linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHSET] blkcg: update locking and fix stacking
@ 2012-02-16 22:37 Tejun Heo
  2012-02-16 22:37 ` [PATCH 1/9] blkcg: use double locking instead of RCU for blkg synchronization Tejun Heo
                   ` (8 more replies)
  0 siblings, 9 replies; 50+ messages in thread
From: Tejun Heo @ 2012-02-16 22:37 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel

Hey, guys.

This is the third patchset of blkcg API cleanup series and does the
following two things.

* Drops RCU and use double locking.  As with ioc, we might want to
  re-introduce RCU in more restricted form to walk blkg from blkcg but
  for now there's no need.  cgroup deletion is much colder than task
  exit and there's no other path which requires reverse double
  locking.

* Before this patchset, blk-throttle and cfq-iosched propio couldn't
  be used together because for any bios delayed by blk-throttle would
  be issued by a worker thread and get dumped to the default cgroup.
  So, bios will be nondeterministically thrown into the root cg.  This
  isn't a problem limited to blk-throttle.  Any mechanism which punts
  bios to a different task (e.g. btrfs) would mess up IO scheduling
  for both cgroups and iocontexts.

  This patchset implements a mechanism to associate a bio to a task
  (its cgroup and ioc actually) and block layer will handle the
  request as if it were issued by the associated task no matter which
  task actually ends up issuing it to block layer.  It's applied to
  blk-throttle such that any delayed bio is associated to the original
  task.

This patchset contains the following 9 patches.

 0001-blkcg-use-double-locking-instead-of-RCU-for-blkg-syn.patch
 0002-blkcg-drop-unnecessary-RCU-locking.patch
 0003-block-restructure-get_request.patch
 0004-block-interface-update-for-ioc-icq-creation-function.patch
 0005-block-ioc_task_link-can-t-fail.patch
 0006-block-add-io_context-active_ref.patch
 0007-block-implement-bio_associate_current.patch
 0008-block-make-block-cgroup-policies-follow-bio-task-ass.patch
 0009-block-make-blk-throttle-preserve-the-issuing-task-on.patch

 0001-0002 updates locking and strips out RCU.

 0003-0006 prepare for bio task association.

 0007-0009 implement bio task association and apply it to
 blk-throttle.

This patchset is on top of

  block/for-linus 621032ad6eaabf2fe771c4fa0d8f58e1fcfcdba6
+ [1] blkcg: kill policy node and blkg->dev, take#4
+ [2] blkcg: unify blkgs for different policies (updated)

and is also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git blkcg-locking

The git branch has separate branches for previous patchsets, so using
it probably is the easiest way to review this series.  Jens, if you
want the whole thing reposted, please let me know.

diffstat follows.

 block/blk-cgroup.c        |  167 +++++++++++++++++-----------------------------
 block/blk-cgroup.h        |   12 +--
 block/blk-core.c          |   92 ++++++++++++++-----------
 block/blk-ioc.c           |   58 +++++++++------
 block/blk-throttle.c      |   41 +----------
 block/blk.h               |   24 +++---
 block/cfq-iosched.c       |   54 ++++----------
 block/cfq.h               |   10 --
 block/elevator.c          |    5 -
 fs/bio.c                  |   61 ++++++++++++++++
 include/linux/bio.h       |    8 ++
 include/linux/blk_types.h |   10 ++
 include/linux/blkdev.h    |    1 
 include/linux/elevator.h  |    6 +
 include/linux/iocontext.h |   32 ++++++--
 kernel/fork.c             |    5 -
 16 files changed, 299 insertions(+), 287 deletions(-)

Thanks.

--
tejun

[1] http://thread.gmane.org/gmane.linux.kernel/1247152
[2] http://thread.gmane.org/gmane.linux.kernel/1247287

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 1/9] blkcg: use double locking instead of RCU for blkg synchronization
  2012-02-16 22:37 [PATCHSET] blkcg: update locking and fix stacking Tejun Heo
@ 2012-02-16 22:37 ` Tejun Heo
  2012-02-16 22:37 ` [PATCH 2/9] blkcg: drop unnecessary RCU locking Tejun Heo
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2012-02-16 22:37 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

blkgs are chained from both blkcgs and request_queues and thus
subjected to two locks - blkcg->lock and q->queue_lock.  As both blkcg
and q can go away anytime, locking during removal is tricky.  It's
currently solved by wrapping removal inside RCU, which makes the
synchronization complex.  There are three locks to worry about - the
outer RCU, q lock and blkcg lock, and it leads to nasty subtle
complications like conditional synchronize_rcu() on queue exit paths.

For all other paths, blkcg lock is naturally nested inside q lock and
the only exception is blkcg removal path, which is a very cold path
and can be implemented as clumsy but conceptually-simple reverse
double lock dancing.

This patch updates blkg removal path such that blkgs are removed while
holding both q and blkcg locks, which is trivial for request queue
exit path - blkg_destroy_all().  The blkcg removal path,
blkiocg_pre_destroy(), implements reverse double lock dancing
essentially identical to ioc_release_fn().

This simplifies blkg locking - no half-dead blkgs to worry about.  Now
unnecessary RCU annotations will be removed by the next patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c |  136 +++++++++++++++++++--------------------------------
 block/blk-cgroup.h |    4 --
 block/cfq.h        |   10 ----
 3 files changed, 51 insertions(+), 99 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index ce2dd15..aee71ef 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -620,32 +620,6 @@ out:
 }
 EXPORT_SYMBOL_GPL(blkg_lookup_create);
 
-static void __blkiocg_del_blkio_group(struct blkio_group *blkg)
-{
-	hlist_del_init_rcu(&blkg->blkcg_node);
-}
-
-/*
- * returns 0 if blkio_group was still on cgroup list. Otherwise returns 1
- * indicating that blk_group was unhashed by the time we got to it.
- */
-int blkiocg_del_blkio_group(struct blkio_group *blkg)
-{
-	struct blkio_cgroup *blkcg = blkg->blkcg;
-	unsigned long flags;
-	int ret = 1;
-
-	spin_lock_irqsave(&blkcg->lock, flags);
-	if (!hlist_unhashed(&blkg->blkcg_node)) {
-		__blkiocg_del_blkio_group(blkg);
-		ret = 0;
-	}
-	spin_unlock_irqrestore(&blkcg->lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL_GPL(blkiocg_del_blkio_group);
-
 /* called under rcu_read_lock(). */
 struct blkio_group *blkg_lookup(struct blkio_cgroup *blkcg,
 				struct request_queue *q)
@@ -663,12 +637,16 @@ EXPORT_SYMBOL_GPL(blkg_lookup);
 static void blkg_destroy(struct blkio_group *blkg)
 {
 	struct request_queue *q = blkg->q;
+	struct blkio_cgroup *blkcg = blkg->blkcg;
 
 	lockdep_assert_held(q->queue_lock);
+	lockdep_assert_held(&blkcg->lock);
 
 	/* Something wrong if we are trying to remove same group twice */
 	WARN_ON_ONCE(list_empty(&blkg->q_node));
+	WARN_ON_ONCE(hlist_unhashed(&blkg->blkcg_node));
 	list_del_init(&blkg->q_node);
+	hlist_del_init_rcu(&blkg->blkcg_node);
 
 	WARN_ON_ONCE(q->nr_blkgs <= 0);
 	q->nr_blkgs--;
@@ -712,47 +690,35 @@ static void update_root_blkg(struct request_queue *q, enum blkio_policy_id plid)
 	pol->ops.blkio_init_group_fn(blkg);
 }
 
+/**
+ * blkg_destroy_all - destroy all blkgs associated with a request_queue
+ * @q: request_queue of interest
+ * @destroy_root: whether to destroy root blkg or not
+ *
+ * Destroy blkgs associated with @q.  If @destroy_root is %true, all are
+ * destroyed; otherwise, root blkg is left alone.
+ */
 void blkg_destroy_all(struct request_queue *q, bool destroy_root)
 {
 	struct blkio_group *blkg, *n;
 	int i;
 
-	while (true) {
-		bool done = true;
-
-		spin_lock_irq(q->queue_lock);
-
-		list_for_each_entry_safe(blkg, n, &q->blkg_list, q_node) {
-			/* skip root? */
-			if (!destroy_root && blkg->blkcg == &blkio_root_cgroup)
-				continue;
-
-			/*
-			 * If cgroup removal path got to blk_group first
-			 * and removed it from cgroup list, then it will
-			 * take care of destroying cfqg also.
-			 */
-			if (!blkiocg_del_blkio_group(blkg))
-				blkg_destroy(blkg);
-			else
-				done = false;
-		}
+	spin_lock_irq(q->queue_lock);
 
-		spin_unlock_irq(q->queue_lock);
+	list_for_each_entry_safe(blkg, n, &q->blkg_list, q_node) {
+		struct blkio_cgroup *blkcg = blkg->blkcg;
 
-		/*
-		 * Group list may not be empty if we raced cgroup removal
-		 * and lost.  cgroup removal is guaranteed to make forward
-		 * progress and retrying after a while is enough.  This
-		 * ugliness is scheduled to be removed after locking
-		 * update.
-		 */
-		if (done)
-			break;
+		/* skip root? */
+		if (!destroy_root && blkg->blkcg == &blkio_root_cgroup)
+			continue;
 
-		msleep(10);	/* just some random duration I like */
+		spin_lock(&blkcg->lock);
+		blkg_destroy(blkg);
+		spin_unlock(&blkcg->lock);
 	}
 
+	spin_unlock_irq(q->queue_lock);
+
 	for (i = 0; i < BLKIO_NR_POLICIES; i++)
 		update_root_blkg(q, i);
 }
@@ -1590,45 +1556,45 @@ static int blkiocg_populate(struct cgroup_subsys *subsys, struct cgroup *cgroup)
 				ARRAY_SIZE(blkio_files));
 }
 
+/**
+ * blkiocg_pre_destroy - cgroup pre_destroy callback
+ * @subsys: cgroup subsys
+ * @cgroup: cgroup of interest
+ *
+ * This function is called when @cgroup is about to go away and responsible
+ * for shooting down all blkgs associated with @cgroup.  blkgs should be
+ * removed while holding both q and blkcg locks.  As blkcg lock is nested
+ * inside q lock, this function performs reverse double lock dancing.
+ *
+ * This is the blkcg counterpart of ioc_release_fn().
+ */
 static int blkiocg_pre_destroy(struct cgroup_subsys *subsys,
 			       struct cgroup *cgroup)
 {
 	struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
-	unsigned long flags;
-	struct blkio_group *blkg;
-	struct request_queue *q;
 
 	rcu_read_lock();
+	spin_lock_irq(&blkcg->lock);
 
-	do {
-		spin_lock_irqsave(&blkcg->lock, flags);
+	while (!hlist_empty(&blkcg->blkg_list)) {
+		struct blkio_group *blkg = hlist_entry(blkcg->blkg_list.first,
+						struct blkio_group, blkcg_node);
+		struct request_queue *q = rcu_dereference(blkg->q);
 
-		if (hlist_empty(&blkcg->blkg_list)) {
-			spin_unlock_irqrestore(&blkcg->lock, flags);
-			break;
+		if (spin_trylock(q->queue_lock)) {
+			blkg_destroy(blkg);
+			spin_unlock(q->queue_lock);
+		} else {
+			spin_unlock_irq(&blkcg->lock);
+			rcu_read_unlock();
+			cpu_relax();
+			rcu_read_lock();
+			spin_lock(&blkcg->lock);
 		}
+	}
 
-		blkg = hlist_entry(blkcg->blkg_list.first, struct blkio_group,
-					blkcg_node);
-		q = rcu_dereference(blkg->q);
-		__blkiocg_del_blkio_group(blkg);
-
-		spin_unlock_irqrestore(&blkcg->lock, flags);
-
-		/*
-		 * This blkio_group is being unlinked as associated cgroup is
-		 * going away. Let all the IO controlling policies know about
-		 * this event.
-		 */
-		spin_lock(&blkio_list_lock);
-		spin_lock_irqsave(q->queue_lock, flags);
-		blkg_destroy(blkg);
-		spin_unlock_irqrestore(q->queue_lock, flags);
-		spin_unlock(&blkio_list_lock);
-	} while (1);
-
+	spin_unlock_irq(&blkcg->lock);
 	rcu_read_unlock();
-
 	return 0;
 }
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 88b2c3b..bebc442 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -376,7 +376,6 @@ static inline void blkiocg_set_start_empty_time(struct blkio_group *blkg,
 extern struct blkio_cgroup blkio_root_cgroup;
 extern struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
 extern struct blkio_cgroup *task_blkio_cgroup(struct task_struct *tsk);
-extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
 extern struct blkio_group *blkg_lookup(struct blkio_cgroup *blkcg,
 				       struct request_queue *q);
 struct blkio_group *blkg_lookup_create(struct blkio_cgroup *blkcg,
@@ -412,9 +411,6 @@ cgroup_to_blkio_cgroup(struct cgroup *cgroup) { return NULL; }
 static inline struct blkio_cgroup *
 task_blkio_cgroup(struct task_struct *tsk) { return NULL; }
 
-static inline int
-blkiocg_del_blkio_group(struct blkio_group *blkg) { return 0; }
-
 static inline struct blkio_group *blkg_lookup(struct blkio_cgroup *blkcg,
 					      void *key) { return NULL; }
 static inline void blkiocg_update_timeslice_used(struct blkio_group *blkg,
diff --git a/block/cfq.h b/block/cfq.h
index 5584e1b..c8b15ef 100644
--- a/block/cfq.h
+++ b/block/cfq.h
@@ -79,11 +79,6 @@ static inline void cfq_blkiocg_update_completion_stats(struct blkio_group *blkg,
 					direction, sync);
 }
 
-static inline int cfq_blkiocg_del_blkio_group(struct blkio_group *blkg)
-{
-	return blkiocg_del_blkio_group(blkg);
-}
-
 #else /* CFQ_GROUP_IOSCHED */
 static inline void cfq_blkiocg_update_io_add_stats(struct blkio_group *blkg,
 			struct blkio_policy_type *pol,
@@ -119,10 +114,5 @@ static inline void cfq_blkiocg_update_completion_stats(struct blkio_group *blkg,
 			struct blkio_policy_type *pol, uint64_t start_time,
 			uint64_t io_start_time, bool direction, bool sync) { }
 
-static inline int cfq_blkiocg_del_blkio_group(struct blkio_group *blkg)
-{
-	return 0;
-}
-
 #endif /* CFQ_GROUP_IOSCHED */
 #endif
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 2/9] blkcg: drop unnecessary RCU locking
  2012-02-16 22:37 [PATCHSET] blkcg: update locking and fix stacking Tejun Heo
  2012-02-16 22:37 ` [PATCH 1/9] blkcg: use double locking instead of RCU for blkg synchronization Tejun Heo
@ 2012-02-16 22:37 ` Tejun Heo
  2012-02-17 16:19   ` Vivek Goyal
                     ` (2 more replies)
  2012-02-16 22:37 ` [PATCH 3/9] block: restructure get_request() Tejun Heo
                   ` (6 subsequent siblings)
  8 siblings, 3 replies; 50+ messages in thread
From: Tejun Heo @ 2012-02-16 22:37 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

Now that blkg additions / removals are always done under both q and
blkcg locks, the only place RCU locking is used is blkg_lookup() for
lockless lookup.  This patch drops unncessary RCU locking replacing it
with plain blkcg / q locking as necessary.

* blkg_lookup_create() and blkiocg_pre_destroy() already perform
  proper locking and don't need RCU.  Dropped.

* queue_lock coverage extended to cover @blkg usage in
  blkio_policy_parse_and_set() and RCU dropped.  This means all config
  update callbacks are now called under queue_lock.

* blkio_read_blkg_stats() now uses blkcg->lock instead of RCU read
  lock.  This isn't a hot path.

* RCU locking around blkg_lookup_create() in cfq_init_queue() and
  blk_throtl_init() removed.

* Now unnecessary synchronize_rcu() from queue exit paths removed.
  This makes q->nr_blkgs unnecessary.  Dropped.

* RCU annotation on blkg->q removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c     |   26 +++++++-------------------
 block/blk-cgroup.h     |    4 ++--
 block/blk-throttle.c   |   35 +----------------------------------
 block/cfq-iosched.c    |   26 --------------------------
 include/linux/blkdev.h |    1 -
 5 files changed, 10 insertions(+), 82 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index aee71ef..fb5f21b 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -500,7 +500,7 @@ static struct blkio_group *blkg_alloc(struct blkio_cgroup *blkcg,
 		return NULL;
 
 	spin_lock_init(&blkg->stats_lock);
-	rcu_assign_pointer(blkg->q, q);
+	blkg->q = q;
 	INIT_LIST_HEAD(&blkg->q_node);
 	blkg->blkcg = blkcg;
 	blkg->refcnt = 1;
@@ -551,7 +551,6 @@ struct blkio_group *blkg_lookup_create(struct blkio_cgroup *blkcg,
 {
 	struct blkio_group *blkg, *new_blkg;
 
-	WARN_ON_ONCE(!rcu_read_lock_held());
 	lockdep_assert_held(q->queue_lock);
 
 	/*
@@ -581,11 +580,9 @@ struct blkio_group *blkg_lookup_create(struct blkio_cgroup *blkcg,
 	 * allocation is fixed.
 	 */
 	spin_unlock_irq(q->queue_lock);
-	rcu_read_unlock();
 
 	new_blkg = blkg_alloc(blkcg, q);
 
-	rcu_read_lock();
 	spin_lock_irq(q->queue_lock);
 
 	/* did bypass get turned on inbetween? */
@@ -611,7 +608,6 @@ struct blkio_group *blkg_lookup_create(struct blkio_cgroup *blkcg,
 
 	hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
 	list_add(&blkg->q_node, &q->blkg_list);
-	q->nr_blkgs++;
 
 	spin_unlock(&blkcg->lock);
 out:
@@ -648,9 +644,6 @@ static void blkg_destroy(struct blkio_group *blkg)
 	list_del_init(&blkg->q_node);
 	hlist_del_init_rcu(&blkg->blkcg_node);
 
-	WARN_ON_ONCE(q->nr_blkgs <= 0);
-	q->nr_blkgs--;
-
 	/*
 	 * Put the reference taken at the time of creation so that when all
 	 * queues are gone, group can be destroyed.
@@ -1041,11 +1034,8 @@ static int blkio_policy_parse_and_set(char *buf, enum blkio_policy_id plid,
 	if (!disk || part)
 		goto out;
 
-	rcu_read_lock();
-
 	spin_lock_irq(disk->queue->queue_lock);
 	blkg = blkg_lookup_create(blkcg, disk->queue, plid, false);
-	spin_unlock_irq(disk->queue->queue_lock);
 
 	if (IS_ERR(blkg)) {
 		ret = PTR_ERR(blkg);
@@ -1092,7 +1082,7 @@ static int blkio_policy_parse_and_set(char *buf, enum blkio_policy_id plid,
 	}
 	ret = 0;
 out_unlock:
-	rcu_read_unlock();
+	spin_unlock_irq(disk->queue->queue_lock);
 out:
 	put_disk(disk);
 
@@ -1224,7 +1214,8 @@ static int blkio_read_blkg_stats(struct blkio_cgroup *blkcg,
 	struct hlist_node *n;
 	uint64_t cgroup_total = 0;
 
-	rcu_read_lock();
+	spin_lock_irq(&blkcg->lock);
+
 	hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {
 		const char *dname = dev_name(blkg->q->backing_dev_info.dev);
 		int plid = BLKIOFILE_POLICY(cft->private);
@@ -1241,7 +1232,8 @@ static int blkio_read_blkg_stats(struct blkio_cgroup *blkcg,
 	}
 	if (show_total)
 		cb->fill(cb, "Total", cgroup_total);
-	rcu_read_unlock();
+
+	spin_unlock_irq(&blkcg->lock);
 	return 0;
 }
 
@@ -1573,28 +1565,24 @@ static int blkiocg_pre_destroy(struct cgroup_subsys *subsys,
 {
 	struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
 
-	rcu_read_lock();
 	spin_lock_irq(&blkcg->lock);
 
 	while (!hlist_empty(&blkcg->blkg_list)) {
 		struct blkio_group *blkg = hlist_entry(blkcg->blkg_list.first,
 						struct blkio_group, blkcg_node);
-		struct request_queue *q = rcu_dereference(blkg->q);
+		struct request_queue *q = blkg->q;
 
 		if (spin_trylock(q->queue_lock)) {
 			blkg_destroy(blkg);
 			spin_unlock(q->queue_lock);
 		} else {
 			spin_unlock_irq(&blkcg->lock);
-			rcu_read_unlock();
 			cpu_relax();
-			rcu_read_lock();
 			spin_lock(&blkcg->lock);
 		}
 	}
 
 	spin_unlock_irq(&blkcg->lock);
-	rcu_read_unlock();
 	return 0;
 }
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index bebc442..1a80619 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -176,8 +176,8 @@ struct blkg_policy_data {
 };
 
 struct blkio_group {
-	/* Pointer to the associated request_queue, RCU protected */
-	struct request_queue __rcu *q;
+	/* Pointer to the associated request_queue */
+	struct request_queue *q;
 	struct list_head q_node;
 	struct hlist_node blkcg_node;
 	struct blkio_cgroup *blkcg;
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index e35ee7a..8cd13ec 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1026,7 +1026,6 @@ int blk_throtl_init(struct request_queue *q)
 	td->queue = q;
 
 	/* alloc and init root group. */
-	rcu_read_lock();
 	spin_lock_irq(q->queue_lock);
 
 	blkg = blkg_lookup_create(&blkio_root_cgroup, q, BLKIO_POLICY_THROTL,
@@ -1035,7 +1034,6 @@ int blk_throtl_init(struct request_queue *q)
 		td->root_tg = blkg_to_tg(blkg);
 
 	spin_unlock_irq(q->queue_lock);
-	rcu_read_unlock();
 
 	if (!td->root_tg) {
 		kfree(td);
@@ -1046,39 +1044,8 @@ int blk_throtl_init(struct request_queue *q)
 
 void blk_throtl_exit(struct request_queue *q)
 {
-	struct throtl_data *td = q->td;
-	bool wait;
-
-	BUG_ON(!td);
-
+	BUG_ON(!q->td);
 	throtl_shutdown_wq(q);
-
-	/* If there are other groups */
-	spin_lock_irq(q->queue_lock);
-	wait = q->nr_blkgs;
-	spin_unlock_irq(q->queue_lock);
-
-	/*
-	 * Wait for tg_to_blkg(tg)->q accessors to exit their grace periods.
-	 * Do this wait only if there are other undestroyed groups out
-	 * there (other than root group). This can happen if cgroup deletion
-	 * path claimed the responsibility of cleaning up a group before
-	 * queue cleanup code get to the group.
-	 *
-	 * Do not call synchronize_rcu() unconditionally as there are drivers
-	 * which create/delete request queue hundreds of times during scan/boot
-	 * and synchronize_rcu() can take significant time and slow down boot.
-	 */
-	if (wait)
-		synchronize_rcu();
-
-	/*
-	 * Just being safe to make sure after previous flush if some body did
-	 * update limits through cgroup and another work got queued, cancel
-	 * it.
-	 */
-	throtl_shutdown_wq(q);
-
 	kfree(q->td);
 }
 
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 637cd76..cc8e9ef 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3449,7 +3449,6 @@ static void cfq_exit_queue(struct elevator_queue *e)
 {
 	struct cfq_data *cfqd = e->elevator_data;
 	struct request_queue *q = cfqd->queue;
-	bool wait = false;
 
 	cfq_shutdown_timer_wq(cfqd);
 
@@ -3462,31 +3461,8 @@ static void cfq_exit_queue(struct elevator_queue *e)
 
 	spin_unlock_irq(q->queue_lock);
 
-#ifdef CONFIG_BLK_CGROUP
-	/*
-	 * If there are groups which we could not unlink from blkcg list,
-	 * wait for a rcu period for them to be freed.
-	 */
-	spin_lock_irq(q->queue_lock);
-	wait = q->nr_blkgs;
-	spin_unlock_irq(q->queue_lock);
-#endif
 	cfq_shutdown_timer_wq(cfqd);
 
-	/*
-	 * Wait for cfqg->blkg->key accessors to exit their grace periods.
-	 * Do this wait only if there are other unlinked groups out
-	 * there. This can happen if cgroup deletion path claimed the
-	 * responsibility of cleaning up a group before queue cleanup code
-	 * get to the group.
-	 *
-	 * Do not call synchronize_rcu() unconditionally as there are drivers
-	 * which create/delete request queue hundreds of times during scan/boot
-	 * and synchronize_rcu() can take significant time and slow down boot.
-	 */
-	if (wait)
-		synchronize_rcu();
-
 #ifndef CONFIG_CFQ_GROUP_IOSCHED
 	kfree(cfqd->root_group);
 #endif
@@ -3511,7 +3487,6 @@ static int cfq_init_queue(struct request_queue *q)
 
 	/* Init root group and prefer root group over other groups by default */
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
-	rcu_read_lock();
 	spin_lock_irq(q->queue_lock);
 
 	blkg = blkg_lookup_create(&blkio_root_cgroup, q, BLKIO_POLICY_PROP,
@@ -3520,7 +3495,6 @@ static int cfq_init_queue(struct request_queue *q)
 		cfqd->root_group = blkg_to_cfqg(blkg);
 
 	spin_unlock_irq(q->queue_lock);
-	rcu_read_unlock();
 #else
 	cfqd->root_group = kzalloc_node(sizeof(*cfqd->root_group),
 					GFP_KERNEL, cfqd->queue->node);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b4d1d4b..33f1b29 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -365,7 +365,6 @@ struct request_queue {
 #ifdef CONFIG_BLK_CGROUP
 	/* XXX: array size hardcoded to avoid include dependency (temporary) */
 	struct list_head	blkg_list;
-	int			nr_blkgs;
 #endif
 
 	struct queue_limits	limits;
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 3/9] block: restructure get_request()
  2012-02-16 22:37 [PATCHSET] blkcg: update locking and fix stacking Tejun Heo
  2012-02-16 22:37 ` [PATCH 1/9] blkcg: use double locking instead of RCU for blkg synchronization Tejun Heo
  2012-02-16 22:37 ` [PATCH 2/9] blkcg: drop unnecessary RCU locking Tejun Heo
@ 2012-02-16 22:37 ` Tejun Heo
  2012-02-16 22:37 ` [PATCH 4/9] block: interface update for ioc/icq creation functions Tejun Heo
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2012-02-16 22:37 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

get_request() is structured a bit unusually in that failure path is
inlined in the usual flow with goto labels atop and inside it.
Relocate the error path to the end of the function.

This is to prepare for icq handling changes in get_request() and
doesn't introduce any behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-core.c |   60 ++++++++++++++++++++++++++---------------------------
 1 files changed, 29 insertions(+), 31 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index bf06d1d..69fa8c4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -825,7 +825,7 @@ static bool blk_rq_should_init_elevator(struct bio *bio)
 static struct request *get_request(struct request_queue *q, int rw_flags,
 				   struct bio *bio, gfp_t gfp_mask)
 {
-	struct request *rq = NULL;
+	struct request *rq;
 	struct request_list *rl = &q->rq;
 	struct elevator_type *et;
 	struct io_context *ioc;
@@ -877,7 +877,7 @@ retry:
 					 * process is not a "batcher", and not
 					 * exempted by the IO scheduler
 					 */
-					goto out;
+					return NULL;
 				}
 			}
 		}
@@ -890,7 +890,7 @@ retry:
 	 * allocated with any setting of ->nr_requests
 	 */
 	if (rl->count[is_sync] >= (3 * q->nr_requests / 2))
-		goto out;
+		return NULL;
 
 	rl->count[is_sync]++;
 	rl->starved[is_sync] = 0;
@@ -920,36 +920,12 @@ retry:
 	if ((rw_flags & REQ_ELVPRIV) && unlikely(et->icq_cache && !icq)) {
 		icq = ioc_create_icq(q, gfp_mask);
 		if (!icq)
-			goto fail_icq;
+			goto fail_alloc;
 	}
 
 	rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
-
-fail_icq:
-	if (unlikely(!rq)) {
-		/*
-		 * Allocation failed presumably due to memory. Undo anything
-		 * we might have messed up.
-		 *
-		 * Allocating task should really be put onto the front of the
-		 * wait queue, but this is pretty rare.
-		 */
-		spin_lock_irq(q->queue_lock);
-		freed_request(q, rw_flags);
-
-		/*
-		 * in the very unlikely event that allocation failed and no
-		 * requests for this direction was pending, mark us starved
-		 * so that freeing of a request in the other direction will
-		 * notice us. another possible fix would be to split the
-		 * rq mempool into READ and WRITE
-		 */
-rq_starved:
-		if (unlikely(rl->count[is_sync] == 0))
-			rl->starved[is_sync] = 1;
-
-		goto out;
-	}
+	if (unlikely(!rq))
+		goto fail_alloc;
 
 	/*
 	 * ioc may be NULL here, and ioc_batching will be false. That's
@@ -961,8 +937,30 @@ rq_starved:
 		ioc->nr_batch_requests--;
 
 	trace_block_getrq(q, bio, rw_flags & 1);
-out:
 	return rq;
+
+fail_alloc:
+	/*
+	 * Allocation failed presumably due to memory. Undo anything we
+	 * might have messed up.
+	 *
+	 * Allocating task should really be put onto the front of the wait
+	 * queue, but this is pretty rare.
+	 */
+	spin_lock_irq(q->queue_lock);
+	freed_request(q, rw_flags);
+
+	/*
+	 * in the very unlikely event that allocation failed and no
+	 * requests for this direction was pending, mark us starved so that
+	 * freeing of a request in the other direction will notice
+	 * us. another possible fix would be to split the rq mempool into
+	 * READ and WRITE
+	 */
+rq_starved:
+	if (unlikely(rl->count[is_sync] == 0))
+		rl->starved[is_sync] = 1;
+	return NULL;
 }
 
 /**
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 4/9] block: interface update for ioc/icq creation functions
  2012-02-16 22:37 [PATCHSET] blkcg: update locking and fix stacking Tejun Heo
                   ` (2 preceding siblings ...)
  2012-02-16 22:37 ` [PATCH 3/9] block: restructure get_request() Tejun Heo
@ 2012-02-16 22:37 ` Tejun Heo
  2012-02-16 22:37 ` [PATCH 5/9] block: ioc_task_link() can't fail Tejun Heo
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2012-02-16 22:37 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

Make the following interface updates to prepare for future ioc related
changes.

* create_io_context() returning ioc only works for %current because it
  doesn't increment ref on the ioc.  Drop @task parameter from it and
  always assume %current.

* Make create_io_context_slowpath() return 0 or -errno and rename it
  to create_task_io_context().

* Make ioc_create_icq() take @ioc as parameter instead of assuming
  that of %current.  The caller, get_request(), is updated to create
  ioc explicitly and then pass it into ioc_create_icq().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-core.c |    8 +++++---
 block/blk-ioc.c  |   22 ++++++++++------------
 block/blk.h      |   24 +++++++++++-------------
 3 files changed, 26 insertions(+), 28 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 69fa8c4..195c5f7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -854,7 +854,7 @@ retry:
 			 */
 			if (!ioc && !retried) {
 				spin_unlock_irq(q->queue_lock);
-				create_io_context(current, gfp_mask, q->node);
+				create_io_context(gfp_mask, q->node);
 				spin_lock_irq(q->queue_lock);
 				retried = true;
 				goto retry;
@@ -918,7 +918,9 @@ retry:
 
 	/* create icq if missing */
 	if ((rw_flags & REQ_ELVPRIV) && unlikely(et->icq_cache && !icq)) {
-		icq = ioc_create_icq(q, gfp_mask);
+		ioc = create_io_context(gfp_mask, q->node);
+		if (ioc)
+			icq = ioc_create_icq(ioc, q, gfp_mask);
 		if (!icq)
 			goto fail_alloc;
 	}
@@ -1004,7 +1006,7 @@ static struct request *get_request_wait(struct request_queue *q, int rw_flags,
 		 * up to a big batch of them for a small period time.
 		 * See ioc_batching, ioc_set_batching
 		 */
-		create_io_context(current, GFP_NOIO, q->node);
+		create_io_context(GFP_NOIO, q->node);
 		ioc_set_batching(q, current->io_context);
 
 		spin_lock_irq(q->queue_lock);
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 92bf555..1092874 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -212,15 +212,14 @@ void ioc_clear_queue(struct request_queue *q)
 	}
 }
 
-void create_io_context_slowpath(struct task_struct *task, gfp_t gfp_flags,
-				int node)
+int create_task_io_context(struct task_struct *task, gfp_t gfp_flags, int node)
 {
 	struct io_context *ioc;
 
 	ioc = kmem_cache_alloc_node(iocontext_cachep, gfp_flags | __GFP_ZERO,
 				    node);
 	if (unlikely(!ioc))
-		return;
+		return -ENOMEM;
 
 	/* initialize */
 	atomic_long_set(&ioc->refcount, 1);
@@ -244,6 +243,8 @@ void create_io_context_slowpath(struct task_struct *task, gfp_t gfp_flags,
 	else
 		kmem_cache_free(iocontext_cachep, ioc);
 	task_unlock(task);
+
+	return 0;
 }
 
 /**
@@ -275,7 +276,7 @@ struct io_context *get_task_io_context(struct task_struct *task,
 			return ioc;
 		}
 		task_unlock(task);
-	} while (create_io_context(task, gfp_flags, node));
+	} while (!create_task_io_context(task, gfp_flags, node));
 
 	return NULL;
 }
@@ -319,26 +320,23 @@ EXPORT_SYMBOL(ioc_lookup_icq);
 
 /**
  * ioc_create_icq - create and link io_cq
+ * @ioc: io_context of interest
  * @q: request_queue of interest
  * @gfp_mask: allocation mask
  *
- * Make sure io_cq linking %current->io_context and @q exists.  If either
- * io_context and/or icq don't exist, they will be created using @gfp_mask.
+ * Make sure io_cq linking @ioc and @q exists.  If icq doesn't exist, they
+ * will be created using @gfp_mask.
  *
  * The caller is responsible for ensuring @ioc won't go away and @q is
  * alive and will stay alive until this function returns.
  */
-struct io_cq *ioc_create_icq(struct request_queue *q, gfp_t gfp_mask)
+struct io_cq *ioc_create_icq(struct io_context *ioc, struct request_queue *q,
+			     gfp_t gfp_mask)
 {
 	struct elevator_type *et = q->elevator->type;
-	struct io_context *ioc;
 	struct io_cq *icq;
 
 	/* allocate stuff */
-	ioc = create_io_context(current, gfp_mask, q->node);
-	if (!ioc)
-		return NULL;
-
 	icq = kmem_cache_alloc_node(et->icq_cache, gfp_mask | __GFP_ZERO,
 				    q->node);
 	if (!icq)
diff --git a/block/blk.h b/block/blk.h
index de15f92..aa81afd 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -200,32 +200,30 @@ static inline int blk_do_io_stat(struct request *rq)
  */
 void get_io_context(struct io_context *ioc);
 struct io_cq *ioc_lookup_icq(struct io_context *ioc, struct request_queue *q);
-struct io_cq *ioc_create_icq(struct request_queue *q, gfp_t gfp_mask);
+struct io_cq *ioc_create_icq(struct io_context *ioc, struct request_queue *q,
+			     gfp_t gfp_mask);
 void ioc_clear_queue(struct request_queue *q);
 
-void create_io_context_slowpath(struct task_struct *task, gfp_t gfp_mask,
-				int node);
+int create_task_io_context(struct task_struct *task, gfp_t gfp_mask, int node);
 
 /**
  * create_io_context - try to create task->io_context
- * @task: target task
  * @gfp_mask: allocation mask
  * @node: allocation node
  *
- * If @task->io_context is %NULL, allocate a new io_context and install it.
- * Returns the current @task->io_context which may be %NULL if allocation
- * failed.
+ * If %current->io_context is %NULL, allocate a new io_context and install
+ * it.  Returns the current %current->io_context which may be %NULL if
+ * allocation failed.
  *
  * Note that this function can't be called with IRQ disabled because
- * task_lock which protects @task->io_context is IRQ-unsafe.
+ * task_lock which protects %current->io_context is IRQ-unsafe.
  */
-static inline struct io_context *create_io_context(struct task_struct *task,
-						   gfp_t gfp_mask, int node)
+static inline struct io_context *create_io_context(gfp_t gfp_mask, int node)
 {
 	WARN_ON_ONCE(irqs_disabled());
-	if (unlikely(!task->io_context))
-		create_io_context_slowpath(task, gfp_mask, node);
-	return task->io_context;
+	if (unlikely(!current->io_context))
+		create_task_io_context(current, gfp_mask, node);
+	return current->io_context;
 }
 
 /*
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 5/9] block: ioc_task_link() can't fail
  2012-02-16 22:37 [PATCHSET] blkcg: update locking and fix stacking Tejun Heo
                   ` (3 preceding siblings ...)
  2012-02-16 22:37 ` [PATCH 4/9] block: interface update for ioc/icq creation functions Tejun Heo
@ 2012-02-16 22:37 ` Tejun Heo
  2012-02-17 20:41   ` Vivek Goyal
  2012-02-16 22:37 ` [PATCH 6/9] block: add io_context->active_ref Tejun Heo
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 50+ messages in thread
From: Tejun Heo @ 2012-02-16 22:37 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

ioc_task_link() is used to share %current's ioc on clone.  If
%current->is_context is set, %current is guaranteed to have refcount
on the ioc and, thus, ioc_task_link() can't fail.

Replace error checking in ioc_task_link() with WARN_ON_ONCE() and make
it just increment refcount and nr_tasks.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 include/linux/iocontext.h |   16 +++++-----------
 kernel/fork.c             |    5 ++---
 2 files changed, 7 insertions(+), 14 deletions(-)

diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 1a30180..81a8870 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -120,18 +120,12 @@ struct io_context {
 	struct work_struct release_work;
 };
 
-static inline struct io_context *ioc_task_link(struct io_context *ioc)
+static inline void ioc_task_link(struct io_context *ioc)
 {
-	/*
-	 * if ref count is zero, don't allow sharing (ioc is going away, it's
-	 * a race).
-	 */
-	if (ioc && atomic_long_inc_not_zero(&ioc->refcount)) {
-		atomic_inc(&ioc->nr_tasks);
-		return ioc;
-	}
-
-	return NULL;
+	WARN_ON_ONCE(atomic_long_read(&ioc->refcount) <= 0);
+	WARN_ON_ONCE(atomic_read(&ioc->nr_tasks) <= 0);
+	atomic_long_inc(&ioc->refcount);
+	atomic_inc(&ioc->nr_tasks);
 }
 
 struct task_struct;
diff --git a/kernel/fork.c b/kernel/fork.c
index b77fd55..a1b6327 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -901,9 +901,8 @@ static int copy_io(unsigned long clone_flags, struct task_struct *tsk)
 	 * Share io context with parent, if CLONE_IO is set
 	 */
 	if (clone_flags & CLONE_IO) {
-		tsk->io_context = ioc_task_link(ioc);
-		if (unlikely(!tsk->io_context))
-			return -ENOMEM;
+		ioc_task_link(ioc);
+		tsk->io_context = ioc;
 	} else if (ioprio_valid(ioc->ioprio)) {
 		new_ioc = get_task_io_context(tsk, GFP_KERNEL, NUMA_NO_NODE);
 		if (unlikely(!new_ioc))
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 6/9] block: add io_context->active_ref
  2012-02-16 22:37 [PATCHSET] blkcg: update locking and fix stacking Tejun Heo
                   ` (4 preceding siblings ...)
  2012-02-16 22:37 ` [PATCH 5/9] block: ioc_task_link() can't fail Tejun Heo
@ 2012-02-16 22:37 ` Tejun Heo
  2012-02-16 22:37 ` [PATCH 7/9] block: implement bio_associate_current() Tejun Heo
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2012-02-16 22:37 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

Currently ioc->nr_tasks is used to decide two things - whether an ioc
is done issuing IOs and whether it's shared by multiple tasks.  This
patch separate out the first into ioc->active_ref, which is acquired
and released using {get|put}_io_context_active() respectively.

This will be used to associate bio's with a given task.  This patch
doesn't introduce any visible behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-ioc.c           |   36 +++++++++++++++++++++++++-----------
 block/cfq-iosched.c       |    4 ++--
 include/linux/iocontext.h |   22 ++++++++++++++++++++--
 3 files changed, 47 insertions(+), 15 deletions(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 1092874..439ec21 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -149,20 +149,20 @@ void put_io_context(struct io_context *ioc)
 }
 EXPORT_SYMBOL(put_io_context);
 
-/* Called by the exiting task */
-void exit_io_context(struct task_struct *task)
+/**
+ * put_io_context_active - put active reference on ioc
+ * @ioc: ioc of interest
+ *
+ * Undo get_io_context_active().  If active reference reaches zero after
+ * put, @ioc can never issue further IOs and ioscheds are notified.
+ */
+void put_io_context_active(struct io_context *ioc)
 {
-	struct io_context *ioc;
-	struct io_cq *icq;
 	struct hlist_node *n;
 	unsigned long flags;
+	struct io_cq *icq;
 
-	task_lock(task);
-	ioc = task->io_context;
-	task->io_context = NULL;
-	task_unlock(task);
-
-	if (!atomic_dec_and_test(&ioc->nr_tasks)) {
+	if (!atomic_dec_and_test(&ioc->active_ref)) {
 		put_io_context(ioc);
 		return;
 	}
@@ -191,6 +191,20 @@ retry:
 	put_io_context(ioc);
 }
 
+/* Called by the exiting task */
+void exit_io_context(struct task_struct *task)
+{
+	struct io_context *ioc;
+
+	task_lock(task);
+	ioc = task->io_context;
+	task->io_context = NULL;
+	task_unlock(task);
+
+	atomic_dec(&ioc->nr_tasks);
+	put_io_context_active(ioc);
+}
+
 /**
  * ioc_clear_queue - break any ioc association with the specified queue
  * @q: request_queue being cleared
@@ -223,7 +237,7 @@ int create_task_io_context(struct task_struct *task, gfp_t gfp_flags, int node)
 
 	/* initialize */
 	atomic_long_set(&ioc->refcount, 1);
-	atomic_set(&ioc->nr_tasks, 1);
+	atomic_set(&ioc->active_ref, 1);
 	spin_lock_init(&ioc->lock);
 	INIT_RADIX_TREE(&ioc->icq_tree, GFP_ATOMIC | __GFP_HIGH);
 	INIT_HLIST_HEAD(&ioc->icq_list);
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index cc8e9ef..00e28a3 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1865,7 +1865,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	 * task has exited, don't wait
 	 */
 	cic = cfqd->active_cic;
-	if (!cic || !atomic_read(&cic->icq.ioc->nr_tasks))
+	if (!cic || !atomic_read(&cic->icq.ioc->active_ref))
 		return;
 
 	/*
@@ -2841,7 +2841,7 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 
 	if (cfqq->next_rq && (cfqq->next_rq->cmd_flags & REQ_NOIDLE))
 		enable_idle = 0;
-	else if (!atomic_read(&cic->icq.ioc->nr_tasks) ||
+	else if (!atomic_read(&cic->icq.ioc->active_ref) ||
 		 !cfqd->cfq_slice_idle ||
 		 (!cfq_cfqq_deep(cfqq) && CFQQ_SEEKY(cfqq)))
 		enable_idle = 0;
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 81a8870..6f1a260 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -100,6 +100,7 @@ struct io_cq {
  */
 struct io_context {
 	atomic_long_t refcount;
+	atomic_t active_ref;
 	atomic_t nr_tasks;
 
 	/* all the fields below are protected by this lock */
@@ -120,17 +121,34 @@ struct io_context {
 	struct work_struct release_work;
 };
 
-static inline void ioc_task_link(struct io_context *ioc)
+/**
+ * get_io_context_active - get active reference on ioc
+ * @ioc: ioc of interest
+ *
+ * Only iocs with active reference can issue new IOs.  This function
+ * acquires an active reference on @ioc.  The caller must already have an
+ * active reference on @ioc.
+ */
+static inline void get_io_context_active(struct io_context *ioc)
 {
 	WARN_ON_ONCE(atomic_long_read(&ioc->refcount) <= 0);
-	WARN_ON_ONCE(atomic_read(&ioc->nr_tasks) <= 0);
+	WARN_ON_ONCE(atomic_read(&ioc->active_ref) <= 0);
 	atomic_long_inc(&ioc->refcount);
+	atomic_inc(&ioc->active_ref);
+}
+
+static inline void ioc_task_link(struct io_context *ioc)
+{
+	get_io_context_active(ioc);
+
+	WARN_ON_ONCE(atomic_read(&ioc->nr_tasks) <= 0);
 	atomic_inc(&ioc->nr_tasks);
 }
 
 struct task_struct;
 #ifdef CONFIG_BLOCK
 void put_io_context(struct io_context *ioc);
+void put_io_context_active(struct io_context *ioc);
 void exit_io_context(struct task_struct *task);
 struct io_context *get_task_io_context(struct task_struct *task,
 				       gfp_t gfp_flags, int node);
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 7/9] block: implement bio_associate_current()
  2012-02-16 22:37 [PATCHSET] blkcg: update locking and fix stacking Tejun Heo
                   ` (5 preceding siblings ...)
  2012-02-16 22:37 ` [PATCH 6/9] block: add io_context->active_ref Tejun Heo
@ 2012-02-16 22:37 ` Tejun Heo
  2012-02-17  1:19   ` Kent Overstreet
  2012-02-17 21:33   ` Vivek Goyal
  2012-02-16 22:37 ` [PATCH 8/9] block: make block cgroup policies follow bio task association Tejun Heo
  2012-02-16 22:37 ` [PATCH 9/9] block: make blk-throttle preserve the issuing task on delayed bios Tejun Heo
  8 siblings, 2 replies; 50+ messages in thread
From: Tejun Heo @ 2012-02-16 22:37 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo, Kent Overstreet

IO scheduling and cgroup are tied to the issuing task via io_context
and cgroup of %current.  Unfortunately, there are cases where IOs need
to be routed via a different task which makes scheduling and cgroup
limit enforcement applied completely incorrectly.

For example, all bios delayed by blk-throttle end up being issued by a
delayed work item and get assigned the io_context of the worker task
which happens to serve the work item and dumped to the default block
cgroup.  This is double confusing as bios which aren't delayed end up
in the correct cgroup and makes using blk-throttle and cfq propio
together impossible.

Any code which punts IO issuing to another task is affected which is
getting more and more common (e.g. btrfs).  As both io_context and
cgroup are firmly tied to task including userland visible APIs to
manipulate them, it makes a lot of sense to match up tasks to bios.

This patch implements bio_associate_current() which associates the
specified bio with %current.  The bio will record the associated ioc
and blkcg at that point and block layer will use the recorded ones
regardless of which task actually ends up issuing the bio.  bio
release puts the associated ioc and blkcg.

It grabs and remembers ioc and blkcg instead of the task itself
because task may already be dead by the time the bio is issued making
ioc and blkcg inaccessible and those are all block layer cares about.

elevator_set_req_fn() is updated such that the bio elvdata is being
allocated for is available to the elevator.

This doesn't update block cgroup policies yet.  Further patches will
implement the support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Kent Overstreet <koverstreet@google.com>
---
 block/blk-core.c          |   30 +++++++++++++++++-----
 block/cfq-iosched.c       |    3 +-
 block/elevator.c          |    5 ++-
 fs/bio.c                  |   61 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/bio.h       |    8 ++++++
 include/linux/blk_types.h |   10 +++++++
 include/linux/elevator.h  |    6 +++-
 7 files changed, 111 insertions(+), 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 195c5f7..e6a4f90 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -695,7 +695,7 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
 }
 
 static struct request *
-blk_alloc_request(struct request_queue *q, struct io_cq *icq,
+blk_alloc_request(struct request_queue *q, struct bio *bio, struct io_cq *icq,
 		  unsigned int flags, gfp_t gfp_mask)
 {
 	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
@@ -709,7 +709,7 @@ blk_alloc_request(struct request_queue *q, struct io_cq *icq,
 
 	if (flags & REQ_ELVPRIV) {
 		rq->elv.icq = icq;
-		if (unlikely(elv_set_request(q, rq, gfp_mask))) {
+		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
 			mempool_free(rq, q->rq.rq_pool);
 			return NULL;
 		}
@@ -809,6 +809,20 @@ static bool blk_rq_should_init_elevator(struct bio *bio)
 }
 
 /**
+ * rq_ioc - determine io_context for request allocation
+ * @bio: request being allocated is for this bio (can be %NULL)
+ *
+ * Determine io_context to use for request allocation for @bio.  May return
+ * %NULL if %current->io_context doesn't exist.
+ */
+static struct io_context *rq_ioc(struct bio *bio)
+{
+	if (bio && bio->bi_ioc)
+		return bio->bi_ioc;
+	return current->io_context;
+}
+
+/**
  * get_request - get a free request
  * @q: request_queue to allocate request from
  * @rw_flags: RW and SYNC flags
@@ -835,7 +849,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	int may_queue;
 retry:
 	et = q->elevator->type;
-	ioc = current->io_context;
+	ioc = rq_ioc(bio);
 
 	if (unlikely(blk_queue_dead(q)))
 		return NULL;
@@ -918,14 +932,16 @@ retry:
 
 	/* create icq if missing */
 	if ((rw_flags & REQ_ELVPRIV) && unlikely(et->icq_cache && !icq)) {
-		ioc = create_io_context(gfp_mask, q->node);
-		if (ioc)
-			icq = ioc_create_icq(ioc, q, gfp_mask);
+		create_io_context(gfp_mask, q->node);
+		ioc = rq_ioc(bio);
+		if (!ioc)
+			goto fail_alloc;
+		icq = ioc_create_icq(ioc, q, gfp_mask);
 		if (!icq)
 			goto fail_alloc;
 	}
 
-	rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
+	rq = blk_alloc_request(q, bio, icq, rw_flags, gfp_mask);
 	if (unlikely(!rq))
 		goto fail_alloc;
 
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 00e28a3..b2aabe8 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3299,7 +3299,8 @@ split_cfqq(struct cfq_io_cq *cic, struct cfq_queue *cfqq)
  * Allocate cfq data structures associated with this request.
  */
 static int
-cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
+		gfp_t gfp_mask)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_io_cq *cic = icq_to_cic(rq->elv.icq);
diff --git a/block/elevator.c b/block/elevator.c
index 06d9869..6315a27 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -663,12 +663,13 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
 	return NULL;
 }
 
-int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
+int elv_set_request(struct request_queue *q, struct request *rq,
+		    struct bio *bio, gfp_t gfp_mask)
 {
 	struct elevator_queue *e = q->elevator;
 
 	if (e->type->ops.elevator_set_req_fn)
-		return e->type->ops.elevator_set_req_fn(q, rq, gfp_mask);
+		return e->type->ops.elevator_set_req_fn(q, rq, bio, gfp_mask);
 	return 0;
 }
 
diff --git a/fs/bio.c b/fs/bio.c
index b980ecd..142214b 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -19,12 +19,14 @@
 #include <linux/swap.h>
 #include <linux/bio.h>
 #include <linux/blkdev.h>
+#include <linux/iocontext.h>
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/mempool.h>
 #include <linux/workqueue.h>
+#include <linux/cgroup.h>
 #include <scsi/sg.h>		/* for struct sg_iovec */
 
 #include <trace/events/block.h>
@@ -418,6 +420,7 @@ void bio_put(struct bio *bio)
 	 * last put frees it
 	 */
 	if (atomic_dec_and_test(&bio->bi_cnt)) {
+		bio_disassociate_task(bio);
 		bio->bi_next = NULL;
 		bio->bi_destructor(bio);
 	}
@@ -1641,6 +1644,64 @@ bad:
 }
 EXPORT_SYMBOL(bioset_create);
 
+#ifdef CONFIG_BLK_CGROUP
+/**
+ * bio_associate_current - associate a bio with %current
+ * @bio: target bio
+ *
+ * Associate @bio with %current if it hasn't been associated yet.  Block
+ * layer will treat @bio as if it were issued by %current no matter which
+ * task actually issues it.
+ *
+ * This function takes an extra reference of @task's io_context and blkcg
+ * which will be put when @bio is released.  The caller must own @bio,
+ * ensure %current->io_context exists, and is responsible for synchronizing
+ * calls to this function.
+ */
+int bio_associate_current(struct bio *bio)
+{
+	struct io_context *ioc;
+	struct cgroup_subsys_state *css;
+
+	if (bio->bi_ioc)
+		return -EBUSY;
+
+	ioc = current->io_context;
+	if (!ioc)
+		return -ENOENT;
+
+	/* acquire active ref on @ioc and associate */
+	get_io_context_active(ioc);
+	bio->bi_ioc = ioc;
+
+	/* associate blkcg if exists */
+	rcu_read_lock();
+	css = task_subsys_state(current, blkio_subsys_id);
+	if (css && css_tryget(css))
+		bio->bi_css = css;
+	rcu_read_unlock();
+
+	return 0;
+}
+
+/**
+ * bio_disassociate_task - undo bio_associate_current()
+ * @bio: target bio
+ */
+void bio_disassociate_task(struct bio *bio)
+{
+	if (bio->bi_ioc) {
+		put_io_context(bio->bi_ioc);
+		bio->bi_ioc = NULL;
+	}
+	if (bio->bi_css) {
+		css_put(bio->bi_css);
+		bio->bi_css = NULL;
+	}
+}
+
+#endif /* CONFIG_BLK_CGROUP */
+
 static void __init biovec_init_slabs(void)
 {
 	int i;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 129a9c0..692d3d5 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -268,6 +268,14 @@ extern struct bio_vec *bvec_alloc_bs(gfp_t, int, unsigned long *, struct bio_set
 extern void bvec_free_bs(struct bio_set *, struct bio_vec *, unsigned int);
 extern unsigned int bvec_nr_vecs(unsigned short idx);
 
+#ifdef CONFIG_BLK_CGROUP
+int bio_associate_current(struct bio *bio);
+void bio_disassociate_task(struct bio *bio);
+#else	/* CONFIG_BLK_CGROUP */
+static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
+static inline void bio_disassociate_task(struct bio *bio) { }
+#endif	/* CONFIG_BLK_CGROUP */
+
 /*
  * bio_set is used to allow other portions of the IO system to
  * allocate their own private memory pools for bio and iovec structures.
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 4053cbd..0edb65d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -14,6 +14,8 @@ struct bio;
 struct bio_integrity_payload;
 struct page;
 struct block_device;
+struct io_context;
+struct cgroup_subsys_state;
 typedef void (bio_end_io_t) (struct bio *, int);
 typedef void (bio_destructor_t) (struct bio *);
 
@@ -66,6 +68,14 @@ struct bio {
 	bio_end_io_t		*bi_end_io;
 
 	void			*bi_private;
+#ifdef CONFIG_BLK_CGROUP
+	/*
+	 * Optional ioc and css associated with this bio.  Put on bio
+	 * release.  Read comment on top of bio_associate_current().
+	 */
+	struct io_context	*bi_ioc;
+	struct cgroup_subsys_state *bi_css;
+#endif
 #if defined(CONFIG_BLK_DEV_INTEGRITY)
 	struct bio_integrity_payload *bi_integrity;  /* data integrity */
 #endif
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 97fb255..c03af76 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -28,7 +28,8 @@ typedef int (elevator_may_queue_fn) (struct request_queue *, int);
 
 typedef void (elevator_init_icq_fn) (struct io_cq *);
 typedef void (elevator_exit_icq_fn) (struct io_cq *);
-typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
+typedef int (elevator_set_req_fn) (struct request_queue *, struct request *,
+				   struct bio *, gfp_t);
 typedef void (elevator_put_req_fn) (struct request *);
 typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *);
 typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *);
@@ -129,7 +130,8 @@ extern void elv_unregister_queue(struct request_queue *q);
 extern int elv_may_queue(struct request_queue *, int);
 extern void elv_abort_queue(struct request_queue *);
 extern void elv_completed_request(struct request_queue *, struct request *);
-extern int elv_set_request(struct request_queue *, struct request *, gfp_t);
+extern int elv_set_request(struct request_queue *q, struct request *rq,
+			   struct bio *bio, gfp_t gfp_mask);
 extern void elv_put_request(struct request_queue *, struct request *);
 extern void elv_drain_elevator(struct request_queue *);
 
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 8/9] block: make block cgroup policies follow bio task association
  2012-02-16 22:37 [PATCHSET] blkcg: update locking and fix stacking Tejun Heo
                   ` (6 preceding siblings ...)
  2012-02-16 22:37 ` [PATCH 7/9] block: implement bio_associate_current() Tejun Heo
@ 2012-02-16 22:37 ` Tejun Heo
  2012-02-16 22:37 ` [PATCH 9/9] block: make blk-throttle preserve the issuing task on delayed bios Tejun Heo
  8 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2012-02-16 22:37 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

Implement bio_blkio_cgroup() which returns the blkcg associated with
the bio if exists or %current's blkcg, and use it in blk-throttle and
cfq-iosched propio.  This makes both cgroup policies honor task
association for the bio instead of always assuming %current.

As nobody is using bio_set_task() yet, this doesn't introduce any
behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c   |   11 +++++++++--
 block/blk-cgroup.h   |    4 ++--
 block/blk-throttle.c |    2 +-
 block/cfq-iosched.c  |   21 +++++++++++----------
 4 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index fb5f21b..9bb203b 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -71,12 +71,19 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
 }
 EXPORT_SYMBOL_GPL(cgroup_to_blkio_cgroup);
 
-struct blkio_cgroup *task_blkio_cgroup(struct task_struct *tsk)
+static struct blkio_cgroup *task_blkio_cgroup(struct task_struct *tsk)
 {
 	return container_of(task_subsys_state(tsk, blkio_subsys_id),
 			    struct blkio_cgroup, css);
 }
-EXPORT_SYMBOL_GPL(task_blkio_cgroup);
+
+struct blkio_cgroup *bio_blkio_cgroup(struct bio *bio)
+{
+	if (bio && bio->bi_css)
+		return container_of(bio->bi_css, struct blkio_cgroup, css);
+	return task_blkio_cgroup(current);
+}
+EXPORT_SYMBOL_GPL(bio_blkio_cgroup);
 
 static inline void blkio_update_group_weight(struct blkio_group *blkg,
 					     int plid, unsigned int weight)
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 1a80619..4bf4c7b 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -375,7 +375,7 @@ static inline void blkiocg_set_start_empty_time(struct blkio_group *blkg,
 #ifdef CONFIG_BLK_CGROUP
 extern struct blkio_cgroup blkio_root_cgroup;
 extern struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup);
-extern struct blkio_cgroup *task_blkio_cgroup(struct task_struct *tsk);
+extern struct blkio_cgroup *bio_blkio_cgroup(struct bio *bio);
 extern struct blkio_group *blkg_lookup(struct blkio_cgroup *blkcg,
 				       struct request_queue *q);
 struct blkio_group *blkg_lookup_create(struct blkio_cgroup *blkcg,
@@ -409,7 +409,7 @@ struct cgroup;
 static inline struct blkio_cgroup *
 cgroup_to_blkio_cgroup(struct cgroup *cgroup) { return NULL; }
 static inline struct blkio_cgroup *
-task_blkio_cgroup(struct task_struct *tsk) { return NULL; }
+bio_blkio_cgroup(struct bio *bio) { return NULL; }
 
 static inline struct blkio_group *blkg_lookup(struct blkio_cgroup *blkcg,
 					      void *key) { return NULL; }
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 8cd13ec..a7c8e0b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -900,7 +900,7 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 	 * just update the dispatch stats in lockless manner and return.
 	 */
 	rcu_read_lock();
-	blkcg = task_blkio_cgroup(current);
+	blkcg = bio_blkio_cgroup(bio);
 	tg = throtl_lookup_tg(td, blkcg);
 	if (tg) {
 		if (tg_no_rule_group(tg, rw)) {
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b2aabe8..d84879a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -467,8 +467,9 @@ static inline int cfqg_busy_async_queues(struct cfq_data *cfqd,
 }
 
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
-static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool,
-				       struct io_context *, gfp_t);
+static struct cfq_queue *cfq_get_queue(struct cfq_data *cfqd, bool is_sync,
+				       struct io_context *ioc, struct bio *bio,
+				       gfp_t gfp_mask);
 
 static inline struct cfq_io_cq *icq_to_cic(struct io_cq *icq)
 {
@@ -2601,7 +2602,7 @@ static void cfq_init_prio_data(struct cfq_queue *cfqq, struct io_context *ioc)
 	cfq_clear_cfqq_prio_changed(cfqq);
 }
 
-static void changed_ioprio(struct cfq_io_cq *cic)
+static void changed_ioprio(struct cfq_io_cq *cic, struct bio *bio)
 {
 	struct cfq_data *cfqd = cic_to_cfqd(cic);
 	struct cfq_queue *cfqq;
@@ -2613,7 +2614,7 @@ static void changed_ioprio(struct cfq_io_cq *cic)
 	if (cfqq) {
 		struct cfq_queue *new_cfqq;
 		new_cfqq = cfq_get_queue(cfqd, BLK_RW_ASYNC, cic->icq.ioc,
-						GFP_ATOMIC);
+					 bio, GFP_ATOMIC);
 		if (new_cfqq) {
 			cic->cfqq[BLK_RW_ASYNC] = new_cfqq;
 			cfq_put_queue(cfqq);
@@ -2671,7 +2672,7 @@ static void changed_cgroup(struct cfq_io_cq *cic)
 
 static struct cfq_queue *
 cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
-		     struct io_context *ioc, gfp_t gfp_mask)
+		     struct io_context *ioc, struct bio *bio, gfp_t gfp_mask)
 {
 	struct blkio_cgroup *blkcg;
 	struct cfq_queue *cfqq, *new_cfqq = NULL;
@@ -2681,7 +2682,7 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
 retry:
 	rcu_read_lock();
 
-	blkcg = task_blkio_cgroup(current);
+	blkcg = bio_blkio_cgroup(bio);
 
 	cfqg = cfq_lookup_create_cfqg(cfqd, blkcg);
 
@@ -2746,7 +2747,7 @@ cfq_async_queue_prio(struct cfq_data *cfqd, int ioprio_class, int ioprio)
 
 static struct cfq_queue *
 cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
-	      gfp_t gfp_mask)
+	      struct bio *bio, gfp_t gfp_mask)
 {
 	const int ioprio = task_ioprio(ioc);
 	const int ioprio_class = task_ioprio_class(ioc);
@@ -2759,7 +2760,7 @@ cfq_get_queue(struct cfq_data *cfqd, bool is_sync, struct io_context *ioc,
 	}
 
 	if (!cfqq)
-		cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
+		cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, bio, gfp_mask);
 
 	/*
 	 * pin the queue now that it's allocated, scheduler exit will prune it
@@ -3316,7 +3317,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
 	/* handle changed notifications */
 	changed = icq_get_changed(&cic->icq);
 	if (unlikely(changed & ICQ_IOPRIO_CHANGED))
-		changed_ioprio(cic);
+		changed_ioprio(cic, bio);
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
 	if (unlikely(changed & ICQ_CGROUP_CHANGED))
 		changed_cgroup(cic);
@@ -3325,7 +3326,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
 new_queue:
 	cfqq = cic_to_cfqq(cic, is_sync);
 	if (!cfqq || cfqq == &cfqd->oom_cfqq) {
-		cfqq = cfq_get_queue(cfqd, is_sync, cic->icq.ioc, gfp_mask);
+		cfqq = cfq_get_queue(cfqd, is_sync, cic->icq.ioc, bio, gfp_mask);
 		cic_set_cfqq(cic, cfqq, is_sync);
 	} else {
 		/*
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 9/9] block: make blk-throttle preserve the issuing task on delayed bios
  2012-02-16 22:37 [PATCHSET] blkcg: update locking and fix stacking Tejun Heo
                   ` (7 preceding siblings ...)
  2012-02-16 22:37 ` [PATCH 8/9] block: make block cgroup policies follow bio task association Tejun Heo
@ 2012-02-16 22:37 ` Tejun Heo
  2012-02-17 21:58   ` Vivek Goyal
  8 siblings, 1 reply; 50+ messages in thread
From: Tejun Heo @ 2012-02-16 22:37 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel, Tejun Heo

Make blk-throttle call bio_associate_current() on bios being delayed
such that they get issued to block layer with the original io_context.
This allows stacking blk-throttle and cfq-iosched propio policies.
bios will always be issued with the correct ioc and blkcg whether it
gets delayed by blk-throttle or not.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-throttle.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a7c8e0b..ad8e16b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -894,6 +894,9 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
 		goto out;
 	}
 
+	/* bio_associate_current() needs ioc, try creating */
+	create_io_context(GFP_ATOMIC, q->node);
+
 	/*
 	 * A throtl_grp pointer retrieved under rcu can be used to access
 	 * basic fields like stats and io rates. If a group has no rules,
@@ -958,6 +961,7 @@ queue_bio:
 			tg->io_disp[rw], tg->iops[rw],
 			tg->nr_queued[READ], tg->nr_queued[WRITE]);
 
+	bio_associate_current(bio);
 	throtl_add_bio_tg(q->td, tg, bio);
 	throttled = true;
 
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-16 22:37 ` [PATCH 7/9] block: implement bio_associate_current() Tejun Heo
@ 2012-02-17  1:19   ` Kent Overstreet
  2012-02-17 22:14     ` Tejun Heo
  2012-02-17 21:33   ` Vivek Goyal
  1 sibling, 1 reply; 50+ messages in thread
From: Kent Overstreet @ 2012-02-17  1:19 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, vgoyal, ctalbott, rni, linux-kernel

On Thu, Feb 16, 2012 at 02:37:56PM -0800, Tejun Heo wrote:
> This patch implements bio_associate_current() which associates the
> specified bio with %current.  The bio will record the associated ioc
> and blkcg at that point and block layer will use the recorded ones
> regardless of which task actually ends up issuing the bio.  bio
> release puts the associated ioc and blkcg.

Excellent.

Why not have bio_associate_current() called from submit_bio()? I would
expect that's what we want most of the time, and the places it's not
(mainly writeback) calling it before submit_bio() would do the right
thing.

It'd make things more consistent - rq_ioc() could be dropped, and
incorrect usage would be more obvious.

> It grabs and remembers ioc and blkcg instead of the task itself
> because task may already be dead by the time the bio is issued making
> ioc and blkcg inaccessible and those are all block layer cares about.
> 
> elevator_set_req_fn() is updated such that the bio elvdata is being
> allocated for is available to the elevator.
> 
> This doesn't update block cgroup policies yet.  Further patches will
> implement the support.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> Cc: Kent Overstreet <koverstreet@google.com>
> ---
>  block/blk-core.c          |   30 +++++++++++++++++-----
>  block/cfq-iosched.c       |    3 +-
>  block/elevator.c          |    5 ++-
>  fs/bio.c                  |   61 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/bio.h       |    8 ++++++
>  include/linux/blk_types.h |   10 +++++++
>  include/linux/elevator.h  |    6 +++-
>  7 files changed, 111 insertions(+), 12 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 195c5f7..e6a4f90 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -695,7 +695,7 @@ static inline void blk_free_request(struct request_queue *q, struct request *rq)
>  }
>  
>  static struct request *
> -blk_alloc_request(struct request_queue *q, struct io_cq *icq,
> +blk_alloc_request(struct request_queue *q, struct bio *bio, struct io_cq *icq,
>  		  unsigned int flags, gfp_t gfp_mask)
>  {
>  	struct request *rq = mempool_alloc(q->rq.rq_pool, gfp_mask);
> @@ -709,7 +709,7 @@ blk_alloc_request(struct request_queue *q, struct io_cq *icq,
>  
>  	if (flags & REQ_ELVPRIV) {
>  		rq->elv.icq = icq;
> -		if (unlikely(elv_set_request(q, rq, gfp_mask))) {
> +		if (unlikely(elv_set_request(q, rq, bio, gfp_mask))) {
>  			mempool_free(rq, q->rq.rq_pool);
>  			return NULL;
>  		}
> @@ -809,6 +809,20 @@ static bool blk_rq_should_init_elevator(struct bio *bio)
>  }
>  
>  /**
> + * rq_ioc - determine io_context for request allocation
> + * @bio: request being allocated is for this bio (can be %NULL)
> + *
> + * Determine io_context to use for request allocation for @bio.  May return
> + * %NULL if %current->io_context doesn't exist.
> + */
> +static struct io_context *rq_ioc(struct bio *bio)
> +{
> +	if (bio && bio->bi_ioc)
> +		return bio->bi_ioc;
> +	return current->io_context;
> +}
> +
> +/**
>   * get_request - get a free request
>   * @q: request_queue to allocate request from
>   * @rw_flags: RW and SYNC flags
> @@ -835,7 +849,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>  	int may_queue;
>  retry:
>  	et = q->elevator->type;
> -	ioc = current->io_context;
> +	ioc = rq_ioc(bio);
>  
>  	if (unlikely(blk_queue_dead(q)))
>  		return NULL;
> @@ -918,14 +932,16 @@ retry:
>  
>  	/* create icq if missing */
>  	if ((rw_flags & REQ_ELVPRIV) && unlikely(et->icq_cache && !icq)) {
> -		ioc = create_io_context(gfp_mask, q->node);
> -		if (ioc)
> -			icq = ioc_create_icq(ioc, q, gfp_mask);
> +		create_io_context(gfp_mask, q->node);
> +		ioc = rq_ioc(bio);
> +		if (!ioc)
> +			goto fail_alloc;
> +		icq = ioc_create_icq(ioc, q, gfp_mask);
>  		if (!icq)
>  			goto fail_alloc;
>  	}
>  
> -	rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
> +	rq = blk_alloc_request(q, bio, icq, rw_flags, gfp_mask);
>  	if (unlikely(!rq))
>  		goto fail_alloc;
>  
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 00e28a3..b2aabe8 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -3299,7 +3299,8 @@ split_cfqq(struct cfq_io_cq *cic, struct cfq_queue *cfqq)
>   * Allocate cfq data structures associated with this request.
>   */
>  static int
> -cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
> +cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
> +		gfp_t gfp_mask)
>  {
>  	struct cfq_data *cfqd = q->elevator->elevator_data;
>  	struct cfq_io_cq *cic = icq_to_cic(rq->elv.icq);
> diff --git a/block/elevator.c b/block/elevator.c
> index 06d9869..6315a27 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -663,12 +663,13 @@ struct request *elv_former_request(struct request_queue *q, struct request *rq)
>  	return NULL;
>  }
>  
> -int elv_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
> +int elv_set_request(struct request_queue *q, struct request *rq,
> +		    struct bio *bio, gfp_t gfp_mask)
>  {
>  	struct elevator_queue *e = q->elevator;
>  
>  	if (e->type->ops.elevator_set_req_fn)
> -		return e->type->ops.elevator_set_req_fn(q, rq, gfp_mask);
> +		return e->type->ops.elevator_set_req_fn(q, rq, bio, gfp_mask);
>  	return 0;
>  }
>  
> diff --git a/fs/bio.c b/fs/bio.c
> index b980ecd..142214b 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -19,12 +19,14 @@
>  #include <linux/swap.h>
>  #include <linux/bio.h>
>  #include <linux/blkdev.h>
> +#include <linux/iocontext.h>
>  #include <linux/slab.h>
>  #include <linux/init.h>
>  #include <linux/kernel.h>
>  #include <linux/module.h>
>  #include <linux/mempool.h>
>  #include <linux/workqueue.h>
> +#include <linux/cgroup.h>
>  #include <scsi/sg.h>		/* for struct sg_iovec */
>  
>  #include <trace/events/block.h>
> @@ -418,6 +420,7 @@ void bio_put(struct bio *bio)
>  	 * last put frees it
>  	 */
>  	if (atomic_dec_and_test(&bio->bi_cnt)) {
> +		bio_disassociate_task(bio);
>  		bio->bi_next = NULL;
>  		bio->bi_destructor(bio);
>  	}
> @@ -1641,6 +1644,64 @@ bad:
>  }
>  EXPORT_SYMBOL(bioset_create);
>  
> +#ifdef CONFIG_BLK_CGROUP
> +/**
> + * bio_associate_current - associate a bio with %current
> + * @bio: target bio
> + *
> + * Associate @bio with %current if it hasn't been associated yet.  Block
> + * layer will treat @bio as if it were issued by %current no matter which
> + * task actually issues it.
> + *
> + * This function takes an extra reference of @task's io_context and blkcg
> + * which will be put when @bio is released.  The caller must own @bio,
> + * ensure %current->io_context exists, and is responsible for synchronizing
> + * calls to this function.
> + */
> +int bio_associate_current(struct bio *bio)
> +{
> +	struct io_context *ioc;
> +	struct cgroup_subsys_state *css;
> +
> +	if (bio->bi_ioc)
> +		return -EBUSY;
> +
> +	ioc = current->io_context;
> +	if (!ioc)
> +		return -ENOENT;
> +
> +	/* acquire active ref on @ioc and associate */
> +	get_io_context_active(ioc);
> +	bio->bi_ioc = ioc;
> +
> +	/* associate blkcg if exists */
> +	rcu_read_lock();
> +	css = task_subsys_state(current, blkio_subsys_id);
> +	if (css && css_tryget(css))
> +		bio->bi_css = css;
> +	rcu_read_unlock();
> +
> +	return 0;
> +}
> +
> +/**
> + * bio_disassociate_task - undo bio_associate_current()
> + * @bio: target bio
> + */
> +void bio_disassociate_task(struct bio *bio)
> +{
> +	if (bio->bi_ioc) {
> +		put_io_context(bio->bi_ioc);
> +		bio->bi_ioc = NULL;
> +	}
> +	if (bio->bi_css) {
> +		css_put(bio->bi_css);
> +		bio->bi_css = NULL;
> +	}
> +}
> +
> +#endif /* CONFIG_BLK_CGROUP */
> +
>  static void __init biovec_init_slabs(void)
>  {
>  	int i;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 129a9c0..692d3d5 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -268,6 +268,14 @@ extern struct bio_vec *bvec_alloc_bs(gfp_t, int, unsigned long *, struct bio_set
>  extern void bvec_free_bs(struct bio_set *, struct bio_vec *, unsigned int);
>  extern unsigned int bvec_nr_vecs(unsigned short idx);
>  
> +#ifdef CONFIG_BLK_CGROUP
> +int bio_associate_current(struct bio *bio);
> +void bio_disassociate_task(struct bio *bio);
> +#else	/* CONFIG_BLK_CGROUP */
> +static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
> +static inline void bio_disassociate_task(struct bio *bio) { }
> +#endif	/* CONFIG_BLK_CGROUP */
> +
>  /*
>   * bio_set is used to allow other portions of the IO system to
>   * allocate their own private memory pools for bio and iovec structures.
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 4053cbd..0edb65d 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -14,6 +14,8 @@ struct bio;
>  struct bio_integrity_payload;
>  struct page;
>  struct block_device;
> +struct io_context;
> +struct cgroup_subsys_state;
>  typedef void (bio_end_io_t) (struct bio *, int);
>  typedef void (bio_destructor_t) (struct bio *);
>  
> @@ -66,6 +68,14 @@ struct bio {
>  	bio_end_io_t		*bi_end_io;
>  
>  	void			*bi_private;
> +#ifdef CONFIG_BLK_CGROUP
> +	/*
> +	 * Optional ioc and css associated with this bio.  Put on bio
> +	 * release.  Read comment on top of bio_associate_current().
> +	 */
> +	struct io_context	*bi_ioc;
> +	struct cgroup_subsys_state *bi_css;
> +#endif
>  #if defined(CONFIG_BLK_DEV_INTEGRITY)
>  	struct bio_integrity_payload *bi_integrity;  /* data integrity */
>  #endif
> diff --git a/include/linux/elevator.h b/include/linux/elevator.h
> index 97fb255..c03af76 100644
> --- a/include/linux/elevator.h
> +++ b/include/linux/elevator.h
> @@ -28,7 +28,8 @@ typedef int (elevator_may_queue_fn) (struct request_queue *, int);
>  
>  typedef void (elevator_init_icq_fn) (struct io_cq *);
>  typedef void (elevator_exit_icq_fn) (struct io_cq *);
> -typedef int (elevator_set_req_fn) (struct request_queue *, struct request *, gfp_t);
> +typedef int (elevator_set_req_fn) (struct request_queue *, struct request *,
> +				   struct bio *, gfp_t);
>  typedef void (elevator_put_req_fn) (struct request *);
>  typedef void (elevator_activate_req_fn) (struct request_queue *, struct request *);
>  typedef void (elevator_deactivate_req_fn) (struct request_queue *, struct request *);
> @@ -129,7 +130,8 @@ extern void elv_unregister_queue(struct request_queue *q);
>  extern int elv_may_queue(struct request_queue *, int);
>  extern void elv_abort_queue(struct request_queue *);
>  extern void elv_completed_request(struct request_queue *, struct request *);
> -extern int elv_set_request(struct request_queue *, struct request *, gfp_t);
> +extern int elv_set_request(struct request_queue *q, struct request *rq,
> +			   struct bio *bio, gfp_t gfp_mask);
>  extern void elv_put_request(struct request_queue *, struct request *);
>  extern void elv_drain_elevator(struct request_queue *);
>  
> -- 
> 1.7.7.3
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/9] blkcg: drop unnecessary RCU locking
  2012-02-16 22:37 ` [PATCH 2/9] blkcg: drop unnecessary RCU locking Tejun Heo
@ 2012-02-17 16:19   ` Vivek Goyal
  2012-02-17 17:07     ` Tejun Heo
  2012-02-17 16:47   ` Vivek Goyal
  2012-02-22  0:49   ` [PATCH UPDATED " Tejun Heo
  2 siblings, 1 reply; 50+ messages in thread
From: Vivek Goyal @ 2012-02-17 16:19 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, ctalbott, rni, linux-kernel

On Thu, Feb 16, 2012 at 02:37:51PM -0800, Tejun Heo wrote:
> Now that blkg additions / removals are always done under both q and
> blkcg locks, the only place RCU locking is used is blkg_lookup() for
> lockless lookup.  This patch drops unncessary RCU locking replacing it
> with plain blkcg / q locking as necessary.
> 
> * blkg_lookup_create() and blkiocg_pre_destroy() already perform
>   proper locking and don't need RCU.  Dropped.

But blkg_lookup_create() is called under rcu() to protect blkcg pointer.
And blkg_lookup() is also happening under same rcu read lock. So I think
you can't drop rcu from blkg_lookup_create().

>  {
>  	struct blkio_group *blkg, *new_blkg;
>  
> -	WARN_ON_ONCE(!rcu_read_lock_held());

Don't we want to be called with rcu lock held needed for blkg_lookup()?

>  	lockdep_assert_held(q->queue_lock);
>  
>  	/*

[..]
> @@ -581,11 +580,9 @@ struct blkio_group *blkg_lookup_create(struct blkio_cgroup *blkcg,
>  	 * allocation is fixed.
>  	 */
>  	spin_unlock_irq(q->queue_lock);
> -	rcu_read_unlock();
>  
>  	new_blkg = blkg_alloc(blkcg, q);
>  
> -	rcu_read_lock();
>  	spin_lock_irq(q->queue_lock);

blkg_alloc() might sleep here with rcu lock held?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/9] blkcg: drop unnecessary RCU locking
  2012-02-16 22:37 ` [PATCH 2/9] blkcg: drop unnecessary RCU locking Tejun Heo
  2012-02-17 16:19   ` Vivek Goyal
@ 2012-02-17 16:47   ` Vivek Goyal
  2012-02-17 17:11     ` Tejun Heo
  2012-02-22  0:49   ` [PATCH UPDATED " Tejun Heo
  2 siblings, 1 reply; 50+ messages in thread
From: Vivek Goyal @ 2012-02-17 16:47 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, ctalbott, rni, linux-kernel

On Thu, Feb 16, 2012 at 02:37:51PM -0800, Tejun Heo wrote:

[..]
> * queue_lock coverage extended to cover @blkg usage in
>   blkio_policy_parse_and_set() and RCU dropped.  This means all config
>   update callbacks are now called under queue_lock.
> 

[..]
> @@ -1041,11 +1034,8 @@ static int blkio_policy_parse_and_set(char *buf, enum blkio_policy_id plid,
>  	if (!disk || part)
>  		goto out;
>  
> -	rcu_read_lock();
> -
>  	spin_lock_irq(disk->queue->queue_lock);
>  	blkg = blkg_lookup_create(blkcg, disk->queue, plid, false);
> -	spin_unlock_irq(disk->queue->queue_lock);
>  

So now in some cases we call blkg_lookup_create() with both queue and rcu
read lock held (cfq_lookup_create_cfqg()) and in this case hold only queue
lock. blkg_lookup_create() calls blkg_lookup() which expects a rcu_read_lock()
to be held and we will be travesing that list without rcu_read_lock()
held. Isn't that a problem? We might be examining a blkg belonging to
a different queue and it might be being freed parallely.

Or blkg destruction in this cgroup is serialized by cgroup_mutex() or
by something else in this policy parse and set path?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/9] blkcg: drop unnecessary RCU locking
  2012-02-17 16:19   ` Vivek Goyal
@ 2012-02-17 17:07     ` Tejun Heo
  2012-02-17 17:14       ` Tejun Heo
  0 siblings, 1 reply; 50+ messages in thread
From: Tejun Heo @ 2012-02-17 17:07 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel

Hey, Vivek.

On Fri, Feb 17, 2012 at 11:19:58AM -0500, Vivek Goyal wrote:
> On Thu, Feb 16, 2012 at 02:37:51PM -0800, Tejun Heo wrote:
> > Now that blkg additions / removals are always done under both q and
> > blkcg locks, the only place RCU locking is used is blkg_lookup() for
> > lockless lookup.  This patch drops unncessary RCU locking replacing it
> > with plain blkcg / q locking as necessary.
> > 
> > * blkg_lookup_create() and blkiocg_pre_destroy() already perform
> >   proper locking and don't need RCU.  Dropped.
> 
> But blkg_lookup_create() is called under rcu() to protect blkcg pointer.
> And blkg_lookup() is also happening under same rcu read lock. So I think
> you can't drop rcu from blkg_lookup_create().

Ooh, right.  Will fix.

> >  {
> >  	struct blkio_group *blkg, *new_blkg;
> >  
> > -	WARN_ON_ONCE(!rcu_read_lock_held());
> 
> Don't we want to be called with rcu lock held needed for blkg_lookup()?

We want WARN_ON_ONCE(none of RCU, blkcg and queue lock held).  We can
do it using lockdep macros inside #ifdef CONFIG_LOCKDEP.  It's just a
bit clunky.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/9] blkcg: drop unnecessary RCU locking
  2012-02-17 16:47   ` Vivek Goyal
@ 2012-02-17 17:11     ` Tejun Heo
  2012-02-17 17:28       ` Vivek Goyal
  0 siblings, 1 reply; 50+ messages in thread
From: Tejun Heo @ 2012-02-17 17:11 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel

On Fri, Feb 17, 2012 at 11:47:49AM -0500, Vivek Goyal wrote:
> So now in some cases we call blkg_lookup_create() with both queue and rcu
> read lock held (cfq_lookup_create_cfqg()) and in this case hold only queue
> lock.

So, this should be okay.  It's currently not because blkg_alloc() is
broken due to percpu allocation but other than that calling both w/
and w/o RCU read lock should be fine.

> blkg_lookup_create() calls blkg_lookup() which expects a rcu_read_lock()
> to be held and we will be travesing that list without rcu_read_lock()
> held. Isn't that a problem?

No, why would it be a problem?

> We might be examining a blkg belonging to a different queue and it
> might be being freed parallely.

How?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/9] blkcg: drop unnecessary RCU locking
  2012-02-17 17:07     ` Tejun Heo
@ 2012-02-17 17:14       ` Tejun Heo
  0 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2012-02-17 17:14 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel

On Fri, Feb 17, 2012 at 09:07:13AM -0800, Tejun Heo wrote:
> On Fri, Feb 17, 2012 at 11:19:58AM -0500, Vivek Goyal wrote:
> > On Thu, Feb 16, 2012 at 02:37:51PM -0800, Tejun Heo wrote:
> > > Now that blkg additions / removals are always done under both q and
> > > blkcg locks, the only place RCU locking is used is blkg_lookup() for
> > > lockless lookup.  This patch drops unncessary RCU locking replacing it
> > > with plain blkcg / q locking as necessary.
> > > 
> > > * blkg_lookup_create() and blkiocg_pre_destroy() already perform
> > >   proper locking and don't need RCU.  Dropped.
> > 
> > But blkg_lookup_create() is called under rcu() to protect blkcg pointer.
> > And blkg_lookup() is also happening under same rcu read lock. So I think
> > you can't drop rcu from blkg_lookup_create().
> 
> Ooh, right.  Will fix.

BTW, the reason this doesn't work is due to the broken percpu
allocation in blkg_alloc().  We should remove unnecessary rcu lockings
after updating stat allocation.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/9] blkcg: drop unnecessary RCU locking
  2012-02-17 17:11     ` Tejun Heo
@ 2012-02-17 17:28       ` Vivek Goyal
  2012-02-17 17:43         ` Tejun Heo
  0 siblings, 1 reply; 50+ messages in thread
From: Vivek Goyal @ 2012-02-17 17:28 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, ctalbott, rni, linux-kernel

On Fri, Feb 17, 2012 at 09:11:13AM -0800, Tejun Heo wrote:
> On Fri, Feb 17, 2012 at 11:47:49AM -0500, Vivek Goyal wrote:
> > So now in some cases we call blkg_lookup_create() with both queue and rcu
> > read lock held (cfq_lookup_create_cfqg()) and in this case hold only queue
> > lock.
> 
> So, this should be okay.  It's currently not because blkg_alloc() is
> broken due to percpu allocation but other than that calling both w/
> and w/o RCU read lock should be fine.
> 
> > blkg_lookup_create() calls blkg_lookup() which expects a rcu_read_lock()
> > to be held and we will be travesing that list without rcu_read_lock()
> > held. Isn't that a problem?
> 
> No, why would it be a problem?

I am kind of confused that what are the semantics of calling
blkg_lookup_create(). Given the fact that it traverses the
blkcg->blkg_list which is rcu protected, so either we should have
rcu read lock held or we should have blkcg->lock held.

So there might not be any problem, just that looking at the code
I am not very clear abou the locking sematics of blkg_lookup(). May
be some documentation will help that it should be called with 
what locks in what situation. Specifically, when should it be called
with rcu_read_lock() held.

> 
> > We might be examining a blkg belonging to a different queue and it
> > might be being freed parallely.
> 
> How?

Can pre_destroy() and blkio_policy_parse_and_set() make progress in
parallel for same cgroup(blkcg) but different queue.

If yes, blkg_lookup() might be doing blkg->q == q check and pre_destroy
might delete that group and free it up.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/9] blkcg: drop unnecessary RCU locking
  2012-02-17 17:28       ` Vivek Goyal
@ 2012-02-17 17:43         ` Tejun Heo
  2012-02-17 18:08           ` Vivek Goyal
  0 siblings, 1 reply; 50+ messages in thread
From: Tejun Heo @ 2012-02-17 17:43 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel

Hello,

On Fri, Feb 17, 2012 at 12:28:57PM -0500, Vivek Goyal wrote:
> I am kind of confused that what are the semantics of calling
> blkg_lookup_create(). Given the fact that it traverses the
> blkcg->blkg_list which is rcu protected, so either we should have
> rcu read lock held or we should have blkcg->lock held.

Modifying blkgs require both blkcg and queue locks, so read access can
be done holding any lock.  Also, as modifications are done using RCU
safe oprations, it can be looked up while holding RCU read lock.

> > How?
> 
> Can pre_destroy() and blkio_policy_parse_and_set() make progress in
> parallel for same cgroup(blkcg) but different queue.
> 
> If yes, blkg_lookup() might be doing blkg->q == q check and pre_destroy
> might delete that group and free it up.

And that's why __blkg_release() is RCU free'ing blkgs, no?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/9] blkcg: drop unnecessary RCU locking
  2012-02-17 17:43         ` Tejun Heo
@ 2012-02-17 18:08           ` Vivek Goyal
  2012-02-17 18:16             ` Tejun Heo
  0 siblings, 1 reply; 50+ messages in thread
From: Vivek Goyal @ 2012-02-17 18:08 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, ctalbott, rni, linux-kernel

On Fri, Feb 17, 2012 at 09:43:04AM -0800, Tejun Heo wrote:
> Hello,
> 
> On Fri, Feb 17, 2012 at 12:28:57PM -0500, Vivek Goyal wrote:
> > I am kind of confused that what are the semantics of calling
> > blkg_lookup_create(). Given the fact that it traverses the
> > blkcg->blkg_list which is rcu protected, so either we should have
> > rcu read lock held or we should have blkcg->lock held.
> 
> Modifying blkgs require both blkcg and queue locks,

> so read access can be done holding any lock.

This is the point I am not getting. How blkg_lookup() is safe just
under queue lock. What stops freeing up blkg associated with other
queues. I thought caller needs to hold rcu_read_lock() also to
make sure it can safely compare blkg->q == q and return the blkg
belonging to the queue in question.

> > 
> > Can pre_destroy() and blkio_policy_parse_and_set() make progress in
> > parallel for same cgroup(blkcg) but different queue.
> > 
> > If yes, blkg_lookup() might be doing blkg->q == q check and pre_destroy
> > might delete that group and free it up.
> 
> And that's why __blkg_release() is RCU free'ing blkgs, no?

Yes. And you are not holding rcu_read_lock() while doing blkg_lookup()
in blkio_policy_parse_and_set(). Just queue lock will not be enough?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/9] blkcg: drop unnecessary RCU locking
  2012-02-17 18:08           ` Vivek Goyal
@ 2012-02-17 18:16             ` Tejun Heo
  0 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2012-02-17 18:16 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel

On Fri, Feb 17, 2012 at 01:08:00PM -0500, Vivek Goyal wrote:
> > Modifying blkgs require both blkcg and queue locks,
> > so read access can be done holding any lock.
> 
> This is the point I am not getting. How blkg_lookup() is safe just
> under queue lock. What stops freeing up blkg associated with other
> queues. I thought caller needs to hold rcu_read_lock() also to
> make sure it can safely compare blkg->q == q and return the blkg
> belonging to the queue in question.

Ooh, you're right.  I got confused.  We should be holding either blkcg
lock or rcu_read_lock() across blkg_lookup().  Will update.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 5/9] block: ioc_task_link() can't fail
  2012-02-16 22:37 ` [PATCH 5/9] block: ioc_task_link() can't fail Tejun Heo
@ 2012-02-17 20:41   ` Vivek Goyal
  2012-02-17 22:18     ` Tejun Heo
  0 siblings, 1 reply; 50+ messages in thread
From: Vivek Goyal @ 2012-02-17 20:41 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, ctalbott, rni, linux-kernel

On Thu, Feb 16, 2012 at 02:37:54PM -0800, Tejun Heo wrote:
> ioc_task_link() is used to share %current's ioc on clone.  If
> %current->is_context is set, %current is guaranteed to have refcount

Typo. s/is_context/io_context.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-16 22:37 ` [PATCH 7/9] block: implement bio_associate_current() Tejun Heo
  2012-02-17  1:19   ` Kent Overstreet
@ 2012-02-17 21:33   ` Vivek Goyal
  2012-02-17 22:03     ` Tejun Heo
  1 sibling, 1 reply; 50+ messages in thread
From: Vivek Goyal @ 2012-02-17 21:33 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, ctalbott, rni, linux-kernel, Kent Overstreet

On Thu, Feb 16, 2012 at 02:37:56PM -0800, Tejun Heo wrote:

[..]
> This patch implements bio_associate_current() which associates the
> specified bio with %current.  The bio will record the associated ioc
> and blkcg at that point and block layer will use the recorded ones
> regardless of which task actually ends up issuing the bio.  bio
> release puts the associated ioc and blkcg.

How about storing blkcg information in io_context instead of bio. We will
have less copies of bio pointers and I think logically it makes sense.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 9/9] block: make blk-throttle preserve the issuing task on delayed bios
  2012-02-16 22:37 ` [PATCH 9/9] block: make blk-throttle preserve the issuing task on delayed bios Tejun Heo
@ 2012-02-17 21:58   ` Vivek Goyal
  2012-02-17 22:17     ` Tejun Heo
  0 siblings, 1 reply; 50+ messages in thread
From: Vivek Goyal @ 2012-02-17 21:58 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, ctalbott, rni, linux-kernel

On Thu, Feb 16, 2012 at 02:37:58PM -0800, Tejun Heo wrote:
> Make blk-throttle call bio_associate_current() on bios being delayed
> such that they get issued to block layer with the original io_context.
> This allows stacking blk-throttle and cfq-iosched propio policies.
> bios will always be issued with the correct ioc and blkcg whether it
> gets delayed by blk-throttle or not.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> ---
>  block/blk-throttle.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index a7c8e0b..ad8e16b 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -894,6 +894,9 @@ bool blk_throtl_bio(struct request_queue *q, struct bio *bio)
>  		goto out;
>  	}
>  
> +	/* bio_associate_current() needs ioc, try creating */
> +	create_io_context(GFP_ATOMIC, q->node);
> +
>  	/*
>  	 * A throtl_grp pointer retrieved under rcu can be used to access
>  	 * basic fields like stats and io rates. If a group has no rules,
> @@ -958,6 +961,7 @@ queue_bio:
>  			tg->io_disp[rw], tg->iops[rw],
>  			tg->nr_queued[READ], tg->nr_queued[WRITE]);
>  
> +	bio_associate_current(bio);

blk-throttle is just one of places where we lose submitter's context info.
There are others too.

As Kent said, doing this at more generic level will probably make sense.
I think dm layer also might benefit from it. We had issues with losing
this process context info while bio is being submitted with a helper thread.

https://lkml.org/lkml/2010/10/24/59

But I guess this will require some changes at dm layer too so that it
takes extra bio->ioc and bio->blkcg references at cloning time.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-17 21:33   ` Vivek Goyal
@ 2012-02-17 22:03     ` Tejun Heo
  2012-02-17 22:29       ` Vivek Goyal
  0 siblings, 1 reply; 50+ messages in thread
From: Tejun Heo @ 2012-02-17 22:03 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel, Kent Overstreet

Hello, Vivek.

On Fri, Feb 17, 2012 at 04:33:13PM -0500, Vivek Goyal wrote:
> On Thu, Feb 16, 2012 at 02:37:56PM -0800, Tejun Heo wrote:
> 
> [..]
> > This patch implements bio_associate_current() which associates the
> > specified bio with %current.  The bio will record the associated ioc
> > and blkcg at that point and block layer will use the recorded ones
> > regardless of which task actually ends up issuing the bio.  bio
> > release puts the associated ioc and blkcg.
> 
> How about storing blkcg information in io_context instead of bio. We will
> have less copies of bio pointers and I think logically it makes sense.

I don't know.  The problem with that approach is that we introduce a
persistent state which needs to be kept in sync.  cgroup is a task
property and the current code just grabs the current cgroup of
%current and uses it for that bio.  It doesn't matter how the task
changes its cgroup membership later - we're correct (in a sense) no
matter what.  If we add cgroup pointer to ioc, we need to keep that in
sync with task changing cgroup memberships and need to introduce
synchronization scheme for accessing ioc->blkcg, which is a much
bigger headache.

I think it's better to take an explicit ref now.  If the situation
changes, it's an implementation detail only known to block layer
proper anyway, so we should be able to change it without too much
difficulty.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-17  1:19   ` Kent Overstreet
@ 2012-02-17 22:14     ` Tejun Heo
  2012-02-17 22:34       ` Vivek Goyal
  0 siblings, 1 reply; 50+ messages in thread
From: Tejun Heo @ 2012-02-17 22:14 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: axboe, vgoyal, ctalbott, rni, linux-kernel

Hello, Kent.

On Thu, Feb 16, 2012 at 05:19:07PM -0800, Kent Overstreet wrote:
> On Thu, Feb 16, 2012 at 02:37:56PM -0800, Tejun Heo wrote:
> > This patch implements bio_associate_current() which associates the
> > specified bio with %current.  The bio will record the associated ioc
> > and blkcg at that point and block layer will use the recorded ones
> > regardless of which task actually ends up issuing the bio.  bio
> > release puts the associated ioc and blkcg.
> 
> Excellent.
> 
> Why not have bio_associate_current() called from submit_bio()? I would
> expect that's what we want most of the time, and the places it's not
> (mainly writeback) calling it before submit_bio() would do the right
> thing.
> 
> It'd make things more consistent - rq_ioc() could be dropped, and
> incorrect usage would be more obvious.

Yeah, maybe, I was mostly trying to avoid adding extra operations on
common paths as explicit task association isn't that common, at least
for now.  That said, I agree that things can be much cleaner just
allocating and associating stuff upfront - ie. always have ioc and
associate blkcg with IOs on issue.

If the associated blkcg is used consistently through block layer, it
might be better but cfq doesn't even do that.  It has its own icq ->
blkcg mapping, which it initializes on first cfqq association.  It
even ignores tasks migrating to a different cgroup and keeps using
whatever cgroup the task first issued IOs on. :(

I don't know.  Maybe someday.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 9/9] block: make blk-throttle preserve the issuing task on delayed bios
  2012-02-17 21:58   ` Vivek Goyal
@ 2012-02-17 22:17     ` Tejun Heo
  0 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2012-02-17 22:17 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel

Hello,

On Fri, Feb 17, 2012 at 04:58:14PM -0500, Vivek Goyal wrote:
> blk-throttle is just one of places where we lose submitter's context info.
> There are others too.

Yeah, sure.  We do that in quite a few places now.

> As Kent said, doing this at more generic level will probably make sense.
> I think dm layer also might benefit from it. We had issues with losing
> this process context info while bio is being submitted with a helper thread.
> 
> https://lkml.org/lkml/2010/10/24/59
> 
> But I guess this will require some changes at dm layer too so that it
> takes extra bio->ioc and bio->blkcg references at cloning time.

Yeah, it just needs to make sure %current's ioc is there and call
bio_associate_current() before passing bios to worker.  We'll get
there.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 5/9] block: ioc_task_link() can't fail
  2012-02-17 20:41   ` Vivek Goyal
@ 2012-02-17 22:18     ` Tejun Heo
  0 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2012-02-17 22:18 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel

On Fri, Feb 17, 2012 at 03:41:08PM -0500, Vivek Goyal wrote:
> On Thu, Feb 16, 2012 at 02:37:54PM -0800, Tejun Heo wrote:
> > ioc_task_link() is used to share %current's ioc on clone.  If
> > %current->is_context is set, %current is guaranteed to have refcount
> 
> Typo. s/is_context/io_context.

Will update.  Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-17 22:03     ` Tejun Heo
@ 2012-02-17 22:29       ` Vivek Goyal
  2012-02-17 22:38         ` Tejun Heo
  0 siblings, 1 reply; 50+ messages in thread
From: Vivek Goyal @ 2012-02-17 22:29 UTC (permalink / raw)
  To: Tejun Heo; +Cc: axboe, ctalbott, rni, linux-kernel, Kent Overstreet

On Fri, Feb 17, 2012 at 02:03:51PM -0800, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Fri, Feb 17, 2012 at 04:33:13PM -0500, Vivek Goyal wrote:
> > On Thu, Feb 16, 2012 at 02:37:56PM -0800, Tejun Heo wrote:
> > 
> > [..]
> > > This patch implements bio_associate_current() which associates the
> > > specified bio with %current.  The bio will record the associated ioc
> > > and blkcg at that point and block layer will use the recorded ones
> > > regardless of which task actually ends up issuing the bio.  bio
> > > release puts the associated ioc and blkcg.
> > 
> > How about storing blkcg information in io_context instead of bio. We will
> > have less copies of bio pointers and I think logically it makes sense.
> 
> I don't know.  The problem with that approach is that we introduce a
> persistent state which needs to be kept in sync.  cgroup is a task
> property and the current code just grabs the current cgroup of
> %current and uses it for that bio.  It doesn't matter how the task
> changes its cgroup membership later - we're correct (in a sense) no
> matter what.  If we add cgroup pointer to ioc, we need to keep that in
> sync with task changing cgroup memberships and need to introduce
> synchronization scheme for accessing ioc->blkcg, which is a much
> bigger headache.

Don't we already keep track of task changing cgroup and record that
state in ioc.

blkiocg_attach()
  ioc_cgroup_changed()

I think in ioc_cgroup_changed() we can just drop the reference to previous
blkcg and store reference to new blkcg?

> 
> I think it's better to take an explicit ref now.  If the situation
> changes, it's an implementation detail only known to block layer
> proper anyway, so we should be able to change it without too much
> difficulty.

I am fine with changing it later too.

BTW, this change seems to be completely orthogonal to blkcg cleanup. May
be it is a good idea to split it out in a separate patch series. It has
nothing to do with rcu cleanup in blkcg.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-17 22:14     ` Tejun Heo
@ 2012-02-17 22:34       ` Vivek Goyal
  2012-02-17 22:41         ` Tejun Heo
  0 siblings, 1 reply; 50+ messages in thread
From: Vivek Goyal @ 2012-02-17 22:34 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Kent Overstreet, axboe, ctalbott, rni, linux-kernel

On Fri, Feb 17, 2012 at 02:14:06PM -0800, Tejun Heo wrote:

[..]
> If the associated blkcg is used consistently through block layer, it
> might be better but cfq doesn't even do that.  It has its own icq ->
> blkcg mapping, which it initializes on first cfqq association.  It
> even ignores tasks migrating to a different cgroup and keeps using
> whatever cgroup the task first issued IOs on. :(

Nope. We make note of task migration and drop cic->cfqq association
and establish a new association where new cfqq is part of new cgroup.
(ioc_cgroup_changed()).

In the past people have raised the issue of retaining the notion of
iopriority while dm targets are operating in between. This patchset
will help for that case.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-17 22:29       ` Vivek Goyal
@ 2012-02-17 22:38         ` Tejun Heo
  2012-02-17 22:42           ` Tejun Heo
  0 siblings, 1 reply; 50+ messages in thread
From: Tejun Heo @ 2012-02-17 22:38 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel, Kent Overstreet

Hello,

On Fri, Feb 17, 2012 at 05:29:09PM -0500, Vivek Goyal wrote:
> Don't we already keep track of task changing cgroup and record that
> state in ioc.
> 
> blkiocg_attach()
>   ioc_cgroup_changed()
> 
> I think in ioc_cgroup_changed() we can just drop the reference to previous
> blkcg and store reference to new blkcg?

Hmmm.... right, we have that; then, why doesn't cgroup change take
effect w/ cfq?  Maybe it actually works and I confused it with
stacking failure.  Will test again later.

But, anyways, ioc isn't keeping track of blkcg.  The changed thing is
necessary only because cfq is caching the relationship as associated
cfqqs.  I think it would be better if cfq can just compare the current
blkcg without requiring the async notification (or at least do it
synchronously).  The current way of handling it is kinda nasty.

> BTW, this change seems to be completely orthogonal to blkcg cleanup. May
> be it is a good idea to split it out in a separate patch series. It has
> nothing to do with rcu cleanup in blkcg.

At first, it required the locking update because I was determining
blkg association on bio issue and then passing it down.  That didn't
work with cfq caching the association, so it no longer has dependency.
It should can be a separate patchset.  This whole lot is gonna go in
as long sequential series of patches so I'm splitting them just so
that each posting isn't too huge at this point.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-17 22:34       ` Vivek Goyal
@ 2012-02-17 22:41         ` Tejun Heo
  2012-02-17 22:51           ` Vivek Goyal
  2012-02-17 22:56           ` Vivek Goyal
  0 siblings, 2 replies; 50+ messages in thread
From: Tejun Heo @ 2012-02-17 22:41 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Kent Overstreet, axboe, ctalbott, rni, linux-kernel

On Fri, Feb 17, 2012 at 05:34:20PM -0500, Vivek Goyal wrote:
> Nope. We make note of task migration and drop cic->cfqq association
> and establish a new association where new cfqq is part of new cgroup.
> (ioc_cgroup_changed()).

Yeah, that's the CHANGED bit thing.  I probably got confused with
blk-throttle losing blkcg while testing.  Anyways, it would be great
if we can either remove that altogether.  e.g. compare the cached
blkg->blkcg and see if it has changed from cfq.  Or add a callback and
do the shootdown synchronously at least.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-17 22:38         ` Tejun Heo
@ 2012-02-17 22:42           ` Tejun Heo
  0 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2012-02-17 22:42 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, ctalbott, rni, linux-kernel, Kent Overstreet

On Fri, Feb 17, 2012 at 02:38:59PM -0800, Tejun Heo wrote:
> At first, it required the locking update because I was determining
> blkg association on bio issue and then passing it down.  That didn't
> work with cfq caching the association, so it no longer has dependency.
> It should can be a separate patchset.  This whole lot is gonna go in
> as long sequential series of patches so I'm splitting them just so
> that each posting isn't too huge at this point.

Hmmm... I'll move the two locking patches to the previous series and
let this series handle just the stacking issue.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-17 22:41         ` Tejun Heo
@ 2012-02-17 22:51           ` Vivek Goyal
  2012-02-17 22:57             ` Tejun Heo
  2012-02-17 22:56           ` Vivek Goyal
  1 sibling, 1 reply; 50+ messages in thread
From: Vivek Goyal @ 2012-02-17 22:51 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Kent Overstreet, axboe, ctalbott, rni, linux-kernel

On Fri, Feb 17, 2012 at 02:41:03PM -0800, Tejun Heo wrote:
> On Fri, Feb 17, 2012 at 05:34:20PM -0500, Vivek Goyal wrote:
> > Nope. We make note of task migration and drop cic->cfqq association
> > and establish a new association where new cfqq is part of new cgroup.
> > (ioc_cgroup_changed()).
> 
> Yeah, that's the CHANGED bit thing.  I probably got confused with
> blk-throttle losing blkcg while testing.  Anyways, it would be great
> if we can either remove that altogether.  e.g. compare the cached
> blkg->blkcg and see if it has changed from cfq.  Or add a callback and
> do the shootdown synchronously at least.

I think dropping it lazily has advantage as once the cic->cfqq association
is set, we don't worry about cgroups at all.

Otherwise on every IO, we will end up comparing submitting tasks's
cgroup and cic/cfqq's cgroup.

Also this will create problems, if two threads sharing io context are
in two different cgroups. We will frequently end up changing the
association.

So comparing probably is not a very good idea. Doing something
synchronously might be better if you don't like CGROUP_CHANGED
bit in ioc.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-17 22:41         ` Tejun Heo
  2012-02-17 22:51           ` Vivek Goyal
@ 2012-02-17 22:56           ` Vivek Goyal
  2012-02-17 23:06             ` Tejun Heo
  1 sibling, 1 reply; 50+ messages in thread
From: Vivek Goyal @ 2012-02-17 22:56 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Kent Overstreet, axboe, ctalbott, rni, linux-kernel

On Fri, Feb 17, 2012 at 02:41:03PM -0800, Tejun Heo wrote:
> On Fri, Feb 17, 2012 at 05:34:20PM -0500, Vivek Goyal wrote:
> > Nope. We make note of task migration and drop cic->cfqq association
> > and establish a new association where new cfqq is part of new cgroup.
> > (ioc_cgroup_changed()).
> 
> Yeah, that's the CHANGED bit thing.  I probably got confused with
> blk-throttle losing blkcg while testing.

BTW, blk-throttle mangling the context, is a real problem you are facing
or it is just one of things which is nice to fix. If a cgroup is being
throttled do people really care about the iopriority of original task
that much. (Given the fact that iopriority helps only so much and kills
performance on fast storage).

Thanks
Vivek

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-17 22:51           ` Vivek Goyal
@ 2012-02-17 22:57             ` Tejun Heo
  2012-02-20 14:22               ` Vivek Goyal
  2012-02-20 14:36               ` Vivek Goyal
  0 siblings, 2 replies; 50+ messages in thread
From: Tejun Heo @ 2012-02-17 22:57 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Kent Overstreet, axboe, ctalbott, rni, linux-kernel

Hey, Vivek.

On Fri, Feb 17, 2012 at 05:51:26PM -0500, Vivek Goyal wrote:
> Otherwise on every IO, we will end up comparing submitting tasks's
> cgroup and cic/cfqq's cgroup.

But how much is that different from checking CHANGED bit on each IO?
I mean, we can just do sth like cfqg->blkg->blkcg == bio_blkcg(bio).
It isn't expensive.

> Also this will create problems, if two threads sharing io context are
> in two different cgroups. We will frequently end up changing the
> association.

blkcg doesn't allow that anyway (it tries but is racy) and I actually
was thinking about sending a RFC patch to kill CLONE_IO.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-17 22:56           ` Vivek Goyal
@ 2012-02-17 23:06             ` Tejun Heo
  0 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2012-02-17 23:06 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Kent Overstreet, axboe, ctalbott, rni, linux-kernel

On Fri, Feb 17, 2012 at 05:56:15PM -0500, Vivek Goyal wrote:
> BTW, blk-throttle mangling the context, is a real problem you are facing
> or it is just one of things which is nice to fix.

Yeah, it is a real problem which makes blk-throttle useless if cfq
propio is in use.

> If a cgroup is being throttled do people really care about the
> iopriority of original task that much. (Given the fact that
> iopriority helps only so much and kills performance on fast
> storage).

Yeap, apparently, people wanna apply hard limits on top of propio.
Besides, it's just bad to mangle cgroup *randomly* as the current code
does.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-17 22:57             ` Tejun Heo
@ 2012-02-20 14:22               ` Vivek Goyal
  2012-02-20 16:59                 ` Tejun Heo
  2012-02-20 14:36               ` Vivek Goyal
  1 sibling, 1 reply; 50+ messages in thread
From: Vivek Goyal @ 2012-02-20 14:22 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Kent Overstreet, axboe, ctalbott, rni, linux-kernel

On Fri, Feb 17, 2012 at 02:57:35PM -0800, Tejun Heo wrote:
> Hey, Vivek.
> 
> On Fri, Feb 17, 2012 at 05:51:26PM -0500, Vivek Goyal wrote:
> > Otherwise on every IO, we will end up comparing submitting tasks's
> > cgroup and cic/cfqq's cgroup.
> 
> But how much is that different from checking CHANGED bit on each IO?
> I mean, we can just do sth like cfqg->blkg->blkcg == bio_blkcg(bio).
> It isn't expensive.

I guess you will first determine cfqq associated with cic and then do

cfqq->cfqg->blkg->blkcg == bio_blkcg(bio)

One can do that but still does not get rid of requirement of checking
for CGRPOUP_CHANGED as not every bio will have cgroup information stored
and you still will have to check whether submitting task has changed
the cgroup since it last did IO.

> 
> > Also this will create problems, if two threads sharing io context are
> > in two different cgroups. We will frequently end up changing the
> > association.
> 
> blkcg doesn't allow that anyway (it tries but is racy) and I actually
> was thinking about sending a RFC patch to kill CLONE_IO.

I thought CLONE_IO is useful and it allows threads to share IO context.
qemu wanted to use it for its IO threads so that one virtual machine
does not get higher share of disk by just craeting more threads. In fact
if multiple threads are doing related IO, we would like them to use
same io context. Those programs who don't use CLONE_IO (dump utility),
we try to detect closely realted IO in CFQ and try to merge cfq queues.
(effectively trying to simulate shared io context).

Hence, I think CLONE_IO is useful and killing it probably does not buy
us much.

Can we logically say that io_context is owned by thread group leader and
cgroup of io_context changes only if thread group leader changes the
cgroup. So even if some threads are in different cgroup, IO gets accounted
to thread group leaders's cgroup.

So we can store ioc->blkcg association and this association changes when
thread group leader changes cgroup. We can possibly keep CHANGED_CGROUP
also around so that next time old cic->cfqq association is dropped and
a new one is established with new ioc->blkcg.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-17 22:57             ` Tejun Heo
  2012-02-20 14:22               ` Vivek Goyal
@ 2012-02-20 14:36               ` Vivek Goyal
  2012-02-20 17:01                 ` Tejun Heo
  1 sibling, 1 reply; 50+ messages in thread
From: Vivek Goyal @ 2012-02-20 14:36 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Kent Overstreet, axboe, ctalbott, rni, linux-kernel

On Fri, Feb 17, 2012 at 02:57:35PM -0800, Tejun Heo wrote:

[..]
> blkcg doesn't allow that anyway (it tries but is racy) and I actually
> was thinking about sending a RFC patch to kill CLONE_IO.

Hi  Tejun,

I have a question about CLONE_IO.

In copy_io(), we don't share the io_context if "current" does not have
task->io_context set. Does that mean if even if I specify CLONE_IO, two
threads might not share io context depending on when clone() happened? Or
I am reading the code wrong.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-20 14:22               ` Vivek Goyal
@ 2012-02-20 16:59                 ` Tejun Heo
  2012-02-20 19:14                   ` Vivek Goyal
  0 siblings, 1 reply; 50+ messages in thread
From: Tejun Heo @ 2012-02-20 16:59 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Kent Overstreet, axboe, ctalbott, rni, linux-kernel

Hello, Vivek.

On Mon, Feb 20, 2012 at 09:22:33AM -0500, Vivek Goyal wrote:
> I guess you will first determine cfqq associated with cic and then do
> 
> cfqq->cfqg->blkg->blkcg == bio_blkcg(bio)
> 
> One can do that but still does not get rid of requirement of checking
> for CGRPOUP_CHANGED as not every bio will have cgroup information stored
> and you still will have to check whether submitting task has changed
> the cgroup since it last did IO.

Hmmm... but in that case task would be using a different blkg and the
test would still work, wouldn't it?

> > blkcg doesn't allow that anyway (it tries but is racy) and I actually
> > was thinking about sending a RFC patch to kill CLONE_IO.
> 
> I thought CLONE_IO is useful and it allows threads to share IO context.
> qemu wanted to use it for its IO threads so that one virtual machine
> does not get higher share of disk by just craeting more threads. In fact
> if multiple threads are doing related IO, we would like them to use
> same io context.

I don't think that's true.  Think of any multithreaded server program
where each thread is working pretty much independently from others.
Virtualization *can* be a valid use case but are they actually using
it?  Aren't they better served by cgroup?

> Those programs who don't use CLONE_IO (dump utility),
> we try to detect closely realted IO in CFQ and try to merge cfq queues.
> (effectively trying to simulate shared io context).
>
> Hence, I think CLONE_IO is useful and killing it probably does not buy
> us much.

I don't know.  Anything can be useful to somebody somehow.  I'm
skeptical whether ioc sharing is justified.  It was first introduced
for syslets which never flew and as you asked in another message the
implementation has always been broken (it likely ends up ignoring
CLONE_IO more often than not) and *nobody* noticed the brekage all
that time.

Another problem is it doesn't play well with cgroup.  If you start
sharing ioc among tasks, those tasks can't be migrated to other
cgroups.  The enforcement of that, BTW, is also broken.

So, to me, it looks like a mostly unused feature which is broken left
and right, which isn't even visible through the usual pthread
interface.

> Can we logically say that io_context is owned by thread group leader and
> cgroup of io_context changes only if thread group leader changes the
> cgroup. So even if some threads are in different cgroup, IO gets accounted
> to thread group leaders's cgroup.

I don't think that's a good idea.  There are lots of multithreaded
heavy-IO servers and the behavior change can be pretty big and I don't
think the new behavior is necessarily better either.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-20 14:36               ` Vivek Goyal
@ 2012-02-20 17:01                 ` Tejun Heo
  2012-02-20 19:16                   ` Vivek Goyal
  0 siblings, 1 reply; 50+ messages in thread
From: Tejun Heo @ 2012-02-20 17:01 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Kent Overstreet, axboe, ctalbott, rni, linux-kernel

On Mon, Feb 20, 2012 at 09:36:22AM -0500, Vivek Goyal wrote:
> In copy_io(), we don't share the io_context if "current" does not have
> task->io_context set. Does that mean if even if I specify CLONE_IO, two
> threads might not share io context depending on when clone() happened? Or
> I am reading the code wrong.

Yeah, AFAICS, if the cloning task hasn't issued IO before, CLONE_IO is
ignored.

--
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-20 16:59                 ` Tejun Heo
@ 2012-02-20 19:14                   ` Vivek Goyal
  2012-02-20 21:21                     ` Tejun Heo
  2012-02-27 23:12                     ` Chris Wright
  0 siblings, 2 replies; 50+ messages in thread
From: Vivek Goyal @ 2012-02-20 19:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kent Overstreet, axboe, ctalbott, rni, linux-kernel, Chris Wright

On Mon, Feb 20, 2012 at 08:59:22AM -0800, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Mon, Feb 20, 2012 at 09:22:33AM -0500, Vivek Goyal wrote:
> > I guess you will first determine cfqq associated with cic and then do
> > 
> > cfqq->cfqg->blkg->blkcg == bio_blkcg(bio)
> > 
> > One can do that but still does not get rid of requirement of checking
> > for CGRPOUP_CHANGED as not every bio will have cgroup information stored
> > and you still will have to check whether submitting task has changed
> > the cgroup since it last did IO.
> 
> Hmmm... but in that case task would be using a different blkg and the
> test would still work, wouldn't it?

Oh.., forgot that bio_blkio_blkcg() returns the current tasks's blkcg if
bio->blkcg is not set. So if a task's cgroup changes, bio_blkcg() will
point to latest cgroup and cfqq->cfqg->blkg->blkcg will point to old
cgroup and test will indicate the discrepancy. So yes, it should work
for both the cases.

> 
> > > blkcg doesn't allow that anyway (it tries but is racy) and I actually
> > > was thinking about sending a RFC patch to kill CLONE_IO.
> > 
> > I thought CLONE_IO is useful and it allows threads to share IO context.
> > qemu wanted to use it for its IO threads so that one virtual machine
> > does not get higher share of disk by just craeting more threads. In fact
> > if multiple threads are doing related IO, we would like them to use
> > same io context.
> 
> I don't think that's true.  Think of any multithreaded server program
> where each thread is working pretty much independently from others.

If threads are working pretty much independently, then one does not have
to specify CLONE_IO.

In case of qemu IO threads, I have debugged issues where an big IO range
is being splitted among its IO threads. Just do a sequential IO inside
guest, and I was seeing that few sector IO comes from one process, next
few sector come from other process and it goes on. A sequential range
of IO is some split among a bunch of threads and that does not work
well with CFQ if every IO is coming from its own IO context and IO
context is not shared. After a bunch of IO from one io context, CFQ
continues to idle on that io context thinking more IO will come soon.
Next IO does come but from a different thread and differnet context.

CFQ now has employed some techniques to detect that case and try
to do preemption and try to reduce idling in such cases. But sometimes
these techniques work well and other times don't.  So to me, CLONE_IO
can help in this case where application can specifically share
IO context and CFQ does not have to do all the tricks.

That's a different thing that applications might not be making use
of CLONE_IO.

> Virtualization *can* be a valid use case but are they actually using
> it?  Aren't they better served by cgroup?

cgroup can be very heavy weight when hundred's of virtual machines
are running. Why? because of idling. CFQ still has lots of tricks
to do preemption and cut down on idling across io contexts, but
across cgroup boundaries, isolation is much more stronger and very
little preemption (if any) is allowed. I suspect in current
implementation, if we create lots of blkio cgroup, it will be 
bad for overall throughput of virtual machines (purely because of
idling).

So I am not too excited about blkio cgroup solution because it might not
scale well. (Until and unless we find a better algorithm to cut down
on idling).

I am ccing Chris Wright <chrisw@redhat.com>. He might have thoughts
on usage of CLONE_IO and qemu.

> 
> > Those programs who don't use CLONE_IO (dump utility),
> > we try to detect closely realted IO in CFQ and try to merge cfq queues.
> > (effectively trying to simulate shared io context).
> >
> > Hence, I think CLONE_IO is useful and killing it probably does not buy
> > us much.
> 
> I don't know.  Anything can be useful to somebody somehow.  I'm
> skeptical whether ioc sharing is justified.  It was first introduced
> for syslets which never flew and as you asked in another message the
> implementation has always been broken (it likely ends up ignoring
> CLONE_IO more often than not) and *nobody* noticed the brekage all
> that time.
> 
> Another problem is it doesn't play well with cgroup.  If you start
> sharing ioc among tasks, those tasks can't be migrated to other
> cgroups.  The enforcement of that, BTW, is also broken.

Do we try to prevent sharing of io context across cgroups as of today?
Can you point me to the relevant code chunk.

> 
> So, to me, it looks like a mostly unused feature which is broken left
> and right, which isn't even visible through the usual pthread
> interface.

> 
> > Can we logically say that io_context is owned by thread group leader and
> > cgroup of io_context changes only if thread group leader changes the
> > cgroup. So even if some threads are in different cgroup, IO gets accounted
> > to thread group leaders's cgroup.
> 
> I don't think that's a good idea.  There are lots of multithreaded
> heavy-IO servers and the behavior change can be pretty big and I don't
> think the new behavior is necessarily better either.

But I thought above you mentioned that these multithread IO servers
are not using CLONE_IO. If that's the case they don't get effected by
this change.

I thought sharing io_context was similar to shared memory pages accounting
in memcg where the process which touched page first gets accounted for
it and other process get a free ride irrespective of cgroups. And when
page owning process moves to a different cgroup, all the accounting
also moves. (Hopefully I remember the details of memcg correctly).

I don't know. Those who have seen IO patterns from other applications can
tell more, whether it is useful or it is just a dead interface.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-20 17:01                 ` Tejun Heo
@ 2012-02-20 19:16                   ` Vivek Goyal
  2012-02-20 21:06                     ` Tejun Heo
  0 siblings, 1 reply; 50+ messages in thread
From: Vivek Goyal @ 2012-02-20 19:16 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Kent Overstreet, axboe, ctalbott, rni, linux-kernel

On Mon, Feb 20, 2012 at 09:01:28AM -0800, Tejun Heo wrote:
> On Mon, Feb 20, 2012 at 09:36:22AM -0500, Vivek Goyal wrote:
> > In copy_io(), we don't share the io_context if "current" does not have
> > task->io_context set. Does that mean if even if I specify CLONE_IO, two
> > threads might not share io context depending on when clone() happened? Or
> > I am reading the code wrong.
> 
> Yeah, AFAICS, if the cloning task hasn't issued IO before, CLONE_IO is
> ignored.

Will it make sense to try to allocate and attach io_context and then 
share it in copy_io()?

Well, you are planning to kill CLONE_IO altogether, so it does not
make a difference.

Vivek

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-20 19:16                   ` Vivek Goyal
@ 2012-02-20 21:06                     ` Tejun Heo
  2012-02-20 21:10                       ` Vivek Goyal
  0 siblings, 1 reply; 50+ messages in thread
From: Tejun Heo @ 2012-02-20 21:06 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Kent Overstreet, axboe, ctalbott, rni, linux-kernel

Hello,

On Mon, Feb 20, 2012 at 02:16:04PM -0500, Vivek Goyal wrote:
> On Mon, Feb 20, 2012 at 09:01:28AM -0800, Tejun Heo wrote:
> > Yeah, AFAICS, if the cloning task hasn't issued IO before, CLONE_IO is
> > ignored.
> 
> Will it make sense to try to allocate and attach io_context and then 
> share it in copy_io()?

Yeap, that's probably what we should do on CLONE_IO.

> Well, you are planning to kill CLONE_IO altogether, so it does not
> make a difference.

Heh, I was just thinking about sending out a RFC patch.  I mean,
CLONE_IO handling that severely broken and nobody noticing for such
long time doesn't look good, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-20 21:06                     ` Tejun Heo
@ 2012-02-20 21:10                       ` Vivek Goyal
  0 siblings, 0 replies; 50+ messages in thread
From: Vivek Goyal @ 2012-02-20 21:10 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Kent Overstreet, axboe, ctalbott, rni, linux-kernel

On Mon, Feb 20, 2012 at 01:06:42PM -0800, Tejun Heo wrote:
> Hello,
> 
> On Mon, Feb 20, 2012 at 02:16:04PM -0500, Vivek Goyal wrote:
> > On Mon, Feb 20, 2012 at 09:01:28AM -0800, Tejun Heo wrote:
> > > Yeah, AFAICS, if the cloning task hasn't issued IO before, CLONE_IO is
> > > ignored.
> > 
> > Will it make sense to try to allocate and attach io_context and then 
> > share it in copy_io()?
> 
> Yeap, that's probably what we should do on CLONE_IO.
> 
> > Well, you are planning to kill CLONE_IO altogether, so it does not
> > make a difference.
> 
> Heh, I was just thinking about sending out a RFC patch.  I mean,
> CLONE_IO handling that severely broken and nobody noticing for such
> long time doesn't look good, right?

Yep, it definitely raises the question that how many users are out 
there.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-20 19:14                   ` Vivek Goyal
@ 2012-02-20 21:21                     ` Tejun Heo
  2012-02-27 23:12                     ` Chris Wright
  1 sibling, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2012-02-20 21:21 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Kent Overstreet, axboe, ctalbott, rni, linux-kernel, Chris Wright

Hello,

On Mon, Feb 20, 2012 at 02:14:04PM -0500, Vivek Goyal wrote:
> Oh.., forgot that bio_blkio_blkcg() returns the current tasks's blkcg if
> bio->blkcg is not set. So if a task's cgroup changes, bio_blkcg() will
> point to latest cgroup and cfqq->cfqg->blkg->blkcg will point to old
> cgroup and test will indicate the discrepancy. So yes, it should work
> for both the cases.

Yeah, about the same thing for ioprio.  After all, if the cfqg is
pointing to the same blkcg / prio, we are using the right one, so
comparing to the current value should always give the correct result.
It can be thought of as caching the lookup instead of trying to keep
the two states in sync via async notification.  ie. cic caches the
cfqg used last time and if it doesn't match the current one we look up
the new one.

> In case of qemu IO threads, I have debugged issues where an big IO range
> is being splitted among its IO threads. Just do a sequential IO inside
> guest, and I was seeing that few sector IO comes from one process, next
> few sector come from other process and it goes on. A sequential range
> of IO is some split among a bunch of threads and that does not work
> well with CFQ if every IO is coming from its own IO context and IO
> context is not shared. After a bunch of IO from one io context, CFQ
> continues to idle on that io context thinking more IO will come soon.
> Next IO does come but from a different thread and differnet context.

That would be a matching use case or maybe we should improve aio
support so that qemu can simply use aio?

> I am ccing Chris Wright <chrisw@redhat.com>. He might have thoughts
> on usage of CLONE_IO and qemu.

Yeah, learning about actual use cases would be very helpful.

> Do we try to prevent sharing of io context across cgroups as of today?
> Can you point me to the relevant code chunk.

blkiocg_can_attach() in blk-cgroup.c.  We simply can't support it as
it may imply multiple cgroups per ioc.

> > > Can we logically say that io_context is owned by thread group leader and
> > > cgroup of io_context changes only if thread group leader changes the
> > > cgroup. So even if some threads are in different cgroup, IO gets accounted
> > > to thread group leaders's cgroup.
> > 
> > I don't think that's a good idea.  There are lots of multithreaded
> > heavy-IO servers and the behavior change can be pretty big and I don't
> > think the new behavior is necessarily better either.
> 
> But I thought above you mentioned that these multithread IO servers
> are not using CLONE_IO. If that's the case they don't get effected by
> this change.

Hmmm?  I thought you were suggesting changing the default behavior.

> I don't know. Those who have seen IO patterns from other applications can
> tell more, whether it is useful or it is just a dead interface.

blk-cgroup limitation seems rather severe to me and it can prevent
migrating tasks in very non-obvious way.  e.g. multiple controllers
mounted on the same cgroup hierarchy and the target process happens to
use CLONE_IO.  Migrating attempts will simply fail with -EINVAL - go
figure. :( And it seems nobody noticed rather serious breakage for
years so I was getting suspicious whether it was being used at all.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH UPDATED 2/9] blkcg: drop unnecessary RCU locking
  2012-02-16 22:37 ` [PATCH 2/9] blkcg: drop unnecessary RCU locking Tejun Heo
  2012-02-17 16:19   ` Vivek Goyal
  2012-02-17 16:47   ` Vivek Goyal
@ 2012-02-22  0:49   ` Tejun Heo
  2 siblings, 0 replies; 50+ messages in thread
From: Tejun Heo @ 2012-02-22  0:49 UTC (permalink / raw)
  To: axboe, vgoyal; +Cc: ctalbott, rni, linux-kernel

Now that blkg additions / removals are always done under both q and
blkcg locks, the only places RCU locking is necessary are
blkg_lookup[_create]() for lookup w/o blkcg lock.  This patch drops
unncessary RCU locking replacing it with plain blkcg locking as
necessary.

* blkiocg_pre_destroy() already perform proper locking and don't need
  RCU.  Dropped.

* blkio_read_blkg_stats() now uses blkcg->lock instead of RCU read
  lock.  This isn't a hot path.

* Now unnecessary synchronize_rcu() from queue exit paths removed.
  This makes q->nr_blkgs unnecessary.  Dropped.

* RCU annotation on blkg->q removed.

-v2: Vivek pointed out that blkg_lookup_create() still needs to be
     called under rcu_read_lock().  Updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c     |   20 +++++++-------------
 block/blk-cgroup.h     |    4 ++--
 block/blk-throttle.c   |   33 +--------------------------------
 block/cfq-iosched.c    |   24 ------------------------
 include/linux/blkdev.h |    1 -
 5 files changed, 10 insertions(+), 72 deletions(-)

Index: work/block/blk-cgroup.c
===================================================================
--- work.orig/block/blk-cgroup.c
+++ work/block/blk-cgroup.c
@@ -500,7 +500,7 @@ static struct blkio_group *blkg_alloc(st
 		return NULL;
 
 	spin_lock_init(&blkg->stats_lock);
-	rcu_assign_pointer(blkg->q, q);
+	blkg->q = q;
 	INIT_LIST_HEAD(&blkg->q_node);
 	blkg->blkcg = blkcg;
 	blkg->refcnt = 1;
@@ -611,7 +611,6 @@ struct blkio_group *blkg_lookup_create(s
 
 	hlist_add_head_rcu(&blkg->blkcg_node, &blkcg->blkg_list);
 	list_add(&blkg->q_node, &q->blkg_list);
-	q->nr_blkgs++;
 
 	spin_unlock(&blkcg->lock);
 out:
@@ -648,9 +647,6 @@ static void blkg_destroy(struct blkio_gr
 	list_del_init(&blkg->q_node);
 	hlist_del_init_rcu(&blkg->blkcg_node);
 
-	WARN_ON_ONCE(q->nr_blkgs <= 0);
-	q->nr_blkgs--;
-
 	/*
 	 * Put the reference taken at the time of creation so that when all
 	 * queues are gone, group can be destroyed.
@@ -1224,8 +1220,9 @@ static int blkio_read_blkg_stats(struct 
 	struct hlist_node *n;
 	uint64_t cgroup_total = 0;
 
-	rcu_read_lock();
-	hlist_for_each_entry_rcu(blkg, n, &blkcg->blkg_list, blkcg_node) {
+	spin_lock_irq(&blkcg->lock);
+
+	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
 		const char *dname = dev_name(blkg->q->backing_dev_info.dev);
 		int plid = BLKIOFILE_POLICY(cft->private);
 
@@ -1241,7 +1238,8 @@ static int blkio_read_blkg_stats(struct 
 	}
 	if (show_total)
 		cb->fill(cb, "Total", cgroup_total);
-	rcu_read_unlock();
+
+	spin_unlock_irq(&blkcg->lock);
 	return 0;
 }
 
@@ -1573,28 +1571,24 @@ static int blkiocg_pre_destroy(struct cg
 {
 	struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgroup);
 
-	rcu_read_lock();
 	spin_lock_irq(&blkcg->lock);
 
 	while (!hlist_empty(&blkcg->blkg_list)) {
 		struct blkio_group *blkg = hlist_entry(blkcg->blkg_list.first,
 						struct blkio_group, blkcg_node);
-		struct request_queue *q = rcu_dereference(blkg->q);
+		struct request_queue *q = blkg->q;
 
 		if (spin_trylock(q->queue_lock)) {
 			blkg_destroy(blkg);
 			spin_unlock(q->queue_lock);
 		} else {
 			spin_unlock_irq(&blkcg->lock);
-			rcu_read_unlock();
 			cpu_relax();
-			rcu_read_lock();
 			spin_lock(&blkcg->lock);
 		}
 	}
 
 	spin_unlock_irq(&blkcg->lock);
-	rcu_read_unlock();
 	return 0;
 }
 
Index: work/block/blk-cgroup.h
===================================================================
--- work.orig/block/blk-cgroup.h
+++ work/block/blk-cgroup.h
@@ -176,8 +176,8 @@ struct blkg_policy_data {
 };
 
 struct blkio_group {
-	/* Pointer to the associated request_queue, RCU protected */
-	struct request_queue __rcu *q;
+	/* Pointer to the associated request_queue */
+	struct request_queue *q;
 	struct list_head q_node;
 	struct hlist_node blkcg_node;
 	struct blkio_cgroup *blkcg;
Index: work/block/blk-throttle.c
===================================================================
--- work.orig/block/blk-throttle.c
+++ work/block/blk-throttle.c
@@ -1046,39 +1046,8 @@ int blk_throtl_init(struct request_queue
 
 void blk_throtl_exit(struct request_queue *q)
 {
-	struct throtl_data *td = q->td;
-	bool wait;
-
-	BUG_ON(!td);
-
+	BUG_ON(!q->td);
 	throtl_shutdown_wq(q);
-
-	/* If there are other groups */
-	spin_lock_irq(q->queue_lock);
-	wait = q->nr_blkgs;
-	spin_unlock_irq(q->queue_lock);
-
-	/*
-	 * Wait for tg_to_blkg(tg)->q accessors to exit their grace periods.
-	 * Do this wait only if there are other undestroyed groups out
-	 * there (other than root group). This can happen if cgroup deletion
-	 * path claimed the responsibility of cleaning up a group before
-	 * queue cleanup code get to the group.
-	 *
-	 * Do not call synchronize_rcu() unconditionally as there are drivers
-	 * which create/delete request queue hundreds of times during scan/boot
-	 * and synchronize_rcu() can take significant time and slow down boot.
-	 */
-	if (wait)
-		synchronize_rcu();
-
-	/*
-	 * Just being safe to make sure after previous flush if some body did
-	 * update limits through cgroup and another work got queued, cancel
-	 * it.
-	 */
-	throtl_shutdown_wq(q);
-
 	kfree(q->td);
 }
 
Index: work/block/cfq-iosched.c
===================================================================
--- work.orig/block/cfq-iosched.c
+++ work/block/cfq-iosched.c
@@ -3449,7 +3449,6 @@ static void cfq_exit_queue(struct elevat
 {
 	struct cfq_data *cfqd = e->elevator_data;
 	struct request_queue *q = cfqd->queue;
-	bool wait = false;
 
 	cfq_shutdown_timer_wq(cfqd);
 
@@ -3462,31 +3461,8 @@ static void cfq_exit_queue(struct elevat
 
 	spin_unlock_irq(q->queue_lock);
 
-#ifdef CONFIG_BLK_CGROUP
-	/*
-	 * If there are groups which we could not unlink from blkcg list,
-	 * wait for a rcu period for them to be freed.
-	 */
-	spin_lock_irq(q->queue_lock);
-	wait = q->nr_blkgs;
-	spin_unlock_irq(q->queue_lock);
-#endif
 	cfq_shutdown_timer_wq(cfqd);
 
-	/*
-	 * Wait for cfqg->blkg->key accessors to exit their grace periods.
-	 * Do this wait only if there are other unlinked groups out
-	 * there. This can happen if cgroup deletion path claimed the
-	 * responsibility of cleaning up a group before queue cleanup code
-	 * get to the group.
-	 *
-	 * Do not call synchronize_rcu() unconditionally as there are drivers
-	 * which create/delete request queue hundreds of times during scan/boot
-	 * and synchronize_rcu() can take significant time and slow down boot.
-	 */
-	if (wait)
-		synchronize_rcu();
-
 #ifndef CONFIG_CFQ_GROUP_IOSCHED
 	kfree(cfqd->root_group);
 #endif
Index: work/include/linux/blkdev.h
===================================================================
--- work.orig/include/linux/blkdev.h
+++ work/include/linux/blkdev.h
@@ -365,7 +365,6 @@ struct request_queue {
 #ifdef CONFIG_BLK_CGROUP
 	/* XXX: array size hardcoded to avoid include dependency (temporary) */
 	struct list_head	blkg_list;
-	int			nr_blkgs;
 #endif
 
 	struct queue_limits	limits;

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-20 19:14                   ` Vivek Goyal
  2012-02-20 21:21                     ` Tejun Heo
@ 2012-02-27 23:12                     ` Chris Wright
  2012-02-28 14:10                       ` Vivek Goyal
  1 sibling, 1 reply; 50+ messages in thread
From: Chris Wright @ 2012-02-27 23:12 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tejun Heo, Kent Overstreet, axboe, ctalbott, rni, linux-kernel,
	Chris Wright

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Mon, Feb 20, 2012 at 08:59:22AM -0800, Tejun Heo wrote:
> > On Mon, Feb 20, 2012 at 09:22:33AM -0500, Vivek Goyal wrote:
> > > I guess you will first determine cfqq associated with cic and then do
> > > 
> > > cfqq->cfqg->blkg->blkcg == bio_blkcg(bio)
> > > 
> > > One can do that but still does not get rid of requirement of checking
> > > for CGRPOUP_CHANGED as not every bio will have cgroup information stored
> > > and you still will have to check whether submitting task has changed
> > > the cgroup since it last did IO.
> > 
> > Hmmm... but in that case task would be using a different blkg and the
> > test would still work, wouldn't it?
> 
> Oh.., forgot that bio_blkio_blkcg() returns the current tasks's blkcg if
> bio->blkcg is not set. So if a task's cgroup changes, bio_blkcg() will
> point to latest cgroup and cfqq->cfqg->blkg->blkcg will point to old
> cgroup and test will indicate the discrepancy. So yes, it should work
> for both the cases.
> 
> > 
> > > > blkcg doesn't allow that anyway (it tries but is racy) and I actually
> > > > was thinking about sending a RFC patch to kill CLONE_IO.
> > > 
> > > I thought CLONE_IO is useful and it allows threads to share IO context.
> > > qemu wanted to use it for its IO threads so that one virtual machine
> > > does not get higher share of disk by just craeting more threads. In fact
> > > if multiple threads are doing related IO, we would like them to use
> > > same io context.
> > 
> > I don't think that's true.  Think of any multithreaded server program
> > where each thread is working pretty much independently from others.
> 
> If threads are working pretty much independently, then one does not have
> to specify CLONE_IO.
> 
> In case of qemu IO threads, I have debugged issues where an big IO range
> is being splitted among its IO threads. Just do a sequential IO inside
> guest, and I was seeing that few sector IO comes from one process, next
> few sector come from other process and it goes on. A sequential range
> of IO is some split among a bunch of threads and that does not work
> well with CFQ if every IO is coming from its own IO context and IO
> context is not shared. After a bunch of IO from one io context, CFQ
> continues to idle on that io context thinking more IO will come soon.
> Next IO does come but from a different thread and differnet context.
> 
> CFQ now has employed some techniques to detect that case and try
> to do preemption and try to reduce idling in such cases. But sometimes
> these techniques work well and other times don't.  So to me, CLONE_IO
> can help in this case where application can specifically share
> IO context and CFQ does not have to do all the tricks.
> 
> That's a different thing that applications might not be making use
> of CLONE_IO.
> 
> > Virtualization *can* be a valid use case but are they actually using
> > it?  Aren't they better served by cgroup?
> 
> cgroup can be very heavy weight when hundred's of virtual machines
> are running. Why? because of idling. CFQ still has lots of tricks
> to do preemption and cut down on idling across io contexts, but
> across cgroup boundaries, isolation is much more stronger and very
> little preemption (if any) is allowed. I suspect in current
> implementation, if we create lots of blkio cgroup, it will be 
> bad for overall throughput of virtual machines (purely because of
> idling).
> 
> So I am not too excited about blkio cgroup solution because it might not
> scale well. (Until and unless we find a better algorithm to cut down
> on idling).
> 
> I am ccing Chris Wright <chrisw@redhat.com>. He might have thoughts
> on usage of CLONE_IO and qemu.

Vivek, you summed it up pretty well.  Also, for qemu, raw CLONE_IO is not
an option because threads are created via pthread (we had done some local
hacks to verify that CLONE_IO helped w/ the idling problem, and it did).

thanks,
-chris

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-27 23:12                     ` Chris Wright
@ 2012-02-28 14:10                       ` Vivek Goyal
  2012-02-28 17:01                         ` Chris Wright
  0 siblings, 1 reply; 50+ messages in thread
From: Vivek Goyal @ 2012-02-28 14:10 UTC (permalink / raw)
  To: Chris Wright
  Cc: Tejun Heo, Kent Overstreet, axboe, ctalbott, rni, linux-kernel

On Mon, Feb 27, 2012 at 03:12:22PM -0800, Chris Wright wrote:

[..]
> > > > > blkcg doesn't allow that anyway (it tries but is racy) and I actually
> > > > > was thinking about sending a RFC patch to kill CLONE_IO.
> > > > 
> > > > I thought CLONE_IO is useful and it allows threads to share IO context.
> > > > qemu wanted to use it for its IO threads so that one virtual machine
> > > > does not get higher share of disk by just craeting more threads. In fact
> > > > if multiple threads are doing related IO, we would like them to use
> > > > same io context.
> > > 
> > > I don't think that's true.  Think of any multithreaded server program
> > > where each thread is working pretty much independently from others.
> > 
> > If threads are working pretty much independently, then one does not have
> > to specify CLONE_IO.
> > 
> > In case of qemu IO threads, I have debugged issues where an big IO range
> > is being splitted among its IO threads. Just do a sequential IO inside
> > guest, and I was seeing that few sector IO comes from one process, next
> > few sector come from other process and it goes on. A sequential range
> > of IO is some split among a bunch of threads and that does not work
> > well with CFQ if every IO is coming from its own IO context and IO
> > context is not shared. After a bunch of IO from one io context, CFQ
> > continues to idle on that io context thinking more IO will come soon.
> > Next IO does come but from a different thread and differnet context.
> > 
> > CFQ now has employed some techniques to detect that case and try
> > to do preemption and try to reduce idling in such cases. But sometimes
> > these techniques work well and other times don't.  So to me, CLONE_IO
> > can help in this case where application can specifically share
> > IO context and CFQ does not have to do all the tricks.
> > 
> > That's a different thing that applications might not be making use
> > of CLONE_IO.
> > 
> > > Virtualization *can* be a valid use case but are they actually using
> > > it?  Aren't they better served by cgroup?
> > 
> > cgroup can be very heavy weight when hundred's of virtual machines
> > are running. Why? because of idling. CFQ still has lots of tricks
> > to do preemption and cut down on idling across io contexts, but
> > across cgroup boundaries, isolation is much more stronger and very
> > little preemption (if any) is allowed. I suspect in current
> > implementation, if we create lots of blkio cgroup, it will be 
> > bad for overall throughput of virtual machines (purely because of
> > idling).
> > 
> > So I am not too excited about blkio cgroup solution because it might not
> > scale well. (Until and unless we find a better algorithm to cut down
> > on idling).
> > 
> > I am ccing Chris Wright <chrisw@redhat.com>. He might have thoughts
> > on usage of CLONE_IO and qemu.
> 
> Vivek, you summed it up pretty well.  Also, for qemu, raw CLONE_IO is not
> an option because threads are created via pthread (we had done some local
> hacks to verify that CLONE_IO helped w/ the idling problem, and it did).

Chris,

Just to make sure I understand it right I am thinking loud.

That means CLONE_IO is useful and ideally qemu would like to make use of it
but beacuse pthread interface does not support it, it is not used as of
today.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-28 14:10                       ` Vivek Goyal
@ 2012-02-28 17:01                         ` Chris Wright
  2012-02-28 20:11                           ` Stefan Hajnoczi
  0 siblings, 1 reply; 50+ messages in thread
From: Chris Wright @ 2012-02-28 17:01 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Chris Wright, Tejun Heo, Kent Overstreet, axboe, ctalbott, rni,
	linux-kernel, Kevin Wolf, Stefan Hajnoczi

* Vivek Goyal (vgoyal@redhat.com) wrote:
> On Mon, Feb 27, 2012 at 03:12:22PM -0800, Chris Wright wrote:
> 
> [..]
> > > > > > blkcg doesn't allow that anyway (it tries but is racy) and I actually
> > > > > > was thinking about sending a RFC patch to kill CLONE_IO.
> > > > > 
> > > > > I thought CLONE_IO is useful and it allows threads to share IO context.
> > > > > qemu wanted to use it for its IO threads so that one virtual machine
> > > > > does not get higher share of disk by just craeting more threads. In fact
> > > > > if multiple threads are doing related IO, we would like them to use
> > > > > same io context.
> > > > 
> > > > I don't think that's true.  Think of any multithreaded server program
> > > > where each thread is working pretty much independently from others.
> > > 
> > > If threads are working pretty much independently, then one does not have
> > > to specify CLONE_IO.
> > > 
> > > In case of qemu IO threads, I have debugged issues where an big IO range
> > > is being splitted among its IO threads. Just do a sequential IO inside
> > > guest, and I was seeing that few sector IO comes from one process, next
> > > few sector come from other process and it goes on. A sequential range
> > > of IO is some split among a bunch of threads and that does not work
> > > well with CFQ if every IO is coming from its own IO context and IO
> > > context is not shared. After a bunch of IO from one io context, CFQ
> > > continues to idle on that io context thinking more IO will come soon.
> > > Next IO does come but from a different thread and differnet context.
> > > 
> > > CFQ now has employed some techniques to detect that case and try
> > > to do preemption and try to reduce idling in such cases. But sometimes
> > > these techniques work well and other times don't.  So to me, CLONE_IO
> > > can help in this case where application can specifically share
> > > IO context and CFQ does not have to do all the tricks.
> > > 
> > > That's a different thing that applications might not be making use
> > > of CLONE_IO.
> > > 
> > > > Virtualization *can* be a valid use case but are they actually using
> > > > it?  Aren't they better served by cgroup?
> > > 
> > > cgroup can be very heavy weight when hundred's of virtual machines
> > > are running. Why? because of idling. CFQ still has lots of tricks
> > > to do preemption and cut down on idling across io contexts, but
> > > across cgroup boundaries, isolation is much more stronger and very
> > > little preemption (if any) is allowed. I suspect in current
> > > implementation, if we create lots of blkio cgroup, it will be 
> > > bad for overall throughput of virtual machines (purely because of
> > > idling).
> > > 
> > > So I am not too excited about blkio cgroup solution because it might not
> > > scale well. (Until and unless we find a better algorithm to cut down
> > > on idling).
> > > 
> > > I am ccing Chris Wright <chrisw@redhat.com>. He might have thoughts
> > > on usage of CLONE_IO and qemu.
> > 
> > Vivek, you summed it up pretty well.  Also, for qemu, raw CLONE_IO is not
> > an option because threads are created via pthread (we had done some local
> > hacks to verify that CLONE_IO helped w/ the idling problem, and it did).
> 
> Chris,
> 
> Just to make sure I understand it right I am thinking loud.
> 
> That means CLONE_IO is useful and ideally qemu would like to make use of it
> but beacuse pthread interface does not support it, it is not used as of
> today.

It depends on the block I/O model in qemu.  It can use either a pthread
pool or native aio.  All of the CFQ + idling problems we encountered were
w/ a pthread pool (and IIRC, those would have predated preadv/pwritev,
so the actual i/o vector from the guest was pulled apart and submitted as
a bunch of discrete pread/pwrite requests), however in many cases simply
using aio may be the better sol'n.  Stefan, Kevin, have any thoughts re:
qemu block I/O and making use of CLONE_IO in a modern qemu?

thanks,
-chris

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 7/9] block: implement bio_associate_current()
  2012-02-28 17:01                         ` Chris Wright
@ 2012-02-28 20:11                           ` Stefan Hajnoczi
  0 siblings, 0 replies; 50+ messages in thread
From: Stefan Hajnoczi @ 2012-02-28 20:11 UTC (permalink / raw)
  To: Chris Wright
  Cc: Vivek Goyal, Tejun Heo, Kent Overstreet, axboe, ctalbott, rni,
	linux-kernel, Kevin Wolf

On Tue, Feb 28, 2012 at 5:01 PM, Chris Wright <chrisw@redhat.com> wrote:
> * Vivek Goyal (vgoyal@redhat.com) wrote:
>> On Mon, Feb 27, 2012 at 03:12:22PM -0800, Chris Wright wrote:
>>
>> [..]
>> > > > > > blkcg doesn't allow that anyway (it tries but is racy) and I actually
>> > > > > > was thinking about sending a RFC patch to kill CLONE_IO.
>> > > > >
>> > > > > I thought CLONE_IO is useful and it allows threads to share IO context.
>> > > > > qemu wanted to use it for its IO threads so that one virtual machine
>> > > > > does not get higher share of disk by just craeting more threads. In fact
>> > > > > if multiple threads are doing related IO, we would like them to use
>> > > > > same io context.
>> > > >
>> > > > I don't think that's true.  Think of any multithreaded server program
>> > > > where each thread is working pretty much independently from others.
>> > >
>> > > If threads are working pretty much independently, then one does not have
>> > > to specify CLONE_IO.
>> > >
>> > > In case of qemu IO threads, I have debugged issues where an big IO range
>> > > is being splitted among its IO threads. Just do a sequential IO inside
>> > > guest, and I was seeing that few sector IO comes from one process, next
>> > > few sector come from other process and it goes on. A sequential range
>> > > of IO is some split among a bunch of threads and that does not work
>> > > well with CFQ if every IO is coming from its own IO context and IO
>> > > context is not shared. After a bunch of IO from one io context, CFQ
>> > > continues to idle on that io context thinking more IO will come soon.
>> > > Next IO does come but from a different thread and differnet context.
>> > >
>> > > CFQ now has employed some techniques to detect that case and try
>> > > to do preemption and try to reduce idling in such cases. But sometimes
>> > > these techniques work well and other times don't.  So to me, CLONE_IO
>> > > can help in this case where application can specifically share
>> > > IO context and CFQ does not have to do all the tricks.
>> > >
>> > > That's a different thing that applications might not be making use
>> > > of CLONE_IO.
>> > >
>> > > > Virtualization *can* be a valid use case but are they actually using
>> > > > it?  Aren't they better served by cgroup?
>> > >
>> > > cgroup can be very heavy weight when hundred's of virtual machines
>> > > are running. Why? because of idling. CFQ still has lots of tricks
>> > > to do preemption and cut down on idling across io contexts, but
>> > > across cgroup boundaries, isolation is much more stronger and very
>> > > little preemption (if any) is allowed. I suspect in current
>> > > implementation, if we create lots of blkio cgroup, it will be
>> > > bad for overall throughput of virtual machines (purely because of
>> > > idling).
>> > >
>> > > So I am not too excited about blkio cgroup solution because it might not
>> > > scale well. (Until and unless we find a better algorithm to cut down
>> > > on idling).
>> > >
>> > > I am ccing Chris Wright <chrisw@redhat.com>. He might have thoughts
>> > > on usage of CLONE_IO and qemu.
>> >
>> > Vivek, you summed it up pretty well.  Also, for qemu, raw CLONE_IO is not
>> > an option because threads are created via pthread (we had done some local
>> > hacks to verify that CLONE_IO helped w/ the idling problem, and it did).
>>
>> Chris,
>>
>> Just to make sure I understand it right I am thinking loud.
>>
>> That means CLONE_IO is useful and ideally qemu would like to make use of it
>> but beacuse pthread interface does not support it, it is not used as of
>> today.
>
> It depends on the block I/O model in qemu.  It can use either a pthread
> pool or native aio.  All of the CFQ + idling problems we encountered were
> w/ a pthread pool (and IIRC, those would have predated preadv/pwritev,
> so the actual i/o vector from the guest was pulled apart and submitted as
> a bunch of discrete pread/pwrite requests), however in many cases simply
> using aio may be the better sol'n.  Stefan, Kevin, have any thoughts re:
> qemu block I/O and making use of CLONE_IO in a modern qemu?

What you say makes sense.  Today QEMU uses either Linux AIO or a
thread pool with preadv()/pwritev().  Vectored requests are passed to
the host kernel, they are no longer broken up into multiple
read()/write() calls.

QEMU takes requests from an emulated storage interface like virtio-blk
or IDE and issues them.  If the guest submits multiple requests at
once QEMU will try to issue them in parallel.

In the preadv()/pwritev() thread pool case QEMU makes no attempt at
binding sequential I/O patterns to the same worker thread.  Therefore
I think idle or anticipatory wait doesn't make sense for worker
threads - they have no state and simply grab the next available
request off a queue.  After 10 seconds of idle time a worker thread
will terminate.

I think it makes sense to use CLONE_IO for QEMU worker threads.

By the way, glibc isn't using CLONE_IO for its own POSIX aio
implementation either, so this flag really seems underused.

Stefan

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2012-02-28 20:11 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-16 22:37 [PATCHSET] blkcg: update locking and fix stacking Tejun Heo
2012-02-16 22:37 ` [PATCH 1/9] blkcg: use double locking instead of RCU for blkg synchronization Tejun Heo
2012-02-16 22:37 ` [PATCH 2/9] blkcg: drop unnecessary RCU locking Tejun Heo
2012-02-17 16:19   ` Vivek Goyal
2012-02-17 17:07     ` Tejun Heo
2012-02-17 17:14       ` Tejun Heo
2012-02-17 16:47   ` Vivek Goyal
2012-02-17 17:11     ` Tejun Heo
2012-02-17 17:28       ` Vivek Goyal
2012-02-17 17:43         ` Tejun Heo
2012-02-17 18:08           ` Vivek Goyal
2012-02-17 18:16             ` Tejun Heo
2012-02-22  0:49   ` [PATCH UPDATED " Tejun Heo
2012-02-16 22:37 ` [PATCH 3/9] block: restructure get_request() Tejun Heo
2012-02-16 22:37 ` [PATCH 4/9] block: interface update for ioc/icq creation functions Tejun Heo
2012-02-16 22:37 ` [PATCH 5/9] block: ioc_task_link() can't fail Tejun Heo
2012-02-17 20:41   ` Vivek Goyal
2012-02-17 22:18     ` Tejun Heo
2012-02-16 22:37 ` [PATCH 6/9] block: add io_context->active_ref Tejun Heo
2012-02-16 22:37 ` [PATCH 7/9] block: implement bio_associate_current() Tejun Heo
2012-02-17  1:19   ` Kent Overstreet
2012-02-17 22:14     ` Tejun Heo
2012-02-17 22:34       ` Vivek Goyal
2012-02-17 22:41         ` Tejun Heo
2012-02-17 22:51           ` Vivek Goyal
2012-02-17 22:57             ` Tejun Heo
2012-02-20 14:22               ` Vivek Goyal
2012-02-20 16:59                 ` Tejun Heo
2012-02-20 19:14                   ` Vivek Goyal
2012-02-20 21:21                     ` Tejun Heo
2012-02-27 23:12                     ` Chris Wright
2012-02-28 14:10                       ` Vivek Goyal
2012-02-28 17:01                         ` Chris Wright
2012-02-28 20:11                           ` Stefan Hajnoczi
2012-02-20 14:36               ` Vivek Goyal
2012-02-20 17:01                 ` Tejun Heo
2012-02-20 19:16                   ` Vivek Goyal
2012-02-20 21:06                     ` Tejun Heo
2012-02-20 21:10                       ` Vivek Goyal
2012-02-17 22:56           ` Vivek Goyal
2012-02-17 23:06             ` Tejun Heo
2012-02-17 21:33   ` Vivek Goyal
2012-02-17 22:03     ` Tejun Heo
2012-02-17 22:29       ` Vivek Goyal
2012-02-17 22:38         ` Tejun Heo
2012-02-17 22:42           ` Tejun Heo
2012-02-16 22:37 ` [PATCH 8/9] block: make block cgroup policies follow bio task association Tejun Heo
2012-02-16 22:37 ` [PATCH 9/9] block: make blk-throttle preserve the issuing task on delayed bios Tejun Heo
2012-02-17 21:58   ` Vivek Goyal
2012-02-17 22:17     ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).