* [PATCH 00/13] blk-throttle: lockless bio processing for no throttle rule group [V2]
@ 2011-05-19 19:38 Vivek Goyal
  2011-05-19 19:38 ` [PATCH 01/13] blk-throttle: Do the new group initialization with the help of a function Vivek Goyal
                   ` (13 more replies)
  0 siblings, 14 replies; 15+ messages in thread
From: Vivek Goyal @ 2011-05-19 19:38 UTC (permalink / raw)
  To: linux-kernel, jaxboe, axboe; +Cc: dpshah, vgoyal

Hi,

This is V2 of the patch series. Changes from V1 are:

- Dropped the first patch of the series, which has already been merged.
- Fixed a couple of whitespace warnings.

Jens, I am sending this series to both of your ids. See if both produce
warnings.

The block throttling code takes the request queue lock for every incoming
bio (blk_throtl_bio()). This is true even if there are no throttle rules
in the group. This is the common case for the root cgroup, where
distributions ship with throttling support compiled in but the vast
majority of users never specify a throttling rule.

This patch series makes bio processing lockless (no request queue
lock) if there are no rules specified for the group. Once a bio is
submitted, we search for the group and update the stats under
rcu_read_lock(), then release the rcu lock. The request queue lock is
taken only if there are throttling rules specified for the group.

I have made some of the dispatch stats per cpu so that they can be updated
without taking the request queue lock. A rough sketch of the resulting
fast path is shown below.
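
For reference, the fast path looks roughly like the sketch below. This
is simplified pseudo-C to illustrate the flow, not the literal patch
code; tg_no_rule() here is a placeholder for the "no limits configured
for this group" check that the real code performs:

	rcu_read_lock();
	blkcg = task_blkio_cgroup(current);
	tg = throtl_find_tg(td, blkcg);
	if (tg && tg_no_rule(tg, rw)) {
		/* No rules: update per cpu dispatch stats, no queue lock */
		blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size, rw,
						bio->bi_rw & REQ_SYNC);
		rcu_read_unlock();
		return 0;	/* bio is not throttled, let it pass */
	}
	rcu_read_unlock();

	/* Throttling rules present: take the slow path under queue lock */
	spin_lock_irq(q->queue_lock);
	...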

On my system, for a simple dd run as follows, the request queue lock
acquisition count has gone down by roughly 11%.

dd if=/mnt/zerofile-1G of=/dev/null bs=4K iflag=direct

lockstat output, vanilla kernel
-------------------------------
class name                      acquisitions    holdtime-total
&(&q->__queue_lock)->rlock:     2360944         1850183.07

lockstat output, patched kernel
-------------------------------
class name                      acquisitions    holdtime-total
&(&q->__queue_lock)->rlock:     2098599         1430478.79


I tested on a 4 cpu system doing IO to one SSD and did not see any
significant improvement in throughput. I suspect that I never saturated
the cpus, hence no throughput improvement shows up. I will see if I can
get more testing done on this and check whether an IO throughput
improvement becomes visible.

Thanks
Vivek

Vivek Goyal (13):
  blk-throttle: Do the new group initialization with the help of a
    function
  blk-cgroup: move some fields of unaccounted_time file under right
    config option
  cfq-iosched: Get rid of redundant function parameter "create"
  cfq-iosched: Fix a possible race with cfq cgroup removal code
  blk-cgroup: Allow sleeping while dynamically allocating a group
  blk-throttle: Dynamically allocate root group
  blk-throttle: Introduce a helper function to fill in device details
  blk-throttle: Use helper function to add root throtl group to lists
  blk-throttle: Free up a group only after one rcu grace period
  blk-throttle: Make dispatch stats per cpu
  blk-cgroup: Make 64bit per cpu stats safe on 32bit arch
  blk-cgroup: Make cgroup stat reset path blkg->lock free for dispatch
    stats
  blk-throttle: Make no throttling rule group processing lockless

 block/blk-cgroup.c   |  180 +++++++++++++++++++++++------
 block/blk-cgroup.h   |   36 +++++--
 block/blk-core.c     |    3 +-
 block/blk-throttle.c |  313 ++++++++++++++++++++++++++++++++++++++------------
 block/cfq-iosched.c  |  202 ++++++++++++++++++++++++--------
 5 files changed, 564 insertions(+), 170 deletions(-)

-- 
1.7.4.4



* [PATCH 01/13] blk-throttle: Do the new group initialization with the help of a function
  2011-05-19 19:38 [PATCH 00/13] blk-throttle: lockless bio processing for no throttle rule group [V2] Vivek Goyal
@ 2011-05-19 19:38 ` Vivek Goyal
  2011-05-19 19:38 ` [PATCH 02/13] blk-cgroup: move some fields of unaccounted_time file under right config option Vivek Goyal
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Vivek Goyal @ 2011-05-19 19:38 UTC (permalink / raw)
  To: linux-kernel, jaxboe, axboe; +Cc: dpshah, vgoyal

Group initialization code currently lives in two places: root group
initialization in blk_throtl_init() and dynamically allocated group
initialization in throtl_find_alloc_tg(). Create a common function and
use it in both places.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-throttle.c |   64 +++++++++++++++++++++++++++----------------------
 1 files changed, 35 insertions(+), 29 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 252a81a..fa9a900 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -159,6 +159,35 @@ static void throtl_put_tg(struct throtl_grp *tg)
 	kfree(tg);
 }
 
+static void throtl_init_group(struct throtl_grp *tg)
+{
+	INIT_HLIST_NODE(&tg->tg_node);
+	RB_CLEAR_NODE(&tg->rb_node);
+	bio_list_init(&tg->bio_lists[0]);
+	bio_list_init(&tg->bio_lists[1]);
+	tg->limits_changed = false;
+
+	/* Practically unlimited BW */
+	tg->bps[0] = tg->bps[1] = -1;
+	tg->iops[0] = tg->iops[1] = -1;
+
+	/*
+	 * Take the initial reference that will be released on destroy
+	 * This can be thought of a joint reference by cgroup and
+	 * request queue which will be dropped by either request queue
+	 * exit or cgroup deletion path depending on who is exiting first.
+	 */
+	atomic_set(&tg->ref, 1);
+}
+
+/* Should be called with rcu read lock held (needed for blkcg) */
+static void
+throtl_add_group_to_td_list(struct throtl_data *td, struct throtl_grp *tg)
+{
+	hlist_add_head(&tg->tg_node, &td->tg_list);
+	td->nr_undestroyed_grps++;
+}
+
 static struct throtl_grp * throtl_find_alloc_tg(struct throtl_data *td,
 			struct blkio_cgroup *blkcg)
 {
@@ -196,19 +225,7 @@ static struct throtl_grp * throtl_find_alloc_tg(struct throtl_data *td,
 	if (!tg)
 		goto done;
 
-	INIT_HLIST_NODE(&tg->tg_node);
-	RB_CLEAR_NODE(&tg->rb_node);
-	bio_list_init(&tg->bio_lists[0]);
-	bio_list_init(&tg->bio_lists[1]);
-	td->limits_changed = false;
-
-	/*
-	 * Take the initial reference that will be released on destroy
-	 * This can be thought of a joint reference by cgroup and
-	 * request queue which will be dropped by either request queue
-	 * exit or cgroup deletion path depending on who is exiting first.
-	 */
-	atomic_set(&tg->ref, 1);
+	throtl_init_group(tg);
 
 	/* Add group onto cgroup list */
 	sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
@@ -220,8 +237,7 @@ static struct throtl_grp * throtl_find_alloc_tg(struct throtl_data *td,
 	tg->iops[READ] = blkcg_get_read_iops(blkcg, tg->blkg.dev);
 	tg->iops[WRITE] = blkcg_get_write_iops(blkcg, tg->blkg.dev);
 
-	hlist_add_head(&tg->tg_node, &td->tg_list);
-	td->nr_undestroyed_grps++;
+	throtl_add_group_to_td_list(td, tg);
 done:
 	return tg;
 }
@@ -1060,18 +1076,11 @@ int blk_throtl_init(struct request_queue *q)
 	INIT_HLIST_HEAD(&td->tg_list);
 	td->tg_service_tree = THROTL_RB_ROOT;
 	td->limits_changed = false;
+	INIT_DELAYED_WORK(&td->throtl_work, blk_throtl_work);
 
 	/* Init root group */
 	tg = &td->root_tg;
-	INIT_HLIST_NODE(&tg->tg_node);
-	RB_CLEAR_NODE(&tg->rb_node);
-	bio_list_init(&tg->bio_lists[0]);
-	bio_list_init(&tg->bio_lists[1]);
-
-	/* Practically unlimited BW */
-	tg->bps[0] = tg->bps[1] = -1;
-	tg->iops[0] = tg->iops[1] = -1;
-	td->limits_changed = false;
+	throtl_init_group(tg);
 
 	/*
 	 * Set root group reference to 2. One reference will be dropped when
@@ -1080,16 +1089,13 @@ int blk_throtl_init(struct request_queue *q)
 	 * as it is statically allocated and gets destroyed when throtl_data
 	 * goes away.
 	 */
-	atomic_set(&tg->ref, 2);
-	hlist_add_head(&tg->tg_node, &td->tg_list);
-	td->nr_undestroyed_grps++;
-
-	INIT_DELAYED_WORK(&td->throtl_work, blk_throtl_work);
+	atomic_inc(&tg->ref);
 
 	rcu_read_lock();
 	blkiocg_add_blkio_group(&blkio_root_cgroup, &tg->blkg, (void *)td,
 					0, BLKIO_POLICY_THROTL);
 	rcu_read_unlock();
+	throtl_add_group_to_td_list(td, tg);
 
 	/* Attach throtl data to request queue */
 	td->queue = q;
-- 
1.7.4.4



* [PATCH 02/13] blk-cgroup: move some fields of unaccounted_time file under right config option
  2011-05-19 19:38 [PATCH 00/13] blk-throttle: lockless bio processing for no throttle rule group [V2] Vivek Goyal
  2011-05-19 19:38 ` [PATCH 01/13] blk-throttle: Do the new group initialization with the help of a function Vivek Goyal
@ 2011-05-19 19:38 ` Vivek Goyal
  2011-05-19 19:38 ` [PATCH 03/13] cfq-iosched: Get rid of redundant function parameter "create" Vivek Goyal
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Vivek Goyal @ 2011-05-19 19:38 UTC (permalink / raw)
  To: linux-kernel, jaxboe, axboe; +Cc: dpshah, vgoyal

The cgroup unaccounted_time file is created only if
CONFIG_DEBUG_BLK_CGROUP=y, but some of the related fields are outside
this config option. Fix that.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c |    2 ++
 block/blk-cgroup.h |    7 ++++---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 471fdcc..b0592bc 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -385,7 +385,9 @@ void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time,
 
 	spin_lock_irqsave(&blkg->stats_lock, flags);
 	blkg->stats.time += time;
+#ifdef CONFIG_DEBUG_BLK_CGROUP
 	blkg->stats.unaccounted_time += unaccounted_time;
+#endif
 	spin_unlock_irqrestore(&blkg->stats_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_timeslice_used);
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index c774930..63f1ef4 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -49,9 +49,9 @@ enum stat_type {
 	/* All the single valued stats go below this */
 	BLKIO_STAT_TIME,
 	BLKIO_STAT_SECTORS,
+#ifdef CONFIG_DEBUG_BLK_CGROUP
 	/* Time not charged to this cgroup */
 	BLKIO_STAT_UNACCOUNTED_TIME,
-#ifdef CONFIG_DEBUG_BLK_CGROUP
 	BLKIO_STAT_AVG_QUEUE_SIZE,
 	BLKIO_STAT_IDLE_TIME,
 	BLKIO_STAT_EMPTY_TIME,
@@ -117,10 +117,11 @@ struct blkio_group_stats {
 	/* total disk time and nr sectors dispatched by this group */
 	uint64_t time;
 	uint64_t sectors;
-	/* Time not charged to this cgroup */
-	uint64_t unaccounted_time;
 	uint64_t stat_arr[BLKIO_STAT_QUEUED + 1][BLKIO_STAT_TOTAL];
 #ifdef CONFIG_DEBUG_BLK_CGROUP
+	/* Time not charged to this cgroup */
+	uint64_t unaccounted_time;
+
 	/* Sum of number of IOs queued across all samples */
 	uint64_t avg_queue_size_sum;
 	/* Count of samples taken for average */
-- 
1.7.4.4



* [PATCH 03/13] cfq-iosched: Get rid of redundant function parameter "create"
  2011-05-19 19:38 [PATCH 00/13] blk-throttle: lockless bio processing for no throttle rule group [V2] Vivek Goyal
  2011-05-19 19:38 ` [PATCH 01/13] blk-throttle: Do the new group initialization with the help of a function Vivek Goyal
  2011-05-19 19:38 ` [PATCH 02/13] blk-cgroup: move some fields of unaccounted_time file under right config option Vivek Goyal
@ 2011-05-19 19:38 ` Vivek Goyal
  2011-05-19 19:38 ` [PATCH 04/13] cfq-iosched: Fix a possible race with cfq cgroup removal code Vivek Goyal
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Vivek Goyal @ 2011-05-19 19:38 UTC (permalink / raw)
  To: linux-kernel, jaxboe, axboe; +Cc: dpshah, vgoyal

Nobody seems to be using the cfq_find_alloc_cfqg() function parameter
"create". Get rid of it.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |   18 +++++++++---------
 1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ab7a9e6..a905b55 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1015,7 +1015,7 @@ void cfq_update_blkio_group_weight(void *key, struct blkio_group *blkg,
 }
 
 static struct cfq_group * cfq_find_alloc_cfqg(struct cfq_data *cfqd,
-		struct blkio_cgroup *blkcg, int create)
+		struct blkio_cgroup *blkcg)
 {
 	struct cfq_group *cfqg = NULL;
 	void *key = cfqd;
@@ -1030,7 +1030,7 @@ static struct cfq_group * cfq_find_alloc_cfqg(struct cfq_data *cfqd,
 		cfqg->blkg.dev = MKDEV(major, minor);
 		goto done;
 	}
-	if (cfqg || !create)
+	if (cfqg)
 		goto done;
 
 	cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
@@ -1073,18 +1073,18 @@ done:
 }
 
 /*
- * Search for the cfq group current task belongs to. If create = 1, then also
- * create the cfq group if it does not exist. request_queue lock must be held.
+ * Search for the cfq group current task belongs to. request_queue lock must
+ * be held.
  */
-static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd)
 {
 	struct blkio_cgroup *blkcg;
 	struct cfq_group *cfqg = NULL;
 
 	rcu_read_lock();
 	blkcg = task_blkio_cgroup(current);
-	cfqg = cfq_find_alloc_cfqg(cfqd, blkcg, create);
-	if (!cfqg && create)
+	cfqg = cfq_find_alloc_cfqg(cfqd, blkcg);
+	if (!cfqg)
 		cfqg = &cfqd->root_group;
 	rcu_read_unlock();
 	return cfqg;
@@ -1176,7 +1176,7 @@ void cfq_unlink_blkio_group(void *key, struct blkio_group *blkg)
 }
 
 #else /* GROUP_IOSCHED */
-static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd, int create)
+static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd)
 {
 	return &cfqd->root_group;
 }
@@ -2911,7 +2911,7 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync,
 	struct cfq_group *cfqg;
 
 retry:
-	cfqg = cfq_get_cfqg(cfqd, 1);
+	cfqg = cfq_get_cfqg(cfqd);
 	cic = cfq_cic_lookup(cfqd, ioc);
 	/* cic always exists here */
 	cfqq = cic_to_cfqq(cic, is_sync);
-- 
1.7.4.4



* [PATCH 04/13] cfq-iosched: Fix a possible race with cfq cgroup removal code
  2011-05-19 19:38 [PATCH 00/13] blk-throttle: lockless bio processing for no throttle rule group [V2] Vivek Goyal
                   ` (2 preceding siblings ...)
  2011-05-19 19:38 ` [PATCH 03/13] cfq-iosched: Get rid of redundant function parameter "create" Vivek Goyal
@ 2011-05-19 19:38 ` Vivek Goyal
  2011-05-19 19:38 ` [PATCH 05/13] blk-cgroup: Allow sleeping while dynamically allocating a group Vivek Goyal
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Vivek Goyal @ 2011-05-19 19:38 UTC (permalink / raw)
  To: linux-kernel, jaxboe, axboe; +Cc: dpshah, vgoyal

blkg->key = cfqd is an rcu protected pointer and hence we used to do
call_rcu(cfqd->rcu_head) to free up cfqd after one rcu grace period.

The problem here is that even though cfqd is around, there are no
guarantees that the associated request queue (cfqd->queue) or
q->queue_lock is still around. A driver might have called
blk_cleanup_queue() and released the lock.

It might happen that after the lock has been freed up we dereference
blkg->key->queue->queue_lock and crash. This is possible in the
following path.

blkiocg_destroy()
 blkio_unlink_group_fn()
  cfq_unlink_blkio_group()

Hence, wait for an rcu period if there are groups which have not
been unlinked from blkcg->blkg_list. That way, any groups which are
taking the cfq_unlink_blkio_group() path can safely take the queue
lock.

This is how the same race has been taken care of in the throttling
logic also.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |   48 ++++++++++++++++++++++++++++++++++++------------
 1 files changed, 36 insertions(+), 12 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a905b55..e2e6719 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -300,7 +300,9 @@ struct cfq_data {
 
 	/* List of cfq groups being managed on this device*/
 	struct hlist_head cfqg_list;
-	struct rcu_head rcu;
+
+	/* Number of groups which are on blkcg->blkg_list */
+	unsigned int nr_blkcg_linked_grps;
 };
 
 static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
@@ -1063,6 +1065,7 @@ static struct cfq_group * cfq_find_alloc_cfqg(struct cfq_data *cfqd,
 		cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
 					0);
 
+	cfqd->nr_blkcg_linked_grps++;
 	cfqg->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
 
 	/* Add group on cfqd list */
@@ -3815,15 +3818,11 @@ static void cfq_put_async_queues(struct cfq_data *cfqd)
 		cfq_put_queue(cfqd->async_idle_cfqq);
 }
 
-static void cfq_cfqd_free(struct rcu_head *head)
-{
-	kfree(container_of(head, struct cfq_data, rcu));
-}
-
 static void cfq_exit_queue(struct elevator_queue *e)
 {
 	struct cfq_data *cfqd = e->elevator_data;
 	struct request_queue *q = cfqd->queue;
+	bool wait = false;
 
 	cfq_shutdown_timer_wq(cfqd);
 
@@ -3842,7 +3841,13 @@ static void cfq_exit_queue(struct elevator_queue *e)
 
 	cfq_put_async_queues(cfqd);
 	cfq_release_cfq_groups(cfqd);
-	cfq_blkiocg_del_blkio_group(&cfqd->root_group.blkg);
+
+	/*
+	 * If there are groups which we could not unlink from blkcg list,
+	 * wait for a rcu period for them to be freed.
+	 */
+	if (cfqd->nr_blkcg_linked_grps)
+		wait = true;
 
 	spin_unlock_irq(q->queue_lock);
 
@@ -3852,8 +3857,20 @@ static void cfq_exit_queue(struct elevator_queue *e)
 	ida_remove(&cic_index_ida, cfqd->cic_index);
 	spin_unlock(&cic_index_lock);
 
-	/* Wait for cfqg->blkg->key accessors to exit their grace periods. */
-	call_rcu(&cfqd->rcu, cfq_cfqd_free);
+	/*
+	 * Wait for cfqg->blkg->key accessors to exit their grace periods.
+	 * Do this wait only if there are other unlinked groups out
+	 * there. This can happen if cgroup deletion path claimed the
+	 * responsibility of cleaning up a group before queue cleanup code
+	 * get to the group.
+	 *
+	 * Do not call synchronize_rcu() unconditionally as there are drivers
+	 * which create/delete request queue hundreds of times during scan/boot
+	 * and synchronize_rcu() can take significant time and slow down boot.
+	 */
+	if (wait)
+		synchronize_rcu();
+	kfree(cfqd);
 }
 
 static int cfq_alloc_cic_index(void)
@@ -3909,14 +3926,21 @@ static void *cfq_init_queue(struct request_queue *q)
 
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
 	/*
-	 * Take a reference to root group which we never drop. This is just
-	 * to make sure that cfq_put_cfqg() does not try to kfree root group
+	 * Set root group reference to 2. One reference will be dropped when
+	 * all groups on cfqd->cfqg_list are being deleted during queue exit.
+	 * Other reference will remain there as we don't want to delete this
+	 * group as it is statically allocated and gets destroyed when
+	 * throtl_data goes away.
 	 */
-	cfqg->ref = 1;
+	cfqg->ref = 2;
 	rcu_read_lock();
 	cfq_blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg,
 					(void *)cfqd, 0);
 	rcu_read_unlock();
+	cfqd->nr_blkcg_linked_grps++;
+
+	/* Add group on cfqd->cfqg_list */
+	hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
 #endif
 	/*
 	 * Not strictly needed (since RB_ROOT just clears the node and we
-- 
1.7.4.4



* [PATCH 05/13] blk-cgroup: Allow sleeping while dynamically allocating a group
  2011-05-19 19:38 [PATCH 00/13] blk-throttle: lockless bio processing for no throttle rule group [V2] Vivek Goyal
                   ` (3 preceding siblings ...)
  2011-05-19 19:38 ` [PATCH 04/13] cfq-iosched: Fix a possible race with cfq cgroup removal code Vivek Goyal
@ 2011-05-19 19:38 ` Vivek Goyal
  2011-05-19 19:38 ` [PATCH 06/13] blk-throttle: Dynamically allocate root group Vivek Goyal
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Vivek Goyal @ 2011-05-19 19:38 UTC (permalink / raw)
  To: linux-kernel, jaxboe, axboe; +Cc: dpshah, vgoyal

Currently, all the cfq_group or throtl_group allocations happen while
we are holding ->queue_lock, so sleeping is not allowed.

Soon, we will move to per cpu stats and will also need to allocate the
per group stats. As one cannot call alloc_percpu() from atomic context
(it can sleep), we need to drop ->queue_lock, allocate the group,
retake the lock and continue processing, as sketched below.

In the throttling code, the queue DEAD flag is checked again to make
sure that the driver did not call blk_cleanup_queue() in the meantime.
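
Condensed, the drop-and-retake pattern in throtl_get_tg() looks roughly
like this (a sketch of the flow, not the exact patch code):

	rcu_read_unlock();
	spin_unlock_irq(q->queue_lock);

	tg = throtl_alloc_tg(td);		/* may sleep */

	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
		/* Driver cleaned up the queue while we slept */
		kfree(tg);
		return ERR_PTR(-ENODEV);
	}

	spin_lock_irq(q->queue_lock);
	rcu_read_lock();

	/* Somebody else may have set up the group while we slept */
	__tg = throtl_find_tg(td, blkcg);
	if (__tg) {
		kfree(tg);
		return __tg;
	}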

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-core.c     |    3 +-
 block/blk-throttle.c |  141 ++++++++++++++++++++++++++++++++++++++------------
 block/cfq-iosched.c  |  128 +++++++++++++++++++++++++++++++++------------
 3 files changed, 205 insertions(+), 67 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3fe00a1..9e8e297 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1550,7 +1550,8 @@ static inline void __generic_make_request(struct bio *bio)
 			goto end_io;
 		}
 
-		blk_throtl_bio(q, &bio);
+		if (blk_throtl_bio(q, &bio))
+			goto end_io;
 
 		/*
 		 * If bio = NULL, bio has been throttled and will be submitted
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index fa9a900..c201967 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -188,8 +188,40 @@ throtl_add_group_to_td_list(struct throtl_data *td, struct throtl_grp *tg)
 	td->nr_undestroyed_grps++;
 }
 
-static struct throtl_grp * throtl_find_alloc_tg(struct throtl_data *td,
-			struct blkio_cgroup *blkcg)
+static void throtl_init_add_tg_lists(struct throtl_data *td,
+			struct throtl_grp *tg, struct blkio_cgroup *blkcg)
+{
+	struct backing_dev_info *bdi = &td->queue->backing_dev_info;
+	unsigned int major, minor;
+
+	/* Add group onto cgroup list */
+	sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+	blkiocg_add_blkio_group(blkcg, &tg->blkg, (void *)td,
+				MKDEV(major, minor), BLKIO_POLICY_THROTL);
+
+	tg->bps[READ] = blkcg_get_read_bps(blkcg, tg->blkg.dev);
+	tg->bps[WRITE] = blkcg_get_write_bps(blkcg, tg->blkg.dev);
+	tg->iops[READ] = blkcg_get_read_iops(blkcg, tg->blkg.dev);
+	tg->iops[WRITE] = blkcg_get_write_iops(blkcg, tg->blkg.dev);
+
+	throtl_add_group_to_td_list(td, tg);
+}
+
+/* Should be called without queue lock and outside of rcu period */
+static struct throtl_grp *throtl_alloc_tg(struct throtl_data *td)
+{
+	struct throtl_grp *tg = NULL;
+
+	tg = kzalloc_node(sizeof(*tg), GFP_ATOMIC, td->queue->node);
+	if (!tg)
+		return NULL;
+
+	throtl_init_group(tg);
+	return tg;
+}
+
+static struct
+throtl_grp *throtl_find_tg(struct throtl_data *td, struct blkio_cgroup *blkcg)
 {
 	struct throtl_grp *tg = NULL;
 	void *key = td;
@@ -197,12 +229,6 @@ static struct throtl_grp * throtl_find_alloc_tg(struct throtl_data *td,
 	unsigned int major, minor;
 
 	/*
-	 * TODO: Speed up blkiocg_lookup_group() by maintaining a radix
-	 * tree of blkg (instead of traversing through hash list all
-	 * the time.
-	 */
-
-	/*
 	 * This is the common case when there are no blkio cgroups.
  	 * Avoid lookup in this case
  	 */
@@ -215,43 +241,83 @@ static struct throtl_grp * throtl_find_alloc_tg(struct throtl_data *td,
 	if (tg && !tg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
 		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
 		tg->blkg.dev = MKDEV(major, minor);
-		goto done;
 	}
 
-	if (tg)
-		goto done;
-
-	tg = kzalloc_node(sizeof(*tg), GFP_ATOMIC, td->queue->node);
-	if (!tg)
-		goto done;
-
-	throtl_init_group(tg);
-
-	/* Add group onto cgroup list */
-	sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
-	blkiocg_add_blkio_group(blkcg, &tg->blkg, (void *)td,
-				MKDEV(major, minor), BLKIO_POLICY_THROTL);
-
-	tg->bps[READ] = blkcg_get_read_bps(blkcg, tg->blkg.dev);
-	tg->bps[WRITE] = blkcg_get_write_bps(blkcg, tg->blkg.dev);
-	tg->iops[READ] = blkcg_get_read_iops(blkcg, tg->blkg.dev);
-	tg->iops[WRITE] = blkcg_get_write_iops(blkcg, tg->blkg.dev);
-
-	throtl_add_group_to_td_list(td, tg);
-done:
 	return tg;
 }
 
+/*
+ * This function returns with queue lock unlocked in case of error, like
+ * request queue is no more
+ */
 static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 {
-	struct throtl_grp *tg = NULL;
+	struct throtl_grp *tg = NULL, *__tg = NULL;
 	struct blkio_cgroup *blkcg;
+	struct request_queue *q = td->queue;
 
 	rcu_read_lock();
 	blkcg = task_blkio_cgroup(current);
-	tg = throtl_find_alloc_tg(td, blkcg);
-	if (!tg)
+	tg = throtl_find_tg(td, blkcg);
+	if (tg) {
+		rcu_read_unlock();
+		return tg;
+	}
+
+	/*
+	 * Need to allocate a group. Allocation of group also needs allocation
+	 * of per cpu stats which in-turn takes a mutex() and can block. Hence
+	 * we need to drop rcu lock and queue_lock before we call alloc
+	 *
+	 * Take the request queue reference to make sure queue does not
+	 * go away once we return from allocation.
+	 */
+	blk_get_queue(q);
+	rcu_read_unlock();
+	spin_unlock_irq(q->queue_lock);
+
+	tg = throtl_alloc_tg(td);
+	/*
+	 * We might have slept in group allocation. Make sure queue is not
+	 * dead
+	 */
+	if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
+		blk_put_queue(q);
+		if (tg)
+			kfree(tg);
+
+		return ERR_PTR(-ENODEV);
+	}
+	blk_put_queue(q);
+
+	/* Group allocated and queue is still alive. take the lock */
+	spin_lock_irq(q->queue_lock);
+
+	/*
+	 * Initialize the new group. After sleeping, read the blkcg again.
+	 */
+	rcu_read_lock();
+	blkcg = task_blkio_cgroup(current);
+
+	/*
+	 * If some other thread already allocated the group while we were
+	 * not holding queue lock, free up the group
+	 */
+	__tg = throtl_find_tg(td, blkcg);
+
+	if (__tg) {
+		kfree(tg);
+		rcu_read_unlock();
+		return __tg;
+	}
+
+	/* Group allocation failed. Account the IO to root group */
+	if (!tg) {
 		tg = &td->root_tg;
+		return tg;
+	}
+
+	throtl_init_add_tg_lists(td, tg, blkcg);
 	rcu_read_unlock();
 	return tg;
 }
@@ -1014,6 +1080,15 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
 	spin_lock_irq(q->queue_lock);
 	tg = throtl_get_tg(td);
 
+	if (IS_ERR(tg)) {
+		if (PTR_ERR(tg)	== -ENODEV) {
+			/*
+			 * Queue is gone. No queue lock held here.
+			 */
+			return -ENODEV;
+		}
+	}
+
 	if (tg->nr_queued[rw]) {
 		/*
 		 * There is already another bio queued in same dir. No
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index e2e6719..606020f 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1016,28 +1016,47 @@ void cfq_update_blkio_group_weight(void *key, struct blkio_group *blkg,
 	cfqg->needs_update = true;
 }
 
-static struct cfq_group * cfq_find_alloc_cfqg(struct cfq_data *cfqd,
-		struct blkio_cgroup *blkcg)
+static void cfq_init_add_cfqg_lists(struct cfq_data *cfqd,
+			struct cfq_group *cfqg, struct blkio_cgroup *blkcg)
 {
-	struct cfq_group *cfqg = NULL;
-	void *key = cfqd;
-	int i, j;
-	struct cfq_rb_root *st;
 	struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
 	unsigned int major, minor;
 
-	cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
-	if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
+	/*
+	 * Add group onto cgroup list. It might happen that bdi->dev is
+	 * not initialized yet. Initialize this new group without major
+	 * and minor info and this info will be filled in once a new thread
+	 * comes for IO.
+	 */
+	if (bdi->dev) {
 		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
-		cfqg->blkg.dev = MKDEV(major, minor);
-		goto done;
-	}
-	if (cfqg)
-		goto done;
+		cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg,
+					(void *)cfqd, MKDEV(major, minor));
+	} else
+		cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg,
+					(void *)cfqd, 0);
+
+	cfqd->nr_blkcg_linked_grps++;
+	cfqg->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
+
+	/* Add group on cfqd list */
+	hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
+}
+
+/*
+ * Should be called from sleepable context. No request queue lock as per
+ * cpu stats are allocated dynamically and alloc_percpu needs to be called
+ * from sleepable context.
+ */
+static struct cfq_group * cfq_alloc_cfqg(struct cfq_data *cfqd)
+{
+	struct cfq_group *cfqg = NULL;
+	int i, j;
+	struct cfq_rb_root *st;
 
 	cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
 	if (!cfqg)
-		goto done;
+		return NULL;
 
 	for_each_cfqg_st(cfqg, i, j, st)
 		*st = CFQ_RB_ROOT;
@@ -1050,28 +1069,31 @@ static struct cfq_group * cfq_find_alloc_cfqg(struct cfq_data *cfqd,
 	 * or cgroup deletion path depending on who is exiting first.
 	 */
 	cfqg->ref = 1;
+	return cfqg;
+}
+
+static struct cfq_group *
+cfq_find_cfqg(struct cfq_data *cfqd, struct blkio_cgroup *blkcg)
+{
+	struct cfq_group *cfqg = NULL;
+	void *key = cfqd;
+	struct backing_dev_info *bdi = &cfqd->queue->backing_dev_info;
+	unsigned int major, minor;
 
 	/*
-	 * Add group onto cgroup list. It might happen that bdi->dev is
-	 * not initialized yet. Initialize this new group without major
-	 * and minor info and this info will be filled in once a new thread
-	 * comes for IO. See code above.
+	 * This is the common case when there are no blkio cgroups.
+	 * Avoid lookup in this case
 	 */
-	if (bdi->dev) {
-		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
-		cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
-					MKDEV(major, minor));
-	} else
-		cfq_blkiocg_add_blkio_group(blkcg, &cfqg->blkg, (void *)cfqd,
-					0);
-
-	cfqd->nr_blkcg_linked_grps++;
-	cfqg->weight = blkcg_get_weight(blkcg, cfqg->blkg.dev);
+	if (blkcg == &blkio_root_cgroup)
+		cfqg = &cfqd->root_group;
+	else
+		cfqg = cfqg_of_blkg(blkiocg_lookup_group(blkcg, key));
 
-	/* Add group on cfqd list */
-	hlist_add_head(&cfqg->cfqd_node, &cfqd->cfqg_list);
+	if (cfqg && !cfqg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
+		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+		cfqg->blkg.dev = MKDEV(major, minor);
+	}
 
-done:
 	return cfqg;
 }
 
@@ -1082,13 +1104,53 @@ done:
 static struct cfq_group *cfq_get_cfqg(struct cfq_data *cfqd)
 {
 	struct blkio_cgroup *blkcg;
-	struct cfq_group *cfqg = NULL;
+	struct cfq_group *cfqg = NULL, *__cfqg = NULL;
+	struct request_queue *q = cfqd->queue;
+
+	rcu_read_lock();
+	blkcg = task_blkio_cgroup(current);
+	cfqg = cfq_find_cfqg(cfqd, blkcg);
+	if (cfqg) {
+		rcu_read_unlock();
+		return cfqg;
+	}
+
+	/*
+	 * Need to allocate a group. Allocation of group also needs allocation
+	 * of per cpu stats which in-turn takes a mutex() and can block. Hence
+	 * we need to drop rcu lock and queue_lock before we call alloc.
+	 *
+	 * Not taking any queue reference here and assuming that queue is
+	 * around by the time we return. CFQ queue allocation code does
+	 * the same. It might be racy though.
+	 */
+
+	rcu_read_unlock();
+	spin_unlock_irq(q->queue_lock);
+
+	cfqg = cfq_alloc_cfqg(cfqd);
+
+	spin_lock_irq(q->queue_lock);
 
 	rcu_read_lock();
 	blkcg = task_blkio_cgroup(current);
-	cfqg = cfq_find_alloc_cfqg(cfqd, blkcg);
+
+	/*
+	 * If some other thread already allocated the group while we were
+	 * not holding queue lock, free up the group
+	 */
+	__cfqg = cfq_find_cfqg(cfqd, blkcg);
+
+	if (__cfqg) {
+		kfree(cfqg);
+		rcu_read_unlock();
+		return __cfqg;
+	}
+
 	if (!cfqg)
 		cfqg = &cfqd->root_group;
+
+	cfq_init_add_cfqg_lists(cfqd, cfqg, blkcg);
 	rcu_read_unlock();
 	return cfqg;
 }
-- 
1.7.4.4



* [PATCH 06/13] blk-throttle: Dynamically allocate root group
  2011-05-19 19:38 [PATCH 00/13] blk-throttle: lockless bio processing for no throttle rule group [V2] Vivek Goyal
                   ` (4 preceding siblings ...)
  2011-05-19 19:38 ` [PATCH 05/13] blk-cgroup: Allow sleeping while dynamically allocating a group Vivek Goyal
@ 2011-05-19 19:38 ` Vivek Goyal
  2011-05-19 19:38 ` [PATCH 07/13] blk-throttle: Introduce a helper function to fill in device details Vivek Goyal
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Vivek Goyal @ 2011-05-19 19:38 UTC (permalink / raw)
  To: linux-kernel, jaxboe, axboe; +Cc: dpshah, vgoyal

Currently, the root throtl_grp is allocated statically. But as we will
be introducing per cpu stat pointers, which will be allocated
dynamically even for the root group, we might as well make the whole
root throtl_grp allocation dynamic and treat it in the same manner as
the other groups.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-throttle.c |   27 ++++++++++++---------------
 1 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index c201967..68f2ac3 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -88,7 +88,7 @@ struct throtl_data
 	/* service tree for active throtl groups */
 	struct throtl_rb_root tg_service_tree;
 
-	struct throtl_grp root_tg;
+	struct throtl_grp *root_tg;
 	struct request_queue *queue;
 
 	/* Total Number of queued bios on READ and WRITE lists */
@@ -233,7 +233,7 @@ throtl_grp *throtl_find_tg(struct throtl_data *td, struct blkio_cgroup *blkcg)
  	 * Avoid lookup in this case
  	 */
 	if (blkcg == &blkio_root_cgroup)
-		tg = &td->root_tg;
+		tg = td->root_tg;
 	else
 		tg = tg_of_blkg(blkiocg_lookup_group(blkcg, key));
 
@@ -313,7 +313,7 @@ static struct throtl_grp * throtl_get_tg(struct throtl_data *td)
 
 	/* Group allocation failed. Account the IO to root group */
 	if (!tg) {
-		tg = &td->root_tg;
+		tg = td->root_tg;
 		return tg;
 	}
 
@@ -1153,18 +1153,16 @@ int blk_throtl_init(struct request_queue *q)
 	td->limits_changed = false;
 	INIT_DELAYED_WORK(&td->throtl_work, blk_throtl_work);
 
-	/* Init root group */
-	tg = &td->root_tg;
-	throtl_init_group(tg);
+	/* alloc and Init root group. */
+	td->queue = q;
+	tg = throtl_alloc_tg(td);
 
-	/*
-	 * Set root group reference to 2. One reference will be dropped when
-	 * all groups on tg_list are being deleted during queue exit. Other
-	 * reference will remain there as we don't want to delete this group
-	 * as it is statically allocated and gets destroyed when throtl_data
-	 * goes away.
-	 */
-	atomic_inc(&tg->ref);
+	if (!tg) {
+		kfree(td);
+		return -ENOMEM;
+	}
+
+	td->root_tg = tg;
 
 	rcu_read_lock();
 	blkiocg_add_blkio_group(&blkio_root_cgroup, &tg->blkg, (void *)td,
@@ -1173,7 +1171,6 @@ int blk_throtl_init(struct request_queue *q)
 	throtl_add_group_to_td_list(td, tg);
 
 	/* Attach throtl data to request queue */
-	td->queue = q;
 	q->td = td;
 	return 0;
 }
-- 
1.7.4.4



* [PATCH 07/13] blk-throttle: Introduce a helper function to fill in device details
  2011-05-19 19:38 [PATCH 00/13] blk-throttle: lockless bio processing for no throttle rule group [V2] Vivek Goyal
                   ` (5 preceding siblings ...)
  2011-05-19 19:38 ` [PATCH 06/13] blk-throttle: Dynamically allocate root group Vivek Goyal
@ 2011-05-19 19:38 ` Vivek Goyal
  2011-05-19 19:38 ` [PATCH 08/13] blk-throttle: Use helper function to add root throtl group to lists Vivek Goyal
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Vivek Goyal @ 2011-05-19 19:38 UTC (permalink / raw)
  To: linux-kernel, jaxboe, axboe; +Cc: dpshah, vgoyal

Introduce a helper function for code that is used in 2-3 places. It
makes reading the code a little easier.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-throttle.c |   35 +++++++++++++++++++++++------------
 1 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 68f2ac3..97ea7f8 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -188,16 +188,34 @@ throtl_add_group_to_td_list(struct throtl_data *td, struct throtl_grp *tg)
 	td->nr_undestroyed_grps++;
 }
 
-static void throtl_init_add_tg_lists(struct throtl_data *td,
-			struct throtl_grp *tg, struct blkio_cgroup *blkcg)
+static void
+__throtl_tg_fill_dev_details(struct throtl_data *td, struct throtl_grp *tg)
 {
 	struct backing_dev_info *bdi = &td->queue->backing_dev_info;
 	unsigned int major, minor;
 
+	if (!tg || tg->blkg.dev)
+		return;
+
+	/*
+	 * Fill in device details for a group which might not have been
+	 * filled at group creation time as queue was being instantiated
+	 * and driver had not attached a device yet
+	 */
+	if (bdi->dev && dev_name(bdi->dev)) {
+		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
+		tg->blkg.dev = MKDEV(major, minor);
+	}
+}
+
+static void throtl_init_add_tg_lists(struct throtl_data *td,
+			struct throtl_grp *tg, struct blkio_cgroup *blkcg)
+{
+	__throtl_tg_fill_dev_details(td, tg);
+
 	/* Add group onto cgroup list */
-	sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
 	blkiocg_add_blkio_group(blkcg, &tg->blkg, (void *)td,
-				MKDEV(major, minor), BLKIO_POLICY_THROTL);
+				tg->blkg.dev, BLKIO_POLICY_THROTL);
 
 	tg->bps[READ] = blkcg_get_read_bps(blkcg, tg->blkg.dev);
 	tg->bps[WRITE] = blkcg_get_write_bps(blkcg, tg->blkg.dev);
@@ -225,8 +243,6 @@ throtl_grp *throtl_find_tg(struct throtl_data *td, struct blkio_cgroup *blkcg)
 {
 	struct throtl_grp *tg = NULL;
 	void *key = td;
-	struct backing_dev_info *bdi = &td->queue->backing_dev_info;
-	unsigned int major, minor;
 
 	/*
 	 * This is the common case when there are no blkio cgroups.
@@ -237,12 +253,7 @@ throtl_grp *throtl_find_tg(struct throtl_data *td, struct blkio_cgroup *blkcg)
 	else
 		tg = tg_of_blkg(blkiocg_lookup_group(blkcg, key));
 
-	/* Fill in device details for root group */
-	if (tg && !tg->blkg.dev && bdi->dev && dev_name(bdi->dev)) {
-		sscanf(dev_name(bdi->dev), "%u:%u", &major, &minor);
-		tg->blkg.dev = MKDEV(major, minor);
-	}
-
+	__throtl_tg_fill_dev_details(td, tg);
 	return tg;
 }
 
-- 
1.7.4.4



* [PATCH 08/13] blk-throttle: Use helper function to add root throtl group to lists
  2011-05-19 19:38 [PATCH 00/13] blk-throttle: lockless bio processing for no throttle rule group [V2] Vivek Goyal
                   ` (6 preceding siblings ...)
  2011-05-19 19:38 ` [PATCH 07/13] blk-throttle: Introduce a helper function to fill in device details Vivek Goyal
@ 2011-05-19 19:38 ` Vivek Goyal
  2011-05-19 19:38 ` [PATCH 09/13] blk-throttle: Free up a group only after one rcu grace period Vivek Goyal
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Vivek Goyal @ 2011-05-19 19:38 UTC (permalink / raw)
  To: linux-kernel, jaxboe, axboe; +Cc: dpshah, vgoyal

Use the same helper function for the root group as we use with
dynamically allocated groups to add it to the various lists.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-throttle.c |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 97ea7f8..b9412d1 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1176,10 +1176,8 @@ int blk_throtl_init(struct request_queue *q)
 	td->root_tg = tg;
 
 	rcu_read_lock();
-	blkiocg_add_blkio_group(&blkio_root_cgroup, &tg->blkg, (void *)td,
-					0, BLKIO_POLICY_THROTL);
+	throtl_init_add_tg_lists(td, tg, &blkio_root_cgroup);
 	rcu_read_unlock();
-	throtl_add_group_to_td_list(td, tg);
 
 	/* Attach throtl data to request queue */
 	q->td = td;
-- 
1.7.4.4



* [PATCH 09/13] blk-throttle: Free up a group only after one rcu grace period
  2011-05-19 19:38 [PATCH 00/13] blk-throttle: lockless bio processing for no throttle rule group [V2] Vivek Goyal
                   ` (7 preceding siblings ...)
  2011-05-19 19:38 ` [PATCH 08/13] blk-throttle: Use helper function to add root throtl group to lists Vivek Goyal
@ 2011-05-19 19:38 ` Vivek Goyal
  2011-05-19 19:38 ` [PATCH 10/13] blk-throttle: Make dispatch stats per cpu Vivek Goyal
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Vivek Goyal @ 2011-05-19 19:38 UTC (permalink / raw)
  To: linux-kernel, jaxboe, axboe; +Cc: dpshah, vgoyal

Soon we will allow accessing a throtl_grp under rcu_read_lock(). Hence
start freeing up throtl_grp after one rcu grace period.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-throttle.c |   22 +++++++++++++++++++++-
 1 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index b9412d1..90ad407 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -78,6 +78,8 @@ struct throtl_grp {
 
 	/* Some throttle limits got updated for the group */
 	int limits_changed;
+
+	struct rcu_head rcu_head;
 };
 
 struct throtl_data
@@ -151,12 +153,30 @@ static inline struct throtl_grp *throtl_ref_get_tg(struct throtl_grp *tg)
 	return tg;
 }
 
+static void throtl_free_tg(struct rcu_head *head)
+{
+	struct throtl_grp *tg;
+
+	tg = container_of(head, struct throtl_grp, rcu_head);
+	kfree(tg);
+}
+
 static void throtl_put_tg(struct throtl_grp *tg)
 {
 	BUG_ON(atomic_read(&tg->ref) <= 0);
 	if (!atomic_dec_and_test(&tg->ref))
 		return;
-	kfree(tg);
+
+	/*
+	 * A group is freed in rcu manner. But having an rcu lock does not
+	 * mean that one can access all the fields of blkg and assume these
+	 * are valid. For example, don't try to follow throtl_data and
+	 * request queue links.
+	 *
+	 * Having a reference to blkg under an rcu allows acess to only
+	 * values local to groups like group stats and group rate limits
+	 */
+	call_rcu(&tg->rcu_head, throtl_free_tg);
 }
 
 static void throtl_init_group(struct throtl_grp *tg)
-- 
1.7.4.4



* [PATCH 10/13] blk-throttle: Make dispatch stats per cpu
  2011-05-19 19:38 [PATCH 00/13] blk-throttle: lockless bio processing for no throttle rule group [V2] Vivek Goyal
                   ` (8 preceding siblings ...)
  2011-05-19 19:38 ` [PATCH 09/13] blk-throttle: Free up a group only after one rcu grace period Vivek Goyal
@ 2011-05-19 19:38 ` Vivek Goyal
  2011-05-19 19:38 ` [PATCH 11/13] blk-cgroup: Make 64bit per cpu stats safe on 32bit arch Vivek Goyal
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Vivek Goyal @ 2011-05-19 19:38 UTC (permalink / raw)
  To: linux-kernel, jaxboe, axboe; +Cc: dpshah, vgoyal

Currently we take blkg->stats_lock even for just updating the stats. So
even if a group has no throttling rules (the common case for the root
group), we end up taking the lock to update the stats.

Make the dispatch stats per cpu so that they can be updated without
taking the blkg lock.

If a cpu goes offline, these stats simply disappear. No protection has
been provided for that yet. Do we really need anything for that?

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c   |  135 +++++++++++++++++++++++++++++++++++++-------------
 block/blk-cgroup.h   |   27 ++++++++--
 block/blk-throttle.c |    9 +++
 block/cfq-iosched.c  |   18 ++++++-
 4 files changed, 147 insertions(+), 42 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index b0592bc..34bfcef 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -392,20 +392,22 @@ void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time,
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_timeslice_used);
 
+/*
+ * should be called under rcu read lock or queue lock to make sure blkg pointer
+ * is valid.
+ */
 void blkiocg_update_dispatch_stats(struct blkio_group *blkg,
 				uint64_t bytes, bool direction, bool sync)
 {
-	struct blkio_group_stats *stats;
-	unsigned long flags;
+	struct blkio_group_stats_cpu *stats_cpu;
 
-	spin_lock_irqsave(&blkg->stats_lock, flags);
-	stats = &blkg->stats;
-	stats->sectors += bytes >> 9;
-	blkio_add_stat(stats->stat_arr[BLKIO_STAT_SERVICED], 1, direction,
-			sync);
-	blkio_add_stat(stats->stat_arr[BLKIO_STAT_SERVICE_BYTES], bytes,
-			direction, sync);
-	spin_unlock_irqrestore(&blkg->stats_lock, flags);
+	stats_cpu = this_cpu_ptr(blkg->stats_cpu);
+
+	stats_cpu->sectors += bytes >> 9;
+	blkio_add_stat(stats_cpu->stat_arr_cpu[BLKIO_STAT_CPU_SERVICED],
+			1, direction, sync);
+	blkio_add_stat(stats_cpu->stat_arr_cpu[BLKIO_STAT_CPU_SERVICE_BYTES],
+			bytes, direction, sync);
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_dispatch_stats);
 
@@ -440,6 +442,20 @@ void blkiocg_update_io_merged_stats(struct blkio_group *blkg, bool direction,
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_io_merged_stats);
 
+/*
+ * This function allocates the per cpu stats for blkio_group. Should be called
+ * from sleepable context as alloc_per_cpu() requires that.
+ */
+int blkio_alloc_blkg_stats(struct blkio_group *blkg)
+{
+	/* Allocate memory for per cpu stats */
+	blkg->stats_cpu = alloc_percpu(struct blkio_group_stats_cpu);
+	if (!blkg->stats_cpu)
+		return -ENOMEM;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(blkio_alloc_blkg_stats);
+
 void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
 		struct blkio_group *blkg, void *key, dev_t dev,
 		enum blkio_policy_id plid)
@@ -600,6 +616,53 @@ static uint64_t blkio_fill_stat(char *str, int chars_left, uint64_t val,
 	return val;
 }
 
+
+static uint64_t blkio_read_stat_cpu(struct blkio_group *blkg,
+			enum stat_type_cpu type, enum stat_sub_type sub_type)
+{
+	int cpu;
+	struct blkio_group_stats_cpu *stats_cpu;
+	uint64_t val = 0;
+
+	for_each_possible_cpu(cpu) {
+		stats_cpu  = per_cpu_ptr(blkg->stats_cpu, cpu);
+
+		if (type == BLKIO_STAT_CPU_SECTORS)
+			val += stats_cpu->sectors;
+		else
+			val += stats_cpu->stat_arr_cpu[type][sub_type];
+	}
+
+	return val;
+}
+
+static uint64_t blkio_get_stat_cpu(struct blkio_group *blkg,
+		struct cgroup_map_cb *cb, dev_t dev, enum stat_type_cpu type)
+{
+	uint64_t disk_total, val;
+	char key_str[MAX_KEY_LEN];
+	enum stat_sub_type sub_type;
+
+	if (type == BLKIO_STAT_CPU_SECTORS) {
+		val = blkio_read_stat_cpu(blkg, type, 0);
+		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1, val, cb, dev);
+	}
+
+	for (sub_type = BLKIO_STAT_READ; sub_type < BLKIO_STAT_TOTAL;
+			sub_type++) {
+		blkio_get_key_name(sub_type, dev, key_str, MAX_KEY_LEN, false);
+		val = blkio_read_stat_cpu(blkg, type, sub_type);
+		cb->fill(cb, key_str, val);
+	}
+
+	disk_total = blkio_read_stat_cpu(blkg, type, BLKIO_STAT_READ) +
+			blkio_read_stat_cpu(blkg, type, BLKIO_STAT_WRITE);
+
+	blkio_get_key_name(BLKIO_STAT_TOTAL, dev, key_str, MAX_KEY_LEN, false);
+	cb->fill(cb, key_str, disk_total);
+	return disk_total;
+}
+
 /* This should be called with blkg->stats_lock held */
 static uint64_t blkio_get_stat(struct blkio_group *blkg,
 		struct cgroup_map_cb *cb, dev_t dev, enum stat_type type)
@@ -611,9 +674,6 @@ static uint64_t blkio_get_stat(struct blkio_group *blkg,
 	if (type == BLKIO_STAT_TIME)
 		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
 					blkg->stats.time, cb, dev);
-	if (type == BLKIO_STAT_SECTORS)
-		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
-					blkg->stats.sectors, cb, dev);
 #ifdef CONFIG_DEBUG_BLK_CGROUP
 	if (type == BLKIO_STAT_UNACCOUNTED_TIME)
 		return blkio_fill_stat(key_str, MAX_KEY_LEN - 1,
@@ -1077,8 +1137,8 @@ static int blkiocg_file_read(struct cgroup *cgrp, struct cftype *cft,
 }
 
 static int blkio_read_blkg_stats(struct blkio_cgroup *blkcg,
-		struct cftype *cft, struct cgroup_map_cb *cb, enum stat_type type,
-		bool show_total)
+		struct cftype *cft, struct cgroup_map_cb *cb,
+		enum stat_type type, bool show_total, bool pcpu)
 {
 	struct blkio_group *blkg;
 	struct hlist_node *n;
@@ -1089,10 +1149,15 @@ static int blkio_read_blkg_stats(struct blkio_cgroup *blkcg,
 		if (blkg->dev) {
 			if (!cftype_blkg_same_policy(cft, blkg))
 				continue;
-			spin_lock_irq(&blkg->stats_lock);
-			cgroup_total += blkio_get_stat(blkg, cb, blkg->dev,
-						type);
-			spin_unlock_irq(&blkg->stats_lock);
+			if (pcpu)
+				cgroup_total += blkio_get_stat_cpu(blkg, cb,
+						blkg->dev, type);
+			else {
+				spin_lock_irq(&blkg->stats_lock);
+				cgroup_total += blkio_get_stat(blkg, cb,
+						blkg->dev, type);
+				spin_unlock_irq(&blkg->stats_lock);
+			}
 		}
 	}
 	if (show_total)
@@ -1116,47 +1181,47 @@ static int blkiocg_file_read_map(struct cgroup *cgrp, struct cftype *cft,
 		switch(name) {
 		case BLKIO_PROP_time:
 			return blkio_read_blkg_stats(blkcg, cft, cb,
-						BLKIO_STAT_TIME, 0);
+						BLKIO_STAT_TIME, 0, 0);
 		case BLKIO_PROP_sectors:
 			return blkio_read_blkg_stats(blkcg, cft, cb,
-						BLKIO_STAT_SECTORS, 0);
+						BLKIO_STAT_CPU_SECTORS, 0, 1);
 		case BLKIO_PROP_io_service_bytes:
 			return blkio_read_blkg_stats(blkcg, cft, cb,
-						BLKIO_STAT_SERVICE_BYTES, 1);
+					BLKIO_STAT_CPU_SERVICE_BYTES, 1, 1);
 		case BLKIO_PROP_io_serviced:
 			return blkio_read_blkg_stats(blkcg, cft, cb,
-						BLKIO_STAT_SERVICED, 1);
+						BLKIO_STAT_CPU_SERVICED, 1, 1);
 		case BLKIO_PROP_io_service_time:
 			return blkio_read_blkg_stats(blkcg, cft, cb,
-						BLKIO_STAT_SERVICE_TIME, 1);
+						BLKIO_STAT_SERVICE_TIME, 1, 0);
 		case BLKIO_PROP_io_wait_time:
 			return blkio_read_blkg_stats(blkcg, cft, cb,
-						BLKIO_STAT_WAIT_TIME, 1);
+						BLKIO_STAT_WAIT_TIME, 1, 0);
 		case BLKIO_PROP_io_merged:
 			return blkio_read_blkg_stats(blkcg, cft, cb,
-						BLKIO_STAT_MERGED, 1);
+						BLKIO_STAT_MERGED, 1, 0);
 		case BLKIO_PROP_io_queued:
 			return blkio_read_blkg_stats(blkcg, cft, cb,
-						BLKIO_STAT_QUEUED, 1);
+						BLKIO_STAT_QUEUED, 1, 0);
 #ifdef CONFIG_DEBUG_BLK_CGROUP
 		case BLKIO_PROP_unaccounted_time:
 			return blkio_read_blkg_stats(blkcg, cft, cb,
-						BLKIO_STAT_UNACCOUNTED_TIME, 0);
+					BLKIO_STAT_UNACCOUNTED_TIME, 0, 0);
 		case BLKIO_PROP_dequeue:
 			return blkio_read_blkg_stats(blkcg, cft, cb,
-						BLKIO_STAT_DEQUEUE, 0);
+						BLKIO_STAT_DEQUEUE, 0, 0);
 		case BLKIO_PROP_avg_queue_size:
 			return blkio_read_blkg_stats(blkcg, cft, cb,
-						BLKIO_STAT_AVG_QUEUE_SIZE, 0);
+					BLKIO_STAT_AVG_QUEUE_SIZE, 0, 0);
 		case BLKIO_PROP_group_wait_time:
 			return blkio_read_blkg_stats(blkcg, cft, cb,
-						BLKIO_STAT_GROUP_WAIT_TIME, 0);
+					BLKIO_STAT_GROUP_WAIT_TIME, 0, 0);
 		case BLKIO_PROP_idle_time:
 			return blkio_read_blkg_stats(blkcg, cft, cb,
-						BLKIO_STAT_IDLE_TIME, 0);
+						BLKIO_STAT_IDLE_TIME, 0, 0);
 		case BLKIO_PROP_empty_time:
 			return blkio_read_blkg_stats(blkcg, cft, cb,
-						BLKIO_STAT_EMPTY_TIME, 0);
+						BLKIO_STAT_EMPTY_TIME, 0, 0);
 #endif
 		default:
 			BUG();
@@ -1166,10 +1231,10 @@ static int blkiocg_file_read_map(struct cgroup *cgrp, struct cftype *cft,
 		switch(name){
 		case BLKIO_THROTL_io_service_bytes:
 			return blkio_read_blkg_stats(blkcg, cft, cb,
-						BLKIO_STAT_SERVICE_BYTES, 1);
+						BLKIO_STAT_CPU_SERVICE_BYTES, 1, 1);
 		case BLKIO_THROTL_io_serviced:
 			return blkio_read_blkg_stats(blkcg, cft, cb,
-						BLKIO_STAT_SERVICED, 1);
+						BLKIO_STAT_CPU_SERVICED, 1, 1);
 		default:
 			BUG();
 		}
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 63f1ef4..fd730a2 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -36,10 +36,6 @@ enum stat_type {
 	 * request completion for IOs doen by this cgroup. This may not be
 	 * accurate when NCQ is turned on. */
 	BLKIO_STAT_SERVICE_TIME = 0,
-	/* Total bytes transferred */
-	BLKIO_STAT_SERVICE_BYTES,
-	/* Total IOs serviced, post merge */
-	BLKIO_STAT_SERVICED,
 	/* Total time spent waiting in scheduler queue in ns */
 	BLKIO_STAT_WAIT_TIME,
 	/* Number of IOs merged */
@@ -48,7 +44,6 @@ enum stat_type {
 	BLKIO_STAT_QUEUED,
 	/* All the single valued stats go below this */
 	BLKIO_STAT_TIME,
-	BLKIO_STAT_SECTORS,
 #ifdef CONFIG_DEBUG_BLK_CGROUP
 	/* Time not charged to this cgroup */
 	BLKIO_STAT_UNACCOUNTED_TIME,
@@ -60,6 +55,16 @@ enum stat_type {
 #endif
 };
 
+/* Per cpu stats */
+enum stat_type_cpu {
+	BLKIO_STAT_CPU_SECTORS,
+	/* Total bytes transferred */
+	BLKIO_STAT_CPU_SERVICE_BYTES,
+	/* Total IOs serviced, post merge */
+	BLKIO_STAT_CPU_SERVICED,
+	BLKIO_STAT_CPU_NR
+};
+
 enum stat_sub_type {
 	BLKIO_STAT_READ = 0,
 	BLKIO_STAT_WRITE,
@@ -116,7 +121,6 @@ struct blkio_cgroup {
 struct blkio_group_stats {
 	/* total disk time and nr sectors dispatched by this group */
 	uint64_t time;
-	uint64_t sectors;
 	uint64_t stat_arr[BLKIO_STAT_QUEUED + 1][BLKIO_STAT_TOTAL];
 #ifdef CONFIG_DEBUG_BLK_CGROUP
 	/* Time not charged to this cgroup */
@@ -146,6 +150,12 @@ struct blkio_group_stats {
 #endif
 };
 
+/* Per cpu blkio group stats */
+struct blkio_group_stats_cpu {
+	uint64_t sectors;
+	uint64_t stat_arr_cpu[BLKIO_STAT_CPU_NR][BLKIO_STAT_TOTAL];
+};
+
 struct blkio_group {
 	/* An rcu protected unique identifier for the group */
 	void *key;
@@ -161,6 +171,8 @@ struct blkio_group {
 	/* Need to serialize the stats in the case of reset/update */
 	spinlock_t stats_lock;
 	struct blkio_group_stats stats;
+	/* Per cpu stats pointer */
+	struct blkio_group_stats_cpu __percpu *stats_cpu;
 };
 
 struct blkio_policy_node {
@@ -296,6 +308,7 @@ extern struct blkio_cgroup *task_blkio_cgroup(struct task_struct *tsk);
 extern void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
 	struct blkio_group *blkg, void *key, dev_t dev,
 	enum blkio_policy_id plid);
+extern int blkio_alloc_blkg_stats(struct blkio_group *blkg);
 extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
 extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
 						void *key);
@@ -323,6 +336,8 @@ static inline void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
 		struct blkio_group *blkg, void *key, dev_t dev,
 		enum blkio_policy_id plid) {}
 
+static inline int blkio_alloc_blkg_stats(struct blkio_group *blkg) { return 0; }
+
 static inline int
 blkiocg_del_blkio_group(struct blkio_group *blkg) { return 0; }
 
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 90ad407..c29a5a8 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -158,6 +158,7 @@ static void throtl_free_tg(struct rcu_head *head)
 	struct throtl_grp *tg;
 
 	tg = container_of(head, struct throtl_grp, rcu_head);
+	free_percpu(tg->blkg.stats_cpu);
 	kfree(tg);
 }
 
@@ -249,11 +250,19 @@ static void throtl_init_add_tg_lists(struct throtl_data *td,
 static struct throtl_grp *throtl_alloc_tg(struct throtl_data *td)
 {
 	struct throtl_grp *tg = NULL;
+	int ret;
 
 	tg = kzalloc_node(sizeof(*tg), GFP_ATOMIC, td->queue->node);
 	if (!tg)
 		return NULL;
 
+	ret = blkio_alloc_blkg_stats(&tg->blkg);
+
+	if (ret) {
+		kfree(tg);
+		return NULL;
+	}
+
 	throtl_init_group(tg);
 	return tg;
 }
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 606020f..d646b27 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1051,7 +1051,7 @@ static void cfq_init_add_cfqg_lists(struct cfq_data *cfqd,
 static struct cfq_group * cfq_alloc_cfqg(struct cfq_data *cfqd)
 {
 	struct cfq_group *cfqg = NULL;
-	int i, j;
+	int i, j, ret;
 	struct cfq_rb_root *st;
 
 	cfqg = kzalloc_node(sizeof(*cfqg), GFP_ATOMIC, cfqd->queue->node);
@@ -1069,6 +1069,13 @@ static struct cfq_group * cfq_alloc_cfqg(struct cfq_data *cfqd)
 	 * or cgroup deletion path depending on who is exiting first.
 	 */
 	cfqg->ref = 1;
+
+	ret = blkio_alloc_blkg_stats(&cfqg->blkg);
+	if (ret) {
+		kfree(cfqg);
+		return NULL;
+	}
+
 	return cfqg;
 }
 
@@ -1183,6 +1190,7 @@ static void cfq_put_cfqg(struct cfq_group *cfqg)
 		return;
 	for_each_cfqg_st(cfqg, i, j, st)
 		BUG_ON(!RB_EMPTY_ROOT(&st->rb));
+	free_percpu(cfqg->blkg.stats_cpu);
 	kfree(cfqg);
 }
 
@@ -3995,7 +4003,15 @@ static void *cfq_init_queue(struct request_queue *q)
 	 * throtl_data goes away.
 	 */
 	cfqg->ref = 2;
+
+	if (blkio_alloc_blkg_stats(&cfqg->blkg)) {
+		kfree(cfqg);
+		kfree(cfqd);
+		return NULL;
+	}
+
 	rcu_read_lock();
+
 	cfq_blkiocg_add_blkio_group(&blkio_root_cgroup, &cfqg->blkg,
 					(void *)cfqd, 0);
 	rcu_read_unlock();
-- 
1.7.4.4



* [PATCH 11/13] blk-cgroup: Make 64bit per cpu stats safe on 32bit arch
  2011-05-19 19:38 [PATCH 00/13] blk-throttle: lockless bio processing for no throttle rule group [V2] Vivek Goyal
                   ` (9 preceding siblings ...)
  2011-05-19 19:38 ` [PATCH 10/13] blk-throttle: Make dispatch stats per cpu Vivek Goyal
@ 2011-05-19 19:38 ` Vivek Goyal
  2011-05-19 19:38 ` [PATCH 12/13] blk-cgroup: Make cgroup stat reset path blkg->lock free for dispatch stats Vivek Goyal
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Vivek Goyal @ 2011-05-19 19:38 UTC (permalink / raw)
  To: linux-kernel, jaxboe, axboe; +Cc: dpshah, vgoyal

Some of the stats are 64bit and updating them will not be atomic on a
32bit architecture. Use sequence counters on 32bit arches to make
reading of the stats safe.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c |   27 ++++++++++++++++++++++-----
 block/blk-cgroup.h |    2 ++
 2 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 34bfcef..3622518e 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -400,14 +400,25 @@ void blkiocg_update_dispatch_stats(struct blkio_group *blkg,
 				uint64_t bytes, bool direction, bool sync)
 {
 	struct blkio_group_stats_cpu *stats_cpu;
+	unsigned long flags;
+
+	/*
+	 * Disabling interrupts to provide mutual exclusion between two
+	 * writes on same cpu. It probably is not needed for 64bit. Not
+	 * optimizing that case yet.
+	 */
+	local_irq_save(flags);
 
 	stats_cpu = this_cpu_ptr(blkg->stats_cpu);
 
+	u64_stats_update_begin(&stats_cpu->syncp);
 	stats_cpu->sectors += bytes >> 9;
 	blkio_add_stat(stats_cpu->stat_arr_cpu[BLKIO_STAT_CPU_SERVICED],
 			1, direction, sync);
 	blkio_add_stat(stats_cpu->stat_arr_cpu[BLKIO_STAT_CPU_SERVICE_BYTES],
 			bytes, direction, sync);
+	u64_stats_update_end(&stats_cpu->syncp);
+	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_dispatch_stats);
 
@@ -622,15 +633,21 @@ static uint64_t blkio_read_stat_cpu(struct blkio_group *blkg,
 {
 	int cpu;
 	struct blkio_group_stats_cpu *stats_cpu;
-	uint64_t val = 0;
+	u64 val = 0, tval;
 
 	for_each_possible_cpu(cpu) {
+		unsigned int start;
 		stats_cpu  = per_cpu_ptr(blkg->stats_cpu, cpu);
 
-		if (type == BLKIO_STAT_CPU_SECTORS)
-			val += stats_cpu->sectors;
-		else
-			val += stats_cpu->stat_arr_cpu[type][sub_type];
+		do {
+			start = u64_stats_fetch_begin(&stats_cpu->syncp);
+			if (type == BLKIO_STAT_CPU_SECTORS)
+				tval = stats_cpu->sectors;
+			else
+				tval = stats_cpu->stat_arr_cpu[type][sub_type];
+		} while (u64_stats_fetch_retry(&stats_cpu->syncp, start));
+
+		val += tval;
 	}
 
 	return val;
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index fd730a2..2622267 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -14,6 +14,7 @@
  */
 
 #include <linux/cgroup.h>
+#include <linux/u64_stats_sync.h>
 
 enum blkio_policy_id {
 	BLKIO_POLICY_PROP = 0,		/* Proportional Bandwidth division */
@@ -154,6 +155,7 @@ struct blkio_group_stats {
 struct blkio_group_stats_cpu {
 	uint64_t sectors;
 	uint64_t stat_arr_cpu[BLKIO_STAT_CPU_NR][BLKIO_STAT_TOTAL];
+	struct u64_stats_sync syncp;
 };
 
 struct blkio_group {
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 12/13] blk-cgroup: Make cgroup stat reset path blkg->lock free for dispatch stats
  2011-05-19 19:38 [PATCH 00/13] blk-throttle: lockless bio processing for no throttle rule group [V2] Vivek Goyal
                   ` (10 preceding siblings ...)
  2011-05-19 19:38 ` [PATCH 11/13] blk-cgroup: Make 64bit per cpu stats safe on 32bit arch Vivek Goyal
@ 2011-05-19 19:38 ` Vivek Goyal
  2011-05-19 19:38 ` [PATCH 13/13] blk-throttle: Make no throttling rule group processing lockless Vivek Goyal
  2011-05-19 19:58 ` [PATCH 00/13] blk-throttle: lockless bio processing for no throttle rule group [V2] Vivek Goyal
  13 siblings, 0 replies; 15+ messages in thread
From: Vivek Goyal @ 2011-05-19 19:38 UTC (permalink / raw)
  To: linux-kernel, jaxboe, axboe; +Cc: dpshah, vgoyal

Now the dispatch stats update is lock free. But resetting these stats still
takes blkg->stats_lock and depends on it. As the stats are per cpu, we should
be able to just reset them on each cpu without any locks (at least on 64bit
arch).

On 32bit arch there is a small race because 64bit updates are not atomic.
The result of this race is that, in the presence of other writers, one might
not read a 0 value right after resetting a stat and might see something
intermediate instead.

One could write more complicated code to cover this race, like sending IPIs
to other cpus to reset their stats and resetting the stats of offline cpus
directly.

Right now I am not taking that path because stats reset is more of a debug
feature, the race can happen only on 32bit arch, and the possibility of it
happening is small. Will fix it if it becomes a real problem. For the time
being, going for code simplicity.
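
For completeness, the IPI based alternative mentioned above would look roughly
like the sketch below. The helper names are illustrative and this is not part
of the patch; it assumes the per cpu stats structures introduced earlier in
the series, and stats of offline cpus would still need a direct reset.

#include <linux/smp.h>
#include <linux/percpu.h>

/* Sketch only: each online cpu clears its own counters, so no remote
 * 64bit store can be observed half done by a local reader or writer. */
static void blkio_reset_stats_on_cpu(void *arg)
{
	struct blkio_group *blkg = arg;
	struct blkio_group_stats_cpu *stats_cpu = this_cpu_ptr(blkg->stats_cpu);
	int j, k;

	stats_cpu->sectors = 0;
	for (j = 0; j < BLKIO_STAT_CPU_NR; j++)
		for (k = 0; k < BLKIO_STAT_TOTAL; k++)
			stats_cpu->stat_arr_cpu[j][k] = 0;
}

static void blkio_reset_stats_cpu_via_ipi(struct blkio_group *blkg)
{
	/* wait=1: do not return until every online cpu has run the reset */
	on_each_cpu(blkio_reset_stats_on_cpu, blkg, 1);

	/* Offline cpus keep stale values and would have to be cleared
	 * directly here, which is part of what makes this path more
	 * complicated than the simple loop used in this patch. */
}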

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c |   28 ++++++++++++++++++++++++++++
 1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 3622518e..e41cc6f 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -537,6 +537,30 @@ struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key)
 }
 EXPORT_SYMBOL_GPL(blkiocg_lookup_group);
 
+static void blkio_reset_stats_cpu(struct blkio_group *blkg)
+{
+	struct blkio_group_stats_cpu *stats_cpu;
+	int i, j, k;
+	/*
+	 * Note: On 64 bit arch this should not be an issue. This has the
+	 * possibility of returning some inconsistent value on 32bit arch
+	 * as 64bit update on 32bit is non atomic. Taking care of this
+	 * corner case makes code very complicated, like sending IPIs to
+	 * cpus, taking care of stats of offline cpus etc.
+	 *
+	 * Reset of stats is anyway more of a debug feature and this sounds
+	 * like a corner case. So I am not complicating the code until and
+	 * unless this becomes a real issue.
+	 */
+	for_each_possible_cpu(i) {
+		stats_cpu = per_cpu_ptr(blkg->stats_cpu, i);
+		stats_cpu->sectors = 0;
+		for (j = 0; j < BLKIO_STAT_CPU_NR; j++)
+			for (k = 0; k < BLKIO_STAT_TOTAL; k++)
+				stats_cpu->stat_arr_cpu[j][k] = 0;
+	}
+}
+
 static int
 blkiocg_reset_stats(struct cgroup *cgroup, struct cftype *cftype, u64 val)
 {
@@ -581,7 +605,11 @@ blkiocg_reset_stats(struct cgroup *cgroup, struct cftype *cftype, u64 val)
 		}
 #endif
 		spin_unlock(&blkg->stats_lock);
+
+		/* Reset Per cpu stats which don't take blkg->stats_lock */
+		blkio_reset_stats_cpu(blkg);
 	}
+
 	spin_unlock_irq(&blkcg->lock);
 	return 0;
 }
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 13/13] blk-throttle: Make no throttling rule group processing lockless
  2011-05-19 19:38 [PATCH 00/13] blk-throttle: lockless bio processing for no throttle rule group [V2] Vivek Goyal
                   ` (11 preceding siblings ...)
  2011-05-19 19:38 ` [PATCH 12/13] blk-cgroup: Make cgroup stat reset path blkg->lock free for dispatch stats Vivek Goyal
@ 2011-05-19 19:38 ` Vivek Goyal
  2011-05-19 19:58 ` [PATCH 00/13] blk-throttle: lockless bio processing for no throttle rule group [V2] Vivek Goyal
  13 siblings, 0 replies; 15+ messages in thread
From: Vivek Goyal @ 2011-05-19 19:38 UTC (permalink / raw)
  To: linux-kernel, jaxboe, axboe; +Cc: dpshah, vgoyal

Currently we take the queue lock on each bio to check whether there are any
throttling rules associated with the group and also to update the stats.
Now access the group under rcu and update the stats without taking the
queue lock. The queue lock is taken only if there are throttling rules
associated with the group.

So the common case of the root group with no rules is saved the unnecessary
pounding of the request queue lock.
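
The general shape of the fast/slow path split is sketched below with
hypothetical names (it is not the patch code itself, which follows in the
diff): look up the group under rcu, account lockless per cpu if no rule is
set, and fall back to the queue lock otherwise.

#include <linux/types.h>
#include <linux/rcupdate.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>

struct grp {
	u64 limit;				/* -1 means "no rule set" */
	unsigned long __percpu *nr_dispatched;	/* lockless per cpu stat */
};

/* Returns true if the bio was accounted on the lockless fast path. */
static bool grp_dispatch_lockless(struct grp __rcu **grpp, spinlock_t *queue_lock)
{
	struct grp *g;

	rcu_read_lock();
	g = rcu_dereference(*grpp);
	if (g && g->limit == (u64)-1) {
		/* Fast path: no rule for this group, bump the per cpu
		 * stat and let the bio go without the queue lock. */
		this_cpu_inc(*g->nr_dispatched);
		rcu_read_unlock();
		return true;
	}
	rcu_read_unlock();

	/* Slow path: a rule exists (or no group yet), take the lock. */
	spin_lock_irq(queue_lock);
	/* ... charge the bio against the limit or queue it ... */
	spin_unlock_irq(queue_lock);
	return false;
}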

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-throttle.c |   53 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index c29a5a8..a62be8d 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -229,6 +229,22 @@ __throtl_tg_fill_dev_details(struct throtl_data *td, struct throtl_grp *tg)
 	}
 }
 
+/*
+ * Should be called without queue lock held. Here the queue lock will be
+ * taken rarely. It will be taken only once during the lifetime of a group,
+ * if need be.
+ */
+static void
+throtl_tg_fill_dev_details(struct throtl_data *td, struct throtl_grp *tg)
+{
+	if (!tg || tg->blkg.dev)
+		return;
+
+	spin_lock_irq(td->queue->queue_lock);
+	__throtl_tg_fill_dev_details(td, tg);
+	spin_unlock_irq(td->queue->queue_lock);
+}
+
 static void throtl_init_add_tg_lists(struct throtl_data *td,
 			struct throtl_grp *tg, struct blkio_cgroup *blkcg)
 {
@@ -666,6 +682,12 @@ static bool tg_with_in_bps_limit(struct throtl_data *td, struct throtl_grp *tg,
 	return 0;
 }
 
+static bool tg_no_rule_group(struct throtl_grp *tg, bool rw) {
+	if (tg->bps[rw] == -1 && tg->iops[rw] == -1)
+		return 1;
+	return 0;
+}
+
 /*
  * Returns whether one can dispatch a bio or not. Also returns approx number
  * of jiffies to wait before this bio is with-in IO rate and can be dispatched
@@ -730,10 +752,6 @@ static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
 	tg->bytes_disp[rw] += bio->bi_size;
 	tg->io_disp[rw]++;
 
-	/*
-	 * TODO: This will take blkg->stats_lock. Figure out a way
-	 * to avoid this cost.
-	 */
 	blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size, rw, sync);
 }
 
@@ -1111,12 +1129,39 @@ int blk_throtl_bio(struct request_queue *q, struct bio **biop)
 	struct throtl_grp *tg;
 	struct bio *bio = *biop;
 	bool rw = bio_data_dir(bio), update_disptime = true;
+	struct blkio_cgroup *blkcg;
 
 	if (bio->bi_rw & REQ_THROTTLED) {
 		bio->bi_rw &= ~REQ_THROTTLED;
 		return 0;
 	}
 
+	/*
+	 * A throtl_grp pointer retrieved under rcu can be used to access
+	 * basic fields like stats and io rates. If a group has no rules,
+	 * just update the dispatch stats in lockless manner and return.
+	 */
+
+	rcu_read_lock();
+	blkcg = task_blkio_cgroup(current);
+	tg = throtl_find_tg(td, blkcg);
+	if (tg) {
+		throtl_tg_fill_dev_details(td, tg);
+
+		if (tg_no_rule_group(tg, rw)) {
+			blkiocg_update_dispatch_stats(&tg->blkg, bio->bi_size,
+					rw, bio->bi_rw & REQ_SYNC);
+			rcu_read_unlock();
+			return 0;
+		}
+	}
+	rcu_read_unlock();
+
+	/*
+	 * Either group has not been allocated yet or it is not an unlimited
+	 * IO group
+	 */
+
 	spin_lock_irq(q->queue_lock);
 	tg = throtl_get_tg(td);
 
-- 
1.7.4.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 00/13] blk-throttle: lockless bio processing for no throttle rule group [V2]
  2011-05-19 19:38 [PATCH 00/13] blk-throttle: lockless bio processing for no throttle rule group [V2] Vivek Goyal
                   ` (12 preceding siblings ...)
  2011-05-19 19:38 ` [PATCH 13/13] blk-throttle: Make no throttling rule group processing lockless Vivek Goyal
@ 2011-05-19 19:58 ` Vivek Goyal
  13 siblings, 0 replies; 15+ messages in thread
From: Vivek Goyal @ 2011-05-19 19:58 UTC (permalink / raw)
  To: linux-kernel, jaxboe, axboe; +Cc: dpshah

On Thu, May 19, 2011 at 03:38:18PM -0400, Vivek Goyal wrote:
> Hi,
> 
> This is V2 of the patch. Changes from V1 are.
> 
> - Dropped first patch of the series which has already been merged.
> - Fixed couple of white space warnings.
> 
> Jens, I am sending this series on your both the ids. See if both produce
> warnings.
> 
> Block throttling code takes request queue lock for every incoming bio
> (blk_throtl_bio()). This is true even if there are no throttle rules in
> the group. This is a common case for root cgroup where distributions
> will have throttling support compiled in but a vast majority of users
> will not be specifying throttling rule.
> 
> This patch series tries to make bio processing lockless (no requeust
> queue lock), if there are no rules specified for the group. Once
> a bio is submitted, under rcu_read_lock() we search for the group,
> update the stats and release the rcu lock. request queue lock is taken
> only if there are throttling rules specified in the group.
> 
> I have made some of the dispatch stats per cpu so that these can be updated
> without taking request queue lock.

Also forgot to mention that the throttling path now does not take
blkg->stats_lock at all. Throttling updates only 3 stats, and all 3 of these
have been converted to per cpu, so we also got rid of the need to take
blkg->stats_lock.

Divyesh, can you have a look at the locking changes around the stats and see
if you have any concerns?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2011-05-19 19:59 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-19 19:38 [PATCH 00/13] blk-throttle: lockless bio processing for no throttle rule group [V2] Vivek Goyal
2011-05-19 19:38 ` [PATCH 01/13] blk-throttle: Do the new group initialization with the help of a function Vivek Goyal
2011-05-19 19:38 ` [PATCH 02/13] blk-cgroup: move some fields of unaccounted_time file under right config option Vivek Goyal
2011-05-19 19:38 ` [PATCH 03/13] cfq-iosched: Get rid of redundant function parameter "create" Vivek Goyal
2011-05-19 19:38 ` [PATCH 04/13] cfq-iosched: Fix a possible race with cfq cgroup removal code Vivek Goyal
2011-05-19 19:38 ` [PATCH 05/13] blk-cgroup: Allow sleeping while dynamically allocating a group Vivek Goyal
2011-05-19 19:38 ` [PATCH 06/13] blk-throttle: Dynamically allocate root group Vivek Goyal
2011-05-19 19:38 ` [PATCH 07/13] blk-throttle: Introduce a helper function to fill in device details Vivek Goyal
2011-05-19 19:38 ` [PATCH 08/13] blk-throttle: Use helper function to add root throtl group to lists Vivek Goyal
2011-05-19 19:38 ` [PATCH 09/13] blk-throttle: Free up a group only after one rcu grace period Vivek Goyal
2011-05-19 19:38 ` [PATCH 10/13] blk-throttle: Make dispatch stats per cpu Vivek Goyal
2011-05-19 19:38 ` [PATCH 11/13] blk-cgroup: Make 64bit per cpu stats safe on 32bit arch Vivek Goyal
2011-05-19 19:38 ` [PATCH 12/13] blk-cgroup: Make cgroup stat reset path blkg->lock free for dispatch stats Vivek Goyal
2011-05-19 19:38 ` [PATCH 13/13] blk-throttle: Make no throttling rule group processing lockless Vivek Goyal
2011-05-19 19:58 ` [PATCH 00/13] blk-throttle: lockless bio processing for no throttle rule group [V2] Vivek Goyal
