* [PATCHSET] blkcg: basic accounting and throttling fixes
@ 2017-11-12 22:26 ` Tejun Heo
  0 siblings, 0 replies; 41+ messages in thread
From: Tejun Heo @ 2017-11-12 22:26 UTC (permalink / raw)
  To: axboe; +Cc: shli, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro

blkcg was often getting basic accounting numbers wildly wrong, and
blk-throtl could throttle the same bio multiple times, leading to
significantly lower limits being enforced than configured.

This patchset contains the following seven patches to update request
cgroup membership tracking and fix the bugs.

 0001-blkcg-relocate-__blkg_release_rcu.patch
 0002-blkcg-use-percpu_ref-for-blkcg_gq-refcnt.patch
 0003-blkcg-associate-a-request-with-its-blkcg_gq-instead-.patch
 0004-blkcg-refactor-blkcg_gq-lookup-and-creation-in-blkcg.patch
 0005-blkcg-associate-blk-mq-requests-with-the-matching-bl.patch
 0006-blkcg-account-requests-instead-of-bios-for-request-b.patch
 0007-blk-throtl-don-t-throttle-the-same-IO-multiple-times.patch

0001-0005 update request cgroup membership tracking so that the
association is always available.

0006 makes blkcg account requests instead of bios, as bios don't really
have much to do with what's actually going on in the system.

0007 prevents blk-throttle from throttling the same IO multiple times.

The patches are also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-blkcg-fixes

diffstat follows.  Thanks.

 block/blk-cgroup.c         |   69 +++++++++++++----------
 block/blk-core.c           |   13 +++-
 block/blk-mq.c             |    9 ++-
 block/blk-mq.h             |    1 
 block/blk-throttle.c       |   10 ---
 include/linux/blk-cgroup.h |  134 ++++++++++++++++++++++++++++++++++++---------
 include/linux/blkdev.h     |    2 
 7 files changed, 169 insertions(+), 69 deletions(-)

--
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 1/7] blkcg: relocate __blkg_release_rcu()
  2017-11-12 22:26 ` Tejun Heo
  (?)
@ 2017-11-12 22:26 ` Tejun Heo
  2017-11-14 23:12     ` Shaohua Li
  -1 siblings, 1 reply; 41+ messages in thread
From: Tejun Heo @ 2017-11-12 22:26 UTC (permalink / raw)
  To: axboe
  Cc: shli, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro,
	Tejun Heo

Move __blkg_release_rcu() above blkg_alloc().  This is a pure code
reorganization to prepare for the switch to percpu_ref.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 46 +++++++++++++++++++++++-----------------------
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d3f56ba..6482be5 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -81,6 +81,29 @@ static void blkg_free(struct blkcg_gq *blkg)
 	kfree(blkg);
 }
 
+/*
+ * A group is RCU protected, but having an rcu lock does not mean that one
+ * can access all the fields of blkg and assume these are valid.  For
+ * example, don't try to follow throtl_data and request queue links.
+ *
+ * Having a reference to blkg under an rcu allows accesses to only values
+ * local to groups like group stats and group rate limits.
+ */
+void __blkg_release_rcu(struct rcu_head *rcu_head)
+{
+	struct blkcg_gq *blkg = container_of(rcu_head, struct blkcg_gq, rcu_head);
+
+	/* release the blkcg and parent blkg refs this blkg has been holding */
+	css_put(&blkg->blkcg->css);
+	if (blkg->parent)
+		blkg_put(blkg->parent);
+
+	wb_congested_put(blkg->wb_congested);
+
+	blkg_free(blkg);
+}
+EXPORT_SYMBOL_GPL(__blkg_release_rcu);
+
 /**
  * blkg_alloc - allocate a blkg
  * @blkcg: block cgroup the new blkg is associated with
@@ -378,29 +401,6 @@ static void blkg_destroy_all(struct request_queue *q)
 }
 
 /*
- * A group is RCU protected, but having an rcu lock does not mean that one
- * can access all the fields of blkg and assume these are valid.  For
- * example, don't try to follow throtl_data and request queue links.
- *
- * Having a reference to blkg under an rcu allows accesses to only values
- * local to groups like group stats and group rate limits.
- */
-void __blkg_release_rcu(struct rcu_head *rcu_head)
-{
-	struct blkcg_gq *blkg = container_of(rcu_head, struct blkcg_gq, rcu_head);
-
-	/* release the blkcg and parent blkg refs this blkg has been holding */
-	css_put(&blkg->blkcg->css);
-	if (blkg->parent)
-		blkg_put(blkg->parent);
-
-	wb_congested_put(blkg->wb_congested);
-
-	blkg_free(blkg);
-}
-EXPORT_SYMBOL_GPL(__blkg_release_rcu);
-
-/*
  * The next function used by blk_queue_for_each_rl().  It's a bit tricky
  * because the root blkg uses @q->root_rl instead of its own rl.
  */
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 2/7] blkcg: use percpu_ref for blkcg_gq->refcnt
  2017-11-12 22:26 ` Tejun Heo
  (?)
  (?)
@ 2017-11-12 22:26 ` Tejun Heo
  2017-11-14 23:12     ` Shaohua Li
  -1 siblings, 1 reply; 41+ messages in thread
From: Tejun Heo @ 2017-11-12 22:26 UTC (permalink / raw)
  To: axboe
  Cc: shli, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro,
	Tejun Heo

blkcg_gq->refcnt is an atomic_t.  This was due to the following two
reasons.

* When blkcg_gq was added, the percpu memory allocator didn't support
  allocations from !GFP_KERNEL contexts.  Because blkcg_gq's are
  created from IO issue paths, it couldn't use GFP_KERNEL allocations.

* A blkcg_gq represents the connection between a cgroup and a
  request_queue.  Because an in-flight bio already pins both, blkcg_gq
  didn't usually need explicit pinning, making the use of atomic_t
  acceptable albeit restrictive and fragile.

Now that the percpu allocator supports !GFP_KERNEL allocations,
there's no reason to keep using atomic_t refcnt.  This will allow
clean separation between bio and request layers, helping blkcg support
in blk-mq.
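
For reference, the percpu_ref lifecycle this conversion follows looks
roughly like the following (a simplified sketch, not part of the patch;
the struct, field and callback names match the diff below):

	/* allocation: start with one reference; blkg_release() runs once
	 * the ref has been killed and every reference has been put */
	if (percpu_ref_init(&blkg->refcnt, blkg_release, 0, gfp_mask)) {
		/* init failed, free the half-constructed blkg */
		kfree(blkg);
		return NULL;
	}

	/* hot paths: per-cpu get/put, no shared atomic to bounce */
	percpu_ref_get(&blkg->refcnt);
	percpu_ref_put(&blkg->refcnt);

	/* teardown (blkg_destroy()): switch to atomic mode and drop the
	 * initial reference; blkg_release() fires when the count hits zero */
	percpu_ref_kill(&blkg->refcnt);

	/* finally, in blkg_free(), before kfree()ing the blkg */
	percpu_ref_exit(&blkg->refcnt);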

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c         | 21 ++++++++++++++++-----
 include/linux/blk-cgroup.h | 13 ++++---------
 2 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 6482be5..60a4486 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -78,6 +78,7 @@ static void blkg_free(struct blkcg_gq *blkg)
 
 	blkg_rwstat_exit(&blkg->stat_ios);
 	blkg_rwstat_exit(&blkg->stat_bytes);
+	percpu_ref_exit(&blkg->refcnt);
 	kfree(blkg);
 }
 
@@ -89,7 +90,7 @@ static void blkg_free(struct blkcg_gq *blkg)
  * Having a reference to blkg under an rcu allows accesses to only values
  * local to groups like group stats and group rate limits.
  */
-void __blkg_release_rcu(struct rcu_head *rcu_head)
+static void blkg_release_rcu(struct rcu_head *rcu_head)
 {
 	struct blkcg_gq *blkg = container_of(rcu_head, struct blkcg_gq, rcu_head);
 
@@ -102,7 +103,13 @@ void __blkg_release_rcu(struct rcu_head *rcu_head)
 
 	blkg_free(blkg);
 }
-EXPORT_SYMBOL_GPL(__blkg_release_rcu);
+
+static void blkg_release(struct percpu_ref *refcnt)
+{
+	struct blkcg_gq *blkg = container_of(refcnt, struct blkcg_gq, refcnt);
+
+	call_rcu(&blkg->rcu_head, blkg_release_rcu);
+}
 
 /**
  * blkg_alloc - allocate a blkg
@@ -123,6 +130,11 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	if (!blkg)
 		return NULL;
 
+	if (percpu_ref_init(&blkg->refcnt, blkg_release, 0, gfp_mask)) {
+		kfree(blkg);
+		return NULL;
+	}
+
 	if (blkg_rwstat_init(&blkg->stat_bytes, gfp_mask) ||
 	    blkg_rwstat_init(&blkg->stat_ios, gfp_mask))
 		goto err_free;
@@ -130,7 +142,6 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	blkg->q = q;
 	INIT_LIST_HEAD(&blkg->q_node);
 	blkg->blkcg = blkcg;
-	atomic_set(&blkg->refcnt, 1);
 
 	/* root blkg uses @q->root_rl, init rl only for !root blkgs */
 	if (blkcg != &blkcg_root) {
@@ -266,7 +277,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 		return blkg;
 
 	/* @blkg failed fully initialized, use the usual release path */
-	blkg_put(blkg);
+	percpu_ref_kill(&blkg->refcnt);
 	return ERR_PTR(ret);
 
 err_put_congested:
@@ -373,7 +384,7 @@ static void blkg_destroy(struct blkcg_gq *blkg)
 	 * Put the reference taken at the time of creation so that when all
 	 * queues are gone, group can be destroyed.
 	 */
-	blkg_put(blkg);
+	percpu_ref_kill(&blkg->refcnt);
 }
 
 /**
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 8bbc371..c0d4736 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -19,7 +19,7 @@
 #include <linux/seq_file.h>
 #include <linux/radix-tree.h>
 #include <linux/blkdev.h>
-#include <linux/atomic.h>
+#include <linux/percpu-refcount.h>
 
 /* percpu_counter batch for blkg_[rw]stats, per-cpu drift doesn't matter */
 #define BLKG_STAT_CPU_BATCH	(INT_MAX / 2)
@@ -123,7 +123,7 @@ struct blkcg_gq {
 	struct request_list		rl;
 
 	/* reference count */
-	atomic_t			refcnt;
+	struct percpu_ref		refcnt;
 
 	/* is this blkg online? protected by both blkcg and q locks */
 	bool				online;
@@ -355,21 +355,16 @@ static inline int blkg_path(struct blkcg_gq *blkg, char *buf, int buflen)
  */
 static inline void blkg_get(struct blkcg_gq *blkg)
 {
-	WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
-	atomic_inc(&blkg->refcnt);
+	percpu_ref_get(&blkg->refcnt);
 }
 
-void __blkg_release_rcu(struct rcu_head *rcu);
-
 /**
  * blkg_put - put a blkg reference
  * @blkg: blkg to put
  */
 static inline void blkg_put(struct blkcg_gq *blkg)
 {
-	WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
-	if (atomic_dec_and_test(&blkg->refcnt))
-		call_rcu(&blkg->rcu_head, __blkg_release_rcu);
+	percpu_ref_put(&blkg->refcnt);
 }
 
 /**
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 3/7] blkcg: associate a request with its blkcg_gq instead of request_list
  2017-11-12 22:26 ` Tejun Heo
                   ` (2 preceding siblings ...)
  (?)
@ 2017-11-12 22:26 ` Tejun Heo
  2017-11-13 20:15     ` Tejun Heo
  -1 siblings, 1 reply; 41+ messages in thread
From: Tejun Heo @ 2017-11-12 22:26 UTC (permalink / raw)
  To: axboe
  Cc: shli, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro,
	Tejun Heo

On the legacy request_queue, each blkcg_gq has a dedicated
request_list that requests for the cgroup are allocated from.  Each
request points to the originating request_list and cgroup membership
can be determined by examining which cgroup the associated
request_list belongs to.

This doesn't work for blk-mq, which doesn't use request_list.  Let's
associate each request with the blkcg_gq directly.  Except for minor
wrapper changes, this doesn't affect the legacy request_queue and will
allow blk-mq to track each request's cgroup association.
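
In other words (an illustrative summary only; the field names are the
ones used in the diff below):

	/* before: cgroup membership is implied by the request_list the
	 * request was allocated from */
	blkg = rq->rl->blkg;

	/* after: the request points at its blkg directly; the request_list,
	 * still needed on the legacy path, is derived from the blkg */
	blkg = rq->blkg;
	rl = blkg->parent ? &blkg->rl : &rq->q->root_rl;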

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-mq.c             | 2 +-
 include/linux/blk-cgroup.h | 9 ++++++---
 include/linux/blkdev.h     | 2 +-
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 98a1860..af958c4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -306,7 +306,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->part = NULL;
 	rq->start_time = jiffies;
 #ifdef CONFIG_BLK_CGROUP
-	rq->rl = NULL;
+	rq->blkg = NULL;
 	set_start_time_ns(rq);
 	rq->io_start_time_ns = 0;
 #endif
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index c0d4736..b5721d32 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -463,7 +463,7 @@ static inline void blk_put_rl(struct request_list *rl)
  */
 static inline void blk_rq_set_rl(struct request *rq, struct request_list *rl)
 {
-	rq->rl = rl;
+	rq->blkg = rl->blkg;
 }
 
 /**
@@ -474,7 +474,10 @@ static inline void blk_rq_set_rl(struct request *rq, struct request_list *rl)
  */
 static inline struct request_list *blk_rq_rl(struct request *rq)
 {
-	return rq->rl;
+	if (!rq->blkg->parent)
+		return &rq->q->root_rl;
+	else
+		return &rq->blkg->rl;
 }
 
 struct request_list *__blk_queue_next_rl(struct request_list *rl,
@@ -762,7 +765,7 @@ static inline void blkg_put(struct blkcg_gq *blkg) { }
 static inline struct request_list *blk_get_rl(struct request_queue *q,
 					      struct bio *bio) { return &q->root_rl; }
 static inline void blk_put_rl(struct request_list *rl) { }
-static inline void blk_rq_set_rl(struct request *rq, struct request_list *rl) { }
+static inline void blk_rq_set_blkg(struct request *rq, struct blkcg_gq *blkg) { }
 static inline struct request_list *blk_rq_rl(struct request *rq) { return &rq->q->root_rl; }
 
 static inline bool blkcg_bio_issue_check(struct request_queue *q,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8da6637..6ec1067 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -206,7 +206,7 @@ struct request {
 	unsigned long start_time;
 	struct blk_issue_stat issue_stat;
 #ifdef CONFIG_BLK_CGROUP
-	struct request_list *rl;		/* rl this rq is alloced from */
+	struct blkcg_gq *blkg;			/* blkg of this rq */
 	unsigned long long start_time_ns;
 	unsigned long long io_start_time_ns;    /* when passed to hardware */
 #endif
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 4/7] blkcg: refactor blkcg_gq lookup and creation in blkcg_bio_issue_check()
  2017-11-12 22:26 ` Tejun Heo
                   ` (3 preceding siblings ...)
  (?)
@ 2017-11-12 22:26 ` Tejun Heo
  2017-11-14 23:18     ` Shaohua Li
  -1 siblings, 1 reply; 41+ messages in thread
From: Tejun Heo @ 2017-11-12 22:26 UTC (permalink / raw)
  To: axboe
  Cc: shli, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro,
	Tejun Heo

blkcg_bio_issue_check() open codes blkcg_gq lookup and creation using
blkg_lookup() and blkg_lookup_create().  Refactor the code so that

* blkg_lookup_create() is renamed to blkg_lookup_create_locked().

* blkg_lookup_create() is now a new function which encapsulates the
  RCU protected lookup and queue_lock protected creation.

* blkg_lookup_create() is guaranteed to return a non-NULL blkcg_gq.
  The NULL checks are removed from the users.

This is pure refactoring and doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c         |  6 +++---
 block/blk-throttle.c       |  2 +-
 include/linux/blk-cgroup.h | 32 +++++++++++++++++++++-----------
 3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 60a4486..490b0a6 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -290,7 +290,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 }
 
 /**
- * blkg_lookup_create - lookup blkg, try to create one if not there
+ * blkg_lookup_create_locked - lookup blkg, try to create one if not there
  * @blkcg: blkcg of interest
  * @q: request_queue of interest
  *
@@ -303,8 +303,8 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
  * value on error.  If @q is dead, returns ERR_PTR(-EINVAL).  If @q is not
  * dead and bypassing, returns ERR_PTR(-EBUSY).
  */
-struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
-				    struct request_queue *q)
+struct blkcg_gq *blkg_lookup_create_locked(struct blkcg *blkcg,
+					   struct request_queue *q)
 {
 	struct blkcg_gq *blkg;
 
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 8631763..1e6916b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2123,7 +2123,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 		    struct bio *bio)
 {
 	struct throtl_qnode *qn = NULL;
-	struct throtl_grp *tg = blkg_to_tg(blkg ?: q->root_blkg);
+	struct throtl_grp *tg = blkg_to_tg(blkg);
 	struct throtl_service_queue *sq;
 	bool rw = bio_data_dir(bio);
 	bool throttled = false;
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index b5721d32..1de5158 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -172,8 +172,8 @@ extern struct cgroup_subsys_state * const blkcg_root_css;
 
 struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg,
 				      struct request_queue *q, bool update_hint);
-struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
-				    struct request_queue *q);
+struct blkcg_gq *blkg_lookup_create_locked(struct blkcg *blkcg,
+					   struct request_queue *q);
 int blkcg_init_queue(struct request_queue *q);
 void blkcg_drain_queue(struct request_queue *q);
 void blkcg_exit_queue(struct request_queue *q);
@@ -680,6 +680,24 @@ static inline bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg
 				  struct bio *bio) { return false; }
 #endif
 
+static inline struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
+						  struct request_queue *q)
+{
+	struct blkcg_gq *blkg;
+
+	blkg = blkg_lookup(blkcg, q);
+	if (likely(blkg))
+		return blkg;
+
+	spin_lock_irq(q->queue_lock);
+	blkg = blkg_lookup_create_locked(blkcg, q);
+	spin_unlock_irq(q->queue_lock);
+	if (likely(!IS_ERR(blkg)))
+		return blkg;
+	else
+		return q->root_blkg;
+}
+
 static inline bool blkcg_bio_issue_check(struct request_queue *q,
 					 struct bio *bio)
 {
@@ -693,19 +711,11 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
 	/* associate blkcg if bio hasn't attached one */
 	bio_associate_blkcg(bio, &blkcg->css);
 
-	blkg = blkg_lookup(blkcg, q);
-	if (unlikely(!blkg)) {
-		spin_lock_irq(q->queue_lock);
-		blkg = blkg_lookup_create(blkcg, q);
-		if (IS_ERR(blkg))
-			blkg = NULL;
-		spin_unlock_irq(q->queue_lock);
-	}
+	blkg = blkg_lookup_create(blkcg, q);
 
 	throtl = blk_throtl_bio(q, blkg, bio);
 
 	if (!throtl) {
-		blkg = blkg ?: q->root_blkg;
 		blkg_rwstat_add(&blkg->stat_bytes, bio->bi_opf,
 				bio->bi_iter.bi_size);
 		blkg_rwstat_add(&blkg->stat_ios, bio->bi_opf, 1);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 5/7] blkcg: associate blk-mq requests with the matching blkcg_gqs
@ 2017-11-12 22:26   ` Tejun Heo
  0 siblings, 0 replies; 41+ messages in thread
From: Tejun Heo @ 2017-11-12 22:26 UTC (permalink / raw)
  To: axboe
  Cc: shli, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro,
	Tejun Heo

On legacy request_queues, request allocation through request_list
associates each request with its blkcg_gq.  However, blk-mq doesn't
use request_list and its requests aren't associated with the matching
blkcg_gqs, preventing cgroup-aware tracking of blk-mq IOs.

This patch adds blk_rq_[dis]associate_blkg() and uses them in blk-mq
request alloc/free paths so that every request is associated with its
blkcg_gq while in-flight.  The added overhead is minimal and per-cpu
in hot paths.  In the future, we also should be able to remove the
more frequent per-bio blkcg_gq operations in blkcg_bio_issue_check()
for blk-mq.

* ->blkcg is added to blk_mq_alloc_data to carry the target blkcg in
  the allocation path.  I didn't #ifdef it for code simplicity.  The
  only overhead when cgroup is disabled is the extra pointer in the
  data structure, which shouldn't matter.

* Moved set_start_time_ns() out of CONFIG_BLK_CGROUP together with
  blk_rq_associate_blkg().  Both functions provide dummy
  implementations when !CONFIG_BLK_CGROUP.
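
The resulting hook points in blk-mq look roughly like this (a simplified
call sketch; blk_mq_get_request() is from the surrounding blk-mq code,
the rest matches the diff below):

	blk_mq_make_request(q, bio)
	    data.blkcg = bio_blkcg(bio);	/* carried in blk_mq_alloc_data */
	    blk_mq_get_request(q, bio, ..., &data)
	        blk_mq_rq_ctx_init(data, ...)
	            blk_rq_associate_blkg(rq, data->blkcg);  /* percpu ref, except root */
	    ...

	blk_mq_free_request(rq)
	    blk_rq_disassociate_blkg(rq);	/* drop the reference taken above */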

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-mq.c             |  9 ++++++---
 block/blk-mq.h             |  1 +
 include/linux/blk-cgroup.h | 42 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index af958c4..7cc64de 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -305,9 +305,10 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->rq_disk = NULL;
 	rq->part = NULL;
 	rq->start_time = jiffies;
-#ifdef CONFIG_BLK_CGROUP
-	rq->blkg = NULL;
+
+	blk_rq_associate_blkg(rq, data->blkcg);
 	set_start_time_ns(rq);
+#ifdef CONFIG_BLK_CGROUP
 	rq->io_start_time_ns = 0;
 #endif
 	rq->nr_phys_segments = 0;
@@ -472,6 +473,8 @@ void blk_mq_free_request(struct request *rq)
 		}
 	}
 
+	blk_rq_disassociate_blkg(rq);
+
 	ctx->rq_completed[rq_is_sync(rq)]++;
 	if (rq->rq_flags & RQF_MQ_INFLIGHT)
 		atomic_dec(&hctx->nr_active);
@@ -1599,7 +1602,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 {
 	const int is_sync = op_is_sync(bio->bi_opf);
 	const int is_flush_fua = op_is_flush(bio->bi_opf);
-	struct blk_mq_alloc_data data = { .flags = 0 };
+	struct blk_mq_alloc_data data = { .flags = 0, .blkcg = bio_blkcg(bio) };
 	struct request *rq;
 	unsigned int request_count = 0;
 	struct blk_plug *plug;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 4933af9..dfb1e1d 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -111,6 +111,7 @@ struct blk_mq_alloc_data {
 	struct request_queue *q;
 	unsigned int flags;
 	unsigned int shallow_depth;
+	struct blkcg *blkcg;
 
 	/* input & output parameter */
 	struct blk_mq_ctx *ctx;
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 1de5158..96eed0f 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -725,6 +725,45 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
 	return !throtl;
 }
 
+/**
+ * blk_rq_associate_blkg - lookup and associate a request with its blkcg_gq
+ * @rq: request of interest
+ * @blkcg: target blkcg
+ *
+ * Associate @rq with the matching blkcg_cq of @blkcg.  This is used by
+ * blk-mq to associate requests directly with blkgs without going through
+ * request_list.
+ */
+static inline void blk_rq_associate_blkg(struct request *rq, struct blkcg *blkcg)
+{
+	struct request_queue *q = rq->q;
+	struct blkcg_gq *blkg;
+
+	if (!blkcg) {
+		rq->blkg = q->root_blkg;
+		return;
+	}
+
+	rcu_read_lock();
+	blkg = blkg_lookup_create(blkcg, q);
+	if (likely(blkg == q->root_blkg || percpu_ref_tryget(&blkg->refcnt)))
+		rq->blkg = blkg;
+	else
+		rq->blkg = q->root_blkg;
+	rcu_read_unlock();
+}
+
+/**
+ * blk_rq_disassociate_blkg - undo blk_rq_associate_blkg()
+ * @rq: request of interest
+ */
+static inline void blk_rq_disassociate_blkg(struct request *rq)
+{
+	if (rq->blkg && rq->blkg != rq->q->root_blkg)
+		blkg_put(rq->blkg);
+	rq->blkg = NULL;
+}
+
 #else	/* CONFIG_BLK_CGROUP */
 
 struct blkcg {
@@ -781,6 +820,9 @@ static inline struct request_list *blk_rq_rl(struct request *rq) { return &rq->q
 static inline bool blkcg_bio_issue_check(struct request_queue *q,
 					 struct bio *bio) { return true; }
 
+static inline void blk_rq_associate_blkg(struct request *rq, struct blkcg *blkcg) { }
+static inline void blk_rq_disassociate_blkg(struct request *rq) { }
+
 #define blk_queue_for_each_rl(rl, q)	\
 	for ((rl) = &(q)->root_rl; (rl); (rl) = NULL)
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 6/7] blkcg: account requests instead of bios for request based request_queues
  2017-11-12 22:26 ` Tejun Heo
                   ` (5 preceding siblings ...)
  (?)
@ 2017-11-12 22:26 ` Tejun Heo
  2017-11-14 23:23     ` Shaohua Li
  -1 siblings, 1 reply; 41+ messages in thread
From: Tejun Heo @ 2017-11-12 22:26 UTC (permalink / raw)
  To: axboe
  Cc: shli, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro,
	Tejun Heo

blkcg accounting is currently bio based, which is silly for request
based request_queues: the number of bios doesn't have much to do with
the actual number of IOs issued to the underlying device (it can be
significantly higher or lower) and may change depending on the
implementation details of how the bios are issued (e.g. from the
recent split-bios-while-issuing change).

request based request_queues have QUEUE_FLAG_IO_STAT set by default
which controls the gendisk accounting.  Do cgroup accounting for those
request_queues together with gendisk accounting on request completion.

This makes cgroup accounting consistent with gendisk accounting and
what's happening on the system.
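
The accounting points move roughly as follows (an illustrative call
summary of the hunks below, not additional code):

	/* before: charged per bio at issue time */
	blkcg_bio_issue_check()
	    blkg_rwstat_add(&blkg->stat_bytes, ...);
	    blkg_rwstat_add(&blkg->stat_ios, ...);

	/* after, for request based queues: charged per request at completion,
	 * right next to the gendisk (part_stat) accounting */
	blk_account_io_completion(req, bytes)
	    blkcg_account_io_completion(req, bytes);
	blk_account_io_done(req)
	    blkcg_account_io_done(req);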

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-core.c           |  3 +++
 include/linux/blk-cgroup.h | 18 +++++++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 048be4a..ad23b96 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2429,6 +2429,7 @@ void blk_account_io_completion(struct request *req, unsigned int bytes)
 		cpu = part_stat_lock();
 		part = req->part;
 		part_stat_add(cpu, part, sectors[rw], bytes >> 9);
+		blkcg_account_io_completion(req, bytes);
 		part_stat_unlock();
 	}
 }
@@ -2454,6 +2455,8 @@ void blk_account_io_done(struct request *req)
 		part_round_stats(req->q, cpu, part);
 		part_dec_in_flight(req->q, part, rw);
 
+		blkcg_account_io_done(req);
+
 		hd_struct_put(part);
 		part_stat_unlock();
 	}
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 96eed0f..f2f9691 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -715,7 +715,8 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
 
 	throtl = blk_throtl_bio(q, blkg, bio);
 
-	if (!throtl) {
+	/* if @q does io stat, blkcg stats are updated together with them */
+	if (!blk_queue_io_stat(q) && !throtl) {
 		blkg_rwstat_add(&blkg->stat_bytes, bio->bi_opf,
 				bio->bi_iter.bi_size);
 		blkg_rwstat_add(&blkg->stat_ios, bio->bi_opf, 1);
@@ -764,6 +765,17 @@ static inline void blk_rq_disassociate_blkg(struct request *rq)
 	rq->blkg = NULL;
 }
 
+static inline void blkcg_account_io_completion(struct request *rq,
+					       unsigned int bytes)
+{
+	blkg_rwstat_add(&rq->blkg->stat_bytes, rq_data_dir(rq), bytes);
+}
+
+static inline void blkcg_account_io_done(struct request *rq)
+{
+	blkg_rwstat_add(&rq->blkg->stat_ios, rq_data_dir(rq), 1);
+}
+
 #else	/* CONFIG_BLK_CGROUP */
 
 struct blkcg {
@@ -823,6 +835,10 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
 static inline void blk_rq_associate_blkg(struct request *rq, struct blkcg *blkcg) { }
 static inline void blk_rq_disassociate_blkg(struct request *rq) { }
 
+static inline void blkcg_account_io_completion(struct request *rq,
+					       unsigned int bytes) { }
+static inline void blkcg_account_io_done(struct request *rq) { }
+
 #define blk_queue_for_each_rl(rl, q)	\
 	for ((rl) = &(q)->root_rl; (rl); (rl) = NULL)
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 7/7] blk-throtl: don't throttle the same IO multiple times
  2017-11-12 22:26 ` Tejun Heo
                   ` (6 preceding siblings ...)
  (?)
@ 2017-11-12 22:26 ` Tejun Heo
  2017-11-13  4:07     ` Shaohua Li
  -1 siblings, 1 reply; 41+ messages in thread
From: Tejun Heo @ 2017-11-12 22:26 UTC (permalink / raw)
  To: axboe
  Cc: shli, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro,
	Tejun Heo

BIO_THROTTLED is used to mark already throttled bios so that a bio
doesn't get throttled multiple times.  The flag gets set when the bio
starts getting dispatched from blk-throtl and cleared when it leaves
blk-throtl.

Unfortunately, this doesn't work when the request_queue decides to
split or requeue the bio and ends up throttling the same IO multiple
times.  This condition gets easily triggered and often leads to
multiple times lower bandwidth limit being enforced than configured.

Fix it by always setting BIO_THROTTLED for bios recursing to the same
request_queue and clearing only when a bio leaves the current level.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-core.c           | 10 +++++++---
 block/blk-throttle.c       |  8 --------
 include/linux/blk-cgroup.h | 20 ++++++++++++++++++++
 3 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ad23b96..f0e3157 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2216,11 +2216,15 @@ blk_qc_t generic_make_request(struct bio *bio)
 			 */
 			bio_list_init(&lower);
 			bio_list_init(&same);
-			while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
-				if (q == bio->bi_disk->queue)
+			while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL) {
+				if (q == bio->bi_disk->queue) {
+					blkcg_bio_repeat_q_level(bio);
 					bio_list_add(&same, bio);
-				else
+				} else {
+					blkcg_bio_leave_q_level(bio);
 					bio_list_add(&lower, bio);
+				}
+			}
 			/* now assemble so we handle the lowest level first */
 			bio_list_merge(&bio_list_on_stack[0], &lower);
 			bio_list_merge(&bio_list_on_stack[0], &same);
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 1e6916b..76579b2 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2223,14 +2223,6 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 out_unlock:
 	spin_unlock_irq(q->queue_lock);
 out:
-	/*
-	 * As multiple blk-throtls may stack in the same issue path, we
-	 * don't want bios to leave with the flag set.  Clear the flag if
-	 * being issued.
-	 */
-	if (!throttled)
-		bio_clear_flag(bio, BIO_THROTTLED);
-
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
 	if (throttled || !td->track_bio_latency)
 		bio->bi_issue_stat.stat |= SKIP_LATENCY;
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index f2f9691..bed0416 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -675,9 +675,29 @@ static inline void blkg_rwstat_add_aux(struct blkg_rwstat *to,
 #ifdef CONFIG_BLK_DEV_THROTTLING
 extern bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 			   struct bio *bio);
+
+static inline void blkcg_bio_repeat_q_level(struct bio *bio)
+{
+	/*
+	 * @bio is queued while processing a previous bio which was already
+	 * throttled.  Don't throttle it again.
+	 */
+	bio_set_flag(bio, BIO_THROTTLED);
+}
+
+static inline void blkcg_bio_leave_q_level(struct bio *bio)
+{
+	/*
+	 * @bio may get throttled at multiple q levels, clear THROTTLED
+	 * when leaving the current one.
+	 */
+	bio_clear_flag(bio, BIO_THROTTLED);
+}
 #else
 static inline bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 				  struct bio *bio) { return false; }
+static inline void blkcg_bio_repeat_q_level(struct bio *bio) { }
+static inline void biocg_bio_leave_q_level(struct bio *bio) { }
 #endif
 
 static inline struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] blk-throtl: don't throttle the same IO multiple times
@ 2017-11-13  4:07     ` Shaohua Li
  0 siblings, 0 replies; 41+ messages in thread
From: Shaohua Li @ 2017-11-13  4:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro

On Sun, Nov 12, 2017 at 02:26:13PM -0800, Tejun Heo wrote:
> BIO_THROTTLED is used to mark already throttled bios so that a bio
> doesn't get throttled multiple times.  The flag gets set when the bio
> starts getting dispatched from blk-throtl and cleared when it leaves
> blk-throtl.
> 
> Unfortunately, this doesn't work when the request_queue decides to
> split or requeue the bio and ends up throttling the same IO multiple
> times.  This condition gets easily triggered and often leads to
> multiple times lower bandwidth limit being enforced than configured.
> 
> Fix it by always setting BIO_THROTTLED for bios recursing to the same
> request_queue and clearing only when a bio leaves the current level.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  block/blk-core.c           | 10 +++++++---
>  block/blk-throttle.c       |  8 --------
>  include/linux/blk-cgroup.h | 20 ++++++++++++++++++++
>  3 files changed, 27 insertions(+), 11 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index ad23b96..f0e3157 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2216,11 +2216,15 @@ blk_qc_t generic_make_request(struct bio *bio)
>  			 */
>  			bio_list_init(&lower);
>  			bio_list_init(&same);
> -			while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)
> -				if (q == bio->bi_disk->queue)
> +			while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL) {
> +				if (q == bio->bi_disk->queue) {
> +					blkcg_bio_repeat_q_level(bio);
>  					bio_list_add(&same, bio);
> -				else
> +				} else {
> +					blkcg_bio_leave_q_level(bio);
>  					bio_list_add(&lower, bio);
> +				}
> +			}

Hi Tejun,

Thanks for looking into this while I was absent. I don't understand how this
works. Assume a bio is split into 2 smaller bios. In generic_make_request,
we charge the whole bio. 'q->make_request_fn' will dispatch the first small
bio and call generic_make_request for the second small bio. Then
generic_make_request charges the second small bio and we add the second small
bio to current->bio_list[0] (please check the order). In the code the patch
changes above, we pop the second small bio and set BIO_THROTTLED for it. But
this is already too late, because generic_make_request has already charged
the second small bio.
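
Roughly, the order in the split case is (simplified call chain;
blk_queue_split() is the split point inside the driver's make_request_fn):

  generic_make_request(bio)
    generic_make_request_checks(bio)          <- blk_throtl_bio() charges the whole bio
    q->make_request_fn(q, bio)
      blk_queue_split(q, &bio)                <- bio becomes the 2nd half, split = 1st half
        generic_make_request(bio)             <- recursion for the 2nd half:
          generic_make_request_checks(bio)    <-   it gets charged AGAIN right here
          bio_list_add(&current->bio_list[0], bio)
                                              <- only now is it visible to the popping loop
      ... dispatch split (the 1st half) ...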

Did you look at my original patch
(https://marc.info/?l=linux-block&m=150791825327628&w=2), anything wrong?

Thanks,
Shaohua

>  			/* now assemble so we handle the lowest level first */
>  			bio_list_merge(&bio_list_on_stack[0], &lower);
>  			bio_list_merge(&bio_list_on_stack[0], &same);
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 1e6916b..76579b2 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -2223,14 +2223,6 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
>  out_unlock:
>  	spin_unlock_irq(q->queue_lock);
>  out:
> -	/*
> -	 * As multiple blk-throtls may stack in the same issue path, we
> -	 * don't want bios to leave with the flag set.  Clear the flag if
> -	 * being issued.
> -	 */
> -	if (!throttled)
> -		bio_clear_flag(bio, BIO_THROTTLED);
> -
>  #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
>  	if (throttled || !td->track_bio_latency)
>  		bio->bi_issue_stat.stat |= SKIP_LATENCY;
> diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
> index f2f9691..bed0416 100644
> --- a/include/linux/blk-cgroup.h
> +++ b/include/linux/blk-cgroup.h
> @@ -675,9 +675,29 @@ static inline void blkg_rwstat_add_aux(struct blkg_rwstat *to,
>  #ifdef CONFIG_BLK_DEV_THROTTLING
>  extern bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
>  			   struct bio *bio);
> +
> +static inline void blkcg_bio_repeat_q_level(struct bio *bio)
> +{
> +	/*
> +	 * @bio is queued while processing a previous bio which was already
> +	 * throttled.  Don't throttle it again.
> +	 */
> +	bio_set_flag(bio, BIO_THROTTLED);
> +}
> +
> +static inline void blkcg_bio_leave_q_level(struct bio *bio)
> +{
> +	/*
> +	 * @bio may get throttled at multiple q levels, clear THROTTLED
> +	 * when leaving the current one.
> +	 */
> +	bio_clear_flag(bio, BIO_THROTTLED);
> +}
>  #else
>  static inline bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
>  				  struct bio *bio) { return false; }
> +static inline void blkcg_bio_repeat_q_level(struct bio *bio) { }
> +static inline void biocg_bio_leave_q_level(struct bio *bio) { }
>  #endif
>  
>  static inline struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
> -- 
> 2.9.5
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] blk-throtl: don't throttle the same IO multiple times
@ 2017-11-13 11:13       ` Tejun Heo
  0 siblings, 0 replies; 41+ messages in thread
From: Tejun Heo @ 2017-11-13 11:13 UTC (permalink / raw)
  To: Shaohua Li
  Cc: axboe, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro

Hello, Shaohua.

On Sun, Nov 12, 2017 at 08:07:16PM -0800, Shaohua Li wrote:
> Thanks for looking into this while I was absence. I don't understand how this
> works. Assume a bio will be splitted into 2 small bios. In
> generic_make_request, we charge the whole bio. 'q->make_request_fn' will
> dispatch the first small bio, and call generic_make_request for the second
> small bio. Then generic_make_request charge the second small bio and we add the
> second small bio to current->bio_list[0] (please check the order). In above

You're right.  If we wanna take this approach, we need to keep the
throttled flag while cloning.  The clearing part is still correct tho.
Without that, I get 1/4 bw limit enforced.  Hmm... I'm not quite sure
where that 1/4 is coming from tho.  Will investigate more.

> code the patch changed, we pop the second small bio and set BIO_THROTTLED for
> it. But this is already too late, because generic_make_request already charged
> the second small bio.
> 
> Did you look at my original patch
> (https://marc.info/?l=linux-block&m=150791825327628&w=2), anything wrong?

That should work too but if we have to modify clone path anyway and
clear the flag when the bio leaves the queue, we don't need to add a
dedicated field to bio, right?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] blk-throtl: don't throttle the same IO multiple times
@ 2017-11-13 15:57         ` Tejun Heo
  0 siblings, 0 replies; 41+ messages in thread
From: Tejun Heo @ 2017-11-13 15:57 UTC (permalink / raw)
  To: Shaohua Li
  Cc: axboe, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro

Hello,

On Mon, Nov 13, 2017 at 03:13:48AM -0800, Tejun Heo wrote:
> You're right.  If we wanna take this approach, we need to keep the
> throttled flag while cloning.  The clearing part is still correct tho.
> Without that, I get 1/4 bw limit enforced.  Hmm... I'm not quite sure
> where that 1/4 is coming from tho.  Will investigate more.

Okay, this is because when we split, the split bio is the first part
which gets issued and then the original bio is wound forward and
requeued.  So, for the splits, the original bio is the one which gets
trimmed in the front and requeued, so not clearing BIO_THROTTLED is
enough.  I think we should still copy BIO_THROTTLED on clones so that
we don't get surprises w/ other bio drivers.
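
Something like the following in the fast clone path, for example (just a
sketch of the idea, not a tested patch):

	void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
	{
		...
		bio->bi_opf = bio_src->bi_opf;
		bio->bi_iter = bio_src->bi_iter;
		bio->bi_io_vec = bio_src->bi_io_vec;
		/* the clone inherits the charge done on the original */
		if (bio_flagged(bio_src, BIO_THROTTLED))
			bio_set_flag(bio, BIO_THROTTLED);
		...
	}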

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] blk-throtl: don't throttle the same IO multiple times
@ 2017-11-13 19:54           ` Shaohua Li
  0 siblings, 0 replies; 41+ messages in thread
From: Shaohua Li @ 2017-11-13 19:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro

On Mon, Nov 13, 2017 at 07:57:45AM -0800, Tejun Heo wrote:
> Hello,
> 
> On Mon, Nov 13, 2017 at 03:13:48AM -0800, Tejun Heo wrote:
> > You're right.  If we wanna take this approach, we need to keep the
> > throttled flag while cloning.  The clearing part is still correct tho.
> > Without that, I get 1/4 bw limit enforced.  Hmm... I'm not quite sure
> > where that 1/4 is coming from tho.  Will investigate more.
> 
> Okay, this is because when we split, the split bio is the first part
> which gets issued and then the original bio is wound forward and
> requeued.  So, for the splits, the original bio is the one which gets
> trimmed in the front and requeued, so not clearing BIO_THROTTLED is
> enough.  I think we should still copy BIO_THROTTLED on clones so that
> we don't get surprises w/ other bio drivers.

I'm not sure how you are going to make this correct. The mechanism is very
fragile. So for example, 'q->make_request_fn(q, bio)' could just queue the bio
somewhere and handle it in another context (both dm and md do this). The bio
will be submitted again with generic_make_request. In this case, the second
time shouldn't throttle the bio. The bio could also be submitted again with
generic_make_request but with the bdev changed. In this case, the second time
should throttle the bio (against the new bdev). There are a lot of different
usages of bio. I'd rather not depend on generic_make_request dispatching the
new bio immediately. That's why I added a bdev in my patch.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] blk-throtl: don't throttle the same IO multiple times
@ 2017-11-13 19:58           ` Shaohua Li
  0 siblings, 0 replies; 41+ messages in thread
From: Shaohua Li @ 2017-11-13 19:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro

On Mon, Nov 13, 2017 at 07:57:45AM -0800, Tejun Heo wrote:
> Hello,
> 
> On Mon, Nov 13, 2017 at 03:13:48AM -0800, Tejun Heo wrote:
> > You're right.  If we wanna take this approach, we need to keep the
> > throttled flag while cloning.  The clearing part is still correct tho.
> > Without that, I get 1/4 bw limit enforced.  Hmm... I'm not quite sure
> > where that 1/4 is coming from tho.  Will investigate more.
> 
> Okay, this is because when we split, the split bio is the first part
> which gets issued and then the original bio is wound forward and
> requeued.  So, for the splits, the original bio is the one which gets
> trimmed in the front and requeued, so not clearing BIO_THROTTLED is
> enough.  I think we should still copy BIO_THROTTLED on clones so that
> we don't get surprises w/ other bio drivers.

Oh, I think there is a better solution that avoids adding a new bdev field. We
always set the BIO_THROTTLED flag after block-throttle and copy the flag on
clone. In bio_set_dev, we clear the flag. This should work, I think.
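
Something like this sketch (taking the current bi_disk/bi_partno assignments
in bio_set_dev as given; the only new bit is the clearing):

	#define bio_set_dev(bio, bdev)					\
	do {								\
		(bio)->bi_disk = (bdev)->bd_disk;			\
		(bio)->bi_partno = (bdev)->bd_partno;			\
		/* new target device, old throttling no longer applies */ \
		bio_clear_flag((bio), BIO_THROTTLED);			\
	} while (0)

together with copying BIO_THROTTLED in the clone helpers so that a clone
issued to the same device isn't throttled a second time.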

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] blk-throtl: don't throttle the same IO multiple times
@ 2017-11-13 19:58             ` Tejun Heo
  0 siblings, 0 replies; 41+ messages in thread
From: Tejun Heo @ 2017-11-13 19:58 UTC (permalink / raw)
  To: Shaohua Li
  Cc: axboe, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro

Hello,

On Mon, Nov 13, 2017 at 11:54:13AM -0800, Shaohua Li wrote:
> I'm not sure how you are going to make this correct. The mechanism is very
> fragile. For example, 'q->make_request_fn(q, bio)' could just queue the bio
> somewhere and handle it in another context (both dm and md do this). The bio
> will then be resubmitted with generic_make_request. In this case, the second
> pass shouldn't throttle the bio. The bio could also be resubmitted with
> generic_make_request but with the bdev changed. In that case, the second pass
> should throttle the bio (against the new bdev). There are a lot of different
> usages of bio, so I'd rather not depend on generic_make_request dispatching
> the new bio immediately. That's why I added a bdev in my patch.

I see.  Yeah, that makes sense.  The cloned bios may get queued to a
different request_queue asynchronously, and without remembering the
last queue you can't tell whether it's a new queue or not.  I'll drop
this one and test with your patch.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v2 3/7] blkcg: associate a request with its blkcg_gq instead of request_list
  2017-11-12 22:26 ` [PATCH 3/7] blkcg: associate a request with its blkcg_gq instead of request_list Tejun Heo
@ 2017-11-13 20:15     ` Tejun Heo
  0 siblings, 0 replies; 41+ messages in thread
From: Tejun Heo @ 2017-11-13 20:15 UTC (permalink / raw)
  To: axboe; +Cc: shli, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro

From c856a199ec70e4022e997609f1b17b9106408777 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Mon, 13 Nov 2017 12:11:57 -0800

On the legacy request_queue, each blkcg_gq has a dedicated
request_list that requests for the cgroup are allocated from.  Each
request points to the originating request_list and cgroup membership
can be determined by examining which cgroup the associated
request_list belongs to.

This doesn't work for blk-mq which doesn't use request_list.  Let's
associate each request with the blkcg_gq directly.  Except for minor
wrapper changes, this doesn't affect the legacy request_queue and will
allow blk-mq to track each request's cgroup association.

v2: Build fix for !CONFIG_BLK_CGROUP.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-mq.c             | 2 +-
 include/linux/blk-cgroup.h | 7 +++++--
 include/linux/blkdev.h     | 2 +-
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 98a1860..af958c4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -306,7 +306,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->part = NULL;
 	rq->start_time = jiffies;
 #ifdef CONFIG_BLK_CGROUP
-	rq->rl = NULL;
+	rq->blkg = NULL;
 	set_start_time_ns(rq);
 	rq->io_start_time_ns = 0;
 #endif
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index c0d4736..47db75a 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -463,7 +463,7 @@ static inline void blk_put_rl(struct request_list *rl)
  */
 static inline void blk_rq_set_rl(struct request *rq, struct request_list *rl)
 {
-	rq->rl = rl;
+	rq->blkg = rl->blkg;
 }
 
 /**
@@ -474,7 +474,10 @@ static inline void blk_rq_set_rl(struct request *rq, struct request_list *rl)
  */
 static inline struct request_list *blk_rq_rl(struct request *rq)
 {
-	return rq->rl;
+	if (!rq->blkg->parent)
+		return &rq->q->root_rl;
+	else
+		return &rq->blkg->rl;
 }
 
 struct request_list *__blk_queue_next_rl(struct request_list *rl,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8da6637..6ec1067 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -206,7 +206,7 @@ struct request {
 	unsigned long start_time;
 	struct blk_issue_stat issue_stat;
 #ifdef CONFIG_BLK_CGROUP
-	struct request_list *rl;		/* rl this rq is alloced from */
+	struct blkcg_gq *blkg;			/* blkg of this rq */
 	unsigned long long start_time_ns;
 	unsigned long long io_start_time_ns;    /* when passed to hardware */
 #endif
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/7] blkcg: relocate __blkg_release_rcu()
@ 2017-11-14 23:12     ` Shaohua Li
  0 siblings, 0 replies; 41+ messages in thread
From: Shaohua Li @ 2017-11-14 23:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro

On Sun, Nov 12, 2017 at 02:26:07PM -0800, Tejun Heo wrote:
> Move __blkg_release_rcu() above blkg_alloc().  This is a pure code
> reorganization to prepare for the switch to percpu_ref.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Reviewed-by: Shaohua Li <shli@kernel.org>
> ---
>  block/blk-cgroup.c | 46 +++++++++++++++++++++++-----------------------
>  1 file changed, 23 insertions(+), 23 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index d3f56ba..6482be5 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -81,6 +81,29 @@ static void blkg_free(struct blkcg_gq *blkg)
>  	kfree(blkg);
>  }
>  
> +/*
> + * A group is RCU protected, but having an rcu lock does not mean that one
> + * can access all the fields of blkg and assume these are valid.  For
> + * example, don't try to follow throtl_data and request queue links.
> + *
> + * Having a reference to blkg under an rcu allows accesses to only values
> + * local to groups like group stats and group rate limits.
> + */
> +void __blkg_release_rcu(struct rcu_head *rcu_head)
> +{
> +	struct blkcg_gq *blkg = container_of(rcu_head, struct blkcg_gq, rcu_head);
> +
> +	/* release the blkcg and parent blkg refs this blkg has been holding */
> +	css_put(&blkg->blkcg->css);
> +	if (blkg->parent)
> +		blkg_put(blkg->parent);
> +
> +	wb_congested_put(blkg->wb_congested);
> +
> +	blkg_free(blkg);
> +}
> +EXPORT_SYMBOL_GPL(__blkg_release_rcu);
> +
>  /**
>   * blkg_alloc - allocate a blkg
>   * @blkcg: block cgroup the new blkg is associated with
> @@ -378,29 +401,6 @@ static void blkg_destroy_all(struct request_queue *q)
>  }
>  
>  /*
> - * A group is RCU protected, but having an rcu lock does not mean that one
> - * can access all the fields of blkg and assume these are valid.  For
> - * example, don't try to follow throtl_data and request queue links.
> - *
> - * Having a reference to blkg under an rcu allows accesses to only values
> - * local to groups like group stats and group rate limits.
> - */
> -void __blkg_release_rcu(struct rcu_head *rcu_head)
> -{
> -	struct blkcg_gq *blkg = container_of(rcu_head, struct blkcg_gq, rcu_head);
> -
> -	/* release the blkcg and parent blkg refs this blkg has been holding */
> -	css_put(&blkg->blkcg->css);
> -	if (blkg->parent)
> -		blkg_put(blkg->parent);
> -
> -	wb_congested_put(blkg->wb_congested);
> -
> -	blkg_free(blkg);
> -}
> -EXPORT_SYMBOL_GPL(__blkg_release_rcu);
> -
> -/*
>   * The next function used by blk_queue_for_each_rl().  It's a bit tricky
>   * because the root blkg uses @q->root_rl instead of its own rl.
>   */
> -- 
> 2.9.5
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/7] blkcg: use percpu_ref for blkcg_gq->refcnt
@ 2017-11-14 23:12     ` Shaohua Li
  0 siblings, 0 replies; 41+ messages in thread
From: Shaohua Li @ 2017-11-14 23:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro

On Sun, Nov 12, 2017 at 02:26:08PM -0800, Tejun Heo wrote:
> blkcg_gq->refcnt is an atomic_t.  This was due to the following two
> reasons.
> 
> * When blkcg_gq was added, the percpu memory allocator didn't support
>   allocations from !GFP_KERNEL contexts.  Because blkcg_gq's are
>   created from IO issue paths, it couldn't use GFP_KERNEL allocations.
> 
> * A blkcg_gq represents the connection between a cgroup and a
>   request_queue.  Because a in-flight bio already pins both, blkcg_gq
>   didn't usually need explicit pinning, making the use of atomic_t
>   acceptable albeit restrictive and fragile.
> 
> Now that the percpu allocator supports !GFP_KERNEL allocations,
> there's no reason to keep using atomic_t refcnt.  This will allow
> clean separation between bio and request layers helping blkcg support
> in blk-mq.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Reviewed-by: Shaohua Li <shli@kernel.org>
> ---
>  block/blk-cgroup.c         | 21 ++++++++++++++++-----
>  include/linux/blk-cgroup.h | 13 ++++---------
>  2 files changed, 20 insertions(+), 14 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 6482be5..60a4486 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -78,6 +78,7 @@ static void blkg_free(struct blkcg_gq *blkg)
>  
>  	blkg_rwstat_exit(&blkg->stat_ios);
>  	blkg_rwstat_exit(&blkg->stat_bytes);
> +	percpu_ref_exit(&blkg->refcnt);
>  	kfree(blkg);
>  }
>  
> @@ -89,7 +90,7 @@ static void blkg_free(struct blkcg_gq *blkg)
>   * Having a reference to blkg under an rcu allows accesses to only values
>   * local to groups like group stats and group rate limits.
>   */
> -void __blkg_release_rcu(struct rcu_head *rcu_head)
> +static void blkg_release_rcu(struct rcu_head *rcu_head)
>  {
>  	struct blkcg_gq *blkg = container_of(rcu_head, struct blkcg_gq, rcu_head);
>  
> @@ -102,7 +103,13 @@ void __blkg_release_rcu(struct rcu_head *rcu_head)
>  
>  	blkg_free(blkg);
>  }
> -EXPORT_SYMBOL_GPL(__blkg_release_rcu);
> +
> +static void blkg_release(struct percpu_ref *refcnt)
> +{
> +	struct blkcg_gq *blkg = container_of(refcnt, struct blkcg_gq, refcnt);
> +
> +	call_rcu(&blkg->rcu_head, blkg_release_rcu);
> +}
>  
>  /**
>   * blkg_alloc - allocate a blkg
> @@ -123,6 +130,11 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
>  	if (!blkg)
>  		return NULL;
>  
> +	if (percpu_ref_init(&blkg->refcnt, blkg_release, 0, gfp_mask)) {
> +		kfree(blkg);
> +		return NULL;
> +	}
> +
>  	if (blkg_rwstat_init(&blkg->stat_bytes, gfp_mask) ||
>  	    blkg_rwstat_init(&blkg->stat_ios, gfp_mask))
>  		goto err_free;
> @@ -130,7 +142,6 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
>  	blkg->q = q;
>  	INIT_LIST_HEAD(&blkg->q_node);
>  	blkg->blkcg = blkcg;
> -	atomic_set(&blkg->refcnt, 1);
>  
>  	/* root blkg uses @q->root_rl, init rl only for !root blkgs */
>  	if (blkcg != &blkcg_root) {
> @@ -266,7 +277,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
>  		return blkg;
>  
>  	/* @blkg failed fully initialized, use the usual release path */
> -	blkg_put(blkg);
> +	percpu_ref_kill(&blkg->refcnt);
>  	return ERR_PTR(ret);
>  
>  err_put_congested:
> @@ -373,7 +384,7 @@ static void blkg_destroy(struct blkcg_gq *blkg)
>  	 * Put the reference taken at the time of creation so that when all
>  	 * queues are gone, group can be destroyed.
>  	 */
> -	blkg_put(blkg);
> +	percpu_ref_kill(&blkg->refcnt);
>  }
>  
>  /**
> diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
> index 8bbc371..c0d4736 100644
> --- a/include/linux/blk-cgroup.h
> +++ b/include/linux/blk-cgroup.h
> @@ -19,7 +19,7 @@
>  #include <linux/seq_file.h>
>  #include <linux/radix-tree.h>
>  #include <linux/blkdev.h>
> -#include <linux/atomic.h>
> +#include <linux/percpu-refcount.h>
>  
>  /* percpu_counter batch for blkg_[rw]stats, per-cpu drift doesn't matter */
>  #define BLKG_STAT_CPU_BATCH	(INT_MAX / 2)
> @@ -123,7 +123,7 @@ struct blkcg_gq {
>  	struct request_list		rl;
>  
>  	/* reference count */
> -	atomic_t			refcnt;
> +	struct percpu_ref		refcnt;
>  
>  	/* is this blkg online? protected by both blkcg and q locks */
>  	bool				online;
> @@ -355,21 +355,16 @@ static inline int blkg_path(struct blkcg_gq *blkg, char *buf, int buflen)
>   */
>  static inline void blkg_get(struct blkcg_gq *blkg)
>  {
> -	WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
> -	atomic_inc(&blkg->refcnt);
> +	percpu_ref_get(&blkg->refcnt);
>  }
>  
> -void __blkg_release_rcu(struct rcu_head *rcu);
> -
>  /**
>   * blkg_put - put a blkg reference
>   * @blkg: blkg to put
>   */
>  static inline void blkg_put(struct blkcg_gq *blkg)
>  {
> -	WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
> -	if (atomic_dec_and_test(&blkg->refcnt))
> -		call_rcu(&blkg->rcu_head, __blkg_release_rcu);
> +	percpu_ref_put(&blkg->refcnt);
>  }
>  
>  /**
> -- 
> 2.9.5
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 3/7] blkcg: associate a request with its blkcg_gq instead of request_list
@ 2017-11-14 23:17       ` Shaohua Li
  0 siblings, 0 replies; 41+ messages in thread
From: Shaohua Li @ 2017-11-14 23:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro

On Mon, Nov 13, 2017 at 12:15:23PM -0800, Tejun Heo wrote:
> From c856a199ec70e4022e997609f1b17b9106408777 Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Mon, 13 Nov 2017 12:11:57 -0800
> 
> On the legacy request_queue, each blkcg_gq has a dedicated
> request_list that requests for the cgroup are allocated from.  Each
> request points to the originating request_list and cgroup membership
> can be determined by examining which cgroup the associated
> request_list belongs to.
> 
> This doesn't work for blk-mq which doesn't use request_list.  Let's
> associate each request with the blkcg_gq directly.  Except for minor
> wrapper changes, this doesn't affect the legacy request_queue and will
> allow blk-mq to track each request's cgroup association.
> 
> v2: Build fix for !CONFIG_BLK_CGROUP.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  block/blk-mq.c             | 2 +-
>  include/linux/blk-cgroup.h | 7 +++++--
>  include/linux/blkdev.h     | 2 +-
>  3 files changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 98a1860..af958c4 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -306,7 +306,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>  	rq->part = NULL;
>  	rq->start_time = jiffies;
>  #ifdef CONFIG_BLK_CGROUP
> -	rq->rl = NULL;
> +	rq->blkg = NULL;
>  	set_start_time_ns(rq);
>  	rq->io_start_time_ns = 0;
>  #endif
> diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
> index c0d4736..47db75a 100644
> --- a/include/linux/blk-cgroup.h
> +++ b/include/linux/blk-cgroup.h
> @@ -463,7 +463,7 @@ static inline void blk_put_rl(struct request_list *rl)
>   */
>  static inline void blk_rq_set_rl(struct request *rq, struct request_list *rl)
>  {
> -	rq->rl = rl;
> +	rq->blkg = rl->blkg;
>  }

After patch 5, we don't really need the blk_put_rl, blk_get_rl, blk_rq_set_rl
functions. I'd like to delete them and only keep blk_rq_rl. The name
blk_rq_set_rl is confusing too.

Thanks,
Shaohua
>  /**
> @@ -474,7 +474,10 @@ static inline void blk_rq_set_rl(struct request *rq, struct request_list *rl)
>   */
>  static inline struct request_list *blk_rq_rl(struct request *rq)
>  {
> -	return rq->rl;
> +	if (!rq->blkg->parent)
> +		return &rq->q->root_rl;
> +	else
> +		return &rq->blkg->rl;
>  }
>  
>  struct request_list *__blk_queue_next_rl(struct request_list *rl,
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 8da6637..6ec1067 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -206,7 +206,7 @@ struct request {
>  	unsigned long start_time;
>  	struct blk_issue_stat issue_stat;
>  #ifdef CONFIG_BLK_CGROUP
> -	struct request_list *rl;		/* rl this rq is alloced from */
> +	struct blkcg_gq *blkg;			/* blkg of this rq */
>  	unsigned long long start_time_ns;
>  	unsigned long long io_start_time_ns;    /* when passed to hardware */
>  #endif
> -- 
> 2.9.5
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/7] blkcg: refactor blkcg_gq lookup and creation in blkcg_bio_issue_check()
@ 2017-11-14 23:18     ` Shaohua Li
  0 siblings, 0 replies; 41+ messages in thread
From: Shaohua Li @ 2017-11-14 23:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro

On Sun, Nov 12, 2017 at 02:26:10PM -0800, Tejun Heo wrote:
> blkcg_bio_issue_check() open codes blkcg_gq lookup and creation using
> blkg_lookup() and blkg_lookup_create().  Refactor the code so that
> 
> * blkg_lookup_create() is renamed to blkg_lookup_create_locked().
> 
> * blkg_lookup_create() is now a new function which encapsulates the
>   RCU protected lookup and queue_lock protected creation.
> 
> * blkg_lookup_create() is guaranteed to return a non-NULL blkcg_gq.
>   The NULL checks are removed from the users.
> 
> This is pure refactoring and doesn't introduce any functional changes.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Reviewed-by: Shaohua Li <shli@kernel.org>
> ---
>  block/blk-cgroup.c         |  6 +++---
>  block/blk-throttle.c       |  2 +-
>  include/linux/blk-cgroup.h | 32 +++++++++++++++++++++-----------
>  3 files changed, 25 insertions(+), 15 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 60a4486..490b0a6 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -290,7 +290,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
>  }
>  
>  /**
> - * blkg_lookup_create - lookup blkg, try to create one if not there
> + * blkg_lookup_create_locked - lookup blkg, try to create one if not there
>   * @blkcg: blkcg of interest
>   * @q: request_queue of interest
>   *
> @@ -303,8 +303,8 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
>   * value on error.  If @q is dead, returns ERR_PTR(-EINVAL).  If @q is not
>   * dead and bypassing, returns ERR_PTR(-EBUSY).
>   */
> -struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
> -				    struct request_queue *q)
> +struct blkcg_gq *blkg_lookup_create_locked(struct blkcg *blkcg,
> +					   struct request_queue *q)
>  {
>  	struct blkcg_gq *blkg;
>  
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index 8631763..1e6916b 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -2123,7 +2123,7 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
>  		    struct bio *bio)
>  {
>  	struct throtl_qnode *qn = NULL;
> -	struct throtl_grp *tg = blkg_to_tg(blkg ?: q->root_blkg);
> +	struct throtl_grp *tg = blkg_to_tg(blkg);
>  	struct throtl_service_queue *sq;
>  	bool rw = bio_data_dir(bio);
>  	bool throttled = false;
> diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
> index b5721d32..1de5158 100644
> --- a/include/linux/blk-cgroup.h
> +++ b/include/linux/blk-cgroup.h
> @@ -172,8 +172,8 @@ extern struct cgroup_subsys_state * const blkcg_root_css;
>  
>  struct blkcg_gq *blkg_lookup_slowpath(struct blkcg *blkcg,
>  				      struct request_queue *q, bool update_hint);
> -struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
> -				    struct request_queue *q);
> +struct blkcg_gq *blkg_lookup_create_locked(struct blkcg *blkcg,
> +					   struct request_queue *q);
>  int blkcg_init_queue(struct request_queue *q);
>  void blkcg_drain_queue(struct request_queue *q);
>  void blkcg_exit_queue(struct request_queue *q);
> @@ -680,6 +680,24 @@ static inline bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg
>  				  struct bio *bio) { return false; }
>  #endif
>  
> +static inline struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
> +						  struct request_queue *q)
> +{
> +	struct blkcg_gq *blkg;
> +
> +	blkg = blkg_lookup(blkcg, q);
> +	if (likely(blkg))
> +		return blkg;
> +
> +	spin_lock_irq(q->queue_lock);
> +	blkg = blkg_lookup_create_locked(blkcg, q);
> +	spin_unlock_irq(q->queue_lock);
> +	if (likely(!IS_ERR(blkg)))
> +		return blkg;
> +	else
> +		return q->root_blkg;
> +}
> +
>  static inline bool blkcg_bio_issue_check(struct request_queue *q,
>  					 struct bio *bio)
>  {
> @@ -693,19 +711,11 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
>  	/* associate blkcg if bio hasn't attached one */
>  	bio_associate_blkcg(bio, &blkcg->css);
>  
> -	blkg = blkg_lookup(blkcg, q);
> -	if (unlikely(!blkg)) {
> -		spin_lock_irq(q->queue_lock);
> -		blkg = blkg_lookup_create(blkcg, q);
> -		if (IS_ERR(blkg))
> -			blkg = NULL;
> -		spin_unlock_irq(q->queue_lock);
> -	}
> +	blkg = blkg_lookup_create(blkcg, q);
>  
>  	throtl = blk_throtl_bio(q, blkg, bio);
>  
>  	if (!throtl) {
> -		blkg = blkg ?: q->root_blkg;
>  		blkg_rwstat_add(&blkg->stat_bytes, bio->bi_opf,
>  				bio->bi_iter.bi_size);
>  		blkg_rwstat_add(&blkg->stat_ios, bio->bi_opf, 1);
> -- 
> 2.9.5
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/7] blkcg: account requests instead of bios for request based request_queues
@ 2017-11-14 23:23     ` Shaohua Li
  0 siblings, 0 replies; 41+ messages in thread
From: Shaohua Li @ 2017-11-14 23:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro

On Sun, Nov 12, 2017 at 02:26:12PM -0800, Tejun Heo wrote:
> blkcg accounting is currently bio based, which is silly for request
> based request_queues.  This is silly as the number of bios doesn't
> have much to do with the actual number of IOs issued to the underlying
> device (can be significantly higher or lower) and may change depending
> on the implementation details on how the bios are issued (e.g. from
> the recent split-bios-while-issuing change).
> 
> request based request_queues have QUEUE_FLAG_IO_STAT set by default
> which controls the gendisk accounting.  Do cgroup accounting for those
> request_queues together with gendisk accounting on request completion.
> 
> This makes cgroup accounting consistent with gendisk accounting and
> what's happening on the system.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  block/blk-core.c           |  3 +++
>  include/linux/blk-cgroup.h | 18 +++++++++++++++++-
>  2 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 048be4a..ad23b96 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2429,6 +2429,7 @@ void blk_account_io_completion(struct request *req, unsigned int bytes)
>  		cpu = part_stat_lock();
>  		part = req->part;
>  		part_stat_add(cpu, part, sectors[rw], bytes >> 9);
> +		blkcg_account_io_completion(req, bytes);
>  		part_stat_unlock();
>  	}
>  }
> @@ -2454,6 +2455,8 @@ void blk_account_io_done(struct request *req)
>  		part_round_stats(req->q, cpu, part);
>  		part_dec_in_flight(req->q, part, rw);
>  
> +		blkcg_account_io_done(req);
> +
>  		hd_struct_put(part);
>  		part_stat_unlock();
>  	}
> diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
> index 96eed0f..f2f9691 100644
> --- a/include/linux/blk-cgroup.h
> +++ b/include/linux/blk-cgroup.h
> @@ -715,7 +715,8 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
>  
>  	throtl = blk_throtl_bio(q, blkg, bio);
>  
> -	if (!throtl) {
> +	/* if @q does io stat, blkcg stats are updated together with them */
> +	if (!blk_queue_io_stat(q) && !throtl) {

Reviewed-by: Shaohua Li <shli@kernel.org>

One nitpick: can we use q->request_fn to determine a request based queue? I
think that is a more reliable and usual way to do this.

Thanks,
Shaohua
>  		blkg_rwstat_add(&blkg->stat_bytes, bio->bi_opf,
>  				bio->bi_iter.bi_size);
>  		blkg_rwstat_add(&blkg->stat_ios, bio->bi_opf, 1);
> @@ -764,6 +765,17 @@ static inline void blk_rq_disassociate_blkg(struct request *rq)
>  	rq->blkg = NULL;
>  }
>  
> +static inline void blkcg_account_io_completion(struct request *rq,
> +					       unsigned int bytes)
> +{
> +	blkg_rwstat_add(&rq->blkg->stat_bytes, rq_data_dir(rq), bytes);
> +}
> +
> +static inline void blkcg_account_io_done(struct request *rq)
> +{
> +	blkg_rwstat_add(&rq->blkg->stat_ios, rq_data_dir(rq), 1);
> +}
> +
>  #else	/* CONFIG_BLK_CGROUP */
>  
>  struct blkcg {
> @@ -823,6 +835,10 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
>  static inline void blk_rq_associate_blkg(struct request *rq, struct blkcg *blkcg) { }
>  static inline void blk_rq_disassociate_blkg(struct request *rq) { }
>  
> +static inline void blkcg_account_io_completion(struct request *rq,
> +					       unsigned int bytes) { }
> +static inline void blkcg_account_io_done(struct request *rq) { }
> +
>  #define blk_queue_for_each_rl(rl, q)	\
>  	for ((rl) = &(q)->root_rl; (rl); (rl) = NULL)
>  
> -- 
> 2.9.5
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 3/7] blkcg: associate a request with its blkcg_gq instead of request_list
  2017-11-14 23:17       ` Shaohua Li
  (?)
@ 2017-11-15 17:11       ` Tejun Heo
  -1 siblings, 0 replies; 41+ messages in thread
From: Tejun Heo @ 2017-11-15 17:11 UTC (permalink / raw)
  To: Shaohua Li
  Cc: axboe, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro

Hello,

On Tue, Nov 14, 2017 at 03:17:43PM -0800, Shaohua Li wrote:
> After patch 5, we don't really need the blk_put_rl, blk_get_rl, blk_rq_set_rl
> functions. I'd like to delete them and only keep blk_rq_rl. The name
> blk_rq_set_rl is confusing too.

Yeah, we can switch the legacy rq path to direct blkg pinning too.  If
that cleans up the code, great.
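
Roughly the same pattern blk-mq gets from patch 5 -- pin the blkg when the
request is set up and drop it when the request is freed. Just a sketch of
the shape, not the actual code from the series:

	static inline void blk_rq_associate_blkg(struct request *rq,
						 struct blkcg_gq *blkg)
	{
		blkg_get(blkg);
		rq->blkg = blkg;
	}

	static inline void blk_rq_disassociate_blkg(struct request *rq)
	{
		if (rq->blkg) {
			blkg_put(rq->blkg);
			rq->blkg = NULL;
		}
	}

With the legacy path doing the same at request allocation / free, the rl
wrappers you mentioned can indeed go away.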

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v2 6/7] blkcg: account requests instead of bios for request based request_queues
@ 2017-11-15 17:18       ` Tejun Heo
  0 siblings, 0 replies; 41+ messages in thread
From: Tejun Heo @ 2017-11-15 17:18 UTC (permalink / raw)
  To: Shaohua Li
  Cc: axboe, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro

blkcg accounting is currently bio based, which is silly for request
based request_queues.  This is silly as the number of bios doesn't
have much to do with the actual number of IOs issued to the underlying
device (can be significantly higher or lower) and may change depending
on the implementation details on how the bios are issued (e.g. from
the recent split-bios-while-issuing change).

Do cgroup accounting for request based request_queues together with
gendisk accounting on request completion.

This makes cgroup accounting consistent with gendisk accounting and
what's happening on the system.

v2: Use q->request_fn to skip bio based accounting instead of
    QUEUE_FLAG_IO_STAT as suggested by Shaohua.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Shaohua Li <shli@kernel.org>
---
 block/blk-core.c           |    3 +++
 include/linux/blk-cgroup.h |   18 +++++++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2429,6 +2429,7 @@ void blk_account_io_completion(struct re
 		cpu = part_stat_lock();
 		part = req->part;
 		part_stat_add(cpu, part, sectors[rw], bytes >> 9);
+		blkcg_account_io_completion(req, bytes);
 		part_stat_unlock();
 	}
 }
@@ -2454,6 +2455,8 @@ void blk_account_io_done(struct request
 		part_round_stats(req->q, cpu, part);
 		part_dec_in_flight(req->q, part, rw);
 
+		blkcg_account_io_done(req);
+
 		hd_struct_put(part);
 		part_stat_unlock();
 	}
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -715,7 +715,8 @@ static inline bool blkcg_bio_issue_check
 
 	throtl = blk_throtl_bio(q, blkg, bio);
 
-	if (!throtl) {
+	/* if @q does io stat, blkcg stats are updated together with them */
+	if (!q->request_fn && !throtl) {
 		blkg_rwstat_add(&blkg->stat_bytes, bio->bi_opf,
 				bio->bi_iter.bi_size);
 		blkg_rwstat_add(&blkg->stat_ios, bio->bi_opf, 1);
@@ -764,6 +765,17 @@ static inline void blk_rq_disassociate_b
 	rq->blkg = NULL;
 }
 
+static inline void blkcg_account_io_completion(struct request *rq,
+					       unsigned int bytes)
+{
+	blkg_rwstat_add(&rq->blkg->stat_bytes, rq_data_dir(rq), bytes);
+}
+
+static inline void blkcg_account_io_done(struct request *rq)
+{
+	blkg_rwstat_add(&rq->blkg->stat_ios, rq_data_dir(rq), 1);
+}
+
 #else	/* CONFIG_BLK_CGROUP */
 
 struct blkcg {
@@ -823,6 +835,10 @@ static inline bool blkcg_bio_issue_check
 static inline void blk_rq_associate_blkg(struct request *rq, struct blkcg *blkcg) { }
 static inline void blk_rq_disassociate_blkg(struct request *rq) { }
 
+static inline void blkcg_account_io_completion(struct request *rq,
+					       unsigned int bytes) { }
+static inline void blkcg_account_io_done(struct request *rq) { }
+
 #define blk_queue_for_each_rl(rl, q)	\
 	for ((rl) = &(q)->root_rl; (rl); (rl) = NULL)
 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/7] blkcg: account requests instead of bios for request based request_queues
@ 2017-11-15 17:19       ` Jens Axboe
  0 siblings, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2017-11-15 17:19 UTC (permalink / raw)
  To: Shaohua Li, Tejun Heo
  Cc: linux-kernel, kernel-team, lizefan, hannes, cgroups, guro

On 11/14/2017 04:23 PM, Shaohua Li wrote:
> On Sun, Nov 12, 2017 at 02:26:12PM -0800, Tejun Heo wrote:
>> blkcg accounting is currently bio based, which is silly for request
>> based request_queues.  This is silly as the number of bios doesn't
>> have much to do with the actual number of IOs issued to the underlying
>> device (can be significantly higher or lower) and may change depending
>> on the implementation details on how the bios are issued (e.g. from
>> the recent split-bios-while-issuing change).
>>
>> request based request_queues have QUEUE_FLAG_IO_STAT set by default
>> which controls the gendisk accounting.  Do cgroup accounting for those
>> request_queues together with gendisk accounting on request completion.
>>
>> This makes cgroup accounting consistent with gendisk accounting and
>> what's happening on the system.
>>
>> Signed-off-by: Tejun Heo <tj@kernel.org>
>> ---
>>  block/blk-core.c           |  3 +++
>>  include/linux/blk-cgroup.h | 18 +++++++++++++++++-
>>  2 files changed, 20 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 048be4a..ad23b96 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2429,6 +2429,7 @@ void blk_account_io_completion(struct request *req, unsigned int bytes)
>>  		cpu = part_stat_lock();
>>  		part = req->part;
>>  		part_stat_add(cpu, part, sectors[rw], bytes >> 9);
>> +		blkcg_account_io_completion(req, bytes);
>>  		part_stat_unlock();
>>  	}
>>  }
>> @@ -2454,6 +2455,8 @@ void blk_account_io_done(struct request *req)
>>  		part_round_stats(req->q, cpu, part);
>>  		part_dec_in_flight(req->q, part, rw);
>>  
>> +		blkcg_account_io_done(req);
>> +
>>  		hd_struct_put(part);
>>  		part_stat_unlock();
>>  	}
>> diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
>> index 96eed0f..f2f9691 100644
>> --- a/include/linux/blk-cgroup.h
>> +++ b/include/linux/blk-cgroup.h
>> @@ -715,7 +715,8 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
>>  
>>  	throtl = blk_throtl_bio(q, blkg, bio);
>>  
>> -	if (!throtl) {
>> +	/* if @q does io stat, blkcg stats are updated together with them */
>> +	if (!blk_queue_io_stat(q) && !throtl) {
> 
> Reviewed-by: Shaohua Li <shli@kernel.org>
> 
> One nitpick: can we use q->request_fn to determine whether a queue is
> request based? I think that is the more reliable and usual way to do
> this.

That won't work for mq - but there is a helper for this, queue_is_rq_based().
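
For reference, that helper is roughly the following (simplified from
include/linux/blkdev.h as of this series' base):

/*
 * True for legacy request_fn queues and for blk-mq queues, i.e. any
 * queue whose IO is driven by struct request rather than raw bios.
 */
static inline bool queue_is_rq_based(struct request_queue *q)
{
	return q->request_fn || q->mq_ops;
}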

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v3 6/7] blkcg: account requests instead of bios for request based request_queues
@ 2017-11-15 17:22         ` Tejun Heo
  0 siblings, 0 replies; 41+ messages in thread
From: Tejun Heo @ 2017-11-15 17:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Shaohua Li, linux-kernel, kernel-team, lizefan, hannes, cgroups, guro

blkcg accounting is currently bio based, which is a poor fit for
request based request_queues: the number of bios doesn't have much to
do with the actual number of IOs issued to the underlying device (it
can be significantly higher or lower) and may change with the
implementation details of how the bios are issued (e.g. the recent
split-bios-while-issuing change).

Do cgroup accounting for request based request_queues together with
gendisk accounting on request completion.

This makes cgroup accounting consistent with gendisk accounting and
what's happening on the system.

v3: The q->request_fn test doesn't work on blk-mq.  Use
    queue_is_rq_based() instead, as suggested by Jens.

v2: Use q->request_fn instead of QUEUE_FLAG_IO_STAT to skip bio based
    accounting, as suggested by Shaohua.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Shaohua Li <shli@kernel.org>
---
 block/blk-core.c           |    3 +++
 include/linux/blk-cgroup.h |   18 +++++++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2429,6 +2429,7 @@ void blk_account_io_completion(struct re
 		cpu = part_stat_lock();
 		part = req->part;
 		part_stat_add(cpu, part, sectors[rw], bytes >> 9);
+		blkcg_account_io_completion(req, bytes);
 		part_stat_unlock();
 	}
 }
@@ -2454,6 +2455,8 @@ void blk_account_io_done(struct request
 		part_round_stats(req->q, cpu, part);
 		part_dec_in_flight(req->q, part, rw);
 
+		blkcg_account_io_done(req);
+
 		hd_struct_put(part);
 		part_stat_unlock();
 	}
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -715,7 +715,8 @@ static inline bool blkcg_bio_issue_check
 
 	throtl = blk_throtl_bio(q, blkg, bio);
 
-	if (!throtl) {
+	/* request based queues update blkcg stats on request completion */
+	if (!queue_is_rq_based(q) && !throtl) {
 		blkg_rwstat_add(&blkg->stat_bytes, bio->bi_opf,
 				bio->bi_iter.bi_size);
 		blkg_rwstat_add(&blkg->stat_ios, bio->bi_opf, 1);
@@ -764,6 +765,17 @@ static inline void blk_rq_disassociate_b
 	rq->blkg = NULL;
 }
 
+static inline void blkcg_account_io_completion(struct request *rq,
+					       unsigned int bytes)
+{
+	blkg_rwstat_add(&rq->blkg->stat_bytes, rq_data_dir(rq), bytes);
+}
+
+static inline void blkcg_account_io_done(struct request *rq)
+{
+	blkg_rwstat_add(&rq->blkg->stat_ios, rq_data_dir(rq), 1);
+}
+
 #else	/* CONFIG_BLK_CGROUP */
 
 struct blkcg {
@@ -823,6 +835,10 @@ static inline bool blkcg_bio_issue_check
 static inline void blk_rq_associate_blkg(struct request *rq, struct blkcg *blkcg) { }
 static inline void blk_rq_disassociate_blkg(struct request *rq) { }
 
+static inline void blkcg_account_io_completion(struct request *rq,
+					       unsigned int bytes) { }
+static inline void blkcg_account_io_done(struct request *rq) { }
+
 #define blk_queue_for_each_rl(rl, q)	\
 	for ((rl) = &(q)->root_rl; (rl); (rl) = NULL)
 

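For context, and only as my reading of the completion path rather than
something the patch states: blkcg_account_io_completion() can run
several times per request as chunks complete, while
blkcg_account_io_done() runs exactly once, mirroring the existing
part_stat accounting.  An illustrative sketch (the real callers are
blk_update_request() and blk_finish_request()/__blk_mq_end_request()):

/* illustrative only, not kernel code */
static void blkcg_accounting_sketch(struct request *req,
				    unsigned int nr_bytes, bool done)
{
	/* bytes: accounted for each completed chunk of @req */
	blk_account_io_completion(req, nr_bytes);

	/* IO count and in-flight bookkeeping: once, when @req finishes */
	if (done)
		blk_account_io_done(req);
}
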
^ permalink raw reply	[flat|nested] 41+ messages in thread

Thread overview: 41+ messages
2017-11-12 22:26 [PATCHSET] blkcg: basic accounting and throttling fixes Tejun Heo
2017-11-12 22:26 ` Tejun Heo
2017-11-12 22:26 ` [PATCH 1/7] blkcg: relocate __blkg_release_rcu() Tejun Heo
2017-11-14 23:12   ` Shaohua Li
2017-11-14 23:12     ` Shaohua Li
2017-11-12 22:26 ` [PATCH 2/7] blkcg: use percpu_ref for blkcg_gq->refcnt Tejun Heo
2017-11-14 23:12   ` Shaohua Li
2017-11-14 23:12     ` Shaohua Li
2017-11-12 22:26 ` [PATCH 3/7] blkcg: associate a request with its blkcg_gq instead of request_list Tejun Heo
2017-11-13 20:15   ` [PATCH v2 " Tejun Heo
2017-11-13 20:15     ` Tejun Heo
2017-11-14 23:17     ` Shaohua Li
2017-11-14 23:17       ` Shaohua Li
2017-11-15 17:11       ` Tejun Heo
2017-11-12 22:26 ` [PATCH 4/7] blkcg: refactor blkcg_gq lookup and creation in blkcg_bio_issue_check() Tejun Heo
2017-11-14 23:18   ` Shaohua Li
2017-11-14 23:18     ` Shaohua Li
2017-11-12 22:26 ` [PATCH 5/7] blkcg: associate blk-mq requests with the matching blkcg_gqs Tejun Heo
2017-11-12 22:26   ` Tejun Heo
2017-11-12 22:26 ` [PATCH 6/7] blkcg: account requests instead of bios for request based request_queues Tejun Heo
2017-11-14 23:23   ` Shaohua Li
2017-11-14 23:23     ` Shaohua Li
2017-11-15 17:18     ` [PATCH v2 " Tejun Heo
2017-11-15 17:18       ` Tejun Heo
2017-11-15 17:19     ` [PATCH " Jens Axboe
2017-11-15 17:19       ` Jens Axboe
2017-11-15 17:22       ` [PATCH v3 " Tejun Heo
2017-11-15 17:22         ` Tejun Heo
2017-11-12 22:26 ` [PATCH 7/7] blk-throtl: don't throttle the same IO multiple times Tejun Heo
2017-11-13  4:07   ` Shaohua Li
2017-11-13  4:07     ` Shaohua Li
2017-11-13 11:13     ` Tejun Heo
2017-11-13 11:13       ` Tejun Heo
2017-11-13 15:57       ` Tejun Heo
2017-11-13 15:57         ` Tejun Heo
2017-11-13 19:54         ` Shaohua Li
2017-11-13 19:54           ` Shaohua Li
2017-11-13 19:58           ` Tejun Heo
2017-11-13 19:58             ` Tejun Heo
2017-11-13 19:58         ` Shaohua Li
2017-11-13 19:58           ` Shaohua Li
