LKML Archive on lore.kernel.org
From: Dennis Zhou <dennisszhou@gmail.com>
To: Jens Axboe <axboe@kernel.dk>, Tejun Heo <tj@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Josef Bacik <josef@toxicpanda.com>
Cc: kernel-team@fb.com, linux-block@vger.kernel.org,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	"Dennis Zhou (Facebook)" <dennisszhou@gmail.com>,
	Jiufei Xue <jiufei.xue@linux.alibaba.com>,
	Joseph Qi <joseph.qi@linux.alibaba.com>
Subject: [PATCH 02/15] blkcg: delay blkg destruction until after writeback has finished
Date: Thu, 30 Aug 2018 21:53:43 -0400
Message-ID: <20180831015356.69796-3-dennisszhou@gmail.com>
In-Reply-To: <20180831015356.69796-1-dennisszhou@gmail.com>

From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>

Currently, blkcg destruction relies on a sequence of events:
  1. Destruction starts. blkcg_css_offline() is called and blkgs
     release their reference to the blkcg. This immediately destroys
     the cgwbs (writeback).
  2. With blkgs giving up their reference, the blkcg ref count should
     become zero and eventually call blkcg_css_free() which finally
     frees the blkcg.

Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
on the completion of all writeback associated with the blkcg. A count of
the number of cgwbs is maintained and once that goes to zero, blkg
destruction can follow. This should prevent premature blkg destruction.

The new process for blkcg cleanup is as follows:
  1. Destruction starts. blkcg_css_offline() is called which offlines
     writeback. Blkg destruction is delayed on the nr_cgwbs count to
     avoid punting potentially large amounts of outstanding writeback
     to root while maintaining any ongoing policies.
  2. When nr_cgwbs becomes zero, blkcg_destroy_blkgs() is called and
     handles destruction of blkgs. This is where the css reference held
     by each blkg is released.
  3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
     This finally frees the blkcg.

It seems that, in the past, blk-throttle took data from a blkg while
associating with current in a way that was hard to follow. The
simplification and unification of what blk-throttle does is what caused
this race.

Fixes: 08e18eab0c579 ("block: add bi_blkg to the bio for cgroups")
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
---
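For reviewers, the counting scheme described above can be modeled in
user space. This is only a rough sketch, not kernel code: struct
model_blkcg and the model_*() helpers are hypothetical stand-ins for
struct blkcg, blkcg_cgwb_inc(), blkcg_cgwb_dec() and
blkcg_destroy_blkgs(), and C11 atomics stand in for atomic_t. It shows
why the count starts at 1: the base count is dropped at offline time,
so blkg destruction cannot run before offline and runs only after the
last cgwb has been released.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for struct blkcg (not the kernel type). */
struct model_blkcg {
	atomic_int nr_cgwbs;	/* biased count: starts at 1, see model_init() */
	bool blkgs_destroyed;	/* set once the stage 2 teardown has run */
};

void model_init(struct model_blkcg *b)
{
	/* Base count of 1, mirroring atomic_set(&blkcg->nr_cgwbs, 1). */
	atomic_init(&b->nr_cgwbs, 1);
	b->blkgs_destroyed = false;
}

/* Stand-in for blkcg_destroy_blkgs(): only records that it ran. */
void model_destroy_blkgs(struct model_blkcg *b)
{
	b->blkgs_destroyed = true;
}

/* Stand-in for blkcg_cgwb_inc(): one more active cgwb. */
void model_cgwb_inc(struct model_blkcg *b)
{
	atomic_fetch_add(&b->nr_cgwbs, 1);
}

/*
 * Stand-in for blkcg_cgwb_dec().  atomic_fetch_sub() returns the value
 * held before the subtraction, so a previous value of 1 means this call
 * brought the count to zero and must trigger the teardown.
 */
void model_cgwb_dec(struct model_blkcg *b)
{
	int prev = atomic_fetch_sub(&b->nr_cgwbs, 1);

	assert(prev > 0);	/* an unbalanced dec would be a bug */
	if (prev == 1)
		model_destroy_blkgs(b);
}
```

With two cgwbs created (two model_cgwb_inc() calls), the offline path
calling model_cgwb_dec() once leaves the count at 2, so nothing is
destroyed yet. Only the second of the two release-path decrements
brings the count to zero and runs model_destroy_blkgs().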
 block/blk-cgroup.c         | 53 ++++++++++++++++++++++++++++++++------
 include/linux/blk-cgroup.h | 29 +++++++++++++++++++++
 mm/backing-dev.c           |  5 ++++
 3 files changed, 79 insertions(+), 8 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 2998e4f095d1..d7114308a480 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1042,21 +1042,59 @@ static struct cftype blkcg_legacy_files[] = {
 	{ }	/* terminate */
 };
 
+/*
+ * blkcg destruction is a three-stage process.
+ *
+ * 1. Destruction starts.  The blkcg_css_offline() callback is invoked
+ *    which offlines writeback.  Here we tie the next stage of blkg destruction
+ *    to the completion of writeback associated with the blkcg.  This lets us
+ *    avoid punting potentially large amounts of outstanding writeback to root
+ *    while maintaining any ongoing policies.  The next stage is triggered when
+ *    the nr_cgwbs count goes to zero.
+ *
+ * 2. When the nr_cgwbs count goes to zero, blkcg_destroy_blkgs() is called
+ *    and handles the destruction of blkgs.  Here the css reference held by
+ *    the blkg is put back eventually allowing blkcg_css_free() to be called.
+ *    This work may occur in cgwb_release_workfn() on the cgwb_release
+ *    workqueue.  Any submitted ios that fail to get the blkg ref will be
+ *    punted to the root_blkg.
+ *
+ * 3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
+ *    This finally frees the blkcg.
+ */
+
 /**
  * blkcg_css_offline - cgroup css_offline callback
  * @css: css of interest
  *
- * This function is called when @css is about to go away and responsible
- * for shooting down all blkgs associated with @css.  blkgs should be
- * removed while holding both q and blkcg locks.  As blkcg lock is nested
- * inside q lock, this function performs reverse double lock dancing.
- *
- * This is the blkcg counterpart of ioc_release_fn().
+ * This function is called when @css is about to go away.  Here the cgwbs are
+ * offlined first and only once writeback associated with the blkcg has
+ * finished do we start step 2 (see above).
  */
 static void blkcg_css_offline(struct cgroup_subsys_state *css)
 {
 	struct blkcg *blkcg = css_to_blkcg(css);
 
+	/* this prevents anyone from attaching or migrating to this blkcg */
+	wb_blkcg_offline(blkcg);
+
+	/* allow the count to go to zero */
+	blkcg_cgwb_dec(blkcg);
+}
+
+/**
+ * blkcg_destroy_blkgs - responsible for shooting down blkgs
+ * @blkcg: blkcg of interest
+ *
+ * blkgs should be removed while holding both q and blkcg locks.  As blkcg lock
+ * is nested inside q lock, this function performs reverse double lock dancing.
+ * Destroying the blkgs releases the reference held on the blkcg's css allowing
+ * blkcg_css_free to eventually be called.
+ *
+ * This is the blkcg counterpart of ioc_release_fn().
+ */
+void blkcg_destroy_blkgs(struct blkcg *blkcg)
+{
 	spin_lock_irq(&blkcg->lock);
 
 	while (!hlist_empty(&blkcg->blkg_list)) {
@@ -1075,8 +1113,6 @@ static void blkcg_css_offline(struct cgroup_subsys_state *css)
 	}
 
 	spin_unlock_irq(&blkcg->lock);
-
-	wb_blkcg_offline(blkcg);
 }
 
 static void blkcg_css_free(struct cgroup_subsys_state *css)
@@ -1146,6 +1182,7 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 	INIT_HLIST_HEAD(&blkcg->blkg_list);
 #ifdef CONFIG_CGROUP_WRITEBACK
 	INIT_LIST_HEAD(&blkcg->cgwb_list);
+	atomic_set(&blkcg->nr_cgwbs, 1);
 #endif
 	list_add_tail(&blkcg->all_blkcgs_node, &all_blkcgs);
 
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 1615cdd4c797..c7386464ec4c 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -56,6 +56,7 @@ struct blkcg {
 	struct list_head		all_blkcgs_node;
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct list_head		cgwb_list;
+	atomic_t			nr_cgwbs;
 #endif
 };
 
@@ -386,6 +387,34 @@ static inline struct blkcg *cpd_to_blkcg(struct blkcg_policy_data *cpd)
 	return cpd ? cpd->blkcg : NULL;
 }
 
+/**
+ * blkcg_cgwb_inc - increment the count for cgwb_list
+ * @blkcg: blkcg of interest
+ *
+ * This is used to count the number of active wb's related to a blkcg.
+ */
+static inline void blkcg_cgwb_inc(struct blkcg *blkcg)
+{
+	atomic_inc(&blkcg->nr_cgwbs);
+}
+
+extern void blkcg_destroy_blkgs(struct blkcg *blkcg);
+
+/**
+ * blkcg_cgwb_dec - decrement the count for cgwb_list
+ * @blkcg: blkcg of interest
+ *
+ * This is used to count the number of active wb's related to a blkcg.
+ * When this count goes to zero, all active wbs have finished, so the
+ * blkcg can be destroyed.  This does blkg destruction if nr_cgwbs
+ * drops to zero.
+ */
+static inline void blkcg_cgwb_dec(struct blkcg *blkcg)
+{
+	if (atomic_dec_and_test(&blkcg->nr_cgwbs))
+		blkcg_destroy_blkgs(blkcg);
+}
+
 /**
  * blkg_path - format cgroup path of blkg
  * @blkg: blkg of interest
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 2e5d3df0853d..92342d38f0c6 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -494,6 +494,7 @@ static void cgwb_release_workfn(struct work_struct *work)
 {
 	struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
 						release_work);
+	struct blkcg *blkcg = css_to_blkcg(wb->blkcg_css);
 
 	mutex_lock(&wb->bdi->cgwb_release_mutex);
 	wb_shutdown(wb);
@@ -502,6 +503,9 @@ static void cgwb_release_workfn(struct work_struct *work)
 	css_put(wb->blkcg_css);
 	mutex_unlock(&wb->bdi->cgwb_release_mutex);
 
+	/* this triggers destruction of blkgs if nr_cgwbs becomes zero */
+	blkcg_cgwb_dec(blkcg);
+
 	fprop_local_destroy_percpu(&wb->memcg_completions);
 	percpu_ref_exit(&wb->refcnt);
 	wb_exit(wb);
@@ -600,6 +604,7 @@ static int cgwb_create(struct backing_dev_info *bdi,
 			list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list);
 			list_add(&wb->memcg_node, memcg_cgwb_list);
 			list_add(&wb->blkcg_node, blkcg_cgwb_list);
+			blkcg_cgwb_inc(blkcg);
 			css_get(memcg_css);
 			css_get(blkcg_css);
 		}
-- 
2.17.1

