* [PATCHSET v2 btrfs/for-next] blkcg, btrfs: fix cgroup writeback support
From: Tejun Heo @ 2019-06-15 18:24 UTC
  To: dsterba, clm, josef, axboe, jack
  Cc: linux-btrfs, linux-kernel, linux-block, kernel-team

Hello,

Changes from v1[1]

  * 0001-cgroup-blkcg-Prepare-some-symbols-for-module-and-CON.patch
    added.  It collects the symbol exports and dummy function
    definitions needed to fix module and different config builds.

When writeback is executed asynchronously (e.g. for compression), bios
are bounced to and issued by a worker pool shared by all cgroups.  This
leads to significant priority inversions when cgroup IO control is in
use - IOs for a low-priority cgroup can tie down the workers, forcing
higher-priority IOs to wait behind them.

This patchset adds a bio punt mechanism to blkcg and updates btrfs to
issue async IOs through it.  A bio tagged with the REQ_CGROUP_PUNT flag
is bounced to the asynchronous issue context of the associated blkcg on
submit_bio().  As the bios are issued from per-blkcg work items,
there's no concern for priority inversions and it doesn't require
invasive changes to the filesystems.  The mechanism should be
generally useful for IO control support across different filesystems.
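
As a rough illustration (a minimal sketch pieced together from patches
0004 and 0008, not code taken verbatim from the series; "write_flags"
and "blkcg_css" stand in for whatever the caller already holds), an
async worker opts in roughly like this:

    bio->bi_opf = REQ_OP_WRITE | write_flags | REQ_CGROUP_PUNT;
    bio_associate_blkg_from_css(bio, blkcg_css); /* charge the owning cgroup */
    submit_bio(bio); /* punted to the blkcg's worker, returns BLK_QC_T_NONE */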

This patchset contains the following 9 patches.  The first four are
my blkcg patches to implement the needed mechanisms.  The latter five
are Chris Mason's btrfs cleanup and update patches.  Please let me
know how the patches should be routed.  Given that there currently
aren't other users, it's probably easiest to route them all through
the btrfs tree.

 0001-cgroup-blkcg-Prepare-some-symbols-for-module-and-CON.patch
 0002-blkcg-writeback-Add-wbc-no_wbc_acct.patch
 0003-blkcg-writeback-Implement-wbc_blkcg_css.patch
 0004-blkcg-implement-REQ_CGROUP_PUNT.patch
 0005-Btrfs-stop-using-btrfs_schedule_bio.patch
 0006-Btrfs-delete-the-entire-async-bio-submission-framewo.patch
 0007-Btrfs-only-associate-the-locked-page-with-one-async_.patch
 0008-Btrfs-use-REQ_CGROUP_PUNT-for-worker-thread-submitte.patch
 0009-Btrfs-extent_write_locked_range-should-attach-inode-.patch

0001 prepares the symbol exports and dummy definitions needed for
module and !CONFIG_CGROUP builds.

0002-0004 implement wbc->no_wbc_acct, wbc_blkcg_css() and
REQ_CGROUP_PUNT.

0005-0007 are prep patches to simplify / improve async bio submission.

0008 makes btrfs use REQ_CGROUP_PUNT for async bios.

0009 fixes wbc writeback accounting for IOs issued through
extent_write_locked_range().
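
For orientation, here is a rough sketch (assembled from the hunks in
0008 and 0009, not verbatim code) of how the nested wbc used for the
async fallback writes ends up being set up:

    struct writeback_control wbc_writepages = {
        .sync_mode      = WB_SYNC_ALL,
        .range_start    = start,
        .range_end      = end + 1,
        .punt_to_cgroup = 1,  /* wbc_to_write_flags() adds REQ_CGROUP_PUNT */
        .no_wbc_acct    = 1,  /* don't skew foreign-inode detection */
    };

    wbc_attach_fdatawrite_inode(&wbc_writepages, inode);
    /* write out the locked range, then... */
    wbc_detach_inode(&wbc_writepages);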

This patchset is also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-btrfs-cgroup-updates-v2

Thanks, diffstat follows.

 block/blk-cgroup.c          |   54 +++++++++
 block/blk-core.c            |    3 
 fs/btrfs/compression.c      |   16 +-
 fs/btrfs/compression.h      |    3 
 fs/btrfs/ctree.h            |    1 
 fs/btrfs/disk-io.c          |   25 +---
 fs/btrfs/extent_io.c        |   15 +-
 fs/btrfs/inode.c            |   61 ++++++++--
 fs/btrfs/super.c            |    1 
 fs/btrfs/volumes.c          |  264 --------------------------------------------
 fs/btrfs/volumes.h          |   10 -
 fs/fs-writeback.c           |    5 
 include/linux/backing-dev.h |    1 
 include/linux/blk-cgroup.h  |   16 ++
 include/linux/blk_types.h   |   10 +
 include/linux/cgroup.h      |    1 
 include/linux/writeback.h   |   24 +++-
 17 files changed, 198 insertions(+), 312 deletions(-)

--
tejun

[1] http://lkml.kernel.org/r/20190614003350.1178444-1-tj@kernel.org



* [PATCH 1/9] cgroup, blkcg: Prepare some symbols for module and !CONFIG_CGROUP usages
From: Tejun Heo @ 2019-06-15 18:24 UTC
  To: dsterba, clm, josef, axboe, jack
  Cc: linux-btrfs, linux-kernel, linux-block, kernel-team, Tejun Heo

btrfs is going to use css_get()/css_put() and the wbc helpers to
improve cgroup writeback support.  Add a dummy css_get() definition and
export the wbc helpers to prepare for module and !CONFIG_CGROUP builds.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kbuild test robot <lkp@intel.com>
---
 block/blk-cgroup.c     | 1 +
 fs/fs-writeback.c      | 3 +++
 include/linux/cgroup.h | 1 +
 3 files changed, 5 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 617a2b3f7582..07600d3c9520 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -46,6 +46,7 @@ struct blkcg blkcg_root;
 EXPORT_SYMBOL_GPL(blkcg_root);
 
 struct cgroup_subsys_state * const blkcg_root_css = &blkcg_root.css;
+EXPORT_SYMBOL_GPL(blkcg_root_css);
 
 static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 36855c1f8daf..c29cff345b1f 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -269,6 +269,7 @@ void __inode_attach_wb(struct inode *inode, struct page *page)
 	if (unlikely(cmpxchg(&inode->i_wb, NULL, wb)))
 		wb_put(wb);
 }
+EXPORT_SYMBOL_GPL(__inode_attach_wb);
 
 /**
  * locked_inode_to_wb_and_lock_list - determine a locked inode's wb and lock it
@@ -580,6 +581,7 @@ void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
 	if (unlikely(wb_dying(wbc->wb)))
 		inode_switch_wbs(inode, wbc->wb_id);
 }
+EXPORT_SYMBOL_GPL(wbc_attach_and_unlock_inode);
 
 /**
  * wbc_detach_inode - disassociate wbc from inode and perform foreign detection
@@ -699,6 +701,7 @@ void wbc_detach_inode(struct writeback_control *wbc)
 	wb_put(wbc->wb);
 	wbc->wb = NULL;
 }
+EXPORT_SYMBOL_GPL(wbc_detach_inode);
 
 /**
  * wbc_account_io - account IO issued during writeback
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 81f58b4a5418..4cb5d5646986 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -687,6 +687,7 @@ void cgroup_path_from_kernfs_id(const union kernfs_node_id *id,
 struct cgroup_subsys_state;
 struct cgroup;
 
+static inline void css_get(struct cgroup_subsys_state *css) {}
 static inline void css_put(struct cgroup_subsys_state *css) {}
 static inline int cgroup_attach_task_all(struct task_struct *from,
 					 struct task_struct *t) { return 0; }
-- 
2.17.1



* [PATCH 2/9] blkcg, writeback: Add wbc->no_wbc_acct
From: Tejun Heo @ 2019-06-15 18:24 UTC
  To: dsterba, clm, josef, axboe, jack
  Cc: linux-btrfs, linux-kernel, linux-block, kernel-team, Tejun Heo

When writeback IOs are bounced through async layers, the IOs should
only be accounted against the wbc from the original bdi writeback to
avoid confusing cgroup inode ownership arbitration.  Add
wbc->no_wbc_acct to allow disabling wbc accounting.  This will be used
to make btrfs compression work well with cgroup IO control.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/fs-writeback.c         | 2 +-
 include/linux/writeback.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c29cff345b1f..667ba07fffcd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -724,7 +724,7 @@ void wbc_account_io(struct writeback_control *wbc, struct page *page,
 	 * behind a slow cgroup.  Ultimately, we want pageout() to kick off
 	 * regular writeback instead of writing things out itself.
 	 */
-	if (!wbc->wb)
+	if (!wbc->wb || wbc->no_wbc_acct)
 		return;
 
 	id = mem_cgroup_css_from_page(page)->id;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 738a0c24874f..b8f5f000cde4 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -68,6 +68,7 @@ struct writeback_control {
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
 	unsigned for_sync:1;		/* sync(2) WB_SYNC_ALL writeback */
+	unsigned no_wbc_acct:1;		/* skip wbc IO accounting */
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct bdi_writeback *wb;	/* wb this writeback is issued under */
 	struct inode *inode;		/* inode being written out */
-- 
2.17.1



* [PATCH 3/9] blkcg, writeback: Implement wbc_blkcg_css()
From: Tejun Heo @ 2019-06-15 18:24 UTC
  To: dsterba, clm, josef, axboe, jack
  Cc: linux-btrfs, linux-kernel, linux-block, kernel-team, Tejun Heo

Add a helper to determine the target blkcg from wbc.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 include/linux/writeback.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index b8f5f000cde4..800ee031e88a 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -11,6 +11,7 @@
 #include <linux/flex_proportions.h>
 #include <linux/backing-dev-defs.h>
 #include <linux/blk_types.h>
+#include <linux/blk-cgroup.h>
 
 struct bio;
 
@@ -93,6 +94,16 @@ static inline int wbc_to_write_flags(struct writeback_control *wbc)
 	return 0;
 }
 
+static inline struct cgroup_subsys_state *
+wbc_blkcg_css(struct writeback_control *wbc)
+{
+#ifdef CONFIG_CGROUP_WRITEBACK
+	if (wbc->wb)
+		return wbc->wb->blkcg_css;
+#endif
+	return blkcg_root_css;
+}
+
 /*
  * A wb_domain represents a domain that wb's (bdi_writeback's) belong to
  * and are measured against each other in.  There always is one global
-- 
2.17.1



* [PATCH 4/9] blkcg: implement REQ_CGROUP_PUNT
From: Tejun Heo @ 2019-06-15 18:24 UTC
  To: dsterba, clm, josef, axboe, jack
  Cc: linux-btrfs, linux-kernel, linux-block, kernel-team, Tejun Heo

When a shared kthread needs to issue a bio for a cgroup, doing so
synchronously can lead to priority inversions as the kthread can be
trapped waiting for that cgroup.  This patch implements the
REQ_CGROUP_PUNT flag, which makes submit_bio() punt the actual issuing
to a dedicated per-blkcg work item to avoid such priority inversions.

This will be used to fix priority inversions in btrfs compression and
should be generally useful as we grow filesystem support for
comprehensive IO control.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Cc: Chris Mason <clm@fb.com>
---
 block/blk-cgroup.c          | 53 +++++++++++++++++++++++++++++++++++++
 block/blk-core.c            |  3 +++
 include/linux/backing-dev.h |  1 +
 include/linux/blk-cgroup.h  | 16 ++++++++++-
 include/linux/blk_types.h   | 10 +++++++
 include/linux/writeback.h   | 12 ++++++---
 6 files changed, 91 insertions(+), 4 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 07600d3c9520..48239bb93fbe 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -53,6 +53,7 @@ static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
 static LIST_HEAD(all_blkcgs);		/* protected by blkcg_pol_mutex */
 
 static bool blkcg_debug_stats = false;
+static struct workqueue_struct *blkcg_punt_bio_wq;
 
 static bool blkcg_policy_enabled(struct request_queue *q,
 				 const struct blkcg_policy *pol)
@@ -88,6 +89,8 @@ static void __blkg_release(struct rcu_head *rcu)
 
 	percpu_ref_exit(&blkg->refcnt);
 
+	WARN_ON(!bio_list_empty(&blkg->async_bios));
+
 	/* release the blkcg and parent blkg refs this blkg has been holding */
 	css_put(&blkg->blkcg->css);
 	if (blkg->parent)
@@ -113,6 +116,23 @@ static void blkg_release(struct percpu_ref *ref)
 	call_rcu(&blkg->rcu_head, __blkg_release);
 }
 
+static void blkg_async_bio_workfn(struct work_struct *work)
+{
+	struct blkcg_gq *blkg = container_of(work, struct blkcg_gq,
+					     async_bio_work);
+	struct bio_list bios = BIO_EMPTY_LIST;
+	struct bio *bio;
+
+	/* as long as there are pending bios, @blkg can't go away */
+	spin_lock_bh(&blkg->async_bio_lock);
+	bio_list_merge(&bios, &blkg->async_bios);
+	bio_list_init(&blkg->async_bios);
+	spin_unlock_bh(&blkg->async_bio_lock);
+
+	while ((bio = bio_list_pop(&bios)))
+		submit_bio(bio);
+}
+
 /**
  * blkg_alloc - allocate a blkg
  * @blkcg: block cgroup the new blkg is associated with
@@ -138,6 +158,9 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 
 	blkg->q = q;
 	INIT_LIST_HEAD(&blkg->q_node);
+	spin_lock_init(&blkg->async_bio_lock);
+	bio_list_init(&blkg->async_bios);
+	INIT_WORK(&blkg->async_bio_work, blkg_async_bio_workfn);
 	blkg->blkcg = blkcg;
 
 	for (i = 0; i < BLKCG_MAX_POLS; i++) {
@@ -1583,6 +1606,25 @@ void blkcg_policy_unregister(struct blkcg_policy *pol)
 }
 EXPORT_SYMBOL_GPL(blkcg_policy_unregister);
 
+bool __blkcg_punt_bio_submit(struct bio *bio)
+{
+	struct blkcg_gq *blkg = bio->bi_blkg;
+
+	/* consume the flag first */
+	bio->bi_opf &= ~REQ_CGROUP_PUNT;
+
+	/* never bounce for the root cgroup */
+	if (!blkg->parent)
+		return false;
+
+	spin_lock_bh(&blkg->async_bio_lock);
+	bio_list_add(&blkg->async_bios, bio);
+	spin_unlock_bh(&blkg->async_bio_lock);
+
+	queue_work(blkcg_punt_bio_wq, &blkg->async_bio_work);
+	return true;
+}
+
 /*
  * Scale the accumulated delay based on how long it has been since we updated
  * the delay.  We only call this when we are adding delay, in case it's been a
@@ -1783,5 +1825,16 @@ void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta)
 	atomic64_add(delta, &blkg->delay_nsec);
 }
 
+static int __init blkcg_init(void)
+{
+	blkcg_punt_bio_wq = alloc_workqueue("blkcg_punt_bio",
+					    WQ_MEM_RECLAIM | WQ_FREEZABLE |
+					    WQ_UNBOUND | WQ_SYSFS, 0);
+	if (!blkcg_punt_bio_wq)
+		return -ENOMEM;
+	return 0;
+}
+subsys_initcall(blkcg_init);
+
 module_param(blkcg_debug_stats, bool, 0644);
 MODULE_PARM_DESC(blkcg_debug_stats, "True if you want debug stats, false if not");
diff --git a/block/blk-core.c b/block/blk-core.c
index a55389ba8779..5879c1ec044d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1165,6 +1165,9 @@ EXPORT_SYMBOL_GPL(direct_make_request);
  */
 blk_qc_t submit_bio(struct bio *bio)
 {
+	if (blkcg_punt_bio_submit(bio))
+		return BLK_QC_T_NONE;
+
 	/*
 	 * If it's a regular read/write or a barrier with data attached,
 	 * go through the normal accounting stuff before submission.
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index f9b029180241..35b31d176f74 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -48,6 +48,7 @@ extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
 
 extern struct workqueue_struct *bdi_wq;
+extern struct workqueue_struct *bdi_async_bio_wq;
 
 static inline bool wb_has_dirty_io(struct bdi_writeback *wb)
 {
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 76c61318fda5..ffb2f88e87c6 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -134,13 +134,17 @@ struct blkcg_gq {
 
 	struct blkg_policy_data		*pd[BLKCG_MAX_POLS];
 
-	struct rcu_head			rcu_head;
+	spinlock_t			async_bio_lock;
+	struct bio_list			async_bios;
+	struct work_struct		async_bio_work;
 
 	atomic_t			use_delay;
 	atomic64_t			delay_nsec;
 	atomic64_t			delay_start;
 	u64				last_delay;
 	int				last_use;
+
+	struct rcu_head			rcu_head;
 };
 
 typedef struct blkcg_policy_data *(blkcg_pol_alloc_cpd_fn)(gfp_t gfp);
@@ -763,6 +767,15 @@ static inline bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg
 				  struct bio *bio) { return false; }
 #endif
 
+bool __blkcg_punt_bio_submit(struct bio *bio);
+
+static inline bool blkcg_punt_bio_submit(struct bio *bio)
+{
+	if (bio->bi_opf & REQ_CGROUP_PUNT)
+		return __blkcg_punt_bio_submit(bio);
+	else
+		return false;
+}
 
 static inline void blkcg_bio_issue_init(struct bio *bio)
 {
@@ -910,6 +923,7 @@ static inline char *blkg_path(struct blkcg_gq *blkg) { return NULL; }
 static inline void blkg_get(struct blkcg_gq *blkg) { }
 static inline void blkg_put(struct blkcg_gq *blkg) { }
 
+static inline bool blkcg_punt_bio_submit(struct bio *bio) { return false; }
 static inline void blkcg_bio_issue_init(struct bio *bio) { }
 static inline bool blkcg_bio_issue_check(struct request_queue *q,
 					 struct bio *bio) { return true; }
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 791fee35df88..e8b42a786315 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -321,6 +321,14 @@ enum req_flag_bits {
 	__REQ_RAHEAD,		/* read ahead, can fail anytime */
 	__REQ_BACKGROUND,	/* background IO */
 	__REQ_NOWAIT,           /* Don't wait if request will block */
+	/*
+	 * When a shared kthread needs to issue a bio for a cgroup, doing
+	 * so synchronously can lead to priority inversions as the kthread
+	 * can be trapped waiting for that cgroup.  CGROUP_PUNT flag makes
+	 * submit_bio() punt the actual issuing to a dedicated per-blkcg
+	 * work item to avoid such priority inversions.
+	 */
+	__REQ_CGROUP_PUNT,
 
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
@@ -347,6 +355,8 @@ enum req_flag_bits {
 #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
 #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
+#define REQ_CGROUP_PUNT		(1ULL << __REQ_CGROUP_PUNT)
+
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
 #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 800ee031e88a..be602c42aab8 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -70,6 +70,7 @@ struct writeback_control {
 	unsigned range_cyclic:1;	/* range_start is cyclic */
 	unsigned for_sync:1;		/* sync(2) WB_SYNC_ALL writeback */
 	unsigned no_wbc_acct:1;		/* skip wbc IO accounting */
+	unsigned punt_to_cgroup:1;	/* cgrp punting, see __REQ_CGROUP_PUNT */
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct bdi_writeback *wb;	/* wb this writeback is issued under */
 	struct inode *inode;		/* inode being written out */
@@ -86,12 +87,17 @@ struct writeback_control {
 
 static inline int wbc_to_write_flags(struct writeback_control *wbc)
 {
+	int flags = 0;
+
+	if (wbc->punt_to_cgroup)
+		flags = REQ_CGROUP_PUNT;
+
 	if (wbc->sync_mode == WB_SYNC_ALL)
-		return REQ_SYNC;
+		flags |= REQ_SYNC;
 	else if (wbc->for_kupdate || wbc->for_background)
-		return REQ_BACKGROUND;
+		flags |= REQ_BACKGROUND;
 
-	return 0;
+	return flags;
 }
 
 static inline struct cgroup_subsys_state *
-- 
2.17.1



* [PATCH 5/9] Btrfs: stop using btrfs_schedule_bio()
From: Tejun Heo @ 2019-06-15 18:24 UTC
  To: dsterba, clm, josef, axboe, jack
  Cc: linux-btrfs, linux-kernel, linux-block, kernel-team

From: Chris Mason <clm@fb.com>

btrfs_schedule_bio() hands IO off to a helper thread to do the actual
submit_bio() call.  This has been used to make sure async crc and
compression helpers don't get stuck on IO submission.  To maintain good
performance, over time the IO submission threads duplicated some IO
scheduler characteristics such as high- and low-priority IOs, and they
also made some ugly assumptions about request allocation batch sizes.

All of this costs at least one extra context switch during IO
submission and doesn't fit well with the modern blk-mq IO stack.  So,
this commit stops using btrfs_schedule_bio().  We may need to adjust
the number of async helper threads for crcs and compression, but long
term it's a better path.

Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/compression.c |  8 +++---
 fs/btrfs/disk-io.c     |  6 ++---
 fs/btrfs/inode.c       |  6 ++---
 fs/btrfs/volumes.c     | 55 +++---------------------------------------
 fs/btrfs/volumes.h     |  2 +-
 5 files changed, 15 insertions(+), 62 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 4ec1df369e47..873261b932b8 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -355,7 +355,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 				BUG_ON(ret); /* -ENOMEM */
 			}
 
-			ret = btrfs_map_bio(fs_info, bio, 0, 1);
+			ret = btrfs_map_bio(fs_info, bio, 0);
 			if (ret) {
 				bio->bi_status = ret;
 				bio_endio(bio);
@@ -385,7 +385,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 		BUG_ON(ret); /* -ENOMEM */
 	}
 
-	ret = btrfs_map_bio(fs_info, bio, 0, 1);
+	ret = btrfs_map_bio(fs_info, bio, 0);
 	if (ret) {
 		bio->bi_status = ret;
 		bio_endio(bio);
@@ -638,7 +638,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 			sums += DIV_ROUND_UP(comp_bio->bi_iter.bi_size,
 					     fs_info->sectorsize);
 
-			ret = btrfs_map_bio(fs_info, comp_bio, mirror_num, 0);
+			ret = btrfs_map_bio(fs_info, comp_bio, mirror_num);
 			if (ret) {
 				comp_bio->bi_status = ret;
 				bio_endio(comp_bio);
@@ -662,7 +662,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 		BUG_ON(ret); /* -ENOMEM */
 	}
 
-	ret = btrfs_map_bio(fs_info, comp_bio, mirror_num, 0);
+	ret = btrfs_map_bio(fs_info, comp_bio, mirror_num);
 	if (ret) {
 		comp_bio->bi_status = ret;
 		bio_endio(comp_bio);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 663efce22d98..b34240406f36 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -800,7 +800,7 @@ static void run_one_async_done(struct btrfs_work *work)
 	}
 
 	ret = btrfs_map_bio(btrfs_sb(inode->i_sb), async->bio,
-			async->mirror_num, 1);
+			    async->mirror_num);
 	if (ret) {
 		async->bio->bi_status = ret;
 		bio_endio(async->bio);
@@ -901,12 +901,12 @@ static blk_status_t btree_submit_bio_hook(struct inode *inode, struct bio *bio,
 					  BTRFS_WQ_ENDIO_METADATA);
 		if (ret)
 			goto out_w_error;
-		ret = btrfs_map_bio(fs_info, bio, mirror_num, 0);
+		ret = btrfs_map_bio(fs_info, bio, mirror_num);
 	} else if (!async) {
 		ret = btree_csum_one_bio(bio);
 		if (ret)
 			goto out_w_error;
-		ret = btrfs_map_bio(fs_info, bio, mirror_num, 0);
+		ret = btrfs_map_bio(fs_info, bio, mirror_num);
 	} else {
 		/*
 		 * kthread helpers are used to submit writes so that
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d519c3520e87..91b161fb1521 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2032,7 +2032,7 @@ static blk_status_t btrfs_submit_bio_hook(struct inode *inode, struct bio *bio,
 	}
 
 mapit:
-	ret = btrfs_map_bio(fs_info, bio, mirror_num, 0);
+	ret = btrfs_map_bio(fs_info, bio, mirror_num);
 
 out:
 	if (ret) {
@@ -7764,7 +7764,7 @@ static inline blk_status_t submit_dio_repair_bio(struct inode *inode,
 	if (ret)
 		return ret;
 
-	ret = btrfs_map_bio(fs_info, bio, mirror_num, 0);
+	ret = btrfs_map_bio(fs_info, bio, mirror_num);
 
 	return ret;
 }
@@ -8295,7 +8295,7 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
 			goto err;
 	}
 map:
-	ret = btrfs_map_bio(fs_info, bio, 0, 0);
+	ret = btrfs_map_bio(fs_info, bio, 0);
 err:
 	return ret;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1c2a6e4b39da..72326cc23985 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6386,52 +6386,8 @@ static void btrfs_end_bio(struct bio *bio)
 	}
 }
 
-/*
- * see run_scheduled_bios for a description of why bios are collected for
- * async submit.
- *
- * This will add one bio to the pending list for a device and make sure
- * the work struct is scheduled.
- */
-static noinline void btrfs_schedule_bio(struct btrfs_device *device,
-					struct bio *bio)
-{
-	struct btrfs_fs_info *fs_info = device->fs_info;
-	int should_queue = 1;
-	struct btrfs_pending_bios *pending_bios;
-
-	/* don't bother with additional async steps for reads, right now */
-	if (bio_op(bio) == REQ_OP_READ) {
-		btrfsic_submit_bio(bio);
-		return;
-	}
-
-	WARN_ON(bio->bi_next);
-	bio->bi_next = NULL;
-
-	spin_lock(&device->io_lock);
-	if (op_is_sync(bio->bi_opf))
-		pending_bios = &device->pending_sync_bios;
-	else
-		pending_bios = &device->pending_bios;
-
-	if (pending_bios->tail)
-		pending_bios->tail->bi_next = bio;
-
-	pending_bios->tail = bio;
-	if (!pending_bios->head)
-		pending_bios->head = bio;
-	if (device->running_pending)
-		should_queue = 0;
-
-	spin_unlock(&device->io_lock);
-
-	if (should_queue)
-		btrfs_queue_work(fs_info->submit_workers, &device->work);
-}
-
 static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio,
-			      u64 physical, int dev_nr, int async)
+			      u64 physical, int dev_nr)
 {
 	struct btrfs_device *dev = bbio->stripes[dev_nr].dev;
 	struct btrfs_fs_info *fs_info = bbio->fs_info;
@@ -6449,10 +6405,7 @@ static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio,
 
 	btrfs_bio_counter_inc_noblocked(fs_info);
 
-	if (async)
-		btrfs_schedule_bio(dev, bio);
-	else
-		btrfsic_submit_bio(bio);
+	btrfsic_submit_bio(bio);
 }
 
 static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
@@ -6473,7 +6426,7 @@ static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
 }
 
 blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
-			   int mirror_num, int async_submit)
+			   int mirror_num)
 {
 	struct btrfs_device *dev;
 	struct bio *first_bio = bio;
@@ -6542,7 +6495,7 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 			bio = first_bio;
 
 		submit_stripe_bio(bbio, bio, bbio->stripes[dev_nr].physical,
-				  dev_nr, async_submit);
+				  dev_nr);
 	}
 	btrfs_bio_counter_dec(fs_info);
 	return BLK_STS_OK;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index b8a0e8d0672d..8c7bd79b234a 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -415,7 +415,7 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans, u64 type);
 void btrfs_mapping_init(struct btrfs_mapping_tree *tree);
 void btrfs_mapping_tree_free(struct btrfs_mapping_tree *tree);
 blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
-			   int mirror_num, int async_submit);
+			   int mirror_num);
 int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
 		       fmode_t flags, void *holder);
 struct btrfs_device *btrfs_scan_one_device(const char *path,
-- 
2.17.1



* [PATCH 6/9] Btrfs: delete the entire async bio submission framework
From: Tejun Heo @ 2019-06-15 18:24 UTC
  To: dsterba, clm, josef, axboe, jack
  Cc: linux-btrfs, linux-kernel, linux-block, kernel-team

From: Chris Mason <clm@fb.com>

Now that we're not using btrfs_schedule_bio() anymore, delete all the
code that supported it.

Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.h   |   1 -
 fs/btrfs/disk-io.c |  13 +--
 fs/btrfs/super.c   |   1 -
 fs/btrfs/volumes.c | 209 ---------------------------------------------
 fs/btrfs/volumes.h |   8 --
 5 files changed, 1 insertion(+), 231 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b81c331b28fa..2a5ba0f85ed3 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -989,7 +989,6 @@ struct btrfs_fs_info {
 	struct btrfs_workqueue *endio_meta_write_workers;
 	struct btrfs_workqueue *endio_write_workers;
 	struct btrfs_workqueue *endio_freespace_worker;
-	struct btrfs_workqueue *submit_workers;
 	struct btrfs_workqueue *caching_workers;
 	struct btrfs_workqueue *readahead_workers;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b34240406f36..9dbe4ba3995d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2028,7 +2028,6 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 	btrfs_destroy_workqueue(fs_info->rmw_workers);
 	btrfs_destroy_workqueue(fs_info->endio_write_workers);
 	btrfs_destroy_workqueue(fs_info->endio_freespace_worker);
-	btrfs_destroy_workqueue(fs_info->submit_workers);
 	btrfs_destroy_workqueue(fs_info->delayed_workers);
 	btrfs_destroy_workqueue(fs_info->caching_workers);
 	btrfs_destroy_workqueue(fs_info->readahead_workers);
@@ -2194,16 +2193,6 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
 	fs_info->caching_workers =
 		btrfs_alloc_workqueue(fs_info, "cache", flags, max_active, 0);
 
-	/*
-	 * a higher idle thresh on the submit workers makes it much more
-	 * likely that bios will be send down in a sane order to the
-	 * devices
-	 */
-	fs_info->submit_workers =
-		btrfs_alloc_workqueue(fs_info, "submit", flags,
-				      min_t(u64, fs_devices->num_devices,
-					    max_active), 64);
-
 	fs_info->fixup_workers =
 		btrfs_alloc_workqueue(fs_info, "fixup", flags, 1, 0);
 
@@ -2246,7 +2235,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
 					    max_active), 8);
 
 	if (!(fs_info->workers && fs_info->delalloc_workers &&
-	      fs_info->submit_workers && fs_info->flush_workers &&
+	      fs_info->flush_workers &&
 	      fs_info->endio_workers && fs_info->endio_meta_workers &&
 	      fs_info->endio_meta_write_workers &&
 	      fs_info->endio_repair_workers &&
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 2c66d9ea6a3b..3fb86a7bfdf7 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1668,7 +1668,6 @@ static void btrfs_resize_thread_pool(struct btrfs_fs_info *fs_info,
 
 	btrfs_workqueue_set_max(fs_info->workers, new_pool_size);
 	btrfs_workqueue_set_max(fs_info->delalloc_workers, new_pool_size);
-	btrfs_workqueue_set_max(fs_info->submit_workers, new_pool_size);
 	btrfs_workqueue_set_max(fs_info->caching_workers, new_pool_size);
 	btrfs_workqueue_set_max(fs_info->endio_workers, new_pool_size);
 	btrfs_workqueue_set_max(fs_info->endio_meta_workers, new_pool_size);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 72326cc23985..fc3a16d87869 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -509,212 +509,6 @@ btrfs_get_bdev_and_sb(const char *device_path, fmode_t flags, void *holder,
 	return ret;
 }
 
-static void requeue_list(struct btrfs_pending_bios *pending_bios,
-			struct bio *head, struct bio *tail)
-{
-
-	struct bio *old_head;
-
-	old_head = pending_bios->head;
-	pending_bios->head = head;
-	if (pending_bios->tail)
-		tail->bi_next = old_head;
-	else
-		pending_bios->tail = tail;
-}
-
-/*
- * we try to collect pending bios for a device so we don't get a large
- * number of procs sending bios down to the same device.  This greatly
- * improves the schedulers ability to collect and merge the bios.
- *
- * But, it also turns into a long list of bios to process and that is sure
- * to eventually make the worker thread block.  The solution here is to
- * make some progress and then put this work struct back at the end of
- * the list if the block device is congested.  This way, multiple devices
- * can make progress from a single worker thread.
- */
-static noinline void run_scheduled_bios(struct btrfs_device *device)
-{
-	struct btrfs_fs_info *fs_info = device->fs_info;
-	struct bio *pending;
-	struct backing_dev_info *bdi;
-	struct btrfs_pending_bios *pending_bios;
-	struct bio *tail;
-	struct bio *cur;
-	int again = 0;
-	unsigned long num_run;
-	unsigned long batch_run = 0;
-	unsigned long last_waited = 0;
-	int force_reg = 0;
-	int sync_pending = 0;
-	struct blk_plug plug;
-
-	/*
-	 * this function runs all the bios we've collected for
-	 * a particular device.  We don't want to wander off to
-	 * another device without first sending all of these down.
-	 * So, setup a plug here and finish it off before we return
-	 */
-	blk_start_plug(&plug);
-
-	bdi = device->bdev->bd_bdi;
-
-loop:
-	spin_lock(&device->io_lock);
-
-loop_lock:
-	num_run = 0;
-
-	/* take all the bios off the list at once and process them
-	 * later on (without the lock held).  But, remember the
-	 * tail and other pointers so the bios can be properly reinserted
-	 * into the list if we hit congestion
-	 */
-	if (!force_reg && device->pending_sync_bios.head) {
-		pending_bios = &device->pending_sync_bios;
-		force_reg = 1;
-	} else {
-		pending_bios = &device->pending_bios;
-		force_reg = 0;
-	}
-
-	pending = pending_bios->head;
-	tail = pending_bios->tail;
-	WARN_ON(pending && !tail);
-
-	/*
-	 * if pending was null this time around, no bios need processing
-	 * at all and we can stop.  Otherwise it'll loop back up again
-	 * and do an additional check so no bios are missed.
-	 *
-	 * device->running_pending is used to synchronize with the
-	 * schedule_bio code.
-	 */
-	if (device->pending_sync_bios.head == NULL &&
-	    device->pending_bios.head == NULL) {
-		again = 0;
-		device->running_pending = 0;
-	} else {
-		again = 1;
-		device->running_pending = 1;
-	}
-
-	pending_bios->head = NULL;
-	pending_bios->tail = NULL;
-
-	spin_unlock(&device->io_lock);
-
-	while (pending) {
-
-		rmb();
-		/* we want to work on both lists, but do more bios on the
-		 * sync list than the regular list
-		 */
-		if ((num_run > 32 &&
-		    pending_bios != &device->pending_sync_bios &&
-		    device->pending_sync_bios.head) ||
-		   (num_run > 64 && pending_bios == &device->pending_sync_bios &&
-		    device->pending_bios.head)) {
-			spin_lock(&device->io_lock);
-			requeue_list(pending_bios, pending, tail);
-			goto loop_lock;
-		}
-
-		cur = pending;
-		pending = pending->bi_next;
-		cur->bi_next = NULL;
-
-		BUG_ON(atomic_read(&cur->__bi_cnt) == 0);
-
-		/*
-		 * if we're doing the sync list, record that our
-		 * plug has some sync requests on it
-		 *
-		 * If we're doing the regular list and there are
-		 * sync requests sitting around, unplug before
-		 * we add more
-		 */
-		if (pending_bios == &device->pending_sync_bios) {
-			sync_pending = 1;
-		} else if (sync_pending) {
-			blk_finish_plug(&plug);
-			blk_start_plug(&plug);
-			sync_pending = 0;
-		}
-
-		btrfsic_submit_bio(cur);
-		num_run++;
-		batch_run++;
-
-		cond_resched();
-
-		/*
-		 * we made progress, there is more work to do and the bdi
-		 * is now congested.  Back off and let other work structs
-		 * run instead
-		 */
-		if (pending && bdi_write_congested(bdi) && batch_run > 8 &&
-		    fs_info->fs_devices->open_devices > 1) {
-			struct io_context *ioc;
-
-			ioc = current->io_context;
-
-			/*
-			 * the main goal here is that we don't want to
-			 * block if we're going to be able to submit
-			 * more requests without blocking.
-			 *
-			 * This code does two great things, it pokes into
-			 * the elevator code from a filesystem _and_
-			 * it makes assumptions about how batching works.
-			 */
-			if (ioc && ioc->nr_batch_requests > 0 &&
-			    time_before(jiffies, ioc->last_waited + HZ/50UL) &&
-			    (last_waited == 0 ||
-			     ioc->last_waited == last_waited)) {
-				/*
-				 * we want to go through our batch of
-				 * requests and stop.  So, we copy out
-				 * the ioc->last_waited time and test
-				 * against it before looping
-				 */
-				last_waited = ioc->last_waited;
-				cond_resched();
-				continue;
-			}
-			spin_lock(&device->io_lock);
-			requeue_list(pending_bios, pending, tail);
-			device->running_pending = 1;
-
-			spin_unlock(&device->io_lock);
-			btrfs_queue_work(fs_info->submit_workers,
-					 &device->work);
-			goto done;
-		}
-	}
-
-	cond_resched();
-	if (again)
-		goto loop;
-
-	spin_lock(&device->io_lock);
-	if (device->pending_bios.head || device->pending_sync_bios.head)
-		goto loop_lock;
-	spin_unlock(&device->io_lock);
-
-done:
-	blk_finish_plug(&plug);
-}
-
-static void pending_bios_fn(struct btrfs_work *work)
-{
-	struct btrfs_device *device;
-
-	device = container_of(work, struct btrfs_device, work);
-	run_scheduled_bios(device);
-}
-
 static bool device_path_matched(const char *path, struct btrfs_device *device)
 {
 	int found;
@@ -6599,9 +6393,6 @@ struct btrfs_device *btrfs_alloc_device(struct btrfs_fs_info *fs_info,
 	else
 		generate_random_uuid(dev->uuid);
 
-	btrfs_init_work(&dev->work, btrfs_submit_helper,
-			pending_bios_fn, NULL, NULL);
-
 	return dev;
 }
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 8c7bd79b234a..231f50dd107d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -18,10 +18,6 @@ extern struct mutex uuid_mutex;
 #define BTRFS_STRIPE_LEN	SZ_64K
 
 struct buffer_head;
-struct btrfs_pending_bios {
-	struct bio *head;
-	struct bio *tail;
-};
 
 /*
  * Use sequence counter to get consistent device stat data on
@@ -55,10 +51,6 @@ struct btrfs_device {
 
 	spinlock_t io_lock ____cacheline_aligned;
 	int running_pending;
-	/* regular prio bios */
-	struct btrfs_pending_bios pending_bios;
-	/* sync bios */
-	struct btrfs_pending_bios pending_sync_bios;
 
 	struct block_device *bdev;
 
-- 
2.17.1



* [PATCH 7/9] Btrfs: only associate the locked page with one async_cow struct
From: Tejun Heo @ 2019-06-15 18:24 UTC
  To: dsterba, clm, josef, axboe, jack
  Cc: linux-btrfs, linux-kernel, linux-block, kernel-team

From: Chris Mason <clm@fb.com>

The btrfs writepages function collects a large range of pages flagged
for delayed allocation, and then sends them down through the COW code
for processing.  When compression is on, we allocate one async_cow
structure for every 512K, and then run those pages through the
compression code for IO submission.

writepages starts all of this off with a single page, locked by
the original call to extent_write_cache_pages(), and it's important to
keep track of this page because it has already been through
clear_page_dirty_for_io().

The btrfs async_cow struct has a pointer to the locked_page, and when
we're redirtying the page because compression had to fall back to
uncompressed IO, we use page->index to decide if a given async_cow
struct really owns that page.

But, this is racy.  If a given delalloc range is broken up into two
async_cows (cowA and cowB), we can end up with something like this:

compress_file_range(cowA)
submit_compress_extents(cowA)
submit compressed bios(cowA)
put_page(locked_page)

				compress_file_range(cowB)
				...

The end result is that cowA is completed and cleaned up before cowB even
starts processing.  This means we can free locked_page and reuse it
elsewhere.  If we get really lucky, it'll have the same page->index in
its new home as it did before.

While we're processing cowB, we might decide we need to fall back to
uncompressed IO, and so compress_file_range() will call
__set_page_dirty_nobuffers() on cowB->locked_page.

Without cgroups in use, this creates a phantom dirty page, which
isn't great but isn't the end of the world.  With cgroups in use, we
might crash in the accounting code because page->mapping->i_wb isn't
set.

[ 8308.523110] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
[ 8308.531084] IP: percpu_counter_add_batch+0x11/0x70
[ 8308.538371] PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
[ 8308.541750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 8308.551948] CPU: 16 PID: 2172 Comm: rm Not tainted
[ 8308.566883] RIP: 0010:percpu_counter_add_batch+0x11/0x70
[ 8308.567891] RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
[ 8308.568986] RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
[ 8308.570734] RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
[ 8308.572543] RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
[ 8308.573856] R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
[ 8308.580099] R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
[ 8308.582520] FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
[ 8308.585440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8308.587951] CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
[ 8308.590707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8308.592865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8308.594469] Call Trace:
[ 8308.595149]  account_page_cleaned+0x15b/0x1f0
[ 8308.596340]  __cancel_dirty_page+0x146/0x200
[ 8308.599395]  truncate_cleanup_page+0x92/0xb0
[ 8308.600480]  truncate_inode_pages_range+0x202/0x7d0
[ 8308.617392]  btrfs_evict_inode+0x92/0x5a0
[ 8308.619108]  evict+0xc1/0x190
[ 8308.620023]  do_unlinkat+0x176/0x280
[ 8308.621202]  do_syscall_64+0x63/0x1a0
[ 8308.623451]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

The fix here is to make async_cow->locked_page NULL everywhere but the
one async_cow struct that's allowed to do things to the locked page.

Signed-off-by: Chris Mason <clm@fb.com>
Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c |  2 +-
 fs/btrfs/inode.c     | 25 +++++++++++++++++++++----
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 13fca7bfc1f2..9f223d7d78c0 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1838,7 +1838,7 @@ static int __process_pages_contig(struct address_space *mapping,
 			if (page_ops & PAGE_SET_PRIVATE2)
 				SetPagePrivate2(pages[i]);
 
-			if (pages[i] == locked_page) {
+			if (locked_page && pages[i] == locked_page) {
 				put_page(pages[i]);
 				pages_locked++;
 				continue;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 91b161fb1521..df5527cc07b9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -666,10 +666,12 @@ static noinline void compress_file_range(struct async_chunk *async_chunk,
 	 * to our extent and set things up for the async work queue to run
 	 * cow_file_range to do the normal delalloc dance.
 	 */
-	if (page_offset(async_chunk->locked_page) >= start &&
-	    page_offset(async_chunk->locked_page) <= end)
+	if (async_chunk->locked_page &&
+	    (page_offset(async_chunk->locked_page) >= start &&
+	     page_offset(async_chunk->locked_page) <= end)) {
 		__set_page_dirty_nobuffers(async_chunk->locked_page);
 		/* unlocked later on in the async handlers */
+	}
 
 	if (redirty)
 		extent_range_redirty_for_io(inode, start, end);
@@ -759,7 +761,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 						  async_extent->start +
 						  async_extent->ram_size - 1,
 						  WB_SYNC_ALL);
-			else if (ret)
+			else if (ret && async_chunk->locked_page)
 				unlock_page(async_chunk->locked_page);
 			kfree(async_extent);
 			cond_resched();
@@ -1236,10 +1238,25 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 		async_chunk[i].inode = inode;
 		async_chunk[i].start = start;
 		async_chunk[i].end = cur_end;
-		async_chunk[i].locked_page = locked_page;
 		async_chunk[i].write_flags = write_flags;
 		INIT_LIST_HEAD(&async_chunk[i].extents);
 
+		/*
+		 * The locked_page comes all the way from writepage and it's
+		 * the original page we were actually given.  As we spread
+		 * this large delalloc region across multiple async_cow
+		 * structs, only the first struct needs a pointer to locked_page.
+		 *
+		 * This way we don't need racy decisions about who is supposed
+		 * to unlock it.
+		 */
+		if (locked_page) {
+			async_chunk[i].locked_page = locked_page;
+			locked_page = NULL;
+		} else {
+			async_chunk[i].locked_page = NULL;
+		}
+
 		btrfs_init_work(&async_chunk[i].work,
 				btrfs_delalloc_helper,
 				async_cow_start, async_cow_submit,
-- 
2.17.1



* [PATCH 8/9] Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios
From: Tejun Heo @ 2019-06-15 18:24 UTC
  To: dsterba, clm, josef, axboe, jack
  Cc: linux-btrfs, linux-kernel, linux-block, kernel-team

From: Chris Mason <clm@fb.com>

Async CRCs and compression submit IO through helper threads, which
means they have IO priority inversions when cgroup IO controllers are
in use.

This flags all of the writes submitted by btrfs helper threads as
REQ_CGROUP_PUNT.  submit_bio() will punt these to dedicated per-blkcg
work items to avoid the priority inversion.

For the compression code, we take a reference on the wbc's blkg css and
pass it down to the async workers.

For the async crcs, the bio already has the correct css; we just need
to tell the block layer to use REQ_CGROUP_PUNT.

Signed-off-by: Chris Mason <clm@fb.com>
Modified-and-reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/compression.c |  8 +++++++-
 fs/btrfs/compression.h |  3 ++-
 fs/btrfs/disk-io.c     |  6 ++++++
 fs/btrfs/extent_io.c   |  3 +++
 fs/btrfs/inode.c       | 30 +++++++++++++++++++++++++++---
 5 files changed, 45 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 873261b932b8..138479a9576c 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -289,7 +289,8 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 				 unsigned long compressed_len,
 				 struct page **compressed_pages,
 				 unsigned long nr_pages,
-				 unsigned int write_flags)
+				 unsigned int write_flags,
+				 struct cgroup_subsys_state *blkcg_css)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct bio *bio = NULL;
@@ -323,6 +324,11 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 	bio->bi_opf = REQ_OP_WRITE | write_flags;
 	bio->bi_private = cb;
 	bio->bi_end_io = end_compressed_bio_write;
+
+	if (blkcg_css) {
+		bio->bi_opf |= REQ_CGROUP_PUNT;
+		bio_associate_blkg_from_css(bio, blkcg_css);
+	}
 	refcount_set(&cb->pending_bios, 1);
 
 	/* create and submit bios for the compressed pages */
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index 9976fe0f7526..7cbefab96ecf 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -93,7 +93,8 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 				  unsigned long compressed_len,
 				  struct page **compressed_pages,
 				  unsigned long nr_pages,
-				  unsigned int write_flags);
+				  unsigned int write_flags,
+				  struct cgroup_subsys_state *blkcg_css);
 blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 				 int mirror_num, unsigned long bio_flags);
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9dbe4ba3995d..a5ebbf3d0833 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -799,6 +799,12 @@ static void run_one_async_done(struct btrfs_work *work)
 		return;
 	}
 
+	/*
+	 * All of the bios that pass through here are from async helpers.
+	 * Use REQ_CGROUP_PUNT to issue them from the owning cgroup's
+	 * context.  This changes nothing when cgroups aren't in use.
+	 */
+	async->bio->bi_opf |= REQ_CGROUP_PUNT;
 	ret = btrfs_map_bio(btrfs_sb(inode->i_sb), async->bio,
 			    async->mirror_num);
 	if (ret) {
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9f223d7d78c0..d7b57341ff1a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4175,6 +4175,9 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
 		.nr_to_write	= nr_pages * 2,
 		.range_start	= start,
 		.range_end	= end + 1,
+		/* we're called from an async helper function */
+		.punt_to_cgroup	= 1,
+		.no_wbc_acct	= 1,
 	};
 
 	while (start <= end) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index df5527cc07b9..3f9b35bc0455 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -357,6 +357,7 @@ struct async_extent {
 };
 
 struct async_chunk {
+	struct cgroup_subsys_state *blkcg_css;
 	struct inode *inode;
 	struct page *locked_page;
 	u64 start;
@@ -846,7 +847,8 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 				    ins.objectid,
 				    ins.offset, async_extent->pages,
 				    async_extent->nr_pages,
-				    async_chunk->write_flags)) {
+				    async_chunk->write_flags,
+				    async_chunk->blkcg_css)) {
 			struct page *p = async_extent->pages[0];
 			const u64 start = async_extent->start;
 			const u64 end = start + async_extent->ram_size - 1;
@@ -1170,6 +1172,8 @@ static noinline void async_cow_free(struct btrfs_work *work)
 	async_chunk = container_of(work, struct async_chunk, work);
 	if (async_chunk->inode)
 		btrfs_add_delayed_iput(async_chunk->inode);
+	if (async_chunk->blkcg_css)
+		css_put(async_chunk->blkcg_css);
 	/*
 	 * Since the pointer to 'pending' is at the beginning of the array of
 	 * async_chunk's, freeing it ensures the whole array has been freed.
@@ -1178,12 +1182,15 @@ static noinline void async_cow_free(struct btrfs_work *work)
 		kvfree(async_chunk->pending);
 }
 
-static int cow_file_range_async(struct inode *inode, struct page *locked_page,
+static int cow_file_range_async(struct inode *inode,
+				struct writeback_control *wbc,
+				struct page *locked_page,
 				u64 start, u64 end, int *page_started,
 				unsigned long *nr_written,
 				unsigned int write_flags)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct cgroup_subsys_state *blkcg_css = wbc_blkcg_css(wbc);
 	struct async_cow *ctx;
 	struct async_chunk *async_chunk;
 	unsigned long nr_pages;
@@ -1251,12 +1258,29 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 		 * to unlock it.
 		 */
 		if (locked_page) {
+			/*
+			 * Depending on the compressibility, the pages
+			 * might or might not go through async.  We want
+			 * all of them to be accounted against @wbc once.
+			 * Let's do it here before the paths diverge.  wbc
+			 * accounting is used only for foreign writeback
+			 * detection and doesn't need full accuracy.  Just
+			 * account the whole thing against the first page.
+			 */
+			wbc_account_io(wbc, locked_page, cur_end - start);
 			async_chunk[i].locked_page = locked_page;
 			locked_page = NULL;
 		} else {
 			async_chunk[i].locked_page = NULL;
 		}
 
+		if (blkcg_css != blkcg_root_css) {
+			css_get(blkcg_css);
+			async_chunk[i].blkcg_css = blkcg_css;
+		} else {
+			async_chunk[i].blkcg_css = NULL;
+		}
+
 		btrfs_init_work(&async_chunk[i].work,
 				btrfs_delalloc_helper,
 				async_cow_start, async_cow_submit,
@@ -1653,7 +1677,7 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page,
 	} else {
 		set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
 			&BTRFS_I(inode)->runtime_flags);
-		ret = cow_file_range_async(inode, locked_page, start, end,
+		ret = cow_file_range_async(inode, wbc, locked_page, start, end,
 					   page_started, nr_written,
 					   write_flags);
 	}
-- 
2.17.1



* [PATCH 9/9] Btrfs: extent_write_locked_range() should attach inode->i_wb
From: Tejun Heo @ 2019-06-15 18:24 UTC
  To: dsterba, clm, josef, axboe, jack
  Cc: linux-btrfs, linux-kernel, linux-block, kernel-team

From: Chris Mason <clm@fb.com>

extent_write_locked_range() is used when we're falling back to buffered
IO from inside of compression.  It allocates its own wbc and should
associate it with the inode's i_wb to make sure the IO goes down from
the correct cgroup.

Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d7b57341ff1a..afb916a69c30 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4180,6 +4180,7 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
 		.no_wbc_acct	= 1,
 	};
 
+	wbc_attach_fdatawrite_inode(&wbc_writepages, inode);
 	while (start <= end) {
 		page = find_get_page(mapping, start >> PAGE_SHIFT);
 		if (clear_page_dirty_for_io(page))
@@ -4194,11 +4195,12 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
 	}
 
 	ASSERT(ret <= 0);
-	if (ret < 0) {
+	if (ret == 0)
+		ret = flush_write_bio(&epd);
+	else
 		end_write_bio(&epd, ret);
-		return ret;
-	}
-	ret = flush_write_bio(&epd);
+
+	wbc_detach_inode(&wbc_writepages);
 	return ret;
 }
 
-- 
2.17.1



* Re: [PATCH 4/9] blkcg: implement REQ_CGROUP_PUNT
From: Tejun Heo @ 2019-06-15 18:29 UTC
  To: dsterba, clm, josef, axboe, jack
  Cc: linux-btrfs, linux-kernel, linux-block, kernel-team

Hello,

On Sat, Jun 15, 2019 at 11:24:48AM -0700, Tejun Heo wrote:
> When a shared kthread needs to issue a bio for a cgroup, doing so
> synchronously can lead to priority inversions as the kthread can be
> trapped waiting for that cgroup.  This patch implements
> REQ_CGROUP_PUNT flag which makes submit_bio() punt the actual issuing
> to a dedicated per-blkcg work item to avoid such priority inversions.
> 
> This will be used to fix priority inversions in btrfs compression and
> should be generally useful as we grow filesystem support for
> comprehensive IO control.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> Cc: Chris Mason <clm@fb.com>

The blkcg patches, especially this one, need Jens's review.  Jens, if
you're okay with the changes, please let me know how you want them to
be routed.

Thanks.

-- 
tejun

* Re: [PATCHSET v2 btrfs/for-next] blkcg, btrfs: fix cgroup writeback support
  2019-06-15 18:24 [PATCHSET v2 btrfs/for-next] blkcg, btrfs: fix cgroup writeback support Tejun Heo
                   ` (8 preceding siblings ...)
  2019-06-15 18:24 ` [PATCH 9/9] Btrfs: extent_write_locked_range() should attach inode->i_wb Tejun Heo
@ 2019-06-18 12:54 ` David Sterba
  2019-06-18 14:45   ` Tejun Heo
  9 siblings, 1 reply; 25+ messages in thread
From: David Sterba @ 2019-06-18 12:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: dsterba, clm, josef, axboe, jack, linux-btrfs, linux-kernel,
	linux-block, kernel-team

On Sat, Jun 15, 2019 at 11:24:44AM -0700, Tejun Heo wrote:
> Hello,
> 
> Changes from v1[1]
> 
>   * 0001-cgroup-blkcg-Prepare-some-symbols-for-module-and-CON.patch
>     added.  It collects and adds symbol exports and dummy function def
>     to fix module and different config builds.
> 
> When writeback is executed asynchronously (e.g. for compression), bios
> are bounced to and issued by worker pool shared by all cgroups.  This
> leads to significant priority inversions when cgroup IO control is in
> use - IOs for a low priority cgroup can tie down the workers forcing
> higher priority IOs to wait behind them.
> 
> This patchset adds an bio punt mechanism to blkcg and updates btrfs to
> issue async IOs through it.  A bio tagged with REQ_CGROUP_PUNT flag is
> bounced to the asynchronous issue context of the associated blkcg on
> bio_submit().  As the bios are issued from per-blkcg work items,
> there's no concern for priority inversions and it doesn't require
> invasive changes to the filesystems.  The mechanism should be
> generally useful for IO control support across different filesystems.
> 
> This patchset contains the following 9 patches.  The first three are
> my blkcg patches to implement the needed mechanisms.  The latter five
> are Chris Mason's btrfs cleanup and update patches.  Please let me
> know how the patches should be routed.  Given that there currently
> aren't other users, it's likely the easiest to route all through btrfs
> tree.

That would be the easiest, as it avoids synchronization between two trees,
provided that all non-btrfs commits have acks/reviews.

However, as it's rc5, I'm not at all comfortable adding this patchset to
the 5.3 queue; the changes seem to be intrusive, and redoing the bio
submission path is something that will affect all workloads. I did quick
tests with fstests (without cgroups enabled) and this was fine, but that's
the minimum that must work. A wider range of workloads would be needed; I
can do that with mmtests, but all of that means that 5.3 is infeasible.

So this opens more possibilities regarding the patchset routing. Both
parts can go separately through their usual trees.

* Re: [PATCHSET v2 btrfs/for-next] blkcg, btrfs: fix cgroup writeback support
  2019-06-18 12:54 ` [PATCHSET v2 btrfs/for-next] blkcg, btrfs: fix cgroup writeback support David Sterba
@ 2019-06-18 14:45   ` Tejun Heo
  0 siblings, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2019-06-18 14:45 UTC (permalink / raw)
  To: dsterba, dsterba, clm, josef, axboe, jack, linux-btrfs,
	linux-kernel, linux-block, kernel-team

Hello, David.

On Tue, Jun 18, 2019 at 02:54:42PM +0200, David Sterba wrote:
> However, as it's rc5, I'm not at all comfortable adding this patchset to
> the 5.3 queue; the changes seem to be intrusive, and redoing the bio
> submission path is something that will affect all workloads. I did quick
> tests with fstests (without cgroups enabled) and this was fine, but that's
> the minimum that must work. A wider range of workloads would be needed; I
> can do that with mmtests, but all of that means that 5.3 is infeasible.

Sure thing.  These aren't urgent in any way.

> So this opens more possibilites regarding the patchset routing. Both
> parts can go separately through their usual trees.

Yeah, that sounds great too.  Let's wait for Jens's review and decide
how to route the patches.

Thanks.

-- 
tejun

* Re: [PATCH 2/9] blkcg, writeback: Add wbc->no_wbc_acct
  2019-06-15 18:24 ` [PATCH 2/9] blkcg, writeback: Add wbc->no_wbc_acct Tejun Heo
@ 2019-06-20 15:21   ` Jan Kara
  2019-06-20 17:02     ` Tejun Heo
  2019-06-24 16:49   ` Jan Kara
  1 sibling, 1 reply; 25+ messages in thread
From: Jan Kara @ 2019-06-20 15:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: dsterba, clm, josef, axboe, jack, linux-btrfs, linux-kernel,
	linux-block, kernel-team

On Sat 15-06-19 11:24:46, Tejun Heo wrote:
> When writeback IOs are bounced through async layers, the IOs should
> only be accounted against the wbc from the original bdi writeback to
> avoid confusing cgroup inode ownership arbitration.  Add
> wbc->no_wbc_acct to allow disabling wbc accounting.  This will be used
> make btfs compression work well with cgroup IO control.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>

I'm completely ignorant of how btrfs compressed writeback works so don't
quite understand implications of this. So does this mean that writeback to
btrfs compressed files won't be able to transition inodes from one memcg to
another? Or are you trying to say the 'wbc' used from async worker thread
is actually a dummy one and we would double-account the writeback?

Anyway, AFAICS no_wbc_acct means: "IO done as a result of this wbc will not
have influence on inode memcg ownership", doesn't it?

								Honza
> ---
>  fs/fs-writeback.c         | 2 +-
>  include/linux/writeback.h | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index c29cff345b1f..667ba07fffcd 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -724,7 +724,7 @@ void wbc_account_io(struct writeback_control *wbc, struct page *page,
>  	 * behind a slow cgroup.  Ultimately, we want pageout() to kick off
>  	 * regular writeback instead of writing things out itself.
>  	 */
> -	if (!wbc->wb)
> +	if (!wbc->wb || wbc->no_wbc_acct)
>  		return;
>  
>  	id = mem_cgroup_css_from_page(page)->id;
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 738a0c24874f..b8f5f000cde4 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -68,6 +68,7 @@ struct writeback_control {
>  	unsigned for_reclaim:1;		/* Invoked from the page allocator */
>  	unsigned range_cyclic:1;	/* range_start is cyclic */
>  	unsigned for_sync:1;		/* sync(2) WB_SYNC_ALL writeback */
> +	unsigned no_wbc_acct:1;		/* skip wbc IO accounting */
>  #ifdef CONFIG_CGROUP_WRITEBACK
>  	struct bdi_writeback *wb;	/* wb this writeback is issued under */
>  	struct inode *inode;		/* inode being written out */
> -- 
> 2.17.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 4/9] blkcg: implement REQ_CGROUP_PUNT
  2019-06-15 18:24 ` [PATCH 4/9] blkcg: implement REQ_CGROUP_PUNT Tejun Heo
  2019-06-15 18:29   ` Tejun Heo
@ 2019-06-20 15:37   ` Jan Kara
  2019-06-20 16:42     ` Tejun Heo
  2019-06-24 16:53   ` Jan Kara
  2 siblings, 1 reply; 25+ messages in thread
From: Jan Kara @ 2019-06-20 15:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: dsterba, clm, josef, axboe, jack, linux-btrfs, linux-kernel,
	linux-block, kernel-team

On Sat 15-06-19 11:24:48, Tejun Heo wrote:
> When a shared kthread needs to issue a bio for a cgroup, doing so
> synchronously can lead to priority inversions as the kthread can be
> trapped waiting for that cgroup.  This patch implements
> REQ_CGROUP_PUNT flag which makes submit_bio() punt the actual issuing
> to a dedicated per-blkcg work item to avoid such priority inversions.
> 
> This will be used to fix priority inversions in btrfs compression and
> should be generally useful as we grow filesystem support for
> comprehensive IO control.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> Cc: Chris Mason <clm@fb.com>

...

> +bool __blkcg_punt_bio_submit(struct bio *bio)
> +{
> +	struct blkcg_gq *blkg = bio->bi_blkg;
> +
> +	/* consume the flag first */
> +	bio->bi_opf &= ~REQ_CGROUP_PUNT;
> +
> +	/* never bounce for the root cgroup */
> +	if (!blkg->parent)
> +		return false;
> +
> +	spin_lock_bh(&blkg->async_bio_lock);
> +	bio_list_add(&blkg->async_bios, bio);
> +	spin_unlock_bh(&blkg->async_bio_lock);
> +
> +	queue_work(blkcg_punt_bio_wq, &blkg->async_bio_work);
> +	return true;
> +}
> +

So does this mean that if there is some inode with lots of dirty data for a
blkcg that is heavily throttled, that blkcg can occupy a ton of workers all
being throttled in submit_bio()? Or what is constraining a number of
workers one blkcg can consume?

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 4/9] blkcg: implement REQ_CGROUP_PUNT
  2019-06-20 15:37   ` Jan Kara
@ 2019-06-20 16:42     ` Tejun Heo
  2019-06-20 17:01       ` Jan Kara
  0 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2019-06-20 16:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: dsterba, clm, josef, axboe, linux-btrfs, linux-kernel,
	linux-block, kernel-team

Hello, Jan.

On Thu, Jun 20, 2019 at 05:37:33PM +0200, Jan Kara wrote:
> > +bool __blkcg_punt_bio_submit(struct bio *bio)
> > +{
> > +	struct blkcg_gq *blkg = bio->bi_blkg;
> > +
> > +	/* consume the flag first */
> > +	bio->bi_opf &= ~REQ_CGROUP_PUNT;
> > +
> > +	/* never bounce for the root cgroup */
> > +	if (!blkg->parent)
> > +		return false;
> > +
> > +	spin_lock_bh(&blkg->async_bio_lock);
> > +	bio_list_add(&blkg->async_bios, bio);
> > +	spin_unlock_bh(&blkg->async_bio_lock);
> > +
> > +	queue_work(blkcg_punt_bio_wq, &blkg->async_bio_work);
> > +	return true;
> > +}
> > +
> 
> So does this mean that if there is some inode with lots of dirty data for a
> blkcg that is heavily throttled, that blkcg can occupy a ton of workers all
> being throttled in submit_bio()? Or what is constraining a number of
> workers one blkcg can consume?

There's only one work item per blkcg-device pair, so the maximum
number of kthreads a blkcg can occupy on a filesystem would be one.
It's the same scheme as writeback work items.
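
To spell the bound out, the relevant per-blkg state boils down to roughly
this (simplified sketch; the real definitions are in patch 4/9):

  struct blkcg_gq {				/* one per blkcg-device pair */
	...
	spinlock_t		async_bio_lock;
	struct bio_list		async_bios;	/* punted bios pile up here */
	struct work_struct	async_bio_work;	/* single work item: queue_work()
						 * while it's pending is a no-op,
						 * and the same item is never run
						 * concurrently */
	...
  };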

Thanks.

-- 
tejun

* Re: [PATCH 4/9] blkcg: implement REQ_CGROUP_PUNT
  2019-06-20 16:42     ` Tejun Heo
@ 2019-06-20 17:01       ` Jan Kara
  0 siblings, 0 replies; 25+ messages in thread
From: Jan Kara @ 2019-06-20 17:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, dsterba, clm, josef, axboe, linux-btrfs, linux-kernel,
	linux-block, kernel-team

On Thu 20-06-19 09:42:29, Tejun Heo wrote:
> Hello, Jan.
> 
> On Thu, Jun 20, 2019 at 05:37:33PM +0200, Jan Kara wrote:
> > > +bool __blkcg_punt_bio_submit(struct bio *bio)
> > > +{
> > > +	struct blkcg_gq *blkg = bio->bi_blkg;
> > > +
> > > +	/* consume the flag first */
> > > +	bio->bi_opf &= ~REQ_CGROUP_PUNT;
> > > +
> > > +	/* never bounce for the root cgroup */
> > > +	if (!blkg->parent)
> > > +		return false;
> > > +
> > > +	spin_lock_bh(&blkg->async_bio_lock);
> > > +	bio_list_add(&blkg->async_bios, bio);
> > > +	spin_unlock_bh(&blkg->async_bio_lock);
> > > +
> > > +	queue_work(blkcg_punt_bio_wq, &blkg->async_bio_work);
> > > +	return true;
> > > +}
> > > +
> > 
> > So does this mean that if there is some inode with lots of dirty data for a
> > blkcg that is heavily throttled, that blkcg can occupy a ton of workers all
> > being throttled in submit_bio()? Or what is constraining a number of
> > workers one blkcg can consume?
> 
> There's only one work item per blkcg-device pair, so the maximum
> number of kthreads a blkcg can occupy on a filesystem would be one.
> It's the same scheme as writeback work items.

OK, I'd missed the fact that although the work item can get queued while
it is still running, it cannot be executed more than once at a time (which
is kind of obvious, but I got confused). Thanks for the explanation!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 2/9] blkcg, writeback: Add wbc->no_wbc_acct
  2019-06-20 15:21   ` Jan Kara
@ 2019-06-20 17:02     ` Tejun Heo
  2019-06-24  8:21       ` Jan Kara
  0 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2019-06-20 17:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: dsterba, clm, josef, axboe, linux-btrfs, linux-kernel,
	linux-block, kernel-team

Hello, Jan.

On Thu, Jun 20, 2019 at 05:21:45PM +0200, Jan Kara wrote:
> I'm completely ignorant of how btrfs compressed writeback works so don't
> quite understand implications of this. So does this mean that writeback to
> btrfs compressed files won't be able to transition inodes from one memcg to
> another? Or are you trying to say the 'wbc' used from async worker thread
> is actually a dummy one and we would double-account the writeback?

So, before, only the async compression workers would run through the
wbc accounting code regardless of who originated the dirty pages,
which is obviously wrong.  After the patch, the code accounts when the
dirty pages are being handed off to the compression workers and
no_wbc_acct is used to suppress spurious accounting from the workers.
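
In code terms, the flow after the series is roughly the following
(simplified sketch stitched together from patches 8/9 and 9/9, not the
literal diffs):

  /* originating context, still attached to the real bdi writeback wbc */
  wbc_account_io(wbc, locked_page, cur_end - start);  /* accounted once, at handoff */
  ...
  /* the compression worker later writes the pages out with its own wbc */
  struct writeback_control wbc_writepages = {
	...
	.no_wbc_acct	= 1,	/* already accounted above, skip it here */
  };

so ownership arbitration only ever sees the IO from the context that
handed it off, not from the shared workers.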

> Anyway, AFAICS no_wbc_acct means: "IO done as a result of this wbc will not
> have influence on inode memcg ownership", doesn't it?

Yeap.

Thanks.

-- 
tejun

* Re: [PATCH 2/9] blkcg, writeback: Add wbc->no_wbc_acct
  2019-06-20 17:02     ` Tejun Heo
@ 2019-06-24  8:21       ` Jan Kara
  2019-06-24 12:58         ` Tejun Heo
  0 siblings, 1 reply; 25+ messages in thread
From: Jan Kara @ 2019-06-24  8:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, dsterba, clm, josef, axboe, linux-btrfs, linux-kernel,
	linux-block, kernel-team

Hello Tejun!

On Thu 20-06-19 10:02:50, Tejun Heo wrote:
> On Thu, Jun 20, 2019 at 05:21:45PM +0200, Jan Kara wrote:
> > I'm completely ignorant of how btrfs compressed writeback works so don't
> > quite understand implications of this. So does this mean that writeback to
> > btrfs compressed files won't be able to transition inodes from one memcg to
> > another? Or are you trying to say the 'wbc' used from async worker thread
> > is actually a dummy one and we would double-account the writeback?
> 
> So, before, only the async compression workers would run through the
> wbc accounting code regardless of who originated the dirty pages,
> which is obviously wrong.  After the patch, the code accounts when the
> dirty pages are being handed off to the compression workers and
> no_wbc_acct is used to suppress spurious accounting from the workers.

OK, now I understand. Just one more question: So effectively, you are using
wbc->no_wbc_acct to pass information from btrfs code to btrfs code telling
it whether IO should or should not be accounted with wbc_account_io().
Wouldn't it make more sense to just pass this information internally
within btrfs? Granted, if this mechanism gets more widespread use by other
filesystems, then probably using wbc flag makes more sense. But I'm not
sure if this isn't a premature generalization...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 2/9] blkcg, writeback: Add wbc->no_wbc_acct
  2019-06-24  8:21       ` Jan Kara
@ 2019-06-24 12:58         ` Tejun Heo
  2019-06-24 16:39           ` Jan Kara
  0 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2019-06-24 12:58 UTC (permalink / raw)
  To: Jan Kara
  Cc: dsterba, clm, josef, axboe, linux-btrfs, linux-kernel,
	linux-block, kernel-team

Hello, Jan.

On Mon, Jun 24, 2019 at 10:21:30AM +0200, Jan Kara wrote:
> OK, now I understand. Just one more question: So effectively, you are using
> wbc->no_wbc_acct to pass information from btrfs code to btrfs code telling
> it whether IO should or should not be accounted with wbc_account_io().

Yes.

> Wouldn't it make more sense to just pass this information internally
> within btrfs? Granted, if this mechanism gets more widespread use by other
> filesystems, then probably using wbc flag makes more sense. But I'm not
> sure if this isn't a premature generalization...

The btrfs async issuers end up using the generic writeback path and the
generic wbc owner mechanism, so the IOs are attached to the right cgroup
too.  So I kinda prefer to provide a generic mechanism on the wbc side.
That said, the names are a bit misleading and I think it'd be better to
rename them to something more explicit, e.g. something along the lines of
wbc_update_cgroup_owner() and wbc->no_cgroup_owner.  What do you think?
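
Concretely, something along these lines, purely to make the proposed
naming visible (sketch only, not part of this series):

  struct writeback_control {
	...
	/* IO has already been accounted toward cgroup ownership, skip it */
	unsigned no_cgroup_owner:1;
	...
  };

  /* would replace wbc_account_io(); it only feeds cgroup inode ownership */
  void wbc_update_cgroup_owner(struct writeback_control *wbc,
			       struct page *page, size_t bytes);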

Thanks.

-- 
tejun

* Re: [PATCH 2/9] blkcg, writeback: Add wbc->no_wbc_acct
  2019-06-24 12:58         ` Tejun Heo
@ 2019-06-24 16:39           ` Jan Kara
  0 siblings, 0 replies; 25+ messages in thread
From: Jan Kara @ 2019-06-24 16:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, dsterba, clm, josef, axboe, linux-btrfs, linux-kernel,
	linux-block, kernel-team

On Mon 24-06-19 05:58:56, Tejun Heo wrote:
> Hello, Jan.
> 
> On Mon, Jun 24, 2019 at 10:21:30AM +0200, Jan Kara wrote:
> > OK, now I understand. Just one more question: So effectively, you are using
> > wbc->no_wbc_acct to pass information from btrfs code to btrfs code telling
> > it whether IO should or should not be accounted with wbc_account_io().
> 
> Yes.
> 
> > Wouldn't it make more sense to just pass this information internally
> > within btrfs? Granted, if this mechanism gets more widespread use by other
> > filesystems, then probably using wbc flag makes more sense. But I'm not
> > sure if this isn't a premature generalization...
> 
> The btrfs async issuers end up using the generic writeback path and the
> generic wbc owner mechanism, so the IOs are attached to the right cgroup
> too.  So I kinda prefer to provide a generic mechanism on the wbc side.

OK, I can live with that. We just have to be kind of careful so that people
don't sprinkle no_wbc_acct writeback around because they don't know any
better. Maybe you could at least add a comment to no_wbc_acct mentioning
that this is for the cases where writeback has already been accounted for?
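
Something along these lines, perhaps (wording is only a suggestion):

  unsigned no_wbc_acct:1;	/* writeback already accounted for by the
				 * originating wbc; skip wbc_account_io()
				 * so it isn't counted twice */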

> That said, the names are a bit misleading and I think it'd be better to
> rename them to something more explicit, e.g. something along the lines of
> wbc_update_cgroup_owner() and wbc->no_cgroup_owner.  What do you think?

Yeah, renaming would probably make things clearer as well.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 1/9] cgroup, blkcg: Prepare some symbols for module and !CONFIG_CGROUP usages
  2019-06-15 18:24 ` [PATCH 1/9] cgroup, blkcg: Prepare some symbols for module and !CONFIG_CGROUP usages Tejun Heo
@ 2019-06-24 16:39   ` Jan Kara
  0 siblings, 0 replies; 25+ messages in thread
From: Jan Kara @ 2019-06-24 16:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: dsterba, clm, josef, axboe, jack, linux-btrfs, linux-kernel,
	linux-block, kernel-team

On Sat 15-06-19 11:24:45, Tejun Heo wrote:
> btrfs is going to use css_put() and wbc helpers to improve cgroup
> writeback support.  Add dummy css_get() definition and export wbc
> helpers to prepare for module and !CONFIG_CGROUP builds.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: kbuild test robot <lkp@intel.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  block/blk-cgroup.c     | 1 +
>  fs/fs-writeback.c      | 3 +++
>  include/linux/cgroup.h | 1 +
>  3 files changed, 5 insertions(+)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 617a2b3f7582..07600d3c9520 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -46,6 +46,7 @@ struct blkcg blkcg_root;
>  EXPORT_SYMBOL_GPL(blkcg_root);
>  
>  struct cgroup_subsys_state * const blkcg_root_css = &blkcg_root.css;
> +EXPORT_SYMBOL_GPL(blkcg_root_css);
>  
>  static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
>  
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 36855c1f8daf..c29cff345b1f 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -269,6 +269,7 @@ void __inode_attach_wb(struct inode *inode, struct page *page)
>  	if (unlikely(cmpxchg(&inode->i_wb, NULL, wb)))
>  		wb_put(wb);
>  }
> +EXPORT_SYMBOL_GPL(__inode_attach_wb);
>  
>  /**
>   * locked_inode_to_wb_and_lock_list - determine a locked inode's wb and lock it
> @@ -580,6 +581,7 @@ void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
>  	if (unlikely(wb_dying(wbc->wb)))
>  		inode_switch_wbs(inode, wbc->wb_id);
>  }
> +EXPORT_SYMBOL_GPL(wbc_attach_and_unlock_inode);
>  
>  /**
>   * wbc_detach_inode - disassociate wbc from inode and perform foreign detection
> @@ -699,6 +701,7 @@ void wbc_detach_inode(struct writeback_control *wbc)
>  	wb_put(wbc->wb);
>  	wbc->wb = NULL;
>  }
> +EXPORT_SYMBOL_GPL(wbc_detach_inode);
>  
>  /**
>   * wbc_account_io - account IO issued during writeback
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 81f58b4a5418..4cb5d5646986 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -687,6 +687,7 @@ void cgroup_path_from_kernfs_id(const union kernfs_node_id *id,
>  struct cgroup_subsys_state;
>  struct cgroup;
>  
> +static inline void css_get(struct cgroup_subsys_state *css) {}
>  static inline void css_put(struct cgroup_subsys_state *css) {}
>  static inline int cgroup_attach_task_all(struct task_struct *from,
>  					 struct task_struct *t) { return 0; }
> -- 
> 2.17.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 2/9] blkcg, writeback: Add wbc->no_wbc_acct
  2019-06-15 18:24 ` [PATCH 2/9] blkcg, writeback: Add wbc->no_wbc_acct Tejun Heo
  2019-06-20 15:21   ` Jan Kara
@ 2019-06-24 16:49   ` Jan Kara
  1 sibling, 0 replies; 25+ messages in thread
From: Jan Kara @ 2019-06-24 16:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: dsterba, clm, josef, axboe, jack, linux-btrfs, linux-kernel,
	linux-block, kernel-team

On Sat 15-06-19 11:24:46, Tejun Heo wrote:
> When writeback IOs are bounced through async layers, the IOs should
> only be accounted against the wbc from the original bdi writeback to
> avoid confusing cgroup inode ownership arbitration.  Add
> wbc->no_wbc_acct to allow disabling wbc accounting.  This will be used
> make btfs compression work well with cgroup IO control.
       ^^ btrfs

> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
...
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 738a0c24874f..b8f5f000cde4 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -68,6 +68,7 @@ struct writeback_control {
>  	unsigned for_reclaim:1;		/* Invoked from the page allocator */
>  	unsigned range_cyclic:1;	/* range_start is cyclic */
>  	unsigned for_sync:1;		/* sync(2) WB_SYNC_ALL writeback */
> +	unsigned no_wbc_acct:1;		/* skip wbc IO accounting */
>  #ifdef CONFIG_CGROUP_WRITEBACK
>  	struct bdi_writeback *wb;	/* wb this writeback is issued under */
>  	struct inode *inode;		/* inode being written out */

Can you add a comment here that this is for writeback which has already
been accounted for?

Otherwise the patch looks good to me so feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 3/9] blkcg, writeback: Implement wbc_blkcg_css()
  2019-06-15 18:24 ` [PATCH 3/9] blkcg, writeback: Implement wbc_blkcg_css() Tejun Heo
@ 2019-06-24 16:50   ` Jan Kara
  0 siblings, 0 replies; 25+ messages in thread
From: Jan Kara @ 2019-06-24 16:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: dsterba, clm, josef, axboe, jack, linux-btrfs, linux-kernel,
	linux-block, kernel-team

On Sat 15-06-19 11:24:47, Tejun Heo wrote:
> Add a helper to determine the target blkcg from wbc.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/writeback.h | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index b8f5f000cde4..800ee031e88a 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -11,6 +11,7 @@
>  #include <linux/flex_proportions.h>
>  #include <linux/backing-dev-defs.h>
>  #include <linux/blk_types.h>
> +#include <linux/blk-cgroup.h>
>  
>  struct bio;
>  
> @@ -93,6 +94,16 @@ static inline int wbc_to_write_flags(struct writeback_control *wbc)
>  	return 0;
>  }
>  
> +static inline struct cgroup_subsys_state *
> +wbc_blkcg_css(struct writeback_control *wbc)
> +{
> +#ifdef CONFIG_CGROUP_WRITEBACK
> +	if (wbc->wb)
> +		return wbc->wb->blkcg_css;
> +#endif
> +	return blkcg_root_css;
> +}
> +
>  /*
>   * A wb_domain represents a domain that wb's (bdi_writeback's) belong to
>   * and are measured against each other in.  There always is one global
> -- 
> 2.17.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 4/9] blkcg: implement REQ_CGROUP_PUNT
  2019-06-15 18:24 ` [PATCH 4/9] blkcg: implement REQ_CGROUP_PUNT Tejun Heo
  2019-06-15 18:29   ` Tejun Heo
  2019-06-20 15:37   ` Jan Kara
@ 2019-06-24 16:53   ` Jan Kara
  2 siblings, 0 replies; 25+ messages in thread
From: Jan Kara @ 2019-06-24 16:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: dsterba, clm, josef, axboe, jack, linux-btrfs, linux-kernel,
	linux-block, kernel-team

On Sat 15-06-19 11:24:48, Tejun Heo wrote:
> When a shared kthread needs to issue a bio for a cgroup, doing so
> synchronously can lead to priority inversions as the kthread can be
> trapped waiting for that cgroup.  This patch implements
> REQ_CGROUP_PUNT flag which makes submit_bio() punt the actual issuing
> to a dedicated per-blkcg work item to avoid such priority inversions.
> 
> This will be used to fix priority inversions in btrfs compression and
> should be generally useful as we grow filesystem support for
> comprehensive IO control.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> Cc: Chris Mason <clm@fb.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  block/blk-cgroup.c          | 53 +++++++++++++++++++++++++++++++++++++
>  block/blk-core.c            |  3 +++
>  include/linux/backing-dev.h |  1 +
>  include/linux/blk-cgroup.h  | 16 ++++++++++-
>  include/linux/blk_types.h   | 10 +++++++
>  include/linux/writeback.h   | 12 ++++++---
>  6 files changed, 91 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 07600d3c9520..48239bb93fbe 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -53,6 +53,7 @@ static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
>  static LIST_HEAD(all_blkcgs);		/* protected by blkcg_pol_mutex */
>  
>  static bool blkcg_debug_stats = false;
> +static struct workqueue_struct *blkcg_punt_bio_wq;
>  
>  static bool blkcg_policy_enabled(struct request_queue *q,
>  				 const struct blkcg_policy *pol)
> @@ -88,6 +89,8 @@ static void __blkg_release(struct rcu_head *rcu)
>  
>  	percpu_ref_exit(&blkg->refcnt);
>  
> +	WARN_ON(!bio_list_empty(&blkg->async_bios));
> +
>  	/* release the blkcg and parent blkg refs this blkg has been holding */
>  	css_put(&blkg->blkcg->css);
>  	if (blkg->parent)
> @@ -113,6 +116,23 @@ static void blkg_release(struct percpu_ref *ref)
>  	call_rcu(&blkg->rcu_head, __blkg_release);
>  }
>  
> +static void blkg_async_bio_workfn(struct work_struct *work)
> +{
> +	struct blkcg_gq *blkg = container_of(work, struct blkcg_gq,
> +					     async_bio_work);
> +	struct bio_list bios = BIO_EMPTY_LIST;
> +	struct bio *bio;
> +
> +	/* as long as there are pending bios, @blkg can't go away */
> +	spin_lock_bh(&blkg->async_bio_lock);
> +	bio_list_merge(&bios, &blkg->async_bios);
> +	bio_list_init(&blkg->async_bios);
> +	spin_unlock_bh(&blkg->async_bio_lock);
> +
> +	while ((bio = bio_list_pop(&bios)))
> +		submit_bio(bio);
> +}
> +
>  /**
>   * blkg_alloc - allocate a blkg
>   * @blkcg: block cgroup the new blkg is associated with
> @@ -138,6 +158,9 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
>  
>  	blkg->q = q;
>  	INIT_LIST_HEAD(&blkg->q_node);
> +	spin_lock_init(&blkg->async_bio_lock);
> +	bio_list_init(&blkg->async_bios);
> +	INIT_WORK(&blkg->async_bio_work, blkg_async_bio_workfn);
>  	blkg->blkcg = blkcg;
>  
>  	for (i = 0; i < BLKCG_MAX_POLS; i++) {
> @@ -1583,6 +1606,25 @@ void blkcg_policy_unregister(struct blkcg_policy *pol)
>  }
>  EXPORT_SYMBOL_GPL(blkcg_policy_unregister);
>  
> +bool __blkcg_punt_bio_submit(struct bio *bio)
> +{
> +	struct blkcg_gq *blkg = bio->bi_blkg;
> +
> +	/* consume the flag first */
> +	bio->bi_opf &= ~REQ_CGROUP_PUNT;
> +
> +	/* never bounce for the root cgroup */
> +	if (!blkg->parent)
> +		return false;
> +
> +	spin_lock_bh(&blkg->async_bio_lock);
> +	bio_list_add(&blkg->async_bios, bio);
> +	spin_unlock_bh(&blkg->async_bio_lock);
> +
> +	queue_work(blkcg_punt_bio_wq, &blkg->async_bio_work);
> +	return true;
> +}
> +
>  /*
>   * Scale the accumulated delay based on how long it has been since we updated
>   * the delay.  We only call this when we are adding delay, in case it's been a
> @@ -1783,5 +1825,16 @@ void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta)
>  	atomic64_add(delta, &blkg->delay_nsec);
>  }
>  
> +static int __init blkcg_init(void)
> +{
> +	blkcg_punt_bio_wq = alloc_workqueue("blkcg_punt_bio",
> +					    WQ_MEM_RECLAIM | WQ_FREEZABLE |
> +					    WQ_UNBOUND | WQ_SYSFS, 0);
> +	if (!blkcg_punt_bio_wq)
> +		return -ENOMEM;
> +	return 0;
> +}
> +subsys_initcall(blkcg_init);
> +
>  module_param(blkcg_debug_stats, bool, 0644);
>  MODULE_PARM_DESC(blkcg_debug_stats, "True if you want debug stats, false if not");
> diff --git a/block/blk-core.c b/block/blk-core.c
> index a55389ba8779..5879c1ec044d 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1165,6 +1165,9 @@ EXPORT_SYMBOL_GPL(direct_make_request);
>   */
>  blk_qc_t submit_bio(struct bio *bio)
>  {
> +	if (blkcg_punt_bio_submit(bio))
> +		return BLK_QC_T_NONE;
> +
>  	/*
>  	 * If it's a regular read/write or a barrier with data attached,
>  	 * go through the normal accounting stuff before submission.
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index f9b029180241..35b31d176f74 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -48,6 +48,7 @@ extern spinlock_t bdi_lock;
>  extern struct list_head bdi_list;
>  
>  extern struct workqueue_struct *bdi_wq;
> +extern struct workqueue_struct *bdi_async_bio_wq;
>  
>  static inline bool wb_has_dirty_io(struct bdi_writeback *wb)
>  {
> diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
> index 76c61318fda5..ffb2f88e87c6 100644
> --- a/include/linux/blk-cgroup.h
> +++ b/include/linux/blk-cgroup.h
> @@ -134,13 +134,17 @@ struct blkcg_gq {
>  
>  	struct blkg_policy_data		*pd[BLKCG_MAX_POLS];
>  
> -	struct rcu_head			rcu_head;
> +	spinlock_t			async_bio_lock;
> +	struct bio_list			async_bios;
> +	struct work_struct		async_bio_work;
>  
>  	atomic_t			use_delay;
>  	atomic64_t			delay_nsec;
>  	atomic64_t			delay_start;
>  	u64				last_delay;
>  	int				last_use;
> +
> +	struct rcu_head			rcu_head;
>  };
>  
>  typedef struct blkcg_policy_data *(blkcg_pol_alloc_cpd_fn)(gfp_t gfp);
> @@ -763,6 +767,15 @@ static inline bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg
>  				  struct bio *bio) { return false; }
>  #endif
>  
> +bool __blkcg_punt_bio_submit(struct bio *bio);
> +
> +static inline bool blkcg_punt_bio_submit(struct bio *bio)
> +{
> +	if (bio->bi_opf & REQ_CGROUP_PUNT)
> +		return __blkcg_punt_bio_submit(bio);
> +	else
> +		return false;
> +}
>  
>  static inline void blkcg_bio_issue_init(struct bio *bio)
>  {
> @@ -910,6 +923,7 @@ static inline char *blkg_path(struct blkcg_gq *blkg) { return NULL; }
>  static inline void blkg_get(struct blkcg_gq *blkg) { }
>  static inline void blkg_put(struct blkcg_gq *blkg) { }
>  
> +static inline bool blkcg_punt_bio_submit(struct bio *bio) { return false; }
>  static inline void blkcg_bio_issue_init(struct bio *bio) { }
>  static inline bool blkcg_bio_issue_check(struct request_queue *q,
>  					 struct bio *bio) { return true; }
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 791fee35df88..e8b42a786315 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -321,6 +321,14 @@ enum req_flag_bits {
>  	__REQ_RAHEAD,		/* read ahead, can fail anytime */
>  	__REQ_BACKGROUND,	/* background IO */
>  	__REQ_NOWAIT,           /* Don't wait if request will block */
> +	/*
> +	 * When a shared kthread needs to issue a bio for a cgroup, doing
> +	 * so synchronously can lead to priority inversions as the kthread
> +	 * can be trapped waiting for that cgroup.  CGROUP_PUNT flag makes
> +	 * submit_bio() punt the actual issuing to a dedicated per-blkcg
> +	 * work item to avoid such priority inversions.
> +	 */
> +	__REQ_CGROUP_PUNT,
>  
>  	/* command specific flags for REQ_OP_WRITE_ZEROES: */
>  	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
> @@ -347,6 +355,8 @@ enum req_flag_bits {
>  #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
>  #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
>  #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
> +#define REQ_CGROUP_PUNT		(1ULL << __REQ_CGROUP_PUNT)
> +
>  #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
>  #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
>  
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 800ee031e88a..be602c42aab8 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -70,6 +70,7 @@ struct writeback_control {
>  	unsigned range_cyclic:1;	/* range_start is cyclic */
>  	unsigned for_sync:1;		/* sync(2) WB_SYNC_ALL writeback */
>  	unsigned no_wbc_acct:1;		/* skip wbc IO accounting */
> +	unsigned punt_to_cgroup:1;	/* cgrp punting, see __REQ_CGROUP_PUNT */
>  #ifdef CONFIG_CGROUP_WRITEBACK
>  	struct bdi_writeback *wb;	/* wb this writeback is issued under */
>  	struct inode *inode;		/* inode being written out */
> @@ -86,12 +87,17 @@ struct writeback_control {
>  
>  static inline int wbc_to_write_flags(struct writeback_control *wbc)
>  {
> +	int flags = 0;
> +
> +	if (wbc->punt_to_cgroup)
> +		flags = REQ_CGROUP_PUNT;
> +
>  	if (wbc->sync_mode == WB_SYNC_ALL)
> -		return REQ_SYNC;
> +		flags |= REQ_SYNC;
>  	else if (wbc->for_kupdate || wbc->for_background)
> -		return REQ_BACKGROUND;
> +		flags |= REQ_BACKGROUND;
>  
> -	return 0;
> +	return flags;
>  }
>  
>  static inline struct cgroup_subsys_state *
> -- 
> 2.17.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

