linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHSET for-5.3/block] block: add blkcg bio punt mechanism
@ 2019-06-27 20:39 Tejun Heo
  2019-06-27 20:39 ` [PATCH 1/5] cgroup, blkcg: Prepare some symbols for module and !CONFIG_CGROUP usages Tejun Heo
                   ` (5 more replies)
  0 siblings, 6 replies; 8+ messages in thread
From: Tejun Heo @ 2019-06-27 20:39 UTC (permalink / raw)
  To: axboe
  Cc: jack, josef, clm, dsterba, linux-block, linux-kernel,
	linux-btrfs, kernel-team

Hello,

This patchset contains only the block part of the following

  [1] [PATCHSET v2 btrfs/for-next] blkcg, btrfs: fix cgroup writeback support

with the following changes

 * wbc_account_io() is renamed to wbc_account_cgroup_owner() and
   wbc->no_account_io to wbc->no_cgroup_owner for clarity.

When writeback is executed asynchronously (e.g. for compression), bios
are bounced to and issued by worker pool shared by all cgroups.  This
leads to significant priority inversions when cgroup IO control is in
use - IOs for a low priority cgroup can tie down the workers forcing
higher priority IOs to wait behind them.

This patchset adds an bio punt mechanism to blkcg and updates btrfs to
issue async IOs through it.  A bio tagged with REQ_CGROUP_PUNT flag is
bounced to the asynchronous issue context of the associated blkcg on
bio_submit().  As the bios are issued from per-blkcg work items,
there's no concern for priority inversions and it doesn't require
invasive changes to the filesystems.  The mechanism should be
generally useful for IO control support across different filesystems.

This patchset contains the following 5 patches.

 0001-cgroup-blkcg-Prepare-some-symbols-for-module-and-CON.patch
 0002-blkcg-writeback-Rename-wbc_account_io-to-wbc_account.patch
 0003-blkcg-writeback-Add-wbc-no_cgroup_owner.patch
 0004-blkcg-writeback-Implement-wbc_blkcg_css.patch
 0005-blkcg-implement-REQ_CGROUP_PUNT.patch

The patches are also available in the following branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-blkcg-punt

Thanks, diffstat follows.

 Documentation/admin-guide/cgroup-v2.rst |    2 -
 block/blk-cgroup.c                      |   54 ++++++++++++++++++++++++++++++++
 block/blk-core.c                        |    3 +
 fs/btrfs/extent_io.c                    |    4 +-
 fs/buffer.c                             |    2 -
 fs/ext4/page-io.c                       |    2 -
 fs/f2fs/data.c                          |    4 +-
 fs/fs-writeback.c                       |   13 ++++---
 fs/mpage.c                              |    2 -
 include/linux/backing-dev.h             |    1 
 include/linux/blk-cgroup.h              |   16 ++++++++-
 include/linux/blk_types.h               |   10 +++++
 include/linux/cgroup.h                  |    1 
 include/linux/writeback.h               |   41 ++++++++++++++++++++----
 14 files changed, 134 insertions(+), 21 deletions(-)

--
tejun

[1] http://lkml.kernel.org/r/20190615182453.843275-1-tj@kernel.org


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/5] cgroup, blkcg: Prepare some symbols for module and !CONFIG_CGROUP usages
  2019-06-27 20:39 [PATCHSET for-5.3/block] block: add blkcg bio punt mechanism Tejun Heo
@ 2019-06-27 20:39 ` Tejun Heo
  2019-06-27 20:39 ` [PATCH 2/5] blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner() Tejun Heo
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2019-06-27 20:39 UTC (permalink / raw)
  To: axboe
  Cc: jack, josef, clm, dsterba, linux-block, linux-kernel,
	linux-btrfs, kernel-team, Tejun Heo

btrfs is going to use css_put() and wbc helpers to improve cgroup
writeback support.  Add dummy css_get() definition and export wbc
helpers to prepare for module and !CONFIG_CGROUP builds.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kbuild test robot <lkp@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 block/blk-cgroup.c     | 1 +
 fs/fs-writeback.c      | 3 +++
 include/linux/cgroup.h | 1 +
 3 files changed, 5 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 53b7bd4c7000..3319ab4ff262 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -47,6 +47,7 @@ struct blkcg blkcg_root;
 EXPORT_SYMBOL_GPL(blkcg_root);
 
 struct cgroup_subsys_state * const blkcg_root_css = &blkcg_root.css;
+EXPORT_SYMBOL_GPL(blkcg_root_css);
 
 static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9ebfb1b28430..a8a40bc26c2f 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -270,6 +270,7 @@ void __inode_attach_wb(struct inode *inode, struct page *page)
 	if (unlikely(cmpxchg(&inode->i_wb, NULL, wb)))
 		wb_put(wb);
 }
+EXPORT_SYMBOL_GPL(__inode_attach_wb);
 
 /**
  * locked_inode_to_wb_and_lock_list - determine a locked inode's wb and lock it
@@ -582,6 +583,7 @@ void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
 	if (unlikely(wb_dying(wbc->wb)))
 		inode_switch_wbs(inode, wbc->wb_id);
 }
+EXPORT_SYMBOL_GPL(wbc_attach_and_unlock_inode);
 
 /**
  * wbc_detach_inode - disassociate wbc from inode and perform foreign detection
@@ -701,6 +703,7 @@ void wbc_detach_inode(struct writeback_control *wbc)
 	wb_put(wbc->wb);
 	wbc->wb = NULL;
 }
+EXPORT_SYMBOL_GPL(wbc_detach_inode);
 
 /**
  * wbc_account_io - account IO issued during writeback
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index c0077adeea83..da9d2afbcf0c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -687,6 +687,7 @@ void cgroup_path_from_kernfs_id(const union kernfs_node_id *id,
 struct cgroup_subsys_state;
 struct cgroup;
 
+static inline void css_get(struct cgroup_subsys_state *css) {}
 static inline void css_put(struct cgroup_subsys_state *css) {}
 static inline int cgroup_attach_task_all(struct task_struct *from,
 					 struct task_struct *t) { return 0; }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/5] blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner()
  2019-06-27 20:39 [PATCHSET for-5.3/block] block: add blkcg bio punt mechanism Tejun Heo
  2019-06-27 20:39 ` [PATCH 1/5] cgroup, blkcg: Prepare some symbols for module and !CONFIG_CGROUP usages Tejun Heo
@ 2019-06-27 20:39 ` Tejun Heo
  2019-07-10 10:54   ` Jan Kara
  2019-06-27 20:39 ` [PATCH 3/5] blkcg, writeback: Add wbc->no_cgroup_owner Tejun Heo
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2019-06-27 20:39 UTC (permalink / raw)
  To: axboe
  Cc: jack, josef, clm, dsterba, linux-block, linux-kernel,
	linux-btrfs, kernel-team, Tejun Heo

wbc_account_io() does a very specific job - try to see which cgroup is
actually dirtying an inode and transfer its ownership to the majority
dirtier if needed.  The name is too generic and confusing.  Let's
rename it to something more specific.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jan Kara <jack@suse.cz>
---
 Documentation/admin-guide/cgroup-v2.rst | 2 +-
 fs/btrfs/extent_io.c                    | 4 ++--
 fs/buffer.c                             | 2 +-
 fs/ext4/page-io.c                       | 2 +-
 fs/f2fs/data.c                          | 4 ++--
 fs/fs-writeback.c                       | 8 ++++----
 fs/mpage.c                              | 2 +-
 include/linux/writeback.h               | 8 ++++----
 8 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index cf88c1f98270..356a7a3dcb2f 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2108,7 +2108,7 @@ following two functions.
 	a queue (device) has been associated with the bio and
 	before submission.
 
-  wbc_account_io(@wbc, @page, @bytes)
+  wbc_account_cgroup_owner(@wbc, @page, @bytes)
 	Should be called for each data segment being written out.
 	While this function doesn't care exactly when it's called
 	during the writeback session, it's the easiest and most
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index db337e53aab3..5106008f5e28 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2911,7 +2911,7 @@ static int submit_extent_page(unsigned int opf, struct extent_io_tree *tree,
 			bio = NULL;
 		} else {
 			if (wbc)
-				wbc_account_io(wbc, page, page_size);
+				wbc_account_cgroup_owner(wbc, page, page_size);
 			return 0;
 		}
 	}
@@ -2924,7 +2924,7 @@ static int submit_extent_page(unsigned int opf, struct extent_io_tree *tree,
 	bio->bi_opf = opf;
 	if (wbc) {
 		wbc_init_bio(wbc, bio);
-		wbc_account_io(wbc, page, page_size);
+		wbc_account_cgroup_owner(wbc, page, page_size);
 	}
 
 	*bio_ret = bio;
diff --git a/fs/buffer.c b/fs/buffer.c
index e450c55f6434..40547bbbea94 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3093,7 +3093,7 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
 
 	if (wbc) {
 		wbc_init_bio(wbc, bio);
-		wbc_account_io(wbc, bh->b_page, bh->b_size);
+		wbc_account_cgroup_owner(wbc, bh->b_page, bh->b_size);
 	}
 
 	submit_bio(bio);
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 4690618a92e9..56e287f5ee50 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -404,7 +404,7 @@ static int io_submit_add_bh(struct ext4_io_submit *io,
 	ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh));
 	if (ret != bh->b_size)
 		goto submit_and_retry;
-	wbc_account_io(io->io_wbc, page, bh->b_size);
+	wbc_account_cgroup_owner(io->io_wbc, page, bh->b_size);
 	io->io_next_block++;
 	return 0;
 }
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index eda4181d2092..e1cab1717ac7 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -470,7 +470,7 @@ int f2fs_submit_page_bio(struct f2fs_io_info *fio)
 	}
 
 	if (fio->io_wbc && !is_read_io(fio->op))
-		wbc_account_io(fio->io_wbc, page, PAGE_SIZE);
+		wbc_account_cgroup_owner(fio->io_wbc, page, PAGE_SIZE);
 
 	bio_set_op_attrs(bio, fio->op, fio->op_flags);
 
@@ -537,7 +537,7 @@ void f2fs_submit_page_write(struct f2fs_io_info *fio)
 	}
 
 	if (fio->io_wbc)
-		wbc_account_io(fio->io_wbc, bio_page, PAGE_SIZE);
+		wbc_account_cgroup_owner(fio->io_wbc, bio_page, PAGE_SIZE);
 
 	io->last_block_in_bio = fio->new_blkaddr;
 	f2fs_trace_ios(fio, 0);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a8a40bc26c2f..0aef79e934bb 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -706,7 +706,7 @@ void wbc_detach_inode(struct writeback_control *wbc)
 EXPORT_SYMBOL_GPL(wbc_detach_inode);
 
 /**
- * wbc_account_io - account IO issued during writeback
+ * wbc_account_cgroup_owner - account writeback to update inode cgroup ownership
  * @wbc: writeback_control of the writeback in progress
  * @page: page being written out
  * @bytes: number of bytes being written out
@@ -715,8 +715,8 @@ EXPORT_SYMBOL_GPL(wbc_detach_inode);
  * controlled by @wbc.  Keep the book for foreign inode detection.  See
  * wbc_detach_inode().
  */
-void wbc_account_io(struct writeback_control *wbc, struct page *page,
-		    size_t bytes)
+void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
+			      size_t bytes)
 {
 	struct cgroup_subsys_state *css;
 	int id;
@@ -753,7 +753,7 @@ void wbc_account_io(struct writeback_control *wbc, struct page *page,
 	else
 		wbc->wb_tcand_bytes -= min(bytes, wbc->wb_tcand_bytes);
 }
-EXPORT_SYMBOL_GPL(wbc_account_io);
+EXPORT_SYMBOL_GPL(wbc_account_cgroup_owner);
 
 /**
  * inode_congested - test whether an inode is congested
diff --git a/fs/mpage.c b/fs/mpage.c
index 436a85260394..a63620cdb73a 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -647,7 +647,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 	 * the confused fail path above (OOM) will be very confused when
 	 * it finds all bh marked clean (i.e. it will not write anything)
 	 */
-	wbc_account_io(wbc, page, PAGE_SIZE);
+	wbc_account_cgroup_owner(wbc, page, PAGE_SIZE);
 	length = first_unmapped << blkbits;
 	if (bio_add_page(bio, page, length, 0) < length) {
 		bio = mpage_bio_submit(REQ_OP_WRITE, op_flags, bio);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 738a0c24874f..dda5cf228172 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -188,8 +188,8 @@ void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
 				 struct inode *inode)
 	__releases(&inode->i_lock);
 void wbc_detach_inode(struct writeback_control *wbc);
-void wbc_account_io(struct writeback_control *wbc, struct page *page,
-		    size_t bytes);
+void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
+			      size_t bytes);
 void cgroup_writeback_umount(void);
 
 /**
@@ -291,8 +291,8 @@ static inline void wbc_init_bio(struct writeback_control *wbc, struct bio *bio)
 {
 }
 
-static inline void wbc_account_io(struct writeback_control *wbc,
-				  struct page *page, size_t bytes)
+static inline void wbc_account_cgroup_owner(struct writeback_control *wbc,
+					    struct page *page, size_t bytes)
 {
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/5] blkcg, writeback: Add wbc->no_cgroup_owner
  2019-06-27 20:39 [PATCHSET for-5.3/block] block: add blkcg bio punt mechanism Tejun Heo
  2019-06-27 20:39 ` [PATCH 1/5] cgroup, blkcg: Prepare some symbols for module and !CONFIG_CGROUP usages Tejun Heo
  2019-06-27 20:39 ` [PATCH 2/5] blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner() Tejun Heo
@ 2019-06-27 20:39 ` Tejun Heo
  2019-06-27 20:39 ` [PATCH 4/5] blkcg, writeback: Implement wbc_blkcg_css() Tejun Heo
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2019-06-27 20:39 UTC (permalink / raw)
  To: axboe
  Cc: jack, josef, clm, dsterba, linux-block, linux-kernel,
	linux-btrfs, kernel-team, Tejun Heo

When writeback IOs are bounced through async layers, the IOs should
only be accounted against the wbc from the original bdi writeback to
avoid confusing cgroup inode ownership arbitration.  Add
wbc->no_cgroup_owner to allow disabling wbc cgroup owner accounting.
This will be used make btrfs compression work well with cgroup IO
control.

v2: Renamed from no_wbc_acct to no_cgroup_owner and added comment as
    per Jan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c         | 2 +-
 include/linux/writeback.h | 9 +++++++++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 0aef79e934bb..542b02d170f8 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -727,7 +727,7 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
 	 * behind a slow cgroup.  Ultimately, we want pageout() to kick off
 	 * regular writeback instead of writing things out itself.
 	 */
-	if (!wbc->wb)
+	if (!wbc->wb || wbc->no_cgroup_owner)
 		return;
 
 	css = mem_cgroup_css_from_page(page);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index dda5cf228172..33a50fa09fac 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -68,6 +68,15 @@ struct writeback_control {
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
 	unsigned for_sync:1;		/* sync(2) WB_SYNC_ALL writeback */
+
+	/*
+	 * When writeback IOs are bounced through async layers, only the
+	 * initial synchronous phase should be accounted towards inode
+	 * cgroup ownership arbitration to avoid confusion.  Later stages
+	 * can set the following flag to disable the accounting.
+	 */
+	unsigned no_cgroup_owner:1;
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct bdi_writeback *wb;	/* wb this writeback is issued under */
 	struct inode *inode;		/* inode being written out */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 4/5] blkcg, writeback: Implement wbc_blkcg_css()
  2019-06-27 20:39 [PATCHSET for-5.3/block] block: add blkcg bio punt mechanism Tejun Heo
                   ` (2 preceding siblings ...)
  2019-06-27 20:39 ` [PATCH 3/5] blkcg, writeback: Add wbc->no_cgroup_owner Tejun Heo
@ 2019-06-27 20:39 ` Tejun Heo
  2019-06-27 20:39 ` [PATCH 5/5] blkcg: implement REQ_CGROUP_PUNT Tejun Heo
  2019-07-10 13:53 ` [PATCHSET for-5.3/block] block: add blkcg bio punt mechanism Jens Axboe
  5 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2019-06-27 20:39 UTC (permalink / raw)
  To: axboe
  Cc: jack, josef, clm, dsterba, linux-block, linux-kernel,
	linux-btrfs, kernel-team, Tejun Heo

Add a helper to determine the target blkcg from wbc.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/writeback.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 33a50fa09fac..e056a22075cf 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -11,6 +11,7 @@
 #include <linux/flex_proportions.h>
 #include <linux/backing-dev-defs.h>
 #include <linux/blk_types.h>
+#include <linux/blk-cgroup.h>
 
 struct bio;
 
@@ -101,6 +102,16 @@ static inline int wbc_to_write_flags(struct writeback_control *wbc)
 	return 0;
 }
 
+static inline struct cgroup_subsys_state *
+wbc_blkcg_css(struct writeback_control *wbc)
+{
+#ifdef CONFIG_CGROUP_WRITEBACK
+	if (wbc->wb)
+		return wbc->wb->blkcg_css;
+#endif
+	return blkcg_root_css;
+}
+
 /*
  * A wb_domain represents a domain that wb's (bdi_writeback's) belong to
  * and are measured against each other in.  There always is one global
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 5/5] blkcg: implement REQ_CGROUP_PUNT
  2019-06-27 20:39 [PATCHSET for-5.3/block] block: add blkcg bio punt mechanism Tejun Heo
                   ` (3 preceding siblings ...)
  2019-06-27 20:39 ` [PATCH 4/5] blkcg, writeback: Implement wbc_blkcg_css() Tejun Heo
@ 2019-06-27 20:39 ` Tejun Heo
  2019-07-10 13:53 ` [PATCHSET for-5.3/block] block: add blkcg bio punt mechanism Jens Axboe
  5 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2019-06-27 20:39 UTC (permalink / raw)
  To: axboe
  Cc: jack, josef, clm, dsterba, linux-block, linux-kernel,
	linux-btrfs, kernel-team, Tejun Heo

When a shared kthread needs to issue a bio for a cgroup, doing so
synchronously can lead to priority inversions as the kthread can be
trapped waiting for that cgroup.  This patch implements
REQ_CGROUP_PUNT flag which makes submit_bio() punt the actual issuing
to a dedicated per-blkcg work item to avoid such priority inversions.

This will be used to fix priority inversions in btrfs compression and
should be generally useful as we grow filesystem support for
comprehensive IO control.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Chris Mason <clm@fb.com>
---
 block/blk-cgroup.c          | 53 +++++++++++++++++++++++++++++++++++++
 block/blk-core.c            |  3 +++
 include/linux/backing-dev.h |  1 +
 include/linux/blk-cgroup.h  | 16 ++++++++++-
 include/linux/blk_types.h   | 10 +++++++
 include/linux/writeback.h   | 13 ++++++---
 6 files changed, 92 insertions(+), 4 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 3319ab4ff262..921a3ef329aa 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -54,6 +54,7 @@ static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
 static LIST_HEAD(all_blkcgs);		/* protected by blkcg_pol_mutex */
 
 static bool blkcg_debug_stats = false;
+static struct workqueue_struct *blkcg_punt_bio_wq;
 
 static bool blkcg_policy_enabled(struct request_queue *q,
 				 const struct blkcg_policy *pol)
@@ -88,6 +89,8 @@ static void __blkg_release(struct rcu_head *rcu)
 {
 	struct blkcg_gq *blkg = container_of(rcu, struct blkcg_gq, rcu_head);
 
+	WARN_ON(!bio_list_empty(&blkg->async_bios));
+
 	/* release the blkcg and parent blkg refs this blkg has been holding */
 	css_put(&blkg->blkcg->css);
 	if (blkg->parent)
@@ -113,6 +116,23 @@ static void blkg_release(struct percpu_ref *ref)
 	call_rcu(&blkg->rcu_head, __blkg_release);
 }
 
+static void blkg_async_bio_workfn(struct work_struct *work)
+{
+	struct blkcg_gq *blkg = container_of(work, struct blkcg_gq,
+					     async_bio_work);
+	struct bio_list bios = BIO_EMPTY_LIST;
+	struct bio *bio;
+
+	/* as long as there are pending bios, @blkg can't go away */
+	spin_lock_bh(&blkg->async_bio_lock);
+	bio_list_merge(&bios, &blkg->async_bios);
+	bio_list_init(&blkg->async_bios);
+	spin_unlock_bh(&blkg->async_bio_lock);
+
+	while ((bio = bio_list_pop(&bios)))
+		submit_bio(bio);
+}
+
 /**
  * blkg_alloc - allocate a blkg
  * @blkcg: block cgroup the new blkg is associated with
@@ -141,6 +161,9 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 
 	blkg->q = q;
 	INIT_LIST_HEAD(&blkg->q_node);
+	spin_lock_init(&blkg->async_bio_lock);
+	bio_list_init(&blkg->async_bios);
+	INIT_WORK(&blkg->async_bio_work, blkg_async_bio_workfn);
 	blkg->blkcg = blkcg;
 
 	for (i = 0; i < BLKCG_MAX_POLS; i++) {
@@ -1527,6 +1550,25 @@ void blkcg_policy_unregister(struct blkcg_policy *pol)
 }
 EXPORT_SYMBOL_GPL(blkcg_policy_unregister);
 
+bool __blkcg_punt_bio_submit(struct bio *bio)
+{
+	struct blkcg_gq *blkg = bio->bi_blkg;
+
+	/* consume the flag first */
+	bio->bi_opf &= ~REQ_CGROUP_PUNT;
+
+	/* never bounce for the root cgroup */
+	if (!blkg->parent)
+		return false;
+
+	spin_lock_bh(&blkg->async_bio_lock);
+	bio_list_add(&blkg->async_bios, bio);
+	spin_unlock_bh(&blkg->async_bio_lock);
+
+	queue_work(blkcg_punt_bio_wq, &blkg->async_bio_work);
+	return true;
+}
+
 /*
  * Scale the accumulated delay based on how long it has been since we updated
  * the delay.  We only call this when we are adding delay, in case it's been a
@@ -1727,5 +1769,16 @@ void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta)
 	atomic64_add(delta, &blkg->delay_nsec);
 }
 
+static int __init blkcg_init(void)
+{
+	blkcg_punt_bio_wq = alloc_workqueue("blkcg_punt_bio",
+					    WQ_MEM_RECLAIM | WQ_FREEZABLE |
+					    WQ_UNBOUND | WQ_SYSFS, 0);
+	if (!blkcg_punt_bio_wq)
+		return -ENOMEM;
+	return 0;
+}
+subsys_initcall(blkcg_init);
+
 module_param(blkcg_debug_stats, bool, 0644);
 MODULE_PARM_DESC(blkcg_debug_stats, "True if you want debug stats, false if not");
diff --git a/block/blk-core.c b/block/blk-core.c
index 5d1fc8e17dd1..812052c835fc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1127,6 +1127,9 @@ EXPORT_SYMBOL_GPL(direct_make_request);
  */
 blk_qc_t submit_bio(struct bio *bio)
 {
+	if (blkcg_punt_bio_submit(bio))
+		return BLK_QC_T_NONE;
+
 	/*
 	 * If it's a regular read/write or a barrier with data attached,
 	 * go through the normal accounting stuff before submission.
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index f9b029180241..35b31d176f74 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -48,6 +48,7 @@ extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
 
 extern struct workqueue_struct *bdi_wq;
+extern struct workqueue_struct *bdi_async_bio_wq;
 
 static inline bool wb_has_dirty_io(struct bdi_writeback *wb)
 {
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 33f23a858438..689a58231288 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -132,13 +132,17 @@ struct blkcg_gq {
 
 	struct blkg_policy_data		*pd[BLKCG_MAX_POLS];
 
-	struct rcu_head			rcu_head;
+	spinlock_t			async_bio_lock;
+	struct bio_list			async_bios;
+	struct work_struct		async_bio_work;
 
 	atomic_t			use_delay;
 	atomic64_t			delay_nsec;
 	atomic64_t			delay_start;
 	u64				last_delay;
 	int				last_use;
+
+	struct rcu_head			rcu_head;
 };
 
 typedef struct blkcg_policy_data *(blkcg_pol_alloc_cpd_fn)(gfp_t gfp);
@@ -701,6 +705,15 @@ static inline bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg
 				  struct bio *bio) { return false; }
 #endif
 
+bool __blkcg_punt_bio_submit(struct bio *bio);
+
+static inline bool blkcg_punt_bio_submit(struct bio *bio)
+{
+	if (bio->bi_opf & REQ_CGROUP_PUNT)
+		return __blkcg_punt_bio_submit(bio);
+	else
+		return false;
+}
 
 static inline void blkcg_bio_issue_init(struct bio *bio)
 {
@@ -848,6 +861,7 @@ static inline char *blkg_path(struct blkcg_gq *blkg) { return NULL; }
 static inline void blkg_get(struct blkcg_gq *blkg) { }
 static inline void blkg_put(struct blkcg_gq *blkg) { }
 
+static inline bool blkcg_punt_bio_submit(struct bio *bio) { return false; }
 static inline void blkcg_bio_issue_init(struct bio *bio) { }
 static inline bool blkcg_bio_issue_check(struct request_queue *q,
 					 struct bio *bio) { return true; }
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 6a53799c3fe2..feff3fe4467e 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -311,6 +311,14 @@ enum req_flag_bits {
 	__REQ_RAHEAD,		/* read ahead, can fail anytime */
 	__REQ_BACKGROUND,	/* background IO */
 	__REQ_NOWAIT,           /* Don't wait if request will block */
+	/*
+	 * When a shared kthread needs to issue a bio for a cgroup, doing
+	 * so synchronously can lead to priority inversions as the kthread
+	 * can be trapped waiting for that cgroup.  CGROUP_PUNT flag makes
+	 * submit_bio() punt the actual issuing to a dedicated per-blkcg
+	 * work item to avoid such priority inversions.
+	 */
+	__REQ_CGROUP_PUNT,
 
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
@@ -337,6 +345,8 @@ enum req_flag_bits {
 #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
 #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
+#define REQ_CGROUP_PUNT		(1ULL << __REQ_CGROUP_PUNT)
+
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
 #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
 
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index e056a22075cf..8945aac31392 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -78,6 +78,8 @@ struct writeback_control {
 	 */
 	unsigned no_cgroup_owner:1;
 
+	unsigned punt_to_cgroup:1;	/* cgrp punting, see __REQ_CGROUP_PUNT */
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct bdi_writeback *wb;	/* wb this writeback is issued under */
 	struct inode *inode;		/* inode being written out */
@@ -94,12 +96,17 @@ struct writeback_control {
 
 static inline int wbc_to_write_flags(struct writeback_control *wbc)
 {
+	int flags = 0;
+
+	if (wbc->punt_to_cgroup)
+		flags = REQ_CGROUP_PUNT;
+
 	if (wbc->sync_mode == WB_SYNC_ALL)
-		return REQ_SYNC;
+		flags |= REQ_SYNC;
 	else if (wbc->for_kupdate || wbc->for_background)
-		return REQ_BACKGROUND;
+		flags |= REQ_BACKGROUND;
 
-	return 0;
+	return flags;
 }
 
 static inline struct cgroup_subsys_state *
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/5] blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner()
  2019-06-27 20:39 ` [PATCH 2/5] blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner() Tejun Heo
@ 2019-07-10 10:54   ` Jan Kara
  0 siblings, 0 replies; 8+ messages in thread
From: Jan Kara @ 2019-07-10 10:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, jack, josef, clm, dsterba, linux-block, linux-kernel,
	linux-btrfs, kernel-team

On Thu 27-06-19 13:39:49, Tejun Heo wrote:
> wbc_account_io() does a very specific job - try to see which cgroup is
> actually dirtying an inode and transfer its ownership to the majority
> dirtier if needed.  The name is too generic and confusing.  Let's
> rename it to something more specific.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Jan Kara <jack@suse.cz>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  Documentation/admin-guide/cgroup-v2.rst | 2 +-
>  fs/btrfs/extent_io.c                    | 4 ++--
>  fs/buffer.c                             | 2 +-
>  fs/ext4/page-io.c                       | 2 +-
>  fs/f2fs/data.c                          | 4 ++--
>  fs/fs-writeback.c                       | 8 ++++----
>  fs/mpage.c                              | 2 +-
>  include/linux/writeback.h               | 8 ++++----
>  8 files changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index cf88c1f98270..356a7a3dcb2f 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -2108,7 +2108,7 @@ following two functions.
>  	a queue (device) has been associated with the bio and
>  	before submission.
>  
> -  wbc_account_io(@wbc, @page, @bytes)
> +  wbc_account_cgroup_owner(@wbc, @page, @bytes)
>  	Should be called for each data segment being written out.
>  	While this function doesn't care exactly when it's called
>  	during the writeback session, it's the easiest and most
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index db337e53aab3..5106008f5e28 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2911,7 +2911,7 @@ static int submit_extent_page(unsigned int opf, struct extent_io_tree *tree,
>  			bio = NULL;
>  		} else {
>  			if (wbc)
> -				wbc_account_io(wbc, page, page_size);
> +				wbc_account_cgroup_owner(wbc, page, page_size);
>  			return 0;
>  		}
>  	}
> @@ -2924,7 +2924,7 @@ static int submit_extent_page(unsigned int opf, struct extent_io_tree *tree,
>  	bio->bi_opf = opf;
>  	if (wbc) {
>  		wbc_init_bio(wbc, bio);
> -		wbc_account_io(wbc, page, page_size);
> +		wbc_account_cgroup_owner(wbc, page, page_size);
>  	}
>  
>  	*bio_ret = bio;
> diff --git a/fs/buffer.c b/fs/buffer.c
> index e450c55f6434..40547bbbea94 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -3093,7 +3093,7 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
>  
>  	if (wbc) {
>  		wbc_init_bio(wbc, bio);
> -		wbc_account_io(wbc, bh->b_page, bh->b_size);
> +		wbc_account_cgroup_owner(wbc, bh->b_page, bh->b_size);
>  	}
>  
>  	submit_bio(bio);
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 4690618a92e9..56e287f5ee50 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -404,7 +404,7 @@ static int io_submit_add_bh(struct ext4_io_submit *io,
>  	ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh));
>  	if (ret != bh->b_size)
>  		goto submit_and_retry;
> -	wbc_account_io(io->io_wbc, page, bh->b_size);
> +	wbc_account_cgroup_owner(io->io_wbc, page, bh->b_size);
>  	io->io_next_block++;
>  	return 0;
>  }
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index eda4181d2092..e1cab1717ac7 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -470,7 +470,7 @@ int f2fs_submit_page_bio(struct f2fs_io_info *fio)
>  	}
>  
>  	if (fio->io_wbc && !is_read_io(fio->op))
> -		wbc_account_io(fio->io_wbc, page, PAGE_SIZE);
> +		wbc_account_cgroup_owner(fio->io_wbc, page, PAGE_SIZE);
>  
>  	bio_set_op_attrs(bio, fio->op, fio->op_flags);
>  
> @@ -537,7 +537,7 @@ void f2fs_submit_page_write(struct f2fs_io_info *fio)
>  	}
>  
>  	if (fio->io_wbc)
> -		wbc_account_io(fio->io_wbc, bio_page, PAGE_SIZE);
> +		wbc_account_cgroup_owner(fio->io_wbc, bio_page, PAGE_SIZE);
>  
>  	io->last_block_in_bio = fio->new_blkaddr;
>  	f2fs_trace_ios(fio, 0);
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index a8a40bc26c2f..0aef79e934bb 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -706,7 +706,7 @@ void wbc_detach_inode(struct writeback_control *wbc)
>  EXPORT_SYMBOL_GPL(wbc_detach_inode);
>  
>  /**
> - * wbc_account_io - account IO issued during writeback
> + * wbc_account_cgroup_owner - account writeback to update inode cgroup ownership
>   * @wbc: writeback_control of the writeback in progress
>   * @page: page being written out
>   * @bytes: number of bytes being written out
> @@ -715,8 +715,8 @@ EXPORT_SYMBOL_GPL(wbc_detach_inode);
>   * controlled by @wbc.  Keep the book for foreign inode detection.  See
>   * wbc_detach_inode().
>   */
> -void wbc_account_io(struct writeback_control *wbc, struct page *page,
> -		    size_t bytes)
> +void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
> +			      size_t bytes)
>  {
>  	struct cgroup_subsys_state *css;
>  	int id;
> @@ -753,7 +753,7 @@ void wbc_account_io(struct writeback_control *wbc, struct page *page,
>  	else
>  		wbc->wb_tcand_bytes -= min(bytes, wbc->wb_tcand_bytes);
>  }
> -EXPORT_SYMBOL_GPL(wbc_account_io);
> +EXPORT_SYMBOL_GPL(wbc_account_cgroup_owner);
>  
>  /**
>   * inode_congested - test whether an inode is congested
> diff --git a/fs/mpage.c b/fs/mpage.c
> index 436a85260394..a63620cdb73a 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -647,7 +647,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
>  	 * the confused fail path above (OOM) will be very confused when
>  	 * it finds all bh marked clean (i.e. it will not write anything)
>  	 */
> -	wbc_account_io(wbc, page, PAGE_SIZE);
> +	wbc_account_cgroup_owner(wbc, page, PAGE_SIZE);
>  	length = first_unmapped << blkbits;
>  	if (bio_add_page(bio, page, length, 0) < length) {
>  		bio = mpage_bio_submit(REQ_OP_WRITE, op_flags, bio);
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 738a0c24874f..dda5cf228172 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -188,8 +188,8 @@ void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
>  				 struct inode *inode)
>  	__releases(&inode->i_lock);
>  void wbc_detach_inode(struct writeback_control *wbc);
> -void wbc_account_io(struct writeback_control *wbc, struct page *page,
> -		    size_t bytes);
> +void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
> +			      size_t bytes);
>  void cgroup_writeback_umount(void);
>  
>  /**
> @@ -291,8 +291,8 @@ static inline void wbc_init_bio(struct writeback_control *wbc, struct bio *bio)
>  {
>  }
>  
> -static inline void wbc_account_io(struct writeback_control *wbc,
> -				  struct page *page, size_t bytes)
> +static inline void wbc_account_cgroup_owner(struct writeback_control *wbc,
> +					    struct page *page, size_t bytes)
>  {
>  }
>  
> -- 
> 2.17.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCHSET for-5.3/block] block: add blkcg bio punt mechanism
  2019-06-27 20:39 [PATCHSET for-5.3/block] block: add blkcg bio punt mechanism Tejun Heo
                   ` (4 preceding siblings ...)
  2019-06-27 20:39 ` [PATCH 5/5] blkcg: implement REQ_CGROUP_PUNT Tejun Heo
@ 2019-07-10 13:53 ` Jens Axboe
  5 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2019-07-10 13:53 UTC (permalink / raw)
  To: Tejun Heo, axboe
  Cc: jack, josef, Chris Mason, dsterba, linux-block, linux-kernel,
	linux-btrfs, Kernel Team

On 6/27/19 2:39 PM, Tejun Heo wrote:
> Hello,
> 
> This patchset contains only the block part of the following
> 
>    [1] [PATCHSET v2 btrfs/for-next] blkcg, btrfs: fix cgroup writeback support
> 
> with the following changes
> 
>   * wbc_account_io() is renamed to wbc_account_cgroup_owner() and
>     wbc->no_account_io to wbc->no_cgroup_owner for clarity.
> 
> When writeback is executed asynchronously (e.g. for compression), bios
> are bounced to and issued by worker pool shared by all cgroups.  This
> leads to significant priority inversions when cgroup IO control is in
> use - IOs for a low priority cgroup can tie down the workers forcing
> higher priority IOs to wait behind them.
> 
> This patchset adds an bio punt mechanism to blkcg and updates btrfs to
> issue async IOs through it.  A bio tagged with REQ_CGROUP_PUNT flag is
> bounced to the asynchronous issue context of the associated blkcg on
> bio_submit().  As the bios are issued from per-blkcg work items,
> there's no concern for priority inversions and it doesn't require
> invasive changes to the filesystems.  The mechanism should be
> generally useful for IO control support across different filesystems.

Applied for 5.3, thanks Tejun.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2019-07-10 13:53 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-27 20:39 [PATCHSET for-5.3/block] block: add blkcg bio punt mechanism Tejun Heo
2019-06-27 20:39 ` [PATCH 1/5] cgroup, blkcg: Prepare some symbols for module and !CONFIG_CGROUP usages Tejun Heo
2019-06-27 20:39 ` [PATCH 2/5] blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner() Tejun Heo
2019-07-10 10:54   ` Jan Kara
2019-06-27 20:39 ` [PATCH 3/5] blkcg, writeback: Add wbc->no_cgroup_owner Tejun Heo
2019-06-27 20:39 ` [PATCH 4/5] blkcg, writeback: Implement wbc_blkcg_css() Tejun Heo
2019-06-27 20:39 ` [PATCH 5/5] blkcg: implement REQ_CGROUP_PUNT Tejun Heo
2019-07-10 13:53 ` [PATCHSET for-5.3/block] block: add blkcg bio punt mechanism Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).