[PATCHSET block/for-4.2/writeback] cgroup, writeback: misc updates for cgroup writeback support

* [PATCHSET block/for-4.2/writeback] cgroup, writeback: misc updates for cgroup writeback support
@ 2015-06-12 21:57 Tejun Heo
  2015-06-12 21:57 ` [PATCH 1/3] writeback: do foreign inode detection iff cgroup writeback is enabled Tejun Heo
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Tejun Heo @ 2015-06-12 21:57 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, linux-fsdevel, lizefan, cgroups

Hello,

This patchset contains the following assorted updates for the cgroup
writeback support.

 0001-writeback-do-foreign-inode-detection-iff-cgroup-writ.patch
 0002-vfs-writeback-replace-FS_CGROUP_WRITEBACK-with-MS_CG.patch
 0003-writeback-blkio-add-documentation-for-cgroup-writeba.patch

0001 fixes a bug where clearing the FS_CGROUP_WRITEBACK flag didn't fully
disable cgroup writeback support if the filesystem code uses
wbc_init_bio() and wbc_account_io().

0002 replaces FS_CGROUP_WRITEBACK with MS_CGROUPWB so that cgroup
writeback support can be enabled / disabled per superblock rather than
filesystem type.

0003 updates blkio documentation with information on cgroup writeback
support.

This patchset is on top of block/for-4.2/writeback and available in
the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup-writeback-updates

diffstat follows.  Thanks.

 Documentation/cgroups/blkio-controller.txt |   83 +++++++++++++++++++++++++++--
 fs/ext2/super.c                            |    4 -
 fs/fs-writeback.c                          |   16 ++++-
 fs/namespace.c                             |    2 
 include/linux/backing-dev.h                |    2 
 include/linux/fs.h                         |    1 
 include/uapi/linux/fs.h                    |    1 
 7 files changed, 96 insertions(+), 13 deletions(-)

--
tejun

* [PATCH 1/3] writeback: do foreign inode detection iff cgroup writeback is enabled
  2015-06-12 21:57 [PATCHSET block/for-4.2/writeback] cgroup, writeback: misc updates for cgroup writeback support Tejun Heo
@ 2015-06-12 21:57 ` Tejun Heo
       [not found] ` <1434146254-26220-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2015-06-12 21:57 ` [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support Tejun Heo
  2 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2015-06-12 21:57 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel, linux-fsdevel, lizefan, cgroups, Tejun Heo

Currently, even when a filesystem doesn't set the FS_CGROUP_WRITEBACK
flag, if the filesystem uses wbc_init_bio() and wbc_account_io(), the
foreign inode detection and migration logic still ends up activating
cgroup writeback which is unexpected.  This patch ensures that the
foreign inode detection logic stays disabled when inode_cgwb_enabled()
is false by not associating writeback_control's with bdi_writeback's.

This also avoids unnecessary operations in wbc_init_bio(),
wbc_account_io() and wbc_detach_inode() for filesystems which don't
support cgroup writeback.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 fs/fs-writeback.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index f60de54..f0520bc 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -513,6 +513,11 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
 void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
 				 struct inode *inode)
 {
+	if (!inode_cgwb_enabled(inode)) {
+		spin_unlock(&inode->i_lock);
+		return;
+	}
+
 	wbc->wb = inode_to_wb(inode);
 	wbc->inode = inode;
 
@@ -575,11 +580,16 @@ void wbc_detach_inode(struct writeback_control *wbc)
 {
 	struct bdi_writeback *wb = wbc->wb;
 	struct inode *inode = wbc->inode;
-	u16 history = inode->i_wb_frn_history;
-	unsigned long avg_time = inode->i_wb_frn_avg_time;
-	unsigned long max_bytes, max_time;
+	unsigned long avg_time, max_bytes, max_time;
+	u16 history;
 	int max_id;
 
+	if (!wb)
+		return;
+
+	history = inode->i_wb_frn_history;
+	avg_time = inode->i_wb_frn_avg_time;
+
 	/* pick the winner of this round */
 	if (wbc->wb_bytes >= wbc->wb_lcand_bytes &&
 	    wbc->wb_bytes >= wbc->wb_tcand_bytes) {
-- 
2.4.2
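
The effect of the two guards in this patch can be modeled outside the
kernel.  The following is only a minimal user-space sketch: every mock_*
name is a hypothetical stand-in rather than a kernel structure or
function, locking is omitted, and the foreign inode detection itself is
reduced to a counter.  It shows that when the attach step bails out,
wbc->wb stays NULL and the detach step has nothing to do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; these are NOT the kernel structures. */
struct mock_wb { int id; };

struct mock_inode {
	bool cgwb_enabled;        /* models the result of inode_cgwb_enabled() */
	struct mock_wb *wb;       /* models the result of inode_to_wb() */
	unsigned int detect_runs; /* counts foreign-detection passes */
};

struct mock_wbc {
	struct mock_wb *wb;       /* stays NULL when cgroup writeback is off */
	struct mock_inode *inode;
};

/* Attach: skip the association entirely when cgroup writeback is disabled. */
static void mock_wbc_attach(struct mock_wbc *wbc, struct mock_inode *inode)
{
	if (!inode->cgwb_enabled)
		return;
	wbc->wb = inode->wb;
	wbc->inode = inode;
}

/* Detach: a NULL wb means attach bailed out, so there is nothing to do. */
static void mock_wbc_detach(struct mock_wbc *wbc)
{
	if (!wbc->wb)
		return;
	assert(wbc->inode != NULL);
	wbc->inode->detect_runs++; /* placeholder for foreign inode detection */
	wbc->wb = NULL;
}

/* Runs one attach/detach cycle and reports how many detection passes ran. */
unsigned int mock_writeback_cycle(bool cgwb_enabled)
{
	struct mock_wb wb = { .id = 1 };
	struct mock_inode inode = { .cgwb_enabled = cgwb_enabled, .wb = &wb };
	struct mock_wbc wbc = { 0 };

	mock_wbc_attach(&wbc, &inode);
	mock_wbc_detach(&wbc);
	return inode.detect_runs;
}
```

Under these assumptions, mock_writeback_cycle(false) runs no detection
pass at all, while mock_writeback_cycle(true) runs exactly one.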

* [PATCH 2/3] vfs, writeback: replace FS_CGROUP_WRITEBACK with MS_CGROUPWB
       [not found] ` <1434146254-26220-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2015-06-12 21:57   ` Tejun Heo
       [not found]     ` <1434146254-26220-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2015-06-12 21:57 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Alexander Viro, Jan Kara,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA

FS_CGROUP_WRITEBACK indicates whether a file_system_type supports
cgroup writeback; however, different super_blocks of the same
file_system_type may or may not support cgroup writeback depending on
filesystem options.  This patch replaces FS_CGROUP_WRITEBACK with a
kernel-internal super_block->s_flags MS_CGROUPWB.  The concatenated
and abbreviated name is for consistency with other MS_* flags.

ext2_fill_super() is updated to assert MS_CGROUPWB.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Alexander Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
Cc: linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 fs/ext2/super.c             | 4 ++--
 fs/namespace.c              | 2 +-
 include/linux/backing-dev.h | 2 +-
 include/linux/fs.h          | 1 -
 include/uapi/linux/fs.h     | 1 +
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 549219d..472ed34 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -879,7 +879,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 	if (!parse_options((char *) data, sb))
 		goto failed_mount;
 
-	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
+	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) | MS_CGROUPWB |
 		((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
 		 MS_POSIXACL : 0);
 
@@ -1543,7 +1543,7 @@ static struct file_system_type ext2_fs_type = {
 	.name		= "ext2",
 	.mount		= ext2_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV | FS_CGROUP_WRITEBACK,
+	.fs_flags	= FS_REQUIRES_DEV,
 };
 MODULE_ALIAS_FS("ext2");
 
diff --git a/fs/namespace.c b/fs/namespace.c
index 1f4f9da..507b90b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2669,7 +2669,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
 
 	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_BORN |
 		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
-		   MS_STRICTATIME);
+		   MS_STRICTATIME | MS_CGROUPWB);
 
 	if (flags & MS_REMOUNT)
 		retval = do_remount(&path, flags & ~MS_REMOUNT, mnt_flags,
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index dfce808..1489131 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -260,7 +260,7 @@ static inline bool inode_cgwb_enabled(struct inode *inode)
 
 	return bdi_cap_account_dirty(bdi) &&
 		(bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
-		(inode->i_sb->s_type->fs_flags & FS_CGROUP_WRITEBACK);
+		(inode->i_sb->s_flags & MS_CGROUPWB);
 }
 
 /**
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b5e1dcf..66e35dc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1912,7 +1912,6 @@ struct file_system_type {
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_USERNS_DEV_MOUNT	16 /* A userns mount does not imply MNT_NODEV */
-#define FS_CGROUP_WRITEBACK	32	/* Supports cgroup-aware writeback */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	struct dentry *(*mount) (struct file_system_type *, int,
 		       const char *, void *);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 9b964a5..60316e7 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -93,6 +93,7 @@ struct inodes_stat_t {
 #define MS_LAZYTIME	(1<<25) /* Update the on-disk [acm]times lazily */
 
 /* These sb flags are internal to the kernel */
+#define MS_CGROUPWB	(1<<27)	/* cgroup-aware writeback enabled */
 #define MS_NOSEC	(1<<28)
 #define MS_BORN		(1<<29)
 #define MS_ACTIVE	(1<<30)
-- 
2.4.2

* Re: [PATCH 2/3] vfs, writeback: replace FS_CGROUP_WRITEBACK with MS_CGROUPWB
       [not found]     ` <1434146254-26220-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2015-06-13 16:16       ` Christoph Hellwig
       [not found]         ` <20150613161608.GA29414-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2015-06-13 16:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Alexander Viro, Jan Kara, linux-ext4-u79uwXL29TY76Z2rM5mHXA

On Fri, Jun 12, 2015 at 04:57:33PM -0500, Tejun Heo wrote:
> FS_CGROUP_WRITEBACK indicates whether a file_system_type supports
> cgroup writeback; however, different super_blocks of the same
> file_system_type may or may not support cgroup writeback depending on
> filesystem options.  This patch replaces FS_CGROUP_WRITEBACK with a
> kernel-internal super_block->s_flags MS_CGROUPWB.  The concatenated
> and abbreviated name is for consistency with other MS_* flags.

Nak.  As the uapi part makes it obvious the MS_ namespace is part
of the userspace ABI.  Please add a new in-kernel flags field instead.

* Re: [PATCH 2/3] vfs, writeback: replace FS_CGROUP_WRITEBACK with MS_CGROUPWB
       [not found]         ` <20150613161608.GA29414-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2015-06-14  5:42           ` Tejun Heo
       [not found]             ` <20150614054236.GA9662-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2015-06-14  5:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Alexander Viro, Jan Kara, linux-ext4-u79uwXL29TY76Z2rM5mHXA

Hello, Christoph.

On Sat, Jun 13, 2015 at 09:16:08AM -0700, Christoph Hellwig wrote:
> On Fri, Jun 12, 2015 at 04:57:33PM -0500, Tejun Heo wrote:
> > FS_CGROUP_WRITEBACK indicates whether a file_system_type supports
> > cgroup writeback; however, different super_blocks of the same
> > file_system_type may or may not support cgroup writeback depending on
> > filesystem options.  This patch replaces FS_CGROUP_WRITEBACK with a
> > kernel-internal super_block->s_flags MS_CGROUPWB.  The concatenated
> > and abbreviated name is for consistency with other MS_* flags.
> 
> Nak.  As the uapi part makes it obvious the MS_ namespace is part
> of the userspace ABI.  Please add a new in-kernel flags field instead.

Are MS_ACTIVE and MS_BORN part of the userspace ABI?  They seem pretty
internal.  I don't mind introducing a new internal flag field but it's
weird to put this single flag there while the other internal flags stay
in ->s_flags.

Assuming we add a new field, how do sb->s_iflags and SB_I_XXX sound?
Any better suggestions?

Thanks.

-- 
tejun

* Re: [PATCH 2/3] vfs, writeback: replace FS_CGROUP_WRITEBACK with MS_CGROUPWB
       [not found]             ` <20150614054236.GA9662-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2015-06-15 11:39               ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2015-06-15 11:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Hellwig, axboe-tSWWG44O7X1aa/9Udqfwiw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Alexander Viro, Jan Kara, linux-ext4-u79uwXL29TY76Z2rM5mHXA

On Sun 14-06-15 00:42:36, Tejun Heo wrote:
> Hello, Christoph.
> 
> On Sat, Jun 13, 2015 at 09:16:08AM -0700, Christoph Hellwig wrote:
> > On Fri, Jun 12, 2015 at 04:57:33PM -0500, Tejun Heo wrote:
> > > FS_CGROUP_WRITEBACK indicates whether a file_system_type supports
> > > cgroup writeback; however, different super_blocks of the same
> > > file_system_type may or may not support cgroup writeback depending on
> > > filesystem options.  This patch replaces FS_CGROUP_WRITEBACK with a
> > > kernel-internal super_block->s_flags MS_CGROUPWB.  The concatenated
> > > and abbreviated name is for consistency with other MS_* flags.
> > 
> > Nak.  As the uapi part makes it obvious the MS_ namespace is part
> > of the userspace ABI.  Please add a new in-kernel flags field instead.
> 
> Are MS_ACTIVE and MS_BORN part of the userspace ABI?  They seem pretty
> internal.  I don't mind introducing a new internal flag field but it's
> weird to put this single flag there while the other internal flags stay
> in ->s_flags.
  So you are right that there are other internal flags allocated from the
top of the s_flags field; however, we are pretty much running out of the
flags available for the ABI, so it's better to move internal flags elsewhere
as that's simpler than creating a new ABI for mount...

> Assuming we add a new field, how do sb->s_iflags and SB_I_XXX sound?
> Any better suggestions?
  Looks good to me.

								Honza
-- 
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR
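
The separate-field idea discussed above can be sketched as follows.
This is a user-space illustration under stated assumptions, not the
eventual kernel change: the s_iflags field name follows the suggestion
in this thread, while MOCK_SB_I_CGROUPWB and every mock_* helper are
hypothetical names invented for the example.  The point is that an
internal field never receives bits from mount(2), so no masking of
user-supplied flags is needed to keep the bit private.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical kernel-internal bit, kept out of the MS_* namespace. */
#define MOCK_SB_I_CGROUPWB (1UL << 0)

struct mock_super_block {
	unsigned long s_flags;   /* shares its bit space with the mount ABI */
	unsigned long s_iflags;  /* internal only, never copied from userspace */
};

/* Filesystem-side opt-in, e.g. from a fill_super-style routine. */
void mock_enable_cgroup_writeback(struct mock_super_block *sb)
{
	assert(sb);
	sb->s_iflags |= MOCK_SB_I_CGROUPWB;
}

/* Models the per-superblock part of an inode_cgwb_enabled()-style test. */
bool mock_sb_cgwb_enabled(const struct mock_super_block *sb)
{
	return (sb->s_iflags & MOCK_SB_I_CGROUPWB) != 0;
}

/* Mount flags from userspace land in s_flags and cannot touch s_iflags. */
void mock_apply_mount_flags(struct mock_super_block *sb, unsigned long flags)
{
	sb->s_flags = flags;
}

/* Walks one superblock through mount and opt-in; returns the final state. */
bool mock_demo(unsigned long user_flags, bool fs_opts_in)
{
	struct mock_super_block sb = { 0, 0 };

	mock_apply_mount_flags(&sb, user_flags);
	if (fs_opts_in)
		mock_enable_cgroup_writeback(&sb);
	return mock_sb_cgwb_enabled(&sb);
}
```

Even if a caller passes every bit set in user_flags, the internal bit
stays clear unless the filesystem opts in.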

* [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
  2015-06-12 21:57 [PATCHSET block/for-4.2/writeback] cgroup, writeback: misc updates for cgroup writeback support Tejun Heo
  2015-06-12 21:57 ` [PATCH 1/3] writeback: do foreign inode detection iff cgroup writeback is enabled Tejun Heo
       [not found] ` <1434146254-26220-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2015-06-12 21:57 ` Tejun Heo
       [not found]   ` <1434146254-26220-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2015-06-12 21:57 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, linux-fsdevel, lizefan, cgroups, Tejun Heo, Vivek Goyal

Update Documentation/cgroups/blkio-controller.txt to reflect the
recently added cgroup writeback support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: cgroups@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
---
 Documentation/cgroups/blkio-controller.txt | 83 ++++++++++++++++++++++++++++--
 1 file changed, 78 insertions(+), 5 deletions(-)

diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
index cd556b9..68b6a6a 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -387,8 +387,81 @@ groups and put applications in that group which are not driving enough
 IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
 on individual groups and throughput should improve.
 
-What works
-==========
-- Currently only sync IO queues are support. All the buffered writes are
-  still system wide and not per group. Hence we will not see service
-  differentiation between buffered writes between groups.
+Writeback
+=========
+
+Page cache is dirtied through buffered writes and shared mmaps and
+written asynchronously to the backing filesystem by the writeback
+mechanism.  Writeback sits between the memory and IO domains and
+regulates the proportion of dirty memory by balancing dirtying and
+write IOs.
+
+On traditional cgroup hierarchies, relationships between different
+controllers cannot be established making it impossible for writeback
+to operate accounting for cgroup resource restrictions and all
+writeback IOs are attributed to the root cgroup.
+
+If both the blkio and memory controllers are used on the v2 hierarchy
+and the filesystem supports cgroup writeback, writeback operations
+correctly follow the resource restrictions imposed by both memory and
+blkio controllers.
+
+Writeback examines both system-wide and per-cgroup dirty memory status
+and enforces the more restrictive of the two.  Also, writeback control
+parameters which are absolute values - vm.dirty_bytes and
+vm.dirty_background_bytes - are distributed across cgroups according
+to their current writeback bandwidth.
+
+There's a peculiarity stemming from the discrepancy in ownership
+granularity between memory controller and writeback.  While memory
+controller tracks ownership per page, writeback operates on inode
+basis.  cgroup writeback bridges the gap by tracking ownership by
+inode but migrating ownership if too many foreign pages, pages which
+don't match the current inode ownership, have been encountered while
+writing back the inode.
+
+This is a conscious design choice as writeback operations are
+inherently tied to inodes making strictly following page ownership
+complicated and inefficient.  The only use case which suffers from
+this compromise is multiple cgroups concurrently dirtying disjoint
+regions of the same inode, which is an unlikely use case and decided
+to be unsupported.  Note that as memory controller assigns page
+ownership on the first use and doesn't update it until the page is
+released, even if cgroup writeback strictly follows page ownership,
+multiple cgroups dirtying overlapping areas wouldn't work as expected.
+In general, write-sharing an inode across multiple cgroups is not well
+supported.
+
+Filesystem support for cgroup writeback
+---------------------------------------
+
+A filesystem can make writeback IOs cgroup-aware by updating
+address_space_operations->writepage[s]() to annotate bio's using the
+following two functions.
+
+* wbc_init_bio(@wbc, @bio)
+
+  Should be called for each bio carrying writeback data and associates
+  the bio with the inode's owner cgroup.  Can be called anytime
+  between bio allocation and submission.
+
+* wbc_account_io(@wbc, @page, @bytes)
+
+  Should be called for each data segment being written out.  While
+  this function doesn't care exactly when it's called during the
+  writeback session, it's the easiest and most natural to call it as
+  data segments are added to a bio.
+
+With writeback bio's annotated, cgroup support can be enabled per
+super_block by setting MS_CGROUPWB in ->s_flags.  This allows for
+selective disabling of cgroup writeback support which is helpful when
+certain filesystem features, e.g. journaled data mode, are
+incompatible.
+
+wbc_init_bio() binds the specified bio to its cgroup.  Depending on
+the configuration, the bio may be executed at a lower priority and if
+the writeback session is holding shared resources, e.g. a journal
+entry, may lead to priority inversion.  There is no one easy solution
+for the problem.  Filesystems can try to work around specific problem
+cases by skipping wbc_init_bio() or using bio_associate_blkcg()
+directly.
-- 
2.4.2
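
The calling pattern described in the documentation above (initialize
once per bio, account once per data segment as it is added) can be
sketched in user space.  This is only an illustration: the mock_*
types and helpers are hypothetical stand-ins, the page argument of
wbc_account_io() is dropped, and cgroups are reduced to integer ids.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the call pattern matters. */
struct mock_bio { int cgroup_id; size_t bytes; };
struct mock_wbc { int owner_cgroup_id; size_t accounted_bytes; };

/* Stand-in for wbc_init_bio(): tag the bio with the inode's owner cgroup. */
static void mock_wbc_init_bio(const struct mock_wbc *wbc, struct mock_bio *bio)
{
	bio->cgroup_id = wbc->owner_cgroup_id;
}

/* Stand-in for wbc_account_io(): record the bytes written for one segment. */
static void mock_wbc_account_io(struct mock_wbc *wbc, size_t bytes)
{
	wbc->accounted_bytes += bytes;
}

/*
 * Build one bio from nr_segs segments: init once per bio, account once per
 * segment as it is added.  Returns the total bytes placed in the bio.
 */
size_t mock_write_segments(struct mock_wbc *wbc, struct mock_bio *bio,
			   const size_t *seg_bytes, size_t nr_segs)
{
	size_t i;

	assert(wbc && bio);
	bio->bytes = 0;
	mock_wbc_init_bio(wbc, bio);
	for (i = 0; i < nr_segs; i++) {
		bio->bytes += seg_bytes[i];
		mock_wbc_account_io(wbc, seg_bytes[i]);
	}
	return bio->bytes;
}
```

Accounting in the same loop that fills the bio keeps the accounted byte
count equal to the bytes actually queued, which is why the
documentation calls that placement the most natural one.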

* Re: [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
       [not found]   ` <1434146254-26220-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2015-06-15 17:28     ` Vivek Goyal
  2015-06-15 18:23       ` Tejun Heo
  0 siblings, 1 reply; 16+ messages in thread
From: Vivek Goyal @ 2015-06-15 17:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe-tSWWG44O7X1aa/9Udqfwiw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA

On Fri, Jun 12, 2015 at 04:57:34PM -0500, Tejun Heo wrote:
> Update Documentation/cgroups/blkio-controller.txt to reflect the
> recently added cgroup writeback support.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Cc: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> Cc: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> ---
>  Documentation/cgroups/blkio-controller.txt | 83 ++++++++++++++++++++++++++++--

Hi Tejun,

This looks good to me. Thanks.

IIRC, I had run into issues with two fsyncs running in two cgroups.
One cgroup had a really small limit and the other was unlimited. At that
point in time I think the conclusion was that multiple transactions could
not make progress at the same time. So the slower cgroup had blocked the
unlimited cgroup's process from opening a transaction (as IO from the
slower group was stuck inside the throttling layer).

For some reason, in my limited testing I have not noticed it with your
branch. Maybe things have changed since, or I am just hazy on details.
I will do some more testing.

Thanks
Vivek

>  1 file changed, 78 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
> index cd556b9..68b6a6a 100644
> --- a/Documentation/cgroups/blkio-controller.txt
> +++ b/Documentation/cgroups/blkio-controller.txt
> @@ -387,8 +387,81 @@ groups and put applications in that group which are not driving enough
>  IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
>  on individual groups and throughput should improve.
>  
> -What works
> -==========
> -- Currently only sync IO queues are support. All the buffered writes are
> -  still system wide and not per group. Hence we will not see service
> -  differentiation between buffered writes between groups.
> +Writeback
> +=========
> +
> +Page cache is dirtied through buffered writes and shared mmaps and
> +written asynchronously to the backing filesystem by the writeback
> +mechanism.  Writeback sits between the memory and IO domains and
> +regulates the proportion of dirty memory by balancing dirtying and
> +write IOs.
> +
> +On traditional cgroup hierarchies, relationships between different
> +controllers cannot be established making it impossible for writeback
> +to operate accounting for cgroup resource restrictions and all
> +writeback IOs are attributed to the root cgroup.
> +
> +If both the blkio and memory controllers are used on the v2 hierarchy
> +and the filesystem supports cgroup writeback, writeback operations
> +correctly follow the resource restrictions imposed by both memory and
> +blkio controllers.
> +
> +Writeback examines both system-wide and per-cgroup dirty memory status
> +and enforces the more restrictive of the two.  Also, writeback control
> +parameters which are absolute values - vm.dirty_bytes and
> +vm.dirty_background_bytes - are distributed across cgroups according
> +to their current writeback bandwidth.
> +
> +There's a peculiarity stemming from the discrepancy in ownership
> +granularity between memory controller and writeback.  While memory
> +controller tracks ownership per page, writeback operates on inode
> +basis.  cgroup writeback bridges the gap by tracking ownership by
> +inode but migrating ownership if too many foreign pages, pages which
> +don't match the current inode ownership, have been encountered while
> +writing back the inode.
> +
> +This is a conscious design choice as writeback operations are
> +inherently tied to inodes making strictly following page ownership
> +complicated and inefficient.  The only use case which suffers from
> +this compromise is multiple cgroups concurrently dirtying disjoint
> +regions of the same inode, which is an unlikely use case and decided
> +to be unsupported.  Note that as memory controller assigns page
> +ownership on the first use and doesn't update it until the page is
> +released, even if cgroup writeback strictly follows page ownership,
> +multiple cgroups dirtying overlapping areas wouldn't work as expected.
> +In general, write-sharing an inode across multiple cgroups is not well
> +supported.
> +
> +Filesystem support for cgroup writeback
> +---------------------------------------
> +
> +A filesystem can make writeback IOs cgroup-aware by updating
> +address_space_operations->writepage[s]() to annotate bio's using the
> +following two functions.
> +
> +* wbc_init_bio(@wbc, @bio)
> +
> +  Should be called for each bio carrying writeback data and associates
> +  the bio with the inode's owner cgroup.  Can be called anytime
> +  between bio allocation and submission.
> +
> +* wbc_account_io(@wbc, @page, @bytes)
> +
> +  Should be called for each data segment being written out.  While
> +  this function doesn't care exactly when it's called during the
> +  writeback session, it's the easiest and most natural to call it as
> +  data segments are added to a bio.
> +
> +With writeback bio's annotated, cgroup support can be enabled per
> +super_block by setting MS_CGROUPWB in ->s_flags.  This allows for
> +selective disabling of cgroup writeback support which is helpful when
> +certain filesystem features, e.g. journaled data mode, are
> +incompatible.
> +
> +wbc_init_bio() binds the specified bio to its cgroup.  Depending on
> +the configuration, the bio may be executed at a lower priority and if
> +the writeback session is holding shared resources, e.g. a journal
> +entry, may lead to priority inversion.  There is no one easy solution
> +for the problem.  Filesystems can try to work around specific problem
> +cases by skipping wbc_init_bio() or using bio_associate_blkcg()
> +directly.
> -- 
> 2.4.2

* Re: [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
  2015-06-15 17:28     ` Vivek Goyal
@ 2015-06-15 18:23       ` Tejun Heo
       [not found]         ` <20150615182345.GB18517-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2015-06-15 18:23 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, linux-kernel, linux-fsdevel, lizefan, cgroups

Hey, Vivek.

On Mon, Jun 15, 2015 at 01:28:23PM -0400, Vivek Goyal wrote:
> IIRC, I had run into issues with two fsyncs running in two cgroups.
> One cgroup had a really small limit and the other was unlimited. At that
> point in time I think the conclusion was that multiple transactions could
> not make progress at the same time. So the slower cgroup had blocked the
> unlimited cgroup's process from opening a transaction (as IO from the
> slower group was stuck inside the throttling layer).
>
> For some reason, in my limited testing I have not noticed it with your
> branch. Maybe things have changed since, or I am just hazy on details.
> I will do some more testing.

On ext2, there's nothing interlocking.  My understanding of ext4 is
pretty limited but as long as the journal head doesn't wrap around and
get blocked on the slow one, it should be fine, so for most use cases,
this shouldn't be a problem.

Thanks.

-- 
tejun

* Re: [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
       [not found]         ` <20150615182345.GB18517-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2015-06-15 23:35           ` Theodore Ts'o
       [not found]             ` <20150615233519.GB30059-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Theodore Ts'o @ 2015-06-15 23:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vivek Goyal, axboe-tSWWG44O7X1aa/9Udqfwiw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA

On Mon, Jun 15, 2015 at 02:23:45PM -0400, Tejun Heo wrote:
> 
> On ext2, there's nothing interlocking.  My understanding of ext4 is
> pretty limited but as long as the journal head doesn't wrap around and
> get blocked on the slow one, it should be fine, so for most use cases,
> this shouldn't be a problem.

The writes to the journal in ext3/ext4 are done from the jbd/jbd2
kernel thread.  So writes to the journal shouldn't be a problem.  In
data=ordered mode inodes that have blocks that were allocated during
the current transaction do have to have their data blocks written out,
and this is done by the jbd/jbd2 thread using filemap_fdatawait().

If this gets throttled because blocks were originally dirtied by some
cgroup that didn't have much disk time quota, then all file system
activities will get stalled out until the ordered mode writeback
completes, which means any high priority cgroups trying to execute
any system call that mutates file system state will block until the
commit has gotten past the initial setup stage, and so other system
activity could sputter to a halt --- at which point the commit will
be allowed to complete, and then all of the calls to
ext4_journal_start() will unblock, and the system will come back to
life.  :-)

Because ext3 doesn't have delayed allocation, it will do orders of
magnitude more data=ordered block flushing, so this problem will be
far worse with ext3 compared to ext4.

So if there is some way we can signal to any cgroup that might be
throttling writeback or disk I/O that the jbd/jbd2 process should be
considered privileged, that would be good since it would allow us to
avoid a potential priority inversion problem.

						- Ted

* Re: [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
       [not found]             ` <20150615233519.GB30059-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
@ 2015-06-16 21:54               ` Tejun Heo
  2015-06-17  3:15                 ` Theodore Ts'o
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2015-06-16 21:54 UTC (permalink / raw)
  To: Theodore Ts'o, Vivek Goyal, axboe-tSWWG44O7X1aa/9Udqfwiw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA

Hello, Ted.

On Mon, Jun 15, 2015 at 07:35:19PM -0400, Theodore Ts'o wrote:
> So if there is some way we can signal to any cgroup that might be
> throttling writeback or disk I/O that the jbd/jbd2 process should be
> considered privileged, that would be good since it would allow us to
> avoid a potential priority inversion problem.

I see.  In the long term, I think we might need to come up with a way
to overcharge a slower cgroup to avoid blocking faster ones for cases
where some IOs are depended upon by more than one cgroup.  That'd
take quite a bit of work from blkcg side.  Will think more about it.

Thanks.

-- 
tejun


* Re: [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
  2015-06-16 21:54               ` Tejun Heo
@ 2015-06-17  3:15                 ` Theodore Ts'o
       [not found]                   ` <20150617031540.GB4076-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Theodore Ts'o @ 2015-06-17  3:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vivek Goyal, axboe, linux-kernel, linux-fsdevel, lizefan, cgroups

On Tue, Jun 16, 2015 at 05:54:36PM -0400, Tejun Heo wrote:
> Hello, Ted.
> 
> On Mon, Jun 15, 2015 at 07:35:19PM -0400, Theodore Ts'o wrote:
> > So if there is some way we can signal to any cgroup that might be
> > throttling writeback or disk I/O that the jbd/jbd2 process should be
> > considered privileged, that would be good, since it would allow us to
> > avoid a potential priority inversion problem.
> 
> I see.  In the long term, I think we might need to come up with a way
> to overcharge a slower cgroup to avoid blocking faster ones for cases
> where some IOs are depended upon by more than one cgroup.  That'd
> take quite a bit of work from the blkcg side.  Will think more about it.

Hmm, while we're at it, there's another priority inversion that can be
painful.  If a directory block has been pushed out of memory (possibly
because it was initially accessed by a cgroup with a very tiny amount
of memory allocated to its cgroup) and a process with a cgroup tries
to do a lookup in that directory, it will issue the read with such a
tightly constrained disk time that it might take minutes for the read
to complete.  The problem is that the VFS has locked the directory's
i_mutex *before* calling ext4_lookup().

If a high priority process then tries to read the same directory, or
in fact any VFS operation which requires taking the directory's
i_mutex first, including renaming the directory, the high priority
process will end up blocking until the read is completed --- which can
be minutes if the low priority process has a tiny amount of disk time
allocated to it.

There is a related problem where if a read for a particular block is
issued with a very low amount of disk time, and that same
block is required by a high priority process, we can also get hit with
a very similar priority inversion problem.

To date the answer has always been, "Doctor, Doctor it hurts when I do
that...."  The only way I can think of fixing the directory mutex
problem is by returning an error code to the VFS layer which instructs
it to unlock the directory, and then have it wait on some wait channel
so it ends up calling the lookup after the directory block has been
read into memory (and we can hope that due to a tight memory cgroup
the block doesn't end up getting ejected from memory right away).
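
A minimal user-space sketch of that unlock-and-retry idea, in case it
helps the discussion.  Everything in it (dir_sketch, lookup_nowait(),
LOOKUP_RETRY, the lock counter) is a hypothetical stand-in and not an
existing VFS interface; real code would sleep on a wait queue where
the sketch simply marks the block as read.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical return codes; no such VFS interface exists today. */
enum { LOOKUP_OK = 0, LOOKUP_RETRY = 1 };

struct dir_sketch {
	bool block_cached;  /* is the directory block in memory? */
	int  lock_depth;    /* stands in for the directory's i_mutex */
};

/* Filesystem side: never waits for disk I/O while the lock is held. */
static int lookup_nowait(const struct dir_sketch *dir)
{
	return dir->block_cached ? LOOKUP_OK : LOOKUP_RETRY;
}

/* Stand-in for sleeping on a wait channel until the read completes. */
static void wait_for_block(struct dir_sketch *dir)
{
	/* The whole point: the slow read happens with the lock dropped. */
	assert(dir->lock_depth == 0);
	dir->block_cached = true;
}

/* VFS side: returns the number of lookup attempts that were needed. */
static int lookup_with_retry(struct dir_sketch *dir)
{
	int tries = 0;

	for (;;) {
		int ret;

		dir->lock_depth++;      /* mutex_lock() in real code */
		ret = lookup_nowait(dir);
		dir->lock_depth--;      /* mutex_unlock() in real code */
		tries++;
		if (ret == LOOKUP_OK)
			return tries;
		wait_for_block(dir);    /* slow read, lock not held */
	}
}
```

With a cold directory block the first attempt bails out, the wait
happens unlocked, and the second attempt succeeds; other users of the
directory are only ever blocked for the duration of lookup_nowait().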

As another solution for another part of the problem, if a high
priority process attempts a read and the I/O is already queued up, but
it's at the back of the bus because it was originally posted by a low
priority cgroup, the rest of the fix would be to elevate the priority
of said I/O request and then re-sort the queue.
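
As a rough illustration of "elevate and re-sort", here is a small
user-space sketch.  req_sketch and boost_and_resort() are made-up
names, the queue is a plain array, and it only models requests that
are still waiting in a software queue, with at most one request per
block; it says nothing about how a real I/O scheduler would do this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical queued request; a lower prio value is served first. */
struct req_sketch {
	unsigned long block;  /* block number being read */
	int prio;             /* priority taken from the issuing cgroup */
	int seq;              /* submission order, used as a tie-breaker */
};

static int req_cmp(const void *a, const void *b)
{
	const struct req_sketch *x = a, *y = b;

	if (x->prio != y->prio)
		return x->prio < y->prio ? -1 : 1;
	return (x->seq > y->seq) - (x->seq < y->seq);
}

/*
 * A reader with priority @prio wants @block.  If a request for that
 * block is already queued at a lower priority, raise it and re-sort.
 * Returns the new queue index of the request, or -1 if none is queued.
 */
static int boost_and_resort(struct req_sketch *q, size_t n,
			    unsigned long block, int prio)
{
	size_t i;
	int found = 0;

	assert(q != NULL);
	for (i = 0; i < n; i++) {
		if (q[i].block == block) {
			found = 1;
			if (prio < q[i].prio)
				q[i].prio = prio;
		}
	}
	if (!found)
		return -1;

	qsort(q, n, sizeof(*q), req_cmp);
	for (i = 0; i < n; i++)
		if (q[i].block == block)
			return (int)i;
	return -1;
}
```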

As far as the filemap_fdatawait() call is concerned, if it's being
called by fsync() run by a low priority process, or from the writeback
thread, then it can certainly take place at a low priority.  But if the
filemap_fdatawait() is being done by a high priority process, such as
a jbd/jbd2 thread, then there needs to be a way that we can set a flag
in the wbc structure indicating that the writes should be submitted as
if they were issued from the kernel thread, and not based on who
originally dirtied the page.
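
Something along these lines, sketched in user space.  The structure,
the issue_as_kthread flag and wbc_charge_target() are all hypothetical
(the real writeback_control has no such field), and treating "issued
from the kernel thread" as "charged to the root cgroup" is an
assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cgroup_sketch { const char *name; };

static struct cgroup_sketch root_cgroup = { "root" };

/* Hypothetical subset of a writeback_control-like structure. */
struct wbc_sketch {
	struct cgroup_sketch *dirtier;  /* cgroup that dirtied the pages */
	bool issue_as_kthread;          /* hypothetical override flag */
};

/* Pick the cgroup whose limits the resulting write bios would obey. */
static struct cgroup_sketch *wbc_charge_target(const struct wbc_sketch *wbc)
{
	assert(wbc != NULL);
	/* Override set, or no known dirtier: fall back to root. */
	if (wbc->issue_as_kthread || wbc->dirtier == NULL)
		return &root_cgroup;
	return wbc->dirtier;
}
```

The decision is made before any bio is built, so nothing already in
flight has to be re-prioritized.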

It's going to be a number of point solutions, which is a bit ugly, but
I think that is much more likely to be successful than trying to
implement, say, a generalized priority inheritance scheme for block
I/O requests and related locks.   :-)

    	     					- Ted


[parent not found: <20150617031540.GB4076-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>]

* Re: [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
       [not found]                   ` <20150617031540.GB4076-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
@ 2015-06-17 18:52                     ` Tejun Heo
       [not found]                       ` <20150617185237.GL22637-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2015-06-17 18:52 UTC (permalink / raw)
  To: Theodore Ts'o, Vivek Goyal, axboe-tSWWG44O7X1aa/9Udqfwiw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA

Hi,

On Tue, Jun 16, 2015 at 11:15:40PM -0400, Theodore Ts'o wrote:
> Hmm, while we're at it, there's another priority inversion that can be
> painful.  If a directory block has been pushed out of memory (possibly
> because it was initially accessed by a cgroup with a very tiny amount
> of memory allocated to its cgroup) and a process with a cgroup tries

At scale, this is self-correcting to a certain extent in that if the
inode is actually something shared across cgroups, it'll most likely
end up in a cgroup which has enough resource to keep it in memory.
This doesn't prevent one-off hiccups but it at least shouldn't develop
into a systematic and chronic issue.

> to do a lookup in that directory, it will issue the read with such a
> tightly constrained disk time that it might take minutes for the read
> to complete.  The problem is that the VFS has locked the directory's
> i_mutex *before* calling ext4_lookup().
> 
> If a high priority process then tries to read the same directory, or
> in fact any VFS operation which requires taking the directory's
> i_mutex first, including renaming the directory, the high priority
> process will end up blocking until the read is completed --- which can
> be minutes if the low priority process has a tiny amount of disk time
> allocated to it.
> 
> There is a related problem where if a read for a particular block is
> issued with a very low amount of disk time, and that same
> block is required by a high priority process, we can also get hit with
> a very similar priority inversion problem.
> 
> To date the answer has always been, "Doctor, Doctor it hurts when I do
> that...."  The only way I can think of fixing the directory mutex

In a lot of use cases, the directories accessed by different cgroups
are fairly segregated so this hopefully shouldn't happen too often but
yeah it can be painful on sharing cases.

> problem is by returning an error code to the VFS layer which instructs
> it to unlock the directory, and then have it wait on some wait channel
> so it ends up calling the lookup after the directory block has been
> read into memory (and we can hope that due to a tight memory cgroup
> the block doesn't end up getting ejected from memory right away).
> 
> As another solution for another part of the problem, if a high
> priority process attempts a read and the I/O is already queued up, but
> it's at the back of the bus because it was originally posted by a low
> priority cgroup, the rest of the fix would be to elevate the priority
> of said I/O request and then re-sort the queue.
>
> As far as the filemap_fdatawait() call is concerned, if it's being
> called by fsync() run by a low priority process, or from the writeback
> thread, then it can certainly take place at a low priority.  But if the
> filemap_fdatawait() is being done by a high priority process, such as
> a jbd/jbd2 thread, then there needs to be a way that we can set a flag
> in the wbc structure indicating that the writes should be submitted as
> if they were issued from the kernel thread, and not based on who
> originally dirtied the page.

Hmmm... so, overriding things *before* a bio is issued shouldn't be
too difficult and as long as these sorts of operations aren't prevalent
we might be able to get away with just charging them against root.
Especially if it's to avoid getting blocked on the journal which we
already consider a shared overhead which is charged to root.  If this
becomes large enough to require exacting charges, it'll be more
complex but still way better than trying to raise priority on a bio
which is already issued, which is likely to be excruciatingly painful
if possible at all.

> It's going to be a number of point solutions, which is a bit ugly, but
> I think that is much more likely to be successful than trying to
> implement, say, a generalized priority inheritance scheme for block
> I/O requests and related locks.   :-)

I agree that a generalized priority inheritance mechanism would be
massive overkill.  I think as long as we can avoid boosting bio's
which already have been issued, things should be relatively sane.
Hopefully, we'd be able to figure out solutions for the worst
offenders within these constraints.

Thanks.

-- 
tejun


[parent not found: <20150617185237.GL22637-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>]

* Re: [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
       [not found]                       ` <20150617185237.GL22637-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2015-06-17 21:48                         ` Theodore Ts'o
       [not found]                           ` <20150617214852.GE4076-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Theodore Ts'o @ 2015-06-17 21:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vivek Goyal, axboe-tSWWG44O7X1aa/9Udqfwiw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, Jun 17, 2015 at 02:52:37PM -0400, Tejun Heo wrote:
> 
> Hmmm... so, overriding things *before* a bio is issued shouldn't be
> too difficult and as long as these sorts of operations aren't prevalent
> we might be able to get away with just charging them against root.
> Especially if it's to avoid getting blocked on the journal which we
> already consider a shared overhead which is charged to root.  If this
> becomes large enough to require exacting charges, it'll be more
> complex but still way better than trying to raise priority on a bio
> which is already issued, which is likely to be excruciatingly painful
> if possible at all.

Yeah, just charging the overhead to root seems good enough.

I could imagine charging it to whatever cgroup the jbd/jbd2 thread
belongs to, which in turn would be the cgroup of the process that
mounted the file system.  The only problem with that is that if a
low-priority process is allowed to mount a file system, and it gets
traversed by a high priority process, the high priority process will
get impacted.  So maybe it's better to just say that it always gets
charged to the root cgroup.

					- Ted


[parent not found: <20150617214852.GE4076-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>]

* Re: [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
       [not found]                           ` <20150617214852.GE4076-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
@ 2015-06-20 20:00                             ` Tejun Heo
  0 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2015-06-20 20:00 UTC (permalink / raw)
  To: Theodore Ts'o, Vivek Goyal, axboe-tSWWG44O7X1aa/9Udqfwiw,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA

Hey, Ted.

On Wed, Jun 17, 2015 at 05:48:52PM -0400, Theodore Ts'o wrote:
> On Wed, Jun 17, 2015 at 02:52:37PM -0400, Tejun Heo wrote:
> > 
> > Hmmm... so, overriding things *before* a bio is issued shouldn't be
> > too difficult and as long as these sorts of operations aren't prevalent
> > we might be able to get away with just charging them against root.
> > Especially if it's to avoid getting blocked on the journal which we
> > already consider a shared overhead which is charged to root.  If this
> > becomes large enough to require exacting charges, it'll be more
> > complex but still way better than trying to raise priority on a bio
> > which is already issued, which is likely to be excruciatingly painful
> > if possible at all.
> 
> Yeah, just charging the overhead to root seems good enough.

I think the easiest way to achieve this bypass would be making jbd
mark the inode while waiting in fdatawait so that the writeback path
can skip attaching the writeback bios for the inode.  This isn't
perfect but should be able to work around stalls from priority
inversion to a certain extent.

However, I can't come up with a workload to test it.  AFAICS, the
fdatawait stall path in jbd2 is journal_finish_inode_data_buffers()
but the path doesn't trigger reliably with a mixed load of overwriting
dd, a bunch of file creations and chmods, and different cgroups stay
pretty well isolated.

Can you please suggest a workload for testing the datawait path?

Thanks.

-- 
tejun


* [PATCHSET v2 block/for-4.2/writeback] cgroup, writeback: misc updates for cgroup writeback support
@ 2015-06-16 22:48 Tejun Heo
       [not found] ` <1434494912-31043-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Tejun Heo @ 2015-06-16 22:48 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
	hannes-druUgvl0LCNAfugRpC6u6w, kernel-team-b10kYP2dOMg

Hello,

This is v2.  The only change from the last take[L] is

* super_block->s_iflags added and MS_CGROUPWB replaced with
  SB_I_CGROUPWB as suggested by Christoph and Jan.

This patchset contains the following assorted updates for the cgroup
writeback support.

 0001-writeback-do-foreign-inode-detection-iff-cgroup-writ.patch
 0002-vfs-writeback-replace-FS_CGROUP_WRITEBACK-with-SB_I_.patch
 0003-writeback-blkio-add-documentation-for-cgroup-writeba.patch

0001 fixes a bug where clearing the FS_CGROUP_WRITEBACK flag didn't
fully disable cgroup writeback support if the filesystem code uses
wbc_init_bio() and wbc_account_io().

0002 replaces FS_CGROUP_WRITEBACK with SB_I_CGROUPWB so that cgroup
writeback support can be enabled / disabled per superblock rather than
filesystem type.

0003 updates blkio documentation with information on cgroup writeback
support.

This patchset is on top of block/for-4.2/writeback and available in
the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup-writeback-updates

diffstat follows.  Thanks.

 Documentation/cgroups/blkio-controller.txt |   83 +++++++++++++++++++++++++++--
 fs/ext2/super.c                            |    4 -
 fs/fs-writeback.c                          |   16 ++++-
 fs/namespace.c                             |    2 
 include/linux/backing-dev.h                |    2 
 include/linux/fs.h                         |    1 
 include/uapi/linux/fs.h                    |    1 
 7 files changed, 96 insertions(+), 13 deletions(-)

--
tejun

[L] http://lkml.kernel.org/g/1434146254-26220-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org


[parent not found: <1434494912-31043-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>]

* [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
       [not found] ` <1434494912-31043-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2015-06-16 22:48   ` Tejun Heo
  0 siblings, 0 replies; 16+ messages in thread
From: Tejun Heo @ 2015-06-16 22:48 UTC (permalink / raw)
  To: axboe-tSWWG44O7X1aa/9Udqfwiw
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
	hannes-druUgvl0LCNAfugRpC6u6w, kernel-team-b10kYP2dOMg,
	Tejun Heo, Vivek Goyal

Update Documentation/cgroups/blkio-controller.txt to reflect the
recently added cgroup writeback support.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Cc: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 Documentation/cgroups/blkio-controller.txt | 83 ++++++++++++++++++++++++++++--
 1 file changed, 78 insertions(+), 5 deletions(-)
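
For readers of the Writeback section in the diff below, here is a
small worked sketch of two statements it makes: absolute limits such
as vm.dirty_bytes are split across cgroups by their current writeback
bandwidth, and the more restrictive of the system-wide and per-cgroup
checks is enforced.  The helper names and the plain proportional
formula are assumptions made for illustration; this is not the
kernel's actual calculation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Share of an absolute limit (e.g. vm.dirty_bytes) for one cgroup,
 * proportional to its portion of the total writeback bandwidth.
 * The multiplication can overflow for very large inputs; a sketch.
 */
static uint64_t cgroup_dirty_share(uint64_t dirty_bytes,
				   uint64_t cg_bw, uint64_t total_bw)
{
	assert(cg_bw <= total_bw);
	if (total_bw == 0)
		return 0;
	return dirty_bytes * cg_bw / total_bw;
}

/* Throttle when either limit is exceeded: the stricter one wins. */
static int over_dirty_limit(uint64_t sys_dirty, uint64_t sys_limit,
			    uint64_t cg_dirty, uint64_t cg_limit)
{
	return sys_dirty > sys_limit || cg_dirty > cg_limit;
}
```

For example, with vm.dirty_bytes at 100 MiB (104857600) and a cgroup
doing 30 out of 100 units of writeback bandwidth, the cgroup's share
comes to 31457280 bytes, i.e. 30 MiB.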

diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
index cd556b9..68b6a6a 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -387,8 +387,81 @@ groups and put applications in that group which are not driving enough
 IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
 on individual groups and throughput should improve.
 
-What works
-==========
-- Currently only sync IO queues are support. All the buffered writes are
-  still system wide and not per group. Hence we will not see service
-  differentiation between buffered writes between groups.
+Writeback
+=========
+
+Page cache is dirtied through buffered writes and shared mmaps and
+written asynchronously to the backing filesystem by the writeback
+mechanism.  Writeback sits between the memory and IO domains and
+regulates the proportion of dirty memory by balancing dirtying and
+write IOs.
+
+On traditional cgroup hierarchies, relationships between different
+controllers cannot be established making it impossible for writeback
+to operate accounting for cgroup resource restrictions and all
+writeback IOs are attributed to the root cgroup.
+
+If both the blkio and memory controllers are used on the v2 hierarchy
+and the filesystem supports cgroup writeback, writeback operations
+correctly follow the resource restrictions imposed by both memory and
+blkio controllers.
+
+Writeback examines both system-wide and per-cgroup dirty memory status
+and enforces the more restrictive of the two.  Also, writeback control
+parameters which are absolute values - vm.dirty_bytes and
+vm.dirty_background_bytes - are distributed across cgroups according
+to their current writeback bandwidth.
+
+There's a peculiarity stemming from the discrepancy in ownership
+granularity between memory controller and writeback.  While memory
+controller tracks ownership per page, writeback operates on inode
+basis.  cgroup writeback bridges the gap by tracking ownership by
+inode but migrating ownership if too many foreign pages, pages which
+don't match the current inode ownership, have been encountered while
+writing back the inode.
+
+This is a conscious design choice as writeback operations are
+inherently tied to inodes making strictly following page ownership
+complicated and inefficient.  The only use case which suffers from
+this compromise is multiple cgroups concurrently dirtying disjoint
+regions of the same inode, which is an unlikely use case and decided
+to be unsupported.  Note that as memory controller assigns page
+ownership on the first use and doesn't update it until the page is
+released, even if cgroup writeback strictly follows page ownership,
+multiple cgroups dirtying overlapping areas wouldn't work as expected.
+In general, write-sharing an inode across multiple cgroups is not well
+supported.
+
+Filesystem support for cgroup writeback
+---------------------------------------
+
+A filesystem can make writeback IOs cgroup-aware by updating
+address_space_operations->writepage[s]() to annotate bio's using the
+following two functions.
+
+* wbc_init_bio(@wbc, @bio)
+
+  Should be called for each bio carrying writeback data and associates
+  the bio with the inode's owner cgroup.  Can be called anytime
+  between bio allocation and submission.
+
+* wbc_account_io(@wbc, @page, @bytes)
+
+  Should be called for each data segment being written out.  While
+  this function doesn't care exactly when it's called during the
+  writeback session, it's the easiest and most natural to call it as
+  data segments are added to a bio.
+
+With writeback bio's annotated, cgroup support can be enabled per
+super_block by setting SB_I_CGROUPWB in ->s_iflags.  This allows for
+selective disabling of cgroup writeback support which is helpful when
+certain filesystem features, e.g. journaled data mode, are
+incompatible.
+
+wbc_init_bio() binds the specified bio to its cgroup.  Depending on
+the configuration, the bio may be executed at a lower priority and if
+the writeback session is holding shared resources, e.g. a journal
+entry, may lead to priority inversion.  There is no one easy solution
+for the problem.  Filesystems can try to work around specific problem
+cases by skipping wbc_init_bio() or using bio_associate_blkcg()
+directly.
-- 
2.4.3
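
The ownership migration described in the patch above (track the owner
per inode, switch it when too many foreign pages show up during
writeback) can be sketched in user space as follows.  The structures,
finish_writeback_round() and the "more than half of the bytes" rule
are assumptions for illustration only; the in-kernel detection is more
involved than this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-inode state; a cgroup is just an integer id. */
struct inode_sketch {
	int owner;  /* cgroup currently owning the inode's writeback */
};

/* One page written back in this round, tagged with its memcg owner. */
struct page_sketch {
	int memcg;
	size_t bytes;
};

/*
 * After a writeback round, move inode ownership when a single foreign
 * cgroup accounted for more than half of the bytes written.
 * Returns 1 if ownership moved, 0 otherwise.
 */
static int finish_writeback_round(struct inode_sketch *inode,
				  const struct page_sketch *pages, size_t n)
{
	size_t total = 0, best_bytes = 0, i, j;
	int best;

	assert(inode != NULL);
	best = inode->owner;

	for (i = 0; i < n; i++) {
		size_t sum = 0;

		total += pages[i].bytes;
		/* Bytes written on behalf of the same cgroup as page i. */
		for (j = 0; j < n; j++)
			if (pages[j].memcg == pages[i].memcg)
				sum += pages[j].bytes;
		if (sum > best_bytes) {
			best_bytes = sum;
			best = pages[i].memcg;
		}
	}

	if (best != inode->owner && best_bytes * 2 > total) {
		inode->owner = best;
		return 1;
	}
	return 0;
}
```

A round that is two thirds foreign flips the owner; later rounds
dominated by the new owner leave it alone, so ownership follows
whichever cgroup is doing most of the dirtying.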


end of thread, other threads:[~2015-06-20 20:00 UTC | newest]

Thread overview: 16+ messages
2015-06-12 21:57 [PATCHSET block/for-4.2/writeback] cgroup, writeback: misc updates for cgroup writeback support Tejun Heo
2015-06-12 21:57 ` [PATCH 1/3] writeback: do foreign inode detection iff cgroup writeback is enabled Tejun Heo
     [not found] ` <1434146254-26220-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2015-06-12 21:57   ` [PATCH 2/3] vfs, writeback: replace FS_CGROUP_WRITEBACK with MS_CGROUPWB Tejun Heo
     [not found]     ` <1434146254-26220-3-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2015-06-13 16:16       ` Christoph Hellwig
     [not found]         ` <20150613161608.GA29414-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2015-06-14  5:42           ` Tejun Heo
     [not found]             ` <20150614054236.GA9662-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2015-06-15 11:39               ` Jan Kara
2015-06-12 21:57 ` [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support Tejun Heo
     [not found]   ` <1434146254-26220-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2015-06-15 17:28     ` Vivek Goyal
2015-06-15 18:23       ` Tejun Heo
     [not found]         ` <20150615182345.GB18517-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2015-06-15 23:35           ` Theodore Ts'o
     [not found]             ` <20150615233519.GB30059-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
2015-06-16 21:54               ` Tejun Heo
2015-06-17  3:15                 ` Theodore Ts'o
     [not found]                   ` <20150617031540.GB4076-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
2015-06-17 18:52                     ` Tejun Heo
     [not found]                       ` <20150617185237.GL22637-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2015-06-17 21:48                         ` Theodore Ts'o
     [not found]                           ` <20150617214852.GE4076-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>
2015-06-20 20:00                             ` Tejun Heo
2015-06-16 22:48 [PATCHSET v2 block/for-4.2/writeback] cgroup, writeback: misc updates " Tejun Heo
     [not found] ` <1434494912-31043-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2015-06-16 22:48   ` [PATCH 3/3] writeback, blkio: add documentation " Tejun Heo
