* [PATCHSET block/for-4.2/writeback] cgroup, writeback: misc updates for cgroup writeback support
  0 siblings, 3 replies; 16+ messages in thread
From: Tejun Heo @ 2015-06-12 21:57 UTC
To: axboe
Cc: linux-kernel, linux-fsdevel, lizefan, cgroups

Hello,

This patchset contains the following assorted updates for the cgroup
writeback support.

 0001-writeback-do-foreign-inode-detection-iff-cgroup-writ.patch
 0002-vfs-writeback-replace-FS_CGROUP_WRITEBACK-with-MS_CG.patch
 0003-writeback-blkio-add-documentation-for-cgroup-writeba.patch

0001 fixes a bug where clearing the FS_CGROUP_WRITEBACK flag didn't
fully disable cgroup writeback support if the filesystem code uses
wbc_init_bio() and wbc_account_io().

0002 replaces FS_CGROUP_WRITEBACK with MS_CGROUPWB so that cgroup
writeback support can be enabled / disabled per superblock rather than
per filesystem type.

0003 updates the blkio documentation with information on cgroup
writeback support.

This patchset is on top of block/for-4.2/writeback and is available in
the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup-writeback-updates

diffstat follows. Thanks.

 Documentation/cgroups/blkio-controller.txt |   83 +++++++++++++++++++++++++++--
 fs/ext2/super.c                            |    4 -
 fs/fs-writeback.c                          |   16 ++++-
 fs/namespace.c                             |    2
 include/linux/backing-dev.h                |    2
 include/linux/fs.h                         |    1
 include/uapi/linux/fs.h                    |    1
 7 files changed, 96 insertions(+), 13 deletions(-)

--
tejun
* [PATCH 1/3] writeback: do foreign inode detection iff cgroup writeback is enabled
From: Tejun Heo @ 2015-06-12 21:57 UTC
To: axboe
Cc: linux-kernel, linux-fsdevel, lizefan, cgroups, Tejun Heo

Currently, even when a filesystem doesn't set the FS_CGROUP_WRITEBACK
flag, if the filesystem uses wbc_init_bio() and wbc_account_io(), the
foreign inode detection and migration logic still ends up activating
cgroup writeback which is unexpected. This patch ensures that the
foreign inode detection logic stays disabled when inode_cgwb_enabled()
is false by not associating writeback_control's with bdi_writeback's.

This also avoids unnecessary operations in wbc_init_bio(),
wbc_account_io() and wbc_detach_inode() for filesystems which don't
support cgroup writeback.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 fs/fs-writeback.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index f60de54..f0520bc 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -513,6 +513,11 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
 void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
 				 struct inode *inode)
 {
+	if (!inode_cgwb_enabled(inode)) {
+		spin_unlock(&inode->i_lock);
+		return;
+	}
+
 	wbc->wb = inode_to_wb(inode);
 	wbc->inode = inode;
 
@@ -575,11 +580,16 @@ void wbc_detach_inode(struct writeback_control *wbc)
 {
 	struct bdi_writeback *wb = wbc->wb;
 	struct inode *inode = wbc->inode;
-	u16 history = inode->i_wb_frn_history;
-	unsigned long avg_time = inode->i_wb_frn_avg_time;
-	unsigned long max_bytes, max_time;
+	unsigned long avg_time, max_bytes, max_time;
+	u16 history;
 	int max_id;
 
+	if (!wb)
+		return;
+
+	history = inode->i_wb_frn_history;
+	avg_time = inode->i_wb_frn_avg_time;
+
 	/* pick the winner of this round */
 	if (wbc->wb_bytes >= wbc->wb_lcand_bytes &&
 	    wbc->wb_bytes >= wbc->wb_tcand_bytes) {
--
2.4.2
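The guard added by this patch can be illustrated with a small
userspace sketch in C. This is not kernel code: struct wb_model,
struct inode_model, struct wbc_model and the two *_model functions are
hypothetical stand-ins, the i_lock handling is left out, and
detect_runs merely counts how often the foreign-detection path would
have run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel structures. */
struct wb_model {
	int id;
};

struct inode_model {
	bool cgwb_enabled;		/* stands in for inode_cgwb_enabled() */
	struct wb_model *wb;		/* stands in for inode_to_wb() */
	unsigned int detect_runs;	/* times foreign detection ran */
};

struct wbc_model {
	struct wb_model *wb;
	struct inode_model *inode;
};

/*
 * Attach only when cgroup writeback is enabled for the inode;
 * otherwise leave wbc->wb NULL so later steps can tell.
 */
static void wbc_attach_model(struct wbc_model *wbc, struct inode_model *inode)
{
	assert(wbc != NULL && inode != NULL);

	if (!inode->cgwb_enabled)
		return;

	wbc->wb = inode->wb;
	wbc->inode = inode;
}

/*
 * Detach: a NULL wb means attach bailed out, so the detection
 * logic is skipped entirely and wbc->inode is never touched.
 */
static void wbc_detach_model(struct wbc_model *wbc)
{
	assert(wbc != NULL);

	if (!wbc->wb)
		return;

	wbc->inode->detect_runs++;
	wbc->wb = NULL;
}
```

With this shape, an attach/detach pair on an inode whose cgwb_enabled
is false leaves detect_runs at zero, which is the behavior the commit
message describes for filesystems without cgroup writeback support.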
* [PATCH 2/3] vfs, writeback: replace FS_CGROUP_WRITEBACK with MS_CGROUPWB
From: Tejun Heo @ 2015-06-12 21:57 UTC
To: axboe
Cc: linux-kernel, linux-fsdevel, lizefan, cgroups, Tejun Heo, Alexander Viro, Jan Kara, linux-ext4

FS_CGROUP_WRITEBACK indicates whether a file_system_type supports
cgroup writeback; however, different super_blocks of the same
file_system_type may or may not support cgroup writeback depending on
filesystem options. This patch replaces FS_CGROUP_WRITEBACK with a
kernel-internal super_block->s_flags MS_CGROUPWB. The concatenated
and abbreviated name is for consistency with other MS_* flags.

ext2_fill_super() is updated to assert MS_CGROUPWB.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
Cc: linux-ext4@vger.kernel.org
---
 fs/ext2/super.c             | 4 ++--
 fs/namespace.c              | 2 +-
 include/linux/backing-dev.h | 2 +-
 include/linux/fs.h          | 1 -
 include/uapi/linux/fs.h     | 1 +
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 549219d..472ed34 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -879,7 +879,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 	if (!parse_options((char *) data, sb))
 		goto failed_mount;
 
-	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
+	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) | MS_CGROUPWB |
 		((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
 		 MS_POSIXACL : 0);
 
@@ -1543,7 +1543,7 @@ static struct file_system_type ext2_fs_type = {
 	.name		= "ext2",
 	.mount		= ext2_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV | FS_CGROUP_WRITEBACK,
+	.fs_flags	= FS_REQUIRES_DEV,
 };
 MODULE_ALIAS_FS("ext2");
 
diff --git a/fs/namespace.c b/fs/namespace.c
index 1f4f9da..507b90b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2669,7 +2669,7 @@ long do_mount(const char *dev_name, const char __user *dir_name,
 
 	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_BORN |
 		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
-		   MS_STRICTATIME);
+		   MS_STRICTATIME | MS_CGROUPWB);
 
 	if (flags & MS_REMOUNT)
 		retval = do_remount(&path, flags & ~MS_REMOUNT, mnt_flags,
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index dfce808..1489131 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -260,7 +260,7 @@ static inline bool inode_cgwb_enabled(struct inode *inode)
 
 	return bdi_cap_account_dirty(bdi) &&
 		(bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
-		(inode->i_sb->s_type->fs_flags & FS_CGROUP_WRITEBACK);
+		(inode->i_sb->s_flags & MS_CGROUPWB);
 }
 
 /**
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b5e1dcf..66e35dc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1912,7 +1912,6 @@ struct file_system_type {
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_USERNS_DEV_MOUNT	16 /* A userns mount does not imply MNT_NODEV */
-#define FS_CGROUP_WRITEBACK	32	/* Supports cgroup-aware writeback */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	struct dentry *(*mount) (struct file_system_type *, int,
 		       const char *, void *);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 9b964a5..60316e7 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -93,6 +93,7 @@ struct inodes_stat_t {
 #define MS_LAZYTIME	(1<<25) /* Update the on-disk [acm]times lazily */
 
 /* These sb flags are internal to the kernel */
+#define MS_CGROUPWB	(1<<27) /* cgroup-aware writeback enabled */
 #define MS_NOSEC	(1<<28)
 #define MS_BORN		(1<<29)
 #define MS_ACTIVE	(1<<30)
--
2.4.2
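To make the motivation for this patch concrete, the following
userspace sketch in C contrasts a per-filesystem-type check with a
per-superblock check. It is only a model: the *_MODEL macros, struct
fs_type_model, struct sb_model and fill_super_model() are hypothetical
names, and the rule "disable cgroup writeback for journaled data mode"
is taken as an example from the documentation in patch 3, not from any
real filesystem code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FS_CGROUP_WRITEBACK_MODEL	32u		/* old per-type bit */
#define MS_CGROUPWB_MODEL		(1u << 27)	/* bit value from the patch */

struct fs_type_model {
	unsigned int fs_flags;
};

struct sb_model {
	const struct fs_type_model *s_type;
	unsigned long s_flags;
};

/* Old check: every superblock of a type gets the same answer. */
static bool cgwb_enabled_per_type(const struct sb_model *sb)
{
	return (sb->s_type->fs_flags & FS_CGROUP_WRITEBACK_MODEL) != 0;
}

/* New check: each superblock answers for itself. */
static bool cgwb_enabled_per_sb(const struct sb_model *sb)
{
	return (sb->s_flags & MS_CGROUPWB_MODEL) != 0;
}

/*
 * Hypothetical fill_super: set the per-superblock bit unless this
 * mount uses an option assumed to be incompatible.
 */
static void fill_super_model(struct sb_model *sb,
			     const struct fs_type_model *type,
			     bool journaled_data)
{
	assert(sb != NULL && type != NULL);

	sb->s_type = type;
	sb->s_flags = journaled_data ? 0 : MS_CGROUPWB_MODEL;
}
```

Two superblocks of the same type, one filled with journaled_data set
and one without, get the same answer from cgwb_enabled_per_type() but
different answers from cgwb_enabled_per_sb(), which is the distinction
the commit message is after.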
* Re: [PATCH 2/3] vfs, writeback: replace FS_CGROUP_WRITEBACK with MS_CGROUPWB
From: Christoph Hellwig @ 2015-06-13 16:16 UTC
To: Tejun Heo
Cc: axboe, linux-kernel, linux-fsdevel, lizefan, cgroups, Alexander Viro, Jan Kara, linux-ext4

On Fri, Jun 12, 2015 at 04:57:33PM -0500, Tejun Heo wrote:
> FS_CGROUP_WRITEBACK indicates whether a file_system_type supports
> cgroup writeback; however, different super_blocks of the same
> file_system_type may or may not support cgroup writeback depending on
> filesystem options. This patch replaces FS_CGROUP_WRITEBACK with a
> kernel-internal super_block->s_flags MS_CGROUPWB. The concatenated
> and abbreviated name is for consistency with other MS_* flags.

Nak. As the uapi part makes it obvious the MS_ namespace is part
of the userspace ABI. Please add a new in-kernel flags field instead.
* Re: [PATCH 2/3] vfs, writeback: replace FS_CGROUP_WRITEBACK with MS_CGROUPWB
From: Tejun Heo @ 2015-06-14 5:42 UTC
To: Christoph Hellwig
Cc: axboe, linux-kernel, linux-fsdevel, lizefan, cgroups, Alexander Viro, Jan Kara, linux-ext4

Hello, Christoph.

On Sat, Jun 13, 2015 at 09:16:08AM -0700, Christoph Hellwig wrote:
> On Fri, Jun 12, 2015 at 04:57:33PM -0500, Tejun Heo wrote:
> > FS_CGROUP_WRITEBACK indicates whether a file_system_type supports
> > cgroup writeback; however, different super_blocks of the same
> > file_system_type may or may not support cgroup writeback depending on
> > filesystem options. This patch replaces FS_CGROUP_WRITEBACK with a
> > kernel-internal super_block->s_flags MS_CGROUPWB. The concatenated
> > and abbreviated name is for consistency with other MS_* flags.
>
> Nak. As the uapi part makes it obvious the MS_ namespace is part
> of the userspace ABI. Please add a new in-kernel flags field instead.

Are MS_ACTIVE and MS_BORN part of the userspace ABI? They seem pretty
internal. I don't mind introducing a new internal flag field but it's
weird to put this single flag there while the other internal flags
stay in ->s_flags.

Assuming we add a new field, how do sb->s_iflags and SB_I_XXX sound?
Any better suggestions?

Thanks.

--
tejun
* Re: [PATCH 2/3] vfs, writeback: replace FS_CGROUP_WRITEBACK with MS_CGROUPWB
From: Jan Kara @ 2015-06-15 11:39 UTC
To: Tejun Heo
Cc: Christoph Hellwig, axboe, linux-kernel, linux-fsdevel, lizefan, cgroups, Alexander Viro, Jan Kara, linux-ext4

On Sun 14-06-15 00:42:36, Tejun Heo wrote:
> Hello, Christoph.
>
> On Sat, Jun 13, 2015 at 09:16:08AM -0700, Christoph Hellwig wrote:
> > On Fri, Jun 12, 2015 at 04:57:33PM -0500, Tejun Heo wrote:
> > > FS_CGROUP_WRITEBACK indicates whether a file_system_type supports
> > > cgroup writeback; however, different super_blocks of the same
> > > file_system_type may or may not support cgroup writeback depending on
> > > filesystem options. This patch replaces FS_CGROUP_WRITEBACK with a
> > > kernel-internal super_block->s_flags MS_CGROUPWB. The concatenated
> > > and abbreviated name is for consistency with other MS_* flags.
> >
> > Nak. As the uapi part makes it obvious the MS_ namespace is part
> > of the userspace ABI. Please add a new in-kernel flags field instead.
>
> Are MS_ACTIVE and MS_BORN part of the userspace ABI? They seem pretty
> internal. I don't mind introducing a new internal flag field but it's
> weird to put this single flag there with other internal flags in
> ->s_flags.

So you are right that there are other internal flags allocated from the
top of the s_flags field. However, we are pretty much running out of the
flags available for the ABI, so it's better to move internal flags
elsewhere as that's simpler than creating a new ABI for mount...

> Assuming we add a new field, how do sb->s_iflags and SB_I_XXX sound?
> Any better suggestions?

Looks good to me.

								Honza
--
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR
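The trade-off discussed in this sub-thread can be sketched in
userspace C. Everything here is an assumption made for illustration:
struct sb_flags_model, SB_I_CGROUPWB_MODEL and the mount_*() helpers
are hypothetical, and only the (1<<27) value is taken from the patch.
The sketch shows that an internal bit kept in the shared word has to be
masked out of userspace input (as the do_mount() hunk does), while a
separate kernel-only field never sees userspace input at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MS_CGROUPWB_MODEL	(1u << 27)	/* bit value from the patch */
#define SB_I_CGROUPWB_MODEL	(1u << 0)	/* hypothetical internal bit */

struct sb_flags_model {
	uint32_t s_flags;	/* shares its bit space with mount(2) flags */
	uint32_t s_iflags;	/* hypothetical kernel-only field */
};

/*
 * Shared-word approach: the internal bit lives in s_flags, so it
 * must be stripped from whatever userspace passed in.
 */
static void mount_shared(struct sb_flags_model *sb, uint32_t user_flags)
{
	assert(sb != NULL);
	sb->s_flags = user_flags & ~MS_CGROUPWB_MODEL;
	sb->s_iflags = 0;
}

/*
 * Split approach: userspace flags go into s_flags untouched and
 * bit 27 stays free for a future ABI flag.
 */
static void mount_split(struct sb_flags_model *sb, uint32_t user_flags)
{
	assert(sb != NULL);
	sb->s_flags = user_flags;
	sb->s_iflags = 0;
}

/* Only kernel-side code can reach the internal field. */
static void enable_cgwb_split(struct sb_flags_model *sb)
{
	assert(sb != NULL);
	sb->s_iflags |= SB_I_CGROUPWB_MODEL;
}
```

In the split variant no masking step is needed, and the cost is one
extra field per superblock, which matches the direction Jan and Tejun
settle on above.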
* [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
From: Tejun Heo @ 2015-06-12 21:57 UTC
To: axboe
Cc: linux-kernel, linux-fsdevel, lizefan, cgroups, Tejun Heo, Vivek Goyal

Update Documentation/cgroups/blkio-controller.txt to reflect the
recently added cgroup writeback support.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: cgroups@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
---
 Documentation/cgroups/blkio-controller.txt | 83 ++++++++++++++++++++++++++++--
 1 file changed, 78 insertions(+), 5 deletions(-)

diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
index cd556b9..68b6a6a 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -387,8 +387,81 @@ groups and put applications in that group which are not driving enough
 IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
 on individual groups and throughput should improve.
 
-What works
-==========
-- Currently only sync IO queues are support. All the buffered writes are
-  still system wide and not per group. Hence we will not see service
-  differentiation between buffered writes between groups.
+Writeback
+=========
+
+Page cache is dirtied through buffered writes and shared mmaps and
+written asynchronously to the backing filesystem by the writeback
+mechanism. Writeback sits between the memory and IO domains and
+regulates the proportion of dirty memory by balancing dirtying and
+write IOs.
+
+On traditional cgroup hierarchies, relationships between different
+controllers cannot be established making it impossible for writeback
+to operate accounting for cgroup resource restrictions and all
+writeback IOs are attributed to the root cgroup.
+
+If both the blkio and memory controllers are used on the v2 hierarchy
+and the filesystem supports cgroup writeback, writeback operations
+correctly follow the resource restrictions imposed by both memory and
+blkio controllers.
+
+Writeback examines both system-wide and per-cgroup dirty memory status
+and enforces the more restrictive of the two. Also, writeback control
+parameters which are absolute values - vm.dirty_bytes and
+vm.dirty_background_bytes - are distributed across cgroups according
+to their current writeback bandwidth.
+
+There's a peculiarity stemming from the discrepancy in ownership
+granularity between memory controller and writeback. While memory
+controller tracks ownership per page, writeback operates on inode
+basis. cgroup writeback bridges the gap by tracking ownership by
+inode but migrating ownership if too many foreign pages, pages which
+don't match the current inode ownership, have been encountered while
+writing back the inode.
+
+This is a conscious design choice as writeback operations are
+inherently tied to inodes making strictly following page ownership
+complicated and inefficient. The only use case which suffers from
+this compromise is multiple cgroups concurrently dirtying disjoint
+regions of the same inode, which is an unlikely use case and decided
+to be unsupported. Note that as memory controller assigns page
+ownership on the first use and doesn't update it until the page is
+released, even if cgroup writeback strictly follows page ownership,
+multiple cgroups dirtying overlapping areas wouldn't work as expected.
+In general, write-sharing an inode across multiple cgroups is not well
+supported.
+
+Filesystem support for cgroup writeback
+---------------------------------------
+
+A filesystem can make writeback IOs cgroup-aware by updating
+address_space_operations->writepage[s]() to annotate bio's using the
+following two functions.
+
+* wbc_init_bio(@wbc, @bio)
+
+  Should be called for each bio carrying writeback data and associates
+  the bio with the inode's owner cgroup. Can be called anytime
+  between bio allocation and submission.
+
+* wbc_account_io(@wbc, @page, @bytes)
+
+  Should be called for each data segment being written out. While
+  this function doesn't care exactly when it's called during the
+  writeback session, it's the easiest and most natural to call it as
+  data segments are added to a bio.
+
+With writeback bio's annotated, cgroup support can be enabled per
+super_block by setting MS_CGROUPWB in ->s_flags. This allows for
+selective disabling of cgroup writeback support which is helpful when
+certain filesystem features, e.g. journaled data mode, are
+incompatible.
+
+wbc_init_bio() binds the specified bio to its cgroup. Depending on
+the configuration, the bio may be executed at a lower priority and if
+the writeback session is holding shared resources, e.g. a journal
+entry, may lead to priority inversion. There is no one easy solution
+for the problem. Filesystems can try to work around specific problem
+cases by skipping wbc_init_bio() or using bio_associate_blkcg()
+directly.
--
2.4.2
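The "migrate ownership if too many foreign pages have been
encountered" idea from the documentation above can be sketched as a
toy model in userspace C. This is explicitly not the kernel's
algorithm: it is loosely inspired by the u16 i_wb_frn_history field
visible in patch 1, but struct inode_own_model, record_round(), the
one-bit-per-round history and the threshold of 8 are all assumptions
chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SWITCH_THRESHOLD 8	/* hypothetical: more than 8 of 16 rounds */

struct inode_own_model {
	int owner_id;		/* cgroup currently owning the inode */
	uint16_t frn_history;	/* one bit per round, 1 = foreign winner */
};

/* Count set bits in the 16-slot history. */
static unsigned int popcount16(uint16_t v)
{
	unsigned int n = 0;

	while (v) {
		n += v & 1u;
		v >>= 1;
	}
	return n;
}

/*
 * Record the cgroup that dirtied most of the data in one writeback
 * round.  Returns true if ownership migrated to that cgroup.
 */
static bool record_round(struct inode_own_model *inode, int winner_id)
{
	bool foreign;

	assert(inode != NULL);

	foreign = winner_id != inode->owner_id;
	inode->frn_history =
		(uint16_t)((inode->frn_history << 1) | (foreign ? 1u : 0u));

	if (foreign && popcount16(inode->frn_history) > SWITCH_THRESHOLD) {
		inode->owner_id = winner_id;
		inode->frn_history = 0;
		return true;
	}
	return false;
}
```

In this model a single foreign round among many local ones never
triggers a switch, while a sustained run of foreign rounds eventually
does. That mirrors the stated intent of tolerating occasional foreign
pages without strictly following per-page ownership.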
* Re: [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
From: Vivek Goyal @ 2015-06-15 17:28 UTC
To: Tejun Heo
Cc: axboe, linux-kernel, linux-fsdevel, lizefan, cgroups

On Fri, Jun 12, 2015 at 04:57:34PM -0500, Tejun Heo wrote:
> Update Documentation/cgroups/blkio-controller.txt to reflect the
> recently added cgroup writeback support.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Li Zefan <lizefan@huawei.com>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> Cc: cgroups@vger.kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> ---
>  Documentation/cgroups/blkio-controller.txt | 83 ++++++++++++++++++++++++++++--

Hi Tejun,

This looks good to me. Thanks.

IIRC, I had run into issues with two fsyncs running in two cgroups.
One cgroup had a really small limit and the other was unlimited. At
that point in time I think the conclusion was that multiple
transactions could not make progress at the same time. So the slower
cgroup had blocked the unlimited cgroup's process from opening a
transaction (as IO from the slower group was stuck inside the
throttling layer).

For some reason, in my limited testing I have not noticed it with your
branch. Maybe things have changed since or I am just hazy on details.
I will do some more testing.

Thanks
Vivek

>  1 file changed, 78 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
> index cd556b9..68b6a6a 100644
> --- a/Documentation/cgroups/blkio-controller.txt
> +++ b/Documentation/cgroups/blkio-controller.txt
> @@ -387,8 +387,81 @@ groups and put applications in that group which are not driving enough
>  IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
>  on individual groups and throughput should improve.
>
> -What works
> -==========
> -- Currently only sync IO queues are support. All the buffered writes are
> -  still system wide and not per group. Hence we will not see service
> -  differentiation between buffered writes between groups.
> +Writeback
> +=========
> +
> +Page cache is dirtied through buffered writes and shared mmaps and
> +written asynchronously to the backing filesystem by the writeback
> +mechanism. Writeback sits between the memory and IO domains and
> +regulates the proportion of dirty memory by balancing dirtying and
> +write IOs.
> +
> +On traditional cgroup hierarchies, relationships between different
> +controllers cannot be established making it impossible for writeback
> +to operate accounting for cgroup resource restrictions and all
> +writeback IOs are attributed to the root cgroup.
> +
> +If both the blkio and memory controllers are used on the v2 hierarchy
> +and the filesystem supports cgroup writeback, writeback operations
> +correctly follow the resource restrictions imposed by both memory and
> +blkio controllers.
> +
> +Writeback examines both system-wide and per-cgroup dirty memory status
> +and enforces the more restrictive of the two. Also, writeback control
> +parameters which are absolute values - vm.dirty_bytes and
> +vm.dirty_background_bytes - are distributed across cgroups according
> +to their current writeback bandwidth.
> +
> +There's a peculiarity stemming from the discrepancy in ownership
> +granularity between memory controller and writeback. While memory
> +controller tracks ownership per page, writeback operates on inode
> +basis. cgroup writeback bridges the gap by tracking ownership by
> +inode but migrating ownership if too many foreign pages, pages which
> +don't match the current inode ownership, have been encountered while
> +writing back the inode.
> +
> +This is a conscious design choice as writeback operations are
> +inherently tied to inodes making strictly following page ownership
> +complicated and inefficient. The only use case which suffers from
> +this compromise is multiple cgroups concurrently dirtying disjoint
> +regions of the same inode, which is an unlikely use case and decided
> +to be unsupported. Note that as memory controller assigns page
> +ownership on the first use and doesn't update it until the page is
> +released, even if cgroup writeback strictly follows page ownership,
> +multiple cgroups dirtying overlapping areas wouldn't work as expected.
> +In general, write-sharing an inode across multiple cgroups is not well
> +supported.
> +
> +Filesystem support for cgroup writeback
> +---------------------------------------
> +
> +A filesystem can make writeback IOs cgroup-aware by updating
> +address_space_operations->writepage[s]() to annotate bio's using the
> +following two functions.
> +
> +* wbc_init_bio(@wbc, @bio)
> +
> +  Should be called for each bio carrying writeback data and associates
> +  the bio with the inode's owner cgroup. Can be called anytime
> +  between bio allocation and submission.
> +
> +* wbc_account_io(@wbc, @page, @bytes)
> +
> +  Should be called for each data segment being written out. While
> +  this function doesn't care exactly when it's called during the
> +  writeback session, it's the easiest and most natural to call it as
> +  data segments are added to a bio.
> +
> +With writeback bio's annotated, cgroup support can be enabled per
> +super_block by setting MS_CGROUPWB in ->s_flags. This allows for
> +selective disabling of cgroup writeback support which is helpful when
> +certain filesystem features, e.g. journaled data mode, are
> +incompatible.
> +
> +wbc_init_bio() binds the specified bio to its cgroup. Depending on
> +the configuration, the bio may be executed at a lower priority and if
> +the writeback session is holding shared resources, e.g. a journal
> +entry, may lead to priority inversion. There is no one easy solution
> +for the problem. Filesystems can try to work around specific problem
> +cases by skipping wbc_init_bio() or using bio_associate_blkcg()
> +directly.
> --
> 2.4.2
* Re: [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
From: Tejun Heo @ 2015-06-15 18:23 UTC
To: Vivek Goyal
Cc: axboe, linux-kernel, linux-fsdevel, lizefan, cgroups

Hey, Vivek.

On Mon, Jun 15, 2015 at 01:28:23PM -0400, Vivek Goyal wrote:
> IIRC, I had run into issues with two fsyncs running in two cgroups.
> One cgroup had a really small limit and the other was unlimited. At
> that point in time I think the conclusion was that multiple
> transactions could not make progress at the same time. So the slower
> cgroup had blocked the unlimited cgroup's process from opening a
> transaction (as IO from the slower group was stuck inside the
> throttling layer).
>
> For some reason, in my limited testing I have not noticed it with your
> branch. Maybe things have changed since or I am just hazy on details.
> I will do some more testing.

On ext2, there's nothing interlocking each other. My understanding of
ext4 is pretty limited but as long as the journal head doesn't
overwrap and get blocked on the slow one, it should be fine, so for
most use cases, this shouldn't be a problem.

Thanks.

--
tejun
* Re: [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
From: Theodore Ts'o @ 2015-06-15 23:35 UTC
To: Tejun Heo
Cc: Vivek Goyal, axboe, linux-kernel, linux-fsdevel, lizefan, cgroups

On Mon, Jun 15, 2015 at 02:23:45PM -0400, Tejun Heo wrote:
>
> On ext2, there's nothing interlocking each other. My understanding of
> ext4 is pretty limited but as long as the journal head doesn't
> overwrap and get blocked on the slow one, it should be fine, so for
> most use cases, this shouldn't be a problem.

The writes to the journal in ext3/ext4 are done from the jbd/jbd2
kernel thread. So writes to the journal shouldn't be a problem. In
data=ordered mode, inodes that have blocks that were allocated during
the current transaction do have to have their data blocks written out,
and this is done by the jbd/jbd2 thread using filemap_fdatawait().

If this gets throttled because blocks were originally dirtied by some
cgroup that didn't have much disk time quota, then all file system
activities will get stalled out until the ordered mode writeback
completes. That means any high priority cgroup trying to execute a
system call that mutates file system state will block until the commit
has gotten past the initial setup stage, and so other system activity
could sputter to a halt --- at which point the commit will be allowed
to complete, and then all of the calls to ext4_journal_start() will
unblock, and the system will come back to life. :-)

Because ext3 doesn't have delayed allocation, it will do orders of
magnitude more data=ordered block flushing, so this problem will be
far worse with ext3 compared to ext4.

So if there is some way we can signal to any cgroup that might be
throttling writeback or disk I/O that the jbd/jbd2 process should be
considered privileged, that would be good, since it would allow us to
avoid a potential priority inversion problem.

						- Ted
* Re: [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
From: Tejun Heo @ 2015-06-16 21:54 UTC
To: Theodore Ts'o, Vivek Goyal, axboe, linux-kernel, linux-fsdevel, lizefan, cgroups

Hello, Ted.

On Mon, Jun 15, 2015 at 07:35:19PM -0400, Theodore Ts'o wrote:
> So if there is some way we can signal to any cgroup that might be
> throttling writeback or disk I/O that the jbd/jbd2 process should be
> considered privileged, that would be good, since it would allow us to
> avoid a potential priority inversion problem.

I see. In the long term, I think we might need to come up with a way
to overcharge a slower cgroup to avoid blocking faster ones for cases
where some IOs are depended upon by more than one cgroup. That'd
take quite a bit of work from the blkcg side. Will think more about it.

Thanks.

--
tejun
* Re: [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
From: Theodore Ts'o @ 2015-06-17 3:15 UTC
To: Tejun Heo
Cc: Vivek Goyal, axboe, linux-kernel, linux-fsdevel, lizefan, cgroups

On Tue, Jun 16, 2015 at 05:54:36PM -0400, Tejun Heo wrote:
> Hello, Ted.
>
> On Mon, Jun 15, 2015 at 07:35:19PM -0400, Theodore Ts'o wrote:
> > So if there is some way we can signal to any cgroup that that might be
> > throttling writeback or disk I/O that the jbd/jbd2 process should be
> > considered privileged, that would be a good since it would allow us to
> > avoid a potential priority inversion problem.
>
> I see. In the long term, I think we might need to come up with a way
> to overcharge a slower cgroup to avoid blocking faster ones for cases
> where some IOs are depended upon by more than one cgroup. That'd
> take quite a bit of work on the blkcg side. Will think more about it.

Hmm, while we're at it, there's another priority inversion that can be
painful. If a directory block has been pushed out of memory (possibly
because it was initially accessed by a cgroup with a very tiny amount
of memory allocated to its cgroup) and a process in a cgroup tries
to do a lookup in that directory, it will issue the read with such a
tightly constrained disk time that it might take minutes for the read
to complete. The problem is that the VFS has locked the directory's
i_mutex *before* calling ext4_lookup().

If a high priority process then tries to read the same directory, or
in fact any VFS operation which requires taking the directory's
i_mutex first, including renaming the directory, the high priority
process will end up blocking until the read is completed --- which can
be minutes if the low priority process has a tiny amount of disk time
allocated to it.

There is a related problem where if a read for a particular block is
issued with a very low amount of disk time, and that same
block is required by a high priority process, we can also get hit with
a very similar priority inversion problem.

To date the answer has always been, "Doctor, Doctor it hurts when I do
that...." The only way I can think of fixing the directory mutex
problem is by returning an error code to the VFS layer which instructs
it to unlock the directory, and then have it wait on some wait channel
so it ends up calling the lookup after the directory block has been
read into memory (and we can hope that due to a tight memory cgroup
the block doesn't end up getting ejected from memory right away).

As another solution for another part of the problem, if a high
priority process attempts a read and the I/O is already queued up, but
it's at the back of the bus because it was originally posted by a low
priority cgroup, the rest of the fix would be to elevate the priority
of said I/O request and then re-sort the queue.

As far as the filemap_fdatawait() call is concerned, if it's being
called by fsync() run by a low priority process, or from the writeback
thread, then it can certainly take place at a low priority. But if the
filemap_fdatawait() is being done by a high priority process, such as
a jbd/jbd2 thread, then there needs to be a way that we can set a flag
in the wbc structure indicating that the writes should be submitted as
if they were issued from the kernel thread, and not based on who
originally dirtied the page.

It's going to be a number of point solutions, which is a bit ugly, but
I think that is much more likely to be successful than trying to
implement, say, a generalized priority inheritance scheme for block
I/O requests and related locks. :-)

						- Ted

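The "elevate the priority of said I/O request and then re-sort the queue" idea from the message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `toy_req`, `toy_req_cmp()` and `toy_boost_and_resort()` are hypothetical names rather than block-layer APIs, and a plain sorted array stands in for the real request queue. It ignores everything that makes this hard in a real I/O scheduler.

```c
#include <stdlib.h>

/* Hypothetical model of a queued I/O request; not a kernel structure. */
struct toy_req {
	int block;          /* block number the request reads */
	int prio;           /* lower value is served first */
	unsigned long seq;  /* submission order, used as a tie-breaker */
};

/* Order by priority, then by submission order so equal priorities stay FIFO. */
static int toy_req_cmp(const void *a, const void *b)
{
	const struct toy_req *x = a, *y = b;

	if (x->prio != y->prio)
		return x->prio < y->prio ? -1 : 1;
	if (x->seq != y->seq)
		return x->seq < y->seq ? -1 : 1;
	return 0;
}

/*
 * If a request for @block is already queued at a worse priority than
 * @new_prio, raise it and re-sort the queue. Returns 1 if anything was
 * boosted, 0 otherwise.
 */
int toy_boost_and_resort(struct toy_req *q, size_t n, int block, int new_prio)
{
	int boosted = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (q[i].block == block && q[i].prio > new_prio) {
			q[i].prio = new_prio;
			boosted = 1;
		}
	}
	if (boosted)
		qsort(q, n, sizeof(*q), toy_req_cmp);
	return boosted;
}
```

In this model a high priority reader that finds its block already queued would call `toy_boost_and_resort()` instead of submitting a duplicate read; the sequence number keeps requests of equal priority in submission order.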
* Re: [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
From: Tejun Heo @ 2015-06-17 18:52 UTC
To: Theodore Ts'o, Vivek Goyal, axboe-tSWWG44O7X1aa/9Udqfwiw,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
    lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA

Hi,

On Tue, Jun 16, 2015 at 11:15:40PM -0400, Theodore Ts'o wrote:
> Hmm, while we're at it, there's another priority inversion that can be
> painful. If a directory block has been pushed out of memory (possibly
> because it was initially accessed by a cgroup with a very tiny amount
> of memory allocated to its cgroup) and a process in a cgroup tries

At scale, this is self-correcting to a certain extent in that if the
inode is actually something shared across cgroups, it'll most likely
end up in a cgroup which has enough resources to keep it in memory.
This doesn't prevent one-off hiccups but it at least shouldn't develop
into a systematic and chronic issue.

> to do a lookup in that directory, it will issue the read with such a
> tightly constrained disk time that it might take minutes for the read
> to complete. The problem is that the VFS has locked the directory's
> i_mutex *before* calling ext4_lookup().
>
> If a high priority process then tries to read the same directory, or
> in fact any VFS operation which requires taking the directory's
> i_mutex first, including renaming the directory, the high priority
> process will end up blocking until the read is completed --- which can
> be minutes if the low priority process has a tiny amount of disk time
> allocated to it.
>
> There is a related problem where if a read for a particular block is
> issued with a very low amount of disk time, and that same
> block is required by a high priority process, we can also get hit with
> a very similar priority inversion problem.
>
> To date the answer has always been, "Doctor, Doctor it hurts when I do
> that...." The only way I can think of fixing the directory mutex

In a lot of use cases, the directories accessed by different cgroups
are fairly segregated so this hopefully shouldn't happen too often but
yeah it can be painful in sharing cases.

> problem is by returning an error code to the VFS layer which instructs
> it to unlock the directory, and then have it wait on some wait channel
> so it ends up calling the lookup after the directory block has been
> read into memory (and we can hope that due to a tight memory cgroup
> the block doesn't end up getting ejected from memory right away).
>
> As another solution for another part of the problem, if a high
> priority process attempts a read and the I/O is already queued up, but
> it's at the back of the bus because it was originally posted by a low
> priority cgroup, the rest of the fix would be to elevate the priority
> of said I/O request and then re-sort the queue.
>
> As far as the filemap_fdatawait() call is concerned, if it's being
> called by fsync() run by a low priority process, or from the writeback
> thread, then it can certainly take place at a low priority. But if the
> filemap_fdatawait() is being done by a high priority process, such as
> a jbd/jbd2 thread, then there needs to be a way that we can set a flag
> in the wbc structure indicating that the writes should be submitted as
> if they were issued from the kernel thread, and not based on who
> originally dirtied the page.

Hmmm... so, overriding things *before* a bio is issued shouldn't be
too difficult and as long as this sort of operation isn't prevalent
we might be able to get away with just charging them against root.
Especially if it's to avoid getting blocked on the journal which we
already consider a shared overhead which is charged to root. If this
becomes large enough to require exacting charges, it'll be more
complex but still way better than trying to raise priority on a bio
which is already issued, which is likely to be excruciatingly painful
if possible at all.

> It's going to be a number of point solutions, which is a bit ugly, but
> I think that is much more likely to be successful than trying to
> implement, say, a generalized priority inheritance scheme for block
> I/O requests and related locks. :-)

I agree that a generalized priority inheritance mechanism would be a
massive overkill. I think as long as we can avoid boosting bio's
which already have been issued, things should be relatively sane.
Hopefully, we'd be able to figure out solutions for the worst
offenders within these constraints.

Thanks.

--
tejun

* Re: [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
From: Theodore Ts'o @ 2015-06-17 21:48 UTC
To: Tejun Heo
Cc: Vivek Goyal, axboe-tSWWG44O7X1aa/9Udqfwiw,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
    lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA

On Wed, Jun 17, 2015 at 02:52:37PM -0400, Tejun Heo wrote:
>
> Hmmm... so, overriding things *before* a bio is issued shouldn't be
> too difficult and as long as this sort of operation isn't prevalent
> we might be able to get away with just charging them against root.
> Especially if it's to avoid getting blocked on the journal which we
> already consider a shared overhead which is charged to root. If this
> becomes large enough to require exacting charges, it'll be more
> complex but still way better than trying to raise priority on a bio
> which is already issued, which is likely to be excruciatingly painful
> if possible at all.

Yeah, just charging the overhead to root seems good enough.

I could imagine charging it to whatever cgroup the jbd/jbd2 thread
belongs to, which in turn would be the cgroup of the process that
mounted the file system. The only problem with that is that if a
low-priority process is allowed to mount a file system, and it gets
traversed by a high priority process, the high priority process will
get impacted.

So maybe it's better to just say that it always gets charged to the
root cgroup.

						- Ted

* Re: [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
From: Tejun Heo @ 2015-06-20 20:00 UTC
To: Theodore Ts'o, Vivek Goyal, axboe-tSWWG44O7X1aa/9Udqfwiw,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
    lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA

Hey, Ted.

On Wed, Jun 17, 2015 at 05:48:52PM -0400, Theodore Ts'o wrote:
> On Wed, Jun 17, 2015 at 02:52:37PM -0400, Tejun Heo wrote:
> >
> > Hmmm... so, overriding things *before* a bio is issued shouldn't be
> > too difficult and as long as this sort of operation isn't prevalent
> > we might be able to get away with just charging them against root.
> > Especially if it's to avoid getting blocked on the journal which we
> > already consider a shared overhead which is charged to root. If this
> > becomes large enough to require exacting charges, it'll be more
> > complex but still way better than trying to raise priority on a bio
> > which is already issued, which is likely to be excruciatingly painful
> > if possible at all.
>
> Yeah, just charging the overhead to root seems good enough.

I think the easiest way to achieve this bypass would be making jbd
mark the inode while waiting in fdatawait so that the writeback path
can skip attaching the writeback bios for the inode. This isn't
perfect but should be able to work around stalls from priority
inversion to a certain extent.

However, I can't come up with a workload to test it. AFAICS, the
fdatawait stall path in jbd2 is journal_finish_inode_data_buffers()
but the path doesn't trigger reliably with a mixed load of overwriting
dd, a bunch of file creations and chmods, and different cgroups stay
pretty well isolated. Can you please suggest a workload for testing
the datawait path?

Thanks.

--
tejun

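The bypass proposed in the message above (mark the inode while the journal thread waits, and let the writeback path skip cgroup association for marked inodes so the IO ends up charged to root) can be sketched as a userspace model. Everything here is an assumption made for illustration: `toy_inode`, `journal_waiting`, `TOY_ROOT_CGROUP` and the helper names are hypothetical and do not correspond to existing kernel fields or functions.

```c
#include <stdbool.h>

#define TOY_ROOT_CGROUP 0	/* hypothetical id standing in for the root cgroup */

/* Toy stand-in for an inode; not the kernel's struct inode. */
struct toy_inode {
	int owner_cgroup;	/* cgroup that currently owns the inode's writeback */
	bool journal_waiting;	/* set while the journal thread waits on this inode */
};

/* Journal side: bracket the data wait with a mark and an unmark. */
void toy_journal_wait_begin(struct toy_inode *inode)
{
	inode->journal_waiting = true;
}

void toy_journal_wait_end(struct toy_inode *inode)
{
	inode->journal_waiting = false;
}

/*
 * Writeback side: pick the cgroup a new writeback bio is charged to.
 * A marked inode skips cgroup association, so its IO falls back to root.
 */
int toy_bio_charge_target(const struct toy_inode *inode)
{
	return inode->journal_waiting ? TOY_ROOT_CGROUP : inode->owner_cgroup;
}
```

The decision is made before a bio is created, which matches the constraint discussed in the thread: nothing already issued needs to be re-prioritized. Bios issued before the mark was set are unaffected in this model.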
* [PATCHSET v2 block/for-4.2/writeback] cgroup, writeback: misc updates for cgroup writeback support
From: Tejun Heo @ 2015-06-16 22:48 UTC
To: axboe-tSWWG44O7X1aa/9Udqfwiw
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
    lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
    hannes-druUgvl0LCNAfugRpC6u6w, kernel-team-b10kYP2dOMg

Hello,

This is v2. The only change from the last take[L] is

* super_block->s_iflags added and MS_CGROUPWB replaced with
  SB_I_CGROUPWB as suggested by Christoph and Jan.

This patchset contains the following assorted updates for the cgroup
writeback support.

 0001-writeback-do-foreign-inode-detection-iff-cgroup-writ.patch
 0002-vfs-writeback-replace-FS_CGROUP_WRITEBACK-with-SB_I_.patch
 0003-writeback-blkio-add-documentation-for-cgroup-writeba.patch

0001 fixes a bug where a cleared FS_CGROUP_WRITEBACK flag didn't fully
disable cgroup writeback support if the filesystem code uses
wbc_init_bio() and wbc_account_io().

0002 replaces FS_CGROUP_WRITEBACK with SB_I_CGROUPWB so that cgroup
writeback support can be enabled / disabled per superblock rather than
filesystem type.

0003 updates blkio documentation with information on cgroup writeback
support.

This patchset is on top of block/for-4.2/writeback and available in
the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup-writeback-updates

diffstat follows. Thanks.

 Documentation/cgroups/blkio-controller.txt | 83 +++++++++++++++++++++++++++--
 fs/ext2/super.c                            |  4 -
 fs/fs-writeback.c                          | 16 ++++-
 fs/namespace.c                             |  2
 include/linux/backing-dev.h                |  2
 include/linux/fs.h                         |  1
 include/uapi/linux/fs.h                    |  1
 7 files changed, 96 insertions(+), 13 deletions(-)

--
tejun

[L] http://lkml.kernel.org/g/1434146254-26220-1-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org

* [PATCH 3/3] writeback, blkio: add documentation for cgroup writeback support
From: Tejun Heo @ 2015-06-16 22:48 UTC
To: axboe-tSWWG44O7X1aa/9Udqfwiw
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
    lizefan-hv44wF8Li93QT0dZR+AlfA, cgroups-u79uwXL29TY76Z2rM5mHXA,
    hannes-druUgvl0LCNAfugRpC6u6w, kernel-team-b10kYP2dOMg,
    Tejun Heo, Vivek Goyal

Update Documentation/cgroups/blkio-controller.txt to reflect the
recently added cgroup writeback support.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Cc: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 Documentation/cgroups/blkio-controller.txt | 83 ++++++++++++++++++++++++++++--
 1 file changed, 78 insertions(+), 5 deletions(-)

diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
index cd556b9..68b6a6a 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -387,8 +387,81 @@ groups and put applications in that group which are not driving enough
 IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
 on individual groups and throughput should improve.
 
-What works
-==========
-- Currently only sync IO queues are support. All the buffered writes are
-  still system wide and not per group. Hence we will not see service
-  differentiation between buffered writes between groups.
+Writeback
+=========
+
+Page cache is dirtied through buffered writes and shared mmaps and
+written asynchronously to the backing filesystem by the writeback
+mechanism. Writeback sits between the memory and IO domains and
+regulates the proportion of dirty memory by balancing dirtying and
+write IOs.
+
+On traditional cgroup hierarchies, relationships between different
+controllers cannot be established making it impossible for writeback
+to operate accounting for cgroup resource restrictions and all
+writeback IOs are attributed to the root cgroup.
+
+If both the blkio and memory controllers are used on the v2 hierarchy
+and the filesystem supports cgroup writeback, writeback operations
+correctly follow the resource restrictions imposed by both memory and
+blkio controllers.
+
+Writeback examines both system-wide and per-cgroup dirty memory status
+and enforces the more restrictive of the two. Also, writeback control
+parameters which are absolute values - vm.dirty_bytes and
+vm.dirty_background_bytes - are distributed across cgroups according
+to their current writeback bandwidth.
+
+There's a peculiarity stemming from the discrepancy in ownership
+granularity between memory controller and writeback. While memory
+controller tracks ownership per page, writeback operates on inode
+basis. cgroup writeback bridges the gap by tracking ownership by
+inode but migrating ownership if too many foreign pages, pages which
+don't match the current inode ownership, have been encountered while
+writing back the inode.
+
+This is a conscious design choice as writeback operations are
+inherently tied to inodes making strictly following page ownership
+complicated and inefficient. The only use case which suffers from
+this compromise is multiple cgroups concurrently dirtying disjoint
+regions of the same inode, which is an unlikely use case and decided
+to be unsupported. Note that as memory controller assigns page
+ownership on the first use and doesn't update it until the page is
+released, even if cgroup writeback strictly follows page ownership,
+multiple cgroups dirtying overlapping areas wouldn't work as expected.
+In general, write-sharing an inode across multiple cgroups is not well
+supported.
+
+Filesystem support for cgroup writeback
+---------------------------------------
+
+A filesystem can make writeback IOs cgroup-aware by updating
+address_space_operations->writepage[s]() to annotate bio's using the
+following two functions.
+
+* wbc_init_bio(@wbc, @bio)
+
+  Should be called for each bio carrying writeback data and associates
+  the bio with the inode's owner cgroup. Can be called anytime
+  between bio allocation and submission.
+
+* wbc_account_io(@wbc, @page, @bytes)
+
+  Should be called for each data segment being written out. While
+  this function doesn't care exactly when it's called during the
+  writeback session, it's the easiest and most natural to call it as
+  data segments are added to a bio.
+
+With writeback bio's annotated, cgroup support can be enabled per
+super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for
+selective disabling of cgroup writeback support which is helpful when
+certain filesystem features, e.g. journaled data mode, are
+incompatible.
+
+wbc_init_bio() binds the specified bio to its cgroup. Depending on
+the configuration, the bio may be executed at a lower priority and if
+the writeback session is holding shared resources, e.g. a journal
+entry, may lead to priority inversion. There is no one easy solution
+for the problem. Filesystems can try to work around specific problem
+cases by skipping wbc_init_bio() or using bio_associate_blkcg()
+directly.
--
2.4.3

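The documentation in the patch above describes inode ownership migrating when "too many foreign pages" are seen during writeback. The following userspace model is one way to picture that rule. It is a simplified sketch and does not reproduce the kernel's actual heuristics: the struct, the function name and the three-round threshold are all hypothetical, and "too many" is interpreted here as a strict majority of the pages written in a round.

```c
#include <stddef.h>

/* Hypothetical threshold: consecutive foreign-dominated rounds before switching. */
#define TOY_FOREIGN_ROUNDS_TO_SWITCH 3

/* Toy stand-in for per-inode writeback ownership state; not a kernel type. */
struct toy_wb_inode {
	int owner;		/* cgroup id that currently owns the inode */
	int candidate;		/* foreign cgroup that dominated the latest foreign round */
	int foreign_rounds;	/* consecutive rounds dominated by @candidate */
};

/*
 * Account one writeback round. page_owner[i] is the cgroup owning page i
 * written in this round. Ownership moves only after the same foreign cgroup
 * has owned a strict majority of the pages for several rounds in a row.
 */
void toy_writeback_round(struct toy_wb_inode *inode,
			 const int *page_owner, size_t npages)
{
	int winner = inode->owner;
	size_t votes = 0, count = 0, i;

	/* Boyer-Moore majority vote: find the only possible majority owner. */
	for (i = 0; i < npages; i++) {
		if (votes == 0) {
			winner = page_owner[i];
			votes = 1;
		} else if (page_owner[i] == winner) {
			votes++;
		} else {
			votes--;
		}
	}

	/* Second pass: confirm the candidate really owns more than half. */
	for (i = 0; i < npages; i++)
		if (page_owner[i] == winner)
			count++;

	if (count * 2 <= npages || winner == inode->owner) {
		inode->foreign_rounds = 0;	/* no foreign majority this round */
		return;
	}

	if (winner == inode->candidate) {
		inode->foreign_rounds++;
	} else {
		inode->candidate = winner;
		inode->foreign_rounds = 1;
	}

	if (inode->foreign_rounds >= TOY_FOREIGN_ROUNDS_TO_SWITCH) {
		inode->owner = winner;		/* migrate ownership */
		inode->foreign_rounds = 0;
	}
}
```

Requiring several consecutive rounds is a design choice of this sketch, intended to keep a single burst of foreign pages from flipping ownership back and forth. It also shows why multiple cgroups dirtying disjoint regions of one inode fare poorly: at any time only one of them is the owner.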