* [PATCH V2] xfs: implement cgroup writeback support
@ 2018-03-22 21:11 Shaohua Li
  2018-03-23 14:00 ` Chris Mason
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Shaohua Li @ 2018-03-22 21:11 UTC (permalink / raw)
  To: linux-xfs
  Cc: Kernel-team, bfoster, Shaohua Li, Tejun Heo, Darrick J . Wong,
	Dave Chinner, Christoph Hellwig

From: Shaohua Li <shli@fb.com>

Basically this is a copy of commit 001e4a8775f6 ("ext4: implement cgroup
writeback support"). Tested with fio: verified that writeback is
throttled against the cgroup's io.max write bandwidth, and also verified
that after moving the fio test to another cgroup, writeback is throttled
against the new cgroup's setting.

This only controls file data writes for the cgroup. For metadata, since
xfs dispatches metadata writes from dedicated threads, it's possible a low
prio app's metadata could harm a high prio app's metadata. A while back,
Tejun had a patch to force metadata to belong to the root cgroup for btrfs.
I had a similar patch for xfs too. But since Tejun's patch isn't upstream,
I'll delay posting the xfs patch.

Cc: Tejun Heo <tj@kernel.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/xfs/xfs_aops.c  | 13 +++++++++++--
 fs/xfs/xfs_super.c |  1 +
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 19eadc8..5f70584 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -589,7 +589,8 @@ xfs_alloc_ioend(
 	struct inode		*inode,
 	unsigned int		type,
 	xfs_off_t		offset,
-	struct buffer_head	*bh)
+	struct buffer_head	*bh,
+	struct writeback_control *wbc)
 {
 	struct xfs_ioend	*ioend;
 	struct bio		*bio;
@@ -606,6 +607,8 @@ xfs_alloc_ioend(
 	INIT_WORK(&ioend->io_work, xfs_end_io);
 	ioend->io_append_trans = NULL;
 	ioend->io_bio = bio;
+	/* attach new bio to its cgroup */
+	wbc_init_bio(wbc, bio);
 	return ioend;
 }
 
@@ -633,6 +636,8 @@ xfs_chain_bio(
 	ioend->io_bio->bi_write_hint = ioend->io_inode->i_write_hint;
 	submit_bio(ioend->io_bio);
 	ioend->io_bio = new;
+	/* attach new bio to its cgroup */
+	wbc_init_bio(wbc, new);
 }
 
 /*
@@ -656,7 +661,8 @@ xfs_add_to_ioend(
 	    offset != wpc->ioend->io_offset + wpc->ioend->io_size) {
 		if (wpc->ioend)
 			list_add(&wpc->ioend->io_list, iolist);
-		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset, bh);
+		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset,
+					     bh, wbc);
 	}
 
 	/*
@@ -666,6 +672,9 @@ xfs_add_to_ioend(
 	while (xfs_bio_add_buffer(wpc->ioend->io_bio, bh) != bh->b_size)
 		xfs_chain_bio(wpc->ioend, wbc, bh);
 
+	/* Charge write size to its cgroup for cgroup switching track */
+	wbc_account_io(wbc, bh->b_page, bh->b_size);
+
 	wpc->ioend->io_size += bh->b_size;
 	wpc->last_block = bh->b_blocknr;
 	xfs_start_buffer_writeback(bh);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 951271f..95c2d3d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1666,6 +1666,7 @@ xfs_fs_fill_super(
 	sb->s_max_links = XFS_MAXLINK;
 	sb->s_time_gran = 1;
 	set_posix_acl_flag(sb);
+	sb->s_iflags |= SB_I_CGROUPWB;
 
 	/* version 5 superblocks support inode version counters. */
 	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
-- 
2.9.5



* Re: [PATCH V2] xfs: implement cgroup writeback support
  2018-03-22 21:11 [PATCH V2] xfs: implement cgroup writeback support Shaohua Li
@ 2018-03-23 14:00 ` Chris Mason
  2018-03-23 14:24 ` 张本龙
  2018-03-23 14:37 ` Brian Foster
  2 siblings, 0 replies; 21+ messages in thread
From: Chris Mason @ 2018-03-23 14:00 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-xfs, Kernel-team, bfoster, Shaohua Li, Tejun Heo,
	Darrick J . Wong, Dave Chinner, Christoph Hellwig



On 22 Mar 2018, at 17:11, Shaohua Li wrote:

> From: Shaohua Li <shli@fb.com>
>
> Basically this is a copy of commit 001e4a8775f6(ext4: implement cgroup
> writeback support). Tested with a fio test, verified writeback is
> throttled against cgroup io.max write bandwidth, also verified moving
> the fio test to another cgroup and the writeback is throttled against
> new cgroup setting.
>
> This only controls the file data write for cgroup. For metadata, since
> xfs dispatches the metadata write in specific threads, it's possible 
> low
> prio app's metadata could harm high prio app's metadata. A while back,
> Tejun has a patch to force metadata belonging to root cgroup for 
> btrfs.
> I had a similiar patch for xfs too. But Since Tejun's patch isn't in
> upstream, I'll delay post the xfs patch.

I think we've worked out all the problems on the btrfs/generic patch
set; we'll get those sent out shortly.

-chris


* Re: [PATCH V2] xfs: implement cgroup writeback support
  2018-03-22 21:11 [PATCH V2] xfs: implement cgroup writeback support Shaohua Li
  2018-03-23 14:00 ` Chris Mason
@ 2018-03-23 14:24 ` 张本龙
  2018-03-25 21:59   ` Dave Chinner
  2018-03-23 14:37 ` Brian Foster
  2 siblings, 1 reply; 21+ messages in thread
From: 张本龙 @ 2018-03-23 14:24 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-xfs

Hi Shaohua and XFS,

May I ask how we are going to handle REQ_META bios issued from XFS? You
mentioned charging them to the root cgroup (also in an earlier email
discussion), but the 4.16.0-rc6 code does not seem to handle them
separately.

In our case, to support XFS cgroup writeback control (ported and
slightly adapted to 3.10.0), ignoring xfs log bios resulted in trouble.
Threads from a throttled docker container might call submit_bio() in the
following path under their own identity; once that docker blkcg has
accumulated large amounts of data (e.g., 20GB), the log gets blocked.
------------------------------------------------------------------------------------------------------------------------------------------------
xxx-agent 22825 [001] 78772.391023: probe:submit_bio:
(ffffffff812f70c0) bio=ffff880fb6e1f300 bi_css=0
                  4f70c1 submit_bio
(/usr/lib/debug/lib/modules/3.10.0-514.16.1.el7.cgwb.rdt.x86_64/vmlinux)
                   4e440 xfs_buf_submit ([xfs])
                   6e59b xlog_bdstrat ([xfs])
                   704c5 xlog_sync ([xfs])
                   70683 xlog_state_release_iclog ([xfs])
                   71917 _xfs_log_force_lsn ([xfs])
                   5249e xfs_file_fsync ([xfs])
                  4374d5 do_fsync
(/usr/lib/debug/lib/modules/3.10.0-514.16.1.el7.cgwb.rdt.x86_64/vmlinux)
                  4377a0 sys_fsync
(/usr/lib/debug/lib/modules/3.10.0-514.16.1.el7.cgwb.rdt.x86_64/vmlinux)
                  8a0ac9 system_call
(/usr/lib/debug/lib/modules/3.10.0-514.16.1.el7.cgwb.rdt.x86_64/vmlinux
                  1df9c4 [unknown]
(/home/xxx/xxx-agent/data/install/xxx-agent/16/xxx-agent)

Meanwhile other containers without bps limits, or kworkers from the
root cgroup, might get stuck as follows:
----------------------------------------------------------------------------------------------------------------------------------------------
[79183.692355] INFO: task xxx:33434 blocked for more than 120 seconds.
[79183.730997] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[79183.778095] consul          D ffff880169958000     0 33434  24659 0x00000000
[79183.820478]  ffff880bce3e7e38 0000000000000082 ffff880c7af8bec0
ffff880bce3e7fd8
[79183.865232]  ffff880bce3e7fd8 ffff880bce3e7fd8 ffff880c7af8bec0
ffff880c7af8bec0
[79183.909997]  ffff880169bfa928 ffff880bce3e7ef4 0000000000000000
ffff880c7af8bec0
[79183.954738] Call Trace:
[79183.969516]  [<ffffffff81695ac9>] schedule+0x29/0x70
[79183.999357]  [<ffffffffa073497a>] _xfs_log_force_lsn+0x2fa/0x350 [xfs]
[79184.038497]  [<ffffffff810c84c0>] ? wake_up_state+0x20/0x20
[79184.071941]  [<ffffffffa071542e>] xfs_file_fsync+0x10e/0x1e0 [xfs]
[79184.109010]  [<ffffffff812374d5>] do_fsync+0x65/0xa0
[79184.138826]  [<ffffffff8120368f>] ? SyS_write+0x9f/0xe0
[79184.170182]  [<ffffffff812377a0>] SyS_fsync+0x10/0x20
[79184.200514]  [<ffffffff816a0ac9>] system_call_fastpath+0x16/0x1b
[79184.236613] INFO: task xxx:38778 blocked for more than 120 seconds.
[79184.275238] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[79184.322329] consul          D ffff880fe728af10     0 38778  32505 0x00000000
[79184.364694]  ffff881a18fe7c10 0000000000000082 ffff88196a74bec0
ffff881a18fe7fd8
[79184.409424]  ffff881a18fe7fd8 ffff881a18fe7fd8 ffff88196a74bec0
ffff881a18fe7d60
[79184.454167]  ffff881a18fe7d68 7fffffffffffffff ffff88196a74bec0
ffffffffffffffff
[79184.498909] Call Trace:
[79184.513673]  [<ffffffff81695ac9>] schedule+0x29/0x70
[79184.543475]  [<ffffffff81693529>] schedule_timeout+0x239/0x2c0
[79184.578468]  [<ffffffff810c4f99>] ? ttwu_do_wakeup+0x19/0xd0
[79184.612419]  [<ffffffff810c512d>] ? ttwu_do_activate.constprop.91+0x5d/0x70
[79184.654148]  [<ffffffff810c8308>] ? try_to_wake_up+0x1c8/0x320
[79184.689141]  [<ffffffff81695ea6>] wait_for_completion+0x116/0x170
[79184.725688]  [<ffffffff810c84c0>] ? wake_up_state+0x20/0x20
[79184.759126]  [<ffffffff810ac67c>] flush_work+0xfc/0x1c0
[79184.790481]  [<ffffffff810a85f0>] ? move_linked_works+0x90/0x90
[79184.826020]  [<ffffffffa073622a>] xlog_cil_force_lsn+0x8a/0x210 [xfs]
[79184.864644]  [<ffffffff811850df>] ? __filemap_fdatawrite_range+0xbf/0xe0
[79184.904838]  [<ffffffffa073470f>] _xfs_log_force_lsn+0x8f/0x350 [xfs]
[79184.943467]  [<ffffffff8118531f>] ? filemap_fdatawait_range+0x1f/0x30
[79184.982086]  [<ffffffff81694c42>] ? down_read+0x12/0x30
[79185.013466]  [<ffffffffa071542e>] xfs_file_fsync+0x10e/0x1e0 [xfs]
[79185.050533]  [<ffffffff812374d5>] do_fsync+0x65/0xa0
[79185.080335]  [<ffffffff8120368f>] ? SyS_write+0x9f/0xe0
[79185.111691]  [<ffffffff812377a0>] SyS_fsync+0x10/0x20
[79185.141986]  [<ffffffff816a0ac9>] system_call_fastpath+0x16/0x1b
[79185.178051] INFO: task kworker/1:0:12593 blocked for more than 120 seconds.
[79185.219258] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[79185.266353] kworker/1:0     D ffff8801698caf10     0 12593      2 0x00000000
[79185.308743] Workqueue: xfs-cil/sdb1 xlog_cil_push_work [xfs]
[79185.342757]  ffff880d00e3bbe8 0000000000000046 ffff880169732f10
ffff880d00e3bfd8
[79185.387475]  ffff880d00e3bfd8 ffff880d00e3bfd8 ffff880169732f10
ffff880169bfa800
[79185.432197]  ffff880169bfa928 ffff880bd3b4a5c0 ffff880169732f10
ffff880169bfa900
[79185.476946] Call Trace:
[79185.491725]  [<ffffffff81695ac9>] schedule+0x29/0x70
[79185.521548]  [<ffffffffa073377a>]
xlog_state_get_iclog_space+0x10a/0x310 [xfs]
[79185.565010]  [<ffffffff810c84c0>] ? wake_up_state+0x20/0x20
[79185.598476]  [<ffffffffa0733c99>] xlog_write+0x1a9/0x720 [xfs]
[79185.633477]  [<ffffffffa07359c9>] xlog_cil_push+0x239/0x420 [xfs]
[79185.670037]  [<ffffffffa0735bc5>] xlog_cil_push_work+0x15/0x20 [xfs]
[79185.708138]  [<ffffffff810ab45b>] process_one_work+0x17b/0x470
[79185.743124]  [<ffffffff810ac413>] worker_thread+0x2a3/0x410
[79185.776542]  [<ffffffff810ac170>] ? rescuer_thread+0x460/0x460
[79185.811546]  [<ffffffff810b3a4f>] kthread+0xcf/0xe0
[79185.840835]  [<ffffffff810b3980>] ? kthread_create_on_node+0x140/0x140
[79185.879982]  [<ffffffff816a0a18>] ret_from_fork+0x58/0x90
[79185.912391]  [<ffffffff810b3980>] ? kthread_create_on_node+0x140/0x140

Not familiar with XFS, but it seems log bios can get stuck in
throttled cgroups, leaving other innocent groups waiting for their
completion. To cope with this we let REQ_META log bios bypass
throttling in blk_throtl_bio().
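
For illustration, a minimal sketch of that kind of bypass, assuming a
3.10-era blk_throtl_bio() where the request flags still live in
bio->bi_rw (hypothetical helper name, not the actual change we carry):

static inline bool throtl_bypass_bio(struct bio *bio)
{
        /*
         * XFS submits its log buffers with REQ_META set; let them skip
         * the throttling layer so a rate-limited blkcg can never hold
         * up the globally ordered log.
         */
        return (bio->bi_rw & REQ_META) != 0;
}

The idea is that blk_throtl_bio() would check this before doing any
group accounting and return false (do not throttle) for matching bios.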

So I'm just writing to make sure the issue is known. It would be great
if your other patch already fixes this!

Thanks,
Benlong


2018-03-23 5:11 GMT+08:00 Shaohua Li <shli@kernel.org>:
> From: Shaohua Li <shli@fb.com>
>
> Basically this is a copy of commit 001e4a8775f6(ext4: implement cgroup
> writeback support). Tested with a fio test, verified writeback is
> throttled against cgroup io.max write bandwidth, also verified moving
> the fio test to another cgroup and the writeback is throttled against
> new cgroup setting.
>
> This only controls the file data write for cgroup. For metadata, since
> xfs dispatches the metadata write in specific threads, it's possible low
> prio app's metadata could harm high prio app's metadata. A while back,
> Tejun has a patch to force metadata belonging to root cgroup for btrfs.
> I had a similiar patch for xfs too. But Since Tejun's patch isn't in
> upstream, I'll delay post the xfs patch.
>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Darrick J. Wong <darrick.wong@oracle.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  fs/xfs/xfs_aops.c  | 13 +++++++++++--
>  fs/xfs/xfs_super.c |  1 +
>  2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 19eadc8..5f70584 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -589,7 +589,8 @@ xfs_alloc_ioend(
>         struct inode            *inode,
>         unsigned int            type,
>         xfs_off_t               offset,
> -       struct buffer_head      *bh)
> +       struct buffer_head      *bh,
> +       struct writeback_control *wbc)
>  {
>         struct xfs_ioend        *ioend;
>         struct bio              *bio;
> @@ -606,6 +607,8 @@ xfs_alloc_ioend(
>         INIT_WORK(&ioend->io_work, xfs_end_io);
>         ioend->io_append_trans = NULL;
>         ioend->io_bio = bio;
> +       /* attach new bio to its cgroup */
> +       wbc_init_bio(wbc, bio);
>         return ioend;
>  }
>
> @@ -633,6 +636,8 @@ xfs_chain_bio(
>         ioend->io_bio->bi_write_hint = ioend->io_inode->i_write_hint;
>         submit_bio(ioend->io_bio);
>         ioend->io_bio = new;
> +       /* attach new bio to its cgroup */
> +       wbc_init_bio(wbc, new);
>  }
>
>  /*
> @@ -656,7 +661,8 @@ xfs_add_to_ioend(
>             offset != wpc->ioend->io_offset + wpc->ioend->io_size) {
>                 if (wpc->ioend)
>                         list_add(&wpc->ioend->io_list, iolist);
> -               wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset, bh);
> +               wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset,
> +                                            bh, wbc);
>         }
>
>         /*
> @@ -666,6 +672,9 @@ xfs_add_to_ioend(
>         while (xfs_bio_add_buffer(wpc->ioend->io_bio, bh) != bh->b_size)
>                 xfs_chain_bio(wpc->ioend, wbc, bh);
>
> +       /* Charge write size to its cgroup for cgroup switching track */
> +       wbc_account_io(wbc, bh->b_page, bh->b_size);
> +
>         wpc->ioend->io_size += bh->b_size;
>         wpc->last_block = bh->b_blocknr;
>         xfs_start_buffer_writeback(bh);
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 951271f..95c2d3d 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1666,6 +1666,7 @@ xfs_fs_fill_super(
>         sb->s_max_links = XFS_MAXLINK;
>         sb->s_time_gran = 1;
>         set_posix_acl_flag(sb);
> +       sb->s_iflags |= SB_I_CGROUPWB;
>
>         /* version 5 superblocks support inode version counters. */
>         if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
> --
> 2.9.5
>


* Re: [PATCH V2] xfs: implement cgroup writeback support
  2018-03-22 21:11 [PATCH V2] xfs: implement cgroup writeback support Shaohua Li
  2018-03-23 14:00 ` Chris Mason
  2018-03-23 14:24 ` 张本龙
@ 2018-03-23 14:37 ` Brian Foster
  2 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2018-03-23 14:37 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-xfs, Kernel-team, Shaohua Li, Tejun Heo, Darrick J . Wong,
	Dave Chinner, Christoph Hellwig

On Thu, Mar 22, 2018 at 02:11:45PM -0700, Shaohua Li wrote:
> From: Shaohua Li <shli@fb.com>
> 
> Basically this is a copy of commit 001e4a8775f6(ext4: implement cgroup
> writeback support). Tested with a fio test, verified writeback is
> throttled against cgroup io.max write bandwidth, also verified moving
> the fio test to another cgroup and the writeback is throttled against
> new cgroup setting.
> 
> This only controls the file data write for cgroup. For metadata, since
> xfs dispatches the metadata write in specific threads, it's possible low
> prio app's metadata could harm high prio app's metadata. A while back,
> Tejun has a patch to force metadata belonging to root cgroup for btrfs.
> I had a similiar patch for xfs too. But Since Tejun's patch isn't in
> upstream, I'll delay post the xfs patch.
> 
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Darrick J. Wong <darrick.wong@oracle.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  fs/xfs/xfs_aops.c  | 13 +++++++++++--
>  fs/xfs/xfs_super.c |  1 +
>  2 files changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 19eadc8..5f70584 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -589,7 +589,8 @@ xfs_alloc_ioend(
>  	struct inode		*inode,
>  	unsigned int		type,
>  	xfs_off_t		offset,
> -	struct buffer_head	*bh)
> +	struct buffer_head	*bh,
> +	struct writeback_control *wbc)
>  {
>  	struct xfs_ioend	*ioend;
>  	struct bio		*bio;
> @@ -606,6 +607,8 @@ xfs_alloc_ioend(
>  	INIT_WORK(&ioend->io_work, xfs_end_io);
>  	ioend->io_append_trans = NULL;
>  	ioend->io_bio = bio;
> +	/* attach new bio to its cgroup */
> +	wbc_init_bio(wbc, bio);

Just a nit, but can we move the wbc_init_bio() calls to earlier in the
function, where we set up/init the bio? (Same comment for
xfs_chain_bio().) Otherwise this looks fine to me and passes the posted
xfstests test.
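
Something like the following, purely as a sketch of the placement being
suggested (hypothetical helper name and allocation details, not the
actual xfs_aops.c code):

static struct bio *
xfs_alloc_wb_bio(
        struct writeback_control *wbc,
        struct buffer_head      *bh)
{
        struct bio              *bio = bio_alloc(GFP_NOFS, BIO_MAX_PAGES);

        bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
        bio_set_dev(bio, bh->b_bdev);
        /* attach the new bio to its cgroup right where it is set up */
        wbc_init_bio(wbc, bio);
        return bio;
}

That way every allocation site picks up the cgroup association in one
place instead of tagging the bio just before xfs_alloc_ioend() returns.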

Brian

>  	return ioend;
>  }
>  
> @@ -633,6 +636,8 @@ xfs_chain_bio(
>  	ioend->io_bio->bi_write_hint = ioend->io_inode->i_write_hint;
>  	submit_bio(ioend->io_bio);
>  	ioend->io_bio = new;
> +	/* attach new bio to its cgroup */
> +	wbc_init_bio(wbc, new);
>  }
>  
>  /*
> @@ -656,7 +661,8 @@ xfs_add_to_ioend(
>  	    offset != wpc->ioend->io_offset + wpc->ioend->io_size) {
>  		if (wpc->ioend)
>  			list_add(&wpc->ioend->io_list, iolist);
> -		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset, bh);
> +		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset,
> +					     bh, wbc);
>  	}
>  
>  	/*
> @@ -666,6 +672,9 @@ xfs_add_to_ioend(
>  	while (xfs_bio_add_buffer(wpc->ioend->io_bio, bh) != bh->b_size)
>  		xfs_chain_bio(wpc->ioend, wbc, bh);
>  
> +	/* Charge write size to its cgroup for cgroup switching track */
> +	wbc_account_io(wbc, bh->b_page, bh->b_size);
> +
>  	wpc->ioend->io_size += bh->b_size;
>  	wpc->last_block = bh->b_blocknr;
>  	xfs_start_buffer_writeback(bh);
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 951271f..95c2d3d 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1666,6 +1666,7 @@ xfs_fs_fill_super(
>  	sb->s_max_links = XFS_MAXLINK;
>  	sb->s_time_gran = 1;
>  	set_posix_acl_flag(sb);
> +	sb->s_iflags |= SB_I_CGROUPWB;
>  
>  	/* version 5 superblocks support inode version counters. */
>  	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
> -- 
> 2.9.5
> 


* Re: [PATCH V2] xfs: implement cgroup writeback support
  2018-03-23 14:24 ` 张本龙
@ 2018-03-25 21:59   ` Dave Chinner
  2018-03-26 16:28     ` Brian Foster
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2018-03-25 21:59 UTC (permalink / raw)
  To: 张本龙; +Cc: Shaohua Li, linux-xfs

On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> Hi Shaohua and XFS,
> 
> May I ask how are we gonna handle REQ_META issued from XFS? As you
> mentioned about charging to root cgroup (also in an earlier email
> discussion), and seems the 4.16.0-rc6 code is not handling it
> separately.
> 
> In our case to support XFS cgroup writeback control, which was ported
> and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> trouble. Threads from throttled docker might submit_bio in following
> path by its own identity, this docker blkcg accumulated large amounts
> of data (e.g., 20GB), thus such log gets blocked.

And thus displaying the reason why I originally refused to merge
this code until regression tests were added to fstests to exercise
these sorts of issues. This stuff adds new internal filesystem IO
ordering constraints, so we need tests that exercise it and ensure
we don't accidentally break it in future.


> Not familiar with XFS, but seems log bios are partially stuck in
> throttled cgroups, leaving other innocent groups waiting for
> completion. To cope with this we bypassed REQ_META log bios in
> blk_throtl_bio().

Yup, the log is global and should not be throttled. Metadata is less
obvious as to what the correct thing to do is, because writes are
always done from internal kernel threads but reads are done from a
mix of kernel threads, user cgroup contexts and user data IO
completions. Hence there are situations where metadata reads may
need to be throttled because they are cgroup context only (e.g.
filesystem directory traversal) but others where reads should not
be throttled because they have global context (e.g. inside a
transaction when other buffers are already locked).

Getting this right and keeping it working requires regression tests
that get run on every release, whether it be upstream or distro
kernels, and that means we need tests in fstests to cover cgroup IO
control....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH V2] xfs: implement cgroup writeback support
  2018-03-25 21:59   ` Dave Chinner
@ 2018-03-26 16:28     ` Brian Foster
  2018-03-27  0:55       ` Shaohua Li
       [not found]       ` <CAJDdQW3gOa8ry_XVkcCMf2QT7wC7MvU4b94hMhwJsg9MjYoKgQ@mail.gmail.com>
  0 siblings, 2 replies; 21+ messages in thread
From: Brian Foster @ 2018-03-26 16:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: 张本龙, Shaohua Li, linux-xfs

On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> > Hi Shaohua and XFS,
> > 
> > May I ask how are we gonna handle REQ_META issued from XFS? As you
> > mentioned about charging to root cgroup (also in an earlier email
> > discussion), and seems the 4.16.0-rc6 code is not handling it
> > separately.
> > 
> > In our case to support XFS cgroup writeback control, which was ported
> > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> > trouble. Threads from throttled docker might submit_bio in following
> > path by its own identity, this docker blkcg accumulated large amounts
> > of data (e.g., 20GB), thus such log gets blocked.
> 
> And thus displaying the reason why I originally refused to merge
> this code until regression tests were added to fstests to exercise
> these sorts of issues. This stuff adds new internal filesystem IO
> ordering constraints, so we need tests that exercise it and ensure
> we don't accidentally break it in future.
> 

Hmm, but if the user issues fsync from the throttled cgroup then won't
that throttling occur today, regardless of cgroup aware writeback? My
understanding is that cgawb just accurately accounts writeback I/Os to
the owner of the cached pages. IOW, if the buffered writer and fsync
call are in the same throttled cgroup, then the throttling works just as
it would with cgawb and the writer being in a throttled cgroup.

So ISTM that this is an independent problem. What am I missing?

Shaohua,

Do you have a reference to the older metadata related patch mentioned in
the commit log that presumably addressed this?

Brian

> 
> > Not familiar with XFS, but seems log bios are partially stuck in
> > throttled cgroups, leaving other innocent groups waiting for
> > completion. To cope with this we bypassed REQ_META log bios in
> > blk_throtl_bio().
> 
> Yup, the log is global and should not be throttled. Metadata is less
> obvious as to what the correct thing to do is, because writes are
> always done from internal kernel threads but reads are done from a
> mix of kernel threads, user cgroup contexts and user data IO
> completions. Hence there are situations where metadata reads may
> need to be throttled because they are cgroup context only (e.g.
> filesystem directory traversal) but others where reads should not
> be throttled because they have global context (e.g. inside a
> transaction when other buffers are already locked).
> 
> Getting this right and keeping it working requires regression tests
> that get run on every release, whether it be upstream or distro
> kernels, and that means we need tests in fstests to cover cgroup IO
> control....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH V2] xfs: implement cgroup writeback support
  2018-03-26 16:28     ` Brian Foster
@ 2018-03-27  0:55       ` Shaohua Li
  2018-03-27 11:36         ` Brian Foster
       [not found]       ` <CAJDdQW3gOa8ry_XVkcCMf2QT7wC7MvU4b94hMhwJsg9MjYoKgQ@mail.gmail.com>
  1 sibling, 1 reply; 21+ messages in thread
From: Shaohua Li @ 2018-03-27  0:55 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, 张本龙, linux-xfs

On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote:
> On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> > > Hi Shaohua and XFS,
> > > 
> > > May I ask how are we gonna handle REQ_META issued from XFS? As you
> > > mentioned about charging to root cgroup (also in an earlier email
> > > discussion), and seems the 4.16.0-rc6 code is not handling it
> > > separately.
> > > 
> > > In our case to support XFS cgroup writeback control, which was ported
> > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> > > trouble. Threads from throttled docker might submit_bio in following
> > > path by its own identity, this docker blkcg accumulated large amounts
> > > of data (e.g., 20GB), thus such log gets blocked.
> > 
> > And thus displaying the reason why I originally refused to merge
> > this code until regression tests were added to fstests to exercise
> > these sorts of issues. This stuff adds new internal filesystem IO
> > ordering constraints, so we need tests that exercise it and ensure
> > we don't accidentally break it in future.
> > 
> 
> Hmm, but if the user issues fsync from the throttled cgroup then won't
> that throttling occur today, regardless of cgroup aware writeback? My
> understanding is that cgawb just accurately accounts writeback I/Os to
> the owner of the cached pages. IOW, if the buffered writer and fsync
> call are in the same throttled cgroup, then the throttling works just as
> it would with cgawb and the writer being in a throttled cgroup.
> 
> So ISTM that this is an independent problem. What am I missing?
> 
> Shaohua,
> 
> Do you have a reference to the older metadata related patch mentioned in
> the commit log that presumably addressed this?

The problem is priority inversion. Say you do an fsync in a low-prio
cgroup: the IO will be submitted with low prio. Now you do an fsync in a
high-prio cgroup: that cgroup will wait for the fsync IO to finish, which
was already submitted by the low-prio cgroup and runs at low prio. This
makes the high-prio cgroup run slow. The proposed patch forces all
metadata writes to be submitted from the root cgroup regardless of which
task submitted them, which fixes this issue.
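
As a rough sketch of that approach (assuming the 4.16-era
bio_associate_blkcg()/blkcg_root_css API; this is not the actual
delayed patch, just an illustration):

/* hypothetical helper, called on metadata/log bios before submit_bio() */
static void xfs_bio_force_root_blkcg(struct bio *bio)
{
        /*
         * Associate the bio with the root blkcg up front so bio_blkcg()
         * never falls back to the (possibly throttled) submitting
         * task's cgroup.
         */
        bio_associate_blkcg(bio, blkcg_root_css);
}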

Thanks,
Shaohua


* Re: [PATCH V2] xfs: implement cgroup writeback support
  2018-03-27  0:55       ` Shaohua Li
@ 2018-03-27 11:36         ` Brian Foster
  2018-03-27 21:56           ` Dave Chinner
  2018-03-28  4:37           ` 张本龙
  0 siblings, 2 replies; 21+ messages in thread
From: Brian Foster @ 2018-03-27 11:36 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Dave Chinner, 张本龙, linux-xfs

On Mon, Mar 26, 2018 at 05:55:26PM -0700, Shaohua Li wrote:
> On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote:
> > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> > > > Hi Shaohua and XFS,
> > > > 
> > > > May I ask how are we gonna handle REQ_META issued from XFS? As you
> > > > mentioned about charging to root cgroup (also in an earlier email
> > > > discussion), and seems the 4.16.0-rc6 code is not handling it
> > > > separately.
> > > > 
> > > > In our case to support XFS cgroup writeback control, which was ported
> > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> > > > trouble. Threads from throttled docker might submit_bio in following
> > > > path by its own identity, this docker blkcg accumulated large amounts
> > > > of data (e.g., 20GB), thus such log gets blocked.
> > > 
> > > And thus displaying the reason why I originally refused to merge
> > > this code until regression tests were added to fstests to exercise
> > > these sorts of issues. This stuff adds new internal filesystem IO
> > > ordering constraints, so we need tests that exercise it and ensure
> > > we don't accidentally break it in future.
> > > 
> > 
> > Hmm, but if the user issues fsync from the throttled cgroup then won't
> > that throttling occur today, regardless of cgroup aware writeback? My
> > understanding is that cgawb just accurately accounts writeback I/Os to
> > the owner of the cached pages. IOW, if the buffered writer and fsync
> > call are in the same throttled cgroup, then the throttling works just as
> > it would with cgawb and the writer being in a throttled cgroup.
> > 
> > So ISTM that this is an independent problem. What am I missing?
> > 
> > Shaohua,
> > 
> > Do you have a reference to the older metadata related patch mentioned in
> > the commit log that presumably addressed this?
> 
> The problem is about priority reversion. Say you do a fsync in a low prio
> cgroup, the IO will be submitted with low prio. Now you do a fsync in a high
> prio cgroup, the cgroup will wait for fsync IO finished, which is already
> submitted by the low prio cgroup and run in low prio. This makes the high prio
> cgroup run slow. The proposed patch is to force all metadata write submitted
> from root cgroup regardless which task submitted, which can fix this issue.
> 

Right, but it seems to me that this can happen with or without cgroup
aware writeback. This patch just introduces the final bits required to
carry the page owner from however it is tracked in the writeback machinery
to the bio submitted by the fs. It doesn't introduce/enable/implement
I/O throttling itself, which is already in place and usable (outside of
the buffered write page owner problem fixed by this patch), right?

So without this patch, if a task in throttled cgroup A does a bunch of
buffered writes and calls fsync, then another task in unthrottled cgroup
B calls fsync, aren't we (XFS) susceptible to priority inversion via
these same log I/O serialization points? If not, then what am I missing?

I'm not saying this isn't a problem that needs fixing, I just want to
make sure I understand the fundamental problem(s), what this cgawb patch
actually does and doesn't do and whether there is a logical dependency
between this patch and the proposed metadata filtering patch.

Brian

> Thanks,
> Shaohua


* Re: [PATCH V2] xfs: implement cgroup writeback support
       [not found]       ` <CAJDdQW3gOa8ry_XVkcCMf2QT7wC7MvU4b94hMhwJsg9MjYoKgQ@mail.gmail.com>
@ 2018-03-27 11:50         ` Brian Foster
  2018-03-28  9:55           ` 张本龙
  0 siblings, 1 reply; 21+ messages in thread
From: Brian Foster @ 2018-03-27 11:50 UTC (permalink / raw)
  To: 张本龙; +Cc: Dave Chinner, Shaohua Li, linux-xfs

On Tue, Mar 27, 2018 at 06:54:27PM +0800, 张本龙 wrote:
> 2018-03-27 0:28 GMT+08:00 Brian Foster <bfoster@redhat.com>:
> 
> > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> > > > Hi Shaohua and XFS,
> > > >
> > > > May I ask how are we gonna handle REQ_META issued from XFS? As you
> > > > mentioned about charging to root cgroup (also in an earlier email
> > > > discussion), and seems the 4.16.0-rc6 code is not handling it
> > > > separately.
> > > >
> > > > In our case to support XFS cgroup writeback control, which was ported
> > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> > > > trouble. Threads from throttled docker might submit_bio in following
> > > > path by its own identity, this docker blkcg accumulated large amounts
> > > > of data (e.g., 20GB), thus such log gets blocked.
> > >
> > > And thus displaying the reason why I originally refused to merge
> > > this code until regression tests were added to fstests to exercise
> > > these sorts of issues. This stuff adds new internal filesystem IO
> > > ordering constraints, so we need tests that exercise it and ensure
> > > we don't accidentally break it in future.
> > >
> >
> > Hmm, but if the user issues fsync from the throttled cgroup then won't
> > that throttling occur today, regardless of cgroup aware writeback? My
> > understanding is that cgawb just accurately accounts writeback I/Os to
> > the owner of the cached pages. IOW, if the buffered writer and fsync
> > call are in the same throttled cgroup, then the throttling works just as
> > it would with cgawb and the writer being in a throttled cgroup.
> >
> > So ISTM that this is an independent problem. What am I missing?
> 
> 
> Yeah throttling would work for fsync traffic even without cgwb, but that's
> because the way bio_blkcg() is implemented: charging to bio->bi_css when
> set but submitter blkcg when not. So without cgwb normal traffic from
> writeback kworkers are attributed to blkcg_root, which is normally
> unlimited.
> 

Ok, that's what I thought. Your report shows everything backed up on an
fsync call, however, so I didn't really see how it related to this
patch. So unless Shaohua explains otherwise, my understanding is that
this is not necessarily caused by cgawb, but rather this is a separate
but similar gap in the broader block I/O throttling mechanism that also
needs to be addressed before throttling is usable (in practice) by
filesystems. You presumably deal with this by filtering out metadata
bios, while the patch Shaohua mentions deals with it by promoting
metadata bios to the root cgroup.

Brian

> And fsync might yield kernel INFO messages when bps * 120 < dirty data,
> giving something like below. That's Ok though, since throttling and all
> types
> of sync calls are contradictory in nature.
> ------------------------------------------------------------
> ---------------------------------
> [Mar 2 14:27] INFO: task dd:40794 blocked for more than 120 seconds.
> [  +0.000747]       Not tainted 4.15.7 #1
> [  +0.000610] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [  +0.000614] dd              D    0 40794  40793 0x10000000
> [  +0.000610] Call Trace:
> [  +0.000619]  ? __schedule+0x28d/0x870
> [  +0.000611]  schedule+0x32/0x80
> [  +0.000615]  io_schedule+0x12/0x40
> [  +0.000612]  wait_on_page_bit_common+0xff/0x1a0
> [  +0.000609]  ? page_cache_tree_insert+0xd0/0xd0
> [  +0.000614]  __filemap_fdatawait_range+0xe2/0x150
> [  +0.000614]  file_write_and_wait_range+0x62/0xa0
> [  +0.000664]  xfs_file_fsync+0x65/0x1d0 [xfs]
> [  +0.000621]  do_fsync+0x38/0x60
> [  +0.000610]  SyS_fsync+0xc/0x10
> [  +0.000616]  do_syscall_64+0x6e/0x1a0
> [  +0.000612]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [  +0.000610] RIP: 0033:0x7f8131231020
> [  +0.000604] RSP: 002b:00007fff81c04f48 EFLAGS: 00000246 ORIG_RAX:
> 000000000000004a
> [  +0.000610] RAX: ffffffffffffffda RBX: 00007f8131af9690 RCX:
> 00007f8131231020
> [  +0.000619] RDX: 0000000004000000 RSI: 0000000000000000 RDI:
> 0000000000000001
> [  +0.000626] RBP: 0000000000000020 R08: 00000000ffffffff R09:
> 0000000000000000
> [  +0.000618] R10: 00007fff81c04d10 R11: 0000000000000246 R12:
> 0000000000000020
> [  +0.000620] R13: 0000000000000000 R14: 00007fff81c054cf R15:
> 00007fff81c05198
> 
> 
> > Shaohua,
> >
> > Do you have a reference to the older metadata related patch mentioned in
> > the commit log that presumably addressed this?
> >
> > Brian
> >
> > >
> > > > Not familiar with XFS, but seems log bios are partially stuck in
> > > > throttled cgroups, leaving other innocent groups waiting for
> > > > completion. To cope with this we bypassed REQ_META log bios in
> > > > blk_throtl_bio().
> > >
> > > Yup, the log is global and should not be throttled. Metadata is less
> > > obvious as to what the correct thing to do is, because writes are
> > > always done from internal kernel threads but reads are done from a
> > > mix of kernel threads, user cgroup contexts and user data IO
> > > completions. Hence there are situations where metadata reads may
> > > need to be throttled because they are cgroup context only (e.g.
> > > filesystem directory traversal) but others where reads should not
> > > be throttled because they have global context (e.g. inside a
> > > transaction when other buffers are already locked).
> > >
> 
> 
> Thanks for the explanation Dave. So there are both read and write xfs_buf,
> and reads really seem to be complex then.
> 
> 
> > > Getting this right and keeping it working requires regression tests
> > > that get run on every release, whether it be upstream or distro
> > > kernels, and that means we need tests in fstests to cover cgroup IO
> > > control....
> > >
> > > Cheers,
> > >
> > > Dave.
> > > --
> > > Dave Chinner
> > > david@fromorbit.com
> >


* Re: [PATCH V2] xfs: implement cgroup writeback support
  2018-03-27 11:36         ` Brian Foster
@ 2018-03-27 21:56           ` Dave Chinner
  2018-03-28 11:32             ` Brian Foster
  2018-03-28  4:37           ` 张本龙
  1 sibling, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2018-03-27 21:56 UTC (permalink / raw)
  To: Brian Foster; +Cc: Shaohua Li, 张本龙, linux-xfs

On Tue, Mar 27, 2018 at 07:36:48AM -0400, Brian Foster wrote:
> On Mon, Mar 26, 2018 at 05:55:26PM -0700, Shaohua Li wrote:
> > On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote:
> > > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> > > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> > > > > Hi Shaohua and XFS,
> > > > > 
> > > > > May I ask how are we gonna handle REQ_META issued from XFS? As you
> > > > > mentioned about charging to root cgroup (also in an earlier email
> > > > > discussion), and seems the 4.16.0-rc6 code is not handling it
> > > > > separately.
> > > > > 
> > > > > In our case to support XFS cgroup writeback control, which was ported
> > > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> > > > > trouble. Threads from throttled docker might submit_bio in following
> > > > > path by its own identity, this docker blkcg accumulated large amounts
> > > > > of data (e.g., 20GB), thus such log gets blocked.
> > > > 
> > > > And thus displaying the reason why I originally refused to merge
> > > > this code until regression tests were added to fstests to exercise
> > > > these sorts of issues. This stuff adds new internal filesystem IO
> > > > ordering constraints, so we need tests that exercise it and ensure
> > > > we don't accidentally break it in future.
> > > > 
> > > 
> > > Hmm, but if the user issues fsync from the throttled cgroup then won't
> > > that throttling occur today, regardless of cgroup aware writeback? My
> > > understanding is that cgawb just accurately accounts writeback I/Os to
> > > the owner of the cached pages. IOW, if the buffered writer and fsync
> > > call are in the same throttled cgroup, then the throttling works just as
> > > it would with cgawb and the writer being in a throttled cgroup.
> > > 
> > > So ISTM that this is an independent problem. What am I missing?
> > > 
> > > Shaohua,
> > > 
> > > Do you have a reference to the older metadata related patch mentioned in
> > > the commit log that presumably addressed this?
> > 
> > The problem is about priority reversion. Say you do a fsync in a low prio
> > cgroup, the IO will be submitted with low prio. Now you do a fsync in a high
> > prio cgroup, the cgroup will wait for fsync IO finished, which is already
> > submitted by the low prio cgroup and run in low prio. This makes the high prio
> > cgroup run slow. The proposed patch is to force all metadata write submitted
> > from root cgroup regardless which task submitted, which can fix this issue.
> > 
> 
> Right, but it seems to me that this can happen with or without cgroup
> aware writeback. This patch just introduces the final bits required to
> carry the page owner from however it is tracked in the writeback machine
> to the bio submitted by the fs. It doesn't introduce/enable/implement
> I/O throttling itself, which is already in place and usable (outside of
> the buffered write page owner problem fixed by this patch), right?
> 
> So without this patch, if a task in throttled cgroup A does a bunch of
> buffered writes and calls fsync, then another task in unthrottled cgroup
> B calls fsync, aren't we (XFS) susceptible to priority inversion via
> these same log I/O serialization points? If not, then what am I missing?

Well, I was originally told that bios sent by a filesystem without
cgroup info would be accounted to the root cgroup and hence not
throttled at all, i.e. metadata and any untagged data IO. In that
case, there should be no problems whatsoever, as all XFS IO should
be issued at the same priority.

However, given the way people are talking about needing to bypass
block cgroup throttling via REQ_META hacks, I'm guessing that it
doesn't actually work that way. i.e. somewhere in the block layer it
attaches the current task cgroup to bios without existing cgroup
information, and so we end up with priority inversion problems
whenever cgroups and throttling are in use regardless of whether the
filesystem supports it or not....

Ah, yeah, the task cgroup gets added through this path:

submit_bio
  generic_make_request
    generic_make_request_checks
      blkcg_bio_issue_check
        bio_blkcg
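
Roughly, the fallback it implements looks like this (paraphrased from
the 4.16-era blk-cgroup code, not a verbatim copy):

/* sketch of the bio_blkcg() logic */
static struct blkcg *resolve_bio_blkcg(struct bio *bio)
{
        /* use the css the filesystem attached via wbc_init_bio(), if any */
        if (bio->bi_css)
                return css_to_blkcg(bio->bi_css);
        /* otherwise charge the bio to whoever called submit_bio() */
        return css_to_blkcg(task_css(current, io_cgrp_id));
}

So any bio submitted without an explicit association inherits the
submitting task's cgroup, which is how untagged log and metadata bios
can end up queued in a throttled group.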

> I'm not saying this isn't a problem that needs fixing, I just want to
> make sure I understand the fundamental problem(s), what this cgawb patch
> actually does and doesn't do and whether there is a logical dependency
> between this patch and the proposed metadata filtering patch.

Seems like there's a lot more work to be done to make this stuff
work reliably than anyone has been telling us. Without regression
tests, we're never going to know if it actually works or not....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH V2] xfs: implement cgroup writeback support
  2018-03-27 11:36         ` Brian Foster
  2018-03-27 21:56           ` Dave Chinner
@ 2018-03-28  4:37           ` 张本龙
  2018-03-28 11:24             ` Brian Foster
  1 sibling, 1 reply; 21+ messages in thread
From: 张本龙 @ 2018-03-28  4:37 UTC (permalink / raw)
  To: Brian Foster; +Cc: Shaohua Li, Dave Chinner, linux-xfs

2018-03-27 19:36 GMT+08:00 Brian Foster <bfoster@redhat.com>:
> On Mon, Mar 26, 2018 at 05:55:26PM -0700, Shaohua Li wrote:
>> On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote:
>> > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
>> > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
>> > > > Hi Shaohua and XFS,
>> > > >
>> > > > May I ask how are we gonna handle REQ_META issued from XFS? As you
>> > > > mentioned about charging to root cgroup (also in an earlier email
>> > > > discussion), and seems the 4.16.0-rc6 code is not handling it
>> > > > separately.
>> > > >
>> > > > In our case to support XFS cgroup writeback control, which was ported
>> > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
>> > > > trouble. Threads from throttled docker might submit_bio in following
>> > > > path by its own identity, this docker blkcg accumulated large amounts
>> > > > of data (e.g., 20GB), thus such log gets blocked.
>> > >
>> > > And thus displaying the reason why I originally refused to merge
>> > > this code until regression tests were added to fstests to exercise
>> > > these sorts of issues. This stuff adds new internal filesystem IO
>> > > ordering constraints, so we need tests that exercise it and ensure
>> > > we don't accidentally break it in future.
>> > >
>> >
>> > Hmm, but if the user issues fsync from the throttled cgroup then won't
>> > that throttling occur today, regardless of cgroup aware writeback? My
>> > understanding is that cgawb just accurately accounts writeback I/Os to
>> > the owner of the cached pages. IOW, if the buffered writer and fsync
>> > call are in the same throttled cgroup, then the throttling works just as
>> > it would with cgawb and the writer being in a throttled cgroup.
>> >
>> > So ISTM that this is an independent problem. What am I missing?
>> >
>> > Shaohua,
>> >
>> > Do you have a reference to the older metadata related patch mentioned in
>> > the commit log that presumably addressed this?
>>
>> The problem is about priority reversion. Say you do a fsync in a low prio
>> cgroup, the IO will be submitted with low prio. Now you do a fsync in a high
>> prio cgroup, the cgroup will wait for fsync IO finished, which is already
>> submitted by the low prio cgroup and run in low prio. This makes the high prio
>> cgroup run slow. The proposed patch is to force all metadata write submitted
>> from root cgroup regardless which task submitted, which can fix this issue.
>>
>
> Right, but it seems to me that this can happen with or without cgroup
> aware writeback. This patch just introduces the final bits required to
> carry the page owner from however it is tracked in the writeback machine
> to the bio submitted by the fs. It doesn't introduce/enable/implement
> I/O throttling itself, which is already in place and usable (outside of
> the buffered write page owner problem fixed by this patch), right?
>
> So without this patch, if a task in throttled cgroup A does a bunch of
> buffered writes and calls fsync, then another task in unthrottled cgroup
> B calls fsync, aren't we (XFS) susceptible to priority inversion via
> these same log I/O serialization points? If not, then what am I missing?
>

Ok, first let's agree we're not talking about two groups fsyncing the
same file. What's demonstrated in the original post is that group-A
MIGHT submit the xlog in the fsync path via xlog_sync() ->
xfs_buf_submit(). Threads from group-B are then waiting for the log via
flush_work(), and so are kworkers running xlog_cil_push_work().

The problem is not that fsync gets stuck on the target file. If it
were, group-B would be waiting in filemap_write_and_wait_range(), since
xfs_file_fsync() flushes the actual data before _xfs_log_force_lsn().

Here is the setup: group-A has 10 fio jobs, each running on a 20GB
file, plus some agents doing little IO; group-B just has the agents.
With this patch we set bps=20Mb/s on A, so the large amount of fio
traffic is what blocks the group (note the fio jobs are not doing any
fsync). At this point one agent does an fsync on a non-dirty file,
bypassing filemap_write_and_wait_range() and going straight to
xlog_sync(). Here we go: the log bios are stuck in group-A. Then
group-B and the kworkers wait indefinitely for log completion from
various paths. Without this patch fio writes out at full disk speed,
say 200MB/s, so the backlog never accumulates in group-A.

> I'm not saying this isn't a problem that needs fixing, I just want to
> make sure I understand the fundamental problem(s), what this cgawb patch
> actually does and doesn't do and whether there is a logical dependency
> between this patch and the proposed metadata filtering patch.
>
> Brian
>
>> Thanks,
>> Shaohua


* Re: [PATCH V2] xfs: implement cgroup writeback support
  2018-03-27 11:50         ` Brian Foster
@ 2018-03-28  9:55           ` 张本龙
  0 siblings, 0 replies; 21+ messages in thread
From: 张本龙 @ 2018-03-28  9:55 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, Shaohua Li, linux-xfs

2018-03-27 19:50 GMT+08:00 Brian Foster <bfoster@redhat.com>:
> On Tue, Mar 27, 2018 at 06:54:27PM +0800, 张本龙 wrote:
>> 2018-03-27 0:28 GMT+08:00 Brian Foster <bfoster@redhat.com>:
>>
>> > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
>> > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
>> > > > Hi Shaohua and XFS,
>> > > >
>> > > > May I ask how are we gonna handle REQ_META issued from XFS? As you
>> > > > mentioned about charging to root cgroup (also in an earlier email
>> > > > discussion), and seems the 4.16.0-rc6 code is not handling it
>> > > > separately.
>> > > >
>> > > > In our case to support XFS cgroup writeback control, which was ported
>> > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
>> > > > trouble. Threads from throttled docker might submit_bio in following
>> > > > path by its own identity, this docker blkcg accumulated large amounts
>> > > > of data (e.g., 20GB), thus such log gets blocked.
>> > >
>> > > And thus displaying the reason why I originally refused to merge
>> > > this code until regression tests were added to fstests to exercise
>> > > these sorts of issues. This stuff adds new internal filesystem IO
>> > > ordering constraints, so we need tests that exercise it and ensure
>> > > we don't accidentally break it in future.
>> > >
>> >
>> > Hmm, but if the user issues fsync from the throttled cgroup then won't
>> > that throttling occur today, regardless of cgroup aware writeback? My
>> > understanding is that cgawb just accurately accounts writeback I/Os to
>> > the owner of the cached pages. IOW, if the buffered writer and fsync
>> > call are in the same throttled cgroup, then the throttling works just as
>> > it would with cgawb and the writer being in a throttled cgroup.
>> >
>> > So ISTM that this is an independent problem. What am I missing?
>>
>>
>> Yeah throttling would work for fsync traffic even without cgwb, but that's
>> because the way bio_blkcg() is implemented: charging to bio->bi_css when
>> set but submitter blkcg when not. So without cgwb normal traffic from
>> writeback kworkers are attributed to blkcg_root, which is normally
>> unlimited.
>>
>
> Ok, that's what I thought. Your report shows everything backed up on an
> fsync call, however, so I didn't really see how it related to this

Sorry for the confusion, but the original report is not showing things
backed up on fsync. It shows that fsync MIGHT submit the log.
--------------------------------------------------------------------------
xxx-agent 22825 [001] 78772.391023: probe:submit_bio:
                  4f70c1 submit_bio
                   4e440 xfs_buf_submit ([xfs])
                   6e59b xlog_bdstrat ([xfs])
                   704c5 xlog_sync ([xfs])
                   70683 xlog_state_release_iclog ([xfs])
                   71917 _xfs_log_force_lsn ([xfs])
                   5249e xfs_file_fsync ([xfs])
                  4374d5 do_fsync
                  4377a0 sys_fsync
                  8a0ac9 system_call
                  1df9c4 [unknown]
Please note this is NOT a kernel INFO message; it's a dynamic perf probe
trace showing who is submitting the log. As the agent's docker container
is now blocked by the other fio threads, the log gets stuck via this path.

The following trace is a kernel INFO message; it shows another docker
container waiting for log integrity and timing out, and the third
trace, via flush_work(), is similar. Again I'm not completely sure
about XFS, but the traces and code diving seem to suggest this.
-------------------------------------------------------------------------------------------------
[79183.692355] INFO: task xxx:33434 blocked for more than 120 seconds.
[79183.730997] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
......
[79183.954738] Call Trace:
[79183.969516]  [<ffffffff81695ac9>] schedule+0x29/0x70
                                          /*    inline xlog_wait here */
[79183.999357]  [<ffffffffa073497a>] _xfs_log_force_lsn+0x2fa/0x350 [xfs]
[79184.038497]  [<ffffffff810c84c0>] ? wake_up_state+0x20/0x20
[79184.071941]  [<ffffffffa071542e>] xfs_file_fsync+0x10e/0x1e0 [xfs]
[79184.109010]  [<ffffffff812374d5>] do_fsync+0x65/0xa0
[79184.138826]  [<ffffffff8120368f>] ? SyS_write+0x9f/0xe0
[79184.170182]  [<ffffffff812377a0>] SyS_fsync+0x10/0x20
[79184.200514]  [<ffffffff816a0ac9>] system_call_fastpath+0x16/0x1b

> patch. So unless Shaohua explains otherwise, my understanding is that
> this is not necessarily caused by cgawb, but rather this is a separate
> but similar gap in the broader block I/O throttling mechanism that also
> needs to be addressed before throttling is usable (in practice) by
> filesystems. You presumably deal with this by filtering out metadata
> bios, while the patch Shaohua mentions deals with it by promoting
> metadata bios to the root cgroup.

Either way would work, agreed. Though I wouldn't call it 'priority
inversion'; as Dave explained, it's more like 'the global log
mistakenly throttled'. The report is meant to provide some evidence
that this single patch is not enough for xfs writeback, and that we
should wait for the other patch from Shaohua.

>
> Brian
>
>> And fsync might yield kernel INFO messages when bps * 120 < dirty data,
>> giving something like below. That's Ok though, since throttling and all
>> types
>> of sync calls are contradictory in nature.
>> ------------------------------------------------------------
>> ---------------------------------
>> [Mar 2 14:27] INFO: task dd:40794 blocked for more than 120 seconds.
>> [  +0.000747]       Not tainted 4.15.7 #1
>> [  +0.000610] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [  +0.000614] dd              D    0 40794  40793 0x10000000
>> [  +0.000610] Call Trace:
>> [  +0.000619]  ? __schedule+0x28d/0x870
>> [  +0.000611]  schedule+0x32/0x80
>> [  +0.000615]  io_schedule+0x12/0x40
>> [  +0.000612]  wait_on_page_bit_common+0xff/0x1a0
>> [  +0.000609]  ? page_cache_tree_insert+0xd0/0xd0
>> [  +0.000614]  __filemap_fdatawait_range+0xe2/0x150
>> [  +0.000614]  file_write_and_wait_range+0x62/0xa0
>> [  +0.000664]  xfs_file_fsync+0x65/0x1d0 [xfs]
>> [  +0.000621]  do_fsync+0x38/0x60
>> [  +0.000610]  SyS_fsync+0xc/0x10
>> [  +0.000616]  do_syscall_64+0x6e/0x1a0
>> [  +0.000612]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>> [  +0.000610] RIP: 0033:0x7f8131231020
>> [  +0.000604] RSP: 002b:00007fff81c04f48 EFLAGS: 00000246 ORIG_RAX:
>> 000000000000004a
>> [  +0.000610] RAX: ffffffffffffffda RBX: 00007f8131af9690 RCX:
>> 00007f8131231020
>> [  +0.000619] RDX: 0000000004000000 RSI: 0000000000000000 RDI:
>> 0000000000000001
>> [  +0.000626] RBP: 0000000000000020 R08: 00000000ffffffff R09:
>> 0000000000000000
>> [  +0.000618] R10: 00007fff81c04d10 R11: 0000000000000246 R12:
>> 0000000000000020
>> [  +0.000620] R13: 0000000000000000 R14: 00007fff81c054cf R15:
>> 00007fff81c05198
>>
>>
>> > Shaohua,
>> >
>> > Do you have a reference to the older metadata related patch mentioned in
>> > the commit log that presumably addressed this?
>> >
>> > Brian
>> >
>> > >
>> > > > Not familiar with XFS, but seems log bios are partially stuck in
>> > > > throttled cgroups, leaving other innocent groups waiting for
>> > > > completion. To cope with this we bypassed REQ_META log bios in
>> > > > blk_throtl_bio().
>> > >
>> > > Yup, the log is global and should not be throttled. Metadata is less
>> > > obvious as to what the correct thing to do is, because writes are
>> > > always done from internal kernel threads but reads are done from a
>> > > mix of kernel threads, user cgroup contexts and user data IO
>> > > completions. Hence there are situations where metadata reads may
>> > > need to be throttled because they are cgroup context only (e.g.
>> > > filesystem directory traversal) but others where reads should not
>> > > be throttled because they have global context (e.g. inside a
>> > > transaction when other buffers are already locked).
>> > >
>>
>>
>> Thanks for the explanation, Dave. So xfs_buf I/O covers both reads and
>> writes, and the read side really does seem to be the complex part.
>>
>>
>> > > Getting this right and keeping it working requires regression tests
>> > > that get run on every release, whether it be upstream or distro
>> > > kernels, and that means we need tests in fstests to cover cgroup IO
>> > > control....
>> > >
>> > > Cheers,
>> > >
>> > > Dave.
>> > > --
>> > > Dave Chinner
>> > > david@fromorbit.com
>> >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xfs: implement cgroup writeback support
  2018-03-28  4:37           ` 张本龙
@ 2018-03-28 11:24             ` Brian Foster
  0 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2018-03-28 11:24 UTC (permalink / raw)
  To: 张本龙; +Cc: Shaohua Li, Dave Chinner, linux-xfs

On Wed, Mar 28, 2018 at 12:37:32PM +0800, 张本龙 wrote:
> 2018-03-27 19:36 GMT+08:00 Brian Foster <bfoster@redhat.com>:
> > On Mon, Mar 26, 2018 at 05:55:26PM -0700, Shaohua Li wrote:
> >> On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote:
> >> > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> >> > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> >> > > > Hi Shaohua and XFS,
> >> > > >
> >> > > > May I ask how are we gonna handle REQ_META issued from XFS? As you
> >> > > > mentioned about charging to root cgroup (also in an earlier email
> >> > > > discussion), and seems the 4.16.0-rc6 code is not handling it
> >> > > > separately.
> >> > > >
> >> > > > In our case to support XFS cgroup writeback control, which was ported
> >> > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> >> > > > trouble. Threads from throttled docker might submit_bio in following
> >> > > > path by its own identity, this docker blkcg accumulated large amounts
> >> > > > of data (e.g., 20GB), thus such log gets blocked.
> >> > >
> >> > > And thus displaying the reason why I originally refused to merge
> >> > > this code until regression tests were added to fstests to exercise
> >> > > these sorts of issues. This stuff adds new internal filesystem IO
> >> > > ordering constraints, so we need tests that exercise it and ensure
> >> > > we don't accidentally break it in future.
> >> > >
> >> >
> >> > Hmm, but if the user issues fsync from the throttled cgroup then won't
> >> > that throttling occur today, regardless of cgroup aware writeback? My
> >> > understanding is that cgawb just accurately accounts writeback I/Os to
> >> > the owner of the cached pages. IOW, if the buffered writer and fsync
> >> > call are in the same throttled cgroup, then the throttling works just as
> >> > it would with cgawb and the writer being in a throttled cgroup.
> >> >
> >> > So ISTM that this is an independent problem. What am I missing?
> >> >
> >> > Shaohua,
> >> >
> >> > Do you have a reference to the older metadata related patch mentioned in
> >> > the commit log that presumably addressed this?
> >>
> >> The problem is about priority reversion. Say you do a fsync in a low prio
> >> cgroup, the IO will be submitted with low prio. Now you do a fsync in a high
> >> prio cgroup, the cgroup will wait for fsync IO finished, which is already
> >> submitted by the low prio cgroup and run in low prio. This makes the high prio
> >> cgroup run slow. The proposed patch is to force all metadata write submitted
> >> from root cgroup regardless which task submitted, which can fix this issue.
> >>
> >
> > Right, but it seems to me that this can happen with or without cgroup
> > aware writeback. This patch just introduces the final bits required to
> > carry the page owner from however it is tracked in the writeback machine
> > to the bio submitted by the fs. It doesn't introduce/enable/implement
> > I/O throttling itself, which is already in place and usable (outside of
> > the buffered write page owner problem fixed by this patch), right?
> >
> > So without this patch, if a task in throttled cgroup A does a bunch of
> > buffered writes and calls fsync, then another task in unthrottled cgroup
> > B calls fsync, aren't we (XFS) susceptible to priority inversion via
> > these same log I/O serialization points? If not, then what am I missing?
> >
> 
> Ok, first let's agree we're not talking about two groups fsyncing the same
> file. What's demonstrated in the original post is that group-A MIGHT
> submit the xlog in the fsync path via xlog_sync() -> xfs_buf_submit().
> Threads from group-B are waiting for the log from flush_work(), as are
> kworkers from xlog_cil_push_work().
> 
> The problem is not about fsync getting stuck on the target file. If it
> were, group-B would be waiting on filemap_write_and_wait_range(), since
> xfs_file_fsync() flushes the real data before _xfs_log_force_lsn.
> 

Yes, I don't mean to suggest we're contending on the same file across
different cgroups. Rather, we're getting stuck on a dependency of a
global resource (the log) between groups with very different priorities.

> Here is the setup: group-A has 10 fio jobs, each running on a 20GB file,
> and also some agents with not much IO; group-B just has the agents.
> With this patch we set a bps limit of 20Mb/s on A, so the large amount of
> fio traffic blocks the group (note the fio jobs are not doing any
> fsync though). At this point, one agent does an fsync on a non-dirty
> file, bypassing filemap_write_and_wait_range() and going straight to
> xlog_sync(). Here we go: log bios are stuck in group-A. Then group-B and
> the kworkers wait indefinitely on log integrity from various paths.
> Without this patch fio would write out at full disk speed, say 200MB/s,
> so the backlog never accumulates in group-A.
> 

I see. So in your particular setup, cgawb is required to reproduce
because you aren't actually pushing data through the cgroup via fsync.
Therefore, cgawb is what actually enables the throttling for your use
case. The throttling slows down further I/O from group A and thus
creates the problematic conditions for contention on the log.

However, it looks like the only requirement to cause this particular
condition is that an fsync submits a log bio in an active and throttled
cgroup that ends up blocked for a period of time. Other log serializing
tasks in other cgroups are now susceptible to the first cgroup's
throttle if they have to wait on log I/O completion. So replace your
cgawb use case with one that creates similar conditions by somebody else
waiting on fsync -> filemap_write_and_wait_range() on a separate file in
the same (throttled) group and afaict the same thing can occur without
cgawb in the picture at all.

But between this and your other email I think I see how this isn't what
you are actually reproducing. Thanks for the explanation.

Brian

> > I'm not saying this isn't a problem that needs fixing, I just want to
> > make sure I understand the fundamental problem(s), what this cgawb patch
> > actually does and doesn't do and whether there is a logical dependency
> > between this patch and the proposed metadata filtering patch.
> >
> > Brian
> >
> >> Thanks,
> >> Shaohua
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xfs: implement cgroup writeback support
  2018-03-27 21:56           ` Dave Chinner
@ 2018-03-28 11:32             ` Brian Foster
  2018-03-28 22:35               ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Brian Foster @ 2018-03-28 11:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Shaohua Li, 张本龙, linux-xfs

On Wed, Mar 28, 2018 at 08:56:14AM +1100, Dave Chinner wrote:
> On Tue, Mar 27, 2018 at 07:36:48AM -0400, Brian Foster wrote:
> > On Mon, Mar 26, 2018 at 05:55:26PM -0700, Shaohua Li wrote:
> > > On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote:
> > > > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> > > > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> > > > > > Hi Shaohua and XFS,
> > > > > > 
> > > > > > May I ask how are we gonna handle REQ_META issued from XFS? As you
> > > > > > mentioned about charging to root cgroup (also in an earlier email
> > > > > > discussion), and seems the 4.16.0-rc6 code is not handling it
> > > > > > separately.
> > > > > > 
> > > > > > In our case to support XFS cgroup writeback control, which was ported
> > > > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> > > > > > trouble. Threads from throttled docker might submit_bio in following
> > > > > > path by its own identity, this docker blkcg accumulated large amounts
> > > > > > of data (e.g., 20GB), thus such log gets blocked.
> > > > > 
> > > > > And thus displaying the reason why I originally refused to merge
> > > > > this code until regression tests were added to fstests to exercise
> > > > > these sorts of issues. This stuff adds new internal filesystem IO
> > > > > ordering constraints, so we need tests that exercise it and ensure
> > > > > we don't accidentally break it in future.
> > > > > 
> > > > 
> > > > Hmm, but if the user issues fsync from the throttled cgroup then won't
> > > > that throttling occur today, regardless of cgroup aware writeback? My
> > > > understanding is that cgawb just accurately accounts writeback I/Os to
> > > > the owner of the cached pages. IOW, if the buffered writer and fsync
> > > > call are in the same throttled cgroup, then the throttling works just as
> > > > it would with cgawb and the writer being in a throttled cgroup.
> > > > 
> > > > So ISTM that this is an independent problem. What am I missing?
> > > > 
> > > > Shaohua,
> > > > 
> > > > Do you have a reference to the older metadata related patch mentioned in
> > > > the commit log that presumably addressed this?
> > > 
> > > The problem is about priority reversion. Say you do a fsync in a low prio
> > > cgroup, the IO will be submitted with low prio. Now you do a fsync in a high
> > > prio cgroup, the cgroup will wait for fsync IO finished, which is already
> > > submitted by the low prio cgroup and run in low prio. This makes the high prio
> > > cgroup run slow. The proposed patch is to force all metadata write submitted
> > > from root cgroup regardless which task submitted, which can fix this issue.
> > > 
> > 
> > Right, but it seems to me that this can happen with or without cgroup
> > aware writeback. This patch just introduces the final bits required to
> > carry the page owner from however it is tracked in the writeback machine
> > to the bio submitted by the fs. It doesn't introduce/enable/implement
> > I/O throttling itself, which is already in place and usable (outside of
> > the buffered write page owner problem fixed by this patch), right?
> > 
> > So without this patch, if a task in throttled cgroup A does a bunch of
> > buffered writes and calls fsync, then another task in unthrottled cgroup
> > B calls fsync, aren't we (XFS) susceptible to priority inversion via
> > these same log I/O serialization points? If not, then what am I missing?
> 
> Well, I was originally told that bios sent by a filesystem without
> cgroup info would be accounted to the root cgroup and hence not
> throttled at all, i.e. metadata, and any untagged data IO. In that
> case, there should be no problems whatsoever as all XFS IO should
> be issued at the same priority.
> 

It sounds like this was actually the plan at some point but the
associated patch (to which there is still no reference?) was never
committed. I have no idea why that is.. whether it was just lost or
nacked with outstanding issues..?

> However, given the way people are talking about needing to bypass
> block cgroup throttling via REQ_META hacks, I'm guessing that it
> doesn't actually work that way. i.e. somewhere in the block layer it
> attaches the current task cgroup to bios without existing cgroup
> information, and so we end up with priority inversion problems
> whenever cgroups and throttling are in use regardless of whether the
> filesystem supports it or not....
> 
> Ah, yeah, the task cgroup gets added through this path:
> 
> submit_bio
>   generic_make_request
>     generic_make_request_checks
>       blkcg_bio_issue_check
>         bio_blkcg
> 

Right, this is what I thought based on the code. For anything not
explicitly associated with a cgroup, it pretty much depends on the user
context of the I/O submission.
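
To illustrate the fallback being described, here is roughly what the 4.16-era
bio_blkcg() helper does (paraphrased from memory, not a verbatim copy):

/* include/linux/blk-cgroup.h, approximately */
static inline struct blkcg *bio_blkcg(struct bio *bio)
{
	/* a bio tagged via wbc_init_bio()/bio_associate_blkcg() keeps its owner */
	if (bio && bio->bi_css)
		return css_to_blkcg(bio->bi_css);
	/* otherwise fall back to the submitting task's cgroup */
	return css_to_blkcg(task_css(current, io_cgrp_id));
}

So any untagged bio, including XFS log and metadata writes issued from a task
inside a throttled group, inherits that group's limits at submission time.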

> > I'm not saying this isn't a problem that needs fixing, I just want to
> > make sure I understand the fundamental problem(s), what this cgawb patch
> > actually does and doesn't do and whether there is a logical dependency
> > between this patch and the proposed metadata filtering patch.
> 
> Seems like there's a lot more work to be done to make this stuff
> work reliably than anyone has been telling us. Without regression
> tests, we're never going to know if it actually works or not....
> 

I'm wondering if there's something simple/high-level we can do in the
filesystem to sort of flip the default block layer condition above on
its head with regard to metadata. For example, forcibly associate all
metadata bios to the root cgroup somewhere down in the xfs_buf_submit()
path. So rather than doing metadata I/O on behalf of a variety of random
cgroups based purely on userspace configuration and activity, we start
with a consistent and predictable base behavior.
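
To make that concrete, a minimal sketch follows, assuming the
bio_associate_blkcg() helper and the exported blkcg_root_css available in
kernels of this era under CONFIG_BLK_CGROUP. It is not the actual proposed
patch, and the XBF_CGROUP_OWNED flag is hypothetical, shown only to mark where
an opt-out (like the directory read case mentioned below) could hook in:

/*
 * Sketch only: called for each metadata bio built in the
 * xfs_buf_ioapply_map()/xfs_buf_submit() path, before submit_bio().
 */
static void
xfs_buf_bio_set_cgroup(
	struct xfs_buf		*bp,
	struct bio		*bio)
{
	/*
	 * Charge metadata I/O to the root cgroup by default so the log and
	 * other globally shared structures cannot be stalled behind a
	 * single throttled group.
	 */
	if (!(bp->b_flags & XBF_CGROUP_OWNED))	/* hypothetical opt-out flag */
		bio_associate_blkcg(bio, blkcg_root_css);
}

Whether the filesystem should do this itself, or whether the block layer should
treat untagged REQ_META bios this way by default, is exactly what Dave raises in
the follow-up.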

The former (current) behavior leaves us open to these kinds of
problematic blocked states. The latter approach changes those gaps into
something that instead may just not honor the throttling limits for
certain metadata-heavy workloads until finer-grained control is
implemented in the fs (which can be added as needed). That's also
advantageous because by default we won't throttle certain I/Os that
might have complex locking/transaction dependencies, and instead opt in
to such metadata throttling in the obvious contexts where it's
appropriate. For example, the directory read example you mentioned
before could set a state on the buffer that prevents preemptive
association, so the bio is actually associated with the current userspace
context in the block layer and throttled appropriately.

Of course, I haven't looked into how reasonable any of that is to
implement. It would be nice to see whatever patch was previously
proposed for reference. :/

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xfs: implement cgroup writeback support
  2018-03-28 11:32             ` Brian Foster
@ 2018-03-28 22:35               ` Dave Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2018-03-28 22:35 UTC (permalink / raw)
  To: Brian Foster; +Cc: Shaohua Li, 张本龙, linux-xfs

On Wed, Mar 28, 2018 at 07:32:32AM -0400, Brian Foster wrote:
> On Wed, Mar 28, 2018 at 08:56:14AM +1100, Dave Chinner wrote:
> > On Tue, Mar 27, 2018 at 07:36:48AM -0400, Brian Foster wrote:
> > > On Mon, Mar 26, 2018 at 05:55:26PM -0700, Shaohua Li wrote:
> > > > On Mon, Mar 26, 2018 at 12:28:31PM -0400, Brian Foster wrote:
> > > > > On Mon, Mar 26, 2018 at 08:59:04AM +1100, Dave Chinner wrote:
> > > > > > On Fri, Mar 23, 2018 at 10:24:03PM +0800, 张本龙 wrote:
> > > > > > > Hi Shaohua and XFS,
> > > > > > > 
> > > > > > > May I ask how are we gonna handle REQ_META issued from XFS? As you
> > > > > > > mentioned about charging to root cgroup (also in an earlier email
> > > > > > > discussion), and seems the 4.16.0-rc6 code is not handling it
> > > > > > > separately.
> > > > > > > 
> > > > > > > In our case to support XFS cgroup writeback control, which was ported
> > > > > > > and slightly adapted to 3.10.0, ignoring xfs log bios resulted in
> > > > > > > trouble. Threads from throttled docker might submit_bio in following
> > > > > > > path by its own identity, this docker blkcg accumulated large amounts
> > > > > > > of data (e.g., 20GB), thus such log gets blocked.
> > > > > > 
> > > > > > And thus displaying the reason why I originally refused to merge
> > > > > > this code until regression tests were added to fstests to exercise
> > > > > > these sorts of issues. This stuff adds new internal filesystem IO
> > > > > > ordering constraints, so we need tests that exercise it and ensure
> > > > > > we don't accidentally break it in future.
> > > > > > 
> > > > > 
> > > > > Hmm, but if the user issues fsync from the throttled cgroup then won't
> > > > > that throttling occur today, regardless of cgroup aware writeback? My
> > > > > understanding is that cgawb just accurately accounts writeback I/Os to
> > > > > the owner of the cached pages. IOW, if the buffered writer and fsync
> > > > > call are in the same throttled cgroup, then the throttling works just as
> > > > > it would with cgawb and the writer being in a throttled cgroup.
> > > > > 
> > > > > So ISTM that this is an independent problem. What am I missing?
> > > > > 
> > > > > Shaohua,
> > > > > 
> > > > > Do you have a reference to the older metadata related patch mentioned in
> > > > > the commit log that presumably addressed this?
> > > > 
> > > > The problem is about priority reversion. Say you do a fsync in a low prio
> > > > cgroup, the IO will be submitted with low prio. Now you do a fsync in a high
> > > > prio cgroup, the cgroup will wait for fsync IO finished, which is already
> > > > submitted by the low prio cgroup and run in low prio. This makes the high prio
> > > > cgroup run slow. The proposed patch is to force all metadata write submitted
> > > > from root cgroup regardless which task submitted, which can fix this issue.
> > > > 
> > > 
> > > Right, but it seems to me that this can happen with or without cgroup
> > > aware writeback. This patch just introduces the final bits required to
> > > carry the page owner from however it is tracked in the writeback machine
> > > to the bio submitted by the fs. It doesn't introduce/enable/implement
> > > I/O throttling itself, which is already in place and usable (outside of
> > > the buffered write page owner problem fixed by this patch), right?
> > > 
> > > So without this patch, if a task in throttled cgroup A does a bunch of
> > > buffered writes and calls fsync, then another task in unthrottled cgroup
> > > B calls fsync, aren't we (XFS) susceptible to priority inversion via
> > > these same log I/O serialization points? If not, then what am I missing?
> > 
> > Well, I was originally told that bios sent by a filesystem without
> > cgroup info would be accounted to the root cgroup and hence not
> > throttled at all, i.e. metadata, and any untagged data IO. In that
> > case, there should be no problems whatsoever as all XFS IO should
> > be issued at the same priority.
> > 
> 
> It sounds like this was actually the plan at some point but the
> associated patch (to which there is still no reference?) was never
> committed. I have no idea why that is.. whether it was just lost or
> nacked with outstanding issues..?

I've never seen the patch that is being spoken of. I'm guessing it
lives in some dark corner of FB's internal kernel tree and they have
no motivation to push stuff like this upstream....

> > However, given the way people are talking about needing to bypass
> > block cgroup throttling via REQ_META hacks, I'm guessing that it
> > doesn't actually work that way. i.e. somewhere in the block layer it
> > attaches the current task cgroup to bios without existing cgroup
> > information, and so we end up with priority inversion problems
> > whenever cgroups and throttling are in use regardless of whether the
> > filesystem supports it or not....
> > 
> > Ah, yeah, the task cgroup gets added through this path:
> > 
> > submit_bio
> >   generic_make_request
> >     generic_make_request_checks
> >       blkcg_bio_issue_check
> >         bio_blkcg
> > 
> 
> Right, this is what I thought based on the code. For anything not
> explicitly associated with a cgroup, it pretty much depends on the user
> context of the I/O submission.

I suspect that the code when originally proposed didn't have this
"fall back to task context" code in it. i.e.  it was added later,
thereby changing the default behaviour and introducing all these
filesystem level issues.

> > > I'm not saying this isn't a problem that needs fixing, I just want to
> > > make sure I understand the fundamental problem(s), what this cgawb patch
> > > actually does and doesn't do and whether there is a logical dependency
> > > between this patch and the proposed metadata filtering patch.
> > 
> > Seems like there's a lot more work to be done to make this stuff
> > work reliably than anyone has been telling us. Without regression
> > tests, we're never going to know if it actually works or not....
> > 
> 
> I'm wondering if there's something simple/high-level we can do in the
> filesystem to sort of flip the default block layer condition above on
> its head with regard to metadata.

I'm guessing that's what the patches that look at REQ_META and
assign it to the root cgroup do - they give those bios global scope
rather than whatever context is attached to them, and prevent them from
being throttled by the existing/default cgroup configuration.

> For example, forcibly associate all
> metadata bios to the root cgroup somewhere down in the xfs_buf_submit()
> path. So rather than doing metadata I/O on behalf of a variety of random
> cgroups based purely on userspace configuration and activity, we start
> with a consistent and predictable base behavior.

I'd prefer the block layer does that by default when it comes across
an untagged REQ_META bio. That way we don't have to sprinkle magic
cgroup pixie dust through every single filesystem to deal with this
same problem....
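
For reference, the blunt form of that block-layer default, as described earlier
in the thread, would look something like the sketch below against the 4.16-era
blk_throtl_bio() entry point. This is not upstream code; a fuller version would
re-associate untagged REQ_META bios with the root blkg rather than skip
throttling entirely:

/* block/blk-throttle.c -- sketch, not an actual patch */
bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
		    struct bio *bio)
{
	/*
	 * Metadata bios that nobody explicitly tagged (e.g. via
	 * wbc_init_bio()) have global scope; queueing the log behind one
	 * throttled group stalls every group sharing the filesystem, so
	 * let them straight through.
	 */
	if ((bio->bi_opf & REQ_META) && !bio->bi_css)
		return false;		/* not throttled */

	/* ... existing throttling decision logic would follow here ... */
	return false;			/* placeholder for the elided logic */
}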

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xfs: implement cgroup writeback support
  2017-10-15  5:07 Shaohua Li
  2017-10-15 22:22 ` Dave Chinner
@ 2017-10-19  7:35 ` Christoph Hellwig
  1 sibling, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2017-10-19  7:35 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-xfs, darrick.wong, david, tj, Kernel-team

> -		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset, bh);
> +		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset, bh, wbc);

Too long line, please fix.

> +	sb->s_iflags |= SB_I_CGROUPWB;

Didn't Tejun just change this to a per-inode flag?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xfs: implement cgroup writeback support
  2017-10-16  6:22     ` Dave Chinner
@ 2017-10-18  5:18       ` Shaohua Li
  0 siblings, 0 replies; 21+ messages in thread
From: Shaohua Li @ 2017-10-18  5:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, darrick.wong, tj, Kernel-team

On Mon, Oct 16, 2017 at 05:22:44PM +1100, Dave Chinner wrote:
> On Sun, Oct 15, 2017 at 08:35:39PM -0700, Shaohua Li wrote:
> > On Mon, Oct 16, 2017 at 09:22:02AM +1100, Dave Chinner wrote:
> > > On Sat, Oct 14, 2017 at 10:07:51PM -0700, Shaohua Li wrote:
> > > > From: Shaohua Li <shli@fb.com>
> > > > 
> > > > Basically this is a copy of commit 001e4a8775f6(ext4: implement cgroup
> > > > writeback support). Tested with a fio test, verified writeback is
> > > > throttled against cgroup io.max write bandwidth, also verified moving
> > > > the fio test to another cgroup and the writeback is throttled against
> > > > new cgroup setting.
> > > > 
> > > > I created a test for this as attached, please try! I'll send the test out for
> > > > inclusion later.
> > > 
> > > Hmmm. The test you appended just checks that bytes get written.
> > > That's pretty much useless for verification of the features you
> > > describe above (throttling rate it correct, dynamic throttle
> > > application as memcg config changes).
> > > 
> > > You explicitly state this is a memcg IO QoS feature and that you
> > > have a set of fio tests that verify that it works as expected. We
> > > need those "works as expected" fio tests formalised into automated
> > > fstests. Both upstream fs developers and downstream distro QE
> > > departments need to be able to verify that the bandwidth control and
> > > throttling works as advertised - it's essential that we have
> > > regression tests for this....
> > 
> > Right, this test only verifies the writeback is correctly charged to a cgroup,
> > it doesn't verify the writeback is running in correct bandwidth. I did try to
> > create such automatic test, but my attempt failed.
> 
> <sigh>
> 
> > To measure speed, we need measure the time for a test. But
> > writeback is async, the file write finishes before the data is
> > written to disk. We can't call a fsync, because fsync write is
> > different than writeback write, which is charged to correct cgroup
> > even without cgroup writeback support.
> 
> What a fucking mess.  We were told that if we didn't implement the
> writeback hooks in the filesystem then the memcg write throttling
> would not be active on the filesystem.
> 
> Looks like we got taken for a bunch of suckers, eh?
> 
> So now I don't trust what we've been told about metadata IO being
> ignored by memcg throttling. What happens to REQ_META IO from within
> a throttled memcg task context? Does that also get throttled
> according to the current task memcg limits?

Yes, it's throttled according to the current blkcg settings. This can be fixed,
however: Tejun recently posted patches to force all metadata IO to be charged to
the root cgroup, and we can do the same thing for xfs.
 
> > For my test, I run iostat and check the disk
> > speed is correct, so I don't have idea to create an automatic test.
> 
> Seems to me like there is an obvious answer to your problem:
> syncfs()
> 
> That is, syncfs uses the bdi flusher threads to do async writeback
> of all outstanding dirty data on the filesystem, and it also waits
> for that async writeback to complete.  e.g: set the writeback
> throttle speed to 5MB/s then run:
> 
> # time xfs_io -f -c "pwrite 0 100m" -c "syncfs" /path/to/file
> 
> If that takes less than 20s to run, then async writeback through the
> bdi flusher workqueue wasn't throttled properly.

Good idea, this is exactly what I want, thanks! Will post an xfs test soon.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xfs: implement cgroup writeback support
  2017-10-16  3:35   ` Shaohua Li
@ 2017-10-16  6:22     ` Dave Chinner
  2017-10-18  5:18       ` Shaohua Li
  0 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2017-10-16  6:22 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-xfs, darrick.wong, tj, Kernel-team

On Sun, Oct 15, 2017 at 08:35:39PM -0700, Shaohua Li wrote:
> On Mon, Oct 16, 2017 at 09:22:02AM +1100, Dave Chinner wrote:
> > On Sat, Oct 14, 2017 at 10:07:51PM -0700, Shaohua Li wrote:
> > > From: Shaohua Li <shli@fb.com>
> > > 
> > > Basically this is a copy of commit 001e4a8775f6(ext4: implement cgroup
> > > writeback support). Tested with a fio test, verified writeback is
> > > throttled against cgroup io.max write bandwidth, also verified moving
> > > the fio test to another cgroup and the writeback is throttled against
> > > new cgroup setting.
> > > 
> > > I created a test for this as attached, please try! I'll send the test out for
> > > inclusion later.
> > 
> > Hmmm. The test you appended just checks that bytes get written.
> > That's pretty much useless for verification of the features you
> > describe above (throttling rate it correct, dynamic throttle
> > application as memcg config changes).
> > 
> > You explicitly state this is a memcg IO QoS feature and that you
> > have a set of fio tests that verify that it works as expected. We
> > need those "works as expected" fio tests formalised into automated
> > fstests. Both upstream fs developers and downstream distro QE
> > departments need to be able to verify that the bandwidth control and
> > throttling works as advertised - it's essential that we have
> > regression tests for this....
> 
> Right, this test only verifies the writeback is correctly charged to a cgroup,
> it doesn't verify the writeback is running in correct bandwidth. I did try to
> create such automatic test, but my attempt failed.

<sigh>

> To measure speed, we need measure the time for a test. But
> writeback is async, the file write finishes before the data is
> written to disk. We can't call a fsync, because fsync write is
> different than writeback write, which is charged to correct cgroup
> even without cgroup writeback support.

What a fucking mess.  We were told that if we didn't implement the
writeback hooks in the filesystem then the memcg write throttling
would not be active on the filesystem.

Looks like we got taken for a bunch of suckers, eh?

So now I don't trust what we've been told about metadata IO being
ignored by memcg throttling. What happens to REQ_META IO from within
a throttled memcg task context? Does that also get throttled
according to the current task memcg limits?

> For my test, I run iostat and check the disk
> speed is correct, so I don't have idea to create an automatic test.

Seems to me like there is an obvious answer to your problem:
syncfs()

That is, syncfs uses the bdi flusher threads to do async writeback
of all outstanding dirty data on the filesystem, and it also waits
for that async writeback to complete.  e.g: set the writeback
throttle speed to 5MB/s then run:

# time xfs_io -f -c "pwrite 0 100m" -c "syncfs" /path/to/file

If that takes less than 20s to run, then async writeback through the
bdi flusher workqueue wasn't throttled properly.

You can use this basic concept to do all manner of more complex
testing, including changing throttle rates and the memcgs that tasks
belong to.
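
One way to turn that concept into an automated pass/fail check, without watching
iostat by hand, is to time the write-plus-syncfs sequence from userspace. A
minimal sketch, using the 100MB / 5MB/s / 20s numbers from the example above
(the default file path is just a placeholder):

/* timed_syncfs.c: rough sketch of the timing check described above */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/mnt/scratch/tst";
	static char buf[1 << 20];		/* 1MB of junk to write repeatedly */
	struct timespec t0, t1;
	double secs;
	int fd, i;

	memset(buf, 0xa5, sizeof(buf));
	fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < 100; i++)		/* dirty 100MB of page cache */
		if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
			perror("write");
			return 1;
		}
	if (syncfs(fd) < 0) {			/* waits for bdi flusher writeback */
		perror("syncfs");
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("100MB write + syncfs: %.1fs\n", secs);

	/* with a 5MB/s writeback limit this should take roughly 20 seconds */
	return secs < 20.0 ? 1 : 0;
}

An fstests wrapper would do the same thing with xfs_io and a shell timer, and
compare the elapsed time against the configured io.max limit.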

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xfs: implement cgroup writeback support
  2017-10-15 22:22 ` Dave Chinner
@ 2017-10-16  3:35   ` Shaohua Li
  2017-10-16  6:22     ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Shaohua Li @ 2017-10-16  3:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, darrick.wong, tj, Kernel-team

On Mon, Oct 16, 2017 at 09:22:02AM +1100, Dave Chinner wrote:
> On Sat, Oct 14, 2017 at 10:07:51PM -0700, Shaohua Li wrote:
> > From: Shaohua Li <shli@fb.com>
> > 
> > Basically this is a copy of commit 001e4a8775f6(ext4: implement cgroup
> > writeback support). Tested with a fio test, verified writeback is
> > throttled against cgroup io.max write bandwidth, also verified moving
> > the fio test to another cgroup and the writeback is throttled against
> > new cgroup setting.
> > 
> > I created a test for this as attached, please try! I'll send the test out for
> > inclusion later.
> 
> Hmmm. The test you appended just checks that bytes get written.
> That's pretty much useless for verification of the features you
> describe above (throttling rate it correct, dynamic throttle
> application as memcg config changes).
> 
> You explicitly state this is a memcg IO QoS feature and that you
> have a set of fio tests that verify that it works as expected. We
> need those "works as expected" fio tests formalised into automated
> fstests. Both upstream fs developers and downstream distro QE
> departments need to be able to verify that the bandwidth control and
> throttling works as advertised - it's essential that we have
> regression tests for this....

Right, this test only verifies that the writeback is correctly charged to a
cgroup; it doesn't verify that the writeback runs at the correct bandwidth. I
did try to create such an automatic test, but my attempt failed. To measure
speed, we need to measure the time for a test. But writeback is async, so the
file write finishes before the data is written to disk. We can't call fsync,
because an fsync write is different from a writeback write: it is charged to
the correct cgroup even without cgroup writeback support. For my test, I run
iostat and check that the disk speed is correct, so I have no idea how to
create an automatic test.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V2] xfs: implement cgroup writeback support
  2017-10-15  5:07 Shaohua Li
@ 2017-10-15 22:22 ` Dave Chinner
  2017-10-16  3:35   ` Shaohua Li
  2017-10-19  7:35 ` Christoph Hellwig
  1 sibling, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2017-10-15 22:22 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-xfs, darrick.wong, tj, Kernel-team

On Sat, Oct 14, 2017 at 10:07:51PM -0700, Shaohua Li wrote:
> From: Shaohua Li <shli@fb.com>
> 
> Basically this is a copy of commit 001e4a8775f6(ext4: implement cgroup
> writeback support). Tested with a fio test, verified writeback is
> throttled against cgroup io.max write bandwidth, also verified moving
> the fio test to another cgroup and the writeback is throttled against
> new cgroup setting.
> 
> I created a test for this as attached, please try! I'll send the test out for
> inclusion later.

Hmmm. The test you appended just checks that bytes get written.
That's pretty much useless for verification of the features you
describe above (throttling rate is correct, dynamic throttle
application as memcg config changes).

You explicitly state this is a memcg IO QoS feature and that you
have a set of fio tests that verify that it works as expected. We
need those "works as expected" fio tests formalised into automated
fstests. Both upstream fs developers and downstream distro QE
departments need to be able to verify that the bandwidth control and
throttling works as advertised - it's essential that we have
regression tests for this....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH V2] xfs: implement cgroup writeback support
@ 2017-10-15  5:07 Shaohua Li
  2017-10-15 22:22 ` Dave Chinner
  2017-10-19  7:35 ` Christoph Hellwig
  0 siblings, 2 replies; 21+ messages in thread
From: Shaohua Li @ 2017-10-15  5:07 UTC (permalink / raw)
  To: linux-xfs, darrick.wong; +Cc: david, tj, Kernel-team

[-- Attachment #1: Type: text/plain, Size: 2613 bytes --]

From: Shaohua Li <shli@fb.com>

Basically this is a copy of commit 001e4a8775f6(ext4: implement cgroup
writeback support). Tested with a fio test, verified writeback is
throttled against cgroup io.max write bandwidth, also verified moving
the fio test to another cgroup and the writeback is throttled against
new cgroup setting.

I created a test for this as attached, please try! I'll send the test out for
inclusion later.

Cc: Tejun Heo <tj@kernel.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/xfs/xfs_aops.c  | 12 ++++++++++--
 fs/xfs/xfs_super.c |  1 +
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 2917260..0c41f82 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -561,7 +561,8 @@ xfs_alloc_ioend(
 	struct inode		*inode,
 	unsigned int		type,
 	xfs_off_t		offset,
-	struct buffer_head	*bh)
+	struct buffer_head	*bh,
+	struct writeback_control *wbc)
 {
 	struct xfs_ioend	*ioend;
 	struct bio		*bio;
@@ -578,6 +579,8 @@ xfs_alloc_ioend(
 	INIT_WORK(&ioend->io_work, xfs_end_io);
 	ioend->io_append_trans = NULL;
 	ioend->io_bio = bio;
+	/* attach new bio to its cgroup */
+	wbc_init_bio(wbc, bio);
 	return ioend;
 }
 
@@ -605,6 +608,8 @@ xfs_chain_bio(
 	ioend->io_bio->bi_write_hint = ioend->io_inode->i_write_hint;
 	submit_bio(ioend->io_bio);
 	ioend->io_bio = new;
+	/* attach new bio to its cgroup */
+	wbc_init_bio(wbc, new);
 }
 
 /*
@@ -628,7 +633,7 @@ xfs_add_to_ioend(
 	    offset != wpc->ioend->io_offset + wpc->ioend->io_size) {
 		if (wpc->ioend)
 			list_add(&wpc->ioend->io_list, iolist);
-		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset, bh);
+		wpc->ioend = xfs_alloc_ioend(inode, wpc->io_type, offset, bh, wbc);
 	}
 
 	/*
@@ -638,6 +643,9 @@ xfs_add_to_ioend(
 	while (xfs_bio_add_buffer(wpc->ioend->io_bio, bh) != bh->b_size)
 		xfs_chain_bio(wpc->ioend, wbc, bh);
 
+	/* Charge write size to its cgroup for cgroup switching track */
+	wbc_account_io(wbc, bh->b_page, bh->b_size);
+
 	wpc->ioend->io_size += bh->b_size;
 	wpc->last_block = bh->b_blocknr;
 	xfs_start_buffer_writeback(bh);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index c996f4a..41eb6e0 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1634,6 +1634,7 @@ xfs_fs_fill_super(
 	sb->s_max_links = XFS_MAXLINK;
 	sb->s_time_gran = 1;
 	set_posix_acl_flag(sb);
+	sb->s_iflags |= SB_I_CGROUPWB;
 
 	/* version 5 superblocks support inode version counters. */
 	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
-- 
2.9.5


[-- Attachment #2: 0001-Add-cgroup2-writeback-test.patch --]
[-- Type: text/x-diff, Size: 2919 bytes --]

From 0edbc6d9f7abae299d6b1bd4b8ad76728afab313 Mon Sep 17 00:00:00 2001
Message-Id: <0edbc6d9f7abae299d6b1bd4b8ad76728afab313.1508043579.git.shli@fb.com>
From: Shaohua Li <shli@fb.com>
Date: Sat, 14 Oct 2017 18:57:06 -0700
Subject: [PATCH] Add cgroup2 writeback test

---
 common/cgroup2        | 18 +++++++++++++++++
 tests/generic/463     | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/463.out |  2 ++
 tests/generic/group   |  1 +
 4 files changed, 77 insertions(+)
 create mode 100644 common/cgroup2
 create mode 100755 tests/generic/463
 create mode 100644 tests/generic/463.out

diff --git a/common/cgroup2 b/common/cgroup2
new file mode 100644
index 00000000..130c2f79
--- /dev/null
+++ b/common/cgroup2
@@ -0,0 +1,18 @@
+#!/bin/bash
+# cgroup2 specific common functions
+
+export CGROUP2_PATH="/sys/fs/cgroup"
+
+_require_cgroup2()
+{
+	if [ ! -f ${CGROUP2_PATH}/cgroup.subtree_control ]; then
+		_notrun "Test requires cgroup2 enabled"
+	fi
+}
+
+_get_scratch_dev_devt()
+{
+	ls -l $SCRATCH_DEV | awk '{printf("%s:%s", substr($5, 1, length($5)-1), 0)}'
+}
+
+/bin/true
diff --git a/tests/generic/463 b/tests/generic/463
new file mode 100755
index 00000000..8dafadb0
--- /dev/null
+++ b/tests/generic/463
@@ -0,0 +1,56 @@
+#! /bin/bash
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/cgroup2
+
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+cgname=`mktemp -du ${CGROUP2_PATH}/test.XXXXXX`
+_cleanup()
+{
+    _scratch_unmount
+    cd /
+    rmdir $cgname
+}
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_cgroup2
+
+# Setup Filesystem
+_scratch_mkfs >/dev/null 2>&1 \
+        || _fail "mkfs failed"
+
+_scratch_mount \
+        || _fail "mount failed"
+
+echo +io > /sys/fs/cgroup/cgroup.subtree_control
+mkdir $cgname
+
+wbytes=$(
+echo $BASHPID > $cgname/cgroup.procs;
+dd if=/dev/zero of=$SCRATCH_MNT/image bs=1M count=100 >/dev/null 2>&1;
+# Make sure writeback starts
+sleep 120;
+cat $cgname/io.stat | sed -n "s/$(_get_scratch_dev_devt).*wbytes=\([0-9]*\).*/\1/p"
+)
+
+[ -z "$wbytes" -o "$wbytes" = "0" ] && _fail "Cgroup writeback doesn't work"
+echo "Cgroup writeback test success"
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/463.out b/tests/generic/463.out
new file mode 100644
index 00000000..211da95e
--- /dev/null
+++ b/tests/generic/463.out
@@ -0,0 +1,2 @@
+QA output created by 463
+Cgroup writeback test success
diff --git a/tests/generic/group b/tests/generic/group
index f2a6cdad..ea7b6956 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -465,3 +465,4 @@
 460 auto quick rw
 461 auto shutdown stress
 462 auto quick dax
+463 cgroup writeback test
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2018-03-28 22:36 UTC | newest]

Thread overview: 21+ messages
2018-03-22 21:11 [PATCH V2] xfs: implement cgroup writeback support Shaohua Li
2018-03-23 14:00 ` Chris Mason
2018-03-23 14:24 ` 张本龙
2018-03-25 21:59   ` Dave Chinner
2018-03-26 16:28     ` Brian Foster
2018-03-27  0:55       ` Shaohua Li
2018-03-27 11:36         ` Brian Foster
2018-03-27 21:56           ` Dave Chinner
2018-03-28 11:32             ` Brian Foster
2018-03-28 22:35               ` Dave Chinner
2018-03-28  4:37           ` 张本龙
2018-03-28 11:24             ` Brian Foster
     [not found]       ` <CAJDdQW3gOa8ry_XVkcCMf2QT7wC7MvU4b94hMhwJsg9MjYoKgQ@mail.gmail.com>
2018-03-27 11:50         ` Brian Foster
2018-03-28  9:55           ` 张本龙
2018-03-23 14:37 ` Brian Foster
  -- strict thread matches above, loose matches on Subject: below --
2017-10-15  5:07 Shaohua Li
2017-10-15 22:22 ` Dave Chinner
2017-10-16  3:35   ` Shaohua Li
2017-10-16  6:22     ` Dave Chinner
2017-10-18  5:18       ` Shaohua Li
2017-10-19  7:35 ` Christoph Hellwig
