* ext4 hang and per-memcg dirty throttling
@ 2018-09-12  0:10 Liu Bo
  2018-09-12 12:11 ` Jan Kara
  0 siblings, 1 reply; 5+ messages in thread
From: Liu Bo @ 2018-09-12  0:10 UTC
  To: linux-ext4; +Cc: fengguang.wu, tj, jack, cgroups, gthelen, linux-mm, yang.shi

Hi,

With ext4's data=ordered mode and a blk-throttle limit set on the
underlying device, we can easily run into a hang:

1.
mount /dev/sdc /mnt -odata=ordered
2.
mkdir /sys/fs/cgroup/unified/cg
3.
echo "+io" > /sys/fs/cgroup/unified/cgroup.subtree_control
4.
echo "`cat /sys/block/sdc/dev` wbps=$((1 << 20))" > /sys/fs/cgroup/unified/cg/io.max
5.
echo $$ >  /sys/fs/cgroup/unified/cg/cgroup.procs
6.
# background dirtier
xfs_io -f -c "pwrite 0 1G" /mnt/dummy &
7.
echo $$ > /sys/fs/cgroup/unified/cgroup.procs
8.
# issue synchronous IO
for i in `seq 1 100`;
do
    xfs_io -f -s -c "pwrite 0 4k" /mnt/foo > /dev/null
done


And the hang looks like this:

      [jbd2-sdc]
jbd2_journal_commit_transaction                              
  journal_submit_data_buffers
    # file 'dummy' has been written by writeback kthread
  journal_finish_inode_data_buffers
    # wait on page's writeback

All ext4 operations that need to start a journal transaction then have
to wait until the journal commit completes.
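As a back-of-the-envelope sketch of how long that wait can get (assuming,
pessimistically, that the whole 1G of 'dummy' has to drain through the
1 MiB/s wbps limit from step 4 before the commit can finish; drain_seconds
is just an illustrative helper name, not an existing tool):

```shell
#!/bin/sh
# Hypothetical helper: seconds needed to write <bytes> at <bytes_per_sec>.
drain_seconds() {
    echo $(( $1 / $2 ))
}

wbps=$((1 << 20))          # io.max limit from step 4: 1 MiB/s
file_bytes=$((1 << 30))    # "pwrite 0 1G" from step 6
echo "worst case: $(drain_seconds "$file_bytes" "$wbps") seconds"
```

That works out to 1024 seconds, roughly 17 minutes. In practice only the
pages attached to the committing transaction matter, so the real stall may
be shorter.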

Since there is no per-memcg throttling, such as dirty ratio or dirty
bytes throttle, balance_dirty_pages() may not be able to slow down the
background dirtier task as expected.

I googled a little bit and found that Greg did related work[1] back in
2011, but it seems the patch set didn't make it into the kernel.

Now that we have cgroup-aware writeback, is there any plan to push the
patch set again, or are there any alternative solutions or suggestions?

[1]: https://lwn.net/Articles/455341/

thanks,
-liubo


* Re: ext4 hang and per-memcg dirty throttling
  2018-09-12  0:10 ext4 hang and per-memcg dirty throttling Liu Bo
@ 2018-09-12 12:11 ` Jan Kara
  2018-09-12 15:07   ` Theodore Y. Ts'o
  2018-09-12 19:19   ` Liu Bo
  0 siblings, 2 replies; 5+ messages in thread
From: Jan Kara @ 2018-09-12 12:11 UTC
  To: Liu Bo
  Cc: linux-ext4, fengguang.wu, tj, jack, cgroups, gthelen, linux-mm, yang.shi

Hi!

On Tue 11-09-18 17:10:55, Liu Bo wrote:
> With ext4's data=ordered mode and the underlying blk throttle setting, we
> can easily run to hang,
> 
> 1.
> mount /dev/sdc /mnt -odata=ordered
> 2.
> mkdir /sys/fs/cgroup/unified/cg
> 3.
> echo "+io" > /sys/fs/cgroup/unified/cgroup.subtree_control
> 4.
> echo "`cat /sys/block/sdc/dev` wbps=$((1 << 20))" > /sys/fs/cgroup/unified/cg/io.max
> 5.
> echo $$ >  /sys/fs/cgroup/unified/cg/cgroup.procs
> 6.
> // background dirtier
> xfs_io -f -c "pwrite 0 1G" $M/dummy &
> 7.
> echo $$ > /sys/fs/cgroup/unified/cgroup.procs
> 8.
> // issue synchronous IO
> for i in `seq 1 100`;
> do
>     xfs_io -f -s -c "pwrite 0 4k" $M/foo > /dev/null
> done
> 
> 
> And the hang is like
> 
>       [jbd2-sdc]
> jbd2_journal_commit_transaction                              
>   journal_submit_data_buffers
>     # file 'dummy' has been written by writeback kthread
>   journal_finish_inode_data_buffers
>     # wait on page's writeback

Yes, I guess you're speaking about the one Chris Mason mentioned [1].
Essentially it's a priority inversion where the jbd2 thread gets blocked
behind writeback done on behalf of a heavily restricted process. It actually
is not related to dirty throttling or anything like that. The solution for
this priority inversion is to use unwritten extents for writeback
unconditionally, as I wrote in that thread. The core of this is implemented
and hidden behind the dioread_nolock mount option, but it needs some serious
polishing work and testing...

[1] https://marc.info/?l=linux-fsdevel&m=151688776319077

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: ext4 hang and per-memcg dirty throttling
  2018-09-12 12:11 ` Jan Kara
@ 2018-09-12 15:07   ` Theodore Y. Ts'o
  2018-09-12 19:22     ` Liu Bo
  2018-09-12 19:19   ` Liu Bo
  1 sibling, 1 reply; 5+ messages in thread
From: Theodore Y. Ts'o @ 2018-09-12 15:07 UTC
  To: Jan Kara
  Cc: Liu Bo, linux-ext4, fengguang.wu, tj, cgroups, gthelen, linux-mm,
	yang.shi

On Wed, Sep 12, 2018 at 02:11:30PM +0200, Jan Kara wrote:
> 
> Yes, I guess you're speaking about the one Chris Mason mentioned [1].
> Essentially it's a priority inversion where jbd2 thread gets blocked behind
> writeback done on behalf of a heavily restricted process. It actually is
> not related to dirty throttling or anything like that. And the solution for
> this priority inversion is to use unwritten extents for writeback
> unconditionally as I wrote in that thread. The core of this is implemented
> and hidden behind dioread_nolock mount option but it needs some serious
> polishing work and testing...
> 
> [1] https://marc.info/?l=linux-fsdevel&m=151688776319077

I've actually been considering making dioread_nolock the default when
page_size == block_size.

Arguments in favor:

1)  Improves AIO latency in some circumstances
2)  Improves parallel DIO read performance
3)  Should address the block-cg throttling priority inversion problem

Arguments against:

1)  Hasn't seen much usage outside of Google (where it makes a big
    difference for fast flash workloads; see (1) and (2) above)
2)  Dioread_nolock only works when page_size == block_size; so this
    implies we would be using a different codepath depending on
    the block size.
3)  generic/500 (dm-thin ENOSPC hitter with concurrent discards)
    fails with dioread_nolock, but not in the 4k workload

Liu, can you try out mount -o dioread_nolock and see if this addresses
your problem? If so, maybe this is the development cycle where we
finally change the default.
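
Since dioread_nolock only works when page_size == block_size, it may be
worth checking that before remounting. A minimal sketch, assuming GNU stat
and getconf are available; can_use_dioread_nolock is a hypothetical helper
name, and /dev/sdc and /mnt are taken from the reproducer:

```shell
#!/bin/sh
# Hypothetical helper (not part of ext4 or e2fsprogs): succeeds only when
# the filesystem block size ($1) equals the system page size ($2).
can_use_dioread_nolock() {
    block_size=$1
    page_size=$2
    [ "$block_size" -eq "$page_size" ]
}

# Example use (the mount step needs root):
#   bs=$(stat -f -c %S /mnt)      # block size reported by the filesystem
#   ps=$(getconf PAGESIZE)        # system page size
#   if can_use_dioread_nolock "$bs" "$ps"; then
#       umount /mnt && mount -o data=ordered,dioread_nolock /dev/sdc /mnt
#   fi
```

With a 4k block size on a 4k-page machine the check passes; with 1k blocks
it would not, and the option should not be expected to help.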

						- Ted


* Re: ext4 hang and per-memcg dirty throttling
  2018-09-12 12:11 ` Jan Kara
  2018-09-12 15:07   ` Theodore Y. Ts'o
@ 2018-09-12 19:19   ` Liu Bo
  1 sibling, 0 replies; 5+ messages in thread
From: Liu Bo @ 2018-09-12 19:19 UTC
  To: Jan Kara
  Cc: linux-ext4, fengguang.wu, tj, cgroups, gthelen, linux-mm, yang.shi

On Wed, Sep 12, 2018 at 02:11:30PM +0200, Jan Kara wrote:
> Hi!
> 
> On Tue 11-09-18 17:10:55, Liu Bo wrote:
> > With ext4's data=ordered mode and the underlying blk throttle setting, we
> > can easily run to hang,
> > 
> > 1.
> > mount /dev/sdc /mnt -odata=ordered
> > 2.
> > mkdir /sys/fs/cgroup/unified/cg
> > 3.
> > echo "+io" > /sys/fs/cgroup/unified/cgroup.subtree_control
> > 4.
> > echo "`cat /sys/block/sdc/dev` wbps=$((1 << 20))" > /sys/fs/cgroup/unified/cg/io.max
> > 5.
> > echo $$ >  /sys/fs/cgroup/unified/cg/cgroup.procs
> > 6.
> > // background dirtier
> > xfs_io -f -c "pwrite 0 1G" $M/dummy &
> > 7.
> > echo $$ > /sys/fs/cgroup/unified/cgroup.procs
> > 8.
> > // issue synchronous IO
> > for i in `seq 1 100`;
> > do
> >     xfs_io -f -s -c "pwrite 0 4k" $M/foo > /dev/null
> > done
> > 
> > 
> > And the hang is like
> > 
> >       [jbd2-sdc]
> > jbd2_journal_commit_transaction                              
> >   journal_submit_data_buffers
> >     # file 'dummy' has been written by writeback kthread
> >   journal_finish_inode_data_buffers
> >     # wait on page's writeback
> 
> Yes, I guess you're speaking about the one Chris Mason mentioned [1].

Exactly.

> Essentially it's a priority inversion where jbd2 thread gets blocked behind
> writeback done on behalf of a heavily restricted process. It actually is
> not related to dirty throttling or anything like that. And the solution for
> this priority inversion is to use unwritten extents for writeback
> unconditionally as I wrote in that thread. The core of this is implemented
> and hidden behind dioread_nolock mount option but it needs some serious
> polishing work and testing...

Thank you so much for the details. So setting the extent to unwritten and
then converting it in endio does work and keeps the data=ordered
semantics, but I have to say the name, "dioread_nolock", is really
confusing...

thanks,
-liubo
> 
> [1] https://marc.info/?l=linux-fsdevel&m=151688776319077
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


* Re: ext4 hang and per-memcg dirty throttling
  2018-09-12 15:07   ` Theodore Y. Ts'o
@ 2018-09-12 19:22     ` Liu Bo
  0 siblings, 0 replies; 5+ messages in thread
From: Liu Bo @ 2018-09-12 19:22 UTC
  To: Theodore Y. Ts'o
  Cc: Jan Kara, linux-ext4, fengguang.wu, tj, cgroups, gthelen,
	linux-mm, yang.shi

On Wed, Sep 12, 2018 at 11:07:17AM -0400, Theodore Y. Ts'o wrote:
> On Wed, Sep 12, 2018 at 02:11:30PM +0200, Jan Kara wrote:
> > 
> > Yes, I guess you're speaking about the one Chris Mason mentioned [1].
> > Essentially it's a priority inversion where jbd2 thread gets blocked behind
> > writeback done on behalf of a heavily restricted process. It actually is
> > not related to dirty throttling or anything like that. And the solution for
> > this priority inversion is to use unwritten extents for writeback
> > unconditionally as I wrote in that thread. The core of this is implemented
> > and hidden behind dioread_nolock mount option but it needs some serious
> > polishing work and testing...
> > 
> > [1] https://marc.info/?l=linux-fsdevel&m=151688776319077
> 
> I've actually be considering making dioread_nolock the default when
> page_size == block_size.
> 
> Arguments in favor:
> 
> 1)  Improves AIO latency in some circumstances
> 2)  Improves parallel DIO read performance
> 3)  Should address the block-cg throttling priority inversion problem
> 
> Arguments against:
> 
> 1)  Hasn't seen much usage outside of Google (where it makes a big
>     difference for fast flash workloads; see (1) and (2) above)
> 2)  Dioread_nolock only works when page_size == block_size; so this
>     implies we would be using a different codepath depending on
>     the block size.
> 3)  generic/500 (dm-thin ENOSPC hitter with concurrent discards)
>     fails with dioread_nolock, but not in the 4k workload
> 
> Liu, can you try out mount -o dioread_nolock and see if this address
> your problem, if so, maybe this is the development cycle where we
> finally change the default.
> 

I've confirmed that "mount -o dioread_nolock" fixes the hang. I can do
further testing (maybe in a production environment) if that's needed.

Many thanks to both you and Jan.

thanks,
-liubo

