* question about writeback
From: Ross Zwisler @ 2019-03-14 20:03 UTC
  To: linux-ext4, Theodore Ts'o, Jan Kara, Jens Axboe, linux-block
  Cc: Ross Zwisler

Hi,

I'm trying to understand a failure I'm seeing with both v4.14- and
v4.19-based kernels, and I was hoping you could point me in the right
direction.

What seems to be happening is that under heavy I/O we get into a
situation where for a given inode/mapping we eventually reach a steady
state where one task is continuously dirtying pages and marking them
for writeback via ext4_writepages(), and another task is continuously
completing I/Os via ext4_end_bio() and clearing the
PAGECACHE_TAG_WRITEBACK flags.  So, we are making forward progress as
far as I/O is concerned.

The problem is that another task calls filemap_fdatawait_range(), and
that call never returns because it always finds pages that are tagged
for writeback.  I've added some prints to __filemap_fdatawait_range(),
and the total number of pages tagged for writeback seems pretty
constant.  It goes up and down a bit, but does not seem to move
towards 0.  If we halt I/O the system eventually recovers, but if we
keep I/O going we can block the task waiting in
__filemap_fdatawait_range() long enough for the system to reboot due
to what it perceives as a hung task.

My question is: Is there some mechanism that is supposed to prevent
this sort of situation?  Or is it expected that with slow enough
storage and a high enough I/O load, we could block inside of
filemap_fdatawait_range() indefinitely since we never run out of dirty
pages that are marked for writeback?
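
For reference, here's a rough sketch of the kind of userspace workload
that gets us into this state (a hypothetical reproducer I put together
for illustration, not our actual production load):

/*
 * Hypothetical reproducer sketch: one thread keeps extending a file
 * while another thread repeatedly calls fsync(), so each journal
 * commit's writeback wait keeps finding newly tagged pages.
 * Build with: cc -O2 -pthread repro.c
 */
#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>

static int fd;

static void *appender(void *arg)
{
	static char buf[1 << 16];

	memset(buf, 'x', sizeof(buf));
	for (;;)	/* keep extending the file */
		if (write(fd, buf, sizeof(buf)) < 0)
			break;
	return NULL;
}

int main(void)
{
	pthread_t t;

	fd = open("bigfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0)
		return 1;
	pthread_create(&t, NULL, appender, NULL);
	for (;;)	/* each fsync() forces a jbd2 commit + data wait */
		fsync(fd);
}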

Thanks,
- Ross


* Re: question about writeback
From: Dave Chinner @ 2019-03-14 20:18 UTC
  To: Ross Zwisler
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, Jens Axboe, linux-block,
	Ross Zwisler

On Thu, Mar 14, 2019 at 02:03:08PM -0600, Ross Zwisler wrote:
> Hi,
> 
> I'm trying to understand a failure I'm seeing with both v4.14- and
> v4.19-based kernels, and I was hoping you could point me in the right
> direction.
> 
> What seems to be happening is that under heavy I/O we get into a
> situation where for a given inode/mapping we eventually reach a steady
> state where one task is continuously dirtying pages and marking them
> for writeback via ext4_writepages(), and another task is continuously
> completing I/Os via ext4_end_bio() and clearing the
> PAGECACHE_TAG_WRITEBACK flags.  So, we are making forward progress as
> far as I/O is concerned.
> 
> The problem is that another task calls filemap_fdatawait_range(), and
> that call never returns because it always finds pages that are tagged
> for writeback.  I've added some prints to __filemap_fdatawait_range(),
> and the total number of pages tagged for writeback seems pretty
> constant.  It goes up and down a bit, but does not seem to move
> towards 0.  If we halt I/O the system eventually recovers, but if we
> keep I/O going we can block the task waiting in
> __filemap_fdatawait_range() long enough for the system to reboot due
> to what it perceives as a hung task.
> 
> My question is: Is there some mechanism that is supposed to prevent
> this sort of situation?  Or is it expected that with slow enough
> storage and a high enough I/O load, we could block inside of
> filemap_fdatawait_range() indefinitely since we never run out of dirty
> pages that are marked for writeback?

So your problem is that you are doing an extending write, and then
doing __filemap_fdatawait_range(end = LLONG_MAX), and while it
blocks on the pages under IO, the file is further extended and so
the next radix tree lookup finds more pages past that page under
writeback?

i.e. because it is waiting for pages to complete, it never gets
ahead of the extending write or writeback and always ends up with
more pages to wait on, and so never reaches the end of the file as
directed?

So perhaps the caller should be waiting on a specific range to bound
the wait (e.g.  isize as the end of the wait) rather than using the
default "keep going until the end of file is reached" semantics?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: question about writeback
From: Ross Zwisler @ 2019-03-14 20:37 UTC
  To: Dave Chinner
  Cc: linux-ext4, Theodore Ts'o, Jan Kara, Jens Axboe, linux-block,
	Ross Zwisler

On Thu, Mar 14, 2019 at 2:18 PM Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Mar 14, 2019 at 02:03:08PM -0600, Ross Zwisler wrote:
> > [...]
>
> So your problem is that you are doing an extending write, and then
> doing __filemap_fdatawait_range(end = LLONG_MAX), and while it
> blocks on the pages under IO, the file is further extended and so
> the next radix tree lookup finds more pages past that page under
> writeback?
>
> i.e. because it is waiting for pages to complete, it never gets
> ahead of the extending write or writeback and always ends up with
> more pages to wait on, and so never reaches the end of the file as
> directed?
>
> So perhaps the caller should be waiting on a specific range to bound
> the wait (e.g.  isize as the end of the wait) rather than using the
> default "keep going until the end of file is reached" semantics?

The call to __filemap_fdatawait_range() is happening via the jbd2 code:

jbd2_journal_commit_transaction()
  journal_finish_inode_data_buffers()
    filemap_fdatawait_keep_errors()
      __filemap_fdatawait_range(end = LLONG_MAX)
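
Roughly, the waiting side looks like this in these kernels (my
abridged paraphrase of fs/jbd2/commit.c, not verbatim - locking and
flag handling omitted):

/*
 * Wait for the data pages of every inode attached to the committing
 * transaction.  Note the wait covers the whole mapping, i.e.
 * effectively (0, LLONG_MAX).
 */
static int journal_finish_inode_data_buffers(journal_t *journal,
		transaction_t *commit_transaction)
{
	struct jbd2_inode *jinode;
	int err, ret = 0;

	list_for_each_entry(jinode, &commit_transaction->t_inode_list,
			    i_list) {
		err = filemap_fdatawait_keep_errors(
				jinode->i_vfs_inode->i_mapping);
		if (!ret)
			ret = err;
	}
	return ret;
}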

Would it have to be an extending write?  Or could it work the same if
you have one thread just moving forward through a very large file,
dirtying pages, so that the __filemap_fdatawait_range() call just
keeps finding newly tagged pages as it moves forward?

In either case, I think your description of the problem is correct.
Is this just a "well, don't do that" type situation, or is this
supposed to have a different result?

- Ross


* Re: question about writeback
From: Theodore Ts'o @ 2019-03-14 23:02 UTC
  To: Ross Zwisler
  Cc: Dave Chinner, linux-ext4, Jan Kara, Jens Axboe, linux-block,
	Ross Zwisler

On Thu, Mar 14, 2019 at 02:37:55PM -0600, Ross Zwisler wrote:
> > So perhaps the caller should be waiting on a specific range to bound
> > the wait (e.g.  isize as the end of the wait) rather than using the
> > default "keep going until the end of file is reached" semantics?
> 
> The call to __filemap_fdatawait_range() is happening via the jbd2 code:
> 
> jbd2_journal_commit_transaction()
>   journal_finish_inode_data_buffers()
>     filemap_fdatawait_keep_errors()
>       __filemap_fdatawait_range(end = LLONG_MAX)

I think jbd2 needs to call a new filemap_fdatawait_range_keep_errors()
(to be defined in mm/filemap.c).
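
Something along these lines, say - a sketch of the new helper, not a
tested patch:

/*
 * Sketch for mm/filemap.c: like filemap_fdatawait_range(), but keep
 * the mapping's error flags for later reporting rather than clearing
 * them, so the caller (jbd2) can bound the wait to a byte range.
 */
int filemap_fdatawait_range_keep_errors(struct address_space *mapping,
					loff_t start_byte, loff_t end_byte)
{
	__filemap_fdatawait_range(mapping, start_byte, end_byte);
	return filemap_check_and_keep_errors(mapping);
}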

> Would it have to be an extending write?  Or could it work the same if
> you have one thread just moving forward through a very large file,
> dirtying pages, so that the __filemap_fdatawait_range() call just
> keeps finding newly tagged pages as it moves forward?

No, that case is fine because we'll eventually make our way to the end
of the file and stop.

In the long term I want to get rid of data=ordered mode (while still
avoiding the stale data problem) without going through all of this
hair, so we don't have to call filemap_fdatawait from the commit
thread.  The real problem is that ext2/3 allocates blocks, updates the
inode metadata, and then writes the data blocks out.  What we need to
do is to swap the 2nd and 3rd steps....

					- Ted


* Re: question about writeback
From: Jan Kara @ 2019-03-18 11:38 UTC
  To: Ross Zwisler
  Cc: Dave Chinner, linux-ext4, Theodore Ts'o, Jan Kara,
	Jens Axboe, linux-block, Ross Zwisler

On Thu 14-03-19 14:37:55, Ross Zwisler wrote:
> On Thu, Mar 14, 2019 at 2:18 PM Dave Chinner <david@fromorbit.com> wrote:
> > [...]
> >
> > So perhaps the caller should be waiting on a specific range to bound
> > the wait (e.g.  isize as the end of the wait) rather than using the
> > default "keep going until the end of file is reached" semantics?
> 
> The call to __filemap_fdatawait_range() is happening via the jbd2 code:
> 
> jbd2_journal_commit_transaction()
>   journal_finish_inode_data_buffers()
>     filemap_fdatawait_keep_errors()
>       __filemap_fdatawait_range(end = LLONG_MAX)
> 
> Would it have to be an extending write?  Or could it work the same if
> you have one thread just moving forward through a very large file,
> dirtying pages, so that the __filemap_fdatawait_range() call just
> keeps finding newly tagged pages as it moves forward?

As Ted wrote, it must be an extending write or a very large file.
__filemap_fdatawait_range() is strictly monotone - it waits for each page
at most once (check the loop in __filemap_fdatawait_range()).  It would
actually be good to know which case you hit, if you can find out.
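
For reference, the loop is roughly this (abridged from mm/filemap.c
around v4.19):

/*
 * Abridged: 'index' only moves forward, so each page is waited on at
 * most once, and the loop ends once no tagged page remains in
 * [index, end].  But with an end of LLONG_MAX and a writer extending
 * the file, new tagged pages keep appearing ahead of 'index'.
 */
static void __filemap_fdatawait_range(struct address_space *mapping,
				      loff_t start_byte, loff_t end_byte)
{
	pgoff_t index = start_byte >> PAGE_SHIFT;
	pgoff_t end = end_byte >> PAGE_SHIFT;
	struct pagevec pvec;

	pagevec_init(&pvec);
	while (index <= end) {
		unsigned i, nr_pages;

		nr_pages = pagevec_lookup_range_tag(&pvec, mapping,
				&index, end, PAGECACHE_TAG_WRITEBACK);
		if (!nr_pages)
			break;

		for (i = 0; i < nr_pages; i++)
			wait_on_page_writeback(pvec.pages[i]);
		pagevec_release(&pvec);
		cond_resched();
	}
}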

> In either case, I think your description of the problem is correct.
> Is this just a "well, don't do that" type situation, or is this
> supposed to have a different result?

Let's call this a known limitation of the current ext4 journalling
implementation :) As Ted has outlined, there are plans to redesign some
things which would also avoid this problem.  But that's not a quick fix.
Short term, we could reduce the problem by tracking in jbd2 the min-max
range that's relevant for the running transaction.  It wouldn't
completely fix it, as e.g. for random writes into a sparse file the
problem would still trigger, but that is far less common than a
continuously extending file or a sequential write into a large file.
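
Roughly like this - a hypothetical sketch, the field names are made up
for illustration:

/*
 * Hypothetical: track the byte range dirtied under the running
 * transaction in each jbd2_inode.
 */
struct jbd2_inode {
	/* ... existing fields ... */
	loff_t i_dirty_start;	/* lowest offset dirtied in this txn */
	loff_t i_dirty_end;	/* highest offset dirtied in this txn */
};

/*
 * ... and at commit time wait only on that range (using Ted's
 * proposed helper) instead of (0, LLONG_MAX), so a concurrently
 * extending writer cannot keep the commit thread waiting forever.
 */
err = filemap_fdatawait_range_keep_errors(jinode->i_vfs_inode->i_mapping,
					   jinode->i_dirty_start,
					   jinode->i_dirty_end);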


								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: question about writeback
From: Ross Zwisler @ 2019-03-18 22:54 UTC
  To: Jan Kara
  Cc: Dave Chinner, linux-ext4, Theodore Ts'o, Jan Kara,
	Jens Axboe, linux-block, Ross Zwisler

On Mon, Mar 18, 2019 at 5:38 AM Jan Kara <jack@suse.cz> wrote:
> On Thu 14-03-19 14:37:55, Ross Zwisler wrote:
> > [...]
> >
> > Would it have to be an extending write?  Or could it work the same if
> > you have one thread just moving forward through a very large file,
> > dirtying pages, so that the __filemap_fdatawait_range() call just
> > keeps finding newly tagged pages as it moves forward?
>
> As Ted wrote, it must be an extending write or a very large file.
> __filemap_fdatawait_range() is strictly monotone - it waits for each page
> at most once (check the loop in __filemap_fdatawait_range()).  It would
> actually be good to know which case you hit, if you can find out.
>
> > In either case, I think your description of the problem is correct.
> > Is this just a "well, don't do that" type situation, or is this
> > supposed to have a different result?
>
> Let's call this a known limitation of the current ext4 journalling
> implementation :) As Ted has outlined, there are plans to redesign some
> things which would also avoid this problem.  But that's not a quick fix.
> Short term, we could reduce the problem by tracking in jbd2 the min-max
> range that's relevant for the running transaction.  It wouldn't
> completely fix it, as e.g. for random writes into a sparse file the
> problem would still trigger, but that is far less common than a
> continuously extending file or a sequential write into a large file.

Awesome, thank you for the replies.  I'll see if I can boil it down to
a relatively simple xfstest-type reproducer, and I'll take a crack at
implementing your suggested workaround in jbd2.

Thanks,
- Ross

