linux-fsdevel.vger.kernel.org archive mirror
* iomap: writeback ioend/bio allocation deadlock risk
@ 2021-05-21  3:27 Ming Lei
  2021-05-21  7:17 ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2021-05-21  3:27 UTC (permalink / raw)
  To: Christoph Hellwig, Brian Foster, Darrick J. Wong, Dave Chinner
  Cc: linux-block, linux-fsdevel

Hello Guys,

I found there may be two deadlock risks under memory pressure with regard to
ioend/bio allocation in the iomap writeback code, given what is documented
for bio_alloc_bioset():

        If %__GFP_DIRECT_RECLAIM is set then bio_alloc will always be able to
        allocate a bio.  This is due to the mempool guarantees.  To make this work,
        callers must never allocate more than 1 bio at a time from the general pool.
        Callers that need to allocate more than 1 bio must always submit the
        previously allocated bio for IO before attempting to allocate a new one.
        Failure to do so can cause deadlocks under memory pressure.

1) more than one ioend can be allocated from 'iomap_ioend_bioset'
before submitting them all, so the mempool guarantee can't be met; this can
be observed frequently in writeback over ext4

2) more than one chained bio (allocated from fs_bio_set) can be created via
iomap_chain_bio(), which is easily observed when writing big files on XFS:

- the old bio is submitted _after_ the new allocation
- submission of the old chained bio can't make forward progress because all
chained bios can only be freed after the whole ioend is completed, see
iomap_finish_ioend()

Both look not hard to fix; I just want to make sure they are real issues.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: iomap: writeback ioend/bio allocation deadlock risk
  2021-05-21  3:27 iomap: writeback ioend/bio allocation deadlock risk Ming Lei
@ 2021-05-21  7:17 ` Christoph Hellwig
  2021-05-21  7:31   ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2021-05-21  7:17 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Brian Foster, Darrick J. Wong, Dave Chinner,
	linux-block, linux-fsdevel

On Fri, May 21, 2021 at 11:27:54AM +0800, Ming Lei wrote:
>         if %__gfp_direct_reclaim is set then bio_alloc will always be able to
>         allocate a bio.  this is due to the mempool guarantees.  to make this work,
>         callers must never allocate more than 1 bio at a time from the general pool.
>         callers that need to allocate more than 1 bio must always submit the
>         previously allocated bio for io before attempting to allocate a new one.
>         failure to do so can cause deadlocks under memory pressure.
> 
> 1) more than one ioends can be allocated from 'iomap_ioend_bioset'
> before submitting them all, so mempoll guarantee can't be made, which can
> be observed frequently in writeback over ext4
> 
> 2) more than one chained bio(allocated from fs_bio_set) via iomap_chain_bio,
> which is easy observed when writing big files on XFS:

The comment describing bio_alloc_bioset is actually wrong.  Allocating
more than 1 at a time is perfectly fine; it just can't be more than
the pool_size argument passed to bioset_init.

iomap_ioend_bioset is sized to make sure we can always complete up
to 4 pages, and the list is only used inside a page, so we're fine.

fs_bio_set always has two entries to allow exactly for the common
chain and submit pattern.


* Re: iomap: writeback ioend/bio allocation deadlock risk
  2021-05-21  7:17 ` Christoph Hellwig
@ 2021-05-21  7:31   ` Ming Lei
  2021-05-21  7:35     ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2021-05-21  7:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Brian Foster, Darrick J. Wong, Dave Chinner, linux-block, linux-fsdevel

On Fri, May 21, 2021 at 09:17:27AM +0200, Christoph Hellwig wrote:
> On Fri, May 21, 2021 at 11:27:54AM +0800, Ming Lei wrote:
> >         if %__gfp_direct_reclaim is set then bio_alloc will always be able to
> >         allocate a bio.  this is due to the mempool guarantees.  to make this work,
> >         callers must never allocate more than 1 bio at a time from the general pool.
> >         callers that need to allocate more than 1 bio must always submit the
> >         previously allocated bio for io before attempting to allocate a new one.
> >         failure to do so can cause deadlocks under memory pressure.
> > 
> > 1) more than one ioends can be allocated from 'iomap_ioend_bioset'
> > before submitting them all, so mempoll guarantee can't be made, which can
> > be observed frequently in writeback over ext4
> > 
> > 2) more than one chained bio(allocated from fs_bio_set) via iomap_chain_bio,
> > which is easy observed when writing big files on XFS:
> 
> The comment describing bio_alloc_bioset is actually wrong.  Allocating

OK, we can fix the doc, but...

> more than 1 at a time is perfectly fine, it just can't be more than
> the pool_size argument passed to bioset_init.
> 
> iomap_ioend_bioset is sized to make sure we can always complete up
> to 4 pages, and the list is only used inside a page, so we're fine.

The number itself does not matter, because there isn't any limit on how
many ioends can be allocated before submitting; for example, it can be
observed that 64 ioends are allocated before submitting when writing a
5GB file to ext4. So far the reserved pool size is 32.

> 
> fs_bio_set always has two entries to allow exactly for the common
> chain and submit pattern.

It is easy to trigger dozens of chained bios in one ioend when writing
a big file to XFS.


Thanks,
Ming



* Re: iomap: writeback ioend/bio allocation deadlock risk
  2021-05-21  7:31   ` Ming Lei
@ 2021-05-21  7:35     ` Christoph Hellwig
  2021-05-21  8:35       ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2021-05-21  7:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Brian Foster, Darrick J. Wong, Dave Chinner,
	linux-block, linux-fsdevel

On Fri, May 21, 2021 at 03:31:05PM +0800, Ming Lei wrote:
> > iomap_ioend_bioset is sized to make sure we can always complete up
> > to 4 pages, and the list is only used inside a page, so we're fine.
> 
> The number itself does not matter, because there isn't any limit on how
> many ioends can be allocated before submitting, for example, it can be
> observed that 64 ioends is allocated before submitting when writing
> 5GB file to ext4. So far the reserved pool size is 32.

How do you manage to allocate iomap ioends when writing to ext4?  ext4
doesn't use iomap for buffered I/O.

> > fs_bio_set always has two entries to allow exactly for the common
> > chain and submit pattern.
> 
> It is easy to trigger dozens of chained bios in one ioend when writing
> big file to XFS.

Yes, we can still have one chained bio per ioend, so we need a bioset
with the same size as iomap_ioend_bioset.  That still should not be
dozens for a common setup, though.


* Re: iomap: writeback ioend/bio allocation deadlock risk
  2021-05-21  7:35     ` Christoph Hellwig
@ 2021-05-21  8:35       ` Ming Lei
  2021-05-21  8:36         ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2021-05-21  8:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Brian Foster, Darrick J. Wong, Dave Chinner, linux-block, linux-fsdevel

On Fri, May 21, 2021 at 09:35:47AM +0200, Christoph Hellwig wrote:
> On Fri, May 21, 2021 at 03:31:05PM +0800, Ming Lei wrote:
> > > iomap_ioend_bioset is sized to make sure we can always complete up
> > > to 4 pages, and the list is only used inside a page, so we're fine.
> > 
> > The number itself does not matter, because there isn't any limit on how
> > many ioends can be allocated before submitting, for example, it can be
> > observed that 64 ioends is allocated before submitting when writing
> > 5GB file to ext4. So far the reserved pool size is 32.
> 
> How do you manage to allocate iomap ioends when writing to ext4?  ext4
> doesn't use iomap for buffered I/O.

Just to double-check: the multiple ioend allocations come from the root XFS
filesystem and not from the big file write to ext4, so it looks like this
can be triggered easily in background writeback.

> 
> > > fs_bio_set always has two entries to allow exactly for the common
> > > chain and submit pattern.
> > 
> > It is easy to trigger dozens of chained bios in one ioend when writing
> > big file to XFS.
> 
> Yes, we can still have one chained bio per ioend, so we need a bioset
> with the same size as iomap_ioend_bioset.  That still should not be
> dozends for a common setup, though.

Yeah, that can be one solution.

Just wondering why the ioend isn't submitted out after it becomes full?


Thanks, 
Ming



* Re: iomap: writeback ioend/bio allocation deadlock risk
  2021-05-21  8:35       ` Ming Lei
@ 2021-05-21  8:36         ` Christoph Hellwig
  2021-05-21  8:54           ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2021-05-21  8:36 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Brian Foster, Darrick J. Wong, Dave Chinner,
	linux-block, linux-fsdevel

On Fri, May 21, 2021 at 04:35:03PM +0800, Ming Lei wrote:
> Just wondering why the ioend isn't submitted out after it becomes full?

block layer plugging?  Although failing bio allocations will kick that,
so that is not a deadlock risk.


* Re: iomap: writeback ioend/bio allocation deadlock risk
  2021-05-21  8:36         ` Christoph Hellwig
@ 2021-05-21  8:54           ` Ming Lei
  2021-05-24 15:32             ` Christoph Hellwig
  2021-05-24 23:55             ` Dave Chinner
  0 siblings, 2 replies; 12+ messages in thread
From: Ming Lei @ 2021-05-21  8:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Brian Foster, Darrick J. Wong, Dave Chinner, linux-block, linux-fsdevel

On Fri, May 21, 2021 at 10:36:35AM +0200, Christoph Hellwig wrote:
> On Fri, May 21, 2021 at 04:35:03PM +0800, Ming Lei wrote:
> > Just wondering why the ioend isn't submitted out after it becomes full?
> 
> block layer plugging?  Although failing bio allocations will kick that,
> so that is not a deadlock risk.

These ioends are just added to one list stored in a local stack variable
(submit_list); how can block layer plugging observe & submit them?

Chained bios have been submitted already, but none can be completed/freed
until the whole ioend is done, so that submission won't make forward progress.


Thanks,
Ming



* Re: iomap: writeback ioend/bio allocation deadlock risk
  2021-05-21  8:54           ` Ming Lei
@ 2021-05-24 15:32             ` Christoph Hellwig
  2021-05-24 23:55             ` Dave Chinner
  1 sibling, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2021-05-24 15:32 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Brian Foster, Darrick J. Wong, Dave Chinner,
	linux-block, linux-fsdevel

On Fri, May 21, 2021 at 04:54:45PM +0800, Ming Lei wrote:
> On Fri, May 21, 2021 at 10:36:35AM +0200, Christoph Hellwig wrote:
> > On Fri, May 21, 2021 at 04:35:03PM +0800, Ming Lei wrote:
> > > Just wondering why the ioend isn't submitted out after it becomes full?
> > 
> > block layer plugging?  Although failing bio allocations will kick that,
> > so that is not a deadlock risk.
> 
> These ioends are just added to one list stored on local stack variable(submit_list),

Yes.  But only until the code finishes iterating over a page.  The
worst case number of bios for a page is PAGE_SIZE / SECTOR_SIZE, and
the bio_set is sized to handle that comfortably.

> how can block layer plugging observe & submit them out?

It can't.  I was talking about the high-level plug that is held
over multiple pages.  At that point the bios are submitted to the
block layer already, but they might be held in a plug.  And looking at
the bio_alloc_bioset code we don't actually flush a plug at the moment
in that case.  Something like the patch below would do that:

diff --git a/block/bio.c b/block/bio.c
index 44205dfb6b60..5b9d2f4d7c08 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -432,6 +432,7 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned short nr_iovecs,
 
 	p = mempool_alloc(&bs->bio_pool, gfp_mask);
 	if (!p && gfp_mask != saved_gfp) {
+		blk_schedule_flush_plug(current);
 		punt_bios_to_rescuer(bs);
 		gfp_mask = saved_gfp;
 		p = mempool_alloc(&bs->bio_pool, gfp_mask);


* Re: iomap: writeback ioend/bio allocation deadlock risk
  2021-05-21  8:54           ` Ming Lei
  2021-05-24 15:32             ` Christoph Hellwig
@ 2021-05-24 23:55             ` Dave Chinner
  2021-05-25  4:54               ` Ming Lei
  1 sibling, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2021-05-24 23:55 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Brian Foster, Darrick J. Wong, Dave Chinner,
	linux-block, linux-fsdevel

On Fri, May 21, 2021 at 04:54:45PM +0800, Ming Lei wrote:
> On Fri, May 21, 2021 at 10:36:35AM +0200, Christoph Hellwig wrote:
> > On Fri, May 21, 2021 at 04:35:03PM +0800, Ming Lei wrote:
> > > Just wondering why the ioend isn't submitted out after it becomes full?
> > 
> > block layer plugging?  Although failing bio allocations will kick that,
> > so that is not a deadlock risk.
> 
> These ioends are just added to one list stored on local stack variable(submit_list),
> how can block layer plugging observe & submit them out?

We ignore that, as the commit history says:

commit e10de3723c53378e7cf441529f563c316fdc0dd3
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Feb 15 17:23:12 2016 +1100

    xfs: don't chain ioends during writepage submission

    Currently we can build a long ioend chain during ->writepages that
    gets attached to the writepage context. IO submission only then
    occurs when we finish all the writepage processing. This means we
    can have many ioends allocated and pending, and this violates the
    mempool guarantees that we need to give about forwards progress.
    i.e. we really should only have one ioend being built at a time,
    otherwise we may drain the mempool trying to allocate a new ioend
    and that blocks submission, completion and freeing of ioends that
    are already in progress.

    To prevent this situation from happening, we need to submit ioends
    for IO as soon as they are ready for dispatch rather than queuing
    them for later submission. This means the ioends have bios built
    immediately and they get queued on any plug that is current active.
    Hence if we schedule away from writeback, the ioends that have been
    built will make forwards progress due to the plug flushing on
    context switch. This will also prevent context switches from
    creating unnecessary IO submission latency.

    We can't completely avoid having nested IO allocation - when we have
    a block size smaller than a page size, we still need to hold the
    ioend submission until after we have marked the current page dirty.
    Hence we may need multiple ioends to be held while the current page
    is completely mapped and made ready for IO dispatch. We cannot avoid
    this problem - the current code already has this ioend chaining
    within a page so we can mostly ignore that it occurs.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dave Chinner <david@fromorbit.com>

IOWs, this nesting for block size < page size has been out there
for many years now and we've yet to have anyone report that
writeback deadlocks have occurred.

There's a mistake in that commit message - we can't submit the
ioends on a page until we've marked the page as under writeback, not
dirty. That's because we cannot have ioends completing on a page
that isn't under writeback because calling end_page_writeback() on
it when it isn't under writeback will BUG(). Hence we have to build
all the submission state before we mark the page as under writeback
and perform the submission(s) to avoid completion racing with
submission.

Hence we can't actually avoid nesting ioend allocation here within a
single page - the constraints of page cache writeback require it.
Hence the construction of the iomap_ioend_bioset uses a pool size of:

	4 * (PAGE_SIZE / SECTOR_SIZE)

So that we always have enough ioends cached in the mempool to
guarantee forwards progress of writeback of any single page under
writeback.

But that is a completely separate problem to this:

> Chained bios have been submitted already, but all can't be completed/freed
> until the whole ioend is done, that submission won't make forward progress.

This is a problem caused by having unbound contiguous ioend sizes,
not a problem caused by chaining bios. We can throw 256 pages into
a bio, so when we are doing huge contiguous IOs, we can map a
lot of sequential dirty pages into a contiguous extent into a very
long bio chain. But if we cap the max ioend size to, say, 4096
pages, then we've effectively capped the number of bios that can be
nested by such a writeback chain.

I was about to point you at the patchset that fixes this, but you've
already found it and are claiming that this nesting is an unfixable
problem. Capping the size of the ioend also bounds the depth of the
allocation nesting that will occur, and that fixes the whole nesting
deadlock problem: If the mempool reserves are deeper than the
maximum chain nesting that can occur, then there is no deadlock.

However, this points out what the real problem here is: that bio
allocation is designed to deadlock when nesting bio allocation rather
than fail. Hence at the iomap level we've got no way of knowing that
we should terminate and submit the current bio chain and start a new
ioend+bio chain because we will hang before there's any indication
that a deadlock could occur.

And then the elephant in the room: reality.

We've been nesting bio allocations via this chaining in production
systems under heavy memory pressure for 5 years now and we don't
have a single bug report indicating that this code deadlocks. So
while there's a theoretical problem, evidence points to it not being
an issue in practice.

Hence I don't think there is any need to urgently turn this code
upside down. I'd much prefer that bio allocation would fail rather
than deadlock, because then we can set a flag in the writepage
context that says "do single bio ioend submission only" for the
duration of that writeback context. And with that, the forwards
progress problem is solved whilst still allowing us to chain as
deeply as we want when there is no memory pressure....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: iomap: writeback ioend/bio allocation deadlock risk
  2021-05-24 23:55             ` Dave Chinner
@ 2021-05-25  4:54               ` Ming Lei
  2021-05-25  6:28                 ` Ming Lei
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2021-05-25  4:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Brian Foster, Darrick J. Wong, Dave Chinner,
	linux-block, linux-fsdevel

On Tue, May 25, 2021 at 09:55:38AM +1000, Dave Chinner wrote:
> On Fri, May 21, 2021 at 04:54:45PM +0800, Ming Lei wrote:
> > On Fri, May 21, 2021 at 10:36:35AM +0200, Christoph Hellwig wrote:
> > > On Fri, May 21, 2021 at 04:35:03PM +0800, Ming Lei wrote:
> > > > Just wondering why the ioend isn't submitted out after it becomes full?
> > > 
> > > block layer plugging?  Although failing bio allocations will kick that,
> > > so that is not a deadlock risk.
> > 
> > These ioends are just added to one list stored on local stack variable(submit_list),
> > how can block layer plugging observe & submit them out?
> 
> We ignore that, as the commit histoy says:
> 
> commit e10de3723c53378e7cf441529f563c316fdc0dd3
> Author: Dave Chinner <dchinner@redhat.com>
> Date:   Mon Feb 15 17:23:12 2016 +1100
> 
>     xfs: don't chain ioends during writepage submission
> 
>     Currently we can build a long ioend chain during ->writepages that
>     gets attached to the writepage context. IO submission only then
>     occurs when we finish all the writepage processing. This means we
>     can have many ioends allocated and pending, and this violates the
>     mempool guarantees that we need to give about forwards progress.
>     i.e. we really should only have one ioend being built at a time,
>     otherwise we may drain the mempool trying to allocate a new ioend
>     and that blocks submission, completion and freeing of ioends that
>     are already in progress.
> 
>     To prevent this situation from happening, we need to submit ioends
>     for IO as soon as they are ready for dispatch rather than queuing
>     them for later submission. This means the ioends have bios built
>     immediately and they get queued on any plug that is current active.
>     Hence if we schedule away from writeback, the ioends that have been
>     built will make forwards progress due to the plug flushing on
>     context switch. This will also prevent context switches from
>     creating unnecessary IO submission latency.
> 
>     We can't completely avoid having nested IO allocation - when we have
>     a block size smaller than a page size, we still need to hold the
>     ioend submission until after we have marked the current page dirty.
>     Hence we may need multiple ioends to be held while the current page
>     is completely mapped and made ready for IO dispatch. We cannot avoid
>     this problem - the current code already has this ioend chaining
>     within a page so we can mostly ignore that it occurs.
> 
>     Signed-off-by: Dave Chinner <dchinner@redhat.com>
>     Reviewed-by: Christoph Hellwig <hch@lst.de>
>     Signed-off-by: Dave Chinner <david@fromorbit.com>
> 
> IOWs, this nesting for block size < page size has been out there
> for many years now and we've yet to have anyone report that
> writeback deadlocks have occurred.
> 
> There's a mistake in that commit message - we can't submit the
> ioends on a page until we've marked the page as under writeback, not
> dirty. That's because we cannot have ioends completing on a a page
> that isn't under writeback because calling end_page_writeback() on
> it when it isn't under writeback will BUG(). Hence we have to build
> all the submission state before we mark the page as under writeback
> and perform the submission(s) to avoid completion racing with
> submission.
> 
> Hence we can't actually avoid nesting ioend allocation here within a
> single page - the constraints of page cache writeback require it.
> Hence the construction of the iomap_ioend_bioset uses a pool size of:
> 
> 	4 * (PAGE_SIZE / SECTOR_SIZE)
> 
> So that we always have enough ioends cached in the mempool to
> guarantee forwards progress of writeback of any single page under
> writeback.

OK, it looks like that is just for subpage IO, so there isn't such an
issue in the case of multiple ioends.

> 
> But that is a completely separate problem to this:
> 
> > Chained bios have been submitted already, but all can't be completed/freed
> > until the whole ioend is done, that submission won't make forward progress.
> 
> This is a problem caused by having unbound contiguous ioend sizes,
> not a problem caused by chaining bios. We can throw 256 pages into
> a bio, so when we are doing huge contiguous IOs, we can map a
> lot of sequential dirty pages into a contiguous extent into a very
> long bio chain. But if we cap the max ioend size to, say, 4096
> pages, then we've effectively capped the number of bios that can be
> nested by such a writeback chain.

If the 4096 pages are not contiguous, there may be 4096/256 = 16 bios
allocated for a single ioend: one is allocated from iomap_ioend_bioset, and
the other 15 are allocated by bio_alloc() from fs_bio_set, which only
reserves 2 bios.

> 
> I was about to point you at the patchset that fixes this, but you've
> already found it and are claiming that this nesting is an unfixable
> problem. Capping the size of the ioend also bounds the depth of the
> allocation nesting that will occur, and that fixes the whole nseting
> deadlock problem: If the mempool reserves are deeper than than the
> maximum chain nesting that can occur, then there is no deadlock.
> 
> However, this points out what the real problem here is: that bio
> allocation is designed to deadlock when nesting bio allocation rather
> than fail. Hence at the iomap level we've go no way of knowing that
> we should terminate and submit the current bio chain and start a new
> ioend+bio chain because we will hang before there's any indication
> that a deadlock could occur.

Most reservations are small, such as fs_bio_set's, which is why
bio_alloc_bioset() documents that callers should 'never allocate more than
1 bio at a time'. Actually iomap_chain_bio() does allocate a new one, then
submits the old one. 'fs_bio_set' reserves two bios, so the order (alloc
before submit) is fine, but all bios allocated from iomap_chain_bio() can
only be freed after the whole ioend is done. This makes the scheme fragile
against deadlock, because submission can't provide forward progress any
more; for example, flushing the plug list before scheduling out can't work
as expected.

One question is why the chained bios in an ioend aren't completed
individually. What is the advantage of completing all bios in
iomap_finish_ioend()?

> 
> And then the elephant in the room: reality.
> 
> We've been nesting bio allocations via this chaining in production
> systems under heavy memory pressure for 5 years now and we don't
> have a single bug report indicating that this code deadlocks. So
> while there's a theoretical problem, evidence points to it not being
> an issue in practice.
> 
> Hence I don't think there is any need to urgently turn this code
> upside down. I'd much prefer that bio allocation would fail rather

GFP_NOWAIT can be passed to bio_alloc() if you like, but then the caller
has to handle allocation failure. So far the iomap ioend code assumes that
all bio allocations can succeed.


Thanks, 
Ming



* Re: iomap: writeback ioend/bio allocation deadlock risk
  2021-05-25  4:54               ` Ming Lei
@ 2021-05-25  6:28                 ` Ming Lei
  2021-05-25  8:21                   ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Ming Lei @ 2021-05-25  6:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Brian Foster, Darrick J. Wong, Dave Chinner,
	linux-block, linux-fsdevel

On Tue, May 25, 2021 at 12:54:35PM +0800, Ming Lei wrote:
> On Tue, May 25, 2021 at 09:55:38AM +1000, Dave Chinner wrote:
> > On Fri, May 21, 2021 at 04:54:45PM +0800, Ming Lei wrote:
> > > On Fri, May 21, 2021 at 10:36:35AM +0200, Christoph Hellwig wrote:
> > > > On Fri, May 21, 2021 at 04:35:03PM +0800, Ming Lei wrote:
> > > > > Just wondering why the ioend isn't submitted out after it becomes full?
> > > > 
> > > > block layer plugging?  Although failing bio allocations will kick that,
> > > > so that is not a deadlock risk.
> > > 
> > > These ioends are just added to one list stored on local stack variable(submit_list),
> > > how can block layer plugging observe & submit them out?
> > 
> > We ignore that, as the commit histoy says:
> > 
> > commit e10de3723c53378e7cf441529f563c316fdc0dd3
> > Author: Dave Chinner <dchinner@redhat.com>
> > Date:   Mon Feb 15 17:23:12 2016 +1100
> > 
> >     xfs: don't chain ioends during writepage submission
> > 
> >     Currently we can build a long ioend chain during ->writepages that
> >     gets attached to the writepage context. IO submission only then
> >     occurs when we finish all the writepage processing. This means we
> >     can have many ioends allocated and pending, and this violates the
> >     mempool guarantees that we need to give about forwards progress.
> >     i.e. we really should only have one ioend being built at a time,
> >     otherwise we may drain the mempool trying to allocate a new ioend
> >     and that blocks submission, completion and freeing of ioends that
> >     are already in progress.
> > 
> >     To prevent this situation from happening, we need to submit ioends
> >     for IO as soon as they are ready for dispatch rather than queuing
> >     them for later submission. This means the ioends have bios built
> >     immediately and they get queued on any plug that is current active.
> >     Hence if we schedule away from writeback, the ioends that have been
> >     built will make forwards progress due to the plug flushing on
> >     context switch. This will also prevent context switches from
> >     creating unnecessary IO submission latency.
> > 
> >     We can't completely avoid having nested IO allocation - when we have
> >     a block size smaller than a page size, we still need to hold the
> >     ioend submission until after we have marked the current page dirty.
> >     Hence we may need multiple ioends to be held while the current page
> >     is completely mapped and made ready for IO dispatch. We cannot avoid
> >     this problem - the current code already has this ioend chaining
> >     within a page so we can mostly ignore that it occurs.
> > 
> >     Signed-off-by: Dave Chinner <dchinner@redhat.com>
> >     Reviewed-by: Christoph Hellwig <hch@lst.de>
> >     Signed-off-by: Dave Chinner <david@fromorbit.com>
> > 
> > IOWs, this nesting for block size < page size has been out there
> > for many years now and we've yet to have anyone report that
> > writeback deadlocks have occurred.
> > 
> > There's a mistake in that commit message - we can't submit the
> > ioends on a page until we've marked the page as under writeback, not
> > dirty. That's because we cannot have ioends completing on a a page
> > that isn't under writeback because calling end_page_writeback() on
> > it when it isn't under writeback will BUG(). Hence we have to build
> > all the submission state before we mark the page as under writeback
> > and perform the submission(s) to avoid completion racing with
> > submission.
> > 
> > Hence we can't actually avoid nesting ioend allocation here within a
> > single page - the constraints of page cache writeback require it.
> > Hence the construction of the iomap_ioend_bioset uses a pool size of:
> > 
> > 	4 * (PAGE_SIZE / SECTOR_SIZE)
> > 
> > So that we always have enough ioends cached in the mempool to
> > guarantee forwards progress of writeback of any single page under
> > writeback.
> 
> OK, looks it is just for subpage IO, so there isn't such issue
> in case of multiple ioends.

Thinking about it further, this way is still fragile: suppose there are 8
contexts in which writeback is working at the same time, and each needs to
allocate 6 ioends; then the reserve can be exhausted before any context has
all of its ioends, and none of them can move on.

> 
> > 
> > But that is a completely separate problem to this:
> > 
> > > Chained bios have been submitted already, but all can't be completed/freed
> > > until the whole ioend is done, that submission won't make forward progress.
> > 
> > This is a problem caused by having unbound contiguous ioend sizes,
> > not a problem caused by chaining bios. We can throw 256 pages into
> > a bio, so when we are doing huge contiguous IOs, we can map a
> > lot of sequential dirty pages into a contiguous extent into a very
> > long bio chain. But if we cap the max ioend size to, say, 4096
> > pages, then we've effectively capped the number of bios that can be
> > nested by such a writeback chain.
> 
> If the 4096 pages are not contiguous, there may be 4096/256 = 16 bios
> allocated for a single ioend: one is allocated from iomap_ioend_bioset,
> and the other 15 are allocated by bio_alloc() from fs_bio_set, which only
> reserves 2 bios.
> 
> > 
> > I was about to point you at the patchset that fixes this, but you've
> > already found it and are claiming that this nesting is an unfixable
> > problem. Capping the size of the ioend also bounds the depth of the
> > allocation nesting that will occur, and that fixes the whole nesting
> > deadlock problem: If the mempool reserves are deeper than the
> > maximum chain nesting that can occur, then there is no deadlock.
> > 
> > However, this points out what the real problem here is: that bio
> > allocation is designed to deadlock when nesting bio allocation rather
> > than fail. Hence at the iomap level we've got no way of knowing that
> > we should terminate and submit the current bio chain and start a new
> > ioend+bio chain because we will hang before there's any indication
> > that a deadlock could occur.
> 
> Most reservations are small, such as fs_bio_set's, which is why
> bio_alloc_bioset() documents that callers must 'never allocate more than
> 1 bio at a time'. Actually iomap_chain_bio() does allocate a new bio,
> then submits the old one. 'fs_bio_set' reserves two bios, so that
> order (alloc before submit) is fine,

It may not be fine when more than one context is involved in writeback.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: iomap: writeback ioend/bio allocation deadlock risk
  2021-05-25  6:28                 ` Ming Lei
@ 2021-05-25  8:21                   ` Dave Chinner
  0 siblings, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2021-05-25  8:21 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Brian Foster, Darrick J. Wong, Dave Chinner,
	linux-block, linux-fsdevel

On Tue, May 25, 2021 at 02:28:54PM +0800, Ming Lei wrote:
> On Tue, May 25, 2021 at 12:54:35PM +0800, Ming Lei wrote:
> > On Tue, May 25, 2021 at 09:55:38AM +1000, Dave Chinner wrote:
> > > On Fri, May 21, 2021 at 04:54:45PM +0800, Ming Lei wrote:
> > > > On Fri, May 21, 2021 at 10:36:35AM +0200, Christoph Hellwig wrote:
> > > > > On Fri, May 21, 2021 at 04:35:03PM +0800, Ming Lei wrote:
> > > > > > Just wondering why the ioend isn't submitted out after it becomes full?
> > > > > 
> > > > > block layer plugging?  Although failing bio allocations will kick that,
> > > > > so that is not a deadlock risk.
> > > > 
> > > > These ioends are just added to a list stored in a local stack variable
> > > > (submit_list), so how can block layer plugging observe & submit them out?
> > > 
We ignore that, as the commit history says:
> > > 
> > > commit e10de3723c53378e7cf441529f563c316fdc0dd3
> > > Author: Dave Chinner <dchinner@redhat.com>
> > > Date:   Mon Feb 15 17:23:12 2016 +1100
> > > 
> > >     xfs: don't chain ioends during writepage submission
> > > 
> > >     Currently we can build a long ioend chain during ->writepages that
> > >     gets attached to the writepage context. IO submission only then
> > >     occurs when we finish all the writepage processing. This means we
> > >     can have many ioends allocated and pending, and this violates the
> > >     mempool guarantees that we need to give about forwards progress.
> > >     i.e. we really should only have one ioend being built at a time,
> > >     otherwise we may drain the mempool trying to allocate a new ioend
> > >     and that blocks submission, completion and freeing of ioends that
> > >     are already in progress.
> > > 
> > >     To prevent this situation from happening, we need to submit ioends
> > >     for IO as soon as they are ready for dispatch rather than queuing
> > >     them for later submission. This means the ioends have bios built
> > >     immediately and they get queued on any plug that is current active.
> > >     Hence if we schedule away from writeback, the ioends that have been
> > >     built will make forwards progress due to the plug flushing on
> > >     context switch. This will also prevent context switches from
> > >     creating unnecessary IO submission latency.
> > > 
> > >     We can't completely avoid having nested IO allocation - when we have
> > >     a block size smaller than a page size, we still need to hold the
> > >     ioend submission until after we have marked the current page dirty.
> > >     Hence we may need multiple ioends to be held while the current page
> > >     is completely mapped and made ready for IO dispatch. We cannot avoid
> > >     this problem - the current code already has this ioend chaining
> > >     within a page so we can mostly ignore that it occurs.
> > > 
> > >     Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > >     Reviewed-by: Christoph Hellwig <hch@lst.de>
> > >     Signed-off-by: Dave Chinner <david@fromorbit.com>
> > > 
> > > IOWs, this nesting for block size < page size has been out there
> > > for many years now and we've yet to have anyone report that
> > > writeback deadlocks have occurred.
> > > 
> > > There's a mistake in that commit message - we can't submit the
> > > ioends on a page until we've marked the page as under writeback, not
> > > dirty. That's because we cannot have ioends completing on a page
> > > that isn't under writeback because calling end_page_writeback() on
> > > it when it isn't under writeback will BUG(). Hence we have to build
> > > all the submission state before we mark the page as under writeback
> > > and perform the submission(s) to avoid completion racing with
> > > submission.
> > > 
> > > Hence we can't actually avoid nesting ioend allocation here within a
> > > single page - the constraints of page cache writeback require it.
> > > Hence the construction of the iomap_ioend_bioset uses a pool size of:
> > > 
> > > 	4 * (PAGE_SIZE / SECTOR_SIZE)
> > > 
> > > So that we always have enough ioends cached in the mempool to
> > > guarantee forwards progress of writeback of any single page under
> > > writeback.
> > 
> > OK, it looks like this is just for sub-page IO, so there isn't such an
> > issue in the case of multiple ioends.
> 
> Thinking further, this approach is still fragile: suppose there are 8
> contexts doing writeback at the same time, and each needs to allocate
> 6 ioends; then every context may be unable to move on when allocating
> its 6th ioend.

Again - the reality is that nobody is reporting deadlocks in
production workloads, and some workloads (like file servers) can be
running hundreds of concurrent IO writeback contexts at the same
time to the same filesystem whilst under heavy memory pressure.

Yes, in theory it can deadlock. In practice? Nobody is reporting
deadlocks, so either the deadlock is so extremely rare we just don't
care about it or the theory is wrong.

Either way, I'll take "works in practice" over "theoretically
perfect" every day of the week. As Linus likes to say: "perfect is
the enemy of good".

> > > But that is a completely separate problem to this:
> > > 
> > > > Chained bios have been submitted already, but none can be completed/freed
> > > > until the whole ioend is done, so that submission won't make forward progress.
> > > 
> > > This is a problem caused by having unbounded contiguous ioend sizes,
> > > not a problem caused by chaining bios. We can throw 256 pages into
> > > a bio, so when we are doing huge contiguous IOs, we can map a
> > > lot of sequential dirty pages into a contiguous extent into a very
> > > long bio chain. But if we cap the max ioend size to, say, 4096
> > > pages, then we've effectively capped the number of bios that can be
> > > nested by such a writeback chain.
> > 
> > If the 4096 pages are not contiguous, there may be 4096/256 = 16 bios
> > allocated for a single ioend: one is allocated from iomap_ioend_bioset,
> > and the other 15 are allocated by bio_alloc() from fs_bio_set, which only
> > reserves 2 bios.

Which completely misses the point that bounding the chain length
means we could guarantee forwards progress, simply by increasing the
reservation on the bioset or using a custom bioset with a large
enough reservation.

> > > I was about to point you at the patchset that fixes this, but you've
> > > already found it and are claiming that this nesting is an unfixable
> > > problem. Capping the size of the ioend also bounds the depth of the
> > > allocation nesting that will occur, and that fixes the whole nesting
> > > deadlock problem: If the mempool reserves are deeper than the
> > > maximum chain nesting that can occur, then there is no deadlock.
> > > 
> > > However, this points out what the real problem here is: that bio
> > > allocation is designed to deadlock when nesting bio allocation rather
> > > than fail. Hence at the iomap level we've got no way of knowing that
> > > we should terminate and submit the current bio chain and start a new
> > > ioend+bio chain because we will hang before there's any indication
> > > that a deadlock could occur.
> > 
> > Most reservations are small, such as fs_bio_set's, which is why
> > bio_alloc_bioset() documents that callers must 'never allocate more than
> > 1 bio at a time'. Actually iomap_chain_bio() does allocate a new bio,
> > then submits the old one. 'fs_bio_set' reserves two bios, so that
> > order (alloc before submit) is fine,
> 
> It may not be fine when more than one context is involved in writeback.

Which brings me back to "theory vs reality". In theory, the
fs_bio_set is shared by all filesystems and all of them can have
multiple writeback contexts active at the same time all chaining
bios. So even the fs_bio_set doesn't have the reservation space
available to guarantee forwards progress in even slightly complex
multiple context writeback.

The big assumption in the forwards progress guarantee that mempools
provide is that there are no dependencies between bio allocations from
the same bioset. Bio chaining is an obvious dependency, but things like
concurrent submission, needing to do metadata IO whilst building
bios (i.e. delayed allocation), and stacked filesystems (e.g. loop
devices) create deep, disconnected, nested dependencies between
filesystems, bios and biosets.

Nesting happens. Deep nesting happens. But the evidence is that
systems aren't deadlocking during bio allocation despite what the
theory says.

So rather than telling us "this can't work", how about trying to
find out why it has been working so well for the past 5 years or so?
Maybe we'll all learn something we didn't know about the code in the
process...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



2021-05-21  3:27 iomap: writeback ioend/bio allocation deadlock risk Ming Lei
2021-05-21  7:17 ` Christoph Hellwig
2021-05-21  7:31   ` Ming Lei
2021-05-21  7:35     ` Christoph Hellwig
2021-05-21  8:35       ` Ming Lei
2021-05-21  8:36         ` Christoph Hellwig
2021-05-21  8:54           ` Ming Lei
2021-05-24 15:32             ` Christoph Hellwig
2021-05-24 23:55             ` Dave Chinner
2021-05-25  4:54               ` Ming Lei
2021-05-25  6:28                 ` Ming Lei
2021-05-25  8:21                   ` Dave Chinner
