From: Trond Myklebust <trondmy@hammerspace.com>
To: "bfoster@redhat.com" <bfoster@redhat.com>,
	"david@fromorbit.com" <david@fromorbit.com>
Cc: "djwong@kernel.org" <djwong@kernel.org>,
	"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>,
	"willy@infradead.org" <willy@infradead.org>,
	"hch@infradead.org" <hch@infradead.org>,
	"axboe@kernel.dk" <axboe@kernel.dk>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"trondmy@kernel.org" <trondmy@kernel.org>
Subject: Re: [PATCH] iomap: Address soft lockup in iomap_finish_ioend()
Date: Tue, 11 Jan 2022 14:33:11 +0000	[thread overview]
Message-ID: <615fa140d0dd855a252de09490411be0afe5ae3d.camel@hammerspace.com> (raw)
In-Reply-To: <YdxwnaT0nYHgGQZR@bfoster>

On Mon, 2022-01-10 at 12:45 -0500, Brian Foster wrote:
> On Mon, Jan 10, 2022 at 07:18:47PM +1100, Dave Chinner wrote:
> > On Thu, Jan 06, 2022 at 11:44:23AM -0500, Brian Foster wrote:
> > > On Thu, Jan 06, 2022 at 09:04:21AM +1100, Dave Chinner wrote:
> > > > On Wed, Jan 05, 2022 at 08:56:33AM -0500, Brian Foster wrote:
> > > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > > > index 71a36ae120ee..39214577bc46 100644
> > > > --- a/fs/iomap/buffered-io.c
> > > > +++ b/fs/iomap/buffered-io.c
> > > > @@ -1066,17 +1066,34 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
> > > >         }
> > > >  }
> > > >  
> > > > +/*
> > > > + * Ioend completion routine for merged bios. This can only be
> > > > + * called from task contexts as merged ioends can be of unbound
> > > > + * length. Hence we have to break up the page writeback completion
> > > > + * into manageable chunks to avoid long scheduler holdoffs. We aim
> > > > + * to keep scheduler holdoffs down below 10ms so that we get good
> > > > + * batch processing throughput without creating adverse scheduler
> > > > + * latency conditions.
> > > > + */
> > > >  void
> > > >  iomap_finish_ioends(struct iomap_ioend *ioend, int error)
> > > >  {
> > > >         struct list_head tmp;
> > > > +       int segments;
> > > > +
> > > > +       might_sleep();
> > > >  
> > > >         list_replace_init(&ioend->io_list, &tmp);
> > > > +       segments = ioend->io_segments;
> > > >         iomap_finish_ioend(ioend, error);
> > > >  
> > > >         while (!list_empty(&tmp)) {
> > > > +               if (segments > 32768) {
> > > > +                       cond_resched();
> > > > +                       segments = 0;
> > > > +               }
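The batching in the hunk above amounts to roughly this pattern (a userspace sketch, not the actual kernel code; the function and names are hypothetical, and cond_resched() is modelled as a counter):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Walk per-ioend segment counts, "yielding" once the running total
 * exceeds the threshold, mirroring the cond_resched() placement in
 * the hunk above. Returns the number of yields taken.
 */
static int complete_ioends(const int *segs, size_t n, int threshold)
{
    int yields = 0;
    int segments = 0;

    for (size_t i = 0; i < n; i++) {
        if (segments > threshold) {
            yields++;        /* stand-in for cond_resched() */
            segments = 0;
        }
        segments += segs[i]; /* stand-in for iomap_finish_ioend() */
    }
    return yields;
}
```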
> > > 
> > > How is this intended to address the large bi_vec scenario? AFAICT
> > > bio_segments() doesn't account for multipage bvecs so the above
> > > logic can allow something like 34b (?) 4k pages before a yield.
> > 
> > Right now the bvec segment iteration in iomap_finish_ioend() is
> > completely unaware of multipage bvecs - as per above
> > bio_for_each_segment_all() iterates by PAGE_SIZE within a bvec,
> > regardless of whether they are stored in a multipage bvec or not.
> > Hence it always iterates the entire bio a single page at a time.
> > 
> > IOWs, we don't use multi-page bvecs in iomap writeback, nor is it
> > aware of them at all. We're adding single pages to bios via
> > bio_add_page() which may merge them internally into multipage
> > bvecs.
> > However, all our iterators use single page interfaces, hence we
> > don't see the internal multi-page structure of the bio at all.
> > As such, bio_segments() should return the number of PAGE_SIZE pages
> > attached to the bio regardless of its internal structure.
> > 
> 
> That is pretty much the point. The completion loop doesn't really care
> whether the amount of page processing work is due to a large bio chain,
> multipage bi_bvec(s), merged ioends, or some odd combination thereof.
> As you note, these conditions can manifest from various layers above or
> below iomap. I don't think iomap really needs to know or care about any
> of this. It just needs to yield when it has spent "too much" time
> processing pages.
> 
> With regard to the iterators, my understanding was that
> bio_for_each_segment_all() walks the multipage bvecs but
> bio_for_each_segment() does not, but that could certainly be wrong as I
> find the iterators a bit confusing. Either way, the most recent test
> with the ioend granular filter implies that a single ioend can still
> become a soft lockup vector from non-atomic context.
> 
> > That is what I see on a trace from a large single file submission,
> > comparing bio_segments() output from the page count on an ioend:
> > 
> >    kworker/u67:2-187   [017] 13530.753548: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x370400 bi_vcnt 1, bi_size 16777216
> >    kworker/u67:2-187   [017] 13530.759706: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x378400 bi_vcnt 1, bi_size 16777216
> >    kworker/u67:2-187   [017] 13530.766326: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x380400 bi_vcnt 1, bi_size 16777216
> >    kworker/u67:2-187   [017] 13530.770689: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x388400 bi_vcnt 1, bi_size 16777216
> >    kworker/u67:2-187   [017] 13530.774716: iomap_do_writepage: 2. bios 4096, pages 4096, start sector 0x390400 bi_vcnt 1, bi_size 16777216
> >    kworker/u67:2-187   [017] 13530.777157: iomap_writepages: 3. bios 2048, pages 2048, start sector 0x398400 bi_vcnt 1, bi_size 8388608
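As a sanity check on the trace (simple arithmetic, not from the thread itself): bi_size is exactly the page count times 4096 in every entry, so the bio_segments() count and the byte count agree for 4k pages:

```c
#include <assert.h>

/* Expected bi_size for an ioend of `pages` 4k pages. */
static unsigned long bisize_for_pages(unsigned long pages)
{
    return pages * 4096UL;
}
```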
> > 
> > Which shows we are building ioends with a single bio with a single
> > bvec, containing 4096 pages and 4096 bio segments. So, as expected,
> > bio_segments() matches the page count and we submit 4096 page ioends
> > with a single bio attached to it.
> > 
> > This is clearly a case where we are getting physically contiguous
> > page cache page allocation during write() syscalls, and the result
> > is a single contiguous bvec from bio_add_page() doing physical page
> > merging at the bvec level. Hence we see bio->bi_vcnt = 1 and a
> > physically contiguous 4096 multipage bvec being dispatched. The
> > lower layers slice and dice these huge bios to what the hardware
> > can handle...
> > 
> 
> I think we're in violent agreement here. That is the crux of multipage
> bvecs and what I've been trying to point out [1]. Ming (who I believe
> implemented it) pointed this out back when the problem was first
> reported. This is also why I asked Trond to test out the older patch
> series, because that was intended to cover this case.
> 
> [1]
> https://lore.kernel.org/linux-xfs/20220104192321.GF31606@magnolia/T/#mc08ffe4b619c1b503b2c1342157bdaa9823167c1
> 
> > What I'm not yet reproducing is whatever vector that Trond is seeing
> > that is causing the multi-second hold-offs. I get page completion
> > processed at a rate of about a million pages per second per CPU, but
> > I'm bandwidth limited to about 400,000 pages per second due to
> > mapping->i_pages lock contention (reclaim vs page cache
> > instantiation vs writeback completion). I'm not seeing merged ioend
> > batches of larger than about 40,000 pages being processed at once.
> > Hence I can't yet see where the millions of pages in a single ioend
> > completion that would be required to hold a CPU for tens of seconds
> > are coming from...
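For scale (back-of-envelope arithmetic, not a measurement from the thread): at the ~1M pages/sec/CPU completion rate above, a 40,000-page batch holds the CPU for ~40ms, so a multi-second holdoff really does require millions of pages in one completion:

```c
#include <assert.h>

/* Holdoff time for completing `pages` pages at a given per-CPU rate. */
static double holdoff_seconds(unsigned long pages, double pages_per_sec)
{
    return pages / pages_per_sec;
}
```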
> > 
> 
> I was never able to reproduce the actual warning either (only
> construct the unexpectedly large page sequences through various
> means), so I'm equally curious about that aspect of the problem. My
> only guess at the moment is that perhaps hardware is enough of a
> factor to increase the cost (i.e. slow cpu, cacheline misses, etc.)?
> I dunno..
> 

So I did try patches 1 and 2 from this series over the weekend, but for
some reason the resulting kernel started hanging during I/O. I'm still
trying to figure out why.

I think I'll try Dave's new patch to see if that behaves similarly (in
case this is something new in the stable kernel series) and report
back.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




Thread overview: 52+ messages
2021-12-30 19:35 [PATCH] iomap: Address soft lockup in iomap_finish_ioend() trondmy
2021-12-30 21:24 ` Jens Axboe
2021-12-30 22:25   ` Trond Myklebust
2021-12-30 22:27     ` Jens Axboe
2021-12-30 22:55       ` Trond Myklebust
2021-12-31  1:42 ` Matthew Wilcox
2021-12-31  6:16   ` Trond Myklebust
2022-01-01  3:55     ` Dave Chinner
2022-01-01 17:39       ` Trond Myklebust
2022-01-03 22:03         ` Dave Chinner
2022-01-04  0:04           ` Trond Myklebust
2022-01-04  1:22             ` Dave Chinner
2022-01-04  3:01               ` Trond Myklebust
2022-01-04  7:08               ` hch
2022-01-04 18:08                 ` Matthew Wilcox
2022-01-04 18:14                   ` hch
2022-01-04 19:22                     ` Darrick J. Wong
2022-01-04 21:52                       ` Dave Chinner
2022-01-04 23:12                         ` Darrick J. Wong
2022-01-05  2:10                           ` Dave Chinner
2022-01-05 13:56                             ` Brian Foster
2022-01-05 22:04                               ` Dave Chinner
2022-01-06 16:44                                 ` Brian Foster
2022-01-10  8:18                                   ` Dave Chinner
2022-01-10 17:45                                     ` Brian Foster
2022-01-10 18:11                                       ` hch
2022-01-11 14:33                                       ` Trond Myklebust [this message]
2022-01-05 13:42                           ` hch
2022-01-04 21:16                 ` Dave Chinner
2022-01-05 13:43                   ` hch
2022-01-05 22:34                     ` Dave Chinner
2022-01-05  2:09               ` Trond Myklebust
2022-01-05 20:45                 ` Trond Myklebust
2022-01-05 22:48                   ` Dave Chinner
2022-01-05 23:29                     ` Trond Myklebust
2022-01-06  0:01                     ` Darrick J. Wong
2022-01-09 23:09                       ` Dave Chinner
2022-01-06 18:36                     ` Trond Myklebust
2022-01-06 18:38                       ` Trond Myklebust
2022-01-06 20:07                       ` Brian Foster
2022-01-07  3:08                         ` Trond Myklebust
2022-01-07 15:15                           ` Brian Foster
2022-01-09 23:34                       ` Dave Chinner
2022-01-10 23:37                       ` Dave Chinner
2022-01-11  0:08                         ` Dave Chinner
2022-01-13 17:01                         ` Trond Myklebust
2022-01-17 17:24                           ` Trond Myklebust
2022-01-17 17:36                             ` Darrick J. Wong
2022-01-04 13:36         ` Brian Foster
2022-01-04 19:23           ` Darrick J. Wong
2022-01-05  2:31 ` [iomap] f5934dda54: BUG:sleeping_function_called_from_invalid_context_at_fs/iomap/buffered-io.c kernel test robot
2022-01-05  2:31   ` kernel test robot
