From: Dave Chinner <david@fromorbit.com>
To: Chris Mason <clm@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Christoph Hellwig <hch@infradead.org>,
	"Darrick J. Wong" <djwong@kernel.org>,
	xfs <linux-xfs@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	"dchinner@redhat.com" <dchinner@redhat.com>
Subject: Re: [PATCH RFC] iomap: invalidate pages past eof in iomap_do_writepage()
Date: Mon, 6 Jun 2022 09:32:13 +1000	[thread overview]
Message-ID: <20220605233213.GN1098723@dread.disaster.area> (raw)
In-Reply-To: <c3620e1f-91c5-777c-4193-2478c69a033c@fb.com>

On Fri, Jun 03, 2022 at 12:09:06PM -0400, Chris Mason wrote:
> [ From a different message, Dave asks wtf my email client was doing. Thanks
> Dave, apparently exchange is being exchangey with base64 in unpredictable
> ways.  This was better in my test reply, let's see. ]
> 
> On 6/3/22 11:06 AM, Johannes Weiner wrote:
> > Hello Dave,
> > 
> > On Fri, Jun 03, 2022 at 03:20:47PM +1000, Dave Chinner wrote:
> > > On Fri, Jun 03, 2022 at 01:29:40AM +0000, Chris Mason wrote:
> > > > As you describe above, the loops are definitely coming from higher
> > > > in the stack.  wb_writeback() will loop as long as
> > > > __writeback_inodes_wb() returns that it’s making progress and
> > > > we’re still globally over the bg threshold, so write_cache_pages()
> > > > is just being called over and over again.  We’re coming from
> > > > wb_check_background_flush(), so:
> > > > 
> > > >                  struct wb_writeback_work work = {
> > > >                          .nr_pages       = LONG_MAX,
> > > >                          .sync_mode      = WB_SYNC_NONE,
> > > >                          .for_background = 1,
> > > >                          .range_cyclic   = 1,
> > > >                          .reason         = WB_REASON_BACKGROUND,
> > > >                  };
> > > 
> > > Sure, but we end up in writeback_sb_inodes() which does this after
> > > the __writeback_single_inode()->do_writepages() call that iterates
> > > the dirty pages:
> > > 
> > >                 if (need_resched()) {
> > >                         /*
> > >                          * We're trying to balance between building up a nice
> > >                          * long list of IOs to improve our merge rate, and
> > >                          * getting those IOs out quickly for anyone throttling
> > >                          * in balance_dirty_pages().  cond_resched() doesn't
> > >                          * unplug, so get our IOs out the door before we
> > >                          * give up the CPU.
> > >                          */
> > >                         blk_flush_plug(current->plug, false);
> > >                         cond_resched();
> > >                 }
> > > 
> > > So if there is a pending IO completion on this CPU on a work queue
> > > here, we'll reschedule to it because the work queue kworkers are
> > > bound to CPUs and they take priority over user threads.
> > 
> > The flusher thread is also a kworker, though. So it may hit this
> > cond_resched(), but it doesn't yield until the timeslice expires.

17us or 10ms, it doesn't matter. The fact is that the writeback
thread will give up the CPU long before the multi-second latencies
reported upthread could accumulate. Writeback spinning cannot
explain why truncate is not making progress; everything points to
it being a downstream symptom, not the cause.
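
For reference, the background flush loop quoted above reduces to
roughly the following (paraphrased from wb_writeback() in
fs/fs-writeback.c; heavily simplified, not the verbatim code). With
.nr_pages = LONG_MAX and .for_background = 1, the only ways out are
dropping below the background dirty threshold or making no progress
at all:

	for (;;) {
		if (work->nr_pages <= 0)
			break;

		/* Background writeout stops below the bg dirty threshold. */
		if (work->for_background && !wb_over_bg_thresh(wb))
			break;

		progress = __writeback_inodes_wb(wb, work);
		if (progress)
			continue;	/* made progress - go around again */

		/* No progress: wait for an inode to finish its IO, retry. */
	}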

Also important to note, since we are talking about kworker
scheduling hold-offs: the writeback flusher work is unbound (it can
run on any CPU), whilst the IO completion workers in XFS are
per-CPU and bound to individual CPUs. Bound kernel tasks usually
take run-queue priority on a CPU over unbound and/or user tasks
that can be punted to a different CPU. So, again, with this taken
into account, I don't see how the flusher thread consuming CPU
would cause long-duration hold-offs of IO completion work....

> Just to underline this, the long tail latencies aren't softlockups or major
> explosions.  It's just suboptimal enough that different metrics and
> dashboards noticed it.

Sure, but you've brought to my attention a problem whose root
cause we don't understand. I want to know what that root cause is
so I can determine that there are no other unknown underlying
issues contributing to it.

Nothing found so far explains why truncate is not making progress,
and that's what *I* need to understand here. Sure, go ahead and
patch your systems to get rid of the bad writeback looping
behaviour, but that does not explain or solve the underlying long
IO completion tail latencies that originally led to the writeback
spinning being found.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 17+ messages
2022-06-01  1:11 [PATCH RFC] iomap: invalidate pages past eof in iomap_do_writepage() Chris Mason
2022-06-01 12:18 ` Christoph Hellwig
2022-06-01 14:13   ` Chris Mason
2022-06-02  6:52     ` Dave Chinner
2022-06-02 15:32       ` Johannes Weiner
2022-06-02 19:41         ` Chris Mason
2022-06-02 19:59           ` Matthew Wilcox
2022-06-02 22:07             ` Dave Chinner
2022-06-02 22:06         ` Dave Chinner
2022-06-03  1:29           ` Chris Mason
2022-06-03  5:20             ` Dave Chinner
2022-06-03 15:06               ` Johannes Weiner
2022-06-03 16:09                 ` Chris Mason
2022-06-05 23:32                   ` Dave Chinner [this message]
2022-06-06 14:46                     ` Johannes Weiner
2022-06-06 15:13                       ` Chris Mason
2022-06-07 22:52                       ` Dave Chinner
