Re: [PATCH v2 2/2] xfs: kick extra large ioends to completion workqueue

From: Dave Chinner <david@fromorbit.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Brian Foster <bfoster@redhat.com>,
	"Darrick J. Wong" <djwong@kernel.org>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [PATCH v2 2/2] xfs: kick extra large ioends to completion workqueue
Date: Mon, 10 May 2021 12:45:34 +1000	[thread overview]
Message-ID: <20210510024534.GO63242@dread.disaster.area> (raw)
In-Reply-To: <YJVRZ1Bg1gan2BrW@casper.infradead.org>

On Fri, May 07, 2021 at 03:40:39PM +0100, Matthew Wilcox wrote:
> On Fri, May 07, 2021 at 10:06:31AM -0400, Brian Foster wrote:
> > > <nod> So I guess I'm saying that my resistance to /this/ part of the
> > > changes is melting away.  For a 2GB+ write IO, I guess the extra overhead
> > > of poking a workqueue can be amortized over the sheer number of pages.
> > 
> > I think the main question is what is a suitable size threshold to kick
> > an ioend over to the workqueue? Looking back, I think this patch just
> > picked 256k randomly to propose the idea. ISTM there could be a
> > potentially large window from the point where I/O latency starts to
> > dominate (over the extra context switch for wq processing) and where the
> > softlockup warning thing will eventually trigger due to having too many
> > pages. I think that means we could probably use a more conservative
> > value, I'm just not sure what value should be (10MB, 100MB, 1GB?). If
> > you have a reproducer it might be interesting to experiment with that.
> 
> To my mind, there are four main types of I/Os.
> 
> 1. Small, dependent reads -- maybe reading a B-tree block so we can get
> the next pointer.  Those need latency above all.
> 
> 2. Readahead.  Tend to be large I/Os and latency is not a concern
> 
> 3. Page writeback which tend to be large and can afford the extra latency.
> 
> 4. Log writes.  These tend to be small, and I'm not sure what increasing
> their latency would do.  Probably bad.
> 
> I like 256kB as a threshold.  I think I could get behind anything from
> 128kB to 1MB.  I don't think playing with it is going to be really
> interesting because most IOs are going to be far below or far above
> that threshold.

256kB is waaaaay too small for writeback IO. Brian actually proposed
256k *pages*, not bytes. 256kB will take us back to the bad old days
where we are dependent on bio merging to get decent IO patterns down
to the hardware, and that's a bad place to be from both an IO and
CPU efficiency POV.

IOWs, the IO that is built by the filesystem needs to be large
w.r.t. the underlying hardware so that the block layer can slice and
dice it efficiently so we don't end up doing lots of small partial
stripe writes to RAID devices.

ISTR proposing - in response to Brian's patch - a limit of 16MB or
so - large enough that for most storage stacks we'll keep it's
queues full with well formed IO, but also small enough that we don't
end up with long completion latencies due to walking hundreds of
thousands of pages completing them in a tight loop...

Yup, here's the relevent chunk of that patch:

+/*
+ * Maximum ioend IO size is used to prevent ioends from becoming unbound in
+ * size. Bios can read 4GB in size is pages are contiguous, and bio chains are
+ * effectively unbound in length. Hence the only limits on the size of the bio
+ * chain is the contiguity of the extent on disk and the length of the run of
+ * sequential dirty pages in the page cache. This can be tens of GBs of physical
+ * extents and if memory is large enough, tens of millions of dirty pages.
+ * Locking them all under writeback until the final bio in the chain is
+ * submitted and completed locks all those pages for the legnth of time it takes
+ * to write those many, many GBs of data to storage.
+ *
+ * Background writeback caps any single writepages call to half the device
+ * bandwidth to ensure fairness and prevent any one dirty inode causing
+ * writeback starvation.  fsync() and other WB_SYNC_ALL writebacks have no such
+ * cap on wbc->nr_pages, and that's where the above massive bio chain lengths
+ * come from. We want large IOs to reach the storage, but we need to limit
+ * completion latencies, hence we need to control the maximum IO size we
+ * dispatch to the storage stack.
+ *
+ * We don't really have to care about the extra IO completion overhead here -
+ * iomap has contiguous IO completion merging, so if the device can sustain a
+ * very high throughput and large bios, the ioends will be merged on completion
+ * and processed in large, efficient chunks with no additional IO latency. IOWs,
+ * we really don't need the huge bio chains to be built in the first place for
+ * sequential writeback...
+ *
+ * Note that page size has an effect here - 64k pages really needs lower
+ * page count thresholds because they have the same IO capabilities. We can do
+ * larger IOs because of the lower completion overhead, so cap it at 1024
+ * pages. For everything else, use 16MB as the ioend size.
+ */
+#if (PAGE_SIZE == 65536)
+#define IOMAP_MAX_IOEND_PAGES  1024
+#else
+#define IOMAP_MAX_IOEND_PAGES  4096
+#endif
+#define IOMAP_MAX_IOEND_SIZE   (IOMAP_MAX_IOEND_PAGES * PAGE_SIZE)

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com