Re: [Question] About XFS random buffer write performance

From: Matthew Wilcox <willy@infradead.org>
To: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>,
	Zhengyuan Liu <liuzhengyuang521@gmail.com>,
	linux-xfs@vger.kernel.org,
	Zhengyuan Liu <liuzhengyuan@kylinos.cn>,
	Christoph Hellwig <hch@infradead.org>
Subject: Re: [Question] About XFS random buffer write performance
Date: Wed, 29 Jul 2020 19:50:35 +0100
Message-ID: <20200729185035.GX23808@casper.infradead.org>
In-Reply-To: <20200729051923.GZ2005@dread.disaster.area>

On Wed, Jul 29, 2020 at 03:19:23PM +1000, Dave Chinner wrote:
> On Wed, Jul 29, 2020 at 03:12:31AM +0100, Matthew Wilcox wrote:
> > On Wed, Jul 29, 2020 at 11:54:58AM +1000, Dave Chinner wrote:
> > > On Tue, Jul 28, 2020 at 04:47:53PM +0100, Matthew Wilcox wrote:
> > > > I propose we do away with the 'uptodate' bit-array and replace it with a
> > > > 'writeback' bit-array.  We set the page uptodate bit whenever the reads to
> > > 
> > > That's just per-block dirty state tracking. But when we set a single
> > > bit, we still need to set the page dirty flag.
> > 
> > It's not exactly dirty, though.  It's 'present' (ie the opposite
> > of hole). 
> 
> Careful with your terminology. At the page cache level, there is no
> such thing as a "hole". There is only data and whether the data is
> up to date or not. The page cache may be *sparsely populated*, but
> a lack of a page or a range of the page that is not up to date
> does not imply there is a -hole in the file- at that point.

That's not entirely true.  The current ->uptodate array does keep
track of whether an unwritten extent is currently a hole (see
page_cache_seek_hole_data()).  I don't know how useful that is.

> I'm still not sure what "present" is supposed to mean, though,
> because it seems no different to "up to date". The data is present
> once it's been read into the page, calling page_mkwrite() on the
> page doesn't change that at all.

I had a bit of a misunderstanding.  Let's discard that proposal
and discuss what we want to optimise for, ignoring THPs.  Strictly
speaking, we don't need to track any per-block state at all.  We could
implement __iomap_write_begin() by reading in the entire page (skipping
the last few blocks if they lie outside i_size) and then marking the
entire page Uptodate.
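
To make the "read the whole page" variant concrete, here is a rough
userspace sketch.  It is not kernel code: struct wb_page, wb_write_begin()
and the 4096/1024 page and block sizes are all made up for illustration,
and "storage" is just a buffer standing in for the blocks backing the page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define WB_PAGE_SIZE       4096u
#define WB_BLOCK_SIZE      1024u
#define WB_BLOCKS_PER_PAGE (WB_PAGE_SIZE / WB_BLOCK_SIZE)

/* Hypothetical stand-in for a page cache page; not struct page. */
struct wb_page {
	unsigned char data[WB_PAGE_SIZE];
	bool uptodate;			/* one flag for the whole page */
	unsigned int blocks_read;	/* counts simulated block reads */
};

/*
 * Bring the whole page up to date before a write.  Blocks that start
 * below valid_bytes (the part of i_size that falls in this page) are
 * read from storage; everything past valid_bytes is zeroed.
 */
static void wb_write_begin(struct wb_page *page,
			   const unsigned char *storage, size_t valid_bytes)
{
	assert(valid_bytes <= WB_PAGE_SIZE);

	if (page->uptodate)
		return;		/* nothing to read, no per-block state to check */

	for (unsigned int b = 0; b < WB_BLOCKS_PER_PAGE; b++) {
		size_t off = (size_t)b * WB_BLOCK_SIZE;

		if (off >= valid_bytes)
			break;	/* block lies entirely outside i_size */
		memcpy(page->data + off, storage + off, WB_BLOCK_SIZE);
		page->blocks_read++;
	}

	/* Zero the tail, including the part of a block that straddles EOF. */
	memset(page->data + valid_bytes, 0, WB_PAGE_SIZE - valid_bytes);
	page->uptodate = true;
}
```

With valid_bytes = 1500 this reads two blocks and zeroes the rest, so
the caller only ever has to look at the single Uptodate flag.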

Buffer heads track several bits of information about each block:
 - Uptodate (contents of cache at least as recent as storage)
 - Dirty (contents of cache more recent than storage)
 - ... er, I think all the rest are irrelevant for iomap

I think I just talked myself into what you were arguing for -- that we
change the ->uptodate bit array into a ->dirty bit array.

That implies that we lose the current optimisation that we can write at
a blocksize alignment into the page cache and not read from storage.
I'm personally fine with that; most workloads don't care if you read
extra bytes from storage (hence readahead), but writing unnecessarily
to storage (particularly flash) is bad.
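
Roughly what that trade-off looks like, as a toy userspace model rather
than anything resembling the real iomap code.  struct db_page_state and
both functions are invented names, and I'm assuming four 1024-byte blocks
per page just to keep the numbers small.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DB_BLOCK_SIZE      1024u
#define DB_BLOCKS_PER_PAGE 4u

/* Hypothetical per-page state: whole-page Uptodate, per-block dirty. */
struct db_page_state {
	bool uptodate;
	unsigned long dirty;		/* bit b set: block b is dirty */
	unsigned int blocks_read;	/* simulated reads from storage */
	unsigned int blocks_written;	/* simulated writes to storage */
};

/* Buffered write of len bytes at byte offset off within the page. */
static void db_buffered_write(struct db_page_state *st, size_t off, size_t len)
{
	assert(len > 0);
	assert(off + len <= (size_t)DB_BLOCK_SIZE * DB_BLOCKS_PER_PAGE);

	if (!st->uptodate) {
		/*
		 * Without per-block uptodate bits we cannot skip the read,
		 * even when the write is block-aligned.
		 */
		st->blocks_read += DB_BLOCKS_PER_PAGE;
		st->uptodate = true;
	}

	size_t first = off / DB_BLOCK_SIZE;
	size_t last = (off + len - 1) / DB_BLOCK_SIZE;

	for (size_t b = first; b <= last; b++)
		st->dirty |= 1UL << b;
}

/* Write back only the blocks that were actually dirtied. */
static void db_writeback(struct db_page_state *st)
{
	for (unsigned int b = 0; b < DB_BLOCKS_PER_PAGE; b++) {
		if (st->dirty & (1UL << b))
			st->blocks_written++;
	}
	st->dirty = 0;
}
```

A block-aligned 1024-byte write into a cold page costs four block reads
here where the current code would do none, but writeback then touches
one block instead of four.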

Or we keep two bits per block.  The implementation would be a little icky,
but it could be done.
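
One possible shape for the two-bit version, purely as a sketch.  The
interleaved layout (bit 2*b for uptodate, bit 2*b+1 for dirty) is my
assumption, not something anybody has proposed, and every tb_* name is
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TB_BLOCKS_PER_PAGE 4u

enum tb_block_bit {
	TB_UPTODATE = 0,	/* cache at least as recent as storage */
	TB_DIRTY    = 1,	/* cache more recent than storage */
};

/* Two bits per block packed into one word; enough for four blocks. */
struct tb_block_state {
	unsigned long bits;
};

static unsigned int tb_bit_index(unsigned int block, enum tb_block_bit which)
{
	assert(block < TB_BLOCKS_PER_PAGE);
	return 2u * block + (unsigned int)which;
}

static void tb_set(struct tb_block_state *st, unsigned int block,
		   enum tb_block_bit which)
{
	st->bits |= 1UL << tb_bit_index(block, which);
}

static bool tb_test(const struct tb_block_state *st, unsigned int block,
		    enum tb_block_bit which)
{
	return (st->bits >> tb_bit_index(block, which)) & 1UL;
}

/*
 * A write covering a whole block needs no read from storage: the
 * block becomes both uptodate and dirty in one step.
 */
static void tb_write_full_block(struct tb_block_state *st, unsigned int block)
{
	tb_set(st, block, TB_UPTODATE);
	tb_set(st, block, TB_DIRTY);
}
```

This keeps the block-aligned write optimisation and still lets writeback
skip clean blocks, at the cost of juggling two kinds of bit in one array.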

I like the idea of getting rid of partially uptodate pages.  I've never
really understood the concept.  For me, a partially dirty page makes a
lot more sense than a partially uptodate page.  Perhaps I'm just weird.

Speaking of weird, I don't understand why an unwritten extent queries
the uptodate bits.  Maybe that's a buffer_head thing and we can just
ignore it -- iomap doesn't have such a thing as a !uptodate page any
more.