Re: [Question] About XFS random buffer write performance

From: Dave Chinner <david@fromorbit.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>,
	Zhengyuan Liu <liuzhengyuang521@gmail.com>,
	linux-xfs@vger.kernel.org,
	Zhengyuan Liu <liuzhengyuan@kylinos.cn>,
	Christoph Hellwig <hch@infradead.org>
Subject: Re: [Question] About XFS random buffer write performance
Date: Wed, 29 Jul 2020 15:19:23 +1000	[thread overview]
Message-ID: <20200729051923.GZ2005@dread.disaster.area> (raw)
In-Reply-To: <20200729021231.GV23808@casper.infradead.org>

On Wed, Jul 29, 2020 at 03:12:31AM +0100, Matthew Wilcox wrote:
> On Wed, Jul 29, 2020 at 11:54:58AM +1000, Dave Chinner wrote:
> > On Tue, Jul 28, 2020 at 04:47:53PM +0100, Matthew Wilcox wrote:
> > > I propose we do away with the 'uptodate' bit-array and replace it with an
> > > 'writeback' bit-array.  We set the page uptodate bit whenever the reads to
> > 
> > That's just per-block dirty state tracking. But when we set a single
> > bit, we still need to set the page dirty flag.
> 
> It's not exactly dirty, though.  It's 'present' (ie the opposite
> of hole). 

Careful with your terminology. At the page cache level, there is no
such thing as a "hole". There is only data and whether the data is
up to date or not. The page cache may be *sparsely populated*, but
a lack of a page or a range of the page that is not up to date
does not imply there is a -hole in the file- at that point.

I'm still not sure what "present" is supposed to mean, though,
because it seems no different to "up to date". The data is present
once it's been read into the page, calling page_mkwrite() on the
page doesn't change that at all.

> I'm not attached to the name.  So it can be used to
> implement iomap_is_partially_uptodate.  If the page is dirty, the chunks
> corresponding to the present bits get written back, but we don't track
> a per-block dirty state.

iomap_is_partially_uptodate() only indicates whether data in the
page is entirely valid or not. If it isn't entirely valid, then the
caller has to ask the filesystem whether the underlying range
contains holes or data....

> > > fill the page have completed rather than checking the 'writeback' array.
> > > In page_mkwrite, we fill the writeback bit-array on the grounds that we
> > > have no way to track a block's non-dirtiness and we don't want to scan
> > > each block at writeback time to see if it's been written to.
> > 
> > You're talking about mmap() access to the file here, not
> > read/write() syscall access. If page_mkwrite() sets all the
> > blocks in a page as "needing writeback", how is that different in
> > any way to just using a single dirty bit? So why wouldn't we just do
> > this in iomap_set_page_dirty()?
> 
> iomap_set_page_dirty() is called from iomap_page_mkwrite_actor(), so
> sure!

via set_page_dirty(), which is why I mentioned this:

> > The only place we wouldn't want to set the entire page dirty is
> > the call from __iomap_write_end() which knows the exact range of the
> > page that was dirtied. In which case, iomap_set_page_dirty_range()
> > would be appropriate, right? i.e. we still have to do all the same
> > page/page cache/inode dirtying, but only that would set a sub-page
> > range of dirty bits in the iomap_page?
> > 
> > /me doesn't see the point of calling dirty tracking "writeback bits"
> > when "writeback" is a specific page state that comes between the
> > "dirty" and "clean" states...
> 
> I don't want to get it confused with page states.  This is a different
> thing.  It's just tracking which blocks are holes (and have definitely
> not been written to), so those blocks can remain as holes when the page
> gets written back.

We do not track holes at the page level. We do not want to track
anything to do with the filesystem extent mapping at the page level.
That was something that bufferheads were used for, and was something
we specifically designed iomap specifically not to require.

IOWs, iomap does page cache IO at page level granularity, not block
level granularity.  The only thing we track at block granularity is
wither the range of the page over a given block contains valid data
or not.  i.e. whether the page has been initialised with the correct
data or not.

Further, page-mkwrite() has no knoweldge of whether the backing
store has holes in it or not, nor does it care. All it does is call
into the filesystem to fill any holes that may exist in the backing
space behind the page.  This is also needed for COW to allocate the
destination of the over write, but in either case there is no
interaction with pre-existing holes - that is all done by the
read side of the page fault before page_mkwrite is called...

IOWs, if you call page_mkwrite() on a THP, the filesystem will
allocate/reserve the entire backing space behind the page because
writeback of that THP requires writing the entire page and for
backing space to be fully allocated before that write is issued.

hence I'm really not sure what you are suggesting we do here because
it doesn't make sense to me. Maybe I'm missing something that THP
does that I'm not away of, but other than that I'm completely
missing what you are trying to do here...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com