From: Dave Chinner <david@fromorbit.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>,
	Zhengyuan Liu <liuzhengyuang521@gmail.com>,
	linux-xfs@vger.kernel.org,
	Zhengyuan Liu <liuzhengyuan@kylinos.cn>,
	Christoph Hellwig <hch@infradead.org>
Subject: Re: [Question] About XFS random buffer write performance
Date: Fri, 31 Jul 2020 08:08:57 +1000
Message-ID: <20200730220857.GD2005@dread.disaster.area>
In-Reply-To: <20200730135040.GD23808@casper.infradead.org>

On Thu, Jul 30, 2020 at 02:50:40PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 30, 2020 at 09:05:03AM +1000, Dave Chinner wrote:
> > On Wed, Jul 29, 2020 at 07:50:35PM +0100, Matthew Wilcox wrote:
> > > I had a bit of a misunderstanding.  Let's discard that proposal
> > > and discuss what we want to optimise for, ignoring THPs.  We don't
> > > need to track any per-block state, of course.  We could implement
> > > __iomap_write_begin() by reading in the entire page (skipping the last
> > > few blocks if they lie outside i_size, of course) and then marking the
> > > entire page Uptodate.
> > 
> > __iomap_write_begin() already does read-around for sub-page writes.
> > And, if necessary, it does zeroing of unwritten extents, newly
> > allocated ranges and ranges beyond EOF and marks them uptodate
> > appropriately.
> 
> But it doesn't read in the entire page, just the blocks in the page which
> will be touched by the write.

Ah, you are right, I got my page/offset macros mixed up.

In which case, you just identified why the uptodate array is
necessary and can't be removed. If we do a sub-page write(), the
page is not fully initialised, and so if we then mmap it, readpage
needs to know which parts of the page require initialisation to
bring the page uptodate before it is exposed to userspace.
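
To make this concrete, here's a minimal userspace sketch of the idea
(illustrative only: this is not the kernel's actual iomap sub-page
state, and all names and helpers here are hypothetical). A per-block
uptodate bitmap is what lets readpage initialise only the blocks a
sub-page write() didn't touch:

#include <limits.h>
#include <stdbool.h>

#define PAGE_SZ		65536	/* 64kB page size, as in the benchmark */
#define BLOCK_SZ	4096	/* 4kB filesystem blocks */
#define BLOCKS_PER_PAGE	(PAGE_SZ / BLOCK_SZ)
#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/* hypothetical stand-in for per-page sub-block uptodate state */
struct page_state {
	unsigned long uptodate[(BLOCKS_PER_PAGE + BITS_PER_LONG - 1) /
			       BITS_PER_LONG];
};

static void mark_block_uptodate(struct page_state *ps, unsigned int block)
{
	ps->uptodate[block / BITS_PER_LONG] |= 1UL << (block % BITS_PER_LONG);
}

static bool block_uptodate(const struct page_state *ps, unsigned int block)
{
	return ps->uptodate[block / BITS_PER_LONG] &
		(1UL << (block % BITS_PER_LONG));
}

/*
 * What readpage must do for a partially written page before it can be
 * exposed to userspace via mmap(): read only the blocks the sub-page
 * write() did not initialise, after which the whole page is uptodate.
 */
static void bring_page_uptodate(struct page_state *ps)
{
	unsigned int b;

	for (b = 0; b < BLOCKS_PER_PAGE; b++) {
		if (!block_uptodate(ps, b)) {
			/* read_block_from_disk(ps, b);  hypothetical helper */
			mark_block_uptodate(ps, b);
		}
	}
}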

But that also means the behaviour of the 4kB write on 64kB page
size benchmark is unexplained, because that should only be marking
the written blocks of the page uptodate, and so it should be
behaving exactly like ext4 and only writing back the individual
uptodate chunks of the dirty page....
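
Continuing the sketch above (again illustrative, not ext4's or XFS's
actual writeback code), this is the writeback behaviour that
per-block state should give us: coalesce contiguous runs of uptodate
blocks on the dirty page and submit one IO per run, so a lone 4kB
write produces a single 4kB IO rather than a 64kB page-sized write:

static void write_back_page(struct page_state *ps)
{
	unsigned int b = 0;

	while (b < BLOCKS_PER_PAGE) {
		/* skip blocks that were never initialised */
		if (!block_uptodate(ps, b)) {
			b++;
			continue;
		}
		/* coalesce a contiguous run of uptodate blocks */
		unsigned int start = b;

		while (b < BLOCKS_PER_PAGE && block_uptodate(ps, b))
			b++;
		/*
		 * submit_write(start * BLOCK_SZ, (b - start) * BLOCK_SZ);
		 * hypothetical helper standing in for bio submission
		 */
		(void)start;	/* the real code would submit the IO here */
	}
}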

So, we need to see the iostat output from that test workload to
determine whether XFS is doing page-sized IO or something different.
I suspect it's spewing huge numbers of 4-16kB writes, not PAGE_SIZEd
writes...

> > Modern SSDs really don't care about runs of zeros being written.
> > They compress and/or deduplicate such things on the fly as part of
> > their internal write-amplification reduction strategies. Pretty much
> > all SSDs on the market these days - consumer or enterprise - do this
> > sort of thing in their FTLs and so writing more than the exact
> > changed data really doesn't make a difference.
> 
> You're clearly talking to different SSD people than I am.

Perhaps so.

But it was pretty clear way back in the days of the early SandForce
SSD controllers that compression and zero detection at the FTL level
resulted in massive reductions in write amplification right down at
the hardware level. The next generation of controllers all did this
so they could compete on performance. They still do this, which is
why industry benchmarks test performance with incompressible data,
so that they expose the flash write performance, not just the rate
at which the drive can detect and elide runs of zeros...

Note: I'm not saying that we shouldn't reduce the write bandwidth
being consumed here, just that arguments about write
amplification are really not that convincing. We've *never* cared
about write amplification in XFS (indeed, we've never really cared
about SSD characteristics at all), yet it's consistently the fastest
filesystem on high end SSD storage because stuff like concurrency
and efficient dispatch of IO and deterministic behaviour matter far
more than write amplification.

IOWs, showing that even high end devices end up bandwidth limited
under common workloads using default configurations is a much more
convincing argument...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
