Re: [Question] About XFS random buffer write performance

From: Dave Chinner <david@fromorbit.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>,
	Zhengyuan Liu <liuzhengyuang521@gmail.com>,
	linux-xfs@vger.kernel.org,
	Zhengyuan Liu <liuzhengyuan@kylinos.cn>,
	Christoph Hellwig <hch@infradead.org>
Subject: Re: [Question] About XFS random buffer write performance
Date: Wed, 29 Jul 2020 11:54:58 +1000	[thread overview]
Message-ID: <20200729015458.GY2005@dread.disaster.area> (raw)
In-Reply-To: <20200728154753.GS23808@casper.infradead.org>

On Tue, Jul 28, 2020 at 04:47:53PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 28, 2020 at 08:34:53AM -0700, Darrick J. Wong wrote:
> > On Tue, Jul 28, 2020 at 07:34:39PM +0800, Zhengyuan Liu wrote:
> > > Hi all,
> > > 
> > > When doing random buffer write testing I found the bandwidth on EXT4 is much
> > > better than XFS under the same environment.
> > > The test case ,test result and test environment is as follows:
> > > Test case:
> > > fio --ioengine=sync --rw=randwrite --iodepth=64 --size=4G --name=test
> > > --filename=/mnt/testfile --bs=4k
> > > Before doing fio, use dd (if=/dev/zero of=/mnt/testfile bs=1M
> > > count=4096) to warm-up the file in the page cache.
> > > 
> > > Test result (bandwidth):
> > >          ext4                   xfs
> > >        ~300MB/s       ~120MB/s
> > > 
> > > Test environment:
> > >     Platform:  arm64
> > >     Kernel:  v5.7
> > >     PAGESIZE:  64K
> > >     Memtotal:  16G

So it's capturing roughly 2GB of random 4kB writes before it starts
blocking waiting for writeback.

What happens when you change the size of the file (say 512MB, 1GB,
2GB, 8GB, 16GB, etc)? Does this change the result at all?

i.e. are we just looking at a specific behaviour triggered by the
specific data set size?  I would suspect that the larger the file,
the greater the performance differential as XFS will drive the SSD
to be bandwidth bound before ext4, and that if you have a faster SSd
(e.g. nvme on pcie 4x) there would be a much smaller difference as
the workload won't end up IO bandwidth bound....

I also suspect that you'll get a different result running on spinning
disks. i.e., it is likely that you'll get the oppposite result (i.e
XFS is faster than ext4) because each 64kB page write IO from XFS
captures multiple 4KB user writes and so results in fewer seeks than
ext4 doing individual 4kB IOs.

> > >     Storage: sata ssd(Max bandwidth about 350MB/s)
> > >     FS block size: 4K
> > > 
> > > The  fio "Test result" shows that EXT4 has more than 2x bandwidth compared to
> > > XFS, but iostat shows the transfer speed of XFS to SSD is about 300MB/s too.
> > > So I debt XFS writing back many non-dirty blocks to SSD while  writing back
> > > dirty pages. I tried to read the core writeback code of both
> > > filesystem and found
> > > XFS will write back blocks which is uptodate (seeing iomap_writepage_map()),
> > 
> > Ahhh, right, because iomap tracks uptodate separately for each block in
> > the page, but only tracks dirty status for the whole page.  Hence if you
> > dirty one byte in the 64k page, xfs will write all 64k even though we
> > could get away writing 4k like ext4 does.

Right, iomap intentionally went to a page granularity IO model for
page cache IO, because that's what the page cache uses and largely
gets rid of the need for tracking per-block page state.

> > Hey Christoph & Matthew: If you're already thinking about changing
> > struct iomap_page, should we add the ability to track per-block dirty
> > state to reduce the write amplification that Zhengyuan is asking about?
> > 
> > I'm guessing that between willy's THP series, Dave's iomap chunks
> > series, and whatever Christoph may or may not be writing, at least one
> > of you might have already implemented this? :)
> 
> Well, this is good timing!  I was wondering whether something along
> these lines was an important use-case.
> 
> I propose we do away with the 'uptodate' bit-array and replace it with an
> 'writeback' bit-array.  We set the page uptodate bit whenever the reads to

That's just per-block dirty state tracking. But when we set a single
bit, we still need to set the page dirty flag.

> fill the page have completed rather than checking the 'writeback' array.
> In page_mkwrite, we fill the writeback bit-array on the grounds that we
> have no way to track a block's non-dirtiness and we don't want to scan
> each block at writeback time to see if it's been written to.

You're talking about mmap() access to the file here, not
read/write() syscall access. If page_mkwrite() sets all the
blocks in a page as "needing writeback", how is that different in
any way to just using a single dirty bit? So why wouldn't we just do
this in iomap_set_page_dirty()?

The only place we wouldn't want to set the entire page dirty is
the call from __iomap_write_end() which knows the exact range of the
page that was dirtied. In which case, iomap_set_page_dirty_range()
would be appropriate, right? i.e. we still have to do all the same
page/page cache/inode dirtying, but only that would set a sub-page
range of dirty bits in the iomap_page?

/me doesn't see the point of calling dirty tracking "writeback bits"
when "writeback" is a specific page state that comes between the
"dirty" and "clean" states...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com