From: Zhengyuan Liu <liuzhengyuang521@gmail.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	linux-xfs@vger.kernel.org,
	Zhengyuan Liu <liuzhengyuan@kylinos.cn>,
	Christoph Hellwig <hch@infradead.org>
Subject: Re: [Question] About XFS random buffer write performance
Date: Wed, 29 Jul 2020 21:02:39 +0800
Message-ID: <CAOOPZo7ikpJU2xN7+UWekpLe047C4cAMq_fxSWcRhzVqYUL8rA@mail.gmail.com>
In-Reply-To: <20200729015458.GY2005@dread.disaster.area>

On Wed, Jul 29, 2020 at 9:55 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Jul 28, 2020 at 04:47:53PM +0100, Matthew Wilcox wrote:
> > On Tue, Jul 28, 2020 at 08:34:53AM -0700, Darrick J. Wong wrote:
> > > On Tue, Jul 28, 2020 at 07:34:39PM +0800, Zhengyuan Liu wrote:
> > > > Hi all,
> > > >
> > > > When doing random buffer write testing I found the bandwidth on EXT4 is much
> > > > better than on XFS under the same environment.
> > > > The test case, test result, and test environment are as follows:
> > > > Test case:
> > > > fio --ioengine=sync --rw=randwrite --iodepth=64 --size=4G --name=test
> > > > --filename=/mnt/testfile --bs=4k
> > > > Before doing fio, use dd (if=/dev/zero of=/mnt/testfile bs=1M
> > > > count=4096) to warm-up the file in the page cache.
> > > >
> > > > Test result (bandwidth):
> > > >        ext4         xfs
> > > >      ~300MB/s     ~120MB/s
> > > >
> > > > Test environment:
> > > >     Platform:  arm64
> > > >     Kernel:  v5.7
> > > >     PAGESIZE:  64K
> > > >     Memtotal:  16G
>
> So it's capturing roughly 2GB of random 4kB writes before it starts
> blocking waiting for writeback.
>
> What happens when you change the size of the file (say 512MB, 1GB,
> 2GB, 8GB, 16GB, etc)? Does this change the result at all?

I tested the other file sizes you suggested; here are the results:
    Size      XFS          EXT4
    512M      1.5GB/s      1GB/s
    1G        1.5GB/s      1GB/s
    2G        1.4GB/s      800MB/s
    4G        120MB/s      290MB/s
    8G        60MB/s       280MB/s
For file sizes up to 2G the test is basically a pure in-memory operation.
From 4G upwards the write amplification becomes obvious, which roughly
matches the ~2GB of dirty data you estimated above. It is not strictly
monotonic, though: once the file size exceeds memory we also have to
reclaim and refill page cache (e.g. I found the bandwidth was about
140MB/s with a 16G file).

> i.e. are we just looking at a specific behaviour triggered by the
> specific data set size?  I would suspect that the larger the file,
> the greater the performance differential as XFS will drive the SSD
> to be bandwidth bound before ext4, and that if you have a faster SSD
> (e.g. nvme on pcie 4x) there would be a much smaller difference as
> the workload won't end up IO bandwidth bound....
>
> I also suspect that you'll get a different result running on spinning
> disks. i.e., it is likely that you'll get the opposite result (i.e.
> XFS is faster than ext4) because each 64kB page write IO from XFS
> captures multiple 4KB user writes and so results in fewer seeks than
> ext4 doing individual 4kB IOs.

I found no obvious difference when I tested a 4GB file on a spinning
disk: both got about 8MB/s, although I agree with you that XFS should
be faster than ext4 there. Maybe there is something wrong with my setup.

I also ran the same test on an Intel PCIe NVMe card with a maximum
bandwidth of about 2GB/s:
                 fio bandwidth    nvme bandwidth    cpu usage
    XFS          600MB/s          1.8GB/s           35%
    EXT4         850MB/s          850MB/s           100%
So the write amplification is always there. Even when it does not push
a faster SSD to its bandwidth limit, the bandwidth it wastes isn't
something we expect.
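
(Back-of-the-envelope, using the numbers above: 1.8GB/s of device writes
for 600MB/s of fio writes is roughly 3x amplification, with a worst case
of 64K/4K = 16x if every dirtied 64K page held only a single 4K write.)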

>
> > > >     Storage: sata ssd(Max bandwidth about 350MB/s)
> > > >     FS block size: 4K
> > > >
> > > > The fio "Test result" shows that EXT4 has more than 2x the bandwidth of
> > > > XFS, but iostat shows that XFS's transfer speed to the SSD is about 300MB/s
> > > > too. So I suspect XFS is writing back many non-dirty blocks to the SSD while
> > > > writing back dirty pages. I tried to read the core writeback code of both
> > > > filesystems and found that XFS writes back blocks which are uptodate
> > > > (see iomap_writepage_map()),
> > >
> > > Ahhh, right, because iomap tracks uptodate separately for each block in
> > > the page, but only tracks dirty status for the whole page.  Hence if you
> > > dirty one byte in the 64k page, xfs will write all 64k even though we
> > > could get away with writing 4k like ext4 does.
>
> Right, iomap intentionally went to a page granularity IO model for
> page cache IO, because that's what the page cache uses and largely
> gets rid of the need for tracking per-block page state.
>
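To make sure I understand where the amplification comes from, here is a
very simplified sketch of the sub-page state involved (written just for
this mail -- the names and fields are approximations, not the actual
iomap code):

/*
 * On a 64K-page kernel with 4K fs blocks there are 16 blocks per page.
 * iomap keeps a per-block uptodate bitmap, but dirtiness only exists as
 * the single PG_dirty flag on the struct page.
 */
struct iomap_page {			/* simplified sketch */
	DECLARE_BITMAP(uptodate, 16);	/* per-block uptodate state */
	/* no per-block dirty bitmap */
};

/*
 * After the dd warm-up every block is uptodate, so when writeback walks
 * a dirty page the per-block uptodate check in iomap_writepage_map()
 * cannot exclude anything: a single 4K overwrite becomes a 64K write.
 */
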
> > > Hey Christoph & Matthew: If you're already thinking about changing
> > > struct iomap_page, should we add the ability to track per-block dirty
> > > state to reduce the write amplification that Zhengyuan is asking about?
> > >
> > > I'm guessing that between willy's THP series, Dave's iomap chunks
> > > series, and whatever Christoph may or may not be writing, at least one
> > > of you might have already implemented this? :)
> >
> > Well, this is good timing!  I was wondering whether something along
> > these lines was an important use-case.
> >
> > I propose we do away with the 'uptodate' bit-array and replace it with a
> > 'writeback' bit-array.  We set the page uptodate bit whenever the reads to
>
> That's just per-block dirty state tracking. But when we set a single
> bit, we still need to set the page dirty flag.
>
>
> > fill the page have completed rather than checking the 'writeback' array.
> > In page_mkwrite, we fill the writeback bit-array on the grounds that we
> > have no way to track a block's non-dirtiness and we don't want to scan
> > each block at writeback time to see if it's been written to.
>
> You're talking about mmap() access to the file here, not
> read/write() syscall access. If page_mkwrite() sets all the
> blocks in a page as "needing writeback", how is that different in
> any way to just using a single dirty bit? So why wouldn't we just do
> this in iomap_set_page_dirty()?
>
> The only place we wouldn't want to set the entire page dirty is
> the call from __iomap_write_end() which knows the exact range of the
> page that was dirtied. In which case, iomap_set_page_dirty_range()
> would be appropriate, right? i.e. we still have to do all the same
> page/page cache/inode dirtying, but only that would set a sub-page
> range of dirty bits in the iomap_page?
>
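If I understand the proposal correctly, something along these lines -- this
is purely a sketch with made-up names to check my understanding, not a
patch against the real code:

/* Sketch: add a per-block dirty bitmap next to the uptodate bitmap. */
struct iomap_page {			/* simplified sketch */
	DECLARE_BITMAP(uptodate, 16);
	DECLARE_BITMAP(dirty, 16);	/* new: per-block dirty state */
};

/* Called from __iomap_write_end() with the exact range that was written. */
static void iomap_set_page_dirty_range(struct inode *inode, struct page *page,
				       unsigned int off, unsigned int len)
{
	struct iomap_page *iop = to_iomap_page(page);
	unsigned int first = off >> inode->i_blkbits;
	unsigned int last = (off + len - 1) >> inode->i_blkbits;

	bitmap_set(iop->dirty, first, last - first + 1);
	iomap_set_page_dirty(page);	/* page/page cache/inode dirtying as before */
}

/*
 * page_mkwrite() would still mark every block dirty (we can't know which
 * bytes mmap will touch), and writeback would skip blocks whose dirty bit
 * is clear instead of writing every uptodate block in the page.
 */
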
> /me doesn't see the point of calling dirty tracking "writeback bits"
> when "writeback" is a specific page state that comes between the
> "dirty" and "clean" states...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com


Thread overview: 18+ messages
2020-07-28 11:34 [Question] About XFS random buffer write performance Zhengyuan Liu
2020-07-28 15:34 ` Darrick J. Wong
2020-07-28 15:47   ` Matthew Wilcox
2020-07-29  1:54     ` Dave Chinner
2020-07-29  2:12       ` Matthew Wilcox
2020-07-29  5:19         ` Dave Chinner
2020-07-29 18:50           ` Matthew Wilcox
2020-07-29 23:05             ` Dave Chinner
2020-07-30 13:50               ` Matthew Wilcox
2020-07-30 22:08                 ` Dave Chinner
2020-07-30 23:45                   ` Matthew Wilcox
2020-07-31  2:05                     ` Dave Chinner
2020-07-31  2:37                       ` Matthew Wilcox
2020-07-31 20:47                     ` Matthew Wilcox
2020-07-31 22:13                       ` Dave Chinner
2020-08-21  2:39                         ` Zhengyuan Liu
2020-07-31  6:55                   ` Christoph Hellwig
2020-07-29 13:02       ` Zhengyuan Liu [this message]
