linux-xfs.vger.kernel.org archive mirror
* [Question] About XFS random buffer write performance
@ 2020-07-28 11:34 Zhengyuan Liu
  2020-07-28 15:34 ` Darrick J. Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Zhengyuan Liu @ 2020-07-28 11:34 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, Zhengyuan Liu

Hi all,

When doing random buffered write testing I found that the bandwidth on EXT4
is much better than on XFS in the same environment.
The test case, test result and test environment are as follows:
Test case:
fio --ioengine=sync --rw=randwrite --iodepth=64 --size=4G --name=test \
    --filename=/mnt/testfile --bs=4k
Before running fio, use dd (if=/dev/zero of=/mnt/testfile bs=1M count=4096)
to warm up the file in the page cache.

Test result (bandwidth):
          ext4         xfs
        ~300MB/s     ~120MB/s

Test environment:
    Platform:  arm64
    Kernel:  v5.7
    PAGESIZE:  64K
    Memtotal:  16G
    Storage:  SATA SSD (max bandwidth about 350MB/s)
    FS block size: 4K

The fio result shows that EXT4 gets more than 2x the bandwidth of XFS, yet
iostat shows that XFS is also transferring about 300MB/s to the SSD.  So I
suspect XFS is writing back many non-dirty blocks while writing back dirty
pages.  I read the core writeback code of both filesystems and found that
XFS writes back every block that is uptodate (see iomap_writepage_map()),
while EXT4 only writes back blocks that are dirty (see
ext4_bio_write_page()).  XFS switched from buffer heads to iomap in v4.8,
and iomap only keeps a bitmap tracking each block's uptodate status; there
is no per-block 'dirty' state.  My question is: is this the reason why XFS
writes so many extra blocks to the SSD during random buffered writes?  If
it is, why don't we track the dirty status of blocks in XFS?
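
To make the difference concrete, here is a simplified userspace sketch of
the two per-block writeback decisions (illustrative only, not the actual
kernel code; the helper names are mine):

#include <stdbool.h>

/* 64K page with 4K fs blocks => 16 blocks per page (my test setup). */
#define BLOCKS_PER_PAGE 16

/* iomap (v5.7): only an uptodate bitmap exists, so once the page is
 * dirty, every uptodate block gets written back. */
static bool iomap_writes_block(const bool uptodate[BLOCKS_PER_PAGE], int i)
{
        return uptodate[i];     /* no per-block dirty information */
}

/* ext4: buffer heads carry a per-block dirty bit, so only blocks that
 * were actually written get submitted. */
static bool ext4_writes_block(const bool bh_dirty[BLOCKS_PER_PAGE], int i)
{
        return bh_dirty[i];
}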

With these questions in mind, I started digging into XFS's history and found
this comment in v2.6.12:
        /*
         * Calling this without startio set means we are being asked to make
         * a dirty page ready for freeing it's buffers.  When called with
         * startio set then we are coming from writepage.
         * When called with startio set it is important that we write the
         * WHOLE page if possible.
         * The bh->b_state's cannot know if any of the blocks or which block
         * for that matter are dirty due to mmap writes, and therefore bh
         * uptodate is only vaild if the page itself isn't completely uptodate.
         * Some layers may clear the page dirty flag prior to calling write
         * page, under the assumption the entire page will be written out; by
         * not writing out the whole page the page can be reused before all
         * valid dirty data is written out.  Note: in the case of a page that
         * has been dirty'd by mapwrite and but partially setup by
         * block_prepare_write the bh->b_states's will not agree and only ones
         * setup by BPW/BCW will have valid state, thus the whole page must be
         * written out thing.
         */
        STATIC int xfs_page_state_convert()

From the comment above, it seems this has something to do with mmap, but I
can't quite get the point, so I'm turning to you for help.  In any case, I
don't think there should be such a difference in random write performance
between XFS and EXT4.

Any reply would be appreciated.  Thanks in advance.

* Re: [Question] About XFS random buffer write performance
  2020-07-28 11:34 [Question] About XFS random buffer write performance Zhengyuan Liu
@ 2020-07-28 15:34 ` Darrick J. Wong
  2020-07-28 15:47   ` Matthew Wilcox
  0 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2020-07-28 15:34 UTC (permalink / raw)
  To: Zhengyuan Liu
  Cc: linux-xfs, Zhengyuan Liu, Christoph Hellwig, Matthew Wilcox,
	Dave Chinner

[add hch and willy to cc]

On Tue, Jul 28, 2020 at 07:34:39PM +0800, Zhengyuan Liu wrote:
> Hi all,
> 
> When doing random buffer write testing I found the bandwidth on EXT4 is much
> better than XFS under the same environment.
> The test case ,test result and test environment is as follows:
> Test case:
> fio --ioengine=sync --rw=randwrite --iodepth=64 --size=4G --name=test
> --filename=/mnt/testfile --bs=4k
> Before doing fio, use dd (if=/dev/zero of=/mnt/testfile bs=1M
> count=4096) to warm-up the file in the page cache.
> 
> Test result (bandwidth):
>          ext4                   xfs
>        ~300MB/s       ~120MB/s
> 
> Test environment:
>     Platform:  arm64
>     Kernel:  v5.7
>     PAGESIZE:  64K
>     Memtotal:  16G
>     Storage: sata ssd(Max bandwidth about 350MB/s)
>     FS block size: 4K
> 
> The  fio "Test result" shows that EXT4 has more than 2x bandwidth compared to
> XFS, but iostat shows the transfer speed of XFS to SSD is about 300MB/s too.
> So I debt XFS writing back many non-dirty blocks to SSD while  writing back
> dirty pages. I tried to read the core writeback code of both
> filesystem and found
> XFS will write back blocks which is uptodate (seeing iomap_writepage_map()),

Ahhh, right, because iomap tracks uptodate separately for each block in
the page, but only tracks dirty status for the whole page.  Hence if you
dirty one byte in the 64k page, xfs will write all 64k even though we
could get away with writing 4k like ext4 does.
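
For reference, the per-page state we keep looks roughly like this in v5.7
(from memory, so the exact fields may differ a little) -- note that there
is an uptodate bitmap but no dirty bitmap:

/* Approximate shape of struct iomap_page in fs/iomap/buffered-io.c
 * around v5.7; details from memory.  Per-block uptodate only, no
 * per-block dirty state. */
struct iomap_page {
        atomic_t                read_count;
        atomic_t                write_count;
        DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
};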

Hey Christoph & Matthew: If you're already thinking about changing
struct iomap_page, should we add the ability to track per-block dirty
state to reduce the write amplification that Zhengyuan is asking about?

I'm guessing that between willy's THP series, Dave's iomap chunks
series, and whatever Christoph may or may not be writing, at least one
of you might have already implemented this? :)

--D

> while EXT4 writes back blocks which must be dirty (seeing
> ext4_bio_write_page() ) . XFS had turned from buffer head to iomap since
> V4.8, there is only a bitmap in iomap to track block's uptodate
> status, no 'dirty'
> concept was found, my question is if this is the reason why XFS writes many
> extra blocks to SSD when doing random buffer write? If it is, then why don't we
> track the dirty status of blocks in XFS?
> 
> With the questions in brain, I start digging into XFS's history, and found a
> annotations in V2.6.12:
>         /*
>          * Calling this without startio set means we are being asked
> to make a dirty
>          * page ready for freeing it's buffers.  When called with
> startio set then
>          * we are coming from writepage.
>          * When called with startio set it is important that we write the WHOLE
>          * page if possible.
>          * The bh->b_state's cannot know if any of the blocks or which block for
>          * that matter are dirty due to mmap writes, and therefore bh
> uptodate is
>          * only vaild if the page itself isn't completely uptodate.  Some layers
>          * may clear the page dirty flag prior to calling write page, under the
>          * assumption the entire page will be written out; by not
> writing out the
>          * whole page the page can be reused before all valid dirty data is
>          * written out.  Note: in the case of a page that has been dirty'd by
>          * mapwrite and but partially setup by block_prepare_write the
>          * bh->b_states's will not agree and only ones setup by BPW/BCW will
>          * have valid state, thus the whole page must be written out thing.
>          */
>         STATIC int xfs_page_state_convert()
> 
> From above annotations, It seems this has something to do with mmap, but I
> can't get the point , so I turn to you guys to get the help. Anyway, I don't
> think there is such a difference about random write between XFS and EXT4.
> 
> Any reply would be appreciative, Thanks in advance.

* Re: [Question] About XFS random buffer write performance
  2020-07-28 15:34 ` Darrick J. Wong
@ 2020-07-28 15:47   ` Matthew Wilcox
  2020-07-29  1:54     ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Matthew Wilcox @ 2020-07-28 15:47 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Zhengyuan Liu, linux-xfs, Zhengyuan Liu, Christoph Hellwig, Dave Chinner

On Tue, Jul 28, 2020 at 08:34:53AM -0700, Darrick J. Wong wrote:
> On Tue, Jul 28, 2020 at 07:34:39PM +0800, Zhengyuan Liu wrote:
> > Hi all,
> > 
> > When doing random buffer write testing I found the bandwidth on EXT4 is much
> > better than XFS under the same environment.
> > The test case ,test result and test environment is as follows:
> > Test case:
> > fio --ioengine=sync --rw=randwrite --iodepth=64 --size=4G --name=test
> > --filename=/mnt/testfile --bs=4k
> > Before doing fio, use dd (if=/dev/zero of=/mnt/testfile bs=1M
> > count=4096) to warm-up the file in the page cache.
> > 
> > Test result (bandwidth):
> >          ext4                   xfs
> >        ~300MB/s       ~120MB/s
> > 
> > Test environment:
> >     Platform:  arm64
> >     Kernel:  v5.7
> >     PAGESIZE:  64K
> >     Memtotal:  16G
> >     Storage: sata ssd(Max bandwidth about 350MB/s)
> >     FS block size: 4K
> > 
> > The  fio "Test result" shows that EXT4 has more than 2x bandwidth compared to
> > XFS, but iostat shows the transfer speed of XFS to SSD is about 300MB/s too.
> > So I debt XFS writing back many non-dirty blocks to SSD while  writing back
> > dirty pages. I tried to read the core writeback code of both
> > filesystem and found
> > XFS will write back blocks which is uptodate (seeing iomap_writepage_map()),
> 
> Ahhh, right, because iomap tracks uptodate separately for each block in
> the page, but only tracks dirty status for the whole page.  Hence if you
> dirty one byte in the 64k page, xfs will write all 64k even though we
> could get away writing 4k like ext4 does.
> 
> Hey Christoph & Matthew: If you're already thinking about changing
> struct iomap_page, should we add the ability to track per-block dirty
> state to reduce the write amplification that Zhengyuan is asking about?
> 
> I'm guessing that between willy's THP series, Dave's iomap chunks
> series, and whatever Christoph may or may not be writing, at least one
> of you might have already implemented this? :)

Well, this is good timing!  I was wondering whether something along
these lines was an important use-case.

I propose we do away with the 'uptodate' bit-array and replace it with a
'writeback' bit-array.  We set the page uptodate bit whenever the reads to
fill the page have completed rather than checking the 'writeback' array.
In page_mkwrite, we fill the writeback bit-array on the grounds that we
have no way to track a block's non-dirtiness and we don't want to scan
each block at writeback time to see if it's been written to.

I'll do this now before the THP series gets reposted.

* Re: [Question] About XFS random buffer write performance
  2020-07-28 15:47   ` Matthew Wilcox
@ 2020-07-29  1:54     ` Dave Chinner
  2020-07-29  2:12       ` Matthew Wilcox
  2020-07-29 13:02       ` Zhengyuan Liu
  0 siblings, 2 replies; 18+ messages in thread
From: Dave Chinner @ 2020-07-29  1:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, Zhengyuan Liu, linux-xfs, Zhengyuan Liu,
	Christoph Hellwig

On Tue, Jul 28, 2020 at 04:47:53PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 28, 2020 at 08:34:53AM -0700, Darrick J. Wong wrote:
> > On Tue, Jul 28, 2020 at 07:34:39PM +0800, Zhengyuan Liu wrote:
> > > Hi all,
> > > 
> > > When doing random buffer write testing I found the bandwidth on EXT4 is much
> > > better than XFS under the same environment.
> > > The test case ,test result and test environment is as follows:
> > > Test case:
> > > fio --ioengine=sync --rw=randwrite --iodepth=64 --size=4G --name=test
> > > --filename=/mnt/testfile --bs=4k
> > > Before doing fio, use dd (if=/dev/zero of=/mnt/testfile bs=1M
> > > count=4096) to warm-up the file in the page cache.
> > > 
> > > Test result (bandwidth):
> > >          ext4                   xfs
> > >        ~300MB/s       ~120MB/s
> > > 
> > > Test environment:
> > >     Platform:  arm64
> > >     Kernel:  v5.7
> > >     PAGESIZE:  64K
> > >     Memtotal:  16G

So it's capturing roughly 2GB of random 4kB writes before it starts
blocking waiting for writeback.

What happens when you change the size of the file (say 512MB, 1GB,
2GB, 8GB, 16GB, etc)? Does this change the result at all?

i.e. are we just looking at a specific behaviour triggered by the
specific data set size?  I would suspect that the larger the file,
the greater the performance differential as XFS will drive the SSD
to be bandwidth bound before ext4, and that if you have a faster SSD
(e.g. nvme on pcie 4x) there would be a much smaller difference as
the workload won't end up IO bandwidth bound....

I also suspect that you'll get a different result running on spinning
disks. i.e., it is likely that you'll get the opposite result (i.e.
XFS is faster than ext4) because each 64kB page write IO from XFS
captures multiple 4KB user writes and so results in fewer seeks than
ext4 doing individual 4kB IOs.

> > >     Storage: sata ssd(Max bandwidth about 350MB/s)
> > >     FS block size: 4K
> > > 
> > > The  fio "Test result" shows that EXT4 has more than 2x bandwidth compared to
> > > XFS, but iostat shows the transfer speed of XFS to SSD is about 300MB/s too.
> > > So I debt XFS writing back many non-dirty blocks to SSD while  writing back
> > > dirty pages. I tried to read the core writeback code of both
> > > filesystem and found
> > > XFS will write back blocks which is uptodate (seeing iomap_writepage_map()),
> > 
> > Ahhh, right, because iomap tracks uptodate separately for each block in
> > the page, but only tracks dirty status for the whole page.  Hence if you
> > dirty one byte in the 64k page, xfs will write all 64k even though we
> > could get away writing 4k like ext4 does.

Right, iomap intentionally went to a page granularity IO model for
page cache IO, because that's what the page cache uses and largely
gets rid of the need for tracking per-block page state.

> > Hey Christoph & Matthew: If you're already thinking about changing
> > struct iomap_page, should we add the ability to track per-block dirty
> > state to reduce the write amplification that Zhengyuan is asking about?
> > 
> > I'm guessing that between willy's THP series, Dave's iomap chunks
> > series, and whatever Christoph may or may not be writing, at least one
> > of you might have already implemented this? :)
> 
> Well, this is good timing!  I was wondering whether something along
> these lines was an important use-case.
> 
> I propose we do away with the 'uptodate' bit-array and replace it with an
> 'writeback' bit-array.  We set the page uptodate bit whenever the reads to

That's just per-block dirty state tracking. But when we set a single
bit, we still need to set the page dirty flag.


> fill the page have completed rather than checking the 'writeback' array.
> In page_mkwrite, we fill the writeback bit-array on the grounds that we
> have no way to track a block's non-dirtiness and we don't want to scan
> each block at writeback time to see if it's been written to.

You're talking about mmap() access to the file here, not
read/write() syscall access. If page_mkwrite() sets all the
blocks in a page as "needing writeback", how is that different in
any way to just using a single dirty bit? So why wouldn't we just do
this in iomap_set_page_dirty()?

The only place we wouldn't want to set the entire page dirty is
the call from __iomap_write_end() which knows the exact range of the
page that was dirtied. In which case, iomap_set_page_dirty_range()
would be appropriate, right? i.e. we still have to do all the same
page/page cache/inode dirtying, but only that would set a sub-page
range of dirty bits in the iomap_page?
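
i.e. something along these lines - purely hypothetical, since neither the
per-block dirty bitmap nor the helper exists today:

/* Hypothetical sketch only: a per-block dirty bitmap in the
 * iomap_page plus a ranged dirty helper. */
int iomap_set_page_dirty_range(struct page *page, unsigned int off,
                               unsigned int len)
{
        struct iomap_page *iop = to_iomap_page(page);
        struct inode *inode = page->mapping->host;
        unsigned int first = off >> inode->i_blkbits;
        unsigned int last = (off + len - 1) >> inode->i_blkbits;

        if (iop) {
                unsigned int i;

                for (i = first; i <= last; i++)
                        set_bit(i, iop->dirty); /* hypothetical bitmap */
        }

        /* still do all the usual page/page cache/inode dirtying */
        return iomap_set_page_dirty(page);
}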

/me doesn't see the point of calling dirty tracking "writeback bits"
when "writeback" is a specific page state that comes between the
"dirty" and "clean" states...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [Question] About XFS random buffer write performance
  2020-07-29  1:54     ` Dave Chinner
@ 2020-07-29  2:12       ` Matthew Wilcox
  2020-07-29  5:19         ` Dave Chinner
  2020-07-29 13:02       ` Zhengyuan Liu
  1 sibling, 1 reply; 18+ messages in thread
From: Matthew Wilcox @ 2020-07-29  2:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Zhengyuan Liu, linux-xfs, Zhengyuan Liu,
	Christoph Hellwig

On Wed, Jul 29, 2020 at 11:54:58AM +1000, Dave Chinner wrote:
> On Tue, Jul 28, 2020 at 04:47:53PM +0100, Matthew Wilcox wrote:
> > I propose we do away with the 'uptodate' bit-array and replace it with an
> > 'writeback' bit-array.  We set the page uptodate bit whenever the reads to
> 
> That's just per-block dirty state tracking. But when we set a single
> bit, we still need to set the page dirty flag.

It's not exactly dirty, though.  It's 'present' (i.e. the opposite
of a hole).  I'm not attached to the name.  So it can be used to
implement iomap_is_partially_uptodate.  If the page is dirty, the chunks
corresponding to the present bits get written back, but we don't track
a per-block dirty state.

> > fill the page have completed rather than checking the 'writeback' array.
> > In page_mkwrite, we fill the writeback bit-array on the grounds that we
> > have no way to track a block's non-dirtiness and we don't want to scan
> > each block at writeback time to see if it's been written to.
> 
> You're talking about mmap() access to the file here, not
> read/write() syscall access. If page_mkwrite() sets all the
> blocks in a page as "needing writeback", how is that different in
> any way to just using a single dirty bit? So why wouldn't we just do
> this in iomap_set_page_dirty()?

iomap_set_page_dirty() is called from iomap_page_mkwrite_actor(), so
sure!

> The only place we wouldn't want to set the entire page dirty is
> the call from __iomap_write_end() which knows the exact range of the
> page that was dirtied. In which case, iomap_set_page_dirty_range()
> would be appropriate, right? i.e. we still have to do all the same
> page/page cache/inode dirtying, but only that would set a sub-page
> range of dirty bits in the iomap_page?
> 
> /me doesn't see the point of calling dirty tracking "writeback bits"
> when "writeback" is a specific page state that comes between the
> "dirty" and "clean" states...

I don't want to get it confused with page states.  This is a different
thing.  It's just tracking which blocks are holes (and have definitely
not been written to), so those blocks can remain as holes when the page
gets written back.

* Re: [Question] About XFS random buffer write performance
  2020-07-29  2:12       ` Matthew Wilcox
@ 2020-07-29  5:19         ` Dave Chinner
  2020-07-29 18:50           ` Matthew Wilcox
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2020-07-29  5:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, Zhengyuan Liu, linux-xfs, Zhengyuan Liu,
	Christoph Hellwig

On Wed, Jul 29, 2020 at 03:12:31AM +0100, Matthew Wilcox wrote:
> On Wed, Jul 29, 2020 at 11:54:58AM +1000, Dave Chinner wrote:
> > On Tue, Jul 28, 2020 at 04:47:53PM +0100, Matthew Wilcox wrote:
> > > I propose we do away with the 'uptodate' bit-array and replace it with an
> > > 'writeback' bit-array.  We set the page uptodate bit whenever the reads to
> > 
> > That's just per-block dirty state tracking. But when we set a single
> > bit, we still need to set the page dirty flag.
> 
> It's not exactly dirty, though.  It's 'present' (ie the opposite
> of hole). 

Careful with your terminology. At the page cache level, there is no
such thing as a "hole". There is only data and whether the data is
up to date or not. The page cache may be *sparsely populated*, but
a lack of a page or a range of the page that is not up to date
does not imply there is a -hole in the file- at that point.

I'm still not sure what "present" is supposed to mean, though,
because it seems no different to "up to date". The data is present
once it's been read into the page, calling page_mkwrite() on the
page doesn't change that at all.

> I'm not attached to the name.  So it can be used to
> implement iomap_is_partially_uptodate.  If the page is dirty, the chunks
> corresponding to the present bits get written back, but we don't track
> a per-block dirty state.

iomap_is_partially_uptodate() only indicates whether data in the
page is entirely valid or not. If it isn't entirely valid, then the
caller has to ask the filesystem whether the underlying range
contains holes or data....

> > > fill the page have completed rather than checking the 'writeback' array.
> > > In page_mkwrite, we fill the writeback bit-array on the grounds that we
> > > have no way to track a block's non-dirtiness and we don't want to scan
> > > each block at writeback time to see if it's been written to.
> > 
> > You're talking about mmap() access to the file here, not
> > read/write() syscall access. If page_mkwrite() sets all the
> > blocks in a page as "needing writeback", how is that different in
> > any way to just using a single dirty bit? So why wouldn't we just do
> > this in iomap_set_page_dirty()?
> 
> iomap_set_page_dirty() is called from iomap_page_mkwrite_actor(), so
> sure!

via set_page_dirty(), which is why I mentioned this:

> > The only place we wouldn't want to set the entire page dirty is
> > the call from __iomap_write_end() which knows the exact range of the
> > page that was dirtied. In which case, iomap_set_page_dirty_range()
> > would be appropriate, right? i.e. we still have to do all the same
> > page/page cache/inode dirtying, but only that would set a sub-page
> > range of dirty bits in the iomap_page?
> > 
> > /me doesn't see the point of calling dirty tracking "writeback bits"
> > when "writeback" is a specific page state that comes between the
> > "dirty" and "clean" states...
> 
> I don't want to get it confused with page states.  This is a different
> thing.  It's just tracking which blocks are holes (and have definitely
> not been written to), so those blocks can remain as holes when the page
> gets written back.

We do not track holes at the page level. We do not want to track
anything to do with the filesystem extent mapping at the page level.
That was something that bufferheads were used for, and was something
we specifically designed iomap specifically not to require.

IOWs, iomap does page cache IO at page level granularity, not block
level granularity.  The only thing we track at block granularity is
whether the range of the page over a given block contains valid data
or not, i.e. whether the page has been initialised with the correct
data or not.

Further, page_mkwrite() has no knowledge of whether the backing
store has holes in it or not, nor does it care. All it does is call
into the filesystem to fill any holes that may exist in the backing
space behind the page.  This is also needed for COW to allocate the
destination of the overwrite, but in either case there is no
interaction with pre-existing holes - that is all done by the
read side of the page fault before page_mkwrite is called...

IOWs, if you call page_mkwrite() on a THP, the filesystem will
allocate/reserve the entire backing space behind the page because
writeback of that THP requires writing the entire page and for
backing space to be fully allocated before that write is issued.

Hence I'm really not sure what you are suggesting we do here because
it doesn't make sense to me. Maybe I'm missing something that THP
does that I'm not aware of, but other than that I'm completely
missing what you are trying to do here...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [Question] About XFS random buffer write performance
  2020-07-29  1:54     ` Dave Chinner
  2020-07-29  2:12       ` Matthew Wilcox
@ 2020-07-29 13:02       ` Zhengyuan Liu
  1 sibling, 0 replies; 18+ messages in thread
From: Zhengyuan Liu @ 2020-07-29 13:02 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Darrick J. Wong, linux-xfs, Zhengyuan Liu,
	Christoph Hellwig

On Wed, Jul 29, 2020 at 9:55 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Jul 28, 2020 at 04:47:53PM +0100, Matthew Wilcox wrote:
> > On Tue, Jul 28, 2020 at 08:34:53AM -0700, Darrick J. Wong wrote:
> > > On Tue, Jul 28, 2020 at 07:34:39PM +0800, Zhengyuan Liu wrote:
> > > > Hi all,
> > > >
> > > > When doing random buffer write testing I found the bandwidth on EXT4 is much
> > > > better than XFS under the same environment.
> > > > The test case ,test result and test environment is as follows:
> > > > Test case:
> > > > fio --ioengine=sync --rw=randwrite --iodepth=64 --size=4G --name=test
> > > > --filename=/mnt/testfile --bs=4k
> > > > Before doing fio, use dd (if=/dev/zero of=/mnt/testfile bs=1M
> > > > count=4096) to warm-up the file in the page cache.
> > > >
> > > > Test result (bandwidth):
> > > >          ext4                   xfs
> > > >        ~300MB/s       ~120MB/s
> > > >
> > > > Test environment:
> > > >     Platform:  arm64
> > > >     Kernel:  v5.7
> > > >     PAGESIZE:  64K
> > > >     Memtotal:  16G
>
> So it's capturing roughly 2GB of random 4kB writes before it starts
> blocking waiting for writeback.
>
> What happens when you change the size of the file (say 512MB, 1GB,
> 2GB, 8GB, 16GB, etc)? Does this change the result at all?

I tested the other file sizes you asked about; here are the results:
    Size      XFS          EXT4
    512M      1.5GB/s      1GB/s
    1G        1.5GB/s      1GB/s
    2G        1.4GB/s      800MB/s
    4G        120MB/s      290MB/s
    8G        60MB/s       280MB/s
For file sizes up to 2G it is basically a pure memory operation.
For file sizes of 4G and above the write amplification is more obvious, but
not always: once the file size exceeds the memory size we need to
reallocate page cache pages (e.g. I found the bandwidth was about 140MB/s
when I set the file size to 16G).

> i.e. are we just looking at a specific behaviour triggered by the
> specific data set size?  I would suspect that the larger the file,
> the greater the performance differential as XFS will drive the SSD
> to be bandwidth bound before ext4, and that if you have a faster SSd
> (e.g. nvme on pcie 4x) there would be a much smaller difference as
> the workload won't end up IO bandwidth bound....
>
> I also suspect that you'll get a different result running on spinning
> disks. i.e., it is likely that you'll get the oppposite result (i.e
> XFS is faster than ext4) because each 64kB page write IO from XFS
> captures multiple 4KB user writes and so results in fewer seeks than
> ext4 doing individual 4kB IOs.

I found no obvious difference when I tested a 4GB file on a spinning
disk; both got about 8MB/s, although I agree with you that XFS should
be faster than ext4 there.  Maybe I did something wrong.

I also ran the same test on an Intel PCIe NVMe card with a max
bandwidth of about 2GB/s:
                fio-bandwidth    nvme-bandwidth    cpu-usage
    XFS           600MB/s           1.8GB/s           35%
    EXT4          850MB/s           850MB/s          100%
So the write amplification is still there; even though it may not hit
the bandwidth limit of a faster SSD, the bandwidth it wastes isn't
something we expect.

>
> > > >     Storage: sata ssd(Max bandwidth about 350MB/s)
> > > >     FS block size: 4K
> > > >
> > > > The  fio "Test result" shows that EXT4 has more than 2x bandwidth compared to
> > > > XFS, but iostat shows the transfer speed of XFS to SSD is about 300MB/s too.
> > > > So I debt XFS writing back many non-dirty blocks to SSD while  writing back
> > > > dirty pages. I tried to read the core writeback code of both
> > > > filesystem and found
> > > > XFS will write back blocks which is uptodate (seeing iomap_writepage_map()),
> > >
> > > Ahhh, right, because iomap tracks uptodate separately for each block in
> > > the page, but only tracks dirty status for the whole page.  Hence if you
> > > dirty one byte in the 64k page, xfs will write all 64k even though we
> > > could get away writing 4k like ext4 does.
>
> Right, iomap intentionally went to a page granularity IO model for
> page cache IO, because that's what the page cache uses and largely
> gets rid of the need for tracking per-block page state.
>
> > > Hey Christoph & Matthew: If you're already thinking about changing
> > > struct iomap_page, should we add the ability to track per-block dirty
> > > state to reduce the write amplification that Zhengyuan is asking about?
> > >
> > > I'm guessing that between willy's THP series, Dave's iomap chunks
> > > series, and whatever Christoph may or may not be writing, at least one
> > > of you might have already implemented this? :)
> >
> > Well, this is good timing!  I was wondering whether something along
> > these lines was an important use-case.
> >
> > I propose we do away with the 'uptodate' bit-array and replace it with an
> > 'writeback' bit-array.  We set the page uptodate bit whenever the reads to
>
> That's just per-block dirty state tracking. But when we set a single
> bit, we still need to set the page dirty flag.
>
>
> > fill the page have completed rather than checking the 'writeback' array.
> > In page_mkwrite, we fill the writeback bit-array on the grounds that we
> > have no way to track a block's non-dirtiness and we don't want to scan
> > each block at writeback time to see if it's been written to.
>
> You're talking about mmap() access to the file here, not
> read/write() syscall access. If page_mkwrite() sets all the
> blocks in a page as "needing writeback", how is that different in
> any way to just using a single dirty bit? So why wouldn't we just do
> this in iomap_set_page_dirty()?
>
> The only place we wouldn't want to set the entire page dirty is
> the call from __iomap_write_end() which knows the exact range of the
> page that was dirtied. In which case, iomap_set_page_dirty_range()
> would be appropriate, right? i.e. we still have to do all the same
> page/page cache/inode dirtying, but only that would set a sub-page
> range of dirty bits in the iomap_page?
>
> /me doesn't see the point of calling dirty tracking "writeback bits"
> when "writeback" is a specific page state that comes between the
> "dirty" and "clean" states...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

* Re: [Question] About XFS random buffer write performance
  2020-07-29  5:19         ` Dave Chinner
@ 2020-07-29 18:50           ` Matthew Wilcox
  2020-07-29 23:05             ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Matthew Wilcox @ 2020-07-29 18:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Zhengyuan Liu, linux-xfs, Zhengyuan Liu,
	Christoph Hellwig

On Wed, Jul 29, 2020 at 03:19:23PM +1000, Dave Chinner wrote:
> On Wed, Jul 29, 2020 at 03:12:31AM +0100, Matthew Wilcox wrote:
> > On Wed, Jul 29, 2020 at 11:54:58AM +1000, Dave Chinner wrote:
> > > On Tue, Jul 28, 2020 at 04:47:53PM +0100, Matthew Wilcox wrote:
> > > > I propose we do away with the 'uptodate' bit-array and replace it with an
> > > > 'writeback' bit-array.  We set the page uptodate bit whenever the reads to
> > > 
> > > That's just per-block dirty state tracking. But when we set a single
> > > bit, we still need to set the page dirty flag.
> > 
> > It's not exactly dirty, though.  It's 'present' (ie the opposite
> > of hole). 
> 
> Careful with your terminology. At the page cache level, there is no
> such thing as a "hole". There is only data and whether the data is
> up to date or not. The page cache may be *sparsely populated*, but
> a lack of a page or a range of the page that is not up to date
> does not imply there is a -hole in the file- at that point.

That's not entirely true.  The current ->uptodate array does keep
track of whether an unwritten extent is currently a hole (see
page_cache_seek_hole_data()).  I don't know how useful that is.

> I'm still not sure what "present" is supposed to mean, though,
> because it seems no different to "up to date". The data is present
> once it's been read into the page, calling page_mkwrite() on the
> page doesn't change that at all.

I had a bit of a misunderstanding.  Let's discard that proposal
and discuss what we want to optimise for, ignoring THPs.  We don't
need to track any per-block state, of course.  We could implement
__iomap_write_begin() by reading in the entire page (skipping the last
few blocks if they lie outside i_size, of course) and then marking the
entire page Uptodate.

Buffer heads track several bits of information about each block:
 - Uptodate (contents of cache at least as recent as storage)
 - Dirty (contents of cache more recent than storage)
 - ... er, I think all the rest are irrelevant for iomap

I think I just talked myself into what you were arguing for -- that we
change the ->uptodate bit array into a ->dirty bit array.

That implies that we lose the current optimisation that we can write at
a blocksize alignment into the page cache and not read from storage.
I'm personally fine with that; most workloads don't care if you read
extra bytes from storage (hence readahead), but writing unnecessarily
to storage (particularly flash) is bad.

Or we keep two bits per block.  The implementation would be a little icky,
but it could be done.
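
Something like this, hypothetically (nothing of the sort exists in the
tree today):

/* Hypothetical two-bits-per-block variant: keep uptodate so that
 * reads/mmap of partially cached pages still work, add dirty so
 * writeback only touches blocks that were actually written. */
struct iomap_page {
        atomic_t                read_count;
        atomic_t                write_count;
        DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
        DECLARE_BITMAP(dirty, PAGE_SIZE / 512);
};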

I like the idea of getting rid of partially uptodate pages.  I've never
really understood the concept.  For me, a partially dirty page makes a
lot more sense than a partially uptodate page.  Perhaps I'm just weird.

Speaking of weird, I don't understand why an unwritten extent queries
the uptodate bits.  Maybe that's a buffer_head thing and we can just
ignore it -- iomap doesn't have such a thing as a !uptodate page any
more.

* Re: [Question] About XFS random buffer write performance
  2020-07-29 18:50           ` Matthew Wilcox
@ 2020-07-29 23:05             ` Dave Chinner
  2020-07-30 13:50               ` Matthew Wilcox
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2020-07-29 23:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, Zhengyuan Liu, linux-xfs, Zhengyuan Liu,
	Christoph Hellwig

On Wed, Jul 29, 2020 at 07:50:35PM +0100, Matthew Wilcox wrote:
> On Wed, Jul 29, 2020 at 03:19:23PM +1000, Dave Chinner wrote:
> > On Wed, Jul 29, 2020 at 03:12:31AM +0100, Matthew Wilcox wrote:
> > > On Wed, Jul 29, 2020 at 11:54:58AM +1000, Dave Chinner wrote:
> > > > On Tue, Jul 28, 2020 at 04:47:53PM +0100, Matthew Wilcox wrote:
> > > > > I propose we do away with the 'uptodate' bit-array and replace it with an
> > > > > 'writeback' bit-array.  We set the page uptodate bit whenever the reads to
> > > > 
> > > > That's just per-block dirty state tracking. But when we set a single
> > > > bit, we still need to set the page dirty flag.
> > > 
> > > It's not exactly dirty, though.  It's 'present' (ie the opposite
> > > of hole). 
> > 
> > Careful with your terminology. At the page cache level, there is no
> > such thing as a "hole". There is only data and whether the data is
> > up to date or not. The page cache may be *sparsely populated*, but
> > a lack of a page or a range of the page that is not up to date
> > does not imply there is a -hole in the file- at that point.
> 
> That's not entirely true.  The current ->uptodate array does keep
> track of whether an unwritten extent is currently a hole (see
> page_cache_seek_hole_data()).  I don't know how useful that is.

"unwritten extent is currently a hole"

Ummm, by definition, an unwritten extent is *not* a hole in the
file. It's an allocated extent that is marked as containing zeroes.

SEEK_HOLE uses the definition that "any contiguous run of zeros may
be considered a hole" and from that, we present unwritten extents as
"holes" to userspace so they don't have to copy them when doing
sparse file operations. This does not mean what the iomap_page
uptodate array is doing is tracking a "hole" in the page.

What page_cache_seek_hole_data() is doing here is determining
whether the range of the page contains -data- or not. If it is
uptodate, then it contains data. If it is not uptodate, then it
does not contain data, and because it is over an unwritten extent,
that means it *must* contain zeros and, for the purposes of
SEEK_DATA/SEEK_HOLE, that means it is considered a hole.

This is a SEEK_HOLE/SEEK_DATA API implementation detail - it uses
the uptodate state of a page over an unwritten extent to determine
if user data has been initialised over the unwritten extent -in
memory- but that data hasn't yet reached disk. Having initialised
data in memory means the range is classified as data, if there is no
data then it is a hole. IOWs, the uptodate bits tell us whether
there is -valid data in the cache for that range-, not whether the
page range spans a hole in the file.

> > I'm still not sure what "present" is supposed to mean, though,
> > because it seems no different to "up to date". The data is present
> > once it's been read into the page, calling page_mkwrite() on the
> > page doesn't change that at all.
> 
> I had a bit of a misunderstanding.  Let's discard that proposal
> and discuss what we want to optimise for, ignoring THPs.  We don't
> need to track any per-block state, of course.  We could implement
> __iomap_write_begin() by reading in the entire page (skipping the last
> few blocks if they lie outside i_size, of course) and then marking the
> entire page Uptodate.

__iomap_write_begin() already does read-around for sub-page writes.
And, if necessary, it does zeroing of unwritten extents, newly
allocated ranges and ranges beyond EOF and marks them uptodate
appropriately.

> Buffer heads track several bits of information about each block:
>  - Uptodate (contents of cache at least as recent as storage)
>  - Dirty (contents of cache more recent than storage)
>  - ... er, I think all the rest are irrelevant for iomap


Yes, it is. And we optimised out the dirty tracking by just using
the single dirty bit in the page.

> I think I just talked myself into what you were arguing for -- that we
> change the ->uptodate bit array into a ->dirty bit array.
> 
> That implies that we lose the current optimisation that we can write at
> a blocksize alignment into the page cache and not read from storage.

iomap does not do that. It always reads the entire page in, even for
block aligned sub-page writes. IIRC, we even allocate on page
granularity for sub-page block size filesystems so that we fill
holes and can do full page writes in writeback because this tends to
significantly reduce worst case file fragmentation for random sparse
writes...

> I'm personally fine with that; most workloads don't care if you read
> extra bytes from storage (hence readahead), but writing unnecessarily
> to storage (particularly flash) is bad.

Modern SSDs really don't care about runs of zeros being written.
They compress and/or deduplicate such things on the fly as part of
their internal write-amplification reduction strategies. Pretty much
all SSDs on the market these days - consumer or enterprise - do this
sort of thing in their FTLs and so writing more than the exact
changed data really doesn't make a difference.

Indeed, if we were to write in flash page sized chunks, we'd be
doing the SSDs a major favour because then they don't have to do
sub-page defragmentation to be able to erase blocks and continue
writing. If you look at where interfaces like Micron's HSE stuff is
going, that's all based around optimising writes to only be done at
erase block granularity and all the copy-on-write stuff is done up
in the application running on the host...

So optimising for small writes on modern SSDs is really only about
minimising the data transfer bandwidth required for lots and lots of
small sub-page writes. That's what this specific test showed; XFS
ran out of IO bandwidth before ext4 did. Put the SSD on a PCIe
interface that has bandwidth to burn, and it's likely a very
different story...

> Or we keep two bits per block.  The implementation would be a little icky,
> but it could be done.
> 
> I like the idea of getting rid of partially uptodate pages.  I've never
> really understood the concept.  For me, a partially dirty page makes a
> lot more sense than a partially uptodate page.  Perhaps I'm just weird.

I think we ended up with partially uptodate tracking because it was
a direct translation of the bufferhead uptodate tracking. Similarly
we have read and write io counts which translated from bufferhead
sub-page IO tracking to determine when to process IO completion.

I agree, I don't think we need the uptodate tracking anymore because
we do IO in full pages already. As for sub page dirtying, I'd need
to see the implementation and the numbers before deciding on that...

The other thing to consider is that some filesystems still use
bufferheads with iomap (e.g. gfs2) and so we might be completely
missing something here w.r.t. partially up to date state. That will
need careful audit, too.

> Speaking of weird, I don't understand why an unwritten extent queries
> the uptodate bits.  Maybe that's a buffer_head thing and we can just
> ignore it -- iomap doesn't have such a thing as a !uptodate page any
> more.

It's a direct translation of the code as it existed when partially
uptodate pages could exist in the cache. The page cache seek
hole/data code is not iomap specific, and so filesystems that use
those helpers may well have partially up to date pages.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

* Re: [Question] About XFS random buffer write performance
  2020-07-29 23:05             ` Dave Chinner
@ 2020-07-30 13:50               ` Matthew Wilcox
  2020-07-30 22:08                 ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Matthew Wilcox @ 2020-07-30 13:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Zhengyuan Liu, linux-xfs, Zhengyuan Liu,
	Christoph Hellwig

On Thu, Jul 30, 2020 at 09:05:03AM +1000, Dave Chinner wrote:
> On Wed, Jul 29, 2020 at 07:50:35PM +0100, Matthew Wilcox wrote:
> > I had a bit of a misunderstanding.  Let's discard that proposal
> > and discuss what we want to optimise for, ignoring THPs.  We don't
> > need to track any per-block state, of course.  We could implement
> > __iomap_write_begin() by reading in the entire page (skipping the last
> > few blocks if they lie outside i_size, of course) and then marking the
> > entire page Uptodate.
> 
> __iomap_write_begin() already does read-around for sub-page writes.
> And, if necessary, it does zeroing of unwritten extents, newly
> allocated ranges and ranges beyond EOF and marks them uptodate
> appropriately.

But it doesn't read in the entire page, just the blocks in the page which
will be touched by the write.

> > Buffer heads track several bits of information about each block:
> >  - Uptodate (contents of cache at least as recent as storage)
> >  - Dirty (contents of cache more recent than storage)
> >  - ... er, I think all the rest are irrelevant for iomap
> 
> 
> Yes, it is. And we optimised out the dirty tracking by just using
> the single dirty bit in the page.
> 
> > I think I just talked myself into what you were arguing for -- that we
> > change the ->uptodate bit array into a ->dirty bit array.
> > 
> > That implies that we lose the current optimisation that we can write at
> > a blocksize alignment into the page cache and not read from storage.
> 
> iomap does not do that. It always reads the entire page in, even for
> block aligned sub-page writes. IIRC, we even allocate on page
> granularity for sub-page block size filesystems so taht we fill
> holes and can do full page writes in writeback because this tends to
> significantly reduce worst case file fragmentation for random sparse
> writes...

That isn't what __iomap_write_begin() does today.

Consider a 1kB block size filesystem and a 4kB page size host.  Trace through
writing 1kB at a 2kB offset into the file.
We call iomap_write_begin() with pos of 2048, len 1024.
Allocate a new page
Call __iomap_write_begin(2048, 1024)
block_start = 2048
block_end = 3072
iomap_adjust_read_range() sets poff and plen to 2048 & 1024
from == 2048, to == 3072, so we continue
block_start + plen == block_end so the loop terminates.
We didn't read anything.
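
In other words, the read is skipped whenever the write completely covers
the not-yet-uptodate block range -- e.g. (illustrative helper only, not
the kernel code):

/* A sub-page read is needed only when the write does not completely
 * cover a not-yet-uptodate block range. */
static bool need_read(unsigned int from, unsigned int to,   /* write range */
                      unsigned int poff, unsigned int plen) /* !uptodate range */
{
        return from > poff || to < poff + plen;
}
/* need_read(2048, 3072, 2048, 1024) == false -> nothing is read. */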

> Modern really SSDs don't care about runs of zeros being written.
> They compress and/or deduplicate such things on the fly as part of
> their internal write-amplification reduction strategies. Pretty much
> all SSDs on the market these days - consumer or enterprise - do this
> sort of thing in their FTLs and so writing more than the exact
> changed data really doesn't make a difference.

You're clearly talking to different SSD people than I am.


* Re: [Question] About XFS random buffer write performance
  2020-07-30 13:50               ` Matthew Wilcox
@ 2020-07-30 22:08                 ` Dave Chinner
  2020-07-30 23:45                   ` Matthew Wilcox
  2020-07-31  6:55                   ` Christoph Hellwig
  0 siblings, 2 replies; 18+ messages in thread
From: Dave Chinner @ 2020-07-30 22:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, Zhengyuan Liu, linux-xfs, Zhengyuan Liu,
	Christoph Hellwig

On Thu, Jul 30, 2020 at 02:50:40PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 30, 2020 at 09:05:03AM +1000, Dave Chinner wrote:
> > On Wed, Jul 29, 2020 at 07:50:35PM +0100, Matthew Wilcox wrote:
> > > I had a bit of a misunderstanding.  Let's discard that proposal
> > > and discuss what we want to optimise for, ignoring THPs.  We don't
> > > need to track any per-block state, of course.  We could implement
> > > __iomap_write_begin() by reading in the entire page (skipping the last
> > > few blocks if they lie outside i_size, of course) and then marking the
> > > entire page Uptodate.
> > 
> > __iomap_write_begin() already does read-around for sub-page writes.
> > And, if necessary, it does zeroing of unwritten extents, newly
> > allocated ranges and ranges beyond EOF and marks them uptodate
> > appropriately.
> 
> But it doesn't read in the entire page, just the blocks in the page which
> will be touched by the write.

Ah, you are right, I got my page/offset macros mixed up.

In which case, you just identified why the uptodate array is
necessary and can't be removed. If we do a sub-page write() the page
is not fully initialised, and so if we then mmap it readpage needs
to know what part of the page requires initialisation to bring the
page uptodate before it is exposed to userspace.

But that also means the behaviour of the 4kB write on 64kB page size
benchmark is unexplained, because that should only be marking the
written blocks of the page up to date, and so it should be behaving
exactly like ext4 and only writing back individual uptodate chunks
on the dirty page....

So, we need to see the iostat output from that test workload to
determine if XFS is doing page size IO or something different. I
suspect it's spewing huge numbers of 4-16kB writes, not PAGE_SIZEd
writes...

> > Modern really SSDs don't care about runs of zeros being written.
> > They compress and/or deduplicate such things on the fly as part of
> > their internal write-amplification reduction strategies. Pretty much
> > all SSDs on the market these days - consumer or enterprise - do this
> > sort of thing in their FTLs and so writing more than the exact
> > changed data really doesn't make a difference.
> 
> You're clearly talking to different SSD people than I am.

Perhaps so.

But it was pretty clear way back in the days of early sandforce SSD
controllers that compression and zero detection at the FTL level
resulted in massive reductions in write amplification right down at
the hardware level. The next generation of controllers all did this
so they could compete on performance. They still do this, which is
why industry benchmarks test performance with incompressible data so
that they expose the flash write performance, not just the rate at
which the drive can detect and elide runs of zeros...

Note: I'm not saying that we shouldn't reduce the write bandwidth
being consumed here, just that arguments about write
amplification are really not that convincing. We've *never* cared
about write amplification in XFS (indeed, we've never really cared
about SSD characteristics at all), yet it's consistently the fastest
filesystem on high end SSD storage because stuff like concurrency
and efficient dispatch of IO and deterministic behaviour matter far
more than write amplification.

IOWs, showing that even high end devices end up bandwidth limited
under common workloads using default configurations is a much more
convincing argument...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [Question] About XFS random buffer write performance
  2020-07-30 22:08                 ` Dave Chinner
@ 2020-07-30 23:45                   ` Matthew Wilcox
  2020-07-31  2:05                     ` Dave Chinner
  2020-07-31 20:47                     ` Matthew Wilcox
  2020-07-31  6:55                   ` Christoph Hellwig
  1 sibling, 2 replies; 18+ messages in thread
From: Matthew Wilcox @ 2020-07-30 23:45 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Zhengyuan Liu, linux-xfs, Zhengyuan Liu,
	Christoph Hellwig

On Fri, Jul 31, 2020 at 08:08:57AM +1000, Dave Chinner wrote:
> On Thu, Jul 30, 2020 at 02:50:40PM +0100, Matthew Wilcox wrote:
> > On Thu, Jul 30, 2020 at 09:05:03AM +1000, Dave Chinner wrote:
> > > On Wed, Jul 29, 2020 at 07:50:35PM +0100, Matthew Wilcox wrote:
> > > > I had a bit of a misunderstanding.  Let's discard that proposal
> > > > and discuss what we want to optimise for, ignoring THPs.  We don't
> > > > need to track any per-block state, of course.  We could implement
> > > > __iomap_write_begin() by reading in the entire page (skipping the last
> > > > few blocks if they lie outside i_size, of course) and then marking the
> > > > entire page Uptodate.
> > > 
> > > __iomap_write_begin() already does read-around for sub-page writes.
> > > And, if necessary, it does zeroing of unwritten extents, newly
> > > allocated ranges and ranges beyond EOF and marks them uptodate
> > > appropriately.
> > 
> > But it doesn't read in the entire page, just the blocks in the page which
> > will be touched by the write.
> 
> Ah, you are right, I got my page/offset macros mixed up.
> 
> In which case, you just identified why the uptodate array is
> necessary and can't be removed. If we do a sub-page write() the page
> is not fully initialised, and so if we then mmap it readpage needs
> to know what part of the page requires initialisation to bring the
> page uptodate before it is exposed to userspace.

You snipped the part of my mail where I explained how we could handle
that without the uptodate array ;-(  Essentially, we do as you thought
it worked: we read the entire page (or at least the portion of it that
isn't going to be overwritten).  Once all the bytes have been transferred,
we can mark the page Uptodate.  We'll need to wait for the transfer to
happen if the write overlaps a block boundary, but we do that right now.
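
Roughly (just a sketch of the idea; iomap_read_page_range() is a made-up
name for "read this byte range of the page synchronously"):

/* Sketch of the no-bitmap scheme in write_begin: bring the whole page
 * uptodate before copying in the user data. */
if (!PageUptodate(page)) {
        error = iomap_read_page_range(inode, page, 0, PAGE_SIZE);
        if (error)
                return error;
        SetPageUptodate(page);
}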

> But that also means the behaviour of the 4kB write on 64kB page size
> benchmark is unexplained, because that should only be marking the
> written pages of the page up to date, and so it should be behaving
> exactly like ext4 and only writing back individual uptodate chunks
> on the dirty page....

That benchmark started by zeroing the entire page cache, so all blocks
were marked Uptodate, so we wouldn't skip them on writeout.


* Re: [Question] About XFS random buffer write performance
  2020-07-30 23:45                   ` Matthew Wilcox
@ 2020-07-31  2:05                     ` Dave Chinner
  2020-07-31  2:37                       ` Matthew Wilcox
  2020-07-31 20:47                     ` Matthew Wilcox
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2020-07-31  2:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, Zhengyuan Liu, linux-xfs, Zhengyuan Liu,
	Christoph Hellwig

On Fri, Jul 31, 2020 at 12:45:17AM +0100, Matthew Wilcox wrote:
> On Fri, Jul 31, 2020 at 08:08:57AM +1000, Dave Chinner wrote:
> > On Thu, Jul 30, 2020 at 02:50:40PM +0100, Matthew Wilcox wrote:
> > > On Thu, Jul 30, 2020 at 09:05:03AM +1000, Dave Chinner wrote:
> > > > On Wed, Jul 29, 2020 at 07:50:35PM +0100, Matthew Wilcox wrote:
> > > > > I had a bit of a misunderstanding.  Let's discard that proposal
> > > > > and discuss what we want to optimise for, ignoring THPs.  We don't
> > > > > need to track any per-block state, of course.  We could implement
> > > > > __iomap_write_begin() by reading in the entire page (skipping the last
> > > > > few blocks if they lie outside i_size, of course) and then marking the
> > > > > entire page Uptodate.
> > > > 
> > > > __iomap_write_begin() already does read-around for sub-page writes.
> > > > And, if necessary, it does zeroing of unwritten extents, newly
> > > > allocated ranges and ranges beyond EOF and marks them uptodate
> > > > appropriately.
> > > 
> > > But it doesn't read in the entire page, just the blocks in the page which
> > > will be touched by the write.
> > 
> > Ah, you are right, I got my page/offset macros mixed up.
> > 
> > In which case, you just identified why the uptodate array is
> > necessary and can't be removed. If we do a sub-page write() the page
> > is not fully initialised, and so if we then mmap it readpage needs
> > to know what part of the page requires initialisation to bring the
> > page uptodate before it is exposed to userspace.
> 
> You snipped the part of my mail where I explained how we could handle
> that without the uptodate array ;-(

I snipped the part where you explained the way it currently avoided
reading the parts of the page that the block being dirtied didn't
cover :)

> Essentially, we do as you thought
> it worked, we read the entire page (or at least the portion of it that
> isn't going to be overwritten.  Once all the bytes have been transferred,
> we can mark the page Uptodate.  We'll need to wait for the transfer to
> happen if the write overlaps a block boundary, but we do that right now.

Right, we can do that, but it would be an entire page read, I think,
because I see little point in doing two small IOs with a seek in
between them when a single IO will do the entire thing much faster
than two small IOs and put less IOP load on the disk. We still have
to think about the impact of IOs on spinning disks, unfortunately...

> > But that also means the behaviour of the 4kB write on 64kB page size
> > benchmark is unexplained, because that should only be marking the
> > written pages of the page up to date, and so it should be behaving
> > exactly like ext4 and only writing back individual uptodate chunks
> > on the dirty page....
> 
> That benchmark started by zeroing the entire page cache, so all blocks
> were marked Uptodate, so we wouldn't skip them on writeout.

Ah, I missed that bit. I thought it was just starting from a fully
allocated file and a cold cache, not a primed, hot cache...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [Question] About XFS random buffer write performance
  2020-07-31  2:05                     ` Dave Chinner
@ 2020-07-31  2:37                       ` Matthew Wilcox
  0 siblings, 0 replies; 18+ messages in thread
From: Matthew Wilcox @ 2020-07-31  2:37 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Zhengyuan Liu, linux-xfs, Zhengyuan Liu,
	Christoph Hellwig

On Fri, Jul 31, 2020 at 12:05:58PM +1000, Dave Chinner wrote:
> On Fri, Jul 31, 2020 at 12:45:17AM +0100, Matthew Wilcox wrote:
> > Essentially, we do as you thought
> > it worked, we read the entire page (or at least the portion of it that
> > isn't going to be overwritten.  Once all the bytes have been transferred,
> > we can mark the page Uptodate.  We'll need to wait for the transfer to
> > happen if the write overlaps a block boundary, but we do that right now.
> 
> Right, we can do that, but it would be an entire page read, I think,
> because I see little point in doing two small IOs with a seek in
> between them when a single IO will do the entire thing much faster
> than two small IOs and put less IOP load on the disk. We still have
> to think about the impact of IOs on spinning disks, unfortunately...

Heh, maybe don't read the existing code because we actually do that today
if, say, you have a write that spans bytes 800-3000 of a 4kB page.  Worse,
we wait for each one individually before submitting the next, so the
drive doesn't even get the chance to see that we're doing read-seek-read.

I think we can profitably skip reading portions of the page if the write
overlaps either the beginning or end of the page, but it's not worth
breaking up an I/O for skipping reading 2-3kB.
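
A rough sketch of that policy (read_range_for_write() is a made-up
helper, not the current iomap code), given the write's byte offsets
within the page:

    /* Hypothetical helper: for a write covering bytes [from, to) of a
     * page of size psize, return the single range [*rstart, *rend) to
     * read to bring the rest of the page uptodate.  Reading is only
     * skipped when the write touches the start or end of the page; a
     * small hole in the middle is not worth a second IO. */
    static void read_range_for_write(unsigned int from, unsigned int to,
                                     unsigned int psize,
                                     unsigned int *rstart, unsigned int *rend)
    {
            if (from == 0 && to == psize) {         /* full overwrite: no read */
                    *rstart = *rend = 0;
            } else if (from == 0) {                 /* head write: read the tail */
                    *rstart = to;
                    *rend = psize;
            } else if (to == psize) {               /* tail write: read the head */
                    *rstart = 0;
                    *rend = from;
            } else {                                /* middle write: read it all */
                    *rstart = 0;
                    *rend = psize;
            }
    }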

The readahead window expands up to 256kB, so clearly we are comfortable
with doing potentially-unnecessary reads of at least that much.  I start
to wonder whether it might be worth skipping part of the page if you
you do a 1MB write to the middle of a 2MB page, but the THP patchset
doesn't even try to allocate large pages in the write path yet, so the
question remains moot today.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Question] About XFS random buffer write performance
  2020-07-30 22:08                 ` Dave Chinner
  2020-07-30 23:45                   ` Matthew Wilcox
@ 2020-07-31  6:55                   ` Christoph Hellwig
  1 sibling, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2020-07-31  6:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Darrick J. Wong, Zhengyuan Liu, linux-xfs,
	Zhengyuan Liu, Christoph Hellwig

[delayed and partial response because I'm on vacation, still feeling
 like I should chime in]

On Fri, Jul 31, 2020 at 08:08:57AM +1000, Dave Chinner wrote:
> In which case, you just identified why the uptodate array is
> necessary and can't be removed. If we do a sub-page write() the page
> is not fully initialised, and so if we then mmap it readpage needs
> to know what part of the page requires initialisation to bring the
> page uptodate before it is exposed to userspace.
> 
> But that also means the behaviour of the 4kB write on 64kB page size
> benchmark is unexplained, because that should only be marking the
> written pages of the page up to date, and so it should be behaving
> exactly like ext4 and only writing back individual uptodate chunks
> on the dirty page....

We have two different cases here:  a file read in through read or mmap,
or just writing to an uncached file.  In the former case readpage
reads everything in, and everything will also be written out.  If,
OTOH, write only reads in parts of the page, only those parts will be
written out.
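
That falls out of the writeback side only consulting the per-block
uptodate bits; heavily simplified, the relevant loop in
iomap_writepage_map() does something like:

    /* simplified: the block mapping and error handling are elided */
    for (i = 0; i < PAGE_SIZE >> inode->i_blkbits; i++) {
            if (iop && !test_bit(i, iop->uptodate))
                    continue;       /* block never populated: skipped */
            /* uptodate block on a dirty page: added to the ioend and
             * written back, whether or not it was actually dirtied */
    }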

> > You're clearly talking to different SSD people than I am.
> 
> Perhaps so.
> 
> But it was pretty clear way back in the days of early sandforce SSD
> controllers that compression and zero detection at the FTL level
> resulted in massive reductions in write amplification right down at
> the hardware level. The next generation of controllers all did this
> so they could compete on performance. They still do this, which is
> why industry benchmarks test performance with incompressible data so
> that they expose the flash write performance, not just the rate at
> which the drive can detect and elide runs of zeros...

I don't know of any modern SSDs doing zeroes detection.

> IOWs, showing that even high end devices end up bandwidth limited
> under common workloads using default configurations is a much more
> convincing argument...

Not every SSD is a high end device.  If you have an enterprise SSD
with a non-volatile write cache and a full blown PCIe interface,
bandwidth is not going to be a limitation.  If, on the other hand,
you have an el-cheapo ATA SSD or a 2x gen3 PCIe consumer drive with
very few flash channels...

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Question] About XFS random buffer write performance
  2020-07-30 23:45                   ` Matthew Wilcox
  2020-07-31  2:05                     ` Dave Chinner
@ 2020-07-31 20:47                     ` Matthew Wilcox
  2020-07-31 22:13                       ` Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Matthew Wilcox @ 2020-07-31 20:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Zhengyuan Liu, linux-xfs, Zhengyuan Liu,
	Christoph Hellwig

On Fri, Jul 31, 2020 at 12:45:17AM +0100, Matthew Wilcox wrote:
> On Fri, Jul 31, 2020 at 08:08:57AM +1000, Dave Chinner wrote:
> > On Thu, Jul 30, 2020 at 02:50:40PM +0100, Matthew Wilcox wrote:
> > > On Thu, Jul 30, 2020 at 09:05:03AM +1000, Dave Chinner wrote:
> > > > On Wed, Jul 29, 2020 at 07:50:35PM +0100, Matthew Wilcox wrote:
> > > > > I had a bit of a misunderstanding.  Let's discard that proposal
> > > > > and discuss what we want to optimise for, ignoring THPs.  We don't
> > > > > need to track any per-block state, of course.  We could implement
> > > > > __iomap_write_begin() by reading in the entire page (skipping the last
> > > > > few blocks if they lie outside i_size, of course) and then marking the
> > > > > entire page Uptodate.
> > > > 
> > > > __iomap_write_begin() already does read-around for sub-page writes.
> > > > And, if necessary, it does zeroing of unwritten extents, newly
> > > > allocated ranges and ranges beyond EOF and marks them uptodate
> > > > appropriately.
> > > 
> > > But it doesn't read in the entire page, just the blocks in the page which
> > > will be touched by the write.
> > 
> > Ah, you are right, I got my page/offset macros mixed up.
> > 
> > In which case, you just identified why the uptodate array is
> > necessary and can't be removed. If we do a sub-page write() the page
> > is not fully initialised, and so if we then mmap it readpage needs
> > to know what part of the page requires initialisation to bring the
> > page uptodate before it is exposed to userspace.
> 
> You snipped the part of my mail where I explained how we could handle
> that without the uptodate array ;-(  Essentially, we do as you thought
> it worked, we read the entire page (or at least the portion of it that
> isn't going to be overwritten).  Once all the bytes have been transferred,
> we can mark the page Uptodate.  We'll need to wait for the transfer to
> happen if the write overlaps a block boundary, but we do that right now.

OK, so this turns out to be Hard.  We enter the iomap code with

iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
                const struct iomap_ops *ops)

which does:
                ret = iomap_apply(inode, pos, iov_iter_count(iter),
                                IOMAP_WRITE, ops, iter, iomap_write_actor);

so iomap_write_actor doesn't get told about the blocks in the page before
the starting pos.  They might be a hole or mapped; we have no idea.

We could allocate pages _here_ and call iomap_readpage() for the pages
which overlap the beginning and end of the I/O, but I'm not entirely
convinced that the iomap_ops being passed in will appreciate being
called for a read that has no intent to write the portions of the page
outside pos.
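
To make that concrete, the shape of it would be something like the
sketch below.  It goes through the mapping's readpage path with
read_mapping_page() rather than calling iomap_readpage() directly, and
prefill_boundary_pages() is a made-up name; whether the filesystem is
happy to be called back like this from its write path is exactly the
part I'm not convinced about:

    /* Sketch only: pre-populate the first and last pages of the write
     * so __iomap_write_begin() never sees a partially uptodate page. */
    static int prefill_boundary_pages(struct address_space *mapping,
                                      loff_t pos, size_t len)
    {
            pgoff_t first = pos >> PAGE_SHIFT;
            pgoff_t last = (pos + len - 1) >> PAGE_SHIFT;
            struct page *page;

            page = read_mapping_page(mapping, first, NULL);
            if (IS_ERR(page))
                    return PTR_ERR(page);
            put_page(page);

            if (last != first) {
                    page = read_mapping_page(mapping, last, NULL);
                    if (IS_ERR(page))
                            return PTR_ERR(page);
                    put_page(page);
            }
            return 0;
    }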

Bleh.  I'm going to give up on removing the uptodate bit array and go
back to making the THP patchset better.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Question] About XFS random buffer write performance
  2020-07-31 20:47                     ` Matthew Wilcox
@ 2020-07-31 22:13                       ` Dave Chinner
  2020-08-21  2:39                         ` Zhengyuan Liu
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2020-07-31 22:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, Zhengyuan Liu, linux-xfs, Zhengyuan Liu,
	Christoph Hellwig

On Fri, Jul 31, 2020 at 09:47:13PM +0100, Matthew Wilcox wrote:
> On Fri, Jul 31, 2020 at 12:45:17AM +0100, Matthew Wilcox wrote:
> > On Fri, Jul 31, 2020 at 08:08:57AM +1000, Dave Chinner wrote:
> > > On Thu, Jul 30, 2020 at 02:50:40PM +0100, Matthew Wilcox wrote:
> > > > On Thu, Jul 30, 2020 at 09:05:03AM +1000, Dave Chinner wrote:
> > > > > On Wed, Jul 29, 2020 at 07:50:35PM +0100, Matthew Wilcox wrote:
> > > > > > I had a bit of a misunderstanding.  Let's discard that proposal
> > > > > > and discuss what we want to optimise for, ignoring THPs.  We don't
> > > > > > need to track any per-block state, of course.  We could implement
> > > > > > __iomap_write_begin() by reading in the entire page (skipping the last
> > > > > > few blocks if they lie outside i_size, of course) and then marking the
> > > > > > entire page Uptodate.
> > > > > 
> > > > > __iomap_write_begin() already does read-around for sub-page writes.
> > > > > And, if necessary, it does zeroing of unwritten extents, newly
> > > > > allocated ranges and ranges beyond EOF and marks them uptodate
> > > > > appropriately.
> > > > 
> > > > But it doesn't read in the entire page, just the blocks in the page which
> > > > will be touched by the write.
> > > 
> > > Ah, you are right, I got my page/offset macros mixed up.
> > > 
> > > In which case, you just identified why the uptodate array is
> > > necessary and can't be removed. If we do a sub-page write() the page
> > > is not fully initialised, and so if we then mmap it readpage needs
> > > to know what part of the page requires initialisation to bring the
> > > page uptodate before it is exposed to userspace.
> > 
> > You snipped the part of my mail where I explained how we could handle
> > that without the uptodate array ;-(  Essentially, we do as you thought
> > it worked, we read the entire page (or at least the portion of it that
> > isn't going to be overwritten).  Once all the bytes have been transferred,
> > we can mark the page Uptodate.  We'll need to wait for the transfer to
> > happen if the write overlaps a block boundary, but we do that right now.
> 
> OK, so this turns out to be Hard.  We enter the iomap code with
> 
> iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
>                 const struct iomap_ops *ops)
> 
> which does:
>                 ret = iomap_apply(inode, pos, iov_iter_count(iter),
>                                 IOMAP_WRITE, ops, iter, iomap_write_actor);
> 
> so iomap_write_actor doesn't get told about the blocks in the page before
> the starting pos.  They might be a hole or mapped; we have no idea.

So this is kind of the same problem block size > page size has to
deal with for block allocation - the zero-around issue. That is,
when a sub-block write triggers a new allocation, it actually has to
zero the entire block in the page cache first, which means it needs
to expand the IO range in iomap_write_actor()....

https://lore.kernel.org/linux-xfs/20181107063127.3902-10-david@fromorbit.com/
https://lore.kernel.org/linux-xfs/20181107063127.3902-14-david@fromorbit.com/
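
Concretely, the expansion is just rounding the write's byte range out
to filesystem block boundaries; this is a sketch of the idea, not what
those patches literally do:

    /* zero-around sketch: a write of [pos, pos + len) hitting a newly
     * allocated block grows the range handled in the page cache to
     * whole fs blocks, so the surrounding bytes are zeroed and marked
     * uptodate/dirty instead of being read from disk */
    unsigned int bsize = i_blocksize(inode);
    loff_t zstart = round_down(pos, bsize);     /* zero from here ... */
    loff_t zend = round_up(pos + len, bsize);   /* ... up to here */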

> We could allocate pages _here_ and call iomap_readpage() for the pages
> which overlap the beginning and end of the I/O,

FWIW, this is effectively what calling iomap_zero() from
iomap_write_actor() does - it allocates pages outside the write
range via iomap_write_begin(), then zeroes them in memory and marks
them dirty....

> but I'm not entirely
> convinced that the iomap_ops being passed in will appreciate being
> called for a read that has no intent to write the portions of the page
> outside pos.

I don't think it should matter what the range of the read being done
is - it has the same constraints whether it's to populate the
partial block or whole blocks just before the write. Especially as
we are in the buffered write path and so the filesystem has
guaranteed us exclusive access to the inode and its mapping
here....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Question] About XFS random buffer write performance
  2020-07-31 22:13                       ` Dave Chinner
@ 2020-08-21  2:39                         ` Zhengyuan Liu
  0 siblings, 0 replies; 18+ messages in thread
From: Zhengyuan Liu @ 2020-08-21  2:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, Darrick J. Wong, linux-xfs, Zhengyuan Liu,
	Christoph Hellwig

Thanks for your discussions.
For this issue, do we have any plans to fix it?

On Sat, Aug 1, 2020 at 6:13 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Jul 31, 2020 at 09:47:13PM +0100, Matthew Wilcox wrote:
> > On Fri, Jul 31, 2020 at 12:45:17AM +0100, Matthew Wilcox wrote:
> > > On Fri, Jul 31, 2020 at 08:08:57AM +1000, Dave Chinner wrote:
> > > > On Thu, Jul 30, 2020 at 02:50:40PM +0100, Matthew Wilcox wrote:
> > > > > On Thu, Jul 30, 2020 at 09:05:03AM +1000, Dave Chinner wrote:
> > > > > > On Wed, Jul 29, 2020 at 07:50:35PM +0100, Matthew Wilcox wrote:
> > > > > > > I had a bit of a misunderstanding.  Let's discard that proposal
> > > > > > > and discuss what we want to optimise for, ignoring THPs.  We don't
> > > > > > > need to track any per-block state, of course.  We could implement
> > > > > > > __iomap_write_begin() by reading in the entire page (skipping the last
> > > > > > > few blocks if they lie outside i_size, of course) and then marking the
> > > > > > > entire page Uptodate.
> > > > > >
> > > > > > __iomap_write_begin() already does read-around for sub-page writes.
> > > > > > And, if necessary, it does zeroing of unwritten extents, newly
> > > > > > allocated ranges and ranges beyond EOF and marks them uptodate
> > > > > > appropriately.
> > > > >
> > > > > But it doesn't read in the entire page, just the blocks in the page which
> > > > > will be touched by the write.
> > > >
> > > > Ah, you are right, I got my page/offset macros mixed up.
> > > >
> > > > In which case, you just identified why the uptodate array is
> > > > necessary and can't be removed. If we do a sub-page write() the page
> > > > is not fully initialised, and so if we then mmap it readpage needs
> > > > to know what part of the page requires initialisation to bring the
> > > > page uptodate before it is exposed to userspace.
> > >
> > > You snipped the part of my mail where I explained how we could handle
> > > that without the uptodate array ;-(  Essentially, we do as you thought
> > > it worked, we read the entire page (or at least the portion of it that
> > > isn't going to be overwritten).  Once all the bytes have been transferred,
> > > we can mark the page Uptodate.  We'll need to wait for the transfer to
> > > happen if the write overlaps a block boundary, but we do that right now.
> >
> > OK, so this turns out to be Hard.  We enter the iomap code with
> >
> > iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
> >                 const struct iomap_ops *ops)
> >
> > which does:
> >                 ret = iomap_apply(inode, pos, iov_iter_count(iter),
> >                                 IOMAP_WRITE, ops, iter, iomap_write_actor);
> >
> > so iomap_write_actor doesn't get told about the blocks in the page before
> > the starting pos.  They might be a hole or mapped; we have no idea.
>
> So this is kind of the same problem block size > page size has to
> deal with for block allocation - the zero-around issue. That is,
> when a sub-block write triggers a new allocation, it actually has to
> zero the entire block in the page cache first, which means it needs
> to expand the IO range in iomap_write_actor()....
>
> https://lore.kernel.org/linux-xfs/20181107063127.3902-10-david@fromorbit.com/
> https://lore.kernel.org/linux-xfs/20181107063127.3902-14-david@fromorbit.com/
>
> > We could allocate pages _here_ and call iomap_readpage() for the pages
> > which overlap the beginning and end of the I/O,
>
> FWIW, this is effectively what calling iomap_zero() from
> iomap_write_actor() does - it allocates pages outside the write
> range via iomap_write_begin(), then zeroes them in memory and marks
> them dirty....
>
> > but I'm not entirely
> > convinced that the iomap_ops being passed in will appreciate being
> > called for a read that has no intent to write the portions of the page
> > outside pos.
>
> I don't think it should matter what the range of the read being done
> is - it has the same constraints whether it's to populate the
> partial block or whole blocks just before the write. Especially as
> we are in the buffered write path and so the filesystem has
> guaranteed us exclusive access to the inode and its mapping
> here....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-08-21  2:39 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-28 11:34 [Question] About XFS random buffer write performance Zhengyuan Liu
2020-07-28 15:34 ` Darrick J. Wong
2020-07-28 15:47   ` Matthew Wilcox
2020-07-29  1:54     ` Dave Chinner
2020-07-29  2:12       ` Matthew Wilcox
2020-07-29  5:19         ` Dave Chinner
2020-07-29 18:50           ` Matthew Wilcox
2020-07-29 23:05             ` Dave Chinner
2020-07-30 13:50               ` Matthew Wilcox
2020-07-30 22:08                 ` Dave Chinner
2020-07-30 23:45                   ` Matthew Wilcox
2020-07-31  2:05                     ` Dave Chinner
2020-07-31  2:37                       ` Matthew Wilcox
2020-07-31 20:47                     ` Matthew Wilcox
2020-07-31 22:13                       ` Dave Chinner
2020-08-21  2:39                         ` Zhengyuan Liu
2020-07-31  6:55                   ` Christoph Hellwig
2020-07-29 13:02       ` Zhengyuan Liu
