linux-xfs.vger.kernel.org archive mirror
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support
Date: Wed, 7 Nov 2018 09:14:05 -0800
Message-ID: <20181107171405.GB4135@magnolia>
In-Reply-To: <20181107063127.3902-1-david@fromorbit.com>

On Wed, Nov 07, 2018 at 05:31:11PM +1100, Dave Chinner wrote:
> Hi folks,
> 
> We've had a fair number of problems reported on 64k block size
> filesystems of late, but none of the XFS developers have Power or
> ARM machines handy to reproduce them or even really test the fixes.
> 
> The iomap infrastructure we introduced a while back was designed
> with the capability of block size > page size support in mind, but we
> hadn't tried to implement it.
> 
> So after another 64k block size bug report late last week I said to
> Darrick "How hard could it be"?

"Nothing is ever simple" :)

> About 6 billion (yes, B) fsx ops later, I have most of the XFS
> functionality working on 64k block sizes on x86_64.  Buffered
> read/write, mmap read/write and direct IO read/write all work. All
> the fallocate() operations work correctly, as does truncate. xfsdump
> and xfs_restore are happy with it, as is xfs_repair. xfs-scrub
> needed some help, but I've tested Darrick's fixes for that quite a
> bit over the past few days.
> 
> It passes most of xfstests - there's some test failures that I have
> to determine whether they are code bugs or test problems (i.e. some
> tests don't deal with 64k block sizes correctly or assume block size
> <= page size).
> 
> What I haven't tested yet is shared extents - the COW path,
> clone_file_range and dedupe_file_range. I discovered earlier today
> that fsx doesn't support copy/clone/dedupe_file_range
> operations, so before I go any further I need to enhance fsx. Then

I assume that means you only tested this on reflink=0 filesystems?

Looking at fsstress, it looks like we don't test copy_file_range either.
I can try adding the missing clone/dedupe/copy to both programs, but
maybe you've already done that while I was asleep?

--D

> fix all the bugs it uncovers on block size <= page size filesystems.
> And then I'll move onto adding the rest of the functionality this
> patch set requires.
> 
> The main addition to the iomap code to support this functionality is
> the "zero-around" capability. When the filesystem is mapping a new
> block, a delalloc range or an unwritten extent, it sets the
> IOMAP_F_ZERO_AROUND flag in the iomap it returns. This tells the
> iomap code that it needs to expand the range of the operation being
> performed to cover entire blocks. i.e. if the data being written
> doesn't span the filesystem block, it needs to instantiate and zero
> pages in the page cache to cover the portions of the block the data
> doesn't cover.
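
A minimal sketch of the range arithmetic that paragraph describes, not
the code from the patch set: given a write at [pos, pos + len), round
the start down and the end up to filesystem block boundaries. The
struct and function names are hypothetical, and the block size is
assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: a byte range [start, end) in a file. */
struct byte_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Expand the write [pos, pos + len) outwards to filesystem block
 * boundaries.  block_size must be a non-zero power of two.
 */
static struct byte_range
zero_around_range(uint64_t pos, uint64_t len, uint64_t block_size)
{
	struct byte_range r;

	assert(block_size != 0 && (block_size & (block_size - 1)) == 0);

	/* Round the start down to the block containing pos. */
	r.start = pos & ~(block_size - 1);
	/* Round the end up to the next block boundary. */
	r.end = (pos + len + block_size - 1) & ~(block_size - 1);
	return r;
}
```

For a 4k write at offset 8k on a 64k block size filesystem this gives
[0, 64k), so the regions to instantiate and zero would be [0, 8k) and
[12k, 64k).
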
> 
> Because the page cache may already hold data for the regions (e.g.
> read over a hole/unwritten extent) the zero-around code does not
> zero pages that are already marked up to date. It is assumed that
> whatever put those pages into the page cache has already done the
> right thing with them.
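
The "skip pages that are already up to date" rule could be sketched
roughly as below. This is a userspace model only: struct sketch_page is
a hypothetical stand-in and not the kernel's struct page, the 4k page
size is assumed, and marking a freshly zeroed whole page up to date is
an assumption of the sketch rather than something stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/* Hypothetical stand-in for a page cache page. */
struct sketch_page {
	bool uptodate;
	unsigned char data[SKETCH_PAGE_SIZE];
};

/*
 * Zero every page in pages[first..last] (inclusive) that is not already
 * up to date.  Up-to-date pages are left untouched.  Returns the number
 * of pages that were zeroed.
 */
static size_t
zero_around_pages(struct sketch_page *pages, size_t first, size_t last)
{
	size_t zeroed = 0;

	for (size_t i = first; i <= last; i++) {
		if (pages[i].uptodate)
			continue;	/* trust whoever filled this page */
		memset(pages[i].data, 0, SKETCH_PAGE_SIZE);
		pages[i].uptodate = true;	/* sketch assumption */
		zeroed++;
	}
	return zeroed;
}
```

With 16 pages backing one 64k block and one of them already up to date,
the helper zeroes the other 15 and leaves the existing data alone.
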
> 
> Yes, this means the unit of page cache IO is still individual pages.
> There are no mm/ changes at all, no page cache changes, nothing.
> That all still just operates on individual pages and is oblivious to
> the fact the filesystem/iomap code is now processing gangs of pages
> at a time instead of just one.
> 
> Actually, I stretch the truth there a little - there is one change
> to XFS that is important to note here. I removed ->writepage from
> XFS (patches 1 and 2). We can't use it for large block sizes because
> we want to write whole blocks at a time if they are newly allocated
> or unwritten. And really, it's just a nasty hack that gets in the
> way of background writeback doing an efficient job of cleaning dirty
> pages. So I killed it.
> 
> We also need to expose the large block size to stat(2). If we don't,
> the applications that use stat.st_blksize for operations that require
> block size alignment (e.g. various fallocate ops) then fail because
> 4k is not the block size of a 64k block size filesystem.
> 
> A number of latent bugs in existing code were uncovered as I worked
> through this - patches 3-5 fix bugs in XFS and iomap that can be
> triggered on existing systems but it's somewhat hard to expose them.
> 
> Patches 6-12 introduce all the iomap infrastructure needed to
> support block size > page size.
> 
> Patches 13-16 introduce the necessary functionality to trigger the
> iomap infrastructure, tell userspace the right thing, make sub-block
> fsync ranges do the right thing and finally remove the code that
> prevents large block size filesystems from mounting on small page
> size machines.
> 
> It works, it seems pretty robust and runs enough of fstests that
> I've already used it to find, fix and test a 64k block size bug in
> XFS:
> 
> https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/commit/?h=for-next&id=837514f7a4ca4aca06aec5caa5ff56d33ef06976
> 
> I think this is the last of the XFS Irix features we haven't
> implemented in Linux XFS - it's only taken us 20 years to
> get here, but the end of the tunnel is in sight.
> 
> Nah, it's probably a train. Or maybe a flame. :)
> 
> Anyway, I'm interested to see what people think of the approach.
> 
> Cheers,
> 
> Dave.
> 
> 


Thread overview: 44+ messages
2018-11-07  6:31 [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
2018-11-07  6:31 ` [PATCH 01/16] xfs: drop ->writepage completely Dave Chinner
2018-11-09 15:12   ` Christoph Hellwig
2018-11-12 21:08     ` Dave Chinner
2021-02-02 20:51       ` Darrick J. Wong
2018-11-07  6:31 ` [PATCH 02/16] xfs: move writepage context warnings to writepages Dave Chinner
2018-11-07  6:31 ` [PATCH 03/16] xfs: finobt AG reserves don't consider last AG can be a runt Dave Chinner
2018-11-07 16:55   ` Darrick J. Wong
2018-11-09  0:21     ` Dave Chinner
2018-11-07  6:31 ` [PATCH 04/16] xfs: extent shifting doesn't fully invalidate page cache Dave Chinner
2018-11-07  6:31 ` [PATCH 05/16] iomap: sub-block dio needs to zeroout beyond EOF Dave Chinner
2018-11-09 15:15   ` Christoph Hellwig
2018-11-07  6:31 ` [PATCH 06/16] iomap: support block size > page size for direct IO Dave Chinner
2018-11-08 11:28   ` Nikolay Borisov
2018-11-09 15:18   ` Christoph Hellwig
2018-11-11  1:12     ` Dave Chinner
2018-11-07  6:31 ` [PATCH 07/16] iomap: prepare buffered IO paths for block size > page size Dave Chinner
2018-11-09 15:19   ` Christoph Hellwig
2018-11-11  1:15     ` Dave Chinner
2018-11-07  6:31 ` [PATCH 08/16] iomap: mode iomap_zero_range and friends Dave Chinner
2018-11-09 15:19   ` Christoph Hellwig
2018-11-07  6:31 ` [PATCH 09/16] iomap: introduce zero-around functionality Dave Chinner
2018-11-07  6:31 ` [PATCH 10/16] iomap: enable zero-around for iomap_zero_range() Dave Chinner
2018-11-07  6:31 ` [PATCH 11/16] iomap: Don't mark partial pages zeroing uptodate for zero-around Dave Chinner
2018-11-07  6:31 ` [PATCH 12/16] iomap: zero-around in iomap_page_mkwrite Dave Chinner
2018-11-07  6:31 ` [PATCH 13/16] xfs: add zero-around controls to iomap Dave Chinner
2018-11-07  6:31 ` [PATCH 14/16] xfs: align writepages to large block sizes Dave Chinner
2018-11-09 15:22   ` Christoph Hellwig
2018-11-11  1:20     ` Dave Chinner
2018-11-11 16:32       ` Christoph Hellwig
2018-11-14 14:19   ` Brian Foster
2018-11-14 21:18     ` Dave Chinner
2018-11-15 12:55       ` Brian Foster
2018-11-16  6:19         ` Dave Chinner
2018-11-16 13:29           ` Brian Foster
2018-11-19  1:14             ` Dave Chinner
2018-11-07  6:31 ` [PATCH 15/16] xfs: expose block size in stat Dave Chinner
2018-11-07  6:31 ` [PATCH 16/16] xfs: enable block size larger than page size support Dave Chinner
2018-11-07 17:14 ` Darrick J. Wong [this message]
2018-11-07 22:04   ` [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
2018-11-08  1:38     ` Darrick J. Wong
2018-11-08  9:04       ` Dave Chinner
2018-11-08 22:17         ` Darrick J. Wong
2018-11-08 22:22           ` Dave Chinner
