linux-xfs.vger.kernel.org archive mirror
* [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support
@ 2018-11-07  6:31 Dave Chinner
From: Dave Chinner @ 2018-11-07  6:31 UTC (permalink / raw)
  To: linux-xfs; +Cc: linux-fsdevel

Hi folks,

We've had a fair number of problems reported on 64k block size
filesystems of late, but none of the XFS developers have Power or
ARM machines handy to reproduce them or even really test the fixes.

The iomap infrastructure we introduced a while back was designed
with the capability of block size > page size support in mind, but we
hadn't tried to implement it.

So after another 64k block size bug report late last week I said to
Darrick "How hard could it be?"

About 6 billion (yes, B) fsx ops later, I have most of the XFS
functionality working on 64k block sizes on x86_64.  Buffered
read/write, mmap read/write and direct IO read/write all work. All
the fallocate() operations work correctly, as does truncate. xfsdump
and xfs_restore are happy with it, as is xfs_repair. xfs_scrub
needed some help, but I've tested Darrick's fixes for that quite a
bit over the past few days.

It passes most of xfstests. There are some test failures, and I still
have to determine whether they are code bugs or test problems (i.e.
some tests don't deal with 64k block sizes correctly or assume block
size <= page size).

What I haven't tested yet is shared extents - the COW path,
clone_file_range and dedupe_file_range. I discovered earlier today
that fsx doesn't support copy/clone/dedupe_file_range
operations, so before I go any further I need to enhance fsx. Then
fix all the bugs it uncovers on block size <= page size filesystems.
And then I'll move onto adding the rest of the functionality this
patch set requires.

The main addition to the iomap code to support this functionality is
the "zero-around" capability. When the filesystem is mapping a new
block, a delalloc range or an unwritten extent, it sets the
IOMAP_F_ZERO_AROUND flag in the iomap it returns. This tells the
iomap code that it needs to expand the range of the operation being
performed to cover entire blocks. i.e. if the data being written
doesn't span the filesystem block, it needs to instantiate and zero
pages in the page cache to cover the portions of the block the data
doesn't cover.
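
As a rough illustration of the range expansion described above (this is
not the code from the patch set; the struct and function names are
hypothetical, and it assumes a power-of-two block size and len > 0),
the arithmetic looks something like this:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: a byte range [start, end) within a file. */
struct byte_range {
	uint64_t start;
	uint64_t end;	/* exclusive */
};

/*
 * Expand the write [pos, pos + len) outwards so that it covers whole
 * filesystem blocks. block_size must be a power of two.
 */
static struct byte_range zero_around_range(uint64_t pos, uint64_t len,
					   uint64_t block_size)
{
	struct byte_range r;

	assert(block_size && (block_size & (block_size - 1)) == 0);
	r.start = pos & ~(block_size - 1);
	r.end = (pos + len + block_size - 1) & ~(block_size - 1);
	return r;
}

/* Bytes between the start of the block and the start of the data. */
static uint64_t zero_head_len(uint64_t pos, uint64_t block_size)
{
	return pos & (block_size - 1);
}

/* Bytes between the end of the data and the end of its last block. */
static uint64_t zero_tail_len(uint64_t pos, uint64_t len,
			      uint64_t block_size)
{
	uint64_t end = pos + len;

	return (block_size - (end & (block_size - 1))) & (block_size - 1);
}
```

For a 4k write at offset 8k into a new 64k block, this gives a range of
[0, 64k) with 8k to zero in front of the data and 52k to zero behind it.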

Because the page cache may already hold data for the regions (e.g.
read over a hole/unwritten extent) the zero-around code does not
zero pages that are already marked up to date. It is assumed that
whatever put those pages into the page cache has already done the
right thing with them.
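
A toy userspace model of that rule, to make it concrete. It assumes 4k
pages and only handles whole pages; struct sim_page and
sim_zero_around() are made up for the illustration and are not kernel
interfaces:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SIM_PAGE_SIZE 4096u	/* assumed page size for this model */

/* Toy stand-in for a page cache page; not the kernel's struct page. */
struct sim_page {
	bool uptodate;
	unsigned char data[SIM_PAGE_SIZE];
};

/*
 * Walk the pages [first, last] that back one filesystem block. Pages
 * in [data_first, data_last] will be filled by the write itself, and
 * pages already marked up to date are trusted as they are. Everything
 * else is zeroed and marked up to date. Returns the number of pages
 * that were zeroed.
 */
static size_t sim_zero_around(struct sim_page *pages, size_t first,
			      size_t last, size_t data_first,
			      size_t data_last)
{
	size_t zeroed = 0;

	assert(first <= last);
	for (size_t i = first; i <= last; i++) {
		if (i >= data_first && i <= data_last)
			continue;	/* covered by the data being written */
		if (pages[i].uptodate)
			continue;	/* existing contents are left alone */
		memset(pages[i].data, 0, SIM_PAGE_SIZE);
		pages[i].uptodate = true;
		zeroed++;
	}
	return zeroed;
}
```

With 16 pages per 64k block, a one-page write and one page already up
to date, the model zeroes the other 14 pages and leaves the up to date
page's contents untouched.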

Yes, this means the unit of page cache IO is still individual pages.
There are no mm/ changes at all, no page cache changes, nothing.
That all still just operates on individual pages and is oblivious to
the fact the filesystem/iomap code is now processing gangs of pages
at a time instead of just one.

Actually, I stretch the truth there a little - there is one change
to XFS that is important to note here. I removed ->writepage from
XFS (patches 1 and 2). We can't use it for large block sizes because
we want to write whole blocks at a time if they are newly allocated
or unwritten. And really, it's just a nasty hack that gets in the
way of background writeback doing an efficient job of cleaning dirty
pages. So I killed it.

We also need to expose the large block size to stat(2). If we don't,
applications that use stat.st_blksize for operations that require
block size alignment (e.g. various fallocate ops) fail because
4k is not the block size of a 64k block size filesystem.
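
For illustration, a minimal sketch of what such an application might do:
read the st_blksize field of struct stat and align a byte range to it
before handing the range to something like fallocate(). The helper names
are hypothetical, and whether any particular application behaves this
way is an assumption:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/stat.h>

/* Return st_blksize as reported by stat(2) for path, or -1 on error. */
static long io_block_size(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	return (long)st.st_blksize;
}

/* Round x down to a multiple of blksize. */
static uint64_t align_down(uint64_t x, uint64_t blksize)
{
	assert(blksize > 0);
	return x - (x % blksize);
}

/* Round x up to a multiple of blksize. */
static uint64_t align_up(uint64_t x, uint64_t blksize)
{
	return align_down(x + blksize - 1, blksize);
}
```

If stat reports 4k on a 64k block size filesystem, a range aligned this
way can start or end in the middle of a filesystem block, which is the
failure described above.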

A number of latent bugs in existing code were uncovered as I worked
through this - patches 3-5 fix bugs in XFS and iomap that can be
triggered on existing systems but are somewhat hard to expose.

Patches 6-12 introduce all the iomap infrastructure needed to
support block size > page size.

Patches 13-16 introduce the necessary functionality to trigger the
iomap infrastructure, tell userspace the right thing, make sub-block
fsync ranges do the right thing and finally remove the code that
prevents large block size filesystems from mounting on small page
size machines.

It works, it seems pretty robust and runs enough of fstests that
I've already used it to find, fix and test a 64k block size bug in
XFS:

https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/commit/?h=for-next&id=837514f7a4ca4aca06aec5caa5ff56d33ef06976

I think this is the last of the XFS Irix features we haven't
implemented in Linux XFS - it's only taken us 20 years to
get here, but the end of the tunnel is in sight.

Nah, it's probably a train. Or maybe a flame. :)

Anyway, I'm interested to see what people think of the approach.

Cheers,

Dave.


Thread overview: 44+ messages
2018-11-07  6:31 [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Dave Chinner
2018-11-07  6:31 ` [PATCH 01/16] xfs: drop ->writepage completely Dave Chinner
2018-11-09 15:12   ` Christoph Hellwig
2018-11-12 21:08     ` Dave Chinner
2021-02-02 20:51       ` Darrick J. Wong
2018-11-07  6:31 ` [PATCH 02/16] xfs: move writepage context warnings to writepages Dave Chinner
2018-11-07  6:31 ` [PATCH 03/16] xfs: finobt AG reserves don't consider last AG can be a runt Dave Chinner
2018-11-07 16:55   ` Darrick J. Wong
2018-11-09  0:21     ` Dave Chinner
2018-11-07  6:31 ` [PATCH 04/16] xfs: extent shifting doesn't fully invalidate page cache Dave Chinner
2018-11-07  6:31 ` [PATCH 05/16] iomap: sub-block dio needs to zeroout beyond EOF Dave Chinner
2018-11-09 15:15   ` Christoph Hellwig
2018-11-07  6:31 ` [PATCH 06/16] iomap: support block size > page size for direct IO Dave Chinner
2018-11-08 11:28   ` Nikolay Borisov
2018-11-09 15:18   ` Christoph Hellwig
2018-11-11  1:12     ` Dave Chinner
2018-11-07  6:31 ` [PATCH 07/16] iomap: prepare buffered IO paths for block size > page size Dave Chinner
2018-11-09 15:19   ` Christoph Hellwig
2018-11-11  1:15     ` Dave Chinner
2018-11-07  6:31 ` [PATCH 08/16] iomap: mode iomap_zero_range and friends Dave Chinner
2018-11-09 15:19   ` Christoph Hellwig
2018-11-07  6:31 ` [PATCH 09/16] iomap: introduce zero-around functionality Dave Chinner
2018-11-07  6:31 ` [PATCH 10/16] iomap: enable zero-around for iomap_zero_range() Dave Chinner
2018-11-07  6:31 ` [PATCH 11/16] iomap: Don't mark partial pages zeroing uptodate for zero-around Dave Chinner
2018-11-07  6:31 ` [PATCH 12/16] iomap: zero-around in iomap_page_mkwrite Dave Chinner
2018-11-07  6:31 ` [PATCH 13/16] xfs: add zero-around controls to iomap Dave Chinner
2018-11-07  6:31 ` [PATCH 14/16] xfs: align writepages to large block sizes Dave Chinner
2018-11-09 15:22   ` Christoph Hellwig
2018-11-11  1:20     ` Dave Chinner
2018-11-11 16:32       ` Christoph Hellwig
2018-11-14 14:19   ` Brian Foster
2018-11-14 21:18     ` Dave Chinner
2018-11-15 12:55       ` Brian Foster
2018-11-16  6:19         ` Dave Chinner
2018-11-16 13:29           ` Brian Foster
2018-11-19  1:14             ` Dave Chinner
2018-11-07  6:31 ` [PATCH 15/16] xfs: expose block size in stat Dave Chinner
2018-11-07  6:31 ` [PATCH 16/16] xfs: enable block size larger than page size support Dave Chinner
2018-11-07 17:14 ` [RFC PATCH 00/16] xfs: Block size > PAGE_SIZE support Darrick J. Wong
2018-11-07 22:04   ` Dave Chinner
2018-11-08  1:38     ` Darrick J. Wong
2018-11-08  9:04       ` Dave Chinner
2018-11-08 22:17         ` Darrick J. Wong
2018-11-08 22:22           ` Dave Chinner
