* [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles
From: David Howells @ 2021-09-24 17:17 UTC
To: willy, hch, trond.myklebust
Cc: Theodore Ts'o, linux-block, ceph-devel, Trond Myklebust,
	Darrick J. Wong, Jeff Layton, Andreas Dilger, Anna Schumaker,
	linux-mm, Bob Liu, Darrick J. Wong, Josef Bacik, Seth Jennings,
	Jens Axboe, linux-fsdevel, linux-xfs, linux-ext4, linux-cifs,
	Chris Mason, David Sterba, Minchan Kim, Steve French, NeilBrown,
	Dan Magenheimer, linux-nfs, Ilya Dryomov, linux-btrfs, dhowells,
	dhowells, darrick.wong, viro, jlayton, torvalds, linux-nfs,
	linux-mm, linux-fsdevel, linux-kernel

Hi Willy, Trond, Christoph,

Here's v3 of a change to make reads and writes from the swapfile use async
DIO, adding a new ->swap_rw() address_space method, rather than readpage()
or direct_IO(), as requested by Willy.  This allows NFS to bypass the write
checks that prevent swapfiles from working, plus a bunch of other checks
that may or may not be necessary.

Whilst trying to make this work, I found that NFS's support for swapfiles
seems to have been non-functional since Aug 2019 (I think), so the first
patch fixes that.  Question is: do we actually *want* to keep this
functionality, given that it seems that no one's tested it with an upstream
kernel in the last couple of years?

There are additional patches to get rid of noop_direct_IO and replace it
with a feature bitmask, to make btrfs, ext4, xfs and raw blockdevs use the
new ->swap_rw method and thence remove the direct BIO submission paths from
swap.

I kept the IOCB_SWAP flag, using it to enable REQ_SWAP.  I'm not sure if
that's necessary, but it seems accounting related.

The synchronous DIO I/O code on NFS, raw blockdev, ext4 swapfile and xfs
swapfile all seem to work fine.
Btrfs refuses to swapon because the file might be CoW'd. I've tried doing "chattr +C", but that didn't help. The async DIO paths fail spectacularly (from I/O errors to ATA failure messages on the test disk using a normal swapspace); NFS just hangs. My patches can be found here also: https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=swap-dio I tested this using the procedure and program outlined in the NFS patch. I also encountered occasional instances of the following warning with NFS, so I'm wondering if there's a scheduling problem somewhere: BUG: workqueue lockup - pool cpus=0-3 flags=0x5 nice=0 stuck for 34s! Showing busy workqueues and worker pools: workqueue events: flags=0x0 pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 refcnt=2 in-flight: 1565:fill_page_cache_func workqueue events_highpri: flags=0x10 pwq 3: cpus=1 node=0 flags=0x1 nice=-20 active=1/256 refcnt=2 in-flight: 1547:fill_page_cache_func pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256 refcnt=2 in-flight: 1811:fill_page_cache_func workqueue events_unbound: flags=0x2 pwq 8: cpus=0-3 flags=0x5 nice=0 active=3/512 refcnt=5 pending: fsnotify_connector_destroy_workfn, fsnotify_mark_destroy_workfn, cleanup_offline_cgwbs_workfn workqueue events_power_efficient: flags=0x82 pwq 8: cpus=0-3 flags=0x5 nice=0 active=4/256 refcnt=6 pending: neigh_periodic_work, neigh_periodic_work, check_lifetime, do_cache_clean workqueue writeback: flags=0x4a pwq 8: cpus=0-3 flags=0x5 nice=0 active=1/256 refcnt=4 in-flight: 433(RESCUER):wb_workfn workqueue rpciod: flags=0xa pwq 8: cpus=0-3 flags=0x5 nice=0 active=38/256 refcnt=40 in-flight: 7:rpc_async_schedule, 1609:rpc_async_schedule, 1610:rpc_async_schedule, 912:rpc_async_schedule, 1613:rpc_async_schedule, 1631:rpc_async_schedule, 34:rpc_async_schedule, 44:rpc_async_schedule pending: rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, 
rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule, rpc_async_schedule workqueue ext4-rsv-conversion: flags=0x2000a pool 1: cpus=0 node=0 flags=0x0 nice=-20 hung=59s workers=2 idle: 6 pool 3: cpus=1 node=0 flags=0x1 nice=-20 hung=43s workers=2 manager: 20 pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 498 29 pool 8: cpus=0-3 flags=0x5 nice=0 hung=34s workers=9 manager: 1623 pool 9: cpus=0-3 flags=0x5 nice=-20 hung=0s workers=2 manager: 5224 idle: 859 Note that this is due to DIO writes to NFS only, as far as I can tell, and that no reads had happened yet. Changes: ======== ver #3: - Introduced a new ->swap_rw() method. - Added feature support flags to the address_space_operations struct and got rid of the checks for ->direct_() and noop_direct_IO() and similar. - Implemented swap_rw for nfs, adjusting the direct I/O code paths. - Implemented swap_rw for blockdev, btrfs, ext4 and xfs. - Got rid of the return value on swap_readpage() as it's never checked. ver #2: - Remove the callback param to __swap_writepage() as it's invariant. - Allocate the kiocb on the stack in sync mode. - Do an async DIO write if WB_SYNC_ALL isn't set. - Try to remove the BIO submission paths. 

David

Link: https://lore.kernel.org/r/162876946134.3068428.15475611190876694695.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/162879971699.3306668.8977537647318498651.stgit@warthog.procyon.org.uk/ # v2
---
David Howells (9):
      mm: Remove the callback func argument from __swap_writepage()
      mm: Add 'supports' field to the address_space_operations to list features
      mm: Make swap_readpage() void
      Introduce IOCB_SWAP kiocb flag to trigger REQ_SWAP
      mm: Make swap_readpage() for SWP_FS_OPS use ->swap_rw() not ->readpage()
      mm: Make __swap_writepage() do async DIO if asked for it
      nfs: Fix write to swapfile failure due to generic_write_checks()
      block, btrfs, ext4, xfs: Implement swap_rw
      mm: Remove swap BIO paths and only use DIO paths

 Documentation/filesystems/vfs.rst |   8 +
 block/fops.c                      |   2 +
 drivers/block/loop.c              |   6 +-
 fs/9p/vfs_addr.c                  |   1 +
 fs/affs/file.c                    |   1 +
 fs/btrfs/inode.c                  |  14 +-
 fs/ceph/addr.c                    |  13 +-
 fs/cifs/file.c                    |  21 +-
 fs/direct-io.c                    |   2 +
 fs/erofs/data.c                   |   2 +-
 fs/exfat/inode.c                  |   1 +
 fs/ext2/inode.c                   |   4 +-
 fs/ext4/inode.c                   |  17 +-
 fs/f2fs/data.c                    |   1 +
 fs/fat/inode.c                    |   1 +
 fs/fcntl.c                        |   2 +-
 fs/fuse/dax.c                     |   2 +-
 fs/fuse/file.c                    |   1 +
 fs/gfs2/aops.c                    |   2 +-
 fs/hfs/inode.c                    |   1 +
 fs/hfsplus/inode.c                |   1 +
 fs/jfs/inode.c                    |   1 +
 fs/libfs.c                        |  12 -
 fs/nfs/direct.c                   |  28 +--
 fs/nfs/file.c                     |  15 +-
 fs/nilfs2/inode.c                 |   1 +
 fs/ntfs3/inode.c                  |   1 +
 fs/ocfs2/aops.c                   |   1 +
 fs/open.c                         |   3 +-
 fs/orangefs/inode.c               |   1 +
 fs/overlayfs/file.c               |   2 +-
 fs/overlayfs/inode.c              |   3 +-
 fs/reiserfs/inode.c               |   1 +
 fs/udf/file.c                     |   1 +
 fs/udf/inode.c                    |   1 +
 fs/xfs/xfs_aops.c                 |  13 +-
 fs/zonefs/super.c                 |   2 +-
 include/linux/bio.h               |   2 +
 include/linux/fs.h                |   7 +-
 include/linux/nfs_fs.h            |   2 +-
 include/linux/swap.h              |   2 +-
 mm/page_io.c                      | 356 +++++++++++++++---------------
 mm/swapfile.c                     |   4 +-
 43 files changed, 275 insertions(+), 287 deletions(-)

* [PATCH v3 2/9] mm: Add 'supports' field to the address_space_operations to list features
From: David Howells @ 2021-09-24 17:18 UTC
To: willy, hch, trond.myklebust
Cc: Darrick J. Wong, Ilya Dryomov, Jeff Layton, ceph-devel,
	Steve French, linux-cifs, linux-xfs, linux-fsdevel, linux-mm,
	dhowells, dhowells, darrick.wong, viro, jlayton, torvalds,
	linux-nfs, linux-mm, linux-fsdevel, linux-kernel

Rather than depending on .direct_IO to point to something to indicate that
direct I/O is supported, add a 'supports' bitmask that we can test, since
we only need one bit.

We can then remove noop_direct_IO, ceph_direct_io and cifs_direct_io.

[Question: Some filesystems support read DIO but not write DIO - should I
split the flag?]

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Christoph Hellwig <hch@lst.de>
cc: Darrick J. Wong <djwong@kernel.org>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: ceph-devel@vger.kernel.org
cc: Steve French <sfrench@samba.org>
cc: linux-cifs@vger.kernel.org
cc: linux-xfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---

 Documentation/filesystems/vfs.rst |  8 ++++++++
 block/fops.c                      |  1 +
 drivers/block/loop.c              |  6 +++---
 fs/9p/vfs_addr.c                  |  1 +
 fs/affs/file.c                    |  1 +
 fs/btrfs/inode.c                  |  2 +-
 fs/ceph/addr.c                    | 13 +------------
 fs/cifs/file.c                    | 21 +--------------------
 fs/erofs/data.c                   |  2 +-
 fs/exfat/inode.c                  |  1 +
 fs/ext2/inode.c                   |  4 +++-
 fs/ext4/inode.c                   |  8 ++++----
 fs/f2fs/data.c                    |  1 +
 fs/fat/inode.c                    |  1 +
 fs/fcntl.c                        |  2 +-
 fs/fuse/dax.c                     |  2 +-
 fs/fuse/file.c                    |  1 +
 fs/gfs2/aops.c                    |  2 +-
 fs/hfs/inode.c                    |  1 +
 fs/hfsplus/inode.c                |  1 +
 fs/jfs/inode.c                    |  1 +
 fs/libfs.c                        | 12 ------------
 fs/nfs/file.c                     |  1 +
 fs/nilfs2/inode.c                 |  1 +
 fs/ntfs3/inode.c                  |  1 +
 fs/ocfs2/aops.c                   |  1 +
 fs/open.c                         |  3 ++-
 fs/orangefs/inode.c               |  1 +
 fs/overlayfs/file.c               |  2 +-
 fs/overlayfs/inode.c              |  3 +--
 fs/reiserfs/inode.c               |  1 +
 fs/udf/file.c                     |  1 +
 fs/udf/inode.c                    |  1 +
 fs/xfs/xfs_aops.c                 |  4 ++--
 fs/zonefs/super.c                 |  2 +-
 include/linux/fs.h                |  4 +++-
 36 files changed, 53 insertions(+), 65 deletions(-)

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index bf5c48066fac..abb844792d6a 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -721,6 +721,7 @@ cache in your filesystem.  The following members are defined:
 .. code-block:: c
 
 	struct address_space_operations {
+		unsigned int supports;
 		int (*writepage)(struct page *page, struct writeback_control *wbc);
 		int (*readpage)(struct file *, struct page *);
 		int (*writepages)(struct address_space *, struct writeback_control *);
@@ -755,6 +756,13 @@ cache in your filesystem.  The following members are defined:
 		int (*swap_deactivate)(struct file *);
 	};
 
+``supports``
+	provides a list of features supported by address_spaces using this
+	operations set.  The following feature support flags are provided:
+
+	``AS_SUPPORTS_DIRECT_IO``
+		Direct I/O is supported.
+
 ``writepage``
 	called by the VM to write a dirty page to backing store.  This
 	may happen for data integrity reasons (i.e. 'sync'), or to free
diff --git a/block/fops.c b/block/fops.c
index ffce6f6c68dd..84c64d814d0d 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -384,6 +384,7 @@ const struct address_space_operations def_blk_aops = {
 	.direct_IO = blkdev_direct_IO,
 	.migratepage = buffer_migrate_page_norefs,
 	.is_dirty_writeback = buffer_check_dirty_writeback,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 7bf4686af774..76f7a6d85815 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -237,9 +237,9 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
 	 */
 	if (dio) {
 		if (queue_logical_block_size(lo->lo_queue) >= sb_bsize &&
-				!(lo->lo_offset & dio_align) &&
-				mapping->a_ops->direct_IO &&
-				!lo->transfer)
+		    !(lo->lo_offset & dio_align) &&
+		    (mapping->a_ops->supports & AS_SUPPORTS_DIRECT_IO) &&
+		    !lo->transfer)
 			use_dio = true;
 		else
 			use_dio = false;
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index cce9ace651a2..4910898af0d7 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -333,4 +333,5 @@ const struct address_space_operations v9fs_addr_operations = {
 	.invalidatepage = v9fs_invalidate_page,
 	.launder_page = v9fs_launder_page,
 	.direct_IO = v9fs_direct_IO,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
diff --git a/fs/affs/file.c b/fs/affs/file.c
index 75ebd2b576ca..7488bd7d3e0c 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -460,6 +460,7 @@ const struct address_space_operations affs_aops = {
 	.write_end = affs_write_end,
 	.direct_IO = affs_direct_IO,
 	.bmap = _affs_bmap
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 static inline struct buffer_head *
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 487533c35ddb..b479c97e42fc 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10937,7 +10937,6 @@ static const struct address_space_operations btrfs_aops = {
 	.writepage = btrfs_writepage,
 	.writepages = btrfs_writepages,
 	.readahead = btrfs_readahead,
-	.direct_IO = noop_direct_IO,
 	.invalidatepage = btrfs_invalidatepage,
 	.releasepage = btrfs_releasepage,
 #ifdef CONFIG_MIGRATION
@@ -10947,6 +10946,7 @@ static const struct address_space_operations btrfs_aops = {
 	.error_remove_page = generic_error_remove_page,
 	.swap_activate = btrfs_swap_activate,
 	.swap_deactivate = btrfs_swap_deactivate,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 static const struct inode_operations btrfs_file_inode_operations = {
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 99b80b5c7a93..086d4745b99e 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1306,17 +1306,6 @@ static int ceph_write_end(struct file *file, struct address_space *mapping,
 	return copied;
 }
 
-/*
- * we set .direct_IO to indicate direct io is supported, but since we
- * intercept O_DIRECT reads and writes early, this function should
- * never get called.
- */
-static ssize_t ceph_direct_io(struct kiocb *iocb, struct iov_iter *iter)
-{
-	WARN_ON(1);
-	return -EINVAL;
-}
-
 const struct address_space_operations ceph_aops = {
 	.readpage = ceph_readpage,
 	.readahead = ceph_readahead,
@@ -1327,7 +1316,7 @@ const struct address_space_operations ceph_aops = {
 	.set_page_dirty = ceph_set_page_dirty,
 	.invalidatepage = ceph_invalidatepage,
 	.releasepage = ceph_releasepage,
-	.direct_IO = ceph_direct_io,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 static void ceph_block_sigs(sigset_t *oldset)
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 6796fc73b304..a5787cf3d836 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -4891,25 +4891,6 @@ void cifs_oplock_break(struct work_struct *work)
 	cifs_done_oplock_break(cinode);
 }
 
-/*
- * The presence of cifs_direct_io() in the address space ops vector
- * allowes open() O_DIRECT flags which would have failed otherwise.
- *
- * In the non-cached mode (mount with cache=none), we shunt off direct read and write requests
- * so this method should never be called.
- *
- * Direct IO is not yet supported in the cached mode.
- */
-static ssize_t
-cifs_direct_io(struct kiocb *iocb, struct iov_iter *iter)
-{
-	/*
-	 * FIXME
-	 * Eventually need to support direct IO for non forcedirectio mounts
-	 */
-	return -EINVAL;
-}
-
 static int cifs_swap_activate(struct swap_info_struct *sis,
 			      struct file *swap_file, sector_t *span)
 {
@@ -4974,7 +4955,6 @@ const struct address_space_operations cifs_addr_ops = {
 	.write_end = cifs_write_end,
 	.set_page_dirty = __set_page_dirty_nobuffers,
 	.releasepage = cifs_release_page,
-	.direct_IO = cifs_direct_io,
 	.invalidatepage = cifs_invalidate_page,
 	.launder_page = cifs_launder_page,
 	/*
@@ -4984,6 +4964,7 @@ const struct address_space_operations cifs_addr_ops = {
 	 */
 	.swap_activate = cifs_swap_activate,
 	.swap_deactivate = cifs_swap_deactivate,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/erofs/data.c b/fs/erofs/data.c
index 9db829715652..30f19296b268 100644
--- a/fs/erofs/data.c
+++ b/fs/erofs/data.c
@@ -299,7 +299,7 @@ const struct address_space_operations erofs_raw_access_aops = {
 	.readpage = erofs_readpage,
 	.readahead = erofs_readahead,
 	.bmap = erofs_bmap,
-	.direct_IO = noop_direct_IO,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 #ifdef CONFIG_FS_DAX
diff --git a/fs/exfat/inode.c b/fs/exfat/inode.c
index ca37d4344361..f38f42282f54 100644
--- a/fs/exfat/inode.c
+++ b/fs/exfat/inode.c
@@ -500,6 +500,7 @@ static const struct address_space_operations exfat_aops = {
 	.write_end = exfat_write_end,
 	.direct_IO = exfat_direct_IO,
 	.bmap = exfat_aop_bmap
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 static inline unsigned long exfat_hash(loff_t i_pos)
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 333fa62661d5..4ad3655defd9 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -974,6 +974,7 @@ const struct address_space_operations ext2_aops = {
 	.migratepage = buffer_migrate_page,
 	.is_partially_uptodate = block_is_partially_uptodate,
 	.error_remove_page = generic_error_remove_page,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 const struct address_space_operations ext2_nobh_aops = {
@@ -988,13 +989,14 @@ const struct address_space_operations ext2_nobh_aops = {
 	.writepages = ext2_writepages,
 	.migratepage = buffer_migrate_page,
 	.error_remove_page = generic_error_remove_page,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 static const struct address_space_operations ext2_dax_aops = {
 	.writepages = ext2_dax_writepages,
-	.direct_IO = noop_direct_IO,
 	.set_page_dirty = __set_page_dirty_no_writeback,
 	.invalidatepage = noop_invalidatepage,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d18852d6029c..08d3541d8daa 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3662,11 +3662,11 @@ static const struct address_space_operations ext4_aops = {
 	.bmap = ext4_bmap,
 	.invalidatepage = ext4_invalidatepage,
 	.releasepage = ext4_releasepage,
-	.direct_IO = noop_direct_IO,
 	.migratepage = buffer_migrate_page,
 	.is_partially_uptodate = block_is_partially_uptodate,
 	.error_remove_page = generic_error_remove_page,
 	.swap_activate = ext4_iomap_swap_activate,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 static const struct address_space_operations ext4_journalled_aops = {
@@ -3680,10 +3680,10 @@ static const struct address_space_operations ext4_journalled_aops = {
 	.bmap = ext4_bmap,
 	.invalidatepage = ext4_journalled_invalidatepage,
 	.releasepage = ext4_releasepage,
-	.direct_IO = noop_direct_IO,
 	.is_partially_uptodate = block_is_partially_uptodate,
 	.error_remove_page = generic_error_remove_page,
 	.swap_activate = ext4_iomap_swap_activate,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 static const struct address_space_operations ext4_da_aops = {
@@ -3697,20 +3697,20 @@ static const struct address_space_operations ext4_da_aops = {
 	.bmap = ext4_bmap,
 	.invalidatepage = ext4_invalidatepage,
 	.releasepage = ext4_releasepage,
-	.direct_IO = noop_direct_IO,
 	.migratepage = buffer_migrate_page,
 	.is_partially_uptodate = block_is_partially_uptodate,
 	.error_remove_page = generic_error_remove_page,
 	.swap_activate = ext4_iomap_swap_activate,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 static const struct address_space_operations ext4_dax_aops = {
 	.writepages = ext4_dax_writepages,
-	.direct_IO = noop_direct_IO,
 	.set_page_dirty = __set_page_dirty_no_writeback,
 	.bmap = ext4_bmap,
 	.invalidatepage = noop_invalidatepage,
 	.swap_activate = ext4_iomap_swap_activate,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 void ext4_set_aops(struct inode *inode)
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index f4fd6c246c9a..4c3643969b69 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -4156,6 +4156,7 @@ const struct address_space_operations f2fs_dblock_aops = {
 #ifdef CONFIG_MIGRATION
 	.migratepage = f2fs_migrate_page,
 #endif
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 void f2fs_clear_page_cache_dirty_tag(struct page *page)
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index de0c9b013a85..4352981dfb82 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -351,6 +351,7 @@ static const struct address_space_operations fat_aops = {
 	.write_end = fat_write_end,
 	.direct_IO = fat_direct_IO,
 	.bmap = _fat_bmap
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 9c6c6a3e2de5..7308e8274ff9 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -58,7 +58,7 @@ static int setfl(int fd, struct file * filp, unsigned long arg)
 	/* Pipe packetized mode is controlled by O_DIRECT flag */
 	if (!S_ISFIFO(inode->i_mode) && (arg & O_DIRECT)) {
 		if (!filp->f_mapping || !filp->f_mapping->a_ops ||
-			!filp->f_mapping->a_ops->direct_IO)
+		    !(filp->f_mapping->a_ops->supports & AS_SUPPORTS_DIRECT_IO))
 				return -EINVAL;
 	}
 
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 281d79f8b3d3..e39468fd7177 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -1325,9 +1325,9 @@ bool fuse_dax_inode_alloc(struct super_block *sb, struct fuse_inode *fi)
 
 static const struct address_space_operations fuse_dax_file_aops = {
 	.writepages = fuse_dax_writepages,
-	.direct_IO = noop_direct_IO,
 	.set_page_dirty = __set_page_dirty_no_writeback,
 	.invalidatepage = noop_invalidatepage,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 void fuse_dax_inode_init(struct inode *inode)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 11404f8c21c7..3db64194d346 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3161,6 +3161,7 @@ static const struct address_space_operations fuse_file_aops = {
 	.direct_IO = fuse_direct_IO,
 	.write_begin = fuse_write_begin,
 	.write_end = fuse_write_end,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 void fuse_init_file_inode(struct inode *inode)
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 005e920f5d4a..dc50b53d6abd 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -783,10 +783,10 @@ static const struct address_space_operations gfs2_aops = {
 	.releasepage = iomap_releasepage,
 	.invalidatepage = iomap_invalidatepage,
 	.bmap = gfs2_bmap,
-	.direct_IO = noop_direct_IO,
 	.migratepage = iomap_migrate_page,
 	.is_partially_uptodate = iomap_is_partially_uptodate,
 	.error_remove_page = generic_error_remove_page,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 static const struct address_space_operations gfs2_jdata_aops = {
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index 4a95a92546a0..5f9e5464a5bf 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -177,6 +177,7 @@ const struct address_space_operations hfs_aops = {
 	.bmap = hfs_bmap,
 	.direct_IO = hfs_direct_IO,
 	.writepages = hfs_writepages,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index 6fef67c2a9f0..9f0c27e5e115 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -174,6 +174,7 @@ const struct address_space_operations hfsplus_aops = {
 	.bmap = hfsplus_bmap,
 	.direct_IO = hfsplus_direct_IO,
 	.writepages = hfsplus_writepages,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 const struct dentry_operations hfsplus_dentry_operations = {
diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c
index 57ab424c05ff..a477267471a4 100644
--- a/fs/jfs/inode.c
+++ b/fs/jfs/inode.c
@@ -366,6 +366,7 @@ const struct address_space_operations jfs_aops = {
 	.write_end = nobh_write_end,
 	.bmap = jfs_bmap,
 	.direct_IO = jfs_direct_IO,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/libfs.c b/fs/libfs.c
index 51b4de3b3447..c27f681291e5 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1182,18 +1182,6 @@ void noop_invalidatepage(struct page *page, unsigned int offset,
 }
 EXPORT_SYMBOL_GPL(noop_invalidatepage);
 
-ssize_t noop_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
-{
-	/*
-	 * iomap based filesystems support direct I/O without need for
-	 * this callback. However, it still needs to be set in
-	 * inode->a_ops so that open/fcntl know that direct I/O is
-	 * generally supported.
-	 */
-	return -EINVAL;
-}
-EXPORT_SYMBOL_GPL(noop_direct_IO);
-
 /* Because kfree isn't assignment-compatible with void(void*) ;-/ */
 void kfree_link(void *p)
 {
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index aa353fd58240..7403ec6317cb 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -532,6 +532,7 @@ const struct address_space_operations nfs_file_aops = {
 	.error_remove_page = generic_error_remove_page,
 	.swap_activate = nfs_swap_activate,
 	.swap_deactivate = nfs_swap_deactivate,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 2e8eb263cf0f..c57395c01817 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -307,6 +307,7 @@ const struct address_space_operations nilfs_aops = {
 	.invalidatepage = block_invalidatepage,
 	.direct_IO = nilfs_direct_IO,
 	.is_partially_uptodate = block_is_partially_uptodate,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 static int nilfs_insert_inode_locked(struct inode *inode,
diff --git a/fs/ntfs3/inode.c b/fs/ntfs3/inode.c
index db2a5a4c38e4..7b3ac1ab5d04 100644
--- a/fs/ntfs3/inode.c
+++ b/fs/ntfs3/inode.c
@@ -1948,6 +1948,7 @@ const struct address_space_operations ntfs_aops = {
 	.direct_IO = ntfs_direct_IO,
 	.bmap = ntfs_bmap,
 	.set_page_dirty = __set_page_dirty_buffers,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 const struct address_space_operations ntfs_aops_cmpr = {
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 68d11c295dd3..5a158975a4ff 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -2466,4 +2466,5 @@ const struct address_space_operations ocfs2_aops = {
 	.migratepage = buffer_migrate_page,
 	.is_partially_uptodate = block_is_partially_uptodate,
 	.error_remove_page = generic_error_remove_page,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
diff --git a/fs/open.c b/fs/open.c
index daa324606a41..d679dc0c1801 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -840,7 +840,8 @@ static int do_dentry_open(struct file *f,
 
 	/* NB: we're sure to have correct a_ops only after f_op->open */
 	if (f->f_flags & O_DIRECT) {
-		if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO)
+		if (!f->f_mapping->a_ops ||
+		    !(f->f_mapping->a_ops->supports & AS_SUPPORTS_DIRECT_IO))
 			return -EINVAL;
 	}
 
diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index c1bb4c4b5d67..c5bad94dfbd0 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -641,6 +641,7 @@ static const struct address_space_operations orangefs_address_operations = {
 	.freepage = orangefs_freepage,
 	.launder_page = orangefs_launder_page,
 	.direct_IO = orangefs_direct_IO,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 vm_fault_t orangefs_page_mkwrite(struct vm_fault *vmf)
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index d081faa55e83..87d05f1d718a 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -83,7 +83,7 @@ static int ovl_change_flags(struct file *file, unsigned int flags)
 
 	if (flags & O_DIRECT) {
 		if (!file->f_mapping->a_ops ||
-		    !file->f_mapping->a_ops->direct_IO)
+		    !(file->f_mapping->a_ops->supports & AS_SUPPORTS_DIRECT_IO))
 			return -EINVAL;
 	}
 
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 832b17589733..9902608b1715 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -660,8 +660,7 @@ static const struct inode_operations ovl_special_inode_operations = {
 };
 
 static const struct address_space_operations ovl_aops = {
-	/* For O_DIRECT dentry_open() checks f_mapping->a_ops->direct_IO */
-	.direct_IO = noop_direct_IO,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index f49b72ccac4c..890d91847d58 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -3436,4 +3436,5 @@ const struct address_space_operations reiserfs_address_space_operations = {
 	.bmap = reiserfs_aop_bmap,
 	.direct_IO = reiserfs_direct_IO,
 	.set_page_dirty = reiserfs_set_page_dirty,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
diff --git a/fs/udf/file.c b/fs/udf/file.c
index 1baff8ddb754..2cb1b499e5c7 100644
--- a/fs/udf/file.c
+++ b/fs/udf/file.c
@@ -131,6 +131,7 @@ const struct address_space_operations udf_adinicb_aops = {
 	.write_begin = udf_adinicb_write_begin,
 	.write_end = udf_adinicb_write_end,
 	.direct_IO = udf_adinicb_direct_IO,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 static ssize_t udf_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index 1d6b7a50736b..38b799b457d5 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -244,6 +244,7 @@ const struct address_space_operations udf_aops = {
 	.write_end = generic_write_end,
 	.direct_IO = udf_direct_IO,
 	.bmap = udf_bmap,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 /*
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 34fc6148032a..2a4570516591 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -548,17 +548,17 @@ const struct address_space_operations xfs_address_space_operations = {
 	.releasepage = iomap_releasepage,
 	.invalidatepage = iomap_invalidatepage,
 	.bmap = xfs_vm_bmap,
-	.direct_IO = noop_direct_IO,
 	.migratepage = iomap_migrate_page,
 	.is_partially_uptodate = iomap_is_partially_uptodate,
 	.error_remove_page = generic_error_remove_page,
 	.swap_activate = xfs_iomap_swapfile_activate,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 const struct address_space_operations xfs_dax_aops = {
 	.writepages = xfs_dax_writepages,
-	.direct_IO = noop_direct_IO,
 	.set_page_dirty = __set_page_dirty_no_writeback,
 	.invalidatepage = noop_invalidatepage,
 	.swap_activate = xfs_iomap_swapfile_activate,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index ddc346a9df9b..37ff541467e8 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -191,8 +191,8 @@ static const struct address_space_operations zonefs_file_aops = {
 	.migratepage = iomap_migrate_page,
 	.is_partially_uptodate = iomap_is_partially_uptodate,
 	.error_remove_page = generic_error_remove_page,
-	.direct_IO = noop_direct_IO,
 	.swap_activate = zonefs_swap_activate,
+	.supports = AS_SUPPORTS_DIRECT_IO,
 };
 
 static void zonefs_update_stats(struct inode *inode, loff_t new_isize)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e7a633353fd2..c909ca6c0eb6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -369,7 +369,10 @@ typedef struct {
 typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
 		unsigned long, unsigned long);
 
+#define AS_SUPPORTS_DIRECT_IO 0x00000001
+
 struct address_space_operations {
+	unsigned int supports; /* Bitmask of AS_SUPPORTS_* flags */
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*readpage)(struct file *, struct page *);
 
@@ -3391,7 +3394,6 @@ extern void simple_recursive_removal(struct dentry *,
 extern int noop_fsync(struct file *, loff_t, loff_t, int);
 extern void noop_invalidatepage(struct page *page, unsigned int offset,
 		unsigned int length);
-extern ssize_t noop_direct_IO(struct kiocb *iocb, struct iov_iter *iter);
 extern int simple_empty(struct dentry *);
 extern int simple_write_begin(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned flags,

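The question raised in the commit message of patch 2/9, whether the single flag should be split for filesystems that support read DIO but not write DIO, can be explored with a small user-space model. This is only a sketch, not code from the series: AS_SUPPORTS_DIRECT_READ, AS_SUPPORTS_DIRECT_WRITE and dio_open_allowed() are hypothetical names, and the struct is a cut-down stand-in for the kernel's address_space_operations.

```c
#include <stdbool.h>

/* Hypothetical split of the patch's single AS_SUPPORTS_DIRECT_IO bit. */
#define AS_SUPPORTS_DIRECT_READ		0x00000001u	/* assumed name */
#define AS_SUPPORTS_DIRECT_WRITE	0x00000002u	/* assumed name */
#define AS_SUPPORTS_DIRECT_IO \
	(AS_SUPPORTS_DIRECT_READ | AS_SUPPORTS_DIRECT_WRITE)

/* Stand-in for the kernel structure; only the new field is modelled. */
struct address_space_operations {
	unsigned int supports;	/* Bitmask of AS_SUPPORTS_* flags */
};

/*
 * Model of an open-time O_DIRECT check: every direction the caller
 * wants must be advertised in the 'supports' bitmask.
 */
static bool dio_open_allowed(const struct address_space_operations *a_ops,
			     bool want_read, bool want_write)
{
	unsigned int need = 0;

	if (!a_ops)
		return false;
	if (want_read)
		need |= AS_SUPPORTS_DIRECT_READ;
	if (want_write)
		need |= AS_SUPPORTS_DIRECT_WRITE;
	return (a_ops->supports & need) == need;
}
```

With this layout, an operations table that sets only the read bit would pass an O_RDONLY|O_DIRECT open and fail an O_RDWR|O_DIRECT one, whereas the single-bit form in the patch cannot express that distinction.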
* Re: [PATCH v3 2/9] mm: Add 'supports' field to the address_space_operations to list features
From: Matthew Wilcox @ 2021-09-24 20:10 UTC
To: David Howells
Cc: hch, trond.myklebust, Darrick J. Wong, Ilya Dryomov, Jeff Layton,
	ceph-devel, Steve French, linux-cifs, linux-xfs, linux-fsdevel,
	linux-mm, darrick.wong, viro, torvalds, linux-nfs, linux-kernel

On Fri, Sep 24, 2021 at 06:18:14PM +0100, David Howells wrote:
> Rather than depending on .direct_IO to point to something to indicate that
> direct I/O is supported, add a 'supports' bitmask that we can test, since
> we only need one bit.

Why would you add mapping->a_ops->supports instead of using one of the free
bits in mapping->flags?  See enum mapping_flags in pagemap.h.

It could also be a per-fs flag, or per-sb flag, but it's fewer
dereferences at check time if it's in mapping->flags.

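The difference between the two checks discussed above can be illustrated with a user-space model. This is a minimal sketch under stated assumptions: AS_DIRECT_IO_MODEL is a hypothetical bit number, not an existing member of enum mapping_flags, and both structs are cut-down stand-ins for the kernel types.

```c
#include <stdbool.h>

#define AS_SUPPORTS_DIRECT_IO 0x00000001u	/* value from patch 2/9 */

/* Hypothetical bit number in mapping->flags; not an existing kernel flag. */
enum mapping_flags_model {
	AS_DIRECT_IO_MODEL = 8,
};

struct address_space_operations {
	unsigned int supports;
};

struct address_space {
	unsigned long flags;
	const struct address_space_operations *a_ops;
};

/* Patch 2/9 style: load mapping->a_ops, then a_ops->supports. */
static bool dio_supported_via_aops(const struct address_space *mapping)
{
	return mapping->a_ops &&
	       (mapping->a_ops->supports & AS_SUPPORTS_DIRECT_IO);
}

/* Suggested alternative: a single load from the mapping itself. */
static bool dio_supported_via_flags(const struct address_space *mapping)
{
	return (mapping->flags >> AS_DIRECT_IO_MODEL) & 1UL;
}

/*
 * Assumed set-up step: something would have to copy the capability into
 * mapping->flags when the mapping is initialised.
 */
static void mapping_init_dio_flag(struct address_space *mapping)
{
	if (dio_supported_via_aops(mapping))
		mapping->flags |= 1UL << AS_DIRECT_IO_MODEL;
}
```

The flags variant saves one pointer dereference per check, at the cost of a set-up step for every mapping, which this sketch models with mapping_init_dio_flag().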
* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles 2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells 2021-09-24 17:18 ` [PATCH v3 2/9] mm: Add 'supports' field to the address_space_operations to list features David Howells @ 2021-09-25 23:42 ` Dave Chinner 2021-09-26 3:10 ` Matthew Wilcox 2021-09-27 20:07 ` David Sterba ` (2 subsequent siblings) 4 siblings, 1 reply; 10+ messages in thread From: Dave Chinner @ 2021-09-25 23:42 UTC (permalink / raw) To: David Howells Cc: willy, hch, trond.myklebust, Theodore Ts'o, linux-block, ceph-devel, Trond Myklebust, Darrick J. Wong, Jeff Layton, Andreas Dilger, Anna Schumaker, linux-mm, Bob Liu, Darrick J. Wong, Josef Bacik, Seth Jennings, Jens Axboe, linux-fsdevel, linux-xfs, linux-ext4, linux-cifs, Chris Mason, David Sterba, Minchan Kim, Steve French, NeilBrown, Dan Magenheimer, linux-nfs, Ilya Dryomov, linux-btrfs, viro, torvalds, linux-kernel On Fri, Sep 24, 2021 at 06:17:52PM +0100, David Howells wrote: > > Hi Willy, Trond, Christoph, > > Here's v3 of a change to make reads and writes from the swapfile use async > DIO, adding a new ->swap_rw() address_space method, rather than readpage() > or direct_IO(), as requested by Willy. This allows NFS to bypass the write > checks that prevent swapfiles from working, plus a bunch of other checks > that may or may not be necessary. > > Whilst trying to make this work, I found that NFS's support for swapfiles > seems to have been non-functional since Aug 2019 (I think), so the first > patch fixes that. Question is: do we actually *want* to keep this > functionality, given that it seems that no one's tested it with an upstream > kernel in the last couple of years? > > There are additional patches to get rid of noop_direct_IO and replace it > with a feature bitmask, to make btrfs, ext4, xfs and raw blockdevs use the > new ->swap_rw method and thence remove the direct BIO submission paths from > swap. 
> > I kept the IOCB_SWAP flag, using it to enable REQ_SWAP. I'm not sure if > that's necessary, but it seems accounting related. > > The synchronous DIO I/O code on NFS, raw blockdev, ext4 swapfile and xfs > swapfile all seem to work fine. Btrfs refuses to swapon because the file > might be CoW'd. I've tried doing "chattr +C", but that didn't help. Ok, so if the filesystem is doing block mapping in the IO path now, why does the swap file still need to map the file into a private block mapping now? i.e. all the work that iomap_swapfile_activate() does for filesystems like XFS and ext4 - is this completely redundant now that we are doing block mapping during swap file IO via iomap_dio_rw()? Actually, that path does all the "can we use this file as a swap file" checking. So the extent iteration can't go away, just the swap file mapping part (iomap_swapfile_add_extent()). This is necessary to ensure there aren't any holes in the file, and we still need that because the DIO write path will allocate into holes, which leads me to my main concern here. Using the DIO path opens up the possibility that the filesystem could want to run transactions as part of the DIO. Right now we support unwritten extents for swap files (so they don't have to be written to allocate the backing store before activation) and that means we'll be doing DIO to unwritten extents. IO completion of a DIO write to an unwritten extent will run a transaction to convert that extent to written. A similar problem with sparse files exists, because allocation of blocks can be done from the DIO path, and that requires transactions. File extension is another potential transaction path we open up by using DIO writes for swap. The problem is that a transaction run in swap IO context will deadlock the filesystem. 
Either through the unbound memory demand of metadata modification, or from needing log space that can't be freed up because the metadata IO that will free the log space is waiting on memory allocation that is waiting on swap IO... I think some more thought needs to be put into controlling the behaviour/semantics of the DIO path so that it can be safely used by swap IO, because it's not a direct 1:1 behavioural mapping with existing DIO and there are potential deadlock vectors we need to avoid. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 10+ messages in thread
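The activation-time requirement described above (iterate the extents and refuse a file with holes) can be sketched as a userspace model. The types and the function name are hypothetical and only mirror the idea; this is not the code in iomap_swapfile_activate(). The sketch assumes the extent list is sorted by offset, and it accepts unwritten extents as the message says is done today.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical extent states for this model. */
enum model_extent_type {
	MODEL_EXT_HOLE,
	MODEL_EXT_UNWRITTEN,
	MODEL_EXT_WRITTEN,
};

/* One extent of the candidate swap file, in bytes. */
struct model_extent {
	unsigned long long offset;
	unsigned long long length;
	enum model_extent_type type;
};

/*
 * Return 0 if the extents, sorted by offset, cover [0, size) back to
 * back with no hole, and -1 otherwise. A gap or overlap between two
 * extents is rejected the same way as an explicit hole.
 */
static int model_swapfile_check_extents(const struct model_extent *ext,
					size_t count,
					unsigned long long size)
{
	unsigned long long pos = 0;
	size_t i;

	assert(ext != NULL || count == 0);

	for (i = 0; i < count && pos < size; i++) {
		if (ext[i].offset != pos)
			return -1; /* gap (implicit hole) or overlap */
		if (ext[i].type == MODEL_EXT_HOLE)
			return -1; /* explicit hole */
		pos += ext[i].length;
	}

	return pos >= size ? 0 : -1; /* file must be fully covered */
}
```

A file that passes this check can still contain unwritten extents, which is exactly the case the message warns about: a DIO write to such an extent would need a conversion transaction at I/O completion.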
* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles 2021-09-25 23:42 ` [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles Dave Chinner @ 2021-09-26 3:10 ` Matthew Wilcox 2021-09-26 22:36 ` Dave Chinner 0 siblings, 1 reply; 10+ messages in thread From: Matthew Wilcox @ 2021-09-26 3:10 UTC (permalink / raw) To: Dave Chinner Cc: David Howells, hch, trond.myklebust, Theodore Ts'o, linux-block, ceph-devel, Trond Myklebust, Darrick J. Wong, Jeff Layton, Andreas Dilger, Anna Schumaker, linux-mm, Bob Liu, Darrick J. Wong, Josef Bacik, Seth Jennings, Jens Axboe, linux-fsdevel, linux-xfs, linux-ext4, linux-cifs, Chris Mason, David Sterba, Minchan Kim, Steve French, NeilBrown, Dan Magenheimer, linux-nfs, Ilya Dryomov, linux-btrfs, viro, torvalds, linux-kernel On Sun, Sep 26, 2021 at 09:42:43AM +1000, Dave Chinner wrote: > Ok, so if the filesystem is doing block mapping in the IO path now, > why does the swap file still need to map the file into a private > block mapping now? i.e. all the work that iomap_swapfile_activate() > does for filesystems like XFS and ext4 - is this completely > redundant now that we are doing block mapping during swap file IO > via iomap_dio_rw()? Hi Dave, Thanks for bringing up all these points. I think they all deserve to go into the documentation as "things to consider" for people implementing ->swap_rw for their filesystem. Something I don't think David perhaps made sufficiently clear is that regular DIO from userspace gets handled by ->read_iter and ->write_iter. This ->swap_rw op is used exclusively for, as the name suggests, swap DIO. So filesystems don't have to handle swap DIO and regular DIO the same way, and can split the allocation work between ->swap_activate and the iomap callback as they see fit (as long as they can guarantee the lack of deadlocks under memory pressure). 
There are several advantages to using the DIO infrastructure for swap: - unify block & net swap paths - allow filesystems to _see_ swap IOs instead of being bypassed - get rid of the swap extent rbtree - allow writing compound pages to swap files instead of splitting them - allow ->readpage to be synchronous for better error reporting - remove page_file_mapping() and page_file_offset() I suspect there are several problems with this patchset, but I'm not likely to have a chance to read it closely for a few days. If you have time to give the XFS parts a good look, that would be fantastic. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles 2021-09-26 3:10 ` Matthew Wilcox @ 2021-09-26 22:36 ` Dave Chinner 0 siblings, 0 replies; 10+ messages in thread From: Dave Chinner @ 2021-09-26 22:36 UTC (permalink / raw) To: Matthew Wilcox Cc: David Howells, hch, trond.myklebust, Theodore Ts'o, linux-block, ceph-devel, Trond Myklebust, Darrick J. Wong, Jeff Layton, Andreas Dilger, Anna Schumaker, linux-mm, Bob Liu, Darrick J. Wong, Josef Bacik, Seth Jennings, Jens Axboe, linux-fsdevel, linux-xfs, linux-ext4, linux-cifs, Chris Mason, David Sterba, Minchan Kim, Steve French, NeilBrown, Dan Magenheimer, linux-nfs, Ilya Dryomov, linux-btrfs, viro, torvalds, linux-kernel On Sun, Sep 26, 2021 at 04:10:43AM +0100, Matthew Wilcox wrote: > On Sun, Sep 26, 2021 at 09:42:43AM +1000, Dave Chinner wrote: > > Ok, so if the filesystem is doing block mapping in the IO path now, > > why does the swap file still need to map the file into a private > > block mapping now? i.e. all the work that iomap_swapfile_activate() > > does for filesystems like XFS and ext4 - is this completely > > redundant now that we are doing block mapping during swap file IO > > via iomap_dio_rw()? > > Hi Dave, > > Thanks for bringing up all these points. I think they all deserve to go > into the documentation as "things to consider" for people implementing > ->swap_rw for their filesystem. > > Something I don't think David perhaps made sufficiently clear is that > regular DIO from userspace gets handled by ->read_iter and ->write_iter. > This ->swap_rw op is used exclusively for, as the name suggests, swap DIO. > So filesystems don't have to handle swap DIO and regular DIO the same > way, and can split the allocation work between ->swap_activate and the > iomap callback as they see fit (as long as they can guarantee the lack > of deadlocks under memory pressure). I understand this completely. 
The point is that the implementation of ->swap_rw is to call iomap_dio_rw() with the same ops as the normal DIO read/write path uses. IOWs, apart from the IOCB_SWAP flag, there is no practical difference between the "swap DIO" and "normal DIO" I/O paths. > There are several advantages to using the DIO infrastructure for > swap: > > - unify block & net swap paths > - allow filesystems to _see_ swap IOs instead of being bypassed > - get rid of the swap extent rbtree > - allow writing compound pages to swap files instead of splitting > them > - allow ->readpage to be synchronous for better error reporting > - remove page_file_mapping() and page_file_offset() > > I suspect there are several problems with this patchset, but I'm not > likely to have a chance to read it closely for a few days. If you > have time to give the XFS parts a good look, that would be fantastic. That's what I've already done, and all the questions I've raised are from asking a simple question: what happens if a transaction is required to complete the iomap_dio_rw() swap write operation? I mean, this is similar to the problems with IOCB_NOWAIT - we're supposed to return -EAGAIN if we might block during IO submission, and one of those situations we have to consider is "do we need to run a transaction". If we get it wrong (and we do!), then the worst thing that happens is that there is a long latency for IO submission. It's a minor performance issue, not the end of the world. The difference with IOCB_SWAP is that "don't do transactions during iomap_dio_rw()" is a _hard requirement_ on both IO submission and completion. That means, from now and forever, we will have to guarantee a path through iomap_dio_rw() that will never run transactions on an IO. That requirement needs to be enforced in every block mapping callback into each filesystem, as this is something the iomap infrastructure cannot enforce. 
Hence we'll have to plumb IOCB_SWAP into a new IOMAP_SWAP iterator flag to pass to the ->iomap_begin() DIO methods to ensure they do the right thing. And then the question becomes: what happens if the filesystem cannot do the right thing? Can the swap code handle an error? e.g. the first thing that xfs_direct_write_iomap_begin() and xfs_read_iomap_begin() do is check if the filesystem is shut down and returns -EIO in that case. IOWs, we've now got normal filesystem "reject all IO" corruption protection mechanisms in play. Using iomap_dio_rw() as it stands means that _all swapfile IO will fail_ if the filesystem shuts down. Right now the swap file IO can keep going blissfully unaware of the filesystem failure status. The open swapfile will prevent the filesystem from being unmounted. Hence to unmount the shutdown filesystem to correct the problem, first the swap file has to be turned off, which means we have a fail-safe behaviour. Using the iomap_dio_rw() path means that swapfile IO _can and will fail_. AFAICT, swap IO errors are pretty much thrown away by the mm code; the swap_writepage() return value is ignored or placed on the swap cache address space and ignored. And it looks like the new read path just sets PageError() and leaves it to callers to detect and deal with a swapin failure because swap_readpage() is now void... So it seems like there's a whole new set of failure cases using the DIO path introduces into the swap IO path that haven't been considered here. I can't see why we wouldn't be able to solve them, but these considerations lead me to think that use of the DIO is based on an incorrect assumption - DIO is not a "simple low level IO" interface. Hence I suspect that we'd be much better off with a new iomap_swap_rw() implementation that just does what swap needs without any of the complexity of the DIO API. 
Internally iomap can share what it needs to share with the DIO path, but at this point I'm not sure we should be overloading the iomap_dio_rw() path with the semantics required by swap. e.g. we limit iomap_swap_rw() to only accept written or unwritten block mappings within file size on inodes with clean metadata (i.e. pure overwrite to guarantee no modification transactions), and then the fs provided ->iomap_begin callback can ignore shutdown state, elide inode level locking, do read-only mappings, etc without adding extra overhead to the existing DIO code path... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles 2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells 2021-09-24 17:18 ` [PATCH v3 2/9] mm: Add 'supports' field to the address_space_operations to list features David Howells 2021-09-25 23:42 ` [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles Dave Chinner @ 2021-09-27 20:07 ` David Sterba 2021-09-28 3:11 ` NeilBrown 2021-09-29 15:45 ` David Howells 4 siblings, 0 replies; 10+ messages in thread From: David Sterba @ 2021-09-27 20:07 UTC (permalink / raw) To: David Howells Cc: willy, hch, trond.myklebust, Theodore Ts'o, linux-block, ceph-devel, Trond Myklebust, Darrick J. Wong, Jeff Layton, Andreas Dilger, Anna Schumaker, linux-mm, Bob Liu, Darrick J. Wong, Josef Bacik, Seth Jennings, Jens Axboe, linux-fsdevel, linux-xfs, linux-ext4, linux-cifs, Chris Mason, David Sterba, Minchan Kim, Steve French, NeilBrown, Dan Magenheimer, linux-nfs, Ilya Dryomov, linux-btrfs, viro, torvalds, linux-kernel On Fri, Sep 24, 2021 at 06:17:52PM +0100, David Howells wrote: > > Hi Willy, Trond, Christoph, > > Here's v3 of a change to make reads and writes from the swapfile use async > DIO, adding a new ->swap_rw() address_space method, rather than readpage() > or direct_IO(), as requested by Willy. This allows NFS to bypass the write > checks that prevent swapfiles from working, plus a bunch of other checks > that may or may not be necessary. > > Whilst trying to make this work, I found that NFS's support for swapfiles > seems to have been non-functional since Aug 2019 (I think), so the first > patch fixes that. Question is: do we actually *want* to keep this > functionality, given that it seems that no one's tested it with an upstream > kernel in the last couple of years? 
> > There are additional patches to get rid of noop_direct_IO and replace it > with a feature bitmask, to make btrfs, ext4, xfs and raw blockdevs use the > new ->swap_rw method and thence remove the direct BIO submission paths from > swap. > > I kept the IOCB_SWAP flag, using it to enable REQ_SWAP. I'm not sure if > that's necessary, but it seems accounting related. > > The synchronous DIO I/O code on NFS, raw blockdev, ext4 swapfile and xfs > swapfile all seem to work fine. Btrfs refuses to swapon because the file > might be CoW'd. I've tried doing "chattr +C", but that didn't help. There was probably some step missing. The file must not have holes, so either do 'dd' to the right size or use fallocate (which is recommended in manual page btrfs(5) SWAPFILE SUPPORT). There are some fstests exercising swapfile (grep -l _format_swapfile tests/generic/*) so you could try that without having to set up the swapfile manually. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles 2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells ` (2 preceding siblings ...) 2021-09-27 20:07 ` David Sterba @ 2021-09-28 3:11 ` NeilBrown 2021-09-30 15:54 ` Steve French 2021-09-29 15:45 ` David Howells 4 siblings, 1 reply; 10+ messages in thread From: NeilBrown @ 2021-09-28 3:11 UTC (permalink / raw) To: David Howells Cc: willy, hch, trond.myklebust, Theodore Ts'o, linux-block, ceph-devel, Trond Myklebust, Darrick J. Wong, Jeff Layton, Andreas Dilger, Anna Schumaker, linux-mm, Bob Liu, Darrick J. Wong, Josef Bacik, Seth Jennings, Jens Axboe, linux-fsdevel, linux-xfs, linux-ext4, linux-cifs, Chris Mason, David Sterba, Minchan Kim, Steve French, Dan Magenheimer, linux-nfs, Ilya Dryomov, linux-btrfs, dhowells, viro, torvalds, linux-kernel On Sat, 25 Sep 2021, David Howells wrote: > Whilst trying to make this work, I found that NFS's support for swapfiles > seems to have been non-functional since Aug 2019 (I think), so the first > patch fixes that. Question is: do we actually *want* to keep this > functionality, given that it seems that no one's tested it with an upstream > kernel in the last couple of years? SUSE definitely want to keep this functionality. We have customers using it. I agree it would be good if it was being tested somewhere.... Thanks, NeilBrown ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles 2021-09-28 3:11 ` NeilBrown @ 2021-09-30 15:54 ` Steve French 0 siblings, 0 replies; 10+ messages in thread From: Steve French @ 2021-09-30 15:54 UTC (permalink / raw) To: NeilBrown Cc: David Howells, Matthew Wilcox, Christoph Hellwig, Trond Myklebust, Theodore Ts'o, linux-block, ceph-devel, Trond Myklebust, Darrick J. Wong, Jeff Layton, Andreas Dilger, Anna Schumaker, linux-mm, Bob Liu, Darrick J. Wong, Josef Bacik, Seth Jennings, Jens Axboe, linux-fsdevel, linux-xfs, linux-ext4, CIFS, Chris Mason, David Sterba, Minchan Kim, Steve French, Dan Magenheimer, linux-nfs, Ilya Dryomov, linux-btrfs, Al Viro, Linus Torvalds, LKML On Mon, Sep 27, 2021 at 10:12 PM NeilBrown <neilb@suse.de> wrote: > > On Sat, 25 Sep 2021, David Howells wrote: > > Whilst trying to make this work, I found that NFS's support for swapfiles > > seems to have been non-functional since Aug 2019 (I think), so the first > > patch fixes that. Question is: do we actually *want* to keep this > > functionality, given that it seems that no one's tested it with an upstream > > kernel in the last couple of years? > > SUSE definitely want to keep this functionality. We have customers > using it. > I agree it would be good if it was being tested somewhere.... > I am trying to work through the testing of swap over SMB3 mounts since there are use cases where you need to expand the swap space to remote storage and so this requirement comes up. The main difficulty I run into is forgetting to mount with the mount options (to store mode bits) (so swap file has the right permissions) and debugging some of the xfstests relating to swap can be a little confusing. -- Thanks, Steve ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles 2021-09-24 17:17 [RFC][PATCH v3 0/9] mm: Use DIO for swap and fix NFS swapfiles David Howells ` (3 preceding siblings ...) 2021-09-28 3:11 ` NeilBrown @ 2021-09-29 15:45 ` David Howells 4 siblings, 0 replies; 10+ messages in thread From: David Howells @ 2021-09-29 15:45 UTC (permalink / raw) To: dsterba Cc: dhowells, willy, Chris Mason, linux-block, ceph-devel, linux-mm, linux-fsdevel, linux-xfs, linux-ext4, linux-cifs, linux-nfs, Ilya Dryomov, linux-btrfs, linux-kernel David Sterba <dsterba@suse.cz> wrote: > > There are additional patches to get rid of noop_direct_IO and replace it > > with a feature bitmask, to make btrfs, ext4, xfs and raw blockdevs use the > > new ->swap_rw method and thence remove the direct BIO submission paths from > > swap. > > > > I kept the IOCB_SWAP flag, using it to enable REQ_SWAP. I'm not sure if > > that's necessary, but it seems accounting related. > > There was probably some step missing. The file must not have holes, so > either do 'dd' to the right size or use fallocate (which is recommended > in manual page btrfs(5) SWAPFILE SUPPORT). There are some fstests > exercising swapfile (grep -l _format_swapfile tests/generic/*) so you > could try that without having to set up the swapfile manually. Yeah. As advised elsewhere, I removed the file and recreated it, doing the chattr before extending the file. At that point swapon worked. It didn't work though, and various userspace programs started dying. I'm guessing my btrfs_swap_rw() is wrong somehow. David ^ permalink raw reply [flat|nested] 10+ messages in thread