linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* RFC: allow building a kernel without buffer_heads
@ 2023-04-24  5:49 Christoph Hellwig
  2023-04-24  5:49 ` [PATCH 01/17] fs: unexport buffer_check_dirty_writeback Christoph Hellwig
                   ` (16 more replies)
  0 siblings, 17 replies; 41+ messages in thread
From: Christoph Hellwig @ 2023-04-24  5:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

Hi all,

after all the talk about removing buffer_heads, here is a series that
shows how to build a kernel without buffer_heads.  And how unrealistic
it is to remove the entirely.

Most of the series refactors some common code to make implementing direct
I/O easier without use of the ->direct_IO method and the helpers based
around it.  It then switches buffered writes (but not writeback) for
block devices to use iomap unconditionally, but still using buffer_heads.

The final patch then adds a CONFIG_BUFFER_HEAD selected by all file
systems that need it (which is most block based file systems), makes the
buffer_head support in iomap optional, and adds an alternative
implementation of the block device address_operations using iomap.

With this you can build a kernel with block device support, but without
buffer_heads.  This kernel supports xfs and btrfs as full blown block
based filesystems, and a bunch of read-only ones like cramfs, erofs and
squashfs.  Note that the md software raid drivers is also disabled as it
has some (rather questionable) buffer_head usage in the unconditionally
built bitmap code.

The series is based on Linux 6.3 and will need some rebasing before it
can be fed to the maintainers incrementally.  All but the last patch
definitively seem useful for me.  The last one I think is just to avoid
introducing new buffer_head dependencies, even if I suspect the real
life usefulness of a !CONFIG_BUFFER_HEAD kernel might be rather limited.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 01/17] fs: unexport buffer_check_dirty_writeback
  2023-04-24  5:49 RFC: allow building a kernel without buffer_heads Christoph Hellwig
@ 2023-04-24  5:49 ` Christoph Hellwig
  2023-05-19 14:17   ` Hannes Reinecke
                     ` (2 more replies)
  2023-04-24  5:49 ` [PATCH 02/17] fs: remove the special !CONFIG_BLOCK def_blk_fops Christoph Hellwig
                   ` (15 subsequent siblings)
  16 siblings, 3 replies; 41+ messages in thread
From: Christoph Hellwig @ 2023-04-24  5:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

buffer_check_dirty_writeback is only used by the block device aops,
remove the export.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/buffer.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 9e1e2add541e07..eb14fbaa7d35f7 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -111,7 +111,6 @@ void buffer_check_dirty_writeback(struct folio *folio,
 		bh = bh->b_this_page;
 	} while (bh != head);
 }
-EXPORT_SYMBOL(buffer_check_dirty_writeback);
 
 /*
  * Block until a buffer comes unlocked.  This doesn't stop it
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 02/17] fs: remove the special !CONFIG_BLOCK def_blk_fops
  2023-04-24  5:49 RFC: allow building a kernel without buffer_heads Christoph Hellwig
  2023-04-24  5:49 ` [PATCH 01/17] fs: unexport buffer_check_dirty_writeback Christoph Hellwig
@ 2023-04-24  5:49 ` Christoph Hellwig
  2023-04-24 19:22   ` Randy Dunlap
  2023-04-24  5:49 ` [PATCH 03/17] fs: rename and move block_page_mkwrite_return Christoph Hellwig
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2023-04-24  5:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

def_blk_fops always returns -ENODEV, which dosn't match the return value
of a non-existing block device with CONFIG_BLOCK, which is -ENXIO.
Just remove the extra implementation and fall back to the default
no_open_fops that always returns -ENXIO.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/Makefile   | 10 ++--------
 fs/inode.c    |  3 ++-
 fs/no-block.c | 19 -------------------
 3 files changed, 4 insertions(+), 28 deletions(-)
 delete mode 100644 fs/no-block.c

diff --git a/fs/Makefile b/fs/Makefile
index 05f89b5c962f88..da21e7d0a1cf37 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -18,14 +18,8 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		fs_types.o fs_context.o fs_parser.o fsopen.o init.o \
 		kernel_read_file.o mnt_idmapping.o remap_range.o
 
-ifeq ($(CONFIG_BLOCK),y)
-obj-y +=	buffer.o mpage.o
-else
-obj-y +=	no-block.o
-endif
-
-obj-$(CONFIG_PROC_FS) += proc_namespace.o
-
+obj-$(CONFIG_BLOCK)		+= buffer.o mpage.o
+obj-$(CONFIG_PROC_FS)		+= proc_namespace.o
 obj-$(CONFIG_LEGACY_DIRECT_IO)	+= direct-io.o
 obj-y				+= notify/
 obj-$(CONFIG_EPOLL)		+= eventpoll.o
diff --git a/fs/inode.c b/fs/inode.c
index 4558dc2f135573..d43f07f146eb73 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2265,7 +2265,8 @@ void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
 		inode->i_fop = &def_chr_fops;
 		inode->i_rdev = rdev;
 	} else if (S_ISBLK(mode)) {
-		inode->i_fop = &def_blk_fops;
+		if (IS_ENABLED(CONFIG_BLOCK))
+			inode->i_fop = &def_blk_fops;
 		inode->i_rdev = rdev;
 	} else if (S_ISFIFO(mode))
 		inode->i_fop = &pipefifo_fops;
diff --git a/fs/no-block.c b/fs/no-block.c
deleted file mode 100644
index 481c0f0ab4bd2c..00000000000000
--- a/fs/no-block.c
+++ /dev/null
@@ -1,19 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-or-later
-/* no-block.c: implementation of routines required for non-BLOCK configuration
- *
- * Copyright (C) 2006 Red Hat, Inc. All Rights Reserved.
- * Written by David Howells (dhowells@redhat.com)
- */
-
-#include <linux/kernel.h>
-#include <linux/fs.h>
-
-static int no_blkdev_open(struct inode * inode, struct file * filp)
-{
-	return -ENODEV;
-}
-
-const struct file_operations def_blk_fops = {
-	.open		= no_blkdev_open,
-	.llseek		= noop_llseek,
-};
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 03/17] fs: rename and move block_page_mkwrite_return
  2023-04-24  5:49 RFC: allow building a kernel without buffer_heads Christoph Hellwig
  2023-04-24  5:49 ` [PATCH 01/17] fs: unexport buffer_check_dirty_writeback Christoph Hellwig
  2023-04-24  5:49 ` [PATCH 02/17] fs: remove the special !CONFIG_BLOCK def_blk_fops Christoph Hellwig
@ 2023-04-24  5:49 ` Christoph Hellwig
  2023-04-24 12:30   ` Matthew Wilcox
  2023-04-24  5:49 ` [PATCH 04/17] fs: remove emergency_thaw_bdev Christoph Hellwig
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2023-04-24  5:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

block_page_mkwrite_return is neither block nor mkwrite specific, and
should not be under CONFIG_BLOCK.  Move it to mm.h and rename it to
errno_to_vmfault.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/ext4/inode.c             |  2 +-
 fs/f2fs/file.c              |  2 +-
 fs/gfs2/file.c              | 16 ++++++++--------
 fs/iomap/buffered-io.c      |  2 +-
 fs/nilfs2/file.c            |  2 +-
 fs/udf/file.c               |  2 +-
 include/linux/buffer_head.h | 12 ------------
 include/linux/mm.h          | 13 +++++++++++++
 8 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bf0b7dea4900af..c0f41a38bd9ca4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6343,7 +6343,7 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 	if (err == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
 		goto retry_alloc;
 out_ret:
-	ret = block_page_mkwrite_return(err);
+	ret = errno_to_vmfault(err);
 out:
 	filemap_invalidate_unlock_shared(mapping);
 	sb_end_pagefault(inode->i_sb);
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 15dabeac469050..f4ab23efcf85f8 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -161,7 +161,7 @@ static vm_fault_t f2fs_vm_page_mkwrite(struct vm_fault *vmf)
 
 	sb_end_pagefault(inode->i_sb);
 err:
-	return block_page_mkwrite_return(err);
+	return errno_to_vmfault(err);
 }
 
 static const struct vm_operations_struct f2fs_file_vm_ops = {
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 300844f50dcd28..8c4fad359ff538 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -432,7 +432,7 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
 	err = gfs2_glock_nq(&gh);
 	if (err) {
-		ret = block_page_mkwrite_return(err);
+		ret = errno_to_vmfault(err);
 		goto out_uninit;
 	}
 
@@ -474,7 +474,7 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 
 	err = gfs2_rindex_update(sdp);
 	if (err) {
-		ret = block_page_mkwrite_return(err);
+		ret = errno_to_vmfault(err);
 		goto out_unlock;
 	}
 
@@ -482,12 +482,12 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 	ap.target = data_blocks + ind_blocks;
 	err = gfs2_quota_lock_check(ip, &ap);
 	if (err) {
-		ret = block_page_mkwrite_return(err);
+		ret = errno_to_vmfault(err);
 		goto out_unlock;
 	}
 	err = gfs2_inplace_reserve(ip, &ap);
 	if (err) {
-		ret = block_page_mkwrite_return(err);
+		ret = errno_to_vmfault(err);
 		goto out_quota_unlock;
 	}
 
@@ -500,7 +500,7 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 	}
 	err = gfs2_trans_begin(sdp, rblocks, 0);
 	if (err) {
-		ret = block_page_mkwrite_return(err);
+		ret = errno_to_vmfault(err);
 		goto out_trans_fail;
 	}
 
@@ -508,7 +508,7 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 	if (gfs2_is_stuffed(ip)) {
 		err = gfs2_unstuff_dinode(ip);
 		if (err) {
-			ret = block_page_mkwrite_return(err);
+			ret = errno_to_vmfault(err);
 			goto out_trans_end;
 		}
 	}
@@ -524,7 +524,7 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 
 	err = gfs2_allocate_page_backing(page, length);
 	if (err)
-		ret = block_page_mkwrite_return(err);
+		ret = errno_to_vmfault(err);
 
 out_page_locked:
 	if (ret != VM_FAULT_LOCKED)
@@ -558,7 +558,7 @@ static vm_fault_t gfs2_fault(struct vm_fault *vmf)
 	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
 	err = gfs2_glock_nq(&gh);
 	if (err) {
-		ret = block_page_mkwrite_return(err);
+		ret = errno_to_vmfault(err);
 		goto out_uninit;
 	}
 	ret = filemap_fault(vmf);
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6f4c97a6d7e9dc..2986be63d2bea6 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1290,7 +1290,7 @@ vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops)
 	return VM_FAULT_LOCKED;
 out_unlock:
 	folio_unlock(folio);
-	return block_page_mkwrite_return(ret);
+	return errno_to_vmfault(ret);
 }
 EXPORT_SYMBOL_GPL(iomap_page_mkwrite);
 
diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index a265d391ffe92d..ea35294bb158a3 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -108,7 +108,7 @@ static vm_fault_t nilfs_page_mkwrite(struct vm_fault *vmf)
 	wait_for_stable_page(page);
  out:
 	sb_end_pagefault(inode->i_sb);
-	return block_page_mkwrite_return(ret);
+	return errno_to_vmfault(ret);
 }
 
 static const struct vm_operations_struct nilfs_file_vm_ops = {
diff --git a/fs/udf/file.c b/fs/udf/file.c
index 8238f742377bab..9420284d7c0455 100644
--- a/fs/udf/file.c
+++ b/fs/udf/file.c
@@ -71,7 +71,7 @@ static vm_fault_t udf_page_mkwrite(struct vm_fault *vmf)
 		err = block_commit_write(page, 0, end);
 	if (err < 0) {
 		unlock_page(page);
-		ret = block_page_mkwrite_return(err);
+		ret = errno_to_vmfault(err);
 		goto out_unlock;
 	}
 out_dirty:
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 8f14dca5fed756..0fcc16b7f02bb4 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -281,18 +281,6 @@ int generic_cont_expand_simple(struct inode *inode, loff_t size);
 int block_commit_write(struct page *page, unsigned from, unsigned to);
 int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 				get_block_t get_block);
-/* Convert errno to return value from ->page_mkwrite() call */
-static inline vm_fault_t block_page_mkwrite_return(int err)
-{
-	if (err == 0)
-		return VM_FAULT_LOCKED;
-	if (err == -EFAULT || err == -EAGAIN)
-		return VM_FAULT_NOPAGE;
-	if (err == -ENOMEM)
-		return VM_FAULT_OOM;
-	/* -ENOSPC, -EDQUOT, -EIO ... */
-	return VM_FAULT_SIGBUS;
-}
 sector_t generic_block_bmap(struct address_space *, sector_t, get_block_t *);
 int block_truncate_page(struct address_space *, loff_t, get_block_t *);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1f79667824eb60..03e645032c81ac 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3061,6 +3061,19 @@ extern vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		pgoff_t start_pgoff, pgoff_t end_pgoff);
 extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);
 
+/* Convert errno to return value from ->page_mkwrite() call */
+static inline vm_fault_t errno_to_vmfault(int err)
+{
+	if (err == 0)
+		return VM_FAULT_LOCKED;
+	if (err == -EFAULT || err == -EAGAIN)
+		return VM_FAULT_NOPAGE;
+	if (err == -ENOMEM)
+		return VM_FAULT_OOM;
+	/* -ENOSPC, -EDQUOT, -EIO ... */
+	return VM_FAULT_SIGBUS;
+}
+
 extern unsigned long stack_guard_gap;
 /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 04/17] fs: remove emergency_thaw_bdev
  2023-04-24  5:49 RFC: allow building a kernel without buffer_heads Christoph Hellwig
                   ` (2 preceding siblings ...)
  2023-04-24  5:49 ` [PATCH 03/17] fs: rename and move block_page_mkwrite_return Christoph Hellwig
@ 2023-04-24  5:49 ` Christoph Hellwig
  2023-04-24  5:49 ` [PATCH 05/17] filemap: update ki_pos in generic_perform_write Christoph Hellwig
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2023-04-24  5:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

Fold emergency_thaw_bdev into it's only caller, to prepare for buffer.c
to be built only when buffer_head support is enabled.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/buffer.c   | 6 ------
 fs/internal.h | 6 ------
 fs/super.c    | 4 +++-
 3 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index eb14fbaa7d35f7..58e0007892b1c7 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -563,12 +563,6 @@ static int osync_buffers_list(spinlock_t *lock, struct list_head *list)
 	return err;
 }
 
-void emergency_thaw_bdev(struct super_block *sb)
-{
-	while (sb->s_bdev && !thaw_bdev(sb->s_bdev))
-		printk(KERN_WARNING "Emergency Thaw on %pg\n", sb->s_bdev);
-}
-
 /**
  * sync_mapping_buffers - write out & wait upon a mapping's "associated" buffers
  * @mapping: the mapping which wants those buffers written
diff --git a/fs/internal.h b/fs/internal.h
index dc4eb91a577a80..cad87784eb5e0f 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -23,16 +23,10 @@ struct mnt_idmap;
  */
 #ifdef CONFIG_BLOCK
 extern void __init bdev_cache_init(void);
-
-void emergency_thaw_bdev(struct super_block *sb);
 #else
 static inline void bdev_cache_init(void)
 {
 }
-static inline int emergency_thaw_bdev(struct super_block *sb)
-{
-	return 0;
-}
 #endif /* CONFIG_BLOCK */
 
 /*
diff --git a/fs/super.c b/fs/super.c
index 04bc62ab7dfea9..d8f0a28d1850b1 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1026,7 +1026,9 @@ static void do_thaw_all_callback(struct super_block *sb)
 {
 	down_write(&sb->s_umount);
 	if (sb->s_root && sb->s_flags & SB_BORN) {
-		emergency_thaw_bdev(sb);
+		if (IS_ENABLED(CONFIG_BLOCK))
+			while (sb->s_bdev && !thaw_bdev(sb->s_bdev))
+				pr_warn("Emergency Thaw on %pg\n", sb->s_bdev);
 		thaw_super_locked(sb);
 	} else {
 		up_write(&sb->s_umount);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 05/17] filemap: update ki_pos in generic_perform_write
  2023-04-24  5:49 RFC: allow building a kernel without buffer_heads Christoph Hellwig
                   ` (3 preceding siblings ...)
  2023-04-24  5:49 ` [PATCH 04/17] fs: remove emergency_thaw_bdev Christoph Hellwig
@ 2023-04-24  5:49 ` Christoph Hellwig
  2023-04-24 18:54   ` [Cluster-devel] " Andreas Gruenbacher
  2023-04-24  5:49 ` [PATCH 06/17] filemap: add a kiocb_write_and_wait helper Christoph Hellwig
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2023-04-24  5:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

All callers of generic_perform_write need to updated ki_pos, move it into
common code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/ceph/file.c | 2 --
 fs/ext4/file.c | 9 +++------
 fs/f2fs/file.c | 1 -
 fs/nfs/file.c  | 1 -
 mm/filemap.c   | 8 ++++----
 5 files changed, 7 insertions(+), 14 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index f4d8bf7dec88a8..feeb9882ef635a 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1894,8 +1894,6 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		 * can not run at the same time
 		 */
 		written = generic_perform_write(iocb, from);
-		if (likely(written >= 0))
-			iocb->ki_pos = pos + written;
 		ceph_end_io_write(inode);
 	}
 
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 0b8b4499e5ca18..1026acaf1235a0 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -291,12 +291,9 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 
 out:
 	inode_unlock(inode);
-	if (likely(ret > 0)) {
-		iocb->ki_pos += ret;
-		ret = generic_write_sync(iocb, ret);
-	}
-
-	return ret;
+	if (unlikely(ret <= 0))
+		return ret;
+	return generic_write_sync(iocb, ret);
 }
 
 static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index f4ab23efcf85f8..5a9ae054b6da7d 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -4511,7 +4511,6 @@ static ssize_t f2fs_buffered_write_iter(struct kiocb *iocb,
 	current->backing_dev_info = NULL;
 
 	if (ret > 0) {
-		iocb->ki_pos += ret;
 		f2fs_update_iostat(F2FS_I_SB(inode), inode,
 						APP_BUFFERED_IO, ret);
 	}
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 893625eacab9fa..abdae2b29369be 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -666,7 +666,6 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
 		goto out;
 
 	written = result;
-	iocb->ki_pos += written;
 	nfs_add_stats(inode, NFSIOS_NORMALWRITTENBYTES, written);
 
 	if (mntflags & NFS_MOUNT_WRITE_EAGER) {
diff --git a/mm/filemap.c b/mm/filemap.c
index 2723104cc06a12..0110bde3708b3f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3960,7 +3960,10 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
 		balance_dirty_pages_ratelimited(mapping);
 	} while (iov_iter_count(i));
 
-	return written ? written : status;
+	if (!written)
+		return status;
+	iocb->ki_pos += written;
+	return written;
 }
 EXPORT_SYMBOL(generic_perform_write);
 
@@ -4039,7 +4042,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		endbyte = pos + status - 1;
 		err = filemap_write_and_wait_range(mapping, pos, endbyte);
 		if (err == 0) {
-			iocb->ki_pos = endbyte + 1;
 			written += status;
 			invalidate_mapping_pages(mapping,
 						 pos >> PAGE_SHIFT,
@@ -4052,8 +4054,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		}
 	} else {
 		written = generic_perform_write(iocb, from);
-		if (likely(written > 0))
-			iocb->ki_pos += written;
 	}
 out:
 	current->backing_dev_info = NULL;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 06/17] filemap: add a kiocb_write_and_wait helper
  2023-04-24  5:49 RFC: allow building a kernel without buffer_heads Christoph Hellwig
                   ` (4 preceding siblings ...)
  2023-04-24  5:49 ` [PATCH 05/17] filemap: update ki_pos in generic_perform_write Christoph Hellwig
@ 2023-04-24  5:49 ` Christoph Hellwig
  2023-04-24  5:49 ` [PATCH 07/17] filemap: add a kiocb_invalidate_pages helper Christoph Hellwig
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2023-04-24  5:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

Factor out a helper that does filemap_write_and_wait_range for a the
range covered by a read kiocb, or returns -EAGAIN if the kiocb
is marked as nowait and there would be pages to write.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/fops.c            | 18 +++---------------
 include/linux/pagemap.h |  2 ++
 mm/filemap.c            | 30 ++++++++++++++++++------------
 3 files changed, 23 insertions(+), 27 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index d2e6be4e3d1c7d..c194939b851cfb 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -576,21 +576,9 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		goto reexpand; /* skip atime */
 
 	if (iocb->ki_flags & IOCB_DIRECT) {
-		struct address_space *mapping = iocb->ki_filp->f_mapping;
-
-		if (iocb->ki_flags & IOCB_NOWAIT) {
-			if (filemap_range_needs_writeback(mapping, pos,
-							  pos + count - 1)) {
-				ret = -EAGAIN;
-				goto reexpand;
-			}
-		} else {
-			ret = filemap_write_and_wait_range(mapping, pos,
-							   pos + count - 1);
-			if (ret < 0)
-				goto reexpand;
-		}
-
+		ret = kiocb_write_and_wait(iocb, count);
+		if (ret < 0)
+			goto reexpand;
 		file_accessed(iocb->ki_filp);
 
 		ret = blkdev_direct_IO(iocb, to);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 0acb8e1fb7afdc..51f7aea51169c4 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -30,6 +30,7 @@ static inline void invalidate_remote_inode(struct inode *inode)
 int invalidate_inode_pages2(struct address_space *mapping);
 int invalidate_inode_pages2_range(struct address_space *mapping,
 		pgoff_t start, pgoff_t end);
+
 int write_inode_now(struct inode *, int sync);
 int filemap_fdatawrite(struct address_space *);
 int filemap_flush(struct address_space *);
@@ -54,6 +55,7 @@ int filemap_check_errors(struct address_space *mapping);
 void __filemap_set_wb_err(struct address_space *mapping, int err);
 int filemap_fdatawrite_wbc(struct address_space *mapping,
 			   struct writeback_control *wbc);
+int kiocb_write_and_wait(struct kiocb *iocb, size_t count);
 
 static inline int filemap_write_and_wait(struct address_space *mapping)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 0110bde3708b3f..28ea9804191d6b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2765,6 +2765,21 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 }
 EXPORT_SYMBOL_GPL(filemap_read);
 
+int kiocb_write_and_wait(struct kiocb *iocb, size_t count)
+{
+	struct address_space *mapping = iocb->ki_filp->f_mapping;
+	loff_t pos = iocb->ki_pos;
+	loff_t end = pos + count - 1;
+
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (filemap_range_needs_writeback(mapping, pos, end))
+			return -EAGAIN;
+		return 0;
+	}
+
+	return filemap_write_and_wait_range(mapping, pos, end);
+}
+
 /**
  * generic_file_read_iter - generic filesystem read routine
  * @iocb:	kernel I/O control block
@@ -2800,18 +2815,9 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		struct address_space *mapping = file->f_mapping;
 		struct inode *inode = mapping->host;
 
-		if (iocb->ki_flags & IOCB_NOWAIT) {
-			if (filemap_range_needs_writeback(mapping, iocb->ki_pos,
-						iocb->ki_pos + count - 1))
-				return -EAGAIN;
-		} else {
-			retval = filemap_write_and_wait_range(mapping,
-						iocb->ki_pos,
-					        iocb->ki_pos + count - 1);
-			if (retval < 0)
-				return retval;
-		}
-
+		retval = kiocb_write_and_wait(iocb, count);
+		if (retval < 0)
+			return retval;
 		file_accessed(file);
 
 		retval = mapping->a_ops->direct_IO(iocb, iter);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 07/17] filemap: add a kiocb_invalidate_pages helper
  2023-04-24  5:49 RFC: allow building a kernel without buffer_heads Christoph Hellwig
                   ` (5 preceding siblings ...)
  2023-04-24  5:49 ` [PATCH 06/17] filemap: add a kiocb_write_and_wait helper Christoph Hellwig
@ 2023-04-24  5:49 ` Christoph Hellwig
  2023-04-24  5:49 ` [PATCH 08/17] filemap: add a kiocb_invalidate_post_write helper Christoph Hellwig
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2023-04-24  5:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

Factor out a helper that calls filemap_write_and_wait_range and
invalidate_inode_pages2_rangefor a the range covered by a write kiocb or
returns -EAGAIN if the kiocb is marked as nowait and there would be pages
to write or invalidate.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 include/linux/pagemap.h |  1 +
 mm/filemap.c            | 48 ++++++++++++++++++++++++-----------------
 2 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 51f7aea51169c4..f6c5ef9d3d7646 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -30,6 +30,7 @@ static inline void invalidate_remote_inode(struct inode *inode)
 int invalidate_inode_pages2(struct address_space *mapping);
 int invalidate_inode_pages2_range(struct address_space *mapping,
 		pgoff_t start, pgoff_t end);
+int kiocb_invalidate_pages(struct kiocb *iocb, size_t count);
 
 int write_inode_now(struct inode *, int sync);
 int filemap_fdatawrite(struct address_space *);
diff --git a/mm/filemap.c b/mm/filemap.c
index 28ea9804191d6b..afb3b0169a9173 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2780,6 +2780,33 @@ int kiocb_write_and_wait(struct kiocb *iocb, size_t count)
 	return filemap_write_and_wait_range(mapping, pos, end);
 }
 
+int kiocb_invalidate_pages(struct kiocb *iocb, size_t count)
+{
+	struct address_space *mapping = iocb->ki_filp->f_mapping;
+	loff_t pos = iocb->ki_pos;
+	loff_t end = pos + count - 1;
+	int ret;
+
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		/* we could block if there are any pages in the range */
+		if (filemap_range_has_page(mapping, pos, end))
+			return -EAGAIN;
+	} else {
+		ret = filemap_write_and_wait_range(mapping, pos, end);
+		if (ret)
+			return ret;
+	}
+
+	/*
+	 * After a write we want buffered reads to be sure to go to disk to get
+	 * the new data.  We invalidate clean cached page from the region we're
+	 * about to write.  We do this *before* the write so that we can return
+	 * without clobbering -EIOCBQUEUED from ->direct_IO().
+	 */
+	return invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
+					     end >> PAGE_SHIFT);
+}
+
 /**
  * generic_file_read_iter - generic filesystem read routine
  * @iocb:	kernel I/O control block
@@ -3823,30 +3850,11 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	write_len = iov_iter_count(from);
 	end = (pos + write_len - 1) >> PAGE_SHIFT;
 
-	if (iocb->ki_flags & IOCB_NOWAIT) {
-		/* If there are pages to writeback, return */
-		if (filemap_range_has_page(file->f_mapping, pos,
-					   pos + write_len - 1))
-			return -EAGAIN;
-	} else {
-		written = filemap_write_and_wait_range(mapping, pos,
-							pos + write_len - 1);
-		if (written)
-			goto out;
-	}
-
-	/*
-	 * After a write we want buffered reads to be sure to go to disk to get
-	 * the new data.  We invalidate clean cached page from the region we're
-	 * about to write.  We do this *before* the write so that we can return
-	 * without clobbering -EIOCBQUEUED from ->direct_IO().
-	 */
-	written = invalidate_inode_pages2_range(mapping,
-					pos >> PAGE_SHIFT, end);
 	/*
 	 * If a page can not be invalidated, return 0 to fall back
 	 * to buffered write.
 	 */
+	written = kiocb_invalidate_pages(iocb, write_len);
 	if (written) {
 		if (written == -EBUSY)
 			return 0;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 08/17] filemap: add a kiocb_invalidate_post_write helper
  2023-04-24  5:49 RFC: allow building a kernel without buffer_heads Christoph Hellwig
                   ` (6 preceding siblings ...)
  2023-04-24  5:49 ` [PATCH 07/17] filemap: add a kiocb_invalidate_pages helper Christoph Hellwig
@ 2023-04-24  5:49 ` Christoph Hellwig
  2023-04-24  5:49 ` [PATCH 09/17] fs: factor out a direct_write_fallback helper Christoph Hellwig
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2023-04-24  5:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

Add a helper to invalidate page cache after a dio write.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/direct-io.c          | 10 ++--------
 fs/iomap/direct-io.c    | 12 ++----------
 include/linux/fs.h      |  5 -----
 include/linux/pagemap.h |  1 +
 mm/filemap.c            | 37 ++++++++++++++++++++-----------------
 5 files changed, 25 insertions(+), 40 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index ab0d7ea89813a6..f2d2e544c259dc 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -286,14 +286,8 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
 	 * zeros from unwritten extents.
 	 */
 	if (flags & DIO_COMPLETE_INVALIDATE &&
-	    ret > 0 && dio_op == REQ_OP_WRITE &&
-	    dio->inode->i_mapping->nrpages) {
-		err = invalidate_inode_pages2_range(dio->inode->i_mapping,
-					offset >> PAGE_SHIFT,
-					(offset + ret - 1) >> PAGE_SHIFT);
-		if (err)
-			dio_warn_stale_pagecache(dio->iocb->ki_filp);
-	}
+	    ret > 0 && dio_op == REQ_OP_WRITE)
+	    	kiocb_invalidate_post_write(dio->iocb, ret);
 
 	inode_dio_end(dio->inode);
 
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f771001574d008..2bf8d684675615 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -81,7 +81,6 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 {
 	const struct iomap_dio_ops *dops = dio->dops;
 	struct kiocb *iocb = dio->iocb;
-	struct inode *inode = file_inode(iocb->ki_filp);
 	loff_t offset = iocb->ki_pos;
 	ssize_t ret = dio->error;
 
@@ -109,15 +108,8 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 	 * ->end_io() when necessary, otherwise a racing buffer read would cache
 	 * zeros from unwritten extents.
 	 */
-	if (!dio->error && dio->size &&
-	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) {
-		int err;
-		err = invalidate_inode_pages2_range(inode->i_mapping,
-				offset >> PAGE_SHIFT,
-				(offset + dio->size - 1) >> PAGE_SHIFT);
-		if (err)
-			dio_warn_stale_pagecache(iocb->ki_filp);
-	}
+	if (!dio->error && dio->size && (dio->flags & IOMAP_DIO_WRITE))
+		kiocb_invalidate_post_write(iocb, dio->size);
 
 	inode_dio_end(file_inode(iocb->ki_filp));
 	/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c85916e9f7db50..b8ca376e606255 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2818,11 +2818,6 @@ static inline void inode_dio_end(struct inode *inode)
 		wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
 }
 
-/*
- * Warn about a page cache invalidation failure diring a direct I/O write.
- */
-void dio_warn_stale_pagecache(struct file *filp);
-
 extern void inode_set_flags(struct inode *inode, unsigned int flags,
 			    unsigned int mask);
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index f6c5ef9d3d7646..38d767499daa1a 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -31,6 +31,7 @@ int invalidate_inode_pages2(struct address_space *mapping);
 int invalidate_inode_pages2_range(struct address_space *mapping,
 		pgoff_t start, pgoff_t end);
 int kiocb_invalidate_pages(struct kiocb *iocb, size_t count);
+void kiocb_invalidate_post_write(struct kiocb *iocb, size_t count);
 
 int write_inode_now(struct inode *, int sync);
 int filemap_fdatawrite(struct address_space *);
diff --git a/mm/filemap.c b/mm/filemap.c
index afb3b0169a9173..8c5196cf93a454 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3819,7 +3819,7 @@ EXPORT_SYMBOL(read_cache_page_gfp);
 /*
  * Warn about a page cache invalidation failure during a direct I/O write.
  */
-void dio_warn_stale_pagecache(struct file *filp)
+static void dio_warn_stale_pagecache(struct file *filp)
 {
 	static DEFINE_RATELIMIT_STATE(_rs, 86400 * HZ, DEFAULT_RATELIMIT_BURST);
 	char pathname[128];
@@ -3836,19 +3836,23 @@ void dio_warn_stale_pagecache(struct file *filp)
 	}
 }
 
+void kiocb_invalidate_post_write(struct kiocb *iocb, size_t count)
+{
+	struct address_space *mapping = iocb->ki_filp->f_mapping;
+	loff_t pos = iocb->ki_pos;
+
+	if (mapping->nrpages &&
+	    invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
+			(pos + count - 1) >> PAGE_SHIFT))
+		dio_warn_stale_pagecache(iocb->ki_filp);
+}
+
 ssize_t
 generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 {
-	struct file	*file = iocb->ki_filp;
-	struct address_space *mapping = file->f_mapping;
-	struct inode	*inode = mapping->host;
-	loff_t		pos = iocb->ki_pos;
-	ssize_t		written;
-	size_t		write_len;
-	pgoff_t		end;
-
-	write_len = iov_iter_count(from);
-	end = (pos + write_len - 1) >> PAGE_SHIFT;
+	struct address_space *mapping = iocb->ki_filp->f_mapping;
+	size_t write_len = iov_iter_count(from);
+	ssize_t written;
 
 	/*
 	 * If a page can not be invalidated, return 0 to fall back
@@ -3858,7 +3862,7 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	if (written) {
 		if (written == -EBUSY)
 			return 0;
-		goto out;
+		return written;
 	}
 
 	written = mapping->a_ops->direct_IO(iocb, from);
@@ -3880,11 +3884,11 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	 *
 	 * Skip invalidation for async writes or if mapping has no pages.
 	 */
-	if (written > 0 && mapping->nrpages &&
-	    invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT, end))
-		dio_warn_stale_pagecache(file);
-
 	if (written > 0) {
+		struct inode *inode = mapping->host;
+		loff_t pos = iocb->ki_pos;
+
+		kiocb_invalidate_post_write(iocb, write_len);
 		pos += written;
 		write_len -= written;
 		if (pos > i_size_read(inode) && !S_ISBLK(inode->i_mode)) {
@@ -3895,7 +3899,6 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	}
 	if (written != -EIOCBQUEUED)
 		iov_iter_revert(from, write_len - iov_iter_count(from));
-out:
 	return written;
 }
 EXPORT_SYMBOL(generic_file_direct_write);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 09/17] fs: factor out a direct_write_fallback helper
  2023-04-24  5:49 RFC: allow building a kernel without buffer_heads Christoph Hellwig
                   ` (7 preceding siblings ...)
  2023-04-24  5:49 ` [PATCH 08/17] filemap: add a kiocb_invalidate_post_write helper Christoph Hellwig
@ 2023-04-24  5:49 ` Christoph Hellwig
  2023-04-24  5:49 ` [PATCH 10/17] iomap: use kiocb_write_and_wait and kiocb_invalidate_pages Christoph Hellwig
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2023-04-24  5:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

Add a helper dealing with handling the syncing of a buffered write fallback
for direct I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/libfs.c         | 36 ++++++++++++++++++++++++++++
 include/linux/fs.h |  2 ++
 mm/filemap.c       | 59 ++++++++++------------------------------------
 3 files changed, 50 insertions(+), 47 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 4eda519c30022f..a530d36e0d1a3d 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1618,3 +1618,39 @@ u64 inode_query_iversion(struct inode *inode)
 	return cur >> I_VERSION_QUERIED_SHIFT;
 }
 EXPORT_SYMBOL(inode_query_iversion);
+
+ssize_t direct_write_fallback(struct kiocb *iocb, struct iov_iter *iter,
+		ssize_t direct_written, ssize_t buffered_written)
+{
+	struct address_space *mapping = iocb->ki_filp->f_mapping;
+	loff_t pos = iocb->ki_pos, end;
+	int err;
+
+	/*
+	 * If the buffered write fallback returned an error, we want to return
+	 * the number of bytes which were written by direct I/O, or the error
+	 * code if that was zero.
+	 *
+	 * Note that this differs from normal direct-io semantics, which will
+	 * return -EFOO even if some bytes were written.
+	 */
+	if (unlikely(buffered_written < 0))
+		return buffered_written;
+
+	/*
+	 * We need to ensure that the page cache pages are written to disk and
+	 * invalidated to preserve the expected O_DIRECT semantics.
+	 */
+	end = pos + buffered_written - 1;
+	err = filemap_write_and_wait_range(mapping, pos, end);
+	if (err < 0) {
+		/*
+		 * We don't know how much we wrote, so just return the number of
+		 * bytes which were direct-written
+		 */
+		return err;
+	}
+	invalidate_mapping_pages(mapping, pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
+	return direct_written + buffered_written;
+}
+EXPORT_SYMBOL_GPL(direct_write_fallback);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b8ca376e606255..337afdb5024dcb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2719,6 +2719,8 @@ extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
 extern ssize_t generic_file_direct_write(struct kiocb *, struct iov_iter *);
 ssize_t generic_perform_write(struct kiocb *, struct iov_iter *);
+ssize_t direct_write_fallback(struct kiocb *iocb, struct iov_iter *iter,
+		ssize_t direct_written, ssize_t buffered_written);
 
 ssize_t vfs_iter_read(struct file *file, struct iov_iter *iter, loff_t *ppos,
 		rwf_t flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index 8c5196cf93a454..bb769746f78b08 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -4009,25 +4009,21 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
-	struct inode 	*inode = mapping->host;
-	ssize_t		written = 0;
-	ssize_t		err;
-	ssize_t		status;
+	struct inode *inode = mapping->host;
+	ssize_t ret;
 
 	/* We can write back this queue in page reclaim */
 	current->backing_dev_info = inode_to_bdi(inode);
-	err = file_remove_privs(file);
-	if (err)
+	ret = file_remove_privs(file);
+	if (ret)
 		goto out;
 
-	err = file_update_time(file);
-	if (err)
+	ret = file_update_time(file);
+	if (ret)
 		goto out;
 
 	if (iocb->ki_flags & IOCB_DIRECT) {
-		loff_t pos, endbyte;
-
-		written = generic_file_direct_write(iocb, from);
+		ret = generic_file_direct_write(iocb, from);
 		/*
 		 * If the write stopped short of completing, fall back to
 		 * buffered writes.  Some filesystems do this for writes to
@@ -4035,46 +4031,15 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		 * not succeed (even if it did, DAX does not handle dirty
 		 * page-cache pages correctly).
 		 */
-		if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
-			goto out;
-
-		pos = iocb->ki_pos;
-		status = generic_perform_write(iocb, from);
-		/*
-		 * If generic_perform_write() returned a synchronous error
-		 * then we want to return the number of bytes which were
-		 * direct-written, or the error code if that was zero.  Note
-		 * that this differs from normal direct-io semantics, which
-		 * will return -EFOO even if some bytes were written.
-		 */
-		if (unlikely(status < 0)) {
-			err = status;
-			goto out;
-		}
-		/*
-		 * We need to ensure that the page cache pages are written to
-		 * disk and invalidated to preserve the expected O_DIRECT
-		 * semantics.
-		 */
-		endbyte = pos + status - 1;
-		err = filemap_write_and_wait_range(mapping, pos, endbyte);
-		if (err == 0) {
-			written += status;
-			invalidate_mapping_pages(mapping,
-						 pos >> PAGE_SHIFT,
-						 endbyte >> PAGE_SHIFT);
-		} else {
-			/*
-			 * We don't know how much we wrote, so just return
-			 * the number of bytes which were direct-written
-			 */
-		}
+		if (ret >= 0 && iov_iter_count(from) && !IS_DAX(inode))
+			ret = direct_write_fallback(iocb, from, ret,
+					generic_perform_write(iocb, from));
 	} else {
-		written = generic_perform_write(iocb, from);
+		ret = generic_perform_write(iocb, from);
 	}
 out:
 	current->backing_dev_info = NULL;
-	return written ? written : err;
+	return ret;
 }
 EXPORT_SYMBOL(__generic_file_write_iter);
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 10/17] iomap: use kiocb_write_and_wait and kiocb_invalidate_pages
  2023-04-24  5:49 RFC: allow building a kernel without buffer_heads Christoph Hellwig
                   ` (8 preceding siblings ...)
  2023-04-24  5:49 ` [PATCH 09/17] fs: factor out a direct_write_fallback helper Christoph Hellwig
@ 2023-04-24  5:49 ` Christoph Hellwig
  2023-04-24  5:49 ` [PATCH 11/17] iomap: assign current->backing_dev_info in iomap_file_buffered_write Christoph Hellwig
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2023-04-24  5:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

Use the common helpers for direct I/O page invalidation instead of
open coding the logic.  This leads to a slight reordering of checks
in __iomap_dio_rw to keep the logic straight.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/iomap/direct-io.c | 55 ++++++++++++++++----------------------------
 1 file changed, 20 insertions(+), 35 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 2bf8d684675615..10a790f568afca 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -470,7 +470,6 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		unsigned int dio_flags, void *private, size_t done_before)
 {
-	struct address_space *mapping = iocb->ki_filp->f_mapping;
 	struct inode *inode = file_inode(iocb->ki_filp);
 	struct iomap_iter iomi = {
 		.inode		= inode,
@@ -479,11 +478,11 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		.flags		= IOMAP_DIRECT,
 		.private	= private,
 	};
-	loff_t end = iomi.pos + iomi.len - 1, ret = 0;
 	bool wait_for_completion =
 		is_sync_kiocb(iocb) || (dio_flags & IOMAP_DIO_FORCE_WAIT);
 	struct blk_plug plug;
 	struct iomap_dio *dio;
+	loff_t ret = 0;
 
 	if (!iomi.len)
 		return NULL;
@@ -505,31 +504,29 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	dio->submit.waiter = current;
 	dio->submit.poll_bio = NULL;
 
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		iomi.flags |= IOMAP_NOWAIT;
+
 	if (iov_iter_rw(iter) == READ) {
 		if (iomi.pos >= dio->i_size)
 			goto out_free_dio;
 
-		if (iocb->ki_flags & IOCB_NOWAIT) {
-			if (filemap_range_needs_writeback(mapping, iomi.pos,
-					end)) {
-				ret = -EAGAIN;
-				goto out_free_dio;
-			}
-			iomi.flags |= IOMAP_NOWAIT;
-		}
-
 		if (user_backed_iter(iter))
 			dio->flags |= IOMAP_DIO_DIRTY;
+
+		ret = kiocb_write_and_wait(iocb, iomi.len);
+		if (ret)
+			goto out_free_dio;
 	} else {
 		iomi.flags |= IOMAP_WRITE;
 		dio->flags |= IOMAP_DIO_WRITE;
 
-		if (iocb->ki_flags & IOCB_NOWAIT) {
-			if (filemap_range_has_page(mapping, iomi.pos, end)) {
-				ret = -EAGAIN;
+		if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) {
+			ret = -EAGAIN;
+			if (iomi.pos >= dio->i_size ||
+			    iomi.pos + iomi.len > dio->i_size)
 				goto out_free_dio;
-			}
-			iomi.flags |= IOMAP_NOWAIT;
+			iomi.flags |= IOMAP_OVERWRITE_ONLY;
 		}
 
 		/* for data sync or sync, we need sync completion processing */
@@ -545,31 +542,19 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			if (!(iocb->ki_flags & IOCB_SYNC))
 				dio->flags |= IOMAP_DIO_WRITE_FUA;
 		}
-	}
-
-	if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) {
-		ret = -EAGAIN;
-		if (iomi.pos >= dio->i_size ||
-		    iomi.pos + iomi.len > dio->i_size)
-			goto out_free_dio;
-		iomi.flags |= IOMAP_OVERWRITE_ONLY;
-	}
 
-	ret = filemap_write_and_wait_range(mapping, iomi.pos, end);
-	if (ret)
-		goto out_free_dio;
-
-	if (iov_iter_rw(iter) == WRITE) {
 		/*
 		 * Try to invalidate cache pages for the range we are writing.
 		 * If this invalidation fails, let the caller fall back to
 		 * buffered I/O.
 		 */
-		if (invalidate_inode_pages2_range(mapping,
-				iomi.pos >> PAGE_SHIFT, end >> PAGE_SHIFT)) {
-			trace_iomap_dio_invalidate_fail(inode, iomi.pos,
-							iomi.len);
-			ret = -ENOTBLK;
+		ret = kiocb_invalidate_pages(iocb, iomi.len);
+		if (ret) {
+			if (ret != -EAGAIN) {
+				trace_iomap_dio_invalidate_fail(inode, iomi.pos,
+								iomi.len);
+				ret = -ENOTBLK;
+			}
 			goto out_free_dio;
 		}
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 11/17] iomap: assign current->backing_dev_info in iomap_file_buffered_write
  2023-04-24  5:49 RFC: allow building a kernel without buffer_heads Christoph Hellwig
                   ` (9 preceding siblings ...)
  2023-04-24  5:49 ` [PATCH 10/17] iomap: use kiocb_write_and_wait and kiocb_invalidate_pages Christoph Hellwig
@ 2023-04-24  5:49 ` Christoph Hellwig
  2023-04-24  6:18   ` Darrick J. Wong
  2023-04-24  5:49 ` [PATCH 12/17] fuse: use direct_write_fallback Christoph Hellwig
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2023-04-24  5:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

Move the assignment to current->backing_dev_info from the callers into
iomap_file_buffered_write.  Note that zonefs was missing this assignment
before.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/gfs2/file.c         | 3 ---
 fs/iomap/buffered-io.c | 4 ++++
 fs/xfs/xfs_file.c      | 5 -----
 3 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 8c4fad359ff538..4d88c6080b3e30 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -25,7 +25,6 @@
 #include <linux/dlm.h>
 #include <linux/dlm_plock.h>
 #include <linux/delay.h>
-#include <linux/backing-dev.h>
 #include <linux/fileattr.h>
 
 #include "gfs2.h"
@@ -1041,11 +1040,9 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb,
 			goto out_unlock;
 	}
 
-	current->backing_dev_info = inode_to_bdi(inode);
 	pagefault_disable();
 	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
 	pagefault_enable();
-	current->backing_dev_info = NULL;
 	if (ret > 0) {
 		iocb->ki_pos += ret;
 		written += ret;
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 2986be63d2bea6..3d5042efda202a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -3,6 +3,7 @@
  * Copyright (C) 2010 Red Hat, Inc.
  * Copyright (C) 2016-2019 Christoph Hellwig.
  */
+#include <linux/backing-dev.h>
 #include <linux/module.h>
 #include <linux/compiler.h>
 #include <linux/fs.h>
@@ -876,8 +877,11 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		iter.flags |= IOMAP_NOWAIT;
 
+	current->backing_dev_info = inode_to_bdi(iter.inode);
 	while ((ret = iomap_iter(&iter, ops)) > 0)
 		iter.processed = iomap_write_iter(&iter, i);
+	current->backing_dev_info = NULL;
+
 	if (iter.pos == iocb->ki_pos)
 		return ret;
 	return iter.pos - iocb->ki_pos;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 705250f9f90a1b..f5442e689baf15 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -27,7 +27,6 @@
 
 #include <linux/dax.h>
 #include <linux/falloc.h>
-#include <linux/backing-dev.h>
 #include <linux/mman.h>
 #include <linux/fadvise.h>
 #include <linux/mount.h>
@@ -717,9 +716,6 @@ xfs_file_buffered_write(
 	if (ret)
 		goto out;
 
-	/* We can write back this queue in page reclaim */
-	current->backing_dev_info = inode_to_bdi(inode);
-
 	trace_xfs_file_buffered_write(iocb, from);
 	ret = iomap_file_buffered_write(iocb, from,
 			&xfs_buffered_write_iomap_ops);
@@ -753,7 +749,6 @@ xfs_file_buffered_write(
 		goto write_retry;
 	}
 
-	current->backing_dev_info = NULL;
 out:
 	if (iolock)
 		xfs_iunlock(ip, iolock);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 12/17] fuse: use direct_write_fallback
  2023-04-24  5:49 RFC: allow building a kernel without buffer_heads Christoph Hellwig
                   ` (10 preceding siblings ...)
  2023-04-24  5:49 ` [PATCH 11/17] iomap: assign current->backing_dev_info in iomap_file_buffered_write Christoph Hellwig
@ 2023-04-24  5:49 ` Christoph Hellwig
  2023-04-24  5:49 ` [PATCH 13/17] block: don't plug in blkdev_write_iter Christoph Hellwig
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2023-04-24  5:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

Refator the fuse direct write code so that the fuse_perform_write
callig convention is simplified to match generic_perform_write and
it's updating ki_pos directly, and the generic direct_write_fallback
helper is used to consolidate buffered I/O fallback code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/fuse/file.c | 44 +++++++++++---------------------------------
 1 file changed, 11 insertions(+), 33 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index de37a3a06a7169..55b64dac175d68 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1280,13 +1280,13 @@ static inline unsigned int fuse_wr_pages(loff_t pos, size_t len,
 		     max_pages);
 }
 
-static ssize_t fuse_perform_write(struct kiocb *iocb,
-				  struct address_space *mapping,
-				  struct iov_iter *ii, loff_t pos)
+static ssize_t fuse_perform_write(struct kiocb *iocb, struct iov_iter *ii)
 {
+	struct address_space *mapping = iocb->ki_filp->f_mapping;
 	struct inode *inode = mapping->host;
 	struct fuse_conn *fc = get_fuse_conn(inode);
 	struct fuse_inode *fi = get_fuse_inode(inode);
+	loff_t pos = iocb->ki_pos;
 	int err = 0;
 	ssize_t res = 0;
 
@@ -1329,7 +1329,10 @@ static ssize_t fuse_perform_write(struct kiocb *iocb,
 	fuse_write_update_attr(inode, pos, res);
 	clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state);
 
-	return res > 0 ? res : err;
+	if (!res)
+		return err;
+	iocb->ki_pos += res;
+	return res;
 }
 
 static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
@@ -1337,11 +1340,9 @@ static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
 	ssize_t written = 0;
-	ssize_t written_buffered = 0;
 	struct inode *inode = mapping->host;
 	ssize_t err;
 	struct fuse_conn *fc = get_fuse_conn(inode);
-	loff_t endbyte = 0;
 
 	if (fc->writeback_cache) {
 		/* Update size (EOF optimization) and mode (SUID clearing) */
@@ -1378,35 +1379,12 @@ static ssize_t fuse_cache_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		goto out;
 
 	if (iocb->ki_flags & IOCB_DIRECT) {
-		loff_t pos = iocb->ki_pos;
 		written = generic_file_direct_write(iocb, from);
-		if (written < 0 || !iov_iter_count(from))
-			goto out;
-
-		pos += written;
-
-		written_buffered = fuse_perform_write(iocb, mapping, from, pos);
-		if (written_buffered < 0) {
-			err = written_buffered;
-			goto out;
-		}
-		endbyte = pos + written_buffered - 1;
-
-		err = filemap_write_and_wait_range(file->f_mapping, pos,
-						   endbyte);
-		if (err)
-			goto out;
-
-		invalidate_mapping_pages(file->f_mapping,
-					 pos >> PAGE_SHIFT,
-					 endbyte >> PAGE_SHIFT);
-
-		written += written_buffered;
-		iocb->ki_pos = pos + written_buffered;
+		if (written >= 0 && iov_iter_count(from))
+			written = direct_write_fallback(iocb, from, written,
+					fuse_perform_write(iocb, from));
 	} else {
-		written = fuse_perform_write(iocb, mapping, from, iocb->ki_pos);
-		if (written >= 0)
-			iocb->ki_pos += written;
+		written = fuse_perform_write(iocb, from);
 	}
 out:
 	current->backing_dev_info = NULL;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 13/17] block: don't plug in blkdev_write_iter
  2023-04-24  5:49 RFC: allow building a kernel without buffer_heads Christoph Hellwig
                   ` (11 preceding siblings ...)
  2023-04-24  5:49 ` [PATCH 12/17] fuse: use direct_write_fallback Christoph Hellwig
@ 2023-04-24  5:49 ` Christoph Hellwig
  2023-04-24  5:49 ` [PATCH 14/17] block: open code __generic_file_write_iter for blkdev writes Christoph Hellwig
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2023-04-24  5:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

Remove the no needed plug in blkdev_write_iter.  For direct I/O that
issues more than a single I/O, the plug is already done in
__blkdev_direct_IO, and for synchronous buffered writes, the plug
is done in writeback_inodes_wb / wb_writeback, while for the other
cases a plug doesn't make sense.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/fops.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index c194939b851cfb..b670aa7c5bb745 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -520,7 +520,6 @@ static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	struct block_device *bdev = iocb->ki_filp->private_data;
 	struct inode *bd_inode = bdev->bd_inode;
 	loff_t size = bdev_nr_bytes(bdev);
-	struct blk_plug plug;
 	size_t shorted = 0;
 	ssize_t ret;
 
@@ -545,12 +544,10 @@ static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		iov_iter_truncate(from, size);
 	}
 
-	blk_start_plug(&plug);
 	ret = __generic_file_write_iter(iocb, from);
 	if (ret > 0)
 		ret = generic_write_sync(iocb, ret);
 	iov_iter_reexpand(from, iov_iter_count(from) + shorted);
-	blk_finish_plug(&plug);
 	return ret;
 }
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 14/17] block: open code __generic_file_write_iter for blkdev writes
  2023-04-24  5:49 RFC: allow building a kernel without buffer_heads Christoph Hellwig
                   ` (12 preceding siblings ...)
  2023-04-24  5:49 ` [PATCH 13/17] block: don't plug in blkdev_write_iter Christoph Hellwig
@ 2023-04-24  5:49 ` Christoph Hellwig
  2023-05-24 22:23   ` Luis Chamberlain
  2023-04-24  5:49 ` [PATCH 15/17] block: stop setting ->direct_IO Christoph Hellwig
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2023-04-24  5:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

Open code __generic_file_write_iter to remove the indirect call into
->direct_IO and to prepare using the iomap based write code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/fops.c | 46 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 44 insertions(+), 2 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index b670aa7c5bb745..fd510b6142bd57 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -508,6 +508,29 @@ static int blkdev_close(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+static ssize_t
+blkdev_direct_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	size_t count = iov_iter_count(from);
+	ssize_t written;
+
+	written = kiocb_invalidate_pages(iocb, count);
+	if (written) {
+		if (written == -EBUSY)
+			return 0;
+		return written;
+	}
+
+	written = blkdev_direct_IO(iocb, from);
+	if (written > 0) {
+		kiocb_invalidate_post_write(iocb, count);
+		iocb->ki_pos += written;
+	}
+	if (written != -EIOCBQUEUED)
+		iov_iter_revert(from, count - written - iov_iter_count(from));
+	return written;
+}
+
 /*
  * Write data to the block device.  Only intended for the block device itself
  * and the raw driver which basically is a fake block device.
@@ -517,7 +540,8 @@ static int blkdev_close(struct inode *inode, struct file *filp)
  */
 static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
-	struct block_device *bdev = iocb->ki_filp->private_data;
+	struct file *file = iocb->ki_filp;
+	struct block_device *bdev = file->private_data;
 	struct inode *bd_inode = bdev->bd_inode;
 	loff_t size = bdev_nr_bytes(bdev);
 	size_t shorted = 0;
@@ -538,13 +562,31 @@ static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if ((iocb->ki_flags & (IOCB_NOWAIT | IOCB_DIRECT)) == IOCB_NOWAIT)
 		return -EOPNOTSUPP;
 
+	ret = file_remove_privs(file);
+	if (ret)
+		return ret;
+
+	ret = file_update_time(file);
+	if (ret)
+		return ret;
+
 	size -= iocb->ki_pos;
 	if (iov_iter_count(from) > size) {
 		shorted = iov_iter_count(from) - size;
 		iov_iter_truncate(from, size);
 	}
 
-	ret = __generic_file_write_iter(iocb, from);
+	current->backing_dev_info = bdev->bd_disk->bdi;
+	if (iocb->ki_flags & IOCB_DIRECT) {
+		ret = blkdev_direct_write(iocb, from);
+		if (ret >= 0 && iov_iter_count(from))
+			ret = direct_write_fallback(iocb, from, ret,
+					generic_perform_write(iocb, from));
+	} else {
+		ret = generic_perform_write(iocb, from);
+	}
+	current->backing_dev_info = NULL;
+
 	if (ret > 0)
 		ret = generic_write_sync(iocb, ret);
 	iov_iter_reexpand(from, iov_iter_count(from) + shorted);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 15/17] block: stop setting ->direct_IO
  2023-04-24  5:49 RFC: allow building a kernel without buffer_heads Christoph Hellwig
                   ` (13 preceding siblings ...)
  2023-04-24  5:49 ` [PATCH 14/17] block: open code __generic_file_write_iter for blkdev writes Christoph Hellwig
@ 2023-04-24  5:49 ` Christoph Hellwig
  2023-04-24  5:49 ` [PATCH 16/17] block: use iomap for writes to block devices Christoph Hellwig
  2023-04-24  5:49 ` [PATCH 17/17] fs: add CONFIG_BUFFER_HEAD Christoph Hellwig
  16 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2023-04-24  5:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

Direct I/O on block devices now nevers goes through aops->direct_IO.
Stop setting it and set the FMODE_CAN_ODIRECT in ->open instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/fops.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index fd510b6142bd57..318247832a7bcf 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -428,7 +428,6 @@ const struct address_space_operations def_blk_aops = {
 	.writepage	= blkdev_writepage,
 	.write_begin	= blkdev_write_begin,
 	.write_end	= blkdev_write_end,
-	.direct_IO	= blkdev_direct_IO,
 	.migrate_folio	= buffer_migrate_folio_norefs,
 	.is_dirty_writeback = buffer_check_dirty_writeback,
 };
@@ -481,7 +480,7 @@ static int blkdev_open(struct inode *inode, struct file *filp)
 	 * during an unstable branch.
 	 */
 	filp->f_flags |= O_LARGEFILE;
-	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
+	filp->f_mode |= FMODE_CAN_ODIRECT | FMODE_NOWAIT | FMODE_BUF_RASYNC;
 
 	if (filp->f_flags & O_NDELAY)
 		filp->f_mode |= FMODE_NDELAY;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 16/17] block: use iomap for writes to block devices
  2023-04-24  5:49 RFC: allow building a kernel without buffer_heads Christoph Hellwig
                   ` (14 preceding siblings ...)
  2023-04-24  5:49 ` [PATCH 15/17] block: stop setting ->direct_IO Christoph Hellwig
@ 2023-04-24  5:49 ` Christoph Hellwig
       [not found]   ` <CGME20230426130921eucas1p279078812be7e8d50c1305e47cea53661@eucas1p2.samsung.com>
  2023-05-19 14:22   ` Hannes Reinecke
  2023-04-24  5:49 ` [PATCH 17/17] fs: add CONFIG_BUFFER_HEAD Christoph Hellwig
  16 siblings, 2 replies; 41+ messages in thread
From: Christoph Hellwig @ 2023-04-24  5:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

Use iomap in buffer_head compat mode to write to block devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/Kconfig |  1 +
 block/fops.c  | 33 +++++++++++++++++++++++++++++----
 2 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/block/Kconfig b/block/Kconfig
index 941b2dca70db73..672b08f0096ab4 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -5,6 +5,7 @@
 menuconfig BLOCK
        bool "Enable the block layer" if EXPERT
        default y
+       select IOMAP
        select SBITMAP
        help
 	 Provide block layer support for the kernel.
diff --git a/block/fops.c b/block/fops.c
index 318247832a7bcf..7910636f8df33b 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -15,6 +15,7 @@
 #include <linux/falloc.h>
 #include <linux/suspend.h>
 #include <linux/fs.h>
+#include <linux/iomap.h>
 #include <linux/module.h>
 #include "blk.h"
 
@@ -386,6 +387,27 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	return __blkdev_direct_IO(iocb, iter, bio_max_segs(nr_pages));
 }
 
+static int blkdev_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
+		unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
+{
+	struct block_device *bdev = I_BDEV(inode);
+	loff_t isize = i_size_read(inode);
+
+	iomap->bdev = bdev;
+	iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev));
+	if (WARN_ON_ONCE(iomap->offset >= isize))
+		return -EIO;
+	iomap->type = IOMAP_MAPPED;
+	iomap->addr = iomap->offset;
+	iomap->length = isize - iomap->offset;
+	iomap->flags |= IOMAP_F_BUFFER_HEAD;
+	return 0;
+}
+
+static const struct iomap_ops blkdev_iomap_ops = {
+	.iomap_begin		= blkdev_iomap_begin,
+};
+
 static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
 {
 	return block_write_full_page(page, blkdev_get_block, wbc);
@@ -530,6 +552,11 @@ blkdev_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	return written;
 }
 
+static ssize_t blkdev_buffered_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	return iomap_file_buffered_write(iocb, from, &blkdev_iomap_ops);
+}
+
 /*
  * Write data to the block device.  Only intended for the block device itself
  * and the raw driver which basically is a fake block device.
@@ -575,16 +602,14 @@ static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		iov_iter_truncate(from, size);
 	}
 
-	current->backing_dev_info = bdev->bd_disk->bdi;
 	if (iocb->ki_flags & IOCB_DIRECT) {
 		ret = blkdev_direct_write(iocb, from);
 		if (ret >= 0 && iov_iter_count(from))
 			ret = direct_write_fallback(iocb, from, ret,
-					generic_perform_write(iocb, from));
+					blkdev_buffered_write(iocb, from));
 	} else {
-		ret = generic_perform_write(iocb, from);
+		ret = blkdev_buffered_write(iocb, from);
 	}
-	current->backing_dev_info = NULL;
 
 	if (ret > 0)
 		ret = generic_write_sync(iocb, ret);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 17/17] fs: add CONFIG_BUFFER_HEAD
  2023-04-24  5:49 RFC: allow building a kernel without buffer_heads Christoph Hellwig
                   ` (15 preceding siblings ...)
  2023-04-24  5:49 ` [PATCH 16/17] block: use iomap for writes to block devices Christoph Hellwig
@ 2023-04-24  5:49 ` Christoph Hellwig
  2023-04-29  0:11   ` Luis Chamberlain
  16 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2023-04-24  5:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

Add a new config option that controls building the buffer_head code, and
select it from all file systems and stacking drivers that need it.

For the block device nodes and alternative iomap based buffered I/O path
is provided when buffer_head support is not enabled, and iomap needs a
little tweak to be able to compile out the buffer_head based code path.

Otherwise this is just Kconfig and ifdef changes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/fops.c                 | 71 +++++++++++++++++++++++++++++++-----
 drivers/md/Kconfig           |  1 +
 fs/Kconfig                   |  4 ++
 fs/Makefile                  |  2 +-
 fs/adfs/Kconfig              |  1 +
 fs/affs/Kconfig              |  1 +
 fs/befs/Kconfig              |  1 +
 fs/bfs/Kconfig               |  1 +
 fs/efs/Kconfig               |  1 +
 fs/exfat/Kconfig             |  1 +
 fs/ext2/Kconfig              |  1 +
 fs/ext4/Kconfig              |  1 +
 fs/f2fs/Kconfig              |  1 +
 fs/fat/Kconfig               |  1 +
 fs/freevxfs/Kconfig          |  1 +
 fs/gfs2/Kconfig              |  1 +
 fs/hfs/Kconfig               |  1 +
 fs/hfsplus/Kconfig           |  1 +
 fs/hpfs/Kconfig              |  1 +
 fs/iomap/buffered-io.c       | 12 ++++--
 fs/isofs/Kconfig             |  1 +
 fs/jfs/Kconfig               |  1 +
 fs/minix/Kconfig             |  1 +
 fs/nilfs2/Kconfig            |  1 +
 fs/ntfs/Kconfig              |  1 +
 fs/ntfs3/Kconfig             |  1 +
 fs/ocfs2/Kconfig             |  1 +
 fs/omfs/Kconfig              |  1 +
 fs/qnx4/Kconfig              |  1 +
 fs/qnx6/Kconfig              |  1 +
 fs/reiserfs/Kconfig          |  1 +
 fs/sysv/Kconfig              |  1 +
 fs/udf/Kconfig               |  1 +
 fs/ufs/Kconfig               |  1 +
 include/linux/buffer_head.h  | 32 ++++++++--------
 include/trace/events/block.h |  2 +
 mm/migrate.c                 |  4 +-
 37 files changed, 125 insertions(+), 32 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 7910636f8df33b..524b8a828aad3d 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -24,15 +24,6 @@ static inline struct inode *bdev_file_inode(struct file *file)
 	return file->f_mapping->host;
 }
 
-static int blkdev_get_block(struct inode *inode, sector_t iblock,
-		struct buffer_head *bh, int create)
-{
-	bh->b_bdev = I_BDEV(inode);
-	bh->b_blocknr = iblock;
-	set_buffer_mapped(bh);
-	return 0;
-}
-
 static blk_opf_t dio_bio_write_op(struct kiocb *iocb)
 {
 	blk_opf_t opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
@@ -400,7 +391,8 @@ static int blkdev_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	iomap->type = IOMAP_MAPPED;
 	iomap->addr = iomap->offset;
 	iomap->length = isize - iomap->offset;
-	iomap->flags |= IOMAP_F_BUFFER_HEAD;
+	if (IS_ENABLED(CONFIG_BUFFER_HEAD))
+		iomap->flags |= IOMAP_F_BUFFER_HEAD;
 	return 0;
 }
 
@@ -408,6 +400,16 @@ static const struct iomap_ops blkdev_iomap_ops = {
 	.iomap_begin		= blkdev_iomap_begin,
 };
 
+#ifdef CONFIG_BUFFER_HEAD
+static int blkdev_get_block(struct inode *inode, sector_t iblock,
+		struct buffer_head *bh, int create)
+{
+	bh->b_bdev = I_BDEV(inode);
+	bh->b_blocknr = iblock;
+	set_buffer_mapped(bh);
+	return 0;
+}
+
 static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
 {
 	return block_write_full_page(page, blkdev_get_block, wbc);
@@ -453,6 +455,55 @@ const struct address_space_operations def_blk_aops = {
 	.migrate_folio	= buffer_migrate_folio_norefs,
 	.is_dirty_writeback = buffer_check_dirty_writeback,
 };
+#else /* CONFIG_BUFFER_HEAD */
+static int blkdev_read_folio(struct file *file, struct folio *folio)
+{
+	return iomap_read_folio(folio, &blkdev_iomap_ops);
+}
+
+static void blkdev_readahead(struct readahead_control *rac)
+{
+	iomap_readahead(rac, &blkdev_iomap_ops);
+}
+
+static int blkdev_map_blocks(struct iomap_writepage_ctx *wpc,
+		struct inode *inode, loff_t offset)
+{
+	loff_t isize = i_size_read(inode);
+
+	if (WARN_ON_ONCE(offset >= isize))
+		return -EIO;
+	if (offset >= wpc->iomap.offset &&
+	    offset < wpc->iomap.offset + wpc->iomap.length)
+		return 0;
+	return blkdev_iomap_begin(inode, offset, isize - offset,
+				  IOMAP_WRITE, &wpc->iomap, NULL);
+}
+
+static const struct iomap_writeback_ops blkdev_writeback_ops = {
+	.map_blocks		= blkdev_map_blocks,
+};
+
+static int blkdev_writepages(struct address_space *mapping,
+		struct writeback_control *wbc)
+{
+	struct iomap_writepage_ctx wpc = { };
+
+	return iomap_writepages(mapping, wbc, &wpc, &blkdev_writeback_ops);
+}
+
+const struct address_space_operations def_blk_aops = {
+	.dirty_folio	= filemap_dirty_folio,
+	.release_folio		= iomap_release_folio,
+	.invalidate_folio	= iomap_invalidate_folio,
+	.read_folio		= blkdev_read_folio,
+	.readahead		= blkdev_readahead,
+	.writepages		= blkdev_writepages,
+	.is_partially_uptodate  = iomap_is_partially_uptodate,
+	.error_remove_page	= generic_error_remove_page,
+	.migrate_folio		= filemap_migrate_folio,
+};
+#endif /* CONFIG_BUFFER_HEAD */
 
 /*
  * for a block special file file_inode(file)->i_size is zero
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index b0a22e99bade37..9ee18013b1f2ab 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -15,6 +15,7 @@ if MD
 config BLK_DEV_MD
 	tristate "RAID support"
 	select BLOCK_HOLDER_DEPRECATED if SYSFS
+	select BUFFER_HEAD
 	# BLOCK_LEGACY_AUTOLOAD requirement should be removed
 	# after relevant mdadm enhancements - to make "names=yes"
 	# the default - are widely available.
diff --git a/fs/Kconfig b/fs/Kconfig
index e99830c650336a..366d5d5be2784b 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -18,8 +18,12 @@ config VALIDATE_FS_PARSER
 config FS_IOMAP
 	bool
 
+config BUFFER_HEAD
+	bool
+
 # old blockdev_direct_IO implementation.  Use iomap for new code instead
 config LEGACY_DIRECT_IO
+	depends on BUFFER_HEAD
 	bool
 
 if BLOCK
diff --git a/fs/Makefile b/fs/Makefile
index da21e7d0a1cf37..3cd6aa1d2ce387 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -18,7 +18,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		fs_types.o fs_context.o fs_parser.o fsopen.o init.o \
 		kernel_read_file.o mnt_idmapping.o remap_range.o
 
-obj-$(CONFIG_BLOCK)		+= buffer.o mpage.o
+obj-$(CONFIG_BUFFER_HEAD)	+= buffer.o mpage.o
 obj-$(CONFIG_PROC_FS)		+= proc_namespace.o
 obj-$(CONFIG_LEGACY_DIRECT_IO)	+= direct-io.o
 obj-y				+= notify/
diff --git a/fs/adfs/Kconfig b/fs/adfs/Kconfig
index 44738fed66251f..1b97058f0c4a92 100644
--- a/fs/adfs/Kconfig
+++ b/fs/adfs/Kconfig
@@ -2,6 +2,7 @@
 config ADFS_FS
 	tristate "ADFS file system support"
 	depends on BLOCK
+	select BUFFER_HEAD
 	help
 	  The Acorn Disc Filing System is the standard file system of the
 	  RiscOS operating system which runs on Acorn's ARM-based Risc PC
diff --git a/fs/affs/Kconfig b/fs/affs/Kconfig
index 962b86374e1c15..1ae432d266c32f 100644
--- a/fs/affs/Kconfig
+++ b/fs/affs/Kconfig
@@ -2,6 +2,7 @@
 config AFFS_FS
 	tristate "Amiga FFS file system support"
 	depends on BLOCK
+	select BUFFER_HEAD
 	select LEGACY_DIRECT_IO
 	help
 	  The Fast File System (FFS) is the common file system used on hard
diff --git a/fs/befs/Kconfig b/fs/befs/Kconfig
index 9550b6462b8147..5fcfc4024ffe6f 100644
--- a/fs/befs/Kconfig
+++ b/fs/befs/Kconfig
@@ -2,6 +2,7 @@
 config BEFS_FS
 	tristate "BeOS file system (BeFS) support (read only)"
 	depends on BLOCK
+	select BUFFER_HEAD
 	select NLS
 	help
 	  The BeOS File System (BeFS) is the native file system of Be, Inc's
diff --git a/fs/bfs/Kconfig b/fs/bfs/Kconfig
index 3a757805b58568..8e7ef866b62a62 100644
--- a/fs/bfs/Kconfig
+++ b/fs/bfs/Kconfig
@@ -2,6 +2,7 @@
 config BFS_FS
 	tristate "BFS file system support"
 	depends on BLOCK
+	select BUFFER_HEAD
 	help
 	  Boot File System (BFS) is a file system used under SCO UnixWare to
 	  allow the bootloader access to the kernel image and other important
diff --git a/fs/efs/Kconfig b/fs/efs/Kconfig
index 2df1bac8b375b1..0833e533df9d53 100644
--- a/fs/efs/Kconfig
+++ b/fs/efs/Kconfig
@@ -2,6 +2,7 @@
 config EFS_FS
 	tristate "EFS file system support (read only)"
 	depends on BLOCK
+	select BUFFER_HEAD
 	help
 	  EFS is an older file system used for non-ISO9660 CD-ROMs and hard
 	  disk partitions by SGI's IRIX operating system (IRIX 6.0 and newer
diff --git a/fs/exfat/Kconfig b/fs/exfat/Kconfig
index 147edeb044691d..cbeca8e44d9b38 100644
--- a/fs/exfat/Kconfig
+++ b/fs/exfat/Kconfig
@@ -2,6 +2,7 @@
 
 config EXFAT_FS
 	tristate "exFAT filesystem support"
+	select BUFFER_HEAD
 	select NLS
 	select LEGACY_DIRECT_IO
 	help
diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig
index 77393fda99af09..74d98965902e16 100644
--- a/fs/ext2/Kconfig
+++ b/fs/ext2/Kconfig
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config EXT2_FS
 	tristate "Second extended fs support"
+	select BUFFER_HEAD
 	select FS_IOMAP
 	select LEGACY_DIRECT_IO
 	help
diff --git a/fs/ext4/Kconfig b/fs/ext4/Kconfig
index 86699c8cab281c..e20d59221fc05b 100644
--- a/fs/ext4/Kconfig
+++ b/fs/ext4/Kconfig
@@ -28,6 +28,7 @@ config EXT3_FS_SECURITY
 
 config EXT4_FS
 	tristate "The Extended 4 (ext4) filesystem"
+	select BUFFER_HEAD
 	select JBD2
 	select CRC16
 	select CRYPTO
diff --git a/fs/f2fs/Kconfig b/fs/f2fs/Kconfig
index 03ef087537c7c4..68a1e23e1557c7 100644
--- a/fs/f2fs/Kconfig
+++ b/fs/f2fs/Kconfig
@@ -2,6 +2,7 @@
 config F2FS_FS
 	tristate "F2FS filesystem support"
 	depends on BLOCK
+	select BUFFER_HEAD
 	select NLS
 	select CRYPTO
 	select CRYPTO_CRC32
diff --git a/fs/fat/Kconfig b/fs/fat/Kconfig
index afe83b4e717280..25fae1c83725bc 100644
--- a/fs/fat/Kconfig
+++ b/fs/fat/Kconfig
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config FAT_FS
 	tristate
+	select BUFFER_HEAD
 	select NLS
 	select LEGACY_DIRECT_IO
 	help
diff --git a/fs/freevxfs/Kconfig b/fs/freevxfs/Kconfig
index 0e2fc08f7de492..912107ebea6f40 100644
--- a/fs/freevxfs/Kconfig
+++ b/fs/freevxfs/Kconfig
@@ -2,6 +2,7 @@
 config VXFS_FS
 	tristate "FreeVxFS file system support (VERITAS VxFS(TM) compatible)"
 	depends on BLOCK
+	select BUFFER_HEAD
 	help
 	  FreeVxFS is a file system driver that support the VERITAS VxFS(TM)
 	  file system format.  VERITAS VxFS(TM) is the standard file system
diff --git a/fs/gfs2/Kconfig b/fs/gfs2/Kconfig
index 03c966840422ec..be7f87a8e11ae1 100644
--- a/fs/gfs2/Kconfig
+++ b/fs/gfs2/Kconfig
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config GFS2_FS
 	tristate "GFS2 file system support"
+	select BUFFER_HEAD
 	select FS_POSIX_ACL
 	select CRC32
 	select LIBCRC32C
diff --git a/fs/hfs/Kconfig b/fs/hfs/Kconfig
index d985066006d588..5ea5cd8ecea9c0 100644
--- a/fs/hfs/Kconfig
+++ b/fs/hfs/Kconfig
@@ -2,6 +2,7 @@
 config HFS_FS
 	tristate "Apple Macintosh file system support"
 	depends on BLOCK
+	select BUFFER_HEAD
 	select NLS
 	select LEGACY_DIRECT_IO
 	help
diff --git a/fs/hfsplus/Kconfig b/fs/hfsplus/Kconfig
index 8034e7827a690b..8ce4a33a9ac788 100644
--- a/fs/hfsplus/Kconfig
+++ b/fs/hfsplus/Kconfig
@@ -2,6 +2,7 @@
 config HFSPLUS_FS
 	tristate "Apple Extended HFS file system support"
 	depends on BLOCK
+	select BUFFER_HEAD
 	select NLS
 	select NLS_UTF8
 	select LEGACY_DIRECT_IO
diff --git a/fs/hpfs/Kconfig b/fs/hpfs/Kconfig
index ec975f4668775f..ac1e9318e65a4a 100644
--- a/fs/hpfs/Kconfig
+++ b/fs/hpfs/Kconfig
@@ -2,6 +2,7 @@
 config HPFS_FS
 	tristate "OS/2 HPFS file system support"
 	depends on BLOCK
+	select BUFFER_HEAD
 	select FS_IOMAP
 	help
 	  OS/2 is IBM's operating system for PC's, the same as Warp, and HPFS
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 3d5042efda202a..336a9d4542c97a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -42,6 +42,12 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
 	return NULL;
 }
 
+static inline bool iomap_use_buffer_heads(const struct iomap *iomap)
+{
+	return IS_ENABLED(CONFIG_BUFFER_HEAD) &&
+		(iomap->flags & IOMAP_F_BUFFER_HEAD);
+}
+
 static struct bio_set iomap_ioend_bioset;
 
 static struct iomap_page *
@@ -683,7 +689,7 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 
 	if (srcmap->type == IOMAP_INLINE)
 		status = iomap_write_begin_inline(iter, folio);
-	else if (srcmap->flags & IOMAP_F_BUFFER_HEAD)
+	else if (iomap_use_buffer_heads(srcmap))
 		status = __block_write_begin_int(folio, pos, len, NULL, srcmap);
 	else
 		status = __iomap_write_begin(iter, pos, len, folio);
@@ -753,7 +759,7 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
 
 	if (srcmap->type == IOMAP_INLINE) {
 		ret = iomap_write_end_inline(iter, folio, pos, copied);
-	} else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
+	} else if (iomap_use_buffer_heads(srcmap)) {
 		ret = block_write_end(NULL, iter->inode->i_mapping, pos, len,
 				copied, &folio->page, NULL);
 	} else {
@@ -1256,7 +1262,7 @@ static loff_t iomap_folio_mkwrite_iter(struct iomap_iter *iter,
 	loff_t length = iomap_length(iter);
 	int ret;
 
-	if (iter->iomap.flags & IOMAP_F_BUFFER_HEAD) {
+	if (iomap_use_buffer_heads(&iter->iomap)) {
 		ret = __block_write_begin_int(folio, iter->pos, length, NULL,
 					      &iter->iomap);
 		if (ret)
diff --git a/fs/isofs/Kconfig b/fs/isofs/Kconfig
index 08ffd37b9bb8f6..51434f2a471b0f 100644
--- a/fs/isofs/Kconfig
+++ b/fs/isofs/Kconfig
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config ISO9660_FS
 	tristate "ISO 9660 CDROM file system support"
+	select BUFFER_HEAD
 	help
 	  This is the standard file system used on CD-ROMs.  It was previously
 	  known as "High Sierra File System" and is called "hsfs" on other
diff --git a/fs/jfs/Kconfig b/fs/jfs/Kconfig
index 51e856f0e4b8d6..17488440eef1a9 100644
--- a/fs/jfs/Kconfig
+++ b/fs/jfs/Kconfig
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config JFS_FS
 	tristate "JFS filesystem support"
+	select BUFFER_HEAD
 	select NLS
 	select CRC32
 	select LEGACY_DIRECT_IO
diff --git a/fs/minix/Kconfig b/fs/minix/Kconfig
index de2003974ff0d0..90ddfad2a75e8f 100644
--- a/fs/minix/Kconfig
+++ b/fs/minix/Kconfig
@@ -2,6 +2,7 @@
 config MINIX_FS
 	tristate "Minix file system support"
 	depends on BLOCK
+	select BUFFER_HEAD
 	help
 	  Minix is a simple operating system used in many classes about OS's.
 	  The minix file system (method to organize files on a hard disk
diff --git a/fs/nilfs2/Kconfig b/fs/nilfs2/Kconfig
index 7d59567465e121..7dae168e346e30 100644
--- a/fs/nilfs2/Kconfig
+++ b/fs/nilfs2/Kconfig
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config NILFS2_FS
 	tristate "NILFS2 file system support"
+	select BUFFER_HEAD
 	select CRC32
 	select LEGACY_DIRECT_IO
 	help
diff --git a/fs/ntfs/Kconfig b/fs/ntfs/Kconfig
index f93e69a612833f..7b2509741735a9 100644
--- a/fs/ntfs/Kconfig
+++ b/fs/ntfs/Kconfig
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config NTFS_FS
 	tristate "NTFS file system support"
+	select BUFFER_HEAD
 	select NLS
 	help
 	  NTFS is the file system of Microsoft Windows NT, 2000, XP and 2003.
diff --git a/fs/ntfs3/Kconfig b/fs/ntfs3/Kconfig
index 96cc236f7f7bd3..cdfdf51e55d797 100644
--- a/fs/ntfs3/Kconfig
+++ b/fs/ntfs3/Kconfig
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config NTFS3_FS
 	tristate "NTFS Read-Write file system support"
+	select BUFFER_HEAD
 	select NLS
 	select LEGACY_DIRECT_IO
 	help
diff --git a/fs/ocfs2/Kconfig b/fs/ocfs2/Kconfig
index 304d12186ccd38..4a8288f7ee8697 100644
--- a/fs/ocfs2/Kconfig
+++ b/fs/ocfs2/Kconfig
@@ -2,6 +2,7 @@
 config OCFS2_FS
 	tristate "OCFS2 file system support"
 	depends on INET && SYSFS && CONFIGFS_FS
+	select BUFFER_HEAD
 	select JBD2
 	select CRC32
 	select QUOTA
diff --git a/fs/omfs/Kconfig b/fs/omfs/Kconfig
index 42b2ec35a05bfb..8470f6c3e64e6a 100644
--- a/fs/omfs/Kconfig
+++ b/fs/omfs/Kconfig
@@ -2,6 +2,7 @@
 config OMFS_FS
 	tristate "SonicBlue Optimized MPEG File System support"
 	depends on BLOCK
+	select BUFFER_HEAD
 	select CRC_ITU_T
 	help
 	  This is the proprietary file system used by the Rio Karma music
diff --git a/fs/qnx4/Kconfig b/fs/qnx4/Kconfig
index 45b5b98376c436..a2eb826e76c602 100644
--- a/fs/qnx4/Kconfig
+++ b/fs/qnx4/Kconfig
@@ -2,6 +2,7 @@
 config QNX4FS_FS
 	tristate "QNX4 file system support (read only)"
 	depends on BLOCK
+	select BUFFER_HEAD
 	help
 	  This is the file system used by the real-time operating systems
 	  QNX 4 and QNX 6 (the latter is also called QNX RTP).
diff --git a/fs/qnx6/Kconfig b/fs/qnx6/Kconfig
index 6a9d6bce158622..8e865d72204e75 100644
--- a/fs/qnx6/Kconfig
+++ b/fs/qnx6/Kconfig
@@ -2,6 +2,7 @@
 config QNX6FS_FS
 	tristate "QNX6 file system support (read only)"
 	depends on BLOCK && CRC32
+	select BUFFER_HEAD
 	help
 	  This is the file system used by the real-time operating systems
 	  QNX 6 (also called QNX RTP).
diff --git a/fs/reiserfs/Kconfig b/fs/reiserfs/Kconfig
index 4d22ecfe0fab65..0e6fe26458fede 100644
--- a/fs/reiserfs/Kconfig
+++ b/fs/reiserfs/Kconfig
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config REISERFS_FS
 	tristate "Reiserfs support (deprecated)"
+	select BUFFER_HEAD
 	select CRC32
 	select LEGACY_DIRECT_IO
 	help
diff --git a/fs/sysv/Kconfig b/fs/sysv/Kconfig
index b4e23e03fbeba3..67b3f90afbfd67 100644
--- a/fs/sysv/Kconfig
+++ b/fs/sysv/Kconfig
@@ -2,6 +2,7 @@
 config SYSV_FS
 	tristate "System V/Xenix/V7/Coherent file system support"
 	depends on BLOCK
+	select BUFFER_HEAD
 	help
 	  SCO, Xenix and Coherent are commercial Unix systems for Intel
 	  machines, and Version 7 was used on the DEC PDP-11. Saying Y
diff --git a/fs/udf/Kconfig b/fs/udf/Kconfig
index 82e8bfa2dfd989..8f7ce30d47fdce 100644
--- a/fs/udf/Kconfig
+++ b/fs/udf/Kconfig
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config UDF_FS
 	tristate "UDF file system support"
+	select BUFFER_HEAD
 	select CRC_ITU_T
 	select NLS
 	select LEGACY_DIRECT_IO
diff --git a/fs/ufs/Kconfig b/fs/ufs/Kconfig
index 6d30adb6b890fc..9301e7ecd09210 100644
--- a/fs/ufs/Kconfig
+++ b/fs/ufs/Kconfig
@@ -2,6 +2,7 @@
 config UFS_FS
 	tristate "UFS file system support (read only)"
 	depends on BLOCK
+	select BUFFER_HEAD
 	help
 	  BSD and derivate versions of Unix (such as SunOS, FreeBSD, NetBSD,
 	  OpenBSD and NeXTstep) use a file system called UFS. Some System V
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 0fcc16b7f02bb4..1ef3915cacfae2 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -16,8 +16,6 @@
 #include <linux/wait.h>
 #include <linux/atomic.h>
 
-#ifdef CONFIG_BLOCK
-
 enum bh_state_bits {
 	BH_Uptodate,	/* Contains valid data */
 	BH_Dirty,	/* Is dirty */
@@ -196,7 +194,6 @@ void mark_buffer_write_io_error(struct buffer_head *bh);
 void touch_buffer(struct buffer_head *bh);
 void set_bh_page(struct buffer_head *bh,
 		struct page *page, unsigned long offset);
-bool try_to_free_buffers(struct folio *);
 struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
 		bool retry);
 void create_empty_buffers(struct page *, unsigned long,
@@ -207,10 +204,6 @@ void end_buffer_async_write(struct buffer_head *bh, int uptodate);
 
 /* Things to do with buffers at mapping->private_list */
 void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode);
-int inode_has_buffers(struct inode *);
-void invalidate_inode_buffers(struct inode *);
-int remove_inode_buffers(struct inode *inode);
-int sync_mapping_buffers(struct address_space *mapping);
 void clean_bdev_aliases(struct block_device *bdev, sector_t block,
 			sector_t len);
 static inline void clean_bdev_bh_alias(struct buffer_head *bh)
@@ -230,9 +223,6 @@ void __bforget(struct buffer_head *);
 void __breadahead(struct block_device *, sector_t block, unsigned int size);
 struct buffer_head *__bread_gfp(struct block_device *,
 				sector_t block, unsigned size, gfp_t gfp);
-void invalidate_bh_lrus(void);
-void invalidate_bh_lrus_cpu(void);
-bool has_bh_in_lru(int cpu, void *dummy);
 struct buffer_head *alloc_buffer_head(gfp_t gfp_flags);
 void free_buffer_head(struct buffer_head * bh);
 void unlock_buffer(struct buffer_head *bh);
@@ -248,8 +238,6 @@ int __bh_read(struct buffer_head *bh, blk_opf_t op_flags, bool wait);
 void __bh_read_batch(int nr, struct buffer_head *bhs[],
 		     blk_opf_t op_flags, bool force_lock);
 
-extern int buffer_heads_over_limit;
-
 /*
  * Generic address_space_operations implementations for buffer_head-backed
  * address_spaces.
@@ -294,8 +282,6 @@ extern int buffer_migrate_folio_norefs(struct address_space *,
 #define buffer_migrate_folio_norefs NULL
 #endif
 
-void buffer_init(void);
-
 /*
  * inline definitions
  */
@@ -455,7 +441,20 @@ __bread(struct block_device *bdev, sector_t block, unsigned size)
 
 bool block_dirty_folio(struct address_space *mapping, struct folio *folio);
 
-#else /* CONFIG_BLOCK */
+#ifdef CONFIG_BUFFER_HEAD
+
+void buffer_init(void);
+bool try_to_free_buffers(struct folio *folio);
+int inode_has_buffers(struct inode *inode);
+void invalidate_inode_buffers(struct inode *inode);
+int remove_inode_buffers(struct inode *inode);
+int sync_mapping_buffers(struct address_space *mapping);
+void invalidate_bh_lrus(void);
+void invalidate_bh_lrus_cpu(void);
+bool has_bh_in_lru(int cpu, void *dummy);
+extern int buffer_heads_over_limit;
+
+#else /* CONFIG_BUFFER_HEAD */
 
 static inline void buffer_init(void) {}
 static inline bool try_to_free_buffers(struct folio *folio) { return true; }
@@ -463,9 +462,10 @@ static inline int inode_has_buffers(struct inode *inode) { return 0; }
 static inline void invalidate_inode_buffers(struct inode *inode) {}
 static inline int remove_inode_buffers(struct inode *inode) { return 1; }
 static inline int sync_mapping_buffers(struct address_space *mapping) { return 0; }
+static inline void invalidate_bh_lrus(void) {}
 static inline void invalidate_bh_lrus_cpu(void) {}
 static inline bool has_bh_in_lru(int cpu, void *dummy) { return false; }
 #define buffer_heads_over_limit 0
 
-#endif /* CONFIG_BLOCK */
+#endif /* CONFIG_BUFFER_HEAD */
 #endif /* _LINUX_BUFFER_HEAD_H */
diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 7f4dfbdf12a6f1..a0503f387104a1 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -12,6 +12,7 @@
 
 #define RWBS_LEN	8
 
+#ifdef CONFIG_BUFFER_HEAD
 DECLARE_EVENT_CLASS(block_buffer,
 
 	TP_PROTO(struct buffer_head *bh),
@@ -61,6 +62,7 @@ DEFINE_EVENT(block_buffer, block_dirty_buffer,
 
 	TP_ARGS(bh)
 );
+#endif /* CONFIG_BUFFER_HEAD */
 
 /**
  * block_rq_requeue - place block IO request back on a queue
diff --git a/mm/migrate.c b/mm/migrate.c
index db3f154446af4e..9f9a81d43fc8f2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -692,7 +692,7 @@ int migrate_folio(struct address_space *mapping, struct folio *dst,
 }
 EXPORT_SYMBOL(migrate_folio);
 
-#ifdef CONFIG_BLOCK
+#ifdef CONFIG_BUFFER_HEAD
 /* Returns true if all buffers are successfully locked */
 static bool buffer_migrate_lock_buffers(struct buffer_head *head,
 							enum migrate_mode mode)
@@ -850,7 +850,7 @@ int buffer_migrate_folio_norefs(struct address_space *mapping,
 	return __buffer_migrate_folio(mapping, dst, src, mode, true);
 }
 EXPORT_SYMBOL_GPL(buffer_migrate_folio_norefs);
-#endif
+#endif /* CONFIG_BUFFER_HEAD */
 
 int filemap_migrate_folio(struct address_space *mapping,
 		struct folio *dst, struct folio *src, enum migrate_mode mode)
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 11/17] iomap: assign current->backing_dev_info in iomap_file_buffered_write
  2023-04-24  5:49 ` [PATCH 11/17] iomap: assign current->backing_dev_info in iomap_file_buffered_write Christoph Hellwig
@ 2023-04-24  6:18   ` Darrick J. Wong
  2023-04-24  6:22     ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2023-04-24  6:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Miklos Szeredi, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

On Mon, Apr 24, 2023 at 07:49:20AM +0200, Christoph Hellwig wrote:
> Move the assignment to current->backing_dev_info from the callers into
> iomap_file_buffered_write.  Note that zonefs was missing this assignment
> before.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/gfs2/file.c         | 3 ---
>  fs/iomap/buffered-io.c | 4 ++++
>  fs/xfs/xfs_file.c      | 5 -----
>  3 files changed, 4 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 8c4fad359ff538..4d88c6080b3e30 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -25,7 +25,6 @@
>  #include <linux/dlm.h>
>  #include <linux/dlm_plock.h>
>  #include <linux/delay.h>
> -#include <linux/backing-dev.h>
>  #include <linux/fileattr.h>
>  
>  #include "gfs2.h"
> @@ -1041,11 +1040,9 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb,
>  			goto out_unlock;
>  	}
>  
> -	current->backing_dev_info = inode_to_bdi(inode);
>  	pagefault_disable();
>  	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
>  	pagefault_enable();
> -	current->backing_dev_info = NULL;
>  	if (ret > 0) {
>  		iocb->ki_pos += ret;
>  		written += ret;
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 2986be63d2bea6..3d5042efda202a 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -3,6 +3,7 @@
>   * Copyright (C) 2010 Red Hat, Inc.
>   * Copyright (C) 2016-2019 Christoph Hellwig.
>   */
> +#include <linux/backing-dev.h>
>  #include <linux/module.h>
>  #include <linux/compiler.h>
>  #include <linux/fs.h>
> @@ -876,8 +877,11 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
>  	if (iocb->ki_flags & IOCB_NOWAIT)
>  		iter.flags |= IOMAP_NOWAIT;
>  
> +	current->backing_dev_info = inode_to_bdi(iter.inode);

Dumb question from me late on a Sunday night, but does the iomap_unshare
code need to set this too?  Since it works by dirtying pagecache folios
without actually changing the contents?

--D

>  	while ((ret = iomap_iter(&iter, ops)) > 0)
>  		iter.processed = iomap_write_iter(&iter, i);
> +	current->backing_dev_info = NULL;
> +
>  	if (iter.pos == iocb->ki_pos)
>  		return ret;
>  	return iter.pos - iocb->ki_pos;
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 705250f9f90a1b..f5442e689baf15 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -27,7 +27,6 @@
>  
>  #include <linux/dax.h>
>  #include <linux/falloc.h>
> -#include <linux/backing-dev.h>
>  #include <linux/mman.h>
>  #include <linux/fadvise.h>
>  #include <linux/mount.h>
> @@ -717,9 +716,6 @@ xfs_file_buffered_write(
>  	if (ret)
>  		goto out;
>  
> -	/* We can write back this queue in page reclaim */
> -	current->backing_dev_info = inode_to_bdi(inode);
> -
>  	trace_xfs_file_buffered_write(iocb, from);
>  	ret = iomap_file_buffered_write(iocb, from,
>  			&xfs_buffered_write_iomap_ops);
> @@ -753,7 +749,6 @@ xfs_file_buffered_write(
>  		goto write_retry;
>  	}
>  
> -	current->backing_dev_info = NULL;
>  out:
>  	if (iolock)
>  		xfs_iunlock(ip, iolock);
> -- 
> 2.39.2
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 11/17] iomap: assign current->backing_dev_info in iomap_file_buffered_write
  2023-04-24  6:18   ` Darrick J. Wong
@ 2023-04-24  6:22     ` Christoph Hellwig
  0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2023-04-24  6:22 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Jens Axboe, Miklos Szeredi, Andrew Morton,
	David Howells, Matthew Wilcox, linux-block, linux-fsdevel,
	ceph-devel, linux-ext4, linux-f2fs-devel, cluster-devel,
	linux-xfs, linux-nfs, linux-mm, linux-kernel

On Sun, Apr 23, 2023 at 11:18:25PM -0700, Darrick J. Wong wrote:
> > @@ -876,8 +877,11 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
> >  	if (iocb->ki_flags & IOCB_NOWAIT)
> >  		iter.flags |= IOMAP_NOWAIT;
> >  
> > +	current->backing_dev_info = inode_to_bdi(iter.inode);
> 
> Dumb question from me late on a Sunday night, but does the iomap_unshare
> code need to set this too?  Since it works by dirtying pagecache folios
> without actually changing the contents?

A very good question that I have no answer for.  The
current->backing_dev_info mechanism confuses the heck out of me and
appears basically undocumented.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 03/17] fs: rename and move block_page_mkwrite_return
  2023-04-24  5:49 ` [PATCH 03/17] fs: rename and move block_page_mkwrite_return Christoph Hellwig
@ 2023-04-24 12:30   ` Matthew Wilcox
  2023-04-24 12:42     ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2023-04-24 12:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Miklos Szeredi, Darrick J. Wong, Andrew Morton,
	David Howells, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

On Mon, Apr 24, 2023 at 07:49:12AM +0200, Christoph Hellwig wrote:
> block_page_mkwrite_return is neither block nor mkwrite specific, and
> should not be under CONFIG_BLOCK.  Move it to mm.h and rename it to
> errno_to_vmfault.

Could you move it about 300 lines down and put it near vmf_error()
so we think about how to unify the two at some point?

Perhaps it should better be called vmf_fs_error() for now since the
errnos it handles are the kind generated by filesystems.

> +++ b/include/linux/mm.h
> @@ -3061,6 +3061,19 @@ extern vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>  		pgoff_t start_pgoff, pgoff_t end_pgoff);
>  extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);
>  
> +/* Convert errno to return value from ->page_mkwrite() call */
> +static inline vm_fault_t errno_to_vmfault(int err)
> +{
> +	if (err == 0)
> +		return VM_FAULT_LOCKED;
> +	if (err == -EFAULT || err == -EAGAIN)
> +		return VM_FAULT_NOPAGE;
> +	if (err == -ENOMEM)
> +		return VM_FAULT_OOM;
> +	/* -ENOSPC, -EDQUOT, -EIO ... */
> +	return VM_FAULT_SIGBUS;
> +}
> +
>  extern unsigned long stack_guard_gap;

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 03/17] fs: rename and move block_page_mkwrite_return
  2023-04-24 12:30   ` Matthew Wilcox
@ 2023-04-24 12:42     ` Christoph Hellwig
  0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2023-04-24 12:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Jens Axboe, Miklos Szeredi, Darrick J. Wong,
	Andrew Morton, David Howells, linux-block, linux-fsdevel,
	ceph-devel, linux-ext4, linux-f2fs-devel, cluster-devel,
	linux-xfs, linux-nfs, linux-mm, linux-kernel

On Mon, Apr 24, 2023 at 01:30:56PM +0100, Matthew Wilcox wrote:
> On Mon, Apr 24, 2023 at 07:49:12AM +0200, Christoph Hellwig wrote:
> > block_page_mkwrite_return is neither block nor mkwrite specific, and
> > should not be under CONFIG_BLOCK.  Move it to mm.h and rename it to
> > errno_to_vmfault.
> 
> Could you move it about 300 lines down and put it near vmf_error()
> so we think about how to unify the two at some point?
> 
> Perhaps it should better be called vmf_fs_error() for now since the
> errnos it handles are the kind generated by filesystems.

I'll look into unifying them for the next version.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Cluster-devel] [PATCH 05/17] filemap: update ki_pos in generic_perform_write
  2023-04-24  5:49 ` [PATCH 05/17] filemap: update ki_pos in generic_perform_write Christoph Hellwig
@ 2023-04-24 18:54   ` Andreas Gruenbacher
  0 siblings, 0 replies; 41+ messages in thread
From: Andreas Gruenbacher @ 2023-04-24 18:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, linux-nfs, cluster-devel, linux-xfs,
	Miklos Szeredi, Darrick J. Wong, linux-kernel, Matthew Wilcox,
	linux-f2fs-devel, David Howells, linux-mm, linux-fsdevel,
	Andrew Morton, linux-ext4, ceph-devel

On Mon, Apr 24, 2023 at 8:22 AM Christoph Hellwig <hch@lst.de> wrote:
> All callers of generic_perform_write need to updated ki_pos, move it into
> common code.

We've actually got a similar situation with
iomap_file_buffered_write() and its callers. Would it make sense to
fix that up as well?

> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/ceph/file.c | 2 --
>  fs/ext4/file.c | 9 +++------
>  fs/f2fs/file.c | 1 -
>  fs/nfs/file.c  | 1 -
>  mm/filemap.c   | 8 ++++----
>  5 files changed, 7 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index f4d8bf7dec88a8..feeb9882ef635a 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1894,8 +1894,6 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
>                  * can not run at the same time
>                  */
>                 written = generic_perform_write(iocb, from);
> -               if (likely(written >= 0))
> -                       iocb->ki_pos = pos + written;
>                 ceph_end_io_write(inode);
>         }
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 0b8b4499e5ca18..1026acaf1235a0 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -291,12 +291,9 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
>
>  out:
>         inode_unlock(inode);
> -       if (likely(ret > 0)) {
> -               iocb->ki_pos += ret;
> -               ret = generic_write_sync(iocb, ret);
> -       }
> -
> -       return ret;
> +       if (unlikely(ret <= 0))
> +               return ret;
> +       return generic_write_sync(iocb, ret);
>  }
>
>  static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index f4ab23efcf85f8..5a9ae054b6da7d 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -4511,7 +4511,6 @@ static ssize_t f2fs_buffered_write_iter(struct kiocb *iocb,
>         current->backing_dev_info = NULL;
>
>         if (ret > 0) {
> -               iocb->ki_pos += ret;
>                 f2fs_update_iostat(F2FS_I_SB(inode), inode,
>                                                 APP_BUFFERED_IO, ret);
>         }
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 893625eacab9fa..abdae2b29369be 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -666,7 +666,6 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
>                 goto out;
>
>         written = result;
> -       iocb->ki_pos += written;
>         nfs_add_stats(inode, NFSIOS_NORMALWRITTENBYTES, written);
>
>         if (mntflags & NFS_MOUNT_WRITE_EAGER) {
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 2723104cc06a12..0110bde3708b3f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3960,7 +3960,10 @@ ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
>                 balance_dirty_pages_ratelimited(mapping);
>         } while (iov_iter_count(i));
>
> -       return written ? written : status;
> +       if (!written)
> +               return status;
> +       iocb->ki_pos += written;

Could be turned into:
iocb->ki_pos = pos;

> +       return written;
>  }
>  EXPORT_SYMBOL(generic_perform_write);
>
> @@ -4039,7 +4042,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>                 endbyte = pos + status - 1;
>                 err = filemap_write_and_wait_range(mapping, pos, endbyte);
>                 if (err == 0) {
> -                       iocb->ki_pos = endbyte + 1;
>                         written += status;
>                         invalidate_mapping_pages(mapping,
>                                                  pos >> PAGE_SHIFT,
> @@ -4052,8 +4054,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>                 }
>         } else {
>                 written = generic_perform_write(iocb, from);
> -               if (likely(written > 0))
> -                       iocb->ki_pos += written;
>         }
>  out:
>         current->backing_dev_info = NULL;
> --
> 2.39.2
>

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 02/17] fs: remove the special !CONFIG_BLOCK def_blk_fops
  2023-04-24  5:49 ` [PATCH 02/17] fs: remove the special !CONFIG_BLOCK def_blk_fops Christoph Hellwig
@ 2023-04-24 19:22   ` Randy Dunlap
  2023-04-24 19:37     ` Keith Busch
  0 siblings, 1 reply; 41+ messages in thread
From: Randy Dunlap @ 2023-04-24 19:22 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

Hi,

On 4/23/23 22:49, Christoph Hellwig wrote:
> def_blk_fops always returns -ENODEV, which dosn't match the return value
> of a non-existing block device with CONFIG_BLOCK, which is -ENXIO.
> Just remove the extra implementation and fall back to the default
> no_open_fops that always returns -ENXIO.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/Makefile   | 10 ++--------
>  fs/inode.c    |  3 ++-
>  fs/no-block.c | 19 -------------------
>  3 files changed, 4 insertions(+), 28 deletions(-)
>  delete mode 100644 fs/no-block.c
> 
> diff --git a/fs/Makefile b/fs/Makefile
> index 05f89b5c962f88..da21e7d0a1cf37 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -18,14 +18,8 @@ obj-y :=	open.o read_write.o file_table.o super.o \
>  		fs_types.o fs_context.o fs_parser.o fsopen.o init.o \
>  		kernel_read_file.o mnt_idmapping.o remap_range.o
>  
> -ifeq ($(CONFIG_BLOCK),y)
> -obj-y +=	buffer.o mpage.o
> -else
> -obj-y +=	no-block.o
> -endif
> -
> -obj-$(CONFIG_PROC_FS) += proc_namespace.o
> -
> +obj-$(CONFIG_BLOCK)		+= buffer.o mpage.o
> +obj-$(CONFIG_PROC_FS)		+= proc_namespace.o
>  obj-$(CONFIG_LEGACY_DIRECT_IO)	+= direct-io.o
>  obj-y				+= notify/
>  obj-$(CONFIG_EPOLL)		+= eventpoll.o
> diff --git a/fs/inode.c b/fs/inode.c
> index 4558dc2f135573..d43f07f146eb73 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2265,7 +2265,8 @@ void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
>  		inode->i_fop = &def_chr_fops;
>  		inode->i_rdev = rdev;
>  	} else if (S_ISBLK(mode)) {
> -		inode->i_fop = &def_blk_fops;
> +		if (IS_ENABLED(CONFIG_BLOCK))
> +			inode->i_fop = &def_blk_fops;

It looks like def_blk_fops is being removed (commit message and patch
fragment below), but here (above line) it is being used.

Am I just confused?

>  		inode->i_rdev = rdev;
>  	} else if (S_ISFIFO(mode))
>  		inode->i_fop = &pipefifo_fops;
> diff --git a/fs/no-block.c b/fs/no-block.c
> deleted file mode 100644
> index 481c0f0ab4bd2c..00000000000000
> --- a/fs/no-block.c
> +++ /dev/null
> @@ -1,19 +0,0 @@
> -// SPDX-License-Identifier: GPL-2.0-or-later
> -/* no-block.c: implementation of routines required for non-BLOCK configuration
> - *
> - * Copyright (C) 2006 Red Hat, Inc. All Rights Reserved.
> - * Written by David Howells (dhowells@redhat.com)
> - */
> -
> -#include <linux/kernel.h>
> -#include <linux/fs.h>
> -
> -static int no_blkdev_open(struct inode * inode, struct file * filp)
> -{
> -	return -ENODEV;
> -}
> -
> -const struct file_operations def_blk_fops = {
> -	.open		= no_blkdev_open,
> -	.llseek		= noop_llseek,
> -};

-- 
~Randy

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 02/17] fs: remove the special !CONFIG_BLOCK def_blk_fops
  2023-04-24 19:22   ` Randy Dunlap
@ 2023-04-24 19:37     ` Keith Busch
  0 siblings, 0 replies; 41+ messages in thread
From: Keith Busch @ 2023-04-24 19:37 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Christoph Hellwig, Jens Axboe, Miklos Szeredi, Darrick J. Wong,
	Andrew Morton, David Howells, Matthew Wilcox, linux-block,
	linux-fsdevel, ceph-devel, linux-ext4, linux-f2fs-devel,
	cluster-devel, linux-xfs, linux-nfs, linux-mm, linux-kernel

On Mon, Apr 24, 2023 at 12:22:30PM -0700, Randy Dunlap wrote:
> On 4/23/23 22:49, Christoph Hellwig wrote:
> > +		if (IS_ENABLED(CONFIG_BLOCK))
> > +			inode->i_fop = &def_blk_fops;
> 
> It looks like def_blk_fops is being removed (commit message and patch
> fragment below), but here (above line) it is being used.
> 
> Am I just confused?

The def_blk_fops is removed only for the !CONFIG_BLOCK case. Its usage is under
a branch known at compile time, so it's optimized out for that case before
trying to resolve the symbol.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [f2fs-dev] [PATCH 16/17] block: use iomap for writes to block devices
       [not found]   ` <CGME20230426130921eucas1p279078812be7e8d50c1305e47cea53661@eucas1p2.samsung.com>
@ 2023-04-26 13:00     ` Pankaj Raghav
  0 siblings, 0 replies; 41+ messages in thread
From: Pankaj Raghav @ 2023-04-26 13:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, linux-nfs, cluster-devel, linux-xfs,
	Miklos Szeredi, Darrick J. Wong, linux-kernel, Matthew Wilcox,
	linux-f2fs-devel, David Howells, linux-mm, linux-fsdevel,
	Andrew Morton, linux-ext4, ceph-devel, p.raghav

[-- Attachment #1: Type: text/plain, Size: 710 bytes --]

On Mon, Apr 24, 2023 at 07:49:25AM +0200, Christoph Hellwig wrote:
> Use iomap in buffer_head compat mode to write to block devices.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/Kconfig |  1 +
>  block/fops.c  | 33 +++++++++++++++++++++++++++++----
>  2 files changed, 30 insertions(+), 4 deletions(-)
> 
> diff --git a/block/Kconfig b/block/Kconfig
> index 941b2dca70db73..672b08f0096ab4 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -5,6 +5,7 @@
>  menuconfig BLOCK
>         bool "Enable the block layer" if EXPERT
>         default y
> +       select IOMAP

This needs to be FS_IOMAP.

>         select SBITMAP
>         help
>  	 Provide block layer support for the kernel.

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 17/17] fs: add CONFIG_BUFFER_HEAD
  2023-04-24  5:49 ` [PATCH 17/17] fs: add CONFIG_BUFFER_HEAD Christoph Hellwig
@ 2023-04-29  0:11   ` Luis Chamberlain
  2023-04-29  1:20     ` Matthew Wilcox
  0 siblings, 1 reply; 41+ messages in thread
From: Luis Chamberlain @ 2023-04-29  0:11 UTC (permalink / raw)
  To: Christoph Hellwig, Pankaj Raghav, Daniel Gomez
  Cc: Jens Axboe, Miklos Szeredi, Darrick J. Wong, Andrew Morton,
	David Howells, Matthew Wilcox, linux-block, linux-fsdevel,
	ceph-devel, linux-ext4, linux-f2fs-devel, cluster-devel,
	linux-xfs, linux-nfs, linux-mm, linux-kernel

On Mon, Apr 24, 2023 at 07:49:26AM +0200, Christoph Hellwig wrote:
> +const struct address_space_operations def_blk_aops = {
> +	.dirty_folio	= filemap_dirty_folio,
> +	.release_folio		= iomap_release_folio,
> +	.invalidate_folio	= iomap_invalidate_folio,
> +	.read_folio		= blkdev_read_folio,
> +	.readahead		= blkdev_readahead,
> +	.writepages		= blkdev_writepages,
> +	.is_partially_uptodate  = iomap_is_partially_uptodate,
> +	.error_remove_page	= generic_error_remove_page,
> +	.migrate_folio		= filemap_migrate_folio,
> +};
> +#endif /* CONFIG_BUFFER_HEAD */

We've tested this with bs > ps (LBS) devices and it would seem it crashes,
as Pankaj notes perhaps due to lack of higher order folio support yet
on this path, for the block cache. The same crash happens with NVMe
(using out-of-tree nvme_core.debug_large_lbas boot parameter to enable NVMe
LBS) or brd with LBS. To enable NVMe LBS or brd with LBS you need
out of tree patches though of course, so I've stashed these into
a branch, large-block-20230426 [0] so to help folks who may want
to experiment further.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=large-block-20230426

[   11.245248] BUG: kernel NULL pointer dereference, address: 0000000000000000
[   11.254581] #PF: supervisor read access in kernel mode
[   11.257387] #PF: error_code(0x0000) - not-present page
[   11.260921] PGD 0 P4D 0
[   11.262600] Oops: 0000 [#1] PREEMPT SMP PTI
[   11.264993] CPU: 7 PID: 198 Comm: (udev-worker) Not tainted 6.3.0-large-block-20230426 #2
[   11.269385] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[   11.275054] RIP: 0010:iomap_page_create.isra.0+0xc/0xd0
[   11.277924] Code: 41 5e 41 5f c3 cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 54 55 48 89 f5 53 <48> 8b 06 48 c1 e8 0d 89 c6 83 e6 01 0f 84 a1 00 00 00 4c 8b 65 28
[   11.287293] RSP: 0018:ffffb0f0805ef9d8 EFLAGS: 00010293
[   11.289964] RAX: ffff9de3c1fa8388 RBX: ffffb0f0805efa78 RCX: 000000037ffe0000
[   11.293212] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000000d
[   11.296485] RBP: 0000000000000000 R08: 0000000000021000 R09: ffffffff9c733b20
[   11.299724] R10: 0000000000000001 R11: 000000000000c000 R12: 0000000000000000
[   11.302974] R13: ffffffff9be96260 R14: ffffb0f0805efa58 R15: 0000000000000000
[   11.306206] FS:  00007f03ea8368c0(0000) GS:ffff9de43bdc0000(0000) knlGS:0000000000000000
[   11.309949] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.312464] CR2: 0000000000000000 CR3: 0000000117ec6006 CR4: 0000000000770ee0
[   11.315442] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   11.318310] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   11.321010] PKRU: 55555554
[   11.322212] Call Trace:
[   11.323224]  <TASK>
[   11.324146]  iomap_readpage_iter+0x96/0x300
[   11.325694]  iomap_readahead+0x174/0x2d0
[   11.327129]  read_pages+0x69/0x1f0
[   11.328389]  ? folio_add_lru+0x7e/0xe0
[   11.329751]  page_cache_ra_unbounded+0x187/0x1d0
[   11.331301]  force_page_cache_ra+0x94/0xb0
[   11.332681]  filemap_get_pages+0x10e/0x650
[   11.334073]  ? _raw_spin_lock+0x13/0x40
[   11.335287]  filemap_read+0xbf/0x340
[   11.336430]  ? aa_file_perm+0x117/0x4b0
[   11.337646]  ? generic_fillattr+0x45/0xf0
[   11.338887]  ? _copy_to_user+0x22/0x30
[   11.340026]  ? cp_new_stat+0x150/0x180
[   11.341166]  blkdev_read_iter+0x5e/0x140
[   11.342357]  vfs_read+0x1f0/0x2c0
[   11.343354]  ksys_read+0x63/0xe0
[   11.344331]  do_syscall_64+0x37/0x90
[   11.345411]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   11.346760] RIP: 0033:0x7f03eaf3903d

(gdb) l *(iomap_readpage_iter+0x96)
0xffffffff814021b6 is in iomap_readpage_iter (fs/iomap/buffered-io.c:280).
275             if (iomap->type == IOMAP_INLINE)
276                     return iomap_read_inline_data(iter, folio);
277
278             /* zero post-eof blocks as the page may be mapped */
279             iop = iomap_page_create(iter->inode, folio, iter->flags);
280             iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
281             if (plen == 0)
282                     goto done;
283
284             if (iomap_block_needs_zeroing(iter, pos)) {
(gdb) l *(iomap_page_create+0xc)
0xffffffff81400cdc is in iomap_page_create (./arch/x86/include/asm/bitops.h:207).
202     }
203
204     static __always_inline bool constant_test_bit(long nr, const volatile unsigned long *addr)
205     {
206             return ((1UL << (nr & (BITS_PER_LONG-1))) &
207                     (addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
208     }
209
210     static __always_inline bool constant_test_bit_acquire(long nr, const volatile unsigned long *addr)
211     {

To reproduce one would want a system with only say XFS as the root
image. I've enabled this on kdevops through "pure-iomap" option:

https://github.com/linux-kdevops/kdevops/blob/master/docs/lbs.md

  Luis

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 17/17] fs: add CONFIG_BUFFER_HEAD
  2023-04-29  0:11   ` Luis Chamberlain
@ 2023-04-29  1:20     ` Matthew Wilcox
  2023-05-01  3:14       ` Luis Chamberlain
  0 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2023-04-29  1:20 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, Pankaj Raghav, Daniel Gomez, Jens Axboe,
	Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	linux-block, linux-fsdevel, ceph-devel, linux-ext4,
	linux-f2fs-devel, cluster-devel, linux-xfs, linux-nfs, linux-mm,
	linux-kernel

On Fri, Apr 28, 2023 at 05:11:57PM -0700, Luis Chamberlain wrote:
> [   11.245248] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [   11.254581] #PF: supervisor read access in kernel mode
> [   11.257387] #PF: error_code(0x0000) - not-present page
> [   11.260921] PGD 0 P4D 0
> [   11.262600] Oops: 0000 [#1] PREEMPT SMP PTI
> [   11.264993] CPU: 7 PID: 198 Comm: (udev-worker) Not tainted 6.3.0-large-block-20230426 #2
> [   11.269385] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> [   11.275054] RIP: 0010:iomap_page_create.isra.0+0xc/0xd0
> [   11.277924] Code: 41 5e 41 5f c3 cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 54 55 48 89 f5 53 <48> 8b 06 48 c1 e8 0d 89 c6 83 e6 01 0f 84 a1 00 00 00 4c 8b 65 28
> [   11.287293] RSP: 0018:ffffb0f0805ef9d8 EFLAGS: 00010293
> [   11.289964] RAX: ffff9de3c1fa8388 RBX: ffffb0f0805efa78 RCX: 000000037ffe0000
> [   11.293212] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000000d
> [   11.296485] RBP: 0000000000000000 R08: 0000000000021000 R09: ffffffff9c733b20
> [   11.299724] R10: 0000000000000001 R11: 000000000000c000 R12: 0000000000000000
> [   11.302974] R13: ffffffff9be96260 R14: ffffb0f0805efa58 R15: 0000000000000000

RSI is argument 2, which is folio.

Code starting with the faulting instruction
===========================================
   0:	48 8b 06             	mov    (%rsi),%rax
   3:	48 c1 e8 0d          	shr    $0xd,%rax

Looks to me like a NULL folio was passed into iomap_page_create().

> [   11.306206] FS:  00007f03ea8368c0(0000) GS:ffff9de43bdc0000(0000) knlGS:0000000000000000
> [   11.309949] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   11.312464] CR2: 0000000000000000 CR3: 0000000117ec6006 CR4: 0000000000770ee0
> [   11.315442] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   11.318310] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   11.321010] PKRU: 55555554
> [   11.322212] Call Trace:
> [   11.323224]  <TASK>
> [   11.324146]  iomap_readpage_iter+0x96/0x300
> [   11.325694]  iomap_readahead+0x174/0x2d0
> [   11.327129]  read_pages+0x69/0x1f0
> [   11.329751]  page_cache_ra_unbounded+0x187/0x1d0

... that shouldn't be possible.  read_pages() allocates pages, puts them
in the page cache and tells the filesystem to fill them in.

In your patches, did you call mapping_set_large_folios() anywhere?


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 17/17] fs: add CONFIG_BUFFER_HEAD
  2023-04-29  1:20     ` Matthew Wilcox
@ 2023-05-01  3:14       ` Luis Chamberlain
  2023-05-01 15:46         ` Matthew Wilcox
  0 siblings, 1 reply; 41+ messages in thread
From: Luis Chamberlain @ 2023-05-01  3:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Pankaj Raghav, Daniel Gomez, Jens Axboe,
	Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	linux-block, linux-fsdevel, ceph-devel, linux-ext4,
	linux-f2fs-devel, cluster-devel, linux-xfs, linux-nfs, linux-mm,
	linux-kernel

On Sat, Apr 29, 2023 at 02:20:17AM +0100, Matthew Wilcox wrote:
> On Fri, Apr 28, 2023 at 05:11:57PM -0700, Luis Chamberlain wrote:
> > [   11.245248] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > [   11.254581] #PF: supervisor read access in kernel mode
> > [   11.257387] #PF: error_code(0x0000) - not-present page
> > [   11.260921] PGD 0 P4D 0
> > [   11.262600] Oops: 0000 [#1] PREEMPT SMP PTI
> > [   11.264993] CPU: 7 PID: 198 Comm: (udev-worker) Not tainted 6.3.0-large-block-20230426 #2
> > [   11.269385] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
> > [   11.275054] RIP: 0010:iomap_page_create.isra.0+0xc/0xd0
> > [   11.277924] Code: 41 5e 41 5f c3 cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 54 55 48 89 f5 53 <48> 8b 06 48 c1 e8 0d 89 c6 83 e6 01 0f 84 a1 00 00 00 4c 8b 65 28
> > [   11.287293] RSP: 0018:ffffb0f0805ef9d8 EFLAGS: 00010293
> > [   11.289964] RAX: ffff9de3c1fa8388 RBX: ffffb0f0805efa78 RCX: 000000037ffe0000
> > [   11.293212] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000000d
> > [   11.296485] RBP: 0000000000000000 R08: 0000000000021000 R09: ffffffff9c733b20
> > [   11.299724] R10: 0000000000000001 R11: 000000000000c000 R12: 0000000000000000
> > [   11.302974] R13: ffffffff9be96260 R14: ffffb0f0805efa58 R15: 0000000000000000
> 
> RSI is argument 2, which is folio.
> 
> Code starting with the faulting instruction
> ===========================================
>    0:	48 8b 06             	mov    (%rsi),%rax
>    3:	48 c1 e8 0d          	shr    $0xd,%rax
> 
> Looks to me like a NULL folio was passed into iomap_page_create().
> 
> > [   11.306206] FS:  00007f03ea8368c0(0000) GS:ffff9de43bdc0000(0000) knlGS:0000000000000000
> > [   11.309949] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   11.312464] CR2: 0000000000000000 CR3: 0000000117ec6006 CR4: 0000000000770ee0
> > [   11.315442] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [   11.318310] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [   11.321010] PKRU: 55555554
> > [   11.322212] Call Trace:
> > [   11.323224]  <TASK>
> > [   11.324146]  iomap_readpage_iter+0x96/0x300
> > [   11.325694]  iomap_readahead+0x174/0x2d0
> > [   11.327129]  read_pages+0x69/0x1f0
> > [   11.329751]  page_cache_ra_unbounded+0x187/0x1d0
> 
> ... that shouldn't be possible.  read_pages() allocates pages, puts them
> in the page cache and tells the filesystem to fill them in.
> 
> In your patches, did you call mapping_set_large_folios() anywhere?

No but the only place to add that would be in the block cache. Adding
that alone to the block cache doesn't fix the issue. The below patch
however does get us by.

From my readings it does't seem like readahead_folio() should always
return non-NULL, and also I couldn't easily verify the math is right.
The max cap I see is for the backing device io_size, but there are
some other heuristics which will take me some time to try to grok
to understand if they are correct.

diff --git a/block/bdev.c b/block/bdev.c
index 21c63bfef323..afa0c5ebd364 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -411,6 +411,7 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
 	inode->i_rdev = 0;
 	inode->i_data.a_ops = &def_blk_aops;
 	mapping_set_gfp_mask(&inode->i_data, GFP_USER);
+	mapping_set_large_folios(inode->i_mapping);
 
 	bdev = I_BDEV(inode);
 	mutex_init(&bdev->bd_fsfreeze_mutex);
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8115b0d9a85a..d3c9e16a7066 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -384,7 +384,10 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
 		}
 		if (!ctx->cur_folio) {
 			ctx->cur_folio = readahead_folio(ctx->rac);
-			ctx->cur_folio_in_bio = false;
+			if (ctx->cur_folio)
+				ctx->cur_folio_in_bio = false;
+			else
+				continue;
 		}
 		ret = iomap_readpage_iter(iter, ctx, done);
 		if (ret <= 0)

We still see:

Apr 30 19:54:21 iomap kernel: ------------[ cut here ]------------
Apr 30 19:54:21 iomap kernel: WARNING: CPU: 4 PID: 197 at block/fops.c:389 blkdev_iomap_begin+0x80/0x90
Apr 30 19:54:21 iomap kernel: Modules linked in: psmouse virtio_blk failover nvme nvme_core crc32c_intel t10_pi virtio_pci
virtio_pci_legacy_dev virtio_pci_modern_dev virtio crc64_rocksoft >
Apr 30 19:54:21 iomap kernel: CPU: 4 PID: 197 Comm: (udev-worker) Not tainted 6.3.0-large-block-20230426-dirty #7
Apr 30 19:54:21 iomap kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
Apr 30 19:54:21 iomap kernel: RIP: 0010:blkdev_iomap_begin+0x80/0x90
Apr 30 19:54:21 iomap kernel: Code: c0 48 89 53 10 5b 5d 41 5c c3 cc cc
cc cc 89 c6 83 e8 01 48 8d 6c 2e ff 48 f7 de 48 29 c5 48 21 ee 48 89 73
08 48 39 d6 7c c6 <0f> 0b b8 fb ff ff ff 5b 5>
Apr 30 19:54:21 iomap kernel: RSP: 0018:ffffa437c0723a08 EFLAGS: 00010246
Apr 30 19:54:21 iomap kernel: RAX: 0000000000007fff RBX: ffffa437c0723aa0 RCX: 0000000000000000
Apr 30 19:54:21 iomap kernel: RDX: 0000000400000000 RSI: 0000000400000000 RDI: ffff92f9c23c5788
Apr 30 19:54:21 iomap kernel: RBP: 0000000400000000 R08: ffffa437c0723aa0 R09: ffffa437c0723af0
Apr 30 19:54:21 iomap kernel: R10: 0000000000000001 R11: 000000000000c000 R12: ffff92f9c23c5788
Apr 30 19:54:21 iomap kernel: R13: ffffa437c0723af0 R14: ffffffff828962e0 R15: ffffa437c0723cc8
Apr 30 19:54:21 iomap kernel: FS:  00007febee04b8c0(0000) GS:ffff92fa3bd00000(0000) knlGS:0000000000000000
Apr 30 19:54:21 iomap kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 30 19:54:21 iomap kernel: CR2: 00007ffd68325108 CR3: 0000000102a9a005 CR4: 0000000000770ee0
Apr 30 19:54:21 iomap kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 30 19:54:21 iomap kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr 30 19:54:21 iomap kernel: PKRU: 55555554
Apr 30 19:54:21 iomap kernel: Call Trace:
Apr 30 19:54:21 iomap kernel:  <TASK>
Apr 30 19:54:21 iomap kernel:  iomap_iter+0x179/0x350
Apr 30 19:54:21 iomap kernel:  iomap_readahead+0x200/0x2e0
Apr 30 19:54:21 iomap kernel:  read_pages+0x69/0x1f0
Apr 30 19:54:21 iomap kernel:  ? folio_add_lru+0x7e/0xe0
Apr 30 19:54:21 iomap kernel:  page_cache_ra_unbounded+0x187/0x1d0
Apr 30 19:54:21 iomap kernel:  force_page_cache_ra+0x94/0xb0
Apr 30 19:54:21 iomap kernel:  filemap_get_pages+0x10e/0x650
Apr 30 19:54:21 iomap kernel:  ? _raw_spin_lock+0x13/0x40
Apr 30 19:54:21 iomap kernel:  ? _raw_spin_unlock+0x15/0x30
Apr 30 19:54:21 iomap kernel:  ? __mark_inode_dirty+0x155/0x380
Apr 30 19:54:21 iomap kernel:  filemap_read+0xbf/0x340
Apr 30 19:54:21 iomap kernel:  ? aa_file_perm+0x117/0x4b0
Apr 30 19:54:21 iomap kernel:  ? generic_fillattr+0x45/0xf0
Apr 30 19:54:21 iomap kernel:  ? _copy_to_user+0x22/0x30
Apr 30 19:54:21 iomap kernel:  ? cp_new_stat+0x150/0x180
Apr 30 19:54:21 iomap kernel:  blkdev_read_iter+0x5e/0x140
Apr 30 19:54:21 iomap kernel:  vfs_read+0x1f0/0x2c0
Apr 30 19:54:21 iomap kernel:  ksys_read+0x63/0xe0
Apr 30 19:54:21 iomap kernel:  do_syscall_64+0x37/0x90
Apr 30 19:54:21 iomap kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
Apr 30 19:54:21 iomap kernel: RIP: 0033:0x7febee74e03d
Apr 30 19:54:21 iomap kernel: Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d a6
55 0a 00 e8 39 fe 01 00 66 0f 1f 84 00 00 00 00 00 80 3d a1 25 0e 00 00
74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c>
Apr 30 19:54:21 iomap kernel: RSP: 002b:00007ffd68329148 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Apr 30 19:54:21 iomap kernel: RAX: ffffffffffffffda RBX: 000055b30a902020 RCX: 00007febee74e03d
Apr 30 19:54:21 iomap kernel: RDX: 0000000000000100 RSI: 000055b30a905138 RDI: 000000000000000c
Apr 30 19:54:21 iomap kernel: RBP: 00000003ffffe000 R08: 00007febee828d40 R09: 00007febee828d40
Apr 30 19:54:21 iomap kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 000055b30a905110
Apr 30 19:54:21 iomap kernel: R13: 0000000000000100 R14: 000055b30a902078 R15: 000055b30a905128
Apr 30 19:54:21 iomap kernel:  </TASK>
Apr 30 19:54:21 iomap kernel: ---[ end trace 0000000000000000 ]---

And then also:

Apr 30 19:54:21 iomap kernel: ------------[ cut here ]------------
Apr 30 19:54:21 iomap kernel: WARNING: CPU: 4 PID: 197 at fs/iomap/iter.c:32 iomap_iter+0x335/0x350
Apr 30 19:54:21 iomap kernel: Modules linked in: psmouse virtio_blk failover nvme nvme_core crc32c_intel t10_pi virtio_pci
virtio_pci_legacy_dev virtio_pci_modern_dev virtio crc64_rocksoft >
Apr 30 19:54:21 iomap kernel: CPU: 4 PID: 197 Comm: (udev-worker) Tainted: G        W          6.3.0-large-block-20230426-dirty #7
Apr 30 19:54:21 iomap kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
Apr 30 19:54:21 iomap kernel: RIP: 0010:iomap_iter+0x335/0x350
Apr 30 19:54:21 iomap kernel: Code: b8 fe ff ff e8 dc 29 c0 ff e9 ae fe
ff ff 0f 0b e9 71 fe ff ff 0f 0b e9 77 fe ff ff 0f 0b e9 7c fe ff ff 0f
0b e9 7f fe ff ff <0f> 0b b8 fb ff ff ff e9 8>
Apr 30 19:54:21 iomap kernel: RSP: 0018:ffffa437c0723a28 EFLAGS: 00010287
Apr 30 19:54:21 iomap kernel: RAX: 0000000000008000 RBX: ffffa437c0723a78 RCX: 0000000000008000
Apr 30 19:54:21 iomap kernel: RDX: 000000000000a000 RSI:00000003fffab000 RDI: ffffa437c0723a78
Apr 30 19:54:21 iomap kernel: RBP: ffffffff828962e0 R08: 0000000000005000 R09: ffffffff83133b20
Apr 30 19:54:21 iomap kernel: R10: 0000000000000001 R11: 000000000000c000 R12: 0000000000008000
Apr 30 19:54:21 iomap kernel: R13: ffffa437c0723a78 R14: ffffffff828962e0 R15: ffffa437c0723cc8
Apr 30 19:54:21 iomap kernel: FS:  00007febee04b8c0(0000) GS:ffff92fa3bd00000(0000) knlGS:0000000000000000
Apr 30 19:54:21 iomap kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 30 19:54:21 iomap kernel: CR2: 00007ffd68325108 CR3: 0000000102a9a005 CR4: 0000000000770ee0
Apr 30 19:54:21 iomap kernel: DR0: 0000000000000000 DR1:0000000000000000 DR2: 0000000000000000
Apr 30 19:54:21 iomap kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr 30 19:54:21 iomap kernel: PKRU: 55555554
Apr 30 19:54:21 iomap kernel: Call Trace:
Apr 30 19:54:21 iomap kernel:  <TASK>
Apr 30 19:54:21 iomap kernel:  iomap_readahead+0x200/0x2e0
Apr 30 19:54:21 iomap kernel:  read_pages+0x69/0x1f0
Apr 30 19:54:21 iomap kernel:  ? folio_add_lru+0x7e/0xe0
Apr 30 19:54:21 iomap kernel:  page_cache_ra_unbounded+0x187/0x1d0
Apr 30 19:54:21 iomap kernel:  force_page_cache_ra+0x94/0xb0
Apr 30 19:54:21 iomap kernel:  filemap_get_pages+0x10e/0x650
Apr 30 19:54:21 iomap kernel:  filemap_read+0xbf/0x340
Apr 30 19:54:21 iomap kernel:  ? tomoyo_path_number_perm+0x68/0x1d0
Apr 30 19:54:21 iomap kernel:  ? aa_file_perm+0x117/0x4b0
Apr 30 19:54:21 iomap kernel:  blkdev_read_iter+0x5e/0x140
Apr 30 19:54:21 iomap kernel:  vfs_read+0x1f0/0x2c0
Apr 30 19:54:21 iomap kernel:  ksys_read+0x63/0xe0
Apr 30 19:54:21 iomap kernel:  do_syscall_64+0x37/0x90
Apr 30 19:54:21 iomap kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
Apr 30 19:54:21 iomap kernel: RIP: 0033:0x7febee74e03d
Apr 30 19:54:21 iomap kernel: Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d a6
55 0a 00 e8 39 fe 01 00 66 0f 1f 84 00 00 00 00 00 80 3d a1 25 0e 00 00
74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c>
Apr 30 19:54:21 iomap kernel: RSP: 002b:00007ffd68329198 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Apr 30 19:54:21 iomap kernel: RAX: ffffffffffffffda RBX: 000055b30a902020 RCX: 00007febee74e03d
Apr 30 19:54:21 iomap kernel: RDX: 0000000000000018 RSI:i 000055b30a8fe518 RDI: 000000000000000c
Apr 30 19:54:21 iomap kernel: RBP: 00000003fffaba00 R08: 00007febee828cc0 R09: 0000000000000070
Apr 30 19:54:21 iomap kernel: R10: 0000000000000000 R11:0000000000000246 R12: 000055b30a8fe4f0
Apr 30 19:54:21 iomap kernel: R13: 0000000000000018 R14: 000055b30a902078 R15: 000055b30a8fe508
Apr 30 19:54:21 iomap kernel:  </TASK>
Apr 30 19:54:21 iomap kernel: ---[ end trace 0000000000000000 ]---


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 17/17] fs: add CONFIG_BUFFER_HEAD
  2023-05-01  3:14       ` Luis Chamberlain
@ 2023-05-01 15:46         ` Matthew Wilcox
  2023-05-01 16:00           ` Pankaj Raghav
  0 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2023-05-01 15:46 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, Pankaj Raghav, Daniel Gomez, Jens Axboe,
	Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	linux-block, linux-fsdevel, ceph-devel, linux-ext4,
	linux-f2fs-devel, cluster-devel, linux-xfs, linux-nfs, linux-mm,
	linux-kernel

On Sun, Apr 30, 2023 at 08:14:03PM -0700, Luis Chamberlain wrote:
> On Sat, Apr 29, 2023 at 02:20:17AM +0100, Matthew Wilcox wrote:
> > > [   11.322212] Call Trace:
> > > [   11.323224]  <TASK>
> > > [   11.324146]  iomap_readpage_iter+0x96/0x300
> > > [   11.325694]  iomap_readahead+0x174/0x2d0
> > > [   11.327129]  read_pages+0x69/0x1f0
> > > [   11.329751]  page_cache_ra_unbounded+0x187/0x1d0
> > 
> > ... that shouldn't be possible.  read_pages() allocates pages, puts them
> > in the page cache and tells the filesystem to fill them in.
> > 
> > In your patches, did you call mapping_set_large_folios() anywhere?
> 
> No but the only place to add that would be in the block cache. Adding
> that alone to the block cache doesn't fix the issue. The below patch
> however does get us by.

That's "working around the error", not fixing it ... probably the same
root cause as your other errors; at least I'm not diving into them until
the obvious one is fixed.

> >From my readings it does't seem like readahead_folio() should always
> return non-NULL, and also I couldn't easily verify the math is right.

readahead_folio() always returns non-NULL.  That's guaranteed by how
page_cache_ra_unbounded() and page_cache_ra_order() work.  It allocates
folios, until it can't (already-present folio, ENOMEM, EOF, max batch
size) and then calls the filesystem to make those folios uptodate,
telling it how many folios it put in the page cache, where they start.

Hm.  The fact that it's coming from page_cache_ra_unbounded() makes
me wonder if you updated this line:

                folio = filemap_alloc_folio(gfp_mask, 0);

without updating this line:

                ractl->_nr_pages++;

This is actually number of pages, not number of folios, so needs to be
		ractl->_nr_pages += 1 << order;

various other parts of page_cache_ra_unbounded() need to be examined
carefully for assumptions of order-0; it's never been used for that
before.  all the large folio work has concentrated on
page_cache_ra_order()

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 17/17] fs: add CONFIG_BUFFER_HEAD
  2023-05-01 15:46         ` Matthew Wilcox
@ 2023-05-01 16:00           ` Pankaj Raghav
  0 siblings, 0 replies; 41+ messages in thread
From: Pankaj Raghav @ 2023-05-01 16:00 UTC (permalink / raw)
  To: Matthew Wilcox, Luis Chamberlain
  Cc: Christoph Hellwig, Daniel Gomez, Jens Axboe, Miklos Szeredi,
	Darrick J. Wong, Andrew Morton, David Howells, linux-block,
	linux-fsdevel, ceph-devel, linux-ext4, linux-f2fs-devel,
	cluster-devel, linux-xfs, linux-nfs, linux-mm, linux-kernel

>> No but the only place to add that would be in the block cache. Adding
>> that alone to the block cache doesn't fix the issue. The below patch
>> however does get us by.
> 
> That's "working around the error", not fixing it ... probably the same
> root cause as your other errors; at least I'm not diving into them until
> the obvious one is fixed.
> 
>> >From my readings it does't seem like readahead_folio() should always
>> return non-NULL, and also I couldn't easily verify the math is right.
> 
> readahead_folio() always returns non-NULL.  That's guaranteed by how
> page_cache_ra_unbounded() and page_cache_ra_order() work.  It allocates
> folios, until it can't (already-present folio, ENOMEM, EOF, max batch
> size) and then calls the filesystem to make those folios uptodate,
> telling it how many folios it put in the page cache, where they start.
> 
> Hm.  The fact that it's coming from page_cache_ra_unbounded() makes
> me wonder if you updated this line:
> 
>                 folio = filemap_alloc_folio(gfp_mask, 0);
> 
> without updating this line:
> 
>                 ractl->_nr_pages++;
> 
> This is actually number of pages, not number of folios, so needs to be
> 		ractl->_nr_pages += 1 << order;
> 

I already had a patch which did the following:

ractl->_nr_pages += folio_nr_pages(folio);

but the variable `i` in the loop was not updated properly (assumption of zero order folio). This now
fixes the crash:

@@ -210,7 +210,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
        unsigned long index = readahead_index(ractl);
        gfp_t gfp_mask = readahead_gfp_mask(mapping);
        unsigned long i;
-
+       int order = 0;
        /*
         * Partway through the readahead operation, we will have added
         * locked pages to the page cache, but will not yet have submitted
@@ -223,6 +223,9 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
         */
        unsigned int nofs = memalloc_nofs_save();

+       if (mapping->host->i_blkbits > PAGE_SHIFT)
+               order = mapping->host->i_blkbits - PAGE_SHIFT;
+
        filemap_invalidate_lock_shared(mapping);
        /*
         * Preallocate as many pages as we will need.
@@ -245,7 +248,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
                        continue;
                }

-               folio = filemap_alloc_folio(gfp_mask, 0);
+               folio = filemap_alloc_folio(gfp_mask, order);
                if (!folio)
                        break;
                if (filemap_add_folio(mapping, folio, index + i,
@@ -259,7 +262,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
                if (i == nr_to_read - lookahead_size)
                        folio_set_readahead(folio);
                ractl->_workingset |= folio_test_workingset(folio);
-               ractl->_nr_pages++;
+               ractl->_nr_pages += folio_nr_pages(folio);
+               i += folio_nr_pages(folio) - 1;
        }

> various other parts of page_cache_ra_unbounded() need to be examined
> carefully for assumptions of order-0; it's never been used for that
> before.  all the large folio work has concentrated on
> page_cache_ra_order()

As you have noted here, this needs to be examined more carefully. Even though the patches fix the
crash, fio with verify option fails (i.e write and read are not giving the same output).

I think it is better to send an RFC patch series on top of Christoph's work with optional
BUFFER_HEAD to iron out some core issues/bugs.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 01/17] fs: unexport buffer_check_dirty_writeback
  2023-04-24  5:49 ` [PATCH 01/17] fs: unexport buffer_check_dirty_writeback Christoph Hellwig
@ 2023-05-19 14:17   ` Hannes Reinecke
  2023-07-06  0:18   ` [f2fs-dev] " patchwork-bot+f2fs
  2023-09-04 18:11   ` patchwork-bot+f2fs
  2 siblings, 0 replies; 41+ messages in thread
From: Hannes Reinecke @ 2023-05-19 14:17 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

On 4/24/23 07:49, Christoph Hellwig wrote:
> buffer_check_dirty_writeback is only used by the block device aops,
> remove the export.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   fs/buffer.c | 1 -
>   1 file changed, 1 deletion(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 9e1e2add541e07..eb14fbaa7d35f7 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -111,7 +111,6 @@ void buffer_check_dirty_writeback(struct folio *folio,
>   		bh = bh->b_this_page;
>   	} while (bh != head);
>   }
> -EXPORT_SYMBOL(buffer_check_dirty_writeback);
>   
>   /*
>    * Block until a buffer comes unlocked.  This doesn't stop it

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 16/17] block: use iomap for writes to block devices
  2023-04-24  5:49 ` [PATCH 16/17] block: use iomap for writes to block devices Christoph Hellwig
       [not found]   ` <CGME20230426130921eucas1p279078812be7e8d50c1305e47cea53661@eucas1p2.samsung.com>
@ 2023-05-19 14:22   ` Hannes Reinecke
  2023-05-23 22:27     ` Dave Chinner
  2023-07-20 12:06     ` Christoph Hellwig
  1 sibling, 2 replies; 41+ messages in thread
From: Hannes Reinecke @ 2023-05-19 14:22 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	Matthew Wilcox, linux-block, linux-fsdevel, ceph-devel,
	linux-ext4, linux-f2fs-devel, cluster-devel, linux-xfs,
	linux-nfs, linux-mm, linux-kernel

On 4/24/23 07:49, Christoph Hellwig wrote:
> Use iomap in buffer_head compat mode to write to block devices.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/Kconfig |  1 +
>   block/fops.c  | 33 +++++++++++++++++++++++++++++----
>   2 files changed, 30 insertions(+), 4 deletions(-)
> 
> diff --git a/block/Kconfig b/block/Kconfig
> index 941b2dca70db73..672b08f0096ab4 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -5,6 +5,7 @@
>   menuconfig BLOCK
>          bool "Enable the block layer" if EXPERT
>          default y
> +       select IOMAP
>          select SBITMAP
>          help
>   	 Provide block layer support for the kernel.
> diff --git a/block/fops.c b/block/fops.c
> index 318247832a7bcf..7910636f8df33b 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -15,6 +15,7 @@
>   #include <linux/falloc.h>
>   #include <linux/suspend.h>
>   #include <linux/fs.h>
> +#include <linux/iomap.h>
>   #include <linux/module.h>
>   #include "blk.h"
>   
> @@ -386,6 +387,27 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>   	return __blkdev_direct_IO(iocb, iter, bio_max_segs(nr_pages));
>   }
>   
> +static int blkdev_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> +		unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
> +{
> +	struct block_device *bdev = I_BDEV(inode);
> +	loff_t isize = i_size_read(inode);
> +
> +	iomap->bdev = bdev;
> +	iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev));
> +	if (WARN_ON_ONCE(iomap->offset >= isize))
> +		return -EIO;

I'm hitting this during booting:
[    5.016324]  <TASK>
[    5.030256]  iomap_iter+0x11a/0x350
[    5.030264]  iomap_readahead+0x1eb/0x2c0
[    5.030272]  read_pages+0x5d/0x220
[    5.030279]  page_cache_ra_unbounded+0x131/0x180
[    5.030284]  filemap_get_pages+0xff/0x5a0
[    5.030292]  filemap_read+0xca/0x320
[    5.030296]  ? aa_file_perm+0x126/0x500
[    5.040216]  ? touch_atime+0xc8/0x150
[    5.040224]  blkdev_read_iter+0xb0/0x150
[    5.040228]  vfs_read+0x226/0x2d0
[    5.040234]  ksys_read+0xa5/0xe0
[    5.040238]  do_syscall_64+0x5b/0x80

Maybe we should consider this patch:

diff --git a/block/fops.c b/block/fops.c
index 524b8a828aad..d202fb663f25 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -386,10 +386,13 @@ static int blkdev_iomap_begin(struct inode *inode, 
loff_t offset, loff_t length,

         iomap->bdev = bdev;
         iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev));
-       if (WARN_ON_ONCE(iomap->offset >= isize))
-               return -EIO;
-       iomap->type = IOMAP_MAPPED;
-       iomap->addr = iomap->offset;
+       if (WARN_ON_ONCE(iomap->offset >= isize)) {
+               iomap->type = IOMAP_HOLE;
+               iomap->addr = IOMAP_NULL_ADDR;
+       } else {
+               iomap->type = IOMAP_MAPPED;
+               iomap->addr = iomap->offset;
+       }
         iomap->length = isize - iomap->offset;
         if (IS_ENABLED(CONFIG_BUFFER_HEAD))
                 iomap->flags |= IOMAP_F_BUFFER_HEAD;


Other that the the system seems fine.

Cheers,

Hannes


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 16/17] block: use iomap for writes to block devices
  2023-05-19 14:22   ` Hannes Reinecke
@ 2023-05-23 22:27     ` Dave Chinner
  2023-05-24 13:33       ` Matthew Wilcox
  2023-07-20 12:06     ` Christoph Hellwig
  1 sibling, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2023-05-23 22:27 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Jens Axboe, Miklos Szeredi, Darrick J. Wong,
	Andrew Morton, David Howells, Matthew Wilcox, linux-block,
	linux-fsdevel, ceph-devel, linux-ext4, linux-f2fs-devel,
	cluster-devel, linux-xfs, linux-nfs, linux-mm, linux-kernel

On Fri, May 19, 2023 at 04:22:01PM +0200, Hannes Reinecke wrote:
> On 4/24/23 07:49, Christoph Hellwig wrote:
> > Use iomap in buffer_head compat mode to write to block devices.
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> >   block/Kconfig |  1 +
> >   block/fops.c  | 33 +++++++++++++++++++++++++++++----
> >   2 files changed, 30 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block/Kconfig b/block/Kconfig
> > index 941b2dca70db73..672b08f0096ab4 100644
> > --- a/block/Kconfig
> > +++ b/block/Kconfig
> > @@ -5,6 +5,7 @@
> >   menuconfig BLOCK
> >          bool "Enable the block layer" if EXPERT
> >          default y
> > +       select IOMAP
> >          select SBITMAP
> >          help
> >   	 Provide block layer support for the kernel.
> > diff --git a/block/fops.c b/block/fops.c
> > index 318247832a7bcf..7910636f8df33b 100644
> > --- a/block/fops.c
> > +++ b/block/fops.c
> > @@ -15,6 +15,7 @@
> >   #include <linux/falloc.h>
> >   #include <linux/suspend.h>
> >   #include <linux/fs.h>
> > +#include <linux/iomap.h>
> >   #include <linux/module.h>
> >   #include "blk.h"
> > @@ -386,6 +387,27 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> >   	return __blkdev_direct_IO(iocb, iter, bio_max_segs(nr_pages));
> >   }
> > +static int blkdev_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> > +		unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
> > +{
> > +	struct block_device *bdev = I_BDEV(inode);
> > +	loff_t isize = i_size_read(inode);
> > +
> > +	iomap->bdev = bdev;
> > +	iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev));
> > +	if (WARN_ON_ONCE(iomap->offset >= isize))
> > +		return -EIO;
> 
> I'm hitting this during booting:
> [    5.016324]  <TASK>
> [    5.030256]  iomap_iter+0x11a/0x350
> [    5.030264]  iomap_readahead+0x1eb/0x2c0
> [    5.030272]  read_pages+0x5d/0x220
> [    5.030279]  page_cache_ra_unbounded+0x131/0x180
> [    5.030284]  filemap_get_pages+0xff/0x5a0

Why is filemap_get_pages() using unbounded readahead? Surely
readahead should be limited to reading within EOF....

> [    5.030292]  filemap_read+0xca/0x320
> [    5.030296]  ? aa_file_perm+0x126/0x500
> [    5.040216]  ? touch_atime+0xc8/0x150
> [    5.040224]  blkdev_read_iter+0xb0/0x150
> [    5.040228]  vfs_read+0x226/0x2d0
> [    5.040234]  ksys_read+0xa5/0xe0
> [    5.040238]  do_syscall_64+0x5b/0x80
> 
> Maybe we should consider this patch:
> 
> diff --git a/block/fops.c b/block/fops.c
> index 524b8a828aad..d202fb663f25 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -386,10 +386,13 @@ static int blkdev_iomap_begin(struct inode *inode,
> loff_t offset, loff_t length,
> 
>         iomap->bdev = bdev;
>         iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev));
> -       if (WARN_ON_ONCE(iomap->offset >= isize))
> -               return -EIO;
> -       iomap->type = IOMAP_MAPPED;
> -       iomap->addr = iomap->offset;
> +       if (WARN_ON_ONCE(iomap->offset >= isize)) {
> +               iomap->type = IOMAP_HOLE;
> +               iomap->addr = IOMAP_NULL_ADDR;
> +       } else {
> +               iomap->type = IOMAP_MAPPED;
> +               iomap->addr = iomap->offset;
> +       }

I think Christoph's code is correct. IMO, any attempt to read beyond
the end of the device should throw out a warning and return an
error, not silently return zeros.

If readahead is trying to read beyond the end of the device, then it
really seems to me like the problem here is readahead, not the iomap
code detecting the OOB read request....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 16/17] block: use iomap for writes to block devices
  2023-05-23 22:27     ` Dave Chinner
@ 2023-05-24 13:33       ` Matthew Wilcox
  2023-07-20 12:09         ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Matthew Wilcox @ 2023-05-24 13:33 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Hannes Reinecke, Christoph Hellwig, Jens Axboe, Miklos Szeredi,
	Darrick J. Wong, Andrew Morton, David Howells, linux-block,
	linux-fsdevel, ceph-devel, linux-ext4, linux-f2fs-devel,
	cluster-devel, linux-xfs, linux-nfs, linux-mm, linux-kernel

On Wed, May 24, 2023 at 08:27:13AM +1000, Dave Chinner wrote:
> On Fri, May 19, 2023 at 04:22:01PM +0200, Hannes Reinecke wrote:
> > I'm hitting this during booting:
> > [    5.016324]  <TASK>
> > [    5.030256]  iomap_iter+0x11a/0x350
> > [    5.030264]  iomap_readahead+0x1eb/0x2c0
> > [    5.030272]  read_pages+0x5d/0x220
> > [    5.030279]  page_cache_ra_unbounded+0x131/0x180
> > [    5.030284]  filemap_get_pages+0xff/0x5a0
> 
> Why is filemap_get_pages() using unbounded readahead? Surely
> readahead should be limited to reading within EOF....

It isn't using unbounded readahead; that's an artifact of this
incomplete stack trace.  Actual call stack:

page_cache_ra_unbounded
do_page_cache_ra
ondemand_readahead
page_cache_sync_ra
page_cache_sync_readahead
filemap_get_pages

As you can see, do_page_cache_ra() does limit readahead to i_size.
Is ractl->mapping->host the correct way to find the inode?  I always
get confused.

> I think Christoph's code is correct. IMO, any attempt to read beyond
> the end of the device should throw out a warning and return an
> error, not silently return zeros.
> 
> If readahead is trying to read beyond the end of the device, then it
> really seems to me like the problem here is readahead, not the iomap
> code detecting the OOB read request....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 14/17] block: open code __generic_file_write_iter for blkdev writes
  2023-04-24  5:49 ` [PATCH 14/17] block: open code __generic_file_write_iter for blkdev writes Christoph Hellwig
@ 2023-05-24 22:23   ` Luis Chamberlain
  0 siblings, 0 replies; 41+ messages in thread
From: Luis Chamberlain @ 2023-05-24 22:23 UTC (permalink / raw)
  To: Christoph Hellwig, Daniel Gomez, Pankaj Raghav, Ming Lei
  Cc: Jens Axboe, Miklos Szeredi, Darrick J. Wong, Andrew Morton,
	David Howells, Matthew Wilcox, linux-block, linux-fsdevel,
	ceph-devel, linux-ext4, linux-f2fs-devel, cluster-devel,
	linux-xfs, linux-nfs, linux-mm, linux-kernel

On Mon, Apr 24, 2023 at 07:49:23AM +0200, Christoph Hellwig wrote:
> Open code __generic_file_write_iter to remove the indirect call into
> ->direct_IO and to prepare using the iomap based write code.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  block/fops.c | 46 ++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 44 insertions(+), 2 deletions(-)
> 
> diff --git a/block/fops.c b/block/fops.c
> index b670aa7c5bb745..fd510b6142bd57 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -508,6 +508,29 @@ static int blkdev_close(struct inode *inode, struct file *filp)
>  	return 0;
>  }
>  
> +static ssize_t
> +blkdev_direct_write(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	size_t count = iov_iter_count(from);
> +	ssize_t written;
> +
> +	written = kiocb_invalidate_pages(iocb, count);
> +	if (written) {
> +		if (written == -EBUSY)
> +			return 0;
> +		return written;
> +	}
> +
> +	written = blkdev_direct_IO(iocb, from);
> +	if (written > 0) {
> +		kiocb_invalidate_post_write(iocb, count);
> +		iocb->ki_pos += written;
> +	}

Written can be negative here after blkdev_direct_IO()

> +	if (written != -EIOCBQUEUED)
> +		iov_iter_revert(from, count - written - iov_iter_count(from));

And we'll then use it here on iov_iter_revert() and this can crash on
with some values. For example this can crash on a 4k write attempt
on a 32k drive when experimenting wit large block sizes.

kernel BUG at lib/iov_iter.c:999!
invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 4 PID: 949 Comm: fio Not tainted 6.3.0-large-block-20230426-dirty#28
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.16.0-debian-1.16.0-5        04/01/2014
+RIP: 0010:iov_iter_revert.part.0+0x16e/0x170
Code: f9 40 a2 63 af 74 07 03 56 08 89 d8 29 d0 89 45 08 44 89 6d 20
<etc>
RSP: 0018:ffffaa52006cfc60 EFLAGS: 00010246
RAX: 0000000000000016 RBX: 0000000000000016 RCX: 0000000000000000
RDX: 0000000000000004 RSI: 0000000000000006 RDI: ffffaa52006cfd08
RBP: ffffaa52006cfd08 R08: 0000000000000000 R09: ffffaa52006cfb40
R10: 0000000000000003 R11: ffffffffafcc21e8 R12: 0000000000004000
R13: 0000000000003fea R14: ffff9de3d7565e00 R15: ffff9de3c1f68600
FS:  00007f8bfe726c40(0000) GS:ffff9de43bd00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8bf5eadd68 CR3: 0000000102c76001 CR4: 0000000000770ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 <TASK>
blkdev_direct_write+0xf0/0x160
blkdev_write_iter+0x11b/0x230
io_write+0x10c/0x420
? kmem_cache_alloc_bulk+0x2a1/0x410
? fget+0x79/0xb0
io_issue_sqe+0x60/0x3b0
? io_prep_rw+0x5a/0x190
io_submit_sqes+0x1e6/0x640
__do_sys_io_uring_enter+0x54c/0xb90
? handle_mm_fault+0x9a/0x340
? preempt_count_add+0x47/0xa0
? up_read+0x37/0x70
? do_user_addr_fault+0x27c/0x780
do_syscall_64+0x37/0x90
entry_SYSCALL_64_after_hw

Although I fixed it with an early check on this routine
with:

if (count < bdev_logical_block_size(bdev))
	return -EINVAL; 

I think this can just be fixed by also using the alignment
check earier here:

if (blkdev_dio_unaligned(bdev, pos, iter))                               
	return -EINVAL;  

  Luis

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [f2fs-dev] [PATCH 01/17] fs: unexport buffer_check_dirty_writeback
  2023-04-24  5:49 ` [PATCH 01/17] fs: unexport buffer_check_dirty_writeback Christoph Hellwig
  2023-05-19 14:17   ` Hannes Reinecke
@ 2023-07-06  0:18   ` patchwork-bot+f2fs
  2023-09-04 18:11   ` patchwork-bot+f2fs
  2 siblings, 0 replies; 41+ messages in thread
From: patchwork-bot+f2fs @ 2023-07-06  0:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, linux-block, linux-nfs, cluster-devel, linux-xfs, miklos,
	djwong, linux-kernel, willy, linux-f2fs-devel, dhowells,
	linux-mm, linux-fsdevel, akpm, linux-ext4, ceph-devel

Hello:

This series was applied to jaegeuk/f2fs.git (dev)
by Jens Axboe <axboe@kernel.dk>:

On Mon, 24 Apr 2023 07:49:10 +0200 you wrote:
> buffer_check_dirty_writeback is only used by the block device aops,
> remove the export.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/buffer.c | 1 -
>  1 file changed, 1 deletion(-)

Here is the summary with links:
  - [f2fs-dev,01/17] fs: unexport buffer_check_dirty_writeback
    https://git.kernel.org/jaegeuk/f2fs/c/4bb218a65a43
  - [f2fs-dev,02/17] fs: remove the special !CONFIG_BLOCK def_blk_fops
    https://git.kernel.org/jaegeuk/f2fs/c/bda2795a630b
  - [f2fs-dev,03/17] fs: rename and move block_page_mkwrite_return
    (no matching commit)
  - [f2fs-dev,04/17] fs: remove emergency_thaw_bdev
    (no matching commit)
  - [f2fs-dev,05/17] filemap: update ki_pos in generic_perform_write
    (no matching commit)
  - [f2fs-dev,06/17] filemap: add a kiocb_write_and_wait helper
    https://git.kernel.org/jaegeuk/f2fs/c/3c435a0fe35c
  - [f2fs-dev,07/17] filemap: add a kiocb_invalidate_pages helper
    https://git.kernel.org/jaegeuk/f2fs/c/e003f74afbd2
  - [f2fs-dev,08/17] filemap: add a kiocb_invalidate_post_write helper
    (no matching commit)
  - [f2fs-dev,09/17] fs: factor out a direct_write_fallback helper
    (no matching commit)
  - [f2fs-dev,10/17] iomap: use kiocb_write_and_wait and kiocb_invalidate_pages
    (no matching commit)
  - [f2fs-dev,11/17] iomap: assign current->backing_dev_info in iomap_file_buffered_write
    (no matching commit)
  - [f2fs-dev,12/17] fuse: use direct_write_fallback
    (no matching commit)
  - [f2fs-dev,13/17] block: don't plug in blkdev_write_iter
    https://git.kernel.org/jaegeuk/f2fs/c/712c7364655f
  - [f2fs-dev,14/17] block: open code __generic_file_write_iter for blkdev writes
    (no matching commit)
  - [f2fs-dev,15/17] block: stop setting ->direct_IO
    (no matching commit)
  - [f2fs-dev,16/17] block: use iomap for writes to block devices
    (no matching commit)
  - [f2fs-dev,17/17] fs: add CONFIG_BUFFER_HEAD
    (no matching commit)

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 16/17] block: use iomap for writes to block devices
  2023-05-19 14:22   ` Hannes Reinecke
  2023-05-23 22:27     ` Dave Chinner
@ 2023-07-20 12:06     ` Christoph Hellwig
  2023-07-20 12:16       ` Hannes Reinecke
  1 sibling, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2023-07-20 12:06 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, Jens Axboe, Miklos Szeredi, Darrick J. Wong,
	Andrew Morton, David Howells, Matthew Wilcox, linux-block,
	linux-fsdevel, ceph-devel, linux-ext4, linux-f2fs-devel,
	cluster-devel, linux-xfs, linux-nfs, linux-mm, linux-kernel

On Fri, May 19, 2023 at 04:22:01PM +0200, Hannes Reinecke wrote:
> I'm hitting this during booting:
> [    5.016324]  <TASK>
> [    5.030256]  iomap_iter+0x11a/0x350
> [    5.030264]  iomap_readahead+0x1eb/0x2c0
> [    5.030272]  read_pages+0x5d/0x220
> [    5.030279]  page_cache_ra_unbounded+0x131/0x180
> [    5.030284]  filemap_get_pages+0xff/0x5a0
> [    5.030292]  filemap_read+0xca/0x320
> [    5.030296]  ? aa_file_perm+0x126/0x500
> [    5.040216]  ? touch_atime+0xc8/0x150
> [    5.040224]  blkdev_read_iter+0xb0/0x150
> [    5.040228]  vfs_read+0x226/0x2d0
> [    5.040234]  ksys_read+0xa5/0xe0
> [    5.040238]  do_syscall_64+0x5b/0x80
>
> Maybe we should consider this patch:

As willy said this should be taken care of by the i_size check.
Did you run with just this patch set or some of the large block
size experiments on top which might change the variables?

I'll repost the series today without any chances in the area, and
if you can reproduce it with just that series we need to root
cause it, so please send your kernel and VM config along for the
next report.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 16/17] block: use iomap for writes to block devices
  2023-05-24 13:33       ` Matthew Wilcox
@ 2023-07-20 12:09         ` Christoph Hellwig
  0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2023-07-20 12:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, Hannes Reinecke, Christoph Hellwig, Jens Axboe,
	Miklos Szeredi, Darrick J. Wong, Andrew Morton, David Howells,
	linux-block, linux-fsdevel, ceph-devel, linux-ext4,
	linux-f2fs-devel, cluster-devel, linux-xfs, linux-nfs, linux-mm,
	linux-kernel

On Wed, May 24, 2023 at 02:33:13PM +0100, Matthew Wilcox wrote:
> As you can see, do_page_cache_ra() does limit readahead to i_size.
> Is ractl->mapping->host the correct way to find the inode?  I always
> get confused.

As far as I can tell it is the right inode, the indirection through
file->f_mapping ensures it actually points to the backing inode.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 16/17] block: use iomap for writes to block devices
  2023-07-20 12:06     ` Christoph Hellwig
@ 2023-07-20 12:16       ` Hannes Reinecke
  0 siblings, 0 replies; 41+ messages in thread
From: Hannes Reinecke @ 2023-07-20 12:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Miklos Szeredi, Darrick J. Wong, Andrew Morton,
	David Howells, Matthew Wilcox, linux-block, linux-fsdevel,
	ceph-devel, linux-ext4, linux-f2fs-devel, cluster-devel,
	linux-xfs, linux-nfs, linux-mm, linux-kernel

On 7/20/23 14:06, Christoph Hellwig wrote:
> On Fri, May 19, 2023 at 04:22:01PM +0200, Hannes Reinecke wrote:
>> I'm hitting this during booting:
>> [    5.016324]  <TASK>
>> [    5.030256]  iomap_iter+0x11a/0x350
>> [    5.030264]  iomap_readahead+0x1eb/0x2c0
>> [    5.030272]  read_pages+0x5d/0x220
>> [    5.030279]  page_cache_ra_unbounded+0x131/0x180
>> [    5.030284]  filemap_get_pages+0xff/0x5a0
>> [    5.030292]  filemap_read+0xca/0x320
>> [    5.030296]  ? aa_file_perm+0x126/0x500
>> [    5.040216]  ? touch_atime+0xc8/0x150
>> [    5.040224]  blkdev_read_iter+0xb0/0x150
>> [    5.040228]  vfs_read+0x226/0x2d0
>> [    5.040234]  ksys_read+0xa5/0xe0
>> [    5.040238]  do_syscall_64+0x5b/0x80
>>
>> Maybe we should consider this patch:
> 
> As willy said this should be taken care of by the i_size check.
> Did you run with just this patch set or some of the large block
> size experiments on top which might change the variables?
> 
> I'll repost the series today without any chances in the area, and
> if you can reproduce it with just that series we need to root
> cause it, so please send your kernel and VM config along for the
> next report.

I _think_ it's been resolve now; I've rewritten my patchset (and the 
patches where it's based upon) several times now, so it might be a stale 
issue now.

Eagerly awaiting your patchset.

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [f2fs-dev] [PATCH 01/17] fs: unexport buffer_check_dirty_writeback
  2023-04-24  5:49 ` [PATCH 01/17] fs: unexport buffer_check_dirty_writeback Christoph Hellwig
  2023-05-19 14:17   ` Hannes Reinecke
  2023-07-06  0:18   ` [f2fs-dev] " patchwork-bot+f2fs
@ 2023-09-04 18:11   ` patchwork-bot+f2fs
  2 siblings, 0 replies; 41+ messages in thread
From: patchwork-bot+f2fs @ 2023-09-04 18:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, linux-block, linux-nfs, cluster-devel, linux-xfs, miklos,
	djwong, linux-kernel, willy, linux-f2fs-devel, dhowells,
	linux-mm, linux-fsdevel, akpm, linux-ext4, ceph-devel

Hello:

This series was applied to jaegeuk/f2fs.git (dev)
by Jens Axboe <axboe@kernel.dk>:

On Mon, 24 Apr 2023 07:49:10 +0200 you wrote:
> buffer_check_dirty_writeback is only used by the block device aops,
> remove the export.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/buffer.c | 1 -
>  1 file changed, 1 deletion(-)

Here is the summary with links:
  - [f2fs-dev,01/17] fs: unexport buffer_check_dirty_writeback
    (no matching commit)
  - [f2fs-dev,02/17] fs: remove the special !CONFIG_BLOCK def_blk_fops
    (no matching commit)
  - [f2fs-dev,03/17] fs: rename and move block_page_mkwrite_return
    (no matching commit)
  - [f2fs-dev,04/17] fs: remove emergency_thaw_bdev
    https://git.kernel.org/jaegeuk/f2fs/c/4a8b719f95c0
  - [f2fs-dev,05/17] filemap: update ki_pos in generic_perform_write
    (no matching commit)
  - [f2fs-dev,06/17] filemap: add a kiocb_write_and_wait helper
    (no matching commit)
  - [f2fs-dev,07/17] filemap: add a kiocb_invalidate_pages helper
    (no matching commit)
  - [f2fs-dev,08/17] filemap: add a kiocb_invalidate_post_write helper
    (no matching commit)
  - [f2fs-dev,09/17] fs: factor out a direct_write_fallback helper
    (no matching commit)
  - [f2fs-dev,10/17] iomap: use kiocb_write_and_wait and kiocb_invalidate_pages
    (no matching commit)
  - [f2fs-dev,11/17] iomap: assign current->backing_dev_info in iomap_file_buffered_write
    (no matching commit)
  - [f2fs-dev,12/17] fuse: use direct_write_fallback
    (no matching commit)
  - [f2fs-dev,13/17] block: don't plug in blkdev_write_iter
    (no matching commit)
  - [f2fs-dev,14/17] block: open code __generic_file_write_iter for blkdev writes
    (no matching commit)
  - [f2fs-dev,15/17] block: stop setting ->direct_IO
    (no matching commit)
  - [f2fs-dev,16/17] block: use iomap for writes to block devices
    (no matching commit)
  - [f2fs-dev,17/17] fs: add CONFIG_BUFFER_HEAD
    (no matching commit)

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2023-09-04 18:11 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-24  5:49 RFC: allow building a kernel without buffer_heads Christoph Hellwig
2023-04-24  5:49 ` [PATCH 01/17] fs: unexport buffer_check_dirty_writeback Christoph Hellwig
2023-05-19 14:17   ` Hannes Reinecke
2023-07-06  0:18   ` [f2fs-dev] " patchwork-bot+f2fs
2023-09-04 18:11   ` patchwork-bot+f2fs
2023-04-24  5:49 ` [PATCH 02/17] fs: remove the special !CONFIG_BLOCK def_blk_fops Christoph Hellwig
2023-04-24 19:22   ` Randy Dunlap
2023-04-24 19:37     ` Keith Busch
2023-04-24  5:49 ` [PATCH 03/17] fs: rename and move block_page_mkwrite_return Christoph Hellwig
2023-04-24 12:30   ` Matthew Wilcox
2023-04-24 12:42     ` Christoph Hellwig
2023-04-24  5:49 ` [PATCH 04/17] fs: remove emergency_thaw_bdev Christoph Hellwig
2023-04-24  5:49 ` [PATCH 05/17] filemap: update ki_pos in generic_perform_write Christoph Hellwig
2023-04-24 18:54   ` [Cluster-devel] " Andreas Gruenbacher
2023-04-24  5:49 ` [PATCH 06/17] filemap: add a kiocb_write_and_wait helper Christoph Hellwig
2023-04-24  5:49 ` [PATCH 07/17] filemap: add a kiocb_invalidate_pages helper Christoph Hellwig
2023-04-24  5:49 ` [PATCH 08/17] filemap: add a kiocb_invalidate_post_write helper Christoph Hellwig
2023-04-24  5:49 ` [PATCH 09/17] fs: factor out a direct_write_fallback helper Christoph Hellwig
2023-04-24  5:49 ` [PATCH 10/17] iomap: use kiocb_write_and_wait and kiocb_invalidate_pages Christoph Hellwig
2023-04-24  5:49 ` [PATCH 11/17] iomap: assign current->backing_dev_info in iomap_file_buffered_write Christoph Hellwig
2023-04-24  6:18   ` Darrick J. Wong
2023-04-24  6:22     ` Christoph Hellwig
2023-04-24  5:49 ` [PATCH 12/17] fuse: use direct_write_fallback Christoph Hellwig
2023-04-24  5:49 ` [PATCH 13/17] block: don't plug in blkdev_write_iter Christoph Hellwig
2023-04-24  5:49 ` [PATCH 14/17] block: open code __generic_file_write_iter for blkdev writes Christoph Hellwig
2023-05-24 22:23   ` Luis Chamberlain
2023-04-24  5:49 ` [PATCH 15/17] block: stop setting ->direct_IO Christoph Hellwig
2023-04-24  5:49 ` [PATCH 16/17] block: use iomap for writes to block devices Christoph Hellwig
     [not found]   ` <CGME20230426130921eucas1p279078812be7e8d50c1305e47cea53661@eucas1p2.samsung.com>
2023-04-26 13:00     ` [f2fs-dev] " Pankaj Raghav
2023-05-19 14:22   ` Hannes Reinecke
2023-05-23 22:27     ` Dave Chinner
2023-05-24 13:33       ` Matthew Wilcox
2023-07-20 12:09         ` Christoph Hellwig
2023-07-20 12:06     ` Christoph Hellwig
2023-07-20 12:16       ` Hannes Reinecke
2023-04-24  5:49 ` [PATCH 17/17] fs: add CONFIG_BUFFER_HEAD Christoph Hellwig
2023-04-29  0:11   ` Luis Chamberlain
2023-04-29  1:20     ` Matthew Wilcox
2023-05-01  3:14       ` Luis Chamberlain
2023-05-01 15:46         ` Matthew Wilcox
2023-05-01 16:00           ` Pankaj Raghav

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).