* [PATCH v5 0/9] iov_iter: Add extraction helpers
From: David Howells @ 2023-01-11 14:27 UTC
  To: Al Viro
  Cc: linux-fsdevel, linux-mm, Rohith Surabattula, linux-block,
	Matthew Wilcox, Jeff Layton, linux-cachefs, Jan Kara,
	Logan Gunthorpe, Christoph Hellwig, Jens Axboe, Steve French,
	Shyam Prasad N, John Hubbard, linux-cifs, dhowells, linux-kernel


Hi Al,

Here are some patches that clean up the use of READ/WRITE and
ITER_SOURCE/DEST, patches that provide support for extracting pages from an
iov_iter, and a patch to make the block layer's bio code use the primary
extraction function.  Could you take a look?

[!] NOTE that I've switched the functions to be exported GPL-only at
    Christoph's request[1].  They are, however, intended as a replacement
    for iov_iter_get_pages*(), which is not marked _GPL - so that
    functionality will probably become unavailable to non-GPL 3rd party
    modules in future.

The first three patches deal with ITER_SOURCE/DEST:

 (1) Switch ITER_SOURCE/DEST to an enum and add a couple of helper
     functions to query whether an iterator represents a source or a
     destination buffer.  Using an enum may allow extra consistency
     warnings from the compiler (see the sketch after this list).

 (2) Use the ITER_SOURCE/DEST values in the iov_iter core functions rather
     than READ/WRITE.

 (3) Get rid of most of the callers of iov_iter_rw(), using IOCB_WRITE
     and IOMAP_WRITE instead where available.  This leaves only two
     places looking at this value: a consistency check in cifs and a
     single place in the block layer.
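
As an illustrative sketch (not part of the patches themselves), a caller
that currently does:

	if (iov_iter_rw(iter) == WRITE)
		...

can instead query the direction with the new helpers:

	if (iov_iter_is_source(iter))		/* data flows out of the buffer */
		...
	else if (iov_iter_is_dest(iter))	/* data flows into the buffer */
		...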

The next patch adds a replacement for iov_iter_get_pages*(), including
Logan's new version:

 (4) Add a function to list, get or pin pages from an iterator as a
     future replacement for iov_iter_get_pages*().  Pointers to the pages
     are placed into an array (which will be allocated if not provided)
     and, depending on the iterator type and direction, the pages will
     have a ref taken, a pin taken or be left untouched (on the
     assumption that the caller manages their lifetime).

     The determination is:

	UBUF/IOVEC + DEST	-> pin
	UBUF/IOVEC + SOURCE	-> get
	PIPE + DEST		-> list-only
	BVEC/XARRAY		-> list-only
	Anything else		-> EFAULT

     The function also returns an indication of which of "list only, get
     or pin" it did, to aid in cleaning up (returning 0, FOLL_GET or
     FOLL_PIN as appropriate).
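
     To illustrate the intended calling convention (a minimal sketch, not
     code from the patches):

	struct page **pages = NULL;	/* let the function allocate the array */
	unsigned int max_pages = 16;	/* arbitrary limit for the sketch */
	unsigned int cleanup_mode;
	size_t offset;
	ssize_t len;

	len = iov_iter_extract_pages(iter, &pages, count, max_pages,
				     0, &offset, &cleanup_mode);
	if (len < 0)
		return len;	/* -EFAULT or -ENOMEM */

	/* ... transfer to/from pages[], starting at offset ... */

     The returned cleanup_mode then tells the caller how to dispose of
     the pages; a cleanup sketch is included with patch 4 below.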

Then there are a couple of patches that add stuff to netfslib that I want
to use there as well as in cifs:

 (5) Add a netfslib function to use (4) to extract pages from an
     ITER_IOVEC or ITER_UBUF iterator into an ITER_BVEC iterator.  This
     will get or pin the pages as appropriate (a sketch follows this
     list).

 (6) Add a netfslib function to extract pages from an iterator that's of
     type ITER_UBUF/IOVEC/BVEC/KVEC/XARRAY and add them to a scatterlist.

     The function in (4) is used for UBUF and IOVEC iterators, so those
     pages need cleaning up afterwards.  Pages from BVEC and XARRAY
     iterators are neither got nor pinned and may be rendered into
     elements that span multiple pages, for example if large folios are
     present.
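
     For illustration only (this is not the netfslib code itself), pages
     obtained from (4) could be loaded into a BVEC iterator along these
     lines:

	struct iov_iter bvec_iter;
	struct bio_vec *bv;
	size_t remain = len;
	unsigned int j;

	bv = kcalloc(npages, sizeof(*bv), GFP_KERNEL); /* NULL check elided */
	for (j = 0; j < npages; j++) {
		size_t part = min_t(size_t, PAGE_SIZE - offset, remain);

		bv[j].bv_page   = pages[j];
		bv[j].bv_offset = offset;
		bv[j].bv_len    = part;
		remain -= part;
		offset = 0;	/* only the first page starts mid-page */
	}
	iov_iter_bvec(&bvec_iter, ITER_SOURCE, bv, npages, len);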

Finally, there are a set of three patches to make the block layer's BIO
code use iov_iter_extract_pages():

 (7) Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED (opposite logic).

 (8) Make the block layer's BIO code pin pages, or leave the pages
     unaltered, rather than taking a ref on them when circumstances
     warrant, noting in the bio struct which cleanup should be performed
     so that the bio cleanup code then does the right thing (see the
     sketch after this list).

 (9) Remove an unnecessary check against 0 in bio_flagged() (it returns
     bool), thus allowing the gcc optimiser to combine multiple instances
     of the bitwise-AND on the same flags value.
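
     To sketch the cleanup this enables (illustrative only; the helper
     name here is hypothetical, not taken from the patches):

	static void bio_release_page(struct bio *bio, struct page *page)
	{
		if (bio_flagged(bio, BIO_PAGE_PINNED))
			unpin_user_page(page);	/* extraction used FOLL_PIN */
		else if (bio_flagged(bio, BIO_PAGE_REFFED))
			put_page(page);		/* extraction used FOLL_GET */
		/* otherwise the caller owns the pages; nothing to do */
	}

     And with the check against 0 removed from bio_flagged(), gcc is free
     to combine the two flag tests above into a single load and mask.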

Changes:
========
ver #5)
 - Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED and split into own patch.
 - Transcribe FOLL_GET/PIN into BIO_PAGE_REFFED/PINNED flags.
 - Add patch to allow bio_flagged() to be combined by gcc.

ver #4)
 - Drop the patch to move the FOLL_* flags to linux/mm_types.h as they're
   no longer referenced by linux/uio.h.
 - Add ITER_SOURCE/DEST cleanup patches.
 - Make iov_iter/netfslib iter extraction patches use ITER_SOURCE/DEST.
 - Allow additional gup_flags to be passed into iov_iter_extract_pages().
 - Add struct bio patch.

ver #3)
 - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
   to get/pin_user_pages_fast()[1].

ver #2)
 - Rolled the extraction cleanup mode query function into the extraction
   function, returning the indication through the argument list.
 - Fixed patch 4 (extract to scatterlist) to actually use the new
   extraction API.

I've pushed the patches here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-extract

David

Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/166697254399.61150.1256557652599252121.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166722777223.2555743.162508599131141451.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732024173.3186319.18204305072070871546.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869687556.3723671.10061142538708346995.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920902005.1461876.2786264600108839814.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997419665.9475.15014699817597102032.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk/ # v4

---
David Howells (9):
      iov_iter: Change the direction macros into an enum
      iov_iter: Use the direction in the iterator functions
      iov_iter: Use IOCB/IOMAP_WRITE if available rather than iterator direction
      iov_iter: Add a function to extract a page list from an iterator
      netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator
      netfs: Add a function to extract an iterator into a scatterlist
      bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning
      iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate
      bio: Fix bio_flagged() so that it can be combined


 block/bio.c               |  59 ++++--
 block/fops.c              |   8 +-
 fs/9p/vfs_addr.c          |   2 +-
 fs/affs/file.c            |   4 +-
 fs/ceph/file.c            |   2 +-
 fs/dax.c                  |   6 +-
 fs/direct-io.c            |  22 +--
 fs/exfat/inode.c          |   6 +-
 fs/ext2/inode.c           |   2 +-
 fs/f2fs/file.c            |  10 +-
 fs/fat/inode.c            |   4 +-
 fs/fuse/dax.c             |   2 +-
 fs/fuse/file.c            |   8 +-
 fs/hfs/inode.c            |   2 +-
 fs/hfsplus/inode.c        |   2 +-
 fs/iomap/direct-io.c      |   6 +-
 fs/jfs/inode.c            |   2 +-
 fs/netfs/Makefile         |   1 +
 fs/netfs/iterator.c       | 367 ++++++++++++++++++++++++++++++++++
 fs/nfs/direct.c           |   2 +-
 fs/nilfs2/inode.c         |   2 +-
 fs/ntfs3/inode.c          |   2 +-
 fs/ocfs2/aops.c           |   2 +-
 fs/orangefs/inode.c       |   2 +-
 fs/reiserfs/inode.c       |   2 +-
 fs/udf/inode.c            |   2 +-
 include/linux/bio.h       |   5 +-
 include/linux/blk_types.h |   3 +-
 include/linux/netfs.h     |   7 +
 include/linux/uio.h       |  59 ++++--
 lib/iov_iter.c            | 407 +++++++++++++++++++++++++++++++++++---
 mm/vmalloc.c              |   1 +
 32 files changed, 903 insertions(+), 108 deletions(-)
 create mode 100644 fs/netfs/iterator.c




* [PATCH v5 1/9] iov_iter: Change the direction macros into an enum
From: David Howells @ 2023-01-11 14:27 UTC
  To: Al Viro
  Cc: dhowells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

Change the ITER_SOURCE and ITER_DEST direction macros into an enum and
provide four new helper functions:

 iov_iter_dir() - returns the iterator direction
 iov_iter_is_dest() - returns true if it's an ITER_DEST iterator
 iov_iter_is_source() - returns true if it's an ITER_SOURCE iterator
 iov_iter_dir_valid() - checks that a direction value is valid
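
As an illustrative example (not part of the patch), the enum also lets
callers switch on the direction and get -Wswitch coverage from the
compiler if a case is missed:

	switch (iov_iter_dir(i)) {
	case ITER_SOURCE:
		/* data will be copied out of the buffer (WRITE-like) */
		break;
	case ITER_DEST:
		/* data will be copied into the buffer (READ-like) */
		break;
	}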

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/167305161763.1521586.6593798818336440133.stgit@warthog.procyon.org.uk/ # v4
---

 include/linux/uio.h |   32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 9f158238edba..4b0f4a773d90 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -29,8 +29,10 @@ enum iter_type {
 	ITER_UBUF,
 };
 
-#define ITER_SOURCE	1	// == WRITE
-#define ITER_DEST	0	// == READ
+enum iter_dir {
+	ITER_DEST	= 0,	// == READ
+	ITER_SOURCE	= 1,	// == WRITE
+} __mode(byte);
 
 struct iov_iter_state {
 	size_t iov_offset;
@@ -39,9 +41,9 @@ struct iov_iter_state {
 };
 
 struct iov_iter {
-	u8 iter_type;
+	enum iter_type iter_type __mode(byte);
 	bool nofault;
-	bool data_source;
+	enum iter_dir data_source;
 	bool user_backed;
 	union {
 		size_t iov_offset;
@@ -114,9 +116,29 @@ static inline bool iov_iter_is_xarray(const struct iov_iter *i)
 	return iov_iter_type(i) == ITER_XARRAY;
 }
 
+static inline enum iter_dir iov_iter_dir(const struct iov_iter *i)
+{
+	return i->data_source;
+}
+
+static inline bool iov_iter_is_source(const struct iov_iter *i)
+{
+	return iov_iter_dir(i) == ITER_SOURCE; /* ie. WRITE */
+}
+
+static inline bool iov_iter_is_dest(const struct iov_iter *i)
+{
+	return iov_iter_dir(i) == ITER_DEST; /* ie. READ */
+}
+
+static inline bool iov_iter_dir_valid(enum iter_dir direction)
+{
+	return direction == ITER_DEST || direction == ITER_SOURCE;
+}
+
 static inline unsigned char iov_iter_rw(const struct iov_iter *i)
 {
-	return i->data_source ? WRITE : READ;
+	return iov_iter_dir(i) == ITER_SOURCE ? WRITE : READ;
 }
 
 static inline bool user_backed_iter(const struct iov_iter *i)




* [PATCH v5 2/9] iov_iter: Use the direction in the iterator functions
From: David Howells @ 2023-01-11 14:27 UTC
  To: Al Viro
  Cc: dhowells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

Use the direction in the iterator functions rather than READ/WRITE.
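
For example (a minimal sketch, not taken from the patch), iterator
construction now takes the enum values directly:

	struct kvec kv = { .iov_base = buf, .iov_len = len };
	struct iov_iter iter;

	iov_iter_kvec(&iter, ITER_DEST, &kv, 1, len);	/* was: READ */

and passing anything other than ITER_SOURCE or ITER_DEST now trips the
WARN_ON(!iov_iter_dir_valid(direction)) check in the constructors.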

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/167305162465.1521586.18077838937455153675.stgit@warthog.procyon.org.uk/ # v4
---

 include/linux/uio.h |   22 +++++++++++-----------
 lib/iov_iter.c      |   46 +++++++++++++++++++++++-----------------------
 2 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 4b0f4a773d90..acb1ae3324ed 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -261,16 +261,16 @@ bool iov_iter_is_aligned(const struct iov_iter *i, unsigned addr_mask,
 			unsigned len_mask);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
 unsigned long iov_iter_gap_alignment(const struct iov_iter *i);
-void iov_iter_init(struct iov_iter *i, unsigned int direction, const struct iovec *iov,
+void iov_iter_init(struct iov_iter *i, enum iter_dir direction, const struct iovec *iov,
 			unsigned long nr_segs, size_t count);
-void iov_iter_kvec(struct iov_iter *i, unsigned int direction, const struct kvec *kvec,
+void iov_iter_kvec(struct iov_iter *i, enum iter_dir direction, const struct kvec *kvec,
 			unsigned long nr_segs, size_t count);
-void iov_iter_bvec(struct iov_iter *i, unsigned int direction, const struct bio_vec *bvec,
+void iov_iter_bvec(struct iov_iter *i, enum iter_dir direction, const struct bio_vec *bvec,
 			unsigned long nr_segs, size_t count);
-void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode_info *pipe,
+void iov_iter_pipe(struct iov_iter *i, enum iter_dir direction, struct pipe_inode_info *pipe,
 			size_t count);
-void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
-void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
+void iov_iter_discard(struct iov_iter *i, enum iter_dir direction, size_t count);
+void iov_iter_xarray(struct iov_iter *i, enum iter_dir direction, struct xarray *xarray,
 		     loff_t start, size_t count);
 ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 		size_t maxsize, unsigned maxpages, size_t *start,
@@ -360,19 +360,19 @@ size_t hash_and_copy_to_iter(const void *addr, size_t bytes, void *hashp,
 struct iovec *iovec_from_user(const struct iovec __user *uvector,
 		unsigned long nr_segs, unsigned long fast_segs,
 		struct iovec *fast_iov, bool compat);
-ssize_t import_iovec(int type, const struct iovec __user *uvec,
+ssize_t import_iovec(enum iter_dir direction, const struct iovec __user *uvec,
 		 unsigned nr_segs, unsigned fast_segs, struct iovec **iovp,
 		 struct iov_iter *i);
-ssize_t __import_iovec(int type, const struct iovec __user *uvec,
+ssize_t __import_iovec(enum iter_dir direction, const struct iovec __user *uvec,
 		 unsigned nr_segs, unsigned fast_segs, struct iovec **iovp,
 		 struct iov_iter *i, bool compat);
-int import_single_range(int type, void __user *buf, size_t len,
+int import_single_range(enum iter_dir direction, void __user *buf, size_t len,
 		 struct iovec *iov, struct iov_iter *i);
 
-static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
+static inline void iov_iter_ubuf(struct iov_iter *i, enum iter_dir direction,
 			void __user *buf, size_t count)
 {
-	WARN_ON(direction & ~(READ | WRITE));
+	WARN_ON(!iov_iter_dir_valid(direction));
 	*i = (struct iov_iter) {
 		.iter_type = ITER_UBUF,
 		.user_backed = true,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f9a3ff37ecd1..fec1c5513197 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -421,11 +421,11 @@ size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size)
 }
 EXPORT_SYMBOL(fault_in_iov_iter_writeable);
 
-void iov_iter_init(struct iov_iter *i, unsigned int direction,
+void iov_iter_init(struct iov_iter *i, enum iter_dir direction,
 			const struct iovec *iov, unsigned long nr_segs,
 			size_t count)
 {
-	WARN_ON(direction & ~(READ | WRITE));
+	WARN_ON(!iov_iter_dir_valid(direction));
 	*i = (struct iov_iter) {
 		.iter_type = ITER_IOVEC,
 		.nofault = false,
@@ -994,11 +994,11 @@ size_t iov_iter_single_seg_count(const struct iov_iter *i)
 }
 EXPORT_SYMBOL(iov_iter_single_seg_count);
 
-void iov_iter_kvec(struct iov_iter *i, unsigned int direction,
+void iov_iter_kvec(struct iov_iter *i, enum iter_dir direction,
 			const struct kvec *kvec, unsigned long nr_segs,
 			size_t count)
 {
-	WARN_ON(direction & ~(READ | WRITE));
+	WARN_ON(!iov_iter_dir_valid(direction));
 	*i = (struct iov_iter){
 		.iter_type = ITER_KVEC,
 		.data_source = direction,
@@ -1010,11 +1010,11 @@ void iov_iter_kvec(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_kvec);
 
-void iov_iter_bvec(struct iov_iter *i, unsigned int direction,
+void iov_iter_bvec(struct iov_iter *i, enum iter_dir direction,
 			const struct bio_vec *bvec, unsigned long nr_segs,
 			size_t count)
 {
-	WARN_ON(direction & ~(READ | WRITE));
+	WARN_ON(!iov_iter_dir_valid(direction));
 	*i = (struct iov_iter){
 		.iter_type = ITER_BVEC,
 		.data_source = direction,
@@ -1026,15 +1026,15 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_bvec);
 
-void iov_iter_pipe(struct iov_iter *i, unsigned int direction,
+void iov_iter_pipe(struct iov_iter *i, enum iter_dir direction,
 			struct pipe_inode_info *pipe,
 			size_t count)
 {
-	BUG_ON(direction != READ);
+	BUG_ON(direction != ITER_DEST);
 	WARN_ON(pipe_full(pipe->head, pipe->tail, pipe->ring_size));
 	*i = (struct iov_iter){
 		.iter_type = ITER_PIPE,
-		.data_source = false,
+		.data_source = ITER_DEST,
 		.pipe = pipe,
 		.head = pipe->head,
 		.start_head = pipe->head,
@@ -1057,10 +1057,10 @@ EXPORT_SYMBOL(iov_iter_pipe);
  * from evaporation, either by taking a ref on them or locking them by the
  * caller.
  */
-void iov_iter_xarray(struct iov_iter *i, unsigned int direction,
+void iov_iter_xarray(struct iov_iter *i, enum iter_dir direction,
 		     struct xarray *xarray, loff_t start, size_t count)
 {
-	BUG_ON(direction & ~1);
+	WARN_ON(!iov_iter_dir_valid(direction));
 	*i = (struct iov_iter) {
 		.iter_type = ITER_XARRAY,
 		.data_source = direction,
@@ -1079,14 +1079,14 @@ EXPORT_SYMBOL(iov_iter_xarray);
  * @count: The size of the I/O buffer in bytes.
  *
  * Set up an I/O iterator that just discards everything that's written to it.
- * It's only available as a READ iterator.
+ * It's only available as a destination iterator.
  */
-void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count)
+void iov_iter_discard(struct iov_iter *i, enum iter_dir direction, size_t count)
 {
-	BUG_ON(direction != READ);
+	BUG_ON(direction != ITER_DEST);
 	*i = (struct iov_iter){
 		.iter_type = ITER_DISCARD,
-		.data_source = false,
+		.data_source = ITER_DEST,
 		.count = count,
 		.iov_offset = 0
 	};
@@ -1447,7 +1447,7 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		unsigned long addr;
 		int res;
 
-		if (iov_iter_rw(i) != WRITE)
+		if (iov_iter_is_dest(i))
 			gup_flags |= FOLL_WRITE;
 		if (i->nofault)
 			gup_flags |= FOLL_NOFAULT;
@@ -1784,7 +1784,7 @@ struct iovec *iovec_from_user(const struct iovec __user *uvec,
 	return iov;
 }
 
-ssize_t __import_iovec(int type, const struct iovec __user *uvec,
+ssize_t __import_iovec(enum iter_dir direction, const struct iovec __user *uvec,
 		 unsigned nr_segs, unsigned fast_segs, struct iovec **iovp,
 		 struct iov_iter *i, bool compat)
 {
@@ -1823,7 +1823,7 @@ ssize_t __import_iovec(int type, const struct iovec __user *uvec,
 		total_len += len;
 	}
 
-	iov_iter_init(i, type, iov, nr_segs, total_len);
+	iov_iter_init(i, direction, iov, nr_segs, total_len);
 	if (iov == *iovp)
 		*iovp = NULL;
 	else
@@ -1836,7 +1836,7 @@ ssize_t __import_iovec(int type, const struct iovec __user *uvec,
  *     into the kernel, check that it is valid, and initialize a new
  *     &struct iov_iter iterator to access it.
  *
- * @type: One of %READ or %WRITE.
+ * @direction: One of %ITER_SOURCE or %ITER_DEST.
  * @uvec: Pointer to the userspace array.
  * @nr_segs: Number of elements in userspace array.
  * @fast_segs: Number of elements in @iov.
@@ -1853,16 +1853,16 @@ ssize_t __import_iovec(int type, const struct iovec __user *uvec,
  *
  * Return: Negative error code on error, bytes imported on success
  */
-ssize_t import_iovec(int type, const struct iovec __user *uvec,
+ssize_t import_iovec(enum iter_dir direction, const struct iovec __user *uvec,
 		 unsigned nr_segs, unsigned fast_segs,
 		 struct iovec **iovp, struct iov_iter *i)
 {
-	return __import_iovec(type, uvec, nr_segs, fast_segs, iovp, i,
+	return __import_iovec(direction, uvec, nr_segs, fast_segs, iovp, i,
 			      in_compat_syscall());
 }
 EXPORT_SYMBOL(import_iovec);
 
-int import_single_range(int rw, void __user *buf, size_t len,
+int import_single_range(enum iter_dir direction, void __user *buf, size_t len,
 		 struct iovec *iov, struct iov_iter *i)
 {
 	if (len > MAX_RW_COUNT)
@@ -1872,7 +1872,7 @@ int import_single_range(int rw, void __user *buf, size_t len,
 
 	iov->iov_base = buf;
 	iov->iov_len = len;
-	iov_iter_init(i, rw, iov, 1, len);
+	iov_iter_init(i, direction, iov, 1, len);
 	return 0;
 }
 EXPORT_SYMBOL(import_single_range);




* [PATCH v5 3/9] iov_iter: Use IOCB/IOMAP_WRITE if available rather than iterator direction
From: David Howells @ 2023-01-11 14:27 UTC
  To: Al Viro
  Cc: dhowells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

If a kiocb or iomap_iter is available, use the IOCB_WRITE flag or the
IOMAP_WRITE flag respectively to determine whether we're writing, rather
than using the iterator direction flag.

This allows all but three of the users of iov_iter_rw() to be removed: a
consistency check and a warning statement in cifs, and one user in the
block layer that has neither a kiocb nor an iomap_iter available.
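
The substitution is mechanical.  Where a kiocb is to hand:

	bool is_write = iocb->ki_flags & IOCB_WRITE; /* was: iov_iter_rw(iter) == WRITE */

and where an iomap_iter is to hand:

	bool is_write = iomi->flags & IOMAP_WRITE;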

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/167305163159.1521586.9460968250704377087.stgit@warthog.procyon.org.uk/ # v4
---

 block/fops.c         |    8 ++++----
 fs/9p/vfs_addr.c     |    2 +-
 fs/affs/file.c       |    4 ++--
 fs/ceph/file.c       |    2 +-
 fs/dax.c             |    6 +++---
 fs/direct-io.c       |   22 +++++++++++-----------
 fs/exfat/inode.c     |    6 +++---
 fs/ext2/inode.c      |    2 +-
 fs/f2fs/file.c       |   10 +++++-----
 fs/fat/inode.c       |    4 ++--
 fs/fuse/dax.c        |    2 +-
 fs/fuse/file.c       |    8 ++++----
 fs/hfs/inode.c       |    2 +-
 fs/hfsplus/inode.c   |    2 +-
 fs/iomap/direct-io.c |    6 +++---
 fs/jfs/inode.c       |    2 +-
 fs/nfs/direct.c      |    2 +-
 fs/nilfs2/inode.c    |    2 +-
 fs/ntfs3/inode.c     |    2 +-
 fs/ocfs2/aops.c      |    2 +-
 fs/orangefs/inode.c  |    2 +-
 fs/reiserfs/inode.c  |    2 +-
 fs/udf/inode.c       |    2 +-
 23 files changed, 51 insertions(+), 51 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 50d245e8c913..29c6de67c39e 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -73,7 +73,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 			return -ENOMEM;
 	}
 
-	if (iov_iter_rw(iter) == READ) {
+	if (!(iocb->ki_flags & IOCB_WRITE)) {
 		bio_init(&bio, bdev, vecs, nr_pages, REQ_OP_READ);
 		if (user_backed_iter(iter))
 			should_dirty = true;
@@ -88,7 +88,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 		goto out;
 	ret = bio.bi_iter.bi_size;
 
-	if (iov_iter_rw(iter) == WRITE)
+	if (iocb->ki_flags & IOCB_WRITE)
 		task_io_account_write(ret);
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
@@ -174,7 +174,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	struct blk_plug plug;
 	struct blkdev_dio *dio;
 	struct bio *bio;
-	bool is_read = (iov_iter_rw(iter) == READ), is_sync;
+	bool is_read = !(iocb->ki_flags & IOCB_WRITE), is_sync;
 	blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
 	loff_t pos = iocb->ki_pos;
 	int ret = 0;
@@ -296,7 +296,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 					unsigned int nr_pages)
 {
 	struct block_device *bdev = iocb->ki_filp->private_data;
-	bool is_read = iov_iter_rw(iter) == READ;
+	bool is_read = !(iocb->ki_flags & IOCB_WRITE);
 	blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
 	struct blkdev_dio *dio;
 	struct bio *bio;
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 97599edbc300..383d62fc3e18 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -254,7 +254,7 @@ v9fs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	ssize_t n;
 	int err = 0;
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iocb->ki_flags & IOCB_WRITE) {
 		n = p9_client_write(file->private_data, pos, iter, &err);
 		if (n) {
 			struct inode *inode = file_inode(file);
diff --git a/fs/affs/file.c b/fs/affs/file.c
index cefa222f7881..1c0e80a8aab9 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -400,7 +400,7 @@ affs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	loff_t offset = iocb->ki_pos;
 	ssize_t ret;
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iocb->ki_flags & IOCB_WRITE) {
 		loff_t size = offset + count;
 
 		if (AFFS_I(inode)->mmu_private < size)
@@ -408,7 +408,7 @@ affs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	}
 
 	ret = blockdev_direct_IO(iocb, inode, iter, affs_get_block);
-	if (ret < 0 && iov_iter_rw(iter) == WRITE)
+	if (ret < 0 && (iocb->ki_flags & IOCB_WRITE))
 		affs_write_failed(mapping, offset + count);
 	return ret;
 }
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 764598e1efd9..8bdc5b52c271 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1284,7 +1284,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 	struct timespec64 mtime = current_time(inode);
 	size_t count = iov_iter_count(iter);
 	loff_t pos = iocb->ki_pos;
-	bool write = iov_iter_rw(iter) == WRITE;
+	bool write = iocb->ki_flags & IOCB_WRITE;
 	bool should_dirty = !write && user_backed_iter(iter);
 
 	if (write && ceph_snap(file_inode(file)) != CEPH_NOSNAP)
diff --git a/fs/dax.c b/fs/dax.c
index c48a3a93ab29..7f4c3789907b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1405,7 +1405,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 	loff_t pos = iomi->pos;
 	struct dax_device *dax_dev = iomap->dax_dev;
 	loff_t end = pos + length, done = 0;
-	bool write = iov_iter_rw(iter) == WRITE;
+	bool write = iomi->flags & IOMAP_WRITE;
 	bool cow = write && iomap->flags & IOMAP_F_SHARED;
 	ssize_t ret = 0;
 	size_t xfer;
@@ -1455,7 +1455,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 
 		map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
 				DAX_ACCESS, &kaddr, NULL);
-		if (map_len == -EIO && iov_iter_rw(iter) == WRITE) {
+		if (map_len == -EIO && write) {
 			map_len = dax_direct_access(dax_dev, pgoff,
 					PHYS_PFN(size), DAX_RECOVERY_WRITE,
 					&kaddr, NULL);
@@ -1530,7 +1530,7 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (!iomi.len)
 		return 0;
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iocb->ki_flags & IOCB_WRITE) {
 		lockdep_assert_held_write(&iomi.inode->i_rwsem);
 		iomi.flags |= IOMAP_WRITE;
 	} else {
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 03d381377ae1..e2d5c757a27a 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1143,7 +1143,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 */
 
 	/* watch out for a 0 len io from a tricksy fs */
-	if (iov_iter_rw(iter) == READ && !count)
+	if (!(iocb->ki_flags & IOCB_WRITE) && !count)
 		return 0;
 
 	dio = kmem_cache_alloc(dio_cache, GFP_KERNEL);
@@ -1157,14 +1157,14 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	memset(dio, 0, offsetof(struct dio, pages));
 
 	dio->flags = flags;
-	if (dio->flags & DIO_LOCKING && iov_iter_rw(iter) == READ) {
+	if (dio->flags & DIO_LOCKING && !(iocb->ki_flags & IOCB_WRITE)) {
 		/* will be released by direct_io_worker */
 		inode_lock(inode);
 	}
 
 	/* Once we sampled i_size check for reads beyond EOF */
 	dio->i_size = i_size_read(inode);
-	if (iov_iter_rw(iter) == READ && offset >= dio->i_size) {
+	if (!(iocb->ki_flags & IOCB_WRITE) && offset >= dio->i_size) {
 		retval = 0;
 		goto fail_dio;
 	}
@@ -1177,7 +1177,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 			goto fail_dio;
 	}
 
-	if (dio->flags & DIO_LOCKING && iov_iter_rw(iter) == READ) {
+	if (dio->flags & DIO_LOCKING && !(iocb->ki_flags & IOCB_WRITE)) {
 		struct address_space *mapping = iocb->ki_filp->f_mapping;
 
 		retval = filemap_write_and_wait_range(mapping, offset, end - 1);
@@ -1193,13 +1193,13 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 */
 	if (is_sync_kiocb(iocb))
 		dio->is_async = false;
-	else if (iov_iter_rw(iter) == WRITE && end > i_size_read(inode))
+	else if ((iocb->ki_flags & IOCB_WRITE) && end > i_size_read(inode))
 		dio->is_async = false;
 	else
 		dio->is_async = true;
 
 	dio->inode = inode;
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iocb->ki_flags & IOCB_WRITE) {
 		dio->opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			dio->opf |= REQ_NOWAIT;
@@ -1211,7 +1211,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
 	 * so that we can call ->fsync.
 	 */
-	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
+	if (dio->is_async && (iocb->ki_flags & IOCB_WRITE)) {
 		retval = 0;
 		if (iocb_is_dsync(iocb))
 			retval = dio_set_defer_completion(dio);
@@ -1248,7 +1248,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	spin_lock_init(&dio->bio_lock);
 	dio->refcount = 1;
 
-	dio->should_dirty = user_backed_iter(iter) && iov_iter_rw(iter) == READ;
+	dio->should_dirty = user_backed_iter(iter) && !(iocb->ki_flags & IOCB_WRITE);
 	sdio.iter = iter;
 	sdio.final_block_in_request = end >> blkbits;
 
@@ -1305,7 +1305,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 * we can let i_mutex go now that its achieved its purpose
 	 * of protecting us from looking up uninitialized blocks.
 	 */
-	if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
+	if (!(iocb->ki_flags & IOCB_WRITE) && (dio->flags & DIO_LOCKING))
 		inode_unlock(dio->inode);
 
 	/*
@@ -1317,7 +1317,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 */
 	BUG_ON(retval == -EIOCBQUEUED);
 	if (dio->is_async && retval == 0 && dio->result &&
-	    (iov_iter_rw(iter) == READ || dio->result == count))
+	    (!(iocb->ki_flags & IOCB_WRITE) || dio->result == count))
 		retval = -EIOCBQUEUED;
 	else
 		dio_await_completion(dio);
@@ -1330,7 +1330,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	return retval;
 
 fail_dio:
-	if (dio->flags & DIO_LOCKING && iov_iter_rw(iter) == READ)
+	if (dio->flags & DIO_LOCKING && !(iocb->ki_flags & IOCB_WRITE))
 		inode_unlock(inode);
 
 	kmem_cache_free(dio_cache, dio);
diff --git a/fs/exfat/inode.c b/fs/exfat/inode.c
index 5b644cb057fa..26c2cff71878 100644
--- a/fs/exfat/inode.c
+++ b/fs/exfat/inode.c
@@ -412,10 +412,10 @@ static ssize_t exfat_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
 	struct inode *inode = mapping->host;
 	loff_t size = iocb->ki_pos + iov_iter_count(iter);
-	int rw = iov_iter_rw(iter);
+	bool writing = iocb->ki_flags & IOCB_WRITE;
 	ssize_t ret;
 
-	if (rw == WRITE) {
+	if (writing) {
 		/*
 		 * FIXME: blockdev_direct_IO() doesn't use ->write_begin(),
 		 * so we need to update the ->i_size_aligned to block boundary.
@@ -434,7 +434,7 @@ static ssize_t exfat_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	 * condition of exfat_get_block() and ->truncate().
 	 */
 	ret = blockdev_direct_IO(iocb, inode, iter, exfat_get_block);
-	if (ret < 0 && (rw & WRITE))
+	if (ret < 0 && writing)
 		exfat_write_failed(mapping, size);
 	return ret;
 }
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 69aed9e2359e..9ed588d70722 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -919,7 +919,7 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	ssize_t ret;
 
 	ret = blockdev_direct_IO(iocb, inode, iter, ext2_get_block);
-	if (ret < 0 && iov_iter_rw(iter) == WRITE)
+	if (ret < 0 && (iocb->ki_flags & IOCB_WRITE))
 		ext2_write_failed(mapping, offset + count);
 	return ret;
 }
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index ecbc8c135b49..7a7cfa39b327 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -809,7 +809,7 @@ int f2fs_truncate(struct inode *inode)
 	return 0;
 }
 
-static bool f2fs_force_buffered_io(struct inode *inode, int rw)
+static bool f2fs_force_buffered_io(struct inode *inode, bool writing)
 {
 	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
 
@@ -827,9 +827,9 @@ static bool f2fs_force_buffered_io(struct inode *inode, int rw)
 	 * for blkzoned device, fallback direct IO to buffered IO, so
 	 * all IOs can be serialized by log-structured write.
 	 */
-	if (f2fs_sb_has_blkzoned(sbi) && (rw == WRITE))
+	if (f2fs_sb_has_blkzoned(sbi) && writing)
 		return true;
-	if (f2fs_lfs_mode(sbi) && rw == WRITE && F2FS_IO_ALIGNED(sbi))
+	if (f2fs_lfs_mode(sbi) && writing && F2FS_IO_ALIGNED(sbi))
 		return true;
 	if (is_sbi_flag_set(sbi, SBI_CP_DISABLED))
 		return true;
@@ -865,7 +865,7 @@ int f2fs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 		unsigned int bsize = i_blocksize(inode);
 
 		stat->result_mask |= STATX_DIOALIGN;
-		if (!f2fs_force_buffered_io(inode, WRITE)) {
+		if (!f2fs_force_buffered_io(inode, true)) {
 			stat->dio_mem_align = bsize;
 			stat->dio_offset_align = bsize;
 		}
@@ -4254,7 +4254,7 @@ static bool f2fs_should_use_dio(struct inode *inode, struct kiocb *iocb,
 	if (!(iocb->ki_flags & IOCB_DIRECT))
 		return false;
 
-	if (f2fs_force_buffered_io(inode, iov_iter_rw(iter)))
+	if (f2fs_force_buffered_io(inode, iocb->ki_flags & IOCB_WRITE))
 		return false;
 
 	/*
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index d99b8549ec8f..d7ffc30ce0e5 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -261,7 +261,7 @@ static ssize_t fat_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	loff_t offset = iocb->ki_pos;
 	ssize_t ret;
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iocb->ki_flags & IOCB_WRITE) {
 		/*
 		 * FIXME: blockdev_direct_IO() doesn't use ->write_begin(),
 		 * so we need to update the ->mmu_private to block boundary.
@@ -281,7 +281,7 @@ static ssize_t fat_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	 * condition of fat_get_block() and ->truncate().
 	 */
 	ret = blockdev_direct_IO(iocb, inode, iter, fat_get_block);
-	if (ret < 0 && iov_iter_rw(iter) == WRITE)
+	if (ret < 0 && (iocb->ki_flags & IOCB_WRITE))
 		fat_write_failed(mapping, offset + count);
 
 	return ret;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index e23e802a8013..fddd0b8de27e 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -720,7 +720,7 @@ static bool file_extending_write(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 
-	return (iov_iter_rw(from) == WRITE &&
+	return ((iocb->ki_flags & IOCB_WRITE) &&
 		((iocb->ki_pos) >= i_size_read(inode) ||
 		  (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode))));
 }
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 875314ee6f59..9575c4ca0667 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2897,7 +2897,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	inode = file->f_mapping->host;
 	i_size = i_size_read(inode);
 
-	if ((iov_iter_rw(iter) == READ) && (offset >= i_size))
+	if (!(iocb->ki_flags & IOCB_WRITE) && (offset >= i_size))
 		return 0;
 
 	io = kmalloc(sizeof(struct fuse_io_priv), GFP_KERNEL);
@@ -2909,7 +2909,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	io->bytes = -1;
 	io->size = 0;
 	io->offset = offset;
-	io->write = (iov_iter_rw(iter) == WRITE);
+	io->write = (iocb->ki_flags & IOCB_WRITE);
 	io->err = 0;
 	/*
 	 * By default, we want to optimize all I/Os with async request
@@ -2942,7 +2942,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 		io->done = &wait;
 	}
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if ((iocb->ki_flags & IOCB_WRITE)) {
 		ret = fuse_direct_io(io, iter, &pos, FUSE_DIO_WRITE);
 		fuse_invalidate_attr_mask(inode, FUSE_STATX_MODSIZE);
 	} else {
@@ -2965,7 +2965,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 	kref_put(&io->refcnt, fuse_io_release);
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if ((iocb->ki_flags & IOCB_WRITE)) {
 		fuse_write_update_attr(inode, pos, ret);
 		/* For extending writes we already hold exclusive lock */
 		if (ret < 0 && offset + count > i_size)
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index 9c329a365e75..638c87afd96f 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -141,7 +141,7 @@ static ssize_t hfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	 * In case of error extending write may have instantiated a few
 	 * blocks outside i_size. Trim these off again.
 	 */
-	if (unlikely(iov_iter_rw(iter) == WRITE && ret < 0)) {
+	if (unlikely((iocb->ki_flags & IOCB_WRITE) && ret < 0)) {
 		loff_t isize = i_size_read(inode);
 		loff_t end = iocb->ki_pos + count;
 
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index 840577a0c1e7..843e6f1ced25 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -138,7 +138,7 @@ static ssize_t hfsplus_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	 * In case of error extending write may have instantiated a few
 	 * blocks outside i_size. Trim these off again.
 	 */
-	if (unlikely(iov_iter_rw(iter) == WRITE && ret < 0)) {
+	if (unlikely((iocb->ki_flags & IOCB_WRITE) && ret < 0)) {
 		loff_t isize = i_size_read(inode);
 		loff_t end = iocb->ki_pos + count;
 
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9804714b1751..d045b0c54d04 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -519,7 +519,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	dio->submit.waiter = current;
 	dio->submit.poll_bio = NULL;
 
-	if (iov_iter_rw(iter) == READ) {
+	if (!(iocb->ki_flags & IOCB_WRITE)) {
 		if (iomi.pos >= dio->i_size)
 			goto out_free_dio;
 
@@ -573,7 +573,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (ret)
 		goto out_free_dio;
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iomi.flags & IOMAP_WRITE) {
 		/*
 		 * Try to invalidate cache pages for the range we are writing.
 		 * If this invalidation fails, let the caller fall back to
@@ -613,7 +613,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	 * Revert iter to a state corresponding to that as some callers (such
 	 * as the splice code) rely on it.
 	 */
-	if (iov_iter_rw(iter) == READ && iomi.pos >= dio->i_size)
+	if (!(iomi.flags & IOMAP_WRITE) && iomi.pos >= dio->i_size)
 		iov_iter_revert(iter, iomi.pos - dio->i_size);
 
 	if (ret == -EFAULT && dio->size && (dio_flags & IOMAP_DIO_PARTIAL)) {
diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c
index 8ac10e396050..f403d2f2bfe6 100644
--- a/fs/jfs/inode.c
+++ b/fs/jfs/inode.c
@@ -334,7 +334,7 @@ static ssize_t jfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	 * In case of error extending write may have instantiated a few
 	 * blocks outside i_size. Trim these off again.
 	 */
-	if (unlikely(iov_iter_rw(iter) == WRITE && ret < 0)) {
+	if (unlikely((iocb->ki_flags & IOCB_WRITE) && ret < 0)) {
 		loff_t isize = i_size_read(inode);
 		loff_t end = iocb->ki_pos + count;
 
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 1707f46b1335..e94ce42f93a8 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -133,7 +133,7 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
 
 	VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);
 
-	if (iov_iter_rw(iter) == READ)
+	if (!(iocb->ki_flags & IOCB_WRITE))
 		ret = nfs_file_direct_read(iocb, iter, true);
 	else
 		ret = nfs_file_direct_write(iocb, iter, true);
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 232dd7b6cca1..59df5707d30f 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -289,7 +289,7 @@ nilfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 
-	if (iov_iter_rw(iter) == WRITE)
+	if (iocb->ki_flags & IOCB_WRITE)
 		return 0;
 
 	/* Needs synchronization with the cleaner */
diff --git a/fs/ntfs3/inode.c b/fs/ntfs3/inode.c
index 20b953871574..f0881d0522f8 100644
--- a/fs/ntfs3/inode.c
+++ b/fs/ntfs3/inode.c
@@ -761,7 +761,7 @@ static ssize_t ntfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	struct ntfs_inode *ni = ntfs_i(inode);
 	loff_t vbo = iocb->ki_pos;
 	loff_t end;
-	int wr = iov_iter_rw(iter) & WRITE;
+	bool wr = iocb->ki_flags & IOCB_WRITE;
 	size_t iter_count = iov_iter_count(iter);
 	loff_t valid;
 	ssize_t ret;
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 1d65f6ef00ca..3f41c6b403c2 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -2441,7 +2441,7 @@ static ssize_t ocfs2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	    !ocfs2_supports_append_dio(osb))
 		return 0;
 
-	if (iov_iter_rw(iter) == READ)
+	if (!(iocb->ki_flags & IOCB_WRITE))
 		get_block = ocfs2_lock_get_block;
 	else
 		get_block = ocfs2_dio_wr_get_block;
diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index 4df560894386..fbecca379e91 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -521,7 +521,7 @@ static ssize_t orangefs_direct_IO(struct kiocb *iocb,
 	 */
 	struct file *file = iocb->ki_filp;
 	loff_t pos = iocb->ki_pos;
-	enum ORANGEFS_io_type type = iov_iter_rw(iter) == WRITE ?
+	enum ORANGEFS_io_type type = (iocb->ki_flags & IOCB_WRITE) ?
             ORANGEFS_IO_WRITE : ORANGEFS_IO_READ;
 	loff_t *offset = &pos;
 	struct inode *inode = file->f_mapping->host;
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index c7d1fa526dea..1fc94fd5c371 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -3249,7 +3249,7 @@ static ssize_t reiserfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	 * In case of error extending write may have instantiated a few
 	 * blocks outside i_size. Trim these off again.
 	 */
-	if (unlikely(iov_iter_rw(iter) == WRITE && ret < 0)) {
+	if (unlikely((iocb->ki_flags & IOCB_WRITE) && ret < 0)) {
 		loff_t isize = i_size_read(inode);
 		loff_t end = iocb->ki_pos + count;
 
diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index 1d7c2a812fc1..6d2ce0e512f4 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -219,7 +219,7 @@ static ssize_t udf_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	ssize_t ret;
 
 	ret = blockdev_direct_IO(iocb, inode, iter, udf_get_block);
-	if (unlikely(ret < 0 && iov_iter_rw(iter) == WRITE))
+	if (unlikely(ret < 0 && (iocb->ki_flags & IOCB_WRITE)))
 		udf_write_failed(mapping, iocb->ki_pos + count);
 	return ret;
 }




* [PATCH v5 4/9] iov_iter: Add a function to extract a page list from an iterator
From: David Howells @ 2023-01-11 14:28 UTC
  To: Al Viro
  Cc: Christoph Hellwig, John Hubbard, Matthew Wilcox, linux-fsdevel,
	linux-mm, dhowells, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-block, linux-kernel

Add a function, iov_iter_extract_pages(), to extract a list of pages from
an iterator.  The pages may be returned with a reference added, a pin
added or neither, depending on the type of iterator and the direction of
transfer.

The function also indicates the mode of retention that was employed for an
iterator - and therefore how the caller should dispose of the pages later.

There are three cases:

 (1) Transfer *into* an ITER_IOVEC or ITER_UBUF iterator.

     Extracted pages will have pins obtained on them (but not references)
     so that fork() doesn't CoW the pages incorrectly whilst the I/O is in
     progress.

     The indicated mode of retention will be FOLL_PIN for this case.  The
     caller should use something like unpin_user_page() to dispose of the
     page.

 (2) Transfer is *out of* an ITER_IOVEC or ITER_UBUF iterator.

     Extracted pages will have references obtained on them, but not pins.

     The indicated mode of retention will be FOLL_GET.  The caller should
     use something like put_page() for page disposal.

 (3) Any other sort of iterator.

     No refs or pins are obtained on the page, the assumption is made that
     the caller will manage page retention.

     The indicated mode of retention will be 0.  The pages don't need
     additional disposal.
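
Putting the three cases together, cleanup might look like this (an
illustrative sketch, not code from the patch; it assumes @len bytes
starting at @offset were extracted into @pages):

	size_t npages = DIV_ROUND_UP(offset + len, PAGE_SIZE);

	if (cleanup_mode & FOLL_PIN)
		unpin_user_pages(pages, npages);	/* case (1) */
	else if (cleanup_mode & FOLL_GET)
		release_pages(pages, npages);		/* case (2) */
	/* case (3): cleanup_mode == 0; nothing to do */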

Changes:
========
ver #4)
 - Use ITER_SOURCE/DEST instead of WRITE/READ.
 - Allow additional FOLL_* flags, such as FOLL_PCI_P2PDMA to be passed in.

ver #3)
 - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
   to get/pin_user_pages_fast()[1].

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@lst.de>
cc: John Hubbard <jhubbard@nvidia.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/166722777971.2555743.12953624861046741424.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732025748.3186319.8314014902727092626.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869689451.3723671.18242195992447653092.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920903885.1461876.692029808682876184.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997421646.9475.14837976344157464997.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305163883.1521586.10777155475378874823.stgit@warthog.procyon.org.uk/ # v4
---

 include/linux/uio.h |    5 +
 lib/iov_iter.c      |  361 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 366 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index acb1ae3324ed..9a36b4cddb28 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -382,4 +382,9 @@ static inline void iov_iter_ubuf(struct iov_iter *i, enum iter_dir direction,
 	};
 }
 
+ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
+			       size_t maxsize, unsigned int maxpages,
+			       unsigned int gup_flags,
+			       size_t *offset0, unsigned int *cleanup_mode);
+
 #endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index fec1c5513197..dc6db5ad108b 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1914,3 +1914,364 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
 		i->iov -= state->nr_segs - i->nr_segs;
 	i->nr_segs = state->nr_segs;
 }
+
+/*
+ * Extract a list of contiguous pages from an ITER_PIPE iterator.  This does
+ * not get references of its own on the pages, nor does it get a pin on them.
+ * If there's a partial page, it adds that first and will then allocate and add
+ * pages into the pipe to make up the buffer space to the amount required.
+ *
+ * The caller must hold the pipe locked and only transferring into a pipe is
+ * supported.
+ */
+static ssize_t iov_iter_extract_pipe_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int gup_flags,
+					   size_t *offset0,
+					   unsigned int *cleanup_mode)
+{
+	unsigned int nr, offset, chunk, j;
+	struct page **p;
+	size_t left;
+
+	if (!sanity(i))
+		return -EFAULT;
+
+	offset = pipe_npages(i, &nr);
+	if (!nr)
+		return -EFAULT;
+	*offset0 = offset;
+
+	maxpages = min_t(size_t, nr, maxpages);
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	left = maxsize;
+	for (j = 0; j < maxpages; j++) {
+		struct page *page = append_pipe(i, left, &offset);
+		if (!page)
+			break;
+		chunk = min_t(size_t, left, PAGE_SIZE - offset);
+		left -= chunk;
+		*p++ = page;
+	}
+	if (!j)
+		return -EFAULT;
+	*cleanup_mode = 0;
+	return maxsize - left;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_XARRAY iterator.  This does not
+ * get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_xarray_pages(struct iov_iter *i,
+					     struct page ***pages, size_t maxsize,
+					     unsigned int maxpages,
+					     unsigned int gup_flags,
+					     size_t *offset0,
+					     unsigned int *cleanup_mode)
+{
+	struct page *page, **p;
+	unsigned int nr = 0, offset;
+	loff_t pos = i->xarray_start + i->iov_offset;
+	pgoff_t index = pos >> PAGE_SHIFT;
+	XA_STATE(xas, i->xarray, index);
+
+	offset = pos & ~PAGE_MASK;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	rcu_read_lock();
+	for (page = xas_load(&xas); page; page = xas_next(&xas)) {
+		if (xas_retry(&xas, page))
+			continue;
+
+		/* Has the page moved or been split? */
+		if (unlikely(page != xas_reload(&xas))) {
+			xas_reset(&xas);
+			continue;
+		}
+
+		p[nr++] = find_subpage(page, xas.xa_index);
+		if (nr == maxpages)
+			break;
+	}
+	rcu_read_unlock();
+
+	maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
+	i->iov_offset += maxsize;
+	i->count -= maxsize;
+	*cleanup_mode = 0;
+	return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_BVEC iterator.  This does
+ * not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_bvec_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int gup_flags,
+					   size_t *offset0,
+					   unsigned int *cleanup_mode)
+{
+	struct page **p, *page;
+	size_t skip = i->iov_offset, offset;
+	int k;
+
+	maxsize = min(maxsize, i->bvec->bv_len - skip);
+	skip += i->bvec->bv_offset;
+	page = i->bvec->bv_page + skip / PAGE_SIZE;
+	offset = skip % PAGE_SIZE;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+	for (k = 0; k < maxpages; k++)
+		p[k] = page + k;
+
+	maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+	i->count -= maxsize;
+	i->iov_offset += maxsize;
+	if (i->iov_offset == i->bvec->bv_len) {
+		i->iov_offset = 0;
+		i->bvec++;
+		i->nr_segs--;
+	}
+	*cleanup_mode = 0;
+	return maxsize;
+}
+
+/*
+ * Get the first segment from an ITER_UBUF or ITER_IOVEC iterator.  The
+ * iterator must not be empty.
+ */
+static unsigned long iov_iter_extract_first_user_segment(const struct iov_iter *i,
+							 size_t *size)
+{
+	size_t skip;
+	long k;
+
+	if (iter_is_ubuf(i))
+		return (unsigned long)i->ubuf + i->iov_offset;
+
+	for (k = 0, skip = i->iov_offset; k < i->nr_segs; k++, skip = 0) {
+		size_t len = i->iov[k].iov_len - skip;
+
+		if (unlikely(!len))
+			continue;
+		if (*size > len)
+			*size = len;
+		return (unsigned long)i->iov[k].iov_base + skip;
+	}
+	BUG(); // if it had been empty, we wouldn't get called
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get references
+ * on them.  This should only be used iff the iterator is user-backed
+ * (IOBUF/UBUF) and data is being transferred out of the buffer described by
+ * the iterator (ie. this is the source).
+ *
+ * The pages are returned with incremented refcounts that the caller must undo
+ * once the transfer is complete, but no additional pins are obtained.
+ *
+ * This is only safe to be used where background IO/DMA is not going to be
+ * modifying the buffer, and so won't cause a problem with CoW on fork.
+ */
+static ssize_t iov_iter_extract_user_pages_and_get(struct iov_iter *i,
+						   struct page ***pages,
+						   size_t maxsize,
+						   unsigned int maxpages,
+						   unsigned int gup_flags,
+						   size_t *offset0,
+						   unsigned int *cleanup_mode)
+{
+	unsigned long addr;
+	size_t offset;
+	int res;
+
+	if (WARN_ON_ONCE(!iov_iter_is_source(i)))
+		return -EFAULT;
+
+	gup_flags |= FOLL_GET;
+	if (i->nofault)
+		gup_flags |= FOLL_NOFAULT;
+
+	addr = iov_iter_extract_first_user_segment(i, &maxsize);
+	*offset0 = offset = addr % PAGE_SIZE;
+	addr &= PAGE_MASK;
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	res = get_user_pages_fast(addr, maxpages, gup_flags, *pages);
+	if (unlikely(res <= 0))
+		return res;
+	maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	*cleanup_mode = FOLL_GET;
+	return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get a pin on
+ * each of them.  This should only be used iff the iterator is user-backed
+ * (IOBUF/UBUF) and data is being transferred into the buffer described by the
+ * iterator (ie. this is the destination).
+ *
+ * It does not get refs on the pages, but the pages must be unpinned by the
+ * caller once the transfer is complete.
+ *
+ * This is safe to be used where background IO/DMA *is* going to be modifying
+ * the buffer; using a pin rather than a ref makes sure that CoW happens
+ * correctly in the parent during fork.
+ */
+static ssize_t iov_iter_extract_user_pages_and_pin(struct iov_iter *i,
+						   struct page ***pages,
+						   size_t maxsize,
+						   unsigned int maxpages,
+						   unsigned int gup_flags,
+						   size_t *offset0,
+						   unsigned int *cleanup_mode)
+{
+	unsigned long addr;
+	size_t offset;
+	int res;
+
+	if (WARN_ON_ONCE(!iov_iter_is_dest(i)))
+		return -EFAULT;
+
+	gup_flags |= FOLL_PIN | FOLL_WRITE;
+	if (i->nofault)
+		gup_flags |= FOLL_NOFAULT;
+
+	addr = first_iovec_segment(i, &maxsize);
+	*offset0 = offset = addr % PAGE_SIZE;
+	addr &= PAGE_MASK;
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);
+	if (unlikely(res <= 0))
+		return res;
+	maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	*cleanup_mode = FOLL_PIN;
+	return maxsize;
+}
+
+static ssize_t iov_iter_extract_user_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int gup_flags,
+					   size_t *offset0,
+					   unsigned int *cleanup_mode)
+{
+	if (i->data_source)
+		return iov_iter_extract_user_pages_and_get(i, pages, maxsize,
+							   maxpages, gup_flags,
+							   offset0, cleanup_mode);
+	else
+		return iov_iter_extract_user_pages_and_pin(i, pages, maxsize,
+							   maxpages, gup_flags,
+							   offset0, cleanup_mode);
+}
+
+/**
+ * iov_iter_extract_pages - Extract a list of contiguous pages from an iterator
+ * @i: The iterator to extract from
+ * @pages: Where to return the list of pages
+ * @maxsize: The maximum amount of iterator to extract
+ * @maxpages: The maximum size of the list of pages
+ * @gup_flags: Additional flags when getting pages from a user-backed iterator
+ * @offset0: Where to return the starting offset into (*@pages)[0]
+ * @cleanup_mode: Where to return the cleanup mode
+ *
+ * Extract a list of contiguous pages from the current point of the iterator,
+ * advancing the iterator.  The maximum number of pages and the maximum amount
+ * of page contents can be set.
+ *
+ * If *@pages is NULL, a page list will be allocated to the required size and
+ * *@pages will be set to its base.  If *@pages is not NULL, it will be assumed
+ * that the caller allocated a page list at least @maxpages in size and this
+ * will be filled in.
+ *
+ * Extra refs or pins on the pages may be obtained as follows:
+ *
+ *  (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF) and data is to be
+ *      transferred /OUT OF/ the described buffer, refs will be taken on the
+ *      pages, but pins will not be added.  This can be used for DMA from a
+ *      page; it cannot be used for DMA to a page, as it may cause page-COW
+ *      problems in fork.  *@cleanup_mode will be set to FOLL_GET.
+ *
+ *  (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF) and data is to be
+ *      transferred /INTO/ the described buffer, pins will be added to the
+ *      pages, but refs will not be taken.  This must be used for DMA to a
+ *      page.  *@cleanup_mode will be set to FOLL_PIN.
+ *
+ *  (*) If the iterator is ITER_PIPE, this must describe a destination for the
+ *      data.  Additional pages may be allocated and added to the pipe (which
+ *      will hold the refs), but neither refs nor pins will be obtained for the
+ *      caller.  The caller must hold the pipe lock.  *@cleanup_mode will be
+ *      set to 0.
+ *
+ *  (*) If the iterator is ITER_BVEC or ITER_XARRAY, the pages are merely
+ *      listed; no extra refs or pins are obtained.  *@cleanup_mode will be set
+ *      to 0.
+ *
+ * Note also:
+ *
+ *  (*) Use with ITER_KVEC is not supported as that may refer to memory that
+ *      doesn't have associated page structs.
+ *
+ *  (*) Use with ITER_DISCARD is not supported as that has no content.
+ *
+ * On success, the function sets *@pages to the new pagelist, if allocated,
+ * sets *@offset0 to the offset into the first page, sets *@cleanup_mode to
+ * the cleanup required and returns the amount of buffer space represented by
+ * the page list.
+ *
+ * It may also return -ENOMEM or -EFAULT.
+ */
+ssize_t iov_iter_extract_pages(struct iov_iter *i,
+			       struct page ***pages,
+			       size_t maxsize,
+			       unsigned int maxpages,
+			       unsigned int gup_flags,
+			       size_t *offset0,
+			       unsigned int *cleanup_mode)
+{
+	maxsize = min_t(size_t, min_t(size_t, maxsize, i->count), MAX_RW_COUNT);
+	if (!maxsize)
+		return 0;
+
+	if (likely(user_backed_iter(i)))
+		return iov_iter_extract_user_pages(i, pages, maxsize,
+						   maxpages, gup_flags,
+						   offset0, cleanup_mode);
+	if (iov_iter_is_bvec(i))
+		return iov_iter_extract_bvec_pages(i, pages, maxsize,
+						   maxpages, gup_flags,
+						   offset0, cleanup_mode);
+	if (iov_iter_is_pipe(i))
+		return iov_iter_extract_pipe_pages(i, pages, maxsize,
+						   maxpages, gup_flags,
+						   offset0, cleanup_mode);
+	if (iov_iter_is_xarray(i))
+		return iov_iter_extract_xarray_pages(i, pages, maxsize,
+						     maxpages, gup_flags,
+						     offset0, cleanup_mode);
+	return -EFAULT;
+}
+EXPORT_SYMBOL_GPL(iov_iter_extract_pages);
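
As an illustration, a caller might use this roughly as follows (a minimal
sketch, not part of the patch; error handling is abbreviated and iter,
maxsize and maxpages stand in for the caller's own values):

	struct page **pages = NULL;	/* NULL => allocate the array for us */
	unsigned int cleanup_mode, npages, i;
	size_t offset;
	ssize_t len;

	len = iov_iter_extract_pages(iter, &pages, maxsize, maxpages, 0,
				     &offset, &cleanup_mode);
	if (len <= 0)
		return len;
	npages = DIV_ROUND_UP(offset + len, PAGE_SIZE);

	/* ... transfer to/from the pages ... */

	/* Clean up as indicated: FOLL_PIN => unpin, FOLL_GET => put,
	 * 0 => nothing (the iterator retains the pages).
	 */
	for (i = 0; i < npages; i++) {
		if (cleanup_mode & FOLL_PIN)
			unpin_user_page(pages[i]);
		else if (cleanup_mode & FOLL_GET)
			put_page(pages[i]);
	}
	kvfree(pages);	/* The array is kvmalloc'd if we didn't supply one */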



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v5 5/9] netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator
  2023-01-11 14:27 [PATCH v5 0/9] iov_iter: Add extraction helpers David Howells
                   ` (3 preceding siblings ...)
  2023-01-11 14:28 ` [PATCH v5 4/9] iov_iter: Add a function to extract a page list from an iterator David Howells
@ 2023-01-11 14:28 ` David Howells
  2023-01-11 14:28 ` [PATCH v5 6/9] netfs: Add a function to extract an iterator into a scatterlist David Howells
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: David Howells @ 2023-01-11 14:28 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, Steve French, Shyam Prasad N, Rohith Surabattula,
	linux-cachefs, linux-cifs, linux-fsdevel, dhowells,
	Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

Add a function to extract the pages from a user-space supplied iterator
(UBUF- or IOVEC-type) into a BVEC-type iterator, retaining the pages by
getting a ref on them (ITER_SOURCE, ie. WRITE) or pinning them (ITER_DEST,
ie. READ) as we go.

This is useful in three situations:

 (1) A userspace thread may have a sibling that unmaps or remaps the
     process's VM during the operation, changing the assignment of the
     pages and potentially causing an error.  Retaining the pages keeps
     some pages around, even if this occurs; further, we find out at the
     point of extraction if EFAULT is going to be incurred.

 (2) Pages might get swapped out/discarded if not retained, so we want to
     retain them to avoid the reload causing a deadlock due to a DIO
     from/to an mmapped region on the same file.

 (3) The iterator may get passed to sendmsg() by the filesystem.  If a
     fault occurs, we may get a short write to a TCP stream that's then
     tricky to recover from.

We don't deal with other types of iterator here, leaving it to other
mechanisms to retain the pages (eg. PG_locked, PG_writeback and the pipe
lock).
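
As an illustration, a filesystem might use this roughly like so (a minimal
sketch, not taken from any current caller):

	struct iov_iter bvec_iter;
	unsigned int cleanup_mode;
	ssize_t nbv;

	/* Decant the user buffer into a bvec iterator we can hold on to. */
	nbv = netfs_extract_user_iter(iter, iov_iter_count(iter),
				      &bvec_iter, &cleanup_mode);
	if (nbv < 0)
		return nbv;

	/* bvec_iter now outlives the caller's iterator; once the I/O is
	 * done, unpin or put the pages according to cleanup_mode and free
	 * the bvec array with kvfree().
	 */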

Changes:
========
ver #3)
 - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
   to get/pin_user_pages_fast()[1].

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/166697255265.61150.6289490555867717077.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732026503.3186319.12020462741051772825.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869690376.3723671.8813331570219190705.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920904810.1461876.11603559311247187100.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997422579.9475.12101700945635692496.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305164634.1521586.12199658904363317567.stgit@warthog.procyon.org.uk/ # v4
---

 fs/netfs/Makefile     |    1 
 fs/netfs/iterator.c   |   99 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/netfs.h |    3 +
 3 files changed, 103 insertions(+)
 create mode 100644 fs/netfs/iterator.c

diff --git a/fs/netfs/Makefile b/fs/netfs/Makefile
index f684c0cd1ec5..386d6fb92793 100644
--- a/fs/netfs/Makefile
+++ b/fs/netfs/Makefile
@@ -3,6 +3,7 @@
 netfs-y := \
 	buffered_read.o \
 	io.o \
+	iterator.o \
 	main.o \
 	objects.o
 
diff --git a/fs/netfs/iterator.c b/fs/netfs/iterator.c
new file mode 100644
index 000000000000..7d802d21b9c5
--- /dev/null
+++ b/fs/netfs/iterator.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Iterator helpers.
+ *
+ * Copyright (C) 2022 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/export.h>
+#include <linux/slab.h>
+#include <linux/uio.h>
+#include <linux/netfs.h>
+#include "internal.h"
+
+/**
+ * netfs_extract_user_iter - Extract the pages from a user iterator into a bvec
+ * @orig: The original iterator
+ * @orig_len: The amount of iterator to copy
+ * @new: The iterator to be set up
+ * @cleanup_mode: Where to indicate the cleanup mode
+ *
+ * Extract the page fragments from the given amount of the source iterator and
+ * build up a second iterator that refers to all of those bits.  This allows
+ * the original iterator to be disposed of.
+ *
+ * On success, the number of elements in the bvec is returned, the original
+ * iterator will have been advanced by the amount extracted and *@cleanup_mode
+ * will have been set to FOLL_GET, FOLL_PIN or 0.
+ */
+ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
+				struct iov_iter *new, unsigned int *cleanup_mode)
+{
+	struct bio_vec *bv = NULL;
+	struct page **pages;
+	unsigned int cur_npages;
+	unsigned int max_pages;
+	unsigned int npages = 0;
+	unsigned int i;
+	ssize_t ret;
+	size_t count = orig_len, offset, len;
+	size_t bv_size, pg_size;
+
+	if (WARN_ON_ONCE(!iter_is_ubuf(orig) && !iter_is_iovec(orig)))
+		return -EIO;
+
+	max_pages = iov_iter_npages(orig, INT_MAX);
+	bv_size = array_size(max_pages, sizeof(*bv));
+	bv = kvmalloc(bv_size, GFP_KERNEL);
+	if (!bv)
+		return -ENOMEM;
+
+	*cleanup_mode = 0;
+
+	/* Put the page list at the end of the bvec list storage.  bvec
+	 * elements are larger than page pointers, so as long as we work
+	 * 0->last, we should be fine.
+	 */
+	pg_size = array_size(max_pages, sizeof(*pages));
+	pages = (void *)bv + bv_size - pg_size;
+
+	while (count && npages < max_pages) {
+		ret = iov_iter_extract_pages(orig, &pages, count,
+					     max_pages - npages, 0,
+					     &offset, cleanup_mode);
+		if (ret < 0) {
+			pr_err("Couldn't get user pages (rc=%zd)\n", ret);
+			break;
+		}
+
+		if (ret > count) {
+			pr_err("get_pages rc=%zd more than %zu\n", ret, count);
+			break;
+		}
+
+		count -= ret;
+		ret += offset;
+		cur_npages = DIV_ROUND_UP(ret, PAGE_SIZE);
+
+		if (npages + cur_npages > max_pages) {
+			pr_err("Out of bvec array capacity (%u vs %u)\n",
+			       npages + cur_npages, max_pages);
+			break;
+		}
+
+		for (i = 0; i < cur_npages; i++) {
+			len = ret > PAGE_SIZE ? PAGE_SIZE : ret;
+			bv[npages + i].bv_page	 = *pages++;
+			bv[npages + i].bv_offset = offset;
+			bv[npages + i].bv_len	 = len - offset;
+			ret -= len;
+			offset = 0;
+		}
+
+		npages += cur_npages;
+	}
+
+	iov_iter_bvec(new, iov_iter_rw(orig), bv, npages, orig_len - count);
+	return npages;
+}
+EXPORT_SYMBOL_GPL(netfs_extract_user_iter);
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 4c76ddfb6a67..26fe3e6bafa1 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -296,6 +296,9 @@ void netfs_get_subrequest(struct netfs_io_subrequest *subreq,
 void netfs_put_subrequest(struct netfs_io_subrequest *subreq,
 			  bool was_async, enum netfs_sreq_ref_trace what);
 void netfs_stats_show(struct seq_file *);
+ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
+				struct iov_iter *new,
+				unsigned int *cleanup_mode);
 
 /**
  * netfs_inode - Get the netfs inode context from the inode



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v5 6/9] netfs: Add a function to extract an iterator into a scatterlist
  2023-01-11 14:27 [PATCH v5 0/9] iov_iter: Add extraction helpers David Howells
                   ` (4 preceding siblings ...)
  2023-01-11 14:28 ` [PATCH v5 5/9] netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator David Howells
@ 2023-01-11 14:28 ` David Howells
  2023-01-11 14:28 ` [PATCH v5 7/9] bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning David Howells
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: David Howells @ 2023-01-11 14:28 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, Steve French, Shyam Prasad N, Rohith Surabattula,
	linux-cachefs, linux-cifs, linux-fsdevel, dhowells,
	Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

Provide a function for filling in a scatterlist from the list of pages
contained in an iterator.

If the iterator is UBUF- or IOVEC-type, the pages have a ref (ITER_SOURCE,
ie. WRITE) or a pin (ITER_DEST, ie. READ) taken on them.

If the iterator is BVEC-, KVEC- or XARRAY-type, no ref is taken on the
pages and it is left to the caller to manage their lifetime.  It cannot be
assumed that a ref can be validly taken, particularly in the case of a KVEC
iterator.
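
For illustration, filling a small scatterlist might look like this (a
minimal sketch, not taken from any current caller; a real caller would
size the scatterlist to the I/O):

	struct scatterlist sgl[16];
	struct sg_table sgtable = { .sgl = sgl };
	unsigned int cleanup_mode;
	ssize_t n;

	sg_init_table(sgl, ARRAY_SIZE(sgl));

	n = netfs_extract_iter_to_sg(iter, len, &sgtable, ARRAY_SIZE(sgl),
				     &cleanup_mode);
	if (n < 0)
		return n;
	if (sgtable.nents)
		sg_mark_end(&sgtable.sgl[sgtable.nents - 1]);

	/* Release the ref/pin indicated by cleanup_mode once the
	 * scatterlist has been consumed.
	 */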

Changes:
========
ver #3)
 - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
   to get/pin_user_pages_fast()[1].

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/166697255985.61150.16489950598033809487.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732027275.3186319.5186488812166611598.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869691313.3723671.10714823767342163891.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920905749.1461876.12079195122363691498.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997423514.9475.11145024341505464337.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305165398.1521586.12353215176136705725.stgit@warthog.procyon.org.uk/ # v4
---

 fs/netfs/iterator.c   |  268 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/netfs.h |    4 +
 mm/vmalloc.c          |    1 
 3 files changed, 273 insertions(+)

diff --git a/fs/netfs/iterator.c b/fs/netfs/iterator.c
index 7d802d21b9c5..82c46e3caf1d 100644
--- a/fs/netfs/iterator.c
+++ b/fs/netfs/iterator.c
@@ -7,7 +7,9 @@
 
 #include <linux/export.h>
 #include <linux/slab.h>
+#include <linux/mm.h>
 #include <linux/uio.h>
+#include <linux/scatterlist.h>
 #include <linux/netfs.h>
 #include "internal.h"
 
@@ -97,3 +99,269 @@ ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
 	return npages;
 }
 EXPORT_SYMBOL_GPL(netfs_extract_user_iter);
+
+/*
+ * Extract a list of up to sg_max pages from UBUF- or IOVEC-class iterators,
+ * pin or get refs on them as appropriate and add them to the scatterlist.
+ */
+static ssize_t netfs_extract_user_to_sg(struct iov_iter *iter,
+					ssize_t maxsize,
+					struct sg_table *sgtable,
+					unsigned int sg_max,
+					unsigned int *cleanup_mode)
+{
+	struct scatterlist *sg = sgtable->sgl + sgtable->nents;
+	struct page **pages;
+	unsigned int npages;
+	ssize_t ret = 0, res;
+	size_t len, off;
+
+	*cleanup_mode = 0;
+
+	/* We decant the page list into the tail of the scatterlist */
+	pages = (void *)sgtable->sgl + array_size(sg_max, sizeof(struct scatterlist));
+	pages -= sg_max;
+
+	do {
+		res = iov_iter_extract_pages(iter, &pages, maxsize, sg_max, 0,
+					     &off, cleanup_mode);
+		if (res < 0)
+			goto failed;
+
+		len = res;
+		maxsize -= len;
+		ret += len;
+		npages = DIV_ROUND_UP(off + len, PAGE_SIZE);
+		sg_max -= npages;
+
+		for (; npages > 0; npages--) {
+			struct page *page = *pages;
+			size_t seg = min_t(size_t, PAGE_SIZE - off, len);
+
+			*pages++ = NULL;
+			sg_set_page(sg, page, seg, off);
+			sgtable->nents++;
+			sg++;
+			len -= seg;
+			off = 0;
+		}
+	} while (maxsize > 0 && sg_max > 0);
+
+	return ret;
+
+failed:
+	while (sgtable->nents > sgtable->orig_nents)
+		put_page(sg_page(&sgtable->sgl[--sgtable->nents]));
+	return res;
+}
+
+/*
+ * Extract up to sg_max pages from a BVEC-type iterator and add them to the
+ * scatterlist.  The pages are not pinned.
+ */
+static ssize_t netfs_extract_bvec_to_sg(struct iov_iter *iter,
+					ssize_t maxsize,
+					struct sg_table *sgtable,
+					unsigned int sg_max,
+					unsigned int *cleanup_mode)
+{
+	const struct bio_vec *bv = iter->bvec;
+	struct scatterlist *sg = sgtable->sgl + sgtable->nents;
+	unsigned long start = iter->iov_offset;
+	unsigned int i;
+	ssize_t ret = 0;
+
+	for (i = 0; i < iter->nr_segs; i++) {
+		size_t off, len;
+
+		len = bv[i].bv_len;
+		if (start >= len) {
+			start -= len;
+			continue;
+		}
+
+		len = min_t(size_t, maxsize, len - start);
+		off = bv[i].bv_offset + start;
+
+		sg_set_page(sg, bv[i].bv_page, len, off);
+		sgtable->nents++;
+		sg++;
+		sg_max--;
+
+		ret += len;
+		maxsize -= len;
+		if (maxsize <= 0 || sg_max == 0)
+			break;
+		start = 0;
+	}
+
+	if (ret > 0)
+		iov_iter_advance(iter, ret);
+	*cleanup_mode = 0;
+	return ret;
+}
+
+/*
+ * Extract up to sg_max pages from a KVEC-type iterator and add them to the
+ * scatterlist.  This can deal with vmalloc'd buffers as well as kmalloc'd or
+ * static buffers.  The pages are not pinned.
+ */
+static ssize_t netfs_extract_kvec_to_sg(struct iov_iter *iter,
+					ssize_t maxsize,
+					struct sg_table *sgtable,
+					unsigned int sg_max,
+					unsigned int *cleanup_mode)
+{
+	const struct kvec *kv = iter->kvec;
+	struct scatterlist *sg = sgtable->sgl + sgtable->nents;
+	unsigned long start = iter->iov_offset;
+	unsigned int i;
+	ssize_t ret = 0;
+
+	for (i = 0; i < iter->nr_segs; i++) {
+		struct page *page;
+		unsigned long kaddr;
+		size_t off, len, seg;
+
+		len = kv[i].iov_len;
+		if (start >= len) {
+			start -= len;
+			continue;
+		}
+
+		kaddr = (unsigned long)kv[i].iov_base + start;
+		off = kaddr & ~PAGE_MASK;
+		len = min_t(size_t, maxsize, len - start);
+		kaddr &= PAGE_MASK;
+
+		maxsize -= len;
+		ret += len;
+		do {
+			seg = min_t(size_t, len, PAGE_SIZE - off);
+			if (is_vmalloc_or_module_addr((void *)kaddr))
+				page = vmalloc_to_page((void *)kaddr);
+			else
+				page = virt_to_page(kaddr);
+
+			sg_set_page(sg, page, seg, off);
+			sgtable->nents++;
+			sg++;
+			sg_max--;
+
+			len -= seg;
+			kaddr += PAGE_SIZE;
+			off = 0;
+		} while (len > 0 && sg_max > 0);
+
+		if (maxsize <= 0 || sg_max == 0)
+			break;
+		start = 0;
+	}
+
+	if (ret > 0)
+		iov_iter_advance(iter, ret);
+	*cleanup_mode = 0;
+	return ret;
+}
+
+/*
+ * Extract up to sg_max folios from an XARRAY-type iterator and add them to
+ * the scatterlist.  The pages are not pinned.
+ */
+static ssize_t netfs_extract_xarray_to_sg(struct iov_iter *iter,
+					  ssize_t maxsize,
+					  struct sg_table *sgtable,
+					  unsigned int sg_max,
+					  unsigned int *cleanup_mode)
+{
+	struct scatterlist *sg = sgtable->sgl + sgtable->nents;
+	struct xarray *xa = iter->xarray;
+	struct folio *folio;
+	loff_t start = iter->xarray_start + iter->iov_offset;
+	pgoff_t index = start / PAGE_SIZE;
+	ssize_t ret = 0;
+	size_t offset, len;
+	XA_STATE(xas, xa, index);
+
+	rcu_read_lock();
+
+	xas_for_each(&xas, folio, ULONG_MAX) {
+		if (xas_retry(&xas, folio))
+			continue;
+		if (WARN_ON(xa_is_value(folio)))
+			break;
+		if (WARN_ON(folio_test_hugetlb(folio)))
+			break;
+
+		offset = offset_in_folio(folio, start);
+		len = min_t(size_t, maxsize, folio_size(folio) - offset);
+
+		sg_set_page(sg, folio_page(folio, 0), len, offset);
+		sgtable->nents++;
+		sg++;
+		sg_max--;
+
+		maxsize -= len;
+		ret += len;
+		if (maxsize <= 0 || sg_max == 0)
+			break;
+	}
+
+	rcu_read_unlock();
+	if (ret > 0)
+		iov_iter_advance(iter, ret);
+	*cleanup_mode = 0;
+	return ret;
+}
+
+/**
+ * netfs_extract_iter_to_sg - Extract pages from an iterator and add to an sglist
+ * @iter: The iterator to extract from
+ * @maxsize: The amount of iterator to copy
+ * @sgtable: The scatterlist table to fill in
+ * @sg_max: Maximum number of elements in @sgtable that may be filled
+ * @cleanup_mode: Where to return the cleanup mode
+ *
+ * Extract the page fragments from the given amount of the source iterator and
+ * add them to a scatterlist that refers to all of those bits, to a maximum
+ * addition of @sg_max elements.
+ *
+ * The pages referred to by UBUF- and IOVEC-type iterators are extracted and
+ * ref'd or pinned as appropriate to the direction; BVEC-, KVEC- and XARRAY-
+ * type are extracted but aren't pinned; PIPE- and DISCARD-type are not supported.
+ *
+ * No end mark is placed on the scatterlist; that's left to the caller.
+ *
+ * If successful, @sgtable->nents is updated to include the number of elements
+ * added and the number of bytes added is returned.  @sgtable->orig_nents is
+ * left unaltered.
+ */
+ssize_t netfs_extract_iter_to_sg(struct iov_iter *iter, size_t maxsize,
+				 struct sg_table *sgtable, unsigned int sg_max,
+				 unsigned int *cleanup_mode)
+{
+	if (maxsize == 0)
+		return 0;
+
+	switch (iov_iter_type(iter)) {
+	case ITER_UBUF:
+	case ITER_IOVEC:
+		return netfs_extract_user_to_sg(iter, maxsize, sgtable, sg_max,
+						cleanup_mode);
+	case ITER_BVEC:
+		return netfs_extract_bvec_to_sg(iter, maxsize, sgtable, sg_max,
+						cleanup_mode);
+	case ITER_KVEC:
+		return netfs_extract_kvec_to_sg(iter, maxsize, sgtable, sg_max,
+						cleanup_mode);
+	case ITER_XARRAY:
+		return netfs_extract_xarray_to_sg(iter, maxsize, sgtable, sg_max,
+						  cleanup_mode);
+	default:
+		pr_err("netfs_extract_iter_to_sg(%u) unsupported\n",
+		       iov_iter_type(iter));
+		WARN_ON_ONCE(1);
+		return -EIO;
+	}
+}
+EXPORT_SYMBOL_GPL(netfs_extract_iter_to_sg);
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 26fe3e6bafa1..059e49233e29 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -299,6 +299,10 @@ void netfs_stats_show(struct seq_file *);
 ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
 				struct iov_iter *new,
 				unsigned int *cleanup_mode);
+struct sg_table;
+ssize_t netfs_extract_iter_to_sg(struct iov_iter *iter, size_t len,
+				 struct sg_table *sgtable, unsigned int sg_max,
+				 unsigned int *cleanup_mode);
 
 /**
  * netfs_inode - Get the netfs inode context from the inode
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ca71de7c9d77..61f5bec0f2b6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -656,6 +656,7 @@ int is_vmalloc_or_module_addr(const void *x)
 #endif
 	return is_vmalloc_addr(x);
 }
+EXPORT_SYMBOL_GPL(is_vmalloc_or_module_addr);
 
 /*
  * Walk a vmap address to the struct page it maps. Huge vmap mappings will



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v5 7/9] bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning
  2023-01-11 14:27 [PATCH v5 0/9] iov_iter: Add extraction helpers David Howells
                   ` (5 preceding siblings ...)
  2023-01-11 14:28 ` [PATCH v5 6/9] netfs: Add a function to extract an iterator into a scatterlist David Howells
@ 2023-01-11 14:28 ` David Howells
  2023-01-12  7:32   ` Christoph Hellwig
  2023-01-11 14:28 ` [PATCH v5 8/9] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate David Howells
  2023-01-11 14:28 ` [PATCH v5 9/9] bio: Fix bio_flagged() so that it can be combined David Howells
  8 siblings, 1 reply; 34+ messages in thread
From: David Howells @ 2023-01-11 14:28 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Matthew Wilcox,
	Logan Gunthorpe, linux-block, dhowells, Christoph Hellwig,
	Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel

Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning.  In a
following patch I intend to add a BIO_PAGE_PINNED flag to indicate that the
page needs unpinning and this way both flags have the same logic.

Changes
=======
ver #5)
 - Split from patch that uses iov_iter_extract_pages().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
Link: https://lore.kernel.org/r/167305166150.1521586.10220949115402059720.stgit@warthog.procyon.org.uk/ # v4
---

 block/bio.c               |    9 ++++++++-
 include/linux/bio.h       |    2 +-
 include/linux/blk_types.h |    2 +-
 3 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5f96fcae3f75..0fa140789d16 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -243,6 +243,10 @@ static void bio_free(struct bio *bio)
  * Users of this function have their own bio allocation. Subsequently,
  * they must remember to pair any call to bio_init() with bio_uninit()
  * when IO has completed, or when the bio is released.
+ *
+ * We set the initial assumption that pages attached to the bio will be
+ * released with put_page() by setting BIO_PAGE_REFFED; if the pages
+ * should not be put, this flag should be cleared.
  */
 void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 	      unsigned short max_vecs, blk_opf_t opf)
@@ -274,6 +278,7 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 	bio->bi_integrity = NULL;
 #endif
+	bio_set_flag(bio, BIO_PAGE_REFFED);
 	bio->bi_vcnt = 0;
 
 	atomic_set(&bio->__bi_remaining, 1);
@@ -302,6 +307,7 @@ void bio_reset(struct bio *bio, struct block_device *bdev, blk_opf_t opf)
 {
 	bio_uninit(bio);
 	memset(bio, 0, BIO_RESET_BYTES);
+	bio_set_flag(bio, BIO_PAGE_REFFED);
 	atomic_set(&bio->__bi_remaining, 1);
 	bio->bi_bdev = bdev;
 	if (bio->bi_bdev)
@@ -812,6 +818,7 @@ EXPORT_SYMBOL(bio_put);
 static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
 {
 	bio_set_flag(bio, BIO_CLONED);
+	bio_clear_flag(bio, BIO_PAGE_REFFED);
 	bio->bi_ioprio = bio_src->bi_ioprio;
 	bio->bi_iter = bio_src->bi_iter;
 
@@ -1198,7 +1205,7 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 	bio->bi_io_vec = (struct bio_vec *)iter->bvec;
 	bio->bi_iter.bi_bvec_done = iter->iov_offset;
 	bio->bi_iter.bi_size = size;
-	bio_set_flag(bio, BIO_NO_PAGE_REF);
+	bio_clear_flag(bio, BIO_PAGE_REFFED);
 	bio_set_flag(bio, BIO_CLONED);
 }
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 22078a28d7cb..63bfd91793f9 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -482,7 +482,7 @@ void zero_fill_bio(struct bio *bio);
 
 static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
 {
-	if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+	if (bio_flagged(bio, BIO_PAGE_REFFED))
 		__bio_release_pages(bio, mark_dirty);
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..2f64b3606397 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -318,7 +318,7 @@ struct bio {
  * bio flags
  */
 enum {
-	BIO_NO_PAGE_REF,	/* don't put release vec pages */
+	BIO_PAGE_REFFED,	/* Pages need refs putting (see FOLL_GET) */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */
 	BIO_QUIET,		/* Make BIO Quiet */



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v5 8/9] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate
  2023-01-11 14:27 [PATCH v5 0/9] iov_iter: Add extraction helpers David Howells
                   ` (6 preceding siblings ...)
  2023-01-11 14:28 ` [PATCH v5 7/9] bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning David Howells
@ 2023-01-11 14:28 ` David Howells
  2023-01-12  7:44   ` Christoph Hellwig
  2023-01-12 10:28   ` David Howells
  2023-01-11 14:28 ` [PATCH v5 9/9] bio: Fix bio_flagged() so that it can be combined David Howells
  8 siblings, 2 replies; 34+ messages in thread
From: David Howells @ 2023-01-11 14:28 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Matthew Wilcox,
	Logan Gunthorpe, linux-block, dhowells, Christoph Hellwig,
	Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel

Convert the block layer's bio code to use iov_iter_extract_pages() instead
of iov_iter_get_pages().  This will pin pages or leave them unaltered
rather than getting a ref on them as appropriate to the source iterator.

The pages need to be pinned for DIO-read rather than having refs taken on
them to prevent VM copy-on-write from malfunctioning during a concurrent
fork() (the result of the I/O would otherwise end up only visible to the
child process and not the parent).

To implement this:

 (1) If the BIO_PAGE_REFFED flag is set, this causes attached pages to be
     passed to put_page() during cleanup.

 (2) A BIO_PAGE_PINNED flag is provided.  If set, this causes attached
     pages to be passed to unpin_user_page() during cleanup.

 (3) BIO_PAGE_REFFED is set by default and BIO_PAGE_PINNED is cleared by
     default when the bio is (re-)initialised.

 (4) If iov_iter_extract_pages() indicates FOLL_GET, this causes
     BIO_PAGE_REFFED to be set and if FOLL_PIN is indicated, this causes
     BIO_PAGE_PINNED to be set.  If it returns neither FOLL_* flag, then
     both BIO_PAGE_* flags will be cleared.

     Mixing sets of pages with different clean up modes is not supported.

 (5) Cloned bio structs have both flags cleared.

 (6) bio_release_pages() will do the release if either BIO_PAGE_* flag is
     set.

[!] Note that this is tested a bit with ext4, but nothing else.

Changes
=======
ver #5)
 - Transcribe the FOLL_* flags returned by iov_iter_extract_pages() to
   BIO_* flags and got rid of bi_cleanup_mode.
 - Replaced BIO_NO_PAGE_REF to BIO_PAGE_REFFED in the preceding patch.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
Link: https://lore.kernel.org/r/167305166150.1521586.10220949115402059720.stgit@warthog.procyon.org.uk/ # v4
---

 block/bio.c               |   54 ++++++++++++++++++++++++++++++++-------------
 include/linux/bio.h       |    3 ++-
 include/linux/blk_types.h |    1 +
 3 files changed, 41 insertions(+), 17 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 0fa140789d16..22900f8e9bfb 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -245,8 +245,9 @@ static void bio_free(struct bio *bio)
  * when IO has completed, or when the bio is released.
  *
  * We set the initial assumption that pages attached to the bio will be
- * released with put_page() by setting BIO_PAGE_REFFED; if the pages
- * should not be put, this flag should be cleared.
+ * released with put_page() by setting BIO_PAGE_REFFED, but this should be set
+ * to BIO_PAGE_PINNED if the page should be unpinned instead; if the pages
+ * should not be put or unpinned, these flags should be cleared.
  */
 void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 	      unsigned short max_vecs, blk_opf_t opf)
@@ -819,6 +820,7 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
 {
 	bio_set_flag(bio, BIO_CLONED);
 	bio_clear_flag(bio, BIO_PAGE_REFFED);
+	bio_clear_flag(bio, BIO_PAGE_PINNED);
 	bio->bi_ioprio = bio_src->bi_ioprio;
 	bio->bi_iter = bio_src->bi_iter;
 
@@ -1175,6 +1177,18 @@ bool bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
 	return bio_add_page(bio, &folio->page, len, off) > 0;
 }
 
+/*
+ * Clean up a page appropriately, where the page may be pinned, may have a
+ * ref taken on it or neither.
+ */
+static void bio_release_page(struct bio *bio, struct page *page)
+{
+	if (bio_flagged(bio, BIO_PAGE_PINNED))
+		unpin_user_page(page);
+	if (bio_flagged(bio, BIO_PAGE_REFFED))
+		put_page(page);
+}
+
 void __bio_release_pages(struct bio *bio, bool mark_dirty)
 {
 	struct bvec_iter_all iter_all;
@@ -1183,7 +1197,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (mark_dirty && !PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
-		put_page(bvec->bv_page);
+		bio_release_page(bio, bvec->bv_page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1220,7 +1234,7 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
 	}
 
 	if (same_page)
-		put_page(page);
+		bio_release_page(bio, page);
 	return 0;
 }
 
@@ -1234,7 +1248,7 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
 			queue_max_zone_append_sectors(q), &same_page) != len)
 		return -EINVAL;
 	if (same_page)
-		put_page(page);
+		bio_release_page(bio, page);
 	return 0;
 }
 
@@ -1245,10 +1259,10 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be mapped
  *
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * Extracts pages from *iter and appends them to @bio's bvec array.  The pages
+ * will have to be cleaned up in the way indicated by the BIO_PAGE_REFFED and
+ * BIO_PAGE_PINNED flags.  For a multi-segment *iter, this function only adds
+ * pages from the next non-empty segment of the iov iterator.
  */
 static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1256,7 +1270,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
-	unsigned int gup_flags = 0;
+	unsigned int gup_flags = 0, cleanup_mode;
 	ssize_t size, left;
 	unsigned len, i = 0;
 	size_t offset, trim;
@@ -1280,12 +1294,20 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 * result to ensure the bio's total size is correct. The remainder of
 	 * the iov data will be picked up in the next bio iteration.
 	 */
-	size = iov_iter_get_pages(iter, pages,
-				  UINT_MAX - bio->bi_iter.bi_size,
-				  nr_pages, &offset, gup_flags);
+	size = iov_iter_extract_pages(iter, &pages,
+				      UINT_MAX - bio->bi_iter.bi_size,
+				      nr_pages, gup_flags,
+				      &offset, &cleanup_mode);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
+	bio_clear_flag(bio, BIO_PAGE_REFFED);
+	bio_clear_flag(bio, BIO_PAGE_PINNED);
+	if (cleanup_mode & FOLL_GET)
+		bio_set_flag(bio, BIO_PAGE_REFFED);
+	if (cleanup_mode & FOLL_PIN)
+		bio_set_flag(bio, BIO_PAGE_PINNED);
+
 	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
 
 	trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
@@ -1315,7 +1337,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	iov_iter_revert(iter, left);
 out:
 	while (i < nr_pages)
-		put_page(pages[i++]);
+		bio_release_page(bio, pages[i++]);
 
 	return ret;
 }
@@ -1496,8 +1518,8 @@ void bio_set_pages_dirty(struct bio *bio)
  * the BIO and re-dirty the pages in process context.
  *
  * It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on.  It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on.  It will run one put_page() or unpin_user_page() against each page
+ * and will run one bio_put() against the BIO.
  */
 
 static void bio_dirty_fn(struct work_struct *work);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 63bfd91793f9..1c6f051f6ff2 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -482,7 +482,8 @@ void zero_fill_bio(struct bio *bio);
 
 static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
 {
-	if (bio_flagged(bio, BIO_PAGE_REFFED))
+	if (bio_flagged(bio, BIO_PAGE_REFFED) ||
+	    bio_flagged(bio, BIO_PAGE_PINNED))
 		__bio_release_pages(bio, mark_dirty);
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 2f64b3606397..7a2d2b2cf3a0 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -319,6 +319,7 @@ struct bio {
  */
 enum {
 	BIO_PAGE_REFFED,	/* Pages need refs putting (see FOLL_GET) */
+	BIO_PAGE_PINNED,	/* Pages need pins unpinning (see FOLL_PIN) */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */
 	BIO_QUIET,		/* Make BIO Quiet */



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v5 9/9] bio: Fix bio_flagged() so that it can be combined
  2023-01-11 14:27 [PATCH v5 0/9] iov_iter: Add extraction helpers David Howells
                   ` (7 preceding siblings ...)
  2023-01-11 14:28 ` [PATCH v5 8/9] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate David Howells
@ 2023-01-11 14:28 ` David Howells
  2023-01-12  7:33   ` Christoph Hellwig
  8 siblings, 1 reply; 34+ messages in thread
From: David Howells @ 2023-01-11 14:28 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, linux-block, dhowells, Christoph Hellwig,
	Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel

Fix bio_flagged() so that multiple instances of it can be combined by the
compiler into a single test (arguably, this is a compiler optimisation
issue[1]).

The problem is that it compares the result of the bitwise-AND to zero.
This results in an out-of-line bio_release_page() that looks something
like:

   <+0>:     mov    0x14(%rdi),%eax
   <+3>:     test   $0x1,%al
   <+5>:     jne    0xffffffff816dac53 <bio_release_pages+11>
   <+7>:     test   $0x2,%al
   <+9>:     je     0xffffffff816dac5c <bio_release_pages+20>
   <+11>:    movzbl %sil,%esi
   <+15>:    jmp    0xffffffff816daba1 <__bio_release_pages>
   <+20>:    jmp    0xffffffff81d0b800 <__x86_return_thunk>

Removing the test (it's superfluous as the return type is bool - the
compiler will reduce the return to 0 or 1 as needed) results in:

   <+0>:     testb  $0x3,0x14(%rdi)
   <+4>:     je     0xffffffff816e4af4 <bio_release_pages+15>
   <+6>:     movzbl %sil,%esi
   <+10>:    jmp    0xffffffff816dab7c <__bio_release_pages>
   <+15>:    jmp    0xffffffff81d0b7c0 <__x86_return_thunk>

instead.
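
With the test removed, the compiler is free to merge adjacent flag tests,
so that, for instance, the pair of tests in bio_release_pages() from the
previous patch:

	if (bio_flagged(bio, BIO_PAGE_REFFED) ||
	    bio_flagged(bio, BIO_PAGE_PINNED))
		__bio_release_pages(bio, mark_dirty);

compiles down to the single TESTB $0x3 seen above.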

The MOVZBL instruction also looks unnecessary[2] - I think it's just
're-booling' the mark_dirty.

Fixes: b7c44ed9d2fc ("block: manipulate bio->bi_flags through helpers")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108370 [1]
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108371 [2]
---

 include/linux/bio.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 1c6f051f6ff2..2e6109b0fca8 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -227,7 +227,7 @@ static inline void bio_cnt_set(struct bio *bio, unsigned int count)
 
 static inline bool bio_flagged(struct bio *bio, unsigned int bit)
 {
-	return (bio->bi_flags & (1U << bit)) != 0;
+	return bio->bi_flags & (1U << bit);
 }
 
 static inline void bio_set_flag(struct bio *bio, unsigned int bit)



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 7/9] bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning
  2023-01-11 14:28 ` [PATCH v5 7/9] bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning David Howells
@ 2023-01-12  7:32   ` Christoph Hellwig
  0 siblings, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2023-01-12  7:32 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Jens Axboe, Jan Kara, Matthew Wilcox, Logan Gunthorpe,
	linux-block, Christoph Hellwig, Jeff Layton, linux-fsdevel,
	linux-kernel

I'd rename the flag to BIO_PAGES_REFFED as it can affect multiple pages,
but otherwise this looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 9/9] bio: Fix bio_flagged() so that it can be combined
  2023-01-11 14:28 ` [PATCH v5 9/9] bio: Fix bio_flagged() so that it can be combined David Howells
@ 2023-01-12  7:33   ` Christoph Hellwig
  0 siblings, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2023-01-12  7:33 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Jens Axboe, linux-block, Christoph Hellwig,
	Matthew Wilcox, Jan Kara, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-kernel

I find the subject a little weird.  What you're doing is to clean up
the C version, which also tends to micro-optimize the x86 assembly
generation for a specific compiler.  Which is good on two counts, so I'm
all for the patch, but I don't really think it's "combining", which
really made me think of testing two flags in one call.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 8/9] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate
  2023-01-11 14:28 ` [PATCH v5 8/9] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate David Howells
@ 2023-01-12  7:44   ` Christoph Hellwig
  2023-01-12 10:28   ` David Howells
  1 sibling, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2023-01-12  7:44 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Jens Axboe, Jan Kara, Christoph Hellwig, Matthew Wilcox,
	Logan Gunthorpe, linux-block, Christoph Hellwig, Jeff Layton,
	linux-fsdevel, linux-kernel

On Wed, Jan 11, 2023 at 02:28:35PM +0000, David Howells wrote:
> [!] Note that this is tested a bit with ext4, but nothing else.

You probably want to also at least test it with block device I/O
as that is a slightly different I/O path from iomap.  More file systems
also never hurt, but aren't quite as important.

> +/*
> + * Clean up a page appropriately, where the page may be pinned, may have a
> + * ref taken on it or neither.
> + */
> +static void bio_release_page(struct bio *bio, struct page *page)
> +{
> +	if (bio_flagged(bio, BIO_PAGE_PINNED))
> +		unpin_user_page(page);
> +	if (bio_flagged(bio, BIO_PAGE_REFFED))
> +		put_page(page);
> +}
> +
>  void __bio_release_pages(struct bio *bio, bool mark_dirty)
>  {
>  	struct bvec_iter_all iter_all;
> @@ -1183,7 +1197,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
>  	bio_for_each_segment_all(bvec, bio, iter_all) {
>  		if (mark_dirty && !PageCompound(bvec->bv_page))
>  			set_page_dirty_lock(bvec->bv_page);
> -		put_page(bvec->bv_page);
> +		bio_release_page(bio, bvec->bv_page);

So this does look correct and sensible, but given that the new pin/unpin
path has a significantly higher overhead I wonder if this might be a
good time to switch to folios here as soon as possible in a follow on
patch.


> +	size = iov_iter_extract_pages(iter, &pages,
> +				      UINT_MAX - bio->bi_iter.bi_size,
> +				      nr_pages, gup_flags,
> +				      &offset, &cleanup_mode);
>  	if (unlikely(size <= 0))
>  		return size ? size : -EFAULT;
>  
> +	bio_clear_flag(bio, BIO_PAGE_REFFED);
> +	bio_clear_flag(bio, BIO_PAGE_PINNED);
> +	if (cleanup_mode & FOLL_GET)
> +		bio_set_flag(bio, BIO_PAGE_REFFED);
> +	if (cleanup_mode & FOLL_PIN)
> +		bio_set_flag(bio, BIO_PAGE_PINNED);

The flags here must not change from one invocation to another, so
clearing and resetting them on every iteration seems dangerous.

This should probably be a:

	if (cleanup_mode & FOLL_GET) {
		WARN_ON_ONCE(bio_test_flag(bio, BIO_PAGE_PINNED));
		bio_set_flag(bio, BIO_PAGE_REFFED);
	}
	if (cleanup_mode & FOLL_PIN) {
		WARN_ON_ONCE(bio_test_flag(bio, BIO_PAGE_REFFED));
		bio_set_flag(bio, BIO_PAGE_PINNED);
	}

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 3/9] iov_iter: Use IOCB/IOMAP_WRITE if available rather than iterator direction
  2023-01-11 14:27 ` [PATCH v5 3/9] iov_iter: Use IOCB/IOMAP_WRITE if available rather than iterator direction David Howells
@ 2023-01-12  7:54   ` Christoph Hellwig
  2023-01-12 10:31   ` David Howells
  2023-01-12 21:56   ` Al Viro
  2 siblings, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2023-01-12  7:54 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

On Wed, Jan 11, 2023 at 02:27:58PM +0000, David Howells wrote:
> This allows all but three of the users of iov_iter_rw() to be got rid of: a
> consistency check and a warning statement in cifs

Let's just drop these two.

> and one user in the block
> layer that has neither available.

And use the information in the request for this one (see patch below),
and then move this patch first in the series, add an explicit direction
parameter in the gup_flags to the get/pin helper and drop iov_iter_rw
and the whole confusing source/dest information in the iov_iter entirely,
which is a really nice big tree wide cleanup that removes redundant
information.

---
diff --git a/block/blk-map.c b/block/blk-map.c
index 19940c978c73bb..08cbb7ff3b196d 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -203,7 +203,7 @@ static int bio_copy_user_iov(struct request *rq, struct rq_map_data *map_data,
 	/*
 	 * success
 	 */
-	if ((iov_iter_rw(iter) == WRITE &&
+	if ((op_is_write(rq->cmd_flags) &&
 	     (!map_data || !map_data->null_mapped)) ||
 	    (map_data && map_data->from_user)) {
 		ret = bio_copy_from_iter(bio, iter);

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 4/9] iov_iter: Add a function to extract a page list from an iterator
  2023-01-11 14:28 ` [PATCH v5 4/9] iov_iter: Add a function to extract a page list from an iterator David Howells
@ 2023-01-12  7:55   ` Christoph Hellwig
  2023-01-12 21:15   ` Al Viro
  1 sibling, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2023-01-12  7:55 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, John Hubbard, Matthew Wilcox,
	linux-fsdevel, linux-mm, Christoph Hellwig, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-block, linux-kernel

> +	BUG(); // if it had been empty, we wouldn't get called

Please use a normal /* */ comment here.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 8/9] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate
  2023-01-11 14:28 ` [PATCH v5 8/9] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate David Howells
  2023-01-12  7:44   ` Christoph Hellwig
@ 2023-01-12 10:28   ` David Howells
  2023-01-12 14:09     ` Christoph Hellwig
  1 sibling, 1 reply; 34+ messages in thread
From: David Howells @ 2023-01-12 10:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Al Viro, Jens Axboe, Jan Kara, Christoph Hellwig,
	Matthew Wilcox, Logan Gunthorpe, linux-block, Jeff Layton,
	linux-fsdevel, linux-kernel

Christoph Hellwig <hch@infradead.org> wrote:

> 	if (cleanup_mode & FOLL_GET) {
> 		WARN_ON_ONCE(bio_test_flag(bio, BIO_PAGE_PINNED));
> 		bio_set_flag(bio, BIO_PAGE_REFFED);
> 	}
> 	if (cleanup_mode & FOLL_PIN) {
> 		WARN_ON_ONCE(bio_test_flag(bio, BIO_PAGE_REFFED));
> 		bio_set_flag(bio, BIO_PAGE_PINNED);
> 	}

That won't necessarily work as you might get back cleanup_mode == 0, in which
case both flags are cleared - and neither warning will trip on the next
addition.

I could change it so that rather than using a pair of flags, it uses a
four-state variable (which can be stored in bi_flags): BIO_PAGE_DEFAULT,
BIO_PAGE_REFFED, BIO_PAGE_PINNED, BIO_PAGE_NO_CLEANUP, say.
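
Roughly like this, say (a sketch of the idea only, not a patch):

	enum bio_page_cleanup_mode {
		BIO_PAGE_DEFAULT,	/* Cleanup mode not yet determined */
		BIO_PAGE_REFFED,	/* put_page() each page on cleanup */
		BIO_PAGE_PINNED,	/* unpin_user_page() each page */
		BIO_PAGE_NO_CLEANUP,	/* Caller manages page lifetimes */
	};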

Or I could add an extra flag to say that the setting is locked.  Or we could
just live with the scenario I outlined possibly happening.

David


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 3/9] iov_iter: Use IOCB/IOMAP_WRITE if available rather than iterator direction
  2023-01-11 14:27 ` [PATCH v5 3/9] iov_iter: Use IOCB/IOMAP_WRITE if available rather than iterator direction David Howells
  2023-01-12  7:54   ` Christoph Hellwig
@ 2023-01-12 10:31   ` David Howells
  2023-01-12 14:08     ` Christoph Hellwig
  2023-01-12 21:56   ` Al Viro
  2 siblings, 1 reply; 34+ messages in thread
From: David Howells @ 2023-01-12 10:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Al Viro, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

Christoph Hellwig <hch@infradead.org> wrote:

> > This allows all but three of the users of iov_iter_rw() to be got rid of: a
> > consistency check and a warning statement in cifs
> 
> Let's just drop these two.
> 
> > and one user in the block
> > layer that has neither available.
> 
> And use the information in the request for this one (see patch below),
> and then move this patch first in the series, add an explicit direction
> parameter in the gup_flags to the get/pin helper and drop iov_iter_rw
> and the whole confusing source/dest information in the iov_iter entirely,
> which is a really nice big tree wide cleanup that remove redundant
> information.

Fine by me, but Al might object as I think he wanted the internal checks.  Al?

David


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 3/9] iov_iter: Use IOCB/IOMAP_WRITE if available rather than iterator direction
  2023-01-12 10:31   ` David Howells
@ 2023-01-12 14:08     ` Christoph Hellwig
  2023-01-12 17:37       ` Al Viro
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2023-01-12 14:08 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Al Viro, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

On Thu, Jan 12, 2023 at 10:31:01AM +0000, David Howells wrote:
> > And use the information in the request for this one (see patch below),
> > and then move this patch first in the series, add an explicit direction
> > parameter in the gup_flags to the get/pin helper and drop iov_iter_rw
> > and the whole confusing source/dest information in the iov_iter entirely,
> > which is a really nice big tree wide cleanup that removes redundant
> > information.
> 
> Fine by me, but Al might object as I think he wanted the internal checks.  Al?

I'm happy to have another discussion, but the fact the information in
the iov_iter is 98% redundant and various callers got it wrong and
away is a pretty good sign that we should drop this information.  It
also nicely simplified the API.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 8/9] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate
  2023-01-12 10:28   ` David Howells
@ 2023-01-12 14:09     ` Christoph Hellwig
  2023-01-12 14:17       ` Christoph Hellwig
  2023-01-12 14:58       ` David Howells
  0 siblings, 2 replies; 34+ messages in thread
From: Christoph Hellwig @ 2023-01-12 14:09 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Al Viro, Jens Axboe, Jan Kara,
	Christoph Hellwig, Matthew Wilcox, Logan Gunthorpe, linux-block,
	Jeff Layton, linux-fsdevel, linux-kernel

On Thu, Jan 12, 2023 at 10:28:41AM +0000, David Howells wrote:
> Christoph Hellwig <hch@infradead.org> wrote:
> 
> > 	if (cleanup_mode & FOLL_GET) {
> > 		WARN_ON_ONCE(bio_test_flag(bio, BIO_PAGE_PINNED));
> > 		bio_set_flag(bio, BIO_PAGE_REFFED);
> > 	}
> > 	if (cleanup_mode & FOLL_PIN) {
> > 		WARN_ON_ONCE(bio_test_flag(bio, BIO_PAGE_REFFED));
> > 		bio_set_flag(bio, BIO_PAGE_PINNED);
> > 	}
> 
> That won't necessarily work as you might get back cleanup_mode == 0, in which
> case both flags are cleared - and neither warning will trip on the next
> addition.

Well, it will work for the intended use case even with
cleanup_mode == 0, we just won't get the debug check.  Or am I missing
something fundamental?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 8/9] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate
  2023-01-12 14:09     ` Christoph Hellwig
@ 2023-01-12 14:17       ` Christoph Hellwig
  2023-01-12 14:58       ` David Howells
  1 sibling, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2023-01-12 14:17 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Al Viro, Jens Axboe, Jan Kara,
	Christoph Hellwig, Matthew Wilcox, Logan Gunthorpe, linux-block,
	Jeff Layton, linux-fsdevel, linux-kernel

On Thu, Jan 12, 2023 at 06:09:16AM -0800, Christoph Hellwig wrote:
> On Thu, Jan 12, 2023 at 10:28:41AM +0000, David Howells wrote:
> > Christoph Hellwig <hch@infradead.org> wrote:
> > 
> > > 	if (cleanup_mode & FOLL_GET) {
> > > 		WARN_ON_ONCE(bio_test_flag(bio, BIO_PAGE_PINNED));
> > > 		bio_set_flag(bio, BIO_PAGE_REFFED);
> > > 	}
> > > 	if (cleanup_mode & FOLL_PIN) {
> > > 		WARN_ON_ONCE(bio_test_flag(bio, BIO_PAGE_REFFED));
> > > 		bio_set_flag(bio, BIO_PAGE_PINNED);
> > > 	}
> > 
> > That won't necessarily work as you might get back cleanup_mode == 0, in which
> > case both flags are cleared - and neither warning will trip on the next
> > addition.
> 
> Well, it will work for the intended use case even with
> cleanup_mode == 0, we just won't get the debug check.  Or am I missing
> something fundamental?

In fact looking at the code we can debug check that case too by doing:

	if (cleanup_mode & FOLL_GET) {
 		WARN_ON_ONCE(bio_test_flag(bio, BIO_PAGE_PINNED));
 		bio_set_flag(bio, BIO_PAGE_REFFED);
 	} else if (cleanup_mode & FOLL_PIN) {
 		WARN_ON_ONCE(bio_test_flag(bio, BIO_PAGE_REFFED));
 		bio_set_flag(bio, BIO_PAGE_PINNED);
 	} else {
		WARN_ON_ONCE(bio_test_flag(bio, BIO_PAGE_PINNED) ||
			     bio_test_flag(bio, BIO_PAGE_REFFED));
	}

But given that all calls for the same iter type return the same
cleanup_mode by definition I'm not even sure we need any of this
debug checking, and might as well just do:

	if (cleanup_mode & FOLL_GET)
 		bio_set_flag(bio, BIO_PAGE_REFFED);
 	else if (cleanup_mode & FOLL_PIN)
 		bio_set_flag(bio, BIO_PAGE_PINNED);

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 8/9] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate
  2023-01-12 14:09     ` Christoph Hellwig
  2023-01-12 14:17       ` Christoph Hellwig
@ 2023-01-12 14:58       ` David Howells
  2023-01-12 15:02         ` Christoph Hellwig
  1 sibling, 1 reply; 34+ messages in thread
From: David Howells @ 2023-01-12 14:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Al Viro, Jens Axboe, Jan Kara, Christoph Hellwig,
	Matthew Wilcox, Logan Gunthorpe, linux-block, Jeff Layton,
	linux-fsdevel, linux-kernel

Christoph Hellwig <hch@infradead.org> wrote:

> But given that all calls for the same iter type return the same
> cleanup_mode by defintion I'm not even sure we need any of this
> debug checking, and might as well just do:
> 
> 	if (cleanup_mode & FOLL_GET)
>  		bio_set_flag(bio, BIO_PAGE_REFFED);
>  	else if (cleanup_mode & FOLL_PIN)
>  		bio_set_flag(bio, BIO_PAGE_PINNED);

That's kind of what I'm doing - though I've left out the else just in case the
VM decides to indicate back both FOLL_GET and FOLL_PIN.  I'm not sure why it
would but...

David


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 8/9] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate
  2023-01-12 14:58       ` David Howells
@ 2023-01-12 15:02         ` Christoph Hellwig
  0 siblings, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2023-01-12 15:02 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Al Viro, Jens Axboe, Jan Kara,
	Christoph Hellwig, Matthew Wilcox, Logan Gunthorpe, linux-block,
	Jeff Layton, linux-fsdevel, linux-kernel

On Thu, Jan 12, 2023 at 02:58:49PM +0000, David Howells wrote:
> That's kind of what I'm doing - though I've left out the else just in case the
> VM decides to indicate back both FOLL_GET and FOLL_PIN.  I'm not sure why it
> would but...

It really can't - they are exclusive.  Maybe we need an assert for that
somewhere, but we surely shouldn't try to deal with it.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 3/9] iov_iter: Use IOCB/IOMAP_WRITE if available rather than iterator direction
  2023-01-12 14:08     ` Christoph Hellwig
@ 2023-01-12 17:37       ` Al Viro
  2023-01-12 21:49         ` Bart Van Assche
  2023-01-13  5:30         ` Christoph Hellwig
  0 siblings, 2 replies; 34+ messages in thread
From: Al Viro @ 2023-01-12 17:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel

On Thu, Jan 12, 2023 at 06:08:14AM -0800, Christoph Hellwig wrote:
> On Thu, Jan 12, 2023 at 10:31:01AM +0000, David Howells wrote:
> > > And use the information in the request for this one (see patch below),
> > > and then move this patch first in the series, add an explicit direction
> > > parameter in the gup_flags to the get/pin helper and drop iov_iter_rw
> > > and the whole confusing source/dest information in the iov_iter entirely,
> > > which is a really nice big tree wide cleanup that removes redundant
> > > information.
> > 
> > Fine by me, but Al might object as I think he wanted the internal checks.  Al?
> 
> I'm happy to have another discussion, but the fact the information in
> the iov_iter is 98% redundant and various callers got it wrong anyway
> is a pretty good sign that we should drop this information.  It
> also nicely simplified the API.

I have no problem with getting rid of iov_iter_rw(), but I would really like to
keep ->data_source.  If nothing else, any place getting direction wrong is
trouble waiting to happen - something that is currently dealing only with
iovec and bvec might be given e.g. a pipe.

Speaking of which, I would really like to get rid of the kludge /dev/sg is
pulling - right now from-device requests there do the following:
	* copy the entire destination in (and better hope that nothing is mapped
write-only, etc.)
	* form a request + bio, attach the pages with the destination copy to it
	* submit
	* copy the damn thing back to destination after the completion.
The reason for that is (quoted in commit ecb554a846f8)

====
    The semantics of SG_DXFER_TO_FROM_DEV were:
       - copy user space buffer to kernel (LLD) buffer
       - do SCSI command which is assumed to be of the DATA_IN
         (data from device) variety. This would overwrite
         some or all of the kernel buffer
       - copy kernel (LLD) buffer back to the user space.
    
    The idea was to detect short reads by filling the original
    user space buffer with some marker bytes ("0xec" it would
    seem in this report). The "resid" value is a better way
    of detecting short reads but that was only added this century
    and requires co-operation from the LLD.
====

IOW, we can't tell how much we actually want to copy out unless the SCSI driver
in question is recent enough.  Note that the above was written in 2009, so
it might not be an issue these days.

Do we still have SCSI drivers that would not set the residual on bypass request
completion?  Because I would obviously very much prefer to get rid of that
copy in-overwrite-copy out thing there - given accurate information about
the transfer length it would be easy to do.
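
For reference, a plain-C sketch of the flow described above - the helper
names here are illustrative, not the actual sg driver code:

	/* current SG_DXFER_TO_FROM_DEV kludge */
	if (copy_from_user(kbuf, ubuf, len))	/* copy whole destination in */
		return -EFAULT;
	submit_scsi_request(kbuf, len);		/* DATA_IN overwrites some or all */
	wait_for_completion(&done);
	if (copy_to_user(ubuf, kbuf, len))	/* copy it all back out */
		return -EFAULT;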

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 4/9] iov_iter: Add a function to extract a page list from an iterator
  2023-01-11 14:28 ` [PATCH v5 4/9] iov_iter: Add a function to extract a page list from an iterator David Howells
  2023-01-12  7:55   ` Christoph Hellwig
@ 2023-01-12 21:15   ` Al Viro
  2023-01-12 21:36     ` Al Viro
  2023-01-13  5:26     ` Christoph Hellwig
  1 sibling, 2 replies; 34+ messages in thread
From: Al Viro @ 2023-01-12 21:15 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, John Hubbard, Matthew Wilcox, linux-fsdevel,
	linux-mm, Christoph Hellwig, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-block, linux-kernel

On Wed, Jan 11, 2023 at 02:28:05PM +0000, David Howells wrote:

> +ssize_t iov_iter_extract_pages(struct iov_iter *i,
> +			       struct page ***pages,
> +			       size_t maxsize,
> +			       unsigned int maxpages,
> +			       unsigned int gup_flags,
> +			       size_t *offset0,
> +			       unsigned int *cleanup_mode)

This cleanup_mode thing is wrong.  It's literally a trivial
function of ->user_backed and ->data_source - we don't
even need to look at the ->type.

Separate it into an inline helper and be done with that;
don't carry it all over the place.

It's really "if not user-backed => 0, otherwise it's FOLL_PIN or FOLL_GET,
depending upon the direction".
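
A minimal sketch of such a helper, assuming the name and the mapping of
direction to flag (get for a source buffer, pin for a destination, per
this series); user_backed_iter() and ->data_source are the existing bits:

	static inline unsigned int iov_iter_cleanup_mode(const struct iov_iter *i)
	{
		if (!user_backed_iter(i))
			return 0;		/* kernel-backed: list only */
		if (i->data_source)
			return FOLL_GET;	/* source buffer: take refs */
		return FOLL_PIN;		/* destination buffer: pin */
	}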

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 4/9] iov_iter: Add a function to extract a page list from an iterator
  2023-01-12 21:15   ` Al Viro
@ 2023-01-12 21:36     ` Al Viro
  2023-01-13  5:26     ` Christoph Hellwig
  1 sibling, 0 replies; 34+ messages in thread
From: Al Viro @ 2023-01-12 21:36 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, John Hubbard, Matthew Wilcox, linux-fsdevel,
	linux-mm, Christoph Hellwig, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-block, linux-kernel

On Thu, Jan 12, 2023 at 09:15:50PM +0000, Al Viro wrote:
> On Wed, Jan 11, 2023 at 02:28:05PM +0000, David Howells wrote:
> 
> > +ssize_t iov_iter_extract_pages(struct iov_iter *i,
> > +			       struct page ***pages,
> > +			       size_t maxsize,
> > +			       unsigned int maxpages,
> > +			       unsigned int gup_flags,
> > +			       size_t *offset0,
> > +			       unsigned int *cleanup_mode)
> 
> This cleanup_mode thing is wrong.  It's literally a trivial
> function of ->user_backed and ->data_source - we don't
> even need to look at the ->type.
> 
> Separate it into an inline helper and be done with that;
> don't carry it all over the place.
> 
> It's really "if not user-backed => 0, otherwise it's FOLL_PIN or FOLL_GET,
> depending upon the direction".

Seriously, it would be easier to follow that way; if you really insist upon
keeping these calling conventions, at least put the calculation in one place -
don't make readers chase down into every sodding helper to check if they
do what you'd expect them to do.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 3/9] iov_iter: Use IOCB/IOMAP_WRITE if available rather than iterator direction
  2023-01-12 17:37       ` Al Viro
@ 2023-01-12 21:49         ` Bart Van Assche
  2023-01-13  5:32           ` Christoph Hellwig
  2023-01-14  1:34           ` Martin K. Petersen
  2023-01-13  5:30         ` Christoph Hellwig
  1 sibling, 2 replies; 34+ messages in thread
From: Bart Van Assche @ 2023-01-12 21:49 UTC (permalink / raw)
  To: Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Martin K. Petersen, Douglas Gilbert

On 1/12/23 09:37, Al Viro wrote:
> On Thu, Jan 12, 2023 at 06:08:14AM -0800, Christoph Hellwig wrote:
>> On Thu, Jan 12, 2023 at 10:31:01AM +0000, David Howells wrote:
>>>> And use the information in the request for this one (see patch below),
>>>> and then move this patch first in the series, add an explicit direction
>>>> parameter in the gup_flags to the get/pin helper and drop iov_iter_rw
>>>> and the whole confusing source/dest information in the iov_iter entirely,
>>>> which is a really nice big tree-wide cleanup that removes redundant
>>>> information.
>>>
>>> Fine by me, but Al might object as I think he wanted the internal checks.  Al?
>>
>> I'm happy to have another discussion, but the fact that the information in
>> the iov_iter is 98% redundant and various callers got it wrong anyway
>> is a pretty good sign that we should drop this information.  It
>> also nicely simplifies the API.
> 
> I have no problem with getting rid of iov_iter_rw(), but I would really like to
> keep ->data_source.  If nothing else, any place getting the direction wrong is
> trouble waiting to happen - something that is currently dealing only with
> iovec and bvec might be given e.g. a pipe.
> 
> Speaking of which, I would really like to get rid of the kludge /dev/sg is
> pulling - right now from-device requests there do the following:
> 	* copy the entire destination in (and better hope that nothing is mapped
> write-only, etc.)
> 	* form a request + bio, attach the pages with the destination copy to it
> 	* submit
> 	* copy the damn thing back to destination after the completion.
> The reason for that is (quoted in commit ecb554a846f8)
> 
> ====
>      The semantics of SG_DXFER_TO_FROM_DEV were:
>         - copy user space buffer to kernel (LLD) buffer
>         - do SCSI command which is assumed to be of the DATA_IN
>           (data from device) variety. This would overwrite
>           some or all of the kernel buffer
>         - copy kernel (LLD) buffer back to the user space.
>      
>      The idea was to detect short reads by filling the original
>      user space buffer with some marker bytes ("0xec" it would
>      seem in this report). The "resid" value is a better way
>      of detecting short reads but that was only added this century
>      and requires co-operation from the LLD.
> ====
> 
> IOW, we can't tell how much we actually want to copy out unless the SCSI driver
> in question is recent enough.  Note that the above was written in 2009, so
> it might not be an issue these days.
> 
> Do we still have SCSI drivers that would not set the residual on bypass request
> completion?  Because I would obviously very much prefer to get rid of that
> copy in-overwrite-copy out thing there - given accurate information about
> the transfer length it would be easy to do.

(+Martin and Doug)

I'm not sure that we still need the double copy in the sg driver. It
seems unlikely to me that there is user space software that relies on
finding "0xec" in bytes not originating from a SCSI device.
Additionally, SCSI drivers that do not support residuals should be
something from the past.

Others may be better qualified to comment on this topic.

Bart.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 3/9] iov_iter: Use IOCB/IOMAP_WRITE if available rather than iterator direction
  2023-01-11 14:27 ` [PATCH v5 3/9] iov_iter: Use IOCB/IOMAP_WRITE if available rather than iterator direction David Howells
  2023-01-12  7:54   ` Christoph Hellwig
  2023-01-12 10:31   ` David Howells
@ 2023-01-12 21:56   ` Al Viro
  2023-01-13  5:33     ` Christoph Hellwig
  2 siblings, 1 reply; 34+ messages in thread
From: Al Viro @ 2023-01-12 21:56 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

On Wed, Jan 11, 2023 at 02:27:58PM +0000, David Howells wrote:
> If a kiocb or iomap_iter is available, then use the IOCB_WRITE flag or the
> IOMAP_WRITE flag to determine whether we're writing rather than the
> iterator direction flag.
> 
> This allows all but three of the users of iov_iter_rw() to be removed: a
> consistency check and a warning statement in cifs, and one user in the block
> layer that has neither available.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Al Viro <viro@zeniv.linux.org.uk>
> Link: https://lore.kernel.org/r/167305163159.1521586.9460968250704377087.stgit@warthog.procyon.org.uk/ # v4

Incidentally, I'd suggest iocb_is_write(iocb) - just look at the number of
places where you end up with clumsy parentheses...
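
Something like the following, presumably - the helper itself is what's
being proposed; IOCB_WRITE and struct kiocb are the existing flag and
structure:

	static inline bool iocb_is_write(const struct kiocb *iocb)
	{
		return iocb->ki_flags & IOCB_WRITE;
	}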

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 4/9] iov_iter: Add a function to extract a page list from an iterator
  2023-01-12 21:15   ` Al Viro
  2023-01-12 21:36     ` Al Viro
@ 2023-01-13  5:26     ` Christoph Hellwig
  1 sibling, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2023-01-13  5:26 UTC (permalink / raw)
  To: Al Viro
  Cc: David Howells, Christoph Hellwig, John Hubbard, Matthew Wilcox,
	linux-fsdevel, linux-mm, Christoph Hellwig, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-block, linux-kernel

On Thu, Jan 12, 2023 at 09:15:50PM +0000, Al Viro wrote:
> This cleanup_mode thing is wrong.  It's literally a trivial
> function of ->user_backed and ->data_source - we don't
> even need to look at the ->type.
> 
> Separate it into an inline helper and be done with that;
> don't carry it all over the place.
> 
> It's really "if not user-backed => 0, otherwise it's FOLL_PIN or FOLL_GET,
> depending upon the direction".

That would definitely clean up the bio code as well...

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 3/9] iov_iter: Use IOCB/IOMAP_WRITE if available rather than iterator direction
  2023-01-12 17:37       ` Al Viro
  2023-01-12 21:49         ` Bart Van Assche
@ 2023-01-13  5:30         ` Christoph Hellwig
  1 sibling, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2023-01-13  5:30 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, David Howells, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel, Linus Torvalds

On Thu, Jan 12, 2023 at 05:37:26PM +0000, Al Viro wrote:
> I have no problem with getting rid of iov_iter_rw(), but I would really like to
> keep ->data_source.  If nothing else, any place getting the direction wrong is
> trouble waiting to happen - something that is currently dealing only with
> iovec and bvec might be given e.g. a pipe.

But the calling code knows the direction; in fact, it is generally
encoded in the actual operation we do on the iov_iter.  The only
exception is iov_iter_get_pages and friends.  So I'd much rather
make the last operation that does not explicitly encode a direction
explicit, rather than carrying this duplicate information.

The direction of the iov_iter has been a major source of confusion,
and getting rid of it removes that confusion.
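
To illustrate: most iov_iter operations already name the direction in
the call itself, so only the extraction path would need telling - the
flag below is purely hypothetical:

	copy_to_iter(kbuf, len, iter);		/* iter is a destination */
	copy_from_iter(kbuf, len, iter);	/* iter is a source */

	/* hypothetical explicit-direction flag for the one exception: */
	iov_iter_extract_pages(iter, &pages, maxsize, maxpages,
			       gup_flags | FOLL_SOURCE_BUF,
			       &offset0, &cleanup_mode);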

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 3/9] iov_iter: Use IOCB/IOMAP_WRITE if available rather than iterator direction
  2023-01-12 21:49         ` Bart Van Assche
@ 2023-01-13  5:32           ` Christoph Hellwig
  2023-01-14  1:33             ` Martin K. Petersen
  2023-01-14  1:34           ` Martin K. Petersen
  1 sibling, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2023-01-13  5:32 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Al Viro, Christoph Hellwig, David Howells, Matthew Wilcox,
	Jens Axboe, Jan Kara, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel, Martin K. Petersen,
	Douglas Gilbert

On Thu, Jan 12, 2023 at 01:49:14PM -0800, Bart Van Assche wrote:
> I'm not sure that we still need the double copy in the sg driver. It seems
> unlikely to me that there is user space software that relies on finding
> "0xec" in bytes not originating from a SCSI device. Additionally, SCSI
> drivers that do not support residuals should be something from the past.
> 
> Others may be better qualified to comment on this topic.

Yeah.  And that weird (ab)use of blk_rq_map_user_iov into preallocated
pages in sg has been a constant source of pain.  I'd be very happy to
make it match the generic SG_IO implementation and either do
get_user_pages or the normal bounce buffering implemented in the
common code.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 3/9] iov_iter: Use IOCB/IOMAP_WRITE if available rather than iterator direction
  2023-01-12 21:56   ` Al Viro
@ 2023-01-13  5:33     ` Christoph Hellwig
  0 siblings, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2023-01-13  5:33 UTC (permalink / raw)
  To: Al Viro
  Cc: David Howells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

On Thu, Jan 12, 2023 at 09:56:25PM +0000, Al Viro wrote:
> Incidentally, I'd suggest iocb_is_write(iocb) - just look at the number of
> places where you end up with clumsy parentheses...

Agreed, a little helper won't hurt.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 3/9] iov_iter: Use IOCB/IOMAP_WRITE if available rather than iterator direction
  2023-01-13  5:32           ` Christoph Hellwig
@ 2023-01-14  1:33             ` Martin K. Petersen
  0 siblings, 0 replies; 34+ messages in thread
From: Martin K. Petersen @ 2023-01-14  1:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Al Viro, David Howells, Matthew Wilcox,
	Jens Axboe, Jan Kara, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel, Martin K. Petersen,
	Douglas Gilbert


Christoph,

> Yeah.  And that weird (ab)use of blk_rq_map_user_iov into preallocated
> pages in sg has been a constant source of pain.  I'd be very happy to
> make it match the generic SG_IO implementation and either do
> get_user_pages or the normal bounce buffering implemented in the
> common code.

Yes, it would be nice to get that cleaned up.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 3/9] iov_iter: Use IOCB/IOMAP_WRITE if available rather than iterator direction
  2023-01-12 21:49         ` Bart Van Assche
  2023-01-13  5:32           ` Christoph Hellwig
@ 2023-01-14  1:34           ` Martin K. Petersen
  2023-01-14  1:58             ` Al Viro
  1 sibling, 1 reply; 34+ messages in thread
From: Martin K. Petersen @ 2023-01-14  1:34 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Al Viro, Christoph Hellwig, David Howells, Matthew Wilcox,
	Jens Axboe, Jan Kara, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel, Martin K. Petersen,
	Douglas Gilbert


Bart,

> I'm not sure that we still need the double copy in the sg driver. It
> seems unlikely to me that there is user space software that relies on
> finding "0xec" in bytes not originating from a SCSI
> device. Additionally, SCSI drivers that do not support residuals
> should be something from the past.

Yeah. I'm not aware of anything that relies on this still. But obviously
Doug has more experience in the app dependency department.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 3/9] iov_iter: Use IOCB/IOMAP_WRITE if available rather than iterator direction
  2023-01-14  1:34           ` Martin K. Petersen
@ 2023-01-14  1:58             ` Al Viro
  0 siblings, 0 replies; 34+ messages in thread
From: Al Viro @ 2023-01-14  1:58 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Bart Van Assche, Christoph Hellwig, David Howells,
	Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Douglas Gilbert

On Fri, Jan 13, 2023 at 08:34:50PM -0500, Martin K. Petersen wrote:
> 
> Bart,
> 
> > I'm not sure that we still need the double copy in the sg driver. It
> > seems unlikely to me that there is user space software that relies on
> > finding "0xec" in bytes not originating from a SCSI
> > device. Additionally, SCSI drivers that do not support residuals
> > should be something from the past.
> 
> Yeah. I'm not aware of anything that relies on this still. But obviously
> Doug has more experience in the app dependency department.

Are we guaranteed to know the exact amount of data that got transferred
for all surviving drivers?  If we do, we can do an accurate copy-out and all
apps will keep seeing what they currently do.  If we don't, the best we
can do is replace copy-in + IO + copy-out with memset + IO + copy-out.
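
As a sketch of the two outcomes (helper names illustrative, as before):

	/* with a reliable residual: accurate partial copy-out */
	submit_scsi_request(kbuf, len);
	wait_for_completion(&done);
	if (copy_to_user(ubuf, kbuf, len - resid))
		return -EFAULT;

	/* without one: memset replaces the initial copy-in */
	memset(kbuf, 0, len);
	submit_scsi_request(kbuf, len);
	wait_for_completion(&done);
	if (copy_to_user(ubuf, kbuf, len))
		return -EFAULT;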

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2023-01-14  1:59 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-11 14:27 [PATCH v5 0/9] iov_iter: Add extraction helpers David Howells
2023-01-11 14:27 ` [PATCH v5 1/9] iov_iter: Change the direction macros into an enum David Howells
2023-01-11 14:27 ` [PATCH v5 2/9] iov_iter: Use the direction in the iterator functions David Howells
2023-01-11 14:27 ` [PATCH v5 3/9] iov_iter: Use IOCB/IOMAP_WRITE if available rather than iterator direction David Howells
2023-01-12  7:54   ` Christoph Hellwig
2023-01-12 10:31   ` David Howells
2023-01-12 14:08     ` Christoph Hellwig
2023-01-12 17:37       ` Al Viro
2023-01-12 21:49         ` Bart Van Assche
2023-01-13  5:32           ` Christoph Hellwig
2023-01-14  1:33             ` Martin K. Petersen
2023-01-14  1:34           ` Martin K. Petersen
2023-01-14  1:58             ` Al Viro
2023-01-13  5:30         ` Christoph Hellwig
2023-01-12 21:56   ` Al Viro
2023-01-13  5:33     ` Christoph Hellwig
2023-01-11 14:28 ` [PATCH v5 4/9] iov_iter: Add a function to extract a page list from an iterator David Howells
2023-01-12  7:55   ` Christoph Hellwig
2023-01-12 21:15   ` Al Viro
2023-01-12 21:36     ` Al Viro
2023-01-13  5:26     ` Christoph Hellwig
2023-01-11 14:28 ` [PATCH v5 5/9] netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator David Howells
2023-01-11 14:28 ` [PATCH v5 6/9] netfs: Add a function to extract an iterator into a scatterlist David Howells
2023-01-11 14:28 ` [PATCH v5 7/9] bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning David Howells
2023-01-12  7:32   ` Christoph Hellwig
2023-01-11 14:28 ` [PATCH v5 8/9] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate David Howells
2023-01-12  7:44   ` Christoph Hellwig
2023-01-12 10:28   ` David Howells
2023-01-12 14:09     ` Christoph Hellwig
2023-01-12 14:17       ` Christoph Hellwig
2023-01-12 14:58       ` David Howells
2023-01-12 15:02         ` Christoph Hellwig
2023-01-11 14:28 ` [PATCH v5 9/9] bio: Fix bio_flagged() so that it can be combined David Howells
2023-01-12  7:33   ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.