* [RFC][CFT][PATCHSET] iov_iter stuff
@ 2022-06-22  4:10 Al Viro
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                   ` (2 more replies)
  0 siblings, 3 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:10 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

	There's a bunch of pending iov_iter-related work; most of that had
been posted, but only one part got anything resembling a review.  Currently
it seems to be working, but it obviously needs review and testing.

	It's split into several subseries; the entire series can be observed
as v5.19-rc2..#work.iov_iter_get_pages.  Description follows; individual
patches will be posted^Wmailbombed in followups.

	This stuff is not in -next yet; I'd like to put it there, so if you
see any problems - please yell.

	One thing not currently in there, but to be added very soon is
iov_iter_find_pages{,_alloc}() - analogue of iov_iter_get_pages(), except that
it only grabs page references for userland-backed flavours.  The callers,
of course, are responsible for keeping the underlying object(s) alive for as
long as they are using the results.  Quite a few of iov_iter_get_pages()
callers would be fine with that.  Moreover, unlike iov_iter_get_pages() this
could be allowed for ITER_KVEC, potentially eliminating several places where
we special-case the treatment of ITER_KVEC.

	Another pending thing is integration with the cifs and ceph series (dhowells
and jlayton resp.) and probably io_uring as well.

----------------------------------------------------------------------------

	Part 1, #work.9p: [rc1-based]

1/44: 9p: handling Rerror without copy_from_iter_full()
	Self-contained fix, should be easy to backport.  What happens
there is that arrival of Rerror in response to zerocopy read or readdir
ends up with the error string in the place where the actual data would've
gone in case of success.  It needs to be extracted, and copy_from_iter_full()
is only for data-source iterators, not for e.g. ITER_PIPE.  And ITER_PIPE
can be used with those...

----------------------------------------------------------------------------

	Part 2, #work.iov_iter: [rc1-based]

Dealing with the overhead in new_sync_read()/new_sync_write(), mostly.
Several things there - one is that calculation of iocb flags can be
made cheaper, another is that single-segment iovec is sufficiently
common to be worth turning into a new iov_iter flavour (ITER_UBUF).
With all that, the total size of iov_iter.c goes down, mostly due to
removal of magic in iovec copy_page_to_iter()/copy_page_from_iter().
Generic variant works for those nowadays...

This had been posted two weeks ago, got a reasonable amount of comments.

2/44: No need of likely/unlikely on calls of check_copy_size()
	not just in uio.h; the thing is inlined and it has unlikely on
all paths leading to return false

3/44:  teach iomap_dio_rw() to suppress dsync
	new flag for iomap_dio_rw(), telling it to suppress generic_write_sync()

4/44: btrfs: use IOMAP_DIO_NOSYNC
	use the above instead of currently used kludges.

5/44: struct file: use anonymous union member for rcuhead and llist
	"f_u" might have been an amusing name, but... we expect anon unions to
work.

6/44: iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC
	makes iocb_flags() much cheaper, and it's easier to keep track of
the places where it can change.

7/44: keep iocb_flags() result cached in struct file
	that, along with the previous commit, reduces the overhead of
new_sync_{read,write}().  struct file doesn't grow - we can keep that
thing in the same anon union where rcuhead and llist live; that field
gets used only before ->f_count reaches zero while the other two are
used only after ->f_count has reached zero.
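
A toy model of that lifetime-based sharing (names here are illustrative, not the actual struct file layout):

```c
#include <assert.h>
#include <stddef.h>

/* Fields with disjoint lifetimes can share storage in an anonymous
 * union: one is only meaningful while the object is live, the other
 * only during teardown. */
struct toy_file {
	int refcount;
	union {
		unsigned int iocb_flags;	/* used only while refcount > 0 */
		struct toy_file *free_next;	/* used only after refcount hits 0 */
	};
};

static unsigned int toy_iocb_flags(const struct toy_file *f)
{
	assert(f->refcount > 0);	/* flags are dead once the file is */
	return f->iocb_flags;
}

static void toy_defer_free(struct toy_file *f, struct toy_file **list)
{
	assert(f->refcount == 0);	/* only the teardown path gets here */
	f->free_next = *list;		/* reuses the storage of iocb_flags */
	*list = f;
}
```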

8/44: copy_page_{to,from}_iter(): switch iovec variants to generic
	kmap_local_page() allows that.  And it kills quite a bit of
code.

9/44: new iov_iter flavour - ITER_UBUF
	iovec analogue, with a single segment.  That case is fairly common and it
can be handled with less overhead than a full-blown iovec.

10/44: switch new_sync_{read,write}() to ITER_UBUF
	... and this is why it is so common.  Further reduction of overhead
for new_sync_{read,write}().

11/44: iov_iter_bvec_advance(): don't bother with bvec_iter
	AFAICS, variant similar to what we do for iovec/kvec generates better
code.  Needs profiling, obviously.

----------------------------------------------------------------------------

	Part 3, #fixes [-rc2-based]

12/44: fix short copy handling in copy_mc_pipe_to_iter()
	Minimal version of the fix; it's replaced with a prettier one in the next
series, but the replacement is not backport fodder.

----------------------------------------------------------------------------

	Part 4, #work.ITER_PIPE [on top of merge of previous branches]

ITER_PIPE handling had never been pretty, but by now it has become
really obfuscated and hard to read.  Untangle it a bit.  Posted last
weekend, some brainos fixed since then.

13/44: splice: stop abusing iov_iter_advance() to flush a pipe
	A really odd (ab)use of iov_iter_advance() - in case of error
generic_file_splice_read() wants to free all pipe buffers ->read_iter()
has produced.  Yes, forcibly resetting ->head and ->iov_offset to
original values and calling iov_iter_advance(i, 0) will trigger
pipe_advance(), which will trigger pipe_truncate(), which will free
buffers.  Or we could just go ahead and free the same buffers;
pipe_discard_from() does exactly that, no iov_iter stuff needs to
be involved.

14/44: ITER_PIPE: helper for getting pipe buffer by index
	In a lot of places we want to find a pipe_buffer by index;
the expression is convoluted and hard to read.  Provide an inline helper
for that, convert trivial open-coded cases.  Eventually *all*
open-coded instances in iov_iter.c will be gone.
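
As a sketch of what such a helper does (a userspace model; the real one operates on struct pipe_inode_info, and the structure layout here is an assumption based on the description):

```c
#include <assert.h>

/* Simplified stand-ins; the real structures carry much more state. */
struct pipe_buffer { unsigned int offset, len; };

struct toy_pipe {
	unsigned int ring_size;		/* power of two */
	struct pipe_buffer *bufs;
};

/* Map a monotonically growing slot index (head, tail, ...) to its
 * buffer - replaces open-coded &pipe->bufs[i & (pipe->ring_size - 1)]
 * expressions scattered around iov_iter.c. */
static struct pipe_buffer *pipe_buf(const struct toy_pipe *pipe,
				    unsigned int slot)
{
	return &pipe->bufs[slot & (pipe->ring_size - 1)];
}
```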

15/44: ITER_PIPE: helpers for adding pipe buffers
        There are only two kinds of pipe_buffer in the area used by ITER_PIPE.
* anonymous - copy_to_iter() et.al. end up creating those and copying data
  there.  They have zero ->offset, and their ->ops points to
  default_pipe_buf_ops.
* zero-copy ones - those come from copy_page_to_iter(), and page comes from
  caller.  ->offset is also caller-supplied - it might be non-zero.
  ->ops points to page_cache_pipe_buf_ops.
        Move creation and insertion of those into helpers -
push_anon(pipe, size) and push_page(pipe, page, offset, size) resp., separating
them from the "could we avoid creating a new buffer by merging with the current
head?" logic.

16/44: ITER_PIPE: allocate buffers as we go in copy-to-pipe primitives
        New helper: append_pipe().  Extends the last buffer if possible,
allocates a new one otherwise.  Returns page and offset in it on success,
NULL on failure.  iov_iter is advanced past the data we've got.
        Use that instead of push_pipe() in copy-to-pipe primitives;
they get simpler that way.  Handling of short copy (in "mc" one)
is done simply by iov_iter_revert() - iov_iter is in consistent
state after that one, so we can use that.

17/44: ITER_PIPE: fold push_pipe() into __pipe_get_pages()
        Expand the only remaining call of push_pipe() (in
__pipe_get_pages()), combine it with the page-collecting loop there.
We don't need to bother with i->count checks or calculation of offset
in the first page - the caller already has done that.
        Note that the only reason it's not a loop doing append_pipe()
is that append_pipe() is advancing, while iov_iter_get_pages() is not.
As soon as it switches to saner semantics, this thing will switch
to using append_pipe().

18/44: ITER_PIPE: lose iter_head argument of __pipe_get_pages()
	Redundant.

19/44: ITER_PIPE: clean pipe_advance() up
        Don't bother with pipe_truncate(); adjust the buffer
length just as we decide it'll be the last one, then use
pipe_discard_from() to release buffers past that one.

20/44: ITER_PIPE: clean iov_iter_revert()
        Fold pipe_truncate() in there, clean the things up.

21/44: ITER_PIPE: cache the type of last buffer
        We often need to find whether the last buffer is anon or not, and
currently it's rather clumsy:
        check if ->iov_offset is non-zero (i.e. that pipe is not empty)
        if so, get the corresponding pipe_buffer and check its ->ops
        if it's &default_pipe_buf_ops, we have an anon buffer.
Let's replace the use of ->iov_offset (which is nowhere near similar to
its role for other flavours) with a signed field (->last_offset), with
the following rules:
        empty, no buffers occupied:             0
        anon, with bytes up to N-1 filled:      N
        zero-copy, with bytes up to N-1 filled: -N
That way abs(i->last_offset) is equal to what used to be in i->iov_offset
and empty vs. anon vs. zero-copy can be distinguished by the sign of
i->last_offset.
        Checks for "should we extend the last buffer or should we start
a new one?" become easier to follow that way.
        Note that most of the operations can only be done in a sane
state - i.e. when the pipe has nothing past the current position of
iterator.  About the only thing that could be done outside of that
state is iov_iter_advance(), which transitions to the sane state by
truncating the pipe.  There are only two cases where we leave the
sane state:
        1) iov_iter_get_pages()/iov_iter_get_pages_alloc().  Will be
dealt with later, when we make get_pages advancing - the callers are
actually happier that way.
        2) iov_iter copied, then something is put into the copy.  Since
they share the underlying pipe, the original gets behind.  When we
decide that we are done with the copy (original is not usable until then)
we advance the original.  direct_io used to be done that way; nowadays
it operates on the original and we do iov_iter_revert() to discard
the excessive data.  At the moment there's nothing in the kernel that
could do that to ITER_PIPE iterators, so this reason for insane state
is theoretical right now.
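
The ->last_offset encoding above can be modelled in a few lines (userspace sketch; the helper names are made up for illustration):

```c
#include <assert.h>
#include <stdlib.h>

/* Model of the ->last_offset convention:
 *   0   pipe empty
 *   N   anonymous buffer, bytes 0..N-1 filled
 *  -N   zero-copy buffer, bytes 0..N-1 filled
 */
enum last_kind { LAST_NONE, LAST_ANON, LAST_ZC };

static int encode_last(enum last_kind kind, int filled)
{
	if (kind == LAST_NONE)
		return 0;
	return kind == LAST_ANON ? filled : -filled;
}

/* What used to be read out of ->iov_offset */
static int last_filled(int last_offset)
{
	return abs(last_offset);
}

/* "Should we extend the last buffer?" - only a partially filled
 * anonymous buffer qualifies; a sign check replaces the old
 * ->ops == &default_pipe_buf_ops comparison. */
static int can_extend(int last_offset, int page_size)
{
	return last_offset > 0 && last_offset < page_size;
}
```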

22/44: ITER_PIPE: fold data_start() and pipe_space_for_user() together
        All their callers are next to each other; all of them want
the total amount of pages and, possibly, the offset in the partial
final buffer.
        Combine into a new helper (pipe_npages()), fix the
bogosity in pipe_space_for_user(), while we are at it.

----------------------------------------------------------------------------

	Part 5, #work.unify_iov_iter_get_pages [on top of previous]

iov_iter_get_pages() and iov_iter_get_pages_alloc() have a lot of code
duplication and are bloody hard to read.  With some massage duplication
can be eliminated, along with some of the cruft accumulated there.

	Flavour-independent arguments validation and, for ..._alloc(),
cleanup handling on failure:
23/44: iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT
24/44: iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper
25/44: iov_iter_get_pages(): sanity-check arguments

	Mechanically merge parallel ..._get_pages() and ..._get_pages_alloc().
26/44: unify pipe_get_pages() and pipe_get_pages_alloc()
27/44: unify xarray_get_pages() and xarray_get_pages_alloc()
28/44: unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts

	Decrufting for XARRAY:
29/44: ITER_XARRAY: don't open-code DIV_ROUND_UP()

	Decrufting for UBUF/IOVEC/BVEC: that bunch suffers from really convoluted
helpers; untangling those takes a bit of care, so I've carved that up into fairly
small chunks.  Could be collapsed together, but...
30/44: iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment()
31/44: iov_iter: first_{iovec,bvec}_segment() - simplify a bit
32/44: iov_iter: massage calling conventions for first_{iovec,bvec}_segment()
33/44: found_iovec_segment(): just return address

	Decrufting for PIPE:
34/44: fold __pipe_get_pages() into pipe_get_pages()

	Now we can finally get a helper encapsulating the array allocations
the right way:
35/44: iov_iter: saner helper for page array allocation

----------------------------------------------------------------------------

	Part 6, #work.iov_iter_get_pages-advance [on top of previous]
Convert iov_iter_get_pages{,_alloc}() to iterator-advancing semantics.  

	Most of the callers follow successful ...get_pages... with advance
by the amount it had reported.  For some it's unconditional, for some it
might end up being less in some cases.  All of them would be fine with
advancing variants of those primitives - those that might want to advance
by less than reported could easily use revert by the difference of those
amounts.
	Rather than doing a flagday change (they are exported and signatures
remain unchanged), replacement variants are added (iov_iter_get_pages2()
and iov_iter_get_pages_alloc2(), initially as wrappers).  By the end of
the series everything is converted to those and the old ones are removed.

	Makes for simpler rules for ITER_PIPE, among other things, and
advancing semantics is consistent with all data-copying primitives.
Series is pretty obvious - introduce variants with new semantics, switch
users one by one, fold the old variants into new ones.
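
The conversion pattern can be sketched with a toy iterator (names are made up; the real primitives return mapped pages, not just a byte count):

```c
#include <assert.h>
#include <stddef.h>

/* Toy iterator: position plus bytes remaining. */
struct toy_iter { size_t pos, count; };

/* Old-style primitive: reports what it mapped, leaves the iterator
 * where it was. */
static size_t toy_get_pages(struct toy_iter *i, size_t maxsize)
{
	return i->count < maxsize ? i->count : maxsize;
}

static void toy_advance(struct toy_iter *i, size_t n)
{
	i->pos += n;
	i->count -= n;
}

static void toy_revert(struct toy_iter *i, size_t n)
{
	i->pos -= n;
	i->count += n;
}

/* New-style primitive, initially just a wrapper: advance past what
 * was reported.  A caller that ends up consuming fewer bytes reverts
 * by the difference. */
static size_t toy_get_pages2(struct toy_iter *i, size_t maxsize)
{
	size_t n = toy_get_pages(i, maxsize);
	toy_advance(i, n);
	return n;
}
```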

36/44: iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()
37/44: block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
38/44: iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()
39/44: af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages()
40/44: 9p: convert to advancing variant of iov_iter_get_pages_alloc()
41/44: ceph: switch the last caller of iov_iter_get_pages_alloc()
42/44: get rid of non-advancing variants

----------------------------------------------------------------------------

	Part 7, #work.iov_iter_get_pages [on top of previous]
Trivial followups, with more to be added here...

43/44: pipe_get_pages(): switch to append_pipe()
44/44: expand those iov_iter_advance()...

Overall diffstat:

 arch/powerpc/include/asm/uaccess.h |   2 +-
 arch/s390/include/asm/uaccess.h    |   4 +-
 block/bio.c                        |  15 +-
 block/blk-map.c                    |   7 +-
 block/fops.c                       |   8 +-
 crypto/af_alg.c                    |   3 +-
 crypto/algif_hash.c                |   5 +-
 drivers/nvme/target/io-cmd-file.c  |   2 +-
 drivers/vhost/scsi.c               |   4 +-
 fs/aio.c                           |   2 +-
 fs/btrfs/file.c                    |  19 +-
 fs/btrfs/inode.c                   |   3 +-
 fs/ceph/addr.c                     |   2 +-
 fs/ceph/file.c                     |   5 +-
 fs/cifs/file.c                     |   8 +-
 fs/cifs/misc.c                     |   3 +-
 fs/direct-io.c                     |   7 +-
 fs/fcntl.c                         |   1 +
 fs/file_table.c                    |  17 +-
 fs/fuse/dev.c                      |   7 +-
 fs/fuse/file.c                     |   7 +-
 fs/gfs2/file.c                     |   2 +-
 fs/io_uring.c                      |   2 +-
 fs/iomap/direct-io.c               |  21 +-
 fs/nfs/direct.c                    |   8 +-
 fs/open.c                          |   1 +
 fs/read_write.c                    |   6 +-
 fs/splice.c                        |  54 +-
 fs/zonefs/super.c                  |   2 +-
 include/linux/fs.h                 |  21 +-
 include/linux/iomap.h              |   6 +
 include/linux/pipe_fs_i.h          |  29 +-
 include/linux/uaccess.h            |   4 +-
 include/linux/uio.h                |  50 +-
 lib/iov_iter.c                     | 993 ++++++++++++++-----------------------
 mm/shmem.c                         |   2 +-
 net/9p/client.c                    | 125 +----
 net/9p/protocol.c                  |   3 +-
 net/9p/trans_virtio.c              |  37 +-
 net/core/datagram.c                |   3 +-
 net/core/skmsg.c                   |   3 +-
 net/rds/message.c                  |   3 +-
 net/tls/tls_sw.c                   |   4 +-
 43 files changed, 599 insertions(+), 911 deletions(-)


* [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full()
  2022-06-22  4:10 [RFC][CFT][PATCHSET] iov_iter stuff Al Viro
@ 2022-06-22  4:15 ` Al Viro
  2022-06-22  4:15   ` [PATCH 02/44] No need of likely/unlikely on calls of check_copy_size() Al Viro
                     ` (45 more replies)
  2022-06-23 15:21 ` [RFC][CFT][PATCHSET] iov_iter stuff David Howells
  2022-06-28 12:25 ` Jeff Layton
  2 siblings, 46 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

        p9_client_zc_rpc()/p9_check_zc_errors() are playing fast
and loose with copy_from_iter_full().

	Reading from file is done by sending Tread request.  Response
consists of fixed-sized header (including the amount of data actually
read) followed by the data itself.

	For zero-copy case we arrange the things so that the first
11 bytes of reply go into the fixed-sized buffer, with the rest going
straight into the pages we want to read into.

	What makes things inconvenient is that the sglist describing
what should go where has to be set *before* the reply arrives.  As
a result, if the reply is an error, things get interesting.  On success
we get
	size[4] Rread tag[2] count[4] data[count]
For errors the layout varies depending upon the protocol variant -
in original 9P and 9P2000 it's
	size[4] Rerror tag[2] len[2] error[len]
in 9P2000.U
	size[4] Rerror tag[2] len[2] error[len] errno[4]
in 9P2000.L
	size[4] Rlerror tag[2] errno[4]

	The last case is nice and simple - we have an 11-byte response
that fits into the fixed-sized buffer we hoped to get an Rread into.
In the other two, though, we get a variable-length string spilling into
the pages we'd prepared for the data to be read.

	Had that been in fixed-sized buffer (which is actually 4K),
we would've dealt with that the same way we handle non-zerocopy case.
However, for zerocopy it doesn't end up there, so we need to copy it
from those pages.

	The trouble is, by the time we get around to that, the
references to the pages in question are already dropped.  As a result,
p9_check_zc_errors() tries to get the data using copy_from_iter_full().
Unfortunately, the iov_iter it's trying to read from might *NOT* be
capable of that.  It is, after all, a data destination, not a data source.
In particular, if it's an ITER_PIPE one, copy_from_iter_full() will
simply fail.

	In ->zc_request() itself we do have those pages and dealing with
the problem in there would be a simple matter of memcpy_from_page()
into the fixed-sized buffer.  Moreover, it isn't hard to recognize
the (rare) case when such copying is needed.  That way we get rid of
p9_check_zc_errors() entirely - p9_check_errors() can be used instead
both for zero-copy and non-zero-copy cases.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 net/9p/client.c       | 86 +------------------------------------------
 net/9p/trans_virtio.c | 34 +++++++++++++++++
 2 files changed, 35 insertions(+), 85 deletions(-)

diff --git a/net/9p/client.c b/net/9p/client.c
index 8bba0d9cf975..d403085b9ef5 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -550,90 +550,6 @@ static int p9_check_errors(struct p9_client *c, struct p9_req_t *req)
 	return err;
 }
 
-/**
- * p9_check_zc_errors - check 9p packet for error return and process it
- * @c: current client instance
- * @req: request to parse and check for error conditions
- * @uidata: external buffer containing error
- * @in_hdrlen: Size of response protocol buffer.
- *
- * returns error code if one is discovered, otherwise returns 0
- *
- * this will have to be more complicated if we have multiple
- * error packet types
- */
-
-static int p9_check_zc_errors(struct p9_client *c, struct p9_req_t *req,
-			      struct iov_iter *uidata, int in_hdrlen)
-{
-	int err;
-	int ecode;
-	s8 type;
-	char *ename = NULL;
-
-	err = p9_parse_header(&req->rc, NULL, &type, NULL, 0);
-	/* dump the response from server
-	 * This should be after parse_header which poplulate pdu_fcall.
-	 */
-	trace_9p_protocol_dump(c, &req->rc);
-	if (err) {
-		p9_debug(P9_DEBUG_ERROR, "couldn't parse header %d\n", err);
-		return err;
-	}
-
-	if (type != P9_RERROR && type != P9_RLERROR)
-		return 0;
-
-	if (!p9_is_proto_dotl(c)) {
-		/* Error is reported in string format */
-		int len;
-		/* 7 = header size for RERROR; */
-		int inline_len = in_hdrlen - 7;
-
-		len = req->rc.size - req->rc.offset;
-		if (len > (P9_ZC_HDR_SZ - 7)) {
-			err = -EFAULT;
-			goto out_err;
-		}
-
-		ename = &req->rc.sdata[req->rc.offset];
-		if (len > inline_len) {
-			/* We have error in external buffer */
-			if (!copy_from_iter_full(ename + inline_len,
-						 len - inline_len, uidata)) {
-				err = -EFAULT;
-				goto out_err;
-			}
-		}
-		ename = NULL;
-		err = p9pdu_readf(&req->rc, c->proto_version, "s?d",
-				  &ename, &ecode);
-		if (err)
-			goto out_err;
-
-		if (p9_is_proto_dotu(c) && ecode < 512)
-			err = -ecode;
-
-		if (!err) {
-			err = p9_errstr2errno(ename, strlen(ename));
-
-			p9_debug(P9_DEBUG_9P, "<<< RERROR (%d) %s\n",
-				 -ecode, ename);
-		}
-		kfree(ename);
-	} else {
-		err = p9pdu_readf(&req->rc, c->proto_version, "d", &ecode);
-		err = -ecode;
-
-		p9_debug(P9_DEBUG_9P, "<<< RLERROR (%d)\n", -ecode);
-	}
-	return err;
-
-out_err:
-	p9_debug(P9_DEBUG_ERROR, "couldn't parse error%d\n", err);
-	return err;
-}
-
 static struct p9_req_t *
 p9_client_rpc(struct p9_client *c, int8_t type, const char *fmt, ...);
 
@@ -874,7 +790,7 @@ static struct p9_req_t *p9_client_zc_rpc(struct p9_client *c, int8_t type,
 	if (err < 0)
 		goto reterr;
 
-	err = p9_check_zc_errors(c, req, uidata, in_hdrlen);
+	err = p9_check_errors(c, req);
 	trace_9p_client_res(c, type, req->rc.tag, err);
 	if (!err)
 		return req;
diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index b24a4fb0f0a2..2a210c2f8e40 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -377,6 +377,35 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
 	}
 }
 
+static void handle_rerror(struct p9_req_t *req, int in_hdr_len,
+			  size_t offs, struct page **pages)
+{
+	unsigned size, n;
+	void *to = req->rc.sdata + in_hdr_len;
+
+	// Fits entirely into the static data?  Nothing to do.
+	if (req->rc.size < in_hdr_len)
+		return;
+
+	// Really long error message?  Tough, truncate the reply.  Might get
+	// rejected (we can't be arsed to adjust the size encoded in header,
+	// or string size for that matter), but it wouldn't be anything valid
+	// anyway.
+	if (unlikely(req->rc.size > P9_ZC_HDR_SZ))
+		req->rc.size = P9_ZC_HDR_SZ;
+
+	// data won't span more than two pages
+	size = req->rc.size - in_hdr_len;
+	n = PAGE_SIZE - offs;
+	if (size > n) {
+		memcpy_from_page(to, *pages++, offs, n);
+		offs = 0;
+		to += n;
+		size -= n;
+	}
+	memcpy_from_page(to, *pages, offs, size);
+}
+
 /**
  * p9_virtio_zc_request - issue a zero copy request
  * @client: client instance issuing the request
@@ -503,6 +532,11 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
 	kicked = 1;
 	p9_debug(P9_DEBUG_TRANS, "virtio request kicked\n");
 	err = wait_event_killable(req->wq, req->status >= REQ_STATUS_RCVD);
+	// RERROR needs reply (== error string) in static data
+	if (req->status == REQ_STATUS_RCVD &&
+	    unlikely(req->rc.sdata[4] == P9_RERROR))
+		handle_rerror(req, in_hdr_len, offs, in_pages);
+
 	/*
 	 * Non kernel buffers are pinned, unpin them
 	 */
-- 
2.30.2



* [PATCH 02/44] No need of likely/unlikely on calls of check_copy_size()
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-22  4:15   ` [PATCH 03/44] teach iomap_dio_rw() to suppress dsync Al Viro
                     ` (44 subsequent siblings)
  45 siblings, 0 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

It's inline, and the unlikely() annotations inside it (including the
implicit one in WARN_ON_ONCE()) suffice to convince the compiler that
getting false from check_copy_size() is unlikely.

Spotted-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 arch/powerpc/include/asm/uaccess.h |  2 +-
 arch/s390/include/asm/uaccess.h    |  4 ++--
 include/linux/uaccess.h            |  4 ++--
 include/linux/uio.h                | 15 ++++++---------
 4 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index 9b82b38ff867..105f200b1e31 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -348,7 +348,7 @@ copy_mc_to_kernel(void *to, const void *from, unsigned long size)
 static inline unsigned long __must_check
 copy_mc_to_user(void __user *to, const void *from, unsigned long n)
 {
-	if (likely(check_copy_size(from, n, true))) {
+	if (check_copy_size(from, n, true)) {
 		if (access_ok(to, n)) {
 			allow_write_to_user(to, n);
 			n = copy_mc_generic((void *)to, from, n);
diff --git a/arch/s390/include/asm/uaccess.h b/arch/s390/include/asm/uaccess.h
index f4511e21d646..c2c9995466e0 100644
--- a/arch/s390/include/asm/uaccess.h
+++ b/arch/s390/include/asm/uaccess.h
@@ -39,7 +39,7 @@ _copy_from_user_key(void *to, const void __user *from, unsigned long n, unsigned
 static __always_inline unsigned long __must_check
 copy_from_user_key(void *to, const void __user *from, unsigned long n, unsigned long key)
 {
-	if (likely(check_copy_size(to, n, false)))
+	if (check_copy_size(to, n, false))
 		n = _copy_from_user_key(to, from, n, key);
 	return n;
 }
@@ -50,7 +50,7 @@ _copy_to_user_key(void __user *to, const void *from, unsigned long n, unsigned l
 static __always_inline unsigned long __must_check
 copy_to_user_key(void __user *to, const void *from, unsigned long n, unsigned long key)
 {
-	if (likely(check_copy_size(from, n, true)))
+	if (check_copy_size(from, n, true))
 		n = _copy_to_user_key(to, from, n, key);
 	return n;
 }
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 5a328cf02b75..47e5d374c7eb 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -148,7 +148,7 @@ _copy_to_user(void __user *, const void *, unsigned long);
 static __always_inline unsigned long __must_check
 copy_from_user(void *to, const void __user *from, unsigned long n)
 {
-	if (likely(check_copy_size(to, n, false)))
+	if (check_copy_size(to, n, false))
 		n = _copy_from_user(to, from, n);
 	return n;
 }
@@ -156,7 +156,7 @@ copy_from_user(void *to, const void __user *from, unsigned long n)
 static __always_inline unsigned long __must_check
 copy_to_user(void __user *to, const void *from, unsigned long n)
 {
-	if (likely(check_copy_size(from, n, true)))
+	if (check_copy_size(from, n, true))
 		n = _copy_to_user(to, from, n);
 	return n;
 }
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 739285fe5a2f..76d305f3d4c2 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -156,19 +156,17 @@ static inline size_t copy_folio_to_iter(struct folio *folio, size_t offset,
 static __always_inline __must_check
 size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 {
-	if (unlikely(!check_copy_size(addr, bytes, true)))
-		return 0;
-	else
+	if (check_copy_size(addr, bytes, true))
 		return _copy_to_iter(addr, bytes, i);
+	return 0;
 }
 
 static __always_inline __must_check
 size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 {
-	if (unlikely(!check_copy_size(addr, bytes, false)))
-		return 0;
-	else
+	if (check_copy_size(addr, bytes, false))
 		return _copy_from_iter(addr, bytes, i);
+	return 0;
 }
 
 static __always_inline __must_check
@@ -184,10 +182,9 @@ bool copy_from_iter_full(void *addr, size_t bytes, struct iov_iter *i)
 static __always_inline __must_check
 size_t copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 {
-	if (unlikely(!check_copy_size(addr, bytes, false)))
-		return 0;
-	else
+	if (check_copy_size(addr, bytes, false))
 		return _copy_from_iter_nocache(addr, bytes, i);
+	return 0;
 }
 
 static __always_inline __must_check
-- 
2.30.2



* [PATCH 03/44] teach iomap_dio_rw() to suppress dsync
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
  2022-06-22  4:15   ` [PATCH 02/44] No need of likely/unlikely on calls of check_copy_size() Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-22  4:15   ` [PATCH 04/44] btrfs: use IOMAP_DIO_NOSYNC Al Viro
                     ` (43 subsequent siblings)
  45 siblings, 0 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

New flag, equivalent to removal of IOCB_DSYNC from iocb flags.
This mimics what btrfs is doing (and that's what btrfs will
switch to).  However, I'm not at all sure that we want to
suppress REQ_FUA for those - all the btrfs hack really cares about
is suppression of generic_write_sync().  For now let's keep
the existing behaviour, but I really want to hear more detailed
arguments pro or contra.

[folded brain fix from willy]

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/iomap/direct-io.c  | 20 +++++++++++---------
 include/linux/iomap.h |  6 ++++++
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 370c3241618a..c10c69e2de24 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -548,17 +548,19 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		}
 
 		/* for data sync or sync, we need sync completion processing */
-		if (iocb->ki_flags & IOCB_DSYNC)
+		if (iocb->ki_flags & IOCB_DSYNC &&
+		    !(dio_flags & IOMAP_DIO_NOSYNC)) {
 			dio->flags |= IOMAP_DIO_NEED_SYNC;
 
-		/*
-		 * For datasync only writes, we optimistically try using FUA for
-		 * this IO.  Any non-FUA write that occurs will clear this flag,
-		 * hence we know before completion whether a cache flush is
-		 * necessary.
-		 */
-		if ((iocb->ki_flags & (IOCB_DSYNC | IOCB_SYNC)) == IOCB_DSYNC)
-			dio->flags |= IOMAP_DIO_WRITE_FUA;
+		       /*
+			* For datasync only writes, we optimistically try
+			* using FUA for this IO.  Any non-FUA write that
+			* occurs will clear this flag, hence we know before
+			* completion whether a cache flush is necessary.
+			*/
+			if (!(iocb->ki_flags & IOCB_SYNC))
+				dio->flags |= IOMAP_DIO_WRITE_FUA;
+		}
 	}
 
 	if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) {
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index e552097c67e0..c8622d8f064e 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -353,6 +353,12 @@ struct iomap_dio_ops {
  */
 #define IOMAP_DIO_PARTIAL		(1 << 2)
 
+/*
+ * The caller will sync the write if needed; do not sync it within
+ * iomap_dio_rw.  Overrides IOMAP_DIO_FORCE_WAIT.
+ */
+#define IOMAP_DIO_NOSYNC		(1 << 3)
+
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
 		unsigned int dio_flags, void *private, size_t done_before);
-- 
2.30.2



* [PATCH 04/44] btrfs: use IOMAP_DIO_NOSYNC
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
  2022-06-22  4:15   ` [PATCH 02/44] No need of likely/unlikely on calls of check_copy_size() Al Viro
  2022-06-22  4:15   ` [PATCH 03/44] teach iomap_dio_rw() to suppress dsync Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-22  4:15   ` [PATCH 05/44] struct file: use anonymous union member for rcuhead and llist Al Viro
                     ` (42 subsequent siblings)
  45 siblings, 0 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

... instead of messing with iocb flags

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/btrfs/file.c  | 17 -----------------
 fs/btrfs/inode.c |  3 ++-
 2 files changed, 2 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 1fd827b99c1b..98f81e304eb1 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1848,7 +1848,6 @@ static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
 
 static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 {
-	const bool is_sync_write = (iocb->ki_flags & IOCB_DSYNC);
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -1901,15 +1900,6 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 		goto buffered;
 	}
 
-	/*
-	 * We remove IOCB_DSYNC so that we don't deadlock when iomap_dio_rw()
-	 * calls generic_write_sync() (through iomap_dio_complete()), because
-	 * that results in calling fsync (btrfs_sync_file()) which will try to
-	 * lock the inode in exclusive/write mode.
-	 */
-	if (is_sync_write)
-		iocb->ki_flags &= ~IOCB_DSYNC;
-
 	/*
 	 * The iov_iter can be mapped to the same file range we are writing to.
 	 * If that's the case, then we will deadlock in the iomap code, because
@@ -1964,13 +1954,6 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 
 	btrfs_inode_unlock(inode, ilock_flags);
 
-	/*
-	 * Add back IOCB_DSYNC. Our caller, btrfs_file_write_iter(), will do
-	 * the fsync (call generic_write_sync()).
-	 */
-	if (is_sync_write)
-		iocb->ki_flags |= IOCB_DSYNC;
-
 	/* If 'err' is -ENOTBLK then it means we must fallback to buffered IO. */
 	if ((err < 0 && err != -ENOTBLK) || !iov_iter_count(from))
 		goto out;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 81737eff92f3..fbf0aee7d66a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8152,7 +8152,8 @@ ssize_t btrfs_dio_rw(struct kiocb *iocb, struct iov_iter *iter, size_t done_befo
 	struct btrfs_dio_data data;
 
 	return iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
-			    IOMAP_DIO_PARTIAL, &data, done_before);
+			    IOMAP_DIO_PARTIAL | IOMAP_DIO_NOSYNC,
+			    &data, done_before);
 }
 
 static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 05/44] struct file: use anonymous union member for rcuhead and llist
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (2 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 04/44] btrfs: use IOMAP_DIO_NOSYNC Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-22  4:15   ` [PATCH 06/44] iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC Al Viro
                     ` (41 subsequent siblings)
  45 siblings, 0 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

Once upon a time we couldn't afford anonymous unions; these days the
minimal gcc version has been raised enough to take care of that.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/file_table.c    | 16 ++++++++--------
 include/linux/fs.h |  6 +++---
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index 5424e3a8df5f..b989e33aacda 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -45,7 +45,7 @@ static struct percpu_counter nr_files __cacheline_aligned_in_smp;
 
 static void file_free_rcu(struct rcu_head *head)
 {
-	struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
+	struct file *f = container_of(head, struct file, f_rcuhead);
 
 	put_cred(f->f_cred);
 	kmem_cache_free(filp_cachep, f);
@@ -56,7 +56,7 @@ static inline void file_free(struct file *f)
 	security_file_free(f);
 	if (!(f->f_mode & FMODE_NOACCOUNT))
 		percpu_counter_dec(&nr_files);
-	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
+	call_rcu(&f->f_rcuhead, file_free_rcu);
 }
 
 /*
@@ -142,7 +142,7 @@ static struct file *__alloc_file(int flags, const struct cred *cred)
 	f->f_cred = get_cred(cred);
 	error = security_file_alloc(f);
 	if (unlikely(error)) {
-		file_free_rcu(&f->f_u.fu_rcuhead);
+		file_free_rcu(&f->f_rcuhead);
 		return ERR_PTR(error);
 	}
 
@@ -341,13 +341,13 @@ static void delayed_fput(struct work_struct *unused)
 	struct llist_node *node = llist_del_all(&delayed_fput_list);
 	struct file *f, *t;
 
-	llist_for_each_entry_safe(f, t, node, f_u.fu_llist)
+	llist_for_each_entry_safe(f, t, node, f_llist)
 		__fput(f);
 }
 
 static void ____fput(struct callback_head *work)
 {
-	__fput(container_of(work, struct file, f_u.fu_rcuhead));
+	__fput(container_of(work, struct file, f_rcuhead));
 }
 
 /*
@@ -374,8 +374,8 @@ void fput(struct file *file)
 		struct task_struct *task = current;
 
 		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
-			init_task_work(&file->f_u.fu_rcuhead, ____fput);
-			if (!task_work_add(task, &file->f_u.fu_rcuhead, TWA_RESUME))
+			init_task_work(&file->f_rcuhead, ____fput);
+			if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
 				return;
 			/*
 			 * After this task has run exit_task_work(),
@@ -384,7 +384,7 @@ void fput(struct file *file)
 			 */
 		}
 
-		if (llist_add(&file->f_u.fu_llist, &delayed_fput_list))
+		if (llist_add(&file->f_llist, &delayed_fput_list))
 			schedule_delayed_work(&delayed_fput_work, 1);
 	}
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9ad5e3520fae..6a2a4906041f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -924,9 +924,9 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
 
 struct file {
 	union {
-		struct llist_node	fu_llist;
-		struct rcu_head 	fu_rcuhead;
-	} f_u;
+		struct llist_node	f_llist;
+		struct rcu_head 	f_rcuhead;
+	};
 	struct path		f_path;
 	struct inode		*f_inode;	/* cached value */
 	const struct file_operations	*f_op;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 06/44] iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (3 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 05/44] struct file: use anonymous union member for rcuhead and llist Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-22  4:15   ` [PATCH 07/44] keep iocb_flags() result cached in struct file Al Viro
                     ` (40 subsequent siblings)
  45 siblings, 0 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

New helper to be used instead of direct checks for IOCB_DSYNC:
iocb_is_dsync(iocb).  Checks converted, which allows us to avoid
the IS_SYNC(iocb->ki_filp->f_mapping->host) part (4 cache lines)
in iocb_flags() - it's checked in iocb_is_dsync() instead.

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 block/fops.c         |  2 +-
 fs/btrfs/file.c      |  2 +-
 fs/direct-io.c       |  2 +-
 fs/fuse/file.c       |  2 +-
 fs/iomap/direct-io.c |  3 +--
 fs/zonefs/super.c    |  2 +-
 include/linux/fs.h   | 10 ++++++++--
 7 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index d6b3276a6c68..6e86931ab847 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -37,7 +37,7 @@ static unsigned int dio_bio_write_op(struct kiocb *iocb)
 	unsigned int op = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
 
 	/* avoid the need for a I/O completion work item */
-	if (iocb->ki_flags & IOCB_DSYNC)
+	if (iocb_is_dsync(iocb))
 		op |= REQ_FUA;
 	return op;
 }
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 98f81e304eb1..54358a5c9d56 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2021,7 +2021,7 @@ ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct iov_iter *from,
 	struct file *file = iocb->ki_filp;
 	struct btrfs_inode *inode = BTRFS_I(file_inode(file));
 	ssize_t num_written, num_sync;
-	const bool sync = iocb->ki_flags & IOCB_DSYNC;
+	const bool sync = iocb_is_dsync(iocb);
 
 	/*
 	 * If the fs flips readonly due to some impossible error, although we
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 840752006f60..39647eb56904 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1210,7 +1210,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 */
 	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
 		retval = 0;
-		if (iocb->ki_flags & IOCB_DSYNC)
+		if (iocb_is_dsync(iocb))
 			retval = dio_set_defer_completion(dio);
 		else if (!dio->inode->i_sb->s_dio_done_wq) {
 			/*
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 05caa2b9272e..00fa861aeead 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1042,7 +1042,7 @@ static unsigned int fuse_write_flags(struct kiocb *iocb)
 {
 	unsigned int flags = iocb->ki_filp->f_flags;
 
-	if (iocb->ki_flags & IOCB_DSYNC)
+	if (iocb_is_dsync(iocb))
 		flags |= O_DSYNC;
 	if (iocb->ki_flags & IOCB_SYNC)
 		flags |= O_SYNC;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index c10c69e2de24..31c7f1035b20 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -548,8 +548,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		}
 
 		/* for data sync or sync, we need sync completion processing */
-		if (iocb->ki_flags & IOCB_DSYNC &&
-		    !(dio_flags & IOMAP_DIO_NOSYNC)) {
+		if (iocb_is_dsync(iocb) && !(dio_flags & IOMAP_DIO_NOSYNC)) {
 			dio->flags |= IOMAP_DIO_NEED_SYNC;
 
 		       /*
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index bcb21aea990a..04a98b4cd7ee 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -746,7 +746,7 @@ static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
 			REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE, GFP_NOFS);
 	bio->bi_iter.bi_sector = zi->i_zsector;
 	bio->bi_ioprio = iocb->ki_ioprio;
-	if (iocb->ki_flags & IOCB_DSYNC)
+	if (iocb_is_dsync(iocb))
 		bio->bi_opf |= REQ_FUA;
 
 	ret = bio_iov_iter_get_pages(bio, from);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6a2a4906041f..380a1292f4f9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2720,6 +2720,12 @@ extern int vfs_fsync(struct file *file, int datasync);
 extern int sync_file_range(struct file *file, loff_t offset, loff_t nbytes,
 				unsigned int flags);
 
+static inline bool iocb_is_dsync(const struct kiocb *iocb)
+{
+	return (iocb->ki_flags & IOCB_DSYNC) ||
+		IS_SYNC(iocb->ki_filp->f_mapping->host);
+}
+
 /*
  * Sync the bytes written if this was a synchronous write.  Expect ki_pos
  * to already be updated for the write, and will return either the amount
@@ -2727,7 +2733,7 @@ extern int sync_file_range(struct file *file, loff_t offset, loff_t nbytes,
  */
 static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
 {
-	if (iocb->ki_flags & IOCB_DSYNC) {
+	if (iocb_is_dsync(iocb)) {
 		int ret = vfs_fsync_range(iocb->ki_filp,
 				iocb->ki_pos - count, iocb->ki_pos - 1,
 				(iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
@@ -3262,7 +3268,7 @@ static inline int iocb_flags(struct file *file)
 		res |= IOCB_APPEND;
 	if (file->f_flags & O_DIRECT)
 		res |= IOCB_DIRECT;
-	if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))
+	if (file->f_flags & O_DSYNC)
 		res |= IOCB_DSYNC;
 	if (file->f_flags & __O_SYNC)
 		res |= IOCB_SYNC;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 07/44] keep iocb_flags() result cached in struct file
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (4 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 06/44] iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-22  4:15   ` [PATCH 08/44] copy_page_{to,from}_iter(): switch iovec variants to generic Al Viro
                     ` (39 subsequent siblings)
  45 siblings, 0 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

* calculate at the time we set FMODE_OPENED (do_dentry_open() for normal
opens, alloc_file() for pipe()/socket()/etc.)
* update when handling F_SETFL
* keep in a new field - file->f_iocb_flags; since that thing is needed only
before the refcount reaches zero, we can put it into the same anon union
where ->f_rcuhead and ->f_llist live - those are used only after refcount
reaches zero.

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 drivers/nvme/target/io-cmd-file.c | 2 +-
 fs/aio.c                          | 2 +-
 fs/fcntl.c                        | 1 +
 fs/file_table.c                   | 1 +
 fs/io_uring.c                     | 2 +-
 fs/open.c                         | 1 +
 include/linux/fs.h                | 5 ++---
 7 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
index f3d58abf11e0..64b47e2a4633 100644
--- a/drivers/nvme/target/io-cmd-file.c
+++ b/drivers/nvme/target/io-cmd-file.c
@@ -112,7 +112,7 @@ static ssize_t nvmet_file_submit_bvec(struct nvmet_req *req, loff_t pos,
 
 	iocb->ki_pos = pos;
 	iocb->ki_filp = req->ns->file;
-	iocb->ki_flags = ki_flags | iocb_flags(req->ns->file);
+	iocb->ki_flags = ki_flags | iocb->ki_filp->f_iocb_flags;
 
 	return call_iter(iocb, &iter);
 }
diff --git a/fs/aio.c b/fs/aio.c
index 3c249b938632..2bdd444d408b 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1475,7 +1475,7 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb)
 	req->ki_complete = aio_complete_rw;
 	req->private = NULL;
 	req->ki_pos = iocb->aio_offset;
-	req->ki_flags = iocb_flags(req->ki_filp);
+	req->ki_flags = req->ki_filp->f_iocb_flags;
 	if (iocb->aio_flags & IOCB_FLAG_RESFD)
 		req->ki_flags |= IOCB_EVENTFD;
 	if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 34a3faa4886d..146c9ab0cd4b 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -78,6 +78,7 @@ static int setfl(int fd, struct file * filp, unsigned long arg)
 	}
 	spin_lock(&filp->f_lock);
 	filp->f_flags = (arg & SETFL_MASK) | (filp->f_flags & ~SETFL_MASK);
+	filp->f_iocb_flags = iocb_flags(filp);
 	spin_unlock(&filp->f_lock);
 
  out:
diff --git a/fs/file_table.c b/fs/file_table.c
index b989e33aacda..905792b0521c 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -241,6 +241,7 @@ static struct file *alloc_file(const struct path *path, int flags,
 	if ((file->f_mode & FMODE_WRITE) &&
 	     likely(fop->write || fop->write_iter))
 		file->f_mode |= FMODE_CAN_WRITE;
+	file->f_iocb_flags = iocb_flags(file);
 	file->f_mode |= FMODE_OPENED;
 	file->f_op = fop;
 	if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 3aab4182fd89..53424b1f019f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4330,7 +4330,7 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode)
 	if (!io_req_ffs_set(req))
 		req->flags |= io_file_get_flags(file) << REQ_F_SUPPORT_NOWAIT_BIT;
 
-	kiocb->ki_flags = iocb_flags(file);
+	kiocb->ki_flags = file->f_iocb_flags;
 	ret = kiocb_set_rw_flags(kiocb, req->rw.flags);
 	if (unlikely(ret))
 		return ret;
diff --git a/fs/open.c b/fs/open.c
index 1d57fbde2feb..d80441a0bf17 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -862,6 +862,7 @@ static int do_dentry_open(struct file *f,
 		f->f_mode |= FMODE_CAN_ODIRECT;
 
 	f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
+	f->f_iocb_flags = iocb_flags(f);
 
 	file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 380a1292f4f9..c82b9d442f56 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -926,6 +926,7 @@ struct file {
 	union {
 		struct llist_node	f_llist;
 		struct rcu_head 	f_rcuhead;
+		unsigned int 		f_iocb_flags;
 	};
 	struct path		f_path;
 	struct inode		*f_inode;	/* cached value */
@@ -2199,13 +2200,11 @@ static inline bool HAS_UNMAPPED_ID(struct user_namespace *mnt_userns,
 	       !gid_valid(i_gid_into_mnt(mnt_userns, inode));
 }
 
-static inline int iocb_flags(struct file *file);
-
 static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
 {
 	*kiocb = (struct kiocb) {
 		.ki_filp = filp,
-		.ki_flags = iocb_flags(filp),
+		.ki_flags = filp->f_iocb_flags,
 		.ki_ioprio = get_current_ioprio(),
 	};
 }
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 08/44] copy_page_{to,from}_iter(): switch iovec variants to generic
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (5 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 07/44] keep iocb_flags() result cached in struct file Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-27 18:31     ` Jeff Layton
  2022-06-28 12:32     ` Christian Brauner
  2022-06-22  4:15   ` [PATCH 09/44] new iov_iter flavour - ITER_UBUF Al Viro
                     ` (38 subsequent siblings)
  45 siblings, 2 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

We can do copyin/copyout under kmap_local_page(); it shouldn't overflow
the kmap stack - the maximal footprint increases only by one here.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 191 ++-----------------------------------------------
 1 file changed, 4 insertions(+), 187 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 6dd5330f7a99..4c658a25e29c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -168,174 +168,6 @@ static int copyin(void *to, const void __user *from, size_t n)
 	return n;
 }
 
-static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t bytes,
-			 struct iov_iter *i)
-{
-	size_t skip, copy, left, wanted;
-	const struct iovec *iov;
-	char __user *buf;
-	void *kaddr, *from;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-
-	if (unlikely(!bytes))
-		return 0;
-
-	might_fault();
-	wanted = bytes;
-	iov = i->iov;
-	skip = i->iov_offset;
-	buf = iov->iov_base + skip;
-	copy = min(bytes, iov->iov_len - skip);
-
-	if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_writeable(buf, copy)) {
-		kaddr = kmap_atomic(page);
-		from = kaddr + offset;
-
-		/* first chunk, usually the only one */
-		left = copyout(buf, from, copy);
-		copy -= left;
-		skip += copy;
-		from += copy;
-		bytes -= copy;
-
-		while (unlikely(!left && bytes)) {
-			iov++;
-			buf = iov->iov_base;
-			copy = min(bytes, iov->iov_len);
-			left = copyout(buf, from, copy);
-			copy -= left;
-			skip = copy;
-			from += copy;
-			bytes -= copy;
-		}
-		if (likely(!bytes)) {
-			kunmap_atomic(kaddr);
-			goto done;
-		}
-		offset = from - kaddr;
-		buf += copy;
-		kunmap_atomic(kaddr);
-		copy = min(bytes, iov->iov_len - skip);
-	}
-	/* Too bad - revert to non-atomic kmap */
-
-	kaddr = kmap(page);
-	from = kaddr + offset;
-	left = copyout(buf, from, copy);
-	copy -= left;
-	skip += copy;
-	from += copy;
-	bytes -= copy;
-	while (unlikely(!left && bytes)) {
-		iov++;
-		buf = iov->iov_base;
-		copy = min(bytes, iov->iov_len);
-		left = copyout(buf, from, copy);
-		copy -= left;
-		skip = copy;
-		from += copy;
-		bytes -= copy;
-	}
-	kunmap(page);
-
-done:
-	if (skip == iov->iov_len) {
-		iov++;
-		skip = 0;
-	}
-	i->count -= wanted - bytes;
-	i->nr_segs -= iov - i->iov;
-	i->iov = iov;
-	i->iov_offset = skip;
-	return wanted - bytes;
-}
-
-static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t bytes,
-			 struct iov_iter *i)
-{
-	size_t skip, copy, left, wanted;
-	const struct iovec *iov;
-	char __user *buf;
-	void *kaddr, *to;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-
-	if (unlikely(!bytes))
-		return 0;
-
-	might_fault();
-	wanted = bytes;
-	iov = i->iov;
-	skip = i->iov_offset;
-	buf = iov->iov_base + skip;
-	copy = min(bytes, iov->iov_len - skip);
-
-	if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_readable(buf, copy)) {
-		kaddr = kmap_atomic(page);
-		to = kaddr + offset;
-
-		/* first chunk, usually the only one */
-		left = copyin(to, buf, copy);
-		copy -= left;
-		skip += copy;
-		to += copy;
-		bytes -= copy;
-
-		while (unlikely(!left && bytes)) {
-			iov++;
-			buf = iov->iov_base;
-			copy = min(bytes, iov->iov_len);
-			left = copyin(to, buf, copy);
-			copy -= left;
-			skip = copy;
-			to += copy;
-			bytes -= copy;
-		}
-		if (likely(!bytes)) {
-			kunmap_atomic(kaddr);
-			goto done;
-		}
-		offset = to - kaddr;
-		buf += copy;
-		kunmap_atomic(kaddr);
-		copy = min(bytes, iov->iov_len - skip);
-	}
-	/* Too bad - revert to non-atomic kmap */
-
-	kaddr = kmap(page);
-	to = kaddr + offset;
-	left = copyin(to, buf, copy);
-	copy -= left;
-	skip += copy;
-	to += copy;
-	bytes -= copy;
-	while (unlikely(!left && bytes)) {
-		iov++;
-		buf = iov->iov_base;
-		copy = min(bytes, iov->iov_len);
-		left = copyin(to, buf, copy);
-		copy -= left;
-		skip = copy;
-		to += copy;
-		bytes -= copy;
-	}
-	kunmap(page);
-
-done:
-	if (skip == iov->iov_len) {
-		iov++;
-		skip = 0;
-	}
-	i->count -= wanted - bytes;
-	i->nr_segs -= iov - i->iov;
-	i->iov = iov;
-	i->iov_offset = skip;
-	return wanted - bytes;
-}
-
 #ifdef PIPE_PARANOIA
 static bool sanity(const struct iov_iter *i)
 {
@@ -848,24 +680,14 @@ static inline bool page_copy_sane(struct page *page, size_t offset, size_t n)
 static size_t __copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
-	if (likely(iter_is_iovec(i)))
-		return copy_page_to_iter_iovec(page, offset, bytes, i);
-	if (iov_iter_is_bvec(i) || iov_iter_is_kvec(i) || iov_iter_is_xarray(i)) {
+	if (unlikely(iov_iter_is_pipe(i))) {
+		return copy_page_to_iter_pipe(page, offset, bytes, i);
+	} else {
 		void *kaddr = kmap_local_page(page);
 		size_t wanted = _copy_to_iter(kaddr + offset, bytes, i);
 		kunmap_local(kaddr);
 		return wanted;
 	}
-	if (iov_iter_is_pipe(i))
-		return copy_page_to_iter_pipe(page, offset, bytes, i);
-	if (unlikely(iov_iter_is_discard(i))) {
-		if (unlikely(i->count < bytes))
-			bytes = i->count;
-		i->count -= bytes;
-		return bytes;
-	}
-	WARN_ON(1);
-	return 0;
 }
 
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
@@ -896,17 +718,12 @@ EXPORT_SYMBOL(copy_page_to_iter);
 size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
-	if (unlikely(!page_copy_sane(page, offset, bytes)))
-		return 0;
-	if (likely(iter_is_iovec(i)))
-		return copy_page_from_iter_iovec(page, offset, bytes, i);
-	if (iov_iter_is_bvec(i) || iov_iter_is_kvec(i) || iov_iter_is_xarray(i)) {
+	if (page_copy_sane(page, offset, bytes)) {
 		void *kaddr = kmap_local_page(page);
 		size_t wanted = _copy_from_iter(kaddr + offset, bytes, i);
 		kunmap_local(kaddr);
 		return wanted;
 	}
-	WARN_ON(1);
 	return 0;
 }
 EXPORT_SYMBOL(copy_page_from_iter);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 09/44] new iov_iter flavour - ITER_UBUF
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (6 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 08/44] copy_page_{to,from}_iter(): switch iovec variants to generic Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-27 18:47     ` Jeff Layton
                       ` (2 more replies)
  2022-06-22  4:15   ` [PATCH 10/44] switch new_sync_{read,write}() to ITER_UBUF Al Viro
                     ` (37 subsequent siblings)
  45 siblings, 3 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

Equivalent of single-segment iovec.  Initialized by iov_iter_ubuf(),
checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC
ones.

We are going to expose things like ->write_iter() et al. to those
in subsequent commits.

New predicate (user_backed_iter()) that is true for ITER_IOVEC and
ITER_UBUF; places like direct-IO handling should use it to check
whether pages we modify after getting them from iov_iter_get_pages()
need to be dirtied.

DO NOT assume that replacing iter_is_iovec() with user_backed_iter()
will solve all problems - there's code that uses iter_is_iovec() to
decide how to poke around in iov_iter guts and for that the predicate
replacement obviously won't suffice.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 block/fops.c         |  6 +--
 fs/ceph/file.c       |  2 +-
 fs/cifs/file.c       |  2 +-
 fs/direct-io.c       |  2 +-
 fs/fuse/dev.c        |  4 +-
 fs/fuse/file.c       |  2 +-
 fs/gfs2/file.c       |  2 +-
 fs/iomap/direct-io.c |  2 +-
 fs/nfs/direct.c      |  2 +-
 include/linux/uio.h  | 26 ++++++++++++
 lib/iov_iter.c       | 94 ++++++++++++++++++++++++++++++++++----------
 mm/shmem.c           |  2 +-
 12 files changed, 113 insertions(+), 33 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 6e86931ab847..3e68d69e0ee3 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -69,7 +69,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 
 	if (iov_iter_rw(iter) == READ) {
 		bio_init(&bio, bdev, vecs, nr_pages, REQ_OP_READ);
-		if (iter_is_iovec(iter))
+		if (user_backed_iter(iter))
 			should_dirty = true;
 	} else {
 		bio_init(&bio, bdev, vecs, nr_pages, dio_bio_write_op(iocb));
@@ -199,7 +199,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	}
 
 	dio->size = 0;
-	if (is_read && iter_is_iovec(iter))
+	if (is_read && user_backed_iter(iter))
 		dio->flags |= DIO_SHOULD_DIRTY;
 
 	blk_start_plug(&plug);
@@ -331,7 +331,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	dio->size = bio->bi_iter.bi_size;
 
 	if (is_read) {
-		if (iter_is_iovec(iter)) {
+		if (user_backed_iter(iter)) {
 			dio->flags |= DIO_SHOULD_DIRTY;
 			bio_set_pages_dirty(bio);
 		}
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 8c8226c0feac..e132adeeaf16 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1262,7 +1262,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 	size_t count = iov_iter_count(iter);
 	loff_t pos = iocb->ki_pos;
 	bool write = iov_iter_rw(iter) == WRITE;
-	bool should_dirty = !write && iter_is_iovec(iter);
+	bool should_dirty = !write && user_backed_iter(iter);
 
 	if (write && ceph_snap(file_inode(file)) != CEPH_NOSNAP)
 		return -EROFS;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 1618e0537d58..4b4129d9a90c 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -4004,7 +4004,7 @@ static ssize_t __cifs_readv(
 	if (!is_sync_kiocb(iocb))
 		ctx->iocb = iocb;
 
-	if (iter_is_iovec(to))
+	if (user_backed_iter(to))
 		ctx->should_dirty = true;
 
 	if (direct) {
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 39647eb56904..72237f49ad94 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1245,7 +1245,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	spin_lock_init(&dio->bio_lock);
 	dio->refcount = 1;
 
-	dio->should_dirty = iter_is_iovec(iter) && iov_iter_rw(iter) == READ;
+	dio->should_dirty = user_backed_iter(iter) && iov_iter_rw(iter) == READ;
 	sdio.iter = iter;
 	sdio.final_block_in_request = end >> blkbits;
 
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 0e537e580dc1..8d657c2cd6f7 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1356,7 +1356,7 @@ static ssize_t fuse_dev_read(struct kiocb *iocb, struct iov_iter *to)
 	if (!fud)
 		return -EPERM;
 
-	if (!iter_is_iovec(to))
+	if (!user_backed_iter(to))
 		return -EINVAL;
 
 	fuse_copy_init(&cs, 1, to);
@@ -1949,7 +1949,7 @@ static ssize_t fuse_dev_write(struct kiocb *iocb, struct iov_iter *from)
 	if (!fud)
 		return -EPERM;
 
-	if (!iter_is_iovec(from))
+	if (!user_backed_iter(from))
 		return -EINVAL;
 
 	fuse_copy_init(&cs, 0, from);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 00fa861aeead..c982e3afe3b4 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1465,7 +1465,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
 			inode_unlock(inode);
 	}
 
-	io->should_dirty = !write && iter_is_iovec(iter);
+	io->should_dirty = !write && user_backed_iter(iter);
 	while (count) {
 		ssize_t nres;
 		fl_owner_t owner = current->files;
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 2cceb193dcd8..48e6cc74fdc1 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -780,7 +780,7 @@ static inline bool should_fault_in_pages(struct iov_iter *i,
 
 	if (!count)
 		return false;
-	if (!iter_is_iovec(i))
+	if (!user_backed_iter(i))
 		return false;
 
 	size = PAGE_SIZE;
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 31c7f1035b20..d5c7d019653b 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -533,7 +533,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			iomi.flags |= IOMAP_NOWAIT;
 		}
 
-		if (iter_is_iovec(iter))
+		if (user_backed_iter(iter))
 			dio->flags |= IOMAP_DIO_DIRTY;
 	} else {
 		iomi.flags |= IOMAP_WRITE;
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 4eb2a8380a28..022e1ce63e62 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -478,7 +478,7 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,
 	if (!is_sync_kiocb(iocb))
 		dreq->iocb = iocb;
 
-	if (iter_is_iovec(iter))
+	if (user_backed_iter(iter))
 		dreq->flags = NFS_ODIRECT_SHOULD_DIRTY;
 
 	if (!swap)
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 76d305f3d4c2..6ab4260c3d6c 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -26,6 +26,7 @@ enum iter_type {
 	ITER_PIPE,
 	ITER_XARRAY,
 	ITER_DISCARD,
+	ITER_UBUF,
 };
 
 struct iov_iter_state {
@@ -38,6 +39,7 @@ struct iov_iter {
 	u8 iter_type;
 	bool nofault;
 	bool data_source;
+	bool user_backed;
 	size_t iov_offset;
 	size_t count;
 	union {
@@ -46,6 +48,7 @@ struct iov_iter {
 		const struct bio_vec *bvec;
 		struct xarray *xarray;
 		struct pipe_inode_info *pipe;
+		void __user *ubuf;
 	};
 	union {
 		unsigned long nr_segs;
@@ -70,6 +73,11 @@ static inline void iov_iter_save_state(struct iov_iter *iter,
 	state->nr_segs = iter->nr_segs;
 }
 
+static inline bool iter_is_ubuf(const struct iov_iter *i)
+{
+	return iov_iter_type(i) == ITER_UBUF;
+}
+
 static inline bool iter_is_iovec(const struct iov_iter *i)
 {
 	return iov_iter_type(i) == ITER_IOVEC;
@@ -105,6 +113,11 @@ static inline unsigned char iov_iter_rw(const struct iov_iter *i)
 	return i->data_source ? WRITE : READ;
 }
 
+static inline bool user_backed_iter(const struct iov_iter *i)
+{
+	return i->user_backed;
+}
+
 /*
  * Total number of bytes covered by an iovec.
  *
@@ -320,4 +333,17 @@ ssize_t __import_iovec(int type, const struct iovec __user *uvec,
 int import_single_range(int type, void __user *buf, size_t len,
 		 struct iovec *iov, struct iov_iter *i);
 
+static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
+			void __user *buf, size_t count)
+{
+	WARN_ON(direction & ~(READ | WRITE));
+	*i = (struct iov_iter) {
+		.iter_type = ITER_UBUF,
+		.user_backed = true,
+		.data_source = direction,
+		.ubuf = buf,
+		.count = count
+	};
+}
+
 #endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 4c658a25e29c..8275b28e886b 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -16,6 +16,16 @@
 
 #define PIPE_PARANOIA /* for now */
 
+/* covers ubuf and kbuf alike */
+#define iterate_buf(i, n, base, len, off, __p, STEP) {		\
+	size_t __maybe_unused off = 0;				\
+	len = n;						\
+	base = __p + i->iov_offset;				\
+	len -= (STEP);						\
+	i->iov_offset += len;					\
+	n = len;						\
+}
+
 /* covers iovec and kvec alike */
 #define iterate_iovec(i, n, base, len, off, __p, STEP) {	\
 	size_t off = 0;						\
@@ -110,7 +120,12 @@ __out:								\
 	if (unlikely(i->count < n))				\
 		n = i->count;					\
 	if (likely(n)) {					\
-		if (likely(iter_is_iovec(i))) {			\
+		if (likely(iter_is_ubuf(i))) {			\
+			void __user *base;			\
+			size_t len;				\
+			iterate_buf(i, n, base, len, off,	\
+						i->ubuf, (I)) 	\
+		} else if (likely(iter_is_iovec(i))) {		\
 			const struct iovec *iov = i->iov;	\
 			void __user *base;			\
 			size_t len;				\
@@ -275,7 +290,11 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
  */
 size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size)
 {
-	if (iter_is_iovec(i)) {
+	if (iter_is_ubuf(i)) {
+		size_t n = min(size, iov_iter_count(i));
+		n -= fault_in_readable(i->ubuf + i->iov_offset, n);
+		return size - n;
+	} else if (iter_is_iovec(i)) {
 		size_t count = min(size, iov_iter_count(i));
 		const struct iovec *p;
 		size_t skip;
@@ -314,7 +333,11 @@ EXPORT_SYMBOL(fault_in_iov_iter_readable);
  */
 size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size)
 {
-	if (iter_is_iovec(i)) {
+	if (iter_is_ubuf(i)) {
+		size_t n = min(size, iov_iter_count(i));
+		n -= fault_in_safe_writeable(i->ubuf + i->iov_offset, n);
+		return size - n;
+	} else if (iter_is_iovec(i)) {
 		size_t count = min(size, iov_iter_count(i));
 		const struct iovec *p;
 		size_t skip;
@@ -345,6 +368,7 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
 	*i = (struct iov_iter) {
 		.iter_type = ITER_IOVEC,
 		.nofault = false,
+		.user_backed = true,
 		.data_source = direction,
 		.iov = iov,
 		.nr_segs = nr_segs,
@@ -494,7 +518,7 @@ size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 {
 	if (unlikely(iov_iter_is_pipe(i)))
 		return copy_pipe_to_iter(addr, bytes, i);
-	if (iter_is_iovec(i))
+	if (user_backed_iter(i))
 		might_fault();
 	iterate_and_advance(i, bytes, base, len, off,
 		copyout(base, addr + off, len),
@@ -576,7 +600,7 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 {
 	if (unlikely(iov_iter_is_pipe(i)))
 		return copy_mc_pipe_to_iter(addr, bytes, i);
-	if (iter_is_iovec(i))
+	if (user_backed_iter(i))
 		might_fault();
 	__iterate_and_advance(i, bytes, base, len, off,
 		copyout_mc(base, addr + off, len),
@@ -594,7 +618,7 @@ size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 		WARN_ON(1);
 		return 0;
 	}
-	if (iter_is_iovec(i))
+	if (user_backed_iter(i))
 		might_fault();
 	iterate_and_advance(i, bytes, base, len, off,
 		copyin(addr + off, base, len),
@@ -882,16 +906,16 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
 {
 	if (unlikely(i->count < size))
 		size = i->count;
-	if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i))) {
+	if (likely(iter_is_ubuf(i)) || unlikely(iov_iter_is_xarray(i))) {
+		i->iov_offset += size;
+		i->count -= size;
+	} else if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i))) {
 		/* iovec and kvec have identical layouts */
 		iov_iter_iovec_advance(i, size);
 	} else if (iov_iter_is_bvec(i)) {
 		iov_iter_bvec_advance(i, size);
 	} else if (iov_iter_is_pipe(i)) {
 		pipe_advance(i, size);
-	} else if (unlikely(iov_iter_is_xarray(i))) {
-		i->iov_offset += size;
-		i->count -= size;
 	} else if (iov_iter_is_discard(i)) {
 		i->count -= size;
 	}
@@ -938,7 +962,7 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
 		return;
 	}
 	unroll -= i->iov_offset;
-	if (iov_iter_is_xarray(i)) {
+	if (iov_iter_is_xarray(i) || iter_is_ubuf(i)) {
 		BUG(); /* We should never go beyond the start of the specified
 			* range since we might then be straying into pages that
 			* aren't pinned.
@@ -1129,6 +1153,13 @@ static unsigned long iov_iter_alignment_bvec(const struct iov_iter *i)
 
 unsigned long iov_iter_alignment(const struct iov_iter *i)
 {
+	if (likely(iter_is_ubuf(i))) {
+		size_t size = i->count;
+		if (size)
+			return ((unsigned long)i->ubuf + i->iov_offset) | size;
+		return 0;
+	}
+
 	/* iovec and kvec have identical layouts */
 	if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i)))
 		return iov_iter_alignment_iovec(i);
@@ -1159,6 +1190,9 @@ unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
 	size_t size = i->count;
 	unsigned k;
 
+	if (iter_is_ubuf(i))
+		return 0;
+
 	if (WARN_ON(!iter_is_iovec(i)))
 		return ~0U;
 
@@ -1287,7 +1321,19 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
 	return actual;
 }
 
-/* must be done on non-empty ITER_IOVEC one */
+static unsigned long found_ubuf_segment(unsigned long addr,
+					size_t len,
+					size_t *size, size_t *start,
+					unsigned maxpages)
+{
+	len += (*start = addr % PAGE_SIZE);
+	if (len > maxpages * PAGE_SIZE)
+		len = maxpages * PAGE_SIZE;
+	*size = len;
+	return addr & PAGE_MASK;
+}
+
+/* must be done on non-empty ITER_UBUF or ITER_IOVEC one */
 static unsigned long first_iovec_segment(const struct iov_iter *i,
 					 size_t *size, size_t *start,
 					 size_t maxsize, unsigned maxpages)
@@ -1295,6 +1341,11 @@ static unsigned long first_iovec_segment(const struct iov_iter *i,
 	size_t skip;
 	long k;
 
+	if (iter_is_ubuf(i)) {
+		unsigned long addr = (unsigned long)i->ubuf + i->iov_offset;
+		return found_ubuf_segment(addr, maxsize, size, start, maxpages);
+	}
+
 	for (k = 0, skip = i->iov_offset; k < i->nr_segs; k++, skip = 0) {
 		unsigned long addr = (unsigned long)i->iov[k].iov_base + skip;
 		size_t len = i->iov[k].iov_len - skip;
@@ -1303,11 +1354,7 @@ static unsigned long first_iovec_segment(const struct iov_iter *i,
 			continue;
 		if (len > maxsize)
 			len = maxsize;
-		len += (*start = addr % PAGE_SIZE);
-		if (len > maxpages * PAGE_SIZE)
-			len = maxpages * PAGE_SIZE;
-		*size = len;
-		return addr & PAGE_MASK;
+		return found_ubuf_segment(addr, len, size, start, maxpages);
 	}
 	BUG(); // if it had been empty, we wouldn't get called
 }
@@ -1344,7 +1391,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 	if (!maxsize)
 		return 0;
 
-	if (likely(iter_is_iovec(i))) {
+	if (likely(user_backed_iter(i))) {
 		unsigned int gup_flags = 0;
 		unsigned long addr;
 
@@ -1470,7 +1517,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 	if (!maxsize)
 		return 0;
 
-	if (likely(iter_is_iovec(i))) {
+	if (likely(user_backed_iter(i))) {
 		unsigned int gup_flags = 0;
 		unsigned long addr;
 
@@ -1624,6 +1671,11 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
 {
 	if (unlikely(!i->count))
 		return 0;
+	if (likely(iter_is_ubuf(i))) {
+		unsigned offs = offset_in_page(i->ubuf + i->iov_offset);
+		int npages = DIV_ROUND_UP(offs + i->count, PAGE_SIZE);
+		return min(npages, maxpages);
+	}
 	/* iovec and kvec have identical layouts */
 	if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i)))
 		return iov_npages(i, maxpages);
@@ -1862,10 +1914,12 @@ EXPORT_SYMBOL(import_single_range);
 void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
 {
 	if (WARN_ON_ONCE(!iov_iter_is_bvec(i) && !iter_is_iovec(i)) &&
-			 !iov_iter_is_kvec(i))
+			 !iov_iter_is_kvec(i) && !iter_is_ubuf(i))
 		return;
 	i->iov_offset = state->iov_offset;
 	i->count = state->count;
+	if (iter_is_ubuf(i))
+		return;
 	/*
 	 * For the *vec iters, nr_segs + iov is constant - if we increment
 	 * the vec, then we also decrement the nr_segs count. Hence we don't
diff --git a/mm/shmem.c b/mm/shmem.c
index a6f565308133..6b83f3971795 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2603,7 +2603,7 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			ret = copy_page_to_iter(page, offset, nr, to);
 			put_page(page);
 
-		} else if (iter_is_iovec(to)) {
+		} else if (!user_backed_iter(to)) {
 			/*
 			 * Copy to user tends to be so well optimized, but
 			 * clear_user() not so much, that it is noticeably
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 10/44] switch new_sync_{read,write}() to ITER_UBUF
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (7 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 09/44] new iov_iter flavour - ITER_UBUF Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-22  4:15   ` [PATCH 11/44] iov_iter_bvec_advance(): don't bother with bvec_iter Al Viro
                     ` (36 subsequent siblings)
  45 siblings, 0 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/read_write.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index b1b1cdfee9d3..e82e4301cadd 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -389,14 +389,13 @@ EXPORT_SYMBOL(rw_verify_area);
 
 static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
 {
-	struct iovec iov = { .iov_base = buf, .iov_len = len };
 	struct kiocb kiocb;
 	struct iov_iter iter;
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
 	kiocb.ki_pos = (ppos ? *ppos : 0);
-	iov_iter_init(&iter, READ, &iov, 1, len);
+	iov_iter_ubuf(&iter, READ, buf, len);
 
 	ret = call_read_iter(filp, &kiocb, &iter);
 	BUG_ON(ret == -EIOCBQUEUED);
@@ -492,14 +491,13 @@ ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
 
 static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos)
 {
-	struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = len };
 	struct kiocb kiocb;
 	struct iov_iter iter;
 	ssize_t ret;
 
 	init_sync_kiocb(&kiocb, filp);
 	kiocb.ki_pos = (ppos ? *ppos : 0);
-	iov_iter_init(&iter, WRITE, &iov, 1, len);
+	iov_iter_ubuf(&iter, WRITE, (void __user *)buf, len);
 
 	ret = call_write_iter(filp, &kiocb, &iter);
 	BUG_ON(ret == -EIOCBQUEUED);
-- 
2.30.2



* [PATCH 11/44] iov_iter_bvec_advance(): don't bother with bvec_iter
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (8 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 10/44] switch new_sync_{read,write}() to ITER_UBUF Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-27 18:48     ` Jeff Layton
  2022-06-28 12:40     ` Christian Brauner
  2022-06-22  4:15   ` [PATCH 12/44] fix short copy handling in copy_mc_pipe_to_iter() Al Viro
                     ` (35 subsequent siblings)
  45 siblings, 2 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

Do what we do for iovec/kvec; that ends up generating better code,
AFAICS.
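The open-coded walk this patch introduces can be modelled in userspace. A minimal sketch of the same loop shape, with invented stand-in types (`struct seg`, `struct cursor`) in place of `struct bio_vec` and `struct iov_iter`:

```c
#include <assert.h>
#include <stddef.h>

/* Invented stand-ins, just to illustrate the loop shape. */
struct seg { size_t len; };

struct cursor {
	const struct seg *seg;	/* current segment */
	size_t nr_segs;		/* segments remaining */
	size_t offset;		/* offset into current segment */
	size_t count;		/* bytes remaining overall */
};

/* Mirrors the new iov_iter_bvec_advance() shape: fold the current offset
 * into the size, then walk whole segments until the remainder fits. */
static void advance(struct cursor *c, size_t size)
{
	const struct seg *s, *end;

	if (!c->count)
		return;
	c->count -= size;
	size += c->offset;
	for (s = c->seg, end = s + c->nr_segs; s < end; s++) {
		if (size < s->len)
			break;
		size -= s->len;
	}
	c->offset = size;
	c->nr_segs -= s - c->seg;
	c->seg = s;
}
```

Advancing by 6 over three 4-byte segments leaves the cursor 2 bytes into the second segment, with two segments remaining.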

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 8275b28e886b..93ceb13ec7b5 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -870,17 +870,22 @@ static void pipe_advance(struct iov_iter *i, size_t size)
 
 static void iov_iter_bvec_advance(struct iov_iter *i, size_t size)
 {
-	struct bvec_iter bi;
+	const struct bio_vec *bvec, *end;
 
-	bi.bi_size = i->count;
-	bi.bi_bvec_done = i->iov_offset;
-	bi.bi_idx = 0;
-	bvec_iter_advance(i->bvec, &bi, size);
+	if (!i->count)
+		return;
+	i->count -= size;
+
+	size += i->iov_offset;
 
-	i->bvec += bi.bi_idx;
-	i->nr_segs -= bi.bi_idx;
-	i->count = bi.bi_size;
-	i->iov_offset = bi.bi_bvec_done;
+	for (bvec = i->bvec, end = bvec + i->nr_segs; bvec < end; bvec++) {
+		if (likely(size < bvec->bv_len))
+			break;
+		size -= bvec->bv_len;
+	}
+	i->iov_offset = size;
+	i->nr_segs -= bvec - i->bvec;
+	i->bvec = bvec;
 }
 
 static void iov_iter_iovec_advance(struct iov_iter *i, size_t size)
-- 
2.30.2



* [PATCH 12/44] fix short copy handling in copy_mc_pipe_to_iter()
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (9 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 11/44] iov_iter_bvec_advance(): don't bother with bvec_iter Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-27 19:15     ` Jeff Layton
  2022-06-28 12:42     ` Christian Brauner
  2022-06-22  4:15   ` [PATCH 13/44] splice: stop abusing iov_iter_advance() to flush a pipe Al Viro
                     ` (34 subsequent siblings)
  45 siblings, 2 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

Unlike other copying operations on ITER_PIPE, copy_mc_to_iter() can
result in a short copy.  In that case we need to trim the unused
buffers, as well as the length of the partially filled one - it's not
enough to set ->head, ->iov_offset and ->count to reflect how
much we had copied.  Not hard to fix, fortunately...

I'd rather put a helper (pipe_discard_from(pipe, head)) into pipe_fs_i.h
than iov_iter.c - it has nothing to do with iov_iter, and having it there
will allow us to avoid an ugly kludge in fs/splice.c.  We could put it
into lib/iov_iter.c for now and move it later, but I don't see the point
of going that way...
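The helper's ring walk can be modelled in userspace. A minimal sketch, with an invented `mock_pipe` type standing in for `struct pipe_inode_info` and slot-clearing standing in for `pipe_buf_release()`; head indices grow monotonically and are reduced modulo the ring size only on access:

```c
#include <assert.h>

#define RING_SIZE 4		/* power of two, as in the kernel */

struct mock_pipe {
	unsigned int head;	/* next slot to produce into; never wrapped */
	int bufs[RING_SIZE];	/* nonzero = occupied */
};

/* Mirrors pipe_discard_from(): release every buffer produced at or
 * after old_head, walking the head back as we go. */
static void discard_from(struct mock_pipe *pipe, unsigned int old_head)
{
	unsigned int mask = RING_SIZE - 1;

	while (pipe->head > old_head)
		pipe->bufs[--pipe->head & mask] = 0;
}
```

Discarding from head 6 back to 4 clears slots `5 & 3` and `4 & 3` and leaves earlier buffers intact.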

Fixes: ca146f6f091e ("lib/iov_iter: Fix pipe handling in _copy_to_iter_mcsafe()")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 include/linux/pipe_fs_i.h |  9 +++++++++
 lib/iov_iter.c            | 15 +++++++++++----
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index cb0fd633a610..4ea496924106 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -229,6 +229,15 @@ static inline bool pipe_buf_try_steal(struct pipe_inode_info *pipe,
 	return buf->ops->try_steal(pipe, buf);
 }
 
+static inline void pipe_discard_from(struct pipe_inode_info *pipe,
+		unsigned int old_head)
+{
+	unsigned int mask = pipe->ring_size - 1;
+
+	while (pipe->head > old_head)
+		pipe_buf_release(pipe, &pipe->bufs[--pipe->head & mask]);
+}
+
 /* Differs from PIPE_BUF in that PIPE_SIZE is the length of the actual
    memory allocation, whereas PIPE_BUF makes atomicity guarantees.  */
 #define PIPE_SIZE		PAGE_SIZE
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 0b64695ab632..2bf20b48a04a 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -689,6 +689,7 @@ static size_t copy_mc_pipe_to_iter(const void *addr, size_t bytes,
 	struct pipe_inode_info *pipe = i->pipe;
 	unsigned int p_mask = pipe->ring_size - 1;
 	unsigned int i_head;
+	unsigned int valid = pipe->head;
 	size_t n, off, xfer = 0;
 
 	if (!sanity(i))
@@ -702,11 +703,17 @@ static size_t copy_mc_pipe_to_iter(const void *addr, size_t bytes,
 		rem = copy_mc_to_kernel(p + off, addr + xfer, chunk);
 		chunk -= rem;
 		kunmap_local(p);
-		i->head = i_head;
-		i->iov_offset = off + chunk;
-		xfer += chunk;
-		if (rem)
+		if (chunk) {
+			i->head = i_head;
+			i->iov_offset = off + chunk;
+			xfer += chunk;
+			valid = i_head + 1;
+		}
+		if (rem) {
+			pipe->bufs[i_head & p_mask].len -= rem;
+			pipe_discard_from(pipe, valid);
 			break;
+		}
 		n -= chunk;
 		off = 0;
 		i_head++;
-- 
2.30.2



* [PATCH 13/44] splice: stop abusing iov_iter_advance() to flush a pipe
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (10 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 12/44] fix short copy handling in copy_mc_pipe_to_iter() Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-27 19:17     ` Jeff Layton
  2022-06-28 12:43     ` Christian Brauner
  2022-06-22  4:15   ` [PATCH 14/44] ITER_PIPE: helper for getting pipe buffer by index Al Viro
                     ` (33 subsequent siblings)
  45 siblings, 2 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

Use pipe_discard_from() explicitly in generic_file_splice_read(); don't
bother with the rather non-obvious use of iov_iter_advance() in there.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/splice.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 047b79db8eb5..6645b30ec990 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -301,11 +301,9 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 {
 	struct iov_iter to;
 	struct kiocb kiocb;
-	unsigned int i_head;
 	int ret;
 
 	iov_iter_pipe(&to, READ, pipe, len);
-	i_head = to.head;
 	init_sync_kiocb(&kiocb, in);
 	kiocb.ki_pos = *ppos;
 	ret = call_read_iter(in, &kiocb, &to);
@@ -313,9 +311,8 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 		*ppos = kiocb.ki_pos;
 		file_accessed(in);
 	} else if (ret < 0) {
-		to.head = i_head;
-		to.iov_offset = 0;
-		iov_iter_advance(&to, 0); /* to free what was emitted */
+		/* free what was emitted */
+		pipe_discard_from(pipe, to.start_head);
 		/*
 		 * callers of ->splice_read() expect -EAGAIN on
 		 * "can't put anything in there", rather than -EFAULT.
-- 
2.30.2



* [PATCH 14/44] ITER_PIPE: helper for getting pipe buffer by index
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (11 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 13/44] splice: stop abusing iov_iter_advance() to flush a pipe Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 10:38     ` Jeff Layton
  2022-06-28 12:45     ` Christian Brauner
  2022-06-22  4:15   ` [PATCH 15/44] ITER_PIPE: helpers for adding pipe buffers Al Viro
                     ` (32 subsequent siblings)
  45 siblings, 2 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

pipe_buffer instances of a pipe are organized as a ring buffer,
with power-of-2 size.  Indices are kept *not* reduced modulo ring
size, so the buffer referred to by index N is
	pipe->bufs[N & (pipe->ring_size - 1)].

Ring size can change over the lifetime of a pipe, but not while
the pipe is locked.  So for any iov_iter primitives it's a constant.
The original conversion of pipes to this layout went overboard trying
to micro-optimize that - calculating pipe->ring_size - 1, storing
it in a local variable and using it throughout the function.  In some
cases that might be warranted, but most of the time it only
obfuscates what's going on in there.

Introduce a helper (pipe_buf(pipe, N)) that encapsulates that
calculation and use it in the obvious cases.  More will follow...
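The indexing rule can be sketched in isolation; `pipe_buf_at()` and `struct buf` below are invented stand-ins for the kernel's pipe_buf() and struct pipe_buffer, assuming a power-of-two ring size:

```c
#include <assert.h>

/* Invented stand-in for struct pipe_buffer. */
struct buf { int id; };

/* A pipe index is never reduced modulo the ring size; the reduction
 * happens only at the moment of access, as in pipe_buf(). */
static struct buf *pipe_buf_at(struct buf *bufs, unsigned int ring_size,
			       unsigned int slot)
{
	return &bufs[slot & (ring_size - 1)];
}
```

With a ring of 8, slots 9 and 17 both land on `bufs[1]`; occupancy limits in the real pipe keep such aliases from being live at the same time.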

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index d00cc8971b5b..08bb393da677 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -183,13 +183,18 @@ static int copyin(void *to, const void __user *from, size_t n)
 	return n;
 }
 
+static inline struct pipe_buffer *pipe_buf(const struct pipe_inode_info *pipe,
+					   unsigned int slot)
+{
+	return &pipe->bufs[slot & (pipe->ring_size - 1)];
+}
+
 #ifdef PIPE_PARANOIA
 static bool sanity(const struct iov_iter *i)
 {
 	struct pipe_inode_info *pipe = i->pipe;
 	unsigned int p_head = pipe->head;
 	unsigned int p_tail = pipe->tail;
-	unsigned int p_mask = pipe->ring_size - 1;
 	unsigned int p_occupancy = pipe_occupancy(p_head, p_tail);
 	unsigned int i_head = i->head;
 	unsigned int idx;
@@ -201,7 +206,7 @@ static bool sanity(const struct iov_iter *i)
 		if (unlikely(i_head != p_head - 1))
 			goto Bad;	// must be at the last buffer...
 
-		p = &pipe->bufs[i_head & p_mask];
+		p = pipe_buf(pipe, i_head);
 		if (unlikely(p->offset + p->len != i->iov_offset))
 			goto Bad;	// ... at the end of segment
 	} else {
@@ -386,11 +391,10 @@ static inline bool allocated(struct pipe_buffer *buf)
 static inline void data_start(const struct iov_iter *i,
 			      unsigned int *iter_headp, size_t *offp)
 {
-	unsigned int p_mask = i->pipe->ring_size - 1;
 	unsigned int iter_head = i->head;
 	size_t off = i->iov_offset;
 
-	if (off && (!allocated(&i->pipe->bufs[iter_head & p_mask]) ||
+	if (off && (!allocated(pipe_buf(i->pipe, iter_head)) ||
 		    off == PAGE_SIZE)) {
 		iter_head++;
 		off = 0;
@@ -1180,10 +1184,9 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)
 		return iov_iter_alignment_bvec(i);
 
 	if (iov_iter_is_pipe(i)) {
-		unsigned int p_mask = i->pipe->ring_size - 1;
 		size_t size = i->count;
 
-		if (size && i->iov_offset && allocated(&i->pipe->bufs[i->head & p_mask]))
+		if (size && i->iov_offset && allocated(pipe_buf(i->pipe, i->head)))
 			return size | i->iov_offset;
 		return size;
 	}
-- 
2.30.2



* [PATCH 15/44] ITER_PIPE: helpers for adding pipe buffers
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (12 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 14/44] ITER_PIPE: helper for getting pipe buffer by index Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 11:32     ` Jeff Layton
  2022-06-22  4:15   ` [PATCH 16/44] ITER_PIPE: allocate buffers as we go in copy-to-pipe primitives Al Viro
                     ` (31 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

There are only two kinds of pipe_buffer in the area used by ITER_PIPE.

1) anonymous - copy_to_iter() et al. end up creating those and copying
data there.  They have zero ->offset, and their ->ops points to
default_pipe_buf_ops.

2) zero-copy ones - those come from copy_page_to_iter(), and the page
comes from the caller.  ->offset is also caller-supplied - it might be
non-zero.  ->ops points to page_cache_pipe_buf_ops.

Move creation and insertion of those into helpers - push_anon(pipe, size)
and push_page(pipe, page, offset, size) respectively, separating them from
the "could we avoid creating a new buffer by merging with the current
head?" logic.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 88 ++++++++++++++++++++++++++------------------------
 1 file changed, 46 insertions(+), 42 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 08bb393da677..924854c2a7ce 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -231,15 +231,39 @@ static bool sanity(const struct iov_iter *i)
 #define sanity(i) true
 #endif
 
+static struct page *push_anon(struct pipe_inode_info *pipe, unsigned size)
+{
+	struct page *page = alloc_page(GFP_USER);
+	if (page) {
+		struct pipe_buffer *buf = pipe_buf(pipe, pipe->head++);
+		*buf = (struct pipe_buffer) {
+			.ops = &default_pipe_buf_ops,
+			.page = page,
+			.offset = 0,
+			.len = size
+		};
+	}
+	return page;
+}
+
+static void push_page(struct pipe_inode_info *pipe, struct page *page,
+			unsigned int offset, unsigned int size)
+{
+	struct pipe_buffer *buf = pipe_buf(pipe, pipe->head++);
+	*buf = (struct pipe_buffer) {
+		.ops = &page_cache_pipe_buf_ops,
+		.page = page,
+		.offset = offset,
+		.len = size
+	};
+	get_page(page);
+}
+
 static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
 	struct pipe_inode_info *pipe = i->pipe;
-	struct pipe_buffer *buf;
-	unsigned int p_tail = pipe->tail;
-	unsigned int p_mask = pipe->ring_size - 1;
-	unsigned int i_head = i->head;
-	size_t off;
+	unsigned int head = pipe->head;
 
 	if (unlikely(bytes > i->count))
 		bytes = i->count;
@@ -250,32 +274,21 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
 	if (!sanity(i))
 		return 0;
 
-	off = i->iov_offset;
-	buf = &pipe->bufs[i_head & p_mask];
-	if (off) {
-		if (offset == off && buf->page == page) {
-			/* merge with the last one */
+	if (offset && i->iov_offset == offset) { // could we merge it?
+		struct pipe_buffer *buf = pipe_buf(pipe, head - 1);
+		if (buf->page == page) {
 			buf->len += bytes;
 			i->iov_offset += bytes;
-			goto out;
+			i->count -= bytes;
+			return bytes;
 		}
-		i_head++;
-		buf = &pipe->bufs[i_head & p_mask];
 	}
-	if (pipe_full(i_head, p_tail, pipe->max_usage))
+	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
 		return 0;
 
-	buf->ops = &page_cache_pipe_buf_ops;
-	buf->flags = 0;
-	get_page(page);
-	buf->page = page;
-	buf->offset = offset;
-	buf->len = bytes;
-
-	pipe->head = i_head + 1;
+	push_page(pipe, page, offset, bytes);
 	i->iov_offset = offset + bytes;
-	i->head = i_head;
-out:
+	i->head = head;
 	i->count -= bytes;
 	return bytes;
 }
@@ -407,8 +420,6 @@ static size_t push_pipe(struct iov_iter *i, size_t size,
 			int *iter_headp, size_t *offp)
 {
 	struct pipe_inode_info *pipe = i->pipe;
-	unsigned int p_tail = pipe->tail;
-	unsigned int p_mask = pipe->ring_size - 1;
 	unsigned int iter_head;
 	size_t off;
 	ssize_t left;
@@ -423,30 +434,23 @@ static size_t push_pipe(struct iov_iter *i, size_t size,
 	*iter_headp = iter_head;
 	*offp = off;
 	if (off) {
+		struct pipe_buffer *buf = pipe_buf(pipe, iter_head);
+
 		left -= PAGE_SIZE - off;
 		if (left <= 0) {
-			pipe->bufs[iter_head & p_mask].len += size;
+			buf->len += size;
 			return size;
 		}
-		pipe->bufs[iter_head & p_mask].len = PAGE_SIZE;
-		iter_head++;
+		buf->len = PAGE_SIZE;
 	}
-	while (!pipe_full(iter_head, p_tail, pipe->max_usage)) {
-		struct pipe_buffer *buf = &pipe->bufs[iter_head & p_mask];
-		struct page *page = alloc_page(GFP_USER);
+	while (!pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
+		struct page *page = push_anon(pipe,
+					      min_t(ssize_t, left, PAGE_SIZE));
 		if (!page)
 			break;
 
-		buf->ops = &default_pipe_buf_ops;
-		buf->flags = 0;
-		buf->page = page;
-		buf->offset = 0;
-		buf->len = min_t(ssize_t, left, PAGE_SIZE);
-		left -= buf->len;
-		iter_head++;
-		pipe->head = iter_head;
-
-		if (left == 0)
+		left -= PAGE_SIZE;
+		if (left <= 0)
 			return size;
 	}
 	return size - left;
-- 
2.30.2



* [PATCH 16/44] ITER_PIPE: allocate buffers as we go in copy-to-pipe primitives
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (13 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 15/44] ITER_PIPE: helpers for adding pipe buffers Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-22  4:15   ` [PATCH 17/44] ITER_PIPE: fold push_pipe() into __pipe_get_pages() Al Viro
                     ` (30 subsequent siblings)
  45 siblings, 0 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

New helper: append_pipe().  It extends the last buffer if possible
and allocates a new one otherwise.  Returns the page and offset in it
on success, NULL on failure.  The iov_iter is advanced past the
data we've got.

Use that instead of push_pipe() in copy-to-pipe primitives;
they get simpler that way.  Handling of a short copy (in the "mc" one)
is done simply by iov_iter_revert() - the iov_iter is in a consistent
state after that one, so we can use that.
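The extend-or-allocate decision can be modelled in userspace; `append_chunk()` is an invented stand-in that only tracks the iov_offset bookkeeping, ignoring the allocation-failure and ring-full paths of the real append_pipe():

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_PAGE_SIZE 4096u

/* iov_offset tracks how far into the last page we've written.  Given a
 * request of `size` bytes, return the chunk that fits in the current
 * page and the offset to copy at; advance iov_offset the way
 * append_pipe() does. */
static size_t append_chunk(size_t *iov_offset, size_t size, size_t *off)
{
	size_t offset = *iov_offset;

	if (offset && offset < MOCK_PAGE_SIZE) {
		/* some space in the last buffer; add to it */
		size = size < MOCK_PAGE_SIZE - offset
			? size : MOCK_PAGE_SIZE - offset;
		*off = offset;
	} else {
		/* last page is full (or there is none); start a new one */
		size = size < MOCK_PAGE_SIZE ? size : MOCK_PAGE_SIZE;
		*off = 0;
		offset = 0;
	}
	*iov_offset = offset + size;
	return size;
}
```

A caller loops on this until the request is exhausted, copying each returned chunk at the returned offset - the same shape as the reworked copy_pipe_to_iter().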

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 159 +++++++++++++++++++++++++++++--------------------
 1 file changed, 93 insertions(+), 66 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 924854c2a7ce..2a445261096e 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -259,6 +259,44 @@ static void push_page(struct pipe_inode_info *pipe, struct page *page,
 	get_page(page);
 }
 
+static inline bool allocated(struct pipe_buffer *buf)
+{
+	return buf->ops == &default_pipe_buf_ops;
+}
+
+static struct page *append_pipe(struct iov_iter *i, size_t size, size_t *off)
+{
+	struct pipe_inode_info *pipe = i->pipe;
+	size_t offset = i->iov_offset;
+	struct pipe_buffer *buf;
+	struct page *page;
+
+	if (offset && offset < PAGE_SIZE) {
+		// some space in the last buffer; can we add to it?
+		buf = pipe_buf(pipe, pipe->head - 1);
+		if (allocated(buf)) {
+			size = min_t(size_t, size, PAGE_SIZE - offset);
+			buf->len += size;
+			i->iov_offset += size;
+			i->count -= size;
+			*off = offset;
+			return buf->page;
+		}
+	}
+	// OK, we need a new buffer
+	*off = 0;
+	size = min_t(size_t, size, PAGE_SIZE);
+	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
+		return NULL;
+	page = push_anon(pipe, size);
+	if (!page)
+		return NULL;
+	i->head = pipe->head - 1;
+	i->iov_offset = size;
+	i->count -= size;
+	return page;
+}
+
 static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
@@ -396,11 +434,6 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_init);
 
-static inline bool allocated(struct pipe_buffer *buf)
-{
-	return buf->ops == &default_pipe_buf_ops;
-}
-
 static inline void data_start(const struct iov_iter *i,
 			      unsigned int *iter_headp, size_t *offp)
 {
@@ -459,28 +492,26 @@ static size_t push_pipe(struct iov_iter *i, size_t size,
 static size_t copy_pipe_to_iter(const void *addr, size_t bytes,
 				struct iov_iter *i)
 {
-	struct pipe_inode_info *pipe = i->pipe;
-	unsigned int p_mask = pipe->ring_size - 1;
-	unsigned int i_head;
 	size_t n, off;
 
-	if (!sanity(i))
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+	if (unlikely(!bytes))
 		return 0;
 
-	bytes = n = push_pipe(i, bytes, &i_head, &off);
-	if (unlikely(!n))
+	if (!sanity(i))
 		return 0;
-	do {
+
+	n = bytes;
+	while (n) {
+		struct page *page = append_pipe(i, n, &off);
 		size_t chunk = min_t(size_t, n, PAGE_SIZE - off);
-		memcpy_to_page(pipe->bufs[i_head & p_mask].page, off, addr, chunk);
-		i->head = i_head;
-		i->iov_offset = off + chunk;
-		n -= chunk;
+		if (!page)
+			break;
+		memcpy_to_page(page, off, addr, chunk);
 		addr += chunk;
-		off = 0;
-		i_head++;
-	} while (n);
-	i->count -= bytes;
+		n -= chunk;
+	}
 	return bytes;
 }
 
@@ -494,31 +525,32 @@ static __wsum csum_and_memcpy(void *to, const void *from, size_t len,
 static size_t csum_and_copy_to_pipe_iter(const void *addr, size_t bytes,
 					 struct iov_iter *i, __wsum *sump)
 {
-	struct pipe_inode_info *pipe = i->pipe;
-	unsigned int p_mask = pipe->ring_size - 1;
 	__wsum sum = *sump;
 	size_t off = 0;
-	unsigned int i_head;
 	size_t r;
 
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+	if (unlikely(!bytes))
+		return 0;
+
 	if (!sanity(i))
 		return 0;
 
-	bytes = push_pipe(i, bytes, &i_head, &r);
 	while (bytes) {
+		struct page *page = append_pipe(i, bytes, &r);
 		size_t chunk = min_t(size_t, bytes, PAGE_SIZE - r);
-		char *p = kmap_local_page(pipe->bufs[i_head & p_mask].page);
+		char *p;
+
+		if (!page)
+			break;
+		p = kmap_local_page(page);
 		sum = csum_and_memcpy(p + r, addr + off, chunk, sum, off);
 		kunmap_local(p);
-		i->head = i_head;
-		i->iov_offset = r + chunk;
-		bytes -= chunk;
 		off += chunk;
-		r = 0;
-		i_head++;
+		bytes -= chunk;
 	}
 	*sump = sum;
-	i->count -= off;
 	return off;
 }
 
@@ -550,39 +582,35 @@ static int copyout_mc(void __user *to, const void *from, size_t n)
 static size_t copy_mc_pipe_to_iter(const void *addr, size_t bytes,
 				struct iov_iter *i)
 {
-	struct pipe_inode_info *pipe = i->pipe;
-	unsigned int p_mask = pipe->ring_size - 1;
-	unsigned int i_head;
-	unsigned int valid = pipe->head;
-	size_t n, off, xfer = 0;
+	size_t off, xfer = 0;
+
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+	if (unlikely(!bytes))
+		return 0;
 
 	if (!sanity(i))
 		return 0;
 
-	n = push_pipe(i, bytes, &i_head, &off);
-	while (n) {
-		size_t chunk = min_t(size_t, n, PAGE_SIZE - off);
-		char *p = kmap_local_page(pipe->bufs[i_head & p_mask].page);
+	while (bytes) {
+		struct page *page = append_pipe(i, bytes, &off);
+		size_t chunk = min_t(size_t, bytes, PAGE_SIZE - off);
 		unsigned long rem;
+		char *p;
+
+		if (!page)
+			break;
+		p = kmap_local_page(page);
 		rem = copy_mc_to_kernel(p + off, addr + xfer, chunk);
 		chunk -= rem;
 		kunmap_local(p);
-		if (chunk) {
-			i->head = i_head;
-			i->iov_offset = off + chunk;
-			xfer += chunk;
-			valid = i_head + 1;
-		}
+		xfer += chunk;
+		bytes -= chunk;
 		if (rem) {
-			pipe->bufs[i_head & p_mask].len -= rem;
-			pipe_discard_from(pipe, valid);
+			iov_iter_revert(i, rem);
 			break;
 		}
-		n -= chunk;
-		off = 0;
-		i_head++;
 	}
-	i->count -= xfer;
 	return xfer;
 }
 
@@ -769,30 +797,29 @@ EXPORT_SYMBOL(copy_page_from_iter);
 
 static size_t pipe_zero(size_t bytes, struct iov_iter *i)
 {
-	struct pipe_inode_info *pipe = i->pipe;
-	unsigned int p_mask = pipe->ring_size - 1;
-	unsigned int i_head;
 	size_t n, off;
 
-	if (!sanity(i))
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+	if (unlikely(!bytes))
 		return 0;
 
-	bytes = n = push_pipe(i, bytes, &i_head, &off);
-	if (unlikely(!n))
+	if (!sanity(i))
 		return 0;
 
-	do {
+	n = bytes;
+	while (n) {
+		struct page *page = append_pipe(i, n, &off);
 		size_t chunk = min_t(size_t, n, PAGE_SIZE - off);
-		char *p = kmap_local_page(pipe->bufs[i_head & p_mask].page);
+		char *p;
+
+		if (!page)
+			break;
+		p = kmap_local_page(page);
 		memset(p + off, 0, chunk);
 		kunmap_local(p);
-		i->head = i_head;
-		i->iov_offset = off + chunk;
 		n -= chunk;
-		off = 0;
-		i_head++;
-	} while (n);
-	i->count -= bytes;
+	}
 	return bytes;
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 17/44] ITER_PIPE: fold push_pipe() into __pipe_get_pages()
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (14 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 16/44] ITER_PIPE: allocate buffers as we go in copy-to-pipe primitives Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-22  4:15   ` [PATCH 18/44] ITER_PIPE: lose iter_head argument of __pipe_get_pages() Al Viro
                     ` (29 subsequent siblings)
  45 siblings, 0 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

	Expand the only remaining call of push_pipe() (in
__pipe_get_pages()) and combine it with the page-collecting loop there.

Note that the only reason it's not a loop doing append_pipe() is
that append_pipe() is advancing, while iov_iter_get_pages() is not.
As soon as it switches to saner semantics, this thing will switch
to using append_pipe().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 80 ++++++++++++++++----------------------------------
 1 file changed, 25 insertions(+), 55 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 2a445261096e..a507eed67839 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -449,46 +449,6 @@ static inline void data_start(const struct iov_iter *i,
 	*offp = off;
 }
 
-static size_t push_pipe(struct iov_iter *i, size_t size,
-			int *iter_headp, size_t *offp)
-{
-	struct pipe_inode_info *pipe = i->pipe;
-	unsigned int iter_head;
-	size_t off;
-	ssize_t left;
-
-	if (unlikely(size > i->count))
-		size = i->count;
-	if (unlikely(!size))
-		return 0;
-
-	left = size;
-	data_start(i, &iter_head, &off);
-	*iter_headp = iter_head;
-	*offp = off;
-	if (off) {
-		struct pipe_buffer *buf = pipe_buf(pipe, iter_head);
-
-		left -= PAGE_SIZE - off;
-		if (left <= 0) {
-			buf->len += size;
-			return size;
-		}
-		buf->len = PAGE_SIZE;
-	}
-	while (!pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
-		struct page *page = push_anon(pipe,
-					      min_t(ssize_t, left, PAGE_SIZE));
-		if (!page)
-			break;
-
-		left -= PAGE_SIZE;
-		if (left <= 0)
-			return size;
-	}
-	return size - left;
-}
-
 static size_t copy_pipe_to_iter(const void *addr, size_t bytes,
 				struct iov_iter *i)
 {
@@ -1261,23 +1221,33 @@ static inline ssize_t __pipe_get_pages(struct iov_iter *i,
 				size_t maxsize,
 				struct page **pages,
 				int iter_head,
-				size_t *start)
+				size_t off)
 {
 	struct pipe_inode_info *pipe = i->pipe;
-	unsigned int p_mask = pipe->ring_size - 1;
-	ssize_t n = push_pipe(i, maxsize, &iter_head, start);
-	if (!n)
-		return -EFAULT;
+	ssize_t left = maxsize;
 
-	maxsize = n;
-	n += *start;
-	while (n > 0) {
-		get_page(*pages++ = pipe->bufs[iter_head & p_mask].page);
-		iter_head++;
-		n -= PAGE_SIZE;
-	}
+	if (off) {
+		struct pipe_buffer *buf = pipe_buf(pipe, iter_head);
 
-	return maxsize;
+		get_page(*pages++ = buf->page);
+		left -= PAGE_SIZE - off;
+		if (left <= 0) {
+			buf->len += maxsize;
+			return maxsize;
+		}
+		buf->len = PAGE_SIZE;
+	}
+	while (!pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
+		struct page *page = push_anon(pipe,
+					      min_t(ssize_t, left, PAGE_SIZE));
+		if (!page)
+			break;
+		get_page(*pages++ = page);
+		left -= PAGE_SIZE;
+		if (left <= 0)
+			return maxsize;
+	}
+	return maxsize - left ? : -EFAULT;
 }
 
 static ssize_t pipe_get_pages(struct iov_iter *i,
@@ -1295,7 +1265,7 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
 	npages = pipe_space_for_user(iter_head, i->pipe->tail, i->pipe);
 	capacity = min(npages, maxpages) * PAGE_SIZE - *start;
 
-	return __pipe_get_pages(i, min(maxsize, capacity), pages, iter_head, start);
+	return __pipe_get_pages(i, min(maxsize, capacity), pages, iter_head, *start);
 }
 
 static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa,
@@ -1491,7 +1461,7 @@ static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
 	p = get_pages_array(npages);
 	if (!p)
 		return -ENOMEM;
-	n = __pipe_get_pages(i, maxsize, p, iter_head, start);
+	n = __pipe_get_pages(i, maxsize, p, iter_head, *start);
 	if (n > 0)
 		*pages = p;
 	else
-- 
2.30.2



* [PATCH 18/44] ITER_PIPE: lose iter_head argument of __pipe_get_pages()
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (15 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 17/44] ITER_PIPE: fold push_pipe() into __pipe_get_pages() Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-22  4:15   ` [PATCH 19/44] ITER_PIPE: clean pipe_advance() up Al Viro
                     ` (28 subsequent siblings)
  45 siblings, 0 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

it's only used to get to the partial buffer we can add to,
and that's always the last one, i.e. pipe->head - 1.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index a507eed67839..4b5a98105547 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1220,14 +1220,13 @@ EXPORT_SYMBOL(iov_iter_gap_alignment);
 static inline ssize_t __pipe_get_pages(struct iov_iter *i,
 				size_t maxsize,
 				struct page **pages,
-				int iter_head,
 				size_t off)
 {
 	struct pipe_inode_info *pipe = i->pipe;
 	ssize_t left = maxsize;
 
 	if (off) {
-		struct pipe_buffer *buf = pipe_buf(pipe, iter_head);
+		struct pipe_buffer *buf = pipe_buf(pipe, pipe->head - 1);
 
 		get_page(*pages++ = buf->page);
 		left -= PAGE_SIZE - off;
@@ -1265,7 +1264,7 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
 	npages = pipe_space_for_user(iter_head, i->pipe->tail, i->pipe);
 	capacity = min(npages, maxpages) * PAGE_SIZE - *start;
 
-	return __pipe_get_pages(i, min(maxsize, capacity), pages, iter_head, *start);
+	return __pipe_get_pages(i, min(maxsize, capacity), pages, *start);
 }
 
 static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa,
@@ -1461,7 +1460,7 @@ static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
 	p = get_pages_array(npages);
 	if (!p)
 		return -ENOMEM;
-	n = __pipe_get_pages(i, maxsize, p, iter_head, *start);
+	n = __pipe_get_pages(i, maxsize, p, *start);
 	if (n > 0)
 		*pages = p;
 	else
-- 
2.30.2



* [PATCH 19/44] ITER_PIPE: clean pipe_advance() up
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (16 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 18/44] ITER_PIPE: lose iter_head argument of __pipe_get_pages() Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-22  4:15   ` [PATCH 20/44] ITER_PIPE: clean iov_iter_revert() Al Viro
                     ` (27 subsequent siblings)
  45 siblings, 0 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

Instead of setting ->iov_offset to the new position and calling
pipe_truncate() to adjust ->len of the last buffer and discard
everything after it, adjust ->len at the same time we set ->iov_offset
and use pipe_discard_from() to deal with the buffers past that point.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 4b5a98105547..6d693c1d189d 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -847,27 +847,27 @@ static inline void pipe_truncate(struct iov_iter *i)
 static void pipe_advance(struct iov_iter *i, size_t size)
 {
 	struct pipe_inode_info *pipe = i->pipe;
-	if (size) {
-		struct pipe_buffer *buf;
-		unsigned int p_mask = pipe->ring_size - 1;
-		unsigned int i_head = i->head;
-		size_t off = i->iov_offset, left = size;
+	unsigned int off = i->iov_offset;
 
+	if (!off && !size) {
+		pipe_discard_from(pipe, i->start_head); // discard everything
+		return;
+	}
+	i->count -= size;
+	while (1) {
+		struct pipe_buffer *buf = pipe_buf(pipe, i->head);
 		if (off) /* make it relative to the beginning of buffer */
-			left += off - pipe->bufs[i_head & p_mask].offset;
-		while (1) {
-			buf = &pipe->bufs[i_head & p_mask];
-			if (left <= buf->len)
-				break;
-			left -= buf->len;
-			i_head++;
+			size += off - buf->offset;
+		if (size <= buf->len) {
+			buf->len = size;
+			i->iov_offset = buf->offset + size;
+			break;
 		}
-		i->head = i_head;
-		i->iov_offset = buf->offset + left;
+		size -= buf->len;
+		i->head++;
+		off = 0;
 	}
-	i->count -= size;
-	/* ... and discard everything past that point */
-	pipe_truncate(i);
+	pipe_discard_from(pipe, i->head + 1); // discard everything past this one
 }
 
 static void iov_iter_bvec_advance(struct iov_iter *i, size_t size)
-- 
2.30.2



* [PATCH 20/44] ITER_PIPE: clean iov_iter_revert()
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (17 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 19/44] ITER_PIPE: clean pipe_advance() up Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-22  4:15   ` [PATCH 21/44] ITER_PIPE: cache the type of last buffer Al Viro
                     ` (26 subsequent siblings)
  45 siblings, 0 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

Fold pipe_truncate() into it, clean up.  We can release buffers
in the same loop where we walk backwards to the iterator beginning
looking for the place where the new position will be.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 60 ++++++++++++--------------------------------------
 1 file changed, 14 insertions(+), 46 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 6d693c1d189d..4e2b000b0466 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -818,32 +818,6 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t byt
 }
 EXPORT_SYMBOL(copy_page_from_iter_atomic);
 
-static inline void pipe_truncate(struct iov_iter *i)
-{
-	struct pipe_inode_info *pipe = i->pipe;
-	unsigned int p_tail = pipe->tail;
-	unsigned int p_head = pipe->head;
-	unsigned int p_mask = pipe->ring_size - 1;
-
-	if (!pipe_empty(p_head, p_tail)) {
-		struct pipe_buffer *buf;
-		unsigned int i_head = i->head;
-		size_t off = i->iov_offset;
-
-		if (off) {
-			buf = &pipe->bufs[i_head & p_mask];
-			buf->len = off - buf->offset;
-			i_head++;
-		}
-		while (p_head != i_head) {
-			p_head--;
-			pipe_buf_release(pipe, &pipe->bufs[p_head & p_mask]);
-		}
-
-		pipe->head = p_head;
-	}
-}
-
 static void pipe_advance(struct iov_iter *i, size_t size)
 {
 	struct pipe_inode_info *pipe = i->pipe;
@@ -938,28 +912,22 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
 	i->count += unroll;
 	if (unlikely(iov_iter_is_pipe(i))) {
 		struct pipe_inode_info *pipe = i->pipe;
-		unsigned int p_mask = pipe->ring_size - 1;
-		unsigned int i_head = i->head;
-		size_t off = i->iov_offset;
-		while (1) {
-			struct pipe_buffer *b = &pipe->bufs[i_head & p_mask];
-			size_t n = off - b->offset;
-			if (unroll < n) {
-				off -= unroll;
-				break;
-			}
-			unroll -= n;
-			if (!unroll && i_head == i->start_head) {
-				off = 0;
-				break;
+		unsigned int head = pipe->head;
+
+		while (head > i->start_head) {
+			struct pipe_buffer *b = pipe_buf(pipe, --head);
+			if (unroll < b->len) {
+				b->len -= unroll;
+				i->iov_offset = b->offset + b->len;
+				i->head = head;
+				return;
 			}
-			i_head--;
-			b = &pipe->bufs[i_head & p_mask];
-			off = b->offset + b->len;
+			unroll -= b->len;
+			pipe_buf_release(pipe, b);
+			pipe->head--;
 		}
-		i->iov_offset = off;
-		i->head = i_head;
-		pipe_truncate(i);
+		i->iov_offset = 0;
+		i->head = head;
 		return;
 	}
 	if (unlikely(iov_iter_is_discard(i)))
-- 
2.30.2



* [PATCH 21/44] ITER_PIPE: cache the type of last buffer
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (18 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 20/44] ITER_PIPE: clean iov_iter_revert() Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-22  4:15   ` [PATCH 22/44] ITER_PIPE: fold data_start() and pipe_space_for_user() together Al Viro
                     ` (25 subsequent siblings)
  45 siblings, 0 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

We often need to find whether the last buffer is anon or not, and
currently it's rather clumsy:
	check if ->iov_offset is non-zero (i.e. that pipe is not empty)
	if so, get the corresponding pipe_buffer and check its ->ops
	if it's &default_pipe_buf_ops, we have an anon buffer.

Let's replace the use of ->iov_offset (which is nowhere near similar to
its role for other flavours) with signed field (->last_offset), with
the following rules:
	empty, no buffers occupied:		0
	anon, with bytes up to N-1 filled:	N
	zero-copy, with bytes up to N-1 filled:	-N

That way abs(i->last_offset) is equal to what used to be in i->iov_offset
and empty vs. anon vs. zero-copy can be distinguished by the sign of
i->last_offset.

	Checks for "should we extend the last buffer or should we start
a new one?" become easier to follow that way.

	Note that most of the operations can only be done in a sane
state - i.e. when the pipe has nothing past the current position of
iterator.  About the only thing that could be done outside of that
state is iov_iter_advance(), which transitions to the sane state by
truncating the pipe.  There are only two cases where we leave the
sane state:
	1) iov_iter_get_pages()/iov_iter_get_pages_alloc().  Will be
dealt with later, when we make get_pages advancing - the callers are
actually happier that way.
	2) iov_iter copied, then something is put into the copy.  Since
they share the underlying pipe, the original gets behind.  When we
decide that we are done with the copy (original is not usable until then)
we advance the original.  direct_io used to be done that way; nowadays
it operates on the original and we do iov_iter_revert() to discard
the excessive data.  At the moment there's nothing in the kernel that
could do that to ITER_PIPE iterators, so this reason for insane state
is theoretical right now.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 include/linux/uio.h |  5 +++-
 lib/iov_iter.c      | 72 ++++++++++++++++++++++-----------------------
 2 files changed, 40 insertions(+), 37 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 6ab4260c3d6c..d3e13b37ea72 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -40,7 +40,10 @@ struct iov_iter {
 	bool nofault;
 	bool data_source;
 	bool user_backed;
-	size_t iov_offset;
+	union {
+		size_t iov_offset;
+		int last_offset;
+	};
 	size_t count;
 	union {
 		const struct iovec *iov;
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 4e2b000b0466..27ad2ef93dbc 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -199,7 +199,7 @@ static bool sanity(const struct iov_iter *i)
 	unsigned int i_head = i->head;
 	unsigned int idx;
 
-	if (i->iov_offset) {
+	if (i->last_offset) {
 		struct pipe_buffer *p;
 		if (unlikely(p_occupancy == 0))
 			goto Bad;	// pipe must be non-empty
@@ -207,7 +207,7 @@ static bool sanity(const struct iov_iter *i)
 			goto Bad;	// must be at the last buffer...
 
 		p = pipe_buf(pipe, i_head);
-		if (unlikely(p->offset + p->len != i->iov_offset))
+		if (unlikely(p->offset + p->len != abs(i->last_offset)))
 			goto Bad;	// ... at the end of segment
 	} else {
 		if (i_head != p_head)
@@ -215,7 +215,7 @@ static bool sanity(const struct iov_iter *i)
 	}
 	return true;
 Bad:
-	printk(KERN_ERR "idx = %d, offset = %zd\n", i_head, i->iov_offset);
+	printk(KERN_ERR "idx = %d, offset = %d\n", i_head, i->last_offset);
 	printk(KERN_ERR "head = %d, tail = %d, buffers = %d\n",
 			p_head, p_tail, pipe->ring_size);
 	for (idx = 0; idx < pipe->ring_size; idx++)
@@ -259,29 +259,30 @@ static void push_page(struct pipe_inode_info *pipe, struct page *page,
 	get_page(page);
 }
 
-static inline bool allocated(struct pipe_buffer *buf)
+static inline int last_offset(const struct pipe_buffer *buf)
 {
-	return buf->ops == &default_pipe_buf_ops;
+	if (buf->ops == &default_pipe_buf_ops)
+		return buf->len;	// buf->offset is 0 for those
+	else
+		return -(buf->offset + buf->len);
 }
 
 static struct page *append_pipe(struct iov_iter *i, size_t size, size_t *off)
 {
 	struct pipe_inode_info *pipe = i->pipe;
-	size_t offset = i->iov_offset;
+	int offset = i->last_offset;
 	struct pipe_buffer *buf;
 	struct page *page;
 
-	if (offset && offset < PAGE_SIZE) {
-		// some space in the last buffer; can we add to it?
+	if (offset > 0 && offset < PAGE_SIZE) {
+		// some space in the last buffer; add to it
 		buf = pipe_buf(pipe, pipe->head - 1);
-		if (allocated(buf)) {
-			size = min_t(size_t, size, PAGE_SIZE - offset);
-			buf->len += size;
-			i->iov_offset += size;
-			i->count -= size;
-			*off = offset;
-			return buf->page;
-		}
+		size = min_t(size_t, size, PAGE_SIZE - offset);
+		buf->len += size;
+		i->last_offset += size;
+		i->count -= size;
+		*off = offset;
+		return buf->page;
 	}
 	// OK, we need a new buffer
 	*off = 0;
@@ -292,7 +293,7 @@ static struct page *append_pipe(struct iov_iter *i, size_t size, size_t *off)
 	if (!page)
 		return NULL;
 	i->head = pipe->head - 1;
-	i->iov_offset = size;
+	i->last_offset = size;
 	i->count -= size;
 	return page;
 }
@@ -312,11 +313,11 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
 	if (!sanity(i))
 		return 0;
 
-	if (offset && i->iov_offset == offset) { // could we merge it?
+	if (offset && i->last_offset == -offset) { // could we merge it?
 		struct pipe_buffer *buf = pipe_buf(pipe, head - 1);
 		if (buf->page == page) {
 			buf->len += bytes;
-			i->iov_offset += bytes;
+			i->last_offset -= bytes;
 			i->count -= bytes;
 			return bytes;
 		}
@@ -325,7 +326,7 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
 		return 0;
 
 	push_page(pipe, page, offset, bytes);
-	i->iov_offset = offset + bytes;
+	i->last_offset = -(offset + bytes);
 	i->head = head;
 	i->count -= bytes;
 	return bytes;
@@ -437,16 +438,15 @@ EXPORT_SYMBOL(iov_iter_init);
 static inline void data_start(const struct iov_iter *i,
 			      unsigned int *iter_headp, size_t *offp)
 {
-	unsigned int iter_head = i->head;
-	size_t off = i->iov_offset;
+	int off = i->last_offset;
 
-	if (off && (!allocated(pipe_buf(i->pipe, iter_head)) ||
-		    off == PAGE_SIZE)) {
-		iter_head++;
-		off = 0;
+	if (off > 0 && off < PAGE_SIZE) { // anon and not full
+		*iter_headp = i->pipe->head - 1;
+		*offp = off;
+	} else {
+		*iter_headp = i->pipe->head;
+		*offp = 0;
 	}
-	*iter_headp = iter_head;
-	*offp = off;
 }
 
 static size_t copy_pipe_to_iter(const void *addr, size_t bytes,
@@ -821,7 +821,7 @@ EXPORT_SYMBOL(copy_page_from_iter_atomic);
 static void pipe_advance(struct iov_iter *i, size_t size)
 {
 	struct pipe_inode_info *pipe = i->pipe;
-	unsigned int off = i->iov_offset;
+	int off = i->last_offset;
 
 	if (!off && !size) {
 		pipe_discard_from(pipe, i->start_head); // discard everything
@@ -831,10 +831,10 @@ static void pipe_advance(struct iov_iter *i, size_t size)
 	while (1) {
 		struct pipe_buffer *buf = pipe_buf(pipe, i->head);
 		if (off) /* make it relative to the beginning of buffer */
-			size += off - buf->offset;
+			size += abs(off) - buf->offset;
 		if (size <= buf->len) {
 			buf->len = size;
-			i->iov_offset = buf->offset + size;
+			i->last_offset = last_offset(buf);
 			break;
 		}
 		size -= buf->len;
@@ -918,7 +918,7 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
 			struct pipe_buffer *b = pipe_buf(pipe, --head);
 			if (unroll < b->len) {
 				b->len -= unroll;
-				i->iov_offset = b->offset + b->len;
+				i->last_offset = last_offset(b);
 				i->head = head;
 				return;
 			}
@@ -926,7 +926,7 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
 			pipe_buf_release(pipe, b);
 			pipe->head--;
 		}
-		i->iov_offset = 0;
+		i->last_offset = 0;
 		i->head = head;
 		return;
 	}
@@ -1029,7 +1029,7 @@ void iov_iter_pipe(struct iov_iter *i, unsigned int direction,
 		.pipe = pipe,
 		.head = pipe->head,
 		.start_head = pipe->head,
-		.iov_offset = 0,
+		.last_offset = 0,
 		.count = count
 	};
 }
@@ -1145,8 +1145,8 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)
 	if (iov_iter_is_pipe(i)) {
 		size_t size = i->count;
 
-		if (size && i->iov_offset && allocated(pipe_buf(i->pipe, i->head)))
-			return size | i->iov_offset;
+		if (size && i->last_offset > 0)
+			return size | i->last_offset;
 		return size;
 	}
 
-- 
2.30.2



* [PATCH 22/44] ITER_PIPE: fold data_start() and pipe_space_for_user() together
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (19 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 21/44] ITER_PIPE: cache the type of last buffer Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-22  4:15   ` [PATCH 23/44] iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT Al Viro
                     ` (24 subsequent siblings)
  45 siblings, 0 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

All their callers are next to each other; all of them
want the total number of pages and, possibly, the
offset in the partial final buffer.

Combine into a new helper (pipe_npages()), fix the
bogosity in pipe_space_for_user(), while we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 include/linux/pipe_fs_i.h | 20 ------------------
 lib/iov_iter.c            | 44 +++++++++++++++++----------------------
 2 files changed, 19 insertions(+), 45 deletions(-)

diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 4ea496924106..6cb65df3e3ba 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -156,26 +156,6 @@ static inline bool pipe_full(unsigned int head, unsigned int tail,
 	return pipe_occupancy(head, tail) >= limit;
 }
 
-/**
- * pipe_space_for_user - Return number of slots available to userspace
- * @head: The pipe ring head pointer
- * @tail: The pipe ring tail pointer
- * @pipe: The pipe info structure
- */
-static inline unsigned int pipe_space_for_user(unsigned int head, unsigned int tail,
-					       struct pipe_inode_info *pipe)
-{
-	unsigned int p_occupancy, p_space;
-
-	p_occupancy = pipe_occupancy(head, tail);
-	if (p_occupancy >= pipe->max_usage)
-		return 0;
-	p_space = pipe->ring_size - p_occupancy;
-	if (p_space > pipe->max_usage)
-		p_space = pipe->max_usage;
-	return p_space;
-}
-
 /**
  * pipe_buf_get - get a reference to a pipe_buffer
  * @pipe:	the pipe that the buffer belongs to
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 27ad2ef93dbc..30f4158382d6 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -435,18 +435,20 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_init);
 
-static inline void data_start(const struct iov_iter *i,
-			      unsigned int *iter_headp, size_t *offp)
+// returns the offset in partial buffer (if any)
+static inline unsigned int pipe_npages(const struct iov_iter *i, int *npages)
 {
+	struct pipe_inode_info *pipe = i->pipe;
+	int used = pipe->head - pipe->tail;
 	int off = i->last_offset;
 
+	*npages = max((int)pipe->max_usage - used, 0);
+
 	if (off > 0 && off < PAGE_SIZE) { // anon and not full
-		*iter_headp = i->pipe->head - 1;
-		*offp = off;
-	} else {
-		*iter_headp = i->pipe->head;
-		*offp = 0;
+		(*npages)++;
+		return off;
 	}
+	return 0;
 }
 
 static size_t copy_pipe_to_iter(const void *addr, size_t bytes,
@@ -1221,18 +1223,16 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned maxpages,
 		   size_t *start)
 {
-	unsigned int iter_head, npages;
+	unsigned int npages, off;
 	size_t capacity;
 
 	if (!sanity(i))
 		return -EFAULT;
 
-	data_start(i, &iter_head, start);
-	/* Amount of free space: some of this one + all after this one */
-	npages = pipe_space_for_user(iter_head, i->pipe->tail, i->pipe);
-	capacity = min(npages, maxpages) * PAGE_SIZE - *start;
+	*start = off = pipe_npages(i, &npages);
+	capacity = min(npages, maxpages) * PAGE_SIZE - off;
 
-	return __pipe_get_pages(i, min(maxsize, capacity), pages, *start);
+	return __pipe_get_pages(i, min(maxsize, capacity), pages, off);
 }
 
 static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa,
@@ -1411,24 +1411,22 @@ static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
 		   size_t *start)
 {
 	struct page **p;
-	unsigned int iter_head, npages;
+	unsigned int npages, off;
 	ssize_t n;
 
 	if (!sanity(i))
 		return -EFAULT;
 
-	data_start(i, &iter_head, start);
-	/* Amount of free space: some of this one + all after this one */
-	npages = pipe_space_for_user(iter_head, i->pipe->tail, i->pipe);
-	n = npages * PAGE_SIZE - *start;
+	*start = off = pipe_npages(i, &npages);
+	n = npages * PAGE_SIZE - off;
 	if (maxsize > n)
 		maxsize = n;
 	else
-		npages = DIV_ROUND_UP(maxsize + *start, PAGE_SIZE);
+		npages = DIV_ROUND_UP(maxsize + off, PAGE_SIZE);
 	p = get_pages_array(npages);
 	if (!p)
 		return -ENOMEM;
-	n = __pipe_get_pages(i, maxsize, p, *start);
+	n = __pipe_get_pages(i, maxsize, p, off);
 	if (n > 0)
 		*pages = p;
 	else
@@ -1653,16 +1651,12 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
 	if (iov_iter_is_bvec(i))
 		return bvec_npages(i, maxpages);
 	if (iov_iter_is_pipe(i)) {
-		unsigned int iter_head;
 		int npages;
-		size_t off;
 
 		if (!sanity(i))
 			return 0;
 
-		data_start(i, &iter_head, &off);
-		/* some of this one + all after this one */
-		npages = pipe_space_for_user(iter_head, i->pipe->tail, i->pipe);
+		pipe_npages(i, &npages);
 		return min(npages, maxpages);
 	}
 	if (iov_iter_is_xarray(i)) {
-- 
2.30.2



* [PATCH 23/44] iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (20 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 22/44] ITER_PIPE: fold data_start() and pipe_space_for_user() together Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 11:41     ` Jeff Layton
  2022-06-22  4:15   ` [PATCH 24/44] iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper Al Viro
                     ` (23 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

All callers can and should handle iov_iter_get_pages() returning
fewer pages than requested.  All in-kernel ones do.  And it makes
the arithmetical overflow analysis much simpler...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 30f4158382d6..c3fb7853dbe8 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1367,6 +1367,8 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 		maxsize = i->count;
 	if (!maxsize)
 		return 0;
+	if (maxsize > MAX_RW_COUNT)
+		maxsize = MAX_RW_COUNT;
 
 	if (likely(user_backed_iter(i))) {
 		unsigned int gup_flags = 0;
@@ -1485,6 +1487,8 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		maxsize = i->count;
 	if (!maxsize)
 		return 0;
+	if (maxsize > MAX_RW_COUNT)
+		maxsize = MAX_RW_COUNT;
 
 	if (likely(user_backed_iter(i))) {
 		unsigned int gup_flags = 0;
-- 
2.30.2



* [PATCH 24/44] iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (21 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 23/44] iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 11:45     ` Jeff Layton
  2022-06-22  4:15   ` [PATCH 25/44] iov_iter_get_pages(): sanity-check arguments Al Viro
                     ` (22 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

Incidentally, ITER_XARRAY did *not* free the sucker in the case when
iter_xarray_populate_pages() returned 0...
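The shape of the fix can be sketched outside the kernel; all names below are invented for illustration, only the pattern (the wrapper owns cleanup, the helper's branches just return) matches the patch:

```c
#include <assert.h>
#include <stdlib.h>
#include <stddef.h>

/* helper: may allocate *pages and still fail or return 0 afterwards,
 * without cleaning up - exactly the situation the old per-branch
 * kvfree() calls (and the missing one in ITER_XARRAY) dealt with */
static long get_pages_raw_sketch(void ***pages, long outcome)
{
	if (outcome >= 0) {
		*pages = malloc(4 * sizeof(void *));
		if (!*pages)
			return -1;	/* -ENOMEM stand-in */
	}
	return outcome;
}

/* wrapper: one place frees the array on any failure or empty result */
static long get_pages_sketch(void ***pages, long outcome)
{
	long len;

	*pages = NULL;
	len = get_pages_raw_sketch(pages, outcome);
	if (len <= 0) {
		free(*pages);
		*pages = NULL;
	}
	return len;
}

/* returns 1 if *pages ended up NULL for the given helper outcome */
static int left_null(long outcome)
{
	void **p;
	int r;

	(void)get_pages_sketch(&p, outcome);
	r = (p == NULL);
	free(p);
	return r;
}
```

The outcome == 0 case models the ITER_XARRAY leak: the array gets allocated, the helper reports nothing populated, and only the wrapper's single cleanup path frees it.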

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 38 ++++++++++++++++++++++----------------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index c3fb7853dbe8..9c25661684c6 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1425,15 +1425,10 @@ static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
 		maxsize = n;
 	else
 		npages = DIV_ROUND_UP(maxsize + off, PAGE_SIZE);
-	p = get_pages_array(npages);
+	*pages = p = get_pages_array(npages);
 	if (!p)
 		return -ENOMEM;
-	n = __pipe_get_pages(i, maxsize, p, off);
-	if (n > 0)
-		*pages = p;
-	else
-		kvfree(p);
-	return n;
+	return __pipe_get_pages(i, maxsize, p, off);
 }
 
 static ssize_t iter_xarray_get_pages_alloc(struct iov_iter *i,
@@ -1463,10 +1458,9 @@ static ssize_t iter_xarray_get_pages_alloc(struct iov_iter *i,
 			count++;
 	}
 
-	p = get_pages_array(count);
+	*pages = p = get_pages_array(count);
 	if (!p)
 		return -ENOMEM;
-	*pages = p;
 
 	nr = iter_xarray_populate_pages(p, i->xarray, index, count);
 	if (nr == 0)
@@ -1475,7 +1469,7 @@ static ssize_t iter_xarray_get_pages_alloc(struct iov_iter *i,
 	return min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
 }
 
-ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
+static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   size_t *start)
 {
@@ -1501,16 +1495,12 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 
 		addr = first_iovec_segment(i, &len, start, maxsize, ~0U);
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
-		p = get_pages_array(n);
+		*pages = p = get_pages_array(n);
 		if (!p)
 			return -ENOMEM;
 		res = get_user_pages_fast(addr, n, gup_flags, p);
-		if (unlikely(res <= 0)) {
-			kvfree(p);
-			*pages = NULL;
+		if (unlikely(res <= 0))
 			return res;
-		}
-		*pages = p;
 		return (res == n ? len : res * PAGE_SIZE) - *start;
 	}
 	if (iov_iter_is_bvec(i)) {
@@ -1531,6 +1521,22 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		return iter_xarray_get_pages_alloc(i, pages, maxsize, start);
 	return -EFAULT;
 }
+
+ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
+		   struct page ***pages, size_t maxsize,
+		   size_t *start)
+{
+	ssize_t len;
+
+	*pages = NULL;
+
+	len = __iov_iter_get_pages_alloc(i, pages, maxsize, start);
+	if (len <= 0) {
+		kvfree(*pages);
+		*pages = NULL;
+	}
+	return len;
+}
 EXPORT_SYMBOL(iov_iter_get_pages_alloc);
 
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
-- 
2.30.2



* [PATCH 25/44] iov_iter_get_pages(): sanity-check arguments
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (22 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 24/44] iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 11:47     ` Jeff Layton
  2022-06-22  4:15   ` [PATCH 26/44] unify pipe_get_pages() and pipe_get_pages_alloc() Al Viro
                     ` (21 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

Zero maxpages is bogus, but best treated as "just return 0"; NULL
pages, OTOH, should be treated as a hard bug.

Get rid of the now completely useless checks in xarray_get_pages{,_alloc}().
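A toy model of the argument policy, with invented names and assert() standing in for BUG_ON():

```c
#include <assert.h>
#include <stddef.h>

static void *arr3[4];	/* a caller-supplied destination array */

/* zero maxsize or maxpages: benign, nothing to do;
 * NULL pages with work to do: caller bug */
static long get_pages_checked(void **pages, unsigned int maxpages,
			      size_t maxsize)
{
	if (!maxsize || !maxpages)
		return 0;
	assert(pages != NULL);	/* BUG_ON(!pages) stand-in */
	return 1;		/* pretend one page was produced */
}
```

Note the ordering: the benign early returns come first, so a NULL array is only fatal when the call would actually have to store something.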

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 9c25661684c6..5c985cf2858e 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1271,9 +1271,6 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
 	size_t size = maxsize;
 	loff_t pos;
 
-	if (!size || !maxpages)
-		return 0;
-
 	pos = i->xarray_start + i->iov_offset;
 	index = pos >> PAGE_SHIFT;
 	offset = pos & ~PAGE_MASK;
@@ -1365,10 +1362,11 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 
 	if (maxsize > i->count)
 		maxsize = i->count;
-	if (!maxsize)
+	if (!maxsize || !maxpages)
 		return 0;
 	if (maxsize > MAX_RW_COUNT)
 		maxsize = MAX_RW_COUNT;
+	BUG_ON(!pages);
 
 	if (likely(user_backed_iter(i))) {
 		unsigned int gup_flags = 0;
@@ -1441,9 +1439,6 @@ static ssize_t iter_xarray_get_pages_alloc(struct iov_iter *i,
 	size_t size = maxsize;
 	loff_t pos;
 
-	if (!size)
-		return 0;
-
 	pos = i->xarray_start + i->iov_offset;
 	index = pos >> PAGE_SHIFT;
 	offset = pos & ~PAGE_MASK;
-- 
2.30.2



* [PATCH 26/44] unify pipe_get_pages() and pipe_get_pages_alloc()
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (23 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 25/44] iov_iter_get_pages(): sanity-check arguments Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 11:49     ` Jeff Layton
  2022-06-22  4:15   ` [PATCH 27/44] unify xarray_get_pages() and xarray_get_pages_alloc() Al Viro
                     ` (20 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

	The differences between the two are:
* pipe_get_pages() gets a non-NULL struct page ** value pointing to a
preallocated array, plus the array size.
* pipe_get_pages_alloc() gets the address of a struct page ** variable
that contains NULL, allocates the array and (on success) stores its
address in that variable.

	Not hard to combine - always pass struct page *** and have
the previous pipe_get_pages_alloc() caller pass ~0U as the cap on
array size.
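The unified calling convention can be sketched in user space; the names are made up, only the "allocate iff the caller's variable contains NULL" convention matches the patch:

```c
#include <assert.h>
#include <stdlib.h>
#include <stddef.h>

static void *prealloc[4];	/* a caller-supplied array */

/* *pages either points to a caller-supplied array, or contains NULL
 * and the callee allocates one sized for the request */
static long fill_pages(void ***pages, size_t npages)
{
	void **p = *pages;

	if (!p) {
		*pages = p = calloc(npages, sizeof(void *));
		if (!p)
			return -1;	/* -ENOMEM stand-in */
	}
	for (size_t k = 0; k < npages; k++)
		p[k] = (void *)(k + 1);	/* pretend pages */
	return (long)npages;
}

/* allocate-on-demand mode: start with NULL, callee allocates */
static int demo_alloc_mode(void)
{
	void **v = NULL;
	long n = fill_pages(&v, 3);
	int ok = (n == 3 && v && v[2] == (void *)3);

	free(v);
	return ok;
}

/* preallocated mode: caller's array is used as-is */
static int demo_prealloc_mode(void)
{
	void **w = prealloc;

	return fill_pages(&w, 2) == 2 && w == prealloc &&
	       prealloc[1] == (void *)2;
}
```

One function body then serves both former callers, with the alloc-style caller passing an effectively unlimited page cap.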

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 49 +++++++++++++++++--------------------------------
 1 file changed, 17 insertions(+), 32 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 5c985cf2858e..1c98f2f3a581 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1187,6 +1187,11 @@ unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
 }
 EXPORT_SYMBOL(iov_iter_gap_alignment);
 
+static struct page **get_pages_array(size_t n)
+{
+	return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
+}
+
 static inline ssize_t __pipe_get_pages(struct iov_iter *i,
 				size_t maxsize,
 				struct page **pages,
@@ -1220,10 +1225,11 @@ static inline ssize_t __pipe_get_pages(struct iov_iter *i,
 }
 
 static ssize_t pipe_get_pages(struct iov_iter *i,
-		   struct page **pages, size_t maxsize, unsigned maxpages,
+		   struct page ***pages, size_t maxsize, unsigned maxpages,
 		   size_t *start)
 {
 	unsigned int npages, off;
+	struct page **p;
 	size_t capacity;
 
 	if (!sanity(i))
@@ -1231,8 +1237,15 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
 
 	*start = off = pipe_npages(i, &npages);
 	capacity = min(npages, maxpages) * PAGE_SIZE - off;
+	maxsize = min(maxsize, capacity);
+	p = *pages;
+	if (!p) {
+		*pages = p = get_pages_array(DIV_ROUND_UP(maxsize + off, PAGE_SIZE));
+		if (!p)
+			return -ENOMEM;
+	}
 
-	return __pipe_get_pages(i, min(maxsize, capacity), pages, off);
+	return __pipe_get_pages(i, maxsize, p, off);
 }
 
 static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa,
@@ -1394,41 +1407,13 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 		return len - *start;
 	}
 	if (iov_iter_is_pipe(i))
-		return pipe_get_pages(i, pages, maxsize, maxpages, start);
+		return pipe_get_pages(i, &pages, maxsize, maxpages, start);
 	if (iov_iter_is_xarray(i))
 		return iter_xarray_get_pages(i, pages, maxsize, maxpages, start);
 	return -EFAULT;
 }
 EXPORT_SYMBOL(iov_iter_get_pages);
 
-static struct page **get_pages_array(size_t n)
-{
-	return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
-}
-
-static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
-		   struct page ***pages, size_t maxsize,
-		   size_t *start)
-{
-	struct page **p;
-	unsigned int npages, off;
-	ssize_t n;
-
-	if (!sanity(i))
-		return -EFAULT;
-
-	*start = off = pipe_npages(i, &npages);
-	n = npages * PAGE_SIZE - off;
-	if (maxsize > n)
-		maxsize = n;
-	else
-		npages = DIV_ROUND_UP(maxsize + off, PAGE_SIZE);
-	*pages = p = get_pages_array(npages);
-	if (!p)
-		return -ENOMEM;
-	return __pipe_get_pages(i, maxsize, p, off);
-}
-
 static ssize_t iter_xarray_get_pages_alloc(struct iov_iter *i,
 					   struct page ***pages, size_t maxsize,
 					   size_t *_start_offset)
@@ -1511,7 +1496,7 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		return len - *start;
 	}
 	if (iov_iter_is_pipe(i))
-		return pipe_get_pages_alloc(i, pages, maxsize, start);
+		return pipe_get_pages(i, pages, maxsize, ~0U, start);
 	if (iov_iter_is_xarray(i))
 		return iter_xarray_get_pages_alloc(i, pages, maxsize, start);
 	return -EFAULT;
-- 
2.30.2



* [PATCH 27/44] unify xarray_get_pages() and xarray_get_pages_alloc()
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (24 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 26/44] unify pipe_get_pages() and pipe_get_pages_alloc() Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 11:50     ` Jeff Layton
  2022-06-22  4:15   ` [PATCH 28/44] unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts Al Viro
                     ` (19 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

Same as for pipes - pass struct page *** and allocate the array in the
callee when the caller's variable contains NULL.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 49 ++++++++++---------------------------------------
 1 file changed, 10 insertions(+), 39 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 1c98f2f3a581..07dacb274ba5 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1276,7 +1276,7 @@ static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa
 }
 
 static ssize_t iter_xarray_get_pages(struct iov_iter *i,
-				     struct page **pages, size_t maxsize,
+				     struct page ***pages, size_t maxsize,
 				     unsigned maxpages, size_t *_start_offset)
 {
 	unsigned nr, offset;
@@ -1301,7 +1301,13 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
 	if (count > maxpages)
 		count = maxpages;
 
-	nr = iter_xarray_populate_pages(pages, i->xarray, index, count);
+	if (!*pages) {
+		*pages = get_pages_array(count);
+		if (!*pages)
+			return -ENOMEM;
+	}
+
+	nr = iter_xarray_populate_pages(*pages, i->xarray, index, count);
 	if (nr == 0)
 		return 0;
 
@@ -1409,46 +1415,11 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 	if (iov_iter_is_pipe(i))
 		return pipe_get_pages(i, &pages, maxsize, maxpages, start);
 	if (iov_iter_is_xarray(i))
-		return iter_xarray_get_pages(i, pages, maxsize, maxpages, start);
+		return iter_xarray_get_pages(i, &pages, maxsize, maxpages, start);
 	return -EFAULT;
 }
 EXPORT_SYMBOL(iov_iter_get_pages);
 
-static ssize_t iter_xarray_get_pages_alloc(struct iov_iter *i,
-					   struct page ***pages, size_t maxsize,
-					   size_t *_start_offset)
-{
-	struct page **p;
-	unsigned nr, offset;
-	pgoff_t index, count;
-	size_t size = maxsize;
-	loff_t pos;
-
-	pos = i->xarray_start + i->iov_offset;
-	index = pos >> PAGE_SHIFT;
-	offset = pos & ~PAGE_MASK;
-	*_start_offset = offset;
-
-	count = 1;
-	if (size > PAGE_SIZE - offset) {
-		size -= PAGE_SIZE - offset;
-		count += size >> PAGE_SHIFT;
-		size &= ~PAGE_MASK;
-		if (size)
-			count++;
-	}
-
-	*pages = p = get_pages_array(count);
-	if (!p)
-		return -ENOMEM;
-
-	nr = iter_xarray_populate_pages(p, i->xarray, index, count);
-	if (nr == 0)
-		return 0;
-
-	return min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
-}
-
 static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   size_t *start)
@@ -1498,7 +1469,7 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 	if (iov_iter_is_pipe(i))
 		return pipe_get_pages(i, pages, maxsize, ~0U, start);
 	if (iov_iter_is_xarray(i))
-		return iter_xarray_get_pages_alloc(i, pages, maxsize, start);
+		return iter_xarray_get_pages(i, pages, maxsize, ~0U, start);
 	return -EFAULT;
 }
 
-- 
2.30.2



* [PATCH 28/44] unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (25 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 27/44] unify xarray_get_pages() and xarray_get_pages_alloc() Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 11:54     ` Jeff Layton
  2022-06-22  4:15   ` [PATCH 29/44] ITER_XARRAY: don't open-code DIV_ROUND_UP() Al Viro
                     ` (18 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

Same as for pipes and xarrays; after that, iov_iter_get_pages() becomes
a wrapper for __iov_iter_get_pages_alloc().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 86 ++++++++++++++++----------------------------------
 1 file changed, 28 insertions(+), 58 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 07dacb274ba5..811fa09515d8 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1372,20 +1372,19 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
 	return page;
 }
 
-ssize_t iov_iter_get_pages(struct iov_iter *i,
-		   struct page **pages, size_t maxsize, unsigned maxpages,
-		   size_t *start)
+static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
+		   struct page ***pages, size_t maxsize,
+		   unsigned int maxpages, size_t *start)
 {
 	size_t len;
 	int n, res;
 
 	if (maxsize > i->count)
 		maxsize = i->count;
-	if (!maxsize || !maxpages)
+	if (!maxsize)
 		return 0;
 	if (maxsize > MAX_RW_COUNT)
 		maxsize = MAX_RW_COUNT;
-	BUG_ON(!pages);
 
 	if (likely(user_backed_iter(i))) {
 		unsigned int gup_flags = 0;
@@ -1398,80 +1397,51 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 
 		addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
-		res = get_user_pages_fast(addr, n, gup_flags, pages);
+		if (!*pages) {
+			*pages = get_pages_array(n);
+			if (!*pages)
+				return -ENOMEM;
+		}
+		res = get_user_pages_fast(addr, n, gup_flags, *pages);
 		if (unlikely(res <= 0))
 			return res;
 		return (res == n ? len : res * PAGE_SIZE) - *start;
 	}
 	if (iov_iter_is_bvec(i)) {
+		struct page **p;
 		struct page *page;
 
 		page = first_bvec_segment(i, &len, start, maxsize, maxpages);
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
+		p = *pages;
+		if (!p) {
+			*pages = p = get_pages_array(n);
+			if (!p)
+				return -ENOMEM;
+		}
 		while (n--)
-			get_page(*pages++ = page++);
+			get_page(*p++ = page++);
 		return len - *start;
 	}
 	if (iov_iter_is_pipe(i))
-		return pipe_get_pages(i, &pages, maxsize, maxpages, start);
+		return pipe_get_pages(i, pages, maxsize, maxpages, start);
 	if (iov_iter_is_xarray(i))
-		return iter_xarray_get_pages(i, &pages, maxsize, maxpages, start);
+		return iter_xarray_get_pages(i, pages, maxsize, maxpages,
+					     start);
 	return -EFAULT;
 }
-EXPORT_SYMBOL(iov_iter_get_pages);
 
-static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
-		   struct page ***pages, size_t maxsize,
+ssize_t iov_iter_get_pages(struct iov_iter *i,
+		   struct page **pages, size_t maxsize, unsigned maxpages,
 		   size_t *start)
 {
-	struct page **p;
-	size_t len;
-	int n, res;
-
-	if (maxsize > i->count)
-		maxsize = i->count;
-	if (!maxsize)
+	if (!maxpages)
 		return 0;
-	if (maxsize > MAX_RW_COUNT)
-		maxsize = MAX_RW_COUNT;
-
-	if (likely(user_backed_iter(i))) {
-		unsigned int gup_flags = 0;
-		unsigned long addr;
-
-		if (iov_iter_rw(i) != WRITE)
-			gup_flags |= FOLL_WRITE;
-		if (i->nofault)
-			gup_flags |= FOLL_NOFAULT;
-
-		addr = first_iovec_segment(i, &len, start, maxsize, ~0U);
-		n = DIV_ROUND_UP(len, PAGE_SIZE);
-		*pages = p = get_pages_array(n);
-		if (!p)
-			return -ENOMEM;
-		res = get_user_pages_fast(addr, n, gup_flags, p);
-		if (unlikely(res <= 0))
-			return res;
-		return (res == n ? len : res * PAGE_SIZE) - *start;
-	}
-	if (iov_iter_is_bvec(i)) {
-		struct page *page;
+	BUG_ON(!pages);
 
-		page = first_bvec_segment(i, &len, start, maxsize, ~0U);
-		n = DIV_ROUND_UP(len, PAGE_SIZE);
-		*pages = p = get_pages_array(n);
-		if (!p)
-			return -ENOMEM;
-		while (n--)
-			get_page(*p++ = page++);
-		return len - *start;
-	}
-	if (iov_iter_is_pipe(i))
-		return pipe_get_pages(i, pages, maxsize, ~0U, start);
-	if (iov_iter_is_xarray(i))
-		return iter_xarray_get_pages(i, pages, maxsize, ~0U, start);
-	return -EFAULT;
+	return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start);
 }
+EXPORT_SYMBOL(iov_iter_get_pages);
 
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
@@ -1481,7 +1451,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 
 	*pages = NULL;
 
-	len = __iov_iter_get_pages_alloc(i, pages, maxsize, start);
+	len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start);
 	if (len <= 0) {
 		kvfree(*pages);
 		*pages = NULL;
-- 
2.30.2



* [PATCH 29/44] ITER_XARRAY: don't open-code DIV_ROUND_UP()
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (26 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 28/44] unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 11:54     ` Jeff Layton
  2022-06-22  4:15   ` [PATCH 30/44] iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment() Al Viro
                     ` (17 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 811fa09515d8..92a566f839f9 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1289,15 +1289,7 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
 	offset = pos & ~PAGE_MASK;
 	*_start_offset = offset;
 
-	count = 1;
-	if (size > PAGE_SIZE - offset) {
-		size -= PAGE_SIZE - offset;
-		count += size >> PAGE_SHIFT;
-		size &= ~PAGE_MASK;
-		if (size)
-			count++;
-	}
-
+	count = DIV_ROUND_UP(size + offset, PAGE_SIZE);
 	if (count > maxpages)
 		count = maxpages;
 
-- 
2.30.2



* [PATCH 30/44] iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment()
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (27 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 29/44] ITER_XARRAY: don't open-code DIV_ROUND_UP() Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 11:56     ` Jeff Layton
  2022-06-22  4:15   ` [PATCH 31/44] iov_iter: first_{iovec,bvec}_segment() - simplify a bit Al Viro
                     ` (16 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 92a566f839f9..9ef671b101dc 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1308,12 +1308,9 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
 
 static unsigned long found_ubuf_segment(unsigned long addr,
 					size_t len,
-					size_t *size, size_t *start,
-					unsigned maxpages)
+					size_t *size, size_t *start)
 {
 	len += (*start = addr % PAGE_SIZE);
-	if (len > maxpages * PAGE_SIZE)
-		len = maxpages * PAGE_SIZE;
 	*size = len;
 	return addr & PAGE_MASK;
 }
@@ -1321,14 +1318,14 @@ static unsigned long found_ubuf_segment(unsigned long addr,
 /* must be done on non-empty ITER_UBUF or ITER_IOVEC one */
 static unsigned long first_iovec_segment(const struct iov_iter *i,
 					 size_t *size, size_t *start,
-					 size_t maxsize, unsigned maxpages)
+					 size_t maxsize)
 {
 	size_t skip;
 	long k;
 
 	if (iter_is_ubuf(i)) {
 		unsigned long addr = (unsigned long)i->ubuf + i->iov_offset;
-		return found_ubuf_segment(addr, maxsize, size, start, maxpages);
+		return found_ubuf_segment(addr, maxsize, size, start);
 	}
 
 	for (k = 0, skip = i->iov_offset; k < i->nr_segs; k++, skip = 0) {
@@ -1339,7 +1336,7 @@ static unsigned long first_iovec_segment(const struct iov_iter *i,
 			continue;
 		if (len > maxsize)
 			len = maxsize;
-		return found_ubuf_segment(addr, len, size, start, maxpages);
+		return found_ubuf_segment(addr, len, size, start);
 	}
 	BUG(); // if it had been empty, we wouldn't get called
 }
@@ -1347,7 +1344,7 @@ static unsigned long first_iovec_segment(const struct iov_iter *i,
 /* must be done on non-empty ITER_BVEC one */
 static struct page *first_bvec_segment(const struct iov_iter *i,
 				       size_t *size, size_t *start,
-				       size_t maxsize, unsigned maxpages)
+				       size_t maxsize)
 {
 	struct page *page;
 	size_t skip = i->iov_offset, len;
@@ -1358,8 +1355,6 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
 	skip += i->bvec->bv_offset;
 	page = i->bvec->bv_page + skip / PAGE_SIZE;
 	len += (*start = skip % PAGE_SIZE);
-	if (len > maxpages * PAGE_SIZE)
-		len = maxpages * PAGE_SIZE;
 	*size = len;
 	return page;
 }
@@ -1387,7 +1382,9 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		if (i->nofault)
 			gup_flags |= FOLL_NOFAULT;
 
-		addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
+		addr = first_iovec_segment(i, &len, start, maxsize);
+		if (len > maxpages * PAGE_SIZE)
+			len = maxpages * PAGE_SIZE;
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
 		if (!*pages) {
 			*pages = get_pages_array(n);
@@ -1403,7 +1400,9 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		struct page **p;
 		struct page *page;
 
-		page = first_bvec_segment(i, &len, start, maxsize, maxpages);
+		page = first_bvec_segment(i, &len, start, maxsize);
+		if (len > maxpages * PAGE_SIZE)
+			len = maxpages * PAGE_SIZE;
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
 		p = *pages;
 		if (!p) {
-- 
2.30.2



* [PATCH 31/44] iov_iter: first_{iovec,bvec}_segment() - simplify a bit
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (28 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 30/44] iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment() Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 11:58     ` Jeff Layton
  2022-06-22  4:15   ` [PATCH 32/44] iov_iter: massage calling conventions for first_{iovec,bvec}_segment() Al Viro
                     ` (15 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

We return length + offset in page via *size.  Don't bother - the caller
can do that arithmetic just as well; just report the length to it.
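The arithmetic moved into the caller is just a rounded-up division over length plus offset-in-first-page; a minimal sketch (macro names are local stand-ins for the kernel's PAGE_SIZE and DIV_ROUND_UP, 4K pages assumed):

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SIZE 4096UL			/* illustrative, 4K pages */
#define SK_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* with first_*_segment() reporting the plain length, the caller
 * recovers the page count from length + offset-in-first-page */
static unsigned long npages_for(size_t len, size_t start)
{
	return SK_DIV_ROUND_UP(len + start, SK_PAGE_SIZE);
}
```

A one-byte read that starts at the last byte of a page still needs exactly one page, while a page-sized read that starts one byte in needs two.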

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 9ef671b101dc..0bed684d91d0 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1310,7 +1310,7 @@ static unsigned long found_ubuf_segment(unsigned long addr,
 					size_t len,
 					size_t *size, size_t *start)
 {
-	len += (*start = addr % PAGE_SIZE);
+	*start = addr % PAGE_SIZE;
 	*size = len;
 	return addr & PAGE_MASK;
 }
@@ -1354,7 +1354,7 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
 		len = maxsize;
 	skip += i->bvec->bv_offset;
 	page = i->bvec->bv_page + skip / PAGE_SIZE;
-	len += (*start = skip % PAGE_SIZE);
+	*start = skip % PAGE_SIZE;
 	*size = len;
 	return page;
 }
@@ -1383,9 +1383,9 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 			gup_flags |= FOLL_NOFAULT;
 
 		addr = first_iovec_segment(i, &len, start, maxsize);
-		if (len > maxpages * PAGE_SIZE)
-			len = maxpages * PAGE_SIZE;
-		n = DIV_ROUND_UP(len, PAGE_SIZE);
+		n = DIV_ROUND_UP(len + *start, PAGE_SIZE);
+		if (n > maxpages)
+			n = maxpages;
 		if (!*pages) {
 			*pages = get_pages_array(n);
 			if (!*pages)
@@ -1394,25 +1394,25 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		res = get_user_pages_fast(addr, n, gup_flags, *pages);
 		if (unlikely(res <= 0))
 			return res;
-		return (res == n ? len : res * PAGE_SIZE) - *start;
+		return min_t(size_t, len, res * PAGE_SIZE - *start);
 	}
 	if (iov_iter_is_bvec(i)) {
 		struct page **p;
 		struct page *page;
 
 		page = first_bvec_segment(i, &len, start, maxsize);
-		if (len > maxpages * PAGE_SIZE)
-			len = maxpages * PAGE_SIZE;
-		n = DIV_ROUND_UP(len, PAGE_SIZE);
+		n = DIV_ROUND_UP(len + *start, PAGE_SIZE);
+		if (n > maxpages)
+			n = maxpages;
 		p = *pages;
 		if (!p) {
 			*pages = p = get_pages_array(n);
 			if (!p)
 				return -ENOMEM;
 		}
-		while (n--)
+		for (int k = 0; k < n; k++)
 			get_page(*p++ = page++);
-		return len - *start;
+		return min_t(size_t, len, n * PAGE_SIZE - *start);
 	}
 	if (iov_iter_is_pipe(i))
 		return pipe_get_pages(i, pages, maxsize, maxpages, start);
-- 
2.30.2



* [PATCH 32/44] iov_iter: massage calling conventions for first_{iovec,bvec}_segment()
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (29 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 31/44] iov_iter: first_{iovec,bvec}_segment() - simplify a bit Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 12:06     ` Jeff Layton
  2022-06-22  4:15   ` [PATCH 33/44] found_iovec_segment(): just return address Al Viro
                     ` (14 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

Pass maxsize by reference, return length via the same.
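The in/out parameter convention can be shown with a toy pair of functions (invented names): the segment helper clamps the caller's maxsize through the pointer instead of taking a separate maxsize argument and writing a separate *size result.

```c
#include <assert.h>
#include <stddef.h>

/* callee clamps through the pointer, as first_*_segment() now does */
static void segment_clamp(size_t *size, size_t seg_len)
{
	if (*size > seg_len)
		*size = seg_len;
}

/* caller passes its cap in and reads the usable length back out */
static size_t usable_len(size_t maxsize, size_t seg_len)
{
	segment_clamp(&maxsize, seg_len);
	return maxsize;
}
```

The variable that went in as the cap comes back as the usable length, so the separate len/size locals in the caller disappear.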

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 37 +++++++++++++++----------------------
 1 file changed, 15 insertions(+), 22 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 0bed684d91d0..fca66ecce7a0 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1306,26 +1306,22 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
 	return min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
 }
 
-static unsigned long found_ubuf_segment(unsigned long addr,
-					size_t len,
-					size_t *size, size_t *start)
+static unsigned long found_ubuf_segment(unsigned long addr, size_t *start)
 {
 	*start = addr % PAGE_SIZE;
-	*size = len;
 	return addr & PAGE_MASK;
 }
 
 /* must be done on non-empty ITER_UBUF or ITER_IOVEC one */
 static unsigned long first_iovec_segment(const struct iov_iter *i,
-					 size_t *size, size_t *start,
-					 size_t maxsize)
+					 size_t *size, size_t *start)
 {
 	size_t skip;
 	long k;
 
 	if (iter_is_ubuf(i)) {
 		unsigned long addr = (unsigned long)i->ubuf + i->iov_offset;
-		return found_ubuf_segment(addr, maxsize, size, start);
+		return found_ubuf_segment(addr, start);
 	}
 
 	for (k = 0, skip = i->iov_offset; k < i->nr_segs; k++, skip = 0) {
@@ -1334,28 +1330,26 @@ static unsigned long first_iovec_segment(const struct iov_iter *i,
 
 		if (unlikely(!len))
 			continue;
-		if (len > maxsize)
-			len = maxsize;
-		return found_ubuf_segment(addr, len, size, start);
+		if (*size > len)
+			*size = len;
+		return found_ubuf_segment(addr, start);
 	}
 	BUG(); // if it had been empty, we wouldn't get called
 }
 
 /* must be done on non-empty ITER_BVEC one */
 static struct page *first_bvec_segment(const struct iov_iter *i,
-				       size_t *size, size_t *start,
-				       size_t maxsize)
+				       size_t *size, size_t *start)
 {
 	struct page *page;
 	size_t skip = i->iov_offset, len;
 
 	len = i->bvec->bv_len - skip;
-	if (len > maxsize)
-		len = maxsize;
+	if (*size > len)
+		*size = len;
 	skip += i->bvec->bv_offset;
 	page = i->bvec->bv_page + skip / PAGE_SIZE;
 	*start = skip % PAGE_SIZE;
-	*size = len;
 	return page;
 }
 
@@ -1363,7 +1357,6 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   unsigned int maxpages, size_t *start)
 {
-	size_t len;
 	int n, res;
 
 	if (maxsize > i->count)
@@ -1382,8 +1375,8 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		if (i->nofault)
 			gup_flags |= FOLL_NOFAULT;
 
-		addr = first_iovec_segment(i, &len, start, maxsize);
-		n = DIV_ROUND_UP(len + *start, PAGE_SIZE);
+		addr = first_iovec_segment(i, &maxsize, start);
+		n = DIV_ROUND_UP(maxsize + *start, PAGE_SIZE);
 		if (n > maxpages)
 			n = maxpages;
 		if (!*pages) {
@@ -1394,14 +1387,14 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		res = get_user_pages_fast(addr, n, gup_flags, *pages);
 		if (unlikely(res <= 0))
 			return res;
-		return min_t(size_t, len, res * PAGE_SIZE - *start);
+		return min_t(size_t, maxsize, res * PAGE_SIZE - *start);
 	}
 	if (iov_iter_is_bvec(i)) {
 		struct page **p;
 		struct page *page;
 
-		page = first_bvec_segment(i, &len, start, maxsize);
-		n = DIV_ROUND_UP(len + *start, PAGE_SIZE);
+		page = first_bvec_segment(i, &maxsize, start);
+		n = DIV_ROUND_UP(maxsize + *start, PAGE_SIZE);
 		if (n > maxpages)
 			n = maxpages;
 		p = *pages;
@@ -1412,7 +1405,7 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		}
 		for (int k = 0; k < n; k++)
 			get_page(*p++ = page++);
-		return min_t(size_t, len, n * PAGE_SIZE - *start);
+		return min_t(size_t, maxsize, n * PAGE_SIZE - *start);
 	}
 	if (iov_iter_is_pipe(i))
 		return pipe_get_pages(i, pages, maxsize, maxpages, start);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* [PATCH 33/44] found_iovec_segment(): just return address
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (30 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 32/44] iov_iter: massage calling conventions for first_{iovec,bvec}_segment() Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 12:09     ` Jeff Layton
  2022-06-22  4:15   ` [PATCH 34/44] fold __pipe_get_pages() into pipe_get_pages() Al Viro
                     ` (13 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

... and calculate the offset in the caller
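
As a userspace illustration of the split this patch moves into the caller: the in-page offset comes from `addr % PAGE_SIZE` and the page-aligned base from `addr & PAGE_MASK`. This is a minimal sketch under an assumed 4K page size; the helper names are made up for the example and are not kernel code.

```c
#include <assert.h>

#define PAGE_SIZE 4096UL
#define PAGE_MASK (~(PAGE_SIZE - 1))

/* Illustrative helpers (not in the kernel): split a raw address
 * into the in-page offset and the page-aligned base - the same
 * computation __iov_iter_get_pages_alloc() now does itself after
 * first_iovec_segment() returns the unmasked address. */
static unsigned long page_offset_of(unsigned long addr)
{
	return addr % PAGE_SIZE;
}

static unsigned long page_base_of(unsigned long addr)
{
	return addr & PAGE_MASK;
}
```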

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index fca66ecce7a0..f455b8ee0d76 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1306,33 +1306,23 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
 	return min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
 }
 
-static unsigned long found_ubuf_segment(unsigned long addr, size_t *start)
-{
-	*start = addr % PAGE_SIZE;
-	return addr & PAGE_MASK;
-}
-
 /* must be done on non-empty ITER_UBUF or ITER_IOVEC one */
 static unsigned long first_iovec_segment(const struct iov_iter *i,
-					 size_t *size, size_t *start)
+					 size_t *size)
 {
 	size_t skip;
 	long k;
 
-	if (iter_is_ubuf(i)) {
-		unsigned long addr = (unsigned long)i->ubuf + i->iov_offset;
-		return found_ubuf_segment(addr, start);
-	}
+	if (iter_is_ubuf(i))
+		return (unsigned long)i->ubuf + i->iov_offset;
 
 	for (k = 0, skip = i->iov_offset; k < i->nr_segs; k++, skip = 0) {
-		unsigned long addr = (unsigned long)i->iov[k].iov_base + skip;
 		size_t len = i->iov[k].iov_len - skip;
-
 		if (unlikely(!len))
 			continue;
 		if (*size > len)
 			*size = len;
-		return found_ubuf_segment(addr, start);
+		return (unsigned long)i->iov[k].iov_base + skip;
 	}
 	BUG(); // if it had been empty, we wouldn't get called
 }
@@ -1375,7 +1365,9 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		if (i->nofault)
 			gup_flags |= FOLL_NOFAULT;
 
-		addr = first_iovec_segment(i, &maxsize, start);
+		addr = first_iovec_segment(i, &maxsize);
+		*start = addr % PAGE_SIZE;
+		addr &= PAGE_MASK;
 		n = DIV_ROUND_UP(maxsize + *start, PAGE_SIZE);
 		if (n > maxpages)
 			n = maxpages;
-- 
2.30.2




* [PATCH 34/44] fold __pipe_get_pages() into pipe_get_pages()
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (31 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 33/44] found_iovec_segment(): just return address Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 12:11     ` Jeff Layton
  2022-06-22  4:15   ` [PATCH 35/44] iov_iter: saner helper for page array allocation Al Viro
                     ` (12 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

... and don't mangle maxsize there - turn the loop into a counting
one instead.  That makes it easier to see that we won't run out of
the array.  Note that the special treatment of the partial buffer in
there is an artifact of the non-advancing semantics of
iov_iter_get_pages() - if not for that, it would be append_pipe(),
the same as the body of the loop that follows it.  IOW, once we make
iov_iter_get_pages() advancing, the whole thing will turn into
	calculate how many pages we want
	allocate an array (if needed)
	call append_pipe() that many times.
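
The counting shape described above can be sketched in userspace as follows. This is only a model of the clamping arithmetic in the reworked pipe_get_pages(); the function name, the fixed 4K page size, and the parameter names are illustrative, not the kernel code:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* How many page slots do we need for maxsize bytes starting at
 * in-page offset off, clamped by the space left in the pipe (npages)
 * and by the caller's array size (maxpages)? */
static unsigned int pipe_page_count(size_t maxsize, unsigned int off,
				    unsigned int npages, unsigned int maxpages)
{
	unsigned int count = DIV_ROUND_UP(maxsize + off, PAGE_SIZE);

	if (count > npages)
		count = npages;
	if (count > maxpages)
		count = maxpages;
	return count;
}
```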

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 75 +++++++++++++++++++++++++-------------------------
 1 file changed, 38 insertions(+), 37 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f455b8ee0d76..9280f865fd6a 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1192,60 +1192,61 @@ static struct page **get_pages_array(size_t n)
 	return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
 }
 
-static inline ssize_t __pipe_get_pages(struct iov_iter *i,
-				size_t maxsize,
-				struct page **pages,
-				size_t off)
-{
-	struct pipe_inode_info *pipe = i->pipe;
-	ssize_t left = maxsize;
-
-	if (off) {
-		struct pipe_buffer *buf = pipe_buf(pipe, pipe->head - 1);
-
-		get_page(*pages++ = buf->page);
-		left -= PAGE_SIZE - off;
-		if (left <= 0) {
-			buf->len += maxsize;
-			return maxsize;
-		}
-		buf->len = PAGE_SIZE;
-	}
-	while (!pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
-		struct page *page = push_anon(pipe,
-					      min_t(ssize_t, left, PAGE_SIZE));
-		if (!page)
-			break;
-		get_page(*pages++ = page);
-		left -= PAGE_SIZE;
-		if (left <= 0)
-			return maxsize;
-	}
-	return maxsize - left ? : -EFAULT;
-}
-
 static ssize_t pipe_get_pages(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize, unsigned maxpages,
 		   size_t *start)
 {
+	struct pipe_inode_info *pipe = i->pipe;
 	unsigned int npages, off;
 	struct page **p;
-	size_t capacity;
+	ssize_t left;
+	int count;
 
 	if (!sanity(i))
 		return -EFAULT;
 
 	*start = off = pipe_npages(i, &npages);
-	capacity = min(npages, maxpages) * PAGE_SIZE - off;
-	maxsize = min(maxsize, capacity);
+	count = DIV_ROUND_UP(maxsize + off, PAGE_SIZE);
+	if (count > npages)
+		count = npages;
+	if (count > maxpages)
+		count = maxpages;
 	p = *pages;
 	if (!p) {
-		*pages = p = get_pages_array(DIV_ROUND_UP(maxsize + off, PAGE_SIZE));
+		*pages = p = get_pages_array(count);
 		if (!p)
 			return -ENOMEM;
 	}
 
-	return __pipe_get_pages(i, maxsize, p, off);
+	left = maxsize;
+	npages = 0;
+	if (off) {
+		struct pipe_buffer *buf = pipe_buf(pipe, pipe->head - 1);
+
+		get_page(*p++ = buf->page);
+		left -= PAGE_SIZE - off;
+		if (left <= 0) {
+			buf->len += maxsize;
+			return maxsize;
+		}
+		buf->len = PAGE_SIZE;
+		npages = 1;
+	}
+	for ( ; npages < count; npages++) {
+		struct page *page;
+		unsigned int size = min_t(ssize_t, left, PAGE_SIZE);
+
+		if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
+			break;
+		page = push_anon(pipe, size);
+		if (!page)
+			break;
+		get_page(*p++ = page);
+		left -= size;
+	}
+	if (!npages)
+		return -EFAULT;
+	return maxsize - left;
 }
 
 static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa,
-- 
2.30.2



* [PATCH 35/44] iov_iter: saner helper for page array allocation
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (32 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 34/44] fold __pipe_get_pages() into pipe_get_pages() Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 12:12     ` Jeff Layton
  2022-06-22  4:15   ` [PATCH 36/44] iov_iter: advancing variants of iov_iter_get_pages{,_alloc}() Al Viro
                     ` (11 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

All call sites of get_pages_array() are essentially identical now.
Replace them with a common helper...

Returns the number of slots available in the resulting array, or 0 on
OOM; it's up to the caller to make sure it doesn't ask for a
zero-entry array (i.e. neither maxpages nor size is allowed to be zero).
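
A userspace sketch of that contract, with kvmalloc_array() stood in for by calloc() and struct page left opaque (both substitutions are assumptions for the example, not the kernel implementation):

```c
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

struct page;	/* opaque here; only pointers to it are stored */

/* Model of want_pages_array(): clamp the slot count to maxpages,
 * allocate the array only if the caller didn't pass one in, and
 * return the number of usable slots, or 0 on allocation failure. */
static unsigned int want_pages_array(struct page ***res, size_t size,
				     size_t start, unsigned int maxpages)
{
	unsigned int count = DIV_ROUND_UP(size + start, PAGE_SIZE);

	if (count > maxpages)
		count = maxpages;
	if (!*res) {
		*res = calloc(count, sizeof(struct page *));
		if (!*res)
			return 0;
	}
	return count;
}
```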

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 77 +++++++++++++++++++++-----------------------------
 1 file changed, 32 insertions(+), 45 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 9280f865fd6a..1c744f0c0b2c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1187,9 +1187,20 @@ unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
 }
 EXPORT_SYMBOL(iov_iter_gap_alignment);
 
-static struct page **get_pages_array(size_t n)
+static int want_pages_array(struct page ***res, size_t size,
+			    size_t start, unsigned int maxpages)
 {
-	return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
+	unsigned int count = DIV_ROUND_UP(size + start, PAGE_SIZE);
+
+	if (count > maxpages)
+		count = maxpages;
+	WARN_ON(!count);	// caller should've prevented that
+	if (!*res) {
+		*res = kvmalloc_array(count, sizeof(struct page *), GFP_KERNEL);
+		if (!*res)
+			return 0;
+	}
+	return count;
 }
 
 static ssize_t pipe_get_pages(struct iov_iter *i,
@@ -1197,27 +1208,20 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
 		   size_t *start)
 {
 	struct pipe_inode_info *pipe = i->pipe;
-	unsigned int npages, off;
+	unsigned int npages, off, count;
 	struct page **p;
 	ssize_t left;
-	int count;
 
 	if (!sanity(i))
 		return -EFAULT;
 
 	*start = off = pipe_npages(i, &npages);
-	count = DIV_ROUND_UP(maxsize + off, PAGE_SIZE);
-	if (count > npages)
-		count = npages;
-	if (count > maxpages)
-		count = maxpages;
+	if (!npages)
+		return -EFAULT;
+	count = want_pages_array(pages, maxsize, off, min(npages, maxpages));
+	if (!count)
+		return -ENOMEM;
 	p = *pages;
-	if (!p) {
-		*pages = p = get_pages_array(count);
-		if (!p)
-			return -ENOMEM;
-	}
-
 	left = maxsize;
 	npages = 0;
 	if (off) {
@@ -1280,9 +1284,8 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
 				     struct page ***pages, size_t maxsize,
 				     unsigned maxpages, size_t *_start_offset)
 {
-	unsigned nr, offset;
-	pgoff_t index, count;
-	size_t size = maxsize;
+	unsigned nr, offset, count;
+	pgoff_t index;
 	loff_t pos;
 
 	pos = i->xarray_start + i->iov_offset;
@@ -1290,16 +1293,9 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
 	offset = pos & ~PAGE_MASK;
 	*_start_offset = offset;
 
-	count = DIV_ROUND_UP(size + offset, PAGE_SIZE);
-	if (count > maxpages)
-		count = maxpages;
-
-	if (!*pages) {
-		*pages = get_pages_array(count);
-		if (!*pages)
-			return -ENOMEM;
-	}
-
+	count = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!count)
+		return -ENOMEM;
 	nr = iter_xarray_populate_pages(*pages, i->xarray, index, count);
 	if (nr == 0)
 		return 0;
@@ -1348,7 +1344,7 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   unsigned int maxpages, size_t *start)
 {
-	int n, res;
+	unsigned int n;
 
 	if (maxsize > i->count)
 		maxsize = i->count;
@@ -1360,6 +1356,7 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 	if (likely(user_backed_iter(i))) {
 		unsigned int gup_flags = 0;
 		unsigned long addr;
+		int res;
 
 		if (iov_iter_rw(i) != WRITE)
 			gup_flags |= FOLL_WRITE;
@@ -1369,14 +1366,9 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		addr = first_iovec_segment(i, &maxsize);
 		*start = addr % PAGE_SIZE;
 		addr &= PAGE_MASK;
-		n = DIV_ROUND_UP(maxsize + *start, PAGE_SIZE);
-		if (n > maxpages)
-			n = maxpages;
-		if (!*pages) {
-			*pages = get_pages_array(n);
-			if (!*pages)
-				return -ENOMEM;
-		}
+		n = want_pages_array(pages, maxsize, *start, maxpages);
+		if (!n)
+			return -ENOMEM;
 		res = get_user_pages_fast(addr, n, gup_flags, *pages);
 		if (unlikely(res <= 0))
 			return res;
@@ -1387,15 +1379,10 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		struct page *page;
 
 		page = first_bvec_segment(i, &maxsize, start);
-		n = DIV_ROUND_UP(maxsize + *start, PAGE_SIZE);
-		if (n > maxpages)
-			n = maxpages;
+		n = want_pages_array(pages, maxsize, *start, maxpages);
+		if (!n)
+			return -ENOMEM;
 		p = *pages;
-		if (!p) {
-			*pages = p = get_pages_array(n);
-			if (!p)
-				return -ENOMEM;
-		}
 		for (int k = 0; k < n; k++)
 			get_page(*p++ = page++);
 		return min_t(size_t, maxsize, n * PAGE_SIZE - *start);
-- 
2.30.2



* [PATCH 36/44] iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (33 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 35/44] iov_iter: saner helper for page array allocation Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 12:13     ` Jeff Layton
  2022-06-22  4:15   ` [PATCH 37/44] block: convert to " Al Viro
                     ` (10 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

Most of the users immediately follow successful iov_iter_get_pages()
with advancing by the amount it had returned.

Provide inline wrappers doing that, convert trivial open-coded
uses of those.

BTW, iov_iter_get_pages() never returns more than it had been asked
to; such checks in cifs ought to be removed someday...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 drivers/vhost/scsi.c |  4 +---
 fs/ceph/file.c       |  3 +--
 fs/cifs/file.c       |  6 ++----
 fs/cifs/misc.c       |  3 +--
 fs/direct-io.c       |  3 +--
 fs/fuse/dev.c        |  3 +--
 fs/fuse/file.c       |  3 +--
 fs/nfs/direct.c      |  6 ++----
 include/linux/uio.h  | 20 ++++++++++++++++++++
 net/core/datagram.c  |  3 +--
 net/core/skmsg.c     |  3 +--
 net/rds/message.c    |  3 +--
 net/tls/tls_sw.c     |  4 +---
 13 files changed, 34 insertions(+), 30 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index ffd9e6c2ffc1..9b65509424dc 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -643,14 +643,12 @@ vhost_scsi_map_to_sgl(struct vhost_scsi_cmd *cmd,
 	size_t offset;
 	unsigned int npages = 0;
 
-	bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
+	bytes = iov_iter_get_pages2(iter, pages, LONG_MAX,
 				VHOST_SCSI_PREALLOC_UPAGES, &offset);
 	/* No pages were pinned */
 	if (bytes <= 0)
 		return bytes < 0 ? bytes : -EFAULT;
 
-	iov_iter_advance(iter, bytes);
-
 	while (bytes) {
 		unsigned n = min_t(unsigned, PAGE_SIZE - offset, bytes);
 		sg_set_page(sg++, pages[npages++], n, offset);
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index c535de5852bf..8fab5db16c73 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -95,12 +95,11 @@ static ssize_t __iter_get_bvecs(struct iov_iter *iter, size_t maxsize,
 		size_t start;
 		int idx = 0;
 
-		bytes = iov_iter_get_pages(iter, pages, maxsize - size,
+		bytes = iov_iter_get_pages2(iter, pages, maxsize - size,
 					   ITER_GET_BVECS_PAGES, &start);
 		if (bytes < 0)
 			return size ?: bytes;
 
-		iov_iter_advance(iter, bytes);
 		size += bytes;
 
 		for ( ; bytes; idx++, bvec_idx++) {
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index e1e05b253daa..3ba013e2987f 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3022,7 +3022,7 @@ cifs_write_from_iter(loff_t offset, size_t len, struct iov_iter *from,
 		if (ctx->direct_io) {
 			ssize_t result;
 
-			result = iov_iter_get_pages_alloc(
+			result = iov_iter_get_pages_alloc2(
 				from, &pagevec, cur_len, &start);
 			if (result < 0) {
 				cifs_dbg(VFS,
@@ -3036,7 +3036,6 @@ cifs_write_from_iter(loff_t offset, size_t len, struct iov_iter *from,
 				break;
 			}
 			cur_len = (size_t)result;
-			iov_iter_advance(from, cur_len);
 
 			nr_pages =
 				(cur_len + start + PAGE_SIZE - 1) / PAGE_SIZE;
@@ -3758,7 +3757,7 @@ cifs_send_async_read(loff_t offset, size_t len, struct cifsFileInfo *open_file,
 		if (ctx->direct_io) {
 			ssize_t result;
 
-			result = iov_iter_get_pages_alloc(
+			result = iov_iter_get_pages_alloc2(
 					&direct_iov, &pagevec,
 					cur_len, &start);
 			if (result < 0) {
@@ -3774,7 +3773,6 @@ cifs_send_async_read(loff_t offset, size_t len, struct cifsFileInfo *open_file,
 				break;
 			}
 			cur_len = (size_t)result;
-			iov_iter_advance(&direct_iov, cur_len);
 
 			rdata = cifs_readdata_direct_alloc(
 					pagevec, cifs_uncached_readv_complete);
diff --git a/fs/cifs/misc.c b/fs/cifs/misc.c
index c69e1240d730..37493118fb72 100644
--- a/fs/cifs/misc.c
+++ b/fs/cifs/misc.c
@@ -1022,7 +1022,7 @@ setup_aio_ctx_iter(struct cifs_aio_ctx *ctx, struct iov_iter *iter, int rw)
 	saved_len = count;
 
 	while (count && npages < max_pages) {
-		rc = iov_iter_get_pages(iter, pages, count, max_pages, &start);
+		rc = iov_iter_get_pages2(iter, pages, count, max_pages, &start);
 		if (rc < 0) {
 			cifs_dbg(VFS, "Couldn't get user pages (rc=%zd)\n", rc);
 			break;
@@ -1034,7 +1034,6 @@ setup_aio_ctx_iter(struct cifs_aio_ctx *ctx, struct iov_iter *iter, int rw)
 			break;
 		}
 
-		iov_iter_advance(iter, rc);
 		count -= rc;
 		rc += start;
 		cur_npages = DIV_ROUND_UP(rc, PAGE_SIZE);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 72237f49ad94..9724244f12ce 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -169,7 +169,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 {
 	ssize_t ret;
 
-	ret = iov_iter_get_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
+	ret = iov_iter_get_pages2(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
 				&sdio->from);
 
 	if (ret < 0 && sdio->blocks_available && (dio->op == REQ_OP_WRITE)) {
@@ -191,7 +191,6 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 	}
 
 	if (ret >= 0) {
-		iov_iter_advance(sdio->iter, ret);
 		ret += sdio->from;
 		sdio->head = 0;
 		sdio->tail = (ret + PAGE_SIZE - 1) / PAGE_SIZE;
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 8d657c2cd6f7..51897427a534 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -730,14 +730,13 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
 		}
 	} else {
 		size_t off;
-		err = iov_iter_get_pages(cs->iter, &page, PAGE_SIZE, 1, &off);
+		err = iov_iter_get_pages2(cs->iter, &page, PAGE_SIZE, 1, &off);
 		if (err < 0)
 			return err;
 		BUG_ON(!err);
 		cs->len = err;
 		cs->offset = off;
 		cs->pg = page;
-		iov_iter_advance(cs->iter, err);
 	}
 
 	return lock_request(cs->req);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index c982e3afe3b4..69e19fc0afc1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1401,14 +1401,13 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
 	while (nbytes < *nbytesp && ap->num_pages < max_pages) {
 		unsigned npages;
 		size_t start;
-		ret = iov_iter_get_pages(ii, &ap->pages[ap->num_pages],
+		ret = iov_iter_get_pages2(ii, &ap->pages[ap->num_pages],
 					*nbytesp - nbytes,
 					max_pages - ap->num_pages,
 					&start);
 		if (ret < 0)
 			break;
 
-		iov_iter_advance(ii, ret);
 		nbytes += ret;
 
 		ret += start;
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 022e1ce63e62..c275c83f0aef 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -364,13 +364,12 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 		size_t pgbase;
 		unsigned npages, i;
 
-		result = iov_iter_get_pages_alloc(iter, &pagevec, 
+		result = iov_iter_get_pages_alloc2(iter, &pagevec,
 						  rsize, &pgbase);
 		if (result < 0)
 			break;
 	
 		bytes = result;
-		iov_iter_advance(iter, bytes);
 		npages = (result + pgbase + PAGE_SIZE - 1) / PAGE_SIZE;
 		for (i = 0; i < npages; i++) {
 			struct nfs_page *req;
@@ -812,13 +811,12 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 		size_t pgbase;
 		unsigned npages, i;
 
-		result = iov_iter_get_pages_alloc(iter, &pagevec, 
+		result = iov_iter_get_pages_alloc2(iter, &pagevec,
 						  wsize, &pgbase);
 		if (result < 0)
 			break;
 
 		bytes = result;
-		iov_iter_advance(iter, bytes);
 		npages = (result + pgbase + PAGE_SIZE - 1) / PAGE_SIZE;
 		for (i = 0; i < npages; i++) {
 			struct nfs_page *req;
diff --git a/include/linux/uio.h b/include/linux/uio.h
index d3e13b37ea72..ab1cc218b9de 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -349,4 +349,24 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
 	};
 }
 
+static inline ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
+			size_t maxsize, unsigned maxpages, size_t *start)
+{
+	ssize_t res = iov_iter_get_pages(i, pages, maxsize, maxpages, start);
+
+	if (res >= 0)
+		iov_iter_advance(i, res);
+	return res;
+}
+
+static inline ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
+			size_t maxsize, size_t *start)
+{
+	ssize_t res = iov_iter_get_pages_alloc(i, pages, maxsize, start);
+
+	if (res >= 0)
+		iov_iter_advance(i, res);
+	return res;
+}
+
 #endif
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 50f4faeea76c..344b4c5791ac 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -629,12 +629,11 @@ int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
 		if (frag == MAX_SKB_FRAGS)
 			return -EMSGSIZE;
 
-		copied = iov_iter_get_pages(from, pages, length,
+		copied = iov_iter_get_pages2(from, pages, length,
 					    MAX_SKB_FRAGS - frag, &start);
 		if (copied < 0)
 			return -EFAULT;
 
-		iov_iter_advance(from, copied);
 		length -= copied;
 
 		truesize = PAGE_ALIGN(copied + start);
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 22b983ade0e7..662151678f20 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -324,14 +324,13 @@ int sk_msg_zerocopy_from_iter(struct sock *sk, struct iov_iter *from,
 			goto out;
 		}
 
-		copied = iov_iter_get_pages(from, pages, bytes, maxpages,
+		copied = iov_iter_get_pages2(from, pages, bytes, maxpages,
 					    &offset);
 		if (copied <= 0) {
 			ret = -EFAULT;
 			goto out;
 		}
 
-		iov_iter_advance(from, copied);
 		bytes -= copied;
 		msg->sg.size += copied;
 
diff --git a/net/rds/message.c b/net/rds/message.c
index 799034e0f513..d74be4e3f3fa 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -391,7 +391,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
 		size_t start;
 		ssize_t copied;
 
-		copied = iov_iter_get_pages(from, &pages, PAGE_SIZE,
+		copied = iov_iter_get_pages2(from, &pages, PAGE_SIZE,
 					    1, &start);
 		if (copied < 0) {
 			struct mmpin *mmp;
@@ -405,7 +405,6 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
 			goto err;
 		}
 		total_copied += copied;
-		iov_iter_advance(from, copied);
 		length -= copied;
 		sg_set_page(sg, pages, copied, start);
 		rm->data.op_nents++;
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 0513f82b8537..b1406c60f8df 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1361,7 +1361,7 @@ static int tls_setup_from_iter(struct iov_iter *from,
 			rc = -EFAULT;
 			goto out;
 		}
-		copied = iov_iter_get_pages(from, pages,
+		copied = iov_iter_get_pages2(from, pages,
 					    length,
 					    maxpages, &offset);
 		if (copied <= 0) {
@@ -1369,8 +1369,6 @@ static int tls_setup_from_iter(struct iov_iter *from,
 			goto out;
 		}
 
-		iov_iter_advance(from, copied);
-
 		length -= copied;
 		size += copied;
 		while (copied) {
-- 
2.30.2



* [PATCH 37/44] block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (34 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 36/44] iov_iter: advancing variants of iov_iter_get_pages{,_alloc}() Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 12:16     ` Jeff Layton
                       ` (2 more replies)
  2022-06-22  4:15   ` [PATCH 38/44] iter_to_pipe(): switch to advancing variant of iov_iter_get_pages() Al Viro
                     ` (9 subsequent siblings)
  45 siblings, 3 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

... doing revert if we end up not using some pages
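
The pairing this conversion relies on can be modelled in userspace like so. A real iov_iter carries far more state than a single byte counter; the toy names below are made up for the example:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: the advancing getter consumes everything it returns up
 * front, so when the caller fails to use some tail of it, the unused
 * amount must be handed back, as the converted block code now does
 * with iov_iter_revert(). */
struct toy_iter {
	size_t count;	/* bytes left in the iterator */
};

static size_t toy_get_pages2(struct toy_iter *i, size_t want)
{
	size_t got = want < i->count ? want : i->count;

	i->count -= got;	/* advancing variant */
	return got;
}

static void toy_revert(struct toy_iter *i, size_t unused)
{
	i->count += unused;	/* like iov_iter_revert() */
}
```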

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 block/bio.c     | 15 ++++++---------
 block/blk-map.c |  7 ++++---
 2 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 51c99f2c5c90..01ab683e67be 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1190,7 +1190,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
-	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+	size = iov_iter_get_pages2(iter, pages, LONG_MAX, nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
@@ -1205,6 +1205,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		} else {
 			if (WARN_ON_ONCE(bio_full(bio, len))) {
 				bio_put_pages(pages + i, left, offset);
+				iov_iter_revert(iter, left);
 				return -EINVAL;
 			}
 			__bio_add_page(bio, page, len, offset);
@@ -1212,7 +1213,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		offset = 0;
 	}
 
-	iov_iter_advance(iter, size);
 	return 0;
 }
 
@@ -1227,7 +1227,6 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
 	ssize_t size, left;
 	unsigned len, i;
 	size_t offset;
-	int ret = 0;
 
 	if (WARN_ON_ONCE(!max_append_sectors))
 		return 0;
@@ -1240,7 +1239,7 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
 	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
-	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+	size = iov_iter_get_pages2(iter, pages, LONG_MAX, nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
@@ -1252,16 +1251,14 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
 		if (bio_add_hw_page(q, bio, page, len, offset,
 				max_append_sectors, &same_page) != len) {
 			bio_put_pages(pages + i, left, offset);
-			ret = -EINVAL;
-			break;
+			iov_iter_revert(iter, left);
+			return -EINVAL;
 		}
 		if (same_page)
 			put_page(page);
 		offset = 0;
 	}
-
-	iov_iter_advance(iter, size - left);
-	return ret;
+	return 0;
 }
 
 /**
diff --git a/block/blk-map.c b/block/blk-map.c
index df8b066cd548..7196a6b64c80 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -254,7 +254,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		size_t offs, added = 0;
 		int npages;
 
-		bytes = iov_iter_get_pages_alloc(iter, &pages, LONG_MAX, &offs);
+		bytes = iov_iter_get_pages_alloc2(iter, &pages, LONG_MAX, &offs);
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
 			goto out_unmap;
@@ -284,7 +284,6 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 				bytes -= n;
 				offs = 0;
 			}
-			iov_iter_advance(iter, added);
 		}
 		/*
 		 * release the pages we didn't map into the bio, if any
@@ -293,8 +292,10 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 			put_page(pages[j++]);
 		kvfree(pages);
 		/* couldn't stuff something into bio? */
-		if (bytes)
+		if (bytes) {
+			iov_iter_revert(iter, bytes);
 			break;
+		}
 	}
 
 	ret = blk_rq_append_bio(rq, bio);
-- 
2.30.2



* [PATCH 38/44] iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (35 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 37/44] block: convert to " Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 12:18     ` Jeff Layton
  2022-06-22  4:15   ` [PATCH 39/44] af_alg_make_sg(): " Al Viro
                     ` (8 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

... and untangle the cleanup on failure to add into pipe.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/splice.c | 47 ++++++++++++++++++++++++-----------------------
 1 file changed, 24 insertions(+), 23 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 6645b30ec990..9f84bd21f64c 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1160,39 +1160,40 @@ static int iter_to_pipe(struct iov_iter *from,
 	};
 	size_t total = 0;
 	int ret = 0;
-	bool failed = false;
 
-	while (iov_iter_count(from) && !failed) {
+	while (iov_iter_count(from)) {
 		struct page *pages[16];
-		ssize_t copied;
+		ssize_t left;
 		size_t start;
-		int n;
+		int i, n;
 
-		copied = iov_iter_get_pages(from, pages, ~0UL, 16, &start);
-		if (copied <= 0) {
-			ret = copied;
+		left = iov_iter_get_pages2(from, pages, ~0UL, 16, &start);
+		if (left <= 0) {
+			ret = left;
 			break;
 		}
 
-		for (n = 0; copied; n++, start = 0) {
-			int size = min_t(int, copied, PAGE_SIZE - start);
-			if (!failed) {
-				buf.page = pages[n];
-				buf.offset = start;
-				buf.len = size;
-				ret = add_to_pipe(pipe, &buf);
-				if (unlikely(ret < 0)) {
-					failed = true;
-				} else {
-					iov_iter_advance(from, ret);
-					total += ret;
-				}
-			} else {
-				put_page(pages[n]);
+		n = DIV_ROUND_UP(left + start, PAGE_SIZE);
+		for (i = 0; i < n; i++) {
+			int size = min_t(int, left, PAGE_SIZE - start);
+
+			buf.page = pages[i];
+			buf.offset = start;
+			buf.len = size;
+			ret = add_to_pipe(pipe, &buf);
+			if (unlikely(ret < 0)) {
+				iov_iter_revert(from, left);
+				// this one got dropped by add_to_pipe()
+				while (++i < n)
+					put_page(pages[i]);
+				goto out;
 			}
-			copied -= size;
+			total += ret;
+			left -= size;
+			start = 0;
 		}
 	}
+out:
 	return total ? total : ret;
 }
 
-- 
2.30.2



* [PATCH 39/44] af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages()
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (36 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 38/44] iter_to_pipe(): switch to advancing variant of iov_iter_get_pages() Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 12:18     ` Jeff Layton
  2022-06-22  4:15   ` [PATCH 40/44] 9p: convert to advancing variant of iov_iter_get_pages_alloc() Al Viro
                     ` (7 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

... and adjust the callers

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 crypto/af_alg.c     | 3 +--
 crypto/algif_hash.c | 5 +++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index c8289b7a85ba..e893c0f6c879 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -404,7 +404,7 @@ int af_alg_make_sg(struct af_alg_sgl *sgl, struct iov_iter *iter, int len)
 	ssize_t n;
 	int npages, i;
 
-	n = iov_iter_get_pages(iter, sgl->pages, len, ALG_MAX_PAGES, &off);
+	n = iov_iter_get_pages2(iter, sgl->pages, len, ALG_MAX_PAGES, &off);
 	if (n < 0)
 		return n;
 
@@ -1191,7 +1191,6 @@ int af_alg_get_rsgl(struct sock *sk, struct msghdr *msg, int flags,
 		len += err;
 		atomic_add(err, &ctx->rcvused);
 		rsgl->sg_num_bytes = err;
-		iov_iter_advance(&msg->msg_iter, err);
 	}
 
 	*outlen = len;
diff --git a/crypto/algif_hash.c b/crypto/algif_hash.c
index 50f7b22f1b48..1d017ec5c63c 100644
--- a/crypto/algif_hash.c
+++ b/crypto/algif_hash.c
@@ -102,11 +102,12 @@ static int hash_sendmsg(struct socket *sock, struct msghdr *msg,
 		err = crypto_wait_req(crypto_ahash_update(&ctx->req),
 				      &ctx->wait);
 		af_alg_free_sg(&ctx->sgl);
-		if (err)
+		if (err) {
+			iov_iter_revert(&msg->msg_iter, len);
 			goto unlock;
+		}
 
 		copied += len;
-		iov_iter_advance(&msg->msg_iter, len);
 	}
 
 	err = 0;
-- 
2.30.2



* [PATCH 40/44] 9p: convert to advancing variant of iov_iter_get_pages_alloc()
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (37 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 39/44] af_alg_make_sg(): " Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-07-01  9:01     ` Dominique Martinet
  2022-07-01 13:47     ` Christian Schoenebeck
  2022-06-22  4:15   ` [PATCH 41/44] ceph: switch the last caller " Al Viro
                     ` (6 subsequent siblings)
  45 siblings, 2 replies; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

that one is somewhat clumsier than usual and needs serious testing.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 net/9p/client.c       | 39 +++++++++++++++++++++++----------------
 net/9p/protocol.c     |  3 +--
 net/9p/trans_virtio.c |  3 ++-
 3 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/net/9p/client.c b/net/9p/client.c
index d403085b9ef5..cb4324211561 100644
--- a/net/9p/client.c
+++ b/net/9p/client.c
@@ -1491,7 +1491,7 @@ p9_client_read_once(struct p9_fid *fid, u64 offset, struct iov_iter *to,
 	struct p9_client *clnt = fid->clnt;
 	struct p9_req_t *req;
 	int count = iov_iter_count(to);
-	int rsize, non_zc = 0;
+	int rsize, received, non_zc = 0;
 	char *dataptr;
 
 	*err = 0;
@@ -1520,36 +1520,40 @@ p9_client_read_once(struct p9_fid *fid, u64 offset, struct iov_iter *to,
 	}
 	if (IS_ERR(req)) {
 		*err = PTR_ERR(req);
+		if (!non_zc)
+			iov_iter_revert(to, count - iov_iter_count(to));
 		return 0;
 	}
 
 	*err = p9pdu_readf(&req->rc, clnt->proto_version,
-			   "D", &count, &dataptr);
+			   "D", &received, &dataptr);
 	if (*err) {
+		if (!non_zc)
+			iov_iter_revert(to, count - iov_iter_count(to));
 		trace_9p_protocol_dump(clnt, &req->rc);
 		p9_tag_remove(clnt, req);
 		return 0;
 	}
-	if (rsize < count) {
-		pr_err("bogus RREAD count (%d > %d)\n", count, rsize);
-		count = rsize;
+	if (rsize < received) {
+		pr_err("bogus RREAD count (%d > %d)\n", received, rsize);
+		received = rsize;
 	}
 
 	p9_debug(P9_DEBUG_9P, "<<< RREAD count %d\n", count);
 
 	if (non_zc) {
-		int n = copy_to_iter(dataptr, count, to);
+		int n = copy_to_iter(dataptr, received, to);
 
-		if (n != count) {
+		if (n != received) {
 			*err = -EFAULT;
 			p9_tag_remove(clnt, req);
 			return n;
 		}
 	} else {
-		iov_iter_advance(to, count);
+		iov_iter_revert(to, count - received - iov_iter_count(to));
 	}
 	p9_tag_remove(clnt, req);
-	return count;
+	return received;
 }
 EXPORT_SYMBOL(p9_client_read_once);
 
@@ -1567,6 +1571,7 @@ p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err)
 	while (iov_iter_count(from)) {
 		int count = iov_iter_count(from);
 		int rsize = fid->iounit;
+		int written;
 
 		if (!rsize || rsize > clnt->msize - P9_IOHDRSZ)
 			rsize = clnt->msize - P9_IOHDRSZ;
@@ -1584,27 +1589,29 @@ p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err)
 					    offset, rsize, from);
 		}
 		if (IS_ERR(req)) {
+			iov_iter_revert(from, count - iov_iter_count(from));
 			*err = PTR_ERR(req);
 			break;
 		}
 
-		*err = p9pdu_readf(&req->rc, clnt->proto_version, "d", &count);
+		*err = p9pdu_readf(&req->rc, clnt->proto_version, "d", &written);
 		if (*err) {
+			iov_iter_revert(from, count - iov_iter_count(from));
 			trace_9p_protocol_dump(clnt, &req->rc);
 			p9_tag_remove(clnt, req);
 			break;
 		}
-		if (rsize < count) {
-			pr_err("bogus RWRITE count (%d > %d)\n", count, rsize);
-			count = rsize;
+		if (rsize < written) {
+			pr_err("bogus RWRITE count (%d > %d)\n", written, rsize);
+			written = rsize;
 		}
 
 		p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count);
 
 		p9_tag_remove(clnt, req);
-		iov_iter_advance(from, count);
-		total += count;
-		offset += count;
+		iov_iter_revert(from, count - written - iov_iter_count(from));
+		total += written;
+		offset += written;
 	}
 	return total;
 }
diff --git a/net/9p/protocol.c b/net/9p/protocol.c
index 3754c33e2974..83694c631989 100644
--- a/net/9p/protocol.c
+++ b/net/9p/protocol.c
@@ -63,9 +63,8 @@ static size_t
 pdu_write_u(struct p9_fcall *pdu, struct iov_iter *from, size_t size)
 {
 	size_t len = min(pdu->capacity - pdu->size, size);
-	struct iov_iter i = *from;
 
-	if (!copy_from_iter_full(&pdu->sdata[pdu->size], len, &i))
+	if (!copy_from_iter_full(&pdu->sdata[pdu->size], len, from))
 		len = 0;
 
 	pdu->size += len;
diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 2a210c2f8e40..1977d33475fe 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -331,7 +331,7 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
 			if (err == -ERESTARTSYS)
 				return err;
 		}
-		n = iov_iter_get_pages_alloc(data, pages, count, offs);
+		n = iov_iter_get_pages_alloc2(data, pages, count, offs);
 		if (n < 0)
 			return n;
 		*need_drop = 1;
@@ -373,6 +373,7 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
 				(*pages)[index] = kmap_to_page(p);
 			p += PAGE_SIZE;
 		}
+		iov_iter_advance(data, len);
 		return len;
 	}
 }
-- 
2.30.2



* [PATCH 41/44] ceph: switch the last caller of iov_iter_get_pages_alloc()
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (38 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 40/44] 9p: convert to advancing variant of iov_iter_get_pages_alloc() Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 12:20     ` Jeff Layton
  2022-06-22  4:15   ` [PATCH 42/44] get rid of non-advancing variants Al Viro
                     ` (5 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

here nothing even looks at the iov_iter after the call, so we couldn't
care less whether it advances or not.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/ceph/addr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 6dee88815491..3c8a7cf19e5d 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -329,7 +329,7 @@ static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq)
 
 	dout("%s: pos=%llu orig_len=%zu len=%llu\n", __func__, subreq->start, subreq->len, len);
 	iov_iter_xarray(&iter, READ, &rreq->mapping->i_pages, subreq->start, len);
-	err = iov_iter_get_pages_alloc(&iter, &pages, len, &page_off);
+	err = iov_iter_get_pages_alloc2(&iter, &pages, len, &page_off);
 	if (err < 0) {
 		dout("%s: iov_ter_get_pages_alloc returned %d\n", __func__, err);
 		goto out;
-- 
2.30.2



* [PATCH 42/44] get rid of non-advancing variants
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (39 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 41/44] ceph: switch the last caller " Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 12:21     ` Jeff Layton
  2022-06-22  4:15   ` [PATCH 43/44] pipe_get_pages(): switch to append_pipe() Al Viro
                     ` (4 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

mechanical change; will be further massaged in subsequent commits

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 include/linux/uio.h | 24 ++----------------------
 lib/iov_iter.c      | 27 ++++++++++++++++++---------
 2 files changed, 20 insertions(+), 31 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index ab1cc218b9de..f2fc55f88e45 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -245,9 +245,9 @@ void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode
 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
 void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
 		     loff_t start, size_t count);
-ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
+ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned maxpages, size_t *start);
-ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
+ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
 			size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
 void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state);
@@ -349,24 +349,4 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
 	};
 }
 
-static inline ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
-			size_t maxsize, unsigned maxpages, size_t *start)
-{
-	ssize_t res = iov_iter_get_pages(i, pages, maxsize, maxpages, start);
-
-	if (res >= 0)
-		iov_iter_advance(i, res);
-	return res;
-}
-
-static inline ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
-			size_t maxsize, size_t *start)
-{
-	ssize_t res = iov_iter_get_pages_alloc(i, pages, maxsize, start);
-
-	if (res >= 0)
-		iov_iter_advance(i, res);
-	return res;
-}
-
 #endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 1c744f0c0b2c..70736b3e07c5 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1231,6 +1231,7 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
 		left -= PAGE_SIZE - off;
 		if (left <= 0) {
 			buf->len += maxsize;
+			iov_iter_advance(i, maxsize);
 			return maxsize;
 		}
 		buf->len = PAGE_SIZE;
@@ -1250,7 +1251,9 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
 	}
 	if (!npages)
 		return -EFAULT;
-	return maxsize - left;
+	maxsize -= left;
+	iov_iter_advance(i, maxsize);
+	return maxsize;
 }
 
 static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa,
@@ -1300,7 +1303,9 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
 	if (nr == 0)
 		return 0;
 
-	return min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
+	maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
 }
 
 /* must be done on non-empty ITER_UBUF or ITER_IOVEC one */
@@ -1372,7 +1377,9 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		res = get_user_pages_fast(addr, n, gup_flags, *pages);
 		if (unlikely(res <= 0))
 			return res;
-		return min_t(size_t, maxsize, res * PAGE_SIZE - *start);
+		maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - *start);
+		iov_iter_advance(i, maxsize);
+		return maxsize;
 	}
 	if (iov_iter_is_bvec(i)) {
 		struct page **p;
@@ -1384,8 +1391,10 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 			return -ENOMEM;
 		p = *pages;
 		for (int k = 0; k < n; k++)
-			get_page(*p++ = page++);
-		return min_t(size_t, maxsize, n * PAGE_SIZE - *start);
+			get_page(p[k] = page + k);
+		maxsize = min_t(size_t, maxsize, n * PAGE_SIZE - *start);
+		iov_iter_advance(i, maxsize);
+		return maxsize;
 	}
 	if (iov_iter_is_pipe(i))
 		return pipe_get_pages(i, pages, maxsize, maxpages, start);
@@ -1395,7 +1404,7 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 	return -EFAULT;
 }
 
-ssize_t iov_iter_get_pages(struct iov_iter *i,
+ssize_t iov_iter_get_pages2(struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned maxpages,
 		   size_t *start)
 {
@@ -1405,9 +1414,9 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 
 	return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start);
 }
-EXPORT_SYMBOL(iov_iter_get_pages);
+EXPORT_SYMBOL(iov_iter_get_pages2);
 
-ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
+ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   size_t *start)
 {
@@ -1422,7 +1431,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 	}
 	return len;
 }
-EXPORT_SYMBOL(iov_iter_get_pages_alloc);
+EXPORT_SYMBOL(iov_iter_get_pages_alloc2);
 
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 			       struct iov_iter *i)
-- 
2.30.2



* [PATCH 43/44] pipe_get_pages(): switch to append_pipe()
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (40 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 42/44] get rid of non-advancing variants Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 12:23     ` Jeff Layton
  2022-06-22  4:15   ` [PATCH 44/44] expand those iov_iter_advance() Al Viro
                     ` (3 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

now that we are advancing the iterator, there's no need to
treat the first page separately - just call append_pipe()
in a loop.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 36 ++++++++----------------------------
 1 file changed, 8 insertions(+), 28 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 70736b3e07c5..a8045c97b975 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1207,10 +1207,10 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize, unsigned maxpages,
 		   size_t *start)
 {
-	struct pipe_inode_info *pipe = i->pipe;
-	unsigned int npages, off, count;
+	unsigned int npages, count;
 	struct page **p;
 	ssize_t left;
+	size_t off;
 
 	if (!sanity(i))
 		return -EFAULT;
@@ -1222,38 +1222,18 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
 	if (!count)
 		return -ENOMEM;
 	p = *pages;
-	left = maxsize;
-	npages = 0;
-	if (off) {
-		struct pipe_buffer *buf = pipe_buf(pipe, pipe->head - 1);
-
-		get_page(*p++ = buf->page);
-		left -= PAGE_SIZE - off;
-		if (left <= 0) {
-			buf->len += maxsize;
-			iov_iter_advance(i, maxsize);
-			return maxsize;
-		}
-		buf->len = PAGE_SIZE;
-		npages = 1;
-	}
-	for ( ; npages < count; npages++) {
-		struct page *page;
-		unsigned int size = min_t(ssize_t, left, PAGE_SIZE);
-
-		if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
-			break;
-		page = push_anon(pipe, size);
+	for (npages = 0, left = maxsize ; npages < count; npages++) {
+		struct page *page = append_pipe(i, left, &off);
 		if (!page)
 			break;
 		get_page(*p++ = page);
-		left -= size;
+		if (left <= PAGE_SIZE - off)
+			return maxsize;
+		left -= PAGE_SIZE - off;
 	}
 	if (!npages)
 		return -EFAULT;
-	maxsize -= left;
-	iov_iter_advance(i, maxsize);
-	return maxsize;
+	return maxsize - left;
 }
 
 static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa,
-- 
2.30.2



* [PATCH 44/44] expand those iov_iter_advance()...
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (41 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 43/44] pipe_get_pages(): switch to append_pipe() Al Viro
@ 2022-06-22  4:15   ` Al Viro
  2022-06-28 12:23     ` Jeff Layton
  2022-07-01  6:21   ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Dominique Martinet
                     ` (2 subsequent siblings)
  45 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-22  4:15 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 lib/iov_iter.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index a8045c97b975..79c86add8dea 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1284,7 +1284,8 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
 		return 0;
 
 	maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
-	iov_iter_advance(i, maxsize);
+	i->iov_offset += maxsize;
+	i->count -= maxsize;
 	return maxsize;
 }
 
@@ -1373,7 +1374,13 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		for (int k = 0; k < n; k++)
 			get_page(p[k] = page + k);
 		maxsize = min_t(size_t, maxsize, n * PAGE_SIZE - *start);
-		iov_iter_advance(i, maxsize);
+		i->count -= maxsize;
+		i->iov_offset += maxsize;
+		if (i->iov_offset == i->bvec->bv_len) {
+			i->iov_offset = 0;
+			i->bvec++;
+			i->nr_segs--;
+		}
 		return maxsize;
 	}
 	if (iov_iter_is_pipe(i))
-- 
2.30.2



* Re: [RFC][CFT][PATCHSET] iov_iter stuff
  2022-06-22  4:10 [RFC][CFT][PATCHSET] iov_iter stuff Al Viro
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
@ 2022-06-23 15:21 ` David Howells
  2022-06-23 20:32   ` Al Viro
  2022-06-28 12:25 ` Jeff Layton
  2 siblings, 1 reply; 118+ messages in thread
From: David Howells @ 2022-06-23 15:21 UTC (permalink / raw)
  To: Al Viro
  Cc: dhowells, linux-fsdevel, Linus Torvalds, Jens Axboe,
	Christoph Hellwig, Matthew Wilcox, Dominique Martinet,
	Christian Brauner

Al Viro <viro@zeniv.linux.org.uk> wrote:

> 
> 13/44: splice: stop abusing iov_iter_advance() to flush a pipe
> 	A really odd (ab)use of iov_iter_advance() - in case of error
> generic_file_splice_read() wants to free all pipe buffers ->read_iter()
> has produced.  Yes, forcibly resetting ->head and ->iov_offset to
> original values and calling iov_iter_advance(i, 0) will trigger
> pipe_advance(), which will trigger pipe_truncate(), which will free
> buffers.  Or we could just go ahead and free the same buffers;
> pipe_discard_from() does exactly that, no iov_iter stuff needs to
> be involved.

Can ->splice_read() and ->splice_write() be given pipe-class iov_iters rather
than pipe_inode_info structs?

David



* Re: [RFC][CFT][PATCHSET] iov_iter stuff
  2022-06-23 15:21 ` [RFC][CFT][PATCHSET] iov_iter stuff David Howells
@ 2022-06-23 20:32   ` Al Viro
  0 siblings, 0 replies; 118+ messages in thread
From: Al Viro @ 2022-06-23 20:32 UTC (permalink / raw)
  To: David Howells
  Cc: linux-fsdevel, Linus Torvalds, Jens Axboe, Christoph Hellwig,
	Matthew Wilcox, Dominique Martinet, Christian Brauner

On Thu, Jun 23, 2022 at 04:21:52PM +0100, David Howells wrote:
> Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> > 
> > 13/44: splice: stop abusing iov_iter_advance() to flush a pipe
> > 	A really odd (ab)use of iov_iter_advance() - in case of error
> > generic_file_splice_read() wants to free all pipe buffers ->read_iter()
> > has produced.  Yes, forcibly resetting ->head and ->iov_offset to
> > original values and calling iov_iter_advance(i, 0) will trigger
> > pipe_advance(), which will trigger pipe_truncate(), which will free
> > buffers.  Or we could just go ahead and free the same buffers;
> > pipe_discard_from() does exactly that, no iov_iter stuff needs to
> > be involved.
> 
> Can ->splice_read() and ->splice_write() be given pipe-class iov_iters rather
> than pipe_inode_info structs?

Huh?

First of all, ->splice_write() is given a pipe as data _source_, which makes
ITER_PIPE completely irrelevant - those suckers are data destinations.
What's more, it will unlock the pipe, wait and relock once somebody writes
to that pipe.  And ITER_PIPE very much relies upon the pipe being locked
and staying locked.

As for ->splice_read()...  We could create the iov_iter in the caller, but...
Look at those callers:
        pipe_lock(opipe);
	ret = wait_for_space(opipe, flags);
	if (!ret)
		ret = do_splice_to(in, offset, opipe, len, flags);
	pipe_unlock(opipe);

You can't set it up before wait_for_pace(), obviously - if there's no
empty slots, what would that sucker work with?  And you can't keep
it past pipe_unlock().  So it's limited to do_splice_to(), which
is a very thin wrapper for ->splice_read().  So what's the point?


* Re: [PATCH 08/44] copy_page_{to,from}_iter(): switch iovec variants to generic
  2022-06-22  4:15   ` [PATCH 08/44] copy_page_{to,from}_iter(): switch iovec variants to generic Al Viro
@ 2022-06-27 18:31     ` Jeff Layton
  2022-06-28 12:32     ` Christian Brauner
  1 sibling, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-27 18:31 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> we can do copyin/copyout under kmap_local_page(); it shouldn't overflow
> the kmap stack - the maximal footprint increases only by one here.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  lib/iov_iter.c | 191 ++-----------------------------------------------
>  1 file changed, 4 insertions(+), 187 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 6dd5330f7a99..4c658a25e29c 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -168,174 +168,6 @@ static int copyin(void *to, const void __user *from, size_t n)
>  	return n;
>  }
>  
> -static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t bytes,
> -			 struct iov_iter *i)
> -{
> -	size_t skip, copy, left, wanted;
> -	const struct iovec *iov;
> -	char __user *buf;
> -	void *kaddr, *from;
> -
> -	if (unlikely(bytes > i->count))
> -		bytes = i->count;
> -
> -	if (unlikely(!bytes))
> -		return 0;
> -
> -	might_fault();
> -	wanted = bytes;
> -	iov = i->iov;
> -	skip = i->iov_offset;
> -	buf = iov->iov_base + skip;
> -	copy = min(bytes, iov->iov_len - skip);
> -
> -	if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_writeable(buf, copy)) {
> -		kaddr = kmap_atomic(page);
> -		from = kaddr + offset;
> -
> -		/* first chunk, usually the only one */
> -		left = copyout(buf, from, copy);
> -		copy -= left;
> -		skip += copy;
> -		from += copy;
> -		bytes -= copy;
> -
> -		while (unlikely(!left && bytes)) {
> -			iov++;
> -			buf = iov->iov_base;
> -			copy = min(bytes, iov->iov_len);
> -			left = copyout(buf, from, copy);
> -			copy -= left;
> -			skip = copy;
> -			from += copy;
> -			bytes -= copy;
> -		}
> -		if (likely(!bytes)) {
> -			kunmap_atomic(kaddr);
> -			goto done;
> -		}
> -		offset = from - kaddr;
> -		buf += copy;
> -		kunmap_atomic(kaddr);
> -		copy = min(bytes, iov->iov_len - skip);
> -	}
> -	/* Too bad - revert to non-atomic kmap */
> -
> -	kaddr = kmap(page);
> -	from = kaddr + offset;
> -	left = copyout(buf, from, copy);
> -	copy -= left;
> -	skip += copy;
> -	from += copy;
> -	bytes -= copy;
> -	while (unlikely(!left && bytes)) {
> -		iov++;
> -		buf = iov->iov_base;
> -		copy = min(bytes, iov->iov_len);
> -		left = copyout(buf, from, copy);
> -		copy -= left;
> -		skip = copy;
> -		from += copy;
> -		bytes -= copy;
> -	}
> -	kunmap(page);
> -
> -done:
> -	if (skip == iov->iov_len) {
> -		iov++;
> -		skip = 0;
> -	}
> -	i->count -= wanted - bytes;
> -	i->nr_segs -= iov - i->iov;
> -	i->iov = iov;
> -	i->iov_offset = skip;
> -	return wanted - bytes;
> -}
> -
> -static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t bytes,
> -			 struct iov_iter *i)
> -{
> -	size_t skip, copy, left, wanted;
> -	const struct iovec *iov;
> -	char __user *buf;
> -	void *kaddr, *to;
> -
> -	if (unlikely(bytes > i->count))
> -		bytes = i->count;
> -
> -	if (unlikely(!bytes))
> -		return 0;
> -
> -	might_fault();
> -	wanted = bytes;
> -	iov = i->iov;
> -	skip = i->iov_offset;
> -	buf = iov->iov_base + skip;
> -	copy = min(bytes, iov->iov_len - skip);
> -
> -	if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_readable(buf, copy)) {
> -		kaddr = kmap_atomic(page);
> -		to = kaddr + offset;
> -
> -		/* first chunk, usually the only one */
> -		left = copyin(to, buf, copy);
> -		copy -= left;
> -		skip += copy;
> -		to += copy;
> -		bytes -= copy;
> -
> -		while (unlikely(!left && bytes)) {
> -			iov++;
> -			buf = iov->iov_base;
> -			copy = min(bytes, iov->iov_len);
> -			left = copyin(to, buf, copy);
> -			copy -= left;
> -			skip = copy;
> -			to += copy;
> -			bytes -= copy;
> -		}
> -		if (likely(!bytes)) {
> -			kunmap_atomic(kaddr);
> -			goto done;
> -		}
> -		offset = to - kaddr;
> -		buf += copy;
> -		kunmap_atomic(kaddr);
> -		copy = min(bytes, iov->iov_len - skip);
> -	}
> -	/* Too bad - revert to non-atomic kmap */
> -
> -	kaddr = kmap(page);
> -	to = kaddr + offset;
> -	left = copyin(to, buf, copy);
> -	copy -= left;
> -	skip += copy;
> -	to += copy;
> -	bytes -= copy;
> -	while (unlikely(!left && bytes)) {
> -		iov++;
> -		buf = iov->iov_base;
> -		copy = min(bytes, iov->iov_len);
> -		left = copyin(to, buf, copy);
> -		copy -= left;
> -		skip = copy;
> -		to += copy;
> -		bytes -= copy;
> -	}
> -	kunmap(page);
> -
> -done:
> -	if (skip == iov->iov_len) {
> -		iov++;
> -		skip = 0;
> -	}
> -	i->count -= wanted - bytes;
> -	i->nr_segs -= iov - i->iov;
> -	i->iov = iov;
> -	i->iov_offset = skip;
> -	return wanted - bytes;
> -}
> -
>  #ifdef PIPE_PARANOIA
>  static bool sanity(const struct iov_iter *i)
>  {
> @@ -848,24 +680,14 @@ static inline bool page_copy_sane(struct page *page, size_t offset, size_t n)
>  static size_t __copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
>  			 struct iov_iter *i)
>  {
> -	if (likely(iter_is_iovec(i)))
> -		return copy_page_to_iter_iovec(page, offset, bytes, i);
> -	if (iov_iter_is_bvec(i) || iov_iter_is_kvec(i) || iov_iter_is_xarray(i)) {
> +	if (unlikely(iov_iter_is_pipe(i))) {
> +		return copy_page_to_iter_pipe(page, offset, bytes, i);
> +	} else {
>  		void *kaddr = kmap_local_page(page);
>  		size_t wanted = _copy_to_iter(kaddr + offset, bytes, i);
>  		kunmap_local(kaddr);
>  		return wanted;
>  	}
> -	if (iov_iter_is_pipe(i))
> -		return copy_page_to_iter_pipe(page, offset, bytes, i);
> -	if (unlikely(iov_iter_is_discard(i))) {
> -		if (unlikely(i->count < bytes))
> -			bytes = i->count;
> -		i->count -= bytes;
> -		return bytes;
> -	}
> -	WARN_ON(1);
> -	return 0;
>  }
>  
>  size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
> @@ -896,17 +718,12 @@ EXPORT_SYMBOL(copy_page_to_iter);
>  size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
>  			 struct iov_iter *i)
>  {
> -	if (unlikely(!page_copy_sane(page, offset, bytes)))
> -		return 0;
> -	if (likely(iter_is_iovec(i)))
> -		return copy_page_from_iter_iovec(page, offset, bytes, i);
> -	if (iov_iter_is_bvec(i) || iov_iter_is_kvec(i) || iov_iter_is_xarray(i)) {
> +	if (page_copy_sane(page, offset, bytes)) {
>  		void *kaddr = kmap_local_page(page);
>  		size_t wanted = _copy_from_iter(kaddr + offset, bytes, i);
>  		kunmap_local(kaddr);
>  		return wanted;
>  	}
> -	WARN_ON(1);
>  	return 0;
>  }
>  EXPORT_SYMBOL(copy_page_from_iter);

Love it.

Reviewed-by: Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 09/44] new iov_iter flavour - ITER_UBUF
  2022-06-22  4:15   ` [PATCH 09/44] new iov_iter flavour - ITER_UBUF Al Viro
@ 2022-06-27 18:47     ` Jeff Layton
  2022-06-28 18:41       ` Al Viro
  2022-06-28 12:38     ` Christian Brauner
  2022-07-28  9:55     ` [PATCH 9/44] " Alexander Gordeev
  2 siblings, 1 reply; 118+ messages in thread
From: Jeff Layton @ 2022-06-27 18:47 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> Equivalent of single-segment iovec.  Initialized by iov_iter_ubuf(),
> checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC
> ones.
> 
> We are going to expose things like ->write_iter() et al. to those
> in subsequent commits.
> 
> New predicate (user_backed_iter()) that is true for ITER_IOVEC and
> ITER_UBUF; places like direct-IO handling should use that to check
> whether pages we modify after getting them from iov_iter_get_pages()
> need to be dirtied.
> 
> DO NOT assume that replacing iter_is_iovec() with user_backed_iter()
> will solve all problems - there's code that uses iter_is_iovec() to
> decide how to poke around in iov_iter guts and for that the predicate
> replacement obviously won't suffice.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  block/fops.c         |  6 +--
>  fs/ceph/file.c       |  2 +-
>  fs/cifs/file.c       |  2 +-
>  fs/direct-io.c       |  2 +-
>  fs/fuse/dev.c        |  4 +-
>  fs/fuse/file.c       |  2 +-
>  fs/gfs2/file.c       |  2 +-
>  fs/iomap/direct-io.c |  2 +-
>  fs/nfs/direct.c      |  2 +-
>  include/linux/uio.h  | 26 ++++++++++++
>  lib/iov_iter.c       | 94 ++++++++++++++++++++++++++++++++++----------
>  mm/shmem.c           |  2 +-
>  12 files changed, 113 insertions(+), 33 deletions(-)
> 
> diff --git a/block/fops.c b/block/fops.c
> index 6e86931ab847..3e68d69e0ee3 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -69,7 +69,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
>  
>  	if (iov_iter_rw(iter) == READ) {
>  		bio_init(&bio, bdev, vecs, nr_pages, REQ_OP_READ);
> -		if (iter_is_iovec(iter))
> +		if (user_backed_iter(iter))
>  			should_dirty = true;
>  	} else {
>  		bio_init(&bio, bdev, vecs, nr_pages, dio_bio_write_op(iocb));
> @@ -199,7 +199,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
>  	}
>  
>  	dio->size = 0;
> -	if (is_read && iter_is_iovec(iter))
> +	if (is_read && user_backed_iter(iter))
>  		dio->flags |= DIO_SHOULD_DIRTY;
>  
>  	blk_start_plug(&plug);
> @@ -331,7 +331,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
>  	dio->size = bio->bi_iter.bi_size;
>  
>  	if (is_read) {
> -		if (iter_is_iovec(iter)) {
> +		if (user_backed_iter(iter)) {
>  			dio->flags |= DIO_SHOULD_DIRTY;
>  			bio_set_pages_dirty(bio);
>  		}
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 8c8226c0feac..e132adeeaf16 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1262,7 +1262,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
>  	size_t count = iov_iter_count(iter);
>  	loff_t pos = iocb->ki_pos;
>  	bool write = iov_iter_rw(iter) == WRITE;
> -	bool should_dirty = !write && iter_is_iovec(iter);
> +	bool should_dirty = !write && user_backed_iter(iter);
>  
>  	if (write && ceph_snap(file_inode(file)) != CEPH_NOSNAP)
>  		return -EROFS;
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 1618e0537d58..4b4129d9a90c 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -4004,7 +4004,7 @@ static ssize_t __cifs_readv(
>  	if (!is_sync_kiocb(iocb))
>  		ctx->iocb = iocb;
>  
> -	if (iter_is_iovec(to))
> +	if (user_backed_iter(to))
>  		ctx->should_dirty = true;
>  
>  	if (direct) {
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 39647eb56904..72237f49ad94 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1245,7 +1245,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
>  	spin_lock_init(&dio->bio_lock);
>  	dio->refcount = 1;
>  
> -	dio->should_dirty = iter_is_iovec(iter) && iov_iter_rw(iter) == READ;
> +	dio->should_dirty = user_backed_iter(iter) && iov_iter_rw(iter) == READ;
>  	sdio.iter = iter;
>  	sdio.final_block_in_request = end >> blkbits;
>  
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 0e537e580dc1..8d657c2cd6f7 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -1356,7 +1356,7 @@ static ssize_t fuse_dev_read(struct kiocb *iocb, struct iov_iter *to)
>  	if (!fud)
>  		return -EPERM;
>  
> -	if (!iter_is_iovec(to))
> +	if (!user_backed_iter(to))
>  		return -EINVAL;
>  
>  	fuse_copy_init(&cs, 1, to);
> @@ -1949,7 +1949,7 @@ static ssize_t fuse_dev_write(struct kiocb *iocb, struct iov_iter *from)
>  	if (!fud)
>  		return -EPERM;
>  
> -	if (!iter_is_iovec(from))
> +	if (!user_backed_iter(from))
>  		return -EINVAL;
>  
>  	fuse_copy_init(&cs, 0, from);
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 00fa861aeead..c982e3afe3b4 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1465,7 +1465,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
>  			inode_unlock(inode);
>  	}
>  
> -	io->should_dirty = !write && iter_is_iovec(iter);
> +	io->should_dirty = !write && user_backed_iter(iter);
>  	while (count) {
>  		ssize_t nres;
>  		fl_owner_t owner = current->files;
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 2cceb193dcd8..48e6cc74fdc1 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -780,7 +780,7 @@ static inline bool should_fault_in_pages(struct iov_iter *i,
>  
>  	if (!count)
>  		return false;
> -	if (!iter_is_iovec(i))
> +	if (!user_backed_iter(i))
>  		return false;
>  
>  	size = PAGE_SIZE;
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 31c7f1035b20..d5c7d019653b 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -533,7 +533,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  			iomi.flags |= IOMAP_NOWAIT;
>  		}
>  
> -		if (iter_is_iovec(iter))
> +		if (user_backed_iter(iter))
>  			dio->flags |= IOMAP_DIO_DIRTY;
>  	} else {
>  		iomi.flags |= IOMAP_WRITE;
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index 4eb2a8380a28..022e1ce63e62 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -478,7 +478,7 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,
>  	if (!is_sync_kiocb(iocb))
>  		dreq->iocb = iocb;
>  
> -	if (iter_is_iovec(iter))
> +	if (user_backed_iter(iter))
>  		dreq->flags = NFS_ODIRECT_SHOULD_DIRTY;
>  
>  	if (!swap)
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 76d305f3d4c2..6ab4260c3d6c 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -26,6 +26,7 @@ enum iter_type {
>  	ITER_PIPE,
>  	ITER_XARRAY,
>  	ITER_DISCARD,
> +	ITER_UBUF,
>  };
>  
>  struct iov_iter_state {
> @@ -38,6 +39,7 @@ struct iov_iter {
>  	u8 iter_type;
>  	bool nofault;
>  	bool data_source;
> +	bool user_backed;
>  	size_t iov_offset;
>  	size_t count;
>  	union {
> @@ -46,6 +48,7 @@ struct iov_iter {
>  		const struct bio_vec *bvec;
>  		struct xarray *xarray;
>  		struct pipe_inode_info *pipe;
> +		void __user *ubuf;
>  	};
>  	union {
>  		unsigned long nr_segs;
> @@ -70,6 +73,11 @@ static inline void iov_iter_save_state(struct iov_iter *iter,
>  	state->nr_segs = iter->nr_segs;
>  }
>  
> +static inline bool iter_is_ubuf(const struct iov_iter *i)
> +{
> +	return iov_iter_type(i) == ITER_UBUF;
> +}
> +
>  static inline bool iter_is_iovec(const struct iov_iter *i)
>  {
>  	return iov_iter_type(i) == ITER_IOVEC;
> @@ -105,6 +113,11 @@ static inline unsigned char iov_iter_rw(const struct iov_iter *i)
>  	return i->data_source ? WRITE : READ;
>  }
>  
> +static inline bool user_backed_iter(const struct iov_iter *i)
> +{
> +	return i->user_backed;
> +}
> +

nit: I wonder whether this new boolean is worth it over just checking
iter_is_iovec(i) || iter_is_ubuf(i). Not a big deal though.

>  /*
>   * Total number of bytes covered by an iovec.
>   *
> @@ -320,4 +333,17 @@ ssize_t __import_iovec(int type, const struct iovec __user *uvec,
>  int import_single_range(int type, void __user *buf, size_t len,
>  		 struct iovec *iov, struct iov_iter *i);
>  
> +static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
> +			void __user *buf, size_t count)
> +{
> +	WARN_ON(direction & ~(READ | WRITE));
> +	*i = (struct iov_iter) {
> +		.iter_type = ITER_UBUF,
> +		.user_backed = true,
> +		.data_source = direction,
> +		.ubuf = buf,
> +		.count = count
> +	};
> +}
> +
>  #endif
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 4c658a25e29c..8275b28e886b 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -16,6 +16,16 @@
>  
>  #define PIPE_PARANOIA /* for now */
>  
> +/* covers ubuf and kbuf alike */
> +#define iterate_buf(i, n, base, len, off, __p, STEP) {		\
> +	size_t __maybe_unused off = 0;				\
> +	len = n;						\
> +	base = __p + i->iov_offset;				\
> +	len -= (STEP);						\
> +	i->iov_offset += len;					\
> +	n = len;						\
> +}
> +
>  /* covers iovec and kvec alike */
>  #define iterate_iovec(i, n, base, len, off, __p, STEP) {	\
>  	size_t off = 0;						\
> @@ -110,7 +120,12 @@ __out:								\
>  	if (unlikely(i->count < n))				\
>  		n = i->count;					\
>  	if (likely(n)) {					\
> -		if (likely(iter_is_iovec(i))) {			\
> +		if (likely(iter_is_ubuf(i))) {			\
> +			void __user *base;			\
> +			size_t len;				\
> +			iterate_buf(i, n, base, len, off,	\
> +						i->ubuf, (I)) 	\
> +		} else if (likely(iter_is_iovec(i))) {		\
>  			const struct iovec *iov = i->iov;	\
>  			void __user *base;			\
>  			size_t len;				\
> @@ -275,7 +290,11 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
>   */
>  size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size)
>  {
> -	if (iter_is_iovec(i)) {
> +	if (iter_is_ubuf(i)) {
> +		size_t n = min(size, iov_iter_count(i));
> +		n -= fault_in_readable(i->ubuf + i->iov_offset, n);
> +		return size - n;
> +	} else if (iter_is_iovec(i)) {
>  		size_t count = min(size, iov_iter_count(i));
>  		const struct iovec *p;
>  		size_t skip;
> @@ -314,7 +333,11 @@ EXPORT_SYMBOL(fault_in_iov_iter_readable);
>   */
>  size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size)
>  {
> -	if (iter_is_iovec(i)) {
> +	if (iter_is_ubuf(i)) {
> +		size_t n = min(size, iov_iter_count(i));
> +		n -= fault_in_safe_writeable(i->ubuf + i->iov_offset, n);
> +		return size - n;
> +	} else if (iter_is_iovec(i)) {
>  		size_t count = min(size, iov_iter_count(i));
>  		const struct iovec *p;
>  		size_t skip;
> @@ -345,6 +368,7 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
>  	*i = (struct iov_iter) {
>  		.iter_type = ITER_IOVEC,
>  		.nofault = false,
> +		.user_backed = true,
>  		.data_source = direction,
>  		.iov = iov,
>  		.nr_segs = nr_segs,
> @@ -494,7 +518,7 @@ size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
>  {
>  	if (unlikely(iov_iter_is_pipe(i)))
>  		return copy_pipe_to_iter(addr, bytes, i);
> -	if (iter_is_iovec(i))
> +	if (user_backed_iter(i))
>  		might_fault();
>  	iterate_and_advance(i, bytes, base, len, off,
>  		copyout(base, addr + off, len),
> @@ -576,7 +600,7 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
>  {
>  	if (unlikely(iov_iter_is_pipe(i)))
>  		return copy_mc_pipe_to_iter(addr, bytes, i);
> -	if (iter_is_iovec(i))
> +	if (user_backed_iter(i))
>  		might_fault();
>  	__iterate_and_advance(i, bytes, base, len, off,
>  		copyout_mc(base, addr + off, len),
> @@ -594,7 +618,7 @@ size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
>  		WARN_ON(1);
>  		return 0;
>  	}
> -	if (iter_is_iovec(i))
> +	if (user_backed_iter(i))
>  		might_fault();
>  	iterate_and_advance(i, bytes, base, len, off,
>  		copyin(addr + off, base, len),
> @@ -882,16 +906,16 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
>  {
>  	if (unlikely(i->count < size))
>  		size = i->count;
> -	if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i))) {
> +	if (likely(iter_is_ubuf(i)) || unlikely(iov_iter_is_xarray(i))) {
> +		i->iov_offset += size;
> +		i->count -= size;
> +	} else if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i))) {
>  		/* iovec and kvec have identical layouts */
>  		iov_iter_iovec_advance(i, size);
>  	} else if (iov_iter_is_bvec(i)) {
>  		iov_iter_bvec_advance(i, size);
>  	} else if (iov_iter_is_pipe(i)) {
>  		pipe_advance(i, size);
> -	} else if (unlikely(iov_iter_is_xarray(i))) {
> -		i->iov_offset += size;
> -		i->count -= size;
>  	} else if (iov_iter_is_discard(i)) {
>  		i->count -= size;
>  	}
> @@ -938,7 +962,7 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
>  		return;
>  	}
>  	unroll -= i->iov_offset;
> -	if (iov_iter_is_xarray(i)) {
> +	if (iov_iter_is_xarray(i) || iter_is_ubuf(i)) {
>  		BUG(); /* We should never go beyond the start of the specified
>  			* range since we might then be straying into pages that
>  			* aren't pinned.
> @@ -1129,6 +1153,13 @@ static unsigned long iov_iter_alignment_bvec(const struct iov_iter *i)
>  
>  unsigned long iov_iter_alignment(const struct iov_iter *i)
>  {
> +	if (likely(iter_is_ubuf(i))) {
> +		size_t size = i->count;
> +		if (size)
> +			return ((unsigned long)i->ubuf + i->iov_offset) | size;
> +		return 0;
> +	}
> +
>  	/* iovec and kvec have identical layouts */
>  	if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i)))
>  		return iov_iter_alignment_iovec(i);
> @@ -1159,6 +1190,9 @@ unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
>  	size_t size = i->count;
>  	unsigned k;
>  
> +	if (iter_is_ubuf(i))
> +		return 0;
> +
>  	if (WARN_ON(!iter_is_iovec(i)))
>  		return ~0U;
>  
> @@ -1287,7 +1321,19 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
>  	return actual;
>  }
>  
> -/* must be done on non-empty ITER_IOVEC one */
> +static unsigned long found_ubuf_segment(unsigned long addr,
> +					size_t len,
> +					size_t *size, size_t *start,
> +					unsigned maxpages)
> +{
> +	len += (*start = addr % PAGE_SIZE);
> +	if (len > maxpages * PAGE_SIZE)
> +		len = maxpages * PAGE_SIZE;
> +	*size = len;
> +	return addr & PAGE_MASK;
> +}
> +
> +/* must be done on non-empty ITER_UBUF or ITER_IOVEC one */
>  static unsigned long first_iovec_segment(const struct iov_iter *i,
>  					 size_t *size, size_t *start,
>  					 size_t maxsize, unsigned maxpages)
> @@ -1295,6 +1341,11 @@ static unsigned long first_iovec_segment(const struct iov_iter *i,
>  	size_t skip;
>  	long k;
>  
> +	if (iter_is_ubuf(i)) {
> +		unsigned long addr = (unsigned long)i->ubuf + i->iov_offset;
> +		return found_ubuf_segment(addr, maxsize, size, start, maxpages);
> +	}
> +
>  	for (k = 0, skip = i->iov_offset; k < i->nr_segs; k++, skip = 0) {
>  		unsigned long addr = (unsigned long)i->iov[k].iov_base + skip;
>  		size_t len = i->iov[k].iov_len - skip;
> @@ -1303,11 +1354,7 @@ static unsigned long first_iovec_segment(const struct iov_iter *i,
>  			continue;
>  		if (len > maxsize)
>  			len = maxsize;
> -		len += (*start = addr % PAGE_SIZE);
> -		if (len > maxpages * PAGE_SIZE)
> -			len = maxpages * PAGE_SIZE;
> -		*size = len;
> -		return addr & PAGE_MASK;
> +		return found_ubuf_segment(addr, len, size, start, maxpages);
>  	}
>  	BUG(); // if it had been empty, we wouldn't get called
>  }
> @@ -1344,7 +1391,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
>  	if (!maxsize)
>  		return 0;
>  
> -	if (likely(iter_is_iovec(i))) {
> +	if (likely(user_backed_iter(i))) {
>  		unsigned int gup_flags = 0;
>  		unsigned long addr;
>  
> @@ -1470,7 +1517,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
>  	if (!maxsize)
>  		return 0;
>  
> -	if (likely(iter_is_iovec(i))) {
> +	if (likely(user_backed_iter(i))) {
>  		unsigned int gup_flags = 0;
>  		unsigned long addr;
>  
> @@ -1624,6 +1671,11 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
>  {
>  	if (unlikely(!i->count))
>  		return 0;
> +	if (likely(iter_is_ubuf(i))) {
> +		unsigned offs = offset_in_page(i->ubuf + i->iov_offset);
> +		int npages = DIV_ROUND_UP(offs + i->count, PAGE_SIZE);
> +		return min(npages, maxpages);
> +	}
>  	/* iovec and kvec have identical layouts */
>  	if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i)))
>  		return iov_npages(i, maxpages);
> @@ -1862,10 +1914,12 @@ EXPORT_SYMBOL(import_single_range);
>  void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
>  {
>  	if (WARN_ON_ONCE(!iov_iter_is_bvec(i) && !iter_is_iovec(i)) &&
> -			 !iov_iter_is_kvec(i))
> +			 !iov_iter_is_kvec(i) && !iter_is_ubuf(i))
>  		return;
>  	i->iov_offset = state->iov_offset;
>  	i->count = state->count;
> +	if (iter_is_ubuf(i))
> +		return;
>  	/*
>  	 * For the *vec iters, nr_segs + iov is constant - if we increment
>  	 * the vec, then we also decrement the nr_segs count. Hence we don't
> diff --git a/mm/shmem.c b/mm/shmem.c
> index a6f565308133..6b83f3971795 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2603,7 +2603,7 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  			ret = copy_page_to_iter(page, offset, nr, to);
>  			put_page(page);
>  
> -		} else if (iter_is_iovec(to)) {
> +		} else if (!user_backed_iter(to)) {
>  			/*
>  			 * Copy to user tends to be so well optimized, but
>  			 * clear_user() not so much, that it is noticeably

The code looks reasonable but is there any real benefit here? It seems
like the only user of it so far is new_sync_{read,write}, and both seem
to just use it to avoid allocating a single iovec on the stack.
 
-- 
Jeff Layton <jlayton@kernel.org>
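
To make the new flavour concrete, here is a userspace miniature of the ITER_UBUF idea (all names here — mini_ubuf_iter, mini_advance — are ours, not the kernel's; the real struct carries more state).  A single user buffer needs no iovec array: the position is fully described by (ubuf, iov_offset, count), which is exactly what the new branches in the quoted patch exploit:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Miniature of ITER_UBUF: one contiguous buffer, no segment array.
 * Illustrative sketch only; names are not the kernel's.
 */
struct mini_ubuf_iter {
	char *ubuf;
	size_t iov_offset;
	size_t count;
};

/* Analogue of the iov_iter_ubuf() initializer from the patch. */
static struct mini_ubuf_iter mini_iov_iter_ubuf(char *buf, size_t count)
{
	return (struct mini_ubuf_iter){ .ubuf = buf, .count = count };
}

/* Analogue of the new ITER_UBUF branch in iov_iter_advance(). */
static void mini_advance(struct mini_ubuf_iter *i, size_t size)
{
	if (size > i->count)
		size = i->count;
	i->iov_offset += size;
	i->count -= size;
}

/* Current position, the way a copyin/copyout step would compute it. */
static char *mini_cur(struct mini_ubuf_iter *i)
{
	return i->ubuf + i->iov_offset;
}
```

Compare this with the iovec case, where advancing may have to walk and shrink a segment array; that difference is what makes the single-range fast paths in the patch so short.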

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 11/44] iov_iter_bvec_advance(): don't bother with bvec_iter
  2022-06-22  4:15   ` [PATCH 11/44] iov_iter_bvec_advance(): don't bother with bvec_iter Al Viro
@ 2022-06-27 18:48     ` Jeff Layton
  2022-06-28 12:40     ` Christian Brauner
  1 sibling, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-27 18:48 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> do what we do for iovec/kvec; that ends up generating better code,
> AFAICS.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  lib/iov_iter.c | 23 ++++++++++++++---------
>  1 file changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 8275b28e886b..93ceb13ec7b5 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -870,17 +870,22 @@ static void pipe_advance(struct iov_iter *i, size_t size)
>  
>  static void iov_iter_bvec_advance(struct iov_iter *i, size_t size)
>  {
> -	struct bvec_iter bi;
> +	const struct bio_vec *bvec, *end;
>  
> -	bi.bi_size = i->count;
> -	bi.bi_bvec_done = i->iov_offset;
> -	bi.bi_idx = 0;
> -	bvec_iter_advance(i->bvec, &bi, size);
> +	if (!i->count)
> +		return;
> +	i->count -= size;
> +
> +	size += i->iov_offset;
>  
> -	i->bvec += bi.bi_idx;
> -	i->nr_segs -= bi.bi_idx;
> -	i->count = bi.bi_size;
> -	i->iov_offset = bi.bi_bvec_done;
> +	for (bvec = i->bvec, end = bvec + i->nr_segs; bvec < end; bvec++) {
> +		if (likely(size < bvec->bv_len))
> +			break;
> +		size -= bvec->bv_len;
> +	}
> +	i->iov_offset = size;
> +	i->nr_segs -= bvec - i->bvec;
> +	i->bvec = bvec;
>  }
>  
>  static void iov_iter_iovec_advance(struct iov_iter *i, size_t size)

Much simpler to follow, IMO...

Reviewed-by: Jeff Layton <jlayton@kernel.org>
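
The rewritten loop can be exercised outside the kernel.  This sketch mirrors the structure of the new iov_iter_bvec_advance() with plain length-only segments standing in for bio_vecs (our names; the kernel version also assumes the caller never advances past ->count):

```c
#include <assert.h>
#include <stddef.h>

/* Length-only stand-in for a bio_vec. */
struct seg { size_t len; };

struct adv_state {
	const struct seg *seg;
	size_t nr_segs;
	size_t iov_offset;
	size_t count;
};

/*
 * Mirrors the new iov_iter_bvec_advance(): fold iov_offset into size,
 * skip whole segments, and record the leftover as the new offset.
 */
static void seg_advance(struct adv_state *i, size_t size)
{
	const struct seg *s, *end;

	if (!i->count)
		return;
	i->count -= size;
	size += i->iov_offset;
	for (s = i->seg, end = s + i->nr_segs; s < end; s++) {
		if (size < s->len)
			break;
		size -= s->len;
	}
	i->iov_offset = size;
	i->nr_segs -= s - i->seg;
	i->seg = s;
}
```

The old code built a throwaway bvec_iter just to do this walk; the direct loop keeps all four fields updated in one obvious pass.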


* Re: [PATCH 12/44] fix short copy handling in copy_mc_pipe_to_iter()
  2022-06-22  4:15   ` [PATCH 12/44] fix short copy handling in copy_mc_pipe_to_iter() Al Viro
@ 2022-06-27 19:15     ` Jeff Layton
  2022-06-28 12:42     ` Christian Brauner
  1 sibling, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-27 19:15 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> Unlike other copying operations on ITER_PIPE, copy_mc_to_iter() can
> result in a short copy.  In that case we need to trim the unused
> buffers, as well as the length of partially filled one - it's not
> enough to set ->head, ->iov_offset and ->count to reflect how
> much had we copied.  Not hard to fix, fortunately...
> 
> I'd put a helper (pipe_discard_from(pipe, head)) into pipe_fs_i.h,
> rather than iov_iter.c - it has nothing to do with iov_iter and
> having it will allow us to avoid an ugly kludge in fs/splice.c.
> We could put it into lib/iov_iter.c for now and move it later,
> but I don't see the point going that way...
> 
> Fixes: ca146f6f091e "lib/iov_iter: Fix pipe handling in _copy_to_iter_mcsafe()"
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  include/linux/pipe_fs_i.h |  9 +++++++++
>  lib/iov_iter.c            | 15 +++++++++++----
>  2 files changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
> index cb0fd633a610..4ea496924106 100644
> --- a/include/linux/pipe_fs_i.h
> +++ b/include/linux/pipe_fs_i.h
> @@ -229,6 +229,15 @@ static inline bool pipe_buf_try_steal(struct pipe_inode_info *pipe,
>  	return buf->ops->try_steal(pipe, buf);
>  }
>  
> +static inline void pipe_discard_from(struct pipe_inode_info *pipe,
> +		unsigned int old_head)
> +{
> +	unsigned int mask = pipe->ring_size - 1;
> +
> +	while (pipe->head > old_head)
> +		pipe_buf_release(pipe, &pipe->bufs[--pipe->head & mask]);
> +}
> +
>  /* Differs from PIPE_BUF in that PIPE_SIZE is the length of the actual
>     memory allocation, whereas PIPE_BUF makes atomicity guarantees.  */
>  #define PIPE_SIZE		PAGE_SIZE
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 0b64695ab632..2bf20b48a04a 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -689,6 +689,7 @@ static size_t copy_mc_pipe_to_iter(const void *addr, size_t bytes,
>  	struct pipe_inode_info *pipe = i->pipe;
>  	unsigned int p_mask = pipe->ring_size - 1;
>  	unsigned int i_head;
> +	unsigned int valid = pipe->head;
>  	size_t n, off, xfer = 0;
>  
>  	if (!sanity(i))
> @@ -702,11 +703,17 @@ static size_t copy_mc_pipe_to_iter(const void *addr, size_t bytes,
>  		rem = copy_mc_to_kernel(p + off, addr + xfer, chunk);
>  		chunk -= rem;
>  		kunmap_local(p);
> -		i->head = i_head;
> -		i->iov_offset = off + chunk;
> -		xfer += chunk;
> -		if (rem)
> +		if (chunk) {
> +			i->head = i_head;
> +			i->iov_offset = off + chunk;
> +			xfer += chunk;
> +			valid = i_head + 1;
> +		}
> +		if (rem) {
> +			pipe->bufs[i_head & p_mask].len -= rem;
> +			pipe_discard_from(pipe, valid);
>  			break;
> +		}
>  		n -= chunk;
>  		off = 0;
>  		i_head++;

Reviewed-by: Jeff Layton <jlayton@kernel.org>
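
The trim step the fix adds can be modelled in userspace.  Below, a power-of-two ring whose head counter is never reduced modulo the size (as in the kernel pipe); "releasing" a slot just clears its length where the kernel would drop a page reference.  Names are ours, the masking and walk-back mirror pipe_discard_from():

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8 /* must be a power of two, like pipe->ring_size */

struct mini_pipe {
	unsigned int head;      /* monotonically increasing */
	size_t len[RING_SIZE];
};

/*
 * Mirrors pipe_discard_from(): release every buffer inserted at or
 * after old_head, walking the head counter back as we go.
 */
static void discard_from(struct mini_pipe *pipe, unsigned int old_head)
{
	unsigned int mask = RING_SIZE - 1;

	while (pipe->head > old_head)
		pipe->len[--pipe->head & mask] = 0;
}
```

In the patch, `valid` records the last head known to hold fully copied data, so a short copy_mc_to_kernel() can drop exactly the buffers that were only partially (or not at all) filled.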


* Re: [PATCH 13/44] splice: stop abusing iov_iter_advance() to flush a pipe
  2022-06-22  4:15   ` [PATCH 13/44] splice: stop abusing iov_iter_advance() to flush a pipe Al Viro
@ 2022-06-27 19:17     ` Jeff Layton
  2022-06-28 12:43     ` Christian Brauner
  1 sibling, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-27 19:17 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> Use pipe_discard_from() explicitly in generic_file_splice_read(); don't bother
> with the rather non-obvious use of iov_iter_advance() in there.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/splice.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/splice.c b/fs/splice.c
> index 047b79db8eb5..6645b30ec990 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -301,11 +301,9 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
>  {
>  	struct iov_iter to;
>  	struct kiocb kiocb;
> -	unsigned int i_head;
>  	int ret;
>  
>  	iov_iter_pipe(&to, READ, pipe, len);
> -	i_head = to.head;
>  	init_sync_kiocb(&kiocb, in);
>  	kiocb.ki_pos = *ppos;
>  	ret = call_read_iter(in, &kiocb, &to);
> @@ -313,9 +311,8 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
>  		*ppos = kiocb.ki_pos;
>  		file_accessed(in);
>  	} else if (ret < 0) {
> -		to.head = i_head;
> -		to.iov_offset = 0;
> -		iov_iter_advance(&to, 0); /* to free what was emitted */
> +		/* free what was emitted */
> +		pipe_discard_from(pipe, to.start_head);
>  		/*
>  		 * callers of ->splice_read() expect -EAGAIN on
>  		 * "can't put anything in there", rather than -EFAULT.

Reviewed-by: Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 14/44] ITER_PIPE: helper for getting pipe buffer by index
  2022-06-22  4:15   ` [PATCH 14/44] ITER_PIPE: helper for getting pipe buffer by index Al Viro
@ 2022-06-28 10:38     ` Jeff Layton
  2022-06-28 12:45     ` Christian Brauner
  1 sibling, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 10:38 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> pipe_buffer instances of a pipe are organized as a ring buffer,
> with power-of-2 size.  Indices are kept *not* reduced modulo ring
> size, so the buffer referred to by index N is
> 	pipe->bufs[N & (pipe->ring_size - 1)].
> 
> Ring size can change over the lifetime of a pipe, but not while
> the pipe is locked.  So for any iov_iter primitives it's a constant.
> Original conversion of pipes to this layout went overboard trying
> to microoptimize that - calculating pipe->ring_size - 1, storing
> it in a local variable and using it throughout the function.  In some
> cases it might be warranted, but most of the time it only
> obfuscates what's going on in there.
> 
> Introduce a helper (pipe_buf(pipe, N)) that would encapsulate
> that and use it in the obvious cases.  More will follow...
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  lib/iov_iter.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index d00cc8971b5b..08bb393da677 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -183,13 +183,18 @@ static int copyin(void *to, const void __user *from, size_t n)
>  	return n;
>  }
>  
> +static inline struct pipe_buffer *pipe_buf(const struct pipe_inode_info *pipe,
> +					   unsigned int slot)
> +{
> +	return &pipe->bufs[slot & (pipe->ring_size - 1)];
> +}
> +
>  #ifdef PIPE_PARANOIA
>  static bool sanity(const struct iov_iter *i)
>  {
>  	struct pipe_inode_info *pipe = i->pipe;
>  	unsigned int p_head = pipe->head;
>  	unsigned int p_tail = pipe->tail;
> -	unsigned int p_mask = pipe->ring_size - 1;
>  	unsigned int p_occupancy = pipe_occupancy(p_head, p_tail);
>  	unsigned int i_head = i->head;
>  	unsigned int idx;
> @@ -201,7 +206,7 @@ static bool sanity(const struct iov_iter *i)
>  		if (unlikely(i_head != p_head - 1))
>  			goto Bad;	// must be at the last buffer...
>  
> -		p = &pipe->bufs[i_head & p_mask];
> +		p = pipe_buf(pipe, i_head);
>  		if (unlikely(p->offset + p->len != i->iov_offset))
>  			goto Bad;	// ... at the end of segment
>  	} else {
> @@ -386,11 +391,10 @@ static inline bool allocated(struct pipe_buffer *buf)
>  static inline void data_start(const struct iov_iter *i,
>  			      unsigned int *iter_headp, size_t *offp)
>  {
> -	unsigned int p_mask = i->pipe->ring_size - 1;
>  	unsigned int iter_head = i->head;
>  	size_t off = i->iov_offset;
>  
> -	if (off && (!allocated(&i->pipe->bufs[iter_head & p_mask]) ||
> +	if (off && (!allocated(pipe_buf(i->pipe, iter_head)) ||
>  		    off == PAGE_SIZE)) {
>  		iter_head++;
>  		off = 0;
> @@ -1180,10 +1184,9 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)
>  		return iov_iter_alignment_bvec(i);
>  
>  	if (iov_iter_is_pipe(i)) {
> -		unsigned int p_mask = i->pipe->ring_size - 1;
>  		size_t size = i->count;
>  
> -		if (size && i->iov_offset && allocated(&i->pipe->bufs[i->head & p_mask]))
> +		if (size && i->iov_offset && allocated(pipe_buf(i->pipe, i->head)))
>  			return size | i->iov_offset;
>  		return size;
>  	}

Reviewed-by: Jeff Layton <jlayton@kernel.org>
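
The commit message leans on a property worth spelling out: because the ring size is a power of two, masking with `size - 1` selects the same slot as `N % size`, so head/tail counters may grow monotonically without ever being reduced.  A quick self-contained check (the helper name matches the patch; the sizes below are arbitrary examples):

```c
#include <assert.h>

/* What pipe_buf() computes to map an unreduced index to a slot. */
static unsigned int slot(unsigned int n, unsigned int ring_size)
{
	return n & (ring_size - 1); /* valid only for power-of-two sizes */
}
```

This is also why the ring size must stay a power of two: for any other size the mask would skip slots and the equivalence with modulo arithmetic breaks.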


* Re: [PATCH 15/44] ITER_PIPE: helpers for adding pipe buffers
  2022-06-22  4:15   ` [PATCH 15/44] ITER_PIPE: helpers for adding pipe buffers Al Viro
@ 2022-06-28 11:32     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 11:32 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> There are only two kinds of pipe_buffer in the area used by ITER_PIPE.
> 
> 1) anonymous - copy_to_iter() et al. end up creating those and copying
> data there.  They have zero ->offset, and their ->ops points to
> default_pipe_page_ops.
> 
> 2) zero-copy ones - those come from copy_page_to_iter(), and page
> comes from caller.  ->offset is also caller-supplied - it might be
> non-zero.  ->ops points to page_cache_pipe_buf_ops.
> 
> Move creation and insertion of those into helpers - push_anon(pipe, size)
> and push_page(pipe, page, offset, size) resp., separating them from
> the "could we avoid creating a new buffer by merging with the current
> head?" logics.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  lib/iov_iter.c | 88 ++++++++++++++++++++++++++------------------------
>  1 file changed, 46 insertions(+), 42 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 08bb393da677..924854c2a7ce 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -231,15 +231,39 @@ static bool sanity(const struct iov_iter *i)
>  #define sanity(i) true
>  #endif
>  
> +static struct page *push_anon(struct pipe_inode_info *pipe, unsigned size)
> +{
> +	struct page *page = alloc_page(GFP_USER);
> +	if (page) {
> +		struct pipe_buffer *buf = pipe_buf(pipe, pipe->head++);
> +		*buf = (struct pipe_buffer) {
> +			.ops = &default_pipe_buf_ops,
> +			.page = page,
> +			.offset = 0,
> +			.len = size
> +		};
> +	}
> +	return page;
> +}
> +
> +static void push_page(struct pipe_inode_info *pipe, struct page *page,
> +			unsigned int offset, unsigned int size)
> +{
> +	struct pipe_buffer *buf = pipe_buf(pipe, pipe->head++);
> +	*buf = (struct pipe_buffer) {
> +		.ops = &page_cache_pipe_buf_ops,
> +		.page = page,
> +		.offset = offset,
> +		.len = size
> +	};
> +	get_page(page);
> +}
> +
>  static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
>  			 struct iov_iter *i)
>  {
>  	struct pipe_inode_info *pipe = i->pipe;
> -	struct pipe_buffer *buf;
> -	unsigned int p_tail = pipe->tail;
> -	unsigned int p_mask = pipe->ring_size - 1;
> -	unsigned int i_head = i->head;
> -	size_t off;
> +	unsigned int head = pipe->head;
>  
>  	if (unlikely(bytes > i->count))
>  		bytes = i->count;
> @@ -250,32 +274,21 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
>  	if (!sanity(i))
>  		return 0;
>  
> -	off = i->iov_offset;
> -	buf = &pipe->bufs[i_head & p_mask];
> -	if (off) {
> -		if (offset == off && buf->page == page) {
> -			/* merge with the last one */
> +	if (offset && i->iov_offset == offset) { // could we merge it?
> +		struct pipe_buffer *buf = pipe_buf(pipe, head - 1);
> +		if (buf->page == page) {
>  			buf->len += bytes;
>  			i->iov_offset += bytes;
> -			goto out;
> +			i->count -= bytes;
> +			return bytes;
>  		}
> -		i_head++;
> -		buf = &pipe->bufs[i_head & p_mask];
>  	}
> -	if (pipe_full(i_head, p_tail, pipe->max_usage))
> +	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
>  		return 0;
>  
> -	buf->ops = &page_cache_pipe_buf_ops;
> -	buf->flags = 0;
> -	get_page(page);
> -	buf->page = page;
> -	buf->offset = offset;
> -	buf->len = bytes;
> -
> -	pipe->head = i_head + 1;
> +	push_page(pipe, page, offset, bytes);
>  	i->iov_offset = offset + bytes;
> -	i->head = i_head;
> -out:
> +	i->head = head;
>  	i->count -= bytes;
>  	return bytes;
>  }
> @@ -407,8 +420,6 @@ static size_t push_pipe(struct iov_iter *i, size_t size,
>  			int *iter_headp, size_t *offp)
>  {
>  	struct pipe_inode_info *pipe = i->pipe;
> -	unsigned int p_tail = pipe->tail;
> -	unsigned int p_mask = pipe->ring_size - 1;
>  	unsigned int iter_head;
>  	size_t off;
>  	ssize_t left;
> @@ -423,30 +434,23 @@ static size_t push_pipe(struct iov_iter *i, size_t size,
>  	*iter_headp = iter_head;
>  	*offp = off;
>  	if (off) {
> +		struct pipe_buffer *buf = pipe_buf(pipe, iter_head);
> +
>  		left -= PAGE_SIZE - off;
>  		if (left <= 0) {
> -			pipe->bufs[iter_head & p_mask].len += size;
> +			buf->len += size;
>  			return size;
>  		}
> -		pipe->bufs[iter_head & p_mask].len = PAGE_SIZE;
> -		iter_head++;
> +		buf->len = PAGE_SIZE;
>  	}
> -	while (!pipe_full(iter_head, p_tail, pipe->max_usage)) {
> -		struct pipe_buffer *buf = &pipe->bufs[iter_head & p_mask];
> -		struct page *page = alloc_page(GFP_USER);
> +	while (!pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
> +		struct page *page = push_anon(pipe,
> +					      min_t(ssize_t, left, PAGE_SIZE));
>  		if (!page)
>  			break;
>  
> -		buf->ops = &default_pipe_buf_ops;
> -		buf->flags = 0;
> -		buf->page = page;
> -		buf->offset = 0;
> -		buf->len = min_t(ssize_t, left, PAGE_SIZE);
> -		left -= buf->len;
> -		iter_head++;
> -		pipe->head = iter_head;
> -
> -		if (left == 0)
> +		left -= PAGE_SIZE;
> +		if (left <= 0)
>  			return size;
>  	}
>  	return size - left;

Not sure I follow all of the buffer handling shenanigans in here, but I
think it looks sane.

Acked-by: Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 23/44] iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT
  2022-06-22  4:15   ` [PATCH 23/44] iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT Al Viro
@ 2022-06-28 11:41     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 11:41 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> All callers can and should handle iov_iter_get_pages() returning
> fewer pages than requested.  All in-kernel ones do.  And it makes
> the arithmetical overflow analysis much simpler...
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  lib/iov_iter.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 30f4158382d6..c3fb7853dbe8 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1367,6 +1367,8 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
>  		maxsize = i->count;
>  	if (!maxsize)
>  		return 0;
> +	if (maxsize > MAX_RW_COUNT)
> +		maxsize = MAX_RW_COUNT;
>  
>  	if (likely(user_backed_iter(i))) {
>  		unsigned int gup_flags = 0;
> @@ -1485,6 +1487,8 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
>  		maxsize = i->count;
>  	if (!maxsize)
>  		return 0;
> +	if (maxsize > MAX_RW_COUNT)
> +		maxsize = MAX_RW_COUNT;
>  
>  	if (likely(user_backed_iter(i))) {
>  		unsigned int gup_flags = 0;


Reviewed-by: Jeff Layton <jlayton@kernel.org>
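[Editorial aside, for readers outside the kernel tree: the clamp added here is a plain saturating min against MAX_RW_COUNT (INT_MAX rounded down to a page boundary, i.e. 0x7ffff000 on 4K-page builds). A minimal userspace sketch of the same two clamps, with a stand-in constant and invented function name:]

```c
#include <stddef.h>

/* Stand-in for the kernel's MAX_RW_COUNT (INT_MAX & PAGE_MASK, i.e.
 * 0x7ffff000 with 4K pages); the exact value is irrelevant here. */
#define MAX_RW_COUNT_SKETCH ((size_t)0x7ffff000)

/* Mirror of the two clamps at the top of iov_iter_get_pages() after
 * this patch: never look past i->count, and never hand more than
 * MAX_RW_COUNT to the per-flavour helpers. */
static size_t clamp_maxsize(size_t maxsize, size_t iter_count)
{
	if (maxsize > iter_count)
		maxsize = iter_count;
	if (maxsize > MAX_RW_COUNT_SKETCH)
		maxsize = MAX_RW_COUNT_SKETCH;
	return maxsize;
}
```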

* Re: [PATCH 24/44] iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper
  2022-06-22  4:15   ` [PATCH 24/44] iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper Al Viro
@ 2022-06-28 11:45     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 11:45 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> Incidentally, ITER_XARRAY did *not* free the sucker in case when
> iter_xarray_populate_pages() returned 0...
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  lib/iov_iter.c | 38 ++++++++++++++++++++++----------------
>  1 file changed, 22 insertions(+), 16 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index c3fb7853dbe8..9c25661684c6 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1425,15 +1425,10 @@ static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
>  		maxsize = n;
>  	else
>  		npages = DIV_ROUND_UP(maxsize + off, PAGE_SIZE);
> -	p = get_pages_array(npages);
> +	*pages = p = get_pages_array(npages);
>  	if (!p)
>  		return -ENOMEM;
> -	n = __pipe_get_pages(i, maxsize, p, off);
> -	if (n > 0)
> -		*pages = p;
> -	else
> -		kvfree(p);
> -	return n;
> +	return __pipe_get_pages(i, maxsize, p, off);
>  }
>  
>  static ssize_t iter_xarray_get_pages_alloc(struct iov_iter *i,
> @@ -1463,10 +1458,9 @@ static ssize_t iter_xarray_get_pages_alloc(struct iov_iter *i,
>  			count++;
>  	}
>  
> -	p = get_pages_array(count);
> +	*pages = p = get_pages_array(count);
>  	if (!p)
>  		return -ENOMEM;
> -	*pages = p;
>  
>  	nr = iter_xarray_populate_pages(p, i->xarray, index, count);
>  	if (nr == 0)
> @@ -1475,7 +1469,7 @@ static ssize_t iter_xarray_get_pages_alloc(struct iov_iter *i,
>  	return min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
>  }
>  
> -ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
> +static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  		   struct page ***pages, size_t maxsize,
>  		   size_t *start)
>  {
> @@ -1501,16 +1495,12 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
>  
>  		addr = first_iovec_segment(i, &len, start, maxsize, ~0U);
>  		n = DIV_ROUND_UP(len, PAGE_SIZE);
> -		p = get_pages_array(n);
> +		*pages = p = get_pages_array(n);
>  		if (!p)
>  			return -ENOMEM;
>  		res = get_user_pages_fast(addr, n, gup_flags, p);
> -		if (unlikely(res <= 0)) {
> -			kvfree(p);
> -			*pages = NULL;
> +		if (unlikely(res <= 0))
>  			return res;
> -		}
> -		*pages = p;
>  		return (res == n ? len : res * PAGE_SIZE) - *start;
>  	}
>  	if (iov_iter_is_bvec(i)) {
> @@ -1531,6 +1521,22 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
>  		return iter_xarray_get_pages_alloc(i, pages, maxsize, start);
>  	return -EFAULT;
>  }
> +
> +ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
> +		   struct page ***pages, size_t maxsize,
> +		   size_t *start)
> +{
> +	ssize_t len;
> +
> +	*pages = NULL;
> +
> +	len = __iov_iter_get_pages_alloc(i, pages, maxsize, start);
> +	if (len <= 0) {
> +		kvfree(*pages);
> +		*pages = NULL;
> +	}
> +	return len;
> +}
>  EXPORT_SYMBOL(iov_iter_get_pages_alloc);
>  
>  size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,


Reviewed-by: Jeff Layton <jlayton@kernel.org>
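[Editorial aside: the shape of this fix generalizes beyond iov_iter. When several helpers all allocate through an output parameter, centralizing the error-path free in one wrapper removes every per-helper kvfree() call (and fixes leaks like the ITER_XARRAY one noted above). A hedged userspace sketch of that pattern, with invented names and ints standing in for pages:]

```c
#include <stdlib.h>

/* Hypothetical helper standing in for __iov_iter_get_pages_alloc():
 * it may allocate through *pages and still fail afterwards. */
static long helper(int **pages, int fail_after_alloc)
{
	*pages = malloc(4 * sizeof(**pages));
	if (!*pages)
		return -12;		/* -ENOMEM */
	if (fail_after_alloc)
		return -14;		/* e.g. -EFAULT from a deeper call */
	(*pages)[0] = 42;		/* "populate" the array */
	return 1;			/* entries filled */
}

/* The wrapper owns the error path: on any failure (or empty result)
 * the array is freed exactly once, here.  free(NULL) is a no-op, so
 * helpers that fail before allocating need no special-casing. */
static long get_pages_wrapper(int **pages, int fail_after_alloc)
{
	long len;

	*pages = NULL;
	len = helper(pages, fail_after_alloc);
	if (len <= 0) {
		free(*pages);
		*pages = NULL;
	}
	return len;
}
```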

* Re: [PATCH 25/44] iov_iter_get_pages(): sanity-check arguments
  2022-06-22  4:15   ` [PATCH 25/44] iov_iter_get_pages(): sanity-check arguments Al Viro
@ 2022-06-28 11:47     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 11:47 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> zero maxpages is bogus, but best treated as "just return 0";
> NULL pages, OTOH, should be treated as a hard bug.
> 
> get rid of now completely useless checks in xarray_get_pages{,_alloc}().
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  lib/iov_iter.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 9c25661684c6..5c985cf2858e 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1271,9 +1271,6 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
>  	size_t size = maxsize;
>  	loff_t pos;
>  
> -	if (!size || !maxpages)
> -		return 0;
> -
>  	pos = i->xarray_start + i->iov_offset;
>  	index = pos >> PAGE_SHIFT;
>  	offset = pos & ~PAGE_MASK;
> @@ -1365,10 +1362,11 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
>  
>  	if (maxsize > i->count)
>  		maxsize = i->count;
> -	if (!maxsize)
> +	if (!maxsize || !maxpages)
>  		return 0;
>  	if (maxsize > MAX_RW_COUNT)
>  		maxsize = MAX_RW_COUNT;
> +	BUG_ON(!pages);
>  
>  	if (likely(user_backed_iter(i))) {
>  		unsigned int gup_flags = 0;
> @@ -1441,9 +1439,6 @@ static ssize_t iter_xarray_get_pages_alloc(struct iov_iter *i,
>  	size_t size = maxsize;
>  	loff_t pos;
>  
> -	if (!size)
> -		return 0;
> -
>  	pos = i->xarray_start + i->iov_offset;
>  	index = pos >> PAGE_SHIFT;
>  	offset = pos & ~PAGE_MASK;

Reviewed-by: Jeff Layton <jlayton@kernel.org>

* Re: [PATCH 26/44] unify pipe_get_pages() and pipe_get_pages_alloc()
  2022-06-22  4:15   ` [PATCH 26/44] unify pipe_get_pages() and pipe_get_pages_alloc() Al Viro
@ 2022-06-28 11:49     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 11:49 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> 	The differences between those two are
> * pipe_get_pages() gets a non-NULL struct page ** value pointing to
> preallocated array + array size.
> * pipe_get_pages_alloc() gets an address of struct page ** variable that
> contains NULL, allocates the array and (on success) stores its address in
> that variable.
> 
> 	Not hard to combine - always pass struct page ***, have
> the previous pipe_get_pages_alloc() caller pass ~0U as cap for
> array size.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  lib/iov_iter.c | 49 +++++++++++++++++--------------------------------
>  1 file changed, 17 insertions(+), 32 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 5c985cf2858e..1c98f2f3a581 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1187,6 +1187,11 @@ unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
>  }
>  EXPORT_SYMBOL(iov_iter_gap_alignment);
>  
> +static struct page **get_pages_array(size_t n)
> +{
> +	return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
> +}
> +
>  static inline ssize_t __pipe_get_pages(struct iov_iter *i,
>  				size_t maxsize,
>  				struct page **pages,
> @@ -1220,10 +1225,11 @@ static inline ssize_t __pipe_get_pages(struct iov_iter *i,
>  }
>  
>  static ssize_t pipe_get_pages(struct iov_iter *i,
> -		   struct page **pages, size_t maxsize, unsigned maxpages,
> +		   struct page ***pages, size_t maxsize, unsigned maxpages,
>  		   size_t *start)
>  {
>  	unsigned int npages, off;
> +	struct page **p;
>  	size_t capacity;
>  
>  	if (!sanity(i))
> @@ -1231,8 +1237,15 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
>  
>  	*start = off = pipe_npages(i, &npages);
>  	capacity = min(npages, maxpages) * PAGE_SIZE - off;
> +	maxsize = min(maxsize, capacity);
> +	p = *pages;
> +	if (!p) {
> +		*pages = p = get_pages_array(DIV_ROUND_UP(maxsize + off, PAGE_SIZE));
> +		if (!p)
> +			return -ENOMEM;
> +	}
>  
> -	return __pipe_get_pages(i, min(maxsize, capacity), pages, off);
> +	return __pipe_get_pages(i, maxsize, p, off);
>  }
>  
>  static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa,
> @@ -1394,41 +1407,13 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
>  		return len - *start;
>  	}
>  	if (iov_iter_is_pipe(i))
> -		return pipe_get_pages(i, pages, maxsize, maxpages, start);
> +		return pipe_get_pages(i, &pages, maxsize, maxpages, start);
>  	if (iov_iter_is_xarray(i))
>  		return iter_xarray_get_pages(i, pages, maxsize, maxpages, start);
>  	return -EFAULT;
>  }
>  EXPORT_SYMBOL(iov_iter_get_pages);
>  
> -static struct page **get_pages_array(size_t n)
> -{
> -	return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
> -}
> -
> -static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
> -		   struct page ***pages, size_t maxsize,
> -		   size_t *start)
> -{
> -	struct page **p;
> -	unsigned int npages, off;
> -	ssize_t n;
> -
> -	if (!sanity(i))
> -		return -EFAULT;
> -
> -	*start = off = pipe_npages(i, &npages);
> -	n = npages * PAGE_SIZE - off;
> -	if (maxsize > n)
> -		maxsize = n;
> -	else
> -		npages = DIV_ROUND_UP(maxsize + off, PAGE_SIZE);
> -	*pages = p = get_pages_array(npages);
> -	if (!p)
> -		return -ENOMEM;
> -	return __pipe_get_pages(i, maxsize, p, off);
> -}
> -
>  static ssize_t iter_xarray_get_pages_alloc(struct iov_iter *i,
>  					   struct page ***pages, size_t maxsize,
>  					   size_t *_start_offset)
> @@ -1511,7 +1496,7 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  		return len - *start;
>  	}
>  	if (iov_iter_is_pipe(i))
> -		return pipe_get_pages_alloc(i, pages, maxsize, start);
> +		return pipe_get_pages(i, pages, maxsize, ~0U, start);
>  	if (iov_iter_is_xarray(i))
>  		return iter_xarray_get_pages_alloc(i, pages, maxsize, start);
>  	return -EFAULT;

Reviewed-by: Jeff Layton <jlayton@kernel.org>
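[Editorial aside: the calling convention this patch converges on - one entry point taking struct page ***, where a NULL *pages means "allocate the array for me" and a non-NULL one means "use mine" - is easy to state in isolation. A simplified userspace sketch (ints standing in for struct page pointers, names invented):]

```c
#include <stdlib.h>

/* Simplified model of the unified convention: the helper takes a
 * pointer to the array pointer.  NULL *pages => helper allocates;
 * otherwise the caller-supplied array is used as-is. */
static long fill_pages(int **pages, unsigned int npages)
{
	int *p = *pages;
	unsigned int k;

	if (!p) {
		*pages = p = malloc(npages * sizeof(*p));
		if (!p)
			return -12;	/* -ENOMEM */
	}
	for (k = 0; k < npages; k++)
		p[k] = (int)k;		/* stand-in for grabbing a page ref */
	return npages;
}
```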

* Re: [PATCH 27/44] unify xarray_get_pages() and xarray_get_pages_alloc()
  2022-06-22  4:15   ` [PATCH 27/44] unify xarray_get_pages() and xarray_get_pages_alloc() Al Viro
@ 2022-06-28 11:50     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 11:50 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> same as for pipes
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  lib/iov_iter.c | 49 ++++++++++---------------------------------------
>  1 file changed, 10 insertions(+), 39 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 1c98f2f3a581..07dacb274ba5 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1276,7 +1276,7 @@ static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa
>  }
>  
>  static ssize_t iter_xarray_get_pages(struct iov_iter *i,
> -				     struct page **pages, size_t maxsize,
> +				     struct page ***pages, size_t maxsize,
>  				     unsigned maxpages, size_t *_start_offset)
>  {
>  	unsigned nr, offset;
> @@ -1301,7 +1301,13 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
>  	if (count > maxpages)
>  		count = maxpages;
>  
> -	nr = iter_xarray_populate_pages(pages, i->xarray, index, count);
> +	if (!*pages) {
> +		*pages = get_pages_array(count);
> +		if (!*pages)
> +			return -ENOMEM;
> +	}
> +
> +	nr = iter_xarray_populate_pages(*pages, i->xarray, index, count);
>  	if (nr == 0)
>  		return 0;
>  
> @@ -1409,46 +1415,11 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
>  	if (iov_iter_is_pipe(i))
>  		return pipe_get_pages(i, &pages, maxsize, maxpages, start);
>  	if (iov_iter_is_xarray(i))
> -		return iter_xarray_get_pages(i, pages, maxsize, maxpages, start);
> +		return iter_xarray_get_pages(i, &pages, maxsize, maxpages, start);
>  	return -EFAULT;
>  }
>  EXPORT_SYMBOL(iov_iter_get_pages);
>  
> -static ssize_t iter_xarray_get_pages_alloc(struct iov_iter *i,
> -					   struct page ***pages, size_t maxsize,
> -					   size_t *_start_offset)
> -{
> -	struct page **p;
> -	unsigned nr, offset;
> -	pgoff_t index, count;
> -	size_t size = maxsize;
> -	loff_t pos;
> -
> -	pos = i->xarray_start + i->iov_offset;
> -	index = pos >> PAGE_SHIFT;
> -	offset = pos & ~PAGE_MASK;
> -	*_start_offset = offset;
> -
> -	count = 1;
> -	if (size > PAGE_SIZE - offset) {
> -		size -= PAGE_SIZE - offset;
> -		count += size >> PAGE_SHIFT;
> -		size &= ~PAGE_MASK;
> -		if (size)
> -			count++;
> -	}
> -
> -	*pages = p = get_pages_array(count);
> -	if (!p)
> -		return -ENOMEM;
> -
> -	nr = iter_xarray_populate_pages(p, i->xarray, index, count);
> -	if (nr == 0)
> -		return 0;
> -
> -	return min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
> -}
> -
>  static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  		   struct page ***pages, size_t maxsize,
>  		   size_t *start)
> @@ -1498,7 +1469,7 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  	if (iov_iter_is_pipe(i))
>  		return pipe_get_pages(i, pages, maxsize, ~0U, start);
>  	if (iov_iter_is_xarray(i))
> -		return iter_xarray_get_pages_alloc(i, pages, maxsize, start);
> +		return iter_xarray_get_pages(i, pages, maxsize, ~0U, start);
>  	return -EFAULT;
>  }
>  

Reviewed-by: Jeff Layton <jlayton@kernel.org>

* Re: [PATCH 28/44] unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts
  2022-06-22  4:15   ` [PATCH 28/44] unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts Al Viro
@ 2022-06-28 11:54     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 11:54 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> same as for pipes and xarrays; after that iov_iter_get_pages() becomes
> a wrapper for __iov_iter_get_pages_alloc().
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  lib/iov_iter.c | 86 ++++++++++++++++----------------------------------
>  1 file changed, 28 insertions(+), 58 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 07dacb274ba5..811fa09515d8 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1372,20 +1372,19 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
>  	return page;
>  }
>  
> -ssize_t iov_iter_get_pages(struct iov_iter *i,
> -		   struct page **pages, size_t maxsize, unsigned maxpages,
> -		   size_t *start)
> +static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
> +		   struct page ***pages, size_t maxsize,
> +		   unsigned int maxpages, size_t *start)
>  {
>  	size_t len;
>  	int n, res;
>  
>  	if (maxsize > i->count)
>  		maxsize = i->count;
> -	if (!maxsize || !maxpages)
> +	if (!maxsize)
>  		return 0; 
>  	if (maxsize > MAX_RW_COUNT)
>  		maxsize = MAX_RW_COUNT;
> -	BUG_ON(!pages);
>  
>  	if (likely(user_backed_iter(i))) {
>  		unsigned int gup_flags = 0;
> @@ -1398,80 +1397,51 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
>  
>  		addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
>  		n = DIV_ROUND_UP(len, PAGE_SIZE);
> -		res = get_user_pages_fast(addr, n, gup_flags, pages);
> +		if (!*pages) {
> +			*pages = get_pages_array(n);
> +			if (!*pages)
> +				return -ENOMEM;
> +		}
> +		res = get_user_pages_fast(addr, n, gup_flags, *pages);
>  		if (unlikely(res <= 0))
>  			return res;
>  		return (res == n ? len : res * PAGE_SIZE) - *start;
>  	}
>  	if (iov_iter_is_bvec(i)) {
> +		struct page **p;
>  		struct page *page;
>  
>  		page = first_bvec_segment(i, &len, start, maxsize, maxpages);
>  		n = DIV_ROUND_UP(len, PAGE_SIZE);
> +		p = *pages;
> +		if (!p) {
> +			*pages = p = get_pages_array(n);
> +			if (!p)
> +				return -ENOMEM;
> +		}
>  		while (n--)
> -			get_page(*pages++ = page++);
> +			get_page(*p++ = page++);
>  		return len - *start;
>  	}
>  	if (iov_iter_is_pipe(i))
> -		return pipe_get_pages(i, &pages, maxsize, maxpages, start);
> +		return pipe_get_pages(i, pages, maxsize, maxpages, start);
>  	if (iov_iter_is_xarray(i))
> -		return iter_xarray_get_pages(i, &pages, maxsize, maxpages, start);
> +		return iter_xarray_get_pages(i, pages, maxsize, maxpages,
> +					     start);
>  	return -EFAULT;
>  }
> -EXPORT_SYMBOL(iov_iter_get_pages);
>  
> -static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
> -		   struct page ***pages, size_t maxsize,
> +ssize_t iov_iter_get_pages(struct iov_iter *i,
> +		   struct page **pages, size_t maxsize, unsigned maxpages,
>  		   size_t *start)
>  {
> -	struct page **p;
> -	size_t len;
> -	int n, res;
> -
> -	if (maxsize > i->count)
> -		maxsize = i->count;
> -	if (!maxsize)
> +	if (!maxpages)
>  		return 0;
> -	if (maxsize > MAX_RW_COUNT)
> -		maxsize = MAX_RW_COUNT;
> -
> -	if (likely(user_backed_iter(i))) {
> -		unsigned int gup_flags = 0;
> -		unsigned long addr;
> -
> -		if (iov_iter_rw(i) != WRITE)
> -			gup_flags |= FOLL_WRITE;
> -		if (i->nofault)
> -			gup_flags |= FOLL_NOFAULT;
> -
> -		addr = first_iovec_segment(i, &len, start, maxsize, ~0U);
> -		n = DIV_ROUND_UP(len, PAGE_SIZE);
> -		*pages = p = get_pages_array(n);
> -		if (!p)
> -			return -ENOMEM;
> -		res = get_user_pages_fast(addr, n, gup_flags, p);
> -		if (unlikely(res <= 0))
> -			return res;
> -		return (res == n ? len : res * PAGE_SIZE) - *start;
> -	}
> -	if (iov_iter_is_bvec(i)) {
> -		struct page *page;
> +	BUG_ON(!pages);
>  
> -		page = first_bvec_segment(i, &len, start, maxsize, ~0U);
> -		n = DIV_ROUND_UP(len, PAGE_SIZE);
> -		*pages = p = get_pages_array(n);
> -		if (!p)
> -			return -ENOMEM;
> -		while (n--)
> -			get_page(*p++ = page++);
> -		return len - *start;
> -	}
> -	if (iov_iter_is_pipe(i))
> -		return pipe_get_pages(i, pages, maxsize, ~0U, start);
> -	if (iov_iter_is_xarray(i))
> -		return iter_xarray_get_pages(i, pages, maxsize, ~0U, start);
> -	return -EFAULT;
> +	return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start);
>  }
> +EXPORT_SYMBOL(iov_iter_get_pages);
>  
>  ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
>  		   struct page ***pages, size_t maxsize,
> @@ -1481,7 +1451,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
>  
>  	*pages = NULL;
>  
> -	len = __iov_iter_get_pages_alloc(i, pages, maxsize, start);
> +	len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start);
>  	if (len <= 0) {
>  		kvfree(*pages);
>  		*pages = NULL;

Reviewed-by: Jeff Layton <jlayton@kernel.org>

* Re: [PATCH 29/44] ITER_XARRAY: don't open-code DIV_ROUND_UP()
  2022-06-22  4:15   ` [PATCH 29/44] ITER_XARRAY: don't open-code DIV_ROUND_UP() Al Viro
@ 2022-06-28 11:54     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 11:54 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  lib/iov_iter.c | 10 +---------
>  1 file changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 811fa09515d8..92a566f839f9 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1289,15 +1289,7 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
>  	offset = pos & ~PAGE_MASK;
>  	*_start_offset = offset;
>  
> -	count = 1;
> -	if (size > PAGE_SIZE - offset) {
> -		size -= PAGE_SIZE - offset;
> -		count += size >> PAGE_SHIFT;
> -		size &= ~PAGE_MASK;
> -		if (size)
> -			count++;
> -	}
> -
> +	count = DIV_ROUND_UP(size + offset, PAGE_SIZE);
>  	if (count > maxpages)
>  		count = maxpages;
>  

Reviewed-by: Jeff Layton <jlayton@kernel.org>
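[Editorial aside: for anyone auditing the hunk, the deleted block and DIV_ROUND_UP(size + offset, PAGE_SIZE) are the same function of (size, offset) for any size >= 1 (size is known non-zero here thanks to the earlier !maxsize check). A quick userspace check, assuming 4K pages:]

```c
#include <stddef.h>

#define PAGE_SHIFT_ 12
#define PAGE_SIZE_  (1UL << PAGE_SHIFT_)
#define DIV_ROUND_UP_(n, d) (((n) + (d) - 1) / (d))

/* The computation the patch deletes, transcribed for size >= 1.
 * (PAGE_MASK is ~(PAGE_SIZE - 1), so the kernel's `size &= ~PAGE_MASK`
 * keeps the sub-page remainder, as written below.) */
static unsigned long count_open_coded(size_t size, unsigned int offset)
{
	unsigned long count = 1;

	if (size > PAGE_SIZE_ - offset) {
		size -= PAGE_SIZE_ - offset;
		count += size >> PAGE_SHIFT_;
		size &= PAGE_SIZE_ - 1;
		if (size)
			count++;
	}
	return count;
}

/* The one-liner it becomes. */
static unsigned long count_new(size_t size, unsigned int offset)
{
	return DIV_ROUND_UP_(size + offset, PAGE_SIZE_);
}
```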

* Re: [PATCH 30/44] iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment()
  2022-06-22  4:15   ` [PATCH 30/44] iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment() Al Viro
@ 2022-06-28 11:56     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 11:56 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  lib/iov_iter.c | 23 +++++++++++------------
>  1 file changed, 11 insertions(+), 12 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 92a566f839f9..9ef671b101dc 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1308,12 +1308,9 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
>  
>  static unsigned long found_ubuf_segment(unsigned long addr,
>  					size_t len,
> -					size_t *size, size_t *start,
> -					unsigned maxpages)
> +					size_t *size, size_t *start)
>  {
>  	len += (*start = addr % PAGE_SIZE);
> -	if (len > maxpages * PAGE_SIZE)
> -		len = maxpages * PAGE_SIZE;
>  	*size = len;
>  	return addr & PAGE_MASK;
>  }
> @@ -1321,14 +1318,14 @@ static unsigned long found_ubuf_segment(unsigned long addr,
>  /* must be done on non-empty ITER_UBUF or ITER_IOVEC one */
>  static unsigned long first_iovec_segment(const struct iov_iter *i,
>  					 size_t *size, size_t *start,
> -					 size_t maxsize, unsigned maxpages)
> +					 size_t maxsize)
>  {
>  	size_t skip;
>  	long k;
>  
>  	if (iter_is_ubuf(i)) {
>  		unsigned long addr = (unsigned long)i->ubuf + i->iov_offset;
> -		return found_ubuf_segment(addr, maxsize, size, start, maxpages);
> +		return found_ubuf_segment(addr, maxsize, size, start);
>  	}
>  
>  	for (k = 0, skip = i->iov_offset; k < i->nr_segs; k++, skip = 0) {
> @@ -1339,7 +1336,7 @@ static unsigned long first_iovec_segment(const struct iov_iter *i,
>  			continue;
>  		if (len > maxsize)
>  			len = maxsize;
> -		return found_ubuf_segment(addr, len, size, start, maxpages);
> +		return found_ubuf_segment(addr, len, size, start);
>  	}
>  	BUG(); // if it had been empty, we wouldn't get called
>  }
> @@ -1347,7 +1344,7 @@ static unsigned long first_iovec_segment(const struct iov_iter *i,
>  /* must be done on non-empty ITER_BVEC one */
>  static struct page *first_bvec_segment(const struct iov_iter *i,
>  				       size_t *size, size_t *start,
> -				       size_t maxsize, unsigned maxpages)
> +				       size_t maxsize)
>  {
>  	struct page *page;
>  	size_t skip = i->iov_offset, len;
> @@ -1358,8 +1355,6 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
>  	skip += i->bvec->bv_offset;
>  	page = i->bvec->bv_page + skip / PAGE_SIZE;
>  	len += (*start = skip % PAGE_SIZE);
> -	if (len > maxpages * PAGE_SIZE)
> -		len = maxpages * PAGE_SIZE;
>  	*size = len;
>  	return page;
>  }
> @@ -1387,7 +1382,9 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  		if (i->nofault)
>  			gup_flags |= FOLL_NOFAULT;
>  
> -		addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
> +		addr = first_iovec_segment(i, &len, start, maxsize);
> +		if (len > maxpages * PAGE_SIZE)
> +			len = maxpages * PAGE_SIZE;
>  		n = DIV_ROUND_UP(len, PAGE_SIZE);
>  		if (!*pages) {
>  			*pages = get_pages_array(n);
> @@ -1403,7 +1400,9 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  		struct page **p;
>  		struct page *page;
>  
> -		page = first_bvec_segment(i, &len, start, maxsize, maxpages);
> +		page = first_bvec_segment(i, &len, start, maxsize);
> +		if (len > maxpages * PAGE_SIZE)
> +			len = maxpages * PAGE_SIZE;
>  		n = DIV_ROUND_UP(len, PAGE_SIZE);
>  		p = *pages;
>  		if (!p) {

Reviewed-by: Jeff Layton <jlayton@kernel.org>

* Re: [PATCH 31/44] iov_iter: first_{iovec,bvec}_segment() - simplify a bit
  2022-06-22  4:15   ` [PATCH 31/44] iov_iter: first_{iovec,bvec}_segment() - simplify a bit Al Viro
@ 2022-06-28 11:58     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 11:58 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> We return length + offset in page via *size.  Don't bother - the caller
> can do that arithmetic just as well; just report the length to it.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  lib/iov_iter.c | 22 +++++++++++-----------
>  1 file changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 9ef671b101dc..0bed684d91d0 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1310,7 +1310,7 @@ static unsigned long found_ubuf_segment(unsigned long addr,
>  					size_t len,
>  					size_t *size, size_t *start)
>  {
> -	len += (*start = addr % PAGE_SIZE);
> +	*start = addr % PAGE_SIZE;
>  	*size = len;
>  	return addr & PAGE_MASK;
>  }
> @@ -1354,7 +1354,7 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
>  		len = maxsize;
>  	skip += i->bvec->bv_offset;
>  	page = i->bvec->bv_page + skip / PAGE_SIZE;
> -	len += (*start = skip % PAGE_SIZE);
> +	*start = skip % PAGE_SIZE;
>  	*size = len;
>  	return page;
>  }
> @@ -1383,9 +1383,9 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  			gup_flags |= FOLL_NOFAULT;
>  
>  		addr = first_iovec_segment(i, &len, start, maxsize);
> -		if (len > maxpages * PAGE_SIZE)
> -			len = maxpages * PAGE_SIZE;
> -		n = DIV_ROUND_UP(len, PAGE_SIZE);
> +		n = DIV_ROUND_UP(len + *start, PAGE_SIZE);
> +		if (n > maxpages)
> +			n = maxpages;
>  		if (!*pages) {
>  			*pages = get_pages_array(n);
>  			if (!*pages)
> @@ -1394,25 +1394,25 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  		res = get_user_pages_fast(addr, n, gup_flags, *pages);
>  		if (unlikely(res <= 0))
>  			return res;
> -		return (res == n ? len : res * PAGE_SIZE) - *start;
> +		return min_t(size_t, len, res * PAGE_SIZE - *start);
>  	}
>  	if (iov_iter_is_bvec(i)) {
>  		struct page **p;
>  		struct page *page;
>  
>  		page = first_bvec_segment(i, &len, start, maxsize);
> -		if (len > maxpages * PAGE_SIZE)
> -			len = maxpages * PAGE_SIZE;
> -		n = DIV_ROUND_UP(len, PAGE_SIZE);
> +		n = DIV_ROUND_UP(len + *start, PAGE_SIZE);
> +		if (n > maxpages)
> +			n = maxpages;
>  		p = *pages;
>  		if (!p) {
>  			*pages = p = get_pages_array(n);
>  			if (!p)
>  				return -ENOMEM;
>  		}
> -		while (n--)
> +		for (int k = 0; k < n; k++)
>  			get_page(*p++ = page++);
> -		return len - *start;
> +		return min_t(size_t, len, n * PAGE_SIZE - *start);
>  	}
>  	if (iov_iter_is_pipe(i))
>  		return pipe_get_pages(i, pages, maxsize, maxpages, start);

Reviewed-by: Jeff Layton <jlayton@kernel.org>
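[Editorial aside: the subtle part of this patch is the final hunk. With len no longer pre-biased by *start, min_t(size_t, len, res * PAGE_SIZE - *start) returns exactly what (res == n ? len : res * PAGE_SIZE) - *start used to, for every res in 1..n. A small userspace check of that equivalence (4K pages, invented helper names):]

```c
#include <stddef.h>

#define PAGE_SIZE_ 4096UL
#define DIV_ROUND_UP_(n, d) (((n) + (d) - 1) / (d))

/* Old return expression: the length it saw already had *start folded
 * in, and n pages always covered it, hence the res == n special case. */
static size_t ret_old(size_t len, size_t start, int res)
{
	size_t len_biased = len + start;
	int n = (int)DIV_ROUND_UP_(len_biased, PAGE_SIZE_);

	return (res == n ? len_biased : res * PAGE_SIZE_) - start;
}

/* New return expression: len is the raw segment length, clamped to
 * however many bytes res pages actually cover past the offset. */
static size_t ret_new(size_t len, size_t start, int res)
{
	size_t cap = res * PAGE_SIZE_ - start;

	return len < cap ? len : cap;
}
```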

* Re: [PATCH 32/44] iov_iter: massage calling conventions for first_{iovec,bvec}_segment()
  2022-06-22  4:15   ` [PATCH 32/44] iov_iter: massage calling conventions for first_{iovec,bvec}_segment() Al Viro
@ 2022-06-28 12:06     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 12:06 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> Pass maxsize by reference, return length via the same.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  lib/iov_iter.c | 37 +++++++++++++++----------------------
>  1 file changed, 15 insertions(+), 22 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 0bed684d91d0..fca66ecce7a0 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1306,26 +1306,22 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
>  	return min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
>  }
>  
> -static unsigned long found_ubuf_segment(unsigned long addr,
> -					size_t len,
> -					size_t *size, size_t *start)
> +static unsigned long found_ubuf_segment(unsigned long addr, size_t *start)
>  {
>  	*start = addr % PAGE_SIZE;
> -	*size = len;
>  	return addr & PAGE_MASK;
>  }
>  
>  /* must be done on non-empty ITER_UBUF or ITER_IOVEC one */
>  static unsigned long first_iovec_segment(const struct iov_iter *i,
> -					 size_t *size, size_t *start,
> -					 size_t maxsize)
> +					 size_t *size, size_t *start)
>  {
>  	size_t skip;
>  	long k;
>  
>  	if (iter_is_ubuf(i)) {
>  		unsigned long addr = (unsigned long)i->ubuf + i->iov_offset;
> -		return found_ubuf_segment(addr, maxsize, size, start);
> +		return found_ubuf_segment(addr, start);
>  	}
>  
>  	for (k = 0, skip = i->iov_offset; k < i->nr_segs; k++, skip = 0) {
> @@ -1334,28 +1330,26 @@ static unsigned long first_iovec_segment(const struct iov_iter *i,
>  
>  		if (unlikely(!len))
>  			continue;
> -		if (len > maxsize)
> -			len = maxsize;
> -		return found_ubuf_segment(addr, len, size, start);
> +		if (*size > len)
> +			*size = len;
> +		return found_ubuf_segment(addr, start);
>  	}
>  	BUG(); // if it had been empty, we wouldn't get called
>  }
>  
>  /* must be done on non-empty ITER_BVEC one */
>  static struct page *first_bvec_segment(const struct iov_iter *i,
> -				       size_t *size, size_t *start,
> -				       size_t maxsize)
> +				       size_t *size, size_t *start)
>  {
>  	struct page *page;
>  	size_t skip = i->iov_offset, len;
>  
>  	len = i->bvec->bv_len - skip;
> -	if (len > maxsize)
> -		len = maxsize;
> +	if (*size > len)
> +		*size = len;
>  	skip += i->bvec->bv_offset;
>  	page = i->bvec->bv_page + skip / PAGE_SIZE;
>  	*start = skip % PAGE_SIZE;
> -	*size = len;
>  	return page;
>  }
>  
> @@ -1363,7 +1357,6 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  		   struct page ***pages, size_t maxsize,
>  		   unsigned int maxpages, size_t *start)
>  {
> -	size_t len;
>  	int n, res;
>  
>  	if (maxsize > i->count)
> @@ -1382,8 +1375,8 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  		if (i->nofault)
>  			gup_flags |= FOLL_NOFAULT;
>  
> -		addr = first_iovec_segment(i, &len, start, maxsize);
> -		n = DIV_ROUND_UP(len + *start, PAGE_SIZE);
> +		addr = first_iovec_segment(i, &maxsize, start);
> +		n = DIV_ROUND_UP(maxsize + *start, PAGE_SIZE);
>  		if (n > maxpages)
>  			n = maxpages;
>  		if (!*pages) {
> @@ -1394,14 +1387,14 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  		res = get_user_pages_fast(addr, n, gup_flags, *pages);
>  		if (unlikely(res <= 0))
>  			return res;
> -		return min_t(size_t, len, res * PAGE_SIZE - *start);
> +		return min_t(size_t, maxsize, res * PAGE_SIZE - *start);
>  	}
>  	if (iov_iter_is_bvec(i)) {
>  		struct page **p;
>  		struct page *page;
>  
> -		page = first_bvec_segment(i, &len, start, maxsize);
> -		n = DIV_ROUND_UP(len + *start, PAGE_SIZE);
> +		page = first_bvec_segment(i, &maxsize, start);
> +		n = DIV_ROUND_UP(maxsize + *start, PAGE_SIZE);
>  		if (n > maxpages)
>  			n = maxpages;
>  		p = *pages;
> @@ -1412,7 +1405,7 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  		}
>  		for (int k = 0; k < n; k++)
>  			get_page(*p++ = page++);
> -		return min_t(size_t, len, n * PAGE_SIZE - *start);
> +		return min_t(size_t, maxsize, n * PAGE_SIZE - *start);
>  	}
>  	if (iov_iter_is_pipe(i))
>  		return pipe_get_pages(i, pages, maxsize, maxpages, start);

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 33/44] found_iovec_segment(): just return address
  2022-06-22  4:15   ` [PATCH 33/44] found_iovec_segment(): just return address Al Viro
@ 2022-06-28 12:09     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 12:09 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

The subject line should read "first_iovec_segment".

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> ... and calculate the offset in the caller
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  lib/iov_iter.c | 22 +++++++---------------
>  1 file changed, 7 insertions(+), 15 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index fca66ecce7a0..f455b8ee0d76 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1306,33 +1306,23 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
>  	return min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
>  }
>  
> -static unsigned long found_ubuf_segment(unsigned long addr, size_t *start)
> -{
> -	*start = addr % PAGE_SIZE;
> -	return addr & PAGE_MASK;
> -}
> -
>  /* must be done on non-empty ITER_UBUF or ITER_IOVEC one */
>  static unsigned long first_iovec_segment(const struct iov_iter *i,
> -					 size_t *size, size_t *start)
> +					 size_t *size)
>  {
>  	size_t skip;
>  	long k;
>  
> -	if (iter_is_ubuf(i)) {
> -		unsigned long addr = (unsigned long)i->ubuf + i->iov_offset;
> -		return found_ubuf_segment(addr, start);
> -	}
> +	if (iter_is_ubuf(i))
> +		return (unsigned long)i->ubuf + i->iov_offset;
>  
>  	for (k = 0, skip = i->iov_offset; k < i->nr_segs; k++, skip = 0) {
> -		unsigned long addr = (unsigned long)i->iov[k].iov_base + skip;
>  		size_t len = i->iov[k].iov_len - skip;
> -
>  		if (unlikely(!len))
>  			continue;
>  		if (*size > len)
>  			*size = len;
> -		return found_ubuf_segment(addr, start);
> +		return (unsigned long)i->iov[k].iov_base + skip;
>  	}
>  	BUG(); // if it had been empty, we wouldn't get called
>  }
> @@ -1375,7 +1365,9 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  		if (i->nofault)
>  			gup_flags |= FOLL_NOFAULT;
>  
> -		addr = first_iovec_segment(i, &maxsize, start);
> +		addr = first_iovec_segment(i, &maxsize);
> +		*start = addr % PAGE_SIZE;
> +		addr &= PAGE_MASK;
>  		n = DIV_ROUND_UP(maxsize + *start, PAGE_SIZE);
>  		if (n > maxpages)
>  			n = maxpages;

Reviewed-by: Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 34/44] fold __pipe_get_pages() into pipe_get_pages()
  2022-06-22  4:15   ` [PATCH 34/44] fold __pipe_get_pages() into pipe_get_pages() Al Viro
@ 2022-06-28 12:11     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 12:11 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> ... and don't mangle maxsize there - turn the loop into a counting
> one instead.  It's easier to see that we won't run out of array that
> way.  Note that the special treatment of the partial buffer in that
> thing is an artifact of the non-advancing semantics of
> iov_iter_get_pages() - if not for that, it would be append_pipe(),
> same as the body of the loop that follows it.  IOW, once we make
> iov_iter_get_pages() advancing, the whole thing will turn into
> 	calculate how many pages do we want
> 	allocate an array (if needed)
> 	call append_pipe() that many times.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  lib/iov_iter.c | 75 +++++++++++++++++++++++++-------------------------
>  1 file changed, 38 insertions(+), 37 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index f455b8ee0d76..9280f865fd6a 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1192,60 +1192,61 @@ static struct page **get_pages_array(size_t n)
>  	return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
>  }
>  
> -static inline ssize_t __pipe_get_pages(struct iov_iter *i,
> -				size_t maxsize,
> -				struct page **pages,
> -				size_t off)
> -{
> -	struct pipe_inode_info *pipe = i->pipe;
> -	ssize_t left = maxsize;
> -
> -	if (off) {
> -		struct pipe_buffer *buf = pipe_buf(pipe, pipe->head - 1);
> -
> -		get_page(*pages++ = buf->page);
> -		left -= PAGE_SIZE - off;
> -		if (left <= 0) {
> -			buf->len += maxsize;
> -			return maxsize;
> -		}
> -		buf->len = PAGE_SIZE;
> -	}
> -	while (!pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
> -		struct page *page = push_anon(pipe,
> -					      min_t(ssize_t, left, PAGE_SIZE));
> -		if (!page)
> -			break;
> -		get_page(*pages++ = page);
> -		left -= PAGE_SIZE;
> -		if (left <= 0)
> -			return maxsize;
> -	}
> -	return maxsize - left ? : -EFAULT;
> -}
> -
>  static ssize_t pipe_get_pages(struct iov_iter *i,
>  		   struct page ***pages, size_t maxsize, unsigned maxpages,
>  		   size_t *start)
>  {
> +	struct pipe_inode_info *pipe = i->pipe;
>  	unsigned int npages, off;
>  	struct page **p;
> -	size_t capacity;
> +	ssize_t left;
> +	int count;
>  
>  	if (!sanity(i))
>  		return -EFAULT;
>  
>  	*start = off = pipe_npages(i, &npages);
> -	capacity = min(npages, maxpages) * PAGE_SIZE - off;
> -	maxsize = min(maxsize, capacity);
> +	count = DIV_ROUND_UP(maxsize + off, PAGE_SIZE);
> +	if (count > npages)
> +		count = npages;
> +	if (count > maxpages)
> +		count = maxpages;
>  	p = *pages;
>  	if (!p) {
> -		*pages = p = get_pages_array(DIV_ROUND_UP(maxsize + off, PAGE_SIZE));
> +		*pages = p = get_pages_array(count);
>  		if (!p)
>  			return -ENOMEM;
>  	}
>  
> -	return __pipe_get_pages(i, maxsize, p, off);
> +	left = maxsize;
> +	npages = 0;
> +	if (off) {
> +		struct pipe_buffer *buf = pipe_buf(pipe, pipe->head - 1);
> +
> +		get_page(*p++ = buf->page);
> +		left -= PAGE_SIZE - off;
> +		if (left <= 0) {
> +			buf->len += maxsize;
> +			return maxsize;
> +		}
> +		buf->len = PAGE_SIZE;
> +		npages = 1;
> +	}
> +	for ( ; npages < count; npages++) {
> +		struct page *page;
> +		unsigned int size = min_t(ssize_t, left, PAGE_SIZE);
> +
> +		if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
> +			break;
> +		page = push_anon(pipe, size);
> +		if (!page)
> +			break;
> +		get_page(*p++ = page);
> +		left -= size;
> +	}
> +	if (!npages)
> +		return -EFAULT;
> +	return maxsize - left;
>  }
>  
>  static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa,

Reviewed-by: Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 35/44] iov_iter: saner helper for page array allocation
  2022-06-22  4:15   ` [PATCH 35/44] iov_iter: saner helper for page array allocation Al Viro
@ 2022-06-28 12:12     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 12:12 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> All call sites of get_pages_array() are essentially identical now.
> Replace them with a common helper...
> 
> Returns the number of slots available in the resulting array, or 0 on
> OOM; it's up to the caller to make sure it doesn't ask for a zero-entry
> array (i.e. neither maxpages nor size is allowed to be zero).
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  lib/iov_iter.c | 77 +++++++++++++++++++++-----------------------------
>  1 file changed, 32 insertions(+), 45 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 9280f865fd6a..1c744f0c0b2c 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1187,9 +1187,20 @@ unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
>  }
>  EXPORT_SYMBOL(iov_iter_gap_alignment);
>  
> -static struct page **get_pages_array(size_t n)
> +static int want_pages_array(struct page ***res, size_t size,
> +			    size_t start, unsigned int maxpages)
>  {
> -	return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
> +	unsigned int count = DIV_ROUND_UP(size + start, PAGE_SIZE);
> +
> +	if (count > maxpages)
> +		count = maxpages;
> +	WARN_ON(!count);	// caller should've prevented that
> +	if (!*res) {
> +		*res = kvmalloc_array(count, sizeof(struct page *), GFP_KERNEL);
> +		if (!*res)
> +			return 0;
> +	}
> +	return count;
>  }
>  
>  static ssize_t pipe_get_pages(struct iov_iter *i,
> @@ -1197,27 +1208,20 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
>  		   size_t *start)
>  {
>  	struct pipe_inode_info *pipe = i->pipe;
> -	unsigned int npages, off;
> +	unsigned int npages, off, count;
>  	struct page **p;
>  	ssize_t left;
> -	int count;
>  
>  	if (!sanity(i))
>  		return -EFAULT;
>  
>  	*start = off = pipe_npages(i, &npages);
> -	count = DIV_ROUND_UP(maxsize + off, PAGE_SIZE);
> -	if (count > npages)
> -		count = npages;
> -	if (count > maxpages)
> -		count = maxpages;
> +	if (!npages)
> +		return -EFAULT;
> +	count = want_pages_array(pages, maxsize, off, min(npages, maxpages));
> +	if (!count)
> +		return -ENOMEM;
>  	p = *pages;
> -	if (!p) {
> -		*pages = p = get_pages_array(count);
> -		if (!p)
> -			return -ENOMEM;
> -	}
> -
>  	left = maxsize;
>  	npages = 0;
>  	if (off) {
> @@ -1280,9 +1284,8 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
>  				     struct page ***pages, size_t maxsize,
>  				     unsigned maxpages, size_t *_start_offset)
>  {
> -	unsigned nr, offset;
> -	pgoff_t index, count;
> -	size_t size = maxsize;
> +	unsigned nr, offset, count;
> +	pgoff_t index;
>  	loff_t pos;
>  
>  	pos = i->xarray_start + i->iov_offset;
> @@ -1290,16 +1293,9 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
>  	offset = pos & ~PAGE_MASK;
>  	*_start_offset = offset;
>  
> -	count = DIV_ROUND_UP(size + offset, PAGE_SIZE);
> -	if (count > maxpages)
> -		count = maxpages;
> -
> -	if (!*pages) {
> -		*pages = get_pages_array(count);
> -		if (!*pages)
> -			return -ENOMEM;
> -	}
> -
> +	count = want_pages_array(pages, maxsize, offset, maxpages);
> +	if (!count)
> +		return -ENOMEM;
>  	nr = iter_xarray_populate_pages(*pages, i->xarray, index, count);
>  	if (nr == 0)
>  		return 0;
> @@ -1348,7 +1344,7 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  		   struct page ***pages, size_t maxsize,
>  		   unsigned int maxpages, size_t *start)
>  {
> -	int n, res;
> +	unsigned int n;
>  
>  	if (maxsize > i->count)
>  		maxsize = i->count;
> @@ -1360,6 +1356,7 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  	if (likely(user_backed_iter(i))) {
>  		unsigned int gup_flags = 0;
>  		unsigned long addr;
> +		int res;
>  
>  		if (iov_iter_rw(i) != WRITE)
>  			gup_flags |= FOLL_WRITE;
> @@ -1369,14 +1366,9 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  		addr = first_iovec_segment(i, &maxsize);
>  		*start = addr % PAGE_SIZE;
>  		addr &= PAGE_MASK;
> -		n = DIV_ROUND_UP(maxsize + *start, PAGE_SIZE);
> -		if (n > maxpages)
> -			n = maxpages;
> -		if (!*pages) {
> -			*pages = get_pages_array(n);
> -			if (!*pages)
> -				return -ENOMEM;
> -		}
> +		n = want_pages_array(pages, maxsize, *start, maxpages);
> +		if (!n)
> +			return -ENOMEM;
>  		res = get_user_pages_fast(addr, n, gup_flags, *pages);
>  		if (unlikely(res <= 0))
>  			return res;
> @@ -1387,15 +1379,10 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  		struct page *page;
>  
>  		page = first_bvec_segment(i, &maxsize, start);
> -		n = DIV_ROUND_UP(maxsize + *start, PAGE_SIZE);
> -		if (n > maxpages)
> -			n = maxpages;
> +		n = want_pages_array(pages, maxsize, *start, maxpages);
> +		if (!n)
> +			return -ENOMEM;
>  		p = *pages;
> -		if (!p) {
> -			*pages = p = get_pages_array(n);
> -			if (!p)
> -				return -ENOMEM;
> -		}
>  		for (int k = 0; k < n; k++)
>  			get_page(*p++ = page++);
>  		return min_t(size_t, maxsize, n * PAGE_SIZE - *start);

Reviewed-by: Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 36/44] iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()
  2022-06-22  4:15   ` [PATCH 36/44] iov_iter: advancing variants of iov_iter_get_pages{,_alloc}() Al Viro
@ 2022-06-28 12:13     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 12:13 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> Most users immediately follow a successful iov_iter_get_pages() by
> advancing the iterator by the amount it returned.
> 
> Provide inline wrappers that do that, and convert the trivial
> open-coded uses of those.
> 
> BTW, iov_iter_get_pages() never returns more than it had been asked
> to; such checks in cifs ought to be removed someday...
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  drivers/vhost/scsi.c |  4 +---
>  fs/ceph/file.c       |  3 +--
>  fs/cifs/file.c       |  6 ++----
>  fs/cifs/misc.c       |  3 +--
>  fs/direct-io.c       |  3 +--
>  fs/fuse/dev.c        |  3 +--
>  fs/fuse/file.c       |  3 +--
>  fs/nfs/direct.c      |  6 ++----
>  include/linux/uio.h  | 20 ++++++++++++++++++++
>  net/core/datagram.c  |  3 +--
>  net/core/skmsg.c     |  3 +--
>  net/rds/message.c    |  3 +--
>  net/tls/tls_sw.c     |  4 +---
>  13 files changed, 34 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index ffd9e6c2ffc1..9b65509424dc 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -643,14 +643,12 @@ vhost_scsi_map_to_sgl(struct vhost_scsi_cmd *cmd,
>  	size_t offset;
>  	unsigned int npages = 0;
>  
> -	bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
> +	bytes = iov_iter_get_pages2(iter, pages, LONG_MAX,
>  				VHOST_SCSI_PREALLOC_UPAGES, &offset);
>  	/* No pages were pinned */
>  	if (bytes <= 0)
>  		return bytes < 0 ? bytes : -EFAULT;
>  
> -	iov_iter_advance(iter, bytes);
> -
>  	while (bytes) {
>  		unsigned n = min_t(unsigned, PAGE_SIZE - offset, bytes);
>  		sg_set_page(sg++, pages[npages++], n, offset);
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index c535de5852bf..8fab5db16c73 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -95,12 +95,11 @@ static ssize_t __iter_get_bvecs(struct iov_iter *iter, size_t maxsize,
>  		size_t start;
>  		int idx = 0;
>  
> -		bytes = iov_iter_get_pages(iter, pages, maxsize - size,
> +		bytes = iov_iter_get_pages2(iter, pages, maxsize - size,
>  					   ITER_GET_BVECS_PAGES, &start);
>  		if (bytes < 0)
>  			return size ?: bytes;
>  
> -		iov_iter_advance(iter, bytes);
>  		size += bytes;
>  
>  		for ( ; bytes; idx++, bvec_idx++) {
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index e1e05b253daa..3ba013e2987f 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -3022,7 +3022,7 @@ cifs_write_from_iter(loff_t offset, size_t len, struct iov_iter *from,
>  		if (ctx->direct_io) {
>  			ssize_t result;
>  
> -			result = iov_iter_get_pages_alloc(
> +			result = iov_iter_get_pages_alloc2(
>  				from, &pagevec, cur_len, &start);
>  			if (result < 0) {
>  				cifs_dbg(VFS,
> @@ -3036,7 +3036,6 @@ cifs_write_from_iter(loff_t offset, size_t len, struct iov_iter *from,
>  				break;
>  			}
>  			cur_len = (size_t)result;
> -			iov_iter_advance(from, cur_len);
>  
>  			nr_pages =
>  				(cur_len + start + PAGE_SIZE - 1) / PAGE_SIZE;
> @@ -3758,7 +3757,7 @@ cifs_send_async_read(loff_t offset, size_t len, struct cifsFileInfo *open_file,
>  		if (ctx->direct_io) {
>  			ssize_t result;
>  
> -			result = iov_iter_get_pages_alloc(
> +			result = iov_iter_get_pages_alloc2(
>  					&direct_iov, &pagevec,
>  					cur_len, &start);
>  			if (result < 0) {
> @@ -3774,7 +3773,6 @@ cifs_send_async_read(loff_t offset, size_t len, struct cifsFileInfo *open_file,
>  				break;
>  			}
>  			cur_len = (size_t)result;
> -			iov_iter_advance(&direct_iov, cur_len);
>  
>  			rdata = cifs_readdata_direct_alloc(
>  					pagevec, cifs_uncached_readv_complete);
> diff --git a/fs/cifs/misc.c b/fs/cifs/misc.c
> index c69e1240d730..37493118fb72 100644
> --- a/fs/cifs/misc.c
> +++ b/fs/cifs/misc.c
> @@ -1022,7 +1022,7 @@ setup_aio_ctx_iter(struct cifs_aio_ctx *ctx, struct iov_iter *iter, int rw)
>  	saved_len = count;
>  
>  	while (count && npages < max_pages) {
> -		rc = iov_iter_get_pages(iter, pages, count, max_pages, &start);
> +		rc = iov_iter_get_pages2(iter, pages, count, max_pages, &start);
>  		if (rc < 0) {
>  			cifs_dbg(VFS, "Couldn't get user pages (rc=%zd)\n", rc);
>  			break;
> @@ -1034,7 +1034,6 @@ setup_aio_ctx_iter(struct cifs_aio_ctx *ctx, struct iov_iter *iter, int rw)
>  			break;
>  		}
>  
> -		iov_iter_advance(iter, rc);
>  		count -= rc;
>  		rc += start;
>  		cur_npages = DIV_ROUND_UP(rc, PAGE_SIZE);
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 72237f49ad94..9724244f12ce 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -169,7 +169,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
>  {
>  	ssize_t ret;
>  
> -	ret = iov_iter_get_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
> +	ret = iov_iter_get_pages2(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
>  				&sdio->from);
>  
>  	if (ret < 0 && sdio->blocks_available && (dio->op == REQ_OP_WRITE)) {
> @@ -191,7 +191,6 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
>  	}
>  
>  	if (ret >= 0) {
> -		iov_iter_advance(sdio->iter, ret);
>  		ret += sdio->from;
>  		sdio->head = 0;
>  		sdio->tail = (ret + PAGE_SIZE - 1) / PAGE_SIZE;
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 8d657c2cd6f7..51897427a534 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -730,14 +730,13 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
>  		}
>  	} else {
>  		size_t off;
> -		err = iov_iter_get_pages(cs->iter, &page, PAGE_SIZE, 1, &off);
> +		err = iov_iter_get_pages2(cs->iter, &page, PAGE_SIZE, 1, &off);
>  		if (err < 0)
>  			return err;
>  		BUG_ON(!err);
>  		cs->len = err;
>  		cs->offset = off;
>  		cs->pg = page;
> -		iov_iter_advance(cs->iter, err);
>  	}
>  
>  	return lock_request(cs->req);
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index c982e3afe3b4..69e19fc0afc1 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1401,14 +1401,13 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
>  	while (nbytes < *nbytesp && ap->num_pages < max_pages) {
>  		unsigned npages;
>  		size_t start;
> -		ret = iov_iter_get_pages(ii, &ap->pages[ap->num_pages],
> +		ret = iov_iter_get_pages2(ii, &ap->pages[ap->num_pages],
>  					*nbytesp - nbytes,
>  					max_pages - ap->num_pages,
>  					&start);
>  		if (ret < 0)
>  			break;
>  
> -		iov_iter_advance(ii, ret);
>  		nbytes += ret;
>  
>  		ret += start;
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index 022e1ce63e62..c275c83f0aef 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -364,13 +364,12 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
>  		size_t pgbase;
>  		unsigned npages, i;
>  
> -		result = iov_iter_get_pages_alloc(iter, &pagevec, 
> +		result = iov_iter_get_pages_alloc2(iter, &pagevec,
>  						  rsize, &pgbase);
>  		if (result < 0)
>  			break;
>  	
>  		bytes = result;
> -		iov_iter_advance(iter, bytes);
>  		npages = (result + pgbase + PAGE_SIZE - 1) / PAGE_SIZE;
>  		for (i = 0; i < npages; i++) {
>  			struct nfs_page *req;
> @@ -812,13 +811,12 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
>  		size_t pgbase;
>  		unsigned npages, i;
>  
> -		result = iov_iter_get_pages_alloc(iter, &pagevec, 
> +		result = iov_iter_get_pages_alloc2(iter, &pagevec,
>  						  wsize, &pgbase);
>  		if (result < 0)
>  			break;
>  
>  		bytes = result;
> -		iov_iter_advance(iter, bytes);
>  		npages = (result + pgbase + PAGE_SIZE - 1) / PAGE_SIZE;
>  		for (i = 0; i < npages; i++) {
>  			struct nfs_page *req;
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index d3e13b37ea72..ab1cc218b9de 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -349,4 +349,24 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
>  	};
>  }
>  
> +static inline ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
> +			size_t maxsize, unsigned maxpages, size_t *start)
> +{
> +	ssize_t res = iov_iter_get_pages(i, pages, maxsize, maxpages, start);
> +
> +	if (res >= 0)
> +		iov_iter_advance(i, res);
> +	return res;
> +}
> +
> +static inline ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
> +			size_t maxsize, size_t *start)
> +{
> +	ssize_t res = iov_iter_get_pages_alloc(i, pages, maxsize, start);
> +
> +	if (res >= 0)
> +		iov_iter_advance(i, res);
> +	return res;
> +}
> +
>  #endif
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index 50f4faeea76c..344b4c5791ac 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -629,12 +629,11 @@ int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
>  		if (frag == MAX_SKB_FRAGS)
>  			return -EMSGSIZE;
>  
> -		copied = iov_iter_get_pages(from, pages, length,
> +		copied = iov_iter_get_pages2(from, pages, length,
>  					    MAX_SKB_FRAGS - frag, &start);
>  		if (copied < 0)
>  			return -EFAULT;
>  
> -		iov_iter_advance(from, copied);
>  		length -= copied;
>  
>  		truesize = PAGE_ALIGN(copied + start);
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 22b983ade0e7..662151678f20 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -324,14 +324,13 @@ int sk_msg_zerocopy_from_iter(struct sock *sk, struct iov_iter *from,
>  			goto out;
>  		}
>  
> -		copied = iov_iter_get_pages(from, pages, bytes, maxpages,
> +		copied = iov_iter_get_pages2(from, pages, bytes, maxpages,
>  					    &offset);
>  		if (copied <= 0) {
>  			ret = -EFAULT;
>  			goto out;
>  		}
>  
> -		iov_iter_advance(from, copied);
>  		bytes -= copied;
>  		msg->sg.size += copied;
>  
> diff --git a/net/rds/message.c b/net/rds/message.c
> index 799034e0f513..d74be4e3f3fa 100644
> --- a/net/rds/message.c
> +++ b/net/rds/message.c
> @@ -391,7 +391,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>  		size_t start;
>  		ssize_t copied;
>  
> -		copied = iov_iter_get_pages(from, &pages, PAGE_SIZE,
> +		copied = iov_iter_get_pages2(from, &pages, PAGE_SIZE,
>  					    1, &start);
>  		if (copied < 0) {
>  			struct mmpin *mmp;
> @@ -405,7 +405,6 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>  			goto err;
>  		}
>  		total_copied += copied;
> -		iov_iter_advance(from, copied);
>  		length -= copied;
>  		sg_set_page(sg, pages, copied, start);
>  		rm->data.op_nents++;
> diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
> index 0513f82b8537..b1406c60f8df 100644
> --- a/net/tls/tls_sw.c
> +++ b/net/tls/tls_sw.c
> @@ -1361,7 +1361,7 @@ static int tls_setup_from_iter(struct iov_iter *from,
>  			rc = -EFAULT;
>  			goto out;
>  		}
> -		copied = iov_iter_get_pages(from, pages,
> +		copied = iov_iter_get_pages2(from, pages,
>  					    length,
>  					    maxpages, &offset);
>  		if (copied <= 0) {
> @@ -1369,8 +1369,6 @@ static int tls_setup_from_iter(struct iov_iter *from,
>  			goto out;
>  		}
>  
> -		iov_iter_advance(from, copied);
> -
>  		length -= copied;
>  		size += copied;
>  		while (copied) {

Reviewed-by: Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 37/44] block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  2022-06-22  4:15   ` [PATCH 37/44] block: convert to " Al Viro
@ 2022-06-28 12:16     ` Jeff Layton
  2022-06-30 22:11     ` [block.git conflicts] " Al Viro
  2022-07-10 18:04     ` Sedat Dilek
  2 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 12:16 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> ... doing a revert if we end up not using some of the pages
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  block/bio.c     | 15 ++++++---------
>  block/blk-map.c |  7 ++++---
>  2 files changed, 10 insertions(+), 12 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 51c99f2c5c90..01ab683e67be 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1190,7 +1190,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
>  	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
>  
> -	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> +	size = iov_iter_get_pages2(iter, pages, LONG_MAX, nr_pages, &offset);
>  	if (unlikely(size <= 0))
>  		return size ? size : -EFAULT;
>  
> @@ -1205,6 +1205,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  		} else {
>  			if (WARN_ON_ONCE(bio_full(bio, len))) {
>  				bio_put_pages(pages + i, left, offset);
> +				iov_iter_revert(iter, left);
>  				return -EINVAL;
>  			}
>  			__bio_add_page(bio, page, len, offset);
> @@ -1212,7 +1213,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  		offset = 0;
>  	}
>  
> -	iov_iter_advance(iter, size);
>  	return 0;
>  }
>  
> @@ -1227,7 +1227,6 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
>  	ssize_t size, left;
>  	unsigned len, i;
>  	size_t offset;
> -	int ret = 0;
>  
>  	if (WARN_ON_ONCE(!max_append_sectors))
>  		return 0;
> @@ -1240,7 +1239,7 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
>  	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
>  	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
>  
> -	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> +	size = iov_iter_get_pages2(iter, pages, LONG_MAX, nr_pages, &offset);
>  	if (unlikely(size <= 0))
>  		return size ? size : -EFAULT;
>  
> @@ -1252,16 +1251,14 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
>  		if (bio_add_hw_page(q, bio, page, len, offset,
>  				max_append_sectors, &same_page) != len) {
>  			bio_put_pages(pages + i, left, offset);
> -			ret = -EINVAL;
> -			break;
> +			iov_iter_revert(iter, left);
> +			return -EINVAL;
>  		}
>  		if (same_page)
>  			put_page(page);
>  		offset = 0;
>  	}
> -
> -	iov_iter_advance(iter, size - left);
> -	return ret;
> +	return 0;
>  }
>  
>  /**
> diff --git a/block/blk-map.c b/block/blk-map.c
> index df8b066cd548..7196a6b64c80 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -254,7 +254,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
>  		size_t offs, added = 0;
>  		int npages;
>  
> -		bytes = iov_iter_get_pages_alloc(iter, &pages, LONG_MAX, &offs);
> +		bytes = iov_iter_get_pages_alloc2(iter, &pages, LONG_MAX, &offs);
>  		if (unlikely(bytes <= 0)) {
>  			ret = bytes ? bytes : -EFAULT;
>  			goto out_unmap;
> @@ -284,7 +284,6 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
>  				bytes -= n;
>  				offs = 0;
>  			}
> -			iov_iter_advance(iter, added);
>  		}
>  		/*
>  		 * release the pages we didn't map into the bio, if any
> @@ -293,8 +292,10 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
>  			put_page(pages[j++]);
>  		kvfree(pages);
>  		/* couldn't stuff something into bio? */
> -		if (bytes)
> +		if (bytes) {
> +			iov_iter_revert(iter, bytes);
>  			break;
> +		}
>  	}
>  
>  	ret = blk_rq_append_bio(rq, bio);

Reviewed-by: Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 38/44] iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()
  2022-06-22  4:15   ` [PATCH 38/44] iter_to_pipe(): switch to advancing variant of iov_iter_get_pages() Al Viro
@ 2022-06-28 12:18     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 12:18 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> ... and untangle the cleanup on failure to add into the pipe.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/splice.c | 47 ++++++++++++++++++++++++-----------------------
>  1 file changed, 24 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/splice.c b/fs/splice.c
> index 6645b30ec990..9f84bd21f64c 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -1160,39 +1160,40 @@ static int iter_to_pipe(struct iov_iter *from,
>  	};
>  	size_t total = 0;
>  	int ret = 0;
> -	bool failed = false;
>  
> -	while (iov_iter_count(from) && !failed) {
> +	while (iov_iter_count(from)) {
>  		struct page *pages[16];
> -		ssize_t copied;
> +		ssize_t left;
>  		size_t start;
> -		int n;
> +		int i, n;
>  
> -		copied = iov_iter_get_pages(from, pages, ~0UL, 16, &start);
> -		if (copied <= 0) {
> -			ret = copied;
> +		left = iov_iter_get_pages2(from, pages, ~0UL, 16, &start);
> +		if (left <= 0) {
> +			ret = left;
>  			break;
>  		}
>  
> -		for (n = 0; copied; n++, start = 0) {
> -			int size = min_t(int, copied, PAGE_SIZE - start);
> -			if (!failed) {
> -				buf.page = pages[n];
> -				buf.offset = start;
> -				buf.len = size;
> -				ret = add_to_pipe(pipe, &buf);
> -				if (unlikely(ret < 0)) {
> -					failed = true;
> -				} else {
> -					iov_iter_advance(from, ret);
> -					total += ret;
> -				}
> -			} else {
> -				put_page(pages[n]);
> +		n = DIV_ROUND_UP(left + start, PAGE_SIZE);
> +		for (i = 0; i < n; i++) {
> +			int size = min_t(int, left, PAGE_SIZE - start);
> +
> +			buf.page = pages[i];
> +			buf.offset = start;
> +			buf.len = size;
> +			ret = add_to_pipe(pipe, &buf);
> +			if (unlikely(ret < 0)) {
> +				iov_iter_revert(from, left);
> +				// this one got dropped by add_to_pipe()
> +				while (++i < n)
> +					put_page(pages[i]);
> +				goto out;
>  			}
> -			copied -= size;
> +			total += ret;
> +			left -= size;
> +			start = 0;
>  		}
>  	}
> +out:
>  	return total ? total : ret;
>  }
>  

Reviewed-by: Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 39/44] af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages()
  2022-06-22  4:15   ` [PATCH 39/44] af_alg_make_sg(): " Al Viro
@ 2022-06-28 12:18     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 12:18 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> ... and adjust the callers
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  crypto/af_alg.c     | 3 +--
>  crypto/algif_hash.c | 5 +++--
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/crypto/af_alg.c b/crypto/af_alg.c
> index c8289b7a85ba..e893c0f6c879 100644
> --- a/crypto/af_alg.c
> +++ b/crypto/af_alg.c
> @@ -404,7 +404,7 @@ int af_alg_make_sg(struct af_alg_sgl *sgl, struct iov_iter *iter, int len)
>  	ssize_t n;
>  	int npages, i;
>  
> -	n = iov_iter_get_pages(iter, sgl->pages, len, ALG_MAX_PAGES, &off);
> +	n = iov_iter_get_pages2(iter, sgl->pages, len, ALG_MAX_PAGES, &off);
>  	if (n < 0)
>  		return n;
>  
> @@ -1191,7 +1191,6 @@ int af_alg_get_rsgl(struct sock *sk, struct msghdr *msg, int flags,
>  		len += err;
>  		atomic_add(err, &ctx->rcvused);
>  		rsgl->sg_num_bytes = err;
> -		iov_iter_advance(&msg->msg_iter, err);
>  	}
>  
>  	*outlen = len;
> diff --git a/crypto/algif_hash.c b/crypto/algif_hash.c
> index 50f7b22f1b48..1d017ec5c63c 100644
> --- a/crypto/algif_hash.c
> +++ b/crypto/algif_hash.c
> @@ -102,11 +102,12 @@ static int hash_sendmsg(struct socket *sock, struct msghdr *msg,
>  		err = crypto_wait_req(crypto_ahash_update(&ctx->req),
>  				      &ctx->wait);
>  		af_alg_free_sg(&ctx->sgl);
> -		if (err)
> +		if (err) {
> +			iov_iter_revert(&msg->msg_iter, len);
>  			goto unlock;
> +		}
>  
>  		copied += len;
> -		iov_iter_advance(&msg->msg_iter, len);
>  	}
>  
>  	err = 0;

Reviewed-by: Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 41/44] ceph: switch the last caller of iov_iter_get_pages_alloc()
  2022-06-22  4:15   ` [PATCH 41/44] ceph: switch the last caller " Al Viro
@ 2022-06-28 12:20     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 12:20 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> here nothing even looks at the iov_iter after the call, so we couldn't
> care less whether it advances or not.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/ceph/addr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 6dee88815491..3c8a7cf19e5d 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -329,7 +329,7 @@ static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq)
>  
>  	dout("%s: pos=%llu orig_len=%zu len=%llu\n", __func__, subreq->start, subreq->len, len);
>  	iov_iter_xarray(&iter, READ, &rreq->mapping->i_pages, subreq->start, len);
> -	err = iov_iter_get_pages_alloc(&iter, &pages, len, &page_off);
> +	err = iov_iter_get_pages_alloc2(&iter, &pages, len, &page_off);
>  	if (err < 0) {
>  		dout("%s: iov_ter_get_pages_alloc returned %d\n", __func__, err);
>  		goto out;

There are some coming changes to make this code use an iter passed in as
part of the subreq, at which point we will need to advance this anyway.

Reviewed-by: Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 42/44] get rid of non-advancing variants
  2022-06-22  4:15   ` [PATCH 42/44] get rid of non-advancing variants Al Viro
@ 2022-06-28 12:21     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 12:21 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> mechanical change; will be further massaged in subsequent commits
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  include/linux/uio.h | 24 ++----------------------
>  lib/iov_iter.c      | 27 ++++++++++++++++++---------
>  2 files changed, 20 insertions(+), 31 deletions(-)
> 
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index ab1cc218b9de..f2fc55f88e45 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -245,9 +245,9 @@ void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode
>  void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
>  void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
>  		     loff_t start, size_t count);
> -ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
> +ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
>  			size_t maxsize, unsigned maxpages, size_t *start);
> -ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
> +ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
>  			size_t maxsize, size_t *start);
>  int iov_iter_npages(const struct iov_iter *i, int maxpages);
>  void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state);
> @@ -349,24 +349,4 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
>  	};
>  }
>  
> -static inline ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
> -			size_t maxsize, unsigned maxpages, size_t *start)
> -{
> -	ssize_t res = iov_iter_get_pages(i, pages, maxsize, maxpages, start);
> -
> -	if (res >= 0)
> -		iov_iter_advance(i, res);
> -	return res;
> -}
> -
> -static inline ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
> -			size_t maxsize, size_t *start)
> -{
> -	ssize_t res = iov_iter_get_pages_alloc(i, pages, maxsize, start);
> -
> -	if (res >= 0)
> -		iov_iter_advance(i, res);
> -	return res;
> -}
> -
>  #endif
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 1c744f0c0b2c..70736b3e07c5 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1231,6 +1231,7 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
>  		left -= PAGE_SIZE - off;
>  		if (left <= 0) {
>  			buf->len += maxsize;
> +			iov_iter_advance(i, maxsize);
>  			return maxsize;
>  		}
>  		buf->len = PAGE_SIZE;
> @@ -1250,7 +1251,9 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
>  	}
>  	if (!npages)
>  		return -EFAULT;
> -	return maxsize - left;
> +	maxsize -= left;
> +	iov_iter_advance(i, maxsize);
> +	return maxsize;
>  }
>  
>  static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa,
> @@ -1300,7 +1303,9 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
>  	if (nr == 0)
>  		return 0;
>  
> -	return min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
> +	maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
> +	iov_iter_advance(i, maxsize);
> +	return maxsize;
>  }
>  
>  /* must be done on non-empty ITER_UBUF or ITER_IOVEC one */
> @@ -1372,7 +1377,9 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  		res = get_user_pages_fast(addr, n, gup_flags, *pages);
>  		if (unlikely(res <= 0))
>  			return res;
> -		return min_t(size_t, maxsize, res * PAGE_SIZE - *start);
> +		maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - *start);
> +		iov_iter_advance(i, maxsize);
> +		return maxsize;
>  	}
>  	if (iov_iter_is_bvec(i)) {
>  		struct page **p;
> @@ -1384,8 +1391,10 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  			return -ENOMEM;
>  		p = *pages;
>  		for (int k = 0; k < n; k++)
> -			get_page(*p++ = page++);
> -		return min_t(size_t, maxsize, n * PAGE_SIZE - *start);
> +			get_page(p[k] = page + k);
> +		maxsize = min_t(size_t, maxsize, n * PAGE_SIZE - *start);
> +		iov_iter_advance(i, maxsize);
> +		return maxsize;
>  	}
>  	if (iov_iter_is_pipe(i))
>  		return pipe_get_pages(i, pages, maxsize, maxpages, start);
> @@ -1395,7 +1404,7 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  	return -EFAULT;
>  }
>  
> -ssize_t iov_iter_get_pages(struct iov_iter *i,
> +ssize_t iov_iter_get_pages2(struct iov_iter *i,
>  		   struct page **pages, size_t maxsize, unsigned maxpages,
>  		   size_t *start)
>  {
> @@ -1405,9 +1414,9 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
>  
>  	return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start);
>  }
> -EXPORT_SYMBOL(iov_iter_get_pages);
> +EXPORT_SYMBOL(iov_iter_get_pages2);
>  
> -ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
> +ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
>  		   struct page ***pages, size_t maxsize,
>  		   size_t *start)
>  {
> @@ -1422,7 +1431,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
>  	}
>  	return len;
>  }
> -EXPORT_SYMBOL(iov_iter_get_pages_alloc);
> +EXPORT_SYMBOL(iov_iter_get_pages_alloc2);
>  
>  size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
>  			       struct iov_iter *i)

Reviewed-by: Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 44/44] expand those iov_iter_advance()...
  2022-06-22  4:15   ` [PATCH 44/44] expand those iov_iter_advance() Al Viro
@ 2022-06-28 12:23     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 12:23 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  lib/iov_iter.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index a8045c97b975..79c86add8dea 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1284,7 +1284,8 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
>  		return 0;
>  
>  	maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
> -	iov_iter_advance(i, maxsize);
> +	i->iov_offset += maxsize;
> +	i->count -= maxsize;
>  	return maxsize;
>  }
>  
> @@ -1373,7 +1374,13 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  		for (int k = 0; k < n; k++)
>  			get_page(p[k] = page + k);
>  		maxsize = min_t(size_t, maxsize, n * PAGE_SIZE - *start);
> -		iov_iter_advance(i, maxsize);
> +		i->count -= maxsize;
> +		i->iov_offset += maxsize;
> +		if (i->iov_offset == i->bvec->bv_len) {
> +			i->iov_offset = 0;
> +			i->bvec++;
> +			i->nr_segs--;
> +		}
>  		return maxsize;
>  	}
>  	if (iov_iter_is_pipe(i))

Why do this? iov_iter_advance makes it clearer as to what's going on
here.
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 43/44] pipe_get_pages(): switch to append_pipe()
  2022-06-22  4:15   ` [PATCH 43/44] pipe_get_pages(): switch to append_pipe() Al Viro
@ 2022-06-28 12:23     ` Jeff Layton
  0 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 12:23 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:15 +0100, Al Viro wrote:
> now that we are advancing the iterator, there's no need to
> treat the first page separately - just call append_pipe()
> in a loop.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  lib/iov_iter.c | 36 ++++++++----------------------------
>  1 file changed, 8 insertions(+), 28 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 70736b3e07c5..a8045c97b975 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1207,10 +1207,10 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
>  		   struct page ***pages, size_t maxsize, unsigned maxpages,
>  		   size_t *start)
>  {
> -	struct pipe_inode_info *pipe = i->pipe;
> -	unsigned int npages, off, count;
> +	unsigned int npages, count;
>  	struct page **p;
>  	ssize_t left;
> +	size_t off;
>  
>  	if (!sanity(i))
>  		return -EFAULT;
> @@ -1222,38 +1222,18 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
>  	if (!count)
>  		return -ENOMEM;
>  	p = *pages;
> -	left = maxsize;
> -	npages = 0;
> -	if (off) {
> -		struct pipe_buffer *buf = pipe_buf(pipe, pipe->head - 1);
> -
> -		get_page(*p++ = buf->page);
> -		left -= PAGE_SIZE - off;
> -		if (left <= 0) {
> -			buf->len += maxsize;
> -			iov_iter_advance(i, maxsize);
> -			return maxsize;
> -		}
> -		buf->len = PAGE_SIZE;
> -		npages = 1;
> -	}
> -	for ( ; npages < count; npages++) {
> -		struct page *page;
> -		unsigned int size = min_t(ssize_t, left, PAGE_SIZE);
> -
> -		if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
> -			break;
> -		page = push_anon(pipe, size);
> +	for (npages = 0, left = maxsize ; npages < count; npages++) {
> +		struct page *page = append_pipe(i, left, &off);
>  		if (!page)
>  			break;
>  		get_page(*p++ = page);
> -		left -= size;
> +		if (left <= PAGE_SIZE - off)
> +			return maxsize;
> +		left -= PAGE_SIZE - off;
>  	}
>  	if (!npages)
>  		return -EFAULT;
> -	maxsize -= left;
> -	iov_iter_advance(i, maxsize);
> -	return maxsize;
> +	return maxsize - left;
>  }
>  
>  static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa,

Reviewed-by: Jeff Layton <jlayton@kernel.org>


* Re: [RFC][CFT][PATCHSET] iov_iter stuff
  2022-06-22  4:10 [RFC][CFT][PATCHSET] iov_iter stuff Al Viro
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
  2022-06-23 15:21 ` [RFC][CFT][PATCHSET] iov_iter stuff David Howells
@ 2022-06-28 12:25 ` Jeff Layton
  2 siblings, 0 replies; 118+ messages in thread
From: Jeff Layton @ 2022-06-28 12:25 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner

On Wed, 2022-06-22 at 05:10 +0100, Al Viro wrote:
> 	There's a bunch of pending iov_iter-related work; most of that had
> been posted, but only one part got anything resembling a review.  Currently
> it seems to be working, but it obviously needs review and testing.
> 
> 	It's split into several subseries; the entire series can be observed
> as v5.19-rc2..#work.iov_iter_get_pages.  Description follows; individual
> patches will be posted^Wmailbombed in followups.
> 
> 	This stuff is not in -next yet; I'd like to put it there, so if you
> see any problems - please yell.
> 
> 	One thing not currently in there, but to be added very soon is
> iov_iter_find_pages{,_alloc}() - analogue of iov_iter_get_pages(), except that
> it only grabs page references for userland-backed flavours.  The callers,
> of course, are responsible for keeping the underlying object(s) alive for as
> long as they are using the results.  Quite a few of iov_iter_get_pages()
> callers would be fine with that.  Moreover, unlike iov_iter_get_pages() this
> could be allowed for ITER_KVEC, potentially eliminating several places where
> we special-case the treatment of ITER_KVEC.
> 
> 	Another pending thing is integration with cifs and ceph series (dhowells
> and jlayton resp.) and probably io_uring as well.
> 
> ----------------------------------------------------------------------------
> 
> 	Part 1, #work.9p: [rc1-based]
> 
> 1/44: 9p: handling Rerror without copy_from_iter_full()
> 	Self-contained fix, should be easy to backport.  What happens
> there is that arrival of Rerror in response to zerocopy read or readdir
> ends up with error string in the place where the actual data would've gone
> in case of success.  It needs to be extracted, and copy_from_iter_full()
> is only for data-source iterators, not for e.g. ITER_PIPE.  And ITER_PIPE
> can be used with those...
> 
> ----------------------------------------------------------------------------
> 
> 	Part 2, #work.iov_iter: [rc1-based]
> 
> Dealing with the overhead in new_sync_read()/new_sync_write(), mostly.
> Several things there - one is that calculation of iocb flags can be
> made cheaper, another is that single-segment iovec is sufficiently
> common to be worth turning into a new iov_iter flavour (ITER_UBUF).
> With all that, the total size of iov_iter.c goes down, mostly due to
> removal of magic in iovec copy_page_to_iter()/copy_page_from_iter().
> Generic variant works for those nowadays...
> 
> This had been posted two weeks ago, got a reasonable amount of comments.
> 
> 2/44: No need of likely/unlikely on calls of check_copy_size()
> 	not just in uio.h; the thing is inlined and it has unlikely on
> all paths leading to return false
> 
> 3/44:  teach iomap_dio_rw() to suppress dsync
> 	new flag for iomap_dio_rw(), telling it to suppress generic_write_sync()
> 
> 4/44: btrfs: use IOMAP_DIO_NOSYNC
> 	use the above instead of currently used kludges.
> 
> 5/44: struct file: use anonymous union member for rcuhead and llist
> 	"f_u" might have been an amusing name, but... we expect anon unions to
> work.
> 
> 6/44: iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC
> 	makes iocb_flags() much cheaper, and it's easier to keep track of
> the places where it can change.
> 
> 7/44: keep iocb_flags() result cached in struct file
> 	that, along with the previous commit, reduces the overhead of
> new_sync_{read,write}().  struct file doesn't grow - we can keep that
> thing in the same anon union where rcuhead and llist live; that field
> gets used only before ->f_count reaches zero while the other two are
> used only after ->f_count has reached zero.
> 
> 8/44: copy_page_{to,from}_iter(): switch iovec variants to generic
> 	kmap_local_page() allows that.  And it kills quite a bit of
> code.
> 
> 9/44: new iov_iter flavour - ITER_UBUF
> 	iovec analogue, with single segment.  That case is fairly common and it
> can be handled with less overhead than full-blown iovec.
> 
> 10/44: switch new_sync_{read,write}() to ITER_UBUF
> 	... and this is why it is so common.  Further reduction of overhead
> for new_sync_{read,write}().
> 
> 11/44: iov_iter_bvec_advance(): don't bother with bvec_iter
> 	AFAICS, variant similar to what we do for iovec/kvec generates better
> code.  Needs profiling, obviously.
> 
> ----------------------------------------------------------------------------
> 
> 	Part 3, #fixes [-rc2-based]
> 
> 12/44: fix short copy handling in copy_mc_pipe_to_iter()
> 	Minimal version of fix; it's replaced with prettier one in the next
> series, but replacement is not a backport fodder.
> 
> ----------------------------------------------------------------------------
> 
> 	Part 4, #work.ITER_PIPE [on top of merge of previous branches]
> 
> ITER_PIPE handling had never been pretty, but by now it has become
> really obfuscated and hard to read.  Untangle it a bit.  Posted last
> weekend, some brainos fixed since then.
> 
> 13/44: splice: stop abusing iov_iter_advance() to flush a pipe
> 	A really odd (ab)use of iov_iter_advance() - in case of error
> generic_file_splice_read() wants to free all pipe buffers ->read_iter()
> has produced.  Yes, forcibly resetting ->head and ->iov_offset to
> original values and calling iov_iter_advance(i, 0) will trigger
> pipe_advance(), which will trigger pipe_truncate(), which will free
> buffers.  Or we could just go ahead and free the same buffers;
> pipe_discard_from() does exactly that, no iov_iter stuff needs to
> be involved.
> 
> 14/44: ITER_PIPE: helper for getting pipe buffer by index
> 	In a lot of places we want to find pipe_buffer by index;
> expression is convoluted and hard to read.  Provide an inline helper
> for that, convert trivial open-coded cases.  Eventually *all*
> open-coded instances in iov_iter.c will be gone.
> 
> 15/44: ITER_PIPE: helpers for adding pipe buffers
>         There are only two kinds of pipe_buffer in the area used by ITER_PIPE.
> * anonymous - copy_to_iter() et al. end up creating those and copying data
>   there.  They have zero ->offset, and their ->ops points to
>   default_pipe_page_ops.
> * zero-copy ones - those come from copy_page_to_iter(), and page comes from
>   caller.  ->offset is also caller-supplied - it might be non-zero.
>   ->ops points to page_cache_pipe_buf_ops.
>         Move creation and insertion of those into helpers -
> push_anon(pipe, size) and push_page(pipe, page, offset, size) resp., separating
> them from the "could we avoid creating a new buffer by merging with the current
> head?" logics.
> 
> 16/44: ITER_PIPE: allocate buffers as we go in copy-to-pipe primitives
>         New helper: append_pipe().  Extends the last buffer if possible,
> allocates a new one otherwise.  Returns page and offset in it on success,
> NULL on failure.  iov_iter is advanced past the data we've got.
>         Use that instead of push_pipe() in copy-to-pipe primitives;
> they get simpler that way.  Handling of short copy (in "mc" one)
> is done simply by iov_iter_revert() - iov_iter is in consistent
> state after that one, so we can use that.
> 
> 17/44: ITER_PIPE: fold push_pipe() into __pipe_get_pages()
>         Expand the only remaining call of push_pipe() (in
> __pipe_get_pages()), combine it with the page-collecting loop there.
> We don't need to bother with i->count checks or calculation of offset
> in the first page - the caller already has done that.
>         Note that the only reason it's not a loop doing append_pipe()
> is that append_pipe() is advancing, while iov_iter_get_pages() is not.
> As soon as it switches to saner semantics, this thing will switch
> to using append_pipe().
> 
> 18/44: ITER_PIPE: lose iter_head argument of __pipe_get_pages()
> 	Redundant.
> 
> 19/44: ITER_PIPE: clean pipe_advance() up
>         Don't bother with pipe_truncate(); adjust the buffer
> length just as we decide it'll be the last one, then use
> pipe_discard_from() to release buffers past that one.
> 
> 20/44: ITER_PIPE: clean iov_iter_revert()
>         Fold pipe_truncate() in there, clean the things up.
> 
> 21/44: ITER_PIPE: cache the type of last buffer
>         We often need to find whether the last buffer is anon or not, and
> currently it's rather clumsy:
>         check if ->iov_offset is non-zero (i.e. that pipe is not empty)
>         if so, get the corresponding pipe_buffer and check its ->ops
>         if it's &default_pipe_buf_ops, we have an anon buffer.
> Let's replace the use of ->iov_offset (which is nowhere near similar to
> its role for other flavours) with signed field (->last_offset), with
> the following rules:
>         empty, no buffers occupied:             0
>         anon, with bytes up to N-1 filled:      N
>         zero-copy, with bytes up to N-1 filled: -N
> That way abs(i->last_offset) is equal to what used to be in i->iov_offset
> and empty vs. anon vs. zero-copy can be distinguished by the sign of
> i->last_offset.
>         Checks for "should we extend the last buffer or should we start
> a new one?" become easier to follow that way.
>         Note that most of the operations can only be done in a sane
> state - i.e. when the pipe has nothing past the current position of
> iterator.  About the only thing that could be done outside of that
> state is iov_iter_advance(), which transitions to the sane state by
> truncating the pipe.  There are only two cases where we leave the
> sane state:
>         1) iov_iter_get_pages()/iov_iter_get_pages_alloc().  Will be
> dealt with later, when we make get_pages advancing - the callers are
> actually happier that way.
>         2) iov_iter copied, then something is put into the copy.  Since
> they share the underlying pipe, the original gets behind.  When we
> decide that we are done with the copy (original is not usable until then)
> we advance the original.  direct_io used to be done that way; nowadays
> it operates on the original and we do iov_iter_revert() to discard
> the excessive data.  At the moment there's nothing in the kernel that
> could do that to ITER_PIPE iterators, so this reason for insane state
> is theoretical right now.
> 
> 22/44: ITER_PIPE: fold data_start() and pipe_space_for_user() together
>         All their callers are next to each other; all of them want
> the total amount of pages and, possibly, the offset in the partial
> final buffer.
>         Combine into a new helper (pipe_npages()), fix the
> bogosity in pipe_space_for_user(), while we are at it.
> 
> ----------------------------------------------------------------------------
> 
> 	Part 5, #work.unify_iov_iter_get_pages [on top of previous]
> 
> iov_iter_get_pages() and iov_iter_get_pages_alloc() have a lot of code
> duplication and are bloody hard to read.  With some massage duplication
> can be eliminated, along with some of the cruft accumulated there.
> 
> 	Flavour-independent arguments validation and, for ..._alloc(),
> cleanup handling on failure:
> 23/44: iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT
> 24/44: iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper
> 25/44: iov_iter_get_pages(): sanity-check arguments
> 
> 	Mechanically merge parallel ..._get_pages() and ..._get_pages_alloc().
> 26/44: unify pipe_get_pages() and pipe_get_pages_alloc()
> 27/44: unify xarray_get_pages() and xarray_get_pages_alloc()
> 28/44: unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts
> 
> 	Decrufting for XARRAY:
> 29/44: ITER_XARRAY: don't open-code DIV_ROUND_UP()
> 
> 	Decrufting for UBUF/IOVEC/BVEC: that bunch suffers from really convoluted
> helpers; untangling those takes a bit of care, so I'd carved that up into fairly
> small chunks.  Could be collapsed together, but...
> 30/44: iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment()
> 31/44: iov_iter: first_{iovec,bvec}_segment() - simplify a bit
> 32/44: iov_iter: massage calling conventions for first_{iovec,bvec}_segment()
> 33/44: found_iovec_segment(): just return address
> 
> 	Decrufting for PIPE:
> 34/44: fold __pipe_get_pages() into pipe_get_pages()
> 
> 	Now we can finally get a helper encapsulating the array allocations
> right way:
> 35/44: iov_iter: saner helper for page array allocation
> 
> ----------------------------------------------------------------------------
> 
> 	Part 6, #work.iov_iter_get_pages-advance [on top of previous]
> Convert iov_iter_get_pages{,_alloc}() to iterator-advancing semantics.  
> 
> 	Most of the callers follow successful ...get_pages... with advance
> by the amount it had reported.  For some it's unconditional, for some it
> might end up being less in some cases.  All of them would be fine with
> advancing variants of those primitives - those that might want to advance
> by less than reported could easily use revert by the difference of those
> amounts.
> 	Rather than doing a flagday change (they are exported and signatures
> remain unchanged), replacement variants are added (iov_iter_get_pages2()
> and iov_iter_get_pages_alloc2(), initially as wrappers).  By the end of
> the series everything is converted to those and the old ones are removed.
> 
> 	Makes for simpler rules for ITER_PIPE, among other things, and
> advancing semantics is consistent with all data-copying primitives.
> Series is pretty obvious - introduce variants with new semantics, switch
> users one by one, fold the old variants into new ones.
> 
> 36/44: iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()
> 37/44: block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
> 38/44: iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()
> 39/44: af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages()
> 40/44: 9p: convert to advancing variant of iov_iter_get_pages_alloc()
> 41/44: ceph: switch the last caller of iov_iter_get_pages_alloc()
> 42/44: get rid of non-advancing variants
> 
> ----------------------------------------------------------------------------
> 
> 	Part 7, #work.iov_iter_get_pages [on top of previous]
> Trivial followups, with more to be added here...
> 
> 43/44: pipe_get_pages(): switch to append_pipe()
> 44/44: expand those iov_iter_advance()...
> 
> Overall diffstat:
> 
>  arch/powerpc/include/asm/uaccess.h |   2 +-
>  arch/s390/include/asm/uaccess.h    |   4 +-
>  block/bio.c                        |  15 +-
>  block/blk-map.c                    |   7 +-
>  block/fops.c                       |   8 +-
>  crypto/af_alg.c                    |   3 +-
>  crypto/algif_hash.c                |   5 +-
>  drivers/nvme/target/io-cmd-file.c  |   2 +-
>  drivers/vhost/scsi.c               |   4 +-
>  fs/aio.c                           |   2 +-
>  fs/btrfs/file.c                    |  19 +-
>  fs/btrfs/inode.c                   |   3 +-
>  fs/ceph/addr.c                     |   2 +-
>  fs/ceph/file.c                     |   5 +-
>  fs/cifs/file.c                     |   8 +-
>  fs/cifs/misc.c                     |   3 +-
>  fs/direct-io.c                     |   7 +-
>  fs/fcntl.c                         |   1 +
>  fs/file_table.c                    |  17 +-
>  fs/fuse/dev.c                      |   7 +-
>  fs/fuse/file.c                     |   7 +-
>  fs/gfs2/file.c                     |   2 +-
>  fs/io_uring.c                      |   2 +-
>  fs/iomap/direct-io.c               |  21 +-
>  fs/nfs/direct.c                    |   8 +-
>  fs/open.c                          |   1 +
>  fs/read_write.c                    |   6 +-
>  fs/splice.c                        |  54 +-
>  fs/zonefs/super.c                  |   2 +-
>  include/linux/fs.h                 |  21 +-
>  include/linux/iomap.h              |   6 +
>  include/linux/pipe_fs_i.h          |  29 +-
>  include/linux/uaccess.h            |   4 +-
>  include/linux/uio.h                |  50 +-
>  lib/iov_iter.c                     | 993 ++++++++++++++-----------------------
>  mm/shmem.c                         |   2 +-
>  net/9p/client.c                    | 125 +----
>  net/9p/protocol.c                  |   3 +-
>  net/9p/trans_virtio.c              |  37 +-
>  net/core/datagram.c                |   3 +-
>  net/core/skmsg.c                   |   3 +-
>  net/rds/message.c                  |   3 +-
>  net/tls/tls_sw.c                   |   4 +-
>  43 files changed, 599 insertions(+), 911 deletions(-)

I ported the CEPH_MSG_DATA_ITER patches on top of this, and ran ceph
through xfstests and it seemed to do just fine. You can add:

Tested-by: Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 08/44] copy_page_{to,from}_iter(): switch iovec variants to generic
  2022-06-22  4:15   ` [PATCH 08/44] copy_page_{to,from}_iter(): switch iovec variants to generic Al Viro
  2022-06-27 18:31     ` Jeff Layton
@ 2022-06-28 12:32     ` Christian Brauner
  2022-06-28 18:36       ` Al Viro
  1 sibling, 1 reply; 118+ messages in thread
From: Christian Brauner @ 2022-06-28 12:32 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Jens Axboe, Christoph Hellwig,
	Matthew Wilcox, David Howells, Dominique Martinet

On Wed, Jun 22, 2022 at 05:15:16AM +0100, Al Viro wrote:
> we can do copyin/copyout under kmap_local_page(); it shouldn't overflow
> the kmap stack - the maximal footprint increase only by one here.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Assuming the WARN_ON(1) removals are intentional,
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>

>  lib/iov_iter.c | 191 ++-----------------------------------------------
>  1 file changed, 4 insertions(+), 187 deletions(-)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 6dd5330f7a99..4c658a25e29c 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -168,174 +168,6 @@ static int copyin(void *to, const void __user *from, size_t n)
>  	return n;
>  }
>  
> -static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t bytes,
> -			 struct iov_iter *i)
> -{
> -	size_t skip, copy, left, wanted;
> -	const struct iovec *iov;
> -	char __user *buf;
> -	void *kaddr, *from;
> -
> -	if (unlikely(bytes > i->count))
> -		bytes = i->count;
> -
> -	if (unlikely(!bytes))
> -		return 0;
> -
> -	might_fault();
> -	wanted = bytes;
> -	iov = i->iov;
> -	skip = i->iov_offset;
> -	buf = iov->iov_base + skip;
> -	copy = min(bytes, iov->iov_len - skip);
> -
> -	if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_writeable(buf, copy)) {
> -		kaddr = kmap_atomic(page);
> -		from = kaddr + offset;
> -
> -		/* first chunk, usually the only one */
> -		left = copyout(buf, from, copy);
> -		copy -= left;
> -		skip += copy;
> -		from += copy;
> -		bytes -= copy;
> -
> -		while (unlikely(!left && bytes)) {
> -			iov++;
> -			buf = iov->iov_base;
> -			copy = min(bytes, iov->iov_len);
> -			left = copyout(buf, from, copy);
> -			copy -= left;
> -			skip = copy;
> -			from += copy;
> -			bytes -= copy;
> -		}
> -		if (likely(!bytes)) {
> -			kunmap_atomic(kaddr);
> -			goto done;
> -		}
> -		offset = from - kaddr;
> -		buf += copy;
> -		kunmap_atomic(kaddr);
> -		copy = min(bytes, iov->iov_len - skip);
> -	}
> -	/* Too bad - revert to non-atomic kmap */
> -
> -	kaddr = kmap(page);
> -	from = kaddr + offset;
> -	left = copyout(buf, from, copy);
> -	copy -= left;
> -	skip += copy;
> -	from += copy;
> -	bytes -= copy;
> -	while (unlikely(!left && bytes)) {
> -		iov++;
> -		buf = iov->iov_base;
> -		copy = min(bytes, iov->iov_len);
> -		left = copyout(buf, from, copy);
> -		copy -= left;
> -		skip = copy;
> -		from += copy;
> -		bytes -= copy;
> -	}
> -	kunmap(page);
> -
> -done:
> -	if (skip == iov->iov_len) {
> -		iov++;
> -		skip = 0;
> -	}
> -	i->count -= wanted - bytes;
> -	i->nr_segs -= iov - i->iov;
> -	i->iov = iov;
> -	i->iov_offset = skip;
> -	return wanted - bytes;
> -}
> -
> -static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t bytes,
> -			 struct iov_iter *i)
> -{
> -	size_t skip, copy, left, wanted;
> -	const struct iovec *iov;
> -	char __user *buf;
> -	void *kaddr, *to;
> -
> -	if (unlikely(bytes > i->count))
> -		bytes = i->count;
> -
> -	if (unlikely(!bytes))
> -		return 0;
> -
> -	might_fault();
> -	wanted = bytes;
> -	iov = i->iov;
> -	skip = i->iov_offset;
> -	buf = iov->iov_base + skip;
> -	copy = min(bytes, iov->iov_len - skip);
> -
> -	if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_readable(buf, copy)) {
> -		kaddr = kmap_atomic(page);
> -		to = kaddr + offset;
> -
> -		/* first chunk, usually the only one */
> -		left = copyin(to, buf, copy);
> -		copy -= left;
> -		skip += copy;
> -		to += copy;
> -		bytes -= copy;
> -
> -		while (unlikely(!left && bytes)) {
> -			iov++;
> -			buf = iov->iov_base;
> -			copy = min(bytes, iov->iov_len);
> -			left = copyin(to, buf, copy);
> -			copy -= left;
> -			skip = copy;
> -			to += copy;
> -			bytes -= copy;
> -		}
> -		if (likely(!bytes)) {
> -			kunmap_atomic(kaddr);
> -			goto done;
> -		}
> -		offset = to - kaddr;
> -		buf += copy;
> -		kunmap_atomic(kaddr);
> -		copy = min(bytes, iov->iov_len - skip);
> -	}
> -	/* Too bad - revert to non-atomic kmap */
> -
> -	kaddr = kmap(page);
> -	to = kaddr + offset;
> -	left = copyin(to, buf, copy);
> -	copy -= left;
> -	skip += copy;
> -	to += copy;
> -	bytes -= copy;
> -	while (unlikely(!left && bytes)) {
> -		iov++;
> -		buf = iov->iov_base;
> -		copy = min(bytes, iov->iov_len);
> -		left = copyin(to, buf, copy);
> -		copy -= left;
> -		skip = copy;
> -		to += copy;
> -		bytes -= copy;
> -	}
> -	kunmap(page);
> -
> -done:
> -	if (skip == iov->iov_len) {
> -		iov++;
> -		skip = 0;
> -	}
> -	i->count -= wanted - bytes;
> -	i->nr_segs -= iov - i->iov;
> -	i->iov = iov;
> -	i->iov_offset = skip;
> -	return wanted - bytes;
> -}
> -
>  #ifdef PIPE_PARANOIA
>  static bool sanity(const struct iov_iter *i)
>  {
> @@ -848,24 +680,14 @@ static inline bool page_copy_sane(struct page *page, size_t offset, size_t n)
>  static size_t __copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
>  			 struct iov_iter *i)
>  {
> -	if (likely(iter_is_iovec(i)))
> -		return copy_page_to_iter_iovec(page, offset, bytes, i);
> -	if (iov_iter_is_bvec(i) || iov_iter_is_kvec(i) || iov_iter_is_xarray(i)) {
> +	if (unlikely(iov_iter_is_pipe(i))) {
> +		return copy_page_to_iter_pipe(page, offset, bytes, i);
> +	} else {
>  		void *kaddr = kmap_local_page(page);
>  		size_t wanted = _copy_to_iter(kaddr + offset, bytes, i);
>  		kunmap_local(kaddr);
>  		return wanted;
>  	}
> -	if (iov_iter_is_pipe(i))
> -		return copy_page_to_iter_pipe(page, offset, bytes, i);
> -	if (unlikely(iov_iter_is_discard(i))) {
> -		if (unlikely(i->count < bytes))
> -			bytes = i->count;
> -		i->count -= bytes;
> -		return bytes;
> -	}
> -	WARN_ON(1);
> -	return 0;
>  }
>  
>  size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
> @@ -896,17 +718,12 @@ EXPORT_SYMBOL(copy_page_to_iter);
>  size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
>  			 struct iov_iter *i)
>  {
> -	if (unlikely(!page_copy_sane(page, offset, bytes)))
> -		return 0;
> -	if (likely(iter_is_iovec(i)))
> -		return copy_page_from_iter_iovec(page, offset, bytes, i);
> -	if (iov_iter_is_bvec(i) || iov_iter_is_kvec(i) || iov_iter_is_xarray(i)) {
> +	if (page_copy_sane(page, offset, bytes)) {
>  		void *kaddr = kmap_local_page(page);
>  		size_t wanted = _copy_from_iter(kaddr + offset, bytes, i);
>  		kunmap_local(kaddr);
>  		return wanted;
>  	}
> -	WARN_ON(1);
>  	return 0;
>  }
>  EXPORT_SYMBOL(copy_page_from_iter);
> -- 
> 2.30.2
> 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 09/44] new iov_iter flavour - ITER_UBUF
  2022-06-22  4:15   ` [PATCH 09/44] new iov_iter flavour - ITER_UBUF Al Viro
  2022-06-27 18:47     ` Jeff Layton
@ 2022-06-28 12:38     ` Christian Brauner
  2022-06-28 18:44       ` Al Viro
  2022-07-28  9:55     ` [PATCH 9/44] " Alexander Gordeev
  2 siblings, 1 reply; 118+ messages in thread
From: Christian Brauner @ 2022-06-28 12:38 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Jens Axboe, Christoph Hellwig,
	Matthew Wilcox, David Howells, Dominique Martinet

On Wed, Jun 22, 2022 at 05:15:17AM +0100, Al Viro wrote:
> Equivalent of single-segment iovec.  Initialized by iov_iter_ubuf(),
> checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC
> ones.
> 
> We are going to expose the things like ->write_iter() et.al. to those
> in subsequent commits.
> 
> New predicate (user_backed_iter()) that is true for ITER_IOVEC and
> ITER_UBUF; places like direct-IO handling should use that for
> checking that pages we modify after getting them from iov_iter_get_pages()
> would need to be dirtied.
> 
> DO NOT assume that replacing iter_is_iovec() with user_backed_iter()
> will solve all problems - there's code that uses iter_is_iovec() to
> decide how to poke around in iov_iter guts and for that the predicate
> replacement obviously won't suffice.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  block/fops.c         |  6 +--
>  fs/ceph/file.c       |  2 +-
>  fs/cifs/file.c       |  2 +-
>  fs/direct-io.c       |  2 +-
>  fs/fuse/dev.c        |  4 +-
>  fs/fuse/file.c       |  2 +-
>  fs/gfs2/file.c       |  2 +-
>  fs/iomap/direct-io.c |  2 +-
>  fs/nfs/direct.c      |  2 +-
>  include/linux/uio.h  | 26 ++++++++++++
>  lib/iov_iter.c       | 94 ++++++++++++++++++++++++++++++++++----------
>  mm/shmem.c           |  2 +-
>  12 files changed, 113 insertions(+), 33 deletions(-)
> 
> diff --git a/block/fops.c b/block/fops.c
> index 6e86931ab847..3e68d69e0ee3 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -69,7 +69,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
>  
>  	if (iov_iter_rw(iter) == READ) {
>  		bio_init(&bio, bdev, vecs, nr_pages, REQ_OP_READ);
> -		if (iter_is_iovec(iter))
> +		if (user_backed_iter(iter))
>  			should_dirty = true;
>  	} else {
>  		bio_init(&bio, bdev, vecs, nr_pages, dio_bio_write_op(iocb));
> @@ -199,7 +199,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
>  	}
>  
>  	dio->size = 0;
> -	if (is_read && iter_is_iovec(iter))
> +	if (is_read && user_backed_iter(iter))
>  		dio->flags |= DIO_SHOULD_DIRTY;
>  
>  	blk_start_plug(&plug);
> @@ -331,7 +331,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
>  	dio->size = bio->bi_iter.bi_size;
>  
>  	if (is_read) {
> -		if (iter_is_iovec(iter)) {
> +		if (user_backed_iter(iter)) {
>  			dio->flags |= DIO_SHOULD_DIRTY;
>  			bio_set_pages_dirty(bio);
>  		}
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 8c8226c0feac..e132adeeaf16 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1262,7 +1262,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
>  	size_t count = iov_iter_count(iter);
>  	loff_t pos = iocb->ki_pos;
>  	bool write = iov_iter_rw(iter) == WRITE;
> -	bool should_dirty = !write && iter_is_iovec(iter);
> +	bool should_dirty = !write && user_backed_iter(iter);
>  
>  	if (write && ceph_snap(file_inode(file)) != CEPH_NOSNAP)
>  		return -EROFS;
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 1618e0537d58..4b4129d9a90c 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -4004,7 +4004,7 @@ static ssize_t __cifs_readv(
>  	if (!is_sync_kiocb(iocb))
>  		ctx->iocb = iocb;
>  
> -	if (iter_is_iovec(to))
> +	if (user_backed_iter(to))
>  		ctx->should_dirty = true;
>  
>  	if (direct) {
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 39647eb56904..72237f49ad94 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1245,7 +1245,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
>  	spin_lock_init(&dio->bio_lock);
>  	dio->refcount = 1;
>  
> -	dio->should_dirty = iter_is_iovec(iter) && iov_iter_rw(iter) == READ;
> +	dio->should_dirty = user_backed_iter(iter) && iov_iter_rw(iter) == READ;
>  	sdio.iter = iter;
>  	sdio.final_block_in_request = end >> blkbits;
>  
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 0e537e580dc1..8d657c2cd6f7 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -1356,7 +1356,7 @@ static ssize_t fuse_dev_read(struct kiocb *iocb, struct iov_iter *to)
>  	if (!fud)
>  		return -EPERM;
>  
> -	if (!iter_is_iovec(to))
> +	if (!user_backed_iter(to))
>  		return -EINVAL;
>  
>  	fuse_copy_init(&cs, 1, to);
> @@ -1949,7 +1949,7 @@ static ssize_t fuse_dev_write(struct kiocb *iocb, struct iov_iter *from)
>  	if (!fud)
>  		return -EPERM;
>  
> -	if (!iter_is_iovec(from))
> +	if (!user_backed_iter(from))
>  		return -EINVAL;
>  
>  	fuse_copy_init(&cs, 0, from);
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 00fa861aeead..c982e3afe3b4 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1465,7 +1465,7 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
>  			inode_unlock(inode);
>  	}
>  
> -	io->should_dirty = !write && iter_is_iovec(iter);
> +	io->should_dirty = !write && user_backed_iter(iter);
>  	while (count) {
>  		ssize_t nres;
>  		fl_owner_t owner = current->files;
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 2cceb193dcd8..48e6cc74fdc1 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -780,7 +780,7 @@ static inline bool should_fault_in_pages(struct iov_iter *i,
>  
>  	if (!count)
>  		return false;
> -	if (!iter_is_iovec(i))
> +	if (!user_backed_iter(i))
>  		return false;
>  
>  	size = PAGE_SIZE;
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 31c7f1035b20..d5c7d019653b 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -533,7 +533,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  			iomi.flags |= IOMAP_NOWAIT;
>  		}
>  
> -		if (iter_is_iovec(iter))
> +		if (user_backed_iter(iter))
>  			dio->flags |= IOMAP_DIO_DIRTY;
>  	} else {
>  		iomi.flags |= IOMAP_WRITE;
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index 4eb2a8380a28..022e1ce63e62 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -478,7 +478,7 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter,
>  	if (!is_sync_kiocb(iocb))
>  		dreq->iocb = iocb;
>  
> -	if (iter_is_iovec(iter))
> +	if (user_backed_iter(iter))
>  		dreq->flags = NFS_ODIRECT_SHOULD_DIRTY;
>  
>  	if (!swap)
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 76d305f3d4c2..6ab4260c3d6c 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -26,6 +26,7 @@ enum iter_type {
>  	ITER_PIPE,
>  	ITER_XARRAY,
>  	ITER_DISCARD,
> +	ITER_UBUF,
>  };
>  
>  struct iov_iter_state {
> @@ -38,6 +39,7 @@ struct iov_iter {
>  	u8 iter_type;
>  	bool nofault;
>  	bool data_source;
> +	bool user_backed;
>  	size_t iov_offset;
>  	size_t count;
>  	union {
> @@ -46,6 +48,7 @@ struct iov_iter {
>  		const struct bio_vec *bvec;
>  		struct xarray *xarray;
>  		struct pipe_inode_info *pipe;
> +		void __user *ubuf;
>  	};
>  	union {
>  		unsigned long nr_segs;
> @@ -70,6 +73,11 @@ static inline void iov_iter_save_state(struct iov_iter *iter,
>  	state->nr_segs = iter->nr_segs;
>  }
>  
> +static inline bool iter_is_ubuf(const struct iov_iter *i)
> +{
> +	return iov_iter_type(i) == ITER_UBUF;
> +}
> +
>  static inline bool iter_is_iovec(const struct iov_iter *i)
>  {
>  	return iov_iter_type(i) == ITER_IOVEC;
> @@ -105,6 +113,11 @@ static inline unsigned char iov_iter_rw(const struct iov_iter *i)
>  	return i->data_source ? WRITE : READ;
>  }
>  
> +static inline bool user_backed_iter(const struct iov_iter *i)
> +{
> +	return i->user_backed;
> +}
> +
>  /*
>   * Total number of bytes covered by an iovec.
>   *
> @@ -320,4 +333,17 @@ ssize_t __import_iovec(int type, const struct iovec __user *uvec,
>  int import_single_range(int type, void __user *buf, size_t len,
>  		 struct iovec *iov, struct iov_iter *i);
>  
> +static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
> +			void __user *buf, size_t count)
> +{
> +	WARN_ON(direction & ~(READ | WRITE));
> +	*i = (struct iov_iter) {
> +		.iter_type = ITER_UBUF,
> +		.user_backed = true,
> +		.data_source = direction,
> +		.ubuf = buf,
> +		.count = count
> +	};
> +}
> +
>  #endif
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 4c658a25e29c..8275b28e886b 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -16,6 +16,16 @@
>  
>  #define PIPE_PARANOIA /* for now */
>  
> +/* covers ubuf and kbuf alike */
> +#define iterate_buf(i, n, base, len, off, __p, STEP) {		\
> +	size_t __maybe_unused off = 0;				\
> +	len = n;						\
> +	base = __p + i->iov_offset;				\
> +	len -= (STEP);						\
> +	i->iov_offset += len;					\
> +	n = len;						\
> +}
> +
>  /* covers iovec and kvec alike */
>  #define iterate_iovec(i, n, base, len, off, __p, STEP) {	\
>  	size_t off = 0;						\
> @@ -110,7 +120,12 @@ __out:								\
>  	if (unlikely(i->count < n))				\
>  		n = i->count;					\
>  	if (likely(n)) {					\
> -		if (likely(iter_is_iovec(i))) {			\
> +		if (likely(iter_is_ubuf(i))) {			\
> +			void __user *base;			\
> +			size_t len;				\
> +			iterate_buf(i, n, base, len, off,	\
> +						i->ubuf, (I)) 	\
> +		} else if (likely(iter_is_iovec(i))) {		\
>  			const struct iovec *iov = i->iov;	\
>  			void __user *base;			\
>  			size_t len;				\
> @@ -275,7 +290,11 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
>   */
>  size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size)
>  {
> -	if (iter_is_iovec(i)) {
> +	if (iter_is_ubuf(i)) {
> +		size_t n = min(size, iov_iter_count(i));
> +		n -= fault_in_readable(i->ubuf + i->iov_offset, n);
> +		return size - n;
> +	} else if (iter_is_iovec(i)) {
>  		size_t count = min(size, iov_iter_count(i));
>  		const struct iovec *p;
>  		size_t skip;
> @@ -314,7 +333,11 @@ EXPORT_SYMBOL(fault_in_iov_iter_readable);
>   */
>  size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size)
>  {
> -	if (iter_is_iovec(i)) {
> +	if (iter_is_ubuf(i)) {
> +		size_t n = min(size, iov_iter_count(i));
> +		n -= fault_in_safe_writeable(i->ubuf + i->iov_offset, n);
> +		return size - n;
> +	} else if (iter_is_iovec(i)) {
>  		size_t count = min(size, iov_iter_count(i));
>  		const struct iovec *p;
>  		size_t skip;
> @@ -345,6 +368,7 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
>  	*i = (struct iov_iter) {
>  		.iter_type = ITER_IOVEC,
>  		.nofault = false,
> +		.user_backed = true,
>  		.data_source = direction,
>  		.iov = iov,
>  		.nr_segs = nr_segs,
> @@ -494,7 +518,7 @@ size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
>  {
>  	if (unlikely(iov_iter_is_pipe(i)))
>  		return copy_pipe_to_iter(addr, bytes, i);
> -	if (iter_is_iovec(i))
> +	if (user_backed_iter(i))
>  		might_fault();
>  	iterate_and_advance(i, bytes, base, len, off,
>  		copyout(base, addr + off, len),
> @@ -576,7 +600,7 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
>  {
>  	if (unlikely(iov_iter_is_pipe(i)))
>  		return copy_mc_pipe_to_iter(addr, bytes, i);
> -	if (iter_is_iovec(i))
> +	if (user_backed_iter(i))
>  		might_fault();
>  	__iterate_and_advance(i, bytes, base, len, off,
>  		copyout_mc(base, addr + off, len),
> @@ -594,7 +618,7 @@ size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
>  		WARN_ON(1);
>  		return 0;
>  	}
> -	if (iter_is_iovec(i))
> +	if (user_backed_iter(i))
>  		might_fault();
>  	iterate_and_advance(i, bytes, base, len, off,
>  		copyin(addr + off, base, len),
> @@ -882,16 +906,16 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
>  {
>  	if (unlikely(i->count < size))
>  		size = i->count;
> -	if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i))) {
> +	if (likely(iter_is_ubuf(i)) || unlikely(iov_iter_is_xarray(i))) {
> +		i->iov_offset += size;
> +		i->count -= size;
> +	} else if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i))) {
>  		/* iovec and kvec have identical layouts */
>  		iov_iter_iovec_advance(i, size);
>  	} else if (iov_iter_is_bvec(i)) {
>  		iov_iter_bvec_advance(i, size);
>  	} else if (iov_iter_is_pipe(i)) {
>  		pipe_advance(i, size);
> -	} else if (unlikely(iov_iter_is_xarray(i))) {
> -		i->iov_offset += size;
> -		i->count -= size;
>  	} else if (iov_iter_is_discard(i)) {
>  		i->count -= size;
>  	}
> @@ -938,7 +962,7 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
>  		return;
>  	}
>  	unroll -= i->iov_offset;
> -	if (iov_iter_is_xarray(i)) {
> +	if (iov_iter_is_xarray(i) || iter_is_ubuf(i)) {
>  		BUG(); /* We should never go beyond the start of the specified
>  			* range since we might then be straying into pages that
>  			* aren't pinned.
> @@ -1129,6 +1153,13 @@ static unsigned long iov_iter_alignment_bvec(const struct iov_iter *i)
>  
>  unsigned long iov_iter_alignment(const struct iov_iter *i)
>  {
> +	if (likely(iter_is_ubuf(i))) {
> +		size_t size = i->count;
> +		if (size)
> +			return ((unsigned long)i->ubuf + i->iov_offset) | size;
> +		return 0;
> +	}
> +
>  	/* iovec and kvec have identical layouts */
>  	if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i)))
>  		return iov_iter_alignment_iovec(i);
> @@ -1159,6 +1190,9 @@ unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
>  	size_t size = i->count;
>  	unsigned k;
>  
> +	if (iter_is_ubuf(i))
> +		return 0;
> +
>  	if (WARN_ON(!iter_is_iovec(i)))
>  		return ~0U;
>  
> @@ -1287,7 +1321,19 @@ static ssize_t iter_xarray_get_pages(struct iov_iter *i,
>  	return actual;
>  }
>  
> -/* must be done on non-empty ITER_IOVEC one */
> +static unsigned long found_ubuf_segment(unsigned long addr,
> +					size_t len,
> +					size_t *size, size_t *start,
> +					unsigned maxpages)
> +{
> +	len += (*start = addr % PAGE_SIZE);

Ugh, I know you just copy-pasted this but can we rewrite this to:

	*start = addr % PAGE_SIZE;
	len += *start;

I think that's easier to read.


* Re: [PATCH 11/44] iov_iter_bvec_advance(): don't bother with bvec_iter
  2022-06-22  4:15   ` [PATCH 11/44] iov_iter_bvec_advance(): don't bother with bvec_iter Al Viro
  2022-06-27 18:48     ` Jeff Layton
@ 2022-06-28 12:40     ` Christian Brauner
  1 sibling, 0 replies; 118+ messages in thread
From: Christian Brauner @ 2022-06-28 12:40 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Jens Axboe, Christoph Hellwig,
	Matthew Wilcox, David Howells, Dominique Martinet

On Wed, Jun 22, 2022 at 05:15:19AM +0100, Al Viro wrote:
> do what we do for iovec/kvec; that ends up generating better code,
> AFAICS.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Looks good to me,
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>


* Re: [PATCH 12/44] fix short copy handling in copy_mc_pipe_to_iter()
  2022-06-22  4:15   ` [PATCH 12/44] fix short copy handling in copy_mc_pipe_to_iter() Al Viro
  2022-06-27 19:15     ` Jeff Layton
@ 2022-06-28 12:42     ` Christian Brauner
  1 sibling, 0 replies; 118+ messages in thread
From: Christian Brauner @ 2022-06-28 12:42 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Jens Axboe, Christoph Hellwig,
	Matthew Wilcox, David Howells, Dominique Martinet

On Wed, Jun 22, 2022 at 05:15:20AM +0100, Al Viro wrote:
> Unlike other copying operations on ITER_PIPE, copy_mc_to_iter() can
> result in a short copy.  In that case we need to trim the unused
> buffers, as well as the length of partially filled one - it's not
> enough to set ->head, ->iov_offset and ->count to reflect how
> much we had copied.  Not hard to fix, fortunately...
> 
> I'd put a helper (pipe_discard_from(pipe, head)) into pipe_fs_i.h,
> rather than iov_iter.c - it has nothing to do with iov_iter and
> having it will allow us to avoid an ugly kludge in fs/splice.c.
> We could put it into lib/iov_iter.c for now and move it later,
> but I don't see the point going that way...
> 
> Fixes: ca146f6f091e "lib/iov_iter: Fix pipe handling in _copy_to_iter_mcsafe()"
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Does that need a

CC: stable@kernel.org # 4.19+

or something?

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>


* Re: [PATCH 13/44] splice: stop abusing iov_iter_advance() to flush a pipe
  2022-06-22  4:15   ` [PATCH 13/44] splice: stop abusing iov_iter_advance() to flush a pipe Al Viro
  2022-06-27 19:17     ` Jeff Layton
@ 2022-06-28 12:43     ` Christian Brauner
  1 sibling, 0 replies; 118+ messages in thread
From: Christian Brauner @ 2022-06-28 12:43 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Jens Axboe, Christoph Hellwig,
	Matthew Wilcox, David Howells, Dominique Martinet

On Wed, Jun 22, 2022 at 05:15:21AM +0100, Al Viro wrote:
> Use pipe_discard_from() explicitly in generic_file_read_iter(); don't bother
> with rather non-obvious use of iov_iter_advance() in there.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Looks good to me,
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>


* Re: [PATCH 14/44] ITER_PIPE: helper for getting pipe buffer by index
  2022-06-22  4:15   ` [PATCH 14/44] ITER_PIPE: helper for getting pipe buffer by index Al Viro
  2022-06-28 10:38     ` Jeff Layton
@ 2022-06-28 12:45     ` Christian Brauner
  1 sibling, 0 replies; 118+ messages in thread
From: Christian Brauner @ 2022-06-28 12:45 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Jens Axboe, Christoph Hellwig,
	Matthew Wilcox, David Howells, Dominique Martinet

On Wed, Jun 22, 2022 at 05:15:22AM +0100, Al Viro wrote:
> pipe_buffer instances of a pipe are organized as a ring buffer,
> with power-of-2 size.  Indices are kept *not* reduced modulo ring
> size, so the buffer referred to by index N is
> 	pipe->bufs[N & (pipe->ring_size - 1)].
> 
> Ring size can change over the lifetime of a pipe, but not while
> the pipe is locked.  So for any iov_iter primitives it's a constant.
> Original conversion of pipes to this layout went overboard trying
> to microoptimize that - calculating pipe->ring_size - 1, storing
> it in a local variable and using it throughout the function.  In some
> cases it might be warranted, but most of the time it only
> obfuscates what's going on in there.
> 
> Introduce a helper (pipe_buf(pipe, N)) that would encapsulate
> that and use it in the obvious cases.  More will follow...
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---

Looks good to me,
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>


* Re: [PATCH 08/44] copy_page_{to,from}_iter(): switch iovec variants to generic
  2022-06-28 12:32     ` Christian Brauner
@ 2022-06-28 18:36       ` Al Viro
  0 siblings, 0 replies; 118+ messages in thread
From: Al Viro @ 2022-06-28 18:36 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Linus Torvalds, Jens Axboe, Christoph Hellwig,
	Matthew Wilcox, David Howells, Dominique Martinet

On Tue, Jun 28, 2022 at 02:32:05PM +0200, Christian Brauner wrote:
> On Wed, Jun 22, 2022 at 05:15:16AM +0100, Al Viro wrote:
> > we can do copyin/copyout under kmap_local_page(); it shouldn't overflow
> > the kmap stack - the maximal footprint increases only by one here.
> > 
> > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> > ---
> 
> Assuming the WARN_ON(1) removals are intentional,
> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Deliberate - it shouldn't be any different from what _copy_to_iter() and
_copy_from_iter() are ready to handle.


* Re: [PATCH 09/44] new iov_iter flavour - ITER_UBUF
  2022-06-27 18:47     ` Jeff Layton
@ 2022-06-28 18:41       ` Al Viro
  0 siblings, 0 replies; 118+ messages in thread
From: Al Viro @ 2022-06-28 18:41 UTC (permalink / raw)
  To: Jeff Layton
  Cc: linux-fsdevel, Linus Torvalds, Jens Axboe, Christoph Hellwig,
	Matthew Wilcox, David Howells, Dominique Martinet,
	Christian Brauner

On Mon, Jun 27, 2022 at 02:47:03PM -0400, Jeff Layton wrote:
 
> The code looks reasonable but is there any real benefit here? It seems
> like the only user of it so far is new_sync_{read,write}, and both seem
> to just use it to avoid allocating a single iovec on the stack.

Not really - for one thing, it's less overhead in data-copying primitives,
for another... Jens had plans for it as well.  It's not as simple as "just
use it whenever you are asked for a single-segment iovec", but...


* Re: [PATCH 09/44] new iov_iter flavour - ITER_UBUF
  2022-06-28 12:38     ` Christian Brauner
@ 2022-06-28 18:44       ` Al Viro
  0 siblings, 0 replies; 118+ messages in thread
From: Al Viro @ 2022-06-28 18:44 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Linus Torvalds, Jens Axboe, Christoph Hellwig,
	Matthew Wilcox, David Howells, Dominique Martinet

On Tue, Jun 28, 2022 at 02:38:55PM +0200, Christian Brauner wrote:

> > -/* must be done on non-empty ITER_IOVEC one */
> > +static unsigned long found_ubuf_segment(unsigned long addr,
> > +					size_t len,
> > +					size_t *size, size_t *start,
> > +					unsigned maxpages)
> > +{
> > +	len += (*start = addr % PAGE_SIZE);
> 
> Ugh, I know you just copy-pasted this but can we rewrite this to:
> 
> 	*start = addr % PAGE_SIZE;
> 	len += *start;
> 
> I think that's easier to read.

Dealt with later in the series (around the unification and cleanups
of iov_iter_get_pages/iov_iter_get_pages_alloc).  We could do that
first, but I'd rather not mix that massage in here.


* [block.git conflicts] Re: [PATCH 37/44] block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  2022-06-22  4:15   ` [PATCH 37/44] block: convert to " Al Viro
  2022-06-28 12:16     ` Jeff Layton
@ 2022-06-30 22:11     ` Al Viro
  2022-06-30 22:39       ` Al Viro
  2022-06-30 23:07       ` Jens Axboe
  2022-07-10 18:04     ` Sedat Dilek
  2 siblings, 2 replies; 118+ messages in thread
From: Al Viro @ 2022-06-30 22:11 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner,
	Keith Busch

On Wed, Jun 22, 2022 at 05:15:45AM +0100, Al Viro wrote:
> ... doing revert if we end up not using some pages
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

... and the first half of that thing conflicts with "block: relax direct
io memory alignment" in -next...

Joy.  It's not hard to redo on top of the commit in there; the
question is, how to deal with conflicts?

I can do a backmerge, provided that there's a sane tag or branch to
backmerge from.  Another fun (if trivial) issue in the same series
is around "iov: introduce iov_iter_aligned" (two commits prior).

Jens, Keith, do you have any suggestions?  AFAICS, variants include
	* tag or branch covering b1a000d3b8ec582da64bb644be633e5a0beffcbf
(I'd rather not grab the entire for-5.20/block for obvious reasons)
It sits in the beginning of for-5.20/block, so that should be fairly
straightforward, provided that you are not going to do rebases there.
If you are, could you put that stuff into an invariant branch, so
I'd just pull it?
	* feeding the entire iov_iter pile through block.git;
bad idea, IMO, seeing that it contains a lot of stuff far from
anything block-related. 
	* doing a manual conflict resolution on top of my branch
and pushing that out.  Would get rid of the problem from -next, but
Linus hates that kind of stuff, AFAIK, and with good reasons.

	I would prefer the first variant (and that's what I'm
going to do locally for now - just
git tag keith_stuff bf8d08532bc19a14cfb54ae61099dccadefca446
and backmerge from it), but if you would prefer to deal with that
differently - please tell.


* Re: [block.git conflicts] Re: [PATCH 37/44] block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  2022-06-30 22:11     ` [block.git conflicts] " Al Viro
@ 2022-06-30 22:39       ` Al Viro
  2022-07-01  2:07         ` Keith Busch
  2022-06-30 23:07       ` Jens Axboe
  1 sibling, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-06-30 22:39 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner,
	linux-fsdevel

On Thu, Jun 30, 2022 at 11:11:27PM +0100, Al Viro wrote:

> ... and the first half of that thing conflicts with "block: relax direct
> io memory alignment" in -next...

BTW, looking at that commit - are you sure that bio_put_pages() on failure
exit will do the right thing?  We have grabbed a bunch of page references;
> the amount is DIV_ROUND_UP(offset + size, PAGE_SIZE).  And that's before
your
                size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
in there.  IMO the following would be more obviously correct:
        size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
        if (unlikely(size <= 0))
                return size ? size : -EFAULT;

	nr_pages = DIV_ROUND_UP(size + offset, PAGE_SIZE);
	size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));

        for (left = size, i = 0; left > 0; left -= len, i++) {
...
                if (ret) {
			while (i < nr_pages)
				put_page(pages[i++]);
                        return ret;
                }
...

and get rid of bio_put_pages() entirely.  Objections?


* Re: [block.git conflicts] Re: [PATCH 37/44] block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  2022-06-30 22:11     ` [block.git conflicts] " Al Viro
  2022-06-30 22:39       ` Al Viro
@ 2022-06-30 23:07       ` Jens Axboe
  1 sibling, 0 replies; 118+ messages in thread
From: Jens Axboe @ 2022-06-30 23:07 UTC (permalink / raw)
  To: Al Viro, linux-fsdevel
  Cc: Linus Torvalds, Christoph Hellwig, Matthew Wilcox, David Howells,
	Dominique Martinet, Christian Brauner, Keith Busch

On 6/30/22 4:11 PM, Al Viro wrote:
> On Wed, Jun 22, 2022 at 05:15:45AM +0100, Al Viro wrote:
>> ... doing revert if we end up not using some pages
>>
>> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> 
> ... and the first half of that thing conflicts with "block: relax direct
> io memory alignment" in -next...
> 
> Joy.  It's not hard to redo on top of the commit in there; the
> question is, how to deal with conflicts?
> 
> I can do a backmerge, provided that there's a sane tag or branch to
> backmerge from.  Another fun (if trivial) issue in the same series
> is around "iov: introduce iov_iter_aligned" (two commits prior).
> 
> Jens, Keith, do you have any suggestions?  AFAICS, variants include
> 	* tag or branch covering b1a000d3b8ec582da64bb644be633e5a0beffcbf
> (I'd rather not grab the entire for-5.20/block for obvious reasons)
> It sits in the beginning of for-5.20/block, so that should be fairly
> straightforward, provided that you are not going to do rebases there.
> If you are, could you put that stuff into an invariant branch, so
> I'd just pull it?
> 	* feeding the entire iov_iter pile through block.git;
> bad idea, IMO, seeing that it contains a lot of stuff far from
> anything block-related. 
> 	* doing a manual conflict resolution on top of my branch
> and pushing that out.  Would get rid of the problem from -next, but
> Linus hates that kind of stuff, AFAIK, and with good reasons.
> 
> 	I would prefer the first variant (and that's what I'm
> going to do locally for now - just
> git tag keith_stuff bf8d08532bc19a14cfb54ae61099dccadefca446
> and backmerge from it), but if you would prefer to deal with that
> differently - please tell.

I'm not going to rebase it, and I can create a tag for that commit for
you. Done, it's block-5.20-al. I tagged the former commit; or we can move
the tag so it includes bf8d08532bc19a14cfb54ae61099dccadefca446? That'd
be the whole series of that patchset, which is just that one extra
patch.

-- 
Jens Axboe



* Re: [block.git conflicts] Re: [PATCH 37/44] block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  2022-06-30 22:39       ` Al Viro
@ 2022-07-01  2:07         ` Keith Busch
  2022-07-01 17:40           ` Al Viro
  0 siblings, 1 reply; 118+ messages in thread
From: Keith Busch @ 2022-07-01  2:07 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner,
	linux-fsdevel

On Thu, Jun 30, 2022 at 11:39:36PM +0100, Al Viro wrote:
> On Thu, Jun 30, 2022 at 11:11:27PM +0100, Al Viro wrote:
> 
> > ... and the first half of that thing conflicts with "block: relax direct
> > io memory alignment" in -next...
> 
> BTW, looking at that commit - are you sure that bio_put_pages() on failure
> exit will do the right thing?  We have grabbed a bunch of page references;
> the amount is DIV_ROUND_UP(offset + size, PAGE_SIZE).  And that's before
> your
>                 size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));

Thanks for the catch, it does look like a page reference could get leaked here.

> in there.  IMO the following would be more obviously correct:
>         size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
>         if (unlikely(size <= 0))
>                 return size ? size : -EFAULT;
> 
> 	nr_pages = DIV_ROUND_UP(size + offset, PAGE_SIZE);
> 	size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
> 
>         for (left = size, i = 0; left > 0; left -= len, i++) {
> ...
>                 if (ret) {
> 			while (i < nr_pages)
> 				put_page(pages[i++]);
>                         return ret;
>                 }
> ...
> 
> and get rid of bio_put_pages() entirely.  Objections?


I think that makes sense. I'll give your idea a test run tomorrow.


* Re: [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full()
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (42 preceding siblings ...)
  2022-06-22  4:15   ` [PATCH 44/44] expand those iov_iter_advance() Al Viro
@ 2022-07-01  6:21   ` Dominique Martinet
  2022-07-01  6:25   ` Dominique Martinet
  2022-08-01 12:42   ` [PATCH 09/44] new iov_iter flavour - ITER_UBUF David Howells
  45 siblings, 0 replies; 118+ messages in thread
From: Dominique Martinet @ 2022-07-01  6:21 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Jens Axboe, Christoph Hellwig,
	Matthew Wilcox, David Howells, Christian Brauner

+Christian Schoenebeck in Ccs as that concerns qemu as well.

The patch I'm replying to is at
https://lkml.kernel.org/r/20220622041552.737754-1-viro@zeniv.linux.org.uk

Al Viro wrote on Wed, Jun 22, 2022 at 05:15:09AM +0100:
>         p9_client_zc_rpc()/p9_check_zc_errors() are playing fast
> and loose with copy_from_iter_full().
> 
> 	Reading from file is done by sending Tread request.  Response
> consists of fixed-sized header (including the amount of data actually
> read) followed by the data itself.
> 
> 	For zero-copy case we arrange the things so that the first
> 11 bytes of reply go into the fixed-sized buffer, with the rest going
> straight into the pages we want to read into.
> 
> 	What makes the things inconvenient is that sglist describing
> what should go where has to be set *before* the reply arrives.  As
> the result, if reply is an error, the things get interesting.  On success
> we get
> 	size[4] Rread tag[2] count[4] data[count]
> For error layout varies depending upon the protocol variant -
> in original 9P and 9P2000 it's
> 	size[4] Rerror tag[2] len[2] error[len]
> in 9P2000.U
> 	size[4] Rerror tag[2] len[2] error[len] errno[4]
> in 9P2000.L
> 	size[4] Rlerror tag[2] errno[4]
> 
> 	The last case is nice and simple - we have an 11-byte response
> that fits into the fixed-sized buffer we hoped to get an Rread into.
> In other two, though, we get a variable-length string spill into the
> pages we'd prepared for the data to be read.
> 
> 	Had that been in fixed-sized buffer (which is actually 4K),
> we would've dealt with that the same way we handle non-zerocopy case.
> However, for zerocopy it doesn't end up there, so we need to copy it
> from those pages.
> 
> 	The trouble is, by the time we get around to that, the
> references to pages in question are already dropped.  As the result,
> p9_zc_check_errors() tries to get the data using copy_from_iter_full().
> Unfortunately, the iov_iter it's trying to read from might *NOT* be
> capable of that.  It is, after all, a data destination, not data source.
> In particular, if it's an ITER_PIPE one, copy_from_iter_full() will
> simply fail.
> 
> 	In ->zc_request() itself we do have those pages and dealing with
> the problem in there would be a simple matter of memcpy_from_page()
> into the fixed-sized buffer.  Moreover, it isn't hard to recognize
> the (rare) case when such copying is needed.  That way we get rid of
> p9_zc_check_errors() entirely - p9_check_errors() can be used instead
> both for zero-copy and non-zero-copy cases.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

I ran basic tests with this, should be ok given the code path is never
used on normal (9p2000.L) workloads.


I also tried 9p2000.u on principle and... I have no idea if this works,
but it didn't seem to blow up there at least.
The problem is that 9p2000.u just doesn't work well even without these
patches, so I still stand by what I said about 9p2000.u and virtio (zc
interface): we really can (and I think should) just say virtio doesn't
support 9p2000.u.
(and could then further simplify this)

If you're curious, 9p2000.u hangs without your patch on at least two
different code paths (trying to read a huge buffer aborts sending a
reply because msize is too small instead of clamping it, that one has a
qemu warning message; but there are other ops like copyrange that just
fail silently and I didn't investigate)

I'd rather not fool someone into believing we support it when nobody has
time to maintain it and it fails almost immediately when user requests
some unusual IO patterns... And I definitely don't have time to even try
fixing it.
I'll suggest the same thing to qemu lists if we go that way.


Anyway, for anything useful:

Reviewed-by: Dominique Martinet <asmadeus@codewreck.org>
Tested-by: Dominique Martinet <asmadeus@codewreck.org>

--
Dominique


* Re: [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full()
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (43 preceding siblings ...)
  2022-07-01  6:21   ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Dominique Martinet
@ 2022-07-01  6:25   ` Dominique Martinet
  2022-07-01 16:02     ` Christian Schoenebeck
  2022-08-01 12:42   ` [PATCH 09/44] new iov_iter flavour - ITER_UBUF David Howells
  45 siblings, 1 reply; 118+ messages in thread
From: Dominique Martinet @ 2022-07-01  6:25 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Jens Axboe, Christoph Hellwig,
	Matthew Wilcox, David Howells, Christian Brauner,
	Christian Schoenebeck


(sigh, I'm tired -- said I'd add Christian in Ccs and promptly forgot to
do it. Sorry for double send to everyone else.)

+Christian Schoenebeck in Ccs as that concerns qemu as well.

The patch I'm replying to is at
https://lkml.kernel.org/r/20220622041552.737754-1-viro@zeniv.linux.org.uk

Al Viro wrote on Wed, Jun 22, 2022 at 05:15:09AM +0100:
>         p9_client_zc_rpc()/p9_check_zc_errors() are playing fast
> and loose with copy_from_iter_full().
> 
> 	Reading from file is done by sending Tread request.  Response
> consists of fixed-sized header (including the amount of data actually
> read) followed by the data itself.
> 
> 	For zero-copy case we arrange the things so that the first
> 11 bytes of reply go into the fixed-sized buffer, with the rest going
> straight into the pages we want to read into.
> 
> 	What makes the things inconvenient is that sglist describing
> what should go where has to be set *before* the reply arrives.  As
> the result, if reply is an error, the things get interesting.  On success
> we get
> 	size[4] Rread tag[2] count[4] data[count]
> For error layout varies depending upon the protocol variant -
> in original 9P and 9P2000 it's
> 	size[4] Rerror tag[2] len[2] error[len]
> in 9P2000.U
> 	size[4] Rerror tag[2] len[2] error[len] errno[4]
> in 9P2000.L
> 	size[4] Rlerror tag[2] errno[4]
> 
> 	The last case is nice and simple - we have an 11-byte response
> that fits into the fixed-sized buffer we hoped to get an Rread into.
> In other two, though, we get a variable-length string spill into the
> pages we'd prepared for the data to be read.
> 
> 	Had that been in fixed-sized buffer (which is actually 4K),
> we would've dealt with that the same way we handle non-zerocopy case.
> However, for zerocopy it doesn't end up there, so we need to copy it
> from those pages.
> 
> 	The trouble is, by the time we get around to that, the
> references to pages in question are already dropped.  As the result,
> p9_zc_check_errors() tries to get the data using copy_from_iter_full().
> Unfortunately, the iov_iter it's trying to read from might *NOT* be
> capable of that.  It is, after all, a data destination, not data source.
> In particular, if it's an ITER_PIPE one, copy_from_iter_full() will
> simply fail.
> 
> 	In ->zc_request() itself we do have those pages and dealing with
> the problem in there would be a simple matter of memcpy_from_page()
> into the fixed-sized buffer.  Moreover, it isn't hard to recognize
> the (rare) case when such copying is needed.  That way we get rid of
> p9_zc_check_errors() entirely - p9_check_errors() can be used instead
> both for zero-copy and non-zero-copy cases.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

I ran basic tests with this, should be ok given the code path is never
used on normal (9p2000.L) workloads.


I also tried 9p2000.u on principle and... I have no idea if this works,
but it didn't seem to blow up there at least.
The problem is that 9p2000.u just doesn't work well even without these
patches, so I still stand by what I said about 9p2000.u and virtio (zc
interface): we really can (and I think should) just say virtio doesn't
support 9p2000.u.
(and could then further simplify this)

If you're curious, 9p2000.u hangs without your patch on at least two
different code paths (trying to read a huge buffer aborts sending a
reply because msize is too small instead of clamping it, that one has a
qemu warning message; but there are other ops like copyrange that just
fail silently and I didn't investigate)

I'd rather not fool someone into believing we support it when nobody has
time to maintain it and it fails almost immediately when user requests
some unusual IO patterns... And I definitely don't have time to even try
fixing it.
I'll suggest the same thing to qemu lists if we go that way.


Anyway, for anything useful:

Reviewed-by: Dominique Martinet <asmadeus@codewreck.org>
Tested-by: Dominique Martinet <asmadeus@codewreck.org>

--
Dominique


* Re: [PATCH 40/44] 9p: convert to advancing variant of iov_iter_get_pages_alloc()
  2022-06-22  4:15   ` [PATCH 40/44] 9p: convert to advancing variant of iov_iter_get_pages_alloc() Al Viro
@ 2022-07-01  9:01     ` Dominique Martinet
  2022-07-01 13:47     ` Christian Schoenebeck
  1 sibling, 0 replies; 118+ messages in thread
From: Dominique Martinet @ 2022-07-01  9:01 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Jens Axboe, Christoph Hellwig,
	Matthew Wilcox, David Howells, Christian Brauner,
	Christian Schoenebeck

Al Viro wrote on Wed, Jun 22, 2022 at 05:15:48AM +0100:
> that one is somewhat clumsier than usual and needs serious testing.

code inspection looks good to me: we revert everywhere I think we need
to revert for read/write, and readdir doesn't need any special treatment.
I had a couple of nitpicks on debug messages, but that aside you can add
my R-b:

Reviewed-by: Dominique Martinet <asmadeus@codewreck.org>


Now for tests, though, I'm not quite sure what I'm supposed to run to
stress the error cases in a way that would actually let me detect a
failure... Basic stuff seems to work, but I don't think I ever got into
an error path where that matters -- forcing short reads, perhaps?

> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  net/9p/client.c       | 39 +++++++++++++++++++++++----------------
>  net/9p/protocol.c     |  3 +--
>  net/9p/trans_virtio.c |  3 ++-
>  3 files changed, 26 insertions(+), 19 deletions(-)
> 
> diff --git a/net/9p/client.c b/net/9p/client.c
> index d403085b9ef5..cb4324211561 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -1491,7 +1491,7 @@ p9_client_read_once(struct p9_fid *fid, u64 offset, struct iov_iter *to,
>  	struct p9_client *clnt = fid->clnt;
>  	struct p9_req_t *req;
>  	int count = iov_iter_count(to);
> -	int rsize, non_zc = 0;
> +	int rsize, received, non_zc = 0;
>  	char *dataptr;
>  
>  	*err = 0;
> @@ -1520,36 +1520,40 @@ p9_client_read_once(struct p9_fid *fid, u64 offset, struct iov_iter *to,
>  	}
>  	if (IS_ERR(req)) {
>  		*err = PTR_ERR(req);
> +		if (!non_zc)
> +			iov_iter_revert(to, count - iov_iter_count(to));
>  		return 0;
>  	}
>  
>  	*err = p9pdu_readf(&req->rc, clnt->proto_version,
> -			   "D", &count, &dataptr);
> +			   "D", &received, &dataptr);
>  	if (*err) {
> +		if (!non_zc)
> +			iov_iter_revert(to, count - iov_iter_count(to));
>  		trace_9p_protocol_dump(clnt, &req->rc);
>  		p9_tag_remove(clnt, req);
>  		return 0;
>  	}
> -	if (rsize < count) {
> -		pr_err("bogus RREAD count (%d > %d)\n", count, rsize);
> -		count = rsize;
> +	if (rsize < received) {
> +		pr_err("bogus RREAD count (%d > %d)\n", received, rsize);
> +		received = rsize;
>  	}
>  
>  	p9_debug(P9_DEBUG_9P, "<<< RREAD count %d\n", count);

This probably should be updated to received; we already know how much we
asked to read, what we want to see here is what the server replied with.

>  
>  	if (non_zc) {
> -		int n = copy_to_iter(dataptr, count, to);
> +		int n = copy_to_iter(dataptr, received, to);
>  
> -		if (n != count) {
> +		if (n != received) {
>  			*err = -EFAULT;
>  			p9_tag_remove(clnt, req);
>  			return n;
>  		}
>  	} else {
> -		iov_iter_advance(to, count);
> +		iov_iter_revert(to, count - received - iov_iter_count(to));
>  	}
>  	p9_tag_remove(clnt, req);
> -	return count;
> +	return received;
>  }
>  EXPORT_SYMBOL(p9_client_read_once);
>  
> @@ -1567,6 +1571,7 @@ p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err)
>  	while (iov_iter_count(from)) {
>  		int count = iov_iter_count(from);
>  		int rsize = fid->iounit;
> +		int written;
>  
>  		if (!rsize || rsize > clnt->msize - P9_IOHDRSZ)
>  			rsize = clnt->msize - P9_IOHDRSZ;
> @@ -1584,27 +1589,29 @@ p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err)
>  					    offset, rsize, from);
>  		}
>  		if (IS_ERR(req)) {
> +			iov_iter_revert(from, count - iov_iter_count(from));
>  			*err = PTR_ERR(req);
>  			break;
>  		}
>  
> -		*err = p9pdu_readf(&req->rc, clnt->proto_version, "d", &count);
> +		*err = p9pdu_readf(&req->rc, clnt->proto_version, "d", &written);
>  		if (*err) {
> +			iov_iter_revert(from, count - iov_iter_count(from));
>  			trace_9p_protocol_dump(clnt, &req->rc);
>  			p9_tag_remove(clnt, req);
>  			break;
>  		}
> -		if (rsize < count) {
> -			pr_err("bogus RWRITE count (%d > %d)\n", count, rsize);
> -			count = rsize;
> +		if (rsize < written) {
> +			pr_err("bogus RWRITE count (%d > %d)\n", written, rsize);
> +			written = rsize;
>  		}
>  
>  		p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count);

likewise, please make it dump written.

--
Dominique
>  
>  		p9_tag_remove(clnt, req);
> -		iov_iter_advance(from, count);
> -		total += count;
> -		offset += count;
> +		iov_iter_revert(from, count - written - iov_iter_count(from));
> +		total += written;
> +		offset += written;
>  	}
>  	return total;
>  }
> diff --git a/net/9p/protocol.c b/net/9p/protocol.c
> index 3754c33e2974..83694c631989 100644
> --- a/net/9p/protocol.c
> +++ b/net/9p/protocol.c
> @@ -63,9 +63,8 @@ static size_t
>  pdu_write_u(struct p9_fcall *pdu, struct iov_iter *from, size_t size)
>  {
>  	size_t len = min(pdu->capacity - pdu->size, size);
> -	struct iov_iter i = *from;
>  
> -	if (!copy_from_iter_full(&pdu->sdata[pdu->size], len, &i))
> +	if (!copy_from_iter_full(&pdu->sdata[pdu->size], len, from))
>  		len = 0;
>  
>  	pdu->size += len;
> diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
> index 2a210c2f8e40..1977d33475fe 100644
> --- a/net/9p/trans_virtio.c
> +++ b/net/9p/trans_virtio.c
> @@ -331,7 +331,7 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
>  			if (err == -ERESTARTSYS)
>  				return err;
>  		}
> -		n = iov_iter_get_pages_alloc(data, pages, count, offs);
> +		n = iov_iter_get_pages_alloc2(data, pages, count, offs);
>  		if (n < 0)
>  			return n;
>  		*need_drop = 1;
> @@ -373,6 +373,7 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
>  				(*pages)[index] = kmap_to_page(p);
>  			p += PAGE_SIZE;
>  		}
> +		iov_iter_advance(data, len);
>  		return len;
>  	}
>  }


* Re: [PATCH 40/44] 9p: convert to advancing variant of iov_iter_get_pages_alloc()
  2022-06-22  4:15   ` [PATCH 40/44] 9p: convert to advancing variant of iov_iter_get_pages_alloc() Al Viro
  2022-07-01  9:01     ` Dominique Martinet
@ 2022-07-01 13:47     ` Christian Schoenebeck
  2022-07-06 22:06       ` Christian Schoenebeck
  1 sibling, 1 reply; 118+ messages in thread
From: Christian Schoenebeck @ 2022-07-01 13:47 UTC (permalink / raw)
  To: linux-fsdevel, Al Viro, Dominique Martinet
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Christian Brauner

On Mittwoch, 22. Juni 2022 06:15:48 CEST Al Viro wrote:
> that one is somewhat clumsier than usual and needs serious testing.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Hi Al,

it took me a bit to find the patch that introduces
iov_iter_get_pages_alloc2(), but this patch itself looks fine:

Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>

Please give me some days for thorough testing. We recently had 9p broken (with
cache=loose) for half a year, so I would like to avoid repetition.

Best regards,
Christian Schoenebeck

> ---
>  net/9p/client.c       | 39 +++++++++++++++++++++++----------------
>  net/9p/protocol.c     |  3 +--
>  net/9p/trans_virtio.c |  3 ++-
>  3 files changed, 26 insertions(+), 19 deletions(-)
> 
> diff --git a/net/9p/client.c b/net/9p/client.c
> index d403085b9ef5..cb4324211561 100644
> --- a/net/9p/client.c
> +++ b/net/9p/client.c
> @@ -1491,7 +1491,7 @@ p9_client_read_once(struct p9_fid *fid, u64 offset, struct iov_iter *to,
>  	struct p9_client *clnt = fid->clnt;
>  	struct p9_req_t *req;
>  	int count = iov_iter_count(to);
> -	int rsize, non_zc = 0;
> +	int rsize, received, non_zc = 0;
>  	char *dataptr;
> 
>  	*err = 0;
> @@ -1520,36 +1520,40 @@ p9_client_read_once(struct p9_fid *fid, u64 offset, struct iov_iter *to,
>  	}
>  	if (IS_ERR(req)) {
>  		*err = PTR_ERR(req);
> +		if (!non_zc)
> +			iov_iter_revert(to, count - iov_iter_count(to));
>  		return 0;
>  	}
> 
>  	*err = p9pdu_readf(&req->rc, clnt->proto_version,
> -			   "D", &count, &dataptr);
> +			   "D", &received, &dataptr);
>  	if (*err) {
> +		if (!non_zc)
> +			iov_iter_revert(to, count - iov_iter_count(to));
>  		trace_9p_protocol_dump(clnt, &req->rc);
>  		p9_tag_remove(clnt, req);
>  		return 0;
>  	}
> -	if (rsize < count) {
> -		pr_err("bogus RREAD count (%d > %d)\n", count, rsize);
> -		count = rsize;
> +	if (rsize < received) {
> +		pr_err("bogus RREAD count (%d > %d)\n", received, rsize);
> +		received = rsize;
>  	}
> 
>  	p9_debug(P9_DEBUG_9P, "<<< RREAD count %d\n", count);
> 
>  	if (non_zc) {
> -		int n = copy_to_iter(dataptr, count, to);
> +		int n = copy_to_iter(dataptr, received, to);
> 
> -		if (n != count) {
> +		if (n != received) {
>  			*err = -EFAULT;
>  			p9_tag_remove(clnt, req);
>  			return n;
>  		}
>  	} else {
> -		iov_iter_advance(to, count);
> +		iov_iter_revert(to, count - received - iov_iter_count(to));
>  	}
>  	p9_tag_remove(clnt, req);
> -	return count;
> +	return received;
>  }
>  EXPORT_SYMBOL(p9_client_read_once);
> 
> @@ -1567,6 +1571,7 @@ p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err)
>  	while (iov_iter_count(from)) {
>  		int count = iov_iter_count(from);
>  		int rsize = fid->iounit;
> +		int written;
> 
>  		if (!rsize || rsize > clnt->msize - P9_IOHDRSZ)
>  			rsize = clnt->msize - P9_IOHDRSZ;
> @@ -1584,27 +1589,29 @@ p9_client_write(struct p9_fid *fid, u64 offset, struct iov_iter *from, int *err)
>  					    offset, rsize, from);
>  		}
>  		if (IS_ERR(req)) {
> +			iov_iter_revert(from, count - iov_iter_count(from));
>  			*err = PTR_ERR(req);
>  			break;
>  		}
> 
> -		*err = p9pdu_readf(&req->rc, clnt->proto_version, "d", &count);
> +		*err = p9pdu_readf(&req->rc, clnt->proto_version, "d", &written);
>  		if (*err) {
> +			iov_iter_revert(from, count - iov_iter_count(from));
>  			trace_9p_protocol_dump(clnt, &req->rc);
>  			p9_tag_remove(clnt, req);
>  			break;
>  		}
> -		if (rsize < count) {
> -			pr_err("bogus RWRITE count (%d > %d)\n", count, rsize);
> -			count = rsize;
> +		if (rsize < written) {
> +			pr_err("bogus RWRITE count (%d > %d)\n", written, rsize);
> +			written = rsize;
>  		}
> 
>  		p9_debug(P9_DEBUG_9P, "<<< RWRITE count %d\n", count);
> 
>  		p9_tag_remove(clnt, req);
> -		iov_iter_advance(from, count);
> -		total += count;
> -		offset += count;
> +		iov_iter_revert(from, count - written - iov_iter_count(from));
> +		total += written;
> +		offset += written;
>  	}
>  	return total;
>  }
> diff --git a/net/9p/protocol.c b/net/9p/protocol.c
> index 3754c33e2974..83694c631989 100644
> --- a/net/9p/protocol.c
> +++ b/net/9p/protocol.c
> @@ -63,9 +63,8 @@ static size_t
>  pdu_write_u(struct p9_fcall *pdu, struct iov_iter *from, size_t size)
>  {
>  	size_t len = min(pdu->capacity - pdu->size, size);
> -	struct iov_iter i = *from;
> 
> -	if (!copy_from_iter_full(&pdu->sdata[pdu->size], len, &i))
> +	if (!copy_from_iter_full(&pdu->sdata[pdu->size], len, from))
>  		len = 0;
> 
>  	pdu->size += len;
> diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
> index 2a210c2f8e40..1977d33475fe 100644
> --- a/net/9p/trans_virtio.c
> +++ b/net/9p/trans_virtio.c
> @@ -331,7 +331,7 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
>  			if (err == -ERESTARTSYS)
>  				return err;
>  		}
> -		n = iov_iter_get_pages_alloc(data, pages, count, offs);
> +		n = iov_iter_get_pages_alloc2(data, pages, count, offs);
>  		if (n < 0)
>  			return n;
>  		*need_drop = 1;
> @@ -373,6 +373,7 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
>  				(*pages)[index] = kmap_to_page(p);
>  			p += PAGE_SIZE;
>  		}
> +		iov_iter_advance(data, len);
>  		return len;
>  	}
>  }





* Re: [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full()
  2022-07-01  6:25   ` Dominique Martinet
@ 2022-07-01 16:02     ` Christian Schoenebeck
  2022-07-01 21:00       ` Dominique Martinet
  0 siblings, 1 reply; 118+ messages in thread
From: Christian Schoenebeck @ 2022-07-01 16:02 UTC (permalink / raw)
  To: Al Viro, Dominique Martinet
  Cc: linux-fsdevel, Linus Torvalds, Jens Axboe, Christoph Hellwig,
	Matthew Wilcox, David Howells, Christian Brauner

On Freitag, 1. Juli 2022 08:25:49 CEST Dominique Martinet wrote:
> (sigh, I'm tired -- said I'd add Christian in Ccs and promptly forgot to
> do it. Sorry for double send to everyone else.)
> 
> +Christian Schoenebeck in Ccs as that concerns qemu as well.
> 
> The patch I'm replying to is at
> https://lkml.kernel.org/r/20220622041552.737754-1-viro@zeniv.linux.org.uk
> 
> Al Viro wrote on Wed, Jun 22, 2022 at 05:15:09AM +0100:
> >         p9_client_zc_rpc()/p9_check_zc_errors() are playing fast
> > and loose with copy_from_iter_full().
> > 
> > 	Reading from file is done by sending Tread request.  Response
> > consists of fixed-sized header (including the amount of data actually
> > read) followed by the data itself.
> > 
> > 	For zero-copy case we arrange the things so that the first
> > 11 bytes of reply go into the fixed-sized buffer, with the rest going
> > straight into the pages we want to read into.
> > 
> > 	What makes the things inconvenient is that sglist describing
> > what should go where has to be set *before* the reply arrives.  As
> > the result, if reply is an error, the things get interesting.  On success
> > we get
> > 	size[4] Rread tag[2] count[4] data[count]
> > For error layout varies depending upon the protocol variant -
> > in original 9P and 9P2000 it's
> > 	size[4] Rerror tag[2] len[2] error[len]
> > in 9P2000.U
> > 	size[4] Rerror tag[2] len[2] error[len] errno[4]
> > in 9P2000.L
> > 	size[4] Rlerror tag[2] errno[4]
> > 
> > 	The last case is nice and simple - we have an 11-byte response
> > that fits into the fixed-sized buffer we hoped to get an Rread into.
> > In other two, though, we get a variable-length string spill into the
> > pages we'd prepared for the data to be read.
> > 
> > 	Had that been in fixed-sized buffer (which is actually 4K),
> > we would've dealt with that the same way we handle non-zerocopy case.
> > However, for zerocopy it doesn't end up there, so we need to copy it
> > from those pages.
> > 
> > 	The trouble is, by the time we get around to that, the
> > references to pages in question are already dropped.  As the result,
> > p9_zc_check_errors() tries to get the data using copy_from_iter_full().
> > Unfortunately, the iov_iter it's trying to read from might *NOT* be
> > capable of that.  It is, after all, a data destination, not data source.
> > In particular, if it's an ITER_PIPE one, copy_from_iter_full() will
> > simply fail.
> > 
> > 	In ->zc_request() itself we do have those pages and dealing with
> > the problem in there would be a simple matter of memcpy_from_page()
> > into the fixed-sized buffer.  Moreover, it isn't hard to recognize
> > the (rare) case when such copying is needed.  That way we get rid of
> > p9_zc_check_errors() entirely - p9_check_errors() can be used instead
> > both for zero-copy and non-zero-copy cases.
> > 
> > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> 
> I ran basic tests with this, should be ok given the code path is never
> used on normal (9p2000.L) workloads.

I haven't read this patch in detail yet, but upfront: POSIX error strings are
like what, max. 128 bytes, no? So my expectation would be that this patch
could be further simplified.

Apart from that, I would rename handle_rerror() to something that better
reflects what it actually does, e.g. unsparse_error() or cp_rerror_to_sdata().

> I also tried 9p2000.u for principle and ... I have no idea if this works
> but it didn't seem to blow up there at least.
> The problem is that 9p2000.u just doesn't work well even without these
> patches, so I still stand by what I said about 9p2000.u and virtio (zc
> interface): we really can (and I think should) just say virtio doesn't
> support 9p2000.u.
> (and could then further simplify this)
>
> If you're curious, 9p2000.u hangs without your patch on at least two
> different code paths (trying to read a huge buffer aborts sending a
> reply because msize is too small instead of clamping it, that one has a
> qemu warning message; but there are other ops like copyrange that just
> fail silently and I didn't investigate)

Last time I tested 9p2000.u was with the "remove msize limit" (WIP) patches:
https://lore.kernel.org/all/cover.1640870037.git.linux_oss@crudebyte.com/
Where I did not observe any issue with 9p2000.u.

What msize are we talking about, or can you tell a way to reproduce?

> I'd rather not fool someone into believing we support it when nobody has
> time to maintain it and it fails almost immediately when user requests
> some unusual IO patterns... And I definitely don't have time to even try
> fixing it.
> I'll suggest the same thing to qemu lists if we go that way.

Yeah, the situation with 9p2000.u in QEMU is similar in the sense that
9p2000.u is barely used, gets few contributions, and the code is not in good
shape (e.g. slower in many aspects in comparison to 9p2000.L); for that
reason I discussed with Greg deprecating 9p2000.u in QEMU (not done yet).
We are not aware of any serious issue with 9p2000.u though.

Best regards,
Christian Schoenebeck




* Re: [block.git conflicts] Re: [PATCH 37/44] block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  2022-07-01  2:07         ` Keith Busch
@ 2022-07-01 17:40           ` Al Viro
  2022-07-01 17:53             ` Keith Busch
  2022-07-01 21:30             ` Jens Axboe
  0 siblings, 2 replies; 118+ messages in thread
From: Al Viro @ 2022-07-01 17:40 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner,
	linux-fsdevel

On Thu, Jun 30, 2022 at 08:07:24PM -0600, Keith Busch wrote:
> On Thu, Jun 30, 2022 at 11:39:36PM +0100, Al Viro wrote:
> > On Thu, Jun 30, 2022 at 11:11:27PM +0100, Al Viro wrote:
> > 
> > > ... and the first half of that thing conflicts with "block: relax direct
> > > io memory alignment" in -next...
> > 
> > BTW, looking at that commit - are you sure that bio_put_pages() on failure
> > exit will do the right thing?  We have grabbed a bunch of page references;
> > the amount is DIV_ROUND_UP(offset + size, PAGE_SIZE).  And that's before
> > your
> >                 size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
> 
> Thanks for the catch, it does look like a page reference could get leaked here.
> 
> > in there.  IMO the following would be more obviously correct:
> >         size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> >         if (unlikely(size <= 0))
> >                 return size ? size : -EFAULT;
> > 
> > 	nr_pages = DIV_ROUND_UP(size + offset, PAGE_SIZE);
> > 	size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
> > 
> >         for (left = size, i = 0; left > 0; left -= len, i++) {
> > ...
> >                 if (ret) {
> > 			while (i < nr_pages)
> > 				put_page(pages[i++]);
> >                         return ret;
> >                 }
> > ...
> > 
> > and get rid of bio_put_pages() entirely.  Objections?
> 
> 
> I think that makes sense. I'll give your idea a test run tomorrow.

See vfs.git#block-fixes, along with #work.iov_iter_get_pages-3 in there.
Seems to work here...

If you are OK with #block-fixes (it's one commit on top of
bf8d08532bc1 "iomap: add support for dma aligned direct-io" in
block.git), the easiest way to deal with the conflicts would be
to have that branch pulled into block.git.  Jens, would you be
OK with that in terms of tree topology?  Provided that patch
itself looks sane to you, of course...

FWIW, the patch in question is
commit 863965bb7e52997851af3a107ec3e4d8c7050cbd
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Fri Jul 1 13:15:36 2022 -0400

    __bio_iov_iter_get_pages(): make sure we don't leak page refs on failure
    
    Calculate the number of pages we'd grabbed before trimming size down.
    And don't bother with bio_put_pages() - an explicit cleanup loop is
    easier to follow...
    
    Fixes: b1a000d3b8ec "block: relax direct io memory alignment"
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

diff --git a/block/bio.c b/block/bio.c
index 933ea3210954..59be4eca1192 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1151,14 +1151,6 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 	bio_set_flag(bio, BIO_CLONED);
 }
 
-static void bio_put_pages(struct page **pages, size_t size, size_t off)
-{
-	size_t i, nr = DIV_ROUND_UP(size + (off & ~PAGE_MASK), PAGE_SIZE);
-
-	for (i = 0; i < nr; i++)
-		put_page(pages[i]);
-}
-
 static int bio_iov_add_page(struct bio *bio, struct page *page,
 		unsigned int len, unsigned int offset)
 {
@@ -1228,11 +1220,11 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 * the iov data will be picked up in the next bio iteration.
 	 */
 	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
-	if (size > 0)
-		size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
+	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
 
+	size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
 	for (left = size, i = 0; left > 0; left -= len, i++) {
 		struct page *page = pages[i];
 		int ret;
@@ -1245,7 +1237,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 			ret = bio_iov_add_page(bio, page, len, offset);
 
 		if (ret) {
-			bio_put_pages(pages + i, left, offset);
+			while (i < nr_pages)
+				put_page(pages[i++]);
 			return ret;
 		}
 		offset = 0;


* Re: [block.git conflicts] Re: [PATCH 37/44] block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  2022-07-01 17:40           ` Al Viro
@ 2022-07-01 17:53             ` Keith Busch
  2022-07-01 18:07               ` Al Viro
  2022-07-01 21:30             ` Jens Axboe
  1 sibling, 1 reply; 118+ messages in thread
From: Keith Busch @ 2022-07-01 17:53 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner,
	linux-fsdevel

On Fri, Jul 01, 2022 at 06:40:40PM +0100, Al Viro wrote:
> -static void bio_put_pages(struct page **pages, size_t size, size_t off)
> -{
> -	size_t i, nr = DIV_ROUND_UP(size + (off & ~PAGE_MASK), PAGE_SIZE);
> -
> -	for (i = 0; i < nr; i++)
> -		put_page(pages[i]);
> -}
> -
>  static int bio_iov_add_page(struct bio *bio, struct page *page,
>  		unsigned int len, unsigned int offset)
>  {
> @@ -1228,11 +1220,11 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	 * the iov data will be picked up in the next bio iteration.
>  	 */
>  	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> -	if (size > 0)
> -		size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
>  	if (unlikely(size <= 0))
>  		return size ? size : -EFAULT;
> +	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
>  
> +	size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));

This isn't quite right. The result of the ALIGN_DOWN could be 0, so whatever
page we got before would be leaked since unused pages are only released on an
add_page error. I was about to reply with a patch that fixes this, but here's
the one that I'm currently testing:

---
diff --git a/block/bio.c b/block/bio.c
index 933ea3210954..c4a1ce39c65c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1151,14 +1151,6 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 	bio_set_flag(bio, BIO_CLONED);
 }
 
-static void bio_put_pages(struct page **pages, size_t size, size_t off)
-{
-	size_t i, nr = DIV_ROUND_UP(size + (off & ~PAGE_MASK), PAGE_SIZE);
-
-	for (i = 0; i < nr; i++)
-		put_page(pages[i]);
-}
-
 static int bio_iov_add_page(struct bio *bio, struct page *page,
 		unsigned int len, unsigned int offset)
 {
@@ -1208,9 +1200,10 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
+	unsigned len, i = 0;
 	ssize_t size, left;
-	unsigned len, i;
 	size_t offset;
+	int ret;
 
 	/*
 	 * Move page array up in the allocated memory for the bio vecs as far as
@@ -1228,14 +1221,19 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 * the iov data will be picked up in the next bio iteration.
 	 */
 	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
-	if (size > 0)
+	if (size > 0) {
+		nr_pages = DIV_ROUND_UP(size + offset, PAGE_SIZE);
 		size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
-	if (unlikely(size <= 0))
-		return size ? size : -EFAULT;
+	} else
+		nr_pages = 0;
+
+	if (unlikely(size <= 0)) {
+		ret = size ? size : -EFAULT;
+		goto out;
+	}
 
-	for (left = size, i = 0; left > 0; left -= len, i++) {
+	for (left = size; left > 0; left -= len, i++) {
 		struct page *page = pages[i];
-		int ret;
 
 		len = min_t(size_t, PAGE_SIZE - offset, left);
 		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
@@ -1244,15 +1242,19 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		else
 			ret = bio_iov_add_page(bio, page, len, offset);
 
-		if (ret) {
-			bio_put_pages(pages + i, left, offset);
-			return ret;
-		}
+		if (ret)
+			goto out;
 		offset = 0;
 	}
 
 	iov_iter_advance(iter, size);
-	return 0;
+out:
+	while (i < nr_pages) {
+		put_page(pages[i]);
+		i++;
+	}
+
+	return ret;
 }
 
 /**
--


* Re: [block.git conflicts] Re: [PATCH 37/44] block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  2022-07-01 17:53             ` Keith Busch
@ 2022-07-01 18:07               ` Al Viro
  2022-07-01 18:12                 ` Al Viro
  2022-07-01 19:05                 ` Keith Busch
  0 siblings, 2 replies; 118+ messages in thread
From: Al Viro @ 2022-07-01 18:07 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner,
	linux-fsdevel

On Fri, Jul 01, 2022 at 11:53:44AM -0600, Keith Busch wrote:
> On Fri, Jul 01, 2022 at 06:40:40PM +0100, Al Viro wrote:
> > -static void bio_put_pages(struct page **pages, size_t size, size_t off)
> > -{
> > -	size_t i, nr = DIV_ROUND_UP(size + (off & ~PAGE_MASK), PAGE_SIZE);
> > -
> > -	for (i = 0; i < nr; i++)
> > -		put_page(pages[i]);
> > -}
> > -
> >  static int bio_iov_add_page(struct bio *bio, struct page *page,
> >  		unsigned int len, unsigned int offset)
> >  {
> > @@ -1228,11 +1220,11 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> >  	 * the iov data will be picked up in the next bio iteration.
> >  	 */
> >  	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> > -	if (size > 0)
> > -		size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
> >  	if (unlikely(size <= 0))
> >  		return size ? size : -EFAULT;
> > +	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
> >  
> > +	size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
> 
> This isn't quite right. The result of the ALIGN_DOWN could be 0, so whatever
> page we got before would be leaked since unused pages are only released on an
> add_page error. I was about to reply with a patch that fixes this, but here's
> the one that I'm currently testing:

AFAICS, result is broken; you might end up consuming some data and leaving
iterator not advanced at all.  With no way for the caller to tell which way it
went.


* Re: [block.git conflicts] Re: [PATCH 37/44] block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  2022-07-01 18:07               ` Al Viro
@ 2022-07-01 18:12                 ` Al Viro
  2022-07-01 18:38                   ` Keith Busch
  2022-07-01 19:05                 ` Keith Busch
  1 sibling, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-07-01 18:12 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner,
	linux-fsdevel

On Fri, Jul 01, 2022 at 07:07:45PM +0100, Al Viro wrote:
> On Fri, Jul 01, 2022 at 11:53:44AM -0600, Keith Busch wrote:
> > On Fri, Jul 01, 2022 at 06:40:40PM +0100, Al Viro wrote:
> > > -static void bio_put_pages(struct page **pages, size_t size, size_t off)
> > > -{
> > > -	size_t i, nr = DIV_ROUND_UP(size + (off & ~PAGE_MASK), PAGE_SIZE);
> > > -
> > > -	for (i = 0; i < nr; i++)
> > > -		put_page(pages[i]);
> > > -}
> > > -
> > >  static int bio_iov_add_page(struct bio *bio, struct page *page,
> > >  		unsigned int len, unsigned int offset)
> > >  {
> > > @@ -1228,11 +1220,11 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> > >  	 * the iov data will be picked up in the next bio iteration.
> > >  	 */
> > >  	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> > > -	if (size > 0)
> > > -		size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
> > >  	if (unlikely(size <= 0))
> > >  		return size ? size : -EFAULT;
> > > +	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
> > >  
> > > +	size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
> > 
> > This isn't quite right. The result of the ALIGN_DOWN could be 0, so whatever
> > page we got before would be leaked since unused pages are only released on an
> > add_page error. I was about to reply with a patch that fixes this, but here's
> > the one that I'm currently testing:
> 
> AFAICS, result is broken; you might end up consuming some data and leaving
> iterator not advanced at all.  With no way for the caller to tell which way it
> went.

How about the following?

commit 5e3e9769404de54734c110b2040bdb93593e0f1b
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Fri Jul 1 13:15:36 2022 -0400

    __bio_iov_iter_get_pages(): make sure we don't leak page refs on failure
    
    Calculate the number of pages we'd grabbed before trimming size down.
    And don't bother with bio_put_pages() - an explicit cleanup loop is
    easier to follow...
    
    Fixes: b1a000d3b8ec "block: relax direct io memory alignment"
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

diff --git a/block/bio.c b/block/bio.c
index 933ea3210954..a9fe20cb71fe 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1151,14 +1151,6 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 	bio_set_flag(bio, BIO_CLONED);
 }
 
-static void bio_put_pages(struct page **pages, size_t size, size_t off)
-{
-	size_t i, nr = DIV_ROUND_UP(size + (off & ~PAGE_MASK), PAGE_SIZE);
-
-	for (i = 0; i < nr; i++)
-		put_page(pages[i]);
-}
-
 static int bio_iov_add_page(struct bio *bio, struct page *page,
 		unsigned int len, unsigned int offset)
 {
@@ -1211,6 +1203,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	ssize_t size, left;
 	unsigned len, i;
 	size_t offset;
+	int ret;
 
 	/*
 	 * Move page array up in the allocated memory for the bio vecs as far as
@@ -1228,14 +1221,13 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 * the iov data will be picked up in the next bio iteration.
 	 */
 	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
-	if (size > 0)
-		size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
+	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
 
+	size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
 	for (left = size, i = 0; left > 0; left -= len, i++) {
 		struct page *page = pages[i];
-		int ret;
 
 		len = min_t(size_t, PAGE_SIZE - offset, left);
 		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
@@ -1244,15 +1236,15 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		else
 			ret = bio_iov_add_page(bio, page, len, offset);
 
-		if (ret) {
-			bio_put_pages(pages + i, left, offset);
-			return ret;
-		}
+		if (ret)
+			break;
 		offset = 0;
 	}
+	while (i < nr_pages)
+		put_page(pages[i++]);
 
-	iov_iter_advance(iter, size);
-	return 0;
+	iov_iter_advance(iter, size - left);
+	return ret;
 }
 
 /**


* Re: [block.git conflicts] Re: [PATCH 37/44] block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  2022-07-01 18:12                 ` Al Viro
@ 2022-07-01 18:38                   ` Keith Busch
  2022-07-01 19:08                     ` Al Viro
  0 siblings, 1 reply; 118+ messages in thread
From: Keith Busch @ 2022-07-01 18:38 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner,
	linux-fsdevel

On Fri, Jul 01, 2022 at 07:12:17PM +0100, Al Viro wrote:
> On Fri, Jul 01, 2022 at 07:07:45PM +0100, Al Viro wrote:
> > > page we got before would be leaked since unused pages are only released on an
> > > add_page error. I was about to reply with a patch that fixes this, but here's
> > > the one that I'm currently testing:
> > 
> > AFAICS, result is broken; you might end up consuming some data and leaving
> > iterator not advanced at all.  With no way for the caller to tell which way it
> > went.

I think I see what you mean, though the issue with a non-advancing iterator on
a partially filled bio was happening prior to this patch.

> How about the following?

This looks close to what I was about to respond with. Just a couple issues
below:

> -static void bio_put_pages(struct page **pages, size_t size, size_t off)
> -{
> -	size_t i, nr = DIV_ROUND_UP(size + (off & ~PAGE_MASK), PAGE_SIZE);
> -
> -	for (i = 0; i < nr; i++)
> -		put_page(pages[i]);
> -}
> -
>  static int bio_iov_add_page(struct bio *bio, struct page *page,
>  		unsigned int len, unsigned int offset)
>  {
> @@ -1211,6 +1203,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	ssize_t size, left;
>  	unsigned len, i;
>  	size_t offset;
> +	int ret;

'ret' might never be initialized if 'size' aligns down to 0.

>  	/*
>  	 * Move page array up in the allocated memory for the bio vecs as far as
> @@ -1228,14 +1221,13 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	 * the iov data will be picked up in the next bio iteration.
>  	 */
>  	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> -	if (size > 0)
> -		size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
>  	if (unlikely(size <= 0))
>  		return size ? size : -EFAULT;
> +	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
>  
> +	size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));

We still need to return EFAULT if size becomes 0 because that's the only way
bio_iov_iter_get_pages()'s loop will break out in this condition.


* Re: [block.git conflicts] Re: [PATCH 37/44] block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  2022-07-01 18:07               ` Al Viro
  2022-07-01 18:12                 ` Al Viro
@ 2022-07-01 19:05                 ` Keith Busch
  1 sibling, 0 replies; 118+ messages in thread
From: Keith Busch @ 2022-07-01 19:05 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner,
	linux-fsdevel

On Fri, Jul 01, 2022 at 07:07:45PM +0100, Al Viro wrote:
> On Fri, Jul 01, 2022 at 11:53:44AM -0600, Keith Busch wrote:
> > On Fri, Jul 01, 2022 at 06:40:40PM +0100, Al Viro wrote:
> > > -static void bio_put_pages(struct page **pages, size_t size, size_t off)
> > > -{
> > > -	size_t i, nr = DIV_ROUND_UP(size + (off & ~PAGE_MASK), PAGE_SIZE);
> > > -
> > > -	for (i = 0; i < nr; i++)
> > > -		put_page(pages[i]);
> > > -}
> > > -
> > >  static int bio_iov_add_page(struct bio *bio, struct page *page,
> > >  		unsigned int len, unsigned int offset)
> > >  {
> > > @@ -1228,11 +1220,11 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> > >  	 * the iov data will be picked up in the next bio iteration.
> > >  	 */
> > >  	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> > > -	if (size > 0)
> > > -		size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
> > >  	if (unlikely(size <= 0))
> > >  		return size ? size : -EFAULT;
> > > +	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
> > >  
> > > +	size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
> > 
> > This isn't quite right. The result of the ALIGN_DOWN could be 0, so whatever
> > page we got before would be leaked since unused pages are only released on an
> > add_page error. I was about to reply with a patch that fixes this, but here's
> > the one that I'm currently testing:
> 
> AFAICS, result is broken; you might end up consuming some data and leaving
> iterator not advanced at all.  With no way for the caller to tell which way it
> went.

Looks like the possibility of a non-advancing iterator goes all the way back to
the below commit.

  commit 576ed9135489c723fb39b97c4e2c73428d06dd78
  Author: Christoph Hellwig <hch@lst.de>
  Date:   Thu Sep 20 08:28:21 2018 +0200

      block: use bio_add_page in bio_iov_iter_get_pages

I guess the error condition never occurred, nor should it if the bio's
available vectors are accounted for correctly.


* Re: [block.git conflicts] Re: [PATCH 37/44] block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  2022-07-01 18:38                   ` Keith Busch
@ 2022-07-01 19:08                     ` Al Viro
  2022-07-01 19:28                       ` Keith Busch
  0 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-07-01 19:08 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner,
	linux-fsdevel

On Fri, Jul 01, 2022 at 12:38:36PM -0600, Keith Busch wrote:
> On Fri, Jul 01, 2022 at 07:12:17PM +0100, Al Viro wrote:
> > On Fri, Jul 01, 2022 at 07:07:45PM +0100, Al Viro wrote:
> > > > page we got before would be leaked since unused pages are only released on an
> > > > add_page error. I was about to reply with a patch that fixes this, but here's
> > > > the one that I'm currently testing:
> > > 
> > > AFAICS, result is broken; you might end up consuming some data and leaving
> > > iterator not advanced at all.  With no way for the caller to tell which way it
> > > went.
> 
> I think I see what you mean, though the issue with a non-advancing iterator on
> a partially filled bio was happening prior to this patch.
> 
> > How about the following?
> 
> This looks close to what I was about to respond with. Just a couple issues
> below:
> 
> > -static void bio_put_pages(struct page **pages, size_t size, size_t off)
> > -{
> > -	size_t i, nr = DIV_ROUND_UP(size + (off & ~PAGE_MASK), PAGE_SIZE);
> > -
> > -	for (i = 0; i < nr; i++)
> > -		put_page(pages[i]);
> > -}
> > -
> >  static int bio_iov_add_page(struct bio *bio, struct page *page,
> >  		unsigned int len, unsigned int offset)
> >  {
> > @@ -1211,6 +1203,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> >  	ssize_t size, left;
> >  	unsigned len, i;
> >  	size_t offset;
> > +	int ret;
> 
> 'ret' might never be initialized if 'size' aligns down to 0.

Point.

> >  	/*
> >  	 * Move page array up in the allocated memory for the bio vecs as far as
> > @@ -1228,14 +1221,13 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> >  	 * the iov data will be picked up in the next bio iteration.
> >  	 */
> >  	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> > -	if (size > 0)
> > -		size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
> >  	if (unlikely(size <= 0))
> >  		return size ? size : -EFAULT;
> > +	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
> >  
> > +	size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
> 
> We still need to return EFAULT if size becomes 0 because that's the only way
> bio_iov_iter_get_pages()'s loop will break out in this condition.

I really don't like these calling conventions ;-/

What do you want to happen if we have that align-down to reduce size?
IOW, what should be the state after that loop in the caller?


* Re: [block.git conflicts] Re: [PATCH 37/44] block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  2022-07-01 19:08                     ` Al Viro
@ 2022-07-01 19:28                       ` Keith Busch
  2022-07-01 19:43                         ` Al Viro
  0 siblings, 1 reply; 118+ messages in thread
From: Keith Busch @ 2022-07-01 19:28 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner,
	linux-fsdevel

On Fri, Jul 01, 2022 at 08:08:37PM +0100, Al Viro wrote:
> On Fri, Jul 01, 2022 at 12:38:36PM -0600, Keith Busch wrote:
> > >  	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> > > -	if (size > 0)
> > > -		size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
> > >  	if (unlikely(size <= 0))
> > >  		return size ? size : -EFAULT;
> > > +	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
> > >  
> > > +	size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
> > 
> > We still need to return EFAULT if size becomes 0 because that's the only way
> > bio_iov_iter_get_pages()'s loop will break out in this condition.
> 
> I really don't like these calling conventions ;-/

No argument here; I'm just working in the space as I found it. :)
 
> What do you want to happen if we have that align-down to reduce size?
> IOW, what should be the state after that loop in the caller?

We fill up the bio to bi_max_vecs. If there are more pages than vectors, then
the bio is submitted as-is with the partially drained iov_iter. The remainder
of the iov is left for a subsequent bio to deal with.

The ALIGN_DOWN is required because I've replaced the artificial kernel alignment
with the underlying hardware's alignment. The hardware's alignment is usually
smaller than a block size. If the last bvec has a non-block aligned offset,
then it has to be truncated down, which could result in a 0 size when bi_vcnt
is already non-zero. If that happens, we just submit the bio as-is, and
allocate a new one to deal with the rest of the iov.


* Re: [block.git conflicts] Re: [PATCH 37/44] block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  2022-07-01 19:28                       ` Keith Busch
@ 2022-07-01 19:43                         ` Al Viro
  2022-07-01 19:56                           ` Keith Busch
  0 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-07-01 19:43 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner,
	linux-fsdevel

On Fri, Jul 01, 2022 at 01:28:13PM -0600, Keith Busch wrote:
> On Fri, Jul 01, 2022 at 08:08:37PM +0100, Al Viro wrote:
> > On Fri, Jul 01, 2022 at 12:38:36PM -0600, Keith Busch wrote:
> > > >  	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> > > > -	if (size > 0)
> > > > -		size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
> > > >  	if (unlikely(size <= 0))
> > > >  		return size ? size : -EFAULT;
> > > > +	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
> > > >  
> > > > +	size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
> > > 
> > > We still need to return EFAULT if size becomes 0 because that's the only way
> > > bio_iov_iter_get_pages()'s loop will break out in this condition.
> > 
> > I really don't like these calling conventions ;-/
> 
> No argument here; I'm just working in the space as I found it. :)
>  
> > What do you want to happen if we have that align-down to reduce size?
> > IOW, what should be the state after that loop in the caller?
> 
> We fill up the bio to bi_max_vecs. If there are more pages than vectors, then
> the bio is submitted as-is with the partially drained iov_iter. The remainder
> of the iov is left for a subsequent bio to deal with.
> 
> The ALIGN_DOWN is required because I've replaced the artificial kernel alignment
> with the underlying hardware's alignment. The hardware's alignment is usually
> smaller than a block size. If the last bvec has a non-block aligned offset,
> then it has to be truncated down, which could result in a 0 size when bi_vcnt
> is already non-zero. If that happens, we just submit the bio as-is, and
> allocate a new one to deal with the rest of the iov.

Wait a sec.  Looks like you are dealing with the round-down in the wrong place -
it applies to the *total* you've packed into the bio, no matter how it is
distributed over the segments you've gathered it from.  Looks like it would
be more natural to handle it after the loop in the caller, would it not?

I.e.
	while bio is not full
		grab pages
		if got nothing
			break
		pack pages into bio
		if can't add a page (bio_add_hw_page() failure)
			drop the ones not shoved there
			break
	see how much had we packed into the sucker
	if not a multiple of logical block size
		trim the bio, dropping what needs to be dropped
		iov_iter_revert()
	if nothing's packed
		return -EINVAL if it was a failed bio_add_hw_page()
		return -EFAULT otherwise
	return 0


* Re: [block.git conflicts] Re: [PATCH 37/44] block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  2022-07-01 19:43                         ` Al Viro
@ 2022-07-01 19:56                           ` Keith Busch
  2022-07-02  5:35                             ` Al Viro
  0 siblings, 1 reply; 118+ messages in thread
From: Keith Busch @ 2022-07-01 19:56 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner,
	linux-fsdevel

On Fri, Jul 01, 2022 at 08:43:32PM +0100, Al Viro wrote:
> On Fri, Jul 01, 2022 at 01:28:13PM -0600, Keith Busch wrote:
> > On Fri, Jul 01, 2022 at 08:08:37PM +0100, Al Viro wrote:
> > > On Fri, Jul 01, 2022 at 12:38:36PM -0600, Keith Busch wrote:
> > > > >  	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> > > > > -	if (size > 0)
> > > > > -		size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
> > > > >  	if (unlikely(size <= 0))
> > > > >  		return size ? size : -EFAULT;
> > > > > +	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
> > > > >  
> > > > > +	size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
> > > > 
> > > > We still need to return EFAULT if size becomes 0 because that's the only way
> > > > bio_iov_iter_get_pages()'s loop will break out in this condition.
> > > 
> > > I really don't like these calling conventions ;-/
> > 
> > No argument here; I'm just working in the space as I found it. :)
> >  
> > > What do you want to happen if we have that align-down to reduce size?
> > > IOW, what should be the state after that loop in the caller?
> > 
> > We fill up the bio to bi_max_vecs. If there are more pages than vectors, then
> > the bio is submitted as-is with the partially drained iov_iter. The remainder
> > of the iov is left for a subsequent bio to deal with.
> > 
> > The ALIGN_DOWN is required because I've replaced the artificial kernel alignment
> > with the underlying hardware's alignment. The hardware's alignment is usually
> > smaller than a block size. If the last bvec has a non-block aligned offset,
> > then it has to be truncated down, which could result in a 0 size when bi_vcnt
> > is already non-zero. If that happens, we just submit the bio as-is, and
> > allocate a new one to deal with the rest of the iov.
> 
> Wait a sec.  Looks like you are dealing with the round-down in the wrong place -
> it applies to the *total* you've packed into the bio, no matter how it is
> distributed over the segments you've gathered it from.  Looks like it would
> be more natural to handle it after the loop in the caller, would it not?
> 
> I.e.
> 	while bio is not full
> 		grab pages
> 		if got nothing
> 			break
> 		pack pages into bio
> 		if can't add a page (bio_add_hw_page() failure)
> 			drop the ones not shoved there
> 			break
> 	see how much had we packed into the sucker
> 	if not a multiple of logical block size
> 		trim the bio, dropping what needs to be dropped
> 		iov_iter_revert()
> 	if nothing's packed
> 		return -EINVAL if it was a failed bio_add_hw_page()
> 		return -EFAULT otherwise
> 	return 0

Validating user requests gets really messy if we allow arbitrary segment
lengths. This particular patch just enables arbitrary address alignment, but
segment size is still required to be a block size. You found the commit that
enforces that earlier, "iov: introduce iov_iter_aligned", two commits prior.

The rest of the logic simplifies when we are guaranteed segment size is a block
size multiple: truncating a segment at a block boundary ensures both sides are
block size aligned, and we don't even need to consult lower level queue limits,
like segment count or segment length. The bio is valid after this step, and can
be split into valid bios later if needed.


* Re: [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full()
  2022-07-01 16:02     ` Christian Schoenebeck
@ 2022-07-01 21:00       ` Dominique Martinet
  2022-07-03 13:30         ` Christian Schoenebeck
  0 siblings, 1 reply; 118+ messages in thread
From: Dominique Martinet @ 2022-07-01 21:00 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Al Viro, linux-fsdevel, Linus Torvalds, Jens Axboe,
	Christoph Hellwig, Matthew Wilcox, David Howells,
	Christian Brauner

Christian Schoenebeck wrote on Fri, Jul 01, 2022 at 06:02:31PM +0200:
> > I also tried 9p2000.u for principle and ... I have no idea if this works
> > but it didn't seem to blow up there at least.
> > The problem is that 9p2000.u just doesn't work well even without these
> > patches, so I still stand by what I said about 9p2000.u and virtio (zc
> > interface): we really can (and I think should) just say virtio doesn't
> > support 9p2000.u.
> > (and could then further simplify this)
> >
> > If you're curious, 9p2000.u hangs without your patch on at least two
> > different code paths (trying to read a huge buffer aborts sending a
> > reply because msize is too small instead of clamping it, that one has a
> > qemu warning message; but there are others ops like copyrange that just
> > fail silently and I didn't investigate)
> 
> Last time I tested 9p2000.u was with the "remove msize limit" (WIP) patches:
> https://lore.kernel.org/all/cover.1640870037.git.linux_oss@crudebyte.com/
> Where I did not observe any issue with 9p2000.u.
> 
> What msize are we talking about, or can you tell a way to reproduce?

I just ran fsstress on a
`mount -t 9p -o cache=none,trans=virtio,version=9p2000.u` mount on
v5.19-rc2:

fsstress -d /mnt/fstress -n 1000 -v

If that doesn't reproduce for you (and you care) I can turn some more
logs on, but from the look of it it could very well be msize related, I
just didn't check as I don't expect any real users

--
Dominique

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [block.git conflicts] Re: [PATCH 37/44] block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  2022-07-01 17:40           ` Al Viro
  2022-07-01 17:53             ` Keith Busch
@ 2022-07-01 21:30             ` Jens Axboe
  1 sibling, 0 replies; 118+ messages in thread
From: Jens Axboe @ 2022-07-01 21:30 UTC (permalink / raw)
  To: Al Viro, Keith Busch
  Cc: Linus Torvalds, Christoph Hellwig, Matthew Wilcox, David Howells,
	Dominique Martinet, Christian Brauner, linux-fsdevel

On 7/1/22 11:40 AM, Al Viro wrote:
> On Thu, Jun 30, 2022 at 08:07:24PM -0600, Keith Busch wrote:
>> On Thu, Jun 30, 2022 at 11:39:36PM +0100, Al Viro wrote:
>>> On Thu, Jun 30, 2022 at 11:11:27PM +0100, Al Viro wrote:
>>>
>>>> ... and the first half of that thing conflicts with "block: relax direct
>>>> io memory alignment" in -next...
>>>
>>> BTW, looking at that commit - are you sure that bio_put_pages() on failure
>>> exit will do the right thing?  We have grabbed a bunch of page references;
>>> the amount is DIV_ROUND_UP(offset + size, PAGE_SIZE).  And that's before
>>> your
>>>                 size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
>>
>> Thanks for the catch, it does look like a page reference could get leaked here.
>>
>>> in there.  IMO the following would be more obviously correct:
>>>         size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
>>>         if (unlikely(size <= 0))
>>>                 return size ? size : -EFAULT;
>>>
>>> 	nr_pages = DIV_ROUND_UP(size + offset, PAGE_SIZE);
>>> 	size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
>>>
>>>         for (left = size, i = 0; left > 0; left -= len, i++) {
>>> ...
>>>                 if (ret) {
>>> 			while (i < nr_pages)
>>> 				put_page(pages[i++]);
>>>                         return ret;
>>>                 }
>>> ...
>>>
>>> and get rid of bio_put_pages() entirely.  Objections?
>>
>>
>> I think that makes sense. I'll give your idea a test run tomorrow.
> 
> See vfs.git#block-fixes, along with #work.iov_iter_get_pages-3 in there.
> Seems to work here...
> 
> If you are OK with #block-fixes (it's one commit on top of
> bf8d08532bc1 "iomap: add support for dma aligned direct-io" in
> block.git), the easiest way to deal with the conflicts would be
> to have that branch pulled into block.git.  Jens, would you be
> OK with that in terms of tree topology?  Provided that patch
> itself looks sane to you, of course...

I'm fine with that approach. Don't have too much time to look into this
stuff just yet, but looks like you and Keith are getting it sorted out.
I'll check in on emails later this weekend and we can get it pulled in
at that point when you guys deem it ready to check/pull.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [block.git conflicts] Re: [PATCH 37/44] block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  2022-07-01 19:56                           ` Keith Busch
@ 2022-07-02  5:35                             ` Al Viro
  2022-07-02 21:02                               ` Keith Busch
  0 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-07-02  5:35 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner,
	linux-fsdevel

On Fri, Jul 01, 2022 at 01:56:53PM -0600, Keith Busch wrote:

> Validating user requests gets really messy if we allow arbitrary segment
> lengths. This particular patch just enables arbitrary address alignment, but
> segment size is still required to be a block size. You found the commit that
> enforces that earlier, "iov: introduce iov_iter_aligned", two commits prior.

BTW, where do you check it for this caller?
	fs/zonefs/super.c:786:  ret = bio_iov_iter_get_pages(bio, from);
Incidentally, we have an incorrect use of iov_iter_truncate() in that one (compare
with iomap case, where we reexpand it afterwards)...

I still don't get the logic of those round-downs.  You've *already* verified
that each segment is a multiple of logical block size.  And you are stuffing
as much as you can into bio, covering the data for as many segments as you
can.  Sure, you might end up e.g. running into an unmapped page at wrong
offset (since your requirements for initial offsets might be milder than
logical block size).  Or you might run out of pages bio would take.  Either
might put the end of bio at the wrong offset.

So why not trim it down *after* you are done adding pages into it?  And do it
once, outside of the loop.  IDGI...  Validation is already done; I'm not
suggesting to allow weird segment lengths or to change behaviour of your
iov_iter_is_aligned() in any other way.

Put it another way, is there any possibility for __bio_iov_iter_get_pages() to
do a non-trivial round-down on anything other than the last iteration of that
loop?

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [block.git conflicts] Re: [PATCH 37/44] block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  2022-07-02  5:35                             ` Al Viro
@ 2022-07-02 21:02                               ` Keith Busch
  0 siblings, 0 replies; 118+ messages in thread
From: Keith Busch @ 2022-07-02 21:02 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Dominique Martinet, Christian Brauner,
	linux-fsdevel

On Sat, Jul 02, 2022 at 06:35:58AM +0100, Al Viro wrote:
> On Fri, Jul 01, 2022 at 01:56:53PM -0600, Keith Busch wrote:
> 
> > Validating user requests gets really messy if we allow arbitrary segment
> > lengths. This particular patch just enables arbitrary address alignment, but
> > segment size is still required to be a block size. You found the commit that
> > enforces that earlier, "iov: introduce iov_iter_aligned", two commits prior.
> 
> BTW, where do you check it for this caller?
> 	fs/zonefs/super.c:786:  ret = bio_iov_iter_get_pages(bio, from);
> Incidentally, we have an incorrect use of iov_iter_truncate() in that one (compare
> with iomap case, where we reexpand it afterwards)...
> 
> I still don't get the logic of those round-downs.  You've *already* verified
> that each segment is a multiple of logical block size.  And you are stuffing
> as much as you can into bio, covering the data for as many segments as you
> can.  Sure, you might end up e.g. running into an unmapped page at wrong
> offset (since your requirements for initial offsets might be milder than
> logical block size).  Or you might run out of pages bio would take.  Either
> might put the end of bio at the wrong offset.

It is strange that this function allows the possibility that bio_iov_add_page()
can fail. There's no reason to grab more pages than the bio_full() condition
allows (ignoring the special ZONE_APPEND case for the moment).

I think it will make more sense if we clean that part up first so the size for
all successfully gotten pages can skip the subsequent bio-add-page checks,
making the error handling unnecessary.
 
> So why not trim it down *after* you are done adding pages into it?  And do it
> once, outside of the loop.  IDGI...  Validation is already done; I'm not
> suggesting to allow weird segment lengths or to change behaviour of your
> iov_iter_is_aligned() in any other way.

I may have misunderstood your previous suggestion, but I still think this is
the right way to go. The ALIGN_DOWN in its current location ensures the size
we're appending to the bio is acceptable before we even start. It's easier to
prevent adding pages to a bio than to back them out later. The latter would
need something like a special cased version of bio_truncate().

Anyway, I have some changes testing right now that I think will fix up the
issues you've raised, and make the rest a bit more clear. I'll send them for
consideration this weekend if all is successful.

> Put it another way, is there any possibility for __bio_iov_iter_get_pages() to
> do a non-trivial round-down on anything other than the last iteration of that
> loop?

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full()
  2022-07-01 21:00       ` Dominique Martinet
@ 2022-07-03 13:30         ` Christian Schoenebeck
  0 siblings, 0 replies; 118+ messages in thread
From: Christian Schoenebeck @ 2022-07-03 13:30 UTC (permalink / raw)
  To: Dominique Martinet
  Cc: Al Viro, linux-fsdevel, Linus Torvalds, Jens Axboe,
	Christoph Hellwig, Matthew Wilcox, David Howells,
	Christian Brauner, Greg Kurz

On Friday, 1 July 2022 23:00:06 CEST Dominique Martinet wrote:
> Christian Schoenebeck wrote on Fri, Jul 01, 2022 at 06:02:31PM +0200:
> > > I also tried 9p2000.u for principle and ... I have no idea if this works
> > > but it didn't seem to blow up there at least.
> > > The problem is that 9p2000.u just doesn't work well even without these
> > > patches, so I still stand by what I said about 9p2000.u and virtio (zc
> > > interface): we really can (and I think should) just say virtio doesn't
> > > support 9p2000.u.
> > > (and could then further simplify this)
> > > 
> > > If you're curious, 9p2000.u hangs without your patch on at least two
> > > different code paths (trying to read a huge buffer aborts sending a
> > > reply because msize is too small instead of clamping it, that one has a
> > > qemu warning message; but there are others ops like copyrange that just
> > > fail silently and I didn't investigate)
> > 
> > Last time I tested 9p2000.u was with the "remove msize limit" (WIP)
> > patches:
> > https://lore.kernel.org/all/cover.1640870037.git.linux_oss@crudebyte.com/
> > Where I did not observe any issue with 9p2000.u.
> > 
> > What msize are we talking about, or can you tell a way to reproduce?
> 
> I just ran fsstress on a
> `mount -t 9p -o cache=none,trans=virtio,version=9p2000.u` mount on
> v5.19-rc2:
> 
> fsstress -d /mnt/fstress -n 1000 -v
> 
> If that doesn't reproduce for you (and you care) I can turn some more
> logs on, but from the look of it it could very well be msize related, I
> just didn't check as I don't expect any real user

Confirmed. :/ I tested with various kernel versions (also w/wo "remove msize 
limit" WIP patches), different combinations of msize and cache options. They 
all start to hang the fsstress app with 9p2000.u protocol version, sometimes 
sooner, sometimes later.

BTW, fsstress does not pass with 9p2000.L here either, it would not hang, but 
it consistently terminates fsstress with (cache=none|mmap):

    posix_memalign: Invalid argument

@Greg: JFYI

Best regards,
Christian Schoenebeck



^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 40/44] 9p: convert to advancing variant of iov_iter_get_pages_alloc()
  2022-07-01 13:47     ` Christian Schoenebeck
@ 2022-07-06 22:06       ` Christian Schoenebeck
  0 siblings, 0 replies; 118+ messages in thread
From: Christian Schoenebeck @ 2022-07-06 22:06 UTC (permalink / raw)
  To: linux-fsdevel, Al Viro, Dominique Martinet
  Cc: Linus Torvalds, Jens Axboe, Christoph Hellwig, Matthew Wilcox,
	David Howells, Christian Brauner

On Friday, 1 July 2022 15:47:24 CEST Christian Schoenebeck wrote:
> On Wednesday, 22 June 2022 06:15:48 CEST Al Viro wrote:
> > that one is somewhat clumsier than usual and needs serious testing.
> > 
> > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> 
> Hi Al,
> 
> it took me a bit to find the patch that introduces
> iov_iter_get_pages_alloc2(), but this patch itself looks fine:
> 
> Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>

Tested-by: Christian Schoenebeck <linux_oss@crudebyte.com>




^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 37/44] block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
  2022-06-22  4:15   ` [PATCH 37/44] block: convert to " Al Viro
  2022-06-28 12:16     ` Jeff Layton
  2022-06-30 22:11     ` [block.git conflicts] " Al Viro
@ 2022-07-10 18:04     ` Sedat Dilek
  2 siblings, 0 replies; 118+ messages in thread
From: Sedat Dilek @ 2022-07-10 18:04 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Jens Axboe, Christoph Hellwig,
	Matthew Wilcox, David Howells, Dominique Martinet,
	Christian Brauner

[-- Attachment #1: Type: text/plain, Size: 7535 bytes --]

On Wed, Jun 22, 2022 at 6:56 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> ... doing revert if we end up not using some pages
>

With the version from latest vfs.git#for-next as from [1] (which
differs from this one) I see with LLVM-14:

5618:  clang -Wp,-MMD,block/.bio.o.d -nostdinc -I./arch/x86/include
-I./arch/x86/include/generated -I./include -I./arch/x86/include/uapi
-I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi
-include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h
-include ./include/linux/compiler_types.h -D__KERNEL__ -Qunused-arguments
-fmacro-prefix-map=./= -Wall -Wundef -Werror=strict-prototypes -Wno-trigraphs
-fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE
-Werror=implicit-function-declaration -Werror=implicit-int -Werror=return-type
-Wno-format-security -std=gnu11 --target=x86_64-linux-gnu -fintegrated-as
-Werror=unknown-warning-option -Werror=ignored-optimization-argument -mno-sse
-mno-mmx -mno-sse2 -mno-3dnow -mno-avx -fcf-protection=none -m64
-falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mstack-alignment=8
-mskip-rax-setup -mtune=generic -mno-red-zone -mcmodel=kernel -Wno-sign-compare
-fno-asynchronous-unwind-tables -mretpoline-external-thunk
-fno-delete-null-pointer-checks -Wno-frame-address -Wno-address-of-packed-member
-O2 -Wframe-larger-than=2048 -fstack-protector-strong -Wimplicit-fallthrough
-Wno-gnu -Wno-unused-but-set-variable -Wno-unused-const-variable
-fno-stack-clash-protection -pg -mfentry -DCC_USING_NOP_MCOUNT -DCC_USING_FENTRY
-fno-lto -flto=thin -fsplit-lto-unit -fvisibility=hidden
-Wdeclaration-after-statement -Wvla -Wno-pointer-sign -Wcast-function-type
-fno-strict-overflow -fno-stack-check -Werror=date-time
-Werror=incompatible-pointer-types -Wno-initializer-overrides -Wno-format
-Wno-sign-compare -Wno-format-zero-length -Wno-pointer-to-enum-cast
-Wno-tautological-constant-out-of-range-compare -Wno-unaligned-access -g
-gdwarf-5 -DKBUILD_MODFILE='"block/bio"' -DKBUILD_BASENAME='"bio"'
-DKBUILD_MODNAME='"bio"' -D__KBUILD_MODNAME=kmod_bio -c -o block/bio.o block/bio.c
[ ... ]
5635:block/bio.c:1232:6: warning: variable 'i' is used uninitialized
whenever 'if' condition is true [-Wsometimes-uninitialized]
5636-        if (unlikely(!size)) {
5637-            ^~~~~~~~~~~~~~~
5638-./include/linux/compiler.h:78:22: note: expanded from macro 'unlikely'
5639-# define unlikely(x)    __builtin_expect(!!(x), 0)
5640-                        ^~~~~~~~~~~~~~~~~~~~~~~~~~
5641-block/bio.c:1254:9: note: uninitialized use occurs here
5642-        while (i < nr_pages)
5643-               ^
5644-block/bio.c:1232:2: note: remove the 'if' if its condition is always false
5645-        if (unlikely(!size)) {
5646-        ^~~~~~~~~~~~~~~~~~~~~~
5647-block/bio.c:1202:17: note: initialize the variable 'i' to silence
this warning
5648-        unsigned len, i;
5649-                       ^
5650-                        = 0

Patch from [1] is attached.

Regards,
-Sedat-

[1] https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git/commit/block/bio.c?h=for-next&id=9a6469060316674230c0666c5706f7097e9278bb

> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  block/bio.c     | 15 ++++++---------
>  block/blk-map.c |  7 ++++---
>  2 files changed, 10 insertions(+), 12 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 51c99f2c5c90..01ab683e67be 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1190,7 +1190,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>         BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
>         pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
>
> -       size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> +       size = iov_iter_get_pages2(iter, pages, LONG_MAX, nr_pages, &offset);
>         if (unlikely(size <= 0))
>                 return size ? size : -EFAULT;
>
> @@ -1205,6 +1205,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>                 } else {
>                         if (WARN_ON_ONCE(bio_full(bio, len))) {
>                                 bio_put_pages(pages + i, left, offset);
> +                               iov_iter_revert(iter, left);
>                                 return -EINVAL;
>                         }
>                         __bio_add_page(bio, page, len, offset);
> @@ -1212,7 +1213,6 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>                 offset = 0;
>         }
>
> -       iov_iter_advance(iter, size);
>         return 0;
>  }
>
> @@ -1227,7 +1227,6 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
>         ssize_t size, left;
>         unsigned len, i;
>         size_t offset;
> -       int ret = 0;
>
>         if (WARN_ON_ONCE(!max_append_sectors))
>                 return 0;
> @@ -1240,7 +1239,7 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
>         BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
>         pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
>
> -       size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
> +       size = iov_iter_get_pages2(iter, pages, LONG_MAX, nr_pages, &offset);
>         if (unlikely(size <= 0))
>                 return size ? size : -EFAULT;
>
> @@ -1252,16 +1251,14 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
>                 if (bio_add_hw_page(q, bio, page, len, offset,
>                                 max_append_sectors, &same_page) != len) {
>                         bio_put_pages(pages + i, left, offset);
> -                       ret = -EINVAL;
> -                       break;
> +                       iov_iter_revert(iter, left);
> +                       return -EINVAL;
>                 }
>                 if (same_page)
>                         put_page(page);
>                 offset = 0;
>         }
> -
> -       iov_iter_advance(iter, size - left);
> -       return ret;
> +       return 0;
>  }
>
>  /**
> diff --git a/block/blk-map.c b/block/blk-map.c
> index df8b066cd548..7196a6b64c80 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -254,7 +254,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
>                 size_t offs, added = 0;
>                 int npages;
>
> -               bytes = iov_iter_get_pages_alloc(iter, &pages, LONG_MAX, &offs);
> +               bytes = iov_iter_get_pages_alloc2(iter, &pages, LONG_MAX, &offs);
>                 if (unlikely(bytes <= 0)) {
>                         ret = bytes ? bytes : -EFAULT;
>                         goto out_unmap;
> @@ -284,7 +284,6 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
>                                 bytes -= n;
>                                 offs = 0;
>                         }
> -                       iov_iter_advance(iter, added);
>                 }
>                 /*
>                  * release the pages we didn't map into the bio, if any
> @@ -293,8 +292,10 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
>                         put_page(pages[j++]);
>                 kvfree(pages);
>                 /* couldn't stuff something into bio? */
> -               if (bytes)
> +               if (bytes) {
> +                       iov_iter_revert(iter, bytes);
>                         break;
> +               }
>         }
>
>         ret = blk_rq_append_bio(rq, bio);
> --
> 2.30.2
>

[-- Attachment #2: 0001-block-convert-to-advancing-variants-of-iov_iter_get_.patch --]
[-- Type: text/x-patch, Size: 3047 bytes --]

From 9a6469060316674230c0666c5706f7097e9278bb Mon Sep 17 00:00:00 2001
From: Al Viro <viro@zeniv.linux.org.uk>
Date: Thu, 9 Jun 2022 10:37:57 -0400
Subject: [PATCH] block: convert to advancing variants of
 iov_iter_get_pages{,_alloc}()

... doing revert if we end up not using some pages

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 block/bio.c     | 25 ++++++++++++++-----------
 block/blk-map.c |  7 ++++---
 2 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 082436736d69..d3bc05ed0783 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1200,7 +1200,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	struct page **pages = (struct page **)bv;
 	ssize_t size, left;
 	unsigned len, i;
-	size_t offset;
+	size_t offset, trim;
 	int ret = 0;
 
 	/*
@@ -1218,16 +1218,19 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 * result to ensure the bio's total size is correct. The remainder of
 	 * the iov data will be picked up in the next bio iteration.
 	 */
-	size = iov_iter_get_pages(iter, pages, UINT_MAX - bio->bi_iter.bi_size,
+	size = iov_iter_get_pages2(iter, pages, UINT_MAX - bio->bi_iter.bi_size,
 				  nr_pages, &offset);
-	if (size > 0) {
-		nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
-		size = ALIGN_DOWN(size, bdev_logical_block_size(bio->bi_bdev));
-	} else
-		nr_pages = 0;
-
-	if (unlikely(size <= 0)) {
-		ret = size ? size : -EFAULT;
+	if (unlikely(size <= 0))
+		return size ? size : -EFAULT;
+
+	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
+
+	trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
+	iov_iter_revert(iter, trim);
+
+	size -= trim;
+	if (unlikely(!size)) {
+		ret = -EFAULT;
 		goto out;
 	}
 
@@ -1246,7 +1249,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		offset = 0;
 	}
 
-	iov_iter_advance(iter, size - left);
+	iov_iter_revert(iter, left);
 out:
 	while (i < nr_pages)
 		put_page(pages[i++]);
diff --git a/block/blk-map.c b/block/blk-map.c
index df8b066cd548..7196a6b64c80 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -254,7 +254,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		size_t offs, added = 0;
 		int npages;
 
-		bytes = iov_iter_get_pages_alloc(iter, &pages, LONG_MAX, &offs);
+		bytes = iov_iter_get_pages_alloc2(iter, &pages, LONG_MAX, &offs);
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
 			goto out_unmap;
@@ -284,7 +284,6 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 				bytes -= n;
 				offs = 0;
 			}
-			iov_iter_advance(iter, added);
 		}
 		/*
 		 * release the pages we didn't map into the bio, if any
@@ -293,8 +292,10 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 			put_page(pages[j++]);
 		kvfree(pages);
 		/* couldn't stuff something into bio? */
-		if (bytes)
+		if (bytes) {
+			iov_iter_revert(iter, bytes);
 			break;
+		}
 	}
 
 	ret = blk_rq_append_bio(rq, bio);
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 118+ messages in thread

* Re: [PATCH 9/44] new iov_iter flavour - ITER_UBUF
  2022-06-22  4:15   ` [PATCH 09/44] new iov_iter flavour - ITER_UBUF Al Viro
  2022-06-27 18:47     ` Jeff Layton
  2022-06-28 12:38     ` Christian Brauner
@ 2022-07-28  9:55     ` Alexander Gordeev
  2022-07-29 17:21       ` Al Viro
  2 siblings, 1 reply; 118+ messages in thread
From: Alexander Gordeev @ 2022-07-28  9:55 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Jens Axboe, Christoph Hellwig,
	Matthew Wilcox, David Howells, Dominique Martinet,
	Christian Brauner

On Wed, Jun 22, 2022 at 05:15:17AM +0100, Al Viro wrote:
> Equivalent of single-segment iovec.  Initialized by iov_iter_ubuf(),
> checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC
> ones.
> 
> We are going to expose the things like ->write_iter() et.al. to those
> in subsequent commits.
> 
> New predicate (user_backed_iter()) that is true for ITER_IOVEC and
> ITER_UBUF; places like direct-IO handling should use that for
> checking that pages we modify after getting them from iov_iter_get_pages()
> would need to be dirtied.
> 
> DO NOT assume that replacing iter_is_iovec() with user_backed_iter()
> will solve all problems - there's code that uses iter_is_iovec() to
> decide how to poke around in iov_iter guts and for that the predicate
> replacement obviously won't suffice.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> Link: https://lore.kernel.org/r/20220622041552.737754-9-viro@zeniv.linux.org.uk

Hi Al,

This change causes the sendfile09 LTP testcase to fail in linux-next
(up to next-20220727) on s390. In fact, not this change exactly,
but rather 92d4d18eecb9 ("new iov_iter flavour - ITER_UBUF") -
which differs from what is posted here.

AFAICT page_cache_pipe_buf_confirm() encounters !PageUptodate()
and !page->mapping page and returns -ENODATA.

I am going to narrow the testcase and get more details, but please
let me know if I am missing something.

Thanks!

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 9/44] new iov_iter flavour - ITER_UBUF
  2022-07-28  9:55     ` [PATCH 9/44] " Alexander Gordeev
@ 2022-07-29 17:21       ` Al Viro
  2022-07-29 21:12         ` Alexander Gordeev
  0 siblings, 1 reply; 118+ messages in thread
From: Al Viro @ 2022-07-29 17:21 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-fsdevel, Linus Torvalds, Jens Axboe, Christoph Hellwig,
	Matthew Wilcox, David Howells, Dominique Martinet,
	Christian Brauner

On Thu, Jul 28, 2022 at 11:55:10AM +0200, Alexander Gordeev wrote:
> On Wed, Jun 22, 2022 at 05:15:17AM +0100, Al Viro wrote:
> > Equivalent of single-segment iovec.  Initialized by iov_iter_ubuf(),
> > checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC
> > ones.
> > 
> > We are going to expose the things like ->write_iter() et.al. to those
> > in subsequent commits.
> > 
> > New predicate (user_backed_iter()) that is true for ITER_IOVEC and
> > ITER_UBUF; places like direct-IO handling should use that for
> > checking that pages we modify after getting them from iov_iter_get_pages()
> > would need to be dirtied.
> > 
> > DO NOT assume that replacing iter_is_iovec() with user_backed_iter()
> > will solve all problems - there's code that uses iter_is_iovec() to
> > decide how to poke around in iov_iter guts and for that the predicate
> > replacement obviously won't suffice.
> > 
> > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> > Link: https://lore.kernel.org/r/20220622041552.737754-9-viro@zeniv.linux.org.uk
> 
> Hi Al,
> 
> This changes causes sendfile09 LTP testcase fail in linux-next
> (up to next-20220727) on s390. In fact, not this change exactly,
> but rather 92d4d18eecb9 ("new iov_iter flavour - ITER_UBUF") -
> which differs from what is posted here.
> 
> AFAICT page_cache_pipe_buf_confirm() encounters !PageUptodate()
> and !page->mapping page and returns -ENODATA.
> 
> I am going to narrow the testcase and get more details, but please
> let me know if I am missing something.

Grrr....

-               } else if (iter_is_iovec(to)) {
+               } else if (!user_backed_iter(to)) {

in mm/shmem.c.  Spot the typo...

Could you check if replacing that line with
		} else if (user_backed_iter(to)) {

fixes the breakage?

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [PATCH 9/44] new iov_iter flavour - ITER_UBUF
  2022-07-29 17:21       ` Al Viro
@ 2022-07-29 21:12         ` Alexander Gordeev
  2022-07-30  0:03           ` Al Viro
  0 siblings, 1 reply; 118+ messages in thread
From: Alexander Gordeev @ 2022-07-29 21:12 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-fsdevel, Linus Torvalds, Jens Axboe, Christoph Hellwig,
	Matthew Wilcox, David Howells, Dominique Martinet,
	Christian Brauner

On Fri, Jul 29, 2022 at 06:21:23PM +0100, Al Viro wrote:
> > Hi Al,
> > 
> > This changes causes sendfile09 LTP testcase fail in linux-next
> > (up to next-20220727) on s390. In fact, not this change exactly,
> > but rather 92d4d18eecb9 ("new iov_iter flavour - ITER_UBUF") -
> > which differs from what is posted here.
> > 
> > AFAICT page_cache_pipe_buf_confirm() encounters !PageUptodate()
> > and !page->mapping page and returns -ENODATA.
> > 
> > I am going to narrow the testcase and get more details, but please
> > let me know if I am missing something.
> 
> Grrr....
> 
> -               } else if (iter_is_iovec(to)) {
> +               } else if (!user_backed_iter(to)) {
> 
> in mm/shmem.c.  Spot the typo...
> 
> Could you check if replacing that line with
> 		} else if (user_backed_iter(to)) {
> 
> fixes the breakage?

Yes, it does! So just to be sure - this is the fix:

diff --git a/mm/shmem.c b/mm/shmem.c
index 8baf26eda989..5783f11351bb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2626,7 +2626,7 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			ret = copy_page_to_iter(page, offset, nr, to);
 			put_page(page);
 
-		} else if (!user_backed_iter(to)) {
+		} else if (user_backed_iter(to)) {
 			/*
 			 * Copy to user tends to be so well optimized, but
 			 * clear_user() not so much, that it is noticeably

Thanks!

^ permalink raw reply related	[flat|nested] 118+ messages in thread

* Re: [PATCH 9/44] new iov_iter flavour - ITER_UBUF
  2022-07-29 21:12         ` Alexander Gordeev
@ 2022-07-30  0:03           ` Al Viro
  0 siblings, 0 replies; 118+ messages in thread
From: Al Viro @ 2022-07-30  0:03 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-fsdevel, Linus Torvalds, Jens Axboe, Christoph Hellwig,
	Matthew Wilcox, David Howells, Dominique Martinet,
	Christian Brauner

On Fri, Jul 29, 2022 at 11:12:45PM +0200, Alexander Gordeev wrote:
> On Fri, Jul 29, 2022 at 06:21:23PM +0100, Al Viro wrote:
> > > Hi Al,
> > > 
> > > This changes causes sendfile09 LTP testcase fail in linux-next
> > > (up to next-20220727) on s390. In fact, not this change exactly,
> > > but rather 92d4d18eecb9 ("new iov_iter flavour - ITER_UBUF") -
> > > which differs from what is posted here.
> > > 
> > > AFAICT page_cache_pipe_buf_confirm() encounters !PageUptodate()
> > > and !page->mapping page and returns -ENODATA.
> > > 
> > > I am going to narrow the testcase and get more details, but please
> > > let me know if I am missing something.
> > 
> > Grrr....
> > 
> > -               } else if (iter_is_iovec(to)) {
> > +               } else if (!user_backed_iter(to)) {
> > 
> > in mm/shmem.c.  Spot the typo...
> > 
> > Could you check if replacing that line with
> > 		} else if (user_backed_iter(to)) {
> > 
> > fixes the breakage?
> 
> Yes, it does! So just to be sure - this is the fix:

FWIW, there'd been another braino, caught by a test from Hugh Dickins;
this one in "ITER_PIPE: allocate buffers as we go in copy-to-pipe primitives".

Incremental follows; folded and pushed out.

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 642841ce7595..939078ffbfb5 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -469,7 +469,7 @@ static size_t copy_pipe_to_iter(const void *addr, size_t bytes,
 		struct page *page = append_pipe(i, n, &off);
 		chunk = min_t(size_t, n, PAGE_SIZE - off);
 		if (!page)
-			break;
+			return bytes - n;
 		memcpy_to_page(page, off, addr, chunk);
 		addr += chunk;
 	}
@@ -774,7 +774,7 @@ static size_t pipe_zero(size_t bytes, struct iov_iter *i)
 		char *p;
 
 		if (!page)
-			break;
+			return bytes - n;
 		chunk = min_t(size_t, n, PAGE_SIZE - off);
 		p = kmap_local_page(page);
 		memset(p + off, 0, chunk);
diff --git a/mm/shmem.c b/mm/shmem.c
index 6b83f3971795..6c8a84a1fbbb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2603,7 +2603,7 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			ret = copy_page_to_iter(page, offset, nr, to);
 			put_page(page);
 
-		} else if (!user_backed_iter(to)) {
+		} else if (user_backed_iter(to)) {
 			/*
 			 * Copy to user tends to be so well optimized, but
 			 * clear_user() not so much, that it is noticeably


* Re: [PATCH 09/44] new iov_iter flavour - ITER_UBUF
  2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
                     ` (44 preceding siblings ...)
  2022-07-01  6:25   ` Dominique Martinet
@ 2022-08-01 12:42   ` David Howells
  2022-08-01 21:14     ` Al Viro
  2022-08-01 22:54     ` David Howells
  45 siblings, 2 replies; 118+ messages in thread
From: David Howells @ 2022-08-01 12:42 UTC (permalink / raw)
  To: Al Viro
  Cc: dhowells, linux-fsdevel, Linus Torvalds, Jens Axboe,
	Christoph Hellwig, Matthew Wilcox, Dominique Martinet,
	Christian Brauner

You need to modify dup_iter() also.  That will go through the:

		return new->iov = kmemdup(new->iov,
				   new->nr_segs * sizeof(struct iovec),
				   flags);

case with a ubuf-class iterator, which will clobber new->ubuf.

David



* Re: [PATCH 09/44] new iov_iter flavour - ITER_UBUF
  2022-08-01 12:42   ` [PATCH 09/44] new iov_iter flavour - ITER_UBUF David Howells
@ 2022-08-01 21:14     ` Al Viro
  2022-08-01 22:54     ` David Howells
  1 sibling, 0 replies; 118+ messages in thread
From: Al Viro @ 2022-08-01 21:14 UTC (permalink / raw)
  To: David Howells
  Cc: linux-fsdevel, Linus Torvalds, Jens Axboe, Christoph Hellwig,
	Matthew Wilcox, Dominique Martinet, Christian Brauner

On Mon, Aug 01, 2022 at 01:42:04PM +0100, David Howells wrote:
> You need to modify dup_iter() also.  That will go through the:
> 
> 		return new->iov = kmemdup(new->iov,
> 				   new->nr_segs * sizeof(struct iovec),
> 				   flags);
> 
> case with a ubuf-class iterator, which will clobber new->ubuf.
> 
> David

Fixed, folded and pushed out.  Incremental:

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 939078ffbfb5..46ec07886d7b 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1659,17 +1659,16 @@ const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags)
 		WARN_ON(1);
 		return NULL;
 	}
-	if (unlikely(iov_iter_is_discard(new) || iov_iter_is_xarray(new)))
-		return NULL;
 	if (iov_iter_is_bvec(new))
 		return new->bvec = kmemdup(new->bvec,
 				    new->nr_segs * sizeof(struct bio_vec),
 				    flags);
-	else
+	else if (iov_iter_is_kvec(new) || iter_is_iovec(new))
 		/* iovec and kvec have identical layout */
 		return new->iov = kmemdup(new->iov,
 				   new->nr_segs * sizeof(struct iovec),
 				   flags);
+	return NULL;
 }
 EXPORT_SYMBOL(dup_iter);
 


* Re: [PATCH 09/44] new iov_iter flavour - ITER_UBUF
  2022-08-01 12:42   ` [PATCH 09/44] new iov_iter flavour - ITER_UBUF David Howells
  2022-08-01 21:14     ` Al Viro
@ 2022-08-01 22:54     ` David Howells
  1 sibling, 0 replies; 118+ messages in thread
From: David Howells @ 2022-08-01 22:54 UTC (permalink / raw)
  To: Al Viro
  Cc: dhowells, linux-fsdevel, Linus Torvalds, Jens Axboe,
	Christoph Hellwig, Matthew Wilcox, Dominique Martinet,
	Christian Brauner

Al Viro <viro@zeniv.linux.org.uk> wrote:

>  	if (iov_iter_is_bvec(new))
>  		return new->bvec = kmemdup(new->bvec,
>  				    new->nr_segs * sizeof(struct bio_vec),
>  				    flags);
> -	else
> +	else if (iov_iter_is_kvec(new) || iter_is_iovec(new))

The else is redundant.

David



end of thread, other threads:[~2022-08-01 22:54 UTC | newest]

Thread overview: 118+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-22  4:10 [RFC][CFT][PATCHSET] iov_iter stuff Al Viro
2022-06-22  4:15 ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Al Viro
2022-06-22  4:15   ` [PATCH 02/44] No need of likely/unlikely on calls of check_copy_size() Al Viro
2022-06-22  4:15   ` [PATCH 03/44] teach iomap_dio_rw() to suppress dsync Al Viro
2022-06-22  4:15   ` [PATCH 04/44] btrfs: use IOMAP_DIO_NOSYNC Al Viro
2022-06-22  4:15   ` [PATCH 05/44] struct file: use anonymous union member for rcuhead and llist Al Viro
2022-06-22  4:15   ` [PATCH 06/44] iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC Al Viro
2022-06-22  4:15   ` [PATCH 07/44] keep iocb_flags() result cached in struct file Al Viro
2022-06-22  4:15   ` [PATCH 08/44] copy_page_{to,from}_iter(): switch iovec variants to generic Al Viro
2022-06-27 18:31     ` Jeff Layton
2022-06-28 12:32     ` Christian Brauner
2022-06-28 18:36       ` Al Viro
2022-06-22  4:15   ` [PATCH 09/44] new iov_iter flavour - ITER_UBUF Al Viro
2022-06-27 18:47     ` Jeff Layton
2022-06-28 18:41       ` Al Viro
2022-06-28 12:38     ` Christian Brauner
2022-06-28 18:44       ` Al Viro
2022-07-28  9:55     ` [PATCH 9/44] " Alexander Gordeev
2022-07-29 17:21       ` Al Viro
2022-07-29 21:12         ` Alexander Gordeev
2022-07-30  0:03           ` Al Viro
2022-06-22  4:15   ` [PATCH 10/44] switch new_sync_{read,write}() to ITER_UBUF Al Viro
2022-06-22  4:15   ` [PATCH 11/44] iov_iter_bvec_advance(): don't bother with bvec_iter Al Viro
2022-06-27 18:48     ` Jeff Layton
2022-06-28 12:40     ` Christian Brauner
2022-06-22  4:15   ` [PATCH 12/44] fix short copy handling in copy_mc_pipe_to_iter() Al Viro
2022-06-27 19:15     ` Jeff Layton
2022-06-28 12:42     ` Christian Brauner
2022-06-22  4:15   ` [PATCH 13/44] splice: stop abusing iov_iter_advance() to flush a pipe Al Viro
2022-06-27 19:17     ` Jeff Layton
2022-06-28 12:43     ` Christian Brauner
2022-06-22  4:15   ` [PATCH 14/44] ITER_PIPE: helper for getting pipe buffer by index Al Viro
2022-06-28 10:38     ` Jeff Layton
2022-06-28 12:45     ` Christian Brauner
2022-06-22  4:15   ` [PATCH 15/44] ITER_PIPE: helpers for adding pipe buffers Al Viro
2022-06-28 11:32     ` Jeff Layton
2022-06-22  4:15   ` [PATCH 16/44] ITER_PIPE: allocate buffers as we go in copy-to-pipe primitives Al Viro
2022-06-22  4:15   ` [PATCH 17/44] ITER_PIPE: fold push_pipe() into __pipe_get_pages() Al Viro
2022-06-22  4:15   ` [PATCH 18/44] ITER_PIPE: lose iter_head argument of __pipe_get_pages() Al Viro
2022-06-22  4:15   ` [PATCH 19/44] ITER_PIPE: clean pipe_advance() up Al Viro
2022-06-22  4:15   ` [PATCH 20/44] ITER_PIPE: clean iov_iter_revert() Al Viro
2022-06-22  4:15   ` [PATCH 21/44] ITER_PIPE: cache the type of last buffer Al Viro
2022-06-22  4:15   ` [PATCH 22/44] ITER_PIPE: fold data_start() and pipe_space_for_user() together Al Viro
2022-06-22  4:15   ` [PATCH 23/44] iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT Al Viro
2022-06-28 11:41     ` Jeff Layton
2022-06-22  4:15   ` [PATCH 24/44] iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper Al Viro
2022-06-28 11:45     ` Jeff Layton
2022-06-22  4:15   ` [PATCH 25/44] iov_iter_get_pages(): sanity-check arguments Al Viro
2022-06-28 11:47     ` Jeff Layton
2022-06-22  4:15   ` [PATCH 26/44] unify pipe_get_pages() and pipe_get_pages_alloc() Al Viro
2022-06-28 11:49     ` Jeff Layton
2022-06-22  4:15   ` [PATCH 27/44] unify xarray_get_pages() and xarray_get_pages_alloc() Al Viro
2022-06-28 11:50     ` Jeff Layton
2022-06-22  4:15   ` [PATCH 28/44] unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts Al Viro
2022-06-28 11:54     ` Jeff Layton
2022-06-22  4:15   ` [PATCH 29/44] ITER_XARRAY: don't open-code DIV_ROUND_UP() Al Viro
2022-06-28 11:54     ` Jeff Layton
2022-06-22  4:15   ` [PATCH 30/44] iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment() Al Viro
2022-06-28 11:56     ` Jeff Layton
2022-06-22  4:15   ` [PATCH 31/44] iov_iter: first_{iovec,bvec}_segment() - simplify a bit Al Viro
2022-06-28 11:58     ` Jeff Layton
2022-06-22  4:15   ` [PATCH 32/44] iov_iter: massage calling conventions for first_{iovec,bvec}_segment() Al Viro
2022-06-28 12:06     ` Jeff Layton
2022-06-22  4:15   ` [PATCH 33/44] found_iovec_segment(): just return address Al Viro
2022-06-28 12:09     ` Jeff Layton
2022-06-22  4:15   ` [PATCH 34/44] fold __pipe_get_pages() into pipe_get_pages() Al Viro
2022-06-28 12:11     ` Jeff Layton
2022-06-22  4:15   ` [PATCH 35/44] iov_iter: saner helper for page array allocation Al Viro
2022-06-28 12:12     ` Jeff Layton
2022-06-22  4:15   ` [PATCH 36/44] iov_iter: advancing variants of iov_iter_get_pages{,_alloc}() Al Viro
2022-06-28 12:13     ` Jeff Layton
2022-06-22  4:15   ` [PATCH 37/44] block: convert to " Al Viro
2022-06-28 12:16     ` Jeff Layton
2022-06-30 22:11     ` [block.git conflicts] " Al Viro
2022-06-30 22:39       ` Al Viro
2022-07-01  2:07         ` Keith Busch
2022-07-01 17:40           ` Al Viro
2022-07-01 17:53             ` Keith Busch
2022-07-01 18:07               ` Al Viro
2022-07-01 18:12                 ` Al Viro
2022-07-01 18:38                   ` Keith Busch
2022-07-01 19:08                     ` Al Viro
2022-07-01 19:28                       ` Keith Busch
2022-07-01 19:43                         ` Al Viro
2022-07-01 19:56                           ` Keith Busch
2022-07-02  5:35                             ` Al Viro
2022-07-02 21:02                               ` Keith Busch
2022-07-01 19:05                 ` Keith Busch
2022-07-01 21:30             ` Jens Axboe
2022-06-30 23:07       ` Jens Axboe
2022-07-10 18:04     ` Sedat Dilek
2022-06-22  4:15   ` [PATCH 38/44] iter_to_pipe(): switch to advancing variant of iov_iter_get_pages() Al Viro
2022-06-28 12:18     ` Jeff Layton
2022-06-22  4:15   ` [PATCH 39/44] af_alg_make_sg(): " Al Viro
2022-06-28 12:18     ` Jeff Layton
2022-06-22  4:15   ` [PATCH 40/44] 9p: convert to advancing variant of iov_iter_get_pages_alloc() Al Viro
2022-07-01  9:01     ` Dominique Martinet
2022-07-01 13:47     ` Christian Schoenebeck
2022-07-06 22:06       ` Christian Schoenebeck
2022-06-22  4:15   ` [PATCH 41/44] ceph: switch the last caller " Al Viro
2022-06-28 12:20     ` Jeff Layton
2022-06-22  4:15   ` [PATCH 42/44] get rid of non-advancing variants Al Viro
2022-06-28 12:21     ` Jeff Layton
2022-06-22  4:15   ` [PATCH 43/44] pipe_get_pages(): switch to append_pipe() Al Viro
2022-06-28 12:23     ` Jeff Layton
2022-06-22  4:15   ` [PATCH 44/44] expand those iov_iter_advance() Al Viro
2022-06-28 12:23     ` Jeff Layton
2022-07-01  6:21   ` [PATCH 01/44] 9p: handling Rerror without copy_from_iter_full() Dominique Martinet
2022-07-01  6:25   ` Dominique Martinet
2022-07-01 16:02     ` Christian Schoenebeck
2022-07-01 21:00       ` Dominique Martinet
2022-07-03 13:30         ` Christian Schoenebeck
2022-08-01 12:42   ` [PATCH 09/44] new iov_iter flavour - ITER_UBUF David Howells
2022-08-01 21:14     ` Al Viro
2022-08-01 22:54     ` David Howells
2022-06-23 15:21 ` [RFC][CFT][PATCHSET] iov_iter stuff David Howells
2022-06-23 20:32   ` Al Viro
2022-06-28 12:25 ` Jeff Layton
