All of lore.kernel.org
 help / color / mirror / Atom feed
From: Al Viro <viro@zeniv.linux.org.uk>
To: Jens Axboe <axboe@kernel.dk>
Cc: linux-fsdevel@vger.kernel.org, Christoph Hellwig <hch@lst.de>,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: [RFC][PATCHES] iov_iter stuff
Date: Sat, 18 Jun 2022 06:27:16 +0100	[thread overview]
Message-ID: <Yq1iNHboD+9fz60M@ZenIV> (raw)
In-Reply-To: <Yq0EoTzFE+dSAYY1@ZenIV>

On Fri, Jun 17, 2022 at 11:48:01PM +0100, Al Viro wrote:
> On Fri, Jun 17, 2022 at 04:30:49PM -0600, Jens Axboe wrote:
> 
> > Al, looks good to me from inspection, and I ported stuffed this on top
> > of -git and my 5.20 branch, and did my send/recv/recvmsg io_uring change
> > on top and see a noticeable reduction there too for some benchmarking.
> > Feel free to add:
> > 
> > Reviewed-by: Jens Axboe <axboe@kernel.dk>
> > 
> > to the series.
> > 
> > Side note - of my initial series I played with, I still have this one
> > leftover that I do utilize for io_uring:
> > 
> > https://git.kernel.dk/cgit/linux-block/commit/?h=for-5.20/io_uring-iter&id=a59f5c21a6eeb9506163c20aff4846dbec159f47
> > 
> > Doesn't make sense standalone, but I have it as a prep patch.
> > 
> > Can I consider your work.iov_iter stable at this point, or are you still
> > planning rebasing?
> 
> Umm...  Rebasing this part - probably no; there's a fun followup to it, though,
> I'm finishing the carve up & reorder at the moment.  Will post for review
> tonight...

	This stuff sits on top of #work.iov_iter (as posted a week ago) +
#fixes (one commit, handling of failures halfway through copy_mc_to_iter()
into ITER_PIPE, posted several days ago, backportable minimal fix) +
#work.9p (handling of RERROR on zerocopy 9P read/readdir, posted about
a week ago).  The branch is #work.iov_iter_get_pages; individual patches
in followups.

	NOTE: the older branches are unchanged, but this series on top of
them had been repeatedly carved up, reordered, etc. - there had been a lot
of recent massage, so at this point it should be treated as absolutely
untested.  It can shit over memory and/or chew your filesystems; DON'T
TRY IT OUTSIDE OF A SCRATCH KVM IMAGE.  Said that, review and (cautious)
testing would be very welcome.

	Part 1: ITER_PIPE cleanups

ITER_PIPE handling had never been pretty, but by now it has become
really obfuscated and hard to read.  Untangle it a bit.

1) splice: stop abusing iov_iter_advance() to flush a pipe
	A really odd (ab)use of iov_iter_advance() - in case of error
generic_file_splice_read() wants to free all pipe buffers ->read_iter()
has produced.  Yes, forcibly resetting ->head and ->iov_offset to
original values and calling iov_iter_advance(i, 0), will trigger
pipe_advance(), which will trigger pipe_truncate(), which will free
buffers.  Or we could just go ahead and free the same buffers;
pipe_discard_from() does exactly that, no iov_iter stuff needs to
be involved.

2) ITER_PIPE: helper for getting pipe buffer by index
	In a lot of places we want to find pipe_buffer by index;
expression is convoluted and hard to read.  Provide an inline helper
for that, convert trivial open-coded cases.  Eventually *all*
open-coded instances in iov_iter.c will get converted.

3) ITER_PIPE: helpers for adding pipe buffers
	There are only two kinds of pipe_buffer in the area used by ITER_PIPE.
* anonymous - copy_to_iter() et.al. end up creating those and copying data
  there.  They have zero ->offset, and their ->ops points to
  default_pipe_page_ops.
* zero-copy ones - those come from copy_page_to_iter(), and page comes from
  caller.  ->offset is also caller-supplied - it might be non-zero.
  ->ops points to page_cache_pipe_buf_ops.
	Move creation and insertion of those into helpers -
push_anon(pipe, size) and push_page(pipe, page, offset, size) resp., separating
them from the "could we avoid creating a new buffer by merging with the current
head?" logics.

4) ITER_PIPE: allocate buffers as we go in copy-to-pipe primitives
	New helper: append_pipe().  Extends the last buffer if possible,
allocates a new one otherwise.  Returns page and offset in it on success,
NULL on failure.  iov_iter is advanced past the data we've got.
	Use that instead of push_pipe() in copy-to-pipe primitives;
they get simpler that way.  Handling of short copy (in "mc" one)
is done simply by iov_iter_revert() - iov_iter is in consistent
state after that one, so we can use that.

5) ITER_PIPE: fold push_pipe() into __pipe_get_pages()
	Expand the only remaining call of push_pipe() (in
__pipe_get_pages()), combine it with the page-collecting loop there.
We don't need to bother with i->count checks or calculation of offset
in the first page - the caller already has done that.
	Note that the only reason it's not a loop doing append_pipe()
is that append_pipe() is advancing, while iov_iter_get_pages() is not.
As soon as it switches to saner semantics, this thing will switch
to using append_pipe().

6) ITER_PIPE: lose iter_head argument of __pipe_get_pages()
	Always equal to pipe->head - 1.

7) ITER_PIPE: clean pipe_advance() up
	Don't bother with pipe_truncate(); adjust the buffer
length just as we decide it'll be the last one, then use
pipe_discard_from() to release buffers past that one.

8) ITER_PIPE: clean iov_iter_revert()
	Fold pipe_truncate() in there, clean the things up.

9) ITER_PIPE: cache the type of last buffer
	We often need to find whether the last buffer is anon or not, and
currently it's rather clumsy:
	check if ->iov_offset is non-zero (i.e. that pipe is not empty)
	if so, get the corresponding pipe_buffer and check its ->ops
	if it's &default_pipe_buf_ops, we have an anon buffer.
Let's replace the use of ->iov_offset (which is nowhere near similar to
its role for other flavours) with signed field (->last_offset), with
the following rules:
	empty, no buffers occupied:		0
	anon, with bytes up to N-1 filled:	N
	zero-copy, with bytes up to N-1 filled:	-N
That way abs(i->last_offset) is equal to what used to be in i->iov_offset
and empty vs. anon vs. zero-copy can be distinguished by the sign of
i->last_offset.
	Checks for "should we extend the last buffer or should we start
a new one?" become easier to follow that way.
	Note that most of the operations can only be done in a sane
state - i.e. when the pipe has nothing past the current position of
iterator.  About the only thing that could be done outside of that
state is iov_iter_advance(), which transitions to the sane state by
truncating the pipe.  There are only two cases where we leave the
sane state:
	1) iov_iter_get_pages()/iov_iter_get_pages_alloc().  Will be
dealt with later, when we make get_pages advancing - the callers are
actually happier that way.
	2) iov_iter copied, then something is put into the copy.  Since
they share the underlying pipe, the original gets behind.  When we
decide that we are done with the copy (original is not usable until then)
we advance the original.  direct_io used to be done that way; nowadays
it operates on the original and we do iov_iter_revert() to discard
the excessive data.  At the moment there's nothing in the kernel that
could do that to ITER_PIPE iterators, so this reason for insane state
is theoretical right now.

10) ITER_PIPE: fold data_start() and pipe_space_for_user() together
	All their callers are next to each other; all of them
want the total amount of pages and, possibly, the
offset in the partial final buffer.
	Combine into a new helper (pipe_npages()), fix the
bogosity in pipe_space_for_user(), while we are at it.

	Part 2: iov_iter_get_pages()/iov_iter_get_pages_alloc() unification

	There's a lot of duplication between iov_iter_get_pages() and
iov_iter_get_pages_alloc().  With some massage it can be eliminated,
along with some of the cruft accumulated there.

	Flavour-independent arguments validation and, for ..._alloc(),
cleanup handling on failure:
11) iov_iter_get_pages{,_alloc}(): cap the maxsize with LONG_MAX
12) iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper
13) iov_iter_get_pages(): sanity-check arguments

	Mechanically merge parallel ..._get_pages() and ..._get_pages_alloc().
14) unify pipe_get_pages() and pipe_get_pages_alloc()
15) unify xarray_get_pages() and xarray_get_pages_alloc()
16) unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts

	Decrufting for XARRAY:
17) ITER_XARRAY: don't open-code DIV_ROUND_UP()
	Decrufting for iBUF/IOVEC/BVEC:
18) iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment()
19) iov_iter: massage calling conventions for first_{iovec,bvec}_segment()
20) found_iovec_segment(): just return address
	Decrufting for PIPE:
21) fold __pipe_get_pages() into pipe_get_pages()

	Collapsing the bits that differ for get_pages and get_pages_alloc
cases into a common helper:
22) iov_iter: saner helper for page array allocation

	Part 3: making iov_iter_get_pages{,_alloc}() advancing

	Most of the callers follow successful ...get_pages... with advance
by the amount it had reported.  For some it's unconditional, for some it
might end up being less in some cases.  All of them would be fine with
advancing variants of those primitives - those that might want to advance
by less than reported could easily use revert by the difference of those
amounts.
	Rather than doing a flagday change (they are exported and signatures
remain unchanged), replacement variants are added (iov_iter_get_pages2()
and iov_iter_get_pages_alloc2(), initially as wrappers).  By the end of
the series everything is converted to those and the old ones are removed.

23) iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()
24) block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
25) iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()
26) af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages()
27) 9p: convert to advancing variant of iov_iter_get_pages_alloc()
28) ceph: switch the last caller of iov_iter_get_pages_alloc()
29) get rid of non-advancing variants

	Part 4: cleanups
30) pipe_get_pages(): switch to append_pipe()
31) expand those iov_iter_advance()...

Overall diffstat:

 arch/powerpc/include/asm/uaccess.h |   2 +-
 arch/s390/include/asm/uaccess.h    |   4 +-
 block/bio.c                        |  15 +-
 block/blk-map.c                    |   7 +-
 block/fops.c                       |   8 +-
 crypto/af_alg.c                    |   3 +-
 crypto/algif_hash.c                |   5 +-
 drivers/nvme/target/io-cmd-file.c  |   2 +-
 drivers/vhost/scsi.c               |   4 +-
 fs/aio.c                           |   2 +-
 fs/btrfs/file.c                    |  19 +-
 fs/btrfs/inode.c                   |   3 +-
 fs/ceph/addr.c                     |   2 +-
 fs/ceph/file.c                     |   5 +-
 fs/cifs/file.c                     |   8 +-
 fs/cifs/misc.c                     |   3 +-
 fs/direct-io.c                     |   7 +-
 fs/fcntl.c                         |   1 +
 fs/file_table.c                    |  17 +-
 fs/fuse/dev.c                      |   7 +-
 fs/fuse/file.c                     |   7 +-
 fs/gfs2/file.c                     |   2 +-
 fs/io_uring.c                      |   2 +-
 fs/iomap/direct-io.c               |  21 +-
 fs/nfs/direct.c                    |   8 +-
 fs/open.c                          |   1 +
 fs/read_write.c                    |   6 +-
 fs/splice.c                        |  54 +-
 fs/zonefs/super.c                  |   2 +-
 include/linux/fs.h                 |  21 +-
 include/linux/iomap.h              |   6 +
 include/linux/pipe_fs_i.h          |  29 +-
 include/linux/uaccess.h            |   4 +-
 include/linux/uio.h                |  50 +-
 lib/iov_iter.c                     | 978 ++++++++++++++-----------------------
 mm/shmem.c                         |   2 +-
 net/9p/client.c                    | 125 +----
 net/9p/protocol.c                  |   3 +-
 net/9p/trans_virtio.c              |  37 +-
 net/core/datagram.c                |   3 +-
 net/core/skmsg.c                   |   3 +-
 net/rds/message.c                  |   3 +-
 net/tls/tls_sw.c                   |   4 +-
 43 files changed, 589 insertions(+), 906 deletions(-)

  reply	other threads:[~2022-06-18  5:29 UTC|newest]

Thread overview: 93+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-07  4:08 [RFC][PATCHES] iov_iter stuff Al Viro
2022-06-07  4:09 ` [PATCH 1/9] No need of likely/unlikely on calls of check_copy_size() Al Viro
2022-06-07  4:41   ` Christoph Hellwig
2022-06-07 11:49   ` Christian Brauner
2022-06-07  4:09 ` [PATCH 2/9] btrfs_direct_write(): cleaner way to handle generic_write_sync() suppression Al Viro
2022-06-07  4:42   ` Christoph Hellwig
2022-06-07 16:06     ` Al Viro
2022-06-07 23:27       ` Al Viro
2022-06-07 23:31         ` [PATCH 01/10] No need of likely/unlikely on calls of check_copy_size() Al Viro
2022-06-07 23:31           ` [PATCH 02/10] teach iomap_dio_rw() to suppress dsync Al Viro
2022-06-08  6:18             ` Christoph Hellwig
2022-06-08 15:17             ` Darrick J. Wong
2022-06-10 11:38             ` Christian Brauner
2022-06-07 23:31           ` [PATCH 03/10] btrfs: use IOMAP_DIO_NOSYNC Al Viro
2022-06-08  6:18             ` Christoph Hellwig
2022-06-10 11:09             ` Christian Brauner
2022-06-07 23:31           ` [PATCH 04/10] struct file: use anonymous union member for rcuhead and llist Al Viro
2022-06-07 23:31           ` [PATCH 05/10] iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC Al Viro
2022-06-10 11:41             ` Christian Brauner
2022-06-07 23:31           ` [PATCH 06/10] keep iocb_flags() result cached in struct file Al Viro
2022-06-09  0:35             ` Dave Chinner
2022-06-10 11:43             ` Christian Brauner
2022-06-07 23:31           ` [PATCH 07/10] copy_page_{to,from}_iter(): switch iovec variants to generic Al Viro
2022-06-07 23:31           ` [PATCH 08/10] new iov_iter flavour - ITER_UBUF Al Viro
2022-06-07 23:31           ` [PATCH 09/10] switch new_sync_{read,write}() to ITER_UBUF Al Viro
2022-06-10 11:11             ` Christian Brauner
2022-06-07 23:31           ` [PATCH 10/10] iov_iter_bvec_advance(): don't bother with bvec_iter Al Viro
2022-06-08  6:16       ` [PATCH 2/9] btrfs_direct_write(): cleaner way to handle generic_write_sync() suppression Christoph Hellwig
2022-06-07 14:49   ` Matthew Wilcox
2022-06-07 20:17     ` Al Viro
2022-06-07  4:10 ` [PATCH 3/9] struct file: use anonymous union member for rcuhead and llist Al Viro
2022-06-07 10:18   ` Jan Kara
2022-06-07 11:46   ` Christian Brauner
2022-06-07  4:11 ` [PATCH 4/9] iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC Al Viro
2022-06-07 10:34   ` Jan Kara
2022-06-07 15:34     ` Al Viro
2022-06-07  4:11 ` [PATCH 5/9] keep iocb_flags() result cached in struct file Al Viro
2022-06-07  4:12 ` [PATCH 6/9] copy_page_{to,from}_iter(): switch iovec variants to generic Al Viro
2022-06-07  4:12 ` [PATCH 7/9] new iov_iter flavour - ITER_UBUF Al Viro
2022-06-07  4:13 ` [PATCH 8/9] switch new_sync_{read,write}() to ITER_UBUF Al Viro
2022-06-07  4:13 ` [PATCH 9/9] iov_iter_bvec_advance(): don't bother with bvec_iter Al Viro
2022-06-08 19:28 ` [RFC][PATCHES] iov_iter stuff Sedat Dilek
2022-06-08 20:39   ` Al Viro
2022-06-09 19:10     ` Sedat Dilek
2022-06-09 19:22       ` Matthew Wilcox
2022-06-09 19:58         ` Matthew Wilcox
2022-06-09 19:45       ` Al Viro
2022-06-17 22:30 ` Jens Axboe
2022-06-17 22:48   ` Al Viro
2022-06-18  5:27     ` Al Viro [this message]
2022-06-18  5:35       ` [PATCH 01/31] splice: stop abusing iov_iter_advance() to flush a pipe Al Viro
2022-06-18  5:35         ` [PATCH 02/31] ITER_PIPE: helper for getting pipe buffer by index Al Viro
2022-06-18  5:35         ` [PATCH 03/31] ITER_PIPE: helpers for adding pipe buffers Al Viro
2022-06-18  5:35         ` [PATCH 04/31] ITER_PIPE: allocate buffers as we go in copy-to-pipe primitives Al Viro
2022-06-19  1:34           ` Al Viro
2022-06-18  5:35         ` [PATCH 05/31] ITER_PIPE: fold push_pipe() into __pipe_get_pages() Al Viro
2022-06-18  5:35         ` [PATCH 06/31] ITER_PIPE: lose iter_head argument of __pipe_get_pages() Al Viro
2022-06-18  5:35         ` [PATCH 07/31] ITER_PIPE: clean pipe_advance() up Al Viro
2022-06-18  5:35         ` [PATCH 08/31] ITER_PIPE: clean iov_iter_revert() Al Viro
2022-06-18  5:35         ` [PATCH 09/31] ITER_PIPE: cache the type of last buffer Al Viro
2022-06-18  5:35         ` [PATCH 10/10] iov_iter_bvec_advance(): don't bother with bvec_iter Al Viro
2022-06-18  5:35         ` [PATCH 10/31] ITER_PIPE: fold data_start() and pipe_space_for_user() together Al Viro
2022-06-19  2:25           ` Al Viro
2022-06-18  5:35         ` [PATCH 11/31] iov_iter_get_pages{,_alloc}(): cap the maxsize with LONG_MAX Al Viro
2022-06-18  5:35         ` [PATCH 12/31] iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper Al Viro
2022-06-18  5:35         ` [PATCH 13/31] iov_iter_get_pages(): sanity-check arguments Al Viro
2022-06-19  3:07           ` Al Viro
2022-06-18  5:35         ` [PATCH 14/31] unify pipe_get_pages() and pipe_get_pages_alloc() Al Viro
2022-06-18  5:35         ` [PATCH 15/31] unify xarray_get_pages() and xarray_get_pages_alloc() Al Viro
2022-06-18  5:35         ` [PATCH 16/31] unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts Al Viro
2022-06-19  3:56           ` Al Viro
2022-06-18  5:35         ` [PATCH 17/31] ITER_XARRAY: don't open-code DIV_ROUND_UP() Al Viro
2022-06-18  5:35         ` [PATCH 18/31] iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment() Al Viro
2022-06-18  5:35         ` [PATCH 19/31] iov_iter: massage calling conventions for first_{iovec,bvec}_segment() Al Viro
2022-06-18 11:13           ` Al Viro
2022-06-18 11:18             ` Al Viro
2022-06-18  5:35         ` [PATCH 20/31] found_iovec_segment(): just return address Al Viro
2022-06-18  5:35         ` [PATCH 21/31] fold __pipe_get_pages() into pipe_get_pages() Al Viro
2022-06-18  5:35         ` [PATCH 22/31] iov_iter: saner helper for page array allocation Al Viro
2022-06-18 11:14           ` Al Viro
2022-06-18  5:35         ` [PATCH 23/31] iov_iter: advancing variants of iov_iter_get_pages{,_alloc}() Al Viro
2022-06-18  5:35         ` [PATCH 24/31] block: convert to " Al Viro
2022-06-18  5:35         ` [PATCH 25/31] iter_to_pipe(): switch to advancing variant of iov_iter_get_pages() Al Viro
2022-06-18  5:35         ` [PATCH 26/31] af_alg_make_sg(): " Al Viro
2022-06-18  5:35         ` [PATCH 27/31] 9p: convert to advancing variant of iov_iter_get_pages_alloc() Al Viro
2022-06-18  5:35         ` [PATCH 28/31] ceph: switch the last caller " Al Viro
2022-06-18  5:35         ` [PATCH 29/31] get rid of non-advancing variants Al Viro
2022-06-18 11:14           ` Al Viro
2022-06-18  5:35         ` [PATCH 30/31] pipe_get_pages(): switch to append_pipe() Al Viro
2022-06-19  4:01           ` Al Viro
2022-06-19  4:09             ` Al Viro
2022-06-18  5:35         ` [PATCH 31/31] expand those iov_iter_advance() Al Viro
2022-06-18 11:14           ` Al Viro

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Yq1iNHboD+9fz60M@ZenIV \
    --to=viro@zeniv.linux.org.uk \
    --cc=axboe@kernel.dk \
    --cc=hch@lst.de \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.