* [PATCH v7 00/12] iov_iter: Convert the iterator macros into inline funcs
From: David Howells @ 2023-09-25 12:02 UTC
  To: Jens Axboe
  Cc: David Howells, Al Viro, Linus Torvalds, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel

Hi Jens, Christian,

Could you take these patches into the block tree or the fs tree?  The
patches convert the iov_iter iteration macros to be inline functions.

 (1) Remove last_offset from iov_iter as it was only used by ITER_PIPE.

 (2) Add a __user tag on copy_mc_to_user()'s dst argument on x86 to match
     that on powerpc and get rid of a sparse warning.

 (3) Convert iter->user_backed to user_backed_iter() in the sound PCM
     driver.

 (4) Convert iter->user_backed to user_backed_iter() in a couple of
     infiniband drivers.

 (5) Renumber the type enum so that the ITER_* constants match the order in
     iterate_and_advance*().

 (6) Since the preceding patch puts UBUF and IOVEC at 0 and 1, change
     user_backed_iter() to just use the type value and get rid of the extra
     flag.

 (7) Convert the iov_iter iteration macros to always-inline functions to
     make the code easier to follow.  It uses function pointers, but they
     get optimised away.  The priv2 argument likewise gets optimised away
     if unused.

     The functions are placed into a header file so that bespoke iterators
     can be created elsewhere.  For instance, rbd has an optimisation that
     requires it to scan the buffer it is given to see if it is all zeros.
     It would be nice if this could use iterate_and_advance() - but that's
     currently buried inside lib/iov_iter.c.  (A sketch of such a scan is
     given after this list.)

     Whilst we could provide a generic iteration function that takes a pair
     of function pointers instead, it does add around 16% overhead in the
     framework, presumably from the combination of function pointers and
     various mitigations.

     Further, if it is known that only a particular iterator type is in
     play, just that type's iteration function can be called directly.

 (8) Move the check for ->copy_mc to _copy_from_iter() and
     copy_page_from_iter_atomic() rather than in memcpy_from_iter_mc()
     where it gets repeated for every segment.  Instead, we check once and
     invoke a side function that can use iterate_bvec() rather than
     iterate_and_advance() and supply a different step function.

 (9) Move the copy-and-csum code to net/ where it can be in proximity with
     the code that uses it.  This eliminates the code if CONFIG_NET=n and
     allows for the slim possibility of it being inlined.

(10) Fold csum_and_memcpy() into its two users.

(11) Move csum_and_copy_from_iter_full() out of line and merge in
     csum_and_copy_from_iter() since the former is the only caller of the
     latter.

(12) Move hash_and_copy_to_iter() to net/ where it can be with its only
     caller.
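
As an illustration of (7), here is a rough sketch of how a bespoke scan
such as rbd's all-zeros check might be built on the new header.  This is
only a sketch and not part of the series - the function names are made
up - but the step-function contract (return the amount of the segment
left unprocessed, 0 meaning "all done") is the one iterate_and_advance()
defines:

	#include <linux/iov_iter.h>
	#include <linux/string.h>

	/* Step function: called for each mapped segment in turn. */
	static size_t scan_zero_step(void *iter_base, size_t progress,
				     size_t len, void *priv, void *priv2)
	{
		bool *all_zero = priv;

		if (memchr_inv(iter_base, 0, len)) {
			*all_zero = false;
			return len;	/* unprocessed remainder stops the scan */
		}
		return 0;		/* segment fully processed */
	}

	/* Assumes a kernel-backed iterator (KVEC/BVEC/XARRAY), so no
	 * user-space step function is supplied.
	 */
	static bool iov_iter_is_all_zero(struct iov_iter *iter, size_t len)
	{
		bool all_zero = true;

		iterate_and_advance(iter, len, &all_zero, NULL, scan_zero_step);
		return all_zero;
	}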

Anyway, the changes in compiled function size either side of patches
(7)+(8) on x86_64 look like:

	__copy_from_iter_mc                      new 0xe8
	_copy_from_iter                          inc 0x360 -> 0x3a7 +0x47
	_copy_from_iter_flushcache               inc 0x34c -> 0x38e +0x42
	_copy_from_iter_nocache                  inc 0x354 -> 0x378 +0x24
	_copy_mc_to_iter                         inc 0x396 -> 0x3f1 +0x5b
	_copy_to_iter                            inc 0x33b -> 0x385 +0x4a
	copy_page_from_iter_atomic.part.0        inc 0x393 -> 0x3e0 +0x4d
	copy_page_to_iter_nofault.part.0         inc 0x3de -> 0x3eb +0xd
	copyin                                   del 0x30
	copyout                                  del 0x2d
	copyout_mc                               del 0x2b
	csum_and_copy_from_iter                  inc 0x3db -> 0x41d +0x42
	csum_and_copy_to_iter                    inc 0x45d -> 0x48d +0x30
	iov_iter_zero                            inc 0x34a -> 0x36a +0x20
	memcpy_from_iter.isra.0                  del 0x1f

Note that there's a noticeable expansion on some of the main functions
because a number of the helpers get inlined instead of being called.

In terms of benchmarking patches (7)+(8), three runs without them:

	iov_kunit_benchmark_ubuf: avg 3955 uS, stddev 169 uS
	iov_kunit_benchmark_ubuf: avg 4122 uS, stddev 1292 uS
	iov_kunit_benchmark_ubuf: avg 4451 uS, stddev 1362 uS
	iov_kunit_benchmark_iovec: avg 6607 uS, stddev 22 uS
	iov_kunit_benchmark_iovec: avg 6608 uS, stddev 19 uS
	iov_kunit_benchmark_iovec: avg 6609 uS, stddev 24 uS
	iov_kunit_benchmark_bvec: avg 3166 uS, stddev 11 uS
	iov_kunit_benchmark_bvec: avg 3167 uS, stddev 13 uS
	iov_kunit_benchmark_bvec: avg 3170 uS, stddev 16 uS
	iov_kunit_benchmark_bvec_split: avg 3394 uS, stddev 12 uS
	iov_kunit_benchmark_bvec_split: avg 3394 uS, stddev 20 uS
	iov_kunit_benchmark_bvec_split: avg 3395 uS, stddev 19 uS
	iov_kunit_benchmark_kvec: avg 2672 uS, stddev 12 uS
	iov_kunit_benchmark_kvec: avg 2672 uS, stddev 12 uS
	iov_kunit_benchmark_kvec: avg 2672 uS, stddev 9 uS
	iov_kunit_benchmark_xarray: avg 3719 uS, stddev 9 uS
	iov_kunit_benchmark_xarray: avg 3719 uS, stddev 9 uS
	iov_kunit_benchmark_xarray: avg 3721 uS, stddev 24 uS

and three runs with them:

	iov_kunit_benchmark_ubuf: avg 4110 uS, stddev 1254 uS
	iov_kunit_benchmark_ubuf: avg 4141 uS, stddev 1411 uS
	iov_kunit_benchmark_ubuf: avg 4572 uS, stddev 1889 uS
	iov_kunit_benchmark_iovec: avg 6582 uS, stddev 27 uS
	iov_kunit_benchmark_iovec: avg 6585 uS, stddev 25 uS
	iov_kunit_benchmark_iovec: avg 6586 uS, stddev 48 uS
	iov_kunit_benchmark_bvec: avg 3175 uS, stddev 13 uS
	iov_kunit_benchmark_bvec: avg 3177 uS, stddev 12 uS
	iov_kunit_benchmark_bvec: avg 3178 uS, stddev 12 uS
	iov_kunit_benchmark_bvec_split: avg 3380 uS, stddev 20 uS
	iov_kunit_benchmark_bvec_split: avg 3384 uS, stddev 15 uS
	iov_kunit_benchmark_bvec_split: avg 3386 uS, stddev 25 uS
	iov_kunit_benchmark_kvec: avg 2671 uS, stddev 11 uS
	iov_kunit_benchmark_kvec: avg 2672 uS, stddev 12 uS
	iov_kunit_benchmark_kvec: avg 2677 uS, stddev 20 uS
	iov_kunit_benchmark_xarray: avg 3599 uS, stddev 20 uS
	iov_kunit_benchmark_xarray: avg 3603 uS, stddev 8 uS
	iov_kunit_benchmark_xarray: avg 3610 uS, stddev 16 uS

I've pushed the patches here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-cleanup

David

Changes
=======
ver #7)
 - Defer the patch to add a kernel-backed-only iterator to my ceph branch.
 - Add missing sign-offs and cc's to commit descriptions.

ver #6)
 - Rebase on v6.6-rc2 rather than on my iov-kunit branch.
 - Add a patch to remove last_offset from iov_iter.
 - Add a patch to tag copy_mc_to_user() with __user on x86.
 - Document the priv2 arg of iterate_and_advance_kernel().

ver #5)
 - Moved kunit test patches out to a separate set.
 - Moved "iter->count - progress" into individual iteration subfunctions.
 - Fix iterate_iovec() to update iter->__iov after subtracting it to
   calculate iter->nr_segs.
 - Merged the function inlining patch and the move-to-header patch.
 - Rearranged the patch order slightly to put the patches to move
   networking stuff to net/ last.
 - Went back to a simpler patch to special-case ->copy_mc in
   _copy_from_iter() and copy_page_from_iter_atomic() before we get to call
   iterate_and_advance() so as to reduce the number of checks for this.

ver #4)
 - Fix iterate_bvec() and iterate_kvec() to update iter->bvec and
   iter->kvec after subtracting it to calculate iter->nr_segs.
 - Change iterate_xarray() to use start+progress rather than increasing
   start to reduce code size.
 - Added patches to move some iteration functions over to net/ as the files
   there can #include the iterator framework.
 - Added a patch to benchmark the iteration.
 - Tried an experimental patch to make copy_from_iter() and similar always
   catch MCE.

ver #3)
 - Use min_t(size_t,) not min() to avoid a warning on Hexagon.
 - Inline all the step functions.
 - Added a patch to better handle copy_mc.

ver #2)
 - Rebased on top of Willy's changes in linux-next.
 - Change the checksum argument to the iteration functions to be a general
   void* and use it to pass iter->copy_mc flag to memcpy_from_iter_mc() to
   avoid using a function pointer.
 - Arrange the end of the iterate_*() functions to look the same to give
   the optimiser the best chance.
 - Make iterate_and_advance() a wrapper around iterate_and_advance2().
 - Adjust iterate_and_advance2() to use if-else-if-else-if-else rather than
   switch(), to put ITER_BVEC before KVEC and to mark UBUF and IOVEC as
   likely().
 - Move "iter->count += progress" into iterate_and_advance2() from the
   iterate functions.
 - Mark a number of the iterator helpers with __always_inline.
 - Fix _copy_from_iter_flushcache() to use memcpy_from_iter_flushcache()
   not memcpy_from_iter().

Link: https://lore.kernel.org/r/3710261.1691764329@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/855.1692047347@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/20230816120741.534415-1-dhowells@redhat.com/ # v3
Link: https://lore.kernel.org/r/20230913165648.2570623-1-dhowells@redhat.com/ # v4
Link: https://lore.kernel.org/r/20230920222231.686275-1-dhowells@redhat.com/ # v5
Link: https://lore.kernel.org/r/20230922120227.1173720-1-dhowells@redhat.com/ # v6

David Howells (12):
  iov_iter: Remove last_offset from iov_iter as it was for ITER_PIPE
  iov_iter, x86: Be consistent about the __user tag on copy_mc_to_user()
  sound: Fix snd_pcm_readv()/writev() to use iov access functions
  infiniband: Use user_backed_iter() to see if iterator is UBUF/IOVEC
  iov_iter: Renumber ITER_* constants
  iov_iter: Derive user-backedness from the iterator type
  iov_iter: Convert iterate*() to inline funcs
  iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc()
  iov_iter, net: Move csum_and_copy_to/from_iter() to net/
  iov_iter, net: Fold in csum_and_memcpy()
  iov_iter, net: Merge csum_and_copy_from_iter{,_full}() together
  iov_iter, net: Move hash_and_copy_to_iter() to net/

 arch/x86/include/asm/uaccess.h           |   2 +-
 arch/x86/lib/copy_mc.c                   |   8 +-
 drivers/infiniband/hw/hfi1/file_ops.c    |   2 +-
 drivers/infiniband/hw/qib/qib_file_ops.c |   2 +-
 include/linux/iov_iter.h                 | 274 ++++++++++++++
 include/linux/skbuff.h                   |   3 +
 include/linux/uio.h                      |  34 +-
 lib/iov_iter.c                           | 441 +++++++----------------
 net/core/datagram.c                      |  75 +++-
 net/core/skbuff.c                        |  40 ++
 sound/core/pcm_native.c                  |   4 +-
 11 files changed, 544 insertions(+), 341 deletions(-)
 create mode 100644 include/linux/iov_iter.h



* [PATCH v7 01/12] iov_iter: Remove last_offset from iov_iter as it was for ITER_PIPE
From: David Howells @ 2023-09-25 12:02 UTC
  To: Jens Axboe
  Cc: David Howells, Al Viro, Linus Torvalds, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel

Now that ITER_PIPE has been removed, iov_iter::last_offset is no longer
used, so remove it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: David Laight <David.Laight@ACULAB.COM>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---
 include/linux/uio.h | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 42bce38a8e87..2000e42a6586 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -44,10 +44,7 @@ struct iov_iter {
 	bool nofault;
 	bool data_source;
 	bool user_backed;
-	union {
-		size_t iov_offset;
-		int last_offset;
-	};
+	size_t iov_offset;
 	/*
 	 * Hack alert: overlay ubuf_iovec with iovec + count, so
 	 * that the members resolve correctly regardless of the type



* [PATCH v7 02/12] iov_iter, x86: Be consistent about the __user tag on copy_mc_to_user()
From: David Howells @ 2023-09-25 12:02 UTC
  To: Jens Axboe
  Cc: David Howells, Al Viro, Linus Torvalds, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel,
	Dan Williams, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, H. Peter Anvin, x86

copy_mc_to_user() has the destination marked __user on powerpc, but not on
x86; the latter results in a sparse warning in lib/iov_iter.c.

Fix this by applying the tag on x86 too.

Fixes: ec6347bb4339 ("x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Dan Williams <dan.j.williams@intel.com>
cc: Thomas Gleixner <tglx@linutronix.de>
cc: Ingo Molnar <mingo@redhat.com>
cc: Borislav Petkov <bp@alien8.de>
cc: Dave Hansen <dave.hansen@linux.intel.com>
cc: "H. Peter Anvin" <hpa@zytor.com>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: David Laight <David.Laight@ACULAB.COM>
cc: x86@kernel.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---
 arch/x86/include/asm/uaccess.h | 2 +-
 arch/x86/lib/copy_mc.c         | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 8bae40a66282..5c367c1290c3 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -496,7 +496,7 @@ copy_mc_to_kernel(void *to, const void *from, unsigned len);
 #define copy_mc_to_kernel copy_mc_to_kernel
 
 unsigned long __must_check
-copy_mc_to_user(void *to, const void *from, unsigned len);
+copy_mc_to_user(void __user *to, const void *from, unsigned len);
 #endif
 
 /*
diff --git a/arch/x86/lib/copy_mc.c b/arch/x86/lib/copy_mc.c
index 80efd45a7761..6e8b7e600def 100644
--- a/arch/x86/lib/copy_mc.c
+++ b/arch/x86/lib/copy_mc.c
@@ -70,23 +70,23 @@ unsigned long __must_check copy_mc_to_kernel(void *dst, const void *src, unsigne
 }
 EXPORT_SYMBOL_GPL(copy_mc_to_kernel);
 
-unsigned long __must_check copy_mc_to_user(void *dst, const void *src, unsigned len)
+unsigned long __must_check copy_mc_to_user(void __user *dst, const void *src, unsigned len)
 {
 	unsigned long ret;
 
 	if (copy_mc_fragile_enabled) {
 		__uaccess_begin();
-		ret = copy_mc_fragile(dst, src, len);
+		ret = copy_mc_fragile((__force void *)dst, src, len);
 		__uaccess_end();
 		return ret;
 	}
 
 	if (static_cpu_has(X86_FEATURE_ERMS)) {
 		__uaccess_begin();
-		ret = copy_mc_enhanced_fast_string(dst, src, len);
+		ret = copy_mc_enhanced_fast_string((__force void *)dst, src, len);
 		__uaccess_end();
 		return ret;
 	}
 
-	return copy_user_generic(dst, src, len);
+	return copy_user_generic((__force void *)dst, src, len);
 }
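
As an aside for anyone unfamiliar with the annotation: sparse puts __user
pointers in a separate address space from plain kernel pointers, and
mixing the two spaces is what produces the warning being fixed here.  The
(__force) casts above deliberately strip the annotation at the points
where the pointer is handed to helpers taking plain pointers.  A minimal
illustration (hypothetical code, not from the patch):

	static void addrspace_demo(void __user *uptr)
	{
		void *kptr;

		kptr = uptr;			/* sparse: incorrect type in assignment */
		kptr = (__force void *)uptr;	/* deliberate strip - no warning */
	}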



* [PATCH v7 03/12] sound: Fix snd_pcm_readv()/writev() to use iov access functions
From: David Howells @ 2023-09-25 12:03 UTC
  To: Jens Axboe
  Cc: David Howells, Al Viro, Linus Torvalds, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel,
	Jaroslav Kysela, Takashi Iwai, Oswald Buddenhagen,
	Suren Baghdasaryan, Kuninori Morimoto, alsa-devel

Fix snd_pcm_readv()/writev() to use iov access functions rather than poking
at the iov_iter internals directly.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jaroslav Kysela <perex@perex.cz>
Reviewed-by: Takashi Iwai <tiwai@suse.de>
cc: Oswald Buddenhagen <oswald.buddenhagen@gmx.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Suren Baghdasaryan <surenb@google.com>
cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
cc: alsa-devel@alsa-project.org
---
 sound/core/pcm_native.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/sound/core/pcm_native.c b/sound/core/pcm_native.c
index bd9ddf412b46..9a69236fa207 100644
--- a/sound/core/pcm_native.c
+++ b/sound/core/pcm_native.c
@@ -3527,7 +3527,7 @@ static ssize_t snd_pcm_readv(struct kiocb *iocb, struct iov_iter *to)
 	if (runtime->state == SNDRV_PCM_STATE_OPEN ||
 	    runtime->state == SNDRV_PCM_STATE_DISCONNECTED)
 		return -EBADFD;
-	if (!to->user_backed)
+	if (!user_backed_iter(to))
 		return -EINVAL;
 	if (to->nr_segs > 1024 || to->nr_segs != runtime->channels)
 		return -EINVAL;
@@ -3567,7 +3567,7 @@ static ssize_t snd_pcm_writev(struct kiocb *iocb, struct iov_iter *from)
 	if (runtime->state == SNDRV_PCM_STATE_OPEN ||
 	    runtime->state == SNDRV_PCM_STATE_DISCONNECTED)
 		return -EBADFD;
-	if (!from->user_backed)
+	if (!user_backed_iter(from))
 		return -EINVAL;
 	if (from->nr_segs > 128 || from->nr_segs != runtime->channels ||
 	    !frame_aligned(runtime, iov->iov_len))



* [PATCH v7 04/12] infiniband: Use user_backed_iter() to see if iterator is UBUF/IOVEC
From: David Howells @ 2023-09-25 12:03 UTC
  To: Jens Axboe
  Cc: David Howells, Al Viro, Linus Torvalds, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel,
	Dennis Dalessandro, Jason Gunthorpe, Leon Romanovsky, linux-rdma

Use user_backed_iter() to see if iterator is UBUF/IOVEC rather than poking
inside the iterator.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
cc: Jason Gunthorpe <jgg@ziepe.ca>
cc: Leon Romanovsky <leon@kernel.org>
cc: linux-rdma@vger.kernel.org
---
 drivers/infiniband/hw/hfi1/file_ops.c    | 2 +-
 drivers/infiniband/hw/qib/qib_file_ops.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
index a5ab22cedd41..788fc249234f 100644
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -267,7 +267,7 @@ static ssize_t hfi1_write_iter(struct kiocb *kiocb, struct iov_iter *from)
 
 	if (!HFI1_CAP_IS_KSET(SDMA))
 		return -EINVAL;
-	if (!from->user_backed)
+	if (!user_backed_iter(from))
 		return -EINVAL;
 	idx = srcu_read_lock(&fd->pq_srcu);
 	pq = srcu_dereference(fd->pq, &fd->pq_srcu);
diff --git a/drivers/infiniband/hw/qib/qib_file_ops.c b/drivers/infiniband/hw/qib/qib_file_ops.c
index 152952127f13..29e4c59aa23b 100644
--- a/drivers/infiniband/hw/qib/qib_file_ops.c
+++ b/drivers/infiniband/hw/qib/qib_file_ops.c
@@ -2244,7 +2244,7 @@ static ssize_t qib_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	struct qib_ctxtdata *rcd = ctxt_fp(iocb->ki_filp);
 	struct qib_user_sdma_queue *pq = fp->pq;
 
-	if (!from->user_backed || !from->nr_segs || !pq)
+	if (!user_backed_iter(from) || !from->nr_segs || !pq)
 		return -EINVAL;
 
 	return qib_user_sdma_writev(rcd, pq, iter_iov(from), from->nr_segs);



* [PATCH v7 05/12] iov_iter: Renumber ITER_* constants
From: David Howells @ 2023-09-25 12:03 UTC
  To: Jens Axboe
  Cc: David Howells, Al Viro, Linus Torvalds, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel

Renumber the ITER_* iterator-type constants to put things in the same order
as in the iteration functions and to group user-backed iterators at the
bottom.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: David Laight <David.Laight@ACULAB.COM>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---
 include/linux/uio.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 2000e42a6586..bef8e56aa45c 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -21,12 +21,12 @@ struct kvec {
 
 enum iter_type {
 	/* iter types */
+	ITER_UBUF,
 	ITER_IOVEC,
-	ITER_KVEC,
 	ITER_BVEC,
+	ITER_KVEC,
 	ITER_XARRAY,
 	ITER_DISCARD,
-	ITER_UBUF,
 };
 
 #define ITER_SOURCE	1	// == WRITE



* [PATCH v7 06/12] iov_iter: Derive user-backedness from the iterator type
From: David Howells @ 2023-09-25 12:03 UTC
  To: Jens Axboe
  Cc: David Howells, Al Viro, Linus Torvalds, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel

Use the iterator type to determine whether an iterator is user-backed or
not rather than using a special flag for it.  Now that ITER_UBUF and
ITER_IOVEC are 0 and 1, they can be checked with a single comparison.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: David Laight <David.Laight@ACULAB.COM>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---
 include/linux/uio.h | 4 +---
 lib/iov_iter.c      | 1 -
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index bef8e56aa45c..65d9143f83c8 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -43,7 +43,6 @@ struct iov_iter {
 	bool copy_mc;
 	bool nofault;
 	bool data_source;
-	bool user_backed;
 	size_t iov_offset;
 	/*
 	 * Hack alert: overlay ubuf_iovec with iovec + count, so
@@ -140,7 +139,7 @@ static inline unsigned char iov_iter_rw(const struct iov_iter *i)
 
 static inline bool user_backed_iter(const struct iov_iter *i)
 {
-	return i->user_backed;
+	return iter_is_ubuf(i) || iter_is_iovec(i);
 }
 
 /*
@@ -380,7 +379,6 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
 	*i = (struct iov_iter) {
 		.iter_type = ITER_UBUF,
 		.copy_mc = false,
-		.user_backed = true,
 		.data_source = direction,
 		.ubuf = buf,
 		.count = count,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 27234a820eeb..227c9f536b94 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -290,7 +290,6 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
 		.iter_type = ITER_IOVEC,
 		.copy_mc = false,
 		.nofault = false,
-		.user_backed = true,
 		.data_source = direction,
 		.__iov = iov,
 		.nr_segs = nr_segs,
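
For illustration (not part of the patch): with ITER_UBUF == 0 and
ITER_IOVEC == 1 after the renumbering, the compiler can collapse the two
equality tests in user_backed_iter() into a single unsigned comparison,
equivalent to this hypothetical form:

	static inline bool user_backed_iter_equiv(const struct iov_iter *i)
	{
		return i->iter_type <= ITER_IOVEC;
	}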



* [PATCH v7 07/12] iov_iter: Convert iterate*() to inline funcs
From: David Howells @ 2023-09-25 12:03 UTC
  To: Jens Axboe
  Cc: David Howells, Al Viro, Linus Torvalds, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel

Convert the iov_iter iteration macros to inline functions to make the code
easier to follow.

The functions are marked __always_inline as we don't want to end up with
indirect calls in the code.  This, however, leaves dealing with ->copy_mc
in an awkward situation since the step function (memcpy_from_iter_mc())
needs to test the flag in the iterator, but isn't passed the iterator.
This will be dealt with in a follow-up patch.

The variable names in the per-type iterator functions have been harmonised
as much as possible and made clearer as to their purpose.

The iterator functions are also moved to a header file so that other
operations that need to scan over an iterator can be added.  For instance,
the rbd driver could use this to scan a buffer to see if it is all zeros
and libceph could use this to generate a crc.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: David Laight <David.Laight@ACULAB.COM>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/3710261.1691764329@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/855.1692047347@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/20230816120741.534415-1-dhowells@redhat.com/ # v3
---

Notes:
    Changes
    =======
    ver #5)
     - Merge in patch to move iteration framework to a header file.
     - Move "iter->count - progress" into individual iteration subfunctions.

 include/linux/iov_iter.h | 274 ++++++++++++++++++++++++++
 lib/iov_iter.c           | 416 ++++++++++++++++-----------------------
 2 files changed, 449 insertions(+), 241 deletions(-)
 create mode 100644 include/linux/iov_iter.h

diff --git a/include/linux/iov_iter.h b/include/linux/iov_iter.h
new file mode 100644
index 000000000000..270454a6703d
--- /dev/null
+++ b/include/linux/iov_iter.h
@@ -0,0 +1,274 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* I/O iterator iteration building functions.
+ *
+ * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#ifndef _LINUX_IOV_ITER_H
+#define _LINUX_IOV_ITER_H
+
+#include <linux/uio.h>
+#include <linux/bvec.h>
+
+typedef size_t (*iov_step_f)(void *iter_base, size_t progress, size_t len,
+			     void *priv, void *priv2);
+typedef size_t (*iov_ustep_f)(void __user *iter_base, size_t progress, size_t len,
+			      void *priv, void *priv2);
+
+/*
+ * Handle ITER_UBUF.
+ */
+static __always_inline
+size_t iterate_ubuf(struct iov_iter *iter, size_t len, void *priv, void *priv2,
+		    iov_ustep_f step)
+{
+	void __user *base = iter->ubuf;
+	size_t progress = 0, remain;
+
+	remain = step(base + iter->iov_offset, 0, len, priv, priv2);
+	progress = len - remain;
+	iter->iov_offset += progress;
+	iter->count -= progress;
+	return progress;
+}
+
+/*
+ * Handle ITER_IOVEC.
+ */
+static __always_inline
+size_t iterate_iovec(struct iov_iter *iter, size_t len, void *priv, void *priv2,
+		     iov_ustep_f step)
+{
+	const struct iovec *p = iter->__iov;
+	size_t progress = 0, skip = iter->iov_offset;
+
+	do {
+		size_t remain, consumed;
+		size_t part = min(len, p->iov_len - skip);
+
+		if (likely(part)) {
+			remain = step(p->iov_base + skip, progress, part, priv, priv2);
+			consumed = part - remain;
+			progress += consumed;
+			skip += consumed;
+			len -= consumed;
+			if (skip < p->iov_len)
+				break;
+		}
+		p++;
+		skip = 0;
+	} while (len);
+
+	iter->nr_segs -= p - iter->__iov;
+	iter->__iov = p;
+	iter->iov_offset = skip;
+	iter->count -= progress;
+	return progress;
+}
+
+/*
+ * Handle ITER_KVEC.
+ */
+static __always_inline
+size_t iterate_kvec(struct iov_iter *iter, size_t len, void *priv, void *priv2,
+		    iov_step_f step)
+{
+	const struct kvec *p = iter->kvec;
+	size_t progress = 0, skip = iter->iov_offset;
+
+	do {
+		size_t remain, consumed;
+		size_t part = min(len, p->iov_len - skip);
+
+		if (likely(part)) {
+			remain = step(p->iov_base + skip, progress, part, priv, priv2);
+			consumed = part - remain;
+			progress += consumed;
+			skip += consumed;
+			len -= consumed;
+			if (skip < p->iov_len)
+				break;
+		}
+		p++;
+		skip = 0;
+	} while (len);
+
+	iter->nr_segs -= p - iter->kvec;
+	iter->kvec = p;
+	iter->iov_offset = skip;
+	iter->count -= progress;
+	return progress;
+}
+
+/*
+ * Handle ITER_BVEC.
+ */
+static __always_inline
+size_t iterate_bvec(struct iov_iter *iter, size_t len, void *priv, void *priv2,
+		    iov_step_f step)
+{
+	const struct bio_vec *p = iter->bvec;
+	size_t progress = 0, skip = iter->iov_offset;
+
+	do {
+		size_t remain, consumed;
+		size_t offset = p->bv_offset + skip, part;
+		void *kaddr = kmap_local_page(p->bv_page + offset / PAGE_SIZE);
+
+		part = min3(len,
+			   (size_t)(p->bv_len - skip),
+			   (size_t)(PAGE_SIZE - offset % PAGE_SIZE));
+		remain = step(kaddr + offset % PAGE_SIZE, progress, part, priv, priv2);
+		kunmap_local(kaddr);
+		consumed = part - remain;
+		len -= consumed;
+		progress += consumed;
+		skip += consumed;
+		if (skip >= p->bv_len) {
+			skip = 0;
+			p++;
+		}
+		if (remain)
+			break;
+	} while (len);
+
+	iter->nr_segs -= p - iter->bvec;
+	iter->bvec = p;
+	iter->iov_offset = skip;
+	iter->count -= progress;
+	return progress;
+}
+
+/*
+ * Handle ITER_XARRAY.
+ */
+static __always_inline
+size_t iterate_xarray(struct iov_iter *iter, size_t len, void *priv, void *priv2,
+		      iov_step_f step)
+{
+	struct folio *folio;
+	size_t progress = 0;
+	loff_t start = iter->xarray_start + iter->iov_offset;
+	pgoff_t index = start / PAGE_SIZE;
+	XA_STATE(xas, iter->xarray, index);
+
+	rcu_read_lock();
+	xas_for_each(&xas, folio, ULONG_MAX) {
+		size_t remain, consumed, offset, part, flen;
+
+		if (xas_retry(&xas, folio))
+			continue;
+		if (WARN_ON(xa_is_value(folio)))
+			break;
+		if (WARN_ON(folio_test_hugetlb(folio)))
+			break;
+
+		offset = offset_in_folio(folio, start + progress);
+		flen = min(folio_size(folio) - offset, len);
+
+		while (flen) {
+			void *base = kmap_local_folio(folio, offset);
+
+			part = min_t(size_t, flen,
+				     PAGE_SIZE - offset_in_page(offset));
+			remain = step(base, progress, part, priv, priv2);
+			kunmap_local(base);
+
+			consumed = part - remain;
+			progress += consumed;
+			len -= consumed;
+
+			if (remain || len == 0)
+				goto out;
+			flen -= consumed;
+			offset += consumed;
+		}
+	}
+
+out:
+	rcu_read_unlock();
+	iter->iov_offset += progress;
+	iter->count -= progress;
+	return progress;
+}
+
+/*
+ * Handle ITER_DISCARD.
+ */
+static __always_inline
+size_t iterate_discard(struct iov_iter *iter, size_t len, void *priv, void *priv2,
+		      iov_step_f step)
+{
+	size_t progress = len;
+
+	iter->count -= progress;
+	return progress;
+}
+
+/**
+ * iterate_and_advance2 - Iterate over an iterator
+ * @iter: The iterator to iterate over.
+ * @len: The amount to iterate over.
+ * @priv: Data for the step functions.
+ * @priv2: More data for the step functions.
+ * @ustep: Function for UBUF/IOVEC iterators; given __user addresses.
+ * @step: Function for other iterators; given kernel addresses.
+ *
+ * Iterate over the next part of an iterator, up to the specified length.  The
+ * buffer is presented in segments, which for kernel iteration are broken up by
+ * physical pages and mapped, with the mapped address being presented.
+ *
+ * Two step functions, @step and @ustep, must be provided, one for handling
+ * mapped kernel addresses and the other is given user addresses which have the
+ * potential to fault since no pinning is performed.
+ *
+ * The step functions are passed the address and length of the segment, @priv,
+ * @priv2 and the amount of data so far iterated over (which can, for example,
+ * be added to @priv to point to the right part of a second buffer).  The step
+ * functions should return the amount of the segment they didn't process (ie. 0
+ * indicates complete processing).
+ *
+ * This function returns the amount of data processed (ie. 0 means nothing was
+ * processed and a value of @len means it was processed to completion).
+ */
+static __always_inline
+size_t iterate_and_advance2(struct iov_iter *iter, size_t len, void *priv,
+			    void *priv2, iov_ustep_f ustep, iov_step_f step)
+{
+	if (unlikely(iter->count < len))
+		len = iter->count;
+	if (unlikely(!len))
+		return 0;
+
+	if (likely(iter_is_ubuf(iter)))
+		return iterate_ubuf(iter, len, priv, priv2, ustep);
+	if (likely(iter_is_iovec(iter)))
+		return iterate_iovec(iter, len, priv, priv2, ustep);
+	if (iov_iter_is_bvec(iter))
+		return iterate_bvec(iter, len, priv, priv2, step);
+	if (iov_iter_is_kvec(iter))
+		return iterate_kvec(iter, len, priv, priv2, step);
+	if (iov_iter_is_xarray(iter))
+		return iterate_xarray(iter, len, priv, priv2, step);
+	return iterate_discard(iter, len, priv, priv2, step);
+}
+
+/**
+ * iterate_and_advance - Iterate over an iterator
+ * @iter: The iterator to iterate over.
+ * @len: The amount to iterate over.
+ * @priv: Data for the step functions.
+ * @ustep: Function for UBUF/IOVEC iterators; given __user addresses.
+ * @step: Function for other iterators; given kernel addresses.
+ *
+ * As iterate_and_advance2(), but priv2 is always NULL.
+ */
+static __always_inline
+size_t iterate_and_advance(struct iov_iter *iter, size_t len, void *priv,
+			   iov_ustep_f ustep, iov_step_f step)
+{
+	return iterate_and_advance2(iter, len, priv, NULL, ustep, step);
+}
+
+#endif /* _LINUX_IOV_ITER_H */
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 227c9f536b94..65374ee91ecd 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -13,189 +13,69 @@
 #include <net/checksum.h>
 #include <linux/scatterlist.h>
 #include <linux/instrumented.h>
+#include <linux/iov_iter.h>
 
-/* covers ubuf and kbuf alike */
-#define iterate_buf(i, n, base, len, off, __p, STEP) {		\
-	size_t __maybe_unused off = 0;				\
-	len = n;						\
-	base = __p + i->iov_offset;				\
-	len -= (STEP);						\
-	i->iov_offset += len;					\
-	n = len;						\
-}
-
-/* covers iovec and kvec alike */
-#define iterate_iovec(i, n, base, len, off, __p, STEP) {	\
-	size_t off = 0;						\
-	size_t skip = i->iov_offset;				\
-	do {							\
-		len = min(n, __p->iov_len - skip);		\
-		if (likely(len)) {				\
-			base = __p->iov_base + skip;		\
-			len -= (STEP);				\
-			off += len;				\
-			skip += len;				\
-			n -= len;				\
-			if (skip < __p->iov_len)		\
-				break;				\
-		}						\
-		__p++;						\
-		skip = 0;					\
-	} while (n);						\
-	i->iov_offset = skip;					\
-	n = off;						\
-}
-
-#define iterate_bvec(i, n, base, len, off, p, STEP) {		\
-	size_t off = 0;						\
-	unsigned skip = i->iov_offset;				\
-	while (n) {						\
-		unsigned offset = p->bv_offset + skip;		\
-		unsigned left;					\
-		void *kaddr = kmap_local_page(p->bv_page +	\
-					offset / PAGE_SIZE);	\
-		base = kaddr + offset % PAGE_SIZE;		\
-		len = min(min(n, (size_t)(p->bv_len - skip)),	\
-		     (size_t)(PAGE_SIZE - offset % PAGE_SIZE));	\
-		left = (STEP);					\
-		kunmap_local(kaddr);				\
-		len -= left;					\
-		off += len;					\
-		skip += len;					\
-		if (skip == p->bv_len) {			\
-			skip = 0;				\
-			p++;					\
-		}						\
-		n -= len;					\
-		if (left)					\
-			break;					\
-	}							\
-	i->iov_offset = skip;					\
-	n = off;						\
-}
-
-#define iterate_xarray(i, n, base, len, __off, STEP) {		\
-	__label__ __out;					\
-	size_t __off = 0;					\
-	struct folio *folio;					\
-	loff_t start = i->xarray_start + i->iov_offset;		\
-	pgoff_t index = start / PAGE_SIZE;			\
-	XA_STATE(xas, i->xarray, index);			\
-								\
-	len = PAGE_SIZE - offset_in_page(start);		\
-	rcu_read_lock();					\
-	xas_for_each(&xas, folio, ULONG_MAX) {			\
-		unsigned left;					\
-		size_t offset;					\
-		if (xas_retry(&xas, folio))			\
-			continue;				\
-		if (WARN_ON(xa_is_value(folio)))		\
-			break;					\
-		if (WARN_ON(folio_test_hugetlb(folio)))		\
-			break;					\
-		offset = offset_in_folio(folio, start + __off);	\
-		while (offset < folio_size(folio)) {		\
-			base = kmap_local_folio(folio, offset);	\
-			len = min(n, len);			\
-			left = (STEP);				\
-			kunmap_local(base);			\
-			len -= left;				\
-			__off += len;				\
-			n -= len;				\
-			if (left || n == 0)			\
-				goto __out;			\
-			offset += len;				\
-			len = PAGE_SIZE;			\
-		}						\
-	}							\
-__out:								\
-	rcu_read_unlock();					\
-	i->iov_offset += __off;					\
-	n = __off;						\
-}
-
-#define __iterate_and_advance(i, n, base, len, off, I, K) {	\
-	if (unlikely(i->count < n))				\
-		n = i->count;					\
-	if (likely(n)) {					\
-		if (likely(iter_is_ubuf(i))) {			\
-			void __user *base;			\
-			size_t len;				\
-			iterate_buf(i, n, base, len, off,	\
-						i->ubuf, (I)) 	\
-		} else if (likely(iter_is_iovec(i))) {		\
-			const struct iovec *iov = iter_iov(i);	\
-			void __user *base;			\
-			size_t len;				\
-			iterate_iovec(i, n, base, len, off,	\
-						iov, (I))	\
-			i->nr_segs -= iov - iter_iov(i);	\
-			i->__iov = iov;				\
-		} else if (iov_iter_is_bvec(i)) {		\
-			const struct bio_vec *bvec = i->bvec;	\
-			void *base;				\
-			size_t len;				\
-			iterate_bvec(i, n, base, len, off,	\
-						bvec, (K))	\
-			i->nr_segs -= bvec - i->bvec;		\
-			i->bvec = bvec;				\
-		} else if (iov_iter_is_kvec(i)) {		\
-			const struct kvec *kvec = i->kvec;	\
-			void *base;				\
-			size_t len;				\
-			iterate_iovec(i, n, base, len, off,	\
-						kvec, (K))	\
-			i->nr_segs -= kvec - i->kvec;		\
-			i->kvec = kvec;				\
-		} else if (iov_iter_is_xarray(i)) {		\
-			void *base;				\
-			size_t len;				\
-			iterate_xarray(i, n, base, len, off,	\
-							(K))	\
-		}						\
-		i->count -= n;					\
-	}							\
-}
-#define iterate_and_advance(i, n, base, len, off, I, K) \
-	__iterate_and_advance(i, n, base, len, off, I, ((void)(K),0))
-
-static int copyout(void __user *to, const void *from, size_t n)
+static __always_inline
+size_t copy_to_user_iter(void __user *iter_to, size_t progress,
+			 size_t len, void *from, void *priv2)
 {
 	if (should_fail_usercopy())
-		return n;
-	if (access_ok(to, n)) {
-		instrument_copy_to_user(to, from, n);
-		n = raw_copy_to_user(to, from, n);
+		return len;
+	if (access_ok(iter_to, len)) {
+		from += progress;
+		instrument_copy_to_user(iter_to, from, len);
+		len = raw_copy_to_user(iter_to, from, len);
 	}
-	return n;
+	return len;
 }
 
-static int copyout_nofault(void __user *to, const void *from, size_t n)
+static __always_inline
+size_t copy_to_user_iter_nofault(void __user *iter_to, size_t progress,
+				 size_t len, void *from, void *priv2)
 {
-	long res;
+	ssize_t res;
 
 	if (should_fail_usercopy())
-		return n;
-
-	res = copy_to_user_nofault(to, from, n);
+		return len;
 
-	return res < 0 ? n : res;
+	from += progress;
+	res = copy_to_user_nofault(iter_to, from, len);
+	return res < 0 ? len : res;
 }
 
-static int copyin(void *to, const void __user *from, size_t n)
+static __always_inline
+size_t copy_from_user_iter(void __user *iter_from, size_t progress,
+			   size_t len, void *to, void *priv2)
 {
-	size_t res = n;
+	size_t res = len;
 
 	if (should_fail_usercopy())
-		return n;
-	if (access_ok(from, n)) {
-		instrument_copy_from_user_before(to, from, n);
-		res = raw_copy_from_user(to, from, n);
-		instrument_copy_from_user_after(to, from, n, res);
+		return len;
+	if (access_ok(iter_from, len)) {
+		to += progress;
+		instrument_copy_from_user_before(to, iter_from, len);
+		res = raw_copy_from_user(to, iter_from, len);
+		instrument_copy_from_user_after(to, iter_from, len, res);
 	}
 	return res;
 }
 
+static __always_inline
+size_t memcpy_to_iter(void *iter_to, size_t progress,
+		      size_t len, void *from, void *priv2)
+{
+	memcpy(iter_to, from + progress, len);
+	return 0;
+}
+
+static __always_inline
+size_t memcpy_from_iter(void *iter_from, size_t progress,
+			size_t len, void *to, void *priv2)
+{
+	memcpy(to + progress, iter_from, len);
+	return 0;
+}
+
 /*
  * fault_in_iov_iter_readable - fault in iov iterator for reading
  * @i: iterator
@@ -312,23 +192,29 @@ size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 		return 0;
 	if (user_backed_iter(i))
 		might_fault();
-	iterate_and_advance(i, bytes, base, len, off,
-		copyout(base, addr + off, len),
-		memcpy(base, addr + off, len)
-	)
-
-	return bytes;
+	return iterate_and_advance(i, bytes, (void *)addr,
+				   copy_to_user_iter, memcpy_to_iter);
 }
 EXPORT_SYMBOL(_copy_to_iter);
 
 #ifdef CONFIG_ARCH_HAS_COPY_MC
-static int copyout_mc(void __user *to, const void *from, size_t n)
-{
-	if (access_ok(to, n)) {
-		instrument_copy_to_user(to, from, n);
-		n = copy_mc_to_user((__force void *) to, from, n);
+static __always_inline
+size_t copy_to_user_iter_mc(void __user *iter_to, size_t progress,
+			    size_t len, void *from, void *priv2)
+{
+	if (access_ok(iter_to, len)) {
+		from += progress;
+		instrument_copy_to_user(iter_to, from, len);
+		len = copy_mc_to_user(iter_to, from, len);
 	}
-	return n;
+	return len;
+}
+
+static __always_inline
+size_t memcpy_to_iter_mc(void *iter_to, size_t progress,
+			 size_t len, void *from, void *priv2)
+{
+	return copy_mc_to_kernel(iter_to, from + progress, len);
 }
 
 /**
@@ -361,22 +247,20 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 		return 0;
 	if (user_backed_iter(i))
 		might_fault();
-	__iterate_and_advance(i, bytes, base, len, off,
-		copyout_mc(base, addr + off, len),
-		copy_mc_to_kernel(base, addr + off, len)
-	)
-
-	return bytes;
+	return iterate_and_advance(i, bytes, (void *)addr,
+				   copy_to_user_iter_mc, memcpy_to_iter_mc);
 }
 EXPORT_SYMBOL_GPL(_copy_mc_to_iter);
 #endif /* CONFIG_ARCH_HAS_COPY_MC */
 
-static void *memcpy_from_iter(struct iov_iter *i, void *to, const void *from,
-				 size_t size)
+static size_t memcpy_from_iter_mc(void *iter_from, size_t progress,
+				  size_t len, void *to, void *priv2)
 {
-	if (iov_iter_is_copy_mc(i))
-		return (void *)copy_mc_to_kernel(to, from, size);
-	return memcpy(to, from, size);
+	struct iov_iter *iter = priv2;
+
+	if (iov_iter_is_copy_mc(iter))
+		return copy_mc_to_kernel(to + progress, iter_from, len);
+	return memcpy_from_iter(iter_from, progress, len, to, priv2);
 }
 
 size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
@@ -386,30 +270,46 @@ size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 
 	if (user_backed_iter(i))
 		might_fault();
-	iterate_and_advance(i, bytes, base, len, off,
-		copyin(addr + off, base, len),
-		memcpy_from_iter(i, addr + off, base, len)
-	)
-
-	return bytes;
+	return iterate_and_advance2(i, bytes, addr, i,
+				    copy_from_user_iter,
+				    memcpy_from_iter_mc);
 }
 EXPORT_SYMBOL(_copy_from_iter);
 
+static __always_inline
+size_t copy_from_user_iter_nocache(void __user *iter_from, size_t progress,
+				   size_t len, void *to, void *priv2)
+{
+	return __copy_from_user_inatomic_nocache(to + progress, iter_from, len);
+}
+
 size_t _copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 {
 	if (WARN_ON_ONCE(!i->data_source))
 		return 0;
 
-	iterate_and_advance(i, bytes, base, len, off,
-		__copy_from_user_inatomic_nocache(addr + off, base, len),
-		memcpy(addr + off, base, len)
-	)
-
-	return bytes;
+	return iterate_and_advance(i, bytes, addr,
+				   copy_from_user_iter_nocache,
+				   memcpy_from_iter);
 }
 EXPORT_SYMBOL(_copy_from_iter_nocache);
 
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
+static __always_inline
+size_t copy_from_user_iter_flushcache(void __user *iter_from, size_t progress,
+				      size_t len, void *to, void *priv2)
+{
+	return __copy_from_user_flushcache(to + progress, iter_from, len);
+}
+
+static __always_inline
+size_t memcpy_from_iter_flushcache(void *iter_from, size_t progress,
+				   size_t len, void *to, void *priv2)
+{
+	memcpy_flushcache(to + progress, iter_from, len);
+	return 0;
+}
+
 /**
  * _copy_from_iter_flushcache - write destination through cpu cache
  * @addr: destination kernel address
@@ -431,12 +331,9 @@ size_t _copy_from_iter_flushcache(void *addr, size_t bytes, struct iov_iter *i)
 	if (WARN_ON_ONCE(!i->data_source))
 		return 0;
 
-	iterate_and_advance(i, bytes, base, len, off,
-		__copy_from_user_flushcache(addr + off, base, len),
-		memcpy_flushcache(addr + off, base, len)
-	)
-
-	return bytes;
+	return iterate_and_advance(i, bytes, addr,
+				   copy_from_user_iter_flushcache,
+				   memcpy_from_iter_flushcache);
 }
 EXPORT_SYMBOL_GPL(_copy_from_iter_flushcache);
 #endif
@@ -508,10 +405,9 @@ size_t copy_page_to_iter_nofault(struct page *page, unsigned offset, size_t byte
 		void *kaddr = kmap_local_page(page);
 		size_t n = min(bytes, (size_t)PAGE_SIZE - offset);
 
-		iterate_and_advance(i, n, base, len, off,
-			copyout_nofault(base, kaddr + offset + off, len),
-			memcpy(base, kaddr + offset + off, len)
-		)
+		n = iterate_and_advance(i, n, kaddr + offset,
+					copy_to_user_iter_nofault,
+					memcpy_to_iter);
 		kunmap_local(kaddr);
 		res += n;
 		bytes -= n;
@@ -554,14 +450,25 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 }
 EXPORT_SYMBOL(copy_page_from_iter);
 
-size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
+static __always_inline
+size_t zero_to_user_iter(void __user *iter_to, size_t progress,
+			 size_t len, void *priv, void *priv2)
 {
-	iterate_and_advance(i, bytes, base, len, count,
-		clear_user(base, len),
-		memset(base, 0, len)
-	)
+	return clear_user(iter_to, len);
+}
 
-	return bytes;
+static __always_inline
+size_t zero_to_iter(void *iter_to, size_t progress,
+		    size_t len, void *priv, void *priv2)
+{
+	memset(iter_to, 0, len);
+	return 0;
+}
+
+size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
+{
+	return iterate_and_advance(i, bytes, NULL,
+				   zero_to_user_iter, zero_to_iter);
 }
 EXPORT_SYMBOL(iov_iter_zero);
 
@@ -586,10 +493,9 @@ size_t copy_page_from_iter_atomic(struct page *page, size_t offset,
 		}
 
 		p = kmap_atomic(page) + offset;
-		iterate_and_advance(i, n, base, len, off,
-			copyin(p + off, base, len),
-			memcpy_from_iter(i, p + off, base, len)
-		)
+		n = iterate_and_advance2(i, n, p, i,
+					 copy_from_user_iter,
+					 memcpy_from_iter_mc);
 		kunmap_atomic(p);
 		copied += n;
 		offset += n;
@@ -1180,32 +1086,64 @@ ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
 }
 EXPORT_SYMBOL(iov_iter_get_pages_alloc2);
 
+static __always_inline
+size_t copy_from_user_iter_csum(void __user *iter_from, size_t progress,
+				size_t len, void *to, void *priv2)
+{
+	__wsum next, *csum = priv2;
+
+	next = csum_and_copy_from_user(iter_from, to + progress, len);
+	*csum = csum_block_add(*csum, next, progress);
+	return next ? 0 : len;
+}
+
+static __always_inline
+size_t memcpy_from_iter_csum(void *iter_from, size_t progress,
+			     size_t len, void *to, void *priv2)
+{
+	__wsum *csum = priv2;
+
+	*csum = csum_and_memcpy(to + progress, iter_from, len, *csum, progress);
+	return 0;
+}
+
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 			       struct iov_iter *i)
 {
-	__wsum sum, next;
-	sum = *csum;
 	if (WARN_ON_ONCE(!i->data_source))
 		return 0;
-
-	iterate_and_advance(i, bytes, base, len, off, ({
-		next = csum_and_copy_from_user(base, addr + off, len);
-		sum = csum_block_add(sum, next, off);
-		next ? 0 : len;
-	}), ({
-		sum = csum_and_memcpy(addr + off, base, len, sum, off);
-	})
-	)
-	*csum = sum;
-	return bytes;
+	return iterate_and_advance2(i, bytes, addr, csum,
+				    copy_from_user_iter_csum,
+				    memcpy_from_iter_csum);
 }
 EXPORT_SYMBOL(csum_and_copy_from_iter);
 
+static __always_inline
+size_t copy_to_user_iter_csum(void __user *iter_to, size_t progress,
+			      size_t len, void *from, void *priv2)
+{
+	__wsum next, *csum = priv2;
+
+	next = csum_and_copy_to_user(from + progress, iter_to, len);
+	*csum = csum_block_add(*csum, next, progress);
+	return next ? 0 : len;
+}
+
+static __always_inline
+size_t memcpy_to_iter_csum(void *iter_to, size_t progress,
+			   size_t len, void *from, void *priv2)
+{
+	__wsum *csum = priv2;
+
+	*csum = csum_and_memcpy(iter_to, from + progress, len, *csum, progress);
+	return 0;
+}
+
 size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *_csstate,
 			     struct iov_iter *i)
 {
 	struct csum_state *csstate = _csstate;
-	__wsum sum, next;
+	__wsum sum;
 
 	if (WARN_ON_ONCE(i->data_source))
 		return 0;
@@ -1219,14 +1157,10 @@ size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *_csstate,
 	}
 
 	sum = csum_shift(csstate->csum, csstate->off);
-	iterate_and_advance(i, bytes, base, len, off, ({
-		next = csum_and_copy_to_user(addr + off, base, len);
-		sum = csum_block_add(sum, next, off);
-		next ? 0 : len;
-	}), ({
-		sum = csum_and_memcpy(base, addr + off, len, sum, off);
-	})
-	)
+	
+	bytes = iterate_and_advance2(i, bytes, (void *)addr, &sum,
+				     copy_to_user_iter_csum,
+				     memcpy_to_iter_csum);
 	csstate->csum = csum_shift(sum, csstate->off);
 	csstate->off += bytes;
 	return bytes;
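
As a usage note, the libceph CRC generation mentioned in the commit
message could be built on this framework along the following lines.  This
is a hypothetical sketch, not part of the series - crc32c() is the
kernel's existing helper, but the wrapper names are made up:

	#include <linux/iov_iter.h>
	#include <linux/crc32c.h>

	/* Step function: fold each mapped segment into the running CRC. */
	static size_t crc32c_iter_step(void *iter_base, size_t progress,
				       size_t len, void *priv, void *priv2)
	{
		u32 *crc = priv;

		*crc = crc32c(*crc, iter_base, len);
		return 0;	/* whole segment processed */
	}

	/* Assumes a kernel-backed iterator, so no user-space step is given. */
	static u32 crc32c_from_iter(u32 crc, struct iov_iter *iter, size_t len)
	{
		iterate_and_advance(iter, len, &crc, NULL, crc32c_iter_step);
		return crc;
	}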



* [PATCH v7 08/12] iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc()
From: David Howells @ 2023-09-25 12:03 UTC
  To: Jens Axboe
  Cc: David Howells, Al Viro, Linus Torvalds, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel

iter->copy_mc is only used with a bvec iterator and only by
dump_emit_page() in fs/coredump.c, so rather than handling this in
memcpy_from_iter_mc(), where it is checked repeatedly by _copy_from_iter()
and copy_page_from_iter_atomic(), check it once in those two functions and
have them call out to a side function that uses iterate_bvec() with a
different step function.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: David Laight <David.Laight@ACULAB.COM>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---
 lib/iov_iter.c | 39 +++++++++++++++++++++++++++------------
 1 file changed, 27 insertions(+), 12 deletions(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 65374ee91ecd..943aa3cfd7b3 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -253,14 +253,33 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 EXPORT_SYMBOL_GPL(_copy_mc_to_iter);
 #endif /* CONFIG_ARCH_HAS_COPY_MC */
 
-static size_t memcpy_from_iter_mc(void *iter_from, size_t progress,
-				  size_t len, void *to, void *priv2)
+static __always_inline
+size_t memcpy_from_iter_mc(void *iter_from, size_t progress,
+			   size_t len, void *to, void *priv2)
+{
+	return copy_mc_to_kernel(to + progress, iter_from, len);
+}
+
+static size_t __copy_from_iter_mc(void *addr, size_t bytes, struct iov_iter *i)
 {
-	struct iov_iter *iter = priv2;
+	size_t progress;
 
-	if (iov_iter_is_copy_mc(iter))
-		return copy_mc_to_kernel(to + progress, iter_from, len);
-	return memcpy_from_iter(iter_from, progress, len, to, priv2);
+	if (unlikely(i->count < bytes))
+		bytes = i->count;
+	if (unlikely(!bytes))
+		return 0;
+	progress = iterate_bvec(i, bytes, addr, NULL, memcpy_from_iter_mc);
+	i->count -= progress;
+	return progress;
+}
+
+static __always_inline
+size_t __copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
+{
+	if (unlikely(iov_iter_is_copy_mc(i)))
+		return __copy_from_iter_mc(addr, bytes, i);
+	return iterate_and_advance(i, bytes, addr,
+				   copy_from_user_iter, memcpy_from_iter);
 }
 
 size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
@@ -270,9 +289,7 @@ size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 
 	if (user_backed_iter(i))
 		might_fault();
-	return iterate_and_advance2(i, bytes, addr, i,
-				    copy_from_user_iter,
-				    memcpy_from_iter_mc);
+	return __copy_from_iter(addr, bytes, i);
 }
 EXPORT_SYMBOL(_copy_from_iter);
 
@@ -493,9 +510,7 @@ size_t copy_page_from_iter_atomic(struct page *page, size_t offset,
 		}
 
 		p = kmap_atomic(page) + offset;
-		n = iterate_and_advance2(i, n, p, i,
-					 copy_from_user_iter,
-					 memcpy_from_iter_mc);
+		n = __copy_from_iter(p, n, i);
 		kunmap_atomic(p);
 		copied += n;
 		offset += n;



* [PATCH v7 09/12] iov_iter, net: Move csum_and_copy_to/from_iter() to net/
From: David Howells @ 2023-09-25 12:03 UTC
  To: Jens Axboe
  Cc: David Howells, Al Viro, Linus Torvalds, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni

Move csum_and_copy_to/from_iter() to net code now that the iteration
framework can be #included.
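
Having the framework in a header also lets other code build bespoke
scans from the same step machinery.  As a minimal sketch (illustrative
names, not part of this series), an all-zeros check over a bvec-backed
iterator could be written as:

	#include <linux/iov_iter.h>
	#include <linux/string.h>

	static size_t scan_zero_step(void *iter_base, size_t progress,
				     size_t len, void *priv, void *priv2)
	{
		/* A non-zero return stops the iteration in this segment. */
		return memchr_inv(iter_base, 0, len) ? len : 0;
	}

	/* Note that iterate_bvec() advances the iterator as it scans. */
	static bool bvec_iter_is_zero(struct iov_iter *iter, size_t bytes)
	{
		return iterate_bvec(iter, bytes, NULL, NULL,
				    scan_zero_step) == bytes;
	}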

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: David Laight <David.Laight@ACULAB.COM>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
cc: netdev@vger.kernel.org
---
 include/linux/skbuff.h | 25 ++++++++++++
 include/linux/uio.h    | 18 ---------
 lib/iov_iter.c         | 89 ------------------------------------------
 net/core/datagram.c    | 50 +++++++++++++++++++++++-
 net/core/skbuff.c      | 33 ++++++++++++++++
 5 files changed, 107 insertions(+), 108 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4174c4b82d13..d0656cc11c16 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3679,6 +3679,31 @@ static inline int __must_check skb_put_padto(struct sk_buff *skb, unsigned int l
 	return __skb_put_padto(skb, len, true);
 }
 
+static inline __wsum csum_and_memcpy(void *to, const void *from, size_t len,
+				     __wsum sum, size_t off)
+{
+	__wsum next = csum_partial_copy_nocheck(from, to, len);
+	return csum_block_add(sum, next, off);
+}
+
+struct csum_state {
+	__wsum csum;
+	size_t off;
+};
+
+size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i);
+
+static __always_inline __must_check
+bool csum_and_copy_from_iter_full(void *addr, size_t bytes,
+				  __wsum *csum, struct iov_iter *i)
+{
+	size_t copied = csum_and_copy_from_iter(addr, bytes, csum, i);
+	if (likely(copied == bytes))
+		return true;
+	iov_iter_revert(i, copied);
+	return false;
+}
+
 static inline int skb_add_data(struct sk_buff *skb,
 			       struct iov_iter *from, int copy)
 {
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 65d9143f83c8..0a5426c97e02 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -338,24 +338,6 @@ iov_iter_npages_cap(struct iov_iter *i, int maxpages, size_t max_bytes)
 	return npages;
 }
 
-struct csum_state {
-	__wsum csum;
-	size_t off;
-};
-
-size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *csstate, struct iov_iter *i);
-size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i);
-
-static __always_inline __must_check
-bool csum_and_copy_from_iter_full(void *addr, size_t bytes,
-				  __wsum *csum, struct iov_iter *i)
-{
-	size_t copied = csum_and_copy_from_iter(addr, bytes, csum, i);
-	if (likely(copied == bytes))
-		return true;
-	iov_iter_revert(i, copied);
-	return false;
-}
 size_t hash_and_copy_to_iter(const void *addr, size_t bytes, void *hashp,
 		struct iov_iter *i);
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 943aa3cfd7b3..fef934a8745d 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -10,7 +10,6 @@
 #include <linux/vmalloc.h>
 #include <linux/splice.h>
 #include <linux/compat.h>
-#include <net/checksum.h>
 #include <linux/scatterlist.h>
 #include <linux/instrumented.h>
 #include <linux/iov_iter.h>
@@ -179,13 +178,6 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_init);
 
-static __wsum csum_and_memcpy(void *to, const void *from, size_t len,
-			      __wsum sum, size_t off)
-{
-	__wsum next = csum_partial_copy_nocheck(from, to, len);
-	return csum_block_add(sum, next, off);
-}
-
 size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 {
 	if (WARN_ON_ONCE(i->data_source))
@@ -1101,87 +1093,6 @@ ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
 }
 EXPORT_SYMBOL(iov_iter_get_pages_alloc2);
 
-static __always_inline
-size_t copy_from_user_iter_csum(void __user *iter_from, size_t progress,
-				size_t len, void *to, void *priv2)
-{
-	__wsum next, *csum = priv2;
-
-	next = csum_and_copy_from_user(iter_from, to + progress, len);
-	*csum = csum_block_add(*csum, next, progress);
-	return next ? 0 : len;
-}
-
-static __always_inline
-size_t memcpy_from_iter_csum(void *iter_from, size_t progress,
-			     size_t len, void *to, void *priv2)
-{
-	__wsum *csum = priv2;
-
-	*csum = csum_and_memcpy(to + progress, iter_from, len, *csum, progress);
-	return 0;
-}
-
-size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
-			       struct iov_iter *i)
-{
-	if (WARN_ON_ONCE(!i->data_source))
-		return 0;
-	return iterate_and_advance2(i, bytes, addr, csum,
-				    copy_from_user_iter_csum,
-				    memcpy_from_iter_csum);
-}
-EXPORT_SYMBOL(csum_and_copy_from_iter);
-
-static __always_inline
-size_t copy_to_user_iter_csum(void __user *iter_to, size_t progress,
-			      size_t len, void *from, void *priv2)
-{
-	__wsum next, *csum = priv2;
-
-	next = csum_and_copy_to_user(from + progress, iter_to, len);
-	*csum = csum_block_add(*csum, next, progress);
-	return next ? 0 : len;
-}
-
-static __always_inline
-size_t memcpy_to_iter_csum(void *iter_to, size_t progress,
-			   size_t len, void *from, void *priv2)
-{
-	__wsum *csum = priv2;
-
-	*csum = csum_and_memcpy(iter_to, from + progress, len, *csum, progress);
-	return 0;
-}
-
-size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *_csstate,
-			     struct iov_iter *i)
-{
-	struct csum_state *csstate = _csstate;
-	__wsum sum;
-
-	if (WARN_ON_ONCE(i->data_source))
-		return 0;
-	if (unlikely(iov_iter_is_discard(i))) {
-		// can't use csum_memcpy() for that one - data is not copied
-		csstate->csum = csum_block_add(csstate->csum,
-					       csum_partial(addr, bytes, 0),
-					       csstate->off);
-		csstate->off += bytes;
-		return bytes;
-	}
-
-	sum = csum_shift(csstate->csum, csstate->off);
-	
-	bytes = iterate_and_advance2(i, bytes, (void *)addr, &sum,
-				     copy_to_user_iter_csum,
-				     memcpy_to_iter_csum);
-	csstate->csum = csum_shift(sum, csstate->off);
-	csstate->off += bytes;
-	return bytes;
-}
-EXPORT_SYMBOL(csum_and_copy_to_iter);
-
 size_t hash_and_copy_to_iter(const void *addr, size_t bytes, void *hashp,
 		struct iov_iter *i)
 {
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 176eb5834746..37c89d0933b7 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -50,7 +50,7 @@
 #include <linux/spinlock.h>
 #include <linux/slab.h>
 #include <linux/pagemap.h>
-#include <linux/uio.h>
+#include <linux/iov_iter.h>
 #include <linux/indirect_call_wrapper.h>
 
 #include <net/protocol.h>
@@ -716,6 +716,54 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
 }
 EXPORT_SYMBOL(zerocopy_sg_from_iter);
 
+static __always_inline
+size_t copy_to_user_iter_csum(void __user *iter_to, size_t progress,
+			      size_t len, void *from, void *priv2)
+{
+	__wsum next, *csum = priv2;
+
+	next = csum_and_copy_to_user(from + progress, iter_to, len);
+	*csum = csum_block_add(*csum, next, progress);
+	return next ? 0 : len;
+}
+
+static __always_inline
+size_t memcpy_to_iter_csum(void *iter_to, size_t progress,
+			   size_t len, void *from, void *priv2)
+{
+	__wsum *csum = priv2;
+
+	*csum = csum_and_memcpy(iter_to, from + progress, len, *csum, progress);
+	return 0;
+}
+
+static size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *_csstate,
+				    struct iov_iter *i)
+{
+	struct csum_state *csstate = _csstate;
+	__wsum sum;
+
+	if (WARN_ON_ONCE(i->data_source))
+		return 0;
+	if (unlikely(iov_iter_is_discard(i))) {
+		// can't use csum_memcpy() for that one - data is not copied
+		csstate->csum = csum_block_add(csstate->csum,
+					       csum_partial(addr, bytes, 0),
+					       csstate->off);
+		csstate->off += bytes;
+		return bytes;
+	}
+
+	sum = csum_shift(csstate->csum, csstate->off);
+
+	bytes = iterate_and_advance2(i, bytes, (void *)addr, &sum,
+				     copy_to_user_iter_csum,
+				     memcpy_to_iter_csum);
+	csstate->csum = csum_shift(sum, csstate->off);
+	csstate->off += bytes;
+	return bytes;
+}
+
 /**
  *	skb_copy_and_csum_datagram - Copy datagram to an iovec iterator
  *          and update a checksum.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4eaf7ed0d1f4..5dbdfce2d05f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -62,6 +62,7 @@
 #include <linux/if_vlan.h>
 #include <linux/mpls.h>
 #include <linux/kcov.h>
+#include <linux/iov_iter.h>
 
 #include <net/protocol.h>
 #include <net/dst.h>
@@ -6931,3 +6932,35 @@ ssize_t skb_splice_from_iter(struct sk_buff *skb, struct iov_iter *iter,
 	return spliced ?: ret;
 }
 EXPORT_SYMBOL(skb_splice_from_iter);
+
+static __always_inline
+size_t memcpy_from_iter_csum(void *iter_from, size_t progress,
+			     size_t len, void *to, void *priv2)
+{
+	__wsum *csum = priv2;
+
+	*csum = csum_and_memcpy(to + progress, iter_from, len, *csum, progress);
+	return 0;
+}
+
+static __always_inline
+size_t copy_from_user_iter_csum(void __user *iter_from, size_t progress,
+				size_t len, void *to, void *priv2)
+{
+	__wsum next, *csum = priv2;
+
+	next = csum_and_copy_from_user(iter_from, to + progress, len);
+	*csum = csum_block_add(*csum, next, progress);
+	return next ? 0 : len;
+}
+
+size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
+			       struct iov_iter *i)
+{
+	if (WARN_ON_ONCE(!i->data_source))
+		return 0;
+	return iterate_and_advance2(i, bytes, addr, csum,
+				    copy_from_user_iter_csum,
+				    memcpy_from_iter_csum);
+}
+EXPORT_SYMBOL(csum_and_copy_from_iter);


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 10/12] iov_iter, net: Fold in csum_and_memcpy()
  2023-09-25 12:02 [PATCH v7 00/12] iov_iter: Convert the iterator macros into inline funcs David Howells
                   ` (8 preceding siblings ...)
  2023-09-25 12:03 ` [PATCH v7 09/12] iov_iter, net: Move csum_and_copy_to/from_iter() to net/ David Howells
@ 2023-09-25 12:03 ` David Howells
  2023-09-25 12:03 ` [PATCH v7 11/12] iov_iter, net: Merge csum_and_copy_from_iter{,_full}() together David Howells
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: David Howells @ 2023-09-25 12:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: David Howells, Al Viro, Linus Torvalds, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni

Fold csum_and_memcpy() into its callers.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: David Laight <David.Laight@ACULAB.COM>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
cc: netdev@vger.kernel.org
---
 include/linux/skbuff.h | 7 -------
 net/core/datagram.c    | 3 ++-
 net/core/skbuff.c      | 3 ++-
 3 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d0656cc11c16..c81ef5d76953 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3679,13 +3679,6 @@ static inline int __must_check skb_put_padto(struct sk_buff *skb, unsigned int l
 	return __skb_put_padto(skb, len, true);
 }
 
-static inline __wsum csum_and_memcpy(void *to, const void *from, size_t len,
-				     __wsum sum, size_t off)
-{
-	__wsum next = csum_partial_copy_nocheck(from, to, len);
-	return csum_block_add(sum, next, off);
-}
-
 struct csum_state {
 	__wsum csum;
 	size_t off;
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 37c89d0933b7..452620dd41e4 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -732,8 +732,9 @@ size_t memcpy_to_iter_csum(void *iter_to, size_t progress,
 			   size_t len, void *from, void *priv2)
 {
 	__wsum *csum = priv2;
+	__wsum next = csum_partial_copy_nocheck(from + progress, iter_to, len);
 
-	*csum = csum_and_memcpy(iter_to, from + progress, len, *csum, progress);
+	*csum = csum_block_add(*csum, next, progress);
 	return 0;
 }
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5dbdfce2d05f..3efed86321db 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -6938,8 +6938,9 @@ size_t memcpy_from_iter_csum(void *iter_from, size_t progress,
 			     size_t len, void *to, void *priv2)
 {
 	__wsum *csum = priv2;
+	__wsum next = csum_partial_copy_nocheck(iter_from, to + progress, len);
 
-	*csum = csum_and_memcpy(to + progress, iter_from, len, *csum, progress);
+	*csum = csum_block_add(*csum, next, progress);
 	return 0;
 }
 


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 11/12] iov_iter, net: Merge csum_and_copy_from_iter{,_full}() together
  2023-09-25 12:02 [PATCH v7 00/12] iov_iter: Convert the iterator macros into inline funcs David Howells
                   ` (9 preceding siblings ...)
  2023-09-25 12:03 ` [PATCH v7 10/12] iov_iter, net: Fold in csum_and_memcpy() David Howells
@ 2023-09-25 12:03 ` David Howells
  2023-09-25 12:03 ` [PATCH v7 12/12] iov_iter, net: Move hash_and_copy_to_iter() to net/ David Howells
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: David Howells @ 2023-09-25 12:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: David Howells, Al Viro, Linus Torvalds, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni

Move csum_and_copy_from_iter_full() out of line and then merge
csum_and_copy_from_iter() into its only caller.
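
A typical call site then looks roughly like the datagram getfrag path
(an illustrative sketch with error paths trimmed, not a verbatim copy
of any one function):

	static int getfrag_sketch(void *from, char *to, int len, int odd,
				  struct sk_buff *skb)
	{
		struct msghdr *msg = from;
		__wsum csum = 0;

		/* On a short copy the helper reverts the iterator itself. */
		if (!csum_and_copy_from_iter_full(to, len, &csum, &msg->msg_iter))
			return -EFAULT;
		skb->csum = csum_block_add(skb->csum, csum, odd);
		return 0;
	}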

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: David Laight <David.Laight@ACULAB.COM>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
cc: netdev@vger.kernel.org
---
 include/linux/skbuff.h | 19 ++-----------------
 net/core/datagram.c    |  5 +++++
 net/core/skbuff.c      | 20 +++++++++++++-------
 3 files changed, 20 insertions(+), 24 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c81ef5d76953..be402f55f6d6 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3679,23 +3679,8 @@ static inline int __must_check skb_put_padto(struct sk_buff *skb, unsigned int l
 	return __skb_put_padto(skb, len, true);
 }
 
-struct csum_state {
-	__wsum csum;
-	size_t off;
-};
-
-size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i);
-
-static __always_inline __must_check
-bool csum_and_copy_from_iter_full(void *addr, size_t bytes,
-				  __wsum *csum, struct iov_iter *i)
-{
-	size_t copied = csum_and_copy_from_iter(addr, bytes, csum, i);
-	if (likely(copied == bytes))
-		return true;
-	iov_iter_revert(i, copied);
-	return false;
-}
+bool csum_and_copy_from_iter_full(void *addr, size_t bytes, __wsum *csum, struct iov_iter *i)
+	__must_check;
 
 static inline int skb_add_data(struct sk_buff *skb,
 			       struct iov_iter *from, int copy)
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 452620dd41e4..722311eeee18 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -738,6 +738,11 @@ size_t memcpy_to_iter_csum(void *iter_to, size_t progress,
 	return 0;
 }
 
+struct csum_state {
+	__wsum csum;
+	size_t off;
+};
+
 static size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *_csstate,
 				    struct iov_iter *i)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3efed86321db..2bfa6a7ba244 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -6955,13 +6955,19 @@ size_t copy_from_user_iter_csum(void __user *iter_from, size_t progress,
 	return next ? 0 : len;
 }
 
-size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
-			       struct iov_iter *i)
+bool csum_and_copy_from_iter_full(void *addr, size_t bytes,
+				  __wsum *csum, struct iov_iter *i)
 {
+	size_t copied;
+
 	if (WARN_ON_ONCE(!i->data_source))
-		return 0;
-	return iterate_and_advance2(i, bytes, addr, csum,
-				    copy_from_user_iter_csum,
-				    memcpy_from_iter_csum);
+		return false;
+	copied = iterate_and_advance2(i, bytes, addr, csum,
+				      copy_from_user_iter_csum,
+				      memcpy_from_iter_csum);
+	if (likely(copied == bytes))
+		return true;
+	iov_iter_revert(i, copied);
+	return false;
 }
-EXPORT_SYMBOL(csum_and_copy_from_iter);
+EXPORT_SYMBOL(csum_and_copy_from_iter_full);


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v7 12/12] iov_iter, net: Move hash_and_copy_to_iter() to net/
  2023-09-25 12:02 [PATCH v7 00/12] iov_iter: Convert the iterator macros into inline funcs David Howells
                   ` (10 preceding siblings ...)
  2023-09-25 12:03 ` [PATCH v7 11/12] iov_iter, net: Merge csum_and_copy_from_iter{,_full}() together David Howells
@ 2023-09-25 12:03 ` David Howells
  2023-09-25 12:34 ` [PATCH v7 00/12] iov_iter: Convert the iterator macros into inline funcs Christian Brauner
  2023-10-02  9:25   ` David Howells
  13 siblings, 0 replies; 36+ messages in thread
From: David Howells @ 2023-09-25 12:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: David Howells, Al Viro, Linus Torvalds, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni

Move hash_and_copy_to_iter() to be with its only caller in networking code.
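
For reference, the only caller just passes it as the per-chunk callback
of __skb_datagram_iter(); roughly (modelled on the existing caller in
net/core/datagram.c, unchanged by this patch):

	int skb_copy_and_hash_datagram_iter(const struct sk_buff *skb, int offset,
					    struct iov_iter *to, int len,
					    struct ahash_request *hash)
	{
		return __skb_datagram_iter(skb, offset, to, len, true,
					   hash_and_copy_to_iter, hash);
	}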

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Christian Brauner <christian@brauner.io>
cc: Matthew Wilcox <willy@infradead.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: David Laight <David.Laight@ACULAB.COM>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
cc: netdev@vger.kernel.org
---
 include/linux/uio.h |  3 ---
 lib/iov_iter.c      | 20 --------------------
 net/core/datagram.c | 19 +++++++++++++++++++
 3 files changed, 19 insertions(+), 23 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 0a5426c97e02..b6214cbf2a43 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -338,9 +338,6 @@ iov_iter_npages_cap(struct iov_iter *i, int maxpages, size_t max_bytes)
 	return npages;
 }
 
-size_t hash_and_copy_to_iter(const void *addr, size_t bytes, void *hashp,
-		struct iov_iter *i);
-
 struct iovec *iovec_from_user(const struct iovec __user *uvector,
 		unsigned long nr_segs, unsigned long fast_segs,
 		struct iovec *fast_iov, bool compat);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index fef934a8745d..2547c96d56c7 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1,5 +1,4 @@
 // SPDX-License-Identifier: GPL-2.0-only
-#include <crypto/hash.h>
 #include <linux/export.h>
 #include <linux/bvec.h>
 #include <linux/fault-inject-usercopy.h>
@@ -1093,25 +1092,6 @@ ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
 }
 EXPORT_SYMBOL(iov_iter_get_pages_alloc2);
 
-size_t hash_and_copy_to_iter(const void *addr, size_t bytes, void *hashp,
-		struct iov_iter *i)
-{
-#ifdef CONFIG_CRYPTO_HASH
-	struct ahash_request *hash = hashp;
-	struct scatterlist sg;
-	size_t copied;
-
-	copied = copy_to_iter(addr, bytes, i);
-	sg_init_one(&sg, addr, copied);
-	ahash_request_set_crypt(hash, &sg, NULL, copied);
-	crypto_ahash_update(hash);
-	return copied;
-#else
-	return 0;
-#endif
-}
-EXPORT_SYMBOL(hash_and_copy_to_iter);
-
 static int iov_npages(const struct iov_iter *i, int maxpages)
 {
 	size_t skip = i->iov_offset, size = i->count;
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 722311eeee18..103d46fa0eeb 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -61,6 +61,7 @@
 #include <net/tcp_states.h>
 #include <trace/events/skb.h>
 #include <net/busy_poll.h>
+#include <crypto/hash.h>
 
 /*
  *	Is a socket 'connection oriented' ?
@@ -489,6 +490,24 @@ static int __skb_datagram_iter(const struct sk_buff *skb, int offset,
 	return 0;
 }
 
+static size_t hash_and_copy_to_iter(const void *addr, size_t bytes, void *hashp,
+				    struct iov_iter *i)
+{
+#ifdef CONFIG_CRYPTO_HASH
+	struct ahash_request *hash = hashp;
+	struct scatterlist sg;
+	size_t copied;
+
+	copied = copy_to_iter(addr, bytes, i);
+	sg_init_one(&sg, addr, copied);
+	ahash_request_set_crypt(hash, &sg, NULL, copied);
+	crypto_ahash_update(hash);
+	return copied;
+#else
+	return 0;
+#endif
+}
+
 /**
  *	skb_copy_and_hash_datagram_iter - Copy datagram to an iovec iterator
  *          and update a hash.


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 00/12] iov_iter: Convert the iterator macros into inline funcs
  2023-09-25 12:02 [PATCH v7 00/12] iov_iter: Convert the iterator macros into inline funcs David Howells
                   ` (11 preceding siblings ...)
  2023-09-25 12:03 ` [PATCH v7 12/12] iov_iter, net: Move hash_and_copy_to_iter() to net/ David Howells
@ 2023-09-25 12:34 ` Christian Brauner
  2023-10-02  9:25   ` David Howells
  13 siblings, 0 replies; 36+ messages in thread
From: Christian Brauner @ 2023-09-25 12:34 UTC (permalink / raw)
  To: David Howells
  Cc: Christian Brauner, Al Viro, Linus Torvalds, Christoph Hellwig,
	David Laight, Matthew Wilcox, Jeff Layton, linux-fsdevel,
	linux-block, linux-mm, netdev, linux-kernel, Jens Axboe

On Mon, 25 Sep 2023 13:02:57 +0100, David Howells wrote:
> Could you take these patches into the block tree or the fs tree?  The
> patches convert the iov_iter iteration macros to be inline functions.
> 
>  (1) Remove last_offset from iov_iter as it was only used by ITER_PIPE.
> 
>  (2) Add a __user tag on copy_mc_to_user()'s dst argument on x86 to match
>      that on powerpc and get rid of a sparse warning.
> 
> [...]

I'm giving you vfs.iov_iter as a stable (no rebases) branch so you can
put fixes on top. Please let me know if someone else needs to take this.

---

Applied to the vfs.iov_iter branch of the vfs/vfs.git tree.
Patches in the vfs.iov_iter branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.iov_iter

[01/12] iov_iter: Remove last_offset from iov_iter as it was for ITER_PIPE
        https://git.kernel.org/vfs/vfs/c/581beb4fe37d
[02/12] iov_iter, x86: Be consistent about the __user tag on copy_mc_to_user()
        https://git.kernel.org/vfs/vfs/c/066baf92bed9
[03/12] sound: Fix snd_pcm_readv()/writev() to use iov access functions
        https://git.kernel.org/vfs/vfs/c/1fcb71282e73
[04/12] infiniband: Use user_backed_iter() to see if iterator is UBUF/IOVEC
        https://git.kernel.org/vfs/vfs/c/7ebc540b3524
[05/12] iov_iter: Renumber ITER_* constants
        https://git.kernel.org/vfs/vfs/c/7d9e44a6ad8a
[06/12] iov_iter: Derive user-backedness from the iterator type
        https://git.kernel.org/vfs/vfs/c/f1b4cb650b9a
[07/12] iov_iter: Convert iterate*() to inline funcs
        https://git.kernel.org/vfs/vfs/c/f1982740f5e7
[08/12] iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc()
        https://git.kernel.org/vfs/vfs/c/51edcc92222f
[09/12] iov_iter, net: Move csum_and_copy_to/from_iter() to net/
        https://git.kernel.org/vfs/vfs/c/ef6fdd780dd4
[10/12] iov_iter, net: Fold in csum_and_memcpy()
        https://git.kernel.org/vfs/vfs/c/0837c6c20a4c
[11/12] iov_iter, net: Merge csum_and_copy_from_iter{,_full}() together
        https://git.kernel.org/vfs/vfs/c/921203282d82
[12/12] iov_iter, net: Move hash_and_copy_to_iter() to net/
        https://git.kernel.org/vfs/vfs/c/d7a22f309096

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 02/12] iov_iter, x86: Be consistent about the __user tag on copy_mc_to_user()
  2023-09-25 12:02 ` [PATCH v7 02/12] iov_iter, x86: Be consistent about the __user tag on copy_mc_to_user() David Howells
@ 2023-09-28 14:47   ` Borislav Petkov
  0 siblings, 0 replies; 36+ messages in thread
From: Borislav Petkov @ 2023-09-28 14:47 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Al Viro, Linus Torvalds, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel,
	Dan Williams, Thomas Gleixner, Ingo Molnar, Dave Hansen,
	H. Peter Anvin, x86

On Mon, Sep 25, 2023 at 01:02:59PM +0100, David Howells wrote:
> copy_mc_to_user() has the destination marked __user on powerpc, but not on
> x86; the latter results in a sparse warning in lib/iov_iter.c.
> 
> Fix this by applying the tag on x86 too.
> 
> Fixes: ec6347bb4339 ("x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()")
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Dan Williams <dan.j.williams@intel.com>
> cc: Thomas Gleixner <tglx@linutronix.de>
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Borislav Petkov <bp@alien8.de>
> cc: Dave Hansen <dave.hansen@linux.intel.com>
> cc: "H. Peter Anvin" <hpa@zytor.com>
> cc: Alexander Viro <viro@zeniv.linux.org.uk>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Christoph Hellwig <hch@lst.de>
> cc: Christian Brauner <christian@brauner.io>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: Linus Torvalds <torvalds@linux-foundation.org>
> cc: David Laight <David.Laight@ACULAB.COM>
> cc: x86@kernel.org
> cc: linux-block@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> cc: linux-mm@kvack.org
> ---
>  arch/x86/include/asm/uaccess.h | 2 +-
>  arch/x86/lib/copy_mc.c         | 8 ++++----
>  2 files changed, 5 insertions(+), 5 deletions(-)

Acked-by: Borislav Petkov (AMD) <bp@alien8.de>

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v7 08/12] iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc()
  2023-09-25 12:02 [PATCH v7 00/12] iov_iter: Convert the iterator macros into inline funcs David Howells
@ 2023-10-02  9:25   ` David Howells
  2023-09-25 12:02 ` [PATCH v7 02/12] iov_iter, x86: Be consistent about the __user tag on copy_mc_to_user() David Howells
                     ` (12 subsequent siblings)
  13 siblings, 0 replies; 36+ messages in thread
From: David Howells @ 2023-10-02  9:25 UTC (permalink / raw)
  Cc: dhowells, Jens Axboe, Al Viro, Linus Torvalds, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel

David Howells <dhowells@redhat.com> wrote:

> +static size_t __copy_from_iter_mc(void *addr, size_t bytes, struct iov_iter *i)
>  {
> -	struct iov_iter *iter = priv2;
> +	size_t progress;
>  
> -	if (iov_iter_is_copy_mc(iter))
> -		return copy_mc_to_kernel(to + progress, iter_from, len);
> -	return memcpy_from_iter(iter_from, progress, len, to, priv2);
> +	if (unlikely(i->count < bytes))
> +		bytes = i->count;
> +	if (unlikely(!bytes))
> +		return 0;
> +	progress = iterate_bvec(i, bytes, addr, NULL, memcpy_from_iter_mc);
> +	i->count -= progress;

i->count shouldn't be decreased here as iterate_bvec() now does that.

This causes the LTP abort01 test to log a warning under KASAN (see below).
I'll remove the line and repush the patches.
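
The tail of __copy_from_iter_mc() then reduces to (sketch):

	progress = iterate_bvec(i, bytes, addr, NULL, memcpy_from_iter_mc);
	/* iterate_bvec() has already subtracted progress from i->count. */
	return progress;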

David

    LTP: starting abort01
    ==================================================================
    BUG: KASAN: stack-out-of-bounds in __copy_from_iter_mc+0x2e6/0x480
    Read of size 4 at addr ffffc90004777594 by task abort01/708

    CPU: 4 PID: 708 Comm: abort01 Not tainted 99.6.0-rc3-ged6251886a1d #46
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)/Incus, BIOS unknown 2/2/2022
    Call Trace:
     <TASK>
     dump_stack_lvl+0x3d/0x70
     print_report+0xce/0x650
     ? lock_acquire+0x1b1/0x330
     kasan_report+0xda/0x110
     ? __copy_from_iter_mc+0x2e6/0x480
     ? __copy_from_iter_mc+0x2e6/0x480
     __copy_from_iter_mc+0x2e6/0x480
     copy_page_from_iter_atomic+0x517/0x1350
     ? __pfx_copy_page_from_iter_atomic+0x10/0x10
     ? __filemap_get_folio+0x281/0x6c0
     ? folio_wait_writeback+0x53/0x1e0
     ? prepare_pages.constprop.0+0x40b/0x6c0
     btrfs_copy_from_user+0xc6/0x290
     btrfs_buffered_write+0x8c9/0x1190
     ? __pfx_btrfs_buffered_write+0x10/0x10
     ? _raw_spin_unlock+0x2d/0x50
     ? btrfs_file_llseek+0x100/0xf00
     ? follow_page_mask+0x69f/0x1e10
     btrfs_do_write_iter+0x859/0xff0
     ? __pfx_btrfs_file_llseek+0x10/0x10
     ? find_held_lock+0x2d/0x110
     ? __pfx_btrfs_do_write_iter+0x10/0x10
     ? __up_read+0x211/0x790
     ? __pfx___get_user_pages+0x10/0x10
     ? __pfx___up_read+0x10/0x10
     ? __kernel_write_iter+0x3be/0x6d0
     __kernel_write_iter+0x226/0x6d0
     ? __pfx___kernel_write_iter+0x10/0x10
     dump_user_range+0x25d/0x650
     ? __pfx_dump_user_range+0x10/0x10
     ? __pfx_writenote+0x10/0x10
     elf_core_dump+0x231f/0x2e90
     ? __pfx_elf_core_dump+0x10/0x10
     ? do_coredump+0x12a9/0x38c0
     ? kasan_set_track+0x25/0x30
     ? __kasan_kmalloc+0xaa/0xb0
     ? __kmalloc_node+0x6c/0x1b0
     ? do_coredump+0x12a9/0x38c0
     ? get_signal+0x1e7d/0x20f0
     ? 0xffffffffff600000
     ? mas_next_slot+0x328/0x1dd0
     ? lock_acquire+0x162/0x330
     ? do_coredump+0x2537/0x38c0
     do_coredump+0x2537/0x38c0
     ? __pfx_do_coredump+0x10/0x10
     ? kmem_cache_free+0x114/0x520
     ? find_held_lock+0x2d/0x110
     get_signal+0x1e7d/0x20f0
     ? __pfx_get_signal+0x10/0x10
     ? do_send_specific+0xf1/0x1c0
     ? __pfx_do_send_specific+0x10/0x10
     arch_do_signal_or_restart+0x8b/0x4b0
     ? __pfx_arch_do_signal_or_restart+0x10/0x10
     exit_to_user_mode_prepare+0xde/0x210
     syscall_exit_to_user_mode+0x16/0x50
     do_syscall_64+0x53/0x90
     entry_SYSCALL_64_after_hwframe+0x6e/0xd8


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH next] iov_iter: fix copy_page_from_iter_atomic()
  2023-10-02  9:25   ` David Howells
  (?)
@ 2023-10-07  4:32   ` Hugh Dickins
  -1 siblings, 0 replies; 36+ messages in thread
From: Hugh Dickins @ 2023-10-07  4:32 UTC (permalink / raw)
  To: David Howells
  Cc: Christian Brauner, Jens Axboe, Al Viro, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel

[PATCH next] iov_iter: fix copy_page_from_iter_atomic()

Trying to test tmpfs on latest linux-next (copying and building kernel
trees, with huge pages and swapping involved, while running swapoff)
produced lots of:
cp: error writing '/tmp/2624/Documentation/fb/vesafb.txt': Bad address
cp: error writing '/tmp/2624/arch/mips/math-emu/dp_fsp.c': Bad address
etc.

Bisection leads to next-20231006's 376fdc4552f1 ("iov_iter:
Don't deal with iter->copy_mc in memcpy_from_iter_mc()") from vfs.git.
The tweak below makes it healthy again: please feel free to fold it in.

Signed-off-by: Hugh Dickins <hughd@google.com>

--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -497,7 +497,7 @@ size_t copy_page_from_iter_atomic(struct
 		}
 
 		p = kmap_atomic(page) + offset;
-		__copy_from_iter(p, n, i);
+		n = __copy_from_iter(p, n, i);
 		kunmap_atomic(p);
 		copied += n;
 		offset += n;

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH next] iov_iter: fix copy_page_from_iter_atomic()
  2023-10-02  9:25   ` David Howells
  (?)
  (?)
@ 2023-10-07  7:29   ` David Howells
  2023-10-09  7:36     ` Christian Brauner
  -1 siblings, 1 reply; 36+ messages in thread
From: David Howells @ 2023-10-07  7:29 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: dhowells, Christian Brauner, Jens Axboe, Al Viro,
	Christoph Hellwig, Christian Brauner, David Laight,
	Matthew Wilcox, Jeff Layton, linux-fsdevel, linux-block,
	linux-mm, netdev, linux-kernel

Hugh Dickins <hughd@google.com> wrote:

> -		__copy_from_iter(p, n, i);
> +		n = __copy_from_iter(p, n, i);

Yeah, that looks right.  Can you fold it in, Christian?

David


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH next] iov_iter: fix copy_page_from_iter_atomic()
  2023-10-07  7:29   ` David Howells
@ 2023-10-09  7:36     ` Christian Brauner
  0 siblings, 0 replies; 36+ messages in thread
From: Christian Brauner @ 2023-10-09  7:36 UTC (permalink / raw)
  To: David Howells
  Cc: Hugh Dickins, Jens Axboe, Al Viro, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel

On Sat, Oct 07, 2023 at 08:29:14AM +0100, David Howells wrote:
> Hugh Dickins <hughd@google.com> wrote:
> 
> > -		__copy_from_iter(p, n, i);
> > +		n = __copy_from_iter(p, n, i);
> 
> Yeah, that looks right.  Can you fold it in, Christian?

Of course. Folded into
c9eec08bac96 ("iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc()")
on vfs.iov_iter.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [bug report] dead loop in generic_perform_write() //Re: [PATCH v7 07/12] iov_iter: Convert iterate*() to inline funcs
  2023-09-25 12:03 ` [PATCH v7 07/12] iov_iter: Convert iterate*() to inline funcs David Howells
@ 2024-02-18  3:13   ` Tong Tiangen
  2024-02-27 12:43     ` Tong Tiangen
  2024-02-28 21:21     ` Linus Torvalds
  0 siblings, 2 replies; 36+ messages in thread
From: Tong Tiangen @ 2024-02-18  3:13 UTC (permalink / raw)
  To: David Howells, Jens Axboe
  Cc: Al Viro, Linus Torvalds, Christoph Hellwig, Christian Brauner,
	David Laight, Matthew Wilcox, Jeff Layton, linux-fsdevel,
	linux-block, linux-mm, netdev, linux-kernel, Kefeng Wang

Hi David, Jens:

Recently, I tested the x86 coredump path of a user process on
mainline (6.8-rc1) and found a dead-loop issue related to this patch.

Let's discuss it.

1. Test steps:
----------------------------
   a. Start a user process.
   b. Use EINJ to inject a hardware memory error into a page of
      this user process.
   c. Send SIGBUS to this user process.
   d. After receiving the signal, the process dumps core; the coredump
      file is configured to be written to tmpfs.

2. Root cause:
----------------------------
The dead loop occurs in generic_perform_write(); the call path is:

elf_core_dump()
   -> dump_user_range()
     -> dump_emit_page()
       -> iov_iter_bvec()  //iter type set to BVEC
         -> iov_iter_set_copy_mc(&iter);  //support copy mc
           -> __kernel_write_iter()
             -> shmem_file_write_iter()
               -> generic_perform_write()

ssize_t generic_perform_write(...)
{
	[...]
	do {
		[...]
	again:
		//[4]
		if (unlikely(fault_in_iov_iter_readable(i, bytes) ==
                              bytes)) {
			status = -EFAULT;
			break;
		}
		//[5]
		if (fatal_signal_pending(current)) {
			status = -EINTR;
			break;
		}
		
	        [...]
		
		//[1]
		copied = copy_page_from_iter_atomic(page, offset, bytes,
                          i);
		[...]
		
		//[2]
		status = a_ops->write_end(...);
		if (unlikely(status != copied)) {
			iov_iter_revert(i, copied - max(status, 0L));
			if (unlikely(status < 0))
				break;
		}
		cond_resched();
		
		if (unlikely(status == 0)) {
			/*
			* A short copy made ->write_end() reject the
			* thing entirely.  Might be memory poisoning
			* halfway through, might be a race with munmap,
			* might be severe memory pressure.
			*/
			if (copied)
				bytes = copied;
			//----[3]
			goto again;
		}
		[...]
	} while (iov_iter_count(i));
	[...]
}

[1] Before this patch:
   copy_page_from_iter_atomic()
     -> iterate_and_advance()
        -> __iterate_and_advance(..., ((void)(K),0))
          -> iterate_bvec macro
            -> left = ((void)(K),0)

With CONFIG_ARCH_HAS_COPY_MC, K() is copy_mc_to_kernel(), which returns
the number of bytes not copied.

When a memory error occurs during K(), "left" is nevertheless 0,
because the macro discards K()'s return value via the ((void)(K),0)
wrapping.  Therefore, the "copied" value returned by
copy_page_from_iter_atomic() is not 0, and the loop in
generic_perform_write() ends normally.
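
That is, the old wrapper threw the step's return value away; the
expansion is effectively (sketch):

	left = ((void)(copy_mc_to_kernel(base, addr + off, len)), 0);
	/* left is always 0, so the macro never records a short copy */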


After this patch:
   copy_page_from_iter_atomic()
     -> iterate_and_advance2()
       -> iterate_bvec()
         -> remain = step()

With CONFIG_ARCH_HAS_COPY_MC, step() is copy_mc_to_kernel(), which
returns the number of bytes not copied.

When a memory error occurs during step(), "remain" is equal to "part"
(not a single byte is copied successfully). In this case,
iterate_bvec() returns 0, and copy_page_from_iter_atomic() also returns
0. The callback shmem_write_end() [2] then returns 0 as well, so
generic_perform_write() takes "goto again" [3] and the loop restarts.
Neither check [4] nor [5] fires to break out of the loop, so a dead
loop occurs.
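
Condensed, the failing step inside iterate_bvec() looks like this
(a sketch of the new code's locals, bookkeeping lines elided):

	remain = step(kaddr + offset % PAGE_SIZE, progress, part, priv, priv2);
	/* copy_mc_to_kernel() hit the poisoned page: remain == part */
	consumed = part - remain;	/* consumed == 0 */
	progress += consumed;		/* progress stays 0 */
	if (remain)
		break;
	/* ... */
	return progress;		/* caller sees copied == 0 and loops */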

Thanks.
Tong


On 2023/9/25 20:03, David Howells wrote:
> Convert the iov_iter iteration macros to inline functions to make the code
> easier to follow.
> 
> The functions are marked __always_inline as we don't want to end up with
> indirect calls in the code.  This, however, leaves dealing with ->copy_mc
> in an awkard situation since the step function (memcpy_from_iter_mc())
> needs to test the flag in the iterator, but isn't passed the iterator.
> This will be dealt with in a follow-up patch.
> 
> The variable names in the per-type iterator functions have been harmonised
> as much as possible and made clearer as to the variable purpose.
> 
> The iterator functions are also moved to a header file so that other
> operations that need to scan over an iterator can be added.  For instance,
> the rbd driver could use this to scan a buffer to see if it is all zeros
> and libceph could use this to generate a crc.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Alexander Viro <viro@zeniv.linux.org.uk>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Christoph Hellwig <hch@lst.de>
> cc: Christian Brauner <christian@brauner.io>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: Linus Torvalds <torvalds@linux-foundation.org>
> cc: David Laight <David.Laight@ACULAB.COM>
> cc: linux-block@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> cc: linux-mm@kvack.org
> Link: https://lore.kernel.org/r/3710261.1691764329@warthog.procyon.org.uk/ # v1
> Link: https://lore.kernel.org/r/855.1692047347@warthog.procyon.org.uk/ # v2
> Link: https://lore.kernel.org/r/20230816120741.534415-1-dhowells@redhat.com/ # v3
> ---
> 
> Notes:
>      Changes
>      =======
>      ver #5)
>       - Merge in patch to move iteration framework to a header file.
>       - Move "iter->count - progress" into individual iteration subfunctions.
> 
>   include/linux/iov_iter.h | 274 ++++++++++++++++++++++++++
>   lib/iov_iter.c           | 416 ++++++++++++++++-----------------------
>   2 files changed, 449 insertions(+), 241 deletions(-)
>   create mode 100644 include/linux/iov_iter.h
> 
> diff --git a/include/linux/iov_iter.h b/include/linux/iov_iter.h
> new file mode 100644
> index 000000000000..270454a6703d
> --- /dev/null
> +++ b/include/linux/iov_iter.h
> @@ -0,0 +1,274 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/* I/O iterator iteration building functions.
> + *
> + * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
> + * Written by David Howells (dhowells@redhat.com)
> + */
> +
> +#ifndef _LINUX_IOV_ITER_H
> +#define _LINUX_IOV_ITER_H
> +
> +#include <linux/uio.h>
> +#include <linux/bvec.h>
> +
> +typedef size_t (*iov_step_f)(void *iter_base, size_t progress, size_t len,
> +			     void *priv, void *priv2);
> +typedef size_t (*iov_ustep_f)(void __user *iter_base, size_t progress, size_t len,
> +			      void *priv, void *priv2);
> +
> +/*
> + * Handle ITER_UBUF.
> + */
> +static __always_inline
> +size_t iterate_ubuf(struct iov_iter *iter, size_t len, void *priv, void *priv2,
> +		    iov_ustep_f step)
> +{
> +	void __user *base = iter->ubuf;
> +	size_t progress = 0, remain;
> +
> +	remain = step(base + iter->iov_offset, 0, len, priv, priv2);
> +	progress = len - remain;
> +	iter->iov_offset += progress;
> +	iter->count -= progress;
> +	return progress;
> +}
> +
> +/*
> + * Handle ITER_IOVEC.
> + */
> +static __always_inline
> +size_t iterate_iovec(struct iov_iter *iter, size_t len, void *priv, void *priv2,
> +		     iov_ustep_f step)
> +{
> +	const struct iovec *p = iter->__iov;
> +	size_t progress = 0, skip = iter->iov_offset;
> +
> +	do {
> +		size_t remain, consumed;
> +		size_t part = min(len, p->iov_len - skip);
> +
> +		if (likely(part)) {
> +			remain = step(p->iov_base + skip, progress, part, priv, priv2);
> +			consumed = part - remain;
> +			progress += consumed;
> +			skip += consumed;
> +			len -= consumed;
> +			if (skip < p->iov_len)
> +				break;
> +		}
> +		p++;
> +		skip = 0;
> +	} while (len);
> +
> +	iter->nr_segs -= p - iter->__iov;
> +	iter->__iov = p;
> +	iter->iov_offset = skip;
> +	iter->count -= progress;
> +	return progress;
> +}
> +
> +/*
> + * Handle ITER_KVEC.
> + */
> +static __always_inline
> +size_t iterate_kvec(struct iov_iter *iter, size_t len, void *priv, void *priv2,
> +		    iov_step_f step)
> +{
> +	const struct kvec *p = iter->kvec;
> +	size_t progress = 0, skip = iter->iov_offset;
> +
> +	do {
> +		size_t remain, consumed;
> +		size_t part = min(len, p->iov_len - skip);
> +
> +		if (likely(part)) {
> +			remain = step(p->iov_base + skip, progress, part, priv, priv2);
> +			consumed = part - remain;
> +			progress += consumed;
> +			skip += consumed;
> +			len -= consumed;
> +			if (skip < p->iov_len)
> +				break;
> +		}
> +		p++;
> +		skip = 0;
> +	} while (len);
> +
> +	iter->nr_segs -= p - iter->kvec;
> +	iter->kvec = p;
> +	iter->iov_offset = skip;
> +	iter->count -= progress;
> +	return progress;
> +}
> +
> +/*
> + * Handle ITER_BVEC.
> + */
> +static __always_inline
> +size_t iterate_bvec(struct iov_iter *iter, size_t len, void *priv, void *priv2,
> +		    iov_step_f step)
> +{
> +	const struct bio_vec *p = iter->bvec;
> +	size_t progress = 0, skip = iter->iov_offset;
> +
> +	do {
> +		size_t remain, consumed;
> +		size_t offset = p->bv_offset + skip, part;
> +		void *kaddr = kmap_local_page(p->bv_page + offset / PAGE_SIZE);
> +
> +		part = min3(len,
> +			   (size_t)(p->bv_len - skip),
> +			   (size_t)(PAGE_SIZE - offset % PAGE_SIZE));
> +		remain = step(kaddr + offset % PAGE_SIZE, progress, part, priv, priv2);
> +		kunmap_local(kaddr);
> +		consumed = part - remain;
> +		len -= consumed;
> +		progress += consumed;
> +		skip += consumed;
> +		if (skip >= p->bv_len) {
> +			skip = 0;
> +			p++;
> +		}
> +		if (remain)
> +			break;
> +	} while (len);
> +
> +	iter->nr_segs -= p - iter->bvec;
> +	iter->bvec = p;
> +	iter->iov_offset = skip;
> +	iter->count -= progress;
> +	return progress;
> +}
> +
> +/*
> + * Handle ITER_XARRAY.
> + */
> +static __always_inline
> +size_t iterate_xarray(struct iov_iter *iter, size_t len, void *priv, void *priv2,
> +		      iov_step_f step)
> +{
> +	struct folio *folio;
> +	size_t progress = 0;
> +	loff_t start = iter->xarray_start + iter->iov_offset;
> +	pgoff_t index = start / PAGE_SIZE;
> +	XA_STATE(xas, iter->xarray, index);
> +
> +	rcu_read_lock();
> +	xas_for_each(&xas, folio, ULONG_MAX) {
> +		size_t remain, consumed, offset, part, flen;
> +
> +		if (xas_retry(&xas, folio))
> +			continue;
> +		if (WARN_ON(xa_is_value(folio)))
> +			break;
> +		if (WARN_ON(folio_test_hugetlb(folio)))
> +			break;
> +
> +		offset = offset_in_folio(folio, start + progress);
> +		flen = min(folio_size(folio) - offset, len);
> +
> +		while (flen) {
> +			void *base = kmap_local_folio(folio, offset);
> +
> +			part = min_t(size_t, flen,
> +				     PAGE_SIZE - offset_in_page(offset));
> +			remain = step(base, progress, part, priv, priv2);
> +			kunmap_local(base);
> +
> +			consumed = part - remain;
> +			progress += consumed;
> +			len -= consumed;
> +
> +			if (remain || len == 0)
> +				goto out;
> +			flen -= consumed;
> +			offset += consumed;
> +		}
> +	}
> +
> +out:
> +	rcu_read_unlock();
> +	iter->iov_offset += progress;
> +	iter->count -= progress;
> +	return progress;
> +}
> +
> +/*
> + * Handle ITER_DISCARD.
> + */
> +static __always_inline
> +size_t iterate_discard(struct iov_iter *iter, size_t len, void *priv, void *priv2,
> +		      iov_step_f step)
> +{
> +	size_t progress = len;
> +
> +	iter->count -= progress;
> +	return progress;
> +}
> +
> +/**
> + * iterate_and_advance2 - Iterate over an iterator
> + * @iter: The iterator to iterate over.
> + * @len: The amount to iterate over.
> + * @priv: Data for the step functions.
> + * @priv2: More data for the step functions.
> + * @ustep: Function for UBUF/IOVEC iterators; given __user addresses.
> + * @step: Function for other iterators; given kernel addresses.
> + *
> + * Iterate over the next part of an iterator, up to the specified length.  The
> + * buffer is presented in segments, which for kernel iteration are broken up by
> + * physical pages and mapped, with the mapped address being presented.
> + *
> + * Two step functions, @step and @ustep, must be provided, one for handling
> + * mapped kernel addresses and the other is given user addresses which have the
> + * potential to fault since no pinning is performed.
> + *
> + * The step functions are passed the address and length of the segment, @priv,
> + * @priv2 and the amount of data so far iterated over (which can, for example,
> + * be added to @priv to point to the right part of a second buffer).  The step
> + * functions should return the amount of the segment they didn't process (ie. 0
> + * indicates complete processing).
> + *
> + * This function returns the amount of data processed (ie. 0 means nothing was
> + * processed and the value of @len means processed to completion).
> + */
> +static __always_inline
> +size_t iterate_and_advance2(struct iov_iter *iter, size_t len, void *priv,
> +			    void *priv2, iov_ustep_f ustep, iov_step_f step)
> +{
> +	if (unlikely(iter->count < len))
> +		len = iter->count;
> +	if (unlikely(!len))
> +		return 0;
> +
> +	if (likely(iter_is_ubuf(iter)))
> +		return iterate_ubuf(iter, len, priv, priv2, ustep);
> +	if (likely(iter_is_iovec(iter)))
> +		return iterate_iovec(iter, len, priv, priv2, ustep);
> +	if (iov_iter_is_bvec(iter))
> +		return iterate_bvec(iter, len, priv, priv2, step);
> +	if (iov_iter_is_kvec(iter))
> +		return iterate_kvec(iter, len, priv, priv2, step);
> +	if (iov_iter_is_xarray(iter))
> +		return iterate_xarray(iter, len, priv, priv2, step);
> +	return iterate_discard(iter, len, priv, priv2, step);
> +}
> +
> +/**
> + * iterate_and_advance - Iterate over an iterator
> + * @iter: The iterator to iterate over.
> + * @len: The amount to iterate over.
> + * @priv: Data for the step functions.
> + * @ustep: Function for UBUF/IOVEC iterators; given __user addresses.
> + * @step: Function for other iterators; given kernel addresses.
> + *
> + * As iterate_and_advance2(), but priv2 is always NULL.
> + */
> +static __always_inline
> +size_t iterate_and_advance(struct iov_iter *iter, size_t len, void *priv,
> +			   iov_ustep_f ustep, iov_step_f step)
> +{
> +	return iterate_and_advance2(iter, len, priv, NULL, ustep, step);
> +}
> +
> +#endif /* _LINUX_IOV_ITER_H */
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 227c9f536b94..65374ee91ecd 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -13,189 +13,69 @@
>   #include <net/checksum.h>
>   #include <linux/scatterlist.h>
>   #include <linux/instrumented.h>
> +#include <linux/iov_iter.h>
>   
> -/* covers ubuf and kbuf alike */
> -#define iterate_buf(i, n, base, len, off, __p, STEP) {		\
> -	size_t __maybe_unused off = 0;				\
> -	len = n;						\
> -	base = __p + i->iov_offset;				\
> -	len -= (STEP);						\
> -	i->iov_offset += len;					\
> -	n = len;						\
> -}
> -
> -/* covers iovec and kvec alike */
> -#define iterate_iovec(i, n, base, len, off, __p, STEP) {	\
> -	size_t off = 0;						\
> -	size_t skip = i->iov_offset;				\
> -	do {							\
> -		len = min(n, __p->iov_len - skip);		\
> -		if (likely(len)) {				\
> -			base = __p->iov_base + skip;		\
> -			len -= (STEP);				\
> -			off += len;				\
> -			skip += len;				\
> -			n -= len;				\
> -			if (skip < __p->iov_len)		\
> -				break;				\
> -		}						\
> -		__p++;						\
> -		skip = 0;					\
> -	} while (n);						\
> -	i->iov_offset = skip;					\
> -	n = off;						\
> -}
> -
> -#define iterate_bvec(i, n, base, len, off, p, STEP) {		\
> -	size_t off = 0;						\
> -	unsigned skip = i->iov_offset;				\
> -	while (n) {						\
> -		unsigned offset = p->bv_offset + skip;		\
> -		unsigned left;					\
> -		void *kaddr = kmap_local_page(p->bv_page +	\
> -					offset / PAGE_SIZE);	\
> -		base = kaddr + offset % PAGE_SIZE;		\
> -		len = min(min(n, (size_t)(p->bv_len - skip)),	\
> -		     (size_t)(PAGE_SIZE - offset % PAGE_SIZE));	\
> -		left = (STEP);					\
> -		kunmap_local(kaddr);				\
> -		len -= left;					\
> -		off += len;					\
> -		skip += len;					\
> -		if (skip == p->bv_len) {			\
> -			skip = 0;				\
> -			p++;					\
> -		}						\
> -		n -= len;					\
> -		if (left)					\
> -			break;					\
> -	}							\
> -	i->iov_offset = skip;					\
> -	n = off;						\
> -}
> -
> -#define iterate_xarray(i, n, base, len, __off, STEP) {		\
> -	__label__ __out;					\
> -	size_t __off = 0;					\
> -	struct folio *folio;					\
> -	loff_t start = i->xarray_start + i->iov_offset;		\
> -	pgoff_t index = start / PAGE_SIZE;			\
> -	XA_STATE(xas, i->xarray, index);			\
> -								\
> -	len = PAGE_SIZE - offset_in_page(start);		\
> -	rcu_read_lock();					\
> -	xas_for_each(&xas, folio, ULONG_MAX) {			\
> -		unsigned left;					\
> -		size_t offset;					\
> -		if (xas_retry(&xas, folio))			\
> -			continue;				\
> -		if (WARN_ON(xa_is_value(folio)))		\
> -			break;					\
> -		if (WARN_ON(folio_test_hugetlb(folio)))		\
> -			break;					\
> -		offset = offset_in_folio(folio, start + __off);	\
> -		while (offset < folio_size(folio)) {		\
> -			base = kmap_local_folio(folio, offset);	\
> -			len = min(n, len);			\
> -			left = (STEP);				\
> -			kunmap_local(base);			\
> -			len -= left;				\
> -			__off += len;				\
> -			n -= len;				\
> -			if (left || n == 0)			\
> -				goto __out;			\
> -			offset += len;				\
> -			len = PAGE_SIZE;			\
> -		}						\
> -	}							\
> -__out:								\
> -	rcu_read_unlock();					\
> -	i->iov_offset += __off;					\
> -	n = __off;						\
> -}
> -
> -#define __iterate_and_advance(i, n, base, len, off, I, K) {	\
> -	if (unlikely(i->count < n))				\
> -		n = i->count;					\
> -	if (likely(n)) {					\
> -		if (likely(iter_is_ubuf(i))) {			\
> -			void __user *base;			\
> -			size_t len;				\
> -			iterate_buf(i, n, base, len, off,	\
> -						i->ubuf, (I)) 	\
> -		} else if (likely(iter_is_iovec(i))) {		\
> -			const struct iovec *iov = iter_iov(i);	\
> -			void __user *base;			\
> -			size_t len;				\
> -			iterate_iovec(i, n, base, len, off,	\
> -						iov, (I))	\
> -			i->nr_segs -= iov - iter_iov(i);	\
> -			i->__iov = iov;				\
> -		} else if (iov_iter_is_bvec(i)) {		\
> -			const struct bio_vec *bvec = i->bvec;	\
> -			void *base;				\
> -			size_t len;				\
> -			iterate_bvec(i, n, base, len, off,	\
> -						bvec, (K))	\
> -			i->nr_segs -= bvec - i->bvec;		\
> -			i->bvec = bvec;				\
> -		} else if (iov_iter_is_kvec(i)) {		\
> -			const struct kvec *kvec = i->kvec;	\
> -			void *base;				\
> -			size_t len;				\
> -			iterate_iovec(i, n, base, len, off,	\
> -						kvec, (K))	\
> -			i->nr_segs -= kvec - i->kvec;		\
> -			i->kvec = kvec;				\
> -		} else if (iov_iter_is_xarray(i)) {		\
> -			void *base;				\
> -			size_t len;				\
> -			iterate_xarray(i, n, base, len, off,	\
> -							(K))	\
> -		}						\
> -		i->count -= n;					\
> -	}							\
> -}
> -#define iterate_and_advance(i, n, base, len, off, I, K) \
> -	__iterate_and_advance(i, n, base, len, off, I, ((void)(K),0))
> -
> -static int copyout(void __user *to, const void *from, size_t n)
> +static __always_inline
> +size_t copy_to_user_iter(void __user *iter_to, size_t progress,
> +			 size_t len, void *from, void *priv2)
>   {
>   	if (should_fail_usercopy())
> -		return n;
> -	if (access_ok(to, n)) {
> -		instrument_copy_to_user(to, from, n);
> -		n = raw_copy_to_user(to, from, n);
> +		return len;
> +	if (access_ok(iter_to, len)) {
> +		from += progress;
> +		instrument_copy_to_user(iter_to, from, len);
> +		len = raw_copy_to_user(iter_to, from, len);
>   	}
> -	return n;
> +	return len;
>   }
>   
> -static int copyout_nofault(void __user *to, const void *from, size_t n)
> +static __always_inline
> +size_t copy_to_user_iter_nofault(void __user *iter_to, size_t progress,
> +				 size_t len, void *from, void *priv2)
>   {
> -	long res;
> +	ssize_t res;
>   
>   	if (should_fail_usercopy())
> -		return n;
> -
> -	res = copy_to_user_nofault(to, from, n);
> +		return len;
>   
> -	return res < 0 ? n : res;
> +	from += progress;
> +	res = copy_to_user_nofault(iter_to, from, len);
> +	return res < 0 ? len : res;
>   }
>   
> -static int copyin(void *to, const void __user *from, size_t n)
> +static __always_inline
> +size_t copy_from_user_iter(void __user *iter_from, size_t progress,
> +			   size_t len, void *to, void *priv2)
>   {
> -	size_t res = n;
> +	size_t res = len;
>   
>   	if (should_fail_usercopy())
> -		return n;
> -	if (access_ok(from, n)) {
> -		instrument_copy_from_user_before(to, from, n);
> -		res = raw_copy_from_user(to, from, n);
> -		instrument_copy_from_user_after(to, from, n, res);
> +		return len;
> +	if (access_ok(iter_from, len)) {
> +		to += progress;
> +		instrument_copy_from_user_before(to, iter_from, len);
> +		res = raw_copy_from_user(to, iter_from, len);
> +		instrument_copy_from_user_after(to, iter_from, len, res);
>   	}
>   	return res;
>   }
>   
> +static __always_inline
> +size_t memcpy_to_iter(void *iter_to, size_t progress,
> +		      size_t len, void *from, void *priv2)
> +{
> +	memcpy(iter_to, from + progress, len);
> +	return 0;
> +}
> +
> +static __always_inline
> +size_t memcpy_from_iter(void *iter_from, size_t progress,
> +			size_t len, void *to, void *priv2)
> +{
> +	memcpy(to + progress, iter_from, len);
> +	return 0;
> +}
> +
>   /*
>    * fault_in_iov_iter_readable - fault in iov iterator for reading
>    * @i: iterator
> @@ -312,23 +192,29 @@ size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
>   		return 0;
>   	if (user_backed_iter(i))
>   		might_fault();
> -	iterate_and_advance(i, bytes, base, len, off,
> -		copyout(base, addr + off, len),
> -		memcpy(base, addr + off, len)
> -	)
> -
> -	return bytes;
> +	return iterate_and_advance(i, bytes, (void *)addr,
> +				   copy_to_user_iter, memcpy_to_iter);
>   }
>   EXPORT_SYMBOL(_copy_to_iter);
>   
>   #ifdef CONFIG_ARCH_HAS_COPY_MC
> -static int copyout_mc(void __user *to, const void *from, size_t n)
> -{
> -	if (access_ok(to, n)) {
> -		instrument_copy_to_user(to, from, n);
> -		n = copy_mc_to_user((__force void *) to, from, n);
> +static __always_inline
> +size_t copy_to_user_iter_mc(void __user *iter_to, size_t progress,
> +			    size_t len, void *from, void *priv2)
> +{
> +	if (access_ok(iter_to, len)) {
> +		from += progress;
> +		instrument_copy_to_user(iter_to, from, len);
> +		len = copy_mc_to_user(iter_to, from, len);
>   	}
> -	return n;
> +	return len;
> +}
> +
> +static __always_inline
> +size_t memcpy_to_iter_mc(void *iter_to, size_t progress,
> +			 size_t len, void *from, void *priv2)
> +{
> +	return copy_mc_to_kernel(iter_to, from + progress, len);
>   }
>   
>   /**
> @@ -361,22 +247,20 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
>   		return 0;
>   	if (user_backed_iter(i))
>   		might_fault();
> -	__iterate_and_advance(i, bytes, base, len, off,
> -		copyout_mc(base, addr + off, len),
> -		copy_mc_to_kernel(base, addr + off, len)
> -	)
> -
> -	return bytes;
> +	return iterate_and_advance(i, bytes, (void *)addr,
> +				   copy_to_user_iter_mc, memcpy_to_iter_mc);
>   }
>   EXPORT_SYMBOL_GPL(_copy_mc_to_iter);
>   #endif /* CONFIG_ARCH_HAS_COPY_MC */
>   
> -static void *memcpy_from_iter(struct iov_iter *i, void *to, const void *from,
> -				 size_t size)
> +static size_t memcpy_from_iter_mc(void *iter_from, size_t progress,
> +				  size_t len, void *to, void *priv2)
>   {
> -	if (iov_iter_is_copy_mc(i))
> -		return (void *)copy_mc_to_kernel(to, from, size);
> -	return memcpy(to, from, size);
> +	struct iov_iter *iter = priv2;
> +
> +	if (iov_iter_is_copy_mc(iter))
> +		return copy_mc_to_kernel(to + progress, iter_from, len);
> +	return memcpy_from_iter(iter_from, progress, len, to, priv2);
>   }
>   
>   size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
> @@ -386,30 +270,46 @@ size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
>   
>   	if (user_backed_iter(i))
>   		might_fault();
> -	iterate_and_advance(i, bytes, base, len, off,
> -		copyin(addr + off, base, len),
> -		memcpy_from_iter(i, addr + off, base, len)
> -	)
> -
> -	return bytes;
> +	return iterate_and_advance2(i, bytes, addr, i,
> +				    copy_from_user_iter,
> +				    memcpy_from_iter_mc);
>   }
>   EXPORT_SYMBOL(_copy_from_iter);
>   
> +static __always_inline
> +size_t copy_from_user_iter_nocache(void __user *iter_from, size_t progress,
> +				   size_t len, void *to, void *priv2)
> +{
> +	return __copy_from_user_inatomic_nocache(to + progress, iter_from, len);
> +}
> +
>   size_t _copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
>   {
>   	if (WARN_ON_ONCE(!i->data_source))
>   		return 0;
>   
> -	iterate_and_advance(i, bytes, base, len, off,
> -		__copy_from_user_inatomic_nocache(addr + off, base, len),
> -		memcpy(addr + off, base, len)
> -	)
> -
> -	return bytes;
> +	return iterate_and_advance(i, bytes, addr,
> +				   copy_from_user_iter_nocache,
> +				   memcpy_from_iter);
>   }
>   EXPORT_SYMBOL(_copy_from_iter_nocache);
>   
>   #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
> +static __always_inline
> +size_t copy_from_user_iter_flushcache(void __user *iter_from, size_t progress,
> +				      size_t len, void *to, void *priv2)
> +{
> +	return __copy_from_user_flushcache(to + progress, iter_from, len);
> +}
> +
> +static __always_inline
> +size_t memcpy_from_iter_flushcache(void *iter_from, size_t progress,
> +				   size_t len, void *to, void *priv2)
> +{
> +	memcpy_flushcache(to + progress, iter_from, len);
> +	return 0;
> +}
> +
>   /**
>    * _copy_from_iter_flushcache - write destination through cpu cache
>    * @addr: destination kernel address
> @@ -431,12 +331,9 @@ size_t _copy_from_iter_flushcache(void *addr, size_t bytes, struct iov_iter *i)
>   	if (WARN_ON_ONCE(!i->data_source))
>   		return 0;
>   
> -	iterate_and_advance(i, bytes, base, len, off,
> -		__copy_from_user_flushcache(addr + off, base, len),
> -		memcpy_flushcache(addr + off, base, len)
> -	)
> -
> -	return bytes;
> +	return iterate_and_advance(i, bytes, addr,
> +				   copy_from_user_iter_flushcache,
> +				   memcpy_from_iter_flushcache);
>   }
>   EXPORT_SYMBOL_GPL(_copy_from_iter_flushcache);
>   #endif
> @@ -508,10 +405,9 @@ size_t copy_page_to_iter_nofault(struct page *page, unsigned offset, size_t byte
>   		void *kaddr = kmap_local_page(page);
>   		size_t n = min(bytes, (size_t)PAGE_SIZE - offset);
>   
> -		iterate_and_advance(i, n, base, len, off,
> -			copyout_nofault(base, kaddr + offset + off, len),
> -			memcpy(base, kaddr + offset + off, len)
> -		)
> +		n = iterate_and_advance(i, bytes, kaddr,
> +					copy_to_user_iter_nofault,
> +					memcpy_to_iter);
>   		kunmap_local(kaddr);
>   		res += n;
>   		bytes -= n;
> @@ -554,14 +450,25 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
>   }
>   EXPORT_SYMBOL(copy_page_from_iter);
>   
> -size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
> +static __always_inline
> +size_t zero_to_user_iter(void __user *iter_to, size_t progress,
> +			 size_t len, void *priv, void *priv2)
>   {
> -	iterate_and_advance(i, bytes, base, len, count,
> -		clear_user(base, len),
> -		memset(base, 0, len)
> -	)
> +	return clear_user(iter_to, len);
> +}
>   
> -	return bytes;
> +static __always_inline
> +size_t zero_to_iter(void *iter_to, size_t progress,
> +		    size_t len, void *priv, void *priv2)
> +{
> +	memset(iter_to, 0, len);
> +	return 0;
> +}
> +
> +size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
> +{
> +	return iterate_and_advance(i, bytes, NULL,
> +				   zero_to_user_iter, zero_to_iter);
>   }
>   EXPORT_SYMBOL(iov_iter_zero);
>   
> @@ -586,10 +493,9 @@ size_t copy_page_from_iter_atomic(struct page *page, size_t offset,
>   		}
>   
>   		p = kmap_atomic(page) + offset;
> -		iterate_and_advance(i, n, base, len, off,
> -			copyin(p + off, base, len),
> -			memcpy_from_iter(i, p + off, base, len)
> -		)
> +		n = iterate_and_advance2(i, n, p, i,
> +					 copy_from_user_iter,
> +					 memcpy_from_iter_mc);
>   		kunmap_atomic(p);
>   		copied += n;
>   		offset += n;
> @@ -1180,32 +1086,64 @@ ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
>   }
>   EXPORT_SYMBOL(iov_iter_get_pages_alloc2);
>   
> +static __always_inline
> +size_t copy_from_user_iter_csum(void __user *iter_from, size_t progress,
> +				size_t len, void *to, void *priv2)
> +{
> +	__wsum next, *csum = priv2;
> +
> +	next = csum_and_copy_from_user(iter_from, to + progress, len);
> +	*csum = csum_block_add(*csum, next, progress);
> +	return next ? 0 : len;
> +}
> +
> +static __always_inline
> +size_t memcpy_from_iter_csum(void *iter_from, size_t progress,
> +			     size_t len, void *to, void *priv2)
> +{
> +	__wsum *csum = priv2;
> +
> +	*csum = csum_and_memcpy(to + progress, iter_from, len, *csum, progress);
> +	return 0;
> +}
> +
>   size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
>   			       struct iov_iter *i)
>   {
> -	__wsum sum, next;
> -	sum = *csum;
>   	if (WARN_ON_ONCE(!i->data_source))
>   		return 0;
> -
> -	iterate_and_advance(i, bytes, base, len, off, ({
> -		next = csum_and_copy_from_user(base, addr + off, len);
> -		sum = csum_block_add(sum, next, off);
> -		next ? 0 : len;
> -	}), ({
> -		sum = csum_and_memcpy(addr + off, base, len, sum, off);
> -	})
> -	)
> -	*csum = sum;
> -	return bytes;
> +	return iterate_and_advance2(i, bytes, addr, csum,
> +				    copy_from_user_iter_csum,
> +				    memcpy_from_iter_csum);
>   }
>   EXPORT_SYMBOL(csum_and_copy_from_iter);
>   
> +static __always_inline
> +size_t copy_to_user_iter_csum(void __user *iter_to, size_t progress,
> +			      size_t len, void *from, void *priv2)
> +{
> +	__wsum next, *csum = priv2;
> +
> +	next = csum_and_copy_to_user(from + progress, iter_to, len);
> +	*csum = csum_block_add(*csum, next, progress);
> +	return next ? 0 : len;
> +}
> +
> +static __always_inline
> +size_t memcpy_to_iter_csum(void *iter_to, size_t progress,
> +			   size_t len, void *from, void *priv2)
> +{
> +	__wsum *csum = priv2;
> +
> +	*csum = csum_and_memcpy(iter_to, from + progress, len, *csum, progress);
> +	return 0;
> +}
> +
>   size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *_csstate,
>   			     struct iov_iter *i)
>   {
>   	struct csum_state *csstate = _csstate;
> -	__wsum sum, next;
> +	__wsum sum;
>   
>   	if (WARN_ON_ONCE(i->data_source))
>   		return 0;
> @@ -1219,14 +1157,10 @@ size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *_csstate,
>   	}
>   
>   	sum = csum_shift(csstate->csum, csstate->off);
> -	iterate_and_advance(i, bytes, base, len, off, ({
> -		next = csum_and_copy_to_user(addr + off, base, len);
> -		sum = csum_block_add(sum, next, off);
> -		next ? 0 : len;
> -	}), ({
> -		sum = csum_and_memcpy(base, addr + off, len, sum, off);
> -	})
> -	)
> +	
> +	bytes = iterate_and_advance2(i, bytes, (void *)addr, &sum,
> +				     copy_to_user_iter_csum,
> +				     memcpy_to_iter_csum);
>   	csstate->csum = csum_shift(sum, csstate->off);
>   	csstate->off += bytes;
>   	return bytes;
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [bug report] dead loop in generic_perform_write() //Re: [PATCH v7 07/12] iov_iter: Convert iterate*() to inline funcs
  2024-02-18  3:13   ` [bug report] dead loop in generic_perform_write() //Re: " Tong Tiangen
@ 2024-02-27 12:43     ` Tong Tiangen
  2024-02-28 21:21     ` Linus Torvalds
  1 sibling, 0 replies; 36+ messages in thread
From: Tong Tiangen @ 2024-02-27 12:43 UTC (permalink / raw)
  To: David Howells, Jens Axboe
  Cc: Al Viro, Linus Torvalds, Christoph Hellwig, Christian Brauner,
	David Laight, Matthew Wilcox, Jeff Layton, linux-fsdevel,
	linux-block, linux-mm, netdev, linux-kernel, Kefeng Wang

Hi David, Jens:

Kindly ping...

Thanks.
Tong.


On 2024/2/18 11:13, Tong Tiangen wrote:
> Hi David, Jens:
> 
> Recently, I tested the x86 coredump path of a user process on the
> mainline (6.8-rc1) and found a dead-loop issue related to this patch.
> 
> Let's discuss it.
> 
> 1. Test step:
> ----------------------------
>    a. Start a user process.
>    b. Use EINJ to inject a hardware memory error into a page of
>       this user process.
>    c. Send SIGBUS to this user process.
>    d. After the signal is received, a coredump file is written to
>       tmpfs (as configured).
> 
> 2. Root cause:
> ----------------------------
> The dead loop occurs in generic_perform_write(); the call path is:
> 
> elf_core_dump()
>    -> dump_user_range()
>      -> dump_emit_page()
>        -> iov_iter_bvec()  //iter type set to BVEC
>          -> iov_iter_set_copy_mc(&iter);  //support copy mc
>            -> __kernel_write_iter()
>              -> shmem_file_write_iter()
>                -> generic_perform_write()
> 
> ssize_t generic_perform_write(...)
> {
>      [...]
>      do {
>          [...]
>      again:
>          //[4]
>          if (unlikely(fault_in_iov_iter_readable(i, bytes) ==
>                               bytes)) {
>              status = -EFAULT;
>              break;
>          }
>          //[5]
>          if (fatal_signal_pending(current)) {
>              status = -EINTR;
>              break;
>          }
> 
>              [...]
> 
>          //[1]
>          copied = copy_page_from_iter_atomic(page, offset, bytes,
>                           i);
>          [...]
> 
>          //[2]
>          status = a_ops->write_end(...);
>          if (unlikely(status != copied)) {
>              iov_iter_revert(i, copied - max(status, 0L));
>              if (unlikely(status < 0))
>                  break;
>          }
>          cond_resched();
> 
>          if (unlikely(status == 0)) {
>              /*
>              * A short copy made ->write_end() reject the
>              * thing entirely.  Might be memory poisoning
>              * halfway through, might be a race with munmap,
>              * might be severe memory pressure.
>              */
>              if (copied)
>                  bytes = copied;
>              //[3]
>              goto again;
>          }
>          [...]
>      } while (iov_iter_count(i));
>      [...]
> }
> 
> [1] Before this patch:
>    copy_page_from_iter_atomic()
>      -> iterate_and_advance()
>         -> __iterate_and_advance(..., ((void)(K),0))
>           ->iterate_bvec macro
>             -> left = ((void)(K),0)
> 
> With CONFIG_ARCH_HAS_COPY_MC, the K() is copy_mc_to_kernel(), which
> returns the number of bytes not copied.
> 
> Even when a memory error occurs during K(), the value of "left" is
> forced to 0 by the macro. Therefore, the value of "copied" returned by
> copy_page_from_iter_atomic() is not 0, and the loop in
> generic_perform_write() can end normally.
> 
> 
> After this patch:
>    copy_page_from_iter_atomic()
>      -> iterate_and_advance2()
>        -> iterate_bvec()
>          -> remain = step()
> 
> With CONFIG_ARCH_HAS_COPY_MC, the step() is copy_mc_to_kernel(), which
> returns the number of bytes not copied.
> 
> When a memory error occurs during step(), the value of "remain" equals
> the value of "part" (not a single byte is copied successfully). In this
> case, iterate_bvec() returns 0, and copy_page_from_iter_atomic() also
> returns 0. The callback shmem_write_end()[2] then returns 0. Finally,
> generic_perform_write() takes the "goto again"[3] and the loop restarts.
> Neither [4] nor [5] can catch this and exit the loop, so a dead loop
> occurs.
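> 
> To make the difference concrete, a minimal annotated sketch of the two
> step expressions (illustration only, not a literal quote of either
> version):
> 
>     /* Before: copy_page_from_iter_atomic() used iterate_and_advance(),
>      * which wraps the kernel step K in ((void)(K),0), so the short-copy
>      * result of memcpy_from_iter() -> copy_mc_to_kernel() was discarded
>      * and "left" was the compile-time constant 0.
>      */
>     left = ((void)(memcpy_from_iter(i, p + off, base, len)), 0);
> 
>     /* After: the step function's return value (bytes NOT copied) is
>      * used directly, so "remain" can equal "part" and iterate_bvec()
>      * can legitimately report zero progress.
>      */
>     remain = step(kaddr + offset % PAGE_SIZE, progress, part, priv, priv2);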
> 
> Thanks.
> Tong
> 
> 
> On 2023/9/25 20:03, David Howells wrote:
>> Convert the iov_iter iteration macros to inline functions to make the code
>> easier to follow.
>>
>> The functions are marked __always_inline as we don't want to end up with
>> indirect calls in the code.  This, however, leaves dealing with ->copy_mc
>> in an awkward situation since the step function (memcpy_from_iter_mc())
>> needs to test the flag in the iterator, but isn't passed the iterator.
>> This will be dealt with in a follow-up patch.
>>
>> The variable names in the per-type iterator functions have been harmonised
>> as much as possible and made clearer as to the variable purpose.
>>
>> The iterator functions are also moved to a header file so that other
>> operations that need to scan over an iterator can be added.  For instance,
>> the rbd driver could use this to scan a buffer to see if it is all zeros
>> and libceph could use this to generate a crc.
>>
>> Signed-off-by: David Howells <dhowells@redhat.com>
>> cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> cc: Jens Axboe <axboe@kernel.dk>
>> cc: Christoph Hellwig <hch@lst.de>
>> cc: Christian Brauner <christian@brauner.io>
>> cc: Matthew Wilcox <willy@infradead.org>
>> cc: Linus Torvalds <torvalds@linux-foundation.org>
>> cc: David Laight <David.Laight@ACULAB.COM>
>> cc: linux-block@vger.kernel.org
>> cc: linux-fsdevel@vger.kernel.org
>> cc: linux-mm@kvack.org
>> Link: https://lore.kernel.org/r/3710261.1691764329@warthog.procyon.org.uk/ # v1
>> Link: https://lore.kernel.org/r/855.1692047347@warthog.procyon.org.uk/ # v2
>> Link: https://lore.kernel.org/r/20230816120741.534415-1-dhowells@redhat.com/ # v3
>> ---
>>
>> Notes:
>>      Changes
>>      =======
>>      ver #5)
>>       - Merge in patch to move iteration framework to a header file.
>>       - Move "iter->count - progress" into individual iteration subfunctions.
>>
>>   include/linux/iov_iter.h | 274 ++++++++++++++++++++++++++
>>   lib/iov_iter.c           | 416 ++++++++++++++++-----------------------
>>   2 files changed, 449 insertions(+), 241 deletions(-)
>>   create mode 100644 include/linux/iov_iter.h
>>
>> diff --git a/include/linux/iov_iter.h b/include/linux/iov_iter.h
>> new file mode 100644
>> index 000000000000..270454a6703d
>> --- /dev/null
>> +++ b/include/linux/iov_iter.h
>> @@ -0,0 +1,274 @@
>> +/* SPDX-License-Identifier: GPL-2.0-or-later */
>> +/* I/O iterator iteration building functions.
>> + *
>> + * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
>> + * Written by David Howells (dhowells@redhat.com)
>> + */
>> +
>> +#ifndef _LINUX_IOV_ITER_H
>> +#define _LINUX_IOV_ITER_H
>> +
>> +#include <linux/uio.h>
>> +#include <linux/bvec.h>
>> +
>> +typedef size_t (*iov_step_f)(void *iter_base, size_t progress, size_t len,
>> +                 void *priv, void *priv2);
>> +typedef size_t (*iov_ustep_f)(void __user *iter_base, size_t progress, size_t len,
>> +                  void *priv, void *priv2);
>> +
>> +/*
>> + * Handle ITER_UBUF.
>> + */
>> +static __always_inline
>> +size_t iterate_ubuf(struct iov_iter *iter, size_t len, void *priv, void *priv2,
>> +            iov_ustep_f step)
>> +{
>> +    void __user *base = iter->ubuf;
>> +    size_t progress = 0, remain;
>> +
>> +    remain = step(base + iter->iov_offset, 0, len, priv, priv2);
>> +    progress = len - remain;
>> +    iter->iov_offset += progress;
>> +    iter->count -= progress;
>> +    return progress;
>> +}
>> +
>> +/*
>> + * Handle ITER_IOVEC.
>> + */
>> +static __always_inline
>> +size_t iterate_iovec(struct iov_iter *iter, size_t len, void *priv, void *priv2,
>> +             iov_ustep_f step)
>> +{
>> +    const struct iovec *p = iter->__iov;
>> +    size_t progress = 0, skip = iter->iov_offset;
>> +
>> +    do {
>> +        size_t remain, consumed;
>> +        size_t part = min(len, p->iov_len - skip);
>> +
>> +        if (likely(part)) {
>> +            remain = step(p->iov_base + skip, progress, part, priv, priv2);
>> +            consumed = part - remain;
>> +            progress += consumed;
>> +            skip += consumed;
>> +            len -= consumed;
>> +            if (skip < p->iov_len)
>> +                break;
>> +        }
>> +        p++;
>> +        skip = 0;
>> +    } while (len);
>> +
>> +    iter->nr_segs -= p - iter->__iov;
>> +    iter->__iov = p;
>> +    iter->iov_offset = skip;
>> +    iter->count -= progress;
>> +    return progress;
>> +}
>> +
>> +/*
>> + * Handle ITER_KVEC.
>> + */
>> +static __always_inline
>> +size_t iterate_kvec(struct iov_iter *iter, size_t len, void *priv, void *priv2,
>> +            iov_step_f step)
>> +{
>> +    const struct kvec *p = iter->kvec;
>> +    size_t progress = 0, skip = iter->iov_offset;
>> +
>> +    do {
>> +        size_t remain, consumed;
>> +        size_t part = min(len, p->iov_len - skip);
>> +
>> +        if (likely(part)) {
>> +            remain = step(p->iov_base + skip, progress, part, priv, priv2);
>> +            consumed = part - remain;
>> +            progress += consumed;
>> +            skip += consumed;
>> +            len -= consumed;
>> +            if (skip < p->iov_len)
>> +                break;
>> +        }
>> +        p++;
>> +        skip = 0;
>> +    } while (len);
>> +
>> +    iter->nr_segs -= p - iter->kvec;
>> +    iter->kvec = p;
>> +    iter->iov_offset = skip;
>> +    iter->count -= progress;
>> +    return progress;
>> +}
>> +
>> +/*
>> + * Handle ITER_BVEC.
>> + */
>> +static __always_inline
>> +size_t iterate_bvec(struct iov_iter *iter, size_t len, void *priv, void *priv2,
>> +            iov_step_f step)
>> +{
>> +    const struct bio_vec *p = iter->bvec;
>> +    size_t progress = 0, skip = iter->iov_offset;
>> +
>> +    do {
>> +        size_t remain, consumed;
>> +        size_t offset = p->bv_offset + skip, part;
>> +        void *kaddr = kmap_local_page(p->bv_page + offset / PAGE_SIZE);
>> +
>> +        part = min3(len,
>> +               (size_t)(p->bv_len - skip),
>> +               (size_t)(PAGE_SIZE - offset % PAGE_SIZE));
>> +        remain = step(kaddr + offset % PAGE_SIZE, progress, part, priv, priv2);
>> +        kunmap_local(kaddr);
>> +        consumed = part - remain;
>> +        len -= consumed;
>> +        progress += consumed;
>> +        skip += consumed;
>> +        if (skip >= p->bv_len) {
>> +            skip = 0;
>> +            p++;
>> +        }
>> +        if (remain)
>> +            break;
>> +    } while (len);
>> +
>> +    iter->nr_segs -= p - iter->bvec;
>> +    iter->bvec = p;
>> +    iter->iov_offset = skip;
>> +    iter->count -= progress;
>> +    return progress;
>> +}
>> +
>> +/*
>> + * Handle ITER_XARRAY.
>> + */
>> +static __always_inline
>> +size_t iterate_xarray(struct iov_iter *iter, size_t len, void *priv, void *priv2,
>> +              iov_step_f step)
>> +{
>> +    struct folio *folio;
>> +    size_t progress = 0;
>> +    loff_t start = iter->xarray_start + iter->iov_offset;
>> +    pgoff_t index = start / PAGE_SIZE;
>> +    XA_STATE(xas, iter->xarray, index);
>> +
>> +    rcu_read_lock();
>> +    xas_for_each(&xas, folio, ULONG_MAX) {
>> +        size_t remain, consumed, offset, part, flen;
>> +
>> +        if (xas_retry(&xas, folio))
>> +            continue;
>> +        if (WARN_ON(xa_is_value(folio)))
>> +            break;
>> +        if (WARN_ON(folio_test_hugetlb(folio)))
>> +            break;
>> +
>> +        offset = offset_in_folio(folio, start + progress);
>> +        flen = min(folio_size(folio) - offset, len);
>> +
>> +        while (flen) {
>> +            void *base = kmap_local_folio(folio, offset);
>> +
>> +            part = min_t(size_t, flen,
>> +                     PAGE_SIZE - offset_in_page(offset));
>> +            remain = step(base, progress, part, priv, priv2);
>> +            kunmap_local(base);
>> +
>> +            consumed = part - remain;
>> +            progress += consumed;
>> +            len -= consumed;
>> +
>> +            if (remain || len == 0)
>> +                goto out;
>> +            flen -= consumed;
>> +            offset += consumed;
>> +        }
>> +    }
>> +
>> +out:
>> +    rcu_read_unlock();
>> +    iter->iov_offset += progress;
>> +    iter->count -= progress;
>> +    return progress;
>> +}
>> +
>> +/*
>> + * Handle ITER_DISCARD.
>> + */
>> +static __always_inline
>> +size_t iterate_discard(struct iov_iter *iter, size_t len, void *priv, void *priv2,
>> +              iov_step_f step)
>> +{
>> +    size_t progress = len;
>> +
>> +    iter->count -= progress;
>> +    return progress;
>> +}
>> +
>> +/**
>> + * iterate_and_advance2 - Iterate over an iterator
>> + * @iter: The iterator to iterate over.
>> + * @len: The amount to iterate over.
>> + * @priv: Data for the step functions.
>> + * @priv2: More data for the step functions.
>> + * @ustep: Function for UBUF/IOVEC iterators; given __user addresses.
>> + * @step: Function for other iterators; given kernel addresses.
>> + *
>> + * Iterate over the next part of an iterator, up to the specified length.  The
>> + * buffer is presented in segments, which for kernel iteration are broken up by
>> + * physical pages and mapped, with the mapped address being presented.
>> + *
>> + * Two step functions, @step and @ustep, must be provided, one for handling
>> + * mapped kernel addresses and the other is given user addresses which have the
>> + * potential to fault since no pinning is performed.
>> + *
>> + * The step functions are passed the address and length of the segment, @priv,
>> + * @priv2 and the amount of data so far iterated over (which can, for example,
>> + * be added to @priv to point to the right part of a second buffer).  The step
>> + * functions should return the amount of the segment they didn't process (ie. 0
>> + * indicates complete processing).
>> + *
>> + * This function returns the amount of data processed (ie. 0 means nothing was
>> + * processed and the value of @len means processed to completion).
>> + */
>> +static __always_inline
>> +size_t iterate_and_advance2(struct iov_iter *iter, size_t len, void *priv,
>> +                void *priv2, iov_ustep_f ustep, iov_step_f step)
>> +{
>> +    if (unlikely(iter->count < len))
>> +        len = iter->count;
>> +    if (unlikely(!len))
>> +        return 0;
>> +
>> +    if (likely(iter_is_ubuf(iter)))
>> +        return iterate_ubuf(iter, len, priv, priv2, ustep);
>> +    if (likely(iter_is_iovec(iter)))
>> +        return iterate_iovec(iter, len, priv, priv2, ustep);
>> +    if (iov_iter_is_bvec(iter))
>> +        return iterate_bvec(iter, len, priv, priv2, step);
>> +    if (iov_iter_is_kvec(iter))
>> +        return iterate_kvec(iter, len, priv, priv2, step);
>> +    if (iov_iter_is_xarray(iter))
>> +        return iterate_xarray(iter, len, priv, priv2, step);
>> +    return iterate_discard(iter, len, priv, priv2, step);
>> +}
>> +
>> +/**
>> + * iterate_and_advance - Iterate over an iterator
>> + * @iter: The iterator to iterate over.
>> + * @len: The amount to iterate over.
>> + * @priv: Data for the step functions.
>> + * @ustep: Function for UBUF/IOVEC iterators; given __user addresses.
>> + * @step: Function for other iterators; given kernel addresses.
>> + *
>> + * As iterate_and_advance2(), but priv2 is always NULL.
>> + */
>> +static __always_inline
>> +size_t iterate_and_advance(struct iov_iter *iter, size_t len, void *priv,
>> +               iov_ustep_f ustep, iov_step_f step)
>> +{
>> +    return iterate_and_advance2(iter, len, priv, NULL, ustep, step);
>> +}
>> +
>> +#endif /* _LINUX_IOV_ITER_H */
>> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
>> index 227c9f536b94..65374ee91ecd 100644
>> --- a/lib/iov_iter.c
>> +++ b/lib/iov_iter.c
>> @@ -13,189 +13,69 @@
>>   #include <net/checksum.h>
>>   #include <linux/scatterlist.h>
>>   #include <linux/instrumented.h>
>> +#include <linux/iov_iter.h>
>> -/* covers ubuf and kbuf alike */
>> -#define iterate_buf(i, n, base, len, off, __p, STEP) {        \
>> -    size_t __maybe_unused off = 0;                \
>> -    len = n;                        \
>> -    base = __p + i->iov_offset;                \
>> -    len -= (STEP);                        \
>> -    i->iov_offset += len;                    \
>> -    n = len;                        \
>> -}
>> -
>> -/* covers iovec and kvec alike */
>> -#define iterate_iovec(i, n, base, len, off, __p, STEP) {    \
>> -    size_t off = 0;                        \
>> -    size_t skip = i->iov_offset;                \
>> -    do {                            \
>> -        len = min(n, __p->iov_len - skip);        \
>> -        if (likely(len)) {                \
>> -            base = __p->iov_base + skip;        \
>> -            len -= (STEP);                \
>> -            off += len;                \
>> -            skip += len;                \
>> -            n -= len;                \
>> -            if (skip < __p->iov_len)        \
>> -                break;                \
>> -        }                        \
>> -        __p++;                        \
>> -        skip = 0;                    \
>> -    } while (n);                        \
>> -    i->iov_offset = skip;                    \
>> -    n = off;                        \
>> -}
>> -
>> -#define iterate_bvec(i, n, base, len, off, p, STEP) {        \
>> -    size_t off = 0;                        \
>> -    unsigned skip = i->iov_offset;                \
>> -    while (n) {                        \
>> -        unsigned offset = p->bv_offset + skip;        \
>> -        unsigned left;                    \
>> -        void *kaddr = kmap_local_page(p->bv_page +    \
>> -                    offset / PAGE_SIZE);    \
>> -        base = kaddr + offset % PAGE_SIZE;        \
>> -        len = min(min(n, (size_t)(p->bv_len - skip)),    \
>> -             (size_t)(PAGE_SIZE - offset % PAGE_SIZE));    \
>> -        left = (STEP);                    \
>> -        kunmap_local(kaddr);                \
>> -        len -= left;                    \
>> -        off += len;                    \
>> -        skip += len;                    \
>> -        if (skip == p->bv_len) {            \
>> -            skip = 0;                \
>> -            p++;                    \
>> -        }                        \
>> -        n -= len;                    \
>> -        if (left)                    \
>> -            break;                    \
>> -    }                            \
>> -    i->iov_offset = skip;                    \
>> -    n = off;                        \
>> -}
>> -
>> -#define iterate_xarray(i, n, base, len, __off, STEP) {        \
>> -    __label__ __out;                    \
>> -    size_t __off = 0;                    \
>> -    struct folio *folio;                    \
>> -    loff_t start = i->xarray_start + i->iov_offset;        \
>> -    pgoff_t index = start / PAGE_SIZE;            \
>> -    XA_STATE(xas, i->xarray, index);            \
>> -                                \
>> -    len = PAGE_SIZE - offset_in_page(start);        \
>> -    rcu_read_lock();                    \
>> -    xas_for_each(&xas, folio, ULONG_MAX) {            \
>> -        unsigned left;                    \
>> -        size_t offset;                    \
>> -        if (xas_retry(&xas, folio))            \
>> -            continue;                \
>> -        if (WARN_ON(xa_is_value(folio)))        \
>> -            break;                    \
>> -        if (WARN_ON(folio_test_hugetlb(folio)))        \
>> -            break;                    \
>> -        offset = offset_in_folio(folio, start + __off);    \
>> -        while (offset < folio_size(folio)) {        \
>> -            base = kmap_local_folio(folio, offset);    \
>> -            len = min(n, len);            \
>> -            left = (STEP);                \
>> -            kunmap_local(base);            \
>> -            len -= left;                \
>> -            __off += len;                \
>> -            n -= len;                \
>> -            if (left || n == 0)            \
>> -                goto __out;            \
>> -            offset += len;                \
>> -            len = PAGE_SIZE;            \
>> -        }                        \
>> -    }                            \
>> -__out:                                \
>> -    rcu_read_unlock();                    \
>> -    i->iov_offset += __off;                    \
>> -    n = __off;                        \
>> -}
>> -
>> -#define __iterate_and_advance(i, n, base, len, off, I, K) {    \
>> -    if (unlikely(i->count < n))                \
>> -        n = i->count;                    \
>> -    if (likely(n)) {                    \
>> -        if (likely(iter_is_ubuf(i))) {            \
>> -            void __user *base;            \
>> -            size_t len;                \
>> -            iterate_buf(i, n, base, len, off,    \
>> -                        i->ubuf, (I))     \
>> -        } else if (likely(iter_is_iovec(i))) {        \
>> -            const struct iovec *iov = iter_iov(i);    \
>> -            void __user *base;            \
>> -            size_t len;                \
>> -            iterate_iovec(i, n, base, len, off,    \
>> -                        iov, (I))    \
>> -            i->nr_segs -= iov - iter_iov(i);    \
>> -            i->__iov = iov;                \
>> -        } else if (iov_iter_is_bvec(i)) {        \
>> -            const struct bio_vec *bvec = i->bvec;    \
>> -            void *base;                \
>> -            size_t len;                \
>> -            iterate_bvec(i, n, base, len, off,    \
>> -                        bvec, (K))    \
>> -            i->nr_segs -= bvec - i->bvec;        \
>> -            i->bvec = bvec;                \
>> -        } else if (iov_iter_is_kvec(i)) {        \
>> -            const struct kvec *kvec = i->kvec;    \
>> -            void *base;                \
>> -            size_t len;                \
>> -            iterate_iovec(i, n, base, len, off,    \
>> -                        kvec, (K))    \
>> -            i->nr_segs -= kvec - i->kvec;        \
>> -            i->kvec = kvec;                \
>> -        } else if (iov_iter_is_xarray(i)) {        \
>> -            void *base;                \
>> -            size_t len;                \
>> -            iterate_xarray(i, n, base, len, off,    \
>> -                            (K))    \
>> -        }                        \
>> -        i->count -= n;                    \
>> -    }                            \
>> -}
>> -#define iterate_and_advance(i, n, base, len, off, I, K) \
>> -    __iterate_and_advance(i, n, base, len, off, I, ((void)(K),0))
>> -
>> -static int copyout(void __user *to, const void *from, size_t n)
>> +static __always_inline
>> +size_t copy_to_user_iter(void __user *iter_to, size_t progress,
>> +             size_t len, void *from, void *priv2)
>>   {
>>       if (should_fail_usercopy())
>> -        return n;
>> -    if (access_ok(to, n)) {
>> -        instrument_copy_to_user(to, from, n);
>> -        n = raw_copy_to_user(to, from, n);
>> +        return len;
>> +    if (access_ok(iter_to, len)) {
>> +        from += progress;
>> +        instrument_copy_to_user(iter_to, from, len);
>> +        len = raw_copy_to_user(iter_to, from, len);
>>       }
>> -    return n;
>> +    return len;
>>   }
>> -static int copyout_nofault(void __user *to, const void *from, size_t n)
>> +static __always_inline
>> +size_t copy_to_user_iter_nofault(void __user *iter_to, size_t progress,
>> +                 size_t len, void *from, void *priv2)
>>   {
>> -    long res;
>> +    ssize_t res;
>>       if (should_fail_usercopy())
>> -        return n;
>> -
>> -    res = copy_to_user_nofault(to, from, n);
>> +        return len;
>> -    return res < 0 ? n : res;
>> +    from += progress;
>> +    res = copy_to_user_nofault(iter_to, from, len);
>> +    return res < 0 ? len : res;
>>   }
>> -static int copyin(void *to, const void __user *from, size_t n)
>> +static __always_inline
>> +size_t copy_from_user_iter(void __user *iter_from, size_t progress,
>> +               size_t len, void *to, void *priv2)
>>   {
>> -    size_t res = n;
>> +    size_t res = len;
>>       if (should_fail_usercopy())
>> -        return n;
>> -    if (access_ok(from, n)) {
>> -        instrument_copy_from_user_before(to, from, n);
>> -        res = raw_copy_from_user(to, from, n);
>> -        instrument_copy_from_user_after(to, from, n, res);
>> +        return len;
>> +    if (access_ok(iter_from, len)) {
>> +        to += progress;
>> +        instrument_copy_from_user_before(to, iter_from, len);
>> +        res = raw_copy_from_user(to, iter_from, len);
>> +        instrument_copy_from_user_after(to, iter_from, len, res);
>>       }
>>       return res;
>>   }
>> +static __always_inline
>> +size_t memcpy_to_iter(void *iter_to, size_t progress,
>> +              size_t len, void *from, void *priv2)
>> +{
>> +    memcpy(iter_to, from + progress, len);
>> +    return 0;
>> +}
>> +
>> +static __always_inline
>> +size_t memcpy_from_iter(void *iter_from, size_t progress,
>> +            size_t len, void *to, void *priv2)
>> +{
>> +    memcpy(to + progress, iter_from, len);
>> +    return 0;
>> +}
>> +
>>   /*
>>    * fault_in_iov_iter_readable - fault in iov iterator for reading
>>    * @i: iterator
>> @@ -312,23 +192,29 @@ size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
>>           return 0;
>>       if (user_backed_iter(i))
>>           might_fault();
>> -    iterate_and_advance(i, bytes, base, len, off,
>> -        copyout(base, addr + off, len),
>> -        memcpy(base, addr + off, len)
>> -    )
>> -
>> -    return bytes;
>> +    return iterate_and_advance(i, bytes, (void *)addr,
>> +                   copy_to_user_iter, memcpy_to_iter);
>>   }
>>   EXPORT_SYMBOL(_copy_to_iter);
>>   #ifdef CONFIG_ARCH_HAS_COPY_MC
>> -static int copyout_mc(void __user *to, const void *from, size_t n)
>> -{
>> -    if (access_ok(to, n)) {
>> -        instrument_copy_to_user(to, from, n);
>> -        n = copy_mc_to_user((__force void *) to, from, n);
>> +static __always_inline
>> +size_t copy_to_user_iter_mc(void __user *iter_to, size_t progress,
>> +                size_t len, void *from, void *priv2)
>> +{
>> +    if (access_ok(iter_to, len)) {
>> +        from += progress;
>> +        instrument_copy_to_user(iter_to, from, len);
>> +        len = copy_mc_to_user(iter_to, from, len);
>>       }
>> -    return n;
>> +    return len;
>> +}
>> +
>> +static __always_inline
>> +size_t memcpy_to_iter_mc(void *iter_to, size_t progress,
>> +             size_t len, void *from, void *priv2)
>> +{
>> +    return copy_mc_to_kernel(iter_to, from + progress, len);
>>   }
>>   /**
>> @@ -361,22 +247,20 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
>>           return 0;
>>       if (user_backed_iter(i))
>>           might_fault();
>> -    __iterate_and_advance(i, bytes, base, len, off,
>> -        copyout_mc(base, addr + off, len),
>> -        copy_mc_to_kernel(base, addr + off, len)
>> -    )
>> -
>> -    return bytes;
>> +    return iterate_and_advance(i, bytes, (void *)addr,
>> +                   copy_to_user_iter_mc, memcpy_to_iter_mc);
>>   }
>>   EXPORT_SYMBOL_GPL(_copy_mc_to_iter);
>>   #endif /* CONFIG_ARCH_HAS_COPY_MC */
>> -static void *memcpy_from_iter(struct iov_iter *i, void *to, const void *from,
>> -                 size_t size)
>> +static size_t memcpy_from_iter_mc(void *iter_from, size_t progress,
>> +                  size_t len, void *to, void *priv2)
>>   {
>> -    if (iov_iter_is_copy_mc(i))
>> -        return (void *)copy_mc_to_kernel(to, from, size);
>> -    return memcpy(to, from, size);
>> +    struct iov_iter *iter = priv2;
>> +
>> +    if (iov_iter_is_copy_mc(iter))
>> +        return copy_mc_to_kernel(to + progress, iter_from, len);
>> +    return memcpy_from_iter(iter_from, progress, len, to, priv2);
>>   }
>>   size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
>> @@ -386,30 +270,46 @@ size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
>>       if (user_backed_iter(i))
>>           might_fault();
>> -    iterate_and_advance(i, bytes, base, len, off,
>> -        copyin(addr + off, base, len),
>> -        memcpy_from_iter(i, addr + off, base, len)
>> -    )
>> -
>> -    return bytes;
>> +    return iterate_and_advance2(i, bytes, addr, i,
>> +                    copy_from_user_iter,
>> +                    memcpy_from_iter_mc);
>>   }
>>   EXPORT_SYMBOL(_copy_from_iter);
>> +static __always_inline
>> +size_t copy_from_user_iter_nocache(void __user *iter_from, size_t progress,
>> +                   size_t len, void *to, void *priv2)
>> +{
>> +    return __copy_from_user_inatomic_nocache(to + progress, iter_from, len);
>> +}
>> +
>>   size_t _copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
>>   {
>>       if (WARN_ON_ONCE(!i->data_source))
>>           return 0;
>> -    iterate_and_advance(i, bytes, base, len, off,
>> -        __copy_from_user_inatomic_nocache(addr + off, base, len),
>> -        memcpy(addr + off, base, len)
>> -    )
>> -
>> -    return bytes;
>> +    return iterate_and_advance(i, bytes, addr,
>> +                   copy_from_user_iter_nocache,
>> +                   memcpy_from_iter);
>>   }
>>   EXPORT_SYMBOL(_copy_from_iter_nocache);
>>   #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
>> +static __always_inline
>> +size_t copy_from_user_iter_flushcache(void __user *iter_from, size_t progress,
>> +                      size_t len, void *to, void *priv2)
>> +{
>> +    return __copy_from_user_flushcache(to + progress, iter_from, len);
>> +}
>> +
>> +static __always_inline
>> +size_t memcpy_from_iter_flushcache(void *iter_from, size_t progress,
>> +                   size_t len, void *to, void *priv2)
>> +{
>> +    memcpy_flushcache(to + progress, iter_from, len);
>> +    return 0;
>> +}
>> +
>>   /**
>>    * _copy_from_iter_flushcache - write destination through cpu cache
>>    * @addr: destination kernel address
>> @@ -431,12 +331,9 @@ size_t _copy_from_iter_flushcache(void *addr, size_t bytes, struct iov_iter *i)
>>       if (WARN_ON_ONCE(!i->data_source))
>>           return 0;
>> -    iterate_and_advance(i, bytes, base, len, off,
>> -        __copy_from_user_flushcache(addr + off, base, len),
>> -        memcpy_flushcache(addr + off, base, len)
>> -    )
>> -
>> -    return bytes;
>> +    return iterate_and_advance(i, bytes, addr,
>> +                   copy_from_user_iter_flushcache,
>> +                   memcpy_from_iter_flushcache);
>>   }
>>   EXPORT_SYMBOL_GPL(_copy_from_iter_flushcache);
>>   #endif
>> @@ -508,10 +405,9 @@ size_t copy_page_to_iter_nofault(struct page *page, unsigned offset, size_t byte
>>           void *kaddr = kmap_local_page(page);
>>           size_t n = min(bytes, (size_t)PAGE_SIZE - offset);
>> -        iterate_and_advance(i, n, base, len, off,
>> -            copyout_nofault(base, kaddr + offset + off, len),
>> -            memcpy(base, kaddr + offset + off, len)
>> -        )
>> +        n = iterate_and_advance(i, bytes, kaddr,
>> +                    copy_to_user_iter_nofault,
>> +                    memcpy_to_iter);
>>           kunmap_local(kaddr);
>>           res += n;
>>           bytes -= n;
>> @@ -554,14 +450,25 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
>>   }
>>   EXPORT_SYMBOL(copy_page_from_iter);
>> -size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
>> +static __always_inline
>> +size_t zero_to_user_iter(void __user *iter_to, size_t progress,
>> +             size_t len, void *priv, void *priv2)
>>   {
>> -    iterate_and_advance(i, bytes, base, len, count,
>> -        clear_user(base, len),
>> -        memset(base, 0, len)
>> -    )
>> +    return clear_user(iter_to, len);
>> +}
>> -    return bytes;
>> +static __always_inline
>> +size_t zero_to_iter(void *iter_to, size_t progress,
>> +            size_t len, void *priv, void *priv2)
>> +{
>> +    memset(iter_to, 0, len);
>> +    return 0;
>> +}
>> +
>> +size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
>> +{
>> +    return iterate_and_advance(i, bytes, NULL,
>> +                   zero_to_user_iter, zero_to_iter);
>>   }
>>   EXPORT_SYMBOL(iov_iter_zero);
>> @@ -586,10 +493,9 @@ size_t copy_page_from_iter_atomic(struct page *page, size_t offset,
>>           }
>>           p = kmap_atomic(page) + offset;
>> -        iterate_and_advance(i, n, base, len, off,
>> -            copyin(p + off, base, len),
>> -            memcpy_from_iter(i, p + off, base, len)
>> -        )
>> +        n = iterate_and_advance2(i, n, p, i,
>> +                     copy_from_user_iter,
>> +                     memcpy_from_iter_mc);
>>           kunmap_atomic(p);
>>           copied += n;
>>           offset += n;
>> @@ -1180,32 +1086,64 @@ ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
>>   }
>>   EXPORT_SYMBOL(iov_iter_get_pages_alloc2);
>> +static __always_inline
>> +size_t copy_from_user_iter_csum(void __user *iter_from, size_t progress,
>> +                size_t len, void *to, void *priv2)
>> +{
>> +    __wsum next, *csum = priv2;
>> +
>> +    next = csum_and_copy_from_user(iter_from, to + progress, len);
>> +    *csum = csum_block_add(*csum, next, progress);
>> +    return next ? 0 : len;
>> +}
>> +
>> +static __always_inline
>> +size_t memcpy_from_iter_csum(void *iter_from, size_t progress,
>> +                 size_t len, void *to, void *priv2)
>> +{
>> +    __wsum *csum = priv2;
>> +
>> +    *csum = csum_and_memcpy(to + progress, iter_from, len, *csum, progress);
>> +    return 0;
>> +}
>> +
>>   size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
>>                      struct iov_iter *i)
>>   {
>> -    __wsum sum, next;
>> -    sum = *csum;
>>       if (WARN_ON_ONCE(!i->data_source))
>>           return 0;
>> -
>> -    iterate_and_advance(i, bytes, base, len, off, ({
>> -        next = csum_and_copy_from_user(base, addr + off, len);
>> -        sum = csum_block_add(sum, next, off);
>> -        next ? 0 : len;
>> -    }), ({
>> -        sum = csum_and_memcpy(addr + off, base, len, sum, off);
>> -    })
>> -    )
>> -    *csum = sum;
>> -    return bytes;
>> +    return iterate_and_advance2(i, bytes, addr, csum,
>> +                    copy_from_user_iter_csum,
>> +                    memcpy_from_iter_csum);
>>   }
>>   EXPORT_SYMBOL(csum_and_copy_from_iter);
>> +static __always_inline
>> +size_t copy_to_user_iter_csum(void __user *iter_to, size_t progress,
>> +                  size_t len, void *from, void *priv2)
>> +{
>> +    __wsum next, *csum = priv2;
>> +
>> +    next = csum_and_copy_to_user(from + progress, iter_to, len);
>> +    *csum = csum_block_add(*csum, next, progress);
>> +    return next ? 0 : len;
>> +}
>> +
>> +static __always_inline
>> +size_t memcpy_to_iter_csum(void *iter_to, size_t progress,
>> +               size_t len, void *from, void *priv2)
>> +{
>> +    __wsum *csum = priv2;
>> +
>> +    *csum = csum_and_memcpy(iter_to, from + progress, len, *csum, progress);
>> +    return 0;
>> +}
>> +
>>   size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *_csstate,
>>                    struct iov_iter *i)
>>   {
>>       struct csum_state *csstate = _csstate;
>> -    __wsum sum, next;
>> +    __wsum sum;
>>       if (WARN_ON_ONCE(i->data_source))
>>           return 0;
>> @@ -1219,14 +1157,10 @@ size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *_csstate,
>>       }
>>       sum = csum_shift(csstate->csum, csstate->off);
>> -    iterate_and_advance(i, bytes, base, len, off, ({
>> -        next = csum_and_copy_to_user(addr + off, base, len);
>> -        sum = csum_block_add(sum, next, off);
>> -        next ? 0 : len;
>> -    }), ({
>> -        sum = csum_and_memcpy(base, addr + off, len, sum, off);
>> -    })
>> -    )
>> +
>> +    bytes = iterate_and_advance2(i, bytes, (void *)addr, &sum,
>> +                     copy_to_user_iter_csum,
>> +                     memcpy_to_iter_csum);
>>       csstate->csum = csum_shift(sum, csstate->off);
>>       csstate->off += bytes;
>>       return bytes;
>>
>>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [bug report] dead loop in generic_perform_write() //Re: [PATCH v7 07/12] iov_iter: Convert iterate*() to inline funcs
  2024-02-18  3:13   ` [bug report] dead loop in generic_perform_write() //Re: " Tong Tiangen
  2024-02-27 12:43     ` Tong Tiangen
@ 2024-02-28 21:21     ` Linus Torvalds
  2024-02-28 22:57       ` Linus Torvalds
  1 sibling, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2024-02-28 21:21 UTC (permalink / raw)
  To: Tong Tiangen
  Cc: David Howells, Jens Axboe, Al Viro, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel,
	Kefeng Wang

On Sat, 17 Feb 2024 at 19:13, Tong Tiangen <tongtiangen@huawei.com> wrote:
>
> After this patch:
>    copy_page_from_iter_atomic()
>      -> iterate_and_advance2()
>        -> iterate_bvec()
>          -> remain = step()
>
> With CONFIG_ARCH_HAS_COPY_MC, the step() is copy_mc_to_kernel(), which
> returns the number of bytes not copied.
>
> When a memory error occurs during step(), the value of "remain" equals
> the value of "part" (not a single byte is copied successfully). In this
> case, iterate_bvec() returns 0, and copy_page_from_iter_atomic() also
> returns 0. The callback shmem_write_end()[2] then returns 0. Finally,
> generic_perform_write() takes the "goto again"[3] and the loop restarts.
> Neither [4] nor [5] can catch this and exit the loop, so a dead loop
> occurs.

Hmm. If the copy doesn't succeed and make any progress at all, then
the code in generic_perform_write() after the "goto again"

                //[4]
                if (unlikely(fault_in_iov_iter_readable(i, bytes) ==
                              bytes)) {
                        status = -EFAULT;
                        break;
                }

should break out of the loop.

So either your analysis looks a bit flawed, or I'm missing something.
Likely I'm missing something really obvious.

Why does the copy_mc_to_kernel() fail, but the
fault_in_iov_iter_readable() succeeds?

              Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [bug report] dead loop in generic_perform_write() //Re: [PATCH v7 07/12] iov_iter: Convert iterate*() to inline funcs
  2024-02-28 21:21     ` Linus Torvalds
@ 2024-02-28 22:57       ` Linus Torvalds
  2024-02-29  8:13         ` Tong Tiangen
  0 siblings, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2024-02-28 22:57 UTC (permalink / raw)
  To: Tong Tiangen
  Cc: David Howells, Jens Axboe, Al Viro, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel,
	Kefeng Wang

On Wed, 28 Feb 2024 at 13:21, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Hmm. If the copy doesn't succeed and make any progress at all, then
> the code in generic_perform_write() after the "goto again"
>
>                 //[4]
>                 if (unlikely(fault_in_iov_iter_readable(i, bytes) ==
>                               bytes)) {
>
> should break out of the loop.

Ahh. I see the problem. Or at least part of it.

The iter is an ITER_BVEC.

And fault_in_iov_iter_readable() "knows" that an ITER_BVEC cannot
fail. Because obviously it's a kernel address, so no user page fault.

But for the machine check case, ITER_BVEC very much can fail.
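
For reference, fault_in_iov_iter_readable() looks roughly like this
(abridged from lib/iov_iter.c, with the iovec walk elided; the point is
the unconditional "return 0" for anything that isn't user-backed):

    size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size)
    {
        if (iter_is_ubuf(i)) {
            size_t n = min(size, iov_iter_count(i));

            n -= fault_in_readable(i->ubuf + i->iov_offset, n);
            return size - n;
        } else if (iter_is_iovec(i)) {
            /* ... fault in each user segment, stop at first failure ... */
        }
        /* ITER_BVEC, ITER_KVEC, ITER_XARRAY: assumed unable to fault */
        return 0;
    }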

This should never have worked in the first place.

What a crock.

Do we need to make iterate_bvec() always succeed fully, and make
copy_mc_to_kernel() zero out the end?

                   Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [bug report] dead loop in generic_perform_write() //Re: [PATCH v7 07/12] iov_iter: Convert iterate*() to inline funcs
  2024-02-28 22:57       ` Linus Torvalds
@ 2024-02-29  8:13         ` Tong Tiangen
  2024-02-29 17:32           ` Linus Torvalds
  0 siblings, 1 reply; 36+ messages in thread
From: Tong Tiangen @ 2024-02-29  8:13 UTC (permalink / raw)
  To: Linus Torvalds, Alexander Viro
  Cc: David Howells, Jens Axboe, Al Viro, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel,
	Kefeng Wang



On 2024/2/29 6:57, Linus Torvalds wrote:
> On Wed, 28 Feb 2024 at 13:21, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Hmm. If the copy doesn't succeed and make any progress at all, then
>> the code in generic_perform_write() after the "goto again"
>>
>>                  //[4]
>>                  if (unlikely(fault_in_iov_iter_readable(i, bytes) ==
>>                                bytes)) {
>>
>> should break out of the loop.
> 
> Ahh. I see the problem. Or at least part of it.
> 
> The iter is an ITER_BVEC.
> 
> And fault_in_iov_iter_readable() "knows" that an ITER_BVEC cannot
> fail. Because obviously it's a kernel address, so no user page fault.
> 
> But for the machine check case, ITER_BVEC very much can fail.
> 
> This should never have worked in the first place.
> 
> What a crock.
> 
> Do we need to make iterate_bvec() always succeed fully, and make
> copy_mc_to_kernel() zero out the end?
> 
>                     Linus
> .

Hi Linus:

Looking at the logic before this patch, success (((void)(K),0)) was
always returned for the three types ITER_BVEC, ITER_KVEC and ITER_XARRAY.

-------------------------------------------------------------------
   -#define __iterate_and_advance(i, n, base, len, off, I, K) {	\
   -	if (unlikely(i->count < n))				\
   -		n = i->count;					\
   -	if (likely(n)) {					\
   -		if (likely(iter_is_ubuf(i))) {			\
   			[...]					\
   -			iterate_buf(i, n, base, len, off,	\
   -						i->ubuf, (I)) 	\
   -		} else if (likely(iter_is_iovec(i))) {		\
			[...]					\
   -			iterate_iovec(i, n, base, len, off,	\
   -						iov, (I))	\
   -			i->nr_segs -= iov - iter_iov(i);	\
   -			i->__iov = iov;				\
   -		} else if (iov_iter_is_bvec(i)) {		\
			[...]					\
   -			iterate_bvec(i, n, base, len, off,	\
   -						bvec, (K))	\
   -			i->nr_segs -= bvec - i->bvec;		\
   -			i->bvec = bvec;				\
   -		} else if (iov_iter_is_kvec(i)) {		\
			[...]					\
   -			iterate_iovec(i, n, base, len, off,	\
   -						kvec, (K))	\
			[...]					\
   -		} else if (iov_iter_is_xarray(i)) {		\
			[...]					\
   -			iterate_xarray(i, n, base, len, off,	\
   -							(K))	\
   -		}						\
   -		i->count -= n;					\
   -	}							\
   -}
   -#define iterate_and_advance(i, n, base, len, off, I, K) \
   -	__iterate_and_advance(i, n, base, len, off, I, ((void)(K),0))
-------------------------------------------------------------------

Maybe we should just fix it back, as follows:
-------------------------------------------------------------------
   --- a/include/linux/iov_iter.h
   +++ b/include/linux/iov_iter.h
   @@ -246,11 +246,11 @@ size_t iterate_and_advance2(struct iov_iter *iter, size_t len, void *priv,
           if (likely(iter_is_iovec(iter)))
                   return iterate_iovec(iter, len, priv, priv2, ustep);
           if (iov_iter_is_bvec(iter))
   -               return iterate_bvec(iter, len, priv, priv2, step);
   +               return iterate_bvec(iter, len, priv, priv2, ((void *)step, 0));
           if (iov_iter_is_kvec(iter))
   -               return iterate_kvec(iter, len, priv, priv2, step);
   +               return iterate_kvec(iter, len, priv, priv2, ((void *)step, 0));
           if (iov_iter_is_xarray(iter))
   -               return iterate_xarray(iter, len, priv, priv2, step);
   +               return iterate_xarray(iter, len, priv, priv2, ((void *)step, 0));
           return iterate_discard(iter, len, priv, priv2, step);
    }

   diff --git a/lib/iov_iter.c b/lib/iov_iter.c
   index e0aa6b440ca5..fabd5b1b97c7 100644
   --- a/lib/iov_iter.c
   +++ b/lib/iov_iter.c
   @@ -257,7 +257,7 @@ static size_t __copy_from_iter_mc(void *addr, size_t bytes, struct iov_iter *i)
                   bytes = i->count;
           if (unlikely(!bytes))
                   return 0;
   -       return iterate_bvec(i, bytes, addr, NULL, memcpy_from_iter_mc);
   +       return iterate_bvec(i, bytes, addr, NULL, ((void 
*)memcpy_from_iter_mc, 0));
    }

    static __always_inline
-------------------------------------------------------------------

Hi Alexander (maintainer), what do you think? :)

Thanks,
Tong.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [bug report] dead loop in generic_perform_write() //Re: [PATCH v7 07/12] iov_iter: Convert iterate*() to inline funcs
  2024-02-29  8:13         ` Tong Tiangen
@ 2024-02-29 17:32           ` Linus Torvalds
  2024-03-01  2:13             ` Tong Tiangen
                               ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Linus Torvalds @ 2024-02-29 17:32 UTC (permalink / raw)
  To: Tong Tiangen, Al Viro
  Cc: David Howells, Jens Axboe, Christoph Hellwig, Christian Brauner,
	David Laight, Matthew Wilcox, Jeff Layton, linux-fsdevel,
	linux-block, linux-mm, netdev, linux-kernel, Kefeng Wang

[-- Attachment #1: Type: text/plain, Size: 1794 bytes --]

On Thu, 29 Feb 2024 at 00:13, Tong Tiangen <tongtiangen@huawei.com> wrote:
>
> See the logic before this patch: always-success ("((void)(K),0)") is
> returned for three types: ITER_BVEC, ITER_KVEC and ITER_XARRAY.

No, look closer.

Yes, the iterate_and_advance() macro does that "((void)(K),0)" to make
the compiler generate better code for those cases (because then the
compiler can see that the return value is a compile-time zero), but
notice how _copy_mc_to_iter() didn't use that macro back then. It used
the unvarnished __iterate_and_advance() exactly so that the MC copy
case would *not* get that "always return zero" behavior.
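
To make the trick concrete, here it is in isolation (illustrative macro
names, not the kernel's):

	/*
	 * Hypothetical illustration: "(void)(K)" still evaluates K, so it
	 * type-checks and any side effects happen, but the comma operator
	 * makes the whole expression a literal 0.  The caller's short-copy
	 * check then becomes provably-dead code the compiler can delete.
	 */
	#define STEP_CANNOT_FAIL(K)	((void)(K), 0)	/* forced "copied it all" */
	#define STEP_CAN_FAIL(K)	(K)		/* short copies propagate */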

That goes back to (in a different form) at least commit 1b4fb5ffd79b
("iov_iter: teach iterate_{bvec,xarray}() about possible short
copies").

But hardly anybody ever tests this machine-check special case code, so
who knows when it broke again.

I'm just looking at the source code, and with all the macro games it's
*really* hard to follow, so I may well be missing something.

> Maybe we should just fix it back, as follows:

No. We could do it for the kvec and xarray case, just to get better
code generation again (not that I looked at it, so who knows), but the
one case that actually uses memcpy_from_iter_mc() needs to react to a
short write.

One option might be to make a failed memcpy_from_iter_mc() set another
flag in the iter, and then make fault_in_iov_iter_readable() test that
flag and return 'len' if that flag is set.

Something like that (wild handwaving) should get the right error handling.
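
A rough sketch of that flag idea, to make the handwaving slightly more
concrete (the mc_failed field is hypothetical, and the iter would have
to be threaded through to the step function, assumed here to arrive via
priv2):

	static __always_inline
	size_t memcpy_from_iter_mc(void *iter_from, size_t progress,
				   size_t len, void *to, void *priv2)
	{
		struct iov_iter *i = priv2;	/* assumption: priv2 carries the iter */
		size_t n = copy_mc_to_kernel(to + progress, iter_from, len);

		if (unlikely(n))
			i->mc_failed = true;	/* hypothetical new field */
		return n;
	}

	size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t len)
	{
		if (unlikely(i->mc_failed))	/* hypothetical */
			return len;	/* report nothing readable, so callers bail */
		/* ... existing user-fault handling ... */
	}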

The simpler alternative is maybe something like the attached.
COMPLETELY UNTESTED. Maybe I've confused myself with all the different
indirection mazes in the iov_iter code.

                     Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 618 bytes --]

 lib/iov_iter.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index e0aa6b440ca5..5236c16734e0 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -248,7 +248,10 @@ static __always_inline
 size_t memcpy_from_iter_mc(void *iter_from, size_t progress,
 			   size_t len, void *to, void *priv2)
 {
-	return copy_mc_to_kernel(to + progress, iter_from, len);
+	size_t n = copy_mc_to_kernel(to + progress, iter_from, len);
+	if (n)
+		memset(to + progress - n, 0, n);
+	return 0;
 }
 
 static size_t __copy_from_iter_mc(void *addr, size_t bytes, struct iov_iter *i)

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [bug report] dead loop in generic_perform_write() //Re: [PATCH v7 07/12] iov_iter: Convert iterate*() to inline funcs
  2024-02-29 17:32           ` Linus Torvalds
@ 2024-03-01  2:13             ` Tong Tiangen
  2024-03-02  2:59             ` Linus Torvalds
  2024-03-04 11:56             ` David Howells
  2 siblings, 0 replies; 36+ messages in thread
From: Tong Tiangen @ 2024-03-01  2:13 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro
  Cc: David Howells, Jens Axboe, Christoph Hellwig, Christian Brauner,
	David Laight, Matthew Wilcox, Jeff Layton, linux-fsdevel,
	linux-block, linux-mm, netdev, linux-kernel, Kefeng Wang



On 2024/3/1 1:32, Linus Torvalds wrote:
> On Thu, 29 Feb 2024 at 00:13, Tong Tiangen <tongtiangen@huawei.com> wrote:
>>
>> See the logic before this patch: always-success ("((void)(K),0)") is
>> returned for three types: ITER_BVEC, ITER_KVEC and ITER_XARRAY.
> 
> No, look closer.
> 
> Yes, the iterate_and_advance() macro does that "((void)(K),0)" to make
> the compiler generate better code for those cases (because then the
> compiler can see that the return value is a compile-time zero), but
> notice how _copy_mc_to_iter() didn't use that macro back then. It used
> the unvarnished __iterate_and_advance() exactly so that the MC copy
> case would *not* get that "always return zero" behavior.
> 
> That goes back to (in a different form) at least commit 1b4fb5ffd79b
> ("iov_iter: teach iterate_{bvec,xarray}() about possible short
> copies").
> 
> But hardly anybody ever tests this machine-check special case code, so
> who knows when it broke again.
> 
> I'm just looking at the source code, and with all the macro games it's
> *really* hard to follow, so I may well be missing something.
> 
>> Maybe we should just fix it back, as follows:
> 
> No. We could do it for the kvec and xarray case, just to get better
> code generation again (not that I looked at it, so who knows), but the
> one case that actually uses memcpy_from_iter_mc() needs to react to a
> short write.
> 
> One option might be to make a failed memcpy_from_iter_mc() set another
> flag in the iter, and then make fault_in_iov_iter_readable() test that
> flag and return 'len' if that flag is set.
> 
> Something like that (wild handwaving) should get the right error handling.
> 
> The simpler alternative is maybe something like the attached.
> COMPLETELY UNTESTED. Maybe I've confused myself with all the different
> indirection mazes in the iov_iter code.
> 
>                       Linus

Hi Linus:

I have tested the method in the attachment before; it is feasible and
solves this dead loop problem. I am also somewhat confused by the
iov_iter code, so let's see the maintainer's comments on whether there
are more comprehensive considerations.

Thanks,
Tong.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [bug report] dead loop in generic_perform_write() //Re: [PATCH v7 07/12] iov_iter: Convert iterate*() to inline funcs
  2024-02-29 17:32           ` Linus Torvalds
  2024-03-01  2:13             ` Tong Tiangen
@ 2024-03-02  2:59             ` Linus Torvalds
  2024-03-02  9:37               ` Tong Tiangen
  2024-03-04 11:56             ` David Howells
  2 siblings, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2024-03-02  2:59 UTC (permalink / raw)
  To: Tong Tiangen, Al Viro
  Cc: David Howells, Jens Axboe, Christoph Hellwig, Christian Brauner,
	David Laight, Matthew Wilcox, Jeff Layton, linux-fsdevel,
	linux-block, linux-mm, netdev, linux-kernel, Kefeng Wang

[-- Attachment #1: Type: text/plain, Size: 1600 bytes --]

On Thu, 29 Feb 2024 at 09:32, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> One option might be to make a failed memcpy_from_iter_mc() set another
> flag in the iter, and then make fault_in_iov_iter_readable() test that
> flag and return 'len' if that flag is set.
>
> Something like that (wild handwaving) should get the right error handling.
>
> The simpler alternative is maybe something like the attached.
> COMPLETELY UNTESTED. Maybe I've confused myself with all the different
> indirection mazes in the iov_iter code.

Actually, I think the right model is to get rid of that horrendous
.copy_mc field entirely.

We only have one single place that uses it - that nasty core dumping
code. And that code is *not* performance critical.

And not only isn't it performance-critical, it already does all the
core dumping one page at a time because it doesn't want to write pages
that were never mapped into user space.

So what we can do is

 (a) make the core dumping code *copy* the page to a good location
with copy_mc_to_kernel() first

 (b) remove this horrendous .copy_mc crap entirely from iov_iter

This is slightly complicated by the fact that copy_mc_to_kernel() may
not even exist, and architectures that don't have it don't want the
silly extra copy. So we need to abstract the "copy to temporary page"
code a bit. But that's probably a good thing anyway in that it forces
us to have nice interfaces.

End result: something like the attached.

AGAIN: THIS IS ENTIRELY UNTESTED.

But hey, so was clearly all the .copy_mc code too that this removes, so...

               Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 6155 bytes --]

 fs/coredump.c       | 41 ++++++++++++++++++++++++++++++++++++++---
 include/linux/uio.h | 16 ----------------
 lib/iov_iter.c      | 23 -----------------------
 3 files changed, 38 insertions(+), 42 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index f258c17c1841..6a9b9f3280d8 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -872,6 +872,9 @@ static int dump_emit_page(struct coredump_params *cprm, struct page *page)
 	loff_t pos;
 	ssize_t n;
 
+	if (!page)
+		return 0;
+
 	if (cprm->to_skip) {
 		if (!__dump_skip(cprm, cprm->to_skip))
 			return 0;
@@ -884,7 +887,6 @@ static int dump_emit_page(struct coredump_params *cprm, struct page *page)
 	pos = file->f_pos;
 	bvec_set_page(&bvec, page, PAGE_SIZE, 0);
 	iov_iter_bvec(&iter, ITER_SOURCE, &bvec, 1, PAGE_SIZE);
-	iov_iter_set_copy_mc(&iter);
 	n = __kernel_write_iter(cprm->file, &iter, &pos);
 	if (n != PAGE_SIZE)
 		return 0;
@@ -895,10 +897,40 @@ static int dump_emit_page(struct coredump_params *cprm, struct page *page)
 	return 1;
 }
 
+/*
+ * If we might get machine checks from kernel accesses during the
+ * core dump, let's get those errors early rather than during the
+ * IO. This is not performance-critical enough to warrant having
+ * all the machine check logic in the iovec paths.
+ */
+#ifdef copy_mc_to_kernel
+
+#define dump_page_alloc() alloc_page(GFP_KERNEL)
+#define dump_page_free(x) __free_page(x)
+static struct page *dump_page_copy(struct page *src, struct page *dst)
+{
+	void *buf = kmap_local_page(src);
+	size_t left = copy_mc_to_kernel(page_address(dst), buf, PAGE_SIZE);
+	kunmap_local(buf);
+	return left ? NULL : dst;
+}
+
+#else
+
+#define dump_page_alloc() ((struct page *)8) // Not NULL
+#define dump_page_free(x) do { } while (0)
+#define dump_page_copy(src,dst) ((dst),(src))
+
+#endif
+
 int dump_user_range(struct coredump_params *cprm, unsigned long start,
 		    unsigned long len)
 {
 	unsigned long addr;
+	struct page *dump_page = dump_page_alloc();
+
+	if (!dump_page)
+		return 0;
 
 	for (addr = start; addr < start + len; addr += PAGE_SIZE) {
 		struct page *page;
@@ -912,14 +944,17 @@ int dump_user_range(struct coredump_params *cprm, unsigned long start,
 		 */
 		page = get_dump_page(addr);
 		if (page) {
-			int stop = !dump_emit_page(cprm, page);
+			int stop = !dump_emit_page(cprm, dump_page_copy(page, dump_page));
 			put_page(page);
-			if (stop)
+			if (stop) {
+				dump_page_free(dump_page);
 				return 0;
+			}
 		} else {
 			dump_skip(cprm, PAGE_SIZE);
 		}
 	}
+	dump_page_free(dump_page);
 	return 1;
 }
 #endif
diff --git a/include/linux/uio.h b/include/linux/uio.h
index bea9c89922d9..00cebe2b70de 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -40,7 +40,6 @@ struct iov_iter_state {
 
 struct iov_iter {
 	u8 iter_type;
-	bool copy_mc;
 	bool nofault;
 	bool data_source;
 	size_t iov_offset;
@@ -248,22 +247,8 @@ size_t _copy_from_iter_flushcache(void *addr, size_t bytes, struct iov_iter *i);
 
 #ifdef CONFIG_ARCH_HAS_COPY_MC
 size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i);
-static inline void iov_iter_set_copy_mc(struct iov_iter *i)
-{
-	i->copy_mc = true;
-}
-
-static inline bool iov_iter_is_copy_mc(const struct iov_iter *i)
-{
-	return i->copy_mc;
-}
 #else
 #define _copy_mc_to_iter _copy_to_iter
-static inline void iov_iter_set_copy_mc(struct iov_iter *i) { }
-static inline bool iov_iter_is_copy_mc(const struct iov_iter *i)
-{
-	return false;
-}
 #endif
 
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
@@ -355,7 +340,6 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
 	WARN_ON(direction & ~(READ | WRITE));
 	*i = (struct iov_iter) {
 		.iter_type = ITER_UBUF,
-		.copy_mc = false,
 		.data_source = direction,
 		.ubuf = buf,
 		.count = count,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index e0aa6b440ca5..cf2eb2b2f983 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -166,7 +166,6 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
 	WARN_ON(direction & ~(READ | WRITE));
 	*i = (struct iov_iter) {
 		.iter_type = ITER_IOVEC,
-		.copy_mc = false,
 		.nofault = false,
 		.data_source = direction,
 		.__iov = iov,
@@ -244,27 +243,9 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 EXPORT_SYMBOL_GPL(_copy_mc_to_iter);
 #endif /* CONFIG_ARCH_HAS_COPY_MC */
 
-static __always_inline
-size_t memcpy_from_iter_mc(void *iter_from, size_t progress,
-			   size_t len, void *to, void *priv2)
-{
-	return copy_mc_to_kernel(to + progress, iter_from, len);
-}
-
-static size_t __copy_from_iter_mc(void *addr, size_t bytes, struct iov_iter *i)
-{
-	if (unlikely(i->count < bytes))
-		bytes = i->count;
-	if (unlikely(!bytes))
-		return 0;
-	return iterate_bvec(i, bytes, addr, NULL, memcpy_from_iter_mc);
-}
-
 static __always_inline
 size_t __copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 {
-	if (unlikely(iov_iter_is_copy_mc(i)))
-		return __copy_from_iter_mc(addr, bytes, i);
 	return iterate_and_advance(i, bytes, addr,
 				   copy_from_user_iter, memcpy_from_iter);
 }
@@ -633,7 +614,6 @@ void iov_iter_kvec(struct iov_iter *i, unsigned int direction,
 	WARN_ON(direction & ~(READ | WRITE));
 	*i = (struct iov_iter){
 		.iter_type = ITER_KVEC,
-		.copy_mc = false,
 		.data_source = direction,
 		.kvec = kvec,
 		.nr_segs = nr_segs,
@@ -650,7 +630,6 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int direction,
 	WARN_ON(direction & ~(READ | WRITE));
 	*i = (struct iov_iter){
 		.iter_type = ITER_BVEC,
-		.copy_mc = false,
 		.data_source = direction,
 		.bvec = bvec,
 		.nr_segs = nr_segs,
@@ -679,7 +658,6 @@ void iov_iter_xarray(struct iov_iter *i, unsigned int direction,
 	BUG_ON(direction & ~1);
 	*i = (struct iov_iter) {
 		.iter_type = ITER_XARRAY,
-		.copy_mc = false,
 		.data_source = direction,
 		.xarray = xarray,
 		.xarray_start = start,
@@ -703,7 +681,6 @@ void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count)
 	BUG_ON(direction != READ);
 	*i = (struct iov_iter){
 		.iter_type = ITER_DISCARD,
-		.copy_mc = false,
 		.data_source = false,
 		.count = count,
 		.iov_offset = 0

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [bug report] dead loop in generic_perform_write() //Re: [PATCH v7 07/12] iov_iter: Convert iterate*() to inline funcs
  2024-03-02  2:59             ` Linus Torvalds
@ 2024-03-02  9:37               ` Tong Tiangen
  2024-03-02 18:06                 ` Linus Torvalds
  0 siblings, 1 reply; 36+ messages in thread
From: Tong Tiangen @ 2024-03-02  9:37 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro
  Cc: David Howells, Jens Axboe, Christoph Hellwig, Christian Brauner,
	David Laight, Matthew Wilcox, Jeff Layton, linux-fsdevel,
	linux-block, linux-mm, netdev, linux-kernel, Kefeng Wang



On 2024/3/2 10:59, Linus Torvalds wrote:
> On Thu, 29 Feb 2024 at 09:32, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> One option might be to make a failed memcpy_from_iter_mc() set another
>> flag in the iter, and then make fault_in_iov_iter_readable() test that
>> flag and return 'len' if that flag is set.
>>
>> Something like that (wild handwaving) should get the right error handling.
>>
>> The simpler alternative is maybe something like the attached.
>> COMPLETELY UNTESTED. Maybe I've confused myself with all the different
>> indirection mazes in the iov_iter code.
> 
> Actually, I think the right model is to get rid of that horrendous
> .copy_mc field entirely.
> 
> We only have one single place that uses it - that nasty core dumping
> code. And that code is *not* performance critical.
> 
> And not only isn't it performance-critical, it already does all the
> core dumping one page at a time because it doesn't want to write pages
> that were never mapped into user space.
> 
> So what we can do is
> 
>   (a) make the core dumping code *copy* the page to a good location
> with copy_mc_to_kernel() first
> 
>   (b) remove this horrendous .copy_mc crap entirely from iov_iter

I think this solution has two impacts:
1. Although it is not a performance-critical path, CPU usage may be
affected by the extra memory copy in some large-memory applications.
2. If a hardware memory error occurs in the "good location" after
".copy_mc" is removed, the kernel will panic.

I would prefer to use the previous solution(modify the implementation of
memcpy_from_iter_mc()).

Thanks,
Tong.

> 
> This is slightly complicated by the fact that copy_mc_to_kernel() may
> not even exist, and architectures that don't have it don't want the
> silly extra copy. So we need to abstract the "copy to temporary page"
> code a bit. But that's probably a good thing anyway in that it forces
> us to have nice interfaces.
> 
> End result: something like the attached.
> 
> AGAIN: THIS IS ENTIRELY UNTESTED.
> 
> But hey, so was clearly all the .copy_mc code too that this removes, so...
> 
>                 Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [bug report] dead loop in generic_perform_write() //Re: [PATCH v7 07/12] iov_iter: Convert iterate*() to inline funcs
  2024-03-02  9:37               ` Tong Tiangen
@ 2024-03-02 18:06                 ` Linus Torvalds
  2024-03-02 18:11                   ` Linus Torvalds
  2024-03-04  8:45                   ` Tong Tiangen
  0 siblings, 2 replies; 36+ messages in thread
From: Linus Torvalds @ 2024-03-02 18:06 UTC (permalink / raw)
  To: Tong Tiangen
  Cc: Al Viro, David Howells, Jens Axboe, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel,
	Kefeng Wang

On Sat, 2 Mar 2024 at 01:37, Tong Tiangen <tongtiangen@huawei.com> wrote:
>
> I think this solution has two impacts:
> 1. Although it is not a performance-critical path, CPU usage may be
> affected by the extra memory copy in some large-memory applications.

Compared to the IO, the extra memory copy is a non-issue.

If anything, getting rid of the "copy_mc" flag removes extra code in a
much more important path (ie the normal iov_iter code).

> 2. If a hardware memory error occurs in the "good location" after
> ".copy_mc" is removed, the kernel will panic.

That's always true. We do not support non-recoverable machine checks
on kernel memory. Never have, and realistically probably never will.

In fact, as far as I know, the hardware that caused all this code in
the first place no longer exists, and never really made it to wide
production.

The machine checks in question happened on pmem, now killed by Intel.
It's possible that somebody wants to use it for something else, but
let's hope any future implementations are less broken than the
unbelievable sh*tshow that caused all this code in the first place.

The whole copy_mc_to_kernel() mess exists mainly due to broken pmem
devices along with old and broken CPU's that did not deal correctly
with machine checks inside the regular memory copy ('rep movs') code,
and caused hung machines.

IOW, notice how 'copy_mc_to_kernel()' just becomes a regular
'memcpy()' on fixed hardware, and how we have that disgusting
copy_mc_fragile_key that gets enabled for older CPU cores.

And yes, we then have copy_mc_enhanced_fast_string() which isn't
*that* disgusting, and that actually handles machine checks properly
on more modern hardware, but it's still very much "the hardware is
misdesigned, it has no testing, and nobody sane should depend on this"

In other words, it's the usual "Enterprise Hardware" situation. Looks
fancy on paper, costs an arm and a leg, and the reality is just sad,
sad, sad.

               Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [bug report] dead loop in generic_perform_write() //Re: [PATCH v7 07/12] iov_iter: Convert iterate*() to inline funcs
  2024-03-02 18:06                 ` Linus Torvalds
@ 2024-03-02 18:11                   ` Linus Torvalds
  2024-03-04  8:45                   ` Tong Tiangen
  1 sibling, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2024-03-02 18:11 UTC (permalink / raw)
  To: Tong Tiangen
  Cc: Al Viro, David Howells, Jens Axboe, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel,
	Kefeng Wang

On Sat, 2 Mar 2024 at 10:06, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> In other words, it's the usual "Enterprise Hardware" situation. Looks
> fancy on paper, costs an arm and a leg, and the reality is just sad,
> sad, sad.

Don't get me wrong. I'm sure large companies are more than willing to
sell other large companies very expensive support contracts and have
engineers that they fly out to deal with the problems all these
enterprise solutions have.

The problem *will* get fixed somehow, it's just going to cost you. A lot.

Because THAT is what Enterprise Hardware is all about.

                  Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [bug report] dead loop in generic_perform_write() //Re: [PATCH v7 07/12] iov_iter: Convert iterate*() to inline funcs
  2024-03-02 18:06                 ` Linus Torvalds
  2024-03-02 18:11                   ` Linus Torvalds
@ 2024-03-04  8:45                   ` Tong Tiangen
  1 sibling, 0 replies; 36+ messages in thread
From: Tong Tiangen @ 2024-03-04  8:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, David Howells, Jens Axboe, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel,
	Kefeng Wang



On 2024/3/3 2:06, Linus Torvalds wrote:
> On Sat, 2 Mar 2024 at 01:37, Tong Tiangen <tongtiangen@huawei.com> wrote:
>>
>> I think this solution has two impacts:
>> 1. Although it is not a performance-critical path, CPU usage may be
>> affected by the extra memory copy in some large-memory applications.
> 
> Compared to the IO, the extra memory copy is a non-issue.
> 
> If anything, getting rid of the "copy_mc" flag removes extra code in a
> much more important path (ie the normal iov_iter code).

Indeed. I'll test this solution. Theoretically, it should solve the problem.

> 
>> 2. If a hardware memory error occurs in the "good location" after
>> ".copy_mc" is removed, the kernel will panic.
> 
> That's always true. We do not support non-recoverable machine checks
> on kernel memory. Never have, and realistically probably never will.
>
> In fact, as far as I know, the hardware that caused all this code in
> the first place no longer exists, and never really made it to wide
> production.

Yes. There is a low probability that the newly allocated memory is faulty.

Thanks,
Tong.

> 
> The machine checks in question happened on pmem, now killed by Intel.
> It's possible that somebody wants to use it for something else, but
> let's hope any future implementations are less broken than the
> unbelievable sh*tshow that caused all this code in the first place.
> 
> The whole copy_mc_to_kernel() mess exists mainly due to broken pmem
> devices along with old and broken CPU's that did not deal correctly
> with machine checks inside the regular memory copy ('rep movs') code,
> and caused hung machines.
> 
> IOW, notice how 'copy_mc_to_kernel()' just becomes a regular
> 'memcpy()' on fixed hardware, and how we have that disgusting
> copy_mc_fragile_key that gets enabled for older CPU cores.
> 
> And yes, we then have copy_mc_enhanced_fast_string() which isn't
> *that* disgusting, and that actually handles machine checks properly
> on more modern hardware, but it's still very much "the hardware is
> misdesigned, it has no testing, and nobody sane should depend on this"
> 
> In other words, it's the usual "Enterprise Hardware" situation. Looks
> fancy on paper, costs an arm and a leg, and the reality is just sad,
> sad, sad.
> 
>                 Linus
> .

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [bug report] dead loop in generic_perform_write() //Re: [PATCH v7 07/12] iov_iter: Convert iterate*() to inline funcs
  2024-02-29 17:32           ` Linus Torvalds
  2024-03-01  2:13             ` Tong Tiangen
  2024-03-02  2:59             ` Linus Torvalds
@ 2024-03-04 11:56             ` David Howells
  2024-03-04 12:15               ` Tong Tiangen
  2024-03-04 18:32               ` Linus Torvalds
  2 siblings, 2 replies; 36+ messages in thread
From: David Howells @ 2024-03-04 11:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Tong Tiangen, Al Viro, Jens Axboe, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel,
	Kefeng Wang

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Actually, I think the right model is to get rid of that horrendous
> .copy_mc field entirely.
> 
> We only have one single place that uses it - that nasty core dumping
> code. And that code is *not* performance critical.
> 
> And not only isn't it performance-critical, it already does all the
> core dumping one page at a time because it doesn't want to write pages
> that were never mapped into user space.
> 
> So what we can do is
> 
>  (a) make the core dumping code *copy* the page to a good location
> with copy_mc_to_kernel() first
> 
>  (b) remove this horrendous .copy_mc crap entirely from iov_iter
> 
> This is slightly complicated by the fact that copy_mc_to_kernel() may
> not even exist, and architectures that don't have it don't want the
> silly extra copy. So we need to abstract the "copy to temporary page"
> code a bit. But that's probably a good thing anyway in that it forces
> us to have nice interfaces.
> 
> End result: something like the attached.
> 
> AGAIN: THIS IS ENTIRELY UNTESTED.
> 
> But hey, so was clearly all the .copy_mc code too that this removes, so...

I like it:-)

I've tested it by SIGQUIT'ing a number of processes and using gdb to examine
the coredumps - which seems to work - at least without the production of any
MCEs.  I'm not sure how I could test it with MCEs.

Feel free to add:

Reviewed-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>

That said, I wonder if:

	#ifdef copy_mc_to_kernel

should be:

	#ifdef CONFIG_ARCH_HAS_COPY_MC

and whether it's possible to find out dynamically if MCEs can occur at all.

David


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [bug report] dead loop in generic_perform_write() //Re: [PATCH v7 07/12] iov_iter: Convert iterate*() to inline funcs
  2024-03-04 11:56             ` David Howells
@ 2024-03-04 12:15               ` Tong Tiangen
  2024-03-04 18:32               ` Linus Torvalds
  1 sibling, 0 replies; 36+ messages in thread
From: Tong Tiangen @ 2024-03-04 12:15 UTC (permalink / raw)
  To: David Howells, Linus Torvalds
  Cc: Al Viro, Jens Axboe, Christoph Hellwig, Christian Brauner,
	David Laight, Matthew Wilcox, Jeff Layton, linux-fsdevel,
	linux-block, linux-mm, netdev, linux-kernel, Kefeng Wang



On 2024/3/4 19:56, David Howells wrote:
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> Actually, I think the right model is to get rid of that horrendous
>> .copy_mc field entirely.
>>
>> We only have one single place that uses it - that nasty core dumping
>> code. And that code is *not* performance critical.
>>
>> And not only isn't it performance-critical, it already does all the
>> core dumping one page at a time because it doesn't want to write pages
>> that were never mapped into user space.
>>
>> So what we can do is
>>
>>   (a) make the core dumping code *copy* the page to a good location
>> with copy_mc_to_kernel() first
>>
>>   (b) remove this horrendous .copy_mc crap entirely from iov_iter
>>
>> This is slightly complicated by the fact that copy_mc_to_kernel() may
>> not even exist, and architectures that don't have it don't want the
>> silly extra copy. So we need to abstract the "copy to temporary page"
>> code a bit. But that's probably a good thing anyway in that it forces
>> us to have nice interfaces.
>>
>> End result: something like the attached.
>>
>> AGAIN: THIS IS ENTIRELY UNTESTED.
>>
>> But hey, so was clearly all the .copy_mc code too that this removes, so...
> 
> I like it:-)
> 
> I've tested it by SIGQUIT'ing a number of processes and using gdb to examine
> the coredumps - which seems to work - at least without the production of any
> MCEs.  I'm not sure how I could test it with MCEs.

I'm going to test the coredump path with MCE injection.

> 
> Feel free to add:
> 
> Reviewed-by: David Howells <dhowells@redhat.com>
> Tested-by: David Howells <dhowells@redhat.com>
> 
> That said, I wonder if:
> 
> 	#ifdef copy_mc_to_kernel
> 
> should be:
> 
> 	#ifdef CONFIG_ARCH_HAS_COPY_MC
> 
> and whether it's possible to find out dynamically if MCEs can occur at all.

An MCE can occur at any time while a page is in use, so I think it
occurs dynamically.

Thanks,
Tong

> 
> David
> 
> .

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [bug report] dead loop in generic_perform_write() //Re: [PATCH v7 07/12] iov_iter: Convert iterate*() to inline funcs
  2024-03-04 11:56             ` David Howells
  2024-03-04 12:15               ` Tong Tiangen
@ 2024-03-04 18:32               ` Linus Torvalds
  2024-03-05  6:57                 ` Tong Tiangen
  1 sibling, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2024-03-04 18:32 UTC (permalink / raw)
  To: David Howells
  Cc: Tong Tiangen, Al Viro, Jens Axboe, Christoph Hellwig,
	Christian Brauner, David Laight, Matthew Wilcox, Jeff Layton,
	linux-fsdevel, linux-block, linux-mm, netdev, linux-kernel,
	Kefeng Wang

On Mon, 4 Mar 2024 at 03:56, David Howells <dhowells@redhat.com> wrote:
>
> That said, I wonder if:
>
>         #ifdef copy_mc_to_kernel
>
> should be:
>
>         #ifdef CONFIG_ARCH_HAS_COPY_MC

Hmm. Maybe. We do have that

  #ifdef copy_mc_to_kernel

pattern already in <linux/uaccess.h>, so clearly we've done it both ways.
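
For reference, the fallback in <linux/uaccess.h> when no architecture
override is defined looks roughly like this (paraphrased; check the
tree for the exact text):

	#ifndef copy_mc_to_kernel
	static inline unsigned long __must_check
	copy_mc_to_kernel(void *dst, const void *src, size_t cnt)
	{
		memcpy(dst, src, cnt);
		return 0;	/* nothing left uncopied */
	}
	#endif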

I personally like the "just test for the thing you are using" model,
which is then why I did it that way, but I don't have hugely strong
opinions on it.

> and whether it's possible to find out dynamically if MCEs can occur at all.

I really wanted to do something like that, and look at the source page
to decide "is this a pmem page that can cause machine checks", but I
didn't find any obvious way to do that.

Improvement suggestions more than welcome.
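
The obvious-looking predicate would be something along these lines, but
it is knowingly wrong in both directions (not every ZONE_DEVICE page is
pmem, and ordinary RAM can be poisoned too), which is presumably why no
such helper exists:

	/* hypothetical, and knowingly inadequate */
	static bool page_may_machine_check(const struct page *page)
	{
		return is_zone_device_page(page);
	}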

               Linus

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [bug report] dead loop in generic_perform_write() //Re: [PATCH v7 07/12] iov_iter: Convert iterate*() to inline funcs
  2024-03-04 18:32               ` Linus Torvalds
@ 2024-03-05  6:57                 ` Tong Tiangen
  0 siblings, 0 replies; 36+ messages in thread
From: Tong Tiangen @ 2024-03-05  6:57 UTC (permalink / raw)
  To: Linus Torvalds, David Howells
  Cc: Al Viro, Jens Axboe, Christoph Hellwig, Christian Brauner,
	David Laight, Matthew Wilcox, Jeff Layton, linux-fsdevel,
	linux-block, linux-mm, netdev, linux-kernel, Kefeng Wang



On 2024/3/5 2:32, Linus Torvalds wrote:
> On Mon, 4 Mar 2024 at 03:56, David Howells <dhowells@redhat.com> wrote:
>>
>> That said, I wonder if:
>>
>>          #ifdef copy_mc_to_kernel
>>
>> should be:
>>
>>          #ifdef CONFIG_ARCH_HAS_COPY_MC
> 
> Hmm. Maybe. We do have that
> 
>    #ifdef copy_mc_to_kernel
> 
> pattern already in <linux/uaccess.h>, so clearly we've done it both ways.
> 
> I personally like the "just test for the thing you are using" model,
> which is then why I did it that way, but I don't have hugely strong
> opinions on it.
> 
>> and whether it's possible to find out dynamically if MCEs can occur at all.
> 
> I really wanted to do something like that, and look at the source page
> to decide "is this a pmem page that can cause machine checks", but I
> didn't find any obvious way to do that.
> 
> Improvement suggestions more than welcome.

I used EINJ to simulate a hardware memory error and tested this on an
ARM64 server. This solution solves the coredump dead loop problem.
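
For anyone else wanting to reproduce this, the injection goes through
the ACPI EINJ debugfs interface, roughly as follows (error-type support
depends on the firmware, and <paddr> is a placeholder for a physical
address backing one of the dumped pages):

	cd /sys/kernel/debug/apei/einj
	cat available_error_type		# confirm 0x10 is offered
	echo 0x10 > error_type			# memory uncorrectable, non-fatal
	echo 0x<paddr> > param1			# physical address to poison
	echo 0xfffffffffffff000 > param2	# address mask, one page
	echo 1 > error_inject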

I'll sort it out and send the patch.

Thanks,
Tong.

> 
>                 Linus
> .

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2024-03-05  6:57 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-25 12:02 [PATCH v7 00/12] iov_iter: Convert the iterator macros into inline funcs David Howells
2023-09-25 12:02 ` [PATCH v7 01/12] iov_iter: Remove last_offset from iov_iter as it was for ITER_PIPE David Howells
2023-09-25 12:02 ` [PATCH v7 02/12] iov_iter, x86: Be consistent about the __user tag on copy_mc_to_user() David Howells
2023-09-28 14:47   ` Borislav Petkov
2023-09-25 12:03 ` [PATCH v7 03/12] sound: Fix snd_pcm_readv()/writev() to use iov access functions David Howells
2023-09-25 12:03 ` [PATCH v7 04/12] infiniband: Use user_backed_iter() to see if iterator is UBUF/IOVEC David Howells
2023-09-25 12:03 ` [PATCH v7 05/12] iov_iter: Renumber ITER_* constants David Howells
2023-09-25 12:03 ` [PATCH v7 06/12] iov_iter: Derive user-backedness from the iterator type David Howells
2023-09-25 12:03 ` [PATCH v7 07/12] iov_iter: Convert iterate*() to inline funcs David Howells
2024-02-18  3:13   ` [bug report] dead loop in generic_perform_write() //Re: " Tong Tiangen
2024-02-27 12:43     ` Tong Tiangen
2024-02-28 21:21     ` Linus Torvalds
2024-02-28 22:57       ` Linus Torvalds
2024-02-29  8:13         ` Tong Tiangen
2024-02-29 17:32           ` Linus Torvalds
2024-03-01  2:13             ` Tong Tiangen
2024-03-02  2:59             ` Linus Torvalds
2024-03-02  9:37               ` Tong Tiangen
2024-03-02 18:06                 ` Linus Torvalds
2024-03-02 18:11                   ` Linus Torvalds
2024-03-04  8:45                   ` Tong Tiangen
2024-03-04 11:56             ` David Howells
2024-03-04 12:15               ` Tong Tiangen
2024-03-04 18:32               ` Linus Torvalds
2024-03-05  6:57                 ` Tong Tiangen
2023-09-25 12:03 ` [PATCH v7 08/12] iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc() David Howells
2023-09-25 12:03 ` [PATCH v7 09/12] iov_iter, net: Move csum_and_copy_to/from_iter() to net/ David Howells
2023-09-25 12:03 ` [PATCH v7 10/12] iov_iter, net: Fold in csum_and_memcpy() David Howells
2023-09-25 12:03 ` [PATCH v7 11/12] iov_iter, net: Merge csum_and_copy_from_iter{,_full}() together David Howells
2023-09-25 12:03 ` [PATCH v7 12/12] iov_iter, net: Move hash_and_copy_to_iter() to net/ David Howells
2023-09-25 12:34 ` [PATCH v7 00/12] iov_iter: Convert the iterator macros into inline funcs Christian Brauner
2023-10-02  9:25 ` [PATCH v7 08/12] iov_iter: Don't deal with iter->copy_mc in memcpy_from_iter_mc() David Howells
2023-10-02  9:25   ` David Howells
2023-10-07  4:32   ` [PATCH next] iov_iter: fix copy_page_from_iter_atomic() Hugh Dickins
2023-10-07  7:29   ` David Howells
2023-10-09  7:36     ` Christian Brauner
