linux-kernel.vger.kernel.org archive mirror
* [PATCH v7 0/4] convert read_kcore(), vread() to use iterators
@ 2023-03-22 18:57 Lorenzo Stoakes
  2023-03-22 18:57 ` [PATCH v7 1/4] fs/proc/kcore: avoid bounce buffer for ktext data Lorenzo Stoakes
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Lorenzo Stoakes @ 2023-03-22 18:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel, Andrew Morton
  Cc: Baoquan He, Uladzislau Rezki, Matthew Wilcox, David Hildenbrand,
	Liu Shixin, Jiri Olsa, Jens Axboe, Alexander Viro,
	Lorenzo Stoakes

While reviewing Baoquan's recent changes to permit vread() access to
vm_map_ram regions of vmalloc allocations, Willy pointed out [1] that it
would be nice to refactor vread() as a whole, since its only user is
read_kcore() and the existing form of vread() necessitates the use of a
bounce buffer.

This patch series does exactly that, as well as adjusting how we read the
kernel text section so that a bounce buffer is avoided there too.

This has been tested against the test case which motivated Baoquan's
changes in the first place [2], which continues to function correctly, as
do the vmalloc self tests.

[1] https://lore.kernel.org/all/Y8WfDSRkc%2FOHP3oD@casper.infradead.org/
[2] https://lore.kernel.org/all/87ilk6gos2.fsf@oracle.com/T/#u
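
For reference, the user-visible API change (quoted verbatim from patch 4's
change to include/linux/vmalloc.h) is:

-extern long vread(char *buf, char *addr, unsigned long count);
+extern long vread_iter(struct iov_iter *iter, const char *addr, size_t count);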

v7:
- Keep trying to fault in memory until the vmalloc read operation
  completes.

v6:
- Correct copy_page_to_iter_nofault() to handle -EFAULT case correctly.
https://lore.kernel.org/all/cover.1679496827.git.lstoakes@gmail.com/

v5:
- Do not rename fpos to ppos in read_kcore_iter() to avoid churn.
- Fix incorrect commit messages after prior revisions altered the approach.
- Replace copy_page_to_iter_atomic() with copy_page_to_iter_nofault() and
  adjust it to be able to handle compound pages. This uses
  copy_to_user_nofault() which ensures page faults are disabled during copy
  which kmap_local_page() was not doing.
- Only try to fault in pages if we are unable to copy in the first place
  and try only once to avoid any risk of spinning.
- Do not zero memory in aligned_vread_iter() if we couldn't copy it.
- Fix mistake in zeroing missing or unpopulated blocks in
  vmap_ram_vread_iter().
https://lore.kernel.org/linux-mm/cover.1679494218.git.lstoakes@gmail.com/

v4:
- Fixup mistake in email client which orphaned patch emails from the
  cover letter.
https://lore.kernel.org/all/cover.1679431886.git.lstoakes@gmail.com

v3:
- Revert introduction of mutex/rwsem in vmalloc
- Introduce copy_page_to_iter_atomic() iovec function
- Update vread_iter() and descendent functions to use only this
- Fault in user pages before calling vread_iter()
- Use const char* in vread_iter() and descendent functions
- Updated commit messages based on feedback
- Extend vread functions to always check how many bytes we could copy. If
  at any stage we are unable to copy/zero, abort and return the number of
  bytes we did copy.
https://lore.kernel.org/all/cover.1679354384.git.lstoakes@gmail.com/

v2:
- Fix ordering of vread_iter() parameters
- Fix nommu vread() -> vread_iter()
https://lore.kernel.org/all/cover.1679209395.git.lstoakes@gmail.com/

v1:
https://lore.kernel.org/all/cover.1679183626.git.lstoakes@gmail.com/

Lorenzo Stoakes (4):
  fs/proc/kcore: avoid bounce buffer for ktext data
  fs/proc/kcore: convert read_kcore() to read_kcore_iter()
  iov_iter: add copy_page_to_iter_nofault()
  mm: vmalloc: convert vread() to vread_iter()

 fs/proc/kcore.c         |  85 +++++++--------
 include/linux/uio.h     |   2 +
 include/linux/vmalloc.h |   3 +-
 lib/iov_iter.c          |  48 +++++++++
 mm/nommu.c              |  10 +-
 mm/vmalloc.c            | 234 +++++++++++++++++++++++++---------------
 6 files changed, 243 insertions(+), 139 deletions(-)

--
2.39.2


* [PATCH v7 1/4] fs/proc/kcore: avoid bounce buffer for ktext data
  2023-03-22 18:57 [PATCH v7 0/4] convert read_kcore(), vread() to use iterators Lorenzo Stoakes
@ 2023-03-22 18:57 ` Lorenzo Stoakes
  2023-03-22 18:57 ` [PATCH v7 2/4] fs/proc/kcore: convert read_kcore() to read_kcore_iter() Lorenzo Stoakes
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Lorenzo Stoakes @ 2023-03-22 18:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel, Andrew Morton
  Cc: Baoquan He, Uladzislau Rezki, Matthew Wilcox, David Hildenbrand,
	Liu Shixin, Jiri Olsa, Jens Axboe, Alexander Viro,
	Lorenzo Stoakes

Commit df04abfd181a ("fs/proc/kcore.c: Add bounce buffer for ktext data")
introduced the use of a bounce buffer to retrieve kernel text data for
/proc/kcore in order to avoid failures arising from hardened user copies
enabled by CONFIG_HARDENED_USERCOPY in check_kernel_text_object().

We can avoid doing this if instead of copy_to_user() we use _copy_to_user()
which bypasses the hardening check. This is more efficient than using a
bounce buffer and simplifies the code.

We do so as part of an overall effort to eliminate bounce buffer usage in
the function, with an eye to converting it to an iterator read.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 fs/proc/kcore.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 71157ee35c1a..556f310d6aa4 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -541,19 +541,12 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 		case KCORE_VMEMMAP:
 		case KCORE_TEXT:
 			/*
-			 * Using bounce buffer to bypass the
-			 * hardened user copy kernel text checks.
+			 * We use _copy_to_user() to bypass usermode hardening
+			 * which would otherwise prevent this operation.
 			 */
-			if (copy_from_kernel_nofault(buf, (void *)start, tsz)) {
-				if (clear_user(buffer, tsz)) {
-					ret = -EFAULT;
-					goto out;
-				}
-			} else {
-				if (copy_to_user(buffer, buf, tsz)) {
-					ret = -EFAULT;
-					goto out;
-				}
+			if (_copy_to_user(buffer, (char *)start, tsz)) {
+				ret = -EFAULT;
+				goto out;
 			}
 			break;
 		default:
-- 
2.39.2



* [PATCH v7 2/4] fs/proc/kcore: convert read_kcore() to read_kcore_iter()
  2023-03-22 18:57 [PATCH v7 0/4] convert read_kcore(), vread() to use iterators Lorenzo Stoakes
  2023-03-22 18:57 ` [PATCH v7 1/4] fs/proc/kcore: avoid bounce buffer for ktext data Lorenzo Stoakes
@ 2023-03-22 18:57 ` Lorenzo Stoakes
  2023-03-22 18:57 ` [PATCH v7 3/4] iov_iter: add copy_page_to_iter_nofault() Lorenzo Stoakes
  2023-03-22 18:57 ` [PATCH v7 4/4] mm: vmalloc: convert vread() to vread_iter() Lorenzo Stoakes
  3 siblings, 0 replies; 12+ messages in thread
From: Lorenzo Stoakes @ 2023-03-22 18:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel, Andrew Morton
  Cc: Baoquan He, Uladzislau Rezki, Matthew Wilcox, David Hildenbrand,
	Liu Shixin, Jiri Olsa, Jens Axboe, Alexander Viro,
	Lorenzo Stoakes

Convert read_kcore() to read_kcore_iter(), which reads into an iov_iter
rather than a userland buffer. For the time being we still use a bounce
buffer for vread(); however, in the next patch we will convert this to
interact directly with the iterator and eliminate the bounce buffer
altogether.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 fs/proc/kcore.c | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 556f310d6aa4..08b795fd80b4 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -24,7 +24,7 @@
 #include <linux/memblock.h>
 #include <linux/init.h>
 #include <linux/slab.h>
-#include <linux/uaccess.h>
+#include <linux/uio.h>
 #include <asm/io.h>
 #include <linux/list.h>
 #include <linux/ioport.h>
@@ -308,9 +308,12 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
 }
 
 static ssize_t
-read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
+read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
 {
+	struct file *file = iocb->ki_filp;
 	char *buf = file->private_data;
+	loff_t *fpos = &iocb->ki_pos;
+
 	size_t phdrs_offset, notes_offset, data_offset;
 	size_t page_offline_frozen = 1;
 	size_t phdrs_len, notes_len;
@@ -318,6 +321,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 	size_t tsz;
 	int nphdr;
 	unsigned long start;
+	size_t buflen = iov_iter_count(iter);
 	size_t orig_buflen = buflen;
 	int ret = 0;
 
@@ -356,12 +360,11 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 		};
 
 		tsz = min_t(size_t, buflen, sizeof(struct elfhdr) - *fpos);
-		if (copy_to_user(buffer, (char *)&ehdr + *fpos, tsz)) {
+		if (copy_to_iter((char *)&ehdr + *fpos, tsz, iter) != tsz) {
 			ret = -EFAULT;
 			goto out;
 		}
 
-		buffer += tsz;
 		buflen -= tsz;
 		*fpos += tsz;
 	}
@@ -398,15 +401,14 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 		}
 
 		tsz = min_t(size_t, buflen, phdrs_offset + phdrs_len - *fpos);
-		if (copy_to_user(buffer, (char *)phdrs + *fpos - phdrs_offset,
-				 tsz)) {
+		if (copy_to_iter((char *)phdrs + *fpos - phdrs_offset, tsz,
+				 iter) != tsz) {
 			kfree(phdrs);
 			ret = -EFAULT;
 			goto out;
 		}
 		kfree(phdrs);
 
-		buffer += tsz;
 		buflen -= tsz;
 		*fpos += tsz;
 	}
@@ -448,14 +450,13 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 				  min(vmcoreinfo_size, notes_len - i));
 
 		tsz = min_t(size_t, buflen, notes_offset + notes_len - *fpos);
-		if (copy_to_user(buffer, notes + *fpos - notes_offset, tsz)) {
+		if (copy_to_iter(notes + *fpos - notes_offset, tsz, iter) != tsz) {
 			kfree(notes);
 			ret = -EFAULT;
 			goto out;
 		}
 		kfree(notes);
 
-		buffer += tsz;
 		buflen -= tsz;
 		*fpos += tsz;
 	}
@@ -497,7 +498,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 		}
 
 		if (!m) {
-			if (clear_user(buffer, tsz)) {
+			if (iov_iter_zero(tsz, iter) != tsz) {
 				ret = -EFAULT;
 				goto out;
 			}
@@ -508,14 +509,14 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 		case KCORE_VMALLOC:
 			vread(buf, (char *)start, tsz);
 			/* we have to zero-fill user buffer even if no read */
-			if (copy_to_user(buffer, buf, tsz)) {
+			if (copy_to_iter(buf, tsz, iter) != tsz) {
 				ret = -EFAULT;
 				goto out;
 			}
 			break;
 		case KCORE_USER:
 			/* User page is handled prior to normal kernel page: */
-			if (copy_to_user(buffer, (char *)start, tsz)) {
+			if (copy_to_iter((char *)start, tsz, iter) != tsz) {
 				ret = -EFAULT;
 				goto out;
 			}
@@ -531,7 +532,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 			 */
 			if (!page || PageOffline(page) ||
 			    is_page_hwpoison(page) || !pfn_is_ram(pfn)) {
-				if (clear_user(buffer, tsz)) {
+				if (iov_iter_zero(tsz, iter) != tsz) {
 					ret = -EFAULT;
 					goto out;
 				}
@@ -541,17 +542,17 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 		case KCORE_VMEMMAP:
 		case KCORE_TEXT:
 			/*
-			 * We use _copy_to_user() to bypass usermode hardening
+			 * We use _copy_to_iter() to bypass usermode hardening
 			 * which would otherwise prevent this operation.
 			 */
-			if (_copy_to_user(buffer, (char *)start, tsz)) {
+			if (_copy_to_iter((char *)start, tsz, iter) != tsz) {
 				ret = -EFAULT;
 				goto out;
 			}
 			break;
 		default:
 			pr_warn_once("Unhandled KCORE type: %d\n", m->type);
-			if (clear_user(buffer, tsz)) {
+			if (iov_iter_zero(tsz, iter) != tsz) {
 				ret = -EFAULT;
 				goto out;
 			}
@@ -559,7 +560,6 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 skip:
 		buflen -= tsz;
 		*fpos += tsz;
-		buffer += tsz;
 		start += tsz;
 		tsz = (buflen > PAGE_SIZE ? PAGE_SIZE : buflen);
 	}
@@ -603,7 +603,7 @@ static int release_kcore(struct inode *inode, struct file *file)
 }
 
 static const struct proc_ops kcore_proc_ops = {
-	.proc_read	= read_kcore,
+	.proc_read_iter	= read_kcore_iter,
 	.proc_open	= open_kcore,
 	.proc_release	= release_kcore,
 	.proc_lseek	= default_llseek,
-- 
2.39.2



* [PATCH v7 3/4] iov_iter: add copy_page_to_iter_nofault()
  2023-03-22 18:57 [PATCH v7 0/4] convert read_kcore(), vread() to use iterators Lorenzo Stoakes
  2023-03-22 18:57 ` [PATCH v7 1/4] fs/proc/kcore: avoid bounce buffer for ktext data Lorenzo Stoakes
  2023-03-22 18:57 ` [PATCH v7 2/4] fs/proc/kcore: convert read_kcore() to read_kcore_iter() Lorenzo Stoakes
@ 2023-03-22 18:57 ` Lorenzo Stoakes
  2023-03-22 18:57 ` [PATCH v7 4/4] mm: vmalloc: convert vread() to vread_iter() Lorenzo Stoakes
  3 siblings, 0 replies; 12+ messages in thread
From: Lorenzo Stoakes @ 2023-03-22 18:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel, Andrew Morton
  Cc: Baoquan He, Uladzislau Rezki, Matthew Wilcox, David Hildenbrand,
	Liu Shixin, Jiri Olsa, Jens Axboe, Alexander Viro,
	Lorenzo Stoakes

Provide a means to copy a page to user space from an iterator, aborting if
a page fault would occur. This supports compound pages, but may be passed a
tail page with an offset extending further into the compound page, so we
cannot pass a folio.

This allows the function to be called from atomic context and to _try_ to
copy to user pages if they are faulted in, aborting if not.

The function does not use _copy_to_iter() so as to avoid the might_fault()
annotation; this is similar to copy_page_from_iter_atomic().

This is being added so that an iterator form of vread() can be implemented
while holding spinlocks.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
 include/linux/uio.h |  2 ++
 lib/iov_iter.c      | 48 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 27e3fd942960..29eb18bb6feb 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -173,6 +173,8 @@ static inline size_t copy_folio_to_iter(struct folio *folio, size_t offset,
 {
 	return copy_page_to_iter(&folio->page, offset, bytes, i);
 }
+size_t copy_page_to_iter_nofault(struct page *page, unsigned offset,
+				 size_t bytes, struct iov_iter *i);
 
 static __always_inline __must_check
 size_t copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 274014e4eafe..34dd6bdf2fba 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -172,6 +172,18 @@ static int copyout(void __user *to, const void *from, size_t n)
 	return n;
 }
 
+static int copyout_nofault(void __user *to, const void *from, size_t n)
+{
+	long res;
+
+	if (should_fail_usercopy())
+		return n;
+
+	res = copy_to_user_nofault(to, from, n);
+
+	return res < 0 ? n : res;
+}
+
 static int copyin(void *to, const void __user *from, size_t n)
 {
 	size_t res = n;
@@ -734,6 +746,42 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 }
 EXPORT_SYMBOL(copy_page_to_iter);
 
+size_t copy_page_to_iter_nofault(struct page *page, unsigned offset, size_t bytes,
+				 struct iov_iter *i)
+{
+	size_t res = 0;
+
+	if (!page_copy_sane(page, offset, bytes))
+		return 0;
+	if (WARN_ON_ONCE(i->data_source))
+		return 0;
+	if (unlikely(iov_iter_is_pipe(i)))
+		return copy_page_to_iter_pipe(page, offset, bytes, i);
+	page += offset / PAGE_SIZE; // first subpage
+	offset %= PAGE_SIZE;
+	while (1) {
+		void *kaddr = kmap_local_page(page);
+		size_t n = min(bytes, (size_t)PAGE_SIZE - offset);
+
+		iterate_and_advance(i, n, base, len, off,
+			copyout_nofault(base, kaddr + offset + off, len),
+			memcpy(base, kaddr + offset + off, len)
+		)
+		kunmap_local(kaddr);
+		res += n;
+		bytes -= n;
+		if (!bytes || !n)
+			break;
+		offset += n;
+		if (offset == PAGE_SIZE) {
+			page++;
+			offset = 0;
+		}
+	}
+	return res;
+}
+EXPORT_SYMBOL(copy_page_to_iter_nofault);
+
 size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
-- 
2.39.2



* [PATCH v7 4/4] mm: vmalloc: convert vread() to vread_iter()
  2023-03-22 18:57 [PATCH v7 0/4] convert read_kcore(), vread() to use iterators Lorenzo Stoakes
                   ` (2 preceding siblings ...)
  2023-03-22 18:57 ` [PATCH v7 3/4] iov_iter: add copy_page_to_iter_nofault() Lorenzo Stoakes
@ 2023-03-22 18:57 ` Lorenzo Stoakes
  2023-03-23  2:52   ` Baoquan He
  3 siblings, 1 reply; 12+ messages in thread
From: Lorenzo Stoakes @ 2023-03-22 18:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel, Andrew Morton
  Cc: Baoquan He, Uladzislau Rezki, Matthew Wilcox, David Hildenbrand,
	Liu Shixin, Jiri Olsa, Jens Axboe, Alexander Viro,
	Lorenzo Stoakes

Having previously laid the foundation for converting vread() to an iterator
function, pull the trigger and do so.

This patch attempts to provide minimal refactoring and to reflect the
existing logic as best we can, for example we continue to zero portions of
memory not read, as before.

Overall, there should be no functional difference other than a performance
improvement in /proc/kcore access to vmalloc regions.

Now we have eliminated the need for a bounce buffer in read_kcore_iter(),
we dispense with it, and try to write to user memory optimistically but
with faults disabled via copy_page_to_iter_nofault(). We already have
preemption disabled by holding a spin lock. We continue faulting in until
the operation is complete.

Additionally, we must account for the fact that at any point a copy may
fail, most likely because a fault could not be taken; in that case we exit,
indicating fewer bytes retrieved than expected.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
 fs/proc/kcore.c         |  44 ++++----
 include/linux/vmalloc.h |   3 +-
 mm/nommu.c              |  10 +-
 mm/vmalloc.c            | 234 +++++++++++++++++++++++++---------------
 4 files changed, 176 insertions(+), 115 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 08b795fd80b4..25b44b303b35 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -307,13 +307,9 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
 	*i = ALIGN(*i + descsz, 4);
 }
 
-static ssize_t
-read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
+static ssize_t read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
 {
-	struct file *file = iocb->ki_filp;
-	char *buf = file->private_data;
 	loff_t *fpos = &iocb->ki_pos;
-
 	size_t phdrs_offset, notes_offset, data_offset;
 	size_t page_offline_frozen = 1;
 	size_t phdrs_len, notes_len;
@@ -507,13 +503,30 @@ read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
 
 		switch (m->type) {
 		case KCORE_VMALLOC:
-			vread(buf, (char *)start, tsz);
-			/* we have to zero-fill user buffer even if no read */
-			if (copy_to_iter(buf, tsz, iter) != tsz) {
-				ret = -EFAULT;
-				goto out;
+		{
+			const char *src = (char *)start;
+			size_t read = 0, left = tsz;
+
+			/*
+			 * vmalloc uses spinlocks, so we optimistically try to
+			 * read memory. If this fails, fault pages in and try
+			 * again until we are done.
+			 */
+			while (true) {
+				read += vread_iter(iter, src, left);
+				if (read == tsz)
+					break;
+
+				src += read;
+				left -= read;
+
+				if (fault_in_iov_iter_writeable(iter, left)) {
+					ret = -EFAULT;
+					goto out;
+				}
 			}
 			break;
+		}
 		case KCORE_USER:
 			/* User page is handled prior to normal kernel page: */
 			if (copy_to_iter((char *)start, tsz, iter) != tsz) {
@@ -582,10 +595,6 @@ static int open_kcore(struct inode *inode, struct file *filp)
 	if (ret)
 		return ret;
 
-	filp->private_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
-	if (!filp->private_data)
-		return -ENOMEM;
-
 	if (kcore_need_update)
 		kcore_update_ram();
 	if (i_size_read(inode) != proc_root_kcore->size) {
@@ -596,16 +605,9 @@ static int open_kcore(struct inode *inode, struct file *filp)
 	return 0;
 }
 
-static int release_kcore(struct inode *inode, struct file *file)
-{
-	kfree(file->private_data);
-	return 0;
-}
-
 static const struct proc_ops kcore_proc_ops = {
 	.proc_read_iter	= read_kcore_iter,
 	.proc_open	= open_kcore,
-	.proc_release	= release_kcore,
 	.proc_lseek	= default_llseek,
 };
 
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 69250efa03d1..461aa5637f65 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -9,6 +9,7 @@
 #include <asm/page.h>		/* pgprot_t */
 #include <linux/rbtree.h>
 #include <linux/overflow.h>
+#include <linux/uio.h>
 
 #include <asm/vmalloc.h>
 
@@ -251,7 +252,7 @@ static inline void set_vm_flush_reset_perms(void *addr)
 #endif
 
 /* for /proc/kcore */
-extern long vread(char *buf, char *addr, unsigned long count);
+extern long vread_iter(struct iov_iter *iter, const char *addr, size_t count);
 
 /*
  *	Internals.  Don't use..
diff --git a/mm/nommu.c b/mm/nommu.c
index 57ba243c6a37..f670d9979a26 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -36,6 +36,7 @@
 #include <linux/printk.h>
 
 #include <linux/uaccess.h>
+#include <linux/uio.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -198,14 +199,13 @@ unsigned long vmalloc_to_pfn(const void *addr)
 }
 EXPORT_SYMBOL(vmalloc_to_pfn);
 
-long vread(char *buf, char *addr, unsigned long count)
+long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
 {
 	/* Don't allow overflow */
-	if ((unsigned long) buf + count < count)
-		count = -(unsigned long) buf;
+	if ((unsigned long) addr + count < count)
+		count = -(unsigned long) addr;
 
-	memcpy(buf, addr, count);
-	return count;
+	return copy_to_iter(addr, count, iter);
 }
 
 /*
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 978194dc2bb8..629cd87bb403 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -37,7 +37,6 @@
 #include <linux/rbtree_augmented.h>
 #include <linux/overflow.h>
 #include <linux/pgtable.h>
-#include <linux/uaccess.h>
 #include <linux/hugetlb.h>
 #include <linux/sched/mm.h>
 #include <asm/tlbflush.h>
@@ -3442,62 +3441,96 @@ void *vmalloc_32_user(unsigned long size)
 EXPORT_SYMBOL(vmalloc_32_user);
 
 /*
- * small helper routine , copy contents to buf from addr.
- * If the page is not present, fill zero.
+ * Atomically zero bytes in the iterator.
+ *
+ * Returns the number of zeroed bytes.
  */
+size_t zero_iter(struct iov_iter *iter, size_t count)
+{
+	size_t remains = count;
+
+	while (remains > 0) {
+		size_t num, copied;
+
+		num = remains < PAGE_SIZE ? remains : PAGE_SIZE;
+		copied = copy_page_to_iter_nofault(ZERO_PAGE(0), 0, num, iter);
+		remains -= copied;
+
+		if (copied < num)
+			break;
+	}
 
-static int aligned_vread(char *buf, char *addr, unsigned long count)
+	return count - remains;
+}
+
+/*
+ * small helper routine, copy contents to iter from addr.
+ * If the page is not present, fill zero.
+ *
+ * Returns the number of copied bytes.
+ */
+static size_t aligned_vread_iter(struct iov_iter *iter,
+				 const char *addr, size_t count)
 {
-	struct page *p;
-	int copied = 0;
+	size_t remains = count;
+	struct page *page;
 
-	while (count) {
+	while (remains > 0) {
 		unsigned long offset, length;
+		size_t copied = 0;
 
 		offset = offset_in_page(addr);
 		length = PAGE_SIZE - offset;
-		if (length > count)
-			length = count;
-		p = vmalloc_to_page(addr);
+		if (length > remains)
+			length = remains;
+		page = vmalloc_to_page(addr);
 		/*
-		 * To do safe access to this _mapped_ area, we need
-		 * lock. But adding lock here means that we need to add
-		 * overhead of vmalloc()/vfree() calls for this _debug_
-		 * interface, rarely used. Instead of that, we'll use
-		 * kmap() and get small overhead in this access function.
+		 * To do safe access to this _mapped_ area, we need lock. But
+		 * adding lock here means that we need to add overhead of
+		 * vmalloc()/vfree() calls for this _debug_ interface, rarely
+		 * used. Instead of that, we'll use an local mapping via
+		 * copy_page_to_iter_nofault() and accept a small overhead in
+		 * this access function.
 		 */
-		if (p) {
-			/* We can expect USER0 is not used -- see vread() */
-			void *map = kmap_atomic(p);
-			memcpy(buf, map + offset, length);
-			kunmap_atomic(map);
-		} else
-			memset(buf, 0, length);
+		if (page)
+			copied = copy_page_to_iter_nofault(page, offset,
+							   length, iter);
+		else
+			copied = zero_iter(iter, length);
 
-		addr += length;
-		buf += length;
-		copied += length;
-		count -= length;
+		addr += copied;
+		remains -= copied;
+
+		if (copied != length)
+			break;
 	}
-	return copied;
+
+	return count - remains;
 }
 
-static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags)
+/*
+ * Read from a vm_map_ram region of memory.
+ *
+ * Returns the number of copied bytes.
+ */
+static size_t vmap_ram_vread_iter(struct iov_iter *iter, const char *addr,
+				  size_t count, unsigned long flags)
 {
 	char *start;
 	struct vmap_block *vb;
 	unsigned long offset;
-	unsigned int rs, re, n;
+	unsigned int rs, re;
+	size_t remains, n;
 
 	/*
 	 * If it's area created by vm_map_ram() interface directly, but
 	 * not further subdividing and delegating management to vmap_block,
 	 * handle it here.
 	 */
-	if (!(flags & VMAP_BLOCK)) {
-		aligned_vread(buf, addr, count);
-		return;
-	}
+	if (!(flags & VMAP_BLOCK))
+		return aligned_vread_iter(iter, addr, count);
+
+	remains = count;
 
 	/*
 	 * Area is split into regions and tracked with vmap_block, read out
@@ -3505,50 +3538,64 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
 	 */
 	vb = xa_load(&vmap_blocks, addr_to_vb_idx((unsigned long)addr));
 	if (!vb)
-		goto finished;
+		goto finished_zero;
 
 	spin_lock(&vb->lock);
 	if (bitmap_empty(vb->used_map, VMAP_BBMAP_BITS)) {
 		spin_unlock(&vb->lock);
-		goto finished;
+		goto finished_zero;
 	}
+
 	for_each_set_bitrange(rs, re, vb->used_map, VMAP_BBMAP_BITS) {
-		if (!count)
-			break;
+		size_t copied;
+
+		if (remains == 0)
+			goto finished;
+
 		start = vmap_block_vaddr(vb->va->va_start, rs);
-		while (addr < start) {
-			if (count == 0)
-				goto unlock;
-			*buf = '\0';
-			buf++;
-			addr++;
-			count--;
+
+		if (addr < start) {
+			size_t to_zero = min_t(size_t, start - addr, remains);
+			size_t zeroed = zero_iter(iter, to_zero);
+
+			addr += zeroed;
+			remains -= zeroed;
+
+			if (remains == 0 || zeroed != to_zero)
+				goto finished;
 		}
+
 		/*it could start reading from the middle of used region*/
 		offset = offset_in_page(addr);
 		n = ((re - rs + 1) << PAGE_SHIFT) - offset;
-		if (n > count)
-			n = count;
-		aligned_vread(buf, start+offset, n);
+		if (n > remains)
+			n = remains;
+
+		copied = aligned_vread_iter(iter, start + offset, n);
 
-		buf += n;
-		addr += n;
-		count -= n;
+		addr += copied;
+		remains -= copied;
+
+		if (copied != n)
+			goto finished;
 	}
-unlock:
+
 	spin_unlock(&vb->lock);
 
-finished:
+finished_zero:
 	/* zero-fill the left dirty or free regions */
-	if (count)
-		memset(buf, 0, count);
+	return count - remains + zero_iter(iter, remains);
+finished:
+	/* We couldn't copy/zero everything */
+	spin_unlock(&vb->lock);
+	return count - remains;
 }
 
 /**
- * vread() - read vmalloc area in a safe way.
- * @buf:     buffer for reading data
- * @addr:    vm address.
- * @count:   number of bytes to be read.
+ * vread_iter() - read vmalloc area in a safe way to an iterator.
+ * @iter:         the iterator to which data should be written.
+ * @addr:         vm address.
+ * @count:        number of bytes to be read.
  *
  * This function checks that addr is a valid vmalloc'ed area, and
  * copy data from that area to a given buffer. If the given memory range
@@ -3568,13 +3615,12 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
  * (same number as @count) or %0 if [addr...addr+count) doesn't
  * include any intersection with valid vmalloc area
  */
-long vread(char *buf, char *addr, unsigned long count)
+long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
 {
 	struct vmap_area *va;
 	struct vm_struct *vm;
-	char *vaddr, *buf_start = buf;
-	unsigned long buflen = count;
-	unsigned long n, size, flags;
+	char *vaddr;
+	size_t n, size, flags, remains;
 
 	addr = kasan_reset_tag(addr);
 
@@ -3582,18 +3628,22 @@ long vread(char *buf, char *addr, unsigned long count)
 	if ((unsigned long) addr + count < count)
 		count = -(unsigned long) addr;
 
+	remains = count;
+
 	spin_lock(&vmap_area_lock);
 	va = find_vmap_area_exceed_addr((unsigned long)addr);
 	if (!va)
-		goto finished;
+		goto finished_zero;
 
 	/* no intersects with alive vmap_area */
-	if ((unsigned long)addr + count <= va->va_start)
-		goto finished;
+	if ((unsigned long)addr + remains <= va->va_start)
+		goto finished_zero;
 
 	list_for_each_entry_from(va, &vmap_area_list, list) {
-		if (!count)
-			break;
+		size_t copied;
+
+		if (remains == 0)
+			goto finished;
 
 		vm = va->vm;
 		flags = va->flags & VMAP_FLAGS_MASK;
@@ -3608,6 +3658,7 @@ long vread(char *buf, char *addr, unsigned long count)
 
 		if (vm && (vm->flags & VM_UNINITIALIZED))
 			continue;
+
 		/* Pair with smp_wmb() in clear_vm_uninitialized_flag() */
 		smp_rmb();
 
@@ -3616,38 +3667,45 @@ long vread(char *buf, char *addr, unsigned long count)
 
 		if (addr >= vaddr + size)
 			continue;
-		while (addr < vaddr) {
-			if (count == 0)
+
+		if (addr < vaddr) {
+			size_t to_zero = min_t(size_t, vaddr - addr, remains);
+			size_t zeroed = zero_iter(iter, to_zero);
+
+			addr += zeroed;
+			remains -= zeroed;
+
+			if (remains == 0 || zeroed != to_zero)
 				goto finished;
-			*buf = '\0';
-			buf++;
-			addr++;
-			count--;
 		}
+
 		n = vaddr + size - addr;
-		if (n > count)
-			n = count;
+		if (n > remains)
+			n = remains;
 
 		if (flags & VMAP_RAM)
-			vmap_ram_vread(buf, addr, n, flags);
+			copied = vmap_ram_vread_iter(iter, addr, n, flags);
 		else if (!(vm->flags & VM_IOREMAP))
-			aligned_vread(buf, addr, n);
+			copied = aligned_vread_iter(iter, addr, n);
 		else /* IOREMAP area is treated as memory hole */
-			memset(buf, 0, n);
-		buf += n;
-		addr += n;
-		count -= n;
+			copied = zero_iter(iter, n);
+
+		addr += copied;
+		remains -= copied;
+
+		if (copied != n)
+			goto finished;
 	}
-finished:
-	spin_unlock(&vmap_area_lock);
 
-	if (buf == buf_start)
-		return 0;
+finished_zero:
+	spin_unlock(&vmap_area_lock);
 	/* zero-fill memory holes */
-	if (buf != buf_start + buflen)
-		memset(buf, 0, buflen - (buf - buf_start));
+	return count - remains + zero_iter(iter, remains);
+finished:
+	/* Nothing remains, or We couldn't copy/zero everything. */
+	spin_unlock(&vmap_area_lock);
 
-	return buflen;
+	return count - remains;
 }
 
 /**
-- 
2.39.2



* Re: [PATCH v7 4/4] mm: vmalloc: convert vread() to vread_iter()
  2023-03-22 18:57 ` [PATCH v7 4/4] mm: vmalloc: convert vread() to vread_iter() Lorenzo Stoakes
@ 2023-03-23  2:52   ` Baoquan He
  2023-03-23  6:44     ` Lorenzo Stoakes
  0 siblings, 1 reply; 12+ messages in thread
From: Baoquan He @ 2023-03-23  2:52 UTC (permalink / raw)
  To: Lorenzo Stoakes, David Hildenbrand
  Cc: linux-mm, linux-kernel, linux-fsdevel, Andrew Morton,
	Uladzislau Rezki, Matthew Wilcox, Liu Shixin, Jiri Olsa,
	Jens Axboe, Alexander Viro

On 03/22/23 at 06:57pm, Lorenzo Stoakes wrote:
> Having previously laid the foundation for converting vread() to an iterator
> function, pull the trigger and do so.
> 
> This patch attempts to provide minimal refactoring and to reflect the
> existing logic as best we can, for example we continue to zero portions of
> memory not read, as before.
> 
> Overall, there should be no functional difference other than a performance
> improvement in /proc/kcore access to vmalloc regions.
> 
> Now we have eliminated the need for a bounce buffer in read_kcore_iter(),
> we dispense with it, and try to write to user memory optimistically but
> with faults disabled via copy_page_to_iter_nofault(). We already have
> preemption disabled by holding a spin lock. We continue faulting in until
> the operation is complete.

I don't understand the sentences here. In vread_iter(), the actual
content reading is done in aligned_vread_iter(), otherwise we zero
filling the region. In aligned_vread_iter(), we will use
vmalloc_to_page() to get the mapped page and read out, otherwise zero
fill. While in this patch, fault_in_iov_iter_writeable() fault in memory
of iter one time and will bail out if failed. I am wondering why we 
continue faulting in until the operation is complete, and how that is done. 

If we look into the failing point in vread_iter(), it's mainly coming
from copy_page_to_iter_nofault(), e.g page_copy_sane() checking failed,
i->data_source checking failed. If these conditional checking failed,
should we continue reading again and again? And this is not related to
memory faulting in. I saw your discussion with David, but I am still a
little lost. Hope I can learn it, thanks in advance.

......
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index 08b795fd80b4..25b44b303b35 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
......
> @@ -507,13 +503,30 @@ read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
>  
>  		switch (m->type) {
>  		case KCORE_VMALLOC:
> -			vread(buf, (char *)start, tsz);
> -			/* we have to zero-fill user buffer even if no read */
> -			if (copy_to_iter(buf, tsz, iter) != tsz) {
> -				ret = -EFAULT;
> -				goto out;
> +		{
> +			const char *src = (char *)start;
> +			size_t read = 0, left = tsz;
> +
> +			/*
> +			 * vmalloc uses spinlocks, so we optimistically try to
> +			 * read memory. If this fails, fault pages in and try
> +			 * again until we are done.
> +			 */
> +			while (true) {
> +				read += vread_iter(iter, src, left);
> +				if (read == tsz)
> +					break;
> +
> +				src += read;
> +				left -= read;
> +
> +				if (fault_in_iov_iter_writeable(iter, left)) {
> +					ret = -EFAULT;
> +					goto out;
> +				}
>  			}
>  			break;
> +		}
>  		case KCORE_USER:
>  			/* User page is handled prior to normal kernel page: */
>  			if (copy_to_iter((char *)start, tsz, iter) != tsz) {



* Re: [PATCH v7 4/4] mm: vmalloc: convert vread() to vread_iter()
  2023-03-23  2:52   ` Baoquan He
@ 2023-03-23  6:44     ` Lorenzo Stoakes
  2023-03-23 10:36       ` Baoquan He
  0 siblings, 1 reply; 12+ messages in thread
From: Lorenzo Stoakes @ 2023-03-23  6:44 UTC (permalink / raw)
  To: Baoquan He
  Cc: David Hildenbrand, linux-mm, linux-kernel, linux-fsdevel,
	Andrew Morton, Uladzislau Rezki, Matthew Wilcox, Liu Shixin,
	Jiri Olsa, Jens Axboe, Alexander Viro

On Thu, Mar 23, 2023 at 10:52:09AM +0800, Baoquan He wrote:
> On 03/22/23 at 06:57pm, Lorenzo Stoakes wrote:
> > Having previously laid the foundation for converting vread() to an iterator
> > function, pull the trigger and do so.
> >
> > This patch attempts to provide minimal refactoring and to reflect the
> > existing logic as best we can, for example we continue to zero portions of
> > memory not read, as before.
> >
> > Overall, there should be no functional difference other than a performance
> > improvement in /proc/kcore access to vmalloc regions.
> >
> > Now we have eliminated the need for a bounce buffer in read_kcore_iter(),
> > we dispense with it, and try to write to user memory optimistically but
> > with faults disabled via copy_page_to_iter_nofault(). We already have
> > preemption disabled by holding a spin lock. We continue faulting in until
> > the operation is complete.
>
> I don't understand the sentences here. In vread_iter(), the actual
> content reading is done in aligned_vread_iter(), otherwise we zero
> filling the region. In aligned_vread_iter(), we will use
> vmalloc_to_page() to get the mapped page and read out, otherwise zero
> fill. While in this patch, fault_in_iov_iter_writeable() fault in memory
> of iter one time and will bail out if failed. I am wondering why we
> continue faulting in until the operation is complete, and how that is done.

This is referring to what's happening in kcore.c, not vread_iter(),
i.e. the looped read/faultin.

The reason we bail out if fault_in_iov_iter_writeable() fails is that it
would indicate an error had occurred.

The whole point is to _optimistically_ try to perform the operation
assuming the pages are faulted in. Ultimately we fault in via
copy_to_user_nofault() which will either copy data or fail if the pages are
not faulted in (will discuss this below a bit more in response to your
other point).

If this fails, then we fault in, and try again. We loop because there could
be some extremely unfortunate timing with a race on e.g. swapping out or
migrating pages between faulting in and trying to write out again.

This is extremely unlikely, but to avoid any chance of breaking userland we
repeat the operation until it completes. In nearly all real-world
situations it'll either work immediately or loop once.

>
> If we look into the failing point in vread_iter(), it's mainly coming
> from copy_page_to_iter_nofault(), e.g page_copy_sane() checking failed,
> i->data_source checking failed. If these conditional checking failed,
> should we continue reading again and again? And this is not related to
> memory faulting in. I saw your discussion with David, but I am still a
> little lost. Hope I can learn it, thanks in advance.
>

Actually neither of these are going to happen. page_copy_sane() checks the
sanity of the _source_ pages, and the 'sanity' is defined by whether your
offset and length sit within the (possibly compound) folio. Since we
control this, we can arrange for it never to happen.

i->data_source is checking that it's an output iterator, however we would
already have checked this when writing ELF headers at the bare minimum, so
we cannot reach this point with an invalid iterator.

Therefore it is not possible for either to cause a failure. What could cause a
failure, and what we are checking for, is specified in copyout_nofault()
(in iov_iter.c) which we pass to the iterate_and_advance() macro. Now we
have a fault-injection should_fail_usercopy() which would just trigger a
redo, or copy_to_user_nofault() returning < 0 (e.g. -EFAULT).

This code is confusing as this function returns the number of bytes _not
copied_ rather than copied. I have tested this to be sure by the way :)
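
To make the "bytes not copied" convention concrete, a minimal sketch (the
variable names here are illustrative only, not taken from the patch):

size_t uncopied = copyout_nofault(udst, ksrc, len); /* bytes NOT copied */
size_t copied = len - uncopied; /* 0 uncopied means complete success */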

Therefore the only way for a failure to occur is for memory to not be
faulted in and thus the loop only triggers in this situation. If we fail to
fault in pages for any reason, the whole operation aborts so this should
cover all angles.

> ......
> > diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> > index 08b795fd80b4..25b44b303b35 100644
> > --- a/fs/proc/kcore.c
> > +++ b/fs/proc/kcore.c
> ......
> > @@ -507,13 +503,30 @@ read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
> >
> >  		switch (m->type) {
> >  		case KCORE_VMALLOC:
> > -			vread(buf, (char *)start, tsz);
> > -			/* we have to zero-fill user buffer even if no read */
> > -			if (copy_to_iter(buf, tsz, iter) != tsz) {
> > -				ret = -EFAULT;
> > -				goto out;
> > +		{
> > +			const char *src = (char *)start;
> > +			size_t read = 0, left = tsz;
> > +
> > +			/*
> > +			 * vmalloc uses spinlocks, so we optimistically try to
> > +			 * read memory. If this fails, fault pages in and try
> > +			 * again until we are done.
> > +			 */
> > +			while (true) {
> > +				read += vread_iter(iter, src, left);
> > +				if (read == tsz)
> > +					break;
> > +
> > +				src += read;
> > +				left -= read;
> > +
> > +				if (fault_in_iov_iter_writeable(iter, left)) {
> > +					ret = -EFAULT;
> > +					goto out;
> > +				}
> >  			}
> >  			break;
> > +		}
> >  		case KCORE_USER:
> >  			/* User page is handled prior to normal kernel page: */
> >  			if (copy_to_iter((char *)start, tsz, iter) != tsz) {
>


* Re: [PATCH v7 4/4] mm: vmalloc: convert vread() to vread_iter()
  2023-03-23  6:44     ` Lorenzo Stoakes
@ 2023-03-23 10:36       ` Baoquan He
  2023-03-23 10:38         ` David Hildenbrand
  0 siblings, 1 reply; 12+ messages in thread
From: Baoquan He @ 2023-03-23 10:36 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, linux-mm, linux-kernel, linux-fsdevel,
	Andrew Morton, Uladzislau Rezki, Matthew Wilcox, Liu Shixin,
	Jiri Olsa, Jens Axboe, Alexander Viro

On 03/23/23 at 06:44am, Lorenzo Stoakes wrote:
> On Thu, Mar 23, 2023 at 10:52:09AM +0800, Baoquan He wrote:
> > On 03/22/23 at 06:57pm, Lorenzo Stoakes wrote:
> > > Having previously laid the foundation for converting vread() to an iterator
> > > function, pull the trigger and do so.
> > >
> > > This patch attempts to provide minimal refactoring and to reflect the
> > > existing logic as best we can, for example we continue to zero portions of
> > > memory not read, as before.
> > >
> > > Overall, there should be no functional difference other than a performance
> > > improvement in /proc/kcore access to vmalloc regions.
> > >
> > > Now we have eliminated the need for a bounce buffer in read_kcore_iter(),
> > > we dispense with it, and try to write to user memory optimistically but
> > > with faults disabled via copy_page_to_iter_nofault(). We already have
> > > preemption disabled by holding a spin lock. We continue faulting in until
> > > the operation is complete.
> >
> > I don't understand the sentences here. In vread_iter(), the actual
> > content reading is done in aligned_vread_iter(), otherwise we zero
> > filling the region. In aligned_vread_iter(), we will use
> > vmalloc_to_page() to get the mapped page and read out, otherwise zero
> > fill. While in this patch, fault_in_iov_iter_writeable() fault in memory
> > of iter one time and will bail out if failed. I am wondering why we
> > continue faulting in until the operation is complete, and how that is done.
> 
> This is referring to what's happening in kcore.c, not vread_iter(),
> i.e. the looped read/faultin.
> 
> The reason we bail out if fault_in_iov_iter_writeable() fails is that it
> would indicate an error had occurred.
> 
> The whole point is to _optimistically_ try to perform the operation
> assuming the pages are faulted in. Ultimately we fault in via
> copy_to_user_nofault() which will either copy data or fail if the pages are
> not faulted in (will discuss this below a bit more in response to your
> other point).
> 
> If this fails, then we fault in, and try again. We loop because there could
> be some extremely unfortunate timing with a race on e.g. swapping out or
> migrating pages between faulting in and trying to write out again.
> 
> This is extremely unlikely, but to avoid any chance of breaking userland we
> repeat the operation until it completes. In nearly all real-world
> situations it'll either work immediately or loop once.

Thanks a lot for these helpful details with patience. I got it now. I was
mainly confused by the while(true) loop in KCORE_VMALLOC case of read_kcore_iter.

Now is there any chance that the faulted in memory is swapped out or
migrated again before vread_iter()? fault_in_iov_iter_writeable() will
pin the memory? I didn't find it from code and document. Seems it only
faults in memory. If yes, there's a window between faulting in and
copy_to_user_nofault().

> 
> >
> > If we look into the failing point in vread_iter(), it's mainly coming
> > from copy_page_to_iter_nofault(), e.g page_copy_sane() checking failed,
> > i->data_source checking failed. If these conditional checking failed,
> > should we continue reading again and again? And this is not related to
> > memory faulting in. I saw your discussion with David, but I am still a
> > little lost. Hope I can learn it, thanks in advance.
> >
> 
> Actually neither of these are going to happen. page_copy_sane() checks the
> sanity of the _source_ pages, and the 'sanity' is defined by whether your
> offset and length sit within the (possibly compound) folio. Since we
> control this, we can arrange for it never to happen.
> 
> i->data_source is checking that it's an output iterator, however we would
> already have checked this when writing ELF headers at the bare minimum, so
> we cannot reach this point with an invalid iterator.
> 
> Therefore it is not possible for either to cause a failure. What could cause a
> failure, and what we are checking for, is specified in copyout_nofault()
> (in iov_iter.c) which we pass to the iterate_and_advance() macro. Now we
> have a fault-injection should_fail_usercopy() which would just trigger a
> redo, or copy_to_user_nofault() returning < 0 (e.g. -EFAULT).
> 
> This code is confusing as this function returns the number of bytes _not
> copied_ rather than copied. I have tested this to be sure by the way :)
> 
> Therefore the only way for a failure to occur is for memory to not be
> faulted in and thus the loop only triggers in this situation. If we fail to
> fault in pages for any reason, the whole operation aborts so this should
> cover all angles.
> 
> > ......
> > > diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> > > index 08b795fd80b4..25b44b303b35 100644
> > > --- a/fs/proc/kcore.c
> > > +++ b/fs/proc/kcore.c
> > ......
> > > @@ -507,13 +503,30 @@ read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
> > >
> > >  		switch (m->type) {
> > >  		case KCORE_VMALLOC:
> > > -			vread(buf, (char *)start, tsz);
> > > -			/* we have to zero-fill user buffer even if no read */
> > > -			if (copy_to_iter(buf, tsz, iter) != tsz) {
> > > -				ret = -EFAULT;
> > > -				goto out;
> > > +		{
> > > +			const char *src = (char *)start;
> > > +			size_t read = 0, left = tsz;
> > > +
> > > +			/*
> > > +			 * vmalloc uses spinlocks, so we optimistically try to
> > > +			 * read memory. If this fails, fault pages in and try
> > > +			 * again until we are done.
> > > +			 */
> > > +			while (true) {
> > > +				read += vread_iter(iter, src, left);
> > > +				if (read == tsz)
> > > +					break;
> > > +
> > > +				src += read;
> > > +				left -= read;
> > > +
> > > +				if (fault_in_iov_iter_writeable(iter, left)) {
> > > +					ret = -EFAULT;
> > > +					goto out;
> > > +				}
> > >  			}
> > >  			break;
> > > +		}
> > >  		case KCORE_USER:
> > >  			/* User page is handled prior to normal kernel page: */
> > >  			if (copy_to_iter((char *)start, tsz, iter) != tsz) {
> >
> 



* Re: [PATCH v7 4/4] mm: vmalloc: convert vread() to vread_iter()
  2023-03-23 10:36       ` Baoquan He
@ 2023-03-23 10:38         ` David Hildenbrand
  2023-03-23 13:31           ` Baoquan He
  0 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2023-03-23 10:38 UTC (permalink / raw)
  To: Baoquan He, Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, linux-fsdevel, Andrew Morton,
	Uladzislau Rezki, Matthew Wilcox, Liu Shixin, Jiri Olsa,
	Jens Axboe, Alexander Viro

On 23.03.23 11:36, Baoquan He wrote:
> On 03/23/23 at 06:44am, Lorenzo Stoakes wrote:
>> On Thu, Mar 23, 2023 at 10:52:09AM +0800, Baoquan He wrote:
>>> On 03/22/23 at 06:57pm, Lorenzo Stoakes wrote:
>>>> Having previously laid the foundation for converting vread() to an iterator
>>>> function, pull the trigger and do so.
>>>>
>>>> This patch attempts to provide minimal refactoring and to reflect the
>>>> existing logic as best we can, for example we continue to zero portions of
>>>> memory not read, as before.
>>>>
>>>> Overall, there should be no functional difference other than a performance
>>>> improvement in /proc/kcore access to vmalloc regions.
>>>>
>>>> Now we have eliminated the need for a bounce buffer in read_kcore_iter(),
>>>> we dispense with it, and try to write to user memory optimistically but
>>>> with faults disabled via copy_page_to_iter_nofault(). We already have
>>>> preemption disabled by holding a spin lock. We continue faulting in until
>>>> the operation is complete.
>>>
>>> I don't understand the sentences here. In vread_iter(), the actual
>>> content reading is done in aligned_vread_iter(), otherwise we zero
>>> filling the region. In aligned_vread_iter(), we will use
>>> vmalloc_to_page() to get the mapped page and read out, otherwise zero
>>> fill. While in this patch, fault_in_iov_iter_writeable() fault in memory
>>> of iter one time and will bail out if failed. I am wondering why we
>>> continue faulting in until the operation is complete, and how that is done.
>>
>> This is referring to what's happening in kcore.c, not vread_iter(),
>> i.e. the looped read/faultin.
>>
>> The reason we bail out if fault_in_iov_iter_writeable() fails is that it
>> would indicate an error had occurred.
>>
>> The whole point is to _optimistically_ try to perform the operation
>> assuming the pages are faulted in. Ultimately we fault in via
>> copy_to_user_nofault() which will either copy data or fail if the pages are
>> not faulted in (will discuss this below a bit more in response to your
>> other point).
>>
>> If this fails, then we fault in, and try again. We loop because there could
>> be some extremely unfortunate timing with a race on e.g. swapping out or
>> migrating pages between faulting in and trying to write out again.
>>
>> This is extremely unlikely, but to avoid any chance of breaking userland we
>> repeat the operation until it completes. In nearly all real-world
>> situations it'll either work immediately or loop once.
> 
> Thanks a lot for these helpful details with patience. I got it now. I was
> mainly confused by the while(true) loop in KCORE_VMALLOC case of read_kcore_iter.
> 
> Now is there any chance that the faulted in memory is swapped out or
> migrated again before vread_iter()? fault_in_iov_iter_writeable() will
> pin the memory? I didn't find it from code and document. Seems it only
> faults in memory. If yes, there's a window between faulting in and
> copy_to_user_nofault().
> 

See the documentation of fault_in_safe_writeable():

"Note that we don't pin or otherwise hold the pages referenced that we 
fault in.  There's no guarantee that they'll stay in memory for any 
duration of time."

-- 
Thanks,

David / dhildenb



* Re: [PATCH v7 4/4] mm: vmalloc: convert vread() to vread_iter()
  2023-03-23 10:38         ` David Hildenbrand
@ 2023-03-23 13:31           ` Baoquan He
  2023-03-26 13:26             ` David Laight
  0 siblings, 1 reply; 12+ messages in thread
From: Baoquan He @ 2023-03-23 13:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, linux-mm, linux-kernel, linux-fsdevel,
	Andrew Morton, Uladzislau Rezki, Matthew Wilcox, Liu Shixin,
	Jiri Olsa, Jens Axboe, Alexander Viro

On 03/23/23 at 11:38am, David Hildenbrand wrote:
> On 23.03.23 11:36, Baoquan He wrote:
> > On 03/23/23 at 06:44am, Lorenzo Stoakes wrote:
> > > On Thu, Mar 23, 2023 at 10:52:09AM +0800, Baoquan He wrote:
> > > > On 03/22/23 at 06:57pm, Lorenzo Stoakes wrote:
> > > > > Having previously laid the foundation for converting vread() to an iterator
> > > > > function, pull the trigger and do so.
> > > > > 
> > > > > This patch attempts to provide minimal refactoring and to reflect the
> > > > > existing logic as best we can, for example we continue to zero portions of
> > > > > memory not read, as before.
> > > > > 
> > > > > Overall, there should be no functional difference other than a performance
> > > > > improvement in /proc/kcore access to vmalloc regions.
> > > > > 
> > > > > Now we have eliminated the need for a bounce buffer in read_kcore_iter(),
> > > > > we dispense with it, and try to write to user memory optimistically but
> > > > > with faults disabled via copy_page_to_iter_nofault(). We already have
> > > > > preemption disabled by holding a spin lock. We continue faulting in until
> > > > > the operation is complete.
> > > > 
> > > > I don't understand the sentences here. In vread_iter(), the actual
> > > > content reading is done in aligned_vread_iter(), otherwise we zero
> > > > filling the region. In aligned_vread_iter(), we will use
> > > > vmalloc_to_page() to get the mapped page and read out, otherwise zero
> > > > fill. While in this patch, fault_in_iov_iter_writeable() fault in memory
> > > > of iter one time and will bail out if failed. I am wondering why we
> > > > continue faulting in until the operation is complete, and how that is done.
> > > 
> > > This is referring to what's happening in kcore.c, not vread_iter(),
> > > i.e. the looped read/faultin.
> > > 
> > > The reason we bail out if fault_in_iov_iter_writeable() fails is that it
> > > would indicate an error had occurred.
> > > 
> > > The whole point is to _optimistically_ try to perform the operation
> > > assuming the pages are faulted in. Ultimately we fault in via
> > > copy_to_user_nofault() which will either copy data or fail if the pages are
> > > not faulted in (will discuss this below a bit more in response to your
> > > other point).
> > > 
> > > If this fails, then we fault in, and try again. We loop because there could
> > > be some extremely unfortunate timing with a race on e.g. swapping out or
> > > migrating pages between faulting in and trying to write out again.
> > > 
> > > This is extremely unlikely, but to avoid any chance of breaking userland we
> > > repeat the operation until it completes. In nearly all real-world
> > > situations it'll either work immediately or loop once.
> > 
> > Thanks a lot for these helpful details with patience. I got it now. I was
> > mainly confused by the while(true) loop in KCORE_VMALLOC case of read_kcore_iter.
> > 
> > Now is there any chance that the faulted in memory is swapped out or
> > migrated again before vread_iter()? fault_in_iov_iter_writeable() will
> > pin the memory? I didn't find it from code and document. Seems it only
> > faults in memory. If yes, there's a window between faulting in and
> > copy_to_user_nofault().
> > 
> 
> See the documentation of fault_in_safe_writeable():
> 
> "Note that we don't pin or otherwise hold the pages referenced that we fault
> in.  There's no guarantee that they'll stay in memory for any duration of
> time."

Thanks for the info. Then swapping out/migration could happen again, so
that's why while(true) loop is meaningful.



* RE: [PATCH v7 4/4] mm: vmalloc: convert vread() to vread_iter()
  2023-03-23 13:31           ` Baoquan He
@ 2023-03-26 13:26             ` David Laight
  2023-03-26 14:20               ` Lorenzo Stoakes
  0 siblings, 1 reply; 12+ messages in thread
From: David Laight @ 2023-03-26 13:26 UTC (permalink / raw)
  To: 'Baoquan He', David Hildenbrand
  Cc: Lorenzo Stoakes, linux-mm, linux-kernel, linux-fsdevel,
	Andrew Morton, Uladzislau Rezki, Matthew Wilcox, Liu Shixin,
	Jiri Olsa, Jens Axboe, Alexander Viro

From: Baoquan He
> Sent: 23 March 2023 13:32
...
> > > > If this fails, then we fault in, and try again. We loop because there could
> > > > be some extremely unfortunate timing with a race on e.g. swapping out or
> > > > migrating pages between faulting in and trying to write out again.
> > > >
> > > > This is extremely unlikely, but to avoid any chance of breaking userland we
> > > > repeat the operation until it completes. In nearly all real-world
> > > > situations it'll either work immediately or loop once.
> > >
> > > Thanks a lot for these helpful details with patience. I got it now. I was
> > > mainly confused by the while(true) loop in KCORE_VMALLOC case of read_kcore_iter.
> > >
> > > Now is there any chance that the faulted in memory is swapped out or
> > > migrated again before vread_iter()? fault_in_iov_iter_writeable() will
> > > pin the memory? I didn't find it from code and document. Seems it only
> > > faults in memory. If yes, there's a window between faulting in and
> > > copy_to_user_nofault().
> > >
> >
> > See the documentation of fault_in_safe_writeable():
> >
> > "Note that we don't pin or otherwise hold the pages referenced that we fault
> > in.  There's no guarantee that they'll stay in memory for any duration of
> > time."
> 
> Thanks for the info. Then swapping out/migration could happen again, so
> that's why while(true) loop is meaningful.

One of the problems is that if the system is under severe memory
pressure and you try to fault in (say) 20 pages, the first page
might get unmapped in order to map the last one in.

So it is quite likely better to retry 'one page at a time'.

There have also been cases where the instruction to copy data
has faulted for reasons other than 'page fault'.
ISTR an infinite loop being caused by misaligned accesses failing
due to 'bad instruction choice' in the copy code.
While this is really a bug, an infinite retry in a file read/write
didn't make it easy to spot.

So maybe there are cases where a dropping back to a 'bounce buffer'
may be necessary.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



* Re: [PATCH v7 4/4] mm: vmalloc: convert vread() to vread_iter()
  2023-03-26 13:26             ` David Laight
@ 2023-03-26 14:20               ` Lorenzo Stoakes
  0 siblings, 0 replies; 12+ messages in thread
From: Lorenzo Stoakes @ 2023-03-26 14:20 UTC (permalink / raw)
  To: David Laight
  Cc: 'Baoquan He',
	David Hildenbrand, linux-mm, linux-kernel, linux-fsdevel,
	Andrew Morton, Uladzislau Rezki, Matthew Wilcox, Liu Shixin,
	Jiri Olsa, Jens Axboe, Alexander Viro

On Sun, Mar 26, 2023 at 01:26:57PM +0000, David Laight wrote:
> From: Baoquan He
> > Sent: 23 March 2023 13:32
> ...
> > > > > If this fails, then we fault in, and try again. We loop because there could
> > > > > be some extremely unfortunate timing with a race on e.g. swapping out or
> > > > > migrating pages between faulting in and trying to write out again.
> > > > >
> > > > > This is extremely unlikely, but to avoid any chance of breaking userland we
> > > > > repeat the operation until it completes. In nearly all real-world
> > > > > situations it'll either work immediately or loop once.
> > > >
> > > > Thanks a lot for these helpful details with patience. I got it now. I was
> > > > mainly confused by the while(true) loop in KCORE_VMALLOC case of read_kcore_iter.
> > > >
> > > > Now is there any chance that the faulted in memory is swapped out or
> > > > migrated again before vread_iter()? fault_in_iov_iter_writeable() will
> > > > pin the memory? I didn't find it from code and document. Seems it only
> > > > faults in memory. If yes, there's a window between faulting in and
> > > > copy_to_user_nofault().
> > > >
> > >
> > > See the documentation of fault_in_safe_writeable():
> > >
> > > "Note that we don't pin or otherwise hold the pages referenced that we fault
> > > in.  There's no guarantee that they'll stay in memory for any duration of
> > > time."
> >
> > Thanks for the info. Then swapping out/migration could happen again, so
> > that's why while(true) loop is meaningful.
>
> One of the problems is that if the system is under severe memory
> pressure and you try to fault in (say) 20 pages, the first page
> might get unmapped in order to map the last one in.
>
> So it is quite likely better to retry 'one page at a time'.

If you look at the kcore code, it is in fact only faulting one page at a
time. tsz never exceeds PAGE_SIZE, so we never attempt to fault in or copy
more than one page at a time, e.g.:-

if ((tsz = (PAGE_SIZE - (start & ~PAGE_MASK))) > buflen)
	tsz = buflen;

...

tsz = (buflen > PAGE_SIZE ? PAGE_SIZE : buflen);

It might be a good idea to make this totally explicit in vread_iter()
(perhaps making it vread_page_iter() or such), but I think that might be
good for another patch series.

>
> There have also been cases where the instruction to copy data
> has faulted for reasons other than 'page fault'.
> ISTR an infinite loop being caused by misaligned accesses failing
> due to 'bad instruction choice' in the copy code.
> While this is really a bug, an infinite retry in a file read/write
> didn't make it easy to spot.

I am not sure it's reasonable to not write code just in case an arch
implements buggy user copy code (do correct me if I'm misunderstanding you
about this!). By that token wouldn't a lot more be broken in that
situation? I don't imagine all other areas of the kernel would make it
explicitly clear to you that this was the problem.

>
> So maybe there are cases where a dropping back to a 'bounce buffer'
> may be necessary.

One approach could be to reinstate the kernel bounce buffer, set up an
iterator that points to it and pass that in after one attempt with
userland.
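
A hypothetical sketch of that fallback (the function name and shape are
illustrative only, not part of the series): read the vmalloc area through
an ITER_KVEC iterator pointing at a kernel bounce buffer, which cannot
fault, then copy the result out to the user iterator, which may fault and
can be retried by the caller:

static ssize_t vread_via_bounce(struct iov_iter *uiter, const char *src,
				size_t len, char *bounce)
{
	struct kvec kvec = { .iov_base = bounce, .iov_len = len };
	struct iov_iter kiter;
	size_t read;

	/* Kernel-destination iterator: vread_iter() cannot fault here. */
	iov_iter_kvec(&kiter, ITER_DEST, &kvec, 1, len);
	read = vread_iter(&kiter, src, len);

	/* This copy may fault; the caller handles a short return. */
	return copy_to_iter(bounce, read, uiter);
}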

But it feels a bit like overkill, as in the case of an alignment issue,
surely that would still occur and that'd just error out anyway? Again I'm
not sure bending over backwards to account for possibly buggy arch code is
sensible.

Ideally the iterator code would explicitly pass back the EFAULT error which
we could then explicitly handle but that'd require probably quite
significant rework there which feels a bit out of scope for this change.

We could implement some maximum number of attempts which statistically must
reduce the odds of repeated faults in the tiny window between fault in and
copy to effectively zero. But I'm not sure the other David would be happy
with that!

If we were to make a change to be extra careful I'd opt for simply trying X
times then giving up, given we're trying this a page at a time I don't
think X need be that large before any swap out/migrate bad luck becomes so
unlikely that we're competing with heat death of the universe timescales
before it might happen (again, I may be missing some common scenario where
the same single page swaps out/migrates over and over, please correct me if
so).
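
For concreteness, a bounded-retry variant of the KCORE_VMALLOC loop quoted
earlier in the thread might look something like the sketch below, reusing
the same local names (iter, src, tsz, ret); the cap of three attempts is
arbitrary, and this is not what the posted series does, which retries until
completion:

size_t read = 0;
int attempts = 3;	/* arbitrary cap, sketch only */

while (read < tsz && attempts--) {
	read += vread_iter(iter, src + read, tsz - read);
	if (read == tsz)
		break;
	if (fault_in_iov_iter_writeable(iter, tsz - read)) {
		ret = -EFAULT;
		goto out;
	}
}
if (read != tsz) {
	ret = -EFAULT;
	goto out;
}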

However I think there's a case to be made that it's fine as-is unless there
is another scenario we are overly concerned about?

>
> 	David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>


end of thread

Thread overview: 12+ messages
2023-03-22 18:57 [PATCH v7 0/4] convert read_kcore(), vread() to use iterators Lorenzo Stoakes
2023-03-22 18:57 ` [PATCH v7 1/4] fs/proc/kcore: avoid bounce buffer for ktext data Lorenzo Stoakes
2023-03-22 18:57 ` [PATCH v7 2/4] fs/proc/kcore: convert read_kcore() to read_kcore_iter() Lorenzo Stoakes
2023-03-22 18:57 ` [PATCH v7 3/4] iov_iter: add copy_page_to_iter_nofault() Lorenzo Stoakes
2023-03-22 18:57 ` [PATCH v7 4/4] mm: vmalloc: convert vread() to vread_iter() Lorenzo Stoakes
2023-03-23  2:52   ` Baoquan He
2023-03-23  6:44     ` Lorenzo Stoakes
2023-03-23 10:36       ` Baoquan He
2023-03-23 10:38         ` David Hildenbrand
2023-03-23 13:31           ` Baoquan He
2023-03-26 13:26             ` David Laight
2023-03-26 14:20               ` Lorenzo Stoakes
