* [PATCH v2 0/4] convert read_kcore(), vread() to use iterators
@ 2023-03-19  7:09 Lorenzo Stoakes
  2023-03-19  7:09 ` [PATCH v2 1/4] fs/proc/kcore: Avoid bounce buffer for ktext data Lorenzo Stoakes
                   ` (3 more replies)
  0 siblings, 4 replies; 33+ messages in thread
From: Lorenzo Stoakes @ 2023-03-19  7:09 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel, Andrew Morton
  Cc: Baoquan He, Uladzislau Rezki, Matthew Wilcox, David Hildenbrand,
	Liu Shixin, Jiri Olsa, Lorenzo Stoakes

While reviewing Baoquan's recent changes to permit vread() access to
vm_map_ram regions of vmalloc allocations, Willy pointed out [1] that it
would be nice to refactor vread() as a whole, since its only user is
read_kcore() and the existing form of vread() necessitates the use of a
bounce buffer.

This patch series does exactly that, and also adjusts how we read the kernel
text section so that a bounce buffer is avoided there too.

This patch series necessarily changes the locking used in vmalloc, however
tests indicate that this has very little impact on allocation performance
(test results are shown in the relevant patch).

This has been tested against the test case which motivated Baoquan's changes
in the first place [2]; that test case continues to function correctly, as do
the vmalloc self-tests.

[1] https://lore.kernel.org/all/Y8WfDSRkc%2FOHP3oD@casper.infradead.org/
[2] https://lore.kernel.org/all/87ilk6gos2.fsf@oracle.com/T/#u

v2:
- Fix ordering of vread_iter() parameters
- Fix nommu vread() -> vread_iter()

v1:
https://lore.kernel.org/all/cover.1679183626.git.lstoakes@gmail.com/

Lorenzo Stoakes (4):
  fs/proc/kcore: Avoid bounce buffer for ktext data
  mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  fs/proc/kcore: convert read_kcore() to read_kcore_iter()
  mm: vmalloc: convert vread() to vread_iter()

 fs/proc/kcore.c         |  84 +++++++------------
 include/linux/vmalloc.h |   3 +-
 mm/nommu.c              |  10 +--
 mm/vmalloc.c            | 178 +++++++++++++++++++++-------------------
 4 files changed, 130 insertions(+), 145 deletions(-)


base-commit: 4018ab1f7cec061b8425737328edefebdc0ab832
--
2.39.2



* [PATCH v2 1/4] fs/proc/kcore: Avoid bounce buffer for ktext data
  2023-03-19  7:09 [PATCH v2 0/4] convert read_kcore(), vread() to use iterators Lorenzo Stoakes
@ 2023-03-19  7:09 ` Lorenzo Stoakes
  2023-03-20  9:58   ` David Hildenbrand
  2023-03-19  7:09 ` [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock Lorenzo Stoakes
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 33+ messages in thread
From: Lorenzo Stoakes @ 2023-03-19  7:09 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel, Andrew Morton
  Cc: Baoquan He, Uladzislau Rezki, Matthew Wilcox, David Hildenbrand,
	Liu Shixin, Jiri Olsa, Lorenzo Stoakes

Commit df04abfd181a ("fs/proc/kcore.c: Add bounce buffer for ktext data")
introduced the use of a bounce buffer to retrieve kernel text data for
/proc/kcore in order to avoid failures in check_kernel_text_object() arising
from the hardened usercopy checks enabled by CONFIG_HARDENED_USERCOPY.

We can avoid doing this if instead of copy_to_user() we use _copy_to_user()
which bypasses the hardening check. This is more efficient than using a
bounce buffer and simplifies the code.

We do so as part of an overall effort to eliminate bounce buffer usage in the
function, with an eye to converting it to an iterator read.
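
As a simplified illustration of why this works, the copy_to_user() wrapper is
roughly shaped as below (paraphrased from include/linux/uaccess.h; exact
details vary by kernel version and configuration, and the function name here
is invented for the sketch):

/*
 * Sketch only: copy_to_user() performs the hardened usercopy object size
 * check via check_copy_size()/check_object_size() before handing off to
 * _copy_to_user(). Calling _copy_to_user() directly skips only that check;
 * access_ok() and the usual fault handling still occur within it.
 */
static __always_inline unsigned long
sketch_copy_to_user(void __user *to, const void *from, unsigned long n)
{
        if (check_copy_size(from, n, true))     /* hardening check */
                n = _copy_to_user(to, from, n); /* raw copy */
        return n;
}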

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
 fs/proc/kcore.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 71157ee35c1a..556f310d6aa4 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -541,19 +541,12 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 		case KCORE_VMEMMAP:
 		case KCORE_TEXT:
 			/*
-			 * Using bounce buffer to bypass the
-			 * hardened user copy kernel text checks.
+			 * We use _copy_to_user() to bypass usermode hardening
+			 * which would otherwise prevent this operation.
 			 */
-			if (copy_from_kernel_nofault(buf, (void *)start, tsz)) {
-				if (clear_user(buffer, tsz)) {
-					ret = -EFAULT;
-					goto out;
-				}
-			} else {
-				if (copy_to_user(buffer, buf, tsz)) {
-					ret = -EFAULT;
-					goto out;
-				}
+			if (_copy_to_user(buffer, (char *)start, tsz)) {
+				ret = -EFAULT;
+				goto out;
 			}
 			break;
 		default:
-- 
2.39.2




* [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-19  7:09 [PATCH v2 0/4] convert read_kcore(), vread() to use iterators Lorenzo Stoakes
  2023-03-19  7:09 ` [PATCH v2 1/4] fs/proc/kcore: Avoid bounce buffer for ktext data Lorenzo Stoakes
@ 2023-03-19  7:09 ` Lorenzo Stoakes
  2023-03-19 20:10   ` Andrew Morton
                     ` (2 more replies)
  2023-03-19  7:09 ` [PATCH v2 3/4] fs/proc/kcore: convert read_kcore() to read_kcore_iter() Lorenzo Stoakes
  2023-03-19  7:09 ` [PATCH v2 4/4] mm: vmalloc: convert vread() to vread_iter() Lorenzo Stoakes
  3 siblings, 3 replies; 33+ messages in thread
From: Lorenzo Stoakes @ 2023-03-19  7:09 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel, Andrew Morton
  Cc: Baoquan He, Uladzislau Rezki, Matthew Wilcox, David Hildenbrand,
	Liu Shixin, Jiri Olsa, Lorenzo Stoakes

vmalloc() is, by design, not permitted to be used in atomic context and
already contains components which may sleep, so avoiding spin locks is not
a problem from the perspective of atomic context.

The global vmap_area_lock is held when the red/black tree rooted in
vmap_area_root is accessed and thus is rather long-held and under
potentially high contention. It is likely to be under contention for reads
rather than writes, so replace it with an rwsem.

Each individual vmap_block->lock is likely to be held for less time but
under low contention, so a mutex is not an outrageous choice here.

A subset of test_vmalloc.sh performance results:-

fix_size_alloc_test              0.40%
full_fit_alloc_test              2.08%
long_busy_list_alloc_test        0.34%
random_size_alloc_test          -0.25%
random_size_align_alloc_test     0.06%
...
all tests cycles                 0.2%

This represents a tiny reduction in performance that sits barely above
noise.

The reason for making this change is to build a basis for vread() to be
usable asynchronously, thus eliminating the need for a bounce buffer when
copying data to userland in read_kcore() and allowing that to be converted
to an iterator form.
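
To make the constraint concrete, the sketch below (illustrative only, with
kaddr/len/copied as placeholder names, not code from this series) shows the
pattern the later patches rely on: copy_to_iter() may fault in userspace
pages and hence sleep, which is forbidden under a spinlock but permitted
under an rwsem held for read:

        /* Not permitted: the copy may fault and sleep under a spinlock. */
        spin_lock(&vmap_area_lock);
        copied = copy_to_iter(kaddr, len, iter);
        spin_unlock(&vmap_area_lock);

        /* Fine once vmap_area_lock is an rwsem: sleeping is allowed here. */
        down_read(&vmap_area_lock);
        copied = copy_to_iter(kaddr, len, iter);
        up_read(&vmap_area_lock);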

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
 mm/vmalloc.c | 77 +++++++++++++++++++++++++++-------------------------
 1 file changed, 40 insertions(+), 37 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 978194dc2bb8..c24b27664a97 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -40,6 +40,7 @@
 #include <linux/uaccess.h>
 #include <linux/hugetlb.h>
 #include <linux/sched/mm.h>
+#include <linux/rwsem.h>
 #include <asm/tlbflush.h>
 #include <asm/shmparam.h>
 
@@ -725,7 +726,7 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
 #define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0
 
 
-static DEFINE_SPINLOCK(vmap_area_lock);
+static DECLARE_RWSEM(vmap_area_lock);
 static DEFINE_SPINLOCK(free_vmap_area_lock);
 /* Export for kexec only */
 LIST_HEAD(vmap_area_list);
@@ -1537,9 +1538,9 @@ static void free_vmap_area(struct vmap_area *va)
 	/*
 	 * Remove from the busy tree/list.
 	 */
-	spin_lock(&vmap_area_lock);
+	down_write(&vmap_area_lock);
 	unlink_va(va, &vmap_area_root);
-	spin_unlock(&vmap_area_lock);
+	up_write(&vmap_area_lock);
 
 	/*
 	 * Insert/Merge it back to the free tree/list.
@@ -1627,9 +1628,9 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	va->vm = NULL;
 	va->flags = va_flags;
 
-	spin_lock(&vmap_area_lock);
+	down_write(&vmap_area_lock);
 	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
-	spin_unlock(&vmap_area_lock);
+	up_write(&vmap_area_lock);
 
 	BUG_ON(!IS_ALIGNED(va->va_start, align));
 	BUG_ON(va->va_start < vstart);
@@ -1854,9 +1855,9 @@ struct vmap_area *find_vmap_area(unsigned long addr)
 {
 	struct vmap_area *va;
 
-	spin_lock(&vmap_area_lock);
+	down_read(&vmap_area_lock);
 	va = __find_vmap_area(addr, &vmap_area_root);
-	spin_unlock(&vmap_area_lock);
+	up_read(&vmap_area_lock);
 
 	return va;
 }
@@ -1865,11 +1866,11 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
 {
 	struct vmap_area *va;
 
-	spin_lock(&vmap_area_lock);
+	down_write(&vmap_area_lock);
 	va = __find_vmap_area(addr, &vmap_area_root);
 	if (va)
 		unlink_va(va, &vmap_area_root);
-	spin_unlock(&vmap_area_lock);
+	up_write(&vmap_area_lock);
 
 	return va;
 }
@@ -1914,7 +1915,7 @@ struct vmap_block_queue {
 };
 
 struct vmap_block {
-	spinlock_t lock;
+	struct mutex lock;
 	struct vmap_area *va;
 	unsigned long free, dirty;
 	DECLARE_BITMAP(used_map, VMAP_BBMAP_BITS);
@@ -1991,7 +1992,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
 	}
 
 	vaddr = vmap_block_vaddr(va->va_start, 0);
-	spin_lock_init(&vb->lock);
+	mutex_init(&vb->lock);
 	vb->va = va;
 	/* At least something should be left free */
 	BUG_ON(VMAP_BBMAP_BITS <= (1UL << order));
@@ -2026,9 +2027,9 @@ static void free_vmap_block(struct vmap_block *vb)
 	tmp = xa_erase(&vmap_blocks, addr_to_vb_idx(vb->va->va_start));
 	BUG_ON(tmp != vb);
 
-	spin_lock(&vmap_area_lock);
+	down_write(&vmap_area_lock);
 	unlink_va(vb->va, &vmap_area_root);
-	spin_unlock(&vmap_area_lock);
+	up_write(&vmap_area_lock);
 
 	free_vmap_area_noflush(vb->va);
 	kfree_rcu(vb, rcu_head);
@@ -2047,7 +2048,7 @@ static void purge_fragmented_blocks(int cpu)
 		if (!(vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS))
 			continue;
 
-		spin_lock(&vb->lock);
+		mutex_lock(&vb->lock);
 		if (vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS) {
 			vb->free = 0; /* prevent further allocs after releasing lock */
 			vb->dirty = VMAP_BBMAP_BITS; /* prevent purging it again */
@@ -2056,10 +2057,10 @@ static void purge_fragmented_blocks(int cpu)
 			spin_lock(&vbq->lock);
 			list_del_rcu(&vb->free_list);
 			spin_unlock(&vbq->lock);
-			spin_unlock(&vb->lock);
+			mutex_unlock(&vb->lock);
 			list_add_tail(&vb->purge, &purge);
 		} else
-			spin_unlock(&vb->lock);
+			mutex_unlock(&vb->lock);
 	}
 	rcu_read_unlock();
 
@@ -2101,9 +2102,9 @@ static void *vb_alloc(unsigned long size, gfp_t gfp_mask)
 	list_for_each_entry_rcu(vb, &vbq->free, free_list) {
 		unsigned long pages_off;
 
-		spin_lock(&vb->lock);
+		mutex_lock(&vb->lock);
 		if (vb->free < (1UL << order)) {
-			spin_unlock(&vb->lock);
+			mutex_unlock(&vb->lock);
 			continue;
 		}
 
@@ -2117,7 +2118,7 @@ static void *vb_alloc(unsigned long size, gfp_t gfp_mask)
 			spin_unlock(&vbq->lock);
 		}
 
-		spin_unlock(&vb->lock);
+		mutex_unlock(&vb->lock);
 		break;
 	}
 
@@ -2144,16 +2145,16 @@ static void vb_free(unsigned long addr, unsigned long size)
 	order = get_order(size);
 	offset = (addr & (VMAP_BLOCK_SIZE - 1)) >> PAGE_SHIFT;
 	vb = xa_load(&vmap_blocks, addr_to_vb_idx(addr));
-	spin_lock(&vb->lock);
+	mutex_lock(&vb->lock);
 	bitmap_clear(vb->used_map, offset, (1UL << order));
-	spin_unlock(&vb->lock);
+	mutex_unlock(&vb->lock);
 
 	vunmap_range_noflush(addr, addr + size);
 
 	if (debug_pagealloc_enabled_static())
 		flush_tlb_kernel_range(addr, addr + size);
 
-	spin_lock(&vb->lock);
+	mutex_lock(&vb->lock);
 
 	/* Expand dirty range */
 	vb->dirty_min = min(vb->dirty_min, offset);
@@ -2162,10 +2163,10 @@ static void vb_free(unsigned long addr, unsigned long size)
 	vb->dirty += 1UL << order;
 	if (vb->dirty == VMAP_BBMAP_BITS) {
 		BUG_ON(vb->free);
-		spin_unlock(&vb->lock);
+		mutex_unlock(&vb->lock);
 		free_vmap_block(vb);
 	} else
-		spin_unlock(&vb->lock);
+		mutex_unlock(&vb->lock);
 }
 
 static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
@@ -2183,7 +2184,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
 
 		rcu_read_lock();
 		list_for_each_entry_rcu(vb, &vbq->free, free_list) {
-			spin_lock(&vb->lock);
+			mutex_lock(&vb->lock);
 			if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
 				unsigned long va_start = vb->va->va_start;
 				unsigned long s, e;
@@ -2196,7 +2197,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
 
 				flush = 1;
 			}
-			spin_unlock(&vb->lock);
+			mutex_unlock(&vb->lock);
 		}
 		rcu_read_unlock();
 	}
@@ -2451,9 +2452,9 @@ static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
 static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
 			      unsigned long flags, const void *caller)
 {
-	spin_lock(&vmap_area_lock);
+	down_write(&vmap_area_lock);
 	setup_vmalloc_vm_locked(vm, va, flags, caller);
-	spin_unlock(&vmap_area_lock);
+	up_write(&vmap_area_lock);
 }
 
 static void clear_vm_uninitialized_flag(struct vm_struct *vm)
@@ -3507,9 +3508,9 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
 	if (!vb)
 		goto finished;
 
-	spin_lock(&vb->lock);
+	mutex_lock(&vb->lock);
 	if (bitmap_empty(vb->used_map, VMAP_BBMAP_BITS)) {
-		spin_unlock(&vb->lock);
+		mutex_unlock(&vb->lock);
 		goto finished;
 	}
 	for_each_set_bitrange(rs, re, vb->used_map, VMAP_BBMAP_BITS) {
@@ -3536,7 +3537,7 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
 		count -= n;
 	}
 unlock:
-	spin_unlock(&vb->lock);
+	mutex_unlock(&vb->lock);
 
 finished:
 	/* zero-fill the left dirty or free regions */
@@ -3576,13 +3577,15 @@ long vread(char *buf, char *addr, unsigned long count)
 	unsigned long buflen = count;
 	unsigned long n, size, flags;
 
+	might_sleep();
+
 	addr = kasan_reset_tag(addr);
 
 	/* Don't allow overflow */
 	if ((unsigned long) addr + count < count)
 		count = -(unsigned long) addr;
 
-	spin_lock(&vmap_area_lock);
+	down_read(&vmap_area_lock);
 	va = find_vmap_area_exceed_addr((unsigned long)addr);
 	if (!va)
 		goto finished;
@@ -3639,7 +3642,7 @@ long vread(char *buf, char *addr, unsigned long count)
 		count -= n;
 	}
 finished:
-	spin_unlock(&vmap_area_lock);
+	up_read(&vmap_area_lock);
 
 	if (buf == buf_start)
 		return 0;
@@ -3980,14 +3983,14 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 	}
 
 	/* insert all vm's */
-	spin_lock(&vmap_area_lock);
+	down_write(&vmap_area_lock);
 	for (area = 0; area < nr_vms; area++) {
 		insert_vmap_area(vas[area], &vmap_area_root, &vmap_area_list);
 
 		setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
 				 pcpu_get_vm_areas);
 	}
-	spin_unlock(&vmap_area_lock);
+	up_write(&vmap_area_lock);
 
 	/*
 	 * Mark allocated areas as accessible. Do it now as a best-effort
@@ -4114,7 +4117,7 @@ static void *s_start(struct seq_file *m, loff_t *pos)
 	__acquires(&vmap_area_lock)
 {
 	mutex_lock(&vmap_purge_lock);
-	spin_lock(&vmap_area_lock);
+	down_read(&vmap_area_lock);
 
 	return seq_list_start(&vmap_area_list, *pos);
 }
@@ -4128,7 +4131,7 @@ static void s_stop(struct seq_file *m, void *p)
 	__releases(&vmap_area_lock)
 	__releases(&vmap_purge_lock)
 {
-	spin_unlock(&vmap_area_lock);
+	up_read(&vmap_area_lock);
 	mutex_unlock(&vmap_purge_lock);
 }
 
-- 
2.39.2




* [PATCH v2 3/4] fs/proc/kcore: convert read_kcore() to read_kcore_iter()
  2023-03-19  7:09 [PATCH v2 0/4] convert read_kcore(), vread() to use iterators Lorenzo Stoakes
  2023-03-19  7:09 ` [PATCH v2 1/4] fs/proc/kcore: Avoid bounce buffer for ktext data Lorenzo Stoakes
  2023-03-19  7:09 ` [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock Lorenzo Stoakes
@ 2023-03-19  7:09 ` Lorenzo Stoakes
  2023-03-19  7:09 ` [PATCH v2 4/4] mm: vmalloc: convert vread() to vread_iter() Lorenzo Stoakes
  3 siblings, 0 replies; 33+ messages in thread
From: Lorenzo Stoakes @ 2023-03-19  7:09 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel, Andrew Morton
  Cc: Baoquan He, Uladzislau Rezki, Matthew Wilcox, David Hildenbrand,
	Liu Shixin, Jiri Olsa, Lorenzo Stoakes

Now that we have eliminated spinlocks from the vread() case, convert
read_kcore() to read_kcore_iter().

For the time being we still use a bounce buffer for vread(); however, in the
next patch we will convert this to interact directly with the iterator and
eliminate the bounce buffer altogether.
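
For readers less familiar with the iov_iter interface, the sketch below shows
the general shape a ->read_iter() handler takes (a hypothetical example, not
the kcore code; demo_read_iter() and its buffer are invented, and the usual
<linux/fs.h>/<linux/uio.h> includes are assumed):

static ssize_t demo_read_iter(struct kiocb *iocb, struct iov_iter *iter)
{
        static const char data[] = "example contents\n";
        loff_t pos = iocb->ki_pos;
        size_t n;

        if (pos >= sizeof(data))
                return 0;                       /* EOF */

        /* copy_to_iter() returns the number of bytes actually copied. */
        n = min_t(size_t, iov_iter_count(iter), sizeof(data) - pos);
        n = copy_to_iter(data + pos, n, iter);
        if (!n)
                return -EFAULT;

        iocb->ki_pos += n;
        return n;
}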

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
 fs/proc/kcore.c | 58 ++++++++++++++++++++++++-------------------------
 1 file changed, 29 insertions(+), 29 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 556f310d6aa4..25e0eeb8d498 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -24,7 +24,7 @@
 #include <linux/memblock.h>
 #include <linux/init.h>
 #include <linux/slab.h>
-#include <linux/uaccess.h>
+#include <linux/uio.h>
 #include <asm/io.h>
 #include <linux/list.h>
 #include <linux/ioport.h>
@@ -308,9 +308,12 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
 }
 
 static ssize_t
-read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
+read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
 {
+	struct file *file = iocb->ki_filp;
 	char *buf = file->private_data;
+	loff_t *ppos = &iocb->ki_pos;
+
 	size_t phdrs_offset, notes_offset, data_offset;
 	size_t page_offline_frozen = 1;
 	size_t phdrs_len, notes_len;
@@ -318,6 +321,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 	size_t tsz;
 	int nphdr;
 	unsigned long start;
+	size_t buflen = iov_iter_count(iter);
 	size_t orig_buflen = buflen;
 	int ret = 0;
 
@@ -333,7 +337,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 	notes_offset = phdrs_offset + phdrs_len;
 
 	/* ELF file header. */
-	if (buflen && *fpos < sizeof(struct elfhdr)) {
+	if (buflen && *ppos < sizeof(struct elfhdr)) {
 		struct elfhdr ehdr = {
 			.e_ident = {
 				[EI_MAG0] = ELFMAG0,
@@ -355,19 +359,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 			.e_phnum = nphdr,
 		};
 
-		tsz = min_t(size_t, buflen, sizeof(struct elfhdr) - *fpos);
-		if (copy_to_user(buffer, (char *)&ehdr + *fpos, tsz)) {
+		tsz = min_t(size_t, buflen, sizeof(struct elfhdr) - *ppos);
+		if (copy_to_iter((char *)&ehdr + *ppos, tsz, iter) != tsz) {
 			ret = -EFAULT;
 			goto out;
 		}
 
-		buffer += tsz;
 		buflen -= tsz;
-		*fpos += tsz;
+		*ppos += tsz;
 	}
 
 	/* ELF program headers. */
-	if (buflen && *fpos < phdrs_offset + phdrs_len) {
+	if (buflen && *ppos < phdrs_offset + phdrs_len) {
 		struct elf_phdr *phdrs, *phdr;
 
 		phdrs = kzalloc(phdrs_len, GFP_KERNEL);
@@ -397,22 +400,21 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 			phdr++;
 		}
 
-		tsz = min_t(size_t, buflen, phdrs_offset + phdrs_len - *fpos);
-		if (copy_to_user(buffer, (char *)phdrs + *fpos - phdrs_offset,
-				 tsz)) {
+		tsz = min_t(size_t, buflen, phdrs_offset + phdrs_len - *ppos);
+		if (copy_to_iter((char *)phdrs + *ppos - phdrs_offset, tsz,
+				 iter) != tsz) {
 			kfree(phdrs);
 			ret = -EFAULT;
 			goto out;
 		}
 		kfree(phdrs);
 
-		buffer += tsz;
 		buflen -= tsz;
-		*fpos += tsz;
+		*ppos += tsz;
 	}
 
 	/* ELF note segment. */
-	if (buflen && *fpos < notes_offset + notes_len) {
+	if (buflen && *ppos < notes_offset + notes_len) {
 		struct elf_prstatus prstatus = {};
 		struct elf_prpsinfo prpsinfo = {
 			.pr_sname = 'R',
@@ -447,24 +449,23 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 				  vmcoreinfo_data,
 				  min(vmcoreinfo_size, notes_len - i));
 
-		tsz = min_t(size_t, buflen, notes_offset + notes_len - *fpos);
-		if (copy_to_user(buffer, notes + *fpos - notes_offset, tsz)) {
+		tsz = min_t(size_t, buflen, notes_offset + notes_len - *ppos);
+		if (copy_to_iter(notes + *ppos - notes_offset, tsz, iter) != tsz) {
 			kfree(notes);
 			ret = -EFAULT;
 			goto out;
 		}
 		kfree(notes);
 
-		buffer += tsz;
 		buflen -= tsz;
-		*fpos += tsz;
+		*ppos += tsz;
 	}
 
 	/*
 	 * Check to see if our file offset matches with any of
 	 * the addresses in the elf_phdr on our list.
 	 */
-	start = kc_offset_to_vaddr(*fpos - data_offset);
+	start = kc_offset_to_vaddr(*ppos - data_offset);
 	if ((tsz = (PAGE_SIZE - (start & ~PAGE_MASK))) > buflen)
 		tsz = buflen;
 
@@ -497,7 +498,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 		}
 
 		if (!m) {
-			if (clear_user(buffer, tsz)) {
+			if (iov_iter_zero(tsz, iter) != tsz) {
 				ret = -EFAULT;
 				goto out;
 			}
@@ -508,14 +509,14 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 		case KCORE_VMALLOC:
 			vread(buf, (char *)start, tsz);
 			/* we have to zero-fill user buffer even if no read */
-			if (copy_to_user(buffer, buf, tsz)) {
+			if (copy_to_iter(buf, tsz, iter) != tsz) {
 				ret = -EFAULT;
 				goto out;
 			}
 			break;
 		case KCORE_USER:
 			/* User page is handled prior to normal kernel page: */
-			if (copy_to_user(buffer, (char *)start, tsz)) {
+			if (copy_to_iter((char *)start, tsz, iter) != tsz) {
 				ret = -EFAULT;
 				goto out;
 			}
@@ -531,7 +532,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 			 */
 			if (!page || PageOffline(page) ||
 			    is_page_hwpoison(page) || !pfn_is_ram(pfn)) {
-				if (clear_user(buffer, tsz)) {
+				if (iov_iter_zero(tsz, iter) != tsz) {
 					ret = -EFAULT;
 					goto out;
 				}
@@ -541,25 +542,24 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
 		case KCORE_VMEMMAP:
 		case KCORE_TEXT:
 			/*
-			 * We use _copy_to_user() to bypass usermode hardening
+			 * We use _copy_to_iter() to bypass usermode hardening
 			 * which would otherwise prevent this operation.
 			 */
-			if (_copy_to_user(buffer, (char *)start, tsz)) {
+			if (_copy_to_iter((char *)start, tsz, iter) != tsz) {
 				ret = -EFAULT;
 				goto out;
 			}
 			break;
 		default:
 			pr_warn_once("Unhandled KCORE type: %d\n", m->type);
-			if (clear_user(buffer, tsz)) {
+			if (iov_iter_zero(tsz, iter) != tsz) {
 				ret = -EFAULT;
 				goto out;
 			}
 		}
 skip:
 		buflen -= tsz;
-		*fpos += tsz;
-		buffer += tsz;
+		*ppos += tsz;
 		start += tsz;
 		tsz = (buflen > PAGE_SIZE ? PAGE_SIZE : buflen);
 	}
@@ -603,7 +603,7 @@ static int release_kcore(struct inode *inode, struct file *file)
 }
 
 static const struct proc_ops kcore_proc_ops = {
-	.proc_read	= read_kcore,
+	.proc_read_iter	= read_kcore_iter,
 	.proc_open	= open_kcore,
 	.proc_release	= release_kcore,
 	.proc_lseek	= default_llseek,
-- 
2.39.2




* [PATCH v2 4/4] mm: vmalloc: convert vread() to vread_iter()
  2023-03-19  7:09 [PATCH v2 0/4] convert read_kcore(), vread() to use iterators Lorenzo Stoakes
                   ` (2 preceding siblings ...)
  2023-03-19  7:09 ` [PATCH v2 3/4] fs/proc/kcore: convert read_kcore() to read_kcore_iter() Lorenzo Stoakes
@ 2023-03-19  7:09 ` Lorenzo Stoakes
  3 siblings, 0 replies; 33+ messages in thread
From: Lorenzo Stoakes @ 2023-03-19  7:09 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-fsdevel, Andrew Morton
  Cc: Baoquan He, Uladzislau Rezki, Matthew Wilcox, David Hildenbrand,
	Liu Shixin, Jiri Olsa, Lorenzo Stoakes

Having previously laid the foundation for converting vread() to an iterator
function, pull the trigger and do so.

This patch attempts to provide minimal refactoring and to reflect the
existing logic as best we can, with the exception of aligned_vread_iter()
which drops the use of the deprecated kmap_atomic() in favour of
kmap_local_page().
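
In sketch form, the mapping pattern in aligned_vread_iter() becomes the
following (page/offset/length/iter stand in for the local variables in the
real function; see the hunk below for the actual change):

        if (page) {
                /*
                 * kmap_local_page() does not disable pagefaults or (on most
                 * configurations) preemption, unlike kmap_atomic(), so
                 * copying to the iterator - which may itself fault - is
                 * safe here.
                 */
                void *map = kmap_local_page(page);

                copied = copy_to_iter(map + offset, length, iter);
                kunmap_local(map);
        }

        /* Zero whatever could not be copied, as the old memset() did. */
        if (copied < length)
                iov_iter_zero(length - copied, iter);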

All existing logic to zero portions of memory not read remains, and there
should be no functional difference other than a performance improvement in
/proc/kcore access to vmalloc regions.

Now that we have done away with the need for a bounce buffer in
read_kcore_iter() altogether, we dispense with the one allocated there.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
---
 fs/proc/kcore.c         |  21 +--------
 include/linux/vmalloc.h |   3 +-
 mm/nommu.c              |  10 ++--
 mm/vmalloc.c            | 101 +++++++++++++++++++++-------------------
 4 files changed, 62 insertions(+), 73 deletions(-)

diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 25e0eeb8d498..a0ed3ca35cce 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -307,13 +307,9 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
 	*i = ALIGN(*i + descsz, 4);
 }
 
-static ssize_t
-read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
+static ssize_t read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
 {
-	struct file *file = iocb->ki_filp;
-	char *buf = file->private_data;
 	loff_t *ppos = &iocb->ki_pos;
-
 	size_t phdrs_offset, notes_offset, data_offset;
 	size_t page_offline_frozen = 1;
 	size_t phdrs_len, notes_len;
@@ -507,9 +503,7 @@ read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
 
 		switch (m->type) {
 		case KCORE_VMALLOC:
-			vread(buf, (char *)start, tsz);
-			/* we have to zero-fill user buffer even if no read */
-			if (copy_to_iter(buf, tsz, iter) != tsz) {
+			if (vread_iter(iter, (char *)start, tsz) != tsz) {
 				ret = -EFAULT;
 				goto out;
 			}
@@ -582,10 +576,6 @@ static int open_kcore(struct inode *inode, struct file *filp)
 	if (ret)
 		return ret;
 
-	filp->private_data = kmalloc(PAGE_SIZE, GFP_KERNEL);
-	if (!filp->private_data)
-		return -ENOMEM;
-
 	if (kcore_need_update)
 		kcore_update_ram();
 	if (i_size_read(inode) != proc_root_kcore->size) {
@@ -596,16 +586,9 @@ static int open_kcore(struct inode *inode, struct file *filp)
 	return 0;
 }
 
-static int release_kcore(struct inode *inode, struct file *file)
-{
-	kfree(file->private_data);
-	return 0;
-}
-
 static const struct proc_ops kcore_proc_ops = {
 	.proc_read_iter	= read_kcore_iter,
 	.proc_open	= open_kcore,
-	.proc_release	= release_kcore,
 	.proc_lseek	= default_llseek,
 };
 
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 69250efa03d1..6beb2ace6a7a 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -9,6 +9,7 @@
 #include <asm/page.h>		/* pgprot_t */
 #include <linux/rbtree.h>
 #include <linux/overflow.h>
+#include <linux/uio.h>
 
 #include <asm/vmalloc.h>
 
@@ -251,7 +252,7 @@ static inline void set_vm_flush_reset_perms(void *addr)
 #endif
 
 /* for /proc/kcore */
-extern long vread(char *buf, char *addr, unsigned long count);
+extern long vread_iter(struct iov_iter *iter, char *addr, size_t count);
 
 /*
  *	Internals.  Don't use..
diff --git a/mm/nommu.c b/mm/nommu.c
index 57ba243c6a37..e0fcd948096e 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -36,6 +36,7 @@
 #include <linux/printk.h>
 
 #include <linux/uaccess.h>
+#include <linux/uio.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -198,14 +199,13 @@ unsigned long vmalloc_to_pfn(const void *addr)
 }
 EXPORT_SYMBOL(vmalloc_to_pfn);
 
-long vread(char *buf, char *addr, unsigned long count)
+long vread_iter(struct iov_iter *iter, char *addr, size_t count)
 {
 	/* Don't allow overflow */
-	if ((unsigned long) buf + count < count)
-		count = -(unsigned long) buf;
+	if ((unsigned long) addr + count < count)
+		count = -(unsigned long) addr;
 
-	memcpy(buf, addr, count);
-	return count;
+	return copy_to_iter(addr, count, iter);
 }
 
 /*
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index c24b27664a97..f19509a6eef4 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -37,7 +37,6 @@
 #include <linux/rbtree_augmented.h>
 #include <linux/overflow.h>
 #include <linux/pgtable.h>
-#include <linux/uaccess.h>
 #include <linux/hugetlb.h>
 #include <linux/sched/mm.h>
 #include <linux/rwsem.h>
@@ -3446,20 +3445,20 @@ EXPORT_SYMBOL(vmalloc_32_user);
  * small helper routine , copy contents to buf from addr.
  * If the page is not present, fill zero.
  */
-
-static int aligned_vread(char *buf, char *addr, unsigned long count)
+static void aligned_vread_iter(struct iov_iter *iter,
+			       char *addr, size_t count)
 {
-	struct page *p;
-	int copied = 0;
+	struct page *page;
 
-	while (count) {
+	while (count > 0) {
 		unsigned long offset, length;
+		size_t copied = 0;
 
 		offset = offset_in_page(addr);
 		length = PAGE_SIZE - offset;
 		if (length > count)
 			length = count;
-		p = vmalloc_to_page(addr);
+		page = vmalloc_to_page(addr);
 		/*
 		 * To do safe access to this _mapped_ area, we need
 		 * lock. But adding lock here means that we need to add
@@ -3467,23 +3466,24 @@ static int aligned_vread(char *buf, char *addr, unsigned long count)
 		 * interface, rarely used. Instead of that, we'll use
 		 * kmap() and get small overhead in this access function.
 		 */
-		if (p) {
+		if (page) {
 			/* We can expect USER0 is not used -- see vread() */
-			void *map = kmap_atomic(p);
-			memcpy(buf, map + offset, length);
-			kunmap_atomic(map);
-		} else
-			memset(buf, 0, length);
+			void *map = kmap_local_page(page);
+
+			copied = copy_to_iter(map + offset, length, iter);
+			kunmap_local(map);
+		}
+
+		if (copied < length)
+			iov_iter_zero(length - copied, iter);
 
 		addr += length;
-		buf += length;
-		copied += length;
 		count -= length;
 	}
-	return copied;
 }
 
-static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags)
+static void vmap_ram_vread_iter(struct iov_iter *iter, char *addr, int count,
+				unsigned long flags)
 {
 	char *start;
 	struct vmap_block *vb;
@@ -3496,7 +3496,7 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
 	 * handle it here.
 	 */
 	if (!(flags & VMAP_BLOCK)) {
-		aligned_vread(buf, addr, count);
+		aligned_vread_iter(iter, addr, count);
 		return;
 	}
 
@@ -3517,22 +3517,24 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
 		if (!count)
 			break;
 		start = vmap_block_vaddr(vb->va->va_start, rs);
-		while (addr < start) {
+
+		if (addr < start) {
+			size_t to_zero = min_t(size_t, start - addr, count);
+
+			iov_iter_zero(to_zero, iter);
+			addr += to_zero;
+			count -= (int)to_zero;
 			if (count == 0)
 				goto unlock;
-			*buf = '\0';
-			buf++;
-			addr++;
-			count--;
 		}
+
 		/*it could start reading from the middle of used region*/
 		offset = offset_in_page(addr);
 		n = ((re - rs + 1) << PAGE_SHIFT) - offset;
 		if (n > count)
 			n = count;
-		aligned_vread(buf, start+offset, n);
+		aligned_vread_iter(iter, start + offset, n);
 
-		buf += n;
 		addr += n;
 		count -= n;
 	}
@@ -3541,15 +3543,15 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
 
 finished:
 	/* zero-fill the left dirty or free regions */
-	if (count)
-		memset(buf, 0, count);
+	if (count > 0)
+		iov_iter_zero(count, iter);
 }
 
 /**
- * vread() - read vmalloc area in a safe way.
- * @buf:     buffer for reading data
- * @addr:    vm address.
- * @count:   number of bytes to be read.
+ * vread_iter() - read vmalloc area in a safe way to an iterator.
+ * @iter:         the iterator to which data should be written.
+ * @addr:         vm address.
+ * @count:        number of bytes to be read.
  *
  * This function checks that addr is a valid vmalloc'ed area, and
  * copy data from that area to a given buffer. If the given memory range
@@ -3569,13 +3571,13 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
  * (same number as @count) or %0 if [addr...addr+count) doesn't
  * include any intersection with valid vmalloc area
  */
-long vread(char *buf, char *addr, unsigned long count)
+long vread_iter(struct iov_iter *iter, char *addr, size_t count)
 {
 	struct vmap_area *va;
 	struct vm_struct *vm;
-	char *vaddr, *buf_start = buf;
-	unsigned long buflen = count;
-	unsigned long n, size, flags;
+	char *vaddr;
+	size_t buflen = count;
+	size_t n, size, flags;
 
 	might_sleep();
 
@@ -3595,7 +3597,7 @@ long vread(char *buf, char *addr, unsigned long count)
 		goto finished;
 
 	list_for_each_entry_from(va, &vmap_area_list, list) {
-		if (!count)
+		if (count == 0)
 			break;
 
 		vm = va->vm;
@@ -3619,36 +3621,39 @@ long vread(char *buf, char *addr, unsigned long count)
 
 		if (addr >= vaddr + size)
 			continue;
-		while (addr < vaddr) {
+
+		if (addr < vaddr) {
+			size_t to_zero = min_t(size_t, vaddr - addr, count);
+
+			iov_iter_zero(to_zero, iter);
+			addr += to_zero;
+			count -= to_zero;
 			if (count == 0)
 				goto finished;
-			*buf = '\0';
-			buf++;
-			addr++;
-			count--;
 		}
+
 		n = vaddr + size - addr;
 		if (n > count)
 			n = count;
 
 		if (flags & VMAP_RAM)
-			vmap_ram_vread(buf, addr, n, flags);
+			vmap_ram_vread_iter(iter, addr, n, flags);
 		else if (!(vm->flags & VM_IOREMAP))
-			aligned_vread(buf, addr, n);
+			aligned_vread_iter(iter, addr, n);
 		else /* IOREMAP area is treated as memory hole */
-			memset(buf, 0, n);
-		buf += n;
+			iov_iter_zero(n, iter);
+
 		addr += n;
 		count -= n;
 	}
 finished:
 	up_read(&vmap_area_lock);
 
-	if (buf == buf_start)
+	if (count == buflen)
 		return 0;
 	/* zero-fill memory holes */
-	if (buf != buf_start + buflen)
-		memset(buf, 0, buflen - (buf - buf_start));
+	if (count > 0)
+		iov_iter_zero(count, iter);
 
 	return buflen;
 }
-- 
2.39.2




* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-19  7:09 ` [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock Lorenzo Stoakes
@ 2023-03-19 20:10   ` Andrew Morton
  2023-03-19 20:29     ` Lorenzo Stoakes
  2023-03-20  7:54   ` Uladzislau Rezki
  2023-03-21  1:09   ` Dave Chinner
  2 siblings, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2023-03-19 20:10 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, linux-fsdevel, Baoquan He,
	Uladzislau Rezki, Matthew Wilcox, David Hildenbrand, Liu Shixin,
	Jiri Olsa

On Sun, 19 Mar 2023 07:09:31 +0000 Lorenzo Stoakes <lstoakes@gmail.com> wrote:

> vmalloc() is, by design, not permitted to be used in atomic context and
> already contains components which may sleep, so avoiding spin locks is not
> a problem from the perspective of atomic context.
> 
> The global vmap_area_lock is held when the red/black tree rooted in
> vmap_are_root is accessed and thus is rather long-held and under
> potentially high contention. It is likely to be under contention for reads
> rather than write, so replace it with a rwsem.
> 
> Each individual vmap_block->lock is likely to be held for less time but
> under low contention, so a mutex is not an outrageous choice here.
> 
> A subset of test_vmalloc.sh performance results:-
> 
> fix_size_alloc_test             0.40%
> full_fit_alloc_test		2.08%
> long_busy_list_alloc_test	0.34%
> random_size_alloc_test		-0.25%
> random_size_align_alloc_test	0.06%
> ...
> all tests cycles                0.2%
> 
> This represents a tiny reduction in performance that sits barely above
> noise.
> 
> The reason for making this change is to build a basis for vread() to be
> usable asynchronously, this eliminating the need for a bounce buffer when
> copying data to userland in read_kcore() and allowing that to be converted
> to an iterator form.
> 

I'm not understanding the final paragraph.  How and where is vread()
used "asynchronously"?



* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-19 20:10   ` Andrew Morton
@ 2023-03-19 20:29     ` Lorenzo Stoakes
  2023-03-19 20:47       ` Matthew Wilcox
  0 siblings, 1 reply; 33+ messages in thread
From: Lorenzo Stoakes @ 2023-03-19 20:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, linux-fsdevel, Baoquan He,
	Uladzislau Rezki, Matthew Wilcox, David Hildenbrand, Liu Shixin,
	Jiri Olsa

On Sun, Mar 19, 2023 at 01:10:47PM -0700, Andrew Morton wrote:
> On Sun, 19 Mar 2023 07:09:31 +0000 Lorenzo Stoakes <lstoakes@gmail.com> wrote:
>
> > vmalloc() is, by design, not permitted to be used in atomic context and
> > already contains components which may sleep, so avoiding spin locks is not
> > a problem from the perspective of atomic context.
> >
> > The global vmap_area_lock is held when the red/black tree rooted in
> > vmap_are_root is accessed and thus is rather long-held and under
> > potentially high contention. It is likely to be under contention for reads
> > rather than write, so replace it with a rwsem.
> >
> > Each individual vmap_block->lock is likely to be held for less time but
> > under low contention, so a mutex is not an outrageous choice here.
> >
> > A subset of test_vmalloc.sh performance results:-
> >
> > fix_size_alloc_test             0.40%
> > full_fit_alloc_test		2.08%
> > long_busy_list_alloc_test	0.34%
> > random_size_alloc_test		-0.25%
> > random_size_align_alloc_test	0.06%
> > ...
> > all tests cycles                0.2%
> >
> > This represents a tiny reduction in performance that sits barely above
> > noise.
> >
> > The reason for making this change is to build a basis for vread() to be
> > usable asynchronously, this eliminating the need for a bounce buffer when
> > copying data to userland in read_kcore() and allowing that to be converted
> > to an iterator form.
> >
>
> I'm not understanding the final paragraph.  How and where is vread()
> used "asynchronously"?


The basis for saying asynchronous was Documentation/filesystems/vfs.rst
describing read_iter() as a 'possibly asynchronous read with iov_iter as
destination', and read_iter() is what is (now) invoked when accessing
/proc/kcore.

However, I agree this is vague; it is clearer to refer to the fact that we are
now writing directly to user memory and thus wish to avoid spinlocks, as we may
need to fault in user memory in doing so.

Would it be ok for you to go ahead and replace that final paragraph with the
below?:-

The reason for making this change is to build a basis for vread() to write
to user memory directly via an iterator; as a result we may cause page
faults during which we must not hold a spinlock. Doing this eliminates the
need for a bounce buffer in read_kcore() and thus permits that to be
converted to also use an iterator, as a read_iter() handler.



* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-19 20:29     ` Lorenzo Stoakes
@ 2023-03-19 20:47       ` Matthew Wilcox
  2023-03-19 21:16         ` Lorenzo Stoakes
  0 siblings, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2023-03-19 20:47 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-fsdevel, Baoquan He,
	Uladzislau Rezki, David Hildenbrand, Liu Shixin, Jiri Olsa

On Sun, Mar 19, 2023 at 08:29:16PM +0000, Lorenzo Stoakes wrote:
> The basis for saying asynchronous was based on Documentation/filesystems/vfs.rst
> describing read_iter() as 'possibly asynchronous read with iov_iter as
> destination', and read_iter() is what is (now) invoked when accessing
> /proc/kcore.
> 
> However I agree this is vague and it is clearer to refer to the fact that we are
> now directly writing to user memory and thus wish to avoid spinlocks as we may
> need to fault in user memory in doing so.
> 
> Would it be ok for you to go ahead and replace that final paragraph with the
> below?:-
> 
> The reason for making this change is to build a basis for vread() to write
> to user memory directly via an iterator; as a result we may cause page
> faults during which we must not hold a spinlock. Doing this eliminates the
> need for a bounce buffer in read_kcore() and thus permits that to be
> converted to also use an iterator, as a read_iter() handler.

I'd say the purpose of the iterator is to abstract whether we're
accessing user memory, kernel memory or a pipe, so I'd suggest:

   The reason for making this change is to build a basis for vread() to
   write to memory via an iterator; as a result we may cause page faults
   during which we must not hold a spinlock. Doing this eliminates the
   need for a bounce buffer in read_kcore() and thus permits that to be
   converted to also use an iterator, as a read_iter() handler.

I'm still undecided whether this change is really a good thing.  I
think we have line-of-sight to making vmalloc (and thus kvmalloc)
usable from interrupt context, and this destroys that possibility.

I wonder if we can't do something like prefaulting the page before
taking the spinlock, then use copy_page_to_iter_atomic()
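
Roughly the shape that suggestion would take (a hypothetical sketch only:
fault_in_iov_iter_writeable() exists in lib/iov_iter.c, while
copy_page_to_iter_atomic() is the helper named above and its signature is
assumed here; page/offset/tsz are placeholders):

        /* Prefault the destination pages so the copy below should not fault. */
        if (fault_in_iov_iter_writeable(iter, tsz))
                return -EFAULT;

        spin_lock(&vmap_area_lock);
        /* Must not fault or sleep while the spinlock is held. */
        copied = copy_page_to_iter_atomic(page, offset, tsz, iter);
        spin_unlock(&vmap_area_lock);

        /*
         * If the pages were reclaimed between the two steps, copied < tsz
         * and the caller would need to drop the lock and retry.
         */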



* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-19 20:47       ` Matthew Wilcox
@ 2023-03-19 21:16         ` Lorenzo Stoakes
  2023-03-20  8:40           ` Lorenzo Stoakes
  0 siblings, 1 reply; 33+ messages in thread
From: Lorenzo Stoakes @ 2023-03-19 21:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-fsdevel, Baoquan He,
	Uladzislau Rezki, David Hildenbrand, Liu Shixin, Jiri Olsa

On Sun, Mar 19, 2023 at 08:47:14PM +0000, Matthew Wilcox wrote:
> On Sun, Mar 19, 2023 at 08:29:16PM +0000, Lorenzo Stoakes wrote:
> > The basis for saying asynchronous was based on Documentation/filesystems/vfs.rst
> > describing read_iter() as 'possibly asynchronous read with iov_iter as
> > destination', and read_iter() is what is (now) invoked when accessing
> > /proc/kcore.
> >
> > However I agree this is vague and it is clearer to refer to the fact that we are
> > now directly writing to user memory and thus wish to avoid spinlocks as we may
> > need to fault in user memory in doing so.
> >
> > Would it be ok for you to go ahead and replace that final paragraph with the
> > below?:-
> >
> > The reason for making this change is to build a basis for vread() to write
> > to user memory directly via an iterator; as a result we may cause page
> > faults during which we must not hold a spinlock. Doing this eliminates the
> > need for a bounce buffer in read_kcore() and thus permits that to be
> > converted to also use an iterator, as a read_iter() handler.
>
> I'd say the purpose of the iterator is to abstract whether we're
> accessing user memory, kernel memory or a pipe, so I'd suggest:
>
>    The reason for making this change is to build a basis for vread() to
>    write to memory via an iterator; as a result we may cause page faults
>    during which we must not hold a spinlock. Doing this eliminates the
>    need for a bounce buffer in read_kcore() and thus permits that to be
>    converted to also use an iterator, as a read_iter() handler.
>

Thanks, sorry I missed the detail about iterators abstracting the three
different targets there; that is definitely better!

> I'm still undecided whether this change is really a good thing.  I
> think we have line-of-sight to making vmalloc (and thus kvmalloc)
> usable from interrupt context, and this destroys that possibility.
>
> I wonder if we can't do something like prefaulting the page before
> taking the spinlock, then use copy_page_to_iter_atomic()

There are a number of aspects of vmalloc that are not atomic-safe,
e.g. alloc_vmap_area() and vmap_range_noflush() are marked might_sleep(),
as is vfree().

So I feel that making it safe for atomic context requires a bit more of a
general rework. Given we would be able to revisit lock types at the point
we do that (something that would fit very solidly into the context of any
such change), and given that this patch series establishes that we use an
iterator, I think it is useful to keep this as-is and defer that change
until later.



* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-19  7:09 ` [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock Lorenzo Stoakes
  2023-03-19 20:10   ` Andrew Morton
@ 2023-03-20  7:54   ` Uladzislau Rezki
  2023-03-20  8:25     ` Lorenzo Stoakes
  2023-03-21  1:09   ` Dave Chinner
  2 siblings, 1 reply; 33+ messages in thread
From: Uladzislau Rezki @ 2023-03-20  7:54 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, linux-fsdevel, Andrew Morton, Baoquan He,
	Uladzislau Rezki, Matthew Wilcox, David Hildenbrand, Liu Shixin,
	Jiri Olsa

> vmalloc() is, by design, not permitted to be used in atomic context and
> already contains components which may sleep, so avoiding spin locks is not
> a problem from the perspective of atomic context.
> 
> The global vmap_area_lock is held when the red/black tree rooted in
> vmap_are_root is accessed and thus is rather long-held and under
> potentially high contention. It is likely to be under contention for reads
> rather than write, so replace it with a rwsem.
> 
> Each individual vmap_block->lock is likely to be held for less time but
> under low contention, so a mutex is not an outrageous choice here.
> 
> A subset of test_vmalloc.sh performance results:-
> 
> fix_size_alloc_test             0.40%
> full_fit_alloc_test		2.08%
> long_busy_list_alloc_test	0.34%
> random_size_alloc_test		-0.25%
> random_size_align_alloc_test	0.06%
> ...
> all tests cycles                0.2%
> 
> This represents a tiny reduction in performance that sits barely above
> noise.
> 
How important is it to have many simultaneous users of vread()? I do not see a
big reason to switch to mutexes, given the performance impact and that it makes
things less atomic.

So, how important is it for you to have this change?

--
Uladzislau Rezki



* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-20  7:54   ` Uladzislau Rezki
@ 2023-03-20  8:25     ` Lorenzo Stoakes
  2023-03-20  8:32       ` Uladzislau Rezki
  0 siblings, 1 reply; 33+ messages in thread
From: Lorenzo Stoakes @ 2023-03-20  8:25 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: linux-mm, linux-kernel, linux-fsdevel, Andrew Morton, Baoquan He,
	Matthew Wilcox, David Hildenbrand, Liu Shixin, Jiri Olsa

On Mon, Mar 20, 2023 at 08:54:33AM +0100, Uladzislau Rezki wrote:
> > vmalloc() is, by design, not permitted to be used in atomic context and
> > already contains components which may sleep, so avoiding spin locks is not
> > a problem from the perspective of atomic context.
> >
> > The global vmap_area_lock is held when the red/black tree rooted in
> > vmap_are_root is accessed and thus is rather long-held and under
> > potentially high contention. It is likely to be under contention for reads
> > rather than write, so replace it with a rwsem.
> >
> > Each individual vmap_block->lock is likely to be held for less time but
> > under low contention, so a mutex is not an outrageous choice here.
> >
> > A subset of test_vmalloc.sh performance results:-
> >
> > fix_size_alloc_test             0.40%
> > full_fit_alloc_test		2.08%
> > long_busy_list_alloc_test	0.34%
> > random_size_alloc_test		-0.25%
> > random_size_align_alloc_test	0.06%
> > ...
> > all tests cycles                0.2%
> >
> > This represents a tiny reduction in performance that sits barely above
> > noise.
> >
> How important to have many simultaneous users of vread()? I do not see a
> big reason to switch into mutexes due to performance impact and making it
> less atomic.

It's less about simultaneous users of vread() and more about being able to write
direct to user memory rather than via a bounce buffer and not hold a spinlock
over possible page faults.

The performance impact is barely above noise (I got fairly widely varying
results), so I don't think it's really much of a cost at all. I can't imagine
there are many users critically dependent on a sub-single digit % reduction in
speed in vmalloc() allocation.

As I was saying to Willy, the code is already not atomic, or rather needs rework
to become atomic-safe (there are a smattering of might_sleep()'s throughout)

However, given your objection alongside Willy's, let me examine Willy's
suggestion that we instead of doing this, prefault the user memory in advance of
the vread call.

>
> So, how important for you to have this change?
>

Personally, always very important :)

> --
> Uladzislau Rezki



* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-20  8:25     ` Lorenzo Stoakes
@ 2023-03-20  8:32       ` Uladzislau Rezki
  2023-03-20  8:35         ` Lorenzo Stoakes
  0 siblings, 1 reply; 33+ messages in thread
From: Uladzislau Rezki @ 2023-03-20  8:32 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Uladzislau Rezki, linux-mm, linux-kernel, linux-fsdevel,
	Andrew Morton, Baoquan He, Matthew Wilcox, David Hildenbrand,
	Liu Shixin, Jiri Olsa

On Mon, Mar 20, 2023 at 08:25:32AM +0000, Lorenzo Stoakes wrote:
> On Mon, Mar 20, 2023 at 08:54:33AM +0100, Uladzislau Rezki wrote:
> > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > already contains components which may sleep, so avoiding spin locks is not
> > > a problem from the perspective of atomic context.
> > >
> > > The global vmap_area_lock is held when the red/black tree rooted in
> > > vmap_are_root is accessed and thus is rather long-held and under
> > > potentially high contention. It is likely to be under contention for reads
> > > rather than write, so replace it with a rwsem.
> > >
> > > Each individual vmap_block->lock is likely to be held for less time but
> > > under low contention, so a mutex is not an outrageous choice here.
> > >
> > > A subset of test_vmalloc.sh performance results:-
> > >
> > > fix_size_alloc_test             0.40%
> > > full_fit_alloc_test		2.08%
> > > long_busy_list_alloc_test	0.34%
> > > random_size_alloc_test		-0.25%
> > > random_size_align_alloc_test	0.06%
> > > ...
> > > all tests cycles                0.2%
> > >
> > > This represents a tiny reduction in performance that sits barely above
> > > noise.
> > >
> > How important to have many simultaneous users of vread()? I do not see a
> > big reason to switch into mutexes due to performance impact and making it
> > less atomic.
> 
> It's less about simultaneous users of vread() and more about being able to write
> direct to user memory rather than via a bounce buffer and not hold a spinlock
> over possible page faults.
> 
> The performance impact is barely above noise (I got fairly widely varying
> results), so I don't think it's really much of a cost at all. I can't imagine
> there are many users critically dependent on a sub-single digit % reduction in
> speed in vmalloc() allocation.
> 
> As I was saying to Willy, the code is already not atomic, or rather needs rework
> to become atomic-safe (there are a smattering of might_sleep()'s throughout)
> 
> However, given your objection alongside Willy's, let me examine Willy's
> suggestion that we instead of doing this, prefault the user memory in advance of
> the vread call.
> 
Just a quick perf test shows a regression of around 6%. 10 workers, test_mask is 31:

# default
[  140.349731] All test took worker0=485061693537 cycles
[  140.386065] All test took worker1=486504572954 cycles
[  140.418452] All test took worker2=467204082542 cycles
[  140.435895] All test took worker3=512591010219 cycles
[  140.458316] All test took worker4=448583324125 cycles
[  140.494244] All test took worker5=501018129647 cycles
[  140.518144] All test took worker6=516224787767 cycles
[  140.535472] All test took worker7=442025617137 cycles
[  140.558249] All test took worker8=503337286539 cycles
[  140.590571] All test took worker9=494369561574 cycles

# patch
[  144.464916] All test took worker0=530373399067 cycles
[  144.492904] All test took worker1=522641540924 cycles
[  144.528999] All test took worker2=529711158267 cycles
[  144.552963] All test took worker3=527389011775 cycles
[  144.592951] All test took worker4=529583252449 cycles
[  144.610286] All test took worker5=523605706016 cycles
[  144.627690] All test took worker6=531494777011 cycles
[  144.653046] All test took worker7=527150114726 cycles
[  144.669818] All test took worker8=526599712235 cycles
[  144.693428] All test took worker9=526057490851 cycles

> >
> > So, how important for you to have this change?
> >
> 
> Personally, always very important :)
> 
This is good. Personal opinion always wins :)

--
Uladzislau Rezki



* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-20  8:32       ` Uladzislau Rezki
@ 2023-03-20  8:35         ` Lorenzo Stoakes
  2023-03-20 11:20           ` Uladzislau Rezki
  0 siblings, 1 reply; 33+ messages in thread
From: Lorenzo Stoakes @ 2023-03-20  8:35 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: linux-mm, linux-kernel, linux-fsdevel, Andrew Morton, Baoquan He,
	Matthew Wilcox, David Hildenbrand, Liu Shixin, Jiri Olsa

On Mon, Mar 20, 2023 at 09:32:06AM +0100, Uladzislau Rezki wrote:
> On Mon, Mar 20, 2023 at 08:25:32AM +0000, Lorenzo Stoakes wrote:
> > On Mon, Mar 20, 2023 at 08:54:33AM +0100, Uladzislau Rezki wrote:
> > > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > > already contains components which may sleep, so avoiding spin locks is not
> > > > a problem from the perspective of atomic context.
> > > >
> > > > The global vmap_area_lock is held when the red/black tree rooted in
> > > > vmap_are_root is accessed and thus is rather long-held and under
> > > > potentially high contention. It is likely to be under contention for reads
> > > > rather than write, so replace it with a rwsem.
> > > >
> > > > Each individual vmap_block->lock is likely to be held for less time but
> > > > under low contention, so a mutex is not an outrageous choice here.
> > > >
> > > > A subset of test_vmalloc.sh performance results:-
> > > >
> > > > fix_size_alloc_test             0.40%
> > > > full_fit_alloc_test		2.08%
> > > > long_busy_list_alloc_test	0.34%
> > > > random_size_alloc_test		-0.25%
> > > > random_size_align_alloc_test	0.06%
> > > > ...
> > > > all tests cycles                0.2%
> > > >
> > > > This represents a tiny reduction in performance that sits barely above
> > > > noise.
> > > >
> > > How important to have many simultaneous users of vread()? I do not see a
> > > big reason to switch into mutexes due to performance impact and making it
> > > less atomic.
> >
> > It's less about simultaneous users of vread() and more about being able to write
> > direct to user memory rather than via a bounce buffer and not hold a spinlock
> > over possible page faults.
> >
> > The performance impact is barely above noise (I got fairly widely varying
> > results), so I don't think it's really much of a cost at all. I can't imagine
> > there are many users critically dependent on a sub-single digit % reduction in
> > speed in vmalloc() allocation.
> >
> > As I was saying to Willy, the code is already not atomic, or rather needs rework
> > to become atomic-safe (there are a smattering of might_sleep()'s throughout)
> >
> > However, given your objection alongside Willy's, let me examine Willy's
> > suggestion that we instead of doing this, prefault the user memory in advance of
> > the vread call.
> >
> Just a quick perf tests shows regression around 6%. 10 workers test_mask is 31:
>
> # default
> [  140.349731] All test took worker0=485061693537 cycles
> [  140.386065] All test took worker1=486504572954 cycles
> [  140.418452] All test took worker2=467204082542 cycles
> [  140.435895] All test took worker3=512591010219 cycles
> [  140.458316] All test took worker4=448583324125 cycles
> [  140.494244] All test took worker5=501018129647 cycles
> [  140.518144] All test took worker6=516224787767 cycles
> [  140.535472] All test took worker7=442025617137 cycles
> [  140.558249] All test took worker8=503337286539 cycles
> [  140.590571] All test took worker9=494369561574 cycles
>
> # patch
> [  144.464916] All test took worker0=530373399067 cycles
> [  144.492904] All test took worker1=522641540924 cycles
> [  144.528999] All test took worker2=529711158267 cycles
> [  144.552963] All test took worker3=527389011775 cycles
> [  144.592951] All test took worker4=529583252449 cycles
> [  144.610286] All test took worker5=523605706016 cycles
> [  144.627690] All test took worker6=531494777011 cycles
> [  144.653046] All test took worker7=527150114726 cycles
> [  144.669818] All test took worker8=526599712235 cycles
> [  144.693428] All test took worker9=526057490851 cycles
>

OK ouch, that's worse than I observed! Let me try this prefault approach and
then we can revert back to spinlocks.

> > >
> > > So, how important for you to have this change?
> > >
> >
> > Personally, always very important :)
> >
> This is good. Personal opinion always wins :)
>

The heart always wins ;) well, an adaption here can make everybody's hearts
happy I think.

> --
> Uladzislau Rezki


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-19 21:16         ` Lorenzo Stoakes
@ 2023-03-20  8:40           ` Lorenzo Stoakes
  0 siblings, 0 replies; 33+ messages in thread
From: Lorenzo Stoakes @ 2023-03-20  8:40 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-fsdevel, Baoquan He,
	Uladzislau Rezki, David Hildenbrand, Liu Shixin, Jiri Olsa

On Sun, Mar 19, 2023 at 09:16:18PM +0000, Lorenzo Stoakes wrote:
> On Sun, Mar 19, 2023 at 08:47:14PM +0000, Matthew Wilcox wrote:
> > I wonder if we can't do something like prefaulting the page before
> > taking the spinlock, then use copy_page_to_iter_atomic()
>
<snip>

On second thoughts, and after discussing with Uladzislau, this seems like a more
sensible approach.

I'm going to respin with prefaulting and revert to the previous locking.
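
For the record, the rough shape of that approach looks something like the sketch
below. This is illustrative only and not the actual respin:
fault_in_iov_iter_writeable() is the existing prefault helper, and the
non-faulting copy under the lock is assumed to be something like the
copy_page_to_iter_atomic() you mention above.

<snip>
/*
 * Illustrative sketch only, not the real patch: fault the destination
 * pages in up front, while we are still allowed to sleep, then take the
 * spinlock and copy with a primitive that is not allowed to fault.
 */
static ssize_t vread_prefaulted(struct iov_iter *iter, const char *addr,
				size_t count)
{
	size_t copied;

	/* May sleep: touch the user destination before taking the lock. */
	if (fault_in_iov_iter_writeable(iter, count))
		return -EFAULT;

	spin_lock(&vmap_area_lock);
	/*
	 * Walk the vmap areas covering [addr, addr + count) and copy each
	 * chunk with a non-faulting helper such as copy_page_to_iter_atomic();
	 * elided here. A short copy means a prefaulted page went away in the
	 * meantime, in which case the caller can prefault and retry.
	 */
	copied = count;	/* placeholder for the elided walk */
	spin_unlock(&vmap_area_lock);

	return copied;
}
<snip>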


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 1/4] fs/proc/kcore: Avoid bounce buffer for ktext data
  2023-03-19  7:09 ` [PATCH v2 1/4] fs/proc/kcore: Avoid bounce buffer for ktext data Lorenzo Stoakes
@ 2023-03-20  9:58   ` David Hildenbrand
  0 siblings, 0 replies; 33+ messages in thread
From: David Hildenbrand @ 2023-03-20  9:58 UTC (permalink / raw)
  To: Lorenzo Stoakes, linux-mm, linux-kernel, linux-fsdevel, Andrew Morton
  Cc: Baoquan He, Uladzislau Rezki, Matthew Wilcox, Liu Shixin, Jiri Olsa

On 19.03.23 08:09, Lorenzo Stoakes wrote:
> Commit df04abfd181a ("fs/proc/kcore.c: Add bounce buffer for ktext data")
> introduced the use of a bounce buffer to retrieve kernel text data for
> /proc/kcore in order to avoid failures arising from hardened user copies
> enabled by CONFIG_HARDENED_USERCOPY in check_kernel_text_object().
> 
> We can avoid doing this if instead of copy_to_user() we use _copy_to_user()
> which bypasses the hardening check. This is more efficient than using a
> bounce buffer and simplifies the code.
> 
> We do so as part an overall effort to eliminate bounce buffer usage in the
> function with an eye to converting it an iterator read.
> 
> Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
> ---
>   fs/proc/kcore.c | 17 +++++------------
>   1 file changed, 5 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> index 71157ee35c1a..556f310d6aa4 100644
> --- a/fs/proc/kcore.c
> +++ b/fs/proc/kcore.c
> @@ -541,19 +541,12 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
>   		case KCORE_VMEMMAP:
>   		case KCORE_TEXT:
>   			/*
> -			 * Using bounce buffer to bypass the
> -			 * hardened user copy kernel text checks.
> +			 * We use _copy_to_user() to bypass usermode hardening
> +			 * which would otherwise prevent this operation.
>   			 */
> -			if (copy_from_kernel_nofault(buf, (void *)start, tsz)) {
> -				if (clear_user(buffer, tsz)) {
> -					ret = -EFAULT;
> -					goto out;
> -				}
> -			} else {
> -				if (copy_to_user(buffer, buf, tsz)) {
> -					ret = -EFAULT;
> -					goto out;
> -				}
> +			if (_copy_to_user(buffer, (char *)start, tsz)) {
> +				ret = -EFAULT;
> +				goto out;
>   			}
>   			break;
>   		default:

Looks correct to me. The only difference between copy_to_user() and
_copy_to_user() is the check_copy_size() check, which ends up calling
check_kernel_text_object().
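
For reference, the wrapper in include/linux/uaccess.h is roughly the following
(paraphrased, so treat it as a sketch rather than the exact source):

<snip>
/*
 * Sketch of the wrapper: copy_to_user() is _copy_to_user() plus the
 * check_copy_size() gate, which with CONFIG_HARDENED_USERCOPY ends up
 * in check_kernel_text_object() and rejects copies from kernel text.
 */
static __always_inline unsigned long __must_check
copy_to_user(void __user *to, const void *from, unsigned long n)
{
	if (check_copy_size(from, n, true))
		n = _copy_to_user(to, from, n);
	return n;
}
<snip>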

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-20  8:35         ` Lorenzo Stoakes
@ 2023-03-20 11:20           ` Uladzislau Rezki
  0 siblings, 0 replies; 33+ messages in thread
From: Uladzislau Rezki @ 2023-03-20 11:20 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Uladzislau Rezki, linux-mm, linux-kernel, linux-fsdevel,
	Andrew Morton, Baoquan He, Matthew Wilcox, David Hildenbrand,
	Liu Shixin, Jiri Olsa

On Mon, Mar 20, 2023 at 08:35:11AM +0000, Lorenzo Stoakes wrote:
> On Mon, Mar 20, 2023 at 09:32:06AM +0100, Uladzislau Rezki wrote:
> > On Mon, Mar 20, 2023 at 08:25:32AM +0000, Lorenzo Stoakes wrote:
> > > On Mon, Mar 20, 2023 at 08:54:33AM +0100, Uladzislau Rezki wrote:
> > > > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > > > already contains components which may sleep, so avoiding spin locks is not
> > > > > a problem from the perspective of atomic context.
> > > > >
> > > > > The global vmap_area_lock is held when the red/black tree rooted in
> > > > > vmap_are_root is accessed and thus is rather long-held and under
> > > > > potentially high contention. It is likely to be under contention for reads
> > > > > rather than write, so replace it with a rwsem.
> > > > >
> > > > > Each individual vmap_block->lock is likely to be held for less time but
> > > > > under low contention, so a mutex is not an outrageous choice here.
> > > > >
> > > > > A subset of test_vmalloc.sh performance results:-
> > > > >
> > > > > fix_size_alloc_test             0.40%
> > > > > full_fit_alloc_test		2.08%
> > > > > long_busy_list_alloc_test	0.34%
> > > > > random_size_alloc_test		-0.25%
> > > > > random_size_align_alloc_test	0.06%
> > > > > ...
> > > > > all tests cycles                0.2%
> > > > >
> > > > > This represents a tiny reduction in performance that sits barely above
> > > > > noise.
> > > > >
> > > > How important to have many simultaneous users of vread()? I do not see a
> > > > big reason to switch into mutexes due to performance impact and making it
> > > > less atomic.
> > >
> > > It's less about simultaneous users of vread() and more about being able to write
> > > direct to user memory rather than via a bounce buffer and not hold a spinlock
> > > over possible page faults.
> > >
> > > The performance impact is barely above noise (I got fairly widely varying
> > > results), so I don't think it's really much of a cost at all. I can't imagine
> > > there are many users critically dependent on a sub-single digit % reduction in
> > > speed in vmalloc() allocation.
> > >
> > > As I was saying to Willy, the code is already not atomic, or rather needs rework
> > > to become atomic-safe (there are a smattering of might_sleep()'s throughout)
> > >
> > > However, given your objection alongside Willy's, let me examine Willy's
> > > suggestion that we instead of doing this, prefault the user memory in advance of
> > > the vread call.
> > >
> > Just a quick perf tests shows regression around 6%. 10 workers test_mask is 31:
> >
> > # default
> > [  140.349731] All test took worker0=485061693537 cycles
> > [  140.386065] All test took worker1=486504572954 cycles
> > [  140.418452] All test took worker2=467204082542 cycles
> > [  140.435895] All test took worker3=512591010219 cycles
> > [  140.458316] All test took worker4=448583324125 cycles
> > [  140.494244] All test took worker5=501018129647 cycles
> > [  140.518144] All test took worker6=516224787767 cycles
> > [  140.535472] All test took worker7=442025617137 cycles
> > [  140.558249] All test took worker8=503337286539 cycles
> > [  140.590571] All test took worker9=494369561574 cycles
> >
> > # patch
> > [  144.464916] All test took worker0=530373399067 cycles
> > [  144.492904] All test took worker1=522641540924 cycles
> > [  144.528999] All test took worker2=529711158267 cycles
> > [  144.552963] All test took worker3=527389011775 cycles
> > [  144.592951] All test took worker4=529583252449 cycles
> > [  144.610286] All test took worker5=523605706016 cycles
> > [  144.627690] All test took worker6=531494777011 cycles
> > [  144.653046] All test took worker7=527150114726 cycles
> > [  144.669818] All test took worker8=526599712235 cycles
> > [  144.693428] All test took worker9=526057490851 cycles
> >
> 
> OK ouch, that's worse than I observed! Let me try this prefault approach and
> then we can revert back to spinlocks.
> 
> > > >
> > > > So, how important for you to have this change?
> > > >
> > >
> > > Personally, always very important :)
> > >
> > This is good. Personal opinion always wins :)
> >
> 
> The heart always wins ;) well, an adaption here can make everybody's hearts
> happy I think.
> 
Totally agree :)

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-19  7:09 ` [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock Lorenzo Stoakes
  2023-03-19 20:10   ` Andrew Morton
  2023-03-20  7:54   ` Uladzislau Rezki
@ 2023-03-21  1:09   ` Dave Chinner
  2023-03-21  5:23     ` Uladzislau Rezki
  2023-03-22 13:18     ` Uladzislau Rezki
  2 siblings, 2 replies; 33+ messages in thread
From: Dave Chinner @ 2023-03-21  1:09 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-mm, linux-kernel, linux-fsdevel, Andrew Morton, Baoquan He,
	Uladzislau Rezki, Matthew Wilcox, David Hildenbrand, Liu Shixin,
	Jiri Olsa

On Sun, Mar 19, 2023 at 07:09:31AM +0000, Lorenzo Stoakes wrote:
> vmalloc() is, by design, not permitted to be used in atomic context and
> already contains components which may sleep, so avoiding spin locks is not
> a problem from the perspective of atomic context.
> 
> The global vmap_area_lock is held when the red/black tree rooted in
> vmap_are_root is accessed and thus is rather long-held and under
> potentially high contention. It is likely to be under contention for reads
> rather than write, so replace it with a rwsem.
> 
> Each individual vmap_block->lock is likely to be held for less time but
> under low contention, so a mutex is not an outrageous choice here.
> 
> A subset of test_vmalloc.sh performance results:-
> 
> fix_size_alloc_test             0.40%
> full_fit_alloc_test		2.08%
> long_busy_list_alloc_test	0.34%
> random_size_alloc_test		-0.25%
> random_size_align_alloc_test	0.06%
> ...
> all tests cycles                0.2%
> 
> This represents a tiny reduction in performance that sits barely above
> noise.

I'm travelling right now, but give me a few days and I'll test this
against the XFS workloads that hammer the global vmalloc spin lock
really, really badly. XFS can use vm_map_ram and vmalloc really
heavily for metadata buffers and hit the global spin lock from every
CPU in the system at the same time (i.e. highly concurrent
workloads). vmalloc is also heavily used in the hottest path
through the journal where we process and calculate delta changes to
several million items every second, again spread across every CPU in
the system at the same time.

We really need the global spinlock to go away completely, but in the
mean time a shared read lock should help a little bit....

-Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-21  1:09   ` Dave Chinner
@ 2023-03-21  5:23     ` Uladzislau Rezki
  2023-03-21  7:45       ` Lorenzo Stoakes
  2023-03-22 13:18     ` Uladzislau Rezki
  1 sibling, 1 reply; 33+ messages in thread
From: Uladzislau Rezki @ 2023-03-21  5:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Lorenzo Stoakes, linux-mm, linux-kernel, linux-fsdevel,
	Andrew Morton, Baoquan He, Uladzislau Rezki, Matthew Wilcox,
	David Hildenbrand, Liu Shixin, Jiri Olsa

On Tue, Mar 21, 2023 at 12:09:12PM +1100, Dave Chinner wrote:
> On Sun, Mar 19, 2023 at 07:09:31AM +0000, Lorenzo Stoakes wrote:
> > vmalloc() is, by design, not permitted to be used in atomic context and
> > already contains components which may sleep, so avoiding spin locks is not
> > a problem from the perspective of atomic context.
> > 
> > The global vmap_area_lock is held when the red/black tree rooted in
> > vmap_are_root is accessed and thus is rather long-held and under
> > potentially high contention. It is likely to be under contention for reads
> > rather than write, so replace it with a rwsem.
> > 
> > Each individual vmap_block->lock is likely to be held for less time but
> > under low contention, so a mutex is not an outrageous choice here.
> > 
> > A subset of test_vmalloc.sh performance results:-
> > 
> > fix_size_alloc_test             0.40%
> > full_fit_alloc_test		2.08%
> > long_busy_list_alloc_test	0.34%
> > random_size_alloc_test		-0.25%
> > random_size_align_alloc_test	0.06%
> > ...
> > all tests cycles                0.2%
> > 
> > This represents a tiny reduction in performance that sits barely above
> > noise.
> 
> I'm travelling right now, but give me a few days and I'll test this
> against the XFS workloads that hammer the global vmalloc spin lock
> really, really badly. XFS can use vm_map_ram and vmalloc really
> heavily for metadata buffers and hit the global spin lock from every
> CPU in the system at the same time (i.e. highly concurrent
> workloads). vmalloc is also heavily used in the hottest path
> throught the journal where we process and calculate delta changes to
> several million items every second, again spread across every CPU in
> the system at the same time.
> 
> We really need the global spinlock to go away completely, but in the
> mean time a shared read lock should help a little bit....
> 
I am working on it. I submitted a proposal for how to eliminate it:


<snip>
Hello, LSF.

Title: Introduce a per-cpu-vmap-cache to eliminate a vmap lock contention

Description:
 Currently the vmap code does not scale with the number of CPU cores in a
 system because the global vmap space is protected by a single spinlock. Such
 an approach has a clear bottleneck when many CPUs simultaneously access one
 resource.

 In this talk I would like to describe the drawback and show some data about
 the contention and the places in the code where it occurs. Apart from that,
 I would like to share ideas on how to eliminate it, presenting a few
 approaches and comparing them.

Requirements:
 * It should be a per-cpu approach;
 * Search of freed ptrs should not interfere with other freeing (as much as we can);
 *   - offload allocated areas (busy ones) per-cpu;
 * Cache ready-sized objects or merge them into one big per-cpu space (split on demand);
 * Lazily-freed areas are either drained per-cpu individually or by one CPU for all;
 * Prefetch a fixed size in front and allocate per-cpu

Goals:
 * Implement a per-cpu way of allocation to eliminate a contention.

Thanks!
<snip>
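
Purely as an illustration of the kind of structure the proposal is hinting at,
and not anything that exists in mm/vmalloc.c today, a per-cpu cache could be
shaped roughly like this:

<snip>
/*
 * Hypothetical sketch only; none of these names exist in the tree.
 * Each CPU keeps a small cache of ready-sized vmap areas plus its own
 * lazy-free list, so the common alloc/free path never has to touch the
 * global rb-tree or its spinlock.
 */
struct vmap_cpu_cache {
	spinlock_t		lock;		/* local, rarely contended */
	struct list_head	free_areas;	/* cached, ready-sized areas */
	struct list_head	lazy_free;	/* drained per-cpu or by one CPU for all */
	unsigned long		prefetched;	/* space pre-reserved from the global area */
};

static DEFINE_PER_CPU(struct vmap_cpu_cache, vmap_cpu_cache);
<snip>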

--
Uladzislau Rezki



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-21  5:23     ` Uladzislau Rezki
@ 2023-03-21  7:45       ` Lorenzo Stoakes
  2023-03-21  8:54         ` Uladzislau Rezki
  2023-03-21 10:05         ` Dave Chinner
  0 siblings, 2 replies; 33+ messages in thread
From: Lorenzo Stoakes @ 2023-03-21  7:45 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Dave Chinner, linux-mm, linux-kernel, linux-fsdevel,
	Andrew Morton, Baoquan He, Matthew Wilcox, David Hildenbrand,
	Liu Shixin, Jiri Olsa

On Tue, Mar 21, 2023 at 06:23:39AM +0100, Uladzislau Rezki wrote:
> On Tue, Mar 21, 2023 at 12:09:12PM +1100, Dave Chinner wrote:
> > On Sun, Mar 19, 2023 at 07:09:31AM +0000, Lorenzo Stoakes wrote:
> > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > already contains components which may sleep, so avoiding spin locks is not
> > > a problem from the perspective of atomic context.
> > >
> > > The global vmap_area_lock is held when the red/black tree rooted in
> > > vmap_are_root is accessed and thus is rather long-held and under
> > > potentially high contention. It is likely to be under contention for reads
> > > rather than write, so replace it with a rwsem.
> > >
> > > Each individual vmap_block->lock is likely to be held for less time but
> > > under low contention, so a mutex is not an outrageous choice here.
> > >
> > > A subset of test_vmalloc.sh performance results:-
> > >
> > > fix_size_alloc_test             0.40%
> > > full_fit_alloc_test		2.08%
> > > long_busy_list_alloc_test	0.34%
> > > random_size_alloc_test		-0.25%
> > > random_size_align_alloc_test	0.06%
> > > ...
> > > all tests cycles                0.2%
> > >
> > > This represents a tiny reduction in performance that sits barely above
> > > noise.
> >
> > I'm travelling right now, but give me a few days and I'll test this
> > against the XFS workloads that hammer the global vmalloc spin lock
> > really, really badly. XFS can use vm_map_ram and vmalloc really
> > heavily for metadata buffers and hit the global spin lock from every
> > CPU in the system at the same time (i.e. highly concurrent
> > workloads). vmalloc is also heavily used in the hottest path
> > throught the journal where we process and calculate delta changes to
> > several million items every second, again spread across every CPU in
> > the system at the same time.
> >
> > We really need the global spinlock to go away completely, but in the
> > mean time a shared read lock should help a little bit....
> >

Hugely appreciated Dave, however I must disappoint on the rwsem as I have now
reworked my patch set to use the original locks in order to satisfy Willy's
desire to make vmalloc atomic in future, and Uladzislau's desire to not have a
~6% performance hit -
https://lore.kernel.org/all/cover.1679354384.git.lstoakes@gmail.com/

> I am working on it. I submitted a proposal how to eliminate it:
>
>
> <snip>
> Hello, LSF.
>
> Title: Introduce a per-cpu-vmap-cache to eliminate a vmap lock contention
>
> Description:
>  Currently the vmap code is not scaled to number of CPU cores in a system
>  because a global vmap space is protected by a single spinlock. Such approach
>  has a clear bottleneck if many CPUs simultaneously access to one resource.
>
>  In this talk i would like to describe a drawback, show some data related
>  to contentions and places where those occur in a code. Apart of that i
>  would like to share ideas how to eliminate it providing a few approaches
>  and compare them.
>
> Requirements:
>  * It should be a per-cpu approach;
>  * Search of freed ptrs should not interfere with other freeing(as much as we can);
>  *   - offload allocated areas(buzy ones) per-cpu;
>  * Cache ready sized objects or merge them into one big per-cpu-space(split on demand);
>  * Lazily-freed areas either drained per-cpu individually or by one CPU for all;
>  * Prefetch a fixed size in front and allocate per-cpu
>
> Goals:
>  * Implement a per-cpu way of allocation to eliminate a contention.
>
> Thanks!
> <snip>
>
> --
> Uladzislau Rezki
>

That's really awesome! I will come to that talk at LSF/MM :) being able to
sustain the lock in atomic context seems to be an aspect that is important going
forward also.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-21  7:45       ` Lorenzo Stoakes
@ 2023-03-21  8:54         ` Uladzislau Rezki
  2023-03-21 10:05         ` Dave Chinner
  1 sibling, 0 replies; 33+ messages in thread
From: Uladzislau Rezki @ 2023-03-21  8:54 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Uladzislau Rezki, Dave Chinner, linux-mm, linux-kernel,
	linux-fsdevel, Andrew Morton, Baoquan He, Matthew Wilcox,
	David Hildenbrand, Liu Shixin, Jiri Olsa

On Tue, Mar 21, 2023 at 07:45:56AM +0000, Lorenzo Stoakes wrote:
> On Tue, Mar 21, 2023 at 06:23:39AM +0100, Uladzislau Rezki wrote:
> > On Tue, Mar 21, 2023 at 12:09:12PM +1100, Dave Chinner wrote:
> > > On Sun, Mar 19, 2023 at 07:09:31AM +0000, Lorenzo Stoakes wrote:
> > > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > > already contains components which may sleep, so avoiding spin locks is not
> > > > a problem from the perspective of atomic context.
> > > >
> > > > The global vmap_area_lock is held when the red/black tree rooted in
> > > > vmap_are_root is accessed and thus is rather long-held and under
> > > > potentially high contention. It is likely to be under contention for reads
> > > > rather than write, so replace it with a rwsem.
> > > >
> > > > Each individual vmap_block->lock is likely to be held for less time but
> > > > under low contention, so a mutex is not an outrageous choice here.
> > > >
> > > > A subset of test_vmalloc.sh performance results:-
> > > >
> > > > fix_size_alloc_test             0.40%
> > > > full_fit_alloc_test		2.08%
> > > > long_busy_list_alloc_test	0.34%
> > > > random_size_alloc_test		-0.25%
> > > > random_size_align_alloc_test	0.06%
> > > > ...
> > > > all tests cycles                0.2%
> > > >
> > > > This represents a tiny reduction in performance that sits barely above
> > > > noise.
> > >
> > > I'm travelling right now, but give me a few days and I'll test this
> > > against the XFS workloads that hammer the global vmalloc spin lock
> > > really, really badly. XFS can use vm_map_ram and vmalloc really
> > > heavily for metadata buffers and hit the global spin lock from every
> > > CPU in the system at the same time (i.e. highly concurrent
> > > workloads). vmalloc is also heavily used in the hottest path
> > > throught the journal where we process and calculate delta changes to
> > > several million items every second, again spread across every CPU in
> > > the system at the same time.
> > >
> > > We really need the global spinlock to go away completely, but in the
> > > mean time a shared read lock should help a little bit....
> > >
> 
> Hugely appreciated Dave, however I must disappoint on the rwsem as I have now
> reworked my patch set to use the original locks in order to satisfy Willy's
> desire to make vmalloc atomic in future, and Uladzislau's desire to not have a
> ~6% performance hit -
> https://lore.kernel.org/all/cover.1679354384.git.lstoakes@gmail.com/
> 
> > I am working on it. I submitted a proposal how to eliminate it:
> >
> >
> > <snip>
> > Hello, LSF.
> >
> > Title: Introduce a per-cpu-vmap-cache to eliminate a vmap lock contention
> >
> > Description:
> >  Currently the vmap code is not scaled to number of CPU cores in a system
> >  because a global vmap space is protected by a single spinlock. Such approach
> >  has a clear bottleneck if many CPUs simultaneously access to one resource.
> >
> >  In this talk i would like to describe a drawback, show some data related
> >  to contentions and places where those occur in a code. Apart of that i
> >  would like to share ideas how to eliminate it providing a few approaches
> >  and compare them.
> >
> > Requirements:
> >  * It should be a per-cpu approach;
> >  * Search of freed ptrs should not interfere with other freeing(as much as we can);
> >  *   - offload allocated areas(buzy ones) per-cpu;
> >  * Cache ready sized objects or merge them into one big per-cpu-space(split on demand);
> >  * Lazily-freed areas either drained per-cpu individually or by one CPU for all;
> >  * Prefetch a fixed size in front and allocate per-cpu
> >
> > Goals:
> >  * Implement a per-cpu way of allocation to eliminate a contention.
> >
> > Thanks!
> > <snip>
> >
> > --
> > Uladzislau Rezki
> >
> 
> That's really awesome! I will come to that talk at LSF/MM :) being able to
> sustain the lock in atomic context seems to be an aspect that is important going
> forward also.
>
Uhh... So i need to prepare then :)))

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-21  7:45       ` Lorenzo Stoakes
  2023-03-21  8:54         ` Uladzislau Rezki
@ 2023-03-21 10:05         ` Dave Chinner
  2023-03-21 10:24           ` Uladzislau Rezki
  1 sibling, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2023-03-21 10:05 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Uladzislau Rezki, linux-mm, linux-kernel, linux-fsdevel,
	Andrew Morton, Baoquan He, Matthew Wilcox, David Hildenbrand,
	Liu Shixin, Jiri Olsa

On Tue, Mar 21, 2023 at 07:45:56AM +0000, Lorenzo Stoakes wrote:
> On Tue, Mar 21, 2023 at 06:23:39AM +0100, Uladzislau Rezki wrote:
> > On Tue, Mar 21, 2023 at 12:09:12PM +1100, Dave Chinner wrote:
> > > On Sun, Mar 19, 2023 at 07:09:31AM +0000, Lorenzo Stoakes wrote:
> > > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > > already contains components which may sleep, so avoiding spin locks is not
> > > > a problem from the perspective of atomic context.
> > > >
> > > > The global vmap_area_lock is held when the red/black tree rooted in
> > > > vmap_are_root is accessed and thus is rather long-held and under
> > > > potentially high contention. It is likely to be under contention for reads
> > > > rather than write, so replace it with a rwsem.
> > > >
> > > > Each individual vmap_block->lock is likely to be held for less time but
> > > > under low contention, so a mutex is not an outrageous choice here.
> > > >
> > > > A subset of test_vmalloc.sh performance results:-
> > > >
> > > > fix_size_alloc_test             0.40%
> > > > full_fit_alloc_test		2.08%
> > > > long_busy_list_alloc_test	0.34%
> > > > random_size_alloc_test		-0.25%
> > > > random_size_align_alloc_test	0.06%
> > > > ...
> > > > all tests cycles                0.2%
> > > >
> > > > This represents a tiny reduction in performance that sits barely above
> > > > noise.
> > >
> > > I'm travelling right now, but give me a few days and I'll test this
> > > against the XFS workloads that hammer the global vmalloc spin lock
> > > really, really badly. XFS can use vm_map_ram and vmalloc really
> > > heavily for metadata buffers and hit the global spin lock from every
> > > CPU in the system at the same time (i.e. highly concurrent
> > > workloads). vmalloc is also heavily used in the hottest path
> > > throught the journal where we process and calculate delta changes to
> > > several million items every second, again spread across every CPU in
> > > the system at the same time.
> > >
> > > We really need the global spinlock to go away completely, but in the
> > > mean time a shared read lock should help a little bit....
> > >
> 
> Hugely appreciated Dave, however I must disappoint on the rwsem as I have now
> reworked my patch set to use the original locks in order to satisfy Willy's
> desire to make vmalloc atomic in future, and Uladzislau's desire to not have a
> ~6% performance hit -
> https://lore.kernel.org/all/cover.1679354384.git.lstoakes@gmail.com/

Yeah, I'd already read that.

What I want to do, though, is to determine whether the problem is
shared access contention or exclusive access contention. If it's
exclusive access contention, then an rwsem will do nothing to
alleviate the problem, and that's kinda critical to know before any
fix for the contention problems is worked out...

> > I am working on it. I submitted a proposal how to eliminate it:
> >
> >
> > <snip>
> > Hello, LSF.
> >
> > Title: Introduce a per-cpu-vmap-cache to eliminate a vmap lock contention
> >
> > Description:
> >  Currently the vmap code is not scaled to number of CPU cores in a system
> >  because a global vmap space is protected by a single spinlock. Such approach
> >  has a clear bottleneck if many CPUs simultaneously access to one resource.
> >
> >  In this talk i would like to describe a drawback, show some data related
> >  to contentions and places where those occur in a code. Apart of that i
> >  would like to share ideas how to eliminate it providing a few approaches
> >  and compare them.

If you want data about contention problems with vmalloc

> > Requirements:
> >  * It should be a per-cpu approach;

Hmmmm. My 2c worth on this: That is not a requirement.

That's a -solution-.

The requirement is that independent concurrent vmalloc/vfree
operations do not severely contend with each other.

Yes, the solution will probably involve sharding the resource space
across multiple independent structures (as we do in filesystems with
block groups, allocation groups, etc) but that does not necessarily
need the structures to be per-cpu.

e.g per-node vmalloc arenas might be sufficient and allow more
expensive but more efficient indexing structures to be used because
we don't have to care about the explosion of memory that
fine-grained per-cpu indexing generally entails.  This may also fit
in to the existing per-node structure of the memory reclaim
infrastructure to manage things like compaction, balancing, etc of
vmalloc space assigned to the given node.

Hence I think saying "per-cpu is a requirement" kinda prevents
exploration of other novel solutions that may have advantages other
than "just solves the concurrency problem"...

> >  * Search of freed ptrs should not interfere with other freeing(as much as we can);
> >  *   - offload allocated areas(buzy ones) per-cpu;
> >  * Cache ready sized objects or merge them into one big per-cpu-space(split on demand);
> >  * Lazily-freed areas either drained per-cpu individually or by one CPU for all;
> >  * Prefetch a fixed size in front and allocate per-cpu

I'd call these desired traits and/or potential optimisations, not
hard requirements.

> > Goals:
> >  * Implement a per-cpu way of allocation to eliminate a contention.

The goal should be to "allow contention-free vmalloc operations", not
that we implement a specific solution.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-21 10:05         ` Dave Chinner
@ 2023-03-21 10:24           ` Uladzislau Rezki
  0 siblings, 0 replies; 33+ messages in thread
From: Uladzislau Rezki @ 2023-03-21 10:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Lorenzo Stoakes, Uladzislau Rezki, linux-mm, linux-kernel,
	linux-fsdevel, Andrew Morton, Baoquan He, Matthew Wilcox,
	David Hildenbrand, Liu Shixin, Jiri Olsa

On Tue, Mar 21, 2023 at 09:05:26PM +1100, Dave Chinner wrote:
> On Tue, Mar 21, 2023 at 07:45:56AM +0000, Lorenzo Stoakes wrote:
> > On Tue, Mar 21, 2023 at 06:23:39AM +0100, Uladzislau Rezki wrote:
> > > On Tue, Mar 21, 2023 at 12:09:12PM +1100, Dave Chinner wrote:
> > > > On Sun, Mar 19, 2023 at 07:09:31AM +0000, Lorenzo Stoakes wrote:
> > > > > vmalloc() is, by design, not permitted to be used in atomic context and
> > > > > already contains components which may sleep, so avoiding spin locks is not
> > > > > a problem from the perspective of atomic context.
> > > > >
> > > > > The global vmap_area_lock is held when the red/black tree rooted in
> > > > > vmap_are_root is accessed and thus is rather long-held and under
> > > > > potentially high contention. It is likely to be under contention for reads
> > > > > rather than write, so replace it with a rwsem.
> > > > >
> > > > > Each individual vmap_block->lock is likely to be held for less time but
> > > > > under low contention, so a mutex is not an outrageous choice here.
> > > > >
> > > > > A subset of test_vmalloc.sh performance results:-
> > > > >
> > > > > fix_size_alloc_test             0.40%
> > > > > full_fit_alloc_test		2.08%
> > > > > long_busy_list_alloc_test	0.34%
> > > > > random_size_alloc_test		-0.25%
> > > > > random_size_align_alloc_test	0.06%
> > > > > ...
> > > > > all tests cycles                0.2%
> > > > >
> > > > > This represents a tiny reduction in performance that sits barely above
> > > > > noise.
> > > >
> > > > I'm travelling right now, but give me a few days and I'll test this
> > > > against the XFS workloads that hammer the global vmalloc spin lock
> > > > really, really badly. XFS can use vm_map_ram and vmalloc really
> > > > heavily for metadata buffers and hit the global spin lock from every
> > > > CPU in the system at the same time (i.e. highly concurrent
> > > > workloads). vmalloc is also heavily used in the hottest path
> > > > throught the journal where we process and calculate delta changes to
> > > > several million items every second, again spread across every CPU in
> > > > the system at the same time.
> > > >
> > > > We really need the global spinlock to go away completely, but in the
> > > > mean time a shared read lock should help a little bit....
> > > >
> > 
> > Hugely appreciated Dave, however I must disappoint on the rwsem as I have now
> > reworked my patch set to use the original locks in order to satisfy Willy's
> > desire to make vmalloc atomic in future, and Uladzislau's desire to not have a
> > ~6% performance hit -
> > https://lore.kernel.org/all/cover.1679354384.git.lstoakes@gmail.com/
> 
> Yeah, I'd already read that.
> 
> What I want to do, though, is to determine whether the problem
> shared access contention or exclusive access contention. If it's
> exclusive access contention, then an rwsem will do nothing to
> alleviate the problem, and that's kinda critical to know before any
> fix for the contention problems are worked out...
> 
> > > I am working on it. I submitted a proposal how to eliminate it:
> > >
> > >
> > > <snip>
> > > Hello, LSF.
> > >
> > > Title: Introduce a per-cpu-vmap-cache to eliminate a vmap lock contention
> > >
> > > Description:
> > >  Currently the vmap code is not scaled to number of CPU cores in a system
> > >  because a global vmap space is protected by a single spinlock. Such approach
> > >  has a clear bottleneck if many CPUs simultaneously access to one resource.
> > >
> > >  In this talk i would like to describe a drawback, show some data related
> > >  to contentions and places where those occur in a code. Apart of that i
> > >  would like to share ideas how to eliminate it providing a few approaches
> > >  and compare them.
> 
> If you want data about contention problems with vmalloc
> 
> > > Requirements:
> > >  * It should be a per-cpu approach;
> 
> Hmmmm. My 2c worth on this: That is not a requirement.
> 
> That's a -solution-.
> 
> The requirement is that independent concurrent vmalloc/vfree
> operations do not severely contend with each other.
> 
> Yes, the solution will probably involve sharding the resource space
> across mulitple independent structures (as we do in filesystems with
> block groups, allocations groups, etc) but that does not necessarily
> need the structures to be per-cpu.
> 
> e.g per-node vmalloc arenas might be sufficient and allow more
> expensive but more efficient indexing structures to be used because
> we don't have to care about the explosion of memory that
> fine-grained per-cpu indexing generally entails.  This may also fit
> in to the existing per-node structure of the memory reclaim
> infrastructure to manage things like compaction, balancing, etc of
> vmalloc space assigned to the given node.
> 
> Hence I think saying "per-cpu is a requirement" kinda prevents
> exploration of other novel solutions that may have advantages other
> than "just solves the concurrency problem"...
> 
> > >  * Search of freed ptrs should not interfere with other freeing(as much as we can);
> > >  *   - offload allocated areas(buzy ones) per-cpu;
> > >  * Cache ready sized objects or merge them into one big per-cpu-space(split on demand);
> > >  * Lazily-freed areas either drained per-cpu individually or by one CPU for all;
> > >  * Prefetch a fixed size in front and allocate per-cpu
> 
> I'd call these desired traits and/or potential optimisations, not
> hard requirements.
> 
> > > Goals:
> > >  * Implement a per-cpu way of allocation to eliminate a contention.
> 
> The goal should be to "allow contention-free vmalloc operations", not
> that we implement a specific solution.
> 
I think we are on the same page; I do not see that we differ on anything.
Probably I was a bit more specific in the requirements, but this is how I
personally see it, based on different kinds of experiments with it.

Thank you for your 2c!

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-21  1:09   ` Dave Chinner
  2023-03-21  5:23     ` Uladzislau Rezki
@ 2023-03-22 13:18     ` Uladzislau Rezki
  2023-03-22 17:47       ` Matthew Wilcox
  2023-03-24  5:25       ` Dave Chinner
  1 sibling, 2 replies; 33+ messages in thread
From: Uladzislau Rezki @ 2023-03-22 13:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Lorenzo Stoakes, linux-mm, linux-kernel, linux-fsdevel,
	Andrew Morton, Baoquan He, Uladzislau Rezki, Matthew Wilcox,
	David Hildenbrand, Liu Shixin, Jiri Olsa

Hello, Dave.

> 
> I'm travelling right now, but give me a few days and I'll test this
> against the XFS workloads that hammer the global vmalloc spin lock
> really, really badly. XFS can use vm_map_ram and vmalloc really
> heavily for metadata buffers and hit the global spin lock from every
> CPU in the system at the same time (i.e. highly concurrent
> workloads). vmalloc is also heavily used in the hottest path
> throught the journal where we process and calculate delta changes to
> several million items every second, again spread across every CPU in
> the system at the same time.
> 
> We really need the global spinlock to go away completely, but in the
> mean time a shared read lock should help a little bit....
> 
Could you please share some steps on how to run your workloads so that they
exercise the vmalloc() code? I would like to have a look at it in more detail
just to understand the workloads.

Meanwhile my grep against xfs shows:

<snip>
urezki@pc638:~/data/raid0/coding/linux-rcu.git/fs/xfs$ grep -rn vmalloc ./
./xfs_log_priv.h:675: * Log vector and shadow buffers can be large, so we need to use kvmalloc() here
./xfs_log_priv.h:676: * to ensure success. Unfortunately, kvmalloc() only allows GFP_KERNEL contexts
./xfs_log_priv.h:677: * to fall back to vmalloc, so we can't actually do anything useful with gfp
./xfs_log_priv.h:678: * flags to control the kmalloc() behaviour within kvmalloc(). Hence kmalloc()
./xfs_log_priv.h:681: * vmalloc if it can't get somethign straight away from the free lists or
./xfs_log_priv.h:682: * buddy allocator. Hence we have to open code kvmalloc outselves here.
./xfs_log_priv.h:686: * allocations. This is actually the only way to make vmalloc() do GFP_NOFS
./xfs_log_priv.h:691:xlog_kvmalloc(
./xfs_log_priv.h:702:                   p = vmalloc(buf_size);
./xfs_bio_io.c:21:      unsigned int            is_vmalloc = is_vmalloc_addr(data);
./xfs_bio_io.c:26:      if (is_vmalloc && op == REQ_OP_WRITE)
./xfs_bio_io.c:56:      if (is_vmalloc && op == REQ_OP_READ)
./xfs_log.c:1976:       if (is_vmalloc_addr(iclog->ic_data))
./xfs_log_cil.c:338:                    lv = xlog_kvmalloc(buf_size);
./libxfs/xfs_attr_leaf.c:522:           args->value = kvmalloc(valuelen, GFP_KERNEL | __GFP_NOLOCKDEP);
./kmem.h:12:#include <linux/vmalloc.h>
./kmem.h:78:    if (is_vmalloc_addr(addr))
./kmem.h:79:            return vmalloc_to_page(addr);
./xfs_attr_item.c:84:    * This could be over 64kB in length, so we have to use kvmalloc() for
./xfs_attr_item.c:85:    * this. But kvmalloc() utterly sucks, so we use our own version.
./xfs_attr_item.c:87:   nv = xlog_kvmalloc(sizeof(struct xfs_attri_log_nameval) +
./scrub/attr.c:60:      ab = kvmalloc(sizeof(*ab) + sz, flags);
urezki@pc638:~/data/raid0/coding/linux-rcu.git/fs/xfs$
<snip>

Thanks!

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-22 13:18     ` Uladzislau Rezki
@ 2023-03-22 17:47       ` Matthew Wilcox
  2023-03-22 18:01         ` Uladzislau Rezki
  2023-03-24  5:25       ` Dave Chinner
  1 sibling, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2023-03-22 17:47 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Dave Chinner, Lorenzo Stoakes, linux-mm, linux-kernel,
	linux-fsdevel, Andrew Morton, Baoquan He, David Hildenbrand,
	Liu Shixin, Jiri Olsa

On Wed, Mar 22, 2023 at 02:18:19PM +0100, Uladzislau Rezki wrote:
> Hello, Dave.
> 
> > 
> > I'm travelling right now, but give me a few days and I'll test this
> > against the XFS workloads that hammer the global vmalloc spin lock
> > really, really badly. XFS can use vm_map_ram and vmalloc really
> > heavily for metadata buffers and hit the global spin lock from every
> > CPU in the system at the same time (i.e. highly concurrent
> > workloads). vmalloc is also heavily used in the hottest path
> > throught the journal where we process and calculate delta changes to
> > several million items every second, again spread across every CPU in
> > the system at the same time.
> > 
> > We really need the global spinlock to go away completely, but in the
> > mean time a shared read lock should help a little bit....
> > 
> Could you please share some steps how to run your workloads in order to
> touch vmalloc() code. I would like to have a look at it in more detail
> just for understanding the workloads.
> 
> Meanwhile my grep agains xfs shows:
> 
> <snip>
> urezki@pc638:~/data/raid0/coding/linux-rcu.git/fs/xfs$ grep -rn vmalloc ./

You're missing:

fs/xfs/xfs_buf.c:                       bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,

which i suspect is the majority of Dave's workload.  That will almost
certainly take the vb_alloc() path.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-22 17:47       ` Matthew Wilcox
@ 2023-03-22 18:01         ` Uladzislau Rezki
  2023-03-22 19:15           ` Uladzislau Rezki
  0 siblings, 1 reply; 33+ messages in thread
From: Uladzislau Rezki @ 2023-03-22 18:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Uladzislau Rezki, Dave Chinner, Lorenzo Stoakes, linux-mm,
	linux-kernel, linux-fsdevel, Andrew Morton, Baoquan He,
	David Hildenbrand, Liu Shixin, Jiri Olsa

On Wed, Mar 22, 2023 at 05:47:28PM +0000, Matthew Wilcox wrote:
> On Wed, Mar 22, 2023 at 02:18:19PM +0100, Uladzislau Rezki wrote:
> > Hello, Dave.
> > 
> > > 
> > > I'm travelling right now, but give me a few days and I'll test this
> > > against the XFS workloads that hammer the global vmalloc spin lock
> > > really, really badly. XFS can use vm_map_ram and vmalloc really
> > > heavily for metadata buffers and hit the global spin lock from every
> > > CPU in the system at the same time (i.e. highly concurrent
> > > workloads). vmalloc is also heavily used in the hottest path
> > > throught the journal where we process and calculate delta changes to
> > > several million items every second, again spread across every CPU in
> > > the system at the same time.
> > > 
> > > We really need the global spinlock to go away completely, but in the
> > > mean time a shared read lock should help a little bit....
> > > 
> > Could you please share some steps how to run your workloads in order to
> > touch vmalloc() code. I would like to have a look at it in more detail
> > just for understanding the workloads.
> > 
> > Meanwhile my grep agains xfs shows:
> > 
> > <snip>
> > urezki@pc638:~/data/raid0/coding/linux-rcu.git/fs/xfs$ grep -rn vmalloc ./
> 
> You're missing:
> 
> fs/xfs/xfs_buf.c:                       bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> 
> which i suspect is the majority of Dave's workload.  That will almost
> certainly take the vb_alloc() path.
>
Then it has nothing to do with vmalloc contention (I mean the global KVA allocator), IMHO.
Unless:

<snip>
void *vm_map_ram(struct page **pages, unsigned int count, int node)
{
	unsigned long size = (unsigned long)count << PAGE_SHIFT;
	unsigned long addr;
	void *mem;

	if (likely(count <= VMAP_MAX_ALLOC)) {
		mem = vb_alloc(size, GFP_KERNEL);
		if (IS_ERR(mem))
			return NULL;
		addr = (unsigned long)mem;
	} else {
		struct vmap_area *va;
		va = alloc_vmap_area(size, PAGE_SIZE,
				VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
		if (IS_ERR(va))
			return NULL;
<snip>

i.e. the number of pages is > VMAP_MAX_ALLOC.

That is why I have asked about the workloads: I would like to understand
where the "problem" is. vm_map_ram() accesses the global vmap space, but that
only happens when a new vmap block is required, and I also think it is not a
problem.

But who knows, therefore it makes sense to have a look at the workload.

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-22 18:01         ` Uladzislau Rezki
@ 2023-03-22 19:15           ` Uladzislau Rezki
  2023-03-23 12:47             ` Uladzislau Rezki
  0 siblings, 1 reply; 33+ messages in thread
From: Uladzislau Rezki @ 2023-03-22 19:15 UTC (permalink / raw)
  To: Matthew Wilcox, Dave Chinner
  Cc: Matthew Wilcox, Dave Chinner, Lorenzo Stoakes, linux-mm,
	linux-kernel, linux-fsdevel, Andrew Morton, Baoquan He,
	David Hildenbrand, Liu Shixin, Jiri Olsa

On Wed, Mar 22, 2023 at 07:01:59PM +0100, Uladzislau Rezki wrote:
> On Wed, Mar 22, 2023 at 05:47:28PM +0000, Matthew Wilcox wrote:
> > On Wed, Mar 22, 2023 at 02:18:19PM +0100, Uladzislau Rezki wrote:
> > > Hello, Dave.
> > > 
> > > > 
> > > > I'm travelling right now, but give me a few days and I'll test this
> > > > against the XFS workloads that hammer the global vmalloc spin lock
> > > > really, really badly. XFS can use vm_map_ram and vmalloc really
> > > > heavily for metadata buffers and hit the global spin lock from every
> > > > CPU in the system at the same time (i.e. highly concurrent
> > > > workloads). vmalloc is also heavily used in the hottest path
> > > > throught the journal where we process and calculate delta changes to
> > > > several million items every second, again spread across every CPU in
> > > > the system at the same time.
> > > > 
> > > > We really need the global spinlock to go away completely, but in the
> > > > mean time a shared read lock should help a little bit....
> > > > 
> > > Could you please share some steps how to run your workloads in order to
> > > touch vmalloc() code. I would like to have a look at it in more detail
> > > just for understanding the workloads.
> > > 
> > > Meanwhile my grep agains xfs shows:
> > > 
> > > <snip>
> > > urezki@pc638:~/data/raid0/coding/linux-rcu.git/fs/xfs$ grep -rn vmalloc ./
> > 
> > You're missing:
> > 
> > fs/xfs/xfs_buf.c:                       bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> > 
> > which i suspect is the majority of Dave's workload.  That will almost
> > certainly take the vb_alloc() path.
> >
> Then it has nothing to do with vmalloc contention(i mean global KVA allocator), IMHO.
> Unless:
> 
> <snip>
> void *vm_map_ram(struct page **pages, unsigned int count, int node)
> {
> 	unsigned long size = (unsigned long)count << PAGE_SHIFT;
> 	unsigned long addr;
> 	void *mem;
> 
> 	if (likely(count <= VMAP_MAX_ALLOC)) {
> 		mem = vb_alloc(size, GFP_KERNEL);
> 		if (IS_ERR(mem))
> 			return NULL;
> 		addr = (unsigned long)mem;
> 	} else {
> 		struct vmap_area *va;
> 		va = alloc_vmap_area(size, PAGE_SIZE,
> 				VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
> 		if (IS_ERR(va))
> 			return NULL;
> <snip>
> 
> number of pages > VMAP_MAX_ALLOC.
> 
> That is why i have asked about workloads because i would like to understand
> where a "problem" is. A vm_map_ram() access the global vmap space but it happens 
> when a new vmap block is required and i also think it is not a problem.
> 
> But who knows, therefore it makes sense to have a lock at workload.
> 
Here are lock_stat statistics for a vm_map_ram()/vm_unmap_ram() test.
I ran it on a 64-CPU system with 64 threads doing mapping/unmapping
of 1 page. Each thread does 10 000 000 mappings + unmappings in a loop:

<snip>
root@pc638:/home/urezki# cat /proc/lock_stat
lock_stat version 0.4
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name    con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

vmap_area_lock:       2554079        2554276           0.06         213.61    11719647.67           4.59        2986513        3005712           0.05          67.02     3573323.37           1.19
  --------------
  vmap_area_lock        1297948          [<00000000dd41cbaa>] alloc_vmap_area+0x1c7/0x910
  vmap_area_lock        1256330          [<000000009d927bf3>] free_vmap_block+0x4a/0xe0
  vmap_area_lock              1          [<00000000c95c05a7>] find_vm_area+0x16/0x70
  --------------
  vmap_area_lock        1738590          [<00000000dd41cbaa>] alloc_vmap_area+0x1c7/0x910
  vmap_area_lock         815688          [<000000009d927bf3>] free_vmap_block+0x4a/0xe0
  vmap_area_lock              1          [<00000000c1d619d7>] __get_vm_area_node+0xd2/0x170

.....................................................................................................................................................................................................

vmap_blocks.xa_lock:        862689         862698           0.05          77.74      849325.39           0.98        3005156        3005709           0.12          31.11     1920242.82           0.64
  -------------------
  vmap_blocks.xa_lock         378418          [<00000000625a5626>] vm_map_ram+0x359/0x4a0
  vmap_blocks.xa_lock         484280          [<00000000caa2ef03>] xa_erase+0xe/0x30
  -------------------
  vmap_blocks.xa_lock         576226          [<00000000caa2ef03>] xa_erase+0xe/0x30
  vmap_blocks.xa_lock         286472          [<00000000625a5626>] vm_map_ram+0x359/0x4a0

....................................................................................................................................................................................................

free_vmap_area_lock:        394960         394961           0.05         124.78      448241.23           1.13        1514508        1515077           0.12          30.48     1179167.01           0.78
  -------------------
  free_vmap_area_lock         385970          [<00000000955bd641>] alloc_vmap_area+0xe5/0x910
  free_vmap_area_lock           4692          [<00000000230abf7e>] __purge_vmap_area_lazy+0x10a/0x7d0
  free_vmap_area_lock           4299          [<00000000eed9ff9e>] alloc_vmap_area+0x497/0x910
  -------------------
  free_vmap_area_lock         371734          [<00000000955bd641>] alloc_vmap_area+0xe5/0x910
  free_vmap_area_lock          17007          [<00000000230abf7e>] __purge_vmap_area_lazy+0x10a/0x7d0
  free_vmap_area_lock           6220          [<00000000eed9ff9e>] alloc_vmap_area+0x497/0x910

.....................................................................................................................................................................................................

purge_vmap_area_lock:        169307         169312           0.05          31.08       81655.21           0.48        1514794        1515078           0.05          67.73      912391.12           0.60
  --------------------
  purge_vmap_area_lock         166409          [<0000000050938075>] free_vmap_area_noflush+0x65/0x370
  purge_vmap_area_lock           2903          [<00000000fb8b57f7>] __purge_vmap_area_lazy+0x47/0x7d0
  --------------------
  purge_vmap_area_lock         167511          [<0000000050938075>] free_vmap_area_noflush+0x65/0x370
  purge_vmap_area_lock           1801          [<00000000fb8b57f7>] __purge_vmap_area_lazy+0x47/0x7d0
<snip>

alloc_vmap_area() is at the top and the second one is the xa_lock. But the
test I have done is a pretty highly concurrent scenario.
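
If anyone wants to reproduce something similar, a minimal sketch of such a
stress module is below. It is not the exact test used for the numbers above,
just the general shape (module teardown omitted for brevity):

<snip>
/*
 * Illustrative sketch only: one kthread per online CPU maps and unmaps
 * a single page in a tight loop via vm_map_ram()/vm_unmap_ram(), which
 * is the kind of load that produces the lock_stat numbers above.
 */
#include <linux/cpumask.h>
#include <linux/gfp.h>
#include <linux/kthread.h>
#include <linux/module.h>
#include <linux/numa.h>
#include <linux/vmalloc.h>

static int map_unmap_fn(void *unused)
{
	struct page *page = alloc_page(GFP_KERNEL);
	unsigned long i;
	void *mem;

	if (!page)
		return -ENOMEM;

	for (i = 0; i < 10 * 1000 * 1000; i++) {
		mem = vm_map_ram(&page, 1, NUMA_NO_NODE);
		if (!mem)
			break;
		vm_unmap_ram(mem, 1);
	}

	__free_page(page);
	return 0;
}

static int __init vmap_stress_init(void)
{
	int cpu;

	for_each_online_cpu(cpu)
		kthread_run(map_unmap_fn, NULL, "vmap_stress/%d", cpu);

	return 0;
}
module_init(vmap_stress_init);
MODULE_LICENSE("GPL");
<snip>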

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-22 19:15           ` Uladzislau Rezki
@ 2023-03-23 12:47             ` Uladzislau Rezki
  0 siblings, 0 replies; 33+ messages in thread
From: Uladzislau Rezki @ 2023-03-23 12:47 UTC (permalink / raw)
  To: Matthew Wilcox, Dave Chinner
  Cc: Matthew Wilcox, Dave Chinner, Lorenzo Stoakes, linux-mm,
	linux-kernel, linux-fsdevel, Andrew Morton, Baoquan He,
	David Hildenbrand, Liu Shixin, Jiri Olsa

On Wed, Mar 22, 2023 at 08:15:09PM +0100, Uladzislau Rezki wrote:
> On Wed, Mar 22, 2023 at 07:01:59PM +0100, Uladzislau Rezki wrote:
> > On Wed, Mar 22, 2023 at 05:47:28PM +0000, Matthew Wilcox wrote:
> > > On Wed, Mar 22, 2023 at 02:18:19PM +0100, Uladzislau Rezki wrote:
> > > > Hello, Dave.
> > > > 
> > > > > 
> > > > > I'm travelling right now, but give me a few days and I'll test this
> > > > > against the XFS workloads that hammer the global vmalloc spin lock
> > > > > really, really badly. XFS can use vm_map_ram and vmalloc really
> > > > > heavily for metadata buffers and hit the global spin lock from every
> > > > > CPU in the system at the same time (i.e. highly concurrent
> > > > > workloads). vmalloc is also heavily used in the hottest path
> > > > > throught the journal where we process and calculate delta changes to
> > > > > several million items every second, again spread across every CPU in
> > > > > the system at the same time.
> > > > > 
> > > > > We really need the global spinlock to go away completely, but in the
> > > > > mean time a shared read lock should help a little bit....
> > > > > 
> > > > Could you please share some steps how to run your workloads in order to
> > > > touch vmalloc() code. I would like to have a look at it in more detail
> > > > just for understanding the workloads.
> > > > 
> > > > Meanwhile my grep agains xfs shows:
> > > > 
> > > > <snip>
> > > > urezki@pc638:~/data/raid0/coding/linux-rcu.git/fs/xfs$ grep -rn vmalloc ./
> > > 
> > > You're missing:
> > > 
> > > fs/xfs/xfs_buf.c:                       bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> > > 
> > > which i suspect is the majority of Dave's workload.  That will almost
> > > certainly take the vb_alloc() path.
> > >
> > Then it has nothing to do with vmalloc contention(i mean global KVA allocator), IMHO.
> > Unless:
> > 
> > <snip>
> > void *vm_map_ram(struct page **pages, unsigned int count, int node)
> > {
> > 	unsigned long size = (unsigned long)count << PAGE_SHIFT;
> > 	unsigned long addr;
> > 	void *mem;
> > 
> > 	if (likely(count <= VMAP_MAX_ALLOC)) {
> > 		mem = vb_alloc(size, GFP_KERNEL);
> > 		if (IS_ERR(mem))
> > 			return NULL;
> > 		addr = (unsigned long)mem;
> > 	} else {
> > 		struct vmap_area *va;
> > 		va = alloc_vmap_area(size, PAGE_SIZE,
> > 				VMALLOC_START, VMALLOC_END, node, GFP_KERNEL);
> > 		if (IS_ERR(va))
> > 			return NULL;
> > <snip>
> > 
> > number of pages > VMAP_MAX_ALLOC.
> > 
> > That is why i have asked about workloads because i would like to understand
> > where a "problem" is. A vm_map_ram() access the global vmap space but it happens 
> > when a new vmap block is required and i also think it is not a problem.
> > 
> > But who knows, therefore it makes sense to have a lock at workload.
> > 
> There is a lock-stat statistics for vm_map_ram()/vm_unmap_ram() test.
> I did it on 64 CPUs system with running 64 threads doing mapping/unmapping
> of 1 page. Each thread does 10 000 000 mapping + unmapping in a loop:
> 
> <snip>
> root@pc638:/home/urezki# cat /proc/lock_stat
> lock_stat version 0.4
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> class name    con-bounces    contentions   waittime-min   waittime-max waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total   holdtime-avg
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> 
> vmap_area_lock:       2554079        2554276           0.06         213.61    11719647.67           4.59        2986513        3005712           0.05          67.02     3573323.37           1.19
>   --------------
>   vmap_area_lock        1297948          [<00000000dd41cbaa>] alloc_vmap_area+0x1c7/0x910
>   vmap_area_lock        1256330          [<000000009d927bf3>] free_vmap_block+0x4a/0xe0
>   vmap_area_lock              1          [<00000000c95c05a7>] find_vm_area+0x16/0x70
>   --------------
>   vmap_area_lock        1738590          [<00000000dd41cbaa>] alloc_vmap_area+0x1c7/0x910
>   vmap_area_lock         815688          [<000000009d927bf3>] free_vmap_block+0x4a/0xe0
>   vmap_area_lock              1          [<00000000c1d619d7>] __get_vm_area_node+0xd2/0x170
> 
> .....................................................................................................................................................................................................
> 
> vmap_blocks.xa_lock:        862689         862698           0.05          77.74      849325.39           0.98        3005156        3005709           0.12          31.11     1920242.82           0.64
>   -------------------
>   vmap_blocks.xa_lock         378418          [<00000000625a5626>] vm_map_ram+0x359/0x4a0
>   vmap_blocks.xa_lock         484280          [<00000000caa2ef03>] xa_erase+0xe/0x30
>   -------------------
>   vmap_blocks.xa_lock         576226          [<00000000caa2ef03>] xa_erase+0xe/0x30
>   vmap_blocks.xa_lock         286472          [<00000000625a5626>] vm_map_ram+0x359/0x4a0
> 
> ....................................................................................................................................................................................................
> 
> free_vmap_area_lock:        394960         394961           0.05         124.78      448241.23           1.13        1514508        1515077           0.12          30.48     1179167.01           0.78
>   -------------------
>   free_vmap_area_lock         385970          [<00000000955bd641>] alloc_vmap_area+0xe5/0x910
>   free_vmap_area_lock           4692          [<00000000230abf7e>] __purge_vmap_area_lazy+0x10a/0x7d0
>   free_vmap_area_lock           4299          [<00000000eed9ff9e>] alloc_vmap_area+0x497/0x910
>   -------------------
>   free_vmap_area_lock         371734          [<00000000955bd641>] alloc_vmap_area+0xe5/0x910
>   free_vmap_area_lock          17007          [<00000000230abf7e>] __purge_vmap_area_lazy+0x10a/0x7d0
>   free_vmap_area_lock           6220          [<00000000eed9ff9e>] alloc_vmap_area+0x497/0x910
> 
> .....................................................................................................................................................................................................
> 
> purge_vmap_area_lock:        169307         169312           0.05          31.08       81655.21           0.48        1514794        1515078           0.05          67.73      912391.12           0.60
>   --------------------
>   purge_vmap_area_lock         166409          [<0000000050938075>] free_vmap_area_noflush+0x65/0x370
>   purge_vmap_area_lock           2903          [<00000000fb8b57f7>] __purge_vmap_area_lazy+0x47/0x7d0
>   --------------------
>   purge_vmap_area_lock         167511          [<0000000050938075>] free_vmap_area_noflush+0x65/0x370
>   purge_vmap_area_lock           1801          [<00000000fb8b57f7>] __purge_vmap_area_lazy+0x47/0x7d0
> <snip>
> 
> alloc_vmap_area is at the top and the second one is the xa_lock. But the
> test I have done is a pretty highly concurrent scenario.
> 
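For reference, the per-thread part of the stress test quoted above is
essentially a tight map/unmap loop. A minimal sketch of it (illustrative
only, not the actual test module; the 64 threads would be started e.g.
via kthread_run(), and that boilerplate is omitted):

<snip>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

/* One pre-allocated page is repeatedly mapped and unmapped. */
static int vm_map_ram_stress_thread(void *unused)
{
	struct page *page = alloc_page(GFP_KERNEL);
	unsigned long i;
	void *addr;

	if (!page)
		return -ENOMEM;

	for (i = 0; i < 10000000UL; i++) {
		addr = vm_map_ram(&page, 1, NUMA_NO_NODE);
		if (!addr)
			break;

		vm_unmap_ram(addr, 1);
	}

	__free_page(page);
	return 0;
}
<snip>
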
<snip>
From 32c38d239c6de3f1d3accf97d9d4944ecaa4bccd Mon Sep 17 00:00:00 2001
From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Date: Thu, 23 Mar 2023 13:07:27 +0100
Subject: [PATCH] mm: vmalloc: Remove global vmap_blocks xarray

The global vmap_blocks xarray can be contended under
heavy usage of the vm_map_ram()/vm_unmap_ram() APIs. Under
a stress test, lock-stat shows that the "vmap_blocks.xa_lock"
lock is second on the list when it comes to contention:

<snip>
----------------------------------------
class name con-bounces contentions ...
----------------------------------------
...
vmap_blocks.xa_lock:    862689 862698 ...
  -------------------
  vmap_blocks.xa_lock   378418    [<00000000625a5626>] vm_map_ram+0x359/0x4a0
  vmap_blocks.xa_lock   484280    [<00000000caa2ef03>] xa_erase+0xe/0x30
  -------------------
  vmap_blocks.xa_lock   576226    [<00000000caa2ef03>] xa_erase+0xe/0x30
  vmap_blocks.xa_lock   286472    [<00000000625a5626>] vm_map_ram+0x359/0x4a0
...
<snip>

That is a result of running vm_map_ram()/vm_unmap_ram() in
a loop. The test creates 64 threads (on a 64-CPU system) and
each one maps/unmaps 1 page.

After this change the xa_lock shows up only as noise under
the same test conditions:

<snip>
...
&xa->xa_lock#1:         10333 10394 ...
  --------------
  &xa->xa_lock#1        5349      [<00000000bbbc9751>] xa_erase+0xe/0x30
  &xa->xa_lock#1        5045      [<0000000018def45d>] vm_map_ram+0x3a4/0x4f0
  --------------
  &xa->xa_lock#1        7326      [<0000000018def45d>] vm_map_ram+0x3a4/0x4f0
  &xa->xa_lock#1        3068      [<00000000bbbc9751>] xa_erase+0xe/0x30
...
<snip>

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 54 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 30 insertions(+), 24 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 978194dc2bb8..b1e549d152b2 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1911,6 +1911,7 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
 struct vmap_block_queue {
 	spinlock_t lock;
 	struct list_head free;
+	struct xarray vmap_blocks;
 };
 
 struct vmap_block {
@@ -1927,25 +1928,22 @@ struct vmap_block {
 /* Queue of free and dirty vmap blocks, for allocation and flushing purposes */
 static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
 
-/*
- * XArray of vmap blocks, indexed by address, to quickly find a vmap block
- * in the free path. Could get rid of this if we change the API to return a
- * "cookie" from alloc, to be passed to free. But no big deal yet.
- */
-static DEFINE_XARRAY(vmap_blocks);
-
-/*
- * We should probably have a fallback mechanism to allocate virtual memory
- * out of partially filled vmap blocks. However vmap block sizing should be
- * fairly reasonable according to the vmalloc size, so it shouldn't be a
- * big problem.
- */
+static struct vmap_block_queue *
+addr_to_vbq(unsigned long addr)
+{
+	int cpu = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
+	return &per_cpu(vmap_block_queue, cpu);
+}
 
-static unsigned long addr_to_vb_idx(unsigned long addr)
+static unsigned long
+addr_to_vb_va_start(unsigned long addr)
 {
-	addr -= VMALLOC_START & ~(VMAP_BLOCK_SIZE-1);
-	addr /= VMAP_BLOCK_SIZE;
-	return addr;
+	/* Check if aligned. */
+	if (IS_ALIGNED(addr, VMAP_BLOCK_SIZE))
+		return addr;
+
+	/* A start address of block an address belongs to. */
+	return rounddown(addr, VMAP_BLOCK_SIZE);
 }
 
 static void *vmap_block_vaddr(unsigned long va_start, unsigned long pages_off)
@@ -1953,7 +1951,7 @@ static void *vmap_block_vaddr(unsigned long va_start, unsigned long pages_off)
 	unsigned long addr;
 
 	addr = va_start + (pages_off << PAGE_SHIFT);
-	BUG_ON(addr_to_vb_idx(addr) != addr_to_vb_idx(va_start));
+	BUG_ON(addr_to_vb_va_start(addr) != addr_to_vb_va_start(va_start));
 	return (void *)addr;
 }
 
@@ -1970,7 +1968,6 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
 	struct vmap_block_queue *vbq;
 	struct vmap_block *vb;
 	struct vmap_area *va;
-	unsigned long vb_idx;
 	int node, err;
 	void *vaddr;
 
@@ -2003,8 +2000,8 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
 	bitmap_set(vb->used_map, 0, (1UL << order));
 	INIT_LIST_HEAD(&vb->free_list);
 
-	vb_idx = addr_to_vb_idx(va->va_start);
-	err = xa_insert(&vmap_blocks, vb_idx, vb, gfp_mask);
+	vbq = addr_to_vbq(va->va_start);
+	err = xa_insert(&vbq->vmap_blocks, va->va_start, vb, gfp_mask);
 	if (err) {
 		kfree(vb);
 		free_vmap_area(va);
@@ -2021,9 +2018,11 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
 
 static void free_vmap_block(struct vmap_block *vb)
 {
+	struct vmap_block_queue *vbq;
 	struct vmap_block *tmp;
 
-	tmp = xa_erase(&vmap_blocks, addr_to_vb_idx(vb->va->va_start));
+	vbq = addr_to_vbq(vb->va->va_start);
+	tmp = xa_erase(&vbq->vmap_blocks, vb->va->va_start);
 	BUG_ON(tmp != vb);
 
 	spin_lock(&vmap_area_lock);
@@ -2135,6 +2134,7 @@ static void vb_free(unsigned long addr, unsigned long size)
 	unsigned long offset;
 	unsigned int order;
 	struct vmap_block *vb;
+	struct vmap_block_queue *vbq;
 
 	BUG_ON(offset_in_page(size));
 	BUG_ON(size > PAGE_SIZE*VMAP_MAX_ALLOC);
@@ -2143,7 +2143,10 @@ static void vb_free(unsigned long addr, unsigned long size)
 
 	order = get_order(size);
 	offset = (addr & (VMAP_BLOCK_SIZE - 1)) >> PAGE_SHIFT;
-	vb = xa_load(&vmap_blocks, addr_to_vb_idx(addr));
+
+	vbq = addr_to_vbq(addr);
+	vb = xa_load(&vbq->vmap_blocks, addr_to_vb_va_start(addr));
+
 	spin_lock(&vb->lock);
 	bitmap_clear(vb->used_map, offset, (1UL << order));
 	spin_unlock(&vb->lock);
@@ -3486,6 +3489,7 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
 {
 	char *start;
 	struct vmap_block *vb;
+	struct vmap_block_queue *vbq;
 	unsigned long offset;
 	unsigned int rs, re, n;
 
@@ -3503,7 +3507,8 @@ static void vmap_ram_vread(char *buf, char *addr, int count, unsigned long flags
 	 * Area is split into regions and tracked with vmap_block, read out
 	 * each region and zero fill the hole between regions.
 	 */
-	vb = xa_load(&vmap_blocks, addr_to_vb_idx((unsigned long)addr));
+	vbq = addr_to_vbq((unsigned long) addr);
+	vb = xa_load(&vbq->vmap_blocks, addr_to_vb_va_start((unsigned long) addr));
 	if (!vb)
 		goto finished;
 
@@ -4272,6 +4277,7 @@ void __init vmalloc_init(void)
 		p = &per_cpu(vfree_deferred, i);
 		init_llist_head(&p->list);
 		INIT_WORK(&p->wq, delayed_vfree_work);
+		xa_init(&vbq->vmap_blocks);
 	}
 
 	/* Import existing vmlist entries. */
-- 
2.30.2
<snip>

Any thoughts on the patch?

I do not consider it a big improvement in performance. But it completely
removes the contention on the "xa_lock", and it slightly refactors the
per-cpu allocator. XFS workloads might be improved by it, though.

--
Uladzislau Rezki


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-22 13:18     ` Uladzislau Rezki
  2023-03-22 17:47       ` Matthew Wilcox
@ 2023-03-24  5:25       ` Dave Chinner
  2023-03-24  5:31         ` Matthew Wilcox
  2023-03-27 17:22         ` Uladzislau Rezki
  1 sibling, 2 replies; 33+ messages in thread
From: Dave Chinner @ 2023-03-24  5:25 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Lorenzo Stoakes, linux-mm, linux-kernel, linux-fsdevel,
	Andrew Morton, Baoquan He, Matthew Wilcox, David Hildenbrand,
	Liu Shixin, Jiri Olsa

On Wed, Mar 22, 2023 at 02:18:19PM +0100, Uladzislau Rezki wrote:
> Hello, Dave.
> 
> > 
> > I'm travelling right now, but give me a few days and I'll test this
> > against the XFS workloads that hammer the global vmalloc spin lock
> > really, really badly. XFS can use vm_map_ram and vmalloc really
> > heavily for metadata buffers and hit the global spin lock from every
> > CPU in the system at the same time (i.e. highly concurrent
> > workloads). vmalloc is also heavily used in the hottest path
> > through the journal where we process and calculate delta changes to
> > several million items every second, again spread across every CPU in
> > the system at the same time.
> > 
> > We really need the global spinlock to go away completely, but in the
> > mean time a shared read lock should help a little bit....
> > 
> Could you please share some steps on how to run your workloads in order to
> exercise the vmalloc() code? I would like to have a look at it in more detail,
> just to understand the workloads.

Go search lore for the fsmark scalability benchmarks I've been
running for the past 12-13 years on XFS. Essentially they are high
concurrency create/walk/modify/remove workloads on cold cache and
limited memory configurations. In essence, it's hammering caches,
turning over inodes as fast as the system can possibly stream them
in and out of memory....

> <snip>
> urezki@pc638:~/data/raid0/coding/linux-rcu.git/fs/xfs$ grep -rn vmalloc ./
> ./xfs_log_priv.h:675: * Log vector and shadow buffers can be large, so we need to use kvmalloc() here
> ./xfs_log_priv.h:676: * to ensure success. Unfortunately, kvmalloc() only allows GFP_KERNEL contexts
> ./xfs_log_priv.h:677: * to fall back to vmalloc, so we can't actually do anything useful with gfp
> ./xfs_log_priv.h:678: * flags to control the kmalloc() behaviour within kvmalloc(). Hence kmalloc()
> ./xfs_log_priv.h:681: * vmalloc if it can't get somethign straight away from the free lists or
> ./xfs_log_priv.h:682: * buddy allocator. Hence we have to open code kvmalloc outselves here.
> ./xfs_log_priv.h:686: * allocations. This is actually the only way to make vmalloc() do GFP_NOFS
> ./xfs_log_priv.h:691:xlog_kvmalloc(

Did you read the comment above this function? I mean, it's all about
how poorly kvmalloc() works for the highly concurrent, fail-fast
context that occurs in the journal commit fast path, and how we open
code it with kmalloc and vmalloc to work "ok" in this path.
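
Roughly, the open-coded pattern being described is of the following
shape (a simplified sketch, not the actual xlog_kvmalloc()
implementation; the function name is made up and retry/error handling
is omitted):

<snip>
#include <linux/slab.h>
#include <linux/vmalloc.h>

static void *fail_fast_kvmalloc(size_t size)
{
	gfp_t flags = GFP_KERNEL;
	void *p;

	/* Contiguous attempt: no compaction, no direct reclaim. */
	flags &= ~__GFP_DIRECT_RECLAIM;
	flags |= __GFP_NOWARN | __GFP_NORETRY;

	p = kmalloc(size, flags);
	if (p)
		return p;

	/* Fall back to the vmalloc path with plain GFP_KERNEL. */
	return vmalloc(size);
}
<snip>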

Then if you go look at the commits related to it, you might find
that XFS developers tend to write properly useful changelogs to
document things like "it's better, but vmalloc will soon have lock
contention problems if we hit it any harder"....

commit 8dc9384b7d75012856b02ff44c37566a55fc2abf
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Jan 4 17:22:18 2022 -0800

    xfs: reduce kvmalloc overhead for CIL shadow buffers
    
    Oh, let me count the ways that the kvmalloc API sucks dog eggs.
    
    The problem is when we are logging lots of large objects, we hit
    kvmalloc really damn hard with costly order allocations, and
    behaviour utterly sucks:
    
         - 49.73% xlog_cil_commit
             - 31.62% kvmalloc_node
                - 29.96% __kmalloc_node
                   - 29.38% kmalloc_large_node
                      - 29.33% __alloc_pages
                         - 24.33% __alloc_pages_slowpath.constprop.0
                            - 18.35% __alloc_pages_direct_compact
                               - 17.39% try_to_compact_pages
                                  - compact_zone_order
                                     - 15.26% compact_zone
                                          5.29% __pageblock_pfn_to_page
                                          3.71% PageHuge
                                        - 1.44% isolate_migratepages_block
                                             0.71% set_pfnblock_flags_mask
                                       1.11% get_pfnblock_flags_mask
                               - 0.81% get_page_from_freelist
                                  - 0.59% _raw_spin_lock_irqsave
                                     - do_raw_spin_lock
                                          __pv_queued_spin_lock_slowpath
                            - 3.24% try_to_free_pages
                               - 3.14% shrink_node
                                  - 2.94% shrink_slab.constprop.0
                                     - 0.89% super_cache_count
                                        - 0.66% xfs_fs_nr_cached_objects
                                           - 0.65% xfs_reclaim_inodes_count
                                                0.55% xfs_perag_get_tag
                                       0.58% kfree_rcu_shrink_count
                            - 2.09% get_page_from_freelist
                               - 1.03% _raw_spin_lock_irqsave
                                  - do_raw_spin_lock
                                       __pv_queued_spin_lock_slowpath
                         - 4.88% get_page_from_freelist
                            - 3.66% _raw_spin_lock_irqsave
                               - do_raw_spin_lock
                                    __pv_queued_spin_lock_slowpath
                - 1.63% __vmalloc_node
                   - __vmalloc_node_range
                      - 1.10% __alloc_pages_bulk
                         - 0.93% __alloc_pages
                            - 0.92% get_page_from_freelist
                               - 0.89% rmqueue_bulk
                                  - 0.69% _raw_spin_lock
                                     - do_raw_spin_lock
                                          __pv_queued_spin_lock_slowpath
               13.73% memcpy_erms
             - 2.22% kvfree
    
    On this workload, that's almost a dozen CPUs all trying to compact
    and reclaim memory inside kvmalloc_node at the same time. Yet it is
    regularly falling back to vmalloc despite all that compaction, page
    and shrinker reclaim that direct reclaim is doing. Copying all the
    metadata is taking far less CPU time than allocating the storage!
    
    Direct reclaim should be considered extremely harmful.
    
    This is a high frequency, high throughput, CPU usage and latency
    sensitive allocation. We've got memory there, and we're using
    kvmalloc to allow memory allocation to avoid doing lots of work to
    try to do contiguous allocations.
    
    Except it still does *lots of costly work* that is unnecessary.
    
    Worse: the only way to avoid the slowpath page allocation trying to
    do compaction on costly allocations is to turn off direct reclaim
    (i.e. remove __GFP_RECLAIM_DIRECT from the gfp flags).
    
    Unfortunately, the stupid kvmalloc API then says "oh, this isn't a
    GFP_KERNEL allocation context, so you only get kmalloc!". This
    cuts off the vmalloc fallback, and this leads to almost instant OOM
    problems which ends up in filesystems deadlocks, shutdowns and/or
    kernel crashes.
    
    I want some basic kvmalloc behaviour:
    
    - kmalloc for a contiguous range with fail fast semantics - no
      compaction direct reclaim if the allocation enters the slow path.
    - run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails
    
    The really, really stupid part about this is these kvmalloc() calls
    are run under memalloc_nofs task context, so all the allocations are
    always reduced to GFP_NOFS regardless of the fact that kvmalloc
    requires GFP_KERNEL to be passed in. IOWs, we're already telling
    kvmalloc to behave differently to the gfp flags we pass in, but it
    still won't allow vmalloc to be run with anything other than
    GFP_KERNEL.
    
    So, this patch open codes the kvmalloc() in the commit path to have
    the above described behaviour. The result is we more than halve the
    CPU time spend doing kvmalloc() in this path and transaction commits
    with 64kB objects in them more than doubles. i.e. we get ~5x
    reduction in CPU usage per costly-sized kvmalloc() invocation and
    the profile looks like this:
    
      - 37.60% xlog_cil_commit
            16.01% memcpy_erms
          - 8.45% __kmalloc
             - 8.04% kmalloc_order_trace
                - 8.03% kmalloc_order
                   - 7.93% alloc_pages
                      - 7.90% __alloc_pages
                         - 4.05% __alloc_pages_slowpath.constprop.0
                            - 2.18% get_page_from_freelist
                            - 1.77% wake_all_kswapds
    ....
                                        - __wake_up_common_lock
                                           - 0.94% _raw_spin_lock_irqsave
                         - 3.72% get_page_from_freelist
                            - 2.43% _raw_spin_lock_irqsave
          - 5.72% vmalloc
             - 5.72% __vmalloc_node_range
                - 4.81% __get_vm_area_node.constprop.0
                   - 3.26% alloc_vmap_area
                      - 2.52% _raw_spin_lock
                   - 1.46% _raw_spin_lock
                  0.56% __alloc_pages_bulk
          - 4.66% kvfree
             - 3.25% vfree
                - __vfree
                   - 3.23% __vunmap
                      - 1.95% remove_vm_area
                         - 1.06% free_vmap_area_noflush
                            - 0.82% _raw_spin_lock
                         - 0.68% _raw_spin_lock
                      - 0.92% _raw_spin_lock
             - 1.40% kfree
                - 1.36% __free_pages
                   - 1.35% __free_pages_ok
                      - 1.02% _raw_spin_lock_irqsave
    
    It's worth noting that over 50% of the CPU time spent allocating
    these shadow buffers is now spent on spinlocks. So the shadow buffer
    allocation overhead is greatly reduced by getting rid of direct
    reclaim from kmalloc, and could probably be made even less costly if
    vmalloc() didn't use global spinlocks to protect it's structures.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>


-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-24  5:25       ` Dave Chinner
@ 2023-03-24  5:31         ` Matthew Wilcox
  2023-03-27  0:38           ` Dave Chinner
  2023-03-27 17:22         ` Uladzislau Rezki
  1 sibling, 1 reply; 33+ messages in thread
From: Matthew Wilcox @ 2023-03-24  5:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Uladzislau Rezki, Lorenzo Stoakes, linux-mm, linux-kernel,
	linux-fsdevel, Andrew Morton, Baoquan He, David Hildenbrand,
	Liu Shixin, Jiri Olsa

On Fri, Mar 24, 2023 at 04:25:39PM +1100, Dave Chinner wrote:
> Did you read the comment above this function? I mean, it's all about
> how poorly kvmalloc() works for the highly concurrent, fail-fast
> context that occurs in the journal commit fast path, and how we open
> code it with kmalloc and vmalloc to work "ok" in this path.
> 
> Then if you go look at the commits related to it, you might find
> that XFS developers tend to write properly useful changelogs to
> document things like "it's better, but vmalloc will soon have lock
> contention problems if we hit it any harder"....

The problem with writing whinges like this is that mm developers don't
read XFS changelogs.  I certainly had no idea this was a problem, and
I doubt anybody else who could make a start at fixing this problem had
any idea either.  Why go to all this effort instead of sending an email
to linux-mm?


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-24  5:31         ` Matthew Wilcox
@ 2023-03-27  0:38           ` Dave Chinner
  0 siblings, 0 replies; 33+ messages in thread
From: Dave Chinner @ 2023-03-27  0:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Uladzislau Rezki, Lorenzo Stoakes, linux-mm, linux-kernel,
	linux-fsdevel, Andrew Morton, Baoquan He, David Hildenbrand,
	Liu Shixin, Jiri Olsa

On Fri, Mar 24, 2023 at 05:31:55AM +0000, Matthew Wilcox wrote:
> On Fri, Mar 24, 2023 at 04:25:39PM +1100, Dave Chinner wrote:
> > Did you read the comment above this function? I mean, it's all about
> > how poorly kvmalloc() works for the highly concurrent, fail-fast
> > context that occurs in the journal commit fast path, and how we open
> > code it with kmalloc and vmalloc to work "ok" in this path.
> > 
> > Then if you go look at the commits related to it, you might find
> > that XFS developers tend to write properly useful changelogs to
> > document things like "it's better, but vmalloc will soon have lock
> > contention problems if we hit it any harder"....
> 
> The problem with writing whinges like this is that mm developers don't
> read XFS changelogs.  I certainly had no idea this was a problem, and
> I doubt anybody else who could make a start at fixing this problem had
> any idea either.  Why go to all this effort instead of sending an email
> to linux-mm?

<sigh>

If you read the mm/vmalloc.c change logs, you'd find that two weeks
later, a bunch of commits went into the vmalloc code to change some
of the stuff mentioned in the above XFS commit. That was a direct
result of the discussion of vmalloc/kvmalloc inadequacies, and if
you followed the links from these three commits:

30d3f01191d3 mm/vmalloc: be more explicit about supported gfp flags.
9376130c390a mm/vmalloc: add support for __GFP_NOFAIL
451769ebb7e7 mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc

You'd have found those discussions.

I went into great detail in that discussion about the problems with
the kvmalloc/vmalloc APIs and the problems with actually using them in
anger, e.g.:

https://lore.kernel.org/all/163184741778.29351.16920832234899124642.stgit@noble.brown/T/#e8bc85de35d432dcbc35a16fc72b6a3daef2a0f78

In that discussion, I gave examples and use cases for why fail-fast
behaviour is needed for the kmalloc part of kvmalloc, why the
arguments that GFP_NOFS didn't work with vmalloc were bullshit (we'd
been using it heavily for years in GFP_NOFS contexts without issues),
the lack of scope APIs for anything other than NOFS/NOIO, and that
filesystems want "retry forever" semantics, not the current
__GFP_NOFAIL semantics that have all sorts of weird side effects,
etc. I also pointed out that vmalloc is rapidly becoming one of the
hottest paths in XFS, in response to the comments that vmalloc "isn't
a hot path".

Indeed, I pointed at the XFS change in that commit during that
discussion, and you made exactly the same "you should raise this
with mm developers" complaint then, too. Imagine how frustrating it
is when I was being told to raise vmalloc issues on the linux-mm
list with mm developers during a discussion about vmalloc issues
with mm developers on the linux-mm list. Especially as it wasn't
just one mm developer that responded like that.

And yet, the 3 commits that came out of the discussion did nothing
to address the actual problem we need to fix - fail-fast high-order
kmalloc behaviour in kvmalloc() - and so the XFS commit still stands
and is badly needed.

Repeatedly castigating people, saying we should talk to mm developers
rather than working around the API they maintain, when we've repeatedly
talked to the mm developers about getting changes made and repeatedly
failed to get the changes we need made? Yeah, that leads to frustration
and commit messages documenting all the shit we haven't been able to
get changed and so need to work around.....

-Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-24  5:25       ` Dave Chinner
  2023-03-24  5:31         ` Matthew Wilcox
@ 2023-03-27 17:22         ` Uladzislau Rezki
  2023-03-28  2:53           ` Dave Chinner
  1 sibling, 1 reply; 33+ messages in thread
From: Uladzislau Rezki @ 2023-03-27 17:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Uladzislau Rezki, Lorenzo Stoakes, linux-mm, linux-kernel,
	linux-fsdevel, Andrew Morton, Baoquan He, Matthew Wilcox,
	David Hildenbrand, Liu Shixin, Jiri Olsa

>     So, this patch open codes the kvmalloc() in the commit path to have
>     the above described behaviour. The result is we more than halve the
>     CPU time spend doing kvmalloc() in this path and transaction commits
>     with 64kB objects in them more than doubles. i.e. we get ~5x
>     reduction in CPU usage per costly-sized kvmalloc() invocation and
>     the profile looks like this:
>     
>       - 37.60% xlog_cil_commit
>             16.01% memcpy_erms
>           - 8.45% __kmalloc
>              - 8.04% kmalloc_order_trace
>                 - 8.03% kmalloc_order
>                    - 7.93% alloc_pages
>                       - 7.90% __alloc_pages
>                          - 4.05% __alloc_pages_slowpath.constprop.0
>                             - 2.18% get_page_from_freelist
>                             - 1.77% wake_all_kswapds
>     ....
>                                         - __wake_up_common_lock
>                                            - 0.94% _raw_spin_lock_irqsave
>                          - 3.72% get_page_from_freelist
>                             - 2.43% _raw_spin_lock_irqsave
>           - 5.72% vmalloc
>              - 5.72% __vmalloc_node_range
>                 - 4.81% __get_vm_area_node.constprop.0
>                    - 3.26% alloc_vmap_area
>                       - 2.52% _raw_spin_lock
>                    - 1.46% _raw_spin_lock
>                   0.56% __alloc_pages_bulk
>           - 4.66% kvfree
>              - 3.25% vfree
OK, I see. I tried to use fs_mark in different configurations. For
example:

<snip>
time fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 -d ./scratch/0 -d ./scratch/1 -d ./scratch/2  \
-d ./scratch/3 -d ./scratch/4 -d ./scratch/5 -d ./scratch/6 -d ./scratch/7 -d ./scratch/8 \
-d ./scratch/9 -d ./scratch/10 -d ./scratch/11 -d ./scratch/12 -d ./scratch/13 \
-d ./scratch/14 -d ./scratch/15 -t 64 -F
<snip>

But I did not manage to get xlog_cil_commit() to fall back to the
vmalloc code. I think I should reduce the amount of memory on my kvm-pc
and repeat the tests!

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-27 17:22         ` Uladzislau Rezki
@ 2023-03-28  2:53           ` Dave Chinner
  2023-03-28 12:40             ` Uladzislau Rezki
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2023-03-28  2:53 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Lorenzo Stoakes, linux-mm, linux-kernel, linux-fsdevel,
	Andrew Morton, Baoquan He, Matthew Wilcox, David Hildenbrand,
	Liu Shixin, Jiri Olsa

On Mon, Mar 27, 2023 at 07:22:44PM +0200, Uladzislau Rezki wrote:
> >     So, this patch open codes the kvmalloc() in the commit path to have
> >     the above described behaviour. The result is we more than halve the
> >     CPU time spend doing kvmalloc() in this path and transaction commits
> >     with 64kB objects in them more than doubles. i.e. we get ~5x
> >     reduction in CPU usage per costly-sized kvmalloc() invocation and
> >     the profile looks like this:
> >     
> >       - 37.60% xlog_cil_commit
> >             16.01% memcpy_erms
> >           - 8.45% __kmalloc
> >              - 8.04% kmalloc_order_trace
> >                 - 8.03% kmalloc_order
> >                    - 7.93% alloc_pages
> >                       - 7.90% __alloc_pages
> >                          - 4.05% __alloc_pages_slowpath.constprop.0
> >                             - 2.18% get_page_from_freelist
> >                             - 1.77% wake_all_kswapds
> >     ....
> >                                         - __wake_up_common_lock
> >                                            - 0.94% _raw_spin_lock_irqsave
> >                          - 3.72% get_page_from_freelist
> >                             - 2.43% _raw_spin_lock_irqsave
> >           - 5.72% vmalloc
> >              - 5.72% __vmalloc_node_range
> >                 - 4.81% __get_vm_area_node.constprop.0
> >                    - 3.26% alloc_vmap_area
> >                       - 2.52% _raw_spin_lock
> >                    - 1.46% _raw_spin_lock
> >                   0.56% __alloc_pages_bulk
> >           - 4.66% kvfree
> >              - 3.25% vfree
> OK, I see. I tried to use fs_mark in different configurations. For
> example:
> 
> <snip>
> time fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 -d ./scratch/0 -d ./scratch/1 -d ./scratch/2  \
> -d ./scratch/3 -d ./scratch/4 -d ./scratch/5 -d ./scratch/6 -d ./scratch/7 -d ./scratch/8 \
> -d ./scratch/9 -d ./scratch/10 -d ./scratch/11 -d ./scratch/12 -d ./scratch/13 \
> -d ./scratch/14 -d ./scratch/15 -t 64 -F
> <snip>
> 
> But I did not manage to get xlog_cil_commit() to fall back to the
> vmalloc code. I think I should reduce the amount of memory on my kvm-pc
> and repeat the tests!

A simple way of doing that is to use directory blocks that are larger
than page size:

mkfs.xfs -n size=64k ....

We can hit that path in other ways: large attributes will hit it in
the attr buffer allocation path, and enabling the new attribute
intent-based logging mode will hit it in the xlog_cil_commit path as
well. IIRC, the above profile comes from the latter case, creating
lots of zero-length files with 64kB xattrs attached via fsmark.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock
  2023-03-28  2:53           ` Dave Chinner
@ 2023-03-28 12:40             ` Uladzislau Rezki
  0 siblings, 0 replies; 33+ messages in thread
From: Uladzislau Rezki @ 2023-03-28 12:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Uladzislau Rezki, Lorenzo Stoakes, linux-mm, linux-kernel,
	linux-fsdevel, Andrew Morton, Baoquan He, Matthew Wilcox,
	David Hildenbrand, Liu Shixin, Jiri Olsa

On Tue, Mar 28, 2023 at 01:53:27PM +1100, Dave Chinner wrote:
> On Mon, Mar 27, 2023 at 07:22:44PM +0200, Uladzislau Rezki wrote:
> > >     So, this patch open codes the kvmalloc() in the commit path to have
> > >     the above described behaviour. The result is we more than halve the
> > >     CPU time spend doing kvmalloc() in this path and transaction commits
> > >     with 64kB objects in them more than doubles. i.e. we get ~5x
> > >     reduction in CPU usage per costly-sized kvmalloc() invocation and
> > >     the profile looks like this:
> > >     
> > >       - 37.60% xlog_cil_commit
> > >             16.01% memcpy_erms
> > >           - 8.45% __kmalloc
> > >              - 8.04% kmalloc_order_trace
> > >                 - 8.03% kmalloc_order
> > >                    - 7.93% alloc_pages
> > >                       - 7.90% __alloc_pages
> > >                          - 4.05% __alloc_pages_slowpath.constprop.0
> > >                             - 2.18% get_page_from_freelist
> > >                             - 1.77% wake_all_kswapds
> > >     ....
> > >                                         - __wake_up_common_lock
> > >                                            - 0.94% _raw_spin_lock_irqsave
> > >                          - 3.72% get_page_from_freelist
> > >                             - 2.43% _raw_spin_lock_irqsave
> > >           - 5.72% vmalloc
> > >              - 5.72% __vmalloc_node_range
> > >                 - 4.81% __get_vm_area_node.constprop.0
> > >                    - 3.26% alloc_vmap_area
> > >                       - 2.52% _raw_spin_lock
> > >                    - 1.46% _raw_spin_lock
> > >                   0.56% __alloc_pages_bulk
> > >           - 4.66% kvfree
> > >              - 3.25% vfree
> > OK, I see. I tried to use fs_mark in different configurations. For
> > example:
> > 
> > <snip>
> > time fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 -d ./scratch/0 -d ./scratch/1 -d ./scratch/2  \
> > -d ./scratch/3 -d ./scratch/4 -d ./scratch/5 -d ./scratch/6 -d ./scratch/7 -d ./scratch/8 \
> > -d ./scratch/9 -d ./scratch/10 -d ./scratch/11 -d ./scratch/12 -d ./scratch/13 \
> > -d ./scratch/14 -d ./scratch/15 -t 64 -F
> > <snip>
> > 
> > But I did not manage to get xlog_cil_commit() to fall back to the
> > vmalloc code. I think I should reduce the amount of memory on my kvm-pc
> > and repeat the tests!
> 
> A simple way of doing that is to use directory blocks that are larger
> than page size:
> 
> mkfs.xfs -n size=64k ....
> 
> We can hit that path in other ways: large attributes will hit it in
> the attr buffer allocation path, and enabling the new attribute
> intent-based logging mode will hit it in the xlog_cil_commit path as
> well. IIRC, the above profile comes from the latter case, creating
> lots of zero-length files with 64kB xattrs attached via fsmark.
> 
Good. Thank you, that is useful.

--
Uladzislau Rezki


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2023-03-28 12:40 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-19  7:09 [PATCH v2 0/4] convert read_kcore(), vread() to use iterators Lorenzo Stoakes
2023-03-19  7:09 ` [PATCH v2 1/4] fs/proc/kcore: Avoid bounce buffer for ktext data Lorenzo Stoakes
2023-03-20  9:58   ` David Hildenbrand
2023-03-19  7:09 ` [PATCH v2 2/4] mm: vmalloc: use rwsem, mutex for vmap_area_lock and vmap_block->lock Lorenzo Stoakes
2023-03-19 20:10   ` Andrew Morton
2023-03-19 20:29     ` Lorenzo Stoakes
2023-03-19 20:47       ` Matthew Wilcox
2023-03-19 21:16         ` Lorenzo Stoakes
2023-03-20  8:40           ` Lorenzo Stoakes
2023-03-20  7:54   ` Uladzislau Rezki
2023-03-20  8:25     ` Lorenzo Stoakes
2023-03-20  8:32       ` Uladzislau Rezki
2023-03-20  8:35         ` Lorenzo Stoakes
2023-03-20 11:20           ` Uladzislau Rezki
2023-03-21  1:09   ` Dave Chinner
2023-03-21  5:23     ` Uladzislau Rezki
2023-03-21  7:45       ` Lorenzo Stoakes
2023-03-21  8:54         ` Uladzislau Rezki
2023-03-21 10:05         ` Dave Chinner
2023-03-21 10:24           ` Uladzislau Rezki
2023-03-22 13:18     ` Uladzislau Rezki
2023-03-22 17:47       ` Matthew Wilcox
2023-03-22 18:01         ` Uladzislau Rezki
2023-03-22 19:15           ` Uladzislau Rezki
2023-03-23 12:47             ` Uladzislau Rezki
2023-03-24  5:25       ` Dave Chinner
2023-03-24  5:31         ` Matthew Wilcox
2023-03-27  0:38           ` Dave Chinner
2023-03-27 17:22         ` Uladzislau Rezki
2023-03-28  2:53           ` Dave Chinner
2023-03-28 12:40             ` Uladzislau Rezki
2023-03-19  7:09 ` [PATCH v2 3/4] fs/proc/kcore: convert read_kcore() to read_kcore_iter() Lorenzo Stoakes
2023-03-19  7:09 ` [PATCH v2 4/4] mm: vmalloc: convert vread() to vread_iter() Lorenzo Stoakes
