* [PATCH RFC 0/3] mm/vmalloc.c: allow vread() to read out vm_map_ram areas
@ 2022-11-09  3:35 Baoquan He
  2022-11-09  3:35 ` [PATCH RFC 1/3] mm/vmalloc.c: add used_map into vmap_block to track space of vmap_block Baoquan He
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Baoquan He @ 2022-11-09  3:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, akpm, stephen.s.brennan, urezki, hch, Baoquan He

Problem:
***
Stephen reported that vread() skips vm_map_ram areas when reading out
/proc/kcore with the drgn utility. Please see the link below for more
details:

  /proc/kcore reads 0's for vmap_block
  https://lore.kernel.org/all/87ilk6gos2.fsf@oracle.com/T/#u

Root cause:
***
The normal vmalloc API uses struct vmap_area to manage the allocated
virtual kernel area, and associates a vm_struct with it to store more
information and pass it out. However, an area reserved through the
vm_map_ram() interface has no vm_struct bound to it. So the current code
in vread() skips vm_map_ram areas via the 'if (!va->vm)' check.

Solution:
***
There are two types of vm_map_ram areas. In one, the whole vmap_area is
reserved and mapped at one time; in the other, a whole vmap_area of
VMAP_BLOCK_SIZE is reserved at one time and then mapped piecemeal into
smaller regions several times.

Patches 1 and 2 add flags to struct vmap_area to mark these two types of
vm_map_ram area, and add a bitmap field used_map to struct vmap_block to
mark the regions in use, differentiating them from the dirty and free
regions.

With the help of vmap_area->flags and vmap_block->used_map, vread() can
recognize vm_map_ram areas and handle each type accordingly.
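
For a quick picture, the per-area dispatch in vread() roughly becomes the
following (condensed from patch 3; see the patch for the exact code and
context):

	/* skip areas that are neither normal vmalloc nor vm_map_ram */
	if (!(va->flags & VMAP_RAM) && !va->vm)
		continue;
	...
	if ((va->flags & (VMAP_RAM|VMAP_BLOCK)) == (VMAP_RAM|VMAP_BLOCK))
		vb_vread(buf, addr, n);		/* vb_alloc() pieces: consult vb->used_map */
	else if ((va->flags & VMAP_RAM) || !(vm->flags & VM_IOREMAP))
		aligned_vread(buf, addr, n);	/* whole vm_map_ram area or normal vmalloc */
	else
		memset(buf, 0, n);		/* IOREMAP area is treated as memory hole */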

Test:
***
I don't know of a system that has vm_map_ram() areas, so this has only
passed a build test, plus running "makedumpfile --mem-usage /proc/kcore"
to confirm it doesn't impact the existing kcore reading.

	[root@ibm-x3950x6-01 ~]# free -h
	               total        used        free      shared  buff/cache   available
	Mem:           3.9Ti       3.6Gi       3.9Ti       7.0Mi       497Mi       3.9Ti
	Swap:          8.0Gi          0B       8.0Gi
	[root@ibm-x3950x6-01 ~]# makedumpfile --mem-usage /proc/kcore
	The kernel version is not supported.
	The makedumpfile operation may be incomplete.
	
	TYPE		PAGES			EXCLUDABLE	DESCRIPTION
	----------------------------------------------------------------------
	ZERO		327309          	yes		Pages filled with zero
	NON_PRI_CACHE	81750           	yes		Cache pages without private flag
	PRI_CACHE	83981           	yes		Cache pages with private flag
	USER		12735           	yes		User process pages
	FREE		1055688908      	yes		Free pages
	KERN_DATA	17464385        	no		Dumpable kernel data 
	
	page size:		4096            
	Total pages on system:	1073659068      
	Total size on system:	4397707542528    Byte


Baoquan He (3):
  mm/vmalloc.c: add used_map into vmap_block to track space of
    vmap_block
  mm/vmalloc.c: add flags to mark vm_map_ram area
  mm/vmalloc.c: allow vread() to read out vm_map_ram areas

 include/linux/vmalloc.h |  1 +
 mm/vmalloc.c            | 81 +++++++++++++++++++++++++++++++++++++----
 2 files changed, 75 insertions(+), 7 deletions(-)

-- 
2.34.1



* [PATCH RFC 1/3] mm/vmalloc.c: add used_map into vmap_block to track space of vmap_block
  2022-11-09  3:35 [PATCH RFC 0/3] mm/vmalloc.c: allow vread() to read out vm_map_ram areas Baoquan He
@ 2022-11-09  3:35 ` Baoquan He
  2022-11-09  3:35 ` [PATCH RFC 2/3] mm/vmalloc.c: add flags to mark vm_map_ram area Baoquan He
  2022-11-09  3:35 ` [PATCH RFC 3/3] mm/vmalloc.c: allow vread() to read out vm_map_ram areas Baoquan He
  2 siblings, 0 replies; 14+ messages in thread
From: Baoquan He @ 2022-11-09  3:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, akpm, stephen.s.brennan, urezki, hch, Baoquan He

In a vmap_block area, there can be three types of regions: used regions
allocated through vb_alloc(), dirty regions freed through vb_free(), and
free regions. Among them, only used regions contain available data, yet
there is currently no way to track them.

Here, add a bitmap field used_map to vmap_block, and set/clear its bits
when regions of the vmap_block area are allocated or freed.

This is a preparation for later patches.

Signed-off-by: Baoquan He <bhe@redhat.com>
---
 mm/vmalloc.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ccaa461998f3..5d3fd3e6fe09 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1896,6 +1896,7 @@ struct vmap_block {
 	spinlock_t lock;
 	struct vmap_area *va;
 	unsigned long free, dirty;
+	DECLARE_BITMAP(used_map, VMAP_BBMAP_BITS);
 	unsigned long dirty_min, dirty_max; /*< dirty range */
 	struct list_head free_list;
 	struct rcu_head rcu_head;
@@ -1972,10 +1973,12 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
 	vb->va = va;
 	/* At least something should be left free */
 	BUG_ON(VMAP_BBMAP_BITS <= (1UL << order));
+	bitmap_zero(vb->used_map, VMAP_BBMAP_BITS);
 	vb->free = VMAP_BBMAP_BITS - (1UL << order);
 	vb->dirty = 0;
 	vb->dirty_min = VMAP_BBMAP_BITS;
 	vb->dirty_max = 0;
+	bitmap_set(vb->used_map, 0, (1UL << order));
 	INIT_LIST_HEAD(&vb->free_list);
 
 	vb_idx = addr_to_vb_idx(va->va_start);
@@ -2081,6 +2084,7 @@ static void *vb_alloc(unsigned long size, gfp_t gfp_mask)
 		pages_off = VMAP_BBMAP_BITS - vb->free;
 		vaddr = vmap_block_vaddr(vb->va->va_start, pages_off);
 		vb->free -= 1UL << order;
+		bitmap_set(vb->used_map, pages_off, (1UL << order));
 		if (vb->free == 0) {
 			spin_lock(&vbq->lock);
 			list_del_rcu(&vb->free_list);
@@ -2114,6 +2118,9 @@ static void vb_free(unsigned long addr, unsigned long size)
 	order = get_order(size);
 	offset = (addr & (VMAP_BLOCK_SIZE - 1)) >> PAGE_SHIFT;
 	vb = xa_load(&vmap_blocks, addr_to_vb_idx(addr));
+	spin_lock(&vb->lock);
+	bitmap_clear(vb->used_map, offset, (1UL << order));
+	spin_unlock(&vb->lock);
 
 	vunmap_range_noflush(addr, addr + size);
 
-- 
2.34.1



* [PATCH RFC 2/3] mm/vmalloc.c: add flags to mark vm_map_ram area
  2022-11-09  3:35 [PATCH RFC 0/3] mm/vmalloc.c: allow vread() to read out vm_map_ram areas Baoquan He
  2022-11-09  3:35 ` [PATCH RFC 1/3] mm/vmalloc.c: add used_map into vmap_block to track space of vmap_block Baoquan He
@ 2022-11-09  3:35 ` Baoquan He
  2022-11-09  3:35 ` [PATCH RFC 3/3] mm/vmalloc.c: allow vread() to read out vm_map_ram areas Baoquan He
  2 siblings, 0 replies; 14+ messages in thread
From: Baoquan He @ 2022-11-09  3:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, akpm, stephen.s.brennan, urezki, hch, Baoquan He

Through the vmalloc API, a virtual kernel area is reserved for physical
address mapping, a vmap_area is used to track it, and a vm_struct is
allocated and associated with the vmap_area to store more information and
pass it out.

However, an area reserved via vm_map_ram() is an exception: it has no
vm_struct associated with its vmap_area. And we can't simply recognize
vm_map_ram areas by '->vm == NULL', because the normal freeing path sets
va->vm to NULL before unmapping; see remove_vm_area().

Meanwhile, there are two types of vm_map_ram areas. In one, the whole
vmap_area is reserved and mapped at one time; in the other, a whole
vmap_area of VMAP_BLOCK_SIZE is reserved and then mapped piecemeal into
smaller regions several times.

To mark areas reserved through vm_map_ram(), add a flags member to the
union in struct vmap_area, reusing that space since the union is free
anyway for a vm_map_ram area.

This is a preparation for later patches.

Signed-off-by: Baoquan He <bhe@redhat.com>
---
 include/linux/vmalloc.h |  1 +
 mm/vmalloc.c            | 17 ++++++++++++++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 096d48aa3437..b739a60b73da 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -75,6 +75,7 @@ struct vmap_area {
 	union {
 		unsigned long subtree_max_size; /* in "free" tree */
 		struct vm_struct *vm;           /* in "busy" tree */
+		unsigned long flags; /* mark type of vm_map_ram area */
 	};
 };
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5d3fd3e6fe09..41d82dc07e13 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1815,6 +1815,7 @@ static void free_vmap_area_noflush(struct vmap_area *va)
 
 	spin_lock(&vmap_area_lock);
 	unlink_va(va, &vmap_area_root);
+	va->flags = 0;
 	spin_unlock(&vmap_area_lock);
 
 	nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
@@ -1887,6 +1888,9 @@ struct vmap_area *find_vmap_area(unsigned long addr)
 
 #define VMAP_BLOCK_SIZE		(VMAP_BBMAP_BITS * PAGE_SIZE)
 
+#define VMAP_RAM	0x1
+#define VMAP_BLOCK	0x2
+
 struct vmap_block_queue {
 	spinlock_t lock;
 	struct list_head free;
@@ -1967,6 +1971,9 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
 		kfree(vb);
 		return ERR_CAST(va);
 	}
+	spin_lock(&vmap_area_lock);
+	va->flags = VMAP_RAM|VMAP_BLOCK;
+	spin_unlock(&vmap_area_lock);
 
 	vaddr = vmap_block_vaddr(va->va_start, 0);
 	spin_lock_init(&vb->lock);
@@ -2229,8 +2236,12 @@ void vm_unmap_ram(const void *mem, unsigned int count)
 		return;
 	}
 
-	va = find_vmap_area(addr);
+	spin_lock(&vmap_area_lock);
+	va = __find_vmap_area((unsigned long)addr, &vmap_area_root);
 	BUG_ON(!va);
+	if (va)
+		va->flags &= ~VMAP_RAM;
+	spin_unlock(&vmap_area_lock);
 	debug_check_no_locks_freed((void *)va->va_start,
 				    (va->va_end - va->va_start));
 	free_unmap_vmap_area(va);
@@ -2269,6 +2280,10 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
 		if (IS_ERR(va))
 			return NULL;
 
+		spin_lock(&vmap_area_lock);
+		va->flags = VMAP_RAM;
+		spin_unlock(&vmap_area_lock);
+
 		addr = va->va_start;
 		mem = (void *)addr;
 	}
-- 
2.34.1



* [PATCH RFC 3/3] mm/vmalloc.c: allow vread() to read out vm_map_ram areas
  2022-11-09  3:35 [PATCH RFC 0/3] mm/vmalloc.c: allow vread() to read out vm_map_ram areas Baoquan He
  2022-11-09  3:35 ` [PATCH RFC 1/3] mm/vmalloc.c: add used_map into vmap_block to track space of vmap_block Baoquan He
  2022-11-09  3:35 ` [PATCH RFC 2/3] mm/vmalloc.c: add flags to mark vm_map_ram area Baoquan He
@ 2022-11-09  3:35 ` Baoquan He
  2022-11-10  0:59   ` Stephen Brennan
  2022-11-18  8:01   ` Matthew Wilcox
  2 siblings, 2 replies; 14+ messages in thread
From: Baoquan He @ 2022-11-09  3:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, akpm, stephen.s.brennan, urezki, hch, Baoquan He

Currently, vread() can read out vmalloc areas which are associated with a
vm_struct. However, this doesn't work for areas created by the
vm_map_ram() interface because it doesn't allocate a vm_struct, so these
areas are skipped in vread().

Here, add a new function vb_vread() to read out areas managed by
vmap_block specifically. Then recognize vm_map_ram areas via
vmap_area->flags and handle them accordingly.

Reported-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Signed-off-by: Baoquan He <bhe@redhat.com>
Link: https://lore.kernel.org/all/87ilk6gos2.fsf@oracle.com/T/#u
---
 mm/vmalloc.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 51 insertions(+), 6 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 41d82dc07e13..5a8d5659bfb0 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3518,6 +3518,46 @@ static int aligned_vread(char *buf, char *addr, unsigned long count)
 	return copied;
 }
 
+static void vb_vread(char *buf, char *addr, int count)
+{
+	char *start;
+	struct vmap_block *vb;
+	unsigned long offset;
+	unsigned int rs, re, n;
+
+	offset = ((unsigned long)addr & (VMAP_BLOCK_SIZE - 1)) >> PAGE_SHIFT;
+	vb = xa_load(&vmap_blocks, addr_to_vb_idx((unsigned long)addr));
+
+	spin_lock(&vb->lock);
+	if (bitmap_empty(vb->used_map, VMAP_BBMAP_BITS)) {
+		spin_unlock(&vb->lock);
+		memset(buf, 0, count);
+		return;
+	}
+	for_each_set_bitrange(rs, re, vb->used_map, VMAP_BBMAP_BITS) {
+		if (!count)
+			break;
+		start = vmap_block_vaddr(vb->va->va_start, rs);
+		if (addr < start) {
+			if (count == 0)
+				break;
+			*buf = '\0';
+			buf++;
+			addr++;
+			count--;
+		}
+		n = (re - rs + 1) << PAGE_SHIFT;
+		if (n > count)
+			n = count;
+		aligned_vread(buf, start, n);
+
+		buf += n;
+		addr += n;
+		count -= n;
+	}
+	spin_unlock(&vb->lock);
+}
+
 /**
  * vread() - read vmalloc area in a safe way.
  * @buf:     buffer for reading data
@@ -3548,7 +3588,7 @@ long vread(char *buf, char *addr, unsigned long count)
 	struct vm_struct *vm;
 	char *vaddr, *buf_start = buf;
 	unsigned long buflen = count;
-	unsigned long n;
+	unsigned long n, size;
 
 	addr = kasan_reset_tag(addr);
 
@@ -3569,12 +3609,14 @@ long vread(char *buf, char *addr, unsigned long count)
 		if (!count)
 			break;
 
-		if (!va->vm)
+		if (!(va->flags & VMAP_RAM) && !va->vm)
 			continue;
 
 		vm = va->vm;
-		vaddr = (char *) vm->addr;
-		if (addr >= vaddr + get_vm_area_size(vm))
+		vaddr = (char *) va->va_start;
+		size = vm ? get_vm_area_size(vm) : va_size(va);
+
+		if (addr >= vaddr + size)
 			continue;
 		while (addr < vaddr) {
 			if (count == 0)
@@ -3584,10 +3626,13 @@ long vread(char *buf, char *addr, unsigned long count)
 			addr++;
 			count--;
 		}
-		n = vaddr + get_vm_area_size(vm) - addr;
+		n = vaddr + size - addr;
 		if (n > count)
 			n = count;
-		if (!(vm->flags & VM_IOREMAP))
+
+		if ((va->flags & (VMAP_RAM|VMAP_BLOCK)) == (VMAP_RAM|VMAP_BLOCK))
+			vb_vread(buf, addr, n);
+		else if ((va->flags & VMAP_RAM) || !(vm->flags & VM_IOREMAP))
 			aligned_vread(buf, addr, n);
 		else /* IOREMAP area is treated as memory hole */
 			memset(buf, 0, n);
-- 
2.34.1



* Re: [PATCH RFC 3/3] mm/vmalloc.c: allow vread() to read out vm_map_ram areas
  2022-11-09  3:35 ` [PATCH RFC 3/3] mm/vmalloc.c: allow vread() to read out vm_map_ram areas Baoquan He
@ 2022-11-10  0:59   ` Stephen Brennan
  2022-11-10 10:23     ` Baoquan He
  2022-11-18  8:01   ` Matthew Wilcox
  1 sibling, 1 reply; 14+ messages in thread
From: Stephen Brennan @ 2022-11-10  0:59 UTC (permalink / raw)
  To: Baoquan He, linux-kernel; +Cc: linux-mm, akpm, urezki, hch, Baoquan He

Baoquan He <bhe@redhat.com> writes:
> Currently, vread() can read out vmalloc areas which is associated with
> a vm_struct. While this doesn't work for areas created by vm_map_ram()
> interface because it doesn't allocate a vm_struct. Then in vread(),
> these areas will be skipped.
>
> Here, add a new function vb_vread() to read out areas managed by
> vmap_block specifically. Then recognize vm_map_ram areas via vmap->flags
> and handle  them respectively.
>
> Stephen Brennan <stephen.s.brennan@oracle.com>
> Signed-off-by: Baoquan He <bhe@redhat.com>
> Link: https://lore.kernel.org/all/87ilk6gos2.fsf@oracle.com/T/#u
> ---
>  mm/vmalloc.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 51 insertions(+), 6 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 41d82dc07e13..5a8d5659bfb0 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3518,6 +3518,46 @@ static int aligned_vread(char *buf, char *addr, unsigned long count)
>  	return copied;
>  }
>  
> +static void vb_vread(char *buf, char *addr, int count)
> +{
> +	char *start;
> +	struct vmap_block *vb;
> +	unsigned long offset;
> +	unsigned int rs, re, n;
> +
> +	offset = ((unsigned long)addr & (VMAP_BLOCK_SIZE - 1)) >> PAGE_SHIFT;
> +	vb = xa_load(&vmap_blocks, addr_to_vb_idx((unsigned long)addr));
> +
> +	spin_lock(&vb->lock);
> +	if (bitmap_empty(vb->used_map, VMAP_BBMAP_BITS)) {
> +		spin_unlock(&vb->lock);
> +		memset(buf, 0, count);
> +		return;
> +	}
> +	for_each_set_bitrange(rs, re, vb->used_map, VMAP_BBMAP_BITS) {
> +		if (!count)
> +			break;
> +		start = vmap_block_vaddr(vb->va->va_start, rs);
> +		if (addr < start) {
> +			if (count == 0)
> +				break;
> +			*buf = '\0';
> +			buf++;
> +			addr++;
> +			count--;
> +		}
> +		n = (re - rs + 1) << PAGE_SHIFT;
> +		if (n > count)
> +			n = count;
> +		aligned_vread(buf, start, n);
> +
> +		buf += n;
> +		addr += n;
> +		count -= n;
> +	}
> +	spin_unlock(&vb->lock);
> +}
> +
>  /**
>   * vread() - read vmalloc area in a safe way.
>   * @buf:     buffer for reading data
> @@ -3548,7 +3588,7 @@ long vread(char *buf, char *addr, unsigned long count)
>  	struct vm_struct *vm;
>  	char *vaddr, *buf_start = buf;
>  	unsigned long buflen = count;
> -	unsigned long n;
> +	unsigned long n, size;
>  
>  	addr = kasan_reset_tag(addr);
>  
> @@ -3569,12 +3609,14 @@ long vread(char *buf, char *addr, unsigned long count)
>  		if (!count)
>  			break;
>  
> -		if (!va->vm)
> +		if (!(va->flags & VMAP_RAM) && !va->vm)
>  			continue;
>  
>  		vm = va->vm;
> -		vaddr = (char *) vm->addr;
> -		if (addr >= vaddr + get_vm_area_size(vm))
> +		vaddr = (char *) va->va_start;
> +		size = vm ? get_vm_area_size(vm) : va_size(va);

Hi Baoquan,

Thanks for working on this. I tested your patches out by using drgn to
debug /proc/kcore. I have a kernel module[1] to do a vm_map_ram() call
and print the virtual address to the kernel log so I can try to read
that memory address in drgn. When I did this test, I got a panic on the
above line of code.

[  167.101113] BUG: kernel NULL pointer dereference, address: 0000000000000013
[  167.104538] #PF: supervisor read access in kernel mode
[  167.106446] #PF: error_code(0x0000) - not-present page
[  167.108474] PGD 0 P4D 0
[  167.109311] Oops: 0000 [#1] PREEMPT SMP NOPTI
[  167.111727] CPU: 3 PID: 7647 Comm: drgn Kdump: loaded Tainted: G           OE      6.1.0-rc4.bugvreadtest.el8.dev02.x86_64 #1
[  167.115076] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.5.1 06/16/2021
[  167.117348] RIP: 0010:vread+0xaf/0x210
[  167.118345] Code: 86 3e 01 00 00 48 85 db 0f 84 35 01 00 00 49 8d 47 28 48 3d 10 f8 a7 8f 0f 84 25 01 00 00 4d 89 f4 49 8b 57 38 48 85 d2 74 21 <48> 8b 42 10 f6 42 18 40 48 89 d6 49 8b 0f 48 8d b8 00 f0 ff ff 48
[  167.123776] RSP: 0018:ffffaeb380a1fb90 EFLAGS: 00010206
[  167.125669] RAX: ffff9853a1397b28 RBX: 0000000000000040 RCX: 0000000000000000
[  167.128401] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
[  167.130948] RBP: ffffaeb382400000 R08: 0000000000000000 R09: 0000000000000000
[  167.133372] R10: 0000000000000000 R11: 0000000000000000 R12: ffff985385877000
[  167.135397] R13: 0000000000000040 R14: ffff985385877000 R15: ffff9853a1397b00
[  167.137533] FS:  00007f71eae33b80(0000) GS:ffff9856afd80000(0000) knlGS:0000000000000000
[  167.140210] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  167.142440] CR2: 0000000000000013 CR3: 000000012048a000 CR4: 00000000003506e0
[  167.144640] Call Trace:
[  167.145494]  <TASK>
[  167.146263]  read_kcore+0x33a/0xa30
[  167.147392]  ? remove_entity_load_avg+0x2e/0x70
[  167.148425]  ? _raw_spin_unlock_irqrestore+0x11/0x60
[  167.150657]  ? __wake_up_common_lock+0x8b/0xd0
[  167.152261]  ? tty_set_termios+0x211/0x280
[  167.153397]  ? set_termios+0x16b/0x1d0
[  167.154698]  ? _raw_spin_unlock+0xe/0x40
[  167.155737]  ? wp_page_reuse+0x60/0x80
[  167.157138]  ? do_wp_page+0x169/0x3a0
[  167.158752]  ? pmd_pfn+0x9/0x50
[  167.159645]  ? __handle_mm_fault+0x3b0/0x690
[  167.160837]  ? inode_security+0x22/0x60
[  167.161761]  proc_reg_read+0x5a/0xb0
[  167.162777]  vfs_read+0xa7/0x320
[  167.163512]  ? handle_mm_fault+0xb6/0x2c0
[  167.164400]  __x64_sys_pread64+0x9c/0xd0
[  167.165763]  do_syscall_64+0x3f/0xa0
[  167.167610]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  167.169951] RIP: 0033:0x7f71e9c123d7

I debugged the resulting core dump and found the reason:

>>> stack_trace = prog.crashed_thread().stack_trace()
>>> stack_trace
#0  crash_setup_regs (./arch/x86/include/asm/kexec.h:95:3)
#1  __crash_kexec (kernel/kexec_core.c:974:4)
#2  panic (kernel/panic.c:330:3)
#3  oops_end (arch/x86/kernel/dumpstack.c:379:3)
#4  page_fault_oops (arch/x86/mm/fault.c:729:2)
#5  handle_page_fault (arch/x86/mm/fault.c:1519:3)
#6  exc_page_fault (arch/x86/mm/fault.c:1575:2)
#7  asm_exc_page_fault+0x26/0x2b (./arch/x86/include/asm/idtentry.h:570)
#8  get_vm_area_size (./include/linux/vmalloc.h:203:14)
#9  vread (mm/vmalloc.c:3617:15)
#10 read_kcore (fs/proc/kcore.c:510:4)
#11 pde_read (fs/proc/inode.c:316:10)
#12 proc_reg_read (fs/proc/inode.c:328:8)
#13 vfs_read (fs/read_write.c:468:9)
#14 ksys_pread64 (fs/read_write.c:665:10)
#15 __do_sys_pread64 (fs/read_write.c:675:9)
#16 __se_sys_pread64 (fs/read_write.c:672:1)
#17 __x64_sys_pread64 (fs/read_write.c:672:1)
#18 do_syscall_x64 (arch/x86/entry/common.c:50:14)
#19 do_syscall_64 (arch/x86/entry/common.c:80:7)
#20 entry_SYSCALL_64+0x9f/0x19b (arch/x86/entry/entry_64.S:120)
#21 0x7f71e9c123d7
>>> stack_trace[9]["va"]
*(struct vmap_area *)0xffff9853a1397b00 = {
        .va_start = (unsigned long)18446654684740452352,
        .va_end = (unsigned long)18446654684741500928,
        .rb_node = (struct rb_node){
                .__rb_parent_color = (unsigned long)18446630083335569168,
                .rb_right = (struct rb_node *)0x0,
                .rb_left = (struct rb_node *)0x0,
        },
        .list = (struct list_head){
                .next = (struct list_head *)0xffff98538c403f28,
                .prev = (struct list_head *)0xffff98538c54e1e8,
        },
        .subtree_max_size = (unsigned long)3,
        .vm = (struct vm_struct *)0x3,
        .flags = (unsigned long)3,
}

Since flags is in a union, it shadows "vm" and causes the condition to
be true, and then get_vm_area_size() tries to follow the pointer defined
by flags. I'm not sure if the fix is to have flags be a separate field
inside vmap_area, or to have more careful handling in the vread path.
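
For illustration, a minimal user-space reduction of the aliasing
(hypothetical code, not from the kernel):

	#include <stdio.h>

	struct fake_vmap_area {
		union {
			unsigned long subtree_max_size;
			void *vm;		/* stands in for struct vm_struct * */
			unsigned long flags;
		};
	};

	int main(void)
	{
		/* flags = VMAP_RAM|VMAP_BLOCK, as in the crash above */
		struct fake_vmap_area va = { .flags = 0x3 };

		/* "va.vm" looks non-NULL only because it aliases flags */
		if (va.vm)
			printf("vm looks valid: %p\n", va.vm);	/* prints 0x3 */
		return 0;
	}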

Thanks,
Stephen

> +
> +		if (addr >= vaddr + size)
>  			continue;
>  		while (addr < vaddr) {
>  			if (count == 0)
> @@ -3584,10 +3626,13 @@ long vread(char *buf, char *addr, unsigned long count)
>  			addr++;
>  			count--;
>  		}
> -		n = vaddr + get_vm_area_size(vm) - addr;
> +		n = vaddr + size - addr;
>  		if (n > count)
>  			n = count;
> -		if (!(vm->flags & VM_IOREMAP))
> +
> +		if ((va->flags & (VMAP_RAM|VMAP_BLOCK)) == (VMAP_RAM|VMAP_BLOCK))
> +			vb_vread(buf, addr, n);
> +		else if ((va->flags & VMAP_RAM) || !(vm->flags & VM_IOREMAP))
>  			aligned_vread(buf, addr, n);
>  		else /* IOREMAP area is treated as memory hole */
>  			memset(buf, 0, n);
> -- 
> 2.34.1


* Re: [PATCH RFC 3/3] mm/vmalloc.c: allow vread() to read out vm_map_ram areas
  2022-11-10  0:59   ` Stephen Brennan
@ 2022-11-10 10:23     ` Baoquan He
  2022-11-10 18:48       ` Stephen Brennan
  0 siblings, 1 reply; 14+ messages in thread
From: Baoquan He @ 2022-11-10 10:23 UTC (permalink / raw)
  To: Stephen Brennan; +Cc: linux-kernel, linux-mm, akpm, urezki, hch

On 11/09/22 at 04:59pm, Stephen Brennan wrote:
......  
> > @@ -3569,12 +3609,14 @@ long vread(char *buf, char *addr, unsigned long count)
> >  		if (!count)
> >  			break;
> >  
> > -		if (!va->vm)
> > +		if (!(va->flags & VMAP_RAM) && !va->vm)
> >  			continue;
> >  
> >  		vm = va->vm;
> > -		vaddr = (char *) vm->addr;
> > -		if (addr >= vaddr + get_vm_area_size(vm))
> > +		vaddr = (char *) va->va_start;
> > +		size = vm ? get_vm_area_size(vm) : va_size(va);
> 
> Hi Baoquan,
> 
> Thanks for working on this. I tested your patches out by using drgn to
> debug /proc/kcore. I have a kernel module[1] to do a vm_map_ram() call
> and print the virtual address to the kernel log so I can try to read
> that memory address in drgn. When I did this test, I got a panic on the
> above line of code.
......
> Since flags is in a union, it shadows "vm" and causes the condition to
> be true, and then get_vm_area_size() tries to follow the pointer defined
> by flags. I'm not sure if the fix is to have flags be a separate field
> inside vmap_area, or to have more careful handling in the vread path.

Sorry, my bad. Thanks for testing this and catching the error, Stephen.

About the fix, both ways are fine to me. I made a draft fix based on the
current patchset. To me, adding flags as a separate field makes the code
simpler, but costs extra memory. I will see what other people say about
this: first whether the solution is acceptable at all, then whether to
reuse the union field or add another flags field.

Could you try the code below to see if it works?

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5a8d5659bfb0..78cae59170d8 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1890,6 +1890,7 @@ struct vmap_area *find_vmap_area(unsigned long addr)
 
 #define VMAP_RAM	0x1
 #define VMAP_BLOCK	0x2
+#define VMAP_FLAGS_MASK	0x3
 
 struct vmap_block_queue {
 	spinlock_t lock;
@@ -3588,7 +3589,7 @@ long vread(char *buf, char *addr, unsigned long count)
 	struct vm_struct *vm;
 	char *vaddr, *buf_start = buf;
 	unsigned long buflen = count;
-	unsigned long n, size;
+	unsigned long n, size, flags;
 
 	addr = kasan_reset_tag(addr);
 
@@ -3609,12 +3610,14 @@ long vread(char *buf, char *addr, unsigned long count)
 		if (!count)
 			break;
 
-		if (!(va->flags & VMAP_RAM) && !va->vm)
+		if (!va->vm)
 			continue;
 
+		flags = va->flags & VMAP_FLAGS_MASK;
 		vm = va->vm;
+
 		vaddr = (char *) va->va_start;
-		size = vm ? get_vm_area_size(vm) : va_size(va);
+		size = flags ? va_size(va) : get_vm_area_size(vm);
 
 		if (addr >= vaddr + size)
 			continue;
@@ -3630,9 +3633,9 @@ long vread(char *buf, char *addr, unsigned long count)
 		if (n > count)
 			n = count;
 
-		if ((va->flags & (VMAP_RAM|VMAP_BLOCK)) == (VMAP_RAM|VMAP_BLOCK))
+		if ((flags & (VMAP_RAM|VMAP_BLOCK)) == (VMAP_RAM|VMAP_BLOCK))
 			vb_vread(buf, addr, n);
-		else if ((va->flags & VMAP_RAM) || !(vm->flags & VM_IOREMAP))
+		else if ((flags & VMAP_RAM) || !(vm->flags & VM_IOREMAP))
 			aligned_vread(buf, addr, n);
 		else /* IOREMAP area is treated as memory hole */
 			memset(buf, 0, n);



* Re: [PATCH RFC 3/3] mm/vmalloc.c: allow vread() to read out vm_map_ram areas
  2022-11-10 10:23     ` Baoquan He
@ 2022-11-10 18:48       ` Stephen Brennan
  2022-11-14 10:06         ` Baoquan He
  0 siblings, 1 reply; 14+ messages in thread
From: Stephen Brennan @ 2022-11-10 18:48 UTC (permalink / raw)
  To: Baoquan He; +Cc: linux-kernel, linux-mm, akpm, urezki, hch

Baoquan He <bhe@redhat.com> writes:

> On 11/09/22 at 04:59pm, Stephen Brennan wrote:
> ......  
>> > @@ -3569,12 +3609,14 @@ long vread(char *buf, char *addr, unsigned long count)
>> >  		if (!count)
>> >  			break;
>> >  
>> > -		if (!va->vm)
>> > +		if (!(va->flags & VMAP_RAM) && !va->vm)
>> >  			continue;
>> >  
>> >  		vm = va->vm;
>> > -		vaddr = (char *) vm->addr;
>> > -		if (addr >= vaddr + get_vm_area_size(vm))
>> > +		vaddr = (char *) va->va_start;
>> > +		size = vm ? get_vm_area_size(vm) : va_size(va);
>> 
>> Hi Baoquan,
>> 
>> Thanks for working on this. I tested your patches out by using drgn to
>> debug /proc/kcore. I have a kernel module[1] to do a vm_map_ram() call
>> and print the virtual address to the kernel log so I can try to read
>> that memory address in drgn. When I did this test, I got a panic on the
>> above line of code.
> ......
>> Since flags is in a union, it shadows "vm" and causes the condition to
>> be true, and then get_vm_area_size() tries to follow the pointer defined
>> by flags. I'm not sure if the fix is to have flags be a separate field
>> inside vmap_area, or to have more careful handling in the vread path.
>
> Sorry, my bad. Thanks for testing this and catching the error, Stephen.
>
> About the fix, both way are fine to me. I made a draft fix based on the
> current patchset. To me, adding flags in a separate field makes code
> easier, but cost extra memory. I will see what other people say about
> this, firstly if the solution is acceptable, then reusing the union
> field or adding anohter flags.
>
> Could you try below code to see if it works?

I tried your patch and it worked for me: reading from vm_map_ram()
regions in drgn was fine and gave me the correct values, and I also
tested reads which overlapped the beginning and end of the region.

That said (and I don't know the vmalloc code well at all), I wonder
whether you can be confident that the lower 2 bits of the va->vm pointer
are always clear? It looks like it comes from kmalloc, and so it should
be aligned, but can slab red zones mess up that alignment?

I also tested out this patch on top of yours, to be a bit more cautious.
I think we can be confident that the remaining bits are zero when used
as flags, and non-zero when used as a pointer, so you can test them to
avoid any overlap. But it's probably too cautious.

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 78cae59170d8..911974f32b23 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3613,7 +3613,7 @@ long vread(char *buf, char *addr, unsigned long count)
                if (!va->vm)
                        continue;

-               flags = va->flags & VMAP_FLAGS_MASK;
+               flags = (va->flags & ~VMAP_FLAGS_MASK) ? 0 : (va->flags & VMAP_FLAGS_MASK);
                vm = va->vm;

                vaddr = (char *) va->va_start;
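
To illustrate the idea of that extra check with hypothetical values
(64-bit; a real vm_struct pointer always has bits set above the low two,
while the flag values never do):

	#include <stdio.h>

	#define VMAP_FLAGS_MASK	0x3UL

	static unsigned long effective_flags(unsigned long word)
	{
		/* treat the word as flags only if no bits beyond the mask are set */
		return (word & ~VMAP_FLAGS_MASK) ? 0 : (word & VMAP_FLAGS_MASK);
	}

	int main(void)
	{
		printf("%#lx\n", effective_flags(0x3UL));		 /* 0x3: flag bits */
		printf("%#lx\n", effective_flags(0xffff985385877000UL)); /* 0: pointer-like value */
		return 0;
	}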


* Re: [PATCH RFC 3/3] mm/vmalloc.c: allow vread() to read out vm_map_ram areas
  2022-11-10 18:48       ` Stephen Brennan
@ 2022-11-14 10:06         ` Baoquan He
  0 siblings, 0 replies; 14+ messages in thread
From: Baoquan He @ 2022-11-14 10:06 UTC (permalink / raw)
  To: Stephen Brennan; +Cc: linux-kernel, linux-mm, akpm, urezki, hch

On 11/10/22 at 10:48am, Stephen Brennan wrote:
> Baoquan He <bhe@redhat.com> writes:
> 
> > On 11/09/22 at 04:59pm, Stephen Brennan wrote:
> > ......  
> >> > @@ -3569,12 +3609,14 @@ long vread(char *buf, char *addr, unsigned long count)
> >> >  		if (!count)
> >> >  			break;
> >> >  
> >> > -		if (!va->vm)
> >> > +		if (!(va->flags & VMAP_RAM) && !va->vm)
> >> >  			continue;
> >> >  
> >> >  		vm = va->vm;
> >> > -		vaddr = (char *) vm->addr;
> >> > -		if (addr >= vaddr + get_vm_area_size(vm))
> >> > +		vaddr = (char *) va->va_start;
> >> > +		size = vm ? get_vm_area_size(vm) : va_size(va);
> >> 
> >> Hi Baoquan,
> >> 
> >> Thanks for working on this. I tested your patches out by using drgn to
> >> debug /proc/kcore. I have a kernel module[1] to do a vm_map_ram() call
> >> and print the virtual address to the kernel log so I can try to read
> >> that memory address in drgn. When I did this test, I got a panic on the
> >> above line of code.
> > ......
> >> Since flags is in a union, it shadows "vm" and causes the condition to
> >> be true, and then get_vm_area_size() tries to follow the pointer defined
> >> by flags. I'm not sure if the fix is to have flags be a separate field
> >> inside vmap_area, or to have more careful handling in the vread path.
> >
> > Sorry, my bad. Thanks for testing this and catching the error, Stephen.
> >
> > About the fix, both way are fine to me. I made a draft fix based on the
> > current patchset. To me, adding flags in a separate field makes code
> > easier, but cost extra memory. I will see what other people say about
> > this, firstly if the solution is acceptable, then reusing the union
> > field or adding anohter flags.
> >
> > Could you try below code to see if it works?
> 
> I tried the patch below and it worked for me: reading from vm_map_ram()
> regions in drgn was fine, gave me the correct values, and I also tested
> reads which overlapped the beginning and end of the region.

Thanks a lot for the testing.

> 
> That said (and I don't know the vmalloc code well at all), I wonder
> whether you can be confident that the lower 2 bits of the va->vm pointer
> are always clear? It looks like it comes from kmalloc, and so it should
> be aligned, but can slab red zones mess up that alignment?

Hmm, it should be OK. I am also worried about the other places where
va->vm is checked. I will check the code again to see if there's a risk
in the case you mentioned. I may change to adding a separate ->flags
field to vmap_area in the v2 post.

> 
> I also tested out this patch on top of yours, to be a bit more cautious.
> I think we can be confident that the remaining bits are zero when used
> as flags, and non-zero when used as a pointer, so you can test them to
> avoid any overlap. But it's probably too cautious.
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 78cae59170d8..911974f32b23 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3613,7 +3613,7 @@ long vread(char *buf, char *addr, unsigned long count)
>                 if (!va->vm)
>                         continue;
> 
> -               flags = va->flags & VMAP_FLAGS_MASK;
> +               flags = (va->flags & ~VMAP_FLAGS_MASK) ? 0 : (va->flags & VMAP_FLAGS_MASK);
>                 vm = va->vm;
> 
>                 vaddr = (char *) va->va_start;
> 



* Re: [PATCH RFC 3/3] mm/vmalloc.c: allow vread() to read out vm_map_ram areas
  2022-11-09  3:35 ` [PATCH RFC 3/3] mm/vmalloc.c: allow vread() to read out vm_map_ram areas Baoquan He
  2022-11-10  0:59   ` Stephen Brennan
@ 2022-11-18  8:01   ` Matthew Wilcox
  2022-11-23  3:38     ` Baoquan He
  1 sibling, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2022-11-18  8:01 UTC (permalink / raw)
  To: Baoquan He; +Cc: linux-kernel, linux-mm, akpm, stephen.s.brennan, urezki, hch

On Wed, Nov 09, 2022 at 11:35:34AM +0800, Baoquan He wrote:
> Currently, vread() can read out vmalloc areas which is associated with
> a vm_struct. While this doesn't work for areas created by vm_map_ram()
> interface because it doesn't allocate a vm_struct. Then in vread(),
> these areas will be skipped.
> 
> Here, add a new function vb_vread() to read out areas managed by
> vmap_block specifically. Then recognize vm_map_ram areas via vmap->flags
> and handle  them respectively.

I don't understand how this deals with the original problem identified,
that vread() can race with an unmap.


* Re: [PATCH RFC 3/3] mm/vmalloc.c: allow vread() to read out vm_map_ram areas
  2022-11-18  8:01   ` Matthew Wilcox
@ 2022-11-23  3:38     ` Baoquan He
  2022-11-23 13:24       ` Matthew Wilcox
  0 siblings, 1 reply; 14+ messages in thread
From: Baoquan He @ 2022-11-23  3:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, linux-mm, akpm, stephen.s.brennan, urezki, hch

On 11/18/22 at 08:01am, Matthew Wilcox wrote:
> On Wed, Nov 09, 2022 at 11:35:34AM +0800, Baoquan He wrote:
> > Currently, vread() can read out vmalloc areas which is associated with
> > a vm_struct. While this doesn't work for areas created by vm_map_ram()
> > interface because it doesn't allocate a vm_struct. Then in vread(),
> > these areas will be skipped.
> > 
> > Here, add a new function vb_vread() to read out areas managed by
> > vmap_block specifically. Then recognize vm_map_ram areas via vmap->flags
> > and handle  them respectively.
> 
> i don't understand how this deals with the original problem identified,
> that the vread() can race with an unmap.

Thanks for checking.

I wrote a paragraph, then realized I misunderstood your concern. You
mean the comment from Uladzislau on my original draft patch, right? I'll
paste the link to Uladzislau's reply here in case other people want to
know the background:
https://lore.kernel.org/all/Y1uKSmgURNEa3nQu@pc636/T/#u

When Stephen raised the issue originally, I posted a draft patch trying
to fix it:
https://lore.kernel.org/all/Y1pHTj2wuhoWmeV3@MiWiFi-R3L-srv/T/#u

In the above draft patch, I tried to differentiate normal vmalloc areas
from vm_map_ram areas by the fact that a vmalloc area is associated with
a vm_struct, while a vm_map_ram area has ->vm as NULL. And I thought
their only other difference was that a normal vmalloc area has a guard
page, so its size needs to account for the guard page, while a
vm_map_ram area has no guard page and only its actual size matters.
Uladzislau's comment reminded me that I was wrong, and that the things
we need to handle go beyond that.

Currently there are three kinds of vmalloc areas in the kernel:

1) normal vmalloc areas, associated with a vm_struct; these are allocated
in __get_vm_area_node(). When freeing, remove_vm_area() sets ->vm to NULL
first, then the area is unmapped and the vmap_area is freed.

2) areas allocated via vm_map_ram() with size larger than VMAP_MAX_ALLOC.
The entire area is not associated with a vm_struct, and is freed at one
time in vm_unmap_ram(), which unmaps it and frees the vmap_area.

3) areas allocated via vm_map_ram() and delegated to vb_alloc() when
size <= VMAP_MAX_ALLOC. The vmap_area is allocated once with
VMAP_BLOCK_SIZE, then split up and handed out later through vb_alloc()
and freed via vb_free(). Only when the entire area is dirty is it
unmapped and freed.

Based on the above facts, we need to add flags to differentiate normal
vmalloc areas from vm_map_ram areas, namely kind 1) from kind 2), and we
also need flags to differentiate kind 2) from kind 3). Because kind 3)
areas are pieces of an entire vmap_area, vb_free() only unmaps a piece
and marks that part dirty, while the entire vmap_area is kept. So when
reading a kind 3) area, we need to take vb->lock and read out only the
still-mapped parts, not the dirty or free parts of the vmap_area.
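
To summarize, the flag scheme this series uses per kind of area is
roughly the following (a sketch only; the names are from patches 1 and 2):

	#define VMAP_RAM	0x1	/* set for any vm_map_ram() area */
	#define VMAP_BLOCK	0x2	/* additionally set when managed by vb_alloc()/vb_free() */

	/* 1) normal vmalloc area:   va->vm set, flags not used                 */
	/* 2) large vm_map_ram area: va->vm NULL, va->flags == VMAP_RAM         */
	/* 3) vb_alloc() pieces:     va->flags == VMAP_RAM|VMAP_BLOCK, and      */
	/*                           vb->used_map marks the pieces still mapped */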

Thanks
Baoquan



* Re: [PATCH RFC 3/3] mm/vmalloc.c: allow vread() to read out vm_map_ram areas
  2022-11-23  3:38     ` Baoquan He
@ 2022-11-23 13:24       ` Matthew Wilcox
  2022-11-24  9:52         ` Baoquan He
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2022-11-23 13:24 UTC (permalink / raw)
  To: Baoquan He; +Cc: linux-kernel, linux-mm, akpm, stephen.s.brennan, urezki, hch

On Wed, Nov 23, 2022 at 11:38:54AM +0800, Baoquan He wrote:
> On 11/18/22 at 08:01am, Matthew Wilcox wrote:
> > On Wed, Nov 09, 2022 at 11:35:34AM +0800, Baoquan He wrote:
> > > Currently, vread() can read out vmalloc areas which is associated with
> > > a vm_struct. While this doesn't work for areas created by vm_map_ram()
> > > interface because it doesn't allocate a vm_struct. Then in vread(),
> > > these areas will be skipped.
> > > 
> > > Here, add a new function vb_vread() to read out areas managed by
> > > vmap_block specifically. Then recognize vm_map_ram areas via vmap->flags
> > > and handle  them respectively.
> > 
> > i don't understand how this deals with the original problem identified,
> > that the vread() can race with an unmap.
> 
> Thanks for checking.
> 
> I wrote a paragraph, then realized I misunderstood your concern. You are
> saying the comment from Uladzislau about my original draft patch, right?
> Paste the link of Uladzislau's reply here in case other people want to
> know the background:
> https://lore.kernel.org/all/Y1uKSmgURNEa3nQu@pc636/T/#u
> 
> When Stephen raised the issue originally, I posted a draft patch as
> below trying to fix it:
> https://lore.kernel.org/all/Y1pHTj2wuhoWmeV3@MiWiFi-R3L-srv/T/#u
> 
> In above draft patch, I tried to differentiate normal vmalloc area and
> vm_map_ram area with the fact that vmalloc area is associated with a
> vm_struct, while vm_map_ram area has ->vm as NULL. And I thought their
> only difference is normal vmalloc area has guard page, so its size need
> consider the guard page; while vm_map_ram area has no guard page, only
> consider its own actual size. Uladzislau's comment reminded me I was
> wrong. And the things we need handle are beyond that.
> 
> Currently there are three kinds of vmalloc areas in kernel:
> 
> 1) normal vmalloc areas, associated with a vm_struct, this is allocated 
> in __get_vm_area_node(). When freeing, it set ->vm to NULL
> firstly, then unmap and free vmap_area, see remove_vm_area().
> 
> 2) areas allocated via vm_map_ram() and size is larger than
> VMAP_MAX_ALLOC. The entire area is not associated with vm_struct, and
> freed at one time in vm_unmap_ram() with unmapping and freeing vmap_area;
> 
> 3) areas allocated via vm_map_ram(), then delegate to vb_alloc() when
> size <= VMAP_MAX_ALLOC. Its vmap_area is allocated at one time with
> VMAP_BLOCK_SIZE big, and split and used later through vb_alloc(), freed
> via vb_free(). When the entire area is dirty, it will be unmapped and
> freed.
> 
> Based on above facts, we need add flags to differentiate the normal
> vmalloc area from the vm_map_ram area, namely area 1) and 2). And we
> also need flags to differentiate the area 2) and 3). Because area 3) are
> pieces of a entire vmap_area, vb_free() will unmap the piece of area and
> set the part dirty, but the entire vmap_area will kept there. So when we
> will read area 3), we need take vb->lock and only read out the still
> mapped part, but not dirty or free part of the vmap_area.

I don't think you understand the problem.

Task A:			Task B:		Task C:
p = vm_map_ram()
			vread(p);
			... preempted ...
vm_unmap_ram(p);
					q = vm_map_ram();
			vread continues

If C has reused the address space allocated by A, task B is now reading
the memory mapped by task C instead of task A.  If it hasn't, it's now
trying to read from unmapped, and quite possibly freed memory.  Which
might have been allocated by task D.

Unless there's some kind of reference count so that B knows that both
the address range and the underlying memory can't be freed while it's
in the middle of the vread(), this is just unsafe.


* Re: [PATCH RFC 3/3] mm/vmalloc.c: allow vread() to read out vm_map_ram areas
  2022-11-23 13:24       ` Matthew Wilcox
@ 2022-11-24  9:52         ` Baoquan He
  2022-11-30 13:06           ` Uladzislau Rezki
  0 siblings, 1 reply; 14+ messages in thread
From: Baoquan He @ 2022-11-24  9:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-kernel, linux-mm, akpm, stephen.s.brennan, urezki, hch

On 11/23/22 at 01:24pm, Matthew Wilcox wrote:
> On Wed, Nov 23, 2022 at 11:38:54AM +0800, Baoquan He wrote:
> > On 11/18/22 at 08:01am, Matthew Wilcox wrote:
> > > On Wed, Nov 09, 2022 at 11:35:34AM +0800, Baoquan He wrote:
> > > > Currently, vread() can read out vmalloc areas which is associated with
> > > > a vm_struct. While this doesn't work for areas created by vm_map_ram()
> > > > interface because it doesn't allocate a vm_struct. Then in vread(),
> > > > these areas will be skipped.
> > > > 
> > > > Here, add a new function vb_vread() to read out areas managed by
> > > > vmap_block specifically. Then recognize vm_map_ram areas via vmap->flags
> > > > and handle  them respectively.
> > > 
> > > i don't understand how this deals with the original problem identified,
> > > that the vread() can race with an unmap.
> > 
> > Thanks for checking.
> > 
> > I wrote a paragraph, then realized I misunderstood your concern. You are
> > saying the comment from Uladzislau about my original draft patch, right?
> > Paste the link of Uladzislau's reply here in case other people want to
> > know the background:
> > https://lore.kernel.org/all/Y1uKSmgURNEa3nQu@pc636/T/#u
> > 
> > When Stephen raised the issue originally, I posted a draft patch as
> > below trying to fix it:
> > https://lore.kernel.org/all/Y1pHTj2wuhoWmeV3@MiWiFi-R3L-srv/T/#u
> > 
> > In above draft patch, I tried to differentiate normal vmalloc area and
> > vm_map_ram area with the fact that vmalloc area is associated with a
> > vm_struct, while vm_map_ram area has ->vm as NULL. And I thought their
> > only difference is normal vmalloc area has guard page, so its size need
> > consider the guard page; while vm_map_ram area has no guard page, only
> > consider its own actual size. Uladzislau's comment reminded me I was
> > wrong. And the things we need handle are beyond that.
> > 
> > Currently there are three kinds of vmalloc areas in kernel:
> > 
> > 1) normal vmalloc areas, associated with a vm_struct, this is allocated 
> > in __get_vm_area_node(). When freeing, it set ->vm to NULL
> > firstly, then unmap and free vmap_area, see remove_vm_area().
> > 
> > 2) areas allocated via vm_map_ram() and size is larger than
> > VMAP_MAX_ALLOC. The entire area is not associated with vm_struct, and
> > freed at one time in vm_unmap_ram() with unmapping and freeing vmap_area;
> > 
> > 3) areas allocated via vm_map_ram(), then delegate to vb_alloc() when
> > size <= VMAP_MAX_ALLOC. Its vmap_area is allocated at one time with
> > VMAP_BLOCK_SIZE big, and split and used later through vb_alloc(), freed
> > via vb_free(). When the entire area is dirty, it will be unmapped and
> > freed.
> > 
> > Based on above facts, we need add flags to differentiate the normal
> > vmalloc area from the vm_map_ram area, namely area 1) and 2). And we
> > also need flags to differentiate the area 2) and 3). Because area 3) are
> > pieces of a entire vmap_area, vb_free() will unmap the piece of area and
> > set the part dirty, but the entire vmap_area will kept there. So when we
> > will read area 3), we need take vb->lock and only read out the still
> > mapped part, but not dirty or free part of the vmap_area.
> 
> I don't think you understand the problem.
> 
> Task A:			Task B:		Task C:
> p = vm_map_ram()
> 			vread(p);
> 			... preempted ...
> vm_unmap_ram(p);
> 					q = vm_map_ram();
> 			vread continues



> 
> If C has reused the address space allocated by A, task B is now reading
> the memory mapped by task C instead of task A.  If it hasn't, it's now
> trying to read from unmapped, and quite possibly freed memory.  Which
> might have been allocated by task D.

Hmm, it may not be like that.

Firstly, I would remind you that vread() holds vmap_area_lock during the
whole reading process. Before this patchset, vread() and the other area
manipulations can be in the following states:
1) __get_vm_area_node() has only finished insert_vmap_area(), then
released vmap_area_lock and is waiting;
2) __get_vm_area_node() has finished setup_vmalloc_vm()
  2.1) the mapping is in progress but not finished;
  2.2) clear_vm_uninitialized_flag() is called after the mapping is done;
3) remove_vm_area() is called to set ->vm = NULL, then releases
vmap_area_lock.

Task A:			   Task B:		     Task C:
p = __get_vm_area_node()
remove_vm_area(p);
			   vread(p);

			   vread end 
					     q = __get_vm_area_node();

So, as you can see, the "if (!va->vm)" check in vread() filters out
vmap_areas:
a) where only insert_vmap_area() has been called, so ->vm is still NULL;
b) where remove_vm_area() has been called to clear ->vm to NULL;
c) created through vm_map_ram(), since their ->vm is always NULL.

That means vread() will read out vmap_areas:
d) where setup_vmalloc_vm() has been called, and either
  1) the mapping is done, e.g. clear_vm_uninitialized_flag() has been
     called; or
  2) the mapping is still being handled, right after returning from
     setup_vmalloc_vm().


******* after this patchset applied:

Task A:			Task B:		Task C:
p = vm_map_ram()
vm_unmap_ram(p);
			vread(p);
                         vb_vread()
			vread end 

					q = vm_map_ram();

With this patchset applied, in addition to the normal areas, for the
vm_map_ram() areas:
1) vm_map_ram() sets vmap_area->flags = VMAP_RAM while vmap_area_lock is
   held; vm_unmap_ram() clears it with "va->flags &= ~VMAP_RAM" while
   vmap_area_lock is held;
2) for a vmap_block, set va->flags = VMAP_RAM|VMAP_BLOCK, and use
   vmap_block->used_map to track the used regions, filtering out the
   dirty and free regions;
3) in vb_vread(), we take vb->lock to avoid reading out dirty regions.

Please help point out what is wrong or what I have missed.

> 
> Unless there's some kind of reference count so that B knows that both
> the address range and the underlying memory can't be freed while it's
> in the middle of the vread(), this is just unsafe.
> 



* Re: [PATCH RFC 3/3] mm/vmalloc.c: allow vread() to read out vm_map_ram areas
  2022-11-24  9:52         ` Baoquan He
@ 2022-11-30 13:06           ` Uladzislau Rezki
  2022-12-01  4:46             ` Baoquan He
  0 siblings, 1 reply; 14+ messages in thread
From: Uladzislau Rezki @ 2022-11-30 13:06 UTC (permalink / raw)
  To: Baoquan He
  Cc: Matthew Wilcox, linux-kernel, linux-mm, akpm, stephen.s.brennan,
	urezki, hch

On Thu, Nov 24, 2022 at 05:52:34PM +0800, Baoquan He wrote:
> On 11/23/22 at 01:24pm, Matthew Wilcox wrote:
> > On Wed, Nov 23, 2022 at 11:38:54AM +0800, Baoquan He wrote:
> > > On 11/18/22 at 08:01am, Matthew Wilcox wrote:
> > > > On Wed, Nov 09, 2022 at 11:35:34AM +0800, Baoquan He wrote:
> > > > > Currently, vread() can read out vmalloc areas which is associated with
> > > > > a vm_struct. While this doesn't work for areas created by vm_map_ram()
> > > > > interface because it doesn't allocate a vm_struct. Then in vread(),
> > > > > these areas will be skipped.
> > > > > 
> > > > > Here, add a new function vb_vread() to read out areas managed by
> > > > > vmap_block specifically. Then recognize vm_map_ram areas via vmap->flags
> > > > > and handle  them respectively.
> > > > 
> > > > i don't understand how this deals with the original problem identified,
> > > > that the vread() can race with an unmap.
> > > 
> > > Thanks for checking.
> > > 
> > > I wrote a paragraph, then realized I misunderstood your concern. You are
> > > saying the comment from Uladzislau about my original draft patch, right?
> > > Paste the link of Uladzislau's reply here in case other people want to
> > > know the background:
> > > https://lore.kernel.org/all/Y1uKSmgURNEa3nQu@pc636/T/#u
> > > 
> > > When Stephen raised the issue originally, I posted a draft patch as
> > > below trying to fix it:
> > > https://lore.kernel.org/all/Y1pHTj2wuhoWmeV3@MiWiFi-R3L-srv/T/#u
> > > 
> > > In above draft patch, I tried to differentiate normal vmalloc area and
> > > vm_map_ram area with the fact that vmalloc area is associated with a
> > > vm_struct, while vm_map_ram area has ->vm as NULL. And I thought their
> > > only difference is normal vmalloc area has guard page, so its size need
> > > consider the guard page; while vm_map_ram area has no guard page, only
> > > consider its own actual size. Uladzislau's comment reminded me I was
> > > wrong. And the things we need handle are beyond that.
> > > 
> > > Currently there are three kinds of vmalloc areas in kernel:
> > > 
> > > 1) normal vmalloc areas, associated with a vm_struct, this is allocated 
> > > in __get_vm_area_node(). When freeing, it set ->vm to NULL
> > > firstly, then unmap and free vmap_area, see remove_vm_area().
> > > 
> > > 2) areas allocated via vm_map_ram() and size is larger than
> > > VMAP_MAX_ALLOC. The entire area is not associated with vm_struct, and
> > > freed at one time in vm_unmap_ram() with unmapping and freeing vmap_area;
> > > 
> > > 3) areas allocated via vm_map_ram(), then delegate to vb_alloc() when
> > > size <= VMAP_MAX_ALLOC. Its vmap_area is allocated at one time with
> > > VMAP_BLOCK_SIZE big, and split and used later through vb_alloc(), freed
> > > via vb_free(). When the entire area is dirty, it will be unmapped and
> > > freed.
> > > 
> > > Based on above facts, we need add flags to differentiate the normal
> > > vmalloc area from the vm_map_ram area, namely area 1) and 2). And we
> > > also need flags to differentiate the area 2) and 3). Because area 3) are
> > > pieces of a entire vmap_area, vb_free() will unmap the piece of area and
> > > set the part dirty, but the entire vmap_area will kept there. So when we
> > > will read area 3), we need take vb->lock and only read out the still
> > > mapped part, but not dirty or free part of the vmap_area.
> > 
> > I don't think you understand the problem.
> > 
> > Task A:			Task B:		Task C:
> > p = vm_map_ram()
> > 			vread(p);
> > 			... preempted ...
> > vm_unmap_ram(p);
> > 					q = vm_map_ram();
> > 			vread continues
> 
> 
> 
> > 
> > If C has reused the address space allocated by A, task B is now reading
> > the memory mapped by task C instead of task A.  If it hasn't, it's now
> > trying to read from unmapped, and quite possibly freed memory.  Which
> > might have been allocated by task D.
> 
> Hmm, it may not be like that.
> 
> Firstly, I would remind that vread() takes vmap_area_lock during the
> whole reading process. Before this patchset, the vread() and other area
> manipulation will have below status:
> 1) __get_vm_area_node() could only finish insert_vmap_area(), then free
> the vmap_area_lock, then warting;
> 2) __get_vm_area_node() finishs setup_vmalloc_vm()
>   2.1) doing mapping but not finished;
>   2.2) clear_vm_uninitialized_flag() is called after mapping is done;
> 3) remove_vm_area() is called to set -> = NULL, then free vmap_area_lock;
> 
> Task A:			   Task B:		     Task C:
> p = __get_vm_area_node()
> remove_vm_area(p);
> 			   vread(p);
> 
> 			   vread end 
> 					     q = __get_vm_area_node();
> 
> So, as you can see, the checking "if (!va->vm)" in vread() will filter
> out vmap_area:
> a) areas if only insert_vmap_area() is called, but ->vm is still NULL; 
> b) areas if remove_vm_area() is called to clear ->vm to NULL;
> c) areas created through vm_map_ram() since its ->vm is always NULL;
> 
> Means vread() will read out vmap_area:
> d) areas if setup_vmalloc_vm() is called;
>   1) mapping is done on areas, e.g clear_vm_uninitialized_flag() is
>        called;
>   2) mapping is being handled, just after returning from setup_vmalloc_vm();
> 
> 
> ******* after this patchset applied:
> 
> Task A:			Task B:		Task C:
> p = vm_map_ram()
> vm_unmap_ram(p);
> 			vread(p);
>                          vb_vread()
> 			vread end 
> 
> 					q = vm_map_ram();
> 
> With this patchset applied, other than normal areas, for the
> vm_map_ram() areas:
> 1) In vm_map_ram(), set vmap_area->flags = VMAP_RAM when vmap_area_lock
>    is taken; In vm_unmap_ram(), clear it wiht "va->flags &= ~VMAP_RAM"
>    when vmap_area_lock is taken;
> 2) If vmap_block, set va->flags = VMAP_RAM|VMAP_BLOCK; And set
>    vmap_block->used_map to track the used region, filter out the dirty
>    and free region;
> 3) In vb_vread(), we take vb->lock to avoid reading out dirty regions.
> 
> Please help point out what is wrong or I missed.
> 
One thing is that we can still read out unmapped pages; see the code
below:

<snip>
static void vb_free(unsigned long addr, unsigned long size)
{
	unsigned long offset;
	unsigned int order;
	struct vmap_block *vb;

	BUG_ON(offset_in_page(size));
	BUG_ON(size > PAGE_SIZE*VMAP_MAX_ALLOC);

	flush_cache_vunmap(addr, addr + size);

	order = get_order(size);
	offset = (addr & (VMAP_BLOCK_SIZE - 1)) >> PAGE_SHIFT;
	vb = xa_load(&vmap_blocks, addr_to_vb_idx(addr));

	vunmap_range_noflush(addr, addr + size);

	if (debug_pagealloc_enabled_static())
		flush_tlb_kernel_range(addr, addr + size);

	spin_lock(&vb->lock);
...
<snip>

Or am I missing something? Is it a problem? It might be. Another thing:
it would be good if you uploaded a new patchset so it is easier to
review.

Thanks!

--
Uladzislau Rezki


* Re: [PATCH RFC 3/3] mm/vmalloc.c: allow vread() to read out vm_map_ram areas
  2022-11-30 13:06           ` Uladzislau Rezki
@ 2022-12-01  4:46             ` Baoquan He
  0 siblings, 0 replies; 14+ messages in thread
From: Baoquan He @ 2022-12-01  4:46 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Matthew Wilcox, linux-kernel, linux-mm, akpm, stephen.s.brennan, hch

On 11/30/22 at 02:06pm, Uladzislau Rezki wrote:
> On Thu, Nov 24, 2022 at 05:52:34PM +0800, Baoquan He wrote:
......
> > > I don't think you understand the problem.
> > > 
> > > Task A:			Task B:		Task C:
> > > p = vm_map_ram()
> > > 			vread(p);
> > > 			... preempted ...
> > > vm_unmap_ram(p);
> > > 					q = vm_map_ram();
> > > 			vread continues
> > 
> > 
> > 
> > > 
> > > If C has reused the address space allocated by A, task B is now reading
> > > the memory mapped by task C instead of task A.  If it hasn't, it's now
> > > trying to read from unmapped, and quite possibly freed memory.  Which
> > > might have been allocated by task D.
> > 
> > Hmm, it may not be like that.
> > 
> > Firstly, I would remind that vread() takes vmap_area_lock during the
> > whole reading process. Before this patchset, the vread() and other area
> > manipulation will have below status:
> > 1) __get_vm_area_node() could only finish insert_vmap_area(), then free
> > the vmap_area_lock, then warting;
> > 2) __get_vm_area_node() finishs setup_vmalloc_vm()
> >   2.1) doing mapping but not finished;
> >   2.2) clear_vm_uninitialized_flag() is called after mapping is done;
> > 3) remove_vm_area() is called to set -> = NULL, then free vmap_area_lock;
> > 
> > Task A:			   Task B:		     Task C:
> > p = __get_vm_area_node()
> > remove_vm_area(p);
> > 			   vread(p);
> > 
> > 			   vread end 
> > 					     q = __get_vm_area_node();
> > 
> > So, as you can see, the checking "if (!va->vm)" in vread() will filter
> > out vmap_area:
> > a) areas if only insert_vmap_area() is called, but ->vm is still NULL; 
> > b) areas if remove_vm_area() is called to clear ->vm to NULL;
> > c) areas created through vm_map_ram() since its ->vm is always NULL;
> > 
> > Means vread() will read out vmap_area:
> > d) areas if setup_vmalloc_vm() is called;
> >   1) mapping is done on areas, e.g clear_vm_uninitialized_flag() is
> >        called;
> >   2) mapping is being handled, just after returning from setup_vmalloc_vm();
> > 
> > 
> > ******* after this patchset applied:
> > 
> > Task A:			Task B:		Task C:
> > p = vm_map_ram()
> > vm_unmap_ram(p);
> > 			vread(p);
> >                          vb_vread()
> > 			vread end 
> > 
> > 					q = vm_map_ram();
> > 
> > With this patchset applied, other than normal areas, for the
> > vm_map_ram() areas:
> > 1) In vm_map_ram(), set vmap_area->flags = VMAP_RAM when vmap_area_lock
> >    is taken; In vm_unmap_ram(), clear it wiht "va->flags &= ~VMAP_RAM"
> >    when vmap_area_lock is taken;
> > 2) If vmap_block, set va->flags = VMAP_RAM|VMAP_BLOCK; And set
> >    vmap_block->used_map to track the used region, filter out the dirty
> >    and free region;
> > 3) In vb_vread(), we take vb->lock to avoid reading out dirty regions.
> > 
> > Please help point out what is wrong or I missed.
> > 
> One thing is we still can read-out un-mapped pages, i.e. a text instead:
> 
> <snip>
> static void vb_free(unsigned long addr, unsigned long size)
> {
> 	unsigned long offset;
> 	unsigned int order;
> 	struct vmap_block *vb;
> 
> 	BUG_ON(offset_in_page(size));
> 	BUG_ON(size > PAGE_SIZE*VMAP_MAX_ALLOC);
> 
> 	flush_cache_vunmap(addr, addr + size);
> 
> 	order = get_order(size);
> 	offset = (addr & (VMAP_BLOCK_SIZE - 1)) >> PAGE_SHIFT;
> 	vb = xa_load(&vmap_blocks, addr_to_vb_idx(addr));
> 
> 	vunmap_range_noflush(addr, addr + size);
> 
> 	if (debug_pagealloc_enabled_static())
> 		flush_tlb_kernel_range(addr, addr + size);
> 
> 	spin_lock(&vb->lock);
> ...
> <snip>
> 
> or am i missing something? Is it a problem? It might be. Another thing
> it would be good if you upload a new patchset so it is easier to review
> it.

Thanks for checking.

Please check patch 1: vmap_block->used_map is introduced to track the
allocation and freeing of vb regions via vb_alloc()/vb_free(). Bits in
vb->used_map are set only for pages in use; they are cleared for the
dirty and free regions. In the vb_vread() added in patch 3, vb->lock is
required and held during the whole vb reading, and only pages of regions
set in vb->used_map are read out.

So if vb_free() has been called and the bits in vb->used_map have been
cleared, that region won't be read out in vb_vread(). If vb_free() is
still waiting for vb->lock, the region hasn't been unmapped yet and can
be read out.

@@ -2114,6 +2118,9 @@ static void vb_free(unsigned long addr, unsigned long size)
        order = get_order(size);
        offset = (addr & (VMAP_BLOCK_SIZE - 1)) >> PAGE_SHIFT;
        vb = xa_load(&vmap_blocks, addr_to_vb_idx(addr));
+       spin_lock(&vb->lock);
+       bitmap_clear(vb->used_map, offset, (1UL << order));
+       spin_unlock(&vb->lock);
 
        vunmap_range_noflush(addr, addr + size);
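
As a user-space toy of the same bookkeeping (hypothetical code, not the
kernel bitmap API), only the bits still set in used_map stay eligible for
reading:

	#include <stdio.h>

	static unsigned long used_map;	/* stand-in for vb->used_map, 64 pages max */

	static void toy_vb_alloc(unsigned int off, unsigned int npages)
	{
		for (unsigned int i = 0; i < npages; i++)
			used_map |= 1UL << (off + i);	/* mark pages as used */
	}

	static void toy_vb_free(unsigned int off, unsigned int npages)
	{
		for (unsigned int i = 0; i < npages; i++)
			used_map &= ~(1UL << (off + i));/* now dirty, no longer readable */
	}

	int main(void)
	{
		toy_vb_alloc(0, 4);	/* pages 0-3 in use */
		toy_vb_alloc(8, 2);	/* pages 8-9 in use */
		toy_vb_free(0, 4);	/* pages 0-3 freed via vb_free() */
		printf("used_map = %#lx\n", used_map);	/* 0x300: only pages 8-9 remain */
		return 0;
	}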

I will work out a formal patchset for review, post it and CC all
reviewers.

Thanks
Baoquan


