All of lore.kernel.org
* /proc/kcore reads 0's for vmap_block
@ 2022-10-26 23:15 Stephen Brennan
  2022-10-27  8:54 ` Baoquan He
  0 siblings, 1 reply; 5+ messages in thread
From: Stephen Brennan @ 2022-10-26 23:15 UTC (permalink / raw)
  To: Andrew Morton, linux-mm

Hi all,

The /proc/kcore interface uses vread() to read memory addresses which
are in the vmalloc region, and it seems that vread() doesn't support
vmap_areas which are associated with a vmap_block. As far as I can tell,
those have vmap_area.vm == NULL, and get skipped during vread()'s
iteration. So the end result is that the read simply returns 0's.

This impacts live debuggers like gdb and drgn, which is how I stumbled
upon it[1]. It looks like crash avoids the issue by doing a page table
walk and reading the physical address.

I'm wondering if there's any rationale for this omission from vread():
is it a simple oversight, or was it omitted due to the difficulty? Is
it possible for /proc/kcore to simply take the page faults when it reads
unmapped memory and handle them? (I'm sure that's already discussed or
is obviously infeasible for some reason beyond me.)

Ideally, I'm just looking for a way forward that allows the debugger to
*work* as expected, meaning either that /proc/kcore always reads the
correct data, or that the debugger can know ahead of time that it will
need to do some processing (like a page table walk) first. 

Thanks,
Stephen

[1]: https://github.com/osandov/drgn/issues/217



* Re: /proc/kcore reads 0's for vmap_block
  2022-10-26 23:15 /proc/kcore reads 0's for vmap_block Stephen Brennan
@ 2022-10-27  8:54 ` Baoquan He
  2022-10-27 18:07   ` Stephen Brennan
  2022-10-28  7:52   ` Uladzislau Rezki
  0 siblings, 2 replies; 5+ messages in thread
From: Baoquan He @ 2022-10-27  8:54 UTC (permalink / raw)
  To: Stephen Brennan
  Cc: Andrew Morton, linux-mm, Uladzislau Rezki, Christoph Hellwig,
	Matthew Wilcox

On 10/26/22 at 04:15pm, Stephen Brennan wrote:
> Hi all,
> 
> The /proc/kcore interface uses vread() to read memory addresses which
> are in the vmalloc region, and it seems that vread() doesn't support
> vmap_areas which are associated with a vmap_block. As far as I can tell,
> those have vmap_area.vm == NULL, and get skipped during vread()'s
> iteration. So the end result is that the read simply returns 0's.

Hmm, to my understanding, it should be vm_map_ram() which is called
without an associated struct vm_struct, and it's the only interface
to do so. vmap_block is a percpu-based optimization to reduce vmap
lock contention.

> 
> This impacts live debuggers like gdb and drgn, which is how I stumbled
> upon it[1]. It looks like crash avoids the issue by doing a page table
> walk and reading the physical address.
> 
> I'm wondering if there's any rationale for this omission from vread():
> is it a simple oversight, or was it omitted due to the difficulty? Is
> it possible for /proc/kcore to simply take the page faults when it reads
> unmapped memory and handle them? (I'm sure that's already discussed or
> is obviously infeasible for some reason beyond me.)

From git history, vread() iterated vmlist in the first place. Later,
in the commit below, when it was changed to iterate vmap_area_list
instead, the old logic was simply inherited. I guess that's why
vmap_areas with a NULL ->vm are skipped.

commit e81ce85f960c ("mm, vmalloc: iterate vmap_area_list, instead of vmlist in vread/vwrite()")

> 
> Ideally, I'm just looking for a way forward that allows the debugger to
> *work* as expected, meaning either that /proc/kcore always reads the
> correct data, or that the debugger can know ahead of time that it will
> need to do some processing (like a page table walk) first. 

I think we can adjust vread() to allow vmap_areas with a NULL ->vm
to be read out. I made a draft patch below; please feel free to test
it. I'm not sure if there's any risk.

From 9f1b786730f3ee0a8d5b48a94dbefa674102d7b9 Mon Sep 17 00:00:00 2001
From: Baoquan He <bhe@redhat.com>
Date: Thu, 27 Oct 2022 16:20:26 +0800
Subject: [PATCH] mm/vmalloc.c: allow to read out vm_map_ram() areas in vread()
Content-type: text/plain

Currently, vread() can read out vmalloc areas which are associated with
a vm_struct. This doesn't work for areas created by the vm_map_ram()
interface because it doesn't allocate a vm_struct, so in vread() these
areas are skipped.

Pages are passed into vm_map_ram() and mapped onto a free vmap area,
so it should be safe to read them out. Change the code to allow reading
out these vm_map_ram() areas in vread().

Signed-off-by: Baoquan He <bhe@redhat.com>
---
 mm/vmalloc.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ccaa461998f3..f899ab784671 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3526,7 +3526,7 @@ long vread(char *buf, char *addr, unsigned long count)
 	struct vm_struct *vm;
 	char *vaddr, *buf_start = buf;
 	unsigned long buflen = count;
-	unsigned long n;
+	unsigned long n, size;
 
 	addr = kasan_reset_tag(addr);
 
@@ -3547,12 +3547,11 @@ long vread(char *buf, char *addr, unsigned long count)
 		if (!count)
 			break;
 
-		if (!va->vm)
-			continue;
-
 		vm = va->vm;
-		vaddr = (char *) vm->addr;
-		if (addr >= vaddr + get_vm_area_size(vm))
+		vaddr = (char *) va->va_start;
+		size = vm ? get_vm_area_size(vm) : va_size(va);
+
+		if (addr >= vaddr + size)
 			continue;
 		while (addr < vaddr) {
 			if (count == 0)
@@ -3562,10 +3561,10 @@ long vread(char *buf, char *addr, unsigned long count)
 			addr++;
 			count--;
 		}
-		n = vaddr + get_vm_area_size(vm) - addr;
+		n = vaddr + size - addr;
 		if (n > count)
 			n = count;
-		if (!(vm->flags & VM_IOREMAP))
+		if (!vm || !(vm->flags & VM_IOREMAP))
 			aligned_vread(buf, addr, n);
 		else /* IOREMAP area is treated as memory hole */
 			memset(buf, 0, n);
-- 
2.34.1
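
For illustration only, the core of the change can be restated in userspace C: the size selection falls back to the vmap_area's own extent when no vm_struct is attached, so vm_map_ram() areas are no longer rejected. Mock types and hypothetical names, not the kernel's definitions:

```c
#include <assert.h>
#include <stddef.h>

struct mock_vm_struct { unsigned long size; };

struct mock_vmap_area {
    unsigned long va_start, va_end;
    struct mock_vm_struct *vm;   /* NULL for vm_map_ram() areas */
};

/* Patched size selection, mirroring the patch's
 * "size = vm ? get_vm_area_size(vm) : va_size(va)" line. */
static unsigned long area_size(const struct mock_vmap_area *va)
{
    return va->vm ? va->vm->size : va->va_end - va->va_start;
}

/* Patched range check: the old "if (!va->vm) continue;" rejected
 * vm_map_ram() areas before this point was ever reached. */
static int area_is_readable(const struct mock_vmap_area *va,
                            unsigned long addr, unsigned long count)
{
    return addr >= va->va_start &&
           addr + count <= va->va_start + area_size(va);
}
```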




* Re: /proc/kcore reads 0's for vmap_block
  2022-10-27  8:54 ` Baoquan He
@ 2022-10-27 18:07   ` Stephen Brennan
  2022-10-28  7:52   ` Uladzislau Rezki
  1 sibling, 0 replies; 5+ messages in thread
From: Stephen Brennan @ 2022-10-27 18:07 UTC (permalink / raw)
  To: Baoquan He
  Cc: Andrew Morton, linux-mm, Uladzislau Rezki, Christoph Hellwig,
	Matthew Wilcox

Hi Baoquan,

Thanks for taking a look and coming up with a patch!

(responses inline)

Baoquan He <bhe@redhat.com> writes:
> On 10/26/22 at 04:15pm, Stephen Brennan wrote:
>> Hi all,
>> 
>> The /proc/kcore interface uses vread() to read memory addresses which
>> are in the vmalloc region, and it seems that vread() doesn't support
>> vmap_areas which are associated with a vmap_block. As far as I can tell,
>> those have vmap_area.vm == NULL, and get skipped during vread()'s
>> iteration. So the end result is that the read simply returns 0's.
>
> Hmm, to my understanding, it should be vm_map_ram() which is called
> without an associated struct vm_struct, and it's the only interface
> to do so. vmap_block is a percpu-based optimization to reduce vmap
> lock contention.
>
>> 
>> This impacts live debuggers like gdb and drgn, which is how I stumbled
>> upon it[1]. It looks like crash avoids the issue by doing a page table
>> walk and reading the physical address.
>> 
>> I'm wondering if there's any rationale for this omission from vread():
>> is it a simple oversight, or was it omitted due to the difficulty? Is
>> it possible for /proc/kcore to simply take the page faults when it reads
>> unmapped memory and handle them? (I'm sure that's already discussed or
>> is obviously infeasible for some reason beyond me.)
>
> From git history, vread() iterated vmlist in the first place. Later,
> in the commit below, when it was changed to iterate vmap_area_list
> instead, the old logic was simply inherited. I guess that's why
> vmap_areas with a NULL ->vm are skipped.
>
> commit e81ce85f960c ("mm, vmalloc: iterate vmap_area_list, instead of vmlist in vread/vwrite()")
>
>> 
>> Ideally, I'm just looking for a way forward that allows the debugger to
>> *work* as expected, meaning either that /proc/kcore always reads the
>> correct data, or that the debugger can know ahead of time that it will
>> need to do some processing (like a page table walk) first. 
>
> I think we can adjust vread() to allow vmap_areas with a NULL ->vm
> to be read out. I made a draft patch below; please feel free to test
> it. I'm not sure if there's any risk.

I tested this out and it fully resolved the problem! I didn't encounter
any issues with the debuggers using /proc/kcore. My test kernel module
is here:
https://github.com/brenns10/kernel_stuff/blob/master/drgn_vmalloc_test/drgn_vmalloc_test.c

I was able to read the full contents of the mapped area correctly with
drgn thanks to this patch.

>
> From 9f1b786730f3ee0a8d5b48a94dbefa674102d7b9 Mon Sep 17 00:00:00 2001
> From: Baoquan He <bhe@redhat.com>
> Date: Thu, 27 Oct 2022 16:20:26 +0800
> Subject: [PATCH] mm/vmalloc.c: allow to read out vm_map_ram() areas in vread()
> Content-type: text/plain
>
> Currently, vread() can read out vmalloc areas which are associated with
> a vm_struct. This doesn't work for areas created by the vm_map_ram()
> interface because it doesn't allocate a vm_struct, so in vread() these
> areas are skipped.
>
> Pages are passed into vm_map_ram() and mapped onto a free vmap area,
> so it should be safe to read them out. Change the code to allow reading
> out these vm_map_ram() areas in vread().

I see, we don't even need to worry about page faults because
vm_map_ram() is guaranteed to have mapped them. That does make things
easier :)

>
> Signed-off-by: Baoquan He <bhe@redhat.com>

Reviewed-by: Stephen Brennan <stephen.s.brennan@oracle.com>

(or Tested-by if you prefer)

> ---
>  mm/vmalloc.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index ccaa461998f3..f899ab784671 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3526,7 +3526,7 @@ long vread(char *buf, char *addr, unsigned long count)
>  	struct vm_struct *vm;
>  	char *vaddr, *buf_start = buf;
>  	unsigned long buflen = count;
> -	unsigned long n;
> +	unsigned long n, size;
>  
>  	addr = kasan_reset_tag(addr);
>  
> @@ -3547,12 +3547,11 @@ long vread(char *buf, char *addr, unsigned long count)
>  		if (!count)
>  			break;
>  
> -		if (!va->vm)
> -			continue;
> -
>  		vm = va->vm;
> -		vaddr = (char *) vm->addr;
> -		if (addr >= vaddr + get_vm_area_size(vm))
> +		vaddr = (char *) va->va_start;
> +		size = vm ? get_vm_area_size(vm) : va_size(va);
> +
> +		if (addr >= vaddr + size)
>  			continue;
>  		while (addr < vaddr) {
>  			if (count == 0)
> @@ -3562,10 +3561,10 @@ long vread(char *buf, char *addr, unsigned long count)
>  			addr++;
>  			count--;
>  		}
> -		n = vaddr + get_vm_area_size(vm) - addr;
> +		n = vaddr + size - addr;
>  		if (n > count)
>  			n = count;
> -		if (!(vm->flags & VM_IOREMAP))
> +		if (!vm || !(vm->flags & VM_IOREMAP))
>  			aligned_vread(buf, addr, n);
>  		else /* IOREMAP area is treated as memory hole */
>  			memset(buf, 0, n);
> -- 
> 2.34.1



* Re: /proc/kcore reads 0's for vmap_block
  2022-10-27  8:54 ` Baoquan He
  2022-10-27 18:07   ` Stephen Brennan
@ 2022-10-28  7:52   ` Uladzislau Rezki
  2022-10-31  7:12     ` Baoquan He
  1 sibling, 1 reply; 5+ messages in thread
From: Uladzislau Rezki @ 2022-10-28  7:52 UTC (permalink / raw)
  To: Baoquan He
  Cc: Stephen Brennan, Andrew Morton, linux-mm, Uladzislau Rezki,
	Christoph Hellwig, Matthew Wilcox

On Thu, Oct 27, 2022 at 04:54:38PM +0800, Baoquan He wrote:
> On 10/26/22 at 04:15pm, Stephen Brennan wrote:
> > Hi all,
> > 
> > The /proc/kcore interface uses vread() to read memory addresses which
> > are in the vmalloc region, and it seems that vread() doesn't support
> > vmap_areas which are associated with a vmap_block. As far as I can tell,
> > those have vmap_area.vm == NULL, and get skipped during vread()'s
> > iteration. So the end result is that the read simply returns 0's.
> 
> Hmm, to my understanding, it should be vm_map_ram() which is called
> without an associated struct vm_struct, and it's the only interface
> to do so. vmap_block is a percpu-based optimization to reduce vmap
> lock contention.
> 
> > 
> > This impacts live debuggers like gdb and drgn, which is how I stumbled
> > upon it[1]. It looks like crash avoids the issue by doing a page table
> > walk and reading the physical address.
> > 
> > I'm wondering if there's any rationale for this omission from vread():
> > is it a simple oversight, or was it omitted due to the difficulty? Is
> > it possible for /proc/kcore to simply take the page faults when it reads
> > unmapped memory and handle them? (I'm sure that's already discussed or
> > is obviously infeasible for some reason beyond me.)
> 
> From git history, vread() iterated vmlist in the first place. Later,
> in the commit below, when it was changed to iterate vmap_area_list
> instead, the old logic was simply inherited. I guess that's why
> vmap_areas with a NULL ->vm are skipped.
> 
> commit e81ce85f960c ("mm, vmalloc: iterate vmap_area_list, instead of vmlist in vread/vwrite()")
> 
> > 
> > Ideally, I'm just looking for a way forward that allows the debugger to
> > *work* as expected, meaning either that /proc/kcore always reads the
> > correct data, or that the debugger can know ahead of time that it will
> > need to do some processing (like a page table walk) first. 
> 
> I think we can adjust vread() to allow vmap_areas with a NULL ->vm
> to be read out. I made a draft patch below; please feel free to test
> it. I'm not sure if there's any risk.
> 
> From 9f1b786730f3ee0a8d5b48a94dbefa674102d7b9 Mon Sep 17 00:00:00 2001
> From: Baoquan He <bhe@redhat.com>
> Date: Thu, 27 Oct 2022 16:20:26 +0800
> Subject: [PATCH] mm/vmalloc.c: allow to read out vm_map_ram() areas in vread()
> Content-type: text/plain
> 
> Currently, vread() can read out vmalloc areas which are associated with
> a vm_struct. This doesn't work for areas created by the vm_map_ram()
> interface because it doesn't allocate a vm_struct, so in vread() these
> areas are skipped.
> 
> Pages are passed into vm_map_ram() and mapped onto a free vmap area,
> so it should be safe to read them out. Change the code to allow reading
> out these vm_map_ram() areas in vread().
> 
> Signed-off-by: Baoquan He <bhe@redhat.com>
> ---
>  mm/vmalloc.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index ccaa461998f3..f899ab784671 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3526,7 +3526,7 @@ long vread(char *buf, char *addr, unsigned long count)
>  	struct vm_struct *vm;
>  	char *vaddr, *buf_start = buf;
>  	unsigned long buflen = count;
> -	unsigned long n;
> +	unsigned long n, size;
>  
>  	addr = kasan_reset_tag(addr);
>  
> @@ -3547,12 +3547,11 @@ long vread(char *buf, char *addr, unsigned long count)
>  		if (!count)
>  			break;
>  
> -		if (!va->vm)
> -			continue;
> -
>  		vm = va->vm;
> -		vaddr = (char *) vm->addr;
> -		if (addr >= vaddr + get_vm_area_size(vm))
> +		vaddr = (char *) va->va_start;
> +		size = vm ? get_vm_area_size(vm) : va_size(va);
> +
> +		if (addr >= vaddr + size)
>  			continue;
>  		while (addr < vaddr) {
>  			if (count == 0)
> @@ -3562,10 +3561,10 @@ long vread(char *buf, char *addr, unsigned long count)
>  			addr++;
>  			count--;
>  		}
> -		n = vaddr + get_vm_area_size(vm) - addr;
> +		n = vaddr + size - addr;
>  		if (n > count)
>  			n = count;
> -		if (!(vm->flags & VM_IOREMAP))
> +		if (!vm || !(vm->flags & VM_IOREMAP))
>  			aligned_vread(buf, addr, n);
>
What happens if, during the read, a page is unmapped by
vm_unmap_ram()? I see that this can occur concurrently.

--
Uladzislau Rezki



* Re: /proc/kcore reads 0's for vmap_block
  2022-10-28  7:52   ` Uladzislau Rezki
@ 2022-10-31  7:12     ` Baoquan He
  0 siblings, 0 replies; 5+ messages in thread
From: Baoquan He @ 2022-10-31  7:12 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Stephen Brennan, Andrew Morton, linux-mm, Christoph Hellwig,
	Matthew Wilcox

On 10/28/22 at 09:52am, Uladzislau Rezki wrote:
> On Thu, Oct 27, 2022 at 04:54:38PM +0800, Baoquan He wrote:
> > On 10/26/22 at 04:15pm, Stephen Brennan wrote:
> > > Hi all,
> > > 
> > > The /proc/kcore interface uses vread() to read memory addresses which
> > > are in the vmalloc region, and it seems that vread() doesn't support
> > > vmap_areas which are associated with a vmap_block. As far as I can tell,
> > > those have vmap_area.vm == NULL, and get skipped during vread()'s
> > > iteration. So the end result is that the read simply returns 0's.
> > 
> > Hmm, to my understanding, it should be vm_map_ram() which is called
> > without an associated struct vm_struct, and it's the only interface
> > to do so. vmap_block is a percpu-based optimization to reduce vmap
> > lock contention.
> > 
> > > 
> > > This impacts live debuggers like gdb and drgn, which is how I stumbled
> > > upon it[1]. It looks like crash avoids the issue by doing a page table
> > > walk and reading the physical address.
> > > 
> > > I'm wondering if there's any rationale for this omission from vread():
> > > is it a simple oversight, or was it omitted due to the difficulty? Is
> > > it possible for /proc/kcore to simply take the page faults when it reads
> > > unmapped memory and handle them? (I'm sure that's already discussed or
> > > is obviously infeasible for some reason beyond me.)
> > 
> > From git history, vread() iterated vmlist in the first place. Later,
> > in the commit below, when it was changed to iterate vmap_area_list
> > instead, the old logic was simply inherited. I guess that's why
> > vmap_areas with a NULL ->vm are skipped.
> > 
> > commit e81ce85f960c ("mm, vmalloc: iterate vmap_area_list, instead of vmlist in vread/vwrite()")
> > 
> > > 
> > > Ideally, I'm just looking for a way forward that allows the debugger to
> > > *work* as expected, meaning either that /proc/kcore always reads the
> > > correct data, or that the debugger can know ahead of time that it will
> > > need to do some processing (like a page table walk) first. 
> > 
> > I think we can adjust vread() to allow vmap_areas with a NULL ->vm
> > to be read out. I made a draft patch below; please feel free to test
> > it. I'm not sure if there's any risk.
> > 
> > From 9f1b786730f3ee0a8d5b48a94dbefa674102d7b9 Mon Sep 17 00:00:00 2001
> > From: Baoquan He <bhe@redhat.com>
> > Date: Thu, 27 Oct 2022 16:20:26 +0800
> > Subject: [PATCH] mm/vmalloc.c: allow to read out vm_map_ram() areas in vread()
> > Content-type: text/plain
> > 
> > Currently, vread() can read out vmalloc areas which are associated with
> > a vm_struct. This doesn't work for areas created by the vm_map_ram()
> > interface because it doesn't allocate a vm_struct, so in vread() these
> > areas are skipped.
> > 
> > Pages are passed into vm_map_ram() and mapped onto a free vmap area,
> > so it should be safe to read them out. Change the code to allow reading
> > out these vm_map_ram() areas in vread().
> > 
> > Signed-off-by: Baoquan He <bhe@redhat.com>
> > ---
> >  mm/vmalloc.c | 15 +++++++--------
> >  1 file changed, 7 insertions(+), 8 deletions(-)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index ccaa461998f3..f899ab784671 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3526,7 +3526,7 @@ long vread(char *buf, char *addr, unsigned long count)
> >  	struct vm_struct *vm;
> >  	char *vaddr, *buf_start = buf;
> >  	unsigned long buflen = count;
> > -	unsigned long n;
> > +	unsigned long n, size;
> >  
> >  	addr = kasan_reset_tag(addr);
> >  
> > @@ -3547,12 +3547,11 @@ long vread(char *buf, char *addr, unsigned long count)
> >  		if (!count)
> >  			break;
> >  
> > -		if (!va->vm)
> > -			continue;
> > -
> >  		vm = va->vm;
> > -		vaddr = (char *) vm->addr;
> > -		if (addr >= vaddr + get_vm_area_size(vm))
> > +		vaddr = (char *) va->va_start;
> > +		size = vm ? get_vm_area_size(vm) : va_size(va);
> > +
> > +		if (addr >= vaddr + size)
> >  			continue;
> >  		while (addr < vaddr) {
> >  			if (count == 0)
> > @@ -3562,10 +3561,10 @@ long vread(char *buf, char *addr, unsigned long count)
> >  			addr++;
> >  			count--;
> >  		}
> > -		n = vaddr + get_vm_area_size(vm) - addr;
> > +		n = vaddr + size - addr;
> >  		if (n > count)
> >  			n = count;
> > -		if (!(vm->flags & VM_IOREMAP))
> > +		if (!vm || !(vm->flags & VM_IOREMAP))
> >  			aligned_vread(buf, addr, n);
> >
> What happens if, during the read, a page is unmapped by
> vm_unmap_ram()? I see that this can occur concurrently.

You are right, thanks for pointing that out.

Currently, vmap_block doesn't track the used or dirty/free pages within
one whole vmap block; the old alloc_map and dirty_map bitmaps have been
removed. I plan to add a used_map to track the pages being used. The
added code looks simple and shouldn't degrade efficiency. I will post
it for review once finished.
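
A used_map of the kind described could look like the following userspace sketch: a plain bitmap with one bit per page of a block, which vread() could consult before touching a page, reading only set pages and zero-filling the rest. All names here (mock_vmap_block, block_set_used, and so on) are hypothetical; this is not the eventual kernel patch:

```c
#include <assert.h>
#include <stddef.h>

#define VMAP_BBMAP_BITS 1024                    /* pages per block, illustrative */
#define BITS_PER_LONG   (8 * sizeof(unsigned long))

/* Hypothetical per-block usage bitmap: one bit per page. */
struct mock_vmap_block {
    unsigned long used_map[VMAP_BBMAP_BITS / BITS_PER_LONG];
};

/* Mark a page as used when it is handed out by vm_map_ram(). */
static void block_set_used(struct mock_vmap_block *vb, unsigned int page)
{
    vb->used_map[page / BITS_PER_LONG] |= 1UL << (page % BITS_PER_LONG);
}

/* Clear the bit when vm_unmap_ram() releases the page. */
static void block_clear_used(struct mock_vmap_block *vb, unsigned int page)
{
    vb->used_map[page / BITS_PER_LONG] &= ~(1UL << (page % BITS_PER_LONG));
}

/* A reader like vread() would test this before copying the page. */
static int block_page_used(const struct mock_vmap_block *vb, unsigned int page)
{
    return (vb->used_map[page / BITS_PER_LONG] >> (page % BITS_PER_LONG)) & 1;
}
```

In the kernel the bitmap updates and the reader would of course need to agree on locking (e.g. under the block's lock), which the sketch does not show.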



