Re: /proc/kcore reads 0's for vmap_block

From: Baoquan He <bhe@redhat.com>
To: Uladzislau Rezki <urezki@gmail.com>
Cc: Stephen Brennan <stephen.s.brennan@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, Christoph Hellwig <hch@infradead.org>,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: /proc/kcore reads 0's for vmap_block
Date: Mon, 31 Oct 2022 15:12:40 +0800	[thread overview]
Message-ID: <Y191aGqqaCLMF/P3@MiWiFi-R3L-srv> (raw)
In-Reply-To: <Y1uKSmgURNEa3nQu@pc636>

On 10/28/22 at 09:52am, Uladzislau Rezki wrote:
> On Thu, Oct 27, 2022 at 04:54:38PM +0800, Baoquan He wrote:
> > On 10/26/22 at 04:15pm, Stephen Brennan wrote:
> > > Hi all,
> > > 
> > > The /proc/kcore interface uses vread() to read memory addresses which
> > > are in the vmalloc region, and it seems that vread() doesn't support
> > > vmap_areas which are associated with a vmap_block. As far as I can tell,
> > > those have vmap_area.vm == NULL, and get skipped during vread()'s
> > > iteration. So the end result is that the read simply returns 0's.
> > 
> > Hmm, with my understanding, it should be vm_map_ram() which is called
> > without a struct vm_struct associated, and it's the only interface do
> > to so. vmap_block is the optimization way based on percpu to reduce vmap
> > lock race.
> > 
> > > 
> > > This impacts live debuggers like gdb and drgn, which is how I stumbled
> > > upon it[1]. It looks like crash avoids the issue by doing a page table
> > > walk and reading the physical address.
> > > 
> > > I'm wondering if there's any rationale for this omission from vread():
> > > is it a simple oversight, or was it omitted due to the difficulty? Is
> > > it possible for /proc/kcore to simply take the page faults when it reads
> > > unmapped memory and handle them? (I'm sure that's already discussed or
> > > is obviously infeasible for some reason beyond me.)
> > 
> > From git history, the vmlist iterating was taken in vread() at the
> > first place. Later, in below commit, when people changed to iterate
> > vmap_area_list instead, they just inherited the old code logic. I guess
> > that's why vmap with NULL ->vm is skipped.
> > 
> > commit e81ce85f960c ("mm, vmalloc: iterate vmap_area_list, instead of vmlist in vread/vwrite()")
> > 
> > > 
> > > Ideally, I'm just looking for a way forward that allows the debugger to
> > > *work* as expected, meaning either that /proc/kcore always reads the
> > > correct data, or that the debugger can know ahead of time that it will
> > > need to do some processing (like a page table walk) first. 
> > 
> > I think we can adjust vread() to allow those vmap_area with NULL ->vm
> > being read out? Made a draft patch at below, please feel free to have a
> > test. Not sure if there's any risk.
> > 
> > From 9f1b786730f3ee0a8d5b48a94dbefa674102d7b9 Mon Sep 17 00:00:00 2001
> > From: Baoquan He <bhe@redhat.com>
> > Date: Thu, 27 Oct 2022 16:20:26 +0800
> > Subject: [PATCH] mm/vmalloc.c: allow to read out vm_map_ram() areas in vread()
> > Content-type: text/plain
> > 
> > Currently, vread can read out vmalloc areas who is associated with
> > a vm_struct. While this doesn't work for areas created by vm_map_ram()
> > interface because it doesn't allocate a vm_struct. Then in vread(),
> > these areas will be skipped.
> > 
> > Pages are passed into vm_map_ram() and mapped onto frea vmap area,
> > it should be safe to read them out. Change code to allow to read
> > out these vm_map_ram() areas in vread().
> > 
> > Signed-off-by: Baoquan He <bhe@redhat.com>
> > ---
> >  mm/vmalloc.c | 15 +++++++--------
> >  1 file changed, 7 insertions(+), 8 deletions(-)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index ccaa461998f3..f899ab784671 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3526,7 +3526,7 @@ long vread(char *buf, char *addr, unsigned long count)
> >  	struct vm_struct *vm;
> >  	char *vaddr, *buf_start = buf;
> >  	unsigned long buflen = count;
> > -	unsigned long n;
> > +	unsigned long n, size;
> >  
> >  	addr = kasan_reset_tag(addr);
> >  
> > @@ -3547,12 +3547,11 @@ long vread(char *buf, char *addr, unsigned long count)
> >  		if (!count)
> >  			break;
> >  
> > -		if (!va->vm)
> > -			continue;
> > -
> >  		vm = va->vm;
> > -		vaddr = (char *) vm->addr;
> > -		if (addr >= vaddr + get_vm_area_size(vm))
> > +		vaddr = (char *) va->va_start;
> > +		size = vm ? get_vm_area_size(vm) : va_size(va);
> > +
> > +		if (addr >= vaddr + size)
> >  			continue;
> >  		while (addr < vaddr) {
> >  			if (count == 0)
> > @@ -3562,10 +3561,10 @@ long vread(char *buf, char *addr, unsigned long count)
> >  			addr++;
> >  			count--;
> >  		}
> > -		n = vaddr + get_vm_area_size(vm) - addr;
> > +		n = vaddr + size - addr;
> >  		if (n > count)
> >  			n = count;
> > -		if (!(vm->flags & VM_IOREMAP))
> > +		if (!vm || !(vm->flags & VM_IOREMAP))
> >  			aligned_vread(buf, addr, n);
> >
> What happens if during the read a page is unmapped by the
> vm_unamp_ram()? I see that it concurrently can occur.

You are right, thanks for pointing that out.

Currently, vmap_block doesn't track the used or dirty/free pages in one
whole vmap block. The old alloc_map and dirty_map bits have been
removed. I plan to add used_map to track the pages being used. The
added code looks simple and won't degrade efficiency. Will post for
reviewing once finished.