linux-kernel.vger.kernel.org archive mirror
* Re: [2/3] mm: fix up some user-visible effects of the stack guard page
@ 2015-01-05 10:21 Jay Foad
  2015-01-05 21:03 ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Jay Foad @ 2015-01-05 10:21 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, stable, stable-review, akpm, alan, Linus Torvalds

Sorry for replying to this old email...

On Wed, 2010-08-18 at 13:30 -0700, Greg KH wrote:
> 2.6.35-stable review patch. If anyone has any objections, please let us know.
>
> ------------------
>
> From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>
> commit d7824370e26325c881b665350ce64fb0a4fde24a upstream.
>
> This commit makes the stack guard page somewhat less visible to user
> space. It does this by:
>
> - not showing the guard page in /proc/<pid>/maps

> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -210,6 +210,7 @@ static void show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
>   int flags = vma->vm_flags;
>   unsigned long ino = 0;
>   unsigned long long pgoff = 0;
> + unsigned long start;
>   dev_t dev = 0;
>   int len;
>
> @@ -220,8 +221,13 @@ static void show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
>   pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
>   }
>
> + /* We don't show the stack guard page in /proc/maps */
> + start = vma->vm_start;
> + if (vma->vm_flags & VM_GROWSDOWN)
> + start += PAGE_SIZE;
> +
>   seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu %n",
> - vma->vm_start,
> + start,
>   vma->vm_end,
>   flags & VM_READ ? 'r' : '-',
>   flags & VM_WRITE ? 'w' : '-',

This change seems to be causing a problem with an address sanitizer
(https://code.google.com/p/address-sanitizer/) test case. See here for
some more details: http://reviews.llvm.org/D6777

Address sanitizer tries to find the mapping for the current thread's
stack by iterating through the entries in /proc/self/maps looking for
one that contains the address of some random stack variable. This
fails if the stack mapping has already used up all of its RLIMIT_STACK
quota, because in that case check_stack_guard_page() will fail to add
a guard page, but show_map_vma() will still assume that the first page
of the stack *is* a guard page, and won't report it in /proc/maps.

Here's a small program that demonstrates the failure:

$ cat lim.c
#include <stdio.h>
int main() {
  char a[1000]; /* doubles as the fgets buffer and as a probe address on the stack */
  FILE *m = fopen("/proc/self/maps", "r");
  while (fgets(a, sizeof a, m)) {
    unsigned long p = (unsigned long)a, start, end;
    if (sscanf(a, "%lx-%lx", &start, &end) == 2
        && start <= p && p < end) {
      printf("stack found at %lx-%lx\n", start, end);
      goto close;
    }
  }
  printf("stack not found\n");
close:
  fclose(m);
  main(); /* recurse so the stack keeps growing toward the limit */
}

On x86-64 with a 3.16 kernel I get:

$ gcc -o lim lim.c && ulimit -Ss 16 && ./lim
stack found at 7fff389c7000-7fff389ca000
stack found at 7fff389c7000-7fff389ca000
stack found at 7fff389c7000-7fff389ca000
stack not found
stack not found
stack not found
Segmentation fault (core dumped)

This seems like a bug in /proc/maps to me, but is there any chance of
fixing it? Or is it too late to change this behaviour?

Thanks,
Jay.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [2/3] mm: fix up some user-visible effects of the stack guard page
  2015-01-05 10:21 [2/3] mm: fix up some user-visible effects of the stack guard page Jay Foad
@ 2015-01-05 21:03 ` Linus Torvalds
  2015-01-05 21:11   ` Linus Torvalds
                     ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Linus Torvalds @ 2015-01-05 21:03 UTC (permalink / raw)
  To: Jay Foad
  Cc: Greg KH, Linux Kernel Mailing List, # .39.x, stable-review,
	Andrew Morton, Alan Cox

[-- Attachment #1: Type: text/plain, Size: 1247 bytes --]

On Mon, Jan 5, 2015 at 2:21 AM, Jay Foad <jay.foad@gmail.com> wrote:
> Sorry for replying to this old email...
>
>> commit d7824370e26325c881b665350ce64fb0a4fde24a upstream

Heh. From August 2010. That's 4+ years ago.. How come it was noticed
only now? You guys are running excessively old kernels, methinks.

> Address sanitizer tries to find the mapping for the current thread's
> stack by iterating through the entries in /proc/self/maps looking for
> one that contains the address of some random stack variable. This
> fails if the stack mapping has already used up all of its RLIMIT_STACK
> quota, because in that case check_stack_guard_page() will fail to add
> a guard page, but show_map_vma() will still assume that the first page
> of the stack *is* a guard page, and won't report it in /proc/maps.
>
> Here's a small program that demonstrates the failure:

Yup, your analysis sounds correct.  My completely untested gut feel is
that the problem is that we don't actually return the error from the
expand_stack() call, so then do_anonymous_page() will allow the extra
guard-page access.

IOW, *maybe* a patch like this. TOTALLY UNTESTED! I may have missed
something, and this may be complete crap.

                                 Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1490 bytes --]

 include/linux/mm.h | 2 +-
 mm/memory.c        | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f80d0194c9bc..80fc92a49649 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1952,7 +1952,7 @@ extern int expand_downwards(struct vm_area_struct *vma,
 #if VM_GROWSUP
 extern int expand_upwards(struct vm_area_struct *vma, unsigned long address);
 #else
-  #define expand_upwards(vma, address) do { } while (0)
+  #define expand_upwards(vma, address) (0)
 #endif
 
 /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
diff --git a/mm/memory.c b/mm/memory.c
index ca920d1fd314..d7e497e98f46 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2593,7 +2593,7 @@ static inline int check_stack_guard_page(struct vm_area_struct *vma, unsigned lo
 		if (prev && prev->vm_end == address)
 			return prev->vm_flags & VM_GROWSDOWN ? 0 : -ENOMEM;
 
-		expand_downwards(vma, address - PAGE_SIZE);
+		return expand_downwards(vma, address - PAGE_SIZE);
 	}
 	if ((vma->vm_flags & VM_GROWSUP) && address + PAGE_SIZE == vma->vm_end) {
 		struct vm_area_struct *next = vma->vm_next;
@@ -2602,7 +2602,7 @@ static inline int check_stack_guard_page(struct vm_area_struct *vma, unsigned lo
 		if (next && next->vm_start == address + PAGE_SIZE)
 			return next->vm_flags & VM_GROWSUP ? 0 : -ENOMEM;
 
-		expand_upwards(vma, address + PAGE_SIZE);
+		return expand_upwards(vma, address + PAGE_SIZE);
 	}
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [2/3] mm: fix up some user-visible effects of the stack guard page
  2015-01-05 21:03 ` Linus Torvalds
@ 2015-01-05 21:11   ` Linus Torvalds
  2015-01-05 21:19   ` Linus Torvalds
  2015-01-06 13:27   ` Jay Foad
  2 siblings, 0 replies; 21+ messages in thread
From: Linus Torvalds @ 2015-01-05 21:11 UTC (permalink / raw)
  To: Jay Foad; +Cc: Linux Kernel Mailing List, Andrew Morton, Alan Cox

On Mon, Jan 5, 2015 at 1:03 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Yup, your analysis sounds correct.  My completely untested gut feel is
> that the problem is that we don't actually return the error from the
> expand_stack() call, so then do_anonymous_page() will allow the extra
> guard-page access.
>
> IOW, *maybe* a patch like this. TOTALLY UNTESTED! I may have missed
> something, and this may be complete crap.

.. and it's still untested, but I'm pretty sure it's right - except we
should probably also allow an extra page for the stack rlimit (so that
the guard page doesn't count towards the limit).
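
(Purely illustrative sketch of that idea, assuming the limit check ends
up somewhere like acct_stack_growth() in mm/mmap.c; the helper name is
indicative only, not a tested patch:)

  /* Illustrative only: would growing the stack to 'size' bytes fit the
   * stack limit, if one guard page is not counted towards RLIMIT_STACK? */
  static int stack_limit_ok(struct vm_area_struct *vma, unsigned long size)
  {
          unsigned long actual_size = size;

          if (vma->vm_flags & (VM_GROWSUP | VM_GROWSDOWN))
                  actual_size -= PAGE_SIZE;
          return actual_size <= rlimit(RLIMIT_STACK);
  }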

                          Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [2/3] mm: fix up some user-visible effects of the stack guard page
  2015-01-05 21:03 ` Linus Torvalds
  2015-01-05 21:11   ` Linus Torvalds
@ 2015-01-05 21:19   ` Linus Torvalds
  2015-01-06 13:27   ` Jay Foad
  2 siblings, 0 replies; 21+ messages in thread
From: Linus Torvalds @ 2015-01-05 21:19 UTC (permalink / raw)
  To: Jay Foad
  Cc: Greg KH, Linux Kernel Mailing List, # .39.x, stable-review,
	Andrew Morton, Alan Cox

On Mon, Jan 5, 2015 at 1:03 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> IOW, *maybe* a patch like this. TOTALLY UNTESTED! I may have missed
> something, and this may be complete crap.

It does indeed seem to work.

But see my other email about this effectively making the stack limit
one page smaller. I'm not entirely sure it's worth fixing, but maybe
in another four years somebody will post a regression email about that
;)

IOW, I'd be inclined to say "stack limit includes guard page", and
just wait and see whether that causes any issues.

But regardless, I'd like you to test the patch in your environment
(with the full real load, not just the demonstration test case), since
you are clearly doing special things. My limited testing was just that
- very limited.

Thanks,

                                Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [2/3] mm: fix up some user-visible effects of the stack guard page
  2015-01-05 21:03 ` Linus Torvalds
  2015-01-05 21:11   ` Linus Torvalds
  2015-01-05 21:19   ` Linus Torvalds
@ 2015-01-06 13:27   ` Jay Foad
  2 siblings, 0 replies; 21+ messages in thread
From: Jay Foad @ 2015-01-06 13:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Greg KH, Linux Kernel Mailing List, # .39.x, stable-review,
	Andrew Morton, Alan Cox

On 5 January 2015 at 21:03, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Jan 5, 2015 at 2:21 AM, Jay Foad <jay.foad@gmail.com> wrote:
>> Sorry for replying to this old email...
>>
>>> commit d7824370e26325c881b665350ce64fb0a4fde24a upstream
>
> Heh. From August 2010. That's 4+ years ago.. How come it was noticed
> only now?

*Not* because we've just upgraded from an ancient kernel. Rather
because we've started running the asan tests on powerpc64, which has
64k pages rather than 4k, which just happens to trigger the failure. A
bit of git blame detective work led me to this ancient commit.

>> Address sanitizer tries to find the mapping for the current thread's
>> stack by iterating through the entries in /proc/self/maps looking for
>> one that contains the address of some random stack variable. This
>> fails if the stack mapping has already used up all of its RLIMIT_STACK
>> quota, because in that case check_stack_guard_page() will fail to add
>> a guard page, but show_map_vma() will still assume that the first page
>> of the stack *is* a guard page, and won't report it in /proc/maps.
>>
>> Here's a small program that demonstrates the failure:
>
> Yup, your analysis sounds correct.  My completely untested gut feel is
> that the problem is that we don't actually return the error from the
> expand_stack() call, so then do_anonymous_page() will allow the extra
> guard-page access.
>
> IOW, *maybe* a patch like this. TOTALLY UNTESTED! I may have missed
> something, and this may be complete crap.

(Later)
> .. and it's still untested, but I'm pretty sure it's right - except we
> should probably also allow an extra page for the stack rlimit (so that
> the guard page doesn't count towards the limit).

(Later)
> It does indeed seem to work.
>
> But see my other email about this effectively making the stack limit
> one page smaller. I'm not entirely sure it's worth fixing, but maybe
> in another four years somebody will post a regression email about that
> ;)
>
> IOW, I'd be inclined to say "stack limit includes guard page", and
> just wait and see whether that causes any issues.

I don't have a problem with that, but I suppose some users might have
to bump their stack limit up by a page, if they were already sailing
close to the wind.
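
(Illustrative only, userspace: a program that wants its old effective
stack size back could raise its own soft limit by one page, along the
lines of:)

  #include <sys/resource.h>
  #include <unistd.h>

  /* raise the soft RLIMIT_STACK by one page, staying under the hard limit */
  static void bump_stack_limit(void)
  {
          long page = sysconf(_SC_PAGESIZE);
          struct rlimit rl;

          if (getrlimit(RLIMIT_STACK, &rl) == 0 &&
              rl.rlim_cur != RLIM_INFINITY &&
              rl.rlim_cur + page <= rl.rlim_max) {
                  rl.rlim_cur += page;
                  setrlimit(RLIMIT_STACK, &rl);
          }
  }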

> But regardless, I'd like you to test the patch in your environment
> (with the full real load, not just the demonstration test case), since
> you are clearly doing special things. My limited testing was just that
> - very limited.

I can't test kernel patches on the powerpc64 box as I don't even have
root access, but I've done the best I can on my own x86_64 machine:
- rebuilt my current 3.16.0 kernel both without and with your patch
- tested both my small repro and the original asan stack-overflow test
with a variety of small stack limits (12, 16, 20 k)
- run the full address sanitizer test suite
All looks good *except* that stack overflow tests are now getting
SIGBUS where they used to get SIGSEGV. Asan was only handling SIGSEGV,
but it seems entirely reasonable for it to handle SIGBUS in the same
way too (indeed it was already doing that on Darwin, just not on
Linux). If I make that change then the whole test suite runs cleanly
with your patch.
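
(For illustration only, not the actual asan code: treating SIGBUS the
same as SIGSEGV amounts to installing one handler for both signals, on
an alternate stack so it can still run once the main stack is
exhausted:)

  #include <signal.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  static void on_fatal_signal(int sig, siginfo_t *info, void *ctx)
  {
          (void)ctx;
          /* a real handler would print a report/stack trace here */
          fprintf(stderr, "fatal signal %d at address %p\n", sig, info->si_addr);
          _exit(1);
  }

  static void install_handlers(void)
  {
          stack_t ss;
          struct sigaction sa;

          /* alternate signal stack, so the handler runs on stack overflow */
          ss.ss_sp = malloc(SIGSTKSZ);
          ss.ss_size = SIGSTKSZ;
          ss.ss_flags = 0;
          sigaltstack(&ss, NULL);

          memset(&sa, 0, sizeof sa);
          sa.sa_sigaction = on_fatal_signal;
          sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
          sigaction(SIGSEGV, &sa, NULL);
          sigaction(SIGBUS, &sa, NULL);   /* same treatment as SIGSEGV */
  }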

Thus:
Tested-by: Jay Foad <jay.foad@gmail.com>

Thanks,
Jay.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [2/3] mm: fix up some user-visible effects of the stack guard page
  2010-08-20 21:24                     ` Linus Torvalds
@ 2010-08-23  8:58                       ` Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2010-08-23  8:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ian Campbell, linux-kernel, stable, stable-review, akpm, alan, Greg KH

On Fri, 2010-08-20 at 14:24 -0700, Linus Torvalds wrote:
> Still. A doubly-linked list is totally trivial, and I just bet I have
> a bug in there somewhere. But looking at your 2007 patch and mine
> side-by-side, I do think mine is still less scary. 

For sure, and if you need it in a hurry your patch is much saner.

So if you feel like it's required for .36 go for it, we can look at
cleaning it up for .37 or somesuch.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [2/3] mm: fix up some user-visible effects of the stack guard page
  2010-08-20 21:17                   ` Linus Torvalds
@ 2010-08-20 21:24                     ` Linus Torvalds
  2010-08-23  8:58                       ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2010-08-20 21:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ian Campbell, linux-kernel, stable, stable-review, akpm, alan, Greg KH

On Fri, Aug 20, 2010 at 2:17 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Appended is that much smaller patch.

Note that the "real" patch is actually even smaller than implied by
this thing. The diffstat

 include/linux/mm_types.h |    2 +-
 kernel/fork.c            |    7 +++++--
 mm/mmap.c                |   37 +++++++++++++++++++++++++++++++++----
 mm/nommu.c               |    7 +++++--
 4 files changed, 44 insertions(+), 9 deletions(-)

implies that it adds a lot more than it removes, but of the 35 new
lines it adds, about half of them - 16 lines - are just the vma list
verification hack. So it really adds less than 20 lines of code.
Hopefully those 20 lines would then buy themselves back with cleanups.

Still. A doubly-linked list is totally trivial, and I just bet I have
a bug in there somewhere. But looking at your 2007 patch and mine
side-by-side, I do think mine is still less scary.

                     Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [2/3] mm: fix up some user-visible effects of the stack guard page
  2010-08-20 20:42                 ` Peter Zijlstra
@ 2010-08-20 21:17                   ` Linus Torvalds
  2010-08-20 21:24                     ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2010-08-20 21:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ian Campbell, linux-kernel, stable, stable-review, akpm, alan, Greg KH

[-- Attachment #1: Type: text/plain, Size: 1934 bytes --]

On Fri, Aug 20, 2010 at 1:42 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> You mean something like the below?

Yeah. I looked at exactly something like that. With the same name for
vma_next() - except I just passed down 'mm' to it too. Your patch
takes it from vma->mm and then you have a few places doing the whole
__vma_next() by hand.

> Should I look at refreshing that thing (yes, that's a 2007 patch)?

Interesting. I generated about half the patch, and then decided that
it's not worth it. But the fact that apparently you did the same thing
in 2007 means that maybe there really _is_ a pent-up demand for doing
this dang thing.

I do note that I also did a patch that just added "vm_prev", and it's
a lot smaller. It doesn't allow for the cleanups of using the generic
list iterators, though. And it has the usual problem with the
straightforward doubly-linked lists: you end up with those ugly
conditional assignments like

  if (prev)
     prev->vm_next = vma;
  if (next)
     next->vm_prev = vma;

which the list_head approach avoids.
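
(For comparison, a sketch of the same insertion using list.h, with the
mm_vmas / vm_list names from the 2007 patch further down in the thread;
since the list head is a real node, neither neighbour needs a NULL
check:)

  /* insert 'vma' after 'prev', or at the front of mm->mm_vmas when there
   * is no previous vma; list_add() updates both link directions itself */
  static void vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
                            struct vm_area_struct *prev)
  {
          list_add(&vma->vm_list, prev ? &prev->vm_list : &mm->mm_vmas);
  }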

Appended is that much smaller patch. It has an ugly and partial
"validate_vma()" hack there in find_vma() to catch the worst bugs, but
it's really not tested. I did try booting it, and it's not spewing out
errors, but who the heck knows. It doesn't look too horrid, but...

What do you think?

NOTE! I've done _zero_ cleanups. And one of the things I noticed is
that "find_vma_prev()" is sometimes used to find a "prev" even when
there is no "vma" (ie growing a grow-up stack at the end of the VM, or
adding a new mapping at the end), so my hope that we could get rid of
it entirely was naïve. But there should be other things we should be
able to simplify when we can get at the prev pointer (all the
mprotect/mlock splitting code should work fine without having that
'prev' thing passed around etc).

                            Linus

[-- Attachment #2: diff.txt --]
[-- Type: text/plain, Size: 5233 bytes --]

 include/linux/mm_types.h |    2 +-
 kernel/fork.c            |    7 +++++--
 mm/mmap.c                |   37 +++++++++++++++++++++++++++++++++----
 mm/nommu.c               |    7 +++++--
 4 files changed, 44 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b8bb9a6..ee7e258 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -134,7 +134,7 @@ struct vm_area_struct {
 					   within vm_mm. */
 
 	/* linked list of VM areas per task, sorted by address */
-	struct vm_area_struct *vm_next;
+	struct vm_area_struct *vm_next, *vm_prev;
 
 	pgprot_t vm_page_prot;		/* Access permissions of this VMA. */
 	unsigned long vm_flags;		/* Flags, see mm.h. */
diff --git a/kernel/fork.c b/kernel/fork.c
index 856eac3..b7e9d60 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -300,7 +300,7 @@ out:
 #ifdef CONFIG_MMU
 static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 {
-	struct vm_area_struct *mpnt, *tmp, **pprev;
+	struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
 	struct rb_node **rb_link, *rb_parent;
 	int retval;
 	unsigned long charge;
@@ -328,6 +328,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 	if (retval)
 		goto out;
 
+	prev = NULL;
 	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
 		struct file *file;
 
@@ -359,7 +360,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 			goto fail_nomem_anon_vma_fork;
 		tmp->vm_flags &= ~VM_LOCKED;
 		tmp->vm_mm = mm;
-		tmp->vm_next = NULL;
+		tmp->vm_next = tmp->vm_prev = NULL;
 		file = tmp->vm_file;
 		if (file) {
 			struct inode *inode = file->f_path.dentry->d_inode;
@@ -392,6 +393,8 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 		 */
 		*pprev = tmp;
 		pprev = &tmp->vm_next;
+		tmp->vm_prev = prev;
+		prev = tmp;
 
 		__vma_link_rb(mm, tmp, rb_link, rb_parent);
 		rb_link = &tmp->vm_rb.rb_right;
diff --git a/mm/mmap.c b/mm/mmap.c
index 3100333..b6212c2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -388,17 +388,23 @@ static inline void
 __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
 		struct vm_area_struct *prev, struct rb_node *rb_parent)
 {
+	struct vm_area_struct *next;
+
+	vma->vm_prev = prev;
 	if (prev) {
-		vma->vm_next = prev->vm_next;
+		next = prev->vm_next;
 		prev->vm_next = vma;
 	} else {
 		mm->mmap = vma;
 		if (rb_parent)
-			vma->vm_next = rb_entry(rb_parent,
+			next = rb_entry(rb_parent,
 					struct vm_area_struct, vm_rb);
 		else
-			vma->vm_next = NULL;
+			next = NULL;
 	}
+	vma->vm_next = next;
+	if (next)
+		next->vm_prev = vma;
 }
 
 void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -483,7 +489,11 @@ static inline void
 __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
 		struct vm_area_struct *prev)
 {
-	prev->vm_next = vma->vm_next;
+	struct vm_area_struct *next = vma->vm_next;
+
+	prev->vm_next = next;
+	if (next)
+		next->vm_prev = prev;
 	rb_erase(&vma->vm_rb, &mm->mm_rb);
 	if (mm->mmap_cache == vma)
 		mm->mmap_cache = prev;
@@ -1577,6 +1587,21 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 
 EXPORT_SYMBOL(get_unmapped_area);
 
+static void validate_vma_list(struct vm_area_struct *vma)
+{
+	static int count;
+
+	if (((vma->vm_next && vma->vm_next->vm_prev != vma) ||
+	     (vma->vm_prev && vma->vm_prev->vm_next != vma)) &&
+	      count < 100) {
+		printk("Badness %p[%p] -> %p -> [%p]%p\n",
+			vma->vm_prev, vma->vm_prev ? vma->vm_prev->vm_next : NULL,
+			vma,
+			vma->vm_next ? vma->vm_next->vm_prev : NULL, vma->vm_next);
+		count++;
+	}
+}
+
 /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
 struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 {
@@ -1610,6 +1635,7 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 				mm->mmap_cache = vma;
 		}
 	}
+if (vma) validate_vma_list(vma);
 	return vma;
 }
 
@@ -1915,6 +1941,7 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
 	unsigned long addr;
 
 	insertion_point = (prev ? &prev->vm_next : &mm->mmap);
+	vma->vm_prev = NULL;
 	do {
 		rb_erase(&vma->vm_rb, &mm->mm_rb);
 		mm->map_count--;
@@ -1922,6 +1949,8 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
 		vma = vma->vm_next;
 	} while (vma && vma->vm_start < end);
 	*insertion_point = vma;
+	if (vma)
+		vma->vm_prev = prev;
 	tail_vma->vm_next = NULL;
 	if (mm->unmap_area == arch_unmap_area)
 		addr = prev ? prev->vm_end : mm->mmap_base;
diff --git a/mm/nommu.c b/mm/nommu.c
index efa9a38..88ff091 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -604,7 +604,7 @@ static void protect_vma(struct vm_area_struct *vma, unsigned long flags)
  */
 static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
 {
-	struct vm_area_struct *pvma, **pp;
+	struct vm_area_struct *pvma, **pp, *next;
 	struct address_space *mapping;
 	struct rb_node **p, *parent;
 
@@ -664,8 +664,11 @@ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
 			break;
 	}
 
-	vma->vm_next = *pp;
+	next = *pp;
 	*pp = vma;
+	vma->vm_next = next;
+	if (next)
+		next->vm_prev = vma;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [2/3] mm: fix up some user-visible effects of the stack guard page
  2010-08-20 19:43               ` Linus Torvalds
@ 2010-08-20 20:42                 ` Peter Zijlstra
  2010-08-20 21:17                   ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2010-08-20 20:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ian Campbell, linux-kernel, stable, stable-review, akpm, alan, Greg KH

On Fri, 2010-08-20 at 12:43 -0700, Linus Torvalds wrote:
> Yeah, no. It looks like adding a "vm_prev" and doing a regular doubly
> linked list thing wouldn't be too bad. But switching over to the
> list.h version looks like a nightmare.
> 
> Too bad. 

You mean something like the below?

Should I look at refreshing that thing (yes, that's a 2007 patch)?

---
Subject: mm: replace vm_next with a list_head
Date: 19 Jul 2007

Replace the vma list with a proper list_head.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/alpha/kernel/osf_sys.c         |    2 
 arch/arm/mm/mmap.c                  |    2 
 arch/frv/mm/elf-fdpic.c             |    4 -
 arch/i386/mm/hugetlbpage.c          |    2 
 arch/ia64/kernel/sys_ia64.c         |    2 
 arch/ia64/mm/hugetlbpage.c          |    2 
 arch/mips/kernel/irixelf.c          |   20 +++---
 arch/mips/kernel/syscall.c          |    2 
 arch/parisc/kernel/sys_parisc.c     |    4 -
 arch/powerpc/mm/tlb_32.c            |    2 
 arch/ppc/mm/tlb.c                   |    2 
 arch/sh/kernel/sys_sh.c             |    2 
 arch/sh/mm/cache-sh4.c              |    2 
 arch/sh64/kernel/sys_sh64.c         |    2 
 arch/sparc/kernel/sys_sparc.c       |    2 
 arch/sparc64/kernel/binfmt_aout32.c |    2 
 arch/sparc64/kernel/sys_sparc.c     |    2 
 arch/sparc64/mm/hugetlbpage.c       |    2 
 arch/um/kernel/skas/tlb.c           |    6 -
 arch/x86_64/ia32/ia32_aout.c        |    2 
 arch/x86_64/kernel/sys_x86_64.c     |    2 
 drivers/char/mem.c                  |    2 
 drivers/oprofile/buffer_sync.c      |    4 -
 fs/binfmt_aout.c                    |    2 
 fs/binfmt_elf.c                     |    6 -
 fs/binfmt_elf_fdpic.c               |    4 -
 fs/hugetlbfs/inode.c                |    2 
 fs/proc/task_mmu.c                  |   18 ++---
 include/linux/init_task.h           |    1 
 include/linux/mm.h                  |   44 +++++++++++++-
 include/linux/sched.h               |    2 
 ipc/shm.c                           |    4 -
 kernel/acct.c                       |    5 -
 kernel/auditsc.c                    |    4 -
 kernel/fork.c                       |   12 +--
 mm/madvise.c                        |    2 
 mm/memory.c                         |   20 +++---
 mm/mempolicy.c                      |   10 +--
 mm/migrate.c                        |    2 
 mm/mlock.c                          |    4 -
 mm/mmap.c                           |  112 +++++++++++++++++-------------------
 mm/mprotect.c                       |    2 
 mm/mremap.c                         |    7 +-
 mm/msync.c                          |    2 
 mm/swapfile.c                       |    2 
 45 files changed, 186 insertions(+), 155 deletions(-)

Index: linux-2.6/arch/i386/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/i386/mm/hugetlbpage.c
+++ linux-2.6/arch/i386/mm/hugetlbpage.c
@@ -241,7 +241,7 @@ static unsigned long hugetlb_get_unmappe
 full_search:
 	addr = ALIGN(start_addr, HPAGE_SIZE);
 
-	for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
+	for (vma = find_vma(mm, addr); ; vma = vma_next(vma)) {
 		/* At this point:  (!vma || addr < vma->vm_end). */
 		if (TASK_SIZE - len < addr) {
 			/*
Index: linux-2.6/arch/x86_64/kernel/sys_x86_64.c
===================================================================
--- linux-2.6.orig/arch/x86_64/kernel/sys_x86_64.c
+++ linux-2.6/arch/x86_64/kernel/sys_x86_64.c
@@ -118,7 +118,7 @@ arch_get_unmapped_area(struct file *filp
 	start_addr = addr;
 
 full_search:
-	for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
+	for (vma = find_vma(mm, addr); ; vma = vma_next(vma)) {
 		/* At this point:  (!vma || addr < vma->vm_end). */
 		if (end - len < addr) {
 			/*
Index: linux-2.6/drivers/char/mem.c
===================================================================
--- linux-2.6.orig/drivers/char/mem.c
+++ linux-2.6/drivers/char/mem.c
@@ -633,7 +633,7 @@ static inline size_t read_zero_pagealign
 	down_read(&mm->mmap_sem);
 
 	/* For private mappings, just map in zero pages. */
-	for (vma = find_vma(mm, addr); vma; vma = vma->vm_next) {
+	for (vma = find_vma(mm, addr); vma; vma = vma_next(vma)) {
 		unsigned long count;
 
 		if (vma->vm_start > addr || (vma->vm_flags & VM_WRITE) == 0)
Index: linux-2.6/drivers/oprofile/buffer_sync.c
===================================================================
--- linux-2.6.orig/drivers/oprofile/buffer_sync.c
+++ linux-2.6/drivers/oprofile/buffer_sync.c
@@ -216,7 +216,7 @@ static unsigned long get_exec_dcookie(st
 	if (!mm)
 		goto out;
  
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+	list_for_each_entry(vma, &mm->mm_vmas, vm_list) {
 		if (!vma->vm_file)
 			continue;
 		if (!(vma->vm_flags & VM_EXECUTABLE))
@@ -241,7 +241,7 @@ static unsigned long lookup_dcookie(stru
 	unsigned long cookie = NO_COOKIE;
 	struct vm_area_struct * vma;
 
-	for (vma = find_vma(mm, addr); vma; vma = vma->vm_next) {
+	for (vma = find_vma(mm, addr); vma; vma = vma_next(vma)) {
  
 		if (addr < vma->vm_start || addr >= vma->vm_end)
 			continue;
Index: linux-2.6/fs/binfmt_elf.c
===================================================================
--- linux-2.6.orig/fs/binfmt_elf.c
+++ linux-2.6/fs/binfmt_elf.c
@@ -780,7 +780,7 @@ static int load_elf_binary(struct linux_
 	current->mm->start_data = 0;
 	current->mm->end_data = 0;
 	current->mm->end_code = 0;
-	current->mm->mmap = NULL;
+	INIT_LIST_HEAD(&current->mm->mm_vmas);
 	current->flags &= ~PF_FORKNOEXEC;
 	current->mm->def_flags = def_flags;
 
@@ -1444,7 +1444,7 @@ static int elf_dump_thread_status(long s
 static struct vm_area_struct *first_vma(struct task_struct *tsk,
 					struct vm_area_struct *gate_vma)
 {
-	struct vm_area_struct *ret = tsk->mm->mmap;
+	struct vm_area_struct *ret = __vma_next(&tsk->mm->mm_vmas, NULL);
 
 	if (ret)
 		return ret;
@@ -1459,7 +1459,7 @@ static struct vm_area_struct *next_vma(s
 {
 	struct vm_area_struct *ret;
 
-	ret = this_vma->vm_next;
+	ret = vma_next(this_vma);
 	if (ret)
 		return ret;
 	if (this_vma == gate_vma)
Index: linux-2.6/fs/binfmt_elf_fdpic.c
===================================================================
--- linux-2.6.orig/fs/binfmt_elf_fdpic.c
+++ linux-2.6/fs/binfmt_elf_fdpic.c
@@ -1461,7 +1461,7 @@ static int elf_fdpic_dump_segments(struc
 {
 	struct vm_area_struct *vma;
 
-	for (vma = current->mm->mmap; vma; vma = vma->vm_next) {
+	list_for_each_entry(vma, &current->mm->mm_vmas, vm_list) {
 		unsigned long addr;
 
 		if (!maydump(vma))
@@ -1710,7 +1710,7 @@ static int elf_fdpic_core_dump(long sign
 	/* write program headers for segments dump */
 	for (
 #ifdef CONFIG_MMU
-		vma = current->mm->mmap; vma; vma = vma->vm_next
+		vma = __vma_next(&current->mm->mm_vmas, NULL); vma; vma = vma_next(vma)
 #else
 			vml = current->mm->context.vmlist; vml; vml = vml->next
 #endif
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -135,7 +135,7 @@ hugetlb_get_unmapped_area(struct file *f
 full_search:
 	addr = ALIGN(start_addr, HPAGE_SIZE);
 
-	for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
+	for (vma = find_vma(mm, addr); ; vma = vma_next(vma)) {
 		/* At this point:  (!vma || addr < vma->vm_end). */
 		if (TASK_SIZE - len < addr) {
 			/*
Index: linux-2.6/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.orig/fs/proc/task_mmu.c
+++ linux-2.6/fs/proc/task_mmu.c
@@ -87,11 +87,11 @@ int proc_exe_link(struct inode *inode, s
 		goto out;
 	down_read(&mm->mmap_sem);
 
-	vma = mm->mmap;
+	vma = __vma_next(&mm->mm_vmas, NULL);
 	while (vma) {
 		if ((vma->vm_flags & VM_EXECUTABLE) && vma->vm_file)
 			break;
-		vma = vma->vm_next;
+		vma = vma_next(vma);
 	}
 
 	if (vma) {
@@ -364,7 +364,7 @@ void clear_refs_smap(struct mm_struct *m
 	struct vm_area_struct *vma;
 
 	down_read(&mm->mmap_sem);
-	for (vma = mm->mmap; vma; vma = vma->vm_next)
+	list_for_each_entry(vma, &mm->mm_vmas, vm_list)
 		if (vma->vm_mm && !is_vm_hugetlb_page(vma))
 			walk_page_range(vma, clear_refs_pte_range, NULL);
 	flush_tlb_mm(mm);
@@ -406,7 +406,7 @@ static void *m_start(struct seq_file *m,
 
 	/* Start with last addr hint */
 	if (last_addr && (vma = find_vma(mm, last_addr))) {
-		vma = vma->vm_next;
+		vma = vma_next(vma);
 		goto out;
 	}
 
@@ -416,9 +416,9 @@ static void *m_start(struct seq_file *m,
 	 */
 	vma = NULL;
 	if ((unsigned long)l < mm->map_count) {
-		vma = mm->mmap;
+		vma = __vma_next(&mm->mm_vmas, NULL);
 		while (l-- && vma)
-			vma = vma->vm_next;
+			vma = vma_next(vma);
 		goto out;
 	}
 
@@ -448,12 +448,12 @@ static void vma_stop(struct proc_maps_pr
 static void *m_next(struct seq_file *m, void *v, loff_t *pos)
 {
 	struct proc_maps_private *priv = m->private;
-	struct vm_area_struct *vma = v;
+	struct vm_area_struct *vma = v, *next;
 	struct vm_area_struct *tail_vma = priv->tail_vma;
 
 	(*pos)++;
-	if (vma && (vma != tail_vma) && vma->vm_next)
-		return vma->vm_next;
+	if (vma && (vma != tail_vma) && (next = vma_next(vma)))
+		return next;
 	vma_stop(priv, vma);
 	return (vma != tail_vma)? tail_vma: NULL;
 }
Index: linux-2.6/ipc/shm.c
===================================================================
--- linux-2.6.orig/ipc/shm.c
+++ linux-2.6/ipc/shm.c
@@ -1034,7 +1034,7 @@ asmlinkage long sys_shmdt(char __user *s
 	vma = find_vma(mm, addr);
 
 	while (vma) {
-		next = vma->vm_next;
+		next = vma_next(vma);
 
 		/*
 		 * Check if the starting address would match, i.e. it's
@@ -1067,7 +1067,7 @@ asmlinkage long sys_shmdt(char __user *s
 	 */
 	size = PAGE_ALIGN(size);
 	while (vma && (loff_t)(vma->vm_end - addr) <= size) {
-		next = vma->vm_next;
+		next = vma_next(vma);
 
 		/* finding a matching vma now does not alter retval */
 		if ((vma->vm_ops == &shm_vm_ops) &&
Index: linux-2.6/kernel/acct.c
===================================================================
--- linux-2.6.orig/kernel/acct.c
+++ linux-2.6/kernel/acct.c
@@ -540,11 +540,8 @@ void acct_collect(long exitcode, int gro
 	if (group_dead && current->mm) {
 		struct vm_area_struct *vma;
 		down_read(&current->mm->mmap_sem);
-		vma = current->mm->mmap;
-		while (vma) {
+		list_for_each_entry(vma, &current->mm->mm_vmas, vm_list)
 			vsize += vma->vm_end - vma->vm_start;
-			vma = vma->vm_next;
-		}
 		up_read(&current->mm->mmap_sem);
 	}
 
Index: linux-2.6/kernel/auditsc.c
===================================================================
--- linux-2.6.orig/kernel/auditsc.c
+++ linux-2.6/kernel/auditsc.c
@@ -795,8 +795,7 @@ static void audit_log_task_info(struct a
 
 	if (mm) {
 		down_read(&mm->mmap_sem);
-		vma = mm->mmap;
-		while (vma) {
+		list_for_each_entry(vma, &mm->mm_vmas, vm_list) {
 			if ((vma->vm_flags & VM_EXECUTABLE) &&
 			    vma->vm_file) {
 				audit_log_d_path(ab, "exe=",
@@ -804,7 +803,6 @@ static void audit_log_task_info(struct a
 						 vma->vm_file->f_path.mnt);
 				break;
 			}
-			vma = vma->vm_next;
 		}
 		up_read(&mm->mmap_sem);
 	}
Index: linux-2.6/mm/madvise.c
===================================================================
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -349,7 +349,7 @@ asmlinkage long sys_madvise(unsigned lon
 		if (start >= end)
 			goto out;
 		if (prev)
-			vma = prev->vm_next;
+			vma = vma_next(prev);
 		else	/* madvise_remove dropped mmap_sem */
 			vma = find_vma(current->mm, start);
 	}
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -266,11 +266,12 @@ void free_pgd_range(struct mmu_gather **
 		flush_tlb_pgtables((*tlb)->mm, start, end);
 }
 
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
+void free_pgtables(struct mmu_gather **tlb, struct list_head *vmas,
+		struct vm_area_struct *vma,
 		unsigned long floor, unsigned long ceiling)
 {
 	while (vma) {
-		struct vm_area_struct *next = vma->vm_next;
+		struct vm_area_struct *next = __vma_next(vmas, vma);
 		unsigned long addr = vma->vm_start;
 
 		/*
@@ -289,7 +290,7 @@ void free_pgtables(struct mmu_gather **t
 			while (next && next->vm_start <= vma->vm_end + PMD_SIZE
 			       && !is_vm_hugetlb_page(next)) {
 				vma = next;
-				next = vma->vm_next;
+				next = __vma_next(vmas, vma);
 				anon_vma_unlink(vma);
 				unlink_file_vma(vma);
 			}
@@ -808,7 +809,7 @@ static unsigned long unmap_page_range(st
  * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
  * drops the lock and schedules.
  */
-unsigned long unmap_vmas(struct mmu_gather **tlbp,
+unsigned long unmap_vmas(struct mmu_gather **tlbp, struct list_head *vmas,
 		struct vm_area_struct *vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *details)
@@ -820,7 +821,7 @@ unsigned long unmap_vmas(struct mmu_gath
 	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
 	int fullmm = (*tlbp)->fullmm;
 
-	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
+	for (; vma && vma->vm_start < end_addr; vma = __vma_next(vmas, vma)) {
 		unsigned long end;
 
 		start = max(vma->vm_start, start_addr);
@@ -891,7 +892,8 @@ unsigned long zap_page_range(struct vm_a
 	lru_add_drain();
 	tlb = tlb_gather_mmu(mm, 0);
 	update_hiwater_rss(mm);
-	end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
+	end = unmap_vmas(&tlb, &vma->vm_mm->mm_vmas, vma,
+			address, end, &nr_accounted, details);
 	if (tlb)
 		tlb_finish_mmu(tlb, address, end);
 	return end;
@@ -2069,7 +2071,7 @@ int vmtruncate_range(struct inode *inode
 void swapin_readahead(swp_entry_t entry, unsigned long addr,struct vm_area_struct *vma)
 {
 #ifdef CONFIG_NUMA
-	struct vm_area_struct *next_vma = vma ? vma->vm_next : NULL;
+	struct vm_area_struct *next_vma = vma ? vma_next(vma) : NULL;
 #endif
 	int i, num;
 	struct page *new_page;
@@ -2096,14 +2098,14 @@ void swapin_readahead(swp_entry_t entry,
 		if (vma) {
 			if (addr >= vma->vm_end) {
 				vma = next_vma;
-				next_vma = vma ? vma->vm_next : NULL;
+				next_vma = vma ? vma_next(vma) : NULL;
 			}
 			if (vma && addr < vma->vm_start)
 				vma = NULL;
 		} else {
 			if (next_vma && addr >= next_vma->vm_start) {
 				vma = next_vma;
-				next_vma = vma->vm_next;
+				next_vma = vma_next(vma);
 			}
 		}
 #endif
Index: linux-2.6/mm/mempolicy.c
===================================================================
--- linux-2.6.orig/mm/mempolicy.c
+++ linux-2.6/mm/mempolicy.c
@@ -344,9 +344,9 @@ check_range(struct mm_struct *mm, unsign
 	if (!first)
 		return ERR_PTR(-EFAULT);
 	prev = NULL;
-	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
+	for (vma = first; vma && vma->vm_start < end; vma = vma_next(vma)) {
 		if (!(flags & MPOL_MF_DISCONTIG_OK)) {
-			if (!vma->vm_next && vma->vm_end < end)
+			if (!vma_next(vma) && vma->vm_end < end)
 				return ERR_PTR(-EFAULT);
 			if (prev && prev->vm_end < vma->vm_start)
 				return ERR_PTR(-EFAULT);
@@ -403,7 +403,7 @@ static int mbind_range(struct vm_area_st
 
 	err = 0;
 	for (; vma && vma->vm_start < end; vma = next) {
-		next = vma->vm_next;
+		next = vma_next(vma);
 		if (vma->vm_start < start)
 			err = split_vma(vma->vm_mm, vma, start, 1);
 		if (!err && vma->vm_end > end)
@@ -610,7 +610,7 @@ int migrate_to_node(struct mm_struct *mm
 	nodes_clear(nmask);
 	node_set(source, nmask);
 
-	check_range(mm, mm->mmap->vm_start, TASK_SIZE, &nmask,
+	check_range(mm, __vma_next(&mm->mm_vmas, NULL)->vm_start, TASK_SIZE, &nmask,
 			flags | MPOL_MF_DISCONTIG_OK, &pagelist);
 
 	if (!list_empty(&pagelist))
@@ -1698,7 +1698,7 @@ void mpol_rebind_mm(struct mm_struct *mm
 	struct vm_area_struct *vma;
 
 	down_write(&mm->mmap_sem);
-	for (vma = mm->mmap; vma; vma = vma->vm_next)
+	list_for_each_entry(vma, &mm->mm_vmas, vm_list)
 		mpol_rebind_policy(vma->vm_policy, new);
 	up_write(&mm->mmap_sem);
 }
Index: linux-2.6/mm/mlock.c
===================================================================
--- linux-2.6.orig/mm/mlock.c
+++ linux-2.6/mm/mlock.c
@@ -123,7 +123,7 @@ static int do_mlock(unsigned long start,
 		if (nstart >= end)
 			break;
 
-		vma = prev->vm_next;
+		vma = vma_next(prev);
 		if (!vma || vma->vm_start != nstart) {
 			error = -ENOMEM;
 			break;
@@ -181,7 +181,7 @@ static int do_mlockall(int flags)
 	if (flags == MCL_FUTURE)
 		goto out;
 
-	for (vma = current->mm->mmap; vma ; vma = prev->vm_next) {
+	list_for_each_entry(vma, &current->mm->mm_vmas, vm_list) {
 		unsigned int newflags;
 
 		newflags = vma->vm_flags | VM_LOCKED;
Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c
+++ linux-2.6/mm/mprotect.c
@@ -302,7 +302,7 @@ sys_mprotect(unsigned long start, size_t
 		if (nstart >= end)
 			goto out;
 
-		vma = prev->vm_next;
+		vma = vma_next(prev);
 		if (!vma || vma->vm_start != nstart) {
 			error = -ENOMEM;
 			goto out;
Index: linux-2.6/mm/mremap.c
===================================================================
--- linux-2.6.orig/mm/mremap.c
+++ linux-2.6/mm/mremap.c
@@ -226,7 +226,7 @@ static unsigned long move_vma(struct vm_
 	if (excess) {
 		vma->vm_flags |= VM_ACCOUNT;
 		if (split)
-			vma->vm_next->vm_flags |= VM_ACCOUNT;
+			vma_next(vma)->vm_flags |= VM_ACCOUNT;
 	}
 
 	if (vm_flags & VM_LOCKED) {
@@ -356,8 +356,9 @@ unsigned long do_mremap(unsigned long ad
 	    !((flags & MREMAP_FIXED) && (addr != new_addr)) &&
 	    (old_len != new_len || !(flags & MREMAP_MAYMOVE))) {
 		unsigned long max_addr = TASK_SIZE;
-		if (vma->vm_next)
-			max_addr = vma->vm_next->vm_start;
+		struct vm_area_struct *next = vma_next(vma);
+		if (next)
+			max_addr = next->vm_start;
 		/* can we just expand the current mapping? */
 		if (max_addr - addr >= new_len) {
 			int pages = (new_len - old_len) >> PAGE_SHIFT;
Index: linux-2.6/mm/msync.c
===================================================================
--- linux-2.6.orig/mm/msync.c
+++ linux-2.6/mm/msync.c
@@ -93,7 +93,7 @@ asmlinkage long sys_msync(unsigned long 
 				error = 0;
 				goto out_unlock;
 			}
-			vma = vma->vm_next;
+			vma = vma_next(vma);
 		}
 	}
 out_unlock:
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -626,7 +626,7 @@ static int unuse_mm(struct mm_struct *mm
 		down_read(&mm->mmap_sem);
 		lock_page(page);
 	}
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+	list_for_each_entry(vma, &mm->mm_vmas, vm_list) {
 		if (vma->anon_vma && unuse_vma(vma, entry, page))
 			break;
 	}
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -1006,7 +1006,7 @@ int migrate_vmas(struct mm_struct *mm, c
  	struct vm_area_struct *vma;
  	int err = 0;
 
- 	for(vma = mm->mmap; vma->vm_next && !err; vma = vma->vm_next) {
+	list_for_each_entry(vma, &mm->mm_vmas, vm_list) {
  		if (vma->vm_ops && vma->vm_ops->migrate) {
  			err = vma->vm_ops->migrate(vma, to, from, flags);
  			if (err)
Index: linux-2.6/arch/alpha/kernel/osf_sys.c
===================================================================
--- linux-2.6.orig/arch/alpha/kernel/osf_sys.c
+++ linux-2.6/arch/alpha/kernel/osf_sys.c
@@ -1255,7 +1255,7 @@ arch_get_unmapped_area_1(unsigned long a
 		if (!vma || addr + len <= vma->vm_start)
 			return addr;
 		addr = vma->vm_end;
-		vma = vma->vm_next;
+		vma = vma_next(vma);
 	}
 }
 
Index: linux-2.6/arch/arm/mm/mmap.c
===================================================================
--- linux-2.6.orig/arch/arm/mm/mmap.c
+++ linux-2.6/arch/arm/mm/mmap.c
@@ -84,7 +84,7 @@ full_search:
 	else
 		addr = PAGE_ALIGN(addr);
 
-	for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
+	for (vma = find_vma(mm, addr); ; vma = vma_next(vma)) {
 		/* At this point:  (!vma || addr < vma->vm_end). */
 		if (TASK_SIZE - len < addr) {
 			/*
Index: linux-2.6/arch/frv/mm/elf-fdpic.c
===================================================================
--- linux-2.6.orig/arch/frv/mm/elf-fdpic.c
+++ linux-2.6/arch/frv/mm/elf-fdpic.c
@@ -86,7 +86,7 @@ unsigned long arch_get_unmapped_area(str
 
 		if (addr <= limit) {
 			vma = find_vma(current->mm, PAGE_SIZE);
-			for (; vma; vma = vma->vm_next) {
+			for (; vma; vma = vma_next(vma)) {
 				if (addr > limit)
 					break;
 				if (addr + len <= vma->vm_start)
@@ -101,7 +101,7 @@ unsigned long arch_get_unmapped_area(str
 	limit = TASK_SIZE - len;
 	if (addr <= limit) {
 		vma = find_vma(current->mm, addr);
-		for (; vma; vma = vma->vm_next) {
+		for (; vma; vma = vma_next(vma)) {
 			if (addr > limit)
 				break;
 			if (addr + len <= vma->vm_start)
Index: linux-2.6/arch/ia64/kernel/sys_ia64.c
===================================================================
--- linux-2.6.orig/arch/ia64/kernel/sys_ia64.c
+++ linux-2.6/arch/ia64/kernel/sys_ia64.c
@@ -58,7 +58,7 @@ arch_get_unmapped_area (struct file *fil
   full_search:
 	start_addr = addr = (addr + align_mask) & ~align_mask;
 
-	for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
+	for (vma = find_vma(mm, addr); ; vma = vma_next(vma)) {
 		/* At this point:  (!vma || addr < vma->vm_end). */
 		if (TASK_SIZE - len < addr || RGN_MAP_LIMIT - len < REGION_OFFSET(addr)) {
 			if (start_addr != TASK_UNMAPPED_BASE) {
Index: linux-2.6/arch/ia64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/ia64/mm/hugetlbpage.c
+++ linux-2.6/arch/ia64/mm/hugetlbpage.c
@@ -161,7 +161,7 @@ unsigned long hugetlb_get_unmapped_area(
 		addr = HPAGE_REGION_BASE;
 	else
 		addr = ALIGN(addr, HPAGE_SIZE);
-	for (vmm = find_vma(current->mm, addr); ; vmm = vmm->vm_next) {
+	for (vmm = find_vma(current->mm, addr); ; vmm = vma_next(vmm)) {
 		/* At this point:  (!vmm || addr < vmm->vm_end). */
 		if (REGION_OFFSET(addr) + len > RGN_MAP_LIMIT)
 			return -ENOMEM;
Index: linux-2.6/arch/mips/kernel/irixelf.c
===================================================================
--- linux-2.6.orig/arch/mips/kernel/irixelf.c
+++ linux-2.6/arch/mips/kernel/irixelf.c
@@ -718,7 +718,7 @@ static int load_irix_binary(struct linux
 	/* OK, This is the point of no return */
 	current->mm->end_data = 0;
 	current->mm->end_code = 0;
-	current->mm->mmap = NULL;
+	INIT_LIST_HEAD(&current->mm->mm_vmas);
 	current->flags &= ~PF_FORKNOEXEC;
 	elf_entry = (unsigned int) elf_ex.e_entry;
 
@@ -1108,7 +1108,7 @@ static int irix_core_dump(long signr, st
 	/* Count what's needed to dump, up to the limit of coredump size. */
 	segs = 0;
 	size = 0;
-	for (vma = current->mm->mmap; vma != NULL; vma = vma->vm_next) {
+	list_for_each_entry(vma, &current->mm->mm_vmas, vm_list) {
 		if (maydump(vma))
 		{
 			int sz = vma->vm_end-vma->vm_start;
@@ -1267,12 +1267,13 @@ static int irix_core_dump(long signr, st
 	dataoff = offset = roundup(offset, PAGE_SIZE);
 
 	/* Write program headers for segments dump. */
-	for (vma = current->mm->mmap, i = 0;
-		i < segs && vma != NULL; vma = vma->vm_next) {
+	i = 0;
+	list_for_each_entry(vma, &current->mm->mm_vmas, vm_list) {
 		struct elf_phdr phdr;
 		size_t sz;
 
-		i++;
+		if (i++ == segs)
+			break;
 
 		sz = vma->vm_end - vma->vm_start;
 
@@ -1301,15 +1302,16 @@ static int irix_core_dump(long signr, st
 
 	DUMP_SEEK(dataoff);
 
-	for (i = 0, vma = current->mm->mmap;
-	    i < segs && vma != NULL;
-	    vma = vma->vm_next) {
+	i = 0;
+	list_for_each_entry(vma, &current->mm->mm_vmas, vm_list) {
 		unsigned long addr = vma->vm_start;
 		unsigned long len = vma->vm_end - vma->vm_start;
 
 		if (!maydump(vma))
 			continue;
-		i++;
+
+		if (i++ == segs)
+			break;
 		pr_debug("elf_core_dump: writing %08lx %lx\n", addr, len);
 		DUMP_WRITE((void __user *)addr, len);
 	}
Index: linux-2.6/arch/mips/kernel/syscall.c
===================================================================
--- linux-2.6.orig/arch/mips/kernel/syscall.c
+++ linux-2.6/arch/mips/kernel/syscall.c
@@ -103,7 +103,7 @@ unsigned long arch_get_unmapped_area(str
 	else
 		addr = PAGE_ALIGN(addr);
 
-	for (vmm = find_vma(current->mm, addr); ; vmm = vmm->vm_next) {
+	for (vmm = find_vma(current->mm, addr); ; vmm = vma_next(vmm)) {
 		/* At this point:  (!vmm || addr < vmm->vm_end). */
 		if (task_size - len < addr)
 			return -ENOMEM;
Index: linux-2.6/arch/parisc/kernel/sys_parisc.c
===================================================================
--- linux-2.6.orig/arch/parisc/kernel/sys_parisc.c
+++ linux-2.6/arch/parisc/kernel/sys_parisc.c
@@ -52,7 +52,7 @@ static unsigned long get_unshared_area(u
 
 	addr = PAGE_ALIGN(addr);
 
-	for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) {
+	for (vma = find_vma(current->mm, addr); ; vma = vma_next(vma)) {
 		/* At this point:  (!vma || addr < vma->vm_end). */
 		if (TASK_SIZE - len < addr)
 			return -ENOMEM;
@@ -88,7 +88,7 @@ static unsigned long get_shared_area(str
 
 	addr = DCACHE_ALIGN(addr - offset) + offset;
 
-	for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) {
+	for (vma = find_vma(current->mm, addr); ; vma = vma_next(vma)) {
 		/* At this point:  (!vma || addr < vma->vm_end). */
 		if (TASK_SIZE - len < addr)
 			return -ENOMEM;
Index: linux-2.6/arch/powerpc/mm/tlb_32.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/tlb_32.c
+++ linux-2.6/arch/powerpc/mm/tlb_32.c
@@ -154,7 +154,7 @@ void flush_tlb_mm(struct mm_struct *mm)
 	 * unmap_region or exit_mmap, but not from vmtruncate on SMP -
 	 * but it seems dup_mmap is the only SMP case which gets here.
 	 */
-	for (mp = mm->mmap; mp != NULL; mp = mp->vm_next)
+	list_for_each_entry(mp, &mm->mm_vmas, vm_list)
 		flush_range(mp->vm_mm, mp->vm_start, mp->vm_end);
 	FINISH_FLUSH;
 }
Index: linux-2.6/arch/ppc/mm/tlb.c
===================================================================
--- linux-2.6.orig/arch/ppc/mm/tlb.c
+++ linux-2.6/arch/ppc/mm/tlb.c
@@ -148,7 +148,7 @@ void flush_tlb_mm(struct mm_struct *mm)
 		return;
 	}
 
-	for (mp = mm->mmap; mp != NULL; mp = mp->vm_next)
+	list_for_each_entry(mp, &mm->mm_vmas, vm_list)
 		flush_range(mp->vm_mm, mp->vm_start, mp->vm_end);
 	FINISH_FLUSH;
 }
Index: linux-2.6/arch/sh/kernel/sys_sh.c
===================================================================
--- linux-2.6.orig/arch/sh/kernel/sys_sh.c
+++ linux-2.6/arch/sh/kernel/sys_sh.c
@@ -107,7 +107,7 @@ full_search:
 	else
 		addr = PAGE_ALIGN(mm->free_area_cache);
 
-	for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
+	for (vma = find_vma(mm, addr); ; vma = vma_next(vma)) {
 		/* At this point:  (!vma || addr < vma->vm_end). */
 		if (unlikely(TASK_SIZE - len < addr)) {
 			/*
Index: linux-2.6/arch/sh/mm/cache-sh4.c
===================================================================
--- linux-2.6.orig/arch/sh/mm/cache-sh4.c
+++ linux-2.6/arch/sh/mm/cache-sh4.c
@@ -396,7 +396,7 @@ void flush_cache_mm(struct mm_struct *mm
 		 * In this case there are reasonably sized ranges to flush,
 		 * iterate through the VMA list and take care of any aliases.
 		 */
-		for (vma = mm->mmap; vma; vma = vma->vm_next)
+		list_for_each_entry(vma, &mm->mm_vmas, vm_list)
 			__flush_cache_mm(mm, vma->vm_start, vma->vm_end);
 	}
 
Index: linux-2.6/arch/sh64/kernel/sys_sh64.c
===================================================================
--- linux-2.6.orig/arch/sh64/kernel/sys_sh64.c
+++ linux-2.6/arch/sh64/kernel/sys_sh64.c
@@ -119,7 +119,7 @@ unsigned long arch_get_unmapped_area(str
 	else
 		addr = COLOUR_ALIGN(addr);
 
-	for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) {
+	for (vma = find_vma(current->mm, addr); ; vma = vma_next(vma)) {
 		/* At this point:  (!vma || addr < vma->vm_end). */
 		if (TASK_SIZE - len < addr)
 			return -ENOMEM;
Index: linux-2.6/arch/sparc/kernel/sys_sparc.c
===================================================================
--- linux-2.6.orig/arch/sparc/kernel/sys_sparc.c
+++ linux-2.6/arch/sparc/kernel/sys_sparc.c
@@ -64,7 +64,7 @@ unsigned long arch_get_unmapped_area(str
 	else
 		addr = PAGE_ALIGN(addr);
 
-	for (vmm = find_vma(current->mm, addr); ; vmm = vmm->vm_next) {
+	for (vmm = find_vma(current->mm, addr); ; vmm = vma_next(vmm)) {
 		/* At this point:  (!vmm || addr < vmm->vm_end). */
 		if (ARCH_SUN4C_SUN4 && addr < 0xe0000000 && 0x20000000 - len < addr) {
 			addr = PAGE_OFFSET;
Index: linux-2.6/arch/sparc64/kernel/binfmt_aout32.c
===================================================================
--- linux-2.6.orig/arch/sparc64/kernel/binfmt_aout32.c
+++ linux-2.6/arch/sparc64/kernel/binfmt_aout32.c
@@ -242,7 +242,7 @@ static int load_aout32_binary(struct lin
 	current->mm->free_area_cache = current->mm->mmap_base;
 	current->mm->cached_hole_size = 0;
 
-	current->mm->mmap = NULL;
+	INIT_LIST_HEAD(&current->mm->mm_vmas);
 	compute_creds(bprm);
  	current->flags &= ~PF_FORKNOEXEC;
 	if (N_MAGIC(ex) == NMAGIC) {
Index: linux-2.6/arch/sparc64/kernel/sys_sparc.c
===================================================================
--- linux-2.6.orig/arch/sparc64/kernel/sys_sparc.c
+++ linux-2.6/arch/sparc64/kernel/sys_sparc.c
@@ -166,7 +166,7 @@ full_search:
 	else
 		addr = PAGE_ALIGN(addr);
 
-	for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
+	for (vma = find_vma(mm, addr); ; vma = vma_next(vma)) {
 		/* At this point:  (!vma || addr < vma->vm_end). */
 		if (addr < VA_EXCLUDE_START &&
 		    (addr + len) >= VA_EXCLUDE_START) {
Index: linux-2.6/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- linux-2.6.orig/arch/sparc64/mm/hugetlbpage.c
+++ linux-2.6/arch/sparc64/mm/hugetlbpage.c
@@ -54,7 +54,7 @@ static unsigned long hugetlb_get_unmappe
 full_search:
 	addr = ALIGN(addr, HPAGE_SIZE);
 
-	for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
+	for (vma = find_vma(mm, addr); ; vma = vma_next(vma)) {
 		/* At this point:  (!vma || addr < vma->vm_end). */
 		if (addr < VA_EXCLUDE_START &&
 		    (addr + len) >= VA_EXCLUDE_START) {
Index: linux-2.6/arch/x86_64/ia32/ia32_aout.c
===================================================================
--- linux-2.6.orig/arch/x86_64/ia32/ia32_aout.c
+++ linux-2.6/arch/x86_64/ia32/ia32_aout.c
@@ -311,7 +311,7 @@ static int load_aout_binary(struct linux
 	current->mm->free_area_cache = TASK_UNMAPPED_BASE;
 	current->mm->cached_hole_size = 0;
 
-	current->mm->mmap = NULL;
+	INIT_LIST_HEAD(&current->mm->mm_vmas);
 	compute_creds(bprm);
  	current->flags &= ~PF_FORKNOEXEC;
 
Index: linux-2.6/fs/binfmt_aout.c
===================================================================
--- linux-2.6.orig/fs/binfmt_aout.c
+++ linux-2.6/fs/binfmt_aout.c
@@ -323,7 +323,7 @@ static int load_aout_binary(struct linux
 	current->mm->free_area_cache = current->mm->mmap_base;
 	current->mm->cached_hole_size = 0;
 
-	current->mm->mmap = NULL;
+	INIT_LIST_HEAD(&current->mm->mm_vmas);
 	compute_creds(bprm);
  	current->flags &= ~PF_FORKNOEXEC;
 #ifdef __sparc__
Index: linux-2.6/include/linux/init_task.h
===================================================================
--- linux-2.6.orig/include/linux/init_task.h
+++ linux-2.6/include/linux/init_task.h
@@ -46,6 +46,7 @@
 
 #define INIT_MM(name) \
 {			 					\
+	.mm_vmas	= LIST_HEAD_INIT(name.mm_vmas),		\
 	.mm_rb		= RB_ROOT,				\
 	.pgd		= swapper_pg_dir, 			\
 	.mm_users	= ATOMIC_INIT(2), 			\
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -36,6 +36,7 @@ extern int sysctl_legacy_va_layout;
 #define sysctl_legacy_va_layout 0
 #endif
 
+#include <linux/sched.h>
 #include <asm/page.h>
 #include <asm/pgtable.h>
 #include <asm/processor.h>
@@ -64,7 +65,7 @@ struct vm_area_struct {
 					   within vm_mm. */
 
 	/* linked list of VM areas per task, sorted by address */
-	struct vm_area_struct *vm_next;
+	struct list_head vm_list;
 
 	pgprot_t vm_page_prot;		/* Access permissions of this VMA. */
 	unsigned long vm_flags;		/* Flags, listed below. */
@@ -114,6 +115,42 @@ struct vm_area_struct {
 #endif
 };
 
+static inline struct vm_area_struct *
+__vma_next(struct list_head *head, struct vm_area_struct *vma)
+{
+	if (unlikely(!vma))
+		vma = container_of(head, struct vm_area_struct, vm_list);
+
+	if (vma->vm_list.next == head)
+		return NULL;
+
+	return list_entry(vma->vm_list.next, struct vm_area_struct, vm_list);
+}
+
+static inline struct vm_area_struct *
+vma_next(struct vm_area_struct *vma)
+{
+	return __vma_next(&vma->vm_mm->mm_vmas, vma);
+}
+
+static inline struct vm_area_struct *
+__vma_prev(struct list_head *head, struct vm_area_struct *vma)
+{
+	if (unlikely(!vma))
+		vma = container_of(head, struct vm_area_struct, vm_list);
+
+	if (vma->vm_list.prev == head)
+		return NULL;
+
+	return list_entry(vma->vm_list.prev, struct vm_area_struct, vm_list);
+}
+
+static inline struct vm_area_struct *
+vma_prev(struct vm_area_struct *vma)
+{
+	return __vma_prev(&vma->vm_mm->mm_vmas, vma);
+}
+
 extern struct kmem_cache *vm_area_cachep;
 
 /*
@@ -740,13 +777,14 @@ struct zap_details {
 struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
+unsigned long unmap_vmas(struct mmu_gather **tlb, struct list_head *vmas,
 		struct vm_area_struct *start_vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *);
 void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
+void free_pgtables(struct mmu_gather **tlb, struct list_head *vmas,
+	       	struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 			struct vm_area_struct *vma);
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -321,7 +321,7 @@ typedef unsigned long mm_counter_t;
 } while (0)
 
 struct mm_struct {
-	struct vm_area_struct * mmap;		/* list of VMAs */
+	struct list_head mm_vmas;		/* list of VMAs */
 	struct rb_root mm_rb;
 	struct vm_area_struct * mmap_cache;	/* last find_vma result */
 	unsigned long (*get_unmapped_area) (struct file *filp,
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -196,7 +196,7 @@ static struct task_struct *dup_task_stru
 #ifdef CONFIG_MMU
 static inline int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 {
-	struct vm_area_struct *mpnt, *tmp, **pprev;
+	struct vm_area_struct *mpnt, *tmp;
 	struct rb_node **rb_link, *rb_parent;
 	int retval;
 	unsigned long charge;
@@ -210,7 +210,6 @@ static inline int dup_mmap(struct mm_str
 	down_write_nested(&mm->mmap_sem, SINGLE_DEPTH_NESTING);
 
 	mm->locked_vm = 0;
-	mm->mmap = NULL;
 	mm->mmap_cache = NULL;
 	mm->free_area_cache = oldmm->mmap_base;
 	mm->cached_hole_size = ~0UL;
@@ -219,9 +218,8 @@ static inline int dup_mmap(struct mm_str
 	mm->mm_rb = RB_ROOT;
 	rb_link = &mm->mm_rb.rb_node;
 	rb_parent = NULL;
-	pprev = &mm->mmap;
 
-	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
+	list_for_each_entry(mpnt, &oldmm->mm_vmas, vm_list) {
 		struct file *file;
 
 		if (mpnt->vm_flags & VM_DONTCOPY) {
@@ -249,7 +247,6 @@ static inline int dup_mmap(struct mm_str
 		vma_set_policy(tmp, pol);
 		tmp->vm_flags &= ~VM_LOCKED;
 		tmp->vm_mm = mm;
-		tmp->vm_next = NULL;
 		anon_vma_link(tmp);
 		file = tmp->vm_file;
 		if (file) {
@@ -270,9 +267,7 @@ static inline int dup_mmap(struct mm_str
 		/*
 		 * Link in the new vma and copy the page table entries.
 		 */
-		*pprev = tmp;
-		pprev = &tmp->vm_next;
-
+		list_add_tail(&tmp->vm_list, &mm->mm_vmas);
 		__vma_link_rb(mm, tmp, rb_link, rb_parent);
 		rb_link = &tmp->vm_rb.rb_right;
 		rb_parent = &tmp->vm_rb;
@@ -329,6 +324,7 @@ static inline void mm_free_pgd(struct mm
 
 static struct mm_struct * mm_init(struct mm_struct * mm)
 {
+	INIT_LIST_HEAD(&mm->mm_vmas);
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
 	init_rwsem(&mm->mmap_sem);
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c
+++ linux-2.6/mm/mmap.c
@@ -35,7 +35,7 @@
 #define arch_mmap_check(addr, len, flags)	(0)
 #endif
 
-static void unmap_region(struct mm_struct *mm,
+static void unmap_region(struct mm_struct *mm, struct list_head *vmas,
 		struct vm_area_struct *vma, struct vm_area_struct *prev,
 		unsigned long start, unsigned long end);
 
@@ -220,18 +220,16 @@ void unlink_file_vma(struct vm_area_stru
 /*
  * Close a vm structure and free it, returning the next.
  */
-static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
+static void remove_vma(struct vm_area_struct *vma)
 {
-	struct vm_area_struct *next = vma->vm_next;
-
 	might_sleep();
+	list_del(&vma->vm_list);
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
 	if (vma->vm_file)
 		fput(vma->vm_file);
 	mpol_free(vma_policy(vma));
 	kmem_cache_free(vm_area_cachep, vma);
-	return next;
 }
 
 asmlinkage unsigned long sys_brk(unsigned long brk)
@@ -316,11 +314,9 @@ void validate_mm(struct mm_struct *mm)
 {
 	int bug = 0;
 	int i = 0;
-	struct vm_area_struct *tmp = mm->mmap;
-	while (tmp) {
-		tmp = tmp->vm_next;
+	struct vm_area_struct *vma;
+	list_for_each_entry(vma, &mm->mm_vmas, vm_list)
 		i++;
-	}
 	if (i != mm->map_count)
 		printk("map_count %d vm_next %d\n", mm->map_count, i), bug = 1;
 	i = browse_rb(&mm->mm_rb);
@@ -374,15 +370,15 @@ __vma_link_list(struct mm_struct *mm, st
 		struct vm_area_struct *prev, struct rb_node *rb_parent)
 {
 	if (prev) {
-		vma->vm_next = prev->vm_next;
-		prev->vm_next = vma;
+		list_add(&vma->vm_list, &prev->vm_list);
 	} else {
-		mm->mmap = vma;
-		if (rb_parent)
-			vma->vm_next = rb_entry(rb_parent,
+		if (rb_parent) {
+			struct vm_area_struct *next =
+				rb_entry(rb_parent,
 					struct vm_area_struct, vm_rb);
-		else
-			vma->vm_next = NULL;
+			list_add_tail(&vma->vm_list, &next->vm_list);
+		} else
+			list_add(&vma->vm_list, &mm->mm_vmas);
 	}
 }
 
@@ -472,7 +468,7 @@ static inline void
 __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
 		struct vm_area_struct *prev)
 {
-	prev->vm_next = vma->vm_next;
+	list_del(&vma->vm_list);
 	rb_erase(&vma->vm_rb, &mm->mm_rb);
 	if (mm->mmap_cache == vma)
 		mm->mmap_cache = prev;
@@ -489,7 +485,7 @@ void vma_adjust(struct vm_area_struct *v
 	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	struct vm_area_struct *next = vma->vm_next;
+	struct vm_area_struct *next = vma_next(vma);
 	struct vm_area_struct *importer = NULL;
 	struct address_space *mapping = NULL;
 	struct prio_tree_root *root = NULL;
@@ -630,7 +626,7 @@ again:			remove_next = 1 + (end > next->
 		 * up the code too much to do both in one go.
 		 */
 		if (remove_next == 2) {
-			next = vma->vm_next;
+			next = vma_next(vma);
 			goto again;
 		}
 	}
@@ -751,13 +747,10 @@ struct vm_area_struct *vma_merge(struct 
 	if (vm_flags & VM_SPECIAL)
 		return NULL;
 
-	if (prev)
-		next = prev->vm_next;
-	else
-		next = mm->mmap;
+	next = __vma_next(&mm->mm_vmas, prev);
 	area = next;
 	if (next && next->vm_end == end)		/* cases 6, 7, 8 */
-		next = next->vm_next;
+		next = vma_next(next);
 
 	/*
 	 * Can it merge with the predecessor?
@@ -816,7 +809,7 @@ struct anon_vma *find_mergeable_anon_vma
 	struct vm_area_struct *near;
 	unsigned long vm_flags;
 
-	near = vma->vm_next;
+	near = vma_next(vma);
 	if (!near)
 		goto try_prev;
 
@@ -902,6 +895,7 @@ unsigned long do_mmap_pgoff(struct file 
 	struct rb_node ** rb_link, * rb_parent;
 	int accountable = 1;
 	unsigned long charged = 0, reqprot = prot;
+	LIST_HEAD(vmas);
 
 	/*
 	 * Does the application expect PROT_READ to imply PROT_EXEC?
@@ -1165,7 +1159,8 @@ unmap_and_free_vma:
 	fput(file);
 
 	/* Undo any partial mapping done by a device driver. */
-	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
+	list_add(&vma->vm_list, &vmas);
+	unmap_region(mm, &vmas, vma, prev, vma->vm_start, vma->vm_end);
 	charged = 0;
 free_vma:
 	kmem_cache_free(vm_area_cachep, vma);
@@ -1218,7 +1213,7 @@ arch_get_unmapped_area(struct file *filp
 	}
 
 full_search:
-	for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
+	for (vma = find_vma(mm, addr); ; vma = vma_next(vma)) {
 		/* At this point:  (!vma || addr < vma->vm_end). */
 		if (TASK_SIZE - len < addr) {
 			/*
@@ -1429,14 +1424,11 @@ struct vm_area_struct *
 find_vma_prev(struct mm_struct *mm, unsigned long addr,
 			struct vm_area_struct **pprev)
 {
-	struct vm_area_struct *vma = NULL, *prev = NULL;
+	struct vm_area_struct *prev = NULL, *next;
 	struct rb_node * rb_node;
 	if (!mm)
 		goto out;
 
-	/* Guard against addr being lower than the first VMA */
-	vma = mm->mmap;
-
 	/* Go through the RB tree quickly. */
 	rb_node = mm->mm_rb.rb_node;
 
@@ -1448,7 +1440,8 @@ find_vma_prev(struct mm_struct *mm, unsi
 			rb_node = rb_node->rb_left;
 		} else {
 			prev = vma_tmp;
-			if (!prev->vm_next || (addr < prev->vm_next->vm_end))
+			next = __vma_next(&mm->mm_vmas, prev);
+			if (!next || (addr < next->vm_end))
 				break;
 			rb_node = rb_node->rb_right;
 		}
@@ -1456,7 +1449,7 @@ find_vma_prev(struct mm_struct *mm, unsi
 
 out:
 	*pprev = prev;
-	return prev ? prev->vm_next : vma;
+	return __vma_next(&mm->mm_vmas, prev);
 }
 
 /*
@@ -1655,18 +1648,21 @@ find_extend_vma(struct mm_struct * mm, u
  *
  * Called with the mm semaphore held.
  */
-static void remove_vma_list(struct mm_struct *mm, struct vm_area_struct *vma)
+static void remove_vma_list(struct mm_struct *mm, struct list_head *vmas,
+		struct vm_area_struct *vma)
 {
 	/* Update high watermark before we lower total_vm */
 	update_hiwater_vm(mm);
 	do {
+		struct vm_area_struct *next = __vma_next(vmas, vma);
 		long nrpages = vma_pages(vma);
 
 		mm->total_vm -= nrpages;
 		if (vma->vm_flags & VM_LOCKED)
 			mm->locked_vm -= nrpages;
 		vm_stat_account(mm, vma->vm_flags, vma->vm_file, -nrpages);
-		vma = remove_vma(vma);
+		remove_vma(vma);
+		vma = next;
 	} while (vma);
 	validate_mm(mm);
 }
@@ -1676,21 +1672,22 @@ static void remove_vma_list(struct mm_st
  *
  * Called with the mm semaphore held.
  */
-static void unmap_region(struct mm_struct *mm,
+static void unmap_region(struct mm_struct *mm, struct list_head *vmas,
 		struct vm_area_struct *vma, struct vm_area_struct *prev,
 		unsigned long start, unsigned long end)
 {
-	struct vm_area_struct *next = prev? prev->vm_next: mm->mmap;
+	struct vm_area_struct *next = __vma_next(&mm->mm_vmas, prev);
 	struct mmu_gather *tlb;
 	unsigned long nr_accounted = 0;
 
 	lru_add_drain();
 	tlb = tlb_gather_mmu(mm, 0);
 	update_hiwater_rss(mm);
-	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
+	unmap_vmas(&tlb, vmas, vma, start, end, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
-	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
-				 next? next->vm_start: 0);
+	free_pgtables(&tlb, vmas, vma,
+			prev ? prev->vm_end : FIRST_USER_ADDRESS,
+			next ? next->vm_start : 0);
 	tlb_finish_mmu(tlb, start, end);
 }
 
@@ -1700,21 +1697,18 @@ static void unmap_region(struct mm_struc
  */
 static void
 detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
-	struct vm_area_struct *prev, unsigned long end)
+	       struct vm_area_struct *prev, unsigned long end,
+	       struct list_head *vmas)
 {
-	struct vm_area_struct **insertion_point;
-	struct vm_area_struct *tail_vma = NULL;
 	unsigned long addr;
 
-	insertion_point = (prev ? &prev->vm_next : &mm->mmap);
 	do {
+		struct vm_area_struct *next = vma_next(vma);
 		rb_erase(&vma->vm_rb, &mm->mm_rb);
 		mm->map_count--;
-		tail_vma = vma;
-		vma = vma->vm_next;
+		list_move_tail(&vma->vm_list, vmas);
+		vma = next;
 	} while (vma && vma->vm_start < end);
-	*insertion_point = vma;
-	tail_vma->vm_next = NULL;
 	if (mm->unmap_area == arch_unmap_area)
 		addr = prev ? prev->vm_end : mm->mmap_base;
 	else
@@ -1784,6 +1778,7 @@ int do_munmap(struct mm_struct *mm, unsi
 {
 	unsigned long end;
 	struct vm_area_struct *vma, *prev, *last;
+	LIST_HEAD(vmas);
 
 	if ((start & ~PAGE_MASK) || start > TASK_SIZE || len > TASK_SIZE-start)
 		return -EINVAL;
@@ -1823,16 +1818,16 @@ int do_munmap(struct mm_struct *mm, unsi
 		if (error)
 			return error;
 	}
-	vma = prev? prev->vm_next: mm->mmap;
+	vma = __vma_next(&mm->mm_vmas, prev);
 
 	/*
 	 * Remove the vma's, and unmap the actual pages
 	 */
-	detach_vmas_to_be_unmapped(mm, vma, prev, end);
-	unmap_region(mm, vma, prev, start, end);
+	detach_vmas_to_be_unmapped(mm, vma, prev, end, &vmas);
+	unmap_region(mm, &vmas, vma, prev, start, end);
 
 	/* Fix up all other VM information */
-	remove_vma_list(mm, vma);
+	remove_vma_list(mm, &vmas, vma);
 
 	return 0;
 }
@@ -1969,7 +1964,9 @@ EXPORT_SYMBOL(do_brk);
 void exit_mmap(struct mm_struct *mm)
 {
 	struct mmu_gather *tlb;
-	struct vm_area_struct *vma = mm->mmap;
+	LIST_HEAD(vmas);
+	struct vm_area_struct *vma = __vma_next(&mm->mm_vmas, NULL);
+	struct vm_area_struct *next;
 	unsigned long nr_accounted = 0;
 	unsigned long end;
 
@@ -1978,20 +1975,21 @@ void exit_mmap(struct mm_struct *mm)
 
 	lru_add_drain();
 	flush_cache_mm(mm);
+	detach_vmas_to_be_unmapped(mm, vma, NULL, -1, &vmas);
 	tlb = tlb_gather_mmu(mm, 1);
 	/* Don't update_hiwater_rss(mm) here, do_exit already did */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
-	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
+	end = unmap_vmas(&tlb, &vmas, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
-	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
+	free_pgtables(&tlb, &vmas, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
 	 * with preemption enabled, without holding any MM locks.
 	 */
-	while (vma)
-		vma = remove_vma(vma);
+	list_for_each_entry_safe(vma, next, &vmas, vm_list)
+		remove_vma(vma);
 
 	BUG_ON(mm->nr_ptes > (FIRST_USER_ADDRESS+PMD_SIZE-1)>>PMD_SHIFT);
 }
Index: linux-2.6/arch/um/kernel/skas/tlb.c
===================================================================
--- linux-2.6.orig/arch/um/kernel/skas/tlb.c
+++ linux-2.6/arch/um/kernel/skas/tlb.c
@@ -90,12 +90,10 @@ void flush_tlb_mm_skas(struct mm_struct 
 void force_flush_all_skas(void)
 {
 	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma = mm->mmap;
+	struct vm_area_struct *vma;
 
-	while(vma != NULL) {
+	list_for_each_entry(vma, &mm->mm_vmas, vm_list)
 		fix_range(mm, vma->vm_start, vma->vm_end, 1);
-		vma = vma->vm_next;
-	}
 }
 
 void flush_tlb_page_skas(struct vm_area_struct *vma, unsigned long address)


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [2/3] mm: fix up some user-visible effects of the stack guard page
  2010-08-20 18:59             ` Linus Torvalds
@ 2010-08-20 19:43               ` Linus Torvalds
  2010-08-20 20:42                 ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2010-08-20 19:43 UTC (permalink / raw)
  To: Ian Campbell; +Cc: linux-kernel, stable, stable-review, akpm, alan, Greg KH

On Fri, Aug 20, 2010 at 11:59 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It should be a pretty straightforward search-and-replace. And almost
> all of it would be real cleanups.

I take that back. Cleanups - probably. Simple search-and-replace? No.
That vm_next thing is all over the place, and all of the code knows
that the last vma has a vm_next that is NULL.

So switching it to the "<linux/list.h>" kind of accessors would be a major pain.

There's also lots of really ugly code that is all about the "we can't
easily get to the 'prev' entry in the list". Stuff that would be
cleaned up if we just had a vm_prev, but where the cleanup is just
pretty painful.
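
Roughly, the vm_prev variant might look something like this (hypothetical
sketch, not an actual patch; rbtree handling and the other list users are
omitted):

struct vm_area_struct {
	/* ... */
	struct vm_area_struct *vm_next;
	struct vm_area_struct *vm_prev;		/* hypothetical back pointer */
	/* ... */
};

/* Simplified sketch: keep the back pointer up to date when linking
 * 'vma' in after 'prev' (or at the head of the list when prev is NULL). */
static void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
		struct vm_area_struct *prev)
{
	struct vm_area_struct *next;

	vma->vm_prev = prev;
	if (prev) {
		next = prev->vm_next;
		prev->vm_next = vma;
	} else {
		next = mm->mmap;
		mm->mmap = vma;
	}
	vma->vm_next = next;
	if (next)
		next->vm_prev = vma;
}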

> And it would be trivial to change the loops like
>
>    for (vma = mm->mmap; vma; vma = vma->vm_next)
>
> into basically just
>
>   list_for_each_entry(vma, &mm->mmap, vm_list)

Yeah, no. It looks like adding a "vm_prev" and doing a regular doubly
linked list thing wouldn't be too bad. But switching over to the
list.h version looks like a nightmare.

Too bad.

                  Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [2/3] mm: fix up some user-visible effects of the stack guard page
  2010-08-20 17:43           ` Ian Campbell
@ 2010-08-20 18:59             ` Linus Torvalds
  2010-08-20 19:43               ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2010-08-20 18:59 UTC (permalink / raw)
  To: Ian Campbell; +Cc: linux-kernel, stable, stable-review, akpm, alan, Greg KH

On Fri, Aug 20, 2010 at 10:43 AM, Ian Campbell <ijc@hellion.org.uk> wrote:
>
> If we could easily get at the previous VMA (instead of just the next
> one) then we could easily check if we were mlocking a VM_GROWSDOWN
> region which had another VM_GROWSDOWN region immediately below it and
> therefore avoid introducing a guard page at the boundary. Doing this
> check is currently too expensive because of the need to use
> find_vma_prev. Is that right?

Exactly.

> It does look like a big task, but if it seems like the only sane option
> I'll take a look at it and see if can be broken down into manageable
> stages.

It should be a pretty straightforward search-and-replace. And almost
all of it would be real cleanups.

And it would be trivial to change the loops like

    for (vma = mm->mmap; vma; vma = vma->vm_next)

into basically just

   list_for_each_entry(vma, &mm->mmap, vm_list)

etc.

> You mentioned making this a tunable in your original commit message,
> that would at least help in the short term so I may look into that too.
> (prctl would be the right interface?)

I'm not convinced a tunable is really the right thing, but in this
case it might be acceptable, since you really are doing something
rather specific and odd. Changing the VM to use doubly-linked lists
would be a lot _cleaner_, though.

> I wonder if there's any way to auto tune, for example automatically
> disabling the guard page for a process which mlocks only part of its
> stack VMA. That would obviously target the specific issue I'm seeing
> pretty directly and would only reopen the hole for applications which
> were already doing odd things (c.f. your earlier comment about the guard
> page not being magic or helping with wilfully crazy userspace).

I'd really hate to try to do something subtle that doesn't have
obvious semantics. Subtle code is buggy code. Maybe not today, but two
years from now..

                     Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [2/3] mm: fix up some user-visible effects of the stack guard page
  2010-08-20 16:24         ` Linus Torvalds
@ 2010-08-20 17:43           ` Ian Campbell
  2010-08-20 18:59             ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Ian Campbell @ 2010-08-20 17:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, stable, stable-review, akpm, alan, Greg KH

[-- Attachment #1: Type: text/plain, Size: 2195 bytes --]

On Fri, 2010-08-20 at 09:24 -0700, Linus Torvalds wrote:
> On Fri, Aug 20, 2010 at 9:07 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > That said, it does strike me as rather odd to do VM ops on partial
> > stacks. What are you doing, exactly, to hit this?
> 
> The reason I ask is that the _sane_ thing to do - if we really care
> about this - is to change the 'vm_next' singly-linked list into using
> 'list.h'. It would clean up a fair amount of stuff, like removing the
> need for that disgusting 'find_vma_prev()' thing. There are actually
> several users of vma's that want to look at the previous vma, but
> because it's hard to get at, they do something non-intuitive or odd.

I wasn't sure at first what you were getting at here, so let me see if I
figured it out...

If we could easily get at the previous VMA (instead of just the next
one) then we could easily check if we were mlocking a VM_GROWSDOWN
region which had another VM_GROWSDOWN region immediately below it and
therefore avoid introducing a guard page at the boundary. Doing this
check is currently too expensive because of the need to use
find_vma_prev. Is that right?
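
A hypothetical sketch of that check (the helper name is made up; vma_prev()
is the cheap previous-vma accessor a doubly-linked list would provide, as in
the list-conversion patch elsewhere in this thread):

/* Hypothetical: a guard page is only needed if the mapping directly
 * below is not itself part of the same grows-down stack, e.g. the
 * lower piece left behind by an mlock_fixup() split. */
static int vma_needs_guard_page(struct vm_area_struct *vma)
{
	struct vm_area_struct *prev;

	if (!(vma->vm_flags & VM_GROWSDOWN))
		return 0;

	prev = vma_prev(vma);
	if (prev && (prev->vm_flags & VM_GROWSDOWN) &&
	    prev->vm_end == vma->vm_start)
		return 0;

	return 1;
}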

> At the same time, we've had that vm_next pointer since pretty much day
> one, and I also get a strong feeling that it's not really worth the
> churn.

It does look like a big task, but if it seems like the only sane option
I'll take a look at it and see if can be broken down into manageable
stages.

You mentioned making this a tunable in your original commit message,
that would at least help in the short term so I may look into that too.
(prctl would be the right interface?)

I wonder if there's any way to auto tune, for example automatically
disabling the guard page for a process which mlocks only part of its
stack VMA. That would obviously target the specific issue I'm seeing
pretty directly and would only reopen the hole for applications which
were already doing odd things (c.f. your earlier comment about the guard
page not being magic or helping with wilfully crazy userspace).

Ian.

-- 
Ian Campbell

Let he who takes the plunge remember to return it by Tuesday.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [2/3] mm: fix up some user-visible effects of the stack guard page
  2010-08-20 16:35           ` Ian Campbell
@ 2010-08-20 16:49             ` Linus Torvalds
  0 siblings, 0 replies; 21+ messages in thread
From: Linus Torvalds @ 2010-08-20 16:49 UTC (permalink / raw)
  To: Ian Campbell
  Cc: linux-kernel, stable, stable-review, akpm, alan, Greg KH,
	Jeremy Fitzhardinge

On Fri, Aug 20, 2010 at 9:35 AM, Ian Campbell <ijc@hellion.org.uk> wrote:
>
> On the other hand the VMA merging is just an optimisation, isn't it?

Well, yes and no. This would make it have semantic differences, if you
were to unmap the lower part of the stack.

I could imagine some crazy program wanting to basically return the
stack pages to the system after doing heavy recursion. IOW, they could
do

 - use lots of stack because we're recursing 1000 levels deep

 - know that we used lots of stack, so after returning do something like

    stack = &local variable;
    align stack down by two pages
    munmap down from there to give memory back

and now it really would be a semantic change where the VM_GROWSDOWN
bit has literally disappeared.
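
A contrived user-space sketch of that pattern (purely illustrative; the
names and the two-page cushion are made up, and a real program would have
to be far more careful):

/* Recurse deeply, then hand the no-longer-needed stack pages back
 * to the kernel with munmap(). Hypothetical illustration only. */
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static uintptr_t deepest;		/* lowest stack address touched */

static void recurse(int n) __attribute__((noinline));
static void recurse(int n)
{
	volatile char buf[256];

	buf[0] = (char)n;
	if ((uintptr_t)buf < deepest)
		deepest = (uintptr_t)buf;
	if (n > 0)
		recurse(n - 1);
}

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char here;
	uintptr_t low, top;

	deepest = (uintptr_t)&here;
	recurse(1000);			/* "recursing 1000 levels deep" */

	/* take the address of a local variable and align down by two pages */
	top = ((uintptr_t)&here & ~(uintptr_t)(page - 1)) - 2 * (uintptr_t)page;
	low = deepest & ~(uintptr_t)(page - 1);

	/* munmap down from there to give the memory back; note that this
	 * splits the stack vma, which is the semantic issue above */
	if (top > low && munmap((void *)low, top - low) == 0)
		printf("unmapped old stack pages %#lx-%#lx\n",
		       (unsigned long)low, (unsigned long)top);
	return 0;
}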

                                  Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [2/3] mm: fix up some user-visible effects of the stack guard page
  2010-08-20 16:32         ` Ian Campbell
@ 2010-08-20 16:35           ` Ian Campbell
  2010-08-20 16:49             ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Ian Campbell @ 2010-08-20 16:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, stable, stable-review, akpm, alan, Greg KH,
	Jeremy Fitzhardinge

On Fri, 2010-08-20 at 17:32 +0100, Ian Campbell wrote:

> On Fri, 2010-08-20 at 09:07 -0700, Linus Torvalds wrote:
> 
> > Actually, thinking some more about it, that may not be a good idea.
> > Why? Simply because we may want to merge the vma's back together if
> > you do munlock. And it won't (and mustn't) merge if the vm_flags
> > differ in VM_GROWSDOWN.
> > 
> > So I do think we want to keep VM_GROWSDOWN (and VM_GROWSUP on
> PA-RISC)
> > even across a vma split.
> 
> I naively hacked something together and it did seem to work, but I
> shared your worries about merging.
> 
> > Of course, we could set a flag whether the vma really does have a
> > guard page or not.
> 
> Bits in vma->vm_flags seem to be in rather short supply :-( 

On the other hand the VMA merging is just an optimisation, isn't it?

-- 
Ian Campbell

Many people are desperately looking for some wise advice which will
recommend that they do what they want to do.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [2/3] mm: fix up some user-visible effects of the stack guard page
  2010-08-20 16:07       ` Linus Torvalds
  2010-08-20 16:24         ` Linus Torvalds
@ 2010-08-20 16:32         ` Ian Campbell
  2010-08-20 16:35           ` Ian Campbell
  1 sibling, 1 reply; 21+ messages in thread
From: Ian Campbell @ 2010-08-20 16:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, stable, stable-review, akpm, alan, Greg KH,
	Jeremy Fitzhardinge

On Fri, 2010-08-20 at 09:07 -0700, Linus Torvalds wrote:

> Actually, thinking some more about it, that may not be a good idea.
> Why? Simply because we may want to merge the vma's back together if
> you do munlock. And it won't (and mustn't) merge if the vm_flags
> differ in VM_GROWSDOWN.
> 
> So I do think we want to keep VM_GROWSDOWN (and VM_GROWSUP on PA-RISC)
> even across a vma split.

I naively hacked something together and it did seem to work, but I
shared your worries about merging.

> Of course, we could set a flag whether the vma really does have a
> guard page or not.

Bits in vma->vm_flags seem to be in rather short supply :-(

> That said, it does strike me as rather odd to do VM ops on partial
> stacks. What are you doing, exactly, to hit this?

I sent a contrived test program in my other mail.

The actual use is to mlock a buffer on the stack in order to pass it to
a Xen hypercall. The contract with the hypervisor is that the pages
passed to the hypercall must be mapped.

Ian.

-- 
Ian Campbell
Current Noise: Dinosaur Jr. - Green Mind

We can defeat gravity.  The problem is the paperwork involved.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [2/3] mm: fix up some user-visible effects of the stack guard page
  2010-08-20 16:07       ` Linus Torvalds
@ 2010-08-20 16:24         ` Linus Torvalds
  2010-08-20 17:43           ` Ian Campbell
  2010-08-20 16:32         ` Ian Campbell
  1 sibling, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2010-08-20 16:24 UTC (permalink / raw)
  To: Ian Campbell; +Cc: linux-kernel, stable, stable-review, akpm, alan, Greg KH

On Fri, Aug 20, 2010 at 9:07 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> That said, it does strike me as rather odd to do VM ops on partial
> stacks. What are you doing, exactly, to hit this?

The reason I ask is that the _sane_ thing to do - if we really care
about this - is to change the 'vm_next' singly-linked list into using
'list.h'. It would clean up a fair amount of stuff, like removing the
need for that disgusting 'find_vma_prev()' thing. There are actually
several users of vma's that want to look at the previous vma, but
because it's hard to get at, they do something non-intuitive or odd.

At the same time, we've had that vm_next pointer since pretty much day
one, and I also get a strong feeling that it's not really worth the
churn.

                         Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [2/3] mm: fix up some user-visible effects of the stack guard page
  2010-08-20 15:54     ` Linus Torvalds
  2010-08-20 16:02       ` Ian Campbell
@ 2010-08-20 16:07       ` Linus Torvalds
  2010-08-20 16:24         ` Linus Torvalds
  2010-08-20 16:32         ` Ian Campbell
  1 sibling, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2010-08-20 16:07 UTC (permalink / raw)
  To: Ian Campbell; +Cc: linux-kernel, stable, stable-review, akpm, alan, Greg KH

On Fri, Aug 20, 2010 at 8:54 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Aug 20, 2010 at 5:54 AM, Ian Campbell <ijc@hellion.org.uk> wrote:
>>
>> Since we have split the original VMA into 3, shouldn't only the bottom
>> one still have VM_GROWSDOWN set? (how can the top two grow down with the
>> bottom one in the way?) Certainly it seems wrong to enforce a guard page
>> on anything but the bottom VMA (which is what appears to be happening).
>
> Yes, it does seem like we should teach vma splitting to remove
> VM_GROWSDOWN on all but the lowest mapping.

Actually, thinking some more about it, that may not be a good idea.
Why? Simply because we may want to merge the vma's back together if
you do munlock. And it won't (and mustn't) merge if the vm_flags
differ in VM_GROWSDOWN.

So I do think we want to keep VM_GROWSDOWN (and VM_GROWSUP on PA-RISC)
even across a vma split.

Of course, we could set a flag whether the vma really does have a
guard page or not.

That said, it does strike me as rather odd to do VM ops on partial
stacks. What are you doing, exactly, to hit this?

                         Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [2/3] mm: fix up some user-visible effects of the stack guard page
  2010-08-20 15:54     ` Linus Torvalds
@ 2010-08-20 16:02       ` Ian Campbell
  2010-08-20 16:07       ` Linus Torvalds
  1 sibling, 0 replies; 21+ messages in thread
From: Ian Campbell @ 2010-08-20 16:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, stable, stable-review, akpm, alan, Greg KH

[-- Attachment #1: Type: text/plain, Size: 2093 bytes --]

On Fri, 2010-08-20 at 08:54 -0700, Linus Torvalds wrote:
> On Fri, Aug 20, 2010 at 5:54 AM, Ian Campbell <ijc@hellion.org.uk> wrote:
> >
> > Since we have split the original VMA into 3, shouldn't only the bottom
> > one still have VM_GROWSDOWN set? (how can the top two grow down with the
> > bottom one in the way?) Certainly it seems wrong to enforce a guard page
> > on anything but the bottom VMA (which is what appears to be happening).
> 
> Yes, it does seem like we should teach vma splitting to remove
> VM_GROWSDOWN on all but the lowest mapping.

Agreed, I was just coding that up to check.

Is there any VMA coalescing which we need to worry about? We don't
appear to do anything like that on munlock at least but I didn't look
elsewhere.

FWIW the attached mlock.c exhibits the issue for me. Under 2.6.35 it
takes 1 additional minor fault if you do not give the "lock" option and
0 minor faults if you do give "lock".

Under 2.6.35.2 and 2.6.35.3-rc it works the same without "lock", but with
"lock" the mlock fails under 2.6.35.2, while under 2.6.35.3-rc it appears
to succeed but doesn't really, and the program then gets a bus error.

> > Out of interest how does the guard page interact with processes which do
> > alloca(N*PAGE_SIZE)?
> 
> It's a guard page, not magic. Some architecture ABI's specify that if
> you expand the stack by more than a certain number, you need to touch
> a page in between (for example, I think alpha had that rule), because
> they don't grow the stack automatically by an arbitrary amount. But
> x86 has never had that rule, and you can certainly defeat a guard page
> by simply accessing by much more than a page.
> 
> As far as I'm concerned, the guard page thing is not - and shouldn't
> be thought of - a "hard" feature. If it's needed, it's really a bug in
> user space. But given that there are bugs in user space, the guard
> page makes it a bit harder to abuse those bugs. But it's about "a bit
> harder" rather than anything else.
> 
> IOW, it does _not_ make up for user space that willfully does crazy
> things, and never will.

Thanks, was just curious...

Ian.

[-- Attachment #2: mlock.c --]
[-- Type: text/x-csrc, Size: 1778 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <sys/mman.h>

#include <sys/time.h>
#include <sys/resource.h>

#define PAGE_SIZE 4096
#define PAGE_MASK (~(PAGE_SIZE-1))

static void do_lock(void *addr, size_t len)
{
      void *laddr = (void *)((unsigned long)addr & PAGE_MASK);
      size_t llen = (len + ((unsigned long)addr - (unsigned long)laddr) +
                     PAGE_SIZE - 1) & PAGE_MASK;
      int e = mlock(laddr, llen);

      printf("locking %p-%p -> %p-%p -> %d\n",
	     addr, addr+len, laddr, laddr+llen, e);
      if (e < 0)
	      exit(1);
}

static void do_test(int lock_it)  __attribute__((noinline));
static void do_test(int lock_it) {
	struct rusage rbefore, rafter;
	struct {
		char pad1[2*PAGE_SIZE];
		unsigned long lock_me;
		char pad2[2*PAGE_SIZE];
	} s;
	unsigned long esp;

	s.pad1[0] = 1;
	s.pad2[2*PAGE_SIZE-1] = 1;

	//memset(&s.pad1, 0, sizeof(s.pad1));
	//memset(&s.pad2, 0, sizeof(s.pad2));

	printf("pad1 at %p-%p\n", &s.pad1[0], &s.pad1[sizeof(s.pad1)-1]);
	printf("LCK  at %p-%p\n", &s.lock_me, &s.lock_me + 1);
	printf("pad2 at %p-%p\n", &s.pad2[0], &s.pad2[sizeof(s.pad2)-1]);
	asm volatile("mov %%esp, %0\n" : "=r" (esp));
	printf("esp: %#lx\n", esp);

	if (lock_it) {
		do_lock(&s.lock_me, sizeof(s.lock_me));
	}

	getrusage(RUSAGE_SELF, &rbefore);

	s.lock_me = 0xdeadbeef;

	getrusage(RUSAGE_SELF, &rafter);

	printf("minor faults: %ld -> %ld\n", rbefore.ru_minflt, rafter.ru_minflt);
	printf("major faults: %ld -> %ld\n", rbefore.ru_majflt, rafter.ru_majflt);
	if (lock_it && (rbefore.ru_minflt != rafter.ru_minflt || rbefore.ru_majflt != rafter.ru_majflt))
		printf("ERROR -- Should not have faulted\n");
}

int main(int argc, char **argv)
{
	do_test(argc > 1 && strcmp(argv[1], "lock") == 0);
	return 0;
}

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [2/3] mm: fix up some user-visible effects of the stack guard page
  2010-08-20 12:54   ` Ian Campbell
@ 2010-08-20 15:54     ` Linus Torvalds
  2010-08-20 16:02       ` Ian Campbell
  2010-08-20 16:07       ` Linus Torvalds
  0 siblings, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2010-08-20 15:54 UTC (permalink / raw)
  To: Ian Campbell; +Cc: linux-kernel, stable, stable-review, akpm, alan, Greg KH

On Fri, Aug 20, 2010 at 5:54 AM, Ian Campbell <ijc@hellion.org.uk> wrote:
>
> Since we have split the original VMA into 3, shouldn't only the bottom
> one still have VM_GROWSDOWN set? (how can the top two grow down with the
> bottom one in the way?) Certainly it seems wrong to enforce a guard page
> on anything but the bottom VMA (which is what appears to be happening).

Yes, it does seem like we should teach vma splitting to remove
VM_GROWSDOWN on all but the lowest mapping.
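
A hypothetical sketch of what that could mean in split_vma() terms (names
follow that function; this is not an actual patch, and it would complicate
merging the pieces back together after munlock):

	/* Hypothetical: after copying *vma into *new, let only the piece
	 * that keeps the low end of the mapping keep VM_GROWSDOWN; the
	 * upper piece cannot grow down past its neighbour anyway. */
	*new = *vma;
	if (new_below) {
		new->vm_end = addr;		/* new is the lower piece */
		vma->vm_flags &= ~VM_GROWSDOWN;	/* vma becomes the upper piece */
	} else {
		new->vm_start = addr;		/* new is the upper piece */
		new->vm_pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
		new->vm_flags &= ~VM_GROWSDOWN;
	}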

> Out of interest how does the guard page interact with processes which do
> alloca(N*PAGE_SIZE)?

It's a guard page, not magic. Some architecture ABI's specify that if
you expand the stack by more than a certain number, you need to touch
a page in between (for example, I think alpha had that rule), because
they don't grow the stack automatically by an arbitrary amount. But
x86 has never had that rule, and you can certainly defeat a guard page
by simply accessing by much more than a page.
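
For instance (contrived user-space sketch, 4k pages assumed, purely
illustrative):

#include <alloca.h>

void leapfrog(void)
{
	char *p = alloca(64 * 4096);	/* drop the stack pointer ~64 pages */

	p[0] = 1;	/* first access lands far below any single guard page */
}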

As far as I'm concerned, the guard page thing is not - and shouldn't
be thought of - a "hard" feature. If it's needed, it's really a bug in
user space. But given that there are bugs in user space, the guard
page makes it a bit harder to abuse those bugs. But it's about "a bit
harder" rather than anything else.

IOW, it does _not_ make up for user space that willfully does crazy
things, and never will.

                       Linus

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [2/3] mm: fix up some user-visible effects of the stack guard page
  2010-08-18 20:30 ` [2/3] mm: fix up some user-visible effects of the stack guard page Greg KH
@ 2010-08-20 12:54   ` Ian Campbell
  2010-08-20 15:54     ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Ian Campbell @ 2010-08-20 12:54 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel, stable, stable-review, akpm, alan, Greg KH

On Wed, 2010-08-18 at 13:30 -0700, Greg KH wrote:
> 2.6.35-stable review patch.  If anyone has any objections, please let us know.
> 

>  - by also teaching the _real_ mlock() functionality not to try to lock
>    the guard page.
> 
>    That would just expand the mapping down to create a new guard page,
>    so there really is no point in trying to lock it in place.

> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -167,6 +167,14 @@ static long __mlock_vma_pages_range(stru
>  	if (vma->vm_flags & VM_WRITE)
>  		gup_flags |= FOLL_WRITE;
>  
> +	/* We don't try to access the guard page of a stack vma */
> +	if (vma->vm_flags & VM_GROWSDOWN) {
> +		if (start == vma->vm_start) {
> +			start += PAGE_SIZE;
> +			nr_pages--;
> +		}
> +	}
> +

Is this really correct?

I have an app which tries to mlock a portion of its stack. With this
patch (and a bunch of debug) in place I get:
        [  170.977782] sys_mlock 0xbfd8b000-0xbfd8c000 4096
        [  170.978200] sys_mlock aligned, range now 0xbfd8b000-0xbfd8c000 4096
        [  170.978209] do_mlock 0xbfd8b000-0xbfd8c000 4096 (locking)
        [  170.978216] do_mlock vma de47d8f0 0xbfd7e000-0xbfd94000
        [  170.978223] mlock_fixup split vma de47d8f0 0xbfd7e000-0xbfd94000 at start 0xbfd8b000
        [  170.978231] mlock_fixup split vma de47d8f0 0xbfd8b000-0xbfd94000 at end 0xbfd8c000
        [  170.978240] __mlock_vma_pages_range locking 0xbfd8b000-0xbfd8c000 (1 pages) in VMA bfd8b000 0xbfd8c000-0x0
        [  170.978248] __mlock_vma_pages_range adjusting start 0xbfd8b000->0xbfd8c000 to avoid guard
        [  170.978256] __mlock_vma_pages_range now locking 0xbfd8c000-0xbfd8c000 (0 pages)
        [  170.978263] do_mlock error = 0

Note how we end up locking 0 pages.

The stack layout is:
         0xbfd94000 stack VMA end / base
        
         0xbfd8c000 mlock requested end
         0xbfd8b000 mlock requested start
        
         0xbfd7f000 stack VMA start / top
        
         0xbfd7e000 guard page

As part of the mlock_fixup the original VMA (0xbfd7e000-0xbfd94000) is
split into 3, 0xbfd7e000-0xbfd8b000 + 0xbfd8b000-0xbfd8c000 +
0xbfd8c000-0xbfd94000 in order to mlock the middle bit.

Since we have split the original VMA into 3, shouldn't only the bottom
one still have VM_GROWSDOWN set? (how can the top two grow down with the
bottom one in the way?) Certainly it seems wrong to enforce a guard page
on anything but the bottom VMA (which is what appears to be happening).

Although perhaps the larger issue is whether or not it is valid to mlock
below the current end of your stack at all. I don't see why it wouldn't
be, so perhaps the above is just completely bogus? Isn't it possible that
a process may try to mlock a stack page which hasn't previously been
touched, and therefore isn't currently mapped, and which could therefore
contain the guard page?

Out of interest how does the guard page interact with processes which do
alloca(N*PAGE_SIZE)?

Ian.
-- 
Ian Campbell
Current Noise: Opeth - White Cluster

If we do not change our direction we are likely to end up where we are headed.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [2/3] mm: fix up some user-visible effects of the stack guard page
  2010-08-18 20:31 [0/3] 2.6.35.3 -stable review Greg KH
@ 2010-08-18 20:30 ` Greg KH
  2010-08-20 12:54   ` Ian Campbell
  0 siblings, 1 reply; 21+ messages in thread
From: Greg KH @ 2010-08-18 20:30 UTC (permalink / raw)
  To: linux-kernel, stable; +Cc: stable-review, torvalds, akpm, alan

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 2756 bytes --]

2.6.35-stable review patch.  If anyone has any objections, please let us know.

------------------

From: Linus Torvalds <torvalds@linux-foundation.org>

commit d7824370e26325c881b665350ce64fb0a4fde24a upstream.

This commit makes the stack guard page somewhat less visible to user
space. It does this by:

 - not showing the guard page in /proc/<pid>/maps

   It looks like lvm-tools will actually read /proc/self/maps to figure
   out where all its mappings are, and effectively do a specialized
   "mlockall()" in user space.  By not showing the guard page as part of
   the mapping (by just adding PAGE_SIZE to the start for grows-down
   pages), lvm-tools ends up not being aware of it.

 - by also teaching the _real_ mlock() functionality not to try to lock
   the guard page.

   That would just expand the mapping down to create a new guard page,
   so there really is no point in trying to lock it in place.

It would perhaps be nice to show the guard page specially in
/proc/<pid>/maps (or at least mark grow-down segments some way), but
let's not open ourselves up to more breakage by user space from programs
that depend on the exact details of the 'maps' file.

Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools
source code to see what was going on with the whole new warning.

Reported-and-tested-by: François Valenduc <francois.valenduc@tvcablenet.be>
Reported-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 fs/proc/task_mmu.c |    8 +++++++-
 mm/mlock.c         |    8 ++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -210,6 +210,7 @@ static void show_map_vma(struct seq_file
 	int flags = vma->vm_flags;
 	unsigned long ino = 0;
 	unsigned long long pgoff = 0;
+	unsigned long start;
 	dev_t dev = 0;
 	int len;
 
@@ -220,8 +221,13 @@ static void show_map_vma(struct seq_file
 		pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
 	}
 
+	/* We don't show the stack guard page in /proc/maps */
+	start = vma->vm_start;
+	if (vma->vm_flags & VM_GROWSDOWN)
+		start += PAGE_SIZE;
+
 	seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu %n",
-			vma->vm_start,
+			start,
 			vma->vm_end,
 			flags & VM_READ ? 'r' : '-',
 			flags & VM_WRITE ? 'w' : '-',
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -167,6 +167,14 @@ static long __mlock_vma_pages_range(stru
 	if (vma->vm_flags & VM_WRITE)
 		gup_flags |= FOLL_WRITE;
 
+	/* We don't try to access the guard page of a stack vma */
+	if (vma->vm_flags & VM_GROWSDOWN) {
+		if (start == vma->vm_start) {
+			start += PAGE_SIZE;
+			nr_pages--;
+		}
+	}
+
 	while (nr_pages > 0) {
 		int i;
 



^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2015-01-06 13:27 UTC | newest]

Thread overview: 21+ messages
2015-01-05 10:21 [2/3] mm: fix up some user-visible effects of the stack guard page Jay Foad
2015-01-05 21:03 ` Linus Torvalds
2015-01-05 21:11   ` Linus Torvalds
2015-01-05 21:19   ` Linus Torvalds
2015-01-06 13:27   ` Jay Foad
  -- strict thread matches above, loose matches on Subject: below --
2010-08-18 20:31 [0/3] 2.6.35.3 -stable review Greg KH
2010-08-18 20:30 ` [2/3] mm: fix up some user-visible effects of the stack guard page Greg KH
2010-08-20 12:54   ` Ian Campbell
2010-08-20 15:54     ` Linus Torvalds
2010-08-20 16:02       ` Ian Campbell
2010-08-20 16:07       ` Linus Torvalds
2010-08-20 16:24         ` Linus Torvalds
2010-08-20 17:43           ` Ian Campbell
2010-08-20 18:59             ` Linus Torvalds
2010-08-20 19:43               ` Linus Torvalds
2010-08-20 20:42                 ` Peter Zijlstra
2010-08-20 21:17                   ` Linus Torvalds
2010-08-20 21:24                     ` Linus Torvalds
2010-08-23  8:58                       ` Peter Zijlstra
2010-08-20 16:32         ` Ian Campbell
2010-08-20 16:35           ` Ian Campbell
2010-08-20 16:49             ` Linus Torvalds
