linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Don't mlock guardpage if the stack is growing up
@ 2011-05-08 18:55 Mikulas Patocka
  2011-05-08 20:12 ` Hugh Dickins
  2011-05-08 21:47 ` Linus Torvalds
  0 siblings, 2 replies; 25+ messages in thread
From: Mikulas Patocka @ 2011-05-08 18:55 UTC (permalink / raw)
  To: linux-kernel, linux-parisc, Linus Torvalds; +Cc: Hugh Dickins, Oleg Nesterov

Don't mlock guardpage if the stack is growing up

Linux kernel excludes guard page when performing mlock on a VMA with 
down-growing stack. However, some architectures have up-growing stack and 
locking the guard page should be excluded in this case too.

This patch fixes lvm2 on PA-RISC (and possibly other architectures with 
up-growing stack). lvm2 calculates the number of used pages when locking 
and when unlocking and reports an internal error if the numbers mismatch. 
On PA-RISC, the kernel would incorrectly attempt to mlock the stack guard 
page, this causes allocation of one more page and internal error in lvm2.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>

---
 include/linux/mm.h |    6 ++++++
 mm/memory.c        |   21 ++++++++++++---------
 2 files changed, 18 insertions(+), 9 deletions(-)

Index: linux-2.6.39-rc6-fast/include/linux/mm.h
===================================================================
--- linux-2.6.39-rc6-fast.orig/include/linux/mm.h	2011-05-07 05:59:51.000000000 +0200
+++ linux-2.6.39-rc6-fast/include/linux/mm.h	2011-05-07 05:59:52.000000000 +0200
@@ -1016,6 +1016,12 @@ static inline int vma_stack_continue(str
 	return vma && (vma->vm_end == addr) && (vma->vm_flags & VM_GROWSDOWN);
 }
 
+/* Is the vma a continuation of the stack vma below it? */
+static inline int vma_stack_growsup_continue(struct vm_area_struct *vma, unsigned long addr)
+{
+	return vma && (vma->vm_start == addr) && (vma->vm_flags & VM_GROWSUP);
+}
+
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len);
Index: linux-2.6.39-rc6-fast/mm/memory.c
===================================================================
--- linux-2.6.39-rc6-fast.orig/mm/memory.c	2011-05-07 05:59:51.000000000 +0200
+++ linux-2.6.39-rc6-fast/mm/memory.c	2011-05-07 05:59:52.000000000 +0200
@@ -1412,9 +1412,12 @@ no_page_table:
 
 static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
 {
-	return (vma->vm_flags & VM_GROWSDOWN) &&
+	return ((vma->vm_flags & VM_GROWSDOWN) &&
 		(vma->vm_start == addr) &&
-		!vma_stack_continue(vma->vm_prev, addr);
+		!vma_stack_continue(vma->vm_prev, addr)) ||
+	       ((vma->vm_flags & VM_GROWSUP) &&
+		(vma->vm_end == addr + PAGE_SIZE) &&
+		!vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE));
 }
 
 /**
@@ -1551,18 +1554,18 @@ int __get_user_pages(struct task_struct 
 			continue;
 		}
 
-		/*
-		 * If we don't actually want the page itself,
-		 * and it's the stack guard page, just skip it.
-		 */
-		if (!pages && stack_guard_page(vma, start))
-			goto next_page;
-
 		do {
 			struct page *page;
 			unsigned int foll_flags = gup_flags;
 
 			/*
+			 * If we don't actually want the page itself,
+			 * and it's the stack guard page, just skip it.
+			 */
+			if (!pages && stack_guard_page(vma, start))
+				goto next_page;
+
+			/*
 			 * If we have a pending SIGKILL, don't keep faulting
 			 * pages and potentially allocating memory.
 			 */

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-08 18:55 [PATCH] Don't mlock guardpage if the stack is growing up Mikulas Patocka
@ 2011-05-08 20:12 ` Hugh Dickins
  2011-05-09 11:12   ` Mikulas Patocka
  2011-05-08 21:47 ` Linus Torvalds
  1 sibling, 1 reply; 25+ messages in thread
From: Hugh Dickins @ 2011-05-08 20:12 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: linux-kernel, linux-parisc, Linus Torvalds, Michel Lespinasse,
	Oleg Nesterov

On Sun, 8 May 2011, Mikulas Patocka wrote:

> Don't mlock guardpage if the stack is growing up
> 
> Linux kernel excludes guard page when performing mlock on a VMA with 
> down-growing stack. However, some architectures have up-growing stack and 
> locking the guard page should be excluded in this case too.
> 
> This patch fixes lvm2 on PA-RISC (and possibly other architectures with 
> up-growing stack). lvm2 calculates the number of used pages when locking 
> and when unlocking and reports an internal error if the numbers mismatch. 
> On PA-RISC, the kernel would incorrectly attempt to mlock the stack guard 
> page, this causes allocation of one more page and internal error in lvm2.
> 
> Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>

Interesting, I'd convinced myself that the growsup case was safe,
because of how we always approach the vma from its bottom end.

I've added Michel to the Cc, he's the one with the best grasp here.

Could you point us to where lvm2 is making these calculations?
I don't understand quite what it's doing.

Thanks,
Hugh

> 
> ---
>  include/linux/mm.h |    6 ++++++
>  mm/memory.c        |   21 ++++++++++++---------
>  2 files changed, 18 insertions(+), 9 deletions(-)
> 
> Index: linux-2.6.39-rc6-fast/include/linux/mm.h
> ===================================================================
> --- linux-2.6.39-rc6-fast.orig/include/linux/mm.h	2011-05-07 05:59:51.000000000 +0200
> +++ linux-2.6.39-rc6-fast/include/linux/mm.h	2011-05-07 05:59:52.000000000 +0200
> @@ -1016,6 +1016,12 @@ static inline int vma_stack_continue(str
>  	return vma && (vma->vm_end == addr) && (vma->vm_flags & VM_GROWSDOWN);
>  }
>  
> +/* Is the vma a continuation of the stack vma below it? */
> +static inline int vma_stack_growsup_continue(struct vm_area_struct *vma, unsigned long addr)
> +{
> +	return vma && (vma->vm_start == addr) && (vma->vm_flags & VM_GROWSUP);
> +}
> +
>  extern unsigned long move_page_tables(struct vm_area_struct *vma,
>  		unsigned long old_addr, struct vm_area_struct *new_vma,
>  		unsigned long new_addr, unsigned long len);
> Index: linux-2.6.39-rc6-fast/mm/memory.c
> ===================================================================
> --- linux-2.6.39-rc6-fast.orig/mm/memory.c	2011-05-07 05:59:51.000000000 +0200
> +++ linux-2.6.39-rc6-fast/mm/memory.c	2011-05-07 05:59:52.000000000 +0200
> @@ -1412,9 +1412,12 @@ no_page_table:
>  
>  static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
>  {
> -	return (vma->vm_flags & VM_GROWSDOWN) &&
> +	return ((vma->vm_flags & VM_GROWSDOWN) &&
>  		(vma->vm_start == addr) &&
> -		!vma_stack_continue(vma->vm_prev, addr);
> +		!vma_stack_continue(vma->vm_prev, addr)) ||
> +	       ((vma->vm_flags & VM_GROWSUP) &&
> +		(vma->vm_end == addr + PAGE_SIZE) &&
> +		!vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE));
>  }
>  
>  /**
> @@ -1551,18 +1554,18 @@ int __get_user_pages(struct task_struct 
>  			continue;
>  		}
>  
> -		/*
> -		 * If we don't actually want the page itself,
> -		 * and it's the stack guard page, just skip it.
> -		 */
> -		if (!pages && stack_guard_page(vma, start))
> -			goto next_page;
> -
>  		do {
>  			struct page *page;
>  			unsigned int foll_flags = gup_flags;
>  
>  			/*
> +			 * If we don't actually want the page itself,
> +			 * and it's the stack guard page, just skip it.
> +			 */
> +			if (!pages && stack_guard_page(vma, start))
> +				goto next_page;
> +
> +			/*
>  			 * If we have a pending SIGKILL, don't keep faulting
>  			 * pages and potentially allocating memory.
>  			 */

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-08 18:55 [PATCH] Don't mlock guardpage if the stack is growing up Mikulas Patocka
  2011-05-08 20:12 ` Hugh Dickins
@ 2011-05-08 21:47 ` Linus Torvalds
  2011-05-09 11:01   ` Mikulas Patocka
  1 sibling, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2011-05-08 21:47 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: linux-kernel, linux-parisc, Hugh Dickins, Oleg Nesterov

On Sun, May 8, 2011 at 11:55 AM, Mikulas Patocka
<mikulas@artax.karlin.mff.cuni.cz> wrote:
>
> This patch fixes lvm2 on PA-RISC (and possibly other architectures with
> up-growing stack). lvm2 calculates the number of used pages when locking
> and when unlocking and reports an internal error if the numbers mismatch.

This patch won't apply on current kernels (including stable) because
of commit a1fde08c74e9 that changed the test of "pages" to instead
just test "flags & FOLL_MLOCK".

That should be trivial to fix up.

However, I really don't much like this complex test:

>  static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
>  {
> -       return (vma->vm_flags & VM_GROWSDOWN) &&
> +       return ((vma->vm_flags & VM_GROWSDOWN) &&
>                (vma->vm_start == addr) &&
> -               !vma_stack_continue(vma->vm_prev, addr);
> +               !vma_stack_continue(vma->vm_prev, addr)) ||
> +              ((vma->vm_flags & VM_GROWSUP) &&
> +               (vma->vm_end == addr + PAGE_SIZE) &&
> +               !vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE));
>  }

in that format. It gets really hard to read, and I think you'd be
better off writing it as two helper functions (or macros) for the two
cases, and then have

  static inline int stack_guard_page(struct vm_area_struct *vma,
unsigned long addr)
  {
    return stack_guard_page_growsdown(vma, addr) ||
      stack_guard_page_growsup(vma, addr);
  }

I'd also like to verify that it doesn't actually generate any extra
code for the common case (iirc VM_GROWSUP is 0 for the architectures
that don't need it, and so the compiler shouldn't generate any extra
code, but I'd like that mentioned and verified explicitly).

Hmm?

Other than that it looks ok to me.

That said, could we please fix LVM to not do that crazy sh*t in the
first place? The STACK_GROWSUP case is never going to have a lot of
testing, this is just sad.

                   Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-08 21:47 ` Linus Torvalds
@ 2011-05-09 11:01   ` Mikulas Patocka
  2011-05-09 11:43     ` Zdenek Kabelac
  0 siblings, 1 reply; 25+ messages in thread
From: Mikulas Patocka @ 2011-05-09 11:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, linux-parisc, Hugh Dickins, Oleg Nesterov, agk

[-- Attachment #1: Type: TEXT/PLAIN, Size: 6108 bytes --]



On Sun, 8 May 2011, Linus Torvalds wrote:

> On Sun, May 8, 2011 at 11:55 AM, Mikulas Patocka
> <mikulas@artax.karlin.mff.cuni.cz> wrote:
> >
> > This patch fixes lvm2 on PA-RISC (and possibly other architectures with
> > up-growing stack). lvm2 calculates the number of used pages when locking
> > and when unlocking and reports an internal error if the numbers mismatch.
> 
> This patch won't apply on current kernels (including stable) because
> of commit a1fde08c74e9 that changed the test of "pages" to instead
> just test "flags & FOLL_MLOCK".
> 
> That should be trivial to fix up.
> 
> However, I really don't much like this complex test:
> 
> >  static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
> >  {
> > -       return (vma->vm_flags & VM_GROWSDOWN) &&
> > +       return ((vma->vm_flags & VM_GROWSDOWN) &&
> >                (vma->vm_start == addr) &&
> > -               !vma_stack_continue(vma->vm_prev, addr);
> > +               !vma_stack_continue(vma->vm_prev, addr)) ||
> > +              ((vma->vm_flags & VM_GROWSUP) &&
> > +               (vma->vm_end == addr + PAGE_SIZE) &&
> > +               !vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE));
> >  }
> 
> in that format. It gets really hard to read, and I think you'd be
> better off writing it as two helper functions (or macros) for the two
> cases, and then have
> 
>   static inline int stack_guard_page(struct vm_area_struct *vma,
> unsigned long addr)
>   {
>     return stack_guard_page_growsdown(vma, addr) ||
>       stack_guard_page_growsup(vma, addr);
>   }
> 
> I'd also like to verify that it doesn't actually generate any extra
> code for the common case (iirc VM_GROWSUP is 0 for the architectures
> that don't need it, and so the compiler shouldn't generate any extra
> code, but I'd like that mentioned and verified explicitly).
> 
> Hmm?
> 
> Other than that it looks ok to me.
> 
> That said, could we please fix LVM to not do that crazy sh*t in the
> first place? The STACK_GROWSUP case is never going to have a lot of
> testing, this is just sad.

LVM reads process maps from /proc/self/maps and locks them with mlock.

Why it doesn't use mlockall()? Because glibc maps all locales to the 
process. Glibc packs all locales to a 100MB file and maps that file to 
every process. Even if the process uses just one locale, glibc maps all.

So, when LVM used mlockall, it consumed >100MB memory and it caused 
out-of-memory problems in system installers.

So, alternate way of locking was added to LVM --- read all maps and lock 
them, except for the glibc locale file.

The real fix would be to fix glibc not to map 100MB to every process.

>                    Linus
> 

This is updated patch. I tested it on x86-64 and it doesn't change 
generated code.

Mikulas

---

Don't lock guardpage if the stack is growing up

Linux kernel excludes guard page when performing mlock on a VMA with
down-growing stack. However, some architectures have up-growing stack
and locking the guard page should be excluded in this case too.

This patch fixes lvm2 on PA-RISC (and possibly other architectures with
up-growing stack). lvm2 calculates number of used pages when locking and
when unlocking and reports an internal error if the numbers mismatch.

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>

---
 include/linux/mm.h |   10 +++++++++-
 mm/memory.c        |   21 ++++++++++++++++++---
 2 files changed, 27 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2011-05-09 12:33:50.000000000 +0200
+++ linux-2.6/include/linux/mm.h	2011-05-09 12:42:05.000000000 +0200
@@ -1011,11 +1011,19 @@ int set_page_dirty_lock(struct page *pag
 int clear_page_dirty_for_io(struct page *page);
 
 /* Is the vma a continuation of the stack vma above it? */
-static inline int vma_stack_continue(struct vm_area_struct *vma, unsigned long addr)
+static inline int vma_stack_growsdown_continue(struct vm_area_struct *vma,
+					       unsigned long addr)
 {
 	return vma && (vma->vm_end == addr) && (vma->vm_flags & VM_GROWSDOWN);
 }
 
+/* Is the vma a continuation of the stack vma below it? */
+static inline int vma_stack_growsup_continue(struct vm_area_struct *vma,
+					     unsigned long addr)
+{
+	return vma && (vma->vm_start == addr) && (vma->vm_flags & VM_GROWSUP);
+}
+
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2011-05-09 12:33:51.000000000 +0200
+++ linux-2.6/mm/memory.c	2011-05-09 12:41:38.000000000 +0200
@@ -1410,11 +1410,21 @@ no_page_table:
 	return page;
 }
 
-static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
+static inline int stack_guard_page_growsdown(struct vm_area_struct *vma,
+					     unsigned long addr)
 {
 	return (vma->vm_flags & VM_GROWSDOWN) &&
 		(vma->vm_start == addr) &&
-		!vma_stack_continue(vma->vm_prev, addr);
+		!vma_stack_growsdown_continue(vma->vm_prev, addr);
+}
+
+
+static inline int stack_guard_page_growsup(struct vm_area_struct *vma,
+					   unsigned long addr)
+{
+	return (vma->vm_flags & VM_GROWSUP) &&
+		(vma->vm_end == addr + PAGE_SIZE) &&
+		!vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE);
 }
 
 /**
@@ -1554,13 +1564,18 @@ int __get_user_pages(struct task_struct 
 		/*
 		 * For mlock, just skip the stack guard page.
 		 */
-		if ((gup_flags & FOLL_MLOCK) && stack_guard_page(vma, start))
+		if ((gup_flags & FOLL_MLOCK) &&
+		    stack_guard_page_growsdown(vma, start))
 			goto next_page;
 
 		do {
 			struct page *page;
 			unsigned int foll_flags = gup_flags;
 
+			if ((gup_flags & FOLL_MLOCK) &&
+			    stack_guard_page_growsup(vma, start))
+				goto next_page;
+
 			/*
 			 * If we have a pending SIGKILL, don't keep faulting
 			 * pages and potentially allocating memory.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-08 20:12 ` Hugh Dickins
@ 2011-05-09 11:12   ` Mikulas Patocka
  2011-05-09 15:57     ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: Mikulas Patocka @ 2011-05-09 11:12 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-kernel, linux-parisc, Linus Torvalds, Michel Lespinasse,
	Oleg Nesterov

On Sun, 8 May 2011, Hugh Dickins wrote:

> On Sun, 8 May 2011, Mikulas Patocka wrote:
> 
> > Don't mlock guardpage if the stack is growing up
> > 
> > Linux kernel excludes guard page when performing mlock on a VMA with 
> > down-growing stack. However, some architectures have up-growing stack and 
> > locking the guard page should be excluded in this case too.
> > 
> > This patch fixes lvm2 on PA-RISC (and possibly other architectures with 
> > up-growing stack). lvm2 calculates the number of used pages when locking 
> > and when unlocking and reports an internal error if the numbers mismatch. 
> > On PA-RISC, the kernel would incorrectly attempt to mlock the stack guard 
> > page, this causes allocation of one more page and internal error in lvm2.
> > 
> > Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
> 
> Interesting, I'd convinced myself that the growsup case was safe,
> because of how we always approach the vma from its bottom end.
> 
> I've added Michel to the Cc, he's the one with the best grasp here.
> 
> Could you point us to where lvm2 is making these calculations?
> I don't understand quite what it's doing.
> 
> Thanks,
> Hugh

See ./lib/mm/memlock.c in LVM2. It reads /proc/self/maps, parses the file 
and locks each map with mlock, except for glibc locale file.

It calculates how much memory it took when locking and unlocking and 
prints an internal error if the numbers differ. The internal error 
normally means that there was some memory allocated while it was locked 
(that is wrong).

However, on up-growing stack, the internal error is always triggered, 
because mlock() of the stack touches the guard page and allocates one more 
page.

Mikulas

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-09 11:01   ` Mikulas Patocka
@ 2011-05-09 11:43     ` Zdenek Kabelac
  2011-05-09 21:08       ` Alasdair G Kergon
  2011-05-09 22:45       ` Matthew Wilcox
  0 siblings, 2 replies; 25+ messages in thread
From: Zdenek Kabelac @ 2011-05-09 11:43 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Linus Torvalds, linux-kernel, linux-parisc, Hugh Dickins,
	Oleg Nesterov, agk

Dne 9.5.2011 13:01, Mikulas Patocka napsal(a):
> 
> 
> On Sun, 8 May 2011, Linus Torvalds wrote:
> 
>> On Sun, May 8, 2011 at 11:55 AM, Mikulas Patocka
>> <mikulas@artax.karlin.mff.cuni.cz> wrote:
>>>
>>> This patch fixes lvm2 on PA-RISC (and possibly other architectures with
>>> up-growing stack). lvm2 calculates the number of used pages when locking
>>> and when unlocking and reports an internal error if the numbers mismatch.
>>
>> This patch won't apply on current kernels (including stable) because
>> of commit a1fde08c74e9 that changed the test of "pages" to instead
>> just test "flags & FOLL_MLOCK".
>>
>> That should be trivial to fix up.
>>
>> However, I really don't much like this complex test:
>>
>>>  static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
>>>  {
>>> -       return (vma->vm_flags & VM_GROWSDOWN) &&
>>> +       return ((vma->vm_flags & VM_GROWSDOWN) &&
>>>                (vma->vm_start == addr) &&
>>> -               !vma_stack_continue(vma->vm_prev, addr);
>>> +               !vma_stack_continue(vma->vm_prev, addr)) ||
>>> +              ((vma->vm_flags & VM_GROWSUP) &&
>>> +               (vma->vm_end == addr + PAGE_SIZE) &&
>>> +               !vma_stack_growsup_continue(vma->vm_next, addr + PAGE_SIZE));
>>>  }
>>
>> in that format. It gets really hard to read, and I think you'd be
>> better off writing it as two helper functions (or macros) for the two
>> cases, and then have
>>
>>   static inline int stack_guard_page(struct vm_area_struct *vma,
>> unsigned long addr)
>>   {
>>     return stack_guard_page_growsdown(vma, addr) ||
>>       stack_guard_page_growsup(vma, addr);
>>   }
>>
>> I'd also like to verify that it doesn't actually generate any extra
>> code for the common case (iirc VM_GROWSUP is 0 for the architectures
>> that don't need it, and so the compiler shouldn't generate any extra
>> code, but I'd like that mentioned and verified explicitly).
>>
>> Hmm?
>>
>> Other than that it looks ok to me.
>>
>> That said, could we please fix LVM to not do that crazy sh*t in the
>> first place? The STACK_GROWSUP case is never going to have a lot of
>> testing, this is just sad.
> 
> LVM reads process maps from /proc/self/maps and locks them with mlock.
> 
> Why it doesn't use mlockall()? Because glibc maps all locales to the 
> process. Glibc packs all locales to a 100MB file and maps that file to 
> every process. Even if the process uses just one locale, glibc maps all.
> 
> So, when LVM used mlockall, it consumed >100MB memory and it caused 
> out-of-memory problems in system installers.
> 
> So, alternate way of locking was added to LVM --- read all maps and lock 
> them, except for the glibc locale file.
> 
> The real fix would be to fix glibc not to map 100MB to every process.
> 

I should add here probably few words.

Glibc knows few more ways around - so it could work only with one locale file
per language, or even without using mmap and allocating them in memory.
Depends on the distribution usually - Fedora decided to combine all locales
into one huge file (>100MB) - Ubuntu/Debian mmaps each locales individually
(usually ~MB)

LVM support both ways - either user may select in lvm.conf to always use
mlockall, or he may switch to use mlock mapping of individual memory areas
where those memory parts, that cannot be executed during suspend state and
cannot cause memory deadlock, are not locked into memory. As a 'bonus' it's
internally used for tracking algorithmic bugs.

Zdenek

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-09 11:12   ` Mikulas Patocka
@ 2011-05-09 15:57     ` Linus Torvalds
  2011-05-09 22:07       ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2011-05-09 15:57 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Hugh Dickins, linux-kernel, linux-parisc, Michel Lespinasse,
	Oleg Nesterov

On Mon, May 9, 2011 at 4:12 AM, Mikulas Patocka
<mikulas@artax.karlin.mff.cuni.cz> wrote:
>
> See ./lib/mm/memlock.c in LVM2. It reads /proc/self/maps, parses the file
> and locks each map with mlock, except for glibc locale file.

Hmm. One thing that strikes me is this problem also implies that the
/proc/self/maps file is wrong for the GROWSUP case, isn't it?

So I think we should not just apply your lock fix, but then *also*
apply something like this:

    diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
    index 2e7addfd9803..080980880c7f 100644
    --- a/fs/proc/task_mmu.c
    +++ b/fs/proc/task_mmu.c
    @@ -214,7 +214,7 @@ static void show_map_vma(struct seq_file *m,
struct vm_area_struct *vma)
     	int flags = vma->vm_flags;
     	unsigned long ino = 0;
     	unsigned long long pgoff = 0;
    -	unsigned long start;
    +	unsigned long start, end;
     	dev_t dev = 0;
     	int len;

    @@ -227,13 +227,16 @@ static void show_map_vma(struct seq_file *m,
struct vm_area_struct *vma)

     	/* We don't show the stack guard page in /proc/maps */
     	start = vma->vm_start;
    -	if (vma->vm_flags & VM_GROWSDOWN)
    -		if (!vma_stack_continue(vma->vm_prev, vma->vm_start))
    -			start += PAGE_SIZE;
    +	if (stack_guard_page_growsdown(vma, start))
    +		start += PAGE_SIZE;
    +	end = vma->vm_end;
    +	if (stack_guard_page_growsup(vma, end))
    +		end -= PAGE_SIZE;
    +

     	seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu %n",
     			start,
    -			vma->vm_end,
    +			end,
     			flags & VM_READ ? 'r' : '-',
     			flags & VM_WRITE ? 'w' : '-',
     			flags & VM_EXEC ? 'x' : '-',

NOTE! The above actually assumes that your
"stack_guard_page_growsup()" has been changed to actually take the
"next page" value, which I think makes more sense (ie it's the "end of
stack", the same way "stack_guard_page_growsdown()" address is).

Hmm? I don't have any growsup machine to test with, can you do that?

                        Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-09 11:43     ` Zdenek Kabelac
@ 2011-05-09 21:08       ` Alasdair G Kergon
  2011-05-09 22:45       ` Matthew Wilcox
  1 sibling, 0 replies; 25+ messages in thread
From: Alasdair G Kergon @ 2011-05-09 21:08 UTC (permalink / raw)
  To: Zdenek Kabelac, Mikulas Patocka, Linus Torvalds, linux-kernel,
	linux-parisc, Hugh Dickins, Oleg Nesterov, agk

On Mon, May 09, 2011 at 01:43:59PM +0200, Zdenek Kabelac wrote:
> Dne 9.5.2011 13:01, Mikulas Patocka napsal(a):
> > So, when LVM used mlockall, it consumed >100MB memory and it caused 
> > out-of-memory problems in system installers.

The way glibc and some distros have implemented locales makes mlockall()
quite useless for many purposes now in practice IMHO.  We discussed the
options at the time and wrote the extra code to implement the only
solution we were offered.

(1) There was no possibility of some distros not using a single locale
database (reasons: optimisation and packaging convenience);
(2) there was no possibility of glibc providing a mechanism to unmap the
locale database (if you reset your locale to the default glibc will not
unmap the file and indeed doesn't rule out doing something similar with
other files in future - these are glibc internals and it's none of our
business).

Alasdair



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-09 15:57     ` Linus Torvalds
@ 2011-05-09 22:07       ` Linus Torvalds
  2011-05-09 22:19         ` James Bottomley
  2011-05-09 22:26         ` Mikulas Patocka
  0 siblings, 2 replies; 25+ messages in thread
From: Linus Torvalds @ 2011-05-09 22:07 UTC (permalink / raw)
  To: Mikulas Patocka, Tony Luck, Fenghua Yu
  Cc: Hugh Dickins, linux-kernel, linux-parisc, Michel Lespinasse,
	Oleg Nesterov, linux-ia64

[-- Attachment #1: Type: text/plain, Size: 2521 bytes --]

On Mon, May 9, 2011 at 8:57 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Hmm. One thing that strikes me is this problem also implies that the
> /proc/self/maps file is wrong for the GROWSUP case, isn't it?
>
> So I think we should not just apply your lock fix, but then *also*
> apply something like this:

Actually, I think we might be better off with something like this.

It makes a few more changes:

 - move the stack guard page checking in __get_user_pages() into the
rare case (ie we didn't find a page), since that's the only case we
care about (the thing about the guard page is that don't want to call
"handle_mm_fault()"). As a result, it's off any path where we can
possibly care about performance, so we might as well have a nice
helper function for both the grow-up and grow-down cases, instead of
trying to be clever and only look at the grow-down case for the first
page in the vma like you did in your patch.

   End result: simpler, more straightforward code.

 - Move the growsup/down helper functions to <linux/mm.h>, since the
/proc code really wants to use them too. That means that the
"vma_stack_continue()" function (which now got split up into two
cases, for the up/down cases) is now entirely just an internal helper
function - nobody else uses it, and the real interface are the
"stack_guard_page_xyz()"  functions. Renamed to be simpler.

 - changed that naming of those stack_guard_page functions to use
_start and _end instead of growsup/growsdown, since it actually takes
the start or the end of the page as the argument (to match the
semantics of the afore-mentioned helpers)

 - and finally, make /proc/<pid>/maps use these helpers for both the
up/down case, so now /proc/self/maps should work well for the growsup
case too.

Hmm?

The only oddish case is IA64 that actually has a stack that grows
*both* up and down. That means that I could make up a stack mapping
that has a single virtual page in it, that is both the start *and* the
end page. Now /proc/self/maps would actually show such a mapping with
"negative" size. That's interesting.

It would be easy enough to have a "if (end < start) end = start" there
for that case, but maybe it's actually interesting information.

Regardless, I'd like to hear whether this patch really does work on
PA-RISC and especially IA64. I think those are the only cases that
have a GROWSUP stack. And the IA64 case that supports both is the most
interesting, everybody else does just one or the other.

                    Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 3827 bytes --]

 fs/proc/task_mmu.c |   12 +++++++-----
 include/linux/mm.h |   24 +++++++++++++++++++++++-
 mm/memory.c        |   16 +++++++---------
 3 files changed, 37 insertions(+), 15 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2e7addfd9803..318d8654989b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -214,7 +214,7 @@ static void show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 	int flags = vma->vm_flags;
 	unsigned long ino = 0;
 	unsigned long long pgoff = 0;
-	unsigned long start;
+	unsigned long start, end;
 	dev_t dev = 0;
 	int len;
 
@@ -227,13 +227,15 @@ static void show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 
 	/* We don't show the stack guard page in /proc/maps */
 	start = vma->vm_start;
-	if (vma->vm_flags & VM_GROWSDOWN)
-		if (!vma_stack_continue(vma->vm_prev, vma->vm_start))
-			start += PAGE_SIZE;
+	if (stack_guard_page_start(vma, start))
+		start += PAGE_SIZE;
+	end = vma->vm_end;
+	if (stack_guard_page_end(vma, end))
+		end -= PAGE_SIZE;
 
 	seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu %n",
 			start,
-			vma->vm_end,
+			end,
 			flags & VM_READ ? 'r' : '-',
 			flags & VM_WRITE ? 'w' : '-',
 			flags & VM_EXEC ? 'x' : '-',
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2348db26bc3d..6507dde38b16 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1011,11 +1011,33 @@ int set_page_dirty_lock(struct page *page);
 int clear_page_dirty_for_io(struct page *page);
 
 /* Is the vma a continuation of the stack vma above it? */
-static inline int vma_stack_continue(struct vm_area_struct *vma, unsigned long addr)
+static inline int vma_growsdown(struct vm_area_struct *vma, unsigned long addr)
 {
 	return vma && (vma->vm_end == addr) && (vma->vm_flags & VM_GROWSDOWN);
 }
 
+static inline int stack_guard_page_start(struct vm_area_struct *vma,
+					     unsigned long addr)
+{
+	return (vma->vm_flags & VM_GROWSDOWN) &&
+		(vma->vm_start == addr) &&
+		!vma_growsdown(vma->vm_prev, addr);
+}
+
+/* Is the vma a continuation of the stack vma below it? */
+static inline int vma_growsup(struct vm_area_struct *vma, unsigned long addr)
+{
+	return vma && (vma->vm_start == addr) && (vma->vm_flags & VM_GROWSUP);
+}
+
+static inline int stack_guard_page_end(struct vm_area_struct *vma,
+					   unsigned long addr)
+{
+	return (vma->vm_flags & VM_GROWSUP) &&
+		(vma->vm_end == addr) &&
+		!vma_growsup(vma->vm_next, addr);
+}
+
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len);
diff --git a/mm/memory.c b/mm/memory.c
index 27f425378112..61e66f026563 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1412,9 +1412,8 @@ no_page_table:
 
 static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
 {
-	return (vma->vm_flags & VM_GROWSDOWN) &&
-		(vma->vm_start == addr) &&
-		!vma_stack_continue(vma->vm_prev, addr);
+	return stack_guard_page_start(vma, addr) ||
+	       stack_guard_page_end(vma, addr+PAGE_SIZE);
 }
 
 /**
@@ -1551,12 +1550,6 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			continue;
 		}
 
-		/*
-		 * For mlock, just skip the stack guard page.
-		 */
-		if ((gup_flags & FOLL_MLOCK) && stack_guard_page(vma, start))
-			goto next_page;
-
 		do {
 			struct page *page;
 			unsigned int foll_flags = gup_flags;
@@ -1573,6 +1566,11 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				int ret;
 				unsigned int fault_flags = 0;
 
+				/* For mlock, just skip the stack guard page. */
+				if (foll_flags & FOLL_MLOCK) {
+					if (stack_guard_page(vma, start))
+						goto next_page;
+				}
 				if (foll_flags & FOLL_WRITE)
 					fault_flags |= FAULT_FLAG_WRITE;
 				if (nonblocking)

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-09 22:07       ` Linus Torvalds
@ 2011-05-09 22:19         ` James Bottomley
  2011-05-09 22:31           ` Linus Torvalds
  2011-05-09 22:26         ` Mikulas Patocka
  1 sibling, 1 reply; 25+ messages in thread
From: James Bottomley @ 2011-05-09 22:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mikulas Patocka, Tony Luck, Fenghua Yu, Hugh Dickins,
	linux-kernel, linux-parisc, Michel Lespinasse, Oleg Nesterov,
	linux-ia64

On Mon, 2011-05-09 at 15:07 -0700, Linus Torvalds wrote:
> Regardless, I'd like to hear whether this patch really does work on
> PA-RISC and especially IA64. I think those are the only cases that
> have a GROWSUP stack. And the IA64 case that supports both is the most
> interesting, everybody else does just one or the other.

So I can test the patch, if you tell me how.  I don't use lvm2, so it
would have to be a simple test case.

James



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-09 22:07       ` Linus Torvalds
  2011-05-09 22:19         ` James Bottomley
@ 2011-05-09 22:26         ` Mikulas Patocka
  2011-05-15 22:18           ` Mikulas Patocka
  1 sibling, 1 reply; 25+ messages in thread
From: Mikulas Patocka @ 2011-05-09 22:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-parisc, linux-ia64



On Mon, 9 May 2011, Linus Torvalds wrote:

> On Mon, May 9, 2011 at 8:57 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Hmm. One thing that strikes me is this problem also implies that the
> > /proc/self/maps file is wrong for the GROWSUP case, isn't it?
> >
> > So I think we should not just apply your lock fix, but then *also*
> > apply something like this:
> 
> Actually, I think we might be better off with something like this.
> 
> It makes a few more changes:
> 
>  - move the stack guard page checking in __get_user_pages() into the
> rare case (ie we didn't find a page), since that's the only case we
> care about (the thing about the guard page is that don't want to call
> "handle_mm_fault()"). As a result, it's off any path where we can
> possibly care about performance, so we might as well have a nice
> helper function for both the grow-up and grow-down cases, instead of
> trying to be clever and only look at the grow-down case for the first
> page in the vma like you did in your patch.
> 
>    End result: simpler, more straightforward code.
> 
>  - Move the growsup/down helper functions to <linux/mm.h>, since the
> /proc code really wants to use them too. That means that the
> "vma_stack_continue()" function (which now got split up into two
> cases, for the up/down cases) is now entirely just an internal helper
> function - nobody else uses it, and the real interface are the
> "stack_guard_page_xyz()"  functions. Renamed to be simpler.
> 
>  - changed that naming of those stack_guard_page functions to use
> _start and _end instead of growsup/growsdown, since it actually takes
> the start or the end of the page as the argument (to match the
> semantics of the afore-mentioned helpers)
> 
>  - and finally, make /proc/<pid>/maps use these helpers for both the
> up/down case, so now /proc/self/maps should work well for the growsup
> case too.
> 
> Hmm?
> 
> The only oddish case is IA64 that actually has a stack that grows
> *both* up and down. That means that I could make up a stack mapping
> that has a single virtual page in it, that is both the start *and* the
> end page. Now /proc/self/maps would actually show such a mapping with
> "negative" size. That's interesting.
> 
> It would be easy enough to have a "if (end < start) end = start" there
> for that case, but maybe it's actually interesting information.
> 
> Regardless, I'd like to hear whether this patch really does work on
> PA-RISC and especially IA64. I think those are the only cases that
> have a GROWSUP stack. And the IA64 case that supports both is the most
> interesting, everybody else does just one or the other.
> 
>                     Linus

I will test it after a week, now I'm traveling away.

Mikulas

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-09 22:19         ` James Bottomley
@ 2011-05-09 22:31           ` Linus Torvalds
  2011-05-09 22:53             ` Tony Luck
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2011-05-09 22:31 UTC (permalink / raw)
  To: James Bottomley
  Cc: Mikulas Patocka, Tony Luck, Fenghua Yu, Hugh Dickins,
	linux-kernel, linux-parisc, Michel Lespinasse, Oleg Nesterov,
	linux-ia64

[-- Attachment #1: Type: text/plain, Size: 506 bytes --]

On Mon, May 9, 2011 at 3:19 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> So I can test the patch, if you tell me how.  I don't use lvm2, so it
> would have to be a simple test case.

Something like the attached. Run it as root (it needs root just for
the mlockall), and see whether the stack it shows changes.

With current kernels, I think the stack expands by one page during the
mlockall (for STACK_GROWSUP), with the patch it shouldn't.

                   Linus

[-- Attachment #2: t.c --]
[-- Type: text/x-csrc, Size: 400 bytes --]

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void show_stack(void)
{
	FILE *map;
	char buf[1000];

	map = fopen("/proc/self/maps", "r");
	if (!map)
		return;

	while (fgets(buf, 1000, map)) {
		if (!strstr(buf, "[stack]"))
			continue;
		fputs(buf, stdout);
	}

	fclose(map);
}

int main(void)
{
	show_stack();
	mlockall(MCL_CURRENT | MCL_FUTURE);
	show_stack();
	return 0;
}

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-09 11:43     ` Zdenek Kabelac
  2011-05-09 21:08       ` Alasdair G Kergon
@ 2011-05-09 22:45       ` Matthew Wilcox
  2011-05-09 22:56         ` Linus Torvalds
  1 sibling, 1 reply; 25+ messages in thread
From: Matthew Wilcox @ 2011-05-09 22:45 UTC (permalink / raw)
  To: Zdenek Kabelac
  Cc: Mikulas Patocka, Linus Torvalds, linux-kernel, linux-parisc,
	Hugh Dickins, Oleg Nesterov, agk

On Mon, May 09, 2011 at 01:43:59PM +0200, Zdenek Kabelac wrote:
> > Why it doesn't use mlockall()? Because glibc maps all locales to the 
> > process. Glibc packs all locales to a 100MB file and maps that file to 
> > every process. Even if the process uses just one locale, glibc maps all.
> > 
> > So, when LVM used mlockall, it consumed >100MB memory and it caused 
> > out-of-memory problems in system installers.
> > 
> > So, alternate way of locking was added to LVM --- read all maps and lock 
> > them, except for the glibc locale file.
> > 
> > The real fix would be to fix glibc not to map 100MB to every process.
> 
> I should add here probably few words.
> 
> Glibc knows few more ways around - so it could work only with one locale file
> per language, or even without using mmap and allocating them in memory.
> Depends on the distribution usually - Fedora decided to combine all locales
> into one huge file (>100MB) - Ubuntu/Debian mmaps each locales individually
> (usually ~MB)

Sounds to me like glibc should introduce an mlockmost() call that does all
the work for you ...

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-09 22:31           ` Linus Torvalds
@ 2011-05-09 22:53             ` Tony Luck
  2011-05-09 22:58               ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: Tony Luck @ 2011-05-09 22:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: James Bottomley, Mikulas Patocka, Fenghua Yu, Hugh Dickins,
	linux-kernel, linux-parisc, Michel Lespinasse, Oleg Nesterov,
	linux-ia64

On Mon, May 9, 2011 at 3:31 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> With current kernels, I think the stack expands by one page during the
> mlockall (for STACK_GROWSUP), with the patch it shouldn't.

Tried this on ia64 (with a mod because the upward growing stack isn't
blessed with
the [stack] annotation, only the downward growing stack gets that honour).

ia64 builds, boots, and processes can still grow stacks (both of them).  The
patched kernel doesn't change the size of the upwardly growing stack across
the mlockall().

-Tony

P.S. while we could start both stacks on the same page and have the grow
away from the start point, ia64 actually starts them out a fair distance apart
and lets them run into each other (if you have enough memory to let them
grow that far, and if ulimit -s doesn't stop them earlier)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-09 22:45       ` Matthew Wilcox
@ 2011-05-09 22:56         ` Linus Torvalds
  2011-05-10 22:57           ` Alasdair G Kergon
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2011-05-09 22:56 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zdenek Kabelac, Mikulas Patocka, linux-kernel, linux-parisc,
	Hugh Dickins, Oleg Nesterov, agk

On Mon, May 9, 2011 at 3:45 PM, Matthew Wilcox <matthew@wil.cx> wrote:
>
> Sounds to me like glibc should introduce an mlockmost() call that does all
> the work for you ...

That sounds like the worst of all worlds. Nobody will use it, now
there's two places to break, and how do you describe what you don't
want mlocked?

I dunno. There are so few applications that use mlock at *all* that I
wonder if it's even worth worrying about. And this one case we've
hopefully now fixed anyway.

But if people really want to fix mlockall(), I'd suggest some way of
just marking certain mappings as "sparse mappings", and then we can
just teach mlockall to not lock them. And then glibc could just mark
its locale archive that way - and maybe others would mark other
mappings.

We could still make such a "sparse mapping" act as locked for the
actual pages that are mapped into it - so it would have some kind of
real semantics. You could do a "mlockall(..)" on the process, and then
touch the sparse pages you know you want, and they'd be guaranteed to
be mapped after that.

We might even have a "mlockall(MCL_SPARSE)" flag that does that for
everything - basically guarantee that "whatever is mapped in now will
remain mapped in, but I won't bother paging it in just because you ask
me to do a mlockall".

So there are sane semantics for the concept, and it would be easy to
do in the kernel. Whether it's worth doing, I dunno.

                    Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-09 22:53             ` Tony Luck
@ 2011-05-09 22:58               ` Linus Torvalds
  2011-05-09 23:08                 ` Tony Luck
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2011-05-09 22:58 UTC (permalink / raw)
  To: Tony Luck
  Cc: James Bottomley, Mikulas Patocka, Fenghua Yu, Hugh Dickins,
	linux-kernel, linux-parisc, Michel Lespinasse, Oleg Nesterov,
	linux-ia64

On Mon, May 9, 2011 at 3:53 PM, Tony Luck <tony.luck@gmail.com> wrote:
>
> P.S. while we could start both stacks on the same page and have the grow
> away from the start point, ia64 actually starts them out a fair distance apart
> and lets them run into each other (if you have enough memory to let them
> grow that far, and if ulimit -s doesn't stop them earlier)

Ahh, so you never actually have one single mapping that has both flags set?

In that case, I won't even worry about it.

One thing I did want to verify: did the mlockall() actually change the
stack size without that patch? Just to double-check that the patch
actually did change semantics visibly.

                 Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-09 22:58               ` Linus Torvalds
@ 2011-05-09 23:08                 ` Tony Luck
  2011-05-09 23:17                   ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: Tony Luck @ 2011-05-09 23:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: James Bottomley, Mikulas Patocka, Fenghua Yu, Hugh Dickins,
	linux-kernel, linux-parisc, Michel Lespinasse, Oleg Nesterov,
	linux-ia64

On Mon, May 9, 2011 at 3:58 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Ahh, so you never actually have one single mapping that has both flags set?
>
> In that case, I won't even worry about it.

Definitely not for normal processes - I'm not sure how both stacks are
set up for threads.

> One thing I did want to verify: did the mlockall() actually change the
> stack size without that patch? Just to double-check that the patch
> actually did change semantics visibly.

On an unpatched system I see this (lots more than one page of growth -
pages are 64K on this config):
6007fffffff50000-6007fffffff70000 rw-p 00000000 00:00 0
6007fffffff50000-6008000000750000 rw-p 00000000 00:00 0

On a patched system I see (this one has 16K pages - no growth)
600007ffff9d0000-600007ffff9d4000 rw-p 00000000 00:00 0
600007ffff9d0000-600007ffff9d4000 rw-p 00000000 00:00 0

-Tony

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-09 23:08                 ` Tony Luck
@ 2011-05-09 23:17                   ` Linus Torvalds
  2011-05-09 23:25                     ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2011-05-09 23:17 UTC (permalink / raw)
  To: Tony Luck
  Cc: James Bottomley, Mikulas Patocka, Fenghua Yu, Hugh Dickins,
	linux-kernel, linux-parisc, Michel Lespinasse, Oleg Nesterov,
	linux-ia64

On Mon, May 9, 2011 at 4:08 PM, Tony Luck <tony.luck@gmail.com> wrote:
>
> Definitely not for normal processes - I'm not sure how both stacks are
> set up for threads.

We don't actually allow user space to set the growsup/growsdown bits
any more (we have PROT_GROWSUP and PROT_GROWSDOWN, but that is to
allow mprotect to not give an exact range, but say "apply this to the
end of a growsup/growsdown segment").

So the only thing that has those bits are things that the kernel sets
explicitly at exec time. So if ia64 doesn't set it, we're all good.

>> One thing I did want to verify: did the mlockall() actually change the
>> stack size without that patch? Just to double-check that the patch
>> actually did change semantics visibly.
>
> On an unpatched system I see this (lots more than one page of growth -
> pages are 64K on this config):
> 6007fffffff50000-6007fffffff70000 rw-p 00000000 00:00 0
> 6007fffffff50000-6008000000750000 rw-p 00000000 00:00 0
>
> On a patched system I see (this one has 16K pages - no growth)
> 600007ffff9d0000-600007ffff9d4000 rw-p 00000000 00:00 0
> 600007ffff9d0000-600007ffff9d4000 rw-p 00000000 00:00 0

Ok, I'll consider it tested. I'll commit it with Mikulas as author,
but note that I edited it so he won't get the blame if there's some
problem.

                             Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-09 23:17                   ` Linus Torvalds
@ 2011-05-09 23:25                     ` Linus Torvalds
  2011-05-10  4:12                       ` James Bottomley
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2011-05-09 23:25 UTC (permalink / raw)
  To: Tony Luck
  Cc: James Bottomley, Mikulas Patocka, Fenghua Yu, Hugh Dickins,
	linux-kernel, linux-parisc, Michel Lespinasse, Oleg Nesterov,
	linux-ia64

On Mon, May 9, 2011 at 4:17 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Ok, I'll consider it tested. I'll commit it with Mikulas as author,
> but note that I edited it so he won't get the blame if there's some
> problem.

Oh, and I marked it for stable too, although I don't know if any
distribution really cares about parisc or ia64. And I'm not sure that
ia64 even saw the lvm2 failure case - I'd have expected to hear about
it if it actually happens there.

                       Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-09 23:25                     ` Linus Torvalds
@ 2011-05-10  4:12                       ` James Bottomley
  0 siblings, 0 replies; 25+ messages in thread
From: James Bottomley @ 2011-05-10  4:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tony Luck, Mikulas Patocka, Fenghua Yu, Hugh Dickins,
	linux-kernel, linux-parisc, Michel Lespinasse, Oleg Nesterov,
	linux-ia64

On Mon, 2011-05-09 at 16:25 -0700, Linus Torvalds wrote:
> On Mon, May 9, 2011 at 4:17 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Ok, I'll consider it tested. I'll commit it with Mikulas as author,
> > but note that I edited it so he won't get the blame if there's some
> > problem.
> 
> Oh, and I marked it for stable too, although I don't know if any
> distribution really cares about parisc or ia64. And I'm not sure that
> ia64 even saw the lvm2 failure case - I'd have expected to hear about
> it if it actually happens there.

Well, it's a done deal, but here's the proof on parisc too:

Before:

c0266000-c0289000 rwxp 00000000 00:00 0                                  [stack]
c0266000-c0a66000 rwxp 00000000 00:00 0                                  [stack]

After:

bffee000-c0010000 rwxp 00000000 00:00 0                                  [stack]
bffee000-c0010000 rwxp 00000000 00:00 0                                  [stack]

James



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-09 22:56         ` Linus Torvalds
@ 2011-05-10 22:57           ` Alasdair G Kergon
  2011-05-11  8:42             ` Milan Broz
  0 siblings, 1 reply; 25+ messages in thread
From: Alasdair G Kergon @ 2011-05-10 22:57 UTC (permalink / raw)
  To: Linus Torvalds, Matthew Wilcox, Zdenek Kabelac, Mikulas Patocka,
	linux-kernel, linux-parisc, Hugh Dickins, Oleg Nesterov, agk

On Mon, May 09, 2011 at 03:56:11PM -0700, Linus Torvalds wrote:
> So there are sane semantics for the concept, and it would be easy to
> do in the kernel. Whether it's worth doing, I dunno.
 
At this point we have a workaround that seems to work for us.  I find it
ugly and it has needed some tweaking already but we can cope.

If others find similar problems to ours and start replicating the logic
we're using, then that would be the time to give serious thought to a
clean 'sparse' extension.  (Maybe marking individual mappings as MAP_SPARSE
and adding a MCL_NO_SPARSE option to ignore them sounds most promising to me.)

(What other software packages make use of mlockall() and under what
circumstances?)

Alasdair


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-10 22:57           ` Alasdair G Kergon
@ 2011-05-11  8:42             ` Milan Broz
  2011-05-12  2:12               ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: Milan Broz @ 2011-05-11  8:42 UTC (permalink / raw)
  To: Alasdair G Kergon, Linus Torvalds, Matthew Wilcox,
	Zdenek Kabelac, Mikulas Patocka, linux-kernel, linux-parisc,
	Hugh Dickins, Oleg Nesterov

On 05/11/2011 12:57 AM, Alasdair G Kergon wrote:
> (What other software packages make use of mlockall() and under what
> circumstances?)

Another one is cryptsetup for commands which manipulate with keys
in memory.
(Users of libcryptetup library are not forced to lock memory, it is optional
call. But cryptsetup itself as libcryptsetup library user always locks memory.)

And I am not happy with mlockall() as well but the lvm2 workaround
is quite complicated.

Basically it wants to lock memory with explicitly allocated keys
(this can be rewritten to use specific locked page though) but it
also need to lock various libdevmapper buffers when issuing dmcrypt
cfg ioctl (mapping table and messages contains key). So that's why
mlockall(MCL_CURRENT|MCL_FUTURE) was the simplest way (and no problems
reported yet).
(No that it is perfect but better than nothing... Of course
more important is to wipe memory with keys after use.)

Milan

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-11  8:42             ` Milan Broz
@ 2011-05-12  2:12               ` Linus Torvalds
  2011-05-12  9:06                 ` Zdenek Kabelac
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2011-05-12  2:12 UTC (permalink / raw)
  To: Milan Broz
  Cc: Alasdair G Kergon, Matthew Wilcox, Zdenek Kabelac,
	Mikulas Patocka, linux-kernel, linux-parisc, Hugh Dickins,
	Oleg Nesterov

On Wed, May 11, 2011 at 1:42 AM, Milan Broz <mbroz@redhat.com> wrote:
>
> Another one is cryptsetup [..]

Quite frankly, all security-related uses should always be happy about
a "MCL_SPARSE" model, since there is no point in ever bringing in
pages that haven't been used. The whole (and only) point of
mlock[all]() for them is the "avoid to push to disk" issue.

I do wonder if we really should ever do the page-in at all. We might
simply be better off always just saying "we'll lock pages you've
touched, that's it".

                       Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-12  2:12               ` Linus Torvalds
@ 2011-05-12  9:06                 ` Zdenek Kabelac
  0 siblings, 0 replies; 25+ messages in thread
From: Zdenek Kabelac @ 2011-05-12  9:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Milan Broz, Alasdair G Kergon, Matthew Wilcox, Mikulas Patocka,
	linux-kernel, linux-parisc, Hugh Dickins, Oleg Nesterov

Dne 12.5.2011 04:12, Linus Torvalds napsal(a):
> On Wed, May 11, 2011 at 1:42 AM, Milan Broz <mbroz@redhat.com> wrote:
>>
>> Another one is cryptsetup [..]
> 
> Quite frankly, all security-related uses should always be happy about
> a "MCL_SPARSE" model, since there is no point in ever bringing in
> pages that haven't been used. The whole (and only) point of
> mlock[all]() for them is the "avoid to push to disk" issue.
> 
> I do wonder if we really should ever do the page-in at all. We might
> simply be better off always just saying "we'll lock pages you've
> touched, that's it".
> 


For LVM we need to ensure the code which might ever be executed during  disk
suspend state must be paged and locked in - thus we would need MCL_SPARSE only
on several selected 'unneeded' libraries - as we are obviously not really able
to select which part of glibc might be needed during all code path (though I
guess we may find some limits).  But if we are sure that some libraries and
locale files will never be used during suspend state - we do not care about
those pages at all.

So it's not like we would always need only MCL_SPARSE all the time - we would
probably need to have some control to switch i.e.  glibc into MCL_ALL.

Zdenek

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] Don't mlock guardpage if the stack is growing up
  2011-05-09 22:26         ` Mikulas Patocka
@ 2011-05-15 22:18           ` Mikulas Patocka
  0 siblings, 0 replies; 25+ messages in thread
From: Mikulas Patocka @ 2011-05-15 22:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-parisc, linux-ia64

On Tue, 10 May 2011, Mikulas Patocka wrote:

> On Mon, 9 May 2011, Linus Torvalds wrote:
> 
> > On Mon, May 9, 2011 at 8:57 AM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > Hmm. One thing that strikes me is this problem also implies that the
> > > /proc/self/maps file is wrong for the GROWSUP case, isn't it?
> > >
> > > So I think we should not just apply your lock fix, but then *also*
> > > apply something like this:
> > 
> > Actually, I think we might be better off with something like this.
> > 
> > It makes a few more changes:
> > 
> >  - move the stack guard page checking in __get_user_pages() into the
> > rare case (ie we didn't find a page), since that's the only case we
> > care about (the thing about the guard page is that don't want to call
> > "handle_mm_fault()"). As a result, it's off any path where we can
> > possibly care about performance, so we might as well have a nice
> > helper function for both the grow-up and grow-down cases, instead of
> > trying to be clever and only look at the grow-down case for the first
> > page in the vma like you did in your patch.
> > 
> >    End result: simpler, more straightforward code.
> > 
> >  - Move the growsup/down helper functions to <linux/mm.h>, since the
> > /proc code really wants to use them too. That means that the
> > "vma_stack_continue()" function (which now got split up into two
> > cases, for the up/down cases) is now entirely just an internal helper
> > function - nobody else uses it, and the real interface are the
> > "stack_guard_page_xyz()"  functions. Renamed to be simpler.
> > 
> >  - changed that naming of those stack_guard_page functions to use
> > _start and _end instead of growsup/growsdown, since it actually takes
> > the start or the end of the page as the argument (to match the
> > semantics of the afore-mentioned helpers)
> > 
> >  - and finally, make /proc/<pid>/maps use these helpers for both the
> > up/down case, so now /proc/self/maps should work well for the growsup
> > case too.
> > 
> > Hmm?
> > 
> > The only oddish case is IA64 that actually has a stack that grows
> > *both* up and down. That means that I could make up a stack mapping
> > that has a single virtual page in it, that is both the start *and* the
> > end page. Now /proc/self/maps would actually show such a mapping with
> > "negative" size. That's interesting.
> > 
> > It would be easy enough to have a "if (end < start) end = start" there
> > for that case, but maybe it's actually interesting information.
> > 
> > Regardless, I'd like to hear whether this patch really does work on
> > PA-RISC and especially IA64. I think those are the only cases that
> > have a GROWSUP stack. And the IA64 case that supports both is the most
> > interesting, everybody else does just one or the other.
> > 
> >                     Linus
> 
> I will test it after a week, now I'm traveling away.
> 
> Mikulas

Hi

I tested 2.6.39-rc7 in on PA-RISC and confirm that it works.

Mikulas

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2011-05-15 22:18 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-08 18:55 [PATCH] Don't mlock guardpage if the stack is growing up Mikulas Patocka
2011-05-08 20:12 ` Hugh Dickins
2011-05-09 11:12   ` Mikulas Patocka
2011-05-09 15:57     ` Linus Torvalds
2011-05-09 22:07       ` Linus Torvalds
2011-05-09 22:19         ` James Bottomley
2011-05-09 22:31           ` Linus Torvalds
2011-05-09 22:53             ` Tony Luck
2011-05-09 22:58               ` Linus Torvalds
2011-05-09 23:08                 ` Tony Luck
2011-05-09 23:17                   ` Linus Torvalds
2011-05-09 23:25                     ` Linus Torvalds
2011-05-10  4:12                       ` James Bottomley
2011-05-09 22:26         ` Mikulas Patocka
2011-05-15 22:18           ` Mikulas Patocka
2011-05-08 21:47 ` Linus Torvalds
2011-05-09 11:01   ` Mikulas Patocka
2011-05-09 11:43     ` Zdenek Kabelac
2011-05-09 21:08       ` Alasdair G Kergon
2011-05-09 22:45       ` Matthew Wilcox
2011-05-09 22:56         ` Linus Torvalds
2011-05-10 22:57           ` Alasdair G Kergon
2011-05-11  8:42             ` Milan Broz
2011-05-12  2:12               ` Linus Torvalds
2011-05-12  9:06                 ` Zdenek Kabelac

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).