linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/2] break out page allocation warning code
@ 2011-04-08 20:22 Dave Hansen
  2011-04-08 20:22 ` [PATCH 2/2] print vmalloc() state after allocation failures Dave Hansen
                   ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Dave Hansen @ 2011-04-08 20:22 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, Johannes Weiner, Dave Hansen


This originally started as a simple patch to give vmalloc()
some more verbose output on failure on top of the plain
page allocator messages.  Johannes suggested that it might
be nicer to lead with the vmalloc() info _before_ the page
allocator messages.

But, I do think there's a lot of value in what
__alloc_pages_slowpath() does with its filtering and so
forth.

This patch creates a new function which other allocators
can call instead of relying on the internal page allocator
warnings.  It also gives this function private rate-limiting
which separates it from other printk_ratelimit() users.
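
To illustrate what "private rate-limiting" buys, here is a minimal
userspace sketch of a windowed rate limiter with per-caller state.  It is
not the kernel's implementation: struct ratelimit, ratelimit_ok() and the
explicit "now" argument are names and simplifications assumed for this
example.  The point is only that each state object keeps its own window and
counters, so one noisy user cannot use up another user's budget.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for a private rate-limit state. */
struct ratelimit {
	long interval;	/* window length, in caller-defined ticks */
	int burst;	/* messages allowed per window */
	long begin;	/* tick at which the current window opened */
	int printed;	/* messages let through in this window */
	int missed;	/* messages suppressed in this window */
	bool started;	/* false until the first call */
};

/* Returns true if the caller may print at time "now". */
bool ratelimit_ok(struct ratelimit *rs, long now)
{
	if (!rs->started || now - rs->begin >= rs->interval) {
		/* Open a fresh window and forget the old counters. */
		rs->begin = now;
		rs->printed = 0;
		rs->missed = 0;
		rs->started = true;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}

	rs->missed++;
	return false;
}
```

With two separate state objects, exhausting the burst of one leaves the
other untouched, which is the separation from printk_ratelimit() users
described above.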

---

 linux-2.6.git-dave/include/linux/mm.h |    2 +
 linux-2.6.git-dave/mm/page_alloc.c    |   65 +++++++++++++++++++++++-----------
 2 files changed, 46 insertions(+), 21 deletions(-)

diff -puN include/linux/mm.h~break-out-alloc-failure-messages include/linux/mm.h
--- linux-2.6.git/include/linux/mm.h~break-out-alloc-failure-messages	2011-04-08 13:07:18.978332687 -0700
+++ linux-2.6.git-dave/include/linux/mm.h	2011-04-08 13:07:18.990332675 -0700
@@ -1365,6 +1365,8 @@ extern void si_meminfo(struct sysinfo * 
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 extern int after_bootmem;
 
+extern void nopage_warning(gfp_t gfp_mask, int order, const char *fmt, ...);
+
 extern void setup_per_cpu_pageset(void);
 
 extern void zone_pcp_update(struct zone *zone);
diff -puN mm/page_alloc.c~break-out-alloc-failure-messages mm/page_alloc.c
--- linux-2.6.git/mm/page_alloc.c~break-out-alloc-failure-messages	2011-04-08 13:07:18.982332683 -0700
+++ linux-2.6.git-dave/mm/page_alloc.c	2011-04-08 13:07:18.990332675 -0700
@@ -54,6 +54,7 @@
 #include <trace/events/kmem.h>
 #include <linux/ftrace_event.h>
 #include <linux/memcontrol.h>
+#include <linux/ratelimit.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -1734,6 +1735,48 @@ static inline bool should_suppress_show_
 	return ret;
 }
 
+static DEFINE_RATELIMIT_STATE(nopage_rs,
+		DEFAULT_RATELIMIT_INTERVAL,
+		DEFAULT_RATELIMIT_BURST);
+
+void nopage_warning(gfp_t gfp_mask, int order, const char *fmt, ...)
+{
+	va_list args;
+	int r;
+	unsigned int filter = SHOW_MEM_FILTER_NODES;
+	const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
+		return;
+
+	/*
+	 * This documents exceptions given to allocations in certain
+	 * contexts that are allowed to allocate outside current's set
+	 * of allowed nodes.
+	 */
+	if (!(gfp_mask & __GFP_NOMEMALLOC))
+		if (test_thread_flag(TIF_MEMDIE) ||
+		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
+			filter &= ~SHOW_MEM_FILTER_NODES;
+	if (in_interrupt() || !wait)
+		filter &= ~SHOW_MEM_FILTER_NODES;
+
+	if (fmt) {
+		printk(KERN_WARNING);
+		va_start(args, fmt);
+		r = vprintk(fmt, args);
+		va_end(args);
+	}
+
+	printk(KERN_WARNING);
+	printk("%s: page allocation failure: order:%d, mode:0x%x\n",
+			current->comm, order, gfp_mask);
+
+	dump_stack();
+	if (!should_suppress_show_mem())
+		show_mem(filter);
+}
+
 static inline int
 should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 				unsigned long pages_reclaimed)
@@ -2176,27 +2219,7 @@ rebalance:
 	}
 
 nopage:
-	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
-		unsigned int filter = SHOW_MEM_FILTER_NODES;
-
-		/*
-		 * This documents exceptions given to allocations in certain
-		 * contexts that are allowed to allocate outside current's set
-		 * of allowed nodes.
-		 */
-		if (!(gfp_mask & __GFP_NOMEMALLOC))
-			if (test_thread_flag(TIF_MEMDIE) ||
-			    (current->flags & (PF_MEMALLOC | PF_EXITING)))
-				filter &= ~SHOW_MEM_FILTER_NODES;
-		if (in_interrupt() || !wait)
-			filter &= ~SHOW_MEM_FILTER_NODES;
-
-		pr_warning("%s: page allocation failure. order:%d, mode:0x%x\n",
-			current->comm, order, gfp_mask);
-		dump_stack();
-		if (!should_suppress_show_mem())
-			show_mem(filter);
-	}
+	nopage_warning(gfp_mask, order, NULL);
 	return page;
 got_pg:
 	if (kmemcheck_enabled)
_


* [PATCH 2/2] print vmalloc() state after allocation failures
  2011-04-08 20:22 [PATCH 1/2] break out page allocation warning code Dave Hansen
@ 2011-04-08 20:22 ` Dave Hansen
  2011-04-08 20:39   ` David Rientjes
  2011-04-08 20:37 ` [PATCH 1/2] break out page allocation warning code David Rientjes
       [not found] ` <BANLkTi=OnDX53nOZcaaMmqXRBcWicam0xg@mail.gmail.com>
  2 siblings, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2011-04-08 20:22 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, Johannes Weiner, Dave Hansen


I was tracking down a page allocation failure that ended up in vmalloc().
Since vmalloc() uses 0-order pages, if somebody asks for an insane amount
of memory, we'll still get a warning with "order:0" in it.  That's not
very useful.

During recovery, vmalloc() also nicely frees all of the memory that it
got up to the point of the failure.  That is wonderful, but it also
quickly hides any issues.  We have a much different situation if vmalloc()
repeatedly fails 10GB in to:

	vmalloc(100 * 1<<30);

versus repeatedly failing 4096 bytes in to a:

	vmalloc(8192);

This patch will print out messages that look like this:

[   30.040774] bash: vmalloc failure allocating after 0 / 73728 bytes
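
The idea can be modelled in userspace.  The sketch below is an
illustration under stated assumptions, not the vmalloc() code:
alloc_area(), struct area_result, SKETCH_PAGE_SIZE and the pluggable
alloc_one hook are all hypothetical names.  It allocates a request one page
at a time, and on failure records how many bytes it had obtained before
rolling everything back, which is exactly the information that is lost once
the pages have been freed.

```c
#include <stdio.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size, for illustration */

/* Hypothetical hook so that a failure point can be simulated. */
typedef void *(*page_alloc_fn)(size_t index);

struct area_result {
	void **pages;		/* NULL if the request failed */
	size_t got_bytes;	/* bytes obtained before any failure */
	size_t want_bytes;	/* bytes requested */
};

struct area_result alloc_area(size_t nr_pages, page_alloc_fn alloc_one)
{
	struct area_result res = { NULL, 0, nr_pages * SKETCH_PAGE_SIZE };
	void **pages = calloc(nr_pages ? nr_pages : 1, sizeof(*pages));
	size_t i;

	if (!pages)
		return res;

	for (i = 0; i < nr_pages; i++) {
		pages[i] = alloc_one(i);
		if (!pages[i])
			break;
	}
	res.got_bytes = i * SKETCH_PAGE_SIZE;

	if (i < nr_pages) {
		/* Report progress before the rollback hides it. */
		fprintf(stderr,
			"allocation failure, allocated %zu of %zu bytes\n",
			res.got_bytes, res.want_bytes);
		while (i--)
			free(pages[i]);
		free(pages);
		return res;
	}

	res.pages = pages;
	return res;
}

/* Test allocator: the first page succeeds, every later one fails. */
void *fail_after_first_page(size_t index)
{
	return index == 0 ? malloc(SKETCH_PAGE_SIZE) : NULL;
}
```

Calling alloc_area(2, fail_after_first_page) models the vmalloc(8192) case
that fails 4096 bytes in.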

As a side issue, I also noticed that ctl_ioctl() does vmalloc() based
solely on an unverified value passed in from userspace.  Granted, it's
under CAP_SYS_ADMIN, but it still frightens me a bit.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---

 linux-2.6.git-dave/mm/vmalloc.c |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff -puN mm/vmalloc.c~vmalloc-warn mm/vmalloc.c
--- linux-2.6.git/mm/vmalloc.c~vmalloc-warn	2011-04-08 09:36:05.877020199 -0700
+++ linux-2.6.git-dave/mm/vmalloc.c	2011-04-08 09:38:00.373093593 -0700
@@ -1534,6 +1534,7 @@ static void *__vmalloc_node(unsigned lon
 static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 				 pgprot_t prot, int node, void *caller)
 {
+	int order = 0;
 	struct page **pages;
 	unsigned int nr_pages, array_size, i;
 	gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
@@ -1560,11 +1561,12 @@ static void *__vmalloc_area_node(struct 
 
 	for (i = 0; i < area->nr_pages; i++) {
 		struct page *page;
+		gfp_t tmp_mask = gfp_mask | __GFP_NOWARN;
 
 		if (node < 0)
-			page = alloc_page(gfp_mask);
+			page = alloc_page(tmp_mask);
 		else
-			page = alloc_pages_node(node, gfp_mask, 0);
+			page = alloc_pages_node(node, tmp_mask, order);
 
 		if (unlikely(!page)) {
 			/* Successfully allocated i pages, free them in __vunmap() */
@@ -1579,6 +1581,9 @@ static void *__vmalloc_area_node(struct 
 	return area->addr;
 
 fail:
+	nopage_warning(gfp_mask, order, "vmalloc: allocation failure, "
+			"allocated %ld of %ld bytes\n",
+			(area->nr_pages*PAGE_SIZE), area->size);
 	vfree(area->addr);
 	return NULL;
 }
_


* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-08 20:22 [PATCH 1/2] break out page allocation warning code Dave Hansen
  2011-04-08 20:22 ` [PATCH 2/2] print vmalloc() state after allocation failures Dave Hansen
@ 2011-04-08 20:37 ` David Rientjes
  2011-04-08 20:43   ` Dave Hansen
       [not found] ` <BANLkTi=OnDX53nOZcaaMmqXRBcWicam0xg@mail.gmail.com>
  2 siblings, 1 reply; 37+ messages in thread
From: David Rientjes @ 2011-04-08 20:37 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, Johannes Weiner

On Fri, 8 Apr 2011, Dave Hansen wrote:

> 
> This originally started as a simple patch to give vmalloc()
> some more verbose output on failure on top of the plain
> page allocator messages.  Johannes suggested that it might
> be nicer to lead with the vmalloc() info _before_ the page
> allocator messages.
> 
> But, I do think there's a lot of value in what
> __alloc_pages_slowpath() does with its filtering and so
> forth.
> 
> This patch creates a new function which other allocators
> can call instead of relying on the internal page allocator
> warnings.  It also gives this function private rate-limiting
> which separates it from other printk_ratelimit() users.
> 
> ---
> 
>  linux-2.6.git-dave/include/linux/mm.h |    2 +
>  linux-2.6.git-dave/mm/page_alloc.c    |   65 +++++++++++++++++++++++-----------
>  2 files changed, 46 insertions(+), 21 deletions(-)
> 
> diff -puN include/linux/mm.h~break-out-alloc-failure-messages include/linux/mm.h
> --- linux-2.6.git/include/linux/mm.h~break-out-alloc-failure-messages	2011-04-08 13:07:18.978332687 -0700
> +++ linux-2.6.git-dave/include/linux/mm.h	2011-04-08 13:07:18.990332675 -0700
> @@ -1365,6 +1365,8 @@ extern void si_meminfo(struct sysinfo * 
>  extern void si_meminfo_node(struct sysinfo *val, int nid);
>  extern int after_bootmem;
>  
> +extern void nopage_warning(gfp_t gfp_mask, int order, const char *fmt, ...);
> +
>  extern void setup_per_cpu_pageset(void);
>  
>  extern void zone_pcp_update(struct zone *zone);
> diff -puN mm/page_alloc.c~break-out-alloc-failure-messages mm/page_alloc.c
> --- linux-2.6.git/mm/page_alloc.c~break-out-alloc-failure-messages	2011-04-08 13:07:18.982332683 -0700
> +++ linux-2.6.git-dave/mm/page_alloc.c	2011-04-08 13:07:18.990332675 -0700
> @@ -54,6 +54,7 @@
>  #include <trace/events/kmem.h>
>  #include <linux/ftrace_event.h>
>  #include <linux/memcontrol.h>
> +#include <linux/ratelimit.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -1734,6 +1735,48 @@ static inline bool should_suppress_show_
>  	return ret;
>  }
>  
> +static DEFINE_RATELIMIT_STATE(nopage_rs,
> +		DEFAULT_RATELIMIT_INTERVAL,
> +		DEFAULT_RATELIMIT_BURST);
> +
> +void nopage_warning(gfp_t gfp_mask, int order, const char *fmt, ...)

I suggest a different name for this, something like warn_alloc_failure() 
or such.

I guess this isn't general enough where it could be used in the oom killer 
as well?

> +{
> +	va_list args;
> +	int r;
> +	unsigned int filter = SHOW_MEM_FILTER_NODES;
> +	const gfp_t wait = gfp_mask & __GFP_WAIT;
> +
> +	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
> +		return;
> +
> +	/*
> +	 * This documents exceptions given to allocations in certain
> +	 * contexts that are allowed to allocate outside current's set
> +	 * of allowed nodes.
> +	 */
> +	if (!(gfp_mask & __GFP_NOMEMALLOC))
> +		if (test_thread_flag(TIF_MEMDIE) ||
> +		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
> +			filter &= ~SHOW_MEM_FILTER_NODES;
> +	if (in_interrupt() || !wait)
> +		filter &= ~SHOW_MEM_FILTER_NODES;
> +
> +	if (fmt) {
> +		printk(KERN_WARNING);
> +		va_start(args, fmt);
> +		r = vprintk(fmt, args);
> +		va_end(args);
> +	}
> +
> +	printk(KERN_WARNING);
> +	printk("%s: page allocation failure: order:%d, mode:0x%x\n",
> +			current->comm, order, gfp_mask);

This shouldn't be here, it should have been printed already.

> +
> +	dump_stack();
> +	if (!should_suppress_show_mem())
> +		show_mem(filter);
> +}
> +
>  static inline int
>  should_alloc_retry(gfp_t gfp_mask, unsigned int order,
>  				unsigned long pages_reclaimed)
> @@ -2176,27 +2219,7 @@ rebalance:
>  	}
>  
>  nopage:
> -	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
> -		unsigned int filter = SHOW_MEM_FILTER_NODES;
> -
> -		/*
> -		 * This documents exceptions given to allocations in certain
> -		 * contexts that are allowed to allocate outside current's set
> -		 * of allowed nodes.
> -		 */
> -		if (!(gfp_mask & __GFP_NOMEMALLOC))
> -			if (test_thread_flag(TIF_MEMDIE) ||
> -			    (current->flags & (PF_MEMALLOC | PF_EXITING)))
> -				filter &= ~SHOW_MEM_FILTER_NODES;
> -		if (in_interrupt() || !wait)
> -			filter &= ~SHOW_MEM_FILTER_NODES;
> -
> -		pr_warning("%s: page allocation failure. order:%d, mode:0x%x\n",
> -			current->comm, order, gfp_mask);
> -		dump_stack();
> -		if (!should_suppress_show_mem())
> -			show_mem(filter);
> -	}
> +	nopage_warning(gfp_mask, order, NULL);
>  	return page;
>  got_pg:
>  	if (kmemcheck_enabled)


* Re: [PATCH 2/2] print vmalloc() state after allocation failures
  2011-04-08 20:22 ` [PATCH 2/2] print vmalloc() state after allocation failures Dave Hansen
@ 2011-04-08 20:39   ` David Rientjes
  2011-04-08 20:47     ` Dave Hansen
  0 siblings, 1 reply; 37+ messages in thread
From: David Rientjes @ 2011-04-08 20:39 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, linux-kernel, Johannes Weiner

On Fri, 8 Apr 2011, Dave Hansen wrote:

> 
> I was tracking down a page allocation failure that ended up in vmalloc().
> Since vmalloc() uses 0-order pages, if somebody asks for an insane amount
> of memory, we'll still get a warning with "order:0" in it.  That's not
> very useful.
> 
> During recovery, vmalloc() also nicely frees all of the memory that it
> got up to the point of the failure.  That is wonderful, but it also
> quickly hides any issues.  We have a much different situation if vmalloc()
> repeatedly fails 10GB in to:
> 
> 	vmalloc(100 * 1<<30);
> 
> versus repeatedly failing 4096 bytes in to a:
> 
> 	vmalloc(8192);
> 
> This patch will print out messages that look like this:
> 
> [   30.040774] bash: vmalloc failure allocating after 0 / 73728 bytes
> 

Either the changelog or the patch is still wrong because the format of 
this string is inconsistent.

> As a side issue, I also noticed that ctl_ioctl() does vmalloc() based
> solely on an unverified value passed in from userspace.  Granted, it's
> under CAP_SYS_ADMIN, but it still frightens me a bit.
> 
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
> ---
> 
>  linux-2.6.git-dave/mm/vmalloc.c |    9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff -puN mm/vmalloc.c~vmalloc-warn mm/vmalloc.c
> --- linux-2.6.git/mm/vmalloc.c~vmalloc-warn	2011-04-08 09:36:05.877020199 -0700
> +++ linux-2.6.git-dave/mm/vmalloc.c	2011-04-08 09:38:00.373093593 -0700
> @@ -1534,6 +1534,7 @@ static void *__vmalloc_node(unsigned lon
>  static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  				 pgprot_t prot, int node, void *caller)
>  {
> +	int order = 0;

Unnecessary, we can continue to hardcode the 0, vmalloc isn't going to use 
higher order allocs (it's there to avoid such things!).

>  	struct page **pages;
>  	unsigned int nr_pages, array_size, i;
>  	gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
> @@ -1560,11 +1561,12 @@ static void *__vmalloc_area_node(struct 
>  
>  	for (i = 0; i < area->nr_pages; i++) {
>  		struct page *page;
> +		gfp_t tmp_mask = gfp_mask | __GFP_NOWARN;

I think it would be better to just do away with this as well and just 
hardwire the __GFP_NOWARN directly into the two allocation calls.

>  
>  		if (node < 0)
> -			page = alloc_page(gfp_mask);
> +			page = alloc_page(tmp_mask);
>  		else
> -			page = alloc_pages_node(node, gfp_mask, 0);
> +			page = alloc_pages_node(node, tmp_mask, order);
>  
>  		if (unlikely(!page)) {
>  			/* Successfully allocated i pages, free them in __vunmap() */
> @@ -1579,6 +1581,9 @@ static void *__vmalloc_area_node(struct 
>  	return area->addr;
>  
>  fail:
> +	nopage_warning(gfp_mask, order, "vmalloc: allocation failure, "
> +			"allocated %ld of %ld bytes\n",
> +			(area->nr_pages*PAGE_SIZE), area->size);
>  	vfree(area->addr);
>  	return NULL;
>  }


* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-08 20:37 ` [PATCH 1/2] break out page allocation warning code David Rientjes
@ 2011-04-08 20:43   ` Dave Hansen
  0 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2011-04-08 20:43 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-mm, linux-kernel, Johannes Weiner

On Fri, 2011-04-08 at 13:37 -0700, David Rientjes wrote:
> > +static DEFINE_RATELIMIT_STATE(nopage_rs,
> > +		DEFAULT_RATELIMIT_INTERVAL,
> > +		DEFAULT_RATELIMIT_BURST);
> > +
> > +void nopage_warning(gfp_t gfp_mask, int order, const char *fmt, ...)
> 
> I suggest a different name for this, something like warn_alloc_failure() 
> or such.

That works for me.

> I guess this isn't general enough where it could be used in the oom killer 
> as well?

Nope, don't think so.  I took a look at it, but it isn't horribly close
to this.

> > +{
> > +	va_list args;
> > +	int r;
> > +	unsigned int filter = SHOW_MEM_FILTER_NODES;
> > +	const gfp_t wait = gfp_mask & __GFP_WAIT;
> > +
> > +	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
> > +		return;
> > +
> > +	/*
> > +	 * This documents exceptions given to allocations in certain
> > +	 * contexts that are allowed to allocate outside current's set
> > +	 * of allowed nodes.
> > +	 */
> > +	if (!(gfp_mask & __GFP_NOMEMALLOC))
> > +		if (test_thread_flag(TIF_MEMDIE) ||
> > +		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
> > +			filter &= ~SHOW_MEM_FILTER_NODES;
> > +	if (in_interrupt() || !wait)
> > +		filter &= ~SHOW_MEM_FILTER_NODES;
> > +
> > +	if (fmt) {
> > +		printk(KERN_WARNING);
> > +		va_start(args, fmt);
> > +		r = vprintk(fmt, args);
> > +		va_end(args);
> > +	}
> > +
> > +	printk(KERN_WARNING);
> > +	printk("%s: page allocation failure: order:%d, mode:0x%x\n",
> > +			current->comm, order, gfp_mask);
> 
> This shouldn't be here, it should have been printed already.

The "page allocation failure" might have been, if it was specified (it
isn't from the allocator), but order and mode haven't been.  My thought
here is that _all_ allocator failures will want to output order and mode,
so it might as well be common code instead of making everybody specify
it.
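
The shape being argued for (an optional caller-supplied line, followed by
a common line that always carries order and mode) can be sketched in
userspace as follows.  This is an illustration, not the patch itself:
sketch_warn_alloc_failure() is a hypothetical name, it formats into a
caller-provided buffer instead of calling printk(), and it leaves out
current->comm, rate limiting and the stack dump.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Writes an optional caller-formatted line into buf, then appends the
 * common "order/mode" line.  Output is truncated to len - 1 characters.
 * Returns the length of the resulting string.
 */
size_t sketch_warn_alloc_failure(char *buf, size_t len,
				 unsigned int gfp_mask, int order,
				 const char *fmt, ...)
{
	size_t used = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	if (fmt) {
		va_list args;

		/* Forward the caller's arguments to the formatter. */
		va_start(args, fmt);
		vsnprintf(buf, len, fmt, args);
		va_end(args);
		used = strlen(buf);
	}

	/* Common part: every caller gets order and mode for free. */
	snprintf(buf + used, len - used,
		 "page allocation failure: order:%d, mode:0x%x\n",
		 order, gfp_mask);

	return strlen(buf);
}
```

A caller with nothing to add passes NULL for fmt, as the page allocator
does in the patch; a caller such as vmalloc() passes its own format and
arguments and still gets the common line.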

-- Dave



* Re: [PATCH 2/2] print vmalloc() state after allocation failures
  2011-04-08 20:39   ` David Rientjes
@ 2011-04-08 20:47     ` Dave Hansen
  0 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2011-04-08 20:47 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-mm, linux-kernel, Johannes Weiner

On Fri, 2011-04-08 at 13:39 -0700, David Rientjes wrote:
> On Fri, 8 Apr 2011, Dave Hansen wrote:
> > This patch will print out messages that look like this:
> > 
> > [   30.040774] bash: vmalloc failure allocating after 0 / 73728 bytes
> > 
> 
> Either the changelog or the patch is still wrong because the format of 
> this string is inconsistent.

Yeah, ya caught me. :)
> > diff -puN mm/vmalloc.c~vmalloc-warn mm/vmalloc.c
> > --- linux-2.6.git/mm/vmalloc.c~vmalloc-warn	2011-04-08 09:36:05.877020199 -0700
> > +++ linux-2.6.git-dave/mm/vmalloc.c	2011-04-08 09:38:00.373093593 -0700
> > @@ -1534,6 +1534,7 @@ static void *__vmalloc_node(unsigned lon
> >  static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> >  				 pgprot_t prot, int node, void *caller)
> >  {
> > +	int order = 0;
> 
> Unnecessary, we can continue to hardcode the 0, vmalloc isn't going to use 
> higher order allocs (it's there to avoid such things!).

The only reason I did that was to keep the printk from looking like
this:

> > +	nopage_warning(gfp_mask, 0,  "vmalloc: allocation failure, "
> > +			"allocated %ld of %ld bytes\n",
> > +			(area->nr_pages*PAGE_SIZE), area->size);

The order is pretty darn obvious in the direct allocator calls, but I
liked having it named where it wasn't as obvious.

> >  	struct page **pages;
> >  	unsigned int nr_pages, array_size, i;
> >  	gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
> > @@ -1560,11 +1561,12 @@ static void *__vmalloc_area_node(struct 
> >  
> >  	for (i = 0; i < area->nr_pages; i++) {
> >  		struct page *page;
> > +		gfp_t tmp_mask = gfp_mask | __GFP_NOWARN;
> 
> I think it would be better to just do away with this as well and just 
> hardwire the __GFP_NOWARN directly into the two allocation calls.

I did it because hard-wiring it takes the alloc_pages_node() one over 80
columns.  I figured if I was going to add a line, I might as well keep
it pretty.

-- Dave



* Re: [PATCH 1/2] break out page allocation warning code
       [not found] ` <BANLkTi=OnDX53nOZcaaMmqXRBcWicam0xg@mail.gmail.com>
@ 2011-04-08 21:02   ` Dave Hansen
  2011-04-11 10:20     ` Michal Nazarewicz
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2011-04-08 21:02 UTC (permalink / raw)
  To: Michał Nazarewicz; +Cc: linux-mm, Johannes Weiner, linux-kernel

On Fri, 2011-04-08 at 22:54 +0200, Michał Nazarewicz wrote:
> On Apr 8, 2011 10:23 PM, "Dave Hansen" <dave@linux.vnet.ibm.com> wrote:
> > +       if (fmt) {
> > +               printk(KERN_WARNING);
> > +               va_start(args, fmt);
> > +               r = vprintk(fmt, args);
> > +               va_end(args);
> > +       }
> 
> Could we make the "printk(KERN_WARNING);" go away and require caller
> to specify level?  

The core problem is this: I want two lines of output: one for the
order/mode gunk, and one for the user-specified message.

If we have the user pass in a string for the printk() level, we're stuck
doing what I have here.  If we have them _prepend_ it to the "fmt"
string, then it's harder to figure out below.  I guess we could fish in
the string for it.

> > +       printk(KERN_WARNING);
> > +       printk("%s: page allocation failure: order:%d, mode:0x%x\n",
> > +                       current->comm, order, gfp_mask);
> 
> Even more so here. Why not pr_warning instead of two non-atomic calls
> to printk?

It's a relic of an hour ago when I tried passing in the printk() level
to the function as a string.  It can go away now. :)

-- Dave



* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-08 21:02   ` Dave Hansen
@ 2011-04-11 10:20     ` Michal Nazarewicz
  0 siblings, 0 replies; 37+ messages in thread
From: Michal Nazarewicz @ 2011-04-11 10:20 UTC (permalink / raw)
  To: Michał Nazarewicz, Dave Hansen
  Cc: linux-mm, Johannes Weiner, linux-kernel

>> On Apr 8, 2011 10:23 PM, "Dave Hansen" <dave@linux.vnet.ibm.com> wrote:
>>> +       if (fmt) {
>>> +               printk(KERN_WARNING);
>>> +               va_start(args, fmt);
>>> +               r = vprintk(fmt, args);
>>> +               va_end(args);
>>> +       }

> On Fri, 2011-04-08 at 22:54 +0200, Michał Nazarewicz wrote:
>> Could we make the "printk(KERN_WARNING);" go away and require caller
>> to specify level?

On Fri, 08 Apr 2011 23:02:02 +0200, Dave Hansen wrote:
> The core problem is this: I want two lines of output: one for the
> order/mode gunk, and one for the user-specified message.
>
> If we have the user pass in a string for the printk() level, we're stuck
> doing what I have here.  If we have them _prepend_ it to the "fmt"
> string, then it's harder to figure out below.  I guess we could fish in
> the string for it.

This is a bit unfortunate, but that's what I was worried about anyway.  I
guess creating a macro which automatically prepends KERN_WARNING to the
format would solve the issue, but that's probably not the most elegant
solution.
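
For reference, such a macro could look roughly like the userspace sketch
below.  It relies on the compiler concatenating adjacent string literals,
so the format has to be a literal.  The names sketch_emit(), sketch_warn()
and SKETCH_WARNING are assumptions made for this example; the "<4>" prefix
is modelled on what KERN_WARNING expanded to in kernels of this era.

```c
#include <stdarg.h>
#include <stdio.h>

#define SKETCH_WARNING "<4>"	/* stand-in for KERN_WARNING */

/* Formats into buf; stands in for a printk()-like sink. */
int sketch_emit(char *buf, size_t len, const char *fmt, ...)
{
	va_list args;
	int n;

	va_start(args, fmt);
	n = vsnprintf(buf, len, fmt, args);
	va_end(args);
	return n;
}

/*
 * The first variadic argument must be a string literal, so that the
 * compiler can glue the level prefix onto the front of it.
 */
#define sketch_warn(buf, len, ...) \
	sketch_emit(buf, len, SKETCH_WARNING __VA_ARGS__)
```

A call such as sketch_warn(buf, sizeof(buf), "order:%d\n", 0) produces a
single formatted line starting with the level prefix.  The limitation is
that a NULL or run-time format string cannot be passed this way, so the
"no extra message" case would need separate handling.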

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michal "mina86" Nazarewicz    (o o)
ooo +-----<email/xmpp: mnazarewicz@google.com>-----ooO--(_)--Ooo--


* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-28 23:48                               ` john stultz
@ 2011-04-29  0:04                                 ` john stultz
  0 siblings, 0 replies; 37+ messages in thread
From: john stultz @ 2011-04-29  0:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Dave Hansen, linux-mm, linux-kernel,
	Johannes Weiner, Michal Nazarewicz, Andrew Morton

On Thu, 2011-04-28 at 16:48 -0700, john stultz wrote:
> On Thu, 2011-04-28 at 15:48 -0700, David Rientjes wrote:
> > On Wed, 27 Apr 2011, john stultz wrote:
> > 
> > > So thinking further, this can be simplified by adding the seqlock first,
> > > and then retaining the task_locking only in the set_task_comm path until
> > > all comm accessors are converted to using get_task_comm.
> > > 
> > 
> > On second thought, I think it would be better to just retain using a 
> > spinlock but instead of using alloc_lock, introduce a new spinlock to 
> > task_struct for the sole purpose of protecting comm.
> > 
> > > And, instead of using get_task_comm() to write into a preallocated 
> > buffer, I think it would be easier in the vast majority of cases that 
> > you'll need to convert to just provide task_comm_lock(p) and 
> > task_comm_unlock(p) so that p->comm can be dereferenced safely.  

Ok.. trying to find a middle ground here by replying to my own
concerns. :)

> So my concern with this is that it means one more lock that could be
> mis-nested. By keeping the locking isolated to the get/set_task_comm, we
> can be sure that won't happen. 
> 
> Also tracking new current->comm references will be easier if we just
> don't allow new ones. Validating that all the comm references are
> correctly locked becomes more difficult if we need locking at each use
> site.

So maybe we still ban current->comm access and instead have a
lightweight get_comm_locked() accessor or something like that. Then we can
add debugging options to validate that the lock is properly held
internally.

> Further, since I'm not convinced that we never reference current->comm
> from irq context, if we go with spinlocks, we're going to have to
> disable irqs in the read path as well. seqlocks were nice for that
> aspect.

rwlocks can resolve this concern.


Any other thoughts?

-john



* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-28 22:48                             ` David Rientjes
@ 2011-04-28 23:48                               ` john stultz
  2011-04-29  0:04                                 ` john stultz
  0 siblings, 1 reply; 37+ messages in thread
From: john stultz @ 2011-04-28 23:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Dave Hansen, linux-mm, linux-kernel,
	Johannes Weiner, Michal Nazarewicz, Andrew Morton

On Thu, 2011-04-28 at 15:48 -0700, David Rientjes wrote:
> On Wed, 27 Apr 2011, john stultz wrote:
> 
> > So thinking further, this can be simplified by adding the seqlock first,
> > and then retaining the task_locking only in the set_task_comm path until
> > all comm accessors are converted to using get_task_comm.
> > 
> 
> On second thought, I think it would be better to just retain using a 
> spinlock but instead of using alloc_lock, introduce a new spinlock to 
> task_struct for the sole purpose of protecting comm.
> 
> And, instead of using get_task_comm() to write into a preallocated 
> buffer, I think it would be easier in the vast majority of cases that 
> you'll need to convert to just provide task_comm_lock(p) and 
> task_comm_unlock(p) so that p->comm can be dereferenced safely.  

So my concern with this is that it means one more lock that could be
mis-nested. By keeping the locking isolated to the get/set_task_comm, we
can be sure that won't happen. 

Also tracking new current->comm references will be easier if we just
don't allow new ones. Validating that all the comm references are
correctly locked becomes more difficult if we need locking at each use
site.

Further, since I'm not convinced that we never reference current->comm
from irq context, if we go with spinlocks, we're going to have to
disable irqs in the read path as well. seqlocks were nice for that
aspect.
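
To make the read-side property concrete, here is a userspace sketch of a
sequence-counter protected name, written with C11 atomics.  It is not the
kernel's seqlock API: struct sketch_task, sketch_set_comm() and
sketch_get_comm() are hypothetical names, writers are assumed to be
serialized by some other means, and the name bytes are relaxed atomics so
that the racing reads are well defined in C11.

```c
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_COMM_LEN 16	/* the 16-byte comm size discussed here */

struct sketch_task {
	atomic_uint seq;	/* even: stable, odd: write in progress */
	_Atomic char comm[SKETCH_COMM_LEN];
};

/* Writer side.  Assumes at most one writer at a time. */
void sketch_set_comm(struct sketch_task *t, const char *name)
{
	unsigned int seq = atomic_load_explicit(&t->seq,
						memory_order_relaxed);
	size_t i;

	/* Make the count odd before touching the data. */
	atomic_store_explicit(&t->seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);

	for (i = 0; i < SKETCH_COMM_LEN - 1 && name[i]; i++)
		atomic_store_explicit(&t->comm[i], name[i],
				      memory_order_relaxed);
	for (; i < SKETCH_COMM_LEN; i++)
		atomic_store_explicit(&t->comm[i], '\0',
				      memory_order_relaxed);

	/* Make the count even again; this publishes the new name. */
	atomic_store_explicit(&t->seq, seq + 2, memory_order_release);
}

/* Reader side.  Takes no lock; retries if a write overlapped the copy. */
void sketch_get_comm(struct sketch_task *t, char buf[SKETCH_COMM_LEN])
{
	unsigned int before, after;
	size_t i;

	do {
		before = atomic_load_explicit(&t->seq,
					      memory_order_acquire);
		for (i = 0; i < SKETCH_COMM_LEN; i++)
			buf[i] = atomic_load_explicit(&t->comm[i],
						      memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		after = atomic_load_explicit(&t->seq,
					     memory_order_relaxed);
	} while ((before & 1) || before != after);
}
```

The reader never blocks a writer and never takes a lock, which is the
property referred to above.  The price is that readers must copy the name
out and may have to retry.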

> get_task_comm() could use that interface itself and then write into a 
> preallocated buffer.
> 
> The problem with using get_task_comm() everywhere is it requires 16 
> additional bytes to be allocated on the stack in hundreds of locations 
> around the kernel which may or may not be safe.

True. Although is this maybe a bit overzealous?

Maybe I can make sure not to add any mid-layer stack nesting by limiting
the scope of the 16bytes to just around where it is used.  This would
ensure we're only adding 16bytes to any current usage.

Other ideas?

thanks
-john



* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-28  1:29                           ` john stultz
@ 2011-04-28 22:48                             ` David Rientjes
  2011-04-28 23:48                               ` john stultz
  0 siblings, 1 reply; 37+ messages in thread
From: David Rientjes @ 2011-04-28 22:48 UTC (permalink / raw)
  To: john stultz
  Cc: KOSAKI Motohiro, Dave Hansen, linux-mm, linux-kernel,
	Johannes Weiner, Michal Nazarewicz, Andrew Morton

On Wed, 27 Apr 2011, john stultz wrote:

> So thinking further, this can be simplified by adding the seqlock first,
> and then retaining the task_locking only in the set_task_comm path until
> all comm accessors are converted to using get_task_comm.
> 

On second thought, I think it would be better to just retain using a 
spinlock but instead of using alloc_lock, introduce a new spinlock to 
task_struct for the sole purpose of protecting comm.

And, instead of using get_task_comm() to write into a preallocated 
buffer, I think it would be easier in the vast majority of cases that 
you'll need to convert to just provide task_comm_lock(p) and 
task_comm_unlock(p) so that p->comm can be dereferenced safely.  
get_task_comm() could use that interface itself and then write into a 
preallocated buffer.

The problem with using get_task_comm() everywhere is it requires 16 
additional bytes to be allocated on the stack in hundreds of locations 
around the kernel which may or may not be safe.
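
A userspace sketch of this proposal, under stated assumptions: struct
locked_task, LOCKED_COMM_LEN and the function names below are hypothetical,
and a C11 atomic_flag spin loop stands in for a kernel spinlock.  It shows
the two access styles side by side: in-place use under the dedicated lock,
which needs no extra stack, and a copying accessor built on the same lock,
which needs a 16-byte buffer from the caller.

```c
#include <stdatomic.h>
#include <string.h>

#define LOCKED_COMM_LEN 16

struct locked_task {
	atomic_flag comm_lock;	/* dedicated lock, protects only comm */
	char comm[LOCKED_COMM_LEN];
};

void locked_task_init(struct locked_task *t)
{
	atomic_flag_clear(&t->comm_lock);
	memset(t->comm, 0, sizeof(t->comm));
}

void sketch_task_comm_lock(struct locked_task *t)
{
	while (atomic_flag_test_and_set_explicit(&t->comm_lock,
						 memory_order_acquire))
		;	/* spin until the holder releases the lock */
}

void sketch_task_comm_unlock(struct locked_task *t)
{
	atomic_flag_clear_explicit(&t->comm_lock, memory_order_release);
}

void sketch_set_task_comm(struct locked_task *t, const char *name)
{
	sketch_task_comm_lock(t);
	memset(t->comm, 0, sizeof(t->comm));
	strncpy(t->comm, name, sizeof(t->comm) - 1);
	sketch_task_comm_unlock(t);
}

/* In-place use: comm is read under the lock, with no copy. */
size_t sketch_comm_length(struct locked_task *t)
{
	size_t n;

	sketch_task_comm_lock(t);
	n = strlen(t->comm);
	sketch_task_comm_unlock(t);
	return n;
}

/* Copying accessor built on the same lock; the caller provides buf. */
char *sketch_get_task_comm(char buf[LOCKED_COMM_LEN], struct locked_task *t)
{
	sketch_task_comm_lock(t);
	memcpy(buf, t->comm, LOCKED_COMM_LEN);
	buf[LOCKED_COMM_LEN - 1] = '\0';
	sketch_task_comm_unlock(t);
	return buf;
}
```

The in-place style avoids the per-call-site buffer, at the cost of
exposing the lock to every user of comm.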


* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-26 21:25                     ` john stultz
@ 2011-04-28  3:05                       ` KOSAKI Motohiro
  0 siblings, 0 replies; 37+ messages in thread
From: KOSAKI Motohiro @ 2011-04-28  3:05 UTC (permalink / raw)
  To: john stultz
  Cc: kosaki.motohiro, David Rientjes, Dave Hansen, linux-mm,
	linux-kernel, Johannes Weiner, Michal Nazarewicz, Andrew Morton

> On Thu, 2011-04-21 at 10:29 +0900, KOSAKI Motohiro wrote:
> > And one correction.
> > ------------------------------------------------------------------
> > static ssize_t comm_write(struct file *file, const char __user *buf,
> >                                 size_t count, loff_t *offset)
> > {
> >         struct inode *inode = file->f_path.dentry->d_inode;
> >         struct task_struct *p;
> >         char buffer[TASK_COMM_LEN];
> > 
> >         memset(buffer, 0, sizeof(buffer));
> >         if (count > sizeof(buffer) - 1)
> >                 count = sizeof(buffer) - 1;
> >         if (copy_from_user(buffer, buf, count))
> >                 return -EFAULT;
> > 
> >         p = get_proc_task(inode);
> >         if (!p)
> >                 return -ESRCH;
> > 
> >         if (same_thread_group(current, p))
> >                 set_task_comm(p, buffer);
> >         else
> >                 count = -EINVAL;
> > ------------------------------------------------------------------
> > 
> > This code doesn't have a proper credential check. IOW, you forgot the
> > pthread_setuid_np() case.
> 
> Sorry, could you expand on this a bit? Google isn't coming up with much
> for pthread_setuid_np. Can a thread actually end up with a different uid
> than the process it is a member of?

Yes. The Linux kernel _always_ tracks the uid per thread only.
With glibc 2.3.3 or earlier, setuid() used the kernel syscall directly, so
a userland application had no way to change the uid of a whole process.

With glibc 2.3.4 or later, glibc implements per-process setuid() by using
signals for inter-thread communication (i.e., every thread calls the
setuid() syscall internally). Hm, currently pthread_setuid_np doesn't have
a properly exported header file, so perhaps we only need to worry about
syscall(NR_uid) and old libc?

Anyway, if you look at the task_struct definition, you can easily see that
it has a cred field.

Thanks.

> 
> Or is same_thread_group not really what I think it is? What would be a
> better way to check that the two threads are members of the same
> process?
> 
> thanks
> -john
> 
> 




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-28  0:32                         ` john stultz
@ 2011-04-28  1:29                           ` john stultz
  2011-04-28 22:48                             ` David Rientjes
  0 siblings, 1 reply; 37+ messages in thread
From: john stultz @ 2011-04-28  1:29 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Dave Hansen, linux-mm, linux-kernel,
	Johannes Weiner, Michal Nazarewicz, Andrew Morton

On Wed, 2011-04-27 at 17:32 -0700, john stultz wrote:
> On Wed, 2011-04-27 at 16:51 -0700, David Rientjes wrote:
> > On Tue, 26 Apr 2011, john stultz wrote:
> > > In the meantime, I'll put some effort into trying to protect unlocked
> > > current->comm access using get_task_comm() where possible. Won't happen
> > > in a day, and help would be appreciated. 
> > > 
> > 
> > We need to stop protecting ->comm with ->alloc_lock since it is used for 
> > other members of task_struct that may or may not be held in a function 
> > that wants to read ->comm.  We should probably introduce a seqlock.
> 
> Agreed. My initial approach is to consolidate accesses to use
> get_task_comm(), with a special case to skip the locking if tsk==current,
> as well as a lock free __get_task_comm() for cases where it's not current
> being accessed and the task locking is already done.
> 
> Once that's all done, the next step is to switch to a seqlock (or
> possibly RCU if Dave is still playing with that idea), internally in the
> get_task_comm implementation and then yank the special __get_task_comm. 

So thinking further, this can be simplified by adding the seqlock first,
and then retaining the task_locking only in the set_task_comm path until
all comm accessors are converted to using get_task_comm.
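
As a rough illustration of the seqlock idea (a userspace sketch using C11
atomics, not the kernel's seqlock API and not taken from any posted patch):
the writer bumps a sequence counter to an odd value, updates the name, then
bumps it to the next even value; a reader copies the name without taking
any lock and retries if the counter was odd or changed underneath it. The
names demo_task, demo_set_comm and demo_get_comm, and the COMM_LEN value,
are assumptions made up for this example.

```c
#include <stdatomic.h>
#include <stddef.h>

#define COMM_LEN 16 /* assumed to mirror TASK_COMM_LEN */

/* Hypothetical stand-in for the relevant part of task_struct. */
struct demo_task {
	atomic_uint seq;             /* odd while a write is in progress */
	_Atomic char comm[COMM_LEN]; /* per-byte atomics avoid C11 data races */
};

/* Writer side; concurrent writers must still be serialized by the caller. */
static void demo_set_comm(struct demo_task *t, const char *name)
{
	unsigned int s = atomic_load_explicit(&t->seq, memory_order_relaxed);
	size_t i;

	/* Mark the write as in progress before touching the data. */
	atomic_store_explicit(&t->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);

	/* Copy at most COMM_LEN - 1 bytes, then zero-fill the rest. */
	for (i = 0; i < COMM_LEN - 1 && name[i] != '\0'; i++)
		atomic_store_explicit(&t->comm[i], name[i], memory_order_relaxed);
	for (; i < COMM_LEN; i++)
		atomic_store_explicit(&t->comm[i], '\0', memory_order_relaxed);

	/* Publish: the counter is even again and two steps ahead. */
	atomic_store_explicit(&t->seq, s + 2, memory_order_release);
}

/* Reader side: takes no lock, retries if a write overlapped the copy. */
static void demo_get_comm(struct demo_task *t, char out[COMM_LEN])
{
	unsigned int s0, s1;
	size_t i;

	do {
		s0 = atomic_load_explicit(&t->seq, memory_order_acquire);
		for (i = 0; i < COMM_LEN; i++)
			out[i] = atomic_load_explicit(&t->comm[i], memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s1 = atomic_load_explicit(&t->seq, memory_order_relaxed);
	} while ((s0 & 1u) || s0 != s1);
}
```

In this sketch writers would still have to exclude each other (in the plan
above, by keeping the task lock in the set_task_comm path), while readers
never block and never return a half-written name.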

I'll be sending out some initial patches for review shortly.

thanks
-john



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-27 23:51                       ` David Rientjes
@ 2011-04-28  0:32                         ` john stultz
  2011-04-28  1:29                           ` john stultz
  0 siblings, 1 reply; 37+ messages in thread
From: john stultz @ 2011-04-28  0:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Dave Hansen, linux-mm, linux-kernel,
	Johannes Weiner, Michal Nazarewicz, Andrew Morton

On Wed, 2011-04-27 at 16:51 -0700, David Rientjes wrote:
> On Tue, 26 Apr 2011, john stultz wrote:
> > In the meantime, I'll put some effort into trying to protect unlocked
> > current->comm access using get_task_comm() where possible. Won't happen
> > in a day, and help would be appreciated. 
> > 
> 
> We need to stop protecting ->comm with ->alloc_lock since it is used for 
> other members of task_struct that may or may not be held in a function 
> that wants to read ->comm.  We should probably introduce a seqlock.

Agreed. My initial approach is to consolidate accesses to use
get_task_comm(), with a special case to skip the locking if tsk==current,
as well as a lock free __get_task_comm() for cases where it's not current
being accessed and the task locking is already done.

Once that's all done, the next step is to switch to a seqlock (or
possibly RCU if Dave is still playing with that idea), internally in the
get_task_comm implementation and then yank the special __get_task_comm. 

But other suggestions are welcome.

thanks
-john



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-26 19:27                     ` john stultz
@ 2011-04-27 23:51                       ` David Rientjes
  2011-04-28  0:32                         ` john stultz
  0 siblings, 1 reply; 37+ messages in thread
From: David Rientjes @ 2011-04-27 23:51 UTC (permalink / raw)
  To: john stultz
  Cc: KOSAKI Motohiro, Dave Hansen, linux-mm, linux-kernel,
	Johannes Weiner, Michal Nazarewicz, Andrew Morton

On Tue, 26 Apr 2011, john stultz wrote:

> Sorry if this somehow got off on the wrong foot. It's just surprising to
> see such passion bubble up after almost two years of quiet since the
> proc patch went in.
> 

It hasn't been two years, it hasn't even been 18 months.

	$ git diff 4614a696bd1c.. | grep "^+.*current\->comm" | wc -l
	42

Apparently those dozens of new references directly to current->comm since 
the change also were unaware of the need to use get_task_comm() to avoid a 
racy writer.  I don't think there's any code in the kernel that is ok with 
corrupted task names being printed: those messages are usually important.

> So I'm not proposing comm be totally lock free (Dave Hansen might do
> that for me, we'll see :) but when the original patch was proposed, the
> idea that transient empty or incomplete comms would be possible was
> brought up and didn't seem to be a big enough issue at the time to block
> it from being merged.
> 

I'm not really interested in the discussion that happened at the time, I'm 
concerned about racy readers of any thread's comm that result in corrupted 
strings being printed or used in the kernel.

> It's just that having a more specific case where these transient
> null/incomplete comms cause an issue would help prioritize the need for
> correctness.
> 

It doesn't seem like there was any due diligence to ensure other code 
wasn't broken.  When comm could only be changed by prctl(), we needed no 
protection for current->comm and so code naturally will reference it 
directly.  Since that's now changed, no audit was done to ensure the 300+ 
references throughout the tree don't require non-racy reads.

> In the meantime, I'll put some effort into trying to protect unlocked
> current->comm access using get_task_comm() where possible. Won't happen
> in a day, and help would be appreciated. 
> 

We need to stop protecting ->comm with ->alloc_lock since it is used for 
other members of task_struct that may or may not be held in a function 
that wants to read ->comm.  We should probably introduce a seqlock.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-21  1:29                   ` KOSAKI Motohiro
  2011-04-25  4:21                     ` KOSAKI Motohiro
  2011-04-26 19:27                     ` john stultz
@ 2011-04-26 21:25                     ` john stultz
  2011-04-28  3:05                       ` KOSAKI Motohiro
  2 siblings, 1 reply; 37+ messages in thread
From: john stultz @ 2011-04-26 21:25 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, Dave Hansen, linux-mm, linux-kernel,
	Johannes Weiner, Michal Nazarewicz, Andrew Morton

On Thu, 2011-04-21 at 10:29 +0900, KOSAKI Motohiro wrote:
> And one correction.
> ------------------------------------------------------------------
> static ssize_t comm_write(struct file *file, const char __user *buf,
>                                 size_t count, loff_t *offset)
> {
>         struct inode *inode = file->f_path.dentry->d_inode;
>         struct task_struct *p;
>         char buffer[TASK_COMM_LEN];
> 
>         memset(buffer, 0, sizeof(buffer));
>         if (count > sizeof(buffer) - 1)
>                 count = sizeof(buffer) - 1;
>         if (copy_from_user(buffer, buf, count))
>                 return -EFAULT;
> 
>         p = get_proc_task(inode);
>         if (!p)
>                 return -ESRCH;
> 
>         if (same_thread_group(current, p))
>                 set_task_comm(p, buffer);
>         else
>                 count = -EINVAL;
> ------------------------------------------------------------------
> 
> This code doesn't have a proper credential check. IOW, you forgot the
> pthread_setuid_np() case.

Sorry, could you expand on this a bit? Google isn't coming up with much
for pthread_setuid_np. Can a thread actually end up with a different uid
than the process it is a member of?

Or is same_thread_group not really what I think it is? What would be a
better way to check that the two threads are members of the same
process?

thanks
-john



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-21  1:29                   ` KOSAKI Motohiro
  2011-04-25  4:21                     ` KOSAKI Motohiro
@ 2011-04-26 19:27                     ` john stultz
  2011-04-27 23:51                       ` David Rientjes
  2011-04-26 21:25                     ` john stultz
  2 siblings, 1 reply; 37+ messages in thread
From: john stultz @ 2011-04-26 19:27 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, Dave Hansen, linux-mm, linux-kernel,
	Johannes Weiner, Michal Nazarewicz, Andrew Morton

On Thu, 2011-04-21 at 10:29 +0900, KOSAKI Motohiro wrote:
> > On Wed, 2011-04-20 at 13:24 -0700, David Rientjes wrote:
> > > On Wed, 20 Apr 2011, KOSAKI Motohiro wrote:
> > > 
> > > > > That was true a while ago, but you now need to protect every thread's 
> > > > > ->comm with get_task_comm() or ensuring task_lock() is held to protect 
> > > > > against /proc/pid/comm which can change other thread's ->comm.  That was 
> > > > > different before when prctl(PR_SET_NAME) would only operate on current, so 
> > > > > no lock was needed when reading current->comm.
> > > > 
> > > > Right. /proc/pid/comm is evil. We have to fix it; otherwise we need to change
> > > > all of the current->comm users. There are a lot of them!
> > > > 
> > > 
> > > Fixing it in this case would be removing it and only allowing it for 
> > > current via the usual prctl() :)  The code was introduced in 4614a696bd1c 
> > > (procfs: allow threads to rename siblings via /proc/pid/tasks/tid/comm) in 
> > > December 2009 and seems to originally be meant for debugging.  We simply 
> > > can't continue to let it modify any thread's ->comm unless we change the 
> > > over 300 current->comm dereferences in the kernel.
> > > 
> > > I'd prefer that we remove /proc/pid/comm entirely or at least prevent 
> > > writing to it unless CONFIG_EXPERT.
> > 
> > Eeeh. That's probably going to be a tough sell, as I think there is
> > wider interest in what it provides. It's useful for debugging
> > applications not kernels, so I doubt folks will want to rebuild their
> > kernel to try to analyze a java issue.
> > 
> > So I'm well aware that there is the chance that you catch the race and
> > read an incomplete/invalid comm (it was discussed at length when the
> > change went in), but somewhere I've missed how that's causing actual
> > problems. Other than just being "evil" and having the documented race,
> > could you clarify what the issue is that you're hitting?
> 
> The problem is that the race is not documented, either. Okay, I recognize
> that you introduced a new locking rule for task->comm. But it is not
> documented anywhere. Thus, we have no way to review whether the current
> callsites are correct or not. Can you please document it? And I have a
> question. Do you mean that a task->comm reader now doesn't need
> task_lock() even if it is reading another thread's comm?
> 
> _If_ every task->comm reader has to accept that it may read an
> incomplete/invalid comm, task_lock() doesn't help at all.

Sorry if this somehow got off on the wrong foot. It's just surprising to
see such passion bubble up after almost two years of quiet since the
proc patch went in.

So I'm not proposing comm be totally lock free (Dave Hansen might do
that for me, we'll see :) but when the original patch was proposed, the
idea that transient empty or incomplete comms would be possible was
brought up and didn't seem to be a big enough issue at the time to block
it from being merged.

It's just that having a more specific case where these transient
null/incomplete comms cause an issue would help prioritize the need for
correctness.

In the meantime, I'll put some effort into trying to protect unlocked
current->comm access using get_task_comm() where possible. Won't happen
in a day, and help would be appreciated. 

When we hit the point where the remaining places are where the task_lock
can't be taken, we can either live with the possible incomplete comm or
add a new lock to protect just the comm.

thanks
-john







^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-21  1:29                   ` KOSAKI Motohiro
@ 2011-04-25  4:21                     ` KOSAKI Motohiro
  2011-04-26 19:27                     ` john stultz
  2011-04-26 21:25                     ` john stultz
  2 siblings, 0 replies; 37+ messages in thread
From: KOSAKI Motohiro @ 2011-04-25  4:21 UTC (permalink / raw)
  To: john stultz
  Cc: kosaki.motohiro, David Rientjes, Dave Hansen, linux-mm,
	linux-kernel, Johannes Weiner, Michal Nazarewicz, Andrew Morton

> > > I'd prefer that we remove /proc/pid/comm entirely or at least prevent 
> > > writing to it unless CONFIG_EXPERT.
> > 
> > Eeeh. That's probably going to be a tough sell, as I think there is
> > wider interest in what it provides. It's useful for debugging
> > applications not kernels, so I doubt folks will want to rebuild their
> > kernel to try to analyze a java issue.
> > 
> > So I'm well aware that there is the chance that you catch the race and
> > read an incomplete/invalid comm (it was discussed at length when the
> > change went in), but somewhere I've missed how that's causing actual
> > problems. Other than just being "evil" and having the documented race,
> > could you clarify what the issue is that you're hitting?
> 
> The problem is that the race is not documented, either. Okay, I recognize
> that you introduced a new locking rule for task->comm. But it is not
> documented anywhere. Thus, we have no way to review whether the current
> callsites are correct or not. Can you please document it? And I have a
> question. Do you mean that a task->comm reader now doesn't need
> task_lock() even if it is reading another thread's comm?
> 
> _If_ every task->comm reader has to accept that it may read an
> incomplete/invalid comm, task_lock() doesn't help at all.

ping?




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-20 20:34                 ` john stultz
@ 2011-04-21  1:29                   ` KOSAKI Motohiro
  2011-04-25  4:21                     ` KOSAKI Motohiro
                                       ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: KOSAKI Motohiro @ 2011-04-21  1:29 UTC (permalink / raw)
  To: john stultz
  Cc: kosaki.motohiro, David Rientjes, Dave Hansen, linux-mm,
	linux-kernel, Johannes Weiner, Michal Nazarewicz, Andrew Morton

> On Wed, 2011-04-20 at 13:24 -0700, David Rientjes wrote:
> > On Wed, 20 Apr 2011, KOSAKI Motohiro wrote:
> > 
> > > > That was true a while ago, but you now need to protect every thread's 
> > > > ->comm with get_task_comm() or ensuring task_lock() is held to protect 
> > > > against /proc/pid/comm which can change other thread's ->comm.  That was 
> > > > different before when prctl(PR_SET_NAME) would only operate on current, so 
> > > > no lock was needed when reading current->comm.
> > > 
> > > Right. /proc/pid/comm is evil. We have to fix it; otherwise we need to change
> > > all of the current->comm users. There are a lot of them!
> > > 
> > 
> > Fixing it in this case would be removing it and only allowing it for 
> > current via the usual prctl() :)  The code was introduced in 4614a696bd1c 
> > (procfs: allow threads to rename siblings via /proc/pid/tasks/tid/comm) in 
> > December 2009 and seems to originally be meant for debugging.  We simply 
> > can't continue to let it modify any thread's ->comm unless we change the 
> > over 300 current->comm dereferences in the kernel.
> > 
> > I'd prefer that we remove /proc/pid/comm entirely or at least prevent 
> > writing to it unless CONFIG_EXPERT.
> 
> Eeeh. That's probably going to be a tough sell, as I think there is
> wider interest in what it provides. It's useful for debugging
> applications not kernels, so I doubt folks will want to rebuild their
> kernel to try to analyze a java issue.
> 
> So I'm well aware that there is the chance that you catch the race and
> read an incomplete/invalid comm (it was discussed at length when the
> change went in), but somewhere I've missed how that's causing actual
> problems. Other than just being "evil" and having the documented race,
> could you clarify what the issue is that you're hitting?

The problem is that the race is not documented, either. Okay, I recognize
that you introduced a new locking rule for task->comm. But it is not
documented anywhere. Thus, we have no way to review whether the current
callsites are correct or not. Can you please document it? And I have a
question. Do you mean that a task->comm reader now doesn't need
task_lock() even if it is reading another thread's comm?

_If_ every task->comm reader has to accept that it may read an
incomplete/invalid comm, task_lock() doesn't help at all.



And one correction.
------------------------------------------------------------------
static ssize_t comm_write(struct file *file, const char __user *buf,
                                size_t count, loff_t *offset)
{
        struct inode *inode = file->f_path.dentry->d_inode;
        struct task_struct *p;
        char buffer[TASK_COMM_LEN];

        memset(buffer, 0, sizeof(buffer));
        if (count > sizeof(buffer) - 1)
                count = sizeof(buffer) - 1;
        if (copy_from_user(buffer, buf, count))
                return -EFAULT;

        p = get_proc_task(inode);
        if (!p)
                return -ESRCH;

        if (same_thread_group(current, p))
                set_task_comm(p, buffer);
        else
                count = -EINVAL;
------------------------------------------------------------------

This code doesn't have a proper credential check. IOW, you forgot the
pthread_setuid_np() case.





^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-20 20:24               ` David Rientjes
@ 2011-04-20 20:34                 ` john stultz
  2011-04-21  1:29                   ` KOSAKI Motohiro
  0 siblings, 1 reply; 37+ messages in thread
From: john stultz @ 2011-04-20 20:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Dave Hansen, linux-mm, linux-kernel,
	Johannes Weiner, Michal Nazarewicz, Andrew Morton

On Wed, 2011-04-20 at 13:24 -0700, David Rientjes wrote:
> On Wed, 20 Apr 2011, KOSAKI Motohiro wrote:
> 
> > > That was true a while ago, but you now need to protect every thread's 
> > > ->comm with get_task_comm() or ensuring task_lock() is held to protect 
> > > against /proc/pid/comm which can change other thread's ->comm.  That was 
> > > different before when prctl(PR_SET_NAME) would only operate on current, so 
> > > no lock was needed when reading current->comm.
> > 
> > Right. /proc/pid/comm is evil. We have to fix it; otherwise we need to change
> > all of the current->comm users. There are a lot of them!
> > 
> 
> Fixing it in this case would be removing it and only allowing it for 
> current via the usual prctl() :)  The code was introduced in 4614a696bd1c 
> (procfs: allow threads to rename siblings via /proc/pid/tasks/tid/comm) in 
> December 2009 and seems to originally be meant for debugging.  We simply 
> can't continue to let it modify any thread's ->comm unless we change the 
> over 300 current->comm dereferences in the kernel.
> 
> I'd prefer that we remove /proc/pid/comm entirely or at least prevent 
> writing to it unless CONFIG_EXPERT.

Eeeh. That's probably going to be a tough sell, as I think there is
wider interest in what it provides. It's useful for debugging
applications not kernels, so I doubt folks will want to rebuild their
kernel to try to analyze a java issue.

So I'm well aware that there is the chance that you catch the race and
read an incomplete/invalid comm (it was discussed at length when the
change went in), but somewhere I've missed how that's causing actual
problems. Other than just being "evil" and having the documented race,
could you clarify what the issue is that you're hitting?

thanks
-john





^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-20  0:39             ` KOSAKI Motohiro
@ 2011-04-20 20:24               ` David Rientjes
  2011-04-20 20:34                 ` john stultz
  0 siblings, 1 reply; 37+ messages in thread
From: David Rientjes @ 2011-04-20 20:24 UTC (permalink / raw)
  To: KOSAKI Motohiro, John Stultz
  Cc: Dave Hansen, linux-mm, linux-kernel, Johannes Weiner,
	Michal Nazarewicz, Andrew Morton

On Wed, 20 Apr 2011, KOSAKI Motohiro wrote:

> > That was true a while ago, but you now need to protect every thread's 
> > ->comm with get_task_comm() or ensuring task_lock() is held to protect 
> > against /proc/pid/comm which can change other thread's ->comm.  That was 
> > different before when prctl(PR_SET_NAME) would only operate on current, so 
> > no lock was needed when reading current->comm.
> 
> Right. /proc/pid/comm is evil. We have to fix it; otherwise we need to change
> all of the current->comm users. There are a lot of them!
> 

Fixing it in this case would be removing it and only allowing it for 
current via the usual prctl() :)  The code was introduced in 4614a696bd1c 
(procfs: allow threads to rename siblings via /proc/pid/tasks/tid/comm) in 
December 2009 and seems to originally be meant for debugging.  We simply 
can't continue to let it modify any thread's ->comm unless we change the 
over 300 current->comm dereferences in the kernel.

I'd prefer that we remove /proc/pid/comm entirely or at least prevent 
writing to it unless CONFIG_EXPERT.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-20  2:19                 ` KOSAKI Motohiro
@ 2011-04-20  2:46                   ` Dave Hansen
  0 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2011-04-20  2:46 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, linux-mm, linux-kernel, Johannes Weiner,
	Michal Nazarewicz, Andrew Morton, John Stultz

On Wed, 2011-04-20 at 11:19 +0900, KOSAKI Motohiro wrote:
> > +     memcpy(tmp_comm, tsk->comm_buf, TASK_COMM_LEN);
> > +     tsk->comm = tmp_comm;
> >       /*
> > -      * Threads may access current->comm without holding
> > -      * the task lock, so write the string carefully.
> > -      * Readers without a lock may see incomplete new
> > -      * names but are safe from non-terminating string reads.
> > +      * Make sure no one is still looking at tsk->comm_buf
> >        */
> > -     memset(tsk->comm, 0, TASK_COMM_LEN);
> > -     wmb();
> > -     strlcpy(tsk->comm, buf, sizeof(tsk->comm));
> > +     synchronize_rcu();
> 
> The doc says,
> 
> /**
>  * synchronize_rcu - wait until a grace period has elapsed.

Yeah, yeah... see "completely untested". :)

I'll see if dropping the locks or something else equally hackish can
help.  

-- Dave


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-20  1:50               ` KOSAKI Motohiro
@ 2011-04-20  2:19                 ` KOSAKI Motohiro
  2011-04-20  2:46                   ` Dave Hansen
  0 siblings, 1 reply; 37+ messages in thread
From: KOSAKI Motohiro @ 2011-04-20  2:19 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Dave Hansen, David Rientjes, linux-mm,
	linux-kernel, Johannes Weiner, Michal Nazarewicz, Andrew Morton,
	John Stultz

> The concept is OK with me, but AFAIK some callers are now using ARRAY_SIZE(tsk->comm)
> or sizeof(tsk->comm). Those callers probably need to be changed too.

one more correction.

>  void set_task_comm(struct task_struct *tsk, char *buf)
>  {
> +	char tmp_comm[TASK_COMM_LEN];
> +
>  	task_lock(tsk);
>  
> +	memcpy(tmp_comm, tsk->comm_buf, TASK_COMM_LEN);
> +	tsk->comm = tmp_comm;
>  	/*
> -	 * Threads may access current->comm without holding
> -	 * the task lock, so write the string carefully.
> -	 * Readers without a lock may see incomplete new
> -	 * names but are safe from non-terminating string reads.
> +	 * Make sure no one is still looking at tsk->comm_buf
>  	 */
> -	memset(tsk->comm, 0, TASK_COMM_LEN);
> -	wmb();
> -	strlcpy(tsk->comm, buf, sizeof(tsk->comm));
> +	synchronize_rcu();

The doc says,

/**
 * synchronize_rcu - wait until a grace period has elapsed.
 *

And here it is called under a spinlock (task_lock()), where the caller must
not block waiting for a grace period.




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-20  1:41             ` Dave Hansen
@ 2011-04-20  1:50               ` KOSAKI Motohiro
  2011-04-20  2:19                 ` KOSAKI Motohiro
  0 siblings, 1 reply; 37+ messages in thread
From: KOSAKI Motohiro @ 2011-04-20  1:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: kosaki.motohiro, David Rientjes, linux-mm, linux-kernel,
	Johannes Weiner, Michal Nazarewicz, Andrew Morton, John Stultz

Hi

(Cc'ing John Stultz, who is the /proc/<pid>/comm author. I think we need to hear his opinion.)

> On Tue, 2011-04-19 at 14:21 -0700, David Rientjes wrote:
> > On Tue, 19 Apr 2011, KOSAKI Motohiro wrote:
> > > The rule is,
> > > 
> > > 1) writing comm
> > > 	need task_lock
> > > 2) read _another_ thread's comm
> > > 	need task_lock
> > > 3) read own comm
> > > 	no need task_lock
> > 
> > That was true a while ago, but you now need to protect every thread's 
> > ->comm with get_task_comm() or ensuring task_lock() is held to protect 
> > against /proc/pid/comm which can change other thread's ->comm.  That was 
> > different before when prctl(PR_SET_NAME) would only operate on current, so 
> > no lock was needed when reading current->comm.
> 
> Everybody still goes through set_task_comm() to _set_ it, though.  That
> means that the worst case scenario that we get is output truncated
> (possibly to nothing).  We already have at least one existing user in
> mm/ (kmemleak) that thinks this is OK.  I'd tend to err in the direction
> of taking a truncated or empty task name over possibly locking up the
> system.
> 
> There are also plenty of instances of current->comm going in to the
> kernel these days.  I count 18 added since 2.6.37.
> 
> As for a long-term fix, locks probably aren't the answer.  Would
> something like this completely untested patch work?  It would have the
> added bonus that it keeps tsk->comm users working for the moment.  We
> could eventually add an rcu_read_lock()-annotated access function.

The concept is OK with me, but AFAIK some callers are now using ARRAY_SIZE(tsk->comm)
or sizeof(tsk->comm). Those callers probably need to be changed too.

Thanks.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-19 21:21           ` David Rientjes
  2011-04-20  0:39             ` KOSAKI Motohiro
@ 2011-04-20  1:41             ` Dave Hansen
  2011-04-20  1:50               ` KOSAKI Motohiro
  1 sibling, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2011-04-20  1:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, linux-mm, linux-kernel, Johannes Weiner,
	Michal Nazarewicz, Andrew Morton

On Tue, 2011-04-19 at 14:21 -0700, David Rientjes wrote:
> On Tue, 19 Apr 2011, KOSAKI Motohiro wrote:
> > The rule is,
> > 
> > 1) writing comm
> > 	need task_lock
> > 2) read _another_ thread's comm
> > 	need task_lock
> > 3) read own comm
> > 	no need task_lock
> 
> That was true a while ago, but you now need to protect every thread's 
> ->comm with get_task_comm() or ensuring task_lock() is held to protect 
> against /proc/pid/comm which can change other thread's ->comm.  That was 
> different before when prctl(PR_SET_NAME) would only operate on current, so 
> no lock was needed when reading current->comm.

Everybody still goes through set_task_comm() to _set_ it, though.  That
means that the worst case scenario that we get is output truncated
(possibly to nothing).  We already have at least one existing user in
mm/ (kmemleak) that thinks this is OK.  I'd tend to err in the direction
of taking a truncated or empty task name over possibly locking up the
system.
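
To illustrate why a reader racing with this writer sees at worst a
truncated or empty name, here is a small userspace sketch that replays the
zero-fill-then-copy sequence one byte at a time and checks that every
intermediate state is still NUL-terminated. It is single-threaded and
assumes the clear is visible before the copy starts (the role the wmb()
plays in set_task_comm()); COMM_LEN and the function names are made up for
this example.

```c
#include <stddef.h>
#include <string.h>

#define COMM_LEN 16 /* assumed to mirror TASK_COMM_LEN */

/* Returns 1 if buf holds a NUL somewhere within its COMM_LEN bytes. */
static int is_terminated(const char *buf)
{
	return memchr(buf, '\0', COMM_LEN) != NULL;
}

/*
 * Replays the writer one store at a time and inspects what an unlocked
 * reader could observe after each store. Returns how many intermediate
 * states were NOT NUL-terminated.
 */
static int count_unterminated_states(char *comm, const char *new_name)
{
	int bad = 0;
	size_t i;

	/* Step 1: clear the buffer. Zeroing bytes can only add terminators. */
	for (i = 0; i < COMM_LEN; i++) {
		comm[i] = '\0';
		bad += !is_terminated(comm);
	}

	/* Step 2: copy at most COMM_LEN - 1 bytes; the last byte stays NUL. */
	for (i = 0; i < COMM_LEN - 1 && new_name[i] != '\0'; i++) {
		comm[i] = new_name[i];
		bad += !is_terminated(comm);
	}

	return bad;
}
```

Starting from a buffer that already holds a terminated name, the function
returns 0: a reader may catch an empty or partial name, but it never runs
off the end of the buffer.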

There are also plenty of instances of current->comm going in to the
kernel these days.  I count 18 added since 2.6.37.

As for a long-term fix, locks probably aren't the answer.  Would
something like this completely untested patch work?  It would have the
added bonus that it keeps tsk->comm users working for the moment.  We
could eventually add an rcu_read_lock()-annotated access function.

---

 linux-2.6.git-dave/fs/exec.c                 |   22 +++++++++++++++-------
 linux-2.6.git-dave/include/linux/init_task.h |    3 ++-
 linux-2.6.git-dave/include/linux/sched.h     |    3 ++-
 3 files changed, 19 insertions(+), 9 deletions(-)

diff -puN mm/page_alloc.c~tsk_comm mm/page_alloc.c
diff -puN include/linux/sched.h~tsk_comm include/linux/sched.h
--- linux-2.6.git/include/linux/sched.h~tsk_comm	2011-04-19 18:23:58.435013635 -0700
+++ linux-2.6.git-dave/include/linux/sched.h	2011-04-19 18:24:44.651034028 -0700
@@ -1334,10 +1334,11 @@ struct task_struct {
 					 * credentials (COW) */
 	struct cred *replacement_session_keyring; /* for KEYCTL_SESSION_TO_PARENT */
 
-	char comm[TASK_COMM_LEN]; /* executable name excluding path
+	char comm_buf[TASK_COMM_LEN]; /* executable name excluding path
 				     - access with [gs]et_task_comm (which lock
 				       it with task_lock())
 				     - initialized normally by setup_new_exec */
+	char __rcu *comm;
 /* file system info */
 	int link_count, total_link_count;
 #ifdef CONFIG_SYSVIPC
diff -puN include/linux/init_task.h~tsk_comm include/linux/init_task.h
--- linux-2.6.git/include/linux/init_task.h~tsk_comm	2011-04-19 18:24:48.703035798 -0700
+++ linux-2.6.git-dave/include/linux/init_task.h	2011-04-19 18:25:22.147050279 -0700
@@ -161,7 +161,8 @@ extern struct cred init_cred;
 	.group_leader	= &tsk,						\
 	RCU_INIT_POINTER(.real_cred, &init_cred),			\
 	RCU_INIT_POINTER(.cred, &init_cred),				\
-	.comm		= "swapper",					\
+	.comm_buf	= "swapper",					\
+	.comm		= tsk.comm_buf,					\
 	.thread		= INIT_THREAD,					\
 	.fs		= &init_fs,					\
 	.files		= &init_files,					\
diff -puN fs/exec.c~tsk_comm fs/exec.c
--- linux-2.6.git/fs/exec.c~tsk_comm	2011-04-19 18:25:32.283054625 -0700
+++ linux-2.6.git-dave/fs/exec.c	2011-04-19 18:37:47.991485880 -0700
@@ -1007,17 +1007,25 @@ char *get_task_comm(char *buf, struct ta
 
 void set_task_comm(struct task_struct *tsk, char *buf)
 {
+	char tmp_comm[TASK_COMM_LEN];
+
 	task_lock(tsk);
 
+	memcpy(tmp_comm, tsk->comm_buf, TASK_COMM_LEN);
+	tsk->comm = tmp_comm;
 	/*
-	 * Threads may access current->comm without holding
-	 * the task lock, so write the string carefully.
-	 * Readers without a lock may see incomplete new
-	 * names but are safe from non-terminating string reads.
+	 * Make sure no one is still looking at tsk->comm_buf
 	 */
-	memset(tsk->comm, 0, TASK_COMM_LEN);
-	wmb();
-	strlcpy(tsk->comm, buf, sizeof(tsk->comm));
+	synchronize_rcu();
+
+	strlcpy(tsk->comm_buf, buf, sizeof(tsk->comm_buf));
+	tsk->comm = tsk->comm_buf;
+	/*
+	 * Make sure no one is still looking at the
+	 * stack-allocated buffer
+	 */
+	synchronize_rcu();
+
 	task_unlock(tsk);
 	perf_event_comm(tsk);
 }


-- Dave


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-19 21:21           ` David Rientjes
@ 2011-04-20  0:39             ` KOSAKI Motohiro
  2011-04-20 20:24               ` David Rientjes
  2011-04-20  1:41             ` Dave Hansen
  1 sibling, 1 reply; 37+ messages in thread
From: KOSAKI Motohiro @ 2011-04-20  0:39 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Dave Hansen, linux-mm, linux-kernel,
	Johannes Weiner, Michal Nazarewicz, Andrew Morton

> On Tue, 19 Apr 2011, KOSAKI Motohiro wrote:
> 
> > The rule is,
> > 
> > 1) writing comm
> > 	need task_lock
> > 2) read _another_ thread's comm
> > 	need task_lock
> > 3) read own comm
> > 	no need task_lock
> > 
> 
> That was true a while ago, but you now need to protect every thread's
> ->comm with get_task_comm(), or by ensuring task_lock() is held, to guard
> against /proc/pid/comm, which can change another thread's ->comm.  It was
> different before, when prctl(PR_SET_NAME) would only operate on current, so
> no lock was needed when reading current->comm.

Right. /proc/pid/comm is evil. We have to fix it; otherwise we would need to
change every user of current->comm, and there are a great many of them.





* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-18 20:57       ` Dave Hansen
@ 2011-04-19 21:23         ` David Rientjes
  0 siblings, 0 replies; 37+ messages in thread
From: David Rientjes @ 2011-04-19 21:23 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Nazarewicz,
	Andrew Morton

On Mon, 18 Apr 2011, Dave Hansen wrote:

> > It shouldn't be a follow-on patch since you're introducing a new feature 
> > here (vmalloc allocation failure warnings) and what I'm identifying is a 
> > race in the access to current->comm.  A bug fix for a race should always 
> > precede a feature that touches the same code.
> 
> Dude.  Seriously.  Glass house!  a63d83f4
> 

Not sure what you're implying here.  The commit you've identified is the 
oom killer rewrite and the oom killer is very specific about making sure 
to always hold task_lock() whenever dereferencing ->comm, even for 
current, to guard against /proc/pid/comm or prctl().  The oom killer is 
different from your usecase, however, because we can always take 
task_lock(current) in the oom killer because it's in a blockable context, 
whereas page allocation warnings can occur in a superset of those contexts.


* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-19  0:44         ` KOSAKI Motohiro
@ 2011-04-19 21:21           ` David Rientjes
  2011-04-20  0:39             ` KOSAKI Motohiro
  2011-04-20  1:41             ` Dave Hansen
  0 siblings, 2 replies; 37+ messages in thread
From: David Rientjes @ 2011-04-19 21:21 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Dave Hansen, linux-mm, linux-kernel, Johannes Weiner,
	Michal Nazarewicz, Andrew Morton

On Tue, 19 Apr 2011, KOSAKI Motohiro wrote:

> The rule is,
> 
> 1) writing comm
> 	need task_lock
> 2) read _another_ thread's comm
> 	need task_lock
> 3) read own comm
> 	no need task_lock
> 

That was true a while ago, but you now need to protect every thread's
->comm with get_task_comm(), or by ensuring task_lock() is held, to guard
against /proc/pid/comm, which can change another thread's ->comm.  It was
different before, when prctl(PR_SET_NAME) would only operate on current, so
no lock was needed when reading current->comm.


* [PATCH 1/2] break out page allocation warning code
@ 2011-04-19 16:21 Dave Hansen
  0 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2011-04-19 16:21 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Johannes Weiner, David Rientjes, Michal Nazarewicz,
	akpm, Dave Hansen


Changed since last version:
 - use pr_warning instead of open-coded KERN_WARNING
 - eliminate 'wait' variable since we only use it once

--

This originally started as a simple patch to give vmalloc()
some more verbose output on failure on top of the plain
page allocator messages.  Johannes suggested that it might
be nicer to lead with the vmalloc() info _before_ the page
allocator messages.

But, I do think there's a lot of value in what
__alloc_pages_slowpath() does with its filtering and so
forth.

This patch creates a new function which other allocators
can call instead of relying on the internal page allocator
warnings.  It also gives this function private rate-limiting
which separates it from other printk_ratelimit() users.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---

 linux-2.6.git-dave/include/linux/mm.h |    2 +
 linux-2.6.git-dave/mm/page_alloc.c    |   62 ++++++++++++++++++++++------------
 2 files changed, 43 insertions(+), 21 deletions(-)

diff -puN include/linux/mm.h~break-out-alloc-failure-messages include/linux/mm.h
--- linux-2.6.git/include/linux/mm.h~break-out-alloc-failure-messages	2011-04-18 14:59:51.278529173 -0700
+++ linux-2.6.git-dave/include/linux/mm.h	2011-04-18 14:59:51.290529171 -0700
@@ -1365,6 +1365,8 @@ extern void si_meminfo(struct sysinfo * 
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 extern int after_bootmem;
 
+extern void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...);
+
 extern void setup_per_cpu_pageset(void);
 
 extern void zone_pcp_update(struct zone *zone);
diff -puN mm/page_alloc.c~break-out-alloc-failure-messages mm/page_alloc.c
--- linux-2.6.git/mm/page_alloc.c~break-out-alloc-failure-messages	2011-04-18 14:59:51.282529173 -0700
+++ linux-2.6.git-dave/mm/page_alloc.c	2011-04-18 14:59:51.294529170 -0700
@@ -54,6 +54,7 @@
 #include <trace/events/kmem.h>
 #include <linux/ftrace_event.h>
 #include <linux/memcontrol.h>
+#include <linux/ratelimit.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -1734,6 +1735,45 @@ static inline bool should_suppress_show_
 	return ret;
 }
 
+static DEFINE_RATELIMIT_STATE(nopage_rs,
+		DEFAULT_RATELIMIT_INTERVAL,
+		DEFAULT_RATELIMIT_BURST);
+
+void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
+{
+	va_list args;
+	unsigned int filter = SHOW_MEM_FILTER_NODES;
+
+	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
+		return;
+
+	/*
+	 * This documents exceptions given to allocations in certain
+	 * contexts that are allowed to allocate outside current's set
+	 * of allowed nodes.
+	 */
+	if (!(gfp_mask & __GFP_NOMEMALLOC))
+		if (test_thread_flag(TIF_MEMDIE) ||
+		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
+			filter &= ~SHOW_MEM_FILTER_NODES;
+	if (in_interrupt() || !(gfp_mask & __GFP_WAIT))
+		filter &= ~SHOW_MEM_FILTER_NODES;
+
+	if (fmt) {
+		printk(KERN_WARNING);
+		va_start(args, fmt);
+		vprintk(fmt, args);
+		va_end(args);
+	}
+
+	pr_warning("%s: page allocation failure: order:%d, mode:0x%x\n",
+		   current->comm, order, gfp_mask);
+
+	dump_stack();
+	if (!should_suppress_show_mem())
+		show_mem(filter);
+}
+
 static inline int
 should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 				unsigned long pages_reclaimed)
@@ -2176,27 +2216,7 @@ rebalance:
 	}
 
 nopage:
-	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
-		unsigned int filter = SHOW_MEM_FILTER_NODES;
-
-		/*
-		 * This documents exceptions given to allocations in certain
-		 * contexts that are allowed to allocate outside current's set
-		 * of allowed nodes.
-		 */
-		if (!(gfp_mask & __GFP_NOMEMALLOC))
-			if (test_thread_flag(TIF_MEMDIE) ||
-			    (current->flags & (PF_MEMALLOC | PF_EXITING)))
-				filter &= ~SHOW_MEM_FILTER_NODES;
-		if (in_interrupt() || !wait)
-			filter &= ~SHOW_MEM_FILTER_NODES;
-
-		pr_warning("%s: page allocation failure. order:%d, mode:0x%x\n",
-			current->comm, order, gfp_mask);
-		dump_stack();
-		if (!should_suppress_show_mem())
-			show_mem(filter);
-	}
+	warn_alloc_failed(gfp_mask, order, NULL);
 	return page;
 got_pg:
 	if (kmemcheck_enabled)
_
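
The "private rate-limiting" mentioned in the changelog can be pictured with
a small userspace model.  This is a sketch only: struct ratelimit and
ratelimit_ok() are hypothetical names, time is passed in as a plain tick
count, and the kernel's real __ratelimit() differs in detail (locking,
reporting of suppressed messages).  The point it shows is that each state
object has its own budget, so one noisy caller does not consume another
caller's allowance.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: at most `burst` events per `interval` ticks. */
struct ratelimit {
	long interval;	/* window length in ticks */
	int burst;	/* events allowed per window */
	long begin;	/* tick at which the current window started */
	int printed;	/* events allowed so far in this window */
	int missed;	/* events suppressed so far in this window */
};

/* Return true if the caller may emit its message at tick `now`. */
static bool ratelimit_ok(struct ratelimit *rs, long now)
{
	assert(rs);

	if (rs->interval <= 0)
		return true;	/* rate limiting disabled */

	if (now - rs->begin >= rs->interval) {
		/* Window expired: start a new one with a fresh budget. */
		rs->begin = now;
		rs->printed = 0;
		rs->missed = 0;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}

	rs->missed++;
	return false;
}
```

With two independent states, exhausting one leaves the other untouched,
which mirrors how nopage_rs in the patch is kept apart from the global
printk_ratelimit() state.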


* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-18 21:22       ` Dave Hansen
@ 2011-04-19  0:44         ` KOSAKI Motohiro
  2011-04-19 21:21           ` David Rientjes
  0 siblings, 1 reply; 37+ messages in thread
From: KOSAKI Motohiro @ 2011-04-19  0:44 UTC (permalink / raw)
  To: Dave Hansen
  Cc: kosaki.motohiro, David Rientjes, linux-mm, linux-kernel,
	Johannes Weiner, Michal Nazarewicz, Andrew Morton

> On Mon, 2011-04-18 at 13:25 -0700, David Rientjes wrote:
> > It shouldn't be a follow-on patch since you're introducing a new feature 
> > here (vmalloc allocation failure warnings) and what I'm identifying is a 
> > race in the access to current->comm.  A bug fix for a race should always 
> > precede a feature that touches the same code.
> 
> So, what's the race here?  kmemleak.c says:
> 
>                 /*
>                  * There is a small chance of a race with set_task_comm(),
>                  * however using get_task_comm() here may cause locking
>                  * dependency issues with current->alloc_lock. In the worst
>                  * case, the command line is not correct.
>                  */
>                 strncpy(object->comm, current->comm, sizeof(object->comm));
> 
> Are we trying to make sure we don't print out a partially updated
> tsk->comm?  Or is there a bigger issue here, like potential oopses or
> kernel information leaks?
> 
> 1. We require that no memory allocator ever holds the task lock for the
>    current task, and we audit all the existing GFP_ATOMIC users in the
>    kernel to ensure they're not doing it now.  In the case of a problem,
>    we end up with a hung kernel while trying to get a message out to the
>    console.
> 2. We remove current->comm from the printk(), and deal with the
>    information loss.
> 3. We live with corrupted output, like the other ~400 in-kernel users of
>    ->comm do. (I'm assuming that very few of them hold the task lock). 
>    In the case of a race, we get junk on the console, but an otherwise
>    fine bug report (the way it is now).
> 4. We come up with some way to print out current->comm, without holding
>    any task locks.  We could do this by copying it somewhere safe on
>    each context switch.  Could probably also do it with RCU.
> 
> There's also a very, very odd message in fs/exec.c:
> 
>         /*
>          * Threads may access current->comm without holding
>          * the task lock, so write the string carefully.
>          * Readers without a lock may see incomplete new
>          * names but are safe from non-terminating string reads.
>          */

The rule is,

1) writing comm
	need task_lock
2) read _another_ thread's comm
	need task_lock
3) read own comm
	no need task_lock

That's the reason why oom-kill.c needs task_lock and a lot of other places
don't. I agree this is very strange; it is that way only for historical reasons.

The comment in set_task_comm() explains the race against (3).
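
The "write the string carefully" scheme that the set_task_comm() comment
refers to (memset, write barrier, then a bounded copy) can be sketched in
userspace as follows.  The function name set_comm_carefully() is
hypothetical, and atomic_thread_fence() merely stands in for the kernel's
wmb().  The sketch is exercised single-threaded, since plain char accesses
racing across threads are formally undefined in ISO C, while the kernel
relies on its own memory model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

#define TASK_COMM_LEN 16

/*
 * Hypothetical analogue of the old set_task_comm() body.  The last byte of
 * comm is zeroed and never written again, so a lockless reader can at worst
 * see an incomplete new name, never an unterminated string.
 */
static void set_comm_carefully(char *comm, const char *name)
{
	size_t len;

	assert(comm && name);

	len = strlen(name);
	if (len > TASK_COMM_LEN - 1)
		len = TASK_COMM_LEN - 1;	/* truncate, keep room for the NUL */

	memset(comm, 0, TASK_COMM_LEN);			/* clear the whole buffer */
	atomic_thread_fence(memory_order_release);	/* stands in for wmb() */
	memcpy(comm, name, len);			/* never touches the last byte */
}
```

This covers rule (3) only: the reader gets a terminated string, but not
necessarily a consistent one.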

Thanks.




* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-18 20:25     ` David Rientjes
  2011-04-18 20:57       ` Dave Hansen
  2011-04-18 21:03       ` Dave Hansen
@ 2011-04-18 21:22       ` Dave Hansen
  2011-04-19  0:44         ` KOSAKI Motohiro
  2 siblings, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2011-04-18 21:22 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Nazarewicz,
	Andrew Morton

On Mon, 2011-04-18 at 13:25 -0700, David Rientjes wrote:
> It shouldn't be a follow-on patch since you're introducing a new feature 
> here (vmalloc allocation failure warnings) and what I'm identifying is a 
> race in the access to current->comm.  A bug fix for a race should always 
> precede a feature that touches the same code.

So, what's the race here?  kmemleak.c says:

                /*
                 * There is a small chance of a race with set_task_comm(),
                 * however using get_task_comm() here may cause locking
                 * dependency issues with current->alloc_lock. In the worst
                 * case, the command line is not correct.
                 */
                strncpy(object->comm, current->comm, sizeof(object->comm));

Are we trying to make sure we don't print out a partially updated
tsk->comm?  Or is there a bigger issue here, like potential oopses or
kernel information leaks?

1. We require that no memory allocator ever holds the task lock for the
   current task, and we audit all the existing GFP_ATOMIC users in the
   kernel to ensure they're not doing it now.  In the case of a problem,
   we end up with a hung kernel while trying to get a message out to the
   console.
2. We remove current->comm from the printk(), and deal with the
   information loss.
3. We live with corrupted output, like the other ~400 in-kernel users of
   ->comm do. (I'm assuming that very few of them hold the task lock). 
   In the case of a race, we get junk on the console, but an otherwise
   fine bug report (the way it is now).
4. We come up with some way to print out current->comm, without holding
   any task locks.  We could do this by copying it somewhere safe on
   each context switch.  Could probably also do it with RCU.

There's also a very, very odd message in fs/exec.c:

        /*
         * Threads may access current->comm without holding
         * the task lock, so write the string carefully.
         * Readers without a lock may see incomplete new
         * names but are safe from non-terminating string reads.
         */

-- Dave



* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-18 20:25     ` David Rientjes
  2011-04-18 20:57       ` Dave Hansen
@ 2011-04-18 21:03       ` Dave Hansen
  2011-04-18 21:22       ` Dave Hansen
  2 siblings, 0 replies; 37+ messages in thread
From: Dave Hansen @ 2011-04-18 21:03 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Nazarewicz,
	Andrew Morton

On Mon, 2011-04-18 at 13:25 -0700, David Rientjes wrote:
>  - provide a statically-allocated buffer to use for get_task_comm() and 
>    copy current->comm over before printing it, or
> 
>  - take task_lock(current) to protect against /proc/pid/comm.
> 
> The latter probably isn't safe because we could potentially already be 
> holding task_lock(current) during a GFP_ATOMIC page allocation. 

I'm not sure get_task_comm() is suitable, either.  It takes the task
lock:

char *get_task_comm(char *buf, struct task_struct *tsk)
{
        /* buf must be at least sizeof(tsk->comm) in size */
        task_lock(tsk);
        strncpy(buf, tsk->comm, sizeof(tsk->comm));
        task_unlock(tsk);
        return buf;
}

-- Dave



* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-18 20:25     ` David Rientjes
@ 2011-04-18 20:57       ` Dave Hansen
  2011-04-19 21:23         ` David Rientjes
  2011-04-18 21:03       ` Dave Hansen
  2011-04-18 21:22       ` Dave Hansen
  2 siblings, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2011-04-18 20:57 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Nazarewicz,
	Andrew Morton

On Mon, 2011-04-18 at 13:25 -0700, David Rientjes wrote:
> It shouldn't be a follow-on patch since you're introducing a new feature 
> here (vmalloc allocation failure warnings) and what I'm identifying is a 
> race in the access to current->comm.  A bug fix for a race should always 
> precede a feature that touches the same code.

Dude.  Seriously.  Glass house!  a63d83f4

I'll go look into it, though.

-- Dave



* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-18 15:10   ` Dave Hansen
@ 2011-04-18 20:25     ` David Rientjes
  2011-04-18 20:57       ` Dave Hansen
                         ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: David Rientjes @ 2011-04-18 20:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Nazarewicz,
	Andrew Morton

On Mon, 18 Apr 2011, Dave Hansen wrote:

> > > +void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
> > > +{
> > > +	va_list args;
> > > +	unsigned int filter = SHOW_MEM_FILTER_NODES;
> > > +	const gfp_t wait = gfp_mask & __GFP_WAIT;
> > > +
> > 
> > "wait" is unnecessary.  You didn't do "const gfp_t nowarn = gfp_mask & 
> > __GFP_NOWARN;" for the same reason.
> 
> This line is just a copy from the __alloc_pages_slowpath() one.  I guess
> we only use it once, so I've got no problem killing it.
> 
> > > +	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
> > > +		return;
> > > +
> > > +	/*
> > > +	 * This documents exceptions given to allocations in certain
> > > +	 * contexts that are allowed to allocate outside current's set
> > > +	 * of allowed nodes.
> > > +	 */
> > > +	if (!(gfp_mask & __GFP_NOMEMALLOC))
> > > +		if (test_thread_flag(TIF_MEMDIE) ||
> > > +		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
> > > +			filter &= ~SHOW_MEM_FILTER_NODES;
> > > +	if (in_interrupt() || !wait)
> > > +		filter &= ~SHOW_MEM_FILTER_NODES;
> > > +
> > > +	if (fmt) {
> > > +		printk(KERN_WARNING);
> > > +		va_start(args, fmt);
> > > +		vprintk(fmt, args);
> > > +		va_end(args);
> > > +	}
> > > +
> > > +	printk(KERN_WARNING "%s: page allocation failure: order:%d, mode:0x%x\n",
> > > +			current->comm, order, gfp_mask);
> > 
> > pr_warning()?
> 
> OK, I'll change it back.
> 
> > current->comm should always be printed with get_task_comm() to avoid 
> > racing with /proc/pid/comm.  Since this function can be called potentially 
> > deep in the stack, you may need to serialize this with a 
> > statically-allocated buffer.
> 
> This code was already in page_alloc.c.  I'm simply breaking it out here
> trying to keep the changes down to what is needed minimally to move the
> code.  Correcting this preexisting problem sounds like a great follow-on
> patch.
> 

It shouldn't be a follow-on patch since you're introducing a new feature 
here (vmalloc allocation failure warnings) and what I'm identifying is a 
race in the access to current->comm.  A bug fix for a race should always 
precede a feature that touches the same code.

There are two options for fixing the race:

 - provide a statically-allocated buffer to use for get_task_comm() and 
   copy current->comm over before printing it, or

 - take task_lock(current) to protect against /proc/pid/comm.

The latter probably isn't safe because we could potentially already be 
holding task_lock(current) during a GFP_ATOMIC page allocation.
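
The first option (snapshot the name under the lock, then print from the
private copy) might look roughly like this in a userspace model.  Everything
here is an assumption made for illustration: struct task, the atomic_flag
spinlock standing in for task_lock(), and format_failure() are hypothetical,
and a small on-stack buffer is used for simplicity where the text above
suggests a statically-allocated one.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

#define TASK_COMM_LEN 16

/* Hypothetical stand-in for task_struct; a spinning flag replaces task_lock(). */
struct task {
	atomic_flag lock;
	char comm[TASK_COMM_LEN];
};

static void task_lock(struct task *t)
{
	while (atomic_flag_test_and_set(&t->lock))
		;	/* spins forever if the caller already holds the lock */
}

static void task_unlock(struct task *t)
{
	atomic_flag_clear(&t->lock);
}

/* Copy the name under the lock; buf must hold TASK_COMM_LEN bytes. */
static char *get_task_comm(char *buf, struct task *t)
{
	assert(buf && t);
	task_lock(t);
	strncpy(buf, t->comm, TASK_COMM_LEN);
	task_unlock(t);
	buf[TASK_COMM_LEN - 1] = '\0';	/* guarantee termination */
	return buf;
}

/* Format the warning from the snapshot, with no lock held while formatting. */
static int format_failure(char *out, size_t n, struct task *t,
			  int order, unsigned int mode)
{
	char comm[TASK_COMM_LEN];

	get_task_comm(comm, t);
	return snprintf(out, n,
			"%s: page allocation failure: order:%d, mode:0x%x\n",
			comm, order, mode);
}
```

The comment in task_lock() marks the hazard raised for the second option:
if the allocating context already holds the lock, taking it again never
returns.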


* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-17  0:02 ` David Rientjes
@ 2011-04-18 15:10   ` Dave Hansen
  2011-04-18 20:25     ` David Rientjes
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2011-04-18 15:10 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Nazarewicz,
	Andrew Morton

On Sat, 2011-04-16 at 17:02 -0700, David Rientjes wrote:
> > +void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
> > +{
> > +	va_list args;
> > +	unsigned int filter = SHOW_MEM_FILTER_NODES;
> > +	const gfp_t wait = gfp_mask & __GFP_WAIT;
> > +
> 
> "wait" is unnecessary.  You didn't do "const gfp_t nowarn = gfp_mask & 
> __GFP_NOWARN;" for the same reason.

This line is just a copy from the __alloc_pages_slowpath() one.  I guess
we only use it once, so I've got no problem killing it.

> > +	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
> > +		return;
> > +
> > +	/*
> > +	 * This documents exceptions given to allocations in certain
> > +	 * contexts that are allowed to allocate outside current's set
> > +	 * of allowed nodes.
> > +	 */
> > +	if (!(gfp_mask & __GFP_NOMEMALLOC))
> > +		if (test_thread_flag(TIF_MEMDIE) ||
> > +		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
> > +			filter &= ~SHOW_MEM_FILTER_NODES;
> > +	if (in_interrupt() || !wait)
> > +		filter &= ~SHOW_MEM_FILTER_NODES;
> > +
> > +	if (fmt) {
> > +		printk(KERN_WARNING);
> > +		va_start(args, fmt);
> > +		vprintk(fmt, args);
> > +		va_end(args);
> > +	}
> > +
> > +	printk(KERN_WARNING "%s: page allocation failure: order:%d, mode:0x%x\n",
> > +			current->comm, order, gfp_mask);
> 
> pr_warning()?

OK, I'll change it back.

> current->comm should always be printed with get_task_comm() to avoid 
> racing with /proc/pid/comm.  Since this function can be called potentially 
> deep in the stack, you may need to serialize this with a 
> statically-allocated buffer.

This code was already in page_alloc.c.  I'm simply breaking it out here
trying to keep the changes down to what is needed minimally to move the
code.  Correcting this preexisting problem sounds like a great follow-on
patch.

-- Dave



* Re: [PATCH 1/2] break out page allocation warning code
  2011-04-15 17:04 Dave Hansen
@ 2011-04-17  0:02 ` David Rientjes
  2011-04-18 15:10   ` Dave Hansen
  0 siblings, 1 reply; 37+ messages in thread
From: David Rientjes @ 2011-04-17  0:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, linux-kernel, Johannes Weiner, Michal Nazarewicz,
	Andrew Morton

On Fri, 15 Apr 2011, Dave Hansen wrote:

> 
> This originally started as a simple patch to give vmalloc()
> some more verbose output on failure on top of the plain
> page allocator messages.  Johannes suggested that it might
> be nicer to lead with the vmalloc() info _before_ the page
> allocator messages.
> 
> But, I do think there's a lot of value in what
> __alloc_pages_slowpath() does with its filtering and so
> forth.
> 
> This patch creates a new function which other allocators
> can call instead of relying on the internal page allocator
> warnings.  It also gives this function private rate-limiting
> which separates it from other printk_ratelimit() users.
> 
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
> ---
> 
>  linux-2.6.git-dave/include/linux/mm.h |    2 +
>  linux-2.6.git-dave/mm/page_alloc.c    |   63 ++++++++++++++++++++++------------
>  2 files changed, 44 insertions(+), 21 deletions(-)
> 
> diff -puN include/linux/mm.h~break-out-alloc-failure-messages include/linux/mm.h
> --- linux-2.6.git/include/linux/mm.h~break-out-alloc-failure-messages	2011-04-15 08:44:06.911445625 -0700
> +++ linux-2.6.git-dave/include/linux/mm.h	2011-04-15 08:45:10.087416551 -0700
> @@ -1365,6 +1365,8 @@ extern void si_meminfo(struct sysinfo * 
>  extern void si_meminfo_node(struct sysinfo *val, int nid);
>  extern int after_bootmem;
>  
> +extern void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...);
> +
>  extern void setup_per_cpu_pageset(void);
>  
>  extern void zone_pcp_update(struct zone *zone);
> diff -puN mm/page_alloc.c~break-out-alloc-failure-messages mm/page_alloc.c
> --- linux-2.6.git/mm/page_alloc.c~break-out-alloc-failure-messages	2011-04-15 08:44:06.915445623 -0700
> +++ linux-2.6.git-dave/mm/page_alloc.c	2011-04-15 08:48:34.255321834 -0700
> @@ -54,6 +54,7 @@
>  #include <trace/events/kmem.h>
>  #include <linux/ftrace_event.h>
>  #include <linux/memcontrol.h>
> +#include <linux/ratelimit.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -1734,6 +1735,46 @@ static inline bool should_suppress_show_
>  	return ret;
>  }
>  
> +static DEFINE_RATELIMIT_STATE(nopage_rs,
> +		DEFAULT_RATELIMIT_INTERVAL,
> +		DEFAULT_RATELIMIT_BURST);
> +
> +void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
> +{
> +	va_list args;
> +	unsigned int filter = SHOW_MEM_FILTER_NODES;
> +	const gfp_t wait = gfp_mask & __GFP_WAIT;
> +

"wait" is unnecessary.  You didn't do "const gfp_t nowarn = gfp_mask & 
__GFP_NOWARN;" for the same reason.

> +	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
> +		return;
> +
> +	/*
> +	 * This documents exceptions given to allocations in certain
> +	 * contexts that are allowed to allocate outside current's set
> +	 * of allowed nodes.
> +	 */
> +	if (!(gfp_mask & __GFP_NOMEMALLOC))
> +		if (test_thread_flag(TIF_MEMDIE) ||
> +		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
> +			filter &= ~SHOW_MEM_FILTER_NODES;
> +	if (in_interrupt() || !wait)
> +		filter &= ~SHOW_MEM_FILTER_NODES;
> +
> +	if (fmt) {
> +		printk(KERN_WARNING);
> +		va_start(args, fmt);
> +		vprintk(fmt, args);
> +		va_end(args);
> +	}
> +
> +	printk(KERN_WARNING "%s: page allocation failure: order:%d, mode:0x%x\n",
> +			current->comm, order, gfp_mask);

pr_warning()?

current->comm should always be printed with get_task_comm() to avoid 
racing with /proc/pid/comm.  Since this function can be called potentially 
deep in the stack, you may need to serialize this with a 
statically-allocated buffer.

> +
> +	dump_stack();
> +	if (!should_suppress_show_mem())
> +		show_mem(filter);
> +}
> +
>  static inline int
>  should_alloc_retry(gfp_t gfp_mask, unsigned int order,
>  				unsigned long pages_reclaimed)
> @@ -2176,27 +2217,7 @@ rebalance:
>  	}
>  
>  nopage:
> -	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
> -		unsigned int filter = SHOW_MEM_FILTER_NODES;
> -
> -		/*
> -		 * This documents exceptions given to allocations in certain
> -		 * contexts that are allowed to allocate outside current's set
> -		 * of allowed nodes.
> -		 */
> -		if (!(gfp_mask & __GFP_NOMEMALLOC))
> -			if (test_thread_flag(TIF_MEMDIE) ||
> -			    (current->flags & (PF_MEMALLOC | PF_EXITING)))
> -				filter &= ~SHOW_MEM_FILTER_NODES;
> -		if (in_interrupt() || !wait)
> -			filter &= ~SHOW_MEM_FILTER_NODES;
> -
> -		pr_warning("%s: page allocation failure. order:%d, mode:0x%x\n",
> -			current->comm, order, gfp_mask);
> -		dump_stack();
> -		if (!should_suppress_show_mem())
> -			show_mem(filter);
> -	}
> +	warn_alloc_failed(gfp_mask, order, NULL);
>  	return page;
>  got_pg:
>  	if (kmemcheck_enabled)


* [PATCH 1/2] break out page allocation warning code
@ 2011-04-15 17:04 Dave Hansen
  2011-04-17  0:02 ` David Rientjes
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Hansen @ 2011-04-15 17:04 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Johannes Weiner, David Rientjes, Michal Nazarewicz,
	akpm, Dave Hansen


This originally started as a simple patch to give vmalloc()
some more verbose output on failure on top of the plain
page allocator messages.  Johannes suggested that it might
be nicer to lead with the vmalloc() info _before_ the page
allocator messages.

But, I do think there's a lot of value in what
__alloc_pages_slowpath() does with its filtering and so
forth.

This patch creates a new function which other allocators
can call instead of relying on the internal page allocator
warnings.  It also gives this function private rate-limiting
which separates it from other printk_ratelimit() users.

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---

 linux-2.6.git-dave/include/linux/mm.h |    2 +
 linux-2.6.git-dave/mm/page_alloc.c    |   63 ++++++++++++++++++++++------------
 2 files changed, 44 insertions(+), 21 deletions(-)

diff -puN include/linux/mm.h~break-out-alloc-failure-messages include/linux/mm.h
--- linux-2.6.git/include/linux/mm.h~break-out-alloc-failure-messages	2011-04-15 08:44:06.911445625 -0700
+++ linux-2.6.git-dave/include/linux/mm.h	2011-04-15 08:45:10.087416551 -0700
@@ -1365,6 +1365,8 @@ extern void si_meminfo(struct sysinfo * 
 extern void si_meminfo_node(struct sysinfo *val, int nid);
 extern int after_bootmem;
 
+extern void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...);
+
 extern void setup_per_cpu_pageset(void);
 
 extern void zone_pcp_update(struct zone *zone);
diff -puN mm/page_alloc.c~break-out-alloc-failure-messages mm/page_alloc.c
--- linux-2.6.git/mm/page_alloc.c~break-out-alloc-failure-messages	2011-04-15 08:44:06.915445623 -0700
+++ linux-2.6.git-dave/mm/page_alloc.c	2011-04-15 08:48:34.255321834 -0700
@@ -54,6 +54,7 @@
 #include <trace/events/kmem.h>
 #include <linux/ftrace_event.h>
 #include <linux/memcontrol.h>
+#include <linux/ratelimit.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -1734,6 +1735,46 @@ static inline bool should_suppress_show_
 	return ret;
 }
 
+static DEFINE_RATELIMIT_STATE(nopage_rs,
+		DEFAULT_RATELIMIT_INTERVAL,
+		DEFAULT_RATELIMIT_BURST);
+
+void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
+{
+	va_list args;
+	unsigned int filter = SHOW_MEM_FILTER_NODES;
+	const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
+		return;
+
+	/*
+	 * This documents exceptions given to allocations in certain
+	 * contexts that are allowed to allocate outside current's set
+	 * of allowed nodes.
+	 */
+	if (!(gfp_mask & __GFP_NOMEMALLOC))
+		if (test_thread_flag(TIF_MEMDIE) ||
+		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
+			filter &= ~SHOW_MEM_FILTER_NODES;
+	if (in_interrupt() || !wait)
+		filter &= ~SHOW_MEM_FILTER_NODES;
+
+	if (fmt) {
+		printk(KERN_WARNING);
+		va_start(args, fmt);
+		vprintk(fmt, args);
+		va_end(args);
+	}
+
+	printk(KERN_WARNING "%s: page allocation failure: order:%d, mode:0x%x\n",
+			current->comm, order, gfp_mask);
+
+	dump_stack();
+	if (!should_suppress_show_mem())
+		show_mem(filter);
+}
+
 static inline int
 should_alloc_retry(gfp_t gfp_mask, unsigned int order,
 				unsigned long pages_reclaimed)
@@ -2176,27 +2217,7 @@ rebalance:
 	}
 
 nopage:
-	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
-		unsigned int filter = SHOW_MEM_FILTER_NODES;
-
-		/*
-		 * This documents exceptions given to allocations in certain
-		 * contexts that are allowed to allocate outside current's set
-		 * of allowed nodes.
-		 */
-		if (!(gfp_mask & __GFP_NOMEMALLOC))
-			if (test_thread_flag(TIF_MEMDIE) ||
-			    (current->flags & (PF_MEMALLOC | PF_EXITING)))
-				filter &= ~SHOW_MEM_FILTER_NODES;
-		if (in_interrupt() || !wait)
-			filter &= ~SHOW_MEM_FILTER_NODES;
-
-		pr_warning("%s: page allocation failure. order:%d, mode:0x%x\n",
-			current->comm, order, gfp_mask);
-		dump_stack();
-		if (!should_suppress_show_mem())
-			show_mem(filter);
-	}
+	warn_alloc_failed(gfp_mask, order, NULL);
 	return page;
 got_pg:
 	if (kmemcheck_enabled)
_
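
The calling convention of warn_alloc_failed() (an optional printf-style
prefix supplied by the caller, followed by the generic page allocator
message) can be illustrated with a userspace analogue.  This is a sketch
under assumptions: format_alloc_failure() is a hypothetical name, it formats
into a buffer instead of calling printk(), and it omits the rate limiting,
current->comm, dump_stack() and show_mem() parts of the real function.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical analogue of warn_alloc_failed().  Writes the optional
 * caller prefix and then the generic message into out, and returns the
 * resulting string length.  fmt may be NULL, as in the nopage: call site.
 */
static size_t format_alloc_failure(char *out, size_t n, int order,
				   unsigned int mode, const char *fmt, ...)
{
	size_t used;

	assert(out && n > 0);
	out[0] = '\0';

	/* Caller-specific detail comes first, if any was supplied. */
	if (fmt) {
		va_list args;

		va_start(args, fmt);
		vsnprintf(out, n, fmt, args);
		va_end(args);
	}

	/* The generic message always follows the prefix. */
	used = strlen(out);
	snprintf(out + used, n - used,
		 "page allocation failure: order:%d, mode:0x%x\n",
		 order, mode);

	return strlen(out);
}
```

A caller such as the page allocator passes NULL and gets only the generic
line, while another allocator can lead with its own details, for example
format_alloc_failure(buf, sizeof(buf), 2, 0xd2, "vmalloc: allocation
failure: %lu bytes\n", 4096UL), where the prefix text is made up for the
example.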


end of thread, other threads:[~2011-04-29  0:05 UTC | newest]

Thread overview: 37+ messages
2011-04-08 20:22 [PATCH 1/2] break out page allocation warning code Dave Hansen
2011-04-08 20:22 ` [PATCH 2/2] print vmalloc() state after allocation failures Dave Hansen
2011-04-08 20:39   ` David Rientjes
2011-04-08 20:47     ` Dave Hansen
2011-04-08 20:37 ` [PATCH 1/2] break out page allocation warning code David Rientjes
2011-04-08 20:43   ` Dave Hansen
     [not found] ` <BANLkTi=OnDX53nOZcaaMmqXRBcWicam0xg@mail.gmail.com>
2011-04-08 21:02   ` Dave Hansen
2011-04-11 10:20     ` Michal Nazarewicz
2011-04-15 17:04 Dave Hansen
2011-04-17  0:02 ` David Rientjes
2011-04-18 15:10   ` Dave Hansen
2011-04-18 20:25     ` David Rientjes
2011-04-18 20:57       ` Dave Hansen
2011-04-19 21:23         ` David Rientjes
2011-04-18 21:03       ` Dave Hansen
2011-04-18 21:22       ` Dave Hansen
2011-04-19  0:44         ` KOSAKI Motohiro
2011-04-19 21:21           ` David Rientjes
2011-04-20  0:39             ` KOSAKI Motohiro
2011-04-20 20:24               ` David Rientjes
2011-04-20 20:34                 ` john stultz
2011-04-21  1:29                   ` KOSAKI Motohiro
2011-04-25  4:21                     ` KOSAKI Motohiro
2011-04-26 19:27                     ` john stultz
2011-04-27 23:51                       ` David Rientjes
2011-04-28  0:32                         ` john stultz
2011-04-28  1:29                           ` john stultz
2011-04-28 22:48                             ` David Rientjes
2011-04-28 23:48                               ` john stultz
2011-04-29  0:04                                 ` john stultz
2011-04-26 21:25                     ` john stultz
2011-04-28  3:05                       ` KOSAKI Motohiro
2011-04-20  1:41             ` Dave Hansen
2011-04-20  1:50               ` KOSAKI Motohiro
2011-04-20  2:19                 ` KOSAKI Motohiro
2011-04-20  2:46                   ` Dave Hansen
2011-04-19 16:21 Dave Hansen
