linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* mpol_to_str revisited.
@ 2012-10-08 15:09 Dave Jones
  2012-10-08 15:15 ` Dave Jones
                   ` (3 more replies)
  0 siblings, 4 replies; 59+ messages in thread
From: Dave Jones @ 2012-10-08 15:09 UTC (permalink / raw)
  To: Linux Kernel; +Cc: bhutchings, linux-mm, Linus Torvalds, Andrew Morton

Last month I sent in 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a to remove
a user triggerable BUG in mempolicy.

Ben Hutchings pointed out to me that my change introduced a potential leak
of stack contents to userspace, because none of the callers check the return value.

This patch adds the missing return checking, and also clears the buffer beforehand.

Reported-by: Ben Hutchings <bhutchings@solarflare.com>
Cc: stable@kernel.org
Signed-off-by: Dave Jones <davej@redhat.com>

--- 
unanswered question: why are the buffer sizes here different ? which is correct?


diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/fs/proc/task_mmu.c linux-dj/fs/proc/task_mmu.c
--- src/git-trees/kernel/linux/fs/proc/task_mmu.c	2012-05-31 22:32:46.778150675 -0400
+++ linux-dj/fs/proc/task_mmu.c	2012-10-04 19:31:41.269988984 -0400
@@ -1162,6 +1162,7 @@ static int show_numa_map(struct seq_file
 	struct mm_walk walk = {};
 	struct mempolicy *pol;
 	int n;
+	int ret;
 	char buffer[50];
 
 	if (!mm)
@@ -1178,7 +1179,11 @@ static int show_numa_map(struct seq_file
 	walk.mm = mm;
 
 	pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
-	mpol_to_str(buffer, sizeof(buffer), pol, 0);
+	memset(buffer, 0, sizeof(buffer));
+	ret = mpol_to_str(buffer, sizeof(buffer), pol, 0);
+	if (ret < 0)
+		return 0;
+
 	mpol_cond_put(pol);
 
 	seq_printf(m, "%08lx %s", vma->vm_start, buffer);
diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/mm/shmem.c linux-dj/mm/shmem.c
--- src/git-trees/kernel/linux/mm/shmem.c	2012-10-02 15:49:51.977277944 -0400
+++ linux-dj/mm/shmem.c	2012-10-04 19:32:28.862949907 -0400
@@ -885,13 +885,15 @@ redirty:
 static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
 {
 	char buffer[64];
+	int ret;
 
 	if (!mpol || mpol->mode == MPOL_DEFAULT)
 		return;		/* show nothing */
 
-	mpol_to_str(buffer, sizeof(buffer), mpol, 1);
-
-	seq_printf(seq, ",mpol=%s", buffer);
+	memset(buffer, 0, sizeof(buffer));
+	ret = mpol_to_str(buffer, sizeof(buffer), mpol, 1);
+	if (ret > 0)
+		seq_printf(seq, ",mpol=%s", buffer);
 }
 
 static struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: mpol_to_str revisited.
  2012-10-08 15:09 mpol_to_str revisited Dave Jones
@ 2012-10-08 15:15 ` Dave Jones
  2012-10-08 20:46   ` David Rientjes
  2012-10-08 20:35 ` David Rientjes
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 59+ messages in thread
From: Dave Jones @ 2012-10-08 15:15 UTC (permalink / raw)
  To: Linux Kernel, bhutchings, linux-mm, Linus Torvalds, Andrew Morton

On Mon, Oct 08, 2012 at 11:09:49AM -0400, Dave Jones wrote:
 > Last month I sent in 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a to remove
 > a user triggerable BUG in mempolicy.
 > 
 > Ben Hutchings pointed out to me that my change introduced a potential leak
 > of stack contents to userspace, because none of the callers check the return value.
 > 
 > This patch adds the missing return checking, and also clears the buffer beforehand.
 > 
 > Reported-by: Ben Hutchings <bhutchings@solarflare.com>
 > Cc: stable@kernel.org
 > Signed-off-by: Dave Jones <davej@redhat.com>
 > 
 > --- 
 > unanswered question: why are the buffer sizes here different ? which is correct?

A further unanswered question is how the state got so screwed up that we hit that
default case at all.  Looking at the original report: https://lkml.org/lkml/2012/9/6/356
What's in RAX looks suspiciously like left-over slab poison.

If pol->mode was poisoned, that smells like we have a race where policy is getting freed
while another process is reading it.

Am I missing something, or is there no locking around that at all ?

	Dave


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: mpol_to_str revisited.
  2012-10-08 15:09 mpol_to_str revisited Dave Jones
  2012-10-08 15:15 ` Dave Jones
@ 2012-10-08 20:35 ` David Rientjes
  2012-10-08 20:52   ` Dave Jones
  2012-10-09  0:33 ` Ben Hutchings
  2012-10-16  2:34 ` KOSAKI Motohiro
  3 siblings, 1 reply; 59+ messages in thread
From: David Rientjes @ 2012-10-08 20:35 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, bhutchings, linux-mm, Linus Torvalds,
	Andrew Morton

On Mon, 8 Oct 2012, Dave Jones wrote:

> unanswered question: why are the buffer sizes here different ? which is correct?
> 

Given the current set of mempolicy modes and flags, it's 34, but this can 
change if new modes or flags are added with longer names.  I see no reason 
why shmem shouldn't round up to the nearest power-of-2 of 64 like it 
already does, but 50 is certainly safe as well in task_mmu.c.

> diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/fs/proc/task_mmu.c linux-dj/fs/proc/task_mmu.c
> --- src/git-trees/kernel/linux/fs/proc/task_mmu.c	2012-05-31 22:32:46.778150675 -0400
> +++ linux-dj/fs/proc/task_mmu.c	2012-10-04 19:31:41.269988984 -0400
> @@ -1162,6 +1162,7 @@ static int show_numa_map(struct seq_file
>  	struct mm_walk walk = {};
>  	struct mempolicy *pol;
>  	int n;
> +	int ret;
>  	char buffer[50];
>  
>  	if (!mm)
> @@ -1178,7 +1179,11 @@ static int show_numa_map(struct seq_file
>  	walk.mm = mm;
>  
>  	pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
> -	mpol_to_str(buffer, sizeof(buffer), pol, 0);
> +	memset(buffer, 0, sizeof(buffer));
> +	ret = mpol_to_str(buffer, sizeof(buffer), pol, 0);
> +	if (ret < 0)
> +		return 0;

We should need the mpol_cond_put(pol) here before returning.

> +
>  	mpol_cond_put(pol);
>  
>  	seq_printf(m, "%08lx %s", vma->vm_start, buffer);
> diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/mm/shmem.c linux-dj/mm/shmem.c
> --- src/git-trees/kernel/linux/mm/shmem.c	2012-10-02 15:49:51.977277944 -0400
> +++ linux-dj/mm/shmem.c	2012-10-04 19:32:28.862949907 -0400
> @@ -885,13 +885,15 @@ redirty:
>  static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
>  {
>  	char buffer[64];
> +	int ret;
>  
>  	if (!mpol || mpol->mode == MPOL_DEFAULT)
>  		return;		/* show nothing */
>  
> -	mpol_to_str(buffer, sizeof(buffer), mpol, 1);
> -
> -	seq_printf(seq, ",mpol=%s", buffer);
> +	memset(buffer, 0, sizeof(buffer));
> +	ret = mpol_to_str(buffer, sizeof(buffer), mpol, 1);
> +	if (ret > 0)
> +		seq_printf(seq, ",mpol=%s", buffer);
>  }
>  
>  static struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: mpol_to_str revisited.
  2012-10-08 15:15 ` Dave Jones
@ 2012-10-08 20:46   ` David Rientjes
  0 siblings, 0 replies; 59+ messages in thread
From: David Rientjes @ 2012-10-08 20:46 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, bhutchings, linux-mm, Linus Torvalds,
	Andrew Morton

On Mon, 8 Oct 2012, Dave Jones wrote:

> If pol->mode was poisoned, that smells like we have a race where policy is getting freed
> while another process is reading it.
> 
> Am I missing something, or is there no locking around that at all ?
> 

The only thing that is held during the read() is a reference to the 
task_struct so it doesn't disappear from under us.  The protection needed 
for a task's mempolicy, however, is task_lock() and that is not held.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: mpol_to_str revisited.
  2012-10-08 20:35 ` David Rientjes
@ 2012-10-08 20:52   ` Dave Jones
  2012-10-16  0:48     ` David Rientjes
  0 siblings, 1 reply; 59+ messages in thread
From: Dave Jones @ 2012-10-08 20:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linux Kernel, bhutchings, linux-mm, Linus Torvalds, Andrew Morton

On Mon, Oct 08, 2012 at 01:35:42PM -0700, David Rientjes wrote:

 > > unanswered question: why are the buffer sizes here different ? which is correct?
 > > 
 > Given the current set of mempolicy modes and flags, it's 34, but this can 
 > change if new modes or flags are added with longer names.  I see no reason 
 > why shmem shouldn't round up to the nearest power-of-2 of 64 like it 
 > already does, but 50 is certainly safe as well in task_mmu.c.

Ok. I'll leave that for now.
 
 > > diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/fs/proc/task_mmu.c linux-dj/fs/proc/task_mmu.c
 > > --- src/git-trees/kernel/linux/fs/proc/task_mmu.c	2012-05-31 22:32:46.778150675 -0400
 > > +++ linux-dj/fs/proc/task_mmu.c	2012-10-04 19:31:41.269988984 -0400
 > > @@ -1162,6 +1162,7 @@ static int show_numa_map(struct seq_file
 > >  	struct mm_walk walk = {};
 > >  	struct mempolicy *pol;
 > >  	int n;
 > > +	int ret;
 > >  	char buffer[50];
 > >  
 > >  	if (!mm)
 > > @@ -1178,7 +1179,11 @@ static int show_numa_map(struct seq_file
 > >  	walk.mm = mm;
 > >  
 > >  	pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
 > > -	mpol_to_str(buffer, sizeof(buffer), pol, 0);
 > > +	memset(buffer, 0, sizeof(buffer));
 > > +	ret = mpol_to_str(buffer, sizeof(buffer), pol, 0);
 > > +	if (ret < 0)
 > > +		return 0;
 > 
 > We should need the mpol_cond_put(pol) here before returning.

good catch. I'll respin the patch later with this changed.

thanks,
 
	Dave



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: mpol_to_str revisited.
  2012-10-08 15:09 mpol_to_str revisited Dave Jones
  2012-10-08 15:15 ` Dave Jones
  2012-10-08 20:35 ` David Rientjes
@ 2012-10-09  0:33 ` Ben Hutchings
  2012-10-16  2:34 ` KOSAKI Motohiro
  3 siblings, 0 replies; 59+ messages in thread
From: Ben Hutchings @ 2012-10-09  0:33 UTC (permalink / raw)
  To: Dave Jones; +Cc: Linux Kernel, linux-mm, Linus Torvalds, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 1018 bytes --]

On Mon, 2012-10-08 at 11:09 -0400, Dave Jones wrote:
> Last month I sent in 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a to remove
> a user triggerable BUG in mempolicy.
> 
> Ben Hutchings pointed out to me that my change introduced a potential leak
> of stack contents to userspace, because none of the callers check the return value.
> 
> This patch adds the missing return checking, and also clears the buffer beforehand.
>
> Reported-by: Ben Hutchings <bhutchings@solarflare.com>

I was wearing my other hat at the time (ben@decadent.org.uk).

> Cc: stable@kernel.org
> Signed-off-by: Dave Jones <davej@redhat.com>
> 
> --- 
> unanswered question: why are the buffer sizes here different ? which is correct?
[...]

Further question: why even use an intermediate buffer on the stack?
Both callers want to write the result to a seq_file.  Should mpol_str()
then be replaced with a seq_mpol()?

Ben.

-- 
Ben Hutchings
Who are all these weirdos? - David Bowie, about L-Space IRC channel #afp

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: mpol_to_str revisited.
  2012-10-08 20:52   ` Dave Jones
@ 2012-10-16  0:48     ` David Rientjes
  0 siblings, 0 replies; 59+ messages in thread
From: David Rientjes @ 2012-10-16  0:48 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, bhutchings, linux-mm, Linus Torvalds,
	Andrew Morton

On Mon, 8 Oct 2012, Dave Jones wrote:

>  > > diff -durpN '--exclude-from=/home/davej/.exclude' src/git-trees/kernel/linux/fs/proc/task_mmu.c linux-dj/fs/proc/task_mmu.c
>  > > --- src/git-trees/kernel/linux/fs/proc/task_mmu.c	2012-05-31 22:32:46.778150675 -0400
>  > > +++ linux-dj/fs/proc/task_mmu.c	2012-10-04 19:31:41.269988984 -0400
>  > > @@ -1162,6 +1162,7 @@ static int show_numa_map(struct seq_file
>  > >  	struct mm_walk walk = {};
>  > >  	struct mempolicy *pol;
>  > >  	int n;
>  > > +	int ret;
>  > >  	char buffer[50];
>  > >  
>  > >  	if (!mm)
>  > > @@ -1178,7 +1179,11 @@ static int show_numa_map(struct seq_file
>  > >  	walk.mm = mm;
>  > >  
>  > >  	pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
>  > > -	mpol_to_str(buffer, sizeof(buffer), pol, 0);
>  > > +	memset(buffer, 0, sizeof(buffer));
>  > > +	ret = mpol_to_str(buffer, sizeof(buffer), pol, 0);
>  > > +	if (ret < 0)
>  > > +		return 0;
>  > 
>  > We should need the mpol_cond_put(pol) here before returning.
> 
> good catch. I'll respin the patch later with this changed.
> 

Did you get a chance to fix this issue?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: mpol_to_str revisited.
  2012-10-08 15:09 mpol_to_str revisited Dave Jones
                   ` (2 preceding siblings ...)
  2012-10-09  0:33 ` Ben Hutchings
@ 2012-10-16  2:34 ` KOSAKI Motohiro
  2012-10-16  3:58   ` David Rientjes
  3 siblings, 1 reply; 59+ messages in thread
From: KOSAKI Motohiro @ 2012-10-16  2:34 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, bhutchings, linux-mm, Linus Torvalds,
	Andrew Morton

On Mon, Oct 8, 2012 at 11:09 AM, Dave Jones <davej@redhat.com> wrote:
> Last month I sent in 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a to remove
> a user triggerable BUG in mempolicy.
>
> Ben Hutchings pointed out to me that my change introduced a potential leak
> of stack contents to userspace, because none of the callers check the return value.
>
> This patch adds the missing return checking, and also clears the buffer beforehand.

I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix. we should
close a race (or kill remain ref count leak) if we still have.
Because of, this patch makes unstable /proc output and might lead to
userland confusing.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: mpol_to_str revisited.
  2012-10-16  2:34 ` KOSAKI Motohiro
@ 2012-10-16  3:58   ` David Rientjes
  2012-10-16  5:10     ` KOSAKI Motohiro
  0 siblings, 1 reply; 59+ messages in thread
From: David Rientjes @ 2012-10-16  3:58 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Dave Jones, Linux Kernel, bhutchings, linux-mm, Linus Torvalds,
	Andrew Morton

On Mon, 15 Oct 2012, KOSAKI Motohiro wrote:

> I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix.

It's certainly not a complete fix, but I think it's a much better result 
of the race, i.e. we don't panic anymore, we simply fail the read() 
instead.

> we should
> close a race (or kill remain ref count leak) if we still have.

As I mentioned earlier in the thread, the read() is done here on a task 
while only a reference to the task_struct is taken and we do not hold 
task_lock() which is required for task->mempolicy.  Once that is fixed, 
mpol_to_str() should never be called for !task->mempolicy so it will never 
need to return -EINVAL in such a condition.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: mpol_to_str revisited.
  2012-10-16  3:58   ` David Rientjes
@ 2012-10-16  5:10     ` KOSAKI Motohiro
  2012-10-16  6:10       ` David Rientjes
  0 siblings, 1 reply; 59+ messages in thread
From: KOSAKI Motohiro @ 2012-10-16  5:10 UTC (permalink / raw)
  To: David Rientjes
  Cc: Dave Jones, Linux Kernel, bhutchings, linux-mm, Linus Torvalds,
	Andrew Morton

On Mon, Oct 15, 2012 at 11:58 PM, David Rientjes <rientjes@google.com> wrote:
> On Mon, 15 Oct 2012, KOSAKI Motohiro wrote:
>
>> I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix.
>
> It's certainly not a complete fix, but I think it's a much better result
> of the race, i.e. we don't panic anymore, we simply fail the read()
> instead.

Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple. It bring
to caller complex. That's not good and have no worth.

>> we should
>> close a race (or kill remain ref count leak) if we still have.
>
> As I mentioned earlier in the thread, the read() is done here on a task
> while only a reference to the task_struct is taken and we do not hold
> task_lock() which is required for task->mempolicy.  Once that is fixed,
> mpol_to_str() should never be called for !task->mempolicy so it will never
> need to return -EINVAL in such a condition.

I agree that's obviously a bug and we should fix it.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: mpol_to_str revisited.
  2012-10-16  5:10     ` KOSAKI Motohiro
@ 2012-10-16  6:10       ` David Rientjes
  2012-10-16 23:39         ` KOSAKI Motohiro
  0 siblings, 1 reply; 59+ messages in thread
From: David Rientjes @ 2012-10-16  6:10 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Dave Jones, Linux Kernel, bhutchings, linux-mm, Linus Torvalds,
	Andrew Morton

On Tue, 16 Oct 2012, KOSAKI Motohiro wrote:

> >> I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix.
> >
> > It's certainly not a complete fix, but I think it's a much better result
> > of the race, i.e. we don't panic anymore, we simply fail the read()
> > instead.
> 
> Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple. It bring
> to caller complex. That's not good and have no worth.
> 

Before: the kernel panics, all workloads cease.
After: the file shows garbage, all workloads continue.

This is better, in my opinion, but at best it's only a judgment call and 
has no effect on anything.

I agree it would be better to respect the return value of mpol_to_str() 
since there are other possible error conditions other than a freed 
mempolicy, but let's not consider reverting 80de7c3138.  It is obviously 
not a full solution to the problem, though, and we need to serialize with 
task_lock().

Dave, are you interested in coming up with a patch?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: mpol_to_str revisited.
  2012-10-16  6:10       ` David Rientjes
@ 2012-10-16 23:39         ` KOSAKI Motohiro
  2012-10-17  0:12           ` David Rientjes
  0 siblings, 1 reply; 59+ messages in thread
From: KOSAKI Motohiro @ 2012-10-16 23:39 UTC (permalink / raw)
  To: David Rientjes
  Cc: Dave Jones, Linux Kernel, bhutchings, linux-mm, Linus Torvalds,
	Andrew Morton

On Tue, Oct 16, 2012 at 2:10 AM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 16 Oct 2012, KOSAKI Motohiro wrote:
>
>> >> I don't think 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a is right fix.
>> >
>> > It's certainly not a complete fix, but I think it's a much better result
>> > of the race, i.e. we don't panic anymore, we simply fail the read()
>> > instead.
>>
>> Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple. It bring
>> to caller complex. That's not good and have no worth.
>>
>
> Before: the kernel panics, all workloads cease.
> After: the file shows garbage, all workloads continue.
>
> This is better, in my opinion, but at best it's only a judgment call and
> has no effect on anything.

Kernel panics help to find our serious mistake.


> I agree it would be better to respect the return value of mpol_to_str()
> since there are other possible error conditions other than a freed
> mempolicy, but let's not consider reverting 80de7c3138.  It is obviously
> not a full solution to the problem, though, and we need to serialize with
> task_lock().

Sorry no. I will have to revert it. mempolicy have already a lot of
meaningless complex and bring us a lot of problems. I haven't
seen any reason adding more.


> Dave, are you interested in coming up with a patch?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: mpol_to_str revisited.
  2012-10-16 23:39         ` KOSAKI Motohiro
@ 2012-10-17  0:12           ` David Rientjes
  2012-10-17  0:31             ` [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps David Rientjes
  2012-10-17  1:33             ` mpol_to_str revisited KOSAKI Motohiro
  0 siblings, 2 replies; 59+ messages in thread
From: David Rientjes @ 2012-10-17  0:12 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Dave Jones, Linux Kernel, bhutchings, linux-mm, Linus Torvalds,
	Andrew Morton

On Tue, 16 Oct 2012, KOSAKI Motohiro wrote:

> >> Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple. It bring
> >> to caller complex. That's not good and have no worth.
> >>
> >
> > Before: the kernel panics, all workloads cease.
> > After: the file shows garbage, all workloads continue.
> >
> > This is better, in my opinion, but at best it's only a judgment call and
> > has no effect on anything.
> 
> Kernel panics help to find our serious mistake.
> 

Kernel panics are not your little debugging tool to let users suffer 
through for non-fatal issues.

> > I agree it would be better to respect the return value of mpol_to_str()
> > since there are other possible error conditions other than a freed
> > mempolicy, but let's not consider reverting 80de7c3138.  It is obviously
> > not a full solution to the problem, though, and we need to serialize with
> > task_lock().
> 
> Sorry no. I will have to revert it.

Feel free to revert anything you wish in your own tree, I couldn't care 
less.  If you try to propose it upstream, Andrew will surely ask you to 
justify the BUG(), good luck on that.

I'll reply to this message with the fix that I think is best.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-17  0:12           ` David Rientjes
@ 2012-10-17  0:31             ` David Rientjes
  2012-10-17  1:38               ` KOSAKI Motohiro
  2012-10-17  4:05               ` Dave Jones
  2012-10-17  1:33             ` mpol_to_str revisited KOSAKI Motohiro
  1 sibling, 2 replies; 59+ messages in thread
From: David Rientjes @ 2012-10-17  0:31 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: Dave Jones, KOSAKI Motohiro, bhutchings, Konstantin Khlebnikov,
	Naoya Horiguchi, Hugh Dickins, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm

When reading /proc/pid/numa_maps, it's possible to return the contents of 
the stack where the mempolicy string should be printed if the policy gets 
freed from beneath us.

This happens because mpol_to_str() may return an error the 
stack-allocated buffer is then printed without ever being stored.

There are two possible error conditions in mpol_to_str():

 - if the buffer allocated is insufficient for the string to be stored, 
   and

 - if the mempolicy has an invalid mode.

The first error condition is not triggered in any of the callers to 
mpol_to_str(): at least 50 bytes is always allocated on the stack and this 
is sufficient for the string to be written.  A future patch should convert 
this into BUILD_BUG_ON() since we know the maximum strlen possible, but 
that's not -rc material.

The second error condition is possible if a race occurs in dropping a 
reference to a task's mempolicy causing it to be freed during the read().  
The slab poison value is then used for the mode and mpol_to_str() returns 
-EINVAL.

This race is only possible because get_vma_policy() believes that 
mm->mmap_sem protects task->mempolicy, which isn't true.  The exit path 
does not hold mm->mmap_sem when dropping the reference or setting 
task->mempolicy to NULL: it uses task_lock(task) instead.

Thus, it's required for the caller of a task mempolicy to hold 
task_lock(task) while grabbing the mempolicy and reading it.  Callers with 
a vma policy store their mempolicy earlier and can simply increment the 
reference count so it's guaranteed not to be freed.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/proc/task_mmu.c |    7 +++++--
 mm/mempolicy.c     |    5 ++---
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1158,6 +1158,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
 	struct vm_area_struct *vma = v;
 	struct numa_maps *md = &numa_priv->md;
 	struct file *file = vma->vm_file;
+	struct task_struct *task = proc_priv->task;
 	struct mm_struct *mm = vma->vm_mm;
 	struct mm_walk walk = {};
 	struct mempolicy *pol;
@@ -1177,9 +1178,11 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
 	walk.private = md;
 	walk.mm = mm;
 
-	pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
+	task_lock(task);
+	pol = get_vma_policy(task, vma, vma->vm_start);
 	mpol_to_str(buffer, sizeof(buffer), pol, 0);
 	mpol_cond_put(pol);
+	task_unlock(task);
 
 	seq_printf(m, "%08lx %s", vma->vm_start, buffer);
 
@@ -1189,7 +1192,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
 	} else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) {
 		seq_printf(m, " heap");
 	} else {
-		pid_t tid = vm_is_stack(proc_priv->task, vma, is_pid);
+		pid_t tid = vm_is_stack(task, vma, is_pid);
 		if (tid != 0) {
 			/*
 			 * Thread stack in /proc/PID/task/TID/maps or
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0b78fb9..d04a8a5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
  *
  * Returns effective policy for a VMA at specified address.
  * Falls back to @task or system default policy, as necessary.
- * Current or other task's task mempolicy and non-shared vma policies
- * are protected by the task's mmap_sem, which must be held for read by
- * the caller.
+ * Current or other task's task mempolicy and non-shared vma policies must be
+ * protected by task_lock(task) by the caller.
  * Shared policies [those marked as MPOL_F_SHARED] require an extra reference
  * count--added by the get_policy() vm_op, as appropriate--to protect against
  * freeing by another task.  It is the caller's responsibility to free the

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: mpol_to_str revisited.
  2012-10-17  0:12           ` David Rientjes
  2012-10-17  0:31             ` [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps David Rientjes
@ 2012-10-17  1:33             ` KOSAKI Motohiro
  1 sibling, 0 replies; 59+ messages in thread
From: KOSAKI Motohiro @ 2012-10-17  1:33 UTC (permalink / raw)
  To: David Rientjes
  Cc: Dave Jones, Linux Kernel, bhutchings, linux-mm, Linus Torvalds,
	Andrew Morton

On Tue, Oct 16, 2012 at 8:12 PM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 16 Oct 2012, KOSAKI Motohiro wrote:
>
>> >> Even though 80de7c3138ee9fd86a98696fd2cf7ad89b995d0a itself is simple. It bring
>> >> to caller complex. That's not good and have no worth.
>> >>
>> >
>> > Before: the kernel panics, all workloads cease.
>> > After: the file shows garbage, all workloads continue.
>> >
>> > This is better, in my opinion, but at best it's only a judgment call and
>> > has no effect on anything.
>>
>> Kernel panics help to find our serious mistake.
>
> Kernel panics are not your little debugging tool to let users suffer
> through for non-fatal issues.

use after free is fatal, no doubt.

>
>> > I agree it would be better to respect the return value of mpol_to_str()
>> > since there are other possible error conditions other than a freed
>> > mempolicy, but let's not consider reverting 80de7c3138.  It is obviously
>> > not a full solution to the problem, though, and we need to serialize with
>> > task_lock().
>>
>> Sorry no. I will have to revert it.
>
> Feel free to revert anything you wish in your own tree, I couldn't care
> less.  If you try to propose it upstream, Andrew will surely ask you to
> justify the BUG(), good luck on that.

Yeah.
I'm ok just remove both BUG() and EINVAL, but current situation (i.e. ignoring
EINVAL by caller) is surely bad. So, just revert is best IMHO.


>
> I'll reply to this message with the fix that I think is best.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-17  0:31             ` [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps David Rientjes
@ 2012-10-17  1:38               ` KOSAKI Motohiro
  2012-10-17  1:49                 ` David Rientjes
  2012-10-17  4:05               ` Dave Jones
  1 sibling, 1 reply; 59+ messages in thread
From: KOSAKI Motohiro @ 2012-10-17  1:38 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Linus Torvalds, Dave Jones, bhutchings,
	Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Tue, Oct 16, 2012 at 8:31 PM, David Rientjes <rientjes@google.com> wrote:
> When reading /proc/pid/numa_maps, it's possible to return the contents of
> the stack where the mempolicy string should be printed if the policy gets
> freed from beneath us.
>
> This happens because mpol_to_str() may return an error the
> stack-allocated buffer is then printed without ever being stored.
>
> There are two possible error conditions in mpol_to_str():
>
>  - if the buffer allocated is insufficient for the string to be stored,
>    and
>
>  - if the mempolicy has an invalid mode.
>
> The first error condition is not triggered in any of the callers to
> mpol_to_str(): at least 50 bytes is always allocated on the stack and this
> is sufficient for the string to be written.  A future patch should convert
> this into BUILD_BUG_ON() since we know the maximum strlen possible, but
> that's not -rc material.
>
> The second error condition is possible if a race occurs in dropping a
> reference to a task's mempolicy causing it to be freed during the read().
> The slab poison value is then used for the mode and mpol_to_str() returns
> -EINVAL.
>
> This race is only possible because get_vma_policy() believes that
> mm->mmap_sem protects task->mempolicy, which isn't true.  The exit path
> does not hold mm->mmap_sem when dropping the reference or setting
> task->mempolicy to NULL: it uses task_lock(task) instead.
>
> Thus, it's required for the caller of a task mempolicy to hold
> task_lock(task) while grabbing the mempolicy and reading it.  Callers with
> a vma policy store their mempolicy earlier and can simply increment the
> reference count so it's guaranteed not to be freed.
>
> Reported-by: Dave Jones <davej@redhat.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  fs/proc/task_mmu.c |    7 +++++--
>  mm/mempolicy.c     |    5 ++---
>  2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1158,6 +1158,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
>         struct vm_area_struct *vma = v;
>         struct numa_maps *md = &numa_priv->md;
>         struct file *file = vma->vm_file;
> +       struct task_struct *task = proc_priv->task;
>         struct mm_struct *mm = vma->vm_mm;
>         struct mm_walk walk = {};
>         struct mempolicy *pol;
> @@ -1177,9 +1178,11 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
>         walk.private = md;
>         walk.mm = mm;
>
> -       pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
> +       task_lock(task);
> +       pol = get_vma_policy(task, vma, vma->vm_start);
>         mpol_to_str(buffer, sizeof(buffer), pol, 0);
>         mpol_cond_put(pol);
> +       task_unlock(task);
>
>         seq_printf(m, "%08lx %s", vma->vm_start, buffer);
>
> @@ -1189,7 +1192,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
>         } else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) {
>                 seq_printf(m, " heap");
>         } else {
> -               pid_t tid = vm_is_stack(proc_priv->task, vma, is_pid);
> +               pid_t tid = vm_is_stack(task, vma, is_pid);
>                 if (tid != 0) {
>                         /*
>                          * Thread stack in /proc/PID/task/TID/maps or
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 0b78fb9..d04a8a5 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
>   *
>   * Returns effective policy for a VMA at specified address.
>   * Falls back to @task or system default policy, as necessary.
> - * Current or other task's task mempolicy and non-shared vma policies
> - * are protected by the task's mmap_sem, which must be held for read by
> - * the caller.
> + * Current or other task's task mempolicy and non-shared vma policies must be
> + * protected by task_lock(task) by the caller.

This is not correct. mmap_sem is needed for protecting vma. task_lock()
is needed to close vs exit race only when task != current. In other word,
caller must held both mmap_sem and task_lock if task != current.




>   * Shared policies [those marked as MPOL_F_SHARED] require an extra reference
>   * count--added by the get_policy() vm_op, as appropriate--to protect against
>   * freeing by another task.  It is the caller's responsibility to free the

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-17  1:38               ` KOSAKI Motohiro
@ 2012-10-17  1:49                 ` David Rientjes
  2012-10-17  1:53                   ` KOSAKI Motohiro
  0 siblings, 1 reply; 59+ messages in thread
From: David Rientjes @ 2012-10-17  1:49 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Linus Torvalds, Dave Jones, bhutchings,
	Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Tue, 16 Oct 2012, KOSAKI Motohiro wrote:

> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 0b78fb9..d04a8a5 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
> >   *
> >   * Returns effective policy for a VMA at specified address.
> >   * Falls back to @task or system default policy, as necessary.
> > - * Current or other task's task mempolicy and non-shared vma policies
> > - * are protected by the task's mmap_sem, which must be held for read by
> > - * the caller.
> > + * Current or other task's task mempolicy and non-shared vma policies must be
> > + * protected by task_lock(task) by the caller.
> 
> This is not correct. mmap_sem is needed for protecting vma. task_lock()
> is needed to close vs exit race only when task != current. In other word,
> caller must held both mmap_sem and task_lock if task != current.
> 

The comment is specifically addressing non-shared vma policies, you do not 
need to hold mmap_sem to access another thread's mempolicy.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-17  1:49                 ` David Rientjes
@ 2012-10-17  1:53                   ` KOSAKI Motohiro
  0 siblings, 0 replies; 59+ messages in thread
From: KOSAKI Motohiro @ 2012-10-17  1:53 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Linus Torvalds, Dave Jones, bhutchings,
	Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Tue, Oct 16, 2012 at 9:49 PM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 16 Oct 2012, KOSAKI Motohiro wrote:
>
>> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> > index 0b78fb9..d04a8a5 100644
>> > --- a/mm/mempolicy.c
>> > +++ b/mm/mempolicy.c
>> > @@ -1536,9 +1536,8 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
>> >   *
>> >   * Returns effective policy for a VMA at specified address.
>> >   * Falls back to @task or system default policy, as necessary.
>> > - * Current or other task's task mempolicy and non-shared vma policies
>> > - * are protected by the task's mmap_sem, which must be held for read by
>> > - * the caller.
>> > + * Current or other task's task mempolicy and non-shared vma policies must be
>> > + * protected by task_lock(task) by the caller.
>>
>> This is not correct. mmap_sem is needed for protecting vma. task_lock()
>> is needed to close vs exit race only when task != current. In other word,
>> caller must held both mmap_sem and task_lock if task != current.
>
> The comment is specifically addressing non-shared vma policies, you do not
> need to hold mmap_sem to access another thread's mempolicy.

I didn't say old comment is true. I just only your new comment also false.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-17  0:31             ` [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps David Rientjes
  2012-10-17  1:38               ` KOSAKI Motohiro
@ 2012-10-17  4:05               ` Dave Jones
  2012-10-17  5:24                 ` David Rientjes
  1 sibling, 1 reply; 59+ messages in thread
From: Dave Jones @ 2012-10-17  4:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, bhutchings,
	Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Tue, Oct 16, 2012 at 05:31:23PM -0700, David Rientjes wrote:

 > -	pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
 > +	task_lock(task);
 > +	pol = get_vma_policy(task, vma, vma->vm_start);
 >  	mpol_to_str(buffer, sizeof(buffer), pol, 0);
 >  	mpol_cond_put(pol);
 > +	task_unlock(task);

This seems to cause some fallout for me..

BUG: sleeping function called from invalid context at kernel/mutex.c:269
in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2
3 locks on stack by trinity-child2/8558:
 #0: held:     (&p->lock){+.+.+.}, instance: ffff88010c9a00b0, at: [<ffffffff8120cd1f>] seq_lseek+0x3f/0x120
 #1: held:     (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: [<ffffffff81254437>] m_start+0xa7/0x190
 #2: held:     (&(&p->alloc_lock)->rlock){+.+...}, instance: ffff88011fc64f30, at: [<ffffffff81254f8f>] show_numa_map+0x14f/0x610
Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32
Call Trace:
 [<ffffffff810ae4ec>] __might_sleep+0x14c/0x200
 [<ffffffff816bdf4e>] mutex_lock_nested+0x2e/0x50
 [<ffffffff811c43a3>] mpol_shared_policy_lookup+0x33/0x90
 [<ffffffff8118d5c3>] shmem_get_policy+0x33/0x40
 [<ffffffff811c31fa>] get_vma_policy+0x3a/0x90
 [<ffffffff81254fa3>] show_numa_map+0x163/0x610
 [<ffffffff81255b10>] ? pid_maps_open+0x20/0x20
 [<ffffffff81255980>] ? pagemap_hugetlb_range+0xf0/0xf0
 [<ffffffff81255483>] show_pid_numa_map+0x13/0x20
 [<ffffffff8120c902>] traverse+0xf2/0x230
 [<ffffffff8120cd8b>] seq_lseek+0xab/0x120
 [<ffffffff811e6c0b>] sys_lseek+0x7b/0xb0
 [<ffffffff816ca088>] tracesys+0xe1/0xe6


same problem, different syscall..


BUG: sleeping function called from invalid context at kernel/mutex.c:269
in_atomic(): 1, irqs_disabled(): 0, pid: 21996, name: trinity-child3
3 locks on stack by trinity-child3/21996:
 #0: held:     (&p->lock){+.+.+.}, instance: ffff88008d712c08, at: [<ffffffff8120ce3d>] seq_read+0x3d/0x3e0
 #1: held:     (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: [<ffffffff81254437>] m_start+0xa7/0x190
 #2: held:     (&(&p->alloc_lock)->rlock){+.+...}, instance: ffff88011fc64f30, at: [<ffffffff81254f8f>] show_numa_map+0x14f/0x610
Pid: 21996, comm: trinity-child3 Not tainted 3.7.0-rc1+ #32
Call Trace:
 [<ffffffff810ae4ec>] __might_sleep+0x14c/0x200
 [<ffffffff816bdf4e>] mutex_lock_nested+0x2e/0x50
 [<ffffffff811c43a3>] mpol_shared_policy_lookup+0x33/0x90
 [<ffffffff8118d5c3>] shmem_get_policy+0x33/0x40
 [<ffffffff811c31fa>] get_vma_policy+0x3a/0x90
 [<ffffffff81254fa3>] show_numa_map+0x163/0x610
 [<ffffffff81255b10>] ? pid_maps_open+0x20/0x20
 [<ffffffff81255980>] ? pagemap_hugetlb_range+0xf0/0xf0
 [<ffffffff81255483>] show_pid_numa_map+0x13/0x20
 [<ffffffff8120c902>] traverse+0xf2/0x230
 [<ffffffff8120d14b>] seq_read+0x34b/0x3e0
 [<ffffffff8120ce00>] ? seq_lseek+0x120/0x120
 [<ffffffff811e751a>] do_loop_readv_writev+0x5a/0x90
 [<ffffffff811e7851>] do_readv_writev+0x1c1/0x1e0
 [<ffffffff810b0de1>] ? get_parent_ip+0x11/0x50
 [<ffffffff811e7905>] vfs_readv+0x35/0x60
 [<ffffffff811e7b72>] sys_preadv+0xc2/0xe0
 [<ffffffff816ca088>] tracesys+0xe1/0xe6



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-17  4:05               ` Dave Jones
@ 2012-10-17  5:24                 ` David Rientjes
  2012-10-17  5:42                   ` Kamezawa Hiroyuki
                                     ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: David Rientjes @ 2012-10-17  5:24 UTC (permalink / raw)
  To: Dave Jones, Andrew Morton, Linus Torvalds, KOSAKI Motohiro,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Wed, 17 Oct 2012, Dave Jones wrote:

> BUG: sleeping function called from invalid context at kernel/mutex.c:269
> in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2
> 3 locks on stack by trinity-child2/8558:
>  #0: held:     (&p->lock){+.+.+.}, instance: ffff88010c9a00b0, at: [<ffffffff8120cd1f>] seq_lseek+0x3f/0x120
>  #1: held:     (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: [<ffffffff81254437>] m_start+0xa7/0x190
>  #2: held:     (&(&p->alloc_lock)->rlock){+.+...}, instance: ffff88011fc64f30, at: [<ffffffff81254f8f>] show_numa_map+0x14f/0x610
> Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32
> Call Trace:
>  [<ffffffff810ae4ec>] __might_sleep+0x14c/0x200
>  [<ffffffff816bdf4e>] mutex_lock_nested+0x2e/0x50
>  [<ffffffff811c43a3>] mpol_shared_policy_lookup+0x33/0x90
>  [<ffffffff8118d5c3>] shmem_get_policy+0x33/0x40
>  [<ffffffff811c31fa>] get_vma_policy+0x3a/0x90
>  [<ffffffff81254fa3>] show_numa_map+0x163/0x610
>  [<ffffffff81255b10>] ? pid_maps_open+0x20/0x20
>  [<ffffffff81255980>] ? pagemap_hugetlb_range+0xf0/0xf0
>  [<ffffffff81255483>] show_pid_numa_map+0x13/0x20
>  [<ffffffff8120c902>] traverse+0xf2/0x230
>  [<ffffffff8120cd8b>] seq_lseek+0xab/0x120
>  [<ffffffff811e6c0b>] sys_lseek+0x7b/0xb0
>  [<ffffffff816ca088>] tracesys+0xe1/0xe6
> 

Hmm, looks like we need to change the refcount semantics entirely.  We'll 
need to make get_vma_policy() always take a reference and then drop it 
accordingly.  This work sif get_vma_policy() can grab a reference while 
holding task_lock() for the task policy fallback case.

Comments on this approach?
---
 fs/proc/task_mmu.c |    4 +---
 include/linux/mm.h |    3 +--
 mm/hugetlb.c       |    4 ++--
 mm/mempolicy.c     |   41 ++++++++++++++++++++++-------------------
 4 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1178,11 +1178,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
 	walk.private = md;
 	walk.mm = mm;
 
-	task_lock(task);
 	pol = get_vma_policy(task, vma, vma->vm_start);
 	mpol_to_str(buffer, sizeof(buffer), pol, 0);
-	mpol_cond_put(pol);
-	task_unlock(task);
+	__mpol_put(pol);
 
 	seq_printf(m, "%08lx %s", vma->vm_start, buffer);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -216,8 +216,7 @@ struct vm_operations_struct {
 	 * get_policy() op must add reference [mpol_get()] to any policy at
 	 * (vma,addr) marked as MPOL_SHARED.  The shared policy infrastructure
 	 * in mm/mempolicy.c will do this automatically.
-	 * get_policy() must NOT add a ref if the policy at (vma,addr) is not
-	 * marked as MPOL_SHARED. vma policies are protected by the mmap_sem.
+	 * vma policies are protected by the mmap_sem.
 	 * If no [shared/vma] mempolicy exists at the addr, get_policy() op
 	 * must return NULL--i.e., do not "fallback" to task or system default
 	 * policy.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -568,13 +568,13 @@ retry_cpuset:
 		}
 	}
 
-	mpol_cond_put(mpol);
+	__mpol_put(mpol);
 	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 		goto retry_cpuset;
 	return page;
 
 err:
-	mpol_cond_put(mpol);
+	__mpol_put(mpol);
 	return NULL;
 }
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1536,39 +1536,41 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
  *
  * Returns effective policy for a VMA at specified address.
  * Falls back to @task or system default policy, as necessary.
- * Current or other task's task mempolicy and non-shared vma policies must be
- * protected by task_lock(task) by the caller.
- * Shared policies [those marked as MPOL_F_SHARED] require an extra reference
- * count--added by the get_policy() vm_op, as appropriate--to protect against
- * freeing by another task.  It is the caller's responsibility to free the
- * extra reference for shared policies.
+ * Increments the reference count of the returned mempolicy, it is the caller's
+ * responsibility to decrement with __mpol_put().
+ * Requires vma->vm_mm->mmap_sem to be held for vma policies and takes
+ * task_lock(task) for task policy fallback.
  */
 struct mempolicy *get_vma_policy(struct task_struct *task,
 		struct vm_area_struct *vma, unsigned long addr)
 {
-	struct mempolicy *pol = task->mempolicy;
+	struct mempolicy *pol;
+
+	task_lock(task);
+	pol = task->mempolicy;
+	mpol_get(pol);
+	task_unlock(task);
 
 	if (vma) {
 		if (vma->vm_ops && vma->vm_ops->get_policy) {
 			struct mempolicy *vpol = vma->vm_ops->get_policy(vma,
 									addr);
-			if (vpol)
+			if (vpol) {
+				mpol_put(pol);
 				pol = vpol;
+				if (!mpol_needs_cond_ref(pol))
+					mpol_get(pol);
+			}
 		} else if (vma->vm_policy) {
+			mpol_put(pol);
 			pol = vma->vm_policy;
-
-			/*
-			 * shmem_alloc_page() passes MPOL_F_SHARED policy with
-			 * a pseudo vma whose vma->vm_ops=NULL. Take a reference
-			 * count on these policies which will be dropped by
-			 * mpol_cond_put() later
-			 */
-			if (mpol_needs_cond_ref(pol))
-				mpol_get(pol);
+			mpol_get(pol);
 		}
 	}
-	if (!pol)
+	if (!pol) {
 		pol = &default_policy;
+		mpol_get(pol);
+	}
 	return pol;
 }
 
@@ -1919,7 +1921,7 @@ retry_cpuset:
 		unsigned nid;
 
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
-		mpol_cond_put(pol);
+		__mpol_put(pol);
 		page = alloc_page_interleave(gfp, order, nid);
 		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 			goto retry_cpuset;
@@ -1943,6 +1945,7 @@ retry_cpuset:
 	 */
 	page = __alloc_pages_nodemask(gfp, order, zl,
 				      policy_nodemask(gfp, pol));
+	__mpol_put(pol);
 	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 		goto retry_cpuset;
 	return page;

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-17  5:24                 ` David Rientjes
@ 2012-10-17  5:42                   ` Kamezawa Hiroyuki
  2012-10-17  8:49                     ` KOSAKI Motohiro
  2012-10-17 18:14                   ` Dave Jones
  2012-10-24 23:30                   ` [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Sasha Levin
  2 siblings, 1 reply; 59+ messages in thread
From: Kamezawa Hiroyuki @ 2012-10-17  5:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: Dave Jones, Andrew Morton, Linus Torvalds, KOSAKI Motohiro,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	linux-kernel, linux-mm

(2012/10/17 14:24), David Rientjes wrote:
> On Wed, 17 Oct 2012, Dave Jones wrote:
>
>> BUG: sleeping function called from invalid context at kernel/mutex.c:269
>> in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2
>> 3 locks on stack by trinity-child2/8558:
>>   #0: held:     (&p->lock){+.+.+.}, instance: ffff88010c9a00b0, at: [<ffffffff8120cd1f>] seq_lseek+0x3f/0x120
>>   #1: held:     (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: [<ffffffff81254437>] m_start+0xa7/0x190
>>   #2: held:     (&(&p->alloc_lock)->rlock){+.+...}, instance: ffff88011fc64f30, at: [<ffffffff81254f8f>] show_numa_map+0x14f/0x610
>> Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32
>> Call Trace:
>>   [<ffffffff810ae4ec>] __might_sleep+0x14c/0x200
>>   [<ffffffff816bdf4e>] mutex_lock_nested+0x2e/0x50
>>   [<ffffffff811c43a3>] mpol_shared_policy_lookup+0x33/0x90
>>   [<ffffffff8118d5c3>] shmem_get_policy+0x33/0x40
>>   [<ffffffff811c31fa>] get_vma_policy+0x3a/0x90
>>   [<ffffffff81254fa3>] show_numa_map+0x163/0x610
>>   [<ffffffff81255b10>] ? pid_maps_open+0x20/0x20
>>   [<ffffffff81255980>] ? pagemap_hugetlb_range+0xf0/0xf0
>>   [<ffffffff81255483>] show_pid_numa_map+0x13/0x20
>>   [<ffffffff8120c902>] traverse+0xf2/0x230
>>   [<ffffffff8120cd8b>] seq_lseek+0xab/0x120
>>   [<ffffffff811e6c0b>] sys_lseek+0x7b/0xb0
>>   [<ffffffff816ca088>] tracesys+0xe1/0xe6
>>
>
> Hmm, looks like we need to change the refcount semantics entirely.  We'll
> need to make get_vma_policy() always take a reference and then drop it
> accordingly.  This work sif get_vma_policy() can grab a reference while
> holding task_lock() for the task policy fallback case.
>
> Comments on this approach?

I think this refcounting is better than using task_lock().

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-17  5:42                   ` Kamezawa Hiroyuki
@ 2012-10-17  8:49                     ` KOSAKI Motohiro
  2012-10-17 19:50                       ` David Rientjes
  0 siblings, 1 reply; 59+ messages in thread
From: KOSAKI Motohiro @ 2012-10-17  8:49 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: David Rientjes, Dave Jones, Andrew Morton, Linus Torvalds,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	linux-kernel, linux-mm

On Wed, Oct 17, 2012 at 1:42 AM, Kamezawa Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> (2012/10/17 14:24), David Rientjes wrote:
>>
>> On Wed, 17 Oct 2012, Dave Jones wrote:
>>
>>> BUG: sleeping function called from invalid context at kernel/mutex.c:269
>>> in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2
>>> 3 locks on stack by trinity-child2/8558:
>>>   #0: held:     (&p->lock){+.+.+.}, instance: ffff88010c9a00b0, at:
>>> [<ffffffff8120cd1f>] seq_lseek+0x3f/0x120
>>>   #1: held:     (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at:
>>> [<ffffffff81254437>] m_start+0xa7/0x190
>>>   #2: held:     (&(&p->alloc_lock)->rlock){+.+...}, instance:
>>> ffff88011fc64f30, at: [<ffffffff81254f8f>] show_numa_map+0x14f/0x610
>>> Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32
>>> Call Trace:
>>>   [<ffffffff810ae4ec>] __might_sleep+0x14c/0x200
>>>   [<ffffffff816bdf4e>] mutex_lock_nested+0x2e/0x50
>>>   [<ffffffff811c43a3>] mpol_shared_policy_lookup+0x33/0x90
>>>   [<ffffffff8118d5c3>] shmem_get_policy+0x33/0x40
>>>   [<ffffffff811c31fa>] get_vma_policy+0x3a/0x90
>>>   [<ffffffff81254fa3>] show_numa_map+0x163/0x610
>>>   [<ffffffff81255b10>] ? pid_maps_open+0x20/0x20
>>>   [<ffffffff81255980>] ? pagemap_hugetlb_range+0xf0/0xf0
>>>   [<ffffffff81255483>] show_pid_numa_map+0x13/0x20
>>>   [<ffffffff8120c902>] traverse+0xf2/0x230
>>>   [<ffffffff8120cd8b>] seq_lseek+0xab/0x120
>>>   [<ffffffff811e6c0b>] sys_lseek+0x7b/0xb0
>>>   [<ffffffff816ca088>] tracesys+0xe1/0xe6
>>>
>>
>> Hmm, looks like we need to change the refcount semantics entirely.  We'll
>> need to make get_vma_policy() always take a reference and then drop it
>> accordingly.  This work sif get_vma_policy() can grab a reference while
>> holding task_lock() for the task policy fallback case.
>>
>> Comments on this approach?
>
>
> I think this refcounting is better than using task_lock().

I don't think so. get_vma_policy() is used from fast path. In other
words, number of
atomic ops is sensible for allocation performance. Instead, I'd like
to use spinlock
for shared mempolicy instead of mutex.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-17  5:24                 ` David Rientjes
  2012-10-17  5:42                   ` Kamezawa Hiroyuki
@ 2012-10-17 18:14                   ` Dave Jones
  2012-10-17 19:21                     ` David Rientjes
  2012-10-24 23:30                   ` [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Sasha Levin
  2 siblings, 1 reply; 59+ messages in thread
From: Dave Jones @ 2012-10-17 18:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, bhutchings,
	Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote:
 > On Wed, 17 Oct 2012, Dave Jones wrote:
 > 
 > > BUG: sleeping function called from invalid context at kernel/mutex.c:269
 > 
 > Hmm, looks like we need to change the refcount semantics entirely.  We'll 
 > need to make get_vma_policy() always take a reference and then drop it 
 > accordingly.  This work sif get_vma_policy() can grab a reference while 
 > holding task_lock() for the task policy fallback case.
 > 
 > Comments on this approach?

Seems to be surviving my testing at least..

	Dave


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-17 18:14                   ` Dave Jones
@ 2012-10-17 19:21                     ` David Rientjes
  2012-10-17 19:32                       ` Dave Jones
  0 siblings, 1 reply; 59+ messages in thread
From: David Rientjes @ 2012-10-17 19:21 UTC (permalink / raw)
  To: Dave Jones, Andrew Morton, Linus Torvalds, KOSAKI Motohiro,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Wed, 17 Oct 2012, Dave Jones wrote:

> On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote:
>  > On Wed, 17 Oct 2012, Dave Jones wrote:
>  > 
>  > > BUG: sleeping function called from invalid context at kernel/mutex.c:269
>  > 
>  > Hmm, looks like we need to change the refcount semantics entirely.  We'll 
>  > need to make get_vma_policy() always take a reference and then drop it 
>  > accordingly.  This work sif get_vma_policy() can grab a reference while 
>  > holding task_lock() for the task policy fallback case.
>  > 
>  > Comments on this approach?
> 
> Seems to be surviving my testing at least..
> 

Sounds good.  Is it possible to verify that policy_cache isn't getting 
larger than normal in /proc/slabinfo, i.e. when all processes with a 
task mempolicy or shared vma policy have exited, are there still a 
significant number of active objects?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-17 19:21                     ` David Rientjes
@ 2012-10-17 19:32                       ` Dave Jones
  2012-10-17 19:38                         ` David Rientjes
  0 siblings, 1 reply; 59+ messages in thread
From: Dave Jones @ 2012-10-17 19:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, bhutchings,
	Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Wed, Oct 17, 2012 at 12:21:10PM -0700, David Rientjes wrote:
 > On Wed, 17 Oct 2012, Dave Jones wrote:
 > 
 > > On Tue, Oct 16, 2012 at 10:24:32PM -0700, David Rientjes wrote:
 > >  > On Wed, 17 Oct 2012, Dave Jones wrote:
 > >  > 
 > >  > > BUG: sleeping function called from invalid context at kernel/mutex.c:269
 > >  > 
 > >  > Hmm, looks like we need to change the refcount semantics entirely.  We'll 
 > >  > need to make get_vma_policy() always take a reference and then drop it 
 > >  > accordingly.  This work sif get_vma_policy() can grab a reference while 
 > >  > holding task_lock() for the task policy fallback case.
 > >  > 
 > >  > Comments on this approach?
 > > 
 > > Seems to be surviving my testing at least..
 > > 
 > 
 > Sounds good.  Is it possible to verify that policy_cache isn't getting 
 > larger than normal in /proc/slabinfo, i.e. when all processes with a 
 > task mempolicy or shared vma policy have exited, are there still a 
 > significant number of active objects?

Killing the fuzzer caused it to drop dramatically.

Before:
(15:29:59:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo  | grep policy
shared_policy_node   2931   2967    376   43    4 : tunables    0    0    0 : slabdata     69     69      0
numa_policy         2971   6545    464   35    4 : tunables    0    0    0 : slabdata    187    187      0

After:
(15:30:16:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo  | grep policy
shared_policy_node      0    215    376   43    4 : tunables    0    0    0 : slabdata      5      5      0
numa_policy           15    175    464   35    4 : tunables    0    0    0 : slabdata      5      5      0



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-17 19:32                       ` Dave Jones
@ 2012-10-17 19:38                         ` David Rientjes
  2012-10-17 19:45                           ` Dave Jones
  0 siblings, 1 reply; 59+ messages in thread
From: David Rientjes @ 2012-10-17 19:38 UTC (permalink / raw)
  To: Dave Jones, Andrew Morton, Linus Torvalds, KOSAKI Motohiro,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Wed, 17 Oct 2012, Dave Jones wrote:

>  > Sounds good.  Is it possible to verify that policy_cache isn't getting 
>  > larger than normal in /proc/slabinfo, i.e. when all processes with a 
>  > task mempolicy or shared vma policy have exited, are there still a 
>  > significant number of active objects?
> 
> Killing the fuzzer caused it to drop dramatically.
> 
> Before:
> (15:29:59:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo  | grep policy
> shared_policy_node   2931   2967    376   43    4 : tunables    0    0    0 : slabdata     69     69      0
> numa_policy         2971   6545    464   35    4 : tunables    0    0    0 : slabdata    187    187      0
> 
> After:
> (15:30:16:davej@bitcrush:trinity[master])$ sudo cat /proc/slabinfo  | grep policy
> shared_policy_node      0    215    376   43    4 : tunables    0    0    0 : slabdata      5      5      0
> numa_policy           15    175    464   35    4 : tunables    0    0    0 : slabdata      5      5      0
> 

Excellent, thanks.  This shows that the refcounting is working properly 
and we're not leaking any references as a result of this change causing 
the mempolicies to never be freed.  ("numa_policy" turns out to be 
policy_cache in the code, so thanks for checking both of them.)

Could I add your tested-by?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-17 19:38                         ` David Rientjes
@ 2012-10-17 19:45                           ` Dave Jones
  2012-10-17 20:28                             ` [patch for-3.7] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps David Rientjes
  0 siblings, 1 reply; 59+ messages in thread
From: Dave Jones @ 2012-10-17 19:45 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Linus Torvalds, KOSAKI Motohiro, bhutchings,
	Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Wed, Oct 17, 2012 at 12:38:55PM -0700, David Rientjes wrote:

 > >  > Sounds good.  Is it possible to verify that policy_cache isn't getting 
 > >  > larger than normal in /proc/slabinfo, i.e. when all processes with a 
 > >  > task mempolicy or shared vma policy have exited, are there still a 
 > >  > significant number of active objects?
 > > 
 > > Killing the fuzzer caused it to drop dramatically.
 > > 
 > Excellent, thanks.  This shows that the refcounting is working properly 
 > and we're not leaking any references as a result of this change causing 
 > the mempolicies to never be freed.  ("numa_policy" turns out to be 
 > policy_cache in the code, so thanks for checking both of them.)
 > 
 > Could I add your tested-by?

Sure. Here's a fresh one I just baked.

Tested-by: Dave Jones <davej@redhat.com>

	Dave

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-17  8:49                     ` KOSAKI Motohiro
@ 2012-10-17 19:50                       ` David Rientjes
  2012-10-17 21:05                         ` KOSAKI Motohiro
  0 siblings, 1 reply; 59+ messages in thread
From: David Rientjes @ 2012-10-17 19:50 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Kamezawa Hiroyuki, Dave Jones, Andrew Morton, Linus Torvalds,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	linux-kernel, linux-mm

On Wed, 17 Oct 2012, KOSAKI Motohiro wrote:

> > I think this refcounting is better than using task_lock().
> 
> I don't think so. get_vma_policy() is used from fast path. In other
> words, number of
> atomic ops is sensible for allocation performance.

There are enhancements that we can make with refcounting: for instance, we 
may want to avoid doing it in the super-fast path when the policy is 
default_policy and then just do

	if (mpol != &default_policy)
		mpol_put(mpol);

> Instead, I'd like
> to use spinlock
> for shared mempolicy instead of mutex.
> 

Um, this was just changed to a mutex last week in commit b22d127a39dd 
("mempolicy: fix a race in shared_policy_replace()") so that sp_alloc() 
can be done with GFP_KERNEL, so I didn't consider reverting that behavior.  
Are you nacking that patch, which you acked, now?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch for-3.7] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps
  2012-10-17 19:45                           ` Dave Jones
@ 2012-10-17 20:28                             ` David Rientjes
  2012-10-17 21:31                               ` [patch for-3.7 v2] " David Rientjes
  0 siblings, 1 reply; 59+ messages in thread
From: David Rientjes @ 2012-10-17 20:28 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton
  Cc: Dave Jones, KOSAKI Motohiro, bhutchings, Konstantin Khlebnikov,
	Naoya Horiguchi, Hugh Dickins, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm

As a result of commit 32f8516a8c73 ("mm, mempolicy: fix printing stack
contents in numa_maps"), the mutex protecting a shared policy can be
inadvertently taken while holding task_lock(task).

Recently, commit b22d127a39dd ("mempolicy: fix a race in 
shared_policy_replace()") switched the spinlock within a shared policy to 
a mutex so sp_alloc() could block.  Thus, a refcount must be grabbed on 
all mempolicies returned by get_vma_policy() so it isn't freed while being 
passed to mpol_to_str() when reading /proc/pid/numa_maps.

This patch only takes task_lock() while dereferencing task->mempolicy in 
get_vma_policy() to increment its refcount.  This ensures it will remain 
in memory until dropped by __mpol_put() after mpol_to_str() is called.

Refcounts of shared policies are grabbed by the ->get_policy() function of 
the vma, all others will be grabbed directly in get_vma_policy().  Now 
that this is done, all callers now unconditionally drop the refcount.

Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 fs/proc/task_mmu.c        |    4 +--
 include/linux/mempolicy.h |   12 +------
 mm/hugetlb.c              |    4 +--
 mm/mempolicy.c            |   79 +++++++++++++++++++--------------------------
 4 files changed, 38 insertions(+), 61 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1178,11 +1178,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
 	walk.private = md;
 	walk.mm = mm;
 
-	task_lock(task);
 	pol = get_vma_policy(task, vma, vma->vm_start);
 	mpol_to_str(buffer, sizeof(buffer), pol, 0);
-	mpol_cond_put(pol);
-	task_unlock(task);
+	__mpol_put(pol);
 
 	seq_printf(m, "%08lx %s", vma->vm_start, buffer);
 
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -73,13 +73,7 @@ static inline void mpol_put(struct mempolicy *pol)
  */
 static inline int mpol_needs_cond_ref(struct mempolicy *pol)
 {
-	return (pol && (pol->flags & MPOL_F_SHARED));
-}
-
-static inline void mpol_cond_put(struct mempolicy *pol)
-{
-	if (mpol_needs_cond_ref(pol))
-		__mpol_put(pol);
+	return pol->flags & MPOL_F_SHARED;
 }
 
 extern struct mempolicy *__mpol_cond_copy(struct mempolicy *tompol,
@@ -211,10 +205,6 @@ static inline void mpol_put(struct mempolicy *p)
 {
 }
 
-static inline void mpol_cond_put(struct mempolicy *pol)
-{
-}
-
 static inline struct mempolicy *mpol_cond_copy(struct mempolicy *to,
 						struct mempolicy *from)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -568,13 +568,13 @@ retry_cpuset:
 		}
 	}
 
-	mpol_cond_put(mpol);
+	__mpol_put(mpol);
 	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 		goto retry_cpuset;
 	return page;
 
 err:
-	mpol_cond_put(mpol);
+	__mpol_put(mpol);
 	return NULL;
 }
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -906,7 +906,8 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 	}
 
  out:
-	mpol_cond_put(pol);
+	if (mpol_needs_cond_ref(pol))
+		__mpol_put(pol);
 	if (vma)
 		up_read(&current->mm->mmap_sem);
 	return err;
@@ -1527,48 +1528,52 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
 }
 
 #endif
-
-/*
- * get_vma_policy(@task, @vma, @addr)
- * @task - task for fallback if vma policy == default
- * @vma   - virtual memory area whose policy is sought
- * @addr  - address in @vma for shared policy lookup
+/**
+ * get_vma_policy() - return effective policy for a vma at specified address
+ * @task: task for fallback if vma policy == default_policy
+ * @vma: virtual memory area whose policy is sought
+ * @addr: address in @vma for shared policy lookup
  *
- * Returns effective policy for a VMA at specified address.
  * Falls back to @task or system default policy, as necessary.
- * Current or other task's task mempolicy and non-shared vma policies must be
- * protected by task_lock(task) by the caller.
- * Shared policies [those marked as MPOL_F_SHARED] require an extra reference
- * count--added by the get_policy() vm_op, as appropriate--to protect against
- * freeing by another task.  It is the caller's responsibility to free the
- * extra reference for shared policies.
+ * Increments the reference count of the returned mempolicy, it is the caller's
+ * responsibility to decrement with __mpol_put().
+ * Requires vma->vm_mm->mmap_sem to be held for vma policies and takes
+ * task_lock(task) for task policy fallback.
  */
 struct mempolicy *get_vma_policy(struct task_struct *task,
 		struct vm_area_struct *vma, unsigned long addr)
 {
-	struct mempolicy *pol = task->mempolicy;
+	struct mempolicy *pol;
+
+	/*
+	 * Grab a reference before task has the potential to exit and free its
+	 * mempolicy.
+	 */
+	task_lock(task);
+	pol = task->mempolicy;
+	mpol_get(pol);
+	task_unlock(task);
 
 	if (vma) {
 		if (vma->vm_ops && vma->vm_ops->get_policy) {
 			struct mempolicy *vpol = vma->vm_ops->get_policy(vma,
 									addr);
-			if (vpol)
+			if (vpol) {
+				mpol_put(pol);
 				pol = vpol;
+				if (!mpol_needs_cond_ref(pol))
+					mpol_get(pol);
+			}
 		} else if (vma->vm_policy) {
+			mpol_put(pol);
 			pol = vma->vm_policy;
-
-			/*
-			 * shmem_alloc_page() passes MPOL_F_SHARED policy with
-			 * a pseudo vma whose vma->vm_ops=NULL. Take a reference
-			 * count on these policies which will be dropped by
-			 * mpol_cond_put() later
-			 */
-			if (mpol_needs_cond_ref(pol))
-				mpol_get(pol);
+			mpol_get(pol);
 		}
 	}
-	if (!pol)
+	if (!pol) {
 		pol = &default_policy;
+		mpol_get(pol);
+	}
 	return pol;
 }
 
@@ -1919,30 +1924,14 @@ retry_cpuset:
 		unsigned nid;
 
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
-		mpol_cond_put(pol);
 		page = alloc_page_interleave(gfp, order, nid);
-		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
-			goto retry_cpuset;
-
-		return page;
+		goto out;
 	}
 	zl = policy_zonelist(gfp, pol, node);
-	if (unlikely(mpol_needs_cond_ref(pol))) {
-		/*
-		 * slow path: ref counted shared policy
-		 */
-		struct page *page =  __alloc_pages_nodemask(gfp, order,
-						zl, policy_nodemask(gfp, pol));
-		__mpol_put(pol);
-		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
-			goto retry_cpuset;
-		return page;
-	}
-	/*
-	 * fast path:  default or task policy
-	 */
 	page = __alloc_pages_nodemask(gfp, order, zl,
 				      policy_nodemask(gfp, pol));
+out:
+	__mpol_put(pol);
 	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 		goto retry_cpuset;
 	return page;

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-17 19:50                       ` David Rientjes
@ 2012-10-17 21:05                         ` KOSAKI Motohiro
  2012-10-17 21:27                           ` David Rientjes
  0 siblings, 1 reply; 59+ messages in thread
From: KOSAKI Motohiro @ 2012-10-17 21:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: Kamezawa Hiroyuki, Dave Jones, Andrew Morton, Linus Torvalds,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	linux-kernel, linux-mm

On Wed, Oct 17, 2012 at 3:50 PM, David Rientjes <rientjes@google.com> wrote:
> On Wed, 17 Oct 2012, KOSAKI Motohiro wrote:
>
>> > I think this refcounting is better than using task_lock().
>>
>> I don't think so. get_vma_policy() is used from fast path. In other
>> words, number of
>> atomic ops is sensible for allocation performance.
>
> There are enhancements that we can make with refcounting: for instance, we
> may want to avoid doing it in the super-fast path when the policy is
> default_policy and then just do
>
>         if (mpol != &default_policy)
>                 mpol_put(mpol);
>
>> Instead, I'd like
>> to use spinlock
>> for shared mempolicy instead of mutex.
>>
>
> Um, this was just changed to a mutex last week in commit b22d127a39dd
> ("mempolicy: fix a race in shared_policy_replace()") so that sp_alloc()
> can be done with GFP_KERNEL, so I didn't consider reverting that behavior.
> Are you nacking that patch, which you acked, now?

Yes, sadly. /proc usage is a corner case issue. It's not worth to
strike main path.
see commit 52cd3b0740 and around patches. That explain why we avoided your
approach.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-17 21:05                         ` KOSAKI Motohiro
@ 2012-10-17 21:27                           ` David Rientjes
  0 siblings, 0 replies; 59+ messages in thread
From: David Rientjes @ 2012-10-17 21:27 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Kamezawa Hiroyuki, Dave Jones, Andrew Morton, Linus Torvalds,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	linux-kernel, linux-mm

On Wed, 17 Oct 2012, KOSAKI Motohiro wrote:

> > Um, this was just changed to a mutex last week in commit b22d127a39dd
> > ("mempolicy: fix a race in shared_policy_replace()") so that sp_alloc()
> > can be done with GFP_KERNEL, so I didn't consider reverting that behavior.
> > Are you nacking that patch, which you acked, now?
> 
> Yes, sadly. /proc usage is a corner case issue. It's not worth to
> strike main path.

It also simplifies the fastpath since we can now unconditionally drop the 
reference.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps
  2012-10-17 20:28                             ` [patch for-3.7] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps David Rientjes
@ 2012-10-17 21:31                               ` David Rientjes
  2012-10-18  4:06                                 ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 59+ messages in thread
From: David Rientjes @ 2012-10-17 21:31 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton
  Cc: Dave Jones, KOSAKI Motohiro, bhutchings, Konstantin Khlebnikov,
	Naoya Horiguchi, Hugh Dickins, KAMEZAWA Hiroyuki, linux-kernel,
	linux-mm

As a result of commit 32f8516a8c73 ("mm, mempolicy: fix printing stack
contents in numa_maps"), the mutex protecting a shared policy can be
inadvertently taken while holding task_lock(task).

Recently, commit b22d127a39dd ("mempolicy: fix a race in 
shared_policy_replace()") switched the spinlock within a shared policy to 
a mutex so sp_alloc() could block.  Thus, a refcount must be grabbed on 
all mempolicies returned by get_vma_policy() so it isn't freed while being 
passed to mpol_to_str() when reading /proc/pid/numa_maps.

This patch only takes task_lock() while dereferencing task->mempolicy in 
get_vma_policy() if it's non-NULL in the lockess check to increment its 
refcount.  This ensures it will remain in memory until dropped by 
__mpol_put() after mpol_to_str() is called.

Refcounts of shared policies are grabbed by the ->get_policy() function of 
the vma, all others will be grabbed directly in get_vma_policy().  Now 
that this is done, all callers now unconditionally drop the refcount.

Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 v2: optimized task_lock() in get_vma_policy(): test for a non-NULL
     task->mempolicy before taking task_lock() and grabbing the reference
     so we don't take the lock unnecessarily.

 fs/proc/task_mmu.c        |    4 +--
 include/linux/mempolicy.h |   12 +------
 mm/hugetlb.c              |    4 +--
 mm/mempolicy.c            |   79 ++++++++++++++++++++-------------------------
 4 files changed, 39 insertions(+), 60 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 14df880..5709e70 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1178,11 +1178,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
 	walk.private = md;
 	walk.mm = mm;
 
-	task_lock(task);
 	pol = get_vma_policy(task, vma, vma->vm_start);
 	mpol_to_str(buffer, sizeof(buffer), pol, 0);
-	mpol_cond_put(pol);
-	task_unlock(task);
+	__mpol_put(pol);
 
 	seq_printf(m, "%08lx %s", vma->vm_start, buffer);
 
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e5ccb9d..f76f7e0 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -73,13 +73,7 @@ static inline void mpol_put(struct mempolicy *pol)
  */
 static inline int mpol_needs_cond_ref(struct mempolicy *pol)
 {
-	return (pol && (pol->flags & MPOL_F_SHARED));
-}
-
-static inline void mpol_cond_put(struct mempolicy *pol)
-{
-	if (mpol_needs_cond_ref(pol))
-		__mpol_put(pol);
+	return pol->flags & MPOL_F_SHARED;
 }
 
 extern struct mempolicy *__mpol_cond_copy(struct mempolicy *tompol,
@@ -211,10 +205,6 @@ static inline void mpol_put(struct mempolicy *p)
 {
 }
 
-static inline void mpol_cond_put(struct mempolicy *pol)
-{
-}
-
 static inline struct mempolicy *mpol_cond_copy(struct mempolicy *to,
 						struct mempolicy *from)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 59a0059..5080808 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -568,13 +568,13 @@ retry_cpuset:
 		}
 	}
 
-	mpol_cond_put(mpol);
+	__mpol_put(mpol);
 	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 		goto retry_cpuset;
 	return page;
 
 err:
-	mpol_cond_put(mpol);
+	__mpol_put(mpol);
 	return NULL;
 }
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d04a8a5..a0bb463 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -906,7 +906,8 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 	}
 
  out:
-	mpol_cond_put(pol);
+	if (mpol_needs_cond_ref(pol))
+		__mpol_put(pol);
 	if (vma)
 		up_read(&current->mm->mmap_sem);
 	return err;
@@ -1527,48 +1528,54 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
 }
 
 #endif
-
-/*
- * get_vma_policy(@task, @vma, @addr)
- * @task - task for fallback if vma policy == default
- * @vma   - virtual memory area whose policy is sought
- * @addr  - address in @vma for shared policy lookup
+/**
+ * get_vma_policy() - return effective policy for a vma at specified address
+ * @task: task for fallback if vma policy == default_policy
+ * @vma: virtual memory area whose policy is sought
+ * @addr: address in @vma for shared policy lookup
  *
- * Returns effective policy for a VMA at specified address.
  * Falls back to @task or system default policy, as necessary.
- * Current or other task's task mempolicy and non-shared vma policies must be
- * protected by task_lock(task) by the caller.
- * Shared policies [those marked as MPOL_F_SHARED] require an extra reference
- * count--added by the get_policy() vm_op, as appropriate--to protect against
- * freeing by another task.  It is the caller's responsibility to free the
- * extra reference for shared policies.
+ * Increments the reference count of the returned mempolicy, it is the caller's
+ * responsibility to decrement with __mpol_put().
+ * Requires vma->vm_mm->mmap_sem to be held for vma policies and takes
+ * task_lock(task) for task policy fallback.
  */
 struct mempolicy *get_vma_policy(struct task_struct *task,
 		struct vm_area_struct *vma, unsigned long addr)
 {
 	struct mempolicy *pol = task->mempolicy;
 
+	/*
+	 * Grab a reference before task has the potential to exit and free its
+	 * mempolicy.
+	 */
+	if (pol) {
+		task_lock(task);
+		pol = task->mempolicy;
+		mpol_get(pol);
+		task_unlock(task);
+	}
+
 	if (vma) {
 		if (vma->vm_ops && vma->vm_ops->get_policy) {
 			struct mempolicy *vpol = vma->vm_ops->get_policy(vma,
 									addr);
-			if (vpol)
+			if (vpol) {
+				mpol_put(pol);
 				pol = vpol;
+				if (!mpol_needs_cond_ref(pol))
+					mpol_get(pol);
+			}
 		} else if (vma->vm_policy) {
+			mpol_put(pol);
 			pol = vma->vm_policy;
-
-			/*
-			 * shmem_alloc_page() passes MPOL_F_SHARED policy with
-			 * a pseudo vma whose vma->vm_ops=NULL. Take a reference
-			 * count on these policies which will be dropped by
-			 * mpol_cond_put() later
-			 */
-			if (mpol_needs_cond_ref(pol))
-				mpol_get(pol);
+			mpol_get(pol);
 		}
 	}
-	if (!pol)
+	if (!pol) {
 		pol = &default_policy;
+		mpol_get(pol);
+	}
 	return pol;
 }
 
@@ -1919,30 +1926,14 @@ retry_cpuset:
 		unsigned nid;
 
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
-		mpol_cond_put(pol);
 		page = alloc_page_interleave(gfp, order, nid);
-		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
-			goto retry_cpuset;
-
-		return page;
+		goto out;
 	}
 	zl = policy_zonelist(gfp, pol, node);
-	if (unlikely(mpol_needs_cond_ref(pol))) {
-		/*
-		 * slow path: ref counted shared policy
-		 */
-		struct page *page =  __alloc_pages_nodemask(gfp, order,
-						zl, policy_nodemask(gfp, pol));
-		__mpol_put(pol);
-		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
-			goto retry_cpuset;
-		return page;
-	}
-	/*
-	 * fast path:  default or task policy
-	 */
 	page = __alloc_pages_nodemask(gfp, order, zl,
 				      policy_nodemask(gfp, pol));
+out:
+	__mpol_put(pol);
 	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 		goto retry_cpuset;
 	return page;

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps
  2012-10-17 21:31                               ` [patch for-3.7 v2] " David Rientjes
@ 2012-10-18  4:06                                 ` Kamezawa Hiroyuki
  2012-10-18  4:14                                   ` Linus Torvalds
                                                     ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Kamezawa Hiroyuki @ 2012-10-18  4:06 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, Andrew Morton, Dave Jones, KOSAKI Motohiro,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	linux-kernel, linux-mm

(2012/10/18 6:31), David Rientjes wrote:
> As a result of commit 32f8516a8c73 ("mm, mempolicy: fix printing stack
> contents in numa_maps"), the mutex protecting a shared policy can be
> inadvertently taken while holding task_lock(task).
>
> Recently, commit b22d127a39dd ("mempolicy: fix a race in
> shared_policy_replace()") switched the spinlock within a shared policy to
> a mutex so sp_alloc() could block.  Thus, a refcount must be grabbed on
> all mempolicies returned by get_vma_policy() so it isn't freed while being
> passed to mpol_to_str() when reading /proc/pid/numa_maps.
>
> This patch only takes task_lock() while dereferencing task->mempolicy in
> get_vma_policy() if it's non-NULL in the lockess check to increment its
> refcount.  This ensures it will remain in memory until dropped by
> __mpol_put() after mpol_to_str() is called.
>
> Refcounts of shared policies are grabbed by the ->get_policy() function of
> the vma, all others will be grabbed directly in get_vma_policy().  Now
> that this is done, all callers now unconditionally drop the refcount.
>

please add original problem description....

from your 1st patch.
> When reading /proc/pid/numa_maps, it's possible to return the contents of
> the stack where the mempolicy string should be printed if the policy gets
> freed from beneath us.
>
> This happens because mpol_to_str() may return an error the
> stack-allocated buffer is then printed without ever being stored.
.....

Hmm, I've read the whole thread again...and, I'm sorry if I misunderstand something.

I think Kosaki mentioned the commit 52cd3b0740. It avoids refcounting in get_vma_policy()
because it's called every time alloc_pages_vma() is called, at every page fault.
So, it seems he doesn't agree this fix because of performance concern on big NUMA,


Can't we have another way to fix ? like this ? too ugly ?
Again, I'm sorry if I misunderstand the points.

==

 From bfe7e2ab1c1375b134ec12efce6517149318f75d Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Thu, 18 Oct 2012 13:17:25 +0900
Subject: [PATCH] hold task->mempolicy while numa_maps scans.

  /proc/<pid>/numa_maps scans vma and show mempolicy under
  mmap_sem. It sometimes accesses task->mempolicy which can
  be freed without mmap_sem and numa_maps can show some
  garbage while scanning.

This patch tries to take reference count of task->mempolicy at reading
numa_maps before calling get_vma_policy(). By this, task->mempolicy
will not be freed until numa_maps reaches its end.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
  fs/proc/task_mmu.c |   20 ++++++++++++++++++++
  1 file changed, 20 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 14df880..d92e868 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -94,6 +94,11 @@ static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
  {
  	if (vma && vma != priv->tail_vma) {
  		struct mm_struct *mm = vma->vm_mm;
+#ifdef CONFIG_NUMA
+		task_lock(priv->task);
+		__mpol_put(priv->task->mempolicy);
+		task_unlock(priv->task);
+#endif
  		up_read(&mm->mmap_sem);
  		mmput(mm);
  	}
@@ -130,6 +135,16 @@ static void *m_start(struct seq_file *m, loff_t *pos)
  		return mm;
  	down_read(&mm->mmap_sem);
  
+	/*
+	 * task->mempolicy can be freed even if mmap_sem is down (see kernel/exit.c)
+	 * We grab refcount for stable access.
+	 * repleacement of task->mmpolicy is guarded by mmap_sem.
+	 */
+#ifdef CONFIG_NUMA
+	task_lock(priv->task);
+	mpol_get(priv->task->mempolicy);
+	task_unlock(priv->task);
+#endif
  	tail_vma = get_gate_vma(priv->task->mm);
  	priv->tail_vma = tail_vma;
  
@@ -161,6 +176,11 @@ out:
  
  	/* End of vmas has been reached */
  	m->version = (tail_vma != NULL)? 0: -1UL;
+#ifdef CONFIG_NUMA
+	task_lock(priv->task);
+	__mpol_put(priv->task->mempolicy);
+	task_unlock(priv->task);
+#endif
  	up_read(&mm->mmap_sem);
  	mmput(mm);
  	return tail_vma;
-- 
1.7.10.2














^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps
  2012-10-18  4:06                                 ` Kamezawa Hiroyuki
@ 2012-10-18  4:14                                   ` Linus Torvalds
  2012-10-18  4:41                                     ` Kamezawa Hiroyuki
  2012-10-18  4:34                                   ` Kamezawa Hiroyuki
  2012-10-18  4:35                                   ` David Rientjes
  2 siblings, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2012-10-18  4:14 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: David Rientjes, Andrew Morton, Dave Jones, KOSAKI Motohiro,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	linux-kernel, linux-mm

On Wed, Oct 17, 2012 at 9:06 PM, Kamezawa Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>         if (vma && vma != priv->tail_vma) {
>                 struct mm_struct *mm = vma->vm_mm;
> +#ifdef CONFIG_NUMA
> +               task_lock(priv->task);
> +               __mpol_put(priv->task->mempolicy);
> +               task_unlock(priv->task);
> +#endif
>                 up_read(&mm->mmap_sem);
>                 mmput(mm);

Please don't put #ifdef's inside code. It makes things really ugly and
hard to read.

And that is *especially* true in this case, since there's a pattern to
all these things:

> +#ifdef CONFIG_NUMA
> +       task_lock(priv->task);
> +       mpol_get(priv->task->mempolicy);
> +       task_unlock(priv->task);
> +#endif

> +#ifdef CONFIG_NUMA
> +       task_lock(priv->task);
> +       __mpol_put(priv->task->mempolicy);
> +       task_unlock(priv->task);
> +#endif

it really sounds like what you want to do is to just abstract a
"numa_policy_get/put(priv)" operation.

So you could make it be something like

  #ifdef CONFIG_NUMA
  static inline numa_policy_get(struct proc_maps_private *priv)
  {
      task_lock(priv->task);
      mpol_get(priv->task->mempolicy);
      task_unlock(priv->task);
  }
  .. same for the "put" function ..
  #else
    #define numa_policy_get(priv) do { } while (0)
    #define numa_policy_put(priv) do { } while (0)
  #endif

and then you wouldn't have to have the #ifdef's in the middle of code,
and I think it will be more readable in general.

Sure, it is going to be a few more actual lines of patch, but there's
no duplicated code sequence, and the added lines are just the syntax
that makes it look better.

                Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps
  2012-10-18  4:06                                 ` Kamezawa Hiroyuki
  2012-10-18  4:14                                   ` Linus Torvalds
@ 2012-10-18  4:34                                   ` Kamezawa Hiroyuki
  2012-10-18 20:03                                     ` David Rientjes
  2012-10-19  6:51                                     ` [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when " KOSAKI Motohiro
  2012-10-18  4:35                                   ` David Rientjes
  2 siblings, 2 replies; 59+ messages in thread
From: Kamezawa Hiroyuki @ 2012-10-18  4:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, Andrew Morton, Dave Jones, KOSAKI Motohiro,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	linux-kernel, linux-mm

(2012/10/18 13:06), Kamezawa Hiroyuki wrote:
> (2012/10/18 6:31), David Rientjes wrote:
>> As a result of commit 32f8516a8c73 ("mm, mempolicy: fix printing stack
>> contents in numa_maps"), the mutex protecting a shared policy can be
>> inadvertently taken while holding task_lock(task).
>>
>> Recently, commit b22d127a39dd ("mempolicy: fix a race in
>> shared_policy_replace()") switched the spinlock within a shared policy to
>> a mutex so sp_alloc() could block.  Thus, a refcount must be grabbed on
>> all mempolicies returned by get_vma_policy() so it isn't freed while being
>> passed to mpol_to_str() when reading /proc/pid/numa_maps.
>>
>> This patch only takes task_lock() while dereferencing task->mempolicy in
>> get_vma_policy() if it's non-NULL in the lockess check to increment its
>> refcount.  This ensures it will remain in memory until dropped by
>> __mpol_put() after mpol_to_str() is called.
>>
>> Refcounts of shared policies are grabbed by the ->get_policy() function of
>> the vma, all others will be grabbed directly in get_vma_policy().  Now
>> that this is done, all callers now unconditionally drop the refcount.
>>
>
> please add original problem description....
>
> from your 1st patch.
>> When reading /proc/pid/numa_maps, it's possible to return the contents of
>> the stack where the mempolicy string should be printed if the policy gets
>> freed from beneath us.
>>
>> This happens because mpol_to_str() may return an error the
>> stack-allocated buffer is then printed without ever being stored.
> .....
>
> Hmm, I've read the whole thread again...and, I'm sorry if I misunderstand something.
>
> I think Kosaki mentioned the commit 52cd3b0740. It avoids refcounting in get_vma_policy()
> because it's called every time alloc_pages_vma() is called, at every page fault.
> So, it seems he doesn't agree this fix because of performance concern on big NUMA,
>
>
> Can't we have another way to fix ? like this ? too ugly ?
> Again, I'm sorry if I misunderstand the points.
>
Sorry this patch itself may be buggy. please don't test..
I missed that kernel/exit.c sets task->mempolicy to be NULL.
fixed one here.

--
 From 5581c71e68a7f50e52fd67cca00148911023f9f5 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Thu, 18 Oct 2012 13:50:29 +0900
Subject: [PATCH] hold task->mempolicy while numa_maps scans.

  /proc/<pid>/numa_maps scans vma and show mempolicy under
  mmap_sem. It sometimes accesses task->mempolicy which can
  be freed without mmap_sem and numa_maps can show some
  garbage while scanning.

This patch tries to take reference count of task->mempolicy at reading
numa_maps before calling get_vma_policy(). By this, task->mempolicy
will not be freed until numa_maps reaches its end.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

V1->V2
  -  access task->mempolicy only once and remember it.  Becase kernel/exit.c
     can overwrite it.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
  fs/proc/internal.h |    4 ++++
  fs/proc/task_mmu.c |   33 ++++++++++++++++++++++++++++++++-
  2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index cceaab0..43973b0 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -12,6 +12,7 @@
  #include <linux/sched.h>
  #include <linux/proc_fs.h>
  struct  ctl_table_header;
+struct  mempolicy;
  
  extern struct proc_dir_entry proc_root;
  #ifdef CONFIG_PROC_SYSCTL
@@ -74,6 +75,9 @@ struct proc_maps_private {
  #ifdef CONFIG_MMU
  	struct vm_area_struct *tail_vma;
  #endif
+#ifdef CONFIG_NUMA
+	struct mempolicy *task_mempolicy;
+#endif
  };
  
  void proc_init_inodecache(void);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 14df880..624927d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -89,11 +89,41 @@ static void pad_len_spaces(struct seq_file *m, int len)
  		len = 1;
  	seq_printf(m, "%*c", len, ' ');
  }
+#ifdef CONFIG_NUMA
+/*
+ * numa_maps scans all vmas under mmap_sem and checks their mempolicy.
+ * But task->mempolicy is not guarded by mmap_sem, it can be cleared/freed
+ * under task_lock() (see kernel/exit.c) replacement of it is guarded by
+ * mmap_sem. So, take referenceount under task_lock() before we start
+ * scanning and drop it when numa_maps reaches the end.
+ */
+static void hold_task_mempolicy(struct proc_maps_private *priv)
+{
+	struct task_struct *task = priv->task;
+
+	task_lock(task);
+	priv->task_mempolicy = task->mempolicy;
+	mpol_get(priv->task_mempolicy);
+	task_unlock(task);
+}
+static void release_task_mempolicy(struct proc_maps_private *priv)
+{
+	mpol_put(priv->task_mempolicy);
+}
+#else
+static void hold_task_mempolicy(struct proc_maps_private *priv)
+{
+}
+static void release_task_mempolicy(struct proc_maps_private *priv)
+{
+}
+#endif
  
  static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
  {
  	if (vma && vma != priv->tail_vma) {
  		struct mm_struct *mm = vma->vm_mm;
+		release_task_mempolicy(priv);
  		up_read(&mm->mmap_sem);
  		mmput(mm);
  	}
@@ -132,7 +162,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
  
  	tail_vma = get_gate_vma(priv->task->mm);
  	priv->tail_vma = tail_vma;
-
+	hold_task_mempolicy(priv);
  	/* Start with last addr hint */
  	vma = find_vma(mm, last_addr);
  	if (last_addr && vma) {
@@ -159,6 +189,7 @@ out:
  	if (vma)
  		return vma;
  
+	release_task_mempolicy(priv);
  	/* End of vmas has been reached */
  	m->version = (tail_vma != NULL)? 0: -1UL;
  	up_read(&mm->mmap_sem);
-- 
1.7.10.2




^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps
  2012-10-18  4:06                                 ` Kamezawa Hiroyuki
  2012-10-18  4:14                                   ` Linus Torvalds
  2012-10-18  4:34                                   ` Kamezawa Hiroyuki
@ 2012-10-18  4:35                                   ` David Rientjes
  2 siblings, 0 replies; 59+ messages in thread
From: David Rientjes @ 2012-10-18  4:35 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Linus Torvalds, Andrew Morton, Dave Jones, KOSAKI Motohiro,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	linux-kernel, linux-mm

On Thu, 18 Oct 2012, Kamezawa Hiroyuki wrote:

> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 14df880..d92e868 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -94,6 +94,11 @@ static void vma_stop(struct proc_maps_private *priv, struct
> vm_area_struct *vma)
>  {
>  	if (vma && vma != priv->tail_vma) {
>  		struct mm_struct *mm = vma->vm_mm;
> +#ifdef CONFIG_NUMA
> +		task_lock(priv->task);
> +		__mpol_put(priv->task->mempolicy);
> +		task_unlock(priv->task);
> +#endif
>  		up_read(&mm->mmap_sem);
>  		mmput(mm);
>  	}
> @@ -130,6 +135,16 @@ static void *m_start(struct seq_file *m, loff_t *pos)
>  		return mm;
>  	down_read(&mm->mmap_sem);
>  +	/*
> +	 * task->mempolicy can be freed even if mmap_sem is down (see
> kernel/exit.c)
> +	 * We grab refcount for stable access.
> +	 * repleacement of task->mmpolicy is guarded by mmap_sem.
> +	 */
> +#ifdef CONFIG_NUMA
> +	task_lock(priv->task);
> +	mpol_get(priv->task->mempolicy);
> +	task_unlock(priv->task);
> +#endif
>  	tail_vma = get_gate_vma(priv->task->mm);
>  	priv->tail_vma = tail_vma;
>  @@ -161,6 +176,11 @@ out:
>   	/* End of vmas has been reached */
>  	m->version = (tail_vma != NULL)? 0: -1UL;
> +#ifdef CONFIG_NUMA
> +	task_lock(priv->task);
> +	__mpol_put(priv->task->mempolicy);
> +	task_unlock(priv->task);
> +#endif
>  	up_read(&mm->mmap_sem);
>  	mmput(mm);
>  	return tail_vma;

Yes, I must admit that this is better than my version and it looks like 
all the ->show() functions that use these start, next, stop functions 
don't take task_lock() and this would generally be useful: we already hold 
current->mm->mmap_sem so there is little harm in holding 
task_lock(current) when reading these files as long as we're not touching 
the fastpath.

These routines seem like it would nicely be added to mempolicy.h since we 
depend on CONFIG_NUMA there already.

Please fix up the mess I made in show_numa_map() in 32f8516a8c73 ("mm, 
mempolicy: fix printing stack contents in numa_maps") by simply removing 
the task_lock() and task_unlock() as part of your patch.

Thanks Kame!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps
  2012-10-18  4:14                                   ` Linus Torvalds
@ 2012-10-18  4:41                                     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 59+ messages in thread
From: Kamezawa Hiroyuki @ 2012-10-18  4:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Rientjes, Andrew Morton, Dave Jones, KOSAKI Motohiro,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	linux-kernel, linux-mm

(2012/10/18 13:14), Linus Torvalds wrote:
> On Wed, Oct 17, 2012 at 9:06 PM, Kamezawa Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>          if (vma && vma != priv->tail_vma) {
>>                  struct mm_struct *mm = vma->vm_mm;
>> +#ifdef CONFIG_NUMA
>> +               task_lock(priv->task);
>> +               __mpol_put(priv->task->mempolicy);
>> +               task_unlock(priv->task);
>> +#endif
>>                  up_read(&mm->mmap_sem);
>>                  mmput(mm);
>
> Please don't put #ifdef's inside code. It makes things really ugly and
> hard to read.
>
> And that is *especially* true in this case, since there's a pattern to
> all these things:
>
>> +#ifdef CONFIG_NUMA
>> +       task_lock(priv->task);
>> +       mpol_get(priv->task->mempolicy);
>> +       task_unlock(priv->task);
>> +#endif
>
>> +#ifdef CONFIG_NUMA
>> +       task_lock(priv->task);
>> +       __mpol_put(priv->task->mempolicy);
>> +       task_unlock(priv->task);
>> +#endif
>
> it really sounds like what you want to do is to just abstract a
> "numa_policy_get/put(priv)" operation.
>
> So you could make it be something like
>
>    #ifdef CONFIG_NUMA
>    static inline numa_policy_get(struct proc_maps_private *priv)
>    {
>        task_lock(priv->task);
>        mpol_get(priv->task->mempolicy);
>        task_unlock(priv->task);
>    }
>    .. same for the "put" function ..
>    #else
>      #define numa_policy_get(priv) do { } while (0)
>      #define numa_policy_put(priv) do { } while (0)
>    #endif
>
> and then you wouldn't have to have the #ifdef's in the middle of code,
> and I think it will be more readable in general.
>
> Sure, it is going to be a few more actual lines of patch, but there's
> no duplicated code sequence, and the added lines are just the syntax
> that makes it look better.
>

you're right, I shouldn't send an ugly patch. I'm sorry.
V2 uses suggested style, I think.

Regards,
-Kame



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps
  2012-10-18  4:34                                   ` Kamezawa Hiroyuki
@ 2012-10-18 20:03                                     ` David Rientjes
  2012-10-19  8:35                                       ` [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while " Kamezawa Hiroyuki
  2012-10-19  6:51                                     ` [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when " KOSAKI Motohiro
  1 sibling, 1 reply; 59+ messages in thread
From: David Rientjes @ 2012-10-18 20:03 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Linus Torvalds, Andrew Morton, Dave Jones, KOSAKI Motohiro,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	linux-kernel, linux-mm

On Thu, 18 Oct 2012, Kamezawa Hiroyuki wrote:

> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index cceaab0..43973b0 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -12,6 +12,7 @@
>  #include <linux/sched.h>
>  #include <linux/proc_fs.h>
>  struct  ctl_table_header;
> +struct  mempolicy;
>   extern struct proc_dir_entry proc_root;
>  #ifdef CONFIG_PROC_SYSCTL
> @@ -74,6 +75,9 @@ struct proc_maps_private {
>  #ifdef CONFIG_MMU
>  	struct vm_area_struct *tail_vma;
>  #endif
> +#ifdef CONFIG_NUMA
> +	struct mempolicy *task_mempolicy;
> +#endif
>  };
>   void proc_init_inodecache(void);
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 14df880..624927d 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -89,11 +89,41 @@ static void pad_len_spaces(struct seq_file *m, int len)
>  		len = 1;
>  	seq_printf(m, "%*c", len, ' ');
>  }
> +#ifdef CONFIG_NUMA
> +/*
> + * numa_maps scans all vmas under mmap_sem and checks their mempolicy.

Doesn't only affect numa_maps, it also affects maps and smaps although 
they don't need the refcounts.

> + * But task->mempolicy is not guarded by mmap_sem, it can be cleared/freed
> + * under task_lock() (see kernel/exit.c) replacement of it is guarded by
> + * mmap_sem.

I think this should be a little more verbose making it clear that 
task->mempolicy can be cleared and freed if its refcount drops to 0 and is 
only protected by task_lock() and that we're safe from task->mempolicy 
changing between ->start(), ->next(), and ->stop() because 
task->mm->mmap_sem is held for the duration.

> So, take referenceount under task_lock() before we start
> + * scanning and drop it when numa_maps reaches the end.
> + */
> +static void hold_task_mempolicy(struct proc_maps_private *priv)
> +{
> +	struct task_struct *task = priv->task;
> +
> +	task_lock(task);
> +	priv->task_mempolicy = task->mempolicy;
> +	mpol_get(priv->task_mempolicy);
> +	task_unlock(task);
> +}
> +static void release_task_mempolicy(struct proc_maps_private *priv)
> +{
> +	mpol_put(priv->task_mempolicy);
> +}
> +#else
> +static void hold_task_mempolicy(struct proc_maps_private *priv)
> +{
> +}
> +static void release_task_mempolicy(struct proc_maps_private *priv)
> +{
> +}
> +#endif
>   static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct
> *vma)
>  {
>  	if (vma && vma != priv->tail_vma) {
>  		struct mm_struct *mm = vma->vm_mm;
> +		release_task_mempolicy(priv);
>  		up_read(&mm->mmap_sem);
>  		mmput(mm);
>  	}
> @@ -132,7 +162,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
>   	tail_vma = get_gate_vma(priv->task->mm);
>  	priv->tail_vma = tail_vma;
> -
> +	hold_task_mempolicy(priv);
>  	/* Start with last addr hint */
>  	vma = find_vma(mm, last_addr);
>  	if (last_addr && vma) {
> @@ -159,6 +189,7 @@ out:
>  	if (vma)
>  		return vma;
>  +	release_task_mempolicy(priv);
>  	/* End of vmas has been reached */
>  	m->version = (tail_vma != NULL)? 0: -1UL;
>  	up_read(&mm->mmap_sem);

Otherwise looks good, but please remove the two task_lock()'s in 
show_numa_map() that I added as part of this since you're replacing the 
need for locking.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps
  2012-10-18  4:34                                   ` Kamezawa Hiroyuki
  2012-10-18 20:03                                     ` David Rientjes
@ 2012-10-19  6:51                                     ` KOSAKI Motohiro
  1 sibling, 0 replies; 59+ messages in thread
From: KOSAKI Motohiro @ 2012-10-19  6:51 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: David Rientjes, Linus Torvalds, Andrew Morton, Dave Jones,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	linux-kernel, linux-mm

>> Can't we have another way to fix ? like this ? too ugly ?
>> Again, I'm sorry if I misunderstand the points.
>>
> Sorry this patch itself may be buggy. please don't test..
> I missed that kernel/exit.c sets task->mempolicy to be NULL.
> fixed one here.
>
> --
> From 5581c71e68a7f50e52fd67cca00148911023f9f5 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Thu, 18 Oct 2012 13:50:29 +0900
>
> Subject: [PATCH] hold task->mempolicy while numa_maps scans.
>
>  /proc/<pid>/numa_maps scans vma and show mempolicy under
>  mmap_sem. It sometimes accesses task->mempolicy which can
>  be freed without mmap_sem and numa_maps can show some
>  garbage while scanning.
>
> This patch tries to take reference count of task->mempolicy at reading
> numa_maps before calling get_vma_policy(). By this, task->mempolicy
> will not be freed until numa_maps reaches its end.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> V1->V2
>  -  access task->mempolicy only once and remember it.  Becase kernel/exit.c
>     can overwrite it.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Ok, this is acceptable to me. go ahead.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps.
  2012-10-18 20:03                                     ` David Rientjes
@ 2012-10-19  8:35                                       ` Kamezawa Hiroyuki
  2012-10-19  9:28                                         ` David Rientjes
  2012-10-19 19:15                                         ` KOSAKI Motohiro
  0 siblings, 2 replies; 59+ messages in thread
From: Kamezawa Hiroyuki @ 2012-10-19  8:35 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, Andrew Morton, Dave Jones, KOSAKI Motohiro,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	linux-kernel, linux-mm

(2012/10/19 5:03), David Rientjes wrote:
> On Thu, 18 Oct 2012, Kamezawa Hiroyuki wrote:
>> @@ -132,7 +162,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
>>    	tail_vma = get_gate_vma(priv->task->mm);
>>   	priv->tail_vma = tail_vma;
>> -
>> +	hold_task_mempolicy(priv);
>>   	/* Start with last addr hint */
>>   	vma = find_vma(mm, last_addr);
>>   	if (last_addr && vma) {
>> @@ -159,6 +189,7 @@ out:
>>   	if (vma)
>>   		return vma;
>>   +	release_task_mempolicy(priv);
>>   	/* End of vmas has been reached */
>>   	m->version = (tail_vma != NULL)? 0: -1UL;
>>   	up_read(&mm->mmap_sem);
>
> Otherwise looks good, but please remove the two task_lock()'s in
> show_numa_map() that I added as part of this since you're replacing the
> need for locking.
>
Thank you for your review.
How about this ?

==
 From c5849c9034abeec3f26bf30dadccd393b0c5c25e Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Fri, 19 Oct 2012 17:00:55 +0900
Subject: [PATCH] hold task->mempolicy while numa_maps scans.

  /proc/<pid>/numa_maps scans vma and show mempolicy under
  mmap_sem. It sometimes accesses task->mempolicy which can
  be freed without mmap_sem and numa_maps can show some
  garbage while scanning.

This patch tries to take reference count of task->mempolicy at reading
numa_maps before calling get_vma_policy(). By this, task->mempolicy
will not be freed until numa_maps reaches its end.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

V2->v3
  -  updated comments to be more verbose.
  -  removed task_lock() in numa_maps code.
V1->V2
  -  access task->mempolicy only once and remember it.  Becase kernel/exit.c
     can overwrite it.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
  fs/proc/internal.h |    4 ++++
  fs/proc/task_mmu.c |   49 ++++++++++++++++++++++++++++++++++++++++++++++---
  2 files changed, 50 insertions(+), 3 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index cceaab0..43973b0 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -12,6 +12,7 @@
  #include <linux/sched.h>
  #include <linux/proc_fs.h>
  struct  ctl_table_header;
+struct  mempolicy;
  
  extern struct proc_dir_entry proc_root;
  #ifdef CONFIG_PROC_SYSCTL
@@ -74,6 +75,9 @@ struct proc_maps_private {
  #ifdef CONFIG_MMU
  	struct vm_area_struct *tail_vma;
  #endif
+#ifdef CONFIG_NUMA
+	struct mempolicy *task_mempolicy;
+#endif
  };
  
  void proc_init_inodecache(void);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 14df880..2371fea 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -89,11 +89,55 @@ static void pad_len_spaces(struct seq_file *m, int len)
  		len = 1;
  	seq_printf(m, "%*c", len, ' ');
  }
+#ifdef CONFIG_NUMA
+/*
+ * These functions are for numa_maps but called in generic **maps seq_file
+ * ->start(), ->stop() ops.
+ *
+ * numa_maps scans all vmas under mmap_sem and checks their mempolicy.
+ * Each mempolicy object is controlled by reference counting. The problem here
+ * is how to avoid accessing dead mempolicy object.
+ *
+ * Because we're holding mmap_sem while reading seq_file, it's safe to access
+ * each vma's mempolicy, no vma objects will never drop refs to mempolicy.
+ *
+ * A task's mempolicy (task->mempolicy) has different behavior. task->mempolicy
+ * is set and replaced under mmap_sem but unrefed and cleared under task_lock().
+ * So, without task_lock(), we cannot trust get_vma_policy() because we cannot
+ * gurantee the task never exits under us. But taking task_lock() around
+ * get_vma_plicy() causes lock order problem.
+ *
+ * To access task->mempolicy without lock, we hold a reference count of an
+ * object pointed by task->mempolicy and remember it. This will guarantee
+ * that task->mempolicy points to an alive object or NULL in numa_maps accesses.
+ */
+static void hold_task_mempolicy(struct proc_maps_private *priv)
+{
+	struct task_struct *task = priv->task;
+
+	task_lock(task);
+	priv->task_mempolicy = task->mempolicy;
+	mpol_get(priv->task_mempolicy);
+	task_unlock(task);
+}
+static void release_task_mempolicy(struct proc_maps_private *priv)
+{
+	mpol_put(priv->task_mempolicy);
+}
+#else
+static void hold_task_mempolicy(struct proc_maps_private *priv)
+{
+}
+static void release_task_mempolicy(struct proc_maps_private *priv)
+{
+}
+#endif
  
  static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
  {
  	if (vma && vma != priv->tail_vma) {
  		struct mm_struct *mm = vma->vm_mm;
+		release_task_mempolicy(priv);
  		up_read(&mm->mmap_sem);
  		mmput(mm);
  	}
@@ -132,7 +176,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
  
  	tail_vma = get_gate_vma(priv->task->mm);
  	priv->tail_vma = tail_vma;
-
+	hold_task_mempolicy(priv);
  	/* Start with last addr hint */
  	vma = find_vma(mm, last_addr);
  	if (last_addr && vma) {
@@ -159,6 +203,7 @@ out:
  	if (vma)
  		return vma;
  
+	release_task_mempolicy(priv);
  	/* End of vmas has been reached */
  	m->version = (tail_vma != NULL)? 0: -1UL;
  	up_read(&mm->mmap_sem);
@@ -1178,11 +1223,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
  	walk.private = md;
  	walk.mm = mm;
  
-	task_lock(task);
  	pol = get_vma_policy(task, vma, vma->vm_start);
  	mpol_to_str(buffer, sizeof(buffer), pol, 0);
  	mpol_cond_put(pol);
-	task_unlock(task);
  
  	seq_printf(m, "%08lx %s", vma->vm_start, buffer);
  
-- 
1.7.10.2




^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps.
  2012-10-19  8:35                                       ` [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while " Kamezawa Hiroyuki
@ 2012-10-19  9:28                                         ` David Rientjes
  2012-10-22  2:47                                           ` Kamezawa Hiroyuki
  2012-10-19 19:15                                         ` KOSAKI Motohiro
  1 sibling, 1 reply; 59+ messages in thread
From: David Rientjes @ 2012-10-19  9:28 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Linus Torvalds, Andrew Morton, Dave Jones, KOSAKI Motohiro,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	linux-kernel, linux-mm

On Fri, 19 Oct 2012, Kamezawa Hiroyuki wrote:

> From c5849c9034abeec3f26bf30dadccd393b0c5c25e Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Fri, 19 Oct 2012 17:00:55 +0900
> Subject: [PATCH] hold task->mempolicy while numa_maps scans.
> 
>  /proc/<pid>/numa_maps scans vma and show mempolicy under
>  mmap_sem. It sometimes accesses task->mempolicy which can
>  be freed without mmap_sem and numa_maps can show some
>  garbage while scanning.
> 
> This patch tries to take reference count of task->mempolicy at reading
> numa_maps before calling get_vma_policy(). By this, task->mempolicy
> will not be freed until numa_maps reaches its end.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Looks good, but the patch is whitespace damaged so it doesn't apply.  When 
that's fixed:

Acked-by: David Rientjes <rientjes@google.com>

Thanks for following through on this!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps.
  2012-10-19  8:35                                       ` [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while " Kamezawa Hiroyuki
  2012-10-19  9:28                                         ` David Rientjes
@ 2012-10-19 19:15                                         ` KOSAKI Motohiro
  1 sibling, 0 replies; 59+ messages in thread
From: KOSAKI Motohiro @ 2012-10-19 19:15 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: David Rientjes, Linus Torvalds, Andrew Morton, Dave Jones,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	linux-kernel, linux-mm

On Fri, Oct 19, 2012 at 4:35 AM, Kamezawa Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> (2012/10/19 5:03), David Rientjes wrote:
>>
>> On Thu, 18 Oct 2012, Kamezawa Hiroyuki wrote:
>>>
>>> @@ -132,7 +162,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
>>>         tail_vma = get_gate_vma(priv->task->mm);
>>>         priv->tail_vma = tail_vma;
>>> -
>>> +       hold_task_mempolicy(priv);
>>>         /* Start with last addr hint */
>>>         vma = find_vma(mm, last_addr);
>>>         if (last_addr && vma) {
>>> @@ -159,6 +189,7 @@ out:
>>>         if (vma)
>>>                 return vma;
>>>   +     release_task_mempolicy(priv);
>>>         /* End of vmas has been reached */
>>>         m->version = (tail_vma != NULL)? 0: -1UL;
>>>         up_read(&mm->mmap_sem);
>>
>>
>> Otherwise looks good, but please remove the two task_lock()'s in
>> show_numa_map() that I added as part of this since you're replacing the
>> need for locking.
>>
> Thank you for your review.
> How about this ?
>
> ==
> From c5849c9034abeec3f26bf30dadccd393b0c5c25e Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Fri, 19 Oct 2012 17:00:55 +0900
> Subject: [PATCH] hold task->mempolicy while numa_maps scans.
>
>  /proc/<pid>/numa_maps scans vma and show mempolicy under
>  mmap_sem. It sometimes accesses task->mempolicy which can
>  be freed without mmap_sem and numa_maps can show some
>  garbage while scanning.
>
> This patch tries to take reference count of task->mempolicy at reading
> numa_maps before calling get_vma_policy(). By this, task->mempolicy
> will not be freed until numa_maps reaches its end.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> V2->v3
>  -  updated comments to be more verbose.
>  -  removed task_lock() in numa_maps code.
> V1->V2
>  -  access task->mempolicy only once and remember it.  Becase kernel/exit.c
>     can overwrite it.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps.
  2012-10-19  9:28                                         ` David Rientjes
@ 2012-10-22  2:47                                           ` Kamezawa Hiroyuki
  2012-10-22 20:55                                             ` Andrew Morton
  2012-10-22 20:56                                             ` David Rientjes
  0 siblings, 2 replies; 59+ messages in thread
From: Kamezawa Hiroyuki @ 2012-10-22  2:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: Linus Torvalds, Andrew Morton, Dave Jones, KOSAKI Motohiro,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	linux-kernel, linux-mm

(2012/10/19 18:28), David Rientjes wrote:

> Looks good, but the patch is whitespace damaged so it doesn't apply.  When
> that's fixed:
>
> Acked-by: David Rientjes <rientjes@google.com>

Sorry, I hope this one is not broken...
==
 From c5849c9034abeec3f26bf30dadccd393b0c5c25e Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Fri, 19 Oct 2012 17:00:55 +0900
Subject: [PATCH] hold task->mempolicy while numa_maps scans.

  /proc/<pid>/numa_maps scans vma and show mempolicy under
  mmap_sem. It sometimes accesses task->mempolicy which can
  be freed without mmap_sem and numa_maps can show some
  garbage while scanning.

This patch tries to take reference count of task->mempolicy at reading
numa_maps before calling get_vma_policy(). By this, task->mempolicy
will not be freed until numa_maps reaches its end.

Acked-by: David Rientjes <rientjes@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

V2->v3
  -  updated comments to be more verbose.
  -  removed task_lock() in numa_maps code.
V1->V2
  -  access task->mempolicy only once and remember it.  Becase kernel/exit.c
     can overwrite it.

---
  fs/proc/internal.h |    4 ++++
  fs/proc/task_mmu.c |   49 ++++++++++++++++++++++++++++++++++++++++++++++---
  2 files changed, 50 insertions(+), 3 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index cceaab0..43973b0 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -12,6 +12,7 @@
  #include <linux/sched.h>
  #include <linux/proc_fs.h>
  struct  ctl_table_header;
+struct  mempolicy;
  
  extern struct proc_dir_entry proc_root;
  #ifdef CONFIG_PROC_SYSCTL
@@ -74,6 +75,9 @@ struct proc_maps_private {
  #ifdef CONFIG_MMU
  	struct vm_area_struct *tail_vma;
  #endif
+#ifdef CONFIG_NUMA
+	struct mempolicy *task_mempolicy;
+#endif
  };
  
  void proc_init_inodecache(void);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 14df880..2371fea 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -89,11 +89,55 @@ static void pad_len_spaces(struct seq_file *m, int len)
  		len = 1;
  	seq_printf(m, "%*c", len, ' ');
  }
+#ifdef CONFIG_NUMA
+/*
+ * These functions are for numa_maps but called in generic **maps seq_file
+ * ->start(), ->stop() ops.
+ *
+ * numa_maps scans all vmas under mmap_sem and checks their mempolicy.
+ * Each mempolicy object is controlled by reference counting. The problem here
+ * is how to avoid accessing dead mempolicy object.
+ *
+ * Because we're holding mmap_sem while reading seq_file, it's safe to access
+ * each vma's mempolicy, no vma objects will never drop refs to mempolicy.
+ *
+ * A task's mempolicy (task->mempolicy) has different behavior. task->mempolicy
+ * is set and replaced under mmap_sem but unrefed and cleared under task_lock().
+ * So, without task_lock(), we cannot trust get_vma_policy() because we cannot
+ * gurantee the task never exits under us. But taking task_lock() around
+ * get_vma_plicy() causes lock order problem.
+ *
+ * To access task->mempolicy without lock, we hold a reference count of an
+ * object pointed by task->mempolicy and remember it. This will guarantee
+ * that task->mempolicy points to an alive object or NULL in numa_maps accesses.
+ */
+static void hold_task_mempolicy(struct proc_maps_private *priv)
+{
+	struct task_struct *task = priv->task;
+
+	task_lock(task);
+	priv->task_mempolicy = task->mempolicy;
+	mpol_get(priv->task_mempolicy);
+	task_unlock(task);
+}
+static void release_task_mempolicy(struct proc_maps_private *priv)
+{
+	mpol_put(priv->task_mempolicy);
+}
+#else
+static void hold_task_mempolicy(struct proc_maps_private *priv)
+{
+}
+static void release_task_mempolicy(struct proc_maps_private *priv)
+{
+}
+#endif
  
  static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
  {
  	if (vma && vma != priv->tail_vma) {
  		struct mm_struct *mm = vma->vm_mm;
+		release_task_mempolicy(priv);
  		up_read(&mm->mmap_sem);
  		mmput(mm);
  	}
@@ -132,7 +176,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
  
  	tail_vma = get_gate_vma(priv->task->mm);
  	priv->tail_vma = tail_vma;
-
+	hold_task_mempolicy(priv);
  	/* Start with last addr hint */
  	vma = find_vma(mm, last_addr);
  	if (last_addr && vma) {
@@ -159,6 +203,7 @@ out:
  	if (vma)
  		return vma;
  
+	release_task_mempolicy(priv);
  	/* End of vmas has been reached */
  	m->version = (tail_vma != NULL)? 0: -1UL;
  	up_read(&mm->mmap_sem);
@@ -1178,11 +1223,9 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
  	walk.private = md;
  	walk.mm = mm;
  
-	task_lock(task);
  	pol = get_vma_policy(task, vma, vma->vm_start);
  	mpol_to_str(buffer, sizeof(buffer), pol, 0);
  	mpol_cond_put(pol);
-	task_unlock(task);
  
  	seq_printf(m, "%08lx %s", vma->vm_start, buffer);
  
-- 
1.7.10.2





^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps.
  2012-10-22  2:47                                           ` Kamezawa Hiroyuki
@ 2012-10-22 20:55                                             ` Andrew Morton
  2012-10-22 20:56                                             ` David Rientjes
  1 sibling, 0 replies; 59+ messages in thread
From: Andrew Morton @ 2012-10-22 20:55 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: David Rientjes, Linus Torvalds, Dave Jones, KOSAKI Motohiro,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	linux-kernel, linux-mm

On Mon, 22 Oct 2012 11:47:31 +0900
Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> (2012/10/19 18:28), David Rientjes wrote:
> 
> > Looks good, but the patch is whitespace damaged so it doesn't apply.  When
> > that's fixed:
> >
> > Acked-by: David Rientjes <rientjes@google.com>
> 
> Sorry, I hope this one is not broken...
>
> ...
>
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -12,6 +12,7 @@
>   #include <linux/sched.h>
>   #include <linux/proc_fs.h>
>   struct  ctl_table_header;
> +struct  mempolicy;
>   
>   extern struct proc_dir_entry proc_root;
>   #ifdef CONFIG_PROC_SYSCTL
> @@ -74,6 +75,9 @@ struct proc_maps_private {
>   #ifdef CONFIG_MMU
>   	struct vm_area_struct *tail_vma;
>   #endif
> +#ifdef CONFIG_NUMA
> +	struct mempolicy *task_mempolicy;
> +#endif
>   };

The mail client space-stuffed it.

We merged this three days ago, in 9e7814404b77c3e8920b.  Please check
that it landed OK - there's a newline fixup in there but it looks good
to me.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while reading numa_maps.
  2012-10-22  2:47                                           ` Kamezawa Hiroyuki
  2012-10-22 20:55                                             ` Andrew Morton
@ 2012-10-22 20:56                                             ` David Rientjes
  1 sibling, 0 replies; 59+ messages in thread
From: David Rientjes @ 2012-10-22 20:56 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Linus Torvalds, Andrew Morton, Dave Jones, KOSAKI Motohiro,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	linux-kernel, linux-mm

On Mon, 22 Oct 2012, Kamezawa Hiroyuki wrote:

> > Looks good, but the patch is whitespace damaged so it doesn't apply.  When
> > that's fixed:
> > 
> > Acked-by: David Rientjes <rientjes@google.com>
> 
> Sorry, I hope this one is not broken...

Looks like Linus picked this up directly, thanks Kame!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-17  5:24                 ` David Rientjes
  2012-10-17  5:42                   ` Kamezawa Hiroyuki
  2012-10-17 18:14                   ` Dave Jones
@ 2012-10-24 23:30                   ` Sasha Levin
  2012-10-24 23:34                     ` David Rientjes
  2 siblings, 1 reply; 59+ messages in thread
From: Sasha Levin @ 2012-10-24 23:30 UTC (permalink / raw)
  To: David Rientjes
  Cc: Dave Jones, Andrew Morton, Linus Torvalds, KOSAKI Motohiro,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Wed, Oct 17, 2012 at 1:24 AM, David Rientjes <rientjes@google.com> wrote:
> On Wed, 17 Oct 2012, Dave Jones wrote:
>
>> BUG: sleeping function called from invalid context at kernel/mutex.c:269
>> in_atomic(): 1, irqs_disabled(): 0, pid: 8558, name: trinity-child2
>> 3 locks on stack by trinity-child2/8558:
>>  #0: held:     (&p->lock){+.+.+.}, instance: ffff88010c9a00b0, at: [<ffffffff8120cd1f>] seq_lseek+0x3f/0x120
>>  #1: held:     (&mm->mmap_sem){++++++}, instance: ffff88013956f7c8, at: [<ffffffff81254437>] m_start+0xa7/0x190
>>  #2: held:     (&(&p->alloc_lock)->rlock){+.+...}, instance: ffff88011fc64f30, at: [<ffffffff81254f8f>] show_numa_map+0x14f/0x610
>> Pid: 8558, comm: trinity-child2 Not tainted 3.7.0-rc1+ #32
>> Call Trace:
>>  [<ffffffff810ae4ec>] __might_sleep+0x14c/0x200
>>  [<ffffffff816bdf4e>] mutex_lock_nested+0x2e/0x50
>>  [<ffffffff811c43a3>] mpol_shared_policy_lookup+0x33/0x90
>>  [<ffffffff8118d5c3>] shmem_get_policy+0x33/0x40
>>  [<ffffffff811c31fa>] get_vma_policy+0x3a/0x90
>>  [<ffffffff81254fa3>] show_numa_map+0x163/0x610
>>  [<ffffffff81255b10>] ? pid_maps_open+0x20/0x20
>>  [<ffffffff81255980>] ? pagemap_hugetlb_range+0xf0/0xf0
>>  [<ffffffff81255483>] show_pid_numa_map+0x13/0x20
>>  [<ffffffff8120c902>] traverse+0xf2/0x230
>>  [<ffffffff8120cd8b>] seq_lseek+0xab/0x120
>>  [<ffffffff811e6c0b>] sys_lseek+0x7b/0xb0
>>  [<ffffffff816ca088>] tracesys+0xe1/0xe6
>>
>
> Hmm, looks like we need to change the refcount semantics entirely.  We'll
> need to make get_vma_policy() always take a reference and then drop it
> accordingly.  This work sif get_vma_policy() can grab a reference while
> holding task_lock() for the task policy fallback case.
>
> Comments on this approach?
> ---
[snip]

I'm not sure about the status of the patch, but it doesn't apply on
top of -next, and I still
see the warnings when fuzzing on -next.


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-24 23:30                   ` [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Sasha Levin
@ 2012-10-24 23:34                     ` David Rientjes
  2012-10-24 23:37                       ` Sasha Levin
  0 siblings, 1 reply; 59+ messages in thread
From: David Rientjes @ 2012-10-24 23:34 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Dave Jones, Andrew Morton, Linus Torvalds, KOSAKI Motohiro,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Wed, 24 Oct 2012, Sasha Levin wrote:

> I'm not sure about the status of the patch, but it doesn't apply on
> top of -next, and I still
> see the warnings when fuzzing on -next.
> 

This should be fixed by 9e7814404b77 ("hold task->mempolicy while 
numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading 
/proc/pid/numa_maps on that kernel?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-24 23:34                     ` David Rientjes
@ 2012-10-24 23:37                       ` Sasha Levin
  2012-10-25  0:08                         ` David Rientjes
  0 siblings, 1 reply; 59+ messages in thread
From: Sasha Levin @ 2012-10-24 23:37 UTC (permalink / raw)
  To: David Rientjes
  Cc: Dave Jones, Andrew Morton, Linus Torvalds, KOSAKI Motohiro,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Wed, Oct 24, 2012 at 7:34 PM, David Rientjes <rientjes@google.com> wrote:
> On Wed, 24 Oct 2012, Sasha Levin wrote:
>
>> I'm not sure about the status of the patch, but it doesn't apply on
>> top of -next, and I still
>> see the warnings when fuzzing on -next.
>>
>
> This should be fixed by 9e7814404b77 ("hold task->mempolicy while
> numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading
> /proc/pid/numa_maps on that kernel?

I was actually referring to the warnings Dave Jones saw when fuzzing
with trinity after the
original patch was applied.

I still see the following when fuzzing:

[  338.467156] BUG: sleeping function called from invalid context at
kernel/mutex.c:269
[  338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
[  338.481199] 2 locks held by trinity-main/6361:
[  338.486629]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff810aa314>]
__do_page_fault+0x1e4/0x4f0
[  338.498783]  #1:  (&(&mm->page_table_lock)->rlock){+.+...}, at:
[<ffffffff8122f017>] handle_pte_fault+0x3f7/0x6a0
[  338.511409] Pid: 6361, comm: trinity-main Tainted: G        W
3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74
[  338.530318] Call Trace:
[  338.534088]  [<ffffffff8114e393>] __might_sleep+0x1c3/0x1e0
[  338.539358]  [<ffffffff83ae5209>] mutex_lock_nested+0x29/0x50
[  338.545253]  [<ffffffff8124fc3e>] mpol_shared_policy_lookup+0x2e/0x90
[  338.545258]  [<ffffffff81219ebe>] shmem_get_policy+0x2e/0x30
[  338.545264]  [<ffffffff8124e99a>] get_vma_policy+0x5a/0xa0
[  338.545267]  [<ffffffff8124fce1>] mpol_misplaced+0x41/0x1d0
[  338.545272]  [<ffffffff8122f085>] handle_pte_fault+0x465/0x6a0
[  338.545278]  [<ffffffff81131e04>] ? __rcu_read_unlock+0x44/0xb0
[  338.545282]  [<ffffffff81230baa>] handle_mm_fault+0x32a/0x360
[  338.545286]  [<ffffffff810aa5b0>] __do_page_fault+0x480/0x4f0
[  338.545293]  [<ffffffff8111a706>] ? del_timer+0x26/0x80
[  338.545298]  [<ffffffff811c7313>] ? rcu_cleanup_after_idle+0x23/0x170
[  338.545302]  [<ffffffff811ca9a4>] ? rcu_eqs_exit_common+0x64/0x3a0
[  338.545305]  [<ffffffff811c8c66>] ? rcu_eqs_enter_common+0x7c6/0x970
[  338.545309]  [<ffffffff811cafdc>] ? rcu_eqs_exit+0x9c/0xb0
[  338.545312]  [<ffffffff810aa666>] do_page_fault+0x26/0x40
[  338.545317]  [<ffffffff810a3a40>] do_async_page_fault+0x30/0xa0
[  338.545321]  [<ffffffff83ae9268>] async_page_fault+0x28/0x30

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-24 23:37                       ` Sasha Levin
@ 2012-10-25  0:08                         ` David Rientjes
  2012-10-25  0:54                           ` KOSAKI Motohiro
  2012-10-25 12:19                           ` Peter Zijlstra
  0 siblings, 2 replies; 59+ messages in thread
From: David Rientjes @ 2012-10-25  0:08 UTC (permalink / raw)
  To: Sasha Levin, Mel Gorman, Peter Zijlstra, Rik van Riel
  Cc: Dave Jones, Andrew Morton, Linus Torvalds, KOSAKI Motohiro,
	bhutchings, Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Wed, 24 Oct 2012, Sasha Levin wrote:

> > This should be fixed by 9e7814404b77 ("hold task->mempolicy while
> > numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading
> > /proc/pid/numa_maps on that kernel?
> 
> I was actually referring to the warnings Dave Jones saw when fuzzing
> with trinity after the
> original patch was applied.
> 
> I still see the following when fuzzing:
> 
> [  338.467156] BUG: sleeping function called from invalid context at
> kernel/mutex.c:269
> [  338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
> [  338.481199] 2 locks held by trinity-main/6361:
> [  338.486629]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff810aa314>]
> __do_page_fault+0x1e4/0x4f0
> [  338.498783]  #1:  (&(&mm->page_table_lock)->rlock){+.+...}, at:
> [<ffffffff8122f017>] handle_pte_fault+0x3f7/0x6a0
> [  338.511409] Pid: 6361, comm: trinity-main Tainted: G        W
> 3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74
> [  338.530318] Call Trace:
> [  338.534088]  [<ffffffff8114e393>] __might_sleep+0x1c3/0x1e0
> [  338.539358]  [<ffffffff83ae5209>] mutex_lock_nested+0x29/0x50
> [  338.545253]  [<ffffffff8124fc3e>] mpol_shared_policy_lookup+0x2e/0x90
> [  338.545258]  [<ffffffff81219ebe>] shmem_get_policy+0x2e/0x30
> [  338.545264]  [<ffffffff8124e99a>] get_vma_policy+0x5a/0xa0
> [  338.545267]  [<ffffffff8124fce1>] mpol_misplaced+0x41/0x1d0
> [  338.545272]  [<ffffffff8122f085>] handle_pte_fault+0x465/0x6a0
> [  338.545278]  [<ffffffff81131e04>] ? __rcu_read_unlock+0x44/0xb0
> [  338.545282]  [<ffffffff81230baa>] handle_mm_fault+0x32a/0x360
> [  338.545286]  [<ffffffff810aa5b0>] __do_page_fault+0x480/0x4f0
> [  338.545293]  [<ffffffff8111a706>] ? del_timer+0x26/0x80
> [  338.545298]  [<ffffffff811c7313>] ? rcu_cleanup_after_idle+0x23/0x170
> [  338.545302]  [<ffffffff811ca9a4>] ? rcu_eqs_exit_common+0x64/0x3a0
> [  338.545305]  [<ffffffff811c8c66>] ? rcu_eqs_enter_common+0x7c6/0x970
> [  338.545309]  [<ffffffff811cafdc>] ? rcu_eqs_exit+0x9c/0xb0
> [  338.545312]  [<ffffffff810aa666>] do_page_fault+0x26/0x40
> [  338.545317]  [<ffffffff810a3a40>] do_async_page_fault+0x30/0xa0
> [  338.545321]  [<ffffffff83ae9268>] async_page_fault+0x28/0x30
> 

Ok, this looks the same but it's actually a different issue: 
mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, 
calls get_vma_policy() which may take the shared policy mutex.  This 
happens while holding page_table_lock from do_huge_pmd_numa_page() but 
also from do_numa_page() while holding a spinlock on the ptl, which is 
coming from the sched/numa branch.

Is there anyway that we can avoid changing the shared policy mutex back 
into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race 
in shared_policy_replace()"])?

Adding Peter, Rik, and Mel to the cc.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-25  0:08                         ` David Rientjes
@ 2012-10-25  0:54                           ` KOSAKI Motohiro
  2012-10-25  1:15                             ` David Rientjes
  2012-10-25 12:19                           ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: KOSAKI Motohiro @ 2012-10-25  0:54 UTC (permalink / raw)
  To: David Rientjes
  Cc: Sasha Levin, Mel Gorman, Peter Zijlstra, Rik van Riel,
	Dave Jones, Andrew Morton, Linus Torvalds, bhutchings,
	Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Wed, Oct 24, 2012 at 8:08 PM, David Rientjes <rientjes@google.com> wrote:
> On Wed, 24 Oct 2012, Sasha Levin wrote:
>
>> > This should be fixed by 9e7814404b77 ("hold task->mempolicy while
>> > numa_maps scans.") in 3.7-rc2, can you reproduce any issues reading
>> > /proc/pid/numa_maps on that kernel?
>>
>> I was actually referring to the warnings Dave Jones saw when fuzzing
>> with trinity after the
>> original patch was applied.
>>
>> I still see the following when fuzzing:
>>
>> [  338.467156] BUG: sleeping function called from invalid context at
>> kernel/mutex.c:269
>> [  338.473719] in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
>> [  338.481199] 2 locks held by trinity-main/6361:
>> [  338.486629]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff810aa314>]
>> __do_page_fault+0x1e4/0x4f0
>> [  338.498783]  #1:  (&(&mm->page_table_lock)->rlock){+.+...}, at:
>> [<ffffffff8122f017>] handle_pte_fault+0x3f7/0x6a0
>> [  338.511409] Pid: 6361, comm: trinity-main Tainted: G        W
>> 3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74
>> [  338.530318] Call Trace:
>> [  338.534088]  [<ffffffff8114e393>] __might_sleep+0x1c3/0x1e0
>> [  338.539358]  [<ffffffff83ae5209>] mutex_lock_nested+0x29/0x50
>> [  338.545253]  [<ffffffff8124fc3e>] mpol_shared_policy_lookup+0x2e/0x90
>> [  338.545258]  [<ffffffff81219ebe>] shmem_get_policy+0x2e/0x30
>> [  338.545264]  [<ffffffff8124e99a>] get_vma_policy+0x5a/0xa0
>> [  338.545267]  [<ffffffff8124fce1>] mpol_misplaced+0x41/0x1d0
>> [  338.545272]  [<ffffffff8122f085>] handle_pte_fault+0x465/0x6a0
>> [  338.545278]  [<ffffffff81131e04>] ? __rcu_read_unlock+0x44/0xb0
>> [  338.545282]  [<ffffffff81230baa>] handle_mm_fault+0x32a/0x360
>> [  338.545286]  [<ffffffff810aa5b0>] __do_page_fault+0x480/0x4f0
>> [  338.545293]  [<ffffffff8111a706>] ? del_timer+0x26/0x80
>> [  338.545298]  [<ffffffff811c7313>] ? rcu_cleanup_after_idle+0x23/0x170
>> [  338.545302]  [<ffffffff811ca9a4>] ? rcu_eqs_exit_common+0x64/0x3a0
>> [  338.545305]  [<ffffffff811c8c66>] ? rcu_eqs_enter_common+0x7c6/0x970
>> [  338.545309]  [<ffffffff811cafdc>] ? rcu_eqs_exit+0x9c/0xb0
>> [  338.545312]  [<ffffffff810aa666>] do_page_fault+0x26/0x40
>> [  338.545317]  [<ffffffff810a3a40>] do_async_page_fault+0x30/0xa0
>> [  338.545321]  [<ffffffff83ae9268>] async_page_fault+0x28/0x30
>>
>
> Ok, this looks the same but it's actually a different issue:
> mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2,
> calls get_vma_policy() which may take the shared policy mutex.  This
> happens while holding page_table_lock from do_huge_pmd_numa_page() but
> also from do_numa_page() while holding a spinlock on the ptl, which is
> coming from the sched/numa branch.
>
> Is there anyway that we can avoid changing the shared policy mutex back
> into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race
> in shared_policy_replace()"])?
>
> Adding Peter, Rik, and Mel to the cc.

Hrm. I haven't noticed there is mpol_misplaced() in linux-next. Peter,
I guess you commited it, right? If so, may I review your mempolicy
changes? Now mempolicy has a lot of horrible buggy code and I hope to
maintain carefully. Which tree should i see?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-25  0:54                           ` KOSAKI Motohiro
@ 2012-10-25  1:15                             ` David Rientjes
  0 siblings, 0 replies; 59+ messages in thread
From: David Rientjes @ 2012-10-25  1:15 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Sasha Levin, Mel Gorman, Peter Zijlstra, Rik van Riel,
	Dave Jones, Andrew Morton, Linus Torvalds, bhutchings,
	Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Wed, 24 Oct 2012, KOSAKI Motohiro wrote:

> Hrm. I haven't noticed there is mpol_misplaced() in linux-next. Peter,
> I guess you commited it, right? If so, may I review your mempolicy
> changes? Now mempolicy has a lot of horrible buggy code and I hope to
> maintain carefully. Which tree should i see?
> 

Check out sched/numa from 
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

$ git diff v3.7-rc2.. mm/mempolicy.c | diffstat
 mempolicy.c |  444 +++++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 277 insertions(+), 167 deletions(-)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-25  0:08                         ` David Rientjes
  2012-10-25  0:54                           ` KOSAKI Motohiro
@ 2012-10-25 12:19                           ` Peter Zijlstra
  2012-10-25 14:39                             ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2012-10-25 12:19 UTC (permalink / raw)
  To: David Rientjes
  Cc: Sasha Levin, Mel Gorman, Rik van Riel, Dave Jones, Andrew Morton,
	Linus Torvalds, KOSAKI Motohiro, bhutchings,
	Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote:
> Ok, this looks the same but it's actually a different issue: 
> mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, 
> calls get_vma_policy() which may take the shared policy mutex.  This 
> happens while holding page_table_lock from do_huge_pmd_numa_page() but 
> also from do_numa_page() while holding a spinlock on the ptl, which is 
> coming from the sched/numa branch.
> 
> Is there anyway that we can avoid changing the shared policy mutex back 
> into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race 
> in shared_policy_replace()"])?
> 
> Adding Peter, Rik, and Mel to the cc. 

Urgh, crud I totally missed that.

So the problem is that we need to compute if the current page is placed
'right' while holding pte_lock in order to avoid multiple pte_lock
acquisitions on the 'fast' path.

I'll look into this in a bit, but one thing that comes to mind is having
both a spnilock and a mutex and require holding both for modification
while either one is sufficient for read.

That would allow sp_lookup() to use the spinlock, while insert and
replace can hold both.

Not sure it will work for this, need to stare at this code a little
more.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-25 12:19                           ` Peter Zijlstra
@ 2012-10-25 14:39                             ` Peter Zijlstra
  2012-10-25 17:23                               ` Sasha Levin
                                                 ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Peter Zijlstra @ 2012-10-25 14:39 UTC (permalink / raw)
  To: David Rientjes
  Cc: Sasha Levin, Mel Gorman, Rik van Riel, Dave Jones, Andrew Morton,
	Linus Torvalds, KOSAKI Motohiro, bhutchings,
	Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Thu, 2012-10-25 at 14:19 +0200, Peter Zijlstra wrote:
> On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote:
> > Ok, this looks the same but it's actually a different issue: 
> > mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, 
> > calls get_vma_policy() which may take the shared policy mutex.  This 
> > happens while holding page_table_lock from do_huge_pmd_numa_page() but 
> > also from do_numa_page() while holding a spinlock on the ptl, which is 
> > coming from the sched/numa branch.
> > 
> > Is there anyway that we can avoid changing the shared policy mutex back 
> > into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race 
> > in shared_policy_replace()"])?
> > 
> > Adding Peter, Rik, and Mel to the cc. 
> 
> Urgh, crud I totally missed that.
> 
> So the problem is that we need to compute if the current page is placed
> 'right' while holding pte_lock in order to avoid multiple pte_lock
> acquisitions on the 'fast' path.
> 
> I'll look into this in a bit, but one thing that comes to mind is having
> both a spnilock and a mutex and require holding both for modification
> while either one is sufficient for read.
> 
> That would allow sp_lookup() to use the spinlock, while insert and
> replace can hold both.
> 
> Not sure it will work for this, need to stare at this code a little
> more.

So I think the below should work, we hold the spinlock over both rb-tree
modification as sp free, this makes mpol_shared_policy_lookup() which
returns the policy with an incremented refcount work with just the
spinlock.

Comments?

---
 include/linux/mempolicy.h |    1 +
 mm/mempolicy.c            |   23 ++++++++++++++++++-----
 2 files changed, 19 insertions(+), 5 deletions(-)

--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -133,6 +133,7 @@ struct sp_node {
 
 struct shared_policy {
 	struct rb_root root;
+	spinlock_t lock;
 	struct mutex mutex;
 };
 
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2099,12 +2099,20 @@ bool __mpol_equal(struct mempolicy *a, s
  *
  * Remember policies even when nobody has shared memory mapped.
  * The policies are kept in Red-Black tree linked from the inode.
- * They are protected by the sp->lock spinlock, which should be held
- * for any accesses to the tree.
+ *
+ * The rb-tree is locked using both a mutex and a spinlock. Every modification
+ * to the tree must hold both the mutex and the spinlock, lookups can hold
+ * either to observe a stable tree.
+ *
+ * In particular, sp_insert() and sp_delete() take the spinlock, whereas
+ * sp_lookup() doesn't, this so users have choice.
+ *
+ * shared_policy_replace() and mpol_free_shared_policy() take the mutex
+ * and call sp_insert(), sp_delete().
  */
 
 /* lookup first element intersecting start-end */
-/* Caller holds sp->mutex */
+/* Caller holds either sp->lock and/or sp->mutex */
 static struct sp_node *
 sp_lookup(struct shared_policy *sp, unsigned long start, unsigned long end)
 {
@@ -2143,6 +2151,7 @@ static void sp_insert(struct shared_poli
 	struct rb_node *parent = NULL;
 	struct sp_node *nd;
 
+	spin_lock(&sp->lock);
 	while (*p) {
 		parent = *p;
 		nd = rb_entry(parent, struct sp_node, nd);
@@ -2155,6 +2164,7 @@ static void sp_insert(struct shared_poli
 	}
 	rb_link_node(&new->nd, parent, p);
 	rb_insert_color(&new->nd, &sp->root);
+	spin_unlock(&sp->lock);
 	pr_debug("inserting %lx-%lx: %d\n", new->start, new->end,
 		 new->policy ? new->policy->mode : 0);
 }
@@ -2168,13 +2178,13 @@ mpol_shared_policy_lookup(struct shared_
 
 	if (!sp->root.rb_node)
 		return NULL;
-	mutex_lock(&sp->mutex);
+	spin_lock(&sp->lock);
 	sn = sp_lookup(sp, idx, idx+1);
 	if (sn) {
 		mpol_get(sn->policy);
 		pol = sn->policy;
 	}
-	mutex_unlock(&sp->mutex);
+	spin_unlock(&sp->lock);
 	return pol;
 }
 
@@ -2295,8 +2305,10 @@ int mpol_misplaced(struct page *page, st
 static void sp_delete(struct shared_policy *sp, struct sp_node *n)
 {
 	pr_debug("deleting %lx-l%lx\n", n->start, n->end);
+	spin_lock(&sp->lock);
 	rb_erase(&n->nd, &sp->root);
 	sp_free(n);
+	spin_unlock(&sp->lock);
 }
 
 static struct sp_node *sp_alloc(unsigned long start, unsigned long end,
@@ -2381,6 +2393,7 @@ void mpol_shared_policy_init(struct shar
 	int ret;
 
 	sp->root = RB_ROOT;		/* empty tree == default mempolicy */
+	spin_lock_init(&sp->lock);
 	mutex_init(&sp->mutex);
 
 	if (mpol) {


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-25 14:39                             ` Peter Zijlstra
@ 2012-10-25 17:23                               ` Sasha Levin
  2012-10-25 20:22                               ` David Rientjes
  2012-10-25 23:09                               ` Linus Torvalds
  2 siblings, 0 replies; 59+ messages in thread
From: Sasha Levin @ 2012-10-25 17:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Rientjes, Mel Gorman, Rik van Riel, Dave Jones,
	Andrew Morton, Linus Torvalds, KOSAKI Motohiro, bhutchings,
	Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On 10/25/2012 10:39 AM, Peter Zijlstra wrote:
> On Thu, 2012-10-25 at 14:19 +0200, Peter Zijlstra wrote:
>> On Wed, 2012-10-24 at 17:08 -0700, David Rientjes wrote:
>>> Ok, this looks the same but it's actually a different issue: 
>>> mpol_misplaced(), which now only exists in linux-next and not in 3.7-rc2, 
>>> calls get_vma_policy() which may take the shared policy mutex.  This 
>>> happens while holding page_table_lock from do_huge_pmd_numa_page() but 
>>> also from do_numa_page() while holding a spinlock on the ptl, which is 
>>> coming from the sched/numa branch.
>>>
>>> Is there anyway that we can avoid changing the shared policy mutex back 
>>> into a spinlock (it was converted in b22d127a39dd ["mempolicy: fix a race 
>>> in shared_policy_replace()"])?
>>>
>>> Adding Peter, Rik, and Mel to the cc. 
>>
>> Urgh, crud I totally missed that.
>>
>> So the problem is that we need to compute if the current page is placed
>> 'right' while holding pte_lock in order to avoid multiple pte_lock
>> acquisitions on the 'fast' path.
>>
>> I'll look into this in a bit, but one thing that comes to mind is having
>> both a spnilock and a mutex and require holding both for modification
>> while either one is sufficient for read.
>>
>> That would allow sp_lookup() to use the spinlock, while insert and
>> replace can hold both.
>>
>> Not sure it will work for this, need to stare at this code a little
>> more.
> 
> So I think the below should work, we hold the spinlock over both rb-tree
> modification as sp free, this makes mpol_shared_policy_lookup() which
> returns the policy with an incremented refcount work with just the
> spinlock.
> 
> Comments?
> 
> ---

It made the warnings I've reported go away.


Thanks,
Sasha


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-25 14:39                             ` Peter Zijlstra
  2012-10-25 17:23                               ` Sasha Levin
@ 2012-10-25 20:22                               ` David Rientjes
  2012-10-25 23:09                               ` Linus Torvalds
  2 siblings, 0 replies; 59+ messages in thread
From: David Rientjes @ 2012-10-25 20:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Mel Gorman, Rik van Riel, Dave Jones, Andrew Morton,
	Linus Torvalds, KOSAKI Motohiro, bhutchings,
	Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Thu, 25 Oct 2012, Peter Zijlstra wrote:

> So I think the below should work, we hold the spinlock over both rb-tree
> modification as sp free, this makes mpol_shared_policy_lookup() which
> returns the policy with an incremented refcount work with just the
> spinlock.
> 
> Comments?
> 

It's rather unfortunate that we need to protect modification with a 
spinlock and a mutex but since sharing was removed in commit 869833f2c5c6 
("mempolicy: remove mempolicy sharing") it requires that sp_alloc() is 
blockable to do the whole mpol_new() and rebind if necessary, which could 
require mm->mmap_sem; it's not as simple as just converting all the 
allocations to GFP_ATOMIC.

It looks as though there is no other alternative other than protecting 
modification with both the spinlock and mutex, which is a clever 
solution, so it looks good to me, thanks!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-25 14:39                             ` Peter Zijlstra
  2012-10-25 17:23                               ` Sasha Levin
  2012-10-25 20:22                               ` David Rientjes
@ 2012-10-25 23:09                               ` Linus Torvalds
  2012-10-26  8:48                                 ` Peter Zijlstra
  2 siblings, 1 reply; 59+ messages in thread
From: Linus Torvalds @ 2012-10-25 23:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Rientjes, Sasha Levin, Mel Gorman, Rik van Riel,
	Dave Jones, Andrew Morton, KOSAKI Motohiro, bhutchings,
	Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> So I think the below should work, we hold the spinlock over both rb-tree
> modification as sp free, this makes mpol_shared_policy_lookup() which
> returns the policy with an incremented refcount work with just the
> spinlock.
>
> Comments?

Looks reasonable, if annoyingly complex for something that shouldn't
be important enough for this. Oh well.

However, please check me on this: the need for this is only for
linux-next right now, correct? All the current users in my tree are ok
with just the mutex, no?

            Linus

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-25 23:09                               ` Linus Torvalds
@ 2012-10-26  8:48                                 ` Peter Zijlstra
  2012-10-31 18:29                                   ` Sasha Levin
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2012-10-26  8:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Rientjes, Sasha Levin, Mel Gorman, Rik van Riel,
	Dave Jones, Andrew Morton, KOSAKI Motohiro, bhutchings,
	Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote:
> On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > So I think the below should work, we hold the spinlock over both rb-tree
> > modification as sp free, this makes mpol_shared_policy_lookup() which
> > returns the policy with an incremented refcount work with just the
> > spinlock.
> >
> > Comments?
> 
> Looks reasonable, if annoyingly complex for something that shouldn't
> be important enough for this. Oh well.

I agree with that.. Its just that when doing numa placement one needs to
respect the pre-existing placement constraints. I've not seen a way
around this.

> However, please check me on this: the need for this is only for
> linux-next right now, correct? All the current users in my tree are ok
> with just the mutex, no?

Yes, the need comes from the numa stuff and I'll stick this patch in
there.

I completely missed Mel's patch turning it into a mutex, but I guess
that's what -next is for :-).

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-26  8:48                                 ` Peter Zijlstra
@ 2012-10-31 18:29                                   ` Sasha Levin
  2012-11-21  0:59                                     ` Sasha Levin
  0 siblings, 1 reply; 59+ messages in thread
From: Sasha Levin @ 2012-10-31 18:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, David Rientjes, Mel Gorman, Rik van Riel,
	Dave Jones, Andrew Morton, KOSAKI Motohiro, bhutchings,
	Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Fri, Oct 26, 2012 at 4:48 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote:
>> On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> >
>> > So I think the below should work, we hold the spinlock over both rb-tree
>> > modification as sp free, this makes mpol_shared_policy_lookup() which
>> > returns the policy with an incremented refcount work with just the
>> > spinlock.
>> >
>> > Comments?
>>
>> Looks reasonable, if annoyingly complex for something that shouldn't
>> be important enough for this. Oh well.
>
> I agree with that.. Its just that when doing numa placement one needs to
> respect the pre-existing placement constraints. I've not seen a way
> around this.
>
>> However, please check me on this: the need for this is only for
>> linux-next right now, correct? All the current users in my tree are ok
>> with just the mutex, no?
>
> Yes, the need comes from the numa stuff and I'll stick this patch in
> there.
>
> I completely missed Mel's patch turning it into a mutex, but I guess
> that's what -next is for :-).

So I've been fuzzing with it for the past couple of days and it's been
looking fine with it. Can someone grab it into his tree please?


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps
  2012-10-31 18:29                                   ` Sasha Levin
@ 2012-11-21  0:59                                     ` Sasha Levin
  0 siblings, 0 replies; 59+ messages in thread
From: Sasha Levin @ 2012-11-21  0:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, David Rientjes, Mel Gorman, Rik van Riel,
	Dave Jones, Andrew Morton, KOSAKI Motohiro, bhutchings,
	Konstantin Khlebnikov, Naoya Horiguchi, Hugh Dickins,
	KAMEZAWA Hiroyuki, linux-kernel, linux-mm

Ping? Can someone take it before it's lost?

On Wed, Oct 31, 2012 at 2:29 PM, Sasha Levin <levinsasha928@gmail.com> wrote:
> On Fri, Oct 26, 2012 at 4:48 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Thu, 2012-10-25 at 16:09 -0700, Linus Torvalds wrote:
>>> On Thu, Oct 25, 2012 at 7:39 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> >
>>> > So I think the below should work, we hold the spinlock over both rb-tree
>>> > modification as sp free, this makes mpol_shared_policy_lookup() which
>>> > returns the policy with an incremented refcount work with just the
>>> > spinlock.
>>> >
>>> > Comments?
>>>
>>> Looks reasonable, if annoyingly complex for something that shouldn't
>>> be important enough for this. Oh well.
>>
>> I agree with that.. Its just that when doing numa placement one needs to
>> respect the pre-existing placement constraints. I've not seen a way
>> around this.
>>
>>> However, please check me on this: the need for this is only for
>>> linux-next right now, correct? All the current users in my tree are ok
>>> with just the mutex, no?
>>
>> Yes, the need comes from the numa stuff and I'll stick this patch in
>> there.
>>
>> I completely missed Mel's patch turning it into a mutex, but I guess
>> that's what -next is for :-).
>
> So I've been fuzzing with it for the past couple of days and it's been
> looking fine with it. Can someone grab it into his tree please?
>
>
> Thanks,
> Sasha

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2012-11-21  1:00 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-08 15:09 mpol_to_str revisited Dave Jones
2012-10-08 15:15 ` Dave Jones
2012-10-08 20:46   ` David Rientjes
2012-10-08 20:35 ` David Rientjes
2012-10-08 20:52   ` Dave Jones
2012-10-16  0:48     ` David Rientjes
2012-10-09  0:33 ` Ben Hutchings
2012-10-16  2:34 ` KOSAKI Motohiro
2012-10-16  3:58   ` David Rientjes
2012-10-16  5:10     ` KOSAKI Motohiro
2012-10-16  6:10       ` David Rientjes
2012-10-16 23:39         ` KOSAKI Motohiro
2012-10-17  0:12           ` David Rientjes
2012-10-17  0:31             ` [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps David Rientjes
2012-10-17  1:38               ` KOSAKI Motohiro
2012-10-17  1:49                 ` David Rientjes
2012-10-17  1:53                   ` KOSAKI Motohiro
2012-10-17  4:05               ` Dave Jones
2012-10-17  5:24                 ` David Rientjes
2012-10-17  5:42                   ` Kamezawa Hiroyuki
2012-10-17  8:49                     ` KOSAKI Motohiro
2012-10-17 19:50                       ` David Rientjes
2012-10-17 21:05                         ` KOSAKI Motohiro
2012-10-17 21:27                           ` David Rientjes
2012-10-17 18:14                   ` Dave Jones
2012-10-17 19:21                     ` David Rientjes
2012-10-17 19:32                       ` Dave Jones
2012-10-17 19:38                         ` David Rientjes
2012-10-17 19:45                           ` Dave Jones
2012-10-17 20:28                             ` [patch for-3.7] mm, mempolicy: avoid taking mutex inside spinlock when reading numa_maps David Rientjes
2012-10-17 21:31                               ` [patch for-3.7 v2] " David Rientjes
2012-10-18  4:06                                 ` Kamezawa Hiroyuki
2012-10-18  4:14                                   ` Linus Torvalds
2012-10-18  4:41                                     ` Kamezawa Hiroyuki
2012-10-18  4:34                                   ` Kamezawa Hiroyuki
2012-10-18 20:03                                     ` David Rientjes
2012-10-19  8:35                                       ` [patch for-3.7 v3] mm, mempolicy: hold task->mempolicy refcount while " Kamezawa Hiroyuki
2012-10-19  9:28                                         ` David Rientjes
2012-10-22  2:47                                           ` Kamezawa Hiroyuki
2012-10-22 20:55                                             ` Andrew Morton
2012-10-22 20:56                                             ` David Rientjes
2012-10-19 19:15                                         ` KOSAKI Motohiro
2012-10-19  6:51                                     ` [patch for-3.7 v2] mm, mempolicy: avoid taking mutex inside spinlock when " KOSAKI Motohiro
2012-10-18  4:35                                   ` David Rientjes
2012-10-24 23:30                   ` [patch for-3.7] mm, mempolicy: fix printing stack contents in numa_maps Sasha Levin
2012-10-24 23:34                     ` David Rientjes
2012-10-24 23:37                       ` Sasha Levin
2012-10-25  0:08                         ` David Rientjes
2012-10-25  0:54                           ` KOSAKI Motohiro
2012-10-25  1:15                             ` David Rientjes
2012-10-25 12:19                           ` Peter Zijlstra
2012-10-25 14:39                             ` Peter Zijlstra
2012-10-25 17:23                               ` Sasha Levin
2012-10-25 20:22                               ` David Rientjes
2012-10-25 23:09                               ` Linus Torvalds
2012-10-26  8:48                                 ` Peter Zijlstra
2012-10-31 18:29                                   ` Sasha Levin
2012-11-21  0:59                                     ` Sasha Levin
2012-10-17  1:33             ` mpol_to_str revisited KOSAKI Motohiro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).