* [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!
@ 2011-03-14  8:08 ` Hugh Dickins
  0 siblings, 0 replies; 28+ messages in thread
From: Hugh Dickins @ 2011-03-14  8:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Andrew Morton, David Rientjes, linux-kernel, linux-mm

THP's collapse_huge_page() has an understandable but ugly difference
in when its huge page is allocated: inside if NUMA but outside if not.
It's hardly surprising that the memcg failure path forgot that, freeing
the page in the non-NUMA case, then hitting a VM_BUG_ON in get_page()
(or even worse, using the freed page).

Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/huge_memory.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

--- 2.6.38-rc8/mm/huge_memory.c	2011-03-08 09:27:16.000000000 -0800
+++ linux/mm/huge_memory.c	2011-03-13 18:26:21.000000000 -0700
@@ -1762,6 +1762,10 @@ static void collapse_huge_page(struct mm
 #ifndef CONFIG_NUMA
 	VM_BUG_ON(!*hpage);
 	new_page = *hpage;
+	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
+		up_read(&mm->mmap_sem);
+		return;
+	}
 #else
 	VM_BUG_ON(*hpage);
 	/*
@@ -1781,12 +1785,12 @@ static void collapse_huge_page(struct mm
 		*hpage = ERR_PTR(-ENOMEM);
 		return;
 	}
-#endif
 	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
 		up_read(&mm->mmap_sem);
 		put_page(new_page);
 		return;
 	}
+#endif
 
 	/* after allocating the hugepage upgrade to mmap_sem write mode */
 	up_read(&mm->mmap_sem);
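
To spell the bug out (a reconstruction from the description above, not
the verbatim 2.6.38 code): before the fix, both configurations shared
one memcg failure path, but only the NUMA build owns the page it frees
there.

    /* shared memcg failure path before the fix: */
    if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
            up_read(&mm->mmap_sem);
            put_page(new_page);     /* fine for NUMA: drops our own
                                     * allocation.  non-NUMA: new_page ==
                                     * *hpage, still used by khugepaged,
                                     * which will get_page() it again and
                                     * hit the VM_BUG_ON */
            return;
    }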


* Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!
  2011-03-14  8:08 ` Hugh Dickins
@ 2011-03-14 15:25   ` Minchan Kim
  -1 siblings, 0 replies; 28+ messages in thread
From: Minchan Kim @ 2011-03-14 15:25 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Andrea Arcangeli, Andrew Morton, David Rientjes,
	linux-kernel, linux-mm

On Mon, Mar 14, 2011 at 5:08 PM, Hugh Dickins <hughd@google.com> wrote:
> THP's collapse_huge_page() has an understandable but ugly difference
> in when its huge page is allocated: inside if NUMA but outside if not.
> It's hardly surprising that the memcg failure path forgot that, freeing
> the page in the non-NUMA case, then hitting a VM_BUG_ON in get_page()
> (or even worse, using the freed page).
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Thanks, Hugh.
-- 
Kind regards,
Minchan Kim


* Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!
  2011-03-14  8:08 ` Hugh Dickins
@ 2011-03-14 15:52   ` Andrea Arcangeli
  -1 siblings, 0 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2011-03-14 15:52 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Andrew Morton, David Rientjes, linux-kernel, linux-mm

On Mon, Mar 14, 2011 at 01:08:47AM -0700, Hugh Dickins wrote:
>  mm/huge_memory.c |    6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> --- 2.6.38-rc8/mm/huge_memory.c	2011-03-08 09:27:16.000000000 -0800
> +++ linux/mm/huge_memory.c	2011-03-13 18:26:21.000000000 -0700
> @@ -1762,6 +1762,10 @@ static void collapse_huge_page(struct mm
>  #ifndef CONFIG_NUMA
>  	VM_BUG_ON(!*hpage);
>  	new_page = *hpage;
> +	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
> +		up_read(&mm->mmap_sem);
> +		return;
> +	}
>  #else
>  	VM_BUG_ON(*hpage);
>  	/*
> @@ -1781,12 +1785,12 @@ static void collapse_huge_page(struct mm
>  		*hpage = ERR_PTR(-ENOMEM);
>  		return;
>  	}
> -#endif
>  	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
>  		up_read(&mm->mmap_sem);
>  		put_page(new_page);
>  		return;
>  	}
> +#endif
>  
>  	/* after allocating the hugepage upgrade to mmap_sem write mode */
>  	up_read(&mm->mmap_sem);

Correct! I'd suggest fixing it without duplicating the
mem_cgroup_newpage_charge. It's just one more put_page than needed
with memcg enabled and NUMA disabled (I guess most memcg testing
happened with NUMA enabled). The larger diff will likely reject
against the -mm NUMA code... I'll try to diff it with a smaller -U2
(this code has little chance of being misplaced), which may allow it
to apply cleanly regardless of the merging order, so it may make life
easier.

It may have been overkill to keep the NUMA case separated in order to
avoid spurious allocations for the not-NUMA case: the code becomes
more complex and I'm not sure it's really worthwhile. The optimization
makes sense, but it's minor and it created more complexity than
strictly needed. We can't change it in the short term, as it has been
tested this way, but if people dislike the additional complexity that
this optimization created, I'm not against dropping it in the future.
Your comment was positive about the optimization (you said
"understandable"), so I wanted to share my current thinking on these
ifdefs...

Thanks,
Andrea

===
Subject: thp+memcg-numa: fix BUG at include/linux/mm.h:370!

From: Hugh Dickins <hughd@google.com>

THP's collapse_huge_page() has an understandable but ugly difference
in when its huge page is allocated: inside if NUMA but outside if not.
It's hardly surprising that the memcg failure path forgot that, freeing
the page in the non-NUMA case, then hitting a VM_BUG_ON in get_page()
(or even worse, using the freed page).

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index dbe99a5..bf41114 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1785,5 +1785,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
 		up_read(&mm->mmap_sem);
+#ifdef CONFIG_NUMA
 		put_page(new_page);
+#endif
 		return;
 	}



* Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!
  2011-03-14 15:52   ` Andrea Arcangeli
@ 2011-03-14 16:37     ` Hugh Dickins
  -1 siblings, 0 replies; 28+ messages in thread
From: Hugh Dickins @ 2011-03-14 16:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Andrew Morton, David Rientjes, linux-kernel, linux-mm

On Mon, 14 Mar 2011, Andrea Arcangeli wrote:
> 
> Correct! I'd suggest fixing it without duplicating the
> mem_cgroup_newpage_charge. It's just one more put_page than needed
> with memcg enabled and NUMA disabled (I guess most memcg testing
> happened with NUMA enabled). The larger diff will likely reject
> against the -mm NUMA code... I'll try to diff it with a smaller -U2
> (this code has little chance of being misplaced), which may allow it
> to apply cleanly regardless of the merging order, so it may make life
> easier.

I did try it that way at first (didn't help when I mistakenly put
#ifndef instead of #ifdef around the put_page!), but was repulsed
by seeing yet another #ifdef CONFIG_NUMA, so went with the duplicating
version - which Linus has now taken.

> 
> It may have been overkill to keep the NUMA case separated in order to
> avoid spurious allocations for the not-NUMA case: the code becomes
> more complex and I'm not sure it's really worthwhile. The optimization
> makes sense, but it's minor and it created more complexity than
> strictly needed. We can't change it in the short term, as it has been
> tested this way, but if people dislike the additional complexity that
> this optimization created, I'm not against dropping it in the future.
> Your comment was positive about the optimization (you said
> "understandable"), so I wanted to share my current thinking on these
> ifdefs...

I was certainly tempted to remove all the non-NUMA cases, but as you
say, now is not the right time for that since we needed a quick bugfix.

I do appreciate why you did it that way: it is nicer to be allocating
on the outside, though unsuitable in the NUMA case.  But given how this
bug has passed unnoticed for so long, it seems like nobody has been
testing non-NUMA, so yes, best to simplify and make non-NUMA do the
same as NUMA in 2.6.39.

Since Linus has taken my version that you didn't like, perhaps you can
get even by sending him your "mm: PageBuddy cleanups" patch, the version
I didn't like (for its silly branches) so was reluctant to push myself.

I'd really like to see that fix in, since it's a little hard to argue
for in -stable, being all about a system which is already unstable.
But I think it needs a stronger title than "PageBuddy cleanups" -
"fix BUG in bad_page()"?

Hugh


* Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!
  2011-03-14 16:37     ` Hugh Dickins
@ 2011-03-14 16:56       ` Linus Torvalds
  -1 siblings, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2011-03-14 16:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Andrew Morton, David Rientjes, linux-kernel, linux-mm

On Mon, Mar 14, 2011 at 9:37 AM, Hugh Dickins <hughd@google.com> wrote:
>
> I did try it that way at first (didn't help when I mistakenly put
> #ifndef instead of #ifdef around the put_page!), but was repulsed
> by seeing yet another #ifdef CONFIG_NUMA, so went with the duplicating
> version - which Linus has now taken.

I have to admit to being repulsed by the whole patch, but my main
source of "that's effin ugly" was from the crazy lock handling.

Does mem_cgroup_newpage_charge() even _need_ the mmap_sem at all? And
if not, why not release the read-lock early? And even if it _does_
need it, why not do

    ret = mem_cgroup_newpage_charge();
    up_read(&mm->mmap_sem);
    if (ret) {
        ...

finally, the #ifdef CONFIG_NUMA is ugly, but it's ugly in the return
path of the function too, and the nicer way would probably be to have
it in one place and do something like

    /*
     * The allocation rules are different for the NUMA/non-NUMA cases
     * For the NUMA case, we allocate here, for the non-numa case we
     * use the allocation in *hpage
     */
    static inline struct page *collapse_alloc_hugepage(struct page **hpage)
    {
    #ifdef CONFIG_NUMA
        VM_BUG_ON(*hpage);
        return alloc_hugepage_vma(khugepaged_defrag(), vma, address, node);
    #else
        VM_BUG_ON(!*hpage);
        return *hpage;
    #endif
    }

    static inline void collapse_free_hugepage(struct page *page)
    {
    #ifdef CONFIG_NUMA
        put_page(page);
    #else
        /* Nothing to do */
    #endif
    }

and use that instead. The point being that the #ifdef'fery now ends up
being in a much more targeted area and much better abstracted, rather
than in the middle of code, and ugly as sin.
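
For illustration, a call-site sketch using those helpers (a
reconstruction based on the patch above, with vma, address and node
assumed in scope inside the alloc helper, as in the outline):

    new_page = collapse_alloc_hugepage(hpage);
    if (unlikely(!new_page)) {
        up_read(&mm->mmap_sem);
        *hpage = ERR_PTR(-ENOMEM);
        return;
    }
    if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
        up_read(&mm->mmap_sem);
        collapse_free_hugepage(new_page);
        return;
    }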

But as mentioned, the lock handling is disgusting. Why is it even safe
to drop and re-take the lock at all?

                                   Linus


* [PATCH] mm: PageBuddy and mapcount underflows robustness
  2011-03-14 16:37     ` Hugh Dickins
@ 2011-03-14 16:59       ` Andrea Arcangeli
  -1 siblings, 0 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2011-03-14 16:59 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Linus Torvalds, Andrew Morton, David Rientjes, linux-kernel, linux-mm

Hi Hugh,

On Mon, Mar 14, 2011 at 09:37:43AM -0700, Hugh Dickins wrote:
> version - which Linus has now taken.

That was fast; it solves all the merge-order rejects in the first
place, as we'll all rebase on upstream.

> I was certainly tempted to remove all the non-NUMA cases, but as you
> say, now is not the right time for that since we needed a quick bugfix.

Correct. Too much risk in making the not-NUMA case behave like the
NUMA case.

> I do appreciate why you did it that way, it is nicer to be allocating

BTW, one reason it was done this way is that the not-NUMA case was
the original code, and when I added NUMA awareness to khugepaged I
didn't want to risk regressions in the existing case that I knew
worked fine.

> on the outside, though unsuitable in the NUMA case.  But given how this
> bug has passed unnoticed for so long, it seems like nobody has been
> testing non-NUMA, so yes, best to simplify and make non-NUMA do the
> same as NUMA in 2.6.39.

These days the NUMA case clearly gets more testing than the not-NUMA
case, especially for features like memcg, which are mostly needed on
NUMA systems and not so much on small non-NUMA machines like laptops.

I'm unsure it's so urgent to remove it; a little benchmarking with
khugepaged at 100% load may be worth trying first. But if there's no
real change in the growth rate of the khugepaged/pages_collapsed
counter, I'm certainly OK with removing it later.

> Since Linus has taken my version that you didn't like, perhaps you can

I'm ok with your version... no problem.

> get even by sending him your "mm: PageBuddy cleanups" patch, the version
> I didn't like (for its silly branches) so was reluctant to push myself.

OK, that one slipped even my own aa.git tree, as it was more of an
RFC when I posted it: it didn't fix a real bug, it just made the code
more robust at runtime against hardware memory corruption or software
bugs.

> I'd really like to see that fix in, since it's a little hard to argue
> for in -stable, being all about a system which is already unstable.
>
> But I think it needs a stronger title than "PageBuddy cleanups" -
> "fix BUG in bad_page()"?

I think this can comfortably wait for 2.6.39-rc. It's best if Andrew
merges it so it gets digested through -mm for a while, but let me know
if you prefer something else. So here it is with a slightly updated
header.

===
Subject: mm: PageBuddy and mapcount underflows robustness

From: Andrea Arcangeli <aarcange@redhat.com>

bad_page() could trip the VM_BUG_ON(!PageBuddy(page)) inside
__ClearPageBuddy(). I prefer to keep the VM_BUG_ON for safety and to
add an if to solve it.

Change the _mapcount value indicating PageBuddy from -2 to -1024*1024
for more robustness against page_mapcount() underflows.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Hugh Dickins <hughd@google.com>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f6385fc..fa16ba0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -402,16 +402,22 @@ static inline void init_page_count(struct page *page)
 /*
  * PageBuddy() indicate that the page is free and in the buddy system
  * (see mm/page_alloc.c).
+ *
+ * PAGE_BUDDY_MAPCOUNT_VALUE must be <= -2 but better not too close to
+ * -2 so that an underflow of the page_mapcount() won't be mistaken
+ * for a genuine PAGE_BUDDY_MAPCOUNT_VALUE.
  */
+#define PAGE_BUDDY_MAPCOUNT_VALUE (-1024*1024)
+
 static inline int PageBuddy(struct page *page)
 {
-	return atomic_read(&page->_mapcount) == -2;
+	return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE;
 }
 
 static inline void __SetPageBuddy(struct page *page)
 {
 	VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
-	atomic_set(&page->_mapcount, -2);
+	atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);
 }
 
 static inline void __ClearPageBuddy(struct page *page)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a873e61..8aac134 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -286,7 +286,9 @@ static void bad_page(struct page *page)
 
 	/* Don't complain about poisoned pages */
 	if (PageHWPoison(page)) {
-		__ClearPageBuddy(page);
+		/* __ClearPageBuddy VM_BUG_ON(!PageBuddy(page)) */
+		if (PageBuddy(page))
+			__ClearPageBuddy(page);
 		return;
 	}
 
@@ -317,7 +319,8 @@ static void bad_page(struct page *page)
 	dump_stack();
 out:
 	/* Leave bad fields for debug, except PageBuddy could make trouble */
-	__ClearPageBuddy(page);
+	if (PageBuddy(page)) /* __ClearPageBuddy VM_BUG_ON(!PageBuddy(page)) */
+		__ClearPageBuddy(page);
 	add_taint(TAINT_BAD_PAGE);
 }
 


* Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!
  2011-03-14 16:56       ` Linus Torvalds
@ 2011-03-14 17:17         ` Andrea Arcangeli
  -1 siblings, 0 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2011-03-14 17:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, David Rientjes, linux-kernel,
	linux-mm, Johannes Weiner, Minchan Kim

Hello,

On Mon, Mar 14, 2011 at 09:56:10AM -0700, Linus Torvalds wrote:
> On Mon, Mar 14, 2011 at 9:37 AM, Hugh Dickins <hughd@google.com> wrote:
> >
> > I did try it that way at first (didn't help when I mistakenly put
> > #ifndef instead of #ifdef around the put_page!), but was repulsed
> > by seeing yet another #ifdef CONFIG_NUMA, so went with the duplicating
> > version - which Linus has now taken.
> 
> I have to admit to being repulsed by the whole patch, but my main
> source of "that's effin ugly" was from the crazy lock handling.
> 
> Does mem_cgroup_newpage_charge() even _need_ the mmap_sem at all? And
> if not, why not release the read-lock early? And even if it _does_
> need it, why not do
> 

The mmap_sem is needed for the page allocation, or the "vma" can
become a dangling pointer (the vma is passed to alloc_hugepage_vma).

>     ret = mem_cgroup_newpage_charge();
>     up_read(&mm->mmap_sem);
>     if (ret) {
>         ...
> 
> finally, the #ifdef CONFIG_NUMA is ugly, but it's ugly in the return
> path of the function too, and the nicer way would probably be to have
> it in one place and do something like
> 
>     /*
>      * The allocation rules are different for the NUMA/non-NUMA cases
>      * For the NUMA case, we allocate here, for the non-numa case we
>      * use the allocation in *hpage
>      */
>     static inline struct page *collapse_alloc_hugepage(struct page **hpage)
>     {
>     #ifdef CONFIG_NUMA
>         VM_BUG_ON(*hpage);
>         return alloc_hugepage_vma(khugepaged_defrag(), vma, address, node);
>     #else
>         VM_BUG_ON(!*hpage);
>         return *hpage;
>     #endif
>     }
> 
>     static inline void collapse_free_hugepage(struct page *page)
>     {
>     #ifdef CONFIG_NUMA
>         put_page(page);
>     #else
>         /* Nothing to do */
>     #endif
>     }
> 
> and use that instead. The point being that the #ifdef'fery now ends up
> being in a much more targeted area and much better abstracted, rather
> than in the middle of code, and ugly as sin.

Agreed about the cleanup. I didn't add a new function for it like I
probably should have, to make it less ugly...

About mem_cgroup_newpage_charge I think you're right that it won't
need the mmap_sem. Running it under it is surely safe. But if it's not
needed we can move the up_read before the mem_cgroup_newpage_charge
like you suggested. Johannes/Minchan, could you confirm the mmap_sem
isn't needed around mem_cgroup_newpage_charge? The mm and new_page are
stable without the mmap_sem; only the vma goes away, but the memcg
shouldn't care.

> But as mentioned, the lock handling is disgusting. Why is it even safe
> to drop and re-take the lock at all?

We have to upgrade the rwsem from read to write (a huge pmd isn't
allowed to materialize under code that runs with the mmap_sem in read
mode; that's one of the invariants that makes it safe for
split_huge_page_pmd to do nothing and let the regular pte walk go
ahead). It is safe because we revalidate the vma after dropping the
read lock and taking the write lock. It's generally impossible to
upgrade the lock without first dropping it if more than one thread
does that in parallel on the same mm (they all hold the read lock, so
somebody has to drop the read lock and revalidate before anyone has a
chance to take the write lock). Now, interestingly, I notice that in
this very case khugepaged is single threaded and no other place would
call upgrade_read() on the mmap_sem, so it probably would be safe; but
there's no method for that (it'd need to be called by at most one
thread at a time on the same mm, and that's probably not considered a
useful case, even if it probably would be in collapse_huge_page). If
we ever thread khugepaged to run on more than one core, we'd be forced
to drop the lock too (unless we make the scans on the same mm mutually
exclusive, which isn't the best for a large mm anyway). But I doubt
we'll ever need to thread khugepaged; it's blazing fast (unlike the
ksmd scan). So if we implement an upgrade_read() we might be able to
remove a find_vma in collapse_huge_page. It's not very important to
optimize, though, as the memory copy should be the biggest cost
anyway.
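
A minimal sketch of that drop-and-revalidate pattern (a
reconstruction; collapse_huge_page's actual revalidation checks more
than this):

    up_read(&mm->mmap_sem);     /* a rwsem can't be upgraded in place... */
    down_write(&mm->mmap_sem);  /* ...so drop it and retake it for write */
    /* anything may have changed in between: revalidate before touching */
    vma = find_vma(mm, address);
    if (!vma || vma->vm_start > address)
        goto out;               /* the vma we scanned is gone or moved */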


* Re: [PATCH] mm: PageBuddy and mapcount underflows robustness
  2011-03-14 16:59       ` Andrea Arcangeli
@ 2011-03-14 17:30         ` Linus Torvalds
  -1 siblings, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2011-03-14 17:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Andrew Morton, David Rientjes, linux-kernel, linux-mm

On Mon, Mar 14, 2011 at 9:59 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> +#define PAGE_BUDDY_MAPCOUNT_VALUE (-1024*1024)

I realize that this is a nitpick, but from a code generation
standpoint, large random constants like these are just nasty.

I would suggest aiming for constants that are easy to generate and/or
fit better in the code stream. In many encoding schemes (eg x86), -128
is much easier to generate, since it fits in a signed byte and allows
small instructions etc. And in most RISC encodings, 8- or 16-bit
constants can be encoded much more easily than something like your
current one, and bigger ones often end up resulting in a load from
memory or at least several immediate-building instructions.
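
As a concrete illustration of the x86 case (encodings written out by
hand, worth double-checking with a disassembler):

    /*
     * cmpl $-128, (%rax)       -> 83 38 80           3 bytes (imm8,
     *                                                sign-extended)
     * cmpl $-1048576, (%rax)   -> 81 38 00 00 f0 ff  6 bytes (imm32)
     *
     * so the comparison in PageBuddy() can be half the size with -128.
     */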

> -       __ClearPageBuddy(page);
> +       if (PageBuddy(page)) /* __ClearPageBuddy VM_BUG_ON(!PageBuddy(page)) */
> +               __ClearPageBuddy(page);

Also, this is just disgusting. It adds no safety here to have that
VM_BUG_ON(), so it's just unnecessary code generation to do this.
Also, we don't even WANT to do that stupid "__ClearPageBuddy()" in the
first place! What those two code-sites actually want are just a simple

    reset_page_mapcount(page);

which does the right thing in _general_, and not just for the buddy
case - we want to reset the mapcount for other reasons than just
pagebuddy (ie the underflow/overflow case).

And it avoids the VM_BUG_ON() too, making the crazy conditionals unnecessary.

No?

                        Linus


* Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!
  2011-03-14 17:17         ` Andrea Arcangeli
@ 2011-03-14 19:58           ` Johannes Weiner
  -1 siblings, 0 replies; 28+ messages in thread
From: Johannes Weiner @ 2011-03-14 19:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, David Rientjes,
	linux-kernel, linux-mm, Minchan Kim

On Mon, Mar 14, 2011 at 06:17:31PM +0100, Andrea Arcangeli wrote:
> On Mon, Mar 14, 2011 at 09:56:10AM -0700, Linus Torvalds wrote:
> > Does mem_cgroup_newpage_charge() even _need_ the mmap_sem at all? And
> > if not, why not release the read-lock early? And even if it _does_
> > need it, why not do

[...]

> About mem_cgroup_newpage_charge I think you're right that it won't
> need the mmap_sem. Running it under it is surely safe. But if it's not
> needed we can move the up_read before the mem_cgroup_newpage_charge
> like you suggested. Johannes/Minchan, could you confirm the mmap_sem
> isn't needed around mem_cgroup_newpage_charge? The mm and new_page are
> stable without the mmap_sem; only the vma goes away, but the memcg
> shouldn't care.

We don't care about the vma.  It's all about assigning the physical
page to the memcg that mm->owner belongs to.
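
In other words (a simplified sketch, assuming the era's
mem_cgroup_from_task helper; the real path also handles css
refcounting and the hierarchy):

    rcu_read_lock();
    memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
    rcu_read_unlock();
    /* no vma anywhere in sight */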

It would be the first callsite not holding the mmap_sem, but that is
only because all existing sites are fault handlers that don't drop the
lock for other reasons.

I am not aware of anything that would rely on the lock in there, or
would not deserve to break if it did.


* Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!
  2011-03-14 19:58           ` Johannes Weiner
@ 2011-03-14 23:42             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 28+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-03-14 23:42 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrea Arcangeli, Linus Torvalds, Hugh Dickins, Andrew Morton,
	David Rientjes, linux-kernel, linux-mm, Minchan Kim

On Mon, 14 Mar 2011 20:58:23 +0100
Johannes Weiner <jweiner@redhat.com> wrote:

> On Mon, Mar 14, 2011 at 06:17:31PM +0100, Andrea Arcangeli wrote:
> > On Mon, Mar 14, 2011 at 09:56:10AM -0700, Linus Torvalds wrote:
> > > Does mem_cgroup_newpage_charge() even _need_ the mmap_sem at all? And
> > > if not, why not release the read-lock early? And even if it _does_
> > > need it, why not do
> 
> [...]
> 
> > About mem_cgroup_newpage_charge I think you're right that it won't
> > need the mmap_sem. Running it under it is surely safe. But if it's not
> > needed we can move the up_read before the mem_cgroup_newpage_charge
> > like you suggested. Johannes/Minchan, could you confirm the mmap_sem
> > isn't needed around mem_cgroup_newpage_charge? The mm and new_page are
> > stable without the mmap_sem; only the vma goes away, but the memcg
> > shouldn't care.
> 
> We don't care about the vma.  It's all about assigning the physical
> page to the memcg that mm->owner belongs to.
> 
> It would be the first callsite not holding the mmap_sem, but that is
> only because all existing sites are fault handlers that don't drop the
> lock for other reasons.
> 
> I am not aware of anything that would rely on the lock in there, or
> would not deserve to break if it did.
> 

The mmap_sem is not required to be held, provided an uncharge() is
done if the vma turns out to be a stale pointer.
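
A sketch of that pattern (a reconstruction using the era's
mem_cgroup_uncharge_page, not code from this thread):

    /* charge before retaking the lock; mm and new_page are stable */
    if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL)))
        return;
    down_write(&mm->mmap_sem);
    vma = find_vma(mm, address);
    if (!vma) {
        /* the vma went away: give the charge back before bailing out */
        mem_cgroup_uncharge_page(new_page);
        up_write(&mm->mmap_sem);
        return;
    }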

Thanks,
-Kame



* Re: [PATCH] mm: PageBuddy and mapcount underflows robustness
  2011-03-14 17:30         ` Linus Torvalds
@ 2011-03-17 23:16           ` Andrea Arcangeli
  -1 siblings, 0 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2011-03-17 23:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Hugh Dickins, Andrew Morton, David Rientjes, linux-kernel, linux-mm

On Mon, Mar 14, 2011 at 10:30:11AM -0700, Linus Torvalds wrote:
> On Mon, Mar 14, 2011 at 9:59 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > +#define PAGE_BUDDY_MAPCOUNT_VALUE (-1024*1024)
> 
> I realize that this is a nitpick, but from a code generation
> standpoint, large random constants like these are just nasty.
> 
> I would suggest aiming for constants that are easy to generate and/or
> fit better in the code stream. In many encoding schemes (eg x86), -128
> is much easier to generate, since it fits in a signed byte and allows
> small instructions etc. And in most RISC encodings, 8- or 16-bit
> constants can be encoded much more easily than something like your
> current one, and bigger ones often end up resulting in a load from
> memory or at least several immediate-building instructions.

-128 is sure OK with me. It's extremely unlikely to be off by exactly
127; and if we're off by more than 127, we're still going to detect
the error.

> Also, this is just disgusting. It adds no safety here to have that
> VM_BUG_ON(), so it's just unnecessary code generation to do this.
> Also, we don't even WANT to do that stupid "__ClearPageBuddy()" in the
> first place! What those two code-sites actually want are just a simple
> 
>     reset_page_mapcount(page);
> 
> which does the right thing in _general_, and not just for the buddy
> case - we want to reset the mapcount for other reasons than just
> pagebuddy (ie the underflow/overflow case).

Agreed.

> And it avoids the VM_BUG_ON() too, making the crazy conditionals unnecessary.

Well, using reset_page_mapcount in the two error sites didn't require
removing the VM_BUG_ON from __ClearPageBuddy, so I left it there...

===
Subject: mm: PageBuddy and mapcount robustness

From: Andrea Arcangeli <aarcange@redhat.com>

Change the _mapcount value indicating PageBuddy from -2 to -128 for
more robustness against page_mapcount() underflows.

Use reset_page_mapcount instead of __ClearPageBuddy in bad_page, so
that the previous retval of PageBuddy() can be ignored.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Hugh Dickins <hughd@google.com>
---
 include/linux/mm.h |   11 +++++++++--
 mm/page_alloc.c    |    4 ++--
 2 files changed, 11 insertions(+), 4 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -402,16 +402,23 @@ static inline void init_page_count(struc
 /*
  * PageBuddy() indicate that the page is free and in the buddy system
  * (see mm/page_alloc.c).
+ *
+ * PAGE_BUDDY_MAPCOUNT_VALUE must be <= -2 but better not too close to
+ * -2 so that an underflow of the page_mapcount() won't be mistaken
+ * for a genuine PAGE_BUDDY_MAPCOUNT_VALUE. -128 can be created very
+ * efficiently by most CPU architectures.
  */
+#define PAGE_BUDDY_MAPCOUNT_VALUE (-128)
+
 static inline int PageBuddy(struct page *page)
 {
-	return atomic_read(&page->_mapcount) == -2;
+	return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE;
 }
 
 static inline void __SetPageBuddy(struct page *page)
 {
 	VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
-	atomic_set(&page->_mapcount, -2);
+	atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);
 }
 
 static inline void __ClearPageBuddy(struct page *page)
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -286,7 +286,7 @@ static void bad_page(struct page *page)
 
 	/* Don't complain about poisoned pages */
 	if (PageHWPoison(page)) {
-		__ClearPageBuddy(page);
+		reset_page_mapcount(page); /* remove PageBuddy */
 		return;
 	}
 
@@ -317,7 +317,7 @@ static void bad_page(struct page *page)
 	dump_stack();
 out:
 	/* Leave bad fields for debug, except PageBuddy could make trouble */
-	__ClearPageBuddy(page);
+	reset_page_mapcount(page); /* remove PageBuddy */
 	add_taint(TAINT_BAD_PAGE);
 }
 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] mm: PageBuddy and mapcount underflows robustness
  2011-03-17 23:16           ` Andrea Arcangeli
@ 2011-03-18 21:34             ` Hugh Dickins
  -1 siblings, 0 replies; 28+ messages in thread
From: Hugh Dickins @ 2011-03-18 21:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Andrew Morton, David Rientjes, linux-kernel,
	linux-mm, stable

On Fri, 18 Mar 2011, Andrea Arcangeli wrote:
> On Mon, Mar 14, 2011 at 10:30:11AM -0700, Linus Torvalds wrote:
> Subject: mm: PageBuddy and mapcount robustness
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Change the _mapcount value indicating PageBuddy from -2 to -128 for
> more robustness against page_mapcount() underflows.
> 
> Use reset_page_mapcount instead of __ClearPageBuddy in bad_page to
> ignore the previous retval of PageBuddy().
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Reported-by: Hugh Dickins <hughd@google.com>

Yes, this version satisfies my objections too.
I'd say Acked-by but I see that it's already in, great.

I've Cc'ed stable@kernel.org: please can we have this in 2.6.38.1,
since 2.6.38 regressed the recovery from bad page states,
inadvertently changing them to a fatal error when CONFIG_DEBUG_VM.
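
Roughly what went wrong (reconstructed here as a sketch, not quoted
code): bad_page() on a page that was never PageBuddy ran into
__ClearPageBuddy(), whose VM_BUG_ON then upgraded a recoverable
bad-page report into a fatal oops:

	static inline void __ClearPageBuddy(struct page *page)
	{
		VM_BUG_ON(!PageBuddy(page));	/* fires for a non-buddy bad page */
		atomic_set(&page->_mapcount, -1);
	}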

Thanks,
Hugh

commit ef2b4b95a63a1d23958dcb99eb2c6898eddc87d0
Author: Andrea Arcangeli <aarcange@redhat.com>
Date:   Fri Mar 18 00:16:35 2011 +0100

    mm: PageBuddy and mapcount robustness
    
    Change the _mapcount value indicating PageBuddy from -2 to -128 for
    more robustness against page_mapcount() underflows.
    
    Use reset_page_mapcount instead of __ClearPageBuddy in bad_page to
    ignore the previous retval of PageBuddy().
    
    Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
    Reported-by: Hugh Dickins <hughd@google.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 679300c..ff83798 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -402,16 +402,23 @@ static inline void init_page_count(struct page *page)
 /*
  * PageBuddy() indicate that the page is free and in the buddy system
  * (see mm/page_alloc.c).
+ *
+ * PAGE_BUDDY_MAPCOUNT_VALUE must be <= -2 but better not too close to
+ * -2 so that an underflow of the page_mapcount() won't be mistaken
+ * for a genuine PAGE_BUDDY_MAPCOUNT_VALUE. -128 can be created very
+ * efficiently by most CPU architectures.
  */
+#define PAGE_BUDDY_MAPCOUNT_VALUE (-128)
+
 static inline int PageBuddy(struct page *page)
 {
-	return atomic_read(&page->_mapcount) == -2;
+	return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE;
 }
 
 static inline void __SetPageBuddy(struct page *page)
 {
 	VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
-	atomic_set(&page->_mapcount, -2);
+	atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);
 }
 
 static inline void __ClearPageBuddy(struct page *page)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bd76256..7945247 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -286,7 +286,7 @@ static void bad_page(struct page *page)
 
 	/* Don't complain about poisoned pages */
 	if (PageHWPoison(page)) {
-		__ClearPageBuddy(page);
+		reset_page_mapcount(page); /* remove PageBuddy */
 		return;
 	}
 
@@ -317,7 +317,7 @@ static void bad_page(struct page *page)
 	dump_stack();
 out:
 	/* Leave bad fields for debug, except PageBuddy could make trouble */
-	__ClearPageBuddy(page);
+	reset_page_mapcount(page); /* remove PageBuddy */
 	add_taint(TAINT_BAD_PAGE);
 }
 

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [stable] [PATCH] mm: PageBuddy and mapcount underflows robustness
  2011-03-18 21:34             ` Hugh Dickins
@ 2011-03-23 22:58               ` Greg KH
  -1 siblings, 0 replies; 28+ messages in thread
From: Greg KH @ 2011-03-23 22:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, linux-kernel, linux-mm, David Rientjes,
	Andrew Morton, Linus Torvalds, stable

On Fri, Mar 18, 2011 at 02:34:03PM -0700, Hugh Dickins wrote:
> On Fri, 18 Mar 2011, Andrea Arcangeli wrote:
> > On Mon, Mar 14, 2011 at 10:30:11AM -0700, Linus Torvalds wrote:
> > Subject: mm: PageBuddy and mapcount robustness
> > 
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > Change the _mapcount value indicating PageBuddy from -2 to -128 for
> > more robustness against page_mapcount() underflows.
> > 
> > Use reset_page_mapcount instead of __ClearPageBuddy in bad_page to
> > ignore the previous retval of PageBuddy().
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > Reported-by: Hugh Dickins <hughd@google.com>
> 
> Yes, this version satisfies my objections too.
> I'd say Acked-by but I see that it's already in, great.
> 
> I've Cc'ed stable@kernel.org: please can we have this in 2.6.38.1,
> since 2.6.38 regressed the recovery from bad page states,
> inadvertently changing them to a fatal error when CONFIG_DEBUG_VM.

Now queued up for 2.6.38.2, thanks.

greg k-h

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370!
  2011-03-14 19:58           ` Johannes Weiner
@ 2011-04-01 21:21             ` Andrea Arcangeli
  -1 siblings, 0 replies; 28+ messages in thread
From: Andrea Arcangeli @ 2011-04-01 21:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linus Torvalds, Hugh Dickins, Andrew Morton, David Rientjes,
	linux-kernel, linux-mm, Minchan Kim

On Mon, Mar 14, 2011 at 08:58:23PM +0100, Johannes Weiner wrote:
> We don't care about the vma.  It's all about assigning the physical
> page to the memcg that mm->owner belongs to.
> 
> It would be the first callsite not holding the mmap_sem, but that is
> only because all existing sites are fault handlers that don't drop the
> lock for other reasons.

I was afraid it'd be the first callsite; that's why I wasn't excited
to push it in 2.6.38. But Linus is right, and we should micro-optimize
it for 2.6.39.

> I am not aware of anything that would rely on the lock in there, or
> would not deserve to break if it did.

Thanks for double checking.

What about this? (The only problem is that the thp-vmstat patch in -mm
will then reject; maybe I should rediff it against -mm instead, as you
wish.)

===
Subject: thp: optimize memcg charge in khugepaged

From: Andrea Arcangeli <aarcange@redhat.com>

We don't need to hold the mmap_sem through mem_cgroup_newpage_charge(); the
mmap_sem is only held to keep the vma stable, and we don't need the vma
stable anymore after we return from alloc_hugepage_vma().

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0a619e0..c61d9ad 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1762,12 +1762,9 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 #ifndef CONFIG_NUMA
+	up_read(&mm->mmap_sem);
 	VM_BUG_ON(!*hpage);
 	new_page = *hpage;
-	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
-		up_read(&mm->mmap_sem);
-		return;
-	}
 #else
 	VM_BUG_ON(*hpage);
 	/*
@@ -1782,20 +1779,20 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	new_page = alloc_hugepage_vma(khugepaged_defrag(), vma, address,
 				      node, __GFP_OTHER_NODE);
+	/* after allocating the hugepage upgrade to mmap_sem write mode */
+	up_read(&mm->mmap_sem);
 	if (unlikely(!new_page)) {
-		up_read(&mm->mmap_sem);
 		*hpage = ERR_PTR(-ENOMEM);
 		return;
 	}
+#endif
+
 	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
-		up_read(&mm->mmap_sem);
+#ifdef CONFIG_NUMA
 		put_page(new_page);
+#endif
 		return;
 	}
-#endif
-
-	/* after allocating the hugepage upgrade to mmap_sem write mode */
-	up_read(&mm->mmap_sem);
 
 	/*
 	 * Prevent all access to pagetables with the exception of

^ permalink raw reply related	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2011-04-01 21:22 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-14  8:08 [PATCH] thp+memcg-numa: fix BUG at include/linux/mm.h:370! Hugh Dickins
2011-03-14 15:25 ` Minchan Kim
2011-03-14 15:52 ` Andrea Arcangeli
2011-03-14 16:37   ` Hugh Dickins
2011-03-14 16:56     ` Linus Torvalds
2011-03-14 17:17       ` Andrea Arcangeli
2011-03-14 19:58         ` Johannes Weiner
2011-03-14 23:42           ` KAMEZAWA Hiroyuki
2011-04-01 21:21           ` Andrea Arcangeli
2011-03-14 16:59     ` [PATCH] mm: PageBuddy and mapcount underflows robustness Andrea Arcangeli
2011-03-14 17:30       ` Linus Torvalds
2011-03-17 23:16         ` Andrea Arcangeli
2011-03-18 21:34           ` Hugh Dickins
2011-03-23 22:58             ` [stable] " Greg KH
