linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3] mm: prevent endless growth of anon_vma hierarchy
@ 2014-11-26 18:11 Konstantin Khlebnikov
  2014-11-26 19:30 ` Rik van Riel
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Konstantin Khlebnikov @ 2014-11-26 18:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Linux Kernel Mailing List
  Cc: Andrea Arcangeli, Rik van Riel, Tim Hartrick, Daniel Forrest,
	Hugh Dickins, Michel Lespinasse, Vlastimil Babka

Constantly forking task causes unlimited grow of anon_vma chain.
Each next child allocate new level of anon_vmas and links vmas to all
previous levels because it inherits pages from them. None of anon_vmas
cannot be freed because there might be pages which points to them.

This patch adds heuristic which decides to reuse existing anon_vma instead
of forking new one. It counts vmas and direct descendants for each anon_vma.
Anon_vma with degree lower than two will be reused at next fork.

As a result each anon_vma has either alive vma or at least two descendants,
endless chains are no longer possible and count of anon_vmas is no more than
two times more than count of vmas.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Reported-by: Daniel Forrest <dan.forrest@ssec.wisc.edu>
Tested-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu
Fixes: 5beb49305251 ("mm: change anon_vma linking to fix multi-process server scalability issue")
Cc: Stable <stable@vger.kernel.org> (2.6.34+)

---

v2: update degree in anon_vma_prepare for merged anon_vma
v3: update comment and tags
---
 include/linux/rmap.h |   16 ++++++++++++++++
 mm/rmap.c            |   30 +++++++++++++++++++++++++++++-
 2 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c0c2bce..b1d140c 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -45,6 +45,22 @@ struct anon_vma {
 	 * mm_take_all_locks() (mm_all_locks_mutex).
 	 */
 	struct rb_root rb_root;	/* Interval tree of private "related" vmas */
+
+	/*
+	 * Count of child anon_vmas and VMAs which points to this anon_vma.
+	 *
+	 * This counter is used for making decision about reusing old anon_vma
+	 * instead of forking new one. It allows to detect anon_vmas which have
+	 * just one direct descendant and no vmas. Reusing such anon_vma not
+	 * leads to significant preformance regression but prevents degradation
+	 * of anon_vma hierarchy to endless linear chain.
+	 *
+	 * Root anon_vma is never reused because it is its own parent and it has
+	 * at leat one vma or child, thus at fork it's degree is at least 2.
+	 */
+	unsigned degree;
+
+	struct anon_vma *parent;	/* Parent of this anon_vma */
 };
 
 /*
diff --git a/mm/rmap.c b/mm/rmap.c
index 19886fb..df5c44e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -72,6 +72,8 @@ static inline struct anon_vma *anon_vma_alloc(void)
 	anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
 	if (anon_vma) {
 		atomic_set(&anon_vma->refcount, 1);
+		anon_vma->degree = 1;	/* Reference for first vma */
+		anon_vma->parent = anon_vma;
 		/*
 		 * Initialise the anon_vma root to point to itself. If called
 		 * from fork, the root will be reset to the parents anon_vma.
@@ -188,6 +190,8 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 		if (likely(!vma->anon_vma)) {
 			vma->anon_vma = anon_vma;
 			anon_vma_chain_link(vma, avc, anon_vma);
+			/* vma link if merged or child link for new root */
+			anon_vma->degree++;
 			allocated = NULL;
 			avc = NULL;
 		}
@@ -256,7 +260,17 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 		anon_vma = pavc->anon_vma;
 		root = lock_anon_vma_root(root, anon_vma);
 		anon_vma_chain_link(dst, avc, anon_vma);
+
+		/*
+		 * Reuse existing anon_vma if its degree lower than two,
+		 * that means it has no vma and just one anon_vma child.
+		 */
+		if (!dst->anon_vma && anon_vma != src->anon_vma &&
+				anon_vma->degree < 2)
+			dst->anon_vma = anon_vma;
 	}
+	if (dst->anon_vma)
+		dst->anon_vma->degree++;
 	unlock_anon_vma_root(root);
 	return 0;
 
@@ -279,6 +293,9 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	if (!pvma->anon_vma)
 		return 0;
 
+	/* Drop inherited anon_vma, we'll reuse old one or allocate new. */
+	vma->anon_vma = NULL;
+
 	/*
 	 * First, attach the new VMA to the parent VMA's anon_vmas,
 	 * so rmap can find non-COWed pages in child processes.
@@ -286,6 +303,10 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	if (anon_vma_clone(vma, pvma))
 		return -ENOMEM;
 
+	/* An old anon_vma has been reused. */
+	if (vma->anon_vma)
+		return 0;
+
 	/* Then add our own anon_vma. */
 	anon_vma = anon_vma_alloc();
 	if (!anon_vma)
@@ -299,6 +320,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	 * lock any of the anon_vmas in this anon_vma tree.
 	 */
 	anon_vma->root = pvma->anon_vma->root;
+	anon_vma->parent = pvma->anon_vma;
 	/*
 	 * With refcounts, an anon_vma can stay around longer than the
 	 * process it belongs to. The root anon_vma needs to be pinned until
@@ -309,6 +331,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	vma->anon_vma = anon_vma;
 	anon_vma_lock_write(anon_vma);
 	anon_vma_chain_link(vma, avc, anon_vma);
+	anon_vma->parent->degree++;
 	anon_vma_unlock_write(anon_vma);
 
 	return 0;
@@ -339,12 +362,16 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
 		 * Leave empty anon_vmas on the list - we'll need
 		 * to free them outside the lock.
 		 */
-		if (RB_EMPTY_ROOT(&anon_vma->rb_root))
+		if (RB_EMPTY_ROOT(&anon_vma->rb_root)) {
+			anon_vma->parent->degree--;
 			continue;
+		}
 
 		list_del(&avc->same_vma);
 		anon_vma_chain_free(avc);
 	}
+	if (vma->anon_vma)
+		vma->anon_vma->degree--;
 	unlock_anon_vma_root(root);
 
 	/*
@@ -355,6 +382,7 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
 	list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
 		struct anon_vma *anon_vma = avc->anon_vma;
 
+		BUG_ON(anon_vma->degree);
 		put_anon_vma(anon_vma);
 
 		list_del(&avc->same_vma);


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] mm: prevent endless growth of anon_vma hierarchy
  2014-11-26 18:11 [PATCH v3] mm: prevent endless growth of anon_vma hierarchy Konstantin Khlebnikov
@ 2014-11-26 19:30 ` Rik van Riel
  2014-11-26 20:20   ` Konstantin Khlebnikov
  2014-11-26 21:05 ` Daniel Forrest
  2014-12-16 10:42 ` Michal Hocko
  2 siblings, 1 reply; 9+ messages in thread
From: Rik van Riel @ 2014-11-26 19:30 UTC (permalink / raw)
  To: Konstantin Khlebnikov, linux-mm, Andrew Morton,
	Linux Kernel Mailing List
  Cc: Andrea Arcangeli, Tim Hartrick, Daniel Forrest, Hugh Dickins,
	Michel Lespinasse, Vlastimil Babka

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 11/26/2014 01:11 PM, Konstantin Khlebnikov wrote:

> diff --git a/include/linux/rmap.h b/include/linux/rmap.h index
> c0c2bce..b1d140c 100644 --- a/include/linux/rmap.h +++
> b/include/linux/rmap.h @@ -45,6 +45,22 @@ struct anon_vma { *
> mm_take_all_locks() (mm_all_locks_mutex). */ struct rb_root
> rb_root;	/* Interval tree of private "related" vmas */ + +	/* +	 *
> Count of child anon_vmas and VMAs which points to this anon_vma. +
> * +	 * This counter is used for making decision about reusing old
> anon_vma +	 * instead of forking new one. It allows to detect
> anon_vmas which have +	 * just one direct descendant and no vmas.
> Reusing such anon_vma not +	 * leads to significant preformance
> regression but prevents degradation +	 * of anon_vma hierarchy to
> endless linear chain. +	 * +	 * Root anon_vma is never reused
> because it is its own parent and it has +	 * at leat one vma or
> child, thus at fork it's degree is at least 2. +	 */ +	unsigned
> degree; + +	struct anon_vma *parent;	/* Parent of this anon_vma */ 
> };

Could this be put earlier in the struct, so the "unsigned degree" can be
packed into the same long word as the spinlock, on 64 bit systems?

Otherwise there are two 4-byte entities in the struct, each of which get
padded out to 8 bytes.


> diff --git a/mm/rmap.c b/mm/rmap.c index 19886fb..df5c44e 100644 
> --- a/mm/rmap.c +++ b/mm/rmap.c @@ -72,6 +72,8 @@ static inline
> struct anon_vma *anon_vma_alloc(void) anon_vma =
> kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL); if (anon_vma) { 
> atomic_set(&anon_vma->refcount, 1); +		anon_vma->degree = 1;	/*
> Reference for first vma */ +		anon_vma->parent = anon_vma; /* *
> Initialise the anon_vma root to point to itself. If called * from
> fork, the root will be reset to the parents anon_vma. @@ -188,6
> +190,8 @@ int anon_vma_prepare(struct vm_area_struct *vma) if
> (likely(!vma->anon_vma)) { vma->anon_vma = anon_vma; 
> anon_vma_chain_link(vma, avc, anon_vma); +			/* vma link if merged
> or child link for new root */ +			anon_vma->degree++; allocated =
> NULL; avc = NULL; } @@ -256,7 +260,17 @@ int anon_vma_clone(struct
> vm_area_struct *dst, struct vm_area_struct *src) anon_vma =
> pavc->anon_vma; root = lock_anon_vma_root(root, anon_vma); 
> anon_vma_chain_link(dst, avc, anon_vma); + +		/* +		 * Reuse
> existing anon_vma if its degree lower than two, +		 * that means it
> has no vma and just one anon_vma child. +		 */ +		if
> (!dst->anon_vma && anon_vma != src->anon_vma && +
> anon_vma->degree < 2) +			dst->anon_vma = anon_vma; }

Why can src->anon_vma not be reused if it still not shared with
any other task?

Would it be more readable to use a "reuse_anon_vma" pointer here,
and assign dst->anon_vma the value of reuse_anon_vma if we choose
to reuse one?

Assigning different things to dst->anon_vma looks a little confusing.

Would it make sense to rename anon_vma->degree to anon_vma->sharing
or anon_vma->shared, or even anon_vma->users, to indicate that it is
a counter of how many VMAs are sharing this anon_vma?

> +	if (dst->anon_vma) +		dst->anon_vma->degree++; 
> unlock_anon_vma_root(root); return 0;
> 
> @@ -279,6 +293,9 @@ int anon_vma_fork(struct vm_area_struct *vma,
> struct vm_area_struct *pvma) if (!pvma->anon_vma) return 0;
> 
> +	/* Drop inherited anon_vma, we'll reuse old one or allocate new.
> */ +	vma->anon_vma = NULL;

Use of a temporary variable in anon_vma_clone() would avoid this.

> @@ -286,6 +303,10 @@ int anon_vma_fork(struct vm_area_struct *vma,
> struct vm_area_struct *pvma) if (anon_vma_clone(vma, pvma)) return
> -ENOMEM;
> 
> +	/* An old anon_vma has been reused. */

s/old/existing/  ?

> +	if (vma->anon_vma) +		return 0; + /* Then add our own anon_vma.
> */ anon_vma = anon_vma_alloc(); if (!anon_vma)

Overall the patch looks good.

- -- 
All rights reversed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJUdipzAAoJEM553pKExN6Db2kH/20CfKy2ntayKb03tqYnlohu
OUxtCwqiow8XsfYc2cEBrCznCNPD5B0sdDdEgRWybBnRCikHNQS4vUBhNl/F13gS
Hu8LM+RElhZwr69cCshUXefIx5xMKimUeAsHutpvy09onZy0DvYutdR958/Nhca/
1OjXqtE+LbPd0aG87OQQlagk4DQls0uA2l609qBRKsfMgm2444MRPAN0RGQwlYIv
SENvzVFN4ZvIdzsU8IoSw2EkhBankYDKbbTxAy+sHCHaxKzq0eKn+JgRaoZLjxU9
+43snI/fkWNN+S5KLgshUKIVO84kAmRAIfdKUjt/DYYkOj6YPp48aJnOKVFwjIY=
=Bhp4
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] mm: prevent endless growth of anon_vma hierarchy
  2014-11-26 19:30 ` Rik van Riel
@ 2014-11-26 20:20   ` Konstantin Khlebnikov
  0 siblings, 0 replies; 9+ messages in thread
From: Konstantin Khlebnikov @ 2014-11-26 20:20 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-mm, Andrew Morton, Linux Kernel Mailing List,
	Andrea Arcangeli, Tim Hartrick, Daniel Forrest, Hugh Dickins,
	Michel Lespinasse, Vlastimil Babka

On Wed, Nov 26, 2014 at 10:30 PM, Rik van Riel <riel@redhat.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 11/26/2014 01:11 PM, Konstantin Khlebnikov wrote:
>
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h index
>> c0c2bce..b1d140c 100644 --- a/include/linux/rmap.h +++
>> b/include/linux/rmap.h @@ -45,6 +45,22 @@ struct anon_vma { *
>> mm_take_all_locks() (mm_all_locks_mutex). */ struct rb_root
>> rb_root;      /* Interval tree of private "related" vmas */ + +       /* +     *
>> Count of child anon_vmas and VMAs which points to this anon_vma. +
>> * +    * This counter is used for making decision about reusing old
>> anon_vma +     * instead of forking new one. It allows to detect
>> anon_vmas which have +         * just one direct descendant and no vmas.
>> Reusing such anon_vma not +    * leads to significant preformance
>> regression but prevents degradation +  * of anon_vma hierarchy to
>> endless linear chain. +        * +     * Root anon_vma is never reused
>> because it is its own parent and it has +      * at leat one vma or
>> child, thus at fork it's degree is at least 2. +       */ +   unsigned
>> degree; + +   struct anon_vma *parent;        /* Parent of this anon_vma */
>> };
>
> Could this be put earlier in the struct, so the "unsigned degree" can be
> packed into the same long word as the spinlock, on 64 bit systems?
>
> Otherwise there are two 4-byte entities in the struct, each of which get
> padded out to 8 bytes.

As I see, rw-semaphore is used here now.
But here is 4-byte padding after atomic counter.
So we could shrink structure from 8 to 7 64-bit words.

>
>
>> diff --git a/mm/rmap.c b/mm/rmap.c index 19886fb..df5c44e 100644
>> --- a/mm/rmap.c +++ b/mm/rmap.c @@ -72,6 +72,8 @@ static inline
>> struct anon_vma *anon_vma_alloc(void) anon_vma =
>> kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL); if (anon_vma) {
>> atomic_set(&anon_vma->refcount, 1); +         anon_vma->degree = 1;   /*
>> Reference for first vma */ +          anon_vma->parent = anon_vma; /* *
>> Initialise the anon_vma root to point to itself. If called * from
>> fork, the root will be reset to the parents anon_vma. @@ -188,6
>> +190,8 @@ int anon_vma_prepare(struct vm_area_struct *vma) if
>> (likely(!vma->anon_vma)) { vma->anon_vma = anon_vma;
>> anon_vma_chain_link(vma, avc, anon_vma); +                    /* vma link if merged
>> or child link for new root */ +                       anon_vma->degree++; allocated =
>> NULL; avc = NULL; } @@ -256,7 +260,17 @@ int anon_vma_clone(struct
>> vm_area_struct *dst, struct vm_area_struct *src) anon_vma =
>> pavc->anon_vma; root = lock_anon_vma_root(root, anon_vma);
>> anon_vma_chain_link(dst, avc, anon_vma); + +          /* +             * Reuse
>> existing anon_vma if its degree lower than two, +              * that means it
>> has no vma and just one anon_vma child. +              */ +           if
>> (!dst->anon_vma && anon_vma != src->anon_vma && +
>> anon_vma->degree < 2) +                       dst->anon_vma = anon_vma; }
>
> Why can src->anon_vma not be reused if it still not shared with
> any other task?

First child will see parent anon-vma with degree counter == 1.
Child haven't incremented this counter yet.
It could be shared but this isn't necessary for fixing original problem.
So I've decided to minimized influence.

>
> Would it be more readable to use a "reuse_anon_vma" pointer here,
> and assign dst->anon_vma the value of reuse_anon_vma if we choose
> to reuse one?
>
> Assigning different things to dst->anon_vma looks a little confusing.

What different things? It always points to anon_vma, but sometimes it's NULL.

>
> Would it make sense to rename anon_vma->degree to anon_vma->sharing
> or anon_vma->shared, or even anon_vma->users, to indicate that it is
> a counter of how many VMAs are sharing this anon_vma?

It counts two different things: vmas and directly descended anon_vmas.
So I give it abstract name to prevent misunderstanding.
For example, if it's 1 that doesn't means that here is just one user:
that might be one child anon_vma and a lot of grandchilds.

>
>> +     if (dst->anon_vma) +            dst->anon_vma->degree++;
>> unlock_anon_vma_root(root); return 0;
>>
>> @@ -279,6 +293,9 @@ int anon_vma_fork(struct vm_area_struct *vma,
>> struct vm_area_struct *pvma) if (!pvma->anon_vma) return 0;
>>
>> +     /* Drop inherited anon_vma, we'll reuse old one or allocate new.
>> */ +  vma->anon_vma = NULL;
>
> Use of a temporary variable in anon_vma_clone() would avoid this.

anon_vma_clone() is used in some other places where we clone vma
without creating new level: for example in vma_split. This trick allows
to keep interface untouched.

>
>> @@ -286,6 +303,10 @@ int anon_vma_fork(struct vm_area_struct *vma,
>> struct vm_area_struct *pvma) if (anon_vma_clone(vma, pvma)) return
>> -ENOMEM;
>>
>> +     /* An old anon_vma has been reused. */
>
> s/old/existing/  ?

or maybe antecedent

>
>> +     if (vma->anon_vma) +            return 0; + /* Then add our own anon_vma.
>> */ anon_vma = anon_vma_alloc(); if (!anon_vma)
>
> Overall the patch looks good.
>
> - --
> All rights reversed
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1
>
> iQEcBAEBAgAGBQJUdipzAAoJEM553pKExN6Db2kH/20CfKy2ntayKb03tqYnlohu
> OUxtCwqiow8XsfYc2cEBrCznCNPD5B0sdDdEgRWybBnRCikHNQS4vUBhNl/F13gS
> Hu8LM+RElhZwr69cCshUXefIx5xMKimUeAsHutpvy09onZy0DvYutdR958/Nhca/
> 1OjXqtE+LbPd0aG87OQQlagk4DQls0uA2l609qBRKsfMgm2444MRPAN0RGQwlYIv
> SENvzVFN4ZvIdzsU8IoSw2EkhBankYDKbbTxAy+sHCHaxKzq0eKn+JgRaoZLjxU9
> +43snI/fkWNN+S5KLgshUKIVO84kAmRAIfdKUjt/DYYkOj6YPp48aJnOKVFwjIY=
> =Bhp4
> -----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] mm: prevent endless growth of anon_vma hierarchy
  2014-11-26 18:11 [PATCH v3] mm: prevent endless growth of anon_vma hierarchy Konstantin Khlebnikov
  2014-11-26 19:30 ` Rik van Riel
@ 2014-11-26 21:05 ` Daniel Forrest
  2014-11-26 22:35   ` Andrew Morton
  2014-11-27  9:13   ` Michal Hocko
  2014-12-16 10:42 ` Michal Hocko
  2 siblings, 2 replies; 9+ messages in thread
From: Daniel Forrest @ 2014-11-26 21:05 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-mm, Andrew Morton, Linux Kernel Mailing List,
	Andrea Arcangeli, Rik van Riel, Tim Hartrick, Hugh Dickins,
	Michel Lespinasse, Vlastimil Babka

On Wed, Nov 26, 2014 at 10:11:45PM +0400, Konstantin Khlebnikov wrote:

> Constantly forking task causes unlimited grow of anon_vma chain.
> Each next child allocate new level of anon_vmas and links vmas to all
> previous levels because it inherits pages from them. None of anon_vmas
> cannot be freed because there might be pages which points to them.
> 
> This patch adds heuristic which decides to reuse existing anon_vma instead
> of forking new one. It counts vmas and direct descendants for each anon_vma.
> Anon_vma with degree lower than two will be reused at next fork.
> 
> As a result each anon_vma has either alive vma or at least two descendants,
> endless chains are no longer possible and count of anon_vmas is no more than
> two times more than count of vmas.

While I was working on the previous fix for this bug, Andrew Morton
noticed that the error return from anon_vma_clone() was being dropped
and replaced with -ENOMEM (which is not itself a bug because the only
error return value from anon_vma_clone() is -ENOMEM).

I did an audit of callers of anon_vma_clone() and discovered an actual
bug where the error return was being lost.  In __split_vma(), between
Linux 3.11 and 3.12 the code was changed so the err variable is used
before the call to anon_vma_clone() and the default initial value of
-ENOMEM is overwritten.  So a failure of anon_vma_clone() will return
success since err at this point is now zero.

Below is a patch which fixes this bug and also propagates the error
return value from anon_vma_clone() in all cases.

I can send this as a separate patch, but maybe it would be easier if
you were to incorporate it into yours?

Signed-off-by: Daniel Forrest <dan.forrest@ssec.wisc.edu>

---
 mmap.c |   10 +++++++---
 rmap.c |    6 ++++--
 2 files changed, 11 insertions(+), 5 deletions(-)

diff -rup a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -776,8 +776,11 @@ again:			remove_next = 1 + (end > next->
 		 * shrinking vma had, to cover any anon pages imported.
 		 */
 		if (exporter && exporter->anon_vma && !importer->anon_vma) {
-			if (anon_vma_clone(importer, exporter))
-				return -ENOMEM;
+			int error;
+
+			error = anon_vma_clone(importer, exporter);
+			if (error)
+				return error;
 			importer->anon_vma = exporter->anon_vma;
 		}
 	}
@@ -2469,7 +2472,8 @@ static int __split_vma(struct mm_struct 
 	if (err)
 		goto out_free_vma;
 
-	if (anon_vma_clone(new, vma))
+	err = anon_vma_clone(new, vma);
+	if (err)
 		goto out_free_mpol;
 
 	if (new->vm_file)
diff -rup a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -274,6 +274,7 @@ int anon_vma_fork(struct vm_area_struct 
 {
 	struct anon_vma_chain *avc;
 	struct anon_vma *anon_vma;
+	int error;
 
 	/* Don't bother if the parent process has no anon_vma here. */
 	if (!pvma->anon_vma)
@@ -283,8 +284,9 @@ int anon_vma_fork(struct vm_area_struct 
 	 * First, attach the new VMA to the parent VMA's anon_vmas,
 	 * so rmap can find non-COWed pages in child processes.
 	 */
-	if (anon_vma_clone(vma, pvma))
-		return -ENOMEM;
+	error = anon_vma_clone(vma, pvma);
+	if (error)
+		return error;
 
 	/* Then add our own anon_vma. */
 	anon_vma = anon_vma_alloc();

-- 
Daniel K. Forrest		Space Science and
dan.forrest@ssec.wisc.edu	Engineering Center
(608) 890 - 0558		University of Wisconsin, Madison

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] mm: prevent endless growth of anon_vma hierarchy
  2014-11-26 21:05 ` Daniel Forrest
@ 2014-11-26 22:35   ` Andrew Morton
  2014-11-27  9:13   ` Michal Hocko
  1 sibling, 0 replies; 9+ messages in thread
From: Andrew Morton @ 2014-11-26 22:35 UTC (permalink / raw)
  To: Daniel Forrest
  Cc: Konstantin Khlebnikov, linux-mm, Linux Kernel Mailing List,
	Andrea Arcangeli, Rik van Riel, Tim Hartrick, Hugh Dickins,
	Michel Lespinasse, Vlastimil Babka

On Wed, 26 Nov 2014 15:05:59 -0600 Daniel Forrest <dan.forrest@ssec.wisc.edu> wrote:

> On Wed, Nov 26, 2014 at 10:11:45PM +0400, Konstantin Khlebnikov wrote:
> 
> > Constantly forking task causes unlimited grow of anon_vma chain.
> > Each next child allocate new level of anon_vmas and links vmas to all
> > previous levels because it inherits pages from them. None of anon_vmas
> > cannot be freed because there might be pages which points to them.
> > 
> > This patch adds heuristic which decides to reuse existing anon_vma instead
> > of forking new one. It counts vmas and direct descendants for each anon_vma.
> > Anon_vma with degree lower than two will be reused at next fork.
> > 
> > As a result each anon_vma has either alive vma or at least two descendants,
> > endless chains are no longer possible and count of anon_vmas is no more than
> > two times more than count of vmas.
> 
> While I was working on the previous fix for this bug, Andrew Morton
> noticed that the error return from anon_vma_clone() was being dropped
> and replaced with -ENOMEM (which is not itself a bug because the only
> error return value from anon_vma_clone() is -ENOMEM).
> 
> I did an audit of callers of anon_vma_clone() and discovered an actual
> bug where the error return was being lost.  In __split_vma(), between
> Linux 3.11 and 3.12 the code was changed so the err variable is used
> before the call to anon_vma_clone() and the default initial value of
> -ENOMEM is overwritten.  So a failure of anon_vma_clone() will return
> success since err at this point is now zero.
> 
> Below is a patch which fixes this bug and also propagates the error
> return value from anon_vma_clone() in all cases.
> 
> I can send this as a separate patch, but maybe it would be easier if
> you were to incorporate it into yours?
> 

I grabbed it.  A bugfix is a bugfix.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] mm: prevent endless growth of anon_vma hierarchy
  2014-11-26 21:05 ` Daniel Forrest
  2014-11-26 22:35   ` Andrew Morton
@ 2014-11-27  9:13   ` Michal Hocko
  1 sibling, 0 replies; 9+ messages in thread
From: Michal Hocko @ 2014-11-27  9:13 UTC (permalink / raw)
  To: Konstantin Khlebnikov, linux-mm, Andrew Morton,
	Linux Kernel Mailing List, Andrea Arcangeli, Rik van Riel,
	Tim Hartrick, Hugh Dickins, Michel Lespinasse, Vlastimil Babka

On Wed 26-11-14 15:05:59, Daniel Forrest wrote:
> On Wed, Nov 26, 2014 at 10:11:45PM +0400, Konstantin Khlebnikov wrote:
> 
> > Constantly forking task causes unlimited grow of anon_vma chain.
> > Each next child allocate new level of anon_vmas and links vmas to all
> > previous levels because it inherits pages from them. None of anon_vmas
> > cannot be freed because there might be pages which points to them.
> > 
> > This patch adds heuristic which decides to reuse existing anon_vma instead
> > of forking new one. It counts vmas and direct descendants for each anon_vma.
> > Anon_vma with degree lower than two will be reused at next fork.
> > 
> > As a result each anon_vma has either alive vma or at least two descendants,
> > endless chains are no longer possible and count of anon_vmas is no more than
> > two times more than count of vmas.
> 
> While I was working on the previous fix for this bug, Andrew Morton
> noticed that the error return from anon_vma_clone() was being dropped
> and replaced with -ENOMEM (which is not itself a bug because the only
> error return value from anon_vma_clone() is -ENOMEM).
> 
> I did an audit of callers of anon_vma_clone() and discovered an actual
> bug where the error return was being lost.  In __split_vma(), between
> Linux 3.11 and 3.12 the code was changed so the err variable is used
> before the call to anon_vma_clone() and the default initial value of
> -ENOMEM is overwritten.  So a failure of anon_vma_clone() will return
> success since err at this point is now zero.
> 
> Below is a patch which fixes this bug and also propagates the error
> return value from anon_vma_clone() in all cases.
> 
> I can send this as a separate patch, but maybe it would be easier if
> you were to incorporate it into yours?

I would prefer two patches as they address two different things and also
target different set of stable trees.

> Signed-off-by: Daniel Forrest <dan.forrest@ssec.wisc.edu>

Fixes: ef0855d334e1 (mm: mempolicy: turn vma_set_policy() into vma_dup_policy())

and mark for stable (3.12+) please.

Feel free to add
Reviewed-by: Michal Hocko <mhocko@suse.cz>

Thanks!

> 
> ---
>  mmap.c |   10 +++++++---
>  rmap.c |    6 ++++--
>  2 files changed, 11 insertions(+), 5 deletions(-)
> 
> diff -rup a/mm/mmap.c b/mm/mmap.c
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -776,8 +776,11 @@ again:			remove_next = 1 + (end > next->
>  		 * shrinking vma had, to cover any anon pages imported.
>  		 */
>  		if (exporter && exporter->anon_vma && !importer->anon_vma) {
> -			if (anon_vma_clone(importer, exporter))
> -				return -ENOMEM;
> +			int error;
> +
> +			error = anon_vma_clone(importer, exporter);
> +			if (error)
> +				return error;
>  			importer->anon_vma = exporter->anon_vma;
>  		}
>  	}
> @@ -2469,7 +2472,8 @@ static int __split_vma(struct mm_struct 
>  	if (err)
>  		goto out_free_vma;
>  
> -	if (anon_vma_clone(new, vma))
> +	err = anon_vma_clone(new, vma);
> +	if (err)
>  		goto out_free_mpol;
>  
>  	if (new->vm_file)
> diff -rup a/mm/rmap.c b/mm/rmap.c
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -274,6 +274,7 @@ int anon_vma_fork(struct vm_area_struct 
>  {
>  	struct anon_vma_chain *avc;
>  	struct anon_vma *anon_vma;
> +	int error;
>  
>  	/* Don't bother if the parent process has no anon_vma here. */
>  	if (!pvma->anon_vma)
> @@ -283,8 +284,9 @@ int anon_vma_fork(struct vm_area_struct 
>  	 * First, attach the new VMA to the parent VMA's anon_vmas,
>  	 * so rmap can find non-COWed pages in child processes.
>  	 */
> -	if (anon_vma_clone(vma, pvma))
> -		return -ENOMEM;
> +	error = anon_vma_clone(vma, pvma);
> +	if (error)
> +		return error;
>  
>  	/* Then add our own anon_vma. */
>  	anon_vma = anon_vma_alloc();
> 
> -- 
> Daniel K. Forrest		Space Science and
> dan.forrest@ssec.wisc.edu	Engineering Center
> (608) 890 - 0558		University of Wisconsin, Madison
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] mm: prevent endless growth of anon_vma hierarchy
  2014-11-26 18:11 [PATCH v3] mm: prevent endless growth of anon_vma hierarchy Konstantin Khlebnikov
  2014-11-26 19:30 ` Rik van Riel
  2014-11-26 21:05 ` Daniel Forrest
@ 2014-12-16 10:42 ` Michal Hocko
  2014-12-16 23:50   ` Andrew Morton
  2 siblings, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2014-12-16 10:42 UTC (permalink / raw)
  To: Andrew Morton, Konstantin Khlebnikov
  Cc: linux-mm, Linux Kernel Mailing List, Andrea Arcangeli,
	Rik van Riel, Tim Hartrick, Daniel Forrest, Hugh Dickins,
	Michel Lespinasse, Vlastimil Babka

What happened to this patch? I do not see it merged for 3.19 and
nor in the current mmotm tree (2014-12-15-17-05)

On Wed 26-11-14 22:11:45, Konstantin Khlebnikov wrote:
> Constantly forking task causes unlimited grow of anon_vma chain.
> Each next child allocate new level of anon_vmas and links vmas to all
> previous levels because it inherits pages from them. None of anon_vmas
> cannot be freed because there might be pages which points to them.
> 
> This patch adds heuristic which decides to reuse existing anon_vma instead
> of forking new one. It counts vmas and direct descendants for each anon_vma.
> Anon_vma with degree lower than two will be reused at next fork.
> 
> As a result each anon_vma has either alive vma or at least two descendants,
> endless chains are no longer possible and count of anon_vmas is no more than
> two times more than count of vmas.
> 
> Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
> Reported-by: Daniel Forrest <dan.forrest@ssec.wisc.edu>
> Tested-by: Michal Hocko <mhocko@suse.cz>
> Reviewed-by: Michal Hocko <mhocko@suse.cz>
> Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu
> Fixes: 5beb49305251 ("mm: change anon_vma linking to fix multi-process server scalability issue")
> Cc: Stable <stable@vger.kernel.org> (2.6.34+)
> 
> ---
> 
> v2: update degree in anon_vma_prepare for merged anon_vma
> v3: update comment and tags
> ---
>  include/linux/rmap.h |   16 ++++++++++++++++
>  mm/rmap.c            |   30 +++++++++++++++++++++++++++++-
>  2 files changed, 45 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index c0c2bce..b1d140c 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -45,6 +45,22 @@ struct anon_vma {
>  	 * mm_take_all_locks() (mm_all_locks_mutex).
>  	 */
>  	struct rb_root rb_root;	/* Interval tree of private "related" vmas */
> +
> +	/*
> +	 * Count of child anon_vmas and VMAs which points to this anon_vma.
> +	 *
> +	 * This counter is used for making decision about reusing old anon_vma
> +	 * instead of forking new one. It allows to detect anon_vmas which have
> +	 * just one direct descendant and no vmas. Reusing such anon_vma not
> +	 * leads to significant preformance regression but prevents degradation
> +	 * of anon_vma hierarchy to endless linear chain.
> +	 *
> +	 * Root anon_vma is never reused because it is its own parent and it has
> +	 * at leat one vma or child, thus at fork it's degree is at least 2.
> +	 */
> +	unsigned degree;
> +
> +	struct anon_vma *parent;	/* Parent of this anon_vma */
>  };
>  
>  /*
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 19886fb..df5c44e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -72,6 +72,8 @@ static inline struct anon_vma *anon_vma_alloc(void)
>  	anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
>  	if (anon_vma) {
>  		atomic_set(&anon_vma->refcount, 1);
> +		anon_vma->degree = 1;	/* Reference for first vma */
> +		anon_vma->parent = anon_vma;
>  		/*
>  		 * Initialise the anon_vma root to point to itself. If called
>  		 * from fork, the root will be reset to the parents anon_vma.
> @@ -188,6 +190,8 @@ int anon_vma_prepare(struct vm_area_struct *vma)
>  		if (likely(!vma->anon_vma)) {
>  			vma->anon_vma = anon_vma;
>  			anon_vma_chain_link(vma, avc, anon_vma);
> +			/* vma link if merged or child link for new root */
> +			anon_vma->degree++;
>  			allocated = NULL;
>  			avc = NULL;
>  		}
> @@ -256,7 +260,17 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
>  		anon_vma = pavc->anon_vma;
>  		root = lock_anon_vma_root(root, anon_vma);
>  		anon_vma_chain_link(dst, avc, anon_vma);
> +
> +		/*
> +		 * Reuse existing anon_vma if its degree lower than two,
> +		 * that means it has no vma and just one anon_vma child.
> +		 */
> +		if (!dst->anon_vma && anon_vma != src->anon_vma &&
> +				anon_vma->degree < 2)
> +			dst->anon_vma = anon_vma;
>  	}
> +	if (dst->anon_vma)
> +		dst->anon_vma->degree++;
>  	unlock_anon_vma_root(root);
>  	return 0;
>  
> @@ -279,6 +293,9 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  	if (!pvma->anon_vma)
>  		return 0;
>  
> +	/* Drop inherited anon_vma, we'll reuse old one or allocate new. */
> +	vma->anon_vma = NULL;
> +
>  	/*
>  	 * First, attach the new VMA to the parent VMA's anon_vmas,
>  	 * so rmap can find non-COWed pages in child processes.
> @@ -286,6 +303,10 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  	if (anon_vma_clone(vma, pvma))
>  		return -ENOMEM;
>  
> +	/* An old anon_vma has been reused. */
> +	if (vma->anon_vma)
> +		return 0;
> +
>  	/* Then add our own anon_vma. */
>  	anon_vma = anon_vma_alloc();
>  	if (!anon_vma)
> @@ -299,6 +320,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  	 * lock any of the anon_vmas in this anon_vma tree.
>  	 */
>  	anon_vma->root = pvma->anon_vma->root;
> +	anon_vma->parent = pvma->anon_vma;
>  	/*
>  	 * With refcounts, an anon_vma can stay around longer than the
>  	 * process it belongs to. The root anon_vma needs to be pinned until
> @@ -309,6 +331,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  	vma->anon_vma = anon_vma;
>  	anon_vma_lock_write(anon_vma);
>  	anon_vma_chain_link(vma, avc, anon_vma);
> +	anon_vma->parent->degree++;
>  	anon_vma_unlock_write(anon_vma);
>  
>  	return 0;
> @@ -339,12 +362,16 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
>  		 * Leave empty anon_vmas on the list - we'll need
>  		 * to free them outside the lock.
>  		 */
> -		if (RB_EMPTY_ROOT(&anon_vma->rb_root))
> +		if (RB_EMPTY_ROOT(&anon_vma->rb_root)) {
> +			anon_vma->parent->degree--;
>  			continue;
> +		}
>  
>  		list_del(&avc->same_vma);
>  		anon_vma_chain_free(avc);
>  	}
> +	if (vma->anon_vma)
> +		vma->anon_vma->degree--;
>  	unlock_anon_vma_root(root);
>  
>  	/*
> @@ -355,6 +382,7 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
>  	list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
>  		struct anon_vma *anon_vma = avc->anon_vma;
>  
> +		BUG_ON(anon_vma->degree);
>  		put_anon_vma(anon_vma);
>  
>  		list_del(&avc->same_vma);
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] mm: prevent endless growth of anon_vma hierarchy
  2014-12-16 10:42 ` Michal Hocko
@ 2014-12-16 23:50   ` Andrew Morton
  2014-12-17  9:04     ` Konstantin Khlebnikov
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2014-12-16 23:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Konstantin Khlebnikov, linux-mm, Linux Kernel Mailing List,
	Andrea Arcangeli, Rik van Riel, Tim Hartrick, Daniel Forrest,
	Hugh Dickins, Michel Lespinasse, Vlastimil Babka

On Tue, 16 Dec 2014 11:42:18 +0100 Michal Hocko <mhocko@suse.cz> wrote:

> What happened to this patch? I do not see it merged for 3.19 and
> nor in the current mmotm tree (2014-12-15-17-05)
> 

Awaiting v4.  See Konstantin's reply to Rik's review comments.

Konstantin, please ensure that the questions which Rik asked are
answered (via code comments and changelogging) in the next version?
Because other future readers will have the same questions.



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] mm: prevent endless growth of anon_vma hierarchy
  2014-12-16 23:50   ` Andrew Morton
@ 2014-12-17  9:04     ` Konstantin Khlebnikov
  0 siblings, 0 replies; 9+ messages in thread
From: Konstantin Khlebnikov @ 2014-12-17  9:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, linux-mm, Linux Kernel Mailing List,
	Andrea Arcangeli, Rik van Riel, Tim Hartrick, Daniel Forrest,
	Hugh Dickins, Michel Lespinasse, Vlastimil Babka

On Wed, Dec 17, 2014 at 2:50 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Tue, 16 Dec 2014 11:42:18 +0100 Michal Hocko <mhocko@suse.cz> wrote:
>
>> What happened to this patch? I do not see it merged for 3.19 and
>> nor in the current mmotm tree (2014-12-15-17-05)
>>
>
> Awaiting v4.  See Konstantin's reply to Rik's review comments.
>
> Konstantin, please ensure that the questions which Rik asked are
> answered (via code comments and changelogging) in the next version?
> Because other future readers will have the same questions.
>

Done.

Thanks for reminding.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2014-12-17  9:04 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-26 18:11 [PATCH v3] mm: prevent endless growth of anon_vma hierarchy Konstantin Khlebnikov
2014-11-26 19:30 ` Rik van Riel
2014-11-26 20:20   ` Konstantin Khlebnikov
2014-11-26 21:05 ` Daniel Forrest
2014-11-26 22:35   ` Andrew Morton
2014-11-27  9:13   ` Michal Hocko
2014-12-16 10:42 ` Michal Hocko
2014-12-16 23:50   ` Andrew Morton
2014-12-17  9:04     ` Konstantin Khlebnikov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).