* Repeated fork() causes SLAB to grow without bound
@ 2012-08-16  2:46 Daniel Forrest
  2012-08-16 18:58   ` Rik van Riel
  0 siblings, 1 reply; 75+ messages in thread
From: Daniel Forrest @ 2012-08-16  2:46 UTC (permalink / raw)
  To: linux-kernel

I'm hoping someone has seen this before...

I've been trying to track down a performance problem with Linux 3.0.4.
While a data ingest/processing program is running, system-mode load
increases over time but user-mode load remains constant.

Looking at /proc/meminfo I noticed SUnreclaim increasing steadily.

Looking at /proc/slabinfo I noticed anon_vma and anon_vma_chain also
increasing steadily.
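
For anyone who wants to watch the same counter, here is a minimal
sketch of pulling one field such as SUnreclaim out of /proc/meminfo.
The function names meminfo_kb() and read_meminfo_kb() are made up for
this illustration, and the code assumes the usual "Key:   value kB"
line layout.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the value (in kB) of a "Key:   12345 kB" line from text in
 * /proc/meminfo format, or -1 if the key is not present.
 * Hypothetical helper, not part of any library.
 */
long meminfo_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* Match "key:" only at the start of a line. */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long val;

            if (sscanf(line + klen + 1, "%ld", &val) == 1)
                return val;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/* Read /proc/meminfo once and look up one key; -1 on any error. */
long read_meminfo_kb(const char *key)
{
    char buf[8192];
    size_t n;
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return meminfo_kb(buf, key);
}
```

Calling read_meminfo_kb("SUnreclaim") every few seconds while the
test program runs should show the steady growth described above.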

I was able to generate a simple test program that will cause this:

---

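#include <sys/types.h>
#include <unistd.h>

int main(void)
{
   pid_t pid;

   while (1) {
      pid = fork();
      if (pid == -1) {
         /* error: fork() failed */
         return 1;
      }
      if (pid) {
         /* parent: linger briefly, then exit and leave the child running */
         sleep(2);
         break;
      } else {
         /* child: wait, then loop around and become the next parent */
         sleep(1);
      }
   }
   return 0;
}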

---

In the actual program (running as a daemon), a child is reading data
while its parent is processing the previously read data.  At any time
there are only a few processes in existence, with older processes
exiting and new processes being fork()ed.  Killing the program frees
the slab usage.

I patched the kernel to 3.0.40, but the problem remains.  I also
compiled with slab debugging and can see that the growth of anon_vma
and anon_vma_chain is due to anon_vma_clone/anon_vma_fork.

Is this a known issue?  Is it fixed in a later release?

Thanks,

-- 
Daniel K. Forrest		Space Science and
dan.forrest@ssec.wisc.edu	Engineering Center
(608) 890 - 0558		University of Wisconsin, Madison


* Re: Repeated fork() causes SLAB to grow without bound
  2012-08-16  2:46 Repeated fork() causes SLAB to grow without bound Daniel Forrest
@ 2012-08-16 18:58   ` Rik van Riel
  0 siblings, 0 replies; 75+ messages in thread
From: Rik van Riel @ 2012-08-16 18:58 UTC (permalink / raw)
  To: linux-kernel, Hugh Dickins; +Cc: linux-mm

On 08/15/2012 10:46 PM, Daniel Forrest wrote:
> I'm hoping someone has seen this before...
>
> I've been trying to track down a performance problem with Linux 3.0.4.
> The symptom is system-mode load increasing over time while user-mode
> load remains constant while running a data ingest/processing program.
>
> Looking at /proc/meminfo I noticed SUnreclaim increasing steadily.
>
> Looking at /proc/slabinfo I noticed anon_vma and anon_vma_chain also
> increasing steadily.

Oh dear.

Basically, what happens is that at fork time, a new
"level" is created for the anon_vma hierarchy. This
works great for normal forking daemons, since the
parent process just keeps running, and forking off
children.

Look at anon_vma_fork() in mm/rmap.c for the details.

Having each child become the new parent, and the
previous parent exit, can result in an "infinite"
stack of anon_vmas.

Now, the parent anon_vma we cannot get rid of,
because that is where the anon_vma lock lives.

However, in your case you have many more anon_vma
levels than you have processes!

I wonder if it may be possible to fix your bug
by adding a refcount to the struct anon_vma,
one count for each VMA that is directly attached
to the anon_vma (i.e. vma->anon_vma == anon_vma),
and one for each page that points to the anon_vma.

If the reference count on an anon_vma reaches 0,
we can skip that anon_vma in anon_vma_clone, and
the child process should not get that anon_vma.

A scheme like that may be enough to avoid the trouble
you are running into.

Does this sound realistic?
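
The growth and the proposed fix can be sketched with a small
userspace toy model. This is not kernel code: every name in it
(toy_anon_vma, toy_proc, toy_fork, toy_exit, simulate) is invented
for the illustration. It assumes one anonymous VMA per process and
that no pages are left pointing at an ancestor's anon_vma, so the
"users" counter only tracks directly attached VMAs.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for struct anon_vma; only two counters matter here. */
struct toy_anon_vma {
    int links;  /* chain entries (anon_vma_chain) pointing at this object */
    int users;  /* VMAs directly attached, i.e. vma->anon_vma == this     */
};

/* Toy stand-in for a process with a single anonymous VMA. */
struct toy_proc {
    struct toy_anon_vma **chain;  /* every anon_vma the VMA is linked to */
    int len;
};

static int live_anon_vmas;

static void *xcalloc(size_t n, size_t size)
{
    void *p = calloc(n, size);

    if (!p)
        abort();
    return p;
}

/* A fresh anon_vma starts with one chain link and one attached VMA. */
static struct toy_anon_vma *av_new(void)
{
    struct toy_anon_vma *av = xcalloc(1, sizeof(*av));

    av->links = 1;
    av->users = 1;
    live_anon_vmas++;
    return av;
}

/* Clone the parent's chain, then add a new level for the child. */
static struct toy_proc toy_fork(const struct toy_proc *parent, int skip_unused)
{
    struct toy_proc child;
    int i;

    child.chain = xcalloc((size_t)parent->len + 1, sizeof(*child.chain));
    child.len = 0;
    for (i = 0; i < parent->len; i++) {
        struct toy_anon_vma *av = parent->chain[i];

        if (skip_unused && av->users == 0)
            continue;  /* proposed fix: do not inherit unused levels */
        av->links++;
        child.chain[child.len++] = av;
    }
    child.chain[child.len++] = av_new();  /* the child's own anon_vma */
    return child;
}

/* Drop the process; an anon_vma is freed when its last link goes away. */
static void toy_exit(struct toy_proc *p)
{
    int i;

    p->chain[p->len - 1]->users--;  /* last entry is the VMA's own anon_vma */
    for (i = 0; i < p->len; i++) {
        if (--p->chain[i]->links == 0) {
            free(p->chain[i]);
            live_anon_vmas--;
        }
    }
    free(p->chain);
}

/*
 * Run "child becomes the next parent, old parent exits" for the given
 * number of generations and return how many anon_vmas are still
 * allocated while only the last process is alive.
 */
int simulate(int generations, int skip_unused)
{
    struct toy_proc cur;
    int g, live;

    live_anon_vmas = 0;
    cur.chain = xcalloc(1, sizeof(*cur.chain));
    cur.chain[0] = av_new();
    cur.len = 1;
    for (g = 0; g < generations; g++) {
        struct toy_proc child = toy_fork(&cur, skip_unused);

        toy_exit(&cur);
        cur = child;
    }
    live = live_anon_vmas;
    toy_exit(&cur);  /* clean up the last survivor */
    return live;
}
```

In this model simulate(n, 0) leaves n + 1 anon_vmas allocated for a
single surviving process, while simulate(n, 1) stays at 2 for any
n >= 1.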

> I was able to generate a simple test program that will cause this:
>
> ---
>
> #include <unistd.h>
>
> int main(int argc, char *argv[])
> {
>     pid_t pid;
>
>     while (1) {
>        pid = fork();
>        if (pid == -1) {
> 	 /* error */
> 	 return 1;
>        }
>        if (pid) {
> 	 /* parent */
> 	 sleep(2);
> 	 break;
>        }
>        else {
> 	 /* child */
> 	 sleep(1);
>        }
>     }
>     return 0;
> }
>
> ---
>
> In the actual program (running as a daemon), a child is reading data
> while its parent is processing the previously read data.  At any time
> there are only a few processes in existence, with older processes
> exiting and new processes being fork()ed.  Killing the program frees
> the slab usage.
>
> I patched the kernel to 3.0.40, but the problem remains.  I also
> compiled with slab debugging and can see that the growth of anon_vma
> and anon_vma_chain is due to anon_vma_clone/anon_vma_fork.
>
> Is this a known issue?  Is it fixed in a later release?
>
> Thanks,
>



* Re: Repeated fork() causes SLAB to grow without bound
  2012-08-16 18:58   ` Rik van Riel
@ 2012-08-18  0:03     ` Daniel Forrest
  -1 siblings, 0 replies; 75+ messages in thread
From: Daniel Forrest @ 2012-08-18  0:03 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, Hugh Dickins, linux-mm

On Thu, Aug 16, 2012 at 02:58:45PM -0400, Rik van Riel wrote:

> Oh dear.
> 
> Basically, what happens is that at fork time, a new
> "level" is created for the anon_vma hierarchy. This
> works great for normal forking daemons, since the
> parent process just keeps running, and forking off
> children.
> 
> Look at anon_vma_fork() in mm/rmap.c for the details.
> 
> Having each child become the new parent, and the
> previous parent exit, can result in an "infinite"
> stack of anon_vmas.
> 
> Now, the parent anon_vma we cannot get rid of,
> because that is where the anon_vma lock lives.
> 
> However, in your case you have many more anon_vma
> levels than you have processes!
> 
> I wonder if it may be possible to fix your bug
> by adding a refcount to the struct anon_vma,
> one count for each VMA that is directly attached
> to the anon_vma (ie. vma->anon_vma == anon_vma),
> and one for each page that points to the anon_vma.
> 
> If the reference count on an anon_vma reaches 0,
> we can skip that anon_vma in anon_vma_clone, and
> the child process should not get that anon_vma.
> 
> A scheme like that may be enough to avoid the trouble
> you are running into.
> 
> Does this sound realistic?

Based on your comments, I came up with the following patch.  It boots
and the anon_vma/anon_vma_chain SLAB usage is stable, but I don't know
if I've overlooked something.  I'm not a kernel hacker.


--- include/linux/rmap.h.ORIG	2011-08-05 04:59:21.000000000 +0000
+++ include/linux/rmap.h	2012-08-16 22:52:25.000000000 +0000
@@ -35,6 +35,7 @@ struct anon_vma {
 	 * anon_vma if they are the last user on release
 	 */
 	atomic_t refcount;
+	atomic_t pagecount;
 
 	/*
 	 * NOTE: the LSB of the head.next is set by
--- mm/rmap.c.ORIG	2011-08-05 04:59:21.000000000 +0000
+++ mm/rmap.c	2012-08-17 23:55:13.000000000 +0000
@@ -85,6 +85,7 @@ static inline struct anon_vma *anon_vma_
 static inline void anon_vma_free(struct anon_vma *anon_vma)
 {
 	VM_BUG_ON(atomic_read(&anon_vma->refcount));
+	VM_BUG_ON(atomic_read(&anon_vma->pagecount));
 
 	/*
 	 * Synchronize against page_lock_anon_vma() such that
@@ -176,6 +177,7 @@ int anon_vma_prepare(struct vm_area_stru
 		spin_lock(&mm->page_table_lock);
 		if (likely(!vma->anon_vma)) {
 			vma->anon_vma = anon_vma;
+			atomic_inc(&anon_vma->pagecount);
 			avc->anon_vma = anon_vma;
 			avc->vma = vma;
 			list_add(&avc->same_vma, &vma->anon_vma_chain);
@@ -262,7 +264,10 @@ int anon_vma_clone(struct vm_area_struct
 		}
 		anon_vma = pavc->anon_vma;
 		root = lock_anon_vma_root(root, anon_vma);
-		anon_vma_chain_link(dst, avc, anon_vma);
+		if (!atomic_read(&anon_vma->pagecount))
+			anon_vma_chain_free(avc);
+		else
+			anon_vma_chain_link(dst, avc, anon_vma);
 	}
 	unlock_anon_vma_root(root);
 	return 0;
@@ -314,6 +319,7 @@ int anon_vma_fork(struct vm_area_struct
 	get_anon_vma(anon_vma->root);
 	/* Mark this anon_vma as the one where our new (COWed) pages go. */
 	vma->anon_vma = anon_vma;
+	atomic_set(&anon_vma->pagecount, 1);
 	anon_vma_lock(anon_vma);
 	anon_vma_chain_link(vma, avc, anon_vma);
 	anon_vma_unlock(anon_vma);
@@ -341,6 +347,8 @@ void unlink_anon_vmas(struct vm_area_str
 
 		root = lock_anon_vma_root(root, anon_vma);
 		list_del(&avc->same_anon_vma);
+		if (vma->anon_vma == anon_vma)
+			atomic_dec(&anon_vma->pagecount);
 
 		/*
 		 * Leave empty anon_vmas on the list - we'll need
@@ -375,6 +383,7 @@ static void anon_vma_ctor(void *data)
 
 	mutex_init(&anon_vma->mutex);
 	atomic_set(&anon_vma->refcount, 0);
+	atomic_set(&anon_vma->pagecount, 0);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
 
@@ -996,6 +1005,7 @@ static void __page_set_anon_rmap(struct
 	if (!exclusive)
 		anon_vma = anon_vma->root;
 
+	atomic_inc(&anon_vma->pagecount);
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
 	page->mapping = (struct address_space *) anon_vma;
 	page->index = linear_page_index(vma, address);
@@ -1142,6 +1152,11 @@ void page_remove_rmap(struct page *page)
 	if (unlikely(PageHuge(page)))
 		return;
 	if (PageAnon(page)) {
+		struct anon_vma *anon_vma;
+
+		anon_vma = page_anon_vma(page);
+		if (anon_vma)
+			atomic_dec(&anon_vma->pagecount);
 		mem_cgroup_uncharge_page(page);
 		if (!PageTransHuge(page))
 			__dec_zone_page_state(page, NR_ANON_PAGES);
@@ -1747,6 +1762,7 @@ static void __hugepage_set_anon_rmap(str
 	if (!exclusive)
 		anon_vma = anon_vma->root;
 
+	atomic_inc(&anon_vma->pagecount);
 	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
 	page->mapping = (struct address_space *) anon_vma;
 	page->index = linear_page_index(vma, address);

-- 
Daniel K. Forrest		Space Science and
dan.forrest@ssec.wisc.edu	Engineering Center
(608) 890 - 0558		University of Wisconsin, Madison

* Re: Repeated fork() causes SLAB to grow without bound
  2012-08-18  0:03     ` Daniel Forrest
@ 2012-08-18  3:46       ` Rik van Riel
  -1 siblings, 0 replies; 75+ messages in thread
From: Rik van Riel @ 2012-08-18  3:46 UTC (permalink / raw)
  To: linux-kernel, Hugh Dickins, linux-mm

On 08/17/2012 08:03 PM, Daniel Forrest wrote:

> Based on your comments, I came up with the following patch.  It boots
> and the anon_vma/anon_vma_chain SLAB usage is stable, but I don't know
> if I've overlooked something.  I'm not a kernel hacker.

The patch looks reasonable to me.  There is one spot left
for optimization, which I have pointed out below.

Of course, that leaves the big question: do we want the
overhead of having the atomic addition and decrement for
every anonymous memory page, or is it easier to fix this
issue in userspace?

Given that malicious userspace could potentially run the
system out of memory, without needing special privileges,
and the OOM killer may not be able to reclaim it due to
internal slab fragmentation, I guess this issue could be
classified as a low impact denial of service vulnerability.

Furthermore, there is already a fair amount of bookkeeping
being done in the rmap code, so this patch is not likely
to add a whole lot - some testing might be useful, though.

> @@ -262,7 +264,10 @@ int anon_vma_clone(struct vm_area_struct
>   		}
>   		anon_vma = pavc->anon_vma;
>   		root = lock_anon_vma_root(root, anon_vma);
> -		anon_vma_chain_link(dst, avc, anon_vma);
> +		if (!atomic_read(&anon_vma->pagecount))
> +			anon_vma_chain_free(avc);
> +		else
> +			anon_vma_chain_link(dst, avc, anon_vma);
>   	}
>   	unlock_anon_vma_root(root);
>   	return 0;

In this function, you can do the test before the code block
where we try to allocate an anon_vma chain.

In other words:

	list_for_each_entry_reverse(.....
	struct anon_vma *anon_vma;

+	if (!atomic_read(&anon_vma->pagecount))
+		continue;
+
	avc = anon_vma_chain_alloc(...
	if (unlikely(!avc)) {

The rest looks good.
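
As a rough userspace illustration of the difference between the two
orderings (this is not kernel code; toy_clone() and toy_stats are
names invented here, and malloc() merely stands in for
anon_vma_chain_alloc()):

```c
#include <assert.h>
#include <stdlib.h>

struct toy_stats {
    int allocs;  /* chain entries allocated during the walk */
    int linked;  /* chain entries that were actually kept   */
};

/*
 * Walk n parent anon_vmas whose page counts are given in pagecount[].
 * test_first == 0: allocate, then free the entry if the count is zero.
 * test_first == 1: check the count before allocating anything.
 */
struct toy_stats toy_clone(const int *pagecount, int n, int test_first)
{
    struct toy_stats st = { 0, 0 };
    int i;

    for (i = 0; i < n; i++) {
        void *avc;

        if (test_first && pagecount[i] == 0)
            continue;            /* skip before allocating */

        avc = malloc(16);        /* stand-in for the chain allocation */
        if (!avc)
            abort();
        st.allocs++;

        if (pagecount[i] == 0) {
            free(avc);           /* allocated for nothing, discard */
            continue;
        }
        st.linked++;
        free(avc);               /* toy model: the link is not kept */
    }
    return st;
}
```

Both orderings link the same entries; testing first only saves the
allocations that would have been thrown away.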

-- 
All rights reversed

* Re: Repeated fork() causes SLAB to grow without bound
  2012-08-18  3:46       ` Rik van Riel
@ 2012-08-18  4:07         ` Daniel Forrest
  -1 siblings, 0 replies; 75+ messages in thread
From: Daniel Forrest @ 2012-08-18  4:07 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, Hugh Dickins, linux-mm

On Fri, Aug 17, 2012 at 11:46:18PM -0400, Rik van Riel wrote:

> On 08/17/2012 08:03 PM, Daniel Forrest wrote:
> 
> >Based on your comments, I came up with the following patch.  It boots
> >and the anon_vma/anon_vma_chain SLAB usage is stable, but I don't know
> >if I've overlooked something.  I'm not a kernel hacker.
> 
> The patch looks reasonable to me.  There is one spot left
> for optimization, which I have pointed out below.
> 
> Of course, that leaves the big question: do we want the
> overhead of having the atomic addition and decrement for
> every anonymous memory page, or is it easier to fix this
> issue in userspace?
> 
> Given that malicious userspace could potentially run the
> system out of memory, without needing special privileges,
> and the OOM killer may not be able to reclaim it due to
> internal slab fragmentation, I guess this issue could be
> classified as a low impact denial of service vulnerability.
> 
> Furthermore, there is already a fair amount of bookkeeping
> being done in the rmap code, so this patch is not likely
> to add a whole lot - some testing might be useful, though.
> 
> >@@ -262,7 +264,10 @@ int anon_vma_clone(struct vm_area_struct
> >  		}
> >  		anon_vma = pavc->anon_vma;
> >  		root = lock_anon_vma_root(root, anon_vma);
> >-		anon_vma_chain_link(dst, avc, anon_vma);
> >+		if (!atomic_read(&anon_vma->pagecount))
> >+			anon_vma_chain_free(avc);
> >+		else
> >+			anon_vma_chain_link(dst, avc, anon_vma);
> >  	}
> >  	unlock_anon_vma_root(root);
> >  	return 0;
> 
> In this function, you can do the test before the code block
> where we try to allocate an anon_vma chain.
> 
> In other words:
> 
> 	list_for_each_entry_reverse(.....
> 	struct anon_vma *anon_vma;
> 
> +	if (!atomic_read(&anon_vma->pagecount))
> +		continue;
> +
> 	avc = anon_vma_chain_alloc(...
> 	if (unlikely(!avc)) {
> 
> The rest looks good.

I was being careful since I wasn't certain about the locking.  Does
the test need to be protected by "lock_anon_vma_root"?  That's why I
chose the overhead of the possible wasted "anon_vma_chain_alloc".

-- 
Daniel K. Forrest		Space Science and
dan.forrest@ssec.wisc.edu	Engineering Center
(608) 890 - 0558		University of Wisconsin, Madison

* Re: Repeated fork() causes SLAB to grow without bound
  2012-08-18  4:07         ` Daniel Forrest
@ 2012-08-18  4:10           ` Rik van Riel
  -1 siblings, 0 replies; 75+ messages in thread
From: Rik van Riel @ 2012-08-18  4:10 UTC (permalink / raw)
  To: linux-kernel, Hugh Dickins, linux-mm

On 08/18/2012 12:07 AM, Daniel Forrest wrote:

> I was being careful since I wasn't certain about the locking.  Does
> the test need to be protected by "lock_anon_vma_root"?  That's why I
> chose the overhead of the possible wasted "anon_vma_chain_alloc".

The function anon_vma_clone is being called from fork().

While fork() is running, the kernel holds mm->mmap_sem for
write, which prevents page faults by the parent process.
This means that if the anon_vma in question belongs to the
parent process, no new pages will be added to it during this time.

Likewise, if the anon_vma belonged to a grandparent process,
any new pages instantiated in it will not be visible to the
parent process, or to the newly created process. This means
it is safe to skip the anon_vma.

-- 
All rights reversed

* Re: Repeated fork() causes SLAB to grow without bound
  2012-08-18  3:46       ` Rik van Riel
@ 2012-08-20  8:00         ` Hugh Dickins
  -1 siblings, 0 replies; 75+ messages in thread
From: Hugh Dickins @ 2012-08-20  8:00 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Daniel Forrest, Andrea Arcangeli, Michel Lespinasse,
	linux-kernel, linux-mm

On Fri, 17 Aug 2012, Rik van Riel wrote:
> On 08/17/2012 08:03 PM, Daniel Forrest wrote:
> 
> > Based on your comments, I came up with the following patch.  It boots
> > and the anon_vma/anon_vma_chain SLAB usage is stable, but I don't know
> > if I've overlooked something.  I'm not a kernel hacker.
> 
> The patch looks reasonable to me.  There is one spot left
> for optimization, which I have pointed out below.
> 
> Of course, that leaves the big question: do we want the
> overhead of having the atomic addition and decrement for
> every anonymous memory page, or is it easier to fix this
> issue in userspace?

I've not given any thought to alternatives, and I've not done any
performance analysis; but my instinct says that we really do not
want another atomic increment and decrement (and another cache
line redirtied) for every single page mapped.

One of the things I've often admired about Andrea's anon_vma design
was the way it did not need a refcount; and although we later added
one for KSM and migration, that scarcely mattered, because it was
for exceptional circumstances, and not per page.

May I dare to think: what if we just backed out all the anon_vma_chain
complexity, and returned to the simple anon_vma list we had in 2.6.33?

Just how realistic was the workload which led you to anon_vma_chains?
And isn't it correct to say that the performance evaluation was made
while believing that each anon_vma->lock was useful, before the sad
realization that anon_vma->root->lock (or ->mutex) had to be used?

I've Cc'ed Michel, because I think he has plans (or at least hopes) for
the anon_vmas, in his relentless pursuit of world domination by rbtree.

Hugh

> 
> Given that malicious userspace could potentially run the
> system out of memory, without needing special privileges,
> and the OOM killer may not be able to reclaim it due to
> internal slab fragmentation, I guess this issue could be
> classified as a low impact denial of service vulnerability.
> 
> Furthermore, there is already a fair amount of bookkeeping
> being done in the rmap code, so this patch is not likely
> to add a whole lot - some testing might be useful, though.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Repeated fork() causes SLAB to grow without bound
  2012-08-20  8:00         ` Hugh Dickins
@ 2012-08-20  9:39           ` Michel Lespinasse
  -1 siblings, 0 replies; 75+ messages in thread
From: Michel Lespinasse @ 2012-08-20  9:39 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Rik van Riel, Daniel Forrest, Andrea Arcangeli, linux-kernel, linux-mm

On Mon, Aug 20, 2012 at 1:00 AM, Hugh Dickins <hughd@google.com> wrote:
> On Fri, 17 Aug 2012, Rik van Riel wrote:
>> Of course, that leaves the big question: do we want the
>> overhead of having the atomic addition and decrement for
>> every anonymous memory page, or is it easier to fix this
>> issue in userspace?
>
> I've not given any thought to alternatives, and I've not done any
> performance analysis; but my instinct says that we really do not
> want another atomic increment and decrement (and another cache
> line redirtied) for every single page mapped.

I am concerned about this as well.

> May I dare to think: what if we just backed out all the anon_vma_chain
> complexity, and returned to the simple anon_vma list we had in 2.6.33?
>
> Just how realistic was the workload which led you to anon_vma_chains?
> And isn't it correct to say that the performance evaluation was made
> while believing that each anon_vma->lock was useful, before the sad
> realization that anon_vma->root->lock (or ->mutex) had to be used?

Thanks for suggesting this - I certainly wish we could go that way. I
suspect there will be a strong case against this, but I'd certainly
like to hear it (and see if it can be addressed another way).

Here we just don't have processes that fork a lot of children that
don't immediately exec, so anon_vmas don't bring any value for us.

> I've Cc'ed Michel, because I think he has plans (or at least hopes) for
> the anon_vmas, in his relentless pursuit of world domination by rbtree.

Unfortunately I don't have great ideas there.

It would be easy to add a flag to track if an anon_vma has ever been
referenced by a struct page, and not clone the anon_vma if the flag
isn't set. But this wouldn't help at all with the DoS potential here.

If there are pages referencing the anon_vma, we could reassign these
to the parent anon_vma, but finding all such pages would be expensive
too.

Instead of adding an atomic count for page references, we could limit
the anon_vma stacking depth. In fork, we would only clone anon_vmas
that have a low enough generation count. I think that's not great
(adds a special case for the deep-fork-without-exec behavior), but
still better than the atomic page reference counter.

I would still prefer if we could just remove the anon_vma_chain stuff, though.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Repeated fork() causes SLAB to grow without bound
  2012-08-20  9:39           ` Michel Lespinasse
@ 2012-08-20 11:11             ` Andi Kleen
  -1 siblings, 0 replies; 75+ messages in thread
From: Andi Kleen @ 2012-08-20 11:11 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Hugh Dickins, Rik van Riel, Daniel Forrest, Andrea Arcangeli,
	linux-kernel, linux-mm

Michel Lespinasse <walken@google.com> writes:
>
> I would still prefer if we could just remove the anon_vma_chain stuff, though.

Would probably help with the fork locking problems too. 
We never really recovered from that regression.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Repeated fork() causes SLAB to grow without bound
  2012-08-20  9:39           ` Michel Lespinasse
@ 2012-08-20 11:17             ` Rik van Riel
  -1 siblings, 0 replies; 75+ messages in thread
From: Rik van Riel @ 2012-08-20 11:17 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Hugh Dickins, Daniel Forrest, Andrea Arcangeli, linux-kernel, linux-mm

On 08/20/2012 05:39 AM, Michel Lespinasse wrote:

> I would still prefer if we could just remove the anon_vma_chain stuff, though.

If only we could.

That simply replaces a medium issue at fork time, with the
potential for a catastrophic issue at page reclaim time,
in any workload with heavily forking server software.

Without the anon_vma_chains, we end up scanning every single
one of the child processes (and the parent) for every COWed
page, which can be a real issue when the VM runs into 1000
such pages, for 1000 child processes.

Unfortunately, we have seen this happen...
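
A rough sketch of the arithmetic behind that example, assuming the
children do not fork further and each COWed page belongs to exactly one
child. The function names are made up for illustration; the kernel does
no such calculation.

```c
/*
 * VMAs the reverse mapping walk visits when parent and children all
 * hang off one shared anon_vma: each page is checked against the
 * parent's VMA plus every child's VMA.
 */
static unsigned long scans_shared_anon_vma(unsigned long pages,
					   unsigned long children)
{
	return pages * (children + 1);
}

/*
 * VMAs visited when each child has its own anon_vma and the page was
 * COWed in that child: only that child's VMA is on the list.
 */
static unsigned long scans_per_child_anon_vma(unsigned long pages)
{
	return pages;
}
```

For 1000 pages and 1000 children the shared case comes to 1,001,000 VMA
visits, against 1000 with per-child anon_vmas.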

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Repeated fork() causes SLAB to grow without bound
  2012-08-20 11:17             ` Rik van Riel
@ 2012-08-20 11:53               ` Michel Lespinasse
  -1 siblings, 0 replies; 75+ messages in thread
From: Michel Lespinasse @ 2012-08-20 11:53 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Hugh Dickins, Daniel Forrest, Andrea Arcangeli, linux-kernel, linux-mm

On Mon, Aug 20, 2012 at 4:17 AM, Rik van Riel <riel@redhat.com> wrote:
> Without the anon_vma_chains, we end up scanning every single
> one of the child processes (and the parent) for every COWed
> page, which can be a real issue when the VM runs into 1000
> such pages, for 1000 child processes.
>
> Unfortunately, we have seen this happen...

Well, it only happens if the vma is created in the parent, and the
first anon write also happens in the parent. I suppose that's a
legitimate thing to do in a forking server though - say, for an
expensive initialization stage, or precomputing some table, or
whatever.

When fork happens after the first anon page has been created, the
child VMA currently ends up being added to the parent's anon_vma -
even if the child might never create new anon pages into that VMA.

I wonder if it might help to add the child VMA onto the parent's
anon_vma only at the first child COW event. That way it would at least
be possible (with userspace changes) for any forking servers to
separate the areas they want to write into from the parent (such as
things that need expensive initialization), from the ones that they
want to write into from the child, and have none of the anon_vma lists
grow too large.

This might still be impractical if one has too many such workloads to
care about. I'm just not sure how prevalent the problem workloads are.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: Repeated fork() causes SLAB to grow without bound
  2012-08-20 11:53               ` Michel Lespinasse
@ 2012-08-20 19:11                 ` Michel Lespinasse
  -1 siblings, 0 replies; 75+ messages in thread
From: Michel Lespinasse @ 2012-08-20 19:11 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Hugh Dickins, Daniel Forrest, Andrea Arcangeli, linux-kernel, linux-mm

On Mon, Aug 20, 2012 at 4:53 AM, Michel Lespinasse <walken@google.com> wrote:
> I wonder if it might help to add the child VMA onto the parent's
> anon_vma only at the first child COW event. That way it would at least
> be possible (with userspace changes) for any forking servers to
> separate the areas they want to write into from the parent (such as
> things that need expensive initialization), from the ones that they
> want to write into from the child, and have none of the anon_vma lists
> grow too large.

Actually that wouldn't work. The parent's anon pages are visible from
the child, so the child vma needs to be on the parent anon_vma list.
Sorry for the noise :/

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [RFC PATCH] Re: Repeated fork() causes SLAB to grow without bound
  2012-08-20  9:39           ` Michel Lespinasse
@ 2012-08-22  3:20             ` Michel Lespinasse
  -1 siblings, 0 replies; 75+ messages in thread
From: Michel Lespinasse @ 2012-08-22  3:20 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Rik van Riel, Daniel Forrest, Andrea Arcangeli, Andrew Morton,
	linux-kernel, linux-mm

On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote:
> Instead of adding an atomic count for page references, we could limit
> the anon_vma stacking depth. In fork, we would only clone anon_vmas
> that have a low enough generation count. I think that's not great
> (adds a special case for the deep-fork-without-exec behavior), but
> still better than the atomic page reference counter.

Here is an attached patch to demonstrate the idea.

anon_vma_clone() is modified to return the length of the existing same_vma
anon vma chain, and we create a new anon_vma in the child only on the first
fork (this could be tweaked to allow up to a set number of forks, but
I think the first fork would cover all the common forking server cases).

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/mmap.c |    6 +++---
 mm/rmap.c |   18 ++++++++++++++----
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 3edfcdfa42d9..e14b19a838cb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -539,7 +539,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 		 * shrinking vma had, to cover any anon pages imported.
 		 */
 		if (exporter && exporter->anon_vma && !importer->anon_vma) {
-			if (anon_vma_clone(importer, exporter))
+			if (anon_vma_clone(importer, exporter) < 0)
 				return -ENOMEM;
 			importer->anon_vma = exporter->anon_vma;
 		}
@@ -1988,7 +1988,7 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	}
 	vma_set_policy(new, pol);
 
-	if (anon_vma_clone(new, vma))
+	if (anon_vma_clone(new, vma) < 0)
 		goto out_free_mpol;
 
 	if (new->vm_file) {
@@ -2409,7 +2409,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			if (IS_ERR(pol))
 				goto out_free_vma;
 			INIT_LIST_HEAD(&new_vma->anon_vma_chain);
-			if (anon_vma_clone(new_vma, vma))
+			if (anon_vma_clone(new_vma, vma) < 0)
 				goto out_free_mempol;
 			vma_set_policy(new_vma, pol);
 			new_vma->vm_start = addr;
diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cda2a24..ba8a726aaee6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -238,12 +238,13 @@ static inline void unlock_anon_vma_root(struct anon_vma *root)
 
 /*
  * Attach the anon_vmas from src to dst.
- * Returns 0 on success, -ENOMEM on failure.
+ * Returns length of the anon_vma chain on success, -ENOMEM on failure.
  */
 int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 {
 	struct anon_vma_chain *avc, *pavc;
 	struct anon_vma *root = NULL;
+	int length = 0;
 
 	list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
 		struct anon_vma *anon_vma;
@@ -259,9 +260,10 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 		anon_vma = pavc->anon_vma;
 		root = lock_anon_vma_root(root, anon_vma);
 		anon_vma_chain_link(dst, avc, anon_vma);
+		length++;
 	}
 	unlock_anon_vma_root(root);
-	return 0;
+	return length;
 
  enomem_failure:
 	unlink_anon_vmas(dst);
@@ -322,6 +324,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 {
 	struct anon_vma_chain *avc;
 	struct anon_vma *anon_vma;
+	int length;
 
 	/* Don't bother if the parent process has no anon_vma here. */
 	if (!pvma->anon_vma)
@@ -331,10 +334,17 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	 * First, attach the new VMA to the parent VMA's anon_vmas,
 	 * so rmap can find non-COWed pages in child processes.
 	 */
-	if (anon_vma_clone(vma, pvma))
+	length = anon_vma_clone(vma, pvma);
+	if (length < 0)
 		return -ENOMEM;
+	else if (length > 1)
+		return 0;
 
-	/* Then add our own anon_vma. */
+	/*
+	 * Then add our own anon_vma. We do this only on the first fork after
+	 * the anon_vma is created, as we don't want the same_vma chain to
+	 * grow arbitrarily large.
+	 */
 	anon_vma = anon_vma_alloc();
 	if (!anon_vma)
 		goto out_error;
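
The effect of the length check can be modelled in user space. The sketch
below tracks only the same_vma chain length of a single VMA across
successive forks, assuming each child forks the next one as in the
original test program. chain_length_after() and its threshold parameter
are hypothetical; threshold 1 corresponds to the "length > 1" test in
the patch, and INT_MAX stands for the unpatched behaviour.

```c
#include <limits.h>

/*
 * Length of the same_vma chain on one VMA after `generations`
 * successive forks, starting from a process with a single anon_vma.
 */
static int chain_length_after(int generations, int threshold)
{
	int len = 1;	/* the original process: one anon_vma */

	for (int g = 0; g < generations; g++) {
		/* anon_vma_clone() links one entry per parent entry. */
		int cloned = len;

		if (cloned > threshold)
			len = cloned;		/* reuse, no new anon_vma */
		else
			len = cloned + 1;	/* add the child's own anon_vma */
	}
	return len;
}
```

In this model chain_length_after(1000, INT_MAX) is 1001, growing with
every generation, while chain_length_after(1000, 1) stays at 2.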

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH] Re: Repeated fork() causes SLAB to grow without bound
  2012-08-22  3:20             ` Michel Lespinasse
@ 2012-08-22  3:29               ` Rik van Riel
  -1 siblings, 0 replies; 75+ messages in thread
From: Rik van Riel @ 2012-08-22  3:29 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Hugh Dickins, Daniel Forrest, Andrea Arcangeli, Andrew Morton,
	linux-kernel, linux-mm

On 08/21/2012 11:20 PM, Michel Lespinasse wrote:
> On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote:
>> Instead of adding an atomic count for page references, we could limit
>> the anon_vma stacking depth. In fork, we would only clone anon_vmas
>> that have a low enough generation count. I think that's not great
>> (adds a special case for the deep-fork-without-exec behavior), but
>> still better than the atomic page reference counter.
>
> Here is an attached patch to demonstrate the idea.
>
> anon_vma_clone() is modified to return the length of the existing same_vma
> anon vma chain, and we create a new anon_vma in the child only on the first
> fork (this could be tweaked to allow up to a set number of forks, but
> I think the first fork would cover all the common forking server cases).

I suspect we need 2 or 3.

Some forking servers first fork off one child, and have
the original parent exit, in order to "background the server".
That first child then becomes the parent to the real child
processes that do the work.

It is conceivable that we might need an extra level for
processes that do something special with privilege dropping,
namespace changing, etc...

Even setting the threshold to 5 should be totally harmless,
since the problem does not kick in until we have really
long chains, like in Dan's bug report.
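
A small user-space model of why the threshold matters for a server that
backgrounds itself. It assumes fork levels are counted from a process
holding a single anon_vma (level 1 is its child, level 2 the
grandchild), and that a child gets its own anon_vma only while the
cloned chain length is at or below the threshold.
gets_own_anon_vma() is a made-up name.

```c
#include <stdbool.h>

/*
 * Does a process created at `fork_level` get its own anon_vma when
 * new anon_vmas are allocated only for cloned chains of length
 * <= threshold?  Level 0 is the original process.
 */
static bool gets_own_anon_vma(int fork_level, int threshold)
{
	int len = 1;		/* chain length in the original process */
	bool own = true;	/* the original process owns its anon_vma */

	for (int level = 1; level <= fork_level; level++) {
		own = (len <= threshold);
		if (own)
			len++;
	}
	return own;
}
```

With threshold 1 the workers of a backgrounded server (level 2) share
the daemon's anon_vma; with threshold 2 they each get their own, and
threshold 3 leaves room for one more intermediate fork.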

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH] Re: Repeated fork() causes SLAB to grow without bound
  2012-08-22  3:29               ` Rik van Riel
@ 2013-06-03 19:50                 ` Daniel Forrest
  -1 siblings, 0 replies; 75+ messages in thread
From: Daniel Forrest @ 2013-06-03 19:50 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Michel Lespinasse, Hugh Dickins, Andrea Arcangeli, Andrew Morton,
	linux-kernel, linux-mm

On Tue, Aug 21, 2012 at 11:29:54PM -0400, Rik van Riel wrote:
> On 08/21/2012 11:20 PM, Michel Lespinasse wrote:
> >On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote:
> >>Instead of adding an atomic count for page references, we could limit
> >>the anon_vma stacking depth. In fork, we would only clone anon_vmas
> >>that have a low enough generation count. I think that's not great
> >>(adds a special case for the deep-fork-without-exec behavior), but
> >>still better than the atomic page reference counter.
> >
> >Here is an attached patch to demonstrate the idea.
> >
> >anon_vma_clone() is modified to return the length of the existing same_vma
> >anon vma chain, and we create a new anon_vma in the child only on the first
> >fork (this could be tweaked to allow up to a set number of forks, but
> >I think the first fork would cover all the common forking server cases).
> 
> I suspect we need 2 or 3.
> 
> Some forking servers first fork off one child, and have
> the original parent exit, in order to "background the server".
> That first child then becomes the parent to the real child
> processes that do the work.
> 
> It is conceivable that we might need an extra level for
> processes that do something special with privilege dropping,
> namespace changing, etc...
> 
> Even setting the threshold to 5 should be totally harmless,
> since the problem does not kick in until we have really
> long chains, like in Dan's bug report.

I have been running with Michel's patch (with the threshold set to 5)
for quite a few months now and can confirm that it does indeed solve
my problem.  I am not a kernel developer, so I would appreciate it if
one of you could push this into the kernel tree.

NOTE: I have attached Michel's patch with "(length > 1)" modified to
"(length > 5)" and added a "Tested-by:".

---

On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote:
> Instead of adding an atomic count for page references, we could limit
> the anon_vma stacking depth. In fork, we would only clone anon_vmas
> that have a low enough generation count. I think that's not great
> (adds a special case for the deep-fork-without-exec behavior), but
> still better than the atomic page reference counter.

Here is an attached patch to demonstrate the idea.

anon_vma_clone() is modified to return the length of the existing same_vma
anon vma chain, and we create a new anon_vma in the child only on the first
fork (this could be tweaked to allow up to a set number of forks, but
I think the first fork would cover all the common forking server cases).

Signed-off-by: Michel Lespinasse <walken@google.com>
Tested-by: Daniel Forrest <dan.forrest@ssec.wisc.edu>
---
 mm/mmap.c |    6 +++---
 mm/rmap.c |   18 ++++++++++++++----
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 3edfcdfa42d9..e14b19a838cb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -539,7 +539,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 		 * shrinking vma had, to cover any anon pages imported.
 		 */
 		if (exporter && exporter->anon_vma && !importer->anon_vma) {
-			if (anon_vma_clone(importer, exporter))
+			if (anon_vma_clone(importer, exporter) < 0)
 				return -ENOMEM;
 			importer->anon_vma = exporter->anon_vma;
 		}
@@ -1988,7 +1988,7 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	}
 	vma_set_policy(new, pol);
 
-	if (anon_vma_clone(new, vma))
+	if (anon_vma_clone(new, vma) < 0)
 		goto out_free_mpol;
 
 	if (new->vm_file) {
@@ -2409,7 +2409,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			if (IS_ERR(pol))
 				goto out_free_vma;
 			INIT_LIST_HEAD(&new_vma->anon_vma_chain);
-			if (anon_vma_clone(new_vma, vma))
+			if (anon_vma_clone(new_vma, vma) < 0)
 				goto out_free_mempol;
 			vma_set_policy(new_vma, pol);
 			new_vma->vm_start = addr;
diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cda2a24..ba8a726aaee6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -238,12 +238,13 @@ static inline void unlock_anon_vma_root(struct anon_vma *root)
 
 /*
  * Attach the anon_vmas from src to dst.
- * Returns 0 on success, -ENOMEM on failure.
+ * Returns length of the anon_vma chain on success, -ENOMEM on failure.
  */
 int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 {
 	struct anon_vma_chain *avc, *pavc;
 	struct anon_vma *root = NULL;
+	int length = 0;
 
 	list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
 		struct anon_vma *anon_vma;
@@ -259,9 +260,10 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 		anon_vma = pavc->anon_vma;
 		root = lock_anon_vma_root(root, anon_vma);
 		anon_vma_chain_link(dst, avc, anon_vma);
+		length++;
 	}
 	unlock_anon_vma_root(root);
-	return 0;
+	return length;
 
  enomem_failure:
 	unlink_anon_vmas(dst);
@@ -322,6 +324,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 {
 	struct anon_vma_chain *avc;
 	struct anon_vma *anon_vma;
+	int length;
 
 	/* Don't bother if the parent process has no anon_vma here. */
 	if (!pvma->anon_vma)
@@ -331,10 +334,17 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	 * First, attach the new VMA to the parent VMA's anon_vmas,
 	 * so rmap can find non-COWed pages in child processes.
 	 */
-	if (anon_vma_clone(vma, pvma))
+	length = anon_vma_clone(vma, pvma);
+	if (length < 0)
 		return -ENOMEM;
+	else if (length > 5)
+		return 0;
 
-	/* Then add our own anon_vma. */
+	/*
+	 * Then add our own anon_vma. We do this only on the first fork after
+	 * the anon_vma is created, as we don't want the same_vma chain to
+	 * grow arbitrarily large.
+	 */
 	anon_vma = anon_vma_alloc();
 	if (!anon_vma)
 		goto out_error;
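
The mm/mmap.c hunks above follow from the changed return convention:
once anon_vma_clone() reports the chain length on success, a caller that
still tests for "non-zero" would treat every successful clone of a
non-empty chain as a failure. The sketch below shows the difference in
plain userspace C. clone_chain() is a hypothetical stand-in for the
patched function, not the kernel implementation.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Stand-in for the patched anon_vma_clone(): returns the chain length
 * on success and -ENOMEM on failure. 'fail' simulates an allocation
 * failure.
 */
static int clone_chain(int src_length, int fail)
{
	if (fail)
		return -ENOMEM;
	return src_length;
}

/* Old-style caller: any non-zero return is taken as an error. */
static int caller_old_check(int src_length, int fail)
{
	if (clone_chain(src_length, fail))
		return -ENOMEM;
	return 0;
}

/* Updated caller: only a negative return is an error. */
static int caller_new_check(int src_length, int fail)
{
	if (clone_chain(src_length, fail) < 0)
		return -ENOMEM;
	return 0;
}

#ifdef CLONE_CHECK_DEMO	/* build with -DCLONE_CHECK_DEMO to run the demo */
int main(void)
{
	printf("old check, chain of 3, no failure: %d\n",
	       caller_old_check(3, 0));
	printf("new check, chain of 3, no failure: %d\n",
	       caller_new_check(3, 0));
	return 0;
}
#endif
```

With a chain of three entries and no allocation failure, the old-style
check reports -ENOMEM while the updated check returns 0, which is why
every caller has to move to the "< 0" test in the same patch.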

-- 
Daniel K. Forrest		Space Science and
dan.forrest@ssec.wisc.edu	Engineering Center
(608) 890 - 0558		University of Wisconsin, Madison

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH] Re: Repeated fork() causes SLAB to grow without bound
  2013-06-03 19:50                 ` Daniel Forrest
@ 2013-06-04 10:37                   ` Rik van Riel
  -1 siblings, 0 replies; 75+ messages in thread
From: Rik van Riel @ 2013-06-04 10:37 UTC (permalink / raw)
  To: Michel Lespinasse, Hugh Dickins, Andrea Arcangeli, Andrew Morton,
	linux-kernel, linux-mm

On 06/03/2013 03:50 PM, Daniel Forrest wrote:
> On Tue, Aug 21, 2012 at 11:29:54PM -0400, Rik van Riel wrote:
>> On 08/21/2012 11:20 PM, Michel Lespinasse wrote:
>>> On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote:
>>>> Instead of adding an atomic count for page references, we could limit
>>>> the anon_vma stacking depth. In fork, we would only clone anon_vmas
>>>> that have a low enough generation count. I think that's not great
>>>> (adds a special case for the deep-fork-without-exec behavior), but
>>>> still better than the atomic page reference counter.
>>>
>>> Here is an attached patch to demonstrate the idea.
>>>
>>> anon_vma_clone() is modified to return the length of the existing same_vma
>>> anon vma chain, and we create a new anon_vma in the child only on the first
>>> fork (this could be tweaked to allow up to a set number of forks, but
>>> I think the first fork would cover all the common forking server cases).
>>
>> I suspect we need 2 or 3.
>>
>> Some forking servers first fork off one child, and have
>> the original parent exit, in order to "background the server".
>> That first child then becomes the parent to the real child
>> processes that do the work.
>>
>> It is conceivable that we might need an extra level for
>> processes that do something special with privilege dropping,
>> namespace changing, etc...
>>
>> Even setting the threshold to 5 should be totally harmless,
>> since the problem does not kick in until we have really
>> long chains, like in Dan's bug report.
>
> I have been running with Michel's patch (with the threshold set to 5)
> for quite a few months now and can confirm that it does indeed solve
> my problem.  I am not a kernel developer, so I would appreciate if one
> of you could push this into the kernel tree.
>
> NOTE: I have attached Michel's patch with "(length > 1)" modified to
> "(length > 5)" and added a "Tested-by:".

Thank you for testing this.

I believe this code should go into the Linux kernel,
since it closes up what could be a denial of service
attack (albeit a local one) with the anon_vma code.

> On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote:
>> Instead of adding an atomic count for page references, we could limit
>> the anon_vma stacking depth. In fork, we would only clone anon_vmas
>> that have a low enough generation count. I think that's not great
>> (adds a special case for the deep-fork-without-exec behavior), but
>> still better than the atomic page reference counter.
>
> Here is an attached patch to demonstrate the idea.
>
> anon_vma_clone() is modified to return the length of the existing same_vma
> anon vma chain, and we create a new anon_vma in the child only on the first
> fork (this could be tweaked to allow up to a set number of forks, but
> I think the first fork would cover all the common forking server cases).
>
> Signed-off-by: Michel Lespinasse <walken@google.com>
> Tested-by: Daniel Forrest <dan.forrest@ssec.wisc.edu>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [RFC PATCH] Re: Repeated fork() causes SLAB to grow without bound
  2013-06-04 10:37                   ` Rik van Riel
@ 2013-06-05 14:02                     ` Andrea Arcangeli
  -1 siblings, 0 replies; 75+ messages in thread
From: Andrea Arcangeli @ 2013-06-05 14:02 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Michel Lespinasse, Hugh Dickins, Andrew Morton, linux-kernel, linux-mm

On Tue, Jun 04, 2013 at 06:37:25AM -0400, Rik van Riel wrote:
> On 06/03/2013 03:50 PM, Daniel Forrest wrote:
> > On Tue, Aug 21, 2012 at 11:29:54PM -0400, Rik van Riel wrote:
> >> On 08/21/2012 11:20 PM, Michel Lespinasse wrote:
> >>> On Mon, Aug 20, 2012 at 02:39:26AM -0700, Michel Lespinasse wrote:
> >>>> Instead of adding an atomic count for page references, we could limit
> >>>> the anon_vma stacking depth. In fork, we would only clone anon_vmas
> >>>> that have a low enough generation count. I think that's not great
> >>>> (adds a special case for the deep-fork-without-exec behavior), but
> >>>> still better than the atomic page reference counter.
> >>>
> >>> Here is an attached patch to demonstrate the idea.
> >>>
> >>> anon_vma_clone() is modified to return the length of the existing same_vma
> >>> anon vma chain, and we create a new anon_vma in the child only on the first
> >>> fork (this could be tweaked to allow up to a set number of forks, but
> >>> I think the first fork would cover all the common forking server cases).
> >>
> >> I suspect we need 2 or 3.
> >>
> >> Some forking servers first fork off one child, and have
> >> the original parent exit, in order to "background the server".
> >> That first child then becomes the parent to the real child
> >> processes that do the work.
> >>
> >> It is conceivable that we might need an extra level for
> >> processes that do something special with privilege dropping,
> >> namespace changing, etc...
> >>
> >> Even setting the threshold to 5 should be totally harmless,
> >> since the problem does not kick in until we have really
> >> long chains, like in Dan's bug report.
> >
> > I have been running with Michel's patch (with the threshold set to 5)
> > for quite a few months now and can confirm that it does indeed solve
> > my problem.  I am not a kernel developer, so I would appreciate if one
> > of you could push this into the kernel tree.
> >
> > NOTE: I have attached Michel's patch with "(length > 1)" modified to
> > "(length > 5)" and added a "Tested-by:".
> 
> Thank you for testing this.
> 
> I believe this code should go into the Linux kernel,
> since it closes up what could be a denial of service
> attack (albeit a local one) with the anonvma code.

Agreed. The only thing I don't like about this patch is the hardcoding
of the number 5: could we make it a variable tunable through sysfs/sysctl,
so that if some weird workload arises we have a knob to turn? It would
cost one cacheline during fork, so it doesn't look like excessive overhead.
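
As a rough way to see what such a knob would trade off, here is a toy
userspace estimate (not kernel code) for the fork loop from the original
report, where each process forks once and then exits. The limit
parameter stands in for the proposed tunable, and a negative value
models an unpatched kernel. The struct and function names are invented
for this sketch, and the assumption that each entry on the surviving
chain keeps one ancestor anon_vma allocated is a simplification.

```c
#include <stdio.h>

struct slab_estimate {
	long anon_vmas;		/* anon_vma objects still referenced */
	long chain_entries;	/* anon_vma_chain objects on the live VMAs */
};

/*
 * Toy model: only the newest process is alive. A child stops adding its
 * own anon_vma once the chain it cloned is longer than 'limit'; a
 * negative limit means no cap at all.
 */
static struct slab_estimate estimate_after(long generations, int limit,
					   long vmas)
{
	long length = 1;	/* chain length of one VMA in the first process */

	for (long g = 0; g < generations; g++) {
		if (limit < 0 || length <= limit)
			length++;
	}

	/* Each entry on the live chain pins one anon_vma of an ancestor. */
	struct slab_estimate est = { length * vmas, length * vmas };
	return est;
}

#ifdef SLAB_ESTIMATE_DEMO	/* build with -DSLAB_ESTIMATE_DEMO to run */
int main(void)
{
	struct slab_estimate a = estimate_after(1000, -1, 1);
	struct slab_estimate b = estimate_after(1000, 5, 1);

	printf("no limit: %ld anon_vmas per VMA\n", a.anon_vmas);
	printf("limit 5:  %ld anon_vmas per VMA\n", b.anon_vmas);
	return 0;
}
#endif
```

In this model the uncapped count grows by one object per VMA per
generation, while any non-negative limit caps it at limit + 1 per VMA,
so a tunable would mainly move that ceiling rather than change the shape
of the fix.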

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 75+ messages in thread

* [PATCH] Repeated fork() causes SLAB to grow without bound
  2013-06-03 19:50                 ` Daniel Forrest
@ 2014-11-14 16:30                   ` Daniel Forrest
  -1 siblings, 0 replies; 75+ messages in thread
From: Daniel Forrest @ 2014-11-14 16:30 UTC (permalink / raw)
  To: Rik van Riel, Michel Lespinasse, Hugh Dickins, Andrea Arcangeli,
	Andrew Morton, linux-kernel, linux-mm
  Cc: Tim Hartrick, Michal Hocko

There have been a couple of inquiries about the status of this patch
over the last few months, so I am going to try pushing it out.

Andrea Arcangeli has commented:

> Agreed. The only thing I don't like about this patch is the hardcoding
> of the number 5: could we make it a variable tunable through sysfs/sysctl,
> so that if some weird workload arises we have a knob to turn? It would
> cost one cacheline during fork, so it doesn't look like excessive overhead.

Adding this is beyond my experience level, so if it is required then
someone else will have to make it so.

Rik van Riel has commented:

> I believe we should just merge that patch.
> 
> I have not seen any better ideas come by.
> 
> The comment should probably be fixed to reflect the
> chain length of 5 though :)

So here is Michel's patch again with "(length > 1)" modified to
"(length > 5)" and fixed comments.

I have been running with this patch (with the threshold set to 5) for
over two years now and it does indeed solve the problem.

---

anon_vma_clone() is modified to return the length of the existing
same_vma anon vma chain, and we create a new anon_vma in the child
only while that chain is no more than five entries long. Past that
point the child reuses the anon_vmas it inherited, as we don't want
the same_vma chain to grow arbitrarily large.

Signed-off-by: Michel Lespinasse <walken@google.com>
Tested-by: Daniel Forrest <dan.forrest@ssec.wisc.edu>
---
 mm/mmap.c |    6 +++---
 mm/rmap.c |   18 ++++++++++++++----
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 3edfcdfa42d9..e14b19a838cb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -539,7 +539,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 		 * shrinking vma had, to cover any anon pages imported.
 		 */
 		if (exporter && exporter->anon_vma && !importer->anon_vma) {
-			if (anon_vma_clone(importer, exporter))
+			if (anon_vma_clone(importer, exporter) < 0)
 				return -ENOMEM;
 			importer->anon_vma = exporter->anon_vma;
 		}
@@ -1988,7 +1988,7 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	}
 	vma_set_policy(new, pol);
 
-	if (anon_vma_clone(new, vma))
+	if (anon_vma_clone(new, vma) < 0)
 		goto out_free_mpol;
 
 	if (new->vm_file) {
@@ -2409,7 +2409,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			if (IS_ERR(pol))
 				goto out_free_vma;
 			INIT_LIST_HEAD(&new_vma->anon_vma_chain);
-			if (anon_vma_clone(new_vma, vma))
+			if (anon_vma_clone(new_vma, vma) < 0)
 				goto out_free_mempol;
 			vma_set_policy(new_vma, pol);
 			new_vma->vm_start = addr;
diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cda2a24..ba8a726aaee6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -238,12 +238,13 @@ static inline void unlock_anon_vma_root(struct anon_vma *root)
 
 /*
  * Attach the anon_vmas from src to dst.
- * Returns 0 on success, -ENOMEM on failure.
+ * Returns length of the anon_vma chain on success, -ENOMEM on failure.
  */
 int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 {
 	struct anon_vma_chain *avc, *pavc;
 	struct anon_vma *root = NULL;
+	int length = 0;
 
 	list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
 		struct anon_vma *anon_vma;
@@ -259,9 +260,10 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 		anon_vma = pavc->anon_vma;
 		root = lock_anon_vma_root(root, anon_vma);
 		anon_vma_chain_link(dst, avc, anon_vma);
+		length++;
 	}
 	unlock_anon_vma_root(root);
-	return 0;
+	return length;
 
  enomem_failure:
 	unlink_anon_vmas(dst);
@@ -322,6 +324,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 {
 	struct anon_vma_chain *avc;
 	struct anon_vma *anon_vma;
+	int length;
 
 	/* Don't bother if the parent process has no anon_vma here. */
 	if (!pvma->anon_vma)
@@ -331,10 +334,17 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	 * First, attach the new VMA to the parent VMA's anon_vmas,
 	 * so rmap can find non-COWed pages in child processes.
 	 */
-	if (anon_vma_clone(vma, pvma))
+	length = anon_vma_clone(vma, pvma);
+	if (length < 0)
 		return -ENOMEM;
+	else if (length > 5)
+		return 0;
 
-	/* Then add our own anon_vma. */
+	/*
+	 * Then add our own anon_vma. We do this only for five forks after
+	 * the anon_vma was created, as we don't want the same_vma chain to
+	 * grow arbitrarily large.
+	 */
 	anon_vma = anon_vma_alloc();
 	if (!anon_vma)
 		goto out_error;

-- 
Daniel K. Forrest		Space Science and
dan.forrest@ssec.wisc.edu	Engineering Center
(608) 890 - 0558		University of Wisconsin, Madison

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH] Repeated fork() causes SLAB to grow without bound
@ 2014-11-14 16:30                   ` Daniel Forrest
  0 siblings, 0 replies; 75+ messages in thread
From: Daniel Forrest @ 2014-11-14 16:30 UTC (permalink / raw)
  To: Rik van Riel, Michel Lespinasse, Hugh Dickins, Andrea Arcangeli,
	Andrew Morton, linux-kernel, linux-mm
  Cc: Tim Hartrick, Michal Hocko

There have been a couple of inquiries about the status of this patch
over the last few months, so I am going to try pushing it out.

Andrea Arcangeli has commented:

> Agreed. The only thing I don't like about this patch is the hardcoding
> of number 5: could we make it a variable to tweak with sysfs/sysctl so
> if some weird workload arises we have a tuning tweak? It'd cost one
> cacheline during fork, so it doesn't look excessive overhead.

Adding this is beyond my experience level, so if it is required then
someone else will have to make it so.

Rik van Riel has commented:

> I believe we should just merge that patch.
> 
> I have not seen any better ideas come by.
> 
> The comment should probably be fixed to reflect the
> chain length of 5 though :)

So here is Michel's patch again with "(length > 1)" modified to
"(length > 5)" and fixed comments.

I have been running with this patch (with the threshold set to 5) for
over two years now and it does indeed solve the problem.

---

anon_vma_clone() is modified to return the length of the existing
same_vma anon vma chain, and we create a new anon_vma in the child
only if that chain is no more than five entries long (that is, within
five forks of the anon_vma's creation), as we don't want the same_vma
chain to grow arbitrarily large.

Signed-off-by: Michel Lespinasse <walken@google.com>
Tested-by: Daniel Forrest <dan.forrest@ssec.wisc.edu>
---
 mm/mmap.c |    6 +++---
 mm/rmap.c |   18 ++++++++++++++----
 2 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 3edfcdfa42d9..e14b19a838cb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -539,7 +539,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 		 * shrinking vma had, to cover any anon pages imported.
 		 */
 		if (exporter && exporter->anon_vma && !importer->anon_vma) {
-			if (anon_vma_clone(importer, exporter))
+			if (anon_vma_clone(importer, exporter) < 0)
 				return -ENOMEM;
 			importer->anon_vma = exporter->anon_vma;
 		}
@@ -1988,7 +1988,7 @@ static int __split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	}
 	vma_set_policy(new, pol);
 
-	if (anon_vma_clone(new, vma))
+	if (anon_vma_clone(new, vma) < 0)
 		goto out_free_mpol;
 
 	if (new->vm_file) {
@@ -2409,7 +2409,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			if (IS_ERR(pol))
 				goto out_free_vma;
 			INIT_LIST_HEAD(&new_vma->anon_vma_chain);
-			if (anon_vma_clone(new_vma, vma))
+			if (anon_vma_clone(new_vma, vma) < 0)
 				goto out_free_mempol;
 			vma_set_policy(new_vma, pol);
 			new_vma->vm_start = addr;
diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cda2a24..ba8a726aaee6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -238,12 +238,13 @@ static inline void unlock_anon_vma_root(struct anon_vma *root)
 
 /*
  * Attach the anon_vmas from src to dst.
- * Returns 0 on success, -ENOMEM on failure.
+ * Returns length of the anon_vma chain on success, -ENOMEM on failure.
  */
 int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 {
 	struct anon_vma_chain *avc, *pavc;
 	struct anon_vma *root = NULL;
+	int length = 0;
 
 	list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
 		struct anon_vma *anon_vma;
@@ -259,9 +260,10 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 		anon_vma = pavc->anon_vma;
 		root = lock_anon_vma_root(root, anon_vma);
 		anon_vma_chain_link(dst, avc, anon_vma);
+		length++;
 	}
 	unlock_anon_vma_root(root);
-	return 0;
+	return length;
 
  enomem_failure:
 	unlink_anon_vmas(dst);
@@ -322,6 +324,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 {
 	struct anon_vma_chain *avc;
 	struct anon_vma *anon_vma;
+	int length;
 
 	/* Don't bother if the parent process has no anon_vma here. */
 	if (!pvma->anon_vma)
@@ -331,10 +334,17 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	 * First, attach the new VMA to the parent VMA's anon_vmas,
 	 * so rmap can find non-COWed pages in child processes.
 	 */
-	if (anon_vma_clone(vma, pvma))
+	length = anon_vma_clone(vma, pvma);
+	if (length < 0)
 		return -ENOMEM;
+	else if (length > 5)
+		return 0;
 
-	/* Then add our own anon_vma. */
+	/*
+	 * Then add our own anon_vma. We do this only for five forks after
+	 * the anon_vma was created, as we don't want the same_vma chain to
+	 * grow arbitrarily large.
+	 */
 	anon_vma = anon_vma_alloc();
 	if (!anon_vma)
 		goto out_error;

-- 
Daniel K. Forrest		Space Science and
dan.forrest@ssec.wisc.edu	Engineering Center
(608) 890 - 0558		University of Wisconsin, Madison

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-14 16:30                   ` Daniel Forrest
@ 2014-11-18  0:02                     ` Andrew Morton
  -1 siblings, 0 replies; 75+ messages in thread
From: Andrew Morton @ 2014-11-18  0:02 UTC (permalink / raw)
  To: Daniel Forrest
  Cc: Rik van Riel, Michel Lespinasse, Hugh Dickins, Andrea Arcangeli,
	linux-kernel, linux-mm, Tim Hartrick, Michal Hocko

On Fri, 14 Nov 2014 10:30:53 -0600 Daniel Forrest <dan.forrest@ssec.wisc.edu> wrote:

> There have been a couple of inquiries about the status of this patch
> over the last few months, so I am going to try pushing it out.
> 
> Andrea Arcangeli has commented:
> 
> > Agreed. The only thing I don't like about this patch is the hardcoding
> > of number 5: could we make it a variable to tweak with sysfs/sysctl so
> > if some weird workload arises we have a tuning tweak? It'd cost one
> > cacheline during fork, so it doesn't look excessive overhead.
> 
> Adding this is beyond my experience level, so if it is required then
> someone else will have to make it so.
> 
> Rik van Riel has commented:
> 
> > I believe we should just merge that patch.
> > 
> > I have not seen any better ideas come by.
> > 
> > The comment should probably be fixed to reflect the
> > chain length of 5 though :)
> 
> So here is Michel's patch again with "(length > 1)" modified to
> "(length > 5)" and fixed comments.
> 
> I have been running with this patch (with the threshold set to 5) for
> over two years now and it does indeed solve the problem.
> 
> ---
> 
> anon_vma_clone() is modified to return the length of the existing
> same_vma anon vma chain, and we create a new anon_vma in the child
> if it is more than five forks after the anon_vma was created, as we
> don't want the same_vma chain to grow arbitrarily large.

hoo boy, what's going on here.

- Under what circumstances are we seeing this slab windup?

- What are the consequences?  Can it OOM the machine?

- Why is this occurring?  There aren't an infinite number of vmas, so
  there shouldn't be an infinite number of anon_vmas or
  anon_vma_chains.

- IOW, what has to be done to fix this properly?

- What are the runtime consequences of limiting the length of the chain?

> ...
>
> @@ -331,10 +334,17 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  	 * First, attach the new VMA to the parent VMA's anon_vmas,
>  	 * so rmap can find non-COWed pages in child processes.
>  	 */
> -	if (anon_vma_clone(vma, pvma))
> +	length = anon_vma_clone(vma, pvma);
> +	if (length < 0)
>  		return -ENOMEM;

This should propagate the anon_vma_clone() return val instead of
assuming ENOMEM.  But that won't fix anything...
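
A minimal userspace sketch of what propagating the return value would
look like.  This is not the kernel code: model_clone() and
model_fork_propagate() are hypothetical stand-ins, and the fail_with
parameter exists only so the sketch can simulate a failure.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical stand-in for anon_vma_clone(): returns the chain length
 * (>= 0) on success, or a negative errno when fail_with is non-zero.
 */
int model_clone(int parent_len, int fail_with)
{
    assert(parent_len >= 0);
    return fail_with ? -fail_with : parent_len;
}

/* Hypothetical caller written in the style of anon_vma_fork(). */
int model_fork_propagate(int parent_len, int fail_with)
{
    int length = model_clone(parent_len, fail_with);

    if (length < 0)
        return length;  /* pass the callee's error through unchanged */
    return 0;
}
```

Written this way, the caller stays correct even if the callee ever
gains a failure code other than -ENOMEM.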

> +	else if (length > 5)
> +		return 0;
>  
> -	/* Then add our own anon_vma. */
> +	/*
> +	 * Then add our own anon_vma. We do this only for five forks after
> +	 * the anon_vma was created, as we don't want the same_vma chain to
> +	 * grow arbitrarily large.
> +	 */
>  	anon_vma = anon_vma_alloc();


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-18  0:02                     ` Andrew Morton
@ 2014-11-18  1:41                       ` Daniel Forrest
  -1 siblings, 0 replies; 75+ messages in thread
From: Daniel Forrest @ 2014-11-18  1:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michel Lespinasse, Hugh Dickins, Andrea Arcangeli,
	linux-kernel, linux-mm, Tim Hartrick, Michal Hocko

On Mon, Nov 17, 2014 at 04:02:12PM -0800, Andrew Morton wrote:
> On Fri, 14 Nov 2014 10:30:53 -0600 Daniel Forrest <dan.forrest@ssec.wisc.edu> wrote:
> 
> > There have been a couple of inquiries about the status of this patch
> > over the last few months, so I am going to try pushing it out.
> > 
> > Andrea Arcangeli has commented:
> > 
> > > Agreed. The only thing I don't like about this patch is the hardcoding
> > > of number 5: could we make it a variable to tweak with sysfs/sysctl so
> > > if some weird workload arises we have a tuning tweak? It'd cost one
> > > cacheline during fork, so it doesn't look excessive overhead.
> > 
> > Adding this is beyond my experience level, so if it is required then
> > someone else will have to make it so.
> > 
> > Rik van Riel has commented:
> > 
> > > I believe we should just merge that patch.
> > > 
> > > I have not seen any better ideas come by.
> > > 
> > > The comment should probably be fixed to reflect the
> > > chain length of 5 though :)
> > 
> > So here is Michel's patch again with "(length > 1)" modified to
> > "(length > 5)" and fixed comments.
> > 
> > I have been running with this patch (with the threshold set to 5) for
> > over two years now and it does indeed solve the problem.
> > 
> > ---
> > 
> > anon_vma_clone() is modified to return the length of the existing
> > same_vma anon vma chain, and we create a new anon_vma in the child
> > if it is more than five forks after the anon_vma was created, as we
> > don't want the same_vma chain to grow arbitrarily large.
> 
> hoo boy, what's going on here.
> 
> - Under what circumstances are we seeing this slab windup?

The original bug report is here:

https://lkml.org/lkml/2012/8/15/765
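
The symptom in that report was the anon_vma and anon_vma_chain counts in
/proc/slabinfo climbing steadily.  As a rough sketch of how one might
watch for it, the helper below pulls the active object count out of one
slabinfo data line.  The function name slab_active_objs() is made up
for illustration, and it assumes the usual "<name> <active_objs>
<num_objs> ..." column order.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one /proc/slabinfo data line.  Returns 1 and stores the
 * active object count in *active if the line describes `cache`,
 * otherwise returns 0 and leaves *active untouched.
 */
int slab_active_objs(const char *line, const char *cache,
                     unsigned long *active)
{
    char name[64];
    unsigned long objs;

    assert(line && cache && active);

    /* Header lines fail here because their second field is not a number. */
    if (sscanf(line, "%63s %lu", name, &objs) != 2)
        return 0;

    /* Exact match, so "anon_vma" does not also match "anon_vma_chain". */
    if (strcmp(name, cache) != 0)
        return 0;

    *active = objs;
    return 1;
}
```

Feeding it each line read from /proc/slabinfo (which usually requires
root) once a minute and logging the result for "anon_vma" and
"anon_vma_chain" would be enough to see the growth.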

> - What are the consequences?  Can it OOM the machine?

Yes, eventually you run out of SLAB space.

> - Why is this occurring?  There aren't an infinite number of vmas, so
>   there shouldn't be an infinite number of anon_vmas or
>   anon_vma_chains.

Because of the serial forking there does indeed end up being an infinite
number of vmas.  The initial vma can never be deleted (even though the
initial parent process has long since terminated) because the initial
vma is referenced by the children.

> - IOW, what has to be done to fix this properly?

As far as I know, this is the best solution.  I tried a refcounting
solution based on comments by Rik van Riel:

https://lkml.org/lkml/2012/8/17/536

But it didn't fully work, probably because I didn't quite get the
locking done properly.  In any case, at this point questions came up
about the overhead of the page refcounting and Michel Lespinasse
suggested the initial version of this patch:

https://lkml.org/lkml/2012/8/21/730

> - What are the runtime consequences of limiting the length of the chain?

I can't say, but it only affects users who fork more than five levels
deep without doing an exec.  On the other hand, there are at least three
users (Tim Hartrick, Michal Hocko, and myself) who have real world
applications where the consequence of no patch is a crashed system.

I would suggest reading the thread starting with my initial bug report
for what others have had to say about this.

> > ...
> >
> > @@ -331,10 +334,17 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
> >  	 * First, attach the new VMA to the parent VMA's anon_vmas,
> >  	 * so rmap can find non-COWed pages in child processes.
> >  	 */
> > -	if (anon_vma_clone(vma, pvma))
> > +	length = anon_vma_clone(vma, pvma);
> > +	if (length < 0)
> >  		return -ENOMEM;
> 
> This should propagate the anon_vma_clone() return val instead of
> assuming ENOMEM.  But that won't fix anything...

Agreed, but the only failure return value of anon_vma_clone is -ENOMEM.

Scanning the code in __split_vma (mm/mmap.c) it looks like the error
return is lost (between Linux 3.11 and 3.12 the err variable is now
used before the call to anon_vma_clone and the default initial value of
-ENOMEM is overwritten).  This is an actual bug in the current code.
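
To illustrate the pattern (this is a simplified userspace sketch, not
the actual __split_vma code; earlier_step(), model_clone(),
split_buggy() and split_fixed() are hypothetical names):

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical earlier call whose result is stored in err; it succeeds. */
int earlier_step(void)
{
    return 0;
}

/* Hypothetical stand-in for anon_vma_clone(): non-zero means failure. */
int model_clone(int clone_fails)
{
    assert(clone_fails == 0 || clone_fails == 1);
    return clone_fails ? -ENOMEM : 0;
}

int split_buggy(int clone_fails)
{
    int err = -ENOMEM;      /* default meant to cover allocation failures */

    err = earlier_step();   /* succeeds, so err is now 0 */
    if (err)
        goto out;
    if (model_clone(clone_fails))
        goto out;           /* intends -ENOMEM, but err is 0 by now */
    return 0;
out:
    return err;
}

int split_fixed(int clone_fails)
{
    int err = earlier_step();

    if (err)
        return err;
    err = model_clone(clone_fails);  /* keep the callee's own error */
    if (err)
        return err;
    return 0;
}
```

In this sketch split_buggy(1) reports success even though the clone
failed, while split_fixed(1) returns -ENOMEM.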

I can update the patch to fix these issues.

> > +	else if (length > 5)
> > +		return 0;
> >  
> > -	/* Then add our own anon_vma. */
> > +	/*
> > +	 * Then add our own anon_vma. We do this only for five forks after
> > +	 * the anon_vma was created, as we don't want the same_vma chain to
> > +	 * grow arbitrarily large.
> > +	 */
> >  	anon_vma = anon_vma_alloc();

-- 
Daniel K. Forrest		Space Science and
dan.forrest@ssec.wisc.edu	Engineering Center
(608) 890 - 0558		University of Wisconsin, Madison

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-18  1:41                       ` Daniel Forrest
@ 2014-11-18  2:41                         ` Rik van Riel
  -1 siblings, 0 replies; 75+ messages in thread
From: Rik van Riel @ 2014-11-18  2:41 UTC (permalink / raw)
  To: Andrew Morton, Michel Lespinasse, Hugh Dickins, Andrea Arcangeli,
	linux-kernel, linux-mm, Tim Hartrick, Michal Hocko

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 11/17/2014 08:41 PM, Daniel Forrest wrote:
> On Mon, Nov 17, 2014 at 04:02:12PM -0800, Andrew Morton wrote:
>> On Fri, 14 Nov 2014 10:30:53 -0600 Daniel Forrest
>> <dan.forrest@ssec.wisc.edu> wrote:
>> 
>>> There have been a couple of inquiries about the status of this
>>> patch over the last few months, so I am going to try pushing it
>>> out.
>>> 
>>> Andrea Arcangeli has commented:
>>> 
>>>> Agreed. The only thing I don't like about this patch is the
>>>> hardcoding of number 5: could we make it a variable to tweak
>>>> with sysfs/sysctl so if some weird workload arises we have a
>>>> tuning tweak? It'd cost one cacheline during fork, so it
>>>> doesn't look excessive overhead.
>>> 
>>> Adding this is beyond my experience level, so if it is required
>>> then someone else will have to make it so.
>>> 
>>> Rik van Riel has commented:
>>> 
>>>> I believe we should just merge that patch.
>>>> 
>>>> I have not seen any better ideas come by.
>>>> 
>>>> The comment should probably be fixed to reflect the chain
>>>> length of 5 though :)
>>> 
>>> So here is Michel's patch again with "(length > 1)" modified
>>> to "(length > 5)" and fixed comments.
>>> 
>>> I have been running with this patch (with the threshold set to
>>> 5) for over two years now and it does indeed solve the
>>> problem.
>>> 
>>> ---
>>> 
>>> anon_vma_clone() is modified to return the length of the
>>> existing same_vma anon vma chain, and we create a new anon_vma
>>> in the child if it is more than five forks after the anon_vma
>>> was created, as we don't want the same_vma chain to grow
>>> arbitrarily large.
>> 
>> hoo boy, what's going on here.
>> 
>> - Under what circumstances are we seeing this slab windup?
> 
> The original bug report is here:
> 
> https://lkml.org/lkml/2012/8/15/765
> 
>> - What are the consequences?  Can it OOM the machine?
> 
> Yes, eventually you run out of SLAB space.
> 
>> - Why is this occurring?  There aren't an infinite number of
>> vmas, so there shouldn't be an infinite number of anon_vmas or 
>> anon_vma_chains.
> 
> Because of the serial forking there does indeed end up being an
> infinite number of vmas.  The initial vma can never be deleted
> (even though the initial parent process has long since terminated)
> because the initial vma is referenced by the children.

There is a finite number of VMAs, but an infinite number of
anon_vmas.

Subtle, yet deadly...
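
A toy userspace model may make the distinction concrete.  It is only a
sketch, not kernel code: model_fork(), chain_len_after() and NO_LIMIT
are made-up names, and it assumes a strictly serial fork chain in which
every parent exits after forking once.  Under that assumption the
length of the surviving process's same_vma chain equals the number of
anon_vmas that are still pinned.

```c
#include <assert.h>

#define NO_LIMIT (-1)   /* hypothetical marker for the unpatched behaviour */

/*
 * Model of one fork.  The child first copies every entry of the
 * parent's chain (what anon_vma_clone() does), then adds its own
 * anon_vma unless the copied chain is already longer than `limit`.
 * Returns the child's chain length.
 */
int model_fork(int parent_len, int limit)
{
    int length = parent_len;

    if (limit != NO_LIMIT && length > limit)
        return length;      /* reuse the inherited anon_vmas, add none */
    return length + 1;      /* child gets an anon_vma of its own */
}

/* Chain length of the last process after `forks` serial forks. */
int chain_len_after(int forks, int limit)
{
    int len = 1;            /* the first process owns one anon_vma */

    assert(forks >= 0);
    for (int i = 0; i < forks; i++)
        len = model_fork(len, limit);
    return len;
}
```

With NO_LIMIT the chain after 1000 serial forks has 1001 entries and
keeps growing by one per fork.  With the patch's limit of 5 it stops at
6, because the fork that sees a chain of length 5 still adds one entry.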

>> - IOW, what has to be done to fix this properly?
> 
> As far as I know, this is the best solution.  I tried a
> refcounting solution based on comments by Rik van Riel:
> 
> https://lkml.org/lkml/2012/8/17/536
> 
> But it didn't fully work, probably because I didn't quite get the 
> locking done properly.  In any case, at this point questions came
> up about the overhead of the page refcounting and Michel
> Lespinasse suggested the initial version of this patch:
> 
> https://lkml.org/lkml/2012/8/21/730
> 
>> - What are the runtime consequences of limiting the length of the
>> chain?
> 
> I can't say, but it only affects users who fork more than five
> levels deep without doing an exec.  On the other hand, there are at
> least three users (Tim Hartrick, Michal Hocko, and myself) who have
> real world applications where the consequence of no patch is a
> crashed system.
> 
> I would suggest reading the thread starting with my initial bug
> report for what others have had to say about this.

I suspect what Andrew is hinting at is that the
changelog for the patch should contain a detailed
description of exactly what the bug is, how it is
triggered, what the symptoms are, and how the
patch avoids it.

That way people can understand what the code does
simply by looking at the changelog - no need to go
find old linux-kernel mailing list threads.

>>> ...
>>> 
>>> @@ -331,10 +334,17 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>>>  	 * First, attach the new VMA to the parent VMA's anon_vmas,
>>>  	 * so rmap can find non-COWed pages in child processes.
>>>  	 */
>>> -	if (anon_vma_clone(vma, pvma))
>>> +	length = anon_vma_clone(vma, pvma);
>>> +	if (length < 0)
>>>  		return -ENOMEM;
>> 
>> This should propagate the anon_vma_clone() return val instead of 
>> assuming ENOMEM.  But that won't fix anything...
> 
> Agreed, but the only failure return value of anon_vma_clone is
> -ENOMEM.
> 
> Scanning the code in __split_vma (mm/mmap.c) it looks like the
> error return is lost (between Linux 3.11 and 3.12 the err variable
> is now used before the call to anon_vma_clone and the default
> initial value of -ENOMEM is overwritten).  This is an actual bug in
> the current code.
> 
> I can update the patch to fix these issues.
> 
>>> +	else if (length > 5)
>>> +		return 0;
>>>  
>>> -	/* Then add our own anon_vma. */
>>> +	/*
>>> +	 * Then add our own anon_vma. We do this only for five forks after
>>> +	 * the anon_vma was created, as we don't want the same_vma chain to
>>> +	 * grow arbitrarily large.
>>> +	 */
>>>  	anon_vma = anon_vma_alloc();
> 


- -- 
All rights reversed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJUarH1AAoJEM553pKExN6DXwUH/RHNwGTYhzzwIQtbtMqnHYjE
YWriqPLIOW8yWh85hkrmTsjWIegbDnEsbgNRX0Y8ANrKgx+vWRRW/eJ/s+Z+m7UY
lD1DKO3vIfUSQvL4QHnViTEgEHfdychnhe0SE/kMeQbnLpUw8ywviJxX0UibeLdK
L/F8xMzpUj/PBkNTtPxQRevWwUEMMMY6RS8RjHNBADe9ym/Fjd0dzAkoPCYCUapT
barWfI9RMC3gYfyObFNBNYyaYyyK1FlAyBq52d/W8xCBW/5EIhEtFBGben/lAuEP
alJt+jnFq4B1tXQtJIu1YBhY4OhuqWQy5lbz7NFPxg8+cECVPd3Vq6O2Bxilz9U=
=GLaM
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
@ 2014-11-18  2:41                         ` Rik van Riel
  0 siblings, 0 replies; 75+ messages in thread
From: Rik van Riel @ 2014-11-18  2:41 UTC (permalink / raw)
  To: Andrew Morton, Michel Lespinasse, Hugh Dickins, Andrea Arcangeli,
	linux-kernel, linux-mm, Tim Hartrick, Michal Hocko

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 11/17/2014 08:41 PM, Daniel Forrest wrote:
> On Mon, Nov 17, 2014 at 04:02:12PM -0800, Andrew Morton wrote:
>> On Fri, 14 Nov 2014 10:30:53 -0600 Daniel Forrest
>> <dan.forrest@ssec.wisc.edu> wrote:
>> 
>>> There have been a couple of inquiries about the status of this
>>> patch over the last few months, so I am going to try pushing it
>>> out.
>>> 
>>> Andrea Arcangeli has commented:
>>> 
>>>> Agreed. The only thing I don't like about this patch is the
>>>> hardcoding of number 5: could we make it a variable to tweak
>>>> with sysfs/sysctl so if some weird workload arises we have a
>>>> tuning tweak? It'd cost one cacheline during fork, so it
>>>> doesn't look excessive overhead.
>>> 
>>> Adding this is beyond my experience level, so if it is required
>>> then someone else will have to make it so.
>>> 
>>> Rik van Riel has commented:
>>> 
>>>> I believe we should just merge that patch.
>>>> 
>>>> I have not seen any better ideas come by.
>>>> 
>>>> The comment should probably be fixed to reflect the chain
>>>> length of 5 though :)
>>> 
>>> So here is Michel's patch again with "(length > 1)" modified
>>> to "(length > 5)" and fixed comments.
>>> 
>>> I have been running with this patch (with the threshold set to
>>> 5) for over two years now and it does indeed solve the
>>> problem.
>>> 
>>> ---
>>> 
>>> anon_vma_clone() is modified to return the length of the
>>> existing same_vma anon vma chain, and we create a new anon_vma
>>> in the child if it is more than five forks after the anon_vma
>>> was created, as we don't want the same_vma chain to grow
>>> arbitrarily large.
>> 
>> hoo boy, what's going on here.
>> 
>> - Under what circumstances are we seeing this slab windup?
> 
> The original bug report is here:
> 
> https://lkml.org/lkml/2012/8/15/765
> 
>> - What are the consequences?  Can it OOM the machine?
> 
> Yes, eventually you run out of SLAB space.
> 
>> - Why is this occurring?  There aren't an infinite number of
>> vmas, so there shouldn't be an infinite number of anon_vmas or 
>> anon_vma_chains.
> 
> Because of the serial forking there does indeed end up being an
> infinite number of vmas.  The initial vma can never be deleted
> (even though the initial parent process has long since terminated)
> because the initial vma is referenced by the children.

There is a finite number of VMAs, but an infite number of
anon_vmas.

Subtle, yet deadly...

>> - IOW, what has to be done to fix this properly?
> 
> As far as I know, this is the best solution.  I tried a
> refcounting solution based on comments by Rik van Riel:
> 
> https://lkml.org/lkml/2012/8/17/536
> 
> But it didn't fully work, probably because I didn't quite get the 
> locking done properly.  In any case, at this point questions came
> up about the overhead of the page refcounting and Michel
> Lespinasse suggested the initial version of this patch:
> 
> https://lkml.org/lkml/2012/8/21/730
> 
>> - What are the runtime consequences of limiting the length of the
>> chain?
> 
> I can't say, but it only affects users who fork more than five
> levels deep without doing an exec.  On the other hand, there are at
> least three users (Tim Hartrick, Michal Hocko, and myself) who have
> real world applications where the consequence of no patch is a
> crashed system.
> 
> I would suggest reading the thread starting with my initial bug
> report for what others have had to say about this.

I suspect what Andrew is hinting at is that the
changelog for the patch should contain a detailed
description of exactly what the bug is, how it is
triggered, what the symptoms are, and how the
patch avoids it.

That way people can understand what the code does
simply by looking at the changelog - no need to go
find old linux-kernel mailing list threads.

>>> ...
>>> 
>>> @@ -331,10 +334,17 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>>>  	 * First, attach the new VMA to the parent VMA's anon_vmas,
>>>  	 * so rmap can find non-COWed pages in child processes.
>>>  	 */
>>> -	if (anon_vma_clone(vma, pvma))
>>> +	length = anon_vma_clone(vma, pvma);
>>> +	if (length < 0)
>>>  		return -ENOMEM;
>> 
>> This should propagate the anon_vma_clone() return val instead of 
>> assuming ENOMEM.  But that won't fix anything...
> 
> Agreed, but the only failure return value of anon_vma_clone is
> -ENOMEM.
> 
> Scanning the code in __split_vma (mm/mmap.c) it looks like the
> error return is lost (between Linux 3.11 and 3.12 the err variable
> is now used before the call to anon_vma_clone and the default
> initial value of -ENOMEM is overwritten).  This is an actual bug in
> the current code.
> 
> I can update the patch to fix these issues.
> 
>>> +	else if (length > 5)
>>> +		return 0;
>>> 
>>> -	/* Then add our own anon_vma. */
>>> +	/*
>>> +	 * Then add our own anon_vma. We do this only for five forks after
>>> +	 * the anon_vma was created, as we don't want the same_vma chain to
>>> +	 * grow arbitrarily large.
>>> +	 */
>>>  	anon_vma = anon_vma_alloc();
> 


- -- 
All rights reversed

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-18  2:41                         ` Rik van Riel
@ 2014-11-18 20:19                           ` Andrew Morton
  -1 siblings, 0 replies; 75+ messages in thread
From: Andrew Morton @ 2014-11-18 20:19 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Michel Lespinasse, Hugh Dickins, Andrea Arcangeli, linux-kernel,
	linux-mm, Tim Hartrick, Michal Hocko

On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel <riel@redhat.com> wrote:

> > Because of the serial forking there does indeed end up being an
> > infinite number of vmas.  The initial vma can never be deleted
> > (even though the initial parent process has long since terminated)
> > because the initial vma is referenced by the children.
> 
> There is a finite number of VMAs, but an infinite number of
> anon_vmas.
> 
> Subtle, yet deadly...

Well, we clearly have the data structures screwed up.  I've forgotten
enough about this code for me to be unable to work out what the fixed
up data structures would look like :( But surely there is some proper
solution here.  Help?

> > I can't say, but it only affects users who fork more than five
> > levels deep without doing an exec.  On the other hand, there are at
> > least three users (Tim Hartrick, Michal Hocko, and myself) who have
> > real world applications where the consequence of no patch is a
> > crashed system.
> > 
> > I would suggest reading the thread starting with my initial bug
> > report for what others have had to say about this.
> 
> I suspect what Andrew is hinting at is that the
> changelog for the patch should contain a detailed
> description of exactly what the bug is, how it is
> triggered, what the symptoms are, and how the
> patch avoids it.
>
> That way people can understand what the code does
> simply by looking at the changelog - no need to go
> find old linux-kernel mailing list threads.

Yes please, there's a ton of stuff here which we should attempt to
capture.

https://lkml.org/lkml/2012/8/15/765 is useful.

I'm assuming that with the "foo < 5" hack, an application which forked
5 times then did a lot of work would still trigger the "catastrophic
issue at page reclaim time" issue which Rik identified at
https://lkml.org/lkml/2012/8/20/265?

There are real-world workloads which are triggering this slab growth
problem, yes?  (Detail them in the changelog, please).

This bug snuck under my radar last time - we're permitting unprivileged
userspace to exhaust memory and that's bad.  I'm OK with the foo<5
thing for -stable kernels, as it is simple.  But I'm reluctant to merge
(or at least to retain) it in mainline because then everyone will run
away and think about other stuff and this bug will never get fixed
properly.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-18 20:19                           ` Andrew Morton
@ 2014-11-18 22:15                             ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 75+ messages in thread
From: Konstantin Khlebnikov @ 2014-11-18 22:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michel Lespinasse, Hugh Dickins, Andrea Arcangeli,
	Linux Kernel Mailing List, linux-mm, Tim Hartrick, Michal Hocko

On Tue, Nov 18, 2014 at 11:19 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel <riel@redhat.com> wrote:
>
>> > Because of the serial forking there does indeed end up being an
>> > infinite number of vmas.  The initial vma can never be deleted
>> > (even though the initial parent process has long since terminated)
>> > because the initial vma is referenced by the children.
>>
>> There is a finite number of VMAs, but an infinite number of
>> anon_vmas.
>>
>> Subtle, yet deadly...
>
> Well, we clearly have the data structures screwed up.  I've forgotten
> enough about this code for me to be unable to work out what the fixed
> up data structures would look like :( But surely there is some proper
> solution here.  Help?

I'm not sure this is right, but on fork we could probably reuse an old anon_vma
from the chain if it has already lost all the vmas which point to it.
For an endlessly forking exploit this should work mostly like the proposed patch,
which stops branching after some depth, but without the magic constant.

>
>> > I can't say, but it only affects users who fork more than five
>> > levels deep without doing an exec.  On the other hand, there are at
>> > least three users (Tim Hartrick, Michal Hocko, and myself) who have
>> > real world applications where the consequence of no patch is a
>> > crashed system.
>> >
>> > I would suggest reading the thread starting with my initial bug
>> > report for what others have had to say about this.
>>
>> I suspect what Andrew is hinting at is that the
>> changelog for the patch should contain a detailed
>> description of exactly what the bug is, how it is
>> triggered, what the symptoms are, and how the
>> patch avoids it.
>>
>> That way people can understand what the code does
>> simply by looking at the changelog - no need to go
>> find old linux-kernel mailing list threads.
>
> Yes please, there's a ton of stuff here which we should attempt to
> capture.
>
> https://lkml.org/lkml/2012/8/15/765 is useful.
>
> I'm assuming that with the "foo < 5" hack, an application which forked
> 5 times then did a lot of work would still trigger the "catastrophic
> issue at page reclaim time" issue which Rik identified at
> https://lkml.org/lkml/2012/8/20/265?
>
> There are real-world workloads which are triggering this slab growth
> problem, yes?  (Detail them in the changelog, please).
>
> This bug snuck under my radar last time - we're permitting unprivileged
> userspace to exhaust memory and that's bad.  I'm OK with the foo<5
> thing for -stable kernels, as it is simple.  But I'm reluctant to merge
> (or at least to retain) it in mainline because then everyone will run
> away and think about other stuff and this bug will never get fixed
> properly.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-18 22:15                             ` Konstantin Khlebnikov
  (?)
@ 2014-11-18 23:02                             ` Konstantin Khlebnikov
  2014-11-18 23:50                                 ` Vlastimil Babka
  -1 siblings, 1 reply; 75+ messages in thread
From: Konstantin Khlebnikov @ 2014-11-18 23:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Michel Lespinasse, Hugh Dickins, Andrea Arcangeli,
	Linux Kernel Mailing List, linux-mm, Tim Hartrick, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 3214 bytes --]

On Wed, Nov 19, 2014 at 1:15 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> On Tue, Nov 18, 2014 at 11:19 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
>> On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel <riel@redhat.com> wrote:
>>
>>> > Because of the serial forking there does indeed end up being an
>>> > infinite number of vmas.  The initial vma can never be deleted
>>> > (even though the initial parent process has long since terminated)
>>> > because the initial vma is referenced by the children.
>>>
>>> There is a finite number of VMAs, but an infinite number of
>>> anon_vmas.
>>>
>>> Subtle, yet deadly...
>>
>> Well, we clearly have the data structures screwed up.  I've forgotten
>> enough about this code for me to be unable to work out what the fixed
>> up data structures would look like :( But surely there is some proper
>> solution here.  Help?
>
> Not sure if it's right but probably we could reuse on fork an old anon_vma
> from the chain if it's already lost all vmas which points to it.
> For endlessly forking exploit this should work mostly like proposed patch
> which stops branching after some depth but without magic constant.

Something like this. I'll leave a proper comment for tomorrow.

>
>>
>>> > I can't say, but it only affects users who fork more than five
>>> > levels deep without doing an exec.  On the other hand, there are at
>>> > least three users (Tim Hartrick, Michal Hocko, and myself) who have
>>> > real world applications where the consequence of no patch is a
>>> > crashed system.
>>> >
>>> > I would suggest reading the thread starting with my initial bug
>>> > report for what others have had to say about this.
>>>
>>> I suspect what Andrew is hinting at is that the
>>> changelog for the patch should contain a detailed
>>> description of exactly what the bug is, how it is
>>> triggered, what the symptoms are, and how the
>>> patch avoids it.
>>>
>>> That way people can understand what the code does
>>> simply by looking at the changelog - no need to go
>>> find old linux-kernel mailing list threads.
>>
>> Yes please, there's a ton of stuff here which we should attempt to
>> capture.
>>
>> https://lkml.org/lkml/2012/8/15/765 is useful.
>>
>> I'm assuming that with the "foo < 5" hack, an application which forked
>> 5 times then did a lot of work would still trigger the "catastrophic
>> issue at page reclaim time" issue which Rik identified at
>> https://lkml.org/lkml/2012/8/20/265?
>>
>> There are real-world workloads which are triggering this slab growth
>> problem, yes?  (Detail them in the changelog, please).
>>
>> This bug snuck under my radar last time - we're permitting unprivileged
>> userspace to exhaust memory and that's bad.  I'm OK with the foo<5
>> thing for -stable kernels, as it is simple.  But I'm reluctant to merge
>> (or at least to retain) it in mainline because then everyone will run
>> away and think about other stuff and this bug will never get fixed
>> properly.

[-- Attachment #2: mm-reuse-old-anon_vma-if-it-s-lost-all-vmas --]
[-- Type: application/octet-stream, Size: 2362 bytes --]

mm: reuse old anon_vma if it's lost all vmas

From: Konstantin Khlebnikov <koct9i@gmail.com>

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
---
 include/linux/rmap.h |    2 ++
 mm/rmap.c            |   14 ++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c0c2bce..d40ca08 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -36,6 +36,8 @@ struct anon_vma {
 	 */
 	atomic_t refcount;
 
+	int nr_vmas;	/* Number of direct references from vmas */
+
 	/*
 	 * NOTE: the LSB of the rb_root.rb_node is set by
 	 * mm_take_all_locks() _after_ taking the above lock. So the
diff --git a/mm/rmap.c b/mm/rmap.c
index 19886fb..ced4754 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -72,6 +72,7 @@ static inline struct anon_vma *anon_vma_alloc(void)
 	anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
 	if (anon_vma) {
 		atomic_set(&anon_vma->refcount, 1);
+		anon_vma->nr_vmas = 1;
 		/*
 		 * Initialise the anon_vma root to point to itself. If called
 		 * from fork, the root will be reset to the parents anon_vma.
@@ -256,7 +257,11 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 		anon_vma = pavc->anon_vma;
 		root = lock_anon_vma_root(root, anon_vma);
 		anon_vma_chain_link(dst, avc, anon_vma);
+		if (!dst->anon_vma && !anon_vma->nr_vmas)
+			dst->anon_vma = anon_vma;
 	}
+	if (dst->anon_vma)
+		dst->anon_vma->nr_vmas++;
 	unlock_anon_vma_root(root);
 	return 0;
 
@@ -279,6 +284,9 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	if (!pvma->anon_vma)
 		return 0;
 
+	/* Drop parent anon_vma, we want find or allocate our own. */
+	vma->anon_vma = NULL;
+
 	/*
 	 * First, attach the new VMA to the parent VMA's anon_vmas,
 	 * so rmap can find non-COWed pages in child processes.
@@ -286,6 +294,10 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	if (anon_vma_clone(vma, pvma))
 		return -ENOMEM;
 
+	/* Old anon_vma has been reused. */
+	if (vma->anon_vma)
+		return 0;
+
 	/* Then add our own anon_vma. */
 	anon_vma = anon_vma_alloc();
 	if (!anon_vma)
@@ -345,6 +357,8 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
 		list_del(&avc->same_vma);
 		anon_vma_chain_free(avc);
 	}
+	if (vma->anon_vma)
+		vma->anon_vma->nr_vmas--;
 	unlock_anon_vma_root(root);
 
 	/*

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-18 23:02                             ` Konstantin Khlebnikov
@ 2014-11-18 23:50                                 ` Vlastimil Babka
  0 siblings, 0 replies; 75+ messages in thread
From: Vlastimil Babka @ 2014-11-18 23:50 UTC (permalink / raw)
  To: Konstantin Khlebnikov, Andrew Morton
  Cc: Rik van Riel, Michel Lespinasse, Hugh Dickins, Andrea Arcangeli,
	Linux Kernel Mailing List, linux-mm, Tim Hartrick, Michal Hocko

On 11/19/2014 12:02 AM, Konstantin Khlebnikov wrote:
> On Wed, Nov 19, 2014 at 1:15 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>> On Tue, Nov 18, 2014 at 11:19 PM, Andrew Morton
>> <akpm@linux-foundation.org> wrote:
>>> On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel <riel@redhat.com> wrote:
>>>
>>>> > Because of the serial forking there does indeed end up being an
>>>> > infinite number of vmas.  The initial vma can never be deleted
>>>> > (even though the initial parent process has long since terminated)
>>>> > because the initial vma is referenced by the children.
>>>>
>>>> There is a finite number of VMAs, but an infinite number of
>>>> anon_vmas.
>>>>
>>>> Subtle, yet deadly...
>>>
>>> Well, we clearly have the data structures screwed up.  I've forgotten
>>> enough about this code for me to be unable to work out what the fixed
>>> up data structures would look like :( But surely there is some proper
>>> solution here.  Help?
>>
>> Not sure if it's right but probably we could reuse on fork an old anon_vma
>> from the chain if it's already lost all vmas which points to it.
>> For endlessly forking exploit this should work mostly like proposed patch
>> which stops branching after some depth but without magic constant.
> 
> Something like this. I leave proper comment for tomorrow.

Hmm, I'm not sure that will work as it is. If I understand it correctly, your
patch can detect whether the parent's anon_vma has no references of its own at
fork() time. But at fork time the parent is still alive; it only exits after the
fork, right? So I guess it still has its own references, the child will still
allocate its new anon_vma, and the problem is not solved.

So maybe we could detect that the own references dropped to zero when the parent
does exit, and then change the mapping of all relevant pages to the root anon_vma,
destroy the avcs of the children and the anon_vma itself. But that sounds quite
heavyweight :/

Vlastimil

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-18 20:19                           ` Andrew Morton
@ 2014-11-19  2:48                             ` Rik van Riel
  -1 siblings, 0 replies; 75+ messages in thread
From: Rik van Riel @ 2014-11-19  2:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michel Lespinasse, Hugh Dickins, Andrea Arcangeli, linux-kernel,
	linux-mm, Tim Hartrick, Michal Hocko

On 11/18/2014 03:19 PM, Andrew Morton wrote:
> On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel <riel@redhat.com>
> wrote:
> 

>> That way people can understand what the code does simply by
>> looking at the changelog - no need to go find old linux-kernel
>> mailing list threads.
> 
> Yes please, there's a ton of stuff here which we should attempt to 
> capture.
> 
> https://lkml.org/lkml/2012/8/15/765 is useful.
> 
> I'm assuming that with the "foo < 5" hack, an application which
> forked 5 times then did a lot of work would still trigger the
> "catastrophic issue at page reclaim time" issue which Rik
> identified at https://lkml.org/lkml/2012/8/20/265?

It's not "forking 5 times", it is "forking >>5 generations deep".

There are a few programs that do that, but it does not appear
that they are forking servers like apache or sendmail (which
fork from the 2nd generation, and then sometimes again to exec
a helper from the 4th generation).

> There are real-world workloads which are triggering this slab
> growth problem, yes?  (Detail them in the changelog, please).

There are, but the overlap between "forks >>5 generations deep"
and "forks a bajillion child processes" appears to be zero.

- -- 
All rights reversed

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-18 23:50                                 ` Vlastimil Babka
@ 2014-11-19 14:36                                   ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 75+ messages in thread
From: Konstantin Khlebnikov @ 2014-11-19 14:36 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Rik van Riel, Michel Lespinasse, Hugh Dickins,
	Andrea Arcangeli, Linux Kernel Mailing List, linux-mm,
	Tim Hartrick, Michal Hocko

On Wed, Nov 19, 2014 at 2:50 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
> On 11/19/2014 12:02 AM, Konstantin Khlebnikov wrote:
>> On Wed, Nov 19, 2014 at 1:15 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>> On Tue, Nov 18, 2014 at 11:19 PM, Andrew Morton
>>> <akpm@linux-foundation.org> wrote:
>>>> On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel <riel@redhat.com> wrote:
>>>>
>>>>> > Because of the serial forking there does indeed end up being an
>>>>> > infinite number of vmas.  The initial vma can never be deleted
>>>>> > (even though the initial parent process has long since terminated)
>>>>> > because the initial vma is referenced by the children.
>>>>>
>>>>> There is a finite number of VMAs, but an infinite number of
>>>>> anon_vmas.
>>>>>
>>>>> Subtle, yet deadly...
>>>>
>>>> Well, we clearly have the data structures screwed up.  I've forgotten
>>>> enough about this code for me to be unable to work out what the fixed
>>>> up data structures would look like :( But surely there is some proper
>>>> solution here.  Help?
>>>
>>> Not sure if it's right but probably we could reuse on fork an old anon_vma
>>> from the chain if it's already lost all vmas which points to it.
>>> For endlessly forking exploit this should work mostly like proposed patch
>>> which stops branching after some depth but without magic constant.
>>
>> Something like this. I leave proper comment for tomorrow.
>
> Hmm I'm not sure that will work as it is. If I understand it correctly, your
> patch can detect if the parent's anon_vma has no own references at the fork()
> time. But at the fork time, the parent is still alive, it only exits after the
> fork, right? So I guess it still has own references and the child will still
> allocate its new anon_vma, and the problem is not solved.

But it could reuse an anon_vma from the grandparent or an older ancestor.
The number of anon_vmas in the chain will then be bounded by the number
of live processes.

I think it's better to describe this in terms of sets of anon_vmas
instead of a hierarchy:
at clone, a vma inherits pages from its parent together with the set of
anon_vmas to which they belong.
For new pages it might allocate a new anon_vma or reuse an existing one.
After my patch, the vma will try to reuse an anon_vma from that set which
has no vmas pointing to it.
As a result there will be no parent-child relation between anon_vmas, and
multiple pages might have the same (anon_vma, index) pair, but I see no
problems here.
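
To make the reuse idea above concrete, here is a minimal userspace sketch
in C. It is a toy model, not kernel code and not the actual patch: the
toy_* names, the flat array standing in for the inherited set, and the
assumption that every parent exits right after forking are all made up
for illustration. Only the nr_vmas counter name is borrowed from this
thread.

```c
#include <stddef.h>

#define TOY_MAX_GEN 1024	/* arbitrary cap for this toy model */

/* Toy anon_vma: only counts the vmas that use it as their own anon_vma. */
struct toy_anon_vma {
	unsigned int nr_vmas;
};

/*
 * Pick an anon_vma for a freshly cloned vma.  'set' is the set of
 * anon_vmas inherited from the parent.  Return the first one that no
 * vma points to any more (and claim it), or NULL if the caller has to
 * allocate a new one.
 */
static struct toy_anon_vma *toy_pick_reusable(struct toy_anon_vma **set,
					      size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (set[i]->nr_vmas == 0) {
			set[i]->nr_vmas++;	/* the new vma now owns it */
			return set[i];
		}
	}
	return NULL;
}

/*
 * Simulate serial forking: each process forks one child and then exits.
 * Returns how many anon_vmas had to be allocated in total.
 */
static size_t toy_simulate(size_t generations, int reuse)
{
	struct toy_anon_vma pool[TOY_MAX_GEN + 1] = { { 0 } };
	struct toy_anon_vma *set[TOY_MAX_GEN + 1];
	struct toy_anon_vma *own;
	size_t allocated = 0;

	if (generations > TOY_MAX_GEN)
		generations = TOY_MAX_GEN;

	/* The initial process allocates the first anon_vma. */
	own = &pool[allocated];
	set[allocated++] = own;
	own->nr_vmas = 1;

	for (size_t g = 0; g < generations; g++) {
		struct toy_anon_vma *child = NULL;

		if (reuse)
			child = toy_pick_reusable(set, allocated);
		if (!child) {
			child = &pool[allocated];
			set[allocated++] = child;
			child->nr_vmas = 1;
		}
		own->nr_vmas--;	/* the parent exits after the fork */
		own = child;	/* the child becomes the next parent */
	}
	return allocated;
}
```

In this model, toy_simulate(100, 0) allocates one anon_vma per
generation, while toy_simulate(100, 1) keeps alternating between the
same two, which matches the claim that the count stays bounded by the
number of live processes.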

>
> So maybe we could detect that the own references dropped to zero when the parent
> does exit, and then change mapping of all relevant pages to the root anon_vma,
> destroy avc's of children and the anon_vma itself. But that sounds quite
> heavyweight :/
>
> Vlastimil
>
>>>
>>>>
>

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-19 14:36                                   ` Konstantin Khlebnikov
@ 2014-11-19 16:09                                     ` Vlastimil Babka
  -1 siblings, 0 replies; 75+ messages in thread
From: Vlastimil Babka @ 2014-11-19 16:09 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Andrew Morton, Rik van Riel, Michel Lespinasse, Hugh Dickins,
	Andrea Arcangeli, Linux Kernel Mailing List, linux-mm,
	Tim Hartrick, Michal Hocko

On 11/19/2014 03:36 PM, Konstantin Khlebnikov wrote:
> On Wed, Nov 19, 2014 at 2:50 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
>> On 11/19/2014 12:02 AM, Konstantin Khlebnikov wrote:
>>> On Wed, Nov 19, 2014 at 1:15 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>>> On Tue, Nov 18, 2014 at 11:19 PM, Andrew Morton
>>>> <akpm@linux-foundation.org> wrote:
>>>>> On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel <riel@redhat.com> wrote:
>>>>>
>>>>>> > Because of the serial forking there does indeed end up being an
>>>>>> > infinite number of vmas.  The initial vma can never be deleted
>>>>>> > (even though the initial parent process has long since terminated)
>>>>>> > because the initial vma is referenced by the children.
>>>>>>
>>>>>> There is a finite number of VMAs, but an infinite number of
>>>>>> anon_vmas.
>>>>>>
>>>>>> Subtle, yet deadly...
>>>>>
>>>>> Well, we clearly have the data structures screwed up.  I've forgotten
>>>>> enough about this code for me to be unable to work out what the fixed
>>>>> up data structures would look like :( But surely there is some proper
>>>>> solution here.  Help?
>>>>
>>>> Not sure if it's right but probably we could reuse on fork an old anon_vma
>>>> from the chain if it's already lost all vmas which points to it.
>>>> For endlessly forking exploit this should work mostly like proposed patch
>>>> which stops branching after some depth but without magic constant.
>>>
>>> Something like this. I leave proper comment for tomorrow.
>>
>> Hmm I'm not sure that will work as it is. If I understand it correctly, your
>> patch can detect if the parent's anon_vma has no own references at the fork()
>> time. But at the fork time, the parent is still alive, it only exits after the
>> fork, right? So I guess it still has own references and the child will still
>> allocate its new anon_vma, and the problem is not solved.
> 
> But it could reuse anon_vma from grandparent or older.
> Count of anon_vmas in chain will be limited with count of alive processes.

Ah, I missed that it can reuse an older anon_vma, sorry.

> I think it's better to describe this in terms of sets of anon_vma
> instead hierarchy:
> at clone vma inherits pages from parent together with set of anon_vma
> which they belong.
> For new pages it might allocate new anon_vma or reuse existing. After
> my patch vma
> will try to reuse anon_vma from that set which has no vmas which points to it.
> As a result there will be no parent-child relation between anon_vma and
> multiple pages might have equal (anon_vma, index) pair but I see no
> problems here.

Hmm, I wonder if the root anon_vma should be excluded from this reuse. For
performance reasons, exclusive pages go to a non-root anon_vma (see
__page_set_anon_rmap()) and reusing the root anon_vma would change this.
Also, from reading http://lwn.net/Articles/383162/ I understand that correctness
also depends on the hierarchy, and I wonder if there's a danger of reintroducing
a bug like the one described there.

Vlastimil

>>
>> So maybe we could detect that the own references dropped to zero when the parent
>> does exit, and then change mapping of all relevant pages to the root anon_vma,
>> destroy avc's of children and the anon_vma itself. But that sounds quite
>> heavyweight :/
>>
>> Vlastimil
>>
>>>>
>>>>>
>>
> 


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-19 16:09                                     ` Vlastimil Babka
@ 2014-11-19 16:58                                       ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 75+ messages in thread
From: Konstantin Khlebnikov @ 2014-11-19 16:58 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Rik van Riel, Michel Lespinasse, Hugh Dickins,
	Andrea Arcangeli, Linux Kernel Mailing List, linux-mm,
	Tim Hartrick, Michal Hocko

On Wed, Nov 19, 2014 at 7:09 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> On 11/19/2014 03:36 PM, Konstantin Khlebnikov wrote:
>> On Wed, Nov 19, 2014 at 2:50 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
>>> On 11/19/2014 12:02 AM, Konstantin Khlebnikov wrote:
>>>> On Wed, Nov 19, 2014 at 1:15 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>>>> On Tue, Nov 18, 2014 at 11:19 PM, Andrew Morton
>>>>> <akpm@linux-foundation.org> wrote:
>>>>>> On Mon, 17 Nov 2014 21:41:57 -0500 Rik van Riel <riel@redhat.com> wrote:
>>>>>>
>>>>>>> > Because of the serial forking there does indeed end up being an
>>>>>>> > infinite number of vmas.  The initial vma can never be deleted
>>>>>>> > (even though the initial parent process has long since terminated)
>>>>>>> > because the initial vma is referenced by the children.
>>>>>>>
>>>>>>> There is a finite number of VMAs, but an infinite number of
>>>>>>> anon_vmas.
>>>>>>>
>>>>>>> Subtle, yet deadly...
>>>>>>
>>>>>> Well, we clearly have the data structures screwed up.  I've forgotten
>>>>>> enough about this code for me to be unable to work out what the fixed
>>>>>> up data structures would look like :( But surely there is some proper
>>>>>> solution here.  Help?
>>>>>
>>>>> Not sure if it's right but probably we could reuse on fork an old anon_vma
>>>>> from the chain if it's already lost all vmas which points to it.
>>>>> For endlessly forking exploit this should work mostly like proposed patch
>>>>> which stops branching after some depth but without magic constant.
>>>>
>>>> Something like this. I leave proper comment for tomorrow.
>>>
>>> Hmm I'm not sure that will work as it is. If I understand it correctly, your
>>> patch can detect if the parent's anon_vma has no own references at the fork()
>>> time. But at the fork time, the parent is still alive, it only exits after the
>>> fork, right? So I guess it still has own references and the child will still
>>> allocate its new anon_vma, and the problem is not solved.
>>
>> But it could reuse anon_vma from grandparent or older.
>> Count of anon_vmas in chain will be limited with count of alive processes.
>
> Ah I missed that it can reuse older anon_vma, sorry.
>
>> I think it's better to describe this in terms of sets of anon_vma
>> instead hierarchy:
>> at clone vma inherits pages from parent together with set of anon_vma
>> which they belong.
>> For new pages it might allocate new anon_vma or reuse existing. After
>> my patch vma
>> will try to reuse anon_vma from that set which has no vmas which points to it.
>> As a result there will be no parent-child relation between anon_vma and
>> multiple pages might have equal (anon_vma, index) pair but I see no
>> problems here.
>
> Hmm I wonder if root anon_vma should be excluded from this reusal. For
> performance reasons, exclusive pages go to non-root anon_vma (see
> __page_set_anon_rmap()) and reusing root anon_vma would change this.

That is simple: in my patch this can be achieved by bumping the root
anon_vma's nr_vmas by one, so that it will never be reused.
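
A minimal sketch of that bias trick, as a userspace model rather than the
real patch: the demo_* names and helpers are hypothetical, and only the
nr_vmas counter name comes from this thread. The root starts with one
extra count that nothing ever drops, so a "no vmas left" check can never
succeed for it.

```c
#include <stdbool.h>

/* Toy anon_vma with just the fields needed for the reuse check. */
struct demo_anon_vma {
	struct demo_anon_vma *root;
	unsigned int nr_vmas;	/* vmas using this as their own anon_vma */
};

/* The root gets a permanent extra count, so it never looks unused. */
static void demo_init_root(struct demo_anon_vma *av)
{
	av->root = av;
	av->nr_vmas = 1;
}

/* Non-root anon_vmas start at zero and are counted normally. */
static void demo_init_child(struct demo_anon_vma *av,
			    struct demo_anon_vma *root)
{
	av->root = root;
	av->nr_vmas = 0;
}

static void demo_attach_vma(struct demo_anon_vma *av)
{
	av->nr_vmas++;
}

static void demo_detach_vma(struct demo_anon_vma *av)
{
	av->nr_vmas--;
}

/* Reusable only once no vma points to it any more. */
static bool demo_can_reuse(const struct demo_anon_vma *av)
{
	return av->nr_vmas == 0;
}
```

With this, attaching and detaching a vma on the root leaves its count at
one, so demo_can_reuse() stays false for the root while still becoming
true for a child anon_vma whose last vma has gone away.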

> Also from reading http://lwn.net/Articles/383162/ I understand that correctness
> also depends on the hierarchy and I wonder if there's a danger of reintroducing
> a bug like the one described there.

If I remember right, that was fixed by linking non-exclusively mapped pages to
the root anon_vma instead of the anon_vma of the vma where the fault happened.
After my patch this still works. The topological hierarchy actually isn't used.
There is just one selected "root" anon_vma, which dies last. That's all.

>
> Vlastimil
>
>>>
>>> So maybe we could detect that the own references dropped to zero when the parent
>>> does exit, and then change mapping of all relevant pages to the root anon_vma,
>>> destroy avc's of children and the anon_vma itself. But that sounds quite
>>> heavyweight :/
>>>
>>> Vlastimil
>>>
>>>>>
>>>>>>
>>>
>>
>

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-19 16:58                                       ` Konstantin Khlebnikov
@ 2014-11-19 23:14                                         ` Michel Lespinasse
  -1 siblings, 0 replies; 75+ messages in thread
From: Michel Lespinasse @ 2014-11-19 23:14 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Vlastimil Babka, Andrew Morton, Rik van Riel, Hugh Dickins,
	Andrea Arcangeli, Linux Kernel Mailing List, linux-mm,
	Tim Hartrick, Michal Hocko

On Wed, Nov 19, 2014 at 8:58 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> On Wed, Nov 19, 2014 at 7:09 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>> Also from reading http://lwn.net/Articles/383162/ I understand that correctness
>> also depends on the hierarchy and I wonder if there's a danger of reintroducing
>> a bug like the one described there.
>
> If I remember right that was fixed by linking non-exclusively mapped pages to
> root anon_vma instead of anon_vma from vma where fault has happened.
> After my patch this still works. Topology hierarchy actually isn't used.
> Here just one selected "root' anon_vma which dies last. That's all.

That's not how I remember it.

An anon_vma corresponds to a given vma V, and is used to track all
vmas (V and descendant vmas) that may include a page that was
originally mapped in V.

Each anon page has a link to the anon_vma corresponding to the vma
it was originally faulted in, and an offset indicating where the
page was located relative to that original VMA.

The anon_vma has an interval tree of struct anon_vma_chain, and each
struct anon_vma_chain includes a link to a descendant-of-V vma. This
allows rmap to quickly find all the vmas that may map a given page
(based on the page's anon_vma and offset).

When forking or splitting vmas, the new vma is a descendant of the
same vmas as the old one, so it must be added to all the anon_vma
interval trees that were referencing the old one (that is, ancestors
of the new vma). To that end, all the struct anon_vma_chain pointing
to a given vma are kept on a linked list, and struct anon_vma_chain
includes a link to the anon_vma holding the interval tree.

Locking the entire structure is done with a single lock hosted in the
root anon_vma (that is, the anon_vma of a vma that was created by mmap()
and not by cloning or forking existing vmas).
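
The relationships described above can be sketched as a small userspace
model in C. This is only an illustration under simplifying assumptions:
the model_* names are hypothetical, a plain linked list stands in for
the interval tree, and locking and reference counting are left out.

```c
#include <stddef.h>

struct model_vma;
struct model_anon_vma;

/* One link between a vma and an anon_vma that may hold its pages. */
struct model_avc {
	struct model_vma *vma;
	struct model_anon_vma *anon_vma;
	struct model_avc *next_same_vma;	/* list headed at vma->chain */
	struct model_avc *next_same_anon_vma;	/* stand-in for the interval tree */
};

struct model_anon_vma {
	struct model_anon_vma *root;	/* would host the lock */
	struct model_avc *vmas;		/* every vma that may map our pages */
};

struct model_vma {
	struct model_anon_vma *anon_vma;	/* where newly faulted pages go */
	struct model_avc *chain;		/* every anon_vma we are linked to */
};

/* Put 'avc' on both the vma's list and the anon_vma's list. */
static void model_link(struct model_avc *avc, struct model_vma *vma,
		       struct model_anon_vma *av)
{
	avc->vma = vma;
	avc->anon_vma = av;
	avc->next_same_vma = vma->chain;
	vma->chain = avc;
	avc->next_same_anon_vma = av->vmas;
	av->vmas = avc;
}

/*
 * Fork: the child vma joins every anon_vma its parent is linked to and
 * then gets an anon_vma of its own.  'avcs' needs room for one entry
 * per parent link plus one.  Returns the number of entries used.
 */
static size_t model_fork(struct model_vma *child,
			 struct model_anon_vma *child_av,
			 const struct model_vma *parent,
			 struct model_avc *avcs)
{
	size_t used = 0;

	child->chain = NULL;
	for (const struct model_avc *a = parent->chain; a; a = a->next_same_vma)
		model_link(&avcs[used++], child, a->anon_vma);

	child_av->root = parent->anon_vma->root;
	child_av->vmas = NULL;
	child->anon_vma = child_av;
	model_link(&avcs[used++], child, child_av);
	return used;
}

/* Number of vmas rmap would have to visit for a page linked to 'av'. */
static size_t model_rmap_candidates(const struct model_anon_vma *av)
{
	size_t n = 0;

	for (const struct model_avc *a = av->vmas; a; a = a->next_same_anon_vma)
		n++;
	return n;
}
```

After forking a child and then a grandchild in this model, the root
anon_vma lists three vmas, the child's lists two and the grandchild's
lists one, which is the sense in which pages linked to an ancestor's
anon_vma cost rmap more to look up.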

Limiting the length of the ancestors linked list is correct, though it
has performance implications. In the extreme case, forcing all vmas to
be added to the root anon_vma's interval tree would be correct, though it
may re-introduce the performance problems that led to the
introduction of anon_vma.

The good thing about Konstantin's proposal is that it does not have
any magic constant like mine did. However, I think he is mistaken in
saying that the hierarchy isn't used - an ancestor vma will always have
more descendants than its children, and the reason for the hierarchy
is to limit the number of vmas that rmap must explore.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-19 23:14                                         ` Michel Lespinasse
@ 2014-11-20 14:42                                           ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 75+ messages in thread
From: Konstantin Khlebnikov @ 2014-11-20 14:42 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Vlastimil Babka, Andrew Morton, Rik van Riel, Hugh Dickins,
	Andrea Arcangeli, Linux Kernel Mailing List, linux-mm,
	Tim Hartrick, Michal Hocko

On Thu, Nov 20, 2014 at 2:14 AM, Michel Lespinasse <walken@google.com> wrote:
> On Wed, Nov 19, 2014 at 8:58 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>> On Wed, Nov 19, 2014 at 7:09 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>>> Also from reading http://lwn.net/Articles/383162/ I understand that correctness
>>> also depends on the hierarchy and I wonder if there's a danger of reintroducing
>>> a bug like the one described there.
>>
>> If I remember right that was fixed by linking non-exclusively mapped pages to
>> root anon_vma instead of anon_vma from vma where fault has happened.
>> After my patch this still works. Topology hierarchy actually isn't used.
>> Here just one selected "root' anon_vma which dies last. That's all.
>
> That's not how I remember it.

??? That is at the end of the LWN article:

[quote]
The fix is straightforward; when linking an existing page to an
anon_vma structure,
the kernel needs to pick the one which is highest in the process hierarchy;
that guarantees that the anon_vma will not go away prematurely.
[/quote]

Nowadays this happens in __page_set_anon_rmap():

	/*
	 * If the page isn't exclusively mapped into this vma,
	 * we must use the _oldest_ possible anon_vma for the
	 * page mapping!
	 */
	if (!exclusive)
		anon_vma = anon_vma->root;

The rest of the tree-like topology affects only performance.
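
For readers without the source at hand, the rule in that snippet can be
shown as a self-contained userspace sketch. The sketch_* types and the
function below are hypothetical stand-ins, not the kernel's definitions;
only the exclusive/root decision is taken from the snippet.

```c
#include <stdbool.h>

/* Toy anon_vma: only the root pointer matters for this rule. */
struct sketch_anon_vma {
	struct sketch_anon_vma *root;
};

/* Toy page: remembers which anon_vma it was linked to. */
struct sketch_page {
	struct sketch_anon_vma *anon_vma;
};

/*
 * Link 'page' to an anon_vma.  A page mapped exclusively by this vma
 * goes to the vma's own anon_vma; a possibly shared page goes to the
 * root, which is the last anon_vma of the family to be freed.
 */
static void sketch_set_anon_rmap(struct sketch_page *page,
				 struct sketch_anon_vma *vma_anon_vma,
				 bool exclusive)
{
	struct sketch_anon_vma *anon_vma = vma_anon_vma;

	if (!exclusive)
		anon_vma = anon_vma->root;
	page->anon_vma = anon_vma;
}
```

In this sketch a shared page never points at an anon_vma that can
disappear before the root does, regardless of how the non-root anon_vmas
are arranged or reused.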

>
> An anon_vma corresponds to a given vma V, and is used to track all
> vmas (V and descendant vmas) that may include a page that was
> originally mapped in V.
>
> Each anon page has a link to the anon_vma corresponding to the vma
> they were originally faulted in, and an offset indicating where the
> page was located relative to that original VMA.
>
> The anon_vma has an interval tree of struct anon_vma_chain, and each
> struct anon_vma_chain includes a link to a descendent-of-V vma. This
> allows rmap to quickly find all the vmas that may map a given page
> (based on the page's anon_vma and offset).
>
> When forking or splitting vmas, the new vma is a descendent of the
> same vmas as the old one so it must be added to all the anon_vma
> interval trees that were referencing the old one (that is, ancestors
> of the new vma). To that end, all the struct anon_vma_chain pointing
> to a given vma are kept on a linked list, and struct anon_vma_chain
> includes a link to the anon_vma holding the interval tree.
>
> Locking the entire structure is done with a single lock hosted in the
> root anon_vma (that is, a vma that was created by mmap() and not by
> cloning or forking existing vmas).
>
> Limit the length of the ancestors linked list is correct, though it
> has performance implications. In the extreme case, forcing all vmas to
> be added on the root vma's interval tree would be correct, though it
> may re-introduce the performance problems that lead to the
> introduction of anon_vma.
>
> The good thing about Konstantin's proposal is that it does not have
> any magic constant like mine did. However, I think he is mistaken in
> saying that hierarchy isn't used - an ancestor vma will always have
> more descendents than its children, and the reason for the hierarchy
> is to limit the number of vmas that rmap must explore.

I mean that after breaking the hierarchy the whole structure stays correct and
the kernel wouldn't explode. Of course, reusing an anon_vma from an ancestor
makes the rmap walk less efficient, because newly allocated pages will be
falsely aliased to vmas where they will never be mapped.
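
To make the false-aliasing cost concrete, here is a toy model, assuming the walk simply visits every vma attached to the page's anon_vma. All names (walk_stats, rmap_walk, mapped_in, the fixed-size array standing in for the interval tree) are invented for the illustration and are not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

struct vma { int id; };

/* Toy anon_vma: a small array stands in for the kernel's interval tree. */
struct anon_vma {
	const struct vma *vmas[8];
	size_t nr_vmas;
};

/* Toy page: the anon_vma it is linked to and the one vma that maps it. */
struct page {
	const struct anon_vma *anon_vma;
	const struct vma *mapped_in;
};

struct walk_stats {
	size_t visited;	/* vmas the walk had to look at */
	size_t hits;	/* vmas that really map the page */
};

/* Visit every vma attached to the page's anon_vma, as an rmap walk would. */
static struct walk_stats rmap_walk(const struct page *page)
{
	struct walk_stats st = { 0, 0 };

	assert(page != NULL && page->anon_vma != NULL);
	for (size_t i = 0; i < page->anon_vma->nr_vmas; i++) {
		st.visited++;
		if (page->anon_vma->vmas[i] == page->mapped_in)
			st.hits++;
	}
	return st;
}
```

If a child's new page is linked to the child's own anon_vma (one attached vma), the walk visits one vma. If the same page is linked to a reused ancestor anon_vma that also holds the parent's vma, the walk visits two vmas for the same single hit; the extra visit is the false aliasing.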


I'm thinking about a limitation on reusing anon_vmas which might improve
performance without breaking the asymptotic bound on the number of anon_vmas
in the worst case. For example, this heuristic: allow reusing only an anon_vma
with a single direct descendant. It seems there will be up to around two times
more anon_vmas, but false aliasing should be much lower.



>
> --
> Michel "Walken" Lespinasse
> A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-20 14:42                                           ` Konstantin Khlebnikov
@ 2014-11-20 14:50                                             ` Rik van Riel
  -1 siblings, 0 replies; 75+ messages in thread
From: Rik van Riel @ 2014-11-20 14:50 UTC (permalink / raw)
  To: Konstantin Khlebnikov, Michel Lespinasse
  Cc: Vlastimil Babka, Andrew Morton, Hugh Dickins, Andrea Arcangeli,
	Linux Kernel Mailing List, linux-mm, Tim Hartrick, Michal Hocko

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote:

> I'm thinking about a limitation on reusing anon_vmas which might
> improve performance without breaking the asymptotic bound on the
> number of anon_vmas in the worst case. For example, this heuristic:
> allow reusing only an anon_vma with a single direct descendant. It
> seems there will be up to around two times more anon_vmas, but
> false aliasing should be much lower.

It may even be possible to not create a child anon_vma for the
first child a parent forks, but only create a new anon_vma once
the parent clones a second child (alive at the same time as the
first child).

That still takes care of things like apache or sendmail, but
would not create infinite anon_vmas for a task that keeps forking
itself to infinite depth without calling exec...

- -- 
All rights reversed
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJUbf+hAAoJEM553pKExN6DxhQH/1QL+9GdhaSx7EQnRcbDRcHi
GuEfMU0g9Kv4ad+oPSQnH/L7vJMJAYeh5ZJGH+rOykWHp3sGReqDZOnzpXRAe11z
1cSC1BJsndzrv9wX8niFpuKpYbF0IP+ckv3qaEzWtm5yCRyhHVZfr6b794Y4K9jF
z2EPPu1vAAldbkx1VlYTwofBA5lESL5UmrFvH4ouI7BeWYSEe6BgVCbvK+K5fANT
ketdA5R08xyUAcXDa+28qpBYkdWnxNhwqseDoXCW8SOFNwWbLDI6GRfrsCNku13i
Gi41h3uEuIAGDf+AU/GMjiymgwutCOGq+cfZlszELaRvHmDpNGYdPv1llghNg7Q=
=Vk+H
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-20 14:50                                             ` Rik van Riel
@ 2014-11-20 15:03                                               ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 75+ messages in thread
From: Konstantin Khlebnikov @ 2014-11-20 15:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Michel Lespinasse, Vlastimil Babka, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, Linux Kernel Mailing List, linux-mm,
	Tim Hartrick, Michal Hocko

On Thu, Nov 20, 2014 at 5:50 PM, Rik van Riel <riel@redhat.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote:
>
>> I'm thinking about a limitation on reusing anon_vmas which might
>> improve performance without breaking the asymptotic bound on the
>> number of anon_vmas in the worst case. For example, this heuristic:
>> allow reusing only an anon_vma with a single direct descendant. It
>> seems there will be up to around two times more anon_vmas, but
>> false aliasing should be much lower.
>
> It may even be possible to not create a child anon_vma for the
> first child a parent forks, but only create a new anon_vma once
> the parent clones a second child (alive at the same time as the
> first child).
>
> That still takes care of things like apache or sendmail, but
> would not create infinite anon_vmas for a task that keeps forking
> itself to infinite depth without calling exec...

But this scheme is still exploitable. Malicious software could easily create
a sequence of forks and exits which leads to an infinite chain of anon_vmas.

>
> - --
> All rights reversed
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1
>
> iQEcBAEBAgAGBQJUbf+hAAoJEM553pKExN6DxhQH/1QL+9GdhaSx7EQnRcbDRcHi
> GuEfMU0g9Kv4ad+oPSQnH/L7vJMJAYeh5ZJGH+rOykWHp3sGReqDZOnzpXRAe11z
> 1cSC1BJsndzrv9wX8niFpuKpYbF0IP+ckv3qaEzWtm5yCRyhHVZfr6b794Y4K9jF
> z2EPPu1vAAldbkx1VlYTwofBA5lESL5UmrFvH4ouI7BeWYSEe6BgVCbvK+K5fANT
> ketdA5R08xyUAcXDa+28qpBYkdWnxNhwqseDoXCW8SOFNwWbLDI6GRfrsCNku13i
> Gi41h3uEuIAGDf+AU/GMjiymgwutCOGq+cfZlszELaRvHmDpNGYdPv1llghNg7Q=
> =Vk+H
> -----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-20 14:42                                           ` Konstantin Khlebnikov
@ 2014-11-20 15:27                                             ` Michel Lespinasse
  -1 siblings, 0 replies; 75+ messages in thread
From: Michel Lespinasse @ 2014-11-20 15:27 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Vlastimil Babka, Andrew Morton, Rik van Riel, Hugh Dickins,
	Andrea Arcangeli, Linux Kernel Mailing List, linux-mm,
	Tim Hartrick, Michal Hocko

On Thu, Nov 20, 2014 at 3:42 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> On Thu, Nov 20, 2014 at 2:14 AM, Michel Lespinasse <walken@google.com> wrote:
>> On Wed, Nov 19, 2014 at 8:58 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>> On Wed, Nov 19, 2014 at 7:09 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>>>> Also from reading http://lwn.net/Articles/383162/ I understand that correctness
>>>> also depends on the hierarchy and I wonder if there's a danger of reintroducing
>>>> a bug like the one described there.
>>>
>>> If I remember right that was fixed by linking non-exclusively mapped pages to
>>> root anon_vma instead of anon_vma from vma where fault has happened.
>>> After my patch this still works. Topology hierarchy actually isn't used.
>>> Here just one selected "root' anon_vma which dies last. That's all.
>>
>> That's not how I remember it.
>
> ??? That's at the end of the LWN article:
>
> [quote]
> The fix is straightforward; when linking an existing page to an
> anon_vma structure,
> the kernel needs to pick the one which is highest in the process hierarchy;
> that guarantees that the anon_vma will not go away prematurely.
> [/quote]
>
> Nowadays this happens in __page_set_anon_rmap():
>
> /*
>  * If the page isn't exclusively mapped into this vma,
>  * we must use the _oldest_ possible anon_vma for the
>  * page mapping!
>  */
> if (!exclusive)
>     anon_vma = anon_vma->root;
>
> The rest of the tree-like topology affects only performance.

Ah, I see what you mean.

IIRC the !exclusive bit is for pages coming back from swap, where we
don't have enough tracking info to remember where the page was first
created so we have to assume the worst case (i.e. that it was created
in the root anon_vma). My understanding was that we don't exercise
this in the non-swap case. Looking back into it, it seems that we are
now doing this with ksm and migrate as well, though.

The point remains though that moving pages higher than necessary in
the anon_vma hierarchy is OK from a correctness perspective but could
have bad implications from a performance perspective.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-20 15:03                                               ` Konstantin Khlebnikov
  (?)
@ 2014-11-24  7:09                                               ` Konstantin Khlebnikov
  2014-11-25 10:59                                                   ` Michal Hocko
  -1 siblings, 1 reply; 75+ messages in thread
From: Konstantin Khlebnikov @ 2014-11-24  7:09 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Michel Lespinasse, Vlastimil Babka, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, Linux Kernel Mailing List, linux-mm,
	Tim Hartrick, Michal Hocko

[-- Attachment #1: Type: text/plain, Size: 2185 bytes --]

On Thu, Nov 20, 2014 at 6:03 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> On Thu, Nov 20, 2014 at 5:50 PM, Rik van Riel <riel@redhat.com> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote:
>>
>>> I'm thinking about a limitation on reusing anon_vmas which might
>>> improve performance without breaking the asymptotic bound on the
>>> number of anon_vmas in the worst case. For example, this heuristic:
>>> allow reusing only an anon_vma with a single direct descendant. It
>>> seems there will be up to around two times more anon_vmas, but
>>> false aliasing should be much lower.

Done. RFC patch in attachment.

This patch adds a heuristic which decides whether to reuse an existing anon_vma
instead of forking a new one. It counts vmas and direct descendants for each
anon_vma. An anon_vma with degree lower than two will be reused at the next fork.
As a result, each anon_vma has either a live vma or at least two descendants,
endless chains are no longer possible, and the count of anon_vmas is no more
than twice the count of vmas.
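
The heuristic can be sketched as a small user-space model. This is only an illustration of the fork and exit paths with one vma per process; vma_prepare(), vma_fork(), vma_exit(), fork_chain(), nr_vmas and MAX_CHAIN are made-up names, not kernel interfaces, and the sketch ignores vma split/merge, locking and refcounts that the attached patch has to deal with.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_CHAIN 64	/* arbitrary cap for this toy model */

struct anon_vma {
	struct anon_vma *parent;	/* the root is its own parent */
	unsigned degree;		/* child anon_vmas + vmas pointing here */
	unsigned nr_vmas;		/* stand-in for the interval tree size */
};

struct vma {
	struct anon_vma *anon_vma;		/* where new pages would be linked */
	struct anon_vma *chain[MAX_CHAIN];	/* every anon_vma we sit in, root first */
	int nr_chain;
};

static int live_anon_vmas;

static struct anon_vma *av_alloc(void)
{
	struct anon_vma *av = calloc(1, sizeof(*av));

	assert(av != NULL);
	av->degree = 1;		/* reference for the first vma */
	av->parent = av;
	live_anon_vmas++;
	return av;
}

static void chain_link(struct vma *vma, struct anon_vma *av)
{
	assert(vma->nr_chain < MAX_CHAIN);
	vma->chain[vma->nr_chain++] = av;
	av->nr_vmas++;
}

/* First fault in a fresh process: create the root anon_vma. */
static void vma_prepare(struct vma *vma)
{
	struct anon_vma *av = av_alloc();

	av->degree++;		/* the root counts as its own child */
	vma->anon_vma = av;
	vma->nr_chain = 0;
	chain_link(vma, av);
}

/* fork(): attach dst to all of src's anon_vmas, then reuse or allocate. */
static void vma_fork(struct vma *dst, const struct vma *src, int reuse)
{
	struct anon_vma *av;

	dst->anon_vma = NULL;
	dst->nr_chain = 0;
	for (int i = 0; i < src->nr_chain; i++) {
		av = src->chain[i];
		chain_link(dst, av);
		if (reuse && !dst->anon_vma && av != src->anon_vma && av->degree < 2)
			dst->anon_vma = av;	/* no vma and at most one child */
	}
	if (dst->anon_vma) {
		dst->anon_vma->degree++;
		return;
	}
	av = av_alloc();
	av->parent = src->anon_vma;
	av->parent->degree++;
	dst->anon_vma = av;
	chain_link(dst, av);
}

/* exit(): detach the vma and free every anon_vma that became empty. */
static void vma_exit(struct vma *vma)
{
	struct anon_vma *empty[MAX_CHAIN];
	int nr_empty = 0;

	for (int i = 0; i < vma->nr_chain; i++) {
		struct anon_vma *av = vma->chain[i];

		if (--av->nr_vmas == 0) {
			av->parent->degree--;
			empty[nr_empty++] = av;
		}
	}
	vma->anon_vma->degree--;
	for (int i = 0; i < nr_empty; i++) {
		assert(empty[i]->degree == 0);	/* same invariant as the BUG_ON */
		free(empty[i]);
		live_anon_vmas--;
	}
}

/* Each parent exits right after forking; returns anon_vmas alive at the end. */
int fork_chain(int generations, int reuse)
{
	struct vma cur, next;
	int alive;

	vma_prepare(&cur);
	for (int g = 0; g < generations; g++) {
		vma_fork(&next, &cur, reuse);
		vma_exit(&cur);
		cur = next;
	}
	alive = live_anon_vmas;
	vma_exit(&cur);
	return alive;
}
```

In this model fork_chain(20, 0) leaves 21 anon_vmas alive before the last process exits, one more per generation, while fork_chain(20, 1) levels off at 3 because the two non-root anon_vmas are reused alternately.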


>>
>> It may even be possible to not create a child anon_vma for the
>> first child a parent forks, but only create a new anon_vma once
>> the parent clones a second child (alive at the same time as the
>> first child).
>>
>> That still takes care of things like apache or sendmail, but
>> would not create infinite anon_vmas for a task that keeps forking
>> itself to infinite depth without calling exec...
>
> But this scheme is still exploitable. Malicious software easily could create
> sequence of forks and exits which leads to infinite chain of anon_vmas.
>
>>
>> - --
>> All rights reversed
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1
>>
>> iQEcBAEBAgAGBQJUbf+hAAoJEM553pKExN6DxhQH/1QL+9GdhaSx7EQnRcbDRcHi
>> GuEfMU0g9Kv4ad+oPSQnH/L7vJMJAYeh5ZJGH+rOykWHp3sGReqDZOnzpXRAe11z
>> 1cSC1BJsndzrv9wX8niFpuKpYbF0IP+ckv3qaEzWtm5yCRyhHVZfr6b794Y4K9jF
>> z2EPPu1vAAldbkx1VlYTwofBA5lESL5UmrFvH4ouI7BeWYSEe6BgVCbvK+K5fANT
>> ketdA5R08xyUAcXDa+28qpBYkdWnxNhwqseDoXCW8SOFNwWbLDI6GRfrsCNku13i
>> Gi41h3uEuIAGDf+AU/GMjiymgwutCOGq+cfZlszELaRvHmDpNGYdPv1llghNg7Q=
>> =Vk+H
>> -----END PGP SIGNATURE-----

[-- Attachment #2: mm-prevent-endless-growth-of-anon_vma-hierarchy --]
[-- Type: application/octet-stream, Size: 5437 bytes --]

mm: prevent endless growth of anon_vma hierarchy

From: Konstantin Khlebnikov <koct9i@gmail.com>

A constantly forking task causes unlimited growth of the anon_vma chain.
Each new child allocates a new level of anon_vmas and links its vmas to all
previous levels because it inherits pages from them. None of these anon_vmas
can be freed because there might be pages which point to them.

This patch adds a heuristic which decides whether to reuse an existing anon_vma
instead of forking a new one. It counts vmas and direct descendants for each
anon_vma. An anon_vma with degree lower than two will be reused at the next fork.
As a result, each anon_vma has either a live vma or at least two descendants,
endless chains are no longer possible, and the count of anon_vmas is no more
than twice the count of vmas.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu
---
 include/linux/rmap.h |   16 ++++++++++++++++
 mm/rmap.c            |   30 +++++++++++++++++++++++++++++-
 2 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c0c2bce..b1d140c 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -45,6 +45,22 @@ struct anon_vma {
 	 * mm_take_all_locks() (mm_all_locks_mutex).
 	 */
 	struct rb_root rb_root;	/* Interval tree of private "related" vmas */
+
+	/*
+	 * Count of child anon_vmas and VMAs which point to this anon_vma.
+	 *
+	 * This counter is used to decide whether to reuse an old anon_vma
+	 * instead of forking a new one. It allows detecting anon_vmas which
+	 * have just one direct descendant and no vmas. Reusing such an anon_vma
+	 * does not lead to a significant performance regression but prevents
+	 * degradation of the anon_vma hierarchy to an endless linear chain.
+	 *
+	 * Root anon_vma is never reused because it is its own parent and it has
+	 * at least one vma or child, thus at fork its degree is at least 2.
+	 */
+	unsigned degree;
+
+	struct anon_vma *parent;	/* Parent of this anon_vma */
 };
 
 /*
diff --git a/mm/rmap.c b/mm/rmap.c
index 19886fb..ba29e1c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -72,6 +72,8 @@ static inline struct anon_vma *anon_vma_alloc(void)
 	anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
 	if (anon_vma) {
 		atomic_set(&anon_vma->refcount, 1);
+		anon_vma->degree = 1;	/* Reference for first vma */
+		anon_vma->parent = anon_vma;
 		/*
 		 * Initialise the anon_vma root to point to itself. If called
 		 * from fork, the root will be reset to the parents anon_vma.
@@ -180,6 +182,8 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 			if (unlikely(!anon_vma))
 				goto out_enomem_free_avc;
 			allocated = anon_vma;
+			/* Bump degree, root anon_vma is its own parent. */
+			anon_vma->degree++;
 		}
 
 		anon_vma_lock_write(anon_vma);
@@ -256,7 +260,17 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 		anon_vma = pavc->anon_vma;
 		root = lock_anon_vma_root(root, anon_vma);
 		anon_vma_chain_link(dst, avc, anon_vma);
+
+		/*
+		 * Reuse existing anon_vma if its degree is lower than two,
+		 * that means it has no vma and just one anon_vma child.
+		 */
+		if (!dst->anon_vma && anon_vma != src->anon_vma &&
+				anon_vma->degree < 2)
+			dst->anon_vma = anon_vma;
 	}
+	if (dst->anon_vma)
+		dst->anon_vma->degree++;
 	unlock_anon_vma_root(root);
 	return 0;
 
@@ -279,6 +293,9 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	if (!pvma->anon_vma)
 		return 0;
 
+	/* Drop inherited anon_vma, we'll reuse old one or allocate new. */
+	vma->anon_vma = NULL;
+
 	/*
 	 * First, attach the new VMA to the parent VMA's anon_vmas,
 	 * so rmap can find non-COWed pages in child processes.
@@ -286,6 +303,10 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	if (anon_vma_clone(vma, pvma))
 		return -ENOMEM;
 
+	/* An old anon_vma has been reused. */
+	if (vma->anon_vma)
+		return 0;
+
 	/* Then add our own anon_vma. */
 	anon_vma = anon_vma_alloc();
 	if (!anon_vma)
@@ -299,6 +320,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	 * lock any of the anon_vmas in this anon_vma tree.
 	 */
 	anon_vma->root = pvma->anon_vma->root;
+	anon_vma->parent = pvma->anon_vma;
 	/*
 	 * With refcounts, an anon_vma can stay around longer than the
 	 * process it belongs to. The root anon_vma needs to be pinned until
@@ -309,6 +331,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	vma->anon_vma = anon_vma;
 	anon_vma_lock_write(anon_vma);
 	anon_vma_chain_link(vma, avc, anon_vma);
+	anon_vma->parent->degree++;
 	anon_vma_unlock_write(anon_vma);
 
 	return 0;
@@ -339,12 +362,16 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
 		 * Leave empty anon_vmas on the list - we'll need
 		 * to free them outside the lock.
 		 */
-		if (RB_EMPTY_ROOT(&anon_vma->rb_root))
+		if (RB_EMPTY_ROOT(&anon_vma->rb_root)) {
+			anon_vma->parent->degree--;
 			continue;
+		}
 
 		list_del(&avc->same_vma);
 		anon_vma_chain_free(avc);
 	}
+	if (vma->anon_vma)
+		vma->anon_vma->degree--;
 	unlock_anon_vma_root(root);
 
 	/*
@@ -355,6 +382,7 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
 	list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
 		struct anon_vma *anon_vma = avc->anon_vma;
 
+		BUG_ON(anon_vma->degree);
 		put_anon_vma(anon_vma);
 
 		list_del(&avc->same_vma);

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-24  7:09                                               ` Konstantin Khlebnikov
@ 2014-11-25 10:59                                                   ` Michal Hocko
  0 siblings, 0 replies; 75+ messages in thread
From: Michal Hocko @ 2014-11-25 10:59 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Rik van Riel, Michel Lespinasse, Vlastimil Babka, Andrew Morton,
	Hugh Dickins, Andrea Arcangeli, Linux Kernel Mailing List,
	linux-mm, Tim Hartrick

On Mon 24-11-14 11:09:40, Konstantin Khlebnikov wrote:
> On Thu, Nov 20, 2014 at 6:03 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> > On Thu, Nov 20, 2014 at 5:50 PM, Rik van Riel <riel@redhat.com> wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA1
> >>
> >> On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote:
> >>
> >>> I'm thinking about a limitation on reusing anon_vmas which might
> >>> improve performance without breaking the asymptotic bound on the
> >>> number of anon_vmas in the worst case. For example, this heuristic:
> >>> allow reusing only an anon_vma with a single direct descendant. It
> >>> seems there will be up to around two times more anon_vmas, but
> >>> false aliasing should be much lower.
> 
> Done. RFC patch in attachment.

This is triggering BUG_ON(anon_vma->degree); in unlink_anon_vmas. I have
applied the patch on top of 3.18.0-rc6. 

[   12.380189] ------------[ cut here ]------------
[   12.380221] kernel BUG at mm/rmap.c:385!
[   12.380239] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[   12.380272] Modules linked in: i915 cfbfillrect cfbimgblt i2c_algo_bit fbcon bitblit softcursor cfbcopyarea font drm_kms_helper drm fb fbdev binfmt_misc fuse uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core v4l2_common sdhci_pci iwldvm videodev media mac80211 i2c_i801 i2c_core sdhci mmc_core iwlwifi cfg80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm video backlight snd_timer snd
[   12.380518] CPU: 1 PID: 3704 Comm: kdm_greet Not tainted 3.18.0-rc6-test-00001-gf5bc00c103ff #409
[   12.380554] Hardware name: Dell Inc. Latitude E6320/09PHH9, BIOS A08 10/18/2011
[   12.380584] task: ffff8801272bc2c0 ti: ffff8800bcaf0000 task.ti: ffff8800bcaf0000
[   12.380614] RIP: 0010:[<ffffffff81125f09>]  [<ffffffff81125f09>] unlink_anon_vmas+0x12b/0x169
[   12.380653] RSP: 0018:ffff8800bcaf3d28  EFLAGS: 00010286
[   12.380676] RAX: ffff8800bcb3e690 RBX: ffff8800bcb35e28 RCX: ffff8801272bcb60
[   12.380706] RDX: ffff8800bcb38e70 RSI: 0000000000000001 RDI: ffff8800bcb38e70
[   12.380734] RBP: ffff8800bcaf3d78 R08: 0000000000000000 R09: 0000000000000000
[   12.380764] R10: 0000000000000000 R11: ffff8800bcb3e6a0 R12: ffff8800bcb3e680
[   12.380793] R13: ffff8800bcb3e690 R14: ffff8800bcb38e70 R15: ffff8800bcb38e70
[   12.380822] FS:  0000000000000000(0000) GS:ffff88012d440000(0000) knlGS:0000000000000000
[   12.380855] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   12.380880] CR2: 00007fcd2603b0e8 CR3: 0000000001a11000 CR4: 00000000000407e0
[   12.380908] Stack:
[   12.380918]  ffff8801272e9dc0 ffff8800bcb35e38 ffff8800bcb35e38 ffff8800bcb3e680
[   12.380953]  ffff8800bcaf3d78 ffff8800bcb35dc0 ffff8800bcaf3dd8 0000000000000000
[   12.380989]  0000000000000000 ffff8800bcb35dc0 ffff8800bcaf3dc8 ffffffff81119e26
[   12.381024] Call Trace:
[   12.381038]  [<ffffffff81119e26>] free_pgtables+0x8e/0xcc
[   12.381062]  [<ffffffff81121ac1>] exit_mmap+0x84/0x123
[   12.381086]  [<ffffffff8103ff09>] mmput+0x5e/0xbb
[   12.381107]  [<ffffffff81044d8c>] do_exit+0x39c/0x97e
[   12.381131]  [<ffffffff810f49b4>] ? context_tracking_user_exit+0x79/0x116
[   12.381160]  [<ffffffff8127f43a>] ? __this_cpu_preempt_check+0x13/0x15
[   12.381188]  [<ffffffff810453f1>] do_group_exit+0x4c/0xc9
[   12.381212]  [<ffffffff81045482>] SyS_exit_group+0x14/0x14
[   12.381238]  [<ffffffff81524f52>] system_call_fastpath+0x12/0x17
[   12.381262] Code: 32 f5 ff 49 8b 45 78 48 8b 18 4c 8d 60 f0 48 83 eb 10 4d 8d 6c 24 10 4c 3b 6d b8 74 3d 49 8b 7c 24 08 83 bf 98 00 00 00 00 74 02 <0f> 0b f0 ff 8f 88 00 00 00 74 1d 4c 89 ef e8 61 96 15 00 4c 89 
[   12.381445] RIP  [<ffffffff81125f09>] unlink_anon_vmas+0x12b/0x169
[   12.381473]  RSP <ffff8800bcaf3d28>
[   12.386659] ---[ end trace 5761ee18fca12427 ]---
[   12.386662] Fixing recursive fault but reboot is needed!
[   13.158240] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X
[   13.259294] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X
[   13.259468] IPv6: ADDRCONF(NETDEV_UP): lan0: link is not ready
[   16.790917] e1000e: lan0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[   16.790957] IPv6: ADDRCONF(NETDEV_CHANGE): lan0: link becomes ready
[   18.846524] iwlwifi 0000:02:00.0: L1 Enabled - LTR Disabled
[   18.846742] iwlwifi 0000:02:00.0: Radio type=0x0-0x3-0x1
[   18.941594] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[   19.145595] e1000e: lan0 NIC Link is Down
[   19.287399] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X
[   19.391325] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X
[   19.391475] IPv6: ADDRCONF(NETDEV_UP): lan0: link is not ready
[   19.573640] e1000e: lan0 NIC Link is Down
[   19.717813] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X
[   19.819729] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X
[   19.819883] IPv6: ADDRCONF(NETDEV_UP): lan0: link is not ready
[   22.938849] e1000e: lan0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[   22.938889] IPv6: ADDRCONF(NETDEV_CHANGE): lan0: link becomes ready
[   23.404027] ------------[ cut here ]------------
[   23.404056] kernel BUG at mm/rmap.c:385!
[   23.404074] invalid opcode: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC
[   23.404107] Modules linked in: i915 cfbfillrect cfbimgblt i2c_algo_bit fbcon bitblit softcursor cfbcopyarea font drm_kms_helper drm fb fbdev binfmt_misc fuse uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core v4l2_common sdhci_pci iwldvm videodev media mac80211 i2c_i801 i2c_core sdhci mmc_core iwlwifi cfg80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm video backlight snd_timer snd
[   23.404353] CPU: 1 PID: 4506 Comm: synaptikscfg Tainted: G      D        3.18.0-rc6-test-00001-gf5bc00c103ff #409
[   23.404395] Hardware name: Dell Inc. Latitude E6320/09PHH9, BIOS A08 10/18/2011
[   23.404425] task: ffff8800a337c2c0 ti: ffff88009f4ec000 task.ti: ffff88009f4ec000
[   23.404455] RIP: 0010:[<ffffffff81125f09>]  [<ffffffff81125f09>] unlink_anon_vmas+0x12b/0x169
[   23.404494] RSP: 0018:ffff88009f4efd28  EFLAGS: 00010282
[   23.405766] RAX: ffff88009f54d010 RBX: ffff88009f54c488 RCX: 0000000000000000
[   23.407062] RDX: ffff88009f5a3a50 RSI: 0000000000000001 RDI: ffff88009f5a3a50
[   23.408352] RBP: ffff88009f4efd78 R08: 0000000000000000 R09: 0000000000000000
[   23.409597] R10: 0000000000000000 R11: ffff88009f54d020 R12: ffff88009f54d000
[   23.410816] R13: ffff88009f54d010 R14: ffff88009f5a3a50 R15: ffff88009f5a3a50
[   23.411998] FS:  0000000000000000(0000) GS:ffff88012d440000(0000) knlGS:0000000000000000
[   23.413167] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   23.414320] CR2: 00007f7a855608f0 CR3: 00000000a328c000 CR4: 00000000000407e0
[   23.415471] Stack:
[   23.416603]  ffff8800a3390e00 ffff88009f54c498 ffff88009f54c498 ffff88009f54d000
[   23.417747]  ffff88009f4efd78 ffff88009f54c420 ffff88009f4efdd8 0000000000000000
[   23.418892]  0000000000000000 ffff88009f54c420 ffff88009f4efdc8 ffffffff81119e26
[   23.420027] Call Trace:
[   23.421153]  [<ffffffff81119e26>] free_pgtables+0x8e/0xcc
[   23.422273]  [<ffffffff81121ac1>] exit_mmap+0x84/0x123
[   23.423411]  [<ffffffff81044d48>] ? do_exit+0x358/0x97e
[   23.424537]  [<ffffffff8103ff09>] mmput+0x5e/0xbb
[   23.425665]  [<ffffffff81044d8c>] do_exit+0x39c/0x97e
[   23.426766]  [<ffffffff810f49b4>] ? context_tracking_user_exit+0x79/0x116
[   23.427866]  [<ffffffff8127f43a>] ? __this_cpu_preempt_check+0x13/0x15
[   23.428962]  [<ffffffff810453f1>] do_group_exit+0x4c/0xc9
[   23.430064]  [<ffffffff81045482>] SyS_exit_group+0x14/0x14
[   23.431162]  [<ffffffff81524f52>] system_call_fastpath+0x12/0x17
[   23.432262] Code: 32 f5 ff 49 8b 45 78 48 8b 18 4c 8d 60 f0 48 83 eb 10 4d 8d 6c 24 10 4c 3b 6d b8 74 3d 49 8b 7c 24 08 83 bf 98 00 00 00 00 74 02 <0f> 0b f0 ff 8f 88 00 00 00 74 1d 4c 89 ef e8 61 96 15 00 4c 89 
[   23.434722] RIP  [<ffffffff81125f09>] unlink_anon_vmas+0x12b/0x169
[   23.435924]  RSP <ffff88009f4efd28>
[   23.441996] ---[ end trace 5761ee18fca12428 ]---
[   23.442001] Fixing recursive fault but reboot is needed!
[  838.179454] ------------[ cut here ]------------
[  838.180658] kernel BUG at mm/rmap.c:385!
[  838.181843] invalid opcode: 0000 [#3] PREEMPT SMP DEBUG_PAGEALLOC
[  838.183046] Modules linked in: i915 cfbfillrect cfbimgblt i2c_algo_bit fbcon bitblit softcursor cfbcopyarea font drm_kms_helper drm fb fbdev binfmt_misc fuse uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core v4l2_common sdhci_pci iwldvm videodev media mac80211 i2c_i801 i2c_core sdhci mmc_core iwlwifi cfg80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm video backlight snd_timer snd
[  838.186983] CPU: 1 PID: 6643 Comm: colord-sane Tainted: G      D        3.18.0-rc6-test-00001-gf5bc00c103ff #409
[  838.188240] Hardware name: Dell Inc. Latitude E6320/09PHH9, BIOS A08 10/18/2011
[  838.189503] task: ffff8800c4fd8000 ti: ffff880079c6c000 task.ti: ffff880079c6c000
[  838.190765] RIP: 0010:[<ffffffff81125f09>]  [<ffffffff81125f09>] unlink_anon_vmas+0x12b/0x169
[  838.192045] RSP: 0018:ffff880079c6fb68  EFLAGS: 00010286
[  838.193324] RAX: ffff8800c5a70150 RBX: ffff8800a6fd5748 RCX: 0000000000000000
[  838.194616] RDX: ffff8800a5379840 RSI: 0000000000000001 RDI: ffff8800a5379840
[  838.195879] RBP: ffff880079c6fbb8 R08: 0000000000000000 R09: 0000000000000000
[  838.197100] R10: 0000000000000000 R11: ffff8800c5a70160 R12: ffff8800c5a70140
[  838.198289] R13: ffff8800c5a70150 R14: ffff8800a5379840 R15: ffff8800a5379840
[  838.199448] FS:  0000000000000000(0000) GS:ffff88012d440000(0000) knlGS:0000000000000000
[  838.200604] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  838.201753] CR2: 00007fdfd692cde8 CR3: 0000000079d0d000 CR4: 00000000000407e0
[  838.202902] Stack:
[  838.204029]  ffff88011e6fc540 ffff8800a6fd5758 ffff8800a6fd5758 ffff8800c5a70140
[  838.205180]  ffff880079c6fbb8 ffff8800a6fd56e0 ffff880079c6fc18 0000000000000000
[  838.206328]  0000000000000000 ffff8800a6fd56e0 ffff880079c6fc08 ffffffff81119e26
[  838.207477] Call Trace:
[  838.208614]  [<ffffffff81119e26>] free_pgtables+0x8e/0xcc
[  838.209762]  [<ffffffff81121ac1>] exit_mmap+0x84/0x123
[  838.210897]  [<ffffffff81044d48>] ? do_exit+0x358/0x97e
[  838.212020]  [<ffffffff8103ff09>] mmput+0x5e/0xbb
[  838.213132]  [<ffffffff81044d8c>] do_exit+0x39c/0x97e
[  838.214232]  [<ffffffff8104ea16>] ? get_signal+0xdb/0x68a
[  838.215324]  [<ffffffff8115de6d>] ? poll_select_copy_remaining+0xfe/0xfe
[  838.216420]  [<ffffffff810453f1>] do_group_exit+0x4c/0xc9
[  838.217521]  [<ffffffff8104ef82>] get_signal+0x647/0x68a
[  838.218612]  [<ffffffff810f48bd>] ? context_tracking_user_enter+0xdb/0x159
[  838.219705]  [<ffffffff8100228f>] do_signal+0x28/0x657
[  838.220796]  [<ffffffff810c1e10>] ? __acct_update_integrals+0xbf/0xd4
[  838.221894]  [<ffffffff81063e43>] ? preempt_count_sub+0xcd/0xdb
[  838.222998]  [<ffffffff8106972e>] ? vtime_account_user+0x88/0x95
[  838.224105]  [<ffffffff815243a3>] ? _raw_spin_unlock+0x32/0x47
[  838.225205]  [<ffffffff810f49b4>] ? context_tracking_user_exit+0x79/0x116
[  838.226308]  [<ffffffff810f49b4>] ? context_tracking_user_exit+0x79/0x116
[  838.227401]  [<ffffffff810028fd>] do_notify_resume+0x3f/0x94
[  838.228495]  [<ffffffff81525218>] int_signal+0x12/0x17
[  838.229581] Code: 32 f5 ff 49 8b 45 78 48 8b 18 4c 8d 60 f0 48 83 eb 10 4d 8d 6c 24 10 4c 3b 6d b8 74 3d 49 8b 7c 24 08 83 bf 98 00 00 00 00 74 02 <0f> 0b f0 ff 8f 88 00 00 00 74 1d 4c 89 ef e8 61 96 15 00 4c 89 
[  838.231909] RIP  [<ffffffff81125f09>] unlink_anon_vmas+0x12b/0x169
[  838.233003]  RSP <ffff880079c6fb68>
[  838.234248] ---[ end trace 5761ee18fca12429 ]---
[  838.234251] Fixing recursive fault but reboot is needed!
[ 1806.784267] ------------[ cut here ]------------
[ 1806.785322] kernel BUG at mm/rmap.c:385!
[ 1806.786361] invalid opcode: 0000 [#4] PREEMPT SMP DEBUG_PAGEALLOC
[ 1806.787397] Modules linked in: i915 cfbfillrect cfbimgblt i2c_algo_bit fbcon bitblit softcursor cfbcopyarea font drm_kms_helper drm fb fbdev binfmt_misc fuse uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core v4l2_common sdhci_pci iwldvm videodev media mac80211 i2c_i801 i2c_core sdhci mmc_core iwlwifi cfg80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm video backlight snd_timer snd
[ 1806.790682] CPU: 1 PID: 8135 Comm: DNS Resolver #7 Tainted: G      D        3.18.0-rc6-test-00001-gf5bc00c103ff #409
[ 1806.791728] Hardware name: Dell Inc. Latitude E6320/09PHH9, BIOS A08 10/18/2011
[ 1806.792779] task: ffff8800b3d40000 ti: ffff880079e34000 task.ti: ffff880079e34000
[ 1806.793816] RIP: 0010:[<ffffffff81125f09>]  [<ffffffff81125f09>] unlink_anon_vmas+0x12b/0x169
[ 1806.794863] RSP: 0018:ffff880079e37d38  EFLAGS: 00010282
[ 1806.795894] RAX: ffff8800b508d790 RBX: ffff8800bcaa4e28 RCX: 0000000000000000
[ 1806.796948] RDX: ffff880124ce0f20 RSI: 0000000000000001 RDI: ffff880124ce0f20
[ 1806.798011] RBP: ffff880079e37d88 R08: 0000000000000000 R09: 0000000000000000
[ 1806.799048] R10: 00007fc2827f9db0 R11: ffff8800b508d7a0 R12: ffff8800b508d780
[ 1806.800105] R13: ffff8800b508d790 R14: ffff880124ce0f20 R15: ffff880124ce0f20
[ 1806.801143] FS:  00007fc2827fa700(0000) GS:ffff88012d440000(0000) knlGS:0000000000000000
[ 1806.802206] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1806.803244] CR2: 00007fc2c6b87000 CR3: 00000000a3063000 CR4: 00000000000407e0
[ 1806.804305] Stack:
[ 1806.805329]  00007fc280754000 ffff8800bcaa4e38 ffff8800bcaa4e38 ffff8800b508d780
[ 1806.806382]  0000000081098bfb ffff8800bcaa4dc0 ffff880079e37df8 00007fc27ff00000
[ 1806.807467]  00007fc280a00000 ffff8800bcaa4dc0 ffff880079e37dd8 ffffffff81119e26
[ 1806.808536] Call Trace:
[ 1806.809570]  [<ffffffff81119e26>] free_pgtables+0x8e/0xcc
[ 1806.810617]  [<ffffffff8111fe4c>] unmap_region+0xc8/0xec
[ 1806.811658]  [<ffffffff81270329>] ? __rb_erase_color+0x122/0x1f9
[ 1806.812724]  [<ffffffff8112192b>] do_munmap+0x275/0x2f7
[ 1806.813792]  [<ffffffff811219f5>] vm_munmap+0x48/0x61
[ 1806.814841]  [<ffffffff81121a34>] SyS_munmap+0x26/0x2f
[ 1806.815884]  [<ffffffff81524f52>] system_call_fastpath+0x12/0x17
[ 1806.816951] Code: 32 f5 ff 49 8b 45 78 48 8b 18 4c 8d 60 f0 48 83 eb 10 4d 8d 6c 24 10 4c 3b 6d b8 74 3d 49 8b 7c 24 08 83 bf 98 00 00 00 00 74 02 <0f> 0b f0 ff 8f 88 00 00 00 74 1d 4c 89 ef e8 61 96 15 00 4c 89 
[ 1806.819300] RIP  [<ffffffff81125f09>] unlink_anon_vmas+0x12b/0x169
[ 1806.820457]  RSP <ffff880079e37d38>
[ 1806.822068] ---[ end trace 5761ee18fca1242a ]---
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-25 10:59                                                   ` Michal Hocko
  (?)
@ 2014-11-25 12:13                                                   ` Konstantin Khlebnikov
  2014-11-25 15:00                                                       ` Michal Hocko
  -1 siblings, 1 reply; 75+ messages in thread
From: Konstantin Khlebnikov @ 2014-11-25 12:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rik van Riel, Michel Lespinasse, Vlastimil Babka, Andrew Morton,
	Hugh Dickins, Andrea Arcangeli, Linux Kernel Mailing List,
	linux-mm, Tim Hartrick

[-- Attachment #1: Type: text/plain, Size: 16416 bytes --]

On Tue, Nov 25, 2014 at 1:59 PM, Michal Hocko <mhocko@suse.cz> wrote:
> On Mon 24-11-14 11:09:40, Konstantin Khlebnikov wrote:
>> On Thu, Nov 20, 2014 at 6:03 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>> > On Thu, Nov 20, 2014 at 5:50 PM, Rik van Riel <riel@redhat.com> wrote:
>> >> -----BEGIN PGP SIGNED MESSAGE-----
>> >> Hash: SHA1
>> >>
>> >> On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote:
>> >>
>> >>> I'm thinking about a limit on reusing anon_vmas which might
>> >>> increase performance without breaking the asymptotic estimate of
>> >>> the anon_vma count in the worst case. For example, this heuristic:
>> >>> allow reuse only of an anon_vma with a single direct descendant.
>> >>> It seems there will be up to around two times more anon_vmas, but
>> >>> false aliasing must be much lower.
>>
>> Done. RFC patch in attachment.
>
> This triggers BUG_ON(anon_vma->degree) in unlink_anon_vmas(). I have
> applied the patch on top of 3.18.0-rc6.

It seems I've screwed up the degree counter in the case where an anon_vma
is merged in anon_vma_prepare(). The increment must be in the next if block:

--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -182,8 +182,6 @@ int anon_vma_prepare(struct vm_area_struct *vma)
                        if (unlikely(!anon_vma))
                                goto out_enomem_free_avc;
                        allocated = anon_vma;
-                       /* Bump degree, root anon_vma is its own parent. */
-                       anon_vma->degree++;
                }

                anon_vma_lock_write(anon_vma);
@@ -192,6 +190,7 @@ int anon_vma_prepare(struct vm_area_struct *vma)
                if (likely(!vma->anon_vma)) {
                        vma->anon_vma = anon_vma;
                        anon_vma_chain_link(vma, avc, anon_vma);
+                       anon_vma->degree++;
                        allocated = NULL;
                        avc = NULL;
                }

I've tested it with trinity, but probably not for long enough.
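
The accounting problem can be illustrated with a toy userspace model.
This is a sketch under assumptions, not kernel code: toy_anon_vma,
toy_vma, toy_prepare(), toy_unlink() and toy_final_degree() are
hypothetical names, and the rule that unlinking a VMA decrements degree
is assumed here rather than taken from the patch above. It only shows
why bumping degree on the allocation path alone leaves the counter
unbalanced once a second VMA merges into the same anon_vma.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; these are not the kernel structures. */
struct toy_anon_vma { int degree; };
struct toy_vma { struct toy_anon_vma *anon_vma; };

/*
 * Attach vma to av.
 * fixed == 0: bump degree only when av was freshly allocated (old placement).
 * fixed == 1: bump degree whenever the vma is actually linked (new placement).
 */
static void toy_prepare(struct toy_vma *vma, struct toy_anon_vma *av,
                        int freshly_allocated, int fixed)
{
    if (!fixed && freshly_allocated)
        av->degree++;           /* old placement: allocation path only */

    if (vma->anon_vma == NULL) {
        vma->anon_vma = av;
        if (fixed)
            av->degree++;       /* new placement: every successful link */
    }
}

/* Assumed teardown rule: each unlinked VMA drops one reference from degree. */
static void toy_unlink(struct toy_vma *vma)
{
    if (vma->anon_vma != NULL) {
        vma->anon_vma->degree--;
        vma->anon_vma = NULL;
    }
}

/* Two VMAs share one anon_vma; return degree after both are unlinked. */
int toy_final_degree(int fixed)
{
    struct toy_anon_vma av = { 0 };
    struct toy_vma a = { NULL };
    struct toy_vma b = { NULL };

    toy_prepare(&a, &av, 1, fixed);   /* first VMA allocates the anon_vma */
    toy_prepare(&b, &av, 0, fixed);   /* second VMA merges into it */
    toy_unlink(&a);
    toy_unlink(&b);
    return av.degree;
}
```

Under these assumptions toy_final_degree(0) ends at -1 while
toy_final_degree(1) ends at 0, which is the balance a check like
BUG_ON(anon_vma->degree) expects.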

>
> [   12.380189] ------------[ cut here ]------------
> [   12.380221] kernel BUG at mm/rmap.c:385!
> [   12.380239] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [   12.380272] Modules linked in: i915 cfbfillrect cfbimgblt i2c_algo_bit fbcon bitblit softcursor cfbcopyarea font drm_kms_helper drm fb fbdev binfmt_misc fuse uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core v4l2_common sdhci_pci iwldvm videodev media mac80211 i2c_i801 i2c_core sdhci mmc_core iwlwifi cfg80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm video backlight snd_timer snd
> [   12.380518] CPU: 1 PID: 3704 Comm: kdm_greet Not tainted 3.18.0-rc6-test-00001-gf5bc00c103ff #409
> [   12.380554] Hardware name: Dell Inc. Latitude E6320/09PHH9, BIOS A08 10/18/2011
> [   12.380584] task: ffff8801272bc2c0 ti: ffff8800bcaf0000 task.ti: ffff8800bcaf0000
> [   12.380614] RIP: 0010:[<ffffffff81125f09>]  [<ffffffff81125f09>] unlink_anon_vmas+0x12b/0x169
> [   12.380653] RSP: 0018:ffff8800bcaf3d28  EFLAGS: 00010286
> [   12.380676] RAX: ffff8800bcb3e690 RBX: ffff8800bcb35e28 RCX: ffff8801272bcb60
> [   12.380706] RDX: ffff8800bcb38e70 RSI: 0000000000000001 RDI: ffff8800bcb38e70
> [   12.380734] RBP: ffff8800bcaf3d78 R08: 0000000000000000 R09: 0000000000000000
> [   12.380764] R10: 0000000000000000 R11: ffff8800bcb3e6a0 R12: ffff8800bcb3e680
> [   12.380793] R13: ffff8800bcb3e690 R14: ffff8800bcb38e70 R15: ffff8800bcb38e70
> [   12.380822] FS:  0000000000000000(0000) GS:ffff88012d440000(0000) knlGS:0000000000000000
> [   12.380855] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   12.380880] CR2: 00007fcd2603b0e8 CR3: 0000000001a11000 CR4: 00000000000407e0
> [   12.380908] Stack:
> [   12.380918]  ffff8801272e9dc0 ffff8800bcb35e38 ffff8800bcb35e38 ffff8800bcb3e680
> [   12.380953]  ffff8800bcaf3d78 ffff8800bcb35dc0 ffff8800bcaf3dd8 0000000000000000
> [   12.380989]  0000000000000000 ffff8800bcb35dc0 ffff8800bcaf3dc8 ffffffff81119e26
> [   12.381024] Call Trace:
> [   12.381038]  [<ffffffff81119e26>] free_pgtables+0x8e/0xcc
> [   12.381062]  [<ffffffff81121ac1>] exit_mmap+0x84/0x123
> [   12.381086]  [<ffffffff8103ff09>] mmput+0x5e/0xbb
> [   12.381107]  [<ffffffff81044d8c>] do_exit+0x39c/0x97e
> [   12.381131]  [<ffffffff810f49b4>] ? context_tracking_user_exit+0x79/0x116
> [   12.381160]  [<ffffffff8127f43a>] ? __this_cpu_preempt_check+0x13/0x15
> [   12.381188]  [<ffffffff810453f1>] do_group_exit+0x4c/0xc9
> [   12.381212]  [<ffffffff81045482>] SyS_exit_group+0x14/0x14
> [   12.381238]  [<ffffffff81524f52>] system_call_fastpath+0x12/0x17
> [   12.381262] Code: 32 f5 ff 49 8b 45 78 48 8b 18 4c 8d 60 f0 48 83 eb 10 4d 8d 6c 24 10 4c 3b 6d b8 74 3d 49 8b 7c 24 08 83 bf 98 00 00 00 00 74 02 <0f> 0b f0 ff 8f 88 00 00 00 74 1d 4c 89 ef e8 61 96 15 00 4c 89
> [   12.381445] RIP  [<ffffffff81125f09>] unlink_anon_vmas+0x12b/0x169
> [   12.381473]  RSP <ffff8800bcaf3d28>
> [   12.386659] ---[ end trace 5761ee18fca12427 ]---
> [   12.386662] Fixing recursive fault but reboot is needed!
> [   13.158240] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X
> [   13.259294] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X
> [   13.259468] IPv6: ADDRCONF(NETDEV_UP): lan0: link is not ready
> [   16.790917] e1000e: lan0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> [   16.790957] IPv6: ADDRCONF(NETDEV_CHANGE): lan0: link becomes ready
> [   18.846524] iwlwifi 0000:02:00.0: L1 Enabled - LTR Disabled
> [   18.846742] iwlwifi 0000:02:00.0: Radio type=0x0-0x3-0x1
> [   18.941594] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
> [   19.145595] e1000e: lan0 NIC Link is Down
> [   19.287399] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X
> [   19.391325] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X
> [   19.391475] IPv6: ADDRCONF(NETDEV_UP): lan0: link is not ready
> [   19.573640] e1000e: lan0 NIC Link is Down
> [   19.717813] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X
> [   19.819729] e1000e 0000:00:19.0: irq 25 for MSI/MSI-X
> [   19.819883] IPv6: ADDRCONF(NETDEV_UP): lan0: link is not ready
> [   22.938849] e1000e: lan0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
> [   22.938889] IPv6: ADDRCONF(NETDEV_CHANGE): lan0: link becomes ready
> [   23.404027] ------------[ cut here ]------------
> [   23.404056] kernel BUG at mm/rmap.c:385!
> [   23.404074] invalid opcode: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC
> [   23.404107] Modules linked in: i915 cfbfillrect cfbimgblt i2c_algo_bit fbcon bitblit softcursor cfbcopyarea font drm_kms_helper drm fb fbdev binfmt_misc fuse uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core v4l2_common sdhci_pci iwldvm videodev media mac80211 i2c_i801 i2c_core sdhci mmc_core iwlwifi cfg80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm video backlight snd_timer snd
> [   23.404353] CPU: 1 PID: 4506 Comm: synaptikscfg Tainted: G      D        3.18.0-rc6-test-00001-gf5bc00c103ff #409
> [   23.404395] Hardware name: Dell Inc. Latitude E6320/09PHH9, BIOS A08 10/18/2011
> [   23.404425] task: ffff8800a337c2c0 ti: ffff88009f4ec000 task.ti: ffff88009f4ec000
> [   23.404455] RIP: 0010:[<ffffffff81125f09>]  [<ffffffff81125f09>] unlink_anon_vmas+0x12b/0x169
> [   23.404494] RSP: 0018:ffff88009f4efd28  EFLAGS: 00010282
> [   23.405766] RAX: ffff88009f54d010 RBX: ffff88009f54c488 RCX: 0000000000000000
> [   23.407062] RDX: ffff88009f5a3a50 RSI: 0000000000000001 RDI: ffff88009f5a3a50
> [   23.408352] RBP: ffff88009f4efd78 R08: 0000000000000000 R09: 0000000000000000
> [   23.409597] R10: 0000000000000000 R11: ffff88009f54d020 R12: ffff88009f54d000
> [   23.410816] R13: ffff88009f54d010 R14: ffff88009f5a3a50 R15: ffff88009f5a3a50
> [   23.411998] FS:  0000000000000000(0000) GS:ffff88012d440000(0000) knlGS:0000000000000000
> [   23.413167] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   23.414320] CR2: 00007f7a855608f0 CR3: 00000000a328c000 CR4: 00000000000407e0
> [   23.415471] Stack:
> [   23.416603]  ffff8800a3390e00 ffff88009f54c498 ffff88009f54c498 ffff88009f54d000
> [   23.417747]  ffff88009f4efd78 ffff88009f54c420 ffff88009f4efdd8 0000000000000000
> [   23.418892]  0000000000000000 ffff88009f54c420 ffff88009f4efdc8 ffffffff81119e26
> [   23.420027] Call Trace:
> [   23.421153]  [<ffffffff81119e26>] free_pgtables+0x8e/0xcc
> [   23.422273]  [<ffffffff81121ac1>] exit_mmap+0x84/0x123
> [   23.423411]  [<ffffffff81044d48>] ? do_exit+0x358/0x97e
> [   23.424537]  [<ffffffff8103ff09>] mmput+0x5e/0xbb
> [   23.425665]  [<ffffffff81044d8c>] do_exit+0x39c/0x97e
> [   23.426766]  [<ffffffff810f49b4>] ? context_tracking_user_exit+0x79/0x116
> [   23.427866]  [<ffffffff8127f43a>] ? __this_cpu_preempt_check+0x13/0x15
> [   23.428962]  [<ffffffff810453f1>] do_group_exit+0x4c/0xc9
> [   23.430064]  [<ffffffff81045482>] SyS_exit_group+0x14/0x14
> [   23.431162]  [<ffffffff81524f52>] system_call_fastpath+0x12/0x17
> [   23.432262] Code: 32 f5 ff 49 8b 45 78 48 8b 18 4c 8d 60 f0 48 83 eb 10 4d 8d 6c 24 10 4c 3b 6d b8 74 3d 49 8b 7c 24 08 83 bf 98 00 00 00 00 74 02 <0f> 0b f0 ff 8f 88 00 00 00 74 1d 4c 89 ef e8 61 96 15 00 4c 89
> [   23.434722] RIP  [<ffffffff81125f09>] unlink_anon_vmas+0x12b/0x169
> [   23.435924]  RSP <ffff88009f4efd28>
> [   23.441996] ---[ end trace 5761ee18fca12428 ]---
> [   23.442001] Fixing recursive fault but reboot is needed!
> [  838.179454] ------------[ cut here ]------------
> [  838.180658] kernel BUG at mm/rmap.c:385!
> [  838.181843] invalid opcode: 0000 [#3] PREEMPT SMP DEBUG_PAGEALLOC
> [  838.183046] Modules linked in: i915 cfbfillrect cfbimgblt i2c_algo_bit fbcon bitblit softcursor cfbcopyarea font drm_kms_helper drm fb fbdev binfmt_misc fuse uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core v4l2_common sdhci_pci iwldvm videodev media mac80211 i2c_i801 i2c_core sdhci mmc_core iwlwifi cfg80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm video backlight snd_timer snd
> [  838.186983] CPU: 1 PID: 6643 Comm: colord-sane Tainted: G      D        3.18.0-rc6-test-00001-gf5bc00c103ff #409
> [  838.188240] Hardware name: Dell Inc. Latitude E6320/09PHH9, BIOS A08 10/18/2011
> [  838.189503] task: ffff8800c4fd8000 ti: ffff880079c6c000 task.ti: ffff880079c6c000
> [  838.190765] RIP: 0010:[<ffffffff81125f09>]  [<ffffffff81125f09>] unlink_anon_vmas+0x12b/0x169
> [  838.192045] RSP: 0018:ffff880079c6fb68  EFLAGS: 00010286
> [  838.193324] RAX: ffff8800c5a70150 RBX: ffff8800a6fd5748 RCX: 0000000000000000
> [  838.194616] RDX: ffff8800a5379840 RSI: 0000000000000001 RDI: ffff8800a5379840
> [  838.195879] RBP: ffff880079c6fbb8 R08: 0000000000000000 R09: 0000000000000000
> [  838.197100] R10: 0000000000000000 R11: ffff8800c5a70160 R12: ffff8800c5a70140
> [  838.198289] R13: ffff8800c5a70150 R14: ffff8800a5379840 R15: ffff8800a5379840
> [  838.199448] FS:  0000000000000000(0000) GS:ffff88012d440000(0000) knlGS:0000000000000000
> [  838.200604] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  838.201753] CR2: 00007fdfd692cde8 CR3: 0000000079d0d000 CR4: 00000000000407e0
> [  838.202902] Stack:
> [  838.204029]  ffff88011e6fc540 ffff8800a6fd5758 ffff8800a6fd5758 ffff8800c5a70140
> [  838.205180]  ffff880079c6fbb8 ffff8800a6fd56e0 ffff880079c6fc18 0000000000000000
> [  838.206328]  0000000000000000 ffff8800a6fd56e0 ffff880079c6fc08 ffffffff81119e26
> [  838.207477] Call Trace:
> [  838.208614]  [<ffffffff81119e26>] free_pgtables+0x8e/0xcc
> [  838.209762]  [<ffffffff81121ac1>] exit_mmap+0x84/0x123
> [  838.210897]  [<ffffffff81044d48>] ? do_exit+0x358/0x97e
> [  838.212020]  [<ffffffff8103ff09>] mmput+0x5e/0xbb
> [  838.213132]  [<ffffffff81044d8c>] do_exit+0x39c/0x97e
> [  838.214232]  [<ffffffff8104ea16>] ? get_signal+0xdb/0x68a
> [  838.215324]  [<ffffffff8115de6d>] ? poll_select_copy_remaining+0xfe/0xfe
> [  838.216420]  [<ffffffff810453f1>] do_group_exit+0x4c/0xc9
> [  838.217521]  [<ffffffff8104ef82>] get_signal+0x647/0x68a
> [  838.218612]  [<ffffffff810f48bd>] ? context_tracking_user_enter+0xdb/0x159
> [  838.219705]  [<ffffffff8100228f>] do_signal+0x28/0x657
> [  838.220796]  [<ffffffff810c1e10>] ? __acct_update_integrals+0xbf/0xd4
> [  838.221894]  [<ffffffff81063e43>] ? preempt_count_sub+0xcd/0xdb
> [  838.222998]  [<ffffffff8106972e>] ? vtime_account_user+0x88/0x95
> [  838.224105]  [<ffffffff815243a3>] ? _raw_spin_unlock+0x32/0x47
> [  838.225205]  [<ffffffff810f49b4>] ? context_tracking_user_exit+0x79/0x116
> [  838.226308]  [<ffffffff810f49b4>] ? context_tracking_user_exit+0x79/0x116
> [  838.227401]  [<ffffffff810028fd>] do_notify_resume+0x3f/0x94
> [  838.228495]  [<ffffffff81525218>] int_signal+0x12/0x17
> [  838.229581] Code: 32 f5 ff 49 8b 45 78 48 8b 18 4c 8d 60 f0 48 83 eb 10 4d 8d 6c 24 10 4c 3b 6d b8 74 3d 49 8b 7c 24 08 83 bf 98 00 00 00 00 74 02 <0f> 0b f0 ff 8f 88 00 00 00 74 1d 4c 89 ef e8 61 96 15 00 4c 89
> [  838.231909] RIP  [<ffffffff81125f09>] unlink_anon_vmas+0x12b/0x169
> [  838.233003]  RSP <ffff880079c6fb68>
> [  838.234248] ---[ end trace 5761ee18fca12429 ]---
> [  838.234251] Fixing recursive fault but reboot is needed!
> [ 1806.784267] ------------[ cut here ]------------
> [ 1806.785322] kernel BUG at mm/rmap.c:385!
> [ 1806.786361] invalid opcode: 0000 [#4] PREEMPT SMP DEBUG_PAGEALLOC
> [ 1806.787397] Modules linked in: i915 cfbfillrect cfbimgblt i2c_algo_bit fbcon bitblit softcursor cfbcopyarea font drm_kms_helper drm fb fbdev binfmt_misc fuse uvcvideo videobuf2_vmalloc videobuf2_memops arc4 videobuf2_core v4l2_common sdhci_pci iwldvm videodev media mac80211 i2c_i801 i2c_core sdhci mmc_core iwlwifi cfg80211 snd_hda_codec_hdmi snd_hda_codec_idt snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_pcm_oss snd_mixer_oss snd_pcm video backlight snd_timer snd
> [ 1806.790682] CPU: 1 PID: 8135 Comm: DNS Resolver #7 Tainted: G      D        3.18.0-rc6-test-00001-gf5bc00c103ff #409
> [ 1806.791728] Hardware name: Dell Inc. Latitude E6320/09PHH9, BIOS A08 10/18/2011
> [ 1806.792779] task: ffff8800b3d40000 ti: ffff880079e34000 task.ti: ffff880079e34000
> [ 1806.793816] RIP: 0010:[<ffffffff81125f09>]  [<ffffffff81125f09>] unlink_anon_vmas+0x12b/0x169
> [ 1806.794863] RSP: 0018:ffff880079e37d38  EFLAGS: 00010282
> [ 1806.795894] RAX: ffff8800b508d790 RBX: ffff8800bcaa4e28 RCX: 0000000000000000
> [ 1806.796948] RDX: ffff880124ce0f20 RSI: 0000000000000001 RDI: ffff880124ce0f20
> [ 1806.798011] RBP: ffff880079e37d88 R08: 0000000000000000 R09: 0000000000000000
> [ 1806.799048] R10: 00007fc2827f9db0 R11: ffff8800b508d7a0 R12: ffff8800b508d780
> [ 1806.800105] R13: ffff8800b508d790 R14: ffff880124ce0f20 R15: ffff880124ce0f20
> [ 1806.801143] FS:  00007fc2827fa700(0000) GS:ffff88012d440000(0000) knlGS:0000000000000000
> [ 1806.802206] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1806.803244] CR2: 00007fc2c6b87000 CR3: 00000000a3063000 CR4: 00000000000407e0
> [ 1806.804305] Stack:
> [ 1806.805329]  00007fc280754000 ffff8800bcaa4e38 ffff8800bcaa4e38 ffff8800b508d780
> [ 1806.806382]  0000000081098bfb ffff8800bcaa4dc0 ffff880079e37df8 00007fc27ff00000
> [ 1806.807467]  00007fc280a00000 ffff8800bcaa4dc0 ffff880079e37dd8 ffffffff81119e26
> [ 1806.808536] Call Trace:
> [ 1806.809570]  [<ffffffff81119e26>] free_pgtables+0x8e/0xcc
> [ 1806.810617]  [<ffffffff8111fe4c>] unmap_region+0xc8/0xec
> [ 1806.811658]  [<ffffffff81270329>] ? __rb_erase_color+0x122/0x1f9
> [ 1806.812724]  [<ffffffff8112192b>] do_munmap+0x275/0x2f7
> [ 1806.813792]  [<ffffffff811219f5>] vm_munmap+0x48/0x61
> [ 1806.814841]  [<ffffffff81121a34>] SyS_munmap+0x26/0x2f
> [ 1806.815884]  [<ffffffff81524f52>] system_call_fastpath+0x12/0x17
> [ 1806.816951] Code: 32 f5 ff 49 8b 45 78 48 8b 18 4c 8d 60 f0 48 83 eb 10 4d 8d 6c 24 10 4c 3b 6d b8 74 3d 49 8b 7c 24 08 83 bf 98 00 00 00 00 74 02 <0f> 0b f0 ff 8f 88 00 00 00 74 1d 4c 89 ef e8 61 96 15 00 4c 89
> [ 1806.819300] RIP  [<ffffffff81125f09>] unlink_anon_vmas+0x12b/0x169
> [ 1806.820457]  RSP <ffff880079e37d38>
> [ 1806.822068] ---[ end trace 5761ee18fca1242a ]---
> --
> Michal Hocko
> SUSE Labs

[-- Attachment #2: mm-prevent-endless-growth-of-anon_vma-hierarchy-v2 --]
[-- Type: application/octet-stream, Size: 5520 bytes --]

mm: prevent endless growth of anon_vma hierarchy

From: Konstantin Khlebnikov <koct9i@gmail.com>

A constantly forking task causes unlimited growth of the anon_vma chain.
Each new child allocates a new level of anon_vmas and links its vmas to all
previous levels, because it inherits pages from them. None of these anon_vmas
can be freed, because there might be pages which point to them.

This patch adds a heuristic which decides whether to reuse an existing
anon_vma instead of forking a new one. It counts vmas and direct descendants
for each anon_vma. An anon_vma with degree lower than two will be reused at
the next fork. As a result, each anon_vma has either a live vma or at least
two descendants, endless chains are no longer possible, and the count of
anon_vmas is no more than twice the count of vmas.

v2: update degree in anon_vma_prepare for merged anon_vma

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu
---
 include/linux/rmap.h |   16 ++++++++++++++++
 mm/rmap.c            |   30 +++++++++++++++++++++++++++++-
 2 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c0c2bce..b1d140c 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -45,6 +45,22 @@ struct anon_vma {
 	 * mm_take_all_locks() (mm_all_locks_mutex).
 	 */
 	struct rb_root rb_root;	/* Interval tree of private "related" vmas */
+
+	/*
+	 * Count of child anon_vmas and VMAs which point to this anon_vma.
+	 *
+	 * This counter is used to decide whether to reuse an old anon_vma
+	 * instead of forking a new one. It allows detecting anon_vmas which
+	 * have just one direct descendant and no vmas. Reusing such an anon_vma
+	 * does not lead to a significant performance regression but prevents
+	 * degradation of the anon_vma hierarchy into an endless linear chain.
+	 *
+	 * The root anon_vma is never reused because it is its own parent and
+	 * has at least one vma or child, thus at fork its degree is at least 2.
+	 */
+	unsigned degree;
+
+	struct anon_vma *parent;	/* Parent of this anon_vma */
 };
 
 /*
diff --git a/mm/rmap.c b/mm/rmap.c
index 19886fb..df5c44e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -72,6 +72,8 @@ static inline struct anon_vma *anon_vma_alloc(void)
 	anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
 	if (anon_vma) {
 		atomic_set(&anon_vma->refcount, 1);
+		anon_vma->degree = 1;	/* Reference for first vma */
+		anon_vma->parent = anon_vma;
 		/*
 		 * Initialise the anon_vma root to point to itself. If called
 		 * from fork, the root will be reset to the parents anon_vma.
@@ -188,6 +190,8 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 		if (likely(!vma->anon_vma)) {
 			vma->anon_vma = anon_vma;
 			anon_vma_chain_link(vma, avc, anon_vma);
+			/* vma link if merged or child link for new root */
+			anon_vma->degree++;
 			allocated = NULL;
 			avc = NULL;
 		}
@@ -256,7 +260,17 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 		anon_vma = pavc->anon_vma;
 		root = lock_anon_vma_root(root, anon_vma);
 		anon_vma_chain_link(dst, avc, anon_vma);
+
+		/*
+		 * Reuse an existing anon_vma if its degree is lower than two,
+		 * which means it has no vma and just one anon_vma child.
+		 */
+		if (!dst->anon_vma && anon_vma != src->anon_vma &&
+				anon_vma->degree < 2)
+			dst->anon_vma = anon_vma;
 	}
+	if (dst->anon_vma)
+		dst->anon_vma->degree++;
 	unlock_anon_vma_root(root);
 	return 0;
 
@@ -279,6 +293,9 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	if (!pvma->anon_vma)
 		return 0;
 
+	/* Drop inherited anon_vma, we'll reuse old one or allocate new. */
+	vma->anon_vma = NULL;
+
 	/*
 	 * First, attach the new VMA to the parent VMA's anon_vmas,
 	 * so rmap can find non-COWed pages in child processes.
@@ -286,6 +303,10 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	if (anon_vma_clone(vma, pvma))
 		return -ENOMEM;
 
+	/* An old anon_vma has been reused. */
+	if (vma->anon_vma)
+		return 0;
+
 	/* Then add our own anon_vma. */
 	anon_vma = anon_vma_alloc();
 	if (!anon_vma)
@@ -299,6 +320,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	 * lock any of the anon_vmas in this anon_vma tree.
 	 */
 	anon_vma->root = pvma->anon_vma->root;
+	anon_vma->parent = pvma->anon_vma;
 	/*
 	 * With refcounts, an anon_vma can stay around longer than the
 	 * process it belongs to. The root anon_vma needs to be pinned until
@@ -309,6 +331,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	vma->anon_vma = anon_vma;
 	anon_vma_lock_write(anon_vma);
 	anon_vma_chain_link(vma, avc, anon_vma);
+	anon_vma->parent->degree++;
 	anon_vma_unlock_write(anon_vma);
 
 	return 0;
@@ -339,12 +362,16 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
 		 * Leave empty anon_vmas on the list - we'll need
 		 * to free them outside the lock.
 		 */
-		if (RB_EMPTY_ROOT(&anon_vma->rb_root))
+		if (RB_EMPTY_ROOT(&anon_vma->rb_root)) {
+			anon_vma->parent->degree--;
 			continue;
+		}
 
 		list_del(&avc->same_vma);
 		anon_vma_chain_free(avc);
 	}
+	if (vma->anon_vma)
+		vma->anon_vma->degree--;
 	unlock_anon_vma_root(root);
 
 	/*
@@ -355,6 +382,7 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
 	list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
 		struct anon_vma *anon_vma = avc->anon_vma;
 
+		BUG_ON(anon_vma->degree);
 		put_anon_vma(anon_vma);
 
 		list_del(&avc->same_vma);

^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-25 12:13                                                   ` Konstantin Khlebnikov
@ 2014-11-25 15:00                                                       ` Michal Hocko
  0 siblings, 0 replies; 75+ messages in thread
From: Michal Hocko @ 2014-11-25 15:00 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Rik van Riel, Michel Lespinasse, Vlastimil Babka, Andrew Morton,
	Hugh Dickins, Andrea Arcangeli, Linux Kernel Mailing List,
	linux-mm, Tim Hartrick

On Tue 25-11-14 16:13:16, Konstantin Khlebnikov wrote:
> On Tue, Nov 25, 2014 at 1:59 PM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Mon 24-11-14 11:09:40, Konstantin Khlebnikov wrote:
> >> On Thu, Nov 20, 2014 at 6:03 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> >> > On Thu, Nov 20, 2014 at 5:50 PM, Rik van Riel <riel@redhat.com> wrote:
> >> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> >> Hash: SHA1
> >> >>
> >> >> On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote:
> >> >>
> >> >>> I'm thinking about limitation for reusing anon_vmas which might
> >> >>> increase performance without breaking asymptotic estimation of
> >> >>> count anon_vma in the worst case. For example this heuristic: allow
> >> >>> to reuse only anon_vma with single direct descendant. It seems
> >> >>> there will be around up to two times more anon_vmas but
> >> >>> false-aliasing must be much lower.
> >>
> >> Done. RFC patch in attachment.
> >
> > This is triggering BUG_ON(anon_vma->degree); in unlink_anon_vmas. I have
> > applied the patch on top of 3.18.0-rc6.
> 
> It seems I've screwed up with counter if anon_vma is merged in anon_vma_prepare.
> Increment must be in the next if block:
> 
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -182,8 +182,6 @@ int anon_vma_prepare(struct vm_area_struct *vma)
>                         if (unlikely(!anon_vma))
>                                 goto out_enomem_free_avc;
>                         allocated = anon_vma;
> -                       /* Bump degree, root anon_vma is its own parent. */
> -                       anon_vma->degree++;
>                 }
> 
>                 anon_vma_lock_write(anon_vma);
> @@ -192,6 +190,7 @@ int anon_vma_prepare(struct vm_area_struct *vma)
>                 if (likely(!vma->anon_vma)) {
>                         vma->anon_vma = anon_vma;
>                         anon_vma_chain_link(vma, avc, anon_vma);
> +                       anon_vma->degree++;
>                         allocated = NULL;
>                         avc = NULL;
>                 }
> 
> I've tested it with trinity but probably isn't long enough.

OK, this has passed a few runs with the original reproducer:
$ date +%s; grep anon_vma /proc/slabinfo;
$ ./vma_chain_repro
$ sleep 1h
$ date +%s; grep anon_vma /proc/slabinfo
$ killall vma_chain_repro
$ date +%s; grep anon_vma /proc/slabinfo
1416923468
anon_vma           11523  11523    176   23    1 : tunables    0    0    0 : slabdata    501    501      0
1416927070
anon_vma           11477  11477    176   23    1 : tunables    0    0    0 : slabdata    499    499      0
1416927070
anon_vma           11127  11431    176   23    1 : tunables    0    0    0 : slabdata    497    497      0
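
For scripted before/after comparisons, the counters above can be pulled out
of /proc/slabinfo programmatically. A minimal sketch in C follows; the helper
names slab_counts() and slab_counts_from_file() are hypothetical, and the
column order (name, active_objs, num_objs) is assumed to match the slabinfo
2.x layout shown in the output above. An exact name match is used so that
anon_vma is not confused with anon_vma_chain.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse one slabinfo line of the form
 *   <name> <active_objs> <num_objs> <objsize> ...
 * Returns 0 and fills the counters if the line describes `cache`,
 * -1 for any other line (including the header lines).
 */
int slab_counts(const char *line, const char *cache,
		unsigned long *active, unsigned long *total)
{
	char name[64];
	unsigned long a, t;

	if (sscanf(line, "%63s %lu %lu", name, &a, &t) != 3)
		return -1;
	if (strcmp(name, cache) != 0)
		return -1;
	*active = a;
	*total = t;
	return 0;
}

/* Scan a slabinfo-format file; reading /proc/slabinfo usually needs root. */
int slab_counts_from_file(const char *path, const char *cache,
			  unsigned long *active, unsigned long *total)
{
	char line[512];
	int ret = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (slab_counts(line, cache, active, total) == 0) {
			ret = 0;
			break;
		}
	}
	fclose(f);
	return ret;
}
```

Calling slab_counts_from_file("/proc/slabinfo", "anon_vma", &active, &total)
before starting the reproducer and again after killing it gives the two
numbers to compare.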

anon_vmas do not seem to leak anymore. I have forwarded the patch to the
customer who was complaining about NSD but I guess it will take some
time to get the confirmation.

Anyway thanks a lot for your help and feel free to add
Tested-by: Michal Hocko <mhocko@suse.cz>

I have yet to look deeper into the code to give you my Reviewed-by.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-25 15:00                                                       ` Michal Hocko
@ 2014-11-26 17:35                                                         ` Michal Hocko
  -1 siblings, 0 replies; 75+ messages in thread
From: Michal Hocko @ 2014-11-26 17:35 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Rik van Riel, Michel Lespinasse, Vlastimil Babka, Andrew Morton,
	Hugh Dickins, Andrea Arcangeli, Linux Kernel Mailing List,
	linux-mm, Tim Hartrick, Daniel Forrest

On Tue 25-11-14 16:00:06, Michal Hocko wrote:
> On Tue 25-11-14 16:13:16, Konstantin Khlebnikov wrote:
> > On Tue, Nov 25, 2014 at 1:59 PM, Michal Hocko <mhocko@suse.cz> wrote:
> > > On Mon 24-11-14 11:09:40, Konstantin Khlebnikov wrote:
> > >> On Thu, Nov 20, 2014 at 6:03 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> > >> > On Thu, Nov 20, 2014 at 5:50 PM, Rik van Riel <riel@redhat.com> wrote:
> > >> >> -----BEGIN PGP SIGNED MESSAGE-----
> > >> >> Hash: SHA1
> > >> >>
> > >> >> On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote:
> > >> >>
> > >> >>> I'm thinking about limitation for reusing anon_vmas which might
> > >> >>> increase performance without breaking asymptotic estimation of
> > >> >>> count anon_vma in the worst case. For example this heuristic: allow
> > >> >>> to reuse only anon_vma with single direct descendant. It seems
> > >> >>> there will be around up to two times more anon_vmas but
> > >> >>> false-aliasing must be much lower.
> > >>
> > >> Done. RFC patch in attachment.

OK, I finally managed to untangle myself from the vma chains, and your patch
makes sense to me; it is quite clever, actually. Here it is, including the
fixup.
---
> From 1d4b0b38198c69ecfeb37670cb1dda767a802c9a Mon Sep 17 00:00:00 2001
> From: Konstantin Khlebnikov <koct9i@gmail.com>
> Date: Tue, 25 Nov 2014 10:54:44 +0100
> Subject: [PATCH] mm: prevent endless growth of anon_vma hierarchy
> 
> A constantly forking task causes unlimited growth of the anon_vma chain.
> Each new child allocates a new level of anon_vmas and links its vmas to all
> previous levels, because it inherits pages from them. None of these anon_vmas
> can be freed, because there might be pages which point to them.
> 
> This patch adds a heuristic which decides whether to reuse an existing
> anon_vma instead of forking a new one. It counts vmas and direct descendants
> for each anon_vma. An anon_vma with degree lower than two will be reused at
> the next fork. As a result, each anon_vma has either a live vma or at least
> two descendants, endless chains are no longer possible, and the count of
> anon_vmas is no more than twice the count of vmas.
> 
> Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
> Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu

Tested-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Michal Hocko <mhocko@suse.cz>

and I guess
Reported-by: Daniel Forrest <dan.forrest@ssec.wisc.edu>

who somehow vanished from CC list (added back) would be appropriate as
well.

plus

Fixes: 5beb49305251 (mm: change anon_vma linking to fix multi-process server scalability issue)
and mark it for stable

Thanks!

> ---
>  include/linux/rmap.h | 16 ++++++++++++++++
>  mm/rmap.c            | 29 ++++++++++++++++++++++++++++-
>  2 files changed, 44 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index c0c2bce6b0b7..b1d140c20b37 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -45,6 +45,22 @@ struct anon_vma {
>  	 * mm_take_all_locks() (mm_all_locks_mutex).
>  	 */
>  	struct rb_root rb_root;	/* Interval tree of private "related" vmas */
> +
> +	/*
> +	 * Count of child anon_vmas and VMAs which point to this anon_vma.
> +	 *
> +	 * This counter is used to decide whether to reuse an old anon_vma
> +	 * instead of forking a new one. It allows detecting anon_vmas which
> +	 * have just one direct descendant and no vmas. Reusing such an anon_vma
> +	 * does not lead to a significant performance regression but prevents
> +	 * degradation of the anon_vma hierarchy into an endless linear chain.
> +	 *
> +	 * The root anon_vma is never reused because it is its own parent and
> +	 * has at least one vma or child, thus at fork its degree is at least 2.
> +	 */
> +	unsigned degree;
> +
> +	struct anon_vma *parent;	/* Parent of this anon_vma */
>  };
>  
>  /*
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 19886fb2f13a..40ae8184a1e1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -72,6 +72,8 @@ static inline struct anon_vma *anon_vma_alloc(void)
>  	anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
>  	if (anon_vma) {
>  		atomic_set(&anon_vma->refcount, 1);
> +		anon_vma->degree = 1;	/* Reference for first vma */
> +		anon_vma->parent = anon_vma;
>  		/*
>  		 * Initialise the anon_vma root to point to itself. If called
>  		 * from fork, the root will be reset to the parents anon_vma.
> @@ -188,6 +190,7 @@ int anon_vma_prepare(struct vm_area_struct *vma)
>  		if (likely(!vma->anon_vma)) {
>  			vma->anon_vma = anon_vma;
>  			anon_vma_chain_link(vma, avc, anon_vma);
> +			anon_vma->degree++;
>  			allocated = NULL;
>  			avc = NULL;
>  		}
> @@ -256,7 +259,17 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
>  		anon_vma = pavc->anon_vma;
>  		root = lock_anon_vma_root(root, anon_vma);
>  		anon_vma_chain_link(dst, avc, anon_vma);
> +
> +		/*
> +		 * Reuse existing anon_vma if its degree lower than two,
> +		 * that means it has no vma and just one anon_vma child.
> +		 */
> +		if (!dst->anon_vma && anon_vma != src->anon_vma &&
> +				anon_vma->degree < 2)
> +			dst->anon_vma = anon_vma;
>  	}
> +	if (dst->anon_vma)
> +		dst->anon_vma->degree++;
>  	unlock_anon_vma_root(root);
>  	return 0;
>  
> @@ -279,6 +292,9 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  	if (!pvma->anon_vma)
>  		return 0;
>  
> +	/* Drop inherited anon_vma, we'll reuse old one or allocate new. */
> +	vma->anon_vma = NULL;
> +
>  	/*
>  	 * First, attach the new VMA to the parent VMA's anon_vmas,
>  	 * so rmap can find non-COWed pages in child processes.
> @@ -286,6 +302,10 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  	if (anon_vma_clone(vma, pvma))
>  		return -ENOMEM;
>  
> +	/* An old anon_vma has been reused. */
> +	if (vma->anon_vma)
> +		return 0;
> +
>  	/* Then add our own anon_vma. */
>  	anon_vma = anon_vma_alloc();
>  	if (!anon_vma)
> @@ -299,6 +319,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  	 * lock any of the anon_vmas in this anon_vma tree.
>  	 */
>  	anon_vma->root = pvma->anon_vma->root;
> +	anon_vma->parent = pvma->anon_vma;
>  	/*
>  	 * With refcounts, an anon_vma can stay around longer than the
>  	 * process it belongs to. The root anon_vma needs to be pinned until
> @@ -309,6 +330,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  	vma->anon_vma = anon_vma;
>  	anon_vma_lock_write(anon_vma);
>  	anon_vma_chain_link(vma, avc, anon_vma);
> +	anon_vma->parent->degree++;
>  	anon_vma_unlock_write(anon_vma);
>  
>  	return 0;
> @@ -339,12 +361,16 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
>  		 * Leave empty anon_vmas on the list - we'll need
>  		 * to free them outside the lock.
>  		 */
> -		if (RB_EMPTY_ROOT(&anon_vma->rb_root))
> +		if (RB_EMPTY_ROOT(&anon_vma->rb_root)) {
> +			anon_vma->parent->degree--;
>  			continue;
> +		}
>  
>  		list_del(&avc->same_vma);
>  		anon_vma_chain_free(avc);
>  	}
> +	if (vma->anon_vma)
> +		vma->anon_vma->degree--;
>  	unlock_anon_vma_root(root);
>  
>  	/*
> @@ -355,6 +381,7 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
>  	list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
>  		struct anon_vma *anon_vma = avc->anon_vma;
>  
> +		BUG_ON(anon_vma->degree);
>  		put_anon_vma(anon_vma);
>  
>  		list_del(&avc->same_vma);
> -- 
> 2.1.3

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
@ 2014-11-26 17:35                                                         ` Michal Hocko
  0 siblings, 0 replies; 75+ messages in thread
From: Michal Hocko @ 2014-11-26 17:35 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Rik van Riel, Michel Lespinasse, Vlastimil Babka, Andrew Morton,
	Hugh Dickins, Andrea Arcangeli, Linux Kernel Mailing List,
	linux-mm, Tim Hartrick, Daniel Forrest

On Tue 25-11-14 16:00:06, Michal Hocko wrote:
> On Tue 25-11-14 16:13:16, Konstantin Khlebnikov wrote:
> > On Tue, Nov 25, 2014 at 1:59 PM, Michal Hocko <mhocko@suse.cz> wrote:
> > > On Mon 24-11-14 11:09:40, Konstantin Khlebnikov wrote:
> > >> On Thu, Nov 20, 2014 at 6:03 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> > >> > On Thu, Nov 20, 2014 at 5:50 PM, Rik van Riel <riel@redhat.com> wrote:
> > >> >> -----BEGIN PGP SIGNED MESSAGE-----
> > >> >> Hash: SHA1
> > >> >>
> > >> >> On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote:
> > >> >>
> > >> >>> I'm thinking about limitation for reusing anon_vmas which might
> > >> >>> increase performance without breaking asymptotic estimation of
> > >> >>> count anon_vma in the worst case. For example this heuristic: allow
> > >> >>> to reuse only anon_vma with single direct descendant. It seems
> > >> >>> there will be around up to two times more anon_vmas but
> > >> >>> false-aliasing must be much lower.
> > >>
> > >> Done. RFC patch in attachment.

Ok, I finally managed to untangle myself from the vma chains, and your
patch makes sense to me; it is quite clever, actually. Here it is,
including the fixup.
---
> From 1d4b0b38198c69ecfeb37670cb1dda767a802c9a Mon Sep 17 00:00:00 2001
> From: Konstantin Khlebnikov <koct9i@gmail.com>
> Date: Tue, 25 Nov 2014 10:54:44 +0100
> Subject: [PATCH] mm: prevent endless growth of anon_vma hierarchy
> 
> A constantly forking task causes unlimited growth of the anon_vma chain.
> Each new child allocates a new level of anon_vmas and links its vmas to all
> previous levels because it inherits pages from them. None of the anon_vmas
> can be freed because there might be pages which point to them.
> 
> This patch adds a heuristic which decides whether to reuse an existing
> anon_vma instead of forking a new one. It counts vmas and direct descendants
> for each anon_vma. An anon_vma with degree lower than two will be reused at
> the next fork. As a result, each anon_vma has either a live vma or at least
> two descendants, endless chains are no longer possible, and the count of
> anon_vmas is at most twice the count of vmas.
> 
> Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
> Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu

Tested-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Michal Hocko <mhocko@suse.cz>

and I guess
Reported-by: Daniel Forrest <dan.forrest@ssec.wisc.edu>

who somehow vanished from CC list (added back) would be appropriate as
well.

plus

Fixes: 5beb49305251 (mm: change anon_vma linking to fix multi-process server scalability issue)
and mark it for stable

Thanks!

> ---
>  include/linux/rmap.h | 16 ++++++++++++++++
>  mm/rmap.c            | 29 ++++++++++++++++++++++++++++-
>  2 files changed, 44 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index c0c2bce6b0b7..b1d140c20b37 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -45,6 +45,22 @@ struct anon_vma {
>  	 * mm_take_all_locks() (mm_all_locks_mutex).
>  	 */
>  	struct rb_root rb_root;	/* Interval tree of private "related" vmas */
> +
> +	/*
> +	 * Count of child anon_vmas and VMAs which points to this anon_vma.
> +	 *
> +	 * This counter is used for making decision about reusing old anon_vma
> +	 * instead of forking new one. It allows to detect anon_vmas which have
> +	 * just one direct descendant and no vmas. Reusing such anon_vma not
> +	 * leads to significant preformance regression but prevents degradation
> +	 * of anon_vma hierarchy to endless linear chain.
> +	 *
> +	 * Root anon_vma is never reused because it is its own parent and it has
> +	 * at leat one vma or child, thus at fork it's degree is at least 2.
> +	 */
> +	unsigned degree;
> +
> +	struct anon_vma *parent;	/* Parent of this anon_vma */
>  };
>  
>  /*
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 19886fb2f13a..40ae8184a1e1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -72,6 +72,8 @@ static inline struct anon_vma *anon_vma_alloc(void)
>  	anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
>  	if (anon_vma) {
>  		atomic_set(&anon_vma->refcount, 1);
> +		anon_vma->degree = 1;	/* Reference for first vma */
> +		anon_vma->parent = anon_vma;
>  		/*
>  		 * Initialise the anon_vma root to point to itself. If called
>  		 * from fork, the root will be reset to the parents anon_vma.
> @@ -188,6 +190,7 @@ int anon_vma_prepare(struct vm_area_struct *vma)
>  		if (likely(!vma->anon_vma)) {
>  			vma->anon_vma = anon_vma;
>  			anon_vma_chain_link(vma, avc, anon_vma);
> +			anon_vma->degree++;
>  			allocated = NULL;
>  			avc = NULL;
>  		}
> @@ -256,7 +259,17 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
>  		anon_vma = pavc->anon_vma;
>  		root = lock_anon_vma_root(root, anon_vma);
>  		anon_vma_chain_link(dst, avc, anon_vma);
> +
> +		/*
> +		 * Reuse existing anon_vma if its degree lower than two,
> +		 * that means it has no vma and just one anon_vma child.
> +		 */
> +		if (!dst->anon_vma && anon_vma != src->anon_vma &&
> +				anon_vma->degree < 2)
> +			dst->anon_vma = anon_vma;
>  	}
> +	if (dst->anon_vma)
> +		dst->anon_vma->degree++;
>  	unlock_anon_vma_root(root);
>  	return 0;
>  
> @@ -279,6 +292,9 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  	if (!pvma->anon_vma)
>  		return 0;
>  
> +	/* Drop inherited anon_vma, we'll reuse old one or allocate new. */
> +	vma->anon_vma = NULL;
> +
>  	/*
>  	 * First, attach the new VMA to the parent VMA's anon_vmas,
>  	 * so rmap can find non-COWed pages in child processes.
> @@ -286,6 +302,10 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  	if (anon_vma_clone(vma, pvma))
>  		return -ENOMEM;
>  
> +	/* An old anon_vma has been reused. */
> +	if (vma->anon_vma)
> +		return 0;
> +
>  	/* Then add our own anon_vma. */
>  	anon_vma = anon_vma_alloc();
>  	if (!anon_vma)
> @@ -299,6 +319,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  	 * lock any of the anon_vmas in this anon_vma tree.
>  	 */
>  	anon_vma->root = pvma->anon_vma->root;
> +	anon_vma->parent = pvma->anon_vma;
>  	/*
>  	 * With refcounts, an anon_vma can stay around longer than the
>  	 * process it belongs to. The root anon_vma needs to be pinned until
> @@ -309,6 +330,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>  	vma->anon_vma = anon_vma;
>  	anon_vma_lock_write(anon_vma);
>  	anon_vma_chain_link(vma, avc, anon_vma);
> +	anon_vma->parent->degree++;
>  	anon_vma_unlock_write(anon_vma);
>  
>  	return 0;
> @@ -339,12 +361,16 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
>  		 * Leave empty anon_vmas on the list - we'll need
>  		 * to free them outside the lock.
>  		 */
> -		if (RB_EMPTY_ROOT(&anon_vma->rb_root))
> +		if (RB_EMPTY_ROOT(&anon_vma->rb_root)) {
> +			anon_vma->parent->degree--;
>  			continue;
> +		}
>  
>  		list_del(&avc->same_vma);
>  		anon_vma_chain_free(avc);
>  	}
> +	if (vma->anon_vma)
> +		vma->anon_vma->degree--;
>  	unlock_anon_vma_root(root);
>  
>  	/*
> @@ -355,6 +381,7 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
>  	list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
>  		struct anon_vma *anon_vma = avc->anon_vma;
>  
> +		BUG_ON(anon_vma->degree);
>  		put_anon_vma(anon_vma);
>  
>  		list_del(&avc->same_vma);
> -- 
> 2.1.3
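
The degree accounting in the patch above can be tried out in user space.
What follows is only a rough sketch, not kernel code: it assumes one VMA
per process, replaces the interval tree with a plain counter (nr_linked)
and the anon_vma_chain list with an array, and models only the fork and
exit paths. The names xzalloc, chain_link, vma_fork, vma_exit and
simulate are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct anon_vma {
	struct anon_vma *parent;
	unsigned degree;	/* VMAs pointing here + non-empty children */
	unsigned nr_linked;	/* stands in for the rb_root interval tree */
};

struct vma {
	struct anon_vma *anon_vma;
	struct anon_vma **chain;	/* anon_vma_chain stand-in, oldest first */
	size_t nr_chain;
};

static long live_anon_vmas;

static void *xzalloc(size_t sz)
{
	void *p = calloc(1, sz);
	if (!p)
		abort();
	return p;
}

static void chain_link(struct vma *v, struct anon_vma *av)
{
	v->chain = realloc(v->chain, (v->nr_chain + 1) * sizeof(*v->chain));
	if (!v->chain)
		abort();
	v->chain[v->nr_chain++] = av;
	av->nr_linked++;
}

/* anon_vma_clone() + anon_vma_fork() for one VMA; pvma == NULL makes a root. */
static struct vma *vma_fork(const struct vma *pvma, bool reuse)
{
	struct vma *v = xzalloc(sizeof(*v));
	struct anon_vma *av;

	for (size_t i = 0; pvma && i < pvma->nr_chain; i++) {
		av = pvma->chain[i];
		chain_link(v, av);
		if (reuse && !v->anon_vma && av != pvma->anon_vma && av->degree < 2)
			v->anon_vma = av;
	}
	if (v->anon_vma) {	/* an old anon_vma has been reused */
		v->anon_vma->degree++;
		return v;
	}
	av = xzalloc(sizeof(*av));
	av->degree = 1;		/* reference for the first vma */
	av->parent = pvma ? pvma->anon_vma : av;	/* a root is its own parent */
	av->parent->degree++;
	v->anon_vma = av;
	chain_link(v, av);
	live_anon_vmas++;
	return v;
}

/* unlink_anon_vmas() when the process owning v exits. */
static void vma_exit(struct vma *v)
{
	for (size_t i = 0; i < v->nr_chain; i++)
		if (--v->chain[i]->nr_linked == 0)
			v->chain[i]->parent->degree--;
	v->anon_vma->degree--;
	for (size_t i = 0; i < v->nr_chain; i++) {
		if (v->chain[i]->nr_linked)
			continue;
		assert(v->chain[i]->degree == 0);	/* BUG_ON() in the patch */
		free(v->chain[i]);
		live_anon_vmas--;
	}
	free(v->chain);
	free(v);
}

/* Fork n generations, each parent exiting soon after; count live anon_vmas. */
static long simulate(int n, bool reuse)
{
	struct vma *prev = NULL, *cur = vma_fork(NULL, reuse);

	for (int i = 0; i < n; i++) {
		struct vma *child = vma_fork(cur, reuse);
		if (prev)
			vma_exit(prev);
		prev = cur;
		cur = child;
	}
	long count = live_anon_vmas;	/* snapshot before tearing down */
	if (prev)
		vma_exit(prev);
	vma_exit(cur);
	assert(live_anon_vmas == 0);
	return count;
}
```

In this toy model, simulate(1000, false) ends with 1001 anon_vmas still
allocated (one per generation, none freeable), while simulate(1000, true)
levels off at 4. The exact number depends on the fork/exit ordering
assumed here; the point is only that it stops growing.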

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH] Repeated fork() causes SLAB to grow without bound
  2014-11-26 17:35                                                         ` Michal Hocko
  (?)
@ 2014-12-05 15:44                                                         ` Jerome Marchand
  -1 siblings, 0 replies; 75+ messages in thread
From: Jerome Marchand @ 2014-12-05 15:44 UTC (permalink / raw)
  To: Michal Hocko, Konstantin Khlebnikov
  Cc: Rik van Riel, Michel Lespinasse, Vlastimil Babka, Andrew Morton,
	Hugh Dickins, Andrea Arcangeli, Linux Kernel Mailing List,
	linux-mm, Tim Hartrick, Daniel Forrest

On 11/26/2014 06:35 PM, Michal Hocko wrote:
> On Tue 25-11-14 16:00:06, Michal Hocko wrote:
>> On Tue 25-11-14 16:13:16, Konstantin Khlebnikov wrote:
>>> On Tue, Nov 25, 2014 at 1:59 PM, Michal Hocko <mhocko@suse.cz> wrote:
>>>> On Mon 24-11-14 11:09:40, Konstantin Khlebnikov wrote:
>>>>> On Thu, Nov 20, 2014 at 6:03 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>>>>> On Thu, Nov 20, 2014 at 5:50 PM, Rik van Riel <riel@redhat.com> wrote:
>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>>>> Hash: SHA1
>>>>>>>
>>>>>>> On 11/20/2014 09:42 AM, Konstantin Khlebnikov wrote:
>>>>>>>
>>>>>>>> I'm thinking about limitation for reusing anon_vmas which might
>>>>>>>> increase performance without breaking asymptotic estimation of
>>>>>>>> count anon_vma in the worst case. For example this heuristic: allow
>>>>>>>> to reuse only anon_vma with single direct descendant. It seems
>>>>>>>> there will be around up to two times more anon_vmas but
>>>>>>>> false-aliasing must be much lower.
>>>>>
>>>>> Done. RFC patch in attachment.
> 
> Ok, I finally managed to untangle myself from the vma chains, and your
> patch makes sense to me; it is quite clever, actually. Here it is,
> including the fixup.
> ---
>> From 1d4b0b38198c69ecfeb37670cb1dda767a802c9a Mon Sep 17 00:00:00 2001
>> From: Konstantin Khlebnikov <koct9i@gmail.com>
>> Date: Tue, 25 Nov 2014 10:54:44 +0100
>> Subject: [PATCH] mm: prevent endless growth of anon_vma hierarchy
>>
>> A constantly forking task causes unlimited growth of the anon_vma chain.
>> Each new child allocates a new level of anon_vmas and links its vmas to all
>> previous levels because it inherits pages from them. None of the anon_vmas
>> can be freed because there might be pages which point to them.
>>
>> This patch adds a heuristic which decides whether to reuse an existing
>> anon_vma instead of forking a new one. It counts vmas and direct descendants
>> for each anon_vma. An anon_vma with degree lower than two will be reused at
>> the next fork. As a result, each anon_vma has either a live vma or at least
>> two descendants, endless chains are no longer possible, and the count of
>> anon_vmas is at most twice the count of vmas.
>>
>> Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
>> Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu
> 
> Tested-by: Michal Hocko <mhocko@suse.cz>
> Reviewed-by: Michal Hocko <mhocko@suse.cz>
> 
> and I guess
> Reported-by: Daniel Forrest <dan.forrest@ssec.wisc.edu>

Tested-by: Jerome Marchand <jmarchan@redhat.com>

Minor nitpicks below.

> 
> who somehow vanished from CC list (added back) would be appropriate as
> well.
> 
> plus
> 
> Fixes: 5beb49305251 (mm: change anon_vma linking to fix multi-process server scalability issue)
> and mark it for stable
> 
> Thanks!
> 
>> ---
>>  include/linux/rmap.h | 16 ++++++++++++++++
>>  mm/rmap.c            | 29 ++++++++++++++++++++++++++++-
>>  2 files changed, 44 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index c0c2bce6b0b7..b1d140c20b37 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -45,6 +45,22 @@ struct anon_vma {
>>  	 * mm_take_all_locks() (mm_all_locks_mutex).
>>  	 */
>>  	struct rb_root rb_root;	/* Interval tree of private "related" vmas */
>> +
>> +	/*
>> +	 * Count of child anon_vmas and VMAs which points to this anon_vma.
>> +	 *
>> +	 * This counter is used for making decision about reusing old anon_vma
>> +	 * instead of forking new one. It allows to detect anon_vmas which have
>> +	 * just one direct descendant and no vmas. Reusing such anon_vma not
>> +	 * leads to significant preformance regression but prevents degradation

Does it or does it not lead to significant performance issue? I can't tell.

>> +	 * of anon_vma hierarchy to endless linear chain.
>> +	 *
>> +	 * Root anon_vma is never reused because it is its own parent and it has
>> +	 * at leat one vma or child, thus at fork it's degree is at least 2.

s/leat/least/

Thanks,
Jerome

>> +	 */
>> +	unsigned degree;
>> +
>> +	struct anon_vma *parent;	/* Parent of this anon_vma */
>>  };
>>  
>>  /*
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 19886fb2f13a..40ae8184a1e1 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -72,6 +72,8 @@ static inline struct anon_vma *anon_vma_alloc(void)
>>  	anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
>>  	if (anon_vma) {
>>  		atomic_set(&anon_vma->refcount, 1);
>> +		anon_vma->degree = 1;	/* Reference for first vma */
>> +		anon_vma->parent = anon_vma;
>>  		/*
>>  		 * Initialise the anon_vma root to point to itself. If called
>>  		 * from fork, the root will be reset to the parents anon_vma.
>> @@ -188,6 +190,7 @@ int anon_vma_prepare(struct vm_area_struct *vma)
>>  		if (likely(!vma->anon_vma)) {
>>  			vma->anon_vma = anon_vma;
>>  			anon_vma_chain_link(vma, avc, anon_vma);
>> +			anon_vma->degree++;
>>  			allocated = NULL;
>>  			avc = NULL;
>>  		}
>> @@ -256,7 +259,17 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
>>  		anon_vma = pavc->anon_vma;
>>  		root = lock_anon_vma_root(root, anon_vma);
>>  		anon_vma_chain_link(dst, avc, anon_vma);
>> +
>> +		/*
>> +		 * Reuse existing anon_vma if its degree lower than two,
>> +		 * that means it has no vma and just one anon_vma child.
>> +		 */
>> +		if (!dst->anon_vma && anon_vma != src->anon_vma &&
>> +				anon_vma->degree < 2)
>> +			dst->anon_vma = anon_vma;
>>  	}
>> +	if (dst->anon_vma)
>> +		dst->anon_vma->degree++;
>>  	unlock_anon_vma_root(root);
>>  	return 0;
>>  
>> @@ -279,6 +292,9 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>>  	if (!pvma->anon_vma)
>>  		return 0;
>>  
>> +	/* Drop inherited anon_vma, we'll reuse old one or allocate new. */
>> +	vma->anon_vma = NULL;
>> +
>>  	/*
>>  	 * First, attach the new VMA to the parent VMA's anon_vmas,
>>  	 * so rmap can find non-COWed pages in child processes.
>> @@ -286,6 +302,10 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>>  	if (anon_vma_clone(vma, pvma))
>>  		return -ENOMEM;
>>  
>> +	/* An old anon_vma has been reused. */
>> +	if (vma->anon_vma)
>> +		return 0;
>> +
>>  	/* Then add our own anon_vma. */
>>  	anon_vma = anon_vma_alloc();
>>  	if (!anon_vma)
>> @@ -299,6 +319,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>>  	 * lock any of the anon_vmas in this anon_vma tree.
>>  	 */
>>  	anon_vma->root = pvma->anon_vma->root;
>> +	anon_vma->parent = pvma->anon_vma;
>>  	/*
>>  	 * With refcounts, an anon_vma can stay around longer than the
>>  	 * process it belongs to. The root anon_vma needs to be pinned until
>> @@ -309,6 +330,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>>  	vma->anon_vma = anon_vma;
>>  	anon_vma_lock_write(anon_vma);
>>  	anon_vma_chain_link(vma, avc, anon_vma);
>> +	anon_vma->parent->degree++;
>>  	anon_vma_unlock_write(anon_vma);
>>  
>>  	return 0;
>> @@ -339,12 +361,16 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
>>  		 * Leave empty anon_vmas on the list - we'll need
>>  		 * to free them outside the lock.
>>  		 */
>> -		if (RB_EMPTY_ROOT(&anon_vma->rb_root))
>> +		if (RB_EMPTY_ROOT(&anon_vma->rb_root)) {
>> +			anon_vma->parent->degree--;
>>  			continue;
>> +		}
>>  
>>  		list_del(&avc->same_vma);
>>  		anon_vma_chain_free(avc);
>>  	}
>> +	if (vma->anon_vma)
>> +		vma->anon_vma->degree--;
>>  	unlock_anon_vma_root(root);
>>  
>>  	/*
>> @@ -355,6 +381,7 @@ void unlink_anon_vmas(struct vm_area_struct *vma)
>>  	list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
>>  		struct anon_vma *anon_vma = avc->anon_vma;
>>  
>> +		BUG_ON(anon_vma->degree);
>>  		put_anon_vma(anon_vma);
>>  
>>  		list_del(&avc->same_vma);
>> -- 
>> 2.1.3
> 




Thread overview: 75+ messages
2012-08-16  2:46 Repeated fork() causes SLAB to grow without bound Daniel Forrest
2012-08-16 18:58 ` Rik van Riel
2012-08-16 18:58   ` Rik van Riel
2012-08-18  0:03   ` Daniel Forrest
2012-08-18  0:03     ` Daniel Forrest
2012-08-18  3:46     ` Rik van Riel
2012-08-18  3:46       ` Rik van Riel
2012-08-18  4:07       ` Daniel Forrest
2012-08-18  4:07         ` Daniel Forrest
2012-08-18  4:10         ` Rik van Riel
2012-08-18  4:10           ` Rik van Riel
2012-08-20  8:00       ` Hugh Dickins
2012-08-20  8:00         ` Hugh Dickins
2012-08-20  9:39         ` Michel Lespinasse
2012-08-20  9:39           ` Michel Lespinasse
2012-08-20 11:11           ` Andi Kleen
2012-08-20 11:11             ` Andi Kleen
2012-08-20 11:17           ` Rik van Riel
2012-08-20 11:17             ` Rik van Riel
2012-08-20 11:53             ` Michel Lespinasse
2012-08-20 11:53               ` Michel Lespinasse
2012-08-20 19:11               ` Michel Lespinasse
2012-08-20 19:11                 ` Michel Lespinasse
2012-08-22  3:20           ` [RFC PATCH] " Michel Lespinasse
2012-08-22  3:20             ` Michel Lespinasse
2012-08-22  3:29             ` Rik van Riel
2012-08-22  3:29               ` Rik van Riel
2013-06-03 19:50               ` Daniel Forrest
2013-06-03 19:50                 ` Daniel Forrest
2013-06-04 10:37                 ` Rik van Riel
2013-06-04 10:37                   ` Rik van Riel
2013-06-05 14:02                   ` Andrea Arcangeli
2013-06-05 14:02                     ` Andrea Arcangeli
2014-11-14 16:30                 ` [PATCH] " Daniel Forrest
2014-11-14 16:30                   ` Daniel Forrest
2014-11-18  0:02                   ` Andrew Morton
2014-11-18  0:02                     ` Andrew Morton
2014-11-18  1:41                     ` Daniel Forrest
2014-11-18  1:41                       ` Daniel Forrest
2014-11-18  2:41                       ` Rik van Riel
2014-11-18  2:41                         ` Rik van Riel
2014-11-18 20:19                         ` Andrew Morton
2014-11-18 20:19                           ` Andrew Morton
2014-11-18 22:15                           ` Konstantin Khlebnikov
2014-11-18 22:15                             ` Konstantin Khlebnikov
2014-11-18 23:02                             ` Konstantin Khlebnikov
2014-11-18 23:50                               ` Vlastimil Babka
2014-11-18 23:50                                 ` Vlastimil Babka
2014-11-19 14:36                                 ` Konstantin Khlebnikov
2014-11-19 14:36                                   ` Konstantin Khlebnikov
2014-11-19 16:09                                   ` Vlastimil Babka
2014-11-19 16:09                                     ` Vlastimil Babka
2014-11-19 16:58                                     ` Konstantin Khlebnikov
2014-11-19 16:58                                       ` Konstantin Khlebnikov
2014-11-19 23:14                                       ` Michel Lespinasse
2014-11-19 23:14                                         ` Michel Lespinasse
2014-11-20 14:42                                         ` Konstantin Khlebnikov
2014-11-20 14:42                                           ` Konstantin Khlebnikov
2014-11-20 14:50                                           ` Rik van Riel
2014-11-20 14:50                                             ` Rik van Riel
2014-11-20 15:03                                             ` Konstantin Khlebnikov
2014-11-20 15:03                                               ` Konstantin Khlebnikov
2014-11-24  7:09                                               ` Konstantin Khlebnikov
2014-11-25 10:59                                                 ` Michal Hocko
2014-11-25 10:59                                                   ` Michal Hocko
2014-11-25 12:13                                                   ` Konstantin Khlebnikov
2014-11-25 15:00                                                     ` Michal Hocko
2014-11-25 15:00                                                       ` Michal Hocko
2014-11-26 17:35                                                       ` Michal Hocko
2014-11-26 17:35                                                         ` Michal Hocko
2014-12-05 15:44                                                         ` Jerome Marchand
2014-11-20 15:27                                           ` Michel Lespinasse
2014-11-20 15:27                                             ` Michel Lespinasse
2014-11-19  2:48                           ` Rik van Riel
2014-11-19  2:48                             ` Rik van Riel