* [resend][PATCH v2] mlock() doesn't wait to finish lru_add_drain_all()
From: KOSAKI Motohiro @ 2009-10-09  2:21 UTC
  To: Peter Zijlstra, Mike Galbraith, Oleg Nesterov, LKML, linux-mm,
	Andrew Morton
  Cc: kosaki.motohiro


Changelog
  since v1
    - rewrote description
    - folded patches 1/2 and 2/2

==========================================================
Recently, Mike Galbraith reported that mlock() hangs for a very long
time on his system. Peter Zijlstra explained the reason:

  Suppose you have 2 cpus, cpu1 is busy doing a SCHED_FIFO-99 while(1),
  cpu0 does mlock()->lru_add_drain_all(), which does
  schedule_on_each_cpu(), which then waits for all cpus to complete the
  work. Except that cpu1, which is busy with the RT task, will never run
  keventd until the RT load goes away.

  This is not so much an actual deadlock as a serious starvation case.

His system has two partitions created with cpusets, and the CPU in the
RT-task partition doesn't have anything in its per-CPU pagevec cache.
Thus, this result was pretty unexpected.

The fact is, mlock() doesn't need to wait for lru_add_drain_all() to
finish: if mlock() can't turn on PG_mlocked, vmscan will turn it on later.

Thus, this patch replaces it with lru_add_drain_all_async().


Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl> 
Reported-by: Mike Galbraith <efault@gmx.de> 
Cc: Oleg Nesterov <onestero@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/swap.h |    1 +
 mm/mlock.c           |    4 ++--
 mm/swap.c            |   24 ++++++++++++++++++++++++
 3 files changed, 27 insertions(+), 2 deletions(-)

Index: b/include/linux/swap.h
===================================================================
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -204,6 +204,7 @@ extern void activate_page(struct page *)
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
 extern int lru_add_drain_all(void);
+extern int lru_add_drain_all_async(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void swap_setup(void);
 
Index: b/mm/swap.c
===================================================================
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -38,6 +38,7 @@ int page_cluster;
 
 static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
+static DEFINE_PER_CPU(struct work_struct, lru_drain_work);
 
 /*
  * This path almost never happens for VM activity - pages are normally
@@ -312,6 +313,24 @@ int lru_add_drain_all(void)
 }
 
 /*
+ * Like lru_add_drain_all(), but do not wait for completion.  Returns 0.
+ */
+int lru_add_drain_all_async(void)
+{
+	int cpu;
+
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
+		struct work_struct *work = &per_cpu(lru_drain_work, cpu);
+		schedule_work_on(cpu, work);
+	}
+	put_online_cpus();
+
+	return 0;
+}
+
+
+/*
  * Batched page_cache_release().  Decrement the reference count on all the
  * passed pages.  If it fell to zero then remove the page from the LRU and
  * free it.
@@ -497,6 +516,7 @@ EXPORT_SYMBOL(pagevec_lookup_tag);
 void __init swap_setup(void)
 {
 	unsigned long megs = totalram_pages >> (20 - PAGE_SHIFT);
+	int cpu;
 
 #ifdef CONFIG_SWAP
 	bdi_init(swapper_space.backing_dev_info);
@@ -511,4 +531,8 @@ void __init swap_setup(void)
 	 * Right now other parts of the system means that we
 	 * _really_ don't want to cluster much more
 	 */
+
+	for_each_possible_cpu(cpu) {
+		INIT_WORK(&per_cpu(lru_drain_work, cpu), lru_add_drain_per_cpu);
+	}
 }
Index: b/mm/mlock.c
===================================================================
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -481,7 +481,7 @@ SYSCALL_DEFINE2(mlock, unsigned long, st
 	if (!can_do_mlock())
 		return -EPERM;
 
-	lru_add_drain_all();	/* flush pagevec */
+	lru_add_drain_all_async();	/* flush pagevec */
 
 	down_write(&current->mm->mmap_sem);
 	len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
@@ -549,7 +549,7 @@ SYSCALL_DEFINE1(mlockall, int, flags)
 	if (!can_do_mlock())
 		goto out;
 
-	lru_add_drain_all();	/* flush pagevec */
+	lru_add_drain_all_async();	/* flush pagevec */
 
 	down_write(&current->mm->mmap_sem);
 





* Re: [resend][PATCH v2] mlock() doesn't wait to finish lru_add_drain_all()
From: Andrew Morton @ 2009-10-12 23:57 UTC
  To: KOSAKI Motohiro
  Cc: Peter Zijlstra, Mike Galbraith, Oleg Nesterov, LKML, linux-mm

On Fri,  9 Oct 2009 11:21:55 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Recently, Mike Galbraith reported that mlock() hangs for a very long
> time on his system. Peter Zijlstra explained the reason:
> 
>   Suppose you have 2 cpus, cpu1 is busy doing a SCHED_FIFO-99 while(1),
>   cpu0 does mlock()->lru_add_drain_all(), which does
>   schedule_on_each_cpu(), which then waits for all cpus to complete the
>   work. Except that cpu1, which is busy with the RT task, will never run
>   keventd until the RT load goes away.
> 
>   This is not so much an actual deadlock as a serious starvation case.
> 
> His system has two partitions created with cpusets, and the CPU in the
> RT-task partition doesn't have anything in its per-CPU pagevec cache.
> Thus, this result was pretty unexpected.
> 
> The fact is, mlock() doesn't need to wait for lru_add_drain_all() to
> finish: if mlock() can't turn on PG_mlocked, vmscan will turn it on later.
> 
> Thus, this patch replaces it with lru_add_drain_all_async().

So why don't we just remove the lru_add_drain_all() call from sys_mlock()?

How did you work out why the lru_add_drain_all() is present in
sys_mlock() anyway?  Neither the code nor the original changelog tell
us.  Who do I thwap for that?  Nick and his reviewers.  Sigh.

There are many callers of lru_add_drain_all() all over the place.  Each
of those is vulnerable to the same starvation issue, is it not?

If so, it would be better to just fix up lru_add_drain_all().  Afaict
all of its functions can be performed in hard IRQ context, so we can
use smp_call_function()?
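
Roughly like this, say (an untested sketch, and it assumes the pagevec
operations are made IRQ-safe first, which is the open question;
drain_local_lru() is just an invented name for the IPI callback):

static void drain_local_lru(void *unused)
{
	/* runs on each CPU in hard IRQ context - no keventd involved */
	lru_add_drain();
}

int lru_add_drain_all(void)
{
	preempt_disable();	/* pin this CPU while IPI-ing the others */
	smp_call_function(drain_local_lru, NULL, 1);
	lru_add_drain();	/* and drain the local pagevecs ourselves */
	preempt_enable();
	return 0;
}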



* Re: [resend][PATCH v2] mlock() doesn't wait to finish lru_add_drain_all()
From: KOSAKI Motohiro @ 2009-10-13  1:17 UTC
  To: Andrew Morton
  Cc: kosaki.motohiro, Peter Zijlstra, Mike Galbraith, Oleg Nesterov,
	LKML, linux-mm

Hi

> On Fri,  9 Oct 2009 11:21:55 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > Recently, Mike Galbraith reported that mlock() hangs for a very long
> > time on his system. Peter Zijlstra explained the reason:
> > 
> >   Suppose you have 2 cpus, cpu1 is busy doing a SCHED_FIFO-99 while(1),
> >   cpu0 does mlock()->lru_add_drain_all(), which does
> >   schedule_on_each_cpu(), which then waits for all cpus to complete the
> >   work. Except that cpu1, which is busy with the RT task, will never run
> >   keventd until the RT load goes away.
> > 
> >   This is not so much an actual deadlock as a serious starvation case.
> > 
> > His system has two partitions created with cpusets, and the CPU in the
> > RT-task partition doesn't have anything in its per-CPU pagevec cache.
> > Thus, this result was pretty unexpected.
> > 
> > The fact is, mlock() doesn't need to wait for lru_add_drain_all() to
> > finish: if mlock() can't turn on PG_mlocked, vmscan will turn it on later.
> > 
> > Thus, this patch replaces it with lru_add_drain_all_async().
> 
> So why don't we just remove the lru_add_drain_all() call from sys_mlock()?

There is a small reason: administrators and testers (including me)
look at the Mlocked field in /proc/meminfo.
They naturally expect the Mlocked field to match the actual number of
mlocked pages when the system is not under stress. Otherwise, we can't
write an mlock test case ;)


> How did you work out why the lru_add_drain_all() is present in
> sys_mlock() anyway?  Neither the code nor the original changelog tell
> us.  Who do I thwap for that?  Nick and his reviewers.  Sigh.

[Umm, my dictionary doesn't tell me the meaning of "thwap".  The meaning
of an imitative word strongly depends on culture, so I have probably
misunderstood this paragraph.]

I worked out the existing reason through a looooooong review.


> There are many callers of lru_add_drain_all() all over the place.  Each
> of those is vulnerable to the same starvation issue, is it not?

They are.

> If so, it would be better to just fix up lru_add_drain_all().  Afaict
> all of its functions can be performed in hard IRQ context, so we can
> use smp_call_function()?

That is an option, but it has one downside: it requires the
lru_add_pvecs-related functions to disable IRQs.

__lru_cache_add() is often called from the page fault path, so we need
performance measurements.







* Re: [resend][PATCH v2] mlock() doesn't wait to finish lru_add_drain_all()
From: Andrew Morton @ 2009-10-13  1:51 UTC
  To: KOSAKI Motohiro
  Cc: Peter Zijlstra, Mike Galbraith, Oleg Nesterov, LKML, linux-mm

On Tue, 13 Oct 2009 10:17:48 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Hi
> 
> > On Fri,  9 Oct 2009 11:21:55 +0900 (JST)
> > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > 
> > > Recently, Mike Galbraith reported that mlock() hangs for a very long
> > > time on his system. Peter Zijlstra explained the reason:
> > > 
> > >   Suppose you have 2 cpus, cpu1 is busy doing a SCHED_FIFO-99 while(1),
> > >   cpu0 does mlock()->lru_add_drain_all(), which does
> > >   schedule_on_each_cpu(), which then waits for all cpus to complete the
> > >   work. Except that cpu1, which is busy with the RT task, will never run
> > >   keventd until the RT load goes away.
> > > 
> > >   This is not so much an actual deadlock as a serious starvation case.
> > > 
> > > His system has two partitions created with cpusets, and the CPU in the
> > > RT-task partition doesn't have anything in its per-CPU pagevec cache.
> > > Thus, this result was pretty unexpected.
> > > 
> > > The fact is, mlock() doesn't need to wait for lru_add_drain_all() to
> > > finish: if mlock() can't turn on PG_mlocked, vmscan will turn it on later.
> > > 
> > > Thus, this patch replaces it with lru_add_drain_all_async().
> > 
> > So why don't we just remove the lru_add_drain_all() call from sys_mlock()?
> 
> There is a small reason: administrators and testers (including me)
> look at the Mlocked field in /proc/meminfo.
> They naturally expect the Mlocked field to match the actual number of
> mlocked pages when the system is not under stress. Otherwise, we can't
> write an mlock test case ;)
> 
> 
> > How did you work out why the lru_add_drain_all() is present in
> > sys_mlock() anyway?  Neither the code nor the original changelog tell
> > us.  Who do I thwap for that?  Nick and his reviewers.  Sigh.
> 
> [Umm, my dictionary doesn't tell me the meaning of "thwap".  The meaning
> of an imitative word strongly depends on culture, so I have probably
> misunderstood this paragraph.]

"slap"?

> I worked out the existing reason through a looooooong review.
> 
> 
> > There are many callers of lru_add_drain_all() all over the place.  Each
> > of those is vulnerable to the same starvation issue, is it not?
> 
> They are.
> 
> > If so, it would be better to just fix up lru_add_drain_all().  Afaict
> > all of its functions can be performed in hard IRQ context, so we can
> > use smp_call_function()?
> 
> That is an option, but it has one downside: it requires the
> lru_add_pvecs-related functions to disable IRQs.

I don't know what this means.  ____pagevec_lru_add() (for example) can
be trivially changed from spin_lock_irq() to spin_lock_irqsave().

In other cases we can perhaps split an existing

foo()
{
	spin_lock_irq(zone->lock);
}

into

__foo()
{
	spin_lock(zone->lock);
}

foo()
{
	local_irq_disable();
	__foo();
}

then call the new __foo().
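
Or, where foo() itself can be reached with IRQs already disabled, the
save/restore form of the same split (again just a pattern sketch, not
real code):

foo()
{
	unsigned long flags;

	local_irq_save(flags);
	__foo();
	local_irq_restore(flags);
}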
	




* Re: [resend][PATCH v2] mlock() doesn't wait to finish lru_add_drain_all()
From: KOSAKI Motohiro @ 2009-10-13  3:18 UTC
  To: Andrew Morton
  Cc: kosaki.motohiro, Peter Zijlstra, Mike Galbraith, Oleg Nesterov,
	LKML, linux-mm

> On Tue, 13 Oct 2009 10:17:48 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > Hi
> > 
> > > On Fri,  9 Oct 2009 11:21:55 +0900 (JST)
> > > KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > > 
> > > > Recently, Mike Galbraith reported that mlock() hangs for a very long
> > > > time on his system. Peter Zijlstra explained the reason:
> > > > 
> > > >   Suppose you have 2 cpus, cpu1 is busy doing a SCHED_FIFO-99 while(1),
> > > >   cpu0 does mlock()->lru_add_drain_all(), which does
> > > >   schedule_on_each_cpu(), which then waits for all cpus to complete the
> > > >   work. Except that cpu1, which is busy with the RT task, will never run
> > > >   keventd until the RT load goes away.
> > > > 
> > > >   This is not so much an actual deadlock as a serious starvation case.
> > > > 
> > > > His system has two partitions created with cpusets, and the CPU in the
> > > > RT-task partition doesn't have anything in its per-CPU pagevec cache.
> > > > Thus, this result was pretty unexpected.
> > > > 
> > > > The fact is, mlock() doesn't need to wait for lru_add_drain_all() to
> > > > finish: if mlock() can't turn on PG_mlocked, vmscan will turn it on later.
> > > > 
> > > > Thus, this patch replaces it with lru_add_drain_all_async().
> > > 
> > > So why don't we just remove the lru_add_drain_all() call from sys_mlock()?
> > 
> > There is a small reason: administrators and testers (including me)
> > look at the Mlocked field in /proc/meminfo.
> > They naturally expect the Mlocked field to match the actual number of
> > mlocked pages when the system is not under stress. Otherwise, we can't
> > write an mlock test case ;)
> > 
> > 
> > > How did you work out why the lru_add_drain_all() is present in
> > > sys_mlock() anyway?  Neither the code nor the original changelog tell
> > > us.  Who do I thwap for that?  Nick and his reviewers.  Sigh.
> > 
> > [Umm, my dictionary doesn't tell me the meaning of "thwap".  The meaning
> > of an imitative word strongly depends on culture, so I have probably
> > misunderstood this paragraph.]
> 
> "slap"?
> 
> > I worked out the existing reason through a looooooong review.
> > 
> > 
> > > There are many callers of lru_add_drain_all() all over the place.  Each
> > > of those is vulnerable to the same starvation issue, is it not?
> > 
> > They are.
> > 
> > > If so, it would be better to just fix up lru_add_drain_all().  Afaict
> > > all of its functions can be performed in hard IRQ context, so we can
> > > use smp_call_function()?
> > 
> > That is an option, but it has one downside: it requires the
> > lru_add_pvecs-related functions to disable IRQs.
> 
> I don't know what this means.  ____pagevec_lru_add() (for example) can
> be trivially changed from spin_lock_irq() to spin_lock_irqsave().
> 
> In other cases we can perhaps split an existing
> 
> foo()
> {
> 	spin_lock_irq(zone->lock);
> }
> 
> into
> 
> __foo()
> {
> 	spin_lock(zone->lock);
> }
> 
> foo()
> {
> 	local_irq_disable();
> 	__foo();
> }
> 
> then call the new __foo().

The problem is in __lru_cache_add().

============================================================
void __lru_cache_add(struct page *page, enum lru_list lru)
{
        struct pagevec *pvec = &get_cpu_var(lru_add_pvecs)[lru];

        page_cache_get(page);
        if (!pagevec_add(pvec, page))
                ____pagevec_lru_add(pvec, lru);
        put_cpu_var(lru_add_pvecs);
}
============================================================

The current typical scenario is:
1. preempt disable
2. pick up this CPU's lru_add_pvecs entry
3. page_cache_get()
4. pvec->pages[pvec->nr++] = page;
5. preempt enable

But the preempt disabling assumes drain_cpu_pagevecs() runs in process
context; we would need to convert it to IRQ disabling.
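
Something like this, maybe (an untested sketch):

============================================================
void __lru_cache_add(struct page *page, enum lru_list lru)
{
        struct pagevec *pvec;
        unsigned long flags;

        /* IRQ disabling instead of bare preemption disabling */
        local_irq_save(flags);
        pvec = &__get_cpu_var(lru_add_pvecs)[lru];
        page_cache_get(page);
        if (!pagevec_add(pvec, page))
                ____pagevec_lru_add(pvec, lru);
        local_irq_restore(flags);
}
============================================================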

I don't know how serious this would be; I haven't measured it yet. Yes,
I can measure it, and I will report the result in another mail.

Thanks.




* Re: [resend][PATCH v2] mlock() doesn't wait to finish lru_add_drain_all()
From: Andrew Morton @ 2009-10-13  3:35 UTC
  To: KOSAKI Motohiro
  Cc: Peter Zijlstra, Mike Galbraith, Oleg Nesterov, LKML, linux-mm

On Tue, 13 Oct 2009 12:18:17 +0900 (JST) KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> The problem is in __lru_cache_add().
> 
> ============================================================
> void __lru_cache_add(struct page *page, enum lru_list lru)
> {
>         struct pagevec *pvec = &get_cpu_var(lru_add_pvecs)[lru];
> 
>         page_cache_get(page);
>         if (!pagevec_add(pvec, page))
>                 ____pagevec_lru_add(pvec, lru);
>         put_cpu_var(lru_add_pvecs);
> }
> ============================================================
> 
> The current typical scenario is:
> 1. preempt disable
> 2. pick up this CPU's lru_add_pvecs entry
> 3. page_cache_get()
> 4. pvec->pages[pvec->nr++] = page;
> 5. preempt enable
> 
> But the preempt disabling assumes drain_cpu_pagevecs() runs in process
> context; we would need to convert it to IRQ disabling.

Nope, preempt_disable()/enable() can be performed in hard IRQ context. 
I see nothing in __lru_cache_add() which would cause problems when run
from hard IRQ.

Apart from latency, of course.  Doing a full smp_call_function() in
lru_add_drain_all() might get expensive if it's ever called with any
great frequency.

A smart implementation might take a peek at the other CPUs' queues and
omit the cross-CPU call when a queue is empty, for example..
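
Say, something like this (a sketch; cpu_has_pending_lru_pages() is an
invented helper, and the unlocked peek can race, but a stale answer only
costs one extra or one skipped drain):

static bool cpu_has_pending_lru_pages(int cpu)
{
	struct pagevec *pvecs = per_cpu(lru_add_pvecs, cpu);
	int lru;

	for (lru = 0; lru < NR_LRU_LISTS; lru++)
		if (pagevec_count(&pvecs[lru]))
			return true;
	return pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) != 0;
}

then the per-cpu drain loop (async or sync) becomes

	get_online_cpus();
	for_each_online_cpu(cpu)
		if (cpu_has_pending_lru_pages(cpu))
			schedule_work_on(cpu, &per_cpu(lru_drain_work, cpu));
	put_online_cpus();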




* Re: [resend][PATCH v2] mlock() doesn't wait to finish lru_add_drain_all()
From: Valdis.Kletnieks @ 2009-10-13  4:25 UTC
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Peter Zijlstra, Mike Galbraith, Oleg Nesterov,
	LKML, linux-mm


On Tue, 13 Oct 2009 10:17:48 +0900, KOSAKI Motohiro said:

> > How did you work out why the lru_add_drain_all() is present in
> > sys_mlock() anyway?  Neither the code nor the original changelog tell
> > us.  Who do I thwap for that?  Nick and his reviewers.  Sigh.
> 
> [Umm, my dictionary doesn't tell me the meaning of "thwap".  The meaning
> of an imitative word strongly depends on culture, so I have probably
> misunderstood this paragraph.]

http://ars.userfriendly.org/cartoons/?id=20030210&mode=classic

(biff, thwap, it's all the same - the sound of a cluebat impacting somebody ;)


