linux-kernel.vger.kernel.org archive mirror
* Re: mlock: mlocked pages are unevictable
       [not found] <200810201659.m9KGxtFC016280@hera.kernel.org>
@ 2008-10-21 15:13 ` Heiko Carstens
  2008-10-21 15:51   ` KOSAKI Motohiro
  2008-10-22 15:28   ` mlock: mlocked pages are unevictable Lee Schermerhorn
  0 siblings, 2 replies; 35+ messages in thread
From: Heiko Carstens @ 2008-10-21 15:13 UTC (permalink / raw)
  To: Nick Piggin
  Cc: linux-kernel, Hugh Dickins, Andrew Morton, Linus Torvalds,
	Rik van Riel, Lee Schermerhorn, KOSAKI Motohiro, linux-mm

Hi Nick,

On Mon, Oct 20, 2008 at 04:59:55PM +0000, Linux Kernel Mailing List wrote:
> Gitweb:     http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b291f000393f5a0b679012b39d79fbc85c018233
> Commit:     b291f000393f5a0b679012b39d79fbc85c018233
> Author:     Nick Piggin <npiggin@suse.de>
> AuthorDate: Sat Oct 18 20:26:44 2008 -0700
> Committer:  Linus Torvalds <torvalds@linux-foundation.org>
> CommitDate: Mon Oct 20 08:52:30 2008 -0700
> 
>     mlock: mlocked pages are unevictable

[...]

I think the following part of your patch:

> diff --git a/mm/swap.c b/mm/swap.c
> index fee6b97..bc58c13 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -278,7 +278,7 @@ void lru_add_drain(void)
>  	put_cpu();
>  }
> 
> -#ifdef CONFIG_NUMA
> +#if defined(CONFIG_NUMA) || defined(CONFIG_UNEVICTABLE_LRU)
>  static void lru_add_drain_per_cpu(struct work_struct *dummy)
>  {
>  	lru_add_drain();

causes this (allyesconfig on s390):

[17179587.988810] =======================================================
[17179587.988819] [ INFO: possible circular locking dependency detected ]
[17179587.988824] 2.6.27-06509-g2515ddc-dirty #190
[17179587.988827] -------------------------------------------------------
[17179587.988831] multipathd/3868 is trying to acquire lock:
[17179587.988834]  (events){--..}, at: [<0000000000157f82>] flush_work+0x42/0x124
[17179587.988850] 
[17179587.988851] but task is already holding lock:
[17179587.988854]  (&mm->mmap_sem){----}, at: [<00000000001c0be4>] sys_mlockall+0x5c/0xe0
[17179587.988865] 
[17179587.988866] which lock already depends on the new lock.
[17179587.988867] 
[17179587.988871] 
[17179587.988871] the existing dependency chain (in reverse order) is:
[17179587.988875] 
[17179587.988876] -> #3 (&mm->mmap_sem){----}:
[17179587.988883]        [<0000000000171a42>] __lock_acquire+0x143e/0x17c4
[17179587.988891]        [<0000000000171e5c>] lock_acquire+0x94/0xbc
[17179587.988896]        [<0000000000b2a532>] down_read+0x62/0xd8
[17179587.988905]        [<0000000000b2cc40>] do_dat_exception+0x14c/0x390
[17179587.988910]        [<0000000000114d36>] sysc_return+0x0/0x8
[17179587.988917]        [<00000000006c694a>] copy_from_user_mvcos+0x12/0x84
[17179587.988926]        [<00000000007335f0>] eql_ioctl+0x3e8/0x590
[17179587.988935]        [<00000000008b6230>] dev_ifsioc+0x29c/0x2c8
[17179587.988942]        [<00000000008b6874>] dev_ioctl+0x618/0x680
[17179587.988946]        [<00000000008a1a8c>] sock_ioctl+0x2b4/0x2c8
[17179587.988953]        [<00000000001f99a8>] vfs_ioctl+0x50/0xbc
[17179587.988960]        [<00000000001f9ee2>] do_vfs_ioctl+0x4ce/0x510
[17179587.988965]        [<00000000001f9f94>] sys_ioctl+0x70/0x98
[17179587.988970]        [<0000000000114d30>] sysc_noemu+0x10/0x16
[17179587.988975]        [<0000020000131286>] 0x20000131286
[17179587.988980] 
[17179587.988981] -> #2 (rtnl_mutex){--..}:
[17179587.988987]        [<0000000000171a42>] __lock_acquire+0x143e/0x17c4
[17179587.988993]        [<0000000000171e5c>] lock_acquire+0x94/0xbc
[17179587.988998]        [<0000000000b29ae8>] mutex_lock_nested+0x11c/0x31c
[17179587.989003]        [<00000000008bff1c>] rtnl_lock+0x30/0x40
[17179587.989009]        [<00000000008c144e>] linkwatch_event+0x26/0x6c
[17179587.989015]        [<0000000000157356>] run_workqueue+0x146/0x240
[17179587.989020]        [<000000000015756e>] worker_thread+0x11e/0x134
[17179587.989025]        [<000000000015cd8e>] kthread+0x6e/0xa4
[17179587.989030]        [<000000000010ad9a>] kernel_thread_starter+0x6/0xc
[17179587.989036]        [<000000000010ad94>] kernel_thread_starter+0x0/0xc
[17179587.989042] 
[17179587.989042] -> #1 ((linkwatch_work).work){--..}:
[17179587.989049]        [<0000000000171a42>] __lock_acquire+0x143e/0x17c4
[17179587.989054]        [<0000000000171e5c>] lock_acquire+0x94/0xbc
[17179587.989059]        [<0000000000157350>] run_workqueue+0x140/0x240
[17179587.989064]        [<000000000015756e>] worker_thread+0x11e/0x134
[17179587.989069]        [<000000000015cd8e>] kthread+0x6e/0xa4
[17179587.989074]        [<000000000010ad9a>] kernel_thread_starter+0x6/0xc
[17179587.989079]        [<000000000010ad94>] kernel_thread_starter+0x0/0xc
[17179587.989084] 
[17179587.989085] -> #0 (events){--..}:
[17179587.989091]        [<00000000001716ca>] __lock_acquire+0x10c6/0x17c4
[17179587.989096]        [<0000000000171e5c>] lock_acquire+0x94/0xbc
[17179587.989101]        [<0000000000157fb4>] flush_work+0x74/0x124
[17179587.989107]        [<0000000000158620>] schedule_on_each_cpu+0xec/0x138
[17179587.989112]        [<00000000001b0ab4>] lru_add_drain_all+0x2c/0x40
[17179587.989117]        [<00000000001c05ac>] __mlock_vma_pages_range+0xcc/0x2e8
[17179587.989123]        [<00000000001c0970>] mlock_fixup+0x1a8/0x280
[17179587.989128]        [<00000000001c0aec>] do_mlockall+0xa4/0xd4
[17179587.989133]        [<00000000001c0c36>] sys_mlockall+0xae/0xe0
[17179587.989138]        [<0000000000114d30>] sysc_noemu+0x10/0x16
[17179587.989142]        [<000002000025a466>] 0x2000025a466
[17179587.989147] 
[17179587.989148] other info that might help us debug this:
[17179587.989149] 
[17179587.989154] 1 lock held by multipathd/3868:
[17179587.989156]  #0:  (&mm->mmap_sem){----}, at: [<00000000001c0be4>] sys_mlockall+0x5c/0xe0
[17179587.989165] 
[17179587.989166] stack backtrace:
[17179587.989170] CPU: 0 Not tainted 2.6.27-06509-g2515ddc-dirty #190
[17179587.989174] Process multipathd (pid: 3868, task: 000000003978a298, ksp: 0000000039b23eb8)
[17179587.989178] 000000003978aa00 0000000039b238b8 0000000000000002 0000000000000000 
[17179587.989184]        0000000039b23958 0000000039b238d0 0000000039b238d0 00000000001060ee 
[17179587.989192]        0000000000000003 0000000000000000 0000000000000000 000000000000000b 
[17179587.989199]        0000000000000060 0000000000000008 0000000039b238b8 0000000039b23928 
[17179587.989207]        0000000000b30b50 00000000001060ee 0000000039b238b8 0000000039b23910 
[17179587.989216] Call Trace:
[17179587.989219] ([<0000000000106036>] show_trace+0xb2/0xd0)
[17179587.989225]  [<000000000010610c>] show_stack+0xb8/0xc8
[17179587.989230]  [<0000000000b27a96>] dump_stack+0xae/0xbc
[17179587.989234]  [<000000000017019e>] print_circular_bug_tail+0xee/0x100
[17179587.989240]  [<00000000001716ca>] __lock_acquire+0x10c6/0x17c4
[17179587.989245]  [<0000000000171e5c>] lock_acquire+0x94/0xbc
[17179587.989250]  [<0000000000157fb4>] flush_work+0x74/0x124
[17179587.989256]  [<0000000000158620>] schedule_on_each_cpu+0xec/0x138
[17179587.989261]  [<00000000001b0ab4>] lru_add_drain_all+0x2c/0x40
[17179587.989266]  [<00000000001c05ac>] __mlock_vma_pages_range+0xcc/0x2e8
[17179587.989271]  [<00000000001c0970>] mlock_fixup+0x1a8/0x280
[17179587.989276]  [<00000000001c0aec>] do_mlockall+0xa4/0xd4
[17179587.989281]  [<00000000001c0c36>] sys_mlockall+0xae/0xe0
[17179587.989286]  [<0000000000114d30>] sysc_noemu+0x10/0x16
[17179587.989290]  [<000002000025a466>] 0x2000025a466
[17179587.989294] INFO: lockdep is turned off.


* Re: mlock: mlocked pages are unevictable
  2008-10-21 15:13 ` mlock: mlocked pages are unevictable Heiko Carstens
@ 2008-10-21 15:51   ` KOSAKI Motohiro
  2008-10-21 17:18     ` KOSAKI Motohiro
  2008-10-22 15:28   ` mlock: mlocked pages are unevictable Lee Schermerhorn
  1 sibling, 1 reply; 35+ messages in thread
From: KOSAKI Motohiro @ 2008-10-21 15:51 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Nick Piggin, linux-kernel, Hugh Dickins, Andrew Morton,
	Linus Torvalds, Rik van Riel, Lee Schermerhorn, linux-mm

Hi

> I think the following part of your patch:
>
>> diff --git a/mm/swap.c b/mm/swap.c
>> index fee6b97..bc58c13 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -278,7 +278,7 @@ void lru_add_drain(void)
>>       put_cpu();
>>  }
>>
>> -#ifdef CONFIG_NUMA
>> +#if defined(CONFIG_NUMA) || defined(CONFIG_UNEVICTABLE_LRU)
>>  static void lru_add_drain_per_cpu(struct work_struct *dummy)
>>  {
>>       lru_add_drain();
>
> causes this (allyesconfig on s390):

Hmm,

I don't think so.

Actually, this patch introduces an
   mmap_sem -> lru_add_drain_all() dependency,

but that dependency already exists elsewhere.
For example:

  sys_move_pages()
      do_move_pages()  <- down_read(mmap_sem)
          migrate_prep()
               lru_add_drain_all()

Thoughts?

> [17179587.988810] =======================================================
> [17179587.988819] [ INFO: possible circular locking dependency detected ]
> [17179587.988824] 2.6.27-06509-g2515ddc-dirty #190
<snip full lockdep trace>


* Re: mlock: mlocked pages are unevictable
  2008-10-21 15:51   ` KOSAKI Motohiro
@ 2008-10-21 17:18     ` KOSAKI Motohiro
  2008-10-21 20:30       ` Peter Zijlstra
  2008-10-23 15:00       ` [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu() KOSAKI Motohiro
  0 siblings, 2 replies; 35+ messages in thread
From: KOSAKI Motohiro @ 2008-10-21 17:18 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Nick Piggin, linux-kernel, Hugh Dickins, Andrew Morton,
	Linus Torvalds, Rik van Riel, Lee Schermerhorn, linux-mm

2008/10/22 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
> Hi
>
>> I think the following part of your patch:
>>
>>> diff --git a/mm/swap.c b/mm/swap.c
>>> index fee6b97..bc58c13 100644
>>> --- a/mm/swap.c
>>> +++ b/mm/swap.c
>>> @@ -278,7 +278,7 @@ void lru_add_drain(void)
>>>       put_cpu();
>>>  }
>>>
>>> -#ifdef CONFIG_NUMA
>>> +#if defined(CONFIG_NUMA) || defined(CONFIG_UNEVICTABLE_LRU)
>>>  static void lru_add_drain_per_cpu(struct work_struct *dummy)
>>>  {
>>>       lru_add_drain();
>>
>> causes this (allyesconfig on s390):
>
> Hmm,
>
> I don't think so.
>
> Actually, this patch introduces an
>   mmap_sem -> lru_add_drain_all() dependency,
>
> but that dependency already exists elsewhere.
> For example:
>
>  sys_move_pages()
>      do_move_pages()  <- down_read(mmap_sem)
>          migrate_prep()
>               lru_add_drain_all()
>
> Thoughts?

OK, I think I understand this issue now.

This bug is caused by the following dependencies:

Some VM paths have
      mmap_sem -> keventd_wq

net/core/dev.c::dev_ioctl() has
     rtnl_lock -> mmap_sem        (*) almost every ioctl does a
copy_from_user(), which can take a page fault.

linkwatch_event() has
    keventd_wq -> rtnl_lock


So I think the VM subsystem shouldn't use keventd_wq, because many
drivers use the ioctl plus workqueue combination, and fixing all of
those drivers isn't easy.
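
To make the cycle concrete, here is a schematic of the three edges in
kernel-style C (the function names are made up for illustration; the
locks and call sites are the ones from the lockdep report above):

/* Edge 1 (cf. sys_mlockall): take mmap_sem, then wait on keventd_wq */
static void vm_side(struct mm_struct *mm)
{
	down_write(&mm->mmap_sem);
	lru_add_drain_all();	/* schedule_on_each_cpu() + flush_work() */
	up_write(&mm->mmap_sem);
}

/* Edge 2 (cf. dev_ioctl/eql_ioctl): take rtnl_lock, then fault,
 * which takes mmap_sem for reading */
static int ioctl_side(struct ifreq __user *uifr)
{
	struct ifreq ifr;
	int err = 0;

	rtnl_lock();
	if (copy_from_user(&ifr, uifr, sizeof(ifr)))	/* may fault -> down_read(mmap_sem) */
		err = -EFAULT;
	rtnl_unlock();
	return err;
}

/* Edge 3 (cf. linkwatch_event): runs *on* keventd_wq and takes rtnl_lock,
 * so waiting on keventd_wq can mean waiting behind rtnl_lock */
static void linkwatch_side(struct work_struct *dummy)
{
	rtnl_lock();
	rtnl_unlock();
}

mmap_sem -> keventd_wq -> rtnl_lock -> mmap_sem, and lockdep complains.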

I'll make the patch soon.


* Re: mlock: mlocked pages are unevictable
  2008-10-21 17:18     ` KOSAKI Motohiro
@ 2008-10-21 20:30       ` Peter Zijlstra
  2008-10-21 20:48         ` Peter Zijlstra
  2008-10-23 15:00       ` [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu() KOSAKI Motohiro
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2008-10-21 20:30 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Heiko Carstens, Nick Piggin, linux-kernel, Hugh Dickins,
	Andrew Morton, Linus Torvalds, Rik van Riel, Lee Schermerhorn,
	linux-mm, Oleg Nesterov

On Wed, 2008-10-22 at 02:18 +0900, KOSAKI Motohiro wrote:
> 2008/10/22 KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>:
> > Hi
> >
> >> I think the following part of your patch:
> >>
> >>> diff --git a/mm/swap.c b/mm/swap.c
> >>> index fee6b97..bc58c13 100644
> >>> --- a/mm/swap.c
> >>> +++ b/mm/swap.c
> >>> @@ -278,7 +278,7 @@ void lru_add_drain(void)
> >>>       put_cpu();
> >>>  }
> >>>
> >>> -#ifdef CONFIG_NUMA
> >>> +#if defined(CONFIG_NUMA) || defined(CONFIG_UNEVICTABLE_LRU)
> >>>  static void lru_add_drain_per_cpu(struct work_struct *dummy)
> >>>  {
> >>>       lru_add_drain();
> >>
> >> causes this (allyesconfig on s390):

I would have suspected the new might_fault() annotation, although I
haven't checked if that made it to Linus yet.

> > Hmm,
> >
> > I don't think so.
> >
> > Actually, this patch introduces an
> >   mmap_sem -> lru_add_drain_all() dependency,
> >
> > but that dependency already exists elsewhere.
> > For example:
> >
> >  sys_move_pages()
> >      do_move_pages()  <- down_read(mmap_sem)
> >          migrate_prep()
> >               lru_add_drain_all()
> >
> > Thoughts?
> 
> OK, I think I understand this issue now.
> 
> This bug is caused by the following dependencies:
> 
> Some VM paths have
>       mmap_sem -> keventd_wq
> 
> net/core/dev.c::dev_ioctl() has
>      rtnl_lock -> mmap_sem        (*) almost every ioctl does a
> copy_from_user(), which can take a page fault.
> 
> linkwatch_event() has
>     keventd_wq -> rtnl_lock
> 
> 
> So I think the VM subsystem shouldn't use keventd_wq, because many
> drivers use the ioctl plus workqueue combination, and fixing all of
> those drivers isn't easy.
> 
> I'll make the patch soon.

Doing what exactly?

The problem appears to be calling flush_work(), which is rather
heavy-handed. We could do schedule_on_each_cpu() using a completion.

Which I think is a nicer solution (if it is sufficient, of course).
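
Something like the following, perhaps: a minimal, untested sketch assuming
the 2.6.27-era work/completion APIs (schedule_work_on(), completions); the
file-scope state is a simplification and ignores concurrent callers:

static DEFINE_PER_CPU(struct work_struct, drain_work);
static atomic_t drain_pending;
static struct completion drain_done;

static void lru_drain_fn(struct work_struct *dummy)
{
	lru_add_drain();
	if (atomic_dec_and_test(&drain_pending))
		complete(&drain_done);		/* last CPU wakes the waiter */
}

int lru_add_drain_all(void)
{
	int cpu;

	init_completion(&drain_done);
	get_online_cpus();
	atomic_set(&drain_pending, num_online_cpus());
	for_each_online_cpu(cpu) {
		struct work_struct *work = &per_cpu(drain_work, cpu);

		INIT_WORK(work, lru_drain_fn);
		schedule_work_on(cpu, work);
	}
	/* wait for our own works only, rather than flush_work() them */
	wait_for_completion(&drain_done);
	put_online_cpus();
	return 0;
}

Note the works would still run behind whatever is already queued on each
CPU's keventd thread (linkwatch_event() included), so this might only quiet
the lockdep annotation without removing the underlying dependency.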

lockdep splat attached for Oleg's convenience.

> [17179587.988810] =======================================================
> [17179587.988819] [ INFO: possible circular locking dependency detected ]
> [17179587.988824] 2.6.27-06509-g2515ddc-dirty #190
> [17179587.988827] -------------------------------------------------------
> [17179587.988831] multipathd/3868 is trying to acquire lock:
> [17179587.988834]  (events){--..}, at: [<0000000000157f82>] flush_work+0x42/0x124
> [17179587.988850] 
> [17179587.988851] but task is already holding lock:
> [17179587.988854]  (&mm->mmap_sem){----}, at: [<00000000001c0be4>] sys_mlockall+0x5c/0xe0
> [17179587.988865] 
> [17179587.988866] which lock already depends on the new lock.
> [17179587.988867] 
> [17179587.988871] 
> [17179587.988871] the existing dependency chain (in reverse order) is:
> [17179587.988875] 
> [17179587.988876] -> #3 (&mm->mmap_sem){----}:
> [17179587.988883]        [<0000000000171a42>] __lock_acquire+0x143e/0x17c4
> [17179587.988891]        [<0000000000171e5c>] lock_acquire+0x94/0xbc
> [17179587.988896]        [<0000000000b2a532>] down_read+0x62/0xd8
> [17179587.988905]        [<0000000000b2cc40>] do_dat_exception+0x14c/0x390
> [17179587.988910]        [<0000000000114d36>] sysc_return+0x0/0x8
> [17179587.988917]        [<00000000006c694a>] copy_from_user_mvcos+0x12/0x84
> [17179587.988926]        [<00000000007335f0>] eql_ioctl+0x3e8/0x590
> [17179587.988935]        [<00000000008b6230>] dev_ifsioc+0x29c/0x2c8
> [17179587.988942]        [<00000000008b6874>] dev_ioctl+0x618/0x680
> [17179587.988946]        [<00000000008a1a8c>] sock_ioctl+0x2b4/0x2c8
> [17179587.988953]        [<00000000001f99a8>] vfs_ioctl+0x50/0xbc
> [17179587.988960]        [<00000000001f9ee2>] do_vfs_ioctl+0x4ce/0x510
> [17179587.988965]        [<00000000001f9f94>] sys_ioctl+0x70/0x98
> [17179587.988970]        [<0000000000114d30>] sysc_noemu+0x10/0x16
> [17179587.988975]        [<0000020000131286>] 0x20000131286
> [17179587.988980] 
> [17179587.988981] -> #2 (rtnl_mutex){--..}:
> [17179587.988987]        [<0000000000171a42>] __lock_acquire+0x143e/0x17c4
> [17179587.988993]        [<0000000000171e5c>] lock_acquire+0x94/0xbc
> [17179587.988998]        [<0000000000b29ae8>] mutex_lock_nested+0x11c/0x31c
> [17179587.989003]        [<00000000008bff1c>] rtnl_lock+0x30/0x40
> [17179587.989009]        [<00000000008c144e>] linkwatch_event+0x26/0x6c
> [17179587.989015]        [<0000000000157356>] run_workqueue+0x146/0x240
> [17179587.989020]        [<000000000015756e>] worker_thread+0x11e/0x134
> [17179587.989025]        [<000000000015cd8e>] kthread+0x6e/0xa4
> [17179587.989030]        [<000000000010ad9a>] kernel_thread_starter+0x6/0xc
> [17179587.989036]        [<000000000010ad94>] kernel_thread_starter+0x0/0xc
> [17179587.989042] 
> [17179587.989042] -> #1 ((linkwatch_work).work){--..}:
> [17179587.989049]        [<0000000000171a42>] __lock_acquire+0x143e/0x17c4
> [17179587.989054]        [<0000000000171e5c>] lock_acquire+0x94/0xbc
> [17179587.989059]        [<0000000000157350>] run_workqueue+0x140/0x240
> [17179587.989064]        [<000000000015756e>] worker_thread+0x11e/0x134
> [17179587.989069]        [<000000000015cd8e>] kthread+0x6e/0xa4
> [17179587.989074]        [<000000000010ad9a>] kernel_thread_starter+0x6/0xc
> [17179587.989079]        [<000000000010ad94>] kernel_thread_starter+0x0/0xc
> [17179587.989084] 
> [17179587.989085] -> #0 (events){--..}:
> [17179587.989091]        [<00000000001716ca>] __lock_acquire+0x10c6/0x17c4
> [17179587.989096]        [<0000000000171e5c>] lock_acquire+0x94/0xbc
> [17179587.989101]        [<0000000000157fb4>] flush_work+0x74/0x124
> [17179587.989107]        [<0000000000158620>] schedule_on_each_cpu+0xec/0x138
> [17179587.989112]        [<00000000001b0ab4>] lru_add_drain_all+0x2c/0x40
> [17179587.989117]        [<00000000001c05ac>] __mlock_vma_pages_range+0xcc/0x2e8
> [17179587.989123]        [<00000000001c0970>] mlock_fixup+0x1a8/0x280
> [17179587.989128]        [<00000000001c0aec>] do_mlockall+0xa4/0xd4
> [17179587.989133]        [<00000000001c0c36>] sys_mlockall+0xae/0xe0
> [17179587.989138]        [<0000000000114d30>] sysc_noemu+0x10/0x16
> [17179587.989142]        [<000002000025a466>] 0x2000025a466
> [17179587.989147] 
> [17179587.989148] other info that might help us debug this:
> [17179587.989149] 
> [17179587.989154] 1 lock held by multipathd/3868:
> [17179587.989156]  #0:  (&mm->mmap_sem){----}, at: [<00000000001c0be4>] sys_mlockall+0x5c/0xe0
> [17179587.989165] 
> [17179587.989166] stack backtrace:
> [17179587.989170] CPU: 0 Not tainted 2.6.27-06509-g2515ddc-dirty #190
> [17179587.989174] Process multipathd (pid: 3868, task: 000000003978a298, ksp: 0000000039b23eb8)
> [17179587.989178] 000000003978aa00 0000000039b238b8 0000000000000002 0000000000000000 
> [17179587.989184]        0000000039b23958 0000000039b238d0 0000000039b238d0 00000000001060ee 
> [17179587.989192]        0000000000000003 0000000000000000 0000000000000000 000000000000000b 
> [17179587.989199]        0000000000000060 0000000000000008 0000000039b238b8 0000000039b23928 
> [17179587.989207]        0000000000b30b50 00000000001060ee 0000000039b238b8 0000000039b23910 
> [17179587.989216] Call Trace:
> [17179587.989219] ([<0000000000106036>] show_trace+0xb2/0xd0)
> [17179587.989225]  [<000000000010610c>] show_stack+0xb8/0xc8
> [17179587.989230]  [<0000000000b27a96>] dump_stack+0xae/0xbc
> [17179587.989234]  [<000000000017019e>] print_circular_bug_tail+0xee/0x100
> [17179587.989240]  [<00000000001716ca>] __lock_acquire+0x10c6/0x17c4
> [17179587.989245]  [<0000000000171e5c>] lock_acquire+0x94/0xbc
> [17179587.989250]  [<0000000000157fb4>] flush_work+0x74/0x124
> [17179587.989256]  [<0000000000158620>] schedule_on_each_cpu+0xec/0x138
> [17179587.989261]  [<00000000001b0ab4>] lru_add_drain_all+0x2c/0x40
> [17179587.989266]  [<00000000001c05ac>] __mlock_vma_pages_range+0xcc/0x2e8
> [17179587.989271]  [<00000000001c0970>] mlock_fixup+0x1a8/0x280
> [17179587.989276]  [<00000000001c0aec>] do_mlockall+0xa4/0xd4
> [17179587.989281]  [<00000000001c0c36>] sys_mlockall+0xae/0xe0
> [17179587.989286]  [<0000000000114d30>] sysc_noemu+0x10/0x16
> [17179587.989290]  [<000002000025a466>] 0x2000025a466
> [17179587.989294] INFO: lockdep is turned off.



* Re: mlock: mlocked pages are unevictable
  2008-10-21 20:30       ` Peter Zijlstra
@ 2008-10-21 20:48         ` Peter Zijlstra
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2008-10-21 20:48 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Heiko Carstens, Nick Piggin, linux-kernel, Hugh Dickins,
	Andrew Morton, Linus Torvalds, Rik van Riel, Lee Schermerhorn,
	linux-mm, Oleg Nesterov

On Tue, 2008-10-21 at 22:30 +0200, Peter Zijlstra wrote:

> The problem appears to be calling flush_work(), which is rather heavy
> handed. We could do schedule_on_each_cpu() using a completion.
> 
> Which I think is a nicer solution (if sufficient of course).

Ah, never mind: flush_work() is already doing the right thing, using
barriers and completions.
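
For reference, the barrier mechanism looks roughly like this (simplified,
from memory of kernel/workqueue.c in this tree): flush_work() queues a
barrier work right behind the target work on that CPU and sleeps on the
barrier's completion.

struct wq_barrier {
	struct work_struct	work;
	struct completion	done;
};

static void wq_barrier_func(struct work_struct *work)
{
	struct wq_barrier *barr = container_of(work, struct wq_barrier, work);

	complete(&barr->done);
}

So the caller only waits for the one work item; the catch is that the item
can still be queued behind e.g. linkwatch_event() on the same CPU, which is
where the rtnl_lock dependency sneaks back in.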




* Re: mlock: mlocked pages are unevictable
  2008-10-21 15:13 ` mlock: mlocked pages are unevictable Heiko Carstens
  2008-10-21 15:51   ` KOSAKI Motohiro
@ 2008-10-22 15:28   ` Lee Schermerhorn
  1 sibling, 0 replies; 35+ messages in thread
From: Lee Schermerhorn @ 2008-10-22 15:28 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Nick Piggin, linux-kernel, Hugh Dickins, Andrew Morton,
	Linus Torvalds, Rik van Riel, KOSAKI Motohiro, linux-mm

On Tue, 2008-10-21 at 17:13 +0200, Heiko Carstens wrote:
> Hi Nick,
> 
> On Mon, Oct 20, 2008 at 04:59:55PM +0000, Linux Kernel Mailing List wrote:
> > Gitweb:     http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b291f000393f5a0b679012b39d79fbc85c018233
> > Commit:     b291f000393f5a0b679012b39d79fbc85c018233
> > Author:     Nick Piggin <npiggin@suse.de>
> > AuthorDate: Sat Oct 18 20:26:44 2008 -0700
> > Committer:  Linus Torvalds <torvalds@linux-foundation.org>
> > CommitDate: Mon Oct 20 08:52:30 2008 -0700
> > 
> >     mlock: mlocked pages are unevictable
> 
> [...]
> 
> I think the following part of your patch:
> 
> > diff --git a/mm/swap.c b/mm/swap.c
> > index fee6b97..bc58c13 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -278,7 +278,7 @@ void lru_add_drain(void)
> >  	put_cpu();
> >  }
> > 
> > -#ifdef CONFIG_NUMA
> > +#if defined(CONFIG_NUMA) || defined(CONFIG_UNEVICTABLE_LRU)
> >  static void lru_add_drain_per_cpu(struct work_struct *dummy)
> >  {
> >  	lru_add_drain();
> 
> causes this (allyesconfig on s390):
> 
> [17179587.988810] =======================================================
> [17179587.988819] [ INFO: possible circular locking dependency detected ]
> [17179587.988824] 2.6.27-06509-g2515ddc-dirty #190
> [17179587.988827] -------------------------------------------------------
> [17179587.988831] multipathd/3868 is trying to acquire lock:
> [17179587.988834]  (events){--..}, at: [<0000000000157f82>] flush_work+0x42/0x124
> [17179587.988850] 
> [17179587.988851] but task is already holding lock:
> [17179587.988854]  (&mm->mmap_sem){----}, at: [<00000000001c0be4>] sys_mlockall+0x5c/0xe0
> [17179587.988865] 
> [17179587.988866] which lock already depends on the new lock.
> [17179587.988867] 
<snip>
> [17179587.989148] other info that might help us debug this:
> [17179587.989149] 
> [17179587.989154] 1 lock held by multipathd/3868:
> [17179587.989156]  #0:  (&mm->mmap_sem){----}, at: [<00000000001c0be4>] sys_mlockall+0x5c/0xe0
> [17179587.989165] 
> [17179587.989166] stack backtrace:
> [17179587.989170] CPU: 0 Not tainted 2.6.27-06509-g2515ddc-dirty #190
> [17179587.989174] Process multipathd (pid: 3868, task: 000000003978a298, ksp: 0000000039b23eb8)
> [17179587.989178] 000000003978aa00 0000000039b238b8 0000000000000002 0000000000000000 
> [17179587.989184]        0000000039b23958 0000000039b238d0 0000000039b238d0 00000000001060ee 
> [17179587.989192]        0000000000000003 0000000000000000 0000000000000000 000000000000000b 
> [17179587.989199]        0000000000000060 0000000000000008 0000000039b238b8 0000000039b23928 
> [17179587.989207]        0000000000b30b50 00000000001060ee 0000000039b238b8 0000000039b23910 
> [17179587.989216] Call Trace:
> [17179587.989219] ([<0000000000106036>] show_trace+0xb2/0xd0)
> [17179587.989225]  [<000000000010610c>] show_stack+0xb8/0xc8
> [17179587.989230]  [<0000000000b27a96>] dump_stack+0xae/0xbc
> [17179587.989234]  [<000000000017019e>] print_circular_bug_tail+0xee/0x100
> [17179587.989240]  [<00000000001716ca>] __lock_acquire+0x10c6/0x17c4
> [17179587.989245]  [<0000000000171e5c>] lock_acquire+0x94/0xbc
> [17179587.989250]  [<0000000000157fb4>] flush_work+0x74/0x124
> [17179587.989256]  [<0000000000158620>] schedule_on_each_cpu+0xec/0x138
> [17179587.989261]  [<00000000001b0ab4>] lru_add_drain_all+0x2c/0x40
> [17179587.989266]  [<00000000001c05ac>] __mlock_vma_pages_range+0xcc/0x2e8
> [17179587.989271]  [<00000000001c0970>] mlock_fixup+0x1a8/0x280
> [17179587.989276]  [<00000000001c0aec>] do_mlockall+0xa4/0xd4
> [17179587.989281]  [<00000000001c0c36>] sys_mlockall+0xae/0xe0
> [17179587.989286]  [<0000000000114d30>] sysc_noemu+0x10/0x16
> [17179587.989290]  [<000002000025a466>] 0x2000025a466
> [17179587.989294] INFO: lockdep is turned off.


We could probably remove the lru_add_drain_all() called from
__mlock_vma_pages_range(), or replace it with a local lru_add_drain().
It's only there to push pages that might still be in the lru pagevecs
out to the lru lists so that we can isolate them and move them to
the unevictable list.  The local lru_add_drain() should push any pages
faulted in by the immediately prior call to get_user_pages().  The only
pages we'd miss would be pages [recently?] faulted on another processor
and still in that CPU's pagevec.  So we'd have a page marked as mlocked
on a normal lru list.  If/when vmscan sees it, it will immediately move
it to the unevictable lru list.
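
In diff form, the first option would be roughly this (illustrative sketch
only, untested, hunk context elided):

--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ __mlock_vma_pages_range() @@
-	lru_add_drain_all();
+	lru_add_drain();	/* this CPU only; vmscan rescues the rest */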

The call to lru_add_drain_all() from __clear_page_mlock() may be more
difficult.  Rik added that during testing because we found race
conditions--during COW in the fault path, IIRC--where we would strand an
mlocked page on the unevictable list.  It's an unlikely situation, I
think.  We were beating on COWing of mlocked pages--mlockall(); fork();
child attempts write to shared anon page, mlocked by parent;
munlockall()/exit() from parent--pretty heavily at the time.
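
For reference, a hypothetical userspace sketch of that test pattern (not
the actual test program): the parent mlocks everything, forks, the child
COWs a shared anonymous page, then the parent munlocks and exits.

#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (page == MAP_FAILED)
		return 1;
	memset(page, 0, 4096);			/* fault the page in */

	if (mlockall(MCL_CURRENT | MCL_FUTURE))	/* parent mlocks */
		return 1;

	if (fork() == 0) {
		page[0] = 1;	/* child write: COW of a page mlocked by parent */
		_exit(0);
	}
	wait(NULL);
	munlockall();				/* parent munlocks and exits */
	return 0;
}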

Lee




* [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-21 17:18     ` KOSAKI Motohiro
  2008-10-21 20:30       ` Peter Zijlstra
@ 2008-10-23 15:00       ` KOSAKI Motohiro
  2008-10-24  1:28         ` Nick Piggin
                           ` (3 more replies)
  1 sibling, 4 replies; 35+ messages in thread
From: KOSAKI Motohiro @ 2008-10-23 15:00 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: kosaki.motohiro, Nick Piggin, linux-kernel, Hugh Dickins,
	Andrew Morton, Linus Torvalds, Rik van Riel, Lee Schermerhorn,
	linux-mm, Christoph Lameter

Hi Heiko,

> >> I think the following part of your patch:
> >>
> >>> diff --git a/mm/swap.c b/mm/swap.c
> >>> index fee6b97..bc58c13 100644
> >>> --- a/mm/swap.c
> >>> +++ b/mm/swap.c
> >>> @@ -278,7 +278,7 @@ void lru_add_drain(void)
> >>>       put_cpu();
> >>>  }
> >>>
> >>> -#ifdef CONFIG_NUMA
> >>> +#if defined(CONFIG_NUMA) || defined(CONFIG_UNEVICTABLE_LRU)
> >>>  static void lru_add_drain_per_cpu(struct work_struct *dummy)
> >>>  {
> >>>       lru_add_drain();
> >>
> >> causes this (allyesconfig on s390):
> >
> > Hmm,
> >
> > I don't think so.
> >
> > Actually, this patch introduces an
> >   mmap_sem -> lru_add_drain_all() dependency,
> >
> > but that dependency already exists elsewhere.
> > For example:
> >
> >  sys_move_pages()
> >      do_move_pages()  <- down_read(mmap_sem)
> >          migrate_prep()
> >               lru_add_drain_all()
> >
> > Thoughts?
> 
> OK, I think I understand this issue now.
> 
> This bug is caused by the following dependencies:
> 
> Some VM paths have
>       mmap_sem -> keventd_wq
> 
> net/core/dev.c::dev_ioctl() has
>      rtnl_lock -> mmap_sem        (*) almost every ioctl does a
> copy_from_user(), which can take a page fault.
> 
> linkwatch_event() has
>     keventd_wq -> rtnl_lock
> 
> 
> So I think the VM subsystem shouldn't use keventd_wq, because many
> drivers use the ioctl plus workqueue combination, and fixing all of
> those drivers isn't easy.
> 
> I'll make the patch soon.

My box can't reproduce this issue.
Could you please test the following patch?



~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Heiko reported the following lockdep warning.


=================================================================================
causes this (allyesconfig on s390):

[17179587.988810] =======================================================
[17179587.988819] [ INFO: possible circular locking dependency detected ]
[17179587.988824] 2.6.27-06509-g2515ddc-dirty #190
[17179587.988827] -------------------------------------------------------
[17179587.988831] multipathd/3868 is trying to acquire lock:
[17179587.988834]  (events){--..}, at: [<0000000000157f82>] flush_work+0x42/0x124
[17179587.988850] 
[17179587.988851] but task is already holding lock:
[17179587.988854]  (&mm->mmap_sem){----}, at: [<00000000001c0be4>] sys_mlockall+0x5c/0xe0
[17179587.988865] 
[17179587.988866] which lock already depends on the new lock.
[17179587.988867] 
[17179587.988871] 
[17179587.988871] the existing dependency chain (in reverse order) is:
[17179587.988875] 
[17179587.988876] -> #3 (&mm->mmap_sem){----}:
[17179587.988883]        [<0000000000171a42>] __lock_acquire+0x143e/0x17c4
[17179587.988891]        [<0000000000171e5c>] lock_acquire+0x94/0xbc
[17179587.988896]        [<0000000000b2a532>] down_read+0x62/0xd8
[17179587.988905]        [<0000000000b2cc40>] do_dat_exception+0x14c/0x390
[17179587.988910]        [<0000000000114d36>] sysc_return+0x0/0x8
[17179587.988917]        [<00000000006c694a>] copy_from_user_mvcos+0x12/0x84
[17179587.988926]        [<00000000007335f0>] eql_ioctl+0x3e8/0x590
[17179587.988935]        [<00000000008b6230>] dev_ifsioc+0x29c/0x2c8
[17179587.988942]        [<00000000008b6874>] dev_ioctl+0x618/0x680
[17179587.988946]        [<00000000008a1a8c>] sock_ioctl+0x2b4/0x2c8
[17179587.988953]        [<00000000001f99a8>] vfs_ioctl+0x50/0xbc
[17179587.988960]        [<00000000001f9ee2>] do_vfs_ioctl+0x4ce/0x510
[17179587.988965]        [<00000000001f9f94>] sys_ioctl+0x70/0x98
[17179587.988970]        [<0000000000114d30>] sysc_noemu+0x10/0x16
[17179587.988975]        [<0000020000131286>] 0x20000131286
[17179587.988980] 
[17179587.988981] -> #2 (rtnl_mutex){--..}:
[17179587.988987]        [<0000000000171a42>] __lock_acquire+0x143e/0x17c4
[17179587.988993]        [<0000000000171e5c>] lock_acquire+0x94/0xbc
[17179587.988998]        [<0000000000b29ae8>] mutex_lock_nested+0x11c/0x31c
[17179587.989003]        [<00000000008bff1c>] rtnl_lock+0x30/0x40
[17179587.989009]        [<00000000008c144e>] linkwatch_event+0x26/0x6c
[17179587.989015]        [<0000000000157356>] run_workqueue+0x146/0x240
[17179587.989020]        [<000000000015756e>] worker_thread+0x11e/0x134
[17179587.989025]        [<000000000015cd8e>] kthread+0x6e/0xa4
[17179587.989030]        [<000000000010ad9a>] kernel_thread_starter+0x6/0xc
[17179587.989036]        [<000000000010ad94>] kernel_thread_starter+0x0/0xc
[17179587.989042] 
[17179587.989042] -> #1 ((linkwatch_work).work){--..}:
[17179587.989049]        [<0000000000171a42>] __lock_acquire+0x143e/0x17c4
[17179587.989054]        [<0000000000171e5c>] lock_acquire+0x94/0xbc
[17179587.989059]        [<0000000000157350>] run_workqueue+0x140/0x240
[17179587.989064]        [<000000000015756e>] worker_thread+0x11e/0x134
[17179587.989069]        [<000000000015cd8e>] kthread+0x6e/0xa4
[17179587.989074]        [<000000000010ad9a>] kernel_thread_starter+0x6/0xc
[17179587.989079]        [<000000000010ad94>] kernel_thread_starter+0x0/0xc
[17179587.989084] 
[17179587.989085] -> #0 (events){--..}:
[17179587.989091]        [<00000000001716ca>] __lock_acquire+0x10c6/0x17c4
[17179587.989096]        [<0000000000171e5c>] lock_acquire+0x94/0xbc
[17179587.989101]        [<0000000000157fb4>] flush_work+0x74/0x124
[17179587.989107]        [<0000000000158620>] schedule_on_each_cpu+0xec/0x138
[17179587.989112]        [<00000000001b0ab4>] lru_add_drain_all+0x2c/0x40
[17179587.989117]        [<00000000001c05ac>] __mlock_vma_pages_range+0xcc/0x2e8
[17179587.989123]        [<00000000001c0970>] mlock_fixup+0x1a8/0x280
[17179587.989128]        [<00000000001c0aec>] do_mlockall+0xa4/0xd4
[17179587.989133]        [<00000000001c0c36>] sys_mlockall+0xae/0xe0
[17179587.989138]        [<0000000000114d30>] sysc_noemu+0x10/0x16
[17179587.989142]        [<000002000025a466>] 0x2000025a466
[17179587.989147] 
[17179587.989148] other info that might help us debug this:
[17179587.989149] 
[17179587.989154] 1 lock held by multipathd/3868:
[17179587.989156]  #0:  (&mm->mmap_sem){----}, at: [<00000000001c0be4>] sys_mlockall+0x5c/0xe0
[17179587.989165] 
[17179587.989166] stack backtrace:
[17179587.989170] CPU: 0 Not tainted 2.6.27-06509-g2515ddc-dirty #190
[17179587.989174] Process multipathd (pid: 3868, task: 000000003978a298, ksp: 0000000039b23eb8)
[17179587.989178] 000000003978aa00 0000000039b238b8 0000000000000002 0000000000000000 
[17179587.989184]        0000000039b23958 0000000039b238d0 0000000039b238d0 00000000001060ee 
[17179587.989192]        0000000000000003 0000000000000000 0000000000000000 000000000000000b 
[17179587.989199]        0000000000000060 0000000000000008 0000000039b238b8 0000000039b23928 
[17179587.989207]        0000000000b30b50 00000000001060ee 0000000039b238b8 0000000039b23910 
[17179587.989216] Call Trace:
[17179587.989219] ([<0000000000106036>] show_trace+0xb2/0xd0)
[17179587.989225]  [<000000000010610c>] show_stack+0xb8/0xc8
[17179587.989230]  [<0000000000b27a96>] dump_stack+0xae/0xbc
[17179587.989234]  [<000000000017019e>] print_circular_bug_tail+0xee/0x100
[17179587.989240]  [<00000000001716ca>] __lock_acquire+0x10c6/0x17c4
[17179587.989245]  [<0000000000171e5c>] lock_acquire+0x94/0xbc
[17179587.989250]  [<0000000000157fb4>] flush_work+0x74/0x124
[17179587.989256]  [<0000000000158620>] schedule_on_each_cpu+0xec/0x138
[17179587.989261]  [<00000000001b0ab4>] lru_add_drain_all+0x2c/0x40
[17179587.989266]  [<00000000001c05ac>] __mlock_vma_pages_range+0xcc/0x2e8
[17179587.989271]  [<00000000001c0970>] mlock_fixup+0x1a8/0x280
[17179587.989276]  [<00000000001c0aec>] do_mlockall+0xa4/0xd4
[17179587.989281]  [<00000000001c0c36>] sys_mlockall+0xae/0xe0
[17179587.989286]  [<0000000000114d30>] sysc_noemu+0x10/0x16
[17179587.989290]  [<000002000025a466>] 0x2000025a466
[17179587.989294] INFO: lockdep is turned off.
=======================================================================================

It is caused by the following circular locking dependency:

Some VM paths have
      mmap_sem -> keventd_wq via lru_add_drain_all()

net/core/dev.c::dev_ioctl() has
     rtnl_lock -> mmap_sem        (*) the ioctl does a copy_from_user(), which can take a page fault.

linkwatch_event() has
     keventd_wq -> rtnl_lock


Actually, schedule_on_each_cpu() is a very problematic function:
it makes its caller depend on every worker queued on keventd_wq,
but we can't know which locks those workers hold, because
keventd_wq is widely used outside the core kernel, by drivers too.

So a task holding any lock shouldn't wait on keventd_wq.
Such a task should use its own special-purpose workqueue.



Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
CC: Christoph Lameter <cl@linux-foundation.org>
CC: Nick Piggin <npiggin@suse.de>
CC: Hugh Dickins <hugh@veritas.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Rik van Riel <riel@redhat.com>
CC: Lee Schermerhorn <lee.schermerhorn@hp.com>

 linux-2.6.27-git10-vm_wq/include/linux/workqueue.h |    1 
 linux-2.6.27-git10-vm_wq/kernel/workqueue.c        |   37 +++++++++++++++++++++
 linux-2.6.27-git10-vm_wq/mm/swap.c                 |    8 +++-
 3 files changed, 45 insertions(+), 1 deletion(-)

Index: linux-2.6.27-git10-vm_wq/include/linux/workqueue.h
===================================================================
--- linux-2.6.27-git10-vm_wq.orig/include/linux/workqueue.h	2008-10-23 21:01:38.000000000 +0900
+++ linux-2.6.27-git10-vm_wq/include/linux/workqueue.h	2008-10-23 22:34:20.000000000 +0900
@@ -195,6 +195,7 @@ extern int schedule_delayed_work(struct 
 extern int schedule_delayed_work_on(int cpu, struct delayed_work *work,
 					unsigned long delay);
 extern int schedule_on_each_cpu(work_func_t func);
+int queue_work_on_each_cpu(struct workqueue_struct *wq, work_func_t func);
 extern int current_is_keventd(void);
 extern int keventd_up(void);
 
Index: linux-2.6.27-git10-vm_wq/kernel/workqueue.c
===================================================================
--- linux-2.6.27-git10-vm_wq.orig/kernel/workqueue.c	2008-10-23 21:01:38.000000000 +0900
+++ linux-2.6.27-git10-vm_wq/kernel/workqueue.c	2008-10-23 22:34:20.000000000 +0900
@@ -674,6 +674,8 @@ EXPORT_SYMBOL(schedule_delayed_work_on);
  * Returns -ve errno on failure.
  *
  * schedule_on_each_cpu() is very slow.
+ * The caller must NOT hold any lock; otherwise the flush_work() on
+ * keventd_wq can deadlock.
  */
 int schedule_on_each_cpu(work_func_t func)
 {
@@ -698,6 +700,41 @@ int schedule_on_each_cpu(work_func_t fun
 	return 0;
 }
 
+/**
+ * queue_work_on_each_cpu - call a function on each online CPU
+ *
+ * @wq:   the workqueue
+ * @func: the function to call
+ *
+ * Returns zero on success.
+ * Returns -ve errno on failure.
+ *
+ * Similar to schedule_on_each_cpu(), but the caller supplies the workqueue.
+ * queue_work_on_each_cpu() is very slow.
+ */
+int queue_work_on_each_cpu(struct workqueue_struct *wq, work_func_t func)
+{
+	int cpu;
+	struct work_struct *works;
+
+	works = alloc_percpu(struct work_struct);
+	if (!works)
+		return -ENOMEM;
+
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
+		struct work_struct *work = per_cpu_ptr(works, cpu);
+
+		INIT_WORK(work, func);
+		queue_work_on(cpu, wq, work);
+	}
+	for_each_online_cpu(cpu)
+		flush_work(per_cpu_ptr(works, cpu));
+	put_online_cpus();
+	free_percpu(works);
+	return 0;
+}
+
 void flush_scheduled_work(void)
 {
 	flush_workqueue(keventd_wq);
Index: linux-2.6.27-git10-vm_wq/mm/swap.c
===================================================================
--- linux-2.6.27-git10-vm_wq.orig/mm/swap.c	2008-10-23 21:01:38.000000000 +0900
+++ linux-2.6.27-git10-vm_wq/mm/swap.c	2008-10-23 22:53:27.000000000 +0900
@@ -39,6 +39,8 @@ int page_cluster;
 static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 
+static struct workqueue_struct *vm_wq __read_mostly;
+
 /*
  * This path almost never happens for VM activity - pages are normally
  * freed via pagevecs.  But it gets used by networking.
@@ -310,7 +312,7 @@ static void lru_add_drain_per_cpu(struct
  */
 int lru_add_drain_all(void)
 {
-	return schedule_on_each_cpu(lru_add_drain_per_cpu);
+	return queue_work_on_each_cpu(vm_wq, lru_add_drain_per_cpu);
 }
 
 #else
@@ -611,4 +613,8 @@ void __init swap_setup(void)
 #ifdef CONFIG_HOTPLUG_CPU
 	hotcpu_notifier(cpu_swap_callback, 0);
 #endif
+
+	vm_wq = create_workqueue("vm_work");
+	BUG_ON(!vm_wq);
+
 }




* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-23 15:00       ` [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu() KOSAKI Motohiro
@ 2008-10-24  1:28         ` Nick Piggin
  2008-10-24  4:54           ` KOSAKI Motohiro
  2008-10-24 19:20         ` Heiko Carstens
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 35+ messages in thread
From: Nick Piggin @ 2008-10-24  1:28 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Heiko Carstens, linux-kernel, Hugh Dickins, Andrew Morton,
	Linus Torvalds, Rik van Riel, Lee Schermerhorn, linux-mm,
	Christoph Lameter

On Fri, Oct 24, 2008 at 12:00:17AM +0900, KOSAKI Motohiro wrote:
> It is caused by the following circular locking dependency:
> 
> Some VM paths have
>       mmap_sem -> keventd_wq via lru_add_drain_all()
> 
> net/core/dev.c::dev_ioctl() has
>      rtnl_lock -> mmap_sem        (*) the ioctl does a copy_from_user(), which can take a page fault.
> 
> linkwatch_event() has
>      keventd_wq -> rtnl_lock
> 
> 
> Actually, schedule_on_each_cpu() is a very problematic function:
> it makes its caller depend on every worker queued on keventd_wq,
> but we can't know which locks those workers hold, because
> keventd_wq is widely used outside the core kernel, by drivers too.
> 
> So a task holding any lock shouldn't wait on keventd_wq.
> Such a task should use its own special-purpose workqueue.

I don't see a better way to solve it, other than avoiding lru_add_drain_all


<snip quoted patch>


* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-24  1:28         ` Nick Piggin
@ 2008-10-24  4:54           ` KOSAKI Motohiro
  2008-10-24  4:55             ` Nick Piggin
  0 siblings, 1 reply; 35+ messages in thread
From: KOSAKI Motohiro @ 2008-10-24  4:54 UTC (permalink / raw)
  To: Nick Piggin
  Cc: kosaki.motohiro, Heiko Carstens, linux-kernel, Hugh Dickins,
	Andrew Morton, Linus Torvalds, Rik van Riel, Lee Schermerhorn,
	linux-mm, Christoph Lameter

> > 
> > Actually, schedule_on_each_cpu() is very problematic function.
> > it introduce the dependency of all worker on keventd_wq, 
> > but we can't know what lock held by worker in kevend_wq because
> > keventd_wq is widely used out of kernel drivers too.
> > 
> > So, the task of any lock held shouldn't wait on keventd_wq.
> > Its task should use own special purpose work queue.
> 
> I don't see a better way to solve it, other than avoiding lru_add_drain_all

Well,

Unfortunately, lru_add_drain_all() is also used in some other VM places
(page migration and memory hotplug),
and page migration's usage is the same as this mlock usage
(1. grab mmap_sem,  2. call lru_add_drain_all()).

So changing the mlock usage alone isn't a solution ;-)





* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-24  4:54           ` KOSAKI Motohiro
@ 2008-10-24  4:55             ` Nick Piggin
  2008-10-24  5:29               ` KOSAKI Motohiro
  0 siblings, 1 reply; 35+ messages in thread
From: Nick Piggin @ 2008-10-24  4:55 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Heiko Carstens, linux-kernel, Hugh Dickins, Andrew Morton,
	Linus Torvalds, Rik van Riel, Lee Schermerhorn, linux-mm,
	Christoph Lameter

On Fri, Oct 24, 2008 at 01:54:46PM +0900, KOSAKI Motohiro wrote:
> > > 
> > > Actually, schedule_on_each_cpu() is very problematic function.
> > > it introduce the dependency of all worker on keventd_wq, 
> > > but we can't know what lock held by worker in kevend_wq because
> > > keventd_wq is widely used out of kernel drivers too.
> > > 
> > > So, the task of any lock held shouldn't wait on keventd_wq.
> > > Its task should use own special purpose work queue.
> > 
> > I don't see a better way to solve it, other than avoiding lru_add_drain_all
> 
> Well,
> 
> Unfortunately, lru_add_drain_all() is also used in some other VM places
> (page migration and memory hotplug),
> and page migration's usage is the same as this mlock usage
> (1. grab mmap_sem,  2. call lru_add_drain_all()).
> 
> So changing the mlock usage alone isn't a solution ;-)

No, not mlock alone.


* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-24  4:55             ` Nick Piggin
@ 2008-10-24  5:29               ` KOSAKI Motohiro
  2008-10-24  5:34                 ` Nick Piggin
  0 siblings, 1 reply; 35+ messages in thread
From: KOSAKI Motohiro @ 2008-10-24  5:29 UTC (permalink / raw)
  To: Nick Piggin
  Cc: kosaki.motohiro, Heiko Carstens, linux-kernel, Hugh Dickins,
	Andrew Morton, Linus Torvalds, Rik van Riel, Lee Schermerhorn,
	linux-mm, Christoph Lameter

> > > I don't see a better way to solve it, other than avoiding lru_add_drain_all
> > 
> > Well,
> > 
> > Unfortunately, lru_add_drain_all() is also used in some other VM places
> > (page migration and memory hotplug),
> > and page migration's usage is the same as this mlock usage
> > (1. grab mmap_sem,  2. call lru_add_drain_all()).
> > 
> > So changing the mlock usage alone isn't a solution ;-)
> 
> No, not mlock alone.

Ah, I see.
It seems difficult but valuable. I'll think about this approach for a while.


Thanks.




* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-24  5:29               ` KOSAKI Motohiro
@ 2008-10-24  5:34                 ` Nick Piggin
  2008-10-24  5:51                   ` KOSAKI Motohiro
  0 siblings, 1 reply; 35+ messages in thread
From: Nick Piggin @ 2008-10-24  5:34 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Heiko Carstens, linux-kernel, Hugh Dickins, Andrew Morton,
	Linus Torvalds, Rik van Riel, Lee Schermerhorn, linux-mm,
	Christoph Lameter

On Fri, Oct 24, 2008 at 02:29:18PM +0900, KOSAKI Motohiro wrote:
> > > > I don't see a better way to solve it, other than avoiding lru_add_drain_all
> > > 
> > > Well,
> > > 
> > > Unfortunately, lru_add_drain_all() is also used in some other VM places
> > > (page migration and memory hotplug),
> > > and page migration's usage is the same as this mlock usage
> > > (1. grab mmap_sem,  2. call lru_add_drain_all()).
> > > 
> > > So changing the mlock usage alone isn't a solution ;-)
> > 
> > No, not mlock alone.
> 
> Ah, I see.
> It seems difficult but valuable. I'll think about this approach for a while.

Well, I think it would be nice if we could reduce the use of
lru_add_drain_all(); however, your patch might be the least intrusive
and best short-term solution.



* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-24  5:34                 ` Nick Piggin
@ 2008-10-24  5:51                   ` KOSAKI Motohiro
  0 siblings, 0 replies; 35+ messages in thread
From: KOSAKI Motohiro @ 2008-10-24  5:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: kosaki.motohiro, Heiko Carstens, linux-kernel, Hugh Dickins,
	Andrew Morton, Linus Torvalds, Rik van Riel, Lee Schermerhorn,
	linux-mm, Christoph Lameter

> On Fri, Oct 24, 2008 at 02:29:18PM +0900, KOSAKI Motohiro wrote:
> > > > > I don't see a better way to solve it, other than avoiding lru_add_drain_all
> > > > 
> > > > Well,
> > > > 
> > > > Unfortunately, lru_add_drain_all() is also used in some other VM places
> > > > (page migration and memory hotplug),
> > > > and page migration's usage is the same as this mlock usage
> > > > (1. grab mmap_sem,  2. call lru_add_drain_all()).
> > > > 
> > > > So changing the mlock usage alone isn't a solution ;-)
> > > 
> > > No, not mlock alone.
> > 
> > Ah, I see.
> > It seems difficult but valuable. I'll think about this approach for a while.
> 
> Well, I think it would be nice if we could reduce lru_add_drain_all usage;
> however, your patch might be the least intrusive and best short-term
> solution.

Yup, thanks.

I also think my approach is the best solution for the 2.6.28 timeframe,
and I should work on your better solution for the long term.





^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-23 15:00       ` [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu() KOSAKI Motohiro
  2008-10-24  1:28         ` Nick Piggin
@ 2008-10-24 19:20         ` Heiko Carstens
  2008-10-26 11:06         ` Peter Zijlstra
  2008-10-27 21:55         ` Andrew Morton
  3 siblings, 0 replies; 35+ messages in thread
From: Heiko Carstens @ 2008-10-24 19:20 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Nick Piggin, linux-kernel, Hugh Dickins, Andrew Morton,
	Linus Torvalds, Rik van Riel, Lee Schermerhorn, linux-mm,
	Christoph Lameter

On Fri, Oct 24, 2008 at 12:00:17AM +0900, KOSAKI Motohiro wrote:
> Hi Heiko,
> > This bug is caused by the following dependencies.
> > 
> > Some VM places have
> >       mmap_sem -> kevent_wq
> > 
> > net/core/dev.c::dev_ioctl()  has
> >      rtnl_lock  ->  mmap_sem        (*) almost every ioctl has a
> > copy_from_user() and it can cause a page fault.
> > 
> > linkwatch_event has
> >     kevent_wq -> rtnl_lock
> > 
> > 
> > So, I think the VM subsystem shouldn't use kevent_wq, because many
> > drivers use the ioctl + workqueue combination,
> > and fixing all those drivers isn't easy.
> > 
> > I'll make the patch soon.
> 
> My box can't reproduce this issue.
> Could you please test the following patch?

Your patch seems to fix the issue. At least I don't see the warning anymore ;)

Thanks,
Heiko

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-23 15:00       ` [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu() KOSAKI Motohiro
  2008-10-24  1:28         ` Nick Piggin
  2008-10-24 19:20         ` Heiko Carstens
@ 2008-10-26 11:06         ` Peter Zijlstra
  2008-10-26 13:37           ` KOSAKI Motohiro
  2008-10-27 21:55         ` Andrew Morton
  3 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2008-10-26 11:06 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Heiko Carstens, Nick Piggin, linux-kernel, Hugh Dickins,
	Andrew Morton, Linus Torvalds, Rik van Riel, Lee Schermerhorn,
	linux-mm, Christoph Lameter, Gautham Shenoy, Oleg Nesterov,
	Rusty Russell

On Fri, 2008-10-24 at 00:00 +0900, KOSAKI Motohiro wrote:

> It is caused by the following three-way circular locking dependency.
> 
> Some VM places have
>       mmap_sem -> kevent_wq via lru_add_drain_all()
> 
> net/core/dev.c::dev_ioctl()  has
>      rtnl_lock  ->  mmap_sem        (*) the ioctl does copy_from_user() and it can take a page fault.
> 
> linkwatch_event has
>      kevent_wq -> rtnl_lock
> 
> 
> Actually, schedule_on_each_cpu() is a very problematic function:
> it makes the caller depend on every worker queued on keventd_wq,
> but we can't know what locks a worker in keventd_wq holds, because
> keventd_wq is also widely used outside the core kernel, by drivers.
> 
> So, a task holding any lock shouldn't wait on keventd_wq;
> such a task should use its own special-purpose workqueue.
> 
> 
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
> CC: Christoph Lameter <cl@linux-foundation.org>
> CC: Nick Piggin <npiggin@suse.de>
> CC: Hugh Dickins <hugh@veritas.com>,
> CC: Andrew Morton <akpm@linux-foundation.org>,
> CC: Linus Torvalds <torvalds@linux-foundation.org>,
> CC: Rik van Riel <riel@redhat.com>,
> CC: Lee Schermerhorn <lee.schermerhorn@hp.com>,
> 
>  linux-2.6.27-git10-vm_wq/include/linux/workqueue.h |    1 
>  linux-2.6.27-git10-vm_wq/kernel/workqueue.c        |   37 +++++++++++++++++++++
>  linux-2.6.27-git10-vm_wq/mm/swap.c                 |    8 +++-
>  3 files changed, 45 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6.27-git10-vm_wq/include/linux/workqueue.h
> ===================================================================
> --- linux-2.6.27-git10-vm_wq.orig/include/linux/workqueue.h	2008-10-23 21:01:38.000000000 +0900
> +++ linux-2.6.27-git10-vm_wq/include/linux/workqueue.h	2008-10-23 22:34:20.000000000 +0900
> @@ -195,6 +195,7 @@ extern int schedule_delayed_work(struct 
>  extern int schedule_delayed_work_on(int cpu, struct delayed_work *work,
>  					unsigned long delay);
>  extern int schedule_on_each_cpu(work_func_t func);
> +int queue_work_on_each_cpu(struct workqueue_struct *wq, work_func_t func);
>  extern int current_is_keventd(void);
>  extern int keventd_up(void);
>  
> Index: linux-2.6.27-git10-vm_wq/kernel/workqueue.c
> ===================================================================
> --- linux-2.6.27-git10-vm_wq.orig/kernel/workqueue.c	2008-10-23 21:01:38.000000000 +0900
> +++ linux-2.6.27-git10-vm_wq/kernel/workqueue.c	2008-10-23 22:34:20.000000000 +0900
> @@ -674,6 +674,8 @@ EXPORT_SYMBOL(schedule_delayed_work_on);
>   * Returns -ve errno on failure.
>   *
>   * schedule_on_each_cpu() is very slow.
> + * The caller must NOT hold any lock, otherwise flushing work on
> + * keventd_wq can cause a deadlock.

I think this is too strong.

> */
>  int schedule_on_each_cpu(work_func_t func)
>  {
> @@ -698,6 +700,41 @@ int schedule_on_each_cpu(work_func_t fun
>  	return 0;
>  }
>  
> +/**
> + * queue_work_on_each_cpu - call a function on each online CPU
> + *
> + * @wq:   the workqueue
> + * @func: the function to call
> + *
> + * Returns zero on success.
> + * Returns -ve errno on failure.
> + *
> + * Similar to schedule_on_each_cpu(), but takes a wq argument.
> + * queue_work_on_each_cpu() is very slow.
> + */
> +int queue_work_on_each_cpu(struct workqueue_struct *wq, work_func_t func)
> +{
> +	int cpu;
> +	struct work_struct *works;
> +
> +	works = alloc_percpu(struct work_struct);
> +	if (!works)
> +		return -ENOMEM;
> +
> +	get_online_cpus();
> +	for_each_online_cpu(cpu) {
> +		struct work_struct *work = per_cpu_ptr(works, cpu);
> +
> +		INIT_WORK(work, func);
> +		queue_work_on(cpu, wq, work);
> +	}
> +	for_each_online_cpu(cpu)
> +		flush_work(per_cpu_ptr(works, cpu));
> +	put_online_cpus();
> +	free_percpu(works);
> +	return 0;
> +}
> +

Which gives the opportunity to implement schedule_on_each_cpu() with
this.
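
For example, an untested sketch on top of the quoted code (keventd_wq being
the static workqueue that file already uses) could be:

int schedule_on_each_cpu(work_func_t func)
{
	/* reuse the generic helper, pointed at the default keventd queue */
	return queue_work_on_each_cpu(keventd_wq, func);
}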

> void flush_scheduled_work(void)
>  {
>  	flush_workqueue(keventd_wq);
> Index: linux-2.6.27-git10-vm_wq/mm/swap.c
> ===================================================================
> --- linux-2.6.27-git10-vm_wq.orig/mm/swap.c	2008-10-23 21:01:38.000000000 +0900
> +++ linux-2.6.27-git10-vm_wq/mm/swap.c	2008-10-23 22:53:27.000000000 +0900
> @@ -39,6 +39,8 @@ int page_cluster;
>  static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
>  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
>  
> +static struct workqueue_struct *vm_wq __read_mostly;
> +
>  /*
>   * This path almost never happens for VM activity - pages are normally
>   * freed via pagevecs.  But it gets used by networking.
> @@ -310,7 +312,7 @@ static void lru_add_drain_per_cpu(struct
>   */
>  int lru_add_drain_all(void)
>  {
> -	return schedule_on_each_cpu(lru_add_drain_per_cpu);
> +	return queue_work_on_each_cpu(vm_wq, lru_add_drain_per_cpu);
>  }
>  
>  #else
> @@ -611,4 +613,8 @@ void __init swap_setup(void)
>  #ifdef CONFIG_HOTPLUG_CPU
>  	hotcpu_notifier(cpu_swap_callback, 0);
>  #endif
> +
> +	vm_wq = create_workqueue("vm_work");
> +	BUG_ON(!vm_wq);
> +
>  }

While I really hate adding yet another per-cpu thread for this, I don't
see another way out atm.

Oleg, Rusty, ego, you lot were discussing a similar extra per-cpu
workqueue, can we merge these two?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-26 11:06         ` Peter Zijlstra
@ 2008-10-26 13:37           ` KOSAKI Motohiro
  2008-10-26 13:49             ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: KOSAKI Motohiro @ 2008-10-26 13:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Heiko Carstens, Nick Piggin, linux-kernel, Hugh Dickins,
	Andrew Morton, Linus Torvalds, Rik van Riel, Lee Schermerhorn,
	linux-mm, Christoph Lameter, Gautham Shenoy, Oleg Nesterov,
	Rusty Russell

Hi Peter,

>> @@ -611,4 +613,8 @@ void __init swap_setup(void)
>>  #ifdef CONFIG_HOTPLUG_CPU
>>       hotcpu_notifier(cpu_swap_callback, 0);
>>  #endif
>> +
>> +     vm_wq = create_workqueue("vm_work");
>> +     BUG_ON(!vm_wq);
>> +
>>  }
>
> While I really hate adding yet another per-cpu thread for this, I don't
> see another way out atm.

Can I ask the reason for your hate?
If I don't know it, making an improvement patch is very difficult for me.


> Oleg, Rusty, ego, you lot were discussing a similar extra per-cpu
> workqueue, can we merge these two?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-26 13:37           ` KOSAKI Motohiro
@ 2008-10-26 13:49             ` Peter Zijlstra
  2008-10-26 15:51               ` KOSAKI Motohiro
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2008-10-26 13:49 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Heiko Carstens, Nick Piggin, linux-kernel, Hugh Dickins,
	Andrew Morton, Linus Torvalds, Rik van Riel, Lee Schermerhorn,
	linux-mm, Christoph Lameter, Gautham Shenoy, Oleg Nesterov,
	Rusty Russell, mpm

On Sun, 2008-10-26 at 22:37 +0900, KOSAKI Motohiro wrote:
> Hi Peter,
> 
> >> @@ -611,4 +613,8 @@ void __init swap_setup(void)
> >>  #ifdef CONFIG_HOTPLUG_CPU
> >>       hotcpu_notifier(cpu_swap_callback, 0);
> >>  #endif
> >> +
> >> +     vm_wq = create_workqueue("vm_work");
> >> +     BUG_ON(!vm_wq);
> >> +
> >>  }
> >
> > While I really hate adding yet another per-cpu thread for this, I don't
> > see another way out atm.
> 
> Can I ask the reason for your hate?
> If I don't know it, making an improvement patch is very difficult for me.

There seems to be no drive to keep them down; ps -def output is utterly
dominated by kernel threads on a freshly booted machine with many cpus.

And while they are not _that_ expensive to have around, they are not
free either; I imagine the tiny-linux folks have an interest in
keeping these down too.

> > Oleg, Rusty, ego, you lot were discussing a similar extra per-cpu
> > workqueue, can we merge these two?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-26 13:49             ` Peter Zijlstra
@ 2008-10-26 15:51               ` KOSAKI Motohiro
  2008-10-26 16:17                 ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: KOSAKI Motohiro @ 2008-10-26 15:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Heiko Carstens, Nick Piggin, linux-kernel, Hugh Dickins,
	Andrew Morton, Linus Torvalds, Rik van Riel, Lee Schermerhorn,
	linux-mm, Christoph Lameter, Gautham Shenoy, Oleg Nesterov,
	Rusty Russell, mpm

>> >> @@ -611,4 +613,8 @@ void __init swap_setup(void)
>> >>  #ifdef CONFIG_HOTPLUG_CPU
>> >>       hotcpu_notifier(cpu_swap_callback, 0);
>> >>  #endif
>> >> +
>> >> +     vm_wq = create_workqueue("vm_work");
>> >> +     BUG_ON(!vm_wq);
>> >> +
>> >>  }
>> >
>> > While I really hate adding yet another per-cpu thread for this, I don't
>> > see another way out atm.
>>
>> Can I ask the reason for your hate?
>> If I don't know it, making an improvement patch is very difficult for me.
>
> There seems to be no drive to keep them down; ps -def output is utterly
> dominated by kernel threads on a freshly booted machine with many cpus.

True, but I don't think it is a big problem, because

1. people can easily filter them out with grep.
2. it is just a "sense of beauty" issue, not a real pain.
3. the current ps output is already utterly filled with kernel threads on
a large server :)
    the patch doesn't introduce a new problem.

> And while they are not _that_ expensive to have around, they are not
> free either; I imagine the tiny-linux folks have an interest in
> keeping these down too.

In my embedded job experience, I haven't heard that.
Those folks are strongly interested in memory size and cpu usage, but not
so much in the number of threads.

Yes, too many threads spend a lot of memory on stacks, but the patch
introduces only one extra thread on an embedded device.


Perhaps I misunderstand your intention, so can you point me at the URL
of your previous discussion?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-26 15:51               ` KOSAKI Motohiro
@ 2008-10-26 16:17                 ` Peter Zijlstra
  2008-10-27  3:14                   ` KOSAKI Motohiro
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2008-10-26 16:17 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Heiko Carstens, Nick Piggin, linux-kernel, Hugh Dickins,
	Andrew Morton, Linus Torvalds, Rik van Riel, Lee Schermerhorn,
	linux-mm, Christoph Lameter, Gautham Shenoy, Oleg Nesterov,
	Rusty Russell, mpm

On Mon, 2008-10-27 at 00:51 +0900, KOSAKI Motohiro wrote:
> >> >> @@ -611,4 +613,8 @@ void __init swap_setup(void)
> >> >>  #ifdef CONFIG_HOTPLUG_CPU
> >> >>       hotcpu_notifier(cpu_swap_callback, 0);
> >> >>  #endif
> >> >> +
> >> >> +     vm_wq = create_workqueue("vm_work");
> >> >> +     BUG_ON(!vm_wq);
> >> >> +
> >> >>  }
> >> >
> >> > While I really hate adding yet another per-cpu thread for this, I don't
> >> > see another way out atm.
> >>
> >> Can I ask the reason for your hate?
> >> If I don't know it, making an improvement patch is very difficult for me.
> >
> > There seems to be no drive to keep them down; ps -def output is utterly
> > dominated by kernel threads on a freshly booted machine with many cpus.
> 
> True, but I don't think it is a big problem, because
> 
> 1. people can easily filter them out with grep.
> 2. it is just a "sense of beauty" issue, not a real pain.
> 3. the current ps output is already utterly filled with kernel threads on
> a large server :)
>     the patch doesn't introduce a new problem.

Sure, it's already bad, which is why I think we should see to it that it
doesn't get worse. Also, we could make kthreads use CLONE_PID, in which
case they'd all get collapsed, but that would be a user-visible change
which might upset folks even more.

> > And while they are not _that_ expensive to have around, they are not
> > free either; I imagine the tiny-linux folks have an interest in
> > keeping these down too.
> 
> In my embedded job experience, I haven't heard that.
> Those folks are strongly interested in memory size and cpu usage, but not
> so much in the number of threads.
> 
> Yes, too many threads spend a lot of memory on stacks, but the patch
> introduces only one extra thread on an embedded device.

Right, and it would be about 4k+sizeof(task_struct); some people might be
bothered, but most won't care.

> Perhaps I misunderstand your intention, so can you point me at the URL
> of your previous discussion?

my google skillz fail me, but once in a while people complain that we
have too many kernel threads.

Anyway, if we can re-use this per-cpu workqueue for more goals, I guess
there is even less of an objection.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-26 16:17                 ` Peter Zijlstra
@ 2008-10-27  3:14                   ` KOSAKI Motohiro
  2008-10-27  7:56                     ` Peter Zijlstra
  0 siblings, 1 reply; 35+ messages in thread
From: KOSAKI Motohiro @ 2008-10-27  3:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Heiko Carstens, Nick Piggin, linux-kernel,
	Hugh Dickins, Andrew Morton, Linus Torvalds, Rik van Riel,
	Lee Schermerhorn, linux-mm, Christoph Lameter, Gautham Shenoy,
	Oleg Nesterov, Rusty Russell, mpm

> Right, and it would be about 4k+sizeof(task_struct); some people might be
> bothered, but most won't care.
> 
> > Perhaps I misunderstand your intention, so can you point me at the URL
> > of your previous discussion?
> 
> my google skillz fail me, but once in a while people complain that we
> have too many kernel threads.
> 
> Anyway, if we can re-use this per-cpu workqueue for more goals, I guess
> there is even less of an objection.

In general, you are right,
but this is a special case: mmap_sem is used really widely across various subsystems and drivers
(because a page fault via copy_user introduces a dependency on mmap_sem).

So re-using any workqueue can easily cause a similar deadlock.
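
To make the cycle concrete, the three paths from Heiko's lockdep report line
up roughly like this (a sketch, not actual code):

/*
 * mlock path:                 dev_ioctl() path:            keventd worker:
 *  down_write(&mm->mmap_sem)   rtnl_lock()                  linkwatch_event()
 *  lru_add_drain_all()         copy_from_user()               rtnl_lock()
 *    flush keventd_wq            -> page fault                  (waits on dev_ioctl)
 *      (waits on the worker)       down_read(&mm->mmap_sem)
 *                                    (waits on the mlock path)
 */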


So I think we have two choices (Nick explained them earlier in this thread):

(1) use our own workqueue (this patch)
(2) avoid lru_add_drain_all completely


If you really strongly hate (1), we should aim for (2), IMO.

Thoughts?




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-27  3:14                   ` KOSAKI Motohiro
@ 2008-10-27  7:56                     ` Peter Zijlstra
  2008-10-27  8:03                       ` KOSAKI Motohiro
  0 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2008-10-27  7:56 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Heiko Carstens, Nick Piggin, linux-kernel, Hugh Dickins,
	Andrew Morton, Linus Torvalds, Rik van Riel, Lee Schermerhorn,
	linux-mm, Christoph Lameter, Gautham Shenoy, Oleg Nesterov,
	Rusty Russell, mpm

On Mon, 2008-10-27 at 12:14 +0900, KOSAKI Motohiro wrote:
> > Right, and it would be about 4k+sizeof(task_struct); some people might be
> > bothered, but most won't care.
> > 
> > > Perhaps I misunderstand your intention, so can you point me at the URL
> > > of your previous discussion?
> > 
> > my google skillz fail me, but once in a while people complain that we
> > have too many kernel threads.
> > 
> > Anyway, if we can re-use this per-cpu workqueue for more goals, I guess
> > there is even less of an objection.
> 
> In general, you are right,
> but this is a special case: mmap_sem is used really widely across various subsystems and drivers
> (because a page fault via copy_user introduces a dependency on mmap_sem).
> 
> So re-using any workqueue can easily cause a similar deadlock.

Yeah, I know, and the cpu-hotplug discussion needed another thread due
to yet another locking incident. I was hoping these two could go
together.

Neither are general-purpose workqueues, both need to stay away from the
normal eventd due to deadlocks.

ego, does your extra thread ever use mmap_sem?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-27  7:56                     ` Peter Zijlstra
@ 2008-10-27  8:03                       ` KOSAKI Motohiro
  2008-10-27 10:42                         ` KOSAKI Motohiro
  0 siblings, 1 reply; 35+ messages in thread
From: KOSAKI Motohiro @ 2008-10-27  8:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Heiko Carstens, Nick Piggin, linux-kernel,
	Hugh Dickins, Andrew Morton, Linus Torvalds, Rik van Riel,
	Lee Schermerhorn, linux-mm, Christoph Lameter, Gautham Shenoy,
	Oleg Nesterov, Rusty Russell, mpm

> On Mon, 2008-10-27 at 12:14 +0900, KOSAKI Motohiro wrote:
> > > Right, and it would be about 4k+sizeof(task_struct); some people might be
> > > bothered, but most won't care.
> > > 
> > > > Perhaps I misunderstand your intention, so can you point me at the URL
> > > > of your previous discussion?
> > > 
> > > my google skillz fail me, but once in a while people complain that we
> > > have too many kernel threads.
> > > 
> > > Anyway, if we can re-use this per-cpu workqueue for more goals, I guess
> > > there is even less of an objection.
> > 
> > In general, you are right,
> > but this is a special case: mmap_sem is used really widely across various subsystems and drivers
> > (because a page fault via copy_user introduces a dependency on mmap_sem).
> > 
> > So re-using any workqueue can easily cause a similar deadlock.
> 
> Yeah, I know, and the cpu-hotplug discussion needed another thread due
> to yet another locking incident. I was hoping these two could go
> together.

Yeah, I found that thread. (I think it is "work_on_cpu: helper for doing task on a CPU.", right?)
I'll read it soon.

Please wait a bit.


> 
> Neither are general-purpose workqueues, both need to stay away from the
> normal eventd due to deadlocks.
> 
> ego, does your extra thread ever use mmap_sem?






^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-27  8:03                       ` KOSAKI Motohiro
@ 2008-10-27 10:42                         ` KOSAKI Motohiro
  0 siblings, 0 replies; 35+ messages in thread
From: KOSAKI Motohiro @ 2008-10-27 10:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Heiko Carstens, Nick Piggin, linux-kernel,
	Hugh Dickins, Andrew Morton, Linus Torvalds, Rik van Riel,
	Lee Schermerhorn, linux-mm, Christoph Lameter, Gautham Shenoy,
	Oleg Nesterov, Rusty Russell, mpm

> > On Mon, 2008-10-27 at 12:14 +0900, KOSAKI Motohiro wrote:
> > > > Right, and it would be about 4k+sizeof(task_struct); some people might be
> > > > bothered, but most won't care.
> > > > 
> > > > > Perhaps I misunderstand your intention, so can you point me at the URL
> > > > > of your previous discussion?
> > > > 
> > > > my google skillz fail me, but once in a while people complain that we
> > > > have too many kernel threads.
> > > > 
> > > > Anyway, if we can re-use this per-cpu workqueue for more goals, I guess
> > > > there is even less of an objection.
> > > 
> > > In general, you are right,
> > > but this is a special case: mmap_sem is used really widely across various subsystems and drivers
> > > (because a page fault via copy_user introduces a dependency on mmap_sem).
> > > 
> > > So re-using any workqueue can easily cause a similar deadlock.
> > 
> > Yeah, I know, and the cpu-hotplug discussion needed another thread due
> > to yet another locking incident. I was hoping these two could go
> > together.
> 
> Yeah, I found that thread. (I think it is "work_on_cpu: helper for doing task on a CPU.", right?)
> I'll read it soon.
> 
> Please wait a bit.

Done.

Now, I think smp_call_function() is better for this issue.
I'll try it.

Thanks a lot.

> > Neither are general-purpose workqueues, both need to stay away from the
> > normal eventd due to deadlocks.
> > 
> > ego, does your extra thread ever use mmap_sem?
> 
> 
> 




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-23 15:00       ` [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu() KOSAKI Motohiro
                           ` (2 preceding siblings ...)
  2008-10-26 11:06         ` Peter Zijlstra
@ 2008-10-27 21:55         ` Andrew Morton
  2008-10-28 14:25           ` Christoph Lameter
  3 siblings, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2008-10-27 21:55 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: heiko.carstens, kosaki.motohiro, npiggin, linux-kernel, hugh,
	torvalds, riel, lee.schermerhorn, linux-mm, cl

On Fri, 24 Oct 2008 00:00:17 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Hi Heiko,
> 
> > >> I think the following part of your patch:
> > >>
> > >>> diff --git a/mm/swap.c b/mm/swap.c
> > >>> index fee6b97..bc58c13 100644
> > >>> --- a/mm/swap.c
> > >>> +++ b/mm/swap.c
> > >>> @@ -278,7 +278,7 @@ void lru_add_drain(void)
> > >>>       put_cpu();
> > >>>  }
> > >>>
> > >>> -#ifdef CONFIG_NUMA
> > >>> +#if defined(CONFIG_NUMA) || defined(CONFIG_UNEVICTABLE_LRU)
> > >>>  static void lru_add_drain_per_cpu(struct work_struct *dummy)
> > >>>  {
> > >>>       lru_add_drain();
> > >>
> > >> causes this (allyesconfig on s390):
> > >
> > > hm,
> > >
> > > I don't think so.
> > >
> > > Actually, this patch has
> > >   mmap_sem -> lru_add_drain_all() dependency.
> > >
> > > but its dependency already exist in another place.
> > > example,
> > >
> > >  sys_move_pages()
> > >      do_move_pages()  <- down_read(mmap_sem)
> > >          migrate_prep()
> > >               lru_add_drain_all()

Can we fix that instead?

> ...
>
> It is caused by the following three-way circular locking dependency.
> 
> Some VM places have
>       mmap_sem -> kevent_wq via lru_add_drain_all()
> 
> net/core/dev.c::dev_ioctl()  has
>      rtnl_lock  ->  mmap_sem        (*) the ioctl does copy_from_user() and it can take a page fault.
> 
> linkwatch_event has
>      kevent_wq -> rtnl_lock
> 
> 
> Actually, schedule_on_each_cpu() is a very problematic function:
> it makes the caller depend on every worker queued on keventd_wq,
> but we can't know what locks a worker in keventd_wq holds, because
> keventd_wq is also widely used outside the core kernel, by drivers.
> 
> So, a task holding any lock shouldn't wait on keventd_wq;
> such a task should use its own special-purpose workqueue.
> 

Or we change the callers of lru_add_drain_all() to call it without
holding any locks.  I mean, what's the *point* in calling it with
mmap_sem held?  That won't stop threads from adding new pages into the
pagevecs.


>  #endif
> +
> +	vm_wq = create_workqueue("vm_work");
> +	BUG_ON(!vm_wq);
> +
>  }

Because it's pretty sad to add yet another kernel thread on each CPU
(thousands!) just because of some obscure theoretical deadlock in
page-migration and memory-hotplug.  Most people don't even use those.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-27 21:55         ` Andrew Morton
@ 2008-10-28 14:25           ` Christoph Lameter
  2008-10-28 20:45             ` Andrew Morton
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2008-10-28 14:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, heiko.carstens, npiggin, linux-kernel, hugh,
	torvalds, riel, lee.schermerhorn, linux-mm

On Mon, 27 Oct 2008, Andrew Morton wrote:

> Can we fix that instead?

How about this fix?



Subject: Move migrate_prep out from under mmap_sem

Move the migrate_prep outside the mmap_sem for the following system calls

1. sys_move_pages
2. sys_migrate_pages
3. sys_mbind()

It really does not matter when we flush the lru. The system is free to add
pages onto the lru even during migration which will make the page 
migration either skip the page (mbind, migrate_pages) or return a busy 
state (move_pages).

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

Index: linux-2.6/mm/mempolicy.c
===================================================================
--- linux-2.6.orig/mm/mempolicy.c	2008-10-28 09:16:18.475514878 -0500
+++ linux-2.6/mm/mempolicy.c	2008-10-28 09:22:46.486773874 -0500
@@ -489,12 +489,6 @@
  	int err;
  	struct vm_area_struct *first, *vma, *prev;

-	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
-
-		err = migrate_prep();
-		if (err)
-			return ERR_PTR(err);
-	}

  	first = find_vma(mm, start);
  	if (!first)
@@ -809,9 +803,13 @@
  	const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags)
  {
  	int busy = 0;
-	int err = 0;
+	int err;
  	nodemask_t tmp;

+	err = migrate_prep();
+	if (err)
+		return err;
+
  	down_read(&mm->mmap_sem);

  	err = migrate_vmas(mm, from_nodes, to_nodes, flags);
@@ -974,6 +972,12 @@
  		 start, start + len, mode, mode_flags,
  		 nmask ? nodes_addr(*nmask)[0] : -1);

+	if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
+
+		err = migrate_prep();
+		if (err)
+			return err;
+	}
  	down_write(&mm->mmap_sem);
  	vma = check_range(mm, start, end, nmask,
  			  flags | MPOL_MF_INVERT, &pagelist);
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c	2008-10-28 09:15:46.578013464 -0500
+++ linux-2.6/mm/migrate.c	2008-10-28 09:16:14.038014180 -0500
@@ -841,12 +841,12 @@
  	struct page_to_node *pp;
  	LIST_HEAD(pagelist);

+	migrate_prep();
  	down_read(&mm->mmap_sem);

  	/*
  	 * Build a list of pages to migrate
  	 */
-	migrate_prep();
  	for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
  		struct vm_area_struct *vma;
  		struct page *page;

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-28 14:25           ` Christoph Lameter
@ 2008-10-28 20:45             ` Andrew Morton
  2008-10-28 21:29               ` Lee Schermerhorn
  2008-10-29  7:20               ` [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu() KOSAKI Motohiro
  0 siblings, 2 replies; 35+ messages in thread
From: Andrew Morton @ 2008-10-28 20:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: kosaki.motohiro, heiko.carstens, npiggin, linux-kernel, hugh,
	torvalds, riel, lee.schermerhorn, linux-mm

On Tue, 28 Oct 2008 09:25:31 -0500 (CDT)
Christoph Lameter <cl@linux-foundation.org> wrote:

> On Mon, 27 Oct 2008, Andrew Morton wrote:
> 
> > Can we fix that instead?
> 
> How about this fix?
> 
> 
> 
> Subject: Move migrate_prep out from under mmap_sem
> 
> Move the migrate_prep outside the mmap_sem for the following system calls
> 
> 1. sys_move_pages
> 2. sys_migrate_pages
> 3. sys_mbind()
> 
> It really does not matter when we flush the lru. The system is free to add
> pages onto the lru even during migration which will make the page 
> migration either skip the page (mbind, migrate_pages) or return a busy 
> state (move_pages).
> 

That looks nicer, thanks.  Hopefully it fixes the
lockdep-warning/deadlock...

I guess we should document our newly discovered schedule_on_each_cpu()
problems before we forget about them and later rediscover them.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-28 20:45             ` Andrew Morton
@ 2008-10-28 21:29               ` Lee Schermerhorn
  2008-10-29  7:17                 ` KOSAKI Motohiro
  2008-10-29  7:20               ` [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu() KOSAKI Motohiro
  1 sibling, 1 reply; 35+ messages in thread
From: Lee Schermerhorn @ 2008-10-28 21:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, kosaki.motohiro, heiko.carstens, npiggin,
	linux-kernel, hugh, torvalds, riel, linux-mm

On Tue, 2008-10-28 at 13:45 -0700, Andrew Morton wrote:
> On Tue, 28 Oct 2008 09:25:31 -0500 (CDT)
> Christoph Lameter <cl@linux-foundation.org> wrote:
> 
> > On Mon, 27 Oct 2008, Andrew Morton wrote:
> > 
> > > Can we fix that instead?
> > 
> > How about this fix?
> > 
> > 
> > 
> > Subject: Move migrate_prep out from under mmap_sem
> > 
> > Move the migrate_prep outside the mmap_sem for the following system calls
> > 
> > 1. sys_move_pages
> > 2. sys_migrate_pages
> > 3. sys_mbind()
> > 
> > It really does not matter when we flush the lru. The system is free to add
> > pages onto the lru even during migration which will make the page 
> > migration either skip the page (mbind, migrate_pages) or return a busy 
> > state (move_pages).
> > 
> 
> That looks nicer, thanks.  Hopefully it fixes the
> lockdep-warning/deadlock...

I believe that we still have the lru_add_drain_all() called from the fault
path [with mmap_sem held] in clear_page_mlock().  We call
clear_page_mlock() on COW of an mlocked page in a VM_LOCKED vma to
ensure that we don't end up with an mlocked page in some other task's
non-VM_LOCKED vma where we'd then fail to munlock it later.  During
development testing, Rik encountered scenarios where a page would
encounter a COW fault while it was still making its way to the LRU via
the pagevecs.  So, he added the lru_add_drain_all() and that seemed to avoid
this scenario.

Now, in the current upstream version of the unevictable mlocked pages
patches, we just count any mlocked pages [vmstat] that make their way to
free*page() instead of BUGging out, as we were doing earlier during
development.  So, maybe we can drop the lru_add_drain_all()s in the
unevictable mlocked pages work and live with the occasional freed
mlocked page, or mlocked page on the active/inactive lists to be dealt
with by vmscan.

Comments?

Lee

> 
> I guess we should document our newly discovered schedule_on_each_cpu()
> problems before we forget about them and later rediscover them.




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-28 21:29               ` Lee Schermerhorn
@ 2008-10-29  7:17                 ` KOSAKI Motohiro
  2008-10-29 12:40                   ` Lee Schermerhorn
  0 siblings, 1 reply; 35+ messages in thread
From: KOSAKI Motohiro @ 2008-10-29  7:17 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Andrew Morton, Christoph Lameter, heiko.carstens, npiggin,
	linux-kernel, hugh, torvalds, riel, linux-mm

> I believe that we still have the lru_add_drain_all() called from the fault
> path [with mmap_sem held] in clear_page_mlock().  We call
> clear_page_mlock() on COW of an mlocked page in a VM_LOCKED vma to
> ensure that we don't end up with an mlocked page in some other task's
> non-VM_LOCKED vma where we'd then fail to munlock it later.  During
> development testing, Rik encountered scenarios where a page would
> encounter a COW fault while it was still making its way to the LRU via
> the pagevecs.  So, he added the lru_add_drain_all() and that seemed to avoid
> this scenario.

Agreed.


> Now, in the current upstream version of the unevictable mlocked pages
> patches, we just count any mlocked pages [vmstat] that make their way to
> free*page() instead of BUGging out, as we were doing earlier during
> development.  So, maybe we can drop the lru_add_drain_all()s in the
> unevictable mlocked pages work and live with the occasional freed
> mlocked page, or mlocked page on the active/inactive lists to be dealt
> with by vmscan.

Hm, okay.
Maybe I was wrong.

I'll make a "drop lru_add_drain_all()" patch soon.
I expect I'll need a few days:
  make the patch:                  1 day
  confirm by stress workload:  2-3 days

because Rik's original problem only happened under heavy workload, I think.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-28 20:45             ` Andrew Morton
  2008-10-28 21:29               ` Lee Schermerhorn
@ 2008-10-29  7:20               ` KOSAKI Motohiro
  2008-10-29  8:21                 ` KAMEZAWA Hiroyuki
  2008-11-05  9:51                 ` Peter Zijlstra
  1 sibling, 2 replies; 35+ messages in thread
From: KOSAKI Motohiro @ 2008-10-29  7:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, heiko.carstens, npiggin, linux-kernel, hugh,
	torvalds, riel, lee.schermerhorn, linux-mm

> I guess we should document our newly discovered schedule_on_each_cpu()
> problems before we forget about them and later rediscover them.

Now, schedule_on_each_cpu() is only used by lru_add_drain_all(),
and smp_call_function() is a better way to do the cross-CPU call.

So I propose:
   1. make lru_add_drain_all() use smp_call_function()
   2. remove schedule_on_each_cpu()


Thoughts?
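
Concretely, (1) might look like the sketch below (untested; it assumes
lru_add_drain() is safe to run with interrupts disabled in IPI context,
which is exactly the part that would need auditing):

static void lru_add_drain_ipi(void *dummy)
{
	lru_add_drain();
}

int lru_add_drain_all(void)
{
	/* run lru_add_drain() on every online cpu and wait for completion */
	return on_each_cpu(lru_add_drain_ipi, NULL, 1);
}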

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-29  7:20               ` [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu() KOSAKI Motohiro
@ 2008-10-29  8:21                 ` KAMEZAWA Hiroyuki
  2008-11-05  9:51                 ` Peter Zijlstra
  1 sibling, 0 replies; 35+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-29  8:21 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Christoph Lameter, heiko.carstens, npiggin,
	linux-kernel, hugh, torvalds, riel, lee.schermerhorn, linux-mm

On Wed, 29 Oct 2008 16:20:24 +0900
"KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com> wrote:

> > I guess we should document our newly discovered schedule_on_each_cpu()
> > problems before we forget about them and later rediscover them.
> 
> Now, schedule_on_each_cpu() is only used by lru_add_drain_all(),
> and smp_call_function() is a better way to do the cross-CPU call.
> 
> So I propose:
>    1. make lru_add_drain_all() use smp_call_function()
IMHO, smp_call_function() is not good, either.

The real problem with this lru_add_drain_all() around mlock() is the handling
of the pagevec. How about the attached one? (Not tested at all.. just an idea.)

>    2. remove schedule_on_each_cpu()
> 
I'm using schedule_on_each_cpu() from a non-dangerous context (in a new memcg patch..)

Thanks,
-Kame

==
pagevec is used for avoiding lru_lock contention when adding/removing pages
to/from the LRU. But in the split-lru/unevictable-lru world, this delay in
the pagevec can cause unexpected behavior:
  * A page scheduled to be added to the Unevictable lru can be unlocked
    while it's still in the pagevec.
Because a page wrongly linked to the Unevictable lru cannot come back to a
usual lru, this is a problem. To avoid this kind of situation,
lru_add_drain_all() is called from the mlock() path.


This patch removes the pagevec "delay" for Unevictable pages and removes
lru_add_drain_all(), which is a brutal function that should not be called
from deep inside the kernel.




---
 mm/mlock.c |   13 ++-----------
 mm/swap.c  |   17 +++++++++++++----
 2 files changed, 15 insertions(+), 15 deletions(-)

Index: mmotm-2.6.27+/mm/mlock.c
===================================================================
--- mmotm-2.6.27+.orig/mm/mlock.c
+++ mmotm-2.6.27+/mm/mlock.c
@@ -66,14 +66,9 @@ void __clear_page_mlock(struct page *pag
 		putback_lru_page(page);
 	} else {
 		/*
-		 * Page not on the LRU yet.  Flush all pagevecs and retry.
+		 * Page not on the LRU yet.
+		 * The pagevec will handle this in a proper way.
 		 */
-		lru_add_drain_all();
-		if (!isolate_lru_page(page))
-			putback_lru_page(page);
-		else if (PageUnevictable(page))
-			count_vm_event(UNEVICTABLE_PGSTRANDED);
-
 	}
 }
 
@@ -187,8 +182,6 @@ static long __mlock_vma_pages_range(stru
 	if (vma->vm_flags & VM_WRITE)
 		gup_flags |= GUP_FLAGS_WRITE;
 
-	lru_add_drain_all();	/* push cached pages to LRU */
-
 	while (nr_pages > 0) {
 		int i;
 
@@ -251,8 +244,6 @@ static long __mlock_vma_pages_range(stru
 		ret = 0;
 	}
 
-	lru_add_drain_all();	/* to update stats */
-
 	return ret;	/* count entire vma as locked_vm */
 }
 
Index: mmotm-2.6.27+/mm/swap.c
===================================================================
--- mmotm-2.6.27+.orig/mm/swap.c
+++ mmotm-2.6.27+/mm/swap.c
@@ -200,10 +200,19 @@ void __lru_cache_add(struct page *page, 
 {
 	struct pagevec *pvec = &get_cpu_var(lru_add_pvecs)[lru];
 
-	page_cache_get(page);
-	if (!pagevec_add(pvec, page))
-		____pagevec_lru_add(pvec, lru);
-	put_cpu_var(lru_add_pvecs);
+	if (likely(lru != LRU_UNEVICTABLE)) {
+		page_cache_get(page);
+		if (!pagevec_add(pvec, page))
+			____pagevec_lru_add(pvec, lru);
+		put_cpu_var(lru_add_pvecs);
+	} else {
+		/*
+		 * A page put on the Unevictable list has no chance to come back
+		 * to another LRU (it can be unlocked while in a pagevec),
+		 * so we add it to the LRU synchronously.
+		 */
+		add_page_to_unevictable_list(page);
+	}
 }
 
 /**


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-29  7:17                 ` KOSAKI Motohiro
@ 2008-10-29 12:40                   ` Lee Schermerhorn
  2008-11-06  0:14                     ` [PATCH] get rid of lru_add_drain_all() in munlock path KOSAKI Motohiro
  0 siblings, 1 reply; 35+ messages in thread
From: Lee Schermerhorn @ 2008-10-29 12:40 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Christoph Lameter, heiko.carstens, npiggin,
	linux-kernel, hugh, torvalds, riel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2747 bytes --]

On Wed, 2008-10-29 at 16:17 +0900, KOSAKI Motohiro wrote:
> > I believe that we still have the lru_add_drain_all() called from the fault
> > path [with mmap_sem held] in clear_page_mlock().  We call
> > clear_page_mlock() on COW of an mlocked page in a VM_LOCKED vma to
> > ensure that we don't end up with an mlocked page in some other task's
> > non-VM_LOCKED vma where we'd then fail to munlock it later.  During
> > development testing, Rik encountered scenarios where a page would
> > encounter a COW fault while it was still making its way to the LRU via
> > the pagevecs.  So, he added the lru_add_drain_all() and that seemed to avoid
> > this scenario.
> 
> Agreed.
> 
> 
> > Now, in the current upstream version of the unevictable mlocked pages
> > patches, we just count any mlocked pages [vmstat] that make their way to
> > free*page() instead of BUGging out, as we were doing earlier during
> > development.  So, maybe we can drop the lru_add_drain_all()s in the
> > unevictable mlocked pages work and live with the occasional freed
> > mlocked page, or mlocked page on the active/inactive lists to be dealt
> > with by vmscan.
> 
> Hm, okay.
> Maybe I was wrong.
> 
> I'll make a "drop lru_add_drain_all()" patch soon.
> I expect I'll need a few days:
>   make the patch:                  1 day
>   confirm by stress workload:  2-3 days
> 
> because Rik's original problem only happened under heavy workload, I think.

Indeed.  It was an ad hoc test program [2 versions attached] written
specifically to beat on COW of shared pages mlocked by parent then COWed
by parent or child and unmapped explicitly or via exit.  We were trying
to find all the ways that we could end up freeing mlocked pages--and
there were several.  Most of these turned out to be genuine
coding/design defects [as difficult as that may be to believe :-)], so
tracking them down was worthwhile.  And, I think that, in general,
clearing a page's mlocked state and rescuing from the unevictable lru
list on COW--to prevent the mlocked page from ending up mapped into some
task's non-VM_LOCKED vma--is a good thing to strive for.  

Now, looking at the current code [28-rc1] in [__]clear_page_mlock():
We've already cleared the PG_mlocked flag, we've decremented the mlocked
pages stats, and we're just trying to rescue the page from the
unevictable list to the in/active list.  If we fail to isolate the page,
then either some other task has it isolated and will return it to an
appropriate lru or it resides in a pagevec heading for an in/active lru
list.  We don't use a pagevec for the unevictable list.  Any other cases?  If
not, then we can probably dispense with the "try harder" logic--the
lru_add_drain_all()--in __clear_page_mlock().
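
For reference, the branch in question (the same code the patches elsewhere
in this thread touch) currently reads:

	} else {
		/*
		 * Page not on the LRU yet.  Flush all pagevecs and retry.
		 */
		lru_add_drain_all();
		if (!isolate_lru_page(page))
			putback_lru_page(page);
		else if (PageUnevictable(page))
			count_vm_event(UNEVICTABLE_PGSTRANDED);
	}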

Do you agree?  Or have I missed something?

Lee   

[-- Attachment #2: rvr-mlock-oops.c --]
[-- Type: text/x-csrc, Size: 1171 bytes --]

/*
 * In the split VM code in 2.6.25-rc3-mm1 and later, we see PG_mlock
 * pages freed from the exit/exit_mmap path.  This test case creates
 * a process, forks it, mlocks, touches some memory and exits, to
 * try and trigger the bug - Rik van Riel, Mar 2008
 */
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

#define NUMFORKS 1000
#define MEMSIZE 1024*1024

void child(void)
{
	char * mem;
	int err;
	int i;

	err = mlockall(MCL_CURRENT|MCL_FUTURE);
	if (err < 0) {
		printf("child mlock failed\n");
		exit(1);
	}

	mem = malloc(MEMSIZE);
	if (!mem) {
		printf("child could not allocate memory\n");
		exit(2);
	}

	/* Touch the memory so the kernel allocates actual pages. */
	for (i = 0; i < MEMSIZE; i++)
		mem[i] = i;

	/* Avoids the oops?  Nope ... :( */
	munlockall();

	/* This is where we can trigger the oops. */
	exit(0);
}

int main(int argc, char **argv)
{
	int i;
	int status;

	for (i = 0; i < NUMFORKS ; i++) {
		pid_t pid = fork();

		if (!pid)	
			child(); /* does not return */
		else if (pid > 0)
			wait(&status);
		else {
			printf("fork failed\n");
			exit(1);
		}
	}
	return 0;
}

[-- Attachment #3: rvr-mlock-oops2.c --]
[-- Type: text/x-csrc, Size: 1176 bytes --]

/*
 * In the split VM code in 2.6.25-rc3-mm1 and later, we see PG_mlock
 * pages freed from the exit/exit_mmap path.  This test case creates
 * a process, forks it, mlocks, touches some memory and exits, to
 * try and trigger the bug - Rik van Riel, Mar 2008
 */
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

#define NUMFORKS 1
#define MEMSIZE 1024*1024

void child(void)
{
	char * mem;
	int i;

	mem = malloc(MEMSIZE);
	if (!mem) {
		printf("child could not allocate memory\n");
		exit(2);
	}

	/* Touch the memory so the kernel allocates actual pages. */
	for (i = 0; i < MEMSIZE; i++)
		mem[i] = i;

	/* This is where we can trigger the oops. */
	exit(0);
}

int main(int argc, char **argv)
{
	int i;
	int status;
	pid_t pid;
	int err;

	err = mlockall(MCL_CURRENT|MCL_FUTURE);
	if (err < 0) {
		printf("parent mlock failed\n");
		exit(1);
	}

	pid = getpid();

	printf("parent pid = %d\n", pid);

	for (i = 0; i < NUMFORKS ; i++) {
		pid = fork();

		if (!pid)	
			child(); /* does not return */
		else if (pid > 0)
			wait(&status);
		else {
			printf("fork failed\n");
			exit(1);
		}
	}
	return 0;
}

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-10-29  7:20               ` [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu() KOSAKI Motohiro
  2008-10-29  8:21                 ` KAMEZAWA Hiroyuki
@ 2008-11-05  9:51                 ` Peter Zijlstra
  2008-11-05  9:55                   ` KOSAKI Motohiro
  1 sibling, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2008-11-05  9:51 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Christoph Lameter, heiko.carstens, npiggin,
	linux-kernel, hugh, torvalds, riel, lee.schermerhorn, linux-mm

On Wed, 2008-10-29 at 16:20 +0900, KOSAKI Motohiro wrote:
> > I guess we should document our newly discovered schedule_on_each_cpu()
> > problems before we forget about them and later rediscover them.
> 
> Now, schedule_on_each_cpu() is only used by lru_add_drain_all(),
> and smp_call_function() is a better way to do the cross-CPU call.
> 
> So I propose:
>    1. make lru_add_drain_all() use smp_call_function()
>    2. remove schedule_on_each_cpu()
> 
> 
> Thoughts?

At the very least that will not solve the problem on -rt where a lot of
the smp_call_function() users are converted to schedule_on_each_cpu().



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu()
  2008-11-05  9:51                 ` Peter Zijlstra
@ 2008-11-05  9:55                   ` KOSAKI Motohiro
  0 siblings, 0 replies; 35+ messages in thread
From: KOSAKI Motohiro @ 2008-11-05  9:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Andrew Morton, Christoph Lameter,
	heiko.carstens, npiggin, linux-kernel, hugh, torvalds, riel,
	lee.schermerhorn, linux-mm

> On Wed, 2008-10-29 at 16:20 +0900, KOSAKI Motohiro wrote:
> > > I guess we should document our newly discovered schedule_on_each_cpu()
> > > problems before we forget about them and later rediscover them.
> > 
> > Now, schedule_on_each_cpu() is only used by lru_add_drain_all(),
> > and smp_call_function() is a better way to do the cross-CPU call.
> > 
> > So I propose:
> >    1. make lru_add_drain_all() use smp_call_function()
> >    2. remove schedule_on_each_cpu()
> > 
> > 
> > Thoughts?
> 
> At the very least that will not solve the problem on -rt where a lot of
> the smp_call_function() users are converted to schedule_on_each_cpu().

Yup.
Now I'm testing a "simply drop lru_add_drain_all() in the mlock path" patch.

Thanks.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH] get rid of lru_add_drain_all() in munlock path
  2008-10-29 12:40                   ` Lee Schermerhorn
@ 2008-11-06  0:14                     ` KOSAKI Motohiro
  2008-11-06 16:33                       ` Kamalesh Babulal
  0 siblings, 1 reply; 35+ messages in thread
From: KOSAKI Motohiro @ 2008-11-06  0:14 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: kosaki.motohiro, Andrew Morton, Christoph Lameter,
	heiko.carstens, npiggin, linux-kernel, hugh, torvalds, riel,
	linux-mm, Peter Zijlstra, Kamalesh Babulal

> > > Now, in the current upstream version of the unevictable mlocked pages
> > > patches, we just count any mlocked pages [vmstat] that make their way to
> > > free*page() instead of BUGging out, as we were doing earlier during
> > > development.  So, maybe we can drop the lru_add_drain_all()s in the
> > > unevictable mlocked pages work and live with the occasional freed
> > > mlocked page, or mlocked page on the active/inactive lists to be dealt
> > > with by vmscan.
> > 
> > Hm, okay.
> > Maybe I was wrong.
> > 
> > I'll make a "drop lru_add_drain_all()" patch soon.
> > I expect I'll need a few days:
> >   make the patch:                  1 day
> >   confirm by stress workload:  2-3 days
> > 
> > because Rik's original problem only happened under heavy workload, I think.
> 
> Indeed.  It was an ad hoc test program [2 versions attached] written
> specifically to beat on COW of shared pages mlocked by parent then COWed
> by parent or child and unmapped explicitly or via exit.  We were trying
> to find all the ways that we could end up freeing mlocked pages--and
> there were several.  Most of these turned out to be genuine
> coding/design defects [as difficult as that may be to believe :-)], so
> tracking them down was worthwhile.  And, I think that, in general,
> clearing a page's mlocked state and rescuing from the unevictable lru
> list on COW--to prevent the mlocked page from ending up mapped into some
> task's non-VM_LOCKED vma--is a good thing to strive for.  



> Now, looking at the current code [28-rc1] in [__]clear_page_mlock():
> We've already cleared the PG_mlocked flag, we've decremented the mlocked
> pages stats, and we're just trying to rescue the page from the
> unevictable list to the in/active list.  If we fail to isolate the page,
> then either some other task has it isolated and will return it to an
> appropriate lru or it resides in a pagevec heading for an in/active lru
> list.  We don't use a pagevec for the unevictable list.  Any other cases?  If
> not, then we can probably dispense with the "try harder" logic--the
> lru_add_drain_all()--in __clear_page_mlock().
> 
> Do you agree?  Or have I missed something?

Yup,
you are perfectly right.

Honestly, I thought the lazy rescue wasn't good because it causes a temporary
statistics mismatch between the number of mlocked pages and the number of
unevictable pages, and I thought I could avoid that.

But I was wrong.

I actually made such a patch, but it introduced a lot of unnecessary
messiness. So I believe the simple patch that drops lru_add_drain_all()
is better.

Again, you are right.


Over the past few days, I've run a stress workload and confirmed that my
patch doesn't cause an mlocked page leak.

This patch should also solve Heiko's and Kamalesh's rtnl
circular dependency problem (I think):
http://marc.info/?l=linux-kernel&m=122460208308785&w=2
http://marc.info/?l=linux-netdev&m=122586921407698&w=2


-------------------------------------------------------------------------
lockdep emits the warning below at boot time on one of my test machines.
So schedule_on_each_cpu() shouldn't be called while the task holds mmap_sem.

Actually, lru_add_drain_all() exists to prevent unevictable pages from staying
on a reclaimable lru list. But the current unevictable code can rescue
unevictable pages even while they sit on a reclaimable list.

So removing it is better.

In addition, this patch adds lru_add_drain_all() to sys_mlock() and
sys_mlockall(). It isn't strictly required, but it reduces the failures of
moving pages to the unevictable list. Such failures are rescued by vmscan
later, but reducing them is still better.


Note: if the above rescue happens, the Mlocked and Unevictable fields in
/proc/meminfo can mismatch for a while, but it doesn't cause any real trouble.



~~~~~~~~~~~~~~~~~~~~~~~~~ start here ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.28-rc2-mm1 #2
-------------------------------------------------------
lvm/1103 is trying to acquire lock:
 (&cpu_hotplug.lock){--..}, at: [<c0130789>] get_online_cpus+0x29/0x50

but task is already holding lock:
 (&mm->mmap_sem){----}, at: [<c01878ae>] sys_mlockall+0x4e/0xb0

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #3 (&mm->mmap_sem){----}:
       [<c0153da2>] check_noncircular+0x82/0x110
       [<c0185e6a>] might_fault+0x4a/0xa0
       [<c0156161>] validate_chain+0xb11/0x1070
       [<c0185e6a>] might_fault+0x4a/0xa0
       [<c0156923>] __lock_acquire+0x263/0xa10
       [<c015714c>] lock_acquire+0x7c/0xb0			(*) grab mmap_sem
       [<c0185e6a>] might_fault+0x4a/0xa0
       [<c0185e9b>] might_fault+0x7b/0xa0
       [<c0185e6a>] might_fault+0x4a/0xa0
       [<c0294dd0>] copy_to_user+0x30/0x60
       [<c01ae3ec>] filldir+0x7c/0xd0
       [<c01e3a6a>] sysfs_readdir+0x11a/0x1f0			(*) grab sysfs_mutex
       [<c01ae370>] filldir+0x0/0xd0
       [<c01ae370>] filldir+0x0/0xd0
       [<c01ae4c6>] vfs_readdir+0x86/0xa0			(*) grab i_mutex
       [<c01ae75b>] sys_getdents+0x6b/0xc0
       [<c010355a>] syscall_call+0x7/0xb
       [<ffffffff>] 0xffffffff

-> #2 (sysfs_mutex){--..}:
       [<c0153da2>] check_noncircular+0x82/0x110
       [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0
       [<c0156161>] validate_chain+0xb11/0x1070
       [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0
       [<c0156923>] __lock_acquire+0x263/0xa10
       [<c015714c>] lock_acquire+0x7c/0xb0			(*) grab sysfs_mutex
       [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0
       [<c04f8b55>] mutex_lock_nested+0xa5/0x2f0
       [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0
       [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0
       [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0
       [<c01e422f>] create_dir+0x3f/0x90
       [<c01e42a9>] sysfs_create_dir+0x29/0x50
       [<c04faaf5>] _spin_unlock+0x25/0x40
       [<c028f21d>] kobject_add_internal+0xcd/0x1a0
       [<c028f37a>] kobject_set_name_vargs+0x3a/0x50
       [<c028f41d>] kobject_init_and_add+0x2d/0x40
       [<c019d4d2>] sysfs_slab_add+0xd2/0x180
       [<c019d580>] sysfs_add_func+0x0/0x70
       [<c019d5dc>] sysfs_add_func+0x5c/0x70			(*) grab slub_lock
       [<c01400f2>] run_workqueue+0x172/0x200
       [<c014008f>] run_workqueue+0x10f/0x200
       [<c0140bd0>] worker_thread+0x0/0xf0		
       [<c0140c6c>] worker_thread+0x9c/0xf0
       [<c0143c80>] autoremove_wake_function+0x0/0x50
       [<c0140bd0>] worker_thread+0x0/0xf0
       [<c0143972>] kthread+0x42/0x70
       [<c0143930>] kthread+0x0/0x70
       [<c01042db>] kernel_thread_helper+0x7/0x1c
       [<ffffffff>] 0xffffffff

-> #1 (slub_lock){----}:
       [<c0153d2d>] check_noncircular+0xd/0x110
       [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0
       [<c0156161>] validate_chain+0xb11/0x1070
       [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0
       [<c015433d>] mark_lock+0x35d/0xd00
       [<c0156923>] __lock_acquire+0x263/0xa10
       [<c015714c>] lock_acquire+0x7c/0xb0
       [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0
       [<c04f93a3>] down_read+0x43/0x80
       [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0		(*) grab slub_lock
       [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0
       [<c04fd9ac>] notifier_call_chain+0x3c/0x70
       [<c04f5454>] _cpu_up+0x84/0x110
       [<c04f552b>] cpu_up+0x4b/0x70				(*) grab cpu_hotplug.lock
       [<c06d1530>] kernel_init+0x0/0x170
       [<c06d15e5>] kernel_init+0xb5/0x170
       [<c06d1530>] kernel_init+0x0/0x170
       [<c01042db>] kernel_thread_helper+0x7/0x1c
       [<ffffffff>] 0xffffffff

-> #0 (&cpu_hotplug.lock){--..}:
       [<c0155bff>] validate_chain+0x5af/0x1070
       [<c040f7e0>] dev_status+0x0/0x50
       [<c0156923>] __lock_acquire+0x263/0xa10
       [<c015714c>] lock_acquire+0x7c/0xb0
       [<c0130789>] get_online_cpus+0x29/0x50
       [<c04f8b55>] mutex_lock_nested+0xa5/0x2f0
       [<c0130789>] get_online_cpus+0x29/0x50
       [<c0130789>] get_online_cpus+0x29/0x50
       [<c017bc30>] lru_add_drain_per_cpu+0x0/0x10
       [<c0130789>] get_online_cpus+0x29/0x50			(*) grab cpu_hotplug.lock
       [<c0140cf2>] schedule_on_each_cpu+0x32/0xe0
       [<c0187095>] __mlock_vma_pages_range+0x85/0x2c0
       [<c0156945>] __lock_acquire+0x285/0xa10
       [<c0188f09>] vma_merge+0xa9/0x1d0
       [<c0187450>] mlock_fixup+0x180/0x200
       [<c0187548>] do_mlockall+0x78/0x90			(*) grab mmap_sem
       [<c01878e1>] sys_mlockall+0x81/0xb0
       [<c010355a>] syscall_call+0x7/0xb
       [<ffffffff>] 0xffffffff

other info that might help us debug this:

1 lock held by lvm/1103:
 #0:  (&mm->mmap_sem){----}, at: [<c01878ae>] sys_mlockall+0x4e/0xb0

stack backtrace:
Pid: 1103, comm: lvm Not tainted 2.6.28-rc2-mm1 #2
Call Trace:
 [<c01555fc>] print_circular_bug_tail+0x7c/0xd0
 [<c0155bff>] validate_chain+0x5af/0x1070
 [<c040f7e0>] dev_status+0x0/0x50
 [<c0156923>] __lock_acquire+0x263/0xa10
 [<c015714c>] lock_acquire+0x7c/0xb0
 [<c0130789>] get_online_cpus+0x29/0x50
 [<c04f8b55>] mutex_lock_nested+0xa5/0x2f0
 [<c0130789>] get_online_cpus+0x29/0x50
 [<c0130789>] get_online_cpus+0x29/0x50
 [<c017bc30>] lru_add_drain_per_cpu+0x0/0x10
 [<c0130789>] get_online_cpus+0x29/0x50
 [<c0140cf2>] schedule_on_each_cpu+0x32/0xe0
 [<c0187095>] __mlock_vma_pages_range+0x85/0x2c0
 [<c0156945>] __lock_acquire+0x285/0xa10
 [<c0188f09>] vma_merge+0xa9/0x1d0
 [<c0187450>] mlock_fixup+0x180/0x200
 [<c0187548>] do_mlockall+0x78/0x90
 [<c01878e1>] sys_mlockall+0x81/0xb0
 [<c010355a>] syscall_call+0x7/0xb

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ end here ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
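
(For anyone trying to reproduce this: the splat needs lockdep enabled,
e.g. CONFIG_PROVE_LOCKING=y, plus any early caller of mlock()/mlockall()
-- lvm and multipathd just happen to be such callers at boot.  A minimal
userspace trigger is just the sketch below; it adds nothing beyond the
syscall already visible in the trace, and whether lockdep actually fires
depends on the other chains (#1..#3) having been recorded first.)

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/*
	 * sys_mlockall() -> __mlock_vma_pages_range() calls
	 * lru_add_drain_all() with mmap_sem held -- the #0 step
	 * in the cycle above.
	 */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
		perror("mlockall");
	return 0;
}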



Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/mlock.c |   16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

Index: b/mm/mlock.c
===================================================================
--- a/mm/mlock.c	2008-11-02 20:23:38.000000000 +0900
+++ b/mm/mlock.c	2008-11-02 21:00:21.000000000 +0900
@@ -66,14 +66,10 @@ void __clear_page_mlock(struct page *pag
 		putback_lru_page(page);
 	} else {
 		/*
-		 * Page not on the LRU yet.  Flush all pagevecs and retry.
+		 * We lost the race; the page has already moved to the evictable list.
 		 */
-		lru_add_drain_all();
-		if (!isolate_lru_page(page))
-			putback_lru_page(page);
-		else if (PageUnevictable(page))
+		if (PageUnevictable(page))
 			count_vm_event(UNEVICTABLE_PGSTRANDED);
-
 	}
 }
 
@@ -187,8 +183,6 @@ static long __mlock_vma_pages_range(stru
 	if (vma->vm_flags & VM_WRITE)
 		gup_flags |= GUP_FLAGS_WRITE;
 
-	lru_add_drain_all();	/* push cached pages to LRU */
-
 	while (nr_pages > 0) {
 		int i;
 
@@ -251,8 +245,6 @@ static long __mlock_vma_pages_range(stru
 		ret = 0;
 	}
 
-	lru_add_drain_all();	/* to update stats */
-
 	return ret;	/* count entire vma as locked_vm */
 }
 
@@ -546,6 +538,8 @@ asmlinkage long sys_mlock(unsigned long 
 	if (!can_do_mlock())
 		return -EPERM;
 
+	lru_add_drain_all();	/* flush pagevec */
+
 	down_write(&current->mm->mmap_sem);
 	len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
 	start &= PAGE_MASK;
@@ -612,6 +606,8 @@ asmlinkage long sys_mlockall(int flags)
 	if (!can_do_mlock())
 		goto out;
 
+	lru_add_drain_all();	/* flush pagevec */
+
 	down_write(&current->mm->mmap_sem);
 
 	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
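
With this applied, both entry points drain the pagevecs before taking any
lock, so the get_online_cpus()/flush_work() inside schedule_on_each_cpu()
can no longer nest under mmap_sem.  The resulting flow is roughly the
sketch below (reconstructed from the hunks above; not a verbatim copy of
mm/mlock.c, and the body is elided):

asmlinkage long sys_mlock(unsigned long start, size_t len)
{
	if (!can_do_mlock())
		return -EPERM;

	lru_add_drain_all();	/* flush pagevec: no locks held yet */

	down_write(&current->mm->mmap_sem);
	/*
	 * mlock_fixup()/__mlock_vma_pages_range() now run without any
	 * lru_add_drain_all() inside this critical section, so the
	 * mmap_sem -> cpu_hotplug.lock edge that closed the cycle is gone.
	 */
	up_write(&current->mm->mmap_sem);
	return 0;
}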




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] get rid of lru_add_drain_all() in munlock path
  2008-11-06  0:14                     ` [PATCH] get rid of lru_add_drain_all() in munlock path KOSAKI Motohiro
@ 2008-11-06 16:33                       ` Kamalesh Babulal
  0 siblings, 0 replies; 35+ messages in thread
From: Kamalesh Babulal @ 2008-11-06 16:33 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Lee Schermerhorn, Andrew Morton, Christoph Lameter,
	heiko.carstens, npiggin, linux-kernel, hugh, torvalds, riel,
	linux-mm, Peter Zijlstra

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2008-11-06 09:14:07]:

> > > > Now, in the current upstream version of the unevictable mlocked pages
> > > > patches, we just count any mlocked pages [vmstat] that make their way to
> > > > free*page() instead of BUGging out, as we were doing earlier during
> > > > development.  So, maybe we can drop the lru_add_drain()s in the
> > > > unevictable mlocked pages work and live with the occasional freed
> > > > mlocked page, or mlocked page on the active/inactive lists to be dealt
> > > > with by vmscan.
> > > 
> > > hm, okay.
> > > maybe I was wrong.
> > > 
> > > I'll make a "drop lru_add_drain_all()" patch soon.
> > > I expect I'll need a few days:
> > >   make the patch:                  1 day
> > >   confirm by stress workload:  2-3 days
> > > 
> > > because Rik's original problem only happened under heavy workload, I think.
> > 
> > Indeed.  It was an ad hoc test program [2 versions attached] written
> > specifically to beat on COW of shared pages mlocked by parent then COWed
> > by parent or child and unmapped explicitly or via exit.  We were trying
> > to find all the ways that we could end up freeing mlocked pages--and
> > there were several.  Most of these turned out to be genuine
> > coding/design defects [as difficult as that may be to believe :-)], so
> > tracking them down was worthwhile.  And, I think that, in general,
> > clearing a page's mlocked state and rescuing from the unevictable lru
> > list on COW--to prevent the mlocked page from ending up mapped into some
> > task's non-VM_LOCKED vma--is a good thing to strive for.  
> 
> 
> 
> > Now, looking at the current code [28-rc1] in [__]clear_page_mlock():
> > We've already cleared the PG_mlocked flag, we've decremented the mlocked
> > pages stats, and we're just trying to rescue the page from the
> > unevictable list to the in/active list.  If we fail to isolate the page,
> > then either some other task has it isolated and will return it to an
> > appropriate lru or it resides in a pagevec heading for an in/active lru
> > list.  We don't use pagevecs for the unevictable list.  Any other cases?  If
> > not, then we can probably dispense with the "try harder" logic--the
> > lru_add_drain()--in __clear_page_mlock().
> > 
> > Do you agree?  Or have I missed something?
> 
> Yup.
> you are perfectly right.
> 
> Honestly, I thought lazy rescue wasn't so good because it causes a temporary
> mismatch between the # of mlocked pages and the # of unevictable pages in the
> statistics, and I thought I could avoid it.
> 
> but I was wrong.
> 
> I actually made such a patch, but it introduced a lot of unnecessary messiness.
> So, I believe the simple patch that drops lru_add_drain_all() is better.
> 
> Again, you are right.
> 
> 
> Over the last few days, I've run a stress workload and confirmed that my
> patch doesn't cause an mlocked page leak.
> 
> this patch also solves Heiko's and Kamalesh's rtnl
> circular dependency problem (I think).
> http://marc.info/?l=linux-kernel&m=122460208308785&w=2
> http://marc.info/?l=linux-netdev&m=122586921407698&w=2
> 
> 
> -------------------------------------------------------------------------
> lockdep emits the warning below at boot time on one of my test machines.
> In short, schedule_on_each_cpu() shouldn't be called while the task holds mmap_sem.
> 
> Actually, lru_add_drain_all() exists to keep unevictable pages from staying on a reclaimable lru list,
> but the current unevictable code can rescue unevictable pages even while they sit on a reclaimable list.
> 
> So removing it is better.
> 
> In addition, this patch adds lru_add_drain_all() to sys_mlock() and sys_mlockall().
> It isn't a must,
> but it reduces the failures to move pages to the unevictable list.
> Such a failure can be rescued by vmscan later, but reducing failures is still better.
> 
> Note: if the above rescuing happens, the Mlocked and Unevictable fields in /proc/meminfo can mismatch,
> but that doesn't cause any real trouble.
> 
> 
<snip warning>

Hi Kosaki-san,

 Thanks, the patch fixes the circular locking dependency warning while
booting up.

  Tested-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  mm/mlock.c |   16 ++++++----------
>  1 file changed, 6 insertions(+), 10 deletions(-)
> 
> Index: b/mm/mlock.c
> ===================================================================
> --- a/mm/mlock.c	2008-11-02 20:23:38.000000000 +0900
> +++ b/mm/mlock.c	2008-11-02 21:00:21.000000000 +0900
> @@ -66,14 +66,10 @@ void __clear_page_mlock(struct page *pag
>  		putback_lru_page(page);
>  	} else {
>  		/*
> -		 * Page not on the LRU yet.  Flush all pagevecs and retry.
> +		 * We lost the race; the page has already moved to the evictable list.
>  		 */
> -		lru_add_drain_all();
> -		if (!isolate_lru_page(page))
> -			putback_lru_page(page);
> -		else if (PageUnevictable(page))
> +		if (PageUnevictable(page))
>  			count_vm_event(UNEVICTABLE_PGSTRANDED);
> -
>  	}
>  }
> 
> @@ -187,8 +183,6 @@ static long __mlock_vma_pages_range(stru
>  	if (vma->vm_flags & VM_WRITE)
>  		gup_flags |= GUP_FLAGS_WRITE;
> 
> -	lru_add_drain_all();	/* push cached pages to LRU */
> -
>  	while (nr_pages > 0) {
>  		int i;
> 
> @@ -251,8 +245,6 @@ static long __mlock_vma_pages_range(stru
>  		ret = 0;
>  	}
> 
> -	lru_add_drain_all();	/* to update stats */
> -
>  	return ret;	/* count entire vma as locked_vm */
>  }
> 
> @@ -546,6 +538,8 @@ asmlinkage long sys_mlock(unsigned long 
>  	if (!can_do_mlock())
>  		return -EPERM;
> 
> +	lru_add_drain_all();	/* flush pagevec */
> +
>  	down_write(&current->mm->mmap_sem);
>  	len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
>  	start &= PAGE_MASK;
> @@ -612,6 +606,8 @@ asmlinkage long sys_mlockall(int flags)
>  	if (!can_do_mlock())
>  		goto out;
> 
> +	lru_add_drain_all();	/* flush pagevec */
> +
>  	down_write(&current->mm->mmap_sem);
> 
>  	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
> 
> 
> 

-- 
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2008-11-06 16:41 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <200810201659.m9KGxtFC016280@hera.kernel.org>
2008-10-21 15:13 ` mlock: mlocked pages are unevictable Heiko Carstens
2008-10-21 15:51   ` KOSAKI Motohiro
2008-10-21 17:18     ` KOSAKI Motohiro
2008-10-21 20:30       ` Peter Zijlstra
2008-10-21 20:48         ` Peter Zijlstra
2008-10-23 15:00       ` [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu() KOSAKI Motohiro
2008-10-24  1:28         ` Nick Piggin
2008-10-24  4:54           ` KOSAKI Motohiro
2008-10-24  4:55             ` Nick Piggin
2008-10-24  5:29               ` KOSAKI Motohiro
2008-10-24  5:34                 ` Nick Piggin
2008-10-24  5:51                   ` KOSAKI Motohiro
2008-10-24 19:20         ` Heiko Carstens
2008-10-26 11:06         ` Peter Zijlstra
2008-10-26 13:37           ` KOSAKI Motohiro
2008-10-26 13:49             ` Peter Zijlstra
2008-10-26 15:51               ` KOSAKI Motohiro
2008-10-26 16:17                 ` Peter Zijlstra
2008-10-27  3:14                   ` KOSAKI Motohiro
2008-10-27  7:56                     ` Peter Zijlstra
2008-10-27  8:03                       ` KOSAKI Motohiro
2008-10-27 10:42                         ` KOSAKI Motohiro
2008-10-27 21:55         ` Andrew Morton
2008-10-28 14:25           ` Christoph Lameter
2008-10-28 20:45             ` Andrew Morton
2008-10-28 21:29               ` Lee Schermerhorn
2008-10-29  7:17                 ` KOSAKI Motohiro
2008-10-29 12:40                   ` Lee Schermerhorn
2008-11-06  0:14                     ` [PATCH] get rid of lru_add_drain_all() in munlock path KOSAKI Motohiro
2008-11-06 16:33                       ` Kamalesh Babulal
2008-10-29  7:20               ` [RFC][PATCH] lru_add_drain_all() don't use schedule_on_each_cpu() KOSAKI Motohiro
2008-10-29  8:21                 ` KAMEZAWA Hiroyuki
2008-11-05  9:51                 ` Peter Zijlstra
2008-11-05  9:55                   ` KOSAKI Motohiro
2008-10-22 15:28   ` mlock: mlocked pages are unevictable Lee Schermerhorn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).