All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] fs: proc: use down_read_killable in proc_pid_cmdline_read()
@ 2018-02-20 19:49 Yang Shi
  2018-02-20 22:38 ` Alexey Dobriyan
  2018-02-20 23:57 ` Yang Shi
  0 siblings, 2 replies; 11+ messages in thread
From: Yang Shi @ 2018-02-20 19:49 UTC (permalink / raw)
  To: akpm, mingo, adobriyan; +Cc: yang.shi, linux-kernel

When running vm-scalability with large memory (> 300GB), the below hung
task issue happens occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
       Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 ps              D    0 14018      1 0x00000004
  ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
  ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
  00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
 Call Trace:
  [<ffffffff817154d0>] ? __schedule+0x250/0x730
  [<ffffffff817159e6>] schedule+0x36/0x80
  [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
  [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
  [<ffffffff81717db0>] down_read+0x20/0x40
  [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
  [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
  [<ffffffff81241d87>] __vfs_read+0x37/0x150
  [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
  [<ffffffff81242266>] vfs_read+0x96/0x130
  [<ffffffff812437b5>] SyS_read+0x55/0xc0
  [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5

When manipulating a large mapping, the process may hold the mmap_sem for
long time, so reading /proc/<pid>/cmdline may be blocked in
uninterruptible state for long time.

down_read_trylock() sounds too aggressive, and we already have killable
version APIs for semaphore, here use down_read_killable() to improve the
responsiveness.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
---
 fs/proc/base.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9298324..44b6f8f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -242,7 +242,9 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
 		goto out_mmput;
 	}
 
-	down_read(&mm->mmap_sem);
+	rv = down_read_killable(&mm->mmap_sem);
+	if (rv)
+		goto out_mmput;
 	arg_start = mm->arg_start;
 	arg_end = mm->arg_end;
 	env_start = mm->env_start;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: proc: use down_read_killable in proc_pid_cmdline_read()
  2018-02-20 19:49 [PATCH] fs: proc: use down_read_killable in proc_pid_cmdline_read() Yang Shi
@ 2018-02-20 22:38 ` Alexey Dobriyan
  2018-02-20 23:38   ` Yang Shi
  2018-02-20 23:57 ` Yang Shi
  1 sibling, 1 reply; 11+ messages in thread
From: Alexey Dobriyan @ 2018-02-20 22:38 UTC (permalink / raw)
  To: Yang Shi; +Cc: akpm, mingo, linux-kernel

On Wed, Feb 21, 2018 at 03:49:29AM +0800, Yang Shi wrote:
> When running vm-scalability with large memory (> 300GB), the below hung
> task issue happens occasionally.
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>        Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  ps              D    0 14018      1 0x00000004
>   ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>   ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>   00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>  Call Trace:
>   [<ffffffff817154d0>] ? __schedule+0x250/0x730
>   [<ffffffff817159e6>] schedule+0x36/0x80
>   [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>   [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>   [<ffffffff81717db0>] down_read+0x20/0x40
>   [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>   [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>   [<ffffffff81241d87>] __vfs_read+0x37/0x150
>   [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>   [<ffffffff81242266>] vfs_read+0x96/0x130
>   [<ffffffff812437b5>] SyS_read+0x55/0xc0
>   [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> When manipulating a large mapping, the process may hold the mmap_sem for
> long time, so reading /proc/<pid>/cmdline may be blocked in
> uninterruptible state for long time.
> 
> down_read_trylock() sounds too aggressive, and we already have killable
> version APIs for semaphore, here use down_read_killable() to improve the
> responsiveness.

> -	down_read(&mm->mmap_sem);
> +	rv = down_read_killable(&mm->mmap_sem);
> +	if (rv)
> +		goto out_mmput;
>  	arg_start = mm->arg_start;
>  	arg_end = mm->arg_end;
>  	env_start = mm->env_start;

Fix is incomplete
1) /proc/*/cmdline only wants to read 4 values atomically,
   those 4 values are basically random values and aren't
   related to VM part at all (after C/R went in, they are
   settable to arbitrary values)
2) access_remote_vm() et al will do the same ->mmap_sem, and
3) /proc/*/environ and get_cmdline() do the same.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: proc: use down_read_killable in proc_pid_cmdline_read()
  2018-02-20 22:38 ` Alexey Dobriyan
@ 2018-02-20 23:38   ` Yang Shi
  2018-02-21 19:57     ` Alexey Dobriyan
  0 siblings, 1 reply; 11+ messages in thread
From: Yang Shi @ 2018-02-20 23:38 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: akpm, mingo, linux-kernel



On 2/20/18 2:38 PM, Alexey Dobriyan wrote:
> On Wed, Feb 21, 2018 at 03:49:29AM +0800, Yang Shi wrote:
>> When running vm-scalability with large memory (> 300GB), the below hung
>> task issue happens occasionally.
>>
>> INFO: task ps:14018 blocked for more than 120 seconds.
>>         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>   ps              D    0 14018      1 0x00000004
>>    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>   Call Trace:
>>    [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>    [<ffffffff817159e6>] schedule+0x36/0x80
>>    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>    [<ffffffff81717db0>] down_read+0x20/0x40
>>    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>>    [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>>    [<ffffffff81241d87>] __vfs_read+0x37/0x150
>>    [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>>    [<ffffffff81242266>] vfs_read+0x96/0x130
>>    [<ffffffff812437b5>] SyS_read+0x55/0xc0
>>    [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>
>> When manipulating a large mapping, the process may hold the mmap_sem for
>> long time, so reading /proc/<pid>/cmdline may be blocked in
>> uninterruptible state for long time.
>>
>> down_read_trylock() sounds too aggressive, and we already have killable
>> version APIs for semaphore, here use down_read_killable() to improve the
>> responsiveness.
>> -	down_read(&mm->mmap_sem);
>> +	rv = down_read_killable(&mm->mmap_sem);
>> +	if (rv)
>> +		goto out_mmput;
>>   	arg_start = mm->arg_start;
>>   	arg_end = mm->arg_end;
>>   	env_start = mm->env_start;
> Fix is incomplete

Yes, it is. Since I just ran into the above splat, so I just did the 
minimum change.

> 1) /proc/*/cmdline only wants to read 4 values atomically,
>     those 4 values are basically random values and aren't
>     related to VM part at all (after C/R went in, they are
>     settable to arbitrary values)

Sorry, I don't get your point here. Could you please elaborate?

> 2) access_remote_vm() et al will do the same ->mmap_sem, and

Yes, it does. But, __access_remote_vm() is called by access_process_vm() 
too, which is used by much more places, i.e. ptrace, so I was not sure 
if it is preferred to convert to killable version. So, I leave it untouched.

> 3) /proc/*/environ and get_cmdline() do the same.

They look suitable to use killable version.

BTW, I just realized the code should go to out_free_page instead of 
out_mmput. Will rectify in newer version once we decide the extra places 
to use killable version.

Thanks,
Yang

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: proc: use down_read_killable in proc_pid_cmdline_read()
  2018-02-20 19:49 [PATCH] fs: proc: use down_read_killable in proc_pid_cmdline_read() Yang Shi
  2018-02-20 22:38 ` Alexey Dobriyan
@ 2018-02-20 23:57 ` Yang Shi
  1 sibling, 0 replies; 11+ messages in thread
From: Yang Shi @ 2018-02-20 23:57 UTC (permalink / raw)
  To: akpm, mingo, adobriyan; +Cc: linux-kernel



On 2/20/18 11:49 AM, Yang Shi wrote:
> When running vm-scalability with large memory (> 300GB), the below hung
> task issue happens occasionally.
>
> INFO: task ps:14018 blocked for more than 120 seconds.
>         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   ps              D    0 14018      1 0x00000004
>    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>   Call Trace:
>    [<ffffffff817154d0>] ? __schedule+0x250/0x730
>    [<ffffffff817159e6>] schedule+0x36/0x80
>    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>    [<ffffffff81717db0>] down_read+0x20/0x40
>    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>    [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>    [<ffffffff81241d87>] __vfs_read+0x37/0x150
>    [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>    [<ffffffff81242266>] vfs_read+0x96/0x130
>    [<ffffffff812437b5>] SyS_read+0x55/0xc0
>    [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>
> When manipulating a large mapping, the process may hold the mmap_sem for
> long time, so reading /proc/<pid>/cmdline may be blocked in
> uninterruptible state for long time.
>
> down_read_trylock() sounds too aggressive, and we already have killable
> version APIs for semaphore, here use down_read_killable() to improve the
> responsiveness.
>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> ---
>   fs/proc/base.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 9298324..44b6f8f 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -242,7 +242,9 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
>   		goto out_mmput;
>   	}
>   
> -	down_read(&mm->mmap_sem);
> +	rv = down_read_killable(&mm->mmap_sem);
> +	if (rv)
> +		goto out_mmput;
>   	arg_start = mm->arg_start;
>   	arg_end = mm->arg_end;
>   	env_start = mm->env_start;

Should goto out_free_page, amended patch as below.

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9298324..057e6bd 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -242,7 +242,9 @@ static ssize_t proc_pid_cmdline_read(struct file 
*file, char __user *buf,
                 goto out_mmput;
         }

-       down_read(&mm->mmap_sem);
+       rv = down_read_killable(&mm->mmap_sem);
+       if (rv)
+               goto out_free_page;
         arg_start = mm->arg_start;
         arg_end = mm->arg_end;
         env_start = mm->env_start;

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: proc: use down_read_killable in proc_pid_cmdline_read()
  2018-02-20 23:38   ` Yang Shi
@ 2018-02-21 19:57     ` Alexey Dobriyan
  2018-02-21 23:13       ` Yang Shi
  2018-02-23 18:28       ` Yang Shi
  0 siblings, 2 replies; 11+ messages in thread
From: Alexey Dobriyan @ 2018-02-21 19:57 UTC (permalink / raw)
  To: Yang Shi; +Cc: akpm, mingo, linux-kernel

On Tue, Feb 20, 2018 at 03:38:24PM -0800, Yang Shi wrote:
> 
> 
> On 2/20/18 2:38 PM, Alexey Dobriyan wrote:
> > On Wed, Feb 21, 2018 at 03:49:29AM +0800, Yang Shi wrote:
> >> When running vm-scalability with large memory (> 300GB), the below hung
> >> task issue happens occasionally.
> >>
> >> INFO: task ps:14018 blocked for more than 120 seconds.
> >>         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
> >>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>   ps              D    0 14018      1 0x00000004
> >>    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
> >>    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
> >>    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
> >>   Call Trace:
> >>    [<ffffffff817154d0>] ? __schedule+0x250/0x730
> >>    [<ffffffff817159e6>] schedule+0x36/0x80
> >>    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
> >>    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
> >>    [<ffffffff81717db0>] down_read+0x20/0x40
> >>    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
> >>    [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
> >>    [<ffffffff81241d87>] __vfs_read+0x37/0x150
> >>    [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
> >>    [<ffffffff81242266>] vfs_read+0x96/0x130
> >>    [<ffffffff812437b5>] SyS_read+0x55/0xc0
> >>    [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
> >>
> >> When manipulating a large mapping, the process may hold the mmap_sem for
> >> long time, so reading /proc/<pid>/cmdline may be blocked in
> >> uninterruptible state for long time.
> >>
> >> down_read_trylock() sounds too aggressive, and we already have killable
> >> version APIs for semaphore, here use down_read_killable() to improve the
> >> responsiveness.
> >> -	down_read(&mm->mmap_sem);
> >> +	rv = down_read_killable(&mm->mmap_sem);
> >> +	if (rv)
> >> +		goto out_mmput;
> >>   	arg_start = mm->arg_start;
> >>   	arg_end = mm->arg_end;
> >>   	env_start = mm->env_start;
> > Fix is incomplete
> 
> Yes, it is. Since I just ran into the above splat, so I just did the 
> minimum change.
> 
> > 1) /proc/*/cmdline only wants to read 4 values atomically,
> >     those 4 values are basically random values and aren't
> >     related to VM part at all (after C/R went in, they are
> >     settable to arbitrary values)
> 
> Sorry, I don't get your point here. Could you please elaborate?

I hoped there is some random spinlock those 4 values could be moved to
but no.

> > 2) access_remote_vm() et al will do the same ->mmap_sem, and
> 
> Yes, it does. But, __access_remote_vm() is called by access_process_vm() 
> too, which is used by much more places, i.e. ptrace, so I was not sure 
> if it is preferred to convert to killable version. So, I leave it untouched.

Yeah, but ->mmap_sem is taken 3 times per /proc/*/cmdline read
and your scalability tests should trigger next backtrace right away.

> > 3) /proc/*/environ and get_cmdline() do the same.
> 
> They look suitable to use killable version.
> 
> BTW, I just realized the code should go to out_free_page instead of 
> out_mmput. Will rectify in newer version once we decide the extra places 
> to use killable version.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: proc: use down_read_killable in proc_pid_cmdline_read()
  2018-02-21 19:57     ` Alexey Dobriyan
@ 2018-02-21 23:13       ` Yang Shi
  2018-02-23 19:33         ` Alexey Dobriyan
  2018-02-23 18:28       ` Yang Shi
  1 sibling, 1 reply; 11+ messages in thread
From: Yang Shi @ 2018-02-21 23:13 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: akpm, mingo, linux-kernel



On 2/21/18 11:57 AM, Alexey Dobriyan wrote:
> On Tue, Feb 20, 2018 at 03:38:24PM -0800, Yang Shi wrote:
>>
>> On 2/20/18 2:38 PM, Alexey Dobriyan wrote:
>>> On Wed, Feb 21, 2018 at 03:49:29AM +0800, Yang Shi wrote:
>>>> When running vm-scalability with large memory (> 300GB), the below hung
>>>> task issue happens occasionally.
>>>>
>>>> INFO: task ps:14018 blocked for more than 120 seconds.
>>>>          Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>>>    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>    ps              D    0 14018      1 0x00000004
>>>>     ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>>>     ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>>>     00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>>>    Call Trace:
>>>>     [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>>>     [<ffffffff817159e6>] schedule+0x36/0x80
>>>>     [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>>>     [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>>>     [<ffffffff81717db0>] down_read+0x20/0x40
>>>>     [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>>>>     [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>>>>     [<ffffffff81241d87>] __vfs_read+0x37/0x150
>>>>     [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>>>>     [<ffffffff81242266>] vfs_read+0x96/0x130
>>>>     [<ffffffff812437b5>] SyS_read+0x55/0xc0
>>>>     [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>>>
>>>> When manipulating a large mapping, the process may hold the mmap_sem for
>>>> long time, so reading /proc/<pid>/cmdline may be blocked in
>>>> uninterruptible state for long time.
>>>>
>>>> down_read_trylock() sounds too aggressive, and we already have killable
>>>> version APIs for semaphore, here use down_read_killable() to improve the
>>>> responsiveness.
>>>> -	down_read(&mm->mmap_sem);
>>>> +	rv = down_read_killable(&mm->mmap_sem);
>>>> +	if (rv)
>>>> +		goto out_mmput;
>>>>    	arg_start = mm->arg_start;
>>>>    	arg_end = mm->arg_end;
>>>>    	env_start = mm->env_start;
>>> Fix is incomplete
>> Yes, it is. Since I just ran into the above splat, so I just did the
>> minimum change.
>>
>>> 1) /proc/*/cmdline only wants to read 4 values atomically,
>>>      those 4 values are basically random values and aren't
>>>      related to VM part at all (after C/R went in, they are
>>>      settable to arbitrary values)
>> Sorry, I don't get your point here. Could you please elaborate?
> I hoped there is some random spinlock those 4 values could be moved to
> but no.

Those 4 values might be changed by prctl_set_mm concurrently if 
mm->mmap_sem is not acquired.

>
>>> 2) access_remote_vm() et al will do the same ->mmap_sem, and
>> Yes, it does. But, __access_remote_vm() is called by access_process_vm()
>> too, which is used by much more places, i.e. ptrace, so I was not sure
>> if it is preferred to convert to killable version. So, I leave it untouched.
> Yeah, but ->mmap_sem is taken 3 times per /proc/*/cmdline read
> and your scalability tests should trigger next backtrace right away.

Yes, however, I didn't run into it if mmap_sem is acquired earlier.

How about defining a killable version, like 
__access_remote_vm_killable() which use down_read_killable(), then the 
killable version can be used by proc/*/cmdline? There might be other 
users in the future.

Thanks,
Yang

>
>>> 3) /proc/*/environ and get_cmdline() do the same.
>> They look suitable to use killable version.
>>
>> BTW, I just realized the code should go to out_free_page instead of
>> out_mmput. Will rectify in newer version once we decide the extra places
>> to use killable version.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: proc: use down_read_killable in proc_pid_cmdline_read()
  2018-02-21 19:57     ` Alexey Dobriyan
  2018-02-21 23:13       ` Yang Shi
@ 2018-02-23 18:28       ` Yang Shi
  1 sibling, 0 replies; 11+ messages in thread
From: Yang Shi @ 2018-02-23 18:28 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: akpm, mingo, linux-kernel



On 2/21/18 11:57 AM, Alexey Dobriyan wrote:
> On Tue, Feb 20, 2018 at 03:38:24PM -0800, Yang Shi wrote:
>>
>> On 2/20/18 2:38 PM, Alexey Dobriyan wrote:
>>> On Wed, Feb 21, 2018 at 03:49:29AM +0800, Yang Shi wrote:
>>>> When running vm-scalability with large memory (> 300GB), the below hung
>>>> task issue happens occasionally.
>>>>
>>>> INFO: task ps:14018 blocked for more than 120 seconds.
>>>>          Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>>>    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>    ps              D    0 14018      1 0x00000004
>>>>     ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>>>     ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>>>     00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>>>    Call Trace:
>>>>     [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>>>     [<ffffffff817159e6>] schedule+0x36/0x80
>>>>     [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>>>     [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>>>     [<ffffffff81717db0>] down_read+0x20/0x40
>>>>     [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>>>>     [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>>>>     [<ffffffff81241d87>] __vfs_read+0x37/0x150
>>>>     [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>>>>     [<ffffffff81242266>] vfs_read+0x96/0x130
>>>>     [<ffffffff812437b5>] SyS_read+0x55/0xc0
>>>>     [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>>>
>>>> When manipulating a large mapping, the process may hold the mmap_sem for
>>>> long time, so reading /proc/<pid>/cmdline may be blocked in
>>>> uninterruptible state for long time.
>>>>
>>>> down_read_trylock() sounds too aggressive, and we already have killable
>>>> version APIs for semaphore, here use down_read_killable() to improve the
>>>> responsiveness.
>>>> -	down_read(&mm->mmap_sem);
>>>> +	rv = down_read_killable(&mm->mmap_sem);
>>>> +	if (rv)
>>>> +		goto out_mmput;
>>>>    	arg_start = mm->arg_start;
>>>>    	arg_end = mm->arg_end;
>>>>    	env_start = mm->env_start;
>>> Fix is incomplete
>> Yes, it is. Since I just ran into the above splat, so I just did the
>> minimum change.
>>
>>> 1) /proc/*/cmdline only wants to read 4 values atomically,
>>>      those 4 values are basically random values and aren't
>>>      related to VM part at all (after C/R went in, they are
>>>      settable to arbitrary values)
>> Sorry, I don't get your point here. Could you please elaborate?
> I hoped there is some random spinlock those 4 values could be moved to
> but no.
>
>>> 2) access_remote_vm() et al will do the same ->mmap_sem, and
>> Yes, it does. But, __access_remote_vm() is called by access_process_vm()
>> too, which is used by much more places, i.e. ptrace, so I was not sure
>> if it is preferred to convert to killable version. So, I leave it untouched.
> Yeah, but ->mmap_sem is taken 3 times per /proc/*/cmdline read
> and your scalability tests should trigger next backtrace right away.
>
>>> 3) /proc/*/environ and get_cmdline() do the same.
>> They look suitable to use killable version.

Had a look at get_cmdline() further. It looks it is not necessary to 
convert to killable since it is called mainly by audit_exit(), which is 
called when process or syscall exit, so it might be preferred to wait 
until the mmap_sem is released. So, it sounds pointless to use 
down_read_killable().

Thanks,
Yang

>>
>> BTW, I just realized the code should go to out_free_page instead of
>> out_mmput. Will rectify in newer version once we decide the extra places
>> to use killable version.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: proc: use down_read_killable in proc_pid_cmdline_read()
  2018-02-21 23:13       ` Yang Shi
@ 2018-02-23 19:33         ` Alexey Dobriyan
  2018-02-23 19:42           ` Yang Shi
  0 siblings, 1 reply; 11+ messages in thread
From: Alexey Dobriyan @ 2018-02-23 19:33 UTC (permalink / raw)
  To: Yang Shi; +Cc: akpm, mingo, linux-kernel

On Wed, Feb 21, 2018 at 03:13:10PM -0800, Yang Shi wrote:

> >>> 2) access_remote_vm() et al will do the same ->mmap_sem, and
> >> Yes, it does. But, __access_remote_vm() is called by access_process_vm()
> >> too, which is used by much more places, i.e. ptrace, so I was not sure
> >> if it is preferred to convert to killable version. So, I leave it untouched.
> > Yeah, but ->mmap_sem is taken 3 times per /proc/*/cmdline read
> > and your scalability tests should trigger next backtrace right away.
> 
> Yes, however, I didn't run into it if mmap_sem is acquired earlier.
> 
> How about defining a killable version, like 
> __access_remote_vm_killable() which use down_read_killable(), then the 
> killable version can be used by proc/*/cmdline? There might be other 
> users in the future.

It would be a disaster as interfaces multiply.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: proc: use down_read_killable in proc_pid_cmdline_read()
  2018-02-23 19:33         ` Alexey Dobriyan
@ 2018-02-23 19:42           ` Yang Shi
  2018-02-23 19:45             ` Alexey Dobriyan
  0 siblings, 1 reply; 11+ messages in thread
From: Yang Shi @ 2018-02-23 19:42 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: akpm, mingo, linux-kernel



On 2/23/18 11:33 AM, Alexey Dobriyan wrote:
> On Wed, Feb 21, 2018 at 03:13:10PM -0800, Yang Shi wrote:
>
>>>>> 2) access_remote_vm() et al will do the same ->mmap_sem, and
>>>> Yes, it does. But, __access_remote_vm() is called by access_process_vm()
>>>> too, which is used by much more places, i.e. ptrace, so I was not sure
>>>> if it is preferred to convert to killable version. So, I leave it untouched.
>>> Yeah, but ->mmap_sem is taken 3 times per /proc/*/cmdline read
>>> and your scalability tests should trigger next backtrace right away.
>> Yes, however, I didn't run into it if mmap_sem is acquired earlier.
>>
>> How about defining a killable version, like
>> __access_remote_vm_killable() which use down_read_killable(), then the
>> killable version can be used by proc/*/cmdline? There might be other
>> users in the future.
> It would be a disaster as interfaces multiply.

Might be not that bad. Just need add:

__access_remote_vm_killable()
access_remote_vm_killable()

Then access_process_vm() keep using __access_remote_vm(). And, use 
access_remote_vm_killable() in fs/proc/base. And, it looks the 
access_process_vm() calls in get_cmdline() can be changed to 
access_remote_vm() (non killable version).

Yang

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: proc: use down_read_killable in proc_pid_cmdline_read()
  2018-02-23 19:42           ` Yang Shi
@ 2018-02-23 19:45             ` Alexey Dobriyan
  2018-02-23 20:08               ` Yang Shi
  0 siblings, 1 reply; 11+ messages in thread
From: Alexey Dobriyan @ 2018-02-23 19:45 UTC (permalink / raw)
  To: Yang Shi; +Cc: akpm, mingo, linux-kernel

On Fri, Feb 23, 2018 at 11:42:34AM -0800, Yang Shi wrote:
> 
> 
> On 2/23/18 11:33 AM, Alexey Dobriyan wrote:
> > On Wed, Feb 21, 2018 at 03:13:10PM -0800, Yang Shi wrote:
> >
> >>>>> 2) access_remote_vm() et al will do the same ->mmap_sem, and
> >>>> Yes, it does. But, __access_remote_vm() is called by access_process_vm()
> >>>> too, which is used by much more places, i.e. ptrace, so I was not sure
> >>>> if it is preferred to convert to killable version. So, I leave it untouched.
> >>> Yeah, but ->mmap_sem is taken 3 times per /proc/*/cmdline read
> >>> and your scalability tests should trigger next backtrace right away.
> >> Yes, however, I didn't run into it if mmap_sem is acquired earlier.
> >>
> >> How about defining a killable version, like
> >> __access_remote_vm_killable() which use down_read_killable(), then the
> >> killable version can be used by proc/*/cmdline? There might be other
> >> users in the future.
> > It would be a disaster as interfaces multiply.
> 
> Might be not that bad.

Maybe.

But you need to explain why there is no backtrace several lines later:

	access_remote_vm
	__access_remote_vm
	down_read(&mm->mmap_sem)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: proc: use down_read_killable in proc_pid_cmdline_read()
  2018-02-23 19:45             ` Alexey Dobriyan
@ 2018-02-23 20:08               ` Yang Shi
  0 siblings, 0 replies; 11+ messages in thread
From: Yang Shi @ 2018-02-23 20:08 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: akpm, mingo, linux-kernel



On 2/23/18 11:45 AM, Alexey Dobriyan wrote:
> On Fri, Feb 23, 2018 at 11:42:34AM -0800, Yang Shi wrote:
>>
>> On 2/23/18 11:33 AM, Alexey Dobriyan wrote:
>>> On Wed, Feb 21, 2018 at 03:13:10PM -0800, Yang Shi wrote:
>>>
>>>>>>> 2) access_remote_vm() et al will do the same ->mmap_sem, and
>>>>>> Yes, it does. But, __access_remote_vm() is called by access_process_vm()
>>>>>> too, which is used by much more places, i.e. ptrace, so I was not sure
>>>>>> if it is preferred to convert to killable version. So, I leave it untouched.
>>>>> Yeah, but ->mmap_sem is taken 3 times per /proc/*/cmdline read
>>>>> and your scalability tests should trigger next backtrace right away.
>>>> Yes, however, I didn't run into it if mmap_sem is acquired earlier.
>>>>
>>>> How about defining a killable version, like
>>>> __access_remote_vm_killable() which use down_read_killable(), then the
>>>> killable version can be used by proc/*/cmdline? There might be other
>>>> users in the future.
>>> It would be a disaster as interfaces multiply.
>> Might be not that bad.
> Maybe.
>
> But you need to explain why there is no backtrace several lines later:
>
> 	access_remote_vm
> 	__access_remote_vm
> 	down_read(&mm->mmap_sem)

I think it might be because:

         CPU A                                  CPU B
                                          read /proc/*/cmdline
                                          get_mm
acquire mmap_sem
munmap(300G)                 try to acquire mmap_sem --> go to sleep
release mmap_sem
                                           got mmap_sem
                                           release mmap_sem

                                           access_remote_vm
                                           put_mm


The munmap might happen right before access_remote_vm(), but I just 
didn't run into it for the time being. It may be hit on another machine 
or with some changes to the test cases.

BTW, even the hung I met happened occassionally, not very often. So, the 
access_remote_vm() hit sounds less often. But, I agree it is still 
possible in theory.

Regards,
Yang

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-02-23 20:08 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-20 19:49 [PATCH] fs: proc: use down_read_killable in proc_pid_cmdline_read() Yang Shi
2018-02-20 22:38 ` Alexey Dobriyan
2018-02-20 23:38   ` Yang Shi
2018-02-21 19:57     ` Alexey Dobriyan
2018-02-21 23:13       ` Yang Shi
2018-02-23 19:33         ` Alexey Dobriyan
2018-02-23 19:42           ` Yang Shi
2018-02-23 19:45             ` Alexey Dobriyan
2018-02-23 20:08               ` Yang Shi
2018-02-23 18:28       ` Yang Shi
2018-02-20 23:57 ` Yang Shi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.