* [PATCH 1/4 v2] mm: add access_remote_vm_killable APIs
2018-02-27 0:25 [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc Yang Shi
@ 2018-02-27 0:25 ` Yang Shi
2018-02-27 0:25 ` [PATCH 2/4 v2] fs: proc: use down_read_killable in proc_pid_cmdline_read() Yang Shi
` (5 subsequent siblings)
6 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2018-02-27 0:25 UTC (permalink / raw)
To: akpm, mingo, adobriyan; +Cc: yang.shi, linux-mm, linux-kernel
Extracted common part (without acquiring mmap_sem) of
__access_remote_vm() into raw_access_remote_vm() then create
__access_remote_vm_killable() and access_remote_vm_killable() with
acquiring mmap_sem by down_read_killable().
Keep non-killable versions using down_read().
The killable version will be used by reading /proc/*/cmdline and
/proc/*/environ for the time being.
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
---
include/linux/mm.h | 5 +++++
mm/memory.c | 44 +++++++++++++++++++++++++++++++++++++-------
mm/nommu.c | 36 ++++++++++++++++++++++++++++++++----
3 files changed, 74 insertions(+), 11 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42..4574b19 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1372,8 +1372,13 @@ extern int access_process_vm(struct task_struct *tsk, unsigned long addr,
void *buf, int len, unsigned int gup_flags);
extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
void *buf, int len, unsigned int gup_flags);
+extern int access_remote_vm_killable(struct mm_struct *mm, unsigned long addr,
+ void *buf, int len, unsigned int gup_flags);
extern int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
unsigned long addr, void *buf, int len, unsigned int gup_flags);
+extern int __access_remote_vm_killable(struct task_struct *tsk,
+ struct mm_struct *mm, unsigned long addr, void *buf, int len,
+ unsigned int gup_flags);
long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
diff --git a/mm/memory.c b/mm/memory.c
index dd8de96..8d7e223 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4415,18 +4415,13 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
EXPORT_SYMBOL_GPL(generic_access_phys);
#endif
-/*
- * Access another process' address space as given in mm. If non-NULL, use the
- * given task for page fault accounting.
- */
-int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
+static int raw_access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
unsigned long addr, void *buf, int len, unsigned int gup_flags)
{
struct vm_area_struct *vma;
void *old_buf = buf;
int write = gup_flags & FOLL_WRITE;
- down_read(&mm->mmap_sem);
/* ignore errors, just check how much was successfully transferred */
while (len) {
int bytes, ret, offset;
@@ -4475,11 +4470,40 @@ int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
buf += bytes;
addr += bytes;
}
- up_read(&mm->mmap_sem);
return buf - old_buf;
}
+/*
+ * Access another process' address space as given in mm. If non-NULL, use the
+ * given task for page fault accounting.
+ */
+int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long addr, void *buf, int len, unsigned int gup_flags)
+{
+ int ret;
+
+ down_read(&mm->mmap_sem);
+ ret = raw_access_remote_vm(tsk, mm, addr, buf, len, gup_flags);
+ up_read(&mm->mmap_sem);
+ return ret;
+}
+
+int __access_remote_vm_killable(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long addr, void *buf, int len, unsigned int gup_flags)
+{
+ int ret;
+
+ ret = down_read_killable(&mm->mmap_sem);
+ if (ret)
+ goto out;
+
+ ret = raw_access_remote_vm(tsk, mm, addr, buf, len, gup_flags);
+ up_read(&mm->mmap_sem);
+out:
+ return ret;
+}
+
/**
* access_remote_vm - access another process' address space
* @mm: the mm_struct of the target address space
@@ -4490,6 +4514,12 @@ int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
*
* The caller must hold a reference on @mm.
*/
+int access_remote_vm_killable(struct mm_struct *mm, unsigned long addr,
+ void *buf, int len, unsigned int gup_flags)
+{
+ return __access_remote_vm_killable(NULL, mm, addr, buf, len, gup_flags);
+}
+
int access_remote_vm(struct mm_struct *mm, unsigned long addr,
void *buf, int len, unsigned int gup_flags)
{
diff --git a/mm/nommu.c b/mm/nommu.c
index ebb6e61..ea043b3 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1802,14 +1802,12 @@ void filemap_map_pages(struct vm_fault *vmf,
}
EXPORT_SYMBOL(filemap_map_pages);
-int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
+static int raw_access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
unsigned long addr, void *buf, int len, unsigned int gup_flags)
{
struct vm_area_struct *vma;
int write = gup_flags & FOLL_WRITE;
- down_read(&mm->mmap_sem);
-
/* the access must start within one of the target process's mappings */
vma = find_vma(mm, addr);
if (vma) {
@@ -1830,9 +1828,33 @@ int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
len = 0;
}
+ return len;
+}
+
+int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long addr, void *buf, int len, unsigned int gup_flags)
+{
+ int ret;
+
+ down_read(&mm->mmap_sem);
+ ret = raw_access_remote_vm(tsk, mm, addr, buf, len, gup_flags);
up_read(&mm->mmap_sem);
+ return ret;
+}
- return len;
+int __access_remote_vm_killable(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long addr, void *buf, int len, unsigned int gup_flags)
+{
+ int ret;
+
+ ret = down_read_killable(&mm->mmap_sem);
+ if (ret)
+ goto out;
+
+ ret = raw_access_remote_vm(tsk, mm, addr, buf, len, gup_flags);
+ up_read(&mm->mmap_sem);
+out:
+ return ret;
}
/**
@@ -1845,6 +1867,12 @@ int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
*
* The caller must hold a reference on @mm.
*/
+int access_remote_vm_killable(struct mm_struct *mm, unsigned long addr,
+ void *buf, int len, unsigned int gup_flags)
+{
+ return __access_remote_vm_killable(NULL, mm, addr, buf, len, gup_flags);
+}
+
int access_remote_vm(struct mm_struct *mm, unsigned long addr,
void *buf, int len, unsigned int gup_flags)
{
--
1.8.3.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 2/4 v2] fs: proc: use down_read_killable in proc_pid_cmdline_read()
2018-02-27 0:25 [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc Yang Shi
2018-02-27 0:25 ` [PATCH 1/4 v2] mm: add access_remote_vm_killable APIs Yang Shi
@ 2018-02-27 0:25 ` Yang Shi
2018-02-27 0:25 ` [PATCH 3/4 v2] fs: proc: use down_read_killable() in environ_read() Yang Shi
` (4 subsequent siblings)
6 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2018-02-27 0:25 UTC (permalink / raw)
To: akpm, mingo, adobriyan; +Cc: yang.shi, linux-mm, linux-kernel
When running vm-scalability with large memory (> 300GB), the below hung
task issue happens occasionally.
INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ps D 0 14018 1 0x00000004
ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
Call Trace:
[<ffffffff817154d0>] ? __schedule+0x250/0x730
[<ffffffff817159e6>] schedule+0x36/0x80
[<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
[<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
[<ffffffff81717db0>] down_read+0x20/0x40
[<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
[<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
[<ffffffff81241d87>] __vfs_read+0x37/0x150
[<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
[<ffffffff81242266>] vfs_read+0x96/0x130
[<ffffffff812437b5>] SyS_read+0x55/0xc0
[<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
When manipulating a large mapping, the process may hold the mmap_sem for
long time, so reading /proc/<pid>/cmdline may be blocked in
uninterruptible state for long time.
down_read_trylock() sounds too aggressive, and we already have killable
version APIs for semaphore, here use down_read_killable() to improve the
responsiveness.
And, convert access_remote_vm() to killable version.
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: Alexey Dobriyan <adobriyan@gmail.com>
---
fs/proc/base.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9298324..9bdb84b 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -242,7 +242,9 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
goto out_mmput;
}
- down_read(&mm->mmap_sem);
+ rv = down_read_killable(&mm->mmap_sem);
+ if (rv)
+ goto out_free_page;
arg_start = mm->arg_start;
arg_end = mm->arg_end;
env_start = mm->env_start;
@@ -264,7 +266,7 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
* Inherently racy -- command line shares address space
* with code and data.
*/
- rv = access_remote_vm(mm, arg_end - 1, &c, 1, 0);
+ rv = access_remote_vm_killable(mm, arg_end - 1, &c, 1, 0);
if (rv <= 0)
goto out_free_page;
@@ -282,7 +284,8 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
int nr_read;
_count = min3(count, len, PAGE_SIZE);
- nr_read = access_remote_vm(mm, p, page, _count, 0);
+ nr_read = access_remote_vm_killable(mm, p, page,
+ _count, 0);
if (nr_read < 0)
rv = nr_read;
if (nr_read <= 0)
@@ -328,7 +331,8 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
bool final;
_count = min3(count, len, PAGE_SIZE);
- nr_read = access_remote_vm(mm, p, page, _count, 0);
+ nr_read = access_remote_vm_killable(mm, p,
+ page, _count, 0);
if (nr_read < 0)
rv = nr_read;
if (nr_read <= 0)
--
1.8.3.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH 3/4 v2] fs: proc: use down_read_killable() in environ_read()
2018-02-27 0:25 [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc Yang Shi
2018-02-27 0:25 ` [PATCH 1/4 v2] mm: add access_remote_vm_killable APIs Yang Shi
2018-02-27 0:25 ` [PATCH 2/4 v2] fs: proc: use down_read_killable in proc_pid_cmdline_read() Yang Shi
@ 2018-02-27 0:25 ` Yang Shi
2018-02-27 7:15 ` Alexey Dobriyan
2018-02-27 0:25 ` [PATCH 4/4 v2] mm: use access_remote_vm() in get_cmdline() Yang Shi
` (3 subsequent siblings)
6 siblings, 1 reply; 16+ messages in thread
From: Yang Shi @ 2018-02-27 0:25 UTC (permalink / raw)
To: akpm, mingo, adobriyan; +Cc: yang.shi, linux-mm, linux-kernel
Like reading /proc/*/cmdline, it is possible to be blocked for long time
when reading /proc/*/environ when manipulating large mapping at the mean
time. The environ reading process will be waiting for mmap_sem become
available for a long time then it may cause the reading task hung.
Convert down_read() and access_remote_vm() to killable version.
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: Alexey Dobriyan <adobriyan@gmail.com>
---
fs/proc/base.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9bdb84b..d87d9ab 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -933,7 +933,9 @@ static ssize_t environ_read(struct file *file, char __user *buf,
if (!mmget_not_zero(mm))
goto free;
- down_read(&mm->mmap_sem);
+ ret = down_read_killable(&mm->mmap_sem);
+ if (ret)
+ goto out_mmput;
env_start = mm->env_start;
env_end = mm->env_end;
up_read(&mm->mmap_sem);
@@ -950,7 +952,8 @@ static ssize_t environ_read(struct file *file, char __user *buf,
max_len = min_t(size_t, PAGE_SIZE, count);
this_len = min(max_len, this_len);
- retval = access_remote_vm(mm, (env_start + src), page, this_len, 0);
+ retval = access_remote_vm_killable(mm, (env_start + src),
+ page, this_len, 0);
if (retval <= 0) {
ret = retval;
@@ -968,6 +971,8 @@ static ssize_t environ_read(struct file *file, char __user *buf,
count -= retval;
}
*ppos = src;
+
+out_mmput:
mmput(mm);
free:
--
1.8.3.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 3/4 v2] fs: proc: use down_read_killable() in environ_read()
2018-02-27 0:25 ` [PATCH 3/4 v2] fs: proc: use down_read_killable() in environ_read() Yang Shi
@ 2018-02-27 7:15 ` Alexey Dobriyan
2018-02-27 16:59 ` Yang Shi
0 siblings, 1 reply; 16+ messages in thread
From: Alexey Dobriyan @ 2018-02-27 7:15 UTC (permalink / raw)
To: Yang Shi; +Cc: akpm, mingo, linux-mm, linux-kernel
On Tue, Feb 27, 2018 at 08:25:50AM +0800, Yang Shi wrote:
> Like reading /proc/*/cmdline, it is possible to be blocked for long time
> when reading /proc/*/environ when manipulating large mapping at the mean
> time. The environ reading process will be waiting for mmap_sem become
> available for a long time then it may cause the reading task hung.
>
> Convert down_read() and access_remote_vm() to killable version.
>
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> Suggested-by: Alexey Dobriyan <adobriyan@gmail.com>
Ehh, bloody tags.
I didn't suggest _killable() variants, they're quite ugly because API
multiplies. access_remote_vm() could be converted to down_read_killable().
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -933,7 +933,9 @@ static ssize_t environ_read(struct file *file, char __user *buf,
> if (!mmget_not_zero(mm))
> goto free;
>
> - down_read(&mm->mmap_sem);
> + ret = down_read_killable(&mm->mmap_sem);
> + if (ret)
> + goto out_mmput;
> env_start = mm->env_start;
> env_end = mm->env_end;
> up_read(&mm->mmap_sem);
> @@ -950,7 +952,8 @@ static ssize_t environ_read(struct file *file, char __user *buf,
> max_len = min_t(size_t, PAGE_SIZE, count);
> this_len = min(max_len, this_len);
>
> - retval = access_remote_vm(mm, (env_start + src), page, this_len, 0);
> + retval = access_remote_vm_killable(mm, (env_start + src),
> + page, this_len, 0);
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 3/4 v2] fs: proc: use down_read_killable() in environ_read()
2018-02-27 7:15 ` Alexey Dobriyan
@ 2018-02-27 16:59 ` Yang Shi
0 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2018-02-27 16:59 UTC (permalink / raw)
To: Alexey Dobriyan; +Cc: akpm, mingo, linux-mm, linux-kernel
On 2/26/18 11:15 PM, Alexey Dobriyan wrote:
> On Tue, Feb 27, 2018 at 08:25:50AM +0800, Yang Shi wrote:
>> Like reading /proc/*/cmdline, it is possible to be blocked for long time
>> when reading /proc/*/environ when manipulating large mapping at the mean
>> time. The environ reading process will be waiting for mmap_sem become
>> available for a long time then it may cause the reading task hung.
>>
>> Convert down_read() and access_remote_vm() to killable version.
>>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>> Suggested-by: Alexey Dobriyan <adobriyan@gmail.com>
> Ehh, bloody tags.
I mean fix reading /proc/*/environ part.
> I didn't suggest _killable() variants, they're quite ugly because API
> multiplies. access_remote_vm() could be converted to down_read_killable().
There might be other places that need non-killable access_remote_vm(),
like patch 4/4.
And, it sounds keeping access_remote_vm() semantic intact may prevent
from confusion. The most people may get used to assume
access_remote_vm() behaves like access_process_vm() except not inc mm
reference count.
Thanks,
Yang
>
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -933,7 +933,9 @@ static ssize_t environ_read(struct file *file, char __user *buf,
>> if (!mmget_not_zero(mm))
>> goto free;
>>
>> - down_read(&mm->mmap_sem);
>> + ret = down_read_killable(&mm->mmap_sem);
>> + if (ret)
>> + goto out_mmput;
>> env_start = mm->env_start;
>> env_end = mm->env_end;
>> up_read(&mm->mmap_sem);
>> @@ -950,7 +952,8 @@ static ssize_t environ_read(struct file *file, char __user *buf,
>> max_len = min_t(size_t, PAGE_SIZE, count);
>> this_len = min(max_len, this_len);
>>
>> - retval = access_remote_vm(mm, (env_start + src), page, this_len, 0);
>> + retval = access_remote_vm_killable(mm, (env_start + src),
>> + page, this_len, 0);
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 4/4 v2] mm: use access_remote_vm() in get_cmdline()
2018-02-27 0:25 [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc Yang Shi
` (2 preceding siblings ...)
2018-02-27 0:25 ` [PATCH 3/4 v2] fs: proc: use down_read_killable() in environ_read() Yang Shi
@ 2018-02-27 0:25 ` Yang Shi
2018-02-27 1:02 ` [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc David Rientjes
` (2 subsequent siblings)
6 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2018-02-27 0:25 UTC (permalink / raw)
To: akpm, mingo, adobriyan; +Cc: yang.shi, linux-mm, linux-kernel
get_cmdline() is using access_process_vm() which increases mm reference
count, but the mm reference count has been increased before calling
access_process_vm() and it is kept across get_cmdline(). It sounds
unnecessary to get mm reference count increased twice, so replace
access_process_vm() to access_remote_vm() which requires caller increase
mm reference count.
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
mm/util.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/util.c b/mm/util.c
index c125050..9b40637 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -732,7 +732,7 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen)
if (len > buflen)
len = buflen;
- res = access_process_vm(task, arg_start, buffer, len, FOLL_FORCE);
+ res = access_remote_vm(mm, arg_start, buffer, len, FOLL_FORCE);
/*
* If the nul at the end of args has been overwritten, then
@@ -746,7 +746,7 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen)
len = env_end - env_start;
if (len > buflen - res)
len = buflen - res;
- res += access_process_vm(task, env_start,
+ res += access_remote_vm(mm, env_start,
buffer+res, len,
FOLL_FORCE);
res = strnlen(buffer, res);
--
1.8.3.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc
2018-02-27 0:25 [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc Yang Shi
` (3 preceding siblings ...)
2018-02-27 0:25 ` [PATCH 4/4 v2] mm: use access_remote_vm() in get_cmdline() Yang Shi
@ 2018-02-27 1:02 ` David Rientjes
2018-02-27 1:25 ` Yang Shi
2018-03-06 18:45 ` Yang Shi
2018-03-06 20:45 ` Andrew Morton
6 siblings, 1 reply; 16+ messages in thread
From: David Rientjes @ 2018-02-27 1:02 UTC (permalink / raw)
To: Yang Shi; +Cc: akpm, mingo, adobriyan, linux-mm, linux-kernel
On Tue, 27 Feb 2018, Yang Shi wrote:
>
> Background:
> When running vm-scalability with large memory (> 300GB), the below hung
> task issue happens occasionally.
>
> INFO: task ps:14018 blocked for more than 120 seconds.
> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> ps D 0 14018 1 0x00000004
> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
> Call Trace:
> [<ffffffff817154d0>] ? __schedule+0x250/0x730
> [<ffffffff817159e6>] schedule+0x36/0x80
> [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
> [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
> [<ffffffff81717db0>] down_read+0x20/0x40
> [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
> [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
> [<ffffffff81241d87>] __vfs_read+0x37/0x150
> [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
> [<ffffffff81242266>] vfs_read+0x96/0x130
> [<ffffffff812437b5>] SyS_read+0x55/0xc0
> [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>
> When manipulating a large mapping, the process may hold the mmap_sem for
> long time, so reading /proc/<pid>/cmdline may be blocked in
> uninterruptible state for long time.
> We already have killable version APIs for semaphore, here use down_read_killable()
> to improve the responsiveness.
>
Rather than killable, we have patches that introduce down_read_unfair()
variants for the files you've modified (cmdline and environ) as well as
others (maps, numa_maps, smaps).
When another thread is holding down_read() and there are queued
down_write()'s, down_read_unfair() allows for grabbing the rwsem without
queueing for it. Additionally, when another thread is holding
down_write(), down_read_unfair() allows for queueing in front of other
threads trying to grab it for write as well.
Ingo would know more about whether a variant like that in upstream Linux
would be acceptable.
Would you be interested in unfair variants instead of only addressing
killable?
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc
2018-02-27 1:02 ` [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc David Rientjes
@ 2018-02-27 1:25 ` Yang Shi
2018-02-27 1:47 ` David Rientjes
0 siblings, 1 reply; 16+ messages in thread
From: Yang Shi @ 2018-02-27 1:25 UTC (permalink / raw)
To: David Rientjes; +Cc: akpm, mingo, adobriyan, linux-mm, linux-kernel
On 2/26/18 5:02 PM, David Rientjes wrote:
> On Tue, 27 Feb 2018, Yang Shi wrote:
>
>> Background:
>> When running vm-scalability with large memory (> 300GB), the below hung
>> task issue happens occasionally.
>>
>> INFO: task ps:14018 blocked for more than 120 seconds.
>> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> ps D 0 14018 1 0x00000004
>> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>> Call Trace:
>> [<ffffffff817154d0>] ? __schedule+0x250/0x730
>> [<ffffffff817159e6>] schedule+0x36/0x80
>> [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>> [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>> [<ffffffff81717db0>] down_read+0x20/0x40
>> [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>> [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>> [<ffffffff81241d87>] __vfs_read+0x37/0x150
>> [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>> [<ffffffff81242266>] vfs_read+0x96/0x130
>> [<ffffffff812437b5>] SyS_read+0x55/0xc0
>> [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>
>> When manipulating a large mapping, the process may hold the mmap_sem for
>> long time, so reading /proc/<pid>/cmdline may be blocked in
>> uninterruptible state for long time.
>> We already have killable version APIs for semaphore, here use down_read_killable()
>> to improve the responsiveness.
>>
> Rather than killable, we have patches that introduce down_read_unfair()
> variants for the files you've modified (cmdline and environ) as well as
> others (maps, numa_maps, smaps).
You mean you have such functionality used by google internally?
>
> When another thread is holding down_read() and there are queued
> down_write()'s, down_read_unfair() allows for grabbing the rwsem without
> queueing for it. Additionally, when another thread is holding
> down_write(), down_read_unfair() allows for queueing in front of other
> threads trying to grab it for write as well.
It sounds the __unfair variant make the caller have chance to jump the
gun to grab the semaphore before other waiters, right? But when a
process holds the semaphore, i.e. mmap_sem, for a long time, it still
has to sleep in uninterruptible state, right?
But, it seems __unfair variant may not be very helpful in this usecase.
Reading /proc might be not that important to require any special care to
grab the semaphore before other waiters. I just hope it doesn't sleep in
uninterruptible state for a long time. If the user is not patient enough
due to some reason, they can have a chance to abort.
>
> Ingo would know more about whether a variant like that in upstream Linux
> would be acceptable.
>
> Would you be interested in unfair variants instead of only addressing
> killable?
Yes, I'm although it still looks overkilling to me for reading /proc.
Thanks,
Yang
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc
2018-02-27 1:25 ` Yang Shi
@ 2018-02-27 1:47 ` David Rientjes
2018-03-01 0:17 ` Yang Shi
0 siblings, 1 reply; 16+ messages in thread
From: David Rientjes @ 2018-02-27 1:47 UTC (permalink / raw)
To: Yang Shi; +Cc: akpm, mingo, adobriyan, linux-mm, linux-kernel
On Mon, 26 Feb 2018, Yang Shi wrote:
> > Rather than killable, we have patches that introduce down_read_unfair()
> > variants for the files you've modified (cmdline and environ) as well as
> > others (maps, numa_maps, smaps).
>
> You mean you have such functionality used by google internally?
>
Yup, see https://lwn.net/Articles/387720.
> > When another thread is holding down_read() and there are queued
> > down_write()'s, down_read_unfair() allows for grabbing the rwsem without
> > queueing for it. Additionally, when another thread is holding
> > down_write(), down_read_unfair() allows for queueing in front of other
> > threads trying to grab it for write as well.
>
> It sounds the __unfair variant make the caller have chance to jump the gun to
> grab the semaphore before other waiters, right? But when a process holds the
> semaphore, i.e. mmap_sem, for a long time, it still has to sleep in
> uninterruptible state, right?
>
Right, it's solving two separate things which I think may be able to be
merged together. Killable is solving an issue where the rwsem is blocking
for a long period of time in uninterruptible sleep, and unfair is solving
an issue where reading the procfs files gets stalled for a long period of
time. We haven't run into an issue (yet) where killable would have solved
it; we just have the unfair variants to grab the rwsem asap and then, if
killable, gracefully return.
> > Ingo would know more about whether a variant like that in upstream Linux
> > would be acceptable.
> >
> > Would you be interested in unfair variants instead of only addressing
> > killable?
>
> Yes, I'm although it still looks overkilling to me for reading /proc.
>
We make certain inferences on the readablility of procfs files for other
threads to determine how much its mm's mmap_sem is contended.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc
2018-02-27 1:47 ` David Rientjes
@ 2018-03-01 0:17 ` Yang Shi
0 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2018-03-01 0:17 UTC (permalink / raw)
To: David Rientjes; +Cc: akpm, mingo, adobriyan, linux-mm, linux-kernel
On 2/26/18 5:47 PM, David Rientjes wrote:
> On Mon, 26 Feb 2018, Yang Shi wrote:
>
>>> Rather than killable, we have patches that introduce down_read_unfair()
>>> variants for the files you've modified (cmdline and environ) as well as
>>> others (maps, numa_maps, smaps).
>> You mean you have such functionality used by google internally?
>>
> Yup, see https://lwn.net/Articles/387720.
>
>>> When another thread is holding down_read() and there are queued
>>> down_write()'s, down_read_unfair() allows for grabbing the rwsem without
>>> queueing for it. Additionally, when another thread is holding
>>> down_write(), down_read_unfair() allows for queueing in front of other
>>> threads trying to grab it for write as well.
>> It sounds the __unfair variant make the caller have chance to jump the gun to
>> grab the semaphore before other waiters, right? But when a process holds the
>> semaphore, i.e. mmap_sem, for a long time, it still has to sleep in
>> uninterruptible state, right?
>>
> Right, it's solving two separate things which I think may be able to be
> merged together. Killable is solving an issue where the rwsem is blocking
> for a long period of time in uninterruptible sleep, and unfair is solving
> an issue where reading the procfs files gets stalled for a long period of
> time. We haven't run into an issue (yet) where killable would have solved
> it; we just have the unfair variants to grab the rwsem asap and then, if
> killable, gracefully return.
>
>>> Ingo would know more about whether a variant like that in upstream Linux
>>> would be acceptable.
>>>
>>> Would you be interested in unfair variants instead of only addressing
>>> killable?
>> Yes, I'm although it still looks overkilling to me for reading /proc.
>>
> We make certain inferences on the readablility of procfs files for other
> threads to determine how much its mm's mmap_sem is contended.
I see your points here for reading /proc for system monitor. However,
I'm concerned that the _unfair APIs get the processes which read /proc
priority elevation (not real priority change, just look like). It might
be abused by some applications, for example:
A high priority process and a low priority process are waiting for the
same rwsem, if the low priority process is trying to read /proc
maliciously on purpose, it can get elevated to grap the rwsem before any
other processes which are waiting for the same rwsem.
Yang
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc
2018-02-27 0:25 [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc Yang Shi
` (4 preceding siblings ...)
2018-02-27 1:02 ` [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc David Rientjes
@ 2018-03-06 18:45 ` Yang Shi
2018-03-06 20:45 ` Andrew Morton
6 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2018-03-06 18:45 UTC (permalink / raw)
To: akpm, mingo, adobriyan; +Cc: linux-mm, linux-kernel
Hi Andrew,
Any comment on this series is appreciated.
I'm supposed everyone agrees to improve the readability of /proc, right?
The argument is whether access_remote_vm_killable() is preferred or not.
I though _killable() version may prevent from confusion since
access_remote_vm() is always non-killable. And, patch 4/4 uses the
non-killable version.
Thanks,
Yang
On 2/26/18 4:25 PM, Yang Shi wrote:
> Background:
> When running vm-scalability with large memory (> 300GB), the below hung
> task issue happens occasionally.
>
> INFO: task ps:14018 blocked for more than 120 seconds.
> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> ps D 0 14018 1 0x00000004
> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
> Call Trace:
> [<ffffffff817154d0>] ? __schedule+0x250/0x730
> [<ffffffff817159e6>] schedule+0x36/0x80
> [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
> [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
> [<ffffffff81717db0>] down_read+0x20/0x40
> [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
> [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
> [<ffffffff81241d87>] __vfs_read+0x37/0x150
> [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
> [<ffffffff81242266>] vfs_read+0x96/0x130
> [<ffffffff812437b5>] SyS_read+0x55/0xc0
> [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>
> When manipulating a large mapping, the process may hold the mmap_sem for
> long time, so reading /proc/<pid>/cmdline may be blocked in
> uninterruptible state for long time.
> We already have killable version APIs for semaphore, here use down_read_killable()
> to improve the responsiveness.
>
>
> When reviewing the v1 patch (https://patchwork.kernel.org/patch/10230809/),
> Alexey pointed out access_remote_vm() need to be killable too. And, /proc/*/environ
> reading may suffer from the same issue, so it should be converted to killable
> version for both down_read and access_remote_vm too.
>
> With reading the code, both access_remote_vm() and access_process_vm() calls
> __access_remote_vm() which acquires mmap_sem by down_read(). access_remote_vm()
> is only used by fs/proc/base.c, but access_process_vm() is used by other
> subsystems too, i.e. ptrace, audit, etc. So, it sounds not that safe to convert
> both access_remote_vm() and access_process_vm() to killable.
> Instead of doing so, extract command part of __access_remote_vm() (gup part) to
> a new static function, called raw_access_remote_vm(), then define
> __access_remote_vm() and __access_remote_vm_killable(), which acquire mmap_sem
> by down_read() and _killable() respectively.
>
> Then define access_remote_vm() and access_remote_vm_killable() to call them
> respectively. Keep access_process_vm() calls __access_remote_vm().
>
> So far fs/proc/base.c is the only user of access_remote_vm_killable(), but
> there might be other users in the future.
>
> There are 4 patches in this revision:
> #1 define access_remote_vm_killable() APIs
> #2 convert /proc/*/cmdline reading to down_read_killable() and access_remote_vm_killable()
> #3 convert /proc/*/environ reading to down_read_killable() and access_remote_vm_killable()
> #4 replace access_process_vm() to access_remote_vm() in get_cmdline to save one
> mm reference count inc (please see the commit log for the details). This
> change makes get_cmdline() is the only user of access_remote_vm()
>
>
> Yang Shi (4):
> mm: add access_remote_vm_killable APIs
> fs: proc: use down_read_killable in proc_pid_cmdline_read()
> fs: proc: use down_read_killable() in environ_read()
> mm: use access_remote_vm() in get_cmdline()
>
> fs/proc/base.c | 21 +++++++++++++++------
> include/linux/mm.h | 5 +++++
> mm/memory.c | 44 +++++++++++++++++++++++++++++++++++++-------
> mm/nommu.c | 36 ++++++++++++++++++++++++++++++++----
> mm/util.c | 4 ++--
> 5 files changed, 91 insertions(+), 19 deletions(-)
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc
2018-02-27 0:25 [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc Yang Shi
` (5 preceding siblings ...)
2018-03-06 18:45 ` Yang Shi
@ 2018-03-06 20:45 ` Andrew Morton
2018-03-06 21:17 ` Yang Shi
6 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2018-03-06 20:45 UTC (permalink / raw)
To: Yang Shi; +Cc: mingo, adobriyan, linux-mm, linux-kernel, David Rientjes
On Tue, 27 Feb 2018 08:25:47 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
> Background:
> When running vm-scalability with large memory (> 300GB), the below hung
> task issue happens occasionally.
>
> INFO: task ps:14018 blocked for more than 120 seconds.
> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> ps D 0 14018 1 0x00000004
> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
> Call Trace:
> [<ffffffff817154d0>] ? __schedule+0x250/0x730
> [<ffffffff817159e6>] schedule+0x36/0x80
> [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
> [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
> [<ffffffff81717db0>] down_read+0x20/0x40
> [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
> [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
> [<ffffffff81241d87>] __vfs_read+0x37/0x150
> [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
> [<ffffffff81242266>] vfs_read+0x96/0x130
> [<ffffffff812437b5>] SyS_read+0x55/0xc0
> [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>
> When manipulating a large mapping, the process may hold the mmap_sem for
> long time, so reading /proc/<pid>/cmdline may be blocked in
> uninterruptible state for long time.
> We already have killable version APIs for semaphore, here use down_read_killable()
> to improve the responsiveness.
>
Maybe I'm missing something, but I don't see how this solves the
problem. Yes, the read of /proc/pid/cmdline will be abandoned if
someone interrupts that process. But if nobody does that, the read
will still just sit there for 2 minutes and the watchdog warning will
still come out?
Where the heck are we holding mmap_sem for so long? Can that be fixed?
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc
2018-03-06 20:45 ` Andrew Morton
@ 2018-03-06 21:17 ` Yang Shi
2018-03-06 21:41 ` Andrew Morton
0 siblings, 1 reply; 16+ messages in thread
From: Yang Shi @ 2018-03-06 21:17 UTC (permalink / raw)
To: Andrew Morton; +Cc: mingo, adobriyan, linux-mm, linux-kernel, David Rientjes
On 3/6/18 12:45 PM, Andrew Morton wrote:
> On Tue, 27 Feb 2018 08:25:47 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
>> Background:
>> When running vm-scalability with large memory (> 300GB), the below hung
>> task issue happens occasionally.
>>
>> INFO: task ps:14018 blocked for more than 120 seconds.
>> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> ps D 0 14018 1 0x00000004
>> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>> Call Trace:
>> [<ffffffff817154d0>] ? __schedule+0x250/0x730
>> [<ffffffff817159e6>] schedule+0x36/0x80
>> [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>> [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>> [<ffffffff81717db0>] down_read+0x20/0x40
>> [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>> [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>> [<ffffffff81241d87>] __vfs_read+0x37/0x150
>> [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>> [<ffffffff81242266>] vfs_read+0x96/0x130
>> [<ffffffff812437b5>] SyS_read+0x55/0xc0
>> [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>
>> When manipulating a large mapping, the process may hold the mmap_sem for
>> long time, so reading /proc/<pid>/cmdline may be blocked in
>> uninterruptible state for long time.
>> We already have killable version APIs for semaphore, here use down_read_killable()
>> to improve the responsiveness.
>>
> Maybe I'm missing something, but I don't see how this solves the
> problem. Yes, the read of /proc/pid/cmdline will be abandoned if
> someone interrupts that process. But if nobody does that, the read
> will still just sit there for 2 minutes and the watchdog warning will
> still come out?
No, the warning will not come out since down_read_killable() puts the
task into TASK_KILLABLE state instead of TASK_UNINTERRUPTIBLE state. The
hung task check will skip TASK_KILLABLE tasks, please see the below code
in (kernel/hung_task.c):
/* use "==" to skip the TASK_KILLABLE tasks waiting on
NFS */
if (t->state == TASK_UNINTERRUPTIBLE)
check_hung_task(t, timeout);
It just mitigates the hung task warning, can't resolve the mmap_sem
scalability issue. Furthermore, waiting on pure uninterruptible state
for reading /proc sounds unnecessary. It doesn't wait for I/O completion.
>
> Where the heck are we holding mmap_sem for so long? Can that be fixed?
The mmap_sem is held for unmapping a large map which has every single
page mapped. This is not a issue in real production code. Just found it
by running vm-scalability on a machine with ~600GB memory.
AFAIK, I don't see any easy fix for the mmap_sem scalability issue. I
saw range locking patches (https://lwn.net/Articles/723648/) were
floating around. But, it may not help too much on the case that a large
map with every single page mapped.
Thanks,
Yang
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc
2018-03-06 21:17 ` Yang Shi
@ 2018-03-06 21:41 ` Andrew Morton
2018-03-07 0:47 ` Yang Shi
0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2018-03-06 21:41 UTC (permalink / raw)
To: Yang Shi; +Cc: mingo, adobriyan, linux-mm, linux-kernel, David Rientjes
On Tue, 6 Mar 2018 13:17:37 -0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
>
> It just mitigates the hung task warning, can't resolve the mmap_sem
> scalability issue. Furthermore, waiting on pure uninterruptible state
> for reading /proc sounds unnecessary. It doesn't wait for I/O completion.
OK.
> >
> > Where the heck are we holding mmap_sem for so long? Can that be fixed?
>
> The mmap_sem is held for unmapping a large map which has every single
> page mapped. This is not a issue in real production code. Just found it
> by running vm-scalability on a machine with ~600GB memory.
>
> AFAIK, I don't see any easy fix for the mmap_sem scalability issue. I
> saw range locking patches (https://lwn.net/Articles/723648/) were
> floating around. But, it may not help too much on the case that a large
> map with every single page mapped.
Well it sounds fairly simple to mitigate? Simplistically: don't unmap
600G in a single hit; do it 1G at a time, dropping mmap_sem each time.
A smarter version might only come up for air if there are mmap_sem
waiters and if it has already done some work. I don't think we have
any particular atomicity requirements when unmapping?
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc
2018-03-06 21:41 ` Andrew Morton
@ 2018-03-07 0:47 ` Yang Shi
0 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2018-03-07 0:47 UTC (permalink / raw)
To: Andrew Morton; +Cc: mingo, adobriyan, linux-mm, linux-kernel, David Rientjes
On 3/6/18 1:41 PM, Andrew Morton wrote:
> On Tue, 6 Mar 2018 13:17:37 -0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
>>
>> It just mitigates the hung task warning, can't resolve the mmap_sem
>> scalability issue. Furthermore, waiting on pure uninterruptible state
>> for reading /proc sounds unnecessary. It doesn't wait for I/O completion.
> OK.
Since we already had down_read_killable() APIs available, IMHO, giving
application a chance to abort at some circumstances sounds not bad.
>
>>> Where the heck are we holding mmap_sem for so long? Can that be fixed?
>> The mmap_sem is held for unmapping a large map which has every single
>> page mapped. This is not a issue in real production code. Just found it
>> by running vm-scalability on a machine with ~600GB memory.
>>
>> AFAIK, I don't see any easy fix for the mmap_sem scalability issue. I
>> saw range locking patches (https://lwn.net/Articles/723648/) were
>> floating around. But, it may not help too much on the case that a large
>> map with every single page mapped.
> Well it sounds fairly simple to mitigate? Simplistically: don't unmap
> 600G in a single hit; do it 1G at a time, dropping mmap_sem each time.
> A smarter version might only come up for air if there are mmap_sem
> waiters and if it has already done some work. I don't think we have
> any particular atomicity requirements when unmapping?
I'm not quite sure. But, the existing applications may assume munmap is
atomic?
Thanks,
Yang
>
^ permalink raw reply [flat|nested] 16+ messages in thread