linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc
@ 2018-02-27  0:25 Yang Shi
  2018-02-27  0:25 ` [PATCH 1/4 v2] mm: add access_remote_vm_killable APIs Yang Shi
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Yang Shi @ 2018-02-27  0:25 UTC (permalink / raw)
  To: akpm, mingo, adobriyan; +Cc: yang.shi, linux-mm, linux-kernel


Background:
When running vm-scalability with large memory (> 300GB), the below hung
task issue happens occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
       Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 ps              D    0 14018      1 0x00000004
  ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
  ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
  00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
 Call Trace:
  [<ffffffff817154d0>] ? __schedule+0x250/0x730
  [<ffffffff817159e6>] schedule+0x36/0x80
  [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
  [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
  [<ffffffff81717db0>] down_read+0x20/0x40
  [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
  [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
  [<ffffffff81241d87>] __vfs_read+0x37/0x150
  [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
  [<ffffffff81242266>] vfs_read+0x96/0x130
  [<ffffffff812437b5>] SyS_read+0x55/0xc0
  [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5

When manipulating a large mapping, the process may hold the mmap_sem for
long time, so reading /proc/<pid>/cmdline may be blocked in
uninterruptible state for long time.
We already have killable version APIs for semaphore, here use down_read_killable()
to improve the responsiveness.


When reviewing the v1 patch (https://patchwork.kernel.org/patch/10230809/),
Alexey pointed out access_remote_vm() need to be killable too. And, /proc/*/environ
reading may suffer from the same issue, so it should be converted to killable
version for both down_read and access_remote_vm too.

With reading the code, both access_remote_vm() and access_process_vm() calls
__access_remote_vm() which acquires mmap_sem by down_read(). access_remote_vm()
is only used by fs/proc/base.c, but access_process_vm() is used by other
subsystems too, i.e. ptrace, audit, etc. So, it sounds not that safe to convert
both access_remote_vm() and access_process_vm() to killable.
Instead of doing so, extract command part of __access_remote_vm() (gup part) to
a new static function, called raw_access_remote_vm(), then define
__access_remote_vm() and __access_remote_vm_killable(), which acquire mmap_sem
by down_read() and _killable() respectively.

Then define access_remote_vm() and access_remote_vm_killable() to call them
respectively. Keep access_process_vm() calls __access_remote_vm().

So far fs/proc/base.c is the only user of access_remote_vm_killable(), but
there might be other users in the future.

There are 4 patches in this revision:
#1 define access_remote_vm_killable() APIs
#2 convert /proc/*/cmdline reading to down_read_killable() and access_remote_vm_killable()
#3 convert /proc/*/environ reading to down_read_killable() and access_remote_vm_killable()
#4 replace access_process_vm() to access_remote_vm() in get_cmdline to save one
   mm reference count inc (please see the commit log for the details). This
   change makes get_cmdline() is the only user of access_remote_vm()


Yang Shi (4):
      mm: add access_remote_vm_killable APIs
      fs: proc: use down_read_killable in proc_pid_cmdline_read()
      fs: proc: use down_read_killable() in environ_read()
      mm: use access_remote_vm() in get_cmdline()

 fs/proc/base.c     | 21 +++++++++++++++------
 include/linux/mm.h |  5 +++++
 mm/memory.c        | 44 +++++++++++++++++++++++++++++++++++++-------
 mm/nommu.c         | 36 ++++++++++++++++++++++++++++++++----
 mm/util.c          |  4 ++--
 5 files changed, 91 insertions(+), 19 deletions(-)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/4 v2] mm: add access_remote_vm_killable APIs
  2018-02-27  0:25 [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc Yang Shi
@ 2018-02-27  0:25 ` Yang Shi
  2018-02-27  0:25 ` [PATCH 2/4 v2] fs: proc: use down_read_killable in proc_pid_cmdline_read() Yang Shi
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2018-02-27  0:25 UTC (permalink / raw)
  To: akpm, mingo, adobriyan; +Cc: yang.shi, linux-mm, linux-kernel

Extracted common part (without acquiring mmap_sem) of
__access_remote_vm() into raw_access_remote_vm() then create
__access_remote_vm_killable() and access_remote_vm_killable() with
acquiring mmap_sem by down_read_killable().
Keep non-killable versions using down_read().

The killable version will be used by reading /proc/*/cmdline and
/proc/*/environ for the time being.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
---
 include/linux/mm.h |  5 +++++
 mm/memory.c        | 44 +++++++++++++++++++++++++++++++++++++-------
 mm/nommu.c         | 36 ++++++++++++++++++++++++++++++++----
 3 files changed, 74 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42..4574b19 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1372,8 +1372,13 @@ extern int access_process_vm(struct task_struct *tsk, unsigned long addr,
 		void *buf, int len, unsigned int gup_flags);
 extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
 		void *buf, int len, unsigned int gup_flags);
+extern int access_remote_vm_killable(struct mm_struct *mm, unsigned long addr,
+		void *buf, int len, unsigned int gup_flags);
 extern int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long addr, void *buf, int len, unsigned int gup_flags);
+extern int __access_remote_vm_killable(struct task_struct *tsk,
+		struct mm_struct *mm, unsigned long addr, void *buf, int len,
+		unsigned int gup_flags);
 
 long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long start, unsigned long nr_pages,
diff --git a/mm/memory.c b/mm/memory.c
index dd8de96..8d7e223 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4415,18 +4415,13 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
 EXPORT_SYMBOL_GPL(generic_access_phys);
 #endif
 
-/*
- * Access another process' address space as given in mm.  If non-NULL, use the
- * given task for page fault accounting.
- */
-int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
+static int raw_access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long addr, void *buf, int len, unsigned int gup_flags)
 {
 	struct vm_area_struct *vma;
 	void *old_buf = buf;
 	int write = gup_flags & FOLL_WRITE;
 
-	down_read(&mm->mmap_sem);
 	/* ignore errors, just check how much was successfully transferred */
 	while (len) {
 		int bytes, ret, offset;
@@ -4475,11 +4470,40 @@ int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
 		buf += bytes;
 		addr += bytes;
 	}
-	up_read(&mm->mmap_sem);
 
 	return buf - old_buf;
 }
 
+/*
+ * Access another process' address space as given in mm.  If non-NULL, use the
+ * given task for page fault accounting.
+ */
+int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long addr, void *buf, int len, unsigned int gup_flags)
+{
+	int ret;
+
+	down_read(&mm->mmap_sem);
+	ret = raw_access_remote_vm(tsk, mm, addr, buf, len, gup_flags);
+	up_read(&mm->mmap_sem);
+	return ret;
+}
+
+int __access_remote_vm_killable(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long addr, void *buf, int len, unsigned int gup_flags)
+{
+	int ret;
+
+	ret = down_read_killable(&mm->mmap_sem);
+	if (ret)
+		goto out;
+
+	ret = raw_access_remote_vm(tsk, mm, addr, buf, len, gup_flags);
+	up_read(&mm->mmap_sem);
+out:
+	return ret;
+}
+
 /**
  * access_remote_vm - access another process' address space
  * @mm:		the mm_struct of the target address space
@@ -4490,6 +4514,12 @@ int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
  *
  * The caller must hold a reference on @mm.
  */
+int access_remote_vm_killable(struct mm_struct *mm, unsigned long addr,
+		void *buf, int len, unsigned int gup_flags)
+{
+	return __access_remote_vm_killable(NULL, mm, addr, buf, len, gup_flags);
+}
+
 int access_remote_vm(struct mm_struct *mm, unsigned long addr,
 		void *buf, int len, unsigned int gup_flags)
 {
diff --git a/mm/nommu.c b/mm/nommu.c
index ebb6e61..ea043b3 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1802,14 +1802,12 @@ void filemap_map_pages(struct vm_fault *vmf,
 }
 EXPORT_SYMBOL(filemap_map_pages);
 
-int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
+static int raw_access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long addr, void *buf, int len, unsigned int gup_flags)
 {
 	struct vm_area_struct *vma;
 	int write = gup_flags & FOLL_WRITE;
 
-	down_read(&mm->mmap_sem);
-
 	/* the access must start within one of the target process's mappings */
 	vma = find_vma(mm, addr);
 	if (vma) {
@@ -1830,9 +1828,33 @@ int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
 		len = 0;
 	}
 
+	return len;
+}
+
+int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long addr, void *buf, int len, unsigned int gup_flags)
+{
+	int ret;
+
+	down_read(&mm->mmap_sem);
+	ret = raw_access_remote_vm(tsk, mm, addr, buf, len, gup_flags);
 	up_read(&mm->mmap_sem);
+	return ret;
+}
 
-	return len;
+int __access_remote_vm_killable(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long addr, void *buf, int len, unsigned int gup_flags)
+{
+	int ret;
+
+	ret = down_read_killable(&mm->mmap_sem);
+	if (ret)
+		goto out;
+
+	ret = raw_access_remote_vm(tsk, mm, addr, buf, len, gup_flags);
+	up_read(&mm->mmap_sem);
+out:
+	return ret;
 }
 
 /**
@@ -1845,6 +1867,12 @@ int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
  *
  * The caller must hold a reference on @mm.
  */
+int access_remote_vm_killable(struct mm_struct *mm, unsigned long addr,
+		void *buf, int len, unsigned int gup_flags)
+{
+	return __access_remote_vm_killable(NULL, mm, addr, buf, len, gup_flags);
+}
+
 int access_remote_vm(struct mm_struct *mm, unsigned long addr,
 		void *buf, int len, unsigned int gup_flags)
 {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/4 v2] fs: proc: use down_read_killable in proc_pid_cmdline_read()
  2018-02-27  0:25 [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc Yang Shi
  2018-02-27  0:25 ` [PATCH 1/4 v2] mm: add access_remote_vm_killable APIs Yang Shi
@ 2018-02-27  0:25 ` Yang Shi
  2018-02-27  0:25 ` [PATCH 3/4 v2] fs: proc: use down_read_killable() in environ_read() Yang Shi
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2018-02-27  0:25 UTC (permalink / raw)
  To: akpm, mingo, adobriyan; +Cc: yang.shi, linux-mm, linux-kernel

When running vm-scalability with large memory (> 300GB), the below hung
task issue happens occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
       Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 ps              D    0 14018      1 0x00000004
  ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
  ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
  00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
 Call Trace:
  [<ffffffff817154d0>] ? __schedule+0x250/0x730
  [<ffffffff817159e6>] schedule+0x36/0x80
  [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
  [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
  [<ffffffff81717db0>] down_read+0x20/0x40
  [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
  [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
  [<ffffffff81241d87>] __vfs_read+0x37/0x150
  [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
  [<ffffffff81242266>] vfs_read+0x96/0x130
  [<ffffffff812437b5>] SyS_read+0x55/0xc0
  [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5

When manipulating a large mapping, the process may hold the mmap_sem for
long time, so reading /proc/<pid>/cmdline may be blocked in
uninterruptible state for long time.

down_read_trylock() sounds too aggressive, and we already have killable
version APIs for semaphore, here use down_read_killable() to improve the
responsiveness.

And, convert access_remote_vm() to killable version.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 fs/proc/base.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9298324..9bdb84b 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -242,7 +242,9 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
 		goto out_mmput;
 	}
 
-	down_read(&mm->mmap_sem);
+	rv = down_read_killable(&mm->mmap_sem);
+	if (rv)
+		goto out_free_page;
 	arg_start = mm->arg_start;
 	arg_end = mm->arg_end;
 	env_start = mm->env_start;
@@ -264,7 +266,7 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
 	 * Inherently racy -- command line shares address space
 	 * with code and data.
 	 */
-	rv = access_remote_vm(mm, arg_end - 1, &c, 1, 0);
+	rv = access_remote_vm_killable(mm, arg_end - 1, &c, 1, 0);
 	if (rv <= 0)
 		goto out_free_page;
 
@@ -282,7 +284,8 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
 			int nr_read;
 
 			_count = min3(count, len, PAGE_SIZE);
-			nr_read = access_remote_vm(mm, p, page, _count, 0);
+			nr_read = access_remote_vm_killable(mm, p, page,
+							_count, 0);
 			if (nr_read < 0)
 				rv = nr_read;
 			if (nr_read <= 0)
@@ -328,7 +331,8 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
 				bool final;
 
 				_count = min3(count, len, PAGE_SIZE);
-				nr_read = access_remote_vm(mm, p, page, _count, 0);
+				nr_read = access_remote_vm_killable(mm, p,
+							page, _count, 0);
 				if (nr_read < 0)
 					rv = nr_read;
 				if (nr_read <= 0)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/4 v2] fs: proc: use down_read_killable() in environ_read()
  2018-02-27  0:25 [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc Yang Shi
  2018-02-27  0:25 ` [PATCH 1/4 v2] mm: add access_remote_vm_killable APIs Yang Shi
  2018-02-27  0:25 ` [PATCH 2/4 v2] fs: proc: use down_read_killable in proc_pid_cmdline_read() Yang Shi
@ 2018-02-27  0:25 ` Yang Shi
  2018-02-27  7:15   ` Alexey Dobriyan
  2018-02-27  0:25 ` [PATCH 4/4 v2] mm: use access_remote_vm() in get_cmdline() Yang Shi
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Yang Shi @ 2018-02-27  0:25 UTC (permalink / raw)
  To: akpm, mingo, adobriyan; +Cc: yang.shi, linux-mm, linux-kernel

Like reading /proc/*/cmdline, it is possible to be blocked for long time
when reading /proc/*/environ when manipulating large mapping at the mean
time. The environ reading process will be waiting for mmap_sem become
available for a long time then it may cause the reading task hung.

Convert down_read() and access_remote_vm() to killable version.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: Alexey Dobriyan <adobriyan@gmail.com>
---
 fs/proc/base.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9bdb84b..d87d9ab 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -933,7 +933,9 @@ static ssize_t environ_read(struct file *file, char __user *buf,
 	if (!mmget_not_zero(mm))
 		goto free;
 
-	down_read(&mm->mmap_sem);
+	ret = down_read_killable(&mm->mmap_sem);
+	if (ret)
+		goto out_mmput;
 	env_start = mm->env_start;
 	env_end = mm->env_end;
 	up_read(&mm->mmap_sem);
@@ -950,7 +952,8 @@ static ssize_t environ_read(struct file *file, char __user *buf,
 		max_len = min_t(size_t, PAGE_SIZE, count);
 		this_len = min(max_len, this_len);
 
-		retval = access_remote_vm(mm, (env_start + src), page, this_len, 0);
+		retval = access_remote_vm_killable(mm, (env_start + src),
+						page, this_len, 0);
 
 		if (retval <= 0) {
 			ret = retval;
@@ -968,6 +971,8 @@ static ssize_t environ_read(struct file *file, char __user *buf,
 		count -= retval;
 	}
 	*ppos = src;
+
+out_mmput:
 	mmput(mm);
 
 free:
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/4 v2] mm: use access_remote_vm() in get_cmdline()
  2018-02-27  0:25 [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc Yang Shi
                   ` (2 preceding siblings ...)
  2018-02-27  0:25 ` [PATCH 3/4 v2] fs: proc: use down_read_killable() in environ_read() Yang Shi
@ 2018-02-27  0:25 ` Yang Shi
  2018-02-27  1:02 ` [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc David Rientjes
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2018-02-27  0:25 UTC (permalink / raw)
  To: akpm, mingo, adobriyan; +Cc: yang.shi, linux-mm, linux-kernel

get_cmdline() is using access_process_vm() which increases mm reference
count, but the mm reference count has been increased before calling
access_process_vm() and it is kept across get_cmdline(). It sounds
unnecessary to get mm reference count increased twice, so replace
access_process_vm() to access_remote_vm() which requires caller increase
mm reference count.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/util.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/util.c b/mm/util.c
index c125050..9b40637 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -732,7 +732,7 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen)
 	if (len > buflen)
 		len = buflen;
 
-	res = access_process_vm(task, arg_start, buffer, len, FOLL_FORCE);
+	res = access_remote_vm(mm, arg_start, buffer, len, FOLL_FORCE);
 
 	/*
 	 * If the nul at the end of args has been overwritten, then
@@ -746,7 +746,7 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen)
 			len = env_end - env_start;
 			if (len > buflen - res)
 				len = buflen - res;
-			res += access_process_vm(task, env_start,
+			res += access_remote_vm(mm, env_start,
 						 buffer+res, len,
 						 FOLL_FORCE);
 			res = strnlen(buffer, res);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc
  2018-02-27  0:25 [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc Yang Shi
                   ` (3 preceding siblings ...)
  2018-02-27  0:25 ` [PATCH 4/4 v2] mm: use access_remote_vm() in get_cmdline() Yang Shi
@ 2018-02-27  1:02 ` David Rientjes
  2018-02-27  1:25   ` Yang Shi
  2018-03-06 18:45 ` Yang Shi
  2018-03-06 20:45 ` Andrew Morton
  6 siblings, 1 reply; 16+ messages in thread
From: David Rientjes @ 2018-02-27  1:02 UTC (permalink / raw)
  To: Yang Shi; +Cc: akpm, mingo, adobriyan, linux-mm, linux-kernel

On Tue, 27 Feb 2018, Yang Shi wrote:

> 
> Background:
> When running vm-scalability with large memory (> 300GB), the below hung
> task issue happens occasionally.
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>        Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  ps              D    0 14018      1 0x00000004
>   ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>   ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>   00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>  Call Trace:
>   [<ffffffff817154d0>] ? __schedule+0x250/0x730
>   [<ffffffff817159e6>] schedule+0x36/0x80
>   [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>   [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>   [<ffffffff81717db0>] down_read+0x20/0x40
>   [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>   [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>   [<ffffffff81241d87>] __vfs_read+0x37/0x150
>   [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>   [<ffffffff81242266>] vfs_read+0x96/0x130
>   [<ffffffff812437b5>] SyS_read+0x55/0xc0
>   [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> When manipulating a large mapping, the process may hold the mmap_sem for
> long time, so reading /proc/<pid>/cmdline may be blocked in
> uninterruptible state for long time.
> We already have killable version APIs for semaphore, here use down_read_killable()
> to improve the responsiveness.
> 

Rather than killable, we have patches that introduce down_read_unfair() 
variants for the files you've modified (cmdline and environ) as well as 
others (maps, numa_maps, smaps).

When another thread is holding down_read() and there are queued 
down_write()'s, down_read_unfair() allows for grabbing the rwsem without 
queueing for it.  Additionally, when another thread is holding 
down_write(), down_read_unfair() allows for queueing in front of other 
threads trying to grab it for write as well.

Ingo would know more about whether a variant like that in upstream Linux 
would be acceptable.

Would you be interested in unfair variants instead of only addressing 
killable?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc
  2018-02-27  1:02 ` [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc David Rientjes
@ 2018-02-27  1:25   ` Yang Shi
  2018-02-27  1:47     ` David Rientjes
  0 siblings, 1 reply; 16+ messages in thread
From: Yang Shi @ 2018-02-27  1:25 UTC (permalink / raw)
  To: David Rientjes; +Cc: akpm, mingo, adobriyan, linux-mm, linux-kernel



On 2/26/18 5:02 PM, David Rientjes wrote:
> On Tue, 27 Feb 2018, Yang Shi wrote:
>
>> Background:
>> When running vm-scalability with large memory (> 300GB), the below hung
>> task issue happens occasionally.
>>
>> INFO: task ps:14018 blocked for more than 120 seconds.
>>         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>   ps              D    0 14018      1 0x00000004
>>    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>   Call Trace:
>>    [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>    [<ffffffff817159e6>] schedule+0x36/0x80
>>    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>    [<ffffffff81717db0>] down_read+0x20/0x40
>>    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>>    [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>>    [<ffffffff81241d87>] __vfs_read+0x37/0x150
>>    [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>>    [<ffffffff81242266>] vfs_read+0x96/0x130
>>    [<ffffffff812437b5>] SyS_read+0x55/0xc0
>>    [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>
>> When manipulating a large mapping, the process may hold the mmap_sem for
>> long time, so reading /proc/<pid>/cmdline may be blocked in
>> uninterruptible state for long time.
>> We already have killable version APIs for semaphore, here use down_read_killable()
>> to improve the responsiveness.
>>
> Rather than killable, we have patches that introduce down_read_unfair()
> variants for the files you've modified (cmdline and environ) as well as
> others (maps, numa_maps, smaps).

You mean you have such functionality used by google internally?

>
> When another thread is holding down_read() and there are queued
> down_write()'s, down_read_unfair() allows for grabbing the rwsem without
> queueing for it.  Additionally, when another thread is holding
> down_write(), down_read_unfair() allows for queueing in front of other
> threads trying to grab it for write as well.

It sounds the __unfair variant make the caller have chance to jump the 
gun to grab the semaphore before other waiters, right? But when a 
process holds the semaphore, i.e. mmap_sem, for a long time, it still 
has to sleep in uninterruptible state, right?

But, it seems __unfair variant may not be very helpful in this usecase. 
Reading /proc might be not that important to require any special care to 
grab the semaphore before other waiters. I just hope it doesn't sleep in 
uninterruptible state for a long time. If the user is not patient enough 
due to some reason, they can have a chance to abort.

>
> Ingo would know more about whether a variant like that in upstream Linux
> would be acceptable.
>
> Would you be interested in unfair variants instead of only addressing
> killable?

Yes, I'm although it still looks overkilling to me for reading /proc.

Thanks,
Yang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc
  2018-02-27  1:25   ` Yang Shi
@ 2018-02-27  1:47     ` David Rientjes
  2018-03-01  0:17       ` Yang Shi
  0 siblings, 1 reply; 16+ messages in thread
From: David Rientjes @ 2018-02-27  1:47 UTC (permalink / raw)
  To: Yang Shi; +Cc: akpm, mingo, adobriyan, linux-mm, linux-kernel

On Mon, 26 Feb 2018, Yang Shi wrote:

> > Rather than killable, we have patches that introduce down_read_unfair()
> > variants for the files you've modified (cmdline and environ) as well as
> > others (maps, numa_maps, smaps).
> 
> You mean you have such functionality used by google internally?
> 

Yup, see https://lwn.net/Articles/387720.

> > When another thread is holding down_read() and there are queued
> > down_write()'s, down_read_unfair() allows for grabbing the rwsem without
> > queueing for it.  Additionally, when another thread is holding
> > down_write(), down_read_unfair() allows for queueing in front of other
> > threads trying to grab it for write as well.
> 
> It sounds the __unfair variant make the caller have chance to jump the gun to
> grab the semaphore before other waiters, right? But when a process holds the
> semaphore, i.e. mmap_sem, for a long time, it still has to sleep in
> uninterruptible state, right?
> 

Right, it's solving two separate things which I think may be able to be 
merged together.  Killable is solving an issue where the rwsem is blocking 
for a long period of time in uninterruptible sleep, and unfair is solving 
an issue where reading the procfs files gets stalled for a long period of 
time.  We haven't run into an issue (yet) where killable would have solved 
it; we just have the unfair variants to grab the rwsem asap and then, if 
killable, gracefully return.

> > Ingo would know more about whether a variant like that in upstream Linux
> > would be acceptable.
> > 
> > Would you be interested in unfair variants instead of only addressing
> > killable?
> 
> Yes, I'm although it still looks overkilling to me for reading /proc.
> 

We make certain inferences on the readablility of procfs files for other 
threads to determine how much its mm's mmap_sem is contended.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/4 v2] fs: proc: use down_read_killable() in environ_read()
  2018-02-27  0:25 ` [PATCH 3/4 v2] fs: proc: use down_read_killable() in environ_read() Yang Shi
@ 2018-02-27  7:15   ` Alexey Dobriyan
  2018-02-27 16:59     ` Yang Shi
  0 siblings, 1 reply; 16+ messages in thread
From: Alexey Dobriyan @ 2018-02-27  7:15 UTC (permalink / raw)
  To: Yang Shi; +Cc: akpm, mingo, linux-mm, linux-kernel

On Tue, Feb 27, 2018 at 08:25:50AM +0800, Yang Shi wrote:
> Like reading /proc/*/cmdline, it is possible to be blocked for long time
> when reading /proc/*/environ when manipulating large mapping at the mean
> time. The environ reading process will be waiting for mmap_sem become
> available for a long time then it may cause the reading task hung.
> 
> Convert down_read() and access_remote_vm() to killable version.
> 
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> Suggested-by: Alexey Dobriyan <adobriyan@gmail.com>

Ehh, bloody tags.
I didn't suggest _killable() variants, they're quite ugly because API
multiplies. access_remote_vm() could be converted to down_read_killable().

> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -933,7 +933,9 @@ static ssize_t environ_read(struct file *file, char __user *buf,
>  	if (!mmget_not_zero(mm))
>  		goto free;
>  
> -	down_read(&mm->mmap_sem);
> +	ret = down_read_killable(&mm->mmap_sem);
> +	if (ret)
> +		goto out_mmput;
>  	env_start = mm->env_start;
>  	env_end = mm->env_end;
>  	up_read(&mm->mmap_sem);
> @@ -950,7 +952,8 @@ static ssize_t environ_read(struct file *file, char __user *buf,
>  		max_len = min_t(size_t, PAGE_SIZE, count);
>  		this_len = min(max_len, this_len);
>  
> -		retval = access_remote_vm(mm, (env_start + src), page, this_len, 0);
> +		retval = access_remote_vm_killable(mm, (env_start + src),
> +						page, this_len, 0);

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/4 v2] fs: proc: use down_read_killable() in environ_read()
  2018-02-27  7:15   ` Alexey Dobriyan
@ 2018-02-27 16:59     ` Yang Shi
  0 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2018-02-27 16:59 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: akpm, mingo, linux-mm, linux-kernel



On 2/26/18 11:15 PM, Alexey Dobriyan wrote:
> On Tue, Feb 27, 2018 at 08:25:50AM +0800, Yang Shi wrote:
>> Like reading /proc/*/cmdline, it is possible to be blocked for long time
>> when reading /proc/*/environ when manipulating large mapping at the mean
>> time. The environ reading process will be waiting for mmap_sem become
>> available for a long time then it may cause the reading task hung.
>>
>> Convert down_read() and access_remote_vm() to killable version.
>>
>> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
>> Suggested-by: Alexey Dobriyan <adobriyan@gmail.com>
> Ehh, bloody tags.

I mean fix reading /proc/*/environ part.

> I didn't suggest _killable() variants, they're quite ugly because API
> multiplies. access_remote_vm() could be converted to down_read_killable().

There might be other places that need non-killable access_remote_vm(), 
like patch 4/4.

And, it sounds keeping access_remote_vm() semantic intact may prevent 
from confusion. The most people may get used to assume 
access_remote_vm() behaves like access_process_vm() except not inc mm 
reference count.

Thanks,
Yang

>
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -933,7 +933,9 @@ static ssize_t environ_read(struct file *file, char __user *buf,
>>   	if (!mmget_not_zero(mm))
>>   		goto free;
>>   
>> -	down_read(&mm->mmap_sem);
>> +	ret = down_read_killable(&mm->mmap_sem);
>> +	if (ret)
>> +		goto out_mmput;
>>   	env_start = mm->env_start;
>>   	env_end = mm->env_end;
>>   	up_read(&mm->mmap_sem);
>> @@ -950,7 +952,8 @@ static ssize_t environ_read(struct file *file, char __user *buf,
>>   		max_len = min_t(size_t, PAGE_SIZE, count);
>>   		this_len = min(max_len, this_len);
>>   
>> -		retval = access_remote_vm(mm, (env_start + src), page, this_len, 0);
>> +		retval = access_remote_vm_killable(mm, (env_start + src),
>> +						page, this_len, 0);

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc
  2018-02-27  1:47     ` David Rientjes
@ 2018-03-01  0:17       ` Yang Shi
  0 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2018-03-01  0:17 UTC (permalink / raw)
  To: David Rientjes; +Cc: akpm, mingo, adobriyan, linux-mm, linux-kernel



On 2/26/18 5:47 PM, David Rientjes wrote:
> On Mon, 26 Feb 2018, Yang Shi wrote:
>
>>> Rather than killable, we have patches that introduce down_read_unfair()
>>> variants for the files you've modified (cmdline and environ) as well as
>>> others (maps, numa_maps, smaps).
>> You mean you have such functionality used by google internally?
>>
> Yup, see https://lwn.net/Articles/387720.
>
>>> When another thread is holding down_read() and there are queued
>>> down_write()'s, down_read_unfair() allows for grabbing the rwsem without
>>> queueing for it.  Additionally, when another thread is holding
>>> down_write(), down_read_unfair() allows for queueing in front of other
>>> threads trying to grab it for write as well.
>> It sounds the __unfair variant make the caller have chance to jump the gun to
>> grab the semaphore before other waiters, right? But when a process holds the
>> semaphore, i.e. mmap_sem, for a long time, it still has to sleep in
>> uninterruptible state, right?
>>
> Right, it's solving two separate things which I think may be able to be
> merged together.  Killable is solving an issue where the rwsem is blocking
> for a long period of time in uninterruptible sleep, and unfair is solving
> an issue where reading the procfs files gets stalled for a long period of
> time.  We haven't run into an issue (yet) where killable would have solved
> it; we just have the unfair variants to grab the rwsem asap and then, if
> killable, gracefully return.
>
>>> Ingo would know more about whether a variant like that in upstream Linux
>>> would be acceptable.
>>>
>>> Would you be interested in unfair variants instead of only addressing
>>> killable?
>> Yes, I'm although it still looks overkilling to me for reading /proc.
>>
> We make certain inferences on the readablility of procfs files for other
> threads to determine how much its mm's mmap_sem is contended.

I see your points here for reading /proc for system monitor. However, 
I'm concerned that the _unfair APIs get the processes which read /proc 
priority elevation (not real priority change, just look like). It might 
be abused by some applications, for example:

A high priority process and a low priority process are waiting for the 
same rwsem, if the low priority process is trying to read /proc 
maliciously on purpose, it can get elevated to grap the rwsem before any 
other processes which are waiting for the same rwsem.

Yang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc
  2018-02-27  0:25 [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc Yang Shi
                   ` (4 preceding siblings ...)
  2018-02-27  1:02 ` [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc David Rientjes
@ 2018-03-06 18:45 ` Yang Shi
  2018-03-06 20:45 ` Andrew Morton
  6 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2018-03-06 18:45 UTC (permalink / raw)
  To: akpm, mingo, adobriyan; +Cc: linux-mm, linux-kernel

Hi Andrew,


Any comment on this series is appreciated.


I'm supposed everyone agrees to improve the readability of /proc, right? 
The argument is whether access_remote_vm_killable() is preferred or not.


I though _killable() version may prevent from confusion since 
access_remote_vm() is always non-killable. And, patch 4/4 uses the 
non-killable version.


Thanks,

Yang


On 2/26/18 4:25 PM, Yang Shi wrote:
> Background:
> When running vm-scalability with large memory (> 300GB), the below hung
> task issue happens occasionally.
>
> INFO: task ps:14018 blocked for more than 120 seconds.
>         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   ps              D    0 14018      1 0x00000004
>    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>   Call Trace:
>    [<ffffffff817154d0>] ? __schedule+0x250/0x730
>    [<ffffffff817159e6>] schedule+0x36/0x80
>    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>    [<ffffffff81717db0>] down_read+0x20/0x40
>    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>    [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>    [<ffffffff81241d87>] __vfs_read+0x37/0x150
>    [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>    [<ffffffff81242266>] vfs_read+0x96/0x130
>    [<ffffffff812437b5>] SyS_read+0x55/0xc0
>    [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>
> When manipulating a large mapping, the process may hold the mmap_sem for
> long time, so reading /proc/<pid>/cmdline may be blocked in
> uninterruptible state for long time.
> We already have killable version APIs for semaphore, here use down_read_killable()
> to improve the responsiveness.
>
>
> When reviewing the v1 patch (https://patchwork.kernel.org/patch/10230809/),
> Alexey pointed out access_remote_vm() need to be killable too. And, /proc/*/environ
> reading may suffer from the same issue, so it should be converted to killable
> version for both down_read and access_remote_vm too.
>
> With reading the code, both access_remote_vm() and access_process_vm() calls
> __access_remote_vm() which acquires mmap_sem by down_read(). access_remote_vm()
> is only used by fs/proc/base.c, but access_process_vm() is used by other
> subsystems too, i.e. ptrace, audit, etc. So, it sounds not that safe to convert
> both access_remote_vm() and access_process_vm() to killable.
> Instead of doing so, extract command part of __access_remote_vm() (gup part) to
> a new static function, called raw_access_remote_vm(), then define
> __access_remote_vm() and __access_remote_vm_killable(), which acquire mmap_sem
> by down_read() and _killable() respectively.
>
> Then define access_remote_vm() and access_remote_vm_killable() to call them
> respectively. Keep access_process_vm() calls __access_remote_vm().
>
> So far fs/proc/base.c is the only user of access_remote_vm_killable(), but
> there might be other users in the future.
>
> There are 4 patches in this revision:
> #1 define access_remote_vm_killable() APIs
> #2 convert /proc/*/cmdline reading to down_read_killable() and access_remote_vm_killable()
> #3 convert /proc/*/environ reading to down_read_killable() and access_remote_vm_killable()
> #4 replace access_process_vm() to access_remote_vm() in get_cmdline to save one
>     mm reference count inc (please see the commit log for the details). This
>     change makes get_cmdline() is the only user of access_remote_vm()
>
>
> Yang Shi (4):
>        mm: add access_remote_vm_killable APIs
>        fs: proc: use down_read_killable in proc_pid_cmdline_read()
>        fs: proc: use down_read_killable() in environ_read()
>        mm: use access_remote_vm() in get_cmdline()
>
>   fs/proc/base.c     | 21 +++++++++++++++------
>   include/linux/mm.h |  5 +++++
>   mm/memory.c        | 44 +++++++++++++++++++++++++++++++++++++-------
>   mm/nommu.c         | 36 ++++++++++++++++++++++++++++++++----
>   mm/util.c          |  4 ++--
>   5 files changed, 91 insertions(+), 19 deletions(-)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc
  2018-02-27  0:25 [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc Yang Shi
                   ` (5 preceding siblings ...)
  2018-03-06 18:45 ` Yang Shi
@ 2018-03-06 20:45 ` Andrew Morton
  2018-03-06 21:17   ` Yang Shi
  6 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2018-03-06 20:45 UTC (permalink / raw)
  To: Yang Shi; +Cc: mingo, adobriyan, linux-mm, linux-kernel, David Rientjes

On Tue, 27 Feb 2018 08:25:47 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:

> 
> Background:
> When running vm-scalability with large memory (> 300GB), the below hung
> task issue happens occasionally.
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>        Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  ps              D    0 14018      1 0x00000004
>   ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>   ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>   00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>  Call Trace:
>   [<ffffffff817154d0>] ? __schedule+0x250/0x730
>   [<ffffffff817159e6>] schedule+0x36/0x80
>   [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>   [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>   [<ffffffff81717db0>] down_read+0x20/0x40
>   [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>   [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>   [<ffffffff81241d87>] __vfs_read+0x37/0x150
>   [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>   [<ffffffff81242266>] vfs_read+0x96/0x130
>   [<ffffffff812437b5>] SyS_read+0x55/0xc0
>   [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> When manipulating a large mapping, the process may hold the mmap_sem for
> long time, so reading /proc/<pid>/cmdline may be blocked in
> uninterruptible state for long time.
> We already have killable version APIs for semaphore, here use down_read_killable()
> to improve the responsiveness.
> 

Maybe I'm missing something, but I don't see how this solves the
problem.  Yes, the read of /proc/pid/cmdline will be abandoned if
someone interrupts that process.  But if nobody does that, the read
will still just sit there for 2 minutes and the watchdog warning will
still come out?

Where the heck are we holding mmap_sem for so long?  Can that be fixed?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc
  2018-03-06 20:45 ` Andrew Morton
@ 2018-03-06 21:17   ` Yang Shi
  2018-03-06 21:41     ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: Yang Shi @ 2018-03-06 21:17 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mingo, adobriyan, linux-mm, linux-kernel, David Rientjes



On 3/6/18 12:45 PM, Andrew Morton wrote:
> On Tue, 27 Feb 2018 08:25:47 +0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
>> Background:
>> When running vm-scalability with large memory (> 300GB), the below hung
>> task issue happens occasionally.
>>
>> INFO: task ps:14018 blocked for more than 120 seconds.
>>         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>   ps              D    0 14018      1 0x00000004
>>    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>   Call Trace:
>>    [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>    [<ffffffff817159e6>] schedule+0x36/0x80
>>    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>    [<ffffffff81717db0>] down_read+0x20/0x40
>>    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>>    [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>>    [<ffffffff81241d87>] __vfs_read+0x37/0x150
>>    [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>>    [<ffffffff81242266>] vfs_read+0x96/0x130
>>    [<ffffffff812437b5>] SyS_read+0x55/0xc0
>>    [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>
>> When manipulating a large mapping, the process may hold the mmap_sem for
>> long time, so reading /proc/<pid>/cmdline may be blocked in
>> uninterruptible state for long time.
>> We already have killable version APIs for semaphore, here use down_read_killable()
>> to improve the responsiveness.
>>
> Maybe I'm missing something, but I don't see how this solves the
> problem.  Yes, the read of /proc/pid/cmdline will be abandoned if
> someone interrupts that process.  But if nobody does that, the read
> will still just sit there for 2 minutes and the watchdog warning will
> still come out?

No, the warning will not come out since down_read_killable() puts the 
task into TASK_KILLABLE state instead of TASK_UNINTERRUPTIBLE state. The 
hung task check will skip TASK_KILLABLE tasks, please see the below code 
in (kernel/hung_task.c):

                 /* use "==" to skip the TASK_KILLABLE tasks waiting on 
NFS */
                 if (t->state == TASK_UNINTERRUPTIBLE)
                         check_hung_task(t, timeout);

It just mitigates the hung task warning, can't resolve the mmap_sem 
scalability issue. Furthermore, waiting on pure uninterruptible state 
for reading /proc sounds unnecessary. It doesn't wait for I/O completion.

>
> Where the heck are we holding mmap_sem for so long?  Can that be fixed?

The mmap_sem is held for unmapping a large map which has every single 
page mapped. This is not a issue in real production code. Just found it 
by running vm-scalability on a machine with ~600GB memory.

AFAIK, I don't see any easy fix for the mmap_sem scalability issue. I 
saw range locking patches (https://lwn.net/Articles/723648/) were 
floating around. But, it may not help too much on the case that a large 
map with every single page mapped.

Thanks,
Yang

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc
  2018-03-06 21:17   ` Yang Shi
@ 2018-03-06 21:41     ` Andrew Morton
  2018-03-07  0:47       ` Yang Shi
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2018-03-06 21:41 UTC (permalink / raw)
  To: Yang Shi; +Cc: mingo, adobriyan, linux-mm, linux-kernel, David Rientjes

On Tue, 6 Mar 2018 13:17:37 -0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:

> 
> 
> It just mitigates the hung task warning, can't resolve the mmap_sem 
> scalability issue. Furthermore, waiting on pure uninterruptible state 
> for reading /proc sounds unnecessary. It doesn't wait for I/O completion.

OK.

> >
> > Where the heck are we holding mmap_sem for so long?  Can that be fixed?
> 
> The mmap_sem is held for unmapping a large map which has every single 
> page mapped. This is not a issue in real production code. Just found it 
> by running vm-scalability on a machine with ~600GB memory.
> 
> AFAIK, I don't see any easy fix for the mmap_sem scalability issue. I 
> saw range locking patches (https://lwn.net/Articles/723648/) were 
> floating around. But, it may not help too much on the case that a large 
> map with every single page mapped.

Well it sounds fairly simple to mitigate?  Simplistically: don't unmap
600G in a single hit; do it 1G at a time, dropping mmap_sem each time. 
A smarter version might only come up for air if there are mmap_sem
waiters and if it has already done some work.  I don't think we have
any particular atomicity requirements when unmapping?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc
  2018-03-06 21:41     ` Andrew Morton
@ 2018-03-07  0:47       ` Yang Shi
  0 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2018-03-07  0:47 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mingo, adobriyan, linux-mm, linux-kernel, David Rientjes



On 3/6/18 1:41 PM, Andrew Morton wrote:
> On Tue, 6 Mar 2018 13:17:37 -0800 Yang Shi <yang.shi@linux.alibaba.com> wrote:
>
>>
>> It just mitigates the hung task warning, can't resolve the mmap_sem
>> scalability issue. Furthermore, waiting on pure uninterruptible state
>> for reading /proc sounds unnecessary. It doesn't wait for I/O completion.
> OK.

Since we already had down_read_killable() APIs available, IMHO, giving 
application a chance to abort at some circumstances sounds not bad.

>
>>> Where the heck are we holding mmap_sem for so long?  Can that be fixed?
>> The mmap_sem is held for unmapping a large map which has every single
>> page mapped. This is not a issue in real production code. Just found it
>> by running vm-scalability on a machine with ~600GB memory.
>>
>> AFAIK, I don't see any easy fix for the mmap_sem scalability issue. I
>> saw range locking patches (https://lwn.net/Articles/723648/) were
>> floating around. But, it may not help too much on the case that a large
>> map with every single page mapped.
> Well it sounds fairly simple to mitigate?  Simplistically: don't unmap
> 600G in a single hit; do it 1G at a time, dropping mmap_sem each time.
> A smarter version might only come up for air if there are mmap_sem
> waiters and if it has already done some work.  I don't think we have
> any particular atomicity requirements when unmapping?

I'm not quite sure. But, the existing applications may assume munmap is 
atomic?

Thanks,
Yang

>

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2018-03-07  0:47 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-27  0:25 [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc Yang Shi
2018-02-27  0:25 ` [PATCH 1/4 v2] mm: add access_remote_vm_killable APIs Yang Shi
2018-02-27  0:25 ` [PATCH 2/4 v2] fs: proc: use down_read_killable in proc_pid_cmdline_read() Yang Shi
2018-02-27  0:25 ` [PATCH 3/4 v2] fs: proc: use down_read_killable() in environ_read() Yang Shi
2018-02-27  7:15   ` Alexey Dobriyan
2018-02-27 16:59     ` Yang Shi
2018-02-27  0:25 ` [PATCH 4/4 v2] mm: use access_remote_vm() in get_cmdline() Yang Shi
2018-02-27  1:02 ` [RFC PATCH 0/4 v2] Define killable version for access_remote_vm() and use it in fs/proc David Rientjes
2018-02-27  1:25   ` Yang Shi
2018-02-27  1:47     ` David Rientjes
2018-03-01  0:17       ` Yang Shi
2018-03-06 18:45 ` Yang Shi
2018-03-06 20:45 ` Andrew Morton
2018-03-06 21:17   ` Yang Shi
2018-03-06 21:41     ` Andrew Morton
2018-03-07  0:47       ` Yang Shi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).