All of lore.kernel.org
 help / color / mirror / Atom feed
* [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
@ 2018-04-09 21:52 Yang Shi
  2018-04-10  8:48 ` Cyrill Gorcunov
  2018-04-10  9:09 ` Michal Hocko
  0 siblings, 2 replies; 22+ messages in thread
From: Yang Shi @ 2018-04-09 21:52 UTC (permalink / raw)
  To: adobriyan, mhocko, willy, mguzik, gorcunov, akpm
  Cc: yang.shi, linux-mm, linux-kernel

mmap_sem is on the hot path of kernel, and it very contended, but it is
abused too. It is used to protect arg_start|end and evn_start|end when
reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
sense since those proc files just expect to read 4 values atomically and
not related to VM, they could be set to arbitrary values by C/R.

And, the mmap_sem contention may cause unexpected issue like below:

INFO: task ps:14018 blocked for more than 120 seconds.
       Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
 ps              D    0 14018      1 0x00000004
  ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
  ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
  00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
 Call Trace:
  [<ffffffff817154d0>] ? __schedule+0x250/0x730
  [<ffffffff817159e6>] schedule+0x36/0x80
  [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
  [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
  [<ffffffff81717db0>] down_read+0x20/0x40
  [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
  [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
  [<ffffffff81241d87>] __vfs_read+0x37/0x150
  [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
  [<ffffffff81242266>] vfs_read+0x96/0x130
  [<ffffffff812437b5>] SyS_read+0x55/0xc0
  [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5

Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
for them to mitigate the abuse of mmap_sem.

So, introduce a new spinlock in mm_struct to protect the concurrent
access to arg_start|end, env_start|end and others except start_brk and
brk, which are still protected by mmap_sem to avoid concurrent access
from do_brk().

This patch just eliminates the abuse of mmap_sem, but it can't resolve the
above hung task warning completely since the later access_remote_vm() call
needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the
future.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
---
v2 --> v3:
* Restored down_write in prctl syscall
* Elaborate the limitation of this patch suggested by Michal
* Protect those fields by the new lock except brk and start_brk per Michal's
  suggestion
* Based off Cyrill's non PR_SET_MM_MAP oprations deprecation patch
  (https://lkml.org/lkml/2018/4/5/541)

v1 --> v2:
* Use spinlock instead of rwlock per Mattew's suggestion
* Replace down_write to down_read in prctl_set_mm (see commit log for details)

 fs/proc/base.c           |  8 ++++----
 include/linux/mm_types.h |  2 ++
 kernel/fork.c            |  1 +
 kernel/sys.c             | 13 ++++++++-----
 mm/init-mm.c             |  1 +
 5 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index d532468..aa7ce07 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -239,12 +239,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
 		goto out_mmput;
 	}
 
-	down_read(&mm->mmap_sem);
+	spin_lock(&mm->arg_lock);
 	arg_start = mm->arg_start;
 	arg_end = mm->arg_end;
 	env_start = mm->env_start;
 	env_end = mm->env_end;
-	up_read(&mm->mmap_sem);
+	spin_unlock(&mm->arg_lock);
 
 	BUG_ON(arg_start > arg_end);
 	BUG_ON(env_start > env_end);
@@ -926,10 +926,10 @@ static ssize_t environ_read(struct file *file, char __user *buf,
 	if (!mmget_not_zero(mm))
 		goto free;
 
-	down_read(&mm->mmap_sem);
+	spin_lock(&mm->arg_lock);
 	env_start = mm->env_start;
 	env_end = mm->env_end;
-	up_read(&mm->mmap_sem);
+	spin_unlock(&mm->arg_lock);
 
 	while (count > 0) {
 		size_t this_len, max_len;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2161234..26ff7bf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -413,6 +413,8 @@ struct mm_struct {
 	unsigned long exec_vm;		/* VM_EXEC & ~VM_WRITE & ~VM_STACK */
 	unsigned long stack_vm;		/* VM_STACK */
 	unsigned long def_flags;
+
+	spinlock_t arg_lock; /* protect the below fields, except brk */
 	unsigned long start_code, end_code, start_data, end_data;
 	unsigned long start_brk, brk, start_stack;
 	unsigned long arg_start, arg_end, env_start, env_end;
diff --git a/kernel/fork.c b/kernel/fork.c
index 242c8c9..295f903 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -900,6 +900,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm->pinned_vm = 0;
 	memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
 	spin_lock_init(&mm->page_table_lock);
+	spin_lock_init(&mm->arg_lock);
 	mm_init_cpumask(mm);
 	mm_init_aio(mm);
 	mm_init_owner(mm, p);
diff --git a/kernel/sys.c b/kernel/sys.c
index f16725e..cbac235 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2011,8 +2011,6 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
 			return error;
 	}
 
-	down_write(&mm->mmap_sem);
-
 	/*
 	 * We don't validate if these members are pointing to
 	 * real present VMAs because application may have correspond
@@ -2025,17 +2023,23 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
 	 *    to any problem in kernel itself
 	 */
 
+	/* Hold mmap_sem to avoid concurrent access by do_brk */
+	down_write(&mm->mmap_sem);
+	mm->start_brk	= prctl_map.start_brk;
+	mm->brk		= prctl_map.brk;
+	up_write(&mm->mmap_sem);
+
+	spin_lock(&mm->arg_lock);
 	mm->start_code	= prctl_map.start_code;
 	mm->end_code	= prctl_map.end_code;
 	mm->start_data	= prctl_map.start_data;
 	mm->end_data	= prctl_map.end_data;
-	mm->start_brk	= prctl_map.start_brk;
-	mm->brk		= prctl_map.brk;
 	mm->start_stack	= prctl_map.start_stack;
 	mm->arg_start	= prctl_map.arg_start;
 	mm->arg_end	= prctl_map.arg_end;
 	mm->env_start	= prctl_map.env_start;
 	mm->env_end	= prctl_map.env_end;
+	spin_unlock(&mm->arg_lock);
 
 	/*
 	 * Note this update of @saved_auxv is lockless thus
@@ -2048,7 +2052,6 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
 	if (prctl_map.auxv_size)
 		memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv));
 
-	up_write(&mm->mmap_sem);
 	return 0;
 }
 #endif /* CONFIG_CHECKPOINT_RESTORE */
diff --git a/mm/init-mm.c b/mm/init-mm.c
index f94d5d1..f0179c9 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -22,6 +22,7 @@ struct mm_struct init_mm = {
 	.mm_count	= ATOMIC_INIT(1),
 	.mmap_sem	= __RWSEM_INITIALIZER(init_mm.mmap_sem),
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
+	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 	.user_ns	= &init_user_ns,
 	INIT_MM_CONTEXT(init_mm)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
  2018-04-09 21:52 [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct Yang Shi
@ 2018-04-10  8:48 ` Cyrill Gorcunov
  2018-04-10  9:09 ` Michal Hocko
  1 sibling, 0 replies; 22+ messages in thread
From: Cyrill Gorcunov @ 2018-04-10  8:48 UTC (permalink / raw)
  To: Yang Shi; +Cc: adobriyan, mhocko, willy, mguzik, akpm, linux-mm, linux-kernel

On Tue, Apr 10, 2018 at 05:52:54AM +0800, Yang Shi wrote:
> mmap_sem is on the hot path of kernel, and it very contended, but it is
> abused too. It is used to protect arg_start|end and evn_start|end when
> reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
> sense since those proc files just expect to read 4 values atomically and
> not related to VM, they could be set to arbitrary values by C/R.
> 
> And, the mmap_sem contention may cause unexpected issue like below:
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>        Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps              D    0 14018      1 0x00000004
>   ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>   ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>   00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>  Call Trace:
>   [<ffffffff817154d0>] ? __schedule+0x250/0x730
>   [<ffffffff817159e6>] schedule+0x36/0x80
>   [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>   [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>   [<ffffffff81717db0>] down_read+0x20/0x40
>   [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>   [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>   [<ffffffff81241d87>] __vfs_read+0x37/0x150
>   [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>   [<ffffffff81242266>] vfs_read+0x96/0x130
>   [<ffffffff812437b5>] SyS_read+0x55/0xc0
>   [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
> for them to mitigate the abuse of mmap_sem.
> 
> So, introduce a new spinlock in mm_struct to protect the concurrent
> access to arg_start|end, env_start|end and others except start_brk and
> brk, which are still protected by mmap_sem to avoid concurrent access
> from do_brk().
> 
> This patch just eliminates the abuse of mmap_sem, but it can't resolve the
> above hung task warning completely since the later access_remote_vm() call
> needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the
> future.
> 
> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Mateusz Guzik <mguzik@redhat.com>
> Cc: Cyrill Gorcunov <gorcunov@gmail.com>
> ---
> v2 --> v3:
> * Restored down_write in prctl syscall
> * Elaborate the limitation of this patch suggested by Michal
> * Protect those fields by the new lock except brk and start_brk per Michal's
>   suggestion
> * Based off Cyrill's non PR_SET_MM_MAP oprations deprecation patch
>   (https://lkml.org/lkml/2018/4/5/541)

For prctl part

Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>

thanks!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
  2018-04-09 21:52 [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct Yang Shi
  2018-04-10  8:48 ` Cyrill Gorcunov
@ 2018-04-10  9:09 ` Michal Hocko
  2018-04-10  9:40   ` Cyrill Gorcunov
  1 sibling, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2018-04-10  9:09 UTC (permalink / raw)
  To: Yang Shi; +Cc: adobriyan, willy, mguzik, gorcunov, akpm, linux-mm, linux-kernel

On Tue 10-04-18 05:52:54, Yang Shi wrote:
[...]
> So, introduce a new spinlock in mm_struct to protect the concurrent
> access to arg_start|end, env_start|end and others except start_brk and
> brk, which are still protected by mmap_sem to avoid concurrent access
> from do_brk().

Is there any fundamental problem with brk using the same lock?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
  2018-04-10  9:09 ` Michal Hocko
@ 2018-04-10  9:40   ` Cyrill Gorcunov
  2018-04-10 10:42     ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Cyrill Gorcunov @ 2018-04-10  9:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yang Shi, adobriyan, willy, mguzik, akpm, linux-mm, linux-kernel

On Tue, Apr 10, 2018 at 11:09:17AM +0200, Michal Hocko wrote:
> On Tue 10-04-18 05:52:54, Yang Shi wrote:
> [...]
> > So, introduce a new spinlock in mm_struct to protect the concurrent
> > access to arg_start|end, env_start|end and others except start_brk and
> > brk, which are still protected by mmap_sem to avoid concurrent access
> > from do_brk().
> 
> Is there any fundamental problem with brk using the same lock?

Seems so. Look into mm/mmap.c:brk syscall which reads and writes
brk value under mmap_sem ('cause of do_brk called inside).

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
  2018-04-10  9:40   ` Cyrill Gorcunov
@ 2018-04-10 10:42     ` Michal Hocko
  2018-04-10 11:02       ` Cyrill Gorcunov
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2018-04-10 10:42 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Yang Shi, adobriyan, willy, mguzik, akpm, linux-mm, linux-kernel

On Tue 10-04-18 12:40:47, Cyrill Gorcunov wrote:
> On Tue, Apr 10, 2018 at 11:09:17AM +0200, Michal Hocko wrote:
> > On Tue 10-04-18 05:52:54, Yang Shi wrote:
> > [...]
> > > So, introduce a new spinlock in mm_struct to protect the concurrent
> > > access to arg_start|end, env_start|end and others except start_brk and
> > > brk, which are still protected by mmap_sem to avoid concurrent access
> > > from do_brk().
> > 
> > Is there any fundamental problem with brk using the same lock?
> 
> Seems so. Look into mm/mmap.c:brk syscall which reads and writes
> brk value under mmap_sem ('cause of do_brk called inside).

Why cannot we simply use the lock when the value is updated?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
  2018-04-10 10:42     ` Michal Hocko
@ 2018-04-10 11:02       ` Cyrill Gorcunov
  2018-04-10 11:10         ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Cyrill Gorcunov @ 2018-04-10 11:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yang Shi, adobriyan, willy, mguzik, akpm, linux-mm, linux-kernel

On Tue, Apr 10, 2018 at 12:42:15PM +0200, Michal Hocko wrote:
> On Tue 10-04-18 12:40:47, Cyrill Gorcunov wrote:
> > On Tue, Apr 10, 2018 at 11:09:17AM +0200, Michal Hocko wrote:
> > > On Tue 10-04-18 05:52:54, Yang Shi wrote:
> > > [...]
> > > > So, introduce a new spinlock in mm_struct to protect the concurrent
> > > > access to arg_start|end, env_start|end and others except start_brk and
> > > > brk, which are still protected by mmap_sem to avoid concurrent access
> > > > from do_brk().
> > > 
> > > Is there any fundamental problem with brk using the same lock?
> > 
> > Seems so. Look into mm/mmap.c:brk syscall which reads and writes
> > brk value under mmap_sem ('cause of do_brk called inside).
> 
> Why cannot we simply use the lock when the value is updated?

Because do_brk does vma manipulations, for this reason it's
running under down_write_killable(&mm->mmap_sem). Or you
mean something else?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
  2018-04-10 11:02       ` Cyrill Gorcunov
@ 2018-04-10 11:10         ` Michal Hocko
  2018-04-10 12:28           ` Cyrill Gorcunov
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2018-04-10 11:10 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Yang Shi, adobriyan, willy, mguzik, akpm, linux-mm, linux-kernel

On Tue 10-04-18 14:02:42, Cyrill Gorcunov wrote:
> On Tue, Apr 10, 2018 at 12:42:15PM +0200, Michal Hocko wrote:
> > On Tue 10-04-18 12:40:47, Cyrill Gorcunov wrote:
> > > On Tue, Apr 10, 2018 at 11:09:17AM +0200, Michal Hocko wrote:
> > > > On Tue 10-04-18 05:52:54, Yang Shi wrote:
> > > > [...]
> > > > > So, introduce a new spinlock in mm_struct to protect the concurrent
> > > > > access to arg_start|end, env_start|end and others except start_brk and
> > > > > brk, which are still protected by mmap_sem to avoid concurrent access
> > > > > from do_brk().
> > > > 
> > > > Is there any fundamental problem with brk using the same lock?
> > > 
> > > Seems so. Look into mm/mmap.c:brk syscall which reads and writes
> > > brk value under mmap_sem ('cause of do_brk called inside).
> > 
> > Why cannot we simply use the lock when the value is updated?
> 
> Because do_brk does vma manipulations, for this reason it's
> running under down_write_killable(&mm->mmap_sem). Or you
> mean something else?

Yes, all we need the new lock for is to get a consistent view on brk
values. I am simply asking whether there is something fundamentally
wrong by doing the update inside the new lock while keeping the original
mmap_sem locking in the brk path. That would allow us to drop the
mmap_sem lock in the proc path when looking at brk values.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
  2018-04-10 11:10         ` Michal Hocko
@ 2018-04-10 12:28           ` Cyrill Gorcunov
  2018-04-10 16:21             ` Yang Shi
  0 siblings, 1 reply; 22+ messages in thread
From: Cyrill Gorcunov @ 2018-04-10 12:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yang Shi, adobriyan, willy, mguzik, akpm, linux-mm, linux-kernel

On Tue, Apr 10, 2018 at 01:10:01PM +0200, Michal Hocko wrote:
> > 
> > Because do_brk does vma manipulations, for this reason it's
> > running under down_write_killable(&mm->mmap_sem). Or you
> > mean something else?
> 
> Yes, all we need the new lock for is to get a consistent view on brk
> values. I am simply asking whether there is something fundamentally
> wrong by doing the update inside the new lock while keeping the original
> mmap_sem locking in the brk path. That would allow us to drop the
> mmap_sem lock in the proc path when looking at brk values.

Michal gimme some time. I guess  we might do so, but I need some
spare time to take more precise look into the code, hopefully today
evening. Also I've a suspicion that we've wracked check_data_rlimit
with this new lock in prctl. Need to verify it again.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
  2018-04-10 12:28           ` Cyrill Gorcunov
@ 2018-04-10 16:21             ` Yang Shi
  2018-04-10 18:28                 ` Yang Shi
  0 siblings, 1 reply; 22+ messages in thread
From: Yang Shi @ 2018-04-10 16:21 UTC (permalink / raw)
  To: Cyrill Gorcunov, Michal Hocko
  Cc: adobriyan, willy, mguzik, akpm, linux-mm, linux-kernel



On 4/10/18 5:28 AM, Cyrill Gorcunov wrote:
> On Tue, Apr 10, 2018 at 01:10:01PM +0200, Michal Hocko wrote:
>>> Because do_brk does vma manipulations, for this reason it's
>>> running under down_write_killable(&mm->mmap_sem). Or you
>>> mean something else?
>> Yes, all we need the new lock for is to get a consistent view on brk
>> values. I am simply asking whether there is something fundamentally
>> wrong by doing the update inside the new lock while keeping the original
>> mmap_sem locking in the brk path. That would allow us to drop the
>> mmap_sem lock in the proc path when looking at brk values.
> Michal gimme some time. I guess  we might do so, but I need some
> spare time to take more precise look into the code, hopefully today
> evening. Also I've a suspicion that we've wracked check_data_rlimit
> with this new lock in prctl. Need to verify it again.

I see you guys points. We might be able to move the drop of mmap_sem 
before setting mm->brk in sys_brk since mmap_sem should be used to 
protect vma manipulation only, then protect the value modify with the 
new arg_lock. Then we can eliminate mmap_sem stuff in prctl path, and it 
also prevents from wrecking check_data_rlimit.

At the first glance, it looks feasible to me. Will look into deeper later.

Thanks,
Yang

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
  2018-04-10 16:21             ` Yang Shi
@ 2018-04-10 18:28                 ` Yang Shi
  0 siblings, 0 replies; 22+ messages in thread
From: Yang Shi @ 2018-04-10 18:28 UTC (permalink / raw)
  To: Cyrill Gorcunov, Michal Hocko
  Cc: adobriyan, willy, mguzik, akpm, linux-mm, linux-kernel



On 4/10/18 9:21 AM, Yang Shi wrote:
>
>
> On 4/10/18 5:28 AM, Cyrill Gorcunov wrote:
>> On Tue, Apr 10, 2018 at 01:10:01PM +0200, Michal Hocko wrote:
>>>> Because do_brk does vma manipulations, for this reason it's
>>>> running under down_write_killable(&mm->mmap_sem). Or you
>>>> mean something else?
>>> Yes, all we need the new lock for is to get a consistent view on brk
>>> values. I am simply asking whether there is something fundamentally
>>> wrong by doing the update inside the new lock while keeping the 
>>> original
>>> mmap_sem locking in the brk path. That would allow us to drop the
>>> mmap_sem lock in the proc path when looking at brk values.
>> Michal gimme some time. I guess  we might do so, but I need some
>> spare time to take more precise look into the code, hopefully today
>> evening. Also I've a suspicion that we've wracked check_data_rlimit
>> with this new lock in prctl. Need to verify it again.
>
> I see you guys points. We might be able to move the drop of mmap_sem 
> before setting mm->brk in sys_brk since mmap_sem should be used to 
> protect vma manipulation only, then protect the value modify with the 
> new arg_lock. Then we can eliminate mmap_sem stuff in prctl path, and 
> it also prevents from wrecking check_data_rlimit.
>
> At the first glance, it looks feasible to me. Will look into deeper 
> later.

A further look told me this might be *not* feasible.

It looks the new lock will not break check_data_rlimit since in my patch 
both start_brk and brk is protected by mmap_sem. The code flow might 
look like below:

CPU A                             CPU B
--------                       --------
prctl                               sys_brk
                                       down_write
check_data_rlimit           check_data_rlimit (need mm->start_brk)
                                       set brk
down_write                    up_write
set start_brk
set brk
up_write


If CPU A gets the mmap_sem first, it will set start_brk and brk, then 
CPU B will check with the new start_brk. And, prctl doesn't care if 
sys_brk is run before it since it gets the new start_brk and brk from 
parameter.

If we protect start_brk and brk with the new lock, sys_brk might get old 
start_brk, then sys_brk might break rlimit check silently, is that right?

So, it looks using new lock in prctl and keeping mmap_sem in brk path 
has race condition.

Thanks,
Yang

>
> Thanks,
> Yang
>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
@ 2018-04-10 18:28                 ` Yang Shi
  0 siblings, 0 replies; 22+ messages in thread
From: Yang Shi @ 2018-04-10 18:28 UTC (permalink / raw)
  To: Cyrill Gorcunov, Michal Hocko
  Cc: adobriyan, willy, mguzik, akpm, linux-mm, linux-kernel



On 4/10/18 9:21 AM, Yang Shi wrote:
>
>
> On 4/10/18 5:28 AM, Cyrill Gorcunov wrote:
>> On Tue, Apr 10, 2018 at 01:10:01PM +0200, Michal Hocko wrote:
>>>> Because do_brk does vma manipulations, for this reason it's
>>>> running under down_write_killable(&mm->mmap_sem). Or you
>>>> mean something else?
>>> Yes, all we need the new lock for is to get a consistent view on brk
>>> values. I am simply asking whether there is something fundamentally
>>> wrong by doing the update inside the new lock while keeping the 
>>> original
>>> mmap_sem locking in the brk path. That would allow us to drop the
>>> mmap_sem lock in the proc path when looking at brk values.
>> Michal gimme some time. I guessA  we might do so, but I need some
>> spare time to take more precise look into the code, hopefully today
>> evening. Also I've a suspicion that we've wracked check_data_rlimit
>> with this new lock in prctl. Need to verify it again.
>
> I see you guys points. We might be able to move the drop of mmap_sem 
> before setting mm->brk in sys_brk since mmap_sem should be used to 
> protect vma manipulation only, then protect the value modify with the 
> new arg_lock. Then we can eliminate mmap_sem stuff in prctl path, and 
> it also prevents from wrecking check_data_rlimit.
>
> At the first glance, it looks feasible to me. Will look into deeper 
> later.

A further look told me this might be *not* feasible.

It looks the new lock will not break check_data_rlimit since in my patch 
both start_brk and brk is protected by mmap_sem. The code flow might 
look like below:

CPU AA A A A A A A A A A A A A A A A A A A A A A A A A A A A  CPU B
--------A A A A A A A A A A A A A A A A A A A A A A  --------
prctlA A A A A A A A A A A A A A A A A A A A A A A A A A A A A A  sys_brk
 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A  down_write
check_data_rlimitA A A A A A A A A A  check_data_rlimit (need mm->start_brk)
 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A  set brk
down_writeA A A A A A A A A A A A A A A A A A A  up_write
set start_brk
set brk
up_write


If CPU A gets the mmap_sem first, it will set start_brk and brk, then 
CPU B will check with the new start_brk. And, prctl doesn't care if 
sys_brk is run before it since it gets the new start_brk and brk from 
parameter.

If we protect start_brk and brk with the new lock, sys_brk might get old 
start_brk, then sys_brk might break rlimit check silently, is that right?

So, it looks using new lock in prctl and keeping mmap_sem in brk path 
has race condition.

Thanks,
Yang

>
> Thanks,
> Yang
>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
  2018-04-10 18:28                 ` Yang Shi
@ 2018-04-10 19:17                   ` Cyrill Gorcunov
  -1 siblings, 0 replies; 22+ messages in thread
From: Cyrill Gorcunov @ 2018-04-10 19:17 UTC (permalink / raw)
  To: Yang Shi
  Cc: Michal Hocko, adobriyan, willy, mguzik, akpm, linux-mm, linux-kernel

On Tue, Apr 10, 2018 at 11:28:13AM -0700, Yang Shi wrote:
> > 
> > At the first glance, it looks feasible to me. Will look into deeper
> > later.
> 
> A further look told me this might be *not* feasible.
> 
> It looks the new lock will not break check_data_rlimit since in my patch
> both start_brk and brk is protected by mmap_sem. The code flow might look
> like below:
> 
> CPU A                             CPU B
> --------                       --------
> prctl                               sys_brk
>                                       down_write
> check_data_rlimit           check_data_rlimit (need mm->start_brk)
>                                       set brk
> down_write                    up_write
> set start_brk
> set brk
> up_write
> 
> If CPU A gets the mmap_sem first, it will set start_brk and brk, then CPU B
> will check with the new start_brk. And, prctl doesn't care if sys_brk is run
> before it since it gets the new start_brk and brk from parameter.
> 
> If we protect start_brk and brk with the new lock, sys_brk might get old
> start_brk, then sys_brk might break rlimit check silently, is that right?
> 
> So, it looks using new lock in prctl and keeping mmap_sem in brk path has
> race condition.

I fear so. The check_data_rlimit implies that all elements involved into
validation (brk, start_brk, start_data, end_data) are not changed unpredicably
until written back into mm. In turn if we guard start_brk,brk only (as
it is done in the patch) the check_data_rlimit may pass on wrong data
I think. And as you mentioned the race above exact the example of such
situation. I think for prctl case we can simply left use of mmap_sem
as it were before the patch, after all this syscall is really in cold
path all the time.

	Cyrill

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
@ 2018-04-10 19:17                   ` Cyrill Gorcunov
  0 siblings, 0 replies; 22+ messages in thread
From: Cyrill Gorcunov @ 2018-04-10 19:17 UTC (permalink / raw)
  To: Yang Shi
  Cc: Michal Hocko, adobriyan, willy, mguzik, akpm, linux-mm, linux-kernel

On Tue, Apr 10, 2018 at 11:28:13AM -0700, Yang Shi wrote:
> > 
> > At the first glance, it looks feasible to me. Will look into deeper
> > later.
> 
> A further look told me this might be *not* feasible.
> 
> It looks the new lock will not break check_data_rlimit since in my patch
> both start_brk and brk is protected by mmap_sem. The code flow might look
> like below:
> 
> CPU A                             CPU B
> --------                       --------
> prctl                               sys_brk
>                                       down_write
> check_data_rlimit           check_data_rlimit (need mm->start_brk)
>                                       set brk
> down_write                    up_write
> set start_brk
> set brk
> up_write
> 
> If CPU A gets the mmap_sem first, it will set start_brk and brk, then CPU B
> will check with the new start_brk. And, prctl doesn't care if sys_brk is run
> before it since it gets the new start_brk and brk from parameter.
> 
> If we protect start_brk and brk with the new lock, sys_brk might get old
> start_brk, then sys_brk might break rlimit check silently, is that right?
> 
> So, it looks using new lock in prctl and keeping mmap_sem in brk path has
> race condition.

I fear so. The check_data_rlimit implies that all elements involved into
validation (brk, start_brk, start_data, end_data) are not changed unpredicably
until written back into mm. In turn if we guard start_brk,brk only (as
it is done in the patch) the check_data_rlimit may pass on wrong data
I think. And as you mentioned the race above exact the example of such
situation. I think for prctl case we can simply left use of mmap_sem
as it were before the patch, after all this syscall is really in cold
path all the time.

	Cyrill

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
  2018-04-10 19:17                   ` Cyrill Gorcunov
@ 2018-04-10 19:33                     ` Yang Shi
  -1 siblings, 0 replies; 22+ messages in thread
From: Yang Shi @ 2018-04-10 19:33 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Michal Hocko, adobriyan, willy, mguzik, akpm, linux-mm, linux-kernel



On 4/10/18 12:17 PM, Cyrill Gorcunov wrote:
> On Tue, Apr 10, 2018 at 11:28:13AM -0700, Yang Shi wrote:
>>> At the first glance, it looks feasible to me. Will look into deeper
>>> later.
>> A further look told me this might be *not* feasible.
>>
>> It looks the new lock will not break check_data_rlimit since in my patch
>> both start_brk and brk is protected by mmap_sem. The code flow might look
>> like below:
>>
>> CPU A                             CPU B
>> --------                       --------
>> prctl                               sys_brk
>>                                        down_write
>> check_data_rlimit           check_data_rlimit (need mm->start_brk)
>>                                        set brk
>> down_write                    up_write
>> set start_brk
>> set brk
>> up_write
>>
>> If CPU A gets the mmap_sem first, it will set start_brk and brk, then CPU B
>> will check with the new start_brk. And, prctl doesn't care if sys_brk is run
>> before it since it gets the new start_brk and brk from parameter.
>>
>> If we protect start_brk and brk with the new lock, sys_brk might get old
>> start_brk, then sys_brk might break rlimit check silently, is that right?
>>
>> So, it looks using new lock in prctl and keeping mmap_sem in brk path has
>> race condition.
> I fear so. The check_data_rlimit implies that all elements involved into
> validation (brk, start_brk, start_data, end_data) are not changed unpredicably
> until written back into mm. In turn if we guard start_brk,brk only (as
> it is done in the patch) the check_data_rlimit may pass on wrong data
> I think. And as you mentioned the race above exact the example of such
> situation. I think for prctl case we can simply left use of mmap_sem
> as it were before the patch, after all this syscall is really in cold
> path all the time.

The race condition is just valid when protecting start_brk, brk, 
start_data and end_data with the new lock, but keep using mmap_sem in 
brk path.

So, we should just need make a little tweak to have mmap_sem protect 
start_brk, brk, start_data and end_data, then have the new lock protect 
others so that we still can remove mmap_sem in proc as the patch is 
aimed to do.

Yang

>
> 	Cyrill

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
@ 2018-04-10 19:33                     ` Yang Shi
  0 siblings, 0 replies; 22+ messages in thread
From: Yang Shi @ 2018-04-10 19:33 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Michal Hocko, adobriyan, willy, mguzik, akpm, linux-mm, linux-kernel



On 4/10/18 12:17 PM, Cyrill Gorcunov wrote:
> On Tue, Apr 10, 2018 at 11:28:13AM -0700, Yang Shi wrote:
>>> At the first glance, it looks feasible to me. Will look into deeper
>>> later.
>> A further look told me this might be *not* feasible.
>>
>> It looks the new lock will not break check_data_rlimit since in my patch
>> both start_brk and brk is protected by mmap_sem. The code flow might look
>> like below:
>>
>> CPU AA A A A A A A A A A A A A A A A A A A A A A A A A A A A  CPU B
>> --------A A A A A A A A A A A A A A A A A A A A A A  --------
>> prctlA A A A A A A A A A A A A A A A A A A A A A A A A A A A A A  sys_brk
>>  A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A  down_write
>> check_data_rlimitA A A A A A A A A A  check_data_rlimit (need mm->start_brk)
>>  A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A  set brk
>> down_writeA A A A A A A A A A A A A A A A A A A  up_write
>> set start_brk
>> set brk
>> up_write
>>
>> If CPU A gets the mmap_sem first, it will set start_brk and brk, then CPU B
>> will check with the new start_brk. And, prctl doesn't care if sys_brk is run
>> before it since it gets the new start_brk and brk from parameter.
>>
>> If we protect start_brk and brk with the new lock, sys_brk might get old
>> start_brk, then sys_brk might break rlimit check silently, is that right?
>>
>> So, it looks using new lock in prctl and keeping mmap_sem in brk path has
>> race condition.
> I fear so. The check_data_rlimit implies that all elements involved into
> validation (brk, start_brk, start_data, end_data) are not changed unpredicably
> until written back into mm. In turn if we guard start_brk,brk only (as
> it is done in the patch) the check_data_rlimit may pass on wrong data
> I think. And as you mentioned the race above exact the example of such
> situation. I think for prctl case we can simply left use of mmap_sem
> as it were before the patch, after all this syscall is really in cold
> path all the time.

The race condition is just valid when protecting start_brk, brk, 
start_data and end_data with the new lock, but keep using mmap_sem in 
brk path.

So, we should just need make a little tweak to have mmap_sem protect 
start_brk, brk, start_data and end_data, then have the new lock protect 
others so that we still can remove mmap_sem in proc as the patch is 
aimed to do.

Yang

>
> 	Cyrill

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
  2018-04-10 19:33                     ` Yang Shi
  (?)
@ 2018-04-10 20:06                     ` Cyrill Gorcunov
  -1 siblings, 0 replies; 22+ messages in thread
From: Cyrill Gorcunov @ 2018-04-10 20:06 UTC (permalink / raw)
  To: Yang Shi
  Cc: Michal Hocko, adobriyan, willy, mguzik, akpm, linux-mm, linux-kernel

On Tue, Apr 10, 2018 at 12:33:35PM -0700, Yang Shi wrote:
...
> 
> The race condition is just valid when protecting start_brk, brk, start_data
> and end_data with the new lock, but keep using mmap_sem in brk path.
> 
> So, we should just need make a little tweak to have mmap_sem protect
> start_brk, brk, start_data and end_data, then have the new lock protect
> others so that we still can remove mmap_sem in proc as the patch is aimed to
> do.

+1. Sounds like a plan.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
  2018-04-10 18:28                 ` Yang Shi
@ 2018-04-12 12:18                   ` Michal Hocko
  -1 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2018-04-12 12:18 UTC (permalink / raw)
  To: Yang Shi
  Cc: Cyrill Gorcunov, adobriyan, willy, mguzik, akpm, linux-mm, linux-kernel

On Tue 10-04-18 11:28:13, Yang Shi wrote:
> 
> 
> On 4/10/18 9:21 AM, Yang Shi wrote:
> > 
> > 
> > On 4/10/18 5:28 AM, Cyrill Gorcunov wrote:
> > > On Tue, Apr 10, 2018 at 01:10:01PM +0200, Michal Hocko wrote:
> > > > > Because do_brk does vma manipulations, for this reason it's
> > > > > running under down_write_killable(&mm->mmap_sem). Or you
> > > > > mean something else?
> > > > Yes, all we need the new lock for is to get a consistent view on brk
> > > > values. I am simply asking whether there is something fundamentally
> > > > wrong by doing the update inside the new lock while keeping the
> > > > original
> > > > mmap_sem locking in the brk path. That would allow us to drop the
> > > > mmap_sem lock in the proc path when looking at brk values.
> > > Michal gimme some time. I guess  we might do so, but I need some
> > > spare time to take more precise look into the code, hopefully today
> > > evening. Also I've a suspicion that we've wracked check_data_rlimit
> > > with this new lock in prctl. Need to verify it again.
> > 
> > I see you guys points. We might be able to move the drop of mmap_sem
> > before setting mm->brk in sys_brk since mmap_sem should be used to
> > protect vma manipulation only, then protect the value modify with the
> > new arg_lock. Then we can eliminate mmap_sem stuff in prctl path, and it
> > also prevents from wrecking check_data_rlimit.
> > 
> > At the first glance, it looks feasible to me. Will look into deeper
> > later.
> 
> A further look told me this might be *not* feasible.
> 
> It looks the new lock will not break check_data_rlimit since in my patch
> both start_brk and brk is protected by mmap_sem. The code flow might look
> like below:
> 
> CPU A                             CPU B
> --------                       --------
> prctl                               sys_brk
>                                       down_write
> check_data_rlimit           check_data_rlimit (need mm->start_brk)
>                                       set brk
> down_write                    up_write
> set start_brk
> set brk
> up_write
> 
> 
> If CPU A gets the mmap_sem first, it will set start_brk and brk, then CPU B
> will check with the new start_brk. And, prctl doesn't care if sys_brk is run
> before it since it gets the new start_brk and brk from parameter.
> 
> If we protect start_brk and brk with the new lock, sys_brk might get old
> start_brk, then sys_brk might break rlimit check silently, is that right?
> 
> So, it looks using new lock in prctl and keeping mmap_sem in brk path has
> race condition.

OK, I've admittedly didn't give it too much time to think about. Maybe
we do something clever to remove the race but can we start at least by
reducing the write lock to read on prctl side and use the dedicated
spinlock for updating values? That should close the above race AFAICS
and the read lock would be much more friendly to other VM operations.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
@ 2018-04-12 12:18                   ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2018-04-12 12:18 UTC (permalink / raw)
  To: Yang Shi
  Cc: Cyrill Gorcunov, adobriyan, willy, mguzik, akpm, linux-mm, linux-kernel

On Tue 10-04-18 11:28:13, Yang Shi wrote:
> 
> 
> On 4/10/18 9:21 AM, Yang Shi wrote:
> > 
> > 
> > On 4/10/18 5:28 AM, Cyrill Gorcunov wrote:
> > > On Tue, Apr 10, 2018 at 01:10:01PM +0200, Michal Hocko wrote:
> > > > > Because do_brk does vma manipulations, for this reason it's
> > > > > running under down_write_killable(&mm->mmap_sem). Or you
> > > > > mean something else?
> > > > Yes, all we need the new lock for is to get a consistent view on brk
> > > > values. I am simply asking whether there is something fundamentally
> > > > wrong by doing the update inside the new lock while keeping the
> > > > original
> > > > mmap_sem locking in the brk path. That would allow us to drop the
> > > > mmap_sem lock in the proc path when looking at brk values.
> > > Michal gimme some time. I guess  we might do so, but I need some
> > > spare time to take more precise look into the code, hopefully today
> > > evening. Also I've a suspicion that we've wracked check_data_rlimit
> > > with this new lock in prctl. Need to verify it again.
> > 
> > I see you guys points. We might be able to move the drop of mmap_sem
> > before setting mm->brk in sys_brk since mmap_sem should be used to
> > protect vma manipulation only, then protect the value modify with the
> > new arg_lock. Then we can eliminate mmap_sem stuff in prctl path, and it
> > also prevents from wrecking check_data_rlimit.
> > 
> > At the first glance, it looks feasible to me. Will look into deeper
> > later.
> 
> A further look told me this might be *not* feasible.
> 
> It looks the new lock will not break check_data_rlimit since in my patch
> both start_brk and brk is protected by mmap_sem. The code flow might look
> like below:
> 
> CPU A                             CPU B
> --------                       --------
> prctl                               sys_brk
>                                       down_write
> check_data_rlimit           check_data_rlimit (need mm->start_brk)
>                                       set brk
> down_write                    up_write
> set start_brk
> set brk
> up_write
> 
> 
> If CPU A gets the mmap_sem first, it will set start_brk and brk, then CPU B
> will check with the new start_brk. And, prctl doesn't care if sys_brk is run
> before it since it gets the new start_brk and brk from parameter.
> 
> If we protect start_brk and brk with the new lock, sys_brk might get old
> start_brk, then sys_brk might break rlimit check silently, is that right?
> 
> So, it looks using new lock in prctl and keeping mmap_sem in brk path has
> race condition.

OK, I've admittedly didn't give it too much time to think about. Maybe
we do something clever to remove the race but can we start at least by
reducing the write lock to read on prctl side and use the dedicated
spinlock for updating values? That should close the above race AFAICS
and the read lock would be much more friendly to other VM operations.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
  2018-04-12 12:18                   ` Michal Hocko
@ 2018-04-12 16:20                     ` Yang Shi
  -1 siblings, 0 replies; 22+ messages in thread
From: Yang Shi @ 2018-04-12 16:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Cyrill Gorcunov, adobriyan, willy, mguzik, akpm, linux-mm, linux-kernel



On 4/12/18 5:18 AM, Michal Hocko wrote:
> On Tue 10-04-18 11:28:13, Yang Shi wrote:
>>
>> On 4/10/18 9:21 AM, Yang Shi wrote:
>>>
>>> On 4/10/18 5:28 AM, Cyrill Gorcunov wrote:
>>>> On Tue, Apr 10, 2018 at 01:10:01PM +0200, Michal Hocko wrote:
>>>>>> Because do_brk does vma manipulations, for this reason it's
>>>>>> running under down_write_killable(&mm->mmap_sem). Or you
>>>>>> mean something else?
>>>>> Yes, all we need the new lock for is to get a consistent view on brk
>>>>> values. I am simply asking whether there is something fundamentally
>>>>> wrong by doing the update inside the new lock while keeping the
>>>>> original
>>>>> mmap_sem locking in the brk path. That would allow us to drop the
>>>>> mmap_sem lock in the proc path when looking at brk values.
>>>> Michal gimme some time. I guess  we might do so, but I need some
>>>> spare time to take more precise look into the code, hopefully today
>>>> evening. Also I've a suspicion that we've wracked check_data_rlimit
>>>> with this new lock in prctl. Need to verify it again.
>>> I see you guys points. We might be able to move the drop of mmap_sem
>>> before setting mm->brk in sys_brk since mmap_sem should be used to
>>> protect vma manipulation only, then protect the value modify with the
>>> new arg_lock. Then we can eliminate mmap_sem stuff in prctl path, and it
>>> also prevents from wrecking check_data_rlimit.
>>>
>>> At the first glance, it looks feasible to me. Will look into deeper
>>> later.
>> A further look told me this might be *not* feasible.
>>
>> It looks the new lock will not break check_data_rlimit since in my patch
>> both start_brk and brk is protected by mmap_sem. The code flow might look
>> like below:
>>
>> CPU A                             CPU B
>> --------                       --------
>> prctl                               sys_brk
>>                                        down_write
>> check_data_rlimit           check_data_rlimit (need mm->start_brk)
>>                                        set brk
>> down_write                    up_write
>> set start_brk
>> set brk
>> up_write
>>
>>
>> If CPU A gets the mmap_sem first, it will set start_brk and brk, then CPU B
>> will check with the new start_brk. And, prctl doesn't care if sys_brk is run
>> before it since it gets the new start_brk and brk from parameter.
>>
>> If we protect start_brk and brk with the new lock, sys_brk might get old
>> start_brk, then sys_brk might break rlimit check silently, is that right?
>>
>> So, it looks using new lock in prctl and keeping mmap_sem in brk path has
>> race condition.
> OK, I've admittedly didn't give it too much time to think about. Maybe
> we do something clever to remove the race but can we start at least by
> reducing the write lock to read on prctl side and use the dedicated
> spinlock for updating values? That should close the above race AFAICS
> and the read lock would be much more friendly to other VM operations.

Yes, is sounds feasible. We just need care about prctl is run before 
sys_brk. So, you mean:

down_read
spin_lock
update all the values
spin_unlock
up_read


>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
@ 2018-04-12 16:20                     ` Yang Shi
  0 siblings, 0 replies; 22+ messages in thread
From: Yang Shi @ 2018-04-12 16:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Cyrill Gorcunov, adobriyan, willy, mguzik, akpm, linux-mm, linux-kernel



On 4/12/18 5:18 AM, Michal Hocko wrote:
> On Tue 10-04-18 11:28:13, Yang Shi wrote:
>>
>> On 4/10/18 9:21 AM, Yang Shi wrote:
>>>
>>> On 4/10/18 5:28 AM, Cyrill Gorcunov wrote:
>>>> On Tue, Apr 10, 2018 at 01:10:01PM +0200, Michal Hocko wrote:
>>>>>> Because do_brk does vma manipulations, for this reason it's
>>>>>> running under down_write_killable(&mm->mmap_sem). Or you
>>>>>> mean something else?
>>>>> Yes, all we need the new lock for is to get a consistent view on brk
>>>>> values. I am simply asking whether there is something fundamentally
>>>>> wrong by doing the update inside the new lock while keeping the
>>>>> original
>>>>> mmap_sem locking in the brk path. That would allow us to drop the
>>>>> mmap_sem lock in the proc path when looking at brk values.
>>>> Michal gimme some time. I guessA  we might do so, but I need some
>>>> spare time to take more precise look into the code, hopefully today
>>>> evening. Also I've a suspicion that we've wracked check_data_rlimit
>>>> with this new lock in prctl. Need to verify it again.
>>> I see you guys points. We might be able to move the drop of mmap_sem
>>> before setting mm->brk in sys_brk since mmap_sem should be used to
>>> protect vma manipulation only, then protect the value modify with the
>>> new arg_lock. Then we can eliminate mmap_sem stuff in prctl path, and it
>>> also prevents from wrecking check_data_rlimit.
>>>
>>> At the first glance, it looks feasible to me. Will look into deeper
>>> later.
>> A further look told me this might be *not* feasible.
>>
>> It looks the new lock will not break check_data_rlimit since in my patch
>> both start_brk and brk is protected by mmap_sem. The code flow might look
>> like below:
>>
>> CPU AA A A A A A A A A A A A A A A A A A A A A A A A A A A A  CPU B
>> --------A A A A A A A A A A A A A A A A A A A A A A  --------
>> prctlA A A A A A A A A A A A A A A A A A A A A A A A A A A A A A  sys_brk
>>  A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A  down_write
>> check_data_rlimitA A A A A A A A A A  check_data_rlimit (need mm->start_brk)
>>  A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A  set brk
>> down_writeA A A A A A A A A A A A A A A A A A A  up_write
>> set start_brk
>> set brk
>> up_write
>>
>>
>> If CPU A gets the mmap_sem first, it will set start_brk and brk, then CPU B
>> will check with the new start_brk. And, prctl doesn't care if sys_brk is run
>> before it since it gets the new start_brk and brk from parameter.
>>
>> If we protect start_brk and brk with the new lock, sys_brk might get old
>> start_brk, then sys_brk might break rlimit check silently, is that right?
>>
>> So, it looks using new lock in prctl and keeping mmap_sem in brk path has
>> race condition.
> OK, I've admittedly didn't give it too much time to think about. Maybe
> we do something clever to remove the race but can we start at least by
> reducing the write lock to read on prctl side and use the dedicated
> spinlock for updating values? That should close the above race AFAICS
> and the read lock would be much more friendly to other VM operations.

Yes, is sounds feasible. We just need care about prctl is run before 
sys_brk. So, you mean:

down_read
spin_lock
update all the values
spin_unlock
up_read


>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
  2018-04-12 16:20                     ` Yang Shi
@ 2018-04-13  6:56                       ` Michal Hocko
  -1 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2018-04-13  6:56 UTC (permalink / raw)
  To: Yang Shi
  Cc: Cyrill Gorcunov, adobriyan, willy, mguzik, akpm, linux-mm, linux-kernel

On Thu 12-04-18 09:20:24, Yang Shi wrote:
> 
> 
> On 4/12/18 5:18 AM, Michal Hocko wrote:
> > On Tue 10-04-18 11:28:13, Yang Shi wrote:
> > > 
> > > On 4/10/18 9:21 AM, Yang Shi wrote:
> > > > 
> > > > On 4/10/18 5:28 AM, Cyrill Gorcunov wrote:
> > > > > On Tue, Apr 10, 2018 at 01:10:01PM +0200, Michal Hocko wrote:
> > > > > > > Because do_brk does vma manipulations, for this reason it's
> > > > > > > running under down_write_killable(&mm->mmap_sem). Or you
> > > > > > > mean something else?
> > > > > > Yes, all we need the new lock for is to get a consistent view on brk
> > > > > > values. I am simply asking whether there is something fundamentally
> > > > > > wrong by doing the update inside the new lock while keeping the
> > > > > > original
> > > > > > mmap_sem locking in the brk path. That would allow us to drop the
> > > > > > mmap_sem lock in the proc path when looking at brk values.
> > > > > Michal gimme some time. I guess  we might do so, but I need some
> > > > > spare time to take more precise look into the code, hopefully today
> > > > > evening. Also I've a suspicion that we've wracked check_data_rlimit
> > > > > with this new lock in prctl. Need to verify it again.
> > > > I see you guys points. We might be able to move the drop of mmap_sem
> > > > before setting mm->brk in sys_brk since mmap_sem should be used to
> > > > protect vma manipulation only, then protect the value modify with the
> > > > new arg_lock. Then we can eliminate mmap_sem stuff in prctl path, and it
> > > > also prevents from wrecking check_data_rlimit.
> > > > 
> > > > At the first glance, it looks feasible to me. Will look into deeper
> > > > later.
> > > A further look told me this might be *not* feasible.
> > > 
> > > It looks the new lock will not break check_data_rlimit since in my patch
> > > both start_brk and brk is protected by mmap_sem. The code flow might look
> > > like below:
> > > 
> > > CPU A                             CPU B
> > > --------                       --------
> > > prctl                               sys_brk
> > >                                        down_write
> > > check_data_rlimit           check_data_rlimit (need mm->start_brk)
> > >                                        set brk
> > > down_write                    up_write
> > > set start_brk
> > > set brk
> > > up_write
> > > 
> > > 
> > > If CPU A gets the mmap_sem first, it will set start_brk and brk, then CPU B
> > > will check with the new start_brk. And, prctl doesn't care if sys_brk is run
> > > before it since it gets the new start_brk and brk from parameter.
> > > 
> > > If we protect start_brk and brk with the new lock, sys_brk might get old
> > > start_brk, then sys_brk might break rlimit check silently, is that right?
> > > 
> > > So, it looks using new lock in prctl and keeping mmap_sem in brk path has
> > > race condition.
> > OK, I've admittedly didn't give it too much time to think about. Maybe
> > we do something clever to remove the race but can we start at least by
> > reducing the write lock to read on prctl side and use the dedicated
> > spinlock for updating values? That should close the above race AFAICS
> > and the read lock would be much more friendly to other VM operations.
> 
> Yes, is sounds feasible. We just need care about prctl is run before
> sys_brk. 

There will never be any before/after ordering here. It has never been.
We just need the two to be mutually exlusive. We do not really need that
for races with the page fault because the prctl doesn't modify the
layout AFAIU.

> So, you mean:
> 
> down_read
> spin_lock
> update all the values
> spin_unlock
> up_read

Yes.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct
@ 2018-04-13  6:56                       ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2018-04-13  6:56 UTC (permalink / raw)
  To: Yang Shi
  Cc: Cyrill Gorcunov, adobriyan, willy, mguzik, akpm, linux-mm, linux-kernel

On Thu 12-04-18 09:20:24, Yang Shi wrote:
> 
> 
> On 4/12/18 5:18 AM, Michal Hocko wrote:
> > On Tue 10-04-18 11:28:13, Yang Shi wrote:
> > > 
> > > On 4/10/18 9:21 AM, Yang Shi wrote:
> > > > 
> > > > On 4/10/18 5:28 AM, Cyrill Gorcunov wrote:
> > > > > On Tue, Apr 10, 2018 at 01:10:01PM +0200, Michal Hocko wrote:
> > > > > > > Because do_brk does vma manipulations, for this reason it's
> > > > > > > running under down_write_killable(&mm->mmap_sem). Or you
> > > > > > > mean something else?
> > > > > > Yes, all we need the new lock for is to get a consistent view on brk
> > > > > > values. I am simply asking whether there is something fundamentally
> > > > > > wrong by doing the update inside the new lock while keeping the
> > > > > > original
> > > > > > mmap_sem locking in the brk path. That would allow us to drop the
> > > > > > mmap_sem lock in the proc path when looking at brk values.
> > > > > Michal gimme some time. I guess  we might do so, but I need some
> > > > > spare time to take more precise look into the code, hopefully today
> > > > > evening. Also I've a suspicion that we've wracked check_data_rlimit
> > > > > with this new lock in prctl. Need to verify it again.
> > > > I see you guys points. We might be able to move the drop of mmap_sem
> > > > before setting mm->brk in sys_brk since mmap_sem should be used to
> > > > protect vma manipulation only, then protect the value modify with the
> > > > new arg_lock. Then we can eliminate mmap_sem stuff in prctl path, and it
> > > > also prevents from wrecking check_data_rlimit.
> > > > 
> > > > At the first glance, it looks feasible to me. Will look into deeper
> > > > later.
> > > A further look told me this might be *not* feasible.
> > > 
> > > It looks the new lock will not break check_data_rlimit since in my patch
> > > both start_brk and brk is protected by mmap_sem. The code flow might look
> > > like below:
> > > 
> > > CPU A                             CPU B
> > > --------                       --------
> > > prctl                               sys_brk
> > >                                        down_write
> > > check_data_rlimit           check_data_rlimit (need mm->start_brk)
> > >                                        set brk
> > > down_write                    up_write
> > > set start_brk
> > > set brk
> > > up_write
> > > 
> > > 
> > > If CPU A gets the mmap_sem first, it will set start_brk and brk, then CPU B
> > > will check with the new start_brk. And, prctl doesn't care if sys_brk is run
> > > before it since it gets the new start_brk and brk from parameter.
> > > 
> > > If we protect start_brk and brk with the new lock, sys_brk might get old
> > > start_brk, then sys_brk might break rlimit check silently, is that right?
> > > 
> > > So, it looks using new lock in prctl and keeping mmap_sem in brk path has
> > > race condition.
> > OK, I've admittedly didn't give it too much time to think about. Maybe
> > we do something clever to remove the race but can we start at least by
> > reducing the write lock to read on prctl side and use the dedicated
> > spinlock for updating values? That should close the above race AFAICS
> > and the read lock would be much more friendly to other VM operations.
> 
> Yes, is sounds feasible. We just need care about prctl is run before
> sys_brk. 

There will never be any before/after ordering here. It has never been.
We just need the two to be mutually exlusive. We do not really need that
for races with the page fault because the prctl doesn't modify the
layout AFAIU.

> So, you mean:
> 
> down_read
> spin_lock
> update all the values
> spin_unlock
> up_read

Yes.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2018-04-13  6:56 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-09 21:52 [v3 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct Yang Shi
2018-04-10  8:48 ` Cyrill Gorcunov
2018-04-10  9:09 ` Michal Hocko
2018-04-10  9:40   ` Cyrill Gorcunov
2018-04-10 10:42     ` Michal Hocko
2018-04-10 11:02       ` Cyrill Gorcunov
2018-04-10 11:10         ` Michal Hocko
2018-04-10 12:28           ` Cyrill Gorcunov
2018-04-10 16:21             ` Yang Shi
2018-04-10 18:28               ` Yang Shi
2018-04-10 18:28                 ` Yang Shi
2018-04-10 19:17                 ` Cyrill Gorcunov
2018-04-10 19:17                   ` Cyrill Gorcunov
2018-04-10 19:33                   ` Yang Shi
2018-04-10 19:33                     ` Yang Shi
2018-04-10 20:06                     ` Cyrill Gorcunov
2018-04-12 12:18                 ` Michal Hocko
2018-04-12 12:18                   ` Michal Hocko
2018-04-12 16:20                   ` Yang Shi
2018-04-12 16:20                     ` Yang Shi
2018-04-13  6:56                     ` Michal Hocko
2018-04-13  6:56                       ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.