All of lore.kernel.org
 help / color / mirror / Atom feed
* + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree
@ 2014-08-21 22:51 akpm
  0 siblings, 0 replies; 12+ messages in thread
From: akpm @ 2014-08-21 22:51 UTC (permalink / raw)
  To: gorcunov, avagin, ebiederm, hpa, jln, kamezawa.hiroyu, keescook,
	mtk.manpages, segoon, serge.hallyn, tj, xemul, mm-commits


The patch titled
     Subject: prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation
has been added to the -mm tree.  Its filename is
     prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Cyrill Gorcunov <gorcunov@openvz.org>
Subject: prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation

During development of c/r we've noticed that in case if we need to support
user namespaces we face a problem with capabilities in prctl(PR_SET_MM,
...) call, in particular once new user namespace is created
capable(CAP_SYS_RESOURCE) no longer passes.

A approach is to eliminate CAP_SYS_RESOURCE check but pass all new values
in one bundle, which would allow the kernel to make more intensive test
for sanity of values and same time allow us to support checkpoint/restore
of user namespaces.

Thus a new command PR_SET_MM_MAP introduced. It takes a pointer of
prctl_mm_map structure which carries all the members to be updated.

	prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size)

	struct prctl_mm_map {
		__u64	start_code;
		__u64	end_code;
		__u64	start_data;
		__u64	end_data;
		__u64	start_brk;
		__u64	brk;
		__u64	start_stack;
		__u64	arg_start;
		__u64	arg_end;
		__u64	env_start;
		__u64	env_end;
		__u64	*auxv;
		__u32	auxv_size;
		__u32	exe_fd;
	};

All members except @exe_fd correspond ones of struct mm_struct.  To figure
out which available values these members may take here are meanings of the
members.

 - start_code, end_code: represent bounds of executable code area
 - start_data, end_data: represent bounds of data area
 - start_brk, brk: used to calculate bounds for brk() syscall
 - start_stack: used when accounting space needed for command
   line arguments, environment and shmat() syscall
 - arg_start, arg_end, env_start, env_end: represent memory area
   supplied for command line arguments and environment variables
 - auxv, auxv_size: carries auxiliary vector, Elf format specifics
 - exe_fd: file descriptor number for executable link (/proc/self/exe)

Thus we apply the following requirements to the values

1) Any member except @auxv, @auxv_size, @exe_fd is rather an address
   in user space thus it must be laying inside [mmap_min_addr, mmap_max_addr)
   interval.

2) While @[start|end]_code and @[start|end]_data may point to an nonexisting
   VMAs (say a program maps own new .text and .data segments during execution)
   the rest of members should belong to VMA which must exist.

3) Addresses must be ordered, ie @start_ member must not be greater or
   equal to appropriate @end_ member.

4) As in regular Elf loading procedure we require that @start_brk and
   @brk be greater than @end_data.

5) If RLIMIT_DATA rlimit is set to non-infinity new values should not
   exceed existing limit. Same applies to RLIMIT_STACK.

6) Auxiliary vector size must not exceed existing one (which is
   predefined as AT_VECTOR_SIZE and depends on architecture).

7) File descriptor passed in @exe_file should be pointing
   to executable file (because we use existing prctl_set_mm_exe_file_locked
   helper it ensures that the file we are going to use as exe link has all
   required permission granted).

Now about where these members are involved inside kernel code:

 - @start_code and @end_code are used in /proc/$pid/[stat|statm] output;

 - @start_data and @end_data are used in /proc/$pid/[stat|statm] output,
   also they are considered if there enough space for brk() syscall
   result if RLIMIT_DATA is set;

 - @start_brk shown in /proc/$pid/stat output and accounted in brk()
   syscall if RLIMIT_DATA is set; also this member is tested to
   find a symbolic name of mmap event for perf system (we choose
   if event is generated for "heap" area); one more aplication is
   selinux -- we test if a process has PROCESS__EXECHEAP permission
   if trying to make heap area being executable with mprotect() syscall;

 - @brk is a current value for brk() syscall which lays inside heap
   area, it's shown in /proc/$pid/stat. When syscall brk() succesfully
   provides new memory area to a user space upon brk() completion the
   mm::brk is updated to carry new value;

   Both @start_brk and @brk are actively used in /proc/$pid/maps
   and /proc/$pid/smaps output to find a symbolic name "heap" for
   VMA being scanned;

 - @start_stack is printed out in /proc/$pid/stat and used to
   find a symbolic name "stack" for task and threads in
   /proc/$pid/maps and /proc/$pid/smaps output, and as the same
   as with @start_brk -- perf system uses it for event naming.
   Also kernel treat this member as a start address of where
   to map vDSO pages and to check if there is enough space
   for shmat() syscall;

 - @arg_start, @arg_end, @env_start and @env_end are printed out
   in /proc/$pid/stat. Another access to the data these members
   represent is to read /proc/$pid/environ or /proc/$pid/cmdline.
   Any attempt to read these areas kernel tests with access_process_vm
   helper so a user must have enough rights for this action;

 - @auxv and @auxv_size may be read from /proc/$pid/auxv. Strictly
   speaking kernel doesn't care much about which exactly data is
   sitting there because it is solely for userspace;

 - @exe_fd is referred from /proc/$pid/exe and when generating
   coredump. We uses prctl_set_mm_exe_file_locked helper to update
   this member, so exe-file link modification remains one-shot
   action.

Still note that updating exe-file link now doesn't require sys-resource
capability anymore, after all there is no much profit in preventing setup
own file link (there are a number of ways to execute own code -- ptrace,
ld-preload, so that the only reliable way to find which exactly code is
executed is to inspect running program memory).  Still we require the
caller to be at least user-namespace root user.

I believe the old interface should be deprecated and ripped off in a
couple of kernel releases if no one against.

To test if new interface is implemented in the kernel one can pass
PR_SET_MM_MAP_SIZE opcode and the kernel returns the size of currently
supported struct prctl_mm_map.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Andrew Vagin <avagin@openvz.org>
Tested-by: Andrew Vagin <avagin@openvz.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Julien Tinnes <jln@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/uapi/linux/prctl.h |   25 ++++
 kernel/sys.c               |  211 ++++++++++++++++++++++++++++++++++-
 2 files changed, 235 insertions(+), 1 deletion(-)

diff -puN include/uapi/linux/prctl.h~prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3 include/uapi/linux/prctl.h
--- a/include/uapi/linux/prctl.h~prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3
+++ a/include/uapi/linux/prctl.h
@@ -119,6 +119,31 @@
 # define PR_SET_MM_ENV_END		11
 # define PR_SET_MM_AUXV			12
 # define PR_SET_MM_EXE_FILE		13
+# define PR_SET_MM_MAP			14
+# define PR_SET_MM_MAP_SIZE		15
+
+/*
+ * This structure provides new memory descriptor
+ * map which mostly modifies /proc/pid/stat[m]
+ * output for a task. This mostly done in a
+ * sake of checkpoint/restore functionality.
+ */
+struct prctl_mm_map {
+	__u64	start_code;		/* code section bounds */
+	__u64	end_code;
+	__u64	start_data;		/* data section bounds */
+	__u64	end_data;
+	__u64	start_brk;		/* heap for brk() syscall */
+	__u64	brk;
+	__u64	start_stack;		/* stack starts at */
+	__u64	arg_start;		/* command line arguments bounds */
+	__u64	arg_end;
+	__u64	env_start;		/* environment variables bounds */
+	__u64	env_end;
+	__u64	*auxv;			/* auxiliary vector */
+	__u32	auxv_size;		/* vector size */
+	__u32	exe_fd;			/* /proc/$pid/exe link file */
+};
 
 /*
  * Set specific pid that is allowed to ptrace the current task.
diff -puN kernel/sys.c~prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3 kernel/sys.c
--- a/kernel/sys.c~prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3
+++ a/kernel/sys.c
@@ -1687,6 +1687,208 @@ exit:
 	return err;
 }
 
+#ifdef CONFIG_CHECKPOINT_RESTORE
+/*
+ * WARNING: we don't require any capability here so be very careful
+ * in what is allowed for modification from userspace.
+ */
+static int validate_prctl_map_locked(struct prctl_mm_map *prctl_map)
+{
+	unsigned long mmap_max_addr = TASK_SIZE;
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *stack_vma;
+	int error = 0;
+
+	/*
+	 * Make sure the members are not somewhere outside
+	 * of allowed address space.
+	 */
+#define __prctl_check_addr_space(__member)					\
+	({									\
+		int __rc;							\
+		if ((unsigned long)prctl_map->__member < mmap_max_addr &&	\
+		    (unsigned long)prctl_map->__member >= mmap_min_addr)	\
+			__rc = 0;						\
+		else								\
+			__rc = -EINVAL;						\
+		__rc;								\
+	})
+	error |= __prctl_check_addr_space(start_code);
+	error |= __prctl_check_addr_space(end_code);
+	error |= __prctl_check_addr_space(start_data);
+	error |= __prctl_check_addr_space(end_data);
+	error |= __prctl_check_addr_space(start_stack);
+	error |= __prctl_check_addr_space(start_brk);
+	error |= __prctl_check_addr_space(brk);
+	error |= __prctl_check_addr_space(arg_start);
+	error |= __prctl_check_addr_space(arg_end);
+	error |= __prctl_check_addr_space(env_start);
+	error |= __prctl_check_addr_space(env_end);
+	if (error)
+		goto out;
+#undef __prctl_check_addr_space
+
+	/*
+	 * Stack, brk, command line arguments and environment must exist.
+	 */
+	stack_vma = find_vma(mm, (unsigned long)prctl_map->start_stack);
+	if (!stack_vma) {
+		error = -EINVAL;
+		goto out;
+	}
+#define __prctl_check_vma(__member)						\
+	find_vma(mm, (unsigned long)prctl_map->__member) ? 0 : -EINVAL
+	error |= __prctl_check_vma(start_brk);
+	error |= __prctl_check_vma(brk);
+	error |= __prctl_check_vma(arg_start);
+	error |= __prctl_check_vma(arg_end);
+	error |= __prctl_check_vma(env_start);
+	error |= __prctl_check_vma(env_end);
+	if (error)
+		goto out;
+#undef __prctl_check_vma
+
+	/*
+	 * Make sure the pairs are ordered.
+	 */
+#define __prctl_check_order(__m1, __op, __m2)					\
+	((unsigned long)prctl_map->__m1 __op					\
+	 (unsigned long)prctl_map->__m2) ? 0 : -EINVAL
+	error |= __prctl_check_order(start_code, <, end_code);
+	error |= __prctl_check_order(start_data, <, end_data);
+	error |= __prctl_check_order(start_brk, <=, brk);
+	error |= __prctl_check_order(arg_start, <=, arg_end);
+	error |= __prctl_check_order(env_start, <=, env_end);
+	if (error)
+		goto out;
+#undef __prctl_check_order
+
+	error = -EINVAL;
+
+	/*
+	 * @brk should be after @end_data in traditional maps.
+	 */
+	if (prctl_map->start_brk <= prctl_map->end_data ||
+	    prctl_map->brk <= prctl_map->end_data)
+		goto out;
+
+	/*
+	 * Neither we should allow to override limits if they set.
+	 */
+	if (check_data_rlimit(rlimit(RLIMIT_DATA), prctl_map->brk,
+			      prctl_map->start_brk, prctl_map->end_data,
+			      prctl_map->start_data))
+			goto out;
+
+#ifdef CONFIG_STACK_GROWSUP
+	if (check_data_rlimit(rlimit(RLIMIT_STACK),
+			      stack_vma->vm_end,
+			      prctl_map->start_stack, 0, 0))
+#else
+	if (check_data_rlimit(rlimit(RLIMIT_STACK),
+			      prctl_map->start_stack,
+			      stack_vma->vm_start, 0, 0))
+#endif
+		goto out;
+
+	/*
+	 * Someone is trying to cheat the auxv vector.
+	 */
+	if (prctl_map->auxv_size) {
+		if (!prctl_map->auxv ||
+		    prctl_map->auxv_size > sizeof(mm->saved_auxv))
+			goto out;
+	}
+
+	/*
+	 * Finally, make sure the caller has the rights to
+	 * change /proc/pid/exe link: only local root should
+	 * be allowed to.
+	 */
+	if (prctl_map->exe_fd != (u32)-1) {
+		struct user_namespace *ns = current_user_ns();
+		const struct cred *cred = current_cred();
+
+		if (!uid_eq(cred->uid, make_kuid(ns, 0)) ||
+		    !gid_eq(cred->gid, make_kgid(ns, 0)))
+			goto out;
+	}
+
+	error = 0;
+out:
+	return error;
+}
+
+static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data_size)
+{
+	struct prctl_mm_map prctl_map = { .exe_fd = (u32)-1, };
+	unsigned long user_auxv[AT_VECTOR_SIZE];
+	struct mm_struct *mm = current->mm;
+	int error = -EINVAL;
+
+	BUILD_BUG_ON(sizeof(user_auxv) != sizeof(mm->saved_auxv));
+
+	if (opt == PR_SET_MM_MAP_SIZE)
+		return put_user((unsigned int)sizeof(prctl_map),
+				(unsigned int __user *)addr);
+
+	if (data_size != sizeof(prctl_map))
+		return -EINVAL;
+
+	if (copy_from_user(&prctl_map, addr, sizeof(prctl_map)))
+		return -EFAULT;
+
+	down_read(&mm->mmap_sem);
+
+	if (validate_prctl_map_locked(&prctl_map))
+		goto out;
+
+	if (prctl_map.auxv_size) {
+		up_read(&mm->mmap_sem);
+		memset(user_auxv, 0, sizeof(user_auxv));
+		error = copy_from_user(user_auxv,
+				       (const void __user *)prctl_map.auxv,
+				       prctl_map.auxv_size);
+		down_read(&mm->mmap_sem);
+		if (error)
+			goto out;
+	}
+
+	if (prctl_map.exe_fd != (u32)-1) {
+		error = prctl_set_mm_exe_file_locked(mm, prctl_map.exe_fd);
+		if (error)
+			goto out;
+	}
+
+	if (prctl_map.auxv_size) {
+		/* Last entry must be AT_NULL as specification requires */
+		user_auxv[AT_VECTOR_SIZE - 2] = AT_NULL;
+		user_auxv[AT_VECTOR_SIZE - 1] = AT_NULL;
+
+		task_lock(current);
+		memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv));
+		task_unlock(current);
+	}
+
+	mm->start_code	= prctl_map.start_code;
+	mm->end_code	= prctl_map.end_code;
+	mm->start_data	= prctl_map.start_data;
+	mm->end_data	= prctl_map.end_data;
+	mm->start_brk	= prctl_map.start_brk;
+	mm->brk		= prctl_map.brk;
+	mm->start_stack	= prctl_map.start_stack;
+	mm->arg_start	= prctl_map.arg_start;
+	mm->arg_end	= prctl_map.arg_end;
+	mm->env_start	= prctl_map.env_start;
+	mm->env_end	= prctl_map.env_end;
+
+	error = 0;
+out:
+	up_read(&mm->mmap_sem);
+	return error;
+}
+#endif /* CONFIG_CHECKPOINT_RESTORE */
+
 static int prctl_set_mm(int opt, unsigned long addr,
 			unsigned long arg4, unsigned long arg5)
 {
@@ -1694,9 +1896,16 @@ static int prctl_set_mm(int opt, unsigne
 	struct vm_area_struct *vma;
 	int error;
 
-	if (arg5 || (arg4 && opt != PR_SET_MM_AUXV))
+	if (arg5 || (arg4 && (opt != PR_SET_MM_AUXV &&
+			      opt != PR_SET_MM_MAP &&
+			      opt != PR_SET_MM_MAP_SIZE)))
 		return -EINVAL;
 
+#ifdef CONFIG_CHECKPOINT_RESTORE
+	if (opt == PR_SET_MM_MAP || opt == PR_SET_MM_MAP_SIZE)
+		return prctl_set_mm_map(opt, (const void __user *)addr, arg4);
+#endif
+
 	if (!capable(CAP_SYS_RESOURCE))
 		return -EPERM;
 
_

Patches currently in -mm which might be from gorcunov@openvz.org are

mm-introduce-check_data_rlimit-helper-v2.patch
mm-use-may_adjust_brk-helper.patch
prctl-pr_set_mm-factor-out-mmap_sem-when-update-mm-exe_file.patch
prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch
prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3-fix.patch
fs-proc-task_mmuc-dont-use-task-mm-in-m_start-and-show_map.patch
fs-proc-task_mmuc-unify-simplify-do_maps_open-and-numa_maps_open.patch
proc-introduce-proc_mem_open.patch
fs-proc-task_mmuc-shift-mm_access-from-m_start-to-proc_maps_open.patch
fs-proc-task_mmuc-simplify-the-vma_stop-logic.patch
fs-proc-task_mmuc-cleanup-the-tail_vma-horror-in-m_next.patch
fs-proc-task_mmuc-shift-priv-task-=-null-from-m_start-to-m_stop.patch
fs-proc-task_mmuc-kill-the-suboptimal-and-confusing-m-version-logic.patch
fs-proc-task_mmuc-simplify-m_start-to-make-it-readable.patch
fs-proc-task_mmuc-introduce-m_next_vma-helper.patch
fs-proc-task_mmuc-reintroduce-m-version-logic.patch
fs-proc-task_mmuc-update-m-version-in-the-main-loop-in-m_start.patch
fs-proc-task_nommuc-change-maps_open-to-use-__seq_open_private.patch
fs-proc-task_nommuc-shift-mm_access-from-m_start-to-proc_maps_open.patch
fs-proc-task_nommuc-dont-use-priv-task-mm.patch


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree
  2014-08-23 19:29                 ` Oleg Nesterov
@ 2014-08-23 20:11                   ` Cyrill Gorcunov
  0 siblings, 0 replies; 12+ messages in thread
From: Cyrill Gorcunov @ 2014-08-23 20:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov,
	KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton,
	linux-kernel

On Sat, Aug 23, 2014 at 09:29:15PM +0200, Oleg Nesterov wrote:
> On 08/23, Cyrill Gorcunov wrote:
> >
> > On Sat, Aug 23, 2014 at 05:32:22PM +0200, Oleg Nesterov wrote:
> > >
> > > And btw, where do you see RLIMIT_STACK in do_shmat() ?
> >
> > Indirectly, though start_stack pointer. We assign it in setup_arg_pages taking into
> > account RLIMIT_STACK value, then do_shmat operates with start_stack pointer,
> > thus when we setup some new value into start_stack we at least should
> > not exceed old RLIMIT_STACK?
> 
> Still can't understand how do_shmat() connects to RLIMIT_STACK, even
> indirectly. setup_arg_pages() obviously uses RLIMIT_STACK, but with or
> without the patch I sent do_shmat() behaviour does not depend on
> RLIMIT_STACK at all.

setup_arg_pages sets up start_stack member and obeys RLIMIT_STACK limit
so if there were some rlimit set the start_stack will point to space
allocated with known size, then do_shmat takes start_stack and checks
if there space left. So initial value of start_stack is limited to
rlimit setting. When setup_arg_pages is called the start_stack
has rlimit constrain, so it looks to me pretty natural to do the
same in prctl, no?

> > > > arg_start, arg_end, env_start, env_end
> > > >        really point to existing VMAs, strictly speaking the probgram can unmap
> > > >        all own VMAs except executable one and continue running without problem
> > > >        but this is not that practical I think and at first iteration I prefer
> > > >        more severe tests here on VMAs
> > >
> > > But, again, for what? There are only used to report this info via /proc/.
> >
> > Because it's close to how loading elf proceeds. This prctl is like emulating
> > micro-elf loader ;) Indeed there is no problem for kernel if all these members
> > are pointing to some already umapped VMAs,
> 
> Yes. And whatever checks you add into prctl(), an application can simply
> do mmap() before prctl() just to fool it, and unmap (say) arg_start right
> after that.

yes it can, but prctl on itsown actually not much usefull until you need
to mimic "another program" procfs output like we do in criu :)

> > but earlier we required cap-sysadmin
> > to be grated so that if someone calls for this prctl he must be knowing what
> > he is doing, and if some other unpredicted change in kernel code cause this
> > prctl to panic the kernel such action could be initiated by sysadmin only,
> > not regular user. Now I've released the security requirement so I thought
> > let be more paranoid and require all VMAs to present, all rlimits to be
> > in good shape and such even if the members we're modifying are used for
> > procfs statistics output mostly, we always have a way to relax this tests
> > but not the reverse. That was the main idea ;)
> 
> Following this logic you should also verify that KSTK_ESP(group_leader)
> falls into ->start_stack area or return an error otherwise.

And probably I should, I'll re-check.

> 
> As for elf loader, of course it sets ->start_stack/etc correctly and they
> can't point to nowehere, but this is only because it actually creates these
> segments. And I guess the c/r tools should do this "correctly" too, but
> prctl() itself can never verify this.
> 
> OK, lets look at this from another angle. Suppose that an application switches
> to another stack and unmaps ->start_stack area. Now, how are you going to
> restore it correctly if ->start_stack points to nowhere? Are you going
> to "fix" ->start_stack? I do not think this would be correct. Or, are you
> going to create the additional vma just to make prctl() happy?

Actually all the programs we were c/r'ing are "normal" ones who doesn't
mangle own contents but I fear yes, with current approach if some vma's
are missing prctl refuse to proceed. Oleg, I'm cooking the patch which
addresses all your comments and relaxes the restrictions I simply need
once again grep all members to be sure it's safe. Hopefully I'll finish
tomorrow and show the patch, still if there some other concerns please
notify me I'm happy to address any of them!

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree
  2014-08-23 16:33               ` Cyrill Gorcunov
@ 2014-08-23 19:29                 ` Oleg Nesterov
  2014-08-23 20:11                   ` Cyrill Gorcunov
  0 siblings, 1 reply; 12+ messages in thread
From: Oleg Nesterov @ 2014-08-23 19:29 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov,
	KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton,
	linux-kernel

On 08/23, Cyrill Gorcunov wrote:
>
> On Sat, Aug 23, 2014 at 05:32:22PM +0200, Oleg Nesterov wrote:
> >
> > And btw, where do you see RLIMIT_STACK in do_shmat() ?
>
> Indirectly, though start_stack pointer. We assign it in setup_arg_pages taking into
> account RLIMIT_STACK value, then do_shmat operates with start_stack pointer,
> thus when we setup some new value into start_stack we at least should
> not exceed old RLIMIT_STACK?

Still can't understand how do_shmat() connects to RLIMIT_STACK, even
indirectly. setup_arg_pages() obviously uses RLIMIT_STACK, but with or
without the patch I sent do_shmat() behaviour does not depend on
RLIMIT_STACK at all.

> > > arg_start, arg_end, env_start, env_end
> > >        really point to existing VMAs, strictly speaking the probgram can unmap
> > >        all own VMAs except executable one and continue running without problem
> > >        but this is not that practical I think and at first iteration I prefer
> > >        more severe tests here on VMAs
> >
> > But, again, for what? There are only used to report this info via /proc/.
>
> Because it's close to how loading elf proceeds. This prctl is like emulating
> micro-elf loader ;) Indeed there is no problem for kernel if all these members
> are pointing to some already umapped VMAs,

Yes. And whatever checks you add into prctl(), an application can simply
do mmap() before prctl() just to fool it, and unmap (say) arg_start right
after that.

> but earlier we required cap-sysadmin
> to be grated so that if someone calls for this prctl he must be knowing what
> he is doing, and if some other unpredicted change in kernel code cause this
> prctl to panic the kernel such action could be initiated by sysadmin only,
> not regular user. Now I've released the security requirement so I thought
> let be more paranoid and require all VMAs to present, all rlimits to be
> in good shape and such even if the members we're modifying are used for
> procfs statistics output mostly, we always have a way to relax this tests
> but not the reverse. That was the main idea ;)

Following this logic you should also verify that KSTK_ESP(group_leader)
falls into ->start_stack area or return an error otherwise.

As for elf loader, of course it sets ->start_stack/etc correctly and they
can't point to nowehere, but this is only because it actually creates these
segments. And I guess the c/r tools should do this "correctly" too, but
prctl() itself can never verify this.

OK, lets look at this from another angle. Suppose that an application switches
to another stack and unmaps ->start_stack area. Now, how are you going to
restore it correctly if ->start_stack points to nowhere? Are you going
to "fix" ->start_stack? I do not think this would be correct. Or, are you
going to create the additional vma just to make prctl() happy?

Oleg.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree
  2014-08-23 15:32             ` Oleg Nesterov
@ 2014-08-23 16:33               ` Cyrill Gorcunov
  2014-08-23 19:29                 ` Oleg Nesterov
  0 siblings, 1 reply; 12+ messages in thread
From: Cyrill Gorcunov @ 2014-08-23 16:33 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov,
	KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton,
	linux-kernel

On Sat, Aug 23, 2014 at 05:32:22PM +0200, Oleg Nesterov wrote:
> > >
> > > Besides, it can't help anyway. cred_guard_mutex is per-process (not per-thread),
> > > suppose that a vfork()'ed child does prctl() while another thread reads the
> > > parent's /proc/pid/auxv.
> >
> > Then either I need to use some other lock (not sure which one) either leave it
> > completely unlocked mentionin in the man page such lockless behaviour. Thoughts?
> 
> Personally I think "lockless" is the best choice (not sure man page should
> know about this detail). I mean, I think that we do not care if proc_pid_auxv()
> prints garbage if it races with ptctl().
> 
> Otherwise we have to use mmap_sem in proc_pid_auxv(), doesn't look nice.

I see, thanks a lot!

> 
> >  | > Stricktly speaking yes, but don't forget we might need to update
> >  | > exe::file as well which requires lock to be taken.
> >  |
> >  | For reading? I see prctl_set_mm_exe_file_locked() in this patch, probably
> >  | this function was added by another patch. But, if this function calls
> >  | set_mm_exe_file() (I guess it does?) then down_read() is not enough?
> >  | set_mm_exe_file() can race with itself.
> >
> > yes, for reading, look in set_mm_exe_file we lookup for vma which should
> > be not present when we change the link, and yes, because of read-only lock
> > this call can race but only one caller success there because we allow
> > to change exe link only once.
> 
> Ah, I forgot about MMF_EXE_FILE_CHANGED, thanks for correcting me.
> 
> (btw I think this check must die too, but this is off-topic and I was
>  wrong anyway).
> 
> OK, but I still think down_read(mmap_sem) is not enough. get_mm_exe_file()
> can do get_file() after prctl() paths do the final fput().
> 
> Or please look at tomoyo_get_exe(). Another thread can play with mm->exe_file
> fput().
> 
> Plus I am a bit worrried about inode_permission() under mmap_sem... but
> this is probably fine. Although you can never know which locks a creative
> filesystem/security module can take ;) But probably this is fine.

Thanks, I'll re-check.

> >  | But for what? Ignoring the (I think buggy) check in do_shmat() ->start_stack
> >  | is simply unused, we only report it via /proc/. The same for, say, mm->start_code.
> >
> > that't the good question if this check in do_shmat is buggy or not, why do
> > you think it's a bug there?
> 
> Please see the patch I sent.

Already ;)

> > Oleg, letme summarize all the concerns maybe there would be a way to handle
> > them gracefully
> >
> > 1) How code flows for now (with all fixes on top of current Andrew's queued patches)
> >
> >    - obtain struct prctl_mm_map from userspace
> >    - copy saved_auxv from userspace
> >    - down-read mmap_sem
> >    - validate all the data passed from userspace
> 
> I won't argue, but at least mmap_min/max_addr do not need mmap_sem.
> 
> >      - we need a reference to stack-vma for RLIMIT_STACK check (this is doable,
> >        as you said, but until we drop the RLIMIT_STACK from do_shmat I would
> >        prefer it to be here)
> 
> OK, I won't argue, but I think this is pointless and misleading.
> 
> And btw, where do you see RLIMIT_STACK in do_shmat() ?

Indirectly, though start_stack pointer. We assign it in setup_arg_pages taking into
account RLIMIT_STACK value, then do_shmat operates with start_stack pointer,
thus when we setup some new value into start_stack we at least should
not exceed old RLIMIT_STACK? This is not anyhow critial I think but I wanted
to do as maximal tests as possible.

> 
> >      - we need to be sure that start_brk, brk,
> 
> probably yes, simply because the kernel actually uses this members.
> 
> > arg_start, arg_end, env_start, env_end
> >        really point to existing VMAs, strictly speaking the probgram can unmap
> >        all own VMAs except executable one and continue running without problem
> >        but this is not that practical I think and at first iteration I prefer
> >        more severe tests here on VMAs
> 
> But, again, for what? There are only used to report this info via /proc/.

Because it's close to how loading elf proceeds. This prctl is like emulating
micro-elf loader ;) Indeed there is no problem for kernel if all these members
are pointing to some already umapped VMAs, but earlier we required cap-sysadmin
to be grated so that if someone calls for this prctl he must be knowing what
he is doing, and if some other unpredicted change in kernel code cause this
prctl to panic the kernel such action could be initiated by sysadmin only,
not regular user. Now I've released the security requirement so I thought
let be more paranoid and require all VMAs to present, all rlimits to be
in good shape and such even if the members we're modifying are used for
procfs statistics output mostly, we always have a way to relax this tests
but not the reverse. That was the main idea ;)

> >    - setup new mm::exe_file (we need to be sure the old exe_file is unmapped
> >      so mmap_sem read-lock is needed)
> 
> See above.
> 
> > Oleg, check please if I undersnad you correctly, you propose
> >
> >  - drop off mmap_sem completely
> 
> No, no, I didn't, we obviously can't do this.
> 
> >  - don't verify for RLIMIT_STACK
> 
> Yes, and more "don't verify". But again, I won't really argue. Just in my
> opinion almost all these checks looks misleading, confusing, and unnecessary.
> 
> Please think about those who will try to understand this code. A little
> comment like "this is not needed but we all are paranoid in openvz" could
> make it a bit more understandable ;)

Yeah, I think I should add this comment into the code.

> 
> >  - drop off task_lock when updating mm::saved_auxv but still invent
> >    how to prevent update/read race
> 
> Personally I think we can simply ignore this race.
> 
> But let me repeat, I won't argue with any approach as long as I think
> it is fine correctness wise.

I see, thanks a lot for comements Oleg!!

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree
  2014-08-23 14:18           ` Cyrill Gorcunov
@ 2014-08-23 15:32             ` Oleg Nesterov
  2014-08-23 16:33               ` Cyrill Gorcunov
  0 siblings, 1 reply; 12+ messages in thread
From: Oleg Nesterov @ 2014-08-23 15:32 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov,
	KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton,
	linux-kernel

On 08/23, Cyrill Gorcunov wrote:
>
> On Sat, Aug 23, 2014 at 03:30:01PM +0200, Oleg Nesterov wrote:
> >
> > On 08/23, Oleg Nesterov wrote:
> > >
> > > On 08/23, Cyrill Gorcunov wrote:
> > >
> > > > Looks like I need
> > > > to use cred_guard_mutex instead of task_lock here, no?
> > >
> > > Please don't. First of all, it can't help because proc_pid_auxv() doesn't hold
> > > this lock. It does mm_access() which drops this lock after return. And to remind,
> > > we are going to remove mm_access/lock_trace from sys_read() paths in proc.
> >
> > Besides, it can't help anyway. cred_guard_mutex is per-process (not per-thread),
> > suppose that a vfork()'ed child does prctl() while another thread reads the
> > parent's /proc/pid/auxv.
>
> Then either I need to use some other lock (not sure which one) either leave it
> completely unlocked mentionin in the man page such lockless behaviour. Thoughts?

Personally I think "lockless" is the best choice (not sure man page should
know about this detail). I mean, I think that we do not care if proc_pid_auxv()
prints garbage if it races with ptctl().

Otherwise we have to use mmap_sem in proc_pid_auxv(), doesn't look nice.

>  | > Stricktly speaking yes, but don't forget we might need to update
>  | > exe::file as well which requires lock to be taken.
>  |
>  | For reading? I see prctl_set_mm_exe_file_locked() in this patch, probably
>  | this function was added by another patch. But, if this function calls
>  | set_mm_exe_file() (I guess it does?) then down_read() is not enough?
>  | set_mm_exe_file() can race with itself.
>
> yes, for reading, look in set_mm_exe_file we lookup for vma which should
> be not present when we change the link, and yes, because of read-only lock
> this call can race but only one caller success there because we allow
> to change exe link only once.

Ah, I forgot about MMF_EXE_FILE_CHANGED, thanks for correcting me.

(btw I think this check must die too, but this is off-topic and I was
 wrong anyway).

OK, but I still think down_read(mmap_sem) is not enough. get_mm_exe_file()
can do get_file() after prctl() paths do the final fput().

Or please look at tomoyo_get_exe(). Another thread can play with mm->exe_file
fput().

Plus I am a bit worrried about inode_permission() under mmap_sem... but
this is probably fine. Although you can never know which locks a creative
filesystem/security module can take ;) But probably this is fine.

>  | But for what? Ignoring the (I think buggy) check in do_shmat() ->start_stack
>  | is simply unused, we only report it via /proc/. The same for, say, mm->start_code.
>
> that't the good question if this check in do_shmat is buggy or not, why do
> you think it's a bug there?

Please see the patch I sent.

> Oleg, letme summarize all the concerns maybe there would be a way to handle
> them gracefully
>
> 1) How code flows for now (with all fixes on top of current Andrew's queued patches)
>
>    - obtain struct prctl_mm_map from userspace
>    - copy saved_auxv from userspace
>    - down-read mmap_sem
>    - validate all the data passed from userspace

I won't argue, but at least mmap_min/max_addr do not need mmap_sem.

>      - we need a reference to stack-vma for RLIMIT_STACK check (this is doable,
>        as you said, but until we drop the RLIMIT_STACK from do_shmat I would
>        prefer it to be here)

OK, I won't argue, but I think this is pointless and misleading.

And btw, where do you see RLIMIT_STACK in do_shmat() ?

>      - we need to be sure that start_brk, brk,

probably yes, simply because the kernel actually uses this members.

> arg_start, arg_end, env_start, env_end
>        really point to existing VMAs, strictly speaking the probgram can unmap
>        all own VMAs except executable one and continue running without problem
>        but this is not that practical I think and at first iteration I prefer
>        more severe tests here on VMAs

But, again, for what? There are only used to report this info via /proc/.

>    - setup new mm::exe_file (we need to be sure the old exe_file is unmapped
>      so mmap_sem read-lock is needed)

See above.

> Oleg, check please if I undersnad you correctly, you propose
>
>  - drop off mmap_sem completely

No, no, I didn't, we obviously can't do this.

>  - don't verify for RLIMIT_STACK

Yes, and more "don't verify". But again, I won't really argue. Just in my
opinion almost all these checks looks misleading, confusing, and unnecessary.

Please think about those who will try to understand this code. A little
comment like "this is not needed but we all are paranoid in openvz" could
make it a bit more understandable ;)

>  - drop off task_lock when updating mm::saved_auxv but still invent
>    how to prevent update/read race

Personally I think we can simply ignore this race.

But let me repeat, I won't argue with any approach as long as I think
it is fine correctness wise.

Oleg.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree
  2014-08-23 13:30         ` Oleg Nesterov
@ 2014-08-23 14:18           ` Cyrill Gorcunov
  2014-08-23 15:32             ` Oleg Nesterov
  0 siblings, 1 reply; 12+ messages in thread
From: Cyrill Gorcunov @ 2014-08-23 14:18 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov,
	KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton,
	linux-kernel

On Sat, Aug 23, 2014 at 03:30:01PM +0200, Oleg Nesterov wrote:
> forgot to mention,
> 
> On 08/23, Oleg Nesterov wrote:
> >
> > On 08/23, Cyrill Gorcunov wrote:
> >
> > > Looks like I need
> > > to use cred_guard_mutex instead of task_lock here, no?
> >
> > Please don't. First of all, it can't help because proc_pid_auxv() doesn't hold
> > this lock. It does mm_access() which drops this lock after return. And to remind,
> > we are going to remove mm_access/lock_trace from sys_read() paths in proc.
> 
> Besides, it can't help anyway. cred_guard_mutex is per-process (not per-thread),
> suppose that a vfork()'ed child does prctl() while another thread reads the
> parent's /proc/pid/auxv.

Then either I need to use some other lock (not sure which one) either leave it
completely unlocked mentionin in the man page such lockless behaviour. Thoughts?

> Cyrill, I am sorry, but I am starting to think that this patch should be
> dropped and replaced by another version. Or do you think it would be better
> to send the fixes on top?

It's not a problem to drop this particular patch (together with all fixes on
top) one and replace it with new version (this looks like a better idea than
drowning lkml with series of small patches). I rather need to understand what
exactly should be done in new version. So from your previous email

 | > Stricktly speaking yes, but don't forget we might need to update
 | > exe::file as well which requires lock to be taken.
 |
 | For reading? I see prctl_set_mm_exe_file_locked() in this patch, probably
 | this function was added by another patch. But, if this function calls
 | set_mm_exe_file() (I guess it does?) then down_read() is not enough?
 | set_mm_exe_file() can race with itself.

yes, for reading, look in set_mm_exe_file we lookup for vma which should
be not present when we change the link, and yes, because of read-only lock
this call can race but only one caller success there because we allow
to change exe link only once.

 | But for what? Ignoring the (I think buggy) check in do_shmat() ->start_stack
 | is simply unused, we only report it via /proc/. The same for, say, mm->start_code.

that't the good question if this check in do_shmat is buggy or not, why do
you think it's a bug there?

Oleg, letme summarize all the concerns maybe there would be a way to handle
them gracefully

1) How code flows for now (with all fixes on top of current Andrew's queued patches)

   - obtain struct prctl_mm_map from userspace
   - copy saved_auxv from userspace
   - down-read mmap_sem
   - validate all the data passed from userspace
     - we need a reference to stack-vma for RLIMIT_STACK check (this is doable,
       as you said, but until we drop the RLIMIT_STACK from do_shmat I would
       prefer it to be here)
     - we need to be sure that start_brk, brk, arg_start, arg_end, env_start, env_end
       really point to existing VMAs, strictly speaking the probgram can unmap
       all own VMAs except executable one and continue running without problem
       but this is not that practical I think and at first iteration I prefer
       more severe tests here on VMAs
   - setup new mm::exe_file (we need to be sure the old exe_file is unmapped
     so mmap_sem read-lock is needed)
   - update mm::saved_auxv with new values
   - finally setup new members to struct mm_struct
   - up-read mmap_sem

2) The qustion is do we really need that read-lock would be taken for all this
   time? And my answer is yes because of how I implement the checks for start_brk
   and etc.

Oleg, check please if I undersnad you correctly, you propose

 - drop off mmap_sem completely
 - don't verify for RLIMIT_STACK
 - drop off task_lock when updating mm::saved_auxv but still invent
   how to prevent update/read race

right?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree
  2014-08-23 13:14       ` Oleg Nesterov
@ 2014-08-23 13:30         ` Oleg Nesterov
  2014-08-23 14:18           ` Cyrill Gorcunov
  0 siblings, 1 reply; 12+ messages in thread
From: Oleg Nesterov @ 2014-08-23 13:30 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov,
	KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton,
	linux-kernel

forgot to mention,

On 08/23, Oleg Nesterov wrote:
>
> On 08/23, Cyrill Gorcunov wrote:
>
> > Looks like I need
> > to use cred_guard_mutex instead of task_lock here, no?
>
> Please don't. First of all, it can't help because proc_pid_auxv() doesn't hold
> this lock. It does mm_access() which drops this lock after return. And to remind,
> we are going to remove mm_access/lock_trace from sys_read() paths in proc.

Besides, it can't help anyway. cred_guard_mutex is per-process (not per-thread),
suppose that a vfork()'ed child does prctl() while another thread reads the
parent's /proc/pid/auxv.

Cyrill, I am sorry, but I am starting to think that this patch should be
dropped and replaced by another version. Or do you think it would be better
to send the fixes on top?

Oleg.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree
  2014-08-23 12:22     ` Cyrill Gorcunov
@ 2014-08-23 13:14       ` Oleg Nesterov
  2014-08-23 13:30         ` Oleg Nesterov
  0 siblings, 1 reply; 12+ messages in thread
From: Oleg Nesterov @ 2014-08-23 13:14 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov,
	KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton,
	linux-kernel

On 08/23, Cyrill Gorcunov wrote:
>
> On Sat, Aug 23, 2014 at 01:53:02PM +0200, Oleg Nesterov wrote:
> > >
> > > It should protect from allocation/devetion/mergin of another vma. IOW when
> > > I lookup for vma I need to be sure it exist and won't disappear at least
> > > while I validate it.
> >
> > plus you need mmap_sem (at least for reading) when you update mm_struct,
> > this is clear.
> >
> > My question was why the whole function should be called under mmap_sem?
> > It could take it only around find_vma() + check(RLIMIT_STACK) ?
>
> Stricktly speaking yes, but don't forget we might need to update
> exe::file as well which requires lock to be taken.

For reading? I see prctl_set_mm_exe_file_locked() in this patch, probably
this function was added by another patch. But, if this function calls
set_mm_exe_file() (I guess it does?) then down_read() is not enough?
set_mm_exe_file() can race with itself.

And this still doesn't answer my question. As I said, I understand that
we need mmap_sem to update mm_struct, and this is what prctl_set_mm_map()
does at the end. And it also calls prctl_set_mm_exe_file_locked(),
validate_prctl_map_locked() doesn't do this.

> So it is simplier
> to take the read-lock for the whole function.

Still can't understand why validate_prctl_map_locked() should be called
under this lock. OK, I won't insist.

> > In fact I do not think we need this vma_stack/RLIMIT_STACK check at all.
> > It buys nithing and looks strange. RLIMIT_STACK is mostly for self-debugging,
> > to catch the, say, unlimited recursion. An application can trivially
> > create a stack region of arbitrary size. I'd seriously suggest to remove it.
>
> Look, allocate stack for self is not a problem (we do this for our parasite
> code which executes inside dumpee address space) but RLIMIT_STACK check is
> present in ipc shmem so I think we still need this check in a sake of
> consistency.

But for what? Ignoring the (I think buggy) check in do_shmat() ->start_stack
is simply unused, we only report it via /proc/. The same for, say, mm->start_code.

It seems that only start_brk/end_data/brk need some validation. Perhaps something
else, I didn't try to verify. So why do we need these confusing checks?

> > > > > +	if (prctl_map.auxv_size) {
> > > > > +		/* Last entry must be AT_NULL as specification requires */
> > > > > +		user_auxv[AT_VECTOR_SIZE - 2] = AT_NULL;
> > > > > +		user_auxv[AT_VECTOR_SIZE - 1] = AT_NULL;
> > > > > +
> > > > > +		task_lock(current);
> > > > > +		memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv));
> > > > > +		task_unlock(current);
> > > >
> > > > Again, could you explain this task_lock() ?
> > >
> > > It is used for serialization access to saved_auxv, ie when we fill it
> > > with new data the other reader (via procfs interface) should wait until
> > > we finish.
> >
> > But proc_pid_auxv() doesn't take this lock? And even if it did, this lock
> > can't help. task_lock() is per-thread, and multiple threads (including
> > CLONE_VM tasks, vfork() for example) can share the same ->mm.
> >
> > This certainly doesn't look right.
>
> It takes this lock

Where? Another patch I missed ? ;)

> but indeed this won't help much.

Yes, it can't help at all.

> Looks like I need
> to use cred_guard_mutex instead of task_lock here, no?

Please don't. First of all, it can't help because proc_pid_auxv() doesn't hold
this lock. It does mm_access() which drops this lock after return. And to remind,
we are going to remove mm_access/lock_trace from sys_read() paths in proc.

Oleg.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree
  2014-08-23 11:53   ` Oleg Nesterov
@ 2014-08-23 12:22     ` Cyrill Gorcunov
  2014-08-23 13:14       ` Oleg Nesterov
  0 siblings, 1 reply; 12+ messages in thread
From: Cyrill Gorcunov @ 2014-08-23 12:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov,
	KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton,
	linux-kernel

On Sat, Aug 23, 2014 at 01:53:02PM +0200, Oleg Nesterov wrote:
> >
> > It should protect from allocation/devetion/mergin of another vma. IOW when
> > I lookup for vma I need to be sure it exist and won't disappear at least
> > while I validate it.
> 
> plus you need mmap_sem (at least for reading) when you update mm_struct,
> this is clear.
> 
> My question was why the whole function should be called under mmap_sem?
> It could take it only around find_vma() + check(RLIMIT_STACK) ?

Stricktly speaking yes, but don't forget we might need to update
exe::file as well which requires lock to be taken. So it is simplier
to take the read-lock for the whole function.

> In fact I do not think we need this vma_stack/RLIMIT_STACK check at all.
> It buys nithing and looks strange. RLIMIT_STACK is mostly for self-debugging,
> to catch the, say, unlimited recursion. An application can trivially
> create a stack region of arbitrary size. I'd seriously suggest to remove it.

Look, allocate stack for self is not a problem (we do this for our parasite
code which executes inside dumpee address space) but RLIMIT_STACK check is
present in ipc shmem so I think we still need this check in a sake of
consistency. (note this code doesn't require any special caps so I need
to use as much checks/tests as possible).

> 
> > > > +	if (prctl_map.auxv_size) {
> > > > +		/* Last entry must be AT_NULL as specification requires */
> > > > +		user_auxv[AT_VECTOR_SIZE - 2] = AT_NULL;
> > > > +		user_auxv[AT_VECTOR_SIZE - 1] = AT_NULL;
> > > > +
> > > > +		task_lock(current);
> > > > +		memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv));
> > > > +		task_unlock(current);
> > >
> > > Again, could you explain this task_lock() ?
> >
> > It is used for serialization access to saved_auxv, ie when we fill it
> > with new data the other reader (via procfs interface) should wait until
> > we finish.
> 
> But proc_pid_auxv() doesn't take this lock? And even if it did, this lock
> can't help. task_lock() is per-thread, and multiple threads (including
> CLONE_VM tasks, vfork() for example) can share the same ->mm.
> 
> This certainly doesn't look right.

It takes this lock but indeed this won't help much. Looks like I need
to use cred_guard_mutex instead of task_lock here, no?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree
  2014-08-22 20:15 ` Cyrill Gorcunov
@ 2014-08-23 11:53   ` Oleg Nesterov
  2014-08-23 12:22     ` Cyrill Gorcunov
  0 siblings, 1 reply; 12+ messages in thread
From: Oleg Nesterov @ 2014-08-23 11:53 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov,
	KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton,
	linux-kernel

On 08/23, Cyrill Gorcunov wrote:
>
> On Fri, Aug 22, 2014 at 09:22:41PM +0200, Oleg Nesterov wrote:
> > Hi Cyrill,
> >
> > I think the patch is fine but I can't understand the usage of mmap_sem
> > and alloc_lock,
> >
> > > +	stack_vma = find_vma(mm, (unsigned long)prctl_map->start_stack);
> >
> > OK, find_vma() needs mmap_sem. But otherwise, why this should be called
> > under down_read(&mm->mmap_sem) ? What this lock tries to protect?
>
> It should protect from allocation/devetion/mergin of another vma. IOW when
> I lookup for vma I need to be sure it exist and won't disappear at least
> while I validate it.

plus you need mmap_sem (at least for reading) when you update mm_struct,
this is clear.

My question was why the whole function should be called under mmap_sem?
It could take it only around find_vma() + check(RLIMIT_STACK) ?

In fact I do not think we need this vma_stack/RLIMIT_STACK check at all.
It buys nithing and looks strange. RLIMIT_STACK is mostly for self-debugging,
to catch the, say, unlimited recursion. An application can trivially
create a stack region of arbitrary size. I'd seriously suggest to remove it.

> > > +	if (prctl_map.auxv_size) {
> > > +		/* Last entry must be AT_NULL as specification requires */
> > > +		user_auxv[AT_VECTOR_SIZE - 2] = AT_NULL;
> > > +		user_auxv[AT_VECTOR_SIZE - 1] = AT_NULL;
> > > +
> > > +		task_lock(current);
> > > +		memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv));
> > > +		task_unlock(current);
> >
> > Again, could you explain this task_lock() ?
>
> It is used for serialization access to saved_auxv, ie when we fill it
> with new data the other reader (via procfs interface) should wait until
> we finish.

But proc_pid_auxv() doesn't take this lock? And even if it did, this lock
can't help. task_lock() is per-thread, and multiple threads (including
CLONE_VM tasks, vfork() for example) can share the same ->mm.

This certainly doesn't look right.

Oleg.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree
  2014-08-22 19:22 Oleg Nesterov
@ 2014-08-22 20:15 ` Cyrill Gorcunov
  2014-08-23 11:53   ` Oleg Nesterov
  0 siblings, 1 reply; 12+ messages in thread
From: Cyrill Gorcunov @ 2014-08-22 20:15 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov,
	KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton,
	linux-kernel

On Fri, Aug 22, 2014 at 09:22:41PM +0200, Oleg Nesterov wrote:
> Hi Cyrill,
> 
> I think the patch is fine but I can't understand the usage of mmap_sem
> and alloc_lock,
> 
> > +	stack_vma = find_vma(mm, (unsigned long)prctl_map->start_stack);
> 
> OK, find_vma() needs mmap_sem. But otherwise, why this should be called
> under down_read(&mm->mmap_sem) ? What this lock tries to protect?

It should protect from allocation/devetion/mergin of another vma. IOW when
I lookup for vma I need to be sure it exist and won't disappear at least
while I validate it.

> > +	if (prctl_map.auxv_size) {
> > +		up_read(&mm->mmap_sem);
> > +		memset(user_auxv, 0, sizeof(user_auxv));
> > +		error = copy_from_user(user_auxv,
> > +				       (const void __user *)prctl_map.auxv,
> > +				       prctl_map.auxv_size);
> > +		down_read(&mm->mmap_sem);
> 
> And if we actually need this lock, why it is safe to drop it temporary?
> And why we can't move this copy_from_user() up before down_read) in any
> case?

Good point, thanks! I agreed that better to move this copying of user_auxv
before taking read-lock (once Andrew answer me my question about copying of
offsets in prctl_mm_map structure, I'll update on top).

> > +	if (prctl_map.auxv_size) {
> > +		/* Last entry must be AT_NULL as specification requires */
> > +		user_auxv[AT_VECTOR_SIZE - 2] = AT_NULL;
> > +		user_auxv[AT_VECTOR_SIZE - 1] = AT_NULL;
> > +
> > +		task_lock(current);
> > +		memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv));
> > +		task_unlock(current);
> 
> Again, could you explain this task_lock() ?

It is used for serialization access to saved_auxv, ie when we fill it
with new data the other reader (via procfs interface) should wait until
we finish.

Thanks for comments, Oleg!

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree
@ 2014-08-22 19:22 Oleg Nesterov
  2014-08-22 20:15 ` Cyrill Gorcunov
  0 siblings, 1 reply; 12+ messages in thread
From: Oleg Nesterov @ 2014-08-22 19:22 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Kees Cook, Tejun Heo, Andrew Vagin, Eric W. Biederman,
	H. Peter Anvin, Serge Hallyn, Pavel Emelyanov, Vasiliy Kulikov,
	KAMEZAWA Hiroyuki, Michael Kerrisk, Julien Tinnes, Andrew Morton,
	linux-kernel

Hi Cyrill,

I think the patch is fine but I can't understand the usage of mmap_sem
and alloc_lock,

> +	stack_vma = find_vma(mm, (unsigned long)prctl_map->start_stack);

OK, find_vma() needs mmap_sem. But otherwise, why this should be called
under down_read(&mm->mmap_sem) ? What this lock tries to protect?

> +	if (prctl_map.auxv_size) {
> +		up_read(&mm->mmap_sem);
> +		memset(user_auxv, 0, sizeof(user_auxv));
> +		error = copy_from_user(user_auxv,
> +				       (const void __user *)prctl_map.auxv,
> +				       prctl_map.auxv_size);
> +		down_read(&mm->mmap_sem);

And if we actually need this lock, why it is safe to drop it temporary?
And why we can't move this copy_from_user() up before down_read) in any
case?

> +	if (prctl_map.auxv_size) {
> +		/* Last entry must be AT_NULL as specification requires */
> +		user_auxv[AT_VECTOR_SIZE - 2] = AT_NULL;
> +		user_auxv[AT_VECTOR_SIZE - 1] = AT_NULL;
> +
> +		task_lock(current);
> +		memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv));
> +		task_unlock(current);

Again, could you explain this task_lock() ?

Oleg.


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2014-08-23 20:11 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-21 22:51 + prctl-pr_set_mm-introduce-pr_set_mm_map-operation-v3.patch added to -mm tree akpm
2014-08-22 19:22 Oleg Nesterov
2014-08-22 20:15 ` Cyrill Gorcunov
2014-08-23 11:53   ` Oleg Nesterov
2014-08-23 12:22     ` Cyrill Gorcunov
2014-08-23 13:14       ` Oleg Nesterov
2014-08-23 13:30         ` Oleg Nesterov
2014-08-23 14:18           ` Cyrill Gorcunov
2014-08-23 15:32             ` Oleg Nesterov
2014-08-23 16:33               ` Cyrill Gorcunov
2014-08-23 19:29                 ` Oleg Nesterov
2014-08-23 20:11                   ` Cyrill Gorcunov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.