* [PATCH 0/2] Broad write-locking of nascent mm in execve
@ 2020-10-02  1:23 Jann Horn
  2020-10-02  1:24 ` [PATCH 1/2] mmap locking API: Order lock of nascent mm outside lock of live mm Jann Horn
  2020-10-02  1:25 ` [PATCH 2/2] exec: Broadly lock nascent mm until setup_arg_pages() Jann Horn
  0 siblings, 2 replies; 9+ messages in thread
From: Jann Horn @ 2020-10-02  1:23 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: linux-kernel, Eric W . Biederman, Michel Lespinasse,
	Mauro Carvalho Chehab, Sakari Ailus, Jeff Dike,
	Richard Weinberger, Anton Ivanov, linux-um, Jason Gunthorpe,
	John Hubbard

These two patches replace "mmap locking API: don't check locking
if the mm isn't live yet"[1], which is currently in the mmotm tree,
and should be placed in the same spot where the old patch was.

While I originally said that this would be an alternative
patch (meaning that the existing patch would have worked just
as well), the new patches actually address an additional issue
that the old patch missed (bprm->vma is used after the switch
to the new mm).

I have boot-tested these patches on x86-64 (with lockdep) and
!MMU arm (the latter with both FLAT and ELF).

[1] https://lkml.kernel.org/r/CAG48ez03YJG9JU_6tGiMcaVjuTyRE_o4LEQ7901b5ZoCnNAjcg@mail.gmail.com

Jann Horn (2):
  mmap locking API: Order lock of nascent mm outside lock of live mm
  exec: Broadly lock nascent mm until setup_arg_pages()

 arch/um/include/asm/mmu_context.h |  3 +-
 fs/exec.c                         | 64 ++++++++++++++++---------------
 include/linux/binfmts.h           |  2 +-
 include/linux/mmap_lock.h         | 23 ++++++++++-
 kernel/fork.c                     |  7 +---
 5 files changed, 59 insertions(+), 40 deletions(-)


base-commit: fb0155a09b0224a7147cb07a4ce6034c8d29667f
prerequisite-patch-id: 08f97130a51898a5f6efddeeb5b42638577398c7
prerequisite-patch-id: 577664d761cd23fe9031ffdb1d3c9ac313572c67
prerequisite-patch-id: dc29a39716aa8689f80ba2767803d9df3709beaa
prerequisite-patch-id: 42b1b546d33391ead2753621f541bcc408af1769
prerequisite-patch-id: 2cbb839f57006f32e21f4229e099ae1bd782be24
prerequisite-patch-id: 1b4daf01cf61654a5ec54b5c3f7c7508be7244ee
prerequisite-patch-id: f46cc8c99f1909fe2a65fbc3cf1f6bc57489a086
prerequisite-patch-id: 2b0caed97223241d5008898dde995d02fda544e4
prerequisite-patch-id: 6b7adcb54989e1ec3370f256ff2c35d19cf785aa
-- 
2.28.0.806.g8561365e88-goog


* [PATCH 1/2] mmap locking API: Order lock of nascent mm outside lock of live mm
  2020-10-02  1:23 [PATCH 0/2] Broad write-locking of nascent mm in execve Jann Horn
@ 2020-10-02  1:24 ` Jann Horn
  2020-10-02  9:17   ` Michel Lespinasse
  2020-10-02  1:25 ` [PATCH 2/2] exec: Broadly lock nascent mm until setup_arg_pages() Jann Horn
  1 sibling, 1 reply; 9+ messages in thread
From: Jann Horn @ 2020-10-02  1:24 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: linux-kernel, Eric W . Biederman, Michel Lespinasse,
	Mauro Carvalho Chehab, Sakari Ailus, Jeff Dike,
	Richard Weinberger, Anton Ivanov, linux-um, Jason Gunthorpe,
	John Hubbard

Until now, the mmap lock of the nascent mm was ordered inside the mmap lock
of the old mm (in dup_mmap() and in UML's activate_mm()).
A following patch will change the exec path to very broadly lock the
nascent mm, but fine-grained locking should still work at the same time for
the new mm.
To do this in a way that lockdep is happy about, let's turn around the lock
ordering in both places that currently nest the locks.
Since SINGLE_DEPTH_NESTING is normally used for the inner nesting layer,
make up our own lock subclass MMAP_LOCK_SUBCLASS_NASCENT and use that
instead.

The added locking calls in exec_mmap() are temporary; the following patch
will move the locking out of exec_mmap().
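
For illustration, the nesting that this produces in dup_mmap() (and that
lockdep now has to accept) is roughly:

    mmap_write_lock_nascent(mm);        /* new mm, MMAP_LOCK_SUBCLASS_NASCENT */
    mmap_write_lock_killable(oldmm);    /* old mm, MMAP_LOCK_SUBCLASS_NORMAL */
    /* ... copy VMAs from oldmm into mm ... */
    mmap_write_unlock(oldmm);
    mmap_write_unlock(mm);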

Signed-off-by: Jann Horn <jannh@google.com>
---
 arch/um/include/asm/mmu_context.h |  3 +--
 fs/exec.c                         |  4 ++++
 include/linux/mmap_lock.h         | 23 +++++++++++++++++++++--
 kernel/fork.c                     |  7 ++-----
 4 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/arch/um/include/asm/mmu_context.h b/arch/um/include/asm/mmu_context.h
index 17ddd4edf875..c13bc5150607 100644
--- a/arch/um/include/asm/mmu_context.h
+++ b/arch/um/include/asm/mmu_context.h
@@ -48,9 +48,8 @@ static inline void activate_mm(struct mm_struct *old, struct mm_struct *new)
 	 * when the new ->mm is used for the first time.
 	 */
 	__switch_mm(&new->context.id);
-	mmap_write_lock_nested(new, SINGLE_DEPTH_NESTING);
+	mmap_assert_write_locked(new);
 	uml_setup_stubs(new);
-	mmap_write_unlock(new);
 }

 static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
diff --git a/fs/exec.c b/fs/exec.c
index a91003e28eaa..229dbc7aa61a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1114,6 +1114,8 @@ static int exec_mmap(struct mm_struct *mm)
 	if (ret)
 		return ret;

+	mmap_write_lock_nascent(mm);
+
 	if (old_mm) {
 		/*
 		 * Make sure that if there is a core dump in progress
@@ -1125,6 +1127,7 @@ static int exec_mmap(struct mm_struct *mm)
 		if (unlikely(old_mm->core_state)) {
 			mmap_read_unlock(old_mm);
 			mutex_unlock(&tsk->signal->exec_update_mutex);
+			mmap_write_unlock(mm);
 			return -EINTR;
 		}
 	}
@@ -1138,6 +1141,7 @@ static int exec_mmap(struct mm_struct *mm)
 	tsk->mm->vmacache_seqnum = 0;
 	vmacache_flush(tsk);
 	task_unlock(tsk);
+	mmap_write_unlock(mm);
 	if (old_mm) {
 		mmap_read_unlock(old_mm);
 		BUG_ON(active_mm != old_mm);
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 0707671851a8..24de1fe99ee4 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -3,6 +3,18 @@

 #include <linux/mmdebug.h>

+/*
+ * Lock subclasses for the mmap_lock.
+ *
+ * MMAP_LOCK_SUBCLASS_NASCENT is for core kernel code that wants to lock an mm
+ * that is still being constructed and wants to be able to access the active mm
+ * normally at the same time. It nests outside MMAP_LOCK_SUBCLASS_NORMAL.
+ */
+enum {
+	MMAP_LOCK_SUBCLASS_NORMAL = 0,
+	MMAP_LOCK_SUBCLASS_NASCENT
+};
+
 #define MMAP_LOCK_INITIALIZER(name) \
 	.mmap_lock = __RWSEM_INITIALIZER((name).mmap_lock),

@@ -16,9 +28,16 @@ static inline void mmap_write_lock(struct mm_struct *mm)
 	down_write(&mm->mmap_lock);
 }

-static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
+/*
+ * Lock an mm_struct that is still being set up (during fork or exec).
+ * This nests outside the mmap locks of live mm_struct instances.
+ * No interruptible/killable versions exist because at the points where you're
+ * supposed to use this helper, the mm isn't visible to anything else, so we
+ * expect the mmap_lock to be uncontended.
+ */
+static inline void mmap_write_lock_nascent(struct mm_struct *mm)
 {
-	down_write_nested(&mm->mmap_lock, subclass);
+	down_write_nested(&mm->mmap_lock, MMAP_LOCK_SUBCLASS_NASCENT);
 }

 static inline int mmap_write_lock_killable(struct mm_struct *mm)
diff --git a/kernel/fork.c b/kernel/fork.c
index da8d360fb032..db67eb4ac7bd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -474,6 +474,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	unsigned long charge;
 	LIST_HEAD(uf);

+	mmap_write_lock_nascent(mm);
 	uprobe_start_dup_mmap();
 	if (mmap_write_lock_killable(oldmm)) {
 		retval = -EINTR;
@@ -481,10 +482,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	}
 	flush_cache_dup_mm(oldmm);
 	uprobe_dup_mmap(oldmm, mm);
-	/*
-	 * Not linked in yet - no deadlock potential:
-	 */
-	mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING);

 	/* No ordering required: file already has been exposed. */
 	RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
@@ -600,12 +597,12 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	/* a new mm has just been created */
 	retval = arch_dup_mmap(oldmm, mm);
 out:
-	mmap_write_unlock(mm);
 	flush_tlb_mm(oldmm);
 	mmap_write_unlock(oldmm);
 	dup_userfaultfd_complete(&uf);
 fail_uprobe_end:
 	uprobe_end_dup_mmap();
+	mmap_write_unlock(mm);
 	return retval;
 fail_nomem_anon_vma_fork:
 	mpol_put(vma_policy(tmp));
-- 
2.28.0.806.g8561365e88-goog


* [PATCH 2/2] exec: Broadly lock nascent mm until setup_arg_pages()
  2020-10-02  1:23 [PATCH 0/2] Broad write-locking of nascent mm in execve Jann Horn
  2020-10-02  1:24 ` [PATCH 1/2] mmap locking API: Order lock of nascent mm outside lock of live mm Jann Horn
@ 2020-10-02  1:25 ` Jann Horn
  1 sibling, 0 replies; 9+ messages in thread
From: Jann Horn @ 2020-10-02  1:25 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: linux-kernel, Eric W . Biederman, Michel Lespinasse,
	Mauro Carvalho Chehab, Sakari Ailus, Jeff Dike,
	Richard Weinberger, Anton Ivanov, linux-um, Jason Gunthorpe,
	John Hubbard

AFAIK nothing can currently modify the VMA tree of a new mm before
userspace has started running under it; still, we should properly lock
the mm here anyway, both to keep lockdep happy when adding locking
assertions and to be safe in the future in case someone e.g. decides to
permit VMA-tree-mutating operations in process_madvise_behavior_valid().

The goal of this patch is to broadly lock the nascent mm in the exec path,
from around the time it is created all the way to the end of
setup_arg_pages() (because setup_arg_pages() accesses bprm->vma).
As long as the mm is write-locked, keep it around in bprm->mm, even after
it has been installed on the task (with an extra reference on the mm, to
reduce complexity in free_bprm()).
After setup_arg_pages(), we have to unlock the mm so that APIs such as
copy_to_user() will work in the following binfmt-specific setup code.
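
For illustration, the lifecycle of bprm->mm after this patch is roughly
(simplified, error paths omitted):

    /* bprm_mm_init(): create the nascent mm and keep it write-locked */
    bprm->mm = mm_alloc();
    mmap_write_lock_nascent(bprm->mm);

    /*
     * exec_mmap(): install the mm on the task; bprm->mm keeps an extra
     * reference and stays write-locked
     */
    mmget(mm);
    tsk->mm = mm;

    /* setup_arg_pages(): last user of bprm->vma; release everything */
    bprm->vma = NULL;
    mmap_write_unlock(mm);
    mmput(mm);
    bprm->mm = NULL;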

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Jann Horn <jannh@google.com>
---
 fs/exec.c               | 68 ++++++++++++++++++++---------------------
 include/linux/binfmts.h |  2 +-
 2 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 229dbc7aa61a..fe11d77e397a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -254,11 +254,6 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 		return -ENOMEM;
 	vma_set_anonymous(vma);

-	if (mmap_write_lock_killable(mm)) {
-		err = -EINTR;
-		goto err_free;
-	}
-
 	/*
 	 * Place the stack at the largest stack address the architecture
 	 * supports. Later, we'll move this to an appropriate place. We don't
@@ -276,12 +271,9 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 		goto err;

 	mm->stack_vm = mm->total_vm = 1;
-	mmap_write_unlock(mm);
 	bprm->p = vma->vm_end - sizeof(void *);
 	return 0;
 err:
-	mmap_write_unlock(mm);
-err_free:
 	bprm->vma = NULL;
 	vm_area_free(vma);
 	return err;
@@ -364,9 +356,9 @@ static int bprm_mm_init(struct linux_binprm *bprm)
 	struct mm_struct *mm = NULL;

 	bprm->mm = mm = mm_alloc();
-	err = -ENOMEM;
 	if (!mm)
-		goto err;
+		return -ENOMEM;
+	mmap_write_lock_nascent(mm);

 	/* Save current stack limit for all calculations made during exec. */
 	task_lock(current->group_leader);
@@ -374,17 +366,12 @@ static int bprm_mm_init(struct linux_binprm *bprm)
 	task_unlock(current->group_leader);

 	err = __bprm_mm_init(bprm);
-	if (err)
-		goto err;
-
-	return 0;
-
-err:
-	if (mm) {
-		bprm->mm = NULL;
-		mmdrop(mm);
-	}
+	if (!err)
+		return 0;

+	bprm->mm = NULL;
+	mmap_write_unlock(mm);
+	mmdrop(mm);
 	return err;
 }

@@ -735,6 +722,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 /*
  * Finalizes the stack vm_area_struct. The flags and permissions are updated,
  * the stack is optionally relocated, and some extra space is added.
+ * At the end of this, the mm_struct will be unlocked on success.
  */
 int setup_arg_pages(struct linux_binprm *bprm,
 		    unsigned long stack_top,
@@ -787,9 +775,6 @@ int setup_arg_pages(struct linux_binprm *bprm,
 		bprm->loader -= stack_shift;
 	bprm->exec -= stack_shift;

-	if (mmap_write_lock_killable(mm))
-		return -EINTR;
-
 	vm_flags = VM_STACK_FLAGS;

 	/*
@@ -807,7 +792,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	ret = mprotect_fixup(vma, &prev, vma->vm_start, vma->vm_end,
 			vm_flags);
 	if (ret)
-		goto out_unlock;
+		return ret;
 	BUG_ON(prev != vma);

 	if (unlikely(vm_flags & VM_EXEC)) {
@@ -819,7 +804,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	if (stack_shift) {
 		ret = shift_arg_pages(vma, stack_shift);
 		if (ret)
-			goto out_unlock;
+			return ret;
 	}

 	/* mprotect_fixup is overkill to remove the temporary stack flags */
@@ -846,11 +831,17 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	current->mm->start_stack = bprm->p;
 	ret = expand_stack(vma, stack_base);
 	if (ret)
-		ret = -EFAULT;
+		return -EFAULT;

-out_unlock:
+	/*
+	 * From this point on, anything that wants to poke around in the
+	 * mm_struct must lock it by itself.
+	 */
+	bprm->vma = NULL;
 	mmap_write_unlock(mm);
-	return ret;
+	mmput(mm);
+	bprm->mm = NULL;
+	return 0;
 }
 EXPORT_SYMBOL(setup_arg_pages);

@@ -1114,8 +1105,6 @@ static int exec_mmap(struct mm_struct *mm)
 	if (ret)
 		return ret;

-	mmap_write_lock_nascent(mm);
-
 	if (old_mm) {
 		/*
 		 * Make sure that if there is a core dump in progress
@@ -1127,11 +1116,12 @@ static int exec_mmap(struct mm_struct *mm)
 		if (unlikely(old_mm->core_state)) {
 			mmap_read_unlock(old_mm);
 			mutex_unlock(&tsk->signal->exec_update_mutex);
-			mmap_write_unlock(mm);
 			return -EINTR;
 		}
 	}

+	/* bprm->mm stays refcounted, current->mm takes an extra reference */
+	mmget(mm);
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
 	membarrier_exec_mmap(mm);
@@ -1141,7 +1131,6 @@ static int exec_mmap(struct mm_struct *mm)
 	tsk->mm->vmacache_seqnum = 0;
 	vmacache_flush(tsk);
 	task_unlock(tsk);
-	mmap_write_unlock(mm);
 	if (old_mm) {
 		mmap_read_unlock(old_mm);
 		BUG_ON(active_mm != old_mm);
@@ -1397,8 +1386,6 @@ int begin_new_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;

-	bprm->mm = NULL;
-
 #ifdef CONFIG_POSIX_TIMERS
 	exit_itimers(me->signal);
 	flush_itimer_signals();
@@ -1545,6 +1532,18 @@ void setup_new_exec(struct linux_binprm * bprm)
 	me->mm->task_size = TASK_SIZE;
 	mutex_unlock(&me->signal->exec_update_mutex);
 	mutex_unlock(&me->signal->cred_guard_mutex);
+
+#ifndef CONFIG_MMU
+	/*
+	 * On MMU, setup_arg_pages() wants to access bprm->vma after this point,
+	 * so we can't drop the mmap lock yet.
+	 * On !MMU, we have neither setup_arg_pages() nor bprm->vma, so we
+	 * should drop the lock here.
+	 */
+	mmap_write_unlock(bprm->mm);
+	mmput(bprm->mm);
+	bprm->mm = NULL;
+#endif
 }
 EXPORT_SYMBOL(setup_new_exec);

@@ -1581,6 +1580,7 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	if (bprm->mm) {
 		acct_arg_size(bprm, 0);
+		mmap_write_unlock(bprm->mm);
 		mmput(bprm->mm);
 	}
 	free_arg_pages(bprm);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 0571701ab1c5..3bf06212fbae 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -22,7 +22,7 @@ struct linux_binprm {
 # define MAX_ARG_PAGES	32
 	struct page *page[MAX_ARG_PAGES];
 #endif
-	struct mm_struct *mm;
+	struct mm_struct *mm; /* nascent mm, write-locked */
 	unsigned long p; /* current top of mem */
 	unsigned long argmin; /* rlimit marker for copy_strings() */
 	unsigned int
-- 
2.28.0.806.g8561365e88-goog


* Re: [PATCH 1/2] mmap locking API: Order lock of nascent mm outside lock of live mm
  2020-10-02  1:24 ` [PATCH 1/2] mmap locking API: Order lock of nascent mm outside lock of live mm Jann Horn
@ 2020-10-02  9:17   ` Michel Lespinasse
  2020-10-02 11:39     ` Jason Gunthorpe
  2020-10-02 16:33     ` Jann Horn
  0 siblings, 2 replies; 9+ messages in thread
From: Michel Lespinasse @ 2020-10-02  9:17 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, linux-mm, LKML, Eric W . Biederman,
	Mauro Carvalho Chehab, Sakari Ailus, Jeff Dike,
	Richard Weinberger, Anton Ivanov, linux-um, Jason Gunthorpe,
	John Hubbard

On Thu, Oct 1, 2020 at 6:25 PM Jann Horn <jannh@google.com> wrote:
> Until now, the mmap lock of the nascent mm was ordered inside the mmap lock
> of the old mm (in dup_mmap() and in UML's activate_mm()).
> A following patch will change the exec path to very broadly lock the
> nascent mm, but fine-grained locking should still work at the same time for
> the new mm.
> To do this in a way that lockdep is happy about, let's turn around the lock
> ordering in both places that currently nest the locks.
> Since SINGLE_DEPTH_NESTING is normally used for the inner nesting layer,
> make up our own lock subclass MMAP_LOCK_SUBCLASS_NASCENT and use that
> instead.
>
> The added locking calls in exec_mmap() are temporary; the following patch
> will move the locking out of exec_mmap().

Thanks for doing this.

This is probably a silly question, but I am not sure exactly where we
lock the old MM while bprm is creating the new MM ? I am guessing this
would be only in setup_arg_pages(), copying the args and environment
from the old to the new MM ? If that is correct, then wouldn't it be
sufficient to use mmap_write_lock_nested in setup_arg_pages() ? Or, is
the issue that we'd prefer to have a killable version of it there ?

Also FYI I was going to play with these patches a bit to help answer
these questions on my own, but wasn't able to easily apply them as
they came lightly mangled (whitespace issues) when I saved them.


* Re: [PATCH 1/2] mmap locking API: Order lock of nascent mm outside lock of live mm
  2020-10-02  9:17   ` Michel Lespinasse
@ 2020-10-02 11:39     ` Jason Gunthorpe
  2020-10-02 16:33     ` Jann Horn
  1 sibling, 0 replies; 9+ messages in thread
From: Jason Gunthorpe @ 2020-10-02 11:39 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Jann Horn, Andrew Morton, linux-mm, LKML, Eric W . Biederman,
	Mauro Carvalho Chehab, Sakari Ailus, Jeff Dike,
	Richard Weinberger, Anton Ivanov, linux-um, John Hubbard

On Fri, Oct 02, 2020 at 02:17:49AM -0700, Michel Lespinasse wrote:
> Also FYI I was going to play with these patches a bit to help answer
> these questions on my own, but wasn't able to easily apply them as
> they came lightly mangled (whitespace issues) when I saved them.

Me too

It seems OK, you've created sort of a SINGLE_DEPTH_NESTING but in
reverse - instead of marking the child of the nest it marks the
parent.
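
Roughly, just to spell out the contrast (illustrative names, not code
from the series):

    /* conventional nesting: annotate the inner (child) lock */
    down_write(&parent->mmap_lock);
    down_write_nested(&child->mmap_lock, SINGLE_DEPTH_NESTING);

    /* this series: annotate the outer (nascent) lock instead */
    down_write_nested(&nascent->mmap_lock, MMAP_LOCK_SUBCLASS_NASCENT);
    down_write(&live->mmap_lock);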

It would be nice to add a note in the commit message about where the
nesting happens on this path.

Thanks,
Jason


* Re: [PATCH 1/2] mmap locking API: Order lock of nascent mm outside lock of live mm
  2020-10-02  9:17   ` Michel Lespinasse
  2020-10-02 11:39     ` Jason Gunthorpe
@ 2020-10-02 16:33     ` Jann Horn
  2020-10-03 21:30       ` Michel Lespinasse
  1 sibling, 1 reply; 9+ messages in thread
From: Jann Horn @ 2020-10-02 16:33 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrew Morton, linux-mm, LKML, Eric W . Biederman,
	Mauro Carvalho Chehab, Sakari Ailus, Jeff Dike,
	Richard Weinberger, Anton Ivanov, linux-um, Jason Gunthorpe,
	John Hubbard

On Fri, Oct 2, 2020 at 11:18 AM Michel Lespinasse <walken@google.com> wrote:
> On Thu, Oct 1, 2020 at 6:25 PM Jann Horn <jannh@google.com> wrote:
> > Until now, the mmap lock of the nascent mm was ordered inside the mmap lock
> > of the old mm (in dup_mmap() and in UML's activate_mm()).
> > A following patch will change the exec path to very broadly lock the
> > nascent mm, but fine-grained locking should still work at the same time for
> > the new mm.
> > To do this in a way that lockdep is happy about, let's turn around the lock
> > ordering in both places that currently nest the locks.
> > Since SINGLE_DEPTH_NESTING is normally used for the inner nesting layer,
> > make up our own lock subclass MMAP_LOCK_SUBCLASS_NASCENT and use that
> > instead.
> >
> > The added locking calls in exec_mmap() are temporary; the following patch
> > will move the locking out of exec_mmap().
>
> Thanks for doing this.
>
> This is probably a silly question, but I am not sure exactly where we
> lock the old MM while bprm is creating the new MM ? I am guessing this
> would be only in setup_arg_pages(), copying the args and environment
> > from the old to the new MM ? If that is correct, then wouldn't it be
> sufficient to use mmap_write_lock_nested in setup_arg_pages() ? Or, is
> the issue that we'd prefer to have a killable version of it there ?

We're also implicitly locking the old MM any time we take page faults
before exec_mmap(), which basically means the various userspace memory
accesses in do_execveat_common(). Those accesses happen after
bprm_mm_init(), so bprm->vma has already been set up at that point.
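
So, schematically, what lockdep sees during e.g. copy_strings() is
(sketch only):

    /* held since bprm_mm_init(): mmap_write_lock_nascent(bprm->mm) */
    copy_from_user(kaddr + offset, str, bytes_to_copy);
    /*
     * a fault here takes mmap_read_lock() on the *old* mm
     * (current->mm), nested inside the nascent mm's write lock
     */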

> Also FYI I was going to play with these patches a bit to help answer
> these questions on my own, but wasn't able to easily apply them as
> they came lightly mangled (whitespace issues) when I saved them.

Uuugh, dammit, I see what happened. Sorry about the trouble. Thanks
for telling me, guess I'll go back to sending patches the way I did it
before. :/

I guess I'll go make a v2 of this with some extra comment about where
the old MM is accessed, as Jason suggested, and without the whitespace
issues?


* Re: [PATCH 1/2] mmap locking API: Order lock of nascent mm outside lock of live mm
  2020-10-02 16:33     ` Jann Horn
@ 2020-10-03 21:30       ` Michel Lespinasse
  2020-10-05  1:30         ` Jann Horn
  0 siblings, 1 reply; 9+ messages in thread
From: Michel Lespinasse @ 2020-10-03 21:30 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, linux-mm, LKML, Eric W . Biederman,
	Mauro Carvalho Chehab, Sakari Ailus, Jeff Dike,
	Richard Weinberger, Anton Ivanov, linux-um, Jason Gunthorpe,
	John Hubbard

On Fri, Oct 2, 2020 at 9:33 AM Jann Horn <jannh@google.com> wrote:
> On Fri, Oct 2, 2020 at 11:18 AM Michel Lespinasse <walken@google.com> wrote:
> > On Thu, Oct 1, 2020 at 6:25 PM Jann Horn <jannh@google.com> wrote:
> > > Until now, the mmap lock of the nascent mm was ordered inside the mmap lock
> > > of the old mm (in dup_mmap() and in UML's activate_mm()).
> > > A following patch will change the exec path to very broadly lock the
> > > nascent mm, but fine-grained locking should still work at the same time for
> > > the new mm.
> > > To do this in a way that lockdep is happy about, let's turn around the lock
> > > ordering in both places that currently nest the locks.
> > > Since SINGLE_DEPTH_NESTING is normally used for the inner nesting layer,
> > > make up our own lock subclass MMAP_LOCK_SUBCLASS_NASCENT and use that
> > > instead.
> > >
> > > The added locking calls in exec_mmap() are temporary; the following patch
> > > will move the locking out of exec_mmap().
> >
> > Thanks for doing this.
> >
> > This is probably a silly question, but I am not sure exactly where we
> > lock the old MM while bprm is creating the new MM ? I am guessing this
> > would be only in setup_arg_pages(), copying the args and environment
> > from the old to the new MM ? If that is correct, then wouldn't it be
> > sufficient to use mmap_write_lock_nested in setup_arg_pages() ? Or, is
> > the issue that we'd prefer to have a killable version of it there ?
>
> We're also implicitly locking the old MM anytime we take page faults
> before exec_mmap(), which basically means the various userspace memory
> accesses in do_execveat_common(). This happens after bprm_mm_init(),
> so we've already set bprm->vma at that point.

Ah yes, I see the issue now. It would be much nicer if copy_strings
could coax copy_from_user into taking a nested lock, but of course
there is no way to do that.

I'm not sure if it'd be reasonable to kmap the source pages like we do
for the destination pages ?
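
(Purely as a hypothetical sketch of that idea -- pin the source page
from the old mm up front, so the copy itself never faults while the
nascent lock is held; the names and flow here are made up for
illustration:)

    struct page *src_page;
    void *src_kaddr;

    /* faults into and locks the old mm here, before touching the new mm */
    if (get_user_pages_fast((unsigned long)str, 1, 0, &src_page) != 1)
        return -EFAULT;
    src_kaddr = kmap(src_page);
    /* ignoring, for simplicity, copies that cross a source page boundary */
    memcpy(kaddr + offset, src_kaddr + offset_in_page(str), bytes_to_copy);
    kunmap(src_page);
    put_page(src_page);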

Adding a nascent lock instead of a nested lock, as you propose, seems
to work, but it also looks quite unusual. Not that I have anything
better to propose at this point though...


Unrelated to the above: copy_from_user and copy_to_user should not be
called with mmap_lock held; it may be worth adding these assertions
too (probably in separate patches) ?


> Uuugh, dammit, I see what happened. Sorry about the trouble. Thanks
> for telling me, guess I'll go back to sending patches the way I did it
> before. :/

Yeah, I've hit such issues with gmail before too :/

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


* Re: [PATCH 1/2] mmap locking API: Order lock of nascent mm outside lock of live mm
  2020-10-03 21:30       ` Michel Lespinasse
@ 2020-10-05  1:30         ` Jann Horn
  2020-10-05 12:52           ` Jason Gunthorpe
  0 siblings, 1 reply; 9+ messages in thread
From: Jann Horn @ 2020-10-05  1:30 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Andrew Morton, linux-mm, LKML, Eric W . Biederman,
	Mauro Carvalho Chehab, Sakari Ailus, Jeff Dike,
	Richard Weinberger, Anton Ivanov, linux-um, Jason Gunthorpe,
	John Hubbard

On Sat, Oct 3, 2020 at 11:30 PM Michel Lespinasse <walken@google.com> wrote:
> Unrelated to the above: copy_from_user and copy_to_user should not be
> called with mmap_lock held; it may be worth adding these assertions
> too (probably in separate patches) ?

We already have that: all (hopefully?) the userspace accessors call
might_fault(), and that does might_lock_read(&current->mm->mmap_lock)
(unless we're running in a lazytlb kernel thread, KERNEL_DS is on,
we're in IRQ context, or page faults have explicitly been disabled).
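
Schematically, the effect is something like this (a sketch of what the
annotation boils down to, not the exact code):

    if (!uaccess_kernel() && !pagefault_disabled() &&
        !in_interrupt() && current->mm)
        might_lock_read(&current->mm->mmap_lock);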


But another place where lockdep asserts should be added is find_vma();
there are currently several architectures that sometimes improperly
call that with no lock held:

SPARC's arch_validate_prot():
https://lore.kernel.org/linux-mm/CAG48ez3YsfTfOFKa-Po58e4PNp7FK54MFbkK3aUPSRt3LWtxQA@mail.gmail.com/

nios2 sys_cacheflush():
https://lore.kernel.org/linux-mm/CAG48ez3hxeXU29UGWRH-gRXX2jb5Lc==npbXFt8UDrWO4eHZdQ@mail.gmail.com/

nds32 sys_cacheflush():
https://lore.kernel.org/linux-mm/CAG48ez1UnQEMok9rqFQC4XHBaMmBe=eaedu8Z_RXdjFHTna_LA@mail.gmail.com/
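
(For reference, such an assert would presumably look something like
this -- a sketch, not a concrete patch:)

    struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
    {
        mmap_assert_locked(mm); /* catch callers holding no mmap lock */
        /* ... existing vmacache / rbtree lookup ... */
    }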


* Re: [PATCH 1/2] mmap locking API: Order lock of nascent mm outside lock of live mm
  2020-10-05  1:30         ` Jann Horn
@ 2020-10-05 12:52           ` Jason Gunthorpe
  0 siblings, 0 replies; 9+ messages in thread
From: Jason Gunthorpe @ 2020-10-05 12:52 UTC (permalink / raw)
  To: Jann Horn
  Cc: Michel Lespinasse, Andrew Morton, linux-mm, LKML,
	Eric W . Biederman, Mauro Carvalho Chehab, Sakari Ailus,
	Jeff Dike, Richard Weinberger, Anton Ivanov, linux-um,
	John Hubbard

On Mon, Oct 05, 2020 at 03:30:43AM +0200, Jann Horn wrote:
> But another place where lockdep asserts should be added is find_vma();
> there are currently several architectures that sometimes improperly
> call that with no lock held:

Yes, I've seen several cases of this mis-use in drivers too

Jason

