* [PATCH v2 0/2] Broad write-locking of nascent mm in execve
From: Jann Horn
To: Andrew Morton, linux-mm
Cc: linux-kernel, Eric W. Biederman, Michel Lespinasse, Mauro Carvalho Chehab, Sakari Ailus, Jeff Dike, Richard Weinberger, Anton Ivanov, linux-um, Jason Gunthorpe, John Hubbard
Date: 2020-10-06 22:54 UTC

v2:
 - fix commit message of patch 1/2 and be more verbose about where the
   old mmap lock is taken (Michel, Jason)
 - resending without mangling the diffs :/ (Michel, Jason)

These two patches replace "mmap locking API: don't check locking if the mm
isn't live yet"[1], which is currently in the mmotm tree, and should be
placed in the same spot where the old patch was.

While I originally said that this would be an alternative patch (meaning
that the existing patch would have worked just as well), the new patches
actually address an additional issue that the old patch missed (bprm->vma
is used after the switch to the new mm).

I have boot-tested these patches on x86-64 (with lockdep) and !MMU arm
(the latter with both FLAT and ELF).

[1] https://lkml.kernel.org/r/CAG48ez03YJG9JU_6tGiMcaVjuTyRE_o4LEQ7901b5ZoCnNAjcg@mail.gmail.com

Jann Horn (2):
  mmap locking API: Order lock of nascent mm outside lock of live mm
  exec: Broadly lock nascent mm until setup_arg_pages()

 arch/um/include/asm/mmu_context.h |  3 +-
 fs/exec.c                         | 64 ++++++++++++++++---------------
 include/linux/binfmts.h           |  2 +-
 include/linux/mmap_lock.h         | 23 ++++++++++-
 kernel/fork.c                     |  7 +---
 5 files changed, 59 insertions(+), 40 deletions(-)

base-commit: fb0155a09b0224a7147cb07a4ce6034c8d29667f
prerequisite-patch-id: 08f97130a51898a5f6efddeeb5b42638577398c7
prerequisite-patch-id: 577664d761cd23fe9031ffdb1d3c9ac313572c67
prerequisite-patch-id: dc29a39716aa8689f80ba2767803d9df3709beaa
prerequisite-patch-id: 42b1b546d33391ead2753621f541bcc408af1769
prerequisite-patch-id: 2cbb839f57006f32e21f4229e099ae1bd782be24
prerequisite-patch-id: 1b4daf01cf61654a5ec54b5c3f7c7508be7244ee
prerequisite-patch-id: f46cc8c99f1909fe2a65fbc3cf1f6bc57489a086
prerequisite-patch-id: 2b0caed97223241d5008898dde995d02fda544e4
prerequisite-patch-id: 6b7adcb54989e1ec3370f256ff2c35d19cf785aa
-- 
2.28.0.806.g8561365e88-goog
* [PATCH v2 1/2] mmap locking API: Order lock of nascent mm outside lock of live mm
From: Jann Horn
To: Andrew Morton, linux-mm
Cc: linux-kernel, Eric W. Biederman, Michel Lespinasse, Mauro Carvalho Chehab, Sakari Ailus, Jeff Dike, Richard Weinberger, Anton Ivanov, linux-um, Jason Gunthorpe, John Hubbard
Date: 2020-10-06 22:54 UTC

Until now, the mmap lock of the nascent mm was ordered inside the mmap lock
of the old mm (in dup_mmap() and in UML's activate_mm()).
A following patch will change the exec path to very broadly lock the
nascent mm, but fine-grained locking should still work at the same time for
the old mm.

In particular, mmap locking calls are hidden behind the copy_from_user()
calls and such that are reached through functions like copy_strings() -
when a page fault occurs on a userspace memory access, the mmap lock
will be taken.

To do this in a way that lockdep is happy about, let's turn around the lock
ordering in both places that currently nest the locks.
Since SINGLE_DEPTH_NESTING is normally used for the inner nesting layer,
make up our own lock subclass MMAP_LOCK_SUBCLASS_NASCENT and use that
instead.

The added locking calls in exec_mmap() are temporary; the following patch
will move the locking out of exec_mmap().

Signed-off-by: Jann Horn <jannh@google.com>
---
 arch/um/include/asm/mmu_context.h |  3 +--
 fs/exec.c                         |  4 ++++
 include/linux/mmap_lock.h         | 23 +++++++++++++++++++++--
 kernel/fork.c                     |  7 ++-----
 4 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/arch/um/include/asm/mmu_context.h b/arch/um/include/asm/mmu_context.h
index 17ddd4edf875..c13bc5150607 100644
--- a/arch/um/include/asm/mmu_context.h
+++ b/arch/um/include/asm/mmu_context.h
@@ -48,9 +48,8 @@ static inline void activate_mm(struct mm_struct *old, struct mm_struct *new)
 	 * when the new ->mm is used for the first time.
 	 */
 	__switch_mm(&new->context.id);
-	mmap_write_lock_nested(new, SINGLE_DEPTH_NESTING);
+	mmap_assert_write_locked(new);
 	uml_setup_stubs(new);
-	mmap_write_unlock(new);
 }
 
 static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
diff --git a/fs/exec.c b/fs/exec.c
index a91003e28eaa..229dbc7aa61a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1114,6 +1114,8 @@ static int exec_mmap(struct mm_struct *mm)
 	if (ret)
 		return ret;
 
+	mmap_write_lock_nascent(mm);
+
 	if (old_mm) {
 		/*
 		 * Make sure that if there is a core dump in progress
@@ -1125,6 +1127,7 @@ static int exec_mmap(struct mm_struct *mm)
 		if (unlikely(old_mm->core_state)) {
 			mmap_read_unlock(old_mm);
 			mutex_unlock(&tsk->signal->exec_update_mutex);
+			mmap_write_unlock(mm);
 			return -EINTR;
 		}
 	}
@@ -1138,6 +1141,7 @@ static int exec_mmap(struct mm_struct *mm)
 	tsk->mm->vmacache_seqnum = 0;
 	vmacache_flush(tsk);
 	task_unlock(tsk);
+	mmap_write_unlock(mm);
 	if (old_mm) {
 		mmap_read_unlock(old_mm);
 		BUG_ON(active_mm != old_mm);
diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 0707671851a8..24de1fe99ee4 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -3,6 +3,18 @@
 
 #include <linux/mmdebug.h>
 
+/*
+ * Lock subclasses for the mmap_lock.
+ *
+ * MMAP_LOCK_SUBCLASS_NASCENT is for core kernel code that wants to lock an mm
+ * that is still being constructed and wants to be able to access the active mm
+ * normally at the same time. It nests outside MMAP_LOCK_SUBCLASS_NORMAL.
+ */
+enum {
+	MMAP_LOCK_SUBCLASS_NORMAL = 0,
+	MMAP_LOCK_SUBCLASS_NASCENT
+};
+
 #define MMAP_LOCK_INITIALIZER(name) \
 	.mmap_lock = __RWSEM_INITIALIZER((name).mmap_lock),
 
@@ -16,9 +28,16 @@ static inline void mmap_write_lock(struct mm_struct *mm)
 	down_write(&mm->mmap_lock);
 }
 
-static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
+/*
+ * Lock an mm_struct that is still being set up (during fork or exec).
+ * This nests outside the mmap locks of live mm_struct instances.
+ * No interruptible/killable versions exist because at the points where you're
+ * supposed to use this helper, the mm isn't visible to anything else, so we
+ * expect the mmap_lock to be uncontended.
+ */
+static inline void mmap_write_lock_nascent(struct mm_struct *mm)
 {
-	down_write_nested(&mm->mmap_lock, subclass);
+	down_write_nested(&mm->mmap_lock, MMAP_LOCK_SUBCLASS_NASCENT);
 }
 
 static inline int mmap_write_lock_killable(struct mm_struct *mm)
diff --git a/kernel/fork.c b/kernel/fork.c
index da8d360fb032..db67eb4ac7bd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -474,6 +474,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	unsigned long charge;
 	LIST_HEAD(uf);
 
+	mmap_write_lock_nascent(mm);
 	uprobe_start_dup_mmap();
 	if (mmap_write_lock_killable(oldmm)) {
 		retval = -EINTR;
@@ -481,10 +482,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	}
 	flush_cache_dup_mm(oldmm);
 	uprobe_dup_mmap(oldmm, mm);
-	/*
-	 * Not linked in yet - no deadlock potential:
-	 */
-	mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING);
 
 	/* No ordering required: file already has been exposed. */
 	RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
@@ -600,12 +597,12 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	/* a new mm has just been created */
 	retval = arch_dup_mmap(oldmm, mm);
 out:
-	mmap_write_unlock(mm);
 	flush_tlb_mm(oldmm);
 	mmap_write_unlock(oldmm);
 	dup_userfaultfd_complete(&uf);
 fail_uprobe_end:
 	uprobe_end_dup_mmap();
+	mmap_write_unlock(mm);
 	return retval;
 fail_nomem_anon_vma_fork:
 	mpol_put(vma_policy(tmp));
-- 
2.28.0.806.g8561365e88-goog
* Re: [PATCH v2 1/2] mmap locking API: Order lock of nascent mm outside lock of live mm
From: Johannes Berg
To: Jann Horn, Andrew Morton, linux-mm
Cc: Michel Lespinasse, Jason Gunthorpe, Richard Weinberger, Jeff Dike, linux-um, linux-kernel, Eric W. Biederman, Sakari Ailus, John Hubbard, Mauro Carvalho Chehab, Anton Ivanov
Date: 2020-10-07 7:42 UTC

On Wed, 2020-10-07 at 00:54 +0200, Jann Horn wrote:
> Until now, the mmap lock of the nascent mm was ordered inside the mmap lock
> of the old mm (in dup_mmap() and in UML's activate_mm()).
> A following patch will change the exec path to very broadly lock the
> nascent mm, but fine-grained locking should still work at the same time for
> the old mm.
> 
> In particular, mmap locking calls are hidden behind the copy_from_user()
> calls and such that are reached through functions like copy_strings() -
> when a page fault occurs on a userspace memory access, the mmap lock
> will be taken.
> 
> To do this in a way that lockdep is happy about, let's turn around the lock
> ordering in both places that currently nest the locks.
> Since SINGLE_DEPTH_NESTING is normally used for the inner nesting layer,
> make up our own lock subclass MMAP_LOCK_SUBCLASS_NASCENT and use that
> instead.
> 
> The added locking calls in exec_mmap() are temporary; the following patch
> will move the locking out of exec_mmap().
> 
> Signed-off-by: Jann Horn <jannh@google.com>
> ---
>  arch/um/include/asm/mmu_context.h |  3 +--
>  fs/exec.c                         |  4 ++++
>  include/linux/mmap_lock.h         | 23 +++++++++++++++++++++--
>  kernel/fork.c                     |  7 ++-----
>  4 files changed, 28 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/um/include/asm/mmu_context.h b/arch/um/include/asm/mmu_context.h
> index 17ddd4edf875..c13bc5150607 100644
> --- a/arch/um/include/asm/mmu_context.h
> +++ b/arch/um/include/asm/mmu_context.h
> @@ -48,9 +48,8 @@ static inline void activate_mm(struct mm_struct *old, struct mm_struct *new)
>  	 * when the new ->mm is used for the first time.
>  	 */
>  	__switch_mm(&new->context.id);
> -	mmap_write_lock_nested(new, SINGLE_DEPTH_NESTING);
> +	mmap_assert_write_locked(new);
>  	uml_setup_stubs(new);
> -	mmap_write_unlock(new);
>  }

FWIW, I believe this was causing lockdep issues.

I think I had previously determined that this locking was pointless, since
the mm is still nascent and cannot be used yet? But I didn't really know
for sure, and this patch was never applied:

https://patchwork.ozlabs.org/project/linux-um/patch/20200604133752.397dedea0758.I7a24aaa26794eb3fa432003c1bf55cbb816489e2@changeid/

I guess your patches will also fix the lockdep complaints in UML in this
area; I hope I'll be able to test it soon.

johannes
* Re: [PATCH v2 1/2] mmap locking API: Order lock of nascent mm outside lock of live mm
From: Jann Horn
To: Johannes Berg
Cc: Andrew Morton, Linux-MM, Michel Lespinasse, Jason Gunthorpe, Richard Weinberger, Jeff Dike, linux-um, kernel list, Eric W. Biederman, Sakari Ailus, John Hubbard, Mauro Carvalho Chehab, Anton Ivanov
Date: 2020-10-07 8:28 UTC

On Wed, Oct 7, 2020 at 9:42 AM Johannes Berg <johannes@sipsolutions.net> wrote:
> On Wed, 2020-10-07 at 00:54 +0200, Jann Horn wrote:
> > Until now, the mmap lock of the nascent mm was ordered inside the mmap lock
> > of the old mm (in dup_mmap() and in UML's activate_mm()).
> > A following patch will change the exec path to very broadly lock the
> > nascent mm, but fine-grained locking should still work at the same time for
> > the old mm.
> >
> > In particular, mmap locking calls are hidden behind the copy_from_user()
> > calls and such that are reached through functions like copy_strings() -
> > when a page fault occurs on a userspace memory access, the mmap lock
> > will be taken.
> >
> > To do this in a way that lockdep is happy about, let's turn around the lock
> > ordering in both places that currently nest the locks.
> > Since SINGLE_DEPTH_NESTING is normally used for the inner nesting layer,
> > make up our own lock subclass MMAP_LOCK_SUBCLASS_NASCENT and use that
> > instead.
> >
> > The added locking calls in exec_mmap() are temporary; the following patch
> > will move the locking out of exec_mmap().
> >
> > diff --git a/arch/um/include/asm/mmu_context.h b/arch/um/include/asm/mmu_context.h
> > index 17ddd4edf875..c13bc5150607 100644
> > --- a/arch/um/include/asm/mmu_context.h
> > +++ b/arch/um/include/asm/mmu_context.h
> > @@ -48,9 +48,8 @@ static inline void activate_mm(struct mm_struct *old, struct mm_struct *new)
> >  	 * when the new ->mm is used for the first time.
> >  	 */
> >  	__switch_mm(&new->context.id);
> > -	mmap_write_lock_nested(new, SINGLE_DEPTH_NESTING);
> > +	mmap_assert_write_locked(new);
> >  	uml_setup_stubs(new);
> > -	mmap_write_unlock(new);
> >  }
>
> FWIW, this was I believe causing lockdep issues.
>
> I think I had previously determined that this was pointless, since it's
> still nascent and cannot be used yet?

Well.. the thing is that with patch 2/2, I'm not just protecting the
mm while it hasn't been installed yet, but also after it's been
installed, until setup_arg_pages() is done (which still uses a VMA
pointer that we obtained really early in the nascent phase).

With the recent rework Eric Biederman has done to clean up the locking
around execve, operations like process_vm_writev() and (currently only
in the MM tree, not mainline yet) process_madvise() can remotely occur
on our new mm after setup_new_exec(), before we've reached
setup_arg_pages(). While AFAIK all those operations *currently* only
read the VMA tree, that would change as soon as someone e.g. changes
the list of allowed operations for process_madvise() to include
something like MADV_MERGEABLE. In that case, we'd get a UAF if the
madvise code merges away our VMA while we still hold and use a
dangling pointer to it.

So in summary, I think the code currently is not (visibly) buggy in
the sense that you can make it do something bad, but it's extremely
fragile and probably only safe by chance. This patchset is partly my
attempt to make this a bit more future-proof before someone comes
along and turns it into an actual memory corruption bug with some
innocuous little change. (Because I've had a situation before where I
thought "oh, this looks really fragile and only works by chance, but
eh, it's not really worth changing that code" and then the next time I
looked, it had turned into a security bug that had already made its
way into kernel releases people were using.)

> But I didn't really know for sure,
> and this patch was never applied:
>
> https://patchwork.ozlabs.org/project/linux-um/patch/20200604133752.397dedea0758.I7a24aaa26794eb3fa432003c1bf55cbb816489e2@changeid/

Eeeh... with all the kernel debugging infrastructure *disabled*,
down_write_nested() is defined as:

# define down_write_nested(sem, subclass)	down_write(sem)

and then down_write() is:

void __sched down_write(struct rw_semaphore *sem)
{
	might_sleep();
	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);

	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
}

and that might_sleep() there is not just used for atomic sleep
debugging, but actually also creates an explicit preemption point
(independent of CONFIG_DEBUG_ATOMIC_SLEEP; here's the version with
atomic sleep debugging *disabled*):

# define might_sleep() do { might_resched(); } while (0)

where might_resched() is:

#ifdef CONFIG_PREEMPT_VOLUNTARY
extern int _cond_resched(void);
# define might_resched() _cond_resched()
#else
# define might_resched() do { } while (0)
#endif

_cond_resched() has a check for preempt_count before triggering the
scheduler, but on PREEMPT_VOLUNTARY without debugging, taking a
spinlock currently won't increment that, I think. And even if
preempt_count was active for PREEMPT_VOLUNTARY (which I think the x86
folks were discussing?), you'd still hit a call into the RCU core,
which probably shouldn't be happening under a spinlock either.

Now, arch/um/ sets ARCH_NO_PREEMPT, so we can't actually be configured
with PREEMPT_VOLUNTARY, so this can't actually happen. But it feels
like we're on pretty thin ice here.

> I guess your patches will also fix the lockdep complaints in UML in this
> area, I hope I'll be able to test it soon.

That would be a nice side effect. :)
* Re: [PATCH v2 1/2] mmap locking API: Order lock of nascent mm outside lock of live mm
From: Johannes Berg
To: Jann Horn
Cc: Andrew Morton, Linux-MM, Michel Lespinasse, Jason Gunthorpe, Richard Weinberger, Jeff Dike, linux-um, kernel list, Eric W. Biederman, Sakari Ailus, John Hubbard, Mauro Carvalho Chehab, Anton Ivanov
Date: 2020-10-07 11:35 UTC

Hi Jann,

> > > +++ b/arch/um/include/asm/mmu_context.h
> > > @@ -48,9 +48,8 @@ static inline void activate_mm(struct mm_struct *old, struct mm_struct *new)
> > >  	 * when the new ->mm is used for the first time.
> > >  	 */
> > >  	__switch_mm(&new->context.id);
> > > -	mmap_write_lock_nested(new, SINGLE_DEPTH_NESTING);
> > > +	mmap_assert_write_locked(new);
> > >  	uml_setup_stubs(new);
> > > -	mmap_write_unlock(new);
> > >  }
> >
> > FWIW, this was I believe causing lockdep issues.
> >
> > I think I had previously determined that this was pointless, since it's
> > still nascent and cannot be used yet?
>
> Well.. the thing is that with patch 2/2, I'm not just protecting the
> mm while it hasn't been installed yet, but also after it's been
> installed, until setup_arg_pages() is done (which still uses a VMA
> pointer that we obtained really early in the nascent phase).

Oh, sure. I was referring only to the locking in UML's activate_mm(),
quoted above. Sorry for not making that clear.

> So in summary, I think the code currently is not (visibly) buggy in
> the sense that you can make it do something bad, but it's extremely
> fragile and probably only safe by chance. This patchset is partly my
> attempt to make this a bit more future-proof before someone comes
> along and turns it into an actual memory corruption bug with some
> innocuous little change. (Because I've had a situation before where I
> thought "oh, this looks really fragile and only works by chance, but
> eh, it's not really worth changing that code" and then the next time I
> looked, it had turned into a security bug that had already made its
> way into kernel releases people were using.)

Right.

> > But I didn't really know for sure,
> > and this patch was never applied:
> >
> > https://patchwork.ozlabs.org/project/linux-um/patch/20200604133752.397dedea0758.I7a24aaa26794eb3fa432003c1bf55cbb816489e2@changeid/
>
> Eeeh... with all the kernel debugging infrastructure *disabled*,

but I didn't have it disabled, I had lockdep enabled, and lockdep was
complaining (now granted, I was still on 5.8 for that patch):

=============================
[ BUG: Invalid wait context ]
5.8.0-00006-gef4b340c886a #23 Not tainted
-----------------------------
swapper/1 is trying to lock:
000000006e54c160 (&mm->mmap_lock/1){....}-{3:3}, at: begin_new_exec+0x6c5/0xb26
other info that might help us debug this:
context-{4:4}
3 locks held by swapper/1:
 #0: 00000000705f4548 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: __do_execve_file+0x12c/0x7ea
 #1: 00000000705f45e0 (&sig->exec_update_mutex){+.+.}-{3:3}, at: begin_new_exec+0x5db/0xb26
 #2: 00000000705e05a8 (&p->alloc_lock){+.+.}-{2:2}, at: begin_new_exec+0x66b/0xb26
stack backtrace:
CPU: 0 PID: 1 Comm: swapper Not tainted 5.8.0-00006-gef4b340c886a #23
Stack:
 6057fa2d 705e0760 705ebbb0 00000133
 6008d289 705e0760 705e0040 00000003
 705ebbc0 6028e02f 705ebc50 60080b29
Call Trace:
 [<6008d289>] ? printk+0x0/0x94
 [<60024a1a>] show_stack+0x153/0x174
 [<6008d289>] ? printk+0x0/0x94
 [<6028e02f>] dump_stack+0x34/0x36
 [<60080b29>] __lock_acquire+0x515/0x15f5
 [<6007c593>] ? hlock_class+0x0/0xa1
 [<6007fd90>] lock_acquire+0x347/0x42d
 [<6013def5>] ? begin_new_exec+0x6c5/0xb26
 [<60039b51>] ? set_signals+0x29/0x3f
 [<600835c1>] ? lock_acquired+0x310/0x320
 [<6013b5ce>] ? would_dump+0x0/0x8a
 [<600798fd>] down_write_nested+0x2f/0x83
 [<6013def5>] ? begin_new_exec+0x6c5/0xb26
 [<600798ce>] ? down_write_nested+0x0/0x83
 [<6013def5>] begin_new_exec+0x6c5/0xb26
 [<6019593b>] ? load_elf_phdrs+0x6f/0x9d
 [<60298d55>] ? memcmp+0x0/0x20
 [<60196612>] load_elf_binary+0x2cb/0xc49
 [...]

but it really looks just about the same on v5.9-rc8.

> > I guess your patches will also fix the lockdep complaints in UML in this
> > area, I hope I'll be able to test it soon.
>
> That would be a nice side effect. :)

It does indeed fix it :)

johannes
* [PATCH v2 2/2] exec: Broadly lock nascent mm until setup_arg_pages()
From: Jann Horn
To: Andrew Morton, linux-mm
Cc: linux-kernel, Eric W. Biederman, Michel Lespinasse, Mauro Carvalho Chehab, Sakari Ailus, Jeff Dike, Richard Weinberger, Anton Ivanov, linux-um, Jason Gunthorpe, John Hubbard
Date: 2020-10-06 22:54 UTC

While AFAIK there currently is nothing that can modify the VMA tree of a
new mm until userspace has started running under the mm, we should properly
lock the mm here anyway, both to keep lockdep happy when adding locking
assertions and to be safe in the future in case someone e.g. decides to
permit VMA-tree-mutating operations in process_madvise_behavior_valid().

The goal of this patch is to broadly lock the nascent mm in the exec path,
from around the time it is created all the way to the end of
setup_arg_pages() (because setup_arg_pages() accesses bprm->vma).
As long as the mm is write-locked, keep it around in bprm->mm, even after
it has been installed on the task (with an extra reference on the mm, to
reduce complexity in free_bprm()).
After setup_arg_pages(), we have to unlock the mm so that APIs such as
copy_to_user() will work in the following binfmt-specific setup code.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Jann Horn <jannh@google.com>
---
 fs/exec.c               | 68 ++++++++++++++++++++---------------------
 include/linux/binfmts.h |  2 +-
 2 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 229dbc7aa61a..fe11d77e397a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -254,11 +254,6 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 		return -ENOMEM;
 	vma_set_anonymous(vma);
 
-	if (mmap_write_lock_killable(mm)) {
-		err = -EINTR;
-		goto err_free;
-	}
-
 	/*
 	 * Place the stack at the largest stack address the architecture
 	 * supports. Later, we'll move this to an appropriate place. We don't
@@ -276,12 +271,9 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 		goto err;
 	mm->stack_vm = mm->total_vm = 1;
 
-	mmap_write_unlock(mm);
 	bprm->p = vma->vm_end - sizeof(void *);
 	return 0;
 err:
-	mmap_write_unlock(mm);
-err_free:
 	bprm->vma = NULL;
 	vm_area_free(vma);
 	return err;
@@ -364,9 +356,9 @@ static int bprm_mm_init(struct linux_binprm *bprm)
 	struct mm_struct *mm = NULL;
 
 	bprm->mm = mm = mm_alloc();
-	err = -ENOMEM;
 	if (!mm)
-		goto err;
+		return -ENOMEM;
+	mmap_write_lock_nascent(mm);
 
 	/* Save current stack limit for all calculations made during exec. */
 	task_lock(current->group_leader);
@@ -374,17 +366,12 @@ static int bprm_mm_init(struct linux_binprm *bprm)
 	task_unlock(current->group_leader);
 
 	err = __bprm_mm_init(bprm);
-	if (err)
-		goto err;
-
-	return 0;
-
-err:
-	if (mm) {
-		bprm->mm = NULL;
-		mmdrop(mm);
-	}
+	if (!err)
+		return 0;
 
+	bprm->mm = NULL;
+	mmap_write_unlock(mm);
+	mmdrop(mm);
 	return err;
 }
 
@@ -735,6 +722,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 /*
  * Finalizes the stack vm_area_struct. The flags and permissions are updated,
  * the stack is optionally relocated, and some extra space is added.
+ * At the end of this, the mm_struct will be unlocked on success.
  */
 int setup_arg_pages(struct linux_binprm *bprm,
 		    unsigned long stack_top,
@@ -787,9 +775,6 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	bprm->loader -= stack_shift;
 	bprm->exec -= stack_shift;
 
-	if (mmap_write_lock_killable(mm))
-		return -EINTR;
-
 	vm_flags = VM_STACK_FLAGS;
 
 	/*
@@ -807,7 +792,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	ret = mprotect_fixup(vma, &prev, vma->vm_start, vma->vm_end,
 			vm_flags);
 	if (ret)
-		goto out_unlock;
+		return ret;
 	BUG_ON(prev != vma);
 
 	if (unlikely(vm_flags & VM_EXEC)) {
@@ -819,7 +804,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	if (stack_shift) {
 		ret = shift_arg_pages(vma, stack_shift);
 		if (ret)
-			goto out_unlock;
+			return ret;
 	}
 
 	/* mprotect_fixup is overkill to remove the temporary stack flags */
@@ -846,11 +831,17 @@ int setup_arg_pages(struct linux_binprm *bprm,
 	current->mm->start_stack = bprm->p;
 	ret = expand_stack(vma, stack_base);
 	if (ret)
-		ret = -EFAULT;
+		return -EFAULT;
 
-out_unlock:
+	/*
+	 * From this point on, anything that wants to poke around in the
+	 * mm_struct must lock it by itself.
+	 */
+	bprm->vma = NULL;
 	mmap_write_unlock(mm);
-	return ret;
+	mmput(mm);
+	bprm->mm = NULL;
+	return 0;
 }
 EXPORT_SYMBOL(setup_arg_pages);
 
@@ -1114,8 +1105,6 @@ static int exec_mmap(struct mm_struct *mm)
 	if (ret)
 		return ret;
 
-	mmap_write_lock_nascent(mm);
-
 	if (old_mm) {
 		/*
 		 * Make sure that if there is a core dump in progress
@@ -1127,11 +1116,12 @@ static int exec_mmap(struct mm_struct *mm)
 		if (unlikely(old_mm->core_state)) {
 			mmap_read_unlock(old_mm);
 			mutex_unlock(&tsk->signal->exec_update_mutex);
-			mmap_write_unlock(mm);
 			return -EINTR;
 		}
 	}
 
+	/* bprm->mm stays refcounted, current->mm takes an extra reference */
+	mmget(mm);
 	task_lock(tsk);
 	active_mm = tsk->active_mm;
 	membarrier_exec_mmap(mm);
@@ -1141,7 +1131,6 @@ static int exec_mmap(struct mm_struct *mm)
 	tsk->mm->vmacache_seqnum = 0;
 	vmacache_flush(tsk);
 	task_unlock(tsk);
-	mmap_write_unlock(mm);
 	if (old_mm) {
 		mmap_read_unlock(old_mm);
 		BUG_ON(active_mm != old_mm);
@@ -1397,8 +1386,6 @@ int begin_new_exec(struct linux_binprm * bprm)
 	if (retval)
 		goto out;
 
-	bprm->mm = NULL;
-
 #ifdef CONFIG_POSIX_TIMERS
 	exit_itimers(me->signal);
 	flush_itimer_signals();
@@ -1545,6 +1532,18 @@ void setup_new_exec(struct linux_binprm * bprm)
 	me->mm->task_size = TASK_SIZE;
 	mutex_unlock(&me->signal->exec_update_mutex);
 	mutex_unlock(&me->signal->cred_guard_mutex);
+
+#ifndef CONFIG_MMU
+	/*
+	 * On MMU, setup_arg_pages() wants to access bprm->vma after this point,
+	 * so we can't drop the mmap lock yet.
+	 * On !MMU, we have neither setup_arg_pages() nor bprm->vma, so we
+	 * should drop the lock here.
+	 */
+	mmap_write_unlock(bprm->mm);
+	mmput(bprm->mm);
+	bprm->mm = NULL;
+#endif
 }
 EXPORT_SYMBOL(setup_new_exec);
 
@@ -1581,6 +1580,7 @@ static void free_bprm(struct linux_binprm *bprm)
 {
 	if (bprm->mm) {
 		acct_arg_size(bprm, 0);
+		mmap_write_unlock(bprm->mm);
 		mmput(bprm->mm);
 	}
 	free_arg_pages(bprm);
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 0571701ab1c5..3bf06212fbae 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -22,7 +22,7 @@ struct linux_binprm {
 # define MAX_ARG_PAGES	32
 	struct page *page[MAX_ARG_PAGES];
 #endif
-	struct mm_struct *mm;
+	struct mm_struct *mm; /* nascent mm, write-locked */
 	unsigned long p; /* current top of mem */
 	unsigned long argmin; /* rlimit marker for copy_strings() */
 	unsigned int
-- 
2.28.0.806.g8561365e88-goog
* Re: [PATCH v2 2/2] exec: Broadly lock nascent mm until setup_arg_pages()
From: Jason Gunthorpe
To: Jann Horn
Cc: Andrew Morton, linux-mm, linux-kernel, Eric W. Biederman, Michel Lespinasse, Mauro Carvalho Chehab, Sakari Ailus, Jeff Dike, Richard Weinberger, Anton Ivanov, linux-um, John Hubbard
Date: 2020-10-07 12:12 UTC

On Wed, Oct 07, 2020 at 12:54:50AM +0200, Jann Horn wrote:
> @@ -1545,6 +1532,18 @@ void setup_new_exec(struct linux_binprm * bprm)
>  	me->mm->task_size = TASK_SIZE;
>  	mutex_unlock(&me->signal->exec_update_mutex);
>  	mutex_unlock(&me->signal->cred_guard_mutex);
> +
> +#ifndef CONFIG_MMU
> +	/*
> +	 * On MMU, setup_arg_pages() wants to access bprm->vma after this point,
> +	 * so we can't drop the mmap lock yet.
> +	 * On !MMU, we have neither setup_arg_pages() nor bprm->vma, so we
> +	 * should drop the lock here.
> +	 */
> +	mmap_write_unlock(bprm->mm);
> +	mmput(bprm->mm);
> +	bprm->mm = NULL;
> +#endif
> }

It looks like this could be an

	if (!IS_ENABLED(CONFIG_MMU))

This all seems nice; more locking points were removed than added, at least.

Jason