* [RFC PATCH 0/6] mm and ptrace: Track dumpability until task is freed
@ 2020-10-16  2:40 Jann Horn
  2020-10-16  2:40 ` [RFC PATCH 1/6] ptrace: Keep mm around after exit_mm() for __ptrace_may_access() Jann Horn
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: Jann Horn @ 2020-10-16  2:40 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, Eric Biederman, Oleg Nesterov
  Cc: linux-kernel, Will Deacon, Kees Cook, Ingo Molnar

At the moment, there is a lifetime issue (no, not the UAF kind) around
__ptrace_may_access():

__ptrace_may_access() wants to check mm->flags and mm->user_ns to figure
out whether the caller should be allowed to access some target task.
__ptrace_may_access() can be called as long as __put_task_struct()
hasn't happened yet; but __put_task_struct() happens when the task is
about to be freed, which is much later than exit_mm() (which happens
pretty early during task exit).
So we can have a situation where we need to consult the mm for a
security check, but we don't have an mm anymore.

At the moment, this is solved by failing open: If the mm is gone, we
pretend that it was dumpable. That's dubious from a security
perspective - as one example, we drop the mm_struct before the file
descriptor table, so someone might be able to steal file descriptors
from an exiting task when dumpability was supposed to prevent that.
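
For reference, the dumpability check in __ptrace_may_access() currently
looks roughly like this (simplified sketch, surrounding checks omitted):

  mm = task->mm;
  if (mm &&
      ((get_dumpable(mm) != SUID_DUMP_USER) &&
       !ptrace_has_cap(cred, mm->user_ns, mode)))
          return -EPERM;
  /* if ->mm is already NULL (past exit_mm()), we fall through and
   * effectively treat the target as dumpable */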

The easy fix would be to let __ptrace_may_access() instead always refuse
access to tasks that have lost their mm; but then that would e.g. mean
that the ability to inspect dead tasks in procfs would be restricted.
So while that might work in practice, it'd be a bit ugly, too.

Another option would be to move the dumpability information elsewhere -
but that would have to be the task_struct (the signal_struct can be
shared with dead pre-execve threads, so we can't use it here). So we'd
have to keep dumpability information in sync across threads - that'd
probably be pretty ugly.


So I think the proper fix is to let the task_struct hold a reference on
the mm_struct until the task goes away completely. This is implemented
in patch 1/6, which is also the only patch in this series that I
actually care about (and the only one with a stable backport marking);
the rest of the series consists of tweaks in case people dislike the idea
of constantly freeing mm_structs from workqueue context.
Those tweaks should also reduce the memory usage of dead tasks, by
ensuring that they don't keep their PGDs alive.
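
Condensed down, the lifetime after the whole series looks roughly like
this (a sketch; patch 1 initially uses mmgrab()/mmdrop_async() here,
patch 5 switches it over to the new mm_ref()/mm_unref()):

  /* in exit_mm(), early during task exit */
  task_lock(current);
  current->mm = NULL;
  mm_ref(mm);                     /* bare ref: keeps mm_struct + user_ns only */
  current->exit_mm = mm;
  task_unlock(current);

  /* in free_task(), when the task_struct is finally freed */
  if (tsk->exit_mm)
          mm_unref(tsk->exit_mm); /* may free the mm_struct synchronously */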


Patch 1/6 is not particularly pretty, but I can't think of any better
way to do it.

So: Does this series (and in particular patch 1/6) look vaguely sane?
And if not, does anyone have a better approach?


Jann Horn (6):
  ptrace: Keep mm around after exit_mm() for __ptrace_may_access()
  refcount: Move refcount_t definition into linux/types.h
  mm: Add refcount for preserving mm_struct without pgd
  mm, oom: Use mm_ref()/mm_unref() and avoid mmdrop_async()
  ptrace: Use mm_ref() for ->exit_mm
  mm: remove now-unused mmdrop_async()

 arch/x86/kernel/tboot.c    |  2 +
 drivers/firmware/efi/efi.c |  2 +
 include/linux/mm_types.h   | 15 ++++++-
 include/linux/refcount.h   | 13 +-----
 include/linux/sched.h      |  8 ++++
 include/linux/sched/mm.h   | 13 ++++++
 include/linux/types.h      | 12 +++++
 kernel/exit.c              |  2 +
 kernel/fork.c              | 90 +++++++++++++++++---------------------
 kernel/ptrace.c            | 10 +++++
 mm/init-mm.c               |  2 +
 mm/oom_kill.c              |  2 +-
 12 files changed, 105 insertions(+), 66 deletions(-)


base-commit: bbf5c979011a099af5dc76498918ed7df445635b
-- 
2.29.0.rc1.297.gfa9743e501-goog




* [RFC PATCH 1/6] ptrace: Keep mm around after exit_mm() for __ptrace_may_access()
  2020-10-16  2:40 [RFC PATCH 0/6] mm and ptrace: Track dumpability until task is freed Jann Horn
@ 2020-10-16  2:40 ` Jann Horn
  2020-10-16  2:40 ` [RFC PATCH 2/6] refcount: Move refcount_t definition into linux/types.h Jann Horn
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Jann Horn @ 2020-10-16  2:40 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, Eric Biederman, Oleg Nesterov
  Cc: linux-kernel, Will Deacon, Kees Cook, Ingo Molnar

__ptrace_may_access() checks can happen on target tasks that are in the
middle of do_exit(), past exit_mm(). At that point, the ->mm pointer has
been NULLed out, and the mm_struct has been mmput().

Unfortunately, the mm_struct contains the dumpability and the user_ns in
which the task last went through execve(), and we need those for
__ptrace_may_access(). Currently, that problem is handled by failing open:
If the ->mm is gone, we assume that the task was dumpable. In some edge
cases, this could potentially expose access to things like
/proc/$pid/fd/$fd of originally non-dumpable processes.
(exit_files() comes after exit_mm(), so the file descriptor table is still
there when we've gone through exit_mm().)
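
For context, the relevant ordering in do_exit() is roughly (heavily
abbreviated sketch):

  /* kernel/exit.c, do_exit() */
  exit_mm();          /* ->mm is NULLed out and the mm_struct is mmput() */
  /* ... __ptrace_may_access() can already run against this task here ... */
  exit_files(tsk);    /* the fd table only gets torn down at this point */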

One way to fix this would be to move mm->user_ns and the dumpability state
over into the task_struct. However, that gets quite ugly if we want to
preserve existing semantics because e.g. PR_SET_DUMPABLE and commit_creds()
would then have to scan through all tasks sharing the mm_struct and keep
them in sync manually - that'd be a bit error-prone and overcomplicated.

(Moving these things into the signal_struct is not an option because that
is kept across executions, and pre-execve co-threads will share the
signal_struct that is also used by the task that has gone through
execve().)

I believe that this patch may be the least bad option to fix this - keep
the mm_struct (but not process memory) around with an mmgrab() reference
from exit_mm() until the task goes away completely.

Note that this moves free_task() down in order to make mmdrop_async()
available without a forward declaration.

Cc: stable@vger.kernel.org
Fixes: bfedb589252c ("mm: Add a user_ns owner to mm_struct and fix ptrace permission checks")
Signed-off-by: Jann Horn <jannh@google.com>
---
 include/linux/sched.h |  8 +++++++
 kernel/exit.c         |  2 ++
 kernel/fork.c         | 54 ++++++++++++++++++++++---------------------
 kernel/ptrace.c       | 10 ++++++++
 4 files changed, 48 insertions(+), 26 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index afe01e232935..55bec6ff5626 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -747,6 +747,14 @@ struct task_struct {
 
 	struct mm_struct		*mm;
 	struct mm_struct		*active_mm;
+	/*
+	 * When we exit and ->mm (the reference pinning ->mm's address space)
+	 * goes away, we stash a reference to the mm_struct itself (counted via
+	 * exit_mm->mm_count) in this member.
+	 * This allows us to continue using the mm_struct for security checks
+	 * and such even after the task has started exiting.
+	 */
+	struct mm_struct		*exit_mm;
 
 	/* Per-thread vma caching: */
 	struct vmacache			vmacache;
diff --git a/kernel/exit.c b/kernel/exit.c
index 733e80f334e7..97253ef33486 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -476,6 +476,8 @@ static void exit_mm(void)
 	/* more a memory barrier than a real lock */
 	task_lock(current);
 	current->mm = NULL;
+	mmgrab(mm); /* for current->exit_mm */
+	current->exit_mm = mm;
 	mmap_read_unlock(mm);
 	enter_lazy_tlb(mm, current);
 	task_unlock(current);
diff --git a/kernel/fork.c b/kernel/fork.c
index da8d360fb032..4942428a217c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -438,32 +438,6 @@ void put_task_stack(struct task_struct *tsk)
 }
 #endif
 
-void free_task(struct task_struct *tsk)
-{
-	scs_release(tsk);
-
-#ifndef CONFIG_THREAD_INFO_IN_TASK
-	/*
-	 * The task is finally done with both the stack and thread_info,
-	 * so free both.
-	 */
-	release_task_stack(tsk);
-#else
-	/*
-	 * If the task had a separate stack allocation, it should be gone
-	 * by now.
-	 */
-	WARN_ON_ONCE(refcount_read(&tsk->stack_refcount) != 0);
-#endif
-	rt_mutex_debug_task_free(tsk);
-	ftrace_graph_exit_task(tsk);
-	arch_release_task_struct(tsk);
-	if (tsk->flags & PF_KTHREAD)
-		free_kthread_struct(tsk);
-	free_task_struct(tsk);
-}
-EXPORT_SYMBOL(free_task);
-
 #ifdef CONFIG_MMU
 static __latent_entropy int dup_mmap(struct mm_struct *mm,
 					struct mm_struct *oldmm)
@@ -722,6 +696,34 @@ static inline void put_signal_struct(struct signal_struct *sig)
 		free_signal_struct(sig);
 }
 
+void free_task(struct task_struct *tsk)
+{
+	scs_release(tsk);
+
+#ifndef CONFIG_THREAD_INFO_IN_TASK
+	/*
+	 * The task is finally done with both the stack and thread_info,
+	 * so free both.
+	 */
+	release_task_stack(tsk);
+#else
+	/*
+	 * If the task had a separate stack allocation, it should be gone
+	 * by now.
+	 */
+	WARN_ON_ONCE(refcount_read(&tsk->stack_refcount) != 0);
+#endif
+	rt_mutex_debug_task_free(tsk);
+	ftrace_graph_exit_task(tsk);
+	arch_release_task_struct(tsk);
+	if (tsk->flags & PF_KTHREAD)
+		free_kthread_struct(tsk);
+	if (tsk->exit_mm)
+		mmdrop_async(tsk->exit_mm);
+	free_task_struct(tsk);
+}
+EXPORT_SYMBOL(free_task);
+
 void __put_task_struct(struct task_struct *tsk)
 {
 	WARN_ON(!tsk->exit_state);
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 43d6179508d6..0aedc6cf5bdc 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -342,7 +342,17 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 	 * Pairs with a write barrier in commit_creds().
 	 */
 	smp_rmb();
+	/*
+	 * Look up the target task's mm_struct. If it fails because the task is
+	 * exiting and has gone through exit_mm(), we can instead use ->exit_mm
+	 * as long as we only use members that are preserved by an mmgrab()
+	 * reference.
+	 * The only case in which both ->mm and ->exit_mm can be NULL should be
+	 * kernel threads.
+	 */
 	mm = task->mm;
+	if (!mm)
+		mm = task->exit_mm;
 	if (mm &&
 	    ((get_dumpable(mm) != SUID_DUMP_USER) &&
 	     !ptrace_has_cap(cred, mm->user_ns, mode)))
-- 
2.29.0.rc1.297.gfa9743e501-goog




* [RFC PATCH 2/6] refcount: Move refcount_t definition into linux/types.h
  2020-10-16  2:40 [RFC PATCH 0/6] mm and ptrace: Track dumpability until task is freed Jann Horn
  2020-10-16  2:40 ` [RFC PATCH 1/6] ptrace: Keep mm around after exit_mm() for __ptrace_may_access() Jann Horn
@ 2020-10-16  2:40 ` Jann Horn
  2020-10-16  2:40 ` [RFC PATCH 3/6] mm: Add refcount for preserving mm_struct without pgd Jann Horn
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Jann Horn @ 2020-10-16  2:40 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, Eric Biederman, Oleg Nesterov
  Cc: linux-kernel, Will Deacon, Kees Cook, Ingo Molnar

I want to use refcount_t in mm_struct, but if refcount_t is defined in
linux/refcount.h, that header would have to be included in
linux/mm_types.h; that would be wasteful.

Let's move refcount_t over into linux/types.h so that includes can be
written less wastefully.
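
As a minimal illustration (the struct below is made up, not part of this
series): after this patch, a header only needs <linux/types.h> to embed a
refcount_t, while the refcount_*() operations still come from
<linux/refcount.h> in whatever .c file manipulates the counter.

  /* hypothetical lightweight header */
  #include <linux/types.h>        /* now provides refcount_t */

  struct foo {
          refcount_t refs;        /* embedding the type needs no other headers */
  };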

Signed-off-by: Jann Horn <jannh@google.com>
---
 include/linux/refcount.h | 13 +------------
 include/linux/types.h    | 12 ++++++++++++
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/include/linux/refcount.h b/include/linux/refcount.h
index 0e3ee25eb156..fd8cf65e4e2f 100644
--- a/include/linux/refcount.h
+++ b/include/linux/refcount.h
@@ -96,22 +96,11 @@
 #include <linux/bug.h>
 #include <linux/compiler.h>
 #include <linux/limits.h>
+#include <linux/types.h> /* refcount_t is defined here */
 #include <linux/spinlock_types.h>
 
 struct mutex;
 
-/**
- * struct refcount_t - variant of atomic_t specialized for reference counts
- * @refs: atomic_t counter field
- *
- * The counter saturates at REFCOUNT_SATURATED and will not move once
- * there. This avoids wrapping the counter and causing 'spurious'
- * use-after-free bugs.
- */
-typedef struct refcount_struct {
-	atomic_t refs;
-} refcount_t;
-
 #define REFCOUNT_INIT(n)	{ .refs = ATOMIC_INIT(n), }
 #define REFCOUNT_MAX		INT_MAX
 #define REFCOUNT_SATURATED	(INT_MIN / 2)
diff --git a/include/linux/types.h b/include/linux/types.h
index a147977602b5..34e4e779e767 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -175,6 +175,18 @@ typedef struct {
 } atomic64_t;
 #endif
 
+/**
+ * struct refcount_t - variant of atomic_t specialized for reference counts
+ * @refs: atomic_t counter field
+ *
+ * The counter saturates at REFCOUNT_SATURATED and will not move once
+ * there. This avoids wrapping the counter and causing 'spurious'
+ * use-after-free bugs.
+ */
+typedef struct refcount_struct {
+	atomic_t refs;
+} refcount_t;
+
 struct list_head {
 	struct list_head *next, *prev;
 };
-- 
2.29.0.rc1.297.gfa9743e501-goog




* [RFC PATCH 3/6] mm: Add refcount for preserving mm_struct without pgd
  2020-10-16  2:40 [RFC PATCH 0/6] mm and ptrace: Track dumpability until task is freed Jann Horn
  2020-10-16  2:40 ` [RFC PATCH 1/6] ptrace: Keep mm around after exit_mm() for __ptrace_may_access() Jann Horn
  2020-10-16  2:40 ` [RFC PATCH 2/6] refcount: Move refcount_t definition into linux/types.h Jann Horn
@ 2020-10-16  2:40 ` Jann Horn
  2020-10-16  2:40 ` [RFC PATCH 4/6] mm, oom: Use mm_ref()/mm_unref() and avoid mmdrop_async() Jann Horn
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Jann Horn @ 2020-10-16  2:40 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, Eric Biederman, Oleg Nesterov
  Cc: linux-kernel, Will Deacon, Kees Cook, Ingo Molnar

Currently, mm_struct has two refcounts:

 - mm_users: preserves everything - the mm_struct, the page tables, the
   memory mappings, and so on
 - mm_count: preserves the mm_struct and pgd

However, there are three types of users of mm_struct:

1. users that want page tables, memory mappings and so on
2. users that want to preserve the pgd (for lazy TLB)
3. users that just want to keep the mm_struct itself around (e.g. for
   mmget_not_zero() or __ptrace_may_access())

Dropping mm_count references can be messy because dropping mm_count to
zero deletes the pgd, which takes the pgd_lock on x86, meaning it doesn't
work from RCU callbacks (which run in IRQ context). In those cases,
mmdrop_async() must be used to punt the invocation of __mmdrop() to
workqueue context.

That's fine when mmdrop_async() is a rare case, but the preceding patch
"ptrace: Keep mm around after exit_mm() for __ptrace_may_access()" makes it
the common case; we should probably avoid punting the freeing to workqueue
context all the time if we can.

To resolve this, add a third refcount that just protects the mm_struct and
the user_ns it points to, and which can be dropped with synchronous freeing
from (almost) any context.
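
Side by side, the three reference levels and the helpers this patch adds
to <linux/sched/mm.h> (the comment is a sketch of the intended semantics):

  /*
   * mm_users      (mmget/mmput)      mappings, page tables, and so on
   * mm_count      (mmgrab/mmdrop)    the mm_struct and its pgd
   * mm_bare_refs  (mm_ref/mm_unref)  the mm_struct and its user_ns only
   *
   * mm_users counts as one mm_count reference, and mm_count counts as one
   * mm_bare_refs reference; only when mm_bare_refs drops to zero is the
   * mm_struct itself freed.
   */
  static inline void mm_ref(struct mm_struct *mm)
  {
          refcount_inc(&mm->mm_bare_refs);
  }

  static inline void mm_unref(struct mm_struct *mm)
  {
          if (refcount_dec_and_test(&mm->mm_bare_refs))
                  __mm_unref(mm);     /* just put_user_ns() + free_mm() */
  }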

Signed-off-by: Jann Horn <jannh@google.com>
---
 arch/x86/kernel/tboot.c    |  2 ++
 drivers/firmware/efi/efi.c |  2 ++
 include/linux/mm_types.h   | 13 +++++++++++--
 include/linux/sched/mm.h   | 13 +++++++++++++
 kernel/fork.c              | 14 ++++++++++----
 mm/init-mm.c               |  2 ++
 6 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index 992fb1415c0f..b92ea1bb3bb9 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -19,6 +19,7 @@
 #include <linux/mm.h>
 #include <linux/tboot.h>
 #include <linux/debugfs.h>
+#include <linux/refcount.h>
 
 #include <asm/realmode.h>
 #include <asm/processor.h>
@@ -93,6 +94,7 @@ static struct mm_struct tboot_mm = {
 	.pgd            = swapper_pg_dir,
 	.mm_users       = ATOMIC_INIT(2),
 	.mm_count       = ATOMIC_INIT(1),
+	.mm_bare_refs   = REFCOUNT_INIT(1),
 	MMAP_LOCK_INITIALIZER(init_mm)
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
 	.mmlist         = LIST_HEAD_INIT(init_mm.mmlist),
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 3aa07c3b5136..3b73a0717c6e 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -26,6 +26,7 @@
 #include <linux/platform_device.h>
 #include <linux/random.h>
 #include <linux/reboot.h>
+#include <linux/refcount.h>
 #include <linux/slab.h>
 #include <linux/acpi.h>
 #include <linux/ucs2_string.h>
@@ -54,6 +55,7 @@ struct mm_struct efi_mm = {
 	.mm_rb			= RB_ROOT,
 	.mm_users		= ATOMIC_INIT(2),
 	.mm_count		= ATOMIC_INIT(1),
+	.mm_bare_refs		= REFCOUNT_INIT(1),
 	MMAP_LOCK_INITIALIZER(efi_mm)
 	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
 	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ed028af3cb19..764d251966c7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -429,13 +429,22 @@ struct mm_struct {
 
 		/**
 		 * @mm_count: The number of references to &struct mm_struct
-		 * (@mm_users count as 1).
+		 * including its pgd (@mm_users count as 1).
 		 *
 		 * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
-		 * &struct mm_struct is freed.
+		 * pgd is freed.
 		 */
 		atomic_t mm_count;
 
+		/**
+		 * @mm_bare_refs: The number of references to &struct mm_struct
+		 * that preserve no page table state whatsoever (@mm_count
+		 * counts as 1).
+		 *
+		 * When this drops to 0, the &struct mm_struct is freed.
+		 */
+		refcount_t mm_bare_refs;
+
 		/**
 		 * @has_pinned: Whether this mm has pinned any pages.  This can
 		 * be either replaced in the future by @pinned_vm when it
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index f889e332912f..e5b236e15757 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -109,6 +109,19 @@ extern void mmput(struct mm_struct *);
 void mmput_async(struct mm_struct *);
 #endif
 
+static inline void mm_ref(struct mm_struct *mm)
+{
+	refcount_inc(&mm->mm_bare_refs);
+}
+
+void __mm_unref(struct mm_struct *mm);
+
+static inline void mm_unref(struct mm_struct *mm)
+{
+	if (refcount_dec_and_test(&mm->mm_bare_refs))
+		__mm_unref(mm);
+}
+
 /* Grab a reference to a task's mm, if it is not already going away */
 extern struct mm_struct *get_task_mm(struct task_struct *task);
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 4942428a217c..fcdd1ace79e4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -642,10 +642,16 @@ static void check_mm(struct mm_struct *mm)
 #define allocate_mm()	(kmem_cache_alloc(mm_cachep, GFP_KERNEL))
 #define free_mm(mm)	(kmem_cache_free(mm_cachep, (mm)))
 
+void __mm_unref(struct mm_struct *mm)
+{
+	put_user_ns(mm->user_ns);
+	free_mm(mm);
+}
+
 /*
- * Called when the last reference to the mm
+ * Called when the last PGD-preserving reference to the mm
  * is dropped: either by a lazy thread or by
- * mmput. Free the page directory and the mm.
+ * mmput. Free the page directory.
  */
 void __mmdrop(struct mm_struct *mm)
 {
@@ -656,8 +662,7 @@ void __mmdrop(struct mm_struct *mm)
 	destroy_context(mm);
 	mmu_notifier_subscriptions_destroy(mm);
 	check_mm(mm);
-	put_user_ns(mm->user_ns);
-	free_mm(mm);
+	mm_unref(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
 
@@ -1007,6 +1012,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm->vmacache_seqnum = 0;
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
+	refcount_set(&mm->mm_bare_refs, 1);
 	mmap_init_lock(mm);
 	INIT_LIST_HEAD(&mm->mmlist);
 	mm->core_state = NULL;
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 3a613c85f9ed..3c3cd35236fd 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -7,6 +7,7 @@
 #include <linux/cpumask.h>
 #include <linux/mman.h>
 #include <linux/pgtable.h>
+#include <linux/refcount.h>
 
 #include <linux/atomic.h>
 #include <linux/user_namespace.h>
@@ -31,6 +32,7 @@ struct mm_struct init_mm = {
 	.pgd		= swapper_pg_dir,
 	.mm_users	= ATOMIC_INIT(2),
 	.mm_count	= ATOMIC_INIT(1),
+	.mm_bare_refs	= REFCOUNT_INIT(1),
 	MMAP_LOCK_INITIALIZER(init_mm)
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
 	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
-- 
2.29.0.rc1.297.gfa9743e501-goog




* [RFC PATCH 4/6] mm, oom: Use mm_ref()/mm_unref() and avoid mmdrop_async()
  2020-10-16  2:40 [RFC PATCH 0/6] mm and ptrace: Track dumpability until task is freed Jann Horn
                   ` (2 preceding siblings ...)
  2020-10-16  2:40 ` [RFC PATCH 3/6] mm: Add refcount for preserving mm_struct without pgd Jann Horn
@ 2020-10-16  2:40 ` Jann Horn
  2020-10-16  2:40 ` [RFC PATCH 5/6] ptrace: Use mm_ref() for ->exit_mm Jann Horn
  2020-10-16  2:40 ` [RFC PATCH 6/6] mm: remove now-unused mmdrop_async() Jann Horn
  5 siblings, 0 replies; 7+ messages in thread
From: Jann Horn @ 2020-10-16  2:40 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, Eric Biederman, Oleg Nesterov
  Cc: linux-kernel, Will Deacon, Kees Cook, Ingo Molnar

The OOM killer uses MMF_OOM_SKIP to avoid running on an mm that has started
__mmput(); it only uses the mmgrab() reference to ensure that the mm_struct
itself stays alive.

This means that we don't need a full mmgrab() reference, which will keep
the pgd (and potentially also some pmd pages) alive and can't be cleaned up
from RCU callback context; we can use an mm_ref() reference instead.
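
The difference is visible in what the two teardown paths touch (sketch):

  /*
   * mmdrop() path: __mmdrop() frees the pgd (which takes pgd_lock on x86),
   * so it must not run from IRQ/softirq context -- hence mmdrop_async().
   */
  __mmdrop(mm);           /* mm_free_pgd(), destroy_context(), ... */

  /*
   * mm_unref() path: __mm_unref() only frees the struct itself and is
   * fine to call directly from RCU callback context.
   */
  void __mm_unref(struct mm_struct *mm)
  {
          put_user_ns(mm->user_ns);
          free_mm(mm);
  }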

Signed-off-by: Jann Horn <jannh@google.com>
---
 kernel/fork.c | 6 +-----
 mm/oom_kill.c | 2 +-
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index fcdd1ace79e4..59c119b03351 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -686,12 +686,8 @@ static inline void free_signal_struct(struct signal_struct *sig)
 {
 	taskstats_tgid_free(sig);
 	sched_autogroup_exit(sig);
-	/*
-	 * __mmdrop is not safe to call from softirq context on x86 due to
-	 * pgd_dtor so postpone it to the async context
-	 */
 	if (sig->oom_mm)
-		mmdrop_async(sig->oom_mm);
+		mm_unref(sig->oom_mm);
 	kmem_cache_free(signal_cachep, sig);
 }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e90f25d6385d..12967f54fbcf 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -704,7 +704,7 @@ static void mark_oom_victim(struct task_struct *tsk)
 
 	/* oom_mm is bound to the signal struct life time. */
 	if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm)) {
-		mmgrab(tsk->signal->oom_mm);
+		mm_ref(tsk->signal->oom_mm);
 		set_bit(MMF_OOM_VICTIM, &mm->flags);
 	}
 
-- 
2.29.0.rc1.297.gfa9743e501-goog




* [RFC PATCH 5/6] ptrace: Use mm_ref() for ->exit_mm
  2020-10-16  2:40 [RFC PATCH 0/6] mm and ptrace: Track dumpability until task is freed Jann Horn
                   ` (3 preceding siblings ...)
  2020-10-16  2:40 ` [RFC PATCH 4/6] mm, oom: Use mm_ref()/mm_unref() and avoid mmdrop_async() Jann Horn
@ 2020-10-16  2:40 ` Jann Horn
  2020-10-16  2:40 ` [RFC PATCH 6/6] mm: remove now-unused mmdrop_async() Jann Horn
  5 siblings, 0 replies; 7+ messages in thread
From: Jann Horn @ 2020-10-16  2:40 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, Eric Biederman, Oleg Nesterov
  Cc: linux-kernel, Will Deacon, Kees Cook, Ingo Molnar

We only use ->exit_mm to look up dumpability and the ->user_ns; we don't
need to keep the PGD alive for this.
mmgrab() is also inconvenient here, because it means that we need to use
mmdrop_async() when dropping the reference to the mm from an RCU callback.
Use mm_ref() instead of mmgrab() to make things neater.
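
For reference, the path that makes this run from RCU callback (softirq)
context is roughly (simplified; the last task_struct reference can also
be dropped elsewhere):

  /*
   *   delayed_put_task_struct()        RCU callback
   *     put_task_struct()
   *       __put_task_struct()
   *         free_task()
   *           mm_unref(tsk->exit_mm)   was mmdrop_async() before this patch
   *
   * mm_unref() never touches page table state, so no workqueue bounce is
   * needed on this path anymore.
   */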

Signed-off-by: Jann Horn <jannh@google.com>
---
 kernel/exit.c | 2 +-
 kernel/fork.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 97253ef33486..03ba6d13ef1e 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -476,7 +476,7 @@ static void exit_mm(void)
 	/* more a memory barrier than a real lock */
 	task_lock(current);
 	current->mm = NULL;
-	mmgrab(mm); /* for current->exit_mm */
+	mm_ref(mm); /* for current->exit_mm */
 	current->exit_mm = mm;
 	mmap_read_unlock(mm);
 	enter_lazy_tlb(mm, current);
diff --git a/kernel/fork.c b/kernel/fork.c
index 59c119b03351..4383bf055b40 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -720,7 +720,7 @@ void free_task(struct task_struct *tsk)
 	if (tsk->flags & PF_KTHREAD)
 		free_kthread_struct(tsk);
 	if (tsk->exit_mm)
-		mmdrop_async(tsk->exit_mm);
+		mm_unref(tsk->exit_mm);
 	free_task_struct(tsk);
 }
 EXPORT_SYMBOL(free_task);
-- 
2.29.0.rc1.297.gfa9743e501-goog




* [RFC PATCH 6/6] mm: remove now-unused mmdrop_async()
  2020-10-16  2:40 [RFC PATCH 0/6] mm and ptrace: Track dumpability until task is freed Jann Horn
                   ` (4 preceding siblings ...)
  2020-10-16  2:40 ` [RFC PATCH 5/6] ptrace: Use mm_ref() for ->exit_mm Jann Horn
@ 2020-10-16  2:40 ` Jann Horn
  5 siblings, 0 replies; 7+ messages in thread
From: Jann Horn @ 2020-10-16  2:40 UTC (permalink / raw)
  To: Andrew Morton, linux-mm, Eric Biederman, Oleg Nesterov
  Cc: linux-kernel, Will Deacon, Kees Cook, Ingo Molnar

The preceding patches have removed all users of mmdrop_async(); get rid of
it.

Note that on MMU, we still need async_put_work because mmput_async() uses
it, which in turn is used by binder's shrinker callback. We could claw back
those 4 words per mm if we made mmput_async() depend on
CONFIG_ANDROID_BINDER_IPC.
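
If we wanted to go that route, the gating might look something like this
(hypothetical sketch, not part of this series):

  /* include/linux/mm_types.h */
  #ifdef CONFIG_ANDROID_BINDER_IPC
          struct work_struct async_put_work;
  #endif

  /* ...with mmput_async() and its declaration under the same #ifdef. */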

Signed-off-by: Jann Horn <jannh@google.com>
---
 include/linux/mm_types.h |  2 ++
 kernel/fork.c            | 16 ----------------
 2 files changed, 2 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 764d251966c7..8fde2068bde1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -560,7 +560,9 @@ struct mm_struct {
 #ifdef CONFIG_HUGETLB_PAGE
 		atomic_long_t hugetlb_usage;
 #endif
+#ifdef CONFIG_MMU
 		struct work_struct async_put_work;
+#endif
 	} __randomize_layout;
 
 	/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 4383bf055b40..c5f2ec544933 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -666,22 +666,6 @@ void __mmdrop(struct mm_struct *mm)
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
 
-static void mmdrop_async_fn(struct work_struct *work)
-{
-	struct mm_struct *mm;
-
-	mm = container_of(work, struct mm_struct, async_put_work);
-	__mmdrop(mm);
-}
-
-static void mmdrop_async(struct mm_struct *mm)
-{
-	if (unlikely(atomic_dec_and_test(&mm->mm_count))) {
-		INIT_WORK(&mm->async_put_work, mmdrop_async_fn);
-		schedule_work(&mm->async_put_work);
-	}
-}
-
 static inline void free_signal_struct(struct signal_struct *sig)
 {
 	taskstats_tgid_free(sig);
-- 
2.29.0.rc1.297.gfa9743e501-goog



