* [PATCH v3 0/5] Fix ELF / FDPIC ELF core dumping, and use mmap_lock properly in there
@ 2020-08-18  6:12 Jann Horn
  2020-08-18  6:12 ` [PATCH v3 1/5] binfmt_elf_fdpic: Stop using dump_emit() on user pointers on !MMU Jann Horn
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Jann Horn @ 2020-08-18  6:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Christoph Hellwig, linux-kernel, linux-mm,
	linux-fsdevel, Alexander Viro, Eric W . Biederman, Oleg Nesterov

At the moment, we have the rather ugly mmget_still_valid() helper to
work around <https://crbug.com/project-zero/1790>: ELF core dumping
doesn't take the mmap_lock while traversing the task's VMAs, and if
anything (like userfaultfd) then remotely messes with the VMA tree,
fireworks ensue. So we currently use mmget_still_valid() to bail out
of any writers that might be operating on a remote mm's VMAs.

With this series, I'm trying to get rid of the need for that as
cleanly as possible. ("cleanly" meaning "avoid holding the mmap_lock
across unbounded sleeps".)


Patches 1, 2 and 3 are relatively unrelated cleanups in the core
dumping code.

Patches 4 and 5 implement the main change: Instead of repeatedly
accessing the VMA list with sleeps in between, we snapshot it at the
start with proper locking, and then later we just use our copy of
the VMA list. This ensures that the kernel won't crash, that VMA
metadata in the coredump is consistent even in the presence of
concurrent modifications, and that any virtual addresses that aren't
being concurrently modified have their contents show up in the core
dump properly.
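
For orientation, the resulting flow in a binfmt's core dump handler is
roughly the following (a simplified sketch of what patches 4 and 5
implement; header emission and error handling are omitted):

	struct core_vma_metadata *vma_meta;
	int vma_count, i;

	/* snapshot VMA metadata under mmap_read_lock(), then drop the lock */
	if (dump_vma_snapshot(cprm, &vma_count, &vma_meta, vma_dump_size))
		return 0;

	/* ... write ELF program headers from vma_meta[i].start/end/flags ... */

	/* dump page contents; get_dump_page() now takes the mmap_lock itself */
	for (i = 0; i < vma_count; i++)
		if (!dump_user_range(cprm, vma_meta[i].start, vma_meta[i].dump_size))
			break;

	kvfree(vma_meta);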

The disadvantage of this approach is that we need a bit more memory
during core dumping for storing metadata about all VMAs.

After this series has landed, we should be able to rip out
mmget_still_valid().


I have tested:

 - Creating a simple core dump on X86-64 still works.
 - The created coredump on X86-64 opens in GDB and looks plausible.
 - NOMMU 32-bit ARM can still generate plausible-looking core dumps
   through the FDPIC implementation. (I can't test this with GDB because
   GDB is missing some structure definition for nommu ARM, but I've
   poked around in the hexdump and it looked decent.)

Jann Horn (5):
  binfmt_elf_fdpic: Stop using dump_emit() on user pointers on !MMU
  coredump: Let dump_emit() bail out on short writes
  coredump: Refactor page range dumping into common helper
  binfmt_elf, binfmt_elf_fdpic: Use a VMA list snapshot
  mm/gup: Take mmap_lock in get_dump_page()

 fs/binfmt_elf.c          | 184 ++++++++++++++-------------------------
 fs/binfmt_elf_fdpic.c    | 106 +++++++++-------------
 fs/coredump.c            | 125 +++++++++++++++++++++++---
 include/linux/coredump.h |  11 +++
 mm/gup.c                 |  61 +++++++------
 5 files changed, 265 insertions(+), 222 deletions(-)


base-commit: 06a4ec1d9dc652e17ee3ac2ceb6c7cf6c2b75cdd
-- 
2.28.0.220.ged08abb693-goog



* [PATCH v3 1/5] binfmt_elf_fdpic: Stop using dump_emit() on user pointers on !MMU
  2020-08-18  6:12 [PATCH v3 0/5] Fix ELF / FDPIC ELF core dumping, and use mmap_lock properly in there Jann Horn
@ 2020-08-18  6:12 ` Jann Horn
  2020-08-18  6:12 ` [PATCH v3 2/5] coredump: Let dump_emit() bail out on short writes Jann Horn
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Jann Horn @ 2020-08-18  6:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Christoph Hellwig, linux-kernel, linux-mm,
	linux-fsdevel, Alexander Viro, Eric W . Biederman, Oleg Nesterov

dump_emit() is for kernel pointers, and VMAs describe userspace memory.
Let's be tidy here and avoid accessing userspace pointers under KERNEL_DS,
even if it probably doesn't matter much on !MMU systems - especially given
that it looks like we can just use the same get_dump_page() as on MMU if
we move it out of the CONFIG_MMU block.

One small change we have to make in get_dump_page() is to use
__get_user_pages_locked() instead of __get_user_pages(), since the
latter doesn't exist on nommu. On mmu builds, __get_user_pages_locked()
will just call __get_user_pages() for us.

Signed-off-by: Jann Horn <jannh@google.com>
---
 fs/binfmt_elf_fdpic.c |  8 ------
 mm/gup.c              | 57 +++++++++++++++++++++----------------------
 2 files changed, 28 insertions(+), 37 deletions(-)

diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index 50f845702b92..a53f83830986 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -1529,14 +1529,11 @@ static bool elf_fdpic_dump_segments(struct coredump_params *cprm)
 	struct vm_area_struct *vma;
 
 	for (vma = current->mm->mmap; vma; vma = vma->vm_next) {
-#ifdef CONFIG_MMU
 		unsigned long addr;
-#endif
 
 		if (!maydump(vma, cprm->mm_flags))
 			continue;
 
-#ifdef CONFIG_MMU
 		for (addr = vma->vm_start; addr < vma->vm_end;
 							addr += PAGE_SIZE) {
 			bool res;
@@ -1552,11 +1549,6 @@ static bool elf_fdpic_dump_segments(struct coredump_params *cprm)
 			if (!res)
 				return false;
 		}
-#else
-		if (!dump_emit(cprm, (void *) vma->vm_start,
-				vma->vm_end - vma->vm_start))
-			return false;
-#endif
 	}
 	return true;
 }
diff --git a/mm/gup.c b/mm/gup.c
index ae096ea7583f..92519e5a44b3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1495,35 +1495,6 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
 		mmap_read_unlock(mm);
 	return ret;	/* 0 or negative error code */
 }
-
-/**
- * get_dump_page() - pin user page in memory while writing it to core dump
- * @addr: user address
- *
- * Returns struct page pointer of user page pinned for dump,
- * to be freed afterwards by put_page().
- *
- * Returns NULL on any kind of failure - a hole must then be inserted into
- * the corefile, to preserve alignment with its headers; and also returns
- * NULL wherever the ZERO_PAGE, or an anonymous pte_none, has been found -
- * allowing a hole to be left in the corefile to save diskspace.
- *
- * Called without mmap_lock, but after all other threads have been killed.
- */
-#ifdef CONFIG_ELF_CORE
-struct page *get_dump_page(unsigned long addr)
-{
-	struct vm_area_struct *vma;
-	struct page *page;
-
-	if (__get_user_pages(current->mm, addr, 1,
-			     FOLL_FORCE | FOLL_DUMP | FOLL_GET, &page, &vma,
-			     NULL) < 1)
-		return NULL;
-	flush_cache_page(vma, addr, page_to_pfn(page));
-	return page;
-}
-#endif /* CONFIG_ELF_CORE */
 #else /* CONFIG_MMU */
 static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
 		unsigned long nr_pages, struct page **pages,
@@ -1569,6 +1540,34 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
 }
 #endif /* !CONFIG_MMU */
 
+/**
+ * get_dump_page() - pin user page in memory while writing it to core dump
+ * @addr: user address
+ *
+ * Returns struct page pointer of user page pinned for dump,
+ * to be freed afterwards by put_page().
+ *
+ * Returns NULL on any kind of failure - a hole must then be inserted into
+ * the corefile, to preserve alignment with its headers; and also returns
+ * NULL wherever the ZERO_PAGE, or an anonymous pte_none, has been found -
+ * allowing a hole to be left in the corefile to save diskspace.
+ *
+ * Called without mmap_lock, but after all other threads have been killed.
+ */
+#ifdef CONFIG_ELF_CORE
+struct page *get_dump_page(unsigned long addr)
+{
+	struct vm_area_struct *vma;
+	struct page *page;
+
+	if (__get_user_pages_locked(current->mm, addr, 1, &page, &vma, NULL,
+				    FOLL_FORCE | FOLL_DUMP | FOLL_GET) < 1)
+		return NULL;
+	flush_cache_page(vma, addr, page_to_pfn(page));
+	return page;
+}
+#endif /* CONFIG_ELF_CORE */
+
 #if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)
 static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
 {
-- 
2.28.0.220.ged08abb693-goog



* [PATCH v3 2/5] coredump: Let dump_emit() bail out on short writes
  2020-08-18  6:12 [PATCH v3 0/5] Fix ELF / FDPIC ELF core dumping, and use mmap_lock properly in there Jann Horn
  2020-08-18  6:12 ` [PATCH v3 1/5] binfmt_elf_fdpic: Stop using dump_emit() on user pointers on !MMU Jann Horn
@ 2020-08-18  6:12 ` Jann Horn
  2020-08-18 13:40   ` Oleg Nesterov
  2020-08-18  6:12 ` [PATCH v3 3/5] coredump: Refactor page range dumping into common helper Jann Horn
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 11+ messages in thread
From: Jann Horn @ 2020-08-18  6:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Christoph Hellwig, linux-kernel, linux-mm,
	linux-fsdevel, Alexander Viro, Eric W . Biederman, Oleg Nesterov

dump_emit() has a retry loop, but there seems to be no way for that retry
logic to actually be used; and it was also buggy: because the buffer
pointer is never advanced, a short write would cause the same data to be
written out again on the next iteration.

Let's just bail out on a short write.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jann Horn <jannh@google.com>
---
 fs/coredump.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 76e7c10edfc0..5e24c06092c9 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -840,17 +840,17 @@ int dump_emit(struct coredump_params *cprm, const void *addr, int nr)
 	ssize_t n;
 	if (cprm->written + nr > cprm->limit)
 		return 0;
-	while (nr) {
-		if (dump_interrupted())
-			return 0;
-		n = __kernel_write(file, addr, nr, &pos);
-		if (n <= 0)
-			return 0;
-		file->f_pos = pos;
-		cprm->written += n;
-		cprm->pos += n;
-		nr -= n;
-	}
+
+
+	if (dump_interrupted())
+		return 0;
+	n = __kernel_write(file, addr, nr, &pos);
+	if (n != nr)
+		return 0;
+	file->f_pos = pos;
+	cprm->written += n;
+	cprm->pos += n;
+
 	return 1;
 }
 EXPORT_SYMBOL(dump_emit);
-- 
2.28.0.220.ged08abb693-goog



* [PATCH v3 3/5] coredump: Refactor page range dumping into common helper
  2020-08-18  6:12 [PATCH v3 0/5] Fix ELF / FDPIC ELF core dumping, and use mmap_lock properly in there Jann Horn
  2020-08-18  6:12 ` [PATCH v3 1/5] binfmt_elf_fdpic: Stop using dump_emit() on user pointers on !MMU Jann Horn
  2020-08-18  6:12 ` [PATCH v3 2/5] coredump: Let dump_emit() bail out on short writes Jann Horn
@ 2020-08-18  6:12 ` Jann Horn
  2020-08-18  6:12 ` [PATCH v3 4/5] binfmt_elf, binfmt_elf_fdpic: Use a VMA list snapshot Jann Horn
  2020-08-18  6:12 ` [PATCH v3 5/5] mm/gup: Take mmap_lock in get_dump_page() Jann Horn
  4 siblings, 0 replies; 11+ messages in thread
From: Jann Horn @ 2020-08-18  6:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Christoph Hellwig, linux-kernel, linux-mm,
	linux-fsdevel, Alexander Viro, Eric W . Biederman, Oleg Nesterov

Both fs/binfmt_elf.c and fs/binfmt_elf_fdpic.c need to dump ranges of pages
into the coredump file. Extract that logic into a common helper.
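
With the helper in place, a dumper's per-VMA loop body reduces to a
single call (sketch; the two callers pass slightly different lengths,
as the diffs below show):

	if (!dump_user_range(cprm, vma->vm_start, vma->vm_end - vma->vm_start))
		return false;	/* short write or interruption: give up */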

Signed-off-by: Jann Horn <jannh@google.com>
---
 fs/binfmt_elf.c          | 22 ++--------------------
 fs/binfmt_elf_fdpic.c    | 18 +++---------------
 fs/coredump.c            | 34 ++++++++++++++++++++++++++++++++++
 include/linux/coredump.h |  2 ++
 4 files changed, 41 insertions(+), 35 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 13d053982dd7..5fd11a25d320 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -2419,26 +2419,8 @@ static int elf_core_dump(struct coredump_params *cprm)
 
 	for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
 			vma = next_vma(vma, gate_vma)) {
-		unsigned long addr;
-		unsigned long end;
-
-		end = vma->vm_start + vma_filesz[i++];
-
-		for (addr = vma->vm_start; addr < end; addr += PAGE_SIZE) {
-			struct page *page;
-			int stop;
-
-			page = get_dump_page(addr);
-			if (page) {
-				void *kaddr = kmap(page);
-				stop = !dump_emit(cprm, kaddr, PAGE_SIZE);
-				kunmap(page);
-				put_page(page);
-			} else
-				stop = !dump_skip(cprm, PAGE_SIZE);
-			if (stop)
-				goto end_coredump;
-		}
+		if (!dump_user_range(cprm, vma->vm_start, vma_filesz[i++]))
+			goto end_coredump;
 	}
 	dump_truncate(cprm);
 
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index a53f83830986..76e8c0defdc8 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -1534,21 +1534,9 @@ static bool elf_fdpic_dump_segments(struct coredump_params *cprm)
 		if (!maydump(vma, cprm->mm_flags))
 			continue;
 
-		for (addr = vma->vm_start; addr < vma->vm_end;
-							addr += PAGE_SIZE) {
-			bool res;
-			struct page *page = get_dump_page(addr);
-			if (page) {
-				void *kaddr = kmap(page);
-				res = dump_emit(cprm, kaddr, PAGE_SIZE);
-				kunmap(page);
-				put_page(page);
-			} else {
-				res = dump_skip(cprm, PAGE_SIZE);
-			}
-			if (!res)
-				return false;
-		}
+		if (!dump_user_range(cprm, vma->vm_start,
+				     vma->vm_end - vma->vm_start))
+			return false;
 	}
 	return true;
 }
diff --git a/fs/coredump.c b/fs/coredump.c
index 5e24c06092c9..6042d15acd51 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -876,6 +876,40 @@ int dump_skip(struct coredump_params *cprm, size_t nr)
 }
 EXPORT_SYMBOL(dump_skip);
 
+#ifdef CONFIG_ELF_CORE
+int dump_user_range(struct coredump_params *cprm, unsigned long start,
+		    unsigned long len)
+{
+	unsigned long addr;
+
+	for (addr = start; addr < start + len; addr += PAGE_SIZE) {
+		struct page *page;
+		int stop;
+
+		/*
+		 * To avoid having to allocate page tables for virtual address
+		 * ranges that have never been used yet, and also to make it
+		 * easy to generate sparse core files, use a helper that returns
+		 * NULL when encountering an empty page table entry that would
+		 * otherwise have been filled with the zero page.
+		 */
+		page = get_dump_page(addr);
+		if (page) {
+			void *kaddr = kmap(page);
+
+			stop = !dump_emit(cprm, kaddr, PAGE_SIZE);
+			kunmap(page);
+			put_page(page);
+		} else {
+			stop = !dump_skip(cprm, PAGE_SIZE);
+		}
+		if (stop)
+			return 0;
+	}
+	return 1;
+}
+#endif
+
 int dump_align(struct coredump_params *cprm, int align)
 {
 	unsigned mod = cprm->pos & (align - 1);
diff --git a/include/linux/coredump.h b/include/linux/coredump.h
index 7a899e83835d..f0b71a74d0bc 100644
--- a/include/linux/coredump.h
+++ b/include/linux/coredump.h
@@ -16,6 +16,8 @@ extern int dump_skip(struct coredump_params *cprm, size_t nr);
 extern int dump_emit(struct coredump_params *cprm, const void *addr, int nr);
 extern int dump_align(struct coredump_params *cprm, int align);
 extern void dump_truncate(struct coredump_params *cprm);
+int dump_user_range(struct coredump_params *cprm, unsigned long start,
+		    unsigned long len);
 #ifdef CONFIG_COREDUMP
 extern void do_coredump(const kernel_siginfo_t *siginfo);
 #else
-- 
2.28.0.220.ged08abb693-goog



* [PATCH v3 4/5] binfmt_elf, binfmt_elf_fdpic: Use a VMA list snapshot
  2020-08-18  6:12 [PATCH v3 0/5] Fix ELF / FDPIC ELF core dumping, and use mmap_lock properly in there Jann Horn
                   ` (2 preceding siblings ...)
  2020-08-18  6:12 ` [PATCH v3 3/5] coredump: Refactor page range dumping into common helper Jann Horn
@ 2020-08-18  6:12 ` Jann Horn
  2020-08-18  8:18   ` Linus Torvalds
  2020-08-18  6:12 ` [PATCH v3 5/5] mm/gup: Take mmap_lock in get_dump_page() Jann Horn
  4 siblings, 1 reply; 11+ messages in thread
From: Jann Horn @ 2020-08-18  6:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Christoph Hellwig, linux-kernel, linux-mm,
	linux-fsdevel, Alexander Viro, Eric W . Biederman, Oleg Nesterov

In both binfmt_elf and binfmt_elf_fdpic, use a new helper
dump_vma_snapshot() to take a snapshot of the VMA list (including the gate
VMA, if we have one) while protected by the mmap_lock, and then use that
snapshot instead of walking the VMA list without locking.

An alternative approach would be to keep the mmap_lock held across the
entire core dumping operation; however, keeping the mmap_lock locked while
we may be blocked for an unbounded amount of time (e.g. because we're
dumping to a FUSE filesystem or so) isn't really optimal; the mmap_lock
blocks things like the ->release handler of userfaultfd, and we don't
really want critical system daemons to grind to a halt just because someone
"gifted" them SCM_RIGHTS to an eternally-locked userfaultfd, or something
like that.

Since both the normal ELF code and the FDPIC ELF code need this
functionality (and if any other binfmt wants to add coredump support in the
future, they'd probably need it, too), implement this with a common helper
in fs/coredump.c.

A downside of this approach is that we now need more kernel memory per
userspace VMA in the normal ELF case, and that the FDPIC ELF case now
needs O(n) kernel memory where it previously needed none; but 40 bytes
per VMA shouldn't be terribly bad.

Signed-off-by: Jann Horn <jannh@google.com>
---
 fs/binfmt_elf.c          | 166 ++++++++++++++++-----------------------
 fs/binfmt_elf_fdpic.c    |  86 ++++++++++----------
 fs/coredump.c            |  69 ++++++++++++++++
 include/linux/coredump.h |   9 +++
 4 files changed, 186 insertions(+), 144 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 5fd11a25d320..519c38dc9097 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1420,8 +1420,12 @@ static bool always_dump_vma(struct vm_area_struct *vma)
 	return false;
 }
 
+#define DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER 1
+
 /*
  * Decide what to dump of a segment, part, all or none.
+ * The result must be fixed up via vma_dump_size_fixup() once we're in a context
+ * that doesn't hold mmap_lock.
  */
 static unsigned long vma_dump_size(struct vm_area_struct *vma,
 				   unsigned long mm_flags)
@@ -1476,30 +1480,15 @@ static unsigned long vma_dump_size(struct vm_area_struct *vma,
 
 	/*
 	 * If this looks like the beginning of a DSO or executable mapping,
-	 * check for an ELF header.  If we find one, dump the first page to
-	 * aid in determining what was mapped here.
+	 * we'll check for an ELF header. If we find one, we'll dump the first
+	 * page to aid in determining what was mapped here.
+	 * However, we shouldn't sleep on userspace reads while holding the
+	 * mmap_lock, so we just return a placeholder for now that will be fixed
+	 * up later in vma_dump_size_fixup().
 	 */
 	if (FILTER(ELF_HEADERS) &&
-	    vma->vm_pgoff == 0 && (vma->vm_flags & VM_READ)) {
-		u32 __user *header = (u32 __user *) vma->vm_start;
-		u32 word;
-		/*
-		 * Doing it this way gets the constant folded by GCC.
-		 */
-		union {
-			u32 cmp;
-			char elfmag[SELFMAG];
-		} magic;
-		BUILD_BUG_ON(SELFMAG != sizeof word);
-		magic.elfmag[EI_MAG0] = ELFMAG0;
-		magic.elfmag[EI_MAG1] = ELFMAG1;
-		magic.elfmag[EI_MAG2] = ELFMAG2;
-		magic.elfmag[EI_MAG3] = ELFMAG3;
-		if (unlikely(get_user(word, header)))
-			word = 0;
-		if (word == magic.cmp)
-			return PAGE_SIZE;
-	}
+	    vma->vm_pgoff == 0 && (vma->vm_flags & VM_READ))
+		return DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER;
 
 #undef	FILTER
 
@@ -1509,6 +1498,22 @@ static unsigned long vma_dump_size(struct vm_area_struct *vma,
 	return vma->vm_end - vma->vm_start;
 }
 
+/*
+ * Fix up vma_dump_size()'s result based on whether this VMA has an ELF header.
+ * The caller must *not* hold mmap_lock.
+ * See vma_dump_size() for context.
+ */
+static void vma_dump_size_fixup(struct core_vma_metadata *meta)
+{
+	char elfmag[SELFMAG];
+
+	if (copy_from_user(elfmag, (void __user *)meta->start, SELFMAG) ||
+	    memcmp(elfmag, ELFMAG, SELFMAG) != 0)
+		meta->dump_size = 0;
+	else
+		meta->dump_size = PAGE_SIZE;
+}
+
 /* An ELF note in memory */
 struct memelfnote
 {
@@ -2220,32 +2225,6 @@ static void free_note_info(struct elf_note_info *info)
 
 #endif
 
-static struct vm_area_struct *first_vma(struct task_struct *tsk,
-					struct vm_area_struct *gate_vma)
-{
-	struct vm_area_struct *ret = tsk->mm->mmap;
-
-	if (ret)
-		return ret;
-	return gate_vma;
-}
-/*
- * Helper function for iterating across a vma list.  It ensures that the caller
- * will visit `gate_vma' prior to terminating the search.
- */
-static struct vm_area_struct *next_vma(struct vm_area_struct *this_vma,
-					struct vm_area_struct *gate_vma)
-{
-	struct vm_area_struct *ret;
-
-	ret = this_vma->vm_next;
-	if (ret)
-		return ret;
-	if (this_vma == gate_vma)
-		return NULL;
-	return gate_vma;
-}
-
 static void fill_extnum_info(struct elfhdr *elf, struct elf_shdr *shdr4extnum,
 			     elf_addr_t e_shoff, int segs)
 {
@@ -2272,9 +2251,8 @@ static void fill_extnum_info(struct elfhdr *elf, struct elf_shdr *shdr4extnum,
 static int elf_core_dump(struct coredump_params *cprm)
 {
 	int has_dumped = 0;
-	int segs, i;
+	int vma_count, segs, i;
 	size_t vma_data_size = 0;
-	struct vm_area_struct *vma, *gate_vma;
 	struct elfhdr elf;
 	loff_t offset = 0, dataoff;
 	struct elf_note_info info = { };
@@ -2282,30 +2260,35 @@ static int elf_core_dump(struct coredump_params *cprm)
 	struct elf_shdr *shdr4extnum = NULL;
 	Elf_Half e_phnum;
 	elf_addr_t e_shoff;
-	elf_addr_t *vma_filesz = NULL;
+	struct core_vma_metadata *vma_meta;
+
+	if (dump_vma_snapshot(cprm, &vma_count, &vma_meta, vma_dump_size))
+		return 0;
+
+	for (i = 0; i < vma_count; i++) {
+		struct core_vma_metadata *meta = vma_meta + i;
+
+		if (meta->dump_size == DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER) {
+			/*
+			 * We want to select ->dump_size based on whether the
+			 * VMA starts with an ELF header, but that uses
+			 * copy_from_user(), which doesn't work from the
+			 * dump_vma_snapshot() callback because the mmap_lock
+			 * is held there.
+			 * So instead, fix it up here, where we're not holding
+			 * any locks.
+			 */
+			vma_dump_size_fixup(meta);
+		}
+
+		vma_data_size += meta->dump_size;
+	}
 
-	/*
-	 * We no longer stop all VM operations.
-	 * 
-	 * This is because those proceses that could possibly change map_count
-	 * or the mmap / vma pages are now blocked in do_exit on current
-	 * finishing this core dump.
-	 *
-	 * Only ptrace can touch these memory addresses, but it doesn't change
-	 * the map_count or the pages allocated. So no possibility of crashing
-	 * exists while dumping the mm->vm_next areas to the core file.
-	 */
-  
 	/*
 	 * The number of segs are recored into ELF header as 16bit value.
 	 * Please check DEFAULT_MAX_MAP_COUNT definition when you modify here.
 	 */
-	segs = current->mm->map_count;
-	segs += elf_core_extra_phdrs();
-
-	gate_vma = get_gate_vma(current->mm);
-	if (gate_vma != NULL)
-		segs++;
+	segs = vma_count + elf_core_extra_phdrs();
 
 	/* for notes section */
 	segs++;
@@ -2343,24 +2326,6 @@ static int elf_core_dump(struct coredump_params *cprm)
 
 	dataoff = offset = roundup(offset, ELF_EXEC_PAGESIZE);
 
-	/*
-	 * Zero vma process will get ZERO_SIZE_PTR here.
-	 * Let coredump continue for register state at least.
-	 */
-	vma_filesz = kvmalloc(array_size(sizeof(*vma_filesz), (segs - 1)),
-			      GFP_KERNEL);
-	if (!vma_filesz)
-		goto end_coredump;
-
-	for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
-			vma = next_vma(vma, gate_vma)) {
-		unsigned long dump_size;
-
-		dump_size = vma_dump_size(vma, cprm->mm_flags);
-		vma_filesz[i++] = dump_size;
-		vma_data_size += dump_size;
-	}
-
 	offset += vma_data_size;
 	offset += elf_core_extra_data_size();
 	e_shoff = offset;
@@ -2381,21 +2346,23 @@ static int elf_core_dump(struct coredump_params *cprm)
 		goto end_coredump;
 
 	/* Write program headers for segments dump */
-	for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
-			vma = next_vma(vma, gate_vma)) {
+	for (i = 0; i < vma_count; i++) {
+		struct core_vma_metadata *meta = vma_meta + i;
 		struct elf_phdr phdr;
 
 		phdr.p_type = PT_LOAD;
 		phdr.p_offset = offset;
-		phdr.p_vaddr = vma->vm_start;
+		phdr.p_vaddr = meta->start;
 		phdr.p_paddr = 0;
-		phdr.p_filesz = vma_filesz[i++];
-		phdr.p_memsz = vma->vm_end - vma->vm_start;
+		phdr.p_filesz = meta->dump_size;
+		phdr.p_memsz = meta->end - meta->start;
 		offset += phdr.p_filesz;
-		phdr.p_flags = vma->vm_flags & VM_READ ? PF_R : 0;
-		if (vma->vm_flags & VM_WRITE)
+		phdr.p_flags = 0;
+		if (meta->flags & VM_READ)
+			phdr.p_flags |= PF_R;
+		if (meta->flags & VM_WRITE)
 			phdr.p_flags |= PF_W;
-		if (vma->vm_flags & VM_EXEC)
+		if (meta->flags & VM_EXEC)
 			phdr.p_flags |= PF_X;
 		phdr.p_align = ELF_EXEC_PAGESIZE;
 
@@ -2417,9 +2384,10 @@ static int elf_core_dump(struct coredump_params *cprm)
 	if (!dump_skip(cprm, dataoff - cprm->pos))
 		goto end_coredump;
 
-	for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
-			vma = next_vma(vma, gate_vma)) {
-		if (!dump_user_range(cprm, vma->vm_start, vma_filesz[i++]))
+	for (i = 0; i < vma_count; i++) {
+		struct core_vma_metadata *meta = vma_meta + i;
+
+		if (!dump_user_range(cprm, meta->start, meta->dump_size))
 			goto end_coredump;
 	}
 	dump_truncate(cprm);
@@ -2435,7 +2403,7 @@ static int elf_core_dump(struct coredump_params *cprm)
 end_coredump:
 	free_note_info(&info);
 	kfree(shdr4extnum);
-	kvfree(vma_filesz);
+	kvfree(vma_meta);
 	kfree(phdr4note);
 	return has_dumped;
 }
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index 76e8c0defdc8..f9d35bc9f0af 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -1222,7 +1222,8 @@ struct elf_prstatus_fdpic
  *
  * I think we should skip something. But I am not sure how. H.J.
  */
-static int maydump(struct vm_area_struct *vma, unsigned long mm_flags)
+static unsigned long vma_dump_size(struct vm_area_struct *vma,
+				   unsigned long mm_flags)
 {
 	int dump_ok;
 
@@ -1251,7 +1252,7 @@ static int maydump(struct vm_area_struct *vma, unsigned long mm_flags)
 			kdcore("%08lx: %08lx: %s (DAX private)", vma->vm_start,
 			       vma->vm_flags, dump_ok ? "yes" : "no");
 		}
-		return dump_ok;
+		goto out;
 	}
 
 	/* By default, dump shared memory if mapped from an anonymous file. */
@@ -1260,13 +1261,13 @@ static int maydump(struct vm_area_struct *vma, unsigned long mm_flags)
 			dump_ok = test_bit(MMF_DUMP_ANON_SHARED, &mm_flags);
 			kdcore("%08lx: %08lx: %s (share)", vma->vm_start,
 			       vma->vm_flags, dump_ok ? "yes" : "no");
-			return dump_ok;
+			goto out;
 		}
 
 		dump_ok = test_bit(MMF_DUMP_MAPPED_SHARED, &mm_flags);
 		kdcore("%08lx: %08lx: %s (share)", vma->vm_start,
 		       vma->vm_flags, dump_ok ? "yes" : "no");
-		return dump_ok;
+		goto out;
 	}
 
 #ifdef CONFIG_MMU
@@ -1275,14 +1276,16 @@ static int maydump(struct vm_area_struct *vma, unsigned long mm_flags)
 		dump_ok = test_bit(MMF_DUMP_MAPPED_PRIVATE, &mm_flags);
 		kdcore("%08lx: %08lx: %s (!anon)", vma->vm_start,
 		       vma->vm_flags, dump_ok ? "yes" : "no");
-		return dump_ok;
+		goto out;
 	}
 #endif
 
 	dump_ok = test_bit(MMF_DUMP_ANON_PRIVATE, &mm_flags);
 	kdcore("%08lx: %08lx: %s", vma->vm_start, vma->vm_flags,
 	       dump_ok ? "yes" : "no");
-	return dump_ok;
+
+out:
+	return dump_ok ? vma->vm_end - vma->vm_start : 0;
 }
 
 /* An ELF note in memory */
@@ -1524,31 +1527,30 @@ static void fill_extnum_info(struct elfhdr *elf, struct elf_shdr *shdr4extnum,
 /*
  * dump the segments for an MMU process
  */
-static bool elf_fdpic_dump_segments(struct coredump_params *cprm)
+static bool elf_fdpic_dump_segments(struct coredump_params *cprm,
+				    struct core_vma_metadata *vma_meta,
+				    int vma_count)
 {
-	struct vm_area_struct *vma;
+	int i;
 
-	for (vma = current->mm->mmap; vma; vma = vma->vm_next) {
-		unsigned long addr;
+	for (i = 0; i < vma_count; i++) {
+		struct core_vma_metadata *meta = vma_meta + i;
 
-		if (!maydump(vma, cprm->mm_flags))
-			continue;
-
-		if (!dump_user_range(cprm, vma->vm_start,
-				     vma->vm_end - vma->vm_start))
+		if (!dump_user_range(cprm, meta->start, meta->dump_size))
 			return false;
 	}
 	return true;
 }
 
-static size_t elf_core_vma_data_size(unsigned long mm_flags)
+static size_t elf_core_vma_data_size(unsigned long mm_flags,
+				     struct core_vma_metadata *vma_meta,
+				     int vma_count)
 {
-	struct vm_area_struct *vma;
 	size_t size = 0;
+	int i;
 
-	for (vma = current->mm->mmap; vma; vma = vma->vm_next)
-		if (maydump(vma, mm_flags))
-			size += vma->vm_end - vma->vm_start;
+	for (i = 0; i < vma_count; i++)
+		size += vma_meta[i].dump_size;
 	return size;
 }
 
@@ -1562,9 +1564,8 @@ static size_t elf_core_vma_data_size(unsigned long mm_flags)
 static int elf_fdpic_core_dump(struct coredump_params *cprm)
 {
 	int has_dumped = 0;
-	int segs;
+	int vma_count, segs;
 	int i;
-	struct vm_area_struct *vma;
 	struct elfhdr *elf = NULL;
 	loff_t offset = 0, dataoff;
 	struct memelfnote psinfo_note, auxv_note;
@@ -1578,18 +1579,7 @@ static int elf_fdpic_core_dump(struct coredump_params *cprm)
 	elf_addr_t e_shoff;
 	struct core_thread *ct;
 	struct elf_thread_status *tmp;
-
-	/*
-	 * We no longer stop all VM operations.
-	 *
-	 * This is because those proceses that could possibly change map_count
-	 * or the mmap / vma pages are now blocked in do_exit on current
-	 * finishing this core dump.
-	 *
-	 * Only ptrace can touch these memory addresses, but it doesn't change
-	 * the map_count or the pages allocated. So no possibility of crashing
-	 * exists while dumping the mm->vm_next areas to the core file.
-	 */
+	struct core_vma_metadata *vma_meta = NULL;
 
 	/* alloc memory for large data structures: too large to be on stack */
 	elf = kmalloc(sizeof(*elf), GFP_KERNEL);
@@ -1599,6 +1589,9 @@ static int elf_fdpic_core_dump(struct coredump_params *cprm)
 	if (!psinfo)
 		goto end_coredump;
 
+	if (dump_vma_snapshot(cprm, &vma_count, &vma_meta, vma_dump_size))
+		goto end_coredump;
+
 	for (ct = current->mm->core_state->dumper.next;
 					ct; ct = ct->next) {
 		tmp = elf_dump_thread_status(cprm->siginfo->si_signo,
@@ -1618,8 +1611,7 @@ static int elf_fdpic_core_dump(struct coredump_params *cprm)
 	tmp->next = thread_list;
 	thread_list = tmp;
 
-	segs = current->mm->map_count;
-	segs += elf_core_extra_phdrs();
+	segs = vma_count + elf_core_extra_phdrs();
 
 	/* for notes section */
 	segs++;
@@ -1664,7 +1656,7 @@ static int elf_fdpic_core_dump(struct coredump_params *cprm)
 	/* Page-align dumped data */
 	dataoff = offset = roundup(offset, ELF_EXEC_PAGESIZE);
 
-	offset += elf_core_vma_data_size(cprm->mm_flags);
+	offset += elf_core_vma_data_size(cprm->mm_flags, vma_meta, vma_count);
 	offset += elf_core_extra_data_size();
 	e_shoff = offset;
 
@@ -1684,23 +1676,26 @@ static int elf_fdpic_core_dump(struct coredump_params *cprm)
 		goto end_coredump;
 
 	/* write program headers for segments dump */
-	for (vma = current->mm->mmap; vma; vma = vma->vm_next) {
+	for (i = 0; i < vma_count; i++) {
+		struct core_vma_metadata *meta = vma_meta + i;
 		struct elf_phdr phdr;
 		size_t sz;
 
-		sz = vma->vm_end - vma->vm_start;
+		sz = meta->end - meta->start;
 
 		phdr.p_type = PT_LOAD;
 		phdr.p_offset = offset;
-		phdr.p_vaddr = vma->vm_start;
+		phdr.p_vaddr = meta->start;
 		phdr.p_paddr = 0;
-		phdr.p_filesz = maydump(vma, cprm->mm_flags) ? sz : 0;
+		phdr.p_filesz = meta->dump_size;
 		phdr.p_memsz = sz;
 		offset += phdr.p_filesz;
-		phdr.p_flags = vma->vm_flags & VM_READ ? PF_R : 0;
-		if (vma->vm_flags & VM_WRITE)
+		phdr.p_flags = 0;
+		if (meta->flags & VM_READ)
+			phdr.p_flags |= PF_R;
+		if (meta->flags & VM_WRITE)
 			phdr.p_flags |= PF_W;
-		if (vma->vm_flags & VM_EXEC)
+		if (meta->flags & VM_EXEC)
 			phdr.p_flags |= PF_X;
 		phdr.p_align = ELF_EXEC_PAGESIZE;
 
@@ -1732,7 +1727,7 @@ static int elf_fdpic_core_dump(struct coredump_params *cprm)
 	if (!dump_skip(cprm, dataoff - cprm->pos))
 		goto end_coredump;
 
-	if (!elf_fdpic_dump_segments(cprm))
+	if (!elf_fdpic_dump_segments(cprm, vma_meta, vma_count))
 		goto end_coredump;
 
 	if (!elf_core_write_extra_data(cprm))
@@ -1756,6 +1751,7 @@ static int elf_fdpic_core_dump(struct coredump_params *cprm)
 		thread_list = thread_list->next;
 		kfree(tmp);
 	}
+	kvfree(vma_meta);
 	kfree(phdr4note);
 	kfree(elf);
 	kfree(psinfo);
diff --git a/fs/coredump.c b/fs/coredump.c
index 6042d15acd51..7b4486b701bb 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -936,3 +936,72 @@ void dump_truncate(struct coredump_params *cprm)
 	}
 }
 EXPORT_SYMBOL(dump_truncate);
+
+static struct vm_area_struct *first_vma(struct task_struct *tsk,
+					struct vm_area_struct *gate_vma)
+{
+	struct vm_area_struct *ret = tsk->mm->mmap;
+
+	if (ret)
+		return ret;
+	return gate_vma;
+}
+
+/*
+ * Helper function for iterating across a vma list.  It ensures that the caller
+ * will visit `gate_vma' prior to terminating the search.
+ */
+static struct vm_area_struct *next_vma(struct vm_area_struct *this_vma,
+				       struct vm_area_struct *gate_vma)
+{
+	struct vm_area_struct *ret;
+
+	ret = this_vma->vm_next;
+	if (ret)
+		return ret;
+	if (this_vma == gate_vma)
+		return NULL;
+	return gate_vma;
+}
+
+/*
+ * Under the mmap_lock, take a snapshot of relevant information about the task's
+ * VMAs.
+ */
+int dump_vma_snapshot(struct coredump_params *cprm, int *vma_count,
+		      struct core_vma_metadata **vma_meta,
+		      unsigned long (*dump_size_cb)(struct vm_area_struct *, unsigned long))
+{
+	struct vm_area_struct *vma, *gate_vma;
+	struct mm_struct *mm = current->mm;
+	int i;
+
+	if (mmap_read_lock_killable(mm))
+		return -EINTR;
+
+	gate_vma = get_gate_vma(mm);
+	*vma_count = mm->map_count + (gate_vma ? 1 : 0);
+
+	*vma_meta = kvmalloc_array(*vma_count, sizeof(**vma_meta), GFP_KERNEL);
+	if (!*vma_meta) {
+		mmap_read_unlock(mm);
+		return -ENOMEM;
+	}
+
+	for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
+			vma = next_vma(vma, gate_vma), i++) {
+		struct core_vma_metadata *m = (*vma_meta) + i;
+
+		m->start = vma->vm_start;
+		m->end = vma->vm_end;
+		m->flags = vma->vm_flags;
+		m->dump_size = dump_size_cb(vma, cprm->mm_flags);
+	}
+
+	mmap_read_unlock(mm);
+
+	if (WARN_ON(i != *vma_count))
+		return -EFAULT;
+
+	return 0;
+}
diff --git a/include/linux/coredump.h b/include/linux/coredump.h
index f0b71a74d0bc..21bbef9bd8d6 100644
--- a/include/linux/coredump.h
+++ b/include/linux/coredump.h
@@ -7,6 +7,12 @@
 #include <linux/fs.h>
 #include <asm/siginfo.h>
 
+struct core_vma_metadata {
+	unsigned long start, end;
+	unsigned long flags;
+	unsigned long dump_size;
+};
+
 /*
  * These are the only things you should do on a core-file: use only these
  * functions to write out all the necessary info.
@@ -18,6 +24,9 @@ extern int dump_align(struct coredump_params *cprm, int align);
 extern void dump_truncate(struct coredump_params *cprm);
 int dump_user_range(struct coredump_params *cprm, unsigned long start,
 		    unsigned long len);
+int dump_vma_snapshot(struct coredump_params *cprm, int *vma_count,
+		      struct core_vma_metadata **vma_meta,
+		      unsigned long (*dump_size_cb)(struct vm_area_struct *, unsigned long));
 #ifdef CONFIG_COREDUMP
 extern void do_coredump(const kernel_siginfo_t *siginfo);
 #else
-- 
2.28.0.220.ged08abb693-goog



* [PATCH v3 5/5] mm/gup: Take mmap_lock in get_dump_page()
  2020-08-18  6:12 [PATCH v3 0/5] Fix ELF / FDPIC ELF core dumping, and use mmap_lock properly in there Jann Horn
                   ` (3 preceding siblings ...)
  2020-08-18  6:12 ` [PATCH v3 4/5] binfmt_elf, binfmt_elf_fdpic: Use a VMA list snapshot Jann Horn
@ 2020-08-18  6:12 ` Jann Horn
  4 siblings, 0 replies; 11+ messages in thread
From: Jann Horn @ 2020-08-18  6:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Christoph Hellwig, linux-kernel, linux-mm,
	linux-fsdevel, Alexander Viro, Eric W . Biederman, Oleg Nesterov

Properly take the mmap_lock before calling into the GUP code from
get_dump_page(); and play nice, allowing the GUP code to drop the mmap_lock
if it has to sleep.

As Linus pointed out, we don't actually need the VMA because
__get_user_pages() will flush the dcache for us if necessary.
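
For readers unfamiliar with the convention used below: passing a
non-NULL "locked" pointer tells the GUP code that it may drop the
mmap_lock while it sleeps, and *locked is cleared if it did, so the
caller must only unlock when the lock is still held. Sketch of that
pattern (essentially what the hunk below does):

	int locked = 1;

	if (mmap_read_lock_killable(mm))
		return NULL;
	ret = __get_user_pages_locked(mm, addr, 1, &page, NULL, &locked,
				      FOLL_FORCE | FOLL_DUMP | FOLL_GET);
	if (locked)
		mmap_read_unlock(mm);	/* GUP may have dropped the lock already */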

Signed-off-by: Jann Horn <jannh@google.com>
---
 mm/gup.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 92519e5a44b3..bd0f7311c5c6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1552,19 +1552,23 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
  * NULL wherever the ZERO_PAGE, or an anonymous pte_none, has been found -
  * allowing a hole to be left in the corefile to save diskspace.
  *
- * Called without mmap_lock, but after all other threads have been killed.
+ * Called without mmap_lock (takes and releases the mmap_lock by itself).
  */
 #ifdef CONFIG_ELF_CORE
 struct page *get_dump_page(unsigned long addr)
 {
-	struct vm_area_struct *vma;
+	struct mm_struct *mm = current->mm;
 	struct page *page;
+	int locked = 1;
+	int ret;
 
-	if (__get_user_pages_locked(current->mm, addr, 1, &page, &vma, NULL,
-				    FOLL_FORCE | FOLL_DUMP | FOLL_GET) < 1)
+	if (mmap_read_lock_killable(mm))
 		return NULL;
-	flush_cache_page(vma, addr, page_to_pfn(page));
-	return page;
+	ret = __get_user_pages_locked(mm, addr, 1, &page, NULL, &locked,
+				      FOLL_FORCE | FOLL_DUMP | FOLL_GET);
+	if (locked)
+		mmap_read_unlock(mm);
+	return (ret == 1) ? page : NULL;
 }
 #endif /* CONFIG_ELF_CORE */
 
-- 
2.28.0.220.ged08abb693-goog



* Re: [PATCH v3 4/5] binfmt_elf, binfmt_elf_fdpic: Use a VMA list snapshot
  2020-08-18  6:12 ` [PATCH v3 4/5] binfmt_elf, binfmt_elf_fdpic: Use a VMA list snapshot Jann Horn
@ 2020-08-18  8:18   ` Linus Torvalds
  2020-08-18  8:31     ` Jann Horn
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2020-08-18  8:18 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Christoph Hellwig, Linux Kernel Mailing List,
	Linux-MM, linux-fsdevel, Alexander Viro, Eric W . Biederman,
	Oleg Nesterov

On Mon, Aug 17, 2020 at 11:13 PM Jann Horn <jannh@google.com> wrote:
>
>         /*
>          * If this looks like the beginning of a DSO or executable mapping,
> +        * we'll check for an ELF header. If we find one, we'll dump the first
> +        * page to aid in determining what was mapped here.
> +        * However, we shouldn't sleep on userspace reads while holding the
> +        * mmap_lock, so we just return a placeholder for now that will be fixed
> +        * up later in vma_dump_size_fixup().

I still don't like this.

And I still don't think it's necessary.

The whole - and only - point of "check if it's an ELF header" is that
we don't want to dump data that could just be found by looking at the
original binary.

But by the time we get to this point, we already know that

 (a) it's a private mapping with file backing, and it's the first page
of the file

 (b) it has never been written to and it's mapped for reading

and the choice at this point is "don't dump at all", or "dump just the
first page".

And honestly, that whole "check if it has the ELF header" signature
was always just a heuristic. Nothing should depend on it anyway.

We already skip dumping file data under a lot of other circumstances
(and perhaps equally importantly, we already decided to dump it all
under other circumstances).

I think this DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER hackery is worse than
just changing the heuristic.

So instead, just say "ok, if the file was executable, let's dump the
first page".

The test might be as simple as just checking

       if (file_inode(vma->vm_file)->i_mode & 0111)

and you'd be done. That's likely a _better_ heuristic than the "let's
go look at the random first word in memory".

Your patches look otherwise fine, but I really really despise that
DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER, and I don't think it's even
necessary.

             Linus


* Re: [PATCH v3 4/5] binfmt_elf, binfmt_elf_fdpic: Use a VMA list snapshot
  2020-08-18  8:18   ` Linus Torvalds
@ 2020-08-18  8:31     ` Jann Horn
  0 siblings, 0 replies; 11+ messages in thread
From: Jann Horn @ 2020-08-18  8:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Christoph Hellwig, Linux Kernel Mailing List,
	Linux-MM, linux-fsdevel, Alexander Viro, Eric W . Biederman,
	Oleg Nesterov

On Tue, Aug 18, 2020 at 10:18 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Aug 17, 2020 at 11:13 PM Jann Horn <jannh@google.com> wrote:
> >
> >         /*
> >          * If this looks like the beginning of a DSO or executable mapping,
> > +        * we'll check for an ELF header. If we find one, we'll dump the first
> > +        * page to aid in determining what was mapped here.
> > +        * However, we shouldn't sleep on userspace reads while holding the
> > +        * mmap_lock, so we just return a placeholder for now that will be fixed
> > +        * up later in vma_dump_size_fixup().
>
> I still don't like this.
>
> And I still don't think it's necessary.
>
> The whole - and only - point of "check if it's an ELF header" is that
> we don't want to dump data that could just be found by looking at the
> original binary.
>
> But by the time we get to this point, we already know that
>
>  (a) it's a private mapping with file backing, and it's the first page
> of the file
>
>  (b) it has never been written to and it's mapped for reading
>
> and the choice at this point is "don't dump at all", or "dump just the
> first page".
>
> And honestly, that whole "check if it has the ELF header" signature
> was always just a heuristic. Nothing should depend on it anyway.
>
> We already skip dumping file data under a lot of other circumstances
> (and perhaps equally importantly, we already decided to dump it all
> under other circumstances).
>
> I think this DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER hackery is worse than
> just changing the heuristic.
>
> So instead, just say "ok, if the file was executable, let's dump the
> first page".
>
> The test might be as simple as just checking
>
>        if (file_inode(vma->vm_file)->i_mode & 0111)
>
> and you'd be done. That's likely a _better_ heuristic than the "let's
> go look at the random first word in memory".
>
> Your patches look otherwise fine, but I really really despise that
> DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER, and I don't think it's even
> necessary.

Yeah, good point, it's a pretty ugly hack. I'll make a new version
along the lines of what you suggested.
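
For concreteness, a sketch of what the simplified check in
vma_dump_size() could look like (hypothetical, based on Linus's
suggestion; the actual v4 may end up differing):

	/*
	 * Dump the first page of any readable, file-backed mapping whose
	 * backing file is executable, instead of peeking at the mapping's
	 * first word for an ELF magic.
	 */
	if (FILTER(ELF_HEADERS) &&
	    vma->vm_pgoff == 0 && (vma->vm_flags & VM_READ) &&
	    (file_inode(vma->vm_file)->i_mode & 0111) != 0)
		return PAGE_SIZE;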


* Re: [PATCH v3 2/5] coredump: Let dump_emit() bail out on short writes
  2020-08-18  6:12 ` [PATCH v3 2/5] coredump: Let dump_emit() bail out on short writes Jann Horn
@ 2020-08-18 13:40   ` Oleg Nesterov
  2020-08-18 15:08     ` Jann Horn
  2020-08-18 15:17     ` Al Viro
  0 siblings, 2 replies; 11+ messages in thread
From: Oleg Nesterov @ 2020-08-18 13:40 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Linus Torvalds, Christoph Hellwig, linux-kernel,
	linux-mm, linux-fsdevel, Alexander Viro, Eric W . Biederman

On 08/18, Jann Horn wrote:
>
> +	if (dump_interrupted())
> +		return 0;
> +	n = __kernel_write(file, addr, nr, &pos);
> +	if (n != nr)
> +		return 0;
> +	file->f_pos = pos;

Just curious, can't we simply do

	__kernel_write(file, addr, nr, &file->f_pos);

and avoid "loff_t pos" ?

Oleg.



* Re: [PATCH v3 2/5] coredump: Let dump_emit() bail out on short writes
  2020-08-18 13:40   ` Oleg Nesterov
@ 2020-08-18 15:08     ` Jann Horn
  2020-08-18 15:17     ` Al Viro
  1 sibling, 0 replies; 11+ messages in thread
From: Jann Horn @ 2020-08-18 15:08 UTC (permalink / raw)
  To: Oleg Nesterov, Alexander Viro
  Cc: Andrew Morton, Linus Torvalds, Christoph Hellwig, kernel list,
	Linux-MM, linux-fsdevel, Eric W . Biederman

On Tue, Aug 18, 2020 at 3:40 PM Oleg Nesterov <oleg@redhat.com> wrote:
> On 08/18, Jann Horn wrote:
> >
> > +     if (dump_interrupted())
> > +             return 0;
> > +     n = __kernel_write(file, addr, nr, &pos);
> > +     if (n != nr)
> > +             return 0;
> > +     file->f_pos = pos;
>
> Just curious, can't we simply do
>
>         __kernel_write(file, addr, nr, &file->f_pos);
>
> and avoid "loff_t pos" ?

Hm... e.g. ksys_write() has the same pattern of copying the value into
a local variable and back, but I guess maybe there it's done so that
->f_pos can't change when vfs_write() returns a negative value, or
something like that? Or maybe to make the update look more atomic?
None of that is a concern for the core-dumping code, so I guess we
could change it... but then again, maybe we shouldn't diverge from how
it's done in fs/read_write.c (e.g. in ksys_write()) too much.
Coredumping is already a bit too special, no need to make it worse...

It looks like Al Viro introduced this as part of commit 2507a4fbd48a
("make dump_emit() use vfs_write() instead of banging at ->f_op->write
directly"). Before that commit, &file->f_pos was actually passed as a
parameter, just like you're proposing. I don't really want to try
reverting parts of Al's commits without understanding what exactly is
going on...


* Re: [PATCH v3 2/5] coredump: Let dump_emit() bail out on short writes
  2020-08-18 13:40   ` Oleg Nesterov
  2020-08-18 15:08     ` Jann Horn
@ 2020-08-18 15:17     ` Al Viro
  1 sibling, 0 replies; 11+ messages in thread
From: Al Viro @ 2020-08-18 15:17 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jann Horn, Andrew Morton, Linus Torvalds, Christoph Hellwig,
	linux-kernel, linux-mm, linux-fsdevel, Eric W . Biederman

On Tue, Aug 18, 2020 at 03:40:28PM +0200, Oleg Nesterov wrote:
> On 08/18, Jann Horn wrote:
> >
> > +	if (dump_interrupted())
> > +		return 0;
> > +	n = __kernel_write(file, addr, nr, &pos);
> > +	if (n != nr)
> > +		return 0;
> > +	file->f_pos = pos;
> 
> Just curious, can't we simply do
> 
> 	__kernel_write(file, addr, nr, &file->f_pos);
> 
> and avoid "loff_t pos" ?

	Bloody bad pattern; it would be (probably) safe in this case,
but in general ->f_pos is shared data.  Exposing it to fuckloads of
->write() instances is a bad idea - we had bugs like that.

	General rule: never pass an address of ->f_pos to anything,
and limit access to it as much as possible.
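
For reference, the pattern the patch keeps (and which e.g. ksys_write()
also uses) reads ->f_pos into a private variable, hands only that to
the write helper, and publishes it back afterwards - a sketch:

	loff_t pos = file->f_pos;	/* private copy; callees never see &file->f_pos */
	ssize_t n;

	n = __kernel_write(file, addr, nr, &pos);
	if (n != nr)
		return 0;
	file->f_pos = pos;	/* publish only after the full write succeeded */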

