linux-unionfs.vger.kernel.org archive mirror
* [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
@ 2021-08-12  8:43 David Hildenbrand
  2021-08-12  8:43 ` [PATCH v1 1/7] binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib() David Hildenbrand
                   ` (8 more replies)
  0 siblings, 9 replies; 82+ messages in thread
From: David Hildenbrand @ 2021-08-12  8:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Alexander Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Eric W. Biederman,
	Greg Ungerer, Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, linux-api, x86, linux-fsdevel, linux-mm

This series is based on v5.14-rc5 and corresponds code-wise to the
previously sent RFC [1] (the RFC still applied cleanly).

This series removes all in-tree usage of MAP_DENYWRITE from the kernel
and removes VM_DENYWRITE. We stopped supporting MAP_DENYWRITE for
user space applications a while ago because of the chance for DoS.
The last remaining users are binfmt binary loading during exec and
legacy library loading via uselib().

With this change, MAP_DENYWRITE is effectively ignored throughout the
kernel. Although the net change is small, I think the cleanup in mmap()
is quite nice.

There are some (minor) user-visible changes with this series:
1. We no longer deny write access to shared libraries loaded via legacy
   uselib(); this behavior matches modern user space, e.g., dlopen().
2. We no longer deny write access to the ELF interpreter after exec has
   completed, treating it just like shared libraries (which it often is).
3. We always deny write access to the file linked via /proc/pid/exe:
   sys_prctl(PR_SET_MM_EXE_FILE) will fail if write access to the file
   cannot be denied, and write access to the file will remain denied
   until the link is effectively gone (exec, termination,
   PR_SET_MM_EXE_FILE) -- just as if exec'ing the file.
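
To make change 3 concrete, here is a minimal user-space sketch of how the
denial shows up to a would-be writer. probe_write_open() is a hypothetical
helper name used only for illustration; it is not part of the series.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to open `path` for writing without creating or truncating it.
 * Returns 0 if the open succeeded, otherwise the errno value.
 * Hypothetical helper for illustration; not a kernel or libc API.
 */
int probe_write_open(const char *path)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return errno;
	close(fd);
	return 0;
}
```

Calling probe_write_open("/proc/self/exe") from a running program would be
expected to return ETXTBSY, assuming /proc is mounted and the caller has
write permission on the binary (otherwise a permission error is reported
first).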

I was wondering if we really care about permanently disabling write access
to the executable, or if it would be good enough to just disable write
access while loading the new executable during exec; but I don't know
the history of that -- and it somewhat makes sense to deny write access
at least to the main executable. With modern user space -- dlopen() -- we
can effectively modify the content of shared libraries while being used.

There is a related problem [2] with overlayfs, that should at least partly
be tackled by this series. I don't quite understand the interaction of
overlayfs and deny_write_access()/allow_write_access() at exec time:

If we end up denying write access to the wrong file and not to the
realfile, that would be fundamentally broken. We would have to reroute
our deny_write_access()/allow_write_access() calls for the exec file to
the realfile -- but I leave figuring out the details to overlayfs guys, as
that would be a related but different issue.
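
For readers unfamiliar with the helpers, the following is a user-space model
of the counter semantics as I understand them: a positive count means the
file has writers, a negative count means write access is denied. The model_*
names and struct model_inode are made up for this sketch; the real helpers
operate on inode->i_writecount.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* User-space stand-in for an inode; only the write counter is modelled. */
struct model_inode {
	atomic_int writecount;	/* > 0: writers, < 0: deniers, 0: neither */
};

/* Writer side: succeeds only while nobody denies writes. */
int model_get_write_access(struct model_inode *inode)
{
	int old = atomic_load(&inode->writecount);

	do {
		if (old < 0)
			return -ETXTBSY;
	} while (!atomic_compare_exchange_weak(&inode->writecount,
					       &old, old + 1));
	return 0;
}

/* Writer side: drop a reference taken by model_get_write_access(). */
void model_put_write_access(struct model_inode *inode)
{
	atomic_fetch_sub(&inode->writecount, 1);
}

/* Denier side (e.g. exec): succeeds only while there are no writers. */
int model_deny_write_access(struct model_inode *inode)
{
	int old = atomic_load(&inode->writecount);

	do {
		if (old > 0)
			return -ETXTBSY;
	} while (!atomic_compare_exchange_weak(&inode->writecount,
					       &old, old - 1));
	return 0;
}

/* Denier side: undo one model_deny_write_access(). */
void model_allow_write_access(struct model_inode *inode)
{
	atomic_fetch_add(&inode->writecount, 1);
}
```

The point of the model is that the counter is per inode: if exec denies
write access on one inode while writers go through another (as feared above
for overlayfs), the two sides never see each other.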

RFC -> v1:
- "binfmt: remove in-tree usage of MAP_DENYWRITE"
-- Add a note that this should fix part of a problem with overlayfs

[1] https://lore.kernel.org/r/20210423131640.20080-1-david@redhat.com/
[2] https://lore.kernel.org/r/YNHXzBgzRrZu1MrD@miu.piliscsaba.redhat.com/

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jann Horn <jannh@google.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Kevin Brodsky <Kevin.Brodsky@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Shawn Anastasio <shawn@anastas.io>
Cc: Steven Price <steven.price@arm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Cc: Thomas Cedeno <thomascedeno@google.com>
Cc: Collin Fijalkovich <cfijalkovich@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chengguang Xu <cgxu519@mykernel.net>
Cc: "Christian König" <ckoenig.leichtzumerken@gmail.com>
Cc: linux-unionfs@vger.kernel.org
Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org

David Hildenbrand (7):
  binfmt: don't use MAP_DENYWRITE when loading shared libraries via
    uselib()
  kernel/fork: factor out atomically replacing the current MM exe_file
  kernel/fork: always deny write access to current MM exe_file
  binfmt: remove in-tree usage of MAP_DENYWRITE
  mm: remove VM_DENYWRITE
  mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
  fs: update documentation of get_write_access() and friends

 arch/x86/ia32/ia32_aout.c      |  8 ++--
 fs/binfmt_aout.c               |  7 ++--
 fs/binfmt_elf.c                |  6 +--
 fs/binfmt_elf_fdpic.c          |  2 +-
 fs/proc/task_mmu.c             |  1 -
 include/linux/fs.h             | 19 +++++----
 include/linux/mm.h             |  3 +-
 include/linux/mman.h           |  4 +-
 include/trace/events/mmflags.h |  1 -
 kernel/events/core.c           |  2 -
 kernel/fork.c                  | 75 ++++++++++++++++++++++++++++++----
 kernel/sys.c                   | 33 +--------------
 lib/test_printf.c              |  5 +--
 mm/mmap.c                      | 29 ++-----------
 mm/nommu.c                     |  2 -
 15 files changed, 98 insertions(+), 99 deletions(-)


base-commit: 36a21d51725af2ce0700c6ebcb6b9594aac658a6
-- 
2.31.1


* [PATCH v1 1/7] binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib()
  2021-08-12  8:43 [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE David Hildenbrand
@ 2021-08-12  8:43 ` David Hildenbrand
  2021-08-12  8:43 ` [PATCH v1 2/7] kernel/fork: factor out atomically replacing the current MM exe_file David Hildenbrand
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 82+ messages in thread
From: David Hildenbrand @ 2021-08-12  8:43 UTC (permalink / raw)
  To: linux-kernel

uselib() is the legacy system call for loading shared libraries.
Nowadays, applications use dlopen() to load shared libraries, completely
implemented in user space via mmap().

For example, glibc uses MAP_COPY to mmap shared libraries. While this
maps to MAP_PRIVATE | MAP_DENYWRITE on Linux, Linux ignores any
MAP_DENYWRITE specification from user space in mmap.
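
A small user-space check of that claim, as a sketch: map a scratch file with
MAP_PRIVATE | MAP_DENYWRITE and then open the same file for writing while the
mapping exists. The function name and the /tmp scratch path are assumptions
made for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 0 if the file could be opened for writing while mapped,
 * 1 if the write open was refused, -1 on a setup error.
 */
int denywrite_is_ignored(void)
{
	char path[] = "/tmp/denywrite-XXXXXX";
	int flags = MAP_PRIVATE;
	int ret = -1;
	int fd, wfd;
	void *map;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, 4096) < 0) {
		close(fd);
		goto out_unlink;
	}
	close(fd);	/* drop the writable descriptor before mapping */

	fd = open(path, O_RDONLY);
	if (fd < 0)
		goto out_unlink;
#ifdef MAP_DENYWRITE
	flags |= MAP_DENYWRITE;	/* accepted, but ignored for user space */
#endif
	map = mmap(NULL, 4096, PROT_READ, flags, fd, 0);
	if (map != MAP_FAILED) {
		/* With the flag ignored, this open is expected to succeed. */
		wfd = open(path, O_WRONLY);
		ret = wfd >= 0 ? 0 : 1;
		if (wfd >= 0)
			close(wfd);
		munmap(map, 4096);
	}
	close(fd);
out_unlink:
	unlink(path);
	return ret;
}
```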

With this change, all remaining in-tree users of MAP_DENYWRITE use it
to map an executable. We will be able to open shared libraries loaded
via uselib() writable, just as we already can via dlopen() from user
space.

This is one step in the direction of removing MAP_DENYWRITE from the
kernel. This can be considered a minor user space visible change.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/ia32/ia32_aout.c | 2 +-
 fs/binfmt_aout.c          | 2 +-
 fs/binfmt_elf.c           | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c
index 5e5b9fc2747f..321d7b22ad2d 100644
--- a/arch/x86/ia32/ia32_aout.c
+++ b/arch/x86/ia32/ia32_aout.c
@@ -293,7 +293,7 @@ static int load_aout_library(struct file *file)
 	/* Now use mmap to map the library into memory. */
 	error = vm_mmap(file, start_addr, ex.a_text + ex.a_data,
 			PROT_READ | PROT_WRITE | PROT_EXEC,
-			MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE | MAP_32BIT,
+			MAP_FIXED | MAP_PRIVATE | MAP_32BIT,
 			N_TXTOFF(ex));
 	retval = error;
 	if (error != start_addr)
diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c
index 145917f734fe..d29de971d3f3 100644
--- a/fs/binfmt_aout.c
+++ b/fs/binfmt_aout.c
@@ -309,7 +309,7 @@ static int load_aout_library(struct file *file)
 	/* Now use mmap to map the library into memory. */
 	error = vm_mmap(file, start_addr, ex.a_text + ex.a_data,
 			PROT_READ | PROT_WRITE | PROT_EXEC,
-			MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
+			MAP_FIXED | MAP_PRIVATE,
 			N_TXTOFF(ex));
 	retval = error;
 	if (error != start_addr)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 439ed81e755a..6d2c79533631 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1384,7 +1384,7 @@ static int load_elf_library(struct file *file)
 			(eppnt->p_filesz +
 			 ELF_PAGEOFFSET(eppnt->p_vaddr)),
 			PROT_READ | PROT_WRITE | PROT_EXEC,
-			MAP_FIXED_NOREPLACE | MAP_PRIVATE | MAP_DENYWRITE,
+			MAP_FIXED_NOREPLACE | MAP_PRIVATE,
 			(eppnt->p_offset -
 			 ELF_PAGEOFFSET(eppnt->p_vaddr)));
 	if (error != ELF_PAGESTART(eppnt->p_vaddr))
-- 
2.31.1


* [PATCH v1 2/7] kernel/fork: factor out atomically replacing the current MM exe_file
  2021-08-12  8:43 [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE David Hildenbrand
  2021-08-12  8:43 ` [PATCH v1 1/7] binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib() David Hildenbrand
@ 2021-08-12  8:43 ` David Hildenbrand
  2021-08-12  9:17   ` Christian Brauner
  2021-08-12  8:43 ` [PATCH v1 3/7] kernel/fork: always deny write access to " David Hildenbrand
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 82+ messages in thread
From: David Hildenbrand @ 2021-08-12  8:43 UTC (permalink / raw)
  To: linux-kernel

Let's factor the main logic out into atomic_set_mm_exe_file(), such that
all mm->exe_file logic is contained in kernel/fork.c.

While at it, perform some simple cleanups that are possible now that
we're simplifying the individual functions.
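
The control flow of the factored-out helper can be sketched in user space as
follows. This is a simplified model, not the kernel code: the model_* types
are invented for the example, files are compared by pointer instead of by
path, the reference count is a plain int, and no mmap lock is taken.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

struct model_file {
	int refcount;			/* stands in for the file refcount */
};

struct model_vma {
	struct model_file *file;	/* NULL for anonymous mappings */
	struct model_vma *next;
};

struct model_mm {
	struct model_vma *mmap;		/* singly linked list of mappings */
	_Atomic(struct model_file *) exe_file;
};

int model_atomic_set_exe_file(struct model_mm *mm,
			      struct model_file *new_file)
{
	struct model_file *old = atomic_load(&mm->exe_file);
	struct model_vma *vma;

	/* Forbid the change while the old file is still mapped. */
	if (old) {
		for (vma = mm->mmap; vma; vma = vma->next)
			if (vma->file == old)
				return -EBUSY;
	}

	/* Set the new file without holding a lock. */
	new_file->refcount++;				/* get_file() */
	old = atomic_exchange(&mm->exe_file, new_file);	/* xchg() */
	if (old)
		old->refcount--;			/* fput() */
	return 0;
}
```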

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h |  2 ++
 kernel/fork.c      | 35 +++++++++++++++++++++++++++++++++--
 kernel/sys.c       | 33 +--------------------------------
 3 files changed, 36 insertions(+), 34 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7ca22e6e694a..197505324b74 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2581,6 +2581,8 @@ extern int mm_take_all_locks(struct mm_struct *mm);
 extern void mm_drop_all_locks(struct mm_struct *mm);
 
 extern void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file);
+extern int atomic_set_mm_exe_file(struct mm_struct *mm,
+				  struct file *new_exe_file);
 extern struct file *get_mm_exe_file(struct mm_struct *mm);
 extern struct file *get_task_exe_file(struct task_struct *task);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index bc94b2cc5995..6bd2e52bcdfb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1149,8 +1149,8 @@ void mmput_async(struct mm_struct *mm)
  * Main users are mmput() and sys_execve(). Callers prevent concurrent
  * invocations: in mmput() nobody alive left, in execve task is single
  * threaded. sys_prctl(PR_SET_MM_MAP/EXE_FILE) also needs to set the
- * mm->exe_file, but does so without using set_mm_exe_file() in order
- * to avoid the need for any locks.
+ * mm->exe_file, but uses atomic_set_mm_exe_file(), avoiding the need
+ * for any locks.
  */
 void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
 {
@@ -1170,6 +1170,37 @@ void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
 		fput(old_exe_file);
 }
 
+int atomic_set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
+{
+	struct vm_area_struct *vma;
+	struct file *old_exe_file;
+	int ret = 0;
+
+	/* Forbid mm->exe_file change if old file still mapped. */
+	old_exe_file = get_mm_exe_file(mm);
+	if (old_exe_file) {
+		mmap_read_lock(mm);
+		for (vma = mm->mmap; vma && !ret; vma = vma->vm_next) {
+			if (!vma->vm_file)
+				continue;
+			if (path_equal(&vma->vm_file->f_path,
+				       &old_exe_file->f_path))
+				ret = -EBUSY;
+		}
+		mmap_read_unlock(mm);
+		fput(old_exe_file);
+		if (ret)
+			return ret;
+	}
+
+	/* set the new file, lockless */
+	get_file(new_exe_file);
+	old_exe_file = xchg(&mm->exe_file, new_exe_file);
+	if (old_exe_file)
+		fput(old_exe_file);
+	return 0;
+}
+
 /**
  * get_mm_exe_file - acquire a reference to the mm's executable file
  *
diff --git a/kernel/sys.c b/kernel/sys.c
index ef1a78f5d71c..40551b411fda 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1846,7 +1846,6 @@ SYSCALL_DEFINE1(umask, int, mask)
 static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 {
 	struct fd exe;
-	struct file *old_exe, *exe_file;
 	struct inode *inode;
 	int err;
 
@@ -1869,40 +1868,10 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 	if (err)
 		goto exit;
 
-	/*
-	 * Forbid mm->exe_file change if old file still mapped.
-	 */
-	exe_file = get_mm_exe_file(mm);
-	err = -EBUSY;
-	if (exe_file) {
-		struct vm_area_struct *vma;
-
-		mmap_read_lock(mm);
-		for (vma = mm->mmap; vma; vma = vma->vm_next) {
-			if (!vma->vm_file)
-				continue;
-			if (path_equal(&vma->vm_file->f_path,
-				       &exe_file->f_path))
-				goto exit_err;
-		}
-
-		mmap_read_unlock(mm);
-		fput(exe_file);
-	}
-
-	err = 0;
-	/* set the new file, lockless */
-	get_file(exe.file);
-	old_exe = xchg(&mm->exe_file, exe.file);
-	if (old_exe)
-		fput(old_exe);
+	err = atomic_set_mm_exe_file(mm, exe.file);
 exit:
 	fdput(exe);
 	return err;
-exit_err:
-	mmap_read_unlock(mm);
-	fput(exe_file);
-	goto exit;
 }
 
 /*
-- 
2.31.1


* [PATCH v1 3/7] kernel/fork: always deny write access to current MM exe_file
  2021-08-12  8:43 [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE David Hildenbrand
  2021-08-12  8:43 ` [PATCH v1 1/7] binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib() David Hildenbrand
  2021-08-12  8:43 ` [PATCH v1 2/7] kernel/fork: factor out atomically replacing the current MM exe_file David Hildenbrand
@ 2021-08-12  8:43 ` David Hildenbrand
  2021-08-12 10:05   ` Christian Brauner
  2021-08-12 16:51   ` Linus Torvalds
  2021-08-12  8:43 ` [PATCH v1 4/7] binfmt: remove in-tree usage of MAP_DENYWRITE David Hildenbrand
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 82+ messages in thread
From: David Hildenbrand @ 2021-08-12  8:43 UTC (permalink / raw)
  To: linux-kernel

We want to remove VM_DENYWRITE, which is currently only used when mapping
the executable during exec. During exec, we already deny_write_access() the
executable; however, after exec completes, the VMAs mapped
with VM_DENYWRITE effectively keep write access denied via
deny_write_access().

Let's deny write access when setting the MM exe_file. With this change, we
can remove VM_DENYWRITE for mapping executables.

This represents a minor user space visible change:
sys_prctl(PR_SET_MM_EXE_FILE) can now fail if the file is already
opened writable. Also, after sys_prctl(PR_SET_MM_EXE_FILE), the file
cannot be opened writable. Note that we can already fail with -EACCES if
the file doesn't have execute permissions.
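
As a sketch of the intended accounting: every MM that points at the file
holds one denial, so after fork the file stays write-denied until the last
MM lets go of it. The model below is plain single-threaded user-space C with
invented model_* names; file reference counting and locking are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_file {
	int writecount;		/* > 0: writers, < 0: number of denials */
};

struct model_mm {
	struct model_file *exe_file;
};

int model_deny(struct model_file *f)
{
	if (f->writecount > 0)
		return -ETXTBSY;
	f->writecount--;
	return 0;
}

void model_allow(struct model_file *f)
{
	f->writecount++;
}

int model_open_for_write(struct model_file *f)
{
	if (f->writecount < 0)
		return -ETXTBSY;
	f->writecount++;
	return 0;
}

/* Models set_mm_exe_file(): deny the new file, allow the old one. */
void model_set_exe_file(struct model_mm *mm, struct model_file *new_file)
{
	struct model_file *old = mm->exe_file;

	if (new_file)
		model_deny(new_file);	/* assumed not to fail, as in exec */
	mm->exe_file = new_file;
	if (old)
		model_allow(old);
}

/* Models the dup_mmap() hunk: the child takes its own denial. */
void model_dup_exe_file(struct model_mm *child, const struct model_mm *parent)
{
	child->exe_file = parent->exe_file;
	if (child->exe_file)
		model_deny(child->exe_file);
}
```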

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 kernel/fork.c | 39 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 34 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 6bd2e52bcdfb..5d904878f19b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -476,6 +476,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 {
 	struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
 	struct rb_node **rb_link, *rb_parent;
+	struct file *exe_file;
 	int retval;
 	unsigned long charge;
 	LIST_HEAD(uf);
@@ -493,7 +494,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING);
 
 	/* No ordering required: file already has been exposed. */
-	RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
+	exe_file = get_mm_exe_file(oldmm);
+	RCU_INIT_POINTER(mm->exe_file, exe_file);
+	if (exe_file)
+		deny_write_access(exe_file);
 
 	mm->total_vm = oldmm->total_vm;
 	mm->data_vm = oldmm->data_vm;
@@ -638,8 +642,13 @@ static inline void mm_free_pgd(struct mm_struct *mm)
 #else
 static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 {
+	struct file *exe_file;
+
 	mmap_write_lock(oldmm);
-	RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
+	exe_file = get_mm_exe_file(oldmm);
+	RCU_INIT_POINTER(mm->exe_file, exe_file);
+	if (exe_file)
+		deny_write_access(exe_file);
 	mmap_write_unlock(oldmm);
 	return 0;
 }
@@ -1163,11 +1172,19 @@ void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
 	 */
 	old_exe_file = rcu_dereference_raw(mm->exe_file);
 
-	if (new_exe_file)
+	if (new_exe_file) {
 		get_file(new_exe_file);
+		/*
+		 * exec code is required to deny_write_access() successfully,
+		 * so this cannot fail
+		 */
+		deny_write_access(new_exe_file);
+	}
 	rcu_assign_pointer(mm->exe_file, new_exe_file);
-	if (old_exe_file)
+	if (old_exe_file) {
+		allow_write_access(old_exe_file);
 		fput(old_exe_file);
+	}
 }
 
 int atomic_set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
@@ -1194,10 +1211,22 @@ int atomic_set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
 	}
 
 	/* set the new file, lockless */
+	ret = deny_write_access(new_exe_file);
+	if (ret)
+		return -EACCES;
 	get_file(new_exe_file);
+
 	old_exe_file = xchg(&mm->exe_file, new_exe_file);
-	if (old_exe_file)
+	if (old_exe_file) {
+		/*
+		 * Don't race with dup_mmap() getting the file and disallowing
+		 * write access while someone might open the file writable.
+		 */
+		mmap_read_lock(mm);
+		allow_write_access(old_exe_file);
 		fput(old_exe_file);
+		mmap_read_unlock(mm);
+	}
 	return 0;
 }
 
-- 
2.31.1


* [PATCH v1 4/7] binfmt: remove in-tree usage of MAP_DENYWRITE
  2021-08-12  8:43 [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE David Hildenbrand
                   ` (2 preceding siblings ...)
  2021-08-12  8:43 ` [PATCH v1 3/7] kernel/fork: always deny write access to " David Hildenbrand
@ 2021-08-12  8:43 ` David Hildenbrand
  2021-08-12  8:43 ` [PATCH v1 5/7] mm: remove VM_DENYWRITE David Hildenbrand
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 82+ messages in thread
From: David Hildenbrand @ 2021-08-12  8:43 UTC (permalink / raw)
  To: linux-kernel

At exec time when we mmap the new executable via MAP_DENYWRITE we have it
opened via do_open_execat() and already deny_write_access()'ed the file
successfully. Once exec completes, we allow_write_access(); however,
we set mm->exe_file in begin_new_exec() via set_mm_exe_file() and
also deny_write_access() as long as mm->exe_file remains set. We'll
effectively deny write access to our executable via mm->exe_file
until mm->exe_file is changed -- when the process is removed, on new
exec, or via sys_prctl(PR_SET_MM_EXE_FILE).
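
The sequence above can be written down as a small trace of the executable's
write counter. This is only arithmetic on a modelled counter, assuming it
starts at 0 (no writers, no denials); the stage names are made up for the
sketch.

```c
#include <assert.h>

enum exec_stage {
	OPEN_EXECAT,		/* do_open_execat(): deny_write_access() */
	SET_EXE_FILE,		/* set_mm_exe_file(): deny_write_access() */
	EXEC_DONE,		/* exec completes: allow_write_access() */
	EXE_FILE_CLEARED,	/* mm->exe_file changed: allow_write_access() */
	NR_STAGES
};

/* Fill trace[] with the modelled counter value after each stage. */
void model_exec_trace(int trace[NR_STAGES])
{
	int writecount = 0;

	writecount--;
	trace[OPEN_EXECAT] = writecount;
	writecount--;
	trace[SET_EXE_FILE] = writecount;
	writecount++;
	trace[EXEC_DONE] = writecount;
	writecount++;
	trace[EXE_FILE_CLEARED] = writecount;
}
```

The counter stays negative from the first stage until mm->exe_file is
changed, which is why the mapping flag no longer has to carry the denial.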

Let's remove all usage of MAP_DENYWRITE; it's no longer necessary for
mm->exe_file.

In case of an ELF interpreter, we'll now only deny write access to the file
during exec. This is somewhat okay, because the interpreter behaves like
(and sometimes is) a shared library; all shared libraries, especially the
ones loaded directly in user space like via dlopen() won't ever be mapped
via MAP_DENYWRITE, because we ignore that from user space completely;
these shared libraries can always be modified while mapped and executed.
Let's only special-case the main executable, denying write access while
being executed by a process. This can be considered a minor user space
visible change.

While this is a cleanup, it also fixes part of a problem reported with
VM_DENYWRITE on overlayfs, as VM_DENYWRITE is effectively unused with
this patch and will be removed next:
  "Overlayfs did not honor positive i_writecount on realfile for
   VM_DENYWRITE mappings." [1]

[1] https://lore.kernel.org/r/YNHXzBgzRrZu1MrD@miu.piliscsaba.redhat.com/

Reported-by: Chengguang Xu <cgxu519@mykernel.net>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/ia32/ia32_aout.c | 6 ++----
 fs/binfmt_aout.c          | 5 ++---
 fs/binfmt_elf.c           | 4 ++--
 fs/binfmt_elf_fdpic.c     | 2 +-
 4 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c
index 321d7b22ad2d..9bd15241fadb 100644
--- a/arch/x86/ia32/ia32_aout.c
+++ b/arch/x86/ia32/ia32_aout.c
@@ -202,8 +202,7 @@ static int load_aout_binary(struct linux_binprm *bprm)
 
 		error = vm_mmap(bprm->file, N_TXTADDR(ex), ex.a_text,
 				PROT_READ | PROT_EXEC,
-				MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE |
-				MAP_32BIT,
+				MAP_FIXED | MAP_PRIVATE | MAP_32BIT,
 				fd_offset);
 
 		if (error != N_TXTADDR(ex))
@@ -211,8 +210,7 @@ static int load_aout_binary(struct linux_binprm *bprm)
 
 		error = vm_mmap(bprm->file, N_DATADDR(ex), ex.a_data,
 				PROT_READ | PROT_WRITE | PROT_EXEC,
-				MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE |
-				MAP_32BIT,
+				MAP_FIXED | MAP_PRIVATE | MAP_32BIT,
 				fd_offset + ex.a_text);
 		if (error != N_DATADDR(ex))
 			return error;
diff --git a/fs/binfmt_aout.c b/fs/binfmt_aout.c
index d29de971d3f3..a47496d0f123 100644
--- a/fs/binfmt_aout.c
+++ b/fs/binfmt_aout.c
@@ -221,8 +221,7 @@ static int load_aout_binary(struct linux_binprm * bprm)
 		}
 
 		error = vm_mmap(bprm->file, N_TXTADDR(ex), ex.a_text,
-			PROT_READ | PROT_EXEC,
-			MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
+			PROT_READ | PROT_EXEC, MAP_FIXED | MAP_PRIVATE,
 			fd_offset);
 
 		if (error != N_TXTADDR(ex))
@@ -230,7 +229,7 @@ static int load_aout_binary(struct linux_binprm * bprm)
 
 		error = vm_mmap(bprm->file, N_DATADDR(ex), ex.a_data,
 				PROT_READ | PROT_WRITE | PROT_EXEC,
-				MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
+				MAP_FIXED | MAP_PRIVATE,
 				fd_offset + ex.a_text);
 		if (error != N_DATADDR(ex))
 			return error;
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 6d2c79533631..69d900a8473d 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -622,7 +622,7 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
 	eppnt = interp_elf_phdata;
 	for (i = 0; i < interp_elf_ex->e_phnum; i++, eppnt++) {
 		if (eppnt->p_type == PT_LOAD) {
-			int elf_type = MAP_PRIVATE | MAP_DENYWRITE;
+			int elf_type = MAP_PRIVATE;
 			int elf_prot = make_prot(eppnt->p_flags, arch_state,
 						 true, true);
 			unsigned long vaddr = 0;
@@ -1070,7 +1070,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 		elf_prot = make_prot(elf_ppnt->p_flags, &arch_state,
 				     !!interpreter, false);
 
-		elf_flags = MAP_PRIVATE | MAP_DENYWRITE;
+		elf_flags = MAP_PRIVATE;
 
 		vaddr = elf_ppnt->p_vaddr;
 		/*
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index cf4028487dcc..6d8fd6030cbb 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -1041,7 +1041,7 @@ static int elf_fdpic_map_file_by_direct_mmap(struct elf_fdpic_params *params,
 		if (phdr->p_flags & PF_W) prot |= PROT_WRITE;
 		if (phdr->p_flags & PF_X) prot |= PROT_EXEC;
 
-		flags = MAP_PRIVATE | MAP_DENYWRITE;
+		flags = MAP_PRIVATE;
 		maddr = 0;
 
 		switch (params->flags & ELF_FDPIC_FLAG_ARRANGEMENT) {
-- 
2.31.1


* [PATCH v1 5/7] mm: remove VM_DENYWRITE
  2021-08-12  8:43 [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE David Hildenbrand
                   ` (3 preceding siblings ...)
  2021-08-12  8:43 ` [PATCH v1 4/7] binfmt: remove in-tree usage of MAP_DENYWRITE David Hildenbrand
@ 2021-08-12  8:43 ` David Hildenbrand
  2021-08-12  8:43 ` [PATCH v1 6/7] mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff() David Hildenbrand
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 82+ messages in thread
From: David Hildenbrand @ 2021-08-12  8:43 UTC (permalink / raw)
  To: linux-kernel

All in-tree users of MAP_DENYWRITE are gone, and MAP_DENYWRITE cannot be
set from user space, so VM_DENYWRITE has no users left; let's remove it.
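
To illustrate what dropping the flag translation means, here is a user-space
sketch of a MAP_* to VM_* bit translation without a MAP_DENYWRITE entry. The
X_* constants are assumed to match the asm-generic MAP_* values and the VM_*
values in include/linux/mm.h; model_trans() is a simplified stand-in for the
kernel's translation macro, not a copy of it.

```c
#include <assert.h>

/* Assumed values, for illustration only. */
#define X_MAP_GROWSDOWN	0x0100UL
#define X_MAP_DENYWRITE	0x0800UL
#define X_MAP_LOCKED	0x2000UL
#define X_VM_GROWSDOWN	0x00000100UL
#define X_VM_LOCKED	0x00002000UL

/* Move single bit `from` of x to bit position `to` (both nonzero). */
unsigned long model_trans(unsigned long x, unsigned long from,
			  unsigned long to)
{
	if (from <= to)
		return (x & from) * (to / from);
	return (x & from) / (from / to);
}

/* Translation table without a MAP_DENYWRITE entry. */
unsigned long model_calc_vm_flag_bits(unsigned long flags)
{
	return model_trans(flags, X_MAP_GROWSDOWN, X_VM_GROWSDOWN) |
	       model_trans(flags, X_MAP_LOCKED, X_VM_LOCKED);
}
```

With no entry for it, MAP_DENYWRITE passed in flags simply contributes no VM
flag, while the remaining flags translate as before.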

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 fs/proc/task_mmu.c             |  1 -
 include/linux/mm.h             |  1 -
 include/linux/mman.h           |  1 -
 include/trace/events/mmflags.h |  1 -
 kernel/events/core.c           |  2 --
 kernel/fork.c                  |  3 ---
 lib/test_printf.c              |  5 ++---
 mm/mmap.c                      | 27 +++------------------------
 8 files changed, 5 insertions(+), 36 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index eb97468dfe4c..cf25be3e0321 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -619,7 +619,6 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_MAYSHARE)]	= "ms",
 		[ilog2(VM_GROWSDOWN)]	= "gd",
 		[ilog2(VM_PFNMAP)]	= "pf",
-		[ilog2(VM_DENYWRITE)]	= "dw",
 		[ilog2(VM_LOCKED)]	= "lo",
 		[ilog2(VM_IO)]		= "io",
 		[ilog2(VM_SEQ_READ)]	= "sr",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 197505324b74..434cc97ddcf8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -281,7 +281,6 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_GROWSDOWN	0x00000100	/* general info on the segment */
 #define VM_UFFD_MISSING	0x00000200	/* missing pages tracking */
 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
-#define VM_DENYWRITE	0x00000800	/* ETXTBSY on write attempts.. */
 #define VM_UFFD_WP	0x00001000	/* wrprotect pages tracking */
 
 #define VM_LOCKED	0x00002000
diff --git a/include/linux/mman.h b/include/linux/mman.h
index ebb09a964272..bd9aadda047b 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -153,7 +153,6 @@ static inline unsigned long
 calc_vm_flag_bits(unsigned long flags)
 {
 	return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
-	       _calc_vm_trans(flags, MAP_DENYWRITE,  VM_DENYWRITE ) |
 	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
 	       _calc_vm_trans(flags, MAP_SYNC,	     VM_SYNC      ) |
 	       arch_calc_vm_flag_bits(flags);
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 390270e00a1d..f44c3fb8da1a 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -163,7 +163,6 @@ IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison")
 	{VM_UFFD_MISSING,		"uffd_missing"	},		\
 IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR,	"uffd_minor"	)		\
 	{VM_PFNMAP,			"pfnmap"	},		\
-	{VM_DENYWRITE,			"denywrite"	},		\
 	{VM_UFFD_WP,			"uffd_wp"	},		\
 	{VM_LOCKED,			"locked"	},		\
 	{VM_IO,				"io"		},		\
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1cb1f9b8392e..19767bb9933c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8307,8 +8307,6 @@ static void perf_event_mmap_event(struct perf_mmap_event *mmap_event)
 	else
 		flags = MAP_PRIVATE;
 
-	if (vma->vm_flags & VM_DENYWRITE)
-		flags |= MAP_DENYWRITE;
 	if (vma->vm_flags & VM_LOCKED)
 		flags |= MAP_LOCKED;
 	if (is_vm_hugetlb_page(vma))
diff --git a/kernel/fork.c b/kernel/fork.c
index 5d904878f19b..31df30d9f1a9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -560,12 +560,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
 		file = tmp->vm_file;
 		if (file) {
-			struct inode *inode = file_inode(file);
 			struct address_space *mapping = file->f_mapping;
 
 			get_file(file);
-			if (tmp->vm_flags & VM_DENYWRITE)
-				put_write_access(inode);
 			i_mmap_lock_write(mapping);
 			if (tmp->vm_flags & VM_SHARED)
 				mapping_allow_writable(mapping);
diff --git a/lib/test_printf.c b/lib/test_printf.c
index 8ac71aee46af..8a48b61c3763 100644
--- a/lib/test_printf.c
+++ b/lib/test_printf.c
@@ -675,9 +675,8 @@ flags(void)
 			"uptodate|dirty|lru|active|swapbacked",
 			cmp_buffer);
 
-	flags = VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC
-			| VM_DENYWRITE;
-	test("read|exec|mayread|maywrite|mayexec|denywrite", "%pGv", &flags);
+	flags = VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
+	test("read|exec|mayread|maywrite|mayexec", "%pGv", &flags);
 
 	gfp = GFP_TRANSHUGE;
 	test("GFP_TRANSHUGE", "%pGg", &gfp);
diff --git a/mm/mmap.c b/mm/mmap.c
index ca54d36d203a..589dc1dc13db 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -148,8 +148,6 @@ void vma_set_page_prot(struct vm_area_struct *vma)
 static void __remove_shared_vm_struct(struct vm_area_struct *vma,
 		struct file *file, struct address_space *mapping)
 {
-	if (vma->vm_flags & VM_DENYWRITE)
-		allow_write_access(file);
 	if (vma->vm_flags & VM_SHARED)
 		mapping_unmap_writable(mapping);
 
@@ -666,8 +664,6 @@ static void __vma_link_file(struct vm_area_struct *vma)
 	if (file) {
 		struct address_space *mapping = file->f_mapping;
 
-		if (vma->vm_flags & VM_DENYWRITE)
-			put_write_access(file_inode(file));
 		if (vma->vm_flags & VM_SHARED)
 			mapping_allow_writable(mapping);
 
@@ -1788,22 +1784,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	vma->vm_pgoff = pgoff;
 
 	if (file) {
-		if (vm_flags & VM_DENYWRITE) {
-			error = deny_write_access(file);
-			if (error)
-				goto free_vma;
-		}
 		if (vm_flags & VM_SHARED) {
 			error = mapping_map_writable(file->f_mapping);
 			if (error)
-				goto allow_write_and_free_vma;
+				goto free_vma;
 		}
 
-		/* ->mmap() can change vma->vm_file, but must guarantee that
-		 * vma_link() below can deny write-access if VM_DENYWRITE is set
-		 * and map writably if VM_SHARED is set. This usually means the
-		 * new file must not have been exposed to user-space, yet.
-		 */
 		vma->vm_file = get_file(file);
 		error = call_mmap(file, vma);
 		if (error)
@@ -1860,13 +1846,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 	/* Once vma denies write, undo our temporary denial count */
-	if (file) {
 unmap_writable:
-		if (vm_flags & VM_SHARED)
-			mapping_unmap_writable(file->f_mapping);
-		if (vm_flags & VM_DENYWRITE)
-			allow_write_access(file);
-	}
+	if (file && vm_flags & VM_SHARED)
+		mapping_unmap_writable(file->f_mapping);
 	file = vma->vm_file;
 out:
 	perf_event_mmap(vma);
@@ -1906,9 +1888,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	charged = 0;
 	if (vm_flags & VM_SHARED)
 		mapping_unmap_writable(file->f_mapping);
-allow_write_and_free_vma:
-	if (vm_flags & VM_DENYWRITE)
-		allow_write_access(file);
 free_vma:
 	vm_area_free(vma);
 unacct_error:
-- 
2.31.1



* [PATCH v1 6/7] mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
  2021-08-12  8:43 [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE David Hildenbrand
                   ` (4 preceding siblings ...)
  2021-08-12  8:43 ` [PATCH v1 5/7] mm: remove VM_DENYWRITE David Hildenbrand
@ 2021-08-12  8:43 ` David Hildenbrand
  2021-08-12  8:43 ` [PATCH v1 7/7] fs: update documentation of get_write_access() and friends David Hildenbrand
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 82+ messages in thread
From: David Hildenbrand @ 2021-08-12  8:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Alexander Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Eric W. Biederman,
	Greg Ungerer, Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, linux-api, x86, linux-fsdevel, linux-mm

Let's also stop masking off MAP_DENYWRITE in ksys_mmap_pgoff(): the last
in-tree occurrence of MAP_DENYWRITE is now in LEGACY_MAP_MASK, which
accepts the flag, e.g., for MAP_SHARED_VALIDATE; however, the flag is
ignored throughout the kernel now.

Add a comment to LEGACY_MAP_MASK stating that MAP_DENYWRITE is ignored.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mman.h | 3 ++-
 mm/mmap.c            | 2 --
 mm/nommu.c           | 2 --
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/include/linux/mman.h b/include/linux/mman.h
index bd9aadda047b..b66e91b8176c 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -32,7 +32,8 @@
  * The historical set of flags that all mmap implementations implicitly
  * support when a ->mmap_validate() op is not provided in file_operations.
  *
- * MAP_EXECUTABLE is completely ignored throughout the kernel.
+ * MAP_EXECUTABLE and MAP_DENYWRITE are completely ignored throughout the
+ * kernel.
  */
 #define LEGACY_MAP_MASK (MAP_SHARED \
 		| MAP_PRIVATE \
diff --git a/mm/mmap.c b/mm/mmap.c
index 589dc1dc13db..bf11fc6e8311 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1626,8 +1626,6 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 			return PTR_ERR(file);
 	}
 
-	flags &= ~MAP_DENYWRITE;
-
 	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
 out_fput:
 	if (file)
diff --git a/mm/nommu.c b/mm/nommu.c
index 3a93d4054810..0987d131bdfc 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1296,8 +1296,6 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 			goto out;
 	}
 
-	flags &= ~MAP_DENYWRITE;
-
 	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
 
 	if (file)
-- 
2.31.1
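
To illustrate what "accepted but ignored" means from user space, here is a
minimal sketch (not part of the patch). It assumes a Linux system with a
writable /tmp; the helper name denywrite_is_ignored() is made up for this
example. It maps a scratch file with MAP_DENYWRITE set and then checks that
the file can still be opened for writing.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map a scratch file with MAP_DENYWRITE and then try
 * to open it for writing. Returns 0 if both the mmap() and the writable
 * open() succeed (i.e. the flag had no effect), -1 otherwise.
 */
int denywrite_is_ignored(void)
{
	char path[] = "/tmp/denywrite-XXXXXX";
	int fd, wfd, ret = -1;
	void *map;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	if (write(fd, "x", 1) != 1)
		goto out;

	/* The flag is accepted, but no write denial is taken on the file. */
	map = mmap(NULL, 1, PROT_READ, MAP_PRIVATE | MAP_DENYWRITE, fd, 0);
	if (map == MAP_FAILED)
		goto out;

	/* A writer can still open the file while the mapping exists. */
	wfd = open(path, O_WRONLY);
	if (wfd >= 0) {
		ret = 0;
		close(wfd);
	}
	munmap(map, 1);
out:
	close(fd);
	unlink(path);
	return ret;
}
```

The same result is expected before and after this patch: previously the
flag was masked off in ksys_mmap_pgoff(), now it simply has no consumer.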



* [PATCH v1 7/7] fs: update documentation of get_write_access() and friends
  2021-08-12  8:43 [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE David Hildenbrand
                   ` (5 preceding siblings ...)
  2021-08-12  8:43 ` [PATCH v1 6/7] mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff() David Hildenbrand
@ 2021-08-12  8:43 ` David Hildenbrand
  2021-08-12 12:20 ` [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE Florian Weimer
  2021-08-12 17:32 ` Eric W. Biederman
  8 siblings, 0 replies; 82+ messages in thread
From: David Hildenbrand @ 2021-08-12  8:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Alexander Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Eric W. Biederman,
	Greg Ungerer, Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, linux-api, x86, linux-fsdevel, linux-mm

As VM_DENYWRITE no longer exists, let's spring-clean the
documentation of get_write_access() and friends.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/fs.h | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 640574294216..e0dc3e96ed72 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3055,15 +3055,20 @@ static inline void file_end_write(struct file *file)
 }
 
 /*
+ * This is used for regular files where some users -- especially the
+ * currently executed binary in a process, previously handled via
+ * VM_DENYWRITE -- cannot handle concurrent write (and maybe mmap
+ * read-write shared) accesses.
+ *
  * get_write_access() gets write permission for a file.
  * put_write_access() releases this write permission.
- * This is used for regular files.
- * We cannot support write (and maybe mmap read-write shared) accesses and
- * MAP_DENYWRITE mmappings simultaneously. The i_writecount field of an inode
- * can have the following values:
- * 0: no writers, no VM_DENYWRITE mappings
- * < 0: (-i_writecount) vm_area_structs with VM_DENYWRITE set exist
- * > 0: (i_writecount) users are writing to the file.
+ * deny_write_access() denies write access to a file.
+ * allow_write_access() re-enables write access to a file.
+ *
+ * The i_writecount field of an inode can have the following values:
+ * 0: no write access, no denied write access
+ * < 0: (-i_writecount) users that denied write access to the file.
+ * > 0: (i_writecount) users that have write access to the file.
  *
  * Normally we operate on that counter with atomic_{inc,dec} and it's safe
  * except for the cases where we don't hold i_writecount yet. Then we need to
-- 
2.31.1
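
The i_writecount rules documented above can be modelled in user space as
follows. This is only a sketch of the counter semantics, not the kernel
code: struct toy_inode and the toy_*() names are invented for this example,
and C11 atomics stand in for the kernel's atomic_t helpers. The real
functions operate on struct inode and struct file.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Toy stand-in for the i_writecount field of struct inode. */
struct toy_inode {
	atomic_int i_writecount;
};

/* Take write access: increment unless write access is currently denied. */
int toy_get_write_access(struct toy_inode *inode)
{
	int v = atomic_load(&inode->i_writecount);

	do {
		if (v < 0)
			return -ETXTBSY;
	} while (!atomic_compare_exchange_weak(&inode->i_writecount, &v, v + 1));
	return 0;
}

/* Release write access taken via toy_get_write_access(). */
void toy_put_write_access(struct toy_inode *inode)
{
	atomic_fetch_sub(&inode->i_writecount, 1);
}

/* Deny write access: decrement unless there are writers. */
int toy_deny_write_access(struct toy_inode *inode)
{
	int v = atomic_load(&inode->i_writecount);

	do {
		if (v > 0)
			return -ETXTBSY;
	} while (!atomic_compare_exchange_weak(&inode->i_writecount, &v, v - 1));
	return 0;
}

/* Undo one toy_deny_write_access(). */
void toy_allow_write_access(struct toy_inode *inode)
{
	atomic_fetch_add(&inode->i_writecount, 1);
}
```

With a counter starting at 0, a successful deny makes a subsequent get fail
with -ETXTBSY until the denial is dropped again, and vice versa.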



* Re: [PATCH v1 2/7] kernel/fork: factor out atomcially replacing the current MM exe_file
  2021-08-12  8:43 ` [PATCH v1 2/7] kernel/fork: factor out atomcially replacing the current MM exe_file David Hildenbrand
@ 2021-08-12  9:17   ` Christian Brauner
  0 siblings, 0 replies; 82+ messages in thread
From: Christian Brauner @ 2021-08-12  9:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Alexander Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Eric W. Biederman,
	Greg Ungerer, Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, linux-api, x86, linux-fsdevel, linux-mm

On Thu, Aug 12, 2021 at 10:43:43AM +0200, David Hildenbrand wrote:
> Let's factor the main logic out into atomic_set_mm_exe_file(), such that
> all mm->exe_file logic is contained in kernel/fork.c.
> 
> While at it, perform some simple cleanups that are possible now that
> we're simplifying the individual functions.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---

Looks good.
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

>  include/linux/mm.h |  2 ++
>  kernel/fork.c      | 35 +++++++++++++++++++++++++++++++++--
>  kernel/sys.c       | 33 +--------------------------------
>  3 files changed, 36 insertions(+), 34 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 7ca22e6e694a..197505324b74 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2581,6 +2581,8 @@ extern int mm_take_all_locks(struct mm_struct *mm);
>  extern void mm_drop_all_locks(struct mm_struct *mm);
>  
>  extern void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file);
> +extern int atomic_set_mm_exe_file(struct mm_struct *mm,
> +				  struct file *new_exe_file);
>  extern struct file *get_mm_exe_file(struct mm_struct *mm);
>  extern struct file *get_task_exe_file(struct task_struct *task);
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index bc94b2cc5995..6bd2e52bcdfb 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1149,8 +1149,8 @@ void mmput_async(struct mm_struct *mm)
>   * Main users are mmput() and sys_execve(). Callers prevent concurrent
>   * invocations: in mmput() nobody alive left, in execve task is single
>   * threaded. sys_prctl(PR_SET_MM_MAP/EXE_FILE) also needs to set the
> - * mm->exe_file, but does so without using set_mm_exe_file() in order
> - * to avoid the need for any locks.
> + * mm->exe_file, but uses atomic_set_mm_exe_file(), avoiding the need
> + * for any locks.
>   */
>  void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
>  {
> @@ -1170,6 +1170,37 @@ void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
>  		fput(old_exe_file);
>  }
>  
> +int atomic_set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
> +{
> +	struct vm_area_struct *vma;
> +	struct file *old_exe_file;
> +	int ret = 0;
> +
> +	/* Forbid mm->exe_file change if old file still mapped. */
> +	old_exe_file = get_mm_exe_file(mm);
> +	if (old_exe_file) {
> +		mmap_read_lock(mm);
> +		for (vma = mm->mmap; vma && !ret; vma = vma->vm_next) {
> +			if (!vma->vm_file)
> +				continue;
> +			if (path_equal(&vma->vm_file->f_path,
> +				       &old_exe_file->f_path))
> +				ret = -EBUSY;
> +		}
> +		mmap_read_unlock(mm);
> +		fput(old_exe_file);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	/* set the new file, lockless */
> +	get_file(new_exe_file);
> +	old_exe_file = xchg(&mm->exe_file, new_exe_file);
> +	if (old_exe_file)
> +		fput(old_exe_file);
> +	return 0;
> +}
> +
>  /**
>   * get_mm_exe_file - acquire a reference to the mm's executable file
>   *
> diff --git a/kernel/sys.c b/kernel/sys.c
> index ef1a78f5d71c..40551b411fda 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1846,7 +1846,6 @@ SYSCALL_DEFINE1(umask, int, mask)
>  static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
>  {
>  	struct fd exe;
> -	struct file *old_exe, *exe_file;
>  	struct inode *inode;
>  	int err;
>  
> @@ -1869,40 +1868,10 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
>  	if (err)
>  		goto exit;
>  
> -	/*
> -	 * Forbid mm->exe_file change if old file still mapped.
> -	 */
> -	exe_file = get_mm_exe_file(mm);
> -	err = -EBUSY;
> -	if (exe_file) {
> -		struct vm_area_struct *vma;
> -
> -		mmap_read_lock(mm);
> -		for (vma = mm->mmap; vma; vma = vma->vm_next) {
> -			if (!vma->vm_file)
> -				continue;
> -			if (path_equal(&vma->vm_file->f_path,
> -				       &exe_file->f_path))
> -				goto exit_err;
> -		}
> -
> -		mmap_read_unlock(mm);
> -		fput(exe_file);
> -	}
> -
> -	err = 0;
> -	/* set the new file, lockless */
> -	get_file(exe.file);
> -	old_exe = xchg(&mm->exe_file, exe.file);
> -	if (old_exe)
> -		fput(old_exe);
> +	err = atomic_set_mm_exe_file(mm, exe.file);
>  exit:
>  	fdput(exe);
>  	return err;
> -exit_err:
> -	mmap_read_unlock(mm);
> -	fput(exe_file);
> -	goto exit;
>  }
>  
>  /*
> -- 
> 2.31.1
> 


* Re: [PATCH v1 3/7] kernel/fork: always deny write access to current MM exe_file
  2021-08-12  8:43 ` [PATCH v1 3/7] kernel/fork: always deny write access to " David Hildenbrand
@ 2021-08-12 10:05   ` Christian Brauner
  2021-08-12 10:13     ` David Hildenbrand
  2021-08-12 16:51   ` Linus Torvalds
  1 sibling, 1 reply; 82+ messages in thread
From: Christian Brauner @ 2021-08-12 10:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Alexander Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Eric W. Biederman,
	Greg Ungerer, Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, linux-api, x86, linux-fsdevel, linux-mm,
	Andrei Vagin

[+Cc Andrei]

On Thu, Aug 12, 2021 at 10:43:44AM +0200, David Hildenbrand wrote:
> We want to remove VM_DENYWRITE, which is currently only used when mapping
> the executable during exec. During exec, we already deny_write_access() the
> executable; however, after exec completes, the VMAs mapped with
> VM_DENYWRITE effectively keep write access denied via
> deny_write_access().
> 
> Let's deny write access when setting the MM exe_file. With this change, we
> can remove VM_DENYWRITE for mapping executables.
> 
> This represents a minor user space visible change:
> sys_prctl(PR_SET_MM_EXE_FILE) can now fail if the file is already
> opened writable. Also, after sys_prctl(PR_SET_MM_EXE_FILE), the file

Just for completeness, this also affects PR_SET_MM_MAP when exe_fd is
set.

> cannot be opened writable. Note that we can already fail with -EACCES if
> the file doesn't have execute permissions.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---

The biggest user I know of, and one I'm involved in, is CRIU, which heavily
uses PR_SET_MM_MAP (with a fallback to PR_SET_MM_EXE_FILE on older
kernels) during restore. AFAIR, CRIU opens the exe fd with O_PATH
during dump and thus will use the same flag during restore when
opening it. So that should be fine.

However, if I understand the consequences of this change correctly, a
problem could be restoring workloads that hold a writable fd open to
their exe file at dump time which would mean that during restore that fd
would be reopened writable causing CRIU to fail when setting the exe
file for the task to be restored.

Honestly, I have no idea how many such workloads exist. (I know that at
least runC and LXC sometimes need to re-exec themselves (a workaround for
a weird bug, to protect against attacks on the exe file) and thus re-open
/proc/self/exe, but read-only.)

>  kernel/fork.c | 39 ++++++++++++++++++++++++++++++++++-----
>  1 file changed, 34 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 6bd2e52bcdfb..5d904878f19b 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -476,6 +476,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  {
>  	struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
>  	struct rb_node **rb_link, *rb_parent;
> +	struct file *exe_file;
>  	int retval;
>  	unsigned long charge;
>  	LIST_HEAD(uf);
> @@ -493,7 +494,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  	mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING);
>  
>  	/* No ordering required: file already has been exposed. */
> -	RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
> +	exe_file = get_mm_exe_file(oldmm);
> +	RCU_INIT_POINTER(mm->exe_file, exe_file);
> +	if (exe_file)
> +		deny_write_access(exe_file);
>  
>  	mm->total_vm = oldmm->total_vm;
>  	mm->data_vm = oldmm->data_vm;
> @@ -638,8 +642,13 @@ static inline void mm_free_pgd(struct mm_struct *mm)
>  #else
>  static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>  {
> +	struct file *exe_file;
> +
>  	mmap_write_lock(oldmm);
> -	RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
> +	exe_file = get_mm_exe_file(oldmm);
> +	RCU_INIT_POINTER(mm->exe_file, exe_file);
> +	if (exe_file)
> +		deny_write_access(exe_file);
>  	mmap_write_unlock(oldmm);
>  	return 0;
>  }
> @@ -1163,11 +1172,19 @@ void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
>  	 */
>  	old_exe_file = rcu_dereference_raw(mm->exe_file);
>  
> -	if (new_exe_file)
> +	if (new_exe_file) {
>  		get_file(new_exe_file);
> +		/*
> +		 * exec code is required to deny_write_access() successfully,
> +		 * so this cannot fail
> +		 */
> +		deny_write_access(new_exe_file);
> +	}
>  	rcu_assign_pointer(mm->exe_file, new_exe_file);
> -	if (old_exe_file)
> +	if (old_exe_file) {
> +		allow_write_access(old_exe_file);
>  		fput(old_exe_file);
> +	}
>  }
>  
>  int atomic_set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
> @@ -1194,10 +1211,22 @@ int atomic_set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
>  	}
>  
>  	/* set the new file, lockless */
> +	ret = deny_write_access(new_exe_file);
> +	if (ret)
> +		return -EACCES;
>  	get_file(new_exe_file);
> +
>  	old_exe_file = xchg(&mm->exe_file, new_exe_file);
> -	if (old_exe_file)
> +	if (old_exe_file) {
> +		/*
> +		 * Don't race with dup_mmap() getting the file and disallowing
> +		 * write access while someone might open the file writable.
> +		 */
> +		mmap_read_lock(mm);
> +		allow_write_access(old_exe_file);
>  		fput(old_exe_file);
> +		mmap_read_unlock(mm);
> +	}
>  	return 0;
>  }
>  
> -- 
> 2.31.1
> 


* Re: [PATCH v1 3/7] kernel/fork: always deny write access to current MM exe_file
  2021-08-12 10:05   ` Christian Brauner
@ 2021-08-12 10:13     ` David Hildenbrand
  2021-08-12 12:32       ` Christian Brauner
  0 siblings, 1 reply; 82+ messages in thread
From: David Hildenbrand @ 2021-08-12 10:13 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Alexander Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Eric W. Biederman,
	Greg Ungerer, Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, linux-api, x86, linux-fsdevel, linux-mm,
	Andrei Vagin

On 12.08.21 12:05, Christian Brauner wrote:
> [+Cc Andrei]
> 
> On Thu, Aug 12, 2021 at 10:43:44AM +0200, David Hildenbrand wrote:
>> We want to remove VM_DENYWRITE, which is currently only used when mapping
>> the executable during exec. During exec, we already deny_write_access() the
>> executable; however, after exec completes, the VMAs mapped with
>> VM_DENYWRITE effectively keep write access denied via
>> deny_write_access().
>>
>> Let's deny write access when setting the MM exe_file. With this change, we
>> can remove VM_DENYWRITE for mapping executables.
>>
>> This represents a minor user space visible change:
>> sys_prctl(PR_SET_MM_EXE_FILE) can now fail if the file is already
>> opened writable. Also, after sys_prctl(PR_SET_MM_EXE_FILE), the file
> 
> Just for completeness, this also affects PR_SET_MM_MAP when exe_fd is
> set.

Correct.

> 
>> cannot be opened writable. Note that we can already fail with -EACCES if
>> the file doesn't have execute permissions.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
> 
> The biggest user I know of, and one I'm involved in, is CRIU, which heavily
> uses PR_SET_MM_MAP (with a fallback to PR_SET_MM_EXE_FILE on older
> kernels) during restore. AFAIR, CRIU opens the exe fd with O_PATH
> during dump and thus will use the same flag during restore when
> opening it. So that should be fine.

Yes.

> 
> However, if I understand the consequences of this change correctly, a
> problem could be restoring workloads that hold a writable fd open to
> their exe file at dump time which would mean that during restore that fd
> would be reopened writable causing CRIU to fail when setting the exe
> file for the task to be restored.

If it's their exe file, then the existing VM_DENYWRITE handling would 
have forbidden these workloads to open the fd of their exe file 
writable, right? At least before doing any 
PR_SET_MM_MAP/PR_SET_MM_EXE_FILE. But that should rule out quite a lot 
of cases we might be worried about, right?
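
The existing behaviour referred to here can be checked from user space with
a small sketch like the following. It assumes Linux with /proc mounted; the
helper name try_open_self_exe() is made up for this example. On a kernel
that denies write access to the running executable, the writable open is
expected to fail with ETXTBSY when the caller has write permission on the
file (other errors such as EACCES are possible otherwise), while a
read-only open succeeds.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to open the currently running executable with
 * the given open(2) flags. Returns 0 on success, or the errno value
 * reported by open() on failure.
 */
int try_open_self_exe(int flags)
{
	int fd = open("/proc/self/exe", flags);

	if (fd < 0)
		return errno;
	close(fd);
	return 0;
}
```

Calling try_open_self_exe(O_RDONLY) and try_open_self_exe(O_WRONLY) from
the same process shows the asymmetry that runC and LXC rely on when they
re-open /proc/self/exe read-only.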

> 
> Honestly, I have no idea how many such workloads exist. (I know that at
> least runC and LXC sometimes need to re-exec themselves (a workaround for
> a weird bug, to protect against attacks on the exe file) and thus re-open
> /proc/self/exe, but read-only.)
> 
>>   kernel/fork.c | 39 ++++++++++++++++++++++++++++++++++-----
>>   1 file changed, 34 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 6bd2e52bcdfb..5d904878f19b 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -476,6 +476,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   {
>>   	struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
>>   	struct rb_node **rb_link, *rb_parent;
>> +	struct file *exe_file;
>>   	int retval;
>>   	unsigned long charge;
>>   	LIST_HEAD(uf);
>> @@ -493,7 +494,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   	mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING);
>>   
>>   	/* No ordering required: file already has been exposed. */
>> -	RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
>> +	exe_file = get_mm_exe_file(oldmm);
>> +	RCU_INIT_POINTER(mm->exe_file, exe_file);
>> +	if (exe_file)
>> +		deny_write_access(exe_file);
>>   
>>   	mm->total_vm = oldmm->total_vm;
>>   	mm->data_vm = oldmm->data_vm;
>> @@ -638,8 +642,13 @@ static inline void mm_free_pgd(struct mm_struct *mm)
>>   #else
>>   static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>>   {
>> +	struct file *exe_file;
>> +
>>   	mmap_write_lock(oldmm);
>> -	RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
>> +	exe_file = get_mm_exe_file(oldmm);
>> +	RCU_INIT_POINTER(mm->exe_file, exe_file);
>> +	if (exe_file)
>> +		deny_write_access(exe_file);
>>   	mmap_write_unlock(oldmm);
>>   	return 0;
>>   }
>> @@ -1163,11 +1172,19 @@ void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
>>   	 */
>>   	old_exe_file = rcu_dereference_raw(mm->exe_file);
>>   
>> -	if (new_exe_file)
>> +	if (new_exe_file) {
>>   		get_file(new_exe_file);
>> +		/*
>> +		 * exec code is required to deny_write_access() successfully,
>> +		 * so this cannot fail
>> +		 */
>> +		deny_write_access(new_exe_file);
>> +	}
>>   	rcu_assign_pointer(mm->exe_file, new_exe_file);
>> -	if (old_exe_file)
>> +	if (old_exe_file) {
>> +		allow_write_access(old_exe_file);
>>   		fput(old_exe_file);
>> +	}
>>   }
>>   
>>   int atomic_set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
>> @@ -1194,10 +1211,22 @@ int atomic_set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
>>   	}
>>   
>>   	/* set the new file, lockless */
>> +	ret = deny_write_access(new_exe_file);
>> +	if (ret)
>> +		return -EACCES;
>>   	get_file(new_exe_file);
>> +
>>   	old_exe_file = xchg(&mm->exe_file, new_exe_file);
>> -	if (old_exe_file)
>> +	if (old_exe_file) {
>> +		/*
>> +		 * Don't race with dup_mmap() getting the file and disallowing
>> +		 * write access while someone might open the file writable.
>> +		 */
>> +		mmap_read_lock(mm);
>> +		allow_write_access(old_exe_file);
>>   		fput(old_exe_file);
>> +		mmap_read_unlock(mm);
>> +	}
>>   	return 0;
>>   }
>>   
>> -- 
>> 2.31.1
>>
> 


-- 
Thanks,

David / dhildenb



* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-12  8:43 [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE David Hildenbrand
                   ` (6 preceding siblings ...)
  2021-08-12  8:43 ` [PATCH v1 7/7] fs: update documentation of get_write_access() and friends David Hildenbrand
@ 2021-08-12 12:20 ` Florian Weimer
  2021-08-12 12:47   ` David Hildenbrand
  2021-08-12 16:17   ` Eric W. Biederman
  2021-08-12 17:32 ` Eric W. Biederman
  8 siblings, 2 replies; 82+ messages in thread
From: Florian Weimer @ 2021-08-12 12:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Alexander Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Eric W. Biederman,
	Greg Ungerer, Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, linux-api, x86, linux-fsdevel, linux-mm

* David Hildenbrand:

> There are some (minor) user-visible changes with this series:
> 1. We no longer deny write access to shared libraries loaded via legacy
>    uselib(); this behavior matches modern user space e.g., via dlopen().
> 2. We no longer deny write access to the elf interpreter after exec
>    completed, treating it just like shared libraries (which it often is).

We have a persistent issue with people using cp (or similar tools) to
replace system libraries.  Since the file is truncated first, all
relocations and global data are replaced by file contents, resulting in
difficult-to-diagnose crashes.  It would be nice if we had a way to
prevent this mistake.  It doesn't have to be MAP_DENYWRITE or MAP_COPY.
It could be something completely new, like an option that turns every
future access beyond the truncation point into a signal (rather than
getting bad data or bad code and crashing much later).

I don't know how many invalid copy operations are currently thwarted by
the current program interpreter restriction.  I doubt that lifting the
restriction matters.
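
To make the failure mode above concrete, here is a minimal user-space
sketch. It assumes Linux page-cache behaviour for a read-only
MAP_PRIVATE mapping whose pages have not been written through the
mapping; the helper names and DEMO_LEN are made up for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define DEMO_LEN 4096	/* illustrative file size, not taken from the thread */

/* Overwrite path in place the way cp does: O_TRUNC first, then write. */
static int overwrite_in_place(const char *path, char fill)
{
	char buf[DEMO_LEN];
	ssize_t n;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	memset(buf, fill, sizeof(buf));
	n = write(fd, buf, sizeof(buf));
	close(fd);
	return n == (ssize_t)sizeof(buf) ? 0 : -1;
}

/*
 * Map the file privately, as a dynamic loader would, then overwrite it in
 * place and return the first byte visible through the old mapping.
 */
static int byte_seen_after_overwrite(const char *path)
{
	volatile char *map;
	int fd, seen;

	if (overwrite_in_place(path, 'O'))
		return -1;
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	map = mmap(NULL, DEMO_LEN, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);
	if (map == MAP_FAILED)
		return -1;
	assert(map[0] == 'O');	/* fault the old contents in first */
	if (overwrite_in_place(path, 'N')) {
		munmap((void *)map, DEMO_LEN);
		return -1;
	}
	seen = map[0];		/* the mapping now shows the new file contents */
	munmap((void *)map, DEMO_LEN);
	return seen;
}
```

The sketch never touches the mapping between the truncation and the
rewrite; a real process that does so in that window would typically get
SIGBUS, or later run on mismatched code and data.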

> 3. We always deny write access to the file linked via /proc/pid/exe:
>    sys_prctl(PR_SET_MM_EXE_FILE) will fail if write access to the file
>    cannot be denied, and write access to the file will remain denied
>    until the link is effectively gone (exec, termination,
>    PR_SET_MM_EXE_FILE) -- just as if exec'ing the file.
>
> I was wondering if we really care about permanently disabling write access
> to the executable, or if it would be good enough to just disable write
> access while loading the new executable during exec; but I don't know
> the history of that -- and it somewhat makes sense to deny write access
> at least to the main executable. With modern user space -- dlopen() -- we
> can effectively modify the content of shared libraries while being used.

Is there a difference between ET_DYN and ET_EXEC executables?

Thanks,
Florian



* Re: [PATCH v1 3/7] kernel/fork: always deny write access to current MM exe_file
  2021-08-12 10:13     ` David Hildenbrand
@ 2021-08-12 12:32       ` Christian Brauner
  2021-08-12 12:38         ` David Hildenbrand
  0 siblings, 1 reply; 82+ messages in thread
From: Christian Brauner @ 2021-08-12 12:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Alexander Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Eric W. Biederman,
	Greg Ungerer, Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, linux-api, x86, linux-fsdevel, linux-mm,
	Andrei Vagin

On Thu, Aug 12, 2021 at 12:13:44PM +0200, David Hildenbrand wrote:
> On 12.08.21 12:05, Christian Brauner wrote:
> > [+Cc Andrei]
> > 
> > On Thu, Aug 12, 2021 at 10:43:44AM +0200, David Hildenbrand wrote:
> > > We want to remove VM_DENYWRITE, which is currently only used when mapping the
> > > executable during exec. During exec, we already deny_write_access() the
> > > executable, however, after exec completes the VMAs mapped
> > > with VM_DENYWRITE effectively keep write access denied via
> > > deny_write_access().
> > > 
> > > Let's deny write access when setting the MM exe_file. With this change, we
> > > can remove VM_DENYWRITE for mapping executables.
> > > 
> > > This represents a minor user space visible change:
> > > sys_prctl(PR_SET_MM_EXE_FILE) can now fail if the file is already
> > > opened writable. Also, after sys_prctl(PR_SET_MM_EXE_FILE), the file
> > 
> > Just for completeness, this also affects PR_SET_MM_MAP when exe_fd is
> > set.
> 
> Correct.
> 
> > 
> > > cannot be opened writable. Note that we can already fail with -EACCES if
> > > the file doesn't have execute permissions.
> > > 
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > > ---
> > 
> > The biggest user I know and that I'm involved in is CRIU which heavily
> > uses PR_SET_MM_MAP (with a fallback to PR_SET_MM_EXE_FILE on older
> > kernels) during restore. Afair, criu opens the exe fd as an O_PATH
> > during dump and thus will use the same flag during restore when
> > opening it. So that should be fine.
> 
> Yes.
> 
> > 
> > However, if I understand the consequences of this change correctly, a
> > problem could be restoring workloads that hold a writable fd open to
> > their exe file at dump time which would mean that during restore that fd
> > would be reopened writable causing CRIU to fail when setting the exe
> > file for the task to be restored.
> 
> If it's their exe file, then the existing VM_DENYWRITE handling would have
> forbidden these workloads to open the fd of their exe file writable, right?

Yes.

> At least before doing any PR_SET_MM_MAP/PR_SET_MM_EXE_FILE. But that should
> rule out quite a lot of cases we might be worried about, right?

Yes, it rules out the most obvious cases. The problem is really just
that we don't know how common weirder cases are. But that doesn't mean
we shouldn't try and risk it. This is a nice cleanup and playing
/proc/self/exe games isn't super common.

> 
> > 
> > Which honestly, no idea how many such workloads exist. (I know at least
> > of runC and LXC need to sometimes reopen to rexec themselves (weird bug
> > to protect against attacking the exe file) and thus re-open
> > /proc/self/exe but read-only.)
> > 
> > >   kernel/fork.c | 39 ++++++++++++++++++++++++++++++++++-----
> > >   1 file changed, 34 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 6bd2e52bcdfb..5d904878f19b 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -476,6 +476,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >   {
> > >   	struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
> > >   	struct rb_node **rb_link, *rb_parent;
> > > +	struct file *exe_file;
> > >   	int retval;
> > >   	unsigned long charge;
> > >   	LIST_HEAD(uf);
> > > @@ -493,7 +494,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >   	mmap_write_lock_nested(mm, SINGLE_DEPTH_NESTING);
> > >   	/* No ordering required: file already has been exposed. */
> > > -	RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
> > > +	exe_file = get_mm_exe_file(oldmm);
> > > +	RCU_INIT_POINTER(mm->exe_file, exe_file);
> > > +	if (exe_file)
> > > +		deny_write_access(exe_file);
> > >   	mm->total_vm = oldmm->total_vm;
> > >   	mm->data_vm = oldmm->data_vm;
> > > @@ -638,8 +642,13 @@ static inline void mm_free_pgd(struct mm_struct *mm)
> > >   #else
> > >   static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> > >   {
> > > +	struct file *exe_file;
> > > +
> > >   	mmap_write_lock(oldmm);
> > > -	RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
> > > +	exe_file = get_mm_exe_file(oldmm);
> > > +	RCU_INIT_POINTER(mm->exe_file, exe_file);
> > > +	if (exe_file)
> > > +		deny_write_access(exe_file);
> > >   	mmap_write_unlock(oldmm);
> > >   	return 0;
> > >   }
> > > @@ -1163,11 +1172,19 @@ void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
> > >   	 */
> > >   	old_exe_file = rcu_dereference_raw(mm->exe_file);
> > > -	if (new_exe_file)
> > > +	if (new_exe_file) {
> > >   		get_file(new_exe_file);
> > > +		/*
> > > +		 * exec code is required to deny_write_access() successfully,
> > > +		 * so this cannot fail
> > > +		 */
> > > +		deny_write_access(new_exe_file);
> > > +	}
> > >   	rcu_assign_pointer(mm->exe_file, new_exe_file);
> > > -	if (old_exe_file)
> > > +	if (old_exe_file) {
> > > +		allow_write_access(old_exe_file);
> > >   		fput(old_exe_file);
> > > +	}
> > >   }
> > >   int atomic_set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
> > > @@ -1194,10 +1211,22 @@ int atomic_set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
> > >   	}
> > >   	/* set the new file, lockless */
> > > +	ret = deny_write_access(new_exe_file);
> > > +	if (ret)
> > > +		return -EACCES;
> > >   	get_file(new_exe_file);
> > > +
> > >   	old_exe_file = xchg(&mm->exe_file, new_exe_file);
> > > -	if (old_exe_file)
> > > +	if (old_exe_file) {
> > > +		/*
> > > +		 * Don't race with dup_mmap() getting the file and disallowing
> > > +		 * write access while someone might open the file writable.
> > > +		 */
> > > +		mmap_read_lock(mm);
> > > +		allow_write_access(old_exe_file);
> > >   		fput(old_exe_file);
> > > +		mmap_read_unlock(mm);
> > > +	}
> > >   	return 0;
> > >   }
> > > -- 
> > > 2.31.1
> > > 
> > 
> 
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 


* Re: [PATCH v1 3/7] kernel/fork: always deny write access to current MM exe_file
  2021-08-12 12:32       ` Christian Brauner
@ 2021-08-12 12:38         ` David Hildenbrand
  0 siblings, 0 replies; 82+ messages in thread
From: David Hildenbrand @ 2021-08-12 12:38 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Alexander Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Eric W. Biederman,
	Greg Ungerer, Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, linux-api, x86, linux-fsdevel, linux-mm,
	Andrei Vagin

On 12.08.21 14:32, Christian Brauner wrote:
> On Thu, Aug 12, 2021 at 12:13:44PM +0200, David Hildenbrand wrote:
>> On 12.08.21 12:05, Christian Brauner wrote:
>>> [+Cc Andrei]
>>>
>>> On Thu, Aug 12, 2021 at 10:43:44AM +0200, David Hildenbrand wrote:
>>>> We want to remove VM_DENYWRITE, which is currently only used when mapping the
>>>> executable during exec. During exec, we already deny_write_access() the
>>>> executable, however, after exec completes the VMAs mapped
>>>> with VM_DENYWRITE effectively keep write access denied via
>>>> deny_write_access().
>>>>
>>>> Let's deny write access when setting the MM exe_file. With this change, we
>>>> can remove VM_DENYWRITE for mapping executables.
>>>>
>>>> This represents a minor user space visible change:
>>>> sys_prctl(PR_SET_MM_EXE_FILE) can now fail if the file is already
>>>> opened writable. Also, after sys_prctl(PR_SET_MM_EXE_FILE), the file
>>>
>>> Just for completeness, this also affects PR_SET_MM_MAP when exe_fd is
>>> set.
>>
>> Correct.
>>
>>>
>>>> cannot be opened writable. Note that we can already fail with -EACCES if
>>>> the file doesn't have execute permissions.
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>
>>> The biggest user I know and that I'm involved in is CRIU which heavily
>>> uses PR_SET_MM_MAP (with a fallback to PR_SET_MM_EXE_FILE on older
>>> kernels) during restore. Afair, criu opens the exe fd as an O_PATH
>>> during dump and thus will use the same flag during restore when
>>> opening it. So that should be fine.
>>
>> Yes.
>>
>>>
>>> However, if I understand the consequences of this change correctly, a
>>> problem could be restoring workloads that hold a writable fd open to
>>> their exe file at dump time which would mean that during restore that fd
>>> would be reopened writable causing CRIU to fail when setting the exe
>>> file for the task to be restored.
>>
>> If it's their exe file, then the existing VM_DENYWRITE handling would have
>> forbidden these workloads to open the fd of their exe file writable, right?
> 
> Yes.
> 
>> At least before doing any PR_SET_MM_MAP/PR_SET_MM_EXE_FILE. But that should
>> rule out quite a lot of cases we might be worried about, right?
> 
> Yes, it rules out the most obvious cases. The problem is really just
> that we don't know how common weirder cases are. But that doesn't mean
> we shouldn't try and risk it. This is a nice cleanup and playing
> /proc/self/exe games isn't super common.
> 

Right, and having the file you're executing opened writable isn't
something very common either.

If we really run into problems, we could not protect the new file when 
issuing PR_SET_MM_MAP/PR_SET_MM_EXE_FILE. But I'd like to avoid that, if 
possible, because it feels like working around something that never 
should have worked that way and is quite inconsistent.

-- 
Thanks,

David / dhildenb



* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-12 12:20 ` [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE Florian Weimer
@ 2021-08-12 12:47   ` David Hildenbrand
  2021-08-12 16:17   ` Eric W. Biederman
  1 sibling, 0 replies; 82+ messages in thread
From: David Hildenbrand @ 2021-08-12 12:47 UTC (permalink / raw)
  To: Florian Weimer
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Alexander Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Eric W. Biederman,
	Greg Ungerer, Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, linux-api, x86, linux-fsdevel, linux-mm

On 12.08.21 14:20, Florian Weimer wrote:
> * David Hildenbrand:
> 
>> There are some (minor) user-visible changes with this series:
>> 1. We no longer deny write access to shared libraries loaded via legacy
>>     uselib(); this behavior matches modern user space e.g., via dlopen().
>> 2. We no longer deny write access to the elf interpreter after exec
>>     completed, treating it just like shared libraries (which it often is).
> 
> We have a persistent issue with people using cp (or similar tools) to
> replace system libraries.  Since the file is truncated first, all
> relocations and global data are replaced by file contents, resulting in
> difficult-to-diagnose crashes.  It would be nice if we had a way to
> prevent this mistake.  It doesn't have to be MAP_DENYWRITE or MAP_COPY.
> It could be something completely new, like an option that turns every
> future access beyond the truncation point into a signal (rather than
> getting bad data or bad code and crashing much later).
> 
> I don't know how many invalid copy operations are currently thwarted by
> the current program interpreter restriction.  I doubt that lifting the
> restriction matters.
> 
>> 3. We always deny write access to the file linked via /proc/pid/exe:
>>     sys_prctl(PR_SET_MM_EXE_FILE) will fail if write access to the file
>>     cannot be denied, and write access to the file will remain denied
>>     until the link is effectively gone (exec, termination,
>>     PR_SET_MM_EXE_FILE) -- just as if exec'ing the file.
>>
>> I was wondering if we really care about permanently disabling write access
>> to the executable, or if it would be good enough to just disable write
>> access while loading the new executable during exec; but I don't know
>> the history of that -- and it somewhat makes sense to deny write access
>> at least to the main executable. With modern user space -- dlopen() -- we
>> can effectively modify the content of shared libraries while being used.
> 
> Is there a difference between ET_DYN and ET_EXEC executables?

No, I don't think so. When exec'ing, the main executable will see a
deny_write_access(file); AFAICT, that can either be ET_DYN or ET_EXEC.

-- 
Thanks,

David / dhildenb



* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-12 12:20 ` [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE Florian Weimer
  2021-08-12 12:47   ` David Hildenbrand
@ 2021-08-12 16:17   ` Eric W. Biederman
  1 sibling, 0 replies; 82+ messages in thread
From: Eric W. Biederman @ 2021-08-12 16:17 UTC (permalink / raw)
  To: Florian Weimer
  Cc: David Hildenbrand, linux-kernel, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Alexander Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, linux-api, x86, linux-fsdevel, linux-mm

Florian Weimer <fweimer@redhat.com> writes:

> * David Hildenbrand:
>
>> There are some (minor) user-visible changes with this series:
>> 1. We no longer deny write access to shared libraries loaded via legacy
>>    uselib(); this behavior matches modern user space e.g., via dlopen().
>> 2. We no longer deny write access to the elf interpreter after exec
>>    completed, treating it just like shared libraries (which it often is).
>
> We have a persistent issue with people using cp (or similar tools) to
> replace system libraries.  Since the file is truncated first, all
> relocations and global data are replaced by file contents, resulting in
> difficult-to-diagnose crashes.  It would be nice if we had a way to
> prevent this mistake.  It doesn't have to be MAP_DENYWRITE or MAP_COPY.
> It could be something completely new, like an option that turns every
> future access beyond the truncation point into a signal (rather than
> getting bad data or bad code and crashing much later).
>
> I don't know how many invalid copy operations are currently thwarted by
> the current program interpreter restriction.  I doubt that lifting the
> restriction matters.

I suspect that what should happen is that we should make shared
libraries and executables read-only on disk.

We could potentially take this a step further and introduce a new sysctl
that causes "mmap(addr, len, PROT_EXEC, MAP_SHARED, fd, off)" (that is,
PROT_EXEC without PROT_WRITE) to fail if the file can be written by
anyone.  That sysctl could even deny chown adding write access to the
file if there are mappings open.

Given that there hasn't been enough pain for people to install shared
libraries read-only yet I suspect just installing executables and shared
libraries without write-permissions on disk is enough to prevent the
hard to track down bugs you have been talking about.
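
As a rough user-space illustration of "installed without write
permissions", the sketch below checks the mode bits and installs a file
with none of them set.  Both helper names are hypothetical, and the
check only looks at the permission bits; it is not the sysctl described
above.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* True if no write permission bit (owner, group, other) is set on path. */
static bool has_no_write_bits(const char *path)
{
	struct stat st;

	assert(path != NULL);
	if (stat(path, &st) != 0)
		return false;
	return (st.st_mode & (S_IWUSR | S_IWGRP | S_IWOTH)) == 0;
}

/*
 * Install data at path with mode 0555.  Any old file is unlinked first, so
 * existing mappings of the old inode are left untouched.
 */
static int install_read_only(const char *path, const void *data, size_t len)
{
	ssize_t n;
	int fd;

	unlink(path);	/* ignore errors: the file may not exist yet */
	fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0555);
	if (fd < 0)
		return -1;
	n = write(fd, data, len);	/* the creating fd is writable anyway */
	close(fd);
	return n == (ssize_t)len ? 0 : -1;
}
```

With a file installed this way, a plain cp over it fails for an
unprivileged user instead of truncating a file that may be mapped.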

>> 3. We always deny write access to the file linked via /proc/pid/exe:
>>    sys_prctl(PR_SET_MM_EXE_FILE) will fail if write access to the file
>>    cannot be denied, and write access to the file will remain denied
>>    until the link is effectively gone (exec, termination,
>>    PR_SET_MM_EXE_FILE) -- just as if exec'ing the file.
>>
>> I was wondering if we really care about permanently disabling write access
>> to the executable, or if it would be good enough to just disable write
>> access while loading the new executable during exec; but I don't know
>> the history of that -- and it somewhat makes sense to deny write access
>> at least to the main executable. With modern user space -- dlopen() -- we
>> can effectively modify the content of shared libraries while being used.
>
> Is there a difference between ET_DYN and ET_EXEC executables?

What is being changed is how we track which files to deny write
access on.  Instead of denying write access on a per-mapping (aka
mmap) basis, the new code only denies access to /proc/self/exe.

Because the method of tracking is much coarser, the interpreter stops
being protected.  The code doesn't care how the mappings happen, only
if the file is /proc/self/exe or not.
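
The user-visible side of this can be probed from user space.  A minimal
sketch, assuming a Linux system with /proc mounted; the function name is
made up for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to open the running executable for writing.
 * Returns 0 if the open succeeded, otherwise the errno from open().
 */
static int try_open_own_exe_for_write(void)
{
	int fd = open("/proc/self/exe", O_WRONLY);

	if (fd >= 0) {
		close(fd);
		return 0;
	}
	assert(errno > 0);	/* open() failed, so errno is set */
	return errno;
}
```

On kernels that keep write access to the current exe_file denied, the
expected result is ETXTBSY; other errors such as EACCES or EROFS can
show up first, depending on file permissions and the filesystem.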

Eric



* Re: [PATCH v1 3/7] kernel/fork: always deny write access to current MM exe_file
  2021-08-12  8:43 ` [PATCH v1 3/7] kernel/fork: always deny write access to " David Hildenbrand
  2021-08-12 10:05   ` Christian Brauner
@ 2021-08-12 16:51   ` Linus Torvalds
  2021-08-12 19:38     ` David Hildenbrand
  1 sibling, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2021-08-12 16:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Alexander Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Eric W. Biederman,
	Greg Ungerer, Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	linux-fsdevel, Linux-MM

On Wed, Aug 11, 2021 at 10:45 PM David Hildenbrand <david@redhat.com> wrote:
>
>         /* No ordering required: file already has been exposed. */
> -       RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
> +       exe_file = get_mm_exe_file(oldmm);
> +       RCU_INIT_POINTER(mm->exe_file, exe_file);
> +       if (exe_file)
> +               deny_write_access(exe_file);

Can we make a helper function for this, since it's done in two different places?

> -       if (new_exe_file)
> +       if (new_exe_file) {
>                 get_file(new_exe_file);
> +               /*
> +                * exec code is required to deny_write_access() successfully,
> +                * so this cannot fail
> +                */
> +               deny_write_access(new_exe_file);
> +       }
>         rcu_assign_pointer(mm->exe_file, new_exe_file);

And the above looks positively wrong. The comment is also nonsensical,
in that it basically says "we thought this cannot fail, so we'll just
rely on it".

If it truly cannot fail, then the comment should give the reason, not
the "we depend on this not failing".

And honestly, I don't see why it couldn't fail. And if it *does* fail,
we cannot then RCU-assign the exe_file pointer with this, because
you'll get a counter imbalance when you do the allow_write_access()
later.

Anyway, do_open_execat() does do deny_write_access() with proper error
checking. I think that is the existing reference that you depend on -
so that it doesn't fail. So the comment could possibly say that the
only caller has done this, but can we not just use the reference
deny_write_access() directly, and not do a new one here?

IOW, maybe there's an extraneous 'allow_write_access()' somewhere that
should be dropped when we do the whole binprm dance in execve()?

             Linus


* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-12  8:43 [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE David Hildenbrand
                   ` (7 preceding siblings ...)
  2021-08-12 12:20 ` [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE Florian Weimer
@ 2021-08-12 17:32 ` Eric W. Biederman
  2021-08-12 17:35   ` Andy Lutomirski
  8 siblings, 1 reply; 82+ messages in thread
From: Eric W. Biederman @ 2021-08-12 17:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Alexander Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, linux-api, x86, linux-fsdevel, linux-mm

David Hildenbrand <david@redhat.com> writes:

> This series is based on v5.14-rc5 and corresponds code-wise to the
> previously sent RFC [1] (the RFC still applied cleanly).
>
> This series removes all in-tree usage of MAP_DENYWRITE from the kernel
> and removes VM_DENYWRITE. We stopped supporting MAP_DENYWRITE for
> user space applications a while ago because of the chance for DoS.
> The last remaining user is binfmt binary loading during exec and
> legacy library loading via uselib().
>
> With this change, MAP_DENYWRITE is effectively ignored throughout the
> kernel. Although the net change is small, I think the cleanup in mmap()
> is quite nice.
>
> There are some (minor) user-visible changes with this series:
> 1. We no longer deny write access to shared libraries loaded via legacy
>    uselib(); this behavior matches modern user space e.g., via dlopen().
> 2. We no longer deny write access to the elf interpreter after exec
>    completed, treating it just like shared libraries (which it often is).
> 3. We always deny write access to the file linked via /proc/pid/exe:
>    sys_prctl(PR_SET_MM_EXE_FILE) will fail if write access to the file
>    cannot be denied, and write access to the file will remain denied
>    until the link is effectively gone (exec, termination,
>    PR_SET_MM_EXE_FILE) -- just as if exec'ing the file.
>
> I was wondering if we really care about permanently disabling write access
> to the executable, or if it would be good enough to just disable write
> access while loading the new executable during exec; but I don't know
> the history of that -- and it somewhat makes sense to deny write access
> at least to the main executable. With modern user space -- dlopen() -- we
> can effectively modify the content of shared libraries while being
> used.

So I think what we really want to do is to install executables and
shared libraries without write permissions and immutable, so that
upgrades/replacements of the libraries and executables are forced to
rename or unlink them.  We need the immutable bit as CAP_DAC_OVERRIDE
aka being root ignores the writable bits when a file is opened for
write.  However CAP_DAC_OVERRIDE does not override the immutable state
of a file.
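
A minimal user-space sketch of the rename-based replacement described
above.  It writes the new contents to a temporary name and rename()s it
over the old one, so an existing mapping keeps seeing the old inode.
The helper names and DEMO_LEN are made up, and the immutable bit is
left out because setting it needs privileges.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define DEMO_LEN 4096	/* illustrative file size */

/* Create a fresh file at path (mode 0555, no write bits) filled with fill. */
static int create_file(const char *path, char fill)
{
	char buf[DEMO_LEN];
	ssize_t n;
	int fd;

	unlink(path);	/* ignore errors: the file may not exist yet */
	fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0555);
	if (fd < 0)
		return -1;
	memset(buf, fill, sizeof(buf));
	n = write(fd, buf, sizeof(buf));
	close(fd);
	return n == (ssize_t)sizeof(buf) ? 0 : -1;
}

/* Read the first byte of the file currently named path, or -1 on error. */
static int first_byte_of(const char *path)
{
	char c;
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	n = read(fd, &c, 1);
	close(fd);
	return n == 1 ? c : -1;
}

/*
 * Map the old file, replace it via rename(), and return the first byte
 * still visible through the old mapping.
 */
static int byte_seen_after_rename(const char *path, const char *tmp_path)
{
	volatile char *map;
	int fd, seen;

	assert(path != NULL && tmp_path != NULL);
	if (create_file(path, 'O'))
		return -1;
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	map = mmap(NULL, DEMO_LEN, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);
	if (map == MAP_FAILED)
		return -1;
	if (create_file(tmp_path, 'N') || rename(tmp_path, path)) {
		munmap((void *)map, DEMO_LEN);
		return -1;
	}
	seen = map[0];	/* still backed by the old, now unlinked inode */
	munmap((void *)map, DEMO_LEN);
	return seen;
}
```

New openers of the path get the new contents, while processes that
mapped the old file keep a consistent view until they drop the mapping.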

I believe that denying write access at exec mmap time is actually much
too late in the process, and making the denial of writing much larger in
scope is fundamentally what we want to do.  Changing how we install the
files avoids the denial of service problems that MAP_DENYWRITE had.
Making the denial always happen ensures that installation programs are
never fooled into thinking a non-atomic update of an executable or
shared library is ok.

Still that is non-kernel work so I don't know who would make that
change.

As this fundamentally simplifies the code and fixes a design mistake with
very little functional change:

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

For the entire series.


> There is a related problem [2] with overlayfs, that should at least partly
> be tackled by this series. I don't quite understand the interaction of
> overlayfs and deny_write_access()/allow_write_access() at exec time:
>
> If we end up denying write access to the wrong file and not to the
> realfile, that would be fundamentally broken. We would have to reroute
> our deny_write_access()/ allow_write_access() calls for the exec file to
> the realfile -- but I leave figuring out the details to overlayfs guys, as
> that would be a related but different issue.
>
> RFC -> v1:
> - "binfmt: remove in-tree usage of MAP_DENYWRITE"
> -- Add a note that this should fix part of a problem with overlayfs
>
> [1] https://lore.kernel.org/r/20210423131640.20080-1-david@redhat.com/
> [2] https://lore.kernel.org/r/YNHXzBgzRrZu1MrD@miu.piliscsaba.redhat.com/
>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Jiri Olsa <jolsa@redhat.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Petr Mladek <pmladek@suse.com>
> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Greg Ungerer <gerg@linux-m68k.org>
> Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
> Cc: Chinwen Chang <chinwen.chang@mediatek.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Feng Tang <feng.tang@intel.com>
> Cc: Kevin Brodsky <Kevin.Brodsky@arm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Shawn Anastasio <shawn@anastas.io>
> Cc: Steven Price <steven.price@arm.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Marco Elver <elver@google.com>
> Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
> Cc: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> Cc: Thomas Cedeno <thomascedeno@google.com>
> Cc: Collin Fijalkovich <cfijalkovich@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Chengguang Xu <cgxu519@mykernel.net>
> Cc: "Christian König" <ckoenig.leichtzumerken@gmail.com>
> Cc: linux-unionfs@vger.kernel.org
> Cc: linux-api@vger.kernel.org
> Cc: x86@kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-mm@kvack.org
>
> David Hildenbrand (7):
>   binfmt: don't use MAP_DENYWRITE when loading shared libraries via
>     uselib()
>   kernel/fork: factor out atomcially replacing the current MM exe_file
>   kernel/fork: always deny write access to current MM exe_file
>   binfmt: remove in-tree usage of MAP_DENYWRITE
>   mm: remove VM_DENYWRITE
>   mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff()
>   fs: update documentation of get_write_access() and friends
>
>  arch/x86/ia32/ia32_aout.c      |  8 ++--
>  fs/binfmt_aout.c               |  7 ++--
>  fs/binfmt_elf.c                |  6 +--
>  fs/binfmt_elf_fdpic.c          |  2 +-
>  fs/proc/task_mmu.c             |  1 -
>  include/linux/fs.h             | 19 +++++----
>  include/linux/mm.h             |  3 +-
>  include/linux/mman.h           |  4 +-
>  include/trace/events/mmflags.h |  1 -
>  kernel/events/core.c           |  2 -
>  kernel/fork.c                  | 75 ++++++++++++++++++++++++++++++----
>  kernel/sys.c                   | 33 +--------------
>  lib/test_printf.c              |  5 +--
>  mm/mmap.c                      | 29 ++-----------
>  mm/nommu.c                     |  2 -
>  15 files changed, 98 insertions(+), 99 deletions(-)
>
>
> base-commit: 36a21d51725af2ce0700c6ebcb6b9594aac658a6


* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-12 17:32 ` Eric W. Biederman
@ 2021-08-12 17:35   ` Andy Lutomirski
  2021-08-12 17:48     ` Eric W. Biederman
  0 siblings, 1 reply; 82+ messages in thread
From: Andy Lutomirski @ 2021-08-12 17:35 UTC (permalink / raw)
  To: Eric W. Biederman, David Hildenbrand
  Cc: Linux Kernel Mailing List, Linus Torvalds, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Al Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	linux-fsdevel, linux-mm



On Thu, Aug 12, 2021, at 10:32 AM, Eric W. Biederman wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
> > This series is based on v5.14-rc5 and corresponds code-wise to the
> > previously sent RFC [1] (the RFC still applied cleanly).
> >
> > This series removes all in-tree usage of MAP_DENYWRITE from the kernel
> > and removes VM_DENYWRITE. We stopped supporting MAP_DENYWRITE for
> > user space applications a while ago because of the chance for DoS.
> > The last remaining user is binfmt binary loading during exec and
> > legacy library loading via uselib().
> >
> > With this change, MAP_DENYWRITE is effectively ignored throughout the
> > kernel. Although the net change is small, I think the cleanup in mmap()
> > is quite nice.
> >
> > There are some (minor) user-visible changes with this series:
> > 1. We no longer deny write access to shared libraries loaded via legacy
> >    uselib(); this behavior matches modern user space e.g., via dlopen().
> > 2. We no longer deny write access to the elf interpreter after exec
> >    completed, treating it just like shared libraries (which it often is).
> > 3. We always deny write access to the file linked via /proc/pid/exe:
> >    sys_prctl(PR_SET_MM_EXE_FILE) will fail if write access to the file
> >    cannot be denied, and write access to the file will remain denied
> >    until the link is effectively gone (exec, termination,
> >    PR_SET_MM_EXE_FILE) -- just as if exec'ing the file.
> >
> > I was wondering if we really care about permanently disabling write access
> > to the executable, or if it would be good enough to just disable write
> > access while loading the new executable during exec; but I don't know
> > the history of that -- and it somewhat makes sense to deny write access
> > at least to the main executable. With modern user space -- dlopen() -- we
> > can effectively modify the content of shared libraries while being
> > used.
> 
> So I think what we really want to do is to install executables
> and shared libraries without write permissions and immutable.  So that
> upgrades/replacements of the libraries and executables are forced to
> rename or unlink them.  We need the immutable bit as CAP_DAC_OVERRIDE
> aka being root ignores the writable bits when a file is opened for
> write.  However CAP_DAC_OVERRIDE does not override the immutable state
> of a file.

If we really want to do this, I think we'd want a different flag that's more like sealed.  Non-root users should be able to do this, too.

Or we could just more gracefully handle users that overwrite running programs.

--Andy


* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-12 17:35   ` Andy Lutomirski
@ 2021-08-12 17:48     ` Eric W. Biederman
  2021-08-12 18:01       ` Andy Lutomirski
                         ` (2 more replies)
  0 siblings, 3 replies; 82+ messages in thread
From: Eric W. Biederman @ 2021-08-12 17:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Hildenbrand, Linux Kernel Mailing List, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Al Viro, Alexey Dobriyan, Steven Rostedt,
	Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	linux-fsdevel, linux-mm

"Andy Lutomirski" <luto@kernel.org> writes:

> On Thu, Aug 12, 2021, at 10:32 AM, Eric W. Biederman wrote:
>> David Hildenbrand <david@redhat.com> writes:
>> 
>> > This series is based on v5.14-rc5 and corresponds code-wise to the
>> > previously sent RFC [1] (the RFC still applied cleanly).
>> >
>> > This series removes all in-tree usage of MAP_DENYWRITE from the kernel
>> > and removes VM_DENYWRITE. We stopped supporting MAP_DENYWRITE for
>> > user space applications a while ago because of the chance for DoS.
>> > The last remaining user is binfmt binary loading during exec and
>> > legacy library loading via uselib().
>> >
>> > With this change, MAP_DENYWRITE is effectively ignored throughout the
>> > kernel. Although the net change is small, I think the cleanup in mmap()
>> > is quite nice.
>> >
>> > There are some (minor) user-visible changes with this series:
>> > 1. We no longer deny write access to shared libraries loaded via legacy
>> >    uselib(); this behavior matches modern user space e.g., via dlopen().
>> > 2. We no longer deny write access to the elf interpreter after exec
>> >    completed, treating it just like shared libraries (which it often is).
>> > 3. We always deny write access to the file linked via /proc/pid/exe:
>> >    sys_prctl(PR_SET_MM_EXE_FILE) will fail if write access to the file
>> >    cannot be denied, and write access to the file will remain denied
>> >    until the link is effectively gone (exec, termination,
>> >    PR_SET_MM_EXE_FILE) -- just as if exec'ing the file.
>> >
>> > I was wondering if we really care about permanently disabling write access
>> > to the executable, or if it would be good enough to just disable write
>> > access while loading the new executable during exec; but I don't know
>> > the history of that -- and it somewhat makes sense to deny write access
>> > at least to the main executable. With modern user space -- dlopen() -- we
>> > can effectively modify the content of shared libraries while being
>> > used.
>> 
>> So I think what we really want to do is to install executables
>> and shared libraries without write permissions and immutable.  So that
>> upgrades/replacements of the libraries and executables are forced to
>> rename or unlink them.  We need the immutable bit as CAP_DAC_OVERRIDE
>> aka being root ignores the writable bits when a file is opened for
>> write.  However CAP_DAC_OVERRIDE does not override the immutable state
>> of a file.
>
> If we really want to do this, I think we'd want a different flag
> that's more like sealed.  Non-root users should be able to do this,
> too.
>
> Or we could just more gracefully handle users that overwrite running
> programs.

I had a blind spot, and Florian Weimer made a very reasonable request.
Apparently userspace uses MAP_PRIVATE for shared libraries.

So we almost don't care if the library is overwritten.  We lose some
efficiency and apparently there are some corner cases like the library
being extended past the end of the existing file that are problematic.

Given that MAP_PRIVATE for shared libraries is our strategy for handling
writes to shared libraries perhaps we just need to use MAP_POPULATE or a
new related flag (perhaps MAP_PRIVATE_NOW) that just makes certain that
everything mapped from the executable is guaranteed to be visible from
the time of the mmap, and any changes from the filesystem side after
that are guaranteed to cause a copy on write.

Once we get that figured out we could consider getting rid of deny-write
entirely.

Eric


* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-12 17:48     ` Eric W. Biederman
@ 2021-08-12 18:01       ` Andy Lutomirski
  2021-08-12 18:10       ` Linus Torvalds
  2021-08-12 18:15       ` Florian Weimer
  2 siblings, 0 replies; 82+ messages in thread
From: Andy Lutomirski @ 2021-08-12 18:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Hildenbrand, Linux Kernel Mailing List, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Al Viro, Alexey Dobriyan, Steven Rostedt,
	Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	linux-fsdevel, linux-mm



On Thu, Aug 12, 2021, at 10:48 AM, Eric W. Biederman wrote:
> "Andy Lutomirski" <luto@kernel.org> writes:

> I had a blind spot, and Florian Weimer made a very reasonable request.
> Apparently userspace uses MAP_PRIVATE for shared libraries.
> 
> So we almost don't care if the library is overwritten.  We lose some
> efficiency and apparently there are some corner cases like the library
> being extended past the end of the existing file that are problematic.
> 
> Given that MAP_PRIVATE for shared libraries is our strategy for handling
> writes to shared libraries perhaps we just need to use MAP_POPULATE or a
> new related flag (perhaps MAP_PRIVATE_NOW) that just makes certain that
> everything mapped from the executable is guaranteed to be visible from
> the time of the mmap, and any changes from the filesystem side after
> that are guaranteed to cause a copy on write.
> 
> Once we get that figured out we could consider getting rid of deny-write
> entirely.

Are all of the CoW bits in good enough shape for this to work without just immediately CoWing the whole file?  In principle, write(2) to a file should be able to notice that it needs to CoW some pages, but I doubt that this actually works.

--Andy


* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-12 17:48     ` Eric W. Biederman
  2021-08-12 18:01       ` Andy Lutomirski
@ 2021-08-12 18:10       ` Linus Torvalds
  2021-08-12 18:47         ` Eric W. Biederman
  2021-08-12 19:24         ` David Hildenbrand
  2021-08-12 18:15       ` Florian Weimer
  2 siblings, 2 replies; 82+ messages in thread
From: Linus Torvalds @ 2021-08-12 18:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Lutomirski, David Hildenbrand, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Al Viro, Alexey Dobriyan, Steven Rostedt,
	Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	linux-fsdevel, Linux-MM

On Thu, Aug 12, 2021 at 7:48 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Given that MAP_PRIVATE for shared libraries is our strategy for handling
> writes to shared libraries perhaps we just need to use MAP_POPULATE or a
> new related flag (perhaps MAP_PRIVATE_NOW)

No. That would be horrible for the usual bloated GUI libraries. It
might help some (dynamic page faults are not cheap either), but it
would hurt a lot.

This is definitely a "if you overwrite a system library while it's
being used, you get to keep both pieces" situation.

The kernel ETXTBSY thing is purely a courtesy feature, and as people
have noticed it only really works for the main executable because of
various reasons. It's not something user space should even rely on,
it's more of a "ok, you're doing something incredibly stupid, and
we'll help you avoid shooting yourself in the foot when we notice".

Any distro should make sure their upgrade tools don't just
truncate/write to random libraries or executables.

And if they do, it's really not a kernel issue.
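
As an illustration of what a well-behaved upgrade tool might do instead (a minimal sketch, not something posted in this thread): write the new contents to a temporary file in the same directory and rename() it over the old name, so the inode that running processes have mapped is never written to. The helper name replace_file_atomically() and the ".tmp.XXXXXX" suffix are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace the file at `path` with `len` bytes from
 * `data` without writing into the inode that running processes may have
 * mapped. Returns 0 on success, -1 on failure.
 */
int replace_file_atomically(const char *path, const void *data, size_t len,
                            mode_t mode)
{
    char tmp[4096];
    const char *p = data;
    int fd, n;

    assert(path != NULL && (data != NULL || len == 0));

    /* Temp file next to the target so rename() stays on one filesystem. */
    n = snprintf(tmp, sizeof(tmp), "%s.tmp.XXXXXX", path);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;
    fd = mkstemp(tmp);
    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += w;
        len -= (size_t)w;
    }
    /* mkstemp() creates the file 0600; apply the requested mode and flush. */
    if (fchmod(fd, mode) < 0 || fsync(fd) < 0)
        goto fail;
    if (close(fd) < 0) {
        fd = -1;
        goto fail;
    }
    fd = -1;

    /* Switch the name to the new inode; existing mappings keep the old one. */
    if (rename(tmp, path) < 0)
        goto fail;
    return 0;

fail:
    if (fd >= 0)
        close(fd);
    unlink(tmp);
    return -1;
}
```

A file descriptor (or mapping) opened before the replacement keeps referring to the old inode, which is why this pattern does not disturb running programs.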

This patch series basically takes this very historical error return,
and simplifies and clarifies the implementation, and in the process
might change some very subtle corner case (unmapping the original
executable entirely?). I hope (and think) it wouldn't matter exactly
because this is a "courtesy error" rather than anything that a sane
setup would _depend_ on, but hey, insane setups clearly exist.

               Linus


* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-12 17:48     ` Eric W. Biederman
  2021-08-12 18:01       ` Andy Lutomirski
  2021-08-12 18:10       ` Linus Torvalds
@ 2021-08-12 18:15       ` Florian Weimer
  2021-08-12 18:21         ` Linus Torvalds
  2 siblings, 1 reply; 82+ messages in thread
From: Florian Weimer @ 2021-08-12 18:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Lutomirski, David Hildenbrand, Linux Kernel Mailing List,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Al Viro, Alexey Dobriyan,
	Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	linux-fsdevel, linux-mm

* Eric W. Biederman:

> Given that MAP_PRIVATE for shared libraries is our strategy for handling
> writes to shared libraries perhaps we just need to use MAP_POPULATE or a
> new related flag (perhaps MAP_PRIVATE_NOW) that just makes certain that
> everything mapped from the executable is guaranteed to be visible from
> the time of the mmap, and any changes from the filesystem side after
> that are guaranteed to cause a copy on write.

I think this is called MAP_COPY:

  <https://www.gnu.org/software/hurd/glibc/mmap.html>

If we could get that functionality, we would certainly use it in the
glibc dynamic loader.  And it's not just dynamic loaders that would
benefit.

But I assume there are some rather thorny issues around semantics.  If
the changed areas of the file are in the page cache already, everything
is okay.  But if parts of the file are changed or discarded that are
not, they would have to be read in first, which is rather awkward.

That's why I suggested the signal for all future accesses.  It seems
more tractable and addresses the largest issue (the difficulty of
figuring out why some processes crash occasionally).

Thanks,
Florian



* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-12 18:15       ` Florian Weimer
@ 2021-08-12 18:21         ` Linus Torvalds
  0 siblings, 0 replies; 82+ messages in thread
From: Linus Torvalds @ 2021-08-12 18:21 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Eric W. Biederman, Andy Lutomirski, David Hildenbrand,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	linux-fsdevel, Linux-MM

On Thu, Aug 12, 2021 at 8:16 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> I think this is called MAP_COPY:
>
>   <https://www.gnu.org/software/hurd/glibc/mmap.html>

Please don't even consider the crazy notions that GNU Hurd did.

It's a fundamental design mistake. The Hurd VM was horrendous, and
MAP_COPY was a prime example of the kinds of horrors it had.

I'm not sure how much of the mis-designs were due to Hurd, and how
much of it due to Mach 3. But please don't point to Hurd VM
documentation except possibly to warn people. We want people to
_forget_ those mistakes, not repeat them.

          Linus


* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-12 18:10       ` Linus Torvalds
@ 2021-08-12 18:47         ` Eric W. Biederman
  2021-08-13  9:05           ` David Laight
  2021-08-12 19:24         ` David Hildenbrand
  1 sibling, 1 reply; 82+ messages in thread
From: Eric W. Biederman @ 2021-08-12 18:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, David Hildenbrand, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Al Viro, Alexey Dobriyan, Steven Rostedt,
	Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	linux-fsdevel, Linux-MM

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Aug 12, 2021 at 7:48 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Given that MAP_PRIVATE for shared libraries is our strategy for handling
>> writes to shared libraries perhaps we just need to use MAP_POPULATE or a
>> new related flag (perhaps MAP_PRIVATE_NOW)
>
> No. That would be horrible for the usual bloated GUI libraries. It
> might help some (dynamic page faults are not cheap either), but it
> would hurt a lot.

I wasn't aiming so much at the MAP_POPULATE part but something that
would trigger cow from writes to the file.  I see code that is close but
I don't see any code in the kernel that would implement that currently.

Upon reflection I think it will always be difficult to trigger cow from
the file write side of the kernel as code that would cow the page in
the page cache would cause problems with writable mmaps.

> This is definitely a "if you overwrite a system library while it's
> being used, you get to keep both pieces" situation.
>
> The kernel ETXTBSY thing is purely a courtesy feature, and as people
> have noticed it only really works for the main executable because of
> various reasons. It's not something user space should even rely on,
> it's more of a "ok, you're doing something incredibly stupid, and
> we'll help you avoid shooting yourself in the foot when we notice".
>
> Any distro should make sure their upgrade tools don't just
> truncate/write to random libraries or executables.

Yes.  I am trying to come up with advice on how userspace
implementations can build their tools on other mechanisms that
solve the problem of overwriting shared libraries and executables and
that are not broken by design.

For a little bit the way Florian Weimer was talking and the fact that
uselib uses MAP_PRIVATE had me thinking that somehow MAP_PRIVATE could
be part of the solution.  I have now looked into the implementation of
MAP_PRIVATE, and since we don't perform the cow on filesystem writes,
MAP_PRIVATE absolutely cannot be part of the solution we recommend to
userspace.

So today the best advice I can give to userspace is to mark their
executables and shared libraries as read-only and immutable.  Otherwise
a change to the executable file can change what is mapped into memory.
MAP_PRIVATE does not help.
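
To illustrate the point, here is a small userspace sketch (not from this thread; the function name and the 4096-byte file size are assumptions chosen for the example). It maps a file MAP_PRIVATE, overwrites the first byte through the file descriptor, and reports whether the mapping observes the change. POSIX leaves this unspecified; on Linux the untouched page is typically still backed by the page cache, so the change is expected to show through.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical probe: returns 1 if a write to the file made after
 * mmap(MAP_PRIVATE) is visible through the mapping, 0 if it is not,
 * and -1 on error. The scratch file at `path` is created and removed.
 */
int private_mapping_sees_write(const char *path)
{
    char page[4096];            /* assumed file size for the demo */
    char *map;
    int fd, seen;

    assert(path != NULL);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    memset(page, 'A', sizeof(page));
    if (write(fd, page, sizeof(page)) != (ssize_t)sizeof(page))
        goto fail;

    /* Read-only private mapping, as a loader would use for library text. */
    map = mmap(NULL, sizeof(page), PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        goto fail;

    /* Overwrite the file in place, the way a careless upgrade tool would. */
    if (pwrite(fd, "B", 1, 0) != 1) {
        munmap(map, sizeof(page));
        goto fail;
    }

    /* The mapping never wrote to this page, so no private copy exists. */
    seen = (map[0] == 'B');

    munmap(map, sizeof(page));
    close(fd);
    unlink(path);
    return seen;

fail:
    close(fd);
    unlink(path);
    return -1;
}
```

A private copy is only made when the mapping itself writes to a page, which is why a write from the filesystem side is not isolated by MAP_PRIVATE.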

> And if they do, it's really not a kernel issue.

What is a kernel issue is giving people good advice on how to use kernel
features to solve real world problems.  I have seen the write to a
mapped executable/shared lib problem, and Florian has seen it.  So while
rare the problem is real and a pain to debug.

> This patch series basically takes this very historical error return,
> and simplifies and clarifies the implementation, and in the process
> might change some very subtle corner case (unmapping the original
> executable entirely?). I hope (and think) it wouldn't matter exactly
> because this is a "courtesy error" rather than anything that a sane
> setup would _depend_ on, but hey, insane setups clearly exist.

Oh yes.

I very much agree that the design of this patchset is perfectly fine.

I also see that MAP_DENYWRITE is unfortunately broken by design.  I
vaguely remember the discussion when MAP_DENYWRITE was made a noop
because of the denial-of-service aspect of MAP_DENYWRITE.

I very much agree that we should strongly encourage userspace not
to write to mmaped files.

As I am learning with my two year old, it helps to give a constructive
suggestion of alternative behavior instead of just saying no.
Florian reported that there remains a problem in userspace. So I am
coming up with a constructive suggestion.  My apologies for going off
into the weeds for a moment.

Eric



* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-12 18:10       ` Linus Torvalds
  2021-08-12 18:47         ` Eric W. Biederman
@ 2021-08-12 19:24         ` David Hildenbrand
  1 sibling, 0 replies; 82+ messages in thread
From: David Hildenbrand @ 2021-08-12 19:24 UTC (permalink / raw)
  To: Linus Torvalds, Eric W. Biederman
  Cc: Andy Lutomirski, Linux Kernel Mailing List, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Al Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	linux-fsdevel, Linux-MM

On 12.08.21 20:10, Linus Torvalds wrote:
> On Thu, Aug 12, 2021 at 7:48 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Given that MAP_PRIVATE for shared libraries is our strategy for handling
>> writes to shared libraries perhaps we just need to use MAP_POPULATE or a
>> new related flag (perhaps MAP_PRIVATE_NOW)
> 
> No. That would be horrible for the usual bloated GUI libraries. It
> might help some (dynamic page faults are not cheap either), but it
> would hurt a lot.

Right, we most certainly don't want to waste system ram / swap space, 
memory for page tables, and degrade performance just because some 
corner-case nasty user space could harm itself.

> 
> This is definitely a "if you overwrite a system library while it's
> being used, you get to keep both pieces" situation.

Right, play stupid games, win stupid prizes. I agree that if there were 
an efficient way to detect+handle such overwrites gracefully, it 
would be great to have the kernel support that. ETXTBSY as implemented 
with this series (but also before this series) is really only a 
minimalistic approach to help detect some issues regarding the main 
executable.
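
For illustration, a tiny sketch (hypothetical helper name, not from this thread) that probes the courtesy error from userspace by trying to open a file for writing and reporting the errno. Passing "/proc/self/exe" targets the running main executable; on kernels that deny write access to it this is expected to yield ETXTBSY, provided the caller has write permission to the file, and some other errno (for example EACCES) otherwise.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical probe: try to open `path` for writing without truncating it.
 * Returns 0 if the open succeeded, otherwise the errno of the failure.
 */
int try_open_for_write(const char *path)
{
    int fd;

    assert(path != NULL);

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return errno;

    /* Opened successfully: nothing is written, just close again. */
    close(fd);
    return 0;
}
```

A caller could compare the result of try_open_for_write("/proc/self/exe") against ETXTBSY to see whether the running kernel applies the check to the main executable.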

-- 
Thanks,

David / dhildenb



* Re: [PATCH v1 3/7] kernel/fork: always deny write access to current MM exe_file
  2021-08-12 16:51   ` Linus Torvalds
@ 2021-08-12 19:38     ` David Hildenbrand
  0 siblings, 0 replies; 82+ messages in thread
From: David Hildenbrand @ 2021-08-12 19:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Alexander Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Eric W. Biederman,
	Greg Ungerer, Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	linux-fsdevel, Linux-MM

On 12.08.21 18:51, Linus Torvalds wrote:
> On Wed, Aug 11, 2021 at 10:45 PM David Hildenbrand <david@redhat.com> wrote:
>>
>>          /* No ordering required: file already has been exposed. */
>> -       RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
>> +       exe_file = get_mm_exe_file(oldmm);
>> +       RCU_INIT_POINTER(mm->exe_file, exe_file);
>> +       if (exe_file)
>> +               deny_write_access(exe_file);
> 
> Can we make a helper function for this, since it's done in two different places?

Sure, no compelling reason not to (except finding a suitable name, but 
I'll think about that tomorrow).

> 
>> -       if (new_exe_file)
>> +       if (new_exe_file) {
>>                  get_file(new_exe_file);
>> +               /*
>> +                * exec code is required to deny_write_access() successfully,
>> +                * so this cannot fail
>> +                */
>> +               deny_write_access(new_exe_file);
>> +       }
>>          rcu_assign_pointer(mm->exe_file, new_exe_file);
> 
> And the above looks positively wrong. The comment is also nonsensical,
> in that it basically says "we thought this cannot fail, so we'll just
> rely on it".

Well, it documents the expectation towards the caller, but in a 
suboptimal way, I agree.

> 
> If it truly cannot fail, then the comment should give the reason, not
> the "we depend on this not failing".

Right, "We depend on the caller having already done a deny_write_access() 
successfully, such that this call cannot fail." combined with

if (deny_write_access(new_exe_file))
	pr_warn("Unexpected failure of deny_write_access() in %s\n",
		__func__);

Suggestions welcome.

> 
> And honestly, I don't see why it couldn't fail. And if it *does* fail,
> we cannot then RCU-assign the exe_file pointer with this, because
> you'll get a counter imbalance when you do the allow_write_access()
> later.

Anyone calling set_mm_exe_file() (-> begin_new_exec()) is expected to 
have successfully triggered a deny_write_access() upfront such that we 
won't fail at that point.

Further, on the dup_mmap() path we are sure the previous oldmm exe_file 
properly saw a successful deny_write_access() already, because that's 
now guaranteed for any exe_file.

> 
> Anyway, do_open_execat() does do deny_write_access() with proper error
> checking. I think that is the existing reference that you depend on -
> so that it doesn't fail. So the comment could possibly say that the
> only caller has done this, but can we not just use the reference
> deny_write_access() directly, and not do a new one here?

I think that might over-complicate the exec code where we would see an
allow_write_access() on error paths, but not on success paths. This here
looks cleaner to me, agreeing that the comment and the error check have
to be improved.

We handle all allow_write_access()/deny_write_access() regarding 
exe_file completely in kernel/fork.c, which is IMHO quite nice.

> 
> IOW, maybe there's an extraneous 'allow_write_access()' somewhere that
> should be dropped when we do the whole binprm dance in execve()?

fs/exec.c: free_bprm() and exec_binprm() to be precise.

Thanks!

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-12 18:47         ` Eric W. Biederman
@ 2021-08-13  9:05           ` David Laight
       [not found]             ` <87h7ft2j68.fsf@disp2133>
  0 siblings, 1 reply; 82+ messages in thread
From: David Laight @ 2021-08-13  9:05 UTC (permalink / raw)
  To: 'Eric W. Biederman', Linus Torvalds
  Cc: Andy Lutomirski, David Hildenbrand, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Al Viro, Alexey Dobriyan, Steven Rostedt,
	Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	linux-fsdevel, Linux-MM

From: Eric W. Biederman
> Sent: 12 August 2021 19:47
...
> So today the best advice I can give to userspace is to mark their
> executables and shared libraries as read-only and immutable.  Otherwise
> a change to the executable file can change what is mapped into memory.
> MAP_PRIVATE does not help.

While 'immutable' might be ok for files installed by distributions
it would be a PITA in development.

ETXTBSY is a useful reminder that the file you are copying from
machine A to machine B (etc) is still running and probably ought
to be killed/stopped before you get confused.

I've never really understood why it doesn't stop shared libraries
being overwritten - but they do tend to be updated less often.

Overwriting an in-use shared library could be really confusing.
It is likely that all the code is actually in memory.
So everything carries on running as normal.
Until the kernel gets under memory pressure and discards a page.
Then a page from the new version is faulted in and random
programs start getting SEGVs.
This could be days after the borked update.
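
The "MAP_PRIVATE does not help" point can be seen with a small
userspace experiment. This is only a sketch: the function name and the
scratch file under /tmp are made up, and POSIX leaves the result
unspecified, so it shows what I would expect from the Linux page cache
rather than a guarantee.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map a file read-only with MAP_PRIVATE, overwrite the file in place,
 * and return the first byte now visible through the mapping
 * (-1 on setup failure).
 */
static int byte_seen_after_overwrite(void)
{
	char path[] = "/tmp/overwrite-demo-XXXXXX"; /* hypothetical scratch file */
	volatile const char *map;
	int seen = -1;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);		/* the open fd keeps the inode alive */

	if (pwrite(fd, "old", 3, 0) != 3)
		goto out_close;

	map = mmap(NULL, 3, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	/* Read once so the page is mapped, then overwrite in place. */
	if (map[0] == 'o' && pwrite(fd, "new", 3, 0) == 3)
		seen = map[0];

	munmap((void *)map, 3);
out_close:
	close(fd);
	return seen;
}
```

On Linux the function should return 'n': a private page that has never
been written to is still backed by the page cache, so the in-place
write shows through.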

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
       [not found]             ` <87h7ft2j68.fsf@disp2133>
@ 2021-08-13 20:51               ` Florian Weimer
  2021-08-14  0:31               ` Linus Torvalds
  1 sibling, 0 replies; 82+ messages in thread
From: Florian Weimer @ 2021-08-13 20:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Laight, Linus Torvalds, Andy Lutomirski, David Hildenbrand,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	linux-fsdevel, Linux-MM, Michael Kerrisk (man-pages)

* Eric W. Biederman:

> Florian Weimer, would it be possible to get glibc's ld.so implementation to use
> MAP_SHARED?  Just so people reading the code know what to expect of the
> kernel?  As far as I can tell there is not a practical difference
> between a read-only MAP_PRIVATE and a read-only MAP_SHARED.

Some applications use mprotect to change page protections behind glibc's
back.  Using MAP_SHARED would break fork pretty badly.

Most of the hard-to-diagnose crashes seem to come from global data or
relocations because they are wiped by truncation.  And we certainly
can't use MAP_SHARED for those.  Code often seems to come back unchanged
after the truncation because the overwritten file hasn't actually
changed.  File attributes don't help because the copying is an
administrative action in the context of the application (maybe the result
of some automation).

I think avoiding the crashes isn't the right approach.  What I'd like to
see is better diagnostics.  Writing mtime and ctime to the core file
might help.  Or adding a flag to the core file and /proc/PID/smaps that
indicates if the file has been truncated across the mapping since the
mapping was created.

A bit less conservative and even more obvious to diagnose would be a new
flag for the mapping (perhaps set via madvise) that causes any future
access to the mapping to fault with SIGBUS and a special si_code value
after the file has been truncated across the mapping.  I think we would
set that in the glibc dynamic loader.  It would make the crashes much
less weird.
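
For readers who have not seen relocations being "wiped by truncation",
the effect can be reproduced in userspace. A minimal sketch, assuming
Linux truncate semantics (the function name and the scratch file are
made up for illustration):

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Write to a MAP_PRIVATE page (forcing a private copy), then truncate
 * the file to zero and extend it again.
 * Returns 1 if the private write is still visible, 0 if it was wiped,
 * -1 on setup failure.
 */
static int private_write_survives_truncate(void)
{
	char path[] = "/tmp/truncate-demo-XXXXXX"; /* hypothetical scratch file */
	long page = sysconf(_SC_PAGESIZE);
	volatile char *map;
	int survived = -1;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);

	if (page <= 0 || ftruncate(fd, page) != 0)
		goto out_close;

	map = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	map[0] = 'R';	/* stands in for a relocation: page is now private */

	/* What an in-place "cp" effectively does: truncate, then rewrite. */
	if (ftruncate(fd, 0) == 0 && ftruncate(fd, page) == 0)
		survived = (map[0] == 'R');

	munmap((void *)map, page);
out_close:
	close(fd);
	return survived;
}
```

On Linux this should return 0: truncation also drops the private copy,
and the next access faults in the (now zero-filled) file page instead.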

Thanks,
Florian


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
       [not found]             ` <87h7ft2j68.fsf@disp2133>
  2021-08-13 20:51               ` Florian Weimer
@ 2021-08-14  0:31               ` Linus Torvalds
  2021-08-14  0:49                 ` Andy Lutomirski
  1 sibling, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2021-08-14  0:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Laight, Andy Lutomirski, David Hildenbrand,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk (man-pages)

On Fri, Aug 13, 2021 at 10:18 AM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Florian Weimer, would it be possible to get glibc's ld.so implementation to use
> MAP_SHARED?  Just so people reading the code know what to expect of the
> kernel?  As far as I can tell there is not a practical difference
> between a read-only MAP_PRIVATE and a read-only MAP_SHARED.

There's a huge difference.

For one, you actually don't necessarily want read-only. Doing COW on
library images is quite common for things like relocation etc (you'd
_hope_ everything is PC-relative, but no)

So no. Never EVER use MAP_SHARED unless you literally expect to have
two different mappings that need to be kept in sync and one writes the
other.

I'll just repeat: stop arguing about this case. If somebody writes to
a busy library, THAT IS A FUNDAMENTAL BUG, and nobody sane should care
at all about it apart from the "you get what you deserve".

What's next? Do you think glibc should also map every byte in the user
address space so that user programs don't get SIGSEGV when they have
wild pointers?

Again - that's a user BUG and trying to "work around" a wild pointer
is a worse fix than the problem it tries to fix.

The exact same thing is true for shared library (or executable)
mappings. Trying to work around people writing to them is *worse* than
the bug of doing so.

Stop this completely inane discussion already.

                  Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-14  0:31               ` Linus Torvalds
@ 2021-08-14  0:49                 ` Andy Lutomirski
  2021-08-14  0:54                   ` Linus Torvalds
                                     ` (2 more replies)
  0 siblings, 3 replies; 82+ messages in thread
From: Andy Lutomirski @ 2021-08-14  0:49 UTC (permalink / raw)
  To: Linus Torvalds, Eric W. Biederman
  Cc: David Laight, David Hildenbrand, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Al Viro, Alexey Dobriyan, Steven Rostedt,
	Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk



On Fri, Aug 13, 2021, at 5:31 PM, Linus Torvalds wrote:
> On Fri, Aug 13, 2021 at 10:18 AM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> >
> > Florian Weimer, would it be possible to get glibc's ld.so implementation to use
> > MAP_SHARED?  Just so people reading the code know what to expect of the
> > kernel?  As far as I can tell there is not a practical difference
> > between a read-only MAP_PRIVATE and a read-only MAP_SHARED.
> 
> There's a huge difference.
> 
> For one, you actually don't necessarily want read-only. Doing COW on
> library images is quite common for things like relocation etc (you'd
> _hope_ everything is PC-relative, but no)
> 
> So no. Never EVER use MAP_SHARED unless you literally expect to have
> two different mappings that need to be kept in sync and one writes the
> other.
> 
> I'll just repeat: stop arguing about this case. If somebody writes to
> a busy library, THAT IS A FUNDAMENTAL BUG, and nobody sane should care
> at all about it apart from the "you get what you deserve".
> 
> What's next? Do you think glibc should also map every byte in the user
> address space so that user programs don't get SIGSEGV when they have
> wild pointers?
> 
> Again - that's a user BUG and trying to "work around" a wild pointer
> is a worse fix than the problem it tries to fix.
> 
> The exact same thing is true for shared library (or executable)
> mappings. Trying to work around people writing to them is *worse* than
> the bug of doing so.
> 
> Stop this completely inane discussion already.
> 

I’ll bite.  How about we attack this in the opposite direction: remove the deny write mechanism entirely.

In my life, I’ve encountered -ETXTBSY intermittently, and it invariably
means that I somehow failed to finish killing a program fast enough for
whatever random rebuild I’m doing to succeed. It’s at best erratic: it
only applies for static binaries, and it has never once saved me from a
problem I care about. If the program I’m recompiling crashes, I don’t
care; it’s probably already part way through dying from an unrelated
fatal signal.  What actually happens is that I see -ETXTBSY, think
“wait, this isn’t Windows, why are there file sharing rules,” then think
“wait, Linux has *one* half-baked file sharing rule,” and go on with my
life. [0]

Seriously, can we deprecate and remove the whole thing?

[0] we have mandatory locks, too. Sigh.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-14  0:49                 ` Andy Lutomirski
@ 2021-08-14  0:54                   ` Linus Torvalds
  2021-08-14  0:58                     ` Linus Torvalds
                                       ` (2 more replies)
  2021-08-14  3:04                   ` Matthew Wilcox
  2021-08-18 15:42                   ` J. Bruce Fields
  2 siblings, 3 replies; 82+ messages in thread
From: Linus Torvalds @ 2021-08-14  0:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric W. Biederman, David Laight, David Hildenbrand,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> I’ll bite.  How about we attack this in the opposite direction: remove the deny write mechanism entirely.

I think that would be ok, except I can see somebody relying on it.

It's broken, it's stupid, but we've done that ETXTBSY for a _loong_ time.

But you are right that we have removed parts of it over time (no more
MAP_DENYWRITE, no more uselib()) so that what we have today is a
fairly weak form of what we used to do.

And nobody really complained when we weakened it, so maybe removing it
entirely might be acceptable.

              Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-14  0:54                   ` Linus Torvalds
@ 2021-08-14  0:58                     ` Linus Torvalds
  2021-08-14  1:57                       ` Al Viro
  2021-08-14 19:52                     ` David Laight
  2021-08-26 17:48                     ` Andy Lutomirski
  2 siblings, 1 reply; 82+ messages in thread
From: Linus Torvalds @ 2021-08-14  0:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Eric W. Biederman, David Laight, David Hildenbrand,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Fri, Aug 13, 2021 at 2:54 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And nobody really complained when we weakened it, so maybe removing it
> entirely might be acceptable.

I guess we could just try it and see... Worst comes to worst, we'll
have to put it back, but at least we'd know what crazy thing still
wants it..

              Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-14  0:58                     ` Linus Torvalds
@ 2021-08-14  1:57                       ` Al Viro
  2021-08-14  2:02                         ` Al Viro
  2021-08-14  7:53                         ` Christian Brauner
  0 siblings, 2 replies; 82+ messages in thread
From: Al Viro @ 2021-08-14  1:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Eric W. Biederman, David Laight,
	David Hildenbrand, Linux Kernel Mailing List, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Fri, Aug 13, 2021 at 02:58:57PM -1000, Linus Torvalds wrote:
> On Fri, Aug 13, 2021 at 2:54 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > And nobody really complained when we weakened it, so maybe removing it
> > entirely might be acceptable.
> 
> I guess we could just try it and see... Worst comes to worst, we'll
> have to put it back, but at least we'd know what crazy thing still
> wants it..

Umm...  I'll need to go back and look through the thread, but I'm
fairly sure that there used to be suckers that did replacement of
binary that way (try to write, count on exclusion with execve while
it's being written to) instead of using rename.  Install scripts
of weird crap and stuff like that...
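
For comparison, the rename-based replacement that such install scripts
should be using looks roughly like the sketch below. The function name
and the ".tmp.XXXXXX" suffix are made up for illustration; the point is
only that the old inode is never written to, so running processes keep
their old text.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Replace 'path' with 'len' bytes from 'data' by writing a temporary
 * file in the same directory and rename()ing it over the target.
 * Returns 0 on success or a negative errno value.
 */
static int replace_file_atomically(const char *path, const void *data,
				   size_t len, mode_t mode)
{
	char tmp[4096];
	const char *p = data;
	int fd, n, err;

	n = snprintf(tmp, sizeof(tmp), "%s.tmp.XXXXXX", path);
	if (n < 0 || (size_t)n >= sizeof(tmp))
		return -ENAMETOOLONG;

	fd = mkstemp(tmp);
	if (fd < 0)
		return -errno;

	while (len > 0) {
		ssize_t w = write(fd, p, len);

		if (w < 0) {
			if (errno == EINTR)
				continue;
			goto fail;
		}
		p += w;
		len -= (size_t)w;
	}

	/* Make the new contents durable before they become visible. */
	if (fchmod(fd, mode) != 0 || fsync(fd) != 0)
		goto fail;
	if (close(fd) != 0) {
		fd = -1;
		goto fail;
	}
	fd = -1;

	/* Atomically switch the name over to the new inode. */
	if (rename(tmp, path) != 0)
		goto fail;
	return 0;

fail:
	err = errno;
	if (fd >= 0)
		close(fd);
	unlink(tmp);
	return -err;
}
```

A file descriptor (or a running process) that still refers to the old
inode keeps seeing the old contents; only new opens and new execve()
calls pick up the replacement.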

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-14  1:57                       ` Al Viro
@ 2021-08-14  2:02                         ` Al Viro
  2021-08-14  9:06                           ` David Hildenbrand
  2021-08-14  7:53                         ` Christian Brauner
  1 sibling, 1 reply; 82+ messages in thread
From: Al Viro @ 2021-08-14  2:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Eric W. Biederman, David Laight,
	David Hildenbrand, Linux Kernel Mailing List, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Sat, Aug 14, 2021 at 01:57:31AM +0000, Al Viro wrote:
> On Fri, Aug 13, 2021 at 02:58:57PM -1000, Linus Torvalds wrote:
> > On Fri, Aug 13, 2021 at 2:54 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > And nobody really complained when we weakened it, so maybe removing it
> > > entirely might be acceptable.
> > 
> > I guess we could just try it and see... Worst comes to worst, we'll
> > have to put it back, but at least we'd know what crazy thing still
> > wants it..
> 
> Umm...  I'll need to go back and look through the thread, but I'm
> fairly sure that there used to be suckers that did replacement of
> binary that way (try to write, count on exclusion with execve while
> it's being written to) instead of using rename.  Install scripts
> of weird crap and stuff like that...

... and before anyone goes off - I certainly agree that using that
behaviour is not a good idea and had never been one.  All I'm saying
is that there at least used to be very random (and rarely exercised)
bits of userland relying upon that behaviour.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-14  0:49                 ` Andy Lutomirski
  2021-08-14  0:54                   ` Linus Torvalds
@ 2021-08-14  3:04                   ` Matthew Wilcox
  2021-08-17 16:48                     ` Removing Mandatory Locks Eric W. Biederman
  2021-08-18  7:51                     ` [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE Christian Brauner
  2021-08-18 15:42                   ` J. Bruce Fields
  2 siblings, 2 replies; 82+ messages in thread
From: Matthew Wilcox @ 2021-08-14  3:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Eric W. Biederman, David Laight,
	David Hildenbrand, Linux Kernel Mailing List, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Al Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
> [0] we have mandatory locks, too. Sigh.

I'd love to remove that.  Perhaps we could try persuading more of the
distros to disable the CONFIG option first.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-14  1:57                       ` Al Viro
  2021-08-14  2:02                         ` Al Viro
@ 2021-08-14  7:53                         ` Christian Brauner
  1 sibling, 0 replies; 82+ messages in thread
From: Christian Brauner @ 2021-08-14  7:53 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds, Andy Lutomirski
  Cc: Eric W. Biederman, David Laight, David Hildenbrand,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Alexey Dobriyan,
	Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk, Christoph Hellwig

On Sat, Aug 14, 2021 at 01:57:31AM +0000, Al Viro wrote:
> On Fri, Aug 13, 2021 at 02:58:57PM -1000, Linus Torvalds wrote:
> > On Fri, Aug 13, 2021 at 2:54 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > And nobody really complained when we weakened it, so maybe removing it
> > > entirely might be acceptable.
> > 
> > I guess we could just try it and see... Worst comes to worst, we'll
> > have to put it back, but at least we'd know what crazy thing still
> > wants it..
> 
> Umm...  I'll need to go back and look through the thread, but I'm
> fairly sure that there used to be suckers that did replacement of
> binary that way (try to write, count on exclusion with execve while
> it's being written to) instead of using rename.  Install scripts
> of weird crap and stuff like that...

I'm not against trying to remove it, but I think Al has a point.

Removing the write protection will also most certainly make certain
classes of attacks _easier_. For example, the runC container breakout
using privileged containers, tracked as CVE-2019-5736, would be
easier. I'm quoting from the commit I fixed this with:

    The attack can be made when attaching to a running container or when starting a
    container running a specially crafted image.  For example, when runC attaches
    to a container the attacker can trick it into executing itself. This could be
    done by replacing the target binary inside the container with a custom binary
    pointing back at the runC binary itself. As an example, if the target binary
    was /bin/bash, this could be replaced with an executable script specifying the
    interpreter path #!/proc/self/exe (/proc/self/exe is a symbolic link created
    by the kernel for every process which points to the binary that was executed
    for that process). As such when /bin/bash is executed inside the container,
    instead the target of /proc/self/exe will be executed - which will point to the
    runc binary on the host. The attacker can then proceed to write to the target
    of /proc/self/exe to try and overwrite the runC binary on the host.

and then the write protection kicks in of course:

    However in general, this will not succeed as the kernel will not
    permit it to be overwritten whilst runC is executing.

which the attacker can of course already overcome nowadays with minimal
smarts:

    To overcome this, the attacker can instead open a file descriptor to
    /proc/self/exe using the O_PATH flag and then proceed to reopen the
    binary as O_WRONLY through /proc/self/fd/<nr> and try to write to it
    in a busy loop from a separate process. Ultimately it will succeed
    when the runC binary exits. After this the runC binary is
    compromised and can be used to attack other containers or the host
    itself.

But with write protection removed you'd allow such attacks to succeed
right away. It's not a huge deal to remove it since we need to have
other protection mechanisms in place already:

    To prevent this attack, LXC has been patched to create a temporary copy of the
    calling binary itself when it starts or attaches to containers. To do this LXC
    creates an anonymous, in-memory file using the memfd_create() system call and
    copies itself into the temporary in-memory file, which is then sealed to
    prevent further modifications. LXC then executes this sealed, in-memory file
    instead of the original on-disk binary. Any compromising write operations from
    a privileged container to the host LXC binary will then write to the temporary
    in-memory binary and not to the host binary on-disk, preserving the integrity
    of the host LXC binary. Also as the temporary, in-memory LXC binary is sealed,
    writes to this will also fail.

    Note: memfd_create() was added to the Linux kernel in the 3.17 release.
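
The sealed-memfd approach described in the commit message can be
sketched as follows. This is not the LXC code; the function name and
the "self-copy" label are made up, and it assumes a glibc that exposes
memfd_create() under _GNU_SOURCE.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for memfd_create() and F_ADD_SEALS */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Copy the running binary into an anonymous in-memory file and seal it
 * against any further modification.
 * Returns the memfd on success or a negative errno value.
 */
static int clone_self_into_sealed_memfd(void)
{
	char buf[65536];
	int src, memfd, err;
	ssize_t n;

	src = open("/proc/self/exe", O_RDONLY | O_CLOEXEC);
	if (src < 0)
		return -errno;

	memfd = memfd_create("self-copy", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (memfd < 0) {
		err = errno;
		close(src);
		return -err;
	}

	while ((n = read(src, buf, sizeof(buf))) > 0) {
		char *p = buf;

		while (n > 0) {
			ssize_t w = write(memfd, p, (size_t)n);

			if (w < 0)
				goto fail;
			p += w;
			n -= w;
		}
	}
	if (n < 0)
		goto fail;

	/* No more writes, no resizing, and no removing the seals. */
	if (fcntl(memfd, F_ADD_SEALS, F_SEAL_SEAL | F_SEAL_SHRINK |
				      F_SEAL_GROW | F_SEAL_WRITE) != 0)
		goto fail;

	close(src);
	return memfd;

fail:
	err = errno;
	close(src);
	close(memfd);
	return -err;
}
```

A caller would then presumably hand the returned descriptor to
fexecve() instead of executing the on-disk binary; any later write()
or ftruncate() on the sealed file fails with EPERM.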

However, I'd still like to pitch the upgrade mask idea that Aleksa and I
tried to implement when we did openat2(). If we leave write protection in
place, preventing /proc/self/exe from being written to, we can take some
time and upstream the upgrade mask patchset, which was part of the
initial openat2() patchset but was dropped back then (and I had Linus
remove the last remnants of the idea in [1]).

The idea was to add a new field "upgrade_mask" to struct open_how that
would allow a caller to specify the permissions with which an fd could be
reopened. I still like this idea a great deal and it would be a
very welcome addition to system management programs. The upgrade mask is
of course optional, i.e. the caller would have to specify the upgrade
mask at open time to restrict reopening (lest we regress the whole
world).

But, we could make it so that an O_PATH fd gotten from opening
/proc/<pid>/exe always gets a restricted upgrade mask set and so it
can't be upgraded to an O_WRONLY fd afterwards. For this to be
meaningful, write protection for /proc/self/exe would need to be kept.

[1]: commit 5c350aa11b441b32baf3bfe4018168cb8d10cef7
     Author: Christian Brauner <christian.brauner@ubuntu.com>
     Date:   Fri May 28 11:24:15 2021 +0200
     
         fcntl: remove unused VALID_UPGRADE_FLAGS
     
         We currently do not make use of this feature and should we implement
         something like this in the future it's trivial to add it back.
     
         Link: https://lore.kernel.org/r/20210528092417.3942079-2-brauner@kernel.org
         Cc: Christoph Hellwig <hch@lst.de>
         Cc: Aleksa Sarai <cyphar@cyphar.com>
         Cc: Al Viro <viro@zeniv.linux.org.uk>
         Cc: linux-fsdevel@vger.kernel.org
         Suggested-by: Richard Guy Briggs <rgb@redhat.com>
         Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
         Reviewed-by: Christoph Hellwig <hch@lst.de>
         Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-14  2:02                         ` Al Viro
@ 2021-08-14  9:06                           ` David Hildenbrand
  0 siblings, 0 replies; 82+ messages in thread
From: David Hildenbrand @ 2021-08-14  9:06 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds
  Cc: Andy Lutomirski, Eric W. Biederman, David Laight,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Alexey Dobriyan,
	Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	linux-fsdevel, Linux-MM, Florian Weimer, Michael Kerrisk

On 14.08.21 04:02, Al Viro wrote:
> On Sat, Aug 14, 2021 at 01:57:31AM +0000, Al Viro wrote:
>> On Fri, Aug 13, 2021 at 02:58:57PM -1000, Linus Torvalds wrote:
>>> On Fri, Aug 13, 2021 at 2:54 PM Linus Torvalds
>>> <torvalds@linux-foundation.org> wrote:
>>>>
>>>> And nobody really complained when we weakened it, so maybe removing it
>>>> entirely might be acceptable.
>>>
>>> I guess we could just try it and see... Worst comes to worst, we'll
>>> have to put it back, but at least we'd know what crazy thing still
>>> wants it..
>>
>> Umm...  I'll need to go back and look through the thread, but I'm
>> fairly sure that there used to be suckers that did replacement of
>> binary that way (try to write, count on exclusion with execve while
>> it's being written to) instead of using rename.  Install scripts
>> of weird crap and stuff like that...
> 
> ... and before anyone goes off - I certainly agree that using that
> behaviour is not a good idea and had never been one.  All I'm saying
> is that there at least used to be very random (and rarely exercised)
> bits of userland relying upon that behaviour.
> 

Removing it completely is certainly more controversial than limiting it 
to the main executable. I'm mostly happy as long as we get rid of that 
nasty per-VMA handling, because that adds real complexity at places that 
are complicated enough.

Having the remaining deny_write_access()/allow_write_access() at sane 
places now (loading a new binary, exchanging exe_file) looks certainly 
much cleaner and I still consider it a valuable, simple sanity feature 
to have around. I don't think there is any sane use case for modifying 
the main executable, and it seems to be very easy to catch.

For example, even setting aside users that rely on this behavior, in my
thinking (see the cover letter) making sure a binary does not change
while we're loading it sounds like a very good idea. I'm not saying that
allowing modifications while in the ELF parser would expose a way to
exploit the kernel, but I'm also not saying it wouldn't, because I didn't
check whether there is a way; at least we already allow it in the legacy
library loader before mapping the segments with MAP_DENYWRITE. And if we
decide to keep the behavior while loading the executable, keeping it
while exe_file is set isn't much added code/complexity IMHO.

Long story short, I'd vote for keeping it in, and if we decide to rip it
out completely, do it as a separate, more careful step.
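
To make the user-visible side of deny_write_access() concrete, here is a
small sketch (not part of the series) of how userspace observes it. The
helper name write_open_errno() is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to open @path for writing.  Returns 0 on
 * success, otherwise the errno reported by open(2).  ETXTBSY means the
 * kernel is currently denying write access because the file is in use
 * as an executable.
 */
static int write_open_errno(const char *path)
{
	int fd;

	assert(path != NULL);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return errno;
	close(fd);
	return 0;
}
```

On a kernel that keeps the protection for the main executable, a process
that has write permission on its own binary would be expected to get
ETXTBSY from write_open_errno("/proc/self/exe"); other errors (EACCES,
EROFS) can mask it depending on the environment.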

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-14  0:54                   ` Linus Torvalds
  2021-08-14  0:58                     ` Linus Torvalds
@ 2021-08-14 19:52                     ` David Laight
  2021-08-26 17:48                     ` Andy Lutomirski
  2 siblings, 0 replies; 82+ messages in thread
From: David Laight @ 2021-08-14 19:52 UTC (permalink / raw)
  To: 'Linus Torvalds', Andy Lutomirski
  Cc: Eric W. Biederman, David Hildenbrand, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Al Viro, Alexey Dobriyan, Steven Rostedt,
	Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

From: Linus Torvalds
> Sent: 14 August 2021 01:55
> 
> On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > I’ll bite.  How about we attack this in the opposite direction: remove the deny write mechanism
> entirely.
> 
> I think that would be ok, except I can see somebody relying on it.
> 
> It's broken, it's stupid, but we've done that ETXTBSY for a _loong_ time.

I think ETXTBSY predates Linux itself.
But I can't remember whether the ELF versions of SunOS or SVR4
implemented it for shared libraries.
I don't remember hitting it, so they may not have.

I'm actually surprised it is an mmap() flag rather than an open() one.
Being able to open a file and guarantee it can't be changed seems a sane idea.
And not just for programs/libraries.

By the sound of it 'immutable' is no use.
You need to be able to unlink the file - otherwise you get into the
Windows fiasco of not being able to update without 17 reboots.

FWIW MAP_COPY would only need to take one copy of the page - all the
users could share the same page (backed by a single page of swap).
Not that I'm suggesting it is a good idea at all.

I do wonder about /proc/self/exe though.
It gave the NetBSD Linux emulation a terrible problem.
Being able to open the inode of the program is fine.
The problem is what readlink() returns - it is basically stale.
If a program opens the link contents it could get anything at all.
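
The distinction can be sketched as follows (illustrative only; the helper
names are hypothetical and /proc is assumed to be mounted). Opening
/proc/self/exe reaches the running file itself, while the text returned
by readlink() is just a pathname that needs a fresh lookup.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the target text of /proc/self/exe into @buf
 * as a NUL-terminated string.  Returns the length, or -1 on error.  A
 * result of len - 1 may mean the text was truncated.  Opening the
 * returned pathname is a new lookup and may not reach the file that is
 * actually running (for example after it was unlinked or replaced).
 */
static ssize_t exe_link_text(char *buf, size_t len)
{
	ssize_t n;

	assert(buf != NULL && len > 1);
	n = readlink("/proc/self/exe", buf, len - 1);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	return n;
}

/* Hypothetical helper: open the running executable via the magic link. */
static int exe_open(void)
{
	return open("/proc/self/exe", O_RDONLY);
}
```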

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Removing Mandatory Locks
  2021-08-14  3:04                   ` Matthew Wilcox
@ 2021-08-17 16:48                     ` Eric W. Biederman
  2021-08-17 16:50                       ` David Hildenbrand
                                         ` (2 more replies)
  2021-08-18  7:51                     ` [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE Christian Brauner
  1 sibling, 3 replies; 82+ messages in thread
From: Eric W. Biederman @ 2021-08-17 16:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andy Lutomirski, Linus Torvalds, David Laight, David Hildenbrand,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
>> [0] we have mandatory locks, too. Sigh.
>
> I'd love to remove that.  Perhaps we could try persuading more of the
> distros to disable the CONFIG option first.

Yes.  The support is disabled in RHEL8.

Does anyone know the appropriate people to talk to in order to encourage
other distros to disable CONFIG_MANDATORY_FILE_LOCKING?

Either that or we can wait until the code bit-rots, but distros
disabling and removing a feature on their own is the more responsible
path.

The number of hoops that need to be jumped through to use mandatory file
locking once it is enabled, and the fact that it has never worked in
containers, make me suspect there are no more users.

Eric



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Removing Mandatory Locks
  2021-08-17 16:48                     ` Removing Mandatory Locks Eric W. Biederman
@ 2021-08-17 16:50                       ` David Hildenbrand
  2021-08-18  9:34                       ` Rodrigo Campos
  2021-08-19 18:39                       ` Jeff Layton
  2 siblings, 0 replies; 82+ messages in thread
From: David Hildenbrand @ 2021-08-17 16:50 UTC (permalink / raw)
  To: Eric W. Biederman, Matthew Wilcox
  Cc: Andy Lutomirski, Linus Torvalds, David Laight,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	linux-fsdevel, Linux-MM, Florian Weimer, Michael Kerrisk

On 17.08.21 18:48, Eric W. Biederman wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> 
>> On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
>>> [0] we have mandatory locks, too. Sigh.
>>
>> I'd love to remove that.  Perhaps we could try persuading more of the
>> distros to disable the CONFIG option first.
> 
> Yes.  The support is disabled in RHEL8.

kernel-ark also seems to not set it for Fedora and ARK

redhat/configs/common/generic/CONFIG_MANDATORY_FILE_LOCKING:# CONFIG_MANDATORY_FILE_LOCKING is not set


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-14  3:04                   ` Matthew Wilcox
  2021-08-17 16:48                     ` Removing Mandatory Locks Eric W. Biederman
@ 2021-08-18  7:51                     ` Christian Brauner
  1 sibling, 0 replies; 82+ messages in thread
From: Christian Brauner @ 2021-08-18  7:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andy Lutomirski, Linus Torvalds, Eric W. Biederman, David Laight,
	David Hildenbrand, Linux Kernel Mailing List, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Al Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Sat, Aug 14, 2021 at 04:04:09AM +0100, Matthew Wilcox wrote:
> On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
> > [0] we have mandatory locks, too. Sigh.
> 
> I'd love to remove that.  Perhaps we could try persuading more of the
> distros to disable the CONFIG option first.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1940392

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Removing Mandatory Locks
  2021-08-17 16:48                     ` Removing Mandatory Locks Eric W. Biederman
  2021-08-17 16:50                       ` David Hildenbrand
@ 2021-08-18  9:34                       ` Rodrigo Campos
  2021-08-19 19:18                         ` Jeff Layton
  2021-08-19 18:39                       ` Jeff Layton
  2 siblings, 1 reply; 82+ messages in thread
From: Rodrigo Campos @ 2021-08-18  9:34 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matthew Wilcox, Andy Lutomirski, Linus Torvalds, David Laight,
	David Hildenbrand, Linux Kernel Mailing List, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Al Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Tue, Aug 17, 2021 at 6:49 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Matthew Wilcox <willy@infradead.org> writes:
>
> > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
> >> [0] we have mandatory locks, too. Sigh.
> >
> > I'd love to remove that.  Perhaps we could try persuading more of the
> > distros to disable the CONFIG option first.
>
> Yes.  The support is disabled in RHEL8.

If it helps, it seems to be enabled on the just-released Debian stable:
    $ grep CONFIG_MANDATORY_FILE_LOCKING /boot/config-5.10.0-8-amd64
    CONFIG_MANDATORY_FILE_LOCKING=y

Also the new 5.13 kernel in experimental has it too:
    $ grep CONFIG_MANDATORY_FILE_LOCKING /boot/config-5.13.0-trunk-amd64
    CONFIG_MANDATORY_FILE_LOCKING=y

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-14  0:49                 ` Andy Lutomirski
  2021-08-14  0:54                   ` Linus Torvalds
  2021-08-14  3:04                   ` Matthew Wilcox
@ 2021-08-18 15:42                   ` J. Bruce Fields
  2021-08-19 13:56                     ` Eric W. Biederman
       [not found]                     ` <162943109106.9892.7426782042253067338@noble.neil.brown.name>
  2 siblings, 2 replies; 82+ messages in thread
From: J. Bruce Fields @ 2021-08-18 15:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Eric W. Biederman, David Laight,
	David Hildenbrand, Linux Kernel Mailing List, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Al Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
> I’ll bite.  How about we attack this in the opposite direction: remove
> the deny write mechanism entirely.

For what it's worth, Windows has open flags that allow denying read or
write opens.  They also made their way into the NFSv4 protocol, but
knfsd enforces them only against other NFSv4 clients.  Last I checked,
Samba attempted to emulate them using flock (and there's a comment to
that effect on the flock syscall in fs/locks.c).  I don't know what Wine
does.
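
As a rough illustration of what emulating a deny mode with flock() can
look like (a sketch under my own assumptions, not Samba's actual code;
the helper name open_exclusive() is hypothetical):

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical helper: open @path read-write (creating it if needed) and
 * take an exclusive, non-blocking flock() on it.  Returns the fd on
 * success.  Returns -1 with errno == EWOULDBLOCK when another open file
 * description already holds a conflicting flock().
 */
static int open_exclusive(const char *path)
{
	int fd, saved;

	assert(path != NULL);
	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}
```

The lock is advisory: it only excludes openers that also call flock(). A
plain open() by an unrelated program still succeeds, which is the gap
between this emulation and real deny modes.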

Pavel Shilovsky posted patches adding O_DENY* flags years ago:

	https://lwn.net/Articles/581005/

I keep thinking I should look back at those some day but will probably
never get to it.

I've no idea how Windows applications use them, though I'm told it's
common.

--b.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-18 15:42                   ` J. Bruce Fields
@ 2021-08-19 13:56                     ` Eric W. Biederman
  2021-08-19 14:33                       ` J. Bruce Fields
       [not found]                     ` <162943109106.9892.7426782042253067338@noble.neil.brown.name>
  1 sibling, 1 reply; 82+ messages in thread
From: Eric W. Biederman @ 2021-08-19 13:56 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Andy Lutomirski, Linus Torvalds, David Laight, David Hildenbrand,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

bfields@fieldses.org (J. Bruce Fields) writes:

> On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
>> I’ll bite.  How about we attack this in the opposite direction: remove
>> the deny write mechanism entirely.
>
> For what it's worth, Windows has open flags that allow denying read or
> write opens.  They also made their way into the NFSv4 protocol, but
> knfsd enforces them only against other NFSv4 clients.  Last I checked,
> Samba attempted to emulate them using flock (and there's a comment to
> that effect on the flock syscall in fs/locks.c).  I don't know what Wine
> does.
>
> Pavel Shilovsky posted flags adding O_DENY* flags years ago:
>
> 	https://lwn.net/Articles/581005/
>
> I keep thinking I should look back at those some day but will probably
> never get to it.
>
> I've no idea how Windows applications use them, though I'm told it's
> common.

I don't know in any detail.  I just have this memory of not being able
to open or do anything with a file on Windows while any application has
it open.

We limit mandatory locks to filesystems that have the proper mount flag
and files that are sgid but not group-executable.  Reusing that limit we
could probably allow such a behavior in Linux without causing chaos.

Without being very strict about which files can participate I can just
imagine someone hiding their presence by not allowing other applications
the ability to write to utmp or a log file.

In the Windows world where everything evolved with those kinds of
restrictions it is probably fine (although super annoying).

Eric

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-19 13:56                     ` Eric W. Biederman
@ 2021-08-19 14:33                       ` J. Bruce Fields
  2021-08-20 12:54                         ` Jeff Layton
  0 siblings, 1 reply; 82+ messages in thread
From: J. Bruce Fields @ 2021-08-19 14:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Lutomirski, Linus Torvalds, David Laight, David Hildenbrand,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Thu, Aug 19, 2021 at 08:56:52AM -0500, Eric W. Biederman wrote:
> bfields@fieldses.org (J. Bruce Fields) writes:
> 
> > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
> >> I’ll bite.  How about we attack this in the opposite direction: remove
> >> the deny write mechanism entirely.
> >
> > For what it's worth, Windows has open flags that allow denying read or
> > write opens.  They also made their way into the NFSv4 protocol, but
> > knfsd enforces them only against other NFSv4 clients.  Last I checked,
> > Samba attempted to emulate them using flock (and there's a comment to
> > that effect on the flock syscall in fs/locks.c).  I don't know what Wine
> > does.
> >
> > Pavel Shilovsky posted flags adding O_DENY* flags years ago:
> >
> > 	https://lwn.net/Articles/581005/
> >
> > I keep thinking I should look back at those some day but will probably
> > never get to it.
> >
> > I've no idea how Windows applications use them, though I'm told it's
> > common.
> 
> I don't know in any detail.  I just have this memory of not being able
> to open or do anything with a file on windows while any application has
> it open.
> 
> We limit mandatory locks to filesystems that have the proper mount flag
> and files that are sgid but are not executable.  Reusing that limit we
> could probably allow such a behavior in Linux without causing chaos.

I'm pretty confused about how we're using the term "mandatory locking".

The locks you're thinking of are basically ordinary POSIX byte-range
locks which we attempt to enforce as mandatory under certain conditions
(e.g. in fs/read_write.c:rw_verify_area).  That means we have to check
them on ordinary reads and writes, which is a pain in the butt.  (And we
don't manage to do it correctly--the code just checks for the existence
of a conflicting lock before performing IO, ignoring the obvious
time-of-check/time-of-use race.)

This has nothing to do with Windows share locks, which from what I
understand are whole-file locks that are only enforced against opens.
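
For contrast, this is roughly what an ordinary POSIX byte-range lock
looks like from userspace (a minimal sketch; the helper name lock_range()
is hypothetical). By default such a lock is advisory, so reads and writes
by code that never asks about the lock are not checked against it;
mandatory mode is what adds the per-I/O check described above.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: take a non-blocking POSIX write lock on bytes
 * [start, start + len) of @fd.  Returns 0 on success, or -1 with errno
 * set (EACCES or EAGAIN when another process holds a conflicting lock).
 */
static int lock_range(int fd, off_t start, off_t len)
{
	struct flock fl;

	assert(fd >= 0 && len > 0);
	memset(&fl, 0, sizeof(fl));
	fl.l_type = F_WRLCK;
	fl.l_whence = SEEK_SET;
	fl.l_start = start;
	fl.l_len = len;
	return fcntl(fd, F_SETLK, &fl);
}
```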

--b.

> Without being very strict about which files can participate I can just
> imagine someone hiding their presence by not allowing other applications
> the ability to write to utmp or a log file.
> 
> In the windows world where everything evolved with those kinds of
> restrictions it is probably fine (although super annoying).
> 
> Eric

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Removing Mandatory Locks
  2021-08-17 16:48                     ` Removing Mandatory Locks Eric W. Biederman
  2021-08-17 16:50                       ` David Hildenbrand
  2021-08-18  9:34                       ` Rodrigo Campos
@ 2021-08-19 18:39                       ` Jeff Layton
  2021-08-19 19:15                         ` Linus Torvalds
  2 siblings, 1 reply; 82+ messages in thread
From: Jeff Layton @ 2021-08-19 18:39 UTC (permalink / raw)
  To: Eric W. Biederman, Matthew Wilcox
  Cc: Andy Lutomirski, Linus Torvalds, David Laight, David Hildenbrand,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Tue, 2021-08-17 at 11:48 -0500, Eric W. Biederman wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> 
> > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
> > > [0] we have mandatory locks, too. Sigh.
> > 
> > I'd love to remove that.  Perhaps we could try persuading more of the
> > distros to disable the CONFIG option first.
> 
> Yes.  The support is disabled in RHEL8.
> 
> Does anyone know the appropriate people to talk to encourage other
> distro's to encourage them to disable the CONFIG_MANDATORY_FILE_LOCKING?
> 
> Either that or we can wait until the code bit-rots, but distro's
> disabling and removing a feature on their own is the more responsible
> path.
> 
> Given how many hoops need to be jumped through to use mandatory file
> locking once it is enabled, and the fact it has never worked in
> containers makes me suspect there are no more users.
> 

I'm all for ripping it out too. It's an insane interface anyway.

I've not heard a single complaint about this being turned off in
Fedora/RHEL or any other distro that has this disabled.

Cheers,
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Removing Mandatory Locks
  2021-08-19 18:39                       ` Jeff Layton
@ 2021-08-19 19:15                         ` Linus Torvalds
  2021-08-19 19:55                           ` Eric Biggers
                                             ` (2 more replies)
  0 siblings, 3 replies; 82+ messages in thread
From: Linus Torvalds @ 2021-08-19 19:15 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Eric W. Biederman, Matthew Wilcox, Andy Lutomirski, David Laight,
	David Hildenbrand, Linux Kernel Mailing List, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Al Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Thu, Aug 19, 2021 at 11:39 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> I'm all for ripping it out too. It's an insane interface anyway.
>
> I've not heard a single complaint about this being turned off in
> fedora/rhel or any other distro that has this disabled.

I'd love to remove it; we could absolutely test it. The fact that
several major distros have it disabled makes me think it's fine.

But as always, it would be good to check Android.

The desktop distros tend to have the same tools and programs, so if
Fedora and RHEL haven't needed it for years, then it's likely stale in
Debian too (despite being enabled).

But Android tends to be very different. Does anybody know?

            Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Removing Mandatory Locks
  2021-08-18  9:34                       ` Rodrigo Campos
@ 2021-08-19 19:18                         ` Jeff Layton
  2021-08-19 20:03                           ` Willy Tarreau
  0 siblings, 1 reply; 82+ messages in thread
From: Jeff Layton @ 2021-08-19 19:18 UTC (permalink / raw)
  To: Rodrigo Campos, Eric W. Biederman
  Cc: Matthew Wilcox, Andy Lutomirski, Linus Torvalds, David Laight,
	David Hildenbrand, Linux Kernel Mailing List, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Al Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Wed, 2021-08-18 at 11:34 +0200, Rodrigo Campos wrote:
> On Tue, Aug 17, 2021 at 6:49 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > 
> > Matthew Wilcox <willy@infradead.org> writes:
> > 
> > > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
> > > > [0] we have mandatory locks, too. Sigh.
> > > 
> > > I'd love to remove that.  Perhaps we could try persuading more of the
> > > distros to disable the CONFIG option first.
> > 
> > Yes.  The support is disabled in RHEL8.
> 
> If it helps, it seems to be enabled on the just released debian stable:
>     $ grep CONFIG_MANDATORY_FILE_LOCKING /boot/config-5.10.0-8-amd64
>     CONFIG_MANDATORY_FILE_LOCKING=y
> 
> Also the new 5.13 kernel in experimental has it too:
>     $ grep CONFIG_MANDATORY_FILE_LOCKING /boot/config-5.13.0-trunk-amd64
>     CONFIG_MANDATORY_FILE_LOCKING=y

A pity. It would have been nice if they had turned it off a while ago. I
guess I should have done more outreach at the time. Sigh...

In any case, I'm still inclined toward just ripping it out at this
point. It's hard to believe that anyone really uses it.
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Removing Mandatory Locks
  2021-08-19 19:15                         ` Linus Torvalds
@ 2021-08-19 19:55                           ` Eric Biggers
  2021-08-19 20:18                           ` Jeff Layton
  2021-08-20 16:30                           ` Kees Cook
  2 siblings, 0 replies; 82+ messages in thread
From: Eric Biggers @ 2021-08-19 19:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Layton, Eric W. Biederman, Matthew Wilcox, Andy Lutomirski,
	David Laight, David Hildenbrand, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Al Viro, Alexey Dobriyan, Steven Rostedt,
	Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Thu, Aug 19, 2021 at 12:15:08PM -0700, Linus Torvalds wrote:
> On Thu, Aug 19, 2021 at 11:39 AM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > I'm all for ripping it out too. It's an insane interface anyway.
> >
> > I've not heard a single complaint about this being turned off in
> > fedora/rhel or any other distro that has this disabled.
> 
> I'd love to remove it, we could absolutely test it. The fact that
> several major distros have it disabled makes me think it's fine.
> 
> But as always, it would be good to check Android.
> 
> The desktop distros tend to have the same tools and programs, so if
> Fedora and RHEL haven't needed it for years, then it's likely stale in
> Debian too (despite being enabled).
> 
> But Android tends to be very different. Does anybody know?
> 

As far as I know, Android never uses mandatory file locking.  While
CONFIG_MANDATORY_FILE_LOCKING=y is typically set (as it's "default y" upstream),
I can't find anywhere in the Android source tree that uses the "mand" mount
option, let alone anything actually using mandatory locks.  (I'm assuming that
Documentation/filesystems/mandatory-locking.rst is up-to-date regarding what
userspace actually has to do to use it.)

- Eric

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Removing Mandatory Locks
  2021-08-19 19:18                         ` Jeff Layton
@ 2021-08-19 20:03                           ` Willy Tarreau
  0 siblings, 0 replies; 82+ messages in thread
From: Willy Tarreau @ 2021-08-19 20:03 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Rodrigo Campos, Eric W. Biederman, Matthew Wilcox,
	Andy Lutomirski, Linus Torvalds, David Laight, David Hildenbrand,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Thu, Aug 19, 2021 at 03:18:15PM -0400, Jeff Layton wrote:
> On Wed, 2021-08-18 at 11:34 +0200, Rodrigo Campos wrote:
> > On Tue, Aug 17, 2021 at 6:49 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> > > 
> > > Matthew Wilcox <willy@infradead.org> writes:
> > > 
> > > > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
> > > > > [0] we have mandatory locks, too. Sigh.
> > > > 
> > > > I'd love to remove that.  Perhaps we could try persuading more of the
> > > > distros to disable the CONFIG option first.
> > > 
> > > Yes.  The support is disabled in RHEL8.
> > 
> > If it helps, it seems to be enabled on the just released debian stable:
> >     $ grep CONFIG_MANDATORY_FILE_LOCKING /boot/config-5.10.0-8-amd64
> >     CONFIG_MANDATORY_FILE_LOCKING=y
> > 
> > Also the new 5.13 kernel in experimental has it too:
> >     $ grep CONFIG_MANDATORY_FILE_LOCKING /boot/config-5.13.0-trunk-amd64
> >     CONFIG_MANDATORY_FILE_LOCKING=y
> 
> A pity. It would have been nice if they had turned it off a while ago. I
> guess I should have done more outreach at the time. Sigh...

Would it be acceptable to add a warning when MS_MANDLOCK is passed to
mount() and backport this to stable kernels in order to get reports
of any such use in a reasonably short time? Anyway it sounds
important to at least warn about deprecation.

Willy

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Removing Mandatory Locks
  2021-08-19 19:15                         ` Linus Torvalds
  2021-08-19 19:55                           ` Eric Biggers
@ 2021-08-19 20:18                           ` Jeff Layton
  2021-08-19 20:31                             ` Linus Torvalds
  2021-08-20 16:30                           ` Kees Cook
  2 siblings, 1 reply; 82+ messages in thread
From: Jeff Layton @ 2021-08-19 20:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Matthew Wilcox, Andy Lutomirski, David Laight,
	David Hildenbrand, Linux Kernel Mailing List, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Al Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Thu, 2021-08-19 at 12:15 -0700, Linus Torvalds wrote:
> On Thu, Aug 19, 2021 at 11:39 AM Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > I'm all for ripping it out too. It's an insane interface anyway.
> > 
> > I've not heard a single complaint about this being turned off in
> > fedora/rhel or any other distro that has this disabled.
> 
> I'd love to remove it, we could absolutely test it. The fact that
> several major distros have it disabled makes me think it's fine.
> 
> But as always, it would be good to check Android.
> 
> The desktop distros tend to have the same tools and programs, so if
> Fedora and RHEL haven't needed it for years, then it's likely stale in
> Debian too (despite being enabled).
> 
> But Android tends to be very different. Does anybody know?
> 

Now that I think about it a little more, I actually did get one
complaint a few years ago:

Someone had upgraded from an earlier distro that supported the -o mand
mount option to a later one that had disabled it, and they had an (old)
fstab entry that specified it. They didn't actually use mandatory
locking and weren't sure why the option was set, so they removed it and
moved on.

I would feel a lot better about it if we had gotten Debian to turn it
off several years ago too, but I agree it's unlikely anyone uses this
and the risk of removing it is low.

I've spun up a patch to just rip it out. I'll do a bit of testing with
it tomorrow and then send it out.

Cheers,
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Removing Mandatory Locks
  2021-08-19 20:18                           ` Jeff Layton
@ 2021-08-19 20:31                             ` Linus Torvalds
  2021-08-19 21:43                               ` Jeff Layton
                                                 ` (2 more replies)
  0 siblings, 3 replies; 82+ messages in thread
From: Linus Torvalds @ 2021-08-19 20:31 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Eric W. Biederman, Matthew Wilcox, Andy Lutomirski, David Laight,
	David Hildenbrand, Linux Kernel Mailing List, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Al Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Thu, Aug 19, 2021 at 1:18 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> Now that I think about it a little more, I actually did get one
> complaint a few years ago:
>
> Someone had upgraded from an earlier distro that supported the -o mand
> mount option to a later one that had disabled it, and they had an (old)
> fstab entry that specified it.

Hmm. We might be able to turn the "return -EINVAL" into just a warning.

Yes, yes, currently if you turn off CONFIG_MANDATORY_FILE_LOCKING, we
already do that

        VFS: "mand" mount option not supported

warning print, but then we fail the mount.

If CONFIG_MANDATORY_FILE_LOCKING goes away entirely, it might make
sense to turn that warning into something bigger, but then let the
mount continue - since now that "mand" flag would be purely a legacy
thing.

And yes, if we do that, we'd want the warning to be a big ugly thing,
just to make people very aware of it happening. Right now it's a
one-liner that is easy to miss, and the "oh, the mount failed" is the
thing that hopefully informs people about the fact that they need to
enable CONFIG_MANDATORY_FILE_LOCKING.

The logic being that if you can no longer enable mandatory locking in
the kernel, the current hard failure seems overly aggressive (and
might cause boot failures and inability to fix/report things when it
possibly keeps you from using the system at all).
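
As a rough userspace sketch of the "warn, then let the mount continue"
idea above (not kernel code): the helper name sketch_check_mand() is
hypothetical, and SKETCH_MS_MANDLOCK is an assumption that mirrors the
value of MS_MANDLOCK. The warning text is illustrative only.

```c
#include <stdio.h>

/* Assumption: mirrors the value of MS_MANDLOCK in the Linux UAPI headers. */
#define SKETCH_MS_MANDLOCK 64UL

/*
 * Hypothetical helper, not a kernel function. If the legacy "mand" flag
 * is set, print a warning and clear the flag, then report success so the
 * mount can proceed instead of failing with -EINVAL.
 */
static int sketch_check_mand(unsigned long *flags)
{
	if (*flags & SKETCH_MS_MANDLOCK) {
		fprintf(stderr,
			"VFS: \"mand\" mount option is no longer supported, ignoring it\n");
		*flags &= ~SKETCH_MS_MANDLOCK;
	}
	return 0;
}
```

The flag is cleared rather than passed along, so nothing later in the
mount path would see a request that can no longer be honoured.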

              Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Removing Mandatory Locks
  2021-08-19 20:31                             ` Linus Torvalds
@ 2021-08-19 21:43                               ` Jeff Layton
  2021-08-19 22:32                                 ` Linus Torvalds
  2021-08-20  2:10                               ` Matthew Wilcox
  2021-08-20  6:36                               ` Amir Goldstein
  2 siblings, 1 reply; 82+ messages in thread
From: Jeff Layton @ 2021-08-19 21:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, Matthew Wilcox, Andy Lutomirski, David Laight,
	David Hildenbrand, Linux Kernel Mailing List, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Al Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Thu, 2021-08-19 at 13:31 -0700, Linus Torvalds wrote:
> On Thu, Aug 19, 2021 at 1:18 PM Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > Now that I think about it a little more, I actually did get one
> > complaint a few years ago:
> > 
> > Someone had upgraded from an earlier distro that supported the -o mand
> > mount option to a later one that had disabled it, and they had an (old)
> > fstab entry that specified it.
> 
> Hmm. We might be able to turn the "return -EINVAL" into just a warning.
> 
> Yes, yes, currently if you turn off CONFIG_MANDATORY_FILE_LOCKING, we
> already do that
> 
>         VFS: "mand" mount option not supported
> 
> warning print, but then we fail the mount.
> 
> If CONFIG_MANDATORY_FILE_LOCKING goes away entirely, it might make
> sense to turn that warning into something bigger, but then let the
> mount continue - since now that "mand" flag would be purely a legacy
> thing.
> 
> And yes, if we do that, we'd want the warning to be a big ugly thing,
> just to make people very aware of it happening. Right now it's a
> one-liner that is easy to miss, and the "oh, the mount failed" is the
> thing that hopefully informs people about the fact that they need to
> enable CONFIG_MANDATORY_FILE_LOCKING.
> 
> The logic being that if you can no longer enable mandatory locking in
> the kernel, the current hard failure seems overly aggressive (and
> might cause boot failures and inability to fix/report things when it
> possibly keeps you from using the system at all).
> 

What sort of big, ugly warning did you have in mind?

I'm fine with that general approach though and will plan to roll that
change into the patch I'm testing.

Thanks,
-- 
Jeff Layton <jlayton@kernel.org>


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Removing Mandatory Locks
  2021-08-19 21:43                               ` Jeff Layton
@ 2021-08-19 22:32                                 ` Linus Torvalds
  2021-08-20  8:30                                   ` David Laight
  2021-08-20 13:43                                   ` Steven Rostedt
  0 siblings, 2 replies; 82+ messages in thread
From: Linus Torvalds @ 2021-08-19 22:32 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Eric W. Biederman, Matthew Wilcox, Andy Lutomirski, David Laight,
	David Hildenbrand, Linux Kernel Mailing List, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Al Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Thu, Aug 19, 2021 at 2:43 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> What sort of big, ugly warning did you have in mind?

I originally thought WARN_ON_ONCE() just to get the distro automatic
error handling involved, but it would probably be a big problem for
the people who end up having panic-on-warn or something.

So probably just a "make it a big box" thing that stands out, kind of
what lockdep etc does with

        pr_warn("======...====\n");

around the messages..
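
A minimal userspace sketch of such a boxed message, assuming a
hypothetical helper sketch_boxed_warning() that formats into a caller
buffer; in the kernel the three lines would presumably be separate
pr_warn() calls, and the rule width here is arbitrary.

```c
#include <stdio.h>

/*
 * Hypothetical helper: writes "rule / message / rule" into buf.
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int sketch_boxed_warning(char *buf, size_t len, const char *msg)
{
	static const char rule[] =
		"======================================================";
	int n = snprintf(buf, len, "%s\n%s\n%s\n", rule, msg, rule);

	/* snprintf() returns the untruncated length, so n >= len means truncation. */
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```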

I don't know if distros have some pattern we could use that would end
up being something that gets reported to the user?

              Linus

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Removing Mandatory Locks
  2021-08-19 20:31                             ` Linus Torvalds
  2021-08-19 21:43                               ` Jeff Layton
@ 2021-08-20  2:10                               ` Matthew Wilcox
  2021-08-20  6:36                               ` Amir Goldstein
  2 siblings, 0 replies; 82+ messages in thread
From: Matthew Wilcox @ 2021-08-20  2:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Layton, Eric W. Biederman, Andy Lutomirski, David Laight,
	David Hildenbrand, Linux Kernel Mailing List, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Al Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Thu, Aug 19, 2021 at 01:31:35PM -0700, Linus Torvalds wrote:
> Yes, yes, currently if you turn off CONFIG_MANDATORY_FILE_LOCKING, we
> already do that
> 
>         VFS: "mand" mount option not supported
> 
> warning print, but then we fail the mount.
> 
> If CONFIG_MANDATORY_FILE_LOCKING goes away entirely, it might make
> sense to turn that warning into something bigger, but then let the
> mount continue - since now that "mand" flag would be purely a legacy
> thing.
> 
> And yes, if we do that, we'd want the warning to be a big ugly thing,
> just to make people very aware of it happening. Right now it's a
> one-liner that is easy to miss, and the "oh, the mount failed" is the
> thing that hopefully informs people about the fact that they need to
> enable CONFIG_MANDATORY_FILE_LOCKING.

When I ripped out the NFS "intr" mount option fourteen years ago,
I just turned it into a noop (commit 150030b78a45).  I have been greatly
amused by every article written since then that recommends using it.
It just shows how much tribal knowledge we have.

I think this is a little different, though; I was essentially making the
*wanted* behaviour of 'intr' the default (and disabling the unwanted
behaviour).  With 'mand', we're losing the behaviour entirely, and it's
plausible that someone might care.  Maybe something more like the old
sys_bdflush implementation?

        if (msg_count < 5) {
                msg_count++;
                printk(KERN_INFO
                        "warning: process `%s' used the obsolete bdflush"
                        " system call\n", current->comm);
                printk(KERN_INFO "Fix your initscripts?\n");
        }


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Removing Mandatory Locks
  2021-08-19 20:31                             ` Linus Torvalds
  2021-08-19 21:43                               ` Jeff Layton
  2021-08-20  2:10                               ` Matthew Wilcox
@ 2021-08-20  6:36                               ` Amir Goldstein
  2021-08-20  7:14                                 ` Amir Goldstein
  2 siblings, 1 reply; 82+ messages in thread
From: Amir Goldstein @ 2021-08-20  6:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Layton, Eric W. Biederman, Matthew Wilcox, Andy Lutomirski,
	David Laight, David Hildenbrand, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Al Viro, Alexey Dobriyan, Steven Rostedt,
	Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Thu, Aug 19, 2021 at 11:32 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Aug 19, 2021 at 1:18 PM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > Now that I think about it a little more, I actually did get one
> > complaint a few years ago:
> >
> > Someone had upgraded from an earlier distro that supported the -o mand
> > mount option to a later one that had disabled it, and they had an (old)
> > fstab entry that specified it.
>
> Hmm. We might be able to turn the "return -EINVAL" into just a warning.
>
> Yes, yes, currently if you turn off CONFIG_MANDATORY_FILE_LOCKING, we
> already do that
>
>         VFS: "mand" mount option not supported
>
> warning print, but then we fail the mount.
>
> If CONFIG_MANDATORY_FILE_LOCKING goes away entirely, it might make
> sense to turn that warning into something bigger, but then let the
> mount continue - since now that "mand" flag would be purely a legacy
> thing.
>
> And yes, if we do that, we'd want the warning to be a big ugly thing,
> just to make people very aware of it happening. Right now it's a
> one-liner that is easy to miss, and the "oh, the mount failed" is the
> thing that hopefully informs people about the fact that they need to
> enable CONFIG_MANDATORY_FILE_LOCKING.
>
> The logic being that if you can no longer enable mandatory locking in
> the kernel, the current hard failure seems overly aggressive (and
> might cause boot failures and inability to fix/report things when it
> possibly keeps you from using the system at all).
>

Allow me to play the devil's advocate here - if fstab has '-o mand' we have
no way of knowing if any application is relying on '-o mand' and adding
more !!!!! to the warning is mostly good for clearing our conscience ;-)

Not saying we cannot resort to that and not saying there is an easy
solution, but there is one more solution to consider - force rdonly mount.
Yes, it could break some systems and possibly fail boot, but then again
an ext4 fs can already become rdonly due to errors, so it wouldn't
be the first time that sysadmins/users run into this behavior.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: Removing Mandatory Locks
  2021-08-20  6:36                               ` Amir Goldstein
@ 2021-08-20  7:14                                 ` Amir Goldstein
  2021-08-20 12:27                                   ` Jeff Layton
  0 siblings, 1 reply; 82+ messages in thread
From: Amir Goldstein @ 2021-08-20  7:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Layton, Eric W. Biederman, Matthew Wilcox, Andy Lutomirski,
	David Laight, David Hildenbrand, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Al Viro, Alexey Dobriyan, Steven Rostedt,
	Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Fri, Aug 20, 2021 at 9:36 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Thu, Aug 19, 2021 at 11:32 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, Aug 19, 2021 at 1:18 PM Jeff Layton <jlayton@kernel.org> wrote:
> > >
> > > Now that I think about it a little more, I actually did get one
> > > complaint a few years ago:
> > >
> > > Someone had upgraded from an earlier distro that supported the -o mand
> > > mount option to a later one that had disabled it, and they had an (old)
> > > fstab entry that specified it.
> >
> > Hmm. We might be able to turn the "return -EINVAL" into just a warning.
> >
> > Yes, yes, currently if you turn off CONFIG_MANDATORY_FILE_LOCKING, we
> > already do that
> >
> >         VFS: "mand" mount option not supported
> >
> > warning print, but then we fail the mount.
> >
> > If CONFIG_MANDATORY_FILE_LOCKING goes away entirely, it might make
> > sense to turn that warning into something bigger, but then let the
> > mount continue - since now that "mand" flag would be purely a legacy
> > thing.
> >
> > And yes, if we do that, we'd want the warning to be a big ugly thing,
> > just to make people very aware of it happening. Right now it's a
> > one-liner that is easy to miss, and the "oh, the mount failed" is the
> > thing that hopefully informs people about the fact that they need to
> > enable CONFIG_MANDATORY_FILE_LOCKING.
> >
> > The logic being that if you can no longer enable mandatory locking in
> > the kernel, the current hard failure seems overly aggressive (and
> > might cause boot failures and inability to fix/report things when it
> > possibly keeps you from using the system at all).
> >
>
> Allow me to play the devil's advocate here - if fstab has '-o mand' we have
> no way of knowing if any application is relying on '-o mand' and adding
> more !!!!! to the warning is mostly good for clearing our conscience ;-)
>
> Not saying we cannot resort to that and not saying there is an easy
> solution, but there is one more solution to consider - force rdonly mount.
> Yes, it could break some systems and possibly fail boot, but then again
> an ext4 fs can already become rdonly due to errors, so it wouldn't
> be the first time that sysadmins/users run into this behavior.
>

Adding an anecdote - this week I got a report from field support
engineers about failure to assemble a RAID0 array, which led to this
warning that *requires* user intervention; in the worst case, for a boot
device, it requires changing kernel boot params:

md/raid0:%s: cannot assemble multi-zone RAID0 with default_layout setting
md/raid0: please set raid.default_layout to 1 or 2

c84a1372df92 md/raid0: avoid RAID0 data corruption due to layout confusion.

There is no way I would have gotten this report from the field if a failure
was not involved...

The rdonly mount is only needed to get the attention of support people
to look at the kernel logs and find the warning - at this point, not too
many !!!!! are needed ;-)

So we could make 'mand' an alias to 'ro' and print a warning that says:
"'mand' mount option is deprecated, please fix your init scripts.
For caution, your filesystem was mounted rdonly, feel free to remount
rw and move on..."
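
The flag handling for that alias could look roughly like the userspace
sketch below. sketch_mand_to_ro() is a hypothetical name, and the two
SKETCH_ constants are assumptions that mirror the values of MS_RDONLY
and MS_MANDLOCK; the real change would live in the mount path.

```c
#include <stdio.h>

/* Assumptions: these mirror MS_RDONLY and MS_MANDLOCK from the UAPI headers. */
#define SKETCH_MS_RDONLY   1UL
#define SKETCH_MS_MANDLOCK 64UL

/*
 * Hypothetical helper: treat "mand" as an alias for "ro". Flags without
 * the mand bit are returned unchanged; otherwise the bit is dropped,
 * the read-only bit is set, and a deprecation warning is printed.
 */
static unsigned long sketch_mand_to_ro(unsigned long flags)
{
	if (!(flags & SKETCH_MS_MANDLOCK))
		return flags;

	fprintf(stderr,
		"'mand' mount option is deprecated, mounting read-only\n");
	return (flags & ~SKETCH_MS_MANDLOCK) | SKETCH_MS_RDONLY;
}
```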

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
       [not found]                     ` <162943109106.9892.7426782042253067338@noble.neil.brown.name>
@ 2021-08-20  8:25                       ` David Laight
  0 siblings, 0 replies; 82+ messages in thread
From: David Laight @ 2021-08-20  8:25 UTC (permalink / raw)
  To: 'NeilBrown', J. Bruce Fields
  Cc: Andy Lutomirski, Linus Torvalds, Eric W. Biederman,
	David Hildenbrand, Linux Kernel Mailing List, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Al Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

From: NeilBrown
> Sent: 20 August 2021 04:45
...
> O_DENYREAD is an insane flag.  If a process reads a file that some other
> process is working on, then the only one which could be hurt is the reader.
> So allowing a process to ask for the open to fail if someone is writing
> might make sense.  Insisting that all opens fail does not.
> Any code wanting O_DENYREAD *should* use advisory locking, and any code
> wanting to know about read denial should too.

It might make sense if O_DENYREAD | O_DENYWRITE | O_RDWR are all set.
That would be what O_EXCL ought to mean for a normal file.
So it would be useful for a program that wants to update a config file.

...
> It would be nice to be able to combine O_DENYWRITE with O_RDWR.  This
> combination is exactly what the kernel *should* do for swap files.

I suspect that is a common usage - eg for updating a file that contains
a log file sequence number.

...
> I'm not sure about O_DENYDELETE.  It is a lock on the name.  Unix has
> traditionally used lock-files to lock a name.  The functionality makes
> sense for processes with write-access to the directory...

I'm not sure it makes any sense on filesystems that use inode numbers.
Which name would you protect, and how would you manage to do the test?
On windows O_DENYDELETE is pretty much the default.
Which is why software updates are such a PITA.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: Removing Mandatory Locks
  2021-08-19 22:32                                 ` Linus Torvalds
@ 2021-08-20  8:30                                   ` David Laight
  2021-08-23  7:55                                     ` Geert Uytterhoeven
  2021-08-20 13:43                                   ` Steven Rostedt
  1 sibling, 1 reply; 82+ messages in thread
From: David Laight @ 2021-08-20  8:30 UTC (permalink / raw)
  To: 'Linus Torvalds', Jeff Layton
  Cc: Eric W. Biederman, Matthew Wilcox, Andy Lutomirski,
	David Hildenbrand, Linux Kernel Mailing List, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Al Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

From: Linus Torvalds
> Sent: 19 August 2021 23:33
> 
> On Thu, Aug 19, 2021 at 2:43 PM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > What sort of big, ugly warning did you have in mind?
> 
> I originally thought WARN_ON_ONCE() just to get the distro automatic
> error handling involved, but it would probably be a big problem for
> the people who end up having panic-on-warn or something.

Even panic-on-oops is a PITA.
Took us weeks to realise that a customer system that was randomly
rebooting was 'just' having a boring NULL pointer access.
 
> So probably just a "make it a big box" thing that stands out, kind of
> what lockdep etc does with
> 
>         pr_warn("======...====\n");
> 
> around the messages..
> 
> I don't know if distros have some pattern we could use that would end
> up being something that gets reported to the user?

Will users even see it?
A lot of recent distro installs try very hard to hide all the kernel
messages.
OTOH I guess '-o mand' is unlikely to be set on any of those systems.

	David



* Re: Removing Mandatory Locks
  2021-08-20  7:14                                 ` Amir Goldstein
@ 2021-08-20 12:27                                   ` Jeff Layton
  2021-08-20 12:38                                     ` Willy Tarreau
  0 siblings, 1 reply; 82+ messages in thread
From: Jeff Layton @ 2021-08-20 12:27 UTC (permalink / raw)
  To: Amir Goldstein, Linus Torvalds
  Cc: Eric W. Biederman, Matthew Wilcox, Andy Lutomirski, David Laight,
	David Hildenbrand, Linux Kernel Mailing List, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Al Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk, Willy Tarreau

On Fri, 2021-08-20 at 10:14 +0300, Amir Goldstein wrote:
> On Fri, Aug 20, 2021 at 9:36 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > 
> > On Thu, Aug 19, 2021 at 11:32 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > > 
> > > On Thu, Aug 19, 2021 at 1:18 PM Jeff Layton <jlayton@kernel.org> wrote:
> > > > 
> > > > Now that I think about it a little more, I actually did get one
> > > > complaint a few years ago:
> > > > 
> > > > Someone had upgraded from an earlier distro that supported the -o mand
> > > > mount option to a later one that had disabled it, and they had an (old)
> > > > fstab entry that specified it.
> > > 
> > > Hmm. We might be able to turn the "return -EINVAL" into just a warning.
> > > 
> > > Yes, yes, currently if you turn off CONFIG_MANDATORY_FILE_LOCKING, we
> > > already do that
> > > 
> > >         VFS: "mand" mount option not supported
> > > 
> > > warning print, but then we fail the mount.
> > > 
> > > If CONFIG_MANDATORY_FILE_LOCKING goes away entirely, it might make
> > > sense to turn that warning into something bigger, but then let the
> > > mount continue - since now that "mand" flag would be purely a legacy
> > > thing.
> > > 
> > > And yes, if we do that, we'd want the warning to be a big ugly thing,
> > > just to make people very aware of it happening. Right now it's a
> > > one-liner that is easy to miss, and the "oh, the mount failed" is the
> > > thing that hopefully informs people about the fact that they need to
> > > enable CONFIG_MANDATORY_FILE_LOCKING.
> > > 
> > > The logic being that if you can no longer enable mandatory locking in
> > > the kernel, the current hard failure seems overly aggressive (and
> > > might cause boot failures and inability to fix/report things when it
> > > possibly keeps you from using the system at all).
> > > 
> > 
> > Allow me to play the devil's advocate here - if fstab has '-o mand' we have
> > no way of knowing if any application is relying on '-o mand' and adding
> > more !!!!! to the warning is mostly good for clearing our conscience ;-)
> > 
> > Not saying we cannot resort to that and not saying there is an easy
> > solution, but there is one more solution to consider - force rdonly mount.
> > Yes, it could break some systems and possibly fail boot, but then again
> > an ext4 fs can already become rdonly due to errors, so it wouldn't
> > be the first time that sysadmins/users run into this behavior.
> > 
> 
> Adding an anecdote - this week I got a report from field support
> engineers about failure to assemble a RAID0 array, which led to this
> warning that *requires* user intervention, in the worst case for boot
> device it requires changing kernel boot params:
> 
> md/raid0:%s: cannot assemble multi-zone RAID0 with default_layout setting
> md/raid0: please set raid.default_layout to 1 or 2
> 
> c84a1372df92 md/raid0: avoid RAID0 data corruption due to layout confusion.
> 
> There is no way I would have gotten this report from the field if a failure
> was not involved...
> 
> The rdonly mount is only needed to get the attention of support people
> to look at the kernel logs and find the warning - at this point, not too
> many !!!!! are needed ;-)
> 
> So we could make 'mand' an alias to 'ro' and print a warning that says:
> "'mand' mount option is deprecated, please fix your init scripts.
> For caution, your filesystem was mounted rdonly, feel free to remount
> rw and move on..."

That is a possibility, but I'm not sure it's any better than just
failing the mount. We could also just keep the code around and throw a
big, scary warning about its impending removal for a few releases before
ripping it out completely (like Willy T. was suggesting).
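
To make the options concrete, here is a rough userspace sketch of the three
ways a legacy "mand" option could be handled: fail the mount, warn and carry
on, or warn and force read-only. Everything in it (demo_parse_opts, the
mand_policy enum, DEMO_MS_RDONLY, the warning text) is a made-up name for
illustration and is not the kernel's mount option parser.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical policies for a legacy "mand" mount option. */
enum mand_policy { MAND_FAIL, MAND_WARN_ONLY, MAND_FORCE_RO };

#define DEMO_MS_RDONLY 0x1u	/* illustrative flag, not the kernel's MS_RDONLY */

/*
 * Walk a comma-separated option string.  Returns 0 on success, or -EINVAL
 * when the policy rejects "mand".  *flags receives the demo flags.
 */
static int demo_parse_opts(const char *opts, enum mand_policy policy,
			   unsigned int *flags)
{
	const char *p = opts;

	*flags = 0;
	while (*p) {
		size_t len = strcspn(p, ",");

		if (len == 2 && strncmp(p, "ro", 2) == 0) {
			*flags |= DEMO_MS_RDONLY;
		} else if (len == 4 && strncmp(p, "mand", 4) == 0) {
			/* Every policy warns; they differ in what happens next. */
			fprintf(stderr, "\"mand\" mount option is deprecated\n");
			if (policy == MAND_FAIL)
				return -EINVAL;
			if (policy == MAND_FORCE_RO)
				*flags |= DEMO_MS_RDONLY;
		}
		p += len;
		if (*p == ',')
			p++;
	}
	return 0;
}
```

With "rw,mand" as input, MAND_FAIL returns -EINVAL, MAND_WARN_ONLY returns 0
with no flags set, and MAND_FORCE_RO returns 0 with the read-only flag set.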

I'm fine with any of these approaches if the consensus is that it's too
risky to just remove it. OTOH, I've yet to ever hear of any application
that uses this feature, even in a historical sense. You have to jump
through so many hoops that nothing can rely on having it available.
-- 
Jeff Layton <jlayton@kernel.org>



* Re: Removing Mandatory Locks
  2021-08-20 12:27                                   ` Jeff Layton
@ 2021-08-20 12:38                                     ` Willy Tarreau
  2021-08-20 13:03                                       ` Jeff Layton
  0 siblings, 1 reply; 82+ messages in thread
From: Willy Tarreau @ 2021-08-20 12:38 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Amir Goldstein, Linus Torvalds, Eric W. Biederman,
	Matthew Wilcox, Andy Lutomirski, David Laight, David Hildenbrand,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Fri, Aug 20, 2021 at 08:27:12AM -0400, Jeff Layton wrote:
> I'm fine with any of these approaches if the consensus is that it's too
> risky to just remove it. OTOH, I've yet to ever hear of any application
> that uses this feature, even in a historical sense.

Honestly, I agree. Some make fun of me because I'm often using old
stuff, but I don't even remember having used an application that
made use of mandatory locking. I remember having enabled it myself in
my kernels long ago after discovering its existence in the man pages,
just to test it. It doesn't rule out the possibility that it exists
somewhere though, but I think that the immediate removal combined
with the big fat warning in previous branches should be largely
enough to avoid the last minute surprise.

Willy


* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-19 14:33                       ` J. Bruce Fields
@ 2021-08-20 12:54                         ` Jeff Layton
  0 siblings, 0 replies; 82+ messages in thread
From: Jeff Layton @ 2021-08-20 12:54 UTC (permalink / raw)
  To: J. Bruce Fields, Eric W. Biederman
  Cc: Andy Lutomirski, Linus Torvalds, David Laight, David Hildenbrand,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Thu, 2021-08-19 at 10:33 -0400, J. Bruce Fields wrote:
> On Thu, Aug 19, 2021 at 08:56:52AM -0500, Eric W. Biederman wrote:
> > bfields@fieldses.org (J. Bruce Fields) writes:
> > 
> > > On Fri, Aug 13, 2021 at 05:49:19PM -0700, Andy Lutomirski wrote:
> > > > I’ll bite.  How about we attack this in the opposite direction: remove
> > > > the deny write mechanism entirely.
> > > 
> > > For what it's worth, Windows has open flags that allow denying read or
> > > write opens.  They also made their way into the NFSv4 protocol, but
> > > knfsd enforces them only against other NFSv4 clients.  Last I checked,
> > > Samba attempted to emulate them using flock (and there's a comment to
> > > that effect on the flock syscall in fs/locks.c).  I don't know what Wine
> > > does.
> > > 
> > > Pavel Shilovsky posted patches adding O_DENY* flags years ago:
> > > 
> > > 	https://lwn.net/Articles/581005/
> > > 
> > > I keep thinking I should look back at those some day but will probably
> > > never get to it.
> > > 
> > > I've no idea how Windows applications use them, though I'm told it's
> > > common.
> > 
> > I don't know in any detail.  I just have this memory of not being able
> > to open or do anything with a file on windows while any application has
> > it open.
> > 
> > We limit mandatory locks to filesystems that have the proper mount flag
> > and files that are sgid but are not executable.  Reusing that limit we
> > could probably allow such a behavior in Linux without causing chaos.
> 
> I'm pretty confused about how we're using the term "mandatory locking".
> 
> The locks you're thinking of are basically ordinary posix byte-range
> locks which we attempt to enforce as mandatory under certain conditions
> (e.g. in fs/read_write.c:rw_verify_area).  That means we have to check
> them on ordinary reads and writes, which is a pain in the butt.  (And we
> don't manage to do it correctly--the code just checks for the existence
> of a conflicting lock before performing IO, ignoring the obvious
> time-of-check/time-of-use race.)
> 

Yeah, the locks we're talking about are the locks described in:

    Documentation/filesystems/mandatory-locking.rst

They've always been racy. You have to mount the fs with '-o mand' and
set a special mode on the file (setgid bit set, with group execute bit
cleared). It's a crazypants interface.

> This has nothing to do with Windows share locks which from what I
> understand are whole-file locks that are only enforced against opens.
> 

Yep. Those are different.

Confusingly, there is also LOCK_MAND|LOCK_READ|LOCK_WRITE for flock(),
which are purported to be for emulating Windows share modes. They aren't
really mandatory though.

> --b.
> 
> > Without being very strict about which files can participate I can just
> > imagine someone hiding their presence by not allowing other applications
> > the ability to write to utmp or a log file.
> > 
> > In the windows world where everything evolved with those kinds of
> > restrictions it is probably fine (although super annoying).
> > 
> > Eric

-- 
Jeff Layton <jlayton@kernel.org>



* Re: Removing Mandatory Locks
  2021-08-20 12:38                                     ` Willy Tarreau
@ 2021-08-20 13:03                                       ` Jeff Layton
  2021-08-20 13:11                                         ` Willy Tarreau
  0 siblings, 1 reply; 82+ messages in thread
From: Jeff Layton @ 2021-08-20 13:03 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Amir Goldstein, Linus Torvalds, Eric W. Biederman,
	Matthew Wilcox, Andy Lutomirski, David Laight, David Hildenbrand,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Fri, 2021-08-20 at 14:38 +0200, Willy Tarreau wrote:
> On Fri, Aug 20, 2021 at 08:27:12AM -0400, Jeff Layton wrote:
> > I'm fine with any of these approaches if the consensus is that it's too
> > risky to just remove it. OTOH, I've yet to ever hear of any application
> > that uses this feature, even in a historical sense.
> 
> Honestly, I agree. Some make fun of me because I'm often using old
> stuff, but I don't even remember having used an application that
> made use of mandatory locking. I remember having enabled it myself in
> my kernels long ago after discovering its existence in the man pages,
> just to test it. It doesn't rule out the possibility that it exists
> somewhere though, but I think that the immediate removal combined
> with the big fat warning in previous branches should be largely
> enough to avoid the last minute surprise.
> 

Good point. It wouldn't hurt to push such a warning into stable kernels
at the same time. There always is a lag when we do something like this
before some downstream user notices.
-- 
Jeff Layton <jlayton@kernel.org>



* Re: Removing Mandatory Locks
  2021-08-20 13:03                                       ` Jeff Layton
@ 2021-08-20 13:11                                         ` Willy Tarreau
  0 siblings, 0 replies; 82+ messages in thread
From: Willy Tarreau @ 2021-08-20 13:11 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Amir Goldstein, Linus Torvalds, Eric W. Biederman,
	Matthew Wilcox, Andy Lutomirski, David Laight, David Hildenbrand,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Fri, Aug 20, 2021 at 09:03:16AM -0400, Jeff Layton wrote:
> Good point. It wouldn't hurt to push such a warning into stable kernels
> at the same time. There always is a lag when we do something like this
> before some downstream user notices.

Yes, that's why I proposed this. A warning can be backported into
stable without big consequences except warning future victims...

Willy


* Re: Removing Mandatory Locks
  2021-08-19 22:32                                 ` Linus Torvalds
  2021-08-20  8:30                                   ` David Laight
@ 2021-08-20 13:43                                   ` Steven Rostedt
  2021-08-20 16:06                                     ` Linus Torvalds
  1 sibling, 1 reply; 82+ messages in thread
From: Steven Rostedt @ 2021-08-20 13:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Layton, Eric W. Biederman, Matthew Wilcox, Andy Lutomirski,
	David Laight, David Hildenbrand, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Al Viro, Alexey Dobriyan, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Thu, 19 Aug 2021 15:32:31 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I originally thought WARN_ON_ONCE() just to get the distro automatic
> error handling involved, but it would probably be a big problem for
> the people who end up having panic-on-warn or something.
> 
> So probably just a "make it a big box" thing that stands out, kind of
> what lockdep etc does with
> 
>         pr_warn("======...====\n");
> 
> around the messages..
> 
> I don't know if distros have some pattern we could use that would end
> up being something that gets reported to the user?

People have started using my trace-printk notice message, that seems to
be big enough to get noticed.


 **********************************************************
 **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
 **                                                      **
 ** trace_printk() being used. Allocating extra memory.  **
 **                                                      **
 ** This means that this is a DEBUG kernel and it is     **
 ** unsafe for production use.                           **
 **                                                      **
 ** If you see this message and you are not debugging    **
 ** the kernel, report this immediately to your vendor!  **
 **                                                      **
 **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
 **********************************************************

There's been some talk about making that a more "generic" warning
message too.
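
As a rough idea of what a reusable version could look like, here is a small
userspace sketch that frames arbitrary lines in a fixed-width box. The names
(box_line, print_notice_box, BOX_INNER, BOX_WIDTH) and the layout are
assumptions for illustration; in the kernel the output would go through
pr_warn() or similar rather than fprintf(), and this is not the existing
trace_printk() code.

```c
#include <stdio.h>
#include <string.h>

#define BOX_INNER 54			/* text columns inside the borders */
#define BOX_WIDTH (BOX_INNER + 6)	/* plus "** " and " **" */

/* Format one boxed line, padding or truncating text to BOX_INNER columns. */
static int box_line(char *out, size_t outsz, const char *text)
{
	return snprintf(out, outsz, "** %-*.*s **", BOX_INNER, BOX_INNER, text);
}

/* Print the given lines framed by full-width rules of '*'. */
static void print_notice_box(FILE *fp, const char *const *lines, size_t n)
{
	char rule[BOX_WIDTH + 1];
	char buf[BOX_WIDTH + 1];
	size_t i;

	memset(rule, '*', BOX_WIDTH);
	rule[BOX_WIDTH] = '\0';

	fprintf(fp, "%s\n", rule);
	for (i = 0; i < n; i++) {
		box_line(buf, sizeof(buf), lines[i]);
		fprintf(fp, "%s\n", buf);
	}
	fprintf(fp, "%s\n", rule);
}
```

A caller would pass an array of message lines and its length; lines longer
than BOX_INNER are cut off so that the right-hand border stays aligned.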

-- Steve


* Re: Removing Mandatory Locks
  2021-08-20 13:43                                   ` Steven Rostedt
@ 2021-08-20 16:06                                     ` Linus Torvalds
  0 siblings, 0 replies; 82+ messages in thread
From: Linus Torvalds @ 2021-08-20 16:06 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jeff Layton, Eric W. Biederman, Matthew Wilcox, Andy Lutomirski,
	David Laight, David Hildenbrand, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Al Viro, Alexey Dobriyan, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Fri, Aug 20, 2021 at 6:43 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu, 19 Aug 2021 15:32:31 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > I don't know if distros have some pattern we could use that would end
> > up being something that gets reported to the user?
>
> People have started using my trace-printk notice message, that seems to
> be big enough to get noticed.

Well, I think people who use ftrace are more likely to look at kernel
messages than most...

So what would be more interesting is if there's some distro support
for showing kernel notifications..

I see new notifications for calendar events, for devices that got
mounted, for a lot of things - so I'm really wondering if somebody
already perhaps had something for specially formatted kernel
messages..

               Linus


* Re: Removing Mandatory Locks
  2021-08-19 19:15                         ` Linus Torvalds
  2021-08-19 19:55                           ` Eric Biggers
  2021-08-19 20:18                           ` Jeff Layton
@ 2021-08-20 16:30                           ` Kees Cook
  2021-08-20 19:17                             ` H. Peter Anvin
  2 siblings, 1 reply; 82+ messages in thread
From: Kees Cook @ 2021-08-20 16:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jeff Layton, Eric W. Biederman, Matthew Wilcox, Andy Lutomirski,
	David Laight, David Hildenbrand, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Al Viro, Alexey Dobriyan, Steven Rostedt,
	Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Thu, Aug 19, 2021 at 12:15:08PM -0700, Linus Torvalds wrote:
> On Thu, Aug 19, 2021 at 11:39 AM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > I'm all for ripping it out too. It's an insane interface anyway.
> >
> > I've not heard a single complaint about this being turned off in
> > fedora/rhel or any other distro that has this disabled.
> 
> I'd love to remove it, we could absolutely test it. The fact that
> several major distros have it disabled makes me think it's fine.

FWIW, it is now disabled in Ubuntu too:

https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/impish/commit/?h=master-next&id=f3aac5e47789cbeb3177a14d3d2a06575249e14b

> But as always, it would be good to check Android.

It looks like it's enabled (checking the Pixel 4 kernel image), but it's
not specifically mentioned in any of the build configs that are used to
construct the image, so I think this is just catching the "default y". I
expect it'd be fine to turn this off.

I will ask around to see if it's actually used.

-- 
Kees Cook


* Re: Removing Mandatory Locks
  2021-08-20 16:30                           ` Kees Cook
@ 2021-08-20 19:17                             ` H. Peter Anvin
  2021-08-20 21:29                               ` Jeff Layton
  2021-08-20 22:31                               ` Matthew Wilcox
  0 siblings, 2 replies; 82+ messages in thread
From: H. Peter Anvin @ 2021-08-20 19:17 UTC (permalink / raw)
  To: Kees Cook, Linus Torvalds
  Cc: Jeff Layton, Eric W. Biederman, Matthew Wilcox, Andy Lutomirski,
	David Laight, David Hildenbrand, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Al Viro, Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

I thought the main user was Samba and/or otherwise providing file service for M$ systems?

On August 20, 2021 9:30:31 AM PDT, Kees Cook <keescook@chromium.org> wrote:
>On Thu, Aug 19, 2021 at 12:15:08PM -0700, Linus Torvalds wrote:
>> On Thu, Aug 19, 2021 at 11:39 AM Jeff Layton <jlayton@kernel.org> wrote:
>> >
>> > I'm all for ripping it out too. It's an insane interface anyway.
>> >
>> > I've not heard a single complaint about this being turned off in
>> > fedora/rhel or any other distro that has this disabled.
>> 
>> I'd love to remove it, we could absolutely test it. The fact that
>> several major distros have it disabled makes me think it's fine.
>
>FWIW, it is now disabled in Ubuntu too:
>
>https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/impish/commit/?h=master-next&id=f3aac5e47789cbeb3177a14d3d2a06575249e14b
>
>> But as always, it would be good to check Android.
>
>It looks like it's enabled (checking the Pixel 4 kernel image), but it's
>not specifically mentioned in any of the build configs that are used to
>construct the image, so I think this is just catching the "default y". I
>expect it'd be fine to turn this off.
>
>I will ask around to see if it's actually used.
>

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


* Re: Removing Mandatory Locks
  2021-08-20 19:17                             ` H. Peter Anvin
@ 2021-08-20 21:29                               ` Jeff Layton
  2021-08-21 12:45                                 ` Jeff Layton
  2021-08-20 22:31                               ` Matthew Wilcox
  1 sibling, 1 reply; 82+ messages in thread
From: Jeff Layton @ 2021-08-20 21:29 UTC (permalink / raw)
  To: H. Peter Anvin, Kees Cook, Linus Torvalds
  Cc: Eric W. Biederman, Matthew Wilcox, Andy Lutomirski, David Laight,
	David Hildenbrand, Linux Kernel Mailing List, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

No, Windows has deny-mode locking at open time, but the kernel's
mandatory locks are enforced during read/write (which is why they are
such a pain). Samba will not miss these at all.

If we want something to provide windows-like semantics, we'd probably
want to start with something like Pavel Shilovsky's O_DENY_* patches.

-- Jeff

On Fri, 2021-08-20 at 12:17 -0700, H. Peter Anvin wrote:
> I thought the main user was Samba and/or otherwise providing file service for M$ systems?
> 
> On August 20, 2021 9:30:31 AM PDT, Kees Cook <keescook@chromium.org> wrote:
> > On Thu, Aug 19, 2021 at 12:15:08PM -0700, Linus Torvalds wrote:
> > > On Thu, Aug 19, 2021 at 11:39 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > > 
> > > > I'm all for ripping it out too. It's an insane interface anyway.
> > > > 
> > > > I've not heard a single complaint about this being turned off in
> > > > fedora/rhel or any other distro that has this disabled.
> > > 
> > > I'd love to remove it, we could absolutely test it. The fact that
> > > several major distros have it disabled makes me think it's fine.
> > 
> > FWIW, it is now disabled in Ubuntu too:
> > 
> > https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/impish/commit/?h=master-next&id=f3aac5e47789cbeb3177a14d3d2a06575249e14b
> > 
> > > But as always, it would be good to check Android.
> > 
> > It looks like it's enabled (checking the Pixel 4 kernel image), but it's
> > not specifically mentioned in any of the build configs that are used to
> > construct the image, so I think this is just catching the "default y". I
> > expect it'd be fine to turn this off.
> > 
> > I will ask around to see if it's actually used.
> > 
> 

-- 
Jeff Layton <jlayton@kernel.org>



* Re: Removing Mandatory Locks
  2021-08-20 19:17                             ` H. Peter Anvin
  2021-08-20 21:29                               ` Jeff Layton
@ 2021-08-20 22:31                               ` Matthew Wilcox
  1 sibling, 0 replies; 82+ messages in thread
From: Matthew Wilcox @ 2021-08-20 22:31 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Kees Cook, Linus Torvalds, Jeff Layton, Eric W. Biederman,
	Andy Lutomirski, David Laight, David Hildenbrand,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Al Viro, Alexey Dobriyan,
	Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Fri, Aug 20, 2021 at 12:17:49PM -0700, H. Peter Anvin wrote:
> I thought the main user was Samba and/or otherwise providing file service for M$ systems?

When I asked around about this in ~2001, the only example anyone was able
to come up with was some database that I no longer remember the name of.


* Re: Removing Mandatory Locks
  2021-08-20 21:29                               ` Jeff Layton
@ 2021-08-21 12:45                                 ` Jeff Layton
  2021-08-23 22:15                                   ` J. Bruce Fields
  0 siblings, 1 reply; 82+ messages in thread
From: Jeff Layton @ 2021-08-21 12:45 UTC (permalink / raw)
  To: H. Peter Anvin, Kees Cook, Linus Torvalds
  Cc: Eric W. Biederman, Matthew Wilcox, Andy Lutomirski, David Laight,
	David Hildenbrand, Linux Kernel Mailing List, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Fri, 2021-08-20 at 17:29 -0400, Jeff Layton wrote:
> No, Windows has deny-mode locking at open time, but the kernel's
> mandatory locks are enforced during read/write (which is why they are
> such a pain). Samba will not miss these at all.
> 
> If we want something to provide windows-like semantics, we'd probably
> want to start with something like Pavel Shilovsky's O_DENY_* patches.
> 
> -- Jeff
> 

Doh! It completely slipped my mind about byte-range locks on windows...

Those are mandatory and they do block read and write activity to the
ranges locked. They have weird semantics vs. POSIX locks (they stack
instead of splitting/merging, etc.).

Samba emulates these with (advisory) POSIX locks in most cases. Using
mandatory locks is probably possible, but I think it would add more
potential for deadlock and security issues.
-- 
Jeff Layton <jlayton@kernel.org>



* Re: Removing Mandatory Locks
  2021-08-20  8:30                                   ` David Laight
@ 2021-08-23  7:55                                     ` Geert Uytterhoeven
  2021-08-23  8:14                                       ` David Laight
  0 siblings, 1 reply; 82+ messages in thread
From: Geert Uytterhoeven @ 2021-08-23  7:55 UTC (permalink / raw)
  To: David Laight
  Cc: Linus Torvalds, Jeff Layton, Eric W. Biederman, Matthew Wilcox,
	Andy Lutomirski, David Hildenbrand, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Al Viro, Alexey Dobriyan, Steven Rostedt,
	Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Mike Rapoport, Vlastimil Babka, Vincenzo Frascino, Chinwen Chang,
	Michel Lespinasse, Catalin Marinas, Huang Ying, Jann Horn,
	Feng Tang, Kevin Brodsky, Michael Ellerman, Shawn Anastasio,
	Steven Price, Nicholas Piggin, Christian Brauner, Jens Axboe,
	Gabriel Krisman Bertazi, Peter Xu, Suren Baghdasaryan,
	Shakeel Butt, Marco Elver, Daniel Jordan, Nicolas Viennot,
	Thomas Cedeno, Collin Fijalkovich, Michal Hocko, Miklos Szeredi,
	Chengguang Xu, Christian König, linux-unionfs, Linux API,
	the arch/x86 maintainers, <linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Fri, Aug 20, 2021 at 10:30 AM David Laight <David.Laight@aculab.com> wrote:
> From: Linus Torvalds
> > Sent: 19 August 2021 23:33
> >
> > On Thu, Aug 19, 2021 at 2:43 PM Jeff Layton <jlayton@kernel.org> wrote:
> > >
> > > What sort of big, ugly warning did you have in mind?
> >
> > I originally thought WARN_ON_ONCE() just to get the distro automatic
> > error handling involved, but it would probably be a big problem for
> > the people who end up having panic-on-warn or something.
>
> Even panic-on-oops is a PITA.
> Took us weeks to realise that a customer system that was randomly
> rebooting was 'just' having a boring NULL pointer access.
>
> > So probably just a "make it a big box" thing that stands out, kind of
> > what lockdep etc does with
> >
> >         pr_warn("======...====\n");
> >
> > around the messages..

Do we really need more of these?
They take time to print (especially on serial
consoles) and increase kernel size.

What's wrong with using an appropriate KERN_*, and letting userspace
make sure the admin/user will see the message (see below)?

> >
> > I don't know if distros have some pattern we could use that would end
> > up being something that gets reported to the user?
>
> Will users even see it?
> A lot of recent distro installs try very hard to hide all the kernel
> messages.

Exactly.  E.g. Ubuntu doesn't show any kernel output during normal
operation.

On Fri, Aug 20, 2021 at 6:12 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Aug 20, 2021 at 6:43 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> > On Thu, 19 Aug 2021 15:32:31 -0700
> > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > >
> > > I don't know if distros have some pattern we could use that would end
> > > up being something that gets reported to the user?

> So what would be more interesting is if there's some distro support
> for showing kernel notifications..
>
> I see new notifications for calendar events, for devices that got
> mounted, for a lot of things - so I'm really wondering if somebody
> already perhaps had something for specially formatted kernel
> messages..

Isn't that what the old syslog and the new systemd are supposed to
handle in userspace?

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


* RE: Removing Mandatory Locks
  2021-08-23  7:55                                     ` Geert Uytterhoeven
@ 2021-08-23  8:14                                       ` David Laight
  0 siblings, 0 replies; 82+ messages in thread
From: David Laight @ 2021-08-23  8:14 UTC (permalink / raw)
  To: 'Geert Uytterhoeven'
  Cc: Linus Torvalds, Jeff Layton, Eric W. Biederman, Matthew Wilcox,
	Andy Lutomirski, David Hildenbrand, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Al Viro, Alexey Dobriyan, Steven Rostedt,
	Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Mike Rapoport, Vlastimil Babka, Vincenzo Frascino, Chinwen Chang,
	Michel Lespinasse, Catalin Marinas, Huang Ying, Jann Horn,
	Feng Tang, Kevin Brodsky, Michael Ellerman, Shawn Anastasio,
	Steven Price, Nicholas Piggin, Christian Brauner, Jens Axboe,
	Gabriel Krisman Bertazi, Peter Xu, Suren Baghdasaryan,
	Shakeel Butt, Marco Elver, Daniel Jordan, Nicolas Viennot,
	Thomas Cedeno, Collin Fijalkovich, Michal Hocko, Miklos Szeredi,
	Chengguang Xu, Christian König, linux-unionfs, Linux API,
	the arch/x86 maintainers, <linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

From: Geert Uytterhoeven
> Sent: 23 August 2021 08:56
...
> Exactly.  E.g. Ubuntu doesn't show any kernel output during normal
> operation.

Current ubuntu (x86) is getting to be a PITA.
It even runs the graphical login on tty0 - which is where
the kernel messages would end up.
It is almost impossible to get an actual VGA console.

I need to find a different distro that has less bloat in it.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: Removing Mandatory Locks
  2021-08-21 12:45                                 ` Jeff Layton
@ 2021-08-23 22:15                                   ` J. Bruce Fields
  0 siblings, 0 replies; 82+ messages in thread
From: J. Bruce Fields @ 2021-08-23 22:15 UTC (permalink / raw)
  To: Jeff Layton
  Cc: H. Peter Anvin, Kees Cook, Linus Torvalds, Eric W. Biederman,
	Matthew Wilcox, Andy Lutomirski, David Laight, David Hildenbrand,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Al Viro, Alexey Dobriyan,
	Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Sat, Aug 21, 2021 at 08:45:54AM -0400, Jeff Layton wrote:
> On Fri, 2021-08-20 at 17:29 -0400, Jeff Layton wrote:
> > No, Windows has deny-mode locking at open time, but the kernel's
> > mandatory locks are enforced during read/write (which is why they are
> > such a pain). Samba will not miss these at all.
> > 
> > If we want something to provide windows-like semantics, we'd probably
> > want to start with something like Pavel Shilovsky's O_DENY_* patches.
> > 
> > -- Jeff
> > 
> 
> Doh! It completely slipped my mind about byte-range locks on windows...
> 
> Those are mandatory and they do block read and write activity to the
> ranges locked. They have weird semantics vs. POSIX locks (they stack
> instead of splitting/merging, etc.).
> 
> Samba emulates these with (advisory) POSIX locks in most cases. Using
> mandatory locks is probably possible, but I think it would add more
> potential for deadlock and security issues.

Right, so Windows byte-range locks are different from Windows open deny
modes.

But even if somebody wanted to implement them, I doubt they'd start with
the mandatory locking code you're removing here, so I think they're
irrelevant to this discussion.

--b.


* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-14  0:54                   ` Linus Torvalds
  2021-08-14  0:58                     ` Linus Torvalds
  2021-08-14 19:52                     ` David Laight
@ 2021-08-26 17:48                     ` Andy Lutomirski
  2021-08-26 21:47                       ` David Hildenbrand
  2 siblings, 1 reply; 82+ messages in thread
From: Andy Lutomirski @ 2021-08-26 17:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric W. Biederman, David Laight, David Hildenbrand,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	<linux-fsdevel@vger.kernel.org>,
	Linux-MM, Florian Weimer, Michael Kerrisk

On Fri, Aug 13, 2021, at 5:54 PM, Linus Torvalds wrote:
> On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > I’ll bite.  How about we attack this in the opposite direction: remove the deny write mechanism entirely.
> 
> I think that would be ok, except I can see somebody relying on it.
> 
> It's broken, it's stupid, but we've done that ETXTBUSY for a _loong_ time.

Someone off-list just pointed something out to me, and I think we should push harder to remove ETXTBSY.  Specifically, we've all been focused on open() failing with ETXTBSY, and it's easy to make fun of anyone opening a running program for write when they should be unlinking and replacing it.

Alas, Linux's implementation of deny_write_access() is correct^Wabsurd, and deny_write_access() *also* returns ETXTBSY if the file is open for write.  So, in a multithreaded program, one thread does:

fd = open("some exefile", O_RDWR | O_CREAT | O_CLOEXEC);
write(fd, some stuff);

<--- problem is here

close(fd);
execve("some exefile");

Another thread does:

fork();
execve("something else");

In between fork and execve, there's another copy of the open file description, and i_writecount is held, and the execve() fails.  Whoops.  See, for example:

https://github.com/golang/go/issues/22315

I propose we get rid of deny_write_access() completely to solve this.

Getting rid of i_writecount itself seems a bit harder, since a handful of filesystems use it for clever reasons.

(OFD locks seem like they might have the same problem.  Maybe we should have a clone() flag to unshare the file table and close close-on-exec things?)

> 
> But you are right that we have removed parts of it over time (no more
> MAP_DENYWRITE, no more uselib()) so that what we have today is a
> fairly weak form of what we used to do.
> 
> And nobody really complained when we weakened it, so maybe removing it
> entirely might be acceptable.
> 
>               Linus
> 


* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-26 17:48                     ` Andy Lutomirski
@ 2021-08-26 21:47                       ` David Hildenbrand
  2021-08-26 22:13                         ` Eric W. Biederman
  2021-08-27 10:18                         ` Christian Brauner
  0 siblings, 2 replies; 82+ messages in thread
From: David Hildenbrand @ 2021-08-26 21:47 UTC (permalink / raw)
  To: Andy Lutomirski, Linus Torvalds
  Cc: Eric W. Biederman, David Laight, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Al Viro, Alexey Dobriyan, Steven Rostedt,
	Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	linux-fsdevel, Linux-MM, Florian Weimer, Michael Kerrisk

On 26.08.21 19:48, Andy Lutomirski wrote:
> On Fri, Aug 13, 2021, at 5:54 PM, Linus Torvalds wrote:
>> On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>
>>> I’ll bite.  How about we attack this in the opposite direction: remove the deny write mechanism entirely.
>>
>> I think that would be ok, except I can see somebody relying on it.
>>
>> It's broken, it's stupid, but we've done that ETXTBUSY for a _loong_ time.
> 
> Someone off-list just pointed something out to me, and I think we should push harder to remove ETXTBSY.  Specifically, we've all been focused on open() failing with ETXTBSY, and it's easy to make fun of anyone opening a running program for write when they should be unlinking and replacing it.
> 
> Alas, Linux's implementation of deny_write_access() is correct^Wabsurd, and deny_write_access() *also* returns ETXTBSY if the file is open for write.  So, in a multithreaded program, one thread does:
> 
> fd = open("some exefile", O_RDWR | O_CREAT | O_CLOEXEC);
> write(fd, some stuff);
> 
> <--- problem is here
> 
> close(fd);
> execve("some exefile");
> 
> Another thread does:
> 
> fork();
> execve("something else");
> 
> In between fork and execve, there's another copy of the open file description, and i_writecount is held, and the execve() fails.  Whoops.  See, for example:
> 
> https://github.com/golang/go/issues/22315
> 
> I propose we get rid of deny_write_access() completely to solve this.
> 
> Getting rid of i_writecount itself seems a bit harder, since a handful of filesystems use it for clever reasons.
> 
> (OFD locks seem like they might have the same problem.  Maybe we should have a clone() flag to unshare the file table and close close-on-exec things?)
> 

It's not like this issue is new (^2017) or relevant in practice. So no 
need to hurry IMHO. One step at a time: it might make perfect sense to 
remove ETXTBSY, but we have to be careful to not break other user space 
that actually cares about the current behavior in practice.

-- 
Thanks,

David / dhildenb



* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-26 21:47                       ` David Hildenbrand
@ 2021-08-26 22:13                         ` Eric W. Biederman
  2021-08-27  8:22                           ` David Laight
  2021-09-01  8:28                           ` David Hildenbrand
  2021-08-27 10:18                         ` Christian Brauner
  1 sibling, 2 replies; 82+ messages in thread
From: Eric W. Biederman @ 2021-08-26 22:13 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andy Lutomirski, Linus Torvalds, David Laight,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	linux-fsdevel, Linux-MM, Florian Weimer, Michael Kerrisk

David Hildenbrand <david@redhat.com> writes:

> On 26.08.21 19:48, Andy Lutomirski wrote:
>> On Fri, Aug 13, 2021, at 5:54 PM, Linus Torvalds wrote:
>>> On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>>
>>>> I’ll bite.  How about we attack this in the opposite direction: remove the deny write mechanism entirely.
>>>
>>> I think that would be ok, except I can see somebody relying on it.
>>>
>>> It's broken, it's stupid, but we've done that ETXTBUSY for a _loong_ time.
>>
>> Someone off-list just pointed something out to me, and I think we should push harder to remove ETXTBSY.  Specifically, we've all been focused on open() failing with ETXTBSY, and it's easy to make fun of anyone opening a running program for write when they should be unlinking and replacing it.
>>
>> Alas, Linux's implementation of deny_write_access() is correct^Wabsurd, and deny_write_access() *also* returns ETXTBSY if the file is open for write.  So, in a multithreaded program, one thread does:
>>
>> fd = open("some exefile", O_RDWR | O_CREAT | O_CLOEXEC);
>> write(fd, some stuff);
>>
>> <--- problem is here
>>
>> close(fd);
>> execve("some exefile");
>>
>> Another thread does:
>>
>> fork();
>> execve("something else");
>>
>> In between fork and execve, there's another copy of the open file description, and i_writecount is held, and the execve() fails.  Whoops.  See, for example:
>>
>> https://github.com/golang/go/issues/22315
>>
>> I propose we get rid of deny_write_access() completely to solve this.
>>
>> Getting rid of i_writecount itself seems a bit harder, since a handful of filesystems use it for clever reasons.
>>
>> (OFD locks seem like they might have the same problem.  Maybe we should have a clone() flag to unshare the file table and close close-on-exec things?)
>>
>
> It's not like this issue is new (^2017) or relevant in practice. So no
> need to hurry IMHO. One step at a time: it might make perfect sense to
> remove ETXTBSY, but we have to be careful to not break other user
> space that actually cares about the current behavior in practice.

It is an old enough issue that I agree there is no need to hurry.

I also ran into this issue not too long ago when I refactored the
usermode_driver code.  My challenge was that, not being in userspace,
the delayed fput was not happening in my kernel thread.  Which meant
that writing the file, then closing the file, then execing the file
consistently reported -ETXTBSY.

The kernel code wound up doing:
	/* Flush delayed fput so exec can open the file read-only */
	flush_delayed_fput();
	task_work_run();

As I read the code the delay for userspace file descriptors is
always done with task_work_add, so userspace should not hit
that kind of silliness, and should be able to actually close
the file descriptor before the exec.


On the flip side, I don't know how anything can depend upon getting an
-ETXTBSY.  So I don't think there is any real risk of breaking userspace
if we remove it.

Eric



* RE: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-26 22:13                         ` Eric W. Biederman
@ 2021-08-27  8:22                           ` David Laight
  2021-08-27 15:58                             ` Eric W. Biederman
  2021-09-01  8:28                           ` David Hildenbrand
  1 sibling, 1 reply; 82+ messages in thread
From: David Laight @ 2021-08-27  8:22 UTC (permalink / raw)
  To: 'Eric W. Biederman', David Hildenbrand
  Cc: Andy Lutomirski, Linus Torvalds, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Al Viro, Alexey Dobriyan, Steven Rostedt,
	Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	linux-fsdevel, Linux-MM, Florian Weimer, Michael Kerrisk

From: Eric W. Biederman
> Sent: 26 August 2021 23:14
...
> I also ran into this issue not too long ago when I refactored the
> usermode_driver code.  My challenge was not being in userspace
> the delayed fput was not happening in my kernel thread.  Which meant
> that writing the file, then closing the file, then execing the file
> consistently reported -ETXTBSY.
> 
> The kernel code wound up doing:
> 	/* Flush delayed fput so exec can open the file read-only */
> 	flush_delayed_fput();
> 	task_work_run();
> 
> As I read the code the delay for userspace file descriptors is
> always done with task_work_add, so userspace should not hit
> that kind of silliness, and should be able to actually close
> the file descriptor before the exec.

If task_work_add ends up adding it to a task that is already
running on a different cpu, and that cpu takes a hardware
interrupt that takes some time and/or schedules the softint
code to run immediately the hardware interrupt completes
then it may well be possible for userspace to have 'issues'.

Any flags associated with O_DENY_WRITE would need to be cleared
synchronously in the close() rather than in any delayed fput().

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-26 21:47                       ` David Hildenbrand
  2021-08-26 22:13                         ` Eric W. Biederman
@ 2021-08-27 10:18                         ` Christian Brauner
  1 sibling, 0 replies; 82+ messages in thread
From: Christian Brauner @ 2021-08-27 10:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andy Lutomirski, Linus Torvalds, Eric W. Biederman, David Laight,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	linux-fsdevel, Linux-MM, Florian Weimer, Michael Kerrisk

On Thu, Aug 26, 2021 at 11:47:07PM +0200, David Hildenbrand wrote:
> On 26.08.21 19:48, Andy Lutomirski wrote:
> > On Fri, Aug 13, 2021, at 5:54 PM, Linus Torvalds wrote:
> > > On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > > 
> > > > I’ll bite.  How about we attack this in the opposite direction: remove the deny write mechanism entirely.
> > > 
> > > I think that would be ok, except I can see somebody relying on it.
> > > 
> > > It's broken, it's stupid, but we've done that ETXTBUSY for a _loong_ time.
> > 
> > Someone off-list just pointed something out to me, and I think we should push harder to remove ETXTBSY.  Specifically, we've all been focused on open() failing with ETXTBSY, and it's easy to make fun of anyone opening a running program for write when they should be unlinking and replacing it.
> > 
> > Alas, Linux's implementation of deny_write_access() is correct^Wabsurd, and deny_write_access() *also* returns ETXTBSY if the file is open for write.  So, in a multithreaded program, one thread does:
> > 
> > fd = open("some exefile", O_RDWR | O_CREAT | O_CLOEXEC);
> > write(fd, some stuff);
> > 
> > <--- problem is here
> > 
> > close(fd);
> > execve("some exefile");
> > 
> > Another thread does:
> > 
> > fork();
> > execve("something else");
> > 
> > In between fork and execve, there's another copy of the open file description, and i_writecount is held, and the execve() fails.  Whoops.  See, for example:
> > 
> > https://github.com/golang/go/issues/22315
> > 
> > I propose we get rid of deny_write_access() completely to solve this.
> > 
> > Getting rid of i_writecount itself seems a bit harder, since a handful of filesystems use it for clever reasons.
> > 
> > (OFD locks seem like they might have the same problem.  Maybe we should have a clone() flag to unshare the file table and close close-on-exec things?)
> > 
> 
> It's not like this issue is new (^2017) or relevant in practice. So no need
> to hurry IMHO. One step at a time: it might make perfect sense to remove
> ETXTBSY, but we have to be careful to not break other user space that
> actually cares about the current behavior in practice.

I agree. As I at least tried to show, removing write-protection can make
some exploits easier. I'm all for trying to remove this if it simplifies
things but for sure this shouldn't be part of this patchset and we
should be careful about it.

Removing a (misguided or only partially functioning) protection
mechanism doesn't introduce a failure point, it removes one.
And I don't think removal and addition of a failure point usually have
the same consequences. Introducing a new failure point will often mean
userspace quickly detects regressions. Such regressions are pretty
common due to security fixes we introduce. Recent examples include [1].
Right after this was merged the regression was reported.

But when we start allowing behavior that used to fail, such as the
ETXTBSY case, it can be difficult for userspace to detect regressions.
Quite often the reason is that userspace applications tend not to do
things they know upfront will fail. Attackers, however, might.

[1]: bfb819ea20ce ("proc: Check /proc/$pid/attr/ writes against file opener")

Christian


* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-27  8:22                           ` David Laight
@ 2021-08-27 15:58                             ` Eric W. Biederman
  0 siblings, 0 replies; 82+ messages in thread
From: Eric W. Biederman @ 2021-08-27 15:58 UTC (permalink / raw)
  To: David Laight
  Cc: David Hildenbrand, Andy Lutomirski, Linus Torvalds,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	linux-fsdevel, Linux-MM, Florian Weimer, Michael Kerrisk

David Laight <David.Laight@ACULAB.COM> writes:

> From: Eric W. Biederman
>> Sent: 26 August 2021 23:14
> ...
>> I also ran into this issue not too long ago when I refactored the
>> usermode_driver code.  My challenge was that, since I was not in
>> userspace, the delayed fput was not happening in my kernel thread.
>> That meant that writing the file, then closing the file, then execing
>> the file consistently reported -ETXTBSY.
>> 
>> The kernel code wound up doing:
>> 	/* Flush delayed fput so exec can open the file read-only */
>> 	flush_delayed_fput();
>> 	task_work_run();
>> 
>> As I read the code the delay for userspace file descriptors is
>> always done with task_work_add, so userspace should not hit
>> that kind of silliness, and should be able to actually close
>> the file descriptor before the exec.
>
> If task_work_add ends up adding it to a task that is already
> running on a different cpu, and that cpu takes a hardware
> interrupt that takes some time and/or schedules the softint
> code to run immediately the hardware interrupt completes
> then it may well be possible for userspace to have 'issues'.

It is task_work_add(current), which punts the work to the return to
userspace.

> Any flags associated with O_DENY_WRITE would need to be cleared
> synchronously in the close() rather than in any delayed fput().

Eric


* Re: [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE
  2021-08-26 22:13                         ` Eric W. Biederman
  2021-08-27  8:22                           ` David Laight
@ 2021-09-01  8:28                           ` David Hildenbrand
  1 sibling, 0 replies; 82+ messages in thread
From: David Hildenbrand @ 2021-09-01  8:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andy Lutomirski, Linus Torvalds, David Laight,
	Linux Kernel Mailing List, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Al Viro,
	Alexey Dobriyan, Steven Rostedt, Peter Zijlstra (Intel),
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Petr Mladek, Sergey Senozhatsky,
	Andy Shevchenko, Rasmus Villemoes, Kees Cook, Greg Ungerer,
	Geert Uytterhoeven, Mike Rapoport, Vlastimil Babka,
	Vincenzo Frascino, Chinwen Chang, Michel Lespinasse,
	Catalin Marinas, Matthew Wilcox (Oracle),
	Huang Ying, Jann Horn, Feng Tang, Kevin Brodsky,
	Michael Ellerman, Shawn Anastasio, Steven Price, Nicholas Piggin,
	Christian Brauner, Jens Axboe, Gabriel Krisman Bertazi, Peter Xu,
	Suren Baghdasaryan, Shakeel Butt, Marco Elver, Daniel Jordan,
	Nicolas Viennot, Thomas Cedeno, Collin Fijalkovich, Michal Hocko,
	Miklos Szeredi, Chengguang Xu, Christian König,
	linux-unionfs, Linux API, the arch/x86 maintainers,
	linux-fsdevel, Linux-MM, Florian Weimer, Michael Kerrisk

On 27.08.21 00:13, Eric W. Biederman wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 26.08.21 19:48, Andy Lutomirski wrote:
>>> On Fri, Aug 13, 2021, at 5:54 PM, Linus Torvalds wrote:
>>>> On Fri, Aug 13, 2021 at 2:49 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>>>
>>>>> I’ll bite.  How about we attack this in the opposite direction: remove the deny write mechanism entirely.
>>>>
>>>> I think that would be ok, except I can see somebody relying on it.
>>>>
>>>> It's broken, it's stupid, but we've done that ETXTBSY for a _loong_ time.
>>>
>>> Someone off-list just pointed something out to me, and I think we should push harder to remove ETXTBSY.  Specifically, we've all been focused on open() failing with ETXTBSY, and it's easy to make fun of anyone opening a running program for write when they should be unlinking and replacing it.
>>>
>>> Alas, Linux's implementation of deny_write_access() is correct^Wabsurd, and deny_write_access() *also* returns ETXTBSY if the file is open for write.  So, in a multithreaded program, one thread does:
>>>
>>> fd = open("some exefile", O_RDWR | O_CREAT | O_CLOEXEC);
>>> write(fd, some stuff);
>>>
>>> <--- problem is here
>>>
>>> close(fd);
>>> execve("some exefile");
>>>
>>> Another thread does:
>>>
>>> fork();
>>> execve("something else");
>>>
>>> In between fork and execve, there's another copy of the open file description, and i_writecount is held, and the execve() fails.  Whoops.  See, for example:
>>>
>>> https://github.com/golang/go/issues/22315
>>>
>>> I propose we get rid of deny_write_access() completely to solve this.
>>>
>>> Getting rid of i_writecount itself seems a bit harder, since a handful of filesystems use it for clever reasons.
>>>
>>> (OFD locks seem like they might have the same problem.  Maybe we should have a clone() flag to unshare the file table and close close-on-exec things?)
>>>
>>
>> It's not like this issue is new (^2017) or relevant in practice. So no
>> need to hurry IMHO. One step at a time: it might make perfect sense to
>> remove ETXTBSY, but we have to be careful to not break other user
>> space that actually cares about the current behavior in practice.
> 
> It is an old enough issue that I agree there is no need to hurry.
> 
> I also ran into this issue not too long ago when I refactored the
> usermode_driver code.  My challenge was that, since I was not in
> userspace, the delayed fput was not happening in my kernel thread.
> That meant that writing the file, then closing the file, then execing
> the file consistently reported -ETXTBSY.
> 
> The kernel code wound up doing:
> 	/* Flush delayed fput so exec can open the file read-only */
> 	flush_delayed_fput();
> 	task_work_run();
> 
> As I read the code the delay for userspace file descriptors is
> always done with task_work_add, so userspace should not hit
> that kind of silliness, and should be able to actually close
> the file descriptor before the exec.
> 
> 
> On the flip side, I don't know how anything can depend upon getting an
> -ETXTBSY.  So I don't think there is any real risk of breaking userspace
> if we remove it.

At least in LTP, we have two test cases testing exactly that behavior:

testcases/kernel/syscalls/creat/creat07.c
testcases/kernel/syscalls/execve/execve04.c
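
For illustration only, here is a minimal userspace sketch (not the LTP
code) of the behavior those tests cover: execve() of a file that is still
open for writing is expected to fail with ETXTBSY, and to succeed once the
writer has been closed. The helper names, the /tmp path template and the
use of a tiny /bin/sh script as the executable are assumptions of this
sketch, and it presumes /tmp is not mounted noexec.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: create an executable script, return its write fd. */
static int create_script(char *path_template)
{
	static const char body[] = "#!/bin/sh\nexit 0\n";
	int fd;

	assert(path_template != NULL);
	fd = mkstemp(path_template);	/* opens the new file O_RDWR */
	if (fd < 0)
		return -1;
	if (write(fd, body, sizeof(body) - 1) != (ssize_t)(sizeof(body) - 1) ||
	    fchmod(fd, 0700) != 0) {
		close(fd);
		unlink(path_template);
		return -1;
	}
	return fd;
}

/*
 * Hypothetical helper: returns 0 if execve() succeeded in a forked child,
 * the child's errno if execve() failed, or -1 on a setup error.
 */
int exec_errno_for_new_script(int keep_writer_open)
{
	char path[] = "/tmp/etxtbsy-XXXXXX";
	int report[2], err = 0, status;
	pid_t pid;
	int fd = create_script(path);

	if (fd < 0)
		return -1;
	if (!keep_writer_open) {
		close(fd);	/* drop the only write reference before exec */
		fd = -1;
	}
	if (pipe(report) != 0) {
		err = -1;
		goto out;
	}
	/* Write end is close-on-exec: EOF in the parent means exec worked. */
	fcntl(report[1], F_SETFD, FD_CLOEXEC);

	pid = fork();
	if (pid == 0) {
		char *argv[] = { path, NULL };
		char *envp[] = { NULL };

		close(report[0]);
		execve(path, argv, envp);
		err = errno;	/* only reached if execve() failed */
		if (write(report[1], &err, sizeof(err)) < 0)
			_exit(126);
		_exit(127);
	}
	close(report[1]);
	if (pid < 0) {
		err = -1;
	} else {
		if (read(report[0], &err, sizeof(err)) < 0)
			err = -1;
		waitpid(pid, &status, 0);
	}
	close(report[0]);
out:
	if (fd >= 0)
		close(fd);
	unlink(path);
	return err;
}
```

On a kernel that still enforces deny_write_access() at exec time,
exec_errno_for_new_script(1) should report ETXTBSY, while
exec_errno_for_new_script(0) should report 0.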


-- 
Thanks,

David / dhildenb



Thread overview: 82+ messages
2021-08-12  8:43 [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE David Hildenbrand
2021-08-12  8:43 ` [PATCH v1 1/7] binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib() David Hildenbrand
2021-08-12  8:43 ` [PATCH v1 2/7] kernel/fork: factor out atomcially replacing the current MM exe_file David Hildenbrand
2021-08-12  9:17   ` Christian Brauner
2021-08-12  8:43 ` [PATCH v1 3/7] kernel/fork: always deny write access to " David Hildenbrand
2021-08-12 10:05   ` Christian Brauner
2021-08-12 10:13     ` David Hildenbrand
2021-08-12 12:32       ` Christian Brauner
2021-08-12 12:38         ` David Hildenbrand
2021-08-12 16:51   ` Linus Torvalds
2021-08-12 19:38     ` David Hildenbrand
2021-08-12  8:43 ` [PATCH v1 4/7] binfmt: remove in-tree usage of MAP_DENYWRITE David Hildenbrand
2021-08-12  8:43 ` [PATCH v1 5/7] mm: remove VM_DENYWRITE David Hildenbrand
2021-08-12  8:43 ` [PATCH v1 6/7] mm: ignore MAP_DENYWRITE in ksys_mmap_pgoff() David Hildenbrand
2021-08-12  8:43 ` [PATCH v1 7/7] fs: update documentation of get_write_access() and friends David Hildenbrand
2021-08-12 12:20 ` [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE Florian Weimer
2021-08-12 12:47   ` David Hildenbrand
2021-08-12 16:17   ` Eric W. Biederman
2021-08-12 17:32 ` Eric W. Biederman
2021-08-12 17:35   ` Andy Lutomirski
2021-08-12 17:48     ` Eric W. Biederman
2021-08-12 18:01       ` Andy Lutomirski
2021-08-12 18:10       ` Linus Torvalds
2021-08-12 18:47         ` Eric W. Biederman
2021-08-13  9:05           ` David Laight
     [not found]             ` <87h7ft2j68.fsf@disp2133>
2021-08-13 20:51               ` Florian Weimer
2021-08-14  0:31               ` Linus Torvalds
2021-08-14  0:49                 ` Andy Lutomirski
2021-08-14  0:54                   ` Linus Torvalds
2021-08-14  0:58                     ` Linus Torvalds
2021-08-14  1:57                       ` Al Viro
2021-08-14  2:02                         ` Al Viro
2021-08-14  9:06                           ` David Hildenbrand
2021-08-14  7:53                         ` Christian Brauner
2021-08-14 19:52                     ` David Laight
2021-08-26 17:48                     ` Andy Lutomirski
2021-08-26 21:47                       ` David Hildenbrand
2021-08-26 22:13                         ` Eric W. Biederman
2021-08-27  8:22                           ` David Laight
2021-08-27 15:58                             ` Eric W. Biederman
2021-09-01  8:28                           ` David Hildenbrand
2021-08-27 10:18                         ` Christian Brauner
2021-08-14  3:04                   ` Matthew Wilcox
2021-08-17 16:48                     ` Removing Mandatory Locks Eric W. Biederman
2021-08-17 16:50                       ` David Hildenbrand
2021-08-18  9:34                       ` Rodrigo Campos
2021-08-19 19:18                         ` Jeff Layton
2021-08-19 20:03                           ` Willy Tarreau
2021-08-19 18:39                       ` Jeff Layton
2021-08-19 19:15                         ` Linus Torvalds
2021-08-19 19:55                           ` Eric Biggers
2021-08-19 20:18                           ` Jeff Layton
2021-08-19 20:31                             ` Linus Torvalds
2021-08-19 21:43                               ` Jeff Layton
2021-08-19 22:32                                 ` Linus Torvalds
2021-08-20  8:30                                   ` David Laight
2021-08-23  7:55                                     ` Geert Uytterhoeven
2021-08-23  8:14                                       ` David Laight
2021-08-20 13:43                                   ` Steven Rostedt
2021-08-20 16:06                                     ` Linus Torvalds
2021-08-20  2:10                               ` Matthew Wilcox
2021-08-20  6:36                               ` Amir Goldstein
2021-08-20  7:14                                 ` Amir Goldstein
2021-08-20 12:27                                   ` Jeff Layton
2021-08-20 12:38                                     ` Willy Tarreau
2021-08-20 13:03                                       ` Jeff Layton
2021-08-20 13:11                                         ` Willy Tarreau
2021-08-20 16:30                           ` Kees Cook
2021-08-20 19:17                             ` H. Peter Anvin
2021-08-20 21:29                               ` Jeff Layton
2021-08-21 12:45                                 ` Jeff Layton
2021-08-23 22:15                                   ` J. Bruce Fields
2021-08-20 22:31                               ` Matthew Wilcox
2021-08-18  7:51                     ` [PATCH v1 0/7] Remove in-tree usage of MAP_DENYWRITE Christian Brauner
2021-08-18 15:42                   ` J. Bruce Fields
2021-08-19 13:56                     ` Eric W. Biederman
2021-08-19 14:33                       ` J. Bruce Fields
2021-08-20 12:54                         ` Jeff Layton
     [not found]                     ` <162943109106.9892.7426782042253067338@noble.neil.brown.name>
2021-08-20  8:25                       ` David Laight
2021-08-12 19:24         ` David Hildenbrand
2021-08-12 18:15       ` Florian Weimer
2021-08-12 18:21         ` Linus Torvalds
