linux-hardening.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
       [not found] <20220413134946.2732468-1-catalin.marinas@arm.com>
@ 2022-04-14 18:52 ` Kees Cook
  2022-04-15 20:01   ` Topi Miettinen
  2022-04-20 13:01   ` Catalin Marinas
  0 siblings, 2 replies; 13+ messages in thread
From: Kees Cook @ 2022-04-14 18:52 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	Topi Miettinen, linux-mm, linux-arm-kernel, linux-kernel,
	linux-abi-devel, linux-hardening, Jann Horn, Salvatore Mesoraca,
	Igor Zhbanov

On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
> The background to this is that systemd has a configuration option called
> MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
> is to prevent a user task from inadvertently creating an executable
> mapping that is (or was) writeable. Since such BPF filter is stateless,
> it cannot detect mappings that were previously writeable but
> subsequently changed to read-only. Therefore the filter simply rejects
> any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
> support (Branch Target Identification), the dynamic loader cannot change
> an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
> For libraries, it can resort to unmapping and re-mapping but for the
> main executable it does not have a file descriptor. The original bug
> report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
> for libraries - [3].

Right, so, the systemd filter is a big hammer solution for the kernel
not having a very easy way to provide W^X mapping protections to
userspace. There's stuff in SELinux, and there have been several
attempts[1] at other LSMs to do it too, but nothing stuck.

Given the filter, and the implementation of how to enable BTI, I see two
solutions:

- provide a way to do W^X so systemd can implement the feature differently
- provide a way to turn on BTI separate from mprotect to bypass the filter

I would agree, the latter seems like the greater hack, so I welcome
this RFC, though I think it might need to explore a bit of the feature
space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
it risks being too narrowly implemented. For example, playing well with
JITs should be part of the design, and will likely need some kind of
ELF flags and/or "sealing" mode, and to handle the vma alias case as
Jann Horn pointed out[2].

> Add in-kernel support for such feature as a DENY_WRITE_EXEC personality
> flag, inherited on fork() and execve(). The kernel tracks a previously
> writeable mapping via a new VM_WAS_WRITE flag (64-bit only
> architectures). I went for a personality flag by analogy with the
> READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if
> we don't want more personality flags. A minor downside with the
> personality flag is that there is no way for the user to query which
> flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise
> this.

My instinct here is to use a prctl(), which maps to other kinds of modern
inherited state (like no_new_privs).

> Posting this as an RFC to start a discussion and cc'ing some of the
> systemd guys and those involved in the earlier thread around the glibc
> workaround for dynamic libraries [4]. Before thinking of upstreaming
> this we'd need the systemd folk to buy into replacing the MDWE SECCOMP
> BPF filter with the in-kernel one.
> 
> Thanks,
> 
> Catalin
> 
> [1] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#MemoryDenyWriteExecute=
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1888842
> [3] https://sourceware.org/bugzilla/show_bug.cgi?id=26831
> [3] https://lore.kernel.org/r/cover.1604393169.git.szabolcs.nagy@arm.com

So, yes, let's do it. It's long long overdue in the kernel. :)

-Kees

[1] https://github.com/KSPP/linux/issues/32
[2] https://github.com/KSPP/linux/issues/32#issuecomment-1084859611

> 
> Catalin Marinas (4):
>   mm: Track previously writeable vma permission
>   mm, personality: Implement memory-deny-write-execute as a personality
>     flag
>   fs/binfmt_elf: Tell user-space about the DENY_WRITE_EXEC personality
>     flag
>   arm64: Select ARCH_ENABLE_DENY_WRITE_EXEC
> 
>  arch/arm64/Kconfig               |  1 +
>  fs/binfmt_elf.c                  |  2 ++
>  include/linux/mm.h               |  6 ++++++
>  include/linux/mman.h             | 18 +++++++++++++++++-
>  include/uapi/linux/binfmts.h     |  4 ++++
>  include/uapi/linux/personality.h |  1 +
>  mm/Kconfig                       |  4 ++++
>  mm/mmap.c                        |  3 +++
>  mm/mprotect.c                    |  5 +++++
>  9 files changed, 43 insertions(+), 1 deletion(-)
> 

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-14 18:52 ` [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE) Kees Cook
@ 2022-04-15 20:01   ` Topi Miettinen
  2022-04-20 13:01   ` Catalin Marinas
  1 sibling, 0 replies; 13+ messages in thread
From: Topi Miettinen @ 2022-04-15 20:01 UTC (permalink / raw)
  To: Kees Cook, Catalin Marinas
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel,
	linux-hardening, Jann Horn, Salvatore Mesoraca, Igor Zhbanov

On 14.4.2022 21.52, Kees Cook wrote:
> On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
>> The background to this is that systemd has a configuration option called
>> MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
>> is to prevent a user task from inadvertently creating an executable
>> mapping that is (or was) writeable. Since such BPF filter is stateless,
>> it cannot detect mappings that were previously writeable but
>> subsequently changed to read-only. Therefore the filter simply rejects
>> any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
>> support (Branch Target Identification), the dynamic loader cannot change
>> an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
>> For libraries, it can resort to unmapping and re-mapping but for the
>> main executable it does not have a file descriptor. The original bug
>> report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
>> for libraries - [3].
> 
> Right, so, the systemd filter is a big hammer solution for the kernel
> not having a very easy way to provide W^X mapping protections to
> userspace. There's stuff in SELinux, and there have been several
> attempts[1] at other LSMs to do it too, but nothing stuck.
> 
> Given the filter, and the implementation of how to enable BTI, I see two
> solutions:
> 
> - provide a way to do W^X so systemd can implement the feature differently
> - provide a way to turn on BTI separate from mprotect to bypass the filter
> 
> I would agree, the latter seems like the greater hack, so I welcome
> this RFC, though I think it might need to explore a bit of the feature
> space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
> it risks being too narrowly implemented. For example, playing well with
> JITs should be part of the design, and will likely need some kind of
> ELF flags and/or "sealing" mode, and to handle the vma alias case as
> Jann Horn pointed out[2].

Another interesting case from 2006 by Ulrich Drepper is to use a 
temporary file and map it twice, once with PROT_WRITE and once with 
PROT_EXEC [1]. This isn't possible if the mount flags of the file 
systems are also in line with W^X principle. System services (unlike 
user apps) typically don't use /tmp nor /dev/shm (mounted with 
"rw,exec"). With systemd a simple file system W^X policy can be 
implemented for a service for example with NoExecPaths=/ ExecPaths=/usr 
ReadOnlyPaths=/usr. In-kernel MDWE probably could look beyond file 
descriptors and check if the mount flags of the file system containing 
the file being mmap()ed agree with W^X. The use cases for system 
services and user apps may be different: system services are often 
compatible with maximum hardening, while user apps may need various 
compatibility solutions if they use JIT, trampolines or FFI and access 
to W+X file systems may be also needed.

-Topi

[1] https://akkadia.org/drepper/selinux-mem.html

>> Add in-kernel support for such feature as a DENY_WRITE_EXEC personality
>> flag, inherited on fork() and execve(). The kernel tracks a previously
>> writeable mapping via a new VM_WAS_WRITE flag (64-bit only
>> architectures). I went for a personality flag by analogy with the
>> READ_IMPLIES_EXEC one. However, I'm happy to change it to a prctl() if
>> we don't want more personality flags. A minor downside with the
>> personality flag is that there is no way for the user to query which
>> flags are supported, so in patch 3 I added an AT_FLAGS bit to advertise
>> this.
> 
> My instinct here is to use a prctl(), which maps to other kinds of modern
> inherited state (like no_new_privs).
> 
>> Posting this as an RFC to start a discussion and cc'ing some of the
>> systemd guys and those involved in the earlier thread around the glibc
>> workaround for dynamic libraries [4]. Before thinking of upstreaming
>> this we'd need the systemd folk to buy into replacing the MDWE SECCOMP
>> BPF filter with the in-kernel one.
>>
>> Thanks,
>>
>> Catalin
>>
>> [1] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#MemoryDenyWriteExecute=
>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1888842
>> [3] https://sourceware.org/bugzilla/show_bug.cgi?id=26831
>> [3] https://lore.kernel.org/r/cover.1604393169.git.szabolcs.nagy@arm.com
> 
> So, yes, let's do it. It's long long overdue in the kernel. :)
> 
> -Kees
> 
> [1] https://github.com/KSPP/linux/issues/32
> [2] https://github.com/KSPP/linux/issues/32#issuecomment-1084859611
> 
>>
>> Catalin Marinas (4):
>>    mm: Track previously writeable vma permission
>>    mm, personality: Implement memory-deny-write-execute as a personality
>>      flag
>>    fs/binfmt_elf: Tell user-space about the DENY_WRITE_EXEC personality
>>      flag
>>    arm64: Select ARCH_ENABLE_DENY_WRITE_EXEC
>>
>>   arch/arm64/Kconfig               |  1 +
>>   fs/binfmt_elf.c                  |  2 ++
>>   include/linux/mm.h               |  6 ++++++
>>   include/linux/mman.h             | 18 +++++++++++++++++-
>>   include/uapi/linux/binfmts.h     |  4 ++++
>>   include/uapi/linux/personality.h |  1 +
>>   mm/Kconfig                       |  4 ++++
>>   mm/mmap.c                        |  3 +++
>>   mm/mprotect.c                    |  5 +++++
>>   9 files changed, 43 insertions(+), 1 deletion(-)
>>
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-14 18:52 ` [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE) Kees Cook
  2022-04-15 20:01   ` Topi Miettinen
@ 2022-04-20 13:01   ` Catalin Marinas
  2022-04-20 17:44     ` Kees Cook
  2022-04-20 19:34     ` Topi Miettinen
  1 sibling, 2 replies; 13+ messages in thread
From: Catalin Marinas @ 2022-04-20 13:01 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	Topi Miettinen, linux-mm, linux-arm-kernel, linux-kernel,
	linux-abi-devel, linux-hardening, Jann Horn, Salvatore Mesoraca,
	Igor Zhbanov

On Thu, Apr 14, 2022 at 11:52:17AM -0700, Kees Cook wrote:
> On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
> > The background to this is that systemd has a configuration option called
> > MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
> > is to prevent a user task from inadvertently creating an executable
> > mapping that is (or was) writeable. Since such BPF filter is stateless,
> > it cannot detect mappings that were previously writeable but
> > subsequently changed to read-only. Therefore the filter simply rejects
> > any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
> > support (Branch Target Identification), the dynamic loader cannot change
> > an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
> > For libraries, it can resort to unmapping and re-mapping but for the
> > main executable it does not have a file descriptor. The original bug
> > report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
> > for libraries - [3].
> 
> Right, so, the systemd filter is a big hammer solution for the kernel
> not having a very easy way to provide W^X mapping protections to
> userspace. There's stuff in SELinux, and there have been several
> attempts[1] at other LSMs to do it too, but nothing stuck.
> 
> Given the filter, and the implementation of how to enable BTI, I see two
> solutions:
> 
> - provide a way to do W^X so systemd can implement the feature differently
> - provide a way to turn on BTI separate from mprotect to bypass the filter
> 
> I would agree, the latter seems like the greater hack,

We discussed such hacks in the past but they are just working around the
fundamental issue - systemd wants W^X but with BPF it can only achieve
it by preventing mprotect(PROT_EXEC) irrespective of whether the mapping
was already executable. If we find a better solution for W^X, we
wouldn't have to hack anything for mprotect(PROT_EXEC|PROT_BTI).

> so I welcome
> this RFC, though I think it might need to explore a bit of the feature
> space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
> it risks being too narrowly implemented. For example, playing well with
> JITs should be part of the design, and will likely need some kind of
> ELF flags and/or "sealing" mode, and to handle the vma alias case as
> Jann Horn pointed out[2].

I agree we should look at what we want to cover, though trying to avoid
re-inventing SELinux. With this patchset I went for the minimum that
systemd MDWE does with BPF.

I think JITs get around it using something like memfd with two separate
mappings to the same page. We could try to prevent such aliases but
allow it if an ELF note is detected (or get the JIT to issue a prctl()).

Anyway, with a prctl() we can allow finer-grained control starting with
anonymous and file mappings and later extending to vma aliases,
writeable files etc. On top we can add a seal mask so that a process
cannot disable a control was set. Something like (I'm not good at
names):

	prctl(PR_MDWX_SET, flags, seal_mask);
	prctl(PR_MDWX_GET);

with flags like:

	PR_MDWX_MMAP - basics, should cover mmap() and mprotect()
	PR_MDWX_ALIAS - vma aliases, allowed with an ELF note
	PR_MDWX_WRITEABLE_FILE

(needs some more thinking)

-- 
Catalin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-20 13:01   ` Catalin Marinas
@ 2022-04-20 17:44     ` Kees Cook
  2022-04-20 19:34     ` Topi Miettinen
  1 sibling, 0 replies; 13+ messages in thread
From: Kees Cook @ 2022-04-20 17:44 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	Topi Miettinen, linux-mm, linux-arm-kernel, linux-kernel,
	linux-abi-devel, linux-hardening, Jann Horn, Salvatore Mesoraca,
	Igor Zhbanov

On Wed, Apr 20, 2022 at 02:01:02PM +0100, Catalin Marinas wrote:
> I agree we should look at what we want to cover, though trying to avoid
> re-inventing SELinux. With this patchset I went for the minimum that
> systemd MDWE does with BPF.

Right -- and I don't think we're at any risk of slippery-sloping into a
full MAC system. :)

I'm fine with doing the implementation in stages, if we've attempted to
design the steps (which I think you've got a good start on below).

> I think JITs get around it using something like memfd with two separate
> mappings to the same page. We could try to prevent such aliases but
> allow it if an ELF note is detected (or get the JIT to issue a prctl()).

Right -- I'd rather JITs carry some hard-coded property (i.e. ELF note)
to indicate the fact that they're expecting to do these kinds of things
rather than leaving it open for all processes.

> Anyway, with a prctl() we can allow finer-grained control starting with
> anonymous and file mappings and later extending to vma aliases,
> writeable files etc. On top we can add a seal mask so that a process
> cannot disable a control was set. Something like (I'm not good at
> names):
> 
> 	prctl(PR_MDWX_SET, flags, seal_mask);
> 	prctl(PR_MDWX_GET);
> 
> with flags like:
> 
> 	PR_MDWX_MMAP - basics, should cover mmap() and mprotect()
> 	PR_MDWX_ALIAS - vma aliases, allowed with an ELF note
> 	PR_MDWX_WRITEABLE_FILE

The SARA proposal lists a lot of behavioral details to consider.
Quoting it[1] here:
>> - W^X enforcement will cause problems to any programs that needs
>>   memory pages mapped both as writable and executable at the same time e.g.
>>   programs with executable stack markings in the PT_GNU_STACK segment.

IMO, executable stack markings should be considered completely
deprecated. In fact, we've been warning about it since 2020:
47a2ebb7f505 ("execve: warn if process starts with executable stack")

So with execstack, under W^X, I think we should either:
- refuse to exec the process (default)
- disable W^X for the process (but not its children)

>> - W!->X restriction will cause problems to any program that
>>   needs to generate executable code at run time or to modify executable
>>   pages e.g. programs with a JIT compiler built-in or linked against a
>>   non-PIC library.

This seems solvable with an ELF flag.

>> - Executable MMAP prevention can work only with programs that have at least
>>   partial RELRO support. It's disabled automatically for programs that
>>   lack this feature. It will cause problems to any program that uses dlopen
>>   or tries to do an executable mmap. Unfortunately this feature is the one
>>   that could create most problems and should be enabled only after careful
>>   evaluation.

This seems like a variation on the execstack case, and we should be
able to detect the state and choose a behavior based on system settings,
and a smarter version (as SARA has) would track RELRO pages waiting for
the loader to make them read-only.

SARA was proposed with a set of feature flags[2]; quoting here:
>> | W^X                          |  0x0008  |

This is the basic property, refusing PROT_WRITE | PROT_EXEC. I note that
SARA also rejects opening /proc/$pid/mem with FMODE_WRITE when this is
enabled for a process. (It likely should extend to process_vm_write()
too.)

>> | W!->X Heap                   |  0x0001  |
>> | W!->X Stack                  |  0x0002  |
>> | W!->X Other memory           |  0x0004  |

This is for the vma history tracking, and I don't think we need to
separate this by memory type? It's nice to have the granularity, but for
a first-pass it seems like overkill? Maybe I'm missing some detail.

>> | Don't enforce, just complain |  0x0010  |
>> | Be Verbose                   |  0x0020  |

Unclear if these would work well with a non-LSM approach.

>> | Executable MMAP prevention   |  0x0040  |

This is the relro detection piece.

>> | Trampoline emulation         |  0x0100  |

This is a more advanced case for emulating execstack, but if we can just
ignore execstack entirely, this can go away?

>> | Children will inherit flags  |  0x0200  |

Should a process have that control?

>> | Force W^X on setprocattr     |  0x0080  |

This is a "seal" trigger, which could be done through prctl().


It looks like a bunch of the features are designed around having as much
as possible enabled at exec time, and then tightening it further as
various things are finished (e.g. execstack, relro, sealing, etc), which
is, I think, what would still be needed for a process launcher to be
able to enable this kind of protection. (i.e. hoping the process calls a
prctl() to enable the protection isn't going to work well with systemd.)

So, I *think* we could have a minimal form with these considerations:
 - execstack: declare it distinctly incompatible.
 - relro: I think this is solved with BIND_NOW. It's been a while since
   I looked deeply at this, but I think under BIND_NOW, the (executable)
   PLT doesn't ever need to be writable (since it points into the GOT),
   and the (initially writable) GOT is already never executable. This
   needs to be verified...
 - JITs can be allowed with a ELF flag and can choose to opt-in with
   a prctl().

-Kees

[1] https://lore.kernel.org/lkml/1562410493-8661-1-git-send-email-s.mesoraca16@gmail.com/
[2] https://lore.kernel.org/lkml/1562410493-8661-2-git-send-email-s.mesoraca16@gmail.com/

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-20 13:01   ` Catalin Marinas
  2022-04-20 17:44     ` Kees Cook
@ 2022-04-20 19:34     ` Topi Miettinen
  2022-04-20 23:21       ` Kees Cook
  1 sibling, 1 reply; 13+ messages in thread
From: Topi Miettinen @ 2022-04-20 19:34 UTC (permalink / raw)
  To: Catalin Marinas, Kees Cook
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel,
	linux-hardening, Jann Horn, Salvatore Mesoraca, Igor Zhbanov

On 20.4.2022 16.01, Catalin Marinas wrote:
> On Thu, Apr 14, 2022 at 11:52:17AM -0700, Kees Cook wrote:
>> On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
>>> The background to this is that systemd has a configuration option called
>>> MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
>>> is to prevent a user task from inadvertently creating an executable
>>> mapping that is (or was) writeable. Since such BPF filter is stateless,
>>> it cannot detect mappings that were previously writeable but
>>> subsequently changed to read-only. Therefore the filter simply rejects
>>> any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
>>> support (Branch Target Identification), the dynamic loader cannot change
>>> an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
>>> For libraries, it can resort to unmapping and re-mapping but for the
>>> main executable it does not have a file descriptor. The original bug
>>> report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
>>> for libraries - [3].
>>
>> Right, so, the systemd filter is a big hammer solution for the kernel
>> not having a very easy way to provide W^X mapping protections to
>> userspace. There's stuff in SELinux, and there have been several
>> attempts[1] at other LSMs to do it too, but nothing stuck.
>>
>> Given the filter, and the implementation of how to enable BTI, I see two
>> solutions:
>>
>> - provide a way to do W^X so systemd can implement the feature differently
>> - provide a way to turn on BTI separate from mprotect to bypass the filter
>>
>> I would agree, the latter seems like the greater hack,
> 
> We discussed such hacks in the past but they are just working around the
> fundamental issue - systemd wants W^X but with BPF it can only achieve
> it by preventing mprotect(PROT_EXEC) irrespective of whether the mapping
> was already executable. If we find a better solution for W^X, we
> wouldn't have to hack anything for mprotect(PROT_EXEC|PROT_BTI).
> 
>> so I welcome
>> this RFC, though I think it might need to explore a bit of the feature
>> space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
>> it risks being too narrowly implemented. For example, playing well with
>> JITs should be part of the design, and will likely need some kind of
>> ELF flags and/or "sealing" mode, and to handle the vma alias case as
>> Jann Horn pointed out[2].
> 
> I agree we should look at what we want to cover, though trying to avoid
> re-inventing SELinux. With this patchset I went for the minimum that
> systemd MDWE does with BPF.
> 
> I think JITs get around it using something like memfd with two separate
> mappings to the same page. We could try to prevent such aliases but
> allow it if an ELF note is detected (or get the JIT to issue a prctl()).
> 
> Anyway, with a prctl() we can allow finer-grained control starting with
> anonymous and file mappings and later extending to vma aliases,
> writeable files etc. On top we can add a seal mask so that a process
> cannot disable a control was set. Something like (I'm not good at
> names):
> 
> 	prctl(PR_MDWX_SET, flags, seal_mask);
> 	prctl(PR_MDWX_GET);
> 
> with flags like:
> 
> 	PR_MDWX_MMAP - basics, should cover mmap() and mprotect()
> 	PR_MDWX_ALIAS - vma aliases, allowed with an ELF note
> 	PR_MDWX_WRITEABLE_FILE
> 
> (needs some more thinking)
> 

For systemd, feature compatibility with the BPF version is important so 
that we could automatically switch to the kernel version once available 
without regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) 
should match exactly what MemoryDenyWriteExecute=yes as implemented with 
BPF has: only forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). 
Like BPF, once installed there should be no way to escape and ELF flags 
should be also ignored. ARM BTI should be allowed though (allow 
PROT_EXEC|PROT_BTI if the old flags had PROT_EXEC).

Then we could have improved versions (other PR_MDWX_ prctls) with lots 
more checks. This could be enabled with MemoryDenyWriteExecute=strict or so.

Perhaps also more relaxed versions (like SARA) could be interesting 
(system service running Python with FFI, or perhaps JVM etc), enabled 
with for example MemoryDenyWriteExecute=trampolines. That way even those 
programs would get some protection (though there would be a gap in the 
defences).

-Topi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-20 19:34     ` Topi Miettinen
@ 2022-04-20 23:21       ` Kees Cook
  2022-04-21 15:35         ` Catalin Marinas
  0 siblings, 1 reply; 13+ messages in thread
From: Kees Cook @ 2022-04-20 23:21 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: Catalin Marinas, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Will Deacon, Alexander Viro, Eric Biederman, Szabolcs Nagy,
	Mark Brown, Jeremy Linton, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel, linux-hardening, Jann Horn,
	Salvatore Mesoraca, Igor Zhbanov

On Wed, Apr 20, 2022 at 10:34:33PM +0300, Topi Miettinen wrote:
> On 20.4.2022 16.01, Catalin Marinas wrote:
> > On Thu, Apr 14, 2022 at 11:52:17AM -0700, Kees Cook wrote:
> > > On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
> > > > The background to this is that systemd has a configuration option called
> > > > MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
> > > > is to prevent a user task from inadvertently creating an executable
> > > > mapping that is (or was) writeable. Since such BPF filter is stateless,
> > > > it cannot detect mappings that were previously writeable but
> > > > subsequently changed to read-only. Therefore the filter simply rejects
> > > > any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
> > > > support (Branch Target Identification), the dynamic loader cannot change
> > > > an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
> > > > For libraries, it can resort to unmapping and re-mapping but for the
> > > > main executable it does not have a file descriptor. The original bug
> > > > report in the Red Hat bugzilla - [2] - and subsequent glibc workaround
> > > > for libraries - [3].
> > > 
> > > Right, so, the systemd filter is a big hammer solution for the kernel
> > > not having a very easy way to provide W^X mapping protections to
> > > userspace. There's stuff in SELinux, and there have been several
> > > attempts[1] at other LSMs to do it too, but nothing stuck.
> > > 
> > > Given the filter, and the implementation of how to enable BTI, I see two
> > > solutions:
> > > 
> > > - provide a way to do W^X so systemd can implement the feature differently
> > > - provide a way to turn on BTI separate from mprotect to bypass the filter
> > > 
> > > I would agree, the latter seems like the greater hack,
> > 
> > We discussed such hacks in the past but they are just working around the
> > fundamental issue - systemd wants W^X but with BPF it can only achieve
> > it by preventing mprotect(PROT_EXEC) irrespective of whether the mapping
> > was already executable. If we find a better solution for W^X, we
> > wouldn't have to hack anything for mprotect(PROT_EXEC|PROT_BTI).
> > 
> > > so I welcome
> > > this RFC, though I think it might need to explore a bit of the feature
> > > space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
> > > it risks being too narrowly implemented. For example, playing well with
> > > JITs should be part of the design, and will likely need some kind of
> > > ELF flags and/or "sealing" mode, and to handle the vma alias case as
> > > Jann Horn pointed out[2].
> > 
> > I agree we should look at what we want to cover, though trying to avoid
> > re-inventing SELinux. With this patchset I went for the minimum that
> > systemd MDWE does with BPF.
> > 
> > I think JITs get around it using something like memfd with two separate
> > mappings to the same page. We could try to prevent such aliases but
> > allow it if an ELF note is detected (or get the JIT to issue a prctl()).
> > 
> > Anyway, with a prctl() we can allow finer-grained control starting with
> > anonymous and file mappings and later extending to vma aliases,
> > writeable files etc. On top we can add a seal mask so that a process
> > cannot disable a control was set. Something like (I'm not good at
> > names):
> > 
> > 	prctl(PR_MDWX_SET, flags, seal_mask);
> > 	prctl(PR_MDWX_GET);
> > 
> > with flags like:
> > 
> > 	PR_MDWX_MMAP - basics, should cover mmap() and mprotect()
> > 	PR_MDWX_ALIAS - vma aliases, allowed with an ELF note
> > 	PR_MDWX_WRITEABLE_FILE
> > 
> > (needs some more thinking)
> > 
> 
> For systemd, feature compatibility with the BPF version is important so that
> we could automatically switch to the kernel version once available without
> regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match
> exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only
> forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once
> installed there should be no way to escape and ELF flags should be also
> ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the
> old flags had PROT_EXEC).
> 
> Then we could have improved versions (other PR_MDWX_ prctls) with lots more
> checks. This could be enabled with MemoryDenyWriteExecute=strict or so.
> 
> Perhaps also more relaxed versions (like SARA) could be interesting (system
> service running Python with FFI, or perhaps JVM etc), enabled with for
> example MemoryDenyWriteExecute=trampolines. That way even those programs
> would get some protection (though there would be a gap in the defences).

Yup, I think we're all on the same page. Catalin, can you respin with a
prctl for enabling MDWE? I propose just:

	prctl(PR_MDWX_SET, flags);
	prctl(PR_MDWX_GET);

	PR_MDWX_FLAG_MMAP
		disallows PROT_EXEC on any VMA that is or was PROT_WRITE,
		covering at least: mmap, mprotect, pkey_mprotect, and shmat.

I don't think anything should be allowed to be disabled once set.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-20 23:21       ` Kees Cook
@ 2022-04-21 15:35         ` Catalin Marinas
  2022-04-21 16:42           ` Kees Cook
  2022-04-21 16:48           ` Topi Miettinen
  0 siblings, 2 replies; 13+ messages in thread
From: Catalin Marinas @ 2022-04-21 15:35 UTC (permalink / raw)
  To: Kees Cook
  Cc: Topi Miettinen, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Will Deacon, Alexander Viro, Eric Biederman, Szabolcs Nagy,
	Mark Brown, Jeremy Linton, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel, linux-hardening, Jann Horn,
	Salvatore Mesoraca, Igor Zhbanov

On Wed, Apr 20, 2022 at 04:21:45PM -0700, Kees Cook wrote:
> On Wed, Apr 20, 2022 at 10:34:33PM +0300, Topi Miettinen wrote:
> > For systemd, feature compatibility with the BPF version is important so that
> > we could automatically switch to the kernel version once available without
> > regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match
> > exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only
> > forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once
> > installed there should be no way to escape and ELF flags should be also
> > ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the
> > old flags had PROT_EXEC).

I agree.

> > Then we could have improved versions (other PR_MDWX_ prctls) with lots more
> > checks. This could be enabled with MemoryDenyWriteExecute=strict or so.
> > 
> > Perhaps also more relaxed versions (like SARA) could be interesting (system
> > service running Python with FFI, or perhaps JVM etc), enabled with for
> > example MemoryDenyWriteExecute=trampolines. That way even those programs
> > would get some protection (though there would be a gap in the defences).
> 
> Yup, I think we're all on the same page. Catalin, can you respin with a
> prctl for enabling MDWE? I propose just:
> 
> 	prctl(PR_MDWX_SET, flags);
> 	prctl(PR_MDWX_GET);
> 
> 	PR_MDWX_FLAG_MMAP
> 		disallows PROT_EXEC on any VMA that is or was PROT_WRITE,
> 		covering at least: mmap, mprotect, pkey_mprotect, and shmat.

Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
the vma is not already PROT_EXEC? The latter is closer to the current
systemd approach. The former allows an mprotect(PROT_EXEC) if the
mapping was PROT_READ only for example.

I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
replacement for BPF MDWE.

-- 
Catalin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-21 15:35         ` Catalin Marinas
@ 2022-04-21 16:42           ` Kees Cook
  2022-04-21 17:24             ` Catalin Marinas
  2022-04-21 16:48           ` Topi Miettinen
  1 sibling, 1 reply; 13+ messages in thread
From: Kees Cook @ 2022-04-21 16:42 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Topi Miettinen, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Will Deacon, Alexander Viro, Eric Biederman, Szabolcs Nagy,
	Mark Brown, Jeremy Linton, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel, linux-hardening, Jann Horn,
	Salvatore Mesoraca, Igor Zhbanov

On Thu, Apr 21, 2022 at 04:35:15PM +0100, Catalin Marinas wrote:
> On Wed, Apr 20, 2022 at 04:21:45PM -0700, Kees Cook wrote:
> > On Wed, Apr 20, 2022 at 10:34:33PM +0300, Topi Miettinen wrote:
> > > For systemd, feature compatibility with the BPF version is important so that
> > > we could automatically switch to the kernel version once available without
> > > regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match
> > > exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only
> > > forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once
> > > installed there should be no way to escape and ELF flags should be also
> > > ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the
> > > old flags had PROT_EXEC).
> 
> I agree.
> 
> > > Then we could have improved versions (other PR_MDWX_ prctls) with lots more
> > > checks. This could be enabled with MemoryDenyWriteExecute=strict or so.
> > > 
> > > Perhaps also more relaxed versions (like SARA) could be interesting (system
> > > service running Python with FFI, or perhaps JVM etc), enabled with for
> > > example MemoryDenyWriteExecute=trampolines. That way even those programs
> > > would get some protection (though there would be a gap in the defences).
> > 
> > Yup, I think we're all on the same page. Catalin, can you respin with a
> > prctl for enabling MDWE? I propose just:
> > 
> > 	prctl(PR_MDWX_SET, flags);
> > 	prctl(PR_MDWX_GET);
> > 
> > 	PR_MDWX_FLAG_MMAP
> > 		disallows PROT_EXEC on any VMA that is or was PROT_WRITE,
> > 		covering at least: mmap, mprotect, pkey_mprotect, and shmat.
> 
> Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> the vma is not already PROT_EXEC? The latter is closer to the current
> systemd approach. The former allows an mprotect(PROT_EXEC) if the
> mapping was PROT_READ only for example.
> 
> I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> replacement for BPF MDWE.

I think "was PROT_WRITE" is an important part of the defense that
couldn't be done with a simple seccomp filter (which is why the filter
ended up being a problem in the first place).

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-21 15:35         ` Catalin Marinas
  2022-04-21 16:42           ` Kees Cook
@ 2022-04-21 16:48           ` Topi Miettinen
  2022-04-21 17:28             ` Catalin Marinas
  1 sibling, 1 reply; 13+ messages in thread
From: Topi Miettinen @ 2022-04-21 16:48 UTC (permalink / raw)
  To: Catalin Marinas, Kees Cook
  Cc: Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel,
	linux-hardening, Jann Horn, Salvatore Mesoraca, Igor Zhbanov

On 21.4.2022 18.35, Catalin Marinas wrote:
> On Wed, Apr 20, 2022 at 04:21:45PM -0700, Kees Cook wrote:
>> On Wed, Apr 20, 2022 at 10:34:33PM +0300, Topi Miettinen wrote:
>>> For systemd, feature compatibility with the BPF version is important so that
>>> we could automatically switch to the kernel version once available without
>>> regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT) should match
>>> exactly what MemoryDenyWriteExecute=yes as implemented with BPF has: only
>>> forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC). Like BPF, once
>>> installed there should be no way to escape and ELF flags should be also
>>> ignored. ARM BTI should be allowed though (allow PROT_EXEC|PROT_BTI if the
>>> old flags had PROT_EXEC).
> 
> I agree.
> 
>>> Then we could have improved versions (other PR_MDWX_ prctls) with lots more
>>> checks. This could be enabled with MemoryDenyWriteExecute=strict or so.
>>>
>>> Perhaps also more relaxed versions (like SARA) could be interesting (system
>>> service running Python with FFI, or perhaps JVM etc), enabled with for
>>> example MemoryDenyWriteExecute=trampolines. That way even those programs
>>> would get some protection (though there would be a gap in the defences).
>>
>> Yup, I think we're all on the same page. Catalin, can you respin with a
>> prctl for enabling MDWE? I propose just:
>>
>> 	prctl(PR_MDWX_SET, flags);
>> 	prctl(PR_MDWX_GET);
>>
>> 	PR_MDWX_FLAG_MMAP
>> 		disallows PROT_EXEC on any VMA that is or was PROT_WRITE,
>> 		covering at least: mmap, mprotect, pkey_mprotect, and shmat.
> 
> Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> the vma is not already PROT_EXEC? The latter is closer to the current
> systemd approach. The former allows an mprotect(PROT_EXEC) if the
> mapping was PROT_READ only for example.
> 
> I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> replacement for BPF MDWE.
> 

I think we'd want existing installations with MemoryDenyWriteExecute=yes 
not start failing when the implementation is changed to in-kernel 
version. The implementation could be very simple and not even check 
existing PROT_ flags (except for BTI case) to be maximally compatible to 
BPF version. So I'd leave "was PROT_WRITE" and other checks to more 
advanced versions, enabled with a different PR_MDWX_FLAG_.

-Topi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-21 16:42           ` Kees Cook
@ 2022-04-21 17:24             ` Catalin Marinas
  2022-04-21 17:41               ` Kees Cook
  0 siblings, 1 reply; 13+ messages in thread
From: Catalin Marinas @ 2022-04-21 17:24 UTC (permalink / raw)
  To: Kees Cook
  Cc: Topi Miettinen, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Will Deacon, Alexander Viro, Eric Biederman, Szabolcs Nagy,
	Mark Brown, Jeremy Linton, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel, linux-hardening, Jann Horn,
	Salvatore Mesoraca, Igor Zhbanov

On Thu, Apr 21, 2022 at 09:42:23AM -0700, Kees Cook wrote:
> On Thu, Apr 21, 2022 at 04:35:15PM +0100, Catalin Marinas wrote:
> > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> > the vma is not already PROT_EXEC? The latter is closer to the current
> > systemd approach. The former allows an mprotect(PROT_EXEC) if the
> > mapping was PROT_READ only for example.
> > 
> > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> > replacement for BPF MDWE.
> 
> I think "was PROT_WRITE" is an important part of the defense that
> couldn't be done with a simple seccomp filter (which is why the filter
> ended up being a problem in the first place).

I would say "was PROT_WRITE" is slightly more relaxed than "is not
already PROT_EXEC". The seccomp filter can't do "is not already
PROT_EXEC" either since it only checks the mprotect() arguments, not the
current vma flags.

So we have (with sub-cases):

1. Current BPF filter:

   a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails

   b)	mmap(PROT_READ|PROT_EXEC);
	mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// fails

2. "is not already PROT_EXEC":

   a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails

   b)	mmap(PROT_READ|PROT_EXEC);
	mmap(PROT_READ|PROT_EXEC|PROT_BTI);	// passes

   c)	mmap(PROT_READ);
	mprotect(PROT_READ|PROT_EXEC);		// fails

3. "is or was not PROT_WRITE":

   a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails

   b)	mmap(PROT_READ|PROT_EXEC);
	mmap(PROT_READ|PROT_EXEC|PROT_BTI);	// passes

   c)	mmap(PROT_READ);
	mprotect(PROT_READ|PROT_EXEC);		// passes

   d)	mmap(PROT_READ|PROT_WRITE);
	mprotect(PROT_READ);
	mprotect(PROT_READ|PROT_EXEC);		// fails (was write)

The current seccomp filter is the strictest. If we go for (2), it allows
PROT_BTI as in 2.b but prevents 2.c (as would the current seccomp
filter). (3) relaxes 2.c as in 3.c while still preventing 3.d. Both (1)
and (2) prevent 3.d by simply rejecting mprotect(PROT_EXEC).

If we don't care about 3.c, we might as well go for (2). I don't mind,
already went for (3) in this series. I think either of them would not be
a regression on MDWE, unless there is some test that attempts 3.c and
expects it to fail.

-- 
Catalin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-21 16:48           ` Topi Miettinen
@ 2022-04-21 17:28             ` Catalin Marinas
  0 siblings, 0 replies; 13+ messages in thread
From: Catalin Marinas @ 2022-04-21 17:28 UTC (permalink / raw)
  To: Topi Miettinen
  Cc: Kees Cook, Andrew Morton, Christoph Hellwig, Lennart Poettering,
	Zbigniew Jędrzejewski-Szmek, Will Deacon, Alexander Viro,
	Eric Biederman, Szabolcs Nagy, Mark Brown, Jeremy Linton,
	linux-mm, linux-arm-kernel, linux-kernel, linux-abi-devel,
	linux-hardening, Jann Horn, Salvatore Mesoraca, Igor Zhbanov

On Thu, Apr 21, 2022 at 07:48:27PM +0300, Topi Miettinen wrote:
> On 21.4.2022 18.35, Catalin Marinas wrote:
> > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> > the vma is not already PROT_EXEC? The latter is closer to the current
> > systemd approach. The former allows an mprotect(PROT_EXEC) if the
> > mapping was PROT_READ only for example.
> > 
> > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> > replacement for BPF MDWE.
> 
> I think we'd want existing installations with MemoryDenyWriteExecute=yes not
> start failing when the implementation is changed to in-kernel version. The
> implementation could be very simple and not even check existing PROT_ flags
> (except for BTI case) to be maximally compatible to BPF version.

It would have to check the existing flags otherwise this could have been
implemented in the BPF filter. The dynamic loader (or kernel loader)
first mmap(PROT_READ|PROT_EXEC) and, if the BTI note is found, it
switches it to mprotect(PROT_READ|PROT_EXEC|PROT_BTI). If we allowed
this to pass simply because of PROT_BTI, one could create such
executable mapping even if it is (or was) writeable.

So we can either allow mprotect(PROT_EXEC) if the mapping was never
writeable or we allow mprotect(PROT_EXEC) if the mapping is already
PROT_EXEC (and the check for W^X was previously done by mmap()).

> So I'd leave "was PROT_WRITE" and other checks to more advanced
> versions, enabled with a different PR_MDWX_FLAG_.

This works for me as well. See my reply to Kees for the use-cases.

Thanks.

-- 
Catalin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-21 17:24             ` Catalin Marinas
@ 2022-04-21 17:41               ` Kees Cook
  2022-04-21 18:33                 ` Catalin Marinas
  0 siblings, 1 reply; 13+ messages in thread
From: Kees Cook @ 2022-04-21 17:41 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Topi Miettinen, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Will Deacon, Alexander Viro, Eric Biederman, Szabolcs Nagy,
	Mark Brown, Jeremy Linton, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel, linux-hardening, Jann Horn,
	Salvatore Mesoraca, Igor Zhbanov

On Thu, Apr 21, 2022 at 06:24:21PM +0100, Catalin Marinas wrote:
> On Thu, Apr 21, 2022 at 09:42:23AM -0700, Kees Cook wrote:
> > On Thu, Apr 21, 2022 at 04:35:15PM +0100, Catalin Marinas wrote:
> > > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> > > the vma is not already PROT_EXEC? The latter is closer to the current
> > > systemd approach. The former allows an mprotect(PROT_EXEC) if the
> > > mapping was PROT_READ only for example.
> > > 
> > > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> > > replacement for BPF MDWE.
> > 
> > I think "was PROT_WRITE" is an important part of the defense that
> > couldn't be done with a simple seccomp filter (which is why the filter
> > ended up being a problem in the first place).
> 
> I would say "was PROT_WRITE" is slightly more relaxed than "is not
> already PROT_EXEC". The seccomp filter can't do "is not already
> PROT_EXEC" either since it only checks the mprotect() arguments, not the
> current vma flags.
> 
> So we have (with sub-cases):
> 
> 1. Current BPF filter:
> 
>    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> 
>    b)	mmap(PROT_READ|PROT_EXEC);
> 	mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// fails
>
>    c)	mmap(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// fails
>
>    d)	mmap(PROT_READ|PROT_WRITE);
> 	mprotect(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// fails
> 
> 2. "is not already PROT_EXEC":
> 
>    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> 
>    b)	mmap(PROT_READ|PROT_EXEC);
> 	mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// passes
> 
>    c)	mmap(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// fails
>
>    d)	mmap(PROT_READ|PROT_WRITE);
> 	mprotect(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// fails
> 
> 3. "is or was not PROT_WRITE":
> 
>    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> 
>    b)	mmap(PROT_READ|PROT_EXEC);
> 	mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// passes
> 
>    c)	mmap(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// passes
> 
>    d)	mmap(PROT_READ|PROT_WRITE);
> 	mprotect(PROT_READ);
> 	mprotect(PROT_READ|PROT_EXEC);		// fails

[edited above to show each case]

restated what was already summarized:
Problem is 1.b. 2 and 3 solve it. 3 is more relaxed (c passes).

> If we don't care about 3.c, we might as well go for (2). I don't mind,
> already went for (3) in this series. I think either of them would not be
> a regression on MDWE, unless there is some test that attempts 3.c and
> expects it to fail.

I should stop arguing for a less restrictive mode. ;) It just feels weird
that the combinations are API-mediated, rather than logically defined:
I can do PROT_READ|PROT_EXEC with mmap but not mprotect under 2. As
opposed to saying "the vma cannot be executable if it is or ever was
writable". I find the latter much easier to reason about as far as the
expectations of system state.

So, I'd still prefer 3, as that was the _goal_ of the systemd MDWE
seccomp filter, but yes, 2 does provide the same protection while
allowing BTI.

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE)
  2022-04-21 17:41               ` Kees Cook
@ 2022-04-21 18:33                 ` Catalin Marinas
  0 siblings, 0 replies; 13+ messages in thread
From: Catalin Marinas @ 2022-04-21 18:33 UTC (permalink / raw)
  To: Kees Cook
  Cc: Topi Miettinen, Andrew Morton, Christoph Hellwig,
	Lennart Poettering, Zbigniew Jędrzejewski-Szmek,
	Will Deacon, Alexander Viro, Eric Biederman, Szabolcs Nagy,
	Mark Brown, Jeremy Linton, linux-mm, linux-arm-kernel,
	linux-kernel, linux-abi-devel, linux-hardening, Jann Horn,
	Salvatore Mesoraca, Igor Zhbanov

On Thu, Apr 21, 2022 at 10:41:43AM -0700, Kees Cook wrote:
> On Thu, Apr 21, 2022 at 06:24:21PM +0100, Catalin Marinas wrote:
> > On Thu, Apr 21, 2022 at 09:42:23AM -0700, Kees Cook wrote:
> > > On Thu, Apr 21, 2022 at 04:35:15PM +0100, Catalin Marinas wrote:
> > > > Do we want the "was PROT_WRITE" or we just reject mprotect(PROT_EXEC) if
> > > > the vma is not already PROT_EXEC? The latter is closer to the current
> > > > systemd approach. The former allows an mprotect(PROT_EXEC) if the
> > > > mapping was PROT_READ only for example.
> > > > 
> > > > I'd drop the "was PROT_WRITE" for now if the aim is a drop-in
> > > > replacement for BPF MDWE.
> > > 
> > > I think "was PROT_WRITE" is an important part of the defense that
> > > couldn't be done with a simple seccomp filter (which is why the filter
> > > ended up being a problem in the first place).
> > 
> > I would say "was PROT_WRITE" is slightly more relaxed than "is not
> > already PROT_EXEC". The seccomp filter can't do "is not already
> > PROT_EXEC" either since it only checks the mprotect() arguments, not the
> > current vma flags.
> > 
> > So we have (with sub-cases):
> > 
> > 1. Current BPF filter:
> > 
> >    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> > 
> >    b)	mmap(PROT_READ|PROT_EXEC);
> >		mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// fails
> >
> >    c)	mmap(PROT_READ);
> >		mprotect(PROT_READ|PROT_EXEC);		// fails
> >
> >    d)	mmap(PROT_READ|PROT_WRITE);
> >		mprotect(PROT_READ);
> >		mprotect(PROT_READ|PROT_EXEC);		// fails
> > 
> > 2. "is not already PROT_EXEC":
> > 
> >    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> > 
> >    b)	mmap(PROT_READ|PROT_EXEC);
> >		mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// passes
> > 
> >    c)	mmap(PROT_READ);
> >		mprotect(PROT_READ|PROT_EXEC);		// fails
> >
> >    d)	mmap(PROT_READ|PROT_WRITE);
> >		mprotect(PROT_READ);
> >		mprotect(PROT_READ|PROT_EXEC);		// fails
> > 
> > 3. "is or was not PROT_WRITE":
> > 
> >    a)	mmap(PROT_READ|PROT_WRITE|PROT_EXEC);	// fails
> > 
> >    b)	mmap(PROT_READ|PROT_EXEC);
> >		mprotect(PROT_READ|PROT_EXEC|PROT_BTI);	// passes
> > 
> >    c)	mmap(PROT_READ);
> >		mprotect(PROT_READ|PROT_EXEC);		// passes
> > 
> >    d)	mmap(PROT_READ|PROT_WRITE);
> >		mprotect(PROT_READ);
> >	 	mprotect(PROT_READ|PROT_EXEC);		// fails
> 
> [edited above to show each case]

Thanks, I was in a rush to get home ;).

> restated what was already summarized:
> Problem is 1.b. 2 and 3 solve it. 3 is more relaxed (c passes).
> 
> > If we don't care about 3.c, we might as well go for (2). I don't mind,
> > already went for (3) in this series. I think either of them would not be
> > a regression on MDWE, unless there is some test that attempts 3.c and
> > expects it to fail.
> 
> I should stop arguing for a less restrictive mode. ;) It just feels weird
> that the combinations are API-mediated, rather than logically defined:
> I can do PROT_READ|PROT_EXEC with mmap but not mprotect under 2. As
> opposed to saying "the vma cannot be executable if it is or ever was
> writable". I find the latter much easier to reason about as far as the
> expectations of system state.

I had the same reasoning, hence option 3 in this series. I prefer to
treat mmap(PROT_READ|PROT_EXEC) and mprotect(PROT_READ|PROT_EXEC) in a
similar way.

-- 
Catalin

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-04-21 18:33 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20220413134946.2732468-1-catalin.marinas@arm.com>
2022-04-14 18:52 ` [PATCH RFC 0/4] mm, arm64: In-kernel support for memory-deny-write-execute (MDWE) Kees Cook
2022-04-15 20:01   ` Topi Miettinen
2022-04-20 13:01   ` Catalin Marinas
2022-04-20 17:44     ` Kees Cook
2022-04-20 19:34     ` Topi Miettinen
2022-04-20 23:21       ` Kees Cook
2022-04-21 15:35         ` Catalin Marinas
2022-04-21 16:42           ` Kees Cook
2022-04-21 17:24             ` Catalin Marinas
2022-04-21 17:41               ` Kees Cook
2022-04-21 18:33                 ` Catalin Marinas
2022-04-21 16:48           ` Topi Miettinen
2022-04-21 17:28             ` Catalin Marinas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).