* Re: [PATCH] seccomp.2: Explain arch checking, value (non-)truncation, expand example
2015-03-16 23:34 ` Jann Horn
@ 2015-03-17 17:23 ` Kees Cook
2015-03-22 15:58 ` Michael Kerrisk (man-pages)
` (2 subsequent siblings)
3 siblings, 0 replies; 9+ messages in thread
From: Kees Cook @ 2015-03-17 17:23 UTC (permalink / raw)
To: Jann Horn; +Cc: Michael Kerrisk, linux-man, LKML, Andy Lutomirski, Will Drewry
On Mon, Mar 16, 2015 at 4:34 PM, Jann Horn <jann@thejh.net> wrote:
> On Mon, Mar 16, 2015 at 03:25:56PM -0700, Kees Cook wrote:
>> On Mon, Mar 16, 2015 at 11:01 AM, Jann Horn <jann@thejh.net> wrote:
>> > Document some more-or-less surprising things about seccomp.
>> > I'm not sure whether changing the example code like that is a
>> > good idea - maybe that part of the patch should be left out?
>> >
>> > Demo code for the X32 issue:
>> > https://gist.github.com/thejh/c5b670a816bbb9791a6d
>> >
>> > Demo code for full 64bit registers being visible in seccomp
>> > if the i386 ABI is used on a 64bit system:
>> > https://gist.github.com/thejh/c37b27aefc44ab775db5
>>
>> So, it is probably worth noting the x32 ABI somewhere, and seccomp.2
>> is probably reasonable, though maybe it should be explicitly detailed
>> in syscall.2?
>
> I guess that would be sensible. However, I still think that the seccomp
> manpage should mention it, too, or advise the reader to also carefully
> read the syscall.2 manpage.
Yeah, I think both should include it.
>
>
>> In the seccomp.2 manpage, though, I think we should discourage
>> blacklisting, since whitelisting is a much more effective way to do
>> attack surface reduction on syscalls. (And, as such, x32 would be
>> already eliminated from x86-64 filters.)
>
> I agree, whitelisting should be encouraged. However, as far as I can
> tell, people use seccomp not only for proper, strict sandboxing, but
> also to e.g. fix small security problems in containers or to provide
> additional precautions for them. In that case, I think that the use
> of a blacklist is more understandable, and various project use
> seccomp that way or at least support the use of blacklists: The
> default policy of LXC is a blacklist, sandstorm.io uses a seccomp
> blacklist and blacklists specific ptrace calls, systemd-nspawn uses a
> blacklist (although the manpage says that that's meant as a precaution
> against accidental damage, not as a security measure), systemd
> supports both whitelists and blacklists in the SystemCallFilter
> directive.
>
>
>> It is, however, reasonable to update the example just so it can be
>> super-explicit, though I'd change the comments to say something more
>> direct about the whitelisting vs blacklisting, like "While this
>> example uses whitelisting,
>
> You mean you would want to change the example to use whitelisting?
> That sounds like a good idea.
Oh man, I clearly didn't have enough coffee. Yeah, the example is
blacklisting, isn't it? Let's leave that alone and just add the mask
check, as you recommended. I still think it should avoid the AND so we
don't have to reload the syscall nr, though.
Thanks!
-Kees
>
>
>> this is how an overlapping syscall ABI
>> could be tested." or something. Additionally, I think it would be
>> better to test for >= instead of & to avoid having to reload the
>> syscall nr.
>
> Yes, sounds good.
--
Kees Cook
Chrome OS Security
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] seccomp.2: Explain arch checking, value (non-)truncation, expand example
2015-03-16 23:34 ` Jann Horn
2015-03-17 17:23 ` Kees Cook
@ 2015-03-22 15:58 ` Michael Kerrisk (man-pages)
2015-03-24 18:38 ` [PATCH v2 1/2] seccomp.2: Explain blacklisting problems, " Jann Horn
2015-03-24 18:40 ` [PATCH v2 2/2] syscall.2: add x32 ABI Jann Horn
3 siblings, 0 replies; 9+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-03-22 15:58 UTC (permalink / raw)
To: Jann Horn; +Cc: Kees Cook, linux-man, LKML, Andy Lutomirski, Will Drewry
Hi Jann,
Thanks for working on this page. I'm a little unclear at this point.
Am I correct to assume you are going to revise this patch in the light
of Kees's comments and send a new patch?
Thanks,
Michael
On 17 March 2015 at 00:34, Jann Horn <jann@thejh.net> wrote:
> On Mon, Mar 16, 2015 at 03:25:56PM -0700, Kees Cook wrote:
>> On Mon, Mar 16, 2015 at 11:01 AM, Jann Horn <jann@thejh.net> wrote:
>> > Document some more-or-less surprising things about seccomp.
>> > I'm not sure whether changing the example code like that is a
>> > good idea - maybe that part of the patch should be left out?
>> >
>> > Demo code for the X32 issue:
>> > https://gist.github.com/thejh/c5b670a816bbb9791a6d
>> >
>> > Demo code for full 64bit registers being visible in seccomp
>> > if the i386 ABI is used on a 64bit system:
>> > https://gist.github.com/thejh/c37b27aefc44ab775db5
>>
>> So, it is probably worth noting the x32 ABI somewhere, and seccomp.2
>> is probably reasonable, though maybe it should be explicitly detailed
>> in syscall.2?
>
> I guess that would be sensible. However, I still think that the seccomp
> manpage should mention it, too, or advise the reader to also carefully
> read the syscall.2 manpage.
>
>
>> In the seccomp.2 manpage, though, I think we should discourage
>> blacklisting, since whitelisting is a much more effective way to do
>> attack surface reduction on syscalls. (And, as such, x32 would be
>> already eliminated from x86-64 filters.)
>
> I agree, whitelisting should be encouraged. However, as far as I can
> tell, people use seccomp not only for proper, strict sandboxing, but
> also to e.g. fix small security problems in containers or to provide
> additional precautions for them. In that case, I think that the use
> of a blacklist is more understandable, and various project use
> seccomp that way or at least support the use of blacklists: The
> default policy of LXC is a blacklist, sandstorm.io uses a seccomp
> blacklist and blacklists specific ptrace calls, systemd-nspawn uses a
> blacklist (although the manpage says that that's meant as a precaution
> against accidental damage, not as a security measure), systemd
> supports both whitelists and blacklists in the SystemCallFilter
> directive.
>
>
>> It is, however, reasonable to update the example just so it can be
>> super-explicit, though I'd change the comments to say something more
>> direct about the whitelisting vs blacklisting, like "While this
>> example uses whitelisting,
>
> You mean you would want to change the example to use whitelisting?
> That sounds like a good idea.
>
>
>> this is how an overlapping syscall ABI
>> could be tested." or something. Additionally, I think it would be
>> better to test for >= instead of & to avoid having to reload the
>> syscall nr.
>
> Yes, sounds good.
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v2 1/2] seccomp.2: Explain blacklisting problems, expand example
2015-03-16 23:34 ` Jann Horn
2015-03-17 17:23 ` Kees Cook
2015-03-22 15:58 ` Michael Kerrisk (man-pages)
@ 2015-03-24 18:38 ` Jann Horn
2015-03-29 16:01 ` Michael Kerrisk (man-pages)
2015-03-24 18:40 ` [PATCH v2 2/2] syscall.2: add x32 ABI Jann Horn
3 siblings, 1 reply; 9+ messages in thread
From: Jann Horn @ 2015-03-24 18:38 UTC (permalink / raw)
To: Kees Cook; +Cc: Michael Kerrisk, linux-man, LKML, Andy Lutomirski, Will Drewry
---
man2/seccomp.2 | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 67 insertions(+), 6 deletions(-)
diff --git a/man2/seccomp.2 b/man2/seccomp.2
index e2a5060..b596fb8 100644
--- a/man2/seccomp.2
+++ b/man2/seccomp.2
@@ -250,6 +250,55 @@ struct seccomp_data {
.fi
.in
+Because the numbers of system calls vary between architectures and
+some architectures (e.g. X86-64) allow user-space code to use
+the calling conventions of multiple architectures, it is usually
+necessary to verify the value of the
+.IR arch
+field.
+
+It is strongly recommended to use a whitelisting approach whenever
+possible because such an approach is more robust and simple.
+A blacklist will have to be updated whenever a potentially
+dangerous syscall is added (or a dangerous flag or option if those
+are blacklisted), and it is often possible to alter the
+representation of a value without altering its meaning, leading to
+a blacklist bypass.
+
+The
+.IR arch
+field is not unique for all calling conventions. The X86-64 ABI and
+the X32 ABI both use
+.BR AUDIT_ARCH_X86_64
+as
+.IR arch ,
+and they run on the same processors. Instead, the mask
+.BR __X32_SYSCALL_BIT
+is used on the system call number to tell the two ABIs apart.
+This means that in order to create a seccomp-based
+blacklist for system calls performed through the X86-64 ABI,
+it is necessary to not only check that
+.IR arch
+equals
+.BR AUDIT_ARCH_X86_64 ,
+but also to explicitly reject all syscalls that contain
+.BR __X32_SYSCALL_BIT
+in
+.IR nr .
+
+When checking values from
+.IR args
+against a blacklist, keep in mind that arguments are often
+silently truncated before being processed, but after the seccomp
+check. For example, this happens if the i386 ABI is used on an
+X86-64 kernel: Although the kernel will normally not look beyond
+the 32 lowest bits of the arguments, the values of the full
+64-bit registers will be present in the seccomp data. A less
+surprising example is that if the X86-64 ABI is used to perform
+a syscall that takes an argument of type int, the
+more-significant half of the argument register is ignored by
+the syscall, but visible in the seccomp data.
+
A seccomp filter returns a 32-bit value consisting of two parts:
the most significant 16 bits
(corresponding to the mask defined by the constant
@@ -616,38 +665,50 @@ cecilia
#include <linux/seccomp.h>
#include <sys/prctl.h>
+#define X32_SYSCALL_BIT 0x40000000
+
static int
install_filter(int syscall_nr, int t_arch, int f_errno)
{
+ unsigned int upper_nr_limit = 0xffffffff;
+ /* assume that AUDIT_ARCH_X86_64 means the normal X86-64 ABI */
+ if (t_arch == AUDIT_ARCH_X86_64)
+ upper_nr_limit = X32_SYSCALL_BIT - 1;
+
struct sock_filter filter[] = {
/* [0] Load architecture from 'seccomp_data' buffer into
accumulator */
BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
(offsetof(struct seccomp_data, arch))),
- /* [1] Jump forward 4 instructions if architecture does not
+ /* [1] Jump forward 5 instructions if architecture does not
match 't_arch' */
- BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 4),
+ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 5),
/* [2] Load system call number from 'seccomp_data' buffer into
accumulator */
BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
(offsetof(struct seccomp_data, nr))),
- /* [3] Jump forward 1 instruction if system call number
+ /* [3] Check ABI - only needed for X86-64 in blacklist usecases.
+ Use JGT instead of checking against the bitmask to avoid
+ having to reload the syscall number. */
+ BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, upper_nr_limit, 3, 0),
+
+ /* [4] Jump forward 1 instruction if system call number
does not match 'syscall_nr' */
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall_nr, 0, 1),
- /* [4] Matching architecture and system call: don't execute
+ /* [5] Matching architecture and system call: don't execute
the system call, and return 'f_errno' in 'errno' */
BPF_STMT(BPF_RET | BPF_K,
SECCOMP_RET_ERRNO | (f_errno & SECCOMP_RET_DATA)),
- /* [5] Destination of system call number mismatch: allow other
+ /* [6] Destination of system call number mismatch: allow other
system calls */
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
- /* [6] Destination of architecture mismatch: kill process */
+ /* [7] Destination of architecture mismatch: kill process */
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
};
--
2.1.4
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v2 1/2] seccomp.2: Explain blacklisting problems, expand example
2015-03-24 18:38 ` [PATCH v2 1/2] seccomp.2: Explain blacklisting problems, " Jann Horn
@ 2015-03-29 16:01 ` Michael Kerrisk (man-pages)
0 siblings, 0 replies; 9+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-03-29 16:01 UTC (permalink / raw)
To: Jann Horn, Kees Cook
Cc: mtk.manpages, linux-man, LKML, Andy Lutomirski, Will Drewry
Hi Jann,
Thanks for the patch. I've applied it.
Cheers,
Michael
On 03/24/2015 07:38 PM, Jann Horn wrote:
> ---
> man2/seccomp.2 | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 67 insertions(+), 6 deletions(-)
>
> diff --git a/man2/seccomp.2 b/man2/seccomp.2
> index e2a5060..b596fb8 100644
> --- a/man2/seccomp.2
> +++ b/man2/seccomp.2
> @@ -250,6 +250,55 @@ struct seccomp_data {
> .fi
> .in
>
> +Because the numbers of system calls vary between architectures and
> +some architectures (e.g. X86-64) allow user-space code to use
> +the calling conventions of multiple architectures, it is usually
> +necessary to verify the value of the
> +.IR arch
> +field.
> +
> +It is strongly recommended to use a whitelisting approach whenever
> +possible because such an approach is more robust and simple.
> +A blacklist will have to be updated whenever a potentially
> +dangerous syscall is added (or a dangerous flag or option if those
> +are blacklisted), and it is often possible to alter the
> +representation of a value without altering its meaning, leading to
> +a blacklist bypass.
> +
> +The
> +.IR arch
> +field is not unique for all calling conventions. The X86-64 ABI and
> +the X32 ABI both use
> +.BR AUDIT_ARCH_X86_64
> +as
> +.IR arch ,
> +and they run on the same processors. Instead, the mask
> +.BR __X32_SYSCALL_BIT
> +is used on the system call number to tell the two ABIs apart.
> +This means that in order to create a seccomp-based
> +blacklist for system calls performed through the X86-64 ABI,
> +it is necessary to not only check that
> +.IR arch
> +equals
> +.BR AUDIT_ARCH_X86_64 ,
> +but also to explicitly reject all syscalls that contain
> +.BR __X32_SYSCALL_BIT
> +in
> +.IR nr .
> +
> +When checking values from
> +.IR args
> +against a blacklist, keep in mind that arguments are often
> +silently truncated before being processed, but after the seccomp
> +check. For example, this happens if the i386 ABI is used on an
> +X86-64 kernel: Although the kernel will normally not look beyond
> +the 32 lowest bits of the arguments, the values of the full
> +64-bit registers will be present in the seccomp data. A less
> +surprising example is that if the X86-64 ABI is used to perform
> +a syscall that takes an argument of type int, the
> +more-significant half of the argument register is ignored by
> +the syscall, but visible in the seccomp data.
> +
> A seccomp filter returns a 32-bit value consisting of two parts:
> the most significant 16 bits
> (corresponding to the mask defined by the constant
> @@ -616,38 +665,50 @@ cecilia
> #include <linux/seccomp.h>
> #include <sys/prctl.h>
>
> +#define X32_SYSCALL_BIT 0x40000000
> +
> static int
> install_filter(int syscall_nr, int t_arch, int f_errno)
> {
> + unsigned int upper_nr_limit = 0xffffffff;
> + /* assume that AUDIT_ARCH_X86_64 means the normal X86-64 ABI */
> + if (t_arch == AUDIT_ARCH_X86_64)
> + upper_nr_limit = X32_SYSCALL_BIT - 1;
> +
> struct sock_filter filter[] = {
> /* [0] Load architecture from 'seccomp_data' buffer into
> accumulator */
> BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
> (offsetof(struct seccomp_data, arch))),
>
> - /* [1] Jump forward 4 instructions if architecture does not
> + /* [1] Jump forward 5 instructions if architecture does not
> match 't_arch' */
> - BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 4),
> + BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 5),
>
> /* [2] Load system call number from 'seccomp_data' buffer into
> accumulator */
> BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
> (offsetof(struct seccomp_data, nr))),
>
> - /* [3] Jump forward 1 instruction if system call number
> + /* [3] Check ABI - only needed for X86-64 in blacklist usecases.
> + Use JGT instead of checking against the bitmask to avoid
> + having to reload the syscall number. */
> + BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, upper_nr_limit, 3, 0),
> +
> + /* [4] Jump forward 1 instruction if system call number
> does not match 'syscall_nr' */
> BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall_nr, 0, 1),
>
> - /* [4] Matching architecture and system call: don't execute
> + /* [5] Matching architecture and system call: don't execute
> the system call, and return 'f_errno' in 'errno' */
> BPF_STMT(BPF_RET | BPF_K,
> SECCOMP_RET_ERRNO | (f_errno & SECCOMP_RET_DATA)),
>
> - /* [5] Destination of system call number mismatch: allow other
> + /* [6] Destination of system call number mismatch: allow other
> system calls */
> BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
>
> - /* [6] Destination of architecture mismatch: kill process */
> + /* [7] Destination of architecture mismatch: kill process */
> BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
> };
>
>
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v2 2/2] syscall.2: add x32 ABI
2015-03-16 23:34 ` Jann Horn
` (2 preceding siblings ...)
2015-03-24 18:38 ` [PATCH v2 1/2] seccomp.2: Explain blacklisting problems, " Jann Horn
@ 2015-03-24 18:40 ` Jann Horn
2015-04-21 14:01 ` Michael Kerrisk (man-pages)
3 siblings, 1 reply; 9+ messages in thread
From: Jann Horn @ 2015-03-24 18:40 UTC (permalink / raw)
To: Kees Cook
Cc: Michael Kerrisk, linux-man, LKML, Andy Lutomirski, Will Drewry,
H. Peter Anvin
[added H. Peter Anvin to CC for this patch because he
seems to have contributed large parts of the X32 code]
---
man2/syscall.2 | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/man2/syscall.2 b/man2/syscall.2
index ef8f3cf..1f25255 100644
--- a/man2/syscall.2
+++ b/man2/syscall.2
@@ -170,12 +170,18 @@ s390 svc 0 r1 r2 See below
s390x svc 0 r1 r2 See below
sparc/32 t 0x10 g1 o0
sparc/64 t 0x6d g1 o0
-x86_64 syscall rax rax
+x86_64 syscall rax rax See below
+x32 syscall rax rax See below
.TE
.PP
For s390 and s390x, NR (the system call number)
may be passed directly with "svc NR" if it is less than 256.
+The x32 ABI uses the same instruction as the x86_64 ABI and is used on
+the same processors. To differentiate between them, the bitmask
+.I __X32_SYSCALL_BIT
+is bitwise-ORed into the syscall number for syscalls under the x32 ABI.
+
On a few architectures,
a register is used to indicate simple boolean failure of the system call:
ia64 uses
@@ -210,6 +216,7 @@ s390x r2 r3 r4 r5 r6 r7 -
sparc/32 o0 o1 o2 o3 o4 o5 -
sparc/64 o0 o1 o2 o3 o4 o5 -
x86_64 rdi rsi rdx r10 r8 r9 -
+x32 rdi rsi rdx r10 r8 r9 -
.TE
.PP
The mips/o32 system call convention passes
--
2.1.4
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v2 2/2] syscall.2: add x32 ABI
2015-03-24 18:40 ` [PATCH v2 2/2] syscall.2: add x32 ABI Jann Horn
@ 2015-04-21 14:01 ` Michael Kerrisk (man-pages)
0 siblings, 0 replies; 9+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-04-21 14:01 UTC (permalink / raw)
To: Jann Horn
Cc: Kees Cook, linux-man, LKML, Andy Lutomirski, Will Drewry, H. Peter Anvin
Hi Jann,
On 24 March 2015 at 19:40, Jann Horn <jann@thejh.net> wrote:
> [added H. Peter Anvin to CC for this patch because he
> seems to have contributed large parts of the X32 code]
It looks like I did not reply to this mail. But as you'll probably have seen
from the man-pages-3.83 release announcement, I did apply this patch.
Thanks!
Cheers,
Michael
>
> ---
> man2/syscall.2 | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/man2/syscall.2 b/man2/syscall.2
> index ef8f3cf..1f25255 100644
> --- a/man2/syscall.2
> +++ b/man2/syscall.2
> @@ -170,12 +170,18 @@ s390 svc 0 r1 r2 See below
> s390x svc 0 r1 r2 See below
> sparc/32 t 0x10 g1 o0
> sparc/64 t 0x6d g1 o0
> -x86_64 syscall rax rax
> +x86_64 syscall rax rax See below
> +x32 syscall rax rax See below
> .TE
> .PP
> For s390 and s390x, NR (the system call number)
> may be passed directly with "svc NR" if it is less than 256.
>
> +The x32 ABI uses the same instruction as the x86_64 ABI and is used on
> +the same processors. To differentiate between them, the bitmask
> +.I __X32_SYSCALL_BIT
> +is bitwise-ORed into the syscall number for syscalls under the x32 ABI.
> +
> On a few architectures,
> a register is used to indicate simple boolean failure of the system call:
> ia64 uses
> @@ -210,6 +216,7 @@ s390x r2 r3 r4 r5 r6 r7 -
> sparc/32 o0 o1 o2 o3 o4 o5 -
> sparc/64 o0 o1 o2 o3 o4 o5 -
> x86_64 rdi rsi rdx r10 r8 r9 -
> +x32 rdi rsi rdx r10 r8 r9 -
> .TE
> .PP
> The mips/o32 system call convention passes
> --
> 2.1.4
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
^ permalink raw reply [flat|nested] 9+ messages in thread