linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] seccomp.2: Explain arch checking, value (non-)truncation, expand example
@ 2015-03-16 18:01 Jann Horn
  2015-03-16 22:25 ` Kees Cook
  0 siblings, 1 reply; 9+ messages in thread
From: Jann Horn @ 2015-03-16 18:01 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: linux-man, linux-kernel, Kees Cook, Andy Lutomirski, Will Drewry

Document some more-or-less surprising things about seccomp.
I'm not sure whether changing the example code like that is a
good idea - maybe that part of the patch should be left out?

Demo code for the X32 issue:
https://gist.github.com/thejh/c5b670a816bbb9791a6d

Demo code for full 64bit registers being visible in seccomp
if the i386 ABI is used on a 64bit system:
https://gist.github.com/thejh/c37b27aefc44ab775db5

---
 man2/seccomp.2 | 72 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 66 insertions(+), 6 deletions(-)

diff --git a/man2/seccomp.2 b/man2/seccomp.2
index 702ceb8..307a408 100644
--- a/man2/seccomp.2
+++ b/man2/seccomp.2
@@ -223,6 +223,47 @@ struct seccomp_data {
 .fi
 .in
 
+Because the numbers of system calls vary between architectures and
+some architectures (e.g. X86-64) allow user-space code to use
+the calling conventions of multiple architectures, it is usually
+necessary to verify the value of the
+.IR arch
+field.
+
+The
+.IR arch
+field is not unique for all calling conventions. The X86-64 ABI and
+the X32 ABI both use
+.BR AUDIT_ARCH_X86_64
+as
+.IR arch ,
+and they run on the same processors. Instead, the mask
+.BR __X32_SYSCALL_BIT
+is used on the system call number to tell the two ABIs apart.
+This means that in order to create a seccomp-based
+blacklist for system calls performed through the X86-64 ABI,
+it is necessary to not only check that
+.IR arch
+equals
+.BR AUDIT_ARCH_X86_64 ,
+but also to explicitly reject all syscalls that contain
+.BR __X32_SYSCALL_BIT
+in
+.IR nr .
+
+When checking values from
+.IR args
+against a blacklist, keep in mind that arguments are often
+silently truncated before being processed, but after the seccomp
+check. For example, this happens if the i386 ABI is used on an
+X86-64 kernel: Although the kernel will normally not look beyond
+the 32 lowest bits of the arguments, the values of the full
+64-bit registers will be present in the seccomp data. A less
+surprising example is that if any 64-bit ABI is used to perform
+a syscall that takes an argument of type int, the
+more-significant half of the argument register is ignored by
+the syscall, but visible in the seccomp data.
+
 A seccomp filter returns a 32-bit value consisting of two parts:
 the most significant 16 bits
 (corresponding to the mask defined by the constant
@@ -584,38 +625,57 @@ cecilia
 #include <linux/seccomp.h>
 #include <sys/prctl.h>
 
+#define X32_SYSCALL_BIT 0x40000000
+
 static int
 install_filter(int syscall_nr, int t_arch, int f_errno)
 {
+    int forbidden_bitmask = 0;
+    /* assume that AUDIT_ARCH_X86_64 means the normal X86-64 ABI */
+    if (t_arch == AUDIT_ARCH_X86_64)
+        forbidden_bitmask = X32_SYSCALL_BIT;
+
     struct sock_filter filter[] = {
         /* [0] Load architecture from 'seccomp_data' buffer into
                accumulator */
         BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                  (offsetof(struct seccomp_data, arch))),
 
-        /* [1] Jump forward 4 instructions if architecture does not
+        /* [1] Jump forward 7 instructions if architecture does not
                match 't_arch' */
-        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 4),
+        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 7),
 
         /* [2] Load system call number from 'seccomp_data' buffer into
                accumulator */
         BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                  (offsetof(struct seccomp_data, nr))),
 
-        /* [3] Jump forward 1 instruction if system call number
+        /* [3] Determine ABI from system call number - only needed for X86-64
+               in blacklist usecases */
+        BPF_STMT(BPF_ALU | BPF_AND | BPF_K, forbidden_bitmask),
+
+        /* [4] Check ABI - only needed for X86-64 in blacklist usecases */
+        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 4),
+
+        /* [5] Load system call number from 'seccomp_data' buffer into
+               accumulator */
+        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
+                 (offsetof(struct seccomp_data, nr))),
+
+        /* [6] Jump forward 1 instruction if system call number
                does not match 'syscall_nr' */
         BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall_nr, 0, 1),
 
-        /* [4] Matching architecture and system call: don't execute
+        /* [7] Matching architecture and system call: don't execute
 	       the system call, and return 'f_errno' in 'errno' */
         BPF_STMT(BPF_RET | BPF_K,
                  SECCOMP_RET_ERRNO | (f_errno & SECCOMP_RET_DATA)),
 
-        /* [5] Destination of system call number mismatch: allow other
+        /* [8] Destination of system call number mismatch: allow other
                system calls */
         BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
 
-        /* [6] Destination of architecture mismatch: kill process */
+        /* [9] Destination of architecture mismatch: kill process */
         BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
     };
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] seccomp.2: Explain arch checking, value (non-)truncation, expand example
  2015-03-16 18:01 [PATCH] seccomp.2: Explain arch checking, value (non-)truncation, expand example Jann Horn
@ 2015-03-16 22:25 ` Kees Cook
  2015-03-16 23:34   ` Jann Horn
  0 siblings, 1 reply; 9+ messages in thread
From: Kees Cook @ 2015-03-16 22:25 UTC (permalink / raw)
  To: Jann Horn; +Cc: Michael Kerrisk, linux-man, LKML, Andy Lutomirski, Will Drewry

On Mon, Mar 16, 2015 at 11:01 AM, Jann Horn <jann@thejh.net> wrote:
> Document some more-or-less surprising things about seccomp.
> I'm not sure whether changing the example code like that is a
> good idea - maybe that part of the patch should be left out?
>
> Demo code for the X32 issue:
> https://gist.github.com/thejh/c5b670a816bbb9791a6d
>
> Demo code for full 64bit registers being visible in seccomp
> if the i386 ABI is used on a 64bit system:
> https://gist.github.com/thejh/c37b27aefc44ab775db5

So, it is probably worth noting the x32 ABI somewhere, and seccomp.2
is probably reasonable, though maybe it should be explicitly detailed
in syscall.2?

In the seccomp.2 manpage, though, I think we should discourage
blacklisting, since whitelisting is a much more effective way to do
attack surface reduction on syscalls. (And, as such, x32 would be
already eliminated from x86-64 filters.)

It is, however, reasonable to update the example just so it can be
super-explicit, though I'd change the comments to say something more
direct about the whitelisting vs blacklisting, like "While this
example uses whitelisting, this is how an overlapping syscall ABI
could be tested." or something. Additionally, I think it would be
better to test for >= instead of & to avoid having to reload the
syscall nr.

Thanks for this clarification!

-Kees

>
> ---
>  man2/seccomp.2 | 72 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 66 insertions(+), 6 deletions(-)
>
> diff --git a/man2/seccomp.2 b/man2/seccomp.2
> index 702ceb8..307a408 100644
> --- a/man2/seccomp.2
> +++ b/man2/seccomp.2
> @@ -223,6 +223,47 @@ struct seccomp_data {
>  .fi
>  .in
>
> +Because the numbers of system calls vary between architectures and
> +some architectures (e.g. X86-64) allow user-space code to use
> +the calling conventions of multiple architectures, it is usually
> +necessary to verify the value of the
> +.IR arch
> +field.
> +
> +The
> +.IR arch
> +field is not unique for all calling conventions. The X86-64 ABI and
> +the X32 ABI both use
> +.BR AUDIT_ARCH_X86_64
> +as
> +.IR arch ,
> +and they run on the same processors. Instead, the mask
> +.BR __X32_SYSCALL_BIT
> +is used on the system call number to tell the two ABIs apart.
> +This means that in order to create a seccomp-based
> +blacklist for system calls performed through the X86-64 ABI,
> +it is necessary to not only check that
> +.IR arch
> +equals
> +.BR AUDIT_ARCH_X86_64 ,
> +but also to explicitly reject all syscalls that contain
> +.BR __X32_SYSCALL_BIT
> +in
> +.IR nr .
> +
> +When checking values from
> +.IR args
> +against a blacklist, keep in mind that arguments are often
> +silently truncated before being processed, but after the seccomp
> +check. For example, this happens if the i386 ABI is used on an
> +X86-64 kernel: Although the kernel will normally not look beyond
> +the 32 lowest bits of the arguments, the values of the full
> +64-bit registers will be present in the seccomp data. A less
> +surprising example is that if any 64-bit ABI is used to perform
> +a syscall that takes an argument of type int, the
> +more-significant half of the argument register is ignored by
> +the syscall, but visible in the seccomp data.
> +
>  A seccomp filter returns a 32-bit value consisting of two parts:
>  the most significant 16 bits
>  (corresponding to the mask defined by the constant
> @@ -584,38 +625,57 @@ cecilia
>  #include <linux/seccomp.h>
>  #include <sys/prctl.h>
>
> +#define X32_SYSCALL_BIT 0x40000000
> +
>  static int
>  install_filter(int syscall_nr, int t_arch, int f_errno)
>  {
> +    int forbidden_bitmask = 0;
> +    /* assume that AUDIT_ARCH_X86_64 means the normal X86-64 ABI */
> +    if (t_arch == AUDIT_ARCH_X86_64)
> +        forbidden_bitmask = X32_SYSCALL_BIT;
> +
>      struct sock_filter filter[] = {
>          /* [0] Load architecture from 'seccomp_data' buffer into
>                 accumulator */
>          BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
>                   (offsetof(struct seccomp_data, arch))),
>
> -        /* [1] Jump forward 4 instructions if architecture does not
> +        /* [1] Jump forward 7 instructions if architecture does not
>                 match 't_arch' */
> -        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 4),
> +        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 7),
>
>          /* [2] Load system call number from 'seccomp_data' buffer into
>                 accumulator */
>          BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
>                   (offsetof(struct seccomp_data, nr))),
>
> -        /* [3] Jump forward 1 instruction if system call number
> +        /* [3] Determine ABI from system call number - only needed for X86-64
> +               in blacklist usecases */
> +        BPF_STMT(BPF_ALU | BPF_AND | BPF_K, forbidden_bitmask),
> +
> +        /* [4] Check ABI - only needed for X86-64 in blacklist usecases */
> +        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 4),
> +
> +        /* [5] Load system call number from 'seccomp_data' buffer into
> +               accumulator */
> +        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
> +                 (offsetof(struct seccomp_data, nr))),
> +
> +        /* [6] Jump forward 1 instruction if system call number
>                 does not match 'syscall_nr' */
>          BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall_nr, 0, 1),
>
> -        /* [4] Matching architecture and system call: don't execute
> +        /* [7] Matching architecture and system call: don't execute
>                the system call, and return 'f_errno' in 'errno' */
>          BPF_STMT(BPF_RET | BPF_K,
>                   SECCOMP_RET_ERRNO | (f_errno & SECCOMP_RET_DATA)),
>
> -        /* [5] Destination of system call number mismatch: allow other
> +        /* [8] Destination of system call number mismatch: allow other
>                 system calls */
>          BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
>
> -        /* [6] Destination of architecture mismatch: kill process */
> +        /* [9] Destination of architecture mismatch: kill process */
>          BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
>      };
>
> --
> 2.1.4



-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] seccomp.2: Explain arch checking, value (non-)truncation, expand example
  2015-03-16 22:25 ` Kees Cook
@ 2015-03-16 23:34   ` Jann Horn
  2015-03-17 17:23     ` Kees Cook
                       ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Jann Horn @ 2015-03-16 23:34 UTC (permalink / raw)
  To: Kees Cook; +Cc: Michael Kerrisk, linux-man, LKML, Andy Lutomirski, Will Drewry

On Mon, Mar 16, 2015 at 03:25:56PM -0700, Kees Cook wrote:
> On Mon, Mar 16, 2015 at 11:01 AM, Jann Horn <jann@thejh.net> wrote:
> > Document some more-or-less surprising things about seccomp.
> > I'm not sure whether changing the example code like that is a
> > good idea - maybe that part of the patch should be left out?
> >
> > Demo code for the X32 issue:
> > https://gist.github.com/thejh/c5b670a816bbb9791a6d
> >
> > Demo code for full 64bit registers being visible in seccomp
> > if the i386 ABI is used on a 64bit system:
> > https://gist.github.com/thejh/c37b27aefc44ab775db5
> 
> So, it is probably worth noting the x32 ABI somewhere, and seccomp.2
> is probably reasonable, though maybe it should be explicitly detailed
> in syscall.2?

I guess that would be sensible. However, I still think that the seccomp
manpage should mention it, too, or advise the reader to also carefully
read the syscall.2 manpage.


> In the seccomp.2 manpage, though, I think we should discourage
> blacklisting, since whitelisting is a much more effective way to do
> attack surface reduction on syscalls. (And, as such, x32 would be
> already eliminated from x86-64 filters.)

I agree, whitelisting should be encouraged. However, as far as I can
tell, people use seccomp not only for proper, strict sandboxing, but
also to e.g. fix small security problems in containers or to provide
additional precautions for them. In that case, I think that the use
of a blacklist is more understandable, and various project use
seccomp that way or at least support the use of blacklists: The
default policy of LXC is a blacklist, sandstorm.io uses a seccomp
blacklist and blacklists specific ptrace calls, systemd-nspawn uses a
blacklist (although the manpage says that that's meant as a precaution
against accidental damage, not as a security measure), systemd
supports both whitelists and blacklists in the SystemCallFilter
directive.


> It is, however, reasonable to update the example just so it can be
> super-explicit, though I'd change the comments to say something more
> direct about the whitelisting vs blacklisting, like "While this
> example uses whitelisting,

You mean you would want to change the example to use whitelisting?
That sounds like a good idea.


> this is how an overlapping syscall ABI
> could be tested." or something. Additionally, I think it would be
> better to test for >= instead of & to avoid having to reload the
> syscall nr.

Yes, sounds good.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] seccomp.2: Explain arch checking, value (non-)truncation, expand example
  2015-03-16 23:34   ` Jann Horn
@ 2015-03-17 17:23     ` Kees Cook
  2015-03-22 15:58     ` Michael Kerrisk (man-pages)
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Kees Cook @ 2015-03-17 17:23 UTC (permalink / raw)
  To: Jann Horn; +Cc: Michael Kerrisk, linux-man, LKML, Andy Lutomirski, Will Drewry

On Mon, Mar 16, 2015 at 4:34 PM, Jann Horn <jann@thejh.net> wrote:
> On Mon, Mar 16, 2015 at 03:25:56PM -0700, Kees Cook wrote:
>> On Mon, Mar 16, 2015 at 11:01 AM, Jann Horn <jann@thejh.net> wrote:
>> > Document some more-or-less surprising things about seccomp.
>> > I'm not sure whether changing the example code like that is a
>> > good idea - maybe that part of the patch should be left out?
>> >
>> > Demo code for the X32 issue:
>> > https://gist.github.com/thejh/c5b670a816bbb9791a6d
>> >
>> > Demo code for full 64bit registers being visible in seccomp
>> > if the i386 ABI is used on a 64bit system:
>> > https://gist.github.com/thejh/c37b27aefc44ab775db5
>>
>> So, it is probably worth noting the x32 ABI somewhere, and seccomp.2
>> is probably reasonable, though maybe it should be explicitly detailed
>> in syscall.2?
>
> I guess that would be sensible. However, I still think that the seccomp
> manpage should mention it, too, or advise the reader to also carefully
> read the syscall.2 manpage.

Yeah, I think both should include it.

>
>
>> In the seccomp.2 manpage, though, I think we should discourage
>> blacklisting, since whitelisting is a much more effective way to do
>> attack surface reduction on syscalls. (And, as such, x32 would be
>> already eliminated from x86-64 filters.)
>
> I agree, whitelisting should be encouraged. However, as far as I can
> tell, people use seccomp not only for proper, strict sandboxing, but
> also to e.g. fix small security problems in containers or to provide
> additional precautions for them. In that case, I think that the use
> of a blacklist is more understandable, and various project use
> seccomp that way or at least support the use of blacklists: The
> default policy of LXC is a blacklist, sandstorm.io uses a seccomp
> blacklist and blacklists specific ptrace calls, systemd-nspawn uses a
> blacklist (although the manpage says that that's meant as a precaution
> against accidental damage, not as a security measure), systemd
> supports both whitelists and blacklists in the SystemCallFilter
> directive.
>
>
>> It is, however, reasonable to update the example just so it can be
>> super-explicit, though I'd change the comments to say something more
>> direct about the whitelisting vs blacklisting, like "While this
>> example uses whitelisting,
>
> You mean you would want to change the example to use whitelisting?
> That sounds like a good idea.

Oh man, I clearly didn't have enough coffee. Yeah, the example is
blacklisting, isn't it? Let's leave that alone and just add the mask
check, as you recommended. I still think it should avoid the AND so we
don't have to reload the syscall nr, though.

Thanks!

-Kees

>
>
>> this is how an overlapping syscall ABI
>> could be tested." or something. Additionally, I think it would be
>> better to test for >= instead of & to avoid having to reload the
>> syscall nr.
>
> Yes, sounds good.



-- 
Kees Cook
Chrome OS Security

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] seccomp.2: Explain arch checking, value (non-)truncation, expand example
  2015-03-16 23:34   ` Jann Horn
  2015-03-17 17:23     ` Kees Cook
@ 2015-03-22 15:58     ` Michael Kerrisk (man-pages)
  2015-03-24 18:38     ` [PATCH v2 1/2] seccomp.2: Explain blacklisting problems, " Jann Horn
  2015-03-24 18:40     ` [PATCH v2 2/2] syscall.2: add x32 ABI Jann Horn
  3 siblings, 0 replies; 9+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-03-22 15:58 UTC (permalink / raw)
  To: Jann Horn; +Cc: Kees Cook, linux-man, LKML, Andy Lutomirski, Will Drewry

Hi Jann,

Thanks for working on this page. I'm a little unclear at this point.
Am I correct to assume you are going to revise this patch in the light
of Kees's comments and send a new patch?

Thanks,

Michael


On 17 March 2015 at 00:34, Jann Horn <jann@thejh.net> wrote:
> On Mon, Mar 16, 2015 at 03:25:56PM -0700, Kees Cook wrote:
>> On Mon, Mar 16, 2015 at 11:01 AM, Jann Horn <jann@thejh.net> wrote:
>> > Document some more-or-less surprising things about seccomp.
>> > I'm not sure whether changing the example code like that is a
>> > good idea - maybe that part of the patch should be left out?
>> >
>> > Demo code for the X32 issue:
>> > https://gist.github.com/thejh/c5b670a816bbb9791a6d
>> >
>> > Demo code for full 64bit registers being visible in seccomp
>> > if the i386 ABI is used on a 64bit system:
>> > https://gist.github.com/thejh/c37b27aefc44ab775db5
>>
>> So, it is probably worth noting the x32 ABI somewhere, and seccomp.2
>> is probably reasonable, though maybe it should be explicitly detailed
>> in syscall.2?
>
> I guess that would be sensible. However, I still think that the seccomp
> manpage should mention it, too, or advise the reader to also carefully
> read the syscall.2 manpage.
>
>
>> In the seccomp.2 manpage, though, I think we should discourage
>> blacklisting, since whitelisting is a much more effective way to do
>> attack surface reduction on syscalls. (And, as such, x32 would be
>> already eliminated from x86-64 filters.)
>
> I agree, whitelisting should be encouraged. However, as far as I can
> tell, people use seccomp not only for proper, strict sandboxing, but
> also to e.g. fix small security problems in containers or to provide
> additional precautions for them. In that case, I think that the use
> of a blacklist is more understandable, and various project use
> seccomp that way or at least support the use of blacklists: The
> default policy of LXC is a blacklist, sandstorm.io uses a seccomp
> blacklist and blacklists specific ptrace calls, systemd-nspawn uses a
> blacklist (although the manpage says that that's meant as a precaution
> against accidental damage, not as a security measure), systemd
> supports both whitelists and blacklists in the SystemCallFilter
> directive.
>
>
>> It is, however, reasonable to update the example just so it can be
>> super-explicit, though I'd change the comments to say something more
>> direct about the whitelisting vs blacklisting, like "While this
>> example uses whitelisting,
>
> You mean you would want to change the example to use whitelisting?
> That sounds like a good idea.
>
>
>> this is how an overlapping syscall ABI
>> could be tested." or something. Additionally, I think it would be
>> better to test for >= instead of & to avoid having to reload the
>> syscall nr.
>
> Yes, sounds good.



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2 1/2] seccomp.2: Explain blacklisting problems, expand example
  2015-03-16 23:34   ` Jann Horn
  2015-03-17 17:23     ` Kees Cook
  2015-03-22 15:58     ` Michael Kerrisk (man-pages)
@ 2015-03-24 18:38     ` Jann Horn
  2015-03-29 16:01       ` Michael Kerrisk (man-pages)
  2015-03-24 18:40     ` [PATCH v2 2/2] syscall.2: add x32 ABI Jann Horn
  3 siblings, 1 reply; 9+ messages in thread
From: Jann Horn @ 2015-03-24 18:38 UTC (permalink / raw)
  To: Kees Cook; +Cc: Michael Kerrisk, linux-man, LKML, Andy Lutomirski, Will Drewry

---
 man2/seccomp.2 | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 67 insertions(+), 6 deletions(-)

diff --git a/man2/seccomp.2 b/man2/seccomp.2
index e2a5060..b596fb8 100644
--- a/man2/seccomp.2
+++ b/man2/seccomp.2
@@ -250,6 +250,55 @@ struct seccomp_data {
 .fi
 .in
 
+Because the numbers of system calls vary between architectures and
+some architectures (e.g. X86-64) allow user-space code to use
+the calling conventions of multiple architectures, it is usually
+necessary to verify the value of the
+.IR arch
+field.
+
+It is strongly recommended to use a whitelisting approach whenever
+possible because such an approach is more robust and simple.
+A blacklist will have to be updated whenever a potentially
+dangerous syscall is added (or a dangerous flag or option if those
+are blacklisted), and it is often possible to alter the
+representation of a value without altering its meaning, leading to
+a blacklist bypass.
+
+The
+.IR arch
+field is not unique for all calling conventions. The X86-64 ABI and
+the X32 ABI both use
+.BR AUDIT_ARCH_X86_64
+as
+.IR arch ,
+and they run on the same processors. Instead, the mask
+.BR __X32_SYSCALL_BIT
+is used on the system call number to tell the two ABIs apart.
+This means that in order to create a seccomp-based
+blacklist for system calls performed through the X86-64 ABI,
+it is necessary to not only check that
+.IR arch
+equals
+.BR AUDIT_ARCH_X86_64 ,
+but also to explicitly reject all syscalls that contain
+.BR __X32_SYSCALL_BIT
+in
+.IR nr .
+
+When checking values from
+.IR args
+against a blacklist, keep in mind that arguments are often
+silently truncated before being processed, but after the seccomp
+check. For example, this happens if the i386 ABI is used on an
+X86-64 kernel: Although the kernel will normally not look beyond
+the 32 lowest bits of the arguments, the values of the full
+64-bit registers will be present in the seccomp data. A less
+surprising example is that if the X86-64 ABI is used to perform
+a syscall that takes an argument of type int, the
+more-significant half of the argument register is ignored by
+the syscall, but visible in the seccomp data.
+
 A seccomp filter returns a 32-bit value consisting of two parts:
 the most significant 16 bits
 (corresponding to the mask defined by the constant
@@ -616,38 +665,50 @@ cecilia
 #include <linux/seccomp.h>
 #include <sys/prctl.h>
 
+#define X32_SYSCALL_BIT 0x40000000
+
 static int
 install_filter(int syscall_nr, int t_arch, int f_errno)
 {
+    unsigned int upper_nr_limit = 0xffffffff;
+    /* assume that AUDIT_ARCH_X86_64 means the normal X86-64 ABI */
+    if (t_arch == AUDIT_ARCH_X86_64)
+        upper_nr_limit = X32_SYSCALL_BIT - 1;
+
     struct sock_filter filter[] = {
         /* [0] Load architecture from 'seccomp_data' buffer into
                accumulator */
         BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                  (offsetof(struct seccomp_data, arch))),
 
-        /* [1] Jump forward 4 instructions if architecture does not
+        /* [1] Jump forward 5 instructions if architecture does not
                match 't_arch' */
-        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 4),
+        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 5),
 
         /* [2] Load system call number from 'seccomp_data' buffer into
                accumulator */
         BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                  (offsetof(struct seccomp_data, nr))),
 
-        /* [3] Jump forward 1 instruction if system call number
+        /* [3] Check ABI - only needed for X86-64 in blacklist usecases.
+               Use JGT instead of checking against the bitmask to avoid
+               having to reload the syscall number. */
+        BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, upper_nr_limit, 3, 0),
+
+        /* [4] Jump forward 1 instruction if system call number
                does not match 'syscall_nr' */
         BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall_nr, 0, 1),
 
-        /* [4] Matching architecture and system call: don't execute
+        /* [5] Matching architecture and system call: don't execute
 	       the system call, and return 'f_errno' in 'errno' */
         BPF_STMT(BPF_RET | BPF_K,
                  SECCOMP_RET_ERRNO | (f_errno & SECCOMP_RET_DATA)),
 
-        /* [5] Destination of system call number mismatch: allow other
+        /* [6] Destination of system call number mismatch: allow other
                system calls */
         BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
 
-        /* [6] Destination of architecture mismatch: kill process */
+        /* [7] Destination of architecture mismatch: kill process */
         BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
     };
 
-- 
2.1.4



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v2 2/2] syscall.2: add x32 ABI
  2015-03-16 23:34   ` Jann Horn
                       ` (2 preceding siblings ...)
  2015-03-24 18:38     ` [PATCH v2 1/2] seccomp.2: Explain blacklisting problems, " Jann Horn
@ 2015-03-24 18:40     ` Jann Horn
  2015-04-21 14:01       ` Michael Kerrisk (man-pages)
  3 siblings, 1 reply; 9+ messages in thread
From: Jann Horn @ 2015-03-24 18:40 UTC (permalink / raw)
  To: Kees Cook
  Cc: Michael Kerrisk, linux-man, LKML, Andy Lutomirski, Will Drewry,
	H. Peter Anvin

[added H. Peter Anvin to CC for this patch because he
seems to have contributed large parts of the X32 code]

---
 man2/syscall.2 | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/man2/syscall.2 b/man2/syscall.2
index ef8f3cf..1f25255 100644
--- a/man2/syscall.2
+++ b/man2/syscall.2
@@ -170,12 +170,18 @@ s390	svc 0	r1	r2	See below
 s390x	svc 0	r1	r2	See below
 sparc/32	t 0x10	g1	o0
 sparc/64	t 0x6d	g1	o0
-x86_64	syscall	rax	rax
+x86_64	syscall	rax	rax	See below
+x32	syscall	rax	rax	See below
 .TE
 .PP
 For s390 and s390x, NR (the system call number)
 may be passed directly with "svc NR" if it is less than 256.
 
+The x32 ABI uses the same instruction as the x86_64 ABI and is used on
+the same processors. To differentiate between them, the bitmask
+.I __X32_SYSCALL_BIT
+is bitwise-ORed into the syscall number for syscalls under the x32 ABI.
+
 On a few architectures,
 a register is used to indicate simple boolean failure of the system call:
 ia64 uses
@@ -210,6 +216,7 @@ s390x       r2    r3    r4    r5    r6    r7    -
 sparc/32    o0    o1    o2    o3    o4    o5    -
 sparc/64    o0    o1    o2    o3    o4    o5    -
 x86_64      rdi   rsi   rdx   r10   r8    r9    -
+x32         rdi   rsi   rdx   r10   r8    r9    -
 .TE
 .PP
 The mips/o32 system call convention passes
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 1/2] seccomp.2: Explain blacklisting problems, expand example
  2015-03-24 18:38     ` [PATCH v2 1/2] seccomp.2: Explain blacklisting problems, " Jann Horn
@ 2015-03-29 16:01       ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 9+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-03-29 16:01 UTC (permalink / raw)
  To: Jann Horn, Kees Cook
  Cc: mtk.manpages, linux-man, LKML, Andy Lutomirski, Will Drewry

Hi Jann,

Thanks for the patch. I've applied it.

Cheers,

Michael


On 03/24/2015 07:38 PM, Jann Horn wrote:
> ---
>  man2/seccomp.2 | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 67 insertions(+), 6 deletions(-)
> 
> diff --git a/man2/seccomp.2 b/man2/seccomp.2
> index e2a5060..b596fb8 100644
> --- a/man2/seccomp.2
> +++ b/man2/seccomp.2
> @@ -250,6 +250,55 @@ struct seccomp_data {
>  .fi
>  .in
>  
> +Because the numbers of system calls vary between architectures and
> +some architectures (e.g. X86-64) allow user-space code to use
> +the calling conventions of multiple architectures, it is usually
> +necessary to verify the value of the
> +.IR arch
> +field.
> +
> +It is strongly recommended to use a whitelisting approach whenever
> +possible because such an approach is more robust and simple.
> +A blacklist will have to be updated whenever a potentially
> +dangerous syscall is added (or a dangerous flag or option if those
> +are blacklisted), and it is often possible to alter the
> +representation of a value without altering its meaning, leading to
> +a blacklist bypass.
> +
> +The
> +.IR arch
> +field is not unique for all calling conventions. The X86-64 ABI and
> +the X32 ABI both use
> +.BR AUDIT_ARCH_X86_64
> +as
> +.IR arch ,
> +and they run on the same processors. Instead, the mask
> +.BR __X32_SYSCALL_BIT
> +is used on the system call number to tell the two ABIs apart.
> +This means that in order to create a seccomp-based
> +blacklist for system calls performed through the X86-64 ABI,
> +it is necessary to not only check that
> +.IR arch
> +equals
> +.BR AUDIT_ARCH_X86_64 ,
> +but also to explicitly reject all syscalls that contain
> +.BR __X32_SYSCALL_BIT
> +in
> +.IR nr .
> +
> +When checking values from
> +.IR args
> +against a blacklist, keep in mind that arguments are often
> +silently truncated before being processed, but after the seccomp
> +check. For example, this happens if the i386 ABI is used on an
> +X86-64 kernel: Although the kernel will normally not look beyond
> +the 32 lowest bits of the arguments, the values of the full
> +64-bit registers will be present in the seccomp data. A less
> +surprising example is that if the X86-64 ABI is used to perform
> +a syscall that takes an argument of type int, the
> +more-significant half of the argument register is ignored by
> +the syscall, but visible in the seccomp data.
> +
>  A seccomp filter returns a 32-bit value consisting of two parts:
>  the most significant 16 bits
>  (corresponding to the mask defined by the constant
> @@ -616,38 +665,50 @@ cecilia
>  #include <linux/seccomp.h>
>  #include <sys/prctl.h>
>  
> +#define X32_SYSCALL_BIT 0x40000000
> +
>  static int
>  install_filter(int syscall_nr, int t_arch, int f_errno)
>  {
> +    unsigned int upper_nr_limit = 0xffffffff;
> +    /* assume that AUDIT_ARCH_X86_64 means the normal X86-64 ABI */
> +    if (t_arch == AUDIT_ARCH_X86_64)
> +        upper_nr_limit = X32_SYSCALL_BIT - 1;
> +
>      struct sock_filter filter[] = {
>          /* [0] Load architecture from 'seccomp_data' buffer into
>                 accumulator */
>          BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
>                   (offsetof(struct seccomp_data, arch))),
>  
> -        /* [1] Jump forward 4 instructions if architecture does not
> +        /* [1] Jump forward 5 instructions if architecture does not
>                 match 't_arch' */
> -        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 4),
> +        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, t_arch, 0, 5),
>  
>          /* [2] Load system call number from 'seccomp_data' buffer into
>                 accumulator */
>          BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
>                   (offsetof(struct seccomp_data, nr))),
>  
> -        /* [3] Jump forward 1 instruction if system call number
> +        /* [3] Check ABI - only needed for X86-64 in blacklist usecases.
> +               Use JGT instead of checking against the bitmask to avoid
> +               having to reload the syscall number. */
> +        BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, upper_nr_limit, 3, 0),
> +
> +        /* [4] Jump forward 1 instruction if system call number
>                 does not match 'syscall_nr' */
>          BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, syscall_nr, 0, 1),
>  
> -        /* [4] Matching architecture and system call: don't execute
> +        /* [5] Matching architecture and system call: don't execute
>  	       the system call, and return 'f_errno' in 'errno' */
>          BPF_STMT(BPF_RET | BPF_K,
>                   SECCOMP_RET_ERRNO | (f_errno & SECCOMP_RET_DATA)),
>  
> -        /* [5] Destination of system call number mismatch: allow other
> +        /* [6] Destination of system call number mismatch: allow other
>                 system calls */
>          BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
>  
> -        /* [6] Destination of architecture mismatch: kill process */
> +        /* [7] Destination of architecture mismatch: kill process */
>          BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
>      };
>  
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 2/2] syscall.2: add x32 ABI
  2015-03-24 18:40     ` [PATCH v2 2/2] syscall.2: add x32 ABI Jann Horn
@ 2015-04-21 14:01       ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 9+ messages in thread
From: Michael Kerrisk (man-pages) @ 2015-04-21 14:01 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, linux-man, LKML, Andy Lutomirski, Will Drewry, H. Peter Anvin

Hi Jann,

On 24 March 2015 at 19:40, Jann Horn <jann@thejh.net> wrote:
> [added H. Peter Anvin to CC for this patch because he
> seems to have contributed large parts of the X32 code]

It looks like I did not reply to this mail. But as you'll probably have seen
from the man-pages-3.83 release announcement, I did apply this patch.
Thanks!

Cheers,

Michael


>
> ---
>  man2/syscall.2 | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/man2/syscall.2 b/man2/syscall.2
> index ef8f3cf..1f25255 100644
> --- a/man2/syscall.2
> +++ b/man2/syscall.2
> @@ -170,12 +170,18 @@ s390      svc 0   r1      r2      See below
>  s390x  svc 0   r1      r2      See below
>  sparc/32       t 0x10  g1      o0
>  sparc/64       t 0x6d  g1      o0
> -x86_64 syscall rax     rax
> +x86_64 syscall rax     rax     See below
> +x32    syscall rax     rax     See below
>  .TE
>  .PP
>  For s390 and s390x, NR (the system call number)
>  may be passed directly with "svc NR" if it is less than 256.
>
> +The x32 ABI uses the same instruction as the x86_64 ABI and is used on
> +the same processors. To differentiate between them, the bitmask
> +.I __X32_SYSCALL_BIT
> +is bitwise-ORed into the syscall number for syscalls under the x32 ABI.
> +
>  On a few architectures,
>  a register is used to indicate simple boolean failure of the system call:
>  ia64 uses
> @@ -210,6 +216,7 @@ s390x       r2    r3    r4    r5    r6    r7    -
>  sparc/32    o0    o1    o2    o3    o4    o5    -
>  sparc/64    o0    o1    o2    o3    o4    o5    -
>  x86_64      rdi   rsi   rdx   r10   r8    r9    -
> +x32         rdi   rsi   rdx   r10   r8    r9    -
>  .TE
>  .PP
>  The mips/o32 system call convention passes
> --
> 2.1.4



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2015-04-21 14:01 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-16 18:01 [PATCH] seccomp.2: Explain arch checking, value (non-)truncation, expand example Jann Horn
2015-03-16 22:25 ` Kees Cook
2015-03-16 23:34   ` Jann Horn
2015-03-17 17:23     ` Kees Cook
2015-03-22 15:58     ` Michael Kerrisk (man-pages)
2015-03-24 18:38     ` [PATCH v2 1/2] seccomp.2: Explain blacklisting problems, " Jann Horn
2015-03-29 16:01       ` Michael Kerrisk (man-pages)
2015-03-24 18:40     ` [PATCH v2 2/2] syscall.2: add x32 ABI Jann Horn
2015-04-21 14:01       ` Michael Kerrisk (man-pages)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).