* [Qemu-devel] RFC: How to make seccomp reliable and useful ?
@ 2017-02-15 18:27 Daniel P. Berrange
  2017-02-15 23:36 ` Eduardo Otubo
  2017-02-16  8:38 ` Thomas Huth
  0 siblings, 2 replies; 9+ messages in thread
From: Daniel P. Berrange @ 2017-02-15 18:27 UTC
  To: qemu-devel; +Cc: Eduardo Otubo

The current impl of seccomp in QEMU is intentionally allowing a huge range
of system calls to be executed. The goal was that running '-sandbox on'
should never break any feature of QEMU, so naturally any syscall that can
be executed on any codepath QEMU takes must be allowed.

This is good for usability because users don't need to understand the technical
details of the sandbox technology, they merely say "on" and it "just works".
Conversely though, this is bad for security because QEMU has to allow a huge
range of system calls to be used due to its broad functionality.

During initial discussions for seccomp back in 2012 it was suggested that there
might be alternate policies developed for QEMU which deny some features but
improve security overall. To the best of my knowledge, this has never been
discussed again since then.


In addition, since initially merging, there has been a steady stream of patches
to whitelist further syscalls that were missing. Some of these were missing due
to newly added functionality in QEMU since the original seccomp impl, while
others have been missing since day 1. It is reasonable to expect that there are
still many syscalls missing in the whitelist. In just a couple of minutes of
comparing the whitelist vs global syscall list it was possible to identify two
further missing syscalls. The '-netdev bridge,br=virbr0' network backend fails
because setuid is blocked, preventing execution of the qemu-bridge-helper
program. If built against glibc < 2.9, or running on kernel < 2.6.27, QEMU will
fail to call eventfd() because we only permit the eventfd2() syscall, not the
older eventfd() syscall used on older Linux. Some ifup scripts used with the
-netdev arg may also break due to lack of chmod, flock, getxattr permissions.
This risk of missing syscalls is why -sandbox defaults to off, and we've never
considered defaulting it to on.
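
As a rough illustration of the comparison described above (a sketch, not the
script that was actually used): finding candidates is a set difference between
the syscalls the host knows about and the ones on the whitelist. Both lists
below are tiny hypothetical samples chosen to match the examples in this mail,
and the function name is made up for the illustration.

```python
# Hypothetical samples; a real check would use the full syscall table for
# the host architecture and the whitelist from QEMU's seccomp code.
ALL_SYSCALLS = {"read", "write", "eventfd", "eventfd2", "setuid",
                "chmod", "flock", "getxattr"}
WHITELIST = {"read", "write", "eventfd2"}


def missing_from_whitelist(all_syscalls, whitelist):
    """Return syscalls that exist on the host but are not whitelisted."""
    return sorted(set(all_syscalls) - set(whitelist))


if __name__ == "__main__":
    print(missing_from_whitelist(ALL_SYSCALLS, WHITELIST))
```

Each name it reports still needs a human to decide whether QEMU or one of its
libraries can legitimately reach it, which is the part that does not scale.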


The fundamental problem is that building a whitelist of syscalls used by QEMU
emulators is an intractable problem. QEMU on my system links to 183 different
shared libraries and there is no way in the world that anyone can figure out
which code paths QEMU triggers in these libraries and thus identify which
syscalls will be genuinely needed.

Thus a whitelist based approach for QEMU is doomed to always be missing some
syscalls, resulting in unnecessary aborts of QEMU when it tickles some edge
case. If you are lucky the abort() happens at startup so you see it quickly
and can address it. If you are unlucky the abort() happens after your VM has
been running for days/weeks/months and you lose data.

IOW, seccomp integration as it currently exists today in QEMU offers minimal
security benefits, while at the same time causing spurious crashes which may
cause user data loss from aborting a running VM, discouraging users from using
even the minimal protection it offers.

I think we need to rework our seccomp support so that we can have a high enough
level of confidence in it, that it could be enabled by default. At the same time
we need to make it do something more tangibly useful from a security POV.


First we need to admit that whitelisting is a failed approach, and switch to
using blacklisting. Unless we do this, we'll never have high enough confidence
to enable it by default - something that's never turned on might as well not
exist at all.


There is a reasonably easily identifiable set of syscalls that QEMU should
never be permitted to use, no matter what configuration it is in, what helpers
it spawns, or what libraries it links to, e.g. reboot, swapon, swapoff, syslog,
mount, umount, kexec_*, etc. Any syscall that affects global system state,
rather than process local state, should be forbidden.
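
To make the difference between the two list styles concrete, here is a toy
model (plain Python, not seccomp or QEMU code). The deny list is built from
the examples above; umount2, kexec_load and kexec_file_load are an assumed
expansion of "umount" and "kexec_*", and the function name is hypothetical.

```python
# Illustrative deny list taken from the examples in this mail. It is not a
# vetted policy; umount2/kexec_load/kexec_file_load are assumed expansions.
GLOBAL_STATE_SYSCALLS = frozenset({
    "reboot", "swapon", "swapoff", "syslog", "mount", "umount", "umount2",
    "kexec_load", "kexec_file_load",
})


def verdict(syscall, listed, mode):
    """Decide what happens to one syscall under a list-based policy.

    mode="whitelist": only listed syscalls are allowed, the rest are killed.
    mode="blacklist": only listed syscalls are killed, the rest are allowed.
    """
    if mode == "whitelist":
        return "allow" if syscall in listed else "kill"
    if mode == "blacklist":
        return "kill" if syscall in listed else "allow"
    raise ValueError(f"unknown mode: {mode}")
```

With a blacklist, a syscall nobody thought about (say eventfd from an old
glibc) falls through to "allow"; with a whitelist the same oversight kills
the running VM.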

There are some syscalls that are simply hardcoded to return ENOSYS which can
be trivially blacklisted. afs_syscall, break, fattach, ftime, etc (see the
man page 'unimplemented(2)').

There are some syscalls which are considered obsolete - they were previously
useful, but no modern code would call them, as they have been superseded.
For example, readdir has been replaced by getdents. We could blacklist these by default
but provide a way to allow use of obsolete syscalls if running on older systems.
e.g. '-sandbox on,obsolete=allow'. They might be obsolete enough that we decide
to just block them permanently with no opt in - would need to analyse when
their replacements appeared in widespread use.

There might be a few more syscalls which we can determine are never valid to
use in QEMU or any library or helper program it might run. I expect this list
to be very small though, given the impossibility of auditing code paths through
millions of lines of code QEMU links to.

Everything else should be allowed.

At this point we have a highly reliable "-sandbox on" which we're not having
to constantly patch.


From here we need a way to allow a user to opt in to more restrictive policies,
accepting that it will block certain features. For example, there should be
a way to disable any means to elevate privileges from QEMU or things it spawns,
e.g. '-sandbox on,elevateprivileges=deny'.

This would not only block the various set*uid|gid functions via seccomp, but
should also call prctl(PR_SET_NO_NEW_PRIVS). This would allow the user to opt in
to a restrictive world if they know they'll not require things like the setuid
bridge helper.

Similarly there should be a '-sandbox on,spawn=deny' which prevents the ability
to fork/exec processes at all, whether privileged or not. This would block
features like the qemu bridge helper, SMB server, ifup/down scripts, and the
migration exec: protocol. These are all rarely used features though, so an
opt-in to block their use is reasonable & desirable.

There could also be a '-sandbox on,resourcecontrol=deny', which prevents QEMU
from setting things like process affinity, scheduler priority, etc. Some uses of
QEMU might need them, but normally such controls are left to the mgmt app above
QEMU to set prior to the exec() of QEMU.



The key is that these are *not* low level knobs controlling system calls, but
moderately high level knobs controlling general concepts. This is a high enough
level of abstraction to enable libvirt to automatically turn them on/off based
on guest config, without libvirt having to know anything detailed about QEMU
code impl for the features.
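
A minimal sketch of how such knobs could expand into a deny list, assuming the
option names proposed in this mail. It is a plain Python model, not QEMU code:
the syscall names in each group are illustrative guesses, the helper name is
hypothetical, the always-denied global-state list is left out, and the
prctl(PR_SET_NO_NEW_PRIVS) side of elevateprivileges=deny is not modelled.

```python
# Syscall groups per knob. Names are illustrative examples only; a real
# policy would need a careful audit (e.g. threads are created via clone(),
# so "spawn" cannot simply block clone outright).
GROUPS = {
    "obsolete": {"readdir"},
    "elevateprivileges": {"setuid", "setgid", "setreuid", "setregid",
                          "setresuid", "setresgid"},
    "spawn": {"fork", "vfork", "execve"},
    "resourcecontrol": {"sched_setaffinity", "sched_setscheduler",
                        "setpriority"},
}

# Obsolete syscalls are blocked unless the user opts out; the other knobs
# are opt-in restrictions.
DEFAULTS = {"obsolete": "deny", "elevateprivileges": "allow",
            "spawn": "allow", "resourcecontrol": "allow"}


def denied_syscalls(sandbox_arg):
    """Map a '-sandbox' argument such as 'on,spawn=deny' to a deny set."""
    enabled, *opts = sandbox_arg.split(",")
    if enabled != "on":
        return set()
    settings = dict(DEFAULTS)
    for opt in opts:
        key, _, value = opt.partition("=")
        if key not in settings or value not in ("allow", "deny"):
            raise ValueError(f"bad sandbox option: {opt}")
        settings[key] = value
    denied = set()
    for key, value in settings.items():
        if value == "deny":
            denied |= GROUPS[key]
    return denied
```

A management application would only ever choose among the four knob names;
the mapping from knob to syscalls stays private to QEMU and can change
without the caller noticing.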


Finally, for avoidance of doubt, I'm *not* actually proposing to implement this
myself any time in the foreseeable future. This mail came about from the fact
that many people have questioned whether current seccomp code is anything other
than "security theatre". I tend to agree with such an assessment myself, and was
initially intending to just send a patch to remove seccomp, to stimulate some
discussion. Instead, however, I decided to write this mail to see if we can
identify a way forward to make seccomp both reliable and useful. If QEMU had the
kind of approach outlined above, with a default blacklist instead of whitelist,
and some opt-ins for stricter lists, it is something I think libvirt would be
reasonably happy to enable out of the box. That would be a step forward from
today where libvirt would never consider turning seccomp on by default.

Perhaps this re-working could be a GSoC idea for some interested student...

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|


* Re: [Qemu-devel] RFC: How to make seccomp reliable and useful ?
  2017-02-15 18:27 [Qemu-devel] RFC: How to make seccomp reliable and useful ? Daniel P. Berrange
@ 2017-02-15 23:36 ` Eduardo Otubo
  2017-02-16  9:33   ` Daniel P. Berrange
  2017-02-16  8:38 ` Thomas Huth
  1 sibling, 1 reply; 9+ messages in thread
From: Eduardo Otubo @ 2017-02-15 23:36 UTC
  To: Daniel P. Berrange; +Cc: qemu-devel, pmoore

On Wed, Feb 15, 2017 at 06:27:32PM +0000, Daniel P. Berrange wrote:
> [...]
>
> Perhaps this re-working could be a GSoC idea for some interested student...
> 

I'm not a student, and thus not eligible for GSoC, but I would be more
than grateful to take this initiative of yours and turn it into some
patches so we can make this feature something really useful and
reliable.

Perhaps now is not the right time to give terse comments on every idea you
gave; I agree with most of them. I wrote the whole implementation of
this feature but actually became the maintainer because people approving
syscalls and sending pull-requests were too busy, and I could do it. But
to be completely honest I had only a few poor ideas on how to improve it and
almost no time to actually do it in the past. Time passed by and all I
did was approve new syscalls and turn them into pull-requests.

Let's spin up these ideas and hopefully incorporate them into QEMU. As a next
step I'm going to dig into every topic and draft a little more. I guess we can
keep on this thread, or perhaps in separate ones. From there I can start
to write some code.

Best regards,

-- 
Eduardo Otubo
ProfitBricks GmbH


* Re: [Qemu-devel] RFC: How to make seccomp reliable and useful ?
  2017-02-15 18:27 [Qemu-devel] RFC: How to make seccomp reliable and useful ? Daniel P. Berrange
  2017-02-15 23:36 ` Eduardo Otubo
@ 2017-02-16  8:38 ` Thomas Huth
  2017-02-16  9:32   ` Daniel P. Berrange
  1 sibling, 1 reply; 9+ messages in thread
From: Thomas Huth @ 2017-02-16  8:38 UTC
  To: Daniel P. Berrange, qemu-devel; +Cc: Eduardo Otubo

On 15.02.2017 19:27, Daniel P. Berrange wrote:
> [...]
>
> From here we need a way to allow a user to opt in to more restrictive policies,
> accepting that it will block certain features. For example, there should be
> a way to disable any means to elevate privileges from QEMU or things it spawns,
> e.g. '-sandbox on,elevateprivileges=deny'.
> 
> This would not only block the various set*uid|gid functions via seccomp, but
> should also call prctl(PR_SET_NO_NEW_PRIVS). This would allow the user to opt in
> to a restrictive world if they know they'll not require things like the setuid
> bridge helper.
> 
> Similarly there should be a '-sandbox on,spawn=deny' which prevents the ability
> to fork/exec processes at all, whether privileged or not. This would block
> features like the qemu bridge helper, SMB server, ifup/down scripts, and the
> migration exec: protocol. These are all rarely used features though, so an
> opt-in to block their use is reasonable & desirable.
> 
> There could also be a '-sandbox on,resourcecontrol=deny', which prevents QEMU
> from setting things like process affinity, scheduler priority, etc. Some uses of
> QEMU might need them, but normally such controls are left to the mgmt app above
> QEMU to set prior to the exec() of QEMU.

I like your proposal! I just wanted to add an idea for an additional
parameter (not sure whether it is feasible, though): Something like
"-sandbox on,network=off" ... i.e. forbid all system calls that are used
for networking. Rationale: Sometimes your VM does not need any
networking, and you want to make sure that a malicious guest can also
not reach your local network in that case.

 Thomas


* Re: [Qemu-devel] RFC: How to make seccomp reliable and useful ?
  2017-02-16  8:38 ` Thomas Huth
@ 2017-02-16  9:32   ` Daniel P. Berrange
  2017-02-16  9:37     ` Thomas Huth
  0 siblings, 1 reply; 9+ messages in thread
From: Daniel P. Berrange @ 2017-02-16  9:32 UTC
  To: Thomas Huth; +Cc: qemu-devel, Eduardo Otubo

On Thu, Feb 16, 2017 at 09:38:59AM +0100, Thomas Huth wrote:
> On 15.02.2017 19:27, Daniel P. Berrange wrote:
> > [...]
> 
> I like your proposal! I just wanted to add an idea for an additional
> parameter (not sure whether it is feasible, though): Something like
> "-sandbox on,network=off" ... i.e. forbid all system calls that are used
> for networking. Rationale: Sometimes your VM does not need any
> networking, and you want to make sure that a malicious guest can also
> not reach your local network in that case.

This is pretty tricky. Even if there is no obviously configured network
backend in QEMU, there's plenty of scope for things in libraries to
be using networking. Something wants a fully qualified hostname? That'll
trigger UDP / TCP connections to a DNS resolver. Running with the SDL
or GTK display frontends? Those use networking over UNIX sockets to
talk to a display server. Linked to glib2? That'll connect to DConf
over a DBus UNIX socket in the background. etc

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Qemu-devel] RFC: How to make seccomp reliable and useful ?
  2017-02-15 23:36 ` Eduardo Otubo
@ 2017-02-16  9:33   ` Daniel P. Berrange
  2017-03-01 22:38     ` Eduardo Otubo
  0 siblings, 1 reply; 9+ messages in thread
From: Daniel P. Berrange @ 2017-02-16  9:33 UTC (permalink / raw)
  To: qemu-devel, pmoore

On Thu, Feb 16, 2017 at 12:36:51AM +0100, Eduardo Otubo wrote:
> On Wed, Feb 15, 2017 at 06:27:32PM +0000, Daniel P. Berrange wrote:
> > The current impl of seccomp in QEMU is intentionally allowing a huge range
> > of system calls to be executed. The goal was that running '-sandbox on'
> > should never break any feature of QEMU, so naturally any syscall that can
> > be executed on any codepath QEMU takes must be allowed.
> > 
> > This is good for usability because users don't need to understand the technical
> > details of the sandbox technology, they merely say "on" and it "just works".
> > Conversely though, this is bad for security because QEMU has to allow a huge
> > range of system calls to be used due to its broad functionality.
> > 
> > During initial discussions for seccomp back in 2012 it was suggested that there
> > might be alternate policies developed for QEMU which deny some features, but
> > improve security overall. To the best of my knowledge, this has never been
> > discussed again since then.
> > 
> > 
> > In addition, since initially merging, there has been a steady stream of patches
> > to whitelist further syscalls that were missing. Some of these were missing due
> > to newly added functionality in QEMU since the original seccomp impl, while
> > others have been missing since day 1. It is reasonable to expect that there are
> > still many syscalls missing in the whitelist. In just a couple of minutes of
> > comparing the whitelist vs global syscall list it was possible to identify two
> > further missing syscalls. The '-netdev bridge,br=virbr0' network backend fails
> > because setuid is blocked, preventing execution of the qemu-bridge-helper
> > program. If built against glibc < 2.9, or running on kernel < 2.6.27 it will
> > fail to call eventfd() because we only permit eventfd2() syscall, not the
> > older eventfd() syscall used on older Linux. Some ifup scripts used with the
> > -netdev arg may also break due to lack of chmod, flock, getxattr permissions.
> > This risk of missing syscalls is why -sandbox defaults to off, and we've never
> > considered defaulting it to on.
> > 
> > 
> > The fundamental problem is that building a whitelist of syscalls used by QEMU
> > emulators is an intractable problem. QEMU on my system links to 183 different
> > shared libraries and there is no way in the world that anyone can figure out
> > which code paths QEMU triggers in these libraries and thus identify which
> > syscalls will be genuinely needed.
> > 
> > Thus a whitelist based approach for QEMU is doomed to always be missing some
> > syscalls, resulting in unnecessary aborts of QEMU when it tickles some edge
> > case. If you are lucky the abort() happens at startup so you see it quickly
> > and can address it. If you are unlucky the abort() happens after your VM has
> > been running for days/weeks/months and you lose data.
> > 
> > IOW, seccomp integration as it currently exists today in QEMU offers minimal
> > security benefits, while at the same time causing spurious crashes which may
> > cause user data loss from aborting a running VM, discouraging users from using
> > even the minimal protection it offers.
> > 
> > I think we need to rework our seccomp support so that we can have a high enough
> > level of confidence in it, that it could be enabled by default. At the same time
> > we need to make it do something more tangibly useful from a security POV.
> > 
> > 
> > First we need to admit that whitelisting is a failed approach, and switch to
> > using blacklisting. Unless we do this, we'll never have high enough confidence
> > to enable it by default - something that's never turned on might as well not
> > exist at all.
> > 
> > 
> > There is a reasonably easily identifiable set of syscalls that QEMU should
> > never be permitted to use, no matter what configuration it is in, what helpers
> > it spawns, or what libraries it links to, e.g. reboot, swapon, swapoff, syslog,
> > mount, umount, kexec_*, etc. Any syscall that affects global system state,
> > rather than process local state, should be forbidden.
> > 
> > There are some syscalls that are simply hardcoded to return ENOSYS which can
> > be trivially blacklisted. afs_syscall, break, fattach, ftime, etc (see the
> > man page 'unimplemented(2)').
> > 
> > There are some syscalls which are considered obsolete - they were previously
> > useful, but no modern code would call them, as they have been superseded.
> > For example, readdir replaced by getdents. We could blacklist these by default
> > but provide a way to allow use of obsolete syscalls if running on older systems.
> > e.g. '-sandbox on,obsolete=allow'. They might be obsolete enough that we decide
> > to just block them permanently with no opt in - would need to analyse when
> > their replacements appeared in widespread use.
> > 
> > There might be a few more syscalls which we can determine are never valid to
> > use in QEMU or any library or helper program it might run. I expect this list
> > to be very small though, given the impossibility of auditing code paths through
> > millions of lines of code QEMU links to.
> > 
> > Everything else should be allowed.
> > 
> > At this point we have a highly reliable "-sandbox on" which we're not having
> > to constantly patch.
> > 
> > 
> > From here we need a way to allow a user to opt-in to more restrictive policies,
> > accepting that it will block certain features. For example, there should be
> > a way to disable any means to elevate privileges from QEMU or things it spawns.
> > e.g. '-sandbox on,elevateprivileges=deny'.
> > 
> > This would not only block the various set*uid|gid functions via seccomp, but
> > should also call prctl(PR_SET_NO_NEW_PRIVS). This would allow the user to opt in to
> > a restrictive world if they know they'll not require things like the setuid
> > bridge helper.
> > 
> > Similarly there should be a '-sandbox on,spawn=deny' which prevents the ability
> > to fork/exec processes at all, whether privileged or not. This would block
> > features like the qemu bridge helper, SMB server, ifup/down scripts, migration
> > exec: protocol. These are all rarely used features though, so an opt-in to block
> > their use is reasonable & desirable.
> > 
> > A -sandbox on,resourcecontrol=deny, which prevents QEMU from setting stuff like
> > process affinity, scheduler priority, etc. Some uses of QEMU might need them,
> > but normally such controls are left to the mgmt app above QEMU to set prior to
> > the exec() of QEMU.
> > 
> > 
> > 
> > The key is that these are *not* low level knobs controlling system calls, but
> > moderately high level knobs controlling general concepts. This is a high enough
> > level of abstraction to enable libvirt to automatically turn them on/off based
> > on guest config, without libvirt having to know anything detailed about QEMU
> > code impl for the features.
> > 
> > 
> > Finally, for avoidance of doubt, I'm *not* actually proposing to implement this
> > myself any time in the foreseeable future. This mail came about from the fact
> > that many people have questioned whether current seccomp code is anything other
> > than "security theatre". I tend to agree with such an assessment myself, and was
> > initially intending to just send a patch to remove seccomp, to stimulate some
> > discussion. Instead, however, I decided to write this mail to see if we can
> > identify a way forward to make seccomp both reliable and useful. If QEMU had the
> > kind of approach outlined above, with a default blacklist instead of whitelist,
> > and some opt-ins for stricter lists, it is something I think libvirt would be
> > reasonably happy to enable out of the box. That would be a step forward from
> > today where libvirt would never consider turning seccomp on by default.
> > 
> > Perhaps this re-working could be a GSoC idea for some interested student...
> > 
> 
> I'm not a student, and thus not eligible for GSoC, but I would be more
> than grateful to take this initiative of yours and turn it into some
> patches so we can make this feature something really useful and
> reliable.

Sure, I just threw GSoC out there as one possible idea. If you or anyone
else has time to work on it, that's great too.

> Perhaps now is not the right time for terse comments on every idea you
> gave; I agree with most of them. I wrote the whole implementation of
> this feature but actually became the maintainer because people approving
> syscalls and sending pull-requests were too busy, and I could do it. But
> to be completely honest I had only a few poor ideas on how to improve it and
> almost no time to actually do it in the past. Time passed by and all I
> did was approve new syscalls and turn them into pull-requests.
> 
> Let's spin up these ideas and hopefully incorporate into Qemu. Next step
> I'm gonna dig into every topic and draft a little more. I guess we can
> keep on this thread, or perhaps in separate ones. From there I can start
> to write some code.

ok

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Qemu-devel] RFC: How to make seccomp reliable and useful ?
  2017-02-16  9:32   ` Daniel P. Berrange
@ 2017-02-16  9:37     ` Thomas Huth
  2017-02-16  9:41       ` Daniel P. Berrange
  0 siblings, 1 reply; 9+ messages in thread
From: Thomas Huth @ 2017-02-16  9:37 UTC (permalink / raw)
  To: Daniel P. Berrange; +Cc: qemu-devel, Eduardo Otubo

On 16.02.2017 10:32, Daniel P. Berrange wrote:
> On Thu, Feb 16, 2017 at 09:38:59AM +0100, Thomas Huth wrote:
>> On 15.02.2017 19:27, Daniel P. Berrange wrote:
>>> [...]
>>
>> I like your proposal! I just wanted to add an idea for an additional
>> parameter (not sure whether it is feasible, though): Something like
>> "-sandbox on,network=off" ... i.e. forbid all system calls that are used
>> for networking. Rationale: Sometimes your VM does not need any
>> networking, and you want to make sure that a malicious guest can also
>> not reach your local network in that case.
> 
> This is pretty tricky. Even if there is no obviously configured network
> backend in QEMU, there's plenty of scope for things in libraries to
> be using networking. Something wants a fully qualified hostname? That'll
> trigger UDP / TCP connections to a DNS resolver. Running with the SDL
> or GTK display frontends? Those use networking over UNIX sockets to
> talk to a display server. Linked to glib2? That'll connect to DConf
> over a DBus UNIX socket in the background, etc.

Oh, too bad. Aren't there at least some system calls which could be used
to block TCP/IP connections, while we still allow local UNIX sockets?
... hmm, maybe that's rather something to solve at the SELinux level
instead...

 Thomas

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Qemu-devel] RFC: How to make seccomp reliable and useful ?
  2017-02-16  9:37     ` Thomas Huth
@ 2017-02-16  9:41       ` Daniel P. Berrange
  0 siblings, 0 replies; 9+ messages in thread
From: Daniel P. Berrange @ 2017-02-16  9:41 UTC (permalink / raw)
  To: Thomas Huth; +Cc: qemu-devel, Eduardo Otubo

On Thu, Feb 16, 2017 at 10:37:31AM +0100, Thomas Huth wrote:
> On 16.02.2017 10:32, Daniel P. Berrange wrote:
> > On Thu, Feb 16, 2017 at 09:38:59AM +0100, Thomas Huth wrote:
> >> I like your proposal! I just wanted to add an idea for an additional
> >> parameter (not sure whether it is feasible, though): Something like
> >> "-sandbox on,network=off" ... i.e. forbid all system calls that are used
> >> for networking. Rationale: Sometimes your VM does not need any
> >> networking, and you want to make sure that a malicious guest can also
> >> not reach your local network in that case.
> > 
> > This is pretty tricky. Even if there is no obviously configured network
> > backend in QEMU, there's plenty of scope for things in libraries to
> > be using networking. Something wants a fully qualified hostname? That'll
> > trigger UDP / TCP connections to a DNS resolver. Running with the SDL
> > or GTK display frontends? Those use networking over UNIX sockets to
> > talk to a display server. Linked to glib2? That'll connect to DConf
> > over a DBus UNIX socket in the background, etc.
> 
> Oh, too bad. Aren't there at least some system calls which could be used
> to block TCP/IP connections, while we still allow local UNIX sockets?
> ... hmm, maybe that's rather something to solve at the SELinux level
> instead...

seccomp lets you filter based on the value of syscall arguments. So you could
filter out socket() calls with family != AF_UNIX. This still leaves
potential trouble with DNS resolvers though, which can be valid to use
even when you don't want to make network connections. Annoyingly, even if one
ran a localhost DNS resolver, there's no facility in /etc/resolv.conf
to specify a UNIX socket for talking to it - it'd have to use TCP over
localhost :-(
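
To make the argument filtering idea concrete, here is a minimal sketch using
the raw seccomp-BPF interface rather than QEMU's actual seccomp code. The
helper name deny_non_unix_sockets() and the ARG0_LOW32 macro are made up for
illustration, and the choice of EACCES as the error is an assumption. It
skips the seccomp_data.arch check that a real filter needs and ignores
socketcall() on ABIs that multiplex socket calls.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Offset of the low 32 bits of the first syscall argument in seccomp_data. */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define ARG0_LOW32 offsetof(struct seccomp_data, args[0])
#else
#define ARG0_LOW32 (offsetof(struct seccomp_data, args[0]) + sizeof(__u32))
#endif

/*
 * Hypothetical helper (not QEMU code): make socket() fail with EACCES for
 * every address family except AF_UNIX; all other syscalls stay allowed.
 * Sketch only: no seccomp_data.arch check, no socketcall() handling.
 * Returns 0 on success, -1 with errno set on failure.
 */
int deny_non_unix_sockets(void)
{
    struct sock_filter filter[] = {
        /* [0] load the syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* [1] not socket()? jump to [5] (allow) */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_socket, 0, 3),
        /* [2] load the address family argument */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG0_LOW32),
        /* [3] AF_UNIX? jump to [5] (allow), otherwise fall through to [4] */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_UNIX, 1, 0),
        /* [4] deny with EACCES */
        BPF_STMT(BPF_RET | BPF_K,
                 SECCOMP_RET_ERRNO | (EACCES & SECCOMP_RET_DATA)),
        /* [5] allow */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* Unprivileged processes must set no_new_privs before loading a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

After a successful call, socket(AF_UNIX, ...) should keep working while
socket(AF_INET, ...) returns -1 with errno set to EACCES. Note this does
nothing about the DNS resolver problem described above.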

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Qemu-devel] RFC: How to make seccomp reliable and useful ?
  2017-02-16  9:33   ` Daniel P. Berrange
@ 2017-03-01 22:38     ` Eduardo Otubo
  2017-03-02  9:35       ` Daniel P. Berrange
  0 siblings, 1 reply; 9+ messages in thread
From: Eduardo Otubo @ 2017-03-01 22:38 UTC (permalink / raw)
  To: Daniel P. Berrange; +Cc: qemu-devel, pmoore

On Thu, Feb 16, 2017 at 09:33:16AM +0000, Daniel P. Berrange wrote:
> On Thu, Feb 16, 2017 at 12:36:51AM +0100, Eduardo Otubo wrote:
> > On Wed, Feb 15, 2017 at 06:27:32PM +0000, Daniel P. Berrange wrote:

[...]

> > > 
> > > There is a reasonably easily identifiable set of syscalls that QEMU should
> > > never be permitted to use, no matter what configuration it is in, what helpers
> > > it spawns, or what libraries it links to, e.g. reboot, swapon, swapoff, syslog,
> > > mount, umount, kexec_*, etc. Any syscall that affects global system state,
> > > rather than process local state, should be forbidden.
> > > 
> > > There are some syscalls that are simply hardcoded to return ENOSYS which can
> > > be trivially blacklisted. afs_syscall, break, fattach, ftime, etc (see the
> > > man page 'unimplemented(2)').

I've been working on the blacklist; you can see it here:
https://github.com/otubo/qemu/commit/31e603180081474ff35c5897813cb635f8e9a786

I didn't send it as an RFC to the list because it's still a work in progress,
but if you have any comments, please feel free to share them.

> > > 
> > > There are some syscalls which are considered obsolete - they were previously
> > > useful, but no modern code would call them, as they have been superseded.
> > > For example, readdir replaced by getdents. We could blacklist these by default
> > > but provide a way to allow use of obsolete syscalls if running on older systems.
> > > e.g. '-sandbox on,obsolete=allow'. They might be obsolete enough that we decide
> > > to just block them permanently with no opt in - would need to analyse when
> > > their replacements appeared in widespread use.

The obsolete part is also on my github (I didn't send it for the same
reason):
https://github.com/otubo/qemu/commit/54a57eb150ca3e5b67e9a81394c6cfa4ac82a6ff

Also, I can't find a solid list of obsolete system calls anywhere. Can you
elaborate a little more on how to determine this list?

> > > 
> > > There might be a few more syscalls which we can determine are never valid to
> > > use in QEMU or any library or helper program it might run. I expect this list
> > > to be very small though, given the impossibility of auditing code paths through
> > > millions of lines of code QEMU links to.
> > > 
> > > Everything else should be allowed.
> > > 
> > > At this point we have a highly reliable "-sandbox on" which we're not having
> > > to constantly patch.
> > > 
> > > 
> > > From here we need a way to allow a user to opt-in to more restrictive policies,
> > > accepting that it will block certain features. For example, there should be
> > > a way to disable any means to elevate privileges from QEMU or things it spawns.
> > > e.g. '-sandbox on,elevateprivileges=deny'.
> > > 
> > > This would not only block the various set*uid|gid functions via seccomp, but
> > > should also call prctl(PR_SET_NO_NEW_PRIVS). This would allow the user to opt in to
> > > a restrictive world if they know they'll not require things like the setuid
> > > bridge helper.

Also, re-reading the documentation, I see that prctl(PR_SET_NO_NEW_PRIVS)
is enabled by default when using seccomp.
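
One way to check this at runtime is to query the flag directly. The sketch
below is not QEMU code and the helper names are made up for illustration. My
understanding is that the default comes from libseccomp's SCMP_FLTATR_CTL_NNP
filter attribute, which is on unless explicitly cleared, so calling
no_new_privs_is_set() after the filter is loaded should return 1.

```c
#include <sys/prctl.h>

/*
 * Hypothetical helpers (not QEMU code).
 * no_new_privs_is_set() returns 1 if the flag is set for the calling
 * thread, 0 if it is not, and -1 with errno set on failure.
 */
int no_new_privs_is_set(void)
{
    return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}

/*
 * Set the flag explicitly. Once set it cannot be cleared again and it is
 * inherited across fork() and execve().
 * Returns 0 on success, -1 with errno set on failure.
 */
int set_no_new_privs(void)
{
    return prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
}
```

If that holds, the elevateprivileges=deny knob would only need to add the
set*uid|gid rules on top of what loading the filter already does.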

[...]

-- 
Eduardo Otubo
ProfitBricks GmbH

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Qemu-devel] RFC: How to make seccomp reliable and useful ?
  2017-03-01 22:38     ` Eduardo Otubo
@ 2017-03-02  9:35       ` Daniel P. Berrange
  0 siblings, 0 replies; 9+ messages in thread
From: Daniel P. Berrange @ 2017-03-02  9:35 UTC (permalink / raw)
  To: qemu-devel, pmoore

On Wed, Mar 01, 2017 at 11:38:56PM +0100, Eduardo Otubo wrote:
> On Thu, Feb 16, 2017 at 09:33:16AM +0000, Daniel P. Berrange wrote:
> > On Thu, Feb 16, 2017 at 12:36:51AM +0100, Eduardo Otubo wrote:
> > > On Wed, Feb 15, 2017 at 06:27:32PM +0000, Daniel P. Berrange wrote:
> 
> [...]
> 
> > > > 
> > > > There is a reasonably easily identifiable set of syscalls that QEMU should
> > > > never be permitted to use, no matter what configuration it is in, what helpers
> > > > it spawns, or what libraries it links to, e.g. reboot, swapon, swapoff, syslog,
> > > > mount, umount, kexec_*, etc. Any syscall that affects global system state,
> > > > rather than process local state, should be forbidden.
> > > > 
> > > > There are some syscalls that are simply hardcoded to return ENOSYS which can
> > > > be trivially blacklisted. afs_syscall, break, fattach, ftime, etc (see the
> > > > man page 'unimplemented(2)').
> 
> I've been working on the blacklist; you can see it here:
> https://github.com/otubo/qemu/commit/31e603180081474ff35c5897813cb635f8e9a786
> 
> I didn't send it as an RFC to the list because it's still a work in progress,
> but if you have any comments, please feel free to share them.
> 
> > > > 
> > > > There are some syscalls which are considered obsolete - they were previously
> > > > useful, but no modern code would call them, as they have been superseded.
> > > > For example, readdir replaced by getdents. We could blacklist these by default
> > > > but provide a way to allow use of obsolete syscalls if running on older systems.
> > > > e.g. '-sandbox on,obsolete=allow'. They might be obsolete enough that we decide
> > > > to just block them permanently with no opt in - would need to analyse when
> > > > their replacements appeared in widespread use.
> 
> The obsolete part is also on my github (I didn't send it for the same
> reason):
> https://github.com/otubo/qemu/commit/54a57eb150ca3e5b67e9a81394c6cfa4ac82a6ff
> 
> Also, I can't find a solid list of obsolete system calls anywhere. Can you
> elaborate a little more on how to determine this list?

Systemd has such a list in ./src/shared/seccomp-util.c
Look for the array containing SYSCALL_FILTER_SET_OBSOLETE


Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2017-03-02  9:35 UTC | newest]

Thread overview: 9+ messages
2017-02-15 18:27 [Qemu-devel] RFC: How to make seccomp reliable and useful ? Daniel P. Berrange
2017-02-15 23:36 ` Eduardo Otubo
2017-02-16  9:33   ` Daniel P. Berrange
2017-03-01 22:38     ` Eduardo Otubo
2017-03-02  9:35       ` Daniel P. Berrange
2017-02-16  8:38 ` Thomas Huth
2017-02-16  9:32   ` Daniel P. Berrange
2017-02-16  9:37     ` Thomas Huth
2017-02-16  9:41       ` Daniel P. Berrange
