* Secure KVM
@ 2011-11-06 20:40 Sasha Levin
  2011-11-07  0:07 ` Rusty Russell
                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Sasha Levin @ 2011-11-06 20:40 UTC (permalink / raw)
  To: Andrea Arcangeli, Avi Kivity, Marcelo Tosatti, Ingo Molnar,
	Pekka Enberg, Cyrill
  Cc: kvm

Hi all,

I'm planning on doing a small fork of the KVM tool to turn it into a
'Secure KVM' enabled hypervisor. Now you probably ask yourself, Huh?

The idea was discussed briefly a couple of months ago, but never got off
the ground - which is a shame IMO.

It's easy to explain the problem: If an attacker finds a security hole
in any of the devices which are exposed to the guest, the attacker would
be able to either crash the guest, or possibly run code on the host
itself.

The solution is also simple to explain: Split the devices into different
processes and use seccomp to sandbox each device into the exact set of
resources it needs to operate, nothing more and nothing less.

Since I'll be basing it on the KVM tool, which doesn't really emulate
that many legacy devices, I'll focus first on the virtio family for the
sake of simplicity (and covering 90% of the options).

This is my basic overview of how I'm planning on implementing the
initial POC:

1. First I'll focus on the simple virtio-rng device, it's simple enough
to allow us to focus on the aspects which are important for the POC
while still covering most bases (i.e. sandbox to single file
- /dev/urandom and such).

2. Do it on a one process per device concept, where for each device
(notice - not device *type*) requested, a new process which handles it
will be spawned.

3. That process will be limited exactly to the resources it needs to
operate, for example - if we run a virtio-blk device, it would be able
to access only the image file which it should be using.

4. Connection between hypervisor and devices will be based on unix
sockets, this should allow for better separation compared to other
approaches such as shared memory.

5. While performance is an aspect, complete isolation is more important.
Security is primary, performance is secondary.

6. Share as much code as possible with current implementation of virtio
devices, make it possible to run virtio devices either like it's being
done now, or by spawning them as separate processes - the amount of
specific code for the separate process case should be minimal.
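
A minimal sketch of points 2 to 4 of the plan above, not code from the KVM
tool. It assumes a Linux host; `struct rng_req`, `spawn_rng_device()` and
`rng_request()` are hypothetical names made up for illustration. Each device
instance gets its own process, which holds only a unix socket to the
hypervisor and an already-open /dev/urandom descriptor. The sandboxing step
itself is left as a comment.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical wire format: the hypervisor asks for `len` random bytes. */
struct rng_req {
    uint32_t len;
};

#define RNG_MAX_REPLY 64

/* Device process: serves requests using only the two fds it was given. */
static void rng_device_loop(int sock, int rnd_fd)
{
    struct rng_req req;
    unsigned char buf[RNG_MAX_REPLY];

    while (read(sock, &req, sizeof(req)) == (ssize_t)sizeof(req)) {
        if (req.len == 0 || req.len > sizeof(buf))
            break;
        ssize_t n = read(rnd_fd, buf, req.len);
        if (n <= 0 || write(sock, buf, (size_t)n) != n)
            break;
    }
    _exit(0);
}

/*
 * Hypervisor side: spawn one process for one device instance.
 * Returns the hypervisor's end of the socket, or -1 on error.
 */
int spawn_rng_device(pid_t *pid_out)
{
    int sv[2];
    int rnd_fd;
    pid_t pid;

    assert(pid_out != NULL);
    /* SOCK_SEQPACKET keeps request/reply message boundaries intact. */
    if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) < 0)
        return -1;
    rnd_fd = open("/dev/urandom", O_RDONLY);
    if (rnd_fd < 0)
        goto err_sock;
    pid = fork();
    if (pid < 0)
        goto err_rnd;
    if (pid == 0) {
        close(sv[0]);
        /* A real implementation would enter its sandbox here. */
        rng_device_loop(sv[1], rnd_fd);
    }
    close(sv[1]);
    close(rnd_fd);
    *pid_out = pid;
    return sv[0];

err_rnd:
    close(rnd_fd);
err_sock:
    close(sv[0]);
    close(sv[1]);
    return -1;
}

/* Ask the device process for `len` bytes; returns bytes received or -1. */
ssize_t rng_request(int sock, void *out, uint32_t len)
{
    struct rng_req req = { .len = len };

    if (len == 0 || len > RNG_MAX_REPLY)
        return -1;
    if (write(sock, &req, sizeof(req)) != (ssize_t)sizeof(req))
        return -1;
    return read(sock, out, len);
}
```

Closing the hypervisor's end of the socket makes the device process see
end-of-file and exit, so tearing down a device is just close() plus waitpid().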


That's all I have for now, comments are *very* welcome.

-- 

Sasha.



* Re: Secure KVM
  2011-11-06 20:40 Secure KVM Sasha Levin
@ 2011-11-07  0:07 ` Rusty Russell
  2011-11-07  6:29   ` Sasha Levin
  2011-11-07  9:26 ` Avi Kivity
  2011-11-07 17:37   ` [Qemu-devel] " Anthony Liguori
  2 siblings, 1 reply; 31+ messages in thread
From: Rusty Russell @ 2011-11-07  0:07 UTC (permalink / raw)
  To: Sasha Levin, Andrea Arcangeli, Avi Kivity, Marcelo Tosatti,
	Ingo Molnar, Pekka
  Cc: kvm

On Sun, 06 Nov 2011 22:40:20 +0200, Sasha Levin <levinsasha928@gmail.com> wrote:
> The solution is also simple to explain: Split the devices into different
> processes and use seccomp to sandbox each device into the exact set of
> resources it needs to operate, nothing more and nothing less.

lguest does a process per device.  Actually, it uses clone for legacy
reasons, but I have a patch which changes it to processes.

It works well, and it's *simple*.  I suggest looking at
Documentation/virtual/lguest/lguest.c.

Good luck!
Rusty.


* Re: Secure KVM
  2011-11-07  0:07 ` Rusty Russell
@ 2011-11-07  6:29   ` Sasha Levin
  2011-11-07  6:37     ` Pekka Enberg
  2011-11-07 22:49     ` Rusty Russell
  0 siblings, 2 replies; 31+ messages in thread
From: Sasha Levin @ 2011-11-07  6:29 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andrea Arcangeli, Avi Kivity, Marcelo Tosatti, Ingo Molnar,
	Pekka Enberg, Cyrill Gorcunov, Asias He, Anthony Liguori,
	Michael S. Tsirkin, kvm

On Mon, 2011-11-07 at 10:37 +1030, Rusty Russell wrote:
> On Sun, 06 Nov 2011 22:40:20 +0200, Sasha Levin <levinsasha928@gmail.com> wrote:
> > The solution is also simple to explain: Split the devices into different
> > processes and use seccomp to sandbox each device into the exact set of
> > resources it needs to operate, nothing more and nothing less.
> 
> lguest does a process per device.  Actually, it uses clone for legacy
> reasons, but I have a patch which changes it to processes.
> 
> It works well, and it's *simple*.  I suggest looking at
> Documentation/virtual/lguest/lguest.c.
> 
> Good luck!
> Rusty.

Yup, that's pretty much what I want to have.

As you said, clone() isn't really an option - sharing things like the VM
and handles is something which I want to avoid. How does your patch
handle IPC?

-- 

Sasha.



* Re: Secure KVM
  2011-11-07  6:29   ` Sasha Levin
@ 2011-11-07  6:37     ` Pekka Enberg
  2011-11-07  6:46       ` Sasha Levin
  2011-11-07 22:49     ` Rusty Russell
  1 sibling, 1 reply; 31+ messages in thread
From: Pekka Enberg @ 2011-11-07  6:37 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Rusty Russell, Andrea Arcangeli, Avi Kivity, Marcelo Tosatti,
	Ingo Molnar, Cyrill Gorcunov, Asias He, Anthony Liguori,
	Michael S. Tsirkin, kvm

On Mon, Nov 7, 2011 at 8:29 AM, Sasha Levin <levinsasha928@gmail.com> wrote:
> As you said, clone() isn't really an option - sharing things like the VM
> and handles is something which I want to avoid. How does your patch
> handle IPC?

Use the unshare() system call?


* Re: Secure KVM
  2011-11-07  6:37     ` Pekka Enberg
@ 2011-11-07  6:46       ` Sasha Levin
  2011-11-07  7:03         ` Pekka Enberg
  0 siblings, 1 reply; 31+ messages in thread
From: Sasha Levin @ 2011-11-07  6:46 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Rusty Russell, Andrea Arcangeli, Avi Kivity, Marcelo Tosatti,
	Ingo Molnar, Cyrill Gorcunov, Asias He, Anthony Liguori,
	Michael S. Tsirkin, kvm

On Mon, 2011-11-07 at 08:37 +0200, Pekka Enberg wrote:
> On Mon, Nov 7, 2011 at 8:29 AM, Sasha Levin <levinsasha928@gmail.com> wrote:
> > As you said, clone() isn't really an option - sharing things like the VM
> > and handles is something which I want to avoid. How does your patch
> > handle IPC?
> 
> Use the unshare() system call?

Yup, but you must somehow communicate with the master process, and this
is currently missing from the lguest implementation since everything is
shared (vm + fds).

If you simply unshare it, you must have a different method of talking
with the master process. I suggested doing it using unix sockets, and am
wondering how Rusty did it in his patch.
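
One thing unix sockets offer here, sketched below under assumptions: the
master process can hand an unshared device process exactly the descriptors it
needs by passing them as SCM_RIGHTS ancillary data. This is not taken from
lguest or from Rusty's patch; `send_fd()` and `recv_fd()` are hypothetical
helper names.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctrl {
    struct cmsghdr hdr;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send `fd` over the unix socket `sock`. Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char byte = 'F'; /* at least one byte of real data must accompany the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *c;

    assert(sock >= 0 && fd >= 0);
    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from `sock`. Returns the new fd, or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *c;
    int fd;

    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_level != SOL_SOCKET ||
        c->cmsg_type != SCM_RIGHTS || c->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number referring to the same open file, so
a device process never needs permission to open() anything by path.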

-- 

Sasha.



* Re: Secure KVM
  2011-11-07  6:46       ` Sasha Levin
@ 2011-11-07  7:03         ` Pekka Enberg
  0 siblings, 0 replies; 31+ messages in thread
From: Pekka Enberg @ 2011-11-07  7:03 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Pekka Enberg, Rusty Russell, Andrea Arcangeli, Avi Kivity,
	Marcelo Tosatti, Ingo Molnar, Cyrill Gorcunov, Asias He,
	Anthony Liguori, Michael S. Tsirkin, kvm

On Mon, 7 Nov 2011, Sasha Levin wrote:
> Yup, but you must somehow communicate with the master process, and this
> is currently missing from the lguest implementation since everything is
> shared (vm + fds).
>
> If you simply unshare it, you must have a different method of talking
> with the master process. I suggested doing it using unix sockets, and am
> wondering how Rusty did it in his patch.

The model I've heard people talk about is using seccomp which can be used 
for any IPC that works with file descriptors.

 			Pekka


* Re: Secure KVM
  2011-11-06 20:40 Secure KVM Sasha Levin
  2011-11-07  0:07 ` Rusty Russell
@ 2011-11-07  9:26 ` Avi Kivity
  2011-11-07 10:17   ` Sasha Levin
                     ` (2 more replies)
  2011-11-07 17:37   ` [Qemu-devel] " Anthony Liguori
  2 siblings, 3 replies; 31+ messages in thread
From: Avi Kivity @ 2011-11-07  9:26 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Andrea Arcangeli, Marcelo Tosatti, Ingo Molnar, Pekka Enberg,
	Cyrill Gorcunov, Asias He, Anthony Liguori, Rusty Russell,
	Michael S. Tsirkin, kvm

On 11/06/2011 10:40 PM, Sasha Levin wrote:
> Hi all,
>
> I'm planning on doing a small fork of the KVM tool to turn it into a
> 'Secure KVM' enabled hypervisor. Now you probably ask yourself, Huh?

Actually, no.

> The idea was discussed briefly a couple of months ago, but never got off
> the ground - which is a shame IMO.
>
> It's easy to explain the problem: If an attacker finds a security hole
> in any of the devices which are exposed to the guest, the attacker would
> be able to either crash the guest, or possibly run code on the host
> itself.

Crashing the guest is fine (not 100% - you can have unprivileged code
managing a device, in which case we allow unprivileged code to crash the
entire guest - but that's rare).  Running code on the host is also fine;
we have a permissions system in place to prevent damage; see libvirt's
sVirt code, which uses selinux to disallow an exploited guest from
touching other guests or host data.  It should be able to protect
host-only networks as well (not sure if it does that).

The real risk is that the exploited hypervisor turns around and exploits
yet another hole in the system, like a privileged daemon that the
hypervisor is allowed to be in contact with, or the kernel itself, via a
vulnerability in the kernel interfaces.

> The solution is also simple to explain: Split the devices into different
> processes and use seccomp to sandbox each device into the exact set of
> resources it needs to operate, nothing more and nothing less.

One thing to beware of is memory hotplug.  If the memory map is static,
then a fork() once everything is set up (with MAP_SHARED) allows all
processes to access guest memory.  However, if memory hotplug is
supported (or planned to be supported), then you can't do that, as
seccomp doesn't allow you to run mmap() in confined processes.

This means they have to use RPC to the main process in order to access
memory, which is going to slow them down significantly.
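
The static case described above can be sketched as follows, assuming Linux and
using an anonymous mapping as a stand-in for guest memory.
`map_guest_memory()` and `device_write_guest()` are hypothetical names. The
mapping is created once, before fork(), so the child needs no mmap() of its
own; that is exactly why a memory map that changes later is a problem for a
confined process.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map `size` bytes of stand-in "guest memory" that survives fork() shared. */
void *map_guest_memory(size_t size)
{
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    return mem == MAP_FAILED ? NULL : mem;
}

/*
 * Fork a "device" process that writes `msg` into guest memory at `offset`.
 * Returns 0 once the child has exited cleanly, -1 otherwise.
 */
int device_write_guest(void *guest, size_t offset, const char *msg)
{
    pid_t pid;
    int status;

    assert(guest != NULL && msg != NULL);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* The child uses the inherited mapping; no mmap() call is needed. */
        memcpy((char *)guest + offset, msg, strlen(msg) + 1);
        _exit(0);
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

With MAP_PRIVATE instead of MAP_SHARED the child's write would land in its own
copy-on-write page and the parent would never see it.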

> Since I'll be basing it on the KVM tool, which doesn't really emulate
> that many legacy devices, I'll focus first on the virtio family for the
> sake of simplicity (and covering 90% of the options).

Since virtio is so performance sensitive, my feeling is that it is
better to audit it, and rely on sandboxing for the non performance
sensitive parts of the device model.  Of course for a POC it's fine to
start with it.

> This is my basic overview of how I'm planning on implementing the
> initial POC:

<snip plan>

> That's all I have for now, comments are *very* welcome.

This plan is quite similar to the equivalent plans for qemu.  However,
as kvm-tool is much smaller than qemu, you're likely to have much easier
time and make much faster progress.  This is really a great use of
kvm-tool, to explore new ideas rather than catching up; and I'm sure
your experience will prove useful for qemu as well.

-- 
error compiling committee.c: too many arguments to function



* Re: Secure KVM
  2011-11-07  9:26 ` Avi Kivity
@ 2011-11-07 10:17   ` Sasha Levin
  2011-11-07 10:27     ` Avi Kivity
  2011-11-07 11:27     ` Stefan Hajnoczi
  2011-11-07 17:39   ` Anthony Liguori
  2011-11-07 22:56   ` Rusty Russell
  2 siblings, 2 replies; 31+ messages in thread
From: Sasha Levin @ 2011-11-07 10:17 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Andrea Arcangeli, Marcelo Tosatti, Ingo Molnar, Pekka Enberg,
	Cyrill Gorcunov, Asias He, Anthony Liguori, Rusty Russell,
	Michael S. Tsirkin, kvm

Hi Avi,

Thank you for your comments!

Just one question below:

On Mon, Nov 7, 2011 at 11:26 AM, Avi Kivity <avi@redhat.com> wrote:
> Crashing the guest is fine (not 100% - you can have unprivileged code
> managing a device, in which case we allow unprivileged code to crash the
> entire guest - but that's rare).  Running code on the host is also fine;

On Mon, Nov 7, 2011 at 11:26 AM, Avi Kivity <avi@redhat.com> wrote:
> One thing to beware of is memory hotplug.  If the memory map is static,
> then a fork() once everything is set up (with MAP_SHARED) allows all
> processes to access guest memory.  However, if memory hotplug is
> supported (or planned to be supported), then you can't do that, as
> seccomp doesn't allow you to run mmap() in confined processes.
>
> This means they have to use RPC to the main process in order to access
> memory, which is going to slow them down significantly.

Is the risk of non-privileged guest code being able to exploit the
hypervisor to access guest memory which it's not allowed to access
really that small? I actually thought it would be one of the main
concerns we'd need to handle, but from what I understand from you it's
an irrelevant scenario.

If it's really the case, then mapping guest memory is preferable.
While mmap() is an issue, I think it's a great example of why seccomp
filters are needed in the kernel, and might be a good chance to push
that feature forward. In that sense, 'Secure KVM' could be used as a
guinea pig both for seccomp filters and future QEMU work.


* Re: Secure KVM
  2011-11-07 10:17   ` Sasha Levin
@ 2011-11-07 10:27     ` Avi Kivity
  2011-11-07 11:27     ` Stefan Hajnoczi
  1 sibling, 0 replies; 31+ messages in thread
From: Avi Kivity @ 2011-11-07 10:27 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Andrea Arcangeli, Marcelo Tosatti, Ingo Molnar, Pekka Enberg,
	Cyrill Gorcunov, Asias He, Anthony Liguori, Rusty Russell,
	Michael S. Tsirkin, kvm

On 11/07/2011 12:17 PM, Sasha Levin wrote:
> Hi Avi,
>
> Thank you for your comments!
>
> Just one question below:
>
> On Mon, Nov 7, 2011 at 11:26 AM, Avi Kivity <avi@redhat.com> wrote:
> > Crashing the guest is fine (not 100% - you can have unprivileged code
> > managing a device, in which case we allow unprivileged code to crash the
> > entire guest - but that's rare).  Running code on the host is also fine;
>
> On Mon, Nov 7, 2011 at 11:26 AM, Avi Kivity <avi@redhat.com> wrote:
> > One thing to beware of is memory hotplug.  If the memory map is static,
> > then a fork() once everything is set up (with MAP_SHARED) allows all
> > processes to access guest memory.  However, if memory hotplug is
> > supported (or planned to be supported), then you can't do that, as
> > seccomp doesn't allow you to run mmap() in confined processes.
> >
> > This means they have to use RPC to the main process in order to access
> > memory, which is going to slow them down significantly.
>
> Is the risk of non-privileged guest code being able to exploit the
> hypervisor to access guest memory which it's not allowed to access
> really that small? I actually thought it would be one of the main
> concerns we'd need to handle, but from what I understand from you it's
> an irrelevant scenario.

I wouldn't say it's completely irrelevant.  But mainstream deployments
(Linux and Windows) don't really suffer from it, since all device
drivers are privileged (an exception may be graphics drivers on newer
Windows).  Scenarios which may be vulnerable are nested virtualization
with the guest using device assignment.

> If it's really the case, then mapping guest memory is preferable.
> While mmap() is an issue, I think it's a great example of why seccomp
> filters are needed in the kernel, and might be a good chance to push
> that feature forward. In that sense, 'Secure KVM' could be used as a
> guinea pig both for seccomp filters and future QEMU work.

Sure.

Another direction we're looking at is making it harder to exploit a
vulnerability.  PIC/PIE (position independent code/executable) make it
harder to exploit a bug; and selinux controls on exec(), mprotect(), and
mmap(PROT_EXEC) make it impossible to inject code (you can still use
code in the hypervisor or its libraries).  So we still have
vulnerabilities, but they're all denial of service rather than privilege
escalation.

-- 
error compiling committee.c: too many arguments to function



* Re: Secure KVM
  2011-11-07 10:17   ` Sasha Levin
  2011-11-07 10:27     ` Avi Kivity
@ 2011-11-07 11:27     ` Stefan Hajnoczi
  2011-11-07 12:40       ` Sasha Levin
  2011-11-07 17:43       ` Anthony Liguori
  1 sibling, 2 replies; 31+ messages in thread
From: Stefan Hajnoczi @ 2011-11-07 11:27 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Avi Kivity, Andrea Arcangeli, Marcelo Tosatti, Ingo Molnar,
	Pekka Enberg, Cyrill Gorcunov, Asias He, Anthony Liguori,
	Rusty Russell, Michael S. Tsirkin, kvm

On Mon, Nov 7, 2011 at 10:17 AM, Sasha Levin <levinsasha928@gmail.com> wrote:
> Hi Avi,
>
> Thank you for your comments!
>
> Just one question below:
>
> On Mon, Nov 7, 2011 at 11:26 AM, Avi Kivity <avi@redhat.com> wrote:
>> Crashing the guest is fine (not 100% - you can have unprivileged code
>> managing a device, in which case we allow unprivileged code to crash the
>> entire guest - but that's rare).  Running code on the host is also fine;
>
> On Mon, Nov 7, 2011 at 11:26 AM, Avi Kivity <avi@redhat.com> wrote:
>> One thing to beware of is memory hotplug.  If the memory map is static,
>> then a fork() once everything is set up (with MAP_SHARED) allows all
>> processes to access guest memory.  However, if memory hotplug is
>> supported (or planned to be supported), then you can't do that, as
>> seccomp doesn't allow you to run mmap() in confined processes.
>>
>> This means they have to use RPC to the main process in order to access
>> memory, which is going to slow them down significantly.
>
> Is the risk of non-privileged guest code being able to exploit the
> hypervisor to access guest memory which it's not allowed to access
> really that small? I actually thought it would be one of the main
> concerns we'd need to handle, but from what I understand from you it's
> an irrelevant scenario.
>
> If it's really the case, then mapping guest memory is preferable.
> While mmap() is an issue, I think it's a great example of why seccomp
> filters are needed in the kernel, and might be a good chance to push
> that feature forward. In that sense, 'Secure KVM' could be used as a
> guinea pig both for seccomp filters and future QEMU work.

This is a really interesting topic - something that we've discussed in
QEMU as well.

Doing it with seccomp is really hard since that only allows read(2),
write(2), exit(2), and sigreturn(2).  I think using seccomp means that
host devices (e.g. actual network and block device I/O) are
implemented outside the seccomp because it requires other syscalls.
Then the seccomp process would simply do hardware emulation with IPCs
for all actual I/O.
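
To make the restriction concrete, here is a small sketch of strict-mode
seccomp on Linux. `run_confined()` and the `CONFINED_*` values are
hypothetical names. A child enables strict mode, writes to a descriptor it
already holds, and optionally attempts one syscall outside the allowed set,
which gets it killed with SIGKILL. The sketch reports "unavailable" if the
kernel or an outer sandbox refuses to enable strict mode.

```c
#include <assert.h>
#include <signal.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum confined_result {
    CONFINED_EXITED = 0,      /* child stayed within the allowed syscalls */
    CONFINED_KILLED = 1,      /* kernel killed the child with SIGKILL */
    CONFINED_UNAVAILABLE = 2, /* strict mode could not be enabled */
    CONFINED_ERROR = 3,
};

/* Run a confined child that writes "ok" to out_fd; see enum for results. */
int run_confined(int out_fd, int try_forbidden)
{
    pid_t pid;
    int status;

    assert(out_fd >= 0);
    pid = fork();
    if (pid < 0)
        return CONFINED_ERROR;
    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(CONFINED_UNAVAILABLE);
        /* write(2) on an already-open descriptor is still allowed. */
        if (write(out_fd, "ok", 2) != 2)
            syscall(SYS_exit, CONFINED_ERROR);
        if (try_forbidden)
            syscall(SYS_getpid); /* not in the allowed set: SIGKILL */
        /* glibc's _exit() uses exit_group(2), which strict mode forbids. */
        syscall(SYS_exit, CONFINED_EXITED);
    }
    if (waitpid(pid, &status, 0) != pid)
        return CONFINED_ERROR;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
        return CONFINED_KILLED;
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    return CONFINED_ERROR;
}
```

Even a harmless getpid() is fatal here, which is why anything that needs
open(), mmap() or socket calls has to live outside the confined process.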

Where does the VNC server, the image formats, etc go?  It would be
nice to confine them too.

In that respect I think Avi's ideas about using safe programming
languages (even if just a NaCl toolchain) are nice because they are
more general and apply to all of the codebase.

Stefan


* Re: Secure KVM
  2011-11-07 11:27     ` Stefan Hajnoczi
@ 2011-11-07 12:40       ` Sasha Levin
  2011-11-07 12:51         ` Avi Kivity
  2011-11-07 17:43       ` Anthony Liguori
  1 sibling, 1 reply; 31+ messages in thread
From: Sasha Levin @ 2011-11-07 12:40 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Avi Kivity, Andrea Arcangeli, Marcelo Tosatti, Ingo Molnar,
	Pekka Enberg, Cyrill Gorcunov, Asias He, Anthony Liguori,
	Rusty Russell, Michael S. Tsirkin, kvm

On Mon, Nov 7, 2011 at 1:27 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Mon, Nov 7, 2011 at 10:17 AM, Sasha Levin <levinsasha928@gmail.com> wrote:
>> Hi Avi,
>>
>> Thank you for your comments!
>>
>> Just one question below:
>>
>> On Mon, Nov 7, 2011 at 11:26 AM, Avi Kivity <avi@redhat.com> wrote:
>>> Crashing the guest is fine (not 100% - you can have unprivileged code
>>> managing a device, in which case we allow unprivileged code to crash the
>>> entire guest - but that's rare).  Running code on the host is also fine;
>>
>> On Mon, Nov 7, 2011 at 11:26 AM, Avi Kivity <avi@redhat.com> wrote:
>>> One thing to beware of is memory hotplug.  If the memory map is static,
>>> then a fork() once everything is set up (with MAP_SHARED) allows all
>>> processes to access guest memory.  However, if memory hotplug is
>>> supported (or planned to be supported), then you can't do that, as
>>> seccomp doesn't allow you to run mmap() in confined processes.
>>>
>>> This means they have to use RPC to the main process in order to access
>>> memory, which is going to slow them down significantly.
>>
>> Is the risk of non-privileged guest code being able to exploit the
>> hypervisor to access guest memory which it's not allowed to access
>> really that small? I actually thought it would be one of the main
>> concerns we'd need to handle, but from what I understand from you it's
>> an irrelevant scenario.
>>
>> If it's really the case, then mapping guest memory is preferable.
>> While mmap() is an issue, I think it's a great example of why seccomp
>> filters are needed in the kernel, and might be a good chance to push
>> that feature forward. In that sense, 'Secure KVM' could be used as a
>> guinea pig both for seccomp filters and future QEMU work.
>
> This is a really interesting topic - something that we've discussed in
> QEMU as well.
>
> Doing it with seccomp is really hard since that only allows read(2),
> write(2), exit(2), and sigreturn(2).  I think using seccomp means that
> host devices (e.g. actual network and block device I/O) are
> implemented outside the seccomp because it requires other syscalls.
> Then the seccomp process would simply do hardware emulation with IPCs
> for all actual I/O.

Yup, that's why it might be a good chance to explore seccomp filters.

Being able to filter not just calls, but also some parameters of the
calls will allow us to tailor a pretty well defined wrapper for each
and every device.
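
For illustration only: the sketch below uses the filter mode that Linux later
gained (SECCOMP_MODE_FILTER with a classic BPF program), which may differ from
what was being proposed at the time of this thread. It allows exit and
exit_group, and allows write() only when its first argument is one specific
descriptor. `allow_write_to_fd_only()` and `run_filtered()` are hypothetical
names, and only x86_64 and aarch64 are covered.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define SKETCH_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define SKETCH_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only lists x86_64 and aarch64"
#endif

enum { FILTER_EXITED = 0, FILTER_KILLED = 1, FILTER_UNAVAILABLE = 2,
       FILTER_ERROR = 3 };

/* Install a filter: exit/exit_group always, write() only to allowed_fd. */
static int allow_write_to_fd_only(int allowed_fd)
{
    struct sock_filter filter[] = {
        /* 0: refuse anything built for a different syscall ABI */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SKETCH_AUDIT_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        /* 3: dispatch on the syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 5, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit, 4, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 2),
        /* 7: write(): compare the low 32 bits of arg 0 (little-endian) */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, args[0])),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)allowed_fd, 1, 0),
        /* 9: default deny, 10: allow */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Fork a filtered child; if `misbehave`, it also writes to other_fd. */
int run_filtered(int allowed_fd, int other_fd, int misbehave)
{
    pid_t pid;
    int status;

    assert(allowed_fd >= 0 && other_fd >= 0 && allowed_fd != other_fd);
    pid = fork();
    if (pid < 0)
        return FILTER_ERROR;
    if (pid == 0) {
        if (allow_write_to_fd_only(allowed_fd) != 0)
            _exit(FILTER_UNAVAILABLE);
        if (write(allowed_fd, "ok", 2) != 2)
            _exit(FILTER_ERROR);
        if (misbehave && write(other_fd, "no", 2) < 0)
            _exit(FILTER_ERROR); /* not reached: the filter kills first */
        _exit(FILTER_EXITED);
    }
    if (waitpid(pid, &status, 0) != pid)
        return FILTER_ERROR;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSYS)
        return FILTER_KILLED;
    return WIFEXITED(status) ? WEXITSTATUS(status) : FILTER_ERROR;
}
```

The same pattern extends to one rule set per device, for example letting a
block device process read and write only its image descriptor.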

>
> Where does the VNC server, the image formats, etc go?  It would be
> nice to confine them too.

Regarding image formats, just wondering - was there ever any plan to
merge (at least some of them) into the kernel?

> In that respect I think Avi's ideas about using safe programming
> languages (even if just a NaCl toolchain) are nice because they are
> more general and apply to all of the codebase.
>
> Stefan
>


* Re: Secure KVM
  2011-11-07 12:40       ` Sasha Levin
@ 2011-11-07 12:51         ` Avi Kivity
  2011-11-07 14:56           ` Stefan Hajnoczi
  0 siblings, 1 reply; 31+ messages in thread
From: Avi Kivity @ 2011-11-07 12:51 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Stefan Hajnoczi, Andrea Arcangeli, Marcelo Tosatti, Ingo Molnar,
	Pekka Enberg, Cyrill Gorcunov, Asias He, Anthony Liguori,
	Rusty Russell, Michael S. Tsirkin, kvm

On 11/07/2011 02:40 PM, Sasha Levin wrote:
> >
> > Where does the VNC server, the image formats, etc go?  It would be
> > nice to confine them too.
>
> Regarding image formats, just wondering - was there ever any plan to
> merge (at least some of them) into the kernel?

Xen has/had something where (IIUC) the kernel would call out on an
unmapped cluster, let userspace figure out the mapping, then service
requests to that cluster completely in the kernel.  I'm not convinced
it's worthwhile.

btw, the kernel already has support for a flexible copy-on-write format
- btrfs raw files.  It makes sense to increase the integration there. 
You can keep image files as ordinary files, use COW for snapshots, and
implement exporting to qcow via SEEK_HOLE.
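
As a rough idea of what the SEEK_HOLE part involves, the sketch below walks a
file and lists its data extents with lseek(). An exporter would copy only
those ranges into the target image; writing qcow itself is not shown.
`struct extent`, `list_data_extents()` and `make_sparse_demo()` are
hypothetical names. On filesystems that do not track holes, the whole file is
simply reported as one data extent.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA /* Linux values, in case the libc headers hide them */
#define SEEK_DATA 3
#define SEEK_HOLE 4
#endif

struct extent {
    off_t start; /* first byte of data */
    off_t end;   /* first byte past the data */
};

/* Fill out[] with up to `max` data extents of fd; returns count or -1. */
int list_data_extents(int fd, struct extent *out, int max)
{
    off_t size, pos = 0;
    int n = 0;

    assert(out != NULL && max > 0);
    size = lseek(fd, 0, SEEK_END);
    if (size < 0)
        return -1;
    while (pos < size && n < max) {
        off_t data = lseek(fd, pos, SEEK_DATA);
        off_t hole;

        if (data < 0) {
            if (errno == ENXIO)
                break; /* nothing but a hole up to end of file */
            return -1;
        }
        hole = lseek(fd, data, SEEK_HOLE);
        if (hole < 0)
            return -1;
        out[n].start = data;
        out[n].end = hole;
        n++;
        pos = hole;
    }
    return n;
}

/* Demo helper: create a 1 MiB sparse file with 4 KiB of data at 512 KiB. */
int make_sparse_demo(const char *path)
{
    static const char block[4096] = { 1 };
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (ftruncate(fd, 1 << 20) != 0 ||
        pwrite(fd, block, sizeof(block), 512 * 1024) != (ssize_t)sizeof(block)) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The exact extent boundaries depend on the filesystem's allocation granularity,
so callers should treat them as "at least this much is data" rather than as
byte-exact.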

-- 
error compiling committee.c: too many arguments to function



* Re: Secure KVM
  2011-11-07 12:51         ` Avi Kivity
@ 2011-11-07 14:56           ` Stefan Hajnoczi
  0 siblings, 0 replies; 31+ messages in thread
From: Stefan Hajnoczi @ 2011-11-07 14:56 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Sasha Levin, Andrea Arcangeli, Marcelo Tosatti, Ingo Molnar,
	Pekka Enberg, Cyrill Gorcunov, Asias He, Anthony Liguori,
	Rusty Russell, Michael S. Tsirkin, kvm

On Mon, Nov 7, 2011 at 12:51 PM, Avi Kivity <avi@redhat.com> wrote:
> On 11/07/2011 02:40 PM, Sasha Levin wrote:
>> >
>> > Where does the VNC server, the image formats, etc go?  It would be
>> > nice to confine them too.
>>
>> Regarding image formats, just wondering - was there ever any plan to
>> merge (at least some of them) into the kernel?
>
> Xen has/had something where (IIUC) the kernel would call out on an
> unmapped cluster, let userspace figure out the mapping, then service
> requests to that cluster completely in the kernel.  I'm not convinced
> it's worthwhile.

http://wiki.xensource.com/xenwiki/DmUserspace

I like the design - it's essentially a software MMU for block devices.
 Userspace gets to service faults and can therefore look up metadata
in the image format or even allocate new space.

Getting all the qemu-img supported drivers into the kernel isn't
worthwhile or a good idea IMO.  If we got just qcow2 into the kernel
we'd basically have another mechanism to do stuff similar to what LVM
and btrfs can do.  I'm interested in using existing kernel
functionality more than adding a qcow2 driver.

Stefan


* Re: Secure KVM
  2011-11-06 20:40 Secure KVM Sasha Levin
@ 2011-11-07 17:37   ` Anthony Liguori
  2011-11-07  9:26 ` Avi Kivity
  2011-11-07 17:37   ` [Qemu-devel] " Anthony Liguori
  2 siblings, 0 replies; 31+ messages in thread
From: Anthony Liguori @ 2011-11-07 17:37 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Andrea Arcangeli, Avi Kivity, Marcelo Tosatti, Ingo Molnar,
	Pekka Enberg, Cyrill Gorcunov, Asias He, Rusty Russell,
	Michael S. Tsirkin, kvm, Corentin Chary, qemu-devel

On 11/06/2011 02:40 PM, Sasha Levin wrote:
> Hi all,
>
> I'm planning on doing a small fork of the KVM tool to turn it into a
> 'Secure KVM' enabled hypervisor. Now you probably ask yourself, Huh?
>
> The idea was discussed briefly a couple of months ago, but never got off
> the ground - which is a shame IMO.
>
> It's easy to explain the problem: If an attacker finds a security hole
> in any of the devices which are exposed to the guest, the attacker would
> be able to either crash the guest, or possibly run code on the host
> itself.
>
> The solution is also simple to explain: Split the devices into different
> processes and use seccomp to sandbox each device into the exact set of
> resources it needs to operate, nothing more and nothing less.
>
> Since I'll be basing it on the KVM tool, which doesn't really emulate
> that many legacy devices, I'll focus first on the virtio family for the
> sake of simplicity (and covering 90% of the options).
>
> This is my basic overview of how I'm planning on implementing the
> initial POC:
>
> 1. First I'll focus on the simple virtio-rng device, it's simple enough
> to allow us to focus on the aspects which are important for the POC
> while still covering most bases (i.e. sandbox to single file
> - /dev/urandom and such).
>
> 2. Do it on a one process per device concept, where for each device
> (notice - not device *type*) requested, a new process which handles it
> will be spawned.
>
> 3. That process will be limited exactly to the resources it needs to
> operate, for example - if we run a virtio-blk device, it would be able
> to access only the image file which it should be using.
>
> 4. Connection between hypervisor and devices will be based on unix
> sockets, this should allow for better separation compared to other
> approaches such as shared memory.
>
> 5. While performance is an aspect, complete isolation is more important.
> Security is primary, performance is secondary.
>
> 6. Share as much code as possible with current implementation of virtio
> devices, make it possible to run virtio devices either like it's being
> done now, or by spawning them as separate processes - the amount of
> specific code for the separate process case should be minimal.
>
>
> That's all I have for now, comments are *very* welcome.

I thought about this a bit and have some ideas that may or may not help.

1) If you add device save/load support, then it's something you can potentially 
use to give yourself quite a bit of flexibility in changing the sandbox.  At any 
point in run time, you can save the device model's state in the sandbox, destroy 
the sandbox, and then build a new sandbox and restore the device to its former 
state.

This might turn out to be very useful in supporting things like device hotplug 
and/or memory hot plug.

2) I think it's largely possible to implement all device emulation without doing 
any dynamic memory allocation.  Since memory allocation DoS is something you 
have to deal with anyway, I suspect most device emulation already uses a fixed 
amount of memory per device.   This can potentially dramatically simplify things.

3) I think virtio can/should be used as a generic "backend to frontend" 
transport between the device model and the tool.

4) Lack of select() is really challenging.  I understand why it's not there 
since it can technically be emulated but it seems like a no-risk syscall to 
whitelist and it would make programming in a sandbox so much easier.  Maybe 
Andrea has some comments here?  I might be missing something here.
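
Ideas 1 and 2 above combine naturally, as in this sketch: if a device's whole
state is one fixed-size structure, save and load reduce to writing and reading
that structure on a descriptor, using nothing but read() and write() and no
dynamic allocation. Everything here is an assumption for illustration:
`struct rng_dev_state`, its fields, the magic value and the function names are
hypothetical, and the raw-struct format only works between processes built
from the same binary on the same host.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

#define RNG_STATE_MAGIC   0x524e4753u /* hypothetical tag */
#define RNG_STATE_VERSION 1u

/* Entire device state: fixed size, no pointers. */
struct rng_dev_state {
    uint32_t magic;
    uint32_t version;
    uint64_t bytes_served;
    uint16_t last_avail_idx;
    uint16_t reserved[3];
};

static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

static int read_all(int fd, void *buf, size_t len)
{
    char *p = buf;

    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1; /* error or premature end of stream */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Old sandbox: stamp and write the state. Returns 0 on success. */
int rng_state_save(int fd, const struct rng_dev_state *s)
{
    struct rng_dev_state out;

    assert(s != NULL);
    out = *s;
    out.magic = RNG_STATE_MAGIC;
    out.version = RNG_STATE_VERSION;
    return write_all(fd, &out, sizeof(out));
}

/* New sandbox: read the state and reject anything unrecognized. */
int rng_state_load(int fd, struct rng_dev_state *s)
{
    struct rng_dev_state in;

    assert(s != NULL);
    if (read_all(fd, &in, sizeof(in)) != 0)
        return -1;
    if (in.magic != RNG_STATE_MAGIC || in.version != RNG_STATE_VERSION)
        return -1;
    *s = in;
    return 0;
}
```

The loading side should treat the saved state as untrusted input, since it
came out of a sandbox that may have been compromised; the magic and version
check here is only the bare minimum of that validation.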

Regards,

Anthony Liguori

>



* Re: [Qemu-devel] Secure KVM
@ 2011-11-07 17:37   ` Anthony Liguori
  0 siblings, 0 replies; 31+ messages in thread
From: Anthony Liguori @ 2011-11-07 17:37 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Andrea Arcangeli, Cyrill Gorcunov, Rusty Russell, kvm,
	Michael S. Tsirkin, Corentin Chary, Asias He, Marcelo Tosatti,
	qemu-devel, Pekka Enberg, Avi Kivity, Ingo Molnar

On 11/06/2011 02:40 PM, Sasha Levin wrote:
> Hi all,
>
> I'm planning on doing a small fork of the KVM tool to turn it into a
> 'Secure KVM' enabled hypervisor. Now you probably ask yourself, Huh?
>
> The idea was discussed briefly couple of months ago, but never got off
> the ground - which is a shame IMO.
>
> It's easy to explain the problem: If an attacker finds a security hole
> in any of the devices which are exposed to the guest, the attacker would
> be able to either crash the guest, or possibly run code on the host
> itself.
>
> The solution is also simple to explain: Split the devices into different
> processes and use seccomp to sandbox each device into the exact set of
> resources it needs to operate, nothing more and nothing less.
>
> Since I'll be basing it on the KVM tool, which doesn't really emulate
> that many legacy devices, I'll focus first on the virtio family for the
> sake of simplicity (and covering 90% of the options).
>
> This is my basic overview of how I'm planning on implementing the
> initial POC:
>
> 1. First I'll focus on the simple virtio-rng device, it's simple enough
> to allow us to focus on the aspects which are important for the POC
> while still covering most bases (i.e. sandbox to single file
> - /dev/urandom and such).
>
> 2. Do it on a one process per device concept, where for each device
> (notice - not device *type*) requested, a new process which handles it
> will be spawned.
>
> 3. That process will be limited exactly to the resources it needs to
> operate, for example - if we run a virtio-blk device, it would be able
> to access only the image file which it should be using.
>
> 4. Connection between hypervisor and devices will be based on unix
> sockets, this should allow for better separation compared to other
> approaches such as shared memory.
>
> 5. While performance is an aspect, complete isolation is more important.
> Security is primary, performance is secondary.
>
> 6. Share as much code as possible with current implementation of virtio
> devices, make it possible to run virtio devices either like it's being
> done now, or by spawning them as separate processes - the amount of
> specific code for the separate process case should be minimal.
>
>
> Thats all I have for now, comments are *very* welcome.

I thought about this a bit and have some ideas that may or may not help.

1) If you add device save/load support, then it's something you can potentially 
use to give yourself quite a bit of flexibility in changing the sandbox.  At any 
point in run time, you can save the device model's state in the sandbox, destroy 
the sandbox, and then build a new sandbox and restore the device to its former 
state.

This might turn out to be very useful in supporting things like device hotplug
and/or memory hotplug.

2) I think it's largely possible to implement all device emulation without doing
any dynamic memory allocation.  Since memory allocation DoS is something you
have to deal with anyway, I suspect most device emulation already uses a fixed
amount of memory per device.  This could simplify things dramatically.

3) I think virtio can/should be used as a generic "backend to frontend" 
transport between the device model and the tool.

4) Lack of select() is really challenging.  I understand why it's not there 
since it can technically be emulated but it seems like a no-risk syscall to 
whitelist and it would make programming in a sandbox so much easier.  Maybe 
Andrea has some comments here?  I might be missing something here.
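
To make point 2 concrete, here is a minimal sketch of a device model that never
calls malloc(): every request comes out of a statically sized pool, so a guest
that floods the device simply runs out of slots.  The names (struct request,
MAX_INFLIGHT, request_get(), request_put()) and the cap of 8 are assumptions
for illustration, not code from the KVM tool or QEMU.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_INFLIGHT 8   /* assumed per-device cap on outstanding requests */

/* Hypothetical request descriptor for a block-like device. */
struct request {
    uint64_t sector;
    uint32_t len;
    uint8_t  in_use;
};

/* All the request memory this device will ever use, sized at build time. */
static struct request pool[MAX_INFLIGHT];

/* Returns a free slot, or NULL once the guest has used up its quota. */
struct request *request_get(void)
{
    for (size_t i = 0; i < MAX_INFLIGHT; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            return &pool[i];
        }
    }
    return NULL;
}

/* Hands a slot back to the pool. */
void request_put(struct request *req)
{
    req->in_use = 0;
}
```

When request_get() returns NULL the device would stop pulling work off its
queue until a slot is returned, which bounds memory use per device by
construction.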

Regards,

Anthony Liguori

>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Secure KVM
  2011-11-07  9:26 ` Avi Kivity
  2011-11-07 10:17   ` Sasha Levin
@ 2011-11-07 17:39   ` Anthony Liguori
  2011-11-07 18:43     ` Avi Kivity
  2011-11-07 22:56   ` Rusty Russell
  2 siblings, 1 reply; 31+ messages in thread
From: Anthony Liguori @ 2011-11-07 17:39 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Sasha Levin, Andrea Arcangeli, Marcelo Tosatti, Ingo Molnar,
	Pekka Enberg, Cyrill Gorcunov, Asias He, Rusty Russell,
	Michael S. Tsirkin, kvm

On 11/07/2011 03:26 AM, Avi Kivity wrote:
> On 11/06/2011 10:40 PM, Sasha Levin wrote:
>> Hi all,
>>
>> I'm planning on doing a small fork of the KVM tool to turn it into a
>> 'Secure KVM' enabled hypervisor. Now you probably ask yourself, Huh?
>
> Actually, no.
>
>> The idea was discussed briefly couple of months ago, but never got off
>> the ground - which is a shame IMO.
>>
>> It's easy to explain the problem: If an attacker finds a security hole
>> in any of the devices which are exposed to the guest, the attacker would
>> be able to either crash the guest, or possibly run code on the host
>> itself.
>
> Crashing the guest is fine (not 100% - you can have unprivileged code
> managing a device, in which case we allow unprivileged code to crash the
> entire guest - but that's rare).  Running code on the host is also fine;
> we have a permissions system in place to prevent damage; see libvirt's
> sVirt code, which uses selinux to disallow an exploited guest from
> touching other guests or host data.  It should be able to protect
> host-only networks as well (not sure if it does that).
>
> The real risk is that the exploited hypervisor turns around and exploits
> yet another hole in the system, like a privileged daemon that the
> hypervisor is allowed to be in contact with, or the kernel itself, via a
> vulnerability in the kernel interfaces.
>
>> The solution is also simple to explain: Split the devices into different
>> processes and use seccomp to sandbox each device into the exact set of
>> resources it needs to operate, nothing more and nothing less.
>
> One thing to beware of is memory hotplug.  If the memory map is static,
> then a fork() once everything is set up (with MAP_SHARED) allows all
> processes to access guest memory.  However, if memory hotplug is
> supported (or planned to be supported), then you can't do that, as
> seccomp doesn't allow you to run mmap() in confined processes.
>
> This means they have to use RPC to the main process in order to access
> memory, which is going to slow them down significantly.

If you treat the sandbox as ephemeral by leveraging save/restore, you can throw 
away and rebuild the device model on every memory change.  While not a super 
cheap operation, it's at least amortized over time.
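
A rough sketch of that save/rebuild/restore cycle, assuming a toy device whose
whole state fits in one struct.  struct rng_state, run_sandbox_once() and
rebuild_demo() are hypothetical names; the child here is a plain fork() with no
seccomp applied, state is handed in by fork() inheritance and handed back over
a pipe.

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical device state; a real device model would carry far more. */
struct rng_state {
    uint32_t version;       /* format version of the saved blob */
    uint64_t bytes_served;  /* running counter kept by the device */
};

/*
 * Spawn one sandbox process: it loads `in`, "serves" nbytes, saves its
 * state to a pipe and exits.  The saved state is returned in `out`.
 */
static int run_sandbox_once(const struct rng_state *in, uint64_t nbytes,
                            struct rng_state *out)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        struct rng_state st = *in;                    /* load */
        st.bytes_served += nbytes;                    /* device work */
        ssize_t w = write(fds[1], &st, sizeof(st));   /* save */
        _exit(w == (ssize_t)sizeof(st) ? 0 : 1);
    }

    close(fds[1]);
    ssize_t r = read(fds[0], out, sizeof(*out));
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (r != (ssize_t)sizeof(*out) || !WIFEXITED(status) ||
        WEXITSTATUS(status) != 0)
        return -1;
    return 0;
}

/* Destroy and rebuild the sandbox once, carrying the state across. */
uint64_t rebuild_demo(void)
{
    struct rng_state st = { .version = 1, .bytes_served = 0 };
    struct rng_state saved;

    if (run_sandbox_once(&st, 100, &saved) < 0)
        return 0;
    /* The first sandbox is gone; the memory map could change here. */
    if (run_sandbox_once(&saved, 28, &st) < 0)
        return 0;
    return st.bytes_served;
}
```

The point between the two run_sandbox_once() calls is where a memory hotplug
would be applied before the replacement sandbox is built.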

Regards,

Anthony Liguori


>> Since I'll be basing it on the KVM tool, which doesn't really emulate
>> that many legacy devices, I'll focus first on the virtio family for the
>> sake of simplicity (and covering 90% of the options).
>
> Since virtio is so performance sensitive, my feeling is that it is
> better to audit it, and rely on sandboxing for the non performance
> sensitive parts of the device model.  Of course for a POC it's fine to
> start with it.
>
>> This is my basic overview of how I'm planning on implementing the
>> initial POC:
>
> <snip plan>
>
>> Thats all I have for now, comments are *very* welcome.
>
> This plan is quite similar to the equivalent plans for qemu.  However,
> as kvm-tool is much smaller than qemu, you're likely to have much easier
> time and make much faster progress.  This is really a great use of
> kvm-tool, to explore new ideas rather than catching up; and I'm sure
> your experience will prove useful for qemu as well.
>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Secure KVM
  2011-11-07 11:27     ` Stefan Hajnoczi
  2011-11-07 12:40       ` Sasha Levin
@ 2011-11-07 17:43       ` Anthony Liguori
  2011-11-07 18:41         ` Avi Kivity
  1 sibling, 1 reply; 31+ messages in thread
From: Anthony Liguori @ 2011-11-07 17:43 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Sasha Levin, Avi Kivity, Andrea Arcangeli, Marcelo Tosatti,
	Ingo Molnar, Pekka Enberg, Cyrill Gorcunov, Asias He,
	Rusty Russell, Michael S. Tsirkin, kvm

On 11/07/2011 05:27 AM, Stefan Hajnoczi wrote:
> On Mon, Nov 7, 2011 at 10:17 AM, Sasha Levin<levinsasha928@gmail.com>  wrote:
>
> This is a really interesting topic - something that we've discussed in
> QEMU as well.
>
> Doing it with seccomp is really hard since that only allows read(2),
> write(2), exit(2), and sigreturn(2).  I think using seccomp means that
> host devices (e.g. actual network and block device I/O) are
> implemented outside the seccomp because it requires other syscalls.
> Then the seccomp process would simply do hardware emulation with IPCs
> for all actual I/O.
>
> Where does the VNC server, the image formats, etc go?  It would be
> nice to confine them too.
>
> In that respect I think Avi's ideas about using safe programming
> languages (even if just a NaCl toolchain) are nice because they are
> more general and apply to all of the codebase.

It's a nice idea but the NaCl toolchain doesn't have a nice upstream story right
now.

I think seccomp() mode 1 isn't so bad.  It's difficult to bootstrap, but once
you have a reasonable set of RPCs, it shouldn't be all that bad of an
environment to program in.
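
For reference, entering mode 1 looks roughly like this.  It is only a sketch:
run_strict_sandbox() is a made-up helper, the "device" just echoes one message
over pipes that were opened before confinement, and the -2 return covers
environments where prctl() refuses strict mode.

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Echo one message through a child confined with seccomp mode 1.
 * Returns the number of bytes echoed, -1 on error, or -2 if the
 * kernel refused to enter strict mode.
 */
int run_strict_sandbox(void)
{
    int to_child[2], from_child[2];
    char buf[16];

    if (pipe(to_child) < 0 || pipe(from_child) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Everything the sandbox needs must be open before confinement. */
        close(to_child[1]);
        close(from_child[0]);
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(2);
        /* From here on only read, write, exit and sigreturn are allowed. */
        ssize_t got = read(to_child[0], buf, sizeof(buf));
        if (got <= 0 || write(from_child[1], buf, (size_t)got) != got)
            syscall(SYS_exit, 1);
        /* Raw exit(2): glibc's _exit() uses exit_group(2), which is not allowed. */
        syscall(SYS_exit, 0);
    }

    /* Keep to_child[0] open so the write below cannot raise SIGPIPE. */
    close(from_child[1]);
    ssize_t n = -1;
    if (write(to_child[1], "ping", 4) == 4)
        n = read(from_child[0], buf, sizeof(buf));
    close(to_child[0]);
    close(to_child[1]);
    close(from_child[0]);

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 2)
        return -2;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return (int)n;
}
```

Every RPC the sandboxed code needs has to be expressed as read()/write() on
descriptors like these, which is the bootstrapping cost mentioned above.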

One way to think of a seccomp() sandbox is that it emulates the legacy device 
model and translates everything into an ultra-modern, no backwards compat, 
pure-virtio device model.  From a QEMU perspective, it would treat the sandbox 
as part of the guest, and then implement a bare bones machine that only exposed 
the couple of virtio interfaces to the sandbox.  QEMU would then bridge this to 
the various types of backends.

Regards,

Anthony Liguori

>
> Stefan


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Secure KVM
  2011-11-07 17:37   ` [Qemu-devel] " Anthony Liguori
@ 2011-11-07 17:52     ` Sasha Levin
  -1 siblings, 0 replies; 31+ messages in thread
From: Sasha Levin @ 2011-11-07 17:52 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Andrea Arcangeli, Avi Kivity, Marcelo Tosatti, Ingo Molnar,
	Pekka Enberg, Cyrill Gorcunov, Asias He, Rusty Russell,
	Michael S. Tsirkin, kvm, Corentin Chary, qemu-devel

Hi Anthony,

Thank you for your comments!

On Mon, 2011-11-07 at 11:37 -0600, Anthony Liguori wrote:
> On 11/06/2011 02:40 PM, Sasha Levin wrote:
> > Hi all,
> >
> > I'm planning on doing a small fork of the KVM tool to turn it into a
> > 'Secure KVM' enabled hypervisor. Now you probably ask yourself, Huh?
> >
> > The idea was discussed briefly couple of months ago, but never got off
> > the ground - which is a shame IMO.
> >
> > It's easy to explain the problem: If an attacker finds a security hole
> > in any of the devices which are exposed to the guest, the attacker would
> > be able to either crash the guest, or possibly run code on the host
> > itself.
> >
> > The solution is also simple to explain: Split the devices into different
> > processes and use seccomp to sandbox each device into the exact set of
> > resources it needs to operate, nothing more and nothing less.
> >
> > Since I'll be basing it on the KVM tool, which doesn't really emulate
> > that many legacy devices, I'll focus first on the virtio family for the
> > sake of simplicity (and covering 90% of the options).
> >
> > This is my basic overview of how I'm planning on implementing the
> > initial POC:
> >
> > 1. First I'll focus on the simple virtio-rng device, it's simple enough
> > to allow us to focus on the aspects which are important for the POC
> > while still covering most bases (i.e. sandbox to single file
> > - /dev/urandom and such).
> >
> > 2. Do it on a one process per device concept, where for each device
> > (notice - not device *type*) requested, a new process which handles it
> > will be spawned.
> >
> > 3. That process will be limited exactly to the resources it needs to
> > operate, for example - if we run a virtio-blk device, it would be able
> > to access only the image file which it should be using.
> >
> > 4. Connection between hypervisor and devices will be based on unix
> > sockets, this should allow for better separation compared to other
> > approaches such as shared memory.
> >
> > 5. While performance is an aspect, complete isolation is more important.
> > Security is primary, performance is secondary.
> >
> > 6. Share as much code as possible with current implementation of virtio
> > devices, make it possible to run virtio devices either like it's being
> > done now, or by spawning them as separate processes - the amount of
> > specific code for the separate process case should be minimal.
> >
> >
> > Thats all I have for now, comments are *very* welcome.
> 
> I thought about this a bit and have some ideas that may or may not help.
> 
> 1) If you add device save/load support, then it's something you can potentially 
> use to give yourself quite a bit of flexibility in changing the sandbox.  At any 
> point in run time, you can save the device model's state in the sandbox, destroy 
> the sandbox, and then build a new sandbox and restore the device to its former 
> state.
> 
> This might turn out to be very useful in supporting things like device hotplug 
> and/or memory hot plug.
> 
> 2) I think it's largely possible to implement all device emulation without doing 
> any dynamic memory allocation.  Since memory allocation DoS is something you 
> have to deal with anyway, I suspect most device emulation already uses a fixed 
> amount of memory per device.   This can potentially dramatically simplify things.
> 
> 3) I think virtio can/should be used as a generic "backend to frontend" 
> transport between the device model and the tool.

virtio requires server and client to have shared memory, so if we
already go with shared memory we can just let the device manage the
actual virtio driver directly, no?

Also, things like interrupts would require a different sort of IPC,
which would complicate things a bit.


> 4) Lack of select() is really challenging.  I understand why it's not there 
> since it can technically be emulated but it seems like a no-risk syscall to 
> whitelist and it would make programming in a sandbox so much easier.  Maybe 
> Andrea has some comments here?  I might be missing something here.

There are several of these which would be nice to have, and if we can
get seccomp filters we have good flexibility with which APIs we allow
for each device.

> Regards,
> 
> Anthony Liguori
> 
> >
> 

-- 

Sasha.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] Secure KVM
  2011-11-07 17:52     ` [Qemu-devel] " Sasha Levin
@ 2011-11-07 18:03       ` Anthony Liguori
  -1 siblings, 0 replies; 31+ messages in thread
From: Anthony Liguori @ 2011-11-07 18:03 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Andrea Arcangeli, Cyrill Gorcunov, Rusty Russell, kvm,
	Michael S. Tsirkin, Corentin Chary, Asias He, Marcelo Tosatti,
	qemu-devel, Pekka Enberg, Avi Kivity, Ingo Molnar

On 11/07/2011 11:52 AM, Sasha Levin wrote:
> Hi Anthony,
>
> Thank you for your comments!
>
> On Mon, 2011-11-07 at 11:37 -0600, Anthony Liguori wrote:
>> On 11/06/2011 02:40 PM, Sasha Levin wrote:
>>> Hi all,
>>>
>>> I'm planning on doing a small fork of the KVM tool to turn it into a
>>> 'Secure KVM' enabled hypervisor. Now you probably ask yourself, Huh?
>>>
>>> The idea was discussed briefly couple of months ago, but never got off
>>> the ground - which is a shame IMO.
>>>
>>> It's easy to explain the problem: If an attacker finds a security hole
>>> in any of the devices which are exposed to the guest, the attacker would
>>> be able to either crash the guest, or possibly run code on the host
>>> itself.
>>>
>>> The solution is also simple to explain: Split the devices into different
>>> processes and use seccomp to sandbox each device into the exact set of
>>> resources it needs to operate, nothing more and nothing less.
>>>
>>> Since I'll be basing it on the KVM tool, which doesn't really emulate
>>> that many legacy devices, I'll focus first on the virtio family for the
>>> sake of simplicity (and covering 90% of the options).
>>>
>>> This is my basic overview of how I'm planning on implementing the
>>> initial POC:
>>>
>>> 1. First I'll focus on the simple virtio-rng device, it's simple enough
>>> to allow us to focus on the aspects which are important for the POC
>>> while still covering most bases (i.e. sandbox to single file
>>> - /dev/urandom and such).
>>>
>>> 2. Do it on a one process per device concept, where for each device
>>> (notice - not device *type*) requested, a new process which handles it
>>> will be spawned.
>>>
>>> 3. That process will be limited exactly to the resources it needs to
>>> operate, for example - if we run a virtio-blk device, it would be able
>>> to access only the image file which it should be using.
>>>
>>> 4. Connection between hypervisor and devices will be based on unix
>>> sockets, this should allow for better separation compared to other
>>> approaches such as shared memory.
>>>
>>> 5. While performance is an aspect, complete isolation is more important.
>>> Security is primary, performance is secondary.
>>>
>>> 6. Share as much code as possible with current implementation of virtio
>>> devices, make it possible to run virtio devices either like it's being
>>> done now, or by spawning them as separate processes - the amount of
>>> specific code for the separate process case should be minimal.
>>>
>>>
>>> Thats all I have for now, comments are *very* welcome.
>>
>> I thought about this a bit and have some ideas that may or may not help.
>>
>> 1) If you add device save/load support, then it's something you can potentially
>> use to give yourself quite a bit of flexibility in changing the sandbox.  At any
>> point in run time, you can save the device model's state in the sandbox, destroy
>> the sandbox, and then build a new sandbox and restore the device to its former
>> state.
>>
>> This might turn out to be very useful in supporting things like device hotplug
>> and/or memory hot plug.
>>
>> 2) I think it's largely possible to implement all device emulation without doing
>> any dynamic memory allocation.  Since memory allocation DoS is something you
>> have to deal with anyway, I suspect most device emulation already uses a fixed
>> amount of memory per device.   This can potentially dramatically simplify things.
>>
>> 3) I think virtio can/should be used as a generic "backend to frontend"
>> transport between the device model and the tool.
>
> virtio requires server and client to have shared memory, so if we
> already go with shared memory we can just let the device manage the
> actual virtio driver directly, no?

Let's say you're implementing an IDE device model in the sandbox.  You can try
to implement the block layer in the sandbox but I think that will quickly become
too difficult.

You can do as Avi suggested and do all DMA accesses from the IDE device model as 
RPCs, or you can map guest memory as shared memory and utilize (1) in order to 
change that mapping as you need to.
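
The shared-memory option can be sketched as follows, assuming an anonymous
MAP_SHARED mapping stands in for guest RAM.  map_guest(), spawn_device() and
hotplug_demo() are hypothetical helpers, and a real tool would keep the
existing RAM contents across a hotplug rather than mapping a fresh region.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map `size` bytes of anonymous memory that stays shared across fork(). */
static unsigned char *map_guest(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Fork a "device" child that inherits the mapping and writes `tag`
 * into the last byte of guest memory.  Returns 0 on success.
 */
static int spawn_device(unsigned char *guest_mem, size_t size,
                        unsigned char tag)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        guest_mem[size - 1] = tag;  /* visible to the parent: MAP_SHARED */
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}

/* Build a sandbox, "hotplug" memory, then build a replacement sandbox. */
int hotplug_demo(void)
{
    size_t small = 1 << 20, big = 2 << 20;

    unsigned char *mem = map_guest(small);
    if (!mem || spawn_device(mem, small, 0xA1) < 0)
        return -1;
    int first = mem[small - 1];

    /* The old child has exited; change the mapping and respawn. */
    munmap(mem, small);
    mem = map_guest(big);
    if (!mem || spawn_device(mem, big, 0xB2) < 0)
        return -1;
    int second = mem[big - 1];
    munmap(mem, big);

    return (first == 0xA1 && second == 0xB2) ? 0 : -1;
}
```

The child never calls mmap() itself, so this stays compatible with a confined
process; only the parent changes the memory map.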

At some point, you end up with a struct iovec and an offset that you want to 
read/write to the virtual disk.  You need a way to send that to the "frontend" 
that will then handle that as a raw/qcow2 request.

Well, virtio is great at doing exactly that :-)   So if you increase your shared 
memory to have a little bit extra to stick another vring, you can use that for 
device model -> front end communication without paying an extra memcpy.
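
As an illustration of the idea (not the actual virtio vring layout), a
single-producer, single-consumer ring of descriptors in shared memory might
look like this.  struct ring, struct ring_desc, ring_push() and ring_pop() are
assumed names; only the descriptor is written to the ring, while the data
stays where it is in guest memory.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u   /* assumed size; must be a power of two */

/* One request: where the buffer sits in shared memory, and the disk offset. */
struct ring_desc {
    uint64_t guest_off;  /* offset of the data inside shared guest memory */
    uint32_t len;
    uint64_t disk_off;
};

/* Would live in the extra shared-memory area that both sides map. */
struct ring {
    _Atomic uint32_t head;             /* advanced by the device model */
    _Atomic uint32_t tail;             /* advanced by the front end */
    struct ring_desc desc[RING_SIZE];
};

/* Device model side: publish a request.  Returns -1 if the ring is full. */
int ring_push(struct ring *r, struct ring_desc d)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return -1;
    r->desc[head % RING_SIZE] = d;
    /* Release: the descriptor is visible before the new head is. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 0;
}

/* Front end side: take the oldest request.  Returns -1 if the ring is empty. */
int ring_pop(struct ring *r, struct ring_desc *out)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)
        return -1;
    *out = r->desc[tail % RING_SIZE];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 0;
}
```

Since the device model is untrusted, the front end would have to bounds-check
guest_off and len before touching the buffer they describe.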

For notifications, the easiest thing to do is set up an "event channel" bitmap
and use a single eventfd to multiplex that event channel bitmap.  This is pretty
much how Xen works btw.  A single interrupt is reserved and a bitmap is used to 
dispatch the actual events.

So the sandbox loop would look like:

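int main(void)
{
   setup_devices();

   for (;;) {
      /* blocks until the other side kicks the event channel */
      read_from_event_channel(main_channel);
      /* num_vrings: however many vrings were set up above */
      for (int i = 0; i < num_vrings; i++)
         check_vring_notification(i);
   }
}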

One vring would be used for dispatching PIO/MMIO.  The remaining vrings could
be used for anything really.
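
The event channel itself could be sketched like this, assuming one bit per
vring and a bitmap that would sit in shared memory in practice.  struct
event_channel, evtchn_init(), evtchn_notify() and evtchn_wait() are
hypothetical names.  The receiver only needs read(2) on the eventfd, so it
does not depend on select().

```c
#include <stdatomic.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

#define NUM_CHANNELS 32   /* assumed: one bit per vring */

struct event_channel {
    int efd;                   /* single eventfd shared by all channels */
    _Atomic uint32_t pending;  /* bit i set: vring i needs attention */
};

/* Create the eventfd and clear the bitmap.  Returns 0 on success. */
int evtchn_init(struct event_channel *ec)
{
    atomic_init(&ec->pending, 0);
    ec->efd = eventfd(0, 0);
    return ec->efd < 0 ? -1 : 0;
}

/* Sender: mark channel i (< NUM_CHANNELS) pending, then kick the eventfd. */
int evtchn_notify(struct event_channel *ec, unsigned i)
{
    uint64_t one = 1;

    atomic_fetch_or(&ec->pending, UINT32_C(1) << i);
    return write(ec->efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Receiver: block in read(2), then fetch and clear the pending bitmap. */
uint32_t evtchn_wait(struct event_channel *ec)
{
    uint64_t count;

    if (read(ec->efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return 0;
    return atomic_exchange(&ec->pending, 0);
}
```

Several notifications can collapse into one wakeup; the returned bitmap says
which vrings to check.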

Like I mentioned elsewhere, think of the sandbox as just an extension of
the guest's firmware.  The purpose of the sandbox is to reduce a very
complicated legacy device model to a very simple, easy-to-audit, purely
virtio-based model.

>
> Also, things like interrupts would also require some sort of a different
> IPC, which would complicate things a bit.
>
>
>> 4) Lack of select() is really challenging.  I understand why it's not there
>> since it can technically be emulated but it seems like a no-risk syscall to
>> whitelist and it would make programming in a sandbox so much easier.  Maybe
>> Andrea has some comments here?  I might be missing something here.
>
> There are several of these which would be nice to have, and if we can
> get seccomp filters we have good flexibility with which APIs we allow
> for each device.

Yeah, filters are nice but I fear that you lose some of the PR benefits of
sandboxing.  Once the first application that claims to use sandboxing whitelists
a syscall it shouldn't, you'll start getting slashdot articles about "Linux
sandbox broken, Linux security hopelessly broken".  Then what's the point of all
of this?

Regards,

Anthony Liguori

>> Regards,
>>
>> Anthony Liguori
>>
>>>
>>
>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Secure KVM
  2011-11-07 17:43       ` Anthony Liguori
@ 2011-11-07 18:41         ` Avi Kivity
  0 siblings, 0 replies; 31+ messages in thread
From: Avi Kivity @ 2011-11-07 18:41 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Stefan Hajnoczi, Sasha Levin, Andrea Arcangeli, Marcelo Tosatti,
	Ingo Molnar, Pekka Enberg, Cyrill Gorcunov, Asias He,
	Rusty Russell, Michael S. Tsirkin, kvm

On 11/07/2011 07:43 PM, Anthony Liguori wrote:
>> In that respect I think Avi's ideas about using safe programming
>> languages (even if just a NaCl toolchain) are nice because they are
>> more general and apply to all of the codebase.
>
>
> It's a nice idea but the NaCl toolchain doesn't have a nice upstream
> story right now.

True.  It doesn't have to be NaCl though, although that's my favorite. 
The biggest advantage is near native speed with no context switches for
RPC.  This allows virtio and hpet to be sandboxed too.

>
> I think seccomp() mode 1 isn't so bad.  It's difficult to bootstrap,
> but once you have a reasonable set of RPCs, it shouldn't be all that
> bad of an environment to program in.

It should be exactly the same as today's qemu.  We port the qemu_* APIs
to our rpc, and everything should just work (but no direct memory access
any more - everything goes through the APIs).

>
> One way to think of a seccomp() sandbox is that it emulates the legacy
> device model and translates everything into an ultra-modern, no
> backwards compat, pure-virtio device model.  From a QEMU perspective,
> it would treat the sandbox as part of the guest, and then implement a
> bare bones machine that only exposed the couple of virtio interfaces
> to the sandbox.  QEMU would then bridge this to the various types of
> backends.

I don't see how it works - some devices reference the guest address
space, which we can't touch.

How would you bridge IDE to virtio?  Create a third address space for
the internal virtio device?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Secure KVM
  2011-11-07 17:39   ` Anthony Liguori
@ 2011-11-07 18:43     ` Avi Kivity
  2011-11-07 19:07       ` Anthony Liguori
  0 siblings, 1 reply; 31+ messages in thread
From: Avi Kivity @ 2011-11-07 18:43 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Sasha Levin, Andrea Arcangeli, Marcelo Tosatti, Ingo Molnar,
	Pekka Enberg, Cyrill Gorcunov, Asias He, Rusty Russell,
	Michael S. Tsirkin, kvm

On 11/07/2011 07:39 PM, Anthony Liguori wrote:
>> One thing to beware of is memory hotplug.  If the memory map is static,
>> then a fork() once everything is set up (with MAP_SHARED) allows all
>> processes to access guest memory.  However, if memory hotplug is
>> supported (or planned to be supported), then you can't do that, as
>> seccomp doesn't allow you to run mmap() in confined processes.
>>
>> This means they have to use RPC to the main process in order to access
>> memory, which is going to slow them down significantly.
>
>
> If you treat the sandbox as ephemeral by leveraging save/restore, you
> can throw away and rebuild the device model on every memory change. 
> While not a super cheap operation, it's at least amortized over time.

Good idea!

We lose the context of all threads, but that also happens on live
migration.  I'm sure this is workable.

Plus we get save/restore testing for free.  Did someone say win/win?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Secure KVM
  2011-11-07 18:43     ` Avi Kivity
@ 2011-11-07 19:07       ` Anthony Liguori
  2011-11-07 19:54         ` Avi Kivity
  0 siblings, 1 reply; 31+ messages in thread
From: Anthony Liguori @ 2011-11-07 19:07 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Sasha Levin, Andrea Arcangeli, Marcelo Tosatti, Ingo Molnar,
	Pekka Enberg, Cyrill Gorcunov, Asias He, Rusty Russell,
	Michael S. Tsirkin, kvm

On 11/07/2011 12:43 PM, Avi Kivity wrote:
> On 11/07/2011 07:39 PM, Anthony Liguori wrote:
>>> One thing to beware of is memory hotplug.  If the memory map is static,
>>> then a fork() once everything is set up (with MAP_SHARED) allows all
>>> processes to access guest memory.  However, if memory hotplug is
>>> supported (or planned to be supported), then you can't do that, as
>>> seccomp doesn't allow you to run mmap() in confined processes.
>>>
>>> This means they have to use RPC to the main process in order to access
>>> memory, which is going to slow them down significantly.
>>
>>
>> If you treat the sandbox as ephemeral by leveraging save/restore, you
>> can throw away and rebuild the device model on every memory change.
>> While not a super cheap operation, it's at least amortized over time.
>
> Good idea!
>
> We lose the context of all threads, but that also happens on live
> migration.  I'm sure this is workable.
>
> Plus we get save/restore testing for free.  Did someone say win/win?

Indeed.

But it mandates that everything in the sandbox be serializable so given the 
current state of things, it would mean you couldn't put qcow2 in the sandbox, 
for instance.

Regards,

Anthony Liguori






^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Secure KVM
  2011-11-07 19:07       ` Anthony Liguori
@ 2011-11-07 19:54         ` Avi Kivity
  0 siblings, 0 replies; 31+ messages in thread
From: Avi Kivity @ 2011-11-07 19:54 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Sasha Levin, Andrea Arcangeli, Marcelo Tosatti, Ingo Molnar,
	Pekka Enberg, Cyrill Gorcunov, Asias He, Rusty Russell,
	Michael S. Tsirkin, kvm

On 11/07/2011 09:07 PM, Anthony Liguori wrote:
>> We lose the context of all threads, but that also happens on live
>> migration.  I'm sure this is workable.
>>
>> Plus we get save/restore testing for free.  Did someone say win/win?
>
>
> Indeed.
>
> But it mandates that everything in the sandbox be serializable so
> given the current state of things, it would mean you couldn't put
> qcow2 in the sandbox, for instance.

Quiesce all requests, reopen the blockdev.

A bit heavyweight, but that's life.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Secure KVM
  2011-11-07  6:29   ` Sasha Levin
  2011-11-07  6:37     ` Pekka Enberg
@ 2011-11-07 22:49     ` Rusty Russell
  1 sibling, 0 replies; 31+ messages in thread
From: Rusty Russell @ 2011-11-07 22:49 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Andrea Arcangeli, Avi Kivity, Marcelo Tosatti, Ingo Molnar,
	Pekka Enberg, Cyrill Gorcunov, Asias He, Anthony Liguori,
	Michael S. Tsirkin, kvm

On Mon, 07 Nov 2011 08:29:03 +0200, Sasha Levin <levinsasha928@gmail.com> wrote:
> On Mon, 2011-11-07 at 10:37 +1030, Rusty Russell wrote:
> > On Sun, 06 Nov 2011 22:40:20 +0200, Sasha Levin <levinsasha928@gmail.com> wrote:
> > > The solution is also simple to explain: Split the devices into different
> > > processes and use seccomp to sandbox each device into the exact set of
> > > resources it needs to operate, nothing more and nothing less.
> > 
> > lguest does a process per device.  Actually, it uses clone for legacy
> > reasons, but I have a patch which changes it to processes.
> > 
> > It works well, and it's *simple*.  I suggest looking at
> > Documentation/virtual/lguest/lguest.c.
> > 
> > Good luck!
> > Rusty.
> 
> Yup, that's pretty much what I want to have.
> 
> As you said, clone() isn't really an option - sharing things like the VM
> and handles is something which I want to avoid. How does your patch
> handle IPC?

Yeah, the patch to change it to processes just changes the mmap (of
/dev/zero) which forms guest memory from MAP_PRIVATE to MAP_SHARED.

There's no IPC, because I have no device hotplug :)  On exit we kill the
entire process group, so it kills the device processes too.

Cheers,
Rusty.
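
The MAP_PRIVATE to MAP_SHARED change described above can be illustrated
with a minimal sketch.  This is not the lguest patch itself: the helper
names (map_guest_memory, device_process_store) are hypothetical, and an
anonymous mapping stands in for the /dev/zero mapping as an assumption.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Map `size` bytes of zeroed "guest memory" that stays shared across fork().
 * MAP_ANONYMOUS is an assumption standing in for the /dev/zero mapping.
 */
static void *map_guest_memory(size_t size)
{
	void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	return mem == MAP_FAILED ? NULL : mem;
}

/*
 * Fork a hypothetical device process that stores `value` at `offset`.
 * Returns 0 once the child has exited cleanly, -1 on any failure.
 */
static int device_process_store(uint8_t *guest_mem, size_t offset, uint8_t value)
{
	pid_t pid;
	int status;

	assert(guest_mem != NULL);
	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* With MAP_SHARED this store is visible to the parent. */
		guest_mem[offset] = value;
		_exit(0);
	}
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

With MAP_PRIVATE instead, the child's store would land in its own
copy-on-write page and the parent would keep reading zero, which is why
the flag change alone is enough when there is no hotplug.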

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Secure KVM
  2011-11-07  9:26 ` Avi Kivity
  2011-11-07 10:17   ` Sasha Levin
  2011-11-07 17:39   ` Anthony Liguori
@ 2011-11-07 22:56   ` Rusty Russell
  2 siblings, 0 replies; 31+ messages in thread
From: Rusty Russell @ 2011-11-07 22:56 UTC (permalink / raw)
  To: Avi Kivity, Sasha Levin
  Cc: Andrea Arcangeli, Marcelo Tosatti, Ingo Molnar, Pekka Enberg,
	Cyrill Gorcunov, Asias He, Anthony Liguori, Michael S. Tsirkin,
	kvm

On Mon, 07 Nov 2011 11:26:53 +0200, Avi Kivity <avi@redhat.com> wrote:
> One thing to beware of is memory hotplug.  If the memory map is static,
> then a fork() once everything is set up (with MAP_SHARED) allows all
> processes to access guest memory.  However, if memory hotplug is
> supported (or planned to be supported), then you can't do that, as
> seccomp doesn't allow you to run mmap() in confined processes.
> 
> This means they have to use RPC to the main process in order to access
> memory, which is going to slow them down significantly.

That would be very silly.  As virtio devices are simple, you just ask
the device process to save its state, then you kill it and start a new
one.  For initial implementation, you service each request in a loop so
there's no state at all.

A pipe is all you need.
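
As a rough sketch of that "service each request in a loop over a pipe"
idea: the device side below uses only read(), write() and exit(), so it
can run under seccomp mode 1.  The request/reply structs and the
function names are hypothetical, chosen only for illustration, and the
"device" just adds two numbers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/seccomp.h>
#include <stdint.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical wire format: one fixed-size request, one fixed-size reply. */
struct dev_request { uint32_t a, b; };
struct dev_reply   { uint32_t sum; };

/* Device side: runs in the child and never returns. */
static void device_loop(int req_fd, int reply_fd)
{
	struct dev_request req;
	struct dev_reply reply;

	/*
	 * Demo choice: if strict seccomp is unavailable we carry on
	 * unconfined.  A real device should refuse to run instead.
	 */
	prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);

	while (read(req_fd, &req, sizeof(req)) == (ssize_t)sizeof(req)) {
		reply.sum = req.a + req.b;	/* no state kept between requests */
		if (write(reply_fd, &reply, sizeof(reply)) != (ssize_t)sizeof(reply))
			break;
	}
	/* glibc's _exit() uses exit_group(), which mode 1 does not allow. */
	syscall(SYS_exit, 0);
}

/* Host side: spawn the device, send one request, then close the pipe. */
static int ask_device(uint32_t a, uint32_t b, uint32_t *sum)
{
	struct dev_request rq = { a, b };
	struct dev_reply rp = { 0 };
	int req[2], rep[2], status, ok;
	pid_t pid;

	assert(sum != NULL);
	if (pipe(req) < 0)
		return -1;
	if (pipe(rep) < 0) {
		close(req[0]);
		close(req[1]);
		return -1;
	}
	pid = fork();
	if (pid < 0) {
		close(req[0]); close(req[1]);
		close(rep[0]); close(rep[1]);
		return -1;
	}
	if (pid == 0) {
		close(req[1]);
		close(rep[0]);
		device_loop(req[0], rep[1]);
	}
	close(req[0]);
	close(rep[1]);
	ok = write(req[1], &rq, sizeof(rq)) == (ssize_t)sizeof(rq) &&
	     read(rep[0], &rp, sizeof(rp)) == (ssize_t)sizeof(rp);
	close(req[1]);		/* EOF on the request pipe ends device_loop() */
	close(rep[0]);
	if (waitpid(pid, &status, 0) != pid || !ok)
		return -1;
	*sum = rp.sum;
	return 0;
}
```

Because the loop keeps nothing between requests, killing the child and
forking a fresh one after a memory map change loses nothing.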

> > Since I'll be basing it on the KVM tool, which doesn't really emulate
> > that many legacy devices, I'll focus first on the virtio family for the
> > sake of simplicity (and covering 90% of the options).
> 
> Since virtio is so performance sensitive, my feeling is that it is
> better to audit it, and rely on sandboxing for the non performance
> sensitive parts of the device model.  Of course for a POC it's fine to
> start with it.

A separate thread per device (or even per virtqueue, as lguest does)
will help parallelism.  My very brief experiments with lguest showed
that it made some things better, some things worse...

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Qemu-devel] Secure KVM
  2011-11-07 18:03       ` Anthony Liguori
@ 2011-11-07 23:06         ` Rusty Russell
  -1 siblings, 0 replies; 31+ messages in thread
From: Rusty Russell @ 2011-11-07 23:06 UTC (permalink / raw)
  To: Anthony Liguori, Sasha Levin
  Cc: Andrea Arcangeli, Cyrill Gorcunov, kvm, Michael S. Tsirkin,
	Corentin Chary, Asias He, Marcelo Tosatti, qemu-devel,
	Pekka Enberg, Avi Kivity, Ingo Molnar

On Mon, 07 Nov 2011 12:03:38 -0600, Anthony Liguori <anthony@codemonkey.ws> wrote:
> So the sandbox loop would look like:
> 
> void main() {
>    setup_devices();
> 
>    read_from_event_channel(main_channel);
>    for i in vrings:
>       check_vring_notification(i);
> }

lguest uses a model where you attach an eventfd to a given virtqueue.
(If you don't have an eventfd registered for a vq, the main process
 returns from the read() of /dev/lguest with the info).

At the moment we use a process per virtqueue, but you could attach the
same eventfd to multiple vqs.

Since you can't select() inside seccomp, the main process could write to
the eventfd to wake up the thread to respond to IPC.
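
A small sketch of that wake-up path, assuming nothing beyond eventfd(2):
the waiter sleeps in a plain read() on the eventfd, which seccomp mode 1
still permits, and the main process wakes it with a write().  The
function names are hypothetical, and the value written is just a
stand-in for "n notifications pending".

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a hypothetical device process that sleeps in read() on `efd` and
 * reports the counter value it was woken with through `out_fd`.
 */
static pid_t spawn_waiter(int efd, int out_fd)
{
	pid_t pid = fork();

	if (pid == 0) {
		uint64_t kicks = 0;

		/* Blocks until the counter is non-zero, then reads and resets it. */
		if (read(efd, &kicks, sizeof(kicks)) != (ssize_t)sizeof(kicks))
			_exit(1);
		if (write(out_fd, &kicks, sizeof(kicks)) != (ssize_t)sizeof(kicks))
			_exit(1);
		_exit(0);
	}
	return pid;
}

/* Main-process side: add `n` to the eventfd and return what the waiter saw. */
static int64_t kick_and_collect(uint32_t n)
{
	uint64_t add = n, seen = 0;
	int efd, out[2], status, ok;
	pid_t pid;

	assert(n > 0);		/* writing 0 would not wake the reader */
	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;
	if (pipe(out) < 0) {
		close(efd);
		return -1;
	}
	pid = spawn_waiter(efd, out[1]);
	close(out[1]);
	if (pid < 0) {
		close(efd);
		close(out[0]);
		return -1;
	}
	ok = write(efd, &add, sizeof(add)) == (ssize_t)sizeof(add) &&
	     read(out[0], &seen, sizeof(seen)) == (ssize_t)sizeof(seen);
	close(efd);
	close(out[0]);
	if (waitpid(pid, &status, 0) != pid || !ok)
		return -1;
	return (int64_t)seen;
}
```

Several writes that arrive before the waiter runs are summed into one
counter value, so a single eventfd shared by multiple vqs only says
"something happened"; the waiter still has to check each vring.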

Here's the net output code:

/*
 * The Network
 *
 * Handling output for network is also simple: we get all the output buffers
 * and write them to /dev/net/tun.
 */
struct net_info {
	int tunfd;
};

static void net_output(struct virtqueue *vq)
{
	struct net_info *net_info = vq->dev->priv;
	unsigned int head, out, in;
	struct iovec iov[vq->vring.num];

	/* We usually wait in here for the Guest to give us a packet. */
	head = wait_for_vq_desc(vq, iov, &out, &in);
	if (in)
		errx(1, "Input buffers in net output queue?");
	/*
	 * Send the whole thing through to /dev/net/tun.  It expects the exact
	 * same format: what a coincidence!
	 */
	if (writev(net_info->tunfd, iov, out) < 0)
		warnx("Write to tun failed (%d)?", errno);

	/*
	 * Done with that one; wait_for_vq_desc() will send the interrupt if
	 * all packets are processed.
	 */
	add_used(vq, head, 0);
}

Here's the input thread:

/*
 * Handling network input is a bit trickier, because I've tried to optimize it.
 *
 * First we have a helper routine which tells us if reading from this file
 * descriptor (ie. the /dev/net/tun device) will block:
 */
static bool will_block(int fd)
{
	fd_set fdset;
	struct timeval zero = { 0, 0 };
	FD_ZERO(&fdset);
	FD_SET(fd, &fdset);
	return select(fd+1, &fdset, NULL, NULL, &zero) != 1;
}

/*
 * This handles packets coming in from the tun device to our Guest.  Like all
 * service routines, it gets called again as soon as it returns, so you don't
 * see a while(1) loop here.
 */
static void net_input(struct virtqueue *vq)
{
	int len;
	unsigned int head, out, in;
	struct iovec iov[vq->vring.num];
	struct net_info *net_info = vq->dev->priv;

	/*
	 * Get a descriptor to write an incoming packet into.  This will also
	 * send an interrupt if they're out of descriptors.
	 */
	head = wait_for_vq_desc(vq, iov, &out, &in);
	if (out)
		errx(1, "Output buffers in net input queue?");

	/*
	 * If it looks like we'll block reading from the tun device, send them
	 * an interrupt.
	 */
	if (vq->pending_used && will_block(net_info->tunfd))
		trigger_irq(vq);

	/*
	 * Read in the packet.  This is where we normally wait (when there's no
	 * incoming network traffic).
	 */
	len = readv(net_info->tunfd, iov, in);
	if (len <= 0)
		warn("Failed to read from tun (%d).", errno);

	/*
	 * Mark that packet buffer as used, but don't interrupt here.  We want
	 * to wait until we've done as much work as we can.
	 */
	add_used(vq, head, len);
}


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Secure KVM
  2011-11-07 18:03       ` Anthony Liguori
@ 2011-11-08 19:51         ` Will Drewry
  -1 siblings, 0 replies; 31+ messages in thread
From: Will Drewry @ 2011-11-08 19:51 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Andrea Arcangeli, Cyrill Gorcunov, Rusty Russell, kvm,
	Michael S. Tsirkin, Corentin Chary, Asias He, Marcelo Tosatti,
	qemu-devel, Pekka Enberg, Sasha Levin, Ingo Molnar, Avi Kivity

On Mon, Nov 7, 2011 at 12:03 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 11/07/2011 11:52 AM, Sasha Levin wrote:
>>
>> Hi Anthony,
>>
>> Thank you for your comments!
>>
>> On Mon, 2011-11-07 at 11:37 -0600, Anthony Liguori wrote:
>>>
>>> On 11/06/2011 02:40 PM, Sasha Levin wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I'm planning on doing a small fork of the KVM tool to turn it into a
>>>> 'Secure KVM' enabled hypervisor. Now you probably ask yourself, Huh?
>>>>
>>>> The idea was discussed briefly a couple of months ago, but never got off
>>>> the ground - which is a shame IMO.
>>>>
>>>> It's easy to explain the problem: If an attacker finds a security hole
>>>> in any of the devices which are exposed to the guest, the attacker would
>>>> be able to either crash the guest, or possibly run code on the host
>>>> itself.
>>>>
>>>> The solution is also simple to explain: Split the devices into different
>>>> processes and use seccomp to sandbox each device into the exact set of
>>>> resources it needs to operate, nothing more and nothing less.
>>>>
>>>> Since I'll be basing it on the KVM tool, which doesn't really emulate
>>>> that many legacy devices, I'll focus first on the virtio family for the
>>>> sake of simplicity (and covering 90% of the options).
>>>>
>>>> This is my basic overview of how I'm planning on implementing the
>>>> initial POC:
>>>>
>>>> 1. First I'll focus on the simple virtio-rng device, it's simple enough
>>>> to allow us to focus on the aspects which are important for the POC
>>>> while still covering most bases (i.e. sandbox to single file
>>>> - /dev/urandom and such).
>>>>
>>>> 2. Do it on a one process per device concept, where for each device
>>>> (notice - not device *type*) requested, a new process which handles it
>>>> will be spawned.
>>>>
>>>> 3. That process will be limited exactly to the resources it needs to
>>>> operate, for example - if we run a virtio-blk device, it would be able
>>>> to access only the image file which it should be using.
>>>>
>>>> 4. Connection between hypervisor and devices will be based on unix
>>>> sockets, this should allow for better separation compared to other
>>>> approaches such as shared memory.
>>>>
>>>> 5. While performance is an aspect, complete isolation is more important.
>>>> Security is primary, performance is secondary.
>>>>
>>>> 6. Share as much code as possible with current implementation of virtio
>>>> devices, make it possible to run virtio devices either like it's being
>>>> done now, or by spawning them as separate processes - the amount of
>>>> specific code for the separate process case should be minimal.
>>>>
>>>>
>>>> That's all I have for now, comments are *very* welcome.
>>>
>>> I thought about this a bit and have some ideas that may or may not help.
>>>
>>> 1) If you add device save/load support, then it's something you can
>>> potentially use to give yourself quite a bit of flexibility in changing
>>> the sandbox.  At any point in run time, you can save the device model's
>>> state in the sandbox, destroy the sandbox, and then build a new sandbox
>>> and restore the device to its former state.
>>>
>>> This might turn out to be very useful in supporting things like device
>>> hotplug and/or memory hot plug.
>>>
>>> 2) I think it's largely possible to implement all device emulation
>>> without doing any dynamic memory allocation.  Since memory allocation DoS
>>> is something you have to deal with anyway, I suspect most device
>>> emulation already uses a fixed amount of memory per device.  This can
>>> potentially dramatically simplify things.
>>>
>>> 3) I think virtio can/should be used as a generic "backend to frontend"
>>> transport between the device model and the tool.
>>
>> virtio requires server and client to have shared memory, so if we
>> already go with shared memory we can just let the device manage the
>> actual virtio driver directly, no?
>
> Let's say you're implementing an IDE device model in the sandbox.  You can
> try to implement the block layer in the sandbox but I think that quickly
> will become too difficult.
>
> You can do as Avi suggested and do all DMA accesses from the IDE device
> model as RPCs, or you can map guest memory as shared memory and utilize (1)
> in order to change that mapping as you need to.
>
> At some point, you end up with a struct iovec and an offset that you want to
> read/write to the virtual disk.  You need a way to send that to the
> "frontend" that will then handle that as a raw/qcow2 request.
>
> Well, virtio is great at doing exactly that :-)   So if you increase your
> shared memory to have a little bit extra to stick another vring, you can use
> that for device model -> front end communication without paying an extra
> memcpy.
>
> For notifications, the easiest thing to do is setup an "event channel"
> bitmap and use a single eventfd to multiplex that event channel bitmap.
>  This is pretty much how Xen works btw.  A single interrupt is reserved and
> a bitmap is used to dispatch the actual events.
>
> So the sandbox loop would look like:
>
> void main() {
>  setup_devices();
>
>  read_from_event_channel(main_channel);
>  for i in vrings:
>     check_vring_notification(i);
> }
>
> One vring would be used for dispatching PIO/MMIO.  The remaining vrings
> could be used for anything really.
>
> Like I mentioned elsewhere, just think of the sandbox as just an extension
> of the guest's firmware.  The purpose of the sandbox is to reduce a very
> complicated, legacy device model, into a very simple and easy to audit,
> purely virtio based model.
>
>>
>> Also, things like interrupts would also require some sort of a different
>> IPC, which would complicate things a bit.
>>
>>
>>> 4) Lack of select() is really challenging.  I understand why it's not
>>> there since it can technically be emulated but it seems like a no-risk
>>> syscall to whitelist and it would make programming in a sandbox so much
>>> easier.  Maybe Andrea has some comments here?  I might be missing
>>> something here.
>>
>> There are several of these which would be nice to have, and if we can
>> get seccomp filters we have good flexibility with which APIs we allow
>> for each device.
>
> Yeah, filters are nice but I fear that you lose some of the PR benefits of
> sandboxing.  Once the first application that claims to use sandboxing
> whitelists a syscall it shouldn't, you'll start getting slashdot articles
> about "Linux sandbox broken, Linux security hopelessly broken".  Then
> what's the point of all of this?

Approaching the limit: since no security code/infrastructure is
perfect, then what's the point of all of this? :)

When I've spoken about seccomp_filter, I've tried to avoid the word
'sandbox' as that comes with more baggage than just creating a means
of reducing the kernel's attack surface.  Ideally, seccomp_filter just
fills the void between read/write/sigreturn/exit and
all-the-system-calls: Don't want select? ok. Want epoll? ok. . . It
does mean that developers will have to determine the tradeoffs
themselves (or with some general guidance).  But, I expect there'd be
quite a few more consumers of seccomp if it was possible to not need
to emulate select() behavior or if, for example, brk() was allowed.

cheers!
will
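
To make the "fill the void" idea concrete, here is a hedged sketch of a
per-process allow list.  It assumes the BPF-based SECCOMP_MODE_FILTER
interface of later kernels, which is not what existed when this thread
was written, and it extends the mode 1 set by brk() only as an example.
The function names are hypothetical.  It deliberately skips the
architecture check that a production filter needs.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* If the loaded syscall number equals `nr`, return ALLOW; else fall through. */
#define ALLOW(nr) \
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (nr), 0, 1), \
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)

/* Install the allow list in the calling process.  Returns 0 on success. */
static int install_filter(void)
{
	struct sock_filter insns[] = {
		/* Simplification: no seccomp_data.arch check is done here. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		ALLOW(SYS_read),
		ALLOW(SYS_write),
		ALLOW(SYS_rt_sigreturn),
		ALLOW(SYS_exit),
		ALLOW(SYS_exit_group),
		ALLOW(SYS_brk),		/* the extra call this "device" wants */
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
	};
	struct sock_fprog prog = {
		.len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
		.filter = insns,
	};

	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/*
 * Run syscall `nr` in a filtered child.  Returns 1 if the child was killed
 * by the filter, 0 if the call was allowed, and -1 if the filter could not
 * be installed (for example on a kernel without seccomp filter support).
 */
static int probe_syscall(long nr)
{
	pid_t pid;
	int status;

	assert(nr >= 0);
	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		if (install_filter() != 0)
			_exit(2);
		syscall(nr, 0L);
		_exit(0);	/* exit_group() is on the allow list */
	}
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSYS)
		return 1;
	if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
		return 0;
	return -1;
}
```

On a kernel with filter support, probe_syscall(SYS_getpid) should report
the child as killed while probe_syscall(SYS_brk) should not; which calls
belong on the list is exactly the tradeoff left to each developer.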

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2011-11-08 19:51 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-06 20:40 Secure KVM Sasha Levin
2011-11-07  0:07 ` Rusty Russell
2011-11-07  6:29   ` Sasha Levin
2011-11-07  6:37     ` Pekka Enberg
2011-11-07  6:46       ` Sasha Levin
2011-11-07  7:03         ` Pekka Enberg
2011-11-07 22:49     ` Rusty Russell
2011-11-07  9:26 ` Avi Kivity
2011-11-07 10:17   ` Sasha Levin
2011-11-07 10:27     ` Avi Kivity
2011-11-07 11:27     ` Stefan Hajnoczi
2011-11-07 12:40       ` Sasha Levin
2011-11-07 12:51         ` Avi Kivity
2011-11-07 14:56           ` Stefan Hajnoczi
2011-11-07 17:43       ` Anthony Liguori
2011-11-07 18:41         ` Avi Kivity
2011-11-07 17:39   ` Anthony Liguori
2011-11-07 18:43     ` Avi Kivity
2011-11-07 19:07       ` Anthony Liguori
2011-11-07 19:54         ` Avi Kivity
2011-11-07 22:56   ` Rusty Russell
2011-11-07 17:37 ` Anthony Liguori
2011-11-07 17:37   ` [Qemu-devel] " Anthony Liguori
2011-11-07 17:52   ` Sasha Levin
2011-11-07 17:52     ` [Qemu-devel] " Sasha Levin
2011-11-07 18:03     ` Anthony Liguori
2011-11-07 18:03       ` Anthony Liguori
2011-11-07 23:06       ` Rusty Russell
2011-11-07 23:06         ` Rusty Russell
2011-11-08 19:51       ` Will Drewry
2011-11-08 19:51         ` [Qemu-devel] " Will Drewry
