linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC] Next gen kvm api
@ 2012-02-02 16:09 Avi Kivity
       [not found] ` <CAB9FdM9M2DWXBxxyG-ez_5igT61x5b7ptw+fKfgaqMBU_JS5aA@mail.gmail.com>
                   ` (3 more replies)
  0 siblings, 4 replies; 89+ messages in thread
From: Avi Kivity @ 2012-02-02 16:09 UTC (permalink / raw)
  To: KVM list; +Cc: linux-kernel, qemu-devel

The kvm api has been accumulating cruft for several years now.  This is
due to feature creep, fixing mistakes, experience gained by the
maintainers and developers on how to do things, ports to new
architectures, and simply as a side effect of a code base that is
developed slowly and incrementally.

While I don't think we can justify a complete revamp of the API now, I'm
writing this as a thought experiment to see where a from-scratch API can
take us.  Of course, if we do implement this, the new and old APIs will
have to be supported side by side for several years.

Syscalls
--------
kvm currently uses the much-loved ioctl() system call as its entry
point.  While this made it easy to add kvm to the kernel unintrusively,
it does have downsides:

- overhead in the entry path, for the ioctl dispatch path and vcpu mutex
(low but measurable)
- semantic mismatch: kvm really wants a vcpu to be tied to a thread, and
a vm to be tied to an mm_struct, but the current API ties them to file
descriptors, which can move between threads and processes.  We check
that they don't, but we don't want to.

Moving to syscalls avoids these problems, but introduces new ones:

- adding new syscalls is generally frowned upon, and kvm will need several
- syscalls into modules are harder and rarer than into core kernel code
- will need to add a vcpu pointer to task_struct, and a kvm pointer to
mm_struct

Syscalls that operate on the entire guest will pick it up implicitly
from the mm_struct, and syscalls that operate on a vcpu will pick it up
from current.

State accessors
---------------
Currently vcpu state is read and written by a bunch of ioctls that
access register sets that were added (or discovered) along the years. 
Some state is stored in the vcpu mmap area.  These will be replaced by a
pair of syscalls that read or write the entire state, or a subset of the
state, in a tag/value format.  A register will be described by a tuple:

  set: the register set to which it belongs; either a real set (GPR,
x87, SSE/AVX, segment, cpuid, MSRs) or a fake set (for
eflags/rip/IDT/interrupt shadow/pending exception/etc.)
  number: register number within a set
  size: for self-description, and to allow expanding registers like
SSE->AVX or eax->rax
  attributes: read-write, read-only, read-only for guest but read-write
for host
  value

Device model
------------
Currently kvm virtualizes or emulates a set of x86 cores, with or
without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
PCI devices assigned from the host.  The API allows emulating the local
APICs in userspace.

The new API will do away with the IOAPIC/PIC/PIT emulation and defer
them to userspace.  Note: this may cause a regression for older guests
that don't support MSI or kvmclock.  Device assignment will be done
using VFIO, that is, without direct kvm involvement.

Local APICs will be mandatory, but it will be possible to hide them from
the guest.  This means that it will no longer be possible to emulate an
APIC in userspace, but it will be possible to virtualize an APIC-less
core - userspace will play with the LINT0/LINT1 inputs (configured as
EXITINT and NMI) to queue interrupts and NMIs.

The communications between the local APIC and the IOAPIC/PIC will be
done over a socketpair, emulating the APIC bus protocol.

Ioeventfd/irqfd
---------------
As the ioeventfd/irqfd mechanism has been quite successful, it will be
retained, and perhaps supplemented with a way to assign an mmio region
to a socketpair carrying transactions.  This allows a device model to be
implemented out-of-process.  The socketpair can also be used to
implement a replacement for coalesced mmio, by not waiting for responses
on write transactions when enabled.  Synchronization of coalesced mmio
will be implemented in the kernel, not userspace as now: when a
non-coalesced mmio is needed, the kernel will first flush the coalesced
mmio queue(s).

Guest memory management
-----------------------
Instead of managing each memory slot individually, a single API will be
provided that replaces the entire guest physical memory map atomically. 
This matches the implementation (using RCU) and plugs holes in the
current API, where you lose the dirty log in the window between the last
call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
that removes the slot.

Slot-based dirty logging will be replaced by range-based and work-based
dirty logging; that is "what pages are dirty in this range, which may be
smaller than a slot" and "don't return more than N pages".

We may want to place the log in user memory instead of kernel memory, to
reduce pinned memory and increase flexibility.

vcpu fd mmap area
-----------------
Currently we mmap() a few pages of the vcpu fd for fast user/kernel
communications.  This will be replaced by a more orthodox pointer
parameter to sys_kvm_enter_guest(), that will be accessed using
get_user() and put_user().  This is slower than the current situation,
but better for things like strace.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
       [not found] ` <CAB9FdM9M2DWXBxxyG-ez_5igT61x5b7ptw+fKfgaqMBU_JS5aA@mail.gmail.com>
@ 2012-02-02 22:16   ` Rob Earhart
  2012-02-05 13:14   ` Avi Kivity
  1 sibling, 0 replies; 89+ messages in thread
From: Rob Earhart @ 2012-02-02 22:16 UTC (permalink / raw)
  To: Avi Kivity; +Cc: KVM list, linux-kernel, qemu-devel

(Resending as plain text to appease vger.kernel.org :-)

On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity <avi@redhat.com> wrote:
>
> The kvm api has been accumulating cruft for several years now.  This is
> due to feature creep, fixing mistakes, experience gained by the
> maintainers and developers on how to do things, ports to new
> architectures, and simply as a side effect of a code base that is
> developed slowly and incrementally.
>
> While I don't think we can justify a complete revamp of the API now, I'm
> writing this as a thought experiment to see where a from-scratch API can
> take us.  Of course, if we do implement this, the new and old APIs will
> have to be supported side by side for several years.
>
> Syscalls
> --------
> kvm currently uses the much-loved ioctl() system call as its entry
> point.  While this made it easy to add kvm to the kernel unintrusively,
> it does have downsides:
>
> - overhead in the entry path, for the ioctl dispatch path and vcpu mutex
> (low but measurable)
> - semantic mismatch: kvm really wants a vcpu to be tied to a thread, and
> a vm to be tied to an mm_struct, but the current API ties them to file
> descriptors, which can move between threads and processes.  We check
> that they don't, but we don't want to.
>
> Moving to syscalls avoids these problems, but introduces new ones:
>
> - adding new syscalls is generally frowned upon, and kvm will need several
> - syscalls into modules are harder and rarer than into core kernel code
> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
> mm_struct
>
> Syscalls that operate on the entire guest will pick it up implicitly
> from the mm_struct, and syscalls that operate on a vcpu will pick it up
> from current.
>

<snipped>

I like the ioctl() interface.  If the overhead matters in your hot
path, I suspect you're doing it wrong; use irq fds & ioevent fds.  You
might fix the semantic mismatch by having a notion of a "current
process's VM" and "current thread's VCPU", and just use the one
/dev/kvm filedescriptor.

Or you could go the other way, and break the connection between VMs
and processes / VCPUs and threads: I don't know how easy it is to do
it in Linux, but a VCPU might be backed by a kernel thread, operated
on via ioctl()s, indicating that they've exited the guest by having
their descriptors become readable (and either use read() or mmap() to
pull off the reason why the VCPU exited).  This would allow for a
variety of different programming styles for the VMM--I'm a fan of CSP
model myself, but that's hard to do with the current API.

It'd be nice to be able to kick a VCPU out of the guest without
messing around with signals.  One possibility would be to tie it to an
eventfd; another might be to add a pseudo-register to indicate whether
the VCPU is explicitly suspended.  (Combined with the decoupling idea,
you'd want another pseudo-register to indicate whether the VMM is
implicitly suspended due to an intercept; a single "runnable" bit is
racy if both the VMM and VCPU are setting it.)

ioevent fds are definitely useful.  It might be cute if they could
synchronously set the VIRTIO_USED_F_NOTIFY bit - the guest could do
this itself, but that'd require giving the guest write access to the
used side of the virtio queue, and I kind of like the idea that it
doesn't need write access there.  Then again, I don't have any perf
data to back up the need for this.

The rest of it sounds great.

)Rob

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-02 16:09 [RFC] Next gen kvm api Avi Kivity
       [not found] ` <CAB9FdM9M2DWXBxxyG-ez_5igT61x5b7ptw+fKfgaqMBU_JS5aA@mail.gmail.com>
@ 2012-02-03  2:09 ` Anthony Liguori
  2012-02-04  2:08   ` Takuya Yoshikawa
                     ` (3 more replies)
  2012-02-03 18:07 ` Eric Northup
  2012-02-05  9:37 ` Gleb Natapov
  3 siblings, 4 replies; 89+ messages in thread
From: Anthony Liguori @ 2012-02-03  2:09 UTC (permalink / raw)
  To: Avi Kivity; +Cc: KVM list, linux-kernel, qemu-devel

On 02/02/2012 10:09 AM, Avi Kivity wrote:
> The kvm api has been accumulating cruft for several years now.  This is
> due to feature creep, fixing mistakes, experience gained by the
> maintainers and developers on how to do things, ports to new
> architectures, and simply as a side effect of a code base that is
> developed slowly and incrementally.
>
> While I don't think we can justify a complete revamp of the API now, I'm
> writing this as a thought experiment to see where a from-scratch API can
> take us.  Of course, if we do implement this, the new and old APIs will
> have to be supported side by side for several years.
>
> Syscalls
> --------
> kvm currently uses the much-loved ioctl() system call as its entry
> point.  While this made it easy to add kvm to the kernel unintrusively,
> it does have downsides:
>
> - overhead in the entry path, for the ioctl dispatch path and vcpu mutex
> (low but measurable)
> - semantic mismatch: kvm really wants a vcpu to be tied to a thread, and
> a vm to be tied to an mm_struct, but the current API ties them to file
> descriptors, which can move between threads and processes.  We check
> that they don't, but we don't want to.
>
> Moving to syscalls avoids these problems, but introduces new ones:
>
> - adding new syscalls is generally frowned upon, and kvm will need several
> - syscalls into modules are harder and rarer than into core kernel code
> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
> mm_struct
>
> Syscalls that operate on the entire guest will pick it up implicitly
> from the mm_struct, and syscalls that operate on a vcpu will pick it up
> from current.

This seems like the natural progression.

> State accessors
> ---------------
> Currently vcpu state is read and written by a bunch of ioctls that
> access register sets that were added (or discovered) along the years.
> Some state is stored in the vcpu mmap area.  These will be replaced by a
> pair of syscalls that read or write the entire state, or a subset of the
> state, in a tag/value format.  A register will be described by a tuple:
>
>    set: the register set to which it belongs; either a real set (GPR,
> x87, SSE/AVX, segment, cpuid, MSRs) or a fake set (for
> eflags/rip/IDT/interrupt shadow/pending exception/etc.)
>    number: register number within a set
>    size: for self-description, and to allow expanding registers like
> SSE->AVX or eax->rax
>    attributes: read-write, read-only, read-only for guest but read-write
> for host
>    value

I do like the idea a lot of being able to read one register at a time as often 
times that's all you need.

>
> Device model
> ------------
> Currently kvm virtualizes or emulates a set of x86 cores, with or
> without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
> PCI devices assigned from the host.  The API allows emulating the local
> APICs in userspace.
>
> The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> them to userspace.

I'm a big fan of this.

> Note: this may cause a regression for older guests
> that don't support MSI or kvmclock.  Device assignment will be done
> using VFIO, that is, without direct kvm involvement.
>
> Local APICs will be mandatory, but it will be possible to hide them from
> the guest.  This means that it will no longer be possible to emulate an
> APIC in userspace, but it will be possible to virtualize an APIC-less
> core - userspace will play with the LINT0/LINT1 inputs (configured as
> EXITINT and NMI) to queue interrupts and NMIs.

I think this makes sense.  An interesting consequence of this is that it's no 
longer necessary to associate the VCPU context with an MMIO/PIO operation.  I'm 
not sure if there's an obvious benefit to that but it's interesting nonetheless.

> The communications between the local APIC and the IOAPIC/PIC will be
> done over a socketpair, emulating the APIC bus protocol.
>
> Ioeventfd/irqfd
> ---------------
> As the ioeventfd/irqfd mechanism has been quite successful, it will be
> retained, and perhaps supplemented with a way to assign an mmio region
> to a socketpair carrying transactions.  This allows a device model to be
> implemented out-of-process.  The socketpair can also be used to
> implement a replacement for coalesced mmio, by not waiting for responses
> on write transactions when enabled.  Synchronization of coalesced mmio
> will be implemented in the kernel, not userspace as now: when a
> non-coalesced mmio is needed, the kernel will first flush the coalesced
> mmio queue(s).
>
> Guest memory management
> -----------------------
> Instead of managing each memory slot individually, a single API will be
> provided that replaces the entire guest physical memory map atomically.
> This matches the implementation (using RCU) and plugs holes in the
> current API, where you lose the dirty log in the window between the last
> call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
> that removes the slot.
>
> Slot-based dirty logging will be replaced by range-based and work-based
> dirty logging; that is "what pages are dirty in this range, which may be
> smaller than a slot" and "don't return more than N pages".
>
> We may want to place the log in user memory instead of kernel memory, to
> reduce pinned memory and increase flexibility.

Since we really only support 64-bit hosts, what about just pointing the kernel 
at a address/size pair and rely on userspace to mmap() the range appropriately?

> vcpu fd mmap area
> -----------------
> Currently we mmap() a few pages of the vcpu fd for fast user/kernel
> communications.  This will be replaced by a more orthodox pointer
> parameter to sys_kvm_enter_guest(), that will be accessed using
> get_user() and put_user().  This is slower than the current situation,
> but better for things like strace.

Look pretty interesting overall.

Regards,

Anthony Liguori

>


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC] Next gen kvm api
  2012-02-02 16:09 [RFC] Next gen kvm api Avi Kivity
       [not found] ` <CAB9FdM9M2DWXBxxyG-ez_5igT61x5b7ptw+fKfgaqMBU_JS5aA@mail.gmail.com>
  2012-02-03  2:09 ` Anthony Liguori
@ 2012-02-03 18:07 ` Eric Northup
  2012-02-03 22:52   ` [Qemu-devel] " Anthony Liguori
  2012-02-05  9:37 ` Gleb Natapov
  3 siblings, 1 reply; 89+ messages in thread
From: Eric Northup @ 2012-02-03 18:07 UTC (permalink / raw)
  To: Avi Kivity; +Cc: KVM list, linux-kernel, qemu-devel

On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity <avi@redhat.com> wrote:
[...]
>
> Moving to syscalls avoids these problems, but introduces new ones:
>
> - adding new syscalls is generally frowned upon, and kvm will need several
> - syscalls into modules are harder and rarer than into core kernel code
> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
> mm_struct
- Lost a good place to put access control (permissions on /dev/kvm)
for which user-mode processes can use KVM.

How would the ability to use sys_kvm_* be regulated?

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-03 18:07 ` Eric Northup
@ 2012-02-03 22:52   ` Anthony Liguori
  2012-02-06 19:46     ` Scott Wood
  0 siblings, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-02-03 22:52 UTC (permalink / raw)
  To: Eric Northup; +Cc: Avi Kivity, linux-kernel, KVM list, qemu-devel

On 02/03/2012 12:07 PM, Eric Northup wrote:
> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity<avi@redhat.com>  wrote:
> [...]
>>
>> Moving to syscalls avoids these problems, but introduces new ones:
>>
>> - adding new syscalls is generally frowned upon, and kvm will need several
>> - syscalls into modules are harder and rarer than into core kernel code
>> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
>> mm_struct
> - Lost a good place to put access control (permissions on /dev/kvm)
> for which user-mode processes can use KVM.
>
> How would the ability to use sys_kvm_* be regulated?

Why should it be regulated?

It's not a finite or privileged resource.

Regards,

Anthony Liguori

>


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-03  2:09 ` Anthony Liguori
@ 2012-02-04  2:08   ` Takuya Yoshikawa
  2012-02-22 13:06     ` Peter Zijlstra
  2012-02-05  9:24   ` Avi Kivity
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 89+ messages in thread
From: Takuya Yoshikawa @ 2012-02-04  2:08 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, KVM list, linux-kernel, qemu-devel

Hope to get comments from live migration developers,

Anthony Liguori <anthony@codemonkey.ws> wrote:

> > Guest memory management
> > -----------------------
> > Instead of managing each memory slot individually, a single API will be
> > provided that replaces the entire guest physical memory map atomically.
> > This matches the implementation (using RCU) and plugs holes in the
> > current API, where you lose the dirty log in the window between the last
> > call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
> > that removes the slot.
> >
> > Slot-based dirty logging will be replaced by range-based and work-based
> > dirty logging; that is "what pages are dirty in this range, which may be
> > smaller than a slot" and "don't return more than N pages".
> >
> > We may want to place the log in user memory instead of kernel memory, to
> > reduce pinned memory and increase flexibility.
> 
> Since we really only support 64-bit hosts, what about just pointing the kernel 
> at a address/size pair and rely on userspace to mmap() the range appropriately?
> 

Seems reasonable but the real problem is not how to set up the memory:
the problem is how to set a bit in user-space.

We need two things:
	- introducing set_bit_user()
	- changing mmu_lock from spin_lock to mutex_lock
	  (mark_page_dirty() can be called with mmu_lock held)

The former is straightforward and I sent a patch last year.
The latter needs a fundamental change:  I heard (from Avi) that we can
change mmu_lock to mutex_lock if mmu_notifier becomes preemptible.

So I was planning to restart this work when Peter's
	"mm: Preemptibility"
	http://lkml.org/lkml/2011/4/1/141
gets finished.

But even if we cannot achieve "without pinned memory" we may also want
to make the user-space know how many pages are getting dirty.

For example think about the last step of live migration.  We stop the
guest and send the remaining pages.  For this we do not need to write
protect them any more, just want to know which ones are dirty.

If user-space can read the bitmap, it does not need to do GET_DIRTY_LOG
because the guest is already stopped, so we can reduce the downtime.

Is this correct?


So I think we can do this in two steps:
	1. just move the bitmap to user-space and (pin it)
	2. un-pin it when the time comes

I can start 1 after "srcu-less dirty logging" gets finished.


	Takuya

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-03  2:09 ` Anthony Liguori
  2012-02-04  2:08   ` Takuya Yoshikawa
@ 2012-02-05  9:24   ` Avi Kivity
  2012-02-07  1:08   ` Alexander Graf
  2012-02-10  3:07   ` Jamie Lokier
  3 siblings, 0 replies; 89+ messages in thread
From: Avi Kivity @ 2012-02-05  9:24 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: KVM list, linux-kernel, qemu-devel

On 02/03/2012 04:09 AM, Anthony Liguori wrote:
>
>> Note: this may cause a regression for older guests
>> that don't support MSI or kvmclock.  Device assignment will be done
>> using VFIO, that is, without direct kvm involvement.
>>
>> Local APICs will be mandatory, but it will be possible to hide them from
>> the guest.  This means that it will no longer be possible to emulate an
>> APIC in userspace, but it will be possible to virtualize an APIC-less
>> core - userspace will play with the LINT0/LINT1 inputs (configured as
>> EXITINT and NMI) to queue interrupts and NMIs.
>
> I think this makes sense.  An interesting consequence of this is that
> it's no longer necessary to associate the VCPU context with an
> MMIO/PIO operation.  I'm not sure if there's an obvious benefit to
> that but it's interesting nonetheless.

It doesn't follow (at least from the above), and it isn't allowed in
some situations (like PIO invoking synchronous SMI).  So we'll have to
retain synchronous PIO/MMIO (but we can allow to relax this for
socketpair mmio).

>
>> The communications between the local APIC and the IOAPIC/PIC will be
>> done over a socketpair, emulating the APIC bus protocol.
>>
>> Ioeventfd/irqfd
>> ---------------
>> As the ioeventfd/irqfd mechanism has been quite successful, it will be
>> retained, and perhaps supplemented with a way to assign an mmio region
>> to a socketpair carrying transactions.  This allows a device model to be
>> implemented out-of-process.  The socketpair can also be used to
>> implement a replacement for coalesced mmio, by not waiting for responses
>> on write transactions when enabled.  Synchronization of coalesced mmio
>> will be implemented in the kernel, not userspace as now: when a
>> non-coalesced mmio is needed, the kernel will first flush the coalesced
>> mmio queue(s).
>>
>> Guest memory management
>> -----------------------
>> Instead of managing each memory slot individually, a single API will be
>> provided that replaces the entire guest physical memory map atomically.
>> This matches the implementation (using RCU) and plugs holes in the
>> current API, where you lose the dirty log in the window between the last
>> call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
>> that removes the slot.
>>
>> Slot-based dirty logging will be replaced by range-based and work-based
>> dirty logging; that is "what pages are dirty in this range, which may be
>> smaller than a slot" and "don't return more than N pages".
>>
>> We may want to place the log in user memory instead of kernel memory, to
>> reduce pinned memory and increase flexibility.
>
> Since we really only support 64-bit hosts, 

We don't (Red Hat does, but that's a distro choice).  Non-x86 also needs
32-bit.

> what about just pointing the kernel at a address/size pair and rely on
> userspace to mmap() the range appropriately?

The "one large slot" approach.  Even if we ignore the 32-bit issue, we
still need some per-slot information, like per-slot dirty logging.  It's
also hard to create aliases this way (BIOS at 0xe0000 and 0xfffe0000) or
to move memory around (framebuffer BAR).

>
>> vcpu fd mmap area
>> -----------------
>> Currently we mmap() a few pages of the vcpu fd for fast user/kernel
>> communications.  This will be replaced by a more orthodox pointer
>> parameter to sys_kvm_enter_guest(), that will be accessed using
>> get_user() and put_user().  This is slower than the current situation,
>> but better for things like strace.
>
> Look pretty interesting overall.

I'll get an actual API description for the next round.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC] Next gen kvm api
  2012-02-02 16:09 [RFC] Next gen kvm api Avi Kivity
                   ` (2 preceding siblings ...)
  2012-02-03 18:07 ` Eric Northup
@ 2012-02-05  9:37 ` Gleb Natapov
  2012-02-05  9:44   ` Avi Kivity
  3 siblings, 1 reply; 89+ messages in thread
From: Gleb Natapov @ 2012-02-05  9:37 UTC (permalink / raw)
  To: Avi Kivity; +Cc: KVM list, linux-kernel, qemu-devel

On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
> Device model
> ------------
> Currently kvm virtualizes or emulates a set of x86 cores, with or
> without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
> PCI devices assigned from the host.  The API allows emulating the local
> APICs in userspace.
> 
> The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> them to userspace.  Note: this may cause a regression for older guests
> that don't support MSI or kvmclock.  Device assignment will be done
> using VFIO, that is, without direct kvm involvement.
> 
So are we officially saying that KVM is only for modern guest
virtualization? Also my not so old host kernel uses MSI only for NIC.
SATA and USB are using IOAPIC (though this is probably more HW related
than kernel version related).

--
			Gleb.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC] Next gen kvm api
  2012-02-05  9:37 ` Gleb Natapov
@ 2012-02-05  9:44   ` Avi Kivity
  2012-02-05  9:51     ` Gleb Natapov
  0 siblings, 1 reply; 89+ messages in thread
From: Avi Kivity @ 2012-02-05  9:44 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: KVM list, linux-kernel, qemu-devel

On 02/05/2012 11:37 AM, Gleb Natapov wrote:
> On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
> > Device model
> > ------------
> > Currently kvm virtualizes or emulates a set of x86 cores, with or
> > without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
> > PCI devices assigned from the host.  The API allows emulating the local
> > APICs in userspace.
> > 
> > The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> > them to userspace.  Note: this may cause a regression for older guests
> > that don't support MSI or kvmclock.  Device assignment will be done
> > using VFIO, that is, without direct kvm involvement.
> > 
> So are we officially saying that KVM is only for modern guest
> virtualization? 

No, but older guests may have reduced performance in some workloads
(e.g. RHEL4 gettimeofday() intensive workloads).

> Also my not so old host kernel uses MSI only for NIC.
> SATA and USB are using IOAPIC (though this is probably more HW related
> than kernel version related).

For devices emulated in userspace, it doesn't matter where the IOAPIC
is.  It only matters for kernel provided devices (PIT, assigned devices,
vhost-net).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC] Next gen kvm api
  2012-02-05  9:44   ` Avi Kivity
@ 2012-02-05  9:51     ` Gleb Natapov
  2012-02-05  9:56       ` Avi Kivity
  2012-02-05 16:36       ` [Qemu-devel] " Anthony Liguori
  0 siblings, 2 replies; 89+ messages in thread
From: Gleb Natapov @ 2012-02-05  9:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: KVM list, linux-kernel, qemu-devel

On Sun, Feb 05, 2012 at 11:44:43AM +0200, Avi Kivity wrote:
> On 02/05/2012 11:37 AM, Gleb Natapov wrote:
> > On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
> > > Device model
> > > ------------
> > > Currently kvm virtualizes or emulates a set of x86 cores, with or
> > > without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
> > > PCI devices assigned from the host.  The API allows emulating the local
> > > APICs in userspace.
> > > 
> > > The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> > > them to userspace.  Note: this may cause a regression for older guests
> > > that don't support MSI or kvmclock.  Device assignment will be done
> > > using VFIO, that is, without direct kvm involvement.
> > > 
> > So are we officially saying that KVM is only for modern guest
> > virtualization? 
> 
> No, but older guests may have reduced performance in some workloads
> (e.g. RHEL4 gettimeofday() intensive workloads).
> 
Reduced performance is what I mean. Obviously old guests will continue working.

> > Also my not so old host kernel uses MSI only for NIC.
> > SATA and USB are using IOAPIC (though this is probably more HW related
> > than kernel version related).
> 
> For devices emulated in userspace, it doesn't matter where the IOAPIC
> is.  It only matters for kernel provided devices (PIT, assigned devices,
> vhost-net).
> 
What about EOI that will have to do additional exit to userspace for each
interrupt delivered?

--
			Gleb.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC] Next gen kvm api
  2012-02-05  9:51     ` Gleb Natapov
@ 2012-02-05  9:56       ` Avi Kivity
  2012-02-05 10:58         ` Gleb Natapov
  2012-02-05 16:36       ` [Qemu-devel] " Anthony Liguori
  1 sibling, 1 reply; 89+ messages in thread
From: Avi Kivity @ 2012-02-05  9:56 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: KVM list, linux-kernel, qemu-devel

On 02/05/2012 11:51 AM, Gleb Natapov wrote:
> On Sun, Feb 05, 2012 at 11:44:43AM +0200, Avi Kivity wrote:
> > On 02/05/2012 11:37 AM, Gleb Natapov wrote:
> > > On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
> > > > Device model
> > > > ------------
> > > > Currently kvm virtualizes or emulates a set of x86 cores, with or
> > > > without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
> > > > PCI devices assigned from the host.  The API allows emulating the local
> > > > APICs in userspace.
> > > > 
> > > > The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> > > > them to userspace.  Note: this may cause a regression for older guests
> > > > that don't support MSI or kvmclock.  Device assignment will be done
> > > > using VFIO, that is, without direct kvm involvement.
> > > > 
> > > So are we officially saying that KVM is only for modern guest
> > > virtualization? 
> > 
> > No, but older guests may have reduced performance in some workloads
> > (e.g. RHEL4 gettimeofday() intensive workloads).
> > 
> Reduced performance is what I mean. Obviously old guests will continue working.

I'm not happy about it either.

> > > Also my not so old host kernel uses MSI only for NIC.
> > > SATA and USB are using IOAPIC (though this is probably more HW related
> > > than kernel version related).
> > 
> > For devices emulated in userspace, it doesn't matter where the IOAPIC
> > is.  It only matters for kernel provided devices (PIT, assigned devices,
> > vhost-net).
> > 
> What about EOI that will have to do additional exit to userspace for each
> interrupt delivered?

I think the ioapic EOI is asynchronous wrt the core, yes?  So the vcpu
can just post the EOI broadcast on the apic-bus socketpair, waking up
the thread handling the ioapic, and continue running.  This trades off
vcpu latency for using more host resources.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC] Next gen kvm api
  2012-02-05  9:56       ` Avi Kivity
@ 2012-02-05 10:58         ` Gleb Natapov
  2012-02-05 13:16           ` Avi Kivity
  0 siblings, 1 reply; 89+ messages in thread
From: Gleb Natapov @ 2012-02-05 10:58 UTC (permalink / raw)
  To: Avi Kivity; +Cc: KVM list, linux-kernel, qemu-devel

On Sun, Feb 05, 2012 at 11:56:21AM +0200, Avi Kivity wrote:
> On 02/05/2012 11:51 AM, Gleb Natapov wrote:
> > On Sun, Feb 05, 2012 at 11:44:43AM +0200, Avi Kivity wrote:
> > > On 02/05/2012 11:37 AM, Gleb Natapov wrote:
> > > > On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
> > > > > Device model
> > > > > ------------
> > > > > Currently kvm virtualizes or emulates a set of x86 cores, with or
> > > > > without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
> > > > > PCI devices assigned from the host.  The API allows emulating the local
> > > > > APICs in userspace.
> > > > > 
> > > > > The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> > > > > them to userspace.  Note: this may cause a regression for older guests
> > > > > that don't support MSI or kvmclock.  Device assignment will be done
> > > > > using VFIO, that is, without direct kvm involvement.
> > > > > 
> > > > So are we officially saying that KVM is only for modern guest
> > > > virtualization? 
> > > 
> > > No, but older guests may have reduced performance in some workloads
> > > (e.g. RHEL4 gettimeofday() intensive workloads).
> > > 
> > Reduced performance is what I mean. Obviously old guests will continue working.
> 
> I'm not happy about it either.
> 
It is not only about old guests either. In RHEL we pretend to not
support HPET because when some guests detect it they are accessing
its mmio frequently for certain workloads. For Linux guests we can
avoid that by using kvmclock. For Windows guests I hope we will have
enlightenment timers  + RTC, but what about other guests? *BSD? How often
they access HPET when it is available? We will probably have to move
HPET into the kernel if we want to make it usable.

So what is the criteria for device to be emulated in userspace vs kernelspace
in new API? Never? What about vhost-net then? Only if a device works in MSI
mode? This may work for HPET case, but looks like artificial limitation
since the problem with HPET is not interrupt latency, but mmio space
access. 

And BTW, what about enlightenment timers for Windows? Are we going to
implement them in userspace or kernel?
 
> > > > Also my not so old host kernel uses MSI only for NIC.
> > > > SATA and USB are using IOAPIC (though this is probably more HW related
> > > > than kernel version related).
> > > 
> > > For devices emulated in userspace, it doesn't matter where the IOAPIC
> > > is.  It only matters for kernel provided devices (PIT, assigned devices,
> > > vhost-net).
> > > 
> > What about EOI that will have to do additional exit to userspace for each
> > interrupt delivered?
> 
> I think the ioapic EOI is asynchronous wrt the core, yes?  So the vcpu
Probably, do not see what problem can async EOI may cause.

> can just post the EOI broadcast on the apic-bus socketpair, waking up
> the thread handling the ioapic, and continue running.  This trades off
> vcpu latency for using more host resources.
> 
Sounds good. This will increase IOAPIC interrupt latency though since next
interrupt (same GSI) can't be delivered until EOI is processed.

--
			Gleb.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
       [not found] ` <CAB9FdM9M2DWXBxxyG-ez_5igT61x5b7ptw+fKfgaqMBU_JS5aA@mail.gmail.com>
  2012-02-02 22:16   ` [Qemu-devel] " Rob Earhart
@ 2012-02-05 13:14   ` Avi Kivity
  2012-02-06 17:41     ` Rob Earhart
  1 sibling, 1 reply; 89+ messages in thread
From: Avi Kivity @ 2012-02-05 13:14 UTC (permalink / raw)
  To: Rob Earhart; +Cc: linux-kernel, KVM list, qemu-devel

On 02/03/2012 12:13 AM, Rob Earhart wrote:
> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity <avi@redhat.com
> <mailto:avi@redhat.com>> wrote:
>
>     The kvm api has been accumulating cruft for several years now.
>      This is
>     due to feature creep, fixing mistakes, experience gained by the
>     maintainers and developers on how to do things, ports to new
>     architectures, and simply as a side effect of a code base that is
>     developed slowly and incrementally.
>
>     While I don't think we can justify a complete revamp of the API
>     now, I'm
>     writing this as a thought experiment to see where a from-scratch
>     API can
>     take us.  Of course, if we do implement this, the new and old APIs
>     will
>     have to be supported side by side for several years.
>
>     Syscalls
>     --------
>     kvm currently uses the much-loved ioctl() system call as its entry
>     point.  While this made it easy to add kvm to the kernel
>     unintrusively,
>     it does have downsides:
>
>     - overhead in the entry path, for the ioctl dispatch path and vcpu
>     mutex
>     (low but measurable)
>     - semantic mismatch: kvm really wants a vcpu to be tied to a
>     thread, and
>     a vm to be tied to an mm_struct, but the current API ties them to file
>     descriptors, which can move between threads and processes.  We check
>     that they don't, but we don't want to.
>
>     Moving to syscalls avoids these problems, but introduces new ones:
>
>     - adding new syscalls is generally frowned upon, and kvm will need
>     several
>     - syscalls into modules are harder and rarer than into core kernel
>     code
>     - will need to add a vcpu pointer to task_struct, and a kvm pointer to
>     mm_struct
>
>     Syscalls that operate on the entire guest will pick it up implicitly
>     from the mm_struct, and syscalls that operate on a vcpu will pick
>     it up
>     from current.
>
>
> <snipped>
>
> I like the ioctl() interface.  If the overhead matters in your hot path,

I can't say that it's a pressing problem, but it's not negligible.

> I suspect you're doing it wrong;

What am I doing wrong?

> use irq fds & ioevent fds.  You might fix the semantic mismatch by
> having a notion of a "current process's VM" and "current thread's
> VCPU", and just use the one /dev/kvm filedescriptor.
>
> Or you could go the other way, and break the connection between VMs
> and processes / VCPUs and threads: I don't know how easy it is to do
> it in Linux, but a VCPU might be backed by a kernel thread, operated
> on via ioctl()s, indicating that they've exited the guest by having
> their descriptors become readable (and either use read() or mmap() to
> pull off the reason why the VCPU exited). 

That breaks the ability to renice vcpu threads (unless you want the user
renice kernel threads).

> This would allow for a variety of different programming styles for the
> VMM--I'm a fan of CSP model myself, but that's hard to do with the
> current API.

Just convert the synchronous API to an RPC over a pipe, in the vcpu
thread, and you have the asynchronous model you asked for.

>
> It'd be nice to be able to kick a VCPU out of the guest without
> messing around with signals.  One possibility would be to tie it to an
> eventfd;

We have to support signals in any case, supporting more mechanisms just
increases complexity.

> another might be to add a pseudo-register to indicate whether the VCPU
> is explicitly suspended.  (Combined with the decoupling idea, you'd
> want another pseudo-register to indicate whether the VMM is implicitly
> suspended due to an intercept; a single "runnable" bit is racy if both
> the VMM and VCPU are setting it.)
>
> ioevent fds are definitely useful.  It might be cute if they could
> synchronously set the VIRTIO_USED_F_NOTIFY bit - the guest could do
> this itself, but that'd require giving the guest write access to the
> used side of the virtio queue, and I kind of like the idea that it
> doesn't need write access there.  Then again, I don't have any perf
> data to back up the need for this.
>

I'd hate to tie ioeventfds into virtio specifics, they're a general
mechanism.  Especially if the guest can do it itself.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [RFC] Next gen kvm api
  2012-02-05 10:58         ` Gleb Natapov
@ 2012-02-05 13:16           ` Avi Kivity
  0 siblings, 0 replies; 89+ messages in thread
From: Avi Kivity @ 2012-02-05 13:16 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: KVM list, linux-kernel, qemu-devel

On 02/05/2012 12:58 PM, Gleb Natapov wrote:
> > > > 
> > > Reduced performance is what I mean. Obviously old guests will continue working.
> > 
> > I'm not happy about it either.
> > 
> It is not only about old guests either. In RHEL we pretend to not
> support HPET because when some guests detect it they are accessing
> its mmio frequently for certain workloads. For Linux guests we can
> avoid that by using kvmclock. For Windows guests I hope we will have
> enlightenment timers  + RTC, but what about other guests? *BSD? How often
> they access HPET when it is available? We will probably have to move
> HPET into the kernel if we want to make it usable.

If we have to, we'll do it.

> So what is the criteria for device to be emulated in userspace vs kernelspace
> in new API? Never? What about vhost-net then? Only if a device works in MSI
> mode? This may work for HPET case, but looks like artificial limitation
> since the problem with HPET is not interrupt latency, but mmio space
> access. 

The criteria is, if it's absolutely necessary.

> And BTW, what about enlightenment timers for Windows? Are we going to
> implement them in userspace or kernel?

The kernel.
-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-05  9:51     ` Gleb Natapov
  2012-02-05  9:56       ` Avi Kivity
@ 2012-02-05 16:36       ` Anthony Liguori
  2012-02-06  9:34         ` Avi Kivity
  1 sibling, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-02-05 16:36 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Avi Kivity, linux-kernel, KVM list, qemu-devel

On 02/05/2012 03:51 AM, Gleb Natapov wrote:
> On Sun, Feb 05, 2012 at 11:44:43AM +0200, Avi Kivity wrote:
>> On 02/05/2012 11:37 AM, Gleb Natapov wrote:
>>> On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
>>>> Device model
>>>> ------------
>>>> Currently kvm virtualizes or emulates a set of x86 cores, with or
>>>> without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
>>>> PCI devices assigned from the host.  The API allows emulating the local
>>>> APICs in userspace.
>>>>
>>>> The new API will do away with the IOAPIC/PIC/PIT emulation and defer
>>>> them to userspace.  Note: this may cause a regression for older guests
>>>> that don't support MSI or kvmclock.  Device assignment will be done
>>>> using VFIO, that is, without direct kvm involvement.
>>>>
>>> So are we officially saying that KVM is only for modern guest
>>> virtualization?
>>
>> No, but older guests may have reduced performance in some workloads
>> (e.g. RHEL4 gettimeofday() intensive workloads).
>>
> Reduced performance is what I mean. Obviously old guests will continue working.

An interesting solution to this problem would be an in-kernel device VM.

Most of the time, the hot register is just one register within a more complex 
device.  The reads are often side-effect free and trivially computed from some 
device state + host time.

If userspace had a way to upload bytecode to the kernel that was executed for a 
PIO operation, it could either pass the operation to userspace or handle it 
within the kernel when possible without taking a heavy weight exit.

If the bytecode can access variables in a shared memory area, it could be pretty 
efficient to work with.

This means that the kernel never has to deal with specific in-kernel devices but 
that userspace can accelerator as many of its devices as it sees fit.

This could replace ioeventfd as a mechanism (which would allow clearing the 
notify flag before writing to an eventfd).

We could potentially just use BPF for this.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-05 16:36       ` [Qemu-devel] " Anthony Liguori
@ 2012-02-06  9:34         ` Avi Kivity
  2012-02-06 13:33           ` Anthony Liguori
  2012-02-07 18:12           ` Rusty Russell
  0 siblings, 2 replies; 89+ messages in thread
From: Avi Kivity @ 2012-02-06  9:34 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Gleb Natapov, linux-kernel, KVM list, qemu-devel

On 02/05/2012 06:36 PM, Anthony Liguori wrote:
> On 02/05/2012 03:51 AM, Gleb Natapov wrote:
>> On Sun, Feb 05, 2012 at 11:44:43AM +0200, Avi Kivity wrote:
>>> On 02/05/2012 11:37 AM, Gleb Natapov wrote:
>>>> On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
>>>>> Device model
>>>>> ------------
>>>>> Currently kvm virtualizes or emulates a set of x86 cores, with or
>>>>> without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
>>>>> PCI devices assigned from the host.  The API allows emulating the
>>>>> local
>>>>> APICs in userspace.
>>>>>
>>>>> The new API will do away with the IOAPIC/PIC/PIT emulation and defer
>>>>> them to userspace.  Note: this may cause a regression for older
>>>>> guests
>>>>> that don't support MSI or kvmclock.  Device assignment will be done
>>>>> using VFIO, that is, without direct kvm involvement.
>>>>>
>>>> So are we officially saying that KVM is only for modern guest
>>>> virtualization?
>>>
>>> No, but older guests may have reduced performance in some workloads
>>> (e.g. RHEL4 gettimeofday() intensive workloads).
>>>
>> Reduced performance is what I mean. Obviously old guests will
>> continue working.
>
> An interesting solution to this problem would be an in-kernel device VM.

It's interesting, yes, but has a very high barrier to implementation.

>
> Most of the time, the hot register is just one register within a more
> complex device.  The reads are often side-effect free and trivially
> computed from some device state + host time.

Look at arch/x86/kvm/i8254.c:pit_ioport_read() for a counterexample. 
There are also interactions with other devices (for example the
apic/ioapic interaction via the apic bus).

>
> If userspace had a way to upload bytecode to the kernel that was
> executed for a PIO operation, it could either pass the operation to
> userspace or handle it within the kernel when possible without taking
> a heavy weight exit.
>
> If the bytecode can access variables in a shared memory area, it could
> be pretty efficient to work with.
>
> This means that the kernel never has to deal with specific in-kernel
> devices but that userspace can accelerator as many of its devices as
> it sees fit.

I would really love to have this, but the problem is that we'd need a
general purpose bytecode VM with binding to some kernel APIs.  The
bytecode VM, if made general enough to host more complicated devices,
would likely be much larger than the actual code we have in the kernel now.

>
> This could replace ioeventfd as a mechanism (which would allow
> clearing the notify flag before writing to an eventfd).
>
> We could potentially just use BPF for this.

BPF generally just computes a predicate.  We could overload the scratch
area for storing internal state and for read results, though (and have
an "mmio scratch register" for reading the time).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-06  9:34         ` Avi Kivity
@ 2012-02-06 13:33           ` Anthony Liguori
  2012-02-06 13:54             ` Avi Kivity
  2012-02-07 18:12           ` Rusty Russell
  1 sibling, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-02-06 13:33 UTC (permalink / raw)
  To: Avi Kivity; +Cc: qemu-devel, linux-kernel, Gleb Natapov, KVM list

On 02/06/2012 03:34 AM, Avi Kivity wrote:
> On 02/05/2012 06:36 PM, Anthony Liguori wrote:
>> On 02/05/2012 03:51 AM, Gleb Natapov wrote:
>>> On Sun, Feb 05, 2012 at 11:44:43AM +0200, Avi Kivity wrote:
>>>> On 02/05/2012 11:37 AM, Gleb Natapov wrote:
>>>>> On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
>>>>>> Device model
>>>>>> ------------
>>>>>> Currently kvm virtualizes or emulates a set of x86 cores, with or
>>>>>> without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
>>>>>> PCI devices assigned from the host.  The API allows emulating the
>>>>>> local
>>>>>> APICs in userspace.
>>>>>>
>>>>>> The new API will do away with the IOAPIC/PIC/PIT emulation and defer
>>>>>> them to userspace.  Note: this may cause a regression for older
>>>>>> guests
>>>>>> that don't support MSI or kvmclock.  Device assignment will be done
>>>>>> using VFIO, that is, without direct kvm involvement.
>>>>>>
>>>>> So are we officially saying that KVM is only for modern guest
>>>>> virtualization?
>>>>
>>>> No, but older guests may have reduced performance in some workloads
>>>> (e.g. RHEL4 gettimeofday() intensive workloads).
>>>>
>>> Reduced performance is what I mean. Obviously old guests will
>>> continue working.
>>
>> An interesting solution to this problem would be an in-kernel device VM.
>
> It's interesting, yes, but has a very high barrier to implementation.
>
>>
>> Most of the time, the hot register is just one register within a more
>> complex device.  The reads are often side-effect free and trivially
>> computed from some device state + host time.
>
> Look at arch/x86/kvm/i8254.c:pit_ioport_read() for a counterexample.
> There are also interactions with other devices (for example the
> apic/ioapic interaction via the apic bus).

Hrm, maybe I'm missing it, but the path that would be hot is:

if (!status_latched && !count_latched) {
    value = kpit_elapsed()
    // manipulate count based on mode
    // mask value depending on read_state
}

This path is side-effect free, and applies relatively simple math to a time counter.

The idea would be to allow the filter to not handle an I/O request depending on 
existing state.  Anything that's modifies state (like reading the latch counter) 
would drop to userspace.

>
>>
>> If userspace had a way to upload bytecode to the kernel that was
>> executed for a PIO operation, it could either pass the operation to
>> userspace or handle it within the kernel when possible without taking
>> a heavy weight exit.
>>
>> If the bytecode can access variables in a shared memory area, it could
>> be pretty efficient to work with.
>>
>> This means that the kernel never has to deal with specific in-kernel
>> devices but that userspace can accelerator as many of its devices as
>> it sees fit.
>
> I would really love to have this, but the problem is that we'd need a
> general purpose bytecode VM with binding to some kernel APIs.  The
> bytecode VM, if made general enough to host more complicated devices,
> would likely be much larger than the actual code we have in the kernel now.

I think the question is whether BPF is good enough as it stands.  I'm not really 
sure.  I agree that inventing a new bytecode VM is probably not worth it.

>>
>> This could replace ioeventfd as a mechanism (which would allow
>> clearing the notify flag before writing to an eventfd).
>>
>> We could potentially just use BPF for this.
>
> BPF generally just computes a predicate.

Can it modify a packet in place?  I think a predicate is about right (can this 
io operation be handled in the kernel or not) but the question is whether 
there's a way produce an output as a side effect.

> We could overload the scratch
> area for storing internal state and for read results, though (and have
> an "mmio scratch register" for reading the time).

Right.

Regards,

Anthony Liguori



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-06 13:33           ` Anthony Liguori
@ 2012-02-06 13:54             ` Avi Kivity
  2012-02-06 14:00               ` Anthony Liguori
  0 siblings, 1 reply; 89+ messages in thread
From: Avi Kivity @ 2012-02-06 13:54 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, linux-kernel, Gleb Natapov, KVM list

On 02/06/2012 03:33 PM, Anthony Liguori wrote:
>> Look at arch/x86/kvm/i8254.c:pit_ioport_read() for a counterexample.
>> There are also interactions with other devices (for example the
>> apic/ioapic interaction via the apic bus).
>
>
> Hrm, maybe I'm missing it, but the path that would be hot is:
>
> if (!status_latched && !count_latched) {
>    value = kpit_elapsed()
>    // manipulate count based on mode
>    // mask value depending on read_state
> }
>
> This path is side-effect free, and applies relatively simple math to a
> time counter.

Do guests always read an unlatched counter?  Doesn't seem reasonable
since they can't get a stable count this way.

>
> The idea would be to allow the filter to not handle an I/O request
> depending on existing state.  Anything that's modifies state (like
> reading the latch counter) would drop to userspace.

This restricts us to a subset of the device which is at the mercy of the
guest.

>
>>
>>>
>>> If userspace had a way to upload bytecode to the kernel that was
>>> executed for a PIO operation, it could either pass the operation to
>>> userspace or handle it within the kernel when possible without taking
>>> a heavy weight exit.
>>>
>>> If the bytecode can access variables in a shared memory area, it could
>>> be pretty efficient to work with.
>>>
>>> This means that the kernel never has to deal with specific in-kernel
>>> devices but that userspace can accelerator as many of its devices as
>>> it sees fit.
>>
>> I would really love to have this, but the problem is that we'd need a
>> general purpose bytecode VM with binding to some kernel APIs.  The
>> bytecode VM, if made general enough to host more complicated devices,
>> would likely be much larger than the actual code we have in the
>> kernel now.
>
> I think the question is whether BPF is good enough as it stands.  I'm
> not really sure.

I think not.  It doesn't have 64-bit muldiv, required for hpet, for example.

>   I agree that inventing a new bytecode VM is probably not worth it.
>
>>>
>>> This could replace ioeventfd as a mechanism (which would allow
>>> clearing the notify flag before writing to an eventfd).
>>>
>>> We could potentially just use BPF for this.
>>
>> BPF generally just computes a predicate.
>
> Can it modify a packet in place?  I think a predicate is about right
> (can this io operation be handled in the kernel or not) but the
> question is whether there's a way produce an output as a side effect.

You can use the scratch area, and say that it's persistent.  But the VM
itself isn't rich enough.

>
>> We could overload the scratch
>> area for storing internal state and for read results, though (and have
>> an "mmio scratch register" for reading the time).
>
> Right.
>

We could define mmio registers for muldiv64, and for communicating over
the APIC bus.  But then the device model for BPF ends up more
complicated than the kernel devices we have put together.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-06 13:54             ` Avi Kivity
@ 2012-02-06 14:00               ` Anthony Liguori
  2012-02-06 14:08                 ` Avi Kivity
  0 siblings, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-02-06 14:00 UTC (permalink / raw)
  To: Avi Kivity; +Cc: KVM list, qemu-devel, Gleb Natapov, linux-kernel

On 02/06/2012 07:54 AM, Avi Kivity wrote:
> On 02/06/2012 03:33 PM, Anthony Liguori wrote:
>>> Look at arch/x86/kvm/i8254.c:pit_ioport_read() for a counterexample.
>>> There are also interactions with other devices (for example the
>>> apic/ioapic interaction via the apic bus).
>>
>>
>> Hrm, maybe I'm missing it, but the path that would be hot is:
>>
>> if (!status_latched&&  !count_latched) {
>>     value = kpit_elapsed()
>>     // manipulate count based on mode
>>     // mask value depending on read_state
>> }
>>
>> This path is side-effect free, and applies relatively simple math to a
>> time counter.
>
> Do guests always read an unlatched counter?  Doesn't seem reasonable
> since they can't get a stable count this way.

Perhaps.  You could have the latching done by writing to persisted scratch 
memory but then locking becomes an issue.

>> The idea would be to allow the filter to not handle an I/O request
>> depending on existing state.  Anything that's modifies state (like
>> reading the latch counter) would drop to userspace.
>
> This restricts us to a subset of the device which is at the mercy of the
> guest.

Yes, but it provides an elegant solution to having a flexible way to do things 
in the fast path in a generic way without presenting additional security concerns.

A similar, albeit more complex and less elegant, approach would be to make use 
of something like the vtpm optimization to reflect certain exits back into 
injected code into the guest.  But this has the disadvantage of being very 
x86-centric and it's not clear if you can avoid double exits which would hurt 
the slow paths.

> We could define mmio registers for muldiv64, and for communicating over
> the APIC bus.  But then the device model for BPF ends up more
> complicated than the kernel devices we have put together.

Maybe what we really need is NaCL for kernel space :-D

Regards,

Anthony Liguori



^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-06 14:00               ` Anthony Liguori
@ 2012-02-06 14:08                 ` Avi Kivity
  0 siblings, 0 replies; 89+ messages in thread
From: Avi Kivity @ 2012-02-06 14:08 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: KVM list, qemu-devel, Gleb Natapov, linux-kernel

On 02/06/2012 04:00 PM, Anthony Liguori wrote:
>> Do guests always read an unlatched counter?  Doesn't seem reasonable
>> since they can't get a stable count this way.
>
>
> Perhaps.  You could have the latching done by writing to persisted
> scratch memory but then locking becomes an issue.

Oh, you'd certainly serialize the entire device.

>
>>> The idea would be to allow the filter to not handle an I/O request
>>> depending on existing state.  Anything that's modifies state (like
>>> reading the latch counter) would drop to userspace.
>>
>> This restricts us to a subset of the device which is at the mercy of the
>> guest.
>
> Yes, but it provides an elegant solution to having a flexible way to
> do things in the fast path in a generic way without presenting
> additional security concerns.
>
> A similar, albeit more complex and less elegant, approach would be to
> make use of something like the vtpm optimization to reflect certain
> exits back into injected code into the guest.  But this has the
> disadvantage of being very x86-centric and it's not clear if you can
> avoid double exits which would hurt the slow paths.

It's also hard to communicate with the rest of the host kernel (say for
timers).  You can't ensure that any piece of memory will be virtually
mapped, and with the correct permissions too.

>
>> We could define mmio registers for muldiv64, and for communicating over
>> the APIC bus.  But then the device model for BPF ends up more
>> complicated than the kernel devices we have put together.
>
> Maybe what we really need is NaCL for kernel space :-D

NaCl or bytecode, doesn't matter.  But we do need bindings to other
kernel and kvm services.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-05 13:14   ` Avi Kivity
@ 2012-02-06 17:41     ` Rob Earhart
  2012-02-06 19:11       ` Anthony Liguori
  2012-02-07 12:01       ` Avi Kivity
  0 siblings, 2 replies; 89+ messages in thread
From: Rob Earhart @ 2012-02-06 17:41 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linux-kernel, KVM list, qemu-devel

On Sun, Feb 5, 2012 at 5:14 AM, Avi Kivity <avi@redhat.com> wrote:
> On 02/03/2012 12:13 AM, Rob Earhart wrote:
>> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity <avi@redhat.com
>> <mailto:avi@redhat.com>> wrote:
>>
>>     The kvm api has been accumulating cruft for several years now.
>>      This is
>>     due to feature creep, fixing mistakes, experience gained by the
>>     maintainers and developers on how to do things, ports to new
>>     architectures, and simply as a side effect of a code base that is
>>     developed slowly and incrementally.
>>
>>     While I don't think we can justify a complete revamp of the API
>>     now, I'm
>>     writing this as a thought experiment to see where a from-scratch
>>     API can
>>     take us.  Of course, if we do implement this, the new and old APIs
>>     will
>>     have to be supported side by side for several years.
>>
>>     Syscalls
>>     --------
>>     kvm currently uses the much-loved ioctl() system call as its entry
>>     point.  While this made it easy to add kvm to the kernel
>>     unintrusively,
>>     it does have downsides:
>>
>>     - overhead in the entry path, for the ioctl dispatch path and vcpu
>>     mutex
>>     (low but measurable)
>>     - semantic mismatch: kvm really wants a vcpu to be tied to a
>>     thread, and
>>     a vm to be tied to an mm_struct, but the current API ties them to file
>>     descriptors, which can move between threads and processes.  We check
>>     that they don't, but we don't want to.
>>
>>     Moving to syscalls avoids these problems, but introduces new ones:
>>
>>     - adding new syscalls is generally frowned upon, and kvm will need
>>     several
>>     - syscalls into modules are harder and rarer than into core kernel
>>     code
>>     - will need to add a vcpu pointer to task_struct, and a kvm pointer to
>>     mm_struct
>>
>>     Syscalls that operate on the entire guest will pick it up implicitly
>>     from the mm_struct, and syscalls that operate on a vcpu will pick
>>     it up
>>     from current.
>>
>>
>> <snipped>
>>
>> I like the ioctl() interface.  If the overhead matters in your hot path,
>
> I can't say that it's a pressing problem, but it's not negligible.
>
>> I suspect you're doing it wrong;
>
> What am I doing wrong?

"You the vmm" not "you the KVM maintainer" :-)

To be a little more precise: If a VCPU thread is going all the way out
to host usermode in its hot path, that's probably a performance
problem regardless of how fast you make the transitions between host
user and host kernel.

That's why ioctl() doesn't bother me.  I think it'd be more useful to
focus on mechanisms which don't require the VCPU thread to exit at all
in its hot paths, so the overhead of the ioctl() really becomes lost
in the noise.  irq fds and ioevent fds are great for that, and I
really like your MMIO-over-socketpair idea.

>> use irq fds & ioevent fds.  You might fix the semantic mismatch by
>> having a notion of a "current process's VM" and "current thread's
>> VCPU", and just use the one /dev/kvm filedescriptor.
>>
>> Or you could go the other way, and break the connection between VMs
>> and processes / VCPUs and threads: I don't know how easy it is to do
>> it in Linux, but a VCPU might be backed by a kernel thread, operated
>> on via ioctl()s, indicating that they've exited the guest by having
>> their descriptors become readable (and either use read() or mmap() to
>> pull off the reason why the VCPU exited).
>
> That breaks the ability to renice vcpu threads (unless you want the user
> renice kernel threads).

I think it'd be fine to have an ioctl()/syscall() to do it.  But I
don't know how well that'd compose with other tools people might use
for managing priorities.

>> This would allow for a variety of different programming styles for the
>> VMM--I'm a fan of CSP model myself, but that's hard to do with the
>> current API.
>
> Just convert the synchronous API to an RPC over a pipe, in the vcpu
> thread, and you have the asynchronous model you asked for.

Yup.  But you still get multiple threads in your process.  It's not a
disaster, though.

)Rob

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-06 17:41     ` Rob Earhart
@ 2012-02-06 19:11       ` Anthony Liguori
  2012-02-07 12:03         ` Avi Kivity
  2012-02-07 12:01       ` Avi Kivity
  1 sibling, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-02-06 19:11 UTC (permalink / raw)
  To: Rob Earhart; +Cc: Avi Kivity, linux-kernel, KVM list, qemu-devel

On 02/06/2012 11:41 AM, Rob Earhart wrote:
> On Sun, Feb 5, 2012 at 5:14 AM, Avi Kivity<avi@redhat.com>  wrote:
>> On 02/03/2012 12:13 AM, Rob Earhart wrote:
>>> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity<avi@redhat.com
>>> <mailto:avi@redhat.com>>  wrote:
>>>
>>>      The kvm api has been accumulating cruft for several years now.
>>>       This is
>>>      due to feature creep, fixing mistakes, experience gained by the
>>>      maintainers and developers on how to do things, ports to new
>>>      architectures, and simply as a side effect of a code base that is
>>>      developed slowly and incrementally.
>>>
>>>      While I don't think we can justify a complete revamp of the API
>>>      now, I'm
>>>      writing this as a thought experiment to see where a from-scratch
>>>      API can
>>>      take us.  Of course, if we do implement this, the new and old APIs
>>>      will
>>>      have to be supported side by side for several years.
>>>
>>>      Syscalls
>>>      --------
>>>      kvm currently uses the much-loved ioctl() system call as its entry
>>>      point.  While this made it easy to add kvm to the kernel
>>>      unintrusively,
>>>      it does have downsides:
>>>
>>>      - overhead in the entry path, for the ioctl dispatch path and vcpu
>>>      mutex
>>>      (low but measurable)
>>>      - semantic mismatch: kvm really wants a vcpu to be tied to a
>>>      thread, and
>>>      a vm to be tied to an mm_struct, but the current API ties them to file
>>>      descriptors, which can move between threads and processes.  We check
>>>      that they don't, but we don't want to.
>>>
>>>      Moving to syscalls avoids these problems, but introduces new ones:
>>>
>>>      - adding new syscalls is generally frowned upon, and kvm will need
>>>      several
>>>      - syscalls into modules are harder and rarer than into core kernel
>>>      code
>>>      - will need to add a vcpu pointer to task_struct, and a kvm pointer to
>>>      mm_struct
>>>
>>>      Syscalls that operate on the entire guest will pick it up implicitly
>>>      from the mm_struct, and syscalls that operate on a vcpu will pick
>>>      it up
>>>      from current.
>>>
>>>
>>> <snipped>
>>>
>>> I like the ioctl() interface.  If the overhead matters in your hot path,
>>
>> I can't say that it's a pressing problem, but it's not negligible.
>>
>>> I suspect you're doing it wrong;
>>
>> What am I doing wrong?
>
> "You the vmm" not "you the KVM maintainer" :-)
>
> To be a little more precise: If a VCPU thread is going all the way out
> to host usermode in its hot path, that's probably a performance
> problem regardless of how fast you make the transitions between host
> user and host kernel.
>
> That's why ioctl() doesn't bother me.  I think it'd be more useful to
> focus on mechanisms which don't require the VCPU thread to exit at all
> in its hot paths, so the overhead of the ioctl() really becomes lost
> in the noise.  irq fds and ioevent fds are great for that, and I
> really like your MMIO-over-socketpair idea.

I'm not so sure.  ioeventfds and a future mmio-over-socketpair have to put the 
kthread to sleep while it waits for the other end to process it.  This is 
effectively equivalent to a heavy weight exit.  The difference in cost is 
dropping to userspace which is really neglible these days (< 100 cycles).

There is some fast-path trickery to avoid heavy weight exits but this presents 
the same basic problem of having to put all the device model stuff in the kernel.

ioeventfd to userspace is almost certainly worse for performance.  And Avi 
mentioned, you can emulate this behavior yourself in userspace if so inclined.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-03 22:52   ` [Qemu-devel] " Anthony Liguori
@ 2012-02-06 19:46     ` Scott Wood
  2012-02-07  6:58       ` Michael Ellerman
  2012-02-07 12:28       ` Anthony Liguori
  0 siblings, 2 replies; 89+ messages in thread
From: Scott Wood @ 2012-02-06 19:46 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Eric Northup, Avi Kivity, linux-kernel, KVM list, qemu-devel

On 02/03/2012 04:52 PM, Anthony Liguori wrote:
> On 02/03/2012 12:07 PM, Eric Northup wrote:
>> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity<avi@redhat.com>  wrote:
>> [...]
>>>
>>> Moving to syscalls avoids these problems, but introduces new ones:
>>>
>>> - adding new syscalls is generally frowned upon, and kvm will need
>>> several
>>> - syscalls into modules are harder and rarer than into core kernel code
>>> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
>>> mm_struct
>> - Lost a good place to put access control (permissions on /dev/kvm)
>> for which user-mode processes can use KVM.
>>
>> How would the ability to use sys_kvm_* be regulated?
> 
> Why should it be regulated?
> 
> It's not a finite or privileged resource.

You're exposing a large, complex kernel subsystem that does very
low-level things with the hardware.  It's a potential source of exploits
(from bugs in KVM or in hardware).  I can see people wanting to be
selective with access because of that.

And sometimes it is a finite resource.  I don't know how x86 does it,
but on at least some powerpc hardware we have a finite, relatively small
number of hardware partition IDs.

-Scott


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-03  2:09 ` Anthony Liguori
  2012-02-04  2:08   ` Takuya Yoshikawa
  2012-02-05  9:24   ` Avi Kivity
@ 2012-02-07  1:08   ` Alexander Graf
  2012-02-07 12:24     ` Avi Kivity
  2012-02-10  3:07   ` Jamie Lokier
  3 siblings, 1 reply; 89+ messages in thread
From: Alexander Graf @ 2012-02-07  1:08 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, KVM list, linux-kernel, qemu-devel, kvm-ppc


On 03.02.2012, at 03:09, Anthony Liguori wrote:

> On 02/02/2012 10:09 AM, Avi Kivity wrote:
>> The kvm api has been accumulating cruft for several years now.  This is
>> due to feature creep, fixing mistakes, experience gained by the
>> maintainers and developers on how to do things, ports to new
>> architectures, and simply as a side effect of a code base that is
>> developed slowly and incrementally.
>> 
>> While I don't think we can justify a complete revamp of the API now, I'm
>> writing this as a thought experiment to see where a from-scratch API can
>> take us.  Of course, if we do implement this, the new and old APIs will
>> have to be supported side by side for several years.
>> 
>> Syscalls
>> --------
>> kvm currently uses the much-loved ioctl() system call as its entry
>> point.  While this made it easy to add kvm to the kernel unintrusively,
>> it does have downsides:
>> 
>> - overhead in the entry path, for the ioctl dispatch path and vcpu mutex
>> (low but measurable)
>> - semantic mismatch: kvm really wants a vcpu to be tied to a thread, and
>> a vm to be tied to an mm_struct, but the current API ties them to file
>> descriptors, which can move between threads and processes.  We check
>> that they don't, but we don't want to.
>> 
>> Moving to syscalls avoids these problems, but introduces new ones:
>> 
>> - adding new syscalls is generally frowned upon, and kvm will need several
>> - syscalls into modules are harder and rarer than into core kernel code
>> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
>> mm_struct
>> 
>> Syscalls that operate on the entire guest will pick it up implicitly
>> from the mm_struct, and syscalls that operate on a vcpu will pick it up
>> from current.
> 
> This seems like the natural progression.

I don't like the idea too much. On s390 and ppc we can set other vcpu's interrupt status. How would that work in this model?

I really do like the ioctl model btw. It's easily extensible and easy to understand.

I can also promise you that I have no idea what other extensions we will need in the next few years. The non-x86 targets are just really very moving. So having an interface that allows for easy extension is a must-have.

> 
>> State accessors
>> ---------------
>> Currently vcpu state is read and written by a bunch of ioctls that
>> access register sets that were added (or discovered) along the years.
>> Some state is stored in the vcpu mmap area.  These will be replaced by a
>> pair of syscalls that read or write the entire state, or a subset of the
>> state, in a tag/value format.  A register will be described by a tuple:
>> 
>>   set: the register set to which it belongs; either a real set (GPR,
>> x87, SSE/AVX, segment, cpuid, MSRs) or a fake set (for
>> eflags/rip/IDT/interrupt shadow/pending exception/etc.)
>>   number: register number within a set
>>   size: for self-description, and to allow expanding registers like
>> SSE->AVX or eax->rax
>>   attributes: read-write, read-only, read-only for guest but read-write
>> for host
>>   value
> 
> I do like the idea a lot of being able to read one register at a time as often times that's all you need.

The framework is in KVM today. It's called ONE_REG. So far only PPC implements a few registers. If you like it, just throw all the x86 ones in there and you have everything you need.

> 
>> 
>> Device model
>> ------------
>> Currently kvm virtualizes or emulates a set of x86 cores, with or
>> without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
>> PCI devices assigned from the host.  The API allows emulating the local
>> APICs in userspace.
>> 
>> The new API will do away with the IOAPIC/PIC/PIT emulation and defer
>> them to userspace.
> 
> I'm a big fan of this.
> 
>> Note: this may cause a regression for older guests
>> that don't support MSI or kvmclock.  Device assignment will be done
>> using VFIO, that is, without direct kvm involvement.
>> 
>> Local APICs will be mandatory, but it will be possible to hide them from
>> the guest.  This means that it will no longer be possible to emulate an
>> APIC in userspace, but it will be possible to virtualize an APIC-less
>> core - userspace will play with the LINT0/LINT1 inputs (configured as
>> EXITINT and NMI) to queue interrupts and NMIs.
> 
> I think this makes sense.  An interesting consequence of this is that it's no longer necessary to associate the VCPU context with an MMIO/PIO operation.  I'm not sure if there's an obvious benefit to that but it's interesting nonetheless.
> 
>> The communications between the local APIC and the IOAPIC/PIC will be
>> done over a socketpair, emulating the APIC bus protocol.

What is keeping us from moving there today?

>> 
>> Ioeventfd/irqfd
>> ---------------
>> As the ioeventfd/irqfd mechanism has been quite successful, it will be
>> retained, and perhaps supplemented with a way to assign an mmio region
>> to a socketpair carrying transactions.  This allows a device model to be
>> implemented out-of-process.  The socketpair can also be used to
>> implement a replacement for coalesced mmio, by not waiting for responses
>> on write transactions when enabled.  Synchronization of coalesced mmio
>> will be implemented in the kernel, not userspace as now: when a
>> non-coalesced mmio is needed, the kernel will first flush the coalesced
>> mmio queue(s).

I would vote for completely deprecating coalesced MMIO. It is a generic framework that nobody except for VGA really needs. Better make something that accelerates read and write paths thanks to more specific knowledge of the interface.

One thing I'm thinking of here is IDE. There's no need to PIO callback into user space for all the status ports. We only really care about a callback on write to 7 (cmd). All the others are basically registers that the kernel could just read and write from shared memory.

I'm sure the VGA text stuff could use similar acceleration with well-known interfaces.

To me, coalesced mmio has proven that's it's generalization where it doesn't belong.

>> 
>> Guest memory management
>> -----------------------
>> Instead of managing each memory slot individually, a single API will be
>> provided that replaces the entire guest physical memory map atomically.
>> This matches the implementation (using RCU) and plugs holes in the
>> current API, where you lose the dirty log in the window between the last
>> call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
>> that removes the slot.

So we render the actual slot logic invisible? That's a very good idea.

>> 
>> Slot-based dirty logging will be replaced by range-based and work-based
>> dirty logging; that is "what pages are dirty in this range, which may be
>> smaller than a slot" and "don't return more than N pages".
>> 
>> We may want to place the log in user memory instead of kernel memory, to
>> reduce pinned memory and increase flexibility.
> 
> Since we really only support 64-bit hosts, what about just pointing the kernel at a address/size pair and rely on userspace to mmap() the range appropriately?

That's basically what he suggested, no?

> 
>> vcpu fd mmap area
>> -----------------
>> Currently we mmap() a few pages of the vcpu fd for fast user/kernel
>> communications.  This will be replaced by a more orthodox pointer
>> parameter to sys_kvm_enter_guest(), that will be accessed using
>> get_user() and put_user().  This is slower than the current situation,
>> but better for things like strace.

I would actually rather like to see the amount of page sharing between kernel and user space increased, no decreased. I don't care if I can throw strace on KVM. I want speed.

> Look pretty interesting overall.

Yeah, I agree with most ideas, except for the syscall one. Everything else can easily be implemented on top of the current model.


Alex


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-06 19:46     ` Scott Wood
@ 2012-02-07  6:58       ` Michael Ellerman
  2012-02-07 10:04         ` Alexander Graf
  2012-02-07 12:28       ` Anthony Liguori
  1 sibling, 1 reply; 89+ messages in thread
From: Michael Ellerman @ 2012-02-07  6:58 UTC (permalink / raw)
  To: Scott Wood
  Cc: Anthony Liguori, Eric Northup, Avi Kivity, linux-kernel,
	KVM list, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1415 bytes --]

On Mon, 2012-02-06 at 13:46 -0600, Scott Wood wrote:
> On 02/03/2012 04:52 PM, Anthony Liguori wrote:
> > On 02/03/2012 12:07 PM, Eric Northup wrote:
> >> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity<avi@redhat.com>  wrote:
> >> [...]
> >>>
> >>> Moving to syscalls avoids these problems, but introduces new ones:
> >>>
> >>> - adding new syscalls is generally frowned upon, and kvm will need
> >>> several
> >>> - syscalls into modules are harder and rarer than into core kernel code
> >>> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
> >>> mm_struct
> >> - Lost a good place to put access control (permissions on /dev/kvm)
> >> for which user-mode processes can use KVM.
> >>
> >> How would the ability to use sys_kvm_* be regulated?
> > 
> > Why should it be regulated?
> > 
> > It's not a finite or privileged resource.
> 
> You're exposing a large, complex kernel subsystem that does very
> low-level things with the hardware.  It's a potential source of exploits
> (from bugs in KVM or in hardware).  I can see people wanting to be
> selective with access because of that.

Exactly.

In a perfect world I'd agree with Anthony, but in reality I think
sysadmins are quite happy that they can prevent some users from using
KVM.

You could presumably achieve something similar with capabilities or
whatever, but a node in /dev is much simpler.

cheers

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07  6:58       ` Michael Ellerman
@ 2012-02-07 10:04         ` Alexander Graf
  2012-02-15 22:21           ` Arnd Bergmann
  0 siblings, 1 reply; 89+ messages in thread
From: Alexander Graf @ 2012-02-07 10:04 UTC (permalink / raw)
  To: michael
  Cc: Scott Wood, Anthony Liguori, Eric Northup, Avi Kivity,
	linux-kernel, KVM list, qemu-devel


On 07.02.2012, at 07:58, Michael Ellerman wrote:

> On Mon, 2012-02-06 at 13:46 -0600, Scott Wood wrote:
>> On 02/03/2012 04:52 PM, Anthony Liguori wrote:
>>> On 02/03/2012 12:07 PM, Eric Northup wrote:
>>>> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity<avi@redhat.com>  wrote:
>>>> [...]
>>>>> 
>>>>> Moving to syscalls avoids these problems, but introduces new ones:
>>>>> 
>>>>> - adding new syscalls is generally frowned upon, and kvm will need
>>>>> several
>>>>> - syscalls into modules are harder and rarer than into core kernel code
>>>>> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
>>>>> mm_struct
>>>> - Lost a good place to put access control (permissions on /dev/kvm)
>>>> for which user-mode processes can use KVM.
>>>> 
>>>> How would the ability to use sys_kvm_* be regulated?
>>> 
>>> Why should it be regulated?
>>> 
>>> It's not a finite or privileged resource.
>> 
>> You're exposing a large, complex kernel subsystem that does very
>> low-level things with the hardware.  It's a potential source of exploits
>> (from bugs in KVM or in hardware).  I can see people wanting to be
>> selective with access because of that.
> 
> Exactly.
> 
> In a perfect world I'd agree with Anthony, but in reality I think
> sysadmins are quite happy that they can prevent some users from using
> KVM.
> 
> You could presumably achieve something similar with capabilities or
> whatever, but a node in /dev is much simpler.

Well, you could still keep the /dev/kvm node and then have syscalls operate on the fd.

But again, I don't see the problem with the ioctl interface. It's nice, extensible and works great for us.


Alex


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-06 17:41     ` Rob Earhart
  2012-02-06 19:11       ` Anthony Liguori
@ 2012-02-07 12:01       ` Avi Kivity
  1 sibling, 0 replies; 89+ messages in thread
From: Avi Kivity @ 2012-02-07 12:01 UTC (permalink / raw)
  To: Rob Earhart; +Cc: linux-kernel, KVM list, qemu-devel

On 02/06/2012 07:41 PM, Rob Earhart wrote:
> >>
> >>  I like the ioctl() interface.  If the overhead matters in your hot path,
> >
> >  I can't say that it's a pressing problem, but it's not negligible.
> >
> >>  I suspect you're doing it wrong;
> >
> >  What am I doing wrong?
>
> "You the vmm" not "you the KVM maintainer" :-)
>
> To be a little more precise: If a VCPU thread is going all the way out
> to host usermode in its hot path, that's probably a performance
> problem regardless of how fast you make the transitions between host
> user and host kernel.

Why?

> That's why ioctl() doesn't bother me.  I think it'd be more useful to
> focus on mechanisms which don't require the VCPU thread to exit at all
> in its hot paths, so the overhead of the ioctl() really becomes lost
> in the noise.  irq fds and ioevent fds are great for that, and I
> really like your MMIO-over-socketpair idea.

I like them too, but they're not suitable for all cases.

An ioeventfd, or unordered write-over-mmio-socketpair can take one of 
two paths:

  - waking up an idle mmio service thread on a different core, involving 
a double context switch on that remote core
  - scheduling the idle mmio service thread on the current core, 
involving both a double context switch and a heavyweight exit

An ordered write-over-mmio-socketpair, or a read-over-mmio-socketpair 
can also take one of two paths
  - waking up an idle mmio service thread on a different core, involving 
a double context switch on that remote core, and also  invoking two 
context switches on the current core (while we wait for a reply); if the 
current core schedules a user task we might also have a heavyweight exit
  - scheduling the idle mmio service thread on the current core, 
involving both a double context switch and a heavyweight exit

As you can see the actual work is greater for threaded io handlers than 
the synchronous ones.  The real advantage is that you can perform more 
work in parallel if you have the spare cores (not a given in 
consolidation environments) and if you actually have a lot of work to do 
(like virtio-net in a throughput load).  It doesn't quite fit a "read 
hpet register" load.



>
> >>  This would allow for a variety of different programming styles for the
> >>  VMM--I'm a fan of CSP model myself, but that's hard to do with the
> >>  current API.
> >
> >  Just convert the synchronous API to an RPC over a pipe, in the vcpu
> >  thread, and you have the asynchronous model you asked for.
>
> Yup.  But you still get multiple threads in your process.  It's not a
> disaster, though.
>

You have multiple threads anyway, even if it's the kernel that creates them.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-06 19:11       ` Anthony Liguori
@ 2012-02-07 12:03         ` Avi Kivity
  2012-02-07 15:17           ` Anthony Liguori
  0 siblings, 1 reply; 89+ messages in thread
From: Avi Kivity @ 2012-02-07 12:03 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Rob Earhart, linux-kernel, KVM list, qemu-devel

On 02/06/2012 09:11 PM, Anthony Liguori wrote:
>
> I'm not so sure.  ioeventfds and a future mmio-over-socketpair have to 
> put the kthread to sleep while it waits for the other end to process 
> it.  This is effectively equivalent to a heavy weight exit.  The 
> difference in cost is dropping to userspace which is really neglible 
> these days (< 100 cycles).

On what machine did you measure these wonderful numbers?

But I agree a heavyweight exit is probably faster than a double context 
switch on a remote core.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07  1:08   ` Alexander Graf
@ 2012-02-07 12:24     ` Avi Kivity
  2012-02-07 12:51       ` Alexander Graf
  0 siblings, 1 reply; 89+ messages in thread
From: Avi Kivity @ 2012-02-07 12:24 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc

On 02/07/2012 03:08 AM, Alexander Graf wrote:
> I don't like the idea too much. On s390 and ppc we can set other vcpu's interrupt status. How would that work in this model?

It would be a "vm-wide syscall".  You can also do that on x86 (through 
KVM_IRQ_LINE).

>
> I really do like the ioctl model btw. It's easily extensible and easy to understand.
>
> I can also promise you that I have no idea what other extensions we will need in the next few years. The non-x86 targets are just really very moving. So having an interface that allows for easy extension is a must-have.

Good point.  If we ever go through with it, it will only be after we see 
the interface has stabilized.

>
> >
> >>  State accessors
> >>  ---------------
> >>  Currently vcpu state is read and written by a bunch of ioctls that
> >>  access register sets that were added (or discovered) along the years.
> >>  Some state is stored in the vcpu mmap area.  These will be replaced by a
> >>  pair of syscalls that read or write the entire state, or a subset of the
> >>  state, in a tag/value format.  A register will be described by a tuple:
> >>
> >>    set: the register set to which it belongs; either a real set (GPR,
> >>  x87, SSE/AVX, segment, cpuid, MSRs) or a fake set (for
> >>  eflags/rip/IDT/interrupt shadow/pending exception/etc.)
> >>    number: register number within a set
> >>    size: for self-description, and to allow expanding registers like
> >>  SSE->AVX or eax->rax
> >>    attributes: read-write, read-only, read-only for guest but read-write
> >>  for host
> >>    value
> >
> >  I do like the idea a lot of being able to read one register at a time as often times that's all you need.
>
> The framework is in KVM today. It's called ONE_REG. So far only PPC implements a few registers. If you like it, just throw all the x86 ones in there and you have everything you need.

This is more like MANY_REG, where you scatter/gather a list of registers 
in userspace to the kernel or vice versa.

>
> >>  The communications between the local APIC and the IOAPIC/PIC will be
> >>  done over a socketpair, emulating the APIC bus protocol.
>
> What is keeping us from moving there today?

The biggest problem with this proposal is that what we have today works 
reasonably well.  Nothing is keeping us from moving there, except the 
fear of performance regressions and lack of strong motivation.

>
> >>
> >>  Ioeventfd/irqfd
> >>  ---------------
> >>  As the ioeventfd/irqfd mechanism has been quite successful, it will be
> >>  retained, and perhaps supplemented with a way to assign an mmio region
> >>  to a socketpair carrying transactions.  This allows a device model to be
> >>  implemented out-of-process.  The socketpair can also be used to
> >>  implement a replacement for coalesced mmio, by not waiting for responses
> >>  on write transactions when enabled.  Synchronization of coalesced mmio
> >>  will be implemented in the kernel, not userspace as now: when a
> >>  non-coalesced mmio is needed, the kernel will first flush the coalesced
> >>  mmio queue(s).
>
> I would vote for completely deprecating coalesced MMIO. It is a generic framework that nobody except for VGA really needs.

It's actually used by e1000 too, don't remember what the performance 
benefits are.  Of course, few people use e1000.

> Better make something that accelerates read and write paths thanks to more specific knowledge of the interface.
>
> One thing I'm thinking of here is IDE. There's no need to PIO callback into user space for all the status ports. We only really care about a callback on write to 7 (cmd). All the others are basically registers that the kernel could just read and write from shared memory.
>
> I'm sure the VGA text stuff could use similar acceleration with well-known interfaces.

This goes back to the discussion about a kernel bytecode vm for 
accelerating mmio.  The problem is that we need something really general.

> To me, coalesced mmio has proven that's it's generalization where it doesn't belong.

But you want to generalize it even more?

There's no way a patch with 'VGA' in it would be accepted.

>
> >>
> >>  Guest memory management
> >>  -----------------------
> >>  Instead of managing each memory slot individually, a single API will be
> >>  provided that replaces the entire guest physical memory map atomically.
> >>  This matches the implementation (using RCU) and plugs holes in the
> >>  current API, where you lose the dirty log in the window between the last
> >>  call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
> >>  that removes the slot.
>
> So we render the actual slot logic invisible? That's a very good idea.

No, slots still exist.  Only the API is "replace slot list" instead of 
"add slot" and "remove slot".

>
> >>
> >>  Slot-based dirty logging will be replaced by range-based and work-based
> >>  dirty logging; that is "what pages are dirty in this range, which may be
> >>  smaller than a slot" and "don't return more than N pages".
> >>
> >>  We may want to place the log in user memory instead of kernel memory, to
> >>  reduce pinned memory and increase flexibility.
> >
> >  Since we really only support 64-bit hosts, what about just pointing the kernel at a address/size pair and rely on userspace to mmap() the range appropriately?
>
> That's basically what he suggested, no?


No.

> >
> >>  vcpu fd mmap area
> >>  -----------------
> >>  Currently we mmap() a few pages of the vcpu fd for fast user/kernel
> >>  communications.  This will be replaced by a more orthodox pointer
> >>  parameter to sys_kvm_enter_guest(), that will be accessed using
> >>  get_user() and put_user().  This is slower than the current situation,
> >>  but better for things like strace.
>
> I would actually rather like to see the amount of page sharing between kernel and user space increased, no decreased. I don't care if I can throw strace on KVM. I want speed.

Something really critical should be handled in the kernel.  Care to 
provide examples?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-06 19:46     ` Scott Wood
  2012-02-07  6:58       ` Michael Ellerman
@ 2012-02-07 12:28       ` Anthony Liguori
  2012-02-07 12:40         ` Avi Kivity
  2012-02-08 17:02         ` Scott Wood
  1 sibling, 2 replies; 89+ messages in thread
From: Anthony Liguori @ 2012-02-07 12:28 UTC (permalink / raw)
  To: Scott Wood; +Cc: qemu-devel, linux-kernel, Eric Northup, KVM list, Avi Kivity

On 02/06/2012 01:46 PM, Scott Wood wrote:
> On 02/03/2012 04:52 PM, Anthony Liguori wrote:
>> On 02/03/2012 12:07 PM, Eric Northup wrote:
>>> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity<avi@redhat.com>   wrote:
>>> [...]
>>>>
>>>> Moving to syscalls avoids these problems, but introduces new ones:
>>>>
>>>> - adding new syscalls is generally frowned upon, and kvm will need
>>>> several
>>>> - syscalls into modules are harder and rarer than into core kernel code
>>>> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
>>>> mm_struct
>>> - Lost a good place to put access control (permissions on /dev/kvm)
>>> for which user-mode processes can use KVM.
>>>
>>> How would the ability to use sys_kvm_* be regulated?
>>
>> Why should it be regulated?
>>
>> It's not a finite or privileged resource.
>
> You're exposing a large, complex kernel subsystem that does very
> low-level things with the hardware.

As does the rest of the kernel.

>  It's a potential source of exploits
> (from bugs in KVM or in hardware).  I can see people wanting to be
> selective with access because of that.

As is true of the rest of the kernel.

If you want finer grain access control, that's exactly why we have things like 
LSM and SELinux.  You can add the appropriate LSM hooks into the KVM 
infrastructure and setup default SELinux policies appropriately.

> And sometimes it is a finite resource.  I don't know how x86 does it,
> but on at least some powerpc hardware we have a finite, relatively small
> number of hardware partition IDs.

But presumably this is per-core, right?  And they're recycled, right?  IOW, 
there isn't a limit of number of guests <= number of hardware partitions IDs. 
It just impacts performance.

Regards,

Anthony Liguori

>
> -Scott
>
>


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 12:28       ` Anthony Liguori
@ 2012-02-07 12:40         ` Avi Kivity
  2012-02-07 12:51           ` Anthony Liguori
  2012-02-08 17:02         ` Scott Wood
  1 sibling, 1 reply; 89+ messages in thread
From: Avi Kivity @ 2012-02-07 12:40 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Scott Wood, qemu-devel, linux-kernel, Eric Northup, KVM list

On 02/07/2012 02:28 PM, Anthony Liguori wrote:
>
>>  It's a potential source of exploits
>> (from bugs in KVM or in hardware).  I can see people wanting to be
>> selective with access because of that.
>
> As is true of the rest of the kernel.
>
> If you want finer grain access control, that's exactly why we have 
> things like LSM and SELinux.  You can add the appropriate LSM hooks 
> into the KVM infrastructure and setup default SELinux policies 
> appropriately.

LSMs protect objects, not syscalls.  There isn't an object to protect 
here (except the fake /dev/kvm object).

In theory, kvm is exactly the same as other syscalls, but in practice, 
it is used by only very few user programs, so there may be many 
unexercised paths.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 12:40         ` Avi Kivity
@ 2012-02-07 12:51           ` Anthony Liguori
  2012-02-07 13:18             ` Avi Kivity
  0 siblings, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-02-07 12:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Scott Wood, Eric Northup, qemu-devel, KVM list, linux-kernel

On 02/07/2012 06:40 AM, Avi Kivity wrote:
> On 02/07/2012 02:28 PM, Anthony Liguori wrote:
>>
>>> It's a potential source of exploits
>>> (from bugs in KVM or in hardware). I can see people wanting to be
>>> selective with access because of that.
>>
>> As is true of the rest of the kernel.
>>
>> If you want finer grain access control, that's exactly why we have things like
>> LSM and SELinux. You can add the appropriate LSM hooks into the KVM
>> infrastructure and setup default SELinux policies appropriately.
>
> LSMs protect objects, not syscalls. There isn't an object to protect here
> (except the fake /dev/kvm object).

A VM can be an object.

Regards,

Anthony Liguori

> In theory, kvm is exactly the same as other syscalls, but in practice, it is
> used by only very few user programs, so there may be many unexercised paths.
>


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 12:24     ` Avi Kivity
@ 2012-02-07 12:51       ` Alexander Graf
  2012-02-07 13:16         ` Avi Kivity
  0 siblings, 1 reply; 89+ messages in thread
From: Alexander Graf @ 2012-02-07 12:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc


On 07.02.2012, at 13:24, Avi Kivity wrote:

> On 02/07/2012 03:08 AM, Alexander Graf wrote:
>> I don't like the idea too much. On s390 and ppc we can set other vcpu's interrupt status. How would that work in this model?
> 
> It would be a "vm-wide syscall".  You can also do that on x86 (through KVM_IRQ_LINE).
> 
>> 
>> I really do like the ioctl model btw. It's easily extensible and easy to understand.
>> 
>> I can also promise you that I have no idea what other extensions we will need in the next few years. The non-x86 targets are just really very moving. So having an interface that allows for easy extension is a must-have.
> 
> Good point.  If we ever go through with it, it will only be after we see the interface has stabilized.

Not sure we'll ever get there. For PPC, it will probably take another 1-2 years until we get the 32-bit targets stabilized. By then we will have new 64-bit support though. And then the next gen will come out giving us even more new constraints.

The same goes for ARM, where we will get v7 support for now, but very soon we will also want to get v8. Stabilizing a target so far takes ~1-2 years from what I've seen. And that stabilizing to a point where we don't find major ABI issues anymore.

> 
>> 
>> >
>> >>  State accessors
>> >>  ---------------
>> >>  Currently vcpu state is read and written by a bunch of ioctls that
>> >>  access register sets that were added (or discovered) along the years.
>> >>  Some state is stored in the vcpu mmap area.  These will be replaced by a
>> >>  pair of syscalls that read or write the entire state, or a subset of the
>> >>  state, in a tag/value format.  A register will be described by a tuple:
>> >>
>> >>    set: the register set to which it belongs; either a real set (GPR,
>> >>  x87, SSE/AVX, segment, cpuid, MSRs) or a fake set (for
>> >>  eflags/rip/IDT/interrupt shadow/pending exception/etc.)
>> >>    number: register number within a set
>> >>    size: for self-description, and to allow expanding registers like
>> >>  SSE->AVX or eax->rax
>> >>    attributes: read-write, read-only, read-only for guest but read-write
>> >>  for host
>> >>    value
>> >
>> >  I do like the idea a lot of being able to read one register at a time as often times that's all you need.
>> 
>> The framework is in KVM today. It's called ONE_REG. So far only PPC implements a few registers. If you like it, just throw all the x86 ones in there and you have everything you need.
> 
> This is more like MANY_REG, where you scatter/gather a list of registers in userspace to the kernel or vice versa.

Well, yeah, to me MANY_REG is part of ONE_REG. The idea behind ONE_REG was to give every register a unique identifier that can be used to access it. Taking that logic to an array is trivial.

> 
>> 
>> >>  The communications between the local APIC and the IOAPIC/PIC will be
>> >>  done over a socketpair, emulating the APIC bus protocol.
>> 
>> What is keeping us from moving there today?
> 
> The biggest problem with this proposal is that what we have today works reasonably well.  Nothing is keeping us from moving there, except the fear of performance regressions and lack of strong motivation.

So why bring it up in the "next-gen" api discussion?

> 
>> 
>> >>
>> >>  Ioeventfd/irqfd
>> >>  ---------------
>> >>  As the ioeventfd/irqfd mechanism has been quite successful, it will be
>> >>  retained, and perhaps supplemented with a way to assign an mmio region
>> >>  to a socketpair carrying transactions.  This allows a device model to be
>> >>  implemented out-of-process.  The socketpair can also be used to
>> >>  implement a replacement for coalesced mmio, by not waiting for responses
>> >>  on write transactions when enabled.  Synchronization of coalesced mmio
>> >>  will be implemented in the kernel, not userspace as now: when a
>> >>  non-coalesced mmio is needed, the kernel will first flush the coalesced
>> >>  mmio queue(s).
>> 
>> I would vote for completely deprecating coalesced MMIO. It is a generic framework that nobody except for VGA really needs.
> 
> It's actually used by e1000 too, don't remember what the performance benefits are.  Of course, few people use e1000.

And for e1000 it's only used for nvram which actually could benefit from a more clever "this is backed by ram" logic. Coalesced mmio is not a great fit here.

> 
>> Better make something that accelerates read and write paths thanks to more specific knowledge of the interface.
>> 
>> One thing I'm thinking of here is IDE. There's no need to PIO callback into user space for all the status ports. We only really care about a callback on write to 7 (cmd). All the others are basically registers that the kernel could just read and write from shared memory.
>> 
>> I'm sure the VGA text stuff could use similar acceleration with well-known interfaces.
> 
> This goes back to the discussion about a kernel bytecode vm for accelerating mmio.  The problem is that we need something really general.
> 
>> To me, coalesced mmio has proven that's it's generalization where it doesn't belong.
> 
> But you want to generalize it even more?
> 
> There's no way a patch with 'VGA' in it would be accepted.

Why not? I think the natural step forward is hybrid acceleration. Take a minimal subset of device emulation into kernel land, keep the rest in user space. Similar to how vhost works, where we keep device enumeration and configuration in user space, but ring processing in kernel space.

Good candidates for in-kernel acceleration are:

  - HPET
  - VGA
  - IDE

I'm not sure how easy it would be to only partially accelerate the hot paths of the IO-APIC. I'm not too familiar with its details.

We will run into the same thing with the MPIC though. On e500v2, IPIs are done through the MPIC. So if we want any SMP performance on those, we need to shove that part into the kernel. I don't really want to have all of the MPIC code in there however. So a hybrid approach sounds like a great fit.

The problem with in-kernel device emulation the way we have it today is that it's an all-or-nothing choice. Either we push the device into kernel space or we keep it in user space. That adds a lot of code in kernel land where it doesn't belong.

> 
>> 
>> >>
>> >>  Guest memory management
>> >>  -----------------------
>> >>  Instead of managing each memory slot individually, a single API will be
>> >>  provided that replaces the entire guest physical memory map atomically.
>> >>  This matches the implementation (using RCU) and plugs holes in the
>> >>  current API, where you lose the dirty log in the window between the last
>> >>  call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
>> >>  that removes the slot.
>> 
>> So we render the actual slot logic invisible? That's a very good idea.
> 
> No, slots still exist.  Only the API is "replace slot list" instead of "add slot" and "remove slot".

Why? On PPC we walk the slots on every fault (incl. mmio), so fast lookup times there would be great. I was thinking of something page table like here. That only works when then internal slot structure is hidden from user space though.

> 
>> 
>> >>
>> >>  Slot-based dirty logging will be replaced by range-based and work-based
>> >>  dirty logging; that is "what pages are dirty in this range, which may be
>> >>  smaller than a slot" and "don't return more than N pages".
>> >>
>> >>  We may want to place the log in user memory instead of kernel memory, to
>> >>  reduce pinned memory and increase flexibility.
>> >
>> >  Since we really only support 64-bit hosts, what about just pointing the kernel at a address/size pair and rely on userspace to mmap() the range appropriately?
>> 
>> That's basically what he suggested, no?
> 
> 
> No.
> 
>> >
>> >>  vcpu fd mmap area
>> >>  -----------------
>> >>  Currently we mmap() a few pages of the vcpu fd for fast user/kernel
>> >>  communications.  This will be replaced by a more orthodox pointer
>> >>  parameter to sys_kvm_enter_guest(), that will be accessed using
>> >>  get_user() and put_user().  This is slower than the current situation,
>> >>  but better for things like strace.
>> 
>> I would actually rather like to see the amount of page sharing between kernel and user space increased, no decreased. I don't care if I can throw strace on KVM. I want speed.
> 
> Something really critical should be handled in the kernel.  Care to provide examples?

Just look at the s390 patches Christian posted recently. I think that's a very nice direction to walk towards.
For permanently mapped space, the hybrid stuff above could fall into that category. We could however to it through copy_from/to_user with a user space pointer.

So maybe you're right - the mmap'ed space isn't all that important. Having kernel space write into user space memory is however.


Alex


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 12:51       ` Alexander Graf
@ 2012-02-07 13:16         ` Avi Kivity
  2012-02-07 13:40           ` Alexander Graf
  0 siblings, 1 reply; 89+ messages in thread
From: Avi Kivity @ 2012-02-07 13:16 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc

On 02/07/2012 02:51 PM, Alexander Graf wrote:
> On 07.02.2012, at 13:24, Avi Kivity wrote:
>
> >  On 02/07/2012 03:08 AM, Alexander Graf wrote:
> >>  I don't like the idea too much. On s390 and ppc we can set other vcpu's interrupt status. How would that work in this model?
> >
> >  It would be a "vm-wide syscall".  You can also do that on x86 (through KVM_IRQ_LINE).
> >
> >>
> >>  I really do like the ioctl model btw. It's easily extensible and easy to understand.
> >>
> >>  I can also promise you that I have no idea what other extensions we will need in the next few years. The non-x86 targets are just really very moving. So having an interface that allows for easy extension is a must-have.
> >
> >  Good point.  If we ever go through with it, it will only be after we see the interface has stabilized.
>
> Not sure we'll ever get there. For PPC, it will probably take another 1-2 years until we get the 32-bit targets stabilized. By then we will have new 64-bit support though. And then the next gen will come out giving us even more new constraints.

I would expect that newer archs have less constraints, not more.

> The same goes for ARM, where we will get v7 support for now, but very soon we will also want to get v8. Stabilizing a target so far takes ~1-2 years from what I've seen. And that stabilizing to a point where we don't find major ABI issues anymore.

The trick is to get the ABI to be flexible, like a generalized ABI for 
state.  But it's true that it's really hard to nail it down.


> >>
> >>  The framework is in KVM today. It's called ONE_REG. So far only PPC implements a few registers. If you like it, just throw all the x86 ones in there and you have everything you need.
> >
> >  This is more like MANY_REG, where you scatter/gather a list of registers in userspace to the kernel or vice versa.
>
> Well, yeah, to me MANY_REG is part of ONE_REG. The idea behind ONE_REG was to give every register a unique identifier that can be used to access it. Taking that logic to an array is trivial.

Definitely easy to extend.


> >
> >>
> >>  >>   The communications between the local APIC and the IOAPIC/PIC will be
> >>  >>   done over a socketpair, emulating the APIC bus protocol.
> >>
> >>  What is keeping us from moving there today?
> >
> >  The biggest problem with this proposal is that what we have today works reasonably well.  Nothing is keeping us from moving there, except the fear of performance regressions and lack of strong motivation.
>
> So why bring it up in the "next-gen" api discussion?

One reason is to try to shape future changes to the current ABI in the 
same direction.  Another is that maybe someone will convince us that it 
is needed.

> >
> >  There's no way a patch with 'VGA' in it would be accepted.
>
> Why not? I think the natural step forward is hybrid acceleration. Take a minimal subset of device emulation into kernel land, keep the rest in user space.


When a device is fully in the kernel, we have a good specification of 
the ABI: it just implements the spec, and the ABI provides the interface 
from the device to the rest of the world.  Partially accelerated devices 
means a much greater effort in specifying exactly what it does.  It's 
also vulnerable to changes in how the guest uses the device.

> Similar to how vhost works, where we keep device enumeration and configuration in user space, but ring processing in kernel space.

vhost-net was a massive effort, I hope we don't have to replicate it.

>
> Good candidates for in-kernel acceleration are:
>
>    - HPET

Yes

>    - VGA
>    - IDE

Why?  There are perfectly good replacements for these (qxl, virtio-blk, 
virtio-scsi).

> I'm not sure how easy it would be to only partially accelerate the hot paths of the IO-APIC. I'm not too familiar with its details.

Pretty hard.

>
> We will run into the same thing with the MPIC though. On e500v2, IPIs are done through the MPIC. So if we want any SMP performance on those, we need to shove that part into the kernel. I don't really want to have all of the MPIC code in there however. So a hybrid approach sounds like a great fit.

Pointer to the qemu code?

> The problem with in-kernel device emulation the way we have it today is that it's an all-or-nothing choice. Either we push the device into kernel space or we keep it in user space. That adds a lot of code in kernel land where it doesn't belong.

Like I mentioned, I see that as a good thing.

> >
> >  No, slots still exist.  Only the API is "replace slot list" instead of "add slot" and "remove slot".
>
> Why?

Physical memory is discontiguous, and includes aliases (two gpas 
referencing the same backing page).  How else would you describe it.

> On PPC we walk the slots on every fault (incl. mmio), so fast lookup times there would be great. I was thinking of something page table like here.

We can certainly convert the slots to a tree internally.  I'm doing the 
same thing for qemu now, maybe we can do it for kvm too.  No need to 
involve the ABI at all.

Slot searching is quite fast since there's a small number of slots, and 
we sort the larger ones to be in the front, so positive lookups are 
fast.  We cache negative lookups in the shadow page tables (an spte can 
be either "not mapped", "mapped to RAM", or "not mapped and known to be 
mmio") so we rarely need to walk the entire list.

> That only works when then internal slot structure is hidden from user space though.

Why?

>
> >>  I would actually rather like to see the amount of page sharing between kernel and user space increased, no decreased. I don't care if I can throw strace on KVM. I want speed.
> >
> >  Something really critical should be handled in the kernel.  Care to provide examples?
>
> Just look at the s390 patches Christian posted recently.

Which ones?

> I think that's a very nice direction to walk towards.
> For permanently mapped space, the hybrid stuff above could fall into that category. We could however to it through copy_from/to_user with a user space pointer.
>
> So maybe you're right - the mmap'ed space isn't all that important. Having kernel space write into user space memory is however.
>
>

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 12:51           ` Anthony Liguori
@ 2012-02-07 13:18             ` Avi Kivity
  2012-02-07 15:15               ` Anthony Liguori
  0 siblings, 1 reply; 89+ messages in thread
From: Avi Kivity @ 2012-02-07 13:18 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Scott Wood, Eric Northup, qemu-devel, KVM list, linux-kernel

On 02/07/2012 02:51 PM, Anthony Liguori wrote:
> On 02/07/2012 06:40 AM, Avi Kivity wrote:
>> On 02/07/2012 02:28 PM, Anthony Liguori wrote:
>>>
>>>> It's a potential source of exploits
>>>> (from bugs in KVM or in hardware). I can see people wanting to be
>>>> selective with access because of that.
>>>
>>> As is true of the rest of the kernel.
>>>
>>> If you want finer grain access control, that's exactly why we have 
>>> things like
>>> LSM and SELinux. You can add the appropriate LSM hooks into the KVM
>>> infrastructure and setup default SELinux policies appropriately.
>>
>> LSMs protect objects, not syscalls. There isn't an object to protect 
>> here
>> (except the fake /dev/kvm object).
>
> A VM can be an object.
>

Not really, it's not accessible in a namespace.  How would you label it?

Maybe we can reuse the process label/context (not sure what the right 
term is for a process).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 13:16         ` Avi Kivity
@ 2012-02-07 13:40           ` Alexander Graf
  2012-02-07 14:21             ` Avi Kivity
                               ` (2 more replies)
  0 siblings, 3 replies; 89+ messages in thread
From: Alexander Graf @ 2012-02-07 13:40 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc


On 07.02.2012, at 14:16, Avi Kivity wrote:

> On 02/07/2012 02:51 PM, Alexander Graf wrote:
>> On 07.02.2012, at 13:24, Avi Kivity wrote:
>> 
>> >  On 02/07/2012 03:08 AM, Alexander Graf wrote:
>> >>  I don't like the idea too much. On s390 and ppc we can set other vcpu's interrupt status. How would that work in this model?
>> >
>> >  It would be a "vm-wide syscall".  You can also do that on x86 (through KVM_IRQ_LINE).
>> >
>> >>
>> >>  I really do like the ioctl model btw. It's easily extensible and easy to understand.
>> >>
>> >>  I can also promise you that I have no idea what other extensions we will need in the next few years. The non-x86 targets are just really very moving. So having an interface that allows for easy extension is a must-have.
>> >
>> >  Good point.  If we ever go through with it, it will only be after we see the interface has stabilized.
>> 
>> Not sure we'll ever get there. For PPC, it will probably take another 1-2 years until we get the 32-bit targets stabilized. By then we will have new 64-bit support though. And then the next gen will come out giving us even more new constraints.
> 
> I would expect that newer archs have less constraints, not more.

Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have today on 32-bit, but extends a bunch of registers to 64-bit. So what if we laid out stuff wrong before?

I don't even want to imagine what v7 arm vs v8 arm looks like. It's a completely new architecture.

And what if MIPS comes along? I hear they also work on hw accelerated virtualization.

> 
>> The same goes for ARM, where we will get v7 support for now, but very soon we will also want to get v8. Stabilizing a target so far takes ~1-2 years from what I've seen. And that stabilizing to a point where we don't find major ABI issues anymore.
> 
> The trick is to get the ABI to be flexible, like a generalized ABI for state.  But it's true that it's really hard to nail it down.

Yup, and I think what we have today is a pretty good approach to this. I'm trying to mostly add "generalized" ioctls whenever I see that something can be handled generically, like ONE_REG or ENABLE_CAP. If we keep moving that direction, we are extensible with a reasonably stable ABI. Even without syscalls.

> 
> 
>> >>
>> >>  The framework is in KVM today. It's called ONE_REG. So far only PPC implements a few registers. If you like it, just throw all the x86 ones in there and you have everything you need.
>> >
>> >  This is more like MANY_REG, where you scatter/gather a list of registers in userspace to the kernel or vice versa.
>> 
>> Well, yeah, to me MANY_REG is part of ONE_REG. The idea behind ONE_REG was to give every register a unique identifier that can be used to access it. Taking that logic to an array is trivial.
> 
> Definitely easy to extend.
> 
> 
>> >
>> >>
>> >>  >>   The communications between the local APIC and the IOAPIC/PIC will be
>> >>  >>   done over a socketpair, emulating the APIC bus protocol.
>> >>
>> >>  What is keeping us from moving there today?
>> >
>> >  The biggest problem with this proposal is that what we have today works reasonably well.  Nothing is keeping us from moving there, except the fear of performance regressions and lack of strong motivation.
>> 
>> So why bring it up in the "next-gen" api discussion?
> 
> One reason is to try to shape future changes to the current ABI in the same direction.  Another is that maybe someone will convince us that it is needed.
> 
>> >
>> >  There's no way a patch with 'VGA' in it would be accepted.
>> 
>> Why not? I think the natural step forward is hybrid acceleration. Take a minimal subset of device emulation into kernel land, keep the rest in user space.
> 
> 
> When a device is fully in the kernel, we have a good specification of the ABI: it just implements the spec, and the ABI provides the interface from the device to the rest of the world.  Partially accelerated devices means a much greater effort in specifying exactly what it does.  It's also vulnerable to changes in how the guest uses the device.

Why? For the HPET timer register for example, we could have a simple MMIO hook that says

  on_read:
    return read_current_time() - shared_page.offset;
  on_write:
    handle_in_user_space();

For IDE, it would be as simple as

  register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
  for (i = 1; i < 7; i++) {
    register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
    register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
  }

and we should have reduced overhead of IDE by quite a bit already. All the other 2k LOC in hw/ide/core.c don't matter for us really.

> 
>> Similar to how vhost works, where we keep device enumeration and configuration in user space, but ring processing in kernel space.
> 
> vhost-net was a massive effort, I hope we don't have to replicate it.

Was it harder than the in-kernel io-apic?

> 
>> 
>> Good candidates for in-kernel acceleration are:
>> 
>>   - HPET
> 
> Yes
> 
>>   - VGA
>>   - IDE
> 
> Why?  There are perfectly good replacements for these (qxl, virtio-blk, virtio-scsi).

Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp. I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. Same for virtio.

Please don't do the Xen mistake again of claiming that all we care about is Linux as a guest. KVM's strength has always been its close resemblance to hardware.

> 
>> I'm not sure how easy it would be to only partially accelerate the hot paths of the IO-APIC. I'm not too familiar with its details.
> 
> Pretty hard.
> 
>> 
>> We will run into the same thing with the MPIC though. On e500v2, IPIs are done through the MPIC. So if we want any SMP performance on those, we need to shove that part into the kernel. I don't really want to have all of the MPIC code in there however. So a hybrid approach sounds like a great fit.
> 
> Pointer to the qemu code?

hw/openpic.c

> 
>> The problem with in-kernel device emulation the way we have it today is that it's an all-or-nothing choice. Either we push the device into kernel space or we keep it in user space. That adds a lot of code in kernel land where it doesn't belong.
> 
> Like I mentioned, I see that as a good thing.

I don't. And we don't do it for hypercall handling on book3s hv either for example. There we have a 3 level handling system. Very hot path hypercalls get handled in real mode. Reasonably hot path hypercalls get handled in kernel space. Everything else goes to user land.

> 
>> >
>> >  No, slots still exist.  Only the API is "replace slot list" instead of "add slot" and "remove slot".
>> 
>> Why?
> 
> Physical memory is discontiguous, and includes aliases (two gpas referencing the same backing page).  How else would you describe it.
> 
>> On PPC we walk the slots on every fault (incl. mmio), so fast lookup times there would be great. I was thinking of something page table like here.
> 
> We can certainly convert the slots to a tree internally.  I'm doing the same thing for qemu now, maybe we can do it for kvm too.  No need to involve the ABI at all.

Hrm, true.

> Slot searching is quite fast since there's a small number of slots, and we sort the larger ones to be in the front, so positive lookups are fast.  We cache negative lookups in the shadow page tables (an spte can be either "not mapped", "mapped to RAM", or "not mapped and known to be mmio") so we rarely need to walk the entire list.

Well, we don't always have shadow page tables. Having hints for unmapped guest memory like this is pretty tricky.
We're currently running into issues with device assignment though, where we get a lot of small slots mapped to real hardware. I'm sure that will hit us on x86 sooner or later too.

> 
>> That only works when then internal slot structure is hidden from user space though.
> 
> Why?

Because if user space thinks it's slots and in reality it's a tree that doesn't match. If you decouple the external view from the internal view, it works again.

> 
>> 
>> >>  I would actually rather like to see the amount of page sharing between kernel and user space increased, no decreased. I don't care if I can throw strace on KVM. I want speed.
>> >
>> >  Something really critical should be handled in the kernel.  Care to provide examples?
>> 
>> Just look at the s390 patches Christian posted recently.
> 
> Which ones?

  http://www.mail-archive.com/kvm@vger.kernel.org/msg66155.html


Alex


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 13:40           ` Alexander Graf
@ 2012-02-07 14:21             ` Avi Kivity
  2012-02-07 14:39               ` Alexander Graf
  2012-02-12  7:10               ` Takuya Yoshikawa
  2012-02-07 15:23             ` Anthony Liguori
  2012-02-15 22:14             ` Arnd Bergmann
  2 siblings, 2 replies; 89+ messages in thread
From: Avi Kivity @ 2012-02-07 14:21 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc

On 02/07/2012 03:40 PM, Alexander Graf wrote:
> >>
> >>  Not sure we'll ever get there. For PPC, it will probably take another 1-2 years until we get the 32-bit targets stabilized. By then we will have new 64-bit support though. And then the next gen will come out giving us even more new constraints.
> >
> >  I would expect that newer archs have less constraints, not more.
>
> Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have today on 32-bit, but extends a bunch of registers to 64-bit. So what if we laid out stuff wrong before?

That's not what I mean by constraints.  It's easy to accommodate 
different register layouts.  Constraints (for me) are like requiring 
gang scheduling.  But you introduced the subject - what did you mean?

Let's take for example the software-controlled TLB on some ppc.  It's 
tempting to call them all "registers" and use the register interface to 
access them.  Is it workable?

Or let's look at SMM on x86.  To implement it memory slots need an 
additional attribute "SMM/non-SMM/either".  These sort of things, if you 
don't think of them beforehand, break your interface.

>
> I don't even want to imagine what v7 arm vs v8 arm looks like. It's a completely new architecture.
>
> And what if MIPS comes along? I hear they also work on hw accelerated virtualization.

If it's just a matter of different register names and sizes, no 
problem.  From what I've seen of v8, it doesn't introduce new wierdnesses.

>
> >
> >>  The same goes for ARM, where we will get v7 support for now, but very soon we will also want to get v8. Stabilizing a target so far takes ~1-2 years from what I've seen. And that stabilizing to a point where we don't find major ABI issues anymore.
> >
> >  The trick is to get the ABI to be flexible, like a generalized ABI for state.  But it's true that it's really hard to nail it down.
>
> Yup, and I think what we have today is a pretty good approach to this. I'm trying to mostly add "generalized" ioctls whenever I see that something can be handled generically, like ONE_REG or ENABLE_CAP. If we keep moving that direction, we are extensible with a reasonably stable ABI. Even without syscalls.

Syscalls are orthogonal to that - they're to avoid the fget_light() and 
to tighten the vcpu/thread and vm/process relationship.

> , keep the rest in user space.
> >
> >
> >  When a device is fully in the kernel, we have a good specification of the ABI: it just implements the spec, and the ABI provides the interface from the device to the rest of the world.  Partially accelerated devices means a much greater effort in specifying exactly what it does.  It's also vulnerable to changes in how the guest uses the device.
>
> Why? For the HPET timer register for example, we could have a simple MMIO hook that says
>
>    on_read:
>      return read_current_time() - shared_page.offset;
>    on_write:
>      handle_in_user_space();

It works for the really simple cases, yes, but if the guest wants to set 
up one-shot timers, it fails.  Also look at the PIT which latches on read.

>
> For IDE, it would be as simple as
>
>    register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE,&s->cmd[0]);
>    for (i = 1; i<  7; i++) {
>      register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
>      register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
>    }
>
> and we should have reduced overhead of IDE by quite a bit already. All the other 2k LOC in hw/ide/core.c don't matter for us really.


Just use virtio.

>
> >
> >>  Similar to how vhost works, where we keep device enumeration and configuration in user space, but ring processing in kernel space.
> >
> >  vhost-net was a massive effort, I hope we don't have to replicate it.
>
> Was it harder than the in-kernel io-apic?

Much, much harder.

>
> >
> >>
> >>  Good candidates for in-kernel acceleration are:
> >>
> >>    - HPET
> >
> >  Yes
> >
> >>    - VGA
> >>    - IDE
> >
> >  Why?  There are perfectly good replacements for these (qxl, virtio-blk, virtio-scsi).
>
> Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp. I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. Same for virtio.
>
> Please don't do the Xen mistake again of claiming that all we care about is Linux as a guest.

Rest easy, there's no chance of that.  But if a guest is important 
enough, virtio drivers will get written.  IDE has no chance in hell of 
approaching virtio-blk performance, no matter how much effort we put 
into it.

> KVM's strength has always been its close resemblance to hardware.

This will remain.  But we can't optimize everything.

> >
> >>
> >>  We will run into the same thing with the MPIC though. On e500v2, IPIs are done through the MPIC. So if we want any SMP performance on those, we need to shove that part into the kernel. I don't really want to have all of the MPIC code in there however. So a hybrid approach sounds like a great fit.
> >
> >  Pointer to the qemu code?
>
> hw/openpic.c

I see what you mean.

>
> >
> >>  The problem with in-kernel device emulation the way we have it today is that it's an all-or-nothing choice. Either we push the device into kernel space or we keep it in user space. That adds a lot of code in kernel land where it doesn't belong.
> >
> >  Like I mentioned, I see that as a good thing.
>
> I don't. And we don't do it for hypercall handling on book3s hv either for example. There we have a 3 level handling system. Very hot path hypercalls get handled in real mode. Reasonably hot path hypercalls get handled in kernel space. Everything else goes to user land.

Well, the MPIC thing really supports your point.

> >
> >>  >
> >>  >   No, slots still exist.  Only the API is "replace slot list" instead of "add slot" and "remove slot".
> >>
> >>  Why?
> >
> >  Physical memory is discontiguous, and includes aliases (two gpas referencing the same backing page).  How else would you describe it.
> >
> >>  On PPC we walk the slots on every fault (incl. mmio), so fast lookup times there would be great. I was thinking of something page table like here.
> >
> >  We can certainly convert the slots to a tree internally.  I'm doing the same thing for qemu now, maybe we can do it for kvm too.  No need to involve the ABI at all.
>
> Hrm, true.
>
> >  Slot searching is quite fast since there's a small number of slots, and we sort the larger ones to be in the front, so positive lookups are fast.  We cache negative lookups in the shadow page tables (an spte can be either "not mapped", "mapped to RAM", or "not mapped and known to be mmio") so we rarely need to walk the entire list.
>
> Well, we don't always have shadow page tables. Having hints for unmapped guest memory like this is pretty tricky.
> We're currently running into issues with device assignment though, where we get a lot of small slots mapped to real hardware. I'm sure that will hit us on x86 sooner or later too.

For x86 that's not a problem, since once you map a page, it stays mapped 
(on modern hardware).

>
> >
> >>  That only works when then internal slot structure is hidden from user space though.
> >
> >  Why?
>
> Because if user space thinks it's slots and in reality it's a tree that doesn't match. If you decouple the external view from the internal view, it works again.

Userspace needs to provide a function hva = f(gpa).  Why does it matter 
how the function is spelled out?  Slots happen to be a concise 
representation.  Transform the function all you like in the kernel, as 
long as you preserve all the mappings.

>
> >
> >>
> >>  >>   I would actually rather like to see the amount of page sharing between kernel and user space increased, no decreased. I don't care if I can throw strace on KVM. I want speed.
> >>  >
> >>  >   Something really critical should be handled in the kernel.  Care to provide examples?
> >>
> >>  Just look at the s390 patches Christian posted recently.
> >
> >  Which ones?
>
>    http://www.mail-archive.com/kvm@vger.kernel.org/msg66155.html
>

Yeah - s390 is always different.  On the current interface synchronous 
registers are easy, so why not.  But I wonder if it's really critical.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 14:21             ` Avi Kivity
@ 2012-02-07 14:39               ` Alexander Graf
  2012-02-15 11:18                 ` Avi Kivity
  2012-02-12  7:10               ` Takuya Yoshikawa
  1 sibling, 1 reply; 89+ messages in thread
From: Alexander Graf @ 2012-02-07 14:39 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc


On 07.02.2012, at 15:21, Avi Kivity wrote:

> On 02/07/2012 03:40 PM, Alexander Graf wrote:
>> >>
>> >>  Not sure we'll ever get there. For PPC, it will probably take another 1-2 years until we get the 32-bit targets stabilized. By then we will have new 64-bit support though. And then the next gen will come out giving us even more new constraints.
>> >
>> >  I would expect that newer archs have less constraints, not more.
>> 
>> Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have today on 32-bit, but extends a bunch of registers to 64-bit. So what if we laid out stuff wrong before?
> 
> That's not what I mean by constraints.  It's easy to accommodate different register layouts.  Constraints (for me) are like requiring gang scheduling.  But you introduced the subject - what did you mean?

New extensions to architectures give us new challenges. Newer booke for example implements page tables in parallel to soft TLBs. We need to model that. My point was more that I can't predict the future :).

> Let's take for example the software-controlled TLB on some ppc.  It's tempting to call them all "registers" and use the register interface to access them.  Is it workable?

Workable, yes. Fast? No. Right now we share them between kernel and user space to have very fast access to them. That way we don't have to sync anything at all.

> Or let's look at SMM on x86.  To implement it memory slots need an additional attribute "SMM/non-SMM/either".  These sort of things, if you don't think of them beforehand, break your interface.

Yup. And we will never think of all the cases.

> 
>> 
>> I don't even want to imagine what v7 arm vs v8 arm looks like. It's a completely new architecture.
>> 
>> And what if MIPS comes along? I hear they also work on hw accelerated virtualization.
> 
> If it's just a matter of different register names and sizes, no problem.  From what I've seen of v8, it doesn't introduce new wierdnesses.

I haven't seen anything real yet, since the spec isn't out. So far only generic architecture documentation is available.

> 
>> 
>> >
>> >>  The same goes for ARM, where we will get v7 support for now, but very soon we will also want to get v8. Stabilizing a target so far takes ~1-2 years from what I've seen. And that stabilizing to a point where we don't find major ABI issues anymore.
>> >
>> >  The trick is to get the ABI to be flexible, like a generalized ABI for state.  But it's true that it's really hard to nail it down.
>> 
>> Yup, and I think what we have today is a pretty good approach to this. I'm trying to mostly add "generalized" ioctls whenever I see that something can be handled generically, like ONE_REG or ENABLE_CAP. If we keep moving that direction, we are extensible with a reasonably stable ABI. Even without syscalls.
> 
> Syscalls are orthogonal to that - they're to avoid the fget_light() and to tighten the vcpu/thread and vm/process relationship.

How about keeping the ioctl interface but moving vcpu_run to a syscall then? That should really be the only thing that belongs into the fast path, right? Every time we do a register sync in user space, we do something wrong. Instead, user space should either

  a) have wrappers around register accesses, so it can directly ask for specific registers that it needs
or
  b) keep everything that would be requested by the register synchronization in shared memory

> 
>> , keep the rest in user space.
>> >
>> >
>> >  When a device is fully in the kernel, we have a good specification of the ABI: it just implements the spec, and the ABI provides the interface from the device to the rest of the world.  Partially accelerated devices means a much greater effort in specifying exactly what it does.  It's also vulnerable to changes in how the guest uses the device.
>> 
>> Why? For the HPET timer register for example, we could have a simple MMIO hook that says
>> 
>>   on_read:
>>     return read_current_time() - shared_page.offset;
>>   on_write:
>>     handle_in_user_space();
> 
> It works for the really simple cases, yes, but if the guest wants to set up one-shot timers, it fails.  

I don't understand. Why would anything fail here? Once the logic that's implemented by the kernel accelerator doesn't fit anymore, unregister it.

> Also look at the PIT which latches on read.
> 
>> 
>> For IDE, it would be as simple as
>> 
>>   register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE,&s->cmd[0]);
>>   for (i = 1; i<  7; i++) {
>>     register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
>>     register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
>>   }
>> 
>> and we should have reduced overhead of IDE by quite a bit already. All the other 2k LOC in hw/ide/core.c don't matter for us really.
> 
> 
> Just use virtio.

Just use xenbus. Seriously, this is not an answer.

> 
>> 
>> >
>> >>  Similar to how vhost works, where we keep device enumeration and configuration in user space, but ring processing in kernel space.
>> >
>> >  vhost-net was a massive effort, I hope we don't have to replicate it.
>> 
>> Was it harder than the in-kernel io-apic?
> 
> Much, much harder.
> 
>> 
>> >
>> >>
>> >>  Good candidates for in-kernel acceleration are:
>> >>
>> >>    - HPET
>> >
>> >  Yes
>> >
>> >>    - VGA
>> >>    - IDE
>> >
>> >  Why?  There are perfectly good replacements for these (qxl, virtio-blk, virtio-scsi).
>> 
>> Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp. I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. Same for virtio.
>> 
>> Please don't do the Xen mistake again of claiming that all we care about is Linux as a guest.
> 
> Rest easy, there's no chance of that.  But if a guest is important enough, virtio drivers will get written.  IDE has no chance in hell of approaching virtio-blk performance, no matter how much effort we put into it.

Ever used VMware? They basically get virtio-blk performance out of ordinary IDE for linear workloads.

> 
>> KVM's strength has always been its close resemblance to hardware.
> 
> This will remain.  But we can't optimize everything.

That's my point. Let's optimize the hot paths and be good. As long as we default to IDE for disk, we should have that be fast, no?

> 
>> >
>> >>
>> >>  We will run into the same thing with the MPIC though. On e500v2, IPIs are done through the MPIC. So if we want any SMP performance on those, we need to shove that part into the kernel. I don't really want to have all of the MPIC code in there however. So a hybrid approach sounds like a great fit.
>> >
>> >  Pointer to the qemu code?
>> 
>> hw/openpic.c
> 
> I see what you mean.
> 
>> 
>> >
>> >>  The problem with in-kernel device emulation the way we have it today is that it's an all-or-nothing choice. Either we push the device into kernel space or we keep it in user space. That adds a lot of code in kernel land where it doesn't belong.
>> >
>> >  Like I mentioned, I see that as a good thing.
>> 
>> I don't. And we don't do it for hypercall handling on book3s hv either for example. There we have a 3 level handling system. Very hot path hypercalls get handled in real mode. Reasonably hot path hypercalls get handled in kernel space. Everything else goes to user land.
> 
> Well, the MPIC thing really supports your point.

I'm sure we'll find more examples :)

> 
>> >
>> >>  >
>> >>  >   No, slots still exist.  Only the API is "replace slot list" instead of "add slot" and "remove slot".
>> >>
>> >>  Why?
>> >
>> >  Physical memory is discontiguous, and includes aliases (two gpas referencing the same backing page).  How else would you describe it.
>> >
>> >>  On PPC we walk the slots on every fault (incl. mmio), so fast lookup times there would be great. I was thinking of something page table like here.
>> >
>> >  We can certainly convert the slots to a tree internally.  I'm doing the same thing for qemu now, maybe we can do it for kvm too.  No need to involve the ABI at all.
>> 
>> Hrm, true.
>> 
>> >  Slot searching is quite fast since there's a small number of slots, and we sort the larger ones to be in the front, so positive lookups are fast.  We cache negative lookups in the shadow page tables (an spte can be either "not mapped", "mapped to RAM", or "not mapped and known to be mmio") so we rarely need to walk the entire list.
>> 
>> Well, we don't always have shadow page tables. Having hints for unmapped guest memory like this is pretty tricky.
>> We're currently running into issues with device assignment though, where we get a lot of small slots mapped to real hardware. I'm sure that will hit us on x86 sooner or later too.
> 
> For x86 that's not a problem, since once you map a page, it stays mapped (on modern hardware).

Ah, because you're on NPT and you can have MMIO hints in the nested page table. Nifty. Yeah, we don't have that luxury :).

> 
>> 
>> >
>> >>  That only works when then internal slot structure is hidden from user space though.
>> >
>> >  Why?
>> 
>> Because if user space thinks it's slots and in reality it's a tree that doesn't match. If you decouple the external view from the internal view, it works again.
> 
> Userspace needs to provide a function hva = f(gpa).  Why does it matter how the function is spelled out?  Slots happen to be a concise representation.  Transform the function all you like in the kernel, as long as you preserve all the mappings.

I think we're talking about the same thing really.

> 
>> 
>> >
>> >>
>> >>  >>   I would actually rather like to see the amount of page sharing between kernel and user space increased, no decreased. I don't care if I can throw strace on KVM. I want speed.
>> >>  >
>> >>  >   Something really critical should be handled in the kernel.  Care to provide examples?
>> >>
>> >>  Just look at the s390 patches Christian posted recently.
>> >
>> >  Which ones?
>> 
>>   http://www.mail-archive.com/kvm@vger.kernel.org/msg66155.html
>> 
> 
> Yeah - s390 is always different.  On the current interface synchronous registers are easy, so why not.  But I wonder if it's really critical.

It's certainly slick :). We do the same for the TLB on e500, just with a separate ioctl to set the sharing up.


Alex


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 13:18             ` Avi Kivity
@ 2012-02-07 15:15               ` Anthony Liguori
  2012-02-07 18:28                 ` Chris Wright
  0 siblings, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-02-07 15:15 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Scott Wood, linux-kernel, Eric Northup, KVM list, qemu-devel,
	Chris Wright

On 02/07/2012 07:18 AM, Avi Kivity wrote:
> On 02/07/2012 02:51 PM, Anthony Liguori wrote:
>> On 02/07/2012 06:40 AM, Avi Kivity wrote:
>>> On 02/07/2012 02:28 PM, Anthony Liguori wrote:
>>>>
>>>>> It's a potential source of exploits
>>>>> (from bugs in KVM or in hardware). I can see people wanting to be
>>>>> selective with access because of that.
>>>>
>>>> As is true of the rest of the kernel.
>>>>
>>>> If you want finer grain access control, that's exactly why we have things like
>>>> LSM and SELinux. You can add the appropriate LSM hooks into the KVM
>>>> infrastructure and setup default SELinux policies appropriately.
>>>
>>> LSMs protect objects, not syscalls. There isn't an object to protect here
>>> (except the fake /dev/kvm object).
>>
>> A VM can be an object.
>>
>
> Not really, it's not accessible in a namespace. How would you label it?

Labels can originate from userspace, IIUC, so I think it's possible for QEMU (or 
whatever the userspace is) to set the label for the VM while it's creating it. 
I think this is how most of the labeling for X and things of that nature works.

Maybe Chris can set me straight.

> Maybe we can reuse the process label/context (not sure what the right term is
> for a process).

Regards,

Anthony Liguori

>


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 12:03         ` Avi Kivity
@ 2012-02-07 15:17           ` Anthony Liguori
  2012-02-07 16:02             ` Avi Kivity
  0 siblings, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-02-07 15:17 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Rob Earhart, linux-kernel, KVM list, qemu-devel

On 02/07/2012 06:03 AM, Avi Kivity wrote:
> On 02/06/2012 09:11 PM, Anthony Liguori wrote:
>>
>> I'm not so sure. ioeventfds and a future mmio-over-socketpair have to put the
>> kthread to sleep while it waits for the other end to process it. This is
>> effectively equivalent to a heavy weight exit. The difference in cost is
>> dropping to userspace which is really neglible these days (< 100 cycles).
>
> On what machine did you measure these wonderful numbers?

A syscall is what I mean by "dropping to userspace", not the cost of a heavy 
weight exit.  I think a heavy weight exit is still around a few thousand cycles.

Any nehalem class or better processor should have a syscall cost of around that 
unless I'm wildly mistaken.

>
> But I agree a heavyweight exit is probably faster than a double context switch
> on a remote core.

I meant, if you already need to take a heavyweight exit (and you do to schedule 
something else on the core), than the only additional cost is taking a syscall 
return to userspace *first* before scheduling another process.  That overhead is 
pretty low.

Regards,

Anthony Liguori

>
>


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 13:40           ` Alexander Graf
  2012-02-07 14:21             ` Avi Kivity
@ 2012-02-07 15:23             ` Anthony Liguori
  2012-02-07 15:28               ` Alexander Graf
                                 ` (2 more replies)
  2012-02-15 22:14             ` Arnd Bergmann
  2 siblings, 3 replies; 89+ messages in thread
From: Anthony Liguori @ 2012-02-07 15:23 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Avi Kivity, qemu-devel, kvm-ppc, KVM list, linux-kernel

On 02/07/2012 07:40 AM, Alexander Graf wrote:
>
> Why? For the HPET timer register for example, we could have a simple MMIO hook that says
>
>    on_read:
>      return read_current_time() - shared_page.offset;
>    on_write:
>      handle_in_user_space();
>
> For IDE, it would be as simple as
>
>    register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE,&s->cmd[0]);
>    for (i = 1; i<  7; i++) {
>      register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
>      register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
>    }

You can't easily serialize updates to that address with the kernel since two 
threads are likely going to be accessing it at the same time.  That either means 
an expensive sync operation or a reliance on atomic instructions.

But not all architectures offer non-word sized atomic instructions so it gets 
fairly nasty in practice.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 15:23             ` Anthony Liguori
@ 2012-02-07 15:28               ` Alexander Graf
  2012-02-08 17:20               ` Alan Cox
  2012-02-15 13:33               ` Avi Kivity
  2 siblings, 0 replies; 89+ messages in thread
From: Alexander Graf @ 2012-02-07 15:28 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, qemu-devel, kvm-ppc, KVM list, linux-kernel


On 07.02.2012, at 16:23, Anthony Liguori wrote:

> On 02/07/2012 07:40 AM, Alexander Graf wrote:
>> 
>> Why? For the HPET timer register for example, we could have a simple MMIO hook that says
>> 
>>   on_read:
>>     return read_current_time() - shared_page.offset;
>>   on_write:
>>     handle_in_user_space();
>> 
>> For IDE, it would be as simple as
>> 
>>   register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE,&s->cmd[0]);
>>   for (i = 1; i<  7; i++) {
>>     register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
>>     register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
>>   }
> 
> You can't easily serialize updates to that address with the kernel since two threads are likely going to be accessing it at the same time.  That either means an expensive sync operation or a reliance on atomic instructions.

Yes. Essentially we want a mutex for them.

> But not all architectures offer non-word sized atomic instructions so it gets fairly nasty in practice.

Well, we can always require fields to be word sized.


Alex


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 15:17           ` Anthony Liguori
@ 2012-02-07 16:02             ` Avi Kivity
  2012-02-07 16:18               ` Jan Kiszka
  2012-02-07 16:19               ` Anthony Liguori
  0 siblings, 2 replies; 89+ messages in thread
From: Avi Kivity @ 2012-02-07 16:02 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Rob Earhart, linux-kernel, KVM list, qemu-devel

On 02/07/2012 05:17 PM, Anthony Liguori wrote:
> On 02/07/2012 06:03 AM, Avi Kivity wrote:
>> On 02/06/2012 09:11 PM, Anthony Liguori wrote:
>>>
>>> I'm not so sure. ioeventfds and a future mmio-over-socketpair have 
>>> to put the
>>> kthread to sleep while it waits for the other end to process it. 
>>> This is
>>> effectively equivalent to a heavy weight exit. The difference in 
>>> cost is
>>> dropping to userspace which is really neglible these days (< 100 
>>> cycles).
>>
>> On what machine did you measure these wonderful numbers?
>
> A syscall is what I mean by "dropping to userspace", not the cost of a 
> heavy weight exit. 

Ah.  But then ioeventfd has that as well, unless the other end is in the 
kernel too.

> I think a heavy weight exit is still around a few thousand cycles.
>
> Any nehalem class or better processor should have a syscall cost of 
> around that unless I'm wildly mistaken.
>

That's what I remember too.

>>
>> But I agree a heavyweight exit is probably faster than a double 
>> context switch
>> on a remote core.
>
> I meant, if you already need to take a heavyweight exit (and you do to 
> schedule something else on the core), than the only additional cost is 
> taking a syscall return to userspace *first* before scheduling another 
> process.  That overhead is pretty low.

Yeah.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 16:02             ` Avi Kivity
@ 2012-02-07 16:18               ` Jan Kiszka
  2012-02-07 16:21                 ` Anthony Liguori
  2012-02-07 16:19               ` Anthony Liguori
  1 sibling, 1 reply; 89+ messages in thread
From: Jan Kiszka @ 2012-02-07 16:18 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Rob Earhart, linux-kernel, KVM list, qemu-devel

On 2012-02-07 17:02, Avi Kivity wrote:
> On 02/07/2012 05:17 PM, Anthony Liguori wrote:
>> On 02/07/2012 06:03 AM, Avi Kivity wrote:
>>> On 02/06/2012 09:11 PM, Anthony Liguori wrote:
>>>>
>>>> I'm not so sure. ioeventfds and a future mmio-over-socketpair have
>>>> to put the
>>>> kthread to sleep while it waits for the other end to process it.
>>>> This is
>>>> effectively equivalent to a heavy weight exit. The difference in
>>>> cost is
>>>> dropping to userspace which is really neglible these days (< 100
>>>> cycles).
>>>
>>> On what machine did you measure these wonderful numbers?
>>
>> A syscall is what I mean by "dropping to userspace", not the cost of a
>> heavy weight exit. 
> 
> Ah.  But then ioeventfd has that as well, unless the other end is in the
> kernel too.
> 
>> I think a heavy weight exit is still around a few thousand cycles.
>>
>> Any nehalem class or better processor should have a syscall cost of
>> around that unless I'm wildly mistaken.
>>
> 
> That's what I remember too.
> 
>>>
>>> But I agree a heavyweight exit is probably faster than a double
>>> context switch
>>> on a remote core.
>>
>> I meant, if you already need to take a heavyweight exit (and you do to
>> schedule something else on the core), than the only additional cost is
>> taking a syscall return to userspace *first* before scheduling another
>> process.  That overhead is pretty low.
> 
> Yeah.
> 

Isn't there another level in between just scheduling and full syscall
return if the user return notifier has some real work to do?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 16:02             ` Avi Kivity
  2012-02-07 16:18               ` Jan Kiszka
@ 2012-02-07 16:19               ` Anthony Liguori
  2012-02-15 13:47                 ` Avi Kivity
  1 sibling, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-02-07 16:19 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Rob Earhart, linux-kernel, KVM list, qemu-devel

On 02/07/2012 10:02 AM, Avi Kivity wrote:
> On 02/07/2012 05:17 PM, Anthony Liguori wrote:
>> On 02/07/2012 06:03 AM, Avi Kivity wrote:
>>> On 02/06/2012 09:11 PM, Anthony Liguori wrote:
>>>>
>>>> I'm not so sure. ioeventfds and a future mmio-over-socketpair have to put the
>>>> kthread to sleep while it waits for the other end to process it. This is
>>>> effectively equivalent to a heavy weight exit. The difference in cost is
>>>> dropping to userspace which is really neglible these days (< 100 cycles).
>>>
>>> On what machine did you measure these wonderful numbers?
>>
>> A syscall is what I mean by "dropping to userspace", not the cost of a heavy
>> weight exit.
>
> Ah. But then ioeventfd has that as well, unless the other end is in the kernel too.

Yes, that was my point exactly :-)

ioeventfd/mmio-over-socketpair to adifferent thread is not faster than a 
synchronous KVM_RUN + writing to an eventfd in userspace modulo a couple of 
cheap syscalls.

The exception is when the other end is in the kernel and there is magic 
optimizations (like there is today with ioeventfd).

Regards,

Anthony Liguori

>
>> I think a heavy weight exit is still around a few thousand cycles.
>>
>> Any nehalem class or better processor should have a syscall cost of around
>> that unless I'm wildly mistaken.
>>
>
> That's what I remember too.
>
>>>
>>> But I agree a heavyweight exit is probably faster than a double context switch
>>> on a remote core.
>>
>> I meant, if you already need to take a heavyweight exit (and you do to
>> schedule something else on the core), than the only additional cost is taking
>> a syscall return to userspace *first* before scheduling another process. That
>> overhead is pretty low.
>
> Yeah.
>


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 16:18               ` Jan Kiszka
@ 2012-02-07 16:21                 ` Anthony Liguori
  2012-02-07 16:29                   ` Jan Kiszka
  0 siblings, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-02-07 16:21 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Avi Kivity, qemu-devel, Rob Earhart, linux-kernel, KVM list

On 02/07/2012 10:18 AM, Jan Kiszka wrote:
> On 2012-02-07 17:02, Avi Kivity wrote:
>> On 02/07/2012 05:17 PM, Anthony Liguori wrote:
>>> On 02/07/2012 06:03 AM, Avi Kivity wrote:
>>>> On 02/06/2012 09:11 PM, Anthony Liguori wrote:
>>>>>
>>>>> I'm not so sure. ioeventfds and a future mmio-over-socketpair have
>>>>> to put the
>>>>> kthread to sleep while it waits for the other end to process it.
>>>>> This is
>>>>> effectively equivalent to a heavy weight exit. The difference in
>>>>> cost is
>>>>> dropping to userspace which is really neglible these days (<  100
>>>>> cycles).
>>>>
>>>> On what machine did you measure these wonderful numbers?
>>>
>>> A syscall is what I mean by "dropping to userspace", not the cost of a
>>> heavy weight exit.
>>
>> Ah.  But then ioeventfd has that as well, unless the other end is in the
>> kernel too.
>>
>>> I think a heavy weight exit is still around a few thousand cycles.
>>>
>>> Any nehalem class or better processor should have a syscall cost of
>>> around that unless I'm wildly mistaken.
>>>
>>
>> That's what I remember too.
>>
>>>>
>>>> But I agree a heavyweight exit is probably faster than a double
>>>> context switch
>>>> on a remote core.
>>>
>>> I meant, if you already need to take a heavyweight exit (and you do to
>>> schedule something else on the core), than the only additional cost is
>>> taking a syscall return to userspace *first* before scheduling another
>>> process.  That overhead is pretty low.
>>
>> Yeah.
>>
>
> Isn't there another level in between just scheduling and full syscall
> return if the user return notifier has some real work to do?

Depends on whether you're scheduling a kthread or a userspace process, no?  If 
you're eventually going to end up in userspace, you have to do the full heavy 
weight exit.

If you're scheduling to a kthread, it's better to do the type of trickery that 
ioeventfd does and just turn it into a function call.

Regards,

Anthony Liguori

>
> Jan
>


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 16:21                 ` Anthony Liguori
@ 2012-02-07 16:29                   ` Jan Kiszka
  2012-02-15 13:41                     ` Avi Kivity
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Kiszka @ 2012-02-07 16:29 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Avi Kivity, qemu-devel, Rob Earhart, linux-kernel, KVM list

On 2012-02-07 17:21, Anthony Liguori wrote:
> On 02/07/2012 10:18 AM, Jan Kiszka wrote:
>> On 2012-02-07 17:02, Avi Kivity wrote:
>>> On 02/07/2012 05:17 PM, Anthony Liguori wrote:
>>>> On 02/07/2012 06:03 AM, Avi Kivity wrote:
>>>>> On 02/06/2012 09:11 PM, Anthony Liguori wrote:
>>>>>>
>>>>>> I'm not so sure. ioeventfds and a future mmio-over-socketpair have
>>>>>> to put the
>>>>>> kthread to sleep while it waits for the other end to process it.
>>>>>> This is
>>>>>> effectively equivalent to a heavy weight exit. The difference in
>>>>>> cost is
>>>>>> dropping to userspace which is really neglible these days (<  100
>>>>>> cycles).
>>>>>
>>>>> On what machine did you measure these wonderful numbers?
>>>>
>>>> A syscall is what I mean by "dropping to userspace", not the cost of a
>>>> heavy weight exit.
>>>
>>> Ah.  But then ioeventfd has that as well, unless the other end is in the
>>> kernel too.
>>>
>>>> I think a heavy weight exit is still around a few thousand cycles.
>>>>
>>>> Any nehalem class or better processor should have a syscall cost of
>>>> around that unless I'm wildly mistaken.
>>>>
>>>
>>> That's what I remember too.
>>>
>>>>>
>>>>> But I agree a heavyweight exit is probably faster than a double
>>>>> context switch
>>>>> on a remote core.
>>>>
>>>> I meant, if you already need to take a heavyweight exit (and you do to
>>>> schedule something else on the core), than the only additional cost is
>>>> taking a syscall return to userspace *first* before scheduling another
>>>> process.  That overhead is pretty low.
>>>
>>> Yeah.
>>>
>>
>> Isn't there another level in between just scheduling and full syscall
>> return if the user return notifier has some real work to do?
> 
> Depends on whether you're scheduling a kthread or a userspace process, no?  If 

Kthreads can't return, of course. User space threads /may/ do so. And
then there needs to be a differences between host and guest in the
tracked MSRs. I think to recall it's a question of another few hundred
cycles.

Jan

> you're eventually going to end up in userspace, you have to do the full heavy 
> weight exit.
> 
> If you're scheduling to a kthread, it's better to do the type of trickery that 
> ioeventfd does and just turn it into a function call.
> 
> Regards,
> 
> Anthony Liguori
> 
>>
>> Jan
>>
> 

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-06  9:34         ` Avi Kivity
  2012-02-06 13:33           ` Anthony Liguori
@ 2012-02-07 18:12           ` Rusty Russell
  2012-02-15 13:39             ` Avi Kivity
  1 sibling, 1 reply; 89+ messages in thread
From: Rusty Russell @ 2012-02-07 18:12 UTC (permalink / raw)
  To: Avi Kivity, Anthony Liguori
  Cc: Gleb Natapov, linux-kernel, KVM list, qemu-devel

On Mon, 06 Feb 2012 11:34:01 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 02/05/2012 06:36 PM, Anthony Liguori wrote:
> > If userspace had a way to upload bytecode to the kernel that was
> > executed for a PIO operation, it could either pass the operation to
> > userspace or handle it within the kernel when possible without taking
> > a heavy weight exit.
> >
> > If the bytecode can access variables in a shared memory area, it could
> > be pretty efficient to work with.
> >
> > This means that the kernel never has to deal with specific in-kernel
> > devices but that userspace can accelerator as many of its devices as
> > it sees fit.
> 
> I would really love to have this, but the problem is that we'd need a
> general purpose bytecode VM with binding to some kernel APIs.  The
> bytecode VM, if made general enough to host more complicated devices,
> would likely be much larger than the actual code we have in the kernel now.

We have the ability to upload bytecode into the kernel already.  It's in
a great bytecode interpreted by the CPU itself.

If every user were emulating different machines, LPF this would make
sense.  Are they?  Or should we write those helpers once, in C, and
provide that for them.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 15:15               ` Anthony Liguori
@ 2012-02-07 18:28                 ` Chris Wright
  0 siblings, 0 replies; 89+ messages in thread
From: Chris Wright @ 2012-02-07 18:28 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Avi Kivity, Scott Wood, linux-kernel, Eric Northup, KVM list,
	qemu-devel, Chris Wright

* Anthony Liguori (anthony@codemonkey.ws) wrote:
> On 02/07/2012 07:18 AM, Avi Kivity wrote:
> >On 02/07/2012 02:51 PM, Anthony Liguori wrote:
> >>On 02/07/2012 06:40 AM, Avi Kivity wrote:
> >>>On 02/07/2012 02:28 PM, Anthony Liguori wrote:
> >>>>
> >>>>>It's a potential source of exploits
> >>>>>(from bugs in KVM or in hardware). I can see people wanting to be
> >>>>>selective with access because of that.
> >>>>
> >>>>As is true of the rest of the kernel.
> >>>>
> >>>>If you want finer grain access control, that's exactly why we have things like
> >>>>LSM and SELinux. You can add the appropriate LSM hooks into the KVM
> >>>>infrastructure and setup default SELinux policies appropriately.
> >>>
> >>>LSMs protect objects, not syscalls. There isn't an object to protect here
> >>>(except the fake /dev/kvm object).
> >>
> >>A VM can be an object.
> >
> >Not really, it's not accessible in a namespace. How would you label it?

A VM, vcpu, etc are all objects.  The labelling can be implicit based on
the security context of the process creating the object.  You could create
simplistic rules such as a process may have the ability KVM__VM_CREATE
(this is roughly analogous to the PROC__EXECMEM policy control that
allows some processes to create executable writable memory mappings, or
SHM__CREATE for a process that can create a shared memory segment).
Adding some label mgmt to the object (add ->security and some callbacks to
do ->alloc/init/free), and then checks on the object itself would allow
for finer grained protection.  If there was any VM lookup (although the
original example explicitly ties a process to a vm and a thread to a
vcpu) the finer grained check would certainly be useful to verify that
the process can access the VM.

> Labels can originate from userspace, IIUC, so I think it's possible for QEMU
> (or whatever the userspace is) to set the label for the VM while it's
> creating it. I think this is how most of the labeling for X and things of
> that nature works.

For X, the policy enforcement is done in the X server.  There is
assistance from the kernel for doing policy server queries (can foo do
bar?), but it's up to the X server to actually care enough to ask and
then fail a request that doesn't comply.  I'm not sure that's the model
here.

thanks,
-chris

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 12:28       ` Anthony Liguori
  2012-02-07 12:40         ` Avi Kivity
@ 2012-02-08 17:02         ` Scott Wood
  2012-02-08 17:12           ` Alan Cox
  1 sibling, 1 reply; 89+ messages in thread
From: Scott Wood @ 2012-02-08 17:02 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: qemu-devel, linux-kernel, Eric Northup, KVM list, Avi Kivity

On 02/07/2012 06:28 AM, Anthony Liguori wrote:
> On 02/06/2012 01:46 PM, Scott Wood wrote:
>> On 02/03/2012 04:52 PM, Anthony Liguori wrote:
>>> On 02/03/2012 12:07 PM, Eric Northup wrote:
>>>> How would the ability to use sys_kvm_* be regulated?
>>>
>>> Why should it be regulated?
>>>
>>> It's not a finite or privileged resource.
>>
>> You're exposing a large, complex kernel subsystem that does very
>> low-level things with the hardware.
> 
> As does the rest of the kernel.

Just because other parts of the kernel made this mistake (e.g.
networking) doesn't mean that KVM should as well.

> If you want finer grain access control, that's exactly why we have
> things like LSM and SELinux.  You can add the appropriate LSM hooks into
> the KVM infrastructure and setup default SELinux policies appropriately.

Needing to use such bandaids is more complicated (or at least less
familiar to many) than setting permissions on a filesystem object.

>> And sometimes it is a finite resource.  I don't know how x86 does it,
>> but on at least some powerpc hardware we have a finite, relatively small
>> number of hardware partition IDs.
> 
> But presumably this is per-core, right?

Not currently.

I can't speak for the IBM stuff, but our hardware is desgined with the
idea that a partition has a permanent system-wide LPID (partition ID).
We *may* be able to do dynamic LPID on e500mc, but it is likely to be a
problem in the future with things like LPID-based direct-to-guest
interrupt delivery.  There's also a question of prioritizing effort --
there's enough other stuff that needs work first.

> And they're recycled, right? 

Not currently (other than when a guest is destroyed, of course).

What are the advantages of getting rid of the file descriptor that
warrant this?  What is performance sensitive enough than an fd lookup is
unacceptable but the other overhead of going out to qemu is fine?

Is that fd lookup any heavier than "appropriate LSM hooks"?

If the fd overhead really is a problem, perhaps the fd could be retained
for setup operations, and omitted only on calls that require a vcpu to
have been already set up on the current thread?

-Scott


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-08 17:02         ` Scott Wood
@ 2012-02-08 17:12           ` Alan Cox
  0 siblings, 0 replies; 89+ messages in thread
From: Alan Cox @ 2012-02-08 17:12 UTC (permalink / raw)
  To: Scott Wood
  Cc: Anthony Liguori, qemu-devel, linux-kernel, Eric Northup,
	KVM list, Avi Kivity

> If the fd overhead really is a problem, perhaps the fd could be retained
> for setup operations, and omitted only on calls that require a vcpu to
> have been already set up on the current thread?

Quite frankly I'd like to have an fd because it means you've got a
meaningful way of ensuring that id reuse problems go away. You open a
given id and keep a handle to it, if the id gets reused then your handle
will be tied to the old one so you can fail the requests.

Without an fd it's near impossible to get this right. The Unix/Linux
model is open an object, use it, close it. I see no reason not to do that.

Also the LSM hooks apply to file objects mostly, so its a natural fit on
top *IF* you choose to use them.

Finally you can pass file handles around between processes - do that any
other way 8)

Alan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 15:23             ` Anthony Liguori
  2012-02-07 15:28               ` Alexander Graf
@ 2012-02-08 17:20               ` Alan Cox
  2012-02-15 13:33               ` Avi Kivity
  2 siblings, 0 replies; 89+ messages in thread
From: Alan Cox @ 2012-02-08 17:20 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Alexander Graf, Avi Kivity, qemu-devel, kvm-ppc, KVM list, linux-kernel

> >    register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE,&s->cmd[0]);
> >    for (i = 1; i<  7; i++) {
> >      register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
> >      register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
> >    }
> 
> You can't easily serialize updates to that address with the kernel since two 
> threads are likely going to be accessing it at the same time.  That either means 
> an expensive sync operation or a reliance on atomic instructions.

Who cares

If your API is right this isn't a problem (and for IDE the guess that it
won't happen you will win 99.999% of the time).

In fact IDE you can do even better in many cases because you'll get a
single rep outsw you can trap and shortcut.

> But not all architectures offer non-word sized atomic instructions so it gets 
> fairly nasty in practice.

Thats their problem. We don't screwup the fast paths because some
hardware vendor screwed up that bit of their implementation. That's
*their* problem not everyone elses.

So on x86 IDE should be about 10 outb traps that can be predicted, a rep
outsw which can be shortcut and a completion set of inb/inw ops that can
be predicted.

You should hit userspace about once per IDE operation. Fix the hot paths
with good design and the noise doesn't matter.

Alan

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-03  2:09 ` Anthony Liguori
                     ` (2 preceding siblings ...)
  2012-02-07  1:08   ` Alexander Graf
@ 2012-02-10  3:07   ` Jamie Lokier
  3 siblings, 0 replies; 89+ messages in thread
From: Jamie Lokier @ 2012-02-10  3:07 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Avi Kivity, KVM list, linux-kernel, qemu-devel

Anthony Liguori wrote:
> >The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> >them to userspace.
> 
> I'm a big fan of this.

I agree with getting rid of unnecessary emulations.
(Why were those things emulated in the first place?)

But it would be good to retain some way to "plugin" device emulations
in the kernel, separate from KVM core with a well-defined API boundary.

Then it wouldn't matter to the KVM core whether there's PIT emulation
or whatever; that would just be a separate module.  Perhaps even with
its own /dev device and maybe not tightly bound to KVM,

> >Note: this may cause a regression for older guests that don't
> >support MSI or kvmclock.  Device assignment will be done using
> >VFIO, that is, without direct kvm involvement.

I don't like the sound of regressions.

I tend to think of a VM as something that needs to have consistent
behaviour over a long time, for keeping working systems running for
years despite changing hardware, or reviving old systems to test
software and make patches for things in long-term maintenance etc.

But I haven't noticed problems from upgrading kernelspace-KVM yet,
only upgrading the userspace parts.  If a kernel upgrade is risky,
that makes upgrading host kernels difficult and "all or nothing" for
all the guests within.

However it looks like you mean only the performance characteristics
will change because of moving things back to userspace?

> >Local APICs will be mandatory, but it will be possible to hide them from
> >the guest.  This means that it will no longer be possible to emulate an
> >APIC in userspace, but it will be possible to virtualize an APIC-less
> >core - userspace will play with the LINT0/LINT1 inputs (configured as
> >EXITINT and NMI) to queue interrupts and NMIs.
> 
> I think this makes sense.  An interesting consequence of this is
> that it's no longer necessary to associate the VCPU context with an
> MMIO/PIO operation.  I'm not sure if there's an obvious benefit to
> that but it's interesting nonetheless.

Would that be useful for using VCPUs to run sandboxed userspace code
with ability to trap and control the whole environment (as opposed to
guest OSes, or ptrace which is rather incomplete and unsuitable for
sandboxing code meant for other OSes)?

Thanks,
-- Jamie

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 14:21             ` Avi Kivity
  2012-02-07 14:39               ` Alexander Graf
@ 2012-02-12  7:10               ` Takuya Yoshikawa
  2012-02-15 13:32                 ` Avi Kivity
  1 sibling, 1 reply; 89+ messages in thread
From: Takuya Yoshikawa @ 2012-02-12  7:10 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alexander Graf, Anthony Liguori, KVM list, linux-kernel,
	qemu-devel, kvm-ppc

Avi Kivity <avi@redhat.com> wrote:

> > >  Slot searching is quite fast since there's a small number of slots, and we sort the larger ones to be in the front, so positive lookups are fast.  We cache negative lookups in the shadow page tables (an spte can be either "not mapped", "mapped to RAM", or "not mapped and known to be mmio") so we rarely need to walk the entire list.
> >
> > Well, we don't always have shadow page tables. Having hints for unmapped guest memory like this is pretty tricky.
> > We're currently running into issues with device assignment though, where we get a lot of small slots mapped to real hardware. I'm sure that will hit us on x86 sooner or later too.
> 
> For x86 that's not a problem, since once you map a page, it stays mapped 
> (on modern hardware).
> 

I was once thinking about how to search a slot reasonably fast for every case,
even when we do not have mmio-spte cache.

One possible way I thought up was to sort slots according to their base_gfn.
Then the problem would become:  "find the first slot whose base_gfn + npages
is greater than this gfn."

Since we can do binary search, the search cost is O(log(# of slots)).

But I guess that most of the time was wasted on reading many memslots just to
know their base_gfn and npages.

So the most practically effective thing is to make a separate array which holds
just their base_gfn.  This will make the task a simple, and cache friendly,
search on an integer array:  probably faster than using *-tree data structure.

If needed, we should make cmp_memslot() architecture specific in the end?

	Takuya

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 14:39               ` Alexander Graf
@ 2012-02-15 11:18                 ` Avi Kivity
  2012-02-15 11:57                   ` Alexander Graf
  0 siblings, 1 reply; 89+ messages in thread
From: Avi Kivity @ 2012-02-15 11:18 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc

On 02/07/2012 04:39 PM, Alexander Graf wrote:
> > 
> > Syscalls are orthogonal to that - they're to avoid the fget_light() and to tighten the vcpu/thread and vm/process relationship.
>
> How about keeping the ioctl interface but moving vcpu_run to a syscall then?

I dislike half-and-half interfaces even more.  And it's not like the
fget_light() is really painful - it's just that I see it occasionally in
perf top so it annoys me.

>  That should really be the only thing that belongs into the fast path, right? Every time we do a register sync in user space, we do something wrong. Instead, user space should either
>
>   a) have wrappers around register accesses, so it can directly ask for specific registers that it needs
> or
>   b) keep everything that would be requested by the register synchronization in shared memory

Always-synced shared memory is a liability, since newer hardware might
introduce on-chip caches for that state, making synchronization
expensive.  Or we may choose to keep some of the registers loaded, if we
have a way to trap on their use from userspace - for example we can
return to userspace with the guest fpu loaded, and trap if userspace
tries to use it.

Is an extra syscall for copying TLB entries to user space prohibitively
expensive?

> > 
> >> , keep the rest in user space.
> >> >
> >> >
> >> >  When a device is fully in the kernel, we have a good specification of the ABI: it just implements the spec, and the ABI provides the interface from the device to the rest of the world.  Partially accelerated devices means a much greater effort in specifying exactly what it does.  It's also vulnerable to changes in how the guest uses the device.
> >> 
> >> Why? For the HPET timer register for example, we could have a simple MMIO hook that says
> >> 
> >>   on_read:
> >>     return read_current_time() - shared_page.offset;
> >>   on_write:
> >>     handle_in_user_space();
> > 
> > It works for the really simple cases, yes, but if the guest wants to set up one-shot timers, it fails.  
>
> I don't understand. Why would anything fail here? 

It fails to provide a benefit, I didn't mean it causes guest failures.

You also have to make sure the kernel part and the user part use exactly
the same time bases.

> Once the logic that's implemented by the kernel accelerator doesn't fit anymore, unregister it.

Yeah.

>
> > Also look at the PIT which latches on read.
> > 
> >> 
> >> For IDE, it would be as simple as
> >> 
> >>   register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE,&s->cmd[0]);
> >>   for (i = 1; i<  7; i++) {
> >>     register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
> >>     register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
> >>   }
> >> 
> >> and we should have reduced overhead of IDE by quite a bit already. All the other 2k LOC in hw/ide/core.c don't matter for us really.
> > 
> > 
> > Just use virtio.
>
> Just use xenbus. Seriously, this is not an answer.

Why not?  We invested effort in making it as fast as possible, and in
writing the drivers.  IDE will never, ever, get anything close to virtio
performance, even if we put all of it in the kernel.

However, after these examples, I'm more open to partial acceleration
now.  I won't ever like it though.

> >> >
> >> >>    - VGA
> >> >>    - IDE
> >> >
> >> >  Why?  There are perfectly good replacements for these (qxl, virtio-blk, virtio-scsi).
> >> 
> >> Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp. 

3rd party drivers are a way of life for Windows users; and the
incremental benefits of IDE acceleration are still far behind virtio.

> I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. 

Cirrus or vesa should be okay for them, I don't see what we could do for
them in the kernel, or why.

> Same for virtio.
> >> 
> >> Please don't do the Xen mistake again of claiming that all we care about is Linux as a guest.
> > 
> > Rest easy, there's no chance of that.  But if a guest is important enough, virtio drivers will get written.  IDE has no chance in hell of approaching virtio-blk performance, no matter how much effort we put into it.
>
> Ever used VMware? They basically get virtio-blk performance out of ordinary IDE for linear workloads.

For linear loads, so should we, perhaps with greater cpu utliization.

If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple)
means 0.5 msec/transaction.  Spending 30 usec on some heavyweight exits
shouldn't matter.

> > 
> >> KVM's strength has always been its close resemblance to hardware.
> > 
> > This will remain.  But we can't optimize everything.
>
> That's my point. Let's optimize the hot paths and be good. As long as we default to IDE for disk, we should have that be fast, no?

We should make sure that we don't default to IDE.  Qemu has no knowledge
of the guest, so it can't default to virtio, but higher level tools can
and should.

> >> 
> >> Well, we don't always have shadow page tables. Having hints for unmapped guest memory like this is pretty tricky.
> >> We're currently running into issues with device assignment though, where we get a lot of small slots mapped to real hardware. I'm sure that will hit us on x86 sooner or later too.
> > 
> > For x86 that's not a problem, since once you map a page, it stays mapped (on modern hardware).
>
> Ah, because you're on NPT and you can have MMIO hints in the nested page table. Nifty. Yeah, we don't have that luxury :).

Well the real reason is we have an extra bit reported by page faults
that we can control.  Can't you set up a hashed pte that is configured
in a way that it will fault, no matter what type of access the guest
does, and see it in your page fault handler?

I'm guessing guest kernel ptes don't get evicted often.

> > 
> >> 
> >> >
> >> >>  That only works when then internal slot structure is hidden from user space though.
> >> >
> >> >  Why?
> >> 
> >> Because if user space thinks it's slots and in reality it's a tree that doesn't match. If you decouple the external view from the internal view, it works again.
> > 
> > Userspace needs to provide a function hva = f(gpa).  Why does it matter how the function is spelled out?  Slots happen to be a concise representation.  Transform the function all you like in the kernel, as long as you preserve all the mappings.
>
> I think we're talking about the same thing really.

So what's your objection to slots?

> >>   http://www.mail-archive.com/kvm@vger.kernel.org/msg66155.html
> >> 
> > 
> > Yeah - s390 is always different.  On the current interface synchronous registers are easy, so why not.  But I wonder if it's really critical.
>
> It's certainly slick :). We do the same for the TLB on e500, just with a separate ioctl to set the sharing up.

It's also dangerous wrt future hardware, as noted above.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-15 11:18                 ` Avi Kivity
@ 2012-02-15 11:57                   ` Alexander Graf
  2012-02-15 13:29                     ` Avi Kivity
  2012-02-15 19:17                     ` Scott Wood
  0 siblings, 2 replies; 89+ messages in thread
From: Alexander Graf @ 2012-02-15 11:57 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc


On 15.02.2012, at 12:18, Avi Kivity wrote:

> On 02/07/2012 04:39 PM, Alexander Graf wrote:
>>> 
>>> Syscalls are orthogonal to that - they're to avoid the fget_light() and to tighten the vcpu/thread and vm/process relationship.
>> 
>> How about keeping the ioctl interface but moving vcpu_run to a syscall then?
> 
> I dislike half-and-half interfaces even more.  And it's not like the
> fget_light() is really painful - it's just that I see it occasionally in
> perf top so it annoys me.
> 
>> That should really be the only thing that belongs into the fast path, right? Every time we do a register sync in user space, we do something wrong. Instead, user space should either
>> 
>>  a) have wrappers around register accesses, so it can directly ask for specific registers that it needs
>> or
>>  b) keep everything that would be requested by the register synchronization in shared memory
> 
> Always-synced shared memory is a liability, since newer hardware might
> introduce on-chip caches for that state, making synchronization
> expensive.  Or we may choose to keep some of the registers loaded, if we
> have a way to trap on their use from userspace - for example we can
> return to userspace with the guest fpu loaded, and trap if userspace
> tries to use it.
> 
> Is an extra syscall for copying TLB entries to user space prohibitively
> expensive?

The copying can be very expensive, yes. We want to have the possibility of exposing a very large TLB to the guest, in the order of multiple kentries. Every entry is a struct of 24 bytes.

> 
>>> 
>>>> , keep the rest in user space.
>>>>> 
>>>>> 
>>>>> When a device is fully in the kernel, we have a good specification of the ABI: it just implements the spec, and the ABI provides the interface from the device to the rest of the world.  Partially accelerated devices means a much greater effort in specifying exactly what it does.  It's also vulnerable to changes in how the guest uses the device.
>>>> 
>>>> Why? For the HPET timer register for example, we could have a simple MMIO hook that says
>>>> 
>>>>  on_read:
>>>>    return read_current_time() - shared_page.offset;
>>>>  on_write:
>>>>    handle_in_user_space();
>>> 
>>> It works for the really simple cases, yes, but if the guest wants to set up one-shot timers, it fails.  
>> 
>> I don't understand. Why would anything fail here? 
> 
> It fails to provide a benefit, I didn't mean it causes guest failures.
> 
> You also have to make sure the kernel part and the user part use exactly
> the same time bases.

Right. It's an optional performance accelerator. If anything doesn't align, don't use it. But if you happen to have a system where everything's cool, you're faster. Sounds like a good deal to me ;).

> 
>> Once the logic that's implemented by the kernel accelerator doesn't fit anymore, unregister it.
> 
> Yeah.
> 
>> 
>>> Also look at the PIT which latches on read.
>>> 
>>>> 
>>>> For IDE, it would be as simple as
>>>> 
>>>>  register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE,&s->cmd[0]);
>>>>  for (i = 1; i<  7; i++) {
>>>>    register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
>>>>    register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
>>>>  }
>>>> 
>>>> and we should have reduced overhead of IDE by quite a bit already. All the other 2k LOC in hw/ide/core.c don't matter for us really.
>>> 
>>> 
>>> Just use virtio.
>> 
>> Just use xenbus. Seriously, this is not an answer.
> 
> Why not?  We invested effort in making it as fast as possible, and in
> writing the drivers.  IDE will never, ever, get anything close to virtio
> performance, even if we put all of it in the kernel.
> 
> However, after these examples, I'm more open to partial acceleration
> now.  I won't ever like it though.
> 
>>>>> 
>>>>>>   - VGA
>>>>>>   - IDE
>>>>> 
>>>>> Why?  There are perfectly good replacements for these (qxl, virtio-blk, virtio-scsi).
>>>> 
>>>> Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp. 
> 
> 3rd party drivers are a way of life for Windows users; and the
> incremental benefits of IDE acceleration are still far behind virtio.

The typical way of life for Windows users are all-included drivers. Which is the case for AHCI, where we're getting awesome performance for Vista and above guests. The iDE thing was just an idea for legacy ones.

It'd be great to simply try and see how fast we could get by handling a few special registers in kernel space vs heavyweight exiting to QEMU. If it's only 10%, I wouldn't even bother with creating an interface for it. I'd bet the benefits are a lot bigger though.

And the main point was that specific partial device emulation buys us more than pseudo-generic accelerators like coalesced mmio, which are also only used by 1 or 2 devices.

> 
>> I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. 
> 
> Cirrus or vesa should be okay for them, I don't see what we could do for
> them in the kernel, or why.

That's my point. You need fast emulation of standard devices to get a good baseline. Do PV on top, but keep the baseline as fast as is reasonable.

> 
>> Same for virtio.
>>>> 
>>>> Please don't do the Xen mistake again of claiming that all we care about is Linux as a guest.
>>> 
>>> Rest easy, there's no chance of that.  But if a guest is important enough, virtio drivers will get written.  IDE has no chance in hell of approaching virtio-blk performance, no matter how much effort we put into it.
>> 
>> Ever used VMware? They basically get virtio-blk performance out of ordinary IDE for linear workloads.
> 
> For linear loads, so should we, perhaps with greater cpu utliization.
> 
> If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple)
> means 0.5 msec/transaction.  Spending 30 usec on some heavyweight exits
> shouldn't matter.

*shrug* last time I checked we were a lot slower. But maybe there's more stuff making things slow than the exit path ;).

> 
>>> 
>>>> KVM's strength has always been its close resemblance to hardware.
>>> 
>>> This will remain.  But we can't optimize everything.
>> 
>> That's my point. Let's optimize the hot paths and be good. As long as we default to IDE for disk, we should have that be fast, no?
> 
> We should make sure that we don't default to IDE.  Qemu has no knowledge
> of the guest, so it can't default to virtio, but higher level tools can
> and should.

You can only default to virtio on recent Linux. Windows, BSD, etc don't include drivers, so you can't assume it working. You can default to AHCI for basically any recent guest, but that still won't work for XP and the likes :(.

> 
>>>> 
>>>> Well, we don't always have shadow page tables. Having hints for unmapped guest memory like this is pretty tricky.
>>>> We're currently running into issues with device assignment though, where we get a lot of small slots mapped to real hardware. I'm sure that will hit us on x86 sooner or later too.
>>> 
>>> For x86 that's not a problem, since once you map a page, it stays mapped (on modern hardware).
>> 
>> Ah, because you're on NPT and you can have MMIO hints in the nested page table. Nifty. Yeah, we don't have that luxury :).
> 
> Well the real reason is we have an extra bit reported by page faults
> that we can control.  Can't you set up a hashed pte that is configured
> in a way that it will fault, no matter what type of access the guest
> does, and see it in your page fault handler?

I might be able to synthesize a PTE that is !readable and might throw a permission exception instead of a miss exception. I might be able to synthesize something similar for booke. I don't however get any indication on why things failed.

So for MMIO reads, I can assume that this is an MMIO because I would never write a non-readable entry. For writes, I'm overloading the bit that also means "guest entry is not readable" so there I'd have to walk the guest PTEs/TLBs and check if I find a read-only entry. Right now I can just forward write faults to the guest. Since COW is probably a hotter path for the guest than MMIO, this might end up being ineffective.

But it's certainly an interesting idea.

> I'm guessing guest kernel ptes don't get evicted often.

Yeah, depends on the model you're running on ;). It's not the most common thing though, I agree.

> 
>>> 
>>>> 
>>>>> 
>>>>>> That only works when then internal slot structure is hidden from user space though.
>>>>> 
>>>>> Why?
>>>> 
>>>> Because if user space thinks it's slots and in reality it's a tree that doesn't match. If you decouple the external view from the internal view, it works again.
>>> 
>>> Userspace needs to provide a function hva = f(gpa).  Why does it matter how the function is spelled out?  Slots happen to be a concise representation.  Transform the function all you like in the kernel, as long as you preserve all the mappings.
>> 
>> I think we're talking about the same thing really.
> 
> So what's your objection to slots?

I was merely saying that having slots internally keeps us from speeding things up. I don't mind the external interface though.

> 
>>>>  http://www.mail-archive.com/kvm@vger.kernel.org/msg66155.html
>>>> 
>>> 
>>> Yeah - s390 is always different.  On the current interface synchronous registers are easy, so why not.  But I wonder if it's really critical.
>> 
>> It's certainly slick :). We do the same for the TLB on e500, just with a separate ioctl to set the sharing up.
> 
> It's also dangerous wrt future hardware, as noted above.

Yes and no. I see the capability system as two things in one:

  1) indicate features we learn later
  2) indicate missing features in our current model

So if a new model comes out that can't do something, just scratch off the CAP and be good ;). If somehow you ended up with multiple bits in a single CAP, remove the CAP, create a new one with the subset, set that for the new hardware.

We will have the same situation when we get nested TLBs for booke. We just unlearn a CAP then. User space needs to cope with its unavailability anyways.


Alex


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-15 11:57                   ` Alexander Graf
@ 2012-02-15 13:29                     ` Avi Kivity
  2012-02-15 13:37                       ` Alexander Graf
  2012-02-15 19:17                     ` Scott Wood
  1 sibling, 1 reply; 89+ messages in thread
From: Avi Kivity @ 2012-02-15 13:29 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc

On 02/15/2012 01:57 PM, Alexander Graf wrote:
> > 
> > Is an extra syscall for copying TLB entries to user space prohibitively
> > expensive?
>
> The copying can be very expensive, yes. We want to have the possibility of exposing a very large TLB to the guest, in the order of multiple kentries. Every entry is a struct of 24 bytes.

You don't need to copy the entire TLB, just the way that maps the
address you're interested in.

btw, why are you interested in virtual addresses in userspace at all?

> >>> 
> >>> It works for the really simple cases, yes, but if the guest wants to set up one-shot timers, it fails.  
> >> 
> >> I don't understand. Why would anything fail here? 
> > 
> > It fails to provide a benefit, I didn't mean it causes guest failures.
> > 
> > You also have to make sure the kernel part and the user part use exactly
> > the same time bases.
>
> Right. It's an optional performance accelerator. If anything doesn't align, don't use it. But if you happen to have a system where everything's cool, you're faster. Sounds like a good deal to me ;).

Depends on how much the alignment relies on guest knowledge.  I guess
with a simple device like HPET, it's simple, but with a complex device,
different guests (or different versions of the same guest) could drive
it very differently.

> >>>> 
> >>>> Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp. 
> > 
> > 3rd party drivers are a way of life for Windows users; and the
> > incremental benefits of IDE acceleration are still far behind virtio.
>
> The typical way of life for Windows users are all-included drivers. Which is the case for AHCI, where we're getting awesome performance for Vista and above guests. The iDE thing was just an idea for legacy ones.
>
> It'd be great to simply try and see how fast we could get by handling a few special registers in kernel space vs heavyweight exiting to QEMU. If it's only 10%, I wouldn't even bother with creating an interface for it. I'd bet the benefits are a lot bigger though.
>
> And the main point was that specific partial device emulation buys us more than pseudo-generic accelerators like coalesced mmio, which are also only used by 1 or 2 devices.

Ok.

> > 
> >> I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. 
> > 
> > Cirrus or vesa should be okay for them, I don't see what we could do for
> > them in the kernel, or why.
>
> That's my point. You need fast emulation of standard devices to get a good baseline. Do PV on top, but keep the baseline as fast as is reasonable.
>
> > 
> >> Same for virtio.
> >>>> 
> >>>> Please don't do the Xen mistake again of claiming that all we care about is Linux as a guest.
> >>> 
> >>> Rest easy, there's no chance of that.  But if a guest is important enough, virtio drivers will get written.  IDE has no chance in hell of approaching virtio-blk performance, no matter how much effort we put into it.
> >> 
> >> Ever used VMware? They basically get virtio-blk performance out of ordinary IDE for linear workloads.
> > 
> > For linear loads, so should we, perhaps with greater cpu utliization.
> > 
> > If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple)
> > means 0.5 msec/transaction.  Spending 30 usec on some heavyweight exits
> > shouldn't matter.
>
> *shrug* last time I checked we were a lot slower. But maybe there's more stuff making things slow than the exit path ;).

One thing that's different is that virtio offloads itself to a thread
very quickly, while IDE does a lot of work in vcpu thread context.

> > 
> >>> 
> >>>> KVM's strength has always been its close resemblance to hardware.
> >>> 
> >>> This will remain.  But we can't optimize everything.
> >> 
> >> That's my point. Let's optimize the hot paths and be good. As long as we default to IDE for disk, we should have that be fast, no?
> > 
> > We should make sure that we don't default to IDE.  Qemu has no knowledge
> > of the guest, so it can't default to virtio, but higher level tools can
> > and should.
>
> You can only default to virtio on recent Linux. Windows, BSD, etc don't include drivers, so you can't assume it working. You can default to AHCI for basically any recent guest, but that still won't work for XP and the likes :(.

The all-knowing management tool can provide a virtio driver disk, or
even slip-stream the driver into the installation CD.


>  
> >> Ah, because you're on NPT and you can have MMIO hints in the nested page table. Nifty. Yeah, we don't have that luxury :).
> > 
> > Well the real reason is we have an extra bit reported by page faults
> > that we can control.  Can't you set up a hashed pte that is configured
> > in a way that it will fault, no matter what type of access the guest
> > does, and see it in your page fault handler?
>
> I might be able to synthesize a PTE that is !readable and might throw a permission exception instead of a miss exception. I might be able to synthesize something similar for booke. I don't however get any indication on why things failed.
>
> So for MMIO reads, I can assume that this is an MMIO because I would never write a non-readable entry. For writes, I'm overloading the bit that also means "guest entry is not readable" so there I'd have to walk the guest PTEs/TLBs and check if I find a read-only entry. Right now I can just forward write faults to the guest. Since COW is probably a hotter path for the guest than MMIO, this might end up being ineffective.

COWs usually happen from guest userspace, while mmio is usually from the
guest kernel, so you can switch on that, maybe.

> >> 
> >> I think we're talking about the same thing really.
> > 
> > So what's your objection to slots?
>
> I was merely saying that having slots internally keeps us from speeding things up. I don't mind the external interface though.

Ah, but it doesn't.  We can sort them, convert them to a radix tree,
basically do anything with them.

>
> > 
> >>>>  http://www.mail-archive.com/kvm@vger.kernel.org/msg66155.html
> >>>> 
> >>> 
> >>> Yeah - s390 is always different.  On the current interface synchronous registers are easy, so why not.  But I wonder if it's really critical.
> >> 
> >> It's certainly slick :). We do the same for the TLB on e500, just with a separate ioctl to set the sharing up.
> > 
> > It's also dangerous wrt future hardware, as noted above.
>
> Yes and no. I see the capability system as two things in one:
>
>   1) indicate features we learn later
>   2) indicate missing features in our current model
>
> So if a new model comes out that can't do something, just scratch off the CAP and be good ;). If somehow you ended up with multiple bits in a single CAP, remove the CAP, create a new one with the subset, set that for the new hardware.
>
> We will have the same situation when we get nested TLBs for booke. We just unlearn a CAP then. User space needs to cope with its unavailability anyways.
>

At least qemu tends to assume a certain baseline and won't run without
it.  We also need to make sure that the feature is available in some
other way (non-shared memory), which means duplication to begin with.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-12  7:10               ` Takuya Yoshikawa
@ 2012-02-15 13:32                 ` Avi Kivity
  0 siblings, 0 replies; 89+ messages in thread
From: Avi Kivity @ 2012-02-15 13:32 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: Alexander Graf, Anthony Liguori, KVM list, linux-kernel,
	qemu-devel, kvm-ppc

On 02/12/2012 09:10 AM, Takuya Yoshikawa wrote:
> Avi Kivity <avi@redhat.com> wrote:
>
> > > >  Slot searching is quite fast since there's a small number of slots, and we sort the larger ones to be in the front, so positive lookups are fast.  We cache negative lookups in the shadow page tables (an spte can be either "not mapped", "mapped to RAM", or "not mapped and known to be mmio") so we rarely need to walk the entire list.
> > >
> > > Well, we don't always have shadow page tables. Having hints for unmapped guest memory like this is pretty tricky.
> > > We're currently running into issues with device assignment though, where we get a lot of small slots mapped to real hardware. I'm sure that will hit us on x86 sooner or later too.
> > 
> > For x86 that's not a problem, since once you map a page, it stays mapped 
> > (on modern hardware).
> > 
>
> I was once thinking about how to search a slot reasonably fast for every case,
> even when we do not have mmio-spte cache.
>
> One possible way I thought up was to sort slots according to their base_gfn.
> Then the problem would become:  "find the first slot whose base_gfn + npages
> is greater than this gfn."
>
> Since we can do binary search, the search cost is O(log(# of slots)).
>
> But I guess that most of the time was wasted on reading many memslots just to
> know their base_gfn and npages.
>
> So the most practically effective thing is to make a separate array which holds
> just their base_gfn.  This will make the task a simple, and cache friendly,
> search on an integer array:  probably faster than using *-tree data structure.

This assumes that there is equal probability for matching any slot.  But
that's not true, even if you have hundreds of slots, the probability is
much greater for the two main memory slots, or if you're playing with
the framebuffer, the framebuffer slot.  Everything else is loaded
quickly into shadow and forgotten.

> If needed, we should make cmp_memslot() architecture specific in the end?

We could, but why is it needed?  This logic holds for all architectures.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 15:23             ` Anthony Liguori
  2012-02-07 15:28               ` Alexander Graf
  2012-02-08 17:20               ` Alan Cox
@ 2012-02-15 13:33               ` Avi Kivity
  2 siblings, 0 replies; 89+ messages in thread
From: Avi Kivity @ 2012-02-15 13:33 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Alexander Graf, qemu-devel, kvm-ppc, KVM list, linux-kernel

On 02/07/2012 05:23 PM, Anthony Liguori wrote:
> On 02/07/2012 07:40 AM, Alexander Graf wrote:
>>
>> Why? For the HPET timer register for example, we could have a simple
>> MMIO hook that says
>>
>>    on_read:
>>      return read_current_time() - shared_page.offset;
>>    on_write:
>>      handle_in_user_space();
>>
>> For IDE, it would be as simple as
>>
>>    register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE,&s->cmd[0]);
>>    for (i = 1; i<  7; i++) {
>>      register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
>>      register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE,&s->cmd[i]);
>>    }
>
> You can't easily serialize updates to that address with the kernel
> since two threads are likely going to be accessing it at the same
> time.  That either means an expensive sync operation or a reliance on
> atomic instructions.
>
> But not all architectures offer non-word sized atomic instructions so
> it gets fairly nasty in practice.
>

I doubt that any guest accesses IDE registers from two threads in
parallel.  The guest will have some lock, so we could have a lock as
well and be assured that there will never be contention.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-15 13:29                     ` Avi Kivity
@ 2012-02-15 13:37                       ` Alexander Graf
  2012-02-15 13:57                         ` Avi Kivity
  0 siblings, 1 reply; 89+ messages in thread
From: Alexander Graf @ 2012-02-15 13:37 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc


On 15.02.2012, at 14:29, Avi Kivity wrote:

> On 02/15/2012 01:57 PM, Alexander Graf wrote:
>>> 
>>> Is an extra syscall for copying TLB entries to user space prohibitively
>>> expensive?
>> 
>> The copying can be very expensive, yes. We want to have the possibility of exposing a very large TLB to the guest, in the order of multiple kentries. Every entry is a struct of 24 bytes.
> 
> You don't need to copy the entire TLB, just the way that maps the
> address you're interested in.

Yeah, unless we do migration in which case we need to introduce another special case to fetch the whole thing :(.

> btw, why are you interested in virtual addresses in userspace at all?

We need them for gdb and monitor introspection.

> 
>>>>> 
>>>>> It works for the really simple cases, yes, but if the guest wants to set up one-shot timers, it fails.  
>>>> 
>>>> I don't understand. Why would anything fail here? 
>>> 
>>> It fails to provide a benefit, I didn't mean it causes guest failures.
>>> 
>>> You also have to make sure the kernel part and the user part use exactly
>>> the same time bases.
>> 
>> Right. It's an optional performance accelerator. If anything doesn't align, don't use it. But if you happen to have a system where everything's cool, you're faster. Sounds like a good deal to me ;).
> 
> Depends on how much the alignment relies on guest knowledge.  I guess
> with a simple device like HPET, it's simple, but with a complex device,
> different guests (or different versions of the same guest) could drive
> it very differently.

Right. But accelerating simple devices > not accelerating any devices. No? :)

> 
>>>>>> 
>>>>>> Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp. 
>>> 
>>> 3rd party drivers are a way of life for Windows users; and the
>>> incremental benefits of IDE acceleration are still far behind virtio.
>> 
>> The typical way of life for Windows users are all-included drivers. Which is the case for AHCI, where we're getting awesome performance for Vista and above guests. The iDE thing was just an idea for legacy ones.
>> 
>> It'd be great to simply try and see how fast we could get by handling a few special registers in kernel space vs heavyweight exiting to QEMU. If it's only 10%, I wouldn't even bother with creating an interface for it. I'd bet the benefits are a lot bigger though.
>> 
>> And the main point was that specific partial device emulation buys us more than pseudo-generic accelerators like coalesced mmio, which are also only used by 1 or 2 devices.
> 
> Ok.
> 
>>> 
>>>> I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. 
>>> 
>>> Cirrus or vesa should be okay for them, I don't see what we could do for
>>> them in the kernel, or why.
>> 
>> That's my point. You need fast emulation of standard devices to get a good baseline. Do PV on top, but keep the baseline as fast as is reasonable.
>> 
>>> 
>>>> Same for virtio.
>>>>>> 
>>>>>> Please don't do the Xen mistake again of claiming that all we care about is Linux as a guest.
>>>>> 
>>>>> Rest easy, there's no chance of that.  But if a guest is important enough, virtio drivers will get written.  IDE has no chance in hell of approaching virtio-blk performance, no matter how much effort we put into it.
>>>> 
>>>> Ever used VMware? They basically get virtio-blk performance out of ordinary IDE for linear workloads.
>>> 
>>> For linear loads, so should we, perhaps with greater cpu utliization.
>>> 
>>> If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple)
>>> means 0.5 msec/transaction.  Spending 30 usec on some heavyweight exits
>>> shouldn't matter.
>> 
>> *shrug* last time I checked we were a lot slower. But maybe there's more stuff making things slow than the exit path ;).
> 
> One thing that's different is that virtio offloads itself to a thread
> very quickly, while IDE does a lot of work in vcpu thread context.

So it's all about latencies again, which could be reduced at least a fair bit with the scheme I described above. But really, this needs to be prototyped and benchmarked to actually give us data on how fast it would get us.

> 
>>> 
>>>>> 
>>>>>> KVM's strength has always been its close resemblance to hardware.
>>>>> 
>>>>> This will remain.  But we can't optimize everything.
>>>> 
>>>> That's my point. Let's optimize the hot paths and be good. As long as we default to IDE for disk, we should have that be fast, no?
>>> 
>>> We should make sure that we don't default to IDE.  Qemu has no knowledge
>>> of the guest, so it can't default to virtio, but higher level tools can
>>> and should.
>> 
>> You can only default to virtio on recent Linux. Windows, BSD, etc don't include drivers, so you can't assume it working. You can default to AHCI for basically any recent guest, but that still won't work for XP and the likes :(.
> 
> The all-knowing management tool can provide a virtio driver disk, or
> even slip-stream the driver into the installation CD.

One management tool might do that, another one might now. We can't assume that all management tools are all-knowing. Some times you also want to run guest OSs that the management tool doesn't know (yet).

> 
> 
>> 
>>>> Ah, because you're on NPT and you can have MMIO hints in the nested page table. Nifty. Yeah, we don't have that luxury :).
>>> 
>>> Well the real reason is we have an extra bit reported by page faults
>>> that we can control.  Can't you set up a hashed pte that is configured
>>> in a way that it will fault, no matter what type of access the guest
>>> does, and see it in your page fault handler?
>> 
>> I might be able to synthesize a PTE that is !readable and might throw a permission exception instead of a miss exception. I might be able to synthesize something similar for booke. I don't however get any indication on why things failed.
>> 
>> So for MMIO reads, I can assume that this is an MMIO because I would never write a non-readable entry. For writes, I'm overloading the bit that also means "guest entry is not readable" so there I'd have to walk the guest PTEs/TLBs and check if I find a read-only entry. Right now I can just forward write faults to the guest. Since COW is probably a hotter path for the guest than MMIO, this might end up being ineffective.
> 
> COWs usually happen from guest userspace, while mmio is usually from the
> guest kernel, so you can switch on that, maybe.

Hrm, nice idea. That might fall apart with user space drivers that we might eventually have once vfio turns out to work well, but for the time being it's a nice hack :).

> 
>>>> 
>>>> I think we're talking about the same thing really.
>>> 
>>> So what's your objection to slots?
>> 
>> I was merely saying that having slots internally keeps us from speeding things up. I don't mind the external interface though.
> 
> Ah, but it doesn't.  We can sort them, convert them to a radix tree,
> basically do anything with them.

That's perfectly fine then :).

> 
>> 
>>> 
>>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg66155.html
>>>>>> 
>>>>> 
>>>>> Yeah - s390 is always different.  On the current interface synchronous registers are easy, so why not.  But I wonder if it's really critical.
>>>> 
>>>> It's certainly slick :). We do the same for the TLB on e500, just with a separate ioctl to set the sharing up.
>>> 
>>> It's also dangerous wrt future hardware, as noted above.
>> 
>> Yes and no. I see the capability system as two things in one:
>> 
>>  1) indicate features we learn later
>>  2) indicate missing features in our current model
>> 
>> So if a new model comes out that can't do something, just scratch off the CAP and be good ;). If somehow you ended up with multiple bits in a single CAP, remove the CAP, create a new one with the subset, set that for the new hardware.
>> 
>> We will have the same situation when we get nested TLBs for booke. We just unlearn a CAP then. User space needs to cope with its unavailability anyways.
>> 
> 
> At least qemu tends to assume a certain baseline and won't run without
> it.  We also need to make sure that the feature is available in some
> other way (non-shared memory), which means duplication to begin with.

Yes, but that's the nature of accelerating things in other layers. If we move registers from ioctl get/set to shared pages, we need to keep the ioctls around. We also need to keep the ioctl access functions in qemu around. Unless we move up the baseline, but then we'd kill our backwards compatibility, which isn't all that great of an idea.

So yes, that's exactly what happens. And it's good that it does :). Gives us the chance to roll back when we realized we did something stupid.


Alex


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 18:12           ` Rusty Russell
@ 2012-02-15 13:39             ` Avi Kivity
  2012-02-15 21:59               ` Anthony Liguori
  2012-02-15 23:08               ` Rusty Russell
  0 siblings, 2 replies; 89+ messages in thread
From: Avi Kivity @ 2012-02-15 13:39 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Anthony Liguori, Gleb Natapov, linux-kernel, KVM list, qemu-devel

On 02/07/2012 08:12 PM, Rusty Russell wrote:
> > I would really love to have this, but the problem is that we'd need a
> > general purpose bytecode VM with binding to some kernel APIs.  The
> > bytecode VM, if made general enough to host more complicated devices,
> > would likely be much larger than the actual code we have in the kernel now.
>
> We have the ability to upload bytecode into the kernel already.  It's in
> a great bytecode interpreted by the CPU itself.

Unfortunately it's inflexible (has to come with the kernel) and open to
security vulnerabilities.

> If every user were emulating different machines, LPF this would make
> sense.  Are they?  

They aren't.

> Or should we write those helpers once, in C, and
> provide that for them.

There are many of them: PIT/PIC/IOAPIC/MSIX tables/HPET/kvmclock/Hyper-V
stuff/vhost-net/DMA remapping/IO remapping (just for x86), and some of
them are quite complicated.  However implementing them in bytecode
amounts to exposing a stable kernel ABI, since they use such a vast
range of kernel services.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 16:29                   ` Jan Kiszka
@ 2012-02-15 13:41                     ` Avi Kivity
  0 siblings, 0 replies; 89+ messages in thread
From: Avi Kivity @ 2012-02-15 13:41 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Anthony Liguori, qemu-devel, Rob Earhart, linux-kernel, KVM list

On 02/07/2012 06:29 PM, Jan Kiszka wrote:
> >>>
> >>
> >> Isn't there another level in between just scheduling and full syscall
> >> return if the user return notifier has some real work to do?
> > 
> > Depends on whether you're scheduling a kthread or a userspace process, no?  If 
>
> Kthreads can't return, of course. User space threads /may/ do so. And
> then there needs to be a differences between host and guest in the
> tracked MSRs. 

Right.  Until we randomize kernel virtual addresses (what happened to
that?) and then there will always be a difference, even if you run the
same kernel in the host and guest.

> I think to recall it's a question of another few hundred
> cycles.

Right.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 16:19               ` Anthony Liguori
@ 2012-02-15 13:47                 ` Avi Kivity
  0 siblings, 0 replies; 89+ messages in thread
From: Avi Kivity @ 2012-02-15 13:47 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Rob Earhart, linux-kernel, KVM list, qemu-devel

On 02/07/2012 06:19 PM, Anthony Liguori wrote:
>> Ah. But then ioeventfd has that as well, unless the other end is in
>> the kernel too.
>
>
> Yes, that was my point exactly :-)
>
> ioeventfd/mmio-over-socketpair to adifferent thread is not faster than
> a synchronous KVM_RUN + writing to an eventfd in userspace modulo a
> couple of cheap syscalls.
>
> The exception is when the other end is in the kernel and there is
> magic optimizations (like there is today with ioeventfd).

vhost seems to schedule a workqueue item unconditionally.

irqfd does have magic optimizations to avoid an extra schedule.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-15 13:37                       ` Alexander Graf
@ 2012-02-15 13:57                         ` Avi Kivity
  2012-02-15 14:08                           ` Alexander Graf
  0 siblings, 1 reply; 89+ messages in thread
From: Avi Kivity @ 2012-02-15 13:57 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc

On 02/15/2012 03:37 PM, Alexander Graf wrote:
> On 15.02.2012, at 14:29, Avi Kivity wrote:
>
> > On 02/15/2012 01:57 PM, Alexander Graf wrote:
> >>> 
> >>> Is an extra syscall for copying TLB entries to user space prohibitively
> >>> expensive?
> >> 
> >> The copying can be very expensive, yes. We want to have the possibility of exposing a very large TLB to the guest, in the order of multiple kentries. Every entry is a struct of 24 bytes.
> > 
> > You don't need to copy the entire TLB, just the way that maps the
> > address you're interested in.
>
> Yeah, unless we do migration in which case we need to introduce another special case to fetch the whole thing :(.

Well, the scatter/gather registers I proposed will give you just one
register or all of them.

> > btw, why are you interested in virtual addresses in userspace at all?
>
> We need them for gdb and monitor introspection.

Hardly fast paths that justify shared memory.  I should be much harder
on you.

> >> 
> >> Right. It's an optional performance accelerator. If anything doesn't align, don't use it. But if you happen to have a system where everything's cool, you're faster. Sounds like a good deal to me ;).
> > 
> > Depends on how much the alignment relies on guest knowledge.  I guess
> > with a simple device like HPET, it's simple, but with a complex device,
> > different guests (or different versions of the same guest) could drive
> > it very differently.
>
> Right. But accelerating simple devices > not accelerating any devices. No? :)

Yes.  But introducing bugs and vulns < not introducing them.  It's a
tradeoff.  Even an unexploited vulnerability can be a lot more pain,
just because you need to update your entire cluster, than a simple
device that is accelerated for a guest which has maybe 3% utilization. 
Performance is just one parameter we optimize for.  It's easy to overdo
it because it's an easily measurable and sexy parameter, but it's a mistake.

> > 
> > One thing that's different is that virtio offloads itself to a thread
> > very quickly, while IDE does a lot of work in vcpu thread context.
>
> So it's all about latencies again, which could be reduced at least a fair bit with the scheme I described above. But really, this needs to be prototyped and benchmarked to actually give us data on how fast it would get us.

Simply making qemu issue the request from a thread would be way better. 
Something like socketpair mmio, configured for not waiting for the
writes to be seen (posted writes) will also help by buffering writes in
the socket buffer.

> > 
> > The all-knowing management tool can provide a virtio driver disk, or
> > even slip-stream the driver into the installation CD.
>
> One management tool might do that, another one might now. We can't assume that all management tools are all-knowing. Some times you also want to run guest OSs that the management tool doesn't know (yet).

That is true, but we have to leave some work for the management guys.

>  
> >> So for MMIO reads, I can assume that this is an MMIO because I would never write a non-readable entry. For writes, I'm overloading the bit that also means "guest entry is not readable" so there I'd have to walk the guest PTEs/TLBs and check if I find a read-only entry. Right now I can just forward write faults to the guest. Since COW is probably a hotter path for the guest than MMIO, this might end up being ineffective.
> > 
> > COWs usually happen from guest userspace, while mmio is usually from the
> > guest kernel, so you can switch on that, maybe.
>
> Hrm, nice idea. That might fall apart with user space drivers that we might eventually have once vfio turns out to work well, but for the time being it's a nice hack :).

Or nested virt...



-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-15 13:57                         ` Avi Kivity
@ 2012-02-15 14:08                           ` Alexander Graf
  2012-02-16 19:24                             ` Avi Kivity
  0 siblings, 1 reply; 89+ messages in thread
From: Alexander Graf @ 2012-02-15 14:08 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc


On 15.02.2012, at 14:57, Avi Kivity wrote:

> On 02/15/2012 03:37 PM, Alexander Graf wrote:
>> On 15.02.2012, at 14:29, Avi Kivity wrote:
>> 
>>> On 02/15/2012 01:57 PM, Alexander Graf wrote:
>>>>> 
>>>>> Is an extra syscall for copying TLB entries to user space prohibitively
>>>>> expensive?
>>>> 
>>>> The copying can be very expensive, yes. We want to have the possibility of exposing a very large TLB to the guest, in the order of multiple kentries. Every entry is a struct of 24 bytes.
>>> 
>>> You don't need to copy the entire TLB, just the way that maps the
>>> address you're interested in.
>> 
>> Yeah, unless we do migration in which case we need to introduce another special case to fetch the whole thing :(.
> 
> Well, the scatter/gather registers I proposed will give you just one
> register or all of them.

One register is hardly any use. We either need all ways of a respective address to do a full fledged lookup or all of them. By sharing the same data structures between qemu and kvm, we actually managed to reuse all of the tcg code for lookups, just like you do for x86. On x86 you also have shared memory for page tables, it's just guest visible, hence in guest memory. The concept is the same.

> 
>>> btw, why are you interested in virtual addresses in userspace at all?
>> 
>> We need them for gdb and monitor introspection.
> 
> Hardly fast paths that justify shared memory.  I should be much harder
> on you.

It was a tradeoff on speed and complexity. This way we have the least amount of complexity IMHO. All KVM code paths just magically fit in with the TCG code. There are essentially no if(kvm_enabled)'s in our MMU walking code, because the tables are just there. Makes everything a lot easier (without dragging down performance).

> 
>>>> 
>>>> Right. It's an optional performance accelerator. If anything doesn't align, don't use it. But if you happen to have a system where everything's cool, you're faster. Sounds like a good deal to me ;).
>>> 
>>> Depends on how much the alignment relies on guest knowledge.  I guess
>>> with a simple device like HPET, it's simple, but with a complex device,
>>> different guests (or different versions of the same guest) could drive
>>> it very differently.
>> 
>> Right. But accelerating simple devices > not accelerating any devices. No? :)
> 
> Yes.  But introducing bugs and vulns < not introducing them.  It's a
> tradeoff.  Even an unexploited vulnerability can be a lot more pain,
> just because you need to update your entire cluster, than a simple
> device that is accelerated for a guest which has maybe 3% utilization. 
> Performance is just one parameter we optimize for.  It's easy to overdo
> it because it's an easily measurable and sexy parameter, but it's a mistake.

Yeah, I agree. That's why I was trying to get AHCI to the default storage adapter for a while, because I think the same. However, Anthony believes that XP/w2k3 is still a major chunk of the guests running on QEMU, so we can't do that :(.

I'm mostly trying to think of ways to accelerate the obvious low hanging fruits, without overengineering any interfaces.

> 
>>> 
>>> One thing that's different is that virtio offloads itself to a thread
>>> very quickly, while IDE does a lot of work in vcpu thread context.
>> 
>> So it's all about latencies again, which could be reduced at least a fair bit with the scheme I described above. But really, this needs to be prototyped and benchmarked to actually give us data on how fast it would get us.
> 
> Simply making qemu issue the request from a thread would be way better. 
> Something like socketpair mmio, configured for not waiting for the
> writes to be seen (posted writes) will also help by buffering writes in
> the socket buffer.

Yup, nice idea. That only works when all parts of a device are actually implemented through the same socket though. Otherwise you could run out of order. So if you have a PCI device with a PIO and an MMIO BAR region, they would both have to be handled through the same socket.

> 
>>> 
>>> The all-knowing management tool can provide a virtio driver disk, or
>>> even slip-stream the driver into the installation CD.
>> 
>> One management tool might do that, another one might now. We can't assume that all management tools are all-knowing. Some times you also want to run guest OSs that the management tool doesn't know (yet).
> 
> That is true, but we have to leave some work for the management guys.

The easier the management stack is, the happier I am ;).

> 
>> 
>>>> So for MMIO reads, I can assume that this is an MMIO because I would never write a non-readable entry. For writes, I'm overloading the bit that also means "guest entry is not readable" so there I'd have to walk the guest PTEs/TLBs and check if I find a read-only entry. Right now I can just forward write faults to the guest. Since COW is probably a hotter path for the guest than MMIO, this might end up being ineffective.
>>> 
>>> COWs usually happen from guest userspace, while mmio is usually from the
>>> guest kernel, so you can switch on that, maybe.
>> 
>> Hrm, nice idea. That might fall apart with user space drivers that we might eventually have once vfio turns out to work well, but for the time being it's a nice hack :).
> 
> Or nested virt...

Nested virt on ppc with device assignment? And here I thought I was the crazy one of the two of us :)


Alex


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-15 11:57                   ` Alexander Graf
  2012-02-15 13:29                     ` Avi Kivity
@ 2012-02-15 19:17                     ` Scott Wood
  1 sibling, 0 replies; 89+ messages in thread
From: Scott Wood @ 2012-02-15 19:17 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Avi Kivity, Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc

On 02/15/2012 05:57 AM, Alexander Graf wrote:
> 
> On 15.02.2012, at 12:18, Avi Kivity wrote:
> 
>> Well the real reason is we have an extra bit reported by page faults
>> that we can control.  Can't you set up a hashed pte that is configured
>> in a way that it will fault, no matter what type of access the guest
>> does, and see it in your page fault handler?
> 
> I might be able to synthesize a PTE that is !readable and might throw
> a permission exception instead of a miss exception. I might be able
> to synthesize something similar for booke. I don't however get any
> indication on why things failed.

On booke with ISA 2.06 hypervisor extensions, there's MAS8[VF] that will
trigger a DSI that gets sent to the hypervisor even if normal DSIs go
directly to the guest.  You'll still need to zero out the execute
permission bits.

For other booke, you could use one of the user bits in MAS3 (along with
zeroing out all the permission bits), which you could get to by doing a
tlbsx.

-Scott


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-15 13:39             ` Avi Kivity
@ 2012-02-15 21:59               ` Anthony Liguori
  2012-02-16  8:57                 ` Gleb Natapov
  2012-02-15 23:08               ` Rusty Russell
  1 sibling, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-02-15 21:59 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Rusty Russell, qemu-devel, KVM list, Gleb Natapov, linux-kernel

On 02/15/2012 07:39 AM, Avi Kivity wrote:
> On 02/07/2012 08:12 PM, Rusty Russell wrote:
>>> I would really love to have this, but the problem is that we'd need a
>>> general purpose bytecode VM with binding to some kernel APIs.  The
>>> bytecode VM, if made general enough to host more complicated devices,
>>> would likely be much larger than the actual code we have in the kernel now.
>>
>> We have the ability to upload bytecode into the kernel already.  It's in
>> a great bytecode interpreted by the CPU itself.
>
> Unfortunately it's inflexible (has to come with the kernel) and open to
> security vulnerabilities.

I wonder if there's any reasonable way to run device emulation within the 
context of the guest.  Could we effectively do something like SMM?

For a given set of traps, reflect back into the guest quickly changing the 
visibility of the VGA region. It may require installing a new CR3 but maybe that 
wouldn't be so bad with VPIDs.

Then you could implement the PIT as guest firmware using kvmclock as the time base.

Once you're back in the guest, you could install the old CR3.  Perhaps just hide 
a portion of the physical address space with the e820.

Regards,

Anthony Liguori

>> If every user were emulating different machines, LPF this would make
>> sense.  Are they?
>
> They aren't.
>
>> Or should we write those helpers once, in C, and
>> provide that for them.
>
> There are many of them: PIT/PIC/IOAPIC/MSIX tables/HPET/kvmclock/Hyper-V
> stuff/vhost-net/DMA remapping/IO remapping (just for x86), and some of
> them are quite complicated.  However implementing them in bytecode
> amounts to exposing a stable kernel ABI, since they use such a vast
> range of kernel services.
>


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 13:40           ` Alexander Graf
  2012-02-07 14:21             ` Avi Kivity
  2012-02-07 15:23             ` Anthony Liguori
@ 2012-02-15 22:14             ` Arnd Bergmann
  2 siblings, 0 replies; 89+ messages in thread
From: Arnd Bergmann @ 2012-02-15 22:14 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alexander Graf, Avi Kivity, kvm-ppc, KVM list, linux-kernel

On Tuesday 07 February 2012, Alexander Graf wrote:
> >> 
> >> Not sure we'll ever get there. For PPC, it will probably take another 1-2 years until we get the 32-bit targets stabilized. By then we will have new 64-bit support though. And then the next gen will come out giving us even more new constraints.
> > 
> > I would expect that newer archs have less constraints, not more.
> 
> Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have today on 32-bit, but extends a
> bunch of registers to 64-bit. So what if we laid out stuff wrong before?
> 
> I don't even want to imagine what v7 arm vs v8 arm looks like. It's a completely new architecture.
> 

I have not seen the source but I'm pretty sure that v7 and v8 they look very
similar regarding virtualization support because they were designed together,
including the concept that on v8 you can run either a v7 compatible 32 bit
hypervisor with 32 bit guests or a 64 bit hypervisor with a combination of
32 and 64 bit guests. Also, the page table layout in v7-LPAE is identical
to the v8 one. The main difference is the instruction set, but then ARMv7
already has four of these (ARM, Thumb, Thumb2, ThumbEE).

	Arnd


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-07 10:04         ` Alexander Graf
@ 2012-02-15 22:21           ` Arnd Bergmann
  2012-02-16  1:04             ` Michael Ellerman
  2012-02-16 10:26             ` Avi Kivity
  0 siblings, 2 replies; 89+ messages in thread
From: Arnd Bergmann @ 2012-02-15 22:21 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexander Graf, michael, KVM list, linux-kernel, Eric Northup,
	Scott Wood, Avi Kivity

On Tuesday 07 February 2012, Alexander Graf wrote:
> On 07.02.2012, at 07:58, Michael Ellerman wrote:
> 
> > On Mon, 2012-02-06 at 13:46 -0600, Scott Wood wrote:
> >> You're exposing a large, complex kernel subsystem that does very
> >> low-level things with the hardware.  It's a potential source of exploits
> >> (from bugs in KVM or in hardware).  I can see people wanting to be
> >> selective with access because of that.
> > 
> > Exactly.
> > 
> > In a perfect world I'd agree with Anthony, but in reality I think
> > sysadmins are quite happy that they can prevent some users from using
> > KVM.
> > 
> > You could presumably achieve something similar with capabilities or
> > whatever, but a node in /dev is much simpler.
> 
> Well, you could still keep the /dev/kvm node and then have syscalls operate on the fd.
> 
> But again, I don't see the problem with the ioctl interface. It's nice, extensible and works great for us.
> 

ioctl is good for hardware devices and stuff that you want to enumerate
and/or control permissions on. For something like KVM that is really a
core kernel service, a syscall makes much more sense.

I would certainly never mix the two concepts: If you use a chardev to get
a file descriptor, use ioctl to do operations on it, and if you use a 
syscall to get the file descriptor then use other syscalls to do operations
on it.

I don't really have a good recommendation whether or not to change from an
ioctl based interface to syscall for KVM now. On the one hand I believe it
would be significantly cleaner, on the other hand we cannot remove the
chardev interface any more since there are many existing users.

	Arnd

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-15 13:39             ` Avi Kivity
  2012-02-15 21:59               ` Anthony Liguori
@ 2012-02-15 23:08               ` Rusty Russell
  1 sibling, 0 replies; 89+ messages in thread
From: Rusty Russell @ 2012-02-15 23:08 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Anthony Liguori, Gleb Natapov, linux-kernel, KVM list, qemu-devel

On Wed, 15 Feb 2012 15:39:41 +0200, Avi Kivity <avi@redhat.com> wrote:
> On 02/07/2012 08:12 PM, Rusty Russell wrote:
> > > I would really love to have this, but the problem is that we'd need a
> > > general purpose bytecode VM with binding to some kernel APIs.  The
> > > bytecode VM, if made general enough to host more complicated devices,
> > > would likely be much larger than the actual code we have in the kernel now.
> >
> > We have the ability to upload bytecode into the kernel already.  It's in
> > a great bytecode interpreted by the CPU itself.
> 
> Unfortunately it's inflexible (has to come with the kernel) and open to
> security vulnerabilities.

It doesn't have to come with the kernel, but it does require privs.  And
the bytecode itself might be invulnerable, the services it will call
will be, so it's not clear it'll be a win, given the reduced
auditability.

The grass is not really greener, and getting there involves many fences.

> > If every user were emulating different machines, LPF this would make
> > sense.  Are they?  
> 
> They aren't.
> 
> > Or should we write those helpers once, in C, and
> > provide that for them.
> 
> There are many of them: PIT/PIC/IOAPIC/MSIX tables/HPET/kvmclock/Hyper-V
> stuff/vhost-net/DMA remapping/IO remapping (just for x86), and some of
> them are quite complicated.  However implementing them in bytecode
> amounts to exposing a stable kernel ABI, since they use such a vast
> range of kernel services.

We could think about regularizing and enumerating the various in-kernel
helpers, and give userspace a generic mechanism for wiring them up.
That would surely be the first step towards bytecode anyway.

But the current device assignment ioctls make me think that this
wouldn't be simple or neat.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-15 22:21           ` Arnd Bergmann
@ 2012-02-16  1:04             ` Michael Ellerman
  2012-02-16 19:28               ` Avi Kivity
  2012-02-16 10:26             ` Avi Kivity
  1 sibling, 1 reply; 89+ messages in thread
From: Michael Ellerman @ 2012-02-16  1:04 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: qemu-devel, Alexander Graf, KVM list, linux-kernel, Eric Northup,
	Scott Wood, Avi Kivity

[-- Attachment #1: Type: text/plain, Size: 2504 bytes --]

On Wed, 2012-02-15 at 22:21 +0000, Arnd Bergmann wrote:
> On Tuesday 07 February 2012, Alexander Graf wrote:
> > On 07.02.2012, at 07:58, Michael Ellerman wrote:
> > 
> > > On Mon, 2012-02-06 at 13:46 -0600, Scott Wood wrote:
> > >> You're exposing a large, complex kernel subsystem that does very
> > >> low-level things with the hardware.  It's a potential source of exploits
> > >> (from bugs in KVM or in hardware).  I can see people wanting to be
> > >> selective with access because of that.
> > > 
> > > Exactly.
> > > 
> > > In a perfect world I'd agree with Anthony, but in reality I think
> > > sysadmins are quite happy that they can prevent some users from using
> > > KVM.
> > > 
> > > You could presumably achieve something similar with capabilities or
> > > whatever, but a node in /dev is much simpler.
> > 
> > Well, you could still keep the /dev/kvm node and then have syscalls operate on the fd.
> > 
> > But again, I don't see the problem with the ioctl interface. It's nice, extensible and works great for us.
> > 
> 
> ioctl is good for hardware devices and stuff that you want to enumerate
> and/or control permissions on. For something like KVM that is really a
> core kernel service, a syscall makes much more sense.

Yeah maybe. That distinction is at least in part just historical.

The first problem I see with using a syscall is that you don't need one
syscall for KVM, you need ~90. OK so you wouldn't do that, you'd use a
multiplexed syscall like epoll_ctl() - or probably several
(vm/vcpu/etc).

Secondly you still need a handle/context for those syscalls, and I think
the most sane thing to use for that is an fd.

At that point you've basically reinvented ioctl :)

I also think it is an advantage that you have a node in /dev for
permissions. I know other "core kernel" interfaces don't use a /dev
node, but arguably that is their loss.

> I would certainly never mix the two concepts: If you use a chardev to get
> a file descriptor, use ioctl to do operations on it, and if you use a 
> syscall to get the file descriptor then use other syscalls to do operations
> on it.

Sure, we use a syscall to get the fd (open) and then other syscalls to
do operations on it, ioctl and kvm_vcpu_run. ;)

But seriously, I guess that makes sense. Though it's a bit of a pity
because if you want a syscall for any of it, eg. vcpu_run(), then you
have to basically reinvent ioctl for all the other little operations.

cheers

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-15 21:59               ` Anthony Liguori
@ 2012-02-16  8:57                 ` Gleb Natapov
  2012-02-16 14:46                   ` Anthony Liguori
  0 siblings, 1 reply; 89+ messages in thread
From: Gleb Natapov @ 2012-02-16  8:57 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Avi Kivity, Rusty Russell, qemu-devel, KVM list, linux-kernel

On Wed, Feb 15, 2012 at 03:59:33PM -0600, Anthony Liguori wrote:
> On 02/15/2012 07:39 AM, Avi Kivity wrote:
> >On 02/07/2012 08:12 PM, Rusty Russell wrote:
> >>>I would really love to have this, but the problem is that we'd need a
> >>>general purpose bytecode VM with binding to some kernel APIs.  The
> >>>bytecode VM, if made general enough to host more complicated devices,
> >>>would likely be much larger than the actual code we have in the kernel now.
> >>
> >>We have the ability to upload bytecode into the kernel already.  It's in
> >>a great bytecode interpreted by the CPU itself.
> >
> >Unfortunately it's inflexible (has to come with the kernel) and open to
> >security vulnerabilities.
> 
> I wonder if there's any reasonable way to run device emulation
> within the context of the guest.  Could we effectively do something
> like SMM?
> 
> For a given set of traps, reflect back into the guest quickly
> changing the visibility of the VGA region. It may require installing
> a new CR3 but maybe that wouldn't be so bad with VPIDs.
> 
What will it buy us? Surely not speed. Entering a guest is not much
(if at all) faster than exiting to userspace and any non trivial
operation will require exit to userspace anyway, so we just added one
more guest entry/exit operation on the way to userspace.

> Then you could implement the PIT as guest firmware using kvmclock as the time base.
> 
> Once you're back in the guest, you could install the old CR3.
> Perhaps just hide a portion of the physical address space with the
> e820.
> 
> Regards,
> 
> Anthony Liguori
> 
> >>If every user were emulating different machines, LPF this would make
> >>sense.  Are they?
> >
> >They aren't.
> >
> >>Or should we write those helpers once, in C, and
> >>provide that for them.
> >
> >There are many of them: PIT/PIC/IOAPIC/MSIX tables/HPET/kvmclock/Hyper-V
> >stuff/vhost-net/DMA remapping/IO remapping (just for x86), and some of
> >them are quite complicated.  However implementing them in bytecode
> >amounts to exposing a stable kernel ABI, since they use such a vast
> >range of kernel services.
> >

--
			Gleb.

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-15 22:21           ` Arnd Bergmann
  2012-02-16  1:04             ` Michael Ellerman
@ 2012-02-16 10:26             ` Avi Kivity
  1 sibling, 0 replies; 89+ messages in thread
From: Avi Kivity @ 2012-02-16 10:26 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: qemu-devel, Alexander Graf, michael, KVM list, linux-kernel,
	Eric Northup, Scott Wood

On 02/16/2012 12:21 AM, Arnd Bergmann wrote:
> ioctl is good for hardware devices and stuff that you want to enumerate
> and/or control permissions on. For something like KVM that is really a
> core kernel service, a syscall makes much more sense.
>
> I would certainly never mix the two concepts: If you use a chardev to get
> a file descriptor, use ioctl to do operations on it, and if you use a 
> syscall to get the file descriptor then use other syscalls to do operations
> on it.
>
> I don't really have a good recommendation whether or not to change from an
> ioctl based interface to syscall for KVM now. On the one hand I believe it
> would be significantly cleaner, on the other hand we cannot remove the
> chardev interface any more since there are many existing users.
>

This sums up my feelings exactly.  Moving to syscalls would be an
improvement, but not so much an improvement as to warrant the thrashing
and the pain from having to maintain the old interface for a long while.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-16  8:57                 ` Gleb Natapov
@ 2012-02-16 14:46                   ` Anthony Liguori
  2012-02-16 19:34                     ` Avi Kivity
  0 siblings, 1 reply; 89+ messages in thread
From: Anthony Liguori @ 2012-02-16 14:46 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: KVM list, linux-kernel, Avi Kivity, Rusty Russell, qemu-devel

On 02/16/2012 02:57 AM, Gleb Natapov wrote:
> On Wed, Feb 15, 2012 at 03:59:33PM -0600, Anthony Liguori wrote:
>> On 02/15/2012 07:39 AM, Avi Kivity wrote:
>>> On 02/07/2012 08:12 PM, Rusty Russell wrote:
>>>>> I would really love to have this, but the problem is that we'd need a
>>>>> general purpose bytecode VM with binding to some kernel APIs.  The
>>>>> bytecode VM, if made general enough to host more complicated devices,
>>>>> would likely be much larger than the actual code we have in the kernel now.
>>>>
>>>> We have the ability to upload bytecode into the kernel already.  It's in
>>>> a great bytecode interpreted by the CPU itself.
>>>
>>> Unfortunately it's inflexible (has to come with the kernel) and open to
>>> security vulnerabilities.
>>
>> I wonder if there's any reasonable way to run device emulation
>> within the context of the guest.  Could we effectively do something
>> like SMM?
>>
>> For a given set of traps, reflect back into the guest quickly
>> changing the visibility of the VGA region. It may require installing
>> a new CR3 but maybe that wouldn't be so bad with VPIDs.
>>
> What will it buy us? Surely not speed. Entering a guest is not much
> (if at all) faster than exiting to userspace and any non trivial
> operation will require exit to userspace anyway,

You can emulate the PIT/RTC entirely within the guest using kvmclock which 
doesn't require an additional exit to get the current time base.

So instead of:

1) guest -> host kernel
2) host kernel -> userspace
3) implement logic using rdtscp via VDSO
4) userspace -> host kernel
5) host kernel -> guest

You go:

1) guest -> host kernel
2) host kernel -> guest (with special CR3)
3) implement logic using rdtscp + kvmclock page
4) change CR3 within guest and RETI to VMEXIT source RIP

Same basic concept as PS/2 emulation with SMM.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-15 14:08                           ` Alexander Graf
@ 2012-02-16 19:24                             ` Avi Kivity
  2012-02-16 19:34                               ` Alexander Graf
  0 siblings, 1 reply; 89+ messages in thread
From: Avi Kivity @ 2012-02-16 19:24 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc

On 02/15/2012 04:08 PM, Alexander Graf wrote:
> > 
> > Well, the scatter/gather registers I proposed will give you just one
> > register or all of them.
>
> One register is hardly any use. We either need all ways of a respective address to do a full fledged lookup or all of them. 

I should have said, just one register, or all of them, or anything in
between.

> By sharing the same data structures between qemu and kvm, we actually managed to reuse all of the tcg code for lookups, just like you do for x86.

Sharing the data structures is not need.  Simply synchronize them before
lookup, like we do for ordinary registers.

>  On x86 you also have shared memory for page tables, it's just guest visible, hence in guest memory. The concept is the same.

But cr3 isn't, and if we put it in shared memory, we'd have to VMREAD it
on every exit.  And you're risking the same thing if your hardware gets
cleverer.

> > 
> >>> btw, why are you interested in virtual addresses in userspace at all?
> >> 
> >> We need them for gdb and monitor introspection.
> > 
> > Hardly fast paths that justify shared memory.  I should be much harder
> > on you.
>
> It was a tradeoff on speed and complexity. This way we have the least amount of complexity IMHO. All KVM code paths just magically fit in with the TCG code. 

It's too magical, fitting a random version of a random userspace
component.  Now you can't change this tcg code (and still keep the magic).

Some complexity is part of keeping software as separate components.

> There are essentially no if(kvm_enabled)'s in our MMU walking code, because the tables are just there. Makes everything a lot easier (without dragging down performance).

We have the same issue with registers.  There we call
cpu_synchronize_state() before every access.  No magic, but we get to
reuse the code just the same.

> > 
> >>> 
> >>> One thing that's different is that virtio offloads itself to a thread
> >>> very quickly, while IDE does a lot of work in vcpu thread context.
> >> 
> >> So it's all about latencies again, which could be reduced at least a fair bit with the scheme I described above. But really, this needs to be prototyped and benchmarked to actually give us data on how fast it would get us.
> > 
> > Simply making qemu issue the request from a thread would be way better. 
> > Something like socketpair mmio, configured for not waiting for the
> > writes to be seen (posted writes) will also help by buffering writes in
> > the socket buffer.
>
> Yup, nice idea. That only works when all parts of a device are actually implemented through the same socket though. 

Right, but that's not an issue.

> Otherwise you could run out of order. So if you have a PCI device with a PIO and an MMIO BAR region, they would both have to be handled through the same socket.

I'm more worried about interactions between hotplug and a device, and
between people issuing unrelated PCI reads to flush writes (not sure
what the hardware semantics are there).  It's easy to get this wrong.

> >>> 
> >>> COWs usually happen from guest userspace, while mmio is usually from the
> >>> guest kernel, so you can switch on that, maybe.
> >> 
> >> Hrm, nice idea. That might fall apart with user space drivers that we might eventually have once vfio turns out to work well, but for the time being it's a nice hack :).
> > 
> > Or nested virt...
>
> Nested virt on ppc with device assignment? And here I thought I was the crazy one of the two of us :)

I don't mind being crazy on somebody else's arch.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-16  1:04             ` Michael Ellerman
@ 2012-02-16 19:28               ` Avi Kivity
  2012-02-17  0:09                 ` Michael Ellerman
  0 siblings, 1 reply; 89+ messages in thread
From: Avi Kivity @ 2012-02-16 19:28 UTC (permalink / raw)
  To: michael
  Cc: Arnd Bergmann, qemu-devel, Alexander Graf, KVM list,
	linux-kernel, Eric Northup, Scott Wood

On 02/16/2012 03:04 AM, Michael Ellerman wrote:
> > 
> > ioctl is good for hardware devices and stuff that you want to enumerate
> > and/or control permissions on. For something like KVM that is really a
> > core kernel service, a syscall makes much more sense.
>
> Yeah maybe. That distinction is at least in part just historical.
>
> The first problem I see with using a syscall is that you don't need one
> syscall for KVM, you need ~90. OK so you wouldn't do that, you'd use a
> multiplexed syscall like epoll_ctl() - or probably several
> (vm/vcpu/etc).

No.  Many of our ioctls are for state save/restore - we reduce that to
two.  Many others are due to the with/without irqchip support - we slash
that as well.  The device assignment stuff is relegated to vfio.

I still have to draw up a concrete proposal, but I think we'll end up
with 10-15.

>
> Secondly you still need a handle/context for those syscalls, and I think
> the most sane thing to use for that is an fd.

The context is the process (for vm-wide calls) and thread (for vcpu
local calls).

>
> At that point you've basically reinvented ioctl :)
>
> I also think it is an advantage that you have a node in /dev for
> permissions. I know other "core kernel" interfaces don't use a /dev
> node, but arguably that is their loss.

Have to agree with that.  Theoretically we don't need permissions for
/dev/kvm, but in practice we do.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-16 19:24                             ` Avi Kivity
@ 2012-02-16 19:34                               ` Alexander Graf
  2012-02-16 19:38                                 ` Avi Kivity
  0 siblings, 1 reply; 89+ messages in thread
From: Alexander Graf @ 2012-02-16 19:34 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc


On 16.02.2012, at 20:24, Avi Kivity wrote:

> On 02/15/2012 04:08 PM, Alexander Graf wrote:
>>> 
>>> Well, the scatter/gather registers I proposed will give you just one
>>> register or all of them.
>> 
>> One register is hardly any use. We either need all ways of a respective address to do a full fledged lookup or all of them. 
> 
> I should have said, just one register, or all of them, or anything in
> between.
> 
>> By sharing the same data structures between qemu and kvm, we actually managed to reuse all of the tcg code for lookups, just like you do for x86.
> 
> Sharing the data structures is not need.  Simply synchronize them before
> lookup, like we do for ordinary registers.

Ordinary registers are a few bytes. We're talking of dozens of kbytes here.

> 
>> On x86 you also have shared memory for page tables, it's just guest visible, hence in guest memory. The concept is the same.
> 
> But cr3 isn't, and if we put it in shared memory, we'd have to VMREAD it
> on every exit.  And you're risking the same thing if your hardware gets
> cleverer.

Yes, we do. When that day comes, we forget the CAP and do it another way. Which way we will find out by the time that day of more clever hardware comes :).

> 
>>> 
>>>>> btw, why are you interested in virtual addresses in userspace at all?
>>>> 
>>>> We need them for gdb and monitor introspection.
>>> 
>>> Hardly fast paths that justify shared memory.  I should be much harder
>>> on you.
>> 
>> It was a tradeoff on speed and complexity. This way we have the least amount of complexity IMHO. All KVM code paths just magically fit in with the TCG code. 
> 
> It's too magical, fitting a random version of a random userspace
> component.  Now you can't change this tcg code (and still keep the magic).
> 
> Some complexity is part of keeping software as separate components.

Why? If another user space wants to use this, they can

a) do the slow copy path
or
b) simply use our struct definitions

The whole copy thing really only makes sense when you have existing code in user space that you don't want to touch, but easily add on KVM to it. If KVM is part of your whole design, then integrating things makes a lot more sense.

> 
>> There are essentially no if(kvm_enabled)'s in our MMU walking code, because the tables are just there. Makes everything a lot easier (without dragging down performance).
> 
> We have the same issue with registers.  There we call
> cpu_synchronize_state() before every access.  No magic, but we get to
> reuse the code just the same.

Yes, and for those few bytes it's ok to do so - most of the time. On s390, even those get shared by now. And it makes sense to do so - if we synchronize it every time anyways, why not do so implicitly?


Alex


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-16 14:46                   ` Anthony Liguori
@ 2012-02-16 19:34                     ` Avi Kivity
  0 siblings, 0 replies; 89+ messages in thread
From: Avi Kivity @ 2012-02-16 19:34 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Gleb Natapov, KVM list, linux-kernel, Rusty Russell, qemu-devel

On 02/16/2012 04:46 PM, Anthony Liguori wrote:
>> What will it buy us? Surely not speed. Entering a guest is not much
>> (if at all) faster than exiting to userspace and any non trivial
>> operation will require exit to userspace anyway,
>
>
> You can emulate the PIT/RTC entirely within the guest using kvmclock
> which doesn't require an additional exit to get the current time base.
>
> So instead of:
>
> 1) guest -> host kernel
> 2) host kernel -> userspace
> 3) implement logic using rdtscp via VDSO
> 4) userspace -> host kernel
> 5) host kernel -> guest
>
> You go:
>
> 1) guest -> host kernel
> 2) host kernel -> guest (with special CR3)
> 3) implement logic using rdtscp + kvmclock page
> 4) change CR3 within guest and RETI to VMEXIT source RIP
>
> Same basic concept as PS/2 emulation with SMM.

Interesting, but unimplementable in practice.  SMM requires a VMEXIT for
RSM, and anything non-SMM wants a virtual address mapping (and some RAM)
which you can't get without guest cooperation.  There are other
complications like an NMI interrupting hypervisor-provided code and
finding unexpected addresses on its stack (SMM at least blocks NMIs).

Tangentially related, Intel introduced a VMFUNC that allows you to
change the guest's physical memory map to a pre-set alternative provided
by the host, without a VMEXIT.  Seems similar to SMM but requires guest
cooperation.  I guess it's for unintrusive virus scanners and the like.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-16 19:34                               ` Alexander Graf
@ 2012-02-16 19:38                                 ` Avi Kivity
  2012-02-16 20:41                                   ` Scott Wood
  2012-02-17  0:19                                   ` Alexander Graf
  0 siblings, 2 replies; 89+ messages in thread
From: Avi Kivity @ 2012-02-16 19:38 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc

On 02/16/2012 09:34 PM, Alexander Graf wrote:
> On 16.02.2012, at 20:24, Avi Kivity wrote:
>
> > On 02/15/2012 04:08 PM, Alexander Graf wrote:
> >>> 
> >>> Well, the scatter/gather registers I proposed will give you just one
> >>> register or all of them.
> >> 
> >> One register is hardly any use. We either need all ways of a respective address to do a full fledged lookup or all of them. 
> > 
> > I should have said, just one register, or all of them, or anything in
> > between.
> > 
> >> By sharing the same data structures between qemu and kvm, we actually managed to reuse all of the tcg code for lookups, just like you do for x86.
> > 
> > Sharing the data structures is not need.  Simply synchronize them before
> > lookup, like we do for ordinary registers.
>
> Ordinary registers are a few bytes. We're talking of dozens of kbytes here.

A TLB way is a few dozen bytes, no?

> > 
> >> On x86 you also have shared memory for page tables, it's just guest visible, hence in guest memory. The concept is the same.
> > 
> > But cr3 isn't, and if we put it in shared memory, we'd have to VMREAD it
> > on every exit.  And you're risking the same thing if your hardware gets
> > cleverer.
>
> Yes, we do. When that day comes, we forget the CAP and do it another way. Which way we will find out by the time that day of more clever hardware comes :).

Or we try to be less clever unless we have a really compelling reason. 
qemu monitor and gdb support aren't compelling reasons to optimize.

> > 
> > It's too magical, fitting a random version of a random userspace
> > component.  Now you can't change this tcg code (and still keep the magic).
> > 
> > Some complexity is part of keeping software as separate components.
>
> Why? If another user space wants to use this, they can
>
> a) do the slow copy path
> or
> b) simply use our struct definitions
>
> The whole copy thing really only makes sense when you have existing code in user space that you don't want to touch, but easily add on KVM to it. If KVM is part of your whole design, then integrating things makes a lot more sense.

Yeah, I guess.

>
> > 
> >> There are essentially no if(kvm_enabled)'s in our MMU walking code, because the tables are just there. Makes everything a lot easier (without dragging down performance).
> > 
> > We have the same issue with registers.  There we call
> > cpu_synchronize_state() before every access.  No magic, but we get to
> > reuse the code just the same.
>
> Yes, and for those few bytes it's ok to do so - most of the time. On s390, even those get shared by now. And it makes sense to do so - if we synchronize it every time anyways, why not do so implicitly?
>

At least on x86, we synchronize only rarely.



-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-16 19:38                                 ` Avi Kivity
@ 2012-02-16 20:41                                   ` Scott Wood
  2012-02-17  0:23                                     ` Alexander Graf
  2012-02-18  9:49                                     ` Avi Kivity
  2012-02-17  0:19                                   ` Alexander Graf
  1 sibling, 2 replies; 89+ messages in thread
From: Scott Wood @ 2012-02-16 20:41 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alexander Graf, Anthony Liguori, KVM list, linux-kernel,
	qemu-devel, kvm-ppc

On 02/16/2012 01:38 PM, Avi Kivity wrote:
> On 02/16/2012 09:34 PM, Alexander Graf wrote:
>> On 16.02.2012, at 20:24, Avi Kivity wrote:
>>
>>> On 02/15/2012 04:08 PM, Alexander Graf wrote:
>>>>>
>>>>> Well, the scatter/gather registers I proposed will give you just one
>>>>> register or all of them.
>>>>
>>>> One register is hardly any use. We either need all ways of a respective address to do a full fledged lookup or all of them. 
>>>
>>> I should have said, just one register, or all of them, or anything in
>>> between.
>>>
>>>> By sharing the same data structures between qemu and kvm, we actually managed to reuse all of the tcg code for lookups, just like you do for x86.
>>>
>>> Sharing the data structures is not need.  Simply synchronize them before
>>> lookup, like we do for ordinary registers.
>>
>> Ordinary registers are a few bytes. We're talking of dozens of kbytes here.
> 
> A TLB way is a few dozen bytes, no?

I think you mean a TLB set... but the TLB (or part of it) may be fully
associative.

On e500mc, it's 24 bytes for one TLB entry, and you'd need 4 entries for
a set of TLB0, and all 64 entries in TLB1.  So 1632 bytes total.

Then we'd need to deal with tracking whether we synchronized one or more
specific sets, or everything (for migration or debug TLB dump).  The
request to synchronize would have to come from within the QEMU MMU code,
since that's the point where we know what to ask for (unless we
duplicate the logic elsewhere).  I'm not sure that reusing the standard
QEMU MMU code for individual debug address translation is really
simplifying things...

And yes, we do have fancier hardware coming fairly soon for which this
breaks (TLB0 entries can be loaded without host involvement, as long as
there's a translation from guest physical to physical in a separate
hardware table).  It'd be reasonable to ignore TLB0 for migration (treat
it as invalidated), but not for debug since that may be where the
translation we're interested in resides.

-Scott


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-16 19:28               ` Avi Kivity
@ 2012-02-17  0:09                 ` Michael Ellerman
  2012-02-18 10:03                   ` Avi Kivity
  0 siblings, 1 reply; 89+ messages in thread
From: Michael Ellerman @ 2012-02-17  0:09 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Arnd Bergmann, qemu-devel, Alexander Graf, KVM list,
	linux-kernel, Eric Northup, Scott Wood

[-- Attachment #1: Type: text/plain, Size: 1871 bytes --]

On Thu, 2012-02-16 at 21:28 +0200, Avi Kivity wrote:
> On 02/16/2012 03:04 AM, Michael Ellerman wrote:
> > > 
> > > ioctl is good for hardware devices and stuff that you want to enumerate
> > > and/or control permissions on. For something like KVM that is really a
> > > core kernel service, a syscall makes much more sense.
> >
> > Yeah maybe. That distinction is at least in part just historical.
> >
> > The first problem I see with using a syscall is that you don't need one
> > syscall for KVM, you need ~90. OK so you wouldn't do that, you'd use a
> > multiplexed syscall like epoll_ctl() - or probably several
> > (vm/vcpu/etc).
> 
> No.  Many of our ioctls are for state save/restore - we reduce that to
> two.  Many others are due to the with/without irqchip support - we slash
> that as well.  The device assignment stuff is relegated to vfio.
> 
> I still have to draw up a concrete proposal, but I think we'll end up
> with 10-15.

That's true, you certainly could reduce it, though by how much I'm not
sure. On powerpc I'm working on moving the irq controller emulation into
the kernel, and some associated firmware emulation, so that's at least
one new ioctl. And there will always be more, whatever scheme you have
must be easily extensible - ie. not requiring new syscalls for each new
weird platform.

> > Secondly you still need a handle/context for those syscalls, and I think
> > the most sane thing to use for that is an fd.
> 
> The context is the process (for vm-wide calls) and thread (for vcpu
> local calls).

Yeah OK I forgot you'd mentioned that. But isn't that change basically
orthogonal to how you get into the kernel? ie. we could have the
kvm/vcpu pointers in mm_struct/task_struct today?

I guess it wouldn't win you much though because you still have the fd
and ioctl overhead as well.

cheers

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-16 19:38                                 ` Avi Kivity
  2012-02-16 20:41                                   ` Scott Wood
@ 2012-02-17  0:19                                   ` Alexander Graf
  2012-02-18 10:00                                     ` Avi Kivity
  1 sibling, 1 reply; 89+ messages in thread
From: Alexander Graf @ 2012-02-17  0:19 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc


On 16.02.2012, at 20:38, Avi Kivity wrote:

> On 02/16/2012 09:34 PM, Alexander Graf wrote:
>> On 16.02.2012, at 20:24, Avi Kivity wrote:
>> 
>>> On 02/15/2012 04:08 PM, Alexander Graf wrote:
>>>>> 
>>>>> Well, the scatter/gather registers I proposed will give you just one
>>>>> register or all of them.
>>>> 
>>>> One register is hardly any use. We either need all ways of a respective address to do a full fledged lookup or all of them. 
>>> 
>>> I should have said, just one register, or all of them, or anything in
>>> between.
>>> 
>>>> By sharing the same data structures between qemu and kvm, we actually managed to reuse all of the tcg code for lookups, just like you do for x86.
>>> 
>>> Sharing the data structures is not need.  Simply synchronize them before
>>> lookup, like we do for ordinary registers.
>> 
>> Ordinary registers are a few bytes. We're talking of dozens of kbytes here.
> 
> A TLB way is a few dozen bytes, no?
> 
>>> 
>>>> On x86 you also have shared memory for page tables, it's just guest visible, hence in guest memory. The concept is the same.
>>> 
>>> But cr3 isn't, and if we put it in shared memory, we'd have to VMREAD it
>>> on every exit.  And you're risking the same thing if your hardware gets
>>> cleverer.
>> 
>> Yes, we do. When that day comes, we forget the CAP and do it another way. Which way we will find out by the time that day of more clever hardware comes :).
> 
> Or we try to be less clever unless we have a really compelling reason. 
> qemu monitor and gdb support aren't compelling reasons to optimize.

The goal here was simplicity with a grain of performance concerns.

So what would you be envisioning? Should we make all of the MMU walker code in target-ppc KVM aware so it fetches that single way it actually cares about on demand from the kernel? That is pretty intrusive and goes against the general nicely fitting in principle of how KVM integrates today.

Also, we need to store the guest TLB somewhere. With this model, we can just store it in user space memory, so we keep only a single copy around, reducing memory footprint. If we had to copy it, we would need more than a single copy.

> 
>>> 
>>> It's too magical, fitting a random version of a random userspace
>>> component.  Now you can't change this tcg code (and still keep the magic).
>>> 
>>> Some complexity is part of keeping software as separate components.
>> 
>> Why? If another user space wants to use this, they can
>> 
>> a) do the slow copy path
>> or
>> b) simply use our struct definitions
>> 
>> The whole copy thing really only makes sense when you have existing code in user space that you don't want to touch, but easily add on KVM to it. If KVM is part of your whole design, then integrating things makes a lot more sense.
> 
> Yeah, I guess.
> 
>> 
>>> 
>>>> There are essentially no if(kvm_enabled)'s in our MMU walking code, because the tables are just there. Makes everything a lot easier (without dragging down performance).
>>> 
>>> We have the same issue with registers.  There we call
>>> cpu_synchronize_state() before every access.  No magic, but we get to
>>> reuse the code just the same.
>> 
>> Yes, and for those few bytes it's ok to do so - most of the time. On s390, even those get shared by now. And it makes sense to do so - if we synchronize it every time anyways, why not do so implicitly?
>> 
> 
> At least on x86, we synchronize only rarely.

Yeah, on s390 we only know which registers actually contain the information we need for traps / hypercalls when in user space, since that's where the decoding happens. So we better have all GPRs available to read from and write to.


Alex


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-16 20:41                                   ` Scott Wood
@ 2012-02-17  0:23                                     ` Alexander Graf
  2012-02-17 18:27                                       ` Scott Wood
  2012-02-18  9:49                                     ` Avi Kivity
  1 sibling, 1 reply; 89+ messages in thread
From: Alexander Graf @ 2012-02-17  0:23 UTC (permalink / raw)
  To: Scott Wood
  Cc: Avi Kivity, Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc


On 16.02.2012, at 21:41, Scott Wood wrote:

> On 02/16/2012 01:38 PM, Avi Kivity wrote:
>> On 02/16/2012 09:34 PM, Alexander Graf wrote:
>>> On 16.02.2012, at 20:24, Avi Kivity wrote:
>>> 
>>>> On 02/15/2012 04:08 PM, Alexander Graf wrote:
>>>>>> 
>>>>>> Well, the scatter/gather registers I proposed will give you just one
>>>>>> register or all of them.
>>>>> 
>>>>> One register is hardly any use. We either need all ways of a respective address to do a full fledged lookup or all of them. 
>>>> 
>>>> I should have said, just one register, or all of them, or anything in
>>>> between.
>>>> 
>>>>> By sharing the same data structures between qemu and kvm, we actually managed to reuse all of the tcg code for lookups, just like you do for x86.
>>>> 
>>>> Sharing the data structures is not need.  Simply synchronize them before
>>>> lookup, like we do for ordinary registers.
>>> 
>>> Ordinary registers are a few bytes. We're talking of dozens of kbytes here.
>> 
>> A TLB way is a few dozen bytes, no?
> 
> I think you mean a TLB set... but the TLB (or part of it) may be fully
> associative.
> 
> On e500mc, it's 24 bytes for one TLB entry, and you'd need 4 entries for
> a set of TLB0, and all 64 entries in TLB1.  So 1632 bytes total.
> 
> Then we'd need to deal with tracking whether we synchronized one or more
> specific sets, or everything (for migration or debug TLB dump).  The
> request to synchronize would have to come from within the QEMU MMU code,
> since that's the point where we know what to ask for (unless we
> duplicate the logic elsewhere).  I'm not sure that reusing the standard
> QEMU MMU code for individual debug address translation is really
> simplifying things...
> 
> And yes, we do have fancier hardware coming fairly soon for which this
> breaks (TLB0 entries can be loaded without host involvement, as long as
> there's a translation from guest physical to physical in a separate
> hardware table).  It'd be reasonable to ignore TLB0 for migration (treat
> it as invalidated), but not for debug since that may be where the
> translation we're interested in resides.

Could we maybe add an ioctl that forces kvm to read out the current tlb0 contents and push them to memory? How slow would that be?


Alex


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-17  0:23                                     ` Alexander Graf
@ 2012-02-17 18:27                                       ` Scott Wood
  0 siblings, 0 replies; 89+ messages in thread
From: Scott Wood @ 2012-02-17 18:27 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Avi Kivity, Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc

On 02/16/2012 06:23 PM, Alexander Graf wrote:
> On 16.02.2012, at 21:41, Scott Wood wrote:
>> And yes, we do have fancier hardware coming fairly soon for which this
>> breaks (TLB0 entries can be loaded without host involvement, as long as
>> there's a translation from guest physical to physical in a separate
>> hardware table).  It'd be reasonable to ignore TLB0 for migration (treat
>> it as invalidated), but not for debug since that may be where the
>> translation we're interested in resides.
> 
> Could we maybe add an ioctl that forces kvm to read out the current tlb0 contents and push them to memory? How slow would that be?

Yes, I was thinking something like that.  We'd just have to remove (make
conditional on MMU type) the statement that this is synchronized
implicitly on return from vcpu_run.

Performance shouldn't be a problem -- we'd only need to sync once and
then can do all the repeated debug accesses we want.  So should be no
need to mess around with partial sync.

-Scott


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-16 20:41                                   ` Scott Wood
  2012-02-17  0:23                                     ` Alexander Graf
@ 2012-02-18  9:49                                     ` Avi Kivity
  1 sibling, 0 replies; 89+ messages in thread
From: Avi Kivity @ 2012-02-18  9:49 UTC (permalink / raw)
  To: Scott Wood
  Cc: Alexander Graf, Anthony Liguori, KVM list, linux-kernel,
	qemu-devel, kvm-ppc

On 02/16/2012 10:41 PM, Scott Wood wrote:
> >>> Sharing the data structures is not need.  Simply synchronize them before
> >>> lookup, like we do for ordinary registers.
> >>
> >> Ordinary registers are a few bytes. We're talking of dozens of kbytes here.
> > 
> > A TLB way is a few dozen bytes, no?
>
> I think you mean a TLB set... 

Yes, thanks.

> but the TLB (or part of it) may be fully
> associative.

A fully associative TLB has to be very small.

> On e500mc, it's 24 bytes for one TLB entry, and you'd need 4 entries for
> a set of TLB0, and all 64 entries in TLB1.  So 1632 bytes total.

Syncing this every time you need a translation (for gdb or the monitor)
is trivial in terms of performance.

> Then we'd need to deal with tracking whether we synchronized one or more
> specific sets, or everything (for migration or debug TLB dump).  The
> request to synchronize would have to come from within the QEMU MMU code,
> since that's the point where we know what to ask for (unless we
> duplicate the logic elsewhere).  I'm not sure that reusing the standard
> QEMU MMU code for individual debug address translation is really
> simplifying things...
>
> And yes, we do have fancier hardware coming fairly soon for which this
> breaks (TLB0 entries can be loaded without host involvement, as long as
> there's a translation from guest physical to physical in a separate
> hardware table).  It'd be reasonable to ignore TLB0 for migration (treat
> it as invalidated), but not for debug since that may be where the
> translation we're interested in resides.
>

So with this new hardware, the always-sync API breaks.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-17  0:19                                   ` Alexander Graf
@ 2012-02-18 10:00                                     ` Avi Kivity
  2012-02-18 10:43                                       ` Alexander Graf
  0 siblings, 1 reply; 89+ messages in thread
From: Avi Kivity @ 2012-02-18 10:00 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc

On 02/17/2012 02:19 AM, Alexander Graf wrote:
> > 
> > Or we try to be less clever unless we have a really compelling reason. 
> > qemu monitor and gdb support aren't compelling reasons to optimize.
>
> The goal here was simplicity with a grain of performance concerns.
>

Shared memory is simple in one way, but in other ways it is more
complicated since it takes away the kernel's freedom in how it manages
the data, how it's laid out, and whether it can lazify things or not.

> So what would you be envisioning? Should we make all of the MMU walker code in target-ppc KVM aware so it fetches that single way it actually cares about on demand from the kernel? That is pretty intrusive and goes against the general nicely fitting in principle of how KVM integrates today.

First, it's trivial, when you access a set you call
cpu_synchronize_tlb(set), just like how you access the registers when
you want them.

Second, and more important, how a random version of qemu works is
totally immaterial to the kvm userspace interface.  qemu could change in
15 different ways and so could the kernel, and other users exist. 
Fitting into qemu's current model is not a goal (if qemu happens to have
a good model, use it by all means; and clashing with qemu is likely an
indication the something is wrong -- but the two projects need to be
decoupled).

> Also, we need to store the guest TLB somewhere. With this model, we can just store it in user space memory, so we keep only a single copy around, reducing memory footprint. If we had to copy it, we would need more than a single copy.

That's the whole point.  You could store it on the cpu hardware, if the
cpu allows it.  Forcing it into always-synchronized shared memory takes
that ability away from you.

>  
> > 
> > At least on x86, we synchronize only rarely.
>
> Yeah, on s390 we only know which registers actually contain the information we need for traps / hypercalls when in user space, since that's where the decoding happens. So we better have all GPRs available to read from and write to.
>

Ok.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-17  0:09                 ` Michael Ellerman
@ 2012-02-18 10:03                   ` Avi Kivity
  0 siblings, 0 replies; 89+ messages in thread
From: Avi Kivity @ 2012-02-18 10:03 UTC (permalink / raw)
  To: michael
  Cc: Arnd Bergmann, qemu-devel, Alexander Graf, KVM list,
	linux-kernel, Eric Northup, Scott Wood

On 02/17/2012 02:09 AM, Michael Ellerman wrote:
> On Thu, 2012-02-16 at 21:28 +0200, Avi Kivity wrote:
> > On 02/16/2012 03:04 AM, Michael Ellerman wrote:
> > > > 
> > > > ioctl is good for hardware devices and stuff that you want to enumerate
> > > > and/or control permissions on. For something like KVM that is really a
> > > > core kernel service, a syscall makes much more sense.
> > >
> > > Yeah maybe. That distinction is at least in part just historical.
> > >
> > > The first problem I see with using a syscall is that you don't need one
> > > syscall for KVM, you need ~90. OK so you wouldn't do that, you'd use a
> > > multiplexed syscall like epoll_ctl() - or probably several
> > > (vm/vcpu/etc).
> > 
> > No.  Many of our ioctls are for state save/restore - we reduce that to
> > two.  Many others are due to the with/without irqchip support - we slash
> > that as well.  The device assignment stuff is relegated to vfio.
> > 
> > I still have to draw up a concrete proposal, but I think we'll end up
> > with 10-15.
>
> That's true, you certainly could reduce it, though by how much I'm not
> sure. On powerpc I'm working on moving the irq controller emulation into
> the kernel, and some associated firmware emulation, so that's at least
> one new ioctl. And there will always be more, whatever scheme you have
> must be easily extensible - ie. not requiring new syscalls for each new
> weird platform.

Most of it falls into read/write state, which is covered by two
syscalls.  There's probably need for configuration (wiring etc.); we
could call that pseudo-state with fake registers but I don't like that
very much.


> > > Secondly you still need a handle/context for those syscalls, and I think
> > > the most sane thing to use for that is an fd.
> > 
> > The context is the process (for vm-wide calls) and thread (for vcpu
> > local calls).
>
> Yeah OK I forgot you'd mentioned that. But isn't that change basically
> orthogonal to how you get into the kernel? ie. we could have the
> kvm/vcpu pointers in mm_struct/task_struct today?
>
> I guess it wouldn't win you much though because you still have the fd
> and ioctl overhead as well.
>

Yes.  I also dislike bypassing ioctl semantics (though we already do
that by requiring vcpus to stay on the same thread and vms on the same
process).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-18 10:00                                     ` Avi Kivity
@ 2012-02-18 10:43                                       ` Alexander Graf
  0 siblings, 0 replies; 89+ messages in thread
From: Alexander Graf @ 2012-02-18 10:43 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Anthony Liguori, KVM list, linux-kernel, qemu-devel, kvm-ppc



On 18.02.2012, at 11:00, Avi Kivity <avi@redhat.com> wrote:

> On 02/17/2012 02:19 AM, Alexander Graf wrote:
>>> 
>>> Or we try to be less clever unless we have a really compelling reason. 
>>> qemu monitor and gdb support aren't compelling reasons to optimize.
>> 
>> The goal here was simplicity with a grain of performance concerns.
>> 
> 
> Shared memory is simple in one way, but in other ways it is more
> complicated since it takes away the kernel's freedom in how it manages
> the data, how it's laid out, and whether it can lazify things or not.

Yes and no. Shared memory is a means of transferring data. If it's implemented by copying internally or by implicit sychronization is orthogonal to that.

With the interface as is, we can now on newer CPUs (which need changes to user space to work anyways) take the current interface and add a new CAP + ioctl that allows us to force flush the TLYb into the shared buffer. That way we maintain backwards compatibility, memory savings, no in kernel vmalloc cluttering etc. on all CPUs, but get the checkpoint to actually have useful contents for new CPUs.

I don't see the problem really. The data is the architected layout of the TLB. It contains all the data that can possibly make up a TLB entry according to the booke spec. If we wanted to copy different data, we'd need a different ioctl too.

> 
>> So what would you be envisioning? Should we make all of the MMU walker code in target-ppc KVM aware so it fetches that single way it actually cares about on demand from the kernel? That is pretty intrusive and goes against the general nicely fitting in principle of how KVM integrates today.
> 
> First, it's trivial, when you access a set you call
> cpu_synchronize_tlb(set), just like how you access the registers when
> you want them.

Yes, which is reasonably intrusive and going to be necessary with LRAT.

> 
> Second, and more important, how a random version of qemu works is
> totally immaterial to the kvm userspace interface.  qemu could change in
> 15 different ways and so could the kernel, and other users exist. 
> Fitting into qemu's current model is not a goal (if qemu happens to have
> a good model, use it by all means; and clashing with qemu is likely an
> indication the something is wrong -- but the two projects need to be
> decoupled).

Sure. In fact, in this case, the two were developed together. QEMU didn't have support for this specific TLB type, so we combined the development efforts. This way any new user space has a very easy time to implement it too, because we didn't model the KVM parts after QEMU, but the QEMU parts after KVM.

I still think it holds true that the KVM interface is very easy to plug in to any random emulation project. And to achieve that, the interface should be as little intrusive as possible wrt its requirements. The one we have seemed to fit that pretty well. Sure, we need a special flush command for newer CPUs, but at least we don't have to always copy. We only copy when we need to.

> 
>> Also, we need to store the guest TLB somewhere. With this model, we can just store it in user space memory, so we keep only a single copy around, reducing memory footprint. If we had to copy it, we would need more than a single copy.
> 
> That's the whole point.  You could store it on the cpu hardware, if the
> cpu allows it.  Forcing it into always-synchronized shared memory takes
> that ability away from you.

Yup. So the correct comment to make would be "don't make the shared TLB always synchronized", which I agree with today. I still think that the whole idea of passing kvm user space memory to work on is great. It reduces vmalloc footprint, it reduces copying, and it keeps data at one place, reducing chances to mess up.

Having it defined to always be in sync was a mistake, but one we can easily fix. That's why the CAP and ioctl interfaces are so awesome ;). I strongly believe that I can't predict the future. So designing an interface that holds stable for the next 10 years is close to imposdible. with an easily extensible interface however, it becomes almost trivial tk fix earlier messups ;).


Alex


^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [Qemu-devel] [RFC] Next gen kvm api
  2012-02-04  2:08   ` Takuya Yoshikawa
@ 2012-02-22 13:06     ` Peter Zijlstra
  0 siblings, 0 replies; 89+ messages in thread
From: Peter Zijlstra @ 2012-02-22 13:06 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: Anthony Liguori, Avi Kivity, KVM list, linux-kernel, qemu-devel

On Sat, 2012-02-04 at 11:08 +0900, Takuya Yoshikawa wrote:
> The latter needs a fundamental change:  I heard (from Avi) that we can
> change mmu_lock to mutex_lock if mmu_notifier becomes preemptible.
> 
> So I was planning to restart this work when Peter's
>         "mm: Preemptibility"
>         http://lkml.org/lkml/2011/4/1/141
> gets finished. 


That got merged a while ago:

# git describe --contains d16dfc550f5326a4000f3322582a7c05dec91d7a --match "v*"
v3.0-rc1~275

While I still need to get back to unifying mmu_gather across
architectures the whole thing is currently preemptible.

^ permalink raw reply	[flat|nested] 89+ messages in thread

end of thread, other threads:[~2012-02-22 13:06 UTC | newest]

Thread overview: 89+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-02 16:09 [RFC] Next gen kvm api Avi Kivity
     [not found] ` <CAB9FdM9M2DWXBxxyG-ez_5igT61x5b7ptw+fKfgaqMBU_JS5aA@mail.gmail.com>
2012-02-02 22:16   ` [Qemu-devel] " Rob Earhart
2012-02-05 13:14   ` Avi Kivity
2012-02-06 17:41     ` Rob Earhart
2012-02-06 19:11       ` Anthony Liguori
2012-02-07 12:03         ` Avi Kivity
2012-02-07 15:17           ` Anthony Liguori
2012-02-07 16:02             ` Avi Kivity
2012-02-07 16:18               ` Jan Kiszka
2012-02-07 16:21                 ` Anthony Liguori
2012-02-07 16:29                   ` Jan Kiszka
2012-02-15 13:41                     ` Avi Kivity
2012-02-07 16:19               ` Anthony Liguori
2012-02-15 13:47                 ` Avi Kivity
2012-02-07 12:01       ` Avi Kivity
2012-02-03  2:09 ` Anthony Liguori
2012-02-04  2:08   ` Takuya Yoshikawa
2012-02-22 13:06     ` Peter Zijlstra
2012-02-05  9:24   ` Avi Kivity
2012-02-07  1:08   ` Alexander Graf
2012-02-07 12:24     ` Avi Kivity
2012-02-07 12:51       ` Alexander Graf
2012-02-07 13:16         ` Avi Kivity
2012-02-07 13:40           ` Alexander Graf
2012-02-07 14:21             ` Avi Kivity
2012-02-07 14:39               ` Alexander Graf
2012-02-15 11:18                 ` Avi Kivity
2012-02-15 11:57                   ` Alexander Graf
2012-02-15 13:29                     ` Avi Kivity
2012-02-15 13:37                       ` Alexander Graf
2012-02-15 13:57                         ` Avi Kivity
2012-02-15 14:08                           ` Alexander Graf
2012-02-16 19:24                             ` Avi Kivity
2012-02-16 19:34                               ` Alexander Graf
2012-02-16 19:38                                 ` Avi Kivity
2012-02-16 20:41                                   ` Scott Wood
2012-02-17  0:23                                     ` Alexander Graf
2012-02-17 18:27                                       ` Scott Wood
2012-02-18  9:49                                     ` Avi Kivity
2012-02-17  0:19                                   ` Alexander Graf
2012-02-18 10:00                                     ` Avi Kivity
2012-02-18 10:43                                       ` Alexander Graf
2012-02-15 19:17                     ` Scott Wood
2012-02-12  7:10               ` Takuya Yoshikawa
2012-02-15 13:32                 ` Avi Kivity
2012-02-07 15:23             ` Anthony Liguori
2012-02-07 15:28               ` Alexander Graf
2012-02-08 17:20               ` Alan Cox
2012-02-15 13:33               ` Avi Kivity
2012-02-15 22:14             ` Arnd Bergmann
2012-02-10  3:07   ` Jamie Lokier
2012-02-03 18:07 ` Eric Northup
2012-02-03 22:52   ` [Qemu-devel] " Anthony Liguori
2012-02-06 19:46     ` Scott Wood
2012-02-07  6:58       ` Michael Ellerman
2012-02-07 10:04         ` Alexander Graf
2012-02-15 22:21           ` Arnd Bergmann
2012-02-16  1:04             ` Michael Ellerman
2012-02-16 19:28               ` Avi Kivity
2012-02-17  0:09                 ` Michael Ellerman
2012-02-18 10:03                   ` Avi Kivity
2012-02-16 10:26             ` Avi Kivity
2012-02-07 12:28       ` Anthony Liguori
2012-02-07 12:40         ` Avi Kivity
2012-02-07 12:51           ` Anthony Liguori
2012-02-07 13:18             ` Avi Kivity
2012-02-07 15:15               ` Anthony Liguori
2012-02-07 18:28                 ` Chris Wright
2012-02-08 17:02         ` Scott Wood
2012-02-08 17:12           ` Alan Cox
2012-02-05  9:37 ` Gleb Natapov
2012-02-05  9:44   ` Avi Kivity
2012-02-05  9:51     ` Gleb Natapov
2012-02-05  9:56       ` Avi Kivity
2012-02-05 10:58         ` Gleb Natapov
2012-02-05 13:16           ` Avi Kivity
2012-02-05 16:36       ` [Qemu-devel] " Anthony Liguori
2012-02-06  9:34         ` Avi Kivity
2012-02-06 13:33           ` Anthony Liguori
2012-02-06 13:54             ` Avi Kivity
2012-02-06 14:00               ` Anthony Liguori
2012-02-06 14:08                 ` Avi Kivity
2012-02-07 18:12           ` Rusty Russell
2012-02-15 13:39             ` Avi Kivity
2012-02-15 21:59               ` Anthony Liguori
2012-02-16  8:57                 ` Gleb Natapov
2012-02-16 14:46                   ` Anthony Liguori
2012-02-16 19:34                     ` Avi Kivity
2012-02-15 23:08               ` Rusty Russell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).