linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC] Next gen kvm api
@ 2012-02-02 16:09 Avi Kivity
       [not found] ` <CAB9FdM9M2DWXBxxyG-ez_5igT61x5b7ptw+fKfgaqMBU_JS5aA@mail.gmail.com>
                   ` (3 more replies)
  0 siblings, 4 replies; 89+ messages in thread
From: Avi Kivity @ 2012-02-02 16:09 UTC (permalink / raw)
  To: KVM list; +Cc: linux-kernel, qemu-devel

The kvm api has been accumulating cruft for several years now.  This is
due to feature creep, fixing mistakes, experience gained by the
maintainers and developers on how to do things, ports to new
architectures, and simply as a side effect of a code base that is
developed slowly and incrementally.

While I don't think we can justify a complete revamp of the API now, I'm
writing this as a thought experiment to see where a from-scratch API can
take us.  Of course, if we do implement this, the new and old APIs will
have to be supported side by side for several years.

Syscalls
--------
kvm currently uses the much-loved ioctl() system call as its entry
point.  While this made it easy to add kvm to the kernel unintrusively,
it does have downsides:

- overhead in the entry path, for the ioctl dispatch path and vcpu mutex
(low but measurable)
- semantic mismatch: kvm really wants a vcpu to be tied to a thread, and
a vm to be tied to an mm_struct, but the current API ties them to file
descriptors, which can move between threads and processes.  We check
that they don't, but we don't want to.

Moving to syscalls avoids these problems, but introduces new ones:

- adding new syscalls is generally frowned upon, and kvm will need several
- syscalls into modules are harder and rarer than into core kernel code
- will need to add a vcpu pointer to task_struct, and a kvm pointer to
mm_struct

Syscalls that operate on the entire guest will pick it up implicitly
from the mm_struct, and syscalls that operate on a vcpu will pick it up
from current.

State accessors
---------------
Currently vcpu state is read and written by a bunch of ioctls that
access register sets that were added (or discovered) along the years. 
Some state is stored in the vcpu mmap area.  These will be replaced by a
pair of syscalls that read or write the entire state, or a subset of the
state, in a tag/value format.  A register will be described by a tuple:

  set: the register set to which it belongs; either a real set (GPR,
x87, SSE/AVX, segment, cpuid, MSRs) or a fake set (for
eflags/rip/IDT/interrupt shadow/pending exception/etc.)
  number: register number within a set
  size: for self-description, and to allow expanding registers like
SSE->AVX or eax->rax
  attributes: read-write, read-only, read-only for guest but read-write
for host
  value

Device model
------------
Currently kvm virtualizes or emulates a set of x86 cores, with or
without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
PCI devices assigned from the host.  The API allows emulating the local
APICs in userspace.

The new API will do away with the IOAPIC/PIC/PIT emulation and defer
them to userspace.  Note: this may cause a regression for older guests
that don't support MSI or kvmclock.  Device assignment will be done
using VFIO, that is, without direct kvm involvement.

Local APICs will be mandatory, but it will be possible to hide them from
the guest.  This means that it will no longer be possible to emulate an
APIC in userspace, but it will be possible to virtualize an APIC-less
core - userspace will play with the LINT0/LINT1 inputs (configured as
EXITINT and NMI) to queue interrupts and NMIs.

The communications between the local APIC and the IOAPIC/PIC will be
done over a socketpair, emulating the APIC bus protocol.

Ioeventfd/irqfd
---------------
As the ioeventfd/irqfd mechanism has been quite successful, it will be
retained, and perhaps supplemented with a way to assign an mmio region
to a socketpair carrying transactions.  This allows a device model to be
implemented out-of-process.  The socketpair can also be used to
implement a replacement for coalesced mmio, by not waiting for responses
on write transactions when enabled.  Synchronization of coalesced mmio
will be implemented in the kernel, not userspace as now: when a
non-coalesced mmio is needed, the kernel will first flush the coalesced
mmio queue(s).

Guest memory management
-----------------------
Instead of managing each memory slot individually, a single API will be
provided that replaces the entire guest physical memory map atomically. 
This matches the implementation (using RCU) and plugs holes in the
current API, where you lose the dirty log in the window between the last
call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
that removes the slot.

Slot-based dirty logging will be replaced by range-based and work-based
dirty logging; that is "what pages are dirty in this range, which may be
smaller than a slot" and "don't return more than N pages".

We may want to place the log in user memory instead of kernel memory, to
reduce pinned memory and increase flexibility.

vcpu fd mmap area
-----------------
Currently we mmap() a few pages of the vcpu fd for fast user/kernel
communications.  This will be replaced by a more orthodox pointer
parameter to sys_kvm_enter_guest(), that will be accessed using
get_user() and put_user().  This is slower than the current situation,
but better for things like strace.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 89+ messages in thread

end of thread, other threads:[~2012-02-22 13:06 UTC | newest]

Thread overview: 89+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-02 16:09 [RFC] Next gen kvm api Avi Kivity
     [not found] ` <CAB9FdM9M2DWXBxxyG-ez_5igT61x5b7ptw+fKfgaqMBU_JS5aA@mail.gmail.com>
2012-02-02 22:16   ` [Qemu-devel] " Rob Earhart
2012-02-05 13:14   ` Avi Kivity
2012-02-06 17:41     ` Rob Earhart
2012-02-06 19:11       ` Anthony Liguori
2012-02-07 12:03         ` Avi Kivity
2012-02-07 15:17           ` Anthony Liguori
2012-02-07 16:02             ` Avi Kivity
2012-02-07 16:18               ` Jan Kiszka
2012-02-07 16:21                 ` Anthony Liguori
2012-02-07 16:29                   ` Jan Kiszka
2012-02-15 13:41                     ` Avi Kivity
2012-02-07 16:19               ` Anthony Liguori
2012-02-15 13:47                 ` Avi Kivity
2012-02-07 12:01       ` Avi Kivity
2012-02-03  2:09 ` Anthony Liguori
2012-02-04  2:08   ` Takuya Yoshikawa
2012-02-22 13:06     ` Peter Zijlstra
2012-02-05  9:24   ` Avi Kivity
2012-02-07  1:08   ` Alexander Graf
2012-02-07 12:24     ` Avi Kivity
2012-02-07 12:51       ` Alexander Graf
2012-02-07 13:16         ` Avi Kivity
2012-02-07 13:40           ` Alexander Graf
2012-02-07 14:21             ` Avi Kivity
2012-02-07 14:39               ` Alexander Graf
2012-02-15 11:18                 ` Avi Kivity
2012-02-15 11:57                   ` Alexander Graf
2012-02-15 13:29                     ` Avi Kivity
2012-02-15 13:37                       ` Alexander Graf
2012-02-15 13:57                         ` Avi Kivity
2012-02-15 14:08                           ` Alexander Graf
2012-02-16 19:24                             ` Avi Kivity
2012-02-16 19:34                               ` Alexander Graf
2012-02-16 19:38                                 ` Avi Kivity
2012-02-16 20:41                                   ` Scott Wood
2012-02-17  0:23                                     ` Alexander Graf
2012-02-17 18:27                                       ` Scott Wood
2012-02-18  9:49                                     ` Avi Kivity
2012-02-17  0:19                                   ` Alexander Graf
2012-02-18 10:00                                     ` Avi Kivity
2012-02-18 10:43                                       ` Alexander Graf
2012-02-15 19:17                     ` Scott Wood
2012-02-12  7:10               ` Takuya Yoshikawa
2012-02-15 13:32                 ` Avi Kivity
2012-02-07 15:23             ` Anthony Liguori
2012-02-07 15:28               ` Alexander Graf
2012-02-08 17:20               ` Alan Cox
2012-02-15 13:33               ` Avi Kivity
2012-02-15 22:14             ` Arnd Bergmann
2012-02-10  3:07   ` Jamie Lokier
2012-02-03 18:07 ` Eric Northup
2012-02-03 22:52   ` [Qemu-devel] " Anthony Liguori
2012-02-06 19:46     ` Scott Wood
2012-02-07  6:58       ` Michael Ellerman
2012-02-07 10:04         ` Alexander Graf
2012-02-15 22:21           ` Arnd Bergmann
2012-02-16  1:04             ` Michael Ellerman
2012-02-16 19:28               ` Avi Kivity
2012-02-17  0:09                 ` Michael Ellerman
2012-02-18 10:03                   ` Avi Kivity
2012-02-16 10:26             ` Avi Kivity
2012-02-07 12:28       ` Anthony Liguori
2012-02-07 12:40         ` Avi Kivity
2012-02-07 12:51           ` Anthony Liguori
2012-02-07 13:18             ` Avi Kivity
2012-02-07 15:15               ` Anthony Liguori
2012-02-07 18:28                 ` Chris Wright
2012-02-08 17:02         ` Scott Wood
2012-02-08 17:12           ` Alan Cox
2012-02-05  9:37 ` Gleb Natapov
2012-02-05  9:44   ` Avi Kivity
2012-02-05  9:51     ` Gleb Natapov
2012-02-05  9:56       ` Avi Kivity
2012-02-05 10:58         ` Gleb Natapov
2012-02-05 13:16           ` Avi Kivity
2012-02-05 16:36       ` [Qemu-devel] " Anthony Liguori
2012-02-06  9:34         ` Avi Kivity
2012-02-06 13:33           ` Anthony Liguori
2012-02-06 13:54             ` Avi Kivity
2012-02-06 14:00               ` Anthony Liguori
2012-02-06 14:08                 ` Avi Kivity
2012-02-07 18:12           ` Rusty Russell
2012-02-15 13:39             ` Avi Kivity
2012-02-15 21:59               ` Anthony Liguori
2012-02-16  8:57                 ` Gleb Natapov
2012-02-16 14:46                   ` Anthony Liguori
2012-02-16 19:34                     ` Avi Kivity
2012-02-15 23:08               ` Rusty Russell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).