linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Anthony Liguori <anthony@codemonkey.ws>
To: Avi Kivity <avi@redhat.com>
Cc: KVM list <kvm@vger.kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	qemu-devel <qemu-devel@nongnu.org>
Subject: Re: [Qemu-devel] [RFC] Next gen kvm api
Date: Thu, 02 Feb 2012 20:09:26 -0600	[thread overview]
Message-ID: <4F2B41D6.8020603@codemonkey.ws> (raw)
In-Reply-To: <4F2AB552.2070909@redhat.com>

On 02/02/2012 10:09 AM, Avi Kivity wrote:
> The kvm api has been accumulating cruft for several years now.  This is
> due to feature creep, fixing mistakes, experience gained by the
> maintainers and developers on how to do things, ports to new
> architectures, and simply as a side effect of a code base that is
> developed slowly and incrementally.
>
> While I don't think we can justify a complete revamp of the API now, I'm
> writing this as a thought experiment to see where a from-scratch API can
> take us.  Of course, if we do implement this, the new and old APIs will
> have to be supported side by side for several years.
>
> Syscalls
> --------
> kvm currently uses the much-loved ioctl() system call as its entry
> point.  While this made it easy to add kvm to the kernel unintrusively,
> it does have downsides:
>
> - overhead in the entry path, for the ioctl dispatch path and vcpu mutex
> (low but measurable)
> - semantic mismatch: kvm really wants a vcpu to be tied to a thread, and
> a vm to be tied to an mm_struct, but the current API ties them to file
> descriptors, which can move between threads and processes.  We check
> that they don't, but we don't want to.
>
> Moving to syscalls avoids these problems, but introduces new ones:
>
> - adding new syscalls is generally frowned upon, and kvm will need several
> - syscalls into modules are harder and rarer than into core kernel code
> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
> mm_struct
>
> Syscalls that operate on the entire guest will pick it up implicitly
> from the mm_struct, and syscalls that operate on a vcpu will pick it up
> from current.

This seems like the natural progression.

> State accessors
> ---------------
> Currently vcpu state is read and written by a bunch of ioctls that
> access register sets that were added (or discovered) along the years.
> Some state is stored in the vcpu mmap area.  These will be replaced by a
> pair of syscalls that read or write the entire state, or a subset of the
> state, in a tag/value format.  A register will be described by a tuple:
>
>    set: the register set to which it belongs; either a real set (GPR,
> x87, SSE/AVX, segment, cpuid, MSRs) or a fake set (for
> eflags/rip/IDT/interrupt shadow/pending exception/etc.)
>    number: register number within a set
>    size: for self-description, and to allow expanding registers like
> SSE->AVX or eax->rax
>    attributes: read-write, read-only, read-only for guest but read-write
> for host
>    value

I do like the idea a lot of being able to read one register at a time as often 
times that's all you need.

>
> Device model
> ------------
> Currently kvm virtualizes or emulates a set of x86 cores, with or
> without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
> PCI devices assigned from the host.  The API allows emulating the local
> APICs in userspace.
>
> The new API will do away with the IOAPIC/PIC/PIT emulation and defer
> them to userspace.

I'm a big fan of this.

> Note: this may cause a regression for older guests
> that don't support MSI or kvmclock.  Device assignment will be done
> using VFIO, that is, without direct kvm involvement.
>
> Local APICs will be mandatory, but it will be possible to hide them from
> the guest.  This means that it will no longer be possible to emulate an
> APIC in userspace, but it will be possible to virtualize an APIC-less
> core - userspace will play with the LINT0/LINT1 inputs (configured as
> EXITINT and NMI) to queue interrupts and NMIs.

I think this makes sense.  An interesting consequence of this is that it's no 
longer necessary to associate the VCPU context with an MMIO/PIO operation.  I'm 
not sure if there's an obvious benefit to that but it's interesting nonetheless.

> The communications between the local APIC and the IOAPIC/PIC will be
> done over a socketpair, emulating the APIC bus protocol.
>
> Ioeventfd/irqfd
> ---------------
> As the ioeventfd/irqfd mechanism has been quite successful, it will be
> retained, and perhaps supplemented with a way to assign an mmio region
> to a socketpair carrying transactions.  This allows a device model to be
> implemented out-of-process.  The socketpair can also be used to
> implement a replacement for coalesced mmio, by not waiting for responses
> on write transactions when enabled.  Synchronization of coalesced mmio
> will be implemented in the kernel, not userspace as now: when a
> non-coalesced mmio is needed, the kernel will first flush the coalesced
> mmio queue(s).
>
> Guest memory management
> -----------------------
> Instead of managing each memory slot individually, a single API will be
> provided that replaces the entire guest physical memory map atomically.
> This matches the implementation (using RCU) and plugs holes in the
> current API, where you lose the dirty log in the window between the last
> call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
> that removes the slot.
>
> Slot-based dirty logging will be replaced by range-based and work-based
> dirty logging; that is "what pages are dirty in this range, which may be
> smaller than a slot" and "don't return more than N pages".
>
> We may want to place the log in user memory instead of kernel memory, to
> reduce pinned memory and increase flexibility.

Since we really only support 64-bit hosts, what about just pointing the kernel 
at a address/size pair and rely on userspace to mmap() the range appropriately?

> vcpu fd mmap area
> -----------------
> Currently we mmap() a few pages of the vcpu fd for fast user/kernel
> communications.  This will be replaced by a more orthodox pointer
> parameter to sys_kvm_enter_guest(), that will be accessed using
> get_user() and put_user().  This is slower than the current situation,
> but better for things like strace.

Look pretty interesting overall.

Regards,

Anthony Liguori

>


  parent reply	other threads:[~2012-02-03  2:09 UTC|newest]

Thread overview: 89+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-02-02 16:09 [RFC] Next gen kvm api Avi Kivity
     [not found] ` <CAB9FdM9M2DWXBxxyG-ez_5igT61x5b7ptw+fKfgaqMBU_JS5aA@mail.gmail.com>
2012-02-02 22:16   ` [Qemu-devel] " Rob Earhart
2012-02-05 13:14   ` Avi Kivity
2012-02-06 17:41     ` Rob Earhart
2012-02-06 19:11       ` Anthony Liguori
2012-02-07 12:03         ` Avi Kivity
2012-02-07 15:17           ` Anthony Liguori
2012-02-07 16:02             ` Avi Kivity
2012-02-07 16:18               ` Jan Kiszka
2012-02-07 16:21                 ` Anthony Liguori
2012-02-07 16:29                   ` Jan Kiszka
2012-02-15 13:41                     ` Avi Kivity
2012-02-07 16:19               ` Anthony Liguori
2012-02-15 13:47                 ` Avi Kivity
2012-02-07 12:01       ` Avi Kivity
2012-02-03  2:09 ` Anthony Liguori [this message]
2012-02-04  2:08   ` Takuya Yoshikawa
2012-02-22 13:06     ` Peter Zijlstra
2012-02-05  9:24   ` Avi Kivity
2012-02-07  1:08   ` Alexander Graf
2012-02-07 12:24     ` Avi Kivity
2012-02-07 12:51       ` Alexander Graf
2012-02-07 13:16         ` Avi Kivity
2012-02-07 13:40           ` Alexander Graf
2012-02-07 14:21             ` Avi Kivity
2012-02-07 14:39               ` Alexander Graf
2012-02-15 11:18                 ` Avi Kivity
2012-02-15 11:57                   ` Alexander Graf
2012-02-15 13:29                     ` Avi Kivity
2012-02-15 13:37                       ` Alexander Graf
2012-02-15 13:57                         ` Avi Kivity
2012-02-15 14:08                           ` Alexander Graf
2012-02-16 19:24                             ` Avi Kivity
2012-02-16 19:34                               ` Alexander Graf
2012-02-16 19:38                                 ` Avi Kivity
2012-02-16 20:41                                   ` Scott Wood
2012-02-17  0:23                                     ` Alexander Graf
2012-02-17 18:27                                       ` Scott Wood
2012-02-18  9:49                                     ` Avi Kivity
2012-02-17  0:19                                   ` Alexander Graf
2012-02-18 10:00                                     ` Avi Kivity
2012-02-18 10:43                                       ` Alexander Graf
2012-02-15 19:17                     ` Scott Wood
2012-02-12  7:10               ` Takuya Yoshikawa
2012-02-15 13:32                 ` Avi Kivity
2012-02-07 15:23             ` Anthony Liguori
2012-02-07 15:28               ` Alexander Graf
2012-02-08 17:20               ` Alan Cox
2012-02-15 13:33               ` Avi Kivity
2012-02-15 22:14             ` Arnd Bergmann
2012-02-10  3:07   ` Jamie Lokier
2012-02-03 18:07 ` Eric Northup
2012-02-03 22:52   ` [Qemu-devel] " Anthony Liguori
2012-02-06 19:46     ` Scott Wood
2012-02-07  6:58       ` Michael Ellerman
2012-02-07 10:04         ` Alexander Graf
2012-02-15 22:21           ` Arnd Bergmann
2012-02-16  1:04             ` Michael Ellerman
2012-02-16 19:28               ` Avi Kivity
2012-02-17  0:09                 ` Michael Ellerman
2012-02-18 10:03                   ` Avi Kivity
2012-02-16 10:26             ` Avi Kivity
2012-02-07 12:28       ` Anthony Liguori
2012-02-07 12:40         ` Avi Kivity
2012-02-07 12:51           ` Anthony Liguori
2012-02-07 13:18             ` Avi Kivity
2012-02-07 15:15               ` Anthony Liguori
2012-02-07 18:28                 ` Chris Wright
2012-02-08 17:02         ` Scott Wood
2012-02-08 17:12           ` Alan Cox
2012-02-05  9:37 ` Gleb Natapov
2012-02-05  9:44   ` Avi Kivity
2012-02-05  9:51     ` Gleb Natapov
2012-02-05  9:56       ` Avi Kivity
2012-02-05 10:58         ` Gleb Natapov
2012-02-05 13:16           ` Avi Kivity
2012-02-05 16:36       ` [Qemu-devel] " Anthony Liguori
2012-02-06  9:34         ` Avi Kivity
2012-02-06 13:33           ` Anthony Liguori
2012-02-06 13:54             ` Avi Kivity
2012-02-06 14:00               ` Anthony Liguori
2012-02-06 14:08                 ` Avi Kivity
2012-02-07 18:12           ` Rusty Russell
2012-02-15 13:39             ` Avi Kivity
2012-02-15 21:59               ` Anthony Liguori
2012-02-16  8:57                 ` Gleb Natapov
2012-02-16 14:46                   ` Anthony Liguori
2012-02-16 19:34                     ` Avi Kivity
2012-02-15 23:08               ` Rusty Russell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4F2B41D6.8020603@codemonkey.ws \
    --to=anthony@codemonkey.ws \
    --cc=avi@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).