From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755271Ab2BFNdN (ORCPT );
	Mon, 6 Feb 2012 08:33:13 -0500
Received: from mail-pz0-f46.google.com ([209.85.210.46]:45209 "EHLO
	mail-pz0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1755016Ab2BFNdM (ORCPT );
	Mon, 6 Feb 2012 08:33:12 -0500
Message-ID: <4F2FD692.5060708@codemonkey.ws>
Date: Mon, 06 Feb 2012 07:33:06 -0600
From: Anthony Liguori
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.23)
	Gecko/20110922 Lightning/1.0b2 Thunderbird/3.1.15
MIME-Version: 1.0
To: Avi Kivity
CC: qemu-devel, linux-kernel, Gleb Natapov, KVM list
Subject: Re: [Qemu-devel] [RFC] Next gen kvm api
References: <4F2AB552.2070909@redhat.com> <20120205093723.GQ23536@redhat.com>
	<4F2E4F8B.8090504@redhat.com> <20120205095153.GA29265@redhat.com>
	<4F2EAFF6.7030006@codemonkey.ws> <4F2F9E89.7090607@redhat.com>
In-Reply-To: <4F2F9E89.7090607@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 02/06/2012 03:34 AM, Avi Kivity wrote:
> On 02/05/2012 06:36 PM, Anthony Liguori wrote:
>> On 02/05/2012 03:51 AM, Gleb Natapov wrote:
>>> On Sun, Feb 05, 2012 at 11:44:43AM +0200, Avi Kivity wrote:
>>>> On 02/05/2012 11:37 AM, Gleb Natapov wrote:
>>>>> On Thu, Feb 02, 2012 at 06:09:54PM +0200, Avi Kivity wrote:
>>>>>> Device model
>>>>>> ------------
>>>>>> Currently kvm virtualizes or emulates a set of x86 cores, with or
>>>>>> without local APICs, a 24-input IOAPIC, a PIC, a PIT, and a number of
>>>>>> PCI devices assigned from the host. The API allows emulating the
>>>>>> local APICs in userspace.
>>>>>>
>>>>>> The new API will do away with the IOAPIC/PIC/PIT emulation and defer
>>>>>> them to userspace.
>>>>>> Note: this may cause a regression for older guests that don't
>>>>>> support MSI or kvmclock. Device assignment will be done using VFIO,
>>>>>> that is, without direct kvm involvement.
>>>>>>
>>>>> So are we officially saying that KVM is only for modern guest
>>>>> virtualization?
>>>>
>>>> No, but older guests may have reduced performance in some workloads
>>>> (e.g. RHEL4 gettimeofday() intensive workloads).
>>>>
>>> Reduced performance is what I mean. Obviously old guests will
>>> continue working.
>>
>> An interesting solution to this problem would be an in-kernel device VM.
>
> It's interesting, yes, but has a very high barrier to implementation.
>
>>
>> Most of the time, the hot register is just one register within a more
>> complex device. The reads are often side-effect free and trivially
>> computed from some device state + host time.
>
> Look at arch/x86/kvm/i8254.c:pit_ioport_read() for a counterexample.
> There are also interactions with other devices (for example the
> apic/ioapic interaction via the apic bus).

Hrm, maybe I'm missing it, but the path that would be hot is:

if (!status_latched && !count_latched) {
    value = kpit_elapsed();
    // manipulate count based on mode
    // mask value depending on read_state
}

This path is side-effect free, and applies relatively simple math to a time
counter.

The idea would be to allow the filter to not handle an I/O request depending
on existing state. Anything that modifies state (like reading the latch
counter) would drop to userspace.

>
>>
>> If userspace had a way to upload bytecode to the kernel that was
>> executed for a PIO operation, it could either pass the operation to
>> userspace or handle it within the kernel when possible without taking
>> a heavyweight exit.
>>
>> If the bytecode can access variables in a shared memory area, it could
>> be pretty efficient to work with.
>>
>> This means that the kernel never has to deal with specific in-kernel
>> devices but that userspace can accelerate as many of its devices as
>> it sees fit.
>
> I would really love to have this, but the problem is that we'd need a
> general purpose bytecode VM with binding to some kernel APIs. The
> bytecode VM, if made general enough to host more complicated devices,
> would likely be much larger than the actual code we have in the kernel now.

I think the question is whether BPF is good enough as it stands. I'm not
really sure. I agree that inventing a new bytecode VM is probably not worth it.

>>
>> This could replace ioeventfd as a mechanism (which would allow
>> clearing the notify flag before writing to an eventfd).
>>
>> We could potentially just use BPF for this.
>
> BPF generally just computes a predicate. Can it modify a packet in place?

I think a predicate is about right (can this I/O operation be handled in the
kernel or not) but the question is whether there's a way to produce an output
as a side effect.

> We could overload the scratch
> area for storing internal state and for read results, though (and have
> an "mmio scratch register" for reading the time).

Right.

Regards,

Anthony Liguori
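
To make the filter idea above concrete, here is a minimal sketch in plain C
of what such a PIO read filter might compute for the PIT fast path. Nothing
here is an existing KVM or BPF interface: struct pit_scratch,
pit_read_filter(), the verdict enum and the RW_* values are hypothetical
names chosen for illustration, and the countdown math is a simplified
rate-generator-style calculation rather than the real i8254 mode handling.
The host time is passed in as now_ns, standing in for the "mmio scratch
register" mentioned above. Word-mode reads are declined because they would
have to flip the LSB/MSB toggle, which is a state change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PIT_FREQ_HZ  1193182ULL     /* i8254 input clock */
#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical filter verdicts; not an existing KVM API. */
enum pio_verdict { PIO_HANDLED, PIO_TO_USERSPACE };

/* Hypothetical access modes, loosely modelled on the PIT read/write modes. */
enum pit_rw_state { RW_LSB = 1, RW_MSB = 2, RW_WORD = 3 };

/* Hypothetical scratch area shared between userspace and the filter. */
struct pit_scratch {
    bool     status_latched;
    bool     count_latched;
    uint8_t  read_state;     /* one of enum pit_rw_state */
    uint16_t initial_count;  /* 0 means 65536, as on the real part */
    uint64_t load_time_ns;   /* host time when the count was loaded */
    uint8_t  result;         /* read result, written only on PIO_HANDLED */
};

/*
 * Decide whether a counter read can be served without a heavyweight exit.
 * Device state is only read; the sole write is the result byte.
 */
static enum pio_verdict pit_read_filter(struct pit_scratch *s, uint64_t now_ns)
{
    uint64_t elapsed, ticks;
    uint32_t period;
    uint16_t count;

    assert(s != NULL);

    /* Latched reads consume the latch, so they go to userspace. */
    if (s->status_latched || s->count_latched)
        return PIO_TO_USERSPACE;

    /* Word mode toggles LSB/MSB on every read, so decline that too. */
    if (s->read_state != RW_LSB && s->read_state != RW_MSB)
        return PIO_TO_USERSPACE;

    /* Convert elapsed host time to PIT ticks, split to avoid overflow. */
    elapsed = now_ns - s->load_time_ns;
    ticks = (elapsed / NSEC_PER_SEC) * PIT_FREQ_HZ
          + (elapsed % NSEC_PER_SEC) * PIT_FREQ_HZ / NSEC_PER_SEC;

    /* Simplified periodic countdown from the initial count. */
    period = s->initial_count ? s->initial_count : 0x10000;
    count = (uint16_t)(period - ticks % period);

    /* Mask the value depending on read_state. */
    s->result = (s->read_state == RW_LSB) ? (uint8_t)(count & 0xff)
                                          : (uint8_t)(count >> 8);
    return PIO_HANDLED;
}
```

In this sketch the verdict is the predicate (handle in the kernel or not),
and the result byte in the scratch area is the output produced as a side
effect. Whether BPF as it stands could express the scratch write is exactly
the open question in the thread.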