From mboxrd@z Thu Jan  1 00:00:00 1970
From: Alex Williamson <alex.williamson@redhat.com>
Subject: Re: VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT
 - a Mediated ...)
Date: Tue, 26 Jan 2016 15:56:15 -0700
Message-ID: <1453848975.18049.7.camel@redhat.com>
References: <569C5071.6080004@intel.com>
	 <1453092476.32741.67.camel@redhat.com> <569CA8AD.6070200@intel.com>
	 <1453143919.32741.169.camel@redhat.com> <569F4C86.2070501@intel.com>
	 <AADFC41AFE54684AB9EE6CBC0274A5D15F786B4B@SHSMSX101.ccr.corp.intel.com>
	 <56A6083E.10703@intel.com> <1453757426.32741.614.camel@redhat.com>
	 <56A72313.9030009@intel.com> <56A77D2D.40109@gmail.com>
	 <1453826249.26652.54.camel@redhat.com>
	 <AADFC41AFE54684AB9EE6CBC0274A5D15F78EAEA@SHSMSX101.ccr.corp.intel.com>
	 <1453844613.18049.1.camel@redhat.com>
	 <AADFC41AFE54684AB9EE6CBC0274A5D15F78EB95@SHSMSX101.ccr.corp.intel.com>
	 <1453846073.18049.3.camel@redhat.com>
	 <AADFC41AFE54684AB9EE6CBC0274A5D15F78ECBB@SHSMSX101.ccr.corp.intel.com>
	 <1453847250.18049.5.camel@redhat.com>
	 <AADFC41AFE54684AB9EE6CBC0274A5D15F78ED63@SHSMSX101.ccr.corp.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Gerd Hoffmann <kraxel@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"Lv, Zhiyuan" <zhiyuan.lv@intel.com>,
	"Ruan, Shuai" <shuai.ruan@intel.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	qemu-devel <qemu-devel@nongnu.org>,
	"igvt-g@lists.01.org" <igvt-g@ml01.01.org>,
	Neo Jia <cjia@nvidia.com>
To: "Tian, Kevin" <kevin.tian@intel.com>,
	Yang Zhang <yang.zhang.wz@gmail.com>,
	"Song, Jike" <jike.song@intel.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:51483 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753119AbcAZW4R (ORCPT <rfc822;kvm@vger.kernel.org>);
	Tue, 26 Jan 2016 17:56:17 -0500
In-Reply-To: <AADFC41AFE54684AB9EE6CBC0274A5D15F78ED63@SHSMSX101.ccr.corp.intel.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On Tue, 2016-01-26 at 22:39 +0000, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Wednesday, January 27, 2016 6:27 AM
> >
> > On Tue, 2016-01-26 at 22:15 +0000, Tian, Kevin wrote:
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Wednesday, January 27, 2016 6:08 AM
> > > >
> > > > > > > >
> > > > > > >
> > > > > > > Today KVMGT (not using VFIO yet) registers I/O emulation callbacks to
> > > > > > > KVM, so VM MMIO access will be forwarded to KVMGT directly for
> > > > > > > emulation in kernel. If we reuse above R/W flags, the whole emulation
> > > > > > > path would be unnecessarily long with obvious performance impact. We
> > > > > > > either need a new flag here to indicate in-kernel emulation (bias from
> > > > > > > passthrough support), or just hide the region alternatively (let KVMGT
> > > > > > > to handle I/O emulation itself like today).
> > > > > >
> > > > > > That sounds like a future optimization TBH.  There's very strict
> > > > > > layering between vfio and kvm.  Physical device assignment could make
> > > > > > use of it as well, avoiding a round trip through userspace when an
> > > > > > ioread/write would do.  Userspace also needs to orchestrate those kinds
> > > > > > of accelerators; there might be cases where userspace wants to see those
> > > > > > transactions for debugging or manipulating the device.  We can't simply
> > > > > > take shortcuts to provide such direct access.  Thanks,
> > > > > >
> > > > >
> > > > > But we have to balance such debugging flexibility against acceptable
> > > > > performance. To me the latter is more important; otherwise there'd be no
> > > > > real usage of this technique, while for debugging there are other
> > > > > alternatives (e.g. ftrace). Consider some extreme case with 100k
> > > > > traps/second and then see how much impact a 2-3x longer emulation path
> > > > > can bring...
> > > >
> > > > Are you jumping to the conclusion that it cannot be done with proper
> > > > layering in place?  Performance is important, but it's not an excuse to
> > > > abandon designing interfaces between independent components.  Thanks,
> > > >
> > >
> > > The two are not in conflict. My point is to avoid an unnecessarily long
> > > trip where possible. On second thought, yes, we can reuse the existing
> > > read/write flags:
> > >   - KVMGT will expose a private control variable indicating whether
> > > in-kernel delivery is required;
> >
> > But in-kernel delivery is never *required*.  Wouldn't userspace want to
> > deliver in-kernel any time it possibly could?
> >
> > >   - when the variable is true, KVMGT will register in-kernel MMIO
> > > emulation callbacks, and VM MMIO requests will then be delivered to
> > > KVMGT directly;
> > >   - when the variable is false, KVMGT will not register anything.
> > > VM MMIO requests will then be delivered to Qemu, and ioread/write
> > > will be used to finally reach the KVMGT emulation logic;
> >
> > No, that means the interface is entirely dependent on a backdoor through
> > KVM.  Why can't userspace (QEMU) do something like register an MMIO
> > region with KVM handled via a provided file descriptor and offset?
> > Couldn't KVM then call the file ops without a kernel exit?  Thanks,
> >
>
> Could you elaborate on this thought? If it can achieve the purpose w/o
> a kernel exit, we can definitely adapt to it. :-)

I only thought of it when replying to the last email and have been doing
some research, but we already do quite a bit of synchronization through
file descriptors.  The kvm-vfio pseudo device uses a group file
descriptor to ensure a user has access to a group, allowing some degree
of interaction between modules.  Eventfds and irqfds already make use of
f_ops on file descriptors to poke data.  So, if KVM had information that
an MMIO region was backed by a file descriptor for which it already has
a reference via fdget() (and verified access rights and whatnot), then
it ought to be a simple matter to get to f_ops->read/write knowing the
base offset of that MMIO region.  Perhaps it could even simply use
__vfs_read/write().  Then we've got a proper reference to the file
descriptor for ownership purposes and we've transparently jumped across
modules without any implicit knowledge of the other end.  Could it work?
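
To make the shape of the idea a bit more concrete, here is a rough
userspace sketch in Python.  It is only an analogy, not kernel code and
not an existing KVM interface: FdBackedRegion and MmioDispatcher are
hypothetical names, a temporary file stands in for the device file
descriptor, and os.pread()/os.pwrite() stand in for calling
f_ops->read/write at the region's base offset.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class FdBackedRegion:
    """Hypothetical record: a guest MMIO range backed by a file descriptor."""
    gpa_base: int   # guest-physical base address of the MMIO range
    size: int       # length of the range in bytes
    fd: int         # descriptor the dispatcher already holds a reference to
    fd_offset: int  # offset within the fd where this region starts


class MmioDispatcher:
    """Hypothetical stand-in for the KVM side: routes MMIO to file ops."""

    def __init__(self):
        self._regions = []

    def register(self, region):
        # Refuse overlapping ranges so every access maps to one backend.
        for r in self._regions:
            if (region.gpa_base < r.gpa_base + r.size
                    and r.gpa_base < region.gpa_base + region.size):
                raise ValueError("overlapping MMIO region")
        self._regions.append(region)

    def _lookup(self, gpa, length):
        # Find the region that fully contains [gpa, gpa + length).
        for r in self._regions:
            if r.gpa_base <= gpa and gpa + length <= r.gpa_base + r.size:
                return r
        return None

    def read(self, gpa, length):
        r = self._lookup(gpa, length)
        if r is None:
            return None  # unhandled: caller falls back to the slow path
        # Translate the guest address into an offset within the fd.
        return os.pread(r.fd, length, r.fd_offset + (gpa - r.gpa_base))

    def write(self, gpa, data):
        r = self._lookup(gpa, len(data))
        if r is None:
            return False  # unhandled: caller falls back to the slow path
        os.pwrite(r.fd, data, r.fd_offset + (gpa - r.gpa_base))
        return True


if __name__ == "__main__":
    with tempfile.TemporaryFile() as backing:
        os.ftruncate(backing.fileno(), 0x2000)  # fake "device" file
        kvm = MmioDispatcher()
        kvm.register(FdBackedRegion(gpa_base=0xFE000000, size=0x1000,
                                    fd=backing.fileno(), fd_offset=0x1000))
        kvm.write(0xFE000010, b"\x78\x56\x34\x12")
        print(kvm.read(0xFE000010, 4).hex())
```

The dispatcher knows nothing about what sits behind the descriptor; it
only holds the fd and the base offset.  A miss (None or False) would
correspond to taking the existing exit to userspace.
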
Thanks,

Alex

