From: "Michael S. Tsirkin"
Subject: Re: [PATCH 2/7] pci: memory access API and IOMMU support
Date: Sun, 5 Sep 2010 10:10:31 +0300
Message-ID: <20100905071029.GA16014@redhat.com>
References: <1283007298-10942-1-git-send-email-eduard.munteanu@linux360.ro>
 <1283007298-10942-3-git-send-email-eduard.munteanu@linux360.ro>
 <20100902052826.GB4273@redhat.com>
 <20100902084058.GA7211@localhost>
 <20100902094947.GA9085@redhat.com>
To: Blue Swirl
Cc: Eduard - Gabriel Munteanu, joro@8bytes.org, paul@codesourcery.com,
 avi@redhat.com, anthony@codemonkey.ws, av1474@comtv.ru,
 yamahata@valinux.co.jp, kvm@vger.kernel.org, qemu-devel@nongnu.org

On Sat, Sep 04, 2010 at 09:01:06AM +0000, Blue Swirl wrote:
> On Thu, Sep 2, 2010 at 9:49 AM, Michael S. Tsirkin wrote:
> > On Thu, Sep 02, 2010 at 11:40:58AM +0300, Eduard - Gabriel Munteanu wrote:
> >> On Thu, Sep 02, 2010 at 08:28:26AM +0300, Michael S. Tsirkin wrote:
> >> > On Sat, Aug 28, 2010 at 05:54:53PM +0300, Eduard - Gabriel Munteanu wrote:
> >> > > PCI devices should access memory through pci_memory_*() instead of
> >> > > cpu_physical_memory_*(). This also provides support for translation and
> >> > > access checking in case an IOMMU is emulated.
> >> > >
> >> > > Memory maps are treated as remote IOTLBs (that is, translation caches
> >> > > belonging to the IOMMU-aware device itself). Clients (devices) must
> >> > > provide callbacks for map invalidation in case these maps are
> >> > > persistent beyond the current I/O context, e.g. AIO DMA transfers.
> >> > >
> >> > > Signed-off-by: Eduard - Gabriel Munteanu
> >> >
> >> > I am concerned about adding more pointer chasing on the data path.
> >> > Could we have
> >> > 1. an iommu pointer in a device, inherited by secondary buses
> >> >    when they are created and by devices from buses when they are attached.
> >> > 2. translation pointer in the iommu instead of the bus
> >>
> >> The first solution I proposed was based on qdev, that is, each
> >> DeviceState had an 'iommu' field. Translation would be done by
> >> recursively looking in the parent buses/devices for an IOMMU.
> >>
> >> But Anthony said we're better off with bus-specific APIs, mostly because
> >> (IIRC) there may be different types of addresses and it might be
> >> difficult to abstract those properly.
> >
> > Well, we ended up with casting
> > away types to make the pci callbacks fit in the ide structure,
> > and silently assuming that all addresses are in fact 64 bit.
> > So maybe it's hard to abstract addresses properly, but
> > it appears we'll have to, to avoid even worse problems.
> >
> >> I suppose I could revisit the idea by integrating the IOMMU in a
> >> PCIDevice as opposed to a DeviceState.
> >>
> >> Anthony, Paul, any thoughts on this?
> >
> > Just to clarify: this is an optimization idea:
> > instead of a bus walk on each access, do the walk
> > when the device is attached to the bus, and copy the iommu
> > from the root to the device itself.
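
To make the idea above concrete, here is a minimal sketch of the attach-time
propagation (all names below are illustrative assumptions, not the actual
patch or current QEMU API):

typedef struct PCIIOMMU PCIIOMMU;

typedef struct PCIBus {
    struct PCIBus *parent;   /* NULL for the root bus */
    PCIIOMMU *iommu;         /* set only on buses that have an IOMMU */
} PCIBus;

typedef struct PCIDevice {
    PCIBus *bus;
    PCIIOMMU *iommu;         /* cached copy, filled in at attach time */
} PCIDevice;

/* Walk the bus hierarchy once, when the device is attached,
 * instead of on every DMA access. */
static void pci_device_inherit_iommu(PCIDevice *dev, PCIBus *bus)
{
    PCIBus *b;

    dev->bus = bus;
    dev->iommu = NULL;
    for (b = bus; b != NULL; b = b->parent) {
        if (b->iommu) {
            dev->iommu = b->iommu;
            break;
        }
    }
}
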
> >
> > This will also make it possible to create a
> > DMADeviceState structure which would have this iommu field,
> > and we'd use this structure instead of the void pointers all over.
> >
> >> > 3. pci_memory_XX functions inline, doing a fast path for the non-iommu case:
> >> >
> >> >     if (__builtin_expect(!dev->iommu, 1))
> >> >             return cpu_memory_rw
> >>
> >> But isn't this some sort of 'if (likely(!dev->iommu))' from the Linux
> >> kernel? If so, it puts the IOMMU-enabled case at a disadvantage.
> >
> > IOMMU has a ton of indirections anyway.
> >
> >> I suppose most emulated systems would have at least some theoretical
> >> reasons to enable the IOMMU, e.g. as a GART replacement (say for 32-bit
> >> PCI devices) or for userspace drivers.
> >> So there are reasons to enable
> >> the IOMMU even when you don't have a real host IOMMU and you're not
> >> using nested guests.
> >
> > The time when most people enable the iommu for all devices in both real
> > and virtualized systems appears distant; one of the reasons is that it
> > has a lot of overhead. Let's start with not adding overhead for existing
> > users, makes sense?
>
> I think the goal architecture (not for IOMMU, but in general) is one
> with zero copy DMA. This means we have stage one where the addresses
> are translated to host pointers and stage two where the read/write
> operation happens. The devices need to be converted.
>
> Now, let's consider the IOMMU in this zero copy architecture. It's one
> stage of address translation; for the access operation it will not
> matter. We can add translation caching at the device level (or even at
> any intermediate level), but that needs a cache invalidation callback
> system as discussed earlier. This can be implemented later; we need
> the zero copy stuff first.
>
> Given this overall picture, I think eliminating some pointer
> dereference overheads in the non-zero copy architecture is a very
> premature optimization, and it may even steer the architecture in the
> wrong direction.
>
> If the performance degradation at this point is not acceptable, we
> could also postpone merging the IOMMU until the zero copy conversion has
> happened, or make the IOMMU a compile-time option. But it would be nice
> to back the decision with performance figures.

I agree; a minimal benchmark showing no performance impact when the IOMMU
is disabled would put these concerns to rest.

-- 
MST
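
For reference, a minimal sketch of the inline fast path discussed above
(PCIIOMMU, its translate callback and the PCIDevice layout are assumed,
illustrative names; cpu_physical_memory_rw is only declared as a stand-in
for the existing flat memory access):

#include <stdint.h>

typedef uint64_t pci_addr_t;

typedef struct PCIIOMMU {
    /* Translate a bus address and check access rights; returns 0 on
     * success and fills in the translated physical address. */
    int (*translate)(struct PCIIOMMU *iommu, pci_addr_t addr, int len,
                     int is_write, uint64_t *paddr);
} PCIIOMMU;

typedef struct PCIDevice {
    PCIIOMMU *iommu;     /* NULL when no IOMMU is configured */
} PCIDevice;

/* Stand-in declaration for the existing flat access routine. */
void cpu_physical_memory_rw(uint64_t addr, uint8_t *buf, int len, int is_write);

static inline void pci_memory_rw(PCIDevice *dev, pci_addr_t addr,
                                 uint8_t *buf, int len, int is_write)
{
    /* Fast path: no IOMMU, no translation, no extra indirection. */
    if (__builtin_expect(!dev->iommu, 1)) {
        cpu_physical_memory_rw(addr, buf, len, is_write);
        return;
    }

    /* Slow path: translate through the IOMMU, then access memory. */
    uint64_t paddr;
    if (dev->iommu->translate(dev->iommu, addr, len, is_write, &paddr) == 0) {
        cpu_physical_memory_rw(paddr, buf, len, is_write);
    }
    /* In this sketch a failed translation simply drops the access. */
}
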