linux-kernel.vger.kernel.org archive mirror
* IOMMUs was Re: Intel vs AMD x86-64
       [not found]     ` <Pine.LNX.4.58.0402231359280.3005@ppc970.osdl.org.suse.lists.linux.kernel>
@ 2004-02-24 14:06       ` Andi Kleen
  2004-02-24 18:13         ` David S. Miller
  0 siblings, 1 reply; 7+ messages in thread
From: Andi Kleen @ 2004-02-24 14:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, davem

Linus Torvalds <torvalds@osdl.org> writes:

> In fact, I _think_ you could actually use the AGP bridge as a strange
> IOMMU. Of course, right now their AGP bridges are all 32-bit limited
> anyway, but the point being that they at least in theory would seem to
> have the capability to do this.

Actually AGPv3 is 40-bit capable (using a strange encoding, but it works).

On Opteron the IOMMU code (ab)uses the built-in AGPv3 GART in the CPU, which
was originally intended for AGP. AMD extended it to be able to remap
PCI as well, especially for Linux, which I think deserves applause.

It works surprisingly well even though it was not designed as a real
IOMMU. Of course one of the main advantages of a real IOMMU -
preventing arbitrary memory corruption from broken devices - is lost,
because the remapping table is just a hole in the memory. I'm
secretly hoping that when there is more support for Linux at
chipset vendors they will someday add a bit to isolate all traffic
that doesn't go through the GART from the main memory. That way
you could get a much more reliable system that can tolerate broken
PCI devices at a moderate performance penalty.
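The remapping scheme described above can be sketched as a toy model (Python purely for illustration; the class and names are invented, not from the kernel). Bus addresses inside the aperture window are translated through the remap table; anything outside it passes straight to memory, which is exactly why broken devices can still corrupt memory:

```python
PAGE_SIZE = 4096

class GartIommu:
    """Toy model of a GART-style remapper: a window ("aperture") of bus
    addresses is translated through a page table; everything outside the
    window passes through untouched, so there is no isolation."""

    def __init__(self, aperture_base, aperture_pages):
        self.base = aperture_base
        self.size = aperture_pages * PAGE_SIZE
        self.table = [None] * aperture_pages  # slot -> physical page frame

    def map(self, slot, phys_addr):
        self.table[slot] = phys_addr // PAGE_SIZE

    def translate(self, bus_addr):
        if self.base <= bus_addr < self.base + self.size:
            slot = (bus_addr - self.base) // PAGE_SIZE
            frame = self.table[slot]
            if frame is None:
                raise LookupError("unmapped aperture page")
            return frame * PAGE_SIZE + bus_addr % PAGE_SIZE
        return bus_addr  # outside the aperture: direct, unchecked access

iommu = GartIommu(aperture_base=0xE000_0000, aperture_pages=1024)
iommu.map(0, 0x1_2345_6000)  # remap aperture slot 0 to a page above 4GB
print(hex(iommu.translate(0xE000_0010)))  # remapped through the table
print(hex(iommu.translate(0x1000)))       # below the aperture: unchanged
```

The last line is the point about isolation: a device DMAing to an address outside the aperture never goes through the table at all.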

One side effect of this is that the IOMMU TLB flush strategy is a bit
dumb, because it has to do config space accesses for it. This is
understandable, because AGP rarely sets up new mappings. It is a bit
of a problem because the @#$@$ X server does direct PCI accesses on
its own and can race with an IOMMU TLB flush. But I hope this can get
fixed eventually, e.g. with the new freedesktop.org X server. When
we get PCI Express memory-mapped config space support, this problem
will hopefully go away.

The bad news is that PCI Express will do away with GARTs, so
they may not be there anymore in future chipsets. But I hope they
will at least keep the one in the Opteron's on-CPU bridge.
 
> > Really, not having an IOMMU on a 64-bit platform these days is basically like
> > pulling out one's toenails with an ice pick.
> 
> Well, as long as they had that "64-bit is server" mentality, they can 
> honestly say that you just have to use 64-bit-capable PCI cards.
> 
> Now, the "server only" mentality is obviously crap, but since we haven't
> even seen the chipsets designed for the 64-bit chips, we shouldn't
> complain. At least yet.

What I find especially ironic is that exactly the same chipset
people who use these crap arguments put 32-bit-only USB and IDE
devices into the same chips. USB and IDE are the major users
of the IOMMU. And yes, they're 32-bit-only even in the "high-end"
Intel server chipsets. And they already have a mostly working IOMMU in the
chipset for the GART; they just refuse to use it for PCI too.

> Now, I'm not above complaining about Intel (in fact, the Intel people seem
> to often think I hate them because I'm apparently the only person who gets
> quoted who complains about bad decisions publicly), but at least I try to
> avoid complaining before-the-fact ;)

Can you please complain a bit more about the chipset people and
get quoted so that Intel management hears you ? ;-)

-Andi

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: IOMMUs was Re: Intel vs AMD x86-64
  2004-02-24 14:06       ` IOMMUs was Re: Intel vs AMD x86-64 Andi Kleen
@ 2004-02-24 18:13         ` David S. Miller
  2004-02-27  1:28           ` Andi Kleen
  0 siblings, 1 reply; 7+ messages in thread
From: David S. Miller @ 2004-02-24 18:13 UTC (permalink / raw)
  To: Andi Kleen; +Cc: torvalds, linux-kernel

On 24 Feb 2004 15:06:47 +0100
Andi Kleen <ak@suse.de> wrote:

> One side effect of this is that the IOMMU TLB flush strategy is a bit
> dumb, because it has to do config space accesses for it.

This can be costly, but if you flush the IOMMU like sparc64 does (basically
similar to how KMAPs are flushed on x86), the cost gets really low, because
then you only flush the whole iommu once per pass through its whole mapping
table.

I'm sure you've probably thought of this already, just mentioning it in case
you haven't.
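The sparc64 strategy can be modeled like this (a sketch, not the actual sparc64 code; names are invented): mappings are handed out linearly, and the expensive global flush happens only when the allocator wraps back to the start of the table, amortizing its cost over a full pass:

```python
class WrapAllocator:
    """Toy model of flush-on-wraparound IOMMU slot allocation: entries
    are handed out linearly, and a single global TLB flush happens only
    when the allocator wraps back to slot 0."""

    def __init__(self, nslots):
        self.nslots = nslots
        self.next = 0
        self.flushes = 0  # counts expensive global flushes

    def alloc(self):
        if self.next == self.nslots:  # table exhausted: flush once, wrap
            self.flushes += 1
            self.next = 0
        slot = self.next
        self.next += 1
        return slot

a = WrapAllocator(nslots=4)
slots = [a.alloc() for _ in range(10)]  # 10 mappings cost only 2 flushes
print(slots, a.flushes)
```

The trade-off is that a stale TLB entry may survive until the next wraparound, which is safe only if slots are not reused before the flush.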


* Re: IOMMUs was Re: Intel vs AMD x86-64
  2004-02-27  1:28           ` Andi Kleen
@ 2004-02-24 18:41             ` David S. Miller
  2004-02-25  0:36             ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 7+ messages in thread
From: David S. Miller @ 2004-02-24 18:41 UTC (permalink / raw)
  To: Andi Kleen; +Cc: torvalds, linux-kernel, richard.brunner

On Fri, 27 Feb 2004 02:28:49 +0100
Andi Kleen <ak@suse.de> wrote:

> Also the other part of the dumbness is that the flush is global, not per mapping. I guess
> you don't have that problem on Sparc64.

Yes, we can per-page flush, but I don't use that feature at all since I do the "flush all when
wrap around IOMMU pte table" thing we're discussing.  In fact there is no "global flush" so
what I have to do is use diagnostic accesses to the IOMMU TLB to kick out the entries one
by one.
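The contrast between a global flush and the per-entry diagnostic eviction described above can be sketched as a toy model (invented names, not the sparc64 code):

```python
class IommuTlb:
    """Toy model contrasting a global flush with per-entry eviction via
    diagnostic accesses, for hardware that lacks a global flush."""

    def __init__(self):
        self.entries = set()  # cached virtual-page tags

    def fill(self, vpage):
        self.entries.add(vpage)

    def flush_global(self):       # one operation drops everything
        self.entries.clear()

    def evict_one(self, vpage):   # diagnostic access: kick one entry out
        self.entries.discard(vpage)

tlb = IommuTlb()
for v in range(8):
    tlb.fill(v)
for v in range(8):        # no global flush available: kick them one by one
    tlb.evict_one(v)
print(len(tlb.entries))
```

Evicting one by one costs an access per entry, which is why the flush-on-wraparound scheme (which makes flushes rare) matters so much on such hardware.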


* Re: IOMMUs was Re: Intel vs AMD x86-64
  2004-02-27  1:28           ` Andi Kleen
  2004-02-24 18:41             ` David S. Miller
@ 2004-02-25  0:36             ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 7+ messages in thread
From: Benjamin Herrenschmidt @ 2004-02-25  0:36 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David S. Miller, Linus Torvalds, Linux Kernel list, richard.brunner


> Arjan suggested it some time ago already. In fact I implemented it, it's in the current code.
> But it caused data corruption with a few devices, in particular 3ware, so I had 
> to disable it again. I didn't find a bug in the code. It worked fine with others. My theory 
> was that it triggered some hardware bug that was normally masked by the frequent flushes, but 
> I wasn't able to track it down without heavy equipment.

Interesting. I'm having a data corruption issue with the G5 iommu that
I can fix by always mapping everything. That is, non-mapped virtual
IO pages are actually mapped to a dummy RAM page. It seems there is a
problem with the PCI<->HT bridge doing prefetches beyond iommu-mapped
pages, thus triggering an iommu error, which in turn probably triggers
some other chipset bug, ending up in data corruption. Having everything
mapped (allowing the prefetch to complete even though the prefetched data
is useless) fixes the problem and we don't see any corruption.

Of course, that means we can no longer use the mechanism we first
implemented, where we would only flush the iommu TLB once after running
out of virtual pages to allocate. We have to flush on every insertion
and removal now :( On the other hand, we can probably do per-tag TLB
flushes instead of flushing the whole TLB, once we properly figure out
how to access the tag registers on the chipset and their format (the
Darwin source code seems to imply that it's doable, but doesn't actually
do it; in this regard, Apple's implementation is impressively
sub-optimal).
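The dummy-page workaround can be sketched as a toy model (invented names; not the G5 code): every unused IOMMU entry points at a sacrificial RAM page instead of being a hole, so a bridge prefetch past the end of a real mapping reads harmless garbage rather than tripping the error path:

```python
PAGE_SIZE = 4096
DUMMY_FRAME = 0  # a sacrificial RAM page; data prefetched from it is discarded

class DummyBackedIommu:
    """Toy model of the workaround: unused entries point at a dummy RAM
    page instead of faulting, so prefetch past the end of a real mapping
    never triggers an iommu error."""

    def __init__(self, npages):
        self.table = [DUMMY_FRAME] * npages  # everything mapped from the start

    def map(self, vpage, frame):
        self.table[vpage] = frame

    def unmap(self, vpage):
        self.table[vpage] = DUMMY_FRAME  # back to the dummy page, never a hole

    def translate(self, vpage):
        return self.table[vpage]  # always succeeds; no error path to trip

iommu = DummyBackedIommu(npages=16)
iommu.map(3, frame=0x1234)
print(iommu.translate(3))  # the real frame
print(iommu.translate(4))  # prefetch one page past it: dummy frame, no fault
```

The cost, as noted above, is that the table must be kept correct eagerly, which forces a TLB flush on every insertion and removal.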

Ben.




* Re: IOMMUs was Re: Intel vs AMD x86-64
  2004-02-24 18:13         ` David S. Miller
@ 2004-02-27  1:28           ` Andi Kleen
  2004-02-24 18:41             ` David S. Miller
  2004-02-25  0:36             ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 7+ messages in thread
From: Andi Kleen @ 2004-02-27  1:28 UTC (permalink / raw)
  To: David S. Miller; +Cc: torvalds, linux-kernel, richard.brunner

On Tue, 24 Feb 2004 10:13:40 -0800
"David S. Miller" <davem@redhat.com> wrote:

> On 24 Feb 2004 15:06:47 +0100
> Andi Kleen <ak@suse.de> wrote:
> 
> > One side effect of this is that the IOMMU TLB flush strategy is a bit
> > dumb, because it has to do config space accesses for it.
> 
> This can be costly, but if you flush the IOMMU like sparc64 does (basically
> it's similar to how KMAPs are flushed on x86), the cost gets real low because
> then you only flush the whole iommu once every time you walk the whole mapping
> table of the iommu.
> 
> I'm sure you've probably thought of this already, just mentioning it in case
> you haven't.

Arjan already suggested it some time ago. In fact I implemented it; it's in the current code.
But it caused data corruption with a few devices, in particular 3ware, so I had
to disable it again. I didn't find a bug in the code, and it worked fine with other
devices. My theory was that it triggered some hardware bug that was normally masked
by the frequent flushes, but I wasn't able to track it down without heavy equipment.

Currently it is in there, but disabled by default. Can be enabled with iommu=nofullflush.

Also the other part of the dumbness is that the flush is global, not per mapping. I guess
you don't have that problem on Sparc64.

Anyway, even with these restrictions, having the GART as an IOMMU is much better than
doing software bouncing.
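The "software bouncing" being compared against can be sketched as a toy model (invented names; in the real kernel this is the swiotlb/bounce-buffer path): when a 32-bit-limited device must reach a buffer above 4GB, the kernel copies the data through a bounce buffer in low memory on every transfer, a cost a remapping IOMMU avoids entirely:

```python
DMA32_LIMIT = 1 << 32  # a 32-bit-only device can't address beyond this

class BounceDma:
    """Toy model of software bouncing: transfers to buffers above 4GB are
    copied through a low-memory bounce buffer; directly reachable buffers
    are free. A remapping IOMMU would avoid the copy in both cases."""

    def __init__(self):
        self.bytes_copied = 0  # total bounce-copy overhead

    def map_for_device(self, buf_addr, length):
        if buf_addr + length <= DMA32_LIMIT:
            return buf_addr            # directly reachable, no copy needed
        self.bytes_copied += length    # copy into the low-memory bounce buffer
        return 0x100_0000              # hypothetical bounce buffer address

dma = BounceDma()
dma.map_for_device(0x8000_0000, 4096)    # below 4GB: no overhead
dma.map_for_device(0x1_0000_0000, 4096)  # above 4GB: bounced, 4096 bytes copied
print(dma.bytes_copied)
```

With lots of highmem I/O the copies dominate, which is why even an imperfect GART-based IOMMU wins.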

-Andi


* Re: IOMMUs was Re: Intel vs AMD x86-64
  2004-02-24 15:50 richard.brunner
@ 2004-02-24 16:27 ` Mike Fedyk
  0 siblings, 0 replies; 7+ messages in thread
From: Mike Fedyk @ 2004-02-24 16:27 UTC (permalink / raw)
  To: richard.brunner; +Cc: linux-kernel

On Tue, Feb 24, 2004 at 09:50:02AM -0600, richard.brunner@amd.com wrote:
> 
> > -----Original Message-----
> > From: Andi Kleen [mailto:ak@suse.de] 
> 
>  
> > On Opteron the IOMMU code (ab)uses the built in AGPv3 GART in 
> > the CPU, which 
> > was originally intended for AGP. AMD converted it to be able 
> > to remap PCI especially for Linux, which I think deserves applause.
> > 
> > It works surprisingly well even though it was not designed as 
> > a real IOMMU. Of course one of the main advantages of a real 
> > IOMMU - preventing arbitrary memory corruption from broken 
> > devices - is lost because the remapping table is just a hole 
> > in the memory. I'm 
> > secretly hoping that when there is more support for Linux at 
> > chipset vendors they will someday add a bit to isolate all 
> > traffic that doesn't go through the GART from the main 
> > memory. This way you could get a much more reliable system 
> > that can tolerate broken PCI devices at a moderate 
> > performance penalty.
> 
> Andi is being modest. It was he and Andrea Arcangeli who convinced 
> me we had a problem. We found a way to trick the AGP
> GART hardware into helping, and then they turned it into a 
> "real" solution and helped us work the warts out of the BIOS 
> to enable it.

Yowza!  Open source helping to make better processors.  :-D


* RE: IOMMUs was Re: Intel vs AMD x86-64
@ 2004-02-24 15:50 richard.brunner
  2004-02-24 16:27 ` Mike Fedyk
  0 siblings, 1 reply; 7+ messages in thread
From: richard.brunner @ 2004-02-24 15:50 UTC (permalink / raw)
  To: linux-kernel


> -----Original Message-----
> From: Andi Kleen [mailto:ak@suse.de] 

 
> On Opteron the IOMMU code (ab)uses the built in AGPv3 GART in 
> the CPU, which 
> was originally intended for AGP. AMD converted it to be able 
> to remap PCI especially for Linux, which I think deserves applause.
> 
> It works surprisingly well even though it was not designed as 
> a real IOMMU. Of course one of the main advantages of a real 
> IOMMU - preventing arbitrary memory corruption from broken 
> devices - is lost because the remapping table is just a hole 
> in the memory. I'm 
> secretly hoping that when there is more support for Linux at 
> chipset vendors they will someday add a bit to isolate all 
> traffic that doesn't go through the GART from the main 
> memory. This way you could get a much more reliable system 
> that can tolerate broken PCI devices at a moderate 
> performance penalty.

Andi is being modest. It was he and Andrea Arcangeli who convinced 
me we had a problem. We found a way to trick the AGP
GART hardware into helping, and then they turned it into a 
"real" solution and helped us work the warts out of the BIOS 
to enable it.



end of thread, other threads:[~2004-02-25  0:43 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <Pine.LNX.4.44.0402231625220.9708-100000@chimarrao.boston.redhat.com.suse.lists.linux.kernel>
     [not found] ` <Pine.LNX.4.58.0402231335430.3005@ppc970.osdl.org.suse.lists.linux.kernel>
     [not found]   ` <20040223134853.5947a414.davem@redhat.com.suse.lists.linux.kernel>
     [not found]     ` <Pine.LNX.4.58.0402231359280.3005@ppc970.osdl.org.suse.lists.linux.kernel>
2004-02-24 14:06       ` IOMMUs was Re: Intel vs AMD x86-64 Andi Kleen
2004-02-24 18:13         ` David S. Miller
2004-02-27  1:28           ` Andi Kleen
2004-02-24 18:41             ` David S. Miller
2004-02-25  0:36             ` Benjamin Herrenschmidt
2004-02-24 15:50 richard.brunner
2004-02-24 16:27 ` Mike Fedyk
