linux-kernel.vger.kernel.org archive mirror
* problem with blk_queue_bounce_limit()
@ 2003-06-05  6:42 David Mosberger
  2003-06-05  7:20 ` David S. Miller
  0 siblings, 1 reply; 29+ messages in thread
From: David Mosberger @ 2003-06-05  6:42 UTC (permalink / raw)
  To: axboe; +Cc: linux-kernel

On platforms with I/O MMU hardware, memory above 4GB, and IDE hard disks,
this check:

		BUG_ON(dma_addr < BLK_BOUNCE_ISA);

causes an instant panic.  The reason is quite obvious: since there is
an I/O MMU, BLK_BOUNCE_ISA is effectively unlimited, and most IDE
controllers can of course DMA only to <4GB.

So, the check is wrong.  I think the proper way to fix this is to pass
a "struct device" into the routine and then to use dma_supported() to
check whether bounce buffers will be needed.  Do you agree?  If so,
can you fix it?
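
Roughly what I have in mind (just a sketch; it assumes a struct device
gets plumbed through to the queue setup, which is exactly the part
that's missing today):

	/* sketch: let the DMA API decide whether bouncing is needed */
	void blk_queue_bounce_limit(request_queue_t *q, struct device *dev,
				    u64 dma_addr)
	{
		if (dev && dma_supported(dev, dma_addr)) {
			/* the device (or the I/O MMU in front of it)
			 * can reach all of memory: no bouncing needed */
			q->bounce_pfn = blk_max_pfn;
			return;
		}
		/* otherwise bounce anything above the device's limit */
		q->bounce_pfn = dma_addr >> PAGE_SHIFT;
	}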

Thanks,

	--david


* Re: problem with blk_queue_bounce_limit()
  2003-06-05  6:42 problem with blk_queue_bounce_limit() David Mosberger
@ 2003-06-05  7:20 ` David S. Miller
  2003-06-06  6:42   ` David Mosberger
  0 siblings, 1 reply; 29+ messages in thread
From: David S. Miller @ 2003-06-05  7:20 UTC (permalink / raw)
  To: davidm; +Cc: axboe, linux-kernel

On Wed, 2003-06-04 at 23:42, David Mosberger wrote:
> On platforms with I/O MMU hardware, memory above 4GB, and IDE hard disks,
> this check:
> 
> 		BUG_ON(dma_addr < BLK_BOUNCE_ISA);

Doesn't panic on sparc64, let this be your guiding light :-)

-- 
David S. Miller <davem@redhat.com>


* Re: problem with blk_queue_bounce_limit()
  2003-06-05  7:20 ` David S. Miller
@ 2003-06-06  6:42   ` David Mosberger
  2003-06-06  6:45     ` David S. Miller
  2003-06-06  6:52     ` David S. Miller
  0 siblings, 2 replies; 29+ messages in thread
From: David Mosberger @ 2003-06-06  6:42 UTC (permalink / raw)
  To: manfred; +Cc: axboe, davidm, David S. Miller, linux-kernel

>>>>> On 05 Jun 2003 00:20:53 -0700, "David S. Miller" <davem@redhat.com> said:

  DaveM> On Wed, 2003-06-04 at 23:42, David Mosberger wrote:
  >> On platforms with I/O MMU hardware, memory above 4GB, and IDE hard disks,
  >> this check:

  >> BUG_ON(dma_addr < BLK_BOUNCE_ISA);

  DaveM> Doesn't panic on sparc64, let this be your guiding light :-)

I checked with Dave in private email and, as I suspected, this is a
(potential) problem on sparc64 as well.  Let me try to be even more
clear: on many 64-bit platforms, BLK_BOUNCE_ISA will be bigger than
4GB, so the BUG_ON() will trigger for IDE controllers that can DMA
only to, say, 4GB (and it's probably not even an IDE-only problem, I
think we don't see it on SCSI because our machines have SCSI
controllers that can DMA >4GB).

Manfred, I've readdressed this mail to you because, according to Google,
you're the original author of the patch
(http://www.cs.helsinki.fi/linux/linux-kernel/2002-02/0032.html).

As I stated earlier, I think this code simply makes no sense on a
platform with an I/O MMU.  Hence my suggestion to deal with this via
dma_supported().

	--david


* Re: problem with blk_queue_bounce_limit()
  2003-06-06  6:42   ` David Mosberger
@ 2003-06-06  6:45     ` David S. Miller
  2003-06-06  6:54       ` David Mosberger
  2003-06-06  6:52     ` David S. Miller
  1 sibling, 1 reply; 29+ messages in thread
From: David S. Miller @ 2003-06-06  6:45 UTC (permalink / raw)
  To: davidm, davidm; +Cc: manfred, axboe, linux-kernel

   From: David Mosberger <davidm@napali.hpl.hp.com>
   Date: Thu, 5 Jun 2003 23:42:17 -0700
   
   Manfred, I've readdressed this mail to you because, according to Google,
   you're the original author of the patch
   (http://www.cs.helsinki.fi/linux/linux-kernel/2002-02/0032.html).
   
   As I stated earlier, I think this code simply makes no sense on a
   platform with an I/O MMU.  Hence my suggestion to deal with this via
   dma_supported().
   
David, what is blk_max_low_pfn set to on your ia64 systems?


* Re: problem with blk_queue_bounce_limit()
  2003-06-06  6:42   ` David Mosberger
  2003-06-06  6:45     ` David S. Miller
@ 2003-06-06  6:52     ` David S. Miller
  2003-06-06  7:19       ` David Mosberger
  1 sibling, 1 reply; 29+ messages in thread
From: David S. Miller @ 2003-06-06  6:52 UTC (permalink / raw)
  To: davidm, davidm; +Cc: manfred, axboe, linux-kernel


Ok, ide-lib.c can call blk_queue_bounce_limit with only
3 different args:

1) BLK_BOUNCE_HIGH (--> blk_max_low_pfn)
2) BLK_BOUNCE_ANY (--> blk_max_pfn)
3) pci_dev->dma_mask

David, you're not setting PCI_DMA_BUS_IS_PHYS properly.
For an IOMMU it should be set to zero.  This tells the block
layer whether you have an IOMMU or not.

I knew it would be a bug like this.

Anyways, the problem case is #3 above and that will never
happen if you start setting PCI_DMA_BUS_IS_PHYS to zero
as appropriate.
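
For reference, the decision logic behind those three cases is roughly
this (sketch only; q and pdev stand for the drive's queue and PCI
device):

	u64 addr = BLK_BOUNCE_HIGH;	/* case #1: bounce highmem pages */

	if (!PCI_DMA_BUS_IS_PHYS)
		addr = BLK_BOUNCE_ANY;	/* case #2: IOMMU, never bounce */
	else if (pdev && pdev->dma_mask)
		addr = pdev->dma_mask;	/* case #3: device's DMA limit */

	blk_queue_bounce_limit(q, addr);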


* Re: problem with blk_queue_bounce_limit()
  2003-06-06  6:45     ` David S. Miller
@ 2003-06-06  6:54       ` David Mosberger
  2003-06-06  7:08         ` David S. Miller
  0 siblings, 1 reply; 29+ messages in thread
From: David Mosberger @ 2003-06-06  6:54 UTC (permalink / raw)
  To: David S. Miller; +Cc: davidm, manfred, axboe, linux-kernel

>>>>> On Thu, 05 Jun 2003 23:45:26 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:

  >> From: David Mosberger <davidm@napali.hpl.hp.com>
  >> Date: Thu, 5 Jun 2003 23:42:17 -0700

  >> Manfred, I'm readdressed this mail to you because according to google,
  >> you're the original author of the patch
  >> (http://www.cs.helsinki.fi/linux/linux-kernel/2002-02/0032.html).

  >> Like I stated earlier, I think this code simply makes no sense on a
  >> platform with I/O MMU.  Hence my suggestion to deal with this via
  >> dma_supported().

  > David, what is blk_max_low_pfn set to on your ia64 systems?

max_low_pfn, which is going to be the pfn of some page > 4GB (for
machines with a sufficient amount of memory).

	--david


* Re: problem with blk_queue_bounce_limit()
  2003-06-06  6:54       ` David Mosberger
@ 2003-06-06  7:08         ` David S. Miller
  0 siblings, 0 replies; 29+ messages in thread
From: David S. Miller @ 2003-06-06  7:08 UTC (permalink / raw)
  To: davidm, davidm; +Cc: manfred, axboe, linux-kernel

   From: David Mosberger <davidm@napali.hpl.hp.com>
   Date: Thu, 5 Jun 2003 23:54:50 -0700

   >>>>> On Thu, 05 Jun 2003 23:45:26 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:

     > David, what is blk_max_low_pfn set to on your ia64 systems?
   
   max_low_pfn, which is going to be the pfn of some page > 4GB (for
   machines with a sufficient amount of memory).

Right, but see my other email: you need to set PCI_DMA_BUS_IS_PHYS
properly.  This tells the block layer whether you have an IOMMU or not.


* Re: problem with blk_queue_bounce_limit()
  2003-06-06  6:52     ` David S. Miller
@ 2003-06-06  7:19       ` David Mosberger
  2003-06-06  7:32         ` David S. Miller
  0 siblings, 1 reply; 29+ messages in thread
From: David Mosberger @ 2003-06-06  7:19 UTC (permalink / raw)
  To: David S. Miller; +Cc: davidm, manfred, axboe, linux-kernel

>>>>> On Thu, 05 Jun 2003 23:52:49 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:

  David> you're not setting PCI_DMA_BUS_IS_PHYS properly.  For an
  David> IOMMU it should be set to zero.  This tells the block layer
  David> whether you have an IOMMU or not.

Ah, yes, thanks for pointing this out.

PCI_DMA_BUS_IS_PHYS (and its description) is quite misleading: it
claims that it has something to do with there being an equivalence
between PCI bus and physical addresses.  That's actually the case for
(small) ia64 platforms, so that's why we ended up setting it to 1.

	--david


* Re: problem with blk_queue_bounce_limit()
  2003-06-06  7:19       ` David Mosberger
@ 2003-06-06  7:32         ` David S. Miller
  2003-06-06  7:44           ` Christoph Hellwig
  2003-06-06 20:13           ` David Mosberger
  0 siblings, 2 replies; 29+ messages in thread
From: David S. Miller @ 2003-06-06  7:32 UTC (permalink / raw)
  To: davidm, davidm; +Cc: manfred, axboe, linux-kernel

   From: David Mosberger <davidm@napali.hpl.hp.com>
   Date: Fri, 6 Jun 2003 00:19:08 -0700

   PCI_DMA_BUS_IS_PHYS (and its description) is quite misleading: it
   claims that it has something to do with there being an equivalence
   between PCI bus and physical addresses.  That's actually the case for
   (small) ia64 platforms, so that's why we ended up setting it to 1.
   
It does have to do with such an equivalence.  If your port couldn't
work when drivers used the deprecated virt_to_bus/bus_to_virt, you must
set PCI_DMA_BUS_IS_PHYS to zero.

The whole block layer makes all kinds of assumptions about whether
physically contiguous addresses will also be contiguous in the bus
addresses the device will actually use to perform the DMA transfer.

Likewise, when PCI_DMA_BUS_IS_PHYS is zero, it knows that many
IOMMU'ish things can occur, such as taking non-physically-contiguous
pages and mapping them using the IOMMU to create bus-contiguous
mappings.
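
That's also what lets pci_map_sg() hand a driver back fewer DMA
segments than the scatterlist it was given (sketch; setup_tx_desc()
is a made-up driver hook):

	/* the IOMMU may coalesce discontiguous pages into bus-contiguous
	 * ranges, so the mapped count can be smaller than nents */
	int i, mapped = pci_map_sg(pdev, sg, nents, PCI_DMA_TODEVICE);

	for (i = 0; i < mapped; i++)	/* setup_tx_desc(): made-up hook */
		setup_tx_desc(sg_dma_address(&sg[i]), sg_dma_len(&sg[i]));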

We could convert the few compile time checks of PCI_DMA_BUS_IS_PHYS
so that you can set this based upon the configuration of the machine
if for some configurations it is true.  drivers/net/tg3.c is the
only offender, my bad :-)


* Re: problem with blk_queue_bounce_limit()
  2003-06-06  7:44           ` Christoph Hellwig
@ 2003-06-06  7:43             ` David S. Miller
  2003-06-06  7:51               ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: David S. Miller @ 2003-06-06  7:43 UTC (permalink / raw)
  To: hch; +Cc: davidm, davidm, manfred, axboe, linux-kernel

   From: Christoph Hellwig <hch@infradead.org>
   Date: Fri, 6 Jun 2003 08:44:10 +0100

   PCI_DMA_BUS_IS_PHYS should be a property of each struct device because
   a machine might have an IOMMU for one bus type but not another,

We know of no such systems.  Even in mixed-bus environments such as
sparc64 SBUS+PCI systems.


* Re: problem with blk_queue_bounce_limit()
  2003-06-06  7:32         ` David S. Miller
@ 2003-06-06  7:44           ` Christoph Hellwig
  2003-06-06  7:43             ` David S. Miller
  2003-06-06 20:13           ` David Mosberger
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2003-06-06  7:44 UTC (permalink / raw)
  To: David S. Miller; +Cc: davidm, davidm, manfred, axboe, linux-kernel

On Fri, Jun 06, 2003 at 12:32:30AM -0700, David S. Miller wrote:
> We could convert the few compile time checks of PCI_DMA_BUS_IS_PHYS
> so that you can set this based upon the configuration of the machine
> if for some configurations it is true.  drivers/net/tg3.c is the
> only offender, my bad :-)

That reminds me of another thing that came up when looking over
the scsi bounce limit setting code.  All this bounce code is based
purely on a scsi host and, if it exists, its struct device - imo
PCI_DMA_BUS_IS_PHYS should be a property of each struct device because
a machine might have an IOMMU for one bus type but not another, e.g.

	dma_is_phys(dev);

For those arches that need it, this could be expanded to per-bus
code; for all those where everything is either phys or not, it
would be a simple define that ignores the argument and still produces
"perfect" code.


* Re: problem with blk_queue_bounce_limit()
  2003-06-06  7:43             ` David S. Miller
@ 2003-06-06  7:51               ` Christoph Hellwig
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2003-06-06  7:51 UTC (permalink / raw)
  To: David S. Miller; +Cc: davidm, davidm, manfred, axboe, linux-kernel

On Fri, Jun 06, 2003 at 12:43:05AM -0700, David S. Miller wrote:
>    From: Christoph Hellwig <hch@infradead.org>
>    Date: Fri, 6 Jun 2003 08:44:10 +0100
> 
>    PCI_DMA_BUS_IS_PHYS should be a property of each struct device because
>    a machine might have an IOMMU for one bus type but not another,
> 
> We know of no such systems.  Even in mixed-bus environments such as
> sparc64 SBUS+PCI systems.

What about alpha systems with EISA and PCI slots?


* Re: problem with blk_queue_bounce_limit()
  2003-06-06  7:32         ` David S. Miller
  2003-06-06  7:44           ` Christoph Hellwig
@ 2003-06-06 20:13           ` David Mosberger
  2003-06-07  6:44             ` David S. Miller
  1 sibling, 1 reply; 29+ messages in thread
From: David Mosberger @ 2003-06-06 20:13 UTC (permalink / raw)
  To: David S. Miller; +Cc: davidm, davidm, manfred, axboe, linux-kernel

>>>>> On Fri, 06 Jun 2003 00:32:30 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:

  David>    From: David Mosberger <davidm@napali.hpl.hp.com> Date:
  David> Fri, 6 Jun 2003 00:19:08 -0700

  David>    PCI_DMA_BUS_IS_PHYS (and its description) is quite
  David> misleading: it claims that it has something to do with there
  David> being an equivalence between PCI bus and physical addresses.
  David> That's actually the case for (small) ia64 platforms, so that's
  David> why we ended up setting it to 1.

  David> It does have to do with such an equivalence.  If your port
  David> couldn't work when drivers used the deprecated
  David> virt_to_bus/bus_to_virt, you must set PCI_DMA_BUS_IS_PHYS to
  David> zero.

Yes, but the comment certainly is confusing.  How about something like
this:

/*
 * PCI_DMA_BUS_IS_PHYS should be set to 1 if there is necessarily a
 * direct correspondence between device bus addresses and CPU physical
 * addresses.  Platforms with a hardware I/O MMU _must_ turn this off
 * to suppress the bounce buffer handling code in the block and
 * network device layers.  Platforms with separate bus address spaces
 * _must_ turn this off and provide a device DMA mapping
 * implementation that takes care of the necessary address
 * translation.
 */
#define PCI_DMA_BUS_IS_PHYS	pci_dma_bus_is_phys

  David> The whole block layer makes all kinds of assumptions about
  David> whether physically contiguous addresses will also be
  David> contiguous in the bus addresses the device will actually
  David> use to perform the DMA transfer.

This sounds all very dramatic, but try as I might, all I find is three
places where PCI_DMA_BUS_IS_PHYS is used:

	- ide-lib.c: used to disable bounce buffering
	- scsi_lib.c: used to disable bounce buffering
	- tg3.c: what the heck??

The tg3 code looks ugly in the extreme (sorry).  If I understand it
right, it's trying to work around a bug which shows up when a packet
covers a certain address?  With an I/O MMU, you then remap the
offending buffer again before freeing the old mapping, which will
ensure a different address for the packet, whereas in the non-I/O MMU
case, you copy the entire socket buffer (since just remapping it won't
change the address and since there is no interface to copy just a
portion of a socket buffer) and then do the remapping.

Did I get this right (or at least close enough)?

It seems really bad to me to rely on implementation-specifics of the
DMA API.  I suspect the code would break on a platform which has
separate bus address spaces but no (hardware) I/O TLB?  (Yeah,
probably not a supported scenario, but with a proper fix, this
wouldn't be a problem.)

Does this bug happen often enough that it's performance-critical?
Otherwise, you could just always use the copy-the-entire-buffer
workaround.  If it's performance-critical, would it make sense
to extend the socket buffer API to allow copying a portion of
a buffer?

I really dislike PCI_DMA_BUS_IS_PHYS, because it introduces a
discontinuity.  I don't think it should be necessary.

If it wasn't for tg3.c, couldn't PCI_DMA_BUS_IS_PHYS be gotten rid of
much more cleanly with a dma_max_phys_addr(dev) function, which would
return the maximum physical address that device DEV can address
(either directly, or via an I/O TLB)?
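
Sketch of what I mean (dev_has_iommu() is a made-up per-platform
helper):

	u64 dma_max_phys_addr(struct device *dev)
	{
		if (dev_has_iommu(dev))		/* made-up helper */
			return BLK_BOUNCE_ANY;	/* I/O TLB can remap to
						   any page */
		return dev->dma_mask ? *dev->dma_mask : BLK_BOUNCE_HIGH;
	}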

  David> We could convert the few compile time checks of
  David> PCI_DMA_BUS_IS_PHYS so that you can set this based upon the
  David> configuration of the machine if for some configurations it is
  David> true.  drivers/net/tg3.c is the only offender, my bad :-)

Yes.  Would you mind fixing that?

	--david


* Re: problem with blk_queue_bounce_limit()
  2003-06-06 20:13           ` David Mosberger
@ 2003-06-07  6:44             ` David S. Miller
  2003-06-07  7:05               ` David Mosberger
  2003-06-07  7:20               ` David Mosberger
  0 siblings, 2 replies; 29+ messages in thread
From: David S. Miller @ 2003-06-07  6:44 UTC (permalink / raw)
  To: davidm, davidm; +Cc: manfred, axboe, linux-kernel

   From: David Mosberger <davidm@napali.hpl.hp.com>
   Date: Fri, 6 Jun 2003 13:13:38 -0700

   Yes, but the comment certainly is confusing.  How about something like
   this:
   
No arguments.

     David> The whole block layer makes all kinds of assumptions about
     David> whether physically contiguous addresses will also be
     David> contiguous in the bus addresses the device will actually
     David> use to perform the DMA transfer.
   
   This sounds all very dramatic, but try as I might, all I find is three
   places where PCI_DMA_BUS_IS_PHYS is used:
   
   	- ide-lib.c: used to disable bounce buffering
   	- scsi_lib.c: used to disable bounce buffering

Fix your grep, 

   	- tg3.c: what the heck??
   
In order to work around a "just below 4GB dma address" bug in the
chip, we have to have a reliable way to "remap" the given networking
buffer to some other DMA address that does not meet the hw bug case.

If we don't have an IOMMU, we have to change the buffer itself and
we accomplish this with SKB copy.

But on an IOMMU system, we could end up mapping to the same bogus
DMA address.  So we have to solve this problem by keeping the
existing bad mapping, doing a new DMA mapping, then throwing away
the old one.
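
The test itself is roughly of this shape (sketch; see tg3.c for the
exact constants):

	/* does this mapping end in the buggy window just below 4GB? */
	static inline int tg3_4g_overflow_test(dma_addr_t mapping, int len)
	{
		u32 base = (u32) mapping & 0xffffffff;

		return (base > 0xffffdcc0) && (base + len + 8 < base);
	}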
   
   Did I get this right (or at least close enough)?
   
Precisely.

   Otherwise, you could just always use the copy-the-entire-buffer
   workaround.

The new PCI dma mapping I make could map to the SAME bad DMA
address; that's the problem.  I could loop forever making new
DMA mappings on an IOMMU system, each and every one falling into
the hw bug case.

   I really dislike PCI_DMA_BUS_IS_PHYS, because it introduces a
   discontinuity.  I don't think it should be necessary.
   
I totally disagree.

     David> We could convert the few compile time checks of
     David> PCI_DMA_BUS_IS_PHYS so that you can set this based upon the
     David> configuration of the machine if for some configurations it is
     David> true.  drivers/net/tg3.c is the only offender, my bad :-)
   
   Yes.  Would you mind fixing that?
   
Sure, no problem.

   


* Re: problem with blk_queue_bounce_limit()
  2003-06-07  6:44             ` David S. Miller
@ 2003-06-07  7:05               ` David Mosberger
  2003-06-07  7:11                 ` David S. Miller
  2003-06-07  7:20               ` David Mosberger
  1 sibling, 1 reply; 29+ messages in thread
From: David Mosberger @ 2003-06-07  7:05 UTC (permalink / raw)
  To: David S. Miller; +Cc: davidm, manfred, axboe, linux-kernel

>>>>> On Fri, 06 Jun 2003 23:44:01 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:

  DaveM> But on an IOMMU system, we could end up mapping to the same
  DaveM> bogus DMA address.

Ah, yes, just changing the buffer address doesn't guarantee a
different bus address.  I missed that.

  DaveM> So we have to solve this problem by keeping the existing bad
  DaveM> mapping, doing a new DMA mapping, then throwing away the old
  DaveM> one.

But you're creating a new mapping for the old buffer.  What if you had
a DMA API implementation which consolidates multiple mapping attempts
of the same buffer into a single mapping entry (along with a reference
count)?  That would break the workaround.

Isn't the proper fix to (a) get a new buffer, (b) create a mapping for
the new buffer, and (c) destroy the mapping for the old buffer?  That
should guarantee a different bus address, no matter what the
DMA-mapping implementation.
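
I.e., something like this (sketch; old_addr and len stand for the
existing mapping):

	new_skb  = skb_copy(skb, GFP_ATOMIC);			/* (a) */
	new_addr = pci_map_single(tp->pdev, new_skb->data,	/* (b) */
				  len, PCI_DMA_TODEVICE);
	pci_unmap_single(tp->pdev, old_addr, len,		/* (c) */
			 PCI_DMA_TODEVICE);

	/* the old mapping is still live when the new one is created,
	 * so the new one cannot be handed the same bus address */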

Plus then you don't have to rely on PCI_DMA_BUS_IS_PHYS.

	--david


* Re: problem with blk_queue_bounce_limit()
  2003-06-07  7:05               ` David Mosberger
@ 2003-06-07  7:11                 ` David S. Miller
  2003-06-10 20:01                   ` David Mosberger
  2003-06-15  7:06                   ` Anton Blanchard
  0 siblings, 2 replies; 29+ messages in thread
From: David S. Miller @ 2003-06-07  7:11 UTC (permalink / raw)
  To: davidm, davidm; +Cc: manfred, axboe, linux-kernel

   From: David Mosberger <davidm@napali.hpl.hp.com>
   Date: Sat, 7 Jun 2003 00:05:06 -0700

   But you're creating a new mapping for the old buffer.  What if you had
   a DMA API implementation which consolidates multiple mapping attempts
   of the same buffer into a single mapping entry (along with a reference
   count)?  That would break the workaround.
   
I hope nobody is doing this, it would probably break other things
we haven't considered yet.

You can't support all the BIO_VMERGE_BOUNDARY stuff properly in
such a scheme.  And you _WANT_ to support that when you have an
IOMMU: it considerably shrinks the number of addr/len DMA descriptor
entries a chip has to process for each block I/O.

   Isn't the proper fix to (a) get a new buffer, (b) create a mapping for
   the new buffer, and (c) destroy the mapping for the old buffer?  That
   should guarantee a different bus address, no matter what the
   DMA-mapping implementation.
   
I suppose this would work; feel free to code this up for the
tg3 driver for me, because I certainly lack the time to do this.


* Re: problem with blk_queue_bounce_limit()
  2003-06-07  7:20               ` David Mosberger
@ 2003-06-07  7:19                 ` David S. Miller
  2003-06-07  9:44                 ` Russell King
  2003-06-10 20:24                 ` David Mosberger
  2 siblings, 0 replies; 29+ messages in thread
From: David S. Miller @ 2003-06-07  7:19 UTC (permalink / raw)
  To: davidm, davidm; +Cc: manfred, axboe, linux-kernel

   From: David Mosberger <davidm@napali.hpl.hp.com>
   Date: Sat, 7 Jun 2003 00:20:46 -0700

   >>>>> On Fri, 06 Jun 2003 23:44:01 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:
   
     DaveM>    This sounds all very dramatic, but try as I might, all I
     DaveM> find is three places where PCI_DMA_BUS_IS_PHYS is used:
   
     DaveM> Fix your grep,
   
   What am I missing?
   
Sorry, I meant to mention BIO_VMERGE_BOUNDARY, which works in
concert with PCI_DMA_BUS_IS_PHYS.  You really have to set it
accurately to get the most out of the block layer's ability to
take advantage of IOMMUs.


* Re: problem with blk_queue_bounce_limit()
  2003-06-07  6:44             ` David S. Miller
  2003-06-07  7:05               ` David Mosberger
@ 2003-06-07  7:20               ` David Mosberger
  2003-06-07  7:19                 ` David S. Miller
                                   ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread
From: David Mosberger @ 2003-06-07  7:20 UTC (permalink / raw)
  To: David S. Miller; +Cc: davidm, manfred, axboe, linux-kernel

>>>>> On Fri, 06 Jun 2003 23:44:01 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:

  DaveM>    This sounds all very dramatic, but try as I might, all I
  DaveM> find is three places where PCI_DMA_BUS_IS_PHYS is used:

  DaveM> Fix your grep,

What am I missing?

	--david

$ find . -type f | xargs fgrep PCI_DMA_BUS_IS_PHYS | fgrep -v /SCCS/
./drivers/ide/ide-lib.c:                if (!PCI_DMA_BUS_IS_PHYS)
./drivers/net/tg3.c:#ifndef PCI_DMA_BUS_IS_PHYS
./drivers/net/tg3.c:#define PCI_DMA_BUS_IS_PHYS 1
./drivers/net/tg3.c:#if !PCI_DMA_BUS_IS_PHYS
./drivers/net/tg3.c:#if !PCI_DMA_BUS_IS_PHYS
./drivers/scsi/scsi_lib.c:              if (PCI_DMA_BUS_IS_PHYS && host_dev && host_dev->dma_mask)
./include/asm-alpha/pci.h:#define PCI_DMA_BUS_IS_PHYS  0
./include/asm-arm/pci.h:#define PCI_DMA_BUS_IS_PHYS     (0)
./include/asm-i386/pci.h:#define PCI_DMA_BUS_IS_PHYS    (1)
./include/asm-ia64/pci.h: * PCI_DMA_BUS_IS_PHYS should be set to 1 if there is necessarily a direct corespondence
./include/asm-ia64/pci.h:#define PCI_DMA_BUS_IS_PHYS    pci_dma_bus_is_phys
./include/asm-m68k/pci.h:#define PCI_DMA_BUS_IS_PHYS    (1)
./include/asm-parisc/pci.h:#define PCI_DMA_BUS_IS_PHYS     (1)
./include/asm-ppc/pci.h:#define PCI_DMA_BUS_IS_PHYS     (1)
./include/asm-ppc64/pci.h:#define PCI_DMA_BUS_IS_PHYS   (0)
./include/asm-s390/pci.h:#define PCI_DMA_BUS_IS_PHYS (1)
./include/asm-sparc/pci.h:#define PCI_DMA_BUS_IS_PHYS   (0)
./include/asm-sparc64/pci.h:#define PCI_DMA_BUS_IS_PHYS (0)
./include/asm-um/pci.h:#define PCI_DMA_BUS_IS_PHYS     (1)
./include/asm-x86_64/pci.h:#define PCI_DMA_BUS_IS_PHYS  (0)
./include/asm-x86_64/pci.h:#define PCI_DMA_BUS_IS_PHYS  1


* Re: problem with blk_queue_bounce_limit()
  2003-06-07  7:20               ` David Mosberger
  2003-06-07  7:19                 ` David S. Miller
@ 2003-06-07  9:44                 ` Russell King
  2003-06-07  9:47                   ` David S. Miller
  2003-06-10 20:24                 ` David Mosberger
  2 siblings, 1 reply; 29+ messages in thread
From: Russell King @ 2003-06-07  9:44 UTC (permalink / raw)
  To: davidm; +Cc: David S. Miller, manfred, axboe, linux-kernel

On Sat, Jun 07, 2003 at 12:20:46AM -0700, David Mosberger wrote:
> ./include/asm-arm/pci.h:#define PCI_DMA_BUS_IS_PHYS     (0)

I suspect we probably set this incorrectly; we have some platforms where
there is merely an offset between the phys address and the bus address.
For these, I think we want to set this to 1.

Other platforms require the dma functions to allocate a new buffer
and copy the data to work around buggy "won't fix" errata (eg, new buffer
below 1MB), and for these I think we want to leave this at 0.

It is rather unfortunate that this got called "PCI_xxx" since it has
been used in a non-PCI-bus manner in (eg) the scsi layer.

Also note that I have platforms where the dma_mask is a real mask, not
"a set of zeros followed by a set of ones from MSB to LSB."  I can
see this breaking the block layer if PCI_DMA_BUS_IS_PHYS is defined
to one. 8/
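
E.g. (made-up values, just to illustrate the distinction):

	u64 simple_mask = 0x00000000ffffffffULL; /* contiguous ones:
						     a plain 4GB limit */
	u64 holey_mask  = 0x000000ffffe00000ULL; /* hypothetical: some
						     address lines simply
						     not decoded */

	/* treating dma_mask as a maximum address (as the bounce logic
	 * does) is only valid for the first kind of mask */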

-- 
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html



* Re: problem with blk_queue_bounce_limit()
  2003-06-07  9:44                 ` Russell King
@ 2003-06-07  9:47                   ` David S. Miller
  2003-06-07 13:23                     ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: David S. Miller @ 2003-06-07  9:47 UTC (permalink / raw)
  To: rmk; +Cc: davidm, manfred, axboe, linux-kernel

   From: Russell King <rmk@arm.linux.org.uk>
   Date: Sat, 7 Jun 2003 10:44:34 +0100
   
   It is rather unfortunate that this got called "PCI_xxx" since it has
   been used in a non pci-bus manner in (eg) the scsi layer.

I agree; the idea at the time Jens and I did this
work was that the generic device layer would provide interfaces
by which we could test this given a struct device.


* Re: problem with blk_queue_bounce_limit()
  2003-06-07  9:47                   ` David S. Miller
@ 2003-06-07 13:23                     ` Christoph Hellwig
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2003-06-07 13:23 UTC (permalink / raw)
  To: David S. Miller; +Cc: rmk, davidm, manfred, axboe, linux-kernel

On Sat, Jun 07, 2003 at 02:47:04AM -0700, David S. Miller wrote:
> I agree, the idea at the time that Jens and myself did this
> work was that the generic device layer would provide interfaces
> by which we could test this given a struct device.

Wouldn't that be the dma_is_phys(dev) call I mentioned earlier in
this thread that you didn't like? :)



* Re: problem with blk_queue_bounce_limit()
  2003-06-07  7:11                 ` David S. Miller
@ 2003-06-10 20:01                   ` David Mosberger
  2003-06-12  6:47                     ` David S. Miller
  2003-06-15  7:06                   ` Anton Blanchard
  1 sibling, 1 reply; 29+ messages in thread
From: David Mosberger @ 2003-06-10 20:01 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel, davidm

>>>>> On Sat, 07 Jun 2003 00:11:40 -0700 (PDT), "David S. Miller" <davem@redhat.com> said:

  >>    From: David Mosberger <davidm@napali.hpl.hp.com> Date: Sat, 7
  >> Jun 2003 00:05:06 -0700

  >>    Isn't the proper fix to (a) get a new buffer, (b) create a
  >> mapping for the new buffer, (c) destroy the mapping for the old
  >> buffer.  That should guarantee a different bus address, no matter
  >> what the DMA-mapping implementation.

  > I suppose this would work; feel free to code this up for the tg3
  > driver for me, because I certainly lack the time to do this.

How about the attached patch?

The code should work on !PCI_DMA_BUS_IS_PHYS (IOMMU) systems as well,
because it creates the new mapping before destroying the old one(s).

The patch has been compile-tested but not runtime-tested: I can't
trigger the bug on ia64, because there, the end of the 4GB space is
reserved for firmware purposes, so we'll never have available memory
near address 0xffffdcc0.

The performance impact should be negligible, as the probability of
hitting the bug case should be vanishingly small.

	--david

===== drivers/net/tg3.c 1.68 vs edited =====
--- 1.68/drivers/net/tg3.c	Wed Apr 23 20:02:11 2003
+++ edited/drivers/net/tg3.c	Tue Jun 10 12:53:15 2003
@@ -2234,73 +2234,17 @@
 	schedule_work(&tp->reset_task);
 }
 
-#if !PCI_DMA_BUS_IS_PHYS
-static void tg3_set_txd_addr(struct tg3 *tp, int entry, dma_addr_t mapping)
-{
-	if (tp->tg3_flags & TG3_FLAG_HOST_TXDS) {
-		struct tg3_tx_buffer_desc *txd = &tp->tx_ring[entry];
-
-		txd->addr_hi = ((u64) mapping >> 32);
-		txd->addr_lo = ((u64) mapping & 0xffffffff);
-	} else {
-		unsigned long txd;
-
-		txd = (tp->regs +
-		       NIC_SRAM_WIN_BASE +
-		       NIC_SRAM_TX_BUFFER_DESC);
-		txd += (entry * TXD_SIZE);
-
-		if (sizeof(dma_addr_t) != sizeof(u32))
-			writel(((u64) mapping >> 32),
-			       txd + TXD_ADDR + TG3_64BIT_REG_HIGH);
-
-		writel(((u64) mapping & 0xffffffff),
-		       txd + TXD_ADDR + TG3_64BIT_REG_LOW);
-	}
-}
-#endif
-
 static void tg3_set_txd(struct tg3 *, int, dma_addr_t, int, u32, u32);
 
 static int tigon3_4gb_hwbug_workaround(struct tg3 *tp, struct sk_buff *skb,
 				       u32 guilty_entry, int guilty_len,
 				       u32 last_plus_one, u32 *start, u32 mss)
 {
+	struct sk_buff *new_skb = skb_copy(skb, GFP_ATOMIC);
 	dma_addr_t new_addr;
 	u32 entry = *start;
 	int i;
 
-#if !PCI_DMA_BUS_IS_PHYS
-	/* IOMMU, just map the guilty area again which is guaranteed to
-	 * use different addresses.
-	 */
-
-	i = 0;
-	while (entry != guilty_entry) {
-		entry = NEXT_TX(entry);
-		i++;
-	}
-	if (i == 0) {
-		new_addr = pci_map_single(tp->pdev, skb->data, guilty_len,
-					  PCI_DMA_TODEVICE);
-	} else {
-		skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];
-
-		new_addr = pci_map_page(tp->pdev,
-					frag->page, frag->page_offset,
-					guilty_len, PCI_DMA_TODEVICE);
-	}
-	pci_unmap_single(tp->pdev, pci_unmap_addr(&tp->tx_buffers[guilty_entry],
-						  mapping),
-			 guilty_len, PCI_DMA_TODEVICE);
-	tg3_set_txd_addr(tp, guilty_entry, new_addr);
-	pci_unmap_addr_set(&tp->tx_buffers[guilty_entry], mapping,
-			   new_addr);
-	*start = last_plus_one;
-#else
-	/* Oh well, no IOMMU, have to allocate a whole new SKB. */
-	struct sk_buff *new_skb = skb_copy(skb, GFP_ATOMIC);
-
 	if (!new_skb) {
 		dev_kfree_skb(skb);
 		return -1;
@@ -2337,7 +2281,6 @@
 	}
 
 	dev_kfree_skb(skb);
-#endif
 
 	return 0;
 }


* Re: problem with blk_queue_bounce_limit()
  2003-06-07  7:20               ` David Mosberger
  2003-06-07  7:19                 ` David S. Miller
  2003-06-07  9:44                 ` Russell King
@ 2003-06-10 20:24                 ` David Mosberger
  2 siblings, 0 replies; 29+ messages in thread
From: David Mosberger @ 2003-06-10 20:24 UTC (permalink / raw)
  To: hch, James.Bottomley; +Cc: davem, axboe, linux-kernel

Christoph,

You wrote:

 > imo PCI_DMA_BUS_IS_PHYS should be a property of each struct device
 > because a machine might have an IOMMU for one bus type but not
 > another, e.g.

 >	dma_is_phys(dev);

As pointed out by DaveM, this isn't sufficient for the block layer,
which needs to know the page size of the I/O MMU so it can make
merging decisions about physically discontiguous buffers.

I think we also need:

	/*
	 * Returns a mask of bits which need to be 0 in order for
	 * the DMA-mapping interface to be able to remap a buffer.
	 * DMA-mapping implementations for real (hardware) I/O MMUs
	 * will want to return (iommu_page_size - 1) here, if they
	 * support such remapping.  DMA-mapping implementations which
	 * do not support remapping must return a mask of all 1s.
	 */
	unsigned long dma_merge_mask(dev)

Then you can replace:

	BIO_VMERGE_BOUNDARY => (dma_merge_mask(dev) + 1)

Of course, this doesn't work literally, because we don't have a device
pointer handy in the bio code.  Instead, it would probably make the
most sense to add a "iommu_merge_mask" member to "struct
request_queue" and then do something along the lines of:

 #define BIO_VMERGE_BOUNDARY(q)	((q)->iommu_merge_mask + 1)

Note 1: the "+ 1" will get optimized away because the only way
BIO_VMERGE_BOUNDARY() is used is in BIOVEC_VIRT_MERGEABLE, which
really needs a mask anyhow; this could be cleaned up of course, but
that's a separate issue.
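
The merge test would then look roughly like this (a sketch of the
existing BIOVEC_VIRT_MERGEABLE with the queue's mask folded in):

	/* two biovecs may be virtually merged iff the boundary between
	 * them is aligned to the I/O MMU page size */
	#define BIOVEC_VIRT_MERGEABLE(q, vec1, vec2)			\
		((((bvec_to_phys(vec1) + (vec1)->bv_len) |		\
		   bvec_to_phys(vec2)) & (q)->iommu_merge_mask) == 0)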

Note 2: dma_merge_mask() cannot be used to replace dma_is_phys() (as
much as I'd like that), because the issue of (virtual) remapping is
really quite distinct from whether a (hardware) I/O MMU is present
(not to mention the _other_ reason that a bus may not be "physical").

Note 3: I'm not comfortable hacking the bio code, so if someone would
like to prototype this, by all means go ahead... ;-)

Thanks,

	--david


* Re: problem with blk_queue_bounce_limit()
  2003-06-10 20:01                   ` David Mosberger
@ 2003-06-12  6:47                     ` David S. Miller
  0 siblings, 0 replies; 29+ messages in thread
From: David S. Miller @ 2003-06-12  6:47 UTC (permalink / raw)
  To: davidm, davidm; +Cc: linux-kernel

   From: David Mosberger <davidm@napali.hpl.hp.com>
   Date: Tue, 10 Jun 2003 13:01:29 -0700

   How about the attached patch?
   
Looks "Obviously correct" ;-)  I'll apply this, thanks.


* Re: problem with blk_queue_bounce_limit()
  2003-06-07  7:11                 ` David S. Miller
  2003-06-10 20:01                   ` David Mosberger
@ 2003-06-15  7:06                   ` Anton Blanchard
  2003-06-15  7:11                     ` David S. Miller
  1 sibling, 1 reply; 29+ messages in thread
From: Anton Blanchard @ 2003-06-15  7:06 UTC (permalink / raw)
  To: David S. Miller; +Cc: davidm, davidm, manfred, axboe, linux-kernel

 
>    But you're creating a new mapping for the old buffer.  What if you had
>    a DMA API implementation which consolidates multiple mapping attempts
>    of the same buffer into a single mapping entry (along with a reference
>    count)?  That would break the workaround.
>    
> I hope nobody is doing this, it would probably break other things
> we haven't considered yet.

Dave, we talked about this ages ago as a possible alternative to skb
recycling and persistent IOMMU mappings for those skbs.

Unfortunately you need a hash to map from all of memory to a pci bus
address for each host bridge (well, IOMMU), and so far I can't think of
anything that wouldn't chew gobs of RAM.

Anton


* Re: problem with blk_queue_bounce_limit()
  2003-06-15  7:06                   ` Anton Blanchard
@ 2003-06-15  7:11                     ` David S. Miller
  2003-06-15  8:04                       ` Anton Blanchard
  0 siblings, 1 reply; 29+ messages in thread
From: David S. Miller @ 2003-06-15  7:11 UTC (permalink / raw)
  To: anton; +Cc: davidm, davidm, manfred, axboe, linux-kernel

   From: Anton Blanchard <anton@samba.org>
   Date: Sun, 15 Jun 2003 17:06:11 +1000

   Dave, we talked about this ages ago as a possible alternative to skb
   recycling and persistent IOMMU mappings for those skbs.
   
   Unfortunately you need a hash to map from all of memory to a pci bus
   address for each host bridge (well, IOMMU), and so far I can't think of
   anything that wouldn't chew gobs of RAM.
   
The hash table need not be sized wrt. RAM, but rather to the
amount of "DMA in flight" you can expect for the system.

You have to make these tables per-IOMMU anyways, which breaks down the
problem much further.
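
Pure sketch (all names invented):

	/* one small table per IOMMU, sized for in-flight DMA rather
	 * than for total memory */
	#define DMA_MAP_HASH_SIZE	1024	/* tune to expected load */

	struct dma_map_entry {
		unsigned long		cpu_page;	/* lookup key */
		dma_addr_t		bus_addr;	/* live mapping */
		atomic_t		refcnt;
		struct dma_map_entry	*next;		/* hash chain */
	};

	struct iommu_state {
		spinlock_t		lock;
		struct dma_map_entry	*hash[DMA_MAP_HASH_SIZE];
	};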

ROFL, whose workstation name is krispykreme? :-)
I just noticed that.


* Re: problem with blk_queue_bounce_limit()
  2003-06-15  7:11                     ` David S. Miller
@ 2003-06-15  8:04                       ` Anton Blanchard
  2003-06-15  8:18                         ` David S. Miller
  0 siblings, 1 reply; 29+ messages in thread
From: Anton Blanchard @ 2003-06-15  8:04 UTC (permalink / raw)
  To: David S. Miller; +Cc: davidm, davidm, manfred, axboe, linux-kernel


> The hash table need not be sized wrt. ram, but rather to the expected
> amount of "DMA in flight" you can expect for the system.
>
> You have to make these tables per-IOMMU anyways, which breaks down the
> problem much further.

That turns out to be a problem because we can have a bunch of IOMMUs
on ppc64 - we might have 4 slots per IOMMU. Also, on a partitioned machine,
each slot gets an exclusive range of bus addresses to provide isolation
between adapters.

At the extreme end we might have 512GB of RAM and 48 IOMMUs. A hash
table per IOMMU could end up using a lot of RAM.

To make the best use of the IOMMU we would want to have it filled with
addresses (to have the best chance that we will find an existing
mapping), and for zero-copy stuff those could be all over RAM.
In effect we are looking at a mapping from 512GB -> 3GB (assuming the
top 1GB of PCI addresses is for PCI IO and memory).

We also need a quick way to find old entries in order to purge them.

> ROFL, whose workstation name is krispykreme? :-)
> I just noticed that.

Hey, that's my laptop :) One is opening up in Penrith, Australia next
week; I may never need to visit the US again :)

Anton


* Re: problem with blk_queue_bounce_limit()
  2003-06-15  8:04                       ` Anton Blanchard
@ 2003-06-15  8:18                         ` David S. Miller
  0 siblings, 0 replies; 29+ messages in thread
From: David S. Miller @ 2003-06-15  8:18 UTC (permalink / raw)
  To: anton; +Cc: davidm, davidm, manfred, axboe, linux-kernel

   From: Anton Blanchard <anton@samba.org>
   Date: Sun, 15 Jun 2003 18:04:42 +1000
   
   To make the best use of the IOMMU we would want to have it filled with
   addresses (to have the best chance that we will find an existing
   mapping). And for zero copy stuff it could be all over RAM. 
   In effect we are looking at a mapping from 512GB -> 3GB. (assuming the
   top 1GB of PCI addresses are for PCI IO and memory)

We have this same problem with huge files in the page cache.
Maybe you should use a trie, because that data structure will
expand wrt. actual use.

I bet the management (especially the shared memory refcount bump)
will outweigh the costs saved unless your hardware is really stupid.
But judging by the e1000/ppc64 issue, your hardware is stupid and thus
maybe it's worthwhile for you to continue to pursue this idea :)


* Re: problem with blk_queue_bounce_limit()
@ 2003-06-06 18:06 James Bottomley
  0 siblings, 0 replies; 29+ messages in thread
From: James Bottomley @ 2003-06-06 18:06 UTC (permalink / raw)
  To: David S. Miller; +Cc: Christoph Hellwig, Linux Kernel



       From: Christoph Hellwig <hch@infradead.org>
       Date: Fri, 6 Jun 2003 08:44:10 +0100
    
       PCI_DMA_BUS_IS_PHYS should be a property of each struct device because
       a machine might have an IOMMU for one bus type but not another,
    
    We know of no such systems.  Even in mixed-bus environments such as
    sparc64 SBUS+PCI systems.
    
Actually, I know of such a system, although, I will admit, they're
obscure.  Certain PA-RISC machines with EISA buses have an IOMMU on the
EISA bus alone (not on the GSC bus).  Its job was really to cope with
the ISA DMA issue nicely, but it qualifies.

It was for machines like this that I've already been thinking about
introducing a dma_is_phys(dev) macro, which I originally left out of the
generic device DMA API.

James



