* [PATCH 0/4] pci: Speed up processing of IOMMU related functions
@ 2011-03-29 23:36 Mike Travis
  2011-03-29 23:36 ` [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping Mike Travis
                   ` (6 more replies)
  0 siblings, 7 replies; 27+ messages in thread
From: Mike Travis @ 2011-03-29 23:36 UTC (permalink / raw)
  To: David Woodhouse, Jesse Barnes; +Cc: Mike Habeck, iommu, linux-pci, linux-kernel


    During testing of a large customer UV configuration with a lot of
    I/O devices, certain IOMMU functions were either consuming too much
    time or not functioning correctly.

    This patchset addresses those problems.

Signed-off-by: Mike Travis <travis@sgi.com>

-- 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping
  2011-03-29 23:36 [PATCH 0/4] pci: Speed up processing of IOMMU related functions Mike Travis
@ 2011-03-29 23:36 ` Mike Travis
  2011-03-30 17:51   ` Chris Wright
  2011-03-29 23:36 ` [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function Mike Travis
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Mike Travis @ 2011-03-29 23:36 UTC (permalink / raw)
  To: David Woodhouse, Jesse Barnes; +Cc: Mike Habeck, iommu, linux-pci, linux-kernel

[-- Attachment #1: intel-iommu-pass-thru-fix --]
[-- Type: text/plain, Size: 2582 bytes --]

    When the IOMMU is being used, each request for a DMA mapping requires
    the intel_iommu code to look for some space in the DMA mapping table.
    For most drivers this occurs for each transfer.

    When there are many outstanding DMA mappings [as seems to be the case
    with the 10GigE driver], the table grows large and the search for
    space becomes increasingly time consuming.  Performance for the
    10GigE driver drops to about 10% of its capacity on a UV system
    when the CPU count is large.

    The workaround is to specify the iommu=pt option which sets up a 1:1
    identity map for those devices that support enough DMA address bits to
    cover the physical system memory.  This is the "pass through" option.
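For reference, a minimal sketch of how the pass-through option is
typically enabled at boot.  The bootloader file path, regeneration
command, and the companion intel_iommu=on switch are assumptions for
illustration, not taken from this thread:

```shell
# Hypothetical /etc/default/grub entry; the exact file and the
# regeneration command vary by distribution.
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"
# Regenerate the bootloader config afterwards, e.g.:
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```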

    However, this can only be done for devices that pass their DMA
    data through the IOMMU (VT-d).  Host Bridge devices connected
    to System Sockets do not pass their data through VT-d, so the
    following error occurs:

    IOMMU: hardware identity mapping for device 1000:3e:00.0
    Failed to setup IOMMU pass-through
    BUG: unable to handle kernel NULL pointer dereference at 000000000000001c

    This patch fixes that problem by removing Host Bridge devices
    from identity mapping, given that they do not generate DMA ops
    anyway.

Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Mike Habeck <habeck@sgi.com>
---
 drivers/pci/intel-iommu.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

--- linux.orig/drivers/pci/intel-iommu.c
+++ linux/drivers/pci/intel-iommu.c
@@ -46,6 +46,7 @@
 #define ROOT_SIZE		VTD_PAGE_SIZE
 #define CONTEXT_SIZE		VTD_PAGE_SIZE
 
+#define IS_HOSTBRIDGE_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_HOST)
 #define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
 #define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA)
 #define IS_AZALIA(pdev) ((pdev)->vendor == 0x8086 && (pdev)->device == 0x3a3e)
@@ -2183,7 +2184,7 @@ static int iommu_should_identity_map(str
 	 * take them out of the 1:1 domain later.
 	 */
 	if (!startup)
-		return pdev->dma_mask > DMA_BIT_MASK(32);
+		return pdev->dma_mask == DMA_BIT_MASK(64);
 
 	return 1;
 }
@@ -2198,6 +2199,9 @@ static int __init iommu_prepare_static_i
 		return -EFAULT;
 
 	for_each_pci_dev(pdev) {
+		/* Skip PCI Host Bridge devices */
+		if (IS_HOSTBRIDGE_DEVICE(pdev))
+			continue;
 		if (iommu_should_identity_map(pdev, 1)) {
 			printk(KERN_INFO "IOMMU: %s identity mapping for device %s\n",
 			       hw ? "hardware" : "software", pci_name(pdev));

-- 


* [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function
  2011-03-29 23:36 [PATCH 0/4] pci: Speed up processing of IOMMU related functions Mike Travis
  2011-03-29 23:36 ` [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping Mike Travis
@ 2011-03-29 23:36 ` Mike Travis
  2011-03-30 19:19   ` Chris Wright
  2011-03-29 23:36 ` [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges Mike Travis
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Mike Travis @ 2011-03-29 23:36 UTC (permalink / raw)
  To: David Woodhouse, Jesse Barnes; +Cc: Mike Habeck, iommu, linux-pci, linux-kernel

[-- Attachment #1: speed-up-iommu_no_mapping --]
[-- Type: text/plain, Size: 1989 bytes --]

    On Nehalem systems, there are 43 Host Bridges exposed to the kernel
    for each System Socket.  With a large socket count and the iommu
    pass-through option set, too much time is spent in the
    identity_mapping function, hunting through the iommu domains to
    check whether a specific device is "identity mapped".

    Speed up the function by setting an "is_identity_mapped" flag in
    the pci_dev struct, if this device is mapped to the static identity
    domain.

Signed-off-by: Mike Travis <travis@sgi.com>
---
 drivers/pci/intel-iommu.c |   11 +++++------
 include/linux/pci.h       |    1 +
 2 files changed, 6 insertions(+), 6 deletions(-)

--- linux.orig/drivers/pci/intel-iommu.c
+++ linux/drivers/pci/intel-iommu.c
@@ -2102,11 +2102,7 @@ static int identity_mapping(struct pci_d
 	if (likely(!iommu_identity_mapping))
 		return 0;
 
-
-	list_for_each_entry(info, &si_domain->devices, link)
-		if (info->dev == pdev)
-			return 1;
-	return 0;
+	return pdev->is_identity_mapped;
 }
 
 static int domain_add_dev_info(struct dmar_domain *domain,
@@ -2138,6 +2134,7 @@ static int domain_add_dev_info(struct dm
 	list_add(&info->global, &device_domain_list);
 	pdev->dev.archdata.iommu = info;
 	spin_unlock_irqrestore(&device_domain_lock, flags);
+	pdev->is_identity_mapped = (domain == si_domain);
 
 	return 0;
 }
@@ -3404,8 +3401,10 @@ static void domain_remove_one_dev_info(s
 		 * update iommu count and coherency
 		 */
 		if (iommu == device_to_iommu(info->segment, info->bus,
-					    info->devfn))
+					    info->devfn)) {
 			found = 1;
+			pdev->is_identity_mapped = 0;
+		}
 	}
 
 	if (found == 0) {
--- linux.orig/include/linux/pci.h
+++ linux/include/linux/pci.h
@@ -314,6 +314,7 @@ struct pci_dev {
 	unsigned int	is_virtfn:1;
 	unsigned int	reset_fn:1;
 	unsigned int    is_hotplug_bridge:1;
+	unsigned int	is_identity_mapped:1;
 	unsigned int    __aer_firmware_first_valid:1;
 	unsigned int	__aer_firmware_first:1;
 	pci_dev_flags_t dev_flags;

-- 


* [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges
  2011-03-29 23:36 [PATCH 0/4] pci: Speed up processing of IOMMU related functions Mike Travis
  2011-03-29 23:36 ` [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping Mike Travis
  2011-03-29 23:36 ` [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function Mike Travis
@ 2011-03-29 23:36 ` Mike Travis
  2011-03-31 22:11   ` Mike Travis
  2011-03-29 23:36 ` [PATCH 4/4] Intel pci: Use coherent DMA mask when requested Mike Travis
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Mike Travis @ 2011-03-29 23:36 UTC (permalink / raw)
  To: David Woodhouse, Jesse Barnes; +Cc: Mike Habeck, iommu, linux-pci, linux-kernel

[-- Attachment #1: limit-dmar_init_reserved_ranges --]
[-- Type: text/plain, Size: 1775 bytes --]

    dmar_init_reserved_ranges() reserves the card's MMIO ranges to
    prevent handing out a DMA map that would overlap with the MMIO range.
    The problem is that while the Nvidia GPU has 64-bit BARs, it is
    capable of receiving > 40-bit PIOs but cannot generate > 40-bit DMAs.

    So when the iommu code reserves these MMIO ranges, a > 40-bit
    entry ends up in the rbtree.  On a UV test system with
    the Nvidia cards, the BARs are:

      0001:36:00.0 VGA compatible controller: nVidia Corporation GT200GL 
	  Region 0: Memory at 92000000 (32-bit, non-prefetchable) [size=16M]
	  Region 1: Memory at f8200000000 (64-bit, prefetchable) [size=256M]
	  Region 3: Memory at 90000000 (64-bit, non-prefetchable) [size=32M]

    So this 44-bit MMIO address 0xf8200000000 ends up in the rbtree.  As
    DMA maps are added to and deleted from the rbtree, we can end up
    getting a cached entry to this 0xf8200000000 entry... this is what
    results in the code handing out the invalid DMA map of 0xf81fffff000:

	    [ (0xf8200000000 - 1) >> PAGE_SHIFT << PAGE_SHIFT ]

    The IOVA code needs to better honor the "limit_pfn" when allocating
    these maps.

Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Mike Habeck <habeck@sgi.com>
---
 drivers/pci/intel-iommu.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- linux.orig/drivers/pci/intel-iommu.c
+++ linux/drivers/pci/intel-iommu.c
@@ -1323,7 +1323,8 @@ static void dmar_init_reserved_ranges(vo
 
 		for (i = 0; i < PCI_NUM_RESOURCES; i++) {
 			r = &pdev->resource[i];
-			if (!r->flags || !(r->flags & IORESOURCE_MEM))
+			if (!r->flags || !(r->flags & IORESOURCE_MEM) ||
+			    r->start > pdev->dma_mask)
 				continue;
 			iova = reserve_iova(&reserved_iova_list,
 					    IOVA_PFN(r->start),

-- 


* [PATCH 4/4] Intel pci: Use coherent DMA mask when requested
  2011-03-29 23:36 [PATCH 0/4] pci: Speed up processing of IOMMU related functions Mike Travis
                   ` (2 preceding siblings ...)
  2011-03-29 23:36 ` [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges Mike Travis
@ 2011-03-29 23:36 ` Mike Travis
  2011-03-30 18:02   ` Chris Wright
  2011-04-07 19:47 ` [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping Mike Travis
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 27+ messages in thread
From: Mike Travis @ 2011-03-29 23:36 UTC (permalink / raw)
  To: David Woodhouse, Jesse Barnes; +Cc: Mike Habeck, iommu, linux-pci, linux-kernel

[-- Attachment #1: use-coherent-dma-mask --]
[-- Type: text/plain, Size: 761 bytes --]

    The __intel_map_single function is not honoring the passed in
    DMA mask.  This results in not using the coherent DMA mask when
    called from intel_alloc_coherent().

Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Mike Habeck <habeck@sgi.com>
---
 drivers/pci/intel-iommu.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- linux.orig/drivers/pci/intel-iommu.c
+++ linux/drivers/pci/intel-iommu.c
@@ -2582,8 +2582,7 @@ static dma_addr_t __intel_map_single(str
 	iommu = domain_get_iommu(domain);
 	size = aligned_nrpages(paddr, size);
 
-	iova = intel_alloc_iova(hwdev, domain, dma_to_mm_pfn(size),
-				pdev->dma_mask);
+	iova = intel_alloc_iova(hwdev, domain, dma_to_mm_pfn(size), dma_mask);
 	if (!iova)
 		goto error;
 

-- 


* Re: [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping
  2011-03-29 23:36 ` [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping Mike Travis
@ 2011-03-30 17:51   ` Chris Wright
  2011-03-30 18:30     ` Mike Travis
  0 siblings, 1 reply; 27+ messages in thread
From: Chris Wright @ 2011-03-30 17:51 UTC (permalink / raw)
  To: Mike Travis
  Cc: David Woodhouse, Jesse Barnes, linux-pci, iommu, Mike Habeck,
	linux-kernel

* Mike Travis (travis@sgi.com) wrote:
>     When the IOMMU is being used, each request for a DMA mapping requires
>     the intel_iommu code to look for some space in the DMA mapping table.
>     For most drivers this occurs for each transfer.
> 
>     When there are many outstanding DMA mappings [as seems to be the case
>     with the 10GigE driver], the table grows large and the search for
>     space becomes increasingly time consuming.  Performance for the
>     10GigE driver drops to about 10% of its capacity on a UV system
>     when the CPU count is large.

That's pretty poor.  I've seen large overheads, but when that big it was
also related to issues in the 10G driver.  Do you have profile data
showing this as the hotspot?

>     The workaround is to specify the iommu=pt option which sets up a 1:1
>     identity map for those devices that support enough DMA address bits to
>     cover the physical system memory.  This is the "pass through" option.
> 
>     However, this can only be done for devices that pass their DMA
>     data through the IOMMU (VT-d).  Host Bridge devices connected
>     to System Sockets do not pass their data through VT-d, so the
>     following error occurs:
> 
>     IOMMU: hardware identity mapping for device 1000:3e:00.0
>     Failed to setup IOMMU pass-through
>     BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
> 
>     This patch fixes that problem by removing Host Bridge devices
>     from identity mapping, given that they do not generate DMA ops anyway.
> 
> Signed-off-by: Mike Travis <travis@sgi.com>
> Reviewed-by: Mike Habeck <habeck@sgi.com>
> ---
>  drivers/pci/intel-iommu.c |    6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> --- linux.orig/drivers/pci/intel-iommu.c
> +++ linux/drivers/pci/intel-iommu.c
> @@ -46,6 +46,7 @@
>  #define ROOT_SIZE		VTD_PAGE_SIZE
>  #define CONTEXT_SIZE		VTD_PAGE_SIZE
>  
> +#define IS_HOSTBRIDGE_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_HOST)
>  #define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
>  #define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA)
>  #define IS_AZALIA(pdev) ((pdev)->vendor == 0x8086 && (pdev)->device == 0x3a3e)
> @@ -2183,7 +2184,7 @@ static int iommu_should_identity_map(str
>  	 * take them out of the 1:1 domain later.
>  	 */
>  	if (!startup)
> -		return pdev->dma_mask > DMA_BIT_MASK(32);
> +		return pdev->dma_mask == DMA_BIT_MASK(64);

This looks unrelated, why the change?

>  	return 1;
>  }
> @@ -2198,6 +2199,9 @@ static int __init iommu_prepare_static_i
>  		return -EFAULT;
>  
>  	for_each_pci_dev(pdev) {
> +		/* Skip PCI Host Bridge devices */
> +		if (IS_HOSTBRIDGE_DEVICE(pdev))
> +			continue;
>  		if (iommu_should_identity_map(pdev, 1)) {

Should this host bridge check go into iommu_should_identity_map?

I understand skipping the extra host bridges, but what is the NULL ptr deref
coming from?  Just to be sure this isn't a bandaid.

thanks,
-chris



* Re: [PATCH 4/4] Intel pci: Use coherent DMA mask when requested
  2011-03-29 23:36 ` [PATCH 4/4] Intel pci: Use coherent DMA mask when requested Mike Travis
@ 2011-03-30 18:02   ` Chris Wright
  2011-04-01  2:57     ` FUJITA Tomonori
  0 siblings, 1 reply; 27+ messages in thread
From: Chris Wright @ 2011-03-30 18:02 UTC (permalink / raw)
  To: Mike Travis
  Cc: David Woodhouse, Jesse Barnes, linux-pci, iommu, Mike Habeck,
	linux-kernel, fujita.tomonori

* Mike Travis (travis@sgi.com) wrote:
>     The __intel_map_single function is not honoring the passed in
>     DMA mask.  This results in not using the coherent DMA mask when
>     called from intel_alloc_coherent().
> 
> Signed-off-by: Mike Travis <travis@sgi.com>
> Reviewed-by: Mike Habeck <habeck@sgi.com>

Hmm, looks like it has actually been broken since it was introduced
in bb9e6d65.

Acked-by: Chris Wright <chrisw@sous-sol.org>


* Re: [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping
  2011-03-30 17:51   ` Chris Wright
@ 2011-03-30 18:30     ` Mike Travis
  2011-03-30 19:15       ` Chris Wright
  0 siblings, 1 reply; 27+ messages in thread
From: Mike Travis @ 2011-03-30 18:30 UTC (permalink / raw)
  To: Chris Wright
  Cc: David Woodhouse, Jesse Barnes, linux-pci, iommu, Mike Habeck,
	linux-kernel



Chris Wright wrote:
> * Mike Travis (travis@sgi.com) wrote:
>>     When the IOMMU is being used, each request for a DMA mapping requires
>>     the intel_iommu code to look for some space in the DMA mapping table.
>>     For most drivers this occurs for each transfer.
>>
>>     When there are many outstanding DMA mappings [as seems to be the case
>>     with the 10GigE driver], the table grows large and the search for
>>     space becomes increasingly time consuming.  Performance for the
>>     10GigE driver drops to about 10% of its capacity on a UV system
>>     when the CPU count is large.
> 
> That's pretty poor.  I've seen large overheads, but when that big it was
> also related to issues in the 10G driver.  Do you have profile data
> showing this as the hotspot?

Here's one from our internal bug report:

Here is a profile from a run with iommu=on  iommu=pt  (no forcedac)

uv48-sys was receiving and uv-debug sending.
ksoftirqd/640 was running at approx. 100% cpu utilization.
I had pinned the nttcp process on uv48-sys to cpu 64.

# Samples: 1255641
#
# Overhead        Command  Shared Object  Symbol
# ........  .............  .............  ......
#
    50.27%  ksoftirqd/640  [kernel]       [k] _spin_lock
    27.43%  ksoftirqd/640  [kernel]       [k] iommu_no_mapping
...
      0.48%  ksoftirqd/640  [kernel]       [k] iommu_should_identity_map
      0.45%  ksoftirqd/640  [kernel]       [k] ixgbe_alloc_rx_buffers    [ixgbe]
      0.42%  ksoftirqd/640  [kernel]       [k] ioat2_tx_submit_unlock    [ioatdma]
      0.29%  ksoftirqd/640  [kernel]       [k] uv_read_rtc
      0.25%  ksoftirqd/640  [kernel]       [k] __alloc_skb
      0.20%  ksoftirqd/640  [kernel]       [k] try_to_wake_up
      0.19%  ksoftirqd/640  [kernel]       [k] ____cache_alloc_node
      0.19%  ksoftirqd/640  [kernel]       [k] kmem_cache_free
      0.19%  ksoftirqd/640  [kernel]       [k] __netdev_alloc_skb
      0.18%  ksoftirqd/640  [kernel]       [k] tcp_v4_rcv
      0.15%  ksoftirqd/640  [kernel]       [k] resched_task
      0.15%  ksoftirqd/640  [kernel]       [k] tcp_data_queue
      0.13%  ksoftirqd/640  [kernel]       [k] xfrm4_policy_check
      0.11%  ksoftirqd/640  [kernel]       [k] get_page_from_freelist
      0.10%  ksoftirqd/640  [kernel]       [k] sched_clock_cpu
      0.10%  ksoftirqd/640  [kernel]       [k] sock_def_readable
...

I tracked this time down to identity_mapping() in this loop:

       list_for_each_entry(info, &si_domain->devices, link)
               if (info->dev == pdev)
                       return 1;

I didn't get the exact count, but there was approx 11,000 PCI devices
on this system.  And this function was called for every page request
in each DMA request.

Here's an excerpt from our internal bug report:

I also looked at the cpu utilization on uv.  It's at 22% for the nttcp process
and ksoftirqd is not at the top, so I think this means the fix is working.

Another run
uv-debug:~/eddiem/nttcp-1.52 # ./nttcp -T -l 1048576 -P 60 192.168.1.2
Running for 60 seconds...
     Bytes  Real s   CPU s Real-MBit/s  CPU-MBit/s   Calls  Real-C/s   CPU-C/s
l51671728128   60.00   13.52   6889.4548  30582.1259   49278    821.29    3645.7
151671728128   60.00   12.53   6889.4660  32983.4024  123666   2061.07    9867.4

Trying it from the other side shows nttcp on uv at 44% cpu.

uv41-sys:~/eddiem/nttcp-1.52 # ./nttcp -T -l 1048576 -P 60 192.168.1.1
Running for 60 seconds...
     Bytes  Real s   CPU s Real-MBit/s  CPU-MBit/s   Calls  Real-C/s   CPU-C/s
l51292143616   60.00   26.40   6838.9326  15544.4581   48917    815.28    1853.1
151292456796   60.00    7.35   6839.0407  55809.8528   93530   1558.84   12720.9


Note that our networking experts also tuned the 10GigE parameters which
helped bring the speed back up to almost line speed.  (The 10GigE was
by far the most affected driver, but even the 1GigE driver lost performance.)

There were also changes for the irq_rebalancer and for disabling sched
domains 2 and 3 (which were being hit by the idle_rebalancer).  I remember
sched domain 3 had all 4096 cpus, but I forgot what sd 2 had.

Also, running the network test on the same node as where the cards were
helped as well.

If you really need them, I can sign up for some system time and get better
before/after profile data specifically for these IOMMU changes?

Thanks,
Mike

> 
>>     The workaround is to specify the iommu=pt option which sets up a 1:1
>>     identity map for those devices that support enough DMA address bits to
>>     cover the physical system memory.  This is the "pass through" option.
>>
>>     However, this can only be done for devices that pass their DMA
>>     data through the IOMMU (VT-d).  Host Bridge devices connected
>>     to System Sockets do not pass their data through VT-d, so the
>>     following error occurs:
>>
>>     IOMMU: hardware identity mapping for device 1000:3e:00.0
>>     Failed to setup IOMMU pass-through
>>     BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
>>
>>     This patch fixes that problem by removing Host Bridge devices
>>     from identity mapping, given that they do not generate DMA ops anyway.
>>
>> Signed-off-by: Mike Travis <travis@sgi.com>
>> Reviewed-by: Mike Habeck <habeck@sgi.com>
>> ---
>>  drivers/pci/intel-iommu.c |    6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> --- linux.orig/drivers/pci/intel-iommu.c
>> +++ linux/drivers/pci/intel-iommu.c
>> @@ -46,6 +46,7 @@
>>  #define ROOT_SIZE		VTD_PAGE_SIZE
>>  #define CONTEXT_SIZE		VTD_PAGE_SIZE
>>  
>> +#define IS_HOSTBRIDGE_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_HOST)
>>  #define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
>>  #define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA)
>>  #define IS_AZALIA(pdev) ((pdev)->vendor == 0x8086 && (pdev)->device == 0x3a3e)
>> @@ -2183,7 +2184,7 @@ static int iommu_should_identity_map(str
>>  	 * take them out of the 1:1 domain later.
>>  	 */
>>  	if (!startup)
>> -		return pdev->dma_mask > DMA_BIT_MASK(32);
>> +		return pdev->dma_mask == DMA_BIT_MASK(64);
> 
> This looks unrelated, why the change?
> 
>>  	return 1;
>>  }
>> @@ -2198,6 +2199,9 @@ static int __init iommu_prepare_static_i
>>  		return -EFAULT;
>>  
>>  	for_each_pci_dev(pdev) {
>> +		/* Skip PCI Host Bridge devices */
>> +		if (IS_HOSTBRIDGE_DEVICE(pdev))
>> +			continue;
>>  		if (iommu_should_identity_map(pdev, 1)) {
> 
> Should this host bridge check go into iommu_should_identity_map?
> 
> I understand skipping the extra host bridges, but what is the NULL ptr deref
> coming from?  Just to be sure this isn't a bandaid.
> 
> thanks,
> -chris
> 


* Re: [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping
  2011-03-30 18:30     ` Mike Travis
@ 2011-03-30 19:15       ` Chris Wright
  2011-03-30 19:25         ` Mike Travis
  0 siblings, 1 reply; 27+ messages in thread
From: Chris Wright @ 2011-03-30 19:15 UTC (permalink / raw)
  To: Mike Travis
  Cc: Chris Wright, David Woodhouse, Jesse Barnes, linux-pci, iommu,
	Mike Habeck, linux-kernel

* Mike Travis (travis@sgi.com) wrote:
> Chris Wright wrote:
> >* Mike Travis (travis@sgi.com) wrote:
> >>    When the IOMMU is being used, each request for a DMA mapping requires
> >>    the intel_iommu code to look for some space in the DMA mapping table.
> >>    For most drivers this occurs for each transfer.
> >>
> >>    When there are many outstanding DMA mappings [as seems to be the case
> >>    with the 10GigE driver], the table grows large and the search for
> >>    space becomes increasingly time consuming.  Performance for the
> >>    10GigE driver drops to about 10% of its capacity on a UV system
> >>    when the CPU count is large.
> >
> >That's pretty poor.  I've seen large overheads, but when that big it was
> >also related to issues in the 10G driver.  Do you have profile data
> >showing this as the hotspot?
> 
> Here's one from our internal bug report:
> 
> Here is a profile from a run with iommu=on  iommu=pt  (no forcedac)

OK, I was actually interested in the !pt case.  But this is still
useful.  The iova lookup is distinct from the identity_mapping() case.

> uv48-sys was receiving and uv-debug sending.
> ksoftirqd/640 was running at approx. 100% cpu utilization.
> I had pinned the nttcp process on uv48-sys to cpu 64.
> 
> # Samples: 1255641
> #
> # Overhead        Command  Shared Object  Symbol
> # ........  .............  .............  ......
> #
>    50.27%  ksoftirqd/640  [kernel]       [k] _spin_lock
>    27.43%  ksoftirqd/640  [kernel]       [k] iommu_no_mapping

> ...
>      0.48%  ksoftirqd/640  [kernel]       [k] iommu_should_identity_map
>      0.45%  ksoftirqd/640  [kernel]       [k] ixgbe_alloc_rx_buffers    [ixgbe]

Note, ixgbe has had rx dma mapping issues (that's why I wondered what
was causing the massive slowdown under !pt mode).

<snip>
> I tracked this time down to identity_mapping() in this loop:
> 
>       list_for_each_entry(info, &si_domain->devices, link)
>               if (info->dev == pdev)
>                       return 1;
> 
> I didn't get the exact count, but there was approx 11,000 PCI devices
> on this system.  And this function was called for every page request
> in each DMA request.

Right, so this is the list traversal (and wow, a lot of PCI devices).
Did you try a smarter data structure? (While there's room for another
bit in pci_dev, the bit is more about iommu implementation details than
anything at the pci level).

Or the domain_dev_info is cached in the archdata of device struct.
You should be able to just reference that directly.

Didn't think it through completely, but perhaps something as simple as:

	return pdev->dev.archdata.iommu == si_domain;

thanks,
-chris


* Re: [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function
  2011-03-29 23:36 ` [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function Mike Travis
@ 2011-03-30 19:19   ` Chris Wright
  2011-03-30 19:29     ` Mike Travis
  2011-03-31  0:33     ` [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function v2 Mike Travis
  0 siblings, 2 replies; 27+ messages in thread
From: Chris Wright @ 2011-03-30 19:19 UTC (permalink / raw)
  To: Mike Travis
  Cc: David Woodhouse, Jesse Barnes, linux-pci, iommu, Mike Habeck,
	linux-kernel

* Mike Travis (travis@sgi.com) wrote:
> @@ -2138,6 +2134,7 @@ static int domain_add_dev_info(struct dm
>  	list_add(&info->global, &device_domain_list);
>  	pdev->dev.archdata.iommu = info;

The si_domain is cached right here...can you not use this instead of
adding an iommu impl. specific bit to pci_dev?

>  	spin_unlock_irqrestore(&device_domain_lock, flags);
> +	pdev->is_identity_mapped = (domain == si_domain);


* Re: [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping
  2011-03-30 19:15       ` Chris Wright
@ 2011-03-30 19:25         ` Mike Travis
  2011-03-30 19:57           ` Chris Wright
  0 siblings, 1 reply; 27+ messages in thread
From: Mike Travis @ 2011-03-30 19:25 UTC (permalink / raw)
  To: Chris Wright
  Cc: David Woodhouse, Jesse Barnes, linux-pci, iommu, Mike Habeck,
	linux-kernel



Chris Wright wrote:
> * Mike Travis (travis@sgi.com) wrote:
>> Chris Wright wrote:
>>> * Mike Travis (travis@sgi.com) wrote:
>>>>    When the IOMMU is being used, each request for a DMA mapping requires
>>>>    the intel_iommu code to look for some space in the DMA mapping table.
>>>>    For most drivers this occurs for each transfer.
>>>>
>>>>    When there are many outstanding DMA mappings [as seems to be the case
>>>>    with the 10GigE driver], the table grows large and the search for
>>>>    space becomes increasingly time consuming.  Performance for the
>>>>    10GigE driver drops to about 10% of its capacity on a UV system
>>>>    when the CPU count is large.
>>> That's pretty poor.  I've seen large overheads, but when that big it was
>>> also related to issues in the 10G driver.  Do you have profile data
>>> showing this as the hotspot?
>> Here's one from our internal bug report:
>>
>> Here is a profile from a run with iommu=on  iommu=pt  (no forcedac)
> 
> OK, I was actually interested in the !pt case.  But this is useful
> still.  The iova lookup being distinct from the identity_mapping() case.

I can get that as well, but having every device use maps caused its
own set of problems (hundreds of dma maps).  Here's a list of devices
on the system under test.  You can see that even 'minor' glitches can
get magnified when there are so many...

Blade Location    NASID  PCI Address X Display   Device
----------------------------------------------------------------------
    0 r001i01b00      0  0000:01:00.0      -   Intel 82576 Gigabit Network Connection
    .          .      .  0000:01:00.1      -   Intel 82576 Gigabit Network Connection
    .          .      .  0000:04:00.0      -   LSI SAS1064ET Fusion-MPT SAS
    .          .      .  0000:05:00.0      -   Matrox MGA G200e
    2 r001i01b02      4  0001:02:00.0      -   Mellanox MT26428 InfiniBand
    3 r001i01b03      6  0002:02:00.0      -   Mellanox MT26428 InfiniBand
    4 r001i01b04      8  0003:02:00.0      -   Mellanox MT26428 InfiniBand
   11 r001i01b11     22  0007:02:00.0      -   Mellanox MT26428 InfiniBand
   13 r001i01b13     26  0008:02:00.0      -   Mellanox MT26428 InfiniBand
   15 r001i01b15     30  0009:07:00.0   :0.0   nVidia GF100 [Tesla S2050]
    .          .      .  0009:08:00.0   :1.1   nVidia GF100 [Tesla S2050]
   18 r001i23b02     36  000b:02:00.0      -   Mellanox MT26428 InfiniBand
   20 r001i23b04     40  000c:01:00.0      -   Intel 82599EB 10-Gigabit Network Connection
    .          .      .  000c:01:00.1      -   Intel 82599EB 10-Gigabit Network Connection
    .          .      .  000c:04:00.0      -   Mellanox MT26428 InfiniBand
   23 r001i23b07     46  000d:07:00.0      -   nVidia GF100 [Tesla S2050]
    .          .      .  000d:08:00.0      -   nVidia GF100 [Tesla S2050]
   25 r001i23b09     50  000e:01:00.0      -   Intel 82599EB 10-Gigabit Network Connection
    .          .      .  000e:01:00.1      -   Intel 82599EB 10-Gigabit Network Connection
    .          .      .  000e:04:00.0      -   Mellanox MT26428 InfiniBand
   26 r001i23b10     52  000f:02:00.0      -   Mellanox MT26428 InfiniBand
   27 r001i23b11     54  0010:02:00.0      -   Mellanox MT26428 InfiniBand
   29 r001i23b13     58  0011:02:00.0      -   Mellanox MT26428 InfiniBand
   31 r001i23b15     62  0012:02:00.0      -   Mellanox MT26428 InfiniBand
   34 r002i01b02     68  0013:01:00.0      -   Mellanox MT26428 InfiniBand
   35 r002i01b03     70  0014:02:00.0      -   Mellanox MT26428 InfiniBand
   36 r002i01b04     72  0015:01:00.0      -   Mellanox MT26428 InfiniBand
   41 r002i01b09     82  0018:07:00.0      -   nVidia GF100 [Tesla S2050]
    .          .      .  0018:08:00.0      -   nVidia GF100 [Tesla S2050]
   43 r002i01b11     86  0019:01:00.0      -   Mellanox MT26428 InfiniBand
   45 r002i01b13     90  001a:01:00.0      -   Mellanox MT26428 InfiniBand
   48 r002i23b00     96  001c:07:00.0      -   nVidia GF100 [Tesla S2050]
    .          .      .  001c:08:00.0      -   nVidia GF100 [Tesla S2050]
   50 r002i23b02    100  001d:02:00.0      -   Mellanox MT26428 InfiniBand
   52 r002i23b04    104  001e:01:00.0      -   Intel 82599EB 10-Gigabit Network Connection
    .          .      .  001e:01:00.1      -   Intel 82599EB 10-Gigabit Network Connection
    .          .      .  001e:04:00.0      -   Mellanox MT26428 InfiniBand
   57 r002i23b09    114  0020:01:00.0      -   Intel 82599EB 10-Gigabit Network Connection
    .          .      .  0020:01:00.1      -   Intel 82599EB 10-Gigabit Network Connection
    .          .      .  0020:04:00.0      -   Mellanox MT26428 InfiniBand
   58 r002i23b10    116  0021:02:00.0      -   Mellanox MT26428 InfiniBand
   59 r002i23b11    118  0022:02:00.0      -   Mellanox MT26428 InfiniBand
   61 r002i23b13    122  0023:02:00.0      -   Mellanox MT26428 InfiniBand
   63 r002i23b15    126  0024:02:00.0      -   Mellanox MT26428 InfiniBand

> 
>> uv48-sys was receiving and uv-debug sending.
>> ksoftirqd/640 was running at approx. 100% cpu utilization.
>> I had pinned the nttcp process on uv48-sys to cpu 64.
>>
>> # Samples: 1255641
>> #
>> # Overhead        Command  Shared Object  Symbol
>> # ........  .............  .............  ......
>> #
>>    50.27%  ksoftirqd/640  [kernel]       [k] _spin_lock
>>    27.43%  ksoftirqd/640  [kernel]       [k] iommu_no_mapping
> 
>> ...
>>      0.48%  ksoftirqd/640  [kernel]       [k] iommu_should_identity_map
>>      0.45%  ksoftirqd/640  [kernel]       [k] ixgbe_alloc_rx_buffers    [
>> ixgbe]
> 
> Note, ixgbe has had rx dma mapping issues (that's why I wondered what
> was causing the massive slowdown under !pt mode).

I think that since this profile run, the network guys have updated the
ixgbe driver to a later version.  (I don't know the outcome of that test.)

> 
> <snip>
>> I tracked this time down to identity_mapping() in this loop:
>>
>>       list_for_each_entry(info, &si_domain->devices, link)
>>               if (info->dev == pdev)
>>                       return 1;
>>
>> I didn't get the exact count, but there was approx 11,000 PCI devices
>> on this system.  And this function was called for every page request
>> in each DMA request.
> 
> Right, so this is the list traversal (and wow, a lot of PCI devices).

Most of the PCI devices were the 45 on each of 256 Nehalem sockets.
There's a ton of bridges as well.

> Did you try a smarter data structure? (While there's room for another
> bit in pci_dev, the bit is more about iommu implementation details than
> anything at the pci level).
> 
> Or the domain_dev_info is cached in the archdata of device struct.
> You should be able to just reference that directly.
> 
> Didn't think it through completely, but perhaps something as simple as:
> 
> 	return pdev->dev.archdata.iommu == si_domain;

I can try this, thanks!

> 
> thanks,
> -chris

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function
  2011-03-30 19:19   ` Chris Wright
@ 2011-03-30 19:29     ` Mike Travis
  2011-03-31  0:33     ` [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function v2 Mike Travis
  1 sibling, 0 replies; 27+ messages in thread
From: Mike Travis @ 2011-03-30 19:29 UTC (permalink / raw)
  To: Chris Wright
  Cc: David Woodhouse, Jesse Barnes, linux-pci, iommu, Mike Habeck,
	linux-kernel



Chris Wright wrote:
> * Mike Travis (travis@sgi.com) wrote:
>> @@ -2138,6 +2134,7 @@ static int domain_add_dev_info(struct dm
>>  	list_add(&info->global, &device_domain_list);
>>  	pdev->dev.archdata.iommu = info;
> 
> The si_domain is cached right here...can you not use this instead of
> adding an iommu impl. specific bit to pci_dev?
> 
>>  	spin_unlock_irqrestore(&device_domain_lock, flags);
>> +	pdev->is_identity_mapped = (domain == si_domain);

Yes, it looks like I came at the solution backwards... I added
the bit then figured out how to set it.  I'll update that patch.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping
  2011-03-30 19:25         ` Mike Travis
@ 2011-03-30 19:57           ` Chris Wright
  0 siblings, 0 replies; 27+ messages in thread
From: Chris Wright @ 2011-03-30 19:57 UTC (permalink / raw)
  To: Mike Travis
  Cc: Chris Wright, David Woodhouse, Jesse Barnes, linux-pci, iommu,
	Mike Habeck, linux-kernel

* Mike Travis (travis@sgi.com) wrote:
> Chris Wright wrote:
> >OK, I was actually interested in the !pt case.  But this is useful
> >still.  The iova lookup being distinct from the identity_mapping() case.
> 
> I can get that as well, but having every device use maps caused its
> own set of problems (hundreds of dma maps).  Here's a list of devices
> on the system under test.  You can see that even 'minor' glitches can
> get magnified when there are so many...

Yeah, I was focused on the overhead of actually mapping/unmapping an
address in the non-pt case.

> Blade Location    NASID  PCI Address X Display   Device
> ----------------------------------------------------------------------
>    0 r001i01b00      0  0000:01:00.0      -   Intel 82576 Gigabit Network Connection
>    .          .      .  0000:01:00.1      -   Intel 82576 Gigabit Network Connection
>    .          .      .  0000:04:00.0      -   LSI SAS1064ET Fusion-MPT SAS
>    .          .      .  0000:05:00.0      -   Matrox MGA G200e
>    2 r001i01b02      4  0001:02:00.0      -   Mellanox MT26428 InfiniBand
>    3 r001i01b03      6  0002:02:00.0      -   Mellanox MT26428 InfiniBand
>    4 r001i01b04      8  0003:02:00.0      -   Mellanox MT26428 InfiniBand
>   11 r001i01b11     22  0007:02:00.0      -   Mellanox MT26428 InfiniBand
>   13 r001i01b13     26  0008:02:00.0      -   Mellanox MT26428 InfiniBand
>   15 r001i01b15     30  0009:07:00.0   :0.0   nVidia GF100 [Tesla S2050]
>    .          .      .  0009:08:00.0   :1.1   nVidia GF100 [Tesla S2050]
>   18 r001i23b02     36  000b:02:00.0      -   Mellanox MT26428 InfiniBand
>   20 r001i23b04     40  000c:01:00.0      -   Intel 82599EB 10-Gigabit Network Connection
>    .          .      .  000c:01:00.1      -   Intel 82599EB 10-Gigabit Network Connection
>    .          .      .  000c:04:00.0      -   Mellanox MT26428 InfiniBand
>   23 r001i23b07     46  000d:07:00.0      -   nVidia GF100 [Tesla S2050]
>    .          .      .  000d:08:00.0      -   nVidia GF100 [Tesla S2050]
>   25 r001i23b09     50  000e:01:00.0      -   Intel 82599EB 10-Gigabit Network Connection
>    .          .      .  000e:01:00.1      -   Intel 82599EB 10-Gigabit Network Connection
>    .          .      .  000e:04:00.0      -   Mellanox MT26428 InfiniBand
>   26 r001i23b10     52  000f:02:00.0      -   Mellanox MT26428 InfiniBand
>   27 r001i23b11     54  0010:02:00.0      -   Mellanox MT26428 InfiniBand
>   29 r001i23b13     58  0011:02:00.0      -   Mellanox MT26428 InfiniBand
>   31 r001i23b15     62  0012:02:00.0      -   Mellanox MT26428 InfiniBand
>   34 r002i01b02     68  0013:01:00.0      -   Mellanox MT26428 InfiniBand
>   35 r002i01b03     70  0014:02:00.0      -   Mellanox MT26428 InfiniBand
>   36 r002i01b04     72  0015:01:00.0      -   Mellanox MT26428 InfiniBand
>   41 r002i01b09     82  0018:07:00.0      -   nVidia GF100 [Tesla S2050]
>    .          .      .  0018:08:00.0      -   nVidia GF100 [Tesla S2050]
>   43 r002i01b11     86  0019:01:00.0      -   Mellanox MT26428 InfiniBand
>   45 r002i01b13     90  001a:01:00.0      -   Mellanox MT26428 InfiniBand
>   48 r002i23b00     96  001c:07:00.0      -   nVidia GF100 [Tesla S2050]
>    .          .      .  001c:08:00.0      -   nVidia GF100 [Tesla S2050]
>   50 r002i23b02    100  001d:02:00.0      -   Mellanox MT26428 InfiniBand
>   52 r002i23b04    104  001e:01:00.0      -   Intel 82599EB 10-Gigabit Network Connection
>    .          .      .  001e:01:00.1      -   Intel 82599EB 10-Gigabit Network Connection
>    .          .      .  001e:04:00.0      -   Mellanox MT26428 InfiniBand
>   57 r002i23b09    114  0020:01:00.0      -   Intel 82599EB 10-Gigabit Network Connection
>    .          .      .  0020:01:00.1      -   Intel 82599EB 10-Gigabit Network Connection
>    .          .      .  0020:04:00.0      -   Mellanox MT26428 InfiniBand
>   58 r002i23b10    116  0021:02:00.0      -   Mellanox MT26428 InfiniBand
>   59 r002i23b11    118  0022:02:00.0      -   Mellanox MT26428 InfiniBand
>   61 r002i23b13    122  0023:02:00.0      -   Mellanox MT26428 InfiniBand
>   63 r002i23b15    126  0024:02:00.0      -   Mellanox MT26428 InfiniBand
> 
> >
> >>uv48-sys was receiving and uv-debug sending.
> >>ksoftirqd/640 was running at approx. 100% cpu utilization.
> >>I had pinned the nttcp process on uv48-sys to cpu 64.
> >>
> >># Samples: 1255641
> >>#
> >># Overhead        Command  Shared Object  Symbol
> >># ........  .............  .............  ......
> >>#
> >>   50.27%  ksoftirqd/640  [kernel]       [k] _spin_lock
> >>   27.43%  ksoftirqd/640  [kernel]       [k] iommu_no_mapping
> >
> >>...
> >>     0.48%  ksoftirqd/640  [kernel]       [k] iommu_should_identity_map
> >>     0.45%  ksoftirqd/640  [kernel]       [k] ixgbe_alloc_rx_buffers    [
> >>ixgbe]
> >
> >Note, ixgbe has had rx dma mapping issues (that's why I wondered what
> >was causing the massive slowdown under !pt mode).
> 
> I think since this profile run, the network guys updated the ixgbe
> driver with a later version.  (I don't know the outcome of that test.)

OK.  The ixgbe fix I was thinking of has been in since 2.6.34:  43634e82
(ixgbe: Fix DMA mapping/unmapping issues when HWRSC is enabled on IOMMU
enabled kernels).

> ><snip>
> >>I tracked this time down to identity_mapping() in this loop:
> >>
> >>      list_for_each_entry(info, &si_domain->devices, link)
> >>              if (info->dev == pdev)
> >>                      return 1;
> >>
> >>I didn't get the exact count, but there was approx 11,000 PCI devices
> >>on this system.  And this function was called for every page request
> >>in each DMA request.
> >
> >Right, so this is the list traversal (and wow, a lot of PCI devices).
> 
> Most of the PCI devices were the 45 on each of 256 Nahalem sockets.
> Also, there's a ton of bridges as well.
> 
> >Did you try a smarter data structure? (While there's room for another
> >bit in pci_dev, the bit is more about iommu implementation details than
> >anything at the pci level).
> >
> >Or the domain_dev_info is cached in the archdata of device struct.
> >You should be able to just reference that directly.
> >
> >Didn't think it through completely, but perhaps something as simple as:
> >
> >	return pdev->dev.archdata.iommu == si_domain;
> 
> I can try this, thanks!

Err, I guess that'd be info = archdata.iommu; info->domain == si_domain
(and probably need some sanity checking against things like
DUMMY_DEVICE_DOMAIN_INFO).  But you get the idea.

thanks,
-chris

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function v2
  2011-03-30 19:19   ` Chris Wright
  2011-03-30 19:29     ` Mike Travis
@ 2011-03-31  0:33     ` Mike Travis
  1 sibling, 0 replies; 27+ messages in thread
From: Mike Travis @ 2011-03-31  0:33 UTC (permalink / raw)
  To: Chris Wright
  Cc: David Woodhouse, Jesse Barnes, linux-pci, iommu, Mike Habeck,
	linux-kernel

Subject: Intel iommu: Speed up processing of the identity_mapping function

    When there are a large count of PCI devices, and the pass
    through option for iommu is set, much time is spent in the
    identity_mapping function hunting though the iommu domains to
    check if a specific device is "identity mapped".

    Speed up the function by checking the cached info to see if
    it's mapped to the static identity domain.

Signed-off-by: Mike Travis <travis@sgi.com>
---
 drivers/pci/intel-iommu.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- linux-2.6.32.orig/drivers/pci/intel-iommu.c
+++ linux-2.6.32/drivers/pci/intel-iommu.c
@@ -2124,10 +2124,10 @@ static int identity_mapping(struct pci_d
 	if (likely(!iommu_identity_mapping))
 		return 0;
 
+	info = pdev->dev.archdata.iommu;
+	if (info && info != DUMMY_DEVICE_DOMAIN_INFO)
+		return (info->domain == si_domain);
 
-	list_for_each_entry(info, &si_domain->devices, link)
-		if (info->dev == pdev)
-			return 1;
 	return 0;
 }
 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges
  2011-03-29 23:36 ` [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges Mike Travis
@ 2011-03-31 22:11   ` Mike Travis
  2011-03-31 22:53     ` Chris Wright
  0 siblings, 1 reply; 27+ messages in thread
From: Mike Travis @ 2011-03-31 22:11 UTC (permalink / raw)
  To: David Woodhouse, Jesse Barnes, Chris Wright
  Cc: Mike Habeck, iommu, linux-pci, linux-kernel

Chris - did you have any comment on this patch?

David/Jesse - should I resubmit the entire series for consideration?  We
have a number of customers that have encountered this problem.

Thanks,
Mike

Mike Travis wrote:
>     dmar_init_reserved_ranges() reserves the card's MMIO ranges to
>     prevent handing out a DMA map that would overlap with the MMIO range.
>     The problem is that while the Nvidia GPU has 64bit BARs, it's capable
>     of receiving > 40bit PIOs, but it can't generate > 40bit DMAs.
> 
>     So when the iommu code reserves these MMIO ranges a > 40bit
>     entry ends up getting in the rbtree.  On a UV test system with
>     the Nvidia cards, the BARs are:
> 
>       0001:36:00.0 VGA compatible controller: nVidia Corporation GT200GL 
> 	  Region 0: Memory at 92000000 (32-bit, non-prefetchable) [size=16M]
> 	  Region 1: Memory at f8200000000 (64-bit, prefetchable) [size=256M]
> 	  Region 3: Memory at 90000000 (64-bit, non-prefetchable) [size=32M]
> 
>     So this 44bit MMIO address 0xf8200000000 ends up in the rbtree.  As DMA
>     maps get added and deleted from the rbtree we can end up getting a cached
>     entry to this 0xf8200000000 entry... this is what results in the code
>     handing out the invalid DMA map of 0xf81fffff000:
> 
> 	    [ 0xf8200000000-1 >> PAGE_SIZE << PAGE_SIZE ]
> 
>     The IOVA code needs to better honor the "limit_pfn" when allocating
>     these maps.
> 
> Signed-off-by: Mike Travis <travis@sgi.com>
> Reviewed-by: Mike Habeck <habeck@sgi.com>
> ---
>  drivers/pci/intel-iommu.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> --- linux.orig/drivers/pci/intel-iommu.c
> +++ linux/drivers/pci/intel-iommu.c
> @@ -1323,7 +1323,8 @@ static void dmar_init_reserved_ranges(vo
>  
>  		for (i = 0; i < PCI_NUM_RESOURCES; i++) {
>  			r = &pdev->resource[i];
> -			if (!r->flags || !(r->flags & IORESOURCE_MEM))
> +			if (!r->flags || !(r->flags & IORESOURCE_MEM) ||
> +			    r->start > pdev->dma_mask)
>  				continue;
>  			iova = reserve_iova(&reserved_iova_list,
>  					    IOVA_PFN(r->start),
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges
  2011-03-31 22:11   ` Mike Travis
@ 2011-03-31 22:53     ` Chris Wright
  2011-03-31 23:25       ` Mike Travis
  2011-03-31 23:39       ` [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges Chris Wright
  0 siblings, 2 replies; 27+ messages in thread
From: Chris Wright @ 2011-03-31 22:53 UTC (permalink / raw)
  To: Mike Travis
  Cc: David Woodhouse, Jesse Barnes, Chris Wright, Mike Habeck, iommu,
	linux-pci, linux-kernel

* Mike Travis (travis@sgi.com) wrote:
> Chris - did you have any comment on this patch?

It doesn't actually look right to me.  It means that particular range
is no longer reserved.  But perhaps I've misunderstood something.

> Mike Travis wrote:
> >    dmar_init_reserved_ranges() reserves the card's MMIO ranges to
> >    prevent handing out a DMA map that would overlap with the MMIO range.
> >    The problem while the Nvidia GPU has 64bit BARs, it's capable of
> >    receiving > 40bit PIOs, but can't generate > 40bit DMAs.

I don't understand what you mean here.

> >    So when the iommu code reserves these MMIO ranges a > 40bit
> >    entry ends up getting in the rbtree.  On a UV test system with
> >    the Nvidia cards, the BARs are:
> >
> >      0001:36:00.0 VGA compatible controller: nVidia Corporation
> >GT200GL 	  Region 0: Memory at 92000000 (32-bit, non-prefetchable)
> >[size=16M]
> >	  Region 1: Memory at f8200000000 (64-bit, prefetchable) [size=256M]
> >	  Region 3: Memory at 90000000 (64-bit, non-prefetchable) [size=32M]
> >
> >    So this 44bit MMIO address 0xf8200000000 ends up in the rbtree.  As DMA
> >    maps get added and deleted from the rbtree we can end up getting a cached
> >    entry to this 0xf8200000000 entry... this is what results in the code
> >    handing out the invalid DMA map of 0xf81fffff000:
> >
> >	    [ 0xf8200000000-1 >> PAGE_SIZE << PAGE_SIZE ]
> >
> >    The IOVA code needs to better honor the "limit_pfn" when allocating
> >    these maps.

This means we could get the MMIO address range (it's no longer reserved).
It seems to me the DMA transaction would then become a peer-to-peer
transaction if ACS is not enabled, which could show up as a random register
write in that GPU's 256M BAR (i.e. broken).

The iova allocation should not hand out an address bigger than the
dma_mask.  What is the device's dma_mask?

thanks,
-chris

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges
  2011-03-31 22:53     ` Chris Wright
@ 2011-03-31 23:25       ` Mike Travis
  2011-03-31 23:40         ` Mike Habeck
  2011-03-31 23:39       ` [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges Chris Wright
  1 sibling, 1 reply; 27+ messages in thread
From: Mike Travis @ 2011-03-31 23:25 UTC (permalink / raw)
  To: Chris Wright
  Cc: David Woodhouse, Jesse Barnes, Mike Habeck, iommu, linux-pci,
	linux-kernel

I'll probably need help from our Hardware PCI Engineer to help explain
this further, though here's a pointer to an earlier email thread:

http://marc.info/?l=linux-kernel&m=129259816925973&w=2

I'll also dig out the specs you're asking for.

Thanks,
Mike

Chris Wright wrote:
> * Mike Travis (travis@sgi.com) wrote:
>> Chris - did you have any comment on this patch?
> 
> It doesn't actually look right to me.  It means that particular range
> is no longer reserved.  But perhaps I've misunderstood something.
> 
>> Mike Travis wrote:
>>>    dmar_init_reserved_ranges() reserves the card's MMIO ranges to
>>>    prevent handing out a DMA map that would overlap with the MMIO range.
>>>    The problem while the Nvidia GPU has 64bit BARs, it's capable of
>>>    receiving > 40bit PIOs, but can't generate > 40bit DMAs.
> 
> I don't undertand what you mean here.
> 
>>>    So when the iommu code reserves these MMIO ranges a > 40bit
>>>    entry ends up getting in the rbtree.  On a UV test system with
>>>    the Nvidia cards, the BARs are:
>>>
>>>      0001:36:00.0 VGA compatible controller: nVidia Corporation
>>> GT200GL 	  Region 0: Memory at 92000000 (32-bit, non-prefetchable)
>>> [size=16M]
>>> 	  Region 1: Memory at f8200000000 (64-bit, prefetchable) [size=256M]
>>> 	  Region 3: Memory at 90000000 (64-bit, non-prefetchable) [size=32M]
>>>
>>>    So this 44bit MMIO address 0xf8200000000 ends up in the rbtree.  As DMA
>>>    maps get added and deleted from the rbtree we can end up getting a cached
>>>    entry to this 0xf8200000000 entry... this is what results in the code
>>>    handing out the invalid DMA map of 0xf81fffff000:
>>>
>>> 	    [ 0xf8200000000-1 >> PAGE_SIZE << PAGE_SIZE ]
>>>
>>>    The IOVA code needs to better honor the "limit_pfn" when allocating
>>>    these maps.
> 
> This means we could get the MMIO address range (it's no longer reserved).
> It seems to me the DMA transaction would then become a peer to peer
> transaction if ACS is not enabled, which could show up as random register
> write in that GPUs 256M BAR (i.e. broken).
> 
> The iova allocation should not hand out an address bigger than the
> dma_mask.  What is the device's dma_mask?
> 
> thanks,
> -chris

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges
  2011-03-31 22:53     ` Chris Wright
  2011-03-31 23:25       ` Mike Travis
@ 2011-03-31 23:39       ` Chris Wright
  1 sibling, 0 replies; 27+ messages in thread
From: Chris Wright @ 2011-03-31 23:39 UTC (permalink / raw)
  To: Chris Wright
  Cc: Mike Travis, linux-pci, linux-kernel, Jesse Barnes, iommu,
	Mike Habeck, David Woodhouse

* Chris Wright (chrisw@sous-sol.org) wrote:
> > Mike Travis wrote:
> > >	  Region 1: Memory at f8200000000 (64-bit, prefetchable) [size=256M]
> > >	  Region 3: Memory at 90000000 (64-bit, non-prefetchable) [size=32M]
> > >
> > >    So this 44bit MMIO address 0xf8200000000 ends up in the rbtree.  As DMA
> > >    maps get added and deleted from the rbtree we can end up getting a cached
> > >    entry to this 0xf8200000000 entry... this is what results in the code
> > >    handing out the invalid DMA map of 0xf81fffff000:
> > >
> > >	    [ 0xf8200000000-1 >> PAGE_SIZE << PAGE_SIZE ]
> > >
> > >    The IOVA code needs to better honor the "limit_pfn" when allocating
> > >    these maps.
> 
> This means we could get the MMIO address range (it's no longer reserved).
> It seems to me the DMA transaction would then become a peer to peer
> transaction if ACS is not enabled, which could show up as random register
> write in that GPUs 256M BAR (i.e. broken).
> 
> The iova allocation should not hand out an address bigger than the
> dma_mask.  What is the device's dma_mask?

Ah, looks like this is a bad interaction with the way the cached entry
is handled.  I think the iova lookup should skip down to the limit_pfn
rather than assume that rb_last's pfn_lo/hi is ok just because it's in
the tree.  Because you'll never hit the limit_pfn == 32bit_pfn case, it
just goes straight to rb_last in __get_cached_rbnode.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges
  2011-03-31 23:25       ` Mike Travis
@ 2011-03-31 23:40         ` Mike Habeck
  2011-03-31 23:56           ` Chris Wright
  0 siblings, 1 reply; 27+ messages in thread
From: Mike Habeck @ 2011-03-31 23:40 UTC (permalink / raw)
  To: Mike Travis
  Cc: Chris Wright, David Woodhouse, Jesse Barnes, iommu, linux-pci,
	linux-kernel

On 03/31/2011 06:25 PM, Mike Travis wrote:
> I'll probably need help from our Hardware PCI Engineer to help explain
> this further, though here's a pointer to an earlier email thread:
>
> http://marc.info/?l=linux-kernel&m=129259816925973&w=2
>
> I'll also dig out the specs you're asking for.
>
> Thanks,
> Mike
>
> Chris Wright wrote:
>> * Mike Travis (travis@sgi.com) wrote:
>>> Chris - did you have any comment on this patch?
>>
>> It doesn't actually look right to me. It means that particular range
>> is no longer reserved. But perhaps I've misunderstood something.
>>
>>> Mike Travis wrote:
>>>> dmar_init_reserved_ranges() reserves the card's MMIO ranges to
>>>> prevent handing out a DMA map that would overlap with the MMIO range.
>>>> The problem while the Nvidia GPU has 64bit BARs, it's capable of
>>>> receiving > 40bit PIOs, but can't generate > 40bit DMAs.
>>
>> I don't undertand what you mean here.

What Mike is getting at is that there is no reason to reserve the MMIO
range if it's greater than the dma_mask, given the MMIO range is
outside of what the IOVA code will ever hand back to the IOMMU
code.  In this case the nVidia card has a 64bit BAR and is assigned
the MMIO range [0xf8200000000 - 0xf820fffffff].  But the Nvidia
card can only generate a 40bit DMA (thus has a 40bit dma_mask).  If
the IOVA code honors the limit_pfn (i.e., dma_mask) passed in, it
will never hand a >40bit address back to the IOMMU code.  Thus
there is no reason to reserve the card's MMIO range if it is greater
than the dma_mask.  (And that is what the patch is doing.)

More below...

>>
>>>> So when the iommu code reserves these MMIO ranges a > 40bit
>>>> entry ends up getting in the rbtree. On a UV test system with
>>>> the Nvidia cards, the BARs are:
>>>>
>>>> 0001:36:00.0 VGA compatible controller: nVidia Corporation
>>>> GT200GL Region 0: Memory at 92000000 (32-bit, non-prefetchable)
>>>> [size=16M]
>>>> Region 1: Memory at f8200000000 (64-bit, prefetchable) [size=256M]
>>>> Region 3: Memory at 90000000 (64-bit, non-prefetchable) [size=32M]
>>>>
>>>> So this 44bit MMIO address 0xf8200000000 ends up in the rbtree. As DMA
>>>> maps get added and deleted from the rbtree we can end up getting a cached
>>>> entry to this 0xf8200000000 entry... this is what results in the code
>>>> handing out the invalid DMA map of 0xf81fffff000:
>>>>
>>>> [ 0xf8200000000-1 >> PAGE_SIZE << PAGE_SIZE ]
>>>>
>>>> The IOVA code needs to better honor the "limit_pfn" when allocating
>>>> these maps.
>>
>> This means we could get the MMIO address range (it's no longer reserved).

Not true: the MMIO address is greater than the dma_mask (i.e., the
limit_pfn passed into alloc_iova()), thus the IOVA code will never
hand back that address range, given it's greater than the dma_mask.

>> It seems to me the DMA transaction would then become a peer to peer
>> transaction if ACS is not enabled, which could show up as random register
>> write in that GPUs 256M BAR (i.e. broken).
>>
>> The iova allocation should not hand out an address bigger than the
>> dma_mask. What is the device's dma_mask?

Agree.  But there is a bug.  The IOVA code doesn't validate the limit_pfn
if it uses the cached entry.  One could argue that it should validate
the limit_pfn, but then again an entry outside the limit_pfn should
never have gotten into the rbtree...  (it got in due to the IOMMU's
dmar_init_reserved_ranges() adding it).

-mike

>>
>> thanks,
>> -chris


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges
  2011-03-31 23:40         ` Mike Habeck
@ 2011-03-31 23:56           ` Chris Wright
  2011-04-01  1:05             ` Mike Habeck
  0 siblings, 1 reply; 27+ messages in thread
From: Chris Wright @ 2011-03-31 23:56 UTC (permalink / raw)
  To: Mike Habeck
  Cc: Mike Travis, Chris Wright, David Woodhouse, Jesse Barnes, iommu,
	linux-pci, linux-kernel

* Mike Habeck (habeck@sgi.com) wrote:
> On 03/31/2011 06:25 PM, Mike Travis wrote:
> >I'll probably need help from our Hardware PCI Engineer to help explain
> >this further, though here's a pointer to an earlier email thread:
> >
> >http://marc.info/?l=linux-kernel&m=129259816925973&w=2
> >
> >I'll also dig out the specs you're asking for.
> >
> >Thanks,
> >Mike
> >
> >Chris Wright wrote:
> >>* Mike Travis (travis@sgi.com) wrote:
> >>>Chris - did you have any comment on this patch?
> >>
> >>It doesn't actually look right to me. It means that particular range
> >>is no longer reserved. But perhaps I've misunderstood something.
> >>
> >>>Mike Travis wrote:
> >>>>dmar_init_reserved_ranges() reserves the card's MMIO ranges to
> >>>>prevent handing out a DMA map that would overlap with the MMIO range.
> >>>>The problem while the Nvidia GPU has 64bit BARs, it's capable of
> >>>>receiving > 40bit PIOs, but can't generate > 40bit DMAs.
> >>
> >>I don't undertand what you mean here.
> 
> What Mike is getting at is there is no reason to reserve the MMIO
> range if it's greater than the dma_mask, given the MMIO range is
> outside of what the IOVA code will ever hand back to the IOMMU
> code.  In this case the nVidia card has a 64bit BAR and is assigned
> the MMIO range [0xf8200000000 - 0xf820fffffff].  But the Nvidia
> card can only generate a 40bit DMA (thus has a 40bit dma_mask). If
> the IOVA code honors the limit_pfn (i.e., dma_mask) passed in it
> will never hand back a >40bit address back to the IOMMU code. Thus
> there is no reason to reserve the cards MMIO range if it is greater
> than the dma_mask. (And that is what the patch is doing).

The reserved ranges are for all devices.  Another device with a 64bit
dma_mask could get that region if it's not properly reserved.  The
driver would then program that device to DMA to an address that is an
alias of an MMIO region.  The memory transaction travels up towards
the root... and sees the MMIO range in some bridge and would go straight
down to the GPU.

> More below,,,
> 
> >>
> >>>>So when the iommu code reserves these MMIO ranges a > 40bit
> >>>>entry ends up getting in the rbtree. On a UV test system with
> >>>>the Nvidia cards, the BARs are:
> >>>>
> >>>>0001:36:00.0 VGA compatible controller: nVidia Corporation
> >>>>GT200GL Region 0: Memory at 92000000 (32-bit, non-prefetchable)
> >>>>[size=16M]
> >>>>Region 1: Memory at f8200000000 (64-bit, prefetchable) [size=256M]
> >>>>Region 3: Memory at 90000000 (64-bit, non-prefetchable) [size=32M]
> >>>>
> >>>>So this 44bit MMIO address 0xf8200000000 ends up in the rbtree. As DMA
> >>>>maps get added and deleted from the rbtree we can end up getting a cached
> >>>>entry to this 0xf8200000000 entry... this is what results in the code
> >>>>handing out the invalid DMA map of 0xf81fffff000:
> >>>>
> >>>>[ 0xf8200000000-1 >> PAGE_SIZE << PAGE_SIZE ]
> >>>>
> >>>>The IOVA code needs to better honor the "limit_pfn" when allocating
> >>>>these maps.
> >>
> >>This means we could get the MMIO address range (it's no longer reserved).
> 
> Not true, the MMIO address is greater than the dma_mask (i.e., the
> limit_pfn passed into alloc_iova()) thus the IOVA code will never
> hand back that address range given it's greater than the dma_mask).

Well, as you guys are seeing, the iova allocation code is making the
assumption that if the range is in the tree, it's valid.  And it is
handing out an address that's too large.

> >>It seems to me the DMA transaction would then become a peer to peer
> >>transaction if ACS is not enabled, which could show up as random register
> >>write in that GPUs 256M BAR (i.e. broken).
> >>
> >>The iova allocation should not hand out an address bigger than the
> >>dma_mask. What is the device's dma_mask?
> 
> Agree.  But there is a bug.  The IOVA doesn't validate the limit_pfn
> if it uses the cached entry.  One could argue that it should validate
> the limit_pfn, but then again a entry outside the limit_pfn should
> have never got into the rbtree...  (it got in due to the IOMMU's
> dmar_init_reserved_ranges() adding it).

Yeah, I think it needs to be in the global reserved list.  But perhaps
not copied into the domain specific iova.  Or simply skipped on iova
allocation (don't just assume rb_last is <= dma_mask).

thanks,
-chris

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges
  2011-03-31 23:56           ` Chris Wright
@ 2011-04-01  1:05             ` Mike Habeck
  2011-04-02  0:32               ` [PATCH 3/4 v2] intel-iommu: don't cache iova above 32bit caching boundary Chris Wright
  0 siblings, 1 reply; 27+ messages in thread
From: Mike Habeck @ 2011-04-01  1:05 UTC (permalink / raw)
  To: Chris Wright
  Cc: Mike Travis, David Woodhouse, Jesse Barnes, iommu, linux-pci,
	linux-kernel



Chris Wright wrote:
> * Mike Habeck (habeck@sgi.com) wrote:
>> On 03/31/2011 06:25 PM, Mike Travis wrote:
>>> I'll probably need help from our Hardware PCI Engineer to help explain
>>> this further, though here's a pointer to an earlier email thread:
>>>
>>> http://marc.info/?l=linux-kernel&m=129259816925973&w=2
>>>
>>> I'll also dig out the specs you're asking for.
>>>
>>> Thanks,
>>> Mike
>>>
>>> Chris Wright wrote:
>>>> * Mike Travis (travis@sgi.com) wrote:
>>>>> Chris - did you have any comment on this patch?
>>>> It doesn't actually look right to me. It means that particular range
>>>> is no longer reserved. But perhaps I've misunderstood something.
>>>>
>>>>> Mike Travis wrote:
>>>>>> dmar_init_reserved_ranges() reserves the card's MMIO ranges to
>>>>>> prevent handing out a DMA map that would overlap with the MMIO range.
>>>>>> The problem while the Nvidia GPU has 64bit BARs, it's capable of
>>>>>> receiving > 40bit PIOs, but can't generate > 40bit DMAs.
>>>> I don't undertand what you mean here.
>> What Mike is getting at is there is no reason to reserve the MMIO
>> range if it's greater than the dma_mask, given the MMIO range is
>> outside of what the IOVA code will ever hand back to the IOMMU
>> code.  In this case the nVidia card has a 64bit BAR and is assigned
>> the MMIO range [0xf8200000000 - 0xf820fffffff].  But the Nvidia
>> card can only generate a 40bit DMA (thus has a 40bit dma_mask). If
>> the IOVA code honors the limit_pfn (i.e., dma_mask) passed in it
>> will never hand a >40bit address back to the IOMMU code.  Thus
>> there is no reason to reserve the card's MMIO range if it is greater
>> than the dma_mask. (And that is what the patch is doing).
> 
> The reserved ranges are for all devices.  Another device with a 64bit
> dma_mask could get that region if it's not properly reserved.  The
> driver would then program that device to DMA to an address that is
> an alias to a MMIO region.  The memory transaction travels up towards
> root...and sees the MMIO range in some bridge and would go straight down
> to the GPU.

Chris,

OK, I understand now what you meant by the patch possibly causing
the DMA transaction to become a peer to peer transaction.  Mike and
I will have to rethink this one.  Thanks for your input.

-mike


> 
>> More below...
>>
>>>>>> So when the iommu code reserves these MMIO ranges a > 40bit
>>>>>> entry ends up getting in the rbtree. On a UV test system with
>>>>>> the Nvidia cards, the BARs are:
>>>>>>
>>>>>> 0001:36:00.0 VGA compatible controller: nVidia Corporation
>>>>>> GT200GL Region 0: Memory at 92000000 (32-bit, non-prefetchable)
>>>>>> [size=16M]
>>>>>> Region 1: Memory at f8200000000 (64-bit, prefetchable) [size=256M]
>>>>>> Region 3: Memory at 90000000 (64-bit, non-prefetchable) [size=32M]
>>>>>>
>>>>>> So this 44bit MMIO address 0xf8200000000 ends up in the rbtree. As DMA
>>>>>> maps get added and deleted from the rbtree we can end up getting a cached
>>>>>> entry to this 0xf8200000000 entry... this is what results in the code
>>>>>> handing out the invalid DMA map of 0xf81fffff000:
>>>>>>
>>>>>> [ 0xf8200000000-1 >> PAGE_SIZE << PAGE_SIZE ]
>>>>>>
>>>>>> The IOVA code needs to better honor the "limit_pfn" when allocating
>>>>>> these maps.
>>>> This means we could get the MMIO address range (it's no longer reserved).
>> Not true, the MMIO address is greater than the dma_mask (i.e., the
>> limit_pfn passed into alloc_iova()) thus the IOVA code will never
>> hand back that address range, given it's greater than the dma_mask.
> 
> Well, as you guys are seeing, the iova allocation code is making the
> assumption that if the range is in the tree, it's valid.  And it is
> handing out an address that's too large.
> 
>>>> It seems to me the DMA transaction would then become a peer to peer
>>>> transaction if ACS is not enabled, which could show up as random register
>>>> write in that GPUs 256M BAR (i.e. broken).
>>>>
>>>> The iova allocation should not hand out an address bigger than the
>>>> dma_mask. What is the device's dma_mask?
>> Agree.  But there is a bug.  The IOVA doesn't validate the limit_pfn
>> if it uses the cached entry.  One could argue that it should validate
>> the limit_pfn, but then again an entry outside the limit_pfn should
>> never have gotten into the rbtree...  (it got in due to the IOMMU's
>> dmar_init_reserved_ranges() adding it).
> 
> Yeah, I think it needs to be in the global reserved list.  But perhaps
> not copied into the domain specific iova.  Or simply skipped on iova
> allocation (don't just assume rb_last is <= dma_mask).
> 
> thanks,
> -chris

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] Intel pci: Use coherent DMA mask when requested
  2011-03-30 18:02   ` Chris Wright
@ 2011-04-01  2:57     ` FUJITA Tomonori
  0 siblings, 0 replies; 27+ messages in thread
From: FUJITA Tomonori @ 2011-04-01  2:57 UTC (permalink / raw)
  To: chrisw
  Cc: travis, dwmw2, jbarnes, linux-pci, iommu, habeck, linux-kernel,
	fujita.tomonori

On Wed, 30 Mar 2011 11:02:45 -0700
Chris Wright <chrisw@sous-sol.org> wrote:

> * Mike Travis (travis@sgi.com) wrote:
> >     The __intel_map_single function is not honoring the passed in
> >     DMA mask.  This results in not using the coherent DMA mask when
> >     called from intel_alloc_coherent().
> > 
> > Signed-off-by: Mike Travis <travis@sgi.com>
> > Reviewed-by: Mike Habeck <habeck@sgi.com>
> 
> Hmm, looks like it has actually been broken since it was introduced
> in bb9e6d65.

Oops, you are right.

The patch looks good to me.

Thanks,

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 3/4 v2] intel-iommu: don't cache iova above 32bit caching boundary
  2011-04-01  1:05             ` Mike Habeck
@ 2011-04-02  0:32               ` Chris Wright
  2011-04-06  0:39                 ` [PATCH 3/4 v3] " Chris Wright
  0 siblings, 1 reply; 27+ messages in thread
From: Chris Wright @ 2011-04-02  0:32 UTC (permalink / raw)
  To: Mike Travis, Mike Habeck
  Cc: David Woodhouse, Chris Wright, Jesse Barnes, iommu, linux-pci,
	linux-kernel

Mike Travis and Mike Habeck reported an issue where iova allocation
would return a range that was larger than a device's dma mask.

https://lkml.org/lkml/2011/3/29/423

The dmar initialization code will reserve all PCI MMIO regions and copy
those reservations into a domain specific iova tree.  It is possible for
one of those regions to be above the dma mask of a device.  It is typical
to allocate iovas with a 32bit mask (despite device's dma mask possibly
being larger) and cache the result until it exhausts the lower 32bit
address space.  Freeing the iova range that is >= the last iova in the
lower 32bit range when there is still an iova above the 32bit range will
corrupt the cached iova by pointing it to a region that is above 32bit.
If that region is also larger than the device's dma mask, a subsequent
allocation will return an unusable iova and cause dma failure.

Simply don't cache an iova that is above the 32bit caching boundary.

Reported-by: Mike Travis <travis@sgi.com>
Reported-by: Mike Habeck <habeck@sgi.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: stable@kernel.org 
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
---
Mike or Mike, can you try this?  I was able to reproduce the failure
in a few different ways and successfully test this patch against those
failures, but w/out real hw.

 drivers/pci/iova.c |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/iova.c b/drivers/pci/iova.c
index 7914951..1690ca4 100644
--- a/drivers/pci/iova.c
+++ b/drivers/pci/iova.c
@@ -63,8 +63,16 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
 	curr = iovad->cached32_node;
 	cached_iova = container_of(curr, struct iova, node);
 
-	if (free->pfn_lo >= cached_iova->pfn_lo)
-		iovad->cached32_node = rb_next(&free->node);
+	if (free->pfn_lo >= cached_iova->pfn_lo) {
+		struct rb_node *node = rb_next(&free->node);
+		struct iova *iova = container_of(node, struct iova, node);
+
+		/* only cache if it's below 32bit pfn */
+		if (iova->pfn_lo < iovad->dma_32bit_pfn)
+			iovad->cached32_node = node;
+		else
+			iovad->cached32_node = NULL;
+	}
 }
 
 /* Computes the padding size required, to make the
-- 
1.7.4

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/4 v3] intel-iommu: don't cache iova above 32bit caching boundary
  2011-04-02  0:32               ` [PATCH 3/4 v2] intel-iommu: don't cache iova above 32bit caching boundary Chris Wright
@ 2011-04-06  0:39                 ` Chris Wright
  0 siblings, 0 replies; 27+ messages in thread
From: Chris Wright @ 2011-04-06  0:39 UTC (permalink / raw)
  To: Mike Travis, Mike Habeck
  Cc: David Woodhouse, Chris Wright, Jesse Barnes, iommu, linux-pci,
	linux-kernel

Mike Travis and Mike Habeck reported an issue where iova allocation
would return a range that was larger than a device's dma mask.

https://lkml.org/lkml/2011/3/29/423

The dmar initialization code will reserve all PCI MMIO regions and copy
those reservations into a domain specific iova tree.  It is possible for
one of those regions to be above the dma mask of a device.  It is typical
to allocate iovas with a 32bit mask (despite device's dma mask possibly
being larger) and cache the result until it exhausts the lower 32bit
address space.  Freeing the iova range that is >= the last iova in the
lower 32bit range when there is still an iova above the 32bit range will
corrupt the cached iova by pointing it to a region that is above 32bit.
If that region is also larger than the device's dma mask, a subsequent
allocation will return an unusable iova and cause dma failure.

Simply don't cache an iova that is above the 32bit caching boundary.

Reported-by: Mike Travis <travis@sgi.com>
Reported-by: Mike Habeck <habeck@sgi.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: stable@kernel.org
Acked-by: Mike Travis <travis@sgi.com>
Tested-by: Mike Habeck <habeck@sgi.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
---

v3: rb_next() can return NULL, found when testing on my hw

David, Mike Travis will collect and resubmit the full series when he's back.

 drivers/pci/iova.c |   12 ++++++++++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/iova.c b/drivers/pci/iova.c
index 7914951..10f995a 100644
--- a/drivers/pci/iova.c
+++ b/drivers/pci/iova.c
@@ -63,8 +63,16 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
 	curr = iovad->cached32_node;
 	cached_iova = container_of(curr, struct iova, node);
 
-	if (free->pfn_lo >= cached_iova->pfn_lo)
-		iovad->cached32_node = rb_next(&free->node);
+	if (free->pfn_lo >= cached_iova->pfn_lo) {
+		struct rb_node *node = rb_next(&free->node);
+		struct iova *iova = container_of(node, struct iova, node);
+
+		/* only cache if it's below 32bit pfn */
+		if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
+			iovad->cached32_node = node;
+		else
+			iovad->cached32_node = NULL;
+	}
 }
 
 /* Computes the padding size required, to make the
-- 
1.7.4

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping
  2011-03-29 23:36 [PATCH 0/4] pci: Speed up processing of IOMMU related functions Mike Travis
                   ` (3 preceding siblings ...)
  2011-03-29 23:36 ` [PATCH 4/4] Intel pci: Use coherent DMA mask when requested Mike Travis
@ 2011-04-07 19:47 ` Mike Travis
  2011-04-07 19:51 ` [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function Mike Travis
  2011-04-07 19:52 ` [PATCH 4/4] Intel pci: Use coherent DMA mask when requested Mike Travis
  6 siblings, 0 replies; 27+ messages in thread
From: Mike Travis @ 2011-04-07 19:47 UTC (permalink / raw)
  To: David Woodhouse, Jesse Barnes, Chris Wright
  Cc: Mike Habeck, iommu, linux-pci, linux-kernel

    When the IOMMU is being used, each request for a DMA mapping requires
    the intel_iommu code to look for some space in the DMA mapping table.
    For most drivers this occurs for each transfer.

    When there are many outstanding DMA mappings [as seems to be the case
    with the 10GigE driver], the table grows large and the search for
    space becomes increasingly time consuming.  Performance for the
    10GigE driver drops to about 10% of its capacity on a UV system
    when the CPU count is large.

    The workaround is to specify the iommu=pt option which sets up a 1:1
    identity map for those devices that support enough DMA address bits to
    cover the physical system memory.  This is the "pass through" option.

    But this can only be accomplished by those devices that pass their
    DMA data through the IOMMU (VTd).  Host Bridge Devices connected
    to System Sockets do not pass their data through the VTd, thus the
    following error occurs:

    IOMMU: hardware identity mapping for device 1000:3e:00.0
    Failed to setup IOMMU pass-through
    BUG: unable to handle kernel NULL pointer dereference at 000000000000001c

    This patch fixes that problem by removing Host Bridge devices from
    being identity mapped, given that they do not generate DMA ops anyway.

Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Mike Habeck <habeck@sgi.com>
---
 drivers/pci/intel-iommu.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

--- linux.orig/drivers/pci/intel-iommu.c
+++ linux/drivers/pci/intel-iommu.c
@@ -46,6 +46,7 @@
 #define ROOT_SIZE		VTD_PAGE_SIZE
 #define CONTEXT_SIZE		VTD_PAGE_SIZE
 
+#define IS_HOSTBRIDGE_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_HOST)
 #define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
 #define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA)
 #define IS_AZALIA(pdev) ((pdev)->vendor == 0x8086 && (pdev)->device == 0x3a3e)
@@ -2183,7 +2184,7 @@ static int iommu_should_identity_map(str
 	 * take them out of the 1:1 domain later.
 	 */
 	if (!startup)
-		return pdev->dma_mask > DMA_BIT_MASK(32);
+		return pdev->dma_mask == DMA_BIT_MASK(64);
 
 	return 1;
 }
@@ -2198,6 +2199,9 @@ static int __init iommu_prepare_static_i
 		return -EFAULT;
 
 	for_each_pci_dev(pdev) {
+		/* Skip PCI Host Bridge devices */
+		if (IS_HOSTBRIDGE_DEVICE(pdev))
+			continue;
 		if (iommu_should_identity_map(pdev, 1)) {
 			printk(KERN_INFO "IOMMU: %s identity mapping for device %s\n",
 			       hw ? "hardware" : "software", pci_name(pdev));

-- 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function
  2011-03-29 23:36 [PATCH 0/4] pci: Speed up processing of IOMMU related functions Mike Travis
                   ` (4 preceding siblings ...)
  2011-04-07 19:47 ` [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping Mike Travis
@ 2011-04-07 19:51 ` Mike Travis
  2011-04-07 19:52 ` [PATCH 4/4] Intel pci: Use coherent DMA mask when requested Mike Travis
  6 siblings, 0 replies; 27+ messages in thread
From: Mike Travis @ 2011-04-07 19:51 UTC (permalink / raw)
  To: David Woodhouse, Jesse Barnes, Chris Wright
  Cc: Mike Habeck, iommu, linux-pci, linux-kernel

    When there are a large number of PCI devices and the pass
    through option for iommu is set, much time is spent in the
    identity_mapping function hunting through the iommu domains to
    check if a specific device is "identity mapped".

    Speed up the function by checking the cached info to see if
    it's mapped to the static identity domain.

Signed-off-by: Mike Travis <travis@sgi.com>
---
 drivers/pci/intel-iommu.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- linux-2.6.32.orig/drivers/pci/intel-iommu.c
+++ linux-2.6.32/drivers/pci/intel-iommu.c
@@ -2124,10 +2124,10 @@ static int identity_mapping(struct pci_d
 	if (likely(!iommu_identity_mapping))
 		return 0;
 
+	info = pdev->dev.archdata.iommu;
+	if (info && info != DUMMY_DEVICE_DOMAIN_INFO)
+		return (info->domain == si_domain);
 
-	list_for_each_entry(info, &si_domain->devices, link)
-		if (info->dev == pdev)
-			return 1;
 	return 0;
 }
 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 4/4] Intel pci: Use coherent DMA mask when requested
  2011-03-29 23:36 [PATCH 0/4] pci: Speed up processing of IOMMU related functions Mike Travis
                   ` (5 preceding siblings ...)
  2011-04-07 19:51 ` [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function Mike Travis
@ 2011-04-07 19:52 ` Mike Travis
  6 siblings, 0 replies; 27+ messages in thread
From: Mike Travis @ 2011-04-07 19:52 UTC (permalink / raw)
  To: David Woodhouse, Jesse Barnes
  Cc: Mike Habeck, iommu, linux-pci, linux-kernel, Chris Wright

    The __intel_map_single function is not honoring the passed in
    DMA mask.  This results in not using the coherent DMA mask when
    called from intel_alloc_coherent().

Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Mike Habeck <habeck@sgi.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
---
 drivers/pci/intel-iommu.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- linux.orig/drivers/pci/intel-iommu.c
+++ linux/drivers/pci/intel-iommu.c
@@ -2582,8 +2582,7 @@ static dma_addr_t __intel_map_single(str
 	iommu = domain_get_iommu(domain);
 	size = aligned_nrpages(paddr, size);
 
-	iova = intel_alloc_iova(hwdev, domain, dma_to_mm_pfn(size),
-				pdev->dma_mask);
+	iova = intel_alloc_iova(hwdev, domain, dma_to_mm_pfn(size), dma_mask);
 	if (!iova)
 		goto error;
 

-- 

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2011-04-07 19:52 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-03-29 23:36 [PATCH 0/4] pci: Speed up processing of IOMMU related functions Mike Travis
2011-03-29 23:36 ` [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping Mike Travis
2011-03-30 17:51   ` Chris Wright
2011-03-30 18:30     ` Mike Travis
2011-03-30 19:15       ` Chris Wright
2011-03-30 19:25         ` Mike Travis
2011-03-30 19:57           ` Chris Wright
2011-03-29 23:36 ` [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function Mike Travis
2011-03-30 19:19   ` Chris Wright
2011-03-30 19:29     ` Mike Travis
2011-03-31  0:33     ` [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function v2 Mike Travis
2011-03-29 23:36 ` [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges Mike Travis
2011-03-31 22:11   ` Mike Travis
2011-03-31 22:53     ` Chris Wright
2011-03-31 23:25       ` Mike Travis
2011-03-31 23:40         ` Mike Habeck
2011-03-31 23:56           ` Chris Wright
2011-04-01  1:05             ` Mike Habeck
2011-04-02  0:32               ` [PATCH 3/4 v2] intel-iommu: don't cache iova above 32bit caching boundary Chris Wright
2011-04-06  0:39                 ` [PATCH 3/4 v3] " Chris Wright
2011-03-31 23:39       ` [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges Chris Wright
2011-03-29 23:36 ` [PATCH 4/4] Intel pci: Use coherent DMA mask when requested Mike Travis
2011-03-30 18:02   ` Chris Wright
2011-04-01  2:57     ` FUJITA Tomonori
2011-04-07 19:47 ` [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping Mike Travis
2011-04-07 19:51 ` [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function Mike Travis
2011-04-07 19:52 ` [PATCH 4/4] Intel pci: Use coherent DMA mask when requested Mike Travis
