* [PATCH 0/4] pci: Speed up processing of IOMMU related functions
@ 2011-03-29 23:36 Mike Travis
2011-03-29 23:36 ` [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping Mike Travis
` (6 more replies)
0 siblings, 7 replies; 27+ messages in thread
From: Mike Travis @ 2011-03-29 23:36 UTC (permalink / raw)
To: David Woodhouse, Jesse Barnes; +Cc: Mike Habeck, iommu, linux-pci, linux-kernel
During testing of a large customer UV configuration with a lot of
I/O devices, certain IOMMU functions were either consuming too much
time or not functioning correctly.
This patchset addresses those problems.
Signed-off-by: Mike Travis <travis@sgi.com>
--
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping
2011-03-29 23:36 [PATCH 0/4] pci: Speed up processing of IOMMU related functions Mike Travis
@ 2011-03-29 23:36 ` Mike Travis
2011-03-30 17:51 ` Chris Wright
2011-03-29 23:36 ` [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function Mike Travis
` (5 subsequent siblings)
6 siblings, 1 reply; 27+ messages in thread
From: Mike Travis @ 2011-03-29 23:36 UTC (permalink / raw)
To: David Woodhouse, Jesse Barnes; +Cc: Mike Habeck, iommu, linux-pci, linux-kernel
[-- Attachment #1: intel-iommu-pass-thru-fix --]
[-- Type: text/plain, Size: 2582 bytes --]
When the IOMMU is being used, each request for a DMA mapping requires
the intel_iommu code to look for some space in the DMA mapping table.
For most drivers this occurs for each transfer.
When there are many outstanding DMA mappings [as seems to be the case
with the 10GigE driver], the table grows large and the search for
space becomes increasingly time consuming. Performance for the
10GigE driver drops to about 10% of its capacity on a UV system
when the CPU count is large.
The workaround is to specify the iommu=pt option which sets up a 1:1
identity map for those devices that support enough DMA address bits to
cover the physical system memory. This is the "pass through" option.
This can only be accomplished for devices that pass their
DMA data through the IOMMU (VT-d). But Host Bridge devices connected
to System Sockets do not pass their data through the VT-d, thus the
following error occurs:
IOMMU: hardware identity mapping for device 1000:3e:00.0
Failed to setup IOMMU pass-through
BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
This patch fixes that problem by removing Host Bridge devices from
being identity mapped, given that they do not generate DMA ops anyway.
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Mike Habeck <habeck@sgi.com>
---
drivers/pci/intel-iommu.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
--- linux.orig/drivers/pci/intel-iommu.c
+++ linux/drivers/pci/intel-iommu.c
@@ -46,6 +46,7 @@
#define ROOT_SIZE VTD_PAGE_SIZE
#define CONTEXT_SIZE VTD_PAGE_SIZE
+#define IS_HOSTBRIDGE_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_HOST)
#define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
#define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA)
#define IS_AZALIA(pdev) ((pdev)->vendor == 0x8086 && (pdev)->device == 0x3a3e)
@@ -2183,7 +2184,7 @@ static int iommu_should_identity_map(str
* take them out of the 1:1 domain later.
*/
if (!startup)
- return pdev->dma_mask > DMA_BIT_MASK(32);
+ return pdev->dma_mask == DMA_BIT_MASK(64);
return 1;
}
@@ -2198,6 +2199,9 @@ static int __init iommu_prepare_static_i
return -EFAULT;
for_each_pci_dev(pdev) {
+ /* Skip PCI Host Bridge devices */
+ if (IS_HOSTBRIDGE_DEVICE(pdev))
+ continue;
if (iommu_should_identity_map(pdev, 1)) {
printk(KERN_INFO "IOMMU: %s identity mapping for device %s\n",
hw ? "hardware" : "software", pci_name(pdev));
--
* [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function
2011-03-29 23:36 [PATCH 0/4] pci: Speed up processing of IOMMU related functions Mike Travis
2011-03-29 23:36 ` [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping Mike Travis
@ 2011-03-29 23:36 ` Mike Travis
2011-03-30 19:19 ` Chris Wright
2011-03-29 23:36 ` [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges Mike Travis
` (4 subsequent siblings)
6 siblings, 1 reply; 27+ messages in thread
From: Mike Travis @ 2011-03-29 23:36 UTC (permalink / raw)
To: David Woodhouse, Jesse Barnes; +Cc: Mike Habeck, iommu, linux-pci, linux-kernel
[-- Attachment #1: speed-up-iommu_no_mapping --]
[-- Type: text/plain, Size: 1989 bytes --]
On Nehalem systems, there are 43 Host Bridges exposed to the kernel
for each System Socket. With a large socket count and the iommu
pass-through option set, too much time is spent in the
identity_mapping function, hunting through the iommu domains to
check if a specific device is "identity mapped".
Speed up the function by setting an "is_identity_mapped" flag in
the pci_dev struct, if this device is mapped to the static identity
domain.
Signed-off-by: Mike Travis <travis@sgi.com>
---
drivers/pci/intel-iommu.c | 11 +++++------
include/linux/pci.h | 1 +
2 files changed, 6 insertions(+), 6 deletions(-)
--- linux.orig/drivers/pci/intel-iommu.c
+++ linux/drivers/pci/intel-iommu.c
@@ -2102,11 +2102,7 @@ static int identity_mapping(struct pci_d
if (likely(!iommu_identity_mapping))
return 0;
-
- list_for_each_entry(info, &si_domain->devices, link)
- if (info->dev == pdev)
- return 1;
- return 0;
+ return pdev->is_identity_mapped;
}
static int domain_add_dev_info(struct dmar_domain *domain,
@@ -2138,6 +2134,7 @@ static int domain_add_dev_info(struct dm
list_add(&info->global, &device_domain_list);
pdev->dev.archdata.iommu = info;
spin_unlock_irqrestore(&device_domain_lock, flags);
+ pdev->is_identity_mapped = (domain == si_domain);
return 0;
}
@@ -3404,8 +3401,10 @@ static void domain_remove_one_dev_info(s
* update iommu count and coherency
*/
if (iommu == device_to_iommu(info->segment, info->bus,
- info->devfn))
+ info->devfn)) {
found = 1;
+ pdev->is_identity_mapped = 0;
+ }
}
if (found == 0) {
--- linux.orig/include/linux/pci.h
+++ linux/include/linux/pci.h
@@ -314,6 +314,7 @@ struct pci_dev {
unsigned int is_virtfn:1;
unsigned int reset_fn:1;
unsigned int is_hotplug_bridge:1;
+ unsigned int is_identity_mapped:1;
unsigned int __aer_firmware_first_valid:1;
unsigned int __aer_firmware_first:1;
pci_dev_flags_t dev_flags;
--
* [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges
2011-03-29 23:36 [PATCH 0/4] pci: Speed up processing of IOMMU related functions Mike Travis
2011-03-29 23:36 ` [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping Mike Travis
2011-03-29 23:36 ` [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function Mike Travis
@ 2011-03-29 23:36 ` Mike Travis
2011-03-31 22:11 ` Mike Travis
2011-03-29 23:36 ` [PATCH 4/4] Intel pci: Use coherent DMA mask when requested Mike Travis
` (3 subsequent siblings)
6 siblings, 1 reply; 27+ messages in thread
From: Mike Travis @ 2011-03-29 23:36 UTC (permalink / raw)
To: David Woodhouse, Jesse Barnes; +Cc: Mike Habeck, iommu, linux-pci, linux-kernel
[-- Attachment #1: limit-dmar_init_reserved_ranges --]
[-- Type: text/plain, Size: 1775 bytes --]
dmar_init_reserved_ranges() reserves the card's MMIO ranges to
prevent handing out a DMA map that would overlap with the MMIO range.
The problem is that while the Nvidia GPU has 64-bit BARs, it's
capable of receiving > 40-bit PIOs but can't generate > 40-bit DMAs.
So when the iommu code reserves these MMIO ranges, a > 40-bit
entry ends up in the rbtree. On a UV test system with
the Nvidia cards, the BARs are:
0001:36:00.0 VGA compatible controller: nVidia Corporation GT200GL
Region 0: Memory at 92000000 (32-bit, non-prefetchable) [size=16M]
Region 1: Memory at f8200000000 (64-bit, prefetchable) [size=256M]
Region 3: Memory at 90000000 (64-bit, non-prefetchable) [size=32M]
So this 44bit MMIO address 0xf8200000000 ends up in the rbtree. As DMA
maps get added and deleted from the rbtree we can end up getting a cached
entry to this 0xf8200000000 entry... this is what results in the code
handing out the invalid DMA map of 0xf81fffff000:
[ (0xf8200000000 - 1) >> PAGE_SHIFT << PAGE_SHIFT ]
The IOVA code needs to better honor the "limit_pfn" when allocating
these maps.
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Mike Habeck <habeck@sgi.com>
---
drivers/pci/intel-iommu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- linux.orig/drivers/pci/intel-iommu.c
+++ linux/drivers/pci/intel-iommu.c
@@ -1323,7 +1323,8 @@ static void dmar_init_reserved_ranges(vo
for (i = 0; i < PCI_NUM_RESOURCES; i++) {
r = &pdev->resource[i];
- if (!r->flags || !(r->flags & IORESOURCE_MEM))
+ if (!r->flags || !(r->flags & IORESOURCE_MEM) ||
+ r->start > pdev->dma_mask)
continue;
iova = reserve_iova(&reserved_iova_list,
IOVA_PFN(r->start),
--
* [PATCH 4/4] Intel pci: Use coherent DMA mask when requested
2011-03-29 23:36 [PATCH 0/4] pci: Speed up processing of IOMMU related functions Mike Travis
` (2 preceding siblings ...)
2011-03-29 23:36 ` [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges Mike Travis
@ 2011-03-29 23:36 ` Mike Travis
2011-03-30 18:02 ` Chris Wright
2011-04-07 19:47 ` [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping Mike Travis
` (2 subsequent siblings)
6 siblings, 1 reply; 27+ messages in thread
From: Mike Travis @ 2011-03-29 23:36 UTC (permalink / raw)
To: David Woodhouse, Jesse Barnes; +Cc: Mike Habeck, iommu, linux-pci, linux-kernel
[-- Attachment #1: use-coherent-dma-mask --]
[-- Type: text/plain, Size: 761 bytes --]
The __intel_map_single function is not honoring the passed in
DMA mask. This results in not using the coherent DMA mask when
called from intel_alloc_coherent().
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Mike Habeck <habeck@sgi.com>
---
drivers/pci/intel-iommu.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- linux.orig/drivers/pci/intel-iommu.c
+++ linux/drivers/pci/intel-iommu.c
@@ -2582,8 +2582,7 @@ static dma_addr_t __intel_map_single(str
iommu = domain_get_iommu(domain);
size = aligned_nrpages(paddr, size);
- iova = intel_alloc_iova(hwdev, domain, dma_to_mm_pfn(size),
- pdev->dma_mask);
+ iova = intel_alloc_iova(hwdev, domain, dma_to_mm_pfn(size), dma_mask);
if (!iova)
goto error;
--
* Re: [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping
2011-03-29 23:36 ` [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping Mike Travis
@ 2011-03-30 17:51 ` Chris Wright
2011-03-30 18:30 ` Mike Travis
0 siblings, 1 reply; 27+ messages in thread
From: Chris Wright @ 2011-03-30 17:51 UTC (permalink / raw)
To: Mike Travis
Cc: David Woodhouse, Jesse Barnes, linux-pci, iommu, Mike Habeck,
linux-kernel
* Mike Travis (travis@sgi.com) wrote:
> When the IOMMU is being used, each request for a DMA mapping requires
> the intel_iommu code to look for some space in the DMA mapping table.
> For most drivers this occurs for each transfer.
>
> When there are many outstanding DMA mappings [as seems to be the case
> with the 10GigE driver], the table grows large and the search for
> space becomes increasingly time consuming. Performance for the
> 10GigE driver drops to about 10% of its capacity on a UV system
> when the CPU count is large.
That's pretty poor. I've seen large overheads, but when that big it was
also related to issues in the 10G driver. Do you have profile data
showing this as the hotspot?
> The workaround is to specify the iommu=pt option which sets up a 1:1
> identity map for those devices that support enough DMA address bits to
> cover the physical system memory. This is the "pass through" option.
>
> This can only be accomplished for devices that pass their
> DMA data through the IOMMU (VT-d). But Host Bridge devices connected
> to System Sockets do not pass their data through the VT-d, thus the
> following error occurs:
>
> IOMMU: hardware identity mapping for device 1000:3e:00.0
> Failed to setup IOMMU pass-through
> BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
>
> This patch fixes that problem by removing Host Bridge devices from
> being identity mapped, given that they do not generate DMA ops anyway.
>
> Signed-off-by: Mike Travis <travis@sgi.com>
> Reviewed-by: Mike Habeck <habeck@sgi.com>
> ---
> drivers/pci/intel-iommu.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> --- linux.orig/drivers/pci/intel-iommu.c
> +++ linux/drivers/pci/intel-iommu.c
> @@ -46,6 +46,7 @@
> #define ROOT_SIZE VTD_PAGE_SIZE
> #define CONTEXT_SIZE VTD_PAGE_SIZE
>
> +#define IS_HOSTBRIDGE_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_HOST)
> #define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
> #define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA)
> #define IS_AZALIA(pdev) ((pdev)->vendor == 0x8086 && (pdev)->device == 0x3a3e)
> @@ -2183,7 +2184,7 @@ static int iommu_should_identity_map(str
> * take them out of the 1:1 domain later.
> */
> if (!startup)
> - return pdev->dma_mask > DMA_BIT_MASK(32);
> + return pdev->dma_mask == DMA_BIT_MASK(64);
This looks unrelated, why the change?
> return 1;
> }
> @@ -2198,6 +2199,9 @@ static int __init iommu_prepare_static_i
> return -EFAULT;
>
> for_each_pci_dev(pdev) {
> + /* Skip PCI Host Bridge devices */
> + if (IS_HOSTBRIDGE_DEVICE(pdev))
> + continue;
> if (iommu_should_identity_map(pdev, 1)) {
Should this host bridge check go into iommu_should_identity_map?
I understand skipping the extra host bridges, but what is the NULL ptr deref
coming from? Just to be sure this isn't a bandaid.
thanks,
-chris
* Re: [PATCH 4/4] Intel pci: Use coherent DMA mask when requested
2011-03-29 23:36 ` [PATCH 4/4] Intel pci: Use coherent DMA mask when requested Mike Travis
@ 2011-03-30 18:02 ` Chris Wright
2011-04-01 2:57 ` FUJITA Tomonori
0 siblings, 1 reply; 27+ messages in thread
From: Chris Wright @ 2011-03-30 18:02 UTC (permalink / raw)
To: Mike Travis
Cc: David Woodhouse, Jesse Barnes, linux-pci, iommu, Mike Habeck,
linux-kernel, fujita.tomonori
* Mike Travis (travis@sgi.com) wrote:
> The __intel_map_single function is not honoring the passed in
> DMA mask. This results in not using the coherent DMA mask when
> called from intel_alloc_coherent().
>
> Signed-off-by: Mike Travis <travis@sgi.com>
> Reviewed-by: Mike Habeck <habeck@sgi.com>
Hmm, looks like it has actually been broken since it was introduced
in bb9e6d65.
Acked-by: Chris Wright <chrisw@sous-sol.org>
* Re: [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping
2011-03-30 17:51 ` Chris Wright
@ 2011-03-30 18:30 ` Mike Travis
2011-03-30 19:15 ` Chris Wright
0 siblings, 1 reply; 27+ messages in thread
From: Mike Travis @ 2011-03-30 18:30 UTC (permalink / raw)
To: Chris Wright
Cc: David Woodhouse, Jesse Barnes, linux-pci, iommu, Mike Habeck,
linux-kernel
Chris Wright wrote:
> * Mike Travis (travis@sgi.com) wrote:
>> When the IOMMU is being used, each request for a DMA mapping requires
>> the intel_iommu code to look for some space in the DMA mapping table.
>> For most drivers this occurs for each transfer.
>>
>> When there are many outstanding DMA mappings [as seems to be the case
>> with the 10GigE driver], the table grows large and the search for
>> space becomes increasingly time consuming. Performance for the
>> 10GigE driver drops to about 10% of its capacity on a UV system
>> when the CPU count is large.
>
> That's pretty poor. I've seen large overheads, but when that big it was
> also related to issues in the 10G driver. Do you have profile data
> showing this as the hotspot?
Here's one from our internal bug report:
Here is a profile from a run with iommu=on iommu=pt (no forcedac)
uv48-sys was receiving and uv-debug sending.
ksoftirqd/640 was running at approx. 100% cpu utilization.
I had pinned the nttcp process on uv48-sys to cpu 64.
# Samples: 1255641
#
# Overhead Command Shared Object Symbol
# ........ ............. ............. ......
#
50.27% ksoftirqd/640 [kernel] [k] _spin_lock
27.43% ksoftirqd/640 [kernel] [k] iommu_no_mapping
...
0.48% ksoftirqd/640 [kernel] [k] iommu_should_identity_map
0.45% ksoftirqd/640 [kernel] [k] ixgbe_alloc_rx_buffers [ixgbe]
0.42% ksoftirqd/640 [kernel] [k] ioat2_tx_submit_unlock [ioatdma]
0.29% ksoftirqd/640 [kernel] [k] uv_read_rtc
0.25% ksoftirqd/640 [kernel] [k] __alloc_skb
0.20% ksoftirqd/640 [kernel] [k] try_to_wake_up
0.19% ksoftirqd/640 [kernel] [k] ____cache_alloc_node
0.19% ksoftirqd/640 [kernel] [k] kmem_cache_free
0.19% ksoftirqd/640 [kernel] [k] __netdev_alloc_skb
0.18% ksoftirqd/640 [kernel] [k] tcp_v4_rcv
0.15% ksoftirqd/640 [kernel] [k] resched_task
0.15% ksoftirqd/640 [kernel] [k] tcp_data_queue
0.13% ksoftirqd/640 [kernel] [k] xfrm4_policy_check
0.11% ksoftirqd/640 [kernel] [k] get_page_from_freelist
0.10% ksoftirqd/640 [kernel] [k] sched_clock_cpu
0.10% ksoftirqd/640 [kernel] [k] sock_def_readable
...
I tracked this time down to identity_mapping() in this loop:
list_for_each_entry(info, &si_domain->devices, link)
if (info->dev == pdev)
return 1;
I didn't get the exact count, but there was approx 11,000 PCI devices
on this system. And this function was called for every page request
in each DMA request.
Here's an excerpt from our internal bug report:
I also looked at the cpu utilization on uv. It's at 22% for the nttcp process
and ksoftirqd is not at the top, so I think this means the fix is working.
Another run
uv-debug:~/eddiem/nttcp-1.52 # ./nttcp -T -l 1048576 -P 60 192.168.1.2
Running for 60 seconds...
Bytes Real s CPU s Real-MBit/s CPU-MBit/s Calls Real-C/s CPU-C/s
l51671728128 60.00 13.52 6889.4548 30582.1259 49278 821.29 3645.7
151671728128 60.00 12.53 6889.4660 32983.4024 123666 2061.07 9867.4
Trying it from the other side shows nttcp on uv at 44% cpu.
uv41-sys:~/eddiem/nttcp-1.52 # ./nttcp -T -l 1048576 -P 60 192.168.1.1
Running for 60 seconds...
Bytes Real s CPU s Real-MBit/s CPU-MBit/s Calls Real-C/s CPU-C/s
l51292143616 60.00 26.40 6838.9326 15544.4581 48917 815.28 1853.1
151292456796 60.00 7.35 6839.0407 55809.8528 93530 1558.84 12720.9
Note that our networking experts also tuned the 10GigE parameters, which
helped bring the speed back up to almost line speed. (The 10GigE was
by far the most affected driver, but even the 1GigE driver lost performance.)
There were also changes for the irq_rebalancer and disabling sched domains
2 and 3 (which were being hit by the idle_rebalancer). I remember sched
domain 3 had all 4096 cpus but I forgot what sd 2 had.
Also, running the network test on the same node as where the cards were
helped as well.
If you really need them, I can sign up for some system time and get better
before/after profile data specifically for these IOMMU changes.
Thanks,
Mike
>
>> The workaround is to specify the iommu=pt option which sets up a 1:1
>> identity map for those devices that support enough DMA address bits to
>> cover the physical system memory. This is the "pass through" option.
>>
>> This can only be accomplished for devices that pass their
>> DMA data through the IOMMU (VT-d). But Host Bridge devices connected
>> to System Sockets do not pass their data through the VT-d, thus the
>> following error occurs:
>>
>> IOMMU: hardware identity mapping for device 1000:3e:00.0
>> Failed to setup IOMMU pass-through
>> BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
>>
>> This patch fixes that problem by removing Host Bridge devices from
>> being identity mapped, given that they do not generate DMA ops anyway.
>>
>> Signed-off-by: Mike Travis <travis@sgi.com>
>> Reviewed-by: Mike Habeck <habeck@sgi.com>
>> ---
>> drivers/pci/intel-iommu.c | 6 +++++-
>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> --- linux.orig/drivers/pci/intel-iommu.c
>> +++ linux/drivers/pci/intel-iommu.c
>> @@ -46,6 +46,7 @@
>> #define ROOT_SIZE VTD_PAGE_SIZE
>> #define CONTEXT_SIZE VTD_PAGE_SIZE
>>
>> +#define IS_HOSTBRIDGE_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_HOST)
>> #define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
>> #define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA)
>> #define IS_AZALIA(pdev) ((pdev)->vendor == 0x8086 && (pdev)->device == 0x3a3e)
>> @@ -2183,7 +2184,7 @@ static int iommu_should_identity_map(str
>> * take them out of the 1:1 domain later.
>> */
>> if (!startup)
>> - return pdev->dma_mask > DMA_BIT_MASK(32);
>> + return pdev->dma_mask == DMA_BIT_MASK(64);
>
> This looks unrelated, why the change?
>
>> return 1;
>> }
>> @@ -2198,6 +2199,9 @@ static int __init iommu_prepare_static_i
>> return -EFAULT;
>>
>> for_each_pci_dev(pdev) {
>> + /* Skip PCI Host Bridge devices */
>> + if (IS_HOSTBRIDGE_DEVICE(pdev))
>> + continue;
>> if (iommu_should_identity_map(pdev, 1)) {
>
> Should this host bridge check go into iommu_should_identity_map?
>
> I understand skipping the extra host bridges, but what is the NULL ptr deref
> coming from? Just to be sure this isn't a bandaid.
>
> thanks,
> -chris
>
* Re: [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping
2011-03-30 18:30 ` Mike Travis
@ 2011-03-30 19:15 ` Chris Wright
2011-03-30 19:25 ` Mike Travis
0 siblings, 1 reply; 27+ messages in thread
From: Chris Wright @ 2011-03-30 19:15 UTC (permalink / raw)
To: Mike Travis
Cc: Chris Wright, David Woodhouse, Jesse Barnes, linux-pci, iommu,
Mike Habeck, linux-kernel
* Mike Travis (travis@sgi.com) wrote:
> Chris Wright wrote:
> >* Mike Travis (travis@sgi.com) wrote:
> >> When the IOMMU is being used, each request for a DMA mapping requires
> >> the intel_iommu code to look for some space in the DMA mapping table.
> >> For most drivers this occurs for each transfer.
> >>
> >> When there are many outstanding DMA mappings [as seems to be the case
> >> with the 10GigE driver], the table grows large and the search for
> >> space becomes increasingly time consuming. Performance for the
>>>> 10GigE driver drops to about 10% of its capacity on a UV system
> >> when the CPU count is large.
> >
> >That's pretty poor. I've seen large overheads, but when that big it was
> >also related to issues in the 10G driver. Do you have profile data
> >showing this as the hotspot?
>
> Here's one from our internal bug report:
>
> Here is a profile from a run with iommu=on iommu=pt (no forcedac)
OK, I was actually interested in the !pt case. But this is useful
still. The iova lookup being distinct from the identity_mapping() case.
> uv48-sys was receiving and uv-debug sending.
> ksoftirqd/640 was running at approx. 100% cpu utilization.
> I had pinned the nttcp process on uv48-sys to cpu 64.
>
> # Samples: 1255641
> #
> # Overhead Command Shared Object Symbol
> # ........ ............. ............. ......
> #
> 50.27% ksoftirqd/640 [kernel] [k] _spin_lock
> 27.43% ksoftirqd/640 [kernel] [k] iommu_no_mapping
> ...
> 0.48% ksoftirqd/640 [kernel] [k] iommu_should_identity_map
> 0.45% ksoftirqd/640 [kernel] [k] ixgbe_alloc_rx_buffers [ixgbe]
Note, ixgbe has had rx dma mapping issues (that's why I wondered what
was causing the massive slowdown under !pt mode).
<snip>
> I tracked this time down to identity_mapping() in this loop:
>
> list_for_each_entry(info, &si_domain->devices, link)
> if (info->dev == pdev)
> return 1;
>
> I didn't get the exact count, but there was approx 11,000 PCI devices
> on this system. And this function was called for every page request
> in each DMA request.
Right, so this is the list traversal (and wow, a lot of PCI devices).
Did you try a smarter data structure? (While there's room for another
bit in pci_dev, the bit is more about iommu implementation details than
anything at the pci level).
Or the domain_dev_info is cached in the archdata of device struct.
You should be able to just reference that directly.
Didn't think it through completely, but perhaps something as simple as:
return pdev->dev.archdata.iommu == si_domain;
thanks,
-chris
* Re: [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function
2011-03-29 23:36 ` [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function Mike Travis
@ 2011-03-30 19:19 ` Chris Wright
2011-03-30 19:29 ` Mike Travis
2011-03-31 0:33 ` [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function v2 Mike Travis
0 siblings, 2 replies; 27+ messages in thread
From: Chris Wright @ 2011-03-30 19:19 UTC (permalink / raw)
To: Mike Travis
Cc: David Woodhouse, Jesse Barnes, linux-pci, iommu, Mike Habeck,
linux-kernel
* Mike Travis (travis@sgi.com) wrote:
> @@ -2138,6 +2134,7 @@ static int domain_add_dev_info(struct dm
> list_add(&info->global, &device_domain_list);
> pdev->dev.archdata.iommu = info;
The si_domain is cached right here...can you not use this instead of
adding an iommu impl. specific bit to pci_dev?
> spin_unlock_irqrestore(&device_domain_lock, flags);
> + pdev->is_identity_mapped = (domain == si_domain);
* Re: [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping
2011-03-30 19:15 ` Chris Wright
@ 2011-03-30 19:25 ` Mike Travis
2011-03-30 19:57 ` Chris Wright
0 siblings, 1 reply; 27+ messages in thread
From: Mike Travis @ 2011-03-30 19:25 UTC (permalink / raw)
To: Chris Wright
Cc: David Woodhouse, Jesse Barnes, linux-pci, iommu, Mike Habeck,
linux-kernel
Chris Wright wrote:
> * Mike Travis (travis@sgi.com) wrote:
>> Chris Wright wrote:
>>> * Mike Travis (travis@sgi.com) wrote:
>>>> When the IOMMU is being used, each request for a DMA mapping requires
>>>> the intel_iommu code to look for some space in the DMA mapping table.
>>>> For most drivers this occurs for each transfer.
>>>>
>>>> When there are many outstanding DMA mappings [as seems to be the case
>>>> with the 10GigE driver], the table grows large and the search for
>>>> space becomes increasingly time consuming. Performance for the
>>>> 10GigE driver drops to about 10% of its capacity on a UV system
>>>> when the CPU count is large.
>>> That's pretty poor. I've seen large overheads, but when that big it was
>>> also related to issues in the 10G driver. Do you have profile data
>>> showing this as the hotspot?
>> Here's one from our internal bug report:
>>
>> Here is a profile from a run with iommu=on iommu=pt (no forcedac)
>
> OK, I was actually interested in the !pt case. But this is useful
> still. The iova lookup being distinct from the identity_mapping() case.
I can get that as well, but having every device using maps caused its
own set of problems (hundreds of dma maps). Here's a list of devices
on the system under test. You can see that even 'minor' glitches can
get magnified when there are so many...
Blade Location NASID PCI Address X Display Device
----------------------------------------------------------------------
0 r001i01b00 0 0000:01:00.0 - Intel 82576 Gigabit Network Connection
. . . 0000:01:00.1 - Intel 82576 Gigabit Network Connection
. . . 0000:04:00.0 - LSI SAS1064ET Fusion-MPT SAS
. . . 0000:05:00.0 - Matrox MGA G200e
2 r001i01b02 4 0001:02:00.0 - Mellanox MT26428 InfiniBand
3 r001i01b03 6 0002:02:00.0 - Mellanox MT26428 InfiniBand
4 r001i01b04 8 0003:02:00.0 - Mellanox MT26428 InfiniBand
11 r001i01b11 22 0007:02:00.0 - Mellanox MT26428 InfiniBand
13 r001i01b13 26 0008:02:00.0 - Mellanox MT26428 InfiniBand
15 r001i01b15 30 0009:07:00.0 :0.0 nVidia GF100 [Tesla S2050]
. . . 0009:08:00.0 :1.1 nVidia GF100 [Tesla S2050]
18 r001i23b02 36 000b:02:00.0 - Mellanox MT26428 InfiniBand
20 r001i23b04 40 000c:01:00.0 - Intel 82599EB 10-Gigabit Network Connection
. . . 000c:01:00.1 - Intel 82599EB 10-Gigabit Network Connection
. . . 000c:04:00.0 - Mellanox MT26428 InfiniBand
23 r001i23b07 46 000d:07:00.0 - nVidia GF100 [Tesla S2050]
. . . 000d:08:00.0 - nVidia GF100 [Tesla S2050]
25 r001i23b09 50 000e:01:00.0 - Intel 82599EB 10-Gigabit Network Connection
. . . 000e:01:00.1 - Intel 82599EB 10-Gigabit Network Connection
. . . 000e:04:00.0 - Mellanox MT26428 InfiniBand
26 r001i23b10 52 000f:02:00.0 - Mellanox MT26428 InfiniBand
27 r001i23b11 54 0010:02:00.0 - Mellanox MT26428 InfiniBand
29 r001i23b13 58 0011:02:00.0 - Mellanox MT26428 InfiniBand
31 r001i23b15 62 0012:02:00.0 - Mellanox MT26428 InfiniBand
34 r002i01b02 68 0013:01:00.0 - Mellanox MT26428 InfiniBand
35 r002i01b03 70 0014:02:00.0 - Mellanox MT26428 InfiniBand
36 r002i01b04 72 0015:01:00.0 - Mellanox MT26428 InfiniBand
41 r002i01b09 82 0018:07:00.0 - nVidia GF100 [Tesla S2050]
. . . 0018:08:00.0 - nVidia GF100 [Tesla S2050]
43 r002i01b11 86 0019:01:00.0 - Mellanox MT26428 InfiniBand
45 r002i01b13 90 001a:01:00.0 - Mellanox MT26428 InfiniBand
48 r002i23b00 96 001c:07:00.0 - nVidia GF100 [Tesla S2050]
. . . 001c:08:00.0 - nVidia GF100 [Tesla S2050]
50 r002i23b02 100 001d:02:00.0 - Mellanox MT26428 InfiniBand
52 r002i23b04 104 001e:01:00.0 - Intel 82599EB 10-Gigabit Network Connection
. . . 001e:01:00.1 - Intel 82599EB 10-Gigabit Network Connection
. . . 001e:04:00.0 - Mellanox MT26428 InfiniBand
57 r002i23b09 114 0020:01:00.0 - Intel 82599EB 10-Gigabit Network Connection
. . . 0020:01:00.1 - Intel 82599EB 10-Gigabit Network Connection
. . . 0020:04:00.0 - Mellanox MT26428 InfiniBand
58 r002i23b10 116 0021:02:00.0 - Mellanox MT26428 InfiniBand
59 r002i23b11 118 0022:02:00.0 - Mellanox MT26428 InfiniBand
61 r002i23b13 122 0023:02:00.0 - Mellanox MT26428 InfiniBand
63 r002i23b15 126 0024:02:00.0 - Mellanox MT26428 InfiniBand
>
>> uv48-sys was receiving and uv-debug sending.
>> ksoftirqd/640 was running at approx. 100% cpu utilization.
>> I had pinned the nttcp process on uv48-sys to cpu 64.
>>
>> # Samples: 1255641
>> #
>> # Overhead Command Shared Object Symbol
>> # ........ ............. ............. ......
>> #
>> 50.27% ksoftirqd/640 [kernel] [k] _spin_lock
>> 27.43% ksoftirqd/640 [kernel] [k] iommu_no_mapping
>
>> ...
>> 0.48% ksoftirqd/640 [kernel] [k] iommu_should_identity_map
>> 0.45% ksoftirqd/640 [kernel] [k] ixgbe_alloc_rx_buffers [ixgbe]
>
> Note, ixgbe has had rx dma mapping issues (that's why I wondered what
> was causing the massive slowdown under !pt mode).
I think since this profile run, the network guys updated the ixgbe
driver with a later version. (I don't know the outcome of that test.)
>
> <snip>
>> I tracked this time down to identity_mapping() in this loop:
>>
>> list_for_each_entry(info, &si_domain->devices, link)
>> if (info->dev == pdev)
>> return 1;
>>
>> I didn't get the exact count, but there was approx 11,000 PCI devices
>> on this system. And this function was called for every page request
>> in each DMA request.
>
> Right, so this is the list traversal (and wow, a lot of PCI devices).
Most of the PCI devices were the 45 on each of 256 Nehalem sockets.
Also, there's a ton of bridges as well.
> Did you try a smarter data structure? (While there's room for another
> bit in pci_dev, the bit is more about iommu implementation details than
> anything at the pci level).
>
> Or the domain_dev_info is cached in the archdata of device struct.
> You should be able to just reference that directly.
>
> Didn't think it through completely, but perhaps something as simple as:
>
> return pdev->dev.archdata.iommu == si_domain;
I can try this, thanks!
>
> thanks,
> -chris
* Re: [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function
2011-03-30 19:19 ` Chris Wright
@ 2011-03-30 19:29 ` Mike Travis
2011-03-31 0:33 ` [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function v2 Mike Travis
1 sibling, 0 replies; 27+ messages in thread
From: Mike Travis @ 2011-03-30 19:29 UTC (permalink / raw)
To: Chris Wright
Cc: David Woodhouse, Jesse Barnes, linux-pci, iommu, Mike Habeck,
linux-kernel
Chris Wright wrote:
> * Mike Travis (travis@sgi.com) wrote:
>> @@ -2138,6 +2134,7 @@ static int domain_add_dev_info(struct dm
>> list_add(&info->global, &device_domain_list);
>> pdev->dev.archdata.iommu = info;
>
> The si_domain is cached right here...can you not use this instead of
> adding an iommu impl. specific bit to pci_dev?
>
>> spin_unlock_irqrestore(&device_domain_lock, flags);
>> + pdev->is_identity_mapped = (domain == si_domain);
Yes, it looks like I came at the solution backwards... I added
the bit then figured out how to set it. I'll update that patch.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping
2011-03-30 19:25 ` Mike Travis
@ 2011-03-30 19:57 ` Chris Wright
0 siblings, 0 replies; 27+ messages in thread
From: Chris Wright @ 2011-03-30 19:57 UTC (permalink / raw)
To: Mike Travis
Cc: Chris Wright, David Woodhouse, Jesse Barnes, linux-pci, iommu,
Mike Habeck, linux-kernel
* Mike Travis (travis@sgi.com) wrote:
> Chris Wright wrote:
> >OK, I was actually interested in the !pt case. But this is useful
> >still. The iova lookup being distinct from the identity_mapping() case.
>
> I can get that as well, but having every device using maps caused its
> own set of problems (hundreds of dma maps). Here's a list of devices
> on the system under test. You can see that even 'minor' glitches can
> get magnified when there are so many...
Yeah, I was focused on the overhead of actually mapping/unmapping an
address in the non-pt case.
> Blade Location NASID PCI Address X Display Device
> ----------------------------------------------------------------------
> 0 r001i01b00 0 0000:01:00.0 - Intel 82576 Gigabit Network Connection
> . . . 0000:01:00.1 - Intel 82576 Gigabit Network Connection
> . . . 0000:04:00.0 - LSI SAS1064ET Fusion-MPT SAS
> . . . 0000:05:00.0 - Matrox MGA G200e
> 2 r001i01b02 4 0001:02:00.0 - Mellanox MT26428 InfiniBand
> 3 r001i01b03 6 0002:02:00.0 - Mellanox MT26428 InfiniBand
> 4 r001i01b04 8 0003:02:00.0 - Mellanox MT26428 InfiniBand
> 11 r001i01b11 22 0007:02:00.0 - Mellanox MT26428 InfiniBand
> 13 r001i01b13 26 0008:02:00.0 - Mellanox MT26428 InfiniBand
> 15 r001i01b15 30 0009:07:00.0 :0.0 nVidia GF100 [Tesla S2050]
> . . . 0009:08:00.0 :1.1 nVidia GF100 [Tesla S2050]
> 18 r001i23b02 36 000b:02:00.0 - Mellanox MT26428 InfiniBand
> 20 r001i23b04 40 000c:01:00.0 - Intel 82599EB 10-Gigabit Network Connection
> . . . 000c:01:00.1 - Intel 82599EB 10-Gigabit Network Connection
> . . . 000c:04:00.0 - Mellanox MT26428 InfiniBand
> 23 r001i23b07 46 000d:07:00.0 - nVidia GF100 [Tesla S2050]
> . . . 000d:08:00.0 - nVidia GF100 [Tesla S2050]
> 25 r001i23b09 50 000e:01:00.0 - Intel 82599EB 10-Gigabit Network Connection
> . . . 000e:01:00.1 - Intel 82599EB 10-Gigabit Network Connection
> . . . 000e:04:00.0 - Mellanox MT26428 InfiniBand
> 26 r001i23b10 52 000f:02:00.0 - Mellanox MT26428 InfiniBand
> 27 r001i23b11 54 0010:02:00.0 - Mellanox MT26428 InfiniBand
> 29 r001i23b13 58 0011:02:00.0 - Mellanox MT26428 InfiniBand
> 31 r001i23b15 62 0012:02:00.0 - Mellanox MT26428 InfiniBand
> 34 r002i01b02 68 0013:01:00.0 - Mellanox MT26428 InfiniBand
> 35 r002i01b03 70 0014:02:00.0 - Mellanox MT26428 InfiniBand
> 36 r002i01b04 72 0015:01:00.0 - Mellanox MT26428 InfiniBand
> 41 r002i01b09 82 0018:07:00.0 - nVidia GF100 [Tesla S2050]
> . . . 0018:08:00.0 - nVidia GF100 [Tesla S2050]
> 43 r002i01b11 86 0019:01:00.0 - Mellanox MT26428 InfiniBand
> 45 r002i01b13 90 001a:01:00.0 - Mellanox MT26428 InfiniBand
> 48 r002i23b00 96 001c:07:00.0 - nVidia GF100 [Tesla S2050]
> . . . 001c:08:00.0 - nVidia GF100 [Tesla S2050]
> 50 r002i23b02 100 001d:02:00.0 - Mellanox MT26428 InfiniBand
> 52 r002i23b04 104 001e:01:00.0 - Intel 82599EB 10-Gigabit Network Connection
> . . . 001e:01:00.1 - Intel 82599EB 10-Gigabit Network Connection
> . . . 001e:04:00.0 - Mellanox MT26428 InfiniBand
> 57 r002i23b09 114 0020:01:00.0 - Intel 82599EB 10-Gigabit Network Connection
> . . . 0020:01:00.1 - Intel 82599EB 10-Gigabit Network Connection
> . . . 0020:04:00.0 - Mellanox MT26428 InfiniBand
> 58 r002i23b10 116 0021:02:00.0 - Mellanox MT26428 InfiniBand
> 59 r002i23b11 118 0022:02:00.0 - Mellanox MT26428 InfiniBand
> 61 r002i23b13 122 0023:02:00.0 - Mellanox MT26428 InfiniBand
> 63 r002i23b15 126 0024:02:00.0 - Mellanox MT26428 InfiniBand
>
> >
> >>uv48-sys was receiving and uv-debug sending.
> >>ksoftirqd/640 was running at approx. 100% cpu utilization.
> >>I had pinned the nttcp process on uv48-sys to cpu 64.
> >>
> >># Samples: 1255641
> >>#
> >># Overhead Command Shared Object Symbol
> >># ........ ............. ............. ......
> >>#
> >> 50.27% ksoftirqd/640 [kernel] [k] _spin_lock
> >> 27.43% ksoftirqd/640 [kernel] [k] iommu_no_mapping
> >
> >>...
> >> 0.48% ksoftirqd/640 [kernel] [k] iommu_should_identity_map
> >> 0.45% ksoftirqd/640 [kernel] [k] ixgbe_alloc_rx_buffers [ixgbe]
> >
> >Note, ixgbe has had rx dma mapping issues (that's why I wondered what
> >was causing the massive slowdown under !pt mode).
>
> I think since this profile run, the network guys updated the ixgbe
> driver with a later version. (I don't know the outcome of that test.)
OK. The ixgbe fix I was thinking of has been in since 2.6.34: 43634e82
(ixgbe: Fix DMA mapping/unmapping issues when HWRSC is enabled on IOMMU
enabled kernels).
> ><snip>
> >>I tracked this time down to identity_mapping() in this loop:
> >>
> >> list_for_each_entry(info, &si_domain->devices, link)
> >> if (info->dev == pdev)
> >> return 1;
> >>
> >>I didn't get the exact count, but there were approx. 11,000 PCI devices
> >>on this system. And this function was called for every page request
> >>in each DMA request.
> >
> >Right, so this is the list traversal (and wow, a lot of PCI devices).
>
> Most of the PCI devices were the 45 on each of 256 Nehalem sockets.
> Also, there's a ton of bridges as well.
>
> >Did you try a smarter data structure? (While there's room for another
> >bit in pci_dev, the bit is more about iommu implementation details than
> >anything at the pci level).
> >
> >Or the domain_dev_info is cached in the archdata of device struct.
> >You should be able to just reference that directly.
> >
> >Didn't think it through completely, but perhaps something as simple as:
> >
> > return pdev->dev.archdata.iommu == si_domain;
>
> I can try this, thanks!
Err, I guess that'd be info = archdata.iommu; info->domain == si_domain
(and probably need some sanity checking against things like
DUMMY_DEVICE_DOMAIN_INFO). But you get the idea.
thanks,
-chris
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function v2
2011-03-30 19:19 ` Chris Wright
2011-03-30 19:29 ` Mike Travis
@ 2011-03-31 0:33 ` Mike Travis
1 sibling, 0 replies; 27+ messages in thread
From: Mike Travis @ 2011-03-31 0:33 UTC (permalink / raw)
To: Chris Wright
Cc: David Woodhouse, Jesse Barnes, linux-pci, iommu, Mike Habeck,
linux-kernel
Subject: Intel iommu: Speed up processing of the identity_mapping function
When there are a large count of PCI devices, and the pass
through option for iommu is set, much time is spent in the
identity_mapping function hunting though the iommu domains to
check if a specific device is "identity mapped".
Speed up the function by checking the cached info to see if
it's mapped to the static identity domain.
Signed-off-by: Mike Travis <travis@sgi.com>
---
drivers/pci/intel-iommu.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- linux-2.6.32.orig/drivers/pci/intel-iommu.c
+++ linux-2.6.32/drivers/pci/intel-iommu.c
@@ -2124,10 +2124,10 @@ static int identity_mapping(struct pci_d
if (likely(!iommu_identity_mapping))
return 0;
+ info = pdev->dev.archdata.iommu;
+ if (info && info != DUMMY_DEVICE_DOMAIN_INFO)
+ return (info->domain == si_domain);
- list_for_each_entry(info, &si_domain->devices, link)
- if (info->dev == pdev)
- return 1;
return 0;
}
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges
2011-03-29 23:36 ` [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges Mike Travis
@ 2011-03-31 22:11 ` Mike Travis
2011-03-31 22:53 ` Chris Wright
0 siblings, 1 reply; 27+ messages in thread
From: Mike Travis @ 2011-03-31 22:11 UTC (permalink / raw)
To: David Woodhouse, Jesse Barnes, Chris Wright
Cc: Mike Habeck, iommu, linux-pci, linux-kernel
Chris - did you have any comment on this patch?
David/Jesse - should I resubmit the entire series for consideration? We
have a number of customers that have encountered this problem.
Thanks,
Mike
Mike Travis wrote:
> dmar_init_reserved_ranges() reserves the card's MMIO ranges to
> prevent handing out a DMA map that would overlap with the MMIO range.
> The problem is that while the Nvidia GPU has 64bit BARs and is capable
> of receiving > 40bit PIOs, it can't generate > 40bit DMAs.
>
> So when the iommu code reserves these MMIO ranges a > 40bit
> entry ends up getting in the rbtree. On a UV test system with
> the Nvidia cards, the BARs are:
>
> 0001:36:00.0 VGA compatible controller: nVidia Corporation GT200GL
> Region 0: Memory at 92000000 (32-bit, non-prefetchable) [size=16M]
> Region 1: Memory at f8200000000 (64-bit, prefetchable) [size=256M]
> Region 3: Memory at 90000000 (64-bit, non-prefetchable) [size=32M]
>
> So this 44bit MMIO address 0xf8200000000 ends up in the rbtree. As DMA
> maps get added and deleted from the rbtree we can end up getting a cached
> entry to this 0xf8200000000 entry... this is what results in the code
> handing out the invalid DMA map of 0xf81fffff000:
>
> [ 0xf8200000000-1 >> PAGE_SIZE << PAGE_SIZE ]
>
> The IOVA code needs to better honor the "limit_pfn" when allocating
> these maps.
>
> Signed-off-by: Mike Travis <travis@sgi.com>
> Reviewed-by: Mike Habeck <habeck@sgi.com>
> ---
> drivers/pci/intel-iommu.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> --- linux.orig/drivers/pci/intel-iommu.c
> +++ linux/drivers/pci/intel-iommu.c
> @@ -1323,7 +1323,8 @@ static void dmar_init_reserved_ranges(vo
>
> for (i = 0; i < PCI_NUM_RESOURCES; i++) {
> r = &pdev->resource[i];
> - if (!r->flags || !(r->flags & IORESOURCE_MEM))
> + if (!r->flags || !(r->flags & IORESOURCE_MEM) ||
> + r->start > pdev->dma_mask)
> continue;
> iova = reserve_iova(&reserved_iova_list,
> IOVA_PFN(r->start),
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges
2011-03-31 22:11 ` Mike Travis
@ 2011-03-31 22:53 ` Chris Wright
2011-03-31 23:25 ` Mike Travis
2011-03-31 23:39 ` [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges Chris Wright
0 siblings, 2 replies; 27+ messages in thread
From: Chris Wright @ 2011-03-31 22:53 UTC (permalink / raw)
To: Mike Travis
Cc: David Woodhouse, Jesse Barnes, Chris Wright, Mike Habeck, iommu,
linux-pci, linux-kernel
* Mike Travis (travis@sgi.com) wrote:
> Chris - did you have any comment on this patch?
It doesn't actually look right to me. It means that particular range
is no longer reserved. But perhaps I've misunderstood something.
> Mike Travis wrote:
> > dmar_init_reserved_ranges() reserves the card's MMIO ranges to
> > prevent handing out a DMA map that would overlap with the MMIO range.
> > The problem is that while the Nvidia GPU has 64bit BARs and is capable
> > of receiving > 40bit PIOs, it can't generate > 40bit DMAs.
I don't understand what you mean here.
> > So when the iommu code reserves these MMIO ranges a > 40bit
> > entry ends up getting in the rbtree. On a UV test system with
> > the Nvidia cards, the BARs are:
> >
> > 0001:36:00.0 VGA compatible controller: nVidia Corporation
> >GT200GL Region 0: Memory at 92000000 (32-bit, non-prefetchable)
> >[size=16M]
> > Region 1: Memory at f8200000000 (64-bit, prefetchable) [size=256M]
> > Region 3: Memory at 90000000 (64-bit, non-prefetchable) [size=32M]
> >
> > So this 44bit MMIO address 0xf8200000000 ends up in the rbtree. As DMA
> > maps get added and deleted from the rbtree we can end up getting a cached
> > entry to this 0xf8200000000 entry... this is what results in the code
> > handing out the invalid DMA map of 0xf81fffff000:
> >
> > [ 0xf8200000000-1 >> PAGE_SIZE << PAGE_SIZE ]
> >
> > The IOVA code needs to better honor the "limit_pfn" when allocating
> > these maps.
This means we could get the MMIO address range (it's no longer reserved).
It seems to me the DMA transaction would then become a peer to peer
transaction if ACS is not enabled, which could show up as a random register
write in that GPU's 256M BAR (i.e. broken).
The iova allocation should not hand out an address bigger than the
dma_mask. What is the device's dma_mask?
thanks,
-chris
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges
2011-03-31 22:53 ` Chris Wright
@ 2011-03-31 23:25 ` Mike Travis
2011-03-31 23:40 ` Mike Habeck
2011-03-31 23:39 ` [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges Chris Wright
1 sibling, 1 reply; 27+ messages in thread
From: Mike Travis @ 2011-03-31 23:25 UTC (permalink / raw)
To: Chris Wright
Cc: David Woodhouse, Jesse Barnes, Mike Habeck, iommu, linux-pci,
linux-kernel
I'll probably need help from our Hardware PCI Engineer to help explain
this further, though here's a pointer to an earlier email thread:
http://marc.info/?l=linux-kernel&m=129259816925973&w=2
I'll also dig out the specs you're asking for.
Thanks,
Mike
Chris Wright wrote:
> * Mike Travis (travis@sgi.com) wrote:
>> Chris - did you have any comment on this patch?
>
> It doesn't actually look right to me. It means that particular range
> is no longer reserved. But perhaps I've misunderstood something.
>
>> Mike Travis wrote:
>>> dmar_init_reserved_ranges() reserves the card's MMIO ranges to
>>> prevent handing out a DMA map that would overlap with the MMIO range.
>>> The problem is that while the Nvidia GPU has 64bit BARs and is capable
>>> of receiving > 40bit PIOs, it can't generate > 40bit DMAs.
>
> I don't understand what you mean here.
>
>>> So when the iommu code reserves these MMIO ranges a > 40bit
>>> entry ends up getting in the rbtree. On a UV test system with
>>> the Nvidia cards, the BARs are:
>>>
>>> 0001:36:00.0 VGA compatible controller: nVidia Corporation
>>> GT200GL Region 0: Memory at 92000000 (32-bit, non-prefetchable)
>>> [size=16M]
>>> Region 1: Memory at f8200000000 (64-bit, prefetchable) [size=256M]
>>> Region 3: Memory at 90000000 (64-bit, non-prefetchable) [size=32M]
>>>
>>> So this 44bit MMIO address 0xf8200000000 ends up in the rbtree. As DMA
>>> maps get added and deleted from the rbtree we can end up getting a cached
>>> entry to this 0xf8200000000 entry... this is what results in the code
>>> handing out the invalid DMA map of 0xf81fffff000:
>>>
>>> [ 0xf8200000000-1 >> PAGE_SIZE << PAGE_SIZE ]
>>>
>>> The IOVA code needs to better honor the "limit_pfn" when allocating
>>> these maps.
>
> This means we could get the MMIO address range (it's no longer reserved).
> It seems to me the DMA transaction would then become a peer to peer
> transaction if ACS is not enabled, which could show up as a random register
> write in that GPU's 256M BAR (i.e. broken).
>
> The iova allocation should not hand out an address bigger than the
> dma_mask. What is the device's dma_mask?
>
> thanks,
> -chris
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges
2011-03-31 22:53 ` Chris Wright
2011-03-31 23:25 ` Mike Travis
@ 2011-03-31 23:39 ` Chris Wright
1 sibling, 0 replies; 27+ messages in thread
From: Chris Wright @ 2011-03-31 23:39 UTC (permalink / raw)
To: Chris Wright
Cc: Mike Travis, linux-pci, linux-kernel, Jesse Barnes, iommu,
Mike Habeck, David Woodhouse
* Chris Wright (chrisw@sous-sol.org) wrote:
> > Mike Travis wrote:
> > > Region 1: Memory at f8200000000 (64-bit, prefetchable) [size=256M]
> > > Region 3: Memory at 90000000 (64-bit, non-prefetchable) [size=32M]
> > >
> > > So this 44bit MMIO address 0xf8200000000 ends up in the rbtree. As DMA
> > > maps get added and deleted from the rbtree we can end up getting a cached
> > > entry to this 0xf8200000000 entry... this is what results in the code
> > > handing out the invalid DMA map of 0xf81fffff000:
> > >
> > > [ 0xf8200000000-1 >> PAGE_SIZE << PAGE_SIZE ]
> > >
> > > The IOVA code needs to better honor the "limit_pfn" when allocating
> > > these maps.
>
> This means we could get the MMIO address range (it's no longer reserved).
> It seems to me the DMA transaction would then become a peer to peer
> transaction if ACS is not enabled, which could show up as random register
> write in that GPUs 256M BAR (i.e. broken).
>
> The iova allocation should not hand out an address bigger than the
> dma_mask. What is the device's dma_mask?
Ah, looks like this is a bad interaction with the way the cached entry
is handled. I think the iova lookup should skip down to the limit_pfn
rather than assume that rb_last's pfn_lo/hi is ok just because it's in
the tree. Because you'll never hit the limit_pfn == 32bit_pfn case, it
just goes straight to rb_last in __get_cached_rbnode.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges
2011-03-31 23:25 ` Mike Travis
@ 2011-03-31 23:40 ` Mike Habeck
2011-03-31 23:56 ` Chris Wright
0 siblings, 1 reply; 27+ messages in thread
From: Mike Habeck @ 2011-03-31 23:40 UTC (permalink / raw)
To: Mike Travis
Cc: Chris Wright, David Woodhouse, Jesse Barnes, iommu, linux-pci,
linux-kernel
On 03/31/2011 06:25 PM, Mike Travis wrote:
> I'll probably need help from our Hardware PCI Engineer to help explain
> this further, though here's a pointer to an earlier email thread:
>
> http://marc.info/?l=linux-kernel&m=129259816925973&w=2
>
> I'll also dig out the specs you're asking for.
>
> Thanks,
> Mike
>
> Chris Wright wrote:
>> * Mike Travis (travis@sgi.com) wrote:
>>> Chris - did you have any comment on this patch?
>>
>> It doesn't actually look right to me. It means that particular range
>> is no longer reserved. But perhaps I've misunderstood something.
>>
>>> Mike Travis wrote:
>>>> dmar_init_reserved_ranges() reserves the card's MMIO ranges to
>>>> prevent handing out a DMA map that would overlap with the MMIO range.
>>>> The problem is that while the Nvidia GPU has 64bit BARs and is capable
>>>> of receiving > 40bit PIOs, it can't generate > 40bit DMAs.
>>
>> I don't understand what you mean here.
What Mike is getting at is that there is no reason to reserve the MMIO
range if it's greater than the dma_mask, given the MMIO range is
outside of what the IOVA code will ever hand back to the IOMMU
code. In this case the nVidia card has a 64bit BAR and is assigned
the MMIO range [0xf8200000000 - 0xf820fffffff]. But the Nvidia
card can only generate a 40bit DMA (thus has a 40bit dma_mask). If
the IOVA code honors the limit_pfn (i.e., dma_mask) passed in, it
will never hand back a >40bit address to the IOMMU code. Thus
there is no reason to reserve the card's MMIO range if it is greater
than the dma_mask. (And that is what the patch is doing.)
More below...
>>
>>>> So when the iommu code reserves these MMIO ranges a > 40bit
>>>> entry ends up getting in the rbtree. On a UV test system with
>>>> the Nvidia cards, the BARs are:
>>>>
>>>> 0001:36:00.0 VGA compatible controller: nVidia Corporation
>>>> GT200GL Region 0: Memory at 92000000 (32-bit, non-prefetchable)
>>>> [size=16M]
>>>> Region 1: Memory at f8200000000 (64-bit, prefetchable) [size=256M]
>>>> Region 3: Memory at 90000000 (64-bit, non-prefetchable) [size=32M]
>>>>
>>>> So this 44bit MMIO address 0xf8200000000 ends up in the rbtree. As DMA
>>>> maps get added and deleted from the rbtree we can end up getting a cached
>>>> entry to this 0xf8200000000 entry... this is what results in the code
>>>> handing out the invalid DMA map of 0xf81fffff000:
>>>>
>>>> [ 0xf8200000000-1 >> PAGE_SIZE << PAGE_SIZE ]
>>>>
>>>> The IOVA code needs to better honor the "limit_pfn" when allocating
>>>> these maps.
>>
>> This means we could get the MMIO address range (it's no longer reserved).
Not true, the MMIO address is greater than the dma_mask (i.e., the
limit_pfn passed into alloc_iova()), thus the IOVA code will never
hand back that address range, given it's greater than the dma_mask.
>> It seems to me the DMA transaction would then become a peer to peer
>> transaction if ACS is not enabled, which could show up as a random register
>> write in that GPU's 256M BAR (i.e. broken).
>>
>> The iova allocation should not hand out an address bigger than the
>> dma_mask. What is the device's dma_mask?
Agree. But there is a bug. The IOVA code doesn't validate the limit_pfn
if it uses the cached entry. One could argue that it should validate
the limit_pfn, but then again an entry outside the limit_pfn should
never have gotten into the rbtree in the first place... (it got in due
to the IOMMU's dmar_init_reserved_ranges() adding it).
-mike
>>
>> thanks,
>> -chris
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges
2011-03-31 23:40 ` Mike Habeck
@ 2011-03-31 23:56 ` Chris Wright
2011-04-01 1:05 ` Mike Habeck
0 siblings, 1 reply; 27+ messages in thread
From: Chris Wright @ 2011-03-31 23:56 UTC (permalink / raw)
To: Mike Habeck
Cc: Mike Travis, Chris Wright, David Woodhouse, Jesse Barnes, iommu,
linux-pci, linux-kernel
* Mike Habeck (habeck@sgi.com) wrote:
> On 03/31/2011 06:25 PM, Mike Travis wrote:
> >I'll probably need help from our Hardware PCI Engineer to help explain
> >this further, though here's a pointer to an earlier email thread:
> >
> >http://marc.info/?l=linux-kernel&m=129259816925973&w=2
> >
> >I'll also dig out the specs you're asking for.
> >
> >Thanks,
> >Mike
> >
> >Chris Wright wrote:
> >>* Mike Travis (travis@sgi.com) wrote:
> >>>Chris - did you have any comment on this patch?
> >>
> >>It doesn't actually look right to me. It means that particular range
> >>is no longer reserved. But perhaps I've misunderstood something.
> >>
> >>>Mike Travis wrote:
> >>>>dmar_init_reserved_ranges() reserves the card's MMIO ranges to
> >>>>prevent handing out a DMA map that would overlap with the MMIO range.
> >>>>The problem is that while the Nvidia GPU has 64bit BARs and is
> >>>>capable of receiving > 40bit PIOs, it can't generate > 40bit DMAs.
> >>
> >>I don't understand what you mean here.
>
> What Mike is getting at is that there is no reason to reserve the MMIO
> range if it's greater than the dma_mask, given the MMIO range is
> outside of what the IOVA code will ever hand back to the IOMMU
> code. In this case the nVidia card has a 64bit BAR and is assigned
> the MMIO range [0xf8200000000 - 0xf820fffffff]. But the Nvidia
> card can only generate a 40bit DMA (thus has a 40bit dma_mask). If
> the IOVA code honors the limit_pfn (i.e., dma_mask) passed in, it
> will never hand back a >40bit address to the IOMMU code. Thus
> there is no reason to reserve the card's MMIO range if it is greater
> than the dma_mask. (And that is what the patch is doing.)
The reserved ranges are for all devices. Another device with a 64bit
dma_mask could get that region if it's not properly reserved. The
driver would then program that device to dma to an address that is an
alias of an MMIO region. The memory transaction travels up towards the
root...and sees the MMIO range in some bridge and would go straight down
to the GPU.
> More below...
>
> >>
> >>>>So when the iommu code reserves these MMIO ranges a > 40bit
> >>>>entry ends up getting in the rbtree. On a UV test system with
> >>>>the Nvidia cards, the BARs are:
> >>>>
> >>>>0001:36:00.0 VGA compatible controller: nVidia Corporation
> >>>>GT200GL Region 0: Memory at 92000000 (32-bit, non-prefetchable)
> >>>>[size=16M]
> >>>>Region 1: Memory at f8200000000 (64-bit, prefetchable) [size=256M]
> >>>>Region 3: Memory at 90000000 (64-bit, non-prefetchable) [size=32M]
> >>>>
> >>>>So this 44bit MMIO address 0xf8200000000 ends up in the rbtree. As DMA
> >>>>maps get added and deleted from the rbtree we can end up getting a cached
> >>>>entry to this 0xf8200000000 entry... this is what results in the code
> >>>>handing out the invalid DMA map of 0xf81fffff000:
> >>>>
> >>>>[ 0xf8200000000-1 >> PAGE_SIZE << PAGE_SIZE ]
> >>>>
> >>>>The IOVA code needs to better honor the "limit_pfn" when allocating
> >>>>these maps.
> >>
> >>This means we could get the MMIO address range (it's no longer reserved).
>
> Not true, the MMIO address is greater than the dma_mask (i.e., the
> limit_pfn passed into alloc_iova()), thus the IOVA code will never
> hand back that address range, given it's greater than the dma_mask.
Well, as you guys are seeing, the iova allocation code is making the
assumption that if the range is in the tree, it's valid. And it is
handing out an address that's too large.
> >>It seems to me the DMA transaction would then become a peer to peer
> >>transaction if ACS is not enabled, which could show up as a random register
> >>write in that GPU's 256M BAR (i.e. broken).
> >>
> >>The iova allocation should not hand out an address bigger than the
> >>dma_mask. What is the device's dma_mask?
>
> Agree. But there is a bug. The IOVA code doesn't validate the limit_pfn
> if it uses the cached entry. One could argue that it should validate
> the limit_pfn, but then again an entry outside the limit_pfn should
> never have gotten into the rbtree in the first place... (it got in due
> to the IOMMU's dmar_init_reserved_ranges() adding it).
Yeah, I think it needs to be in the global reserved list. But perhaps
not copied into the domain specific iova. Or simply skipped on iova
allocation (don't just assume rb_last is <= dma_mask).
thanks,
-chris
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges
2011-03-31 23:56 ` Chris Wright
@ 2011-04-01 1:05 ` Mike Habeck
2011-04-02 0:32 ` [PATCH 3/4 v2] intel-iommu: don't cache iova above 32bit caching boundary Chris Wright
0 siblings, 1 reply; 27+ messages in thread
From: Mike Habeck @ 2011-04-01 1:05 UTC (permalink / raw)
To: Chris Wright
Cc: Mike Travis, David Woodhouse, Jesse Barnes, iommu, linux-pci,
linux-kernel
Chris Wright wrote:
> * Mike Habeck (habeck@sgi.com) wrote:
>> On 03/31/2011 06:25 PM, Mike Travis wrote:
>>> I'll probably need help from our Hardware PCI Engineer to help explain
>>> this further, though here's a pointer to an earlier email thread:
>>>
>>> http://marc.info/?l=linux-kernel&m=129259816925973&w=2
>>>
>>> I'll also dig out the specs you're asking for.
>>>
>>> Thanks,
>>> Mike
>>>
>>> Chris Wright wrote:
>>>> * Mike Travis (travis@sgi.com) wrote:
>>>>> Chris - did you have any comment on this patch?
>>>> It doesn't actually look right to me. It means that particular range
>>>> is no longer reserved. But perhaps I've misunderstood something.
>>>>
>>>>> Mike Travis wrote:
>>>>>> dmar_init_reserved_ranges() reserves the card's MMIO ranges to
>>>>>> prevent handing out a DMA map that would overlap with the MMIO range.
>>>>>> The problem is that while the Nvidia GPU has 64bit BARs and is
>>>>>> capable of receiving > 40bit PIOs, it can't generate > 40bit DMAs.
>>>> I don't understand what you mean here.
>> What Mike is getting at is that there is no reason to reserve the MMIO
>> range if it's greater than the dma_mask, given the MMIO range is
>> outside of what the IOVA code will ever hand back to the IOMMU
>> code. In this case the nVidia card has a 64bit BAR and is assigned
>> the MMIO range [0xf8200000000 - 0xf820fffffff]. But the Nvidia
>> card can only generate a 40bit DMA (thus has a 40bit dma_mask). If
>> the IOVA code honors the limit_pfn (i.e., dma_mask) passed in, it
>> will never hand back a >40bit address to the IOMMU code. Thus
>> there is no reason to reserve the card's MMIO range if it is greater
>> than the dma_mask. (And that is what the patch is doing.)
>
> The reserved ranges are for all devices. Another device with a 64bit
> dma_mask could get that region if it's not properly reserved. The
> driver would then program that device to dma to an address that is an
> alias of an MMIO region. The memory transaction travels up towards the
> root...and sees the MMIO range in some bridge and would go straight down
> to the GPU.
Chris,
OK, I understand now what you meant by the patch possibly causing
the DMA transaction to become a peer to peer transaction. Mike and
I will have to rethink this one. Thanks for your input.
-mike
>
>> More below...
>>
>>>>>> So when the iommu code reserves these MMIO ranges a > 40bit
>>>>>> entry ends up getting in the rbtree. On a UV test system with
>>>>>> the Nvidia cards, the BARs are:
>>>>>>
>>>>>> 0001:36:00.0 VGA compatible controller: nVidia Corporation
>>>>>> GT200GL Region 0: Memory at 92000000 (32-bit, non-prefetchable)
>>>>>> [size=16M]
>>>>>> Region 1: Memory at f8200000000 (64-bit, prefetchable) [size=256M]
>>>>>> Region 3: Memory at 90000000 (64-bit, non-prefetchable) [size=32M]
>>>>>>
>>>>>> So this 44bit MMIO address 0xf8200000000 ends up in the rbtree. As DMA
>>>>>> maps get added and deleted from the rbtree we can end up getting a cached
>>>>>> entry to this 0xf8200000000 entry... this is what results in the code
>>>>>> handing out the invalid DMA map of 0xf81fffff000:
>>>>>>
>>>>>> [ 0xf8200000000-1 >> PAGE_SIZE << PAGE_SIZE ]
>>>>>>
>>>>>> The IOVA code needs to better honor the "limit_pfn" when allocating
>>>>>> these maps.
>>>> This means we could get the MMIO address range (it's no longer reserved).
>> Not true, the MMIO address is greater than the dma_mask (i.e., the
>> limit_pfn passed into alloc_iova()), thus the IOVA code will never
>> hand back that address range, given it's greater than the dma_mask.
>
> Well, as you guys are seeing, the iova allocation code is making the
> assumption that if the range is in the tree, it's valid. And it is
> handing out an address that's too large.
>
>>>> It seems to me the DMA transaction would then become a peer to peer
>>>> transaction if ACS is not enabled, which could show up as a random register
>>>> write in that GPU's 256M BAR (i.e. broken).
>>>>
>>>> The iova allocation should not hand out an address bigger than the
>>>> dma_mask. What is the device's dma_mask?
>> Agree. But there is a bug. The IOVA code doesn't validate the limit_pfn
>> if it uses the cached entry. One could argue that it should validate
>> the limit_pfn, but then again an entry outside the limit_pfn should
>> never have gotten into the rbtree in the first place... (it got in due
>> to the IOMMU's dmar_init_reserved_ranges() adding it).
>
> Yeah, I think it needs to be in the global reserved list. But perhaps
> not copied into the domain specific iova. Or simply skipped on iova
> allocation (don't just assume rb_last is <= dma_mask).
>
> thanks,
> -chris
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 4/4] Intel pci: Use coherent DMA mask when requested
2011-03-30 18:02 ` Chris Wright
@ 2011-04-01 2:57 ` FUJITA Tomonori
0 siblings, 0 replies; 27+ messages in thread
From: FUJITA Tomonori @ 2011-04-01 2:57 UTC (permalink / raw)
To: chrisw
Cc: travis, dwmw2, jbarnes, linux-pci, iommu, habeck, linux-kernel,
fujita.tomonori
On Wed, 30 Mar 2011 11:02:45 -0700
Chris Wright <chrisw@sous-sol.org> wrote:
> * Mike Travis (travis@sgi.com) wrote:
> > The __intel_map_single function is not honoring the passed in
> > DMA mask. This results in not using the coherent DMA mask when
> > called from intel_alloc_coherent().
> >
> > Signed-off-by: Mike Travis <travis@sgi.com>
> > Reviewed-by: Mike Habeck <habeck@sgi.com>
>
> Hmm, looks like it has actually been broken since it was introduced
> in bb9e6d65.
Oops, you are right.
The patch looks good to me.
Thanks,
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 3/4 v2] intel-iommu: don't cache iova above 32bit caching boundary
2011-04-01 1:05 ` Mike Habeck
@ 2011-04-02 0:32 ` Chris Wright
2011-04-06 0:39 ` [PATCH 3/4 v3] " Chris Wright
0 siblings, 1 reply; 27+ messages in thread
From: Chris Wright @ 2011-04-02 0:32 UTC (permalink / raw)
To: Mike Travis, Mike Habeck
Cc: David Woodhouse, Chris Wright, Jesse Barnes, iommu, linux-pci,
linux-kernel
Mike Travis and Mike Habeck reported an issue where iova allocation
would return a range that was larger than a device's dma mask.
https://lkml.org/lkml/2011/3/29/423
The dmar initialization code will reserve all PCI MMIO regions and copy
those reservations into a domain specific iova tree. It is possible for
one of those regions to be above the dma mask of a device. It is typical
to allocate iovas with a 32bit mask (despite the device's dma_mask possibly
being larger) and cache the result until it exhausts the lower 32bit
address space. Freeing the iova range that is >= the last iova in the
lower 32bit range when there is still an iova above the 32bit range will
corrupt the cached iova by pointing it to a region that is above 32bit.
If that region is also larger than the device's dma mask, a subsequent
allocation will return an unusable iova and cause dma failure.
Simply don't cache an iova that is above the 32bit caching boundary.
Reported-by: Mike Travis <travis@sgi.com>
Reported-by: Mike Habeck <habeck@sgi.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: stable@kernel.org
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
---
Mike or Mike, can you try this? I was able to reproduce the failure
in a few different ways and successfully test this patch against those
failures, but w/out real hw.
drivers/pci/iova.c | 12 ++++++++++--
1 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/iova.c b/drivers/pci/iova.c
index 7914951..1690ca4 100644
--- a/drivers/pci/iova.c
+++ b/drivers/pci/iova.c
@@ -63,8 +63,16 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
curr = iovad->cached32_node;
cached_iova = container_of(curr, struct iova, node);
- if (free->pfn_lo >= cached_iova->pfn_lo)
- iovad->cached32_node = rb_next(&free->node);
+ if (free->pfn_lo >= cached_iova->pfn_lo) {
+ struct rb_node *node = rb_next(&free->node);
+ struct iova *iova = container_of(node, struct iova, node);
+
+ /* only cache if it's below 32bit pfn */
+ if (iova->pfn_lo < iovad->dma_32bit_pfn)
+ iovad->cached32_node = node;
+ else
+ iovad->cached32_node = NULL;
+ }
}
/* Computes the padding size required, to make the
--
1.7.4
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH 3/4 v3] intel-iommu: don't cache iova above 32bit caching boundary
2011-04-02 0:32 ` [PATCH 3/4 v2] intel-iommu: don't cache iova above 32bit caching boundary Chris Wright
@ 2011-04-06 0:39 ` Chris Wright
0 siblings, 0 replies; 27+ messages in thread
From: Chris Wright @ 2011-04-06 0:39 UTC (permalink / raw)
To: Mike Travis, Mike Habeck
Cc: David Woodhouse, Chris Wright, Jesse Barnes, iommu, linux-pci,
linux-kernel
Mike Travis and Mike Habeck reported an issue where iova allocation
would return a range that was larger than a device's dma mask.
https://lkml.org/lkml/2011/3/29/423
The dmar initialization code will reserve all PCI MMIO regions and copy
those reservations into a domain specific iova tree. It is possible for
one of those regions to be above the dma mask of a device. It is typical
to allocate iovas with a 32bit mask (despite the device's dma_mask possibly
being larger) and cache the result until it exhausts the lower 32bit
address space. Freeing the iova range that is >= the last iova in the
lower 32bit range when there is still an iova above the 32bit range will
corrupt the cached iova by pointing it to a region that is above 32bit.
If that region is also larger than the device's dma mask, a subsequent
allocation will return an unusable iova and cause dma failure.
Simply don't cache an iova that is above the 32bit caching boundary.
Reported-by: Mike Travis <travis@sgi.com>
Reported-by: Mike Habeck <habeck@sgi.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: stable@kernel.org
Acked-by: Mike Travis <travis@sgi.com>
Tested-by: Mike Habeck <habeck@sgi.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
---
v3: rb_next() can return NULL, found when testing on my hw
David, Mike Travis will collect and resubmit the full series when he's back.
drivers/pci/iova.c | 12 ++++++++++--
1 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/iova.c b/drivers/pci/iova.c
index 7914951..10f995a 100644
--- a/drivers/pci/iova.c
+++ b/drivers/pci/iova.c
@@ -63,8 +63,16 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
curr = iovad->cached32_node;
cached_iova = container_of(curr, struct iova, node);
- if (free->pfn_lo >= cached_iova->pfn_lo)
- iovad->cached32_node = rb_next(&free->node);
+ if (free->pfn_lo >= cached_iova->pfn_lo) {
+ struct rb_node *node = rb_next(&free->node);
+ struct iova *iova = container_of(node, struct iova, node);
+
+ /* only cache if it's below 32bit pfn */
+ if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
+ iovad->cached32_node = node;
+ else
+ iovad->cached32_node = NULL;
+ }
}
/* Computes the padding size required, to make the
--
1.7.4
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping
2011-03-29 23:36 [PATCH 0/4] pci: Speed up processing of IOMMU related functions Mike Travis
` (3 preceding siblings ...)
2011-03-29 23:36 ` [PATCH 4/4] Intel pci: Use coherent DMA mask when requested Mike Travis
@ 2011-04-07 19:47 ` Mike Travis
2011-04-07 19:51 ` [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function Mike Travis
2011-04-07 19:52 ` [PATCH 4/4] Intel pci: Use coherent DMA mask when requested Mike Travis
6 siblings, 0 replies; 27+ messages in thread
From: Mike Travis @ 2011-04-07 19:47 UTC (permalink / raw)
To: David Woodhouse, Jesse Barnes, Chris Wright
Cc: Mike Habeck, iommu, linux-pci, linux-kernel
When the IOMMU is being used, each request for a DMA mapping requires
the intel_iommu code to look for some space in the DMA mapping table.
For most drivers this occurs for each transfer.
When there are many outstanding DMA mappings [as seems to be the case
with the 10GigE driver], the table grows large and the search for
space becomes increasingly time consuming. Performance for the
10GigE driver drops to about 10% of its capacity on a UV system
when the CPU count is large.
The workaround is to specify the iommu=pt option which sets up a 1:1
identity map for those devices that support enough DMA address bits to
cover the physical system memory. This is the "pass through" option.
But this can only be accomplished for devices that pass their
DMA data through the IOMMU (VT-d). Host Bridge devices connected
to System Sockets do not pass their data through the VT-d, so the
following error occurs:
IOMMU: hardware identity mapping for device 1000:3e:00.0
Failed to setup IOMMU pass-through
BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
This patch fixes that problem by removing Host Bridge devices from
being identity mapped, given that they do not generate DMA ops anyway.
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Mike Habeck <habeck@sgi.com>
---
drivers/pci/intel-iommu.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
--- linux.orig/drivers/pci/intel-iommu.c
+++ linux/drivers/pci/intel-iommu.c
@@ -46,6 +46,7 @@
#define ROOT_SIZE VTD_PAGE_SIZE
#define CONTEXT_SIZE VTD_PAGE_SIZE
+#define IS_HOSTBRIDGE_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_HOST)
#define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
#define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA)
#define IS_AZALIA(pdev) ((pdev)->vendor == 0x8086 && (pdev)->device == 0x3a3e)
@@ -2183,7 +2184,7 @@ static int iommu_should_identity_map(str
* take them out of the 1:1 domain later.
*/
if (!startup)
- return pdev->dma_mask > DMA_BIT_MASK(32);
+ return pdev->dma_mask == DMA_BIT_MASK(64);
return 1;
}
@@ -2198,6 +2199,9 @@ static int __init iommu_prepare_static_i
return -EFAULT;
for_each_pci_dev(pdev) {
+ /* Skip PCI Host Bridge devices */
+ if (IS_HOSTBRIDGE_DEVICE(pdev))
+ continue;
if (iommu_should_identity_map(pdev, 1)) {
printk(KERN_INFO "IOMMU: %s identity mapping for device %s\n",
hw ? "hardware" : "software", pci_name(pdev));
--
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function
2011-03-29 23:36 [PATCH 0/4] pci: Speed up processing of IOMMU related functions Mike Travis
` (4 preceding siblings ...)
2011-04-07 19:47 ` [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping Mike Travis
@ 2011-04-07 19:51 ` Mike Travis
2011-04-07 19:52 ` [PATCH 4/4] Intel pci: Use coherent DMA mask when requested Mike Travis
6 siblings, 0 replies; 27+ messages in thread
From: Mike Travis @ 2011-04-07 19:51 UTC (permalink / raw)
To: David Woodhouse, Jesse Barnes, Chris Wright
Cc: Mike Habeck, iommu, linux-pci, linux-kernel
Subject: Intel iommu: Speed up processing of the identity_mapping function
When there are a large count of PCI devices, and the pass
through option for iommu is set, much time is spent in the
identity_mapping function hunting though the iommu domains to
check if a specific device is "identity mapped".
Speed up the function by checking the cached info to see if
it's mapped to the static identity domain.
Signed-off-by: Mike Travis <travis@sgi.com>
---
drivers/pci/intel-iommu.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- linux-2.6.32.orig/drivers/pci/intel-iommu.c
+++ linux-2.6.32/drivers/pci/intel-iommu.c
@@ -2124,10 +2124,10 @@ static int identity_mapping(struct pci_d
if (likely(!iommu_identity_mapping))
return 0;
+ info = pdev->dev.archdata.iommu;
+ if (info && info != DUMMY_DEVICE_DOMAIN_INFO)
+ return (info->domain == si_domain);
- list_for_each_entry(info, &si_domain->devices, link)
- if (info->dev == pdev)
- return 1;
return 0;
}
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 4/4] Intel pci: Use coherent DMA mask when requested
2011-03-29 23:36 [PATCH 0/4] pci: Speed up processing of IOMMU related functions Mike Travis
` (5 preceding siblings ...)
2011-04-07 19:51 ` [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function Mike Travis
@ 2011-04-07 19:52 ` Mike Travis
6 siblings, 0 replies; 27+ messages in thread
From: Mike Travis @ 2011-04-07 19:52 UTC (permalink / raw)
To: David Woodhouse, Jesse Barnes
Cc: Mike Habeck, iommu, linux-pci, linux-kernel, Chris Wright
The __intel_map_single function is not honoring the passed in
DMA mask. This results in not using the coherent DMA mask when
called from intel_alloc_coherent().
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Mike Habeck <habeck@sgi.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
---
drivers/pci/intel-iommu.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- linux.orig/drivers/pci/intel-iommu.c
+++ linux/drivers/pci/intel-iommu.c
@@ -2582,8 +2582,7 @@ static dma_addr_t __intel_map_single(str
iommu = domain_get_iommu(domain);
size = aligned_nrpages(paddr, size);
- iova = intel_alloc_iova(hwdev, domain, dma_to_mm_pfn(size),
- pdev->dma_mask);
+ iova = intel_alloc_iova(hwdev, domain, dma_to_mm_pfn(size), dma_mask);
if (!iova)
goto error;
--
^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2011-04-07 19:52 UTC | newest]
Thread overview: 27+ messages
-- links below jump to the message on this page --
2011-03-29 23:36 [PATCH 0/4] pci: Speed up processing of IOMMU related functions Mike Travis
2011-03-29 23:36 ` [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping Mike Travis
2011-03-30 17:51 ` Chris Wright
2011-03-30 18:30 ` Mike Travis
2011-03-30 19:15 ` Chris Wright
2011-03-30 19:25 ` Mike Travis
2011-03-30 19:57 ` Chris Wright
2011-03-29 23:36 ` [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function Mike Travis
2011-03-30 19:19 ` Chris Wright
2011-03-30 19:29 ` Mike Travis
2011-03-31 0:33 ` [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function v2 Mike Travis
2011-03-29 23:36 ` [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges Mike Travis
2011-03-31 22:11 ` Mike Travis
2011-03-31 22:53 ` Chris Wright
2011-03-31 23:25 ` Mike Travis
2011-03-31 23:40 ` Mike Habeck
2011-03-31 23:56 ` Chris Wright
2011-04-01 1:05 ` Mike Habeck
2011-04-02 0:32 ` [PATCH 3/4 v2] intel-iommu: don't cache iova above 32bit caching boundary Chris Wright
2011-04-06 0:39 ` [PATCH 3/4 v3] " Chris Wright
2011-03-31 23:39 ` [PATCH 3/4] Intel pci: Limit dmar_init_reserved_ranges Chris Wright
2011-03-29 23:36 ` [PATCH 4/4] Intel pci: Use coherent DMA mask when requested Mike Travis
2011-03-30 18:02 ` Chris Wright
2011-04-01 2:57 ` FUJITA Tomonori
2011-04-07 19:47 ` [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping Mike Travis
2011-04-07 19:51 ` [PATCH 2/4] Intel iommu: Speed up processing of the identity_mapping function Mike Travis
2011-04-07 19:52 ` [PATCH 4/4] Intel pci: Use coherent DMA mask when requested Mike Travis