linux-kernel.vger.kernel.org archive mirror
* [Intel IOMMU 00/10] Intel IOMMU support, take #2
@ 2007-06-19 21:37 Keshavamurthy, Anil S
  2007-06-19 21:37 ` [Intel IOMMU 01/10] DMAR detection and parsing logic Keshavamurthy, Anil S
                   ` (10 more replies)
  0 siblings, 11 replies; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-19 21:37 UTC (permalink / raw)
  To: akpm, linux-kernel
  Cc: ak, gregkh, muli, suresh.b.siddha, arjan, ashok.raj, davem, clameter

Hi All,
	This patch series adds support for the upcoming Intel IOMMU hardware,
a.k.a. Intel(R) Virtualization Technology for Directed I/O
Architecture. The hardware spec for the same can be found here:
http://www.intel.com/technology/virtualization/index.htm

	This version of the patches incorporates feedback
received on the previous postings.

Some of the major changes are:
1) Removed the resource pool (a.k.a. pre-allocated pool) patch.
2) For memory allocation in the DMA map API calls we
   now use kmem_cache_alloc() and get_zeroed_page()
   to allocate memory for internal data structures and for
   page table setup.
3) Memory allocation in the DMA map API calls is
   critical, and to avoid allocation failures in these
   calls we evaluated several techniques:
   a) mempool - We found that a mempool is pretty much useless
      if we try to allocate memory with GFP_ATOMIC, which is
      our case. We also found that it is difficult to judge
      how much to reserve when creating the mempool.
   b) PF_MEMALLOC - When a task's flags (current->flags)
      include PF_MEMALLOC, watermark checks are skipped
      during memory allocation.
  We chose the latter (option b) and made it a separate
  patch which can be debated further; a minimal sketch of
  the idea follows below. Please see patch 6/10.
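
A minimal sketch of option (b), for illustration only (the real change is
in patch 6/10); alloc_for_dma_map() is a hypothetical helper:

#include <linux/sched.h>
#include <linux/slab.h>

static void *alloc_for_dma_map(struct kmem_cache *cachep)
{
	unsigned long pflags = current->flags & PF_MEMALLOC;
	void *obj;

	current->flags |= PF_MEMALLOC;	/* skip watermark checks */
	obj = kmem_cache_alloc(cachep, GFP_ATOMIC);

	/* restore only the PF_MEMALLOC bit to its previous state */
	current->flags &= ~PF_MEMALLOC;
	current->flags |= pflags;

	return obj;
}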

Other minor changes are mostly coding style fixes and
making sure that the patches pass checkpatch.pl.

Please include this set of patches in the next -mm release.

Thanks and regards,
-Anil S Keshavamurthy
E-mail: anil.s.keshavamurthy@intel.com

-- 


* [Intel IOMMU 01/10] DMAR detection and parsing logic
  2007-06-19 21:37 [Intel IOMMU 00/10] Intel IOMMU support, take #2 Keshavamurthy, Anil S
@ 2007-06-19 21:37 ` Keshavamurthy, Anil S
  2007-07-04  9:18   ` Peter Zijlstra
  2007-06-19 21:37 ` [Intel IOMMU 02/10] PCI generic helper function Keshavamurthy, Anil S
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-19 21:37 UTC (permalink / raw)
  To: akpm, linux-kernel
  Cc: ak, gregkh, muli, suresh.b.siddha, arjan, ashok.raj, davem,
	clameter, Anil S Keshavamurthy

[-- Attachment #1: dmar_detection.patch --]
[-- Type: text/plain, Size: 15106 bytes --]

This patch adds support for early detection and parsing of DMAR
(DMA Remapping) units reported to the OS via ACPI tables.

DMA remapping (DMAR) devices provide independent address
translations for Direct Memory Access (DMA) from devices.
These DMA remapping devices are reported via ACPI tables,
which also include the PCI device scope covered by each DMA
remapping device.

For detailed information on the "Intel(R) Virtualization
Technology for Directed I/O Architecture" specification, please see
http://www.intel.com/technology/virtualization/index.htm
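
A hedged, illustrative sketch (not part of this patch) of how boot code is
expected to drive the two entry points added here; example_dmar_boot() is a
hypothetical wrapper, the real wiring into x86_64 arch code comes later in
this series:

#include <linux/init.h>
#include <linux/dmar.h>

static void __init example_dmar_boot(void)
{
	/* Map the ACPI DMAR table, if the platform provides one */
	if (!early_dmar_detect())
		return;

	/* Parse DRHD/RMRR entries onto dmar_drhd_units/dmar_rmrr_units */
	if (dmar_table_init())
		return;

	/* DMA remapping hardware units can now be walked via dmar_drhd_units */
}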

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>

---
 arch/x86_64/Kconfig   |   11 +
 drivers/pci/Makefile  |    3 
 drivers/pci/dmar.c    |  327 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/acpi/actbl1.h |   27 +++-
 include/linux/dmar.h  |   52 +++++++
 5 files changed, 413 insertions(+), 7 deletions(-)

Index: linux-2.6.22-rc4-mm2/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.22-rc4-mm2.orig/arch/x86_64/Kconfig	2007-06-18 15:45:39.000000000 -0700
+++ linux-2.6.22-rc4-mm2/arch/x86_64/Kconfig	2007-06-19 13:05:03.000000000 -0700
@@ -730,6 +730,17 @@
 	bool "Support mmconfig PCI config space access"
 	depends on PCI && ACPI
 
+config DMAR
+	bool "Support for DMA Remapping Devices (EXPERIMENTAL)"
+	depends on PCI_MSI && ACPI && EXPERIMENTAL
+	default y
+	help
+	  DMA remapping (DMAR) devices provide independent address
+	  translations for Direct Memory Access (DMA) from devices.
+	  These DMA remapping devices are reported via ACPI tables,
+	  which also include the PCI device scope covered by each
+	  DMA remapping device.
+
 source "drivers/pci/pcie/Kconfig"
 
 source "drivers/pci/Kconfig"
Index: linux-2.6.22-rc4-mm2/drivers/pci/Makefile
===================================================================
--- linux-2.6.22-rc4-mm2.orig/drivers/pci/Makefile	2007-06-18 15:45:39.000000000 -0700
+++ linux-2.6.22-rc4-mm2/drivers/pci/Makefile	2007-06-19 13:43:14.000000000 -0700
@@ -20,6 +20,9 @@
 # Build the Hypertransport interrupt support
 obj-$(CONFIG_HT_IRQ) += htirq.o
 
+# Build Intel IOMMU support
+obj-$(CONFIG_DMAR) += dmar.o
+
 #
 # Some architectures use the generic PCI setup functions
 #
Index: linux-2.6.22-rc4-mm2/drivers/pci/dmar.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.22-rc4-mm2/drivers/pci/dmar.c	2007-06-18 15:45:46.000000000 -0700
@@ -0,0 +1,327 @@
+/*
+ * Copyright (c) 2006, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * 	Copyright (C) Ashok Raj <ashok.raj@intel.com>
+ *	Copyright (C) Shaohua Li <shaohua.li@intel.com>
+ *	Copyright (C) Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
+ *
+ * 	This file implements early detection/parsing of DMA Remapping Devices
+ * reported to OS through BIOS via DMA remapping reporting (DMAR) ACPI
+ * tables.
+ */
+
+#include <linux/pci.h>
+#include <linux/dmar.h>
+
+#undef PREFIX
+#define PREFIX "DMAR:"
+
+/* No locks are needed as DMA remapping hardware unit
+ * list is constructed at boot time and hotplug of
+ * these units is not supported by the architecture.
+ */
+LIST_HEAD(dmar_drhd_units);
+LIST_HEAD(dmar_rmrr_units);
+
+static struct acpi_table_header * __initdata dmar_tbl;
+
+static void __init dmar_register_drhd_unit(struct dmar_drhd_unit *drhd)
+{
+	/*
+	 * add INCLUDE_ALL at the tail, so scanning the list will find it at
+	 * the very end.
+	 */
+	if (drhd->include_all)
+		list_add_tail(&drhd->list, &dmar_drhd_units);
+	else
+		list_add(&drhd->list, &dmar_drhd_units);
+}
+
+static void __init dmar_register_rmrr_unit(struct dmar_rmrr_unit *rmrr)
+{
+	list_add(&rmrr->list, &dmar_rmrr_units);
+}
+
+static int __init dmar_parse_one_dev_scope(struct acpi_dmar_device_scope *scope,
+					   struct pci_dev **dev, u16 segment)
+{
+	struct pci_bus *bus;
+	struct pci_dev *pdev = NULL;
+	struct acpi_dmar_pci_path *path;
+	int count;
+
+	bus = pci_find_bus(segment, scope->bus);
+	path = (struct acpi_dmar_pci_path *)(scope + 1);
+	count = (scope->length - sizeof(struct acpi_dmar_device_scope))
+		/ sizeof(struct acpi_dmar_pci_path);
+
+	while (count) {
+		if (pdev)
+			pci_dev_put(pdev);
+		/*
+		 * Some BIOSes list non-existent devices in the DMAR table;
+		 * just ignore them
+		 */
+		if (!bus) {
+			printk(KERN_WARNING
+			PREFIX "Device scope bus [%d] not found\n",
+			scope->bus);
+			break;
+		}
+		pdev = pci_get_slot(bus, PCI_DEVFN(path->dev, path->fn));
+		if (!pdev) {
+			printk(KERN_WARNING PREFIX
+			"Device scope device [%04x:%02x:%02x.%02x] not found\n",
+				segment, bus->number, path->dev, path->fn);
+			break;
+		}
+		path ++;
+		count --;
+		bus = pdev->subordinate;
+	}
+	if (!pdev) {
+		printk(KERN_WARNING PREFIX
+		"Device scope device [%04x:%02x:%02x.%02x] not found\n",
+		segment, scope->bus, path->dev, path->fn);
+		*dev = NULL;
+		return 0;
+	}
+	if ((scope->entry_type == ACPI_DMAR_SCOPE_TYPE_ENDPOINT && \
+			pdev->subordinate) || (scope->entry_type == \
+			ACPI_DMAR_SCOPE_TYPE_BRIDGE && !pdev->subordinate)) {
+		pci_dev_put(pdev);
+		printk(KERN_WARNING PREFIX
+			"Device scope type does not match for %s\n",
+			 pci_name(pdev));
+		return -EINVAL;
+	}
+	*dev = pdev;
+	return 0;
+}
+
+static int __init dmar_parse_dev_scope(void *start, void *end, int *cnt,
+				       struct pci_dev ***devices, u16 segment)
+{
+	struct acpi_dmar_device_scope *scope;
+	void * tmp = start;
+	int index;
+	int ret;
+
+	*cnt = 0;
+	while (start < end) {
+		scope = start;
+		if (scope->entry_type == ACPI_DMAR_SCOPE_TYPE_ENDPOINT ||
+		    scope->entry_type == ACPI_DMAR_SCOPE_TYPE_BRIDGE)
+			(*cnt)++;
+		else
+			printk(KERN_WARNING PREFIX
+				"Unsupported device scope\n");
+		start += scope->length;
+	}
+	if (*cnt == 0)
+		return 0;
+
+	*devices = kcalloc(*cnt, sizeof(struct pci_dev *), GFP_KERNEL);
+	if (!*devices)
+		return -ENOMEM;
+
+	start = tmp;
+	index = 0;
+	while (start < end) {
+		scope = start;
+		if (scope->entry_type == ACPI_DMAR_SCOPE_TYPE_ENDPOINT ||
+		    scope->entry_type == ACPI_DMAR_SCOPE_TYPE_BRIDGE) {
+			ret = dmar_parse_one_dev_scope(scope,
+				&(*devices)[index], segment);
+			if (ret) {
+				kfree(*devices);
+				return ret;
+			}
+			index ++;
+		}
+		start += scope->length;
+	}
+
+	return 0;
+}
+
+/**
+ * dmar_parse_one_drhd - parses exactly one DMA remapping hardware definition
+ * structure which uniquely represents one DMA remapping hardware unit
+ * present in the platform
+ */
+static int __init
+dmar_parse_one_drhd(struct acpi_dmar_header *header)
+{
+	struct acpi_dmar_hardware_unit *drhd;
+	struct dmar_drhd_unit *dmaru;
+	int ret = 0;
+	static int include_all;
+
+	dmaru = kzalloc(sizeof(*dmaru), GFP_KERNEL);
+	if (!dmaru)
+		return -ENOMEM;
+
+	drhd = (struct acpi_dmar_hardware_unit *)header;
+	dmaru->reg_base_addr = drhd->address;
+	dmaru->include_all = drhd->flags & 0x1; /* BIT0: INCLUDE_ALL */
+
+	if (!dmaru->include_all)
+		ret = dmar_parse_dev_scope((void *)(drhd + 1),
+				((void *)drhd) + header->length,
+				&dmaru->devices_cnt, &dmaru->devices,
+				drhd->segment);
+	else {
+		/* Only allow one INCLUDE_ALL */
+		if (include_all) {
+			printk(KERN_WARNING PREFIX "Only one INCLUDE_ALL "
+				"device scope is allowed\n");
+			ret = -EINVAL;
+		}
+		include_all = 1;
+	}
+
+	if (ret || (dmaru->devices_cnt == 0 && !dmaru->include_all))
+		kfree(dmaru);
+	else
+		dmar_register_drhd_unit(dmaru);
+	return ret;
+}
+
+static int __init
+dmar_parse_one_rmrr(struct acpi_dmar_header *header)
+{
+	struct acpi_dmar_reserved_memory *rmrr;
+	struct dmar_rmrr_unit *rmrru;
+	int ret = 0;
+
+	rmrru = kzalloc(sizeof(*rmrru), GFP_KERNEL);
+	if (!rmrru)
+		return -ENOMEM;
+
+	rmrr = (struct acpi_dmar_reserved_memory *)header;
+	rmrru->base_address = rmrr->base_address;
+	rmrru->end_address = rmrr->end_address;
+	ret = dmar_parse_dev_scope((void *)(rmrr + 1),
+		((void *)rmrr) + header->length,
+		&rmrru->devices_cnt, &rmrru->devices, rmrr->segment);
+
+	if (ret || (rmrru->devices_cnt == 0))
+		kfree(rmrru);
+	else
+		dmar_register_rmrr_unit(rmrru);
+	return ret;
+}
+
+static void __init
+dmar_table_print_dmar_entry(struct acpi_dmar_header *header)
+{
+	struct acpi_dmar_hardware_unit *drhd;
+	struct acpi_dmar_reserved_memory *rmrr;
+
+	switch (header->type) {
+	case ACPI_DMAR_TYPE_HARDWARE_UNIT:
+		drhd = (struct acpi_dmar_hardware_unit *)header;
+		printk (KERN_INFO PREFIX
+			"DRHD (flags: 0x%08x)base: 0x%016Lx\n",
+			drhd->flags, drhd->address);
+		break;
+	case ACPI_DMAR_TYPE_RESERVED_MEMORY:
+		rmrr = (struct acpi_dmar_reserved_memory *)header;
+
+		printk (KERN_INFO PREFIX
+			"RMRR base: 0x%016Lx end: 0x%016Lx\n",
+			rmrr->base_address, rmrr->end_address);
+		break;
+	}
+}
+
+/**
+ * parse_dmar_table - parses the DMA reporting table
+ */
+static int __init
+parse_dmar_table(void)
+{
+	struct acpi_table_dmar *dmar;
+	struct acpi_dmar_header *entry_header;
+	int ret = 0;
+
+	dmar = (struct acpi_table_dmar *)dmar_tbl;
+
+	if (!dmar->width) {
+		printk (KERN_WARNING PREFIX "Zero: Invalid DMAR haw\n");
+		return -EINVAL;
+	}
+
+	printk (KERN_INFO PREFIX "Host address width %d\n",
+		dmar->width + 1);
+
+	entry_header = (struct acpi_dmar_header *)(dmar + 1);
+	while (((unsigned long)entry_header) <
+			(((unsigned long)dmar) + dmar_tbl->length)) {
+		dmar_table_print_dmar_entry(entry_header);
+
+		switch (entry_header->type) {
+		case ACPI_DMAR_TYPE_HARDWARE_UNIT:
+			ret = dmar_parse_one_drhd(entry_header);
+			break;
+		case ACPI_DMAR_TYPE_RESERVED_MEMORY:
+			ret = dmar_parse_one_rmrr(entry_header);
+			break;
+		default:
+			printk(KERN_WARNING PREFIX
+				"Unknown DMAR structure type\n");
+			ret = 0; /* for forward compatibility */
+			break;
+		}
+		if (ret)
+			break;
+
+		entry_header = ((void *)entry_header + entry_header->length);
+	}
+	return ret;
+}
+
+
+int __init dmar_table_init(void)
+{
+
+	parse_dmar_table();
+	if (list_empty(&dmar_drhd_units)) {
+		printk(KERN_ERR PREFIX "No DMAR devices found\n");
+		return -ENODEV;
+	}
+	return 0;
+}
+
+/**
+ * early_dmar_detect - checks to see if the platform supports DMAR devices
+ */
+int __init early_dmar_detect(void)
+{
+	acpi_status status = AE_OK;
+
+	/* if we could find DMAR table, then there are DMAR devices */
+	status = acpi_get_table(ACPI_SIG_DMAR, 0,
+				(struct acpi_table_header **)&dmar_tbl);
+
+	if (ACPI_SUCCESS(status) && !dmar_tbl) {
+		printk (KERN_WARNING PREFIX "Unable to map DMAR\n");
+		status = AE_NOT_FOUND;
+	}
+
+	return (ACPI_SUCCESS(status) ? 1 : 0);
+}
Index: linux-2.6.22-rc4-mm2/include/acpi/actbl1.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/acpi/actbl1.h	2007-06-18 15:45:39.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/acpi/actbl1.h	2007-06-18 15:45:46.000000000 -0700
@@ -257,7 +257,8 @@
 struct acpi_table_dmar {
 	struct acpi_table_header header;	/* Common ACPI table header */
 	u8 width;		/* Host Address Width */
-	u8 reserved[11];
+	u8 flags;
+	u8 reserved[10];
 };
 
 /* DMAR subtable header */
@@ -265,8 +266,6 @@
 struct acpi_dmar_header {
 	u16 type;
 	u16 length;
-	u8 flags;
-	u8 reserved[3];
 };
 
 /* Values for subtable type in struct acpi_dmar_header */
@@ -274,13 +273,15 @@
 enum acpi_dmar_type {
 	ACPI_DMAR_TYPE_HARDWARE_UNIT = 0,
 	ACPI_DMAR_TYPE_RESERVED_MEMORY = 1,
-	ACPI_DMAR_TYPE_RESERVED = 2	/* 2 and greater are reserved */
+	ACPI_DMAR_TYPE_ATSR = 2,
+	ACPI_DMAR_TYPE_RESERVED = 3	/* 3 and greater are reserved */
 };
 
 struct acpi_dmar_device_scope {
 	u8 entry_type;
 	u8 length;
-	u8 segment;
+	u16 reserved;
+	u8 enumeration_id;
 	u8 bus;
 };
 
@@ -290,7 +291,14 @@
 	ACPI_DMAR_SCOPE_TYPE_NOT_USED = 0,
 	ACPI_DMAR_SCOPE_TYPE_ENDPOINT = 1,
 	ACPI_DMAR_SCOPE_TYPE_BRIDGE = 2,
-	ACPI_DMAR_SCOPE_TYPE_RESERVED = 3	/* 3 and greater are reserved */
+	ACPI_DMAR_SCOPE_TYPE_IOAPIC = 3,
+	ACPI_DMAR_SCOPE_TYPE_HPET = 4,
+	ACPI_DMAR_SCOPE_TYPE_RESERVED = 5	/* 5 and greater are reserved */
+};
+
+struct acpi_dmar_pci_path {
+	u8 dev;
+	u8 fn;
 };
 
 /*
@@ -301,6 +309,9 @@
 
 struct acpi_dmar_hardware_unit {
 	struct acpi_dmar_header header;
+	u8 flags;
+	u8 reserved;
+	u16 segment;
 	u64 address;		/* Register Base Address */
 };
 
@@ -312,7 +323,9 @@
 
 struct acpi_dmar_reserved_memory {
 	struct acpi_dmar_header header;
-	u64 address;		/* 4_k aligned base address */
+	u16 reserved;
+	u16 segment;
+	u64 base_address;		/* 4_k aligned base address */
 	u64 end_address;	/* 4_k aligned limit address */
 };
 
Index: linux-2.6.22-rc4-mm2/include/linux/dmar.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.22-rc4-mm2/include/linux/dmar.h	2007-06-19 13:43:14.000000000 -0700
@@ -0,0 +1,52 @@
+/*
+ * Copyright (c) 2006, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * Copyright (C) Ashok Raj <ashok.raj@intel.com>
+ * Copyright (C) Shaohua Li <shaohua.li@intel.com>
+ */
+
+#ifndef __DMAR_H__
+#define __DMAR_H__
+
+#include <linux/acpi.h>
+#include <linux/types.h>
+
+
+extern int dmar_table_init(void);
+extern int early_dmar_detect(void);
+
+extern struct list_head dmar_drhd_units;
+extern struct list_head dmar_rmrr_units;
+
+struct dmar_drhd_unit {
+	struct list_head list;		/* list of drhd units	*/
+	u64	reg_base_addr;		/* register base address*/
+	struct	pci_dev **devices; 	/* target device array	*/
+	int	devices_cnt;		/* target device count	*/
+	u8	ignored:1; 		/* ignore drhd		*/
+	u8	include_all:1;
+	struct intel_iommu *iommu;
+};
+
+struct dmar_rmrr_unit {
+	struct list_head list;		/* list of rmrr units	*/
+	u64	base_address;		/* reserved base address*/
+	u64	end_address;		/* reserved end address */
+	struct pci_dev **devices;	/* target devices */
+	int	devices_cnt;		/* target device count */
+};
+
+#endif /* __DMAR_H__ */

-- 


* [Intel IOMMU 02/10] PCI generic helper function
  2007-06-19 21:37 [Intel IOMMU 00/10] Intel IOMMU support, take #2 Keshavamurthy, Anil S
  2007-06-19 21:37 ` [Intel IOMMU 01/10] DMAR detection and parsing logic Keshavamurthy, Anil S
@ 2007-06-19 21:37 ` Keshavamurthy, Anil S
  2007-06-26  5:49   ` Andrew Morton
  2007-06-19 21:37 ` [Intel IOMMU 03/10] clflush_cache_range now takes size param Keshavamurthy, Anil S
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-19 21:37 UTC (permalink / raw)
  To: akpm, linux-kernel
  Cc: ak, gregkh, muli, suresh.b.siddha, arjan, ashok.raj, davem,
	clameter, Anil S Keshavamurthy

[-- Attachment #1: pcie_port_type.patch --]
[-- Type: text/plain, Size: 4219 bytes --]

When devices are under a P2P bridge, the source id of upstream
transactions is replaced by the device id of the bridge, as the bridge owns
the PCIe transaction. Hence it is necessary to set up translations on behalf
of the bridge as well. Due to this limitation, all devices under a P2P bridge
share the same domain in a DMAR.

We simply cache whether a device is a native PCIe device
or not, for later use.
Changes from previous posting:
1) Fixed mostly coding style issues, nothing major.

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
---
 drivers/pci/pci.h    |    1 +
 drivers/pci/probe.c  |   14 ++++++++++++++
 drivers/pci/search.c |   30 ++++++++++++++++++++++++++++++
 include/linux/pci.h  |    2 ++
 4 files changed, 47 insertions(+)

Index: linux-2.6.22-rc4-mm2/drivers/pci/pci.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/drivers/pci/pci.h	2007-06-18 15:44:45.000000000 -0700
+++ linux-2.6.22-rc4-mm2/drivers/pci/pci.h	2007-06-18 15:45:07.000000000 -0700
@@ -92,3 +92,4 @@
 	return NULL;
 }
 
+struct pci_dev *pci_find_upstream_pcie_bridge(struct pci_dev *pdev);
Index: linux-2.6.22-rc4-mm2/drivers/pci/probe.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/drivers/pci/probe.c	2007-06-18 15:44:45.000000000 -0700
+++ linux-2.6.22-rc4-mm2/drivers/pci/probe.c	2007-06-18 15:45:07.000000000 -0700
@@ -851,6 +851,19 @@
 	kfree(pci_dev);
 }
 
+static void set_pcie_port_type(struct pci_dev *pdev)
+{
+	int pos;
+	u16 reg16;
+
+	pos = pci_find_capability(pdev, PCI_CAP_ID_EXP);
+	if (!pos)
+		return;
+	pdev->is_pcie = 1;
+	pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, &reg16);
+	pdev->pcie_type = (reg16 & PCI_EXP_FLAGS_TYPE) >> 4;
+}
+
 /**
  * pci_cfg_space_size - get the configuration space size of the PCI device.
  * @dev: PCI device
@@ -965,6 +978,7 @@
 	dev->device = (l >> 16) & 0xffff;
 	dev->cfg_size = pci_cfg_space_size(dev);
 	dev->error_state = pci_channel_io_normal;
+	set_pcie_port_type(dev);
 
 	/* Assume 32-bit PCI; let 64-bit PCI cards (which are far rarer)
 	   set this higher, assuming the system even supports it.  */
Index: linux-2.6.22-rc4-mm2/drivers/pci/search.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/drivers/pci/search.c	2007-06-18 15:44:45.000000000 -0700
+++ linux-2.6.22-rc4-mm2/drivers/pci/search.c	2007-06-18 15:45:07.000000000 -0700
@@ -14,6 +14,36 @@
 #include "pci.h"
 
 DECLARE_RWSEM(pci_bus_sem);
+/*
+ * find the upstream PCIE-to-PCI bridge of a PCI device
+ * if the device is PCIE, return NULL
+ * if the device isn't connected to a PCIE bridge (that is its parent is a
+ * legacy PCI bridge and the bridge is directly connected to bus 0), return its
+ * parent
+ */
+struct pci_dev *
+pci_find_upstream_pcie_bridge(struct pci_dev *pdev)
+{
+	struct pci_dev *tmp = NULL;
+
+	if (pdev->is_pcie)
+		return NULL;
+	while (1) {
+		if (!pdev->bus->self)
+			break;
+		pdev = pdev->bus->self;
+		/* a p2p bridge */
+		if (!pdev->is_pcie) {
+			tmp = pdev;
+			continue;
+		}
+		/* PCI device should connect to a PCIE bridge */
+		BUG_ON(pdev->pcie_type != PCI_EXP_TYPE_PCI_BRIDGE);
+		return pdev;
+	}
+
+	return tmp;
+}
 
 static struct pci_bus *pci_do_find_bus(struct pci_bus *bus, unsigned char busnr)
 {
Index: linux-2.6.22-rc4-mm2/include/linux/pci.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/pci.h	2007-06-18 15:44:45.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/pci.h	2007-06-18 15:45:07.000000000 -0700
@@ -140,6 +140,7 @@
 	unsigned short	subsystem_device;
 	unsigned int	class;		/* 3 bytes: (base,sub,prog-if) */
 	u8		hdr_type;	/* PCI header type (`multi' flag masked out) */
+	u8		pcie_type;	/* PCI-E device/port type */
 	u8		rom_base_reg;	/* which config register controls the ROM */
 	u8		pin;  		/* which interrupt pin this device uses */
 
@@ -182,6 +183,7 @@
 	unsigned int 	msi_enabled:1;
 	unsigned int	msix_enabled:1;
 	unsigned int	is_managed:1;
+	unsigned int	is_pcie:1;
 	atomic_t	enable_cnt;	/* pci_enable_device has been called */
 
 	u32		saved_config_space[16]; /* config space saved at suspend time */

-- 


* [Intel IOMMU 03/10] clflush_cache_range now takes size param
  2007-06-19 21:37 [Intel IOMMU 00/10] Intel IOMMU support, take #2 Keshavamurthy, Anil S
  2007-06-19 21:37 ` [Intel IOMMU 01/10] DMAR detection and parsing logic Keshavamurthy, Anil S
  2007-06-19 21:37 ` [Intel IOMMU 02/10] PCI generic helper function Keshavamurthy, Anil S
@ 2007-06-19 21:37 ` Keshavamurthy, Anil S
  2007-06-19 21:37 ` [Intel IOMMU 04/10] IOVA allocation and management routines Keshavamurthy, Anil S
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-19 21:37 UTC (permalink / raw)
  To: akpm, linux-kernel
  Cc: ak, gregkh, muli, suresh.b.siddha, arjan, ashok.raj, davem,
	clameter, Anil S Keshavamurthy

[-- Attachment #1: clflush_cache_range.patch --]
[-- Type: text/plain, Size: 1727 bytes --]

	Introduce the size param for clflush_cache_range().
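
A brief, hedged example of why the size parameter helps (illustrative only;
struct example_entry is hypothetical): later patches in this series flush
individual table entries rather than whole pages.

#include <linux/types.h>
#include <asm/cacheflush.h>

struct example_entry {		/* hypothetical 16-byte table entry */
	u64 lo;
	u64 hi;
};

static void example_update(struct example_entry *e)
{
	e->lo |= 1;
	/* flush just this entry, not a whole page */
	clflush_cache_range(e, sizeof(*e));
}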

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>

---
 arch/x86_64/mm/pageattr.c       |    6 +++---
 include/asm-x86_64/cacheflush.h |    1 +
 2 files changed, 4 insertions(+), 3 deletions(-)

Index: linux-2.6.22-rc4-mm2/arch/x86_64/mm/pageattr.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/arch/x86_64/mm/pageattr.c	2007-06-18 15:45:39.000000000 -0700
+++ linux-2.6.22-rc4-mm2/arch/x86_64/mm/pageattr.c	2007-06-18 15:45:46.000000000 -0700
@@ -61,10 +61,10 @@
 	return base;
 } 
 
-static void cache_flush_page(void *adr)
+void clflush_cache_range(void *adr, int size)
 {
 	int i;
-	for (i = 0; i < PAGE_SIZE; i += boot_cpu_data.x86_clflush_size)
+	for (i = 0; i < size; i += boot_cpu_data.x86_clflush_size)
 		asm volatile("clflush (%0)" :: "r" (adr + i));
 }
 
@@ -80,7 +80,7 @@
 	list_for_each_entry(pg, l, lru) {
 		void *adr = page_address(pg);
 		if (cpu_has_clflush)
-			cache_flush_page(adr);
+			clflush_cache_range(adr, PAGE_SIZE);
 	}
 	__flush_tlb_all();
 }
Index: linux-2.6.22-rc4-mm2/include/asm-x86_64/cacheflush.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/asm-x86_64/cacheflush.h	2007-06-18 15:45:39.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/asm-x86_64/cacheflush.h	2007-06-18 15:45:46.000000000 -0700
@@ -27,6 +27,7 @@
 void global_flush_tlb(void); 
 int change_page_attr(struct page *page, int numpages, pgprot_t prot);
 int change_page_attr_addr(unsigned long addr, int numpages, pgprot_t prot);
+void clflush_cache_range(void *addr, int size);
 
 #ifdef CONFIG_DEBUG_RODATA
 void mark_rodata_ro(void);

-- 


* [Intel IOMMU 04/10] IOVA allocation and management routines
  2007-06-19 21:37 [Intel IOMMU 00/10] Intel IOMMU support, take #2 Keshavamurthy, Anil S
                   ` (2 preceding siblings ...)
  2007-06-19 21:37 ` [Intel IOMMU 03/10] clflush_cache_range now takes size param Keshavamurthy, Anil S
@ 2007-06-19 21:37 ` Keshavamurthy, Anil S
  2007-06-26  6:07   ` Andrew Morton
  2007-06-19 21:37 ` [Intel IOMMU 05/10] Intel IOMMU driver Keshavamurthy, Anil S
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-19 21:37 UTC (permalink / raw)
  To: akpm, linux-kernel
  Cc: ak, gregkh, muli, suresh.b.siddha, arjan, ashok.raj, davem,
	clameter, Anil S Keshavamurthy

[-- Attachment #1: generic_iova.patch --]
[-- Type: text/plain, Size: 12945 bytes --]

	This code implements generic IOVA allocation and
management. As per Dave's suggestion we now allocate
IO virtual addresses from the higher end of the DMA limit
rather than from the lower end, which eliminated the need to
preserve the IO virtual address for multiple devices sharing
the same domain virtual address space.

This code also uses red-black trees to store the allocated and
reserved iova nodes, which showed a good performance improvement
over the previous linear linked list.
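
A hedged usage sketch of the API introduced here; example_iova_usage() is
hypothetical, and alloc_iova_mem()/free_iova_mem() are assumed to be
provided by the user of this code (the IOMMU driver does so in a later
patch of this series):

#include <linux/errno.h>
#include "iova.h"

static int example_iova_usage(struct iova_domain *iovad)
{
	struct iova *iova;

	init_iova_domain(iovad);

	/* allocation works downwards from the given limit pfn */
	iova = alloc_iova(iovad, 16, DMA_32BIT_PFN);
	if (!iova)
		return -ENOMEM;

	/* ... program IOMMU page tables for iova->pfn_lo .. iova->pfn_hi ... */

	__free_iova(iovad, iova);
	put_iova_domain(iovad);
	return 0;
}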

Changes from previous posting:
1) Fixed mostly coding style issues

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
---
 drivers/pci/iova.c |  351 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/pci/iova.h |   62 +++++++++
 2 files changed, 413 insertions(+)

Index: linux-2.6.22-rc4-mm2/drivers/pci/iova.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.22-rc4-mm2/drivers/pci/iova.c	2007-06-18 15:45:46.000000000 -0700
@@ -0,0 +1,351 @@
+/*
+ * Copyright (c) 2006, Intel Corporation.
+ *
+ * This file is released under the GPLv2.
+ *
+ * Copyright (C) 2006 Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
+ */
+
+#include "iova.h"
+
+void
+init_iova_domain(struct iova_domain *iovad)
+{
+	spin_lock_init(&iovad->iova_alloc_lock);
+	spin_lock_init(&iovad->iova_rbtree_lock);
+	iovad->rbroot = RB_ROOT;
+	iovad->cached32_node = NULL;
+
+}
+
+static struct rb_node *
+__get_cached_rbnode(struct iova_domain *iovad, unsigned long *limit_pfn)
+{
+	if ((*limit_pfn != DMA_32BIT_PFN) ||
+		(iovad->cached32_node == NULL))
+		return rb_last(&iovad->rbroot);
+	else {
+		struct rb_node *prev_node = rb_prev(iovad->cached32_node);
+		struct iova *curr_iova =
+			container_of(iovad->cached32_node, struct iova, node);
+		*limit_pfn = curr_iova->pfn_lo - 1;
+		return prev_node;
+	}
+}
+
+static inline void
+__cached_rbnode_insert_update(struct iova_domain *iovad,
+	unsigned long limit_pfn, struct iova *new)
+{
+	if (limit_pfn != DMA_32BIT_PFN)
+		return;
+	iovad->cached32_node = &new->node;
+}
+
+static inline void
+__cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
+{
+	struct iova *cached_iova;
+	struct rb_node *curr;
+
+	if (!iovad->cached32_node)
+		return;
+	curr = iovad->cached32_node;
+	cached_iova = container_of(curr, struct iova, node);
+
+	if (free->pfn_lo >= cached_iova->pfn_lo)
+		iovad->cached32_node = rb_next(&free->node);
+}
+
+static inline int __alloc_iova_range(struct iova_domain *iovad,
+	unsigned long size, unsigned long limit_pfn, struct iova *new)
+{
+	struct rb_node *curr = NULL;
+	unsigned long flags;
+	unsigned long saved_pfn;
+
+	/* Walk the tree backwards */
+	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	saved_pfn = limit_pfn;
+	curr = __get_cached_rbnode(iovad, &limit_pfn);
+	while (curr) {
+		struct iova *curr_iova = container_of(curr, struct iova, node);
+		if (limit_pfn < curr_iova->pfn_lo)
+			goto move_left;
+		if (limit_pfn < curr_iova->pfn_hi)
+			goto adjust_limit_pfn;
+		if ((curr_iova->pfn_hi + size) <= limit_pfn)
+			break;	/* found a free slot */
+adjust_limit_pfn:
+		limit_pfn = curr_iova->pfn_lo - 1;
+move_left:
+		curr = rb_prev(curr);
+	}
+
+	if ((!curr) && !(IOVA_START_PFN + size <= limit_pfn)) {
+		spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+		return -ENOMEM;
+	}
+	new->pfn_hi = limit_pfn;
+	new->pfn_lo = limit_pfn - size + 1;
+
+	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+	return 0;
+}
+
+static void
+iova_insert_rbtree(struct rb_root *root, struct iova *iova)
+{
+	struct rb_node **new = &(root->rb_node), *parent = NULL;
+	/* Figure out where to put new node */
+	while (*new) {
+		struct iova *this = container_of(*new, struct iova, node);
+		parent = *new;
+
+		if (iova->pfn_lo < this->pfn_lo)
+			new = &((*new)->rb_left);
+		else if (iova->pfn_lo > this->pfn_lo)
+			new = &((*new)->rb_right);
+		else
+			BUG(); /* this should not happen */
+	}
+	/* Add new node and rebalance tree. */
+	rb_link_node(&iova->node, parent, new);
+	rb_insert_color(&iova->node, root);
+}
+
+/**
+ * alloc_iova - allocates an iova
+ * @iovad: iova domain in question
+ * @size: number of page frames to allocate
+ * @limit_pfn: maximum limit address
+ * This function allocates an iova in the range IOVA_START_PFN to limit_pfn,
+ * searching downwards from limit_pfn rather than upwards from IOVA_START_PFN.
+ */
+struct iova *
+alloc_iova(struct iova_domain *iovad, unsigned long size,
+	unsigned long limit_pfn)
+{
+	unsigned long flags;
+	struct iova *new_iova;
+	int ret;
+
+	new_iova = alloc_iova_mem();
+	if (!new_iova)
+		return NULL;
+
+	spin_lock_irqsave(&iovad->iova_alloc_lock, flags);
+	ret = __alloc_iova_range(iovad, size, limit_pfn, new_iova);
+
+	if (ret) {
+		spin_unlock_irqrestore(&iovad->iova_alloc_lock, flags);
+		free_iova_mem(new_iova);
+		return NULL;
+	}
+
+	/* Insert the new_iova into domain rbtree by holding writer lock */
+	spin_lock(&iovad->iova_rbtree_lock);
+	iova_insert_rbtree(&iovad->rbroot, new_iova);
+	__cached_rbnode_insert_update(iovad, limit_pfn, new_iova);
+	spin_unlock(&iovad->iova_rbtree_lock);
+
+	spin_unlock_irqrestore(&iovad->iova_alloc_lock, flags);
+
+	return new_iova;
+}
+
+/**
+ * find_iova - finds an iova for a given pfn
+ * @iovad: iova domain in question
+ * @pfn: page frame number
+ * This function finds and returns an iova belonging to the
+ * given domain which matches the given pfn.
+ */
+struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn)
+{
+	unsigned long flags;
+	struct rb_node *node;
+
+	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	node = iovad->rbroot.rb_node;
+	while (node) {
+		struct iova *iova = container_of(node, struct iova, node);
+
+		/* If pfn falls within iova's range, return iova */
+		if ((pfn >= iova->pfn_lo) && (pfn <= iova->pfn_hi)) {
+			spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+			return iova;
+		}
+
+		if (pfn < iova->pfn_lo)
+			node = node->rb_left;
+		else if (pfn > iova->pfn_lo)
+			node = node->rb_right;
+	}
+
+	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+	return NULL;
+}
+
+/**
+ * __free_iova - frees the given iova
+ * @iovad: iova domain in question.
+ * @iova: iova in question.
+ * Frees the given iova belonging to the given domain
+ */
+void
+__free_iova(struct iova_domain *iovad, struct iova *iova)
+{
+	unsigned long flags;
+
+	if (iova) {
+		spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+		__cached_rbnode_delete_update(iovad, iova);
+		rb_erase(&iova->node, &iovad->rbroot);
+		spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+		free_iova_mem(iova);
+	}
+}
+
+/**
+ * free_iova - finds and frees the iova for a given pfn
+ * @iovad: iova domain in question.
+ * @pfn: pfn that was allocated previously
+ * This function finds an iova for a given pfn and then
+ * frees the iova from that domain.
+ */
+void
+free_iova(struct iova_domain *iovad, unsigned long pfn)
+{
+	struct iova *iova = find_iova(iovad, pfn);
+	__free_iova(iovad, iova);
+
+}
+
+/**
+ * put_iova_domain - destroys the iova domain
+ * @iovad: iova domain in question.
+ * All the iovas in that domain are destroyed.
+ */
+void put_iova_domain(struct iova_domain *iovad)
+{
+	struct rb_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	node = rb_first(&iovad->rbroot);
+	while (node) {
+		struct iova *iova = container_of(node, struct iova, node);
+		rb_erase(node, &iovad->rbroot);
+		free_iova_mem(iova);
+		node = rb_first(&iovad->rbroot);
+	}
+	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+}
+
+static inline int
+__is_range_overlap(struct rb_node *node,
+	unsigned long pfn_lo, unsigned long pfn_hi)
+{
+	struct iova * iova = container_of(node, struct iova, node);
+
+	if ((pfn_lo <= iova->pfn_hi) && (pfn_hi >= iova->pfn_lo))
+		return 1;
+	return 0;
+}
+
+static inline struct iova *
+__insert_new_range(struct iova_domain *iovad,
+	unsigned long pfn_lo, unsigned long pfn_hi)
+{
+	struct iova *iova;
+
+	iova = alloc_iova_mem();
+	if (!iova)
+		return iova;
+
+	iova->pfn_hi = pfn_hi;
+	iova->pfn_lo = pfn_lo;
+	iova_insert_rbtree(&iovad->rbroot, iova);
+	return iova;
+}
+
+static inline void
+__adjust_overlap_range(struct iova *iova,
+	unsigned long *pfn_lo, unsigned long *pfn_hi)
+{
+	if (*pfn_lo < iova->pfn_lo)
+		iova->pfn_lo = *pfn_lo;
+	if (*pfn_hi > iova->pfn_hi)
+		*pfn_lo = iova->pfn_hi + 1;
+}
+
+/**
+ * reserve_iova - reserves an iova in the given range
+ * @iovad: iova domain pointer
+ * @pfn_lo: lower page frame address
+ * @pfn_hi: higher page frame address
+ * This function reserves the address range from pfn_lo to pfn_hi so
+ * that this range is not handed out as part of alloc_iova.
+ */
+struct iova *
+reserve_iova(struct iova_domain *iovad,
+	unsigned long pfn_lo, unsigned long pfn_hi)
+{
+	struct rb_node *node;
+	unsigned long flags;
+	struct iova *iova;
+	unsigned int overlap = 0;
+
+	spin_lock_irqsave(&iovad->iova_alloc_lock, flags);
+	spin_lock(&iovad->iova_rbtree_lock);
+	for (node = rb_first(&iovad->rbroot); node; node = rb_next(node)) {
+		if (__is_range_overlap(node, pfn_lo, pfn_hi)) {
+			iova = container_of(node, struct iova, node);
+			__adjust_overlap_range(iova, &pfn_lo, &pfn_hi);
+			if ((pfn_lo >= iova->pfn_lo) &&
+				(pfn_hi <= iova->pfn_hi))
+				goto finish;
+			overlap = 1;
+
+		} else if (overlap)
+				break;
+	}
+
+	/* We are here either because this is the first reserved node
+	 * or we need to insert the remaining non-overlapping addr range
+	 */
+	iova = __insert_new_range(iovad, pfn_lo, pfn_hi);
+finish:
+
+	spin_unlock(&iovad->iova_rbtree_lock);
+	spin_unlock_irqrestore(&iovad->iova_alloc_lock, flags);
+	return iova;
+}
+
+/**
+ * copy_reserved_iova - copies the reserved iova ranges between domains
+ * @from: source domain to copy from
+ * @to: destination domain to copy to
+ * This function copies reserved iovas from one domain to
+ * another.
+ */
+void
+copy_reserved_iova(struct iova_domain *from, struct iova_domain *to)
+{
+	unsigned long flags;
+	struct rb_node *node;
+
+	spin_lock_irqsave(&from->iova_alloc_lock, flags);
+	spin_lock(&from->iova_rbtree_lock);
+	for (node = rb_first(&from->rbroot); node; node = rb_next(node)) {
+		struct iova *iova = container_of(node, struct iova, node);
+		struct iova *new_iova;
+		new_iova = reserve_iova(to, iova->pfn_lo, iova->pfn_hi);
+		if (!new_iova)
+			printk(KERN_ERR "Reserve iova range %lx@%lx failed\n",
+				iova->pfn_lo, iova->pfn_lo);
+	}
+	spin_unlock(&from->iova_rbtree_lock);
+	spin_unlock_irqrestore(&from->iova_alloc_lock, flags);
+}
Index: linux-2.6.22-rc4-mm2/drivers/pci/iova.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.22-rc4-mm2/drivers/pci/iova.h	2007-06-18 15:45:46.000000000 -0700
@@ -0,0 +1,62 @@
+/*
+ * Copyright (c) 2006, Intel Corporation.
+ *
+ * This file is released under the GPLv2.
+ *
+ * Copyright (C) 2006 Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
+ *
+ */
+
+#ifndef _IOVA_H_
+#define _IOVA_H_
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/rbtree.h>
+#include <linux/dma-mapping.h>
+
+/*
+ * We need a fixed PAGE_SIZE of 4K irrespective of
+ * arch PAGE_SIZE for IOMMU page tables.
+ */
+#define PAGE_SHIFT_4K		(12)
+#define PAGE_SIZE_4K		(1UL << PAGE_SHIFT_4K)
+#define PAGE_MASK_4K		(((u64)-1) << PAGE_SHIFT_4K)
+#define PAGE_ALIGN_4K(addr)	(((addr) + PAGE_SIZE_4K - 1) & PAGE_MASK_4K)
+
+#define IOVA_START_ADDR		(0x1000)
+#define IOVA_START_PFN		(IOVA_START_ADDR >> PAGE_SHIFT_4K)
+
+#define IOVA_PFN(addr)		((addr) >> PAGE_SHIFT_4K)
+#define DMA_32BIT_PFN	IOVA_PFN(DMA_32BIT_MASK)
+#define DMA_64BIT_PFN	IOVA_PFN(DMA_64BIT_MASK)
+
+/* iova structure */
+struct iova {
+	struct rb_node	node;
+	unsigned long	pfn_hi; /* IOMMU dish out addr hi */
+	unsigned long	pfn_lo; /* IOMMU dish out addr lo */
+};
+
+/* holds all the iova translations for a domain */
+struct iova_domain {
+	spinlock_t	iova_alloc_lock;/* Lock to protect iova  allocation */
+	spinlock_t	iova_rbtree_lock; /* Lock to protect update of rbtree */
+	struct rb_root	rbroot;		/* iova domain rbtree root */
+	struct rb_node	*cached32_node; /* Save last alloced node */
+};
+
+struct iova *alloc_iova_mem(void);
+void free_iova_mem(struct iova *iova);
+void free_iova(struct iova_domain *iovad, unsigned long pfn);
+void __free_iova(struct iova_domain *iovad, struct iova *iova);
+struct iova * alloc_iova(struct iova_domain *iovad, unsigned long size,
+	unsigned long limit_pfn);
+struct iova * reserve_iova(struct iova_domain *iovad, unsigned long pfn_lo,
+	unsigned long pfn_hi);
+void copy_reserved_iova(struct iova_domain *from, struct iova_domain *to);
+void init_iova_domain(struct iova_domain *iovad);
+struct iova * find_iova(struct iova_domain *iovad, unsigned long pfn);
+void put_iova_domain(struct iova_domain *iovad);
+
+#endif

-- 


* [Intel IOMMU 05/10] Intel IOMMU driver
  2007-06-19 21:37 [Intel IOMMU 00/10] Intel IOMMU support, take #2 Keshavamurthy, Anil S
                   ` (3 preceding siblings ...)
  2007-06-19 21:37 ` [Intel IOMMU 04/10] IOVA allocation and management routines Keshavamurthy, Anil S
@ 2007-06-19 21:37 ` Keshavamurthy, Anil S
  2007-06-19 23:32   ` Christoph Lameter
                     ` (2 more replies)
  2007-06-19 21:37 ` [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls Keshavamurthy, Anil S
                   ` (5 subsequent siblings)
  10 siblings, 3 replies; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-19 21:37 UTC (permalink / raw)
  To: akpm, linux-kernel
  Cc: ak, gregkh, muli, suresh.b.siddha, arjan, ashok.raj, davem,
	clameter, Anil S Keshavamurthy

[-- Attachment #1: intel_iommu-3.patch --]
[-- Type: text/plain, Size: 68765 bytes --]

	The actual Intel IOMMU driver. The hardware spec can be found at:
http://www.intel.com/technology/virtualization

This driver sets the x86_64 'dma_ops', so it hooks into the standard DMA
APIs. In this way, PCI drivers get virtual DMA addresses. This change is
transparent to PCI drivers.
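
For illustration, a typical driver mapping sequence that stays the same
with or without DMAR enabled; example_do_dma() is hypothetical, and pdev,
buf and len are assumed to come from the driver:

#include <linux/errno.h>
#include <linux/pci.h>

static int example_do_dma(struct pci_dev *pdev, void *buf, size_t len)
{
	dma_addr_t dma;

	dma = pci_map_single(pdev, buf, len, PCI_DMA_TODEVICE);
	if (pci_dma_mapping_error(dma))
		return -EIO;

	/* ... hand 'dma' to the device and wait for the DMA to complete ... */

	pci_unmap_single(pdev, dma, len, PCI_DMA_TODEVICE);
	return 0;
}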

Changes from previous postings:
1) Fixed all the coding style errors - checkpatch.pl passes this patch
2) Addressed all of Andrew's comments
3) Removed the resource pool (a.k.a. pre-allocated pool)
4) Now uses the standard kmem_cache_alloc functions to allocate memory
   during DMA map API calls.


Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
---
 Documentation/Intel-IOMMU.txt       |   93 +
 Documentation/kernel-parameters.txt |   10 
 arch/x86_64/kernel/pci-dma.c        |    5 
 drivers/pci/Makefile                |    5 
 drivers/pci/intel-iommu.c           | 1956 ++++++++++++++++++++++++++++++++++++
 drivers/pci/intel-iommu.h           |  318 +++++
 include/linux/dmar.h                |   22 
 7 files changed, 2408 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc4-mm2/Documentation/Intel-IOMMU.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.22-rc4-mm2/Documentation/Intel-IOMMU.txt	2007-06-19 14:11:43.000000000 -0700
@@ -0,0 +1,93 @@
+Linux IOMMU Support
+===================
+
+The architecture spec can be obtained from the below location.
+
+http://www.intel.com/technology/virtualization/
+
+This guide gives a quick cheat sheet for some basic understanding.
+
+Some Keywords
+
+DMAR - DMA remapping
+DRHD - DMA Remapping Hardware Unit Definition
+RMRR - Reserved memory Region Reporting Structure
+ZLR  - Zero length reads from PCI devices
+IOVA - IO Virtual address.
+
+Basic stuff
+-----------
+
+ACPI enumerates and lists the different DMA engines in the platform, and
+device scope relationships between PCI devices and which DMA engine  controls
+them.
+
+What is RMRR?
+-------------
+
+There are some devices the BIOS controls, e.g. USB devices used to perform
+PS2 emulation. The regions of memory used by these devices are marked
+reserved in the e820 map. When we turn on DMA translation, DMA to those
+regions will fail. Hence the BIOS uses RMRR to specify these regions along
+with the devices that need to access them. The OS is expected to set up
+unity mappings for these regions so that these devices can access them.
+
+How is IOVA generated?
+---------------------
+
+Well-behaved drivers call pci_map_*() before sending commands to a device
+that needs to perform DMA. Once the DMA is completed and the mapping is no
+longer required, the driver calls pci_unmap_*() to unmap the region.
+
+The Intel IOMMU driver allocates a virtual address per domain. Each PCIE
+device has its own domain (hence protection). Devices under p2p bridges
+share the virtual address with all devices under the p2p bridge due to
+transaction id aliasing for p2p bridges.
+
+IOVA generation is pretty generic. We used the same technique as vmalloc()
+but these are not global address spaces; they are separate for each domain.
+Different DMA engines may support different numbers of domains.
+
+We also allocate guard pages with each mapping, so we can attempt to catch
+any overflow that might happen.
+
+
+Graphics Problems?
+------------------
+If you encounter issues with graphics devices, you can try adding
+option intel_iommu=igfx_off to turn off the integrated graphics engine.
+
+Some exceptions to IOVA
+-----------------------
+Interrupt ranges are not address translated (0xfee00000 - 0xfeefffff).
+The same is true for peer to peer transactions. Hence we reserve the
+addresses from the PCI MMIO ranges so they are not allocated as IOVA addresses.
+
+Boot Message Sample
+-------------------
+
+Something like this gets printed indicating presence of DMAR tables
+in ACPI.
+
+ACPI: DMAR (v001 A M I  OEMDMAR  0x00000001 MSFT 0x00000097) @ 0x000000007f5b5ef0
+
+When DMAR is being processed and initialized by ACPI, prints DMAR locations
+and any RMRR's processed.
+
+ACPI DMAR:Host address width 36
+ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000
+ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed91000
+ACPI DMAR:DRHD (flags: 0x00000001)base: 0x00000000fed93000
+ACPI DMAR:RMRR base: 0x00000000000ed000 end: 0x00000000000effff
+ACPI DMAR:RMRR base: 0x000000007f600000 end: 0x000000007fffffff
+
+When DMAR is enabled for use, you will notice..
+
+PCI-DMA: Using DMAR IOMMU
+
+TBD
+----
+
+- For compatibility testing, could use unity map domain for all devices, just
+  provide a 1-1 for all useful memory under a single domain for all devices.
+- API for paravirt ops for abstracting functionality for VMM folks.
Index: linux-2.6.22-rc4-mm2/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.22-rc4-mm2.orig/Documentation/kernel-parameters.txt	2007-06-19 13:43:14.000000000 -0700
+++ linux-2.6.22-rc4-mm2/Documentation/kernel-parameters.txt	2007-06-19 14:11:45.000000000 -0700
@@ -779,6 +779,16 @@
 
 	inttest=	[IA64]
 
+	intel_iommu=	[DMAR] Intel IOMMU driver (DMAR) option
+		off
+			Disable intel iommu driver.
+		igfx_off [Default Off]
+			By default, gfx is mapped as normal device. If a gfx
+			device has a dedicated DMAR unit, the DMAR unit is
+			bypassed by not enabling DMAR with this option. In
+			this case, gfx device will use physical address for
+			DMA.
+
 	io7=		[HW] IO7 for Marvel based alpha systems
 			See comment before marvel_specify_io7 in
 			arch/alpha/kernel/core_marvel.c.
Index: linux-2.6.22-rc4-mm2/arch/x86_64/kernel/pci-dma.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/arch/x86_64/kernel/pci-dma.c	2007-06-19 13:43:14.000000000 -0700
+++ linux-2.6.22-rc4-mm2/arch/x86_64/kernel/pci-dma.c	2007-06-19 14:06:20.000000000 -0700
@@ -7,6 +7,7 @@
 #include <linux/string.h>
 #include <linux/pci.h>
 #include <linux/module.h>
+#include <linux/dmar.h>
 #include <asm/io.h>
 #include <asm/proto.h>
 #include <asm/calgary.h>
@@ -302,6 +303,8 @@
 	detect_calgary();
 #endif
 
+	detect_intel_iommu();
+
 #ifdef CONFIG_SWIOTLB
 	pci_swiotlb_init();
 #endif
@@ -313,6 +316,8 @@
 	calgary_iommu_init();
 #endif
 
+	intel_iommu_init();
+
 #ifdef CONFIG_IOMMU
 	gart_iommu_init();
 #endif
Index: linux-2.6.22-rc4-mm2/drivers/pci/Makefile
===================================================================
--- linux-2.6.22-rc4-mm2.orig/drivers/pci/Makefile	2007-06-19 14:06:20.000000000 -0700
+++ linux-2.6.22-rc4-mm2/drivers/pci/Makefile	2007-06-19 14:06:20.000000000 -0700
@@ -21,7 +21,10 @@
 obj-$(CONFIG_HT_IRQ) += htirq.o
 
 # Build Intel IOMMU support
-obj-$(CONFIG_DMAR) += dmar.o
+obj-$(CONFIG_DMAR) += dmar.o iova.o intel-iommu.o
+
+#Build Intel-IOMMU support
+obj-$(CONFIG_DMAR) += iova.o dmar.o intel-iommu.o
 
 #
 # Some architectures use the generic PCI setup functions
Index: linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.c	2007-06-19 14:13:07.000000000 -0700
@@ -0,0 +1,1956 @@
+/*
+ * Copyright (c) 2006, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * Copyright (C) Ashok Raj <ashok.raj@intel.com>
+ * Copyright (C) Shaohua Li <shaohua.li@intel.com>
+ * Copyright (C) Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
+ */
+
+#include <linux/init.h>
+#include <linux/bitmap.h>
+#include <linux/slab.h>
+#include <linux/irq.h>
+#include <linux/interrupt.h>
+#include <linux/sysdev.h>
+#include <linux/spinlock.h>
+#include <linux/pci.h>
+#include <linux/dmar.h>
+#include <linux/dma-mapping.h>
+#include <linux/mempool.h>
+#include "iova.h"
+#include "intel-iommu.h"
+#include <asm/proto.h> /* force_iommu in this header in x86-64*/
+#include <asm/cacheflush.h>
+#include "pci.h"
+
+#define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
+#define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA)
+
+#define IOAPIC_RANGE_START	(0xfee00000)
+#define IOAPIC_RANGE_END	(0xfeefffff)
+#define IOVA_START_ADDR		(0x1000)
+
+#define DEFAULT_DOMAIN_ADDRESS_WIDTH 48
+
+#define DMAR_OPERATION_TIMEOUT (HZ*60) /* 1m */
+
+#define DOMAIN_MAX_ADDR(gaw) ((((u64)1) << gaw) - 1)
+
+static void domain_remove_dev_info(struct dmar_domain *domain);
+
+static int dmar_disabled;
+static int __initdata dmar_map_gfx = 1;
+
+#define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
+static DEFINE_SPINLOCK(device_domain_lock);
+static LIST_HEAD(device_domain_list);
+
+static int __init intel_iommu_setup(char *str)
+{
+	if (!str)
+		return -EINVAL;
+	while (*str) {
+		if (!strncmp(str, "off", 3)) {
+			dmar_disabled = 1;
+			printk(KERN_INFO"Intel-IOMMU: disabled\n");
+		} else if (!strncmp(str, "igfx_off", 8)) {
+			dmar_map_gfx = 0;
+			printk(KERN_INFO
+				"Intel-IOMMU: disable GFX device mapping\n");
+		}
+
+		str += strcspn(str, ",");
+		while (*str == ',')
+			str++;
+	}
+	return 0;
+}
+__setup("intel_iommu=", intel_iommu_setup);
+
+static struct kmem_cache *iommu_domain_cache;
+static struct kmem_cache *iommu_devinfo_cache;
+static struct kmem_cache *iommu_iova_cache;
+
+static inline void *alloc_pgtable_page(void)
+{
+	return (void *)get_zeroed_page(GFP_ATOMIC);
+}
+
+static inline void free_pgtable_page(void *vaddr)
+{
+	free_page((unsigned long)vaddr);
+}
+
+static inline void *alloc_domain_mem(void)
+{
+	return kmem_cache_alloc(iommu_domain_cache, GFP_ATOMIC);
+}
+
+static inline void free_domain_mem(void *vaddr)
+{
+	kmem_cache_free(iommu_domain_cache, vaddr);
+}
+
+static inline void * alloc_devinfo_mem(void)
+{
+	return kmem_cache_alloc(iommu_devinfo_cache, GFP_ATOMIC);
+}
+
+static inline void free_devinfo_mem(void *vaddr)
+{
+	kmem_cache_free(iommu_devinfo_cache, vaddr);
+}
+
+struct iova *alloc_iova_mem(void)
+{
+	return kmem_cache_alloc(iommu_iova_cache, GFP_ATOMIC);
+}
+
+void free_iova_mem(struct iova *iova)
+{
+	kmem_cache_free(iommu_iova_cache, iova);
+}
+
+static inline void __iommu_flush_cache(
+	struct intel_iommu *iommu, void *addr, int size)
+{
+	if (!ecap_coherent(iommu->ecap))
+		clflush_cache_range(addr, size);
+}
+
+/* Gets context entry for a given bus and devfn */
+static struct context_entry * device_to_context_entry(struct intel_iommu *iommu,
+		u8 bus, u8 devfn)
+{
+	struct root_entry *root;
+	struct context_entry *context;
+	unsigned long phy_addr;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iommu->lock, flags);
+	root = &iommu->root_entry[bus];
+	context = get_context_addr_from_root(root);
+	if (!context) {
+		context = (struct context_entry *)alloc_pgtable_page();
+		if (!context) {
+			spin_unlock_irqrestore(&iommu->lock, flags);
+			return NULL;
+		}
+		__iommu_flush_cache(iommu, (void *)context, PAGE_SIZE_4K);
+		phy_addr = virt_to_phys((void *)context);
+		set_root_value(root, phy_addr);
+		set_root_present(root);
+		__iommu_flush_cache(iommu, root, sizeof(*root));
+	}
+	spin_unlock_irqrestore(&iommu->lock, flags);
+	return &context[devfn];
+}
+
+static int device_context_mapped(struct intel_iommu *iommu, u8 bus, u8 devfn)
+{
+	struct root_entry *root;
+	struct context_entry *context;
+	int ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iommu->lock, flags);
+	root = &iommu->root_entry[bus];
+	context = get_context_addr_from_root(root);
+	if (!context) {
+		ret = 0;
+		goto out;
+	}
+	ret = context_present(context[devfn]);
+out:
+	spin_unlock_irqrestore(&iommu->lock, flags);
+	return ret;
+}
+
+static void clear_context_table(struct intel_iommu *iommu, u8 bus, u8 devfn)
+{
+	struct root_entry *root;
+	struct context_entry *context;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iommu->lock, flags);
+	root = &iommu->root_entry[bus];
+	context = get_context_addr_from_root(root);
+	if (context) {
+		context_clear_entry(context[devfn]);
+		__iommu_flush_cache(iommu, &context[devfn], \
+			sizeof(*context));
+	}
+	spin_unlock_irqrestore(&iommu->lock, flags);
+}
+
+static void free_context_table(struct intel_iommu *iommu)
+{
+	struct root_entry *root;
+	int i;
+	unsigned long flags;
+	struct context_entry *context;
+
+	spin_lock_irqsave(&iommu->lock, flags);
+	if (!iommu->root_entry) {
+		goto out;
+	}
+	for (i = 0; i < ROOT_ENTRY_NR; i++) {
+		root = &iommu->root_entry[i];
+		context = get_context_addr_from_root(root);
+		if (context)
+			free_pgtable_page(context);
+	}
+	free_pgtable_page(iommu->root_entry);
+	iommu->root_entry = NULL;
+out:
+	spin_unlock_irqrestore(&iommu->lock, flags);
+}
+
+/* page table handling */
+#define LEVEL_STRIDE		(9)
+#define LEVEL_MASK		(((u64)1 << LEVEL_STRIDE) - 1)
+
+static inline int agaw_to_level(int agaw)
+{
+	return agaw + 2;
+}
+
+static inline int agaw_to_width(int agaw)
+{
+	return 30 + agaw * LEVEL_STRIDE;
+
+}
+
+static inline int width_to_agaw(int width)
+{
+	return (width - 30) / LEVEL_STRIDE;
+}
+
+static inline unsigned int level_to_offset_bits(int level)
+{
+	return (12 + (level - 1) * LEVEL_STRIDE);
+}
+
+static inline int address_level_offset(u64 addr, int level)
+{
+	return ((addr >> level_to_offset_bits(level)) & LEVEL_MASK);
+}
+
+static inline u64 level_mask(int level)
+{
+	return ((u64)-1 << level_to_offset_bits(level));
+}
+
+static inline u64 level_size(int level)
+{
+	return ((u64)1 << level_to_offset_bits(level));
+}
+
+static inline u64 align_to_level(u64 addr, int level)
+{
+	return ((addr + level_size(level) - 1) & level_mask(level));
+}
+
+static struct dma_pte * addr_to_dma_pte(struct dmar_domain *domain, u64 addr)
+{
+	int addr_width = agaw_to_width(domain->agaw);
+	struct dma_pte *parent, *pte = NULL;
+	int level = agaw_to_level(domain->agaw);
+	int offset;
+	unsigned long flags;
+
+	BUG_ON(!domain->pgd);
+
+	addr &= (((u64)1) << addr_width) - 1;
+	parent = domain->pgd;
+
+	spin_lock_irqsave(&domain->mapping_lock, flags);
+	while (level > 0) {
+		void *tmp_page;
+
+		offset = address_level_offset(addr, level);
+		pte = &parent[offset];
+		if (level == 1)
+			break;
+
+		if (!dma_pte_present(*pte)) {
+			tmp_page = alloc_pgtable_page();
+
+			if (!tmp_page) {
+				spin_unlock_irqrestore(&domain->mapping_lock,
+					flags);
+				return NULL;
+			}
+			__iommu_flush_cache(domain->iommu, tmp_page,
+					PAGE_SIZE_4K);
+			dma_set_pte_addr(*pte, virt_to_phys(tmp_page));
+			/*
+			 * high level table always sets r/w, last level page
+			 * table control read/write
+			 */
+			dma_set_pte_readable(*pte);
+			dma_set_pte_writable(*pte);
+			__iommu_flush_cache(domain->iommu, pte, sizeof(*pte));
+		}
+		parent = phys_to_virt(dma_pte_addr(*pte));
+		level--;
+	}
+
+	spin_unlock_irqrestore(&domain->mapping_lock, flags);
+	return pte;
+}
+
+/* return address's pte at specific level */
+static struct dma_pte *dma_addr_level_pte(struct dmar_domain *domain, u64 addr,
+		int level)
+{
+	struct dma_pte *parent, *pte = NULL;
+	int total = agaw_to_level(domain->agaw);
+	int offset;
+
+	parent = domain->pgd;
+	while (level <= total) {
+		offset = address_level_offset(addr, total);
+		pte = &parent[offset];
+		if (level == total)
+			return pte;
+
+		if (!dma_pte_present(*pte))
+			break;
+		parent = phys_to_virt(dma_pte_addr(*pte));
+		total--;
+	}
+	return NULL;
+}
+
+/* clear one page's page table */
+static void dma_pte_clear_one(struct dmar_domain *domain, u64 addr)
+{
+	struct dma_pte *pte = NULL;
+
+	/* get last level pte */
+	pte = dma_addr_level_pte(domain, addr, 1);
+
+	if (pte) {
+		dma_clear_pte(*pte);
+		__iommu_flush_cache(domain->iommu, pte, sizeof(*pte));
+	}
+}
+
+/* clear last level pte, a tlb flush should be followed */
+static void dma_pte_clear_range(struct dmar_domain *domain, u64 start, u64 end)
+{
+	int addr_width = agaw_to_width(domain->agaw);
+
+	start &= (((u64)1) << addr_width) - 1;
+	end &= (((u64)1) << addr_width) - 1;
+	/* in case it's partial page */
+	start = PAGE_ALIGN_4K(start);
+	end &= PAGE_MASK_4K;
+
+	/* we don't need lock here, nobody else touches the iova range */
+	while (start < end) {
+		dma_pte_clear_one(domain, start);
+		start += PAGE_SIZE_4K;
+	}
+}
+
+/* free page table pages. last level pte should already be cleared */
+static void dma_pte_free_pagetable(struct dmar_domain *domain,
+	u64 start, u64 end)
+{
+	int addr_width = agaw_to_width(domain->agaw);
+	struct dma_pte *pte;
+	int total = agaw_to_level(domain->agaw);
+	int level;
+	u64 tmp;
+
+	start &= (((u64)1) << addr_width) - 1;
+	end &= (((u64)1) << addr_width) - 1;
+
+	/* we don't need lock here, nobody else touches the iova range */
+	level = 2;
+	while (level <= total) {
+		tmp = align_to_level(start, level);
+		if (tmp >= end || (tmp + level_size(level) > end))
+			return;
+
+		while (tmp < end) {
+			pte = dma_addr_level_pte(domain, tmp, level);
+			if (pte) {
+				free_pgtable_page(
+					phys_to_virt(dma_pte_addr(*pte)));
+				dma_clear_pte(*pte);
+				__iommu_flush_cache(domain->iommu,
+						pte, sizeof(*pte));
+			}
+			tmp += level_size(level);
+		}
+		level++;
+	}
+	/* free pgd */
+	if (start == 0 && end >= ((((u64)1) << addr_width) - 1)) {
+		free_pgtable_page(domain->pgd);
+		domain->pgd = NULL;
+	}
+}
+
+/* iommu handling */
+static int iommu_alloc_root_entry(struct intel_iommu *iommu)
+{
+	struct root_entry *root;
+	unsigned long flags;
+
+	root = (struct root_entry *)alloc_pgtable_page();
+	if (!root)
+		return -ENOMEM;
+
+	__iommu_flush_cache(iommu, root, PAGE_SIZE_4K);
+
+	spin_lock_irqsave(&iommu->lock, flags);
+	iommu->root_entry = root;
+	spin_unlock_irqrestore(&iommu->lock, flags);
+
+	return 0;
+}
+
+#define IOMMU_WAIT_OP(iommu, offset, op, cond, sts) \
+{\
+	unsigned long start_time = jiffies;\
+	while (1) {\
+		sts = op (iommu->reg + offset);\
+		if (cond)\
+			break;\
+		if (time_after(jiffies, start_time + DMAR_OPERATION_TIMEOUT))\
+			panic("DMAR hardware is malfunctioning\n");\
+		cpu_relax();\
+	}\
+}
+
+static void iommu_set_root_entry(struct intel_iommu *iommu)
+{
+	void *addr;
+	u32 cmd, sts;
+	unsigned long flag;
+
+	addr = iommu->root_entry;
+
+	spin_lock_irqsave(&iommu->register_lock, flag);
+	dmar_writeq(iommu->reg + DMAR_RTADDR_REG, virt_to_phys(addr));
+
+	cmd = iommu->gcmd | DMA_GCMD_SRTP;
+	writel(cmd, iommu->reg + DMAR_GCMD_REG);
+
+	/* Make sure hardware complete it */
+	IOMMU_WAIT_OP(iommu, DMAR_GSTS_REG,
+		readl, (sts & DMA_GSTS_RTPS), sts);
+
+	spin_unlock_irqrestore(&iommu->register_lock, flag);
+}
+
+static void iommu_flush_write_buffer(struct intel_iommu *iommu)
+{
+	u32 val;
+	unsigned long flag;
+
+	if (!cap_rwbf(iommu->cap))
+		return;
+	val = iommu->gcmd | DMA_GCMD_WBF;
+
+	spin_lock_irqsave(&iommu->register_lock, flag);
+	writel(val, iommu->reg + DMAR_GCMD_REG);
+
+	/* Make sure hardware complete it */
+	IOMMU_WAIT_OP(iommu, DMAR_GSTS_REG,
+			readl, (!(val & DMA_GSTS_WBFS)), val);
+
+	spin_unlock_irqrestore(&iommu->register_lock, flag);
+}
+
+/* return value determine if we need a write buffer flush */
+static int __iommu_flush_context(struct intel_iommu *iommu,
+	u16 did, u16 source_id, u8 function_mask, u64 type,
+	int non_present_entry_flush)
+{
+	u64 val = 0;
+	unsigned long flag;
+
+	/*
+	 * In the non-present entry flush case, if the hardware doesn't
+	 * cache non-present entries we do nothing; if it does cache them,
+	 * we flush entries of domain 0 (the domain id used to cache
+	 * any non-present entries)
+	 */
+	if (non_present_entry_flush) {
+		if (!cap_caching_mode(iommu->cap))
+			return 1;
+		else
+			did = 0;
+	}
+
+	switch (type) {
+	case DMA_CCMD_GLOBAL_INVL:
+		val = DMA_CCMD_GLOBAL_INVL;
+		break;
+	case DMA_CCMD_DOMAIN_INVL:
+		val = DMA_CCMD_DOMAIN_INVL|DMA_CCMD_DID(did);
+		break;
+	case DMA_CCMD_DEVICE_INVL:
+		val = DMA_CCMD_DEVICE_INVL|DMA_CCMD_DID(did)
+			| DMA_CCMD_SID(source_id) | DMA_CCMD_FM(function_mask);
+		break;
+	default:
+		BUG();
+	}
+	val |= DMA_CCMD_ICC;
+
+	spin_lock_irqsave(&iommu->register_lock, flag);
+	dmar_writeq(iommu->reg + DMAR_CCMD_REG, val);
+
+	/* Make sure hardware complete it */
+	IOMMU_WAIT_OP(iommu, DMAR_CCMD_REG,
+		dmar_readq, (!(val & DMA_CCMD_ICC)), val);
+
+	spin_unlock_irqrestore(&iommu->register_lock, flag);
+
+	/* a context entry flush implicitly flushes the write buffer */
+	return 0;
+}
+
+static int inline iommu_flush_context_global(struct intel_iommu *iommu,
+	int non_present_entry_flush)
+{
+	return __iommu_flush_context(iommu, 0, 0, 0, DMA_CCMD_GLOBAL_INVL,
+		non_present_entry_flush);
+}
+
+static int inline iommu_flush_context_domain(struct intel_iommu *iommu, u16 did,
+	int non_present_entry_flush)
+{
+	return __iommu_flush_context(iommu, did, 0, 0, DMA_CCMD_DOMAIN_INVL,
+		non_present_entry_flush);
+}
+
+static int inline iommu_flush_context_device(struct intel_iommu *iommu,
+	u16 did, u16 source_id, u8 function_mask, int non_present_entry_flush)
+{
+	return __iommu_flush_context(iommu, did, source_id, function_mask,
+		DMA_CCMD_DEVICE_INVL, non_present_entry_flush);
+}
+
+/* return value determine if we need a write buffer flush */
+static int __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did,
+	u64 addr, unsigned int size_order, u64 type,
+	int non_present_entry_flush)
+{
+	int tlb_offset = ecap_iotlb_offset(iommu->ecap);
+	u64 val = 0, val_iva = 0;
+	unsigned long flag;
+
+	/*
+	 * In the non-present entry flush case, if the hardware doesn't
+	 * cache non-present entries we do nothing; if it does cache them,
+	 * we flush entries of domain 0 (the domain id used to cache
+	 * any non-present entries)
+	 */
+	if (non_present_entry_flush) {
+		if (!cap_caching_mode(iommu->cap))
+			return 1;
+		else
+			did = 0;
+	}
+
+	switch (type) {
+	case DMA_TLB_GLOBAL_FLUSH:
+		/* global flush doesn't need set IVA_REG */
+		val = DMA_TLB_GLOBAL_FLUSH|DMA_TLB_IVT;
+		break;
+	case DMA_TLB_DSI_FLUSH:
+		val = DMA_TLB_DSI_FLUSH|DMA_TLB_IVT|DMA_TLB_DID(did);
+		break;
+	case DMA_TLB_PSI_FLUSH:
+		val = DMA_TLB_PSI_FLUSH|DMA_TLB_IVT|DMA_TLB_DID(did);
+		/* Note: always flush non-leaf currently */
+		val_iva = size_order | addr;
+		break;
+	default:
+		BUG();
+	}
+	/* Note: set drain read/write */
+#if 0
+	/*
+	 * This is probably only needed to be extra safe; it looks like
+	 * we can ignore it without any impact.
+	 */
+	if (cap_read_drain(iommu->cap))
+		val |= DMA_TLB_READ_DRAIN;
+#endif
+	if (cap_write_drain(iommu->cap))
+		val |= DMA_TLB_WRITE_DRAIN;
+
+	spin_lock_irqsave(&iommu->register_lock, flag);
+	/* Note: Only uses first TLB reg currently */
+	if (val_iva)
+		dmar_writeq(iommu->reg + tlb_offset, val_iva);
+	dmar_writeq(iommu->reg + tlb_offset + 8, val);
+
+	/* Make sure hardware complete it */
+	IOMMU_WAIT_OP(iommu, tlb_offset + 8,
+		dmar_readq, (!(val & DMA_TLB_IVT)), val);
+
+	spin_unlock_irqrestore(&iommu->register_lock, flag);
+
+	/* check IOTLB invalidation granularity */
+	if (DMA_TLB_IAIG(val) == 0)
+		printk(KERN_ERR"IOMMU: flush IOTLB failed\n");
+	if (DMA_TLB_IAIG(val) != DMA_TLB_IIRG(type))
+		pr_debug("IOMMU: tlb flush request %Lx, actual %Lx\n",
+			DMA_TLB_IIRG(type), DMA_TLB_IAIG(val));
+	/* an IOTLB flush implicitly flushes the write buffer */
+	return 0;
+}
+
+static int inline iommu_flush_iotlb_global(struct intel_iommu *iommu,
+	int non_present_entry_flush)
+{
+	return __iommu_flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH,
+		non_present_entry_flush);
+}
+
+static int inline iommu_flush_iotlb_dsi(struct intel_iommu *iommu, u16 did,
+	int non_present_entry_flush)
+{
+	return __iommu_flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH,
+		non_present_entry_flush);
+}
+
+static int iommu_get_alignment(u64 base, unsigned int size)
+{
+	int t = 0;
+	u64 end;
+
+	end = base + size - 1;
+	while (base != end) {
+		t++;
+		base >>= 1;
+		end >>= 1;
+	}
+	return t;
+}
+
+static int iommu_flush_iotlb_psi(struct intel_iommu *iommu, u16 did,
+	u64 addr, unsigned int pages, int non_present_entry_flush)
+{
+	unsigned int align;
+
+	BUG_ON(addr & (~PAGE_MASK_4K));
+	BUG_ON(pages == 0);
+
+	/* Fall back to domain-selective flush if there is no PSI support */
+	if (!cap_pgsel_inv(iommu->cap))
+		return iommu_flush_iotlb_dsi(iommu, did,
+			non_present_entry_flush);
+
+	/*
+	 * PSI requires the page size to be 2^x and the base address to be
+	 * naturally aligned to that size
+	 */
+	align = iommu_get_alignment(addr >> PAGE_SHIFT_4K, pages);
+	/* Fall back to domain-selective flush if the size is too big */
+	if (align > cap_max_amask_val(iommu->cap))
+		return iommu_flush_iotlb_dsi(iommu, did,
+			non_present_entry_flush);
+
+	addr >>= PAGE_SHIFT_4K + align;
+	addr <<= PAGE_SHIFT_4K + align;
+
+	return __iommu_flush_iotlb(iommu, did, addr, align,
+		DMA_TLB_PSI_FLUSH, non_present_entry_flush);
+}
+
+static int iommu_enable_translation(struct intel_iommu *iommu)
+{
+	u32 sts;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iommu->register_lock, flags);
+	writel(iommu->gcmd|DMA_GCMD_TE, iommu->reg + DMAR_GCMD_REG);
+
+	/* Make sure hardware complete it */
+	IOMMU_WAIT_OP(iommu, DMAR_GSTS_REG,
+		readl, (sts & DMA_GSTS_TES), sts);
+
+	iommu->gcmd |= DMA_GCMD_TE;
+	spin_unlock_irqrestore(&iommu->register_lock, flags);
+	return 0;
+}
+
+static int iommu_disable_translation(struct intel_iommu *iommu)
+{
+	u32 sts;
+	unsigned long flag;
+
+	spin_lock_irqsave(&iommu->register_lock, flag);
+	iommu->gcmd &= ~DMA_GCMD_TE;
+	writel(iommu->gcmd, iommu->reg + DMAR_GCMD_REG);
+
+	/* Make sure hardware complete it */
+	IOMMU_WAIT_OP(iommu, DMAR_GSTS_REG,
+		readl, (!(sts & DMA_GSTS_TES)), sts);
+
+	spin_unlock_irqrestore(&iommu->register_lock, flag);
+	return 0;
+}
+
+static int iommu_init_domains(struct intel_iommu *iommu)
+{
+	unsigned long ndomains;
+	unsigned long nlongs;
+
+	ndomains = cap_ndoms(iommu->cap);
+	pr_debug("Number of Domains supported <%ld>\n", ndomains);
+	nlongs = BITS_TO_LONGS(ndomains);
+
+	/* TBD: there might be 64K domains,
+	 * consider a different allocation scheme for future chips
+	 */
+	iommu->domain_ids = kcalloc(nlongs, sizeof(unsigned long), GFP_KERNEL);
+	if (!iommu->domain_ids) {
+		printk(KERN_ERR "Allocating domain id array failed\n");
+		return -ENOMEM;
+	}
+	iommu->domains = kcalloc(ndomains, sizeof(struct dmar_domain *),
+			GFP_KERNEL);
+	if (!iommu->domains) {
+		printk(KERN_ERR "Allocating domain array failed\n");
+		kfree(iommu->domain_ids);
+		return -ENOMEM;
+	}
+
+	/*
+	 * If Caching mode is set, then invalid translations are tagged
+	 * with domain id 0. Hence we need to pre-allocate it.
+	 */
+	if (cap_caching_mode(iommu->cap))
+		set_bit(0, iommu->domain_ids);
+	return 0;
+}
+
+static struct intel_iommu *alloc_iommu(struct dmar_drhd_unit *drhd)
+{
+	struct intel_iommu *iommu;
+	int ret;
+	int map_size;
+	u32 ver;
+
+	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
+	if (!iommu)
+		return NULL;
+	iommu->reg = ioremap(drhd->reg_base_addr, PAGE_SIZE_4K);
+	if (!iommu->reg) {
+		printk(KERN_ERR "IOMMU: can't map the region\n");
+		goto error;
+	}
+	iommu->cap = dmar_readq(iommu->reg + DMAR_CAP_REG);
+	iommu->ecap = dmar_readq(iommu->reg + DMAR_ECAP_REG);
+
+	/* the registers might be more than one page */
+	map_size = max_t(int, ecap_max_iotlb_offset(iommu->ecap),
+		cap_max_fault_reg_offset(iommu->cap));
+	map_size = PAGE_ALIGN_4K(map_size);
+	if (map_size > PAGE_SIZE_4K) {
+		iounmap(iommu->reg);
+		iommu->reg = ioremap(drhd->reg_base_addr, map_size);
+		if (!iommu->reg) {
+			printk(KERN_ERR "IOMMU: can't map the region\n");
+			goto error;
+		}
+	}
+
+	ver = readl(iommu->reg + DMAR_VER_REG);
+	pr_debug("IOMMU %llx: ver %d:%d cap %llx ecap %llx\n",
+		drhd->reg_base_addr, DMAR_VER_MAJOR(ver), DMAR_VER_MINOR(ver),
+		iommu->cap, iommu->ecap);
+	ret = iommu_init_domains(iommu);
+	if (ret)
+		goto error_unmap;
+	spin_lock_init(&iommu->lock);
+	spin_lock_init(&iommu->register_lock);
+
+	drhd->iommu = iommu;
+	return iommu;
+error_unmap:
+	iounmap(iommu->reg);
+	iommu->reg = 0;
+error:
+	kfree(iommu);
+	return NULL;
+}
+
+static void domain_exit(struct dmar_domain *domain);
+static void free_iommu(struct intel_iommu *iommu)
+{
+	struct dmar_domain *domain;
+	int i;
+
+	if (!iommu)
+		return;
+
+	i = find_first_bit(iommu->domain_ids, cap_ndoms(iommu->cap));
+	for (; i < cap_ndoms(iommu->cap); ) {
+		domain = iommu->domains[i];
+		clear_bit(i, iommu->domain_ids);
+		domain_exit(domain);
+		i = find_next_bit(iommu->domain_ids,
+			cap_ndoms(iommu->cap), i+1);
+	}
+
+	if (iommu->gcmd & DMA_GCMD_TE)
+		iommu_disable_translation(iommu);
+
+	if (iommu->irq) {
+		set_irq_data(iommu->irq, NULL);
+		/* This will mask the irq */
+		free_irq(iommu->irq, iommu);
+		destroy_irq(iommu->irq);
+	}
+
+	kfree(iommu->domains);
+	kfree(iommu->domain_ids);
+
+	/* free context mapping */
+	free_context_table(iommu);
+
+	if (iommu->reg)
+		iounmap(iommu->reg);
+	kfree(iommu);
+}
+
+static struct dmar_domain * iommu_alloc_domain(struct intel_iommu *iommu)
+{
+	unsigned long num;
+	unsigned long ndomains;
+	struct dmar_domain *domain;
+	unsigned long flags;
+
+	domain = alloc_domain_mem();
+	if (!domain)
+		return NULL;
+
+	ndomains = cap_ndoms(iommu->cap);
+
+	spin_lock_irqsave(&iommu->lock, flags);
+	num = find_first_zero_bit(iommu->domain_ids, ndomains);
+	if (num >= ndomains) {
+		spin_unlock_irqrestore(&iommu->lock, flags);
+		free_domain_mem(domain);
+		printk(KERN_ERR "IOMMU: no free domain ids\n");
+		return NULL;
+	}
+
+	set_bit(num, iommu->domain_ids);
+	domain->id = num;
+	domain->iommu = iommu;
+	iommu->domains[num] = domain;
+	spin_unlock_irqrestore(&iommu->lock, flags);
+
+	return domain;
+}
+
+static void iommu_free_domain(struct dmar_domain *domain)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&domain->iommu->lock, flags);
+	clear_bit(domain->id, domain->iommu->domain_ids);
+	spin_unlock_irqrestore(&domain->iommu->lock, flags);
+}
+
+static struct iova_domain reserved_iova_list;
+
+static void dmar_init_reserved_ranges(void)
+{
+	struct pci_dev *pdev = NULL;
+	struct iova *iova;
+	int i;
+	u64 addr, size;
+
+	init_iova_domain(&reserved_iova_list);
+
+	/* IOAPIC ranges shouldn't be accessed by DMA */
+	iova = reserve_iova(&reserved_iova_list, IOVA_PFN(IOAPIC_RANGE_START),
+		IOVA_PFN(IOAPIC_RANGE_END));
+	if (!iova)
+		printk(KERN_ERR "Reserve IOAPIC range failed\n");
+
+	/* Reserve all PCI MMIO to avoid peer-to-peer access */
+	for_each_pci_dev(pdev) {
+		struct resource *r;
+
+		for (i = 0; i < PCI_NUM_RESOURCES; i++) {
+			r = &pdev->resource[i];
+			if (!r->flags || !(r->flags & IORESOURCE_MEM))
+				continue;
+			addr = r->start;
+			addr &= PAGE_MASK_4K;
+			size = r->end - addr;
+			size = PAGE_ALIGN_4K(size);
+			iova = reserve_iova(&reserved_iova_list, IOVA_PFN(addr),
+				IOVA_PFN(size + addr) - 1);
+			if (!iova)
+				printk(KERN_ERR "Reserve iova failed\n");
+		}
+	}
+
+}
+
+static void domain_reserve_special_ranges(struct dmar_domain *domain)
+{
+	copy_reserved_iova(&reserved_iova_list, &domain->iovad);
+}
+
+static inline int guestwidth_to_adjustwidth(int gaw)
+{
+	int agaw;
+	int r = (gaw - 12) % 9;
+
+	if (r == 0)
+		agaw = gaw;
+	else
+		agaw = gaw + 9 - r;
+	if (agaw > 64)
+		agaw = 64;
+	return agaw;
+}
+
+static int domain_init(struct dmar_domain *domain, int guest_width)
+{
+	struct intel_iommu *iommu;
+	int adjust_width, agaw;
+	unsigned long sagaw;
+
+	init_iova_domain(&domain->iovad);
+	spin_lock_init(&domain->mapping_lock);
+
+	domain_reserve_special_ranges(domain);
+
+	/* calculate AGAW */
+	iommu = domain->iommu;
+	if (guest_width > cap_mgaw(iommu->cap))
+		guest_width = cap_mgaw(iommu->cap);
+	domain->gaw = guest_width;
+	adjust_width = guestwidth_to_adjustwidth(guest_width);
+	agaw = width_to_agaw(adjust_width);
+	sagaw = cap_sagaw(iommu->cap);
+	if (!test_bit(agaw, &sagaw)) {
+		/* hardware doesn't support it, choose a bigger one */
+		pr_debug("IOMMU: hardware doesn't support agaw %d\n", agaw);
+		agaw = find_next_bit(&sagaw, 5, agaw);
+		if (agaw >= 5)
+			return -ENODEV;
+	}
+	domain->agaw = agaw;
+	INIT_LIST_HEAD(&domain->devices);
+
+	/* always allocate the top pgd */
+	domain->pgd = (struct dma_pte *)alloc_pgtable_page();
+	if (!domain->pgd)
+		return -ENOMEM;
+	__iommu_flush_cache(iommu, domain->pgd, PAGE_SIZE_4K);
+	return 0;
+}
+
+static void domain_exit(struct dmar_domain *domain)
+{
+	u64 end;
+
+	/* Domain 0 is reserved, so don't process it */
+	if (!domain)
+		return;
+
+	domain_remove_dev_info(domain);
+	/* destroy iovas */
+	put_iova_domain(&domain->iovad);
+	end = DOMAIN_MAX_ADDR(domain->gaw);
+	end = end & (~PAGE_MASK_4K);
+
+	/* clear ptes */
+	dma_pte_clear_range(domain, 0, end);
+
+	/* free page tables */
+	dma_pte_free_pagetable(domain, 0, end);
+
+	iommu_free_domain(domain);
+	free_domain_mem(domain);
+}
+
+static int domain_context_mapping_one(struct dmar_domain *domain,
+		u8 bus, u8 devfn)
+{
+	struct context_entry *context;
+	struct intel_iommu *iommu = domain->iommu;
+	unsigned long flags;
+
+	pr_debug("Set context mapping for %02x:%02x.%d\n",
+		bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
+	BUG_ON(!domain->pgd);
+	context = device_to_context_entry(iommu, bus, devfn);
+	if (!context)
+		return -ENOMEM;
+	spin_lock_irqsave(&iommu->lock, flags);
+	if (context_present(*context)) {
+		spin_unlock_irqrestore(&iommu->lock, flags);
+		return 0;
+	}
+
+	context_set_domain_id(*context, domain->id);
+	context_set_address_width(*context, domain->agaw);
+	context_set_address_root(*context, virt_to_phys(domain->pgd));
+	context_set_translation_type(*context, CONTEXT_TT_MULTI_LEVEL);
+	context_set_fault_enable(*context);
+	context_set_present(*context);
+	__iommu_flush_cache(iommu, context, sizeof(*context));
+
+	/* it's a non-present to present mapping */
+	if (iommu_flush_context_device(iommu, domain->id,
+			(((u16)bus) << 8) | devfn, DMA_CCMD_MASK_NOBIT, 1))
+		iommu_flush_write_buffer(iommu);
+	else
+		iommu_flush_iotlb_dsi(iommu, 0, 0);
+	spin_unlock_irqrestore(&iommu->lock, flags);
+	return 0;
+}
+
+static int
+domain_context_mapping(struct dmar_domain *domain, struct pci_dev *pdev)
+{
+	int ret;
+	struct pci_dev *tmp, *parent;
+
+	ret = domain_context_mapping_one(domain, pdev->bus->number,
+		pdev->devfn);
+	if (ret)
+		return ret;
+
+	/* dependent device mapping */
+	tmp = pci_find_upstream_pcie_bridge(pdev);
+	if (!tmp)
+		return 0;
+	/* Secondary interface's bus number and devfn 0 */
+	parent = pdev->bus->self;
+	while (parent != tmp) {
+		ret = domain_context_mapping_one(domain, parent->bus->number,
+			parent->devfn);
+		if (ret)
+			return ret;
+		parent = parent->bus->self;
+	}
+	if (tmp->is_pcie) /* this is a PCIE-to-PCI bridge */
+		return domain_context_mapping_one(domain,
+			tmp->subordinate->number, 0);
+	else /* this is a legacy PCI bridge */
+		return domain_context_mapping_one(domain,
+			tmp->bus->number, tmp->devfn);
+}
+
+static int domain_context_mapped(struct dmar_domain *domain,
+	struct pci_dev *pdev)
+{
+	int ret;
+	struct pci_dev *tmp, *parent;
+
+	ret = device_context_mapped(domain->iommu,
+		pdev->bus->number, pdev->devfn);
+	if (!ret)
+		return ret;
+	/* dependent device mapping */
+	tmp = pci_find_upstream_pcie_bridge(pdev);
+	if (!tmp)
+		return ret;
+	/* Secondary interface's bus number and devfn 0 */
+	parent = pdev->bus->self;
+	while (parent != tmp) {
+		ret = device_context_mapped(domain->iommu, parent->bus->number,
+			parent->devfn);
+		if (!ret)
+			return ret;
+		parent = parent->bus->self;
+	}
+	if (tmp->is_pcie)
+		return device_context_mapped(domain->iommu,
+			tmp->subordinate->number, 0);
+	else
+		return device_context_mapped(domain->iommu,
+			tmp->bus->number, tmp->devfn);
+}
+
+static int
+domain_page_mapping(struct dmar_domain *domain, dma_addr_t iova,
+			u64 hpa, size_t size, int prot)
+{
+	u64 start_pfn, end_pfn;
+	struct dma_pte *pte;
+	int index;
+
+	if ((prot & (DMA_PTE_READ|DMA_PTE_WRITE)) == 0)
+		return -EINVAL;
+	iova &= PAGE_MASK_4K;
+	start_pfn = ((u64)hpa) >> PAGE_SHIFT_4K;
+	end_pfn = (PAGE_ALIGN_4K(((u64)hpa) + size)) >> PAGE_SHIFT_4K;
+	index = 0;
+	while (start_pfn < end_pfn) {
+		pte = addr_to_dma_pte(domain, iova + PAGE_SIZE_4K * index);
+		if (!pte)
+			return -ENOMEM;
+		/* We don't need lock here, nobody else
+		 * touches the iova range
+		 */
+		BUG_ON(dma_pte_addr(*pte));
+		dma_set_pte_addr(*pte, start_pfn << PAGE_SHIFT_4K);
+		dma_set_pte_prot(*pte, prot);
+		__iommu_flush_cache(domain->iommu, pte, sizeof(*pte));
+		start_pfn++;
+		index++;
+	}
+	return 0;
+}
+
+static void detach_domain_for_dev(struct dmar_domain *domain, u8 bus, u8 devfn)
+{
+	clear_context_table(domain->iommu, bus, devfn);
+	iommu_flush_context_global(domain->iommu, 0);
+	iommu_flush_iotlb_global(domain->iommu, 0);
+}
+
+static void domain_remove_dev_info(struct dmar_domain *domain)
+{
+	struct device_domain_info *info;
+	unsigned long flags;
+
+	spin_lock_irqsave(&device_domain_lock, flags);
+	while (!list_empty(&domain->devices)) {
+		info = list_entry(domain->devices.next,
+			struct device_domain_info, link);
+		list_del(&info->link);
+		list_del(&info->global);
+		if (info->dev)
+			info->dev->sysdata = NULL;
+		spin_unlock_irqrestore(&device_domain_lock, flags);
+
+		detach_domain_for_dev(info->domain, info->bus, info->devfn);
+		free_devinfo_mem(info);
+
+		spin_lock_irqsave(&device_domain_lock, flags);
+	}
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+}
+
+/*
+ * find_domain
+ * Note: we use struct pci_dev->sysdata to store the info
+ */
+struct dmar_domain *
+find_domain(struct pci_dev *pdev)
+{
+	struct device_domain_info *info;
+
+	/* No lock here, assumes no domain exit in normal case */
+	info = (struct device_domain_info *)pdev->sysdata;
+	if (info)
+		return info->domain;
+	return NULL;
+}
+
+static int dmar_pci_device_match(struct pci_dev *devices[], int cnt,
+     struct pci_dev *dev)
+{
+	int index;
+
+	while (dev) {
+		for (index = 0; index < cnt; index ++)
+			if (dev == devices[index])
+				return 1;
+
+		/* Check our parent */
+		dev = dev->bus->self;
+	}
+
+	return 0;
+}
+
+static struct dmar_drhd_unit *
+dmar_find_matched_drhd_unit(struct pci_dev *dev)
+{
+	struct dmar_drhd_unit *drhd = NULL;
+
+	list_for_each_entry(drhd, &dmar_drhd_units, list) {
+		if (drhd->include_all || dmar_pci_device_match(drhd->devices,
+						drhd->devices_cnt, dev))
+			return drhd;
+	}
+
+	return NULL;
+}
+
+/* domain is initialized */
+static struct dmar_domain *get_domain_for_dev(struct pci_dev *pdev, int gaw)
+{
+	struct dmar_domain *domain, *found = NULL;
+	struct intel_iommu *iommu;
+	struct dmar_drhd_unit *drhd;
+	struct device_domain_info *info, *tmp;
+	struct pci_dev *dev_tmp;
+	unsigned long flags;
+	int bus = 0, devfn = 0;
+
+	domain = find_domain(pdev);
+	if (domain)
+		return domain;
+
+	dev_tmp = pci_find_upstream_pcie_bridge(pdev);
+	if (dev_tmp) {
+		if (dev_tmp->is_pcie) {
+			bus = dev_tmp->subordinate->number;
+			devfn = 0;
+		} else {
+			bus = dev_tmp->bus->number;
+			devfn = dev_tmp->devfn;
+		}
+		spin_lock_irqsave(&device_domain_lock, flags);
+		list_for_each_entry(info, &device_domain_list, global) {
+			if (info->bus == bus && info->devfn == devfn) {
+				found = info->domain;
+				break;
+			}
+		}
+		spin_unlock_irqrestore(&device_domain_lock, flags);
+		/* pcie-pci bridge already has a domain, use it */
+		if (found) {
+			domain = found;
+			goto found_domain;
+		}
+	}
+
+	/* Allocate new domain for the device */
+	drhd = dmar_find_matched_drhd_unit(pdev);
+	if (!drhd) {
+		printk(KERN_ERR "IOMMU: can't find DMAR for device %s\n",
+			pci_name(pdev));
+		return NULL;
+	}
+	iommu = drhd->iommu;
+
+	domain = iommu_alloc_domain(iommu);
+	if (!domain)
+		goto error;
+
+	if (domain_init(domain, gaw)) {
+		domain_exit(domain);
+		goto error;
+	}
+
+	/* register pcie-to-pci device */
+	if (dev_tmp) {
+		info = alloc_devinfo_mem();
+		if (!info) {
+			domain_exit(domain);
+			goto error;
+		}
+		info->bus = bus;
+		info->devfn = devfn;
+		info->dev = NULL;
+		info->domain = domain;
+		/* This domain is shared by devices under p2p bridge */
+		domain->flags |= DOMAIN_FLAG_MULTIPLE_DEVICES;
+
+		/* pcie-to-pci bridge already has a domain, use it */
+		found = NULL;
+		spin_lock_irqsave(&device_domain_lock, flags);
+		list_for_each_entry(tmp, &device_domain_list, global) {
+			if (tmp->bus == bus && tmp->devfn == devfn) {
+				found = tmp->domain;
+				break;
+			}
+		}
+		if (found) {
+			free_devinfo_mem(info);
+			domain_exit(domain);
+			domain = found;
+		} else {
+			list_add(&info->link, &domain->devices);
+			list_add(&info->global, &device_domain_list);
+		}
+		spin_unlock_irqrestore(&device_domain_lock, flags);
+	}
+
+found_domain:
+	info = alloc_devinfo_mem();
+	if (!info)
+		goto error;
+	info->bus = pdev->bus->number;
+	info->devfn = pdev->devfn;
+	info->dev = pdev;
+	info->domain = domain;
+	spin_lock_irqsave(&device_domain_lock, flags);
+	/* somebody else was faster and already set it up */
+	found = find_domain(pdev);
+	if (found != NULL) {
+		spin_unlock_irqrestore(&device_domain_lock, flags);
+		if (found != domain) {
+			domain_exit(domain);
+			domain = found;
+		}
+		free_devinfo_mem(info);
+		return domain;
+	}
+	list_add(&info->link, &domain->devices);
+	list_add(&info->global, &device_domain_list);
+	pdev->sysdata = info;
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+	return domain;
+error:
+	/* recheck it here, maybe others set it */
+	return find_domain(pdev);
+}
+
+static int iommu_prepare_identity_map(struct pci_dev *pdev, u64 start, u64 end)
+{
+	struct dmar_domain *domain;
+	unsigned long size;
+	u64 base;
+	int ret;
+
+	printk(KERN_INFO
+		"IOMMU: Setting identity map for device %s [0x%Lx - 0x%Lx]\n",
+		pci_name(pdev), start, end);
+	/* page table init */
+	domain = get_domain_for_dev(pdev, DEFAULT_DOMAIN_ADDRESS_WIDTH);
+	if (!domain)
+		return -ENOMEM;
+
+	/* The address might not be aligned */
+	base = start & PAGE_MASK_4K;
+	size = end - base;
+	size = PAGE_ALIGN_4K(size);
+	if (!reserve_iova(&domain->iovad, IOVA_PFN(base),
+			IOVA_PFN(base + size) - 1)) {
+		printk(KERN_ERR "IOMMU: reserve iova failed\n");
+		ret = -ENOMEM;
+		goto error;
+	}
+
+	pr_debug("Mapping reserved region %lx@%llx for %s\n",
+		size, base, pci_name(pdev));
+	/*
+	 * The RMRR range might overlap with a physical memory range,
+	 * clear it first
+	 */
+	dma_pte_clear_range(domain, base, base + size);
+
+	ret = domain_page_mapping(domain, base, base, size,
+		DMA_PTE_READ|DMA_PTE_WRITE);
+	if (ret)
+		goto error;
+
+	/* context entry init */
+	ret = domain_context_mapping(domain, pdev);
+	if (!ret)
+		return 0;
+error:
+	domain_exit(domain);
+	return ret;
+
+}
+
+static inline int iommu_prepare_rmrr_dev(struct dmar_rmrr_unit *rmrr,
+	struct pci_dev *pdev)
+{
+	if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
+		return 0;
+	return iommu_prepare_identity_map(pdev, rmrr->base_address,
+		rmrr->end_address + 1);
+}
+
+int __init init_dmars(void)
+{
+	struct dmar_drhd_unit *drhd;
+	struct dmar_rmrr_unit *rmrr;
+	struct pci_dev *pdev;
+	struct intel_iommu *iommu;
+	int ret, unit = 0;
+
+	/*
+	 * for each drhd
+	 *    allocate root
+	 *    initialize and program root entry to not present
+	 * endfor
+	 */
+	for_each_drhd_unit(drhd) {
+		if (drhd->ignored)
+			continue;
+		iommu = alloc_iommu(drhd);
+		if (!iommu) {
+			ret = -ENOMEM;
+			goto error;
+		}
+
+		/*
+		 * TBD:
+		 * we could share the same root & context tables
+		 * among all IOMMUs. Need to split it later.
+		 */
+		ret = iommu_alloc_root_entry(iommu);
+		if (ret) {
+			printk(KERN_ERR "IOMMU: allocate root entry failed\n");
+			goto error;
+		}
+	}
+
+	/*
+	 * For each rmrr
+	 *   for each dev attached to rmrr
+	 *   do
+	 *     locate drhd for dev, alloc domain for dev
+	 *     allocate free domain
+	 *     allocate page table entries for rmrr
+	 *     if context not allocated for bus
+	 *           allocate and init context
+	 *           set present in root table for this bus
+	 *     init context with domain, translation etc
+	 *    endfor
+	 * endfor
+	 */
+	for_each_rmrr_units(rmrr) {
+		int i;
+		for (i = 0; i < rmrr->devices_cnt; i++) {
+			pdev = rmrr->devices[i];
+			/* some BIOSes list non-existent devices in the DMAR table */
+			if (!pdev)
+				continue;
+			ret = iommu_prepare_rmrr_dev(rmrr, pdev);
+			if (ret)
+				printk(KERN_ERR
+				 "IOMMU: mapping reserved region failed\n");
+		}
+	}
+
+	/*
+	 * for each drhd
+	 *   enable fault log
+	 *   global invalidate context cache
+	 *   global invalidate iotlb
+	 *   enable translation
+	 */
+	for_each_drhd_unit(drhd) {
+		if (drhd->ignored)
+			continue;
+		iommu = drhd->iommu;
+		sprintf (iommu->name, "dmar%d", unit++);
+
+		iommu_flush_write_buffer(iommu);
+
+		iommu_set_root_entry(iommu);
+
+		iommu_flush_context_global(iommu, 0);
+		iommu_flush_iotlb_global(iommu, 0);
+
+		ret = iommu_enable_translation(iommu);
+		if (ret)
+			goto error;
+	}
+
+	return 0;
+error:
+	for_each_drhd_unit(drhd) {
+		if (drhd->ignored)
+			continue;
+		iommu = drhd->iommu;
+		free_iommu(iommu);
+	}
+	return ret;
+}
+
+static inline u64 aligned_size(u64 host_addr, size_t size)
+{
+	u64 addr;
+	addr = (host_addr & (~PAGE_MASK_4K)) + size;
+	return PAGE_ALIGN_4K(addr);
+}
+
+struct iova *
+iommu_alloc_iova(struct dmar_domain *domain, void *host_addr, size_t size,
+		u64 start, u64 end)
+{
+	u64 start_addr;
+	struct iova *piova;
+
+	/* Make sure it's in range */
+	if ((start > DOMAIN_MAX_ADDR(domain->gaw)) || end < start)
+		return NULL;
+
+	end = min_t(u64, DOMAIN_MAX_ADDR(domain->gaw), end);
+	start_addr = PAGE_ALIGN_4K(start);
+	size = aligned_size((u64)host_addr, size);
+	if (!size || (start_addr + size > end))
+		return NULL;
+
+	piova = alloc_iova(&domain->iovad,
+			size >> PAGE_SHIFT_4K, IOVA_PFN(end));
+
+	return piova;
+}
+
+static dma_addr_t __intel_map_single(struct device *dev, void *addr,
+	size_t size, int dir, u64 *flush_addr, unsigned int *flush_size)
+{
+	struct dmar_domain *domain;
+	struct pci_dev *pdev = to_pci_dev(dev);
+	int ret;
+	int prot = 0;
+	struct iova *iova = NULL;
+	u64 start_addr;
+
+	addr = (void *)virt_to_phys(addr);
+
+	domain = get_domain_for_dev(pdev,
+			DEFAULT_DOMAIN_ADDRESS_WIDTH);
+	if (!domain) {
+		printk(KERN_ERR
+			"Allocating domain for %s failed", pci_name(pdev));
+		return 0;
+	}
+
+	start_addr = IOVA_START_ADDR;
+
+	if (pdev->dma_mask <= DMA_32BIT_MASK) {
+		iova = iommu_alloc_iova(domain, addr, size, start_addr,
+			pdev->dma_mask);
+	} else  {
+		/*
+		 * First try to allocate an I/O virtual address within
+		 * DMA_32BIT_MASK; if that fails, then try allocating
+		 * from the higher range
+		 */
+		iova = iommu_alloc_iova(domain, addr, size, start_addr,
+			DMA_32BIT_MASK);
+		if (!iova)
+			iova = iommu_alloc_iova(domain, addr, size, start_addr,
+			pdev->dma_mask);
+	}
+
+	if (!iova) {
+		printk(KERN_ERR"Allocating iova for %s failed", pci_name(pdev));
+		return 0;
+	}
+
+	/* make sure context mapping is ok */
+	if (unlikely(!domain_context_mapped(domain, pdev))) {
+		ret = domain_context_mapping(domain, pdev);
+		if (ret)
+			goto error;
+	}
+
+	/*
+	 * Check if DMAR supports zero-length reads on write-only
+	 * mappings.
+	 */
+	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL || \
+			!cap_zlr(domain->iommu->cap))
+		prot |= DMA_PTE_READ;
+	if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
+		prot |= DMA_PTE_WRITE;
+	/*
+	 * addr - (addr + size) might be a partial page; we should map the
+	 * whole page.  Note: if two parts of one page are mapped separately,
+	 * we might have two guest addresses mapping to the same host address,
+	 * but this is not a big problem
+	 */
+	ret = domain_page_mapping(domain, iova->pfn_lo << PAGE_SHIFT_4K,
+		((u64)addr) & PAGE_MASK_4K,
+		(iova->pfn_hi - iova->pfn_lo + 1) << PAGE_SHIFT_4K, prot);
+	if (ret)
+		goto error;
+
+	pr_debug("Device %s request: %lx@%llx mapping: %lx@%llx, dir %d\n",
+		pci_name(pdev), size, (u64)addr,
+		(iova->pfn_hi - iova->pfn_lo + 1) << PAGE_SHIFT_4K,
+		(u64)(iova->pfn_lo << PAGE_SHIFT_4K), dir);
+
+	*flush_addr = iova->pfn_lo << PAGE_SHIFT_4K;
+	*flush_size = (iova->pfn_hi - iova->pfn_lo + 1) << PAGE_SHIFT_4K;
+	return (iova->pfn_lo << PAGE_SHIFT_4K) + ((u64)addr & (~PAGE_MASK_4K));
+error:
+	__free_iova(&domain->iovad, iova);
+	printk(KERN_ERR"Device %s request: %lx@%llx dir %d --- failed\n",
+		pci_name(pdev), size, (u64)addr, dir);
+	return 0;
+}
+
+static dma_addr_t intel_map_single(struct device *hwdev, void *addr,
+	size_t size, int dir)
+{
+	struct pci_dev *pdev = to_pci_dev(hwdev);
+	dma_addr_t ret;
+	struct dmar_domain *domain;
+	u64 flush_addr;
+	unsigned int flush_size;
+
+	BUG_ON(dir == DMA_NONE);
+	if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
+		return virt_to_bus(addr);
+
+	ret = __intel_map_single(hwdev, addr, size,
+			dir, &flush_addr, &flush_size);
+	if (ret) {
+		domain = find_domain(pdev);
+		/* it's a non-present to present mapping */
+		if (iommu_flush_iotlb_psi(domain->iommu, domain->id,
+				flush_addr, flush_size >> PAGE_SHIFT_4K, 1))
+			iommu_flush_write_buffer(domain->iommu);
+	}
+	return ret;
+}
+
+static void __intel_unmap_single(struct device *dev, dma_addr_t dev_addr,
+	size_t size, int dir, u64 *flush_addr, unsigned int *flush_size)
+{
+	struct dmar_domain *domain;
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct iova *iova;
+
+	domain = find_domain(pdev);
+	BUG_ON(!domain);
+
+	iova = find_iova(&domain->iovad, IOVA_PFN(dev_addr));
+	if (!iova) {
+		*flush_size = 0;
+		return;
+	}
+	pr_debug("Device %s unmapping: %lx@%llx\n",
+		pci_name(pdev),
+		(iova->pfn_hi - iova->pfn_lo + 1) << PAGE_SHIFT_4K,
+		(u64)(iova->pfn_lo << PAGE_SHIFT_4K));
+
+	*flush_addr = iova->pfn_lo << PAGE_SHIFT_4K;
+	*flush_size = (iova->pfn_hi - iova->pfn_lo + 1) << PAGE_SHIFT_4K;
+	/*  clear the whole page, not just dev_addr - (dev_addr + size) */
+	dma_pte_clear_range(domain, *flush_addr, *flush_addr + *flush_size);
+	/* free page tables */
+	dma_pte_free_pagetable(domain, *flush_addr, *flush_addr + *flush_size);
+	/* free iova */
+	__free_iova(&domain->iovad, iova);
+}
+
+static void intel_unmap_single(struct device *dev, dma_addr_t dev_addr,
+	size_t size, int dir)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct dmar_domain *domain;
+	u64 flush_addr;
+	unsigned int flush_size;
+
+	if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
+		return;
+
+	domain = find_domain(pdev);
+	__intel_unmap_single(dev, dev_addr, size,
+		dir, &flush_addr, &flush_size);
+	if (flush_size == 0)
+		return;
+	if (iommu_flush_iotlb_psi(domain->iommu, domain->id, flush_addr,
+			flush_size >> PAGE_SHIFT_4K, 0))
+		iommu_flush_write_buffer(domain->iommu);
+}
+
+static void * intel_alloc_coherent(struct device *hwdev, size_t size,
+		       dma_addr_t *dma_handle, gfp_t flags)
+{
+	void *vaddr;
+	int order;
+
+	size = PAGE_ALIGN_4K(size);
+	order = get_order(size);
+	flags &= ~(GFP_DMA | GFP_DMA32);
+
+	vaddr = (void *)__get_free_pages(flags, order);
+	if (!vaddr)
+		return NULL;
+	memset(vaddr, 0, size);
+
+	*dma_handle = intel_map_single(hwdev, vaddr, size, DMA_BIDIRECTIONAL);
+	if (*dma_handle)
+		return vaddr;
+	free_pages((unsigned long)vaddr, order);
+	return NULL;
+}
+
+static void intel_free_coherent(struct device *hwdev, size_t size,
+	void *vaddr, dma_addr_t dma_handle)
+{
+	int order;
+
+	size = PAGE_ALIGN_4K(size);
+	order = get_order(size);
+
+	intel_unmap_single(hwdev, dma_handle, size, DMA_BIDIRECTIONAL);
+	free_pages((unsigned long)vaddr, order);
+}
+
+static void intel_unmap_sg(struct device *hwdev, struct scatterlist *sg,
+	int nelems, int dir)
+{
+	int i;
+	struct pci_dev *pdev = to_pci_dev(hwdev);
+	struct dmar_domain *domain;
+	u64 flush_addr;
+	unsigned int flush_size;
+
+	if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
+		return;
+
+	domain = find_domain(pdev);
+	for (i = 0; i < nelems; i++, sg++)
+		__intel_unmap_single(hwdev, sg->dma_address,
+			sg->dma_length, dir, &flush_addr, &flush_size);
+
+	if (iommu_flush_iotlb_dsi(domain->iommu, domain->id, 0))
+		iommu_flush_write_buffer(domain->iommu);
+}
+
+#define SG_ENT_VIRT_ADDRESS(sg)	(page_address((sg)->page) + (sg)->offset)
+static int intel_nontranslate_map_sg(struct device *hddev,
+	struct scatterlist *sg, int nelems, int dir)
+{
+	int i;
+
+	for (i = 0; i < nelems; i++) {
+		struct scatterlist *s = &sg[i];
+		BUG_ON(!s->page);
+		s->dma_address = virt_to_bus(SG_ENT_VIRT_ADDRESS(s));
+		s->dma_length = s->length;
+	}
+	return nelems;
+}
+
+static int intel_map_sg(struct device *hwdev, struct scatterlist *sg,
+	int nelems, int dir)
+{
+	void *addr;
+	int i;
+	dma_addr_t dma_handle;
+	struct pci_dev *pdev = to_pci_dev(hwdev);
+	struct dmar_domain *domain;
+	u64 flush_addr;
+	unsigned int flush_size;
+
+	BUG_ON(dir == DMA_NONE);
+	if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO)
+		return intel_nontranslate_map_sg(hwdev, sg, nelems, dir);
+
+	for (i = 0; i < nelems; i++, sg++) {
+		addr = SG_ENT_VIRT_ADDRESS(sg);
+		dma_handle = __intel_map_single(hwdev, addr,
+				sg->length, dir, &flush_addr, &flush_size);
+		if (!dma_handle) {
+			intel_unmap_sg(hwdev, sg - i, i, dir);
+			sg[0].dma_length = 0;
+			return 0;
+		}
+		sg->dma_address = dma_handle;
+		sg->dma_length = sg->length;
+	}
+
+	domain = find_domain(pdev);
+
+	/* it's a non-present to present mapping */
+	if (iommu_flush_iotlb_dsi(domain->iommu, domain->id, 1))
+		iommu_flush_write_buffer(domain->iommu);
+	return nelems;
+}
+
+static struct dma_mapping_ops intel_dma_ops = {
+	.alloc_coherent = intel_alloc_coherent,
+	.free_coherent = intel_free_coherent,
+	.map_single = intel_map_single,
+	.unmap_single = intel_unmap_single,
+	.map_sg = intel_map_sg,
+	.unmap_sg = intel_unmap_sg,
+};
+
+static inline int iommu_domain_cache_init(void)
+{
+	int ret = 0;
+
+	iommu_domain_cache = kmem_cache_create("iommu_domain",
+					 sizeof(struct dmar_domain),
+					 0,
+					 SLAB_HWCACHE_ALIGN,
+					 NULL,
+					 NULL);
+	if (!iommu_domain_cache) {
+		printk(KERN_ERR "Couldn't create iommu_domain cache\n");
+		ret = -ENOMEM;
+	}
+
+	return ret;
+}
+
+static inline int iommu_devinfo_cache_init(void)
+{
+	int ret = 0;
+
+	iommu_devinfo_cache = kmem_cache_create("iommu_devinfo",
+					 sizeof(struct device_domain_info),
+					 0,
+					 SLAB_HWCACHE_ALIGN,
+					 NULL,
+					 NULL);
+	if (!iommu_devinfo_cache) {
+		printk(KERN_ERR "Couldn't create devinfo cache\n");
+		ret = -ENOMEM;
+	}
+
+	return ret;
+}
+
+static inline int iommu_iova_cache_init(void)
+{
+	int ret = 0;
+
+	iommu_iova_cache = kmem_cache_create("iommu_iova",
+					 sizeof(struct iova),
+					 0,
+					 SLAB_HWCACHE_ALIGN,
+					 NULL,
+					 NULL);
+	if (!iommu_iova_cache) {
+		printk(KERN_ERR "Couldn't create iova cache\n");
+		ret = -ENOMEM;
+	}
+
+	return ret;
+}
+
+static int __init iommu_init_mempool(void)
+{
+	int ret;
+	ret = iommu_iova_cache_init();
+	if (ret)
+		return ret;
+
+	ret = iommu_domain_cache_init();
+	if (ret)
+		goto domain_error;
+
+	ret = iommu_devinfo_cache_init();
+	if (!ret)
+		return ret;
+
+	kmem_cache_destroy(iommu_domain_cache);
+domain_error:
+	kmem_cache_destroy(iommu_iova_cache);
+
+	return -ENOMEM;
+}
+
+static void __init iommu_exit_mempool(void)
+{
+	kmem_cache_destroy(iommu_devinfo_cache);
+	kmem_cache_destroy(iommu_domain_cache);
+	kmem_cache_destroy(iommu_iova_cache);
+
+}
+
+void __init detect_intel_iommu(void)
+{
+	if (swiotlb || no_iommu || iommu_detected || dmar_disabled)
+		return;
+	if (early_dmar_detect()) {
+		iommu_detected = 1;
+	}
+}
+
+static void __init init_no_remapping_devices(void)
+{
+	struct dmar_drhd_unit *drhd;
+
+	for_each_drhd_unit(drhd) {
+		if (!drhd->include_all) {
+			int i;
+			for (i = 0; i < drhd->devices_cnt; i++)
+				if (drhd->devices[i] != NULL)
+					break;
+			/* ignore DMAR unit if no pci devices exist */
+			if (i == drhd->devices_cnt)
+				drhd->ignored = 1;
+		}
+	}
+
+	if (dmar_map_gfx)
+		return;
+
+	for_each_drhd_unit(drhd) {
+		int i;
+		if (drhd->ignored || drhd->include_all)
+			continue;
+
+		for (i = 0; i < drhd->devices_cnt; i++)
+			if (drhd->devices[i] &&
+				!IS_GFX_DEVICE(drhd->devices[i]))
+				break;
+
+		if (i < drhd->devices_cnt)
+			continue;
+
+		/* bypass IOMMU if it is just for gfx devices */
+		drhd->ignored = 1;
+		for (i = 0; i < drhd->devices_cnt; i++) {
+			if (!drhd->devices[i])
+				continue;
+			drhd->devices[i]->sysdata = DUMMY_DEVICE_DOMAIN_INFO;
+		}
+	}
+}
+
+int __init intel_iommu_init(void)
+{
+	int ret = 0;
+
+	if (no_iommu || swiotlb || dmar_disabled)
+		return -ENODEV;
+
+	if (dmar_table_init())
+		return 	-ENODEV;
+
+	iommu_init_mempool();
+	dmar_init_reserved_ranges();
+
+	init_no_remapping_devices();
+
+	ret = init_dmars();
+	if (ret) {
+		printk(KERN_ERR "IOMMU: dmar init failed\n");
+		put_iova_domain(&reserved_iova_list);
+		iommu_exit_mempool();
+		return ret;
+	}
+	printk(KERN_INFO
+	"PCI-DMA: Intel(R) Virtualization Technology for Directed I/O\n");
+
+	force_iommu = 1;
+	dma_ops = &intel_dma_ops;
+	return 0;
+}
Index: linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.h	2007-06-19 14:11:41.000000000 -0700
@@ -0,0 +1,318 @@
+/*
+ * Copyright (c) 2006, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * Copyright (C) Ashok Raj <ashok.raj@intel.com>
+ * Copyright (C) Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
+ */
+
+#ifndef _INTEL_IOMMU_H_
+#define _INTEL_IOMMU_H_
+
+#include <linux/types.h>
+#include <linux/msi.h>
+#include "iova.h"
+#include <linux/io.h>
+
+/*
+ * Intel IOMMU register specification per version 1.0 public spec.
+ */
+
+#define	DMAR_VER_REG	0x0	/* Arch version supported by this IOMMU */
+#define	DMAR_CAP_REG	0x8	/* Hardware supported capabilities */
+#define	DMAR_ECAP_REG	0x10	/* Extended capabilities supported */
+#define	DMAR_GCMD_REG	0x18	/* Global command register */
+#define	DMAR_GSTS_REG	0x1c	/* Global status register */
+#define	DMAR_RTADDR_REG	0x20	/* Root entry table */
+#define	DMAR_CCMD_REG	0x28	/* Context command reg */
+#define	DMAR_FSTS_REG	0x34	/* Fault Status register */
+#define	DMAR_FECTL_REG	0x38	/* Fault control register */
+#define	DMAR_FEDATA_REG	0x3c	/* Fault event interrupt data register */
+#define	DMAR_FEADDR_REG	0x40	/* Fault event interrupt addr register */
+#define	DMAR_FEUADDR_REG 0x44	/* Upper address register */
+#define	DMAR_AFLOG_REG	0x58	/* Advanced Fault control */
+#define	DMAR_PMEN_REG	0x64	/* Enable Protected Memory Region */
+#define	DMAR_PLMBASE_REG 0x68	/* PMRR Low addr */
+#define	DMAR_PLMLIMIT_REG 0x6c	/* PMRR low limit */
+#define	DMAR_PHMBASE_REG 0x70	/* pmrr high base addr */
+#define	DMAR_PHMLIMIT_REG 0x78	/* pmrr high limit */
+
+#define OFFSET_STRIDE		(9)
+/*
+#define dmar_readl(dmar, reg) readl(dmar + reg)
+#define dmar_readq(dmar, reg) ({ \
+		u32 lo, hi; \
+		lo = readl(dmar + reg); \
+		hi = readl(dmar + reg + 4); \
+		(((u64) hi) << 32) + lo; })
+*/
+static inline u64 dmar_readq(void *addr)
+{
+	u32 lo, hi;
+	lo = readl(addr);
+	hi = readl(addr + 4);
+	return (((u64) hi) << 32) + lo;
+}
+
+static inline void dmar_writeq(void __iomem *addr, u64 val)
+{
+	writel((u32)val, addr);
+	writel((u32)(val >> 32), addr + 4);
+}
+
+#define DMAR_VER_MAJOR(v)		(((v) & 0xf0) >> 4)
+#define DMAR_VER_MINOR(v)		((v) & 0x0f)
+
+/*
+ * Decoding Capability Register
+ */
+#define cap_read_drain(c)	(((c) >> 55) & 1)
+#define cap_write_drain(c)	(((c) >> 54) & 1)
+#define cap_max_amask_val(c)	(((c) >> 48) & 0x3f)
+#define cap_num_fault_regs(c)	((((c) >> 40) & 0xff) + 1)
+#define cap_pgsel_inv(c)	(((c) >> 39) & 1)
+
+#define cap_super_page_val(c)	(((c) >> 34) & 0xf)
+#define cap_super_offset(c)	(((find_first_bit(&cap_super_page_val(c), 4)) \
+					* OFFSET_STRIDE) + 21)
+
+#define cap_fault_reg_offset(c)	((((c) >> 24) & 0x3ff) * 16)
+#define cap_max_fault_reg_offset(c) \
+	(cap_fault_reg_offset(c) + cap_num_fault_regs(c) * 16)
+
+#define cap_zlr(c)		(((c) >> 22) & 1)
+#define cap_isoch(c)		(((c) >> 23) & 1)
+#define cap_mgaw(c)		((((c) >> 16) & 0x3f) + 1)
+#define cap_sagaw(c)		(((c) >> 8) & 0x1f)
+#define cap_caching_mode(c)	(((c) >> 7) & 1)
+#define cap_phmr(c)		(((c) >> 6) & 1)
+#define cap_plmr(c)		(((c) >> 5) & 1)
+#define cap_rwbf(c)		(((c) >> 4) & 1)
+#define cap_afl(c)		(((c) >> 3) & 1)
+#define cap_ndoms(c)		(((unsigned long)1) << (4 + 2 * ((c) & 0x7)))
+/*
+ * Extended Capability Register
+ */
+
+#define ecap_niotlb_iunits(e)	((((e) >> 24) & 0xff) + 1)
+#define ecap_iotlb_offset(e) 	((((e) >> 8) & 0x3ff) * 16)
+#define ecap_max_iotlb_offset(e) \
+	(ecap_iotlb_offset(e) + ecap_niotlb_iunits(e) * 16)
+#define ecap_coherent(e)	((e) & 0x1)
+
+
+/* IOTLB_REG */
+#define DMA_TLB_GLOBAL_FLUSH (((u64)1) << 60)
+#define DMA_TLB_DSI_FLUSH (((u64)2) << 60)
+#define DMA_TLB_PSI_FLUSH (((u64)3) << 60)
+#define DMA_TLB_IIRG(type) ((type >> 60) & 7)
+#define DMA_TLB_IAIG(val) (((val) >> 57) & 7)
+#define DMA_TLB_READ_DRAIN (((u64)1) << 49)
+#define DMA_TLB_WRITE_DRAIN (((u64)1) << 48)
+#define DMA_TLB_DID(id)	(((u64)((id) & 0xffff)) << 32)
+#define DMA_TLB_IVT (((u64)1) << 63)
+#define DMA_TLB_IH_NONLEAF (((u64)1) << 6)
+#define DMA_TLB_MAX_SIZE (0x3f)
+
+/* GCMD_REG */
+#define DMA_GCMD_TE (((u32)1) << 31)
+#define DMA_GCMD_SRTP (((u32)1) << 30)
+#define DMA_GCMD_SFL (((u32)1) << 29)
+#define DMA_GCMD_EAFL (((u32)1) << 28)
+#define DMA_GCMD_WBF (((u32)1) << 27)
+
+/* GSTS_REG */
+#define DMA_GSTS_TES (((u32)1) << 31)
+#define DMA_GSTS_RTPS (((u32)1) << 30)
+#define DMA_GSTS_FLS (((u32)1) << 29)
+#define DMA_GSTS_AFLS (((u32)1) << 28)
+#define DMA_GSTS_WBFS (((u32)1) << 27)
+
+/* CCMD_REG */
+#define DMA_CCMD_ICC (((u64)1) << 63)
+#define DMA_CCMD_GLOBAL_INVL (((u64)1) << 61)
+#define DMA_CCMD_DOMAIN_INVL (((u64)2) << 61)
+#define DMA_CCMD_DEVICE_INVL (((u64)3) << 61)
+#define DMA_CCMD_FM(m) (((u64)((m) & 0x3)) << 32)
+#define DMA_CCMD_MASK_NOBIT 0
+#define DMA_CCMD_MASK_1BIT 1
+#define DMA_CCMD_MASK_2BIT 2
+#define DMA_CCMD_MASK_3BIT 3
+#define DMA_CCMD_SID(s) (((u64)((s) & 0xffff)) << 16)
+#define DMA_CCMD_DID(d) ((u64)((d) & 0xffff))
+
+/* FECTL_REG */
+#define DMA_FECTL_IM (((u32)1) << 31)
+
+/* FSTS_REG */
+#define DMA_FSTS_PPF ((u32)2)
+#define DMA_FSTS_PFO ((u32)1)
+#define dma_fsts_fault_record_index(s) (((s) >> 8) & 0xff)
+
+/* FRCD_REG, 32 bits access */
+#define DMA_FRCD_F (((u32)1) << 31)
+#define dma_frcd_type(d) ((d >> 30) & 1)
+#define dma_frcd_fault_reason(c) (c & 0xff)
+#define dma_frcd_source_id(c) (c & 0xffff)
+#define dma_frcd_page_addr(d) (d & (((u64)-1) << 12)) /* low 64 bit */
+
+/*
+ * 0: Present
+ * 1-11: Reserved
+ * 12-63: Context Ptr (12 - (haw-1))
+ * 64-127: Reserved
+ */
+struct root_entry {
+	u64	val;
+	u64	rsvd1;
+};
+#define ROOT_ENTRY_NR (PAGE_SIZE_4K/sizeof(struct root_entry))
+static inline bool root_present(struct root_entry *root)
+{
+	return (root->val & 1);
+}
+static inline void set_root_present(struct root_entry *root)
+{
+	root->val |= 1;
+}
+static inline void set_root_value(struct root_entry *root, unsigned long value)
+{
+	root->val |= value & PAGE_MASK_4K;
+}
+
+struct context_entry;
+static inline struct context_entry *
+get_context_addr_from_root(struct root_entry *root)
+{
+	return (struct context_entry *)
+		(root_present(root)?phys_to_virt(
+		root->val & PAGE_MASK_4K):
+		NULL);
+}
+
+/*
+ * low 64 bits:
+ * 0: present
+ * 1: fault processing disable
+ * 2-3: translation type
+ * 12-63: address space root
+ * high 64 bits:
+ * 0-2: address width
+ * 3-6: avail
+ * 8-23: domain id
+ */
+struct context_entry {
+	u64 lo;
+	u64 hi;
+};
+#define context_present(c) ((c).lo & 1)
+#define context_fault_disable(c) (((c).lo >> 1) & 1)
+#define context_translation_type(c) (((c).lo >> 2) & 3)
+#define context_address_root(c) ((c).lo & PAGE_MASK_4K)
+#define context_address_width(c) ((c).hi &  7)
+#define context_domain_id(c) (((c).hi >> 8) & ((1 << 16) - 1))
+
+#define context_set_present(c) do {(c).lo |= 1;} while (0)
+#define context_set_fault_enable(c) \
+	do {(c).lo &= (((u64)-1) << 2) | 1;} while (0)
+#define context_set_translation_type(c, val) \
+	do { \
+		(c).lo &= (((u64)-1) << 4) | 3; \
+		(c).lo |= ((val) & 3) << 2; \
+	} while (0)
+#define CONTEXT_TT_MULTI_LEVEL 0
+#define context_set_address_root(c, val) \
+	do {(c).lo |= (val) & PAGE_MASK_4K;} while (0)
+#define context_set_address_width(c, val) do {(c).hi |= (val) & 7;} while (0)
+#define context_set_domain_id(c, val) \
+	do {(c).hi |= ((val) & ((1 << 16) - 1)) << 8;} while (0)
+#define context_clear_entry(c) do {(c).lo = 0; (c).hi = 0;} while (0)
+
+/*
+ * 0: readable
+ * 1: writable
+ * 2-6: reserved
+ * 7: super page
+ * 8-11: available
+ * 12-63: Host physical address
+ */
+struct dma_pte {
+	u64 val;
+};
+#define dma_clear_pte(p)	do {(p).val = 0;} while (0)
+
+#define DMA_PTE_READ (1)
+#define DMA_PTE_WRITE (2)
+
+#define dma_set_pte_readable(p) do {(p).val |= DMA_PTE_READ;} while (0)
+#define dma_set_pte_writable(p) do {(p).val |= DMA_PTE_WRITE;} while (0)
+#define dma_set_pte_prot(p, prot) \
+		do {(p).val = ((p).val & ~3) | ((prot) & 3); } while (0)
+#define dma_pte_addr(p) ((p).val & PAGE_MASK_4K)
+#define dma_set_pte_addr(p, addr) do {\
+		(p).val |= ((addr) & PAGE_MASK_4K); } while (0)
+#define dma_pte_present(p) (((p).val & 3) != 0)
+
+struct intel_iommu;
+
+struct dmar_domain {
+	int	id;			/* domain id */
+	struct intel_iommu *iommu;	/* back pointer to owning iommu */
+
+	struct list_head devices; 	/* all devices' list */
+	struct iova_domain iovad;	/* iova's that belong to this domain */
+
+	struct dma_pte	*pgd;		/* virtual address */
+	spinlock_t	mapping_lock;	/* page table lock */
+	int		gaw;		/* max guest address width */
+
+	/* adjusted guest address width, 0 is level 2 30-bit */
+	int		agaw;
+
+#define DOMAIN_FLAG_MULTIPLE_DEVICES 1
+	int		flags;
+};
+
+/* PCI domain-device relationship */
+struct device_domain_info {
+	struct list_head link;	/* link to domain siblings */
+	struct list_head global; /* link to global list */
+	u8 bus;			/* PCI bus number */
+	u8 devfn;		/* PCI devfn number */
+	struct pci_dev *dev; /* it's NULL for PCIE-to-PCI bridge */
+	struct dmar_domain *domain; /* pointer to domain */
+};
+
+extern int init_dmars(void);
+
+struct intel_iommu {
+	void __iomem	*reg; /* Pointer to hardware regs, virtual addr */
+	u64		cap;
+	u64		ecap;
+	unsigned long 	*domain_ids; /* bitmap of domains */
+	struct dmar_domain **domains; /* ptr to domains */
+	int		seg;
+	u32		gcmd; /* Holds TE, EAFL. Don't need SRTP, SFL, WBF */
+	spinlock_t	lock; /* protect context, domain ids */
+	spinlock_t	register_lock; /* protect register handling */
+	struct root_entry *root_entry; /* virtual address */
+
+	unsigned int irq;
+	unsigned char name[7];    /* Device Name */
+	struct msi_msg saved_msg;
+	struct sys_device sysdev;
+};
+
+#endif
Index: linux-2.6.22-rc4-mm2/include/linux/dmar.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/dmar.h	2007-06-19 14:06:20.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/dmar.h	2007-06-19 14:11:43.000000000 -0700
@@ -23,7 +23,14 @@
 
 #include <linux/acpi.h>
 #include <linux/types.h>
+#include <linux/msi.h>
 
+#ifdef CONFIG_DMAR
+struct intel_iommu;
+
+/* Intel IOMMU detection and initialization functions */
+extern void detect_intel_iommu(void);
+extern int intel_iommu_init(void);
 
 extern int dmar_table_init(void);
 extern int early_dmar_detect(void);
@@ -49,4 +56,19 @@
 	int	devices_cnt;		/* target device count */
 };
 
+#define for_each_drhd_unit(drhd) \
+	list_for_each_entry(drhd, &dmar_drhd_units, list)
+#define for_each_rmrr_units(rmrr) \
+	list_for_each_entry(rmrr, &dmar_rmrr_units, list)
+#else
+static inline void detect_intel_iommu(void)
+{
+	return;
+}
+static inline int intel_iommu_init(void)
+{
+	return -ENODEV;
+}
+
+#endif /* !CONFIG_DMAR */
 #endif /* __DMAR_H__ */

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls
  2007-06-19 21:37 [Intel IOMMU 00/10] Intel IOMMU support, take #2 Keshavamurthy, Anil S
                   ` (4 preceding siblings ...)
  2007-06-19 21:37 ` [Intel IOMMU 05/10] Intel IOMMU driver Keshavamurthy, Anil S
@ 2007-06-19 21:37 ` Keshavamurthy, Anil S
  2007-06-19 23:25   ` Christoph Lameter
  2007-06-20  8:06   ` Peter Zijlstra
  2007-06-19 21:37 ` [Intel IOMMU 07/10] Intel iommu cmdline option - forcedac Keshavamurthy, Anil S
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-19 21:37 UTC (permalink / raw)
  To: akpm, linux-kernel
  Cc: ak, gregkh, muli, suresh.b.siddha, arjan, ashok.raj, davem,
	clameter, Anil S Keshavamurthy

[-- Attachment #1: intel_iommu_pf_memalloc.patch --]
[-- Type: text/plain, Size: 3662 bytes --]

The Intel IOMMU driver needs memory during DMA map calls to set up its
internal page tables and other data structures. These DMA map calls are
mostly made in interrupt context or with a spinlock held by the upper
level drivers (network/storage drivers), so in order to avoid memory
allocation failures under low-memory conditions, this patch temporarily
sets the PF_MEMALLOC flag on the current task before making its memory
allocation calls.

We evaluated mempools as a backup for when kmem_cache_alloc() fails
and found that mempools are not really useful here because
 1) We don't know for sure how much to reserve in advance
 2) Mempools are not useful for the GFP_ATOMIC case (and we call
    the memory allocation functions with GFP_ATOMIC)


With the PF_MEMALLOC flag set in current->flags, the VM subsystem skips
the watermark checks before allocating memory, thus guaranteeing the
allocation down to the last free page. Further, looking at the
__alloc_pages() code in mm/page_alloc.c, this flag appears to help
only in non-interrupt context.

If we are in interrupt context and a memory allocation in the IOMMU
driver fails for some reason, the DMA map APIs will return failure and
it is up to the higher level drivers to retry. Should an upper level
driver nevertheless program the controller with such a bogus DMA
virtual address, the IOMMU will block that DMA transaction, preventing
any corruption of main memory.

So far, in our test scenarios we have not been able to trigger
any memory allocation failure inside the DMA map API calls.
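
For reference, all of the allocation helpers changed below follow the same
save/set/restore idiom around PF_MEMALLOC. A condensed sketch of the pattern
(the function name is illustrative only and it assumes the usual
<linux/sched.h> and <linux/slab.h> includes; the real helpers are in the
diff below):

	static void *pf_memalloc_alloc(struct kmem_cache *cachep)
	{
		unsigned int saved;
		void *vaddr;

		/* remember whether PF_MEMALLOC was already set for this task */
		saved = current->flags & PF_MEMALLOC;
		current->flags |= PF_MEMALLOC;

		/* the allocation may now dip below the VM watermarks */
		vaddr = kmem_cache_alloc(cachep, GFP_ATOMIC);

		/* restore the task's original PF_MEMALLOC state */
		current->flags &= (~PF_MEMALLOC | saved);
		return vaddr;
	}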

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>

---
 drivers/pci/intel-iommu.c |   30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

Index: linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/drivers/pci/intel-iommu.c	2007-06-18 15:45:46.000000000 -0700
+++ linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.c	2007-06-19 13:10:29.000000000 -0700
@@ -84,9 +84,31 @@
 static struct kmem_cache *iommu_devinfo_cache;
 static struct kmem_cache *iommu_iova_cache;
 
+static inline void *iommu_kmem_cache_alloc(struct kmem_cache *cachep)
+{
+	unsigned int flags;
+	void *vaddr;
+
+	/* trying to avoid low memory issues */
+	flags = current->flags & PF_MEMALLOC;
+	current->flags |= PF_MEMALLOC;
+	vaddr = kmem_cache_alloc(cachep, GFP_ATOMIC);
+	current->flags &= (~PF_MEMALLOC | flags);
+	return vaddr;
+}
+
+
 static inline void *alloc_pgtable_page(void)
 {
-	return (void *)get_zeroed_page(GFP_ATOMIC);
+	unsigned int flags;
+	void *vaddr;
+
+	/* trying to avoid low memory issues */
+	flags = current->flags & PF_MEMALLOC;
+	current->flags |= PF_MEMALLOC;
+	vaddr = (void *)get_zeroed_page(GFP_ATOMIC);
+	current->flags &= (~PF_MEMALLOC | flags);
+	return vaddr;
 }
 
 static inline void free_pgtable_page(void *vaddr)
@@ -96,7 +118,7 @@
 
 static inline void *alloc_domain_mem(void)
 {
-	return kmem_cache_alloc(iommu_domain_cache, GFP_ATOMIC);
+	return iommu_kmem_cache_alloc(iommu_domain_cache);
 }
 
 static inline void free_domain_mem(void *vaddr)
@@ -106,7 +128,7 @@
 
 static inline void * alloc_devinfo_mem(void)
 {
-	return kmem_cache_alloc(iommu_devinfo_cache, GFP_ATOMIC);
+	return iommu_kmem_cache_alloc(iommu_devinfo_cache);
 }
 
 static inline void free_devinfo_mem(void *vaddr)
@@ -116,7 +138,7 @@
 
 struct iova *alloc_iova_mem(void)
 {
-	return kmem_cache_alloc(iommu_iova_cache, GFP_ATOMIC);
+	return iommu_kmem_cache_alloc(iommu_iova_cache);
 }
 
 void free_iova_mem(struct iova *iova)

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [Intel IOMMU 07/10] Intel iommu cmdline option - forcedac
  2007-06-19 21:37 [Intel IOMMU 00/10] Intel IOMMU support, take #2 Keshavamurthy, Anil S
                   ` (5 preceding siblings ...)
  2007-06-19 21:37 ` [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls Keshavamurthy, Anil S
@ 2007-06-19 21:37 ` Keshavamurthy, Anil S
  2007-06-19 21:37 ` [Intel IOMMU 08/10] DMAR fault handling support Keshavamurthy, Anil S
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-19 21:37 UTC (permalink / raw)
  To: akpm, linux-kernel
  Cc: ak, gregkh, muli, suresh.b.siddha, arjan, ashok.raj, davem,
	clameter, Anil S Keshavamurthy

[-- Attachment #1: dmar_forcedac.patch --]
[-- Type: text/plain, Size: 2372 bytes --]

	Introduce the intel_iommu=forcedac command line option.
This option is helpful to verify that a PCI device is capable
of handling DMA-able physical addresses greater than 4GB.
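
As a usage sketch (illustrative, not part of the patch): boot with

	intel_iommu=forcedac

on the kernel command line and make sure the device under test
advertises 64-bit addressing, e.g.

	if (pci_set_dma_mask(pdev, DMA_64BIT_MASK) ||
	    pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK))
		dev_warn(&pdev->dev, "falling back to 32-bit DMA mask\n");

With forcedac set, such a device is no longer handed IO virtual
addresses below 4GB first, so its dual address cycle handling actually
gets exercised.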

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
---
 Documentation/kernel-parameters.txt |    7 +++++++
 drivers/pci/intel-iommu.c           |    7 ++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc4-mm2/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.22-rc4-mm2.orig/Documentation/kernel-parameters.txt	2007-06-18 15:45:46.000000000 -0700
+++ linux-2.6.22-rc4-mm2/Documentation/kernel-parameters.txt	2007-06-18 15:45:46.000000000 -0700
@@ -788,6 +788,13 @@
 			bypassed by not enabling DMAR with this option. In
 			this case, gfx device will use physical address for
 			DMA.
+		forcedac [x86_64]
+			With this option iommu will not optimize to look
+			for io virtual address below 32 bit forcing dual
+			address cycle on pci bus for cards supporting greater
+			than 32 bit addressing. The default is to look
+			for translation below 32 bit and if not available
+			then look in the higher range.
 
 	io7=		[HW] IO7 for Marvel based alpha systems
 			See comment before marvel_specify_io7 in
Index: linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/drivers/pci/intel-iommu.c	2007-06-18 15:45:46.000000000 -0700
+++ linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.c	2007-06-19 13:09:06.000000000 -0700
@@ -53,6 +53,7 @@
 
 static int dmar_disabled;
 static int __initdata dmar_map_gfx = 1;
+static int dmar_forcedac;
 
 #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
 static DEFINE_SPINLOCK(device_domain_lock);
@@ -70,6 +71,10 @@
 			dmar_map_gfx = 0;
 			printk(KERN_INFO
 				"Intel-IOMMU: disable GFX device mapping\n");
+		} else if (!strncmp(str, "forcedac", 8)) {
+			printk (KERN_INFO
+				"Intel-IOMMU: Forcing DAC for PCI devices\n");
+			dmar_forcedac = 1;
 		}
 
 		str += strcspn(str, ",");
@@ -1557,7 +1562,7 @@
 
 	start_addr = IOVA_START_ADDR;
 
-	if (pdev->dma_mask <= DMA_32BIT_MASK) {
+	if ((pdev->dma_mask <= DMA_32BIT_MASK) || (dmar_forcedac)) {
 		iova = iommu_alloc_iova(domain, addr, size, start_addr,
 			pdev->dma_mask);
 	} else  {

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [Intel IOMMU 08/10] DMAR fault handling support
  2007-06-19 21:37 [Intel IOMMU 00/10] Intel IOMMU support, take #2 Keshavamurthy, Anil S
                   ` (6 preceding siblings ...)
  2007-06-19 21:37 ` [Intel IOMMU 07/10] Intel iommu cmdline option - forcedac Keshavamurthy, Anil S
@ 2007-06-19 21:37 ` Keshavamurthy, Anil S
  2007-06-19 21:37 ` [Intel IOMMU 09/10] Iommu Gfx workaround Keshavamurthy, Anil S
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-19 21:37 UTC (permalink / raw)
  To: akpm, linux-kernel
  Cc: ak, gregkh, muli, suresh.b.siddha, arjan, ashok.raj, davem,
	clameter, Anil S Keshavamurthy

[-- Attachment #1: dmar_fault_handling_support.patch --]
[-- Type: text/plain, Size: 10354 bytes --]

	MSI interrupt handler registration and fault handling support
for the Intel IOMMU hardware.

This patch enables the MSI interrupts for the DMA remapping units;
the interrupt handler reads the fault cause and prints it to the
console.
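
For readers of the larger hunk below, decoding one 16-byte primary
fault record boils down to the following (a condensed, illustrative
restatement of what iommu_page_fault() does, using the dma_frcd_*
helpers from intel-iommu.h in this series):

	u32 hi = readl(iommu->reg + reg + idx * PRIMARY_FAULT_REG_LEN + 12);

	if (hi & DMA_FRCD_F) {			/* record holds a valid fault */
		u8  reason = dma_frcd_fault_reason(hi);
		int type   = dma_frcd_type(hi);	/* read vs. write */
		u32 lo     = readl(iommu->reg + reg +
					idx * PRIMARY_FAULT_REG_LEN + 8);
		u16 sid    = dma_frcd_source_id(lo);	/* requester bus:dev.fn */
		u64 addr   = dma_frcd_page_addr(dmar_readq(iommu->reg + reg +
					idx * PRIMARY_FAULT_REG_LEN));

		/* report it, then write DMA_FRCD_F back to offset +12
		 * to clear the record */
	}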

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
---
 Documentation/Intel-IOMMU.txt |   17 +++
 arch/x86_64/kernel/io_apic.c  |   59 ++++++++++++
 drivers/pci/intel-iommu.c     |  194 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/dmar.h          |   12 ++
 4 files changed, 281 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc4-mm2/Documentation/Intel-IOMMU.txt
===================================================================
--- linux-2.6.22-rc4-mm2.orig/Documentation/Intel-IOMMU.txt	2007-06-18 15:45:46.000000000 -0700
+++ linux-2.6.22-rc4-mm2/Documentation/Intel-IOMMU.txt	2007-06-19 13:05:03.000000000 -0700
@@ -63,6 +63,15 @@
 The same is true for peer to peer transactions. Hence we reserve the
 address from PCI MMIO ranges so they are not allocated for IOVA addresses.
 
+
+Fault reporting
+---------------
+When errors are reported, the DMA engine signals via an interrupt. The fault
+reason and the device that caused the fault are printed on the console.
+
+See below for sample.
+
+
 Boot Message Sample
 -------------------
 
@@ -85,6 +94,14 @@
 
 PCI-DMA: Using DMAR IOMMU
 
+Fault reporting
+---------------
+
+DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
+DMAR:[fault reason 05] PTE Write access is not set
+DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
+DMAR:[fault reason 05] PTE Write access is not set
+
 TBD
 ----
 
Index: linux-2.6.22-rc4-mm2/arch/x86_64/kernel/io_apic.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/arch/x86_64/kernel/io_apic.c	2007-06-18 15:45:38.000000000 -0700
+++ linux-2.6.22-rc4-mm2/arch/x86_64/kernel/io_apic.c	2007-06-18 15:45:46.000000000 -0700
@@ -31,6 +31,7 @@
 #include <linux/sysdev.h>
 #include <linux/msi.h>
 #include <linux/htirq.h>
+#include <linux/dmar.h>
 #ifdef CONFIG_ACPI
 #include <acpi/acpi_bus.h>
 #endif
@@ -2026,8 +2027,64 @@
 	destroy_irq(irq);
 }
 
-#endif /* CONFIG_PCI_MSI */
+#ifdef CONFIG_DMAR
+#ifdef CONFIG_SMP
+static void dmar_msi_set_affinity(unsigned int irq, cpumask_t mask)
+{
+	struct irq_cfg *cfg = irq_cfg + irq;
+	struct msi_msg msg;
+	unsigned int dest;
+	cpumask_t tmp;
+
+	cpus_and(tmp, mask, cpu_online_map);
+	if (cpus_empty(tmp))
+		return;
+
+	if (assign_irq_vector(irq, mask))
+		return;
+
+	cpus_and(tmp, cfg->domain, mask);
+	dest = cpu_mask_to_apicid(tmp);
+
+	dmar_msi_read(irq, &msg);
+
+	msg.data &= ~MSI_DATA_VECTOR_MASK;
+	msg.data |= MSI_DATA_VECTOR(cfg->vector);
+	msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
+	msg.address_lo |= MSI_ADDR_DEST_ID(dest);
+
+	dmar_msi_write(irq, &msg);
+	irq_desc[irq].affinity = mask;
+}
+#endif /* CONFIG_SMP */
+
+struct irq_chip dmar_msi_type = {
+	.name = "DMAR_MSI",
+	.unmask = dmar_msi_unmask,
+	.mask = dmar_msi_mask,
+	.ack = ack_apic_edge,
+#ifdef CONFIG_SMP
+	.set_affinity = dmar_msi_set_affinity,
+#endif
+	.retrigger = ioapic_retrigger_irq,
+};
+
+int arch_setup_dmar_msi(unsigned int irq)
+{
+	int ret;
+	struct msi_msg msg;
+
+	ret = msi_compose_msg(NULL, irq, &msg);
+	if (ret < 0)
+		return ret;
+	dmar_msi_write(irq, &msg);
+	set_irq_chip_and_handler_name(irq, &dmar_msi_type, handle_edge_irq,
+		"edge");
+	return 0;
+}
+#endif
 
+#endif /* CONFIG_PCI_MSI */
 /*
  * Hypertransport interrupt support
  */
Index: linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/drivers/pci/intel-iommu.c	2007-06-18 15:45:46.000000000 -0700
+++ linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.c	2007-06-19 13:05:03.000000000 -0700
@@ -742,6 +742,196 @@
 	return 0;
 }
 
+/* iommu interrupt handling. Most stuff are MSI-like. */
+
+static char *fault_reason_strings[] =
+{
+	"Software",
+	"Present bit in root entry is clear",
+	"Present bit in context entry is clear",
+	"Invalid context entry",
+	"Access beyond MGAW",
+	"PTE Write access is not set",
+	"PTE Read access is not set",
+	"Next page table ptr is invalid",
+	"Root table address invalid",
+	"Context table ptr is invalid",
+	"non-zero reserved fields in RTP",
+	"non-zero reserved fields in CTP",
+	"non-zero reserved fields in PTE",
+	"Unknown"
+};
+#define MAX_FAULT_REASON_IDX 	ARRAY_SIZE(fault_reason_strings)
+
+char *dmar_get_fault_reason(u8 fault_reason)
+{
+	if (fault_reason > MAX_FAULT_REASON_IDX)
+		return fault_reason_strings[MAX_FAULT_REASON_IDX];
+	else
+		return fault_reason_strings[fault_reason];
+}
+
+void dmar_msi_unmask(unsigned int irq)
+{
+	struct intel_iommu *iommu = get_irq_data(irq);
+	unsigned long flag;
+
+	/* unmask it */
+	spin_lock_irqsave(&iommu->register_lock, flag);
+	writel(0, iommu->reg + DMAR_FECTL_REG);
+	/* Read a reg to force flush the post write */
+	readl(iommu->reg + DMAR_FECTL_REG);
+	spin_unlock_irqrestore(&iommu->register_lock, flag);
+}
+
+void dmar_msi_mask(unsigned int irq)
+{
+	unsigned long flag;
+	struct intel_iommu *iommu = get_irq_data(irq);
+
+	/* mask it */
+	spin_lock_irqsave(&iommu->register_lock, flag);
+	writel(DMA_FECTL_IM, iommu->reg + DMAR_FECTL_REG);
+	/* Read a reg to force flush the post write */
+	readl(iommu->reg + DMAR_FECTL_REG);
+	spin_unlock_irqrestore(&iommu->register_lock, flag);
+}
+
+void dmar_msi_write(int irq, struct msi_msg *msg)
+{
+	struct intel_iommu *iommu = get_irq_data(irq);
+	unsigned long flag;
+
+	spin_lock_irqsave(&iommu->register_lock, flag);
+	writel(msg->data, iommu->reg + DMAR_FEDATA_REG);
+	writel(msg->address_lo, iommu->reg + DMAR_FEADDR_REG);
+	writel(msg->address_hi, iommu->reg + DMAR_FEUADDR_REG);
+	spin_unlock_irqrestore(&iommu->register_lock, flag);
+}
+
+void dmar_msi_read(int irq, struct msi_msg *msg)
+{
+	struct intel_iommu *iommu = get_irq_data(irq);
+	unsigned long flag;
+
+	spin_lock_irqsave(&iommu->register_lock, flag);
+	msg->data = readl(iommu->reg + DMAR_FEDATA_REG);
+	msg->address_lo = readl(iommu->reg + DMAR_FEADDR_REG);
+	msg->address_hi = readl(iommu->reg + DMAR_FEUADDR_REG);
+	spin_unlock_irqrestore(&iommu->register_lock, flag);
+}
+
+static int iommu_page_fault_do_one(struct intel_iommu *iommu, int type,
+		u8 fault_reason, u16 source_id, u64 addr)
+{
+	char *reason;
+
+	reason = dmar_get_fault_reason(fault_reason);
+
+	printk(KERN_ERR
+		"DMAR:[%s] Request device [%02x:%02x.%d] "
+		"fault addr %llx \n"
+		"DMAR:[fault reason %02d] %s\n",
+		(type ? "DMA Read" : "DMA Write"),
+		(source_id >> 8), PCI_SLOT(source_id & 0xFF),
+		PCI_FUNC(source_id & 0xFF), addr, fault_reason, reason);
+	return 0;
+}
+
+#define PRIMARY_FAULT_REG_LEN (16)
+static irqreturn_t iommu_page_fault(int irq, void *dev_id)
+{
+	struct intel_iommu *iommu = dev_id;
+	int reg, fault_index;
+	u32 fault_status;
+	unsigned long flag;
+
+	spin_lock_irqsave(&iommu->register_lock, flag);
+	fault_status = readl(iommu->reg + DMAR_FSTS_REG);
+
+	/* TBD: ignore advanced fault log currently */
+	if (!(fault_status & DMA_FSTS_PPF))
+		goto clear_overflow;
+
+	fault_index = dma_fsts_fault_record_index(fault_status);
+	reg = cap_fault_reg_offset(iommu->cap);
+	while (1) {
+		u8 fault_reason;
+		u16 source_id;
+		u64 guest_addr;
+		int type;
+		u32 data;
+
+		/* highest 32 bits */
+		data = readl(iommu->reg + reg +
+				fault_index * PRIMARY_FAULT_REG_LEN + 12);
+		if (!(data & DMA_FRCD_F))
+			break;
+
+		fault_reason = dma_frcd_fault_reason(data);
+		type = dma_frcd_type(data);
+
+		data = readl(iommu->reg + reg +
+				fault_index * PRIMARY_FAULT_REG_LEN + 8);
+		source_id = dma_frcd_source_id(data);
+
+		guest_addr = dmar_readq(iommu->reg + reg +
+				fault_index * PRIMARY_FAULT_REG_LEN);
+		guest_addr = dma_frcd_page_addr(guest_addr);
+		/* clear the fault */
+		writel(DMA_FRCD_F, iommu->reg + reg +
+			fault_index * PRIMARY_FAULT_REG_LEN + 12);
+
+		spin_unlock_irqrestore(&iommu->register_lock, flag);
+
+		iommu_page_fault_do_one(iommu, type, fault_reason,
+				source_id, guest_addr);
+
+		fault_index++;
+		if (fault_index > cap_num_fault_regs(iommu->cap))
+			fault_index = 0;
+		spin_lock_irqsave(&iommu->register_lock, flag);
+	}
+clear_overflow:
+	/* clear primary fault overflow */
+	fault_status = readl(iommu->reg + DMAR_FSTS_REG);
+	if (fault_status & DMA_FSTS_PFO)
+		writel(DMA_FSTS_PFO, iommu->reg + DMAR_FSTS_REG);
+
+	spin_unlock_irqrestore(&iommu->register_lock, flag);
+	return IRQ_HANDLED;
+}
+
+int dmar_set_interrupt(struct intel_iommu *iommu)
+{
+	int irq, ret;
+
+	irq = create_irq();
+	if (!irq) {
+		printk(KERN_ERR "IOMMU: no free vectors\n");
+		return -EINVAL;
+	}
+
+	set_irq_data(irq, iommu);
+	iommu->irq = irq;
+
+	ret = arch_setup_dmar_msi(irq);
+	if (ret) {
+		set_irq_data(irq, NULL);
+		iommu->irq = 0;
+		destroy_irq(irq);
+		return 0;
+	}
+
+	/* Force fault register is cleared */
+	iommu_page_fault(irq, iommu);
+
+	ret = request_irq(irq, iommu_page_fault, 0, iommu->name, iommu);
+	if (ret)
+		printk(KERN_ERR "IOMMU: can't request irq\n");
+	return ret;
+}
+
 static int iommu_init_domains(struct intel_iommu *iommu)
 {
 	unsigned long ndomains;
@@ -1489,6 +1679,10 @@
 
 		iommu_flush_write_buffer(iommu);
 
+		ret = dmar_set_interrupt(iommu);
+		if (ret)
+			goto error;
+
 		iommu_set_root_entry(iommu);
 
 		iommu_flush_context_global(iommu, 0);
Index: linux-2.6.22-rc4-mm2/include/linux/dmar.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/include/linux/dmar.h	2007-06-18 15:45:46.000000000 -0700
+++ linux-2.6.22-rc4-mm2/include/linux/dmar.h	2007-06-18 15:45:46.000000000 -0700
@@ -28,6 +28,18 @@
 #ifdef CONFIG_DMAR
 struct intel_iommu;
 
+extern char *dmar_get_fault_reason(u8 fault_reason);
+
+/* Can't use the common MSI interrupt functions
+ * since DMAR is not a pci device
+ */
+extern void dmar_msi_unmask(unsigned int irq);
+extern void dmar_msi_mask(unsigned int irq);
+extern void dmar_msi_read(int irq, struct msi_msg *msg);
+extern void dmar_msi_write(int irq, struct msi_msg *msg);
+extern int dmar_set_interrupt(struct intel_iommu *iommu);
+extern int arch_setup_dmar_msi(unsigned int irq);
+
 /* Intel IOMMU detection and initialization functions */
 extern void detect_intel_iommu(void);
 extern int intel_iommu_init(void);

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [Intel IOMMU 09/10] Iommu Gfx workaround
  2007-06-19 21:37 [Intel IOMMU 00/10] Intel IOMMU support, take #2 Keshavamurthy, Anil S
                   ` (7 preceding siblings ...)
  2007-06-19 21:37 ` [Intel IOMMU 08/10] DMAR fault handling support Keshavamurthy, Anil S
@ 2007-06-19 21:37 ` Keshavamurthy, Anil S
  2007-06-19 21:37 ` [Intel IOMMU 10/10] Iommu floppy workaround Keshavamurthy, Anil S
  2007-06-26  6:45 ` [Intel IOMMU 00/10] Intel IOMMU support, take #2 Andrew Morton
  10 siblings, 0 replies; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-19 21:37 UTC (permalink / raw)
  To: akpm, linux-kernel
  Cc: ak, gregkh, muli, suresh.b.siddha, arjan, ashok.raj, davem,
	clameter, Anil S Keshavamurthy

[-- Attachment #1: gfx_wrkaround.patch --]
[-- Type: text/plain, Size: 4925 bytes --]

Once all the open source gfx drivers are fixed to use the DMA api's,
this config option can be yanked out.
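
For context, "using the DMA api's" means something like the following
on the gfx driver side (an illustrative sketch with generic names, not
code from any particular driver):

	dma_addr_t bus;

	bus = pci_map_page(pdev, page, 0, PAGE_SIZE, PCI_DMA_TODEVICE);
	if (pci_dma_mapping_error(bus))
		return -ENOMEM;
	/* program 'bus' into the hardware instead of page_to_phys(page) */
	...
	pci_unmap_page(pdev, bus, PAGE_SIZE, PCI_DMA_TODEVICE);

Until the drivers do that, the unity map set up below keeps them
working.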

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
---
 Documentation/Intel-IOMMU.txt |    5 +++++
 arch/x86_64/Kconfig           |   11 +++++++++++
 arch/x86_64/kernel/e820.c     |   19 +++++++++++++++++++
 drivers/pci/intel-iommu.c     |   33 +++++++++++++++++++++++++++++++++
 drivers/pci/intel-iommu.h     |    7 +++++++
 5 files changed, 75 insertions(+)

Index: linux-2.6.22-rc4-mm2/Documentation/Intel-IOMMU.txt
===================================================================
--- linux-2.6.22-rc4-mm2.orig/Documentation/Intel-IOMMU.txt	2007-06-18 15:45:08.000000000 -0700
+++ linux-2.6.22-rc4-mm2/Documentation/Intel-IOMMU.txt	2007-06-18 15:45:08.000000000 -0700
@@ -57,6 +57,11 @@
 If you encounter issues with graphics devices, you can try adding
 option intel_iommu=igfx_off to turn off the integrated graphics engine.
 
+If it happens to be a PCI device included in the INCLUDE_ALL Engine,
+then try enabling CONFIG_DMAR_GFX_WA to setup a 1-1 map. We hear
+graphics drivers may be in process of using DMA api's in the near
+future and at that time this option can be yanked out.
+
 Some exceptions to IOVA
 -----------------------
 Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff).
Index: linux-2.6.22-rc4-mm2/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.22-rc4-mm2.orig/arch/x86_64/Kconfig	2007-06-18 15:45:07.000000000 -0700
+++ linux-2.6.22-rc4-mm2/arch/x86_64/Kconfig	2007-06-18 15:45:08.000000000 -0700
@@ -741,6 +741,17 @@
 	  and includes pci device scope covered by these DMA
 	  remapping device.
 
+config DMAR_GFX_WA
+	bool "Support for Graphics workaround"
+	depends on DMAR
+	default y
+	help
+	 Current Graphics drivers tend to use physical address
+	 for DMA and avoid using DMA api's. Setting this config
+	 option permits the IOMMU driver to set a unity map for
+	 all the OS visible memory. Hence the driver can continue
+	 to use physical addresses for DMA.
+
 source "drivers/pci/pcie/Kconfig"
 
 source "drivers/pci/Kconfig"
Index: linux-2.6.22-rc4-mm2/arch/x86_64/kernel/e820.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/arch/x86_64/kernel/e820.c	2007-06-18 15:44:44.000000000 -0700
+++ linux-2.6.22-rc4-mm2/arch/x86_64/kernel/e820.c	2007-06-18 15:45:08.000000000 -0700
@@ -723,3 +723,22 @@
 	printk(KERN_INFO "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",
 		pci_mem_start, gapstart, gapsize);
 }
+
+int __init arch_get_ram_range(int slot, u64 *addr, u64 *size)
+{
+	int i;
+
+	if (slot < 0 || slot >= e820.nr_map)
+		return -1;
+	for (i = slot; i < e820.nr_map; i++) {
+		if (e820.map[i].type != E820_RAM)
+			continue;
+		break;
+	}
+	if (i == e820.nr_map || e820.map[i].addr > (max_pfn << PAGE_SHIFT))
+		return -1;
+	*addr = e820.map[i].addr;
+	*size = min_t(u64, e820.map[i].size + e820.map[i].addr,
+		max_pfn << PAGE_SHIFT) - *addr;
+	return i + 1;
+}
Index: linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/drivers/pci/intel-iommu.c	2007-06-18 15:45:08.000000000 -0700
+++ linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.c	2007-06-18 15:45:08.000000000 -0700
@@ -1601,6 +1601,36 @@
 		rmrr->end_address + 1);
 }
 
+#ifdef CONFIG_DMAR_GFX_WA
+extern int arch_get_ram_range(int slot, u64 *addr, u64 *size);
+static void __init iommu_prepare_gfx_mapping(void)
+{
+	struct pci_dev *pdev = NULL;
+	u64 base, size;
+	int slot;
+	int ret;
+
+	for_each_pci_dev(pdev) {
+		if (pdev->sysdata == DUMMY_DEVICE_DOMAIN_INFO ||
+				!IS_GFX_DEVICE(pdev))
+			continue;
+		printk(KERN_INFO "IOMMU: gfx device %s 1-1 mapping\n",
+			pci_name(pdev));
+		slot = arch_get_ram_range(0, &base, &size);
+		while (slot >= 0) {
+			ret = iommu_prepare_identity_map(pdev,
+					base, base + size);
+			if (ret)
+				goto error;
+			slot = arch_get_ram_range(slot, &base, &size);
+		}
+		continue;
+error:
+		printk(KERN_ERR "IOMMU: mapping reserved region failed\n");
+	}
+}
+#endif
+
 int __init init_dmars(void)
 {
 	struct dmar_drhd_unit *drhd;
@@ -1664,6 +1694,8 @@
 		}
 	}
 
+	iommu_prepare_gfx_mapping();
+
 	/*
 	 * for each drhd
 	 *   enable fault log
@@ -2175,3 +2207,4 @@
 	dma_ops = &intel_dma_ops;
 	return 0;
 }
+
Index: linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/drivers/pci/intel-iommu.h	2007-06-18 15:45:08.000000000 -0700
+++ linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.h	2007-06-18 15:45:08.000000000 -0700
@@ -315,4 +315,11 @@
 	struct sys_device sysdev;
 };
 
+#ifndef CONFIG_DMAR_GFX_WA
+static inline void iommu_prepare_gfx_mapping(void)
+{
+	return;
+}
+#endif /* !CONFIG_DMAR_GFX_WA */
+
 #endif

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [Intel IOMMU 10/10] Iommu floppy workaround
  2007-06-19 21:37 [Intel IOMMU 00/10] Intel IOMMU support, take #2 Keshavamurthy, Anil S
                   ` (8 preceding siblings ...)
  2007-06-19 21:37 ` [Intel IOMMU 09/10] Iommu Gfx workaround Keshavamurthy, Anil S
@ 2007-06-19 21:37 ` Keshavamurthy, Anil S
  2007-06-26  6:42   ` Andrew Morton
  2007-06-26  6:45 ` [Intel IOMMU 00/10] Intel IOMMU support, take #2 Andrew Morton
  10 siblings, 1 reply; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-19 21:37 UTC (permalink / raw)
  To: akpm, linux-kernel
  Cc: ak, gregkh, muli, suresh.b.siddha, arjan, ashok.raj, davem,
	clameter, Anil S Keshavamurthy

[-- Attachment #1: floppy_disk_wrkaround.patch --]
[-- Type: text/plain, Size: 2813 bytes --]

	This config option (DMAR_FLPY_WA) sets up a 1:1 mapping for the
floppy device so that the floppy device, which does not use the
DMA api's, will continue to work.

Once the floppy driver starts using the DMA api's, this config option
can be turned off or this patch can be yanked out of the kernel at
that time.
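
For background (not part of the patch): the floppy controller does
ISA-style DMA and can only address the first 16MB of physical memory.
Its buffers therefore come from the DMA zone, roughly:

	/* ISA DMA buffer: physical address guaranteed below 16MB */
	buf = (void *)__get_free_pages(GFP_KERNEL | GFP_DMA, get_order(size));

and the driver programs the ISA DMA controller with bus addresses
directly instead of going through the DMA mapping API, which is why
the 0-16MB unity map is needed.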


Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
---
 arch/x86_64/Kconfig       |   10 ++++++++++
 drivers/pci/intel-iommu.c |   22 ++++++++++++++++++++++
 drivers/pci/intel-iommu.h |    7 +++++++
 3 files changed, 39 insertions(+)

Index: linux-2.6.22-rc4-mm2/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.22-rc4-mm2.orig/arch/x86_64/Kconfig	2007-06-18 15:45:08.000000000 -0700
+++ linux-2.6.22-rc4-mm2/arch/x86_64/Kconfig	2007-06-18 15:45:09.000000000 -0700
@@ -752,6 +752,16 @@
 	 all the OS visible memory. Hence the driver can continue
 	 to use physical addresses for DMA.
 
+config DMAR_FLPY_WA
+	bool "Support for Floppy disk workaround"
+	depends on DMAR
+	default y
+	help
+	 Floppy disk drivers are known to bypass DMA API calls,
+	 thereby failing to work when the IOMMU is enabled. This
+	 workaround sets up a 1:1 mapping for the first
+	 16M so that the floppy (an ISA device) keeps working.
+
 source "drivers/pci/pcie/Kconfig"
 
 source "drivers/pci/Kconfig"
Index: linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/drivers/pci/intel-iommu.c	2007-06-18 15:45:08.000000000 -0700
+++ linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.c	2007-06-18 15:45:09.000000000 -0700
@@ -1631,6 +1631,26 @@
 }
 #endif
 
+#ifdef CONFIG_DMAR_FLPY_WA
+static inline void iommu_prepare_isa(void)
+{
+	struct pci_dev *pdev = NULL;
+	int ret;
+
+	pdev = pci_get_class (PCI_CLASS_BRIDGE_ISA << 8, NULL);
+	if (!pdev)
+		return;
+
+	printk (KERN_INFO "IOMMU: Prepare 0-16M unity mapping for LPC\n");
+	ret = iommu_prepare_identity_map(pdev, 0, 16*1024*1024);
+
+	if (ret)
+		printk ("IOMMU: Failed to create 0-64M identity map, \
+			Floppy might not work\n");
+
+}
+#endif
+
 int __init init_dmars(void)
 {
 	struct dmar_drhd_unit *drhd;
@@ -1696,6 +1716,8 @@
 
 	iommu_prepare_gfx_mapping();
 
+	iommu_prepare_isa();
+
 	/*
 	 * for each drhd
 	 *   enable fault log
Index: linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/drivers/pci/intel-iommu.h	2007-06-18 15:45:08.000000000 -0700
+++ linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.h	2007-06-18 15:45:09.000000000 -0700
@@ -322,4 +322,11 @@
 }
 #endif /* !CONFIG_DMAR_GFX_WA */
 
+#ifndef CONFIG_DMAR_FLPY_WA
+static inline void iommu_prepare_isa(void)
+{
+	return;
+}
+#endif /* !CONFIG_DMAR_FLPY_WA */
+
 #endif

-- 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls
  2007-06-19 21:37 ` [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls Keshavamurthy, Anil S
@ 2007-06-19 23:25   ` Christoph Lameter
  2007-06-19 23:27     ` Arjan van de Ven
  2007-06-20  8:06   ` Peter Zijlstra
  1 sibling, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2007-06-19 23:25 UTC (permalink / raw)
  To: Keshavamurthy, Anil S
  Cc: akpm, linux-kernel, ak, gregkh, muli, suresh.b.siddha, arjan,
	ashok.raj, davem

On Tue, 19 Jun 2007, Keshavamurthy, Anil S wrote:


> So far in our test scenario, we were unable to create
> any memory allocation failure inside dma map api calls.

All these functions should have gfp_t flags passed to them.
Otherwise you are locked into the use of GFP_ATOMIC. If this is a 
parameter then you may be able to use GFP_KERNEL in various places and you 
may develop the code to have less GFP_ATOMIC allocs in the future.
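
Concretely, the suggestion amounts to something like this (an
illustrative sketch, not code from the posted patch):

	static inline void *alloc_pgtable_page(gfp_t gfp)
	{
		return (void *)get_zeroed_page(gfp);
	}

	static inline void *alloc_domain_mem(gfp_t gfp)
	{
		return kmem_cache_alloc(iommu_domain_cache, gfp);
	}

with callers that may sleep passing GFP_KERNEL and interrupt-context
callers passing GFP_ATOMIC.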


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls
  2007-06-19 23:25   ` Christoph Lameter
@ 2007-06-19 23:27     ` Arjan van de Ven
  2007-06-19 23:34       ` Christoph Lameter
  0 siblings, 1 reply; 65+ messages in thread
From: Arjan van de Ven @ 2007-06-19 23:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Keshavamurthy, Anil S, akpm, linux-kernel, ak, gregkh, muli,
	suresh.b.siddha, ashok.raj, davem

Christoph Lameter wrote:
> On Tue, 19 Jun 2007, Keshavamurthy, Anil S wrote:
> 
> 
>> So far in our test scenario, we were unable to create
>> any memory allocation failure inside dma map api calls.
> 
> All these functions should have gfp_t flags passed to them.

why?

> Otherwise you are locked into the use of GFP_ATOMIC.

all callers pretty much are either in irq context or with spinlocks 
held. Good luck..... it's also called primarily from the PCI DMA API 
which doesn't take a gfp_t argument in the first place...

so I'm not seeing the point.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 05/10] Intel IOMMU driver
  2007-06-19 21:37 ` [Intel IOMMU 05/10] Intel IOMMU driver Keshavamurthy, Anil S
@ 2007-06-19 23:32   ` Christoph Lameter
  2007-06-19 23:50     ` Keshavamurthy, Anil S
  2007-06-26  6:32     ` Andrew Morton
  2007-06-26  6:25   ` Andrew Morton
  2007-06-26  6:30   ` Andrew Morton
  2 siblings, 2 replies; 65+ messages in thread
From: Christoph Lameter @ 2007-06-19 23:32 UTC (permalink / raw)
  To: Keshavamurthy, Anil S
  Cc: akpm, linux-kernel, ak, gregkh, muli, suresh.b.siddha, arjan,
	ashok.raj, davem

On Tue, 19 Jun 2007, Keshavamurthy, Anil S wrote:

> +static inline void *alloc_pgtable_page(void)
> +{
> +	return (void *)get_zeroed_page(GFP_ATOMIC);
> +}

Need to pass gfp_t parameter. Repeates a couple of times.

> +	addr &= (((u64)1) << addr_width) - 1;
> +	parent = domain->pgd;
> +
> +	spin_lock_irqsave(&domain->mapping_lock, flags);
> +	while (level > 0) {
> +		void *tmp_page;
> +
> +		offset = address_level_offset(addr, level);
> +		pte = &parent[offset];
> +		if (level == 1)
> +			break;
> +
> +		if (!dma_pte_present(*pte)) {
> +			tmp_page = alloc_pgtable_page();

Is it not possible here to drop the lock and do the alloc with GFP_KERNEL 
and deal with the resulting race? That is done in other parts of the 
kernel.
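
The pattern being alluded to looks roughly like this (an illustrative
sketch only; it assumes the allocation may sleep, which later replies
in this thread dispute for this call path):

	if (!dma_pte_present(*pte)) {
		spin_unlock_irqrestore(&domain->mapping_lock, flags);
		tmp_page = alloc_pgtable_page();  /* could then use GFP_KERNEL */
		spin_lock_irqsave(&domain->mapping_lock, flags);
		if (dma_pte_present(*pte)) {
			/* lost the race: another CPU installed a table */
			free_pgtable_page(tmp_page);
		} else {
			/* install tmp_page into *pte as the patch does */
		}
	}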

> +/* iommu handling */
> +static int iommu_alloc_root_entry(struct intel_iommu *iommu)
> +{
> +	struct root_entry *root;
> +	unsigned long flags;
> +
> +	root = (struct root_entry *)alloc_pgtable_page();

This may be able to become a GFP_KERNEL alloc since interrupts are enabled 
at this point?

> +int __init init_dmars(void)
> +{
> +	struct dmar_drhd_unit *drhd;
> +	struct dmar_rmrr_unit *rmrr;
> +	struct pci_dev *pdev;
> +	struct intel_iommu *iommu;
> +	int ret, unit = 0;
> +
> +	/*
> +	 * for each drhd
> +	 *    allocate root
> +	 *    initialize and program root entry to not present
> +	 * endfor
> +	 */
> +	for_each_drhd_unit(drhd) {
> +		if (drhd->ignored)
> +			continue;
> +		iommu = alloc_iommu(drhd);

GFP_KERNEL alloc possible?

> +{
> +	int ret = 0;
> +
> +	iommu_devinfo_cache = kmem_cache_create("iommu_devinfo",
> +					 sizeof(struct device_domain_info),
> +					 0,
> +					 SLAB_HWCACHE_ALIGN,
> +					 NULL,

Replace by

iommu_devinfo_cache = KMEM_CACHE(iommu_devinfo, SLAB_HWCACHE_ALIGN)

> +static inline int iommu_iova_cache_init(void)
> +{
> +	int ret = 0;
> +
> +	iommu_iova_cache = kmem_cache_create("iommu_iova",
> +					 sizeof(struct iova),
> +					 0,
> +					 SLAB_HWCACHE_ALIGN,
> +					 NULL,
> +					 NULL);

Ditto


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls
  2007-06-19 23:27     ` Arjan van de Ven
@ 2007-06-19 23:34       ` Christoph Lameter
  2007-06-20  0:02         ` Arjan van de Ven
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Lameter @ 2007-06-19 23:34 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Keshavamurthy, Anil S, akpm, linux-kernel, ak, gregkh, muli,
	suresh.b.siddha, ashok.raj, davem

On Tue, 19 Jun 2007, Arjan van de Ven wrote:

> > Otherwise you are locked into the use of GFP_ATOMIC.
> 
> all callers pretty much are either in irq context or with spinlocks held. Good
> luck..... it's also called primarily from the PCI DMA API which doesn't take a
> gfp_t argument in the first place...
> 
> so I'm not seeing the point.

Hmmm... From my superficial look at things it seems that one could avoid 
GFP_ATOMIC at times. I do not know too much about the driver though but it 
seems a bit restrictive to always do GFP_ATOMIC allocs.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 05/10] Intel IOMMU driver
  2007-06-19 23:32   ` Christoph Lameter
@ 2007-06-19 23:50     ` Keshavamurthy, Anil S
  2007-06-19 23:56       ` Christoph Lameter
  2007-06-26  6:32     ` Andrew Morton
  1 sibling, 1 reply; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-19 23:50 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Keshavamurthy, Anil S, akpm, linux-kernel, ak, gregkh, muli,
	suresh.b.siddha, arjan, ashok.raj, davem

On Tue, Jun 19, 2007 at 04:32:23PM -0700, Christoph Lameter wrote:
> On Tue, 19 Jun 2007, Keshavamurthy, Anil S wrote:
> 
> > +static inline void *alloc_pgtable_page(void)
> > +{
> > +	return (void *)get_zeroed_page(GFP_ATOMIC);
> > +}
> 
> Need to pass gfp_t parameter. Repeates a couple of times.
> 
> > +	addr &= (((u64)1) << addr_width) - 1;
> > +	parent = domain->pgd;
> > +
> > +	spin_lock_irqsave(&domain->mapping_lock, flags);
> > +	while (level > 0) {
> > +		void *tmp_page;
> > +
> > +		offset = address_level_offset(addr, level);
> > +		pte = &parent[offset];
> > +		if (level == 1)
> > +			break;
> > +
> > +		if (!dma_pte_present(*pte)) {
> > +			tmp_page = alloc_pgtable_page();
> 
> Is it not possible here to drop the lock and do the alloc with GFP_KERNEL 
> and deal with the resulting race? That is done in other parts of the 
> kernel.
> 
> > +/* iommu handling */
> > +static int iommu_alloc_root_entry(struct intel_iommu *iommu)
> > +{
> > +	struct root_entry *root;
> > +	unsigned long flags;
> > +
> > +	root = (struct root_entry *)alloc_pgtable_page();
> 
> This may be able to become a GFP_KERNEL alloc since interrupts are enabled 
> at this point?

The memory allocated during driver init is very small, so I don't think
there is much benefit from the suggested changes. Please correct me if I am wrong.

The biggest benefit will be when we can figure out the GFP_xxxx flags
at runtime when the DMA map api's are called.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 05/10] Intel IOMMU driver
  2007-06-19 23:50     ` Keshavamurthy, Anil S
@ 2007-06-19 23:56       ` Christoph Lameter
  0 siblings, 0 replies; 65+ messages in thread
From: Christoph Lameter @ 2007-06-19 23:56 UTC (permalink / raw)
  To: Keshavamurthy, Anil S
  Cc: akpm, linux-kernel, ak, gregkh, muli, suresh.b.siddha, arjan,
	ashok.raj, davem

On Tue, 19 Jun 2007, Keshavamurthy, Anil S wrote:

> Memory allocated during driver init is very less and not much benefit
> with the suggested changes I think. Please correct me If I am wrong.

If it's just a small amount of memory then the benefit will not be large.
You are likely right.

> The biggest benifit will be when we can figure out GPF_XXXX flags
> during runtime when DMA map api's are called. 

Right.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls
  2007-06-19 23:34       ` Christoph Lameter
@ 2007-06-20  0:02         ` Arjan van de Ven
  0 siblings, 0 replies; 65+ messages in thread
From: Arjan van de Ven @ 2007-06-20  0:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Keshavamurthy, Anil S, akpm, linux-kernel, ak, gregkh, muli,
	suresh.b.siddha, ashok.raj, davem

On Tue, 2007-06-19 at 16:34 -0700, Christoph Lameter wrote:
> On Tue, 19 Jun 2007, Arjan van de Ven wrote:
> 
> > > Otherwise you are locked into the use of GFP_ATOMIC.
> > 
> > all callers pretty much are either in irq context or with spinlocks held. Good
> > luck..... it's also called primarily from the PCI DMA API which doesn't take a
> > gfp_t argument in the first place...
> > 
> > so I'm not seeing the point.
> 
> Hmmm... From my superficial look at things it seems that one could avoid 
> GFP_ATOMIC at times. 

by changing ALL drivers. And then you realize the common scenario is
that it's taken in irq context ;)

> I do not know too much about the driver though but it 
> seems a bit restrictive to always do GFP_ATOMIC allocs.

the only alternative to GFP_ATOMIC is GFP_NOIO which is... barely
better. And then add that it can be used only sporadic...
feel free to first change all the drivers.. once the callers of these
functions have a gfp_t then I'm sure Anil will be happy to take one of
those as well ;-)


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls
  2007-06-19 21:37 ` [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls Keshavamurthy, Anil S
  2007-06-19 23:25   ` Christoph Lameter
@ 2007-06-20  8:06   ` Peter Zijlstra
  2007-06-20 13:03     ` Arjan van de Ven
  2007-06-26  5:34     ` Andrew Morton
  1 sibling, 2 replies; 65+ messages in thread
From: Peter Zijlstra @ 2007-06-20  8:06 UTC (permalink / raw)
  To: Keshavamurthy, Anil S
  Cc: akpm, linux-kernel, ak, gregkh, muli, suresh.b.siddha, arjan,
	ashok.raj, davem, clameter

On Tue, 2007-06-19 at 14:37 -0700, Keshavamurthy, Anil S wrote:
> plain text document attachment (intel_iommu_pf_memalloc.patch)
> Intel IOMMU driver needs memory during DMA map calls to setup its internal
> page tables and for other data structures. As we all know that these DMA 
> map calls are mostly called in the interrupt context or with the spinlock 
> held by the upper level drivers(network/storage drivers), so in order to 
> avoid any memory allocation failure due to low memory issues,
> this patch makes memory allocation by temporarily setting PF_MEMALLOC
> flags for the current task before making memory allocation calls.
> 
> We evaluated mempools as a backup when kmem_cache_alloc() fails
> and found that mempools are really not useful here because
>  1) We don;t know for sure how much to reserve in advance

So you just unleashed an unbounded allocation context on PF_MEMALLOC?
seems like a really really bad idea.

>  2) And mempools are not useful for GFP_ATOMIC case (as we call 
>     memory alloc functions with GFP_ATOMIC)

Mempools work as intended with GFP_ATOMIC: the pool gets filled up to the
specified number of elements using GFP_KERNEL (at creation time). This
gives GFP_ATOMIC allocations nr_elements extra items once they would
otherwise start failing.
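
For one of the caches in this driver that would look roughly like this
(illustrative; the pool name and element count are made up):

	static mempool_t *iova_pool;

	/* at init time (may sleep): pre-fill 64 elements with GFP_KERNEL */
	iova_pool = mempool_create(64, mempool_alloc_slab, mempool_free_slab,
				   iommu_iova_cache);

	/* in the map path: dips into the pre-filled elements once
	 * plain GFP_ATOMIC allocation starts failing */
	iova = mempool_alloc(iova_pool, GFP_ATOMIC);
	...
	mempool_free(iova, iova_pool);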

> With PF_MEMALLOC flag set in the current->flags, the VM subsystem avoids
> any watermark checks before allocating memory thus guarantee'ing the 
> memory till the last free page.

PF_MEMALLOC as is, is meant to salvage the VM from the typical VM
deadlock. Using it as you do now is not something a driver should ever
do, and I'm afraid I will have to strongly oppose this patch.

You really really want to calculate an upper bound on your memory
consumption and reserve this.

So, I'm afraid I'll have to..

NACK!


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls
  2007-06-20  8:06   ` Peter Zijlstra
@ 2007-06-20 13:03     ` Arjan van de Ven
  2007-06-20 17:30       ` Siddha, Suresh B
  2007-06-26  5:34     ` Andrew Morton
  1 sibling, 1 reply; 65+ messages in thread
From: Arjan van de Ven @ 2007-06-20 13:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Keshavamurthy, Anil S, akpm, linux-kernel, ak, gregkh, muli,
	suresh.b.siddha, ashok.raj, davem, clameter

Peter Zijlstra wrote:
> 
> 
> PF_MEMALLOC as is, is meant to salvage the VM from the typical VM
> deadlock. 

.. and this IS the typical VM deadlock.. it is your storage driver 
trying to write out a piece of memory on behalf of the VM, and calls 
the iommu to map it, which then needs a bit of memory....

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls
  2007-06-20 13:03     ` Arjan van de Ven
@ 2007-06-20 17:30       ` Siddha, Suresh B
  2007-06-20 18:05         ` Peter Zijlstra
  0 siblings, 1 reply; 65+ messages in thread
From: Siddha, Suresh B @ 2007-06-20 17:30 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Peter Zijlstra, Keshavamurthy, Anil S, akpm, linux-kernel, ak,
	gregkh, muli, suresh.b.siddha, ashok.raj, davem, clameter

On Wed, Jun 20, 2007 at 06:03:02AM -0700, Arjan van de Ven wrote:
> Peter Zijlstra wrote:
> >
> >
> >PF_MEMALLOC as is, is meant to salvage the VM from the typical VM
> >deadlock. 
> 
> .. and this IS the typical VM deadlock.. it is your storage driver 
> trying to write out a piece of memory on behalf of the VM, and calls 
> the iommu to map it, which then needs a bit of memory....

Today PF_MEMALLOC doesn't do much in interrupt context. If PF_MEMALLOC
is the right usage model for this, then we need to fix the behavior of
PF_MEMALLOC in the interrupt context(for our usage model, we do most
of the allocations in interrupt context).

I am not very familiar with PF_MEMALLOC. So experts please comment.

thanks,
suresh

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls
  2007-06-20 17:30       ` Siddha, Suresh B
@ 2007-06-20 18:05         ` Peter Zijlstra
  2007-06-20 19:14           ` Arjan van de Ven
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2007-06-20 18:05 UTC (permalink / raw)
  To: Siddha, Suresh B
  Cc: Arjan van de Ven, Keshavamurthy, Anil S, akpm, linux-kernel, ak,
	gregkh, muli, ashok.raj, davem, clameter

On Wed, 2007-06-20 at 10:30 -0700, Siddha, Suresh B wrote:
> On Wed, Jun 20, 2007 at 06:03:02AM -0700, Arjan van de Ven wrote:
> > Peter Zijlstra wrote:
> > >
> > >
> > >PF_MEMALLOC as is, is meant to salvage the VM from the typical VM
> > >deadlock. 
> > 
> > .. and this IS the typical VM deadlock.. it is your storage driver 
> > trying to write out a piece of memory on behalf of the VM, and calls 
> > the iommu to map it, which then needs a bit of memory....
> 
> Today PF_MEMALLOC doesn't do much in interrupt context. If PF_MEMALLOC
> is the right usage model for this, then we need to fix the behavior of
> PF_MEMALLOC in the interrupt context(for our usage model, we do most
> of the allocations in interrupt context).

Right, I have patches that add GFP_EMERGENCY to do basically that.

> I am not very familiar with PF_MEMALLOC. So experts please comment.

PF_MEMALLOC is meant to avoid the VM deadlock - that is we need memory
to free memory. The one constraint is that its use be bounded. (which is
currently violated in that there is no bound on the number of direct
reclaim contexts - which is on my to-fix list)

So a reclaim context (kswapd and direct reclaim) set PF_MEMALLOC to
ensure they themselves will not block on a memory allocation. And it is
understood that these code paths have a bounded memory footprint.

Now, this code seems to be running from interrupt context, which makes
it impossible to tell if the work is being done on behalf of a reclaim
task.  Is it possible to setup the needed data for the IRQ handler from
process context?

Blindly adding GFP_EMERGENCY to do this, has the distinct disadvantage
that there is no inherent bound on the amount of memory consumed. In my
patch set I add an emergency reserve (below the current watermarks,
because ALLOC_HIGH and ALLOC_HARDER modify the threshold in a relative
way, and thus cannot provide a guaranteed limit). I then accurately
account all allocations made from this reserve to ensure I never cross
the set limit.

Like has been said before, if possible move to blocking allocs
(GFP_NOIO), if that is not possible use mempools (for kmem_cache, or
page alloc), if that is not possible use ALLOC_NO_WATERMARKS
(PF_MEMALLOC, GFP_EMERGENCY) but put in a reserve and account its usage.

The last option basically boils down to reserved based allocation,
something which I hope to introduce some-day...

That is, failure is a OK, unless you're from a reclaim context, those
should make progress.


One thing I'm confused about, in earlier discussions it was said that
mempools are not sufficient because they deplete the GFP_ATOMIC reserve
and only then use the mempool. This would not work because some
downstream allocation would then go splat --- using
PF_MEMALLOC/GFP_EMERGENCY has exactly the same problem!




^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls
  2007-06-20 18:05         ` Peter Zijlstra
@ 2007-06-20 19:14           ` Arjan van de Ven
  2007-06-20 20:08             ` Peter Zijlstra
  0 siblings, 1 reply; 65+ messages in thread
From: Arjan van de Ven @ 2007-06-20 19:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Siddha, Suresh B, Keshavamurthy, Anil S, akpm, linux-kernel, ak,
	gregkh, muli, ashok.raj, davem, clameter

Peter Zijlstra wrote:
> So a reclaim context (kswapd and direct reclaim) set PF_MEMALLOC to
> ensure they themselves will not block on a memory allocation. And it is
> understood that these code paths have a bounded memory footprint.


that's a too simplistic view though; what happens is that kswapd will 
queue the IO, but the irq context will then take the IO from the queue 
and do the DMA mapping... which needs the memory.....

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls
  2007-06-20 19:14           ` Arjan van de Ven
@ 2007-06-20 20:08             ` Peter Zijlstra
  2007-06-20 23:03               ` Keshavamurthy, Anil S
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2007-06-20 20:08 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Siddha, Suresh B, Keshavamurthy, Anil S, akpm, linux-kernel, ak,
	gregkh, muli, ashok.raj, davem, clameter

On Wed, 2007-06-20 at 12:14 -0700, Arjan van de Ven wrote:
> Peter Zijlstra wrote:
> > So a reclaim context (kswapd and direct reclaim) set PF_MEMALLOC to
> > ensure they themselves will not block on a memory allocation. And it is
> > understood that these code paths have a bounded memory footprint.
> 
> 
> that's a too simplistic view though; what happens is that kswapd will 
> queue the IO, but the irq context will then take the IO from the queue 
> and do the DMA mapping... which needs the memory.....

Right, but who stops some unrelated interrupt handler from completely
depleting memory?

What I'm saying is that there should be some coupling between the
reclaim context and the irq context doing work on its behalf.

For instance, you know how many pages are in the queue, and which queue.
So you could preallocate enough memory to handle that many pages from
irq context and couple that reserve to the queue object. Then irq
context can use that memory to do the work.




^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls
  2007-06-20 20:08             ` Peter Zijlstra
@ 2007-06-20 23:03               ` Keshavamurthy, Anil S
  2007-06-21  6:10                 ` Peter Zijlstra
  0 siblings, 1 reply; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-20 23:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Siddha, Suresh B, Keshavamurthy, Anil S, akpm,
	linux-kernel, ak, gregkh, muli, ashok.raj, davem, clameter

On Wed, Jun 20, 2007 at 10:08:51PM +0200, Peter Zijlstra wrote:
> On Wed, 2007-06-20 at 12:14 -0700, Arjan van de Ven wrote:
> > Peter Zijlstra wrote:
> > > So a reclaim context (kswapd and direct reclaim) set PF_MEMALLOC to
> > > ensure they themselves will not block on a memory allocation. And it is
> > > understood that these code paths have a bounded memory footprint.
> > 
> > 
> > that's a too simplistic view though; what happens is that kswapd will 
> > queue the IO, but the irq context will then take the IO from the queue 
> > and do the DMA mapping... which needs the memory.....

As Arjan says, a reclaim context sets the PF_MEMALLOC flag and submits
the IO, but the controller driver decides to queue the IO and later, in
interrupt context, dequeues it and calls the IOMMU driver to map the
DMA physical address; it is in this DMA map API call that we may need
memory. Hence PF_MEMALLOC set by the reclaim context should work from
interrupt context too, and if it does not, that needs to be fixed.

> 
> Right, but who stops some unrelated interrupt handler from completely
> depleting memory?
The DMA map API's exposed by the IOMMU are called by the
storage/network controller drivers, and it is inside this IOMMU
driver that we are talking about allocating memory. So we have no
clue how much I/O the controller above us is capable of submitting.
All we do is map the DMA physical address and provide the caller
with the virtual DMA address, and it is in this process that we may
need some memory.

> 
> What I'm saying is that there should be some coupling between the
> reclaim context and the irq context doing work on its behalf.
Nope, this info is not available when upper level drivers
call the standard DMA map api's.

> 
> For instance, you know how many pages are in the queue, and which queue.
> So you could preallocate enough memory to handle that many pages from
> irq context and couple that reserve to the queue object. Then irq
> context can use that memory to do the work.

As I have said earlier, we are not the storage or network controller
driver and we have no idea how much IO the controller is capable of.


Again, this patch is a best effort to avoid failing the DMA map API;
if it fails due to lack of memory, we have no choice but to return
failure to the upper level driver. But today, upper level drivers are
not capable of handling failures from the DMA map api's, hence this
best effort not to fail the DMA map call.

thanks,
-Anil



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls
  2007-06-20 23:03               ` Keshavamurthy, Anil S
@ 2007-06-21  6:10                 ` Peter Zijlstra
  2007-06-21  6:11                   ` Arjan van de Ven
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2007-06-21  6:10 UTC (permalink / raw)
  To: Keshavamurthy, Anil S
  Cc: Arjan van de Ven, Siddha, Suresh B, akpm, linux-kernel, ak,
	gregkh, muli, ashok.raj, davem, clameter

On Wed, 2007-06-20 at 16:03 -0700, Keshavamurthy, Anil S wrote:
> On Wed, Jun 20, 2007 at 10:08:51PM +0200, Peter Zijlstra wrote:
> > On Wed, 2007-06-20 at 12:14 -0700, Arjan van de Ven wrote:
> > > Peter Zijlstra wrote:
> > > > So a reclaim context (kswapd and direct reclaim) set PF_MEMALLOC to
> > > > ensure they themselves will not block on a memory allocation. And it is
> > > > understood that these code paths have a bounded memory footprint.
> > > 
> > > 
> > > that's a too simplistic view though; what happens is that kswapd will 
> > > queue the IO, but the irq context will then take the IO from the queue 
> > > and do the DMA mapping... which needs the memory.....
> 
> As Arjan is saying, that a reclaim context sets PF_MEMALLOC flag
> and submits the IO, but the controller driver decides to queue the IO,
> and later in the interrupt context it de-queues and calls the
> IOMMU driver for mapping the DMA physical address and in this DMA 
> map api call we may need the memory to satisfy the DMA map api call.
> Hence PF_MEMALLOC set by the reclaim context should work from 
> interrupt context too, if it is not then that needs to be fixed.

PF_MEMALLOC cannot work from interrupt context, nor should it. But there
are other options.

What I'm saying is that if you do use the reserves, you should ensure
the use is bounded. I'm not seeing anything like that.

This is a generic API, who is to ensure some other non-swap device will
not deplete memory and deadlock the reclaim process?


Also, please do explain why mempools are not usable? those will have
exactly the same semantics as you are now getting (if PF_MEMALLOC would
work from interrupt context).


Also, explain to me how an IOMMU is different from bounce buffers? They
both do the same thing, no? They both need memory in order to complete
DMA.

Is it just a broken API you're working against? If so, isn't the Linux
way to fix these things, that is why we have the source code after all.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls
  2007-06-21  6:10                 ` Peter Zijlstra
@ 2007-06-21  6:11                   ` Arjan van de Ven
  2007-06-21  6:29                     ` Peter Zijlstra
  2007-06-21  6:30                     ` Keshavamurthy, Anil S
  0 siblings, 2 replies; 65+ messages in thread
From: Arjan van de Ven @ 2007-06-21  6:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Keshavamurthy, Anil S, Siddha, Suresh B, akpm, linux-kernel, ak,
	gregkh, muli, ashok.raj, davem, clameter

Peter Zijlstra wrote:
> What I'm saying is that if you do use the reserves, you should ensure
> the use is bounded. I'm not seeing anything like that.

each mapping takes at most 3 pages
> 
> This is a generic API, who is to ensure some other non-swap device will
> not deplete memory and deadlock the reclaim process?
>

that information is not available at this level ;(

> 
> 
> Also, explain to me how an IOMMU is different from bounce buffers? They
> both do the same thing, no? They both need memory in order to complete
> DMA.

bounce buffers happen in a place where you can sleep.... that makes a 
lot of difference.

> 
> Is it just a broken API you're working against? If so, isn't the Linux
> way to fix these things, that is why we have the source code after all.

well yes and no... the other iommu's snuck in as well... it's not 
entirely fair to hold this one back until a 2 year, 1400 driver 
project is completed ;(

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls
  2007-06-21  6:11                   ` Arjan van de Ven
@ 2007-06-21  6:29                     ` Peter Zijlstra
  2007-06-21  6:37                       ` Keshavamurthy, Anil S
  2007-06-21  6:30                     ` Keshavamurthy, Anil S
  1 sibling, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2007-06-21  6:29 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Keshavamurthy, Anil S, Siddha, Suresh B, akpm, linux-kernel, ak,
	gregkh, muli, ashok.raj, davem, clameter

On Wed, 2007-06-20 at 23:11 -0700, Arjan van de Ven wrote:
> Peter Zijlstra wrote:
> > What I'm saying is that if you do use the reserves, you should ensure
> > the use is bounded. I'm not seeing anything like that.
> 
> each mapping takes at most 3 pages

That is a start, but the thing I'm worried most about is non-reclaim
related devices using the thing when in dire straits.

> > This is a generic API, who is to ensure some other non-swap device will
> > not deplete memory and deadlock the reclaim process?
> 
> that information is not available at this level ;(

Can we bring it there?

> > Also, explain to me how an IOMMU is different from bounce buffers? They
> > both do the same thing, no? They both need memory in order to complete
> > DMA.
> 
> bounce buffers happen in a place where you can sleep.... that makes a 
> lot of difference.

Right, can't you stick part of this work there?

> > 
> > Is it just a broken API you're working against? If so, isn't the Linux
> > way to fix these things, that is why we have the source code after all.
> 
> well yes and no... the other iommu's snuck in as well... it's not 
> entirely fair to hold this one back until a 2 year, 1400 driver 
> project is completed ;(

I understand, but at some point we should stop; we cannot keep taking
crap in deference of such things.

Also, the other iommu code you pointed me to, was happy to fail, it did
not attempt to use the emergency reserves.


But you left out the mempools question again. I have read the earlier
threads, and it was said mempools are no good because they first deplete
GFP_ATOMIC reserves and then down-stream allocs could go splat.
PF_MEMALLOC/GFP_EMERGENCY has exactly the same problem...

So why no mempools?


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls
  2007-06-21  6:11                   ` Arjan van de Ven
  2007-06-21  6:29                     ` Peter Zijlstra
@ 2007-06-21  6:30                     ` Keshavamurthy, Anil S
  1 sibling, 0 replies; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-21  6:30 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Peter Zijlstra, Keshavamurthy, Anil S, Siddha, Suresh B, akpm,
	linux-kernel, ak, gregkh, muli, ashok.raj, davem, clameter

On Wed, Jun 20, 2007 at 11:11:05PM -0700, Arjan van de Ven wrote:
> Peter Zijlstra wrote:
> >What I'm saying is that if you do use the reserves, you should ensure
> >the use is bounded. I'm not seeing anything like that.
> 
> each mapping takes at most 3 pages
With 3 pages (a 3-level page table), the IOMMU can map at
most 2MB, and each additional last-level page maps another
2MB. Also, the IOMMU driver re-uses virtual addresses instead
of handing out ever-growing contiguous virtual addresses, which
limits page table growth and lets the same page table entries be
reused; in that sense we are bounded. But we are not sure how much
IO will be in flight on a particular system, so we cannot reserve
that many pages for IOMMU page table setup in advance.

-Anil

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls
  2007-06-21  6:29                     ` Peter Zijlstra
@ 2007-06-21  6:37                       ` Keshavamurthy, Anil S
  2007-06-21  7:13                         ` Peter Zijlstra
  0 siblings, 1 reply; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-21  6:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Arjan van de Ven, Keshavamurthy, Anil S, Siddha, Suresh B, akpm,
	linux-kernel, ak, gregkh, muli, ashok.raj, davem, clameter

On Thu, Jun 21, 2007 at 08:29:34AM +0200, Peter Zijlstra wrote:
> On Wed, 2007-06-20 at 23:11 -0700, Arjan van de Ven wrote:
> > Peter Zijlstra wrote:
> Also, the other iommu code you pointed me to, was happy to fail, it did
> not attempt to use the emergency reserves.

Is the same behavior acceptable here?

-Anil

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls
  2007-06-21  6:37                       ` Keshavamurthy, Anil S
@ 2007-06-21  7:13                         ` Peter Zijlstra
  2007-06-21 19:51                           ` Keshavamurthy, Anil S
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2007-06-21  7:13 UTC (permalink / raw)
  To: Keshavamurthy, Anil S
  Cc: Arjan van de Ven, Siddha, Suresh B, akpm, linux-kernel, ak,
	gregkh, muli, ashok.raj, davem, clameter

On Wed, 2007-06-20 at 23:37 -0700, Keshavamurthy, Anil S wrote:
> On Thu, Jun 21, 2007 at 08:29:34AM +0200, Peter Zijlstra wrote:
> > On Wed, 2007-06-20 at 23:11 -0700, Arjan van de Ven wrote:
> > > Peter Zijlstra wrote:
> > Also, the other iommu code you pointed me to, was happy to fail, it did
> > not attempt to use the emergency reserves.
> 
> Is the same behavior acceptable here?

I would say it is. Failure is a part of life.

If you have a (small) mempool with 16 pages or so, that should give you
plenty of megabytes of io-space to get out of a tight spot. That is, you
can queue many pages with that. If it is depleted you know you have at
least that many pages outstanding. So failing will just delay the next
pages.

Throughput is not a key issue when that low on memory, a guarantee of
progress is.
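
For reference, a minimal sketch of such a small page pool (illustrative
only; the pool name and the choice of 16 elements are assumptions, not
something taken from the patches):

#include <linux/init.h>
#include <linux/gfp.h>
#include <linux/mempool.h>

static mempool_t *iommu_pgtable_pool;	/* hypothetical pool */

static int __init iommu_pool_init(void)
{
	/* pre-fills 16 order-0 pages with GFP_KERNEL at creation time */
	iommu_pgtable_pool = mempool_create_page_pool(16, 0);
	return iommu_pgtable_pool ? 0 : -ENOMEM;
}

/* mapping path: may dip into the pre-filled reserve when GFP_ATOMIC fails */
static struct page *iommu_alloc_pgtable_page(void)
{
	return mempool_alloc(iommu_pgtable_pool, GFP_ATOMIC);
}

static void iommu_free_pgtable_page(struct page *page)
{
	mempool_free(page, iommu_pgtable_pool);
}

Freeing pages back through mempool_free() is what refills the reserve and
keeps the guarantee of forward progress.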



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls
  2007-06-21  7:13                         ` Peter Zijlstra
@ 2007-06-21 19:51                           ` Keshavamurthy, Anil S
  0 siblings, 0 replies; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-21 19:51 UTC (permalink / raw)
  To: Peter Zijlstra, akpm
  Cc: Keshavamurthy, Anil S, Arjan van de Ven, Siddha, Suresh B, akpm,
	linux-kernel, ak, gregkh, muli, ashok.raj, davem, clameter

On Thu, Jun 21, 2007 at 09:13:11AM +0200, Peter Zijlstra wrote:
> On Wed, 2007-06-20 at 23:37 -0700, Keshavamurthy, Anil S wrote:
> > On Thu, Jun 21, 2007 at 08:29:34AM +0200, Peter Zijlstra wrote:
> > > On Wed, 2007-06-20 at 23:11 -0700, Arjan van de Ven wrote:
> > > > Peter Zijlstra wrote:
> > > Also, the other iommu code you pointed me to, was happy to fail, it did
> > > not attempt to use the emergency reserves.
> > 
> > Is the same behavior acceptable here?
> 
> I would say it is. Failure is a part of life.
> 
> If you have a (small) mempool with 16 pages or so, that should give you
> plenty of megabytes of io-space to get out of a tight spot. That is, you
> can queue many pages with that. If it is depleted you know you have at
> least that many pages outstanding. So failing will just delay the next
> pages.
> 
> Throughput is not a key issue when that low on memory, a guarantee of
> progress is.

Andrew,
	Can you please queue all the other patches except this one for your
next MM release? (Yes, we can safely drop this patch without any issues in
applying the rest of the patches.)


-thanks,
Anil Keshavamurthy

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls
  2007-06-20  8:06   ` Peter Zijlstra
  2007-06-20 13:03     ` Arjan van de Ven
@ 2007-06-26  5:34     ` Andrew Morton
  1 sibling, 0 replies; 65+ messages in thread
From: Andrew Morton @ 2007-06-26  5:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Keshavamurthy, Anil S, linux-kernel, ak, gregkh, muli,
	suresh.b.siddha, arjan, ashok.raj, davem, clameter

On Wed, 20 Jun 2007 10:06:39 +0200 Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, 2007-06-19 at 14:37 -0700, Keshavamurthy, Anil S wrote:
> > plain text document attachment (intel_iommu_pf_memalloc.patch)
> > Intel IOMMU driver needs memory during DMA map calls to setup its internal
> > page tables and for other data structures. As we all know that these DMA 
> > map calls are mostly called in the interrupt context or with the spinlock 
> > held by the upper level drivers(network/storage drivers), so in order to 
> > avoid any memory allocation failure due to low memory issues,
> > this patch makes memory allocation by temporarily setting PF_MEMALLOC
> > flags for the current task before making memory allocation calls.
> > 
> > We evaluated mempools as a backup when kmem_cache_alloc() fails
> > and found that mempools are really not useful here because
> >  1) We don't know for sure how much to reserve in advance
> 
> So you just unleashed an unbounded allocation context on PF_MEMALLOC?
> seems like a really really bad idea.
> 
> >  2) And mempools are not useful for GFP_ATOMIC case (as we call 
> >     memory alloc functions with GFP_ATOMIC)
> 
> Mempools work as intended with GFP_ATOMIC, it gets filled up to the
> specified number of elements using GFP_KERNEL (at creation time). This
> gives GFP_ATOMIC allocations nr_elements extra items once it would start
> failing.

Yup.  Changelog is buggy.

> > With PF_MEMALLOC flag set in the current->flags, the VM subsystem avoids
> > any watermark checks before allocating memory thus guarantee'ing the 
> > memory till the last free page.
> 
> PF_MEMALLOC as is, is meant to salvage the VM from the typical VM
> deadlock. Using it as you do now is not something a driver should ever
> do, and I'm afraid I will have to strongly oppose this patch.
> 
> You really really want to calculate an upper bound on your memory
> consumption and reserve this.
> 
> So, I'm afraid I'll have to..
> 
> NACK!

err, PF_MEMALLOC doesn't actually do anything if in_interrupt(), so your
reason-for-nacking isn't legitimate.  And neither is Anil's patch ;)

So I'm thinking that if this patch passed all his testing, a patch which
didn't play these PF_MEMALLOC games would pass the same tests.
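
For context, the no-watermark path in the page allocator looks roughly like
this (paraphrased from memory of the 2.6.22-era mm/page_alloc.c slow path,
not an exact quote):

	/* __alloc_pages() slow path: PF_MEMALLOC is ignored in interrupts */
	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
			&& !in_interrupt()) {
		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
			/* retry the zonelist, ignoring the min watermarks */
			page = get_page_from_freelist(gfp_mask, order,
					zonelist, ALLOC_NO_WATERMARKS);
			if (page)
				goto got_pg;
		}
	}

So for the map calls made from interrupt context the flag simply never
kicks in.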


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 02/10] PCI generic helper function
  2007-06-19 21:37 ` [Intel IOMMU 02/10] PCI generic helper function Keshavamurthy, Anil S
@ 2007-06-26  5:49   ` Andrew Morton
  2007-06-26 14:44     ` Keshavamurthy, Anil S
  0 siblings, 1 reply; 65+ messages in thread
From: Andrew Morton @ 2007-06-26  5:49 UTC (permalink / raw)
  To: Keshavamurthy, Anil S
  Cc: linux-kernel, ak, gregkh, muli, suresh.b.siddha, arjan,
	ashok.raj, davem, clameter

On Tue, 19 Jun 2007 14:37:03 -0700 "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com> wrote:

> +struct pci_dev *
> +pci_find_upstream_pcie_bridge(struct pci_dev *pdev)

You didn't need a newline there, but that's what the rest of that file
does.  Hu hum.

> +{
> +	struct pci_dev *tmp = NULL;
> +
> +	if (pdev->is_pcie)
> +		return NULL;
> +	while (1) {
> +		if (!pdev->bus->self)
> +			break;
> +		pdev = pdev->bus->self;
> +		/* a p2p bridge */
> +		if (!pdev->is_pcie) {
> +			tmp = pdev;
> +			continue;
> +		}
> +		/* PCI device should connect to a PCIE bridge */
> +		BUG_ON(pdev->pcie_type != PCI_EXP_TYPE_PCI_BRIDGE);

I assume that if this bug triggers, we've found some broken hardware?

Going BUG seems like a pretty rude reaction to this, especially when it
would be so easy to drop a warning and then recover.


How's about this?

--- a/drivers/pci/search.c~intel-iommu-pci-generic-helper-function-fix
+++ a/drivers/pci/search.c
@@ -38,7 +38,11 @@ pci_find_upstream_pcie_bridge(struct pci
 			continue;
 		}
 		/* PCI device should connect to a PCIE bridge */
-		BUG_ON(pdev->pcie_type != PCI_EXP_TYPE_PCI_BRIDGE);
+		if (pdev->pcie_type != PCI_EXP_TYPE_PCI_BRIDGE) {
+			/* Busted hardware? */
+			WARN_ON_ONCE(1);
+			return NULL;
+		}
 		return pdev;
 	}
 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 04/10] IOVA allocation and management routines
  2007-06-19 21:37 ` [Intel IOMMU 04/10] IOVA allocation and management routines Keshavamurthy, Anil S
@ 2007-06-26  6:07   ` Andrew Morton
  2007-06-26 16:16     ` Keshavamurthy, Anil S
  0 siblings, 1 reply; 65+ messages in thread
From: Andrew Morton @ 2007-06-26  6:07 UTC (permalink / raw)
  To: Keshavamurthy, Anil S
  Cc: linux-kernel, ak, gregkh, muli, suresh.b.siddha, arjan,
	ashok.raj, davem, clameter

On Tue, 19 Jun 2007 14:37:05 -0700 "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com> wrote:

> 	This code implements a generic IOVA allocation and 
> management. As per Dave's suggestion we are now allocating
> IO virtual address from Higher DMA limit address rather
> than lower end address and this eliminated the need to preserve
> the IO virtual address for multiple devices sharing the same
> domain virtual address.
> 
> Also this code uses red black trees to store the allocated and
> reserved iova nodes. This showed a good performance improvements
> over previous linear linked list.
> 
> Changes from previous posting:
> 1) Fixed mostly coding style issues
> 

All the inlines in this code are pretty pointless: all those functions have
a single callsite so the compiler inlines them anyway.  If we later add
more callsites for these functions, they're too big to be inlined.

inline is usually wrong: don't do it!

> +
> +/**
> + * find_iova - find's an iova for a given pfn
> + * @iovad - iova domain in question.
> + * pfn - page frame number
> + * This function finds and returns an iova belonging to the
> + * given doamin which matches the given pfn.
> + */
> +struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn)
> +{
> +	unsigned long flags;
> +	struct rb_node *node;
> +
> +	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
> +	node = iovad->rbroot.rb_node;
> +	while (node) {
> +		struct iova *iova = container_of(node, struct iova, node);
> +
> +		/* If pfn falls within iova's range, return iova */
> +		if ((pfn >= iova->pfn_lo) && (pfn <= iova->pfn_hi)) {
> +			spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
> +			return iova;
> +		}
> +
> +		if (pfn < iova->pfn_lo)
> +			node = node->rb_left;
> +		else if (pfn > iova->pfn_lo)
> +			node = node->rb_right;
> +	}
> +
> +	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
> +	return NULL;
> +}

So we take the lock, look up an item, then drop the lock then return the
item we just found.  We took no refcount on it and we didn't do anything to
keep this object alive.

Is that a bug, or does the (afacit undocumented) lifecycle management of
these things take care of it in some manner?  If yes, please reply via an
add-a-comment patch.


> +/**
> + * __free_iova - frees the given iova
> + * @iovad: iova domain in question.
> + * @iova: iova in question.
> + * Frees the given iova belonging to the giving domain
> + */
> +void
> +__free_iova(struct iova_domain *iovad, struct iova *iova)
> +{
> +	unsigned long flags;
> +
> +	if (iova) {
> +		spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
> +		__cached_rbnode_delete_update(iovad, iova);
> +		rb_erase(&iova->node, &iovad->rbroot);
> +		spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
> +		free_iova_mem(iova);
> +	}
> +}

Can this really be called with NULL?  If so, under what circumstances? 
(This reader couldn't work it out from a brief look at the code, so perhaps
others will not be able to either.  Perhaps a comment is needed)

> +/**
> + * free_iova - finds and frees the iova for a given pfn
> + * @iovad: - iova domain in question.
> + * @pfn: - pfn that is allocated previously
> + * This functions finds an iova for a given pfn and then
> + * frees the iova from that domain.
> + */
> +void
> +free_iova(struct iova_domain *iovad, unsigned long pfn)
> +{
> +	struct iova *iova = find_iova(iovad, pfn);
> +	__free_iova(iovad, iova);
> +
> +}
> +
> +/**
> + * put_iova_domain - destroys the iova doamin
> + * @iovad: - iova domain in question.
> + * All the iova's in that domain are destroyed.
> + */
> +void put_iova_domain(struct iova_domain *iovad)
> +{
> +	struct rb_node *node;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
> +	node = rb_first(&iovad->rbroot);
> +	while (node) {
> +		struct iova *iova = container_of(node, struct iova, node);
> +		rb_erase(node, &iovad->rbroot);
> +		free_iova_mem(iova);
> +		node = rb_first(&iovad->rbroot);
> +	}
> +	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
> +}

Right, so I suspect what's happening here is that all iova's remain valid
until their entire domain is destroyed, yes?

What is the upper bound to the memory consumption here, and what provides
it?

Again, some code comments about these design issues are appropriate.

> +/*
> + * We need a fixed PAGE_SIZE of 4K irrespective of
> + * arch PAGE_SIZE for IOMMU page tables.
> + */
> +#define PAGE_SHIFT_4K		(12)
> +#define PAGE_SIZE_4K		(1UL << PAGE_SHIFT_4K)
> +#define PAGE_MASK_4K		(((u64)-1) << PAGE_SHIFT_4K)
> +#define PAGE_ALIGN_4K(addr)	(((addr) + PAGE_SIZE_4K - 1) & PAGE_MASK_4K)

Am still wondering why we cannot use PAGE_SIZE, PAGE_SHIFT, etc here.

> +#define IOVA_START_ADDR		(0x1000)

What determined that address?  (Needs comment)

> +#define IOVA_START_PFN		(IOVA_START_ADDR >> PAGE_SHIFT_4K)
> +
> +#define IOVA_PFN(addr)		((addr) >> PAGE_SHIFT_4K)

So I'm looking at this and wondering "what type does addr have"?

If it's unsigned long then perhaps we have a problem on x86_32 PAE.  Maybe
we don't support x86_32 PAE, but still, I'd have thought that the
appropriate type here is dma_addr_t.

But alas, it was needlessly implemented as a macro, so the reader cannot
tell.
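
A typed helper would make that explicit; for example (a sketch, not part of
the posted patch):

/* sketch: same computation as IOVA_PFN(), but the argument type is visible */
static inline unsigned long iova_pfn(dma_addr_t addr)
{
	return addr >> PAGE_SHIFT_4K;
}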

> +#define DMA_32BIT_PFN	IOVA_PFN(DMA_32BIT_MASK)
> +#define DMA_64BIT_PFN	IOVA_PFN(DMA_64BIT_MASK)
> +
> +/* iova structure */
> +struct iova {
> +	struct rb_node	node;
> +	unsigned long	pfn_hi; /* IOMMU dish out addr hi */
> +	unsigned long	pfn_lo; /* IOMMU dish out addr lo */
> +};
> +
> +/* holds all the iova translations for a domain */
> +struct iova_domain {
> +	spinlock_t	iova_alloc_lock;/* Lock to protect iova  allocation */
> +	spinlock_t	iova_rbtree_lock; /* Lock to protect update of rbtree */
> +	struct rb_root	rbroot;		/* iova domain rbtree root */
> +	struct rb_node	*cached32_node; /* Save last alloced node */
> +};
> +
> +struct iova *alloc_iova_mem(void);
> +void free_iova_mem(struct iova *iova);
> +void free_iova(struct iova_domain *iovad, unsigned long pfn);
> +void __free_iova(struct iova_domain *iovad, struct iova *iova);
> +struct iova * alloc_iova(struct iova_domain *iovad, unsigned long size,
> +	unsigned long limit_pfn);
> +struct iova * reserve_iova(struct iova_domain *iovad, unsigned long pfn_lo,
> +	unsigned long pfn_hi);
> +void copy_reserved_iova(struct iova_domain *from, struct iova_domain *to);
> +void init_iova_domain(struct iova_domain *iovad);
> +struct iova * find_iova(struct iova_domain *iovad, unsigned long pfn);
> +void put_iova_domain(struct iova_domain *iovad);
> +
> +#endif
> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 05/10] Intel IOMMU driver
  2007-06-19 21:37 ` [Intel IOMMU 05/10] Intel IOMMU driver Keshavamurthy, Anil S
  2007-06-19 23:32   ` Christoph Lameter
@ 2007-06-26  6:25   ` Andrew Morton
  2007-06-26 16:33     ` Keshavamurthy, Anil S
  2007-06-26  6:30   ` Andrew Morton
  2 siblings, 1 reply; 65+ messages in thread
From: Andrew Morton @ 2007-06-26  6:25 UTC (permalink / raw)
  To: Keshavamurthy, Anil S
  Cc: linux-kernel, ak, gregkh, muli, suresh.b.siddha, arjan,
	ashok.raj, davem, clameter

On Tue, 19 Jun 2007 14:37:06 -0700 "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com> wrote:

> +/*
> + * Decoding Capability Register
> + */
> +#define cap_read_drain(c)	(((c) >> 55) & 1)
> +#define cap_write_drain(c)	(((c) >> 54) & 1)
> +#define cap_max_amask_val(c)	(((c) >> 48) & 0x3f)
> +#define cap_num_fault_regs(c)	((((c) >> 40) & 0xff) + 1)
> +#define cap_pgsel_inv(c)	(((c) >> 39) & 1)
> +
> +#define cap_super_page_val(c)	(((c) >> 34) & 0xf)
> +#define cap_super_offset(c)	(((find_first_bit(&cap_super_page_val(c), 4)) \
> +					* OFFSET_STRIDE) + 21)
> +
> +#define cap_fault_reg_offset(c)	((((c) >> 24) & 0x3ff) * 16)
> +#define cap_max_fault_reg_offset(c) \
> +	(cap_fault_reg_offset(c) + cap_num_fault_regs(c) * 16)
> +
> +#define cap_zlr(c)		(((c) >> 22) & 1)
> +#define cap_isoch(c)		(((c) >> 23) & 1)
> +#define cap_mgaw(c)		((((c) >> 16) & 0x3f) + 1)
> +#define cap_sagaw(c)		(((c) >> 8) & 0x1f)
> +#define cap_caching_mode(c)	(((c) >> 7) & 1)
> +#define cap_phmr(c)		(((c) >> 6) & 1)
> +#define cap_plmr(c)		(((c) >> 5) & 1)
> +#define cap_rwbf(c)		(((c) >> 4) & 1)
> +#define cap_afl(c)		(((c) >> 3) & 1)
> +#define cap_ndoms(c)		(((unsigned long)1) << (4 + 2 * ((c) & 0x7)))
> +/*
> + * Extended Capability Register
> + */
> +
> +#define ecap_niotlb_iunits(e)	((((e) >> 24) & 0xff) + 1)
> +#define ecap_iotlb_offset(e) 	((((e) >> 8) & 0x3ff) * 16)
> +#define ecap_max_iotlb_offset(e) \
> +	(ecap_iotlb_offset(e) + ecap_niotlb_iunits(e) * 16)
> +#define ecap_coherent(e)	((e) & 0x1)

None of these actually _need_ to be macros and it would be better to 
implement them in C.  That way things are more self-documenting, more
pleasant to read, more likely to get commented and you'll fix the
two bugs wherein the argument to a macro is evaluated more than once.
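
For example, the double-evaluation cases could become ordinary helpers along
these lines (a sketch based on the quoted macros, not the submitted code):

static inline u64 cap_fault_reg_offset(u64 cap)
{
	return ((cap >> 24) & 0x3ff) * 16;
}

static inline u64 cap_num_fault_regs(u64 cap)
{
	return ((cap >> 40) & 0xff) + 1;
}

/* evaluates its argument exactly once, unlike the macro version */
static inline u64 cap_max_fault_reg_offset(u64 cap)
{
	return cap_fault_reg_offset(cap) + cap_num_fault_regs(cap) * 16;
}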


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 05/10] Intel IOMMU driver
  2007-06-19 21:37 ` [Intel IOMMU 05/10] Intel IOMMU driver Keshavamurthy, Anil S
  2007-06-19 23:32   ` Christoph Lameter
  2007-06-26  6:25   ` Andrew Morton
@ 2007-06-26  6:30   ` Andrew Morton
  2 siblings, 0 replies; 65+ messages in thread
From: Andrew Morton @ 2007-06-26  6:30 UTC (permalink / raw)
  To: Keshavamurthy, Anil S
  Cc: linux-kernel, ak, gregkh, muli, suresh.b.siddha, arjan,
	ashok.raj, davem, clameter

On Tue, 19 Jun 2007 14:37:06 -0700 "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com> wrote:

> 	Actual intel IOMMU driver. Hardware spec can be found at:
> http://www.intel.com/technology/virtualization
> 
> This driver sets X86_64 'dma_ops', so hook into standard DMA APIs. In this way,
> PCI driver will get virtual DMA address. This change is transparent to PCI
> drivers.
> 
> Changes from previous postings:
> 1) Fixed all the coding style errors - checkpatches.pl passes this patch
> 2) Addressed all Andrew's comments
> 3) Removed resource pool ( a.k.a pre-allocate pool)
> 4) Now uses the standard kmem_cache_alloc functions to allocate memory
>    during dma map api calls.
> 
> 
> ...
> +#define context_set_translation_type(c, val) \
> +	do { \
> +		(c).lo &= (((u64)-1) << 4) | 3; \
> +		(c).lo |= ((val) & 3) << 2; \
> +	} while (0)

That evaluates `c' twice.  It's a little handgrenade waiting to go off.

> +#define context_clear_entry(c) do {(c).lo = 0; (c).hi = 0;} while (0)

Ditto
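
A function form would defuse it; for illustration (the struct layout below
is assumed from the macros, and the call site is hypothetical):

struct context_entry {		/* layout implied by the quoted macros */
	u64 lo;
	u64 hi;
};

static inline void context_set_translation_type(struct context_entry *c, u64 val)
{
	c->lo &= (((u64)-1) << 4) | 3;
	c->lo |= (val & 3) << 2;
}

static inline void context_clear_entry(struct context_entry *c)
{
	c->lo = 0;
	c->hi = 0;
}

With the macro, a call such as context_set_translation_type(context[i++], 0)
would increment i twice; the function form evaluates its arguments once.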



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 05/10] Intel IOMMU driver
  2007-06-19 23:32   ` Christoph Lameter
  2007-06-19 23:50     ` Keshavamurthy, Anil S
@ 2007-06-26  6:32     ` Andrew Morton
  2007-06-26 16:29       ` Keshavamurthy, Anil S
  1 sibling, 1 reply; 65+ messages in thread
From: Andrew Morton @ 2007-06-26  6:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Keshavamurthy, Anil S, linux-kernel, ak, gregkh, muli,
	suresh.b.siddha, arjan, ashok.raj, davem

On Tue, 19 Jun 2007 16:32:23 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> On Tue, 19 Jun 2007, Keshavamurthy, Anil S wrote:
> 
> > +static inline void *alloc_pgtable_page(void)
> > +{
> > +	return (void *)get_zeroed_page(GFP_ATOMIC);
> > +}
> 
> Need to pass gfp_t parameter. Repeates a couple of times.
> ...
> Is it not possible here to drop the lock and do the alloc with GFP_KERNEL 
> and deal with the resulting race? That is done in other parts of the 
> kernel.
> ...
> This may be able to become a GFP_KERNEL alloc since interrupts are enabled 
> at this point?
> ...
> GFP_KERNEL alloc possible?
> 

Yeah, if there are any callsites at all at which we know that we can
perform a sleeping allocation, Christoph's suggestions should be adopted. 
Because even a bare GFP_NOIO is heaps more robust than GFP_ATOMIC, and it
will also reload the free-pages reserves, making subsequent GFP_ATOMIC
allocations more likely to succeed.
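
For illustration, Christoph's first point would look something like this
(a sketch only):

/* sketch: let callers that can sleep pass GFP_KERNEL/GFP_NOIO */
static inline void *alloc_pgtable_page(gfp_t gfp)
{
	return (void *)get_zeroed_page(gfp);
}

so the map paths that run with interrupts enabled could call
alloc_pgtable_page(GFP_KERNEL) while the truly atomic paths keep GFP_ATOMIC.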


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 10/10] Iommu floppy workaround
  2007-06-19 21:37 ` [Intel IOMMU 10/10] Iommu floppy workaround Keshavamurthy, Anil S
@ 2007-06-26  6:42   ` Andrew Morton
  2007-06-26 10:37     ` Andi Kleen
  2007-06-26 16:26     ` Keshavamurthy, Anil S
  0 siblings, 2 replies; 65+ messages in thread
From: Andrew Morton @ 2007-06-26  6:42 UTC (permalink / raw)
  To: Keshavamurthy, Anil S
  Cc: linux-kernel, ak, gregkh, muli, suresh.b.siddha, arjan,
	ashok.raj, davem, clameter

On Tue, 19 Jun 2007 14:37:11 -0700 "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com> wrote:

> 	This config option (DMAR_FLPY_WA) sets up 1:1 mapping for the
> floppy device so that the floppy device which does not use
> DMA api's will continue to work. 
> 
> Once the floppy driver starts using DMA api's this config option
> can be turn off or this patch can be yanked out of kernel at that
> time.
> 
> 
> Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
> ---
>  arch/x86_64/Kconfig       |   10 ++++++++++
>  drivers/pci/intel-iommu.c |   22 ++++++++++++++++++++++
>  drivers/pci/intel-iommu.h |    7 +++++++
>  3 files changed, 39 insertions(+)
> 
> Index: linux-2.6.22-rc4-mm2/arch/x86_64/Kconfig
> ===================================================================
> --- linux-2.6.22-rc4-mm2.orig/arch/x86_64/Kconfig	2007-06-18 15:45:08.000000000 -0700
> +++ linux-2.6.22-rc4-mm2/arch/x86_64/Kconfig	2007-06-18 15:45:09.000000000 -0700
> @@ -752,6 +752,16 @@
>  	 all the OS visible memory. Hence the driver can continue
>  	 to use physical addresses for DMA.
>  
> +config DMAR_FLPY_WA

FLOPPY is spelled "FLOPPY"!

> ===================================================================
> --- linux-2.6.22-rc4-mm2.orig/drivers/pci/intel-iommu.c	2007-06-18 15:45:08.000000000 -0700
> +++ linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.c	2007-06-18 15:45:09.000000000 -0700
> @@ -1631,6 +1631,26 @@
>  }
>  #endif
>  
> +#ifdef CONFIG_DMAR_FLPY_WA
> +static inline void iommu_prepare_isa(void)
> +{
> +	struct pci_dev *pdev = NULL;
> +	int ret;
> +
> +	pdev = pci_get_class (PCI_CLASS_BRIDGE_ISA << 8, NULL);
> +	if (!pdev)
> +		return;
> +
> +	printk (KERN_INFO "IOMMU: Prepare 0-16M unity mapping for LPC\n");
> +	ret = iommu_prepare_identity_map(pdev, 0, 16*1024*1024);
> +
> +	if (ret)
> +		printk ("IOMMU: Failed to create 0-64M identity map, \
> +			Floppy might not work\n");
> +
> +}
> +#endif
> +
>  int __init init_dmars(void)
>  {
>  	struct dmar_drhd_unit *drhd;
> @@ -1696,6 +1716,8 @@
>  
>  	iommu_prepare_gfx_mapping();
>  
> +	iommu_prepare_isa();
> +
>  	/*
>  	 * for each drhd
>  	 *   enable fault log
> Index: linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.h
> ===================================================================
> --- linux-2.6.22-rc4-mm2.orig/drivers/pci/intel-iommu.h	2007-06-18 15:45:08.000000000 -0700
> +++ linux-2.6.22-rc4-mm2/drivers/pci/intel-iommu.h	2007-06-18 15:45:09.000000000 -0700
> @@ -322,4 +322,11 @@
>  }
>  #endif /* !CONFIG_DMAR_GFX_WA */
>  
> +#ifndef CONFIG_DMAR_FLPY_WA
> +static inline void iommu_prepare_isa(void)
> +{
> +	return;
> +}
> +#endif /* !CONFIG_DMAR_FLPY_WA */

Bit weird that this was implemented in the header like that.

How about this?  (Also contains rather a lot of obvious style fixes)


 arch/x86_64/Kconfig       |    2 +-
 drivers/pci/intel-iommu.c |   19 ++++++++++++-------
 drivers/pci/intel-iommu.h |    7 -------
 3 files changed, 13 insertions(+), 15 deletions(-)

diff -puN arch/x86_64/Kconfig~intel-iommu-iommu-floppy-workaround-fix arch/x86_64/Kconfig
--- a/arch/x86_64/Kconfig~intel-iommu-iommu-floppy-workaround-fix
+++ a/arch/x86_64/Kconfig
@@ -770,7 +770,7 @@ config DMAR_GFX_WA
 	 all the OS visible memory. Hence the driver can continue
 	 to use physical addresses for DMA.
 
-config DMAR_FLPY_WA
+config DMAR_FLOPPY_WA
 	bool "Support for Floppy disk workaround"
 	depends on DMAR
 	default y
diff -puN drivers/pci/intel-iommu.c~intel-iommu-iommu-floppy-workaround-fix drivers/pci/intel-iommu.c
--- a/drivers/pci/intel-iommu.c~intel-iommu-iommu-floppy-workaround-fix
+++ a/drivers/pci/intel-iommu.c
@@ -1631,25 +1631,30 @@ error:
 }
 #endif
 
-#ifdef CONFIG_DMAR_FLPY_WA
+#ifdef CONFIG_DMAR_FLOPPY_WA
 static inline void iommu_prepare_isa(void)
 {
-	struct pci_dev *pdev = NULL;
+	struct pci_dev *pdev;
 	int ret;
 
-	pdev = pci_get_class (PCI_CLASS_BRIDGE_ISA << 8, NULL);
+	pdev = pci_get_class(PCI_CLASS_BRIDGE_ISA << 8, NULL);
 	if (!pdev)
 		return;
 
-	printk (KERN_INFO "IOMMU: Prepare 0-16M unity mapping for LPC\n");
+	printk(KERN_INFO "IOMMU: Prepare 0-16M unity mapping for LPC\n");
 	ret = iommu_prepare_identity_map(pdev, 0, 16*1024*1024);
 
 	if (ret)
-		printk ("IOMMU: Failed to create 0-64M identity map, \
-			Floppy might not work\n");
+		printk("IOMMU: Failed to create 0-64M identity map, "
+			"floppy might not work\n");
 
 }
-#endif
+#else
+static inline void iommu_prepare_isa(void)
+{
+	return;
+}
+#endif /* CONFIG_DMAR_FLOPPY_WA */
 
 int __init init_dmars(void)
 {
diff -puN drivers/pci/intel-iommu.h~intel-iommu-iommu-floppy-workaround-fix drivers/pci/intel-iommu.h
--- a/drivers/pci/intel-iommu.h~intel-iommu-iommu-floppy-workaround-fix
+++ a/drivers/pci/intel-iommu.h
@@ -322,11 +322,4 @@ static inline void iommu_prepare_gfx_map
 }
 #endif /* !CONFIG_DMAR_GFX_WA */
 
-#ifndef CONFIG_DMAR_FLPY_WA
-static inline void iommu_prepare_isa(void)
-{
-	return;
-}
-#endif /* !CONFIG_DMAR_FLPY_WA */
-
 #endif
_


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 00/10] Intel IOMMU support, take #2
  2007-06-19 21:37 [Intel IOMMU 00/10] Intel IOMMU support, take #2 Keshavamurthy, Anil S
                   ` (9 preceding siblings ...)
  2007-06-19 21:37 ` [Intel IOMMU 10/10] Iommu floppy workaround Keshavamurthy, Anil S
@ 2007-06-26  6:45 ` Andrew Morton
  2007-06-26  7:12   ` Andi Kleen
  10 siblings, 1 reply; 65+ messages in thread
From: Andrew Morton @ 2007-06-26  6:45 UTC (permalink / raw)
  To: Keshavamurthy, Anil S
  Cc: linux-kernel, ak, gregkh, muli, suresh.b.siddha, arjan,
	ashok.raj, davem, clameter

On Tue, 19 Jun 2007 14:37:01 -0700 "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com> wrote:

> 	This patch supports the upcomming Intel IOMMU hardware
> a.k.a. Intel(R) Virtualization Technology for Directed I/O 
> Architecture

So...  what's all this code for?

I assume that the intent here is to speed things up under Xen, etc?  Do we
have any benchmark results to help us to decide whether a merge would be
justified?

Does it slow anything down?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 00/10] Intel IOMMU support, take #2
  2007-06-26  6:45 ` [Intel IOMMU 00/10] Intel IOMMU support, take #2 Andrew Morton
@ 2007-06-26  7:12   ` Andi Kleen
  2007-06-26 11:13     ` Muli Ben-Yehuda
  0 siblings, 1 reply; 65+ messages in thread
From: Andi Kleen @ 2007-06-26  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Keshavamurthy, Anil S, linux-kernel, gregkh, muli,
	suresh.b.siddha, arjan, ashok.raj, davem, clameter

On Tuesday 26 June 2007 08:45:50 Andrew Morton wrote:
> On Tue, 19 Jun 2007 14:37:01 -0700 "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com> wrote:
> 
> > 	This patch supports the upcomming Intel IOMMU hardware
> > a.k.a. Intel(R) Virtualization Technology for Directed I/O 
> > Architecture
> 
> So...  what's all this code for?
> 
> I assume that the intent here is to speed things up under Xen, etc? 

Yes in some cases, but not this code. That would be the Xen version
of this code, which could potentially assign whole devices to guests.
I expect this to be useful only in some special cases though, because
most hardware is not virtualizable and you typically want a separate
instance for each guest.

OK, at some point KVM might implement this too; I would likely
use this code for that.

> Do we 
> have any benchmark results to help us to decide whether a merge would be
> justified?

The main advantage for doing it in the normal kernel is not performance, but 
more safety. Broken devices won't be able to corrupt memory by doing
random DMA.

Unfortunately that doesn't work for graphics yet; for that, user space
interfaces for the X server are needed.

There are some potential performance benefits too:
- When you have a device that cannot address the complete address range
an IOMMU can remap its memory instead of bounce buffering. Remapping
is likely cheaper than copying. 
- The IOMMU can merge sg lists into a single virtual block. This could
potentially speed up SG IO when the device is slow walking SG lists.
[I long ago benchmarked 5% on some block benchmark with an old
MPT Fusion; but it probably depends a lot on the HBA]

And you get better driver debugging because unexpected memory accesses
from the devices will cause a trappable event.

> 
> Does it slow anything down?

It adds more overhead to each IO so yes.

-Andi


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 10/10] Iommu floppy workaround
  2007-06-26  6:42   ` Andrew Morton
@ 2007-06-26 10:37     ` Andi Kleen
  2007-06-26 19:25       ` Keshavamurthy, Anil S
  2007-06-26 16:26     ` Keshavamurthy, Anil S
  1 sibling, 1 reply; 65+ messages in thread
From: Andi Kleen @ 2007-06-26 10:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Keshavamurthy, Anil S, linux-kernel, gregkh, muli,
	suresh.b.siddha, arjan, ashok.raj, davem, clameter


> > Index: linux-2.6.22-rc4-mm2/arch/x86_64/Kconfig
> > ===================================================================
> > --- linux-2.6.22-rc4-mm2.orig/arch/x86_64/Kconfig	2007-06-18 15:45:08.000000000 -0700
> > +++ linux-2.6.22-rc4-mm2/arch/x86_64/Kconfig	2007-06-18 15:45:09.000000000 -0700
> > @@ -752,6 +752,16 @@
> >  	 all the OS visible memory. Hence the driver can continue
> >  	 to use physical addresses for DMA.
> >  
> > +config DMAR_FLPY_WA
> 
> FLOPPY is spelled "FLOPPY"!

Also this shouldn't be a user-visible config option.  The floppy driver should just
do this transparently when loaded and undo it when unloaded.

Otherwise it would be a CONFIG_MAKE_MY_FLOPPY_WORK option, which is just not very nice.

-Andi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 00/10] Intel IOMMU support, take #2
  2007-06-26  7:12   ` Andi Kleen
@ 2007-06-26 11:13     ` Muli Ben-Yehuda
  2007-06-26 15:03       ` Arjan van de Ven
  2007-06-26 15:56       ` Andi Kleen
  0 siblings, 2 replies; 65+ messages in thread
From: Muli Ben-Yehuda @ 2007-06-26 11:13 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Keshavamurthy, Anil S, linux-kernel, gregkh,
	suresh.b.siddha, arjan, ashok.raj, davem, clameter

On Tue, Jun 26, 2007 at 09:12:45AM +0200, Andi Kleen wrote:

> There are some potential performance benefits too:
> - When you have a device that cannot address the complete address range
> an IOMMU can remap its memory instead of bounce buffering. Remapping
> is likely cheaper than copying. 

But those devices aren't likely to be found on modern systems.

> - The IOMMU can merge sg lists into a single virtual block. This could
> potentially speed up SG IO when the device is slow walking SG lists.
> [I long ago benchmarked 5% on some block benchmark with an old
> MPT Fusion; but it probably depends a lot on the HBA]

But most devices are SG-capable.

> And you get better driver debugging because unexpected memory
> accesses from the devices will cause an trapable event.

That and direct access for KVM are the big ones, IMHO, and they definitely
justify merging.

> > Does it slow anything down?
> 
> It adds more overhead to each IO so yes.

How much? we have numbers (to be presented at OLS later this week)
that show that on bare-metal an IOMMU can cost as much as 15%-30% more
CPU utilization for an IO intensive workload (netperf). It will be
interesting to see comparable numbers for VT-d.

Cheers,
Muli

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 02/10] PCI generic helper function
  2007-06-26  5:49   ` Andrew Morton
@ 2007-06-26 14:44     ` Keshavamurthy, Anil S
  0 siblings, 0 replies; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-26 14:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Keshavamurthy, Anil S, linux-kernel, ak, gregkh, muli,
	suresh.b.siddha, arjan, ashok.raj, davem, clameter

On Mon, Jun 25, 2007 at 10:49:37PM -0700, Andrew Morton wrote:
> On Tue, 19 Jun 2007 14:37:03 -0700 "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com> wrote:
> 
> > +struct pci_dev *
> > +pci_find_upstream_pcie_bridge(struct pci_dev *pdev)
> 
> You didn't need a newline there, but that's what the rest of that file
> does.  Hu hum.
> 
> > +{
> > +	struct pci_dev *tmp = NULL;
> > +
> > +	if (pdev->is_pcie)
> > +		return NULL;
> > +	while (1) {
> > +		if (!pdev->bus->self)
> > +			break;
> > +		pdev = pdev->bus->self;
> > +		/* a p2p bridge */
> > +		if (!pdev->is_pcie) {
> > +			tmp = pdev;
> > +			continue;
> > +		}
> > +		/* PCI device should connect to a PCIE bridge */
> > +		BUG_ON(pdev->pcie_type != PCI_EXP_TYPE_PCI_BRIDGE);
> 
> I assume that if this bug triggers, we've found some broken hardware?
> 
> Going BUG seems like a pretty rude reaction to this, especially when it
> would be so easy to drop a warning and then recover.
> 
> 
> How's about this?
Looks good, thanks.

> 
> --- a/drivers/pci/search.c~intel-iommu-pci-generic-helper-function-fix
> +++ a/drivers/pci/search.c
> @@ -38,7 +38,11 @@ pci_find_upstream_pcie_bridge(struct pci
>  			continue;
>  		}
>  		/* PCI device should connect to a PCIE bridge */
> -		BUG_ON(pdev->pcie_type != PCI_EXP_TYPE_PCI_BRIDGE);
> +		if (pdev->pcie_type != PCI_EXP_TYPE_PCI_BRIDGE) {
> +			/* Busted hardware? */
> +			WARN_ON_ONCE(1);
> +			return NULL;
> +		}
>  		return pdev;
>  	}
>  

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 00/10] Intel IOMMU support, take #2
  2007-06-26 11:13     ` Muli Ben-Yehuda
@ 2007-06-26 15:03       ` Arjan van de Ven
  2007-06-26 15:11         ` Muli Ben-Yehuda
  2007-06-26 15:56       ` Andi Kleen
  1 sibling, 1 reply; 65+ messages in thread
From: Arjan van de Ven @ 2007-06-26 15:03 UTC (permalink / raw)
  To: Muli Ben-Yehuda
  Cc: Andi Kleen, Andrew Morton, Keshavamurthy, Anil S, linux-kernel,
	gregkh, suresh.b.siddha, ashok.raj, davem, clameter

Muli Ben-Yehuda wrote:
> How much? we have numbers (to be presented at OLS later this week)
> that show that on bare-metal an IOMMU can cost as much as 15%-30% more
> CPU utilization for an IO intensive workload (netperf). It will be
> interesting to see comparable numbers for VT-d.

for VT-d it is a LOT less. I'll let anil give you his data :)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 00/10] Intel IOMMU support, take #2
  2007-06-26 15:56       ` Andi Kleen
@ 2007-06-26 15:09         ` Muli Ben-Yehuda
  2007-06-26 15:36           ` Andi Kleen
  2007-06-26 15:15         ` Arjan van de Ven
  1 sibling, 1 reply; 65+ messages in thread
From: Muli Ben-Yehuda @ 2007-06-26 15:09 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Keshavamurthy, Anil S, linux-kernel, gregkh,
	suresh.b.siddha, arjan, ashok.raj, davem, clameter

On Tue, Jun 26, 2007 at 05:56:49PM +0200, Andi Kleen wrote:

> > > - The IOMMU can merge sg lists into a single virtual block. This could
> > > potentially speed up SG IO when the device is slow walking SG
> > > lists.  [I long ago benchmarked 5% on some block benchmark with
> > > an old MPT Fusion; but it probably depends a lot on the HBA]
> > 
> > But most devices are SG-capable.
> 
> Your point being?

That the fact that an IOMMU can do SG for non-SG-capable cards is not
interesting from a "reason for inclusion" POV.

> > How much? we have numbers (to be presented at OLS later this week)
> > that show that on bare-metal an IOMMU can cost as much as 15%-30%
> > more CPU utilization for an IO intensive workload (netperf). It
> > will be interesting to see comparable numbers for VT-d.
> 
> That is something that needs more work.

Yup. I'm working on it (mostly in the context of Calgary) but also
looking at improvements to the DMA-API interface and usage.

> We should probably have a switch to use the IOMMU only for specific
> devices (e.g. for the KVM case) r only when remapping is
> needed.

Calgary already does this internally (via calgary=disable=<BUSNUM>)
but that's pretty ugly. It would be better to do it in a generic
fashion when deciding which dma_ops to call (i.e., a dma_ops per bus
or device).
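
Purely as a hypothetical sketch (none of these names exist in the posted
patches), per-device dispatch could look like:

#include <linux/device.h>
#include <asm/dma-mapping.h>	/* struct dma_mapping_ops, global dma_ops */

/* placeholder lookup: could be keyed off the bus number or per-device data */
static const struct dma_mapping_ops *per_device_dma_ops(struct device *dev)
{
	return NULL;
}

static const struct dma_mapping_ops *select_dma_ops(struct device *dev)
{
	const struct dma_mapping_ops *ops = dev ? per_device_dma_ops(dev) : NULL;

	return ops ? ops : dma_ops;	/* fall back to today's global */
}

Something along those lines would avoid per-driver disable options like
calgary=disable=<BUSNUM>.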

> Also the user interface for X server case needs more work.

Is anyone working on it?

Cheers,
Muli

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 00/10] Intel IOMMU support, take #2
  2007-06-26 15:03       ` Arjan van de Ven
@ 2007-06-26 15:11         ` Muli Ben-Yehuda
  2007-06-26 15:48           ` Keshavamurthy, Anil S
  0 siblings, 1 reply; 65+ messages in thread
From: Muli Ben-Yehuda @ 2007-06-26 15:11 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andi Kleen, Andrew Morton, Keshavamurthy, Anil S, linux-kernel,
	gregkh, suresh.b.siddha, ashok.raj, davem, clameter

On Tue, Jun 26, 2007 at 08:03:59AM -0700, Arjan van de Ven wrote:
> Muli Ben-Yehuda wrote:
> >How much? we have numbers (to be presented at OLS later this week)
> >that show that on bare-metal an IOMMU can cost as much as 15%-30% more
> >CPU utilization for an IO intensive workload (netperf). It will be
> >interesting to see comparable numbers for VT-d.
> 
> for VT-d it is a LOT less. I'll let anil give you his data :)

Looking forward to it. Note that this is on a large SMP machine with
Gigabit ethernet, with netperf TCP stream. Comparing numbers for other
benchmarks on other machines is ... less than useful, but the numbers
themselves are interesting.

Cheers,
Muli


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 00/10] Intel IOMMU support, take #2
  2007-06-26 15:56       ` Andi Kleen
  2007-06-26 15:09         ` Muli Ben-Yehuda
@ 2007-06-26 15:15         ` Arjan van de Ven
  2007-06-26 15:33           ` Andi Kleen
  1 sibling, 1 reply; 65+ messages in thread
From: Arjan van de Ven @ 2007-06-26 15:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Muli Ben-Yehuda, Andrew Morton, Keshavamurthy, Anil S,
	linux-kernel, gregkh, suresh.b.siddha, ashok.raj, davem,
	clameter

> 
> Also the user interface for X server case needs more work.
> 

actually with the mode setting of X moving into the kernel... X won't 
use /dev/mem anymore at all
(and I think it mostly already doesn't even without that)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 00/10] Intel IOMMU support, take #2
  2007-06-26 15:15         ` Arjan van de Ven
@ 2007-06-26 15:33           ` Andi Kleen
  2007-06-26 16:25             ` Arjan van de Ven
  0 siblings, 1 reply; 65+ messages in thread
From: Andi Kleen @ 2007-06-26 15:33 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andi Kleen, Muli Ben-Yehuda, Andrew Morton, Keshavamurthy,
	Anil S, linux-kernel, gregkh, suresh.b.siddha, ashok.raj, davem,
	clameter

On Tue, Jun 26, 2007 at 08:15:05AM -0700, Arjan van de Ven wrote:
> >
> >Also the user interface for X server case needs more work.
> >
> 
> actually with the mode setting of X moving into the kernel... X won't 
> use /dev/mem anymore at all

We'll see if that happens. It has been talked about forever,
but results are sparse. 

> (and I think it mostly already doesn't even without that)

It uses /sys/bus/pci/* which is not any better as seen from the IOMMU.

Any interface will need to be explicit because user space needs to know which
DMA addresses to put into the hardware. It's not enough to just transparently
translate the mappings.

-Andi


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 00/10] Intel IOMMU support, take #2
  2007-06-26 15:09         ` Muli Ben-Yehuda
@ 2007-06-26 15:36           ` Andi Kleen
  0 siblings, 0 replies; 65+ messages in thread
From: Andi Kleen @ 2007-06-26 15:36 UTC (permalink / raw)
  To: Muli Ben-Yehuda
  Cc: Andi Kleen, Andrew Morton, Keshavamurthy, Anil S, linux-kernel,
	gregkh, suresh.b.siddha, arjan, ashok.raj, davem, clameter

On Tue, Jun 26, 2007 at 11:09:40AM -0400, Muli Ben-Yehuda wrote:
> On Tue, Jun 26, 2007 at 05:56:49PM +0200, Andi Kleen wrote:
> 
> > > > - The IOMMU can merge sg lists into a single virtual block. This could
> > > > potentially speed up SG IO when the device is slow walking SG
> > > > lists.  [I long ago benchmarked 5% on some block benchmark with
> > > > an old MPT Fusion; but it probably depends a lot on the HBA]
> > > 
> > > But most devices are SG-capable.
> > 
> > Your point being?
> 
> That the fact that an IOMMU can do SG for non-SG-capble cards is not
> interesting from a "reason for inclusion" POV.

You misunderstood me; my point was that some SG-capable devices
can go faster if they get shorter SG lists.

But yes, for non-SG-capable devices it is also interesting. I expect
it will obsolete most users of that ugly external patch to allocate large
memory areas for IOs. That's a point I didn't mention earlier.

> > Also the user interface for X server case needs more work.
> 
> Is anyone working on it?

It's somewhere on the todo list.


-Andi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 00/10] Intel IOMMU support, take #2
  2007-06-26 15:11         ` Muli Ben-Yehuda
@ 2007-06-26 15:48           ` Keshavamurthy, Anil S
  2007-06-26 16:00             ` Muli Ben-Yehuda
  0 siblings, 1 reply; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-26 15:48 UTC (permalink / raw)
  To: Muli Ben-Yehuda
  Cc: Arjan van de Ven, Andi Kleen, Andrew Morton, Keshavamurthy,
	Anil S, linux-kernel, gregkh, suresh.b.siddha, ashok.raj, davem,
	clameter

On Tue, Jun 26, 2007 at 11:11:25AM -0400, Muli Ben-Yehuda wrote:
> On Tue, Jun 26, 2007 at 08:03:59AM -0700, Arjan van de Ven wrote:
> > Muli Ben-Yehuda wrote:
> > >How much? we have numbers (to be presented at OLS later this week)
> > >that show that on bare-metal an IOMMU can cost as much as 15%-30% more
> > >CPU utilization for an IO intensive workload (netperf). It will be
> > >interesting to see comparable numbers for VT-d.
> > 
> > for VT-d it is a LOT less. I'll let anil give you his data :)
> 
> Looking forward to it. Note that this is on a large SMP machine with
> Gigabit ethernet, with netperf TCP stream. Comparing numbers for other
> benchmarks on other machines is ... less than useful, but the numbers
> themeselves are interesting.
Our initial benchmark results showed that we had around 3% extra CPU
utilization overhead when compared to native (i.e. without the IOMMU).
Again, our benchmark was on a small SMP machine and we used
iperf and 1G Ethernet cards.

Going forward we will do more benchmark tests and will share the
results.

-Anil

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 00/10] Intel IOMMU support, take #2
  2007-06-26 11:13     ` Muli Ben-Yehuda
  2007-06-26 15:03       ` Arjan van de Ven
@ 2007-06-26 15:56       ` Andi Kleen
  2007-06-26 15:09         ` Muli Ben-Yehuda
  2007-06-26 15:15         ` Arjan van de Ven
  1 sibling, 2 replies; 65+ messages in thread
From: Andi Kleen @ 2007-06-26 15:56 UTC (permalink / raw)
  To: Muli Ben-Yehuda
  Cc: Andrew Morton, Keshavamurthy, Anil S, linux-kernel, gregkh,
	suresh.b.siddha, arjan, ashok.raj, davem, clameter

Muli Ben-Yehuda <muli@il.ibm.com> writes:

> On Tue, Jun 26, 2007 at 09:12:45AM +0200, Andi Kleen wrote:
> 
> > There are some potential performance benefits too:
> > - When you have a device that cannot address the complete address range
> > an IOMMU can remap its memory instead of bounce buffering. Remapping
> > is likely cheaper than copying. 
> 
> But those devices aren't likely to be found on modern systems.

Not true. I don't see anybody designing DAC-capable USB or FireWire
or sound or TV cards. And there are plenty of non-AHCI SATA interfaces too
(often the BIOS defaults are this way because XP doesn't deal
well with AHCI). And video cards generally don't support it
(although they don't like IOMMUs either). It's just that these devices
might all not be performance relevant (except for the video cards).

> > - The IOMMU can merge sg lists into a single virtual block. This could
> > potentially speed up SG IO when the device is slow walking SG lists.
> > [I long ago benchmarked 5% on some block benchmark with an old
> > MPT Fusion; but it probably depends a lot on the HBA]
> 
> But most devices are SG-capable.

Your point being? It depends on whether the SG hardware is slow
enough that it makes a difference. I found one case where that
was true, but it's unknown how common that is.

Only benchmarks can tell.

Also my results were on a pretty slow IOMMU implementation
so with a fast one it might be different too.

> How much? we have numbers (to be presented at OLS later this week)
> that show that on bare-metal an IOMMU can cost as much as 15%-30% more
> CPU utilization for an IO intensive workload (netperf). It will be
> interesting to see comparable numbers for VT-d.

That is something that needs more work.

We should probably have a switch to use the IOMMU only for specific
devices (e.g. for the KVM case) or only when remapping is needed. Boot
options alone for this are probably not good enough. But that is something
that can be worked on once everything is in the tree.

Also the user interface for X server case needs more work.

-Andi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 00/10] Intel IOMMU support, take #2
  2007-06-26 15:48           ` Keshavamurthy, Anil S
@ 2007-06-26 16:00             ` Muli Ben-Yehuda
  0 siblings, 0 replies; 65+ messages in thread
From: Muli Ben-Yehuda @ 2007-06-26 16:00 UTC (permalink / raw)
  To: Keshavamurthy, Anil S
  Cc: Arjan van de Ven, Andi Kleen, Andrew Morton, linux-kernel,
	gregkh, suresh.b.siddha, ashok.raj, davem, clameter

On Tue, Jun 26, 2007 at 08:48:04AM -0700, Keshavamurthy, Anil S wrote:

> Our initial benchmark results showed we had around 3% extra CPU
> utilization overhead when compared to native(i.e without IOMMU).
> Again, our benchmark was on small SMP machine and we used iperf and
> a 1G ethernet cards.

Please try netperf and a bigger machine for a meaningful comparison :-)
I assume this is with e1000?

> Going forward we will do more benchmark tests and will share the
> results.

Looking forward to it.

Cheers,
Muli

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 04/10] IOVA allocation and management routines
  2007-06-26  6:07   ` Andrew Morton
@ 2007-06-26 16:16     ` Keshavamurthy, Anil S
  0 siblings, 0 replies; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-26 16:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Keshavamurthy, Anil S, linux-kernel, ak, gregkh, muli,
	suresh.b.siddha, arjan, ashok.raj, davem, clameter

On Mon, Jun 25, 2007 at 11:07:47PM -0700, Andrew Morton wrote:
> On Tue, 19 Jun 2007 14:37:05 -0700 "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com> wrote:
> 
> All the inlines in this code are pretty pointless: all those functions have
> a single callsite so the compiler inlines them anyway.  If we later add
> more callsites for these functions, they're too big to be inlined.
> 
> inline is usually wrong: don't do it!
Yup, I agree and will follow that in the future.

> > +
> > +/**
> > + * find_iova - find's an iova for a given pfn
> > + * @iovad - iova domain in question.
> > + * pfn - page frame number
> > + * This function finds and returns an iova belonging to the
> > + * given doamin which matches the given pfn.
> > + */
> > +struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn)
> > +{
> > +	unsigned long flags;
> > +	struct rb_node *node;
> > +
> > +	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
> > +	node = iovad->rbroot.rb_node;
> > +	while (node) {
> > +		struct iova *iova = container_of(node, struct iova, node);
> > +
> > +		/* If pfn falls within iova's range, return iova */
> > +		if ((pfn >= iova->pfn_lo) && (pfn <= iova->pfn_hi)) {
> > +			spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
> > +			return iova;
> > +		}
> > +
> > +		if (pfn < iova->pfn_lo)
> > +			node = node->rb_left;
> > +		else if (pfn > iova->pfn_lo)
> > +			node = node->rb_right;
> > +	}
> > +
> > +	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
> > +	return NULL;
> > +}
> 
> So we take the lock, look up an item, then drop the lock then return the
> item we just found.  We took no refcount on it and we didn't do anything to
> keep this object alive.
> 
> Is that a bug, or does the (afacit undocumented) lifecycle management of
> these things take care of it in some manner?  If yes, please reply via an
> add-a-comment patch.

Nope, this is not a bug. I am adding a comment patch which explains this.

> 
> 
> > +/**
> > + * __free_iova - frees the given iova
> > + * @iovad: iova domain in question.
> > + * @iova: iova in question.
> > + * Frees the given iova belonging to the giving domain
> > + */
> > +void
> > +__free_iova(struct iova_domain *iovad, struct iova *iova)
> > +{
> > +	unsigned long flags;
> > +
> > +	if (iova) {
> > +		spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
> > +		__cached_rbnode_delete_update(iovad, iova);
> > +		rb_erase(&iova->node, &iovad->rbroot);
> > +		spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
> > +		free_iova_mem(iova);
> > +	}
> > +}
> 

> Can this really be called with NULL?  If so, under what circumstances? 
> (This reader couldn't work it out from a brief look at the code, so perhaps
> others will not be able to either.  Perhaps a comment is needed)

It was getting called from only one place, free_iova().
The patch below addresses your concern.

> 
> > +/**
> > + * free_iova - finds and frees the iova for a given pfn
> > + * @iovad: - iova domain in question.
> > + * @pfn: - pfn that is allocated previously
> > + * This functions finds an iova for a given pfn and then
> > + * frees the iova from that domain.
> > + */
> > +void
> > +free_iova(struct iova_domain *iovad, unsigned long pfn)
> > +{
> > +	struct iova *iova = find_iova(iovad, pfn);
> > +	__free_iova(iovad, iova);
> > +
> > +}
> > +
> > +/**
> > + * put_iova_domain - destroys the iova doamin
> > + * @iovad: - iova domain in question.
> > + * All the iova's in that domain are destroyed.
> > + */
> > +void put_iova_domain(struct iova_domain *iovad)
> > +{
> > +	struct rb_node *node;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
> > +	node = rb_first(&iovad->rbroot);
> > +	while (node) {
> > +		struct iova *iova = container_of(node, struct iova, node);
> > +		rb_erase(node, &iovad->rbroot);
> > +		free_iova_mem(iova);
> > +		node = rb_first(&iovad->rbroot);
> > +	}
> > +	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
> > +}
> 
> Right, so I suspect what's happening here is that all iova's remain valid
> until their entire domain is destroyed, yes?

Nope. IOVAs are valid only for the duration of the DMA map and DMA unmap calls.
In the case of the Intel IOMMU driver, the iovas are valid only for the duration
of the __intel_map_single() and __intel_unmap_single() calls.

> 
> What is the upper bound to the memory consumption here, and what provides
> it?
As explained above, iovas are freed and reused again during the DMA map calls.

> 
> Again, some code comments about these design issues are appropriate.
> 
> > +/*
> > + * We need a fixed PAGE_SIZE of 4K irrespective of
> > + * arch PAGE_SIZE for IOMMU page tables.
> > + */
> > +#define PAGE_SHIFT_4K		(12)
> > +#define PAGE_SIZE_4K		(1UL << PAGE_SHIFT_4K)
> > +#define PAGE_MASK_4K		(((u64)-1) << PAGE_SHIFT_4K)
> > +#define PAGE_ALIGN_4K(addr)	(((addr) + PAGE_SIZE_4K - 1) & PAGE_MASK_4K)
> 
> Am still wondering why we cannot use PAGE_SIZE, PAGE_SHIFT, etc here.
The VT-d hardware (a.k.a. Intel IOMMU hardware) page table size is always
4K irrespective of the OS PAGE_SIZE. We want to use the same code for
IA64 too, where the OS PAGE_SIZE may not be 4K, and hence we had to
define these here for the IOMMU.

> 
> > +#define IOVA_START_ADDR		(0x1000)
> 
> What determined that address?  (Needs comment)
Fixed. Please see the patch below.

> 
> > +#define IOVA_START_PFN		(IOVA_START_ADDR >> PAGE_SHIFT_4K)
> > +
> > +#define IOVA_PFN(addr)		((addr) >> PAGE_SHIFT_4K)
> 
> So I'm looking at this and wondering "what type does addr have"?
> 
> If it's unsigned long then perhaps we have a problem on x86_32 PAE.  Maybe
> we don't support x86_32 PAE, but still, I'd have thought that the
> appropriate type here is dma_addr_t.
Yup, that is correct. We don't support x86_32.
> 
> But alas, it was needlessly implemented as a macro, so the reader cannot
> tell.
I am in the process of making the same code base work for the
IA64 architecture too, and in this process I will do more cleanup.

Thanks.

Please apply the below patch as a fix the existing patch.


signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>

---
 drivers/pci/iova.c |   22 ++++++++++++++--------
 drivers/pci/iova.h |    4 ++--
 2 files changed, 16 insertions(+), 10 deletions(-)

Index: linux-2.6.22-rc4-mm2/drivers/pci/iova.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/drivers/pci/iova.c	2007-06-26 07:50:34.000000000 -0700
+++ linux-2.6.22-rc4-mm2/drivers/pci/iova.c	2007-06-26 08:25:23.000000000 -0700
@@ -166,6 +166,7 @@
 	unsigned long flags;
 	struct rb_node *node;
 
+	/* Take the lock so that no other thread is manipulating the rbtree */
 	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
 	node = iovad->rbroot.rb_node;
 	while (node) {
@@ -174,6 +175,12 @@
 		/* If pfn falls within iova's range, return iova */
 		if ((pfn >= iova->pfn_lo) && (pfn <= iova->pfn_hi)) {
 			spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+			/* We are not holding the lock while this iova
+			 * is referenced by the caller as the same thread
+			 * which called this function also calls __free_iova()
+			 * and it is by design that only one thread can possibly
+			 * reference a particular iova and hence no conflict.
+			 */
 			return iova;
 		}
 
@@ -198,13 +205,11 @@
 {
 	unsigned long flags;
 
-	if (iova) {
-		spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
-		__cached_rbnode_delete_update(iovad, iova);
-		rb_erase(&iova->node, &iovad->rbroot);
-		spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
-		free_iova_mem(iova);
-	}
+	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+	__cached_rbnode_delete_update(iovad, iova);
+	rb_erase(&iova->node, &iovad->rbroot);
+	spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+	free_iova_mem(iova);
 }
 
 /**
@@ -218,7 +223,8 @@
 free_iova(struct iova_domain *iovad, unsigned long pfn)
 {
 	struct iova *iova = find_iova(iovad, pfn);
-	__free_iova(iovad, iova);
+	if (iova)
+		__free_iova(iovad, iova);
 
 }
 
Index: linux-2.6.22-rc4-mm2/drivers/pci/iova.h
===================================================================
--- linux-2.6.22-rc4-mm2.orig/drivers/pci/iova.h	2007-06-26 07:50:34.000000000 -0700
+++ linux-2.6.22-rc4-mm2/drivers/pci/iova.h	2007-06-26 08:28:05.000000000 -0700
@@ -24,8 +24,8 @@
 #define PAGE_MASK_4K		(((u64)-1) << PAGE_SHIFT_4K)
 #define PAGE_ALIGN_4K(addr)	(((addr) + PAGE_SIZE_4K - 1) & PAGE_MASK_4K)
 
-#define IOVA_START_ADDR		(0x1000)
-#define IOVA_START_PFN		(IOVA_START_ADDR >> PAGE_SHIFT_4K)
+/* IO virtual address start page frame number */
+#define IOVA_START_PFN		(1)
 
 #define IOVA_PFN(addr)		((addr) >> PAGE_SHIFT_4K)
 #define DMA_32BIT_PFN	IOVA_PFN(DMA_32BIT_MASK)
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 00/10] Intel IOMMU support, take #2
  2007-06-26 15:33           ` Andi Kleen
@ 2007-06-26 16:25             ` Arjan van de Ven
  2007-06-26 17:31               ` Andi Kleen
  0 siblings, 1 reply; 65+ messages in thread
From: Arjan van de Ven @ 2007-06-26 16:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Muli Ben-Yehuda, Andrew Morton, Keshavamurthy, Anil S,
	linux-kernel, gregkh, suresh.b.siddha, ashok.raj, davem,
	clameter

Andi Kleen wrote:
> On Tue, Jun 26, 2007 at 08:15:05AM -0700, Arjan van de Ven wrote:
>>> Also the user interface for X server case needs more work.
>>>
>> actually with the mode setting of X moving into the kernel... X won't 
>> use /dev/mem anymore at all
> 
> We'll see if that happens. It has been talked about forever,
> but results are sparse. 

jbarnes posted the code a few weeks ago.

> 
>> (and I think it mostly already doesn't even without that)
> 
> It uses /sys/bus/pci/* which is not any better as seen from the IOMMU.
> 
> Any interface will need to be explicit because user space needs to know which
> DMA addresses to put into the hardware. It's not enough to just transparently
> translate the mappings.

that's what DRM is used for nowadays...

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 10/10] Iommu floppy workaround
  2007-06-26  6:42   ` Andrew Morton
  2007-06-26 10:37     ` Andi Kleen
@ 2007-06-26 16:26     ` Keshavamurthy, Anil S
  1 sibling, 0 replies; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-26 16:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Keshavamurthy, Anil S, linux-kernel, ak, gregkh, muli,
	suresh.b.siddha, arjan, ashok.raj, davem, clameter

On Mon, Jun 25, 2007 at 11:42:22PM -0700, Andrew Morton wrote:
> On Tue, 19 Jun 2007 14:37:11 -0700 "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com> wrote:
> 
> Bit weird that this was implemented in the header like that.
Sorry, that was my mistake; I had understood that to be what your
previous code review comment was asking for.
> 
> How about this?  (Also contains rather a lot of obvious style fixes)
Yup, looks good.

-Anil

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 05/10] Intel IOMMU driver
  2007-06-26  6:32     ` Andrew Morton
@ 2007-06-26 16:29       ` Keshavamurthy, Anil S
  0 siblings, 0 replies; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-26 16:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Keshavamurthy, Anil S, linux-kernel, ak,
	gregkh, muli, suresh.b.siddha, arjan, ashok.raj, davem

On Mon, Jun 25, 2007 at 11:32:49PM -0700, Andrew Morton wrote:
> On Tue, 19 Jun 2007 16:32:23 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:
> 
> > On Tue, 19 Jun 2007, Keshavamurthy, Anil S wrote:
> > 
> > > +static inline void *alloc_pgtable_page(void)
> > > +{
> > > +	return (void *)get_zeroed_page(GFP_ATOMIC);
> > > +}
> > 
> > Need to pass gfp_t parameter. Repeats a couple of times.
> > ...
> > Is it not possible here to drop the lock and do the alloc with GFP_KERNEL 
> > and deal with the resulting race? That is done in other parts of the 
> > kernel.
> > ...
> > This may be able to become a GFP_KERNEL alloc since interrupts are enabled 
> > at this point?
> > ...
> > GFP_KERNEL alloc possible?
> > 
> 
> Yeah, if there are any callsites at all at which we know that we can
> perform a sleeping allocation, Christoph's suggestions should be adopted. 
> Because even a bare GFP_NOIO is heaps more robust than GFP_ATOMIC, and it
> will also reload the free-pages reserves, making subsequent GFP_ATOMIC
> allocations more likely to succeed.
Yup, will do as part of making this code work for IA64, which is the next
item on my todo list.

-Anil
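
For context, a minimal sketch of the "drop the lock, allocate with GFP_KERNEL,
recheck for a race" idiom suggested in the review comments above.  The
structure and function names are hypothetical illustrations, not code from
this driver; only the sleepable call sites could use this, and the genuinely
atomic paths would keep their current GFP_ATOMIC form.

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/spinlock.h>

/* Hypothetical domain: only the lock and a lazily allocated page matter here. */
struct domain_sketch {
	spinlock_t lock;
	void *pgd;			/* root of the IO page table */
};

static int ensure_pgd(struct domain_sketch *d)
{
	unsigned long flags, page;

	spin_lock_irqsave(&d->lock, flags);
	if (d->pgd) {
		spin_unlock_irqrestore(&d->lock, flags);
		return 0;
	}
	spin_unlock_irqrestore(&d->lock, flags);

	/* May sleep: only valid on call paths that are allowed to block. */
	page = get_zeroed_page(GFP_KERNEL);
	if (!page)
		return -ENOMEM;

	spin_lock_irqsave(&d->lock, flags);
	if (d->pgd) {
		/* Another thread installed a page while we slept; drop ours. */
		spin_unlock_irqrestore(&d->lock, flags);
		free_page(page);
		return 0;
	}
	d->pgd = (void *)page;
	spin_unlock_irqrestore(&d->lock, flags);
	return 0;
}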

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 05/10] Intel IOMMU driver
  2007-06-26  6:25   ` Andrew Morton
@ 2007-06-26 16:33     ` Keshavamurthy, Anil S
  0 siblings, 0 replies; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-26 16:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Keshavamurthy, Anil S, linux-kernel, ak, gregkh, muli,
	suresh.b.siddha, arjan, ashok.raj, davem, clameter

On Mon, Jun 25, 2007 at 11:25:11PM -0700, Andrew Morton wrote:
> On Tue, 19 Jun 2007 14:37:06 -0700 "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com> wrote:
> 
> 
> None of these actually _need_ to be macros and it would be better to 
> implement them in C.  That way things are more self-documenting, more
> pleasant to read, more likely to get commented and you'll fix the
> two bugs wherein the argument to a macro is evaluated more than once.
Agree. Will send a patch when I get back from OLS, as any changes to this
need thorough testing.

-Anil
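
As a generic illustration (not a macro from this driver) of the
double-evaluation problem mentioned above, and of how the same helper written
as a static inline avoids it:

/* Evaluates each argument twice: MAX_OF(pfn++, limit) may bump pfn twice. */
#define MAX_OF(a, b)	((a) > (b) ? (a) : (b))

/* The C version evaluates each argument exactly once, gets type checking,
 * and is easier to read and comment. */
static inline unsigned long max_of(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}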

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 00/10] Intel IOMMU support, take #2
  2007-06-26 16:25             ` Arjan van de Ven
@ 2007-06-26 17:31               ` Andi Kleen
  2007-06-26 20:10                 ` Jesse Barnes
  0 siblings, 1 reply; 65+ messages in thread
From: Andi Kleen @ 2007-06-26 17:31 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andi Kleen, Muli Ben-Yehuda, Andrew Morton, Keshavamurthy,
	Anil S, linux-kernel, gregkh, suresh.b.siddha, ashok.raj, davem,
	clameter

> >>(and I think it mostly already doesn't even without that)
> >
> >It uses /sys/bus/pci/* which is not any better as seen from the IOMMU.
> >
> >Any interface will need to be explicit because user space needs to know 
> >which
> >DMA addresses to put into the hardware. It's not enough to just 
> >transparently
> >translate the mappings.
> 
> that's what DRM is used for nowadays...

But DRM does support much less hardware than the X server?

Perhaps we just need an ioctl where an X server can switch this.

-Andi


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 10/10] Iommu floppy workaround
  2007-06-26 10:37     ` Andi Kleen
@ 2007-06-26 19:25       ` Keshavamurthy, Anil S
  0 siblings, 0 replies; 65+ messages in thread
From: Keshavamurthy, Anil S @ 2007-06-26 19:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Keshavamurthy, Anil S, linux-kernel, gregkh, muli,
	suresh.b.siddha, arjan, ashok.raj, davem, clameter

On Tue, Jun 26, 2007 at 12:37:55PM +0200, Andi Kleen wrote:
> 
> > > Index: linux-2.6.22-rc4-mm2/arch/x86_64/Kconfig
> > > ===================================================================
> > > --- linux-2.6.22-rc4-mm2.orig/arch/x86_64/Kconfig	2007-06-18 15:45:08.000000000 -0700
> > > +++ linux-2.6.22-rc4-mm2/arch/x86_64/Kconfig	2007-06-18 15:45:09.000000000 -0700
> > > @@ -752,6 +752,16 @@
> > >  	 all the OS visible memory. Hence the driver can continue
> > >  	 to use physical addresses for DMA.
> > >  
> > > +config DMAR_FLPY_WA
> > 
> > FLOPPY is spelled "FLOPPY"!
> 
> Also this shouldn't be a user visible config.  The floppy driver should just
> do this transparently when loaded and undo when unloaded.

Yup, I agree. 

Here is the patch to make it a user-invisible config option.

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
---
 arch/x86_64/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.22-rc4-mm2/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.22-rc4-mm2.orig/arch/x86_64/Kconfig	2007-06-26 12:04:42.000000000 -0700
+++ linux-2.6.22-rc4-mm2/arch/x86_64/Kconfig	2007-06-26 12:06:01.000000000 -0700
@@ -753,7 +753,7 @@
 	 to use physical addresses for DMA.
 
 config DMAR_FLOPPY_WA
-	bool "Support for Floppy disk workaround"
+	bool
 	depends on DMAR
 	default y
 	help


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 00/10] Intel IOMMU support, take #2
  2007-06-26 17:31               ` Andi Kleen
@ 2007-06-26 20:10                 ` Jesse Barnes
  2007-06-26 22:35                   ` Andi Kleen
  0 siblings, 1 reply; 65+ messages in thread
From: Jesse Barnes @ 2007-06-26 20:10 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Arjan van de Ven, Muli Ben-Yehuda, Andrew Morton, Keshavamurthy,
	Anil S, linux-kernel, gregkh, suresh.b.siddha, ashok.raj, davem,
	clameter

On Tuesday, June 26, 2007 10:31:57 Andi Kleen wrote:
> > >>(and I think it mostly already doesn't even without that)
> > >
> > >It uses /sys/bus/pci/* which is not any better as seen from the
> > > IOMMU.
> > >
> > >Any interface will need to be explicit because user space needs to
> > > know which
> > >DMA addresses to put into the hardware. It's not enough to just
> > >transparently
> > >translate the mappings.
> >
> > that's what DRM is used for nowadays...
>
> But DRM does support much less hardware than the X server?

Yeah, the number of DRM drivers is relatively small compared to X or 
fbdev, but for simple DMA they're fairly easy to write.

> Perhaps we just need an ioctl where an X server can switch this.

Switch what?  Turn on or off transparent translation?

Jesse

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 00/10] Intel IOMMU support, take #2
  2007-06-26 20:10                 ` Jesse Barnes
@ 2007-06-26 22:35                   ` Andi Kleen
  0 siblings, 0 replies; 65+ messages in thread
From: Andi Kleen @ 2007-06-26 22:35 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Andi Kleen, Arjan van de Ven, Muli Ben-Yehuda, Andrew Morton,
	Keshavamurthy, Anil S, linux-kernel, gregkh, suresh.b.siddha,
	ashok.raj, davem, clameter

> 
> > Perhaps we just need an ioctl where an X server can switch this.
> 
> Switch what?  Turn on or off transparent translation?

Turn on/off bypass for its device.

-Andi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 01/10] DMAR detection and parsing logic
  2007-06-19 21:37 ` [Intel IOMMU 01/10] DMAR detection and parsing logic Keshavamurthy, Anil S
@ 2007-07-04  9:18   ` Peter Zijlstra
  2007-07-04 10:04     ` Andrew Morton
  0 siblings, 1 reply; 65+ messages in thread
From: Peter Zijlstra @ 2007-07-04  9:18 UTC (permalink / raw)
  To: Keshavamurthy, Anil S
  Cc: akpm, linux-kernel, ak, gregkh, muli, suresh.b.siddha, arjan,
	ashok.raj, davem, clameter

On Tue, 2007-06-19 at 14:37 -0700, Keshavamurthy, Anil S wrote:
> plain text document attachment (dmar_detection.patch)

> +/**
> + * parse_dmar_table - parses the DMA reporting table
> + */
> +static int __init
> +parse_dmar_table(void)
> +{
> +	struct acpi_table_dmar *dmar;
> +	struct acpi_dmar_header *entry_header;
> +	int ret = 0;
> +
> +	dmar = (struct acpi_table_dmar *)dmar_tbl;
> +
> +	if (!dmar->width) {
          ^^^^^^^^^^^^^^^^^^^

That goes *splat* on my opteron box.

> +		printk (KERN_WARNING PREFIX "Zero: Invalid DMAR haw\n");
> +		return -EINVAL;
> +	}
> +
> +	printk (KERN_INFO PREFIX "Host address width %d\n",
> +		dmar->width + 1);
> +
> +	entry_header = (struct acpi_dmar_header *)(dmar + 1);
> +	while (((unsigned long)entry_header) <
> +			(((unsigned long)dmar) + dmar_tbl->length)) {
> +		dmar_table_print_dmar_entry(entry_header);
> +
> +		switch (entry_header->type) {
> +		case ACPI_DMAR_TYPE_HARDWARE_UNIT:
> +			ret = dmar_parse_one_drhd(entry_header);
> +			break;
> +		case ACPI_DMAR_TYPE_RESERVED_MEMORY:
> +			ret = dmar_parse_one_rmrr(entry_header);
> +			break;
> +		default:
> +			printk(KERN_WARNING PREFIX
> +				"Unknown DMAR structure type\n");
> +			ret = 0; /* for forward compatibility */
> +			break;
> +		}
> +		if (ret)
> +			break;
> +
> +		entry_header = ((void *)entry_header + entry_header->length);
> +	}
> +	return ret;
> +}
> +


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 01/10] DMAR detection and parsing logic
  2007-07-04  9:18   ` Peter Zijlstra
@ 2007-07-04 10:04     ` Andrew Morton
  2007-07-04 10:14       ` Peter Zijlstra
  0 siblings, 1 reply; 65+ messages in thread
From: Andrew Morton @ 2007-07-04 10:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Keshavamurthy, Anil S, linux-kernel, ak, gregkh, muli,
	suresh.b.siddha, arjan, ashok.raj, davem, clameter

On Wed, 04 Jul 2007 11:18:56 +0200 Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, 2007-06-19 at 14:37 -0700, Keshavamurthy, Anil S wrote:
> > plain text document attachment (dmar_detection.patch)
> 
> > +/**
> > + * parse_dmar_table - parses the DMA reporting table
> > + */
> > +static int __init
> > +parse_dmar_table(void)
> > +{
> > +	struct acpi_table_dmar *dmar;
> > +	struct acpi_dmar_header *entry_header;
> > +	int ret = 0;
> > +
> > +	dmar = (struct acpi_table_dmar *)dmar_tbl;
> > +
> > +	if (!dmar->width) {
>           ^^^^^^^^^^^^^^^^^^^
> 
> That goes *splat* on my opteron box.

This?

From: "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com>

Check for dmar_tbl pointer as this can be NULL on systems with no Intel
VT-d support.

Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/pci/dmar.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff -puN drivers/pci/dmar.c~intel-iommu-dmar-detection-and-parsing-logic-fix-intel-dmar-crash-on-amd-x86_64 drivers/pci/dmar.c
--- a/drivers/pci/dmar.c~intel-iommu-dmar-detection-and-parsing-logic-fix-intel-dmar-crash-on-amd-x86_64
+++ a/drivers/pci/dmar.c
@@ -260,6 +260,8 @@ parse_dmar_table(void)
 	int ret = 0;
 
 	dmar = (struct acpi_table_dmar *)dmar_tbl;
+	if (!dmar)
+		return -ENODEV;
 
 	if (!dmar->width) {
 		printk (KERN_WARNING PREFIX "Zero: Invalid DMAR haw\n");
@@ -301,7 +303,7 @@ int __init dmar_table_init(void)
 
 	parse_dmar_table();
 	if (list_empty(&dmar_drhd_units)) {
-		printk(KERN_ERR PREFIX "No DMAR devices found\n");
+		printk(KERN_INFO PREFIX "No DMAR devices found\n");
 		return -ENODEV;
 	}
 	return 0;
_


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [Intel IOMMU 01/10] DMAR detection and parsing logic
  2007-07-04 10:04     ` Andrew Morton
@ 2007-07-04 10:14       ` Peter Zijlstra
  0 siblings, 0 replies; 65+ messages in thread
From: Peter Zijlstra @ 2007-07-04 10:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Keshavamurthy, Anil S, linux-kernel, ak, gregkh, muli,
	suresh.b.siddha, arjan, ashok.raj, davem, clameter

On Wed, 2007-07-04 at 03:04 -0700, Andrew Morton wrote:
> On Wed, 04 Jul 2007 11:18:56 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Tue, 2007-06-19 at 14:37 -0700, Keshavamurthy, Anil S wrote:
> > > plain text document attachment (dmar_detection.patch)
> > 
> > > +/**
> > > + * parse_dmar_table - parses the DMA reporting table
> > > + */
> > > +static int __init
> > > +parse_dmar_table(void)
> > > +{
> > > +	struct acpi_table_dmar *dmar;
> > > +	struct acpi_dmar_header *entry_header;
> > > +	int ret = 0;
> > > +
> > > +	dmar = (struct acpi_table_dmar *)dmar_tbl;
> > > +
> > > +	if (!dmar->width) {
> >           ^^^^^^^^^^^^^^^^^^^
> > 
> > That goes *splat* on my opteron box.
> 
> This?
> 
> From: "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com>
> 
> Check for dmar_tbl pointer as this can be NULL on systems with no Intel
> VT-d support.
> 
> Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Ah, that does look sane, I'll test it whenever the next -mm comes
around.


^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread

Thread overview: 65+ messages
2007-06-19 21:37 [Intel IOMMU 00/10] Intel IOMMU support, take #2 Keshavamurthy, Anil S
2007-06-19 21:37 ` [Intel IOMMU 01/10] DMAR detection and parsing logic Keshavamurthy, Anil S
2007-07-04  9:18   ` Peter Zijlstra
2007-07-04 10:04     ` Andrew Morton
2007-07-04 10:14       ` Peter Zijlstra
2007-06-19 21:37 ` [Intel IOMMU 02/10] PCI generic helper function Keshavamurthy, Anil S
2007-06-26  5:49   ` Andrew Morton
2007-06-26 14:44     ` Keshavamurthy, Anil S
2007-06-19 21:37 ` [Intel IOMMU 03/10] clflush_cache_range now takes size param Keshavamurthy, Anil S
2007-06-19 21:37 ` [Intel IOMMU 04/10] IOVA allocation and management routines Keshavamurthy, Anil S
2007-06-26  6:07   ` Andrew Morton
2007-06-26 16:16     ` Keshavamurthy, Anil S
2007-06-19 21:37 ` [Intel IOMMU 05/10] Intel IOMMU driver Keshavamurthy, Anil S
2007-06-19 23:32   ` Christoph Lameter
2007-06-19 23:50     ` Keshavamurthy, Anil S
2007-06-19 23:56       ` Christoph Lameter
2007-06-26  6:32     ` Andrew Morton
2007-06-26 16:29       ` Keshavamurthy, Anil S
2007-06-26  6:25   ` Andrew Morton
2007-06-26 16:33     ` Keshavamurthy, Anil S
2007-06-26  6:30   ` Andrew Morton
2007-06-19 21:37 ` [Intel IOMMU 06/10] Avoid memory allocation failures in dma map api calls Keshavamurthy, Anil S
2007-06-19 23:25   ` Christoph Lameter
2007-06-19 23:27     ` Arjan van de Ven
2007-06-19 23:34       ` Christoph Lameter
2007-06-20  0:02         ` Arjan van de Ven
2007-06-20  8:06   ` Peter Zijlstra
2007-06-20 13:03     ` Arjan van de Ven
2007-06-20 17:30       ` Siddha, Suresh B
2007-06-20 18:05         ` Peter Zijlstra
2007-06-20 19:14           ` Arjan van de Ven
2007-06-20 20:08             ` Peter Zijlstra
2007-06-20 23:03               ` Keshavamurthy, Anil S
2007-06-21  6:10                 ` Peter Zijlstra
2007-06-21  6:11                   ` Arjan van de Ven
2007-06-21  6:29                     ` Peter Zijlstra
2007-06-21  6:37                       ` Keshavamurthy, Anil S
2007-06-21  7:13                         ` Peter Zijlstra
2007-06-21 19:51                           ` Keshavamurthy, Anil S
2007-06-21  6:30                     ` Keshavamurthy, Anil S
2007-06-26  5:34     ` Andrew Morton
2007-06-19 21:37 ` [Intel IOMMU 07/10] Intel iommu cmdline option - forcedac Keshavamurthy, Anil S
2007-06-19 21:37 ` [Intel IOMMU 08/10] DMAR fault handling support Keshavamurthy, Anil S
2007-06-19 21:37 ` [Intel IOMMU 09/10] Iommu Gfx workaround Keshavamurthy, Anil S
2007-06-19 21:37 ` [Intel IOMMU 10/10] Iommu floppy workaround Keshavamurthy, Anil S
2007-06-26  6:42   ` Andrew Morton
2007-06-26 10:37     ` Andi Kleen
2007-06-26 19:25       ` Keshavamurthy, Anil S
2007-06-26 16:26     ` Keshavamurthy, Anil S
2007-06-26  6:45 ` [Intel IOMMU 00/10] Intel IOMMU support, take #2 Andrew Morton
2007-06-26  7:12   ` Andi Kleen
2007-06-26 11:13     ` Muli Ben-Yehuda
2007-06-26 15:03       ` Arjan van de Ven
2007-06-26 15:11         ` Muli Ben-Yehuda
2007-06-26 15:48           ` Keshavamurthy, Anil S
2007-06-26 16:00             ` Muli Ben-Yehuda
2007-06-26 15:56       ` Andi Kleen
2007-06-26 15:09         ` Muli Ben-Yehuda
2007-06-26 15:36           ` Andi Kleen
2007-06-26 15:15         ` Arjan van de Ven
2007-06-26 15:33           ` Andi Kleen
2007-06-26 16:25             ` Arjan van de Ven
2007-06-26 17:31               ` Andi Kleen
2007-06-26 20:10                 ` Jesse Barnes
2007-06-26 22:35                   ` Andi Kleen
