* [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
@ 2014-04-25  0:36 Bill Sumner
From: Bill Sumner @ 2014-04-25  0:36 UTC
  To: dwmw2, indou.takao, bhe, joro
  Cc: iommu, kexec, alex.williamson, linux-pci, linux-kernel, ddutile,
	ishii.hironobu, bhelgaas, bill.sumner, doug.hatch, zhenhua,
	jerry.hoemann

The following series implements a fix for a long-standing kdump problem
involving DMA: when a kernel panics and boots into the kdump kernel, DMA
that was started by the panicked kernel is not stopped before the kdump
kernel boots, and the kdump kernel disables the IOMMU while this DMA
continues.  The IOMMU then stops translating the DMA addresses as IOVAs and
begins treating them as physical memory addresses -- which causes the DMA to either:
1. generate DMAR errors,
2. generate PCI SERR errors, or
3. transfer data to or from incorrect areas of memory.
Often this causes the dump to fail.

This patch set modifies the behavior of the Intel iommu in the crashdump kernel:
1. to accept the iommu hardware in an active state,
2. to leave the current translations in place so that legacy DMA will continue
   using its current buffers until the device drivers in the crashdump kernel
   load and initialize their devices,
3. to use portions of the iova address ranges for the device drivers in the
   crashdump kernel that differ from the iova ranges that were in use at the
   time of the panic (a minimal sketch of this idea follows).
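
A minimal sketch of the idea in item 3 above, assuming the kernel's IOVA
allocator (linux/iova.h) and a hypothetical 'old_range' record of each
IOVA range recovered from the old kernel's translation tables; the real
logic lives in intel-iommu-kdump.c (patch 6):

	/* Sketch only: 'old_range' and 'reserve_old_kernel_iovas' are
	 * hypothetical names, not part of this patch set.
	 */
	struct old_range {
		struct list_head link;
		unsigned long pfn_lo, pfn_hi;	/* IOVA page-frame range */
	};

	static void reserve_old_kernel_iovas(struct iova_domain *iovad,
					     struct list_head *old_ranges)
	{
		struct old_range *r;

		/* Mark each range still in use by legacy DMA as allocated,
		 * so drivers in the crashdump kernel are handed different
		 * portions of the iova address space.
		 */
		list_for_each_entry(r, old_ranges, link)
			reserve_iova(iovad, r->pfn_lo, r->pfn_hi);
	}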

Advantages of this approach:
1. All manipulation of the IO-device is done by the Linux device-driver
   for that device.
2. This approach behaves in a manner very similar to operation without an
   active iommu.
3. Any activity between the IO-device and its RMRR areas is handled by the
   device-driver in the same manner as during a non-kdump boot.
4. If an IO-device has no driver in the kdump kernel, it is simply left alone.
   This supports the practice of creating a special kdump kernel without
   drivers for any devices that are not required for taking a crashdump. 
5. Minimal code changes to the existing mainline intel-iommu code.

Summary of changes in this patch set:
1. Updated to a more current top-of-tree and merged the code with the
   large number of changes that were recently taken into intel-iommu.c
2. Returned to the structure of a patch-set
3. Enabled the intel-iommu driver to consist of multiple *.c files
   by moving many of the #defines, prototypes, and inline functions
   into a new file: intel-iommu-private.h (the first three patches
   implement only this enhancement -- they could be applied independently
   of the last five)
4. Moved the new "crashdump fix" code into a new file: intel-iommu-kdump.c 
5. Removed the pr_debug constructs from the new code that implements the
   "crashdump fix" -- making the code much cleaner and easier to read.
6. Miscellaneous cleanups such as enum-values for return-codes.
7. Simplified the code that retrieves the values needed to initialize a new
   domain by using the linked-list of previously-collected values
   instead of stepping back into the tree of translation tables.

Bill Sumner (8):
  Fix a few existing lines for checkpatch.pl
  Consolidate lines for a new private header
  Create intel-iommu-private.h
  Update iommu_attach_domain() and its callers
  Add Items needed for fixing crashdump to private header
  Create intel-iommu-kdump.c
  Add domain-id functions to intel-iommu-kdump.c
  Add remaining changes to intel-iommu.c to fix crashdump

 drivers/iommu/Makefile              |   2 +-
 drivers/iommu/intel-iommu-kdump.c   | 629 ++++++++++++++++++++++++++++++++++++
 drivers/iommu/intel-iommu-private.h | 457 ++++++++++++++++++++++++++
 drivers/iommu/intel-iommu.c         | 512 ++++++++---------------------
 4 files changed, 1229 insertions(+), 371 deletions(-)
 create mode 100644 drivers/iommu/intel-iommu-kdump.c
 create mode 100644 drivers/iommu/intel-iommu-private.h

-- 
Bill Sumner <bill.sumner@hp.com>



* [PATCH 1/8] iommu/vt-d: Fix a few existing lines for checkpatch.pl
@ 2014-04-25  0:36 ` Bill Sumner
From: Bill Sumner @ 2014-04-25  0:36 UTC
  To: dwmw2, indou.takao, bhe, joro
  Cc: iommu, kexec, alex.williamson, linux-pci, linux-kernel, ddutile,
	ishii.hironobu, bhelgaas, bill.sumner, doug.hatch, zhenhua,
	jerry.hoemann

Fix a few existing lines that checkpatch.pl complains about:
1. no space before tabs
2. no initialization of static variables to 0 or NULL
3. no extra parentheses around the value on a 'return' statement
4. no lines over 80 characters

The first three patches in this set enable the intel-iommu driver to consist
of multiple *.c source files by moving many of the existing #defines,
prototypes, and structure definitions from the front of intel-iommu.c into
a new intel-iommu-private.h file -- and replacing them with a #include
of that file.


The last five patches in this set use the above enablement to implement,
within the new source file intel-iommu-kdump.c, a fix for:

A long-standing kdump problem involving DMA: when a kernel panics and boots
into the kdump kernel, DMA that was started by the panicked kernel is not
stopped before the kdump kernel boots, and the kdump kernel disables the
IOMMU while this DMA continues.  The IOMMU then stops translating the DMA
addresses as IOVAs and begins treating them as physical memory addresses --
which causes the DMA to either:
1. generate DMAR errors,
2. generate PCI SERR errors, or
3. transfer data to or from incorrect areas of memory.
Often this causes the dump to fail.

This patch set modifies the behavior of the Intel iommu in the crashdump kernel:
1. to accept the iommu hardware in an active state,
2. to leave the current translations in place so that legacy DMA will continue
   using its current buffers until the device drivers in the crashdump kernel
   load and initialize their devices,
3. to use portions of the iova address ranges for the device drivers in the
   crashdump kernel that differ from the iova ranges that were in use at the
   time of the panic.

Advantages of this approach:
1. All manipulation of the IO-device is done by the Linux device-driver
   for that device.
2. This approach behaves in a manner very similar to operation without an
   active iommu.
3. Any activity between the IO-device and its RMRR areas is handled by the
   device-driver in the same manner as during a non-kdump boot.
4. If an IO-device has no driver in the kdump kernel, it is simply left alone.
   This supports the practice of creating a special kdump kernel without
   drivers for any devices that are not required for taking a crashdump. 
5. Minimal code changes to the existing mainline intel-iommu code.


Signed-off-by: Bill Sumner <bill.sumner@hp.com>
---
 drivers/iommu/intel-iommu.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 69fa7da..f80b4bb 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -69,7 +69,8 @@
    to match. That way, we can use 'unsigned long' for PFNs with impunity. */
 #define DOMAIN_MAX_PFN(gaw)	((unsigned long) min_t(uint64_t, \
 				__DOMAIN_MAX_PFN(gaw), (unsigned long)-1))
-#define DOMAIN_MAX_ADDR(gaw)	(((uint64_t)__DOMAIN_MAX_PFN(gaw)) << VTD_PAGE_SHIFT)
+#define DOMAIN_MAX_ADDR(gaw) (((uint64_t)__DOMAIN_MAX_PFN(gaw)) << \
+					VTD_PAGE_SHIFT)
 
 #define IOVA_PFN(addr)		((addr) >> PAGE_SHIFT)
 #define DMA_32BIT_PFN		IOVA_PFN(DMA_BIT_MASK(32))
@@ -172,7 +173,7 @@ static int rwbf_quirk;
  * set to 1 to panic kernel if can't successfully enable VT-d
  * (used when kernel is launched w/ TXT)
  */
-static int force_on = 0;
+static int force_on;
 
 /*
  * 0: Present
@@ -187,7 +188,7 @@ struct root_entry {
 #define ROOT_ENTRY_NR (VTD_PAGE_SIZE/sizeof(struct root_entry))
 static inline bool root_present(struct root_entry *root)
 {
-	return (root->val & 1);
+	return root->val & 1;
 }
 static inline void set_root_present(struct root_entry *root)
 {
@@ -225,7 +226,7 @@ struct context_entry {
 
 static inline bool context_present(struct context_entry *context)
 {
-	return (context->lo & 1);
+	return context->lo & 1;
 }
 static inline void context_set_present(struct context_entry *context)
 {
@@ -303,7 +304,7 @@ static inline bool dma_pte_present(struct dma_pte *pte)
 
 static inline bool dma_pte_superpage(struct dma_pte *pte)
 {
-	return (pte->val & (1 << 7));
+	return pte->val & (1 << 7);
 }
 
 static inline int first_pte_in_page(struct dma_pte *pte)
@@ -314,7 +315,7 @@ static inline int first_pte_in_page(struct dma_pte *pte)
 /*
  * This domain is a statically identity mapping domain.
  *	1. This domain creats a static 1:1 mapping to all usable memory.
- * 	2. It maps to each iommu if successful.
+ *	2. It maps to each iommu if successful.
  *	3. Each iommu mapps to this domain if successful.
  */
 static struct dmar_domain *si_domain;
@@ -344,7 +345,7 @@ struct dmar_domain {
 	DECLARE_BITMAP(iommu_bmp, IOMMU_UNITS_SUPPORTED);
 					/* bitmap of iommus this domain uses*/
 
-	struct list_head devices; 	/* all devices' list */
+	struct list_head devices;	/* all devices' list */
 	struct iova_domain iovad;	/* iova's that belong to this domain */
 
 	struct dma_pte	*pgd;		/* virtual address */
-- 
Bill Sumner <bill.sumner@hp.com>



* [PATCH 2/8] iommu/vt-d: Consolidate lines for a new private header
@ 2014-04-25  0:36 ` Bill Sumner
From: Bill Sumner @ 2014-04-25  0:36 UTC
  To: dwmw2, indou.takao, bhe, joro
  Cc: iommu, kexec, alex.williamson, linux-pci, linux-kernel, ddutile,
	ishii.hironobu, bhelgaas, bill.sumner, doug.hatch, zhenhua,
	jerry.hoemann

In intel-iommu.c, move downward the few lines near the front that should
not move to an intel-iommu-private.h file (mostly data-item definitions).
Also move upward a couple of inline functions that allocate and free pages.
This leaves at the front of the file a consolidated block of the lines
that would move to an intel-iommu-private.h file.

Signed-off-by: Bill Sumner <bill.sumner@hp.com>
---
 drivers/iommu/intel-iommu.c | 74 +++++++++++++++++++++++----------------------
 1 file changed, 38 insertions(+), 36 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index f80b4bb..49fdac5 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -163,17 +163,6 @@ static inline unsigned long virt_to_dma_pfn(void *p)
 	return page_to_dma_pfn(virt_to_page(p));
 }
 
-/* global iommu list, set NULL for ignored DMAR units */
-static struct intel_iommu **g_iommus;
-
-static void __init check_tylersburg_isoch(void);
-static int rwbf_quirk;
-
-/*
- * set to 1 to panic kernel if can't successfully enable VT-d
- * (used when kernel is launched w/ TXT)
- */
-static int force_on;
 
 /*
  * 0: Present
@@ -312,15 +301,6 @@ static inline int first_pte_in_page(struct dma_pte *pte)
 	return !((unsigned long)pte & ~VTD_PAGE_MASK);
 }
 
-/*
- * This domain is a statically identity mapping domain.
- *	1. This domain creats a static 1:1 mapping to all usable memory.
- *	2. It maps to each iommu if successful.
- *	3. Each iommu mapps to this domain if successful.
- */
-static struct dmar_domain *si_domain;
-static int hw_pass_through = 1;
-
 /* devices under the same p2p bridge are owned in one domain */
 #define DOMAIN_FLAG_P2P_MULTIPLE_DEVICES (1 << 0)
 
@@ -394,12 +374,50 @@ struct dmar_atsr_unit {
 	u8 include_all:1;		/* include all ports */
 };
 
+static inline void *alloc_pgtable_page(int node)
+{
+	struct page *page;
+	void *vaddr = NULL;
+
+	page = alloc_pages_node(node, GFP_ATOMIC | __GFP_ZERO, 0);
+	if (page)
+		vaddr = page_address(page);
+	return vaddr;
+}
+
+static inline void free_pgtable_page(void *vaddr)
+{
+	free_page((unsigned long)vaddr);
+}
+
 static LIST_HEAD(dmar_atsr_units);
 static LIST_HEAD(dmar_rmrr_units);
 
 #define for_each_rmrr_units(rmrr) \
 	list_for_each_entry(rmrr, &dmar_rmrr_units, list)
 
+
+static void __init check_tylersburg_isoch(void);
+
+/* global iommu list, set NULL for ignored DMAR units */
+static struct intel_iommu **g_iommus;
+static int rwbf_quirk;
+
+/*
+ * set to 1 to panic kernel if can't successfully enable VT-d
+ * (used when kernel is launched w/ TXT)
+ */
+static int force_on;
+
+/*
+ * This domain is a statically identity mapping domain.
+ *	1. This domain creates a static 1:1 mapping to all usable memory.
+ *	2. It maps to each iommu if successful.
+ *	3. Each iommu maps to this domain if successful.
+ */
+static struct dmar_domain *si_domain;
+static int hw_pass_through = 1;
+
 static void flush_unmaps_timeout(unsigned long data);
 
 static DEFINE_TIMER(unmap_timer,  flush_unmaps_timeout, 0, 0);
@@ -494,22 +512,6 @@ static struct kmem_cache *iommu_domain_cache;
 static struct kmem_cache *iommu_devinfo_cache;
 static struct kmem_cache *iommu_iova_cache;
 
-static inline void *alloc_pgtable_page(int node)
-{
-	struct page *page;
-	void *vaddr = NULL;
-
-	page = alloc_pages_node(node, GFP_ATOMIC | __GFP_ZERO, 0);
-	if (page)
-		vaddr = page_address(page);
-	return vaddr;
-}
-
-static inline void free_pgtable_page(void *vaddr)
-{
-	free_page((unsigned long)vaddr);
-}
-
 static inline void *alloc_domain_mem(void)
 {
 	return kmem_cache_alloc(iommu_domain_cache, GFP_ATOMIC);
-- 
Bill Sumner <bill.sumner@hp.com>



* [PATCH 3/8] iommu/vt-d: Create intel-iommu-private.h
@ 2014-04-25  0:36 ` Bill Sumner
From: Bill Sumner @ 2014-04-25  0:36 UTC
  To: dwmw2, indou.takao, bhe, joro
  Cc: iommu, kexec, alex.williamson, linux-pci, linux-kernel, ddutile,
	ishii.hironobu, bhelgaas, bill.sumner, doug.hatch, zhenhua,
	jerry.hoemann

Move the single consolidated block of #defines, static inline functions,
and struct definitions from intel-iommu.c into intel-iommu-private.h,
and replace it in intel-iommu.c with #include "intel-iommu-private.h".

This introduces no functional change from current behaviour.

Signed-off-by: Bill Sumner <bill.sumner@hp.com>
---
 drivers/iommu/intel-iommu-private.h | 363 ++++++++++++++++++++++++++++++++++++
 drivers/iommu/intel-iommu.c         | 345 +---------------------------------
 2 files changed, 364 insertions(+), 344 deletions(-)
 create mode 100644 drivers/iommu/intel-iommu-private.h

diff --git a/drivers/iommu/intel-iommu-private.h b/drivers/iommu/intel-iommu-private.h
new file mode 100644
index 0000000..480399c
--- /dev/null
+++ b/drivers/iommu/intel-iommu-private.h
@@ -0,0 +1,363 @@
+/*
+ * Copyright (c) 2006, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Copyright (C) 2006-2008 Intel Corporation
+ * Author: Ashok Raj <ashok.raj@intel.com>
+ * Author: Shaohua Li <shaohua.li@intel.com>
+ * Author: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
+ * Author: Fenghua Yu <fenghua.yu@intel.com>
+ */
+
+
+#define ROOT_SIZE		VTD_PAGE_SIZE
+#define CONTEXT_SIZE		VTD_PAGE_SIZE
+
+#define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
+#define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA)
+#define IS_AZALIA(pdev) ((pdev)->vendor == 0x8086 && (pdev)->device == 0x3a3e)
+
+#define IOAPIC_RANGE_START	(0xfee00000)
+#define IOAPIC_RANGE_END	(0xfeefffff)
+#define IOVA_START_ADDR		(0x1000)
+
+#define DEFAULT_DOMAIN_ADDRESS_WIDTH 48
+
+#define MAX_AGAW_WIDTH 64
+#define MAX_AGAW_PFN_WIDTH	(MAX_AGAW_WIDTH - VTD_PAGE_SHIFT)
+
+#define __DOMAIN_MAX_PFN(gaw)  ((((uint64_t)1) << (gaw-VTD_PAGE_SHIFT)) - 1)
+#define __DOMAIN_MAX_ADDR(gaw) ((((uint64_t)1) << gaw) - 1)
+
+/* We limit DOMAIN_MAX_PFN to fit in an unsigned long, and DOMAIN_MAX_ADDR
+   to match. That way, we can use 'unsigned long' for PFNs with impunity. */
+#define DOMAIN_MAX_PFN(gaw)	((unsigned long) min_t(uint64_t, \
+				__DOMAIN_MAX_PFN(gaw), (unsigned long)-1))
+#define DOMAIN_MAX_ADDR(gaw) (((uint64_t)__DOMAIN_MAX_PFN(gaw)) << \
+					VTD_PAGE_SHIFT)
+
+#define IOVA_PFN(addr)		((addr) >> PAGE_SHIFT)
+#define DMA_32BIT_PFN		IOVA_PFN(DMA_BIT_MASK(32))
+#define DMA_64BIT_PFN		IOVA_PFN(DMA_BIT_MASK(64))
+
+/* page table handling */
+#define LEVEL_STRIDE		(9)
+#define LEVEL_MASK		(((u64)1 << LEVEL_STRIDE) - 1)
+
+/*
+ * This bitmap is used to advertise the page sizes our hardware support
+ * to the IOMMU core, which will then use this information to split
+ * physically contiguous memory regions it is mapping into page sizes
+ * that we support.
+ *
+ * Traditionally the IOMMU core just handed us the mappings directly,
+ * after making sure the size is an order of a 4KiB page and that the
+ * mapping has natural alignment.
+ *
+ * To retain this behavior, we currently advertise that we support
+ * all page sizes that are an order of 4KiB.
+ *
+ * If at some point we'd like to utilize the IOMMU core's new behavior,
+ * we could change this to advertise the real page sizes we support.
+ */
+#define INTEL_IOMMU_PGSIZES	(~0xFFFUL)
+
+static inline int agaw_to_level(int agaw)
+{
+	return agaw + 2;
+}
+
+static inline int agaw_to_width(int agaw)
+{
+	return min_t(int, 30 + agaw * LEVEL_STRIDE, MAX_AGAW_WIDTH);
+}
+
+static inline int width_to_agaw(int width)
+{
+	return DIV_ROUND_UP(width - 30, LEVEL_STRIDE);
+}
+
+static inline unsigned int level_to_offset_bits(int level)
+{
+	return (level - 1) * LEVEL_STRIDE;
+}
+
+static inline int pfn_level_offset(unsigned long pfn, int level)
+{
+	return (pfn >> level_to_offset_bits(level)) & LEVEL_MASK;
+}
+
+static inline unsigned long level_mask(int level)
+{
+	return -1UL << level_to_offset_bits(level);
+}
+
+static inline unsigned long level_size(int level)
+{
+	return 1UL << level_to_offset_bits(level);
+}
+
+static inline unsigned long align_to_level(unsigned long pfn, int level)
+{
+	return (pfn + level_size(level) - 1) & level_mask(level);
+}
+
+static inline unsigned long lvl_to_nr_pages(unsigned int lvl)
+{
+	return  1 << min_t(int, (lvl - 1) * LEVEL_STRIDE, MAX_AGAW_PFN_WIDTH);
+}
+
+/* VT-d pages must always be _smaller_ than MM pages. Otherwise things
+   are never going to work. */
+static inline unsigned long dma_to_mm_pfn(unsigned long dma_pfn)
+{
+	return dma_pfn >> (PAGE_SHIFT - VTD_PAGE_SHIFT);
+}
+
+static inline unsigned long mm_to_dma_pfn(unsigned long mm_pfn)
+{
+	return mm_pfn << (PAGE_SHIFT - VTD_PAGE_SHIFT);
+}
+static inline unsigned long page_to_dma_pfn(struct page *pg)
+{
+	return mm_to_dma_pfn(page_to_pfn(pg));
+}
+static inline unsigned long virt_to_dma_pfn(void *p)
+{
+	return page_to_dma_pfn(virt_to_page(p));
+}
+
+
+/*
+ * 0: Present
+ * 1-11: Reserved
+ * 12-63: Context Ptr (12 - (haw-1))
+ * 64-127: Reserved
+ */
+struct root_entry {
+	u64	val;
+	u64	rsvd1;
+};
+#define ROOT_ENTRY_NR (VTD_PAGE_SIZE/sizeof(struct root_entry))
+static inline bool root_present(struct root_entry *root)
+{
+	return root->val & 1;
+}
+static inline void set_root_present(struct root_entry *root)
+{
+	root->val |= 1;
+}
+static inline void set_root_value(struct root_entry *root, unsigned long value)
+{
+	root->val |= value & VTD_PAGE_MASK;
+}
+
+static inline struct context_entry *
+get_context_addr_from_root(struct root_entry *root)
+{
+	return (struct context_entry *)
+		(root_present(root)?phys_to_virt(
+		root->val & VTD_PAGE_MASK) :
+		NULL);
+}
+
+/*
+ * low 64 bits:
+ * 0: present
+ * 1: fault processing disable
+ * 2-3: translation type
+ * 12-63: address space root
+ * high 64 bits:
+ * 0-2: address width
+ * 3-6: aval
+ * 8-23: domain id
+ */
+struct context_entry {
+	u64 lo;
+	u64 hi;
+};
+
+static inline bool context_present(struct context_entry *context)
+{
+	return context->lo & 1;
+}
+static inline void context_set_present(struct context_entry *context)
+{
+	context->lo |= 1;
+}
+
+static inline void context_set_fault_enable(struct context_entry *context)
+{
+	context->lo &= (((u64)-1) << 2) | 1;
+}
+
+static inline void context_set_translation_type(struct context_entry *context,
+						unsigned long value)
+{
+	context->lo &= (((u64)-1) << 4) | 3;
+	context->lo |= (value & 3) << 2;
+}
+
+static inline void context_set_address_root(struct context_entry *context,
+					    unsigned long value)
+{
+	context->lo |= value & VTD_PAGE_MASK;
+}
+
+static inline void context_set_address_width(struct context_entry *context,
+					     unsigned long value)
+{
+	context->hi |= value & 7;
+}
+
+static inline void context_set_domain_id(struct context_entry *context,
+					 unsigned long value)
+{
+	context->hi |= (value & ((1 << 16) - 1)) << 8;
+}
+
+static inline void context_clear_entry(struct context_entry *context)
+{
+	context->lo = 0;
+	context->hi = 0;
+}
+
+/*
+ * 0: readable
+ * 1: writable
+ * 2-6: reserved
+ * 7: super page
+ * 8-10: available
+ * 11: snoop behavior
+ * 12-63: Host physical address
+ */
+struct dma_pte {
+	u64 val;
+};
+
+static inline void dma_clear_pte(struct dma_pte *pte)
+{
+	pte->val = 0;
+}
+
+static inline u64 dma_pte_addr(struct dma_pte *pte)
+{
+#ifdef CONFIG_64BIT
+	return pte->val & VTD_PAGE_MASK;
+#else
+	/* Must have a full atomic 64-bit read */
+	return  __cmpxchg64(&pte->val, 0ULL, 0ULL) & VTD_PAGE_MASK;
+#endif
+}
+
+static inline bool dma_pte_present(struct dma_pte *pte)
+{
+	return (pte->val & 3) != 0;
+}
+
+static inline bool dma_pte_superpage(struct dma_pte *pte)
+{
+	return pte->val & (1 << 7);
+}
+
+static inline int first_pte_in_page(struct dma_pte *pte)
+{
+	return !((unsigned long)pte & ~VTD_PAGE_MASK);
+}
+
+/* devices under the same p2p bridge are owned in one domain */
+#define DOMAIN_FLAG_P2P_MULTIPLE_DEVICES (1 << 0)
+
+/* domain represents a virtual machine, more than one devices
+ * across iommus may be owned in one domain, e.g. kvm guest.
+ */
+#define DOMAIN_FLAG_VIRTUAL_MACHINE	(1 << 1)
+
+/* si_domain contains multiple devices */
+#define DOMAIN_FLAG_STATIC_IDENTITY	(1 << 2)
+
+/* define the limit of IOMMUs supported in each domain */
+#ifdef	CONFIG_X86
+# define	IOMMU_UNITS_SUPPORTED	MAX_IO_APICS
+#else
+# define	IOMMU_UNITS_SUPPORTED	64
+#endif
+
+struct dmar_domain {
+	int	id;			/* domain id */
+	int	nid;			/* node id */
+	DECLARE_BITMAP(iommu_bmp, IOMMU_UNITS_SUPPORTED);
+					/* bitmap of iommus this domain uses*/
+
+	struct list_head devices;	/* all devices' list */
+	struct iova_domain iovad;	/* iova's that belong to this domain */
+
+	struct dma_pte	*pgd;		/* virtual address */
+	int		gaw;		/* max guest address width */
+
+	/* adjusted guest address width, 0 is level 2 30-bit */
+	int		agaw;
+
+	int		flags;		/* flags to find out type of domain */
+
+	int		iommu_coherency;/* indicate coherency of iommu access */
+	int		iommu_snooping; /* indicate snooping control feature*/
+	int		iommu_count;	/* reference count of iommu */
+	int		iommu_superpage;/* Level of superpages supported:
+					   0 == 4KiB (no superpages), 1 == 2MiB,
+					   2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
+	spinlock_t	iommu_lock;	/* protect iommu set in domain */
+	u64		max_addr;	/* maximum mapped address */
+};
+
+/* PCI domain-device relationship */
+struct device_domain_info {
+	struct list_head link;	/* link to domain siblings */
+	struct list_head global; /* link to global list */
+	u8 bus;			/* PCI bus number */
+	u8 devfn;		/* PCI devfn number */
+	struct device *dev; /* it's NULL for PCIe-to-PCI bridge */
+	struct intel_iommu *iommu; /* IOMMU used by this device */
+	struct dmar_domain *domain; /* pointer to domain */
+};
+
+struct dmar_rmrr_unit {
+	struct list_head list;		/* list of rmrr units	*/
+	struct acpi_dmar_header *hdr;	/* ACPI header		*/
+	u64	base_address;		/* reserved base address*/
+	u64	end_address;		/* reserved end address */
+	struct dmar_dev_scope *devices;	/* target devices */
+	int	devices_cnt;		/* target device count */
+};
+
+struct dmar_atsr_unit {
+	struct list_head list;		/* list of ATSR units */
+	struct acpi_dmar_header *hdr;	/* ACPI header */
+	struct dmar_dev_scope *devices;	/* target devices */
+	int devices_cnt;		/* target device count */
+	u8 include_all:1;		/* include all ports */
+};
+
+static inline void *alloc_pgtable_page(int node)
+{
+	struct page *page;
+	void *vaddr = NULL;
+
+	page = alloc_pages_node(node, GFP_ATOMIC | __GFP_ZERO, 0);
+	if (page)
+		vaddr = page_address(page);
+	return vaddr;
+}
+
+static inline void free_pgtable_page(void *vaddr)
+{
+	free_page((unsigned long)vaddr);
+}
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 49fdac5..4116377 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -45,350 +45,7 @@
 
 #include "irq_remapping.h"
 #include "pci.h"
-
-#define ROOT_SIZE		VTD_PAGE_SIZE
-#define CONTEXT_SIZE		VTD_PAGE_SIZE
-
-#define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
-#define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA)
-#define IS_AZALIA(pdev) ((pdev)->vendor == 0x8086 && (pdev)->device == 0x3a3e)
-
-#define IOAPIC_RANGE_START	(0xfee00000)
-#define IOAPIC_RANGE_END	(0xfeefffff)
-#define IOVA_START_ADDR		(0x1000)
-
-#define DEFAULT_DOMAIN_ADDRESS_WIDTH 48
-
-#define MAX_AGAW_WIDTH 64
-#define MAX_AGAW_PFN_WIDTH	(MAX_AGAW_WIDTH - VTD_PAGE_SHIFT)
-
-#define __DOMAIN_MAX_PFN(gaw)  ((((uint64_t)1) << (gaw-VTD_PAGE_SHIFT)) - 1)
-#define __DOMAIN_MAX_ADDR(gaw) ((((uint64_t)1) << gaw) - 1)
-
-/* We limit DOMAIN_MAX_PFN to fit in an unsigned long, and DOMAIN_MAX_ADDR
-   to match. That way, we can use 'unsigned long' for PFNs with impunity. */
-#define DOMAIN_MAX_PFN(gaw)	((unsigned long) min_t(uint64_t, \
-				__DOMAIN_MAX_PFN(gaw), (unsigned long)-1))
-#define DOMAIN_MAX_ADDR(gaw) (((uint64_t)__DOMAIN_MAX_PFN(gaw)) << \
-					VTD_PAGE_SHIFT)
-
-#define IOVA_PFN(addr)		((addr) >> PAGE_SHIFT)
-#define DMA_32BIT_PFN		IOVA_PFN(DMA_BIT_MASK(32))
-#define DMA_64BIT_PFN		IOVA_PFN(DMA_BIT_MASK(64))
-
-/* page table handling */
-#define LEVEL_STRIDE		(9)
-#define LEVEL_MASK		(((u64)1 << LEVEL_STRIDE) - 1)
-
-/*
- * This bitmap is used to advertise the page sizes our hardware support
- * to the IOMMU core, which will then use this information to split
- * physically contiguous memory regions it is mapping into page sizes
- * that we support.
- *
- * Traditionally the IOMMU core just handed us the mappings directly,
- * after making sure the size is an order of a 4KiB page and that the
- * mapping has natural alignment.
- *
- * To retain this behavior, we currently advertise that we support
- * all page sizes that are an order of 4KiB.
- *
- * If at some point we'd like to utilize the IOMMU core's new behavior,
- * we could change this to advertise the real page sizes we support.
- */
-#define INTEL_IOMMU_PGSIZES	(~0xFFFUL)
-
-static inline int agaw_to_level(int agaw)
-{
-	return agaw + 2;
-}
-
-static inline int agaw_to_width(int agaw)
-{
-	return min_t(int, 30 + agaw * LEVEL_STRIDE, MAX_AGAW_WIDTH);
-}
-
-static inline int width_to_agaw(int width)
-{
-	return DIV_ROUND_UP(width - 30, LEVEL_STRIDE);
-}
-
-static inline unsigned int level_to_offset_bits(int level)
-{
-	return (level - 1) * LEVEL_STRIDE;
-}
-
-static inline int pfn_level_offset(unsigned long pfn, int level)
-{
-	return (pfn >> level_to_offset_bits(level)) & LEVEL_MASK;
-}
-
-static inline unsigned long level_mask(int level)
-{
-	return -1UL << level_to_offset_bits(level);
-}
-
-static inline unsigned long level_size(int level)
-{
-	return 1UL << level_to_offset_bits(level);
-}
-
-static inline unsigned long align_to_level(unsigned long pfn, int level)
-{
-	return (pfn + level_size(level) - 1) & level_mask(level);
-}
-
-static inline unsigned long lvl_to_nr_pages(unsigned int lvl)
-{
-	return  1 << min_t(int, (lvl - 1) * LEVEL_STRIDE, MAX_AGAW_PFN_WIDTH);
-}
-
-/* VT-d pages must always be _smaller_ than MM pages. Otherwise things
-   are never going to work. */
-static inline unsigned long dma_to_mm_pfn(unsigned long dma_pfn)
-{
-	return dma_pfn >> (PAGE_SHIFT - VTD_PAGE_SHIFT);
-}
-
-static inline unsigned long mm_to_dma_pfn(unsigned long mm_pfn)
-{
-	return mm_pfn << (PAGE_SHIFT - VTD_PAGE_SHIFT);
-}
-static inline unsigned long page_to_dma_pfn(struct page *pg)
-{
-	return mm_to_dma_pfn(page_to_pfn(pg));
-}
-static inline unsigned long virt_to_dma_pfn(void *p)
-{
-	return page_to_dma_pfn(virt_to_page(p));
-}
-
-
-/*
- * 0: Present
- * 1-11: Reserved
- * 12-63: Context Ptr (12 - (haw-1))
- * 64-127: Reserved
- */
-struct root_entry {
-	u64	val;
-	u64	rsvd1;
-};
-#define ROOT_ENTRY_NR (VTD_PAGE_SIZE/sizeof(struct root_entry))
-static inline bool root_present(struct root_entry *root)
-{
-	return root->val & 1;
-}
-static inline void set_root_present(struct root_entry *root)
-{
-	root->val |= 1;
-}
-static inline void set_root_value(struct root_entry *root, unsigned long value)
-{
-	root->val |= value & VTD_PAGE_MASK;
-}
-
-static inline struct context_entry *
-get_context_addr_from_root(struct root_entry *root)
-{
-	return (struct context_entry *)
-		(root_present(root)?phys_to_virt(
-		root->val & VTD_PAGE_MASK) :
-		NULL);
-}
-
-/*
- * low 64 bits:
- * 0: present
- * 1: fault processing disable
- * 2-3: translation type
- * 12-63: address space root
- * high 64 bits:
- * 0-2: address width
- * 3-6: aval
- * 8-23: domain id
- */
-struct context_entry {
-	u64 lo;
-	u64 hi;
-};
-
-static inline bool context_present(struct context_entry *context)
-{
-	return context->lo & 1;
-}
-static inline void context_set_present(struct context_entry *context)
-{
-	context->lo |= 1;
-}
-
-static inline void context_set_fault_enable(struct context_entry *context)
-{
-	context->lo &= (((u64)-1) << 2) | 1;
-}
-
-static inline void context_set_translation_type(struct context_entry *context,
-						unsigned long value)
-{
-	context->lo &= (((u64)-1) << 4) | 3;
-	context->lo |= (value & 3) << 2;
-}
-
-static inline void context_set_address_root(struct context_entry *context,
-					    unsigned long value)
-{
-	context->lo |= value & VTD_PAGE_MASK;
-}
-
-static inline void context_set_address_width(struct context_entry *context,
-					     unsigned long value)
-{
-	context->hi |= value & 7;
-}
-
-static inline void context_set_domain_id(struct context_entry *context,
-					 unsigned long value)
-{
-	context->hi |= (value & ((1 << 16) - 1)) << 8;
-}
-
-static inline void context_clear_entry(struct context_entry *context)
-{
-	context->lo = 0;
-	context->hi = 0;
-}
-
-/*
- * 0: readable
- * 1: writable
- * 2-6: reserved
- * 7: super page
- * 8-10: available
- * 11: snoop behavior
- * 12-63: Host physcial address
- */
-struct dma_pte {
-	u64 val;
-};
-
-static inline void dma_clear_pte(struct dma_pte *pte)
-{
-	pte->val = 0;
-}
-
-static inline u64 dma_pte_addr(struct dma_pte *pte)
-{
-#ifdef CONFIG_64BIT
-	return pte->val & VTD_PAGE_MASK;
-#else
-	/* Must have a full atomic 64-bit read */
-	return  __cmpxchg64(&pte->val, 0ULL, 0ULL) & VTD_PAGE_MASK;
-#endif
-}
-
-static inline bool dma_pte_present(struct dma_pte *pte)
-{
-	return (pte->val & 3) != 0;
-}
-
-static inline bool dma_pte_superpage(struct dma_pte *pte)
-{
-	return pte->val & (1 << 7);
-}
-
-static inline int first_pte_in_page(struct dma_pte *pte)
-{
-	return !((unsigned long)pte & ~VTD_PAGE_MASK);
-}
-
-/* devices under the same p2p bridge are owned in one domain */
-#define DOMAIN_FLAG_P2P_MULTIPLE_DEVICES (1 << 0)
-
-/* domain represents a virtual machine, more than one devices
- * across iommus may be owned in one domain, e.g. kvm guest.
- */
-#define DOMAIN_FLAG_VIRTUAL_MACHINE	(1 << 1)
-
-/* si_domain contains mulitple devices */
-#define DOMAIN_FLAG_STATIC_IDENTITY	(1 << 2)
-
-/* define the limit of IOMMUs supported in each domain */
-#ifdef	CONFIG_X86
-# define	IOMMU_UNITS_SUPPORTED	MAX_IO_APICS
-#else
-# define	IOMMU_UNITS_SUPPORTED	64
-#endif
-
-struct dmar_domain {
-	int	id;			/* domain id */
-	int	nid;			/* node id */
-	DECLARE_BITMAP(iommu_bmp, IOMMU_UNITS_SUPPORTED);
-					/* bitmap of iommus this domain uses*/
-
-	struct list_head devices;	/* all devices' list */
-	struct iova_domain iovad;	/* iova's that belong to this domain */
-
-	struct dma_pte	*pgd;		/* virtual address */
-	int		gaw;		/* max guest address width */
-
-	/* adjusted guest address width, 0 is level 2 30-bit */
-	int		agaw;
-
-	int		flags;		/* flags to find out type of domain */
-
-	int		iommu_coherency;/* indicate coherency of iommu access */
-	int		iommu_snooping; /* indicate snooping control feature*/
-	int		iommu_count;	/* reference count of iommu */
-	int		iommu_superpage;/* Level of superpages supported:
-					   0 == 4KiB (no superpages), 1 == 2MiB,
-					   2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
-	spinlock_t	iommu_lock;	/* protect iommu set in domain */
-	u64		max_addr;	/* maximum mapped address */
-};
-
-/* PCI domain-device relationship */
-struct device_domain_info {
-	struct list_head link;	/* link to domain siblings */
-	struct list_head global; /* link to global list */
-	u8 bus;			/* PCI bus number */
-	u8 devfn;		/* PCI devfn number */
-	struct device *dev; /* it's NULL for PCIe-to-PCI bridge */
-	struct intel_iommu *iommu; /* IOMMU used by this device */
-	struct dmar_domain *domain; /* pointer to domain */
-};
-
-struct dmar_rmrr_unit {
-	struct list_head list;		/* list of rmrr units	*/
-	struct acpi_dmar_header *hdr;	/* ACPI header		*/
-	u64	base_address;		/* reserved base address*/
-	u64	end_address;		/* reserved end address */
-	struct dmar_dev_scope *devices;	/* target devices */
-	int	devices_cnt;		/* target device count */
-};
-
-struct dmar_atsr_unit {
-	struct list_head list;		/* list of ATSR units */
-	struct acpi_dmar_header *hdr;	/* ACPI header */
-	struct dmar_dev_scope *devices;	/* target devices */
-	int devices_cnt;		/* target device count */
-	u8 include_all:1;		/* include all ports */
-};
-
-static inline void *alloc_pgtable_page(int node)
-{
-	struct page *page;
-	void *vaddr = NULL;
-
-	page = alloc_pages_node(node, GFP_ATOMIC | __GFP_ZERO, 0);
-	if (page)
-		vaddr = page_address(page);
-	return vaddr;
-}
-
-static inline void free_pgtable_page(void *vaddr)
-{
-	free_page((unsigned long)vaddr);
-}
+#include "intel-iommu-private.h"
 
 static LIST_HEAD(dmar_atsr_units);
 static LIST_HEAD(dmar_rmrr_units);
-- 
Bill Sumner <bill.sumner@hp.com>



* [PATCH 4/8] iommu/vt-d: Update iommu_attach_domain() and its callers
@ 2014-04-25  0:36 ` Bill Sumner
From: Bill Sumner @ 2014-04-25  0:36 UTC
  To: dwmw2, indou.takao, bhe, joro
  Cc: iommu, kexec, alex.williamson, linux-pci, linux-kernel, ddutile,
	ishii.hironobu, bhelgaas, bill.sumner, doug.hatch, zhenhua,
	jerry.hoemann

Allow specification of the domain-id for a new domain.
This patch only adds a domain-id parameter to iommu_attach_domain()
and modifies all of its callers to pass the default value of -1,
which means "no did specified, allocate a new one".

This introduces no functional change from current behaviour -- it just
enables a functional change to be made in a later patch (a hypothetical
example follows).
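
For illustration only, a hypothetical later caller might reuse a
domain-id recovered from the old kernel (via
intel_iommu_did_to_domain_values_entry(), added in patch 5) instead of
passing -1; this sketch is not code from this patch:

	/* Hypothetical: 'domain', 'iommu', 'did' and 'ret' come from the
	 * surrounding (assumed) context.  Reuse the old kernel's
	 * domain-id if it was found among the copied context entries,
	 * otherwise allocate a fresh one.
	 */
	struct domain_values_entry *dve =
		intel_iommu_did_to_domain_values_entry(did, iommu);

	if (dve)
		ret = iommu_attach_domain(domain, iommu, dve->did);
	else
		ret = iommu_attach_domain(domain, iommu, -1);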

Signed-off-by: Bill Sumner <bill.sumner@hp.com>
---
 drivers/iommu/intel-iommu.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 4116377..226c1d0 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1181,7 +1181,8 @@ static struct dmar_domain *alloc_domain(bool vm)
 }
 
 static int iommu_attach_domain(struct dmar_domain *domain,
-			       struct intel_iommu *iommu)
+			       struct intel_iommu *iommu,
+			       int domain_number)
 {
 	int num;
 	unsigned long ndomains;
@@ -1191,12 +1192,15 @@ static int iommu_attach_domain(struct dmar_domain *domain,
 
 	spin_lock_irqsave(&iommu->lock, flags);
 
-	num = find_first_zero_bit(iommu->domain_ids, ndomains);
-	if (num >= ndomains) {
-		spin_unlock_irqrestore(&iommu->lock, flags);
-		printk(KERN_ERR "IOMMU: no free domain ids\n");
-		return -ENOMEM;
-	}
+	if (domain_number < 0) {
+		num = find_first_zero_bit(iommu->domain_ids, ndomains);
+		if (num >= ndomains) {
+			spin_unlock_irqrestore(&iommu->lock, flags);
+			pr_err("IOMMU: no free domain ids\n");
+			return -ENOMEM;
+		}
+	} else
+		num = domain_number;
 
 	domain->id = num;
 	domain->iommu_count++;
@@ -1875,6 +1879,7 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
 	struct pci_dev *dev_tmp = NULL;
 	unsigned long flags;
 	u8 bus, devfn, bridge_bus, bridge_devfn;
+	int did = -1;   /* Default to "no domain_id supplied" */
 
 	domain = find_domain(dev);
 	if (domain)
@@ -1915,7 +1920,7 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
 	domain = alloc_domain(false);
 	if (!domain)
 		goto error;
-	if (iommu_attach_domain(domain, iommu)) {
+	if (iommu_attach_domain(domain, iommu, did)) {
 		free_domain_mem(domain);
 		domain = NULL;
 		goto error;
@@ -2083,7 +2088,7 @@ static int __init si_domain_init(int hw)
 	si_domain->flags = DOMAIN_FLAG_STATIC_IDENTITY;
 
 	for_each_active_iommu(iommu, drhd) {
-		ret = iommu_attach_domain(si_domain, iommu);
+		ret = iommu_attach_domain(si_domain, iommu, (int) -1);
 		if (ret) {
 			domain_exit(si_domain);
 			return -EFAULT;
-- 
Bill Sumner <bill.sumner@hp.com>



* [PATCH 5/8] iommu/vt-d: Items required for kdump
@ 2014-04-25  0:36 ` Bill Sumner
From: Bill Sumner @ 2014-04-25  0:36 UTC
  To: dwmw2, indou.takao, bhe, joro
  Cc: iommu, kexec, alex.williamson, linux-pci, linux-kernel, ddutile,
	ishii.hironobu, bhelgaas, bill.sumner, doug.hatch, zhenhua,
	jerry.hoemann

Add data declarations and prototypes needed for kdump.

Signed-off-by: Bill Sumner <bill.sumner@hp.com>
---
 drivers/iommu/intel-iommu-private.h | 94 +++++++++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)

diff --git a/drivers/iommu/intel-iommu-private.h b/drivers/iommu/intel-iommu-private.h
index 480399c..3f20234 100644
--- a/drivers/iommu/intel-iommu-private.h
+++ b/drivers/iommu/intel-iommu-private.h
@@ -361,3 +361,97 @@ static inline void free_pgtable_page(void *vaddr)
 {
 	free_page((unsigned long)vaddr);
 }
+
+#ifdef CONFIG_CRASH_DUMP
+/*
+ * Fix Crashdump failure caused by leftover DMA through a hardware IOMMU
+ *
+ * Fixes the crashdump kernel to deal with an active iommu and legacy
+ * DMA from the (old) panicked kernel in a manner similar to how legacy
+ * DMA is handled when no hardware iommu was in use by the old kernel --
+ * allow the legacy DMA to continue into its current buffers.
+ *
+ * In the crashdump kernel, this code:
+ * 1. skips disabling the IOMMU's translating of IO Virtual Addresses (IOVA)
+ * 2. leaves the current translations in-place so that legacy DMA will
+ *    continue to use its current buffers,
+ * 3. allocates to the device drivers in the crashdump kernel
+ *    portions of the iova address ranges that are different
+ *    from the iova address ranges that were being used by the old kernel
+ *    at the time of the panic.
+ *
+ */
+
+struct domain_values_entry {
+	struct list_head link;		/* link entries into a list */
+	struct iova_domain iovad;	/* iova's that belong to this domain */
+	struct dma_pte	*pgd;		/* virtual address */
+	int    did;			/* domain id */
+	int    gaw;			/* max guest address width */
+	int    iommu_superpage;		/* Level of superpages supported:
+					   0 == 4KiB (no superpages), 1 == 2MiB,
+					   2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
+};
+
+
+
+/* ------------------------------------------------------------------------
+ * Prototypes of interface functions
+ * ------------------------------------------------------------------------
+ */
+
+extern int
+intel_iommu_copy_translation_tables(struct dmar_drhd_unit *drhd,
+	struct root_entry **root_old_p, struct root_entry **root_new_p,
+	int g_num_of_iommus);
+
+extern int
+intel_iommu_get_dids_from_old_kernel(struct intel_iommu *iommu);
+
+extern struct domain_values_entry *
+intel_iommu_did_to_domain_values_entry(int did, struct intel_iommu *iommu);
+
+
+/* ------------------------------------------------------------------------
+ * Utility functions for accessing the iommu Translation Tables
+ * ------------------------------------------------------------------------
+ */
+
+static inline struct context_entry *
+get_context_phys_from_root(struct root_entry *root)
+{
+	return (struct context_entry *)
+		(root_present(root) ? (void *) (root->val & VTD_PAGE_MASK)
+				    : NULL);
+}
+
+static inline int
+context_get_p(struct context_entry *c)    {return((c->lo >> 0) & 0x1); }
+
+static inline int
+context_get_fpdi(struct context_entry *c) {return((c->lo >> 1) & 0x1); }
+
+static inline int
+context_get_t(struct context_entry *c)    {return((c->lo >> 2) & 0x3); }
+
+static inline u64
+context_get_asr(struct context_entry *c)  {return((c->lo >> 12));      }
+
+static inline int
+context_get_aw(struct context_entry *c)   {return((c->hi >> 0) & 0x7); }
+
+static inline int
+context_get_aval(struct context_entry *c) {return((c->hi >> 3) & 0xf); }
+
+static inline int
+context_get_did(struct context_entry *c)  {return((c->hi >> 8) & 0xffff); }
+
+static inline void
+context_put_asr(struct context_entry *c, unsigned long asr)
+{
+	c->lo &= (~VTD_PAGE_MASK);
+	c->lo |= (asr << VTD_PAGE_SHIFT);
+}
+
+#endif	/* CONFIG_CRASH_DUMP */
+
-- 
Bill Sumner <bill.sumner@hp.com>



* [PATCH 6/8] iommu/vt-d: Create intel-iommu-kdump.c
@ 2014-04-25  0:36 ` Bill Sumner
From: Bill Sumner @ 2014-04-25  0:36 UTC
  To: dwmw2, indou.takao, bhe, joro
  Cc: iommu, kexec, alex.williamson, linux-pci, linux-kernel, ddutile,
	ishii.hironobu, bhelgaas, bill.sumner, doug.hatch, zhenhua,
	jerry.hoemann

Create intel-iommu-kdump.c and populate it with support functions to copy
iommu translation tables from the panicked kernel into the kdump kernel
in the event of a crash.

Update Makefile to build intel-iommu-kdump.o

Signed-off-by: Bill Sumner <bill.sumner@hp.com>
---
 drivers/iommu/Makefile            |   2 +-
 drivers/iommu/intel-iommu-kdump.c | 590 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 591 insertions(+), 1 deletion(-)
 create mode 100644 drivers/iommu/intel-iommu-kdump.c

diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 5d58bf1..bd61452 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_AMD_IOMMU) += amd_iommu.o amd_iommu_init.o
 obj-$(CONFIG_AMD_IOMMU_V2) += amd_iommu_v2.o
 obj-$(CONFIG_ARM_SMMU) += arm-smmu.o
 obj-$(CONFIG_DMAR_TABLE) += dmar.o
-obj-$(CONFIG_INTEL_IOMMU) += iova.o intel-iommu.o
+obj-$(CONFIG_INTEL_IOMMU) += iova.o intel-iommu.o intel-iommu-kdump.o
 obj-$(CONFIG_IRQ_REMAP) += intel_irq_remapping.o irq_remapping.o
 obj-$(CONFIG_OMAP_IOMMU) += omap-iommu.o
 obj-$(CONFIG_OMAP_IOMMU) += omap-iommu2.o
diff --git a/drivers/iommu/intel-iommu-kdump.c b/drivers/iommu/intel-iommu-kdump.c
new file mode 100644
index 0000000..4e653e048
--- /dev/null
+++ b/drivers/iommu/intel-iommu-kdump.c
@@ -0,0 +1,590 @@
+/*
+ * Copyright (C) 2014 Hewlett-Packard Development Company, L.P.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Copyright (C) 2014 Hewlett-Packard Development Company, L.P.
+ * Author: Bill Sumner <bill.sumner@hp.com>
+ */
+#include <linux/init.h>
+#include <linux/bitmap.h>
+#include <linux/debugfs.h>
+#include <linux/export.h>
+#include <linux/slab.h>
+#include <linux/irq.h>
+#include <linux/interrupt.h>
+#include <linux/spinlock.h>
+#include <linux/pci.h>
+#include <linux/dmar.h>
+#include <linux/dma-mapping.h>
+#include <linux/mempool.h>
+#include <linux/timer.h>
+#include <linux/iova.h>
+#include <linux/iommu.h>
+#include <linux/intel-iommu.h>
+#include <linux/syscore_ops.h>
+#include <linux/tboot.h>
+#include <linux/dmi.h>
+#include <linux/pci-ats.h>
+#include <linux/memblock.h>
+#include <asm/irq_remapping.h>
+#include <asm/cacheflush.h>
+#include <linux/iommu.h>
+
+#include "irq_remapping.h"
+#include "pci.h"
+#include "intel-iommu-private.h"
+#include <linux/crash_dump.h>
+
+#ifdef CONFIG_CRASH_DUMP
+
+
+/* Lists of domain_values_entry to hold domain values found during the copy.
+ * One list for each iommu in g_number_of_iommus.
+ */
+static struct list_head *domain_values_list;
+
+
+/* ========================================================================
+ * Copy iommu translation tables from old kernel into new  kernel.
+ * Entry to this set of functions is: intel_iommu_copy_translation_tables()
+ * ------------------------------------------------------------------------
+ */
+#define RET_BADCOPY -1	/* Return-code: Cannot copy translate tables */
+
+/*
+ * Copy memory from a physically-addressed area into a virtually-addressed area
+ */
+static int oldcopy(void *to, void *from, int size)
+{
+	size_t ret = 0;			/* Length copied */
+	unsigned long pfn;		/* Page Frame Number */
+	char *buf = to;			/* Adr(Output buffer) */
+	size_t csize = (size_t)size;	/* Num(bytes to copy) */
+	unsigned long offset;		/* Lower 12 bits of from */
+	int userbuf = 0;		/* to is in kernel space */
+
+
+	pfn = ((unsigned long) from) >> VTD_PAGE_SHIFT;
+	offset = ((unsigned long) from) & (~VTD_PAGE_MASK);
+	ret = copy_oldmem_page(pfn, buf, csize, offset, userbuf);
+
+	return (int) ret;
+}
+
+
+
+/*
+ * Struct copy_page_addr_parms is used to allow copy_page_addr()
+ * to accumulate values across multiple calls and returns.
+ */
+struct copy_page_addr_parms {
+	u32 first;	/* flag: first-time  */
+	u32 last;	/* flag: last-time */
+	u32 bus;	/* last bus number we saw */
+	u32 devfn;	/* last devfn we saw */
+	u32 shift;	/* last shift we saw */
+	u64 pte;	/* Page Table Entry */
+	u64 next_addr;	/* next-expected page_addr */
+
+	u64 page_addr;	/* page_addr accumulating size */
+	u64 page_size;	/* page_size accumulated */
+
+	struct domain_values_entry *dve;	/* to accumulate iova ranges */
+};
+
+/*
+ * constant for initializing instances of copy_page_addr_parms properly.
+ */
+static struct copy_page_addr_parms copy_page_addr_parms_init = {1, 0};
+
+
+
+/*
+ * Lowest-level function in the 'Copy Page Tables' set
+ * Called once for each page_addr present in an iommu page-address table.
+ *
+ * Because of the depth-first traversal of the page-tables by the
+ * higher-level functions that call 'copy_page_addr', all pages
+ * of a domain will be presented in ascending order of IO Virtual Address.
+ *
+ * This function accumulates each contiguous range of these IOVAs and
+ * reserves it within the proper domain in the crashdump kernel when a
+ * non-contiguous range is detected, as determined by any of the following:
+ * 1. a change in the bus or device owning the presented page
+ * 2. a change in the page-size of the presented page (parameter shift)
+ * 3. a change in the page-table entry of the presented page
+ * 4. a presented IOVA that does not match the expected next-page address
+ * 5. the 'last' flag is set, indicating that all IOVAs have been seen.
+ */
+static int copy_page_addr(u64 page_addr, u32 shift, u32 bus, u32 devfn,
+				u64 pte, struct domain_values_entry *dve,
+				void *parms)
+{
+	struct copy_page_addr_parms *ppap = parms;
+
+	u64 page_size = ((u64)1 << shift);	/* page_size */
+	u64 pfn_lo;				/* For reserving IOVA range */
+	u64 pfn_hi;				/* For reserving IOVA range */
+	struct iova *iova_p;			/* For reserving IOVA range */
+
+	if (!ppap) {
+		pr_err("ERROR: ppap is NULL: 0x%3.3x(%3.3d) DevFn: 0x%3.3x(%3.3d) Page: 0x%16.16llx Size: 0x%16.16llx(%lld)\n",
+			bus, bus, devfn, devfn,  page_addr,
+			page_size, page_size);
+		return 0;
+	}
+
+	/* If (only extending current addr range) */
+	if (ppap->first     == 0      &&
+	    ppap->last      == 0      &&
+	    ppap->bus       == bus    &&
+	    ppap->devfn     == devfn  &&
+	    ppap->shift     == shift  &&
+	    (ppap->pte & ~VTD_PAGE_MASK) == (pte & ~VTD_PAGE_MASK) &&
+	    ppap->next_addr == page_addr) {
+
+		/* Update page size and next-expected address */
+		ppap->next_addr += page_size;
+		ppap->page_size += page_size;
+		return 0;
+	}
+
+	if (!ppap->first) {
+		/* Close-out the accumulated IOVA address range */
+
+		if (!ppap->dve) {
+			pr_err("%s ERROR: ppap->dve is NULL -- needed to reserve range for B:D:F=%2.2x:%2.2x:%1.1x\n",
+				__func__,
+				ppap->bus, ppap->devfn >> 3, ppap->devfn & 0x7);
+			return RET_BADCOPY;
+		}
+		pfn_lo = IOVA_PFN(ppap->page_addr);
+		pfn_hi = IOVA_PFN(ppap->page_addr + ppap->page_size);
+		iova_p = reserve_iova(&ppap->dve->iovad, pfn_lo, pfn_hi);
+	}
+
+	/* Prepare for a new IOVA address range */
+	ppap->first     = 0;		/* Not first-time anymore */
+	ppap->bus       = bus;
+	ppap->devfn     = devfn;
+	ppap->shift     = shift;
+	ppap->pte       = pte;
+	ppap->next_addr = page_addr + page_size; /* Next-expected page_addr */
+
+	ppap->page_addr = page_addr;	/* Addr(new page) */
+	ppap->page_size = page_size;	/* Size(new page) */
+
+	ppap->dve	= dve;	/* adr(device_values_entry for new range) */
+
+	return 0;
+}
+
+/*
+ * Recursive function to copy the tree of page tables (max 6 recursions)
+ * Parameter 'shift' controls the recursion
+ */
+static int copy_page_table(struct dma_pte **dma_pte_new_p,
+			   struct dma_pte *dma_pte_phys,
+			   u32 shift, u64 page_addr,
+			   struct intel_iommu *iommu,
+			   u32 bus, u32 devfn,
+			   struct domain_values_entry *dve, void *ppap)
+{
+	int ret;			/* Integer return code */
+	struct dma_pte *p;		/* Physical adr(each entry) iterator */
+	struct dma_pte *pgt_new_virt;	/* Adr(dma_pte in new kernel) */
+	struct dma_pte *dma_pte_next;	/* Adr(next table down)  */
+	u64 u;				/* index(each entry in page_table) */
+
+
+	/* If (already done all levels -- problem) */
+	if (shift < 12) {
+		pr_err("ERROR %s shift < 12 %p\n", __func__, dma_pte_phys);
+		pr_err("shift %d, page_addr %16.16llu bus %3.3u devfn %3.3u\n",
+			shift, page_addr, bus, devfn);
+		return RET_BADCOPY;
+	}
+
+	/* allocate a page table in the new kernel
+	 * copy contents from old kernel
+	 * then update each entry in the table in the new kernel
+	 */
+
+	pgt_new_virt = (struct dma_pte *)alloc_pgtable_page(iommu->node);
+	if (!pgt_new_virt)
+		return -ENOMEM;
+
+	ret = oldcopy(pgt_new_virt, dma_pte_phys, VTD_PAGE_SIZE);
+	if (ret <= 0)
+		return ret;
+
+	for (u = 0, p = pgt_new_virt; u < 512; u++, p++) {
+
+		if (((p->val & DMA_PTE_READ) == 0) &&
+		    ((p->val & DMA_PTE_WRITE) == 0))
+			continue;
+
+		if (dma_pte_superpage(p) || (shift == 12)) {
+
+			ret = copy_page_addr(page_addr | (u << shift),
+				shift, bus, devfn, p->val, dve, ppap);
+			if (ret)
+				return ret;
+			continue;
+		}
+
+		ret = copy_page_table(&dma_pte_next,
+				(struct dma_pte *)(p->val & VTD_PAGE_MASK),
+				shift-9, page_addr | (u << shift),
+				iommu, bus, devfn, dve, ppap);
+		if (ret)
+			return ret;
+
+		p->val &= ~VTD_PAGE_MASK;	/* Clear old and set new pgd */
+		p->val |= ((u64)dma_pte_next & VTD_PAGE_MASK);
+	}
+
+	*dma_pte_new_p = (struct dma_pte *)virt_to_phys(pgt_new_virt);
+	__iommu_flush_cache(iommu, pgt_new_virt, VTD_PAGE_SIZE);
+
+	return 0;
+}
+
+
+/*
+ * Called once for each context_entry found in a copied context_entry_table
+ * Each context_entry represents one PCIe device handled by the IOMMU.
+ *
+ * The 'domain_values_list' contains one 'domain_values_entry' for each
+ * unique domain-id found while copying the context entries for each iommu.
+ *
+ * The Intel-iommu spec. requires that every context_entry that contains
+ * the same domain-id point to the same set of page translation tables.
+ * The hardware uses this to improve the use of its translation cache.
+ * In order to insure that the copied translate tables abide by this
+ * requirement, this function keeps a list of domain-ids (dids) that
+ * have already been seen for this iommu. This function checks each entry
+ * already on the list for a domain-id that matches the domain-id in this
+ * context_entry.  If found, this function places the address of the previous
+ * context's tree of page translation tables into this context_entry.
+ * If a matching previous entry is not found, a new 'domain_values_entry'
+ * structure is created for the domain-id in this context_entry and
+ * copy_page_table is called to duplicate its tree of page tables.
+ */
+
+enum returns_from_copy_context_entry {
+RET_CCE_NOT_PRESENT = 1,
+RET_CCE_NEW_PAGE_TABLES,
+RET_CCE_PASS_THROUGH_1,
+RET_CCE_PASS_THROUGH_2,
+RET_CCE_RESERVED_VALUE,
+RET_CCE_PREVIOUS_DID
+};
+static int copy_context_entry(struct intel_iommu *iommu, u32 bus, u32 devfn,
+			      void *ppap, struct context_entry *ce)
+{
+	int ret = 0;			/* Integer Return Code */
+	u32 shift = 0;			/* bits to shift page_addr  */
+	u64 page_addr = 0;		/* Address of translated page */
+	struct dma_pte *pgt_old_phys;	/* Adr(page_table in the old kernel) */
+	struct dma_pte *pgt_new_phys;	/* Adr(page_table in the new kernel) */
+	unsigned long asr;		/* New asr value for new context */
+	u8  t;				/* Translation-type from context */
+	u8  aw;				/* Address-width from context */
+	u32 aw_shift[8] = {
+		12+9+9,		/* [000b] 30-bit AGAW (2-level page table) */
+		12+9+9+9,	/* [001b] 39-bit AGAW (3-level page table) */
+		12+9+9+9+9,	/* [010b] 48-bit AGAW (4-level page table) */
+		12+9+9+9+9+9,	/* [011b] 57-bit AGAW (5-level page table) */
+		12+9+9+9+9+9+9,	/* [100b] 64-bit AGAW (6-level page table) */
+		0,		/* [101b] Reserved */
+		0,		/* [110b] Reserved */
+		0,		/* [111b] Reserved */
+	};
+
+	struct domain_values_entry *dve = NULL;
+
+
+	if (!context_get_p(ce)) {	/* If (context not present) */
+		ret = RET_CCE_NOT_PRESENT;		/* Skip it */
+		goto exit;
+	}
+
+	t = context_get_t(ce);
+
+	/* If we have seen this domain-id before on this iommu,
+	 * give this context the same page-tables and we are done.
+	 */
+	list_for_each_entry(dve, &domain_values_list[iommu->seq_id], link) {
+		if (dve->did == (int) context_get_did(ce)) {
+			switch (t) {
+			case 0:	/* page tables */
+			case 1:	/* page tables */
+				asr = virt_to_phys(dve->pgd) >> VTD_PAGE_SHIFT;
+				context_put_asr(ce, asr);
+				ret = RET_CCE_PREVIOUS_DID;
+				break;
+
+			case 2:	/* Pass through */
+				if (dve->pgd == NULL)
+					ret = RET_CCE_PASS_THROUGH_2;
+				else
+					ret = RET_BADCOPY;
+				break;
+
+			default: /* Bad value of 't'*/
+				ret = RET_BADCOPY;
+				break;
+			}
+			goto exit;
+		}
+	}
+
+	/* Since we now know that this is a new domain-id for this iommu,
+	 * create a new entry, add it to the list, and handle its
+	 * page tables.
+	 */
+
+	dve = kzalloc(sizeof(struct domain_values_entry), GFP_KERNEL);
+	if (!dve) {
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	dve->did = (int) context_get_did(ce);
+	dve->gaw = (int) agaw_to_width(context_get_aw(ce));
+	dve->pgd = NULL;
+	init_iova_domain(&dve->iovad, DMA_32BIT_PFN);
+
+	list_add(&dve->link, &domain_values_list[iommu->seq_id]);
+
+
+	if (t == 0 || t == 1) {		/* If (context has page tables) */
+		aw = context_get_aw(ce);
+		shift = aw_shift[aw];
+
+		pgt_old_phys = (struct dma_pte *)(context_get_asr(ce) << 12);
+
+		ret = copy_page_table(&pgt_new_phys, pgt_old_phys,
+			shift-9, page_addr, iommu, bus, devfn, dve, ppap);
+
+		if (ret)		/* if (problem) bail out */
+			goto exit;
+
+		asr = ((unsigned long)(pgt_new_phys)) >> VTD_PAGE_SHIFT;
+		context_put_asr(ce, asr);
+		dve->pgd = phys_to_virt((unsigned long)pgt_new_phys);
+		ret = RET_CCE_NEW_PAGE_TABLES;
+		goto exit;
+	}
+
+	if (t == 2) {		/* If (Identity mapped pass-through) */
+		ret = RET_CCE_PASS_THROUGH_1;	/* REVISIT: Skip for now */
+		goto exit;
+	}
+
+	ret = RET_CCE_RESERVED_VALUE;	/* Else ce->t is a Reserved value */
+	/* Note fall-through */
+
+exit:	/* all returns come through here to ensure good clean-up */
+	return ret;
+}
+
+
+/*
+ * Called once for each context_entry_table found in the root_entry_table
+ */
+static int copy_context_entry_table(struct intel_iommu *iommu,
+				    u32 bus, void *ppap,
+				    struct context_entry **context_new_p,
+				    struct context_entry *context_old_phys)
+{
+	int ret = 0;				/* Integer return code */
+	struct context_entry *ce;		/* Iterator */
+	struct context_entry *context_new_phys;	/* adr(table in new kernel) */
+	struct context_entry *context_new_virt;	/* adr(table in new kernel) */
+	u32 devfn = 0;				/* PCI Device & function */
+
+	/* allocate a context-entry table in the new kernel
+	 * copy contents from old kernel
+	 * then update each entry in the table in the new kernel
+	 */
+	context_new_virt =
+		(struct context_entry *)alloc_pgtable_page(iommu->node);
+	if (!context_new_virt)
+		return -ENOMEM;
+
+	context_new_phys =
+		(struct context_entry *)virt_to_phys(context_new_virt);
+
+	oldcopy(context_new_virt, context_old_phys, VTD_PAGE_SIZE);
+
+	for (devfn = 0, ce = context_new_virt; devfn < 256; devfn++, ce++) {
+
+		if (!context_get_p(ce))		/* If (context not present) */
+			continue;		/* Skip it */
+
+		ret = copy_context_entry(iommu, bus, devfn, ppap, ce);
+		if (ret < 0)		/* if (problem) */
+			return RET_BADCOPY;
+
+		switch (ret) {
+		case RET_CCE_NOT_PRESENT:
+			continue;
+		case RET_CCE_NEW_PAGE_TABLES:
+			continue;
+		case RET_CCE_PASS_THROUGH_1:
+			continue;
+		case RET_CCE_PASS_THROUGH_2:
+			continue;
+		case RET_CCE_RESERVED_VALUE:
+			return RET_BADCOPY;
+		case RET_CCE_PREVIOUS_DID:
+			continue;
+		default:
+			return RET_BADCOPY;
+		}
+	}
+
+	*context_new_p = context_new_phys;
+	__iommu_flush_cache(iommu, context_new_virt, VTD_PAGE_SIZE);
+	return 0;
+}
+
+
+/*
+ * Highest-level function in the 'copy translation tables' set of functions
+ */
+static int copy_root_entry_table(struct intel_iommu *iommu, void *ppap,
+				 struct root_entry  **root_new_virt_p,
+				 struct root_entry  *root_old_phys)
+{
+	int ret = 0;				/* Integer return code */
+	u32 bus;				/* Index: root-entry-table */
+	struct root_entry  *re;			/* Virt(iterator: new table) */
+	struct root_entry  *root_new_virt;	/* Virt(table in new kernel) */
+	struct context_entry *context_old_phys;	/* Phys(context table entry) */
+	struct context_entry *context_new_phys;	/* Phys(new context_entry) */
+
+	/*
+	 * allocate a root-entry table in the new kernel
+	 * copy contents from old kernel
+	 * then update each entry in the table in the new kernel
+	 */
+
+	root_new_virt = (struct root_entry *)alloc_pgtable_page(iommu->node);
+	if (!root_new_virt)
+		return -ENOMEM;
+
+	oldcopy(root_new_virt, root_old_phys, VTD_PAGE_SIZE);
+
+	for (bus = 0, re = root_new_virt; bus < 256; bus += 1, re += 1) {
+
+		if (!root_present(re))
+			continue;
+
+		context_old_phys = get_context_phys_from_root(re);
+
+		if (!context_old_phys)
+			continue;
+
+		ret = copy_context_entry_table(iommu, bus, ppap,
+						&context_new_phys,
+						context_old_phys);
+		if (ret)
+			return ret;
+
+		re->val &= ~VTD_PAGE_MASK;
+		set_root_value(re, (unsigned long)context_new_phys);
+	}
+
+	*root_new_virt_p = root_new_virt;
+	__iommu_flush_cache(iommu, root_new_virt, VTD_PAGE_SIZE);
+	return 0;
+}
+
+/*
+ * Interface to the "copy translation tables" set of functions
+ * from mainline code.
+ */
+int intel_iommu_copy_translation_tables(struct dmar_drhd_unit *drhd,
+		struct root_entry **root_old_phys_p,
+		struct root_entry **root_new_virt_p,
+		int g_num_of_iommus)
+{
+	struct intel_iommu *iommu;	/* Virt(iommu hardware registers) */
+	unsigned long long q;		/* quadword scratch */
+	struct root_entry *root_phys;	/* Phys(table in old kernel) */
+	struct root_entry *root_new;	/* Virt(table in new kernel) */
+	int ret = 0;			/* Integer return code */
+	int i = 0;			/* Loop index */
+
+	/* Structure so copy_page_addr() can accumulate things
+	 * over multiple calls and returns
+	 */
+	struct copy_page_addr_parms ppa_parms = copy_page_addr_parms_init;
+	struct copy_page_addr_parms *ppap = &ppa_parms;
+
+
+	iommu = drhd->iommu;
+	q = readq(iommu->reg + DMAR_RTADDR_REG);
+
+	if (!q)
+		return -1;
+
+	*root_old_phys_p = (struct root_entry *)q;	/* Returned to caller */
+
+	/* If (list needs initializing) do it here */
+	if (!domain_values_list) {
+		domain_values_list =
+			 kcalloc(g_num_of_iommus, sizeof(struct list_head),
+					GFP_KERNEL);
+
+		if (!domain_values_list) {
+			pr_err("Allocation failed for domain_values_list array\n");
+			return -ENOMEM;
+		}
+		for (i = 0; i < g_num_of_iommus; i++)
+			INIT_LIST_HEAD(&domain_values_list[i]);
+	}
+
+	/* Copy the root-entry table from the old kernel
+	 * foreach context_entry_table in root_entry
+	 *    foreach context_entry in context_entry_table
+	 *       foreach level-1 page_table_entry in context_entry
+	 *          foreach level-2 page_table_entry in level 1 page_table_entry
+	 *             Above pattern continues up to 6 levels of page tables
+	 *                Sanity-check the entry
+	 *                Process the bus, devfn, page_address, page_size
+	 */
+
+	root_phys = (struct root_entry *)q;
+	ret = copy_root_entry_table(iommu, ppap, &root_new, root_phys);
+	if (ret)
+		return ret;
+
+
+	ppa_parms.last = 1;
+	copy_page_addr(0, 0, 0, 0, 0, NULL, ppap);
+	*root_new_virt_p = root_new;			/* Returned to caller */
+
+	/* The translation tables in the new kernel should now contain
+	 * the same translations as the tables in the old kernel.
+	 * This will allow us to update the iommu hdw to use the new tables.
+	 *
+	 * NOTE: Neither the iommu hardware nor the iommu->root_entry
+	 *       struct-value is updated herein.
+	 *       These are left for the caller to do.
+	 */
+
+	return 0;
+}
+#endif /* CONFIG_CRASH_DUMP */
-- 
Bill Sumner <bill.sumner@hp.com>



* [PATCH 7/8] iommu/vt-d: Add domain-id functions to intel-iommu-kdump.c
  2014-04-25  0:36 [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO Bill Sumner
                   ` (5 preceding siblings ...)
  2014-04-25  0:36 ` [PATCH 6/8] iommu/vt-d: Create intel-iommu-kdump.c Bill Sumner
@ 2014-04-25  0:36 ` Bill Sumner
  2014-04-25  0:36 ` [PATCH 8/8] iommu/vt-d: Changes to support kdump Bill Sumner
  2014-04-30 10:49 ` [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO David Woodhouse
  8 siblings, 0 replies; 21+ messages in thread
From: Bill Sumner @ 2014-04-25  0:36 UTC (permalink / raw)
  To: dwmw2, indou.takao, bhe, joro
  Cc: iommu, kexec, alex.williamson, linux-pci, linux-kernel, ddutile,
	ishii.hironobu, bhelgaas, bill.sumner, doug.hatch, zhenhua,
	jerry.hoemann

Interfaces for when a new domain in the crashdump kernel needs some
values from the panicked kernel's context entries.

Signed-off-by: Bill Sumner <bill.sumner@hp.com>
---
 drivers/iommu/intel-iommu-kdump.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/drivers/iommu/intel-iommu-kdump.c b/drivers/iommu/intel-iommu-kdump.c
index 4e653e048..f3c777e 100644
--- a/drivers/iommu/intel-iommu-kdump.c
+++ b/drivers/iommu/intel-iommu-kdump.c
@@ -53,6 +53,45 @@ static struct list_head *domain_values_list;
 
 
 /* ========================================================================
+ * Interfaces for when a new domain in the crashdump kernel needs some
+ * values from the panicked kernel's context entries
+ * ------------------------------------------------------------------------
+ */
+
+
+struct domain_values_entry *intel_iommu_did_to_domain_values_entry(int did,
+		struct intel_iommu *iommu)
+{
+	struct domain_values_entry *dve;	/* iterator */
+
+	list_for_each_entry(dve, &domain_values_list[iommu->seq_id], link)
+		if (dve->did == did)
+			return dve;
+	return NULL;
+}
+
+
+/* Mark domain-id's from old kernel as in-use on this iommu so that a new
+ * domain-id is allocated in the case where there is a device in the new kernel
+ * that was not in the old kernel -- and therefore a new domain-id is needed.
+ */
+int intel_iommu_get_dids_from_old_kernel(struct intel_iommu *iommu)
+{
+	struct domain_values_entry *dve;	/* iterator */
+
+	pr_info("IOMMU:%d Domain ids from panicked kernel:\n", iommu->seq_id);
+
+	list_for_each_entry(dve, &domain_values_list[iommu->seq_id], link) {
+		set_bit(dve->did, iommu->domain_ids);
+		pr_info("DID did:%d(0x%4.4x)\n", dve->did, dve->did);
+	}
+
+	pr_info("----------------------------------------\n");
+	return 0;
+}
+
+
+/* ========================================================================
  * Copy iommu translation tables from old kernel into new  kernel.
  * Entry to this set of functions is: intel_iommu_copy_translation_tables()
  * ------------------------------------------------------------------------
-- 
Bill Sumner <bill.sumner@hp.com>



* [PATCH 8/8] iommu/vt-d: Changes to support kdump
  2014-04-25  0:36 [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO Bill Sumner
                   ` (6 preceding siblings ...)
  2014-04-25  0:36 ` [PATCH 7/8] iommu/vt-d: Add domain-id functions to intel-iommu-kdump.c Bill Sumner
@ 2014-04-25  0:36 ` Bill Sumner
  2014-04-30 10:49 ` [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO David Woodhouse
  8 siblings, 0 replies; 21+ messages in thread
From: Bill Sumner @ 2014-04-25  0:36 UTC (permalink / raw)
  To: dwmw2, indou.takao, bhe, joro
  Cc: iommu, kexec, alex.williamson, linux-pci, linux-kernel, ddutile,
	ishii.hironobu, bhelgaas, bill.sumner, doug.hatch, zhenhua,
	jerry.hoemann

 Add "#include <linux/crash_dump.h>" to gain access to is_kdump_kernel()

 Modify the operation of the following functions when called during crash dump:
 device_to_domain_id
 get_domain_for_dev
 init_dmars
 intel_iommu_init

Signed-off-by: Bill Sumner <bill.sumner@hp.com>
---
 drivers/iommu/intel-iommu.c | 133 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 120 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 226c1d0..8b3e509 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -46,6 +46,7 @@
 #include "irq_remapping.h"
 #include "pci.h"
 #include "intel-iommu-private.h"
+#include <linux/crash_dump.h>
 
 static LIST_HEAD(dmar_atsr_units);
 static LIST_HEAD(dmar_rmrr_units);
@@ -53,6 +54,7 @@ static LIST_HEAD(dmar_rmrr_units);
 #define for_each_rmrr_units(rmrr) \
 	list_for_each_entry(rmrr, &dmar_rmrr_units, list)
 
+static int intel_iommu_in_crashdump;
 
 static void __init check_tylersburg_isoch(void);
 
@@ -1796,6 +1798,24 @@ static void domain_remove_dev_info(struct dmar_domain *domain)
 	spin_unlock_irqrestore(&device_domain_lock, flags);
 }
 
+#ifdef CONFIG_CRASH_DUMP
+static int device_to_domain_id(struct intel_iommu *iommu, u8 bus, u8 devfn)
+{
+	int did = -1;			/* domain-id returned */
+	struct root_entry *root;
+	struct context_entry *context;
+	unsigned long flags;
+
+	spin_lock_irqsave(&iommu->lock, flags);
+	root = &iommu->root_entry[bus];
+	context = get_context_addr_from_root(root);
+	if (context && context_present(context+devfn))
+		did = context_get_did(context+devfn);
+	spin_unlock_irqrestore(&iommu->lock, flags);
+	return did;
+}
+#endif /* CONFIG_CRASH_DUMP */
+
 /*
  * find_domain
  * Note: we use struct device->archdata.iommu stores the info
@@ -1880,6 +1900,9 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
 	unsigned long flags;
 	u8 bus, devfn, bridge_bus, bridge_devfn;
 	int did = -1;   /* Default to "no domain_id supplied" */
+#ifdef CONFIG_CRASH_DUMP
+	struct domain_values_entry *dve = NULL;
+#endif /* CONFIG_CRASH_DUMP */
 
 	domain = find_domain(dev);
 	if (domain)
@@ -1920,6 +1943,24 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
 	domain = alloc_domain(false);
 	if (!domain)
 		goto error;
+
+#ifdef CONFIG_CRASH_DUMP
+	if (intel_iommu_in_crashdump) {
+		/*
+		 * if this device had a did in the old kernel
+		 * use its values instead of generating new ones
+		 */
+		did = device_to_domain_id(iommu, bus, devfn);
+		if (did > 0 || (did == 0 && !cap_caching_mode(iommu->cap)))
+			dve = intel_iommu_did_to_domain_values_entry(did,
+								iommu);
+		if (dve)
+			gaw = dve->gaw;
+		else
+			did = -1;
+	}
+#endif /* CONFIG_CRASH_DUMP */
+
 	if (iommu_attach_domain(domain, iommu, did)) {
 		free_domain_mem(domain);
 		domain = NULL;
@@ -1929,6 +1970,18 @@ static struct dmar_domain *get_domain_for_dev(struct device *dev, int gaw)
 	if (domain_init(domain, gaw))
 		goto error;
 
+#ifdef CONFIG_CRASH_DUMP
+	if (intel_iommu_in_crashdump && dve) {
+
+		if (domain->pgd)
+			free_pgtable_page(domain->pgd);
+
+		domain->pgd = dve->pgd;
+
+		copy_reserved_iova(&dve->iovad, &domain->iovad);
+	}
+#endif /* CONFIG_CRASH_DUMP */
+
 	/* register pcie-to-pci device */
 	if (dev_tmp) {
 		domain = dmar_insert_dev_info(iommu, bridge_bus, bridge_devfn,
@@ -2331,6 +2384,10 @@ static int __init init_dmars(void)
 	struct device *dev;
 	struct intel_iommu *iommu;
 	int i, ret;
+#ifdef CONFIG_CRASH_DUMP
+	struct root_entry *root_old_phys;
+	struct root_entry *root_new_virt;
+#endif /* CONFIG_CRASH_DUMP */
 
 	/*
 	 * for each drhd
@@ -2374,16 +2431,40 @@ static int __init init_dmars(void)
 		if (ret)
 			goto free_iommu;
 
-		/*
-		 * TBD:
-		 * we could share the same root & context tables
-		 * among all IOMMU's. Need to Split it later.
-		 */
-		ret = iommu_alloc_root_entry(iommu);
-		if (ret) {
-			printk(KERN_ERR "IOMMU: allocate root entry failed\n");
-			goto free_iommu;
+#ifdef CONFIG_CRASH_DUMP
+		if (intel_iommu_in_crashdump) {
+			pr_info("IOMMU Copying translate tables from panicked kernel\n");
+			ret = intel_iommu_copy_translation_tables(drhd,
+					&root_old_phys, &root_new_virt,
+					g_num_of_iommus);
+			if (ret) {
+				pr_err("IOMMU: Copy translate tables failed\n");
+
+				/* Best to stop trying */
+				intel_iommu_in_crashdump = false;
+				goto error;
+			}
+			iommu->root_entry = root_new_virt;
+			pr_info("IOMMU: root_new_virt:0x%12.12llx phys:0x%12.12llx\n",
+				(u64)root_new_virt,
+				virt_to_phys(root_new_virt));
+			intel_iommu_get_dids_from_old_kernel(iommu);
+		} else {
+#endif /* CONFIG_CRASH_DUMP */
+			/*
+			 * TBD:
+			 * we could share the same root & context tables
+			 * among all IOMMU's. Need to Split it later.
+			 */
+			ret = iommu_alloc_root_entry(iommu);
+			if (ret) {
+				printk(KERN_ERR "IOMMU: allocate root entry failed\n");
+				goto free_iommu;
+			}
+#ifdef CONFIG_CRASH_DUMP
 		}
+#endif /* CONFIG_CRASH_DUMP */
+
 		if (!ecap_pass_through(iommu->ecap))
 			hw_pass_through = 0;
 	}
@@ -2442,6 +2523,16 @@ static int __init init_dmars(void)
 
 	check_tylersburg_isoch();
 
+#ifdef CONFIG_CRASH_DUMP
+	/*
+	 * In the crashdump kernel: Skip setting-up new domains for
+	 * si, rmrr, and the isa bus on the expectation that these
+	 * translations were copied from the old kernel.
+	 */
+	if (intel_iommu_in_crashdump)
+		goto skip_new_domains_for_si_rmrr_isa;
+#endif /* CONFIG_CRASH_DUMP */
+
 	/*
 	 * If pass through is not set or not enabled, setup context entries for
 	 * identity mappings for rmrr, gfx, and isa and may fall back to static
@@ -2482,6 +2573,10 @@ static int __init init_dmars(void)
 
 	iommu_prepare_isa();
 
+#ifdef CONFIG_CRASH_DUMP
+skip_new_domains_for_si_rmrr_isa:;
+#endif /* CONFIG_CRASH_DUMP */
+
 	/*
 	 * for each drhd
 	 *   enable fault log
@@ -3624,12 +3719,24 @@ int __init intel_iommu_init(void)
 		goto out_free_dmar;
 	}
 
+#ifdef CONFIG_CRASH_DUMP
 	/*
-	 * Disable translation if already enabled prior to OS handover.
+	 * If (This is the crash kernel)
+	 *    Set: copy iommu translate tables from old kernel
+	 *    Skip disabling the iommu hardware translations
 	 */
-	for_each_active_iommu(iommu, drhd)
-		if (iommu->gcmd & DMA_GCMD_TE)
-			iommu_disable_translation(iommu);
+	if (is_kdump_kernel()) {
+		intel_iommu_in_crashdump = true;
+		pr_info("IOMMU intel_iommu_in_crashdump = true\n");
+		pr_info("IOMMU Skip disabling iommu hardware translations\n");
+	} else
+#endif /* CONFIG_CRASH_DUMP */
+		/*
+		 * Disable translation if already enabled prior to OS handover.
+		 */
+		for_each_active_iommu(iommu, drhd)
+			if (iommu->gcmd & DMA_GCMD_TE)
+				iommu_disable_translation(iommu);
 
 	if (dmar_dev_scope_init() < 0) {
 		if (force_on)
-- 
Bill Sumner <bill.sumner@hp.com>



* Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
  2014-04-25  0:36 [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO Bill Sumner
                   ` (7 preceding siblings ...)
  2014-04-25  0:36 ` [PATCH 8/8] iommu/vt-d: Changes to support kdump Bill Sumner
@ 2014-04-30 10:49 ` David Woodhouse
  2014-05-02 20:13   ` Jerry Hoemann
                     ` (2 more replies)
  8 siblings, 3 replies; 21+ messages in thread
From: David Woodhouse @ 2014-04-30 10:49 UTC (permalink / raw)
  To: Bill Sumner, Hoemann, Jerry
  Cc: indou.takao, bhe, joro, linux-pci, kexec, linux-kernel, iommu,
	doug.hatch, ishii.hironobu, bhelgaas, zhenhua


On Thu, 2014-04-24 at 18:36 -0600, Bill Sumner wrote:
> 
> This patch set modifies the behavior of the Intel iommu in the crashdump kernel: 
> 1. to accept the iommu hardware in an active state,
> 2. to leave the current translations in-place so that legacy DMA will continue
>    using its current buffers until the device drivers in the crashdump kernel
>    initialize and initialize their devices,
> 3. to use different portions of the iova address ranges for the device drivers
>    in the crashdump kernel than the iova ranges that were in-use at the time
>    of the panic.  

There could be all kinds of existing mappings in the DMA page tables,
and I'm not sure it's safe to preserve them. What prevents the crashdump
kernel from trying to use any of the physical pages which are
accessible, and which could thus be corrupted by stray DMA?

In fact, the old kernel could even have set up 1:1 passthrough mappings
for some devices, which would then be able to DMA *anywhere*. Surely we
need to prevent that?

After the last round of this patchset, we discussed a potential
improvement where you point every virtual bus address at the *same*
physical scratch page.

That way, we allow the "rogue" DMA to continue to the same virtual bus
addresses, but it can only ever affect one piece of physical memory and
can't have detrimental effects elsewhere.
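
A minimal sketch of what that remapping might look like, reusing the
dma_pte bits from patch 6/8 (illustrative only; "scratch_phys" is a
hypothetical scratch page allocated once per iommu):

	/* Point every present entry of one leaf page table at a
	 * single scratch page.
	 */
	static void remap_to_scratch(struct dma_pte *pgt_virt, u64 scratch_phys)
	{
		struct dma_pte *p;
		int u;

		for (u = 0, p = pgt_virt; u < 512; u++, p++) {
			if (!(p->val & (DMA_PTE_READ | DMA_PTE_WRITE)))
				continue;	/* not present, leave it */
			p->val &= ~VTD_PAGE_MASK;
			p->val |= (scratch_phys & VTD_PAGE_MASK);
		}
	}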

Was that option considered and discounted for some reason? It seems like
it would make sense...

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation



* Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
  2014-04-30 10:49 ` [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO David Woodhouse
@ 2014-05-02 20:13   ` Jerry Hoemann
  2014-05-07 18:25   ` Jerry Hoemann
  2014-07-02 13:32   ` Joerg Roedel
  2 siblings, 0 replies; 21+ messages in thread
From: Jerry Hoemann @ 2014-05-02 20:13 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Bill Sumner, indou.takao, bhe, joro, linux-pci, kexec,
	linux-kernel, iommu, doug.hatch, ishii.hironobu, bhelgaas,
	zhenhua

On Wed, Apr 30, 2014 at 11:49:33AM +0100, David Woodhouse wrote:

Hi David,

As you may know, Bill has retired and I am picking up this work.
I am still coming up to speed in this area so my goal is to understand
your concerns and research them as I dig through code and specs.

My apologies for the delay in replying and for our missing your
earlier questions.


> On Thu, 2014-04-24 at 18:36 -0600, Bill Sumner wrote:
> > 
> > This patch set modifies the behavior of the Intel iommu in the crashdump kernel: 
> > 1. to accept the iommu hardware in an active state,
> > 2. to leave the current translations in-place so that legacy DMA will continue
> >    using its current buffers until the device drivers in the crashdump kernel
> >    initialize and initialize their devices,
> > 3. to use different portions of the iova address ranges for the device drivers
> >    in the crashdump kernel than the iova ranges that were in-use at the time
> >    of the panic.  
> 
> There could be all kinds of existing mappings in the DMA page tables,
> and I'm not sure it's safe to preserve them. What prevents the crashdump
> kernel from trying to use any of the physical pages which are
> accessible, and which could thus be corrupted by stray DMA?

In kdump, we switch to, and execute from, the capture kernel (a.k.a. the
2nd kernel or crash kernel).  This is a separate, distinct instance of
Linux.  One of the intents of this switch is (quoting kdump.txt):

    "This ensures that ongoing Direct Memory Access
    (DMA) from the system kernel does not corrupt the dump-capture kernel.
    The kexec -p command loads the dump-capture kernel into this reserved
    memory."

As the capture kernel's memory is reserved early in boot, we shouldn't
have DMA targeted at it once the capture kernel is loaded.
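
For example (values illustrative only), the system kernel might boot with
"crashkernel=256M" to reserve that region, and the capture kernel is then
loaded into it with something like:

	kexec -p /boot/vmlinuz-kdump --initrd=/boot/initrd-kdump \
		--append="root=/dev/sda1 irqpoll maxcpus=1 reset_devices"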

Now, the capture kernel will try to access 1st-kernel memory via /proc/vmcore
after it boots and runs makedumpfile.  Is it this access that you
are concerned with?



> 
> In fact, the old kernel could even have set up 1:1 passthrough mappings
> for some devices, which would then be able to DMA *anywhere*. Surely we
> need to prevent that?

From prior patch-version comments, I know Bill was aware of the
issue of pass-through, but I don't know to what extent he tested with
the feature enabled.  E.g., in the January and prior versions he stated he
had not tested with pass-through.  He subsequently dropped this statement.

The approach of the patch is to just allow the outstanding DMA
to complete.  Assuming the targeted address of the pass-through
was sane, does this differ greatly from the non-pass-through case?
Now, if the DMA was truly going to random places (like the capture
kernel itself), I'm not sure what we would do.  Suggestions?


> 
> After the last round of this patchset, we discussed a potential
> improvement where you point every virtual bus address at the *same*
> physical scratch page.
> 
> That way, we allow the "rogue" DMA to continue to the same virtual bus
> addresses, but it can only ever affect one piece of physical memory and
> can't have detrimental effects elsewhere.
> 
> Was that option considered and discounted for some reason? It seems like
> it would make sense...

I don't know if this was considered.

I will need time to go through code and the spec to understand
implications better.

Thanks

Jerry


-- 

----------------------------------------------------------------------------
Jerry Hoemann            Software Engineer              Hewlett-Packard

3404 E Harmony Rd. MS 57                        phone:  (970) 898-1022
Ft. Collins, CO 80528                           FAX:    (970) 898-XXXX
                                                email:  jerry.hoemann@hp.com
----------------------------------------------------------------------------



* Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
  2014-04-30 10:49 ` [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO David Woodhouse
  2014-05-02 20:13   ` Jerry Hoemann
@ 2014-05-07 18:25   ` Jerry Hoemann
  2014-07-02 13:32   ` Joerg Roedel
  2 siblings, 0 replies; 21+ messages in thread
From: Jerry Hoemann @ 2014-05-07 18:25 UTC (permalink / raw)
  To: David Woodhouse
  Cc: indou.takao, bhe, joro, linux-pci, kexec, linux-kernel, iommu,
	ishii.hironobu, bhelgaas, zhenhua


David,

I received the following email from Bill Sumner addressing your
earlier email.

Jerry

On Wed, 2014-04-30, David Woodhouse wrote:
Addressing a portion of the last question first:
>Was that option considered and discounted for some reason? It seems like
>it would make sense.
Considered?
   Yes. It is an interesting idea. See technical discussion below.

Discounted for some reason?
   Not really, only a lack of time.  With the limited time available,
   I focused on providing a clean set of patches on a recent baseline.

>On Thu, 2014-04-24 at 18:36 -0600, Bill Sumner wrote:
>>
>> This patch set modifies the behavior of the Intel iommu in the crashdump kernel:
>> 1. to accept the iommu hardware in an active state,
>> 2. to leave the current translations in-place so that legacy DMA will continue
>> using its current buffers until the device drivers in the crashdump kernel
>> initialize and initialize their devices,
>> 3. to use different portions of the iova address ranges for the device drivers
>> in the crashdump kernel than the iova ranges that were in-use at the time
>> of the panic.
>
>There could be all kinds of existing mappings in the DMA page tables,
>and I'm not sure it's safe to preserve them.

Actually I think it is safe to preserve them in a great number of cases,
and there is some percentage of cases where something else will work better. 

Fortunately, history shows that the panicked kernel generates bad
DMA rarely enough that many crashdumps on systems
without iommu hardware still succeed by simply allowing the DMA/IO
to continue into the existing buffers. The patch set uses the same
technique when the iommu is active. So I think the odds start
out in our favor -- enough in our favor that we should implement
the current patch set into Linux and then begin work to improve it. 

Since the patch set is currently available, and its technique has already been
somewhat tested by three companies, the remaining path for including it
into Linux is short.  This would significantly improve the crashdump success
rate when the Intel iommu is active.  It would also provide a foundation for
investigations into more advanced techniques that would further increase
the success rate.

Bad translations do still happen -- bad programming, tables getting hosed,
etc.  We would like to find a way to get a good dump in this extra percentage
of cases. It sounds like you have some good ideas in this area.  

The iommu hardware guarantees that DMA/IO can only access the memory
areas described in the DMA page tables, so we can be quite sure that the
only physical memory areas any DMA/IO can access are the ones described
by the contents of the translation tables.  This comes with a few caveats:
1. hw-passthru and the 'si' domain -- see discussion under a later question. 
   The problem with these is that they allow DMA/IO access to all of
   physical memory.
2. The iommu hardware will use valid translation entries to the "wrong" place
   just as readily as it will use ones to the "right" place.  Having the
   kdump kernel check-out the memory areas described by the tables before
   using them seems like a good idea.  For instance: any DMA buffers from the
   panicked kernel that point into the kdump kernel's area would be highly
   suspect.
3. Random writes into the translation tables may well write known-bad values
   into fields or reserved areas -- values which will cause the iommu
   hardware to reject that entry.  Unfortunately we cannot count on this
   happening, but it felt like a bright-spot worth mentioning.  

>What prevents the crashdump
>kernel from trying to use any of the physical pages which are
>accessible, and which could thus be corrupted by stray DMA?

DMA into the kdump area will corrupt the kdump and cause loss of the dump.  
Note that this was the original problem with disabling the iommu at the
beginning of the kdump kernel which forced DMA that was going to 
its original (good) buffers to begin going into essentially random
places -- almost all of them "wrong".

However, I believe that the kdump kernel itself will not be the problem
for the following reasons:

As I understand the kdump architecture, the kdump kernel is restricted
to the physical memory area that was reserved for it by the platform
kernel during its initialization (when the platform kernel presumably was
still healthy.) The kdump kernel is assumed to be clean and healthy,
so it will not be attempting to use any memory outside of what it is
assigned -- except for reading pages of the panicked kernel in order to
write them to the dump file.

Assuming that the DMA page tables were checked to ensure that no DMA page
table points into the kdump kernel's reserved area, no stray DMA/IO will
affect the kdump kernel.
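
A sketch of such a check, hooked into the copy_page_addr() walk from
patch 6/8 (crashk_res is the kernel's existing reserved-region resource;
the helper itself is illustrative, not part of the posted series):

	/* True if a copied mapping stays clear of the kdump kernel's
	 * reserved memory region.
	 */
	static bool old_mapping_is_safe(u64 phys, u64 size)
	{
		return (phys > crashk_res.end) ||
		       (phys + size - 1 < crashk_res.start);
	}

Any entry that fails this test could be marked not-present (or pointed
at a scratch page) before the copied tables are handed to the hardware.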

>
>In fact, the old kernel could even have set up 1:1 passthrough mappings
>for some devices, which would then be able to DMA *anywhere*. Surely we
>need to prevent that?
Yes, I agree. 
The 1:1 passthrough mappings seem to be problematic -- both the
use of hw-passthrough by the iommu and the 'si' domain set up in the DMA
page tables.  These mappings completely bypass one of the basic reasons
for using the iommu hardware -- to restrict DMA access to known-safe
areas of memory.

I would prefer that Linux not use either of these mechanisms unless
it is absolutely necessary -- in which case it could be explicitly
enabled.  After all, there are probably still some (hopefully few) devices
that absolutely require it. Also, there may be circumstances where a
performance gain outweighs the additional risk to the crashdump.

If the kdump kernel finds a 1:1 passthrough domain among the DMA page tables,
the real issue comes if we also need that device for taking the crashdump.
If we do not need it, then pointing all of that device's IOVAs at a safe
buffer -- as you recommend -- looks like a good solution.

If kdump does need it, I can think of two ways to handle things:
1. Just leave it. This is what happens when there is no hardware iommu
   active, and this has worked OK there for a long time.  This option
   clearly depends upon the 1:1 passthrough device not being the problem.
   This is also what my patches do, since they are modeled on handling
   the DMA buffers in the same manner as when there is no iommu active.

2. As you suggest, create a safe buffer and force all of this device's 
   IOVAs into it.  Then begin mapping real buffers when the kdump kernel
   begins using the device.

>
>After the last round of this patchset, we discussed a potential
>improvement where you point every virtual bus address at the *same*
>physical scratch page.
>
>That way, we allow the "rogue" DMA to continue to the same virtual bus
>addresses, but it can only ever affect one piece of physical memory and
>can't have detrimental effects elsewhere.

Just a few technical observations and questions that hopefully will help
implement this enhancement:

Since each device may eventually be used by the kdump kernel, each
device will need its own domain-id and its own set of DMA page tables
so that the IOVAs requested by the kdump kernel can map that device's
IOVAs to that device's buffers.

As IO devices have grown smarter, many of them, particularly NICs and
storage interfaces, use DMA for work queues and status-reporting
vectors in addition to buffers of data to be transferred.  
Some experimenting and testing may be necessary to determine how
these devices behave when the translation for the work queue is
switched to a safe-buffer which does not contain valid entries for
that device.  

Questions that came to mind as I thought about this proposal:
1. Does the iommu need to know when the device driver has reset the device
   and that it is safe to add translations to the DMA page tables?

2. If it needs to know, how does it know, since the device driver asking
   for an IOVA via the DMA subsystem is usually the first indication to
   the iommu driver about the device and this may not guarantee that the
   device driver has already reset the device at that point?

3. For any given device, which IOVAs will be mapped to the safe buffer ?
   a. Only the IOVAs active at the time of the panic, which would require
      scanning the existing DMA page tables to find them.
   b. All possible IOVAs ? This would seem to be a very large number of
      pages for the page tables -- especially since each device may need
      its own set of DMA page tables.  There could still be only one
      "safe data buffer" with a lot of page tables pointing to it.
   c. Determine these "on the fly" by capturing DMAR faults or some similar
      mechanism ?
   d. Other possibilities ?


>
>Was that option considered and discounted for some reason? It seems like
>it would make sense.



--
Bill Sumner


Forwarded by Jerry Hoemann

-- 

----------------------------------------------------------------------------
Jerry Hoemann            Software Engineer              Hewlett-Packard

3404 E Harmony Rd. MS 57                        phone:  (970) 898-1022
Ft. Collins, CO 80528                           FAX:    (970) 898-XXXX
                                                email:  jerry.hoemann@hp.com
----------------------------------------------------------------------------



* Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
  2014-04-30 10:49 ` [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO David Woodhouse
  2014-05-02 20:13   ` Jerry Hoemann
  2014-05-07 18:25   ` Jerry Hoemann
@ 2014-07-02 13:32   ` Joerg Roedel
  2014-07-11 16:27     ` Jerry Hoemann
  2014-10-22  2:16     ` Bjorn Helgaas
  2 siblings, 2 replies; 21+ messages in thread
From: Joerg Roedel @ 2014-07-02 13:32 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Bill Sumner, Hoemann, Jerry, indou.takao, bhe, linux-pci, kexec,
	linux-kernel, iommu, doug.hatch, ishii.hironobu, bhelgaas,
	zhenhua

Hi David,

On Wed, Apr 30, 2014 at 11:49:33AM +0100, David Woodhouse wrote:
> There could be all kinds of existing mappings in the DMA page tables,
> and I'm not sure it's safe to preserve them. What prevents the crashdump
> kernel from trying to use any of the physical pages which are
> accessible, and which could thus be corrupted by stray DMA?
> 
> In fact, the old kernel could even have set up 1:1 passthrough mappings
> for some devices, which would then be able to DMA *anywhere*. Surely we
> need to prevent that?

Ideally we would prevent that, yes. But the problem is that a failed DMA
transaction might put the device into an unrecoverable state. Usually
any in-flight DMA transactions should only target buffers set up by the
previous kernel and not corrupt any data.

> After the last round of this patchset, we discussed a potential
> improvement where you point every virtual bus address at the *same*
> physical scratch page.

That is a solution to prevent the in-flight DMA failures. But what
happens when there is some in-flight DMA to a disk to write some inodes
or a new superblock. Then this scratch address-space may cause
filesystem corruption at worst.

So with this in mind I would prefer initially taking over the
page-tables from the old kernel before the device drivers re-initialize
the devices.


	Joerg




* Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
  2014-07-02 13:32   ` Joerg Roedel
@ 2014-07-11 16:27     ` Jerry Hoemann
  2014-10-15  8:10       ` Li, ZhenHua
  2014-10-22  2:16     ` Bjorn Helgaas
  1 sibling, 1 reply; 21+ messages in thread
From: Jerry Hoemann @ 2014-07-11 16:27 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: David Woodhouse, indou.takao, bhe, linux-pci, kexec,
	linux-kernel, iommu, ishii.hironobu, bhelgaas, zhenhua

On Wed, Jul 02, 2014 at 03:32:59PM +0200, Joerg Roedel wrote:
> Hi David,
> 
> On Wed, Apr 30, 2014 at 11:49:33AM +0100, David Woodhouse wrote:
> > There could be all kinds of existing mappings in the DMA page tables,
> > and I'm not sure it's safe to preserve them. What prevents the crashdump
> > kernel from trying to use any of the physical pages which are
> > accessible, and which could thus be corrupted by stray DMA?
> > 
> > In fact, the old kernel could even have set up 1:1 passthrough mappings
> > for some devices, which would then be able to DMA *anywhere*. Surely we
> > need to prevent that?
> 
> Ideally we would prevent that, yes. But the problem is that a failed DMA
> transaction might put the device into an unrecoverable state. Usually
> any in-flight DMA transactions should only target buffers set up by the
> previous kernel and not corrupt any data.
> 
> > After the last round of this patchset, we discussed a potential
> > improvement where you point every virtual bus address at the *same*
> > physical scratch page.
> 
> That is a solution to prevent the in-flight DMA failures. But what
> happens when there is some in-flight DMA to a disk to write some inodes
> or a new superblock. Then this scratch address-space may cause
> filesystem corruption at worst.
> 
> So with this in mind I would prefer initially taking over the
> page-tables from the old kernel before the device drivers re-initialize
> the devices.
> 
> 
> 	Joerg

David, Joerg,

What do you think here?  Do you want me to update the patch set for 3.17?

Jerry

-- 

----------------------------------------------------------------------------
Jerry Hoemann            Software Engineer              Hewlett-Packard

3404 E Harmony Rd. MS 57                        phone:  (970) 898-1022
Ft. Collins, CO 80528                           FAX:    (970) 898-XXXX
                                                email:  jerry.hoemann@hp.com
----------------------------------------------------------------------------



* Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
  2014-07-11 16:27     ` Jerry Hoemann
@ 2014-10-15  8:10       ` Li, ZhenHua
  2014-10-15  8:45         ` Li, ZhenHua
  0 siblings, 1 reply; 21+ messages in thread
From: Li, ZhenHua @ 2014-10-15  8:10 UTC (permalink / raw)
  To: Joerg Roedel, David Woodhouse
  Cc: Jerry.Hoemann, indou.takao, bhe, linux-pci, kexec, linux-kernel,
	iommu, ishii.hironobu, bhelgaas, zhenhua

David, Joerg,
I plan to merge this patch set with the 3.17 stable kernel, and split it
into two:
1. The core part, including the changed functions, like [Patch 4/8] and
[Patch 8/8].
2. The formatting part, like [Patch 1/8] and [Patch 3/8], including the
code-formatting changes and the creation of the new files
intel-iommu-kdump.c and intel-iommu-private.h.

I believe this will make the patch set clearer to read and understand.

What are your suggestions?

Thanks
Zhenhua


On 07/12/2014 12:27 AM, Jerry Hoemann wrote:
> On Wed, Jul 02, 2014 at 03:32:59PM +0200, Joerg Roedel wrote:
>> Hi David,
>>
>> On Wed, Apr 30, 2014 at 11:49:33AM +0100, David Woodhouse wrote:
>>> There could be all kinds of existing mappings in the DMA page tables,
>>> and I'm not sure it's safe to preserve them. What prevents the crashdump
>>> kernel from trying to use any of the physical pages which are
>>> accessible, and which could thus be corrupted by stray DMA?
>>>
>>> In fact, the old kernel could even have set up 1:1 passthrough mappings
>>> for some devices, which would then be able to DMA *anywhere*. Surely we
>>> need to prevent that?
>>
>> Ideally we would prevent that, yes. But the problem is that a failed DMA
>> transaction might put the device into an unrecoverable state. Usually
>> any in-flight DMA transactions should only target buffers set up by the
>> previous kernel and not corrupt any data.
>>
>>> After the last round of this patchset, we discussed a potential
>>> improvement where you point every virtual bus address at the *same*
>>> physical scratch page.
>>
>> That is a solution to prevent the in-flight DMA failures. But what
>> happens when there is some in-flight DMA to a disk to write some inodes
>> or a new superblock. Then this scratch address-space may cause
>> filesystem corruption at worst.
>>
>> So with this in mind I would prefer initially taking over the
>> page-tables from the old kernel before the device drivers re-initialize
>> the devices.
>>
>>
>> 	Joerg
>
> David, Joerg,
>
> What do you think here?  Do you want me to update the patch set for 3.17?
>
> Jerry
>



* Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
  2014-10-15  8:10       ` Li, ZhenHua
@ 2014-10-15  8:45         ` Li, ZhenHua
  0 siblings, 0 replies; 21+ messages in thread
From: Li, ZhenHua @ 2014-10-15  8:45 UTC (permalink / raw)
  To: Joerg Roedel, David Woodhouse
  Cc: Jerry.Hoemann, indou.takao, bhe, linux-pci, kexec, linux-kernel,
	iommu, ishii.hironobu, bhelgaas, zhenhua, Vaden,
	Tom L (HP Server OS Architecture)

Add Tom to CC list.
On 10/15/2014 04:10 PM, Li, ZhenHua wrote:
> David, Joerg,
> I plan to merge this patch set with the 3.17 stable kernel, and split it
> into two:
> 1. The core part, including the changed functions, like [Patch 4/8] and
> [Patch 8/8].
> 2. The formatting part, like [Patch 1/8] and [Patch 3/8], including the
> code-formatting changes and the creation of the new files
> intel-iommu-kdump.c and intel-iommu-private.h.
>
> I believe this will make the patch set clearer to read and understand.
>
> What are your suggestions?
>
> Thanks
> Zhenhua
>
>
> On 07/12/2014 12:27 AM, Jerry Hoemann wrote:
>> On Wed, Jul 02, 2014 at 03:32:59PM +0200, Joerg Roedel wrote:
>>> Hi David,
>>>
>>> On Wed, Apr 30, 2014 at 11:49:33AM +0100, David Woodhouse wrote:
>>>> There could be all kinds of existing mappings in the DMA page tables,
>>>> and I'm not sure it's safe to preserve them. What prevents the
>>>> crashdump
>>>> kernel from trying to use any of the physical pages which are
>>>> accessible, and which could thus be corrupted by stray DMA?
>>>>
>>>> In fact, the old kernel could even have set up 1:1 passthrough mappings
>>>> for some devices, which would then be able to DMA *anywhere*. Surely we
>>>> need to prevent that?
>>>
>>> Ideally we would prevent that, yes. But the problem is that a failed DMA
>>> transaction might put the device into an unrecoverable state. Usually
>>> any in-flight DMA transactions should only target buffers set up by the
>>> previous kernel and not corrupt any data.
>>>
>>>> After the last round of this patchset, we discussed a potential
>>>> improvement where you point every virtual bus address at the *same*
>>>> physical scratch page.
>>>
>>> That is a solution to prevent the in-flight DMA failures. But what
>>> happens when there is some in-flight DMA to a disk to write some inodes
>>> or a new superblock. Then this scratch address-space may cause
>>> filesystem corruption at worst.
>>>
>>> So with this in mind I would prefer initially taking over the
>>> page-tables from the old kernel before the device drivers re-initialize
>>> the devices.
>>>
>>>
>>>     Joerg
>>
>> David, Joerg,
>>
>> What do you think here?  Do you want me to update the patch set for 3.17?
>>
>> Jerry
>>
>



* Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
  2014-07-02 13:32   ` Joerg Roedel
  2014-07-11 16:27     ` Jerry Hoemann
@ 2014-10-22  2:16     ` Bjorn Helgaas
  2014-10-22  2:47       ` Eric W. Biederman
  2014-10-22 13:21       ` Joerg Roedel
  1 sibling, 2 replies; 21+ messages in thread
From: Bjorn Helgaas @ 2014-10-22  2:16 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: David Woodhouse, Hoemann, Jerry, Takao Indoh, Baoquan He,
	linux-pci, kexec, linux-kernel, open list:INTEL IOMMU (VT-d),
	doug.hatch, ishii.hironobu, zhenhua, Zhen-Hua, Eric W. Biederman,
	Vaden, Tom L (HP Server OS Architecture)

[-cc Bill, +cc Zhen-Hua, Eric, Tom, Jerry]

Hi Joerg,

I was looking at Zhen-Hua's recent patches, trying to figure out if I
need to do anything with them.  Resetting devices in the old kernel
seems like a non-starter.  Resetting devices in the new kernel, ...,
well, maybe.  It seems ugly, and it seems like the sort of problem
that IOMMUs are designed to solve.  Anyway, I found this old
discussion that I didn't quite understand:

On Wed, Jul 2, 2014 at 7:32 AM, Joerg Roedel <joro@8bytes.org> wrote:
> On Wed, Apr 30, 2014 at 11:49:33AM +0100, David Woodhouse wrote:

>> After the last round of this patchset, we discussed a potential
>> improvement where you point every virtual bus address at the *same*
>> physical scratch page.
>
> That is a solution to prevent the in-flight DMA failures. But what
> happens when there is some in-flight DMA to a disk to write some inodes
> or a new superblock. Then this scratch address-space may cause
> filesystem corruption at worst.

This in-flight DMA is from a device programmed by the old kernel, and
it would be reading data from the old kernel's buffers.  I think
you're suggesting that we might want that DMA read to complete so the
device can update filesystem metadata?

I don't really understand that argument.  Don't we usually want to
stop any data from escaping the machine after a crash, on the theory
that the old kernel is crashing because something is catastrophically
wrong and we may have already corrupted things in memory?  If so,
allowing this old DMA to complete is just as likely to make things
worse as to make them better.

Without kdump, we likely would reboot through the BIOS and the device
would get reset and the DMA would never happen at all.  So if we made
the dump kernel program the IOMMU to prevent the DMA, that seems like
a similar situation.

> So with this in mind I would prefer initially taking over the
> page-tables from the old kernel before the device drivers re-initialize
> the devices.

This makes the dump kernel more dependent on data from the old kernel,
which we obviously want to avoid when possible.

I didn't find the previous discussion where pointing every virtual bus
address at the same physical scratch page was proposed.  Why was that
better than programming the IOMMU to reject every DMA?

Bjorn


* Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
  2014-10-22  2:16     ` Bjorn Helgaas
@ 2014-10-22  2:47       ` Eric W. Biederman
  2014-10-22  3:08         ` Li, ZhenHua
  2014-10-22 13:21       ` Joerg Roedel
  1 sibling, 1 reply; 21+ messages in thread
From: Eric W. Biederman @ 2014-10-22  2:47 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Joerg Roedel, David Woodhouse, Hoemann, Jerry, Takao Indoh,
	Baoquan He, linux-pci, kexec, linux-kernel,
	open list:INTEL IOMMU (VT-d),
	doug.hatch, ishii.hironobu, zhenhua, Zhen-Hua, Vaden,
	Tom L (HP Server OS Architecture)

Bjorn Helgaas <bhelgaas@google.com> writes:

> [-cc Bill, +cc Zhen-Hua, Eric, Tom, Jerry]
>
> Hi Joerg,
>
> I was looking at Zhen-Hua's recent patches, trying to figure out if I
> need to do anything with them.  Resetting devices in the old kernel
> seems like a non-starter.  Resetting devices in the new kernel, ...,
> well, maybe.  It seems ugly, and it seems like the sort of problem
> that IOMMUs are designed to solve.  Anyway, I found this old
> discussion that I didn't quite understand:

For context here is the kexec on panic design, and what I know from
previous rounds of similar conversations.

The way kexec on panic aka kdump is designed to work is that the
recovery kernel lives in a piece of memory reserved at boot time and
known not to be in use by any driver (because we never ever use it for
DMA).  If DMAs continue from any source, the old kernel may be a little
more corrupted, but our currently running kernel should not be.

Device drivers that we use in the recovery kernel are required to be
able to initialize their devices from an arbitrary state or fail to
initialize their devices.

We have discussed things on various occasions, but IOMMUs all have their
own individual idiosyncrasies and came late to the party, so it
is hard to generalize.

The reserved region is generally low enough in memory that simply
not using IOMMUs works.

The major challenge with initializing an IOMMU would be that there are
potentially devices with on-going DMA sessions whose driver is not
loaded in the recovery kernel (perhaps a NIC responding to a network
packet).

Which essentially means that if you are going to use an IOMMU slot in a
recovery kernel, you either have to know that IOMMU slot was reserved for
the recovery kernel (which has always felt like the easiest way to me),
or you have to know that everything that could target that IOMMU slot has
been reset or has its driver loaded.

I have always thought the simplest and easiest solution would be to
reserve a few IOMMU slots for the kexec on panic kernel.  But if folks
can find other ways to guarantee that an on-going DMA isn't targeting
an IOMMU slot (such as resetting everything downstream from that
IOMMU slot), more power to you.
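
One way to read "reserve a few IOMMU slots" is to reserve a small range
of domain-ids at boot, mirroring the set_bit() use in patch 7/8; the
constant and the hook point below are invented for illustration:

	#define KDUMP_RESERVED_DIDS	8	/* illustrative count */

	/* Mark a range of domain-ids as in-use in the system kernel so
	 * that only the recovery kernel ever allocates from it.
	 */
	static void reserve_kdump_domain_ids(struct intel_iommu *iommu)
	{
		int did;

		for (did = 1; did <= KDUMP_RESERVED_DIDS; did++)
			set_bit(did, iommu->domain_ids);
	}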

> On Wed, Jul 2, 2014 at 7:32 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> On Wed, Apr 30, 2014 at 11:49:33AM +0100, David Woodhouse wrote:
>
>>> After the last round of this patchset, we discussed a potential
>>> improvement where you point every virtual bus address at the *same*
>>> physical scratch page.
>>
>> That is a solution to prevent the in-flight DMA failures. But what
>> happens when there is some in-flight DMA to a disk to write some inodes
>> or a new superblock. Then this scratch address-space may cause
>> filesystem corruption at worst.
>
> This in-flight DMA is from a device programmed by the old kernel, and
> it would be reading data from the old kernel's buffers.  I think
> you're suggesting that we might want that DMA read to complete so the
> device can update filesystem metadata?
>
> I don't really understand that argument.  Don't we usually want to
> stop any data from escaping the machine after a crash, on the theory
> that the old kernel is crashing because something is catastrophically
> wrong and we may have already corrupted things in memory?  If so,
> allowing this old DMA to complete is just as likely to make things
> worse as to make them better.
>
> Without kdump, we likely would reboot through the BIOS and the device
> would get reset and the DMA would never happen at all.  So if we made
> the dump kernel program the IOMMU to prevent the DMA, that seems like
> a similar situation.
>
>> So with this in mind I would prefer initially taking over the
>> page-tables from the old kernel before the device drivers re-initialize
>> the devices.
>
> This makes the dump kernel more dependent on data from the old kernel,
> which we obviously want to avoid when possible.
>
> I didn't find the previous discussion where pointing every virtual bus
> address at the same physical scratch page was proposed.  Why was that
> better than programming the IOMMU to reject every DMA?
>
> Bjorn

Eric



* Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
  2014-10-22  2:47       ` Eric W. Biederman
@ 2014-10-22  3:08         ` Li, ZhenHua
  0 siblings, 0 replies; 21+ messages in thread
From: Li, ZhenHua @ 2014-10-22  3:08 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Bjorn Helgaas, Joerg Roedel, David Woodhouse, Hoemann, Jerry,
	Takao Indoh, Baoquan He, linux-pci, kexec, linux-kernel,
	open list:INTEL IOMMU (VT-d),
	doug.hatch, ishii.hironobu, zhenhua, Vaden,
	Tom L (HP Server OS Architecture)

I need more time to read and think about these mails. I just want to
clarify one thing: Bill has left HP, and I have now inherited his work.
That's why I sent an update of his patch:
	https://lkml.org/lkml/2014/10/21/134

On 10/22/2014 10:47 AM, Eric W. Biederman wrote:
> Bjorn Helgaas <bhelgaas@google.com> writes:
>
>> [-cc Bill, +cc Zhen-Hua, Eric, Tom, Jerry]
>>
>> Hi Joerg,
>>
>> I was looking at Zhen-Hua's recent patches, trying to figure out if I
>> need to do anything with them.  Resetting devices in the old kernel
>> seems like a non-starter.  Resetting devices in the new kernel, ...,
>> well, maybe.  It seems ugly, and it seems like the sort of problem
>> that IOMMUs are designed to solve.  Anyway, I found this old
>> discussion that I didn't quite understand:
>
> For context here is the kexec on panic design, and what I know from
> previous rounds of similar conversations.
>
> The way kexec on panic aka kdump is designed to work is that the
> recovery kernel lives in a piece of memory reserved at boot time and
> known not to be in use by any driver (because we never ever use it for
> DMA).  If DMAs continue from any source, the old kernel may be a little
> more corrupted, but our currently running kernel should not be.
>
> Device drivers that we use in the recovery kernel are required to be
> able to initialize their devices from an arbitrary state or fail to
> initialize their devices.
>
> We have discussed things on various occasions, but IOMMUs all have their
> own individual idiosyncrasies and came late to the party, so it
> is hard to generalize.
>
> The reserved region is generally low enough in memory that simply
> not using IOMMUs works.
>
> The major challenge with initializing an IOMMU would be that there are
> potentially devices with on-going DMA sessions whose driver is not
> loaded in the recovery kernel (perhaps a NIC responding to a network
> packet).
>
> Which essentially means that if you are going to use an IOMMU slot in a
> recovery kernel, you either have to know that IOMMU slot was reserved for
> the recovery kernel (which has always felt like the easiest way to me),
> or you have to know that everything that could target that IOMMU slot has
> been reset or has its driver loaded.
>
> I have always thought the simplest and easiest solution would be to
> reserve a few IOMMU slots for the kexec on panic kernel.  But if folks
> can find other ways to guarantee that an on-going DMA isn't targeting
> an IOMMU slot (such as resetting everything downstream from that
> IOMMU slot), more power to you.
>
>> On Wed, Jul 2, 2014 at 7:32 AM, Joerg Roedel <joro@8bytes.org> wrote:
>>> On Wed, Apr 30, 2014 at 11:49:33AM +0100, David Woodhouse wrote:
>>
>>>> After the last round of this patchset, we discussed a potential
>>>> improvement where you point every virtual bus address at the *same*
>>>> physical scratch page.
>>>
>>> That is a solution to prevent the in-flight DMA failures. But what
>>> happens when there is some in-flight DMA to a disk to write some inodes
>>> or a new superblock?  Then this scratch address space may cause
>>> filesystem corruption at worst.
>>
>> This in-flight DMA is from a device programmed by the old kernel, and
>> it would be reading data from the old kernel's buffers.  I think
>> you're suggesting that we might want that DMA read to complete so the
>> device can update filesystem metadata?
>>
>> I don't really understand that argument.  Don't we usually want to
>> stop any data from escaping the machine after a crash, on the theory
>> that the old kernel is crashing because something is catastrophically
>> wrong and we may have already corrupted things in memory?  If so,
>> allowing this old DMA to complete is just as likely to make things
>> worse as to make them better.
>>
>> Without kdump, we likely would reboot through the BIOS and the device
>> would get reset and the DMA would never happen at all.  So if we made
>> the dump kernel program the IOMMU to prevent the DMA, that seems like
>> a similar situation.
>>
>>> So with this in mind I would prefer initially taking over the
>>> page-tables from the old kernel before the device drivers re-initialize
>>> the devices.
>>
>> This makes the dump kernel more dependent on data from the old kernel,
>> which we obviously want to avoid when possible.
>>
>> I didn't find the previous discussion where pointing every virtual bus
>> address at the same physical scratch page was proposed.  Why was that
>> better than programming the IOMMU to reject every DMA?
>>
>> Bjorn
>
> Eric
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
  2014-10-22  2:16     ` Bjorn Helgaas
  2014-10-22  2:47       ` Eric W. Biederman
@ 2014-10-22 13:21       ` Joerg Roedel
  2014-10-22 18:26         ` Bjorn Helgaas
  1 sibling, 1 reply; 21+ messages in thread
From: Joerg Roedel @ 2014-10-22 13:21 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: David Woodhouse, Hoemann, Jerry, Takao Indoh, Baoquan He,
	linux-pci, kexec, linux-kernel, open list:INTEL IOMMU (VT-d),
	doug.hatch, ishii.hironobu, zhenhua, Zhen-Hua, Eric W. Biederman,
	Vaden, Tom L (HP Server OS Architecture)

Hi Bjorn,

On Tue, Oct 21, 2014 at 08:16:46PM -0600, Bjorn Helgaas wrote:
> I was looking at Zhen-Hua's recent patches, trying to figure out if I
> need to do anything with them.  Resetting devices in the old kernel
> seems like a non-starter.  Resetting devices in the new kernel, ...,
> well, maybe.  It seems ugly, and it seems like the sort of problem
> that IOMMUs are designed to solve.

Actually, resetting the devices in the kdump kernel would be one of the
better solutions for this problem. Once we have a generic way to stop
all in-flight DMA from the PCI endpoints, we could safely disable and
then re-enable the IOMMU.
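
As a minimal sketch, such a generic reset pass could look like this
(for_each_pci_dev() and pci_reset_function() are the existing PCI
interfaces; whether every endpoint supports a usable reset method is
exactly the open question):

	#include <linux/pci.h>

	/* Best effort: try to reset every PCI function so that no
	 * endpoint still has DMA in flight before the IOMMU is
	 * disabled and re-enabled. */
	static void reset_all_pci_endpoints(void)
	{
		struct pci_dev *pdev = NULL;

		for_each_pci_dev(pdev) {
			if (pci_reset_function(pdev))
				pr_warn("reset failed for %s\n",
					pci_name(pdev));
		}
	}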

> On Wed, Jul 2, 2014 at 7:32 AM, Joerg Roedel <joro@8bytes.org> wrote:
> > That is a solution to prevent the in-flight DMA failures. But what
> > happens when there is some in-flight DMA to a disk to write some inodes
> > or a new superblock?  Then this scratch address space may cause
> > filesystem corruption at worst.
> 
> This in-flight DMA is from a device programmed by the old kernel, and
> it would be reading data from the old kernel's buffers.  I think
> you're suggesting that we might want that DMA read to complete so the
> device can update filesystem metadata?

Well, it is not about updating filesystem metadata. In the kdump kernel
we have these options (a code sketch of option 3 follows the list):

	1) Disable the IOMMU. The problem here is that DMA is now
	   untranslated, so any in-flight DMA might read or write
	   a random location in memory, corrupting the kdump or
	   even the new kexec kernel memory. So this is a non-starter.

	2) Re-program the IOMMU to block all DMA. This is safer, as it
	   does not corrupt any data in memory. But some devices react
	   very poorly to a master abort from the IOMMU, so badly that
	   the driver in the kdump kernel fails to initialize the
	   device. In this case taking the dump itself might fail (and
	   often does, according to reports).

	3) To prevent master aborts like in option (2), David suggested
	   mapping the whole DMA address space to a scratch page. This
	   also prevents system memory corruption and the master
	   aborts. But the problem is that in-flight DMA will now read
	   all zeros. This can corrupt disk and network data; at worst
	   it nulls out the superblocks of your filesystem and you lose
	   all data. So this is not an option either.
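
As a rough illustration of option (3), assuming a hypothetical
domain_map_page() helper (a real implementation would alias page-table
levels rather than loop over every page):

	/*
	 * Map every IOVA of a domain to a single zeroed scratch page.
	 * In-flight DMA then neither master-aborts nor touches real
	 * memory: reads return zeros (until something writes to the
	 * page), and writes are absorbed by the scratch page.
	 */
	static int map_domain_to_scratch(struct dmar_domain *domain,
					 u64 max_addr)
	{
		struct page *scratch = alloc_page(GFP_KERNEL | __GFP_ZERO);
		u64 iova;

		if (!scratch)
			return -ENOMEM;

		for (iova = 0; iova < max_addr; iova += PAGE_SIZE)
			domain_map_page(domain, iova,
					page_to_phys(scratch),
					DMA_PTE_READ | DMA_PTE_WRITE);

		return 0;
	}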

What we currently do in the VT-d driver is a mixture of (1) and (2). The
VT-d driver disables the IOMMU hardware (opening a race window for
memory data corruption), re-initializes it to reject any ongoing DMA
(which causes master aborts for in-flight DMA) and re-enables the IOMMU
hardware.

But this strategy quite often fails in heavy IO environments, and we are
looking into ways to solve the problem, or at least to improve the
current situation as much as we can.

I talked to David about this at LPC, and we came up with basically this
procedure (sketched in code after the list):

	1. If the VT-d driver finds the IOMMU enabled, it reuses its
	   root-context table. This way the IOMMU need not be disabled
	   and re-enabled, eliminating the race we have now.
	   Other data structures like the context entries are copied
	   over from the old kernel.  We basically keep all mappings
	   from the old kernel, allowing any in-flight DMA to succeed.
	   No memory or disk data corruption.
	   The issue here is that we modify data from the old kernel
	   which is about to be dumped. But there are ways to work
	   around that.

	2. When a device driver issues the first dma_map command for a
	   device, we assign a new and empty page table, thus removing
	   all mappings from the old kernel for the device.
	   The rationale is that at this point the device driver should
	   have reset the device to a point where all in-flight DMA is
	   canceled.
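
In code, the two steps would look roughly like the sketch below.
copy_translation_tables(), domain_attach_empty_pgtable() and the
inherited flag are made-up names for illustration; the register
accessors and the TES status bit are the existing VT-d ones:

	/* Step 1: translation is already enabled at kdump boot, so
	 * inherit the old kernel's tables instead of disabling and
	 * re-enabling the IOMMU. */
	static void intel_iommu_inherit(struct intel_iommu *iommu)
	{
		if (readl(iommu->reg + DMAR_GSTS_REG) & DMA_GSTS_TES) {
			u64 old_rt = dmar_readq(iommu->reg + DMAR_RTADDR_REG);

			/* Copy the root and context entries out of the
			 * old kernel's memory, so that we stop writing
			 * to pages that are about to be dumped. */
			copy_translation_tables(iommu, old_rt);
		}
	}

	/* Step 2, in the dma_map path: the first mapping for a device
	 * means its driver has reset it, so the inherited mappings
	 * can now be dropped. */
	if (domain->inherited_from_old_kernel) {
		domain_attach_empty_pgtable(domain);
		domain->inherited_from_old_kernel = false;
	}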

This approach goes in the same direction as Bill Sumner's patch set,
which Li took over. But it does not go as far as keeping the old
mappings while the kdump kernel is still working with the devices
(which might introduce new issues and corner cases).

> > So with this in mind I would prefer initially taking over the
> > page-tables from the old kernel before the device drivers re-initialize
> > the devices.
> 
> This makes the dump kernel more dependent on data from the old kernel,
> which we obviously want to avoid when possible.

Sure, but this is not really possible here (unless we have a generic and
reliable way to reset all PCI endpoint devices and cancel all in-flight
DMA before we disable the IOMMU in the kdump kernel).
Otherwise we always risk data corruption somewhere, in system memory or
on disk.

> I didn't find the previous discussion where pointing every virtual bus
> address at the same physical scratch page was proposed.  Why was that
> better than programming the IOMMU to reject every DMA?

As I said, the problem is that this causes master aborts which some
devices can't recover from, so that the device driver in the kdump
kernel fails to initialize the device.


	Joerg


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO
  2014-10-22 13:21       ` Joerg Roedel
@ 2014-10-22 18:26         ` Bjorn Helgaas
  0 siblings, 0 replies; 21+ messages in thread
From: Bjorn Helgaas @ 2014-10-22 18:26 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: David Woodhouse, Hoemann, Jerry, Takao Indoh, Baoquan He,
	linux-pci, kexec, linux-kernel, open list:INTEL IOMMU (VT-d),
	doug.hatch, ishii.hironobu, zhenhua, Zhen-Hua, Eric W. Biederman,
	Vaden, Tom L (HP Server OS Architecture)

On Wed, Oct 22, 2014 at 7:21 AM, Joerg Roedel <joro@8bytes.org> wrote:
> Hi Bjorn,
>
> On Tue, Oct 21, 2014 at 08:16:46PM -0600, Bjorn Helgaas wrote:
>> I was looking at Zhen-Hua's recent patches, trying to figure out if I
>> need to do anything with them.  Resetting devices in the old kernel
>> seems like a non-starter.  Resetting devices in the new kernel, ...,
>> well, maybe.  It seems ugly, and it seems like the sort of problem
>> that IOMMUs are designed to solve.
>
> Actually, resetting the devices in the kdump kernel would be one of the
> better solutions for this problem. Once we have a generic way to stop
> all in-flight DMA from the PCI endpoints, we could safely disable and
> then re-enable the IOMMU.
>
>> On Wed, Jul 2, 2014 at 7:32 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> > That is a solution to prevent the in-flight DMA failures. But what
>> > happens when there is some in-flight DMA to a disk to write some inodes
>> > or a new superblock?  Then this scratch address space may cause
>> > filesystem corruption at worst.
>>
>> This in-flight DMA is from a device programmed by the old kernel, and
>> it would be reading data from the old kernel's buffers.  I think
>> you're suggesting that we might want that DMA read to complete so the
>> device can update filesystem metadata?
>
> Well, it is not about updating filesystem metadata. In the kdump kernel
> we have these options:
>
>         1) Disable the IOMMU. The problem here is that DMA is now
>            untranslated, so any in-flight DMA might read or write
>            a random location in memory, corrupting the kdump or
>            even the new kexec kernel memory. So this is a non-starter.

Agreed (at least if the IOMMU was enabled in the crashed kernel).

>         2) Re-program the IOMMU to block all DMA. This is safer, as it
>            does not corrupt any data in memory. But some devices react
>            very poorly to a master abort from the IOMMU, so badly that
>            the driver in the kdump kernel fails to initialize the
>            device. In this case taking the dump itself might fail (and
>            often does, according to reports).

Sounds like an option, even though broken devices work poorly.

>         3) To prevent master aborts like in option (2), David suggested
>            mapping the whole DMA address space to a scratch page. This
>            also prevents system memory corruption and the master
>            aborts. But the problem is that in-flight DMA will now read
>            all zeros. This can corrupt disk and network data; at worst
>            it nulls out the superblocks of your filesystem and you lose
>            all data. So this is not an option either.

Ah, yes, I see your point now.  This allows corrupted data, e.g., all
zeroes, to be written to disk or network after the kernel crash.  I
agree; this doesn't sound like a good option.

And the proposal below is a fourth option (leave the IOMMU enabled,
reusing the crashed kernel's mappings until drivers make new mappings).

> What we currently do in the VT-d driver is a mixture of (1) and (2). The
> VT-d driver disables the IOMMU hardware (opening a race window for
> memory data corruption), re-initializes it to reject any ongoing DMA
> (which causes master aborts for in-flight DMA) and re-enables the IOMMU
> hardware.
>
> But this strategy quite often fails in heavy IO environments, and we are
> looking into ways to solve the problem, or at least to improve the
> current situation as much as we can.
>
> I talked to David about this at LPC and we came up with basically this
> procedure:
>
>         1. If the VT-d driver finds the IOMMU enabled, it reuses its
>            root-context table. This way the IOMMU need not be disabled
>            and re-enabled, eliminating the race we have now.
>            Other data structures like the context entries are copied
>            over from the old kernel.  We basically keep all mappings
>            from the old kernel, allowing any in-flight DMA to succeed.
>            No memory or disk data corruption.

If the crashed kernel had corrupted memory, couldn't an in-flight DMA
read that corrupted data from memory and write it to disk?

I guess you could argue that this is merely a race, and the in-flight
DMA could just as easily have happened before the kernel crash, so
there's always a window and the only question is whether it closes
when the IOMMU driver starts up or when the device driver starts up.

>            The issue here is that we modify data from the old kernel
>            which is about to be dumped. But there are ways to work
>            around that.
>
>         2. When a device driver issues the first dma_map command for a
>            device, we assign a new and empty page table, thus removing
>            all mappings from the old kernel for the device.
>            The rationale is that at this point the device driver should
>            have reset the device to a point where all in-flight DMA is
>            canceled.
>
> This approach goes in the same direction as Bill Sumner's patch set,
> which Li took over. But it does not go as far as keeping the old
> mappings while the kdump kernel is still working with the devices
> (which might introduce new issues and corner cases).
>
>> > So with this in mind I would prefer initially taking over the
>> > page-tables from the old kernel before the device drivers re-initialize
>> > the devices.
>>
>> This makes the dump kernel more dependent on data from the old kernel,
>> which we obviously want to avoid when possible.
>
> Sure, but this is not really possible here (unless we have a generic and
> reliable way to reset all PCI endpoint devices and cancel all in-flight
> DMA before we disable the IOMMU in the kdump kernel).
> Otherwise we always risk data corruption somewhere, in system memory or
> on disk.
>
>> I didn't find the previous discussion where pointing every virtual bus
>> address at the same physical scratch page was proposed.  Why was that
>> better than programming the IOMMU to reject every DMA?
>
> As I said, the problem is that this causes master aborts which some
> devices can't recover from, so that the device driver in the kdump
> kernel fails to initialize the device.

Yes, thanks for making that explicit again.

Bjorn

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2014-10-22 18:26 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-25  0:36 [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO Bill Sumner
2014-04-25  0:36 ` [PATCH 1/8] iommu/vt-d: Fix a few existing lines for checkpatch.pl Bill Sumner
2014-04-25  0:36 ` [PATCH 2/8] iommu/vt-d: Consolidate lines for a new private header Bill Sumner
2014-04-25  0:36 ` [PATCH 3/8] iommu/vt-d: Create intel-iommu-private.h Bill Sumner
2014-04-25  0:36 ` [PATCH 4/8] iommu/vt-d: Update iommu_attach_domain() and its callers Bill Sumner
2014-04-25  0:36 ` [PATCH 5/8] iommu/vt-d: Items required for kdump Bill Sumner
2014-04-25  0:36 ` [PATCH 6/8] iommu/vt-d: Create intel-iommu-kdump.c Bill Sumner
2014-04-25  0:36 ` [PATCH 7/8] iommu/vt-d: Add domain-id functions to intel-iommu-kdump.c Bill Sumner
2014-04-25  0:36 ` [PATCH 8/8] iommu/vt-d: Changes to support kdump Bill Sumner
2014-04-30 10:49 ` [PATCH 0/8] iommu/vt-d: Fix crash dump failure caused by legacy DMA/IO David Woodhouse
2014-05-02 20:13   ` Jerry Hoemann
2014-05-07 18:25   ` Jerry Hoemann
2014-07-02 13:32   ` Joerg Roedel
2014-07-11 16:27     ` Jerry Hoemann
2014-10-15  8:10       ` Li, ZhenHua
2014-10-15  8:45         ` Li, ZhenHua
2014-10-22  2:16     ` Bjorn Helgaas
2014-10-22  2:47       ` Eric W. Biederman
2014-10-22  3:08         ` Li, ZhenHua
2014-10-22 13:21       ` Joerg Roedel
2014-10-22 18:26         ` Bjorn Helgaas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).