* [patch 0/7] x86 PAT: track pfnmap mappings with remap_pfn_range vm_insert_pfn - v3
@ 2008-12-18 19:41 venkatesh.pallipadi
  2008-12-18 19:41 ` [patch 1/7] x86 PAT: store vm_pgoff for all linear_over_vma_region mappings " venkatesh.pallipadi
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: venkatesh.pallipadi @ 2008-12-18 19:41 UTC (permalink / raw)
  To: mingo, tglx, hpa, akpm, npiggin, hugh
  Cc: arjan, jbarnes, rdreier, jeremy, linux-kernel,
	Venkatesh Pallipadi, Suresh Siddha

v3: Patches updated based on Andrew's comments on the earlier version.

Drivers use mmap followed by pgprot_* and remap_pfn_range or vm_insert_pfn
in order to export reserved memory to userspace. Currently, such mappings
are not tracked and hence not kept consistent with other mappings (/dev/mem,
pci resource, ioremap) of the same memory that may exist in the system.

The following patchset adds x86 PAT attribute tracking and untracking for
pfnmap related APIs.

The first three patches in the set change the generic mm code to support
this tracking. The last four are x86 specific, making things work with the
x86 PAT code. The patchset also introduces a pgprot_writecombine interface,
which gives a write-combining mapping when PAT is enabled, falling back to
pgprot_noncached otherwise.

-- 



* [patch 1/7] x86 PAT: store vm_pgoff for all linear_over_vma_region mappings - v3
  2008-12-18 19:41 [patch 0/7] x86 PAT: track pfnmap mappings with remap_pfn_range vm_insert_pfn - v3 venkatesh.pallipadi
@ 2008-12-18 19:41 ` venkatesh.pallipadi
  2008-12-18 21:27   ` Nick Piggin
  2008-12-18 19:41 ` [patch 2/7] x86 PAT: Add follow_pfnmap_pte routine to help tracking pfnmap pages " venkatesh.pallipadi
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: venkatesh.pallipadi @ 2008-12-18 19:41 UTC (permalink / raw)
  To: mingo, tglx, hpa, akpm, npiggin, hugh
  Cc: arjan, jbarnes, rdreier, jeremy, linux-kernel,
	Venkatesh Pallipadi, Suresh Siddha

[-- Attachment #1: mm_rfc.patch --]
[-- Type: text/plain, Size: 3203 bytes --]

Drivers use mmap followed by pgprot_* and remap_pfn_range or vm_insert_pfn
in order to export reserved memory to userspace. Currently, such mappings
are not tracked and hence not kept consistent with other mappings (/dev/mem,
pci resource, ioremap) of the same memory that may exist in the system.

The following patchset adds x86 PAT attribute tracking and untracking for
pfnmap related APIs.

The first three patches in the set change the generic mm code to support
this tracking. The last four are x86 specific, making things work with the
x86 PAT code. The patchset also introduces a pgprot_writecombine interface,
which gives a write-combining mapping when PAT is enabled, falling back to
pgprot_noncached otherwise.

This patch:

While working on x86 PAT, we faced some hurdles tracking remap_pfn_range()
regions, as we have no information to say whether a given PFNMAP mapping is
linear over the entire vma range or consists of smaller-granularity regions
within the vma.

A simple solution is to use vm_pgoff as an indicator of a linear mapping
over the vma region. Currently, remap_pfn_range only sets vm_pgoff for COW
mappings. The patch below changes that logic and sets vm_pgoff irrespective
of COW. This is still not enough for the case where pfn is zero (vma region
mapped to physical address zero), but for all other cases we can look at a
pfnmap VMA and say whether the mapping covers the entire vma region or not.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>

---
 include/linux/mm.h |    9 +++++++++
 mm/memory.c        |    7 +++----
 2 files changed, 12 insertions(+), 4 deletions(-)

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-12-17 17:24:31.000000000 -0800
+++ linux-2.6/mm/memory.c	2008-12-18 10:10:46.000000000 -0800
@@ -1575,11 +1575,10 @@ int remap_pfn_range(struct vm_area_struc
 	 * behaviour that some programs depend on. We mark the "original"
 	 * un-COW'ed pages by matching them up with "vma->vm_pgoff".
 	 */
-	if (is_cow_mapping(vma->vm_flags)) {
-		if (addr != vma->vm_start || end != vma->vm_end)
-			return -EINVAL;
+	if (addr == vma->vm_start && end == vma->vm_end)
 		vma->vm_pgoff = pfn;
-	}
+	else if (is_cow_mapping(vma->vm_flags))
+		return -EINVAL;
 
 	vma->vm_flags |= VM_IO | VM_RESERVED | VM_PFNMAP;
 
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2008-12-17 17:24:31.000000000 -0800
+++ linux-2.6/include/linux/mm.h	2008-12-18 10:10:46.000000000 -0800
@@ -145,6 +145,15 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_WRITE	0x01	/* Fault was a write access */
 #define FAULT_FLAG_NONLINEAR	0x02	/* Fault was via a nonlinear mapping */
 
+static inline int is_linear_pfn_mapping(struct vm_area_struct *vma)
+{
+	return ((vma->vm_flags & VM_PFNMAP) && vma->vm_pgoff);
+}
+
+static inline int is_pfn_mapping(struct vm_area_struct *vma)
+{
+	return (vma->vm_flags & VM_PFNMAP);
+}
 
 /*
 * vm_fault is filled by the pagefault handler and passed to the vma's

-- 



* [patch 2/7] x86 PAT: Add follow_pfnmap_pte routine to help tracking pfnmap pages - v3
  2008-12-18 19:41 [patch 0/7] x86 PAT: track pfnmap mappings with remap_pfn_range vm_insert_pfn - v3 venkatesh.pallipadi
  2008-12-18 19:41 ` [patch 1/7] x86 PAT: store vm_pgoff for all linear_over_vma_region mappings " venkatesh.pallipadi
@ 2008-12-18 19:41 ` venkatesh.pallipadi
  2008-12-18 21:31   ` Nick Piggin
  2008-12-18 19:41 ` [patch 3/7] x86 PAT: hooks in generic vm code to help archs to track pfnmap regions " venkatesh.pallipadi
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: venkatesh.pallipadi @ 2008-12-18 19:41 UTC (permalink / raw)
  To: mingo, tglx, hpa, akpm, npiggin, hugh
  Cc: arjan, jbarnes, rdreier, jeremy, linux-kernel,
	Venkatesh Pallipadi, Suresh Siddha

[-- Attachment #1: follow_pfn.patch --]
[-- Type: text/plain, Size: 2392 bytes --]

Add a generic interface to follow a pfn in a pfnmap vma range. This is used
by one of the subsequent x86 PAT patches to keep track of memory types for
vma regions across vma copy and free.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>

---
 include/linux/mm.h |    3 +++
 mm/memory.c        |   43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2008-11-25 13:56:46.000000000 -0800
+++ linux-2.6/include/linux/mm.h	2008-11-25 14:20:23.000000000 -0800
@@ -1223,6 +1223,9 @@ struct page *follow_page(struct vm_area_
 #define FOLL_GET	0x04	/* do get_page on page */
 #define FOLL_ANON	0x08	/* give ZERO_PAGE if no pgtable */
 
+int follow_pfnmap_pte(struct vm_area_struct *vma,
+				unsigned long address, pte_t *ret_ptep);
+
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
 extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-11-25 14:07:42.000000000 -0800
+++ linux-2.6/mm/memory.c	2008-11-25 14:18:35.000000000 -0800
@@ -1111,6 +1111,49 @@ no_page_table:
 	return page;
 }
 
+int follow_pfnmap_pte(struct vm_area_struct *vma, unsigned long address,
+			pte_t *ret_ptep)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *ptep, pte;
+	spinlock_t *ptl;
+	struct page *page;
+	struct mm_struct *mm = vma->vm_mm;
+
+	if (!is_pfn_mapping(vma))
+		goto err;
+
+	page = NULL;
+	pgd = pgd_offset(mm, address);
+	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+		goto err;
+
+	pud = pud_offset(pgd, address);
+	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
+		goto err;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+		goto err;
+
+	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+
+	pte = *ptep;
+	if (!pte_present(pte))
+		goto err_unlock;
+
+	*ret_ptep = pte;
+	pte_unmap_unlock(ptep, ptl);
+	return 0;
+
+err_unlock:
+	pte_unmap_unlock(ptep, ptl);
+err:
+	return -EINVAL;
+}
+
 /* Can we do the FOLL_ANON optimization? */
 static inline int use_zero_page(struct vm_area_struct *vma)
 {

-- 



* [patch 3/7] x86 PAT: hooks in generic vm code to help archs to track pfnmap regions - v3
  2008-12-18 19:41 [patch 0/7] x86 PAT: track pfnmap mappings with remap_pfn_range vm_insert_pfn - v3 venkatesh.pallipadi
  2008-12-18 19:41 ` [patch 1/7] x86 PAT: store vm_pgoff for all linear_over_vma_region mappings " venkatesh.pallipadi
  2008-12-18 19:41 ` [patch 2/7] x86 PAT: Add follow_pfnmap_pte routine to help tracking pfnmap pages " venkatesh.pallipadi
@ 2008-12-18 19:41 ` venkatesh.pallipadi
  2008-12-18 21:35   ` Nick Piggin
  2008-12-18 19:41 ` [patch 4/7] x86 PAT: Implement track/untrack of pfnmap regions for x86 " venkatesh.pallipadi
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: venkatesh.pallipadi @ 2008-12-18 19:41 UTC (permalink / raw)
  To: mingo, tglx, hpa, akpm, npiggin, hugh
  Cc: arjan, jbarnes, rdreier, jeremy, linux-kernel,
	Venkatesh Pallipadi, Suresh Siddha

[-- Attachment #1: generic_pfn_range_tracking.patch --]
[-- Type: text/plain, Size: 4946 bytes --]

Introduce generic hooks in remap_pfn_range and vm_insert_pfn and
corresponding copy and free routines with reserve and free tracking.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>

---
 include/linux/mm.h |    6 ++++
 mm/memory.c        |   76 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 81 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2008-12-18 10:10:50.000000000 -0800
+++ linux-2.6/mm/memory.c	2008-12-18 10:11:23.000000000 -0800
@@ -99,6 +99,50 @@ int randomize_va_space __read_mostly =
 					2;
 #endif
 
+#ifndef track_pfn_vma_new
+/*
+ * Interface that can be used by architecture code to keep track of
+ * memory type of pfn mappings (remap_pfn_range, vm_insert_pfn)
+ *
+ * track_pfn_vma_new is called when a _new_ pfn mapping is being established
+ * for physical range indicated by pfn and size.
+ */
+int track_pfn_vma_new(struct vm_area_struct *vma, pgprot_t prot,
+			unsigned long pfn, unsigned long size)
+{
+	return 0;
+}
+#endif
+
+#ifndef track_pfn_vma_copy
+/*
+ * Interface that can be used by architecture code to keep track of
+ * memory type of pfn mappings (remap_pfn_range, vm_insert_pfn)
+ *
+ * track_pfn_vma_copy is called when vma that is covering the pfnmap gets
+ * copied through copy_page_range().
+ */
+int track_pfn_vma_copy(struct vm_area_struct *vma)
+{
+	return 0;
+}
+#endif
+
+#ifndef untrack_pfn_vma
+/*
+ * Interface that can be used by architecture code to keep track of
+ * memory type of pfn mappings (remap_pfn_range, vm_insert_pfn)
+ *
+ * untrack_pfn_vma is called while unmapping a pfnmap for a region.
+ * untrack can be called for a specific region indicated by pfn and size or
+ * can be for the entire vma (in which case size can be zero).
+ */
+void untrack_pfn_vma(struct vm_area_struct *vma, unsigned long pfn,
+			unsigned long size)
+{
+}
+#endif
+
 static int __init disable_randmaps(char *s)
 {
 	randomize_va_space = 0;
@@ -669,6 +713,16 @@ int copy_page_range(struct mm_struct *ds
 	if (is_vm_hugetlb_page(vma))
 		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
 
+	if (is_pfn_mapping(vma)) {
+		/*
+		 * We do not free on error cases below as remove_vma
+		 * gets called on error from higher level routine
+		 */
+		ret = track_pfn_vma_copy(vma);
+		if (ret)
+			return ret;
+	}
+
 	/*
 	 * We need to invalidate the secondary MMU mappings only when
 	 * there could be a permission downgrade on the ptes of the
@@ -915,6 +969,9 @@ unsigned long unmap_vmas(struct mmu_gath
 		if (vma->vm_flags & VM_ACCOUNT)
 			*nr_accounted += (end - start) >> PAGE_SHIFT;
 
+		if (is_pfn_mapping(vma))
+			untrack_pfn_vma(vma, 0, 0);
+
 		while (start != end) {
 			if (!tlb_start_valid) {
 				tlb_start = start;
@@ -1473,6 +1530,7 @@ out:
 int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn)
 {
+	int ret;
 	/*
 	 * Technically, architectures with pte_special can avoid all these
 	 * restrictions (same for remap_pfn_range).  However we would like
@@ -1487,7 +1545,15 @@ int vm_insert_pfn(struct vm_area_struct 
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return -EFAULT;
-	return insert_pfn(vma, addr, pfn, vma->vm_page_prot);
+	if (track_pfn_vma_new(vma, vma->vm_page_prot, pfn, PAGE_SIZE))
+		return -EINVAL;
+
+	ret = insert_pfn(vma, addr, pfn, vma->vm_page_prot);
+
+	if (ret)
+		untrack_pfn_vma(vma, pfn, PAGE_SIZE);
+
+	return ret;
 }
 EXPORT_SYMBOL(vm_insert_pfn);
 
@@ -1625,6 +1691,10 @@ int remap_pfn_range(struct vm_area_struc
 
 	vma->vm_flags |= VM_IO | VM_RESERVED | VM_PFNMAP;
 
+	err = track_pfn_vma_new(vma, prot, pfn, PAGE_ALIGN(size));
+	if (err)
+		return -EINVAL;
+
 	BUG_ON(addr >= end);
 	pfn -= addr >> PAGE_SHIFT;
 	pgd = pgd_offset(mm, addr);
@@ -1636,6 +1706,10 @@ int remap_pfn_range(struct vm_area_struc
 		if (err)
 			break;
 	} while (pgd++, addr = next, addr != end);
+
+	if (err)
+		untrack_pfn_vma(vma, pfn, PAGE_ALIGN(size));
+
 	return err;
 }
 EXPORT_SYMBOL(remap_pfn_range);
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2008-12-18 10:10:50.000000000 -0800
+++ linux-2.6/include/linux/mm.h	2008-12-18 10:11:23.000000000 -0800
@@ -155,6 +155,12 @@ static inline int is_pfn_mapping(struct 
 	return (vma->vm_flags & VM_PFNMAP);
 }
 
+extern int track_pfn_vma_new(struct vm_area_struct *vma, pgprot_t prot,
+				unsigned long pfn, unsigned long size);
+extern int track_pfn_vma_copy(struct vm_area_struct *vma);
+extern void untrack_pfn_vma(struct vm_area_struct *vma, unsigned long pfn,
+				unsigned long size);
+
 /*
 * vm_fault is filled by the pagefault handler and passed to the vma's
  * ->fault function. The vma's ->fault is responsible for returning a bitmask

-- 



* [patch 4/7] x86 PAT: Implement track/untrack of pfnmap regions for x86 - v3
  2008-12-18 19:41 [patch 0/7] x86 PAT: track pfnmap mappings with remap_pfn_range vm_insert_pfn - v3 venkatesh.pallipadi
                   ` (2 preceding siblings ...)
  2008-12-18 19:41 ` [patch 3/7] x86 PAT: hooks in generic vm code to help archs to track pfnmap regions " venkatesh.pallipadi
@ 2008-12-18 19:41 ` venkatesh.pallipadi
  2008-12-18 21:38   ` Nick Piggin
  2008-12-18 19:41 ` [patch 5/7] x86 PAT: change pgprot_noncached to uc_minus instead of strong uc " venkatesh.pallipadi
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: venkatesh.pallipadi @ 2008-12-18 19:41 UTC (permalink / raw)
  To: mingo, tglx, hpa, akpm, npiggin, hugh
  Cc: arjan, jbarnes, rdreier, jeremy, linux-kernel,
	Venkatesh Pallipadi, Suresh Siddha

[-- Attachment #1: x86_pfn_reservefree.patch --]
[-- Type: text/plain, Size: 8585 bytes --]

Hook up remap_pfn_range and vm_insert_pfn, and the corresponding copy and
free routines, with reserve and free tracking.

reserve and free here only take care of non-RAM region mappings. For a RAM
region, the driver should use set_memory_[uc|wc|wb] to set the cache type
and then set up the user pte mapping; the reserve/free below can be
bypassed in that case.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>

---
 arch/x86/include/asm/pgtable.h |   10 +
 arch/x86/mm/pat.c              |  236 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 246 insertions(+)

Index: linux-2.6/arch/x86/mm/pat.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/pat.c	2008-12-17 17:03:25.000000000 -0800
+++ linux-2.6/arch/x86/mm/pat.c	2008-12-17 17:22:59.000000000 -0800
@@ -596,6 +596,242 @@ void unmap_devmem(unsigned long pfn, uns
 	free_memtype(addr, addr + size);
 }
 
+/*
+ * Internal interface to reserve a range of physical memory with prot.
+ * Reserves non-RAM regions only; after a successful reserve_memtype,
+ * this func also keeps identity mapping (if any) in sync with this new prot.
+ */
+static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t vma_prot)
+{
+	int is_ram = 0;
+	int id_sz, ret;
+	unsigned long flags;
+	unsigned long want_flags = (pgprot_val(vma_prot) & _PAGE_CACHE_MASK);
+
+	is_ram = pagerange_is_ram(paddr, paddr + size);
+
+	if (is_ram != 0) {
+		/*
+		 * For mapping RAM pages, drivers need to call
+		 * set_memory_[uc|wc|wb] directly, for reserve and free, before
+		 * setting up the PTE.
+		 */
+		WARN_ON_ONCE(1);
+		return 0;
+	}
+
+	ret = reserve_memtype(paddr, paddr + size, want_flags, &flags);
+	if (ret)
+		return ret;
+
+	if (flags != want_flags) {
+		free_memtype(paddr, paddr + size);
+		printk(KERN_ERR
+		"%s:%d map pfn expected mapping type %s for %Lx-%Lx, got %s\n",
+			current->comm, current->pid,
+			cattr_name(want_flags),
+			(unsigned long long)paddr,
+			(unsigned long long)(paddr + size),
+			cattr_name(flags));
+		return -EINVAL;
+	}
+
+	/* Need to keep identity mapping in sync */
+	if (paddr >= __pa(high_memory))
+		return 0;
+
+	id_sz = (__pa(high_memory) < paddr + size) ?
+				__pa(high_memory) - paddr :
+				size;
+
+	if (ioremap_change_attr((unsigned long)__va(paddr), id_sz, flags) < 0) {
+		free_memtype(paddr, paddr + size);
+		printk(KERN_ERR
+			"%s:%d reserve_pfn_range ioremap_change_attr failed %s "
+			"for %Lx-%Lx\n",
+			current->comm, current->pid,
+			cattr_name(flags),
+			(unsigned long long)paddr,
+			(unsigned long long)(paddr + size));
+		return -EINVAL;
+	}
+	return 0;
+}
+
+/*
+ * Internal interface to free a range of physical memory.
+ * Frees non RAM regions only.
+ */
+static void free_pfn_range(u64 paddr, unsigned long size)
+{
+	int is_ram;
+
+	is_ram = pagerange_is_ram(paddr, paddr + size);
+	if (is_ram == 0)
+		free_memtype(paddr, paddr + size);
+}
+
+/*
+ * track_pfn_vma_copy is called when vma that is covering the pfnmap gets
+ * copied through copy_page_range().
+ *
+ * If the vma has a linear pfn mapping for the entire range, we get the prot
+ * from the pte and reserve the entire vma range with a single
+ * reserve_pfn_range call. Otherwise, we reserve the vma range by going
+ * through the PTEs page by page to get each physical address and protection.
+ */
+int track_pfn_vma_copy(struct vm_area_struct *vma)
+{
+	int retval = 0;
+	unsigned long i, j;
+	u64 paddr;
+	pgprot_t prot;
+	pte_t pte;
+	unsigned long vma_start = vma->vm_start;
+	unsigned long vma_end = vma->vm_end;
+	unsigned long vma_size = vma_end - vma_start;
+
+	if (!pat_enabled)
+		return 0;
+
+	if (is_linear_pfn_mapping(vma)) {
+		/*
+		 * reserve the whole chunk starting from vm_pgoff,
+		 * But, we have to get the protection from pte.
+		 */
+		if (follow_pfnmap_pte(vma, vma_start, &pte)) {
+			WARN_ON_ONCE(1);
+			return -1;
+		}
+		prot = pte_pgprot(pte);
+		paddr = (u64)vma->vm_pgoff << PAGE_SHIFT;
+		return reserve_pfn_range(paddr, vma_size, prot);
+	}
+
+	/* reserve entire vma page by page, using pfn and prot from pte */
+	for (i = 0; i < vma_size; i += PAGE_SIZE) {
+		if (follow_pfnmap_pte(vma, vma_start + i, &pte))
+			continue;
+
+		paddr = pte_pa(pte);
+		prot = pte_pgprot(pte);
+		retval = reserve_pfn_range(paddr, PAGE_SIZE, prot);
+		if (retval)
+			goto cleanup_ret;
+	}
+	return 0;
+
+cleanup_ret:
+	/* Reserve error: Cleanup partial reservation and return error */
+	for (j = 0; j < i; j += PAGE_SIZE) {
+		if (follow_pfnmap_pte(vma, vma_start + j, &pte))
+			continue;
+
+		paddr = pte_pa(pte);
+		free_pfn_range(paddr, PAGE_SIZE);
+	}
+
+	return retval;
+}
+
+/*
+ * track_pfn_vma_new is called when a _new_ pfn mapping is being established
+ * for physical range indicated by pfn and size.
+ *
+ * prot is passed in as a parameter for the new mapping. If the vma has a
+ * linear pfn mapping for the entire range, we reserve the entire vma range
+ * with a single reserve_pfn_range call.
+ * Otherwise, we look at the pfn and size and reserve only the specified
+ * range page by page.
+ *
+ * Note that this function can be called with caller trying to map only a
+ * subrange/page inside the vma.
+ */
+int track_pfn_vma_new(struct vm_area_struct *vma, pgprot_t prot,
+			unsigned long pfn, unsigned long size)
+{
+	int retval = 0;
+	unsigned long i, j;
+	u64 base_paddr;
+	u64 paddr;
+	unsigned long vma_start = vma->vm_start;
+	unsigned long vma_end = vma->vm_end;
+	unsigned long vma_size = vma_end - vma_start;
+
+	if (!pat_enabled)
+		return 0;
+
+	if (is_linear_pfn_mapping(vma)) {
+		/* reserve the whole chunk starting from vm_pgoff */
+		paddr = (u64)vma->vm_pgoff << PAGE_SHIFT;
+		return reserve_pfn_range(paddr, vma_size, prot);
+	}
+
+	/* reserve page by page using pfn and size */
+	base_paddr = (u64)pfn << PAGE_SHIFT;
+	for (i = 0; i < size; i += PAGE_SIZE) {
+		paddr = base_paddr + i;
+		retval = reserve_pfn_range(paddr, PAGE_SIZE, prot);
+		if (retval)
+			goto cleanup_ret;
+	}
+	return 0;
+
+cleanup_ret:
+	/* Reserve error: Cleanup partial reservation and return error */
+	for (j = 0; j < i; j += PAGE_SIZE) {
+		paddr = base_paddr + j;
+		free_pfn_range(paddr, PAGE_SIZE);
+	}
+
+	return retval;
+}
+
+/*
+ * untrack_pfn_vma is called while unmapping a pfnmap for a region.
+ * untrack can be called for a specific region indicated by pfn and size or
+ * can be for the entire vma (in which case size can be zero).
+ */
+void untrack_pfn_vma(struct vm_area_struct *vma, unsigned long pfn,
+			unsigned long size)
+{
+	unsigned long i;
+	u64 paddr;
+	unsigned long vma_start = vma->vm_start;
+	unsigned long vma_end = vma->vm_end;
+	unsigned long vma_size = vma_end - vma_start;
+
+	if (!pat_enabled)
+		return;
+
+	if (is_linear_pfn_mapping(vma)) {
+		/* free the whole chunk starting from vm_pgoff */
+		paddr = (u64)vma->vm_pgoff << PAGE_SHIFT;
+		free_pfn_range(paddr, vma_size);
+		return;
+	}
+
+	if (size != 0 && size != vma_size) {
+		/* free page by page, using pfn and size */
+		paddr = (u64)pfn << PAGE_SHIFT;
+		for (i = 0; i < size; i += PAGE_SIZE) {
+			/* free the i'th page of the range */
+			free_pfn_range(paddr + i, PAGE_SIZE);
+		}
+	} else {
+		/* free entire vma, page by page, using the pfn from pte */
+		for (i = 0; i < vma_size; i += PAGE_SIZE) {
+			pte_t pte;
+
+			if (follow_pfnmap_pte(vma, vma_start + i, &pte))
+				continue;
+
+			paddr = pte_pa(pte);
+			free_pfn_range(paddr, PAGE_SIZE);
+		}
+	}
+}
+
 #if defined(CONFIG_DEBUG_FS) && defined(CONFIG_X86_PAT)
 
 /* get Nth element of the linked list */
Index: linux-2.6/arch/x86/include/asm/pgtable.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/pgtable.h	2008-12-17 17:03:25.000000000 -0800
+++ linux-2.6/arch/x86/include/asm/pgtable.h	2008-12-17 17:09:16.000000000 -0800
@@ -219,6 +219,11 @@ static inline unsigned long pte_pfn(pte_
 	return (pte_val(pte) & PTE_PFN_MASK) >> PAGE_SHIFT;
 }
 
+static inline u64 pte_pa(pte_t pte)
+{
+	return pte_val(pte) & PTE_PFN_MASK;
+}
+
 #define pte_page(pte)	pfn_to_page(pte_pfn(pte))
 
 static inline int pmd_large(pmd_t pte)
@@ -328,6 +333,11 @@ static inline pgprot_t pgprot_modify(pgp
 
 #define canon_pgprot(p) __pgprot(pgprot_val(p) & __supported_pte_mask)
 
+/* Indicate that x86 has its own track and untrack pfn vma functions */
+#define track_pfn_vma_new track_pfn_vma_new
+#define track_pfn_vma_copy track_pfn_vma_copy
+#define untrack_pfn_vma untrack_pfn_vma
+
 #ifndef __ASSEMBLY__
 #define __HAVE_PHYS_MEM_ACCESS_PROT
 struct file;

-- 



* [patch 5/7] x86 PAT: change pgprot_noncached to uc_minus instead of strong uc - v3
  2008-12-18 19:41 [patch 0/7] x86 PAT: track pfnmap mappings with remap_pfn_range vm_insert_pfn - v3 venkatesh.pallipadi
                   ` (3 preceding siblings ...)
  2008-12-18 19:41 ` [patch 4/7] x86 PAT: Implement track/untrack of pfnmap regions for x86 " venkatesh.pallipadi
@ 2008-12-18 19:41 ` venkatesh.pallipadi
  2008-12-18 19:41 ` [patch 6/7] x86 PAT: add pgprot_writecombine() interface for drivers " venkatesh.pallipadi
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: venkatesh.pallipadi @ 2008-12-18 19:41 UTC (permalink / raw)
  To: mingo, tglx, hpa, akpm, npiggin, hugh
  Cc: arjan, jbarnes, rdreier, jeremy, linux-kernel,
	Venkatesh Pallipadi, Suresh Siddha

[-- Attachment #1: make_pgprot_noncached_ucminus.patch --]
[-- Type: text/plain, Size: 2657 bytes --]

Make pgprot_noncached uc_minus instead of strong UC. This brings
pgprot_noncached in line with ioremap_nocache() and all the other APIs that
map a page uc_minus on a uc request.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>

---
 arch/x86/include/asm/pgtable.h    |    8 ++++++++
 arch/x86/include/asm/pgtable_32.h |    9 ---------
 arch/x86/include/asm/pgtable_64.h |    6 ------
 3 files changed, 8 insertions(+), 15 deletions(-)

Index: linux-2.6/arch/x86/include/asm/pgtable.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/pgtable.h	2008-12-17 17:09:16.000000000 -0800
+++ linux-2.6/arch/x86/include/asm/pgtable.h	2008-12-17 17:23:07.000000000 -0800
@@ -158,6 +158,14 @@
 #define PGD_IDENT_ATTR	 0x001		/* PRESENT (no other attributes) */
 #endif
 
+/*
+ * Macro to mark a page protection value as UC-
+ */
+#define pgprot_noncached(prot)					\
+	((boot_cpu_data.x86 > 3)				\
+	 ? (__pgprot(pgprot_val(prot) | _PAGE_CACHE_UC_MINUS))	\
+	 : (prot))
+
 #ifndef __ASSEMBLY__
 
 /*
Index: linux-2.6/arch/x86/include/asm/pgtable_32.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/pgtable_32.h	2008-12-17 15:01:51.000000000 -0800
+++ linux-2.6/arch/x86/include/asm/pgtable_32.h	2008-12-17 17:23:07.000000000 -0800
@@ -101,15 +101,6 @@ extern unsigned long pg0[];
 #endif
 
 /*
- * Macro to mark a page protection value as "uncacheable".
- * On processors which do not support it, this is a no-op.
- */
-#define pgprot_noncached(prot)					\
-	((boot_cpu_data.x86 > 3)				\
-	 ? (__pgprot(pgprot_val(prot) | _PAGE_PCD | _PAGE_PWT))	\
-	 : (prot))
-
-/*
  * Conversion functions: convert a page and protection to a page entry,
  * and a page entry and page directory to the page they refer to.
  */
Index: linux-2.6/arch/x86/include/asm/pgtable_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/pgtable_64.h	2008-12-17 15:01:51.000000000 -0800
+++ linux-2.6/arch/x86/include/asm/pgtable_64.h	2008-12-17 17:23:07.000000000 -0800
@@ -177,12 +177,6 @@ static inline int pmd_bad(pmd_t pmd)
 #define pages_to_mb(x)	((x) >> (20 - PAGE_SHIFT))   /* FIXME: is this right? */
 
 /*
- * Macro to mark a page protection value as "uncacheable".
- */
-#define pgprot_noncached(prot)					\
-	(__pgprot(pgprot_val((prot)) | _PAGE_PCD | _PAGE_PWT))
-
-/*
  * Conversion functions: convert a page and protection to a page entry,
  * and a page entry and page directory to the page they refer to.
  */

-- 



* [patch 6/7] x86 PAT: add pgprot_writecombine() interface for drivers - v3
  2008-12-18 19:41 [patch 0/7] x86 PAT: track pfnmap mappings with remap_pfn_range vm_insert_pfn - v3 venkatesh.pallipadi
                   ` (4 preceding siblings ...)
  2008-12-18 19:41 ` [patch 5/7] x86 PAT: change pgprot_noncached to uc_minus instead of strong uc " venkatesh.pallipadi
@ 2008-12-18 19:41 ` venkatesh.pallipadi
  2008-12-18 19:41 ` [patch 7/7] x86 PAT: update documentation to cover pgprot and remap_pfn related changes " venkatesh.pallipadi
  2008-12-18 21:17 ` [patch 0/7] x86 PAT: track pfnmap mappings with remap_pfn_range vm_insert_pfn " H. Peter Anvin
  7 siblings, 0 replies; 24+ messages in thread
From: venkatesh.pallipadi @ 2008-12-18 19:41 UTC (permalink / raw)
  To: mingo, tglx, hpa, akpm, npiggin, hugh
  Cc: arjan, jbarnes, rdreier, jeremy, linux-kernel,
	Venkatesh Pallipadi, Suresh Siddha

[-- Attachment #1: add_pgprot_writecombine.patch --]
[-- Type: text/plain, Size: 2188 bytes --]

Add pgprot_writecombine. pgprot_writecombine will be aliased to
pgprot_noncached when not supported by the architecture.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>

---
 arch/x86/include/asm/pgtable.h |    3 +++
 arch/x86/mm/pat.c              |    8 ++++++++
 include/asm-generic/pgtable.h  |    4 ++++
 3 files changed, 15 insertions(+)

Index: linux-2.6/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-generic/pgtable.h	2008-12-17 15:01:51.000000000 -0800
+++ linux-2.6/include/asm-generic/pgtable.h	2008-12-17 17:23:11.000000000 -0800
@@ -129,6 +129,10 @@ static inline void ptep_set_wrprotect(st
 #define move_pte(pte, prot, old_addr, new_addr)	(pte)
 #endif
 
+#ifndef pgprot_writecombine
+#define pgprot_writecombine pgprot_noncached
+#endif
+
 /*
  * When walking page tables, get the address of the next boundary,
  * or the end address of the range if that comes earlier.  Although no
Index: linux-2.6/arch/x86/include/asm/pgtable.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/pgtable.h	2008-12-17 17:23:07.000000000 -0800
+++ linux-2.6/arch/x86/include/asm/pgtable.h	2008-12-17 17:23:11.000000000 -0800
@@ -168,6 +168,9 @@
 
 #ifndef __ASSEMBLY__
 
+#define pgprot_writecombine	pgprot_writecombine
+extern pgprot_t pgprot_writecombine(pgprot_t prot);
+
 /*
  * ZERO_PAGE is a global shared page that is always zero: used
  * for zero-mapped memory areas etc..
Index: linux-2.6/arch/x86/mm/pat.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/pat.c	2008-12-17 17:22:59.000000000 -0800
+++ linux-2.6/arch/x86/mm/pat.c	2008-12-17 17:23:11.000000000 -0800
@@ -832,6 +832,14 @@ void untrack_pfn_vma(struct vm_area_stru
 	}
 }
 
+pgprot_t pgprot_writecombine(pgprot_t prot)
+{
+	if (pat_enabled)
+		return __pgprot(pgprot_val(prot) | _PAGE_CACHE_WC);
+	else
+		return pgprot_noncached(prot);
+}
+
 #if defined(CONFIG_DEBUG_FS) && defined(CONFIG_X86_PAT)
 
 /* get Nth element of the linked list */

-- 



* [patch 7/7] x86 PAT: update documentation to cover pgprot and remap_pfn related changes - v3
  2008-12-18 19:41 [patch 0/7] x86 PAT: track pfnmap mappings with remap_pfn_range vm_insert_pfn - v3 venkatesh.pallipadi
                   ` (5 preceding siblings ...)
  2008-12-18 19:41 ` [patch 6/7] x86 PAT: add pgprot_writecombine() interface for drivers " venkatesh.pallipadi
@ 2008-12-18 19:41 ` venkatesh.pallipadi
  2008-12-18 21:13   ` Randy Dunlap
  2008-12-18 21:17 ` [patch 0/7] x86 PAT: track pfnmap mappings with remap_pfn_range vm_insert_pfn " H. Peter Anvin
  7 siblings, 1 reply; 24+ messages in thread
From: venkatesh.pallipadi @ 2008-12-18 19:41 UTC (permalink / raw)
  To: mingo, tglx, hpa, akpm, npiggin, hugh
  Cc: arjan, jbarnes, rdreier, jeremy, linux-kernel,
	Venkatesh Pallipadi, Suresh Siddha


Add documentation related to pgprot_* change.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 Documentation/x86/pat.txt |   24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

Index: linux-2.6/Documentation/x86/pat.txt
===================================================================
--- linux-2.6.orig/Documentation/x86/pat.txt	2008-12-17 15:01:50.000000000 -0800
+++ linux-2.6/Documentation/x86/pat.txt	2008-12-17 17:23:16.000000000 -0800
@@ -80,6 +80,30 @@ pci proc               |    --    |    -
                        |          |            |                  |
 -------------------------------------------------------------------
 
+Advanced APIs for drivers
+-------------------------
+A. Exporting pages to user with remap_pfn_range, io_remap_pfn_range,
+vm_insert_pfn
+
+Drivers wanting to export some pages to userspace, do it by using mmap
+interface and a combination of
+1) pgprot_noncached()
+2) io_remap_pfn_range() or remap_pfn_range() or vm_insert_pfn()
+
+With pat support, a new API pgprot_writecombine is being added. So, driver can
+continue to use the above sequence, with either pgprot_noncached() or
+pgprot_writecombine() in step 1, followed by step 2.
+
+In addition, step 2 internally tracks the region as UC or WC in memtype
+list in order to ensure no conflicting mapping.
+
+Note that this set of APIs only work with IO (non RAM) regions. If driver
+wants to export RAM region, it has to do set_memory_uc() or set_memory_wc()
+as step 0 above and also track the usage of those pages and use set_memory_wb()
+before the page is freed to free pool.
+
+
+
 Notes:
 
 -- in the above table mean "Not suggested usage for the API". Some of the --'s

-- 



* Re: [patch 7/7] x86 PAT: update documentation to cover pgprot and remap_pfn related changes - v3
  2008-12-18 19:41 ` [patch 7/7] x86 PAT: update documentation to cover pgprot and remap_pfn related changes " venkatesh.pallipadi
@ 2008-12-18 21:13   ` Randy Dunlap
  2008-12-18 21:49     ` Pallipadi, Venkatesh
  0 siblings, 1 reply; 24+ messages in thread
From: Randy Dunlap @ 2008-12-18 21:13 UTC (permalink / raw)
  To: venkatesh.pallipadi
  Cc: mingo, tglx, hpa, akpm, npiggin, hugh, arjan, jbarnes, rdreier,
	jeremy, linux-kernel, Suresh Siddha

venkatesh.pallipadi@intel.com wrote:

+Advanced APIs for drivers
+-------------------------
+A. Exporting pages to user with remap_pfn_range, io_remap_pfn_range,

                    to users
or                  to userspace

+vm_insert_pfn
+
+Drivers wanting to export some pages to userspace, do it by using mmap

Drop comma.

+interface and a combination of
+1) pgprot_noncached()
+2) io_remap_pfn_range() or remap_pfn_range() or vm_insert_pfn()
+
+With pat support, a new API pgprot_writecombine is being added. So, driver can

s/pat/PAT/
s/driver/drivers/ or s/driver/a driver/

+continue to use the above sequence, with either pgprot_noncached() or
+pgprot_writecombine() in step 1, followed by step 2.
+
+In addition, step 2 internally tracks the region as UC or WC in memtype
+list in order to ensure no conflicting mapping.
+
+Note that this set of APIs only work with IO (non RAM) regions. If driver

                                 works with IO (non-RAM) regions. If a driver

+wants to export RAM region, it has to do set_memory_uc() or set_memory_wc()

          export a RAM region,

+as step 0 above and also track the usage of those pages and use set_memory_wb()
+before the page is freed to free pool.


-- 
~Randy


* Re: [patch 0/7] x86 PAT: track pfnmap mappings with remap_pfn_range vm_insert_pfn - v3
  2008-12-18 19:41 [patch 0/7] x86 PAT: track pfnmap mappings with remap_pfn_range vm_insert_pfn - v3 venkatesh.pallipadi
                   ` (6 preceding siblings ...)
  2008-12-18 19:41 ` [patch 7/7] x86 PAT: update documentation to cover pgprot and remap_pfn related changes " venkatesh.pallipadi
@ 2008-12-18 21:17 ` H. Peter Anvin
  7 siblings, 0 replies; 24+ messages in thread
From: H. Peter Anvin @ 2008-12-18 21:17 UTC (permalink / raw)
  To: venkatesh.pallipadi
  Cc: mingo, tglx, akpm, npiggin, hugh, arjan, jbarnes, rdreier,
	jeremy, linux-kernel, Suresh Siddha

venkatesh.pallipadi@intel.com wrote:
> v3: Patches updated based on Andrew's comments on the earlier version.
> 
> Drivers use mmap followed by pgprot_* and remap_pfn_range or vm_insert_pfn,
> in order to export reserved memory to userspace. Currently, such mappings are
> not tracked and hence not kept consistent with other mappings (/dev/mem,
> pci resource, ioremap) for the same memory that may exist in the system.
> 
> The following patchset adds x86 PAT attribute tracking and untracking for
> pfnmap related APIs.
> 
> First three patches in the patchset are changing the generic mm code to fit
> in this tracking. Last four patches are x86 specific to make things work
> with x86 PAT code. The patchset also introduces the pgprot_writecombine interface,
> which gives writecombine mapping when enabled, falling back to
> pgprot_noncached otherwise.

Series applied to tip:x86/pat2, thanks!

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* Re: [patch 1/7] x86 PAT: store vm_pgoff for all linear_over_vma_region mappings - v3
  2008-12-18 19:41 ` [patch 1/7] x86 PAT: store vm_pgoff for all linear_over_vma_region mappings " venkatesh.pallipadi
@ 2008-12-18 21:27   ` Nick Piggin
  2008-12-18 22:10     ` Pallipadi, Venkatesh
  0 siblings, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2008-12-18 21:27 UTC (permalink / raw)
  To: venkatesh.pallipadi
  Cc: mingo, tglx, hpa, akpm, hugh, arjan, jbarnes, rdreier, jeremy,
	linux-kernel, Suresh Siddha

On Thu, Dec 18, 2008 at 11:41:27AM -0800, venkatesh.pallipadi@intel.com wrote:
> Drivers use mmap followed by pgprot_* and remap_pfn_range or vm_insert_pfn,
> in order to export reserved memory to userspace. Currently, such mappings are
> not tracked and hence not kept consistent with other mappings (/dev/mem,
> pci resource, ioremap) for the same memory that may exist in the system.
> 
> The following patchset adds x86 PAT attribute tracking and untracking for
> pfnmap related APIs.
> 
> First three patches in the patchset are changing the generic mm code to fit
> in this tracking. Last four patches are x86 specific to make things work
> with x86 PAT code. The patchset also introduces the pgprot_writecombine interface,
> which gives writecombine mapping when enabled, falling back to
> pgprot_noncached otherwise.
> 
> This patch:
> 
> While working on x86 PAT, we faced some hurdles with tracking
> remap_pfn_range() regions, as we do not have any information to say
> whether that PFNMAP mapping is linear for the entire vma range or
> it is smaller granularity regions within the vma.
> 
> A simple solution to this is to use vm_pgoff as an indicator for
> linear mapping over the vma region. Currently, remap_pfn_range
> only sets vm_pgoff for COW mappings. The patch below changes the
> logic and sets vm_pgoff irrespective of COW. This will still not
> be enough for the case where pfn is zero (vma region mapped to
> physical address zero). But, for all the other cases, we can look at
> pfnmap VMAs and say whether the mapping is for the entire vma region
> or not.
> 
> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> 
> ---
>  include/linux/mm.h |    9 +++++++++
>  mm/memory.c        |    7 +++----
>  2 files changed, 12 insertions(+), 4 deletions(-)
> 
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c	2008-12-17 17:24:31.000000000 -0800
> +++ linux-2.6/mm/memory.c	2008-12-18 10:10:46.000000000 -0800
> @@ -1575,11 +1575,10 @@ int remap_pfn_range(struct vm_area_struc
>  	 * behaviour that some programs depend on. We mark the "original"
>  	 * un-COW'ed pages by matching them up with "vma->vm_pgoff".
>  	 */
> -	if (is_cow_mapping(vma->vm_flags)) {
> -		if (addr != vma->vm_start || end != vma->vm_end)
> -			return -EINVAL;
> +	if (addr == vma->vm_start && end == vma->vm_end)
>  		vma->vm_pgoff = pfn;
> -	}
> +	else if (is_cow_mapping(vma->vm_flags))
> +		return -EINVAL;
>  
>  	vma->vm_flags |= VM_IO | VM_RESERVED | VM_PFNMAP;
>  
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h	2008-12-17 17:24:31.000000000 -0800
> +++ linux-2.6/include/linux/mm.h	2008-12-18 10:10:46.000000000 -0800
> @@ -145,6 +145,15 @@ extern pgprot_t protection_map[16];
>  #define FAULT_FLAG_WRITE	0x01	/* Fault was a write access */
>  #define FAULT_FLAG_NONLINEAR	0x02	/* Fault was via a nonlinear mapping */
>  
> +static inline int is_linear_pfn_mapping(struct vm_area_struct *vma)
> +{
> +	return ((vma->vm_flags & VM_PFNMAP) && vma->vm_pgoff);
> +}
> +
> +static inline int is_pfn_mapping(struct vm_area_struct *vma)
> +{
> +	return (vma->vm_flags & VM_PFNMAP);
> +}
>  
>  /*
>   * vm_fault is filled by the the pagefault handler and passed to the vma's

This is fine by me, however:
1. Can you add some comments to say "this is not for core vm but for pat,
   oh and a pgoff of zero is not going to work".
2. Can you please justify to me (or the changelog) roughly why PAT wants
   to know if the mapping is linear or not? Presumably it has to handle
   both types? If performance wasn't an issue, then you could manually scan
   the ptes to verify (which would solve your zero-offset bug). etc.



* Re: [patch 2/7] x86 PAT: Add follow_pfnmp_pte routine to help tracking pfnmap pages - v3
  2008-12-18 19:41 ` [patch 2/7] x86 PAT: Add follow_pfnmp_pte routine to help tracking pfnmap pages " venkatesh.pallipadi
@ 2008-12-18 21:31   ` Nick Piggin
  2008-12-18 22:15     ` Pallipadi, Venkatesh
  0 siblings, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2008-12-18 21:31 UTC (permalink / raw)
  To: venkatesh.pallipadi
  Cc: mingo, tglx, hpa, akpm, hugh, arjan, jbarnes, rdreier, jeremy,
	linux-kernel, Suresh Siddha

On Thu, Dec 18, 2008 at 11:41:28AM -0800, venkatesh.pallipadi@intel.com wrote:
> Add a generic interface to follow pfn in a pfnmap vma range. This is used by
> one of the subsequent x86 PAT related patch to keep track of memory types
> for vma regions across vma copy and free.
> 
> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>

Can you please reuse follow_phys for this? (preferably use the same API
even if it requires some modification, otherwise if not possible, then
at least can you implement a common core for both APIs).


> 
> ---
>  include/linux/mm.h |    3 +++
>  mm/memory.c        |   43 +++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 46 insertions(+)
> 
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h	2008-11-25 13:56:46.000000000 -0800
> +++ linux-2.6/include/linux/mm.h	2008-11-25 14:20:23.000000000 -0800
> @@ -1223,6 +1223,9 @@ struct page *follow_page(struct vm_area_
>  #define FOLL_GET	0x04	/* do get_page on page */
>  #define FOLL_ANON	0x08	/* give ZERO_PAGE if no pgtable */
>  
> +int follow_pfnmap_pte(struct vm_area_struct *vma,
> +				unsigned long address, pte_t *ret_ptep);
> +
>  typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
>  			void *data);
>  extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c	2008-11-25 14:07:42.000000000 -0800
> +++ linux-2.6/mm/memory.c	2008-11-25 14:18:35.000000000 -0800
> @@ -1111,6 +1111,49 @@ no_page_table:
>  	return page;
>  }
>  
> +int follow_pfnmap_pte(struct vm_area_struct *vma, unsigned long address,
> +			pte_t *ret_ptep)
> +{
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *ptep, pte;
> +	spinlock_t *ptl;
> +	struct page *page;
> +	struct mm_struct *mm = vma->vm_mm;
> +
> +	if (!is_pfn_mapping(vma))
> +		goto err;
> +
> +	page = NULL;
> +	pgd = pgd_offset(mm, address);
> +	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> +		goto err;
> +
> +	pud = pud_offset(pgd, address);
> +	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> +		goto err;
> +
> +	pmd = pmd_offset(pud, address);
> +	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> +		goto err;
> +
> +	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> +
> +	pte = *ptep;
> +	if (!pte_present(pte))
> +		goto err_unlock;
> +
> +	*ret_ptep = pte;
> +	pte_unmap_unlock(ptep, ptl);
> +	return 0;
> +
> +err_unlock:
> +	pte_unmap_unlock(ptep, ptl);
> +err:
> +	return -EINVAL;
> +}
> +
>  /* Can we do the FOLL_ANON optimization? */
>  static inline int use_zero_page(struct vm_area_struct *vma)
>  {
> 
> -- 


* Re: [patch 3/7] x86 PAT: hooks in generic vm code to help archs to track pfnmap regions - v3
  2008-12-18 19:41 ` [patch 3/7] x86 PAT: hooks in generic vm code to help archs to track pfnmap regions " venkatesh.pallipadi
@ 2008-12-18 21:35   ` Nick Piggin
  2008-12-18 22:23     ` Pallipadi, Venkatesh
  0 siblings, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2008-12-18 21:35 UTC (permalink / raw)
  To: venkatesh.pallipadi
  Cc: mingo, tglx, hpa, akpm, hugh, arjan, jbarnes, rdreier, jeremy,
	linux-kernel, Suresh Siddha

On Thu, Dec 18, 2008 at 11:41:29AM -0800, venkatesh.pallipadi@intel.com wrote:
> Introduce generic hooks in remap_pfn_range and vm_insert_pfn and
> corresponding copy and free routines with reserve and free tracking.

These should be inline so that they can be folded out (I'm sure gcc
with -Os and "optimize" inlining will do something stupid here).
Also, the normal way to add such arch hooks is to put the default
into asm-generic and have other archs include it... that would be
nicer than sticking it into mm/memory.c wouldn't it?

Sigh, fork/exit paths slow down yet again. But oh well. Maybe you can
add some branch hints?

 
> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> 
> ---
>  include/linux/mm.h |    6 ++++
>  mm/memory.c        |   76 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 81 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c	2008-12-18 10:10:50.000000000 -0800
> +++ linux-2.6/mm/memory.c	2008-12-18 10:11:23.000000000 -0800
> @@ -99,6 +99,50 @@ int randomize_va_space __read_mostly =
>  					2;
>  #endif
>  
> +#ifndef track_pfn_vma_new
> +/*
> + * Interface that can be used by architecture code to keep track of
> + * memory type of pfn mappings (remap_pfn_range, vm_insert_pfn)
> + *
> + * track_pfn_vma_new is called when a _new_ pfn mapping is being established
> + * for physical range indicated by pfn and size.
> + */
> +int track_pfn_vma_new(struct vm_area_struct *vma, pgprot_t prot,
> +			unsigned long pfn, unsigned long size)
> +{
> +	return 0;
> +}
> +#endif
> +
> +#ifndef track_pfn_vma_copy
> +/*
> + * Interface that can be used by architecture code to keep track of
> + * memory type of pfn mappings (remap_pfn_range, vm_insert_pfn)
> + *
> + * track_pfn_vma_copy is called when vma that is covering the pfnmap gets
> + * copied through copy_page_range().
> + */
> +int track_pfn_vma_copy(struct vm_area_struct *vma)
> +{
> +	return 0;
> +}
> +#endif
> +
> +#ifndef untrack_pfn_vma
> +/*
> + * Interface that can be used by architecture code to keep track of
> + * memory type of pfn mappings (remap_pfn_range, vm_insert_pfn)
> + *
> + * untrack_pfn_vma is called while unmapping a pfnmap for a region.
> + * untrack can be called for a specific region indicated by pfn and size or
> + * can be for the entire vma (in which case size can be zero).
> + */
> +void untrack_pfn_vma(struct vm_area_struct *vma, unsigned long pfn,
> +			unsigned long size)
> +{
> +}
> +#endif
> +
>  static int __init disable_randmaps(char *s)
>  {
>  	randomize_va_space = 0;
> @@ -669,6 +713,16 @@ int copy_page_range(struct mm_struct *ds
>  	if (is_vm_hugetlb_page(vma))
>  		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
>  
> +	if (is_pfn_mapping(vma)) {
> +		/*
> +		 * We do not free on error cases below as remove_vma
> +		 * gets called on error from higher level routine
> +		 */
> +		ret = track_pfn_vma_copy(vma);
> +		if (ret)
> +			return ret;
> +	}
> +
>  	/*
>  	 * We need to invalidate the secondary MMU mappings only when
>  	 * there could be a permission downgrade on the ptes of the
> @@ -915,6 +969,9 @@ unsigned long unmap_vmas(struct mmu_gath
>  		if (vma->vm_flags & VM_ACCOUNT)
>  			*nr_accounted += (end - start) >> PAGE_SHIFT;
>  
> +		if (is_pfn_mapping(vma))
> +			untrack_pfn_vma(vma, 0, 0);
> +
>  		while (start != end) {
>  			if (!tlb_start_valid) {
>  				tlb_start = start;
> @@ -1473,6 +1530,7 @@ out:
>  int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>  			unsigned long pfn)
>  {
> +	int ret;
>  	/*
>  	 * Technically, architectures with pte_special can avoid all these
>  	 * restrictions (same for remap_pfn_range).  However we would like
> @@ -1487,7 +1545,15 @@ int vm_insert_pfn(struct vm_area_struct 
>  
>  	if (addr < vma->vm_start || addr >= vma->vm_end)
>  		return -EFAULT;
> -	return insert_pfn(vma, addr, pfn, vma->vm_page_prot);
> +	if (track_pfn_vma_new(vma, vma->vm_page_prot, pfn, PAGE_SIZE))
> +		return -EINVAL;
> +
> +  	ret = insert_pfn(vma, addr, pfn, vma->vm_page_prot);
> +
> +	if (ret)
> +		untrack_pfn_vma(vma, pfn, PAGE_SIZE);
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL(vm_insert_pfn);
>  
> @@ -1625,6 +1691,10 @@ int remap_pfn_range(struct vm_area_struc
>  
>  	vma->vm_flags |= VM_IO | VM_RESERVED | VM_PFNMAP;
>  
> +	err = track_pfn_vma_new(vma, prot, pfn, PAGE_ALIGN(size));
> +	if (err)
> +		return -EINVAL;
> +
>  	BUG_ON(addr >= end);
>  	pfn -= addr >> PAGE_SHIFT;
>  	pgd = pgd_offset(mm, addr);
> @@ -1636,6 +1706,10 @@ int remap_pfn_range(struct vm_area_struc
>  		if (err)
>  			break;
>  	} while (pgd++, addr = next, addr != end);
> +
> +	if (err)
> +		untrack_pfn_vma(vma, pfn, PAGE_ALIGN(size));
> +
>  	return err;
>  }
>  EXPORT_SYMBOL(remap_pfn_range);
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h	2008-12-18 10:10:50.000000000 -0800
> +++ linux-2.6/include/linux/mm.h	2008-12-18 10:11:23.000000000 -0800
> @@ -155,6 +155,12 @@ static inline int is_pfn_mapping(struct 
>  	return (vma->vm_flags & VM_PFNMAP);
>  }
>  
> +extern int track_pfn_vma_new(struct vm_area_struct *vma, pgprot_t prot,
> +				unsigned long pfn, unsigned long size);
> +extern int track_pfn_vma_copy(struct vm_area_struct *vma);
> +extern void untrack_pfn_vma(struct vm_area_struct *vma, unsigned long pfn,
> +				unsigned long size);
> +
>  /*
>   * vm_fault is filled by the the pagefault handler and passed to the vma's
>   * ->fault function. The vma's ->fault is responsible for returning a bitmask
> 
> -- 


* Re: [patch 4/7] x86 PAT: Implement track/untrack of pfnmap regions for x86 - v3
  2008-12-18 19:41 ` [patch 4/7] x86 PAT: Implement track/untrack of pfnmap regions for x86 " venkatesh.pallipadi
@ 2008-12-18 21:38   ` Nick Piggin
  2008-12-18 21:40     ` H. Peter Anvin
  0 siblings, 1 reply; 24+ messages in thread
From: Nick Piggin @ 2008-12-18 21:38 UTC (permalink / raw)
  To: venkatesh.pallipadi
  Cc: mingo, tglx, hpa, akpm, hugh, arjan, jbarnes, rdreier, jeremy,
	linux-kernel, Suresh Siddha

And that's the end of the mm/ code... yeah the mm/ parts seem OK to me,
with my minor comments addressed I would ack them all.

On Thu, Dec 18, 2008 at 11:41:30AM -0800, venkatesh.pallipadi@intel.com wrote:
> Hookup remap_pfn_range and vm_insert_pfn and corresponding copy and free
> routines with reserve and free tracking.


* Re: [patch 4/7] x86 PAT: Implement track/untrack of pfnmap regions for x86 - v3
  2008-12-18 21:38   ` Nick Piggin
@ 2008-12-18 21:40     ` H. Peter Anvin
  2008-12-18 21:46       ` Ingo Molnar
  0 siblings, 1 reply; 24+ messages in thread
From: H. Peter Anvin @ 2008-12-18 21:40 UTC (permalink / raw)
  To: Nick Piggin
  Cc: venkatesh.pallipadi, mingo, tglx, akpm, hugh, arjan, jbarnes,
	rdreier, jeremy, linux-kernel, Suresh Siddha

Nick Piggin wrote:
> And that's the end of the mm/ code... yeah the mm/ parts seem OK to me,
> with my minor comments addressed I would ack them all.

Great!

Venki, can this be done with incremental patches, or would that be too
messy?

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



* Re: [patch 4/7] x86 PAT: Implement track/untrack of pfnmap regions for x86 - v3
  2008-12-18 21:40     ` H. Peter Anvin
@ 2008-12-18 21:46       ` Ingo Molnar
  2008-12-18 21:53         ` Pallipadi, Venkatesh
  0 siblings, 1 reply; 24+ messages in thread
From: Ingo Molnar @ 2008-12-18 21:46 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Nick Piggin, venkatesh.pallipadi, tglx, akpm, hugh, arjan,
	jbarnes, rdreier, jeremy, linux-kernel, Suresh Siddha


* H. Peter Anvin <hpa@zytor.com> wrote:

> Nick Piggin wrote:
> > And that's the end of the mm/ code... yeah the mm/ parts seem OK to me,
> > with my minor comments addressed I would ack them all.
> 
> Great!
> 
> Venki, can this be done with incremental patches, or would that be too 
> messy?

would be nice to have them as deltas. The patches are pushed out into 
tip/master already - and we'll push it towards linux-next once Nick's 
concerns have been addressed.

	Ingo


* Re: [patch 7/7] x86 PAT: update documentation to cover pgprot and remap_pfn related changes - v3
  2008-12-18 21:13   ` Randy Dunlap
@ 2008-12-18 21:49     ` Pallipadi, Venkatesh
  2008-12-18 21:53       ` Randy Dunlap
  0 siblings, 1 reply; 24+ messages in thread
From: Pallipadi, Venkatesh @ 2008-12-18 21:49 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Pallipadi, Venkatesh, mingo, tglx, hpa, akpm, npiggin, hugh,
	arjan, jbarnes, rdreier, jeremy, linux-kernel, Siddha, Suresh B

On Thu, Dec 18, 2008 at 01:13:28PM -0800, Randy Dunlap wrote:
> venkatesh.pallipadi@intel.com wrote:
> 
> +Advanced APIs for drivers
> +-------------------------
> +A. Exporting pages to user with remap_pfn_range, io_remap_pfn_range,
> 
>                     to users
> or                  to userspace
> 
> +vm_insert_pfn
> +
> +Drivers wanting to export some pages to userspace, do it by using mmap
> 
> Drop comma.
> 
> +interface and a combination of
> +1) pgprot_noncached()
> +2) io_remap_pfn_range() or remap_pfn_range() or vm_insert_pfn()
> +
> +With pat support, a new API pgprot_writecombine is being added. So, driver can
> 
> s/pat/PAT/
> s/driver/drivers/ or s/driver/a driver/
> 
> +continue to use the above sequence, with either pgprot_noncached() or
> +pgprot_writecombine() in step 1, followed by step 2.
> +
> +In addition, step 2 internally tracks the region as UC or WC in memtype
> +list in order to ensure no conflicting mapping.
> +
> +Note that this set of APIs only work with IO (non RAM) regions. If driver
> 
>                                  works with IO (non-RAM) regions. If a driver
> 
> +wants to export RAM region, it has to do set_memory_uc() or set_memory_wc()
> 
>           export a RAM region,
> 
> +as step 0 above and also track the usage of those pages and use set_memory_wb()
> +before the page is freed to free pool.
> 
> 

refreshed patch below

Thanks,
Venki


Add documentation related to pgprot_* change.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---
 Documentation/x86/pat.txt |   24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

Index: linux-2.6/Documentation/x86/pat.txt
===================================================================
--- linux-2.6.orig/Documentation/x86/pat.txt	2008-12-17 15:01:50.000000000 -0800
+++ linux-2.6/Documentation/x86/pat.txt	2008-12-17 17:23:16.000000000 -0800
@@ -80,6 +80,30 @@ pci proc               |    --    |    -
                        |          |            |                  |
 -------------------------------------------------------------------
 
+Advanced APIs for drivers
+-------------------------
+A. Exporting pages to users with remap_pfn_range, io_remap_pfn_range,
+vm_insert_pfn
+
+Drivers wanting to export some pages to userspace do it by using mmap
+interface and a combination of
+1) pgprot_noncached()
+2) io_remap_pfn_range() or remap_pfn_range() or vm_insert_pfn()
+
+With pat support, a new API pgprot_writecombine is being added. So, drivers can
+continue to use the above sequence, with either pgprot_noncached() or
+pgprot_writecombine() in step 1, followed by step 2.
+
+In addition, step 2 internally tracks the region as UC or WC in memtype
+list in order to ensure no conflicting mapping.
+
+Note that this set of APIs only work with IO (non RAM) regions. If a driver
+wants to export a RAM region, it has to do set_memory_uc() or set_memory_wc()
+as step 0 above and also track the usage of those pages and use set_memory_wb()
+before the page is freed to free pool.
+
+
+
 Notes:
 
 -- in the above table mean "Not suggested usage for the API". Some of the --'s


* Re: [patch 4/7] x86 PAT: Implement track/untrack of pfnmap regions for x86 - v3
  2008-12-18 21:46       ` Ingo Molnar
@ 2008-12-18 21:53         ` Pallipadi, Venkatesh
  0 siblings, 0 replies; 24+ messages in thread
From: Pallipadi, Venkatesh @ 2008-12-18 21:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Nick Piggin, Pallipadi, Venkatesh, tglx, akpm,
	hugh, arjan, jbarnes, rdreier, jeremy, linux-kernel, Siddha,
	Suresh B

On Thu, Dec 18, 2008 at 01:46:22PM -0800, Ingo Molnar wrote:
> 
> * H. Peter Anvin <hpa@zytor.com> wrote:
> 
> > Nick Piggin wrote:
> > > And that's the end of the mm/ code... yeah the mm/ parts seem OK to me,
> > > with my minor comments addressed I would ack them all.
> >
> > Great!
> >
> > Venki, can this be done with incremental patches, or would that be too
> > messy?
> 
> would be nice to have them as deltas. The patches are pushed out into
> tip/master already - and we'll push it towards linux-next once Nick's
> concerns have been addressed.
> 

Yes. I will send incremental patches over v3 addressing Nick's comments.

Thanks,
Venki


* Re: [patch 7/7] x86 PAT: update documentation to cover pgprot and remap_pfn related changes - v3
  2008-12-18 21:49     ` Pallipadi, Venkatesh
@ 2008-12-18 21:53       ` Randy Dunlap
  2008-12-18 22:03         ` Pallipadi, Venkatesh
  0 siblings, 1 reply; 24+ messages in thread
From: Randy Dunlap @ 2008-12-18 21:53 UTC (permalink / raw)
  To: Pallipadi, Venkatesh
  Cc: mingo, tglx, hpa, akpm, npiggin, hugh, arjan, jbarnes, rdreier,
	jeremy, linux-kernel, Siddha, Suresh B

Pallipadi, Venkatesh wrote:
> On Thu, Dec 18, 2008 at 01:13:28PM -0800, Randy Dunlap wrote:
>> venkatesh.pallipadi@intel.com wrote:
>>
>> +Advanced APIs for drivers
>> +-------------------------
>> +A. Exporting pages to user with remap_pfn_range, io_remap_pfn_range,
>>
>>                     to users
>> or                  to userspace
>>
>> +vm_insert_pfn
>> +
>> +Drivers wanting to export some pages to userspace, do it by using mmap
>>
>> Drop comma.
>>
>> +interface and a combination of
>> +1) pgprot_noncached()
>> +2) io_remap_pfn_range() or remap_pfn_range() or vm_insert_pfn()
>> +
>> +With pat support, a new API pgprot_writecombine is being added. So, driver can
>>
>> s/pat/PAT/
>> s/driver/drivers/ or s/driver/a driver/
>>
>> +continue to use the above sequence, with either pgprot_noncached() or
>> +pgprot_writecombine() in step 1, followed by step 2.
>> +
>> +In addition, step 2 internally tracks the region as UC or WC in memtype
>> +list in order to ensure no conflicting mapping.
>> +
>> +Note that this set of APIs only work with IO (non RAM) regions. If driver
>>
>>                                  works with IO (non-RAM) regions. If a driver
>>
>> +wants to export RAM region, it has to do set_memory_uc() or set_memory_wc()
>>
>>           export a RAM region,
>>
>> +as step 0 above and also track the usage of those pages and use set_memory_wb()
>> +before the page is freed to free pool.
>>
>>
> 
> refreshed patch below

Some of them were changed, others not changed, with no explanation...
More changes below:


> Thanks,
> Venki
> 
> 
> Add documentation related to pgprot_* change.
> 
> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> ---
>  Documentation/x86/pat.txt |   24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
> 
> Index: linux-2.6/Documentation/x86/pat.txt
> ===================================================================
> --- linux-2.6.orig/Documentation/x86/pat.txt	2008-12-17 15:01:50.000000000 -0800
> +++ linux-2.6/Documentation/x86/pat.txt	2008-12-17 17:23:16.000000000 -0800
> @@ -80,6 +80,30 @@ pci proc               |    --    |    -
>                         |          |            |                  |
>  -------------------------------------------------------------------
>  
> +Advanced APIs for drivers
> +-------------------------
> +A. Exporting pages to users with remap_pfn_range, io_remap_pfn_range,
> +vm_insert_pfn
> +
> +Drivers wanting to export some pages to userspace do it by using mmap
> +interface and a combination of
> +1) pgprot_noncached()
> +2) io_remap_pfn_range() or remap_pfn_range() or vm_insert_pfn()
> +
> +With pat support, a new API pgprot_writecombine is being added. So, drivers can

        PAT (please)

> +continue to use the above sequence, with either pgprot_noncached() or
> +pgprot_writecombine() in step 1, followed by step 2.
> +
> +In addition, step 2 internally tracks the region as UC or WC in memtype
> +list in order to ensure no conflicting mapping.
> +
> +Note that this set of APIs only work with IO (non RAM) regions. If a driver

                              only works with IO (non-RAM)

> +wants to export a RAM region, it has to do set_memory_uc() or set_memory_wc()
> +as step 0 above and also track the usage of those pages and use set_memory_wb()
> +before the page is freed to free pool.


-- 
~Randy


* RE: [patch 7/7] x86 PAT: update documentation to cover pgprot and remap_pfn related changes - v3
  2008-12-18 21:53       ` Randy Dunlap
@ 2008-12-18 22:03         ` Pallipadi, Venkatesh
  0 siblings, 0 replies; 24+ messages in thread
From: Pallipadi, Venkatesh @ 2008-12-18 22:03 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: mingo, tglx, hpa, akpm, npiggin, hugh, arjan, jbarnes, rdreier,
	jeremy, linux-kernel, Siddha, Suresh B

 

>Pallipadi, Venkatesh wrote:
>> On Thu, Dec 18, 2008 at 01:13:28PM -0800, Randy Dunlap wrote:
>>> venkatesh.pallipadi@intel.com wrote:
>>>
>>> +Advanced APIs for drivers
>>> +-------------------------
>>> +A. Exporting pages to user with remap_pfn_range, 
>io_remap_pfn_range,
>>>
>>>                     to users
>>> or                  to userspace
>>>
>>> +vm_insert_pfn
>>> +
>>> +Drivers wanting to export some pages to userspace, do it 
>by using mmap
>>>
>>> Drop comma.
>>>
>>> +interface and a combination of
>>> +1) pgprot_noncached()
>>> +2) io_remap_pfn_range() or remap_pfn_range() or vm_insert_pfn()
>>> +
>>> +With pat support, a new API pgprot_writecombine is being 
>added. So, driver can
>>>
>>> s/pat/PAT/
>>> s/driver/drivers/ or s/driver/a driver/
>>>
>>> +continue to use the above sequence, with either 
>pgprot_noncached() or
>>> +pgprot_writecombine() in step 1, followed by step 2.
>>> +
>>> +In addition, step 2 internally tracks the region as UC or 
>WC in memtype
>>> +list in order to ensure no conflicting mapping.
>>> +
>>> +Note that this set of APIs only work with IO (non RAM) 
>regions. If driver
>>>
>>>                                  works with IO (non-RAM) 
>regions. If a driver
>>>
>>> +wants to export RAM region, it has to do set_memory_uc() 
>or set_memory_wc()
>>>
>>>           export a RAM region,
>>>
>>> +as step 0 above and also track the usage of those pages 
>and use set_memory_wb()
>>> +before the page is freed to free pool.
>>>
>>>
>>
>> refreshed patch below
>
>Some of them were changed, others not changed, with no explanation...
>More changes below:

Sorry. Not sure how I missed it :( (note to self - do not send
patches soon after lunch). Will promise to do all the changes
as an incremental patch along with other comments.

Thanks,
Venki



* Re: [patch 1/7] x86 PAT: store vm_pgoff for all linear_over_vma_region mappings - v3
  2008-12-18 21:27   ` Nick Piggin
@ 2008-12-18 22:10     ` Pallipadi, Venkatesh
  2008-12-18 22:33       ` Nick Piggin
  0 siblings, 1 reply; 24+ messages in thread
From: Pallipadi, Venkatesh @ 2008-12-18 22:10 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pallipadi, Venkatesh, mingo, tglx, hpa, akpm, hugh, arjan,
	jbarnes, rdreier, jeremy, linux-kernel, Siddha, Suresh B

On Thu, Dec 18, 2008 at 01:27:28PM -0800, Nick Piggin wrote:
> On Thu, Dec 18, 2008 at 11:41:27AM -0800, venkatesh.pallipadi@intel.com wrote:
> > Drivers use mmap followed by pgprot_* and remap_pfn_range or vm_insert_pfn
> > in order to export reserved memory to userspace. Currently, such mappings are
> > not tracked and hence not kept consistent with other mappings (/dev/mem,
> > pci resource, ioremap) for the same memory that may exist in the system.
> > 
> > The following patchset adds x86 PAT attribute tracking and untracking for
> > pfnmap related APIs.
> > 
> > The first three patches in the patchset change the generic mm code to fit
> > in this tracking. The last four patches are x86 specific, making things work
> > with the x86 PAT code. The patchset also introduces the pgprot_writecombine
> > interface, which gives a write-combining mapping when enabled, falling back
> > to pgprot_noncached otherwise.
> > 
> > This patch:
> > 
> > While working on x86 PAT, we faced some hurdles with tracking
> > remap_pfn_range() regions, as we do not have any information to say
> > whether a PFNMAP mapping is linear over the entire vma range or
> > consists of smaller-granularity regions within the vma.
> > 
> > A simple solution to this is to use vm_pgoff as an indicator of a
> > linear mapping over the vma region. Currently, remap_pfn_range
> > only sets vm_pgoff for COW mappings. The patch below changes the
> > logic and sets vm_pgoff irrespective of COW. This will still not
> > be enough for the case where the pfn is zero (vma region mapped to
> > physical address zero). But for all other cases, we can look at
> > pfnmap VMAs and say whether the mapping covers the entire vma region
> > or not.
> > 
> > Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
> > Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> > 
> > ---
> >  include/linux/mm.h |    9 +++++++++
> >  mm/memory.c        |    7 +++----
> >  2 files changed, 12 insertions(+), 4 deletions(-)
> > 
> > Index: linux-2.6/mm/memory.c
> > ===================================================================
> > --- linux-2.6.orig/mm/memory.c	2008-12-17 17:24:31.000000000 -0800
> > +++ linux-2.6/mm/memory.c	2008-12-18 10:10:46.000000000 -0800
> > @@ -1575,11 +1575,10 @@ int remap_pfn_range(struct vm_area_struc
> >  	 * behaviour that some programs depend on. We mark the "original"
> >  	 * un-COW'ed pages by matching them up with "vma->vm_pgoff".
> >  	 */
> > -	if (is_cow_mapping(vma->vm_flags)) {
> > -		if (addr != vma->vm_start || end != vma->vm_end)
> > -			return -EINVAL;
> > +	if (addr == vma->vm_start && end == vma->vm_end)
> >  		vma->vm_pgoff = pfn;
> > -	}
> > +	else if (is_cow_mapping(vma->vm_flags))
> > +		return -EINVAL;
> >  
> >  	vma->vm_flags |= VM_IO | VM_RESERVED | VM_PFNMAP;
> >  
> > Index: linux-2.6/include/linux/mm.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/mm.h	2008-12-17 17:24:31.000000000 -0800
> > +++ linux-2.6/include/linux/mm.h	2008-12-18 10:10:46.000000000 -0800
> > @@ -145,6 +145,15 @@ extern pgprot_t protection_map[16];
> >  #define FAULT_FLAG_WRITE	0x01	/* Fault was a write access */
> >  #define FAULT_FLAG_NONLINEAR	0x02	/* Fault was via a nonlinear mapping */
> >  
> > +static inline int is_linear_pfn_mapping(struct vm_area_struct *vma)
> > +{
> > +	return ((vma->vm_flags & VM_PFNMAP) && vma->vm_pgoff);
> > +}
> > +
> > +static inline int is_pfn_mapping(struct vm_area_struct *vma)
> > +{
> > +	return (vma->vm_flags & VM_PFNMAP);
> > +}
> >  
> >  /*
> >   * vm_fault is filled by the the pagefault handler and passed to the vma's
> 
> This is fine by me, however:
> 1. Can you add some comments to say "this is not for core vm but for pat,
>    oh and a pgoff of zero is not going to work".

OK. Will add comments about both the points.

> 2. Can you please justify to me (or the changelog) roughly why PAT wants
>    to know if the mapping is linear or not? Presumably it has to handle
>    both types? If performance wasn't an issue, then you could manually scan
>    the ptes to verify (which would solve your zero-offset bug). etc.

The main reason is performance. If we know it is linear, we can track the entire
region as one block and do the reserve/free for the entire region. But if it is
not linear, then we have to reserve the memtype of the physical addresses page
by page. This is not optimal, as it makes reserve and free slower. Almost all
users that we find in the kernel today (at least on x86) are linear.

Thanks,
Venki



* Re: [patch 2/7] x86 PAT: Add follow_pfnmp_pte routine to help tracking pfnmap pages - v3
  2008-12-18 21:31   ` Nick Piggin
@ 2008-12-18 22:15     ` Pallipadi, Venkatesh
  0 siblings, 0 replies; 24+ messages in thread
From: Pallipadi, Venkatesh @ 2008-12-18 22:15 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pallipadi, Venkatesh, mingo, tglx, hpa, akpm, hugh, arjan,
	jbarnes, rdreier, jeremy, linux-kernel, Siddha, Suresh B

On Thu, Dec 18, 2008 at 01:31:12PM -0800, Nick Piggin wrote:
> On Thu, Dec 18, 2008 at 11:41:28AM -0800, venkatesh.pallipadi@intel.com wrote:
> > Add a generic interface to follow a pfn in a pfnmap vma range. This is used by
> > one of the subsequent x86 PAT related patches to keep track of memory types
> > for vma regions across vma copy and free.
> > 
> > Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
> > Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
> 
> Can you please reuse follow_phys for this? (preferably use the same API
> even if it requires some modification, otherwise if not possible, then
> at least can you implement a common core for both APIs).
> 

Yes. Hadn't noticed the presence of follow_phys before. Will send a patch to
handle this.

Thanks,
Venki


* Re: [patch 3/7] x86 PAT: hooks in generic vm code to help archs to track pfnmap regions - v3
  2008-12-18 21:35   ` Nick Piggin
@ 2008-12-18 22:23     ` Pallipadi, Venkatesh
  0 siblings, 0 replies; 24+ messages in thread
From: Pallipadi, Venkatesh @ 2008-12-18 22:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pallipadi, Venkatesh, mingo, tglx, hpa, akpm, hugh, arjan,
	jbarnes, rdreier, jeremy, linux-kernel, Siddha, Suresh B

On Thu, Dec 18, 2008 at 01:35:57PM -0800, Nick Piggin wrote:
> On Thu, Dec 18, 2008 at 11:41:29AM -0800, venkatesh.pallipadi@intel.com wrote:
> > Introduce generic hooks in remap_pfn_range and vm_insert_pfn and
> > corresponding copy and free routines with reserve and free tracking.
> 
> These should be inline so that they can be folded out (I'm sure gcc
> with -Os and "optimize" inlining will do something stupid here).
> Also, the normal way to add such arch hooks is to put the default
> into asm-generic and have other archs include it... that would be
> nicer than sticking it into mm/memory.c wouldn't it?

I did check that these calls were optimized by gcc when there is no arch
specific definitions. But, as you pointed out, it should be cleaner to put this
in asm-generic, though I may have to touch more files.

> Sigh, fork/exit paths slow down yet again. But oh well. Maybe can
> you add some branch hints?

OK. Will add branch hints for these..

Thanks,
Venki



* Re: [patch 1/7] x86 PAT: store vm_pgoff for all linear_over_vma_region mappings - v3
  2008-12-18 22:10     ` Pallipadi, Venkatesh
@ 2008-12-18 22:33       ` Nick Piggin
  0 siblings, 0 replies; 24+ messages in thread
From: Nick Piggin @ 2008-12-18 22:33 UTC (permalink / raw)
  To: Pallipadi, Venkatesh
  Cc: mingo, tglx, hpa, akpm, hugh, arjan, jbarnes, rdreier, jeremy,
	linux-kernel, Siddha, Suresh B

On Thu, Dec 18, 2008 at 02:10:57PM -0800, Pallipadi, Venkatesh wrote:
> On Thu, Dec 18, 2008 at 01:27:28PM -0800, Nick Piggin wrote:
> > 
> > This is fine by me, however:
> > 1. Can you add some comments to say "this is not for core vm but for pat,
> >    oh and a pgoff of zero is not going to work".
> 
> OK. Will add comments about both the points.
> 
> > 2. Can you please justify to me (or the changelog) roughly why PAT wants
> >    to know if the mapping is linear or not? Presumably it has to handle
> >    both types? If performance wasn't an issue, then you could manually scan
> >    the ptes to verify (which would solve your zero-offset bug). etc.
> 
> The main reason is performance. If we know it is linear, we can track the entire
> region as one block and do the reserve/free for the entire region. But if it is
> not linear, then we have to reserve the memtype of the physical addresses page
> by page. This is not optimal, as it makes reserve and free slower. Almost all
> users that we find in the kernel today (at least on x86) are linear.

OK, so it is not a bug to miss the zero pgoff case then. That's good
to know and should be added to comments.



end of thread, other threads:[~2008-12-18 22:33 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-12-18 19:41 [patch 0/7] x86 PAT: track pfnmap mappings with remap_pfn_range vm_insert_pfn - v3 venkatesh.pallipadi
2008-12-18 19:41 ` [patch 1/7] x86 PAT: store vm_pgoff for all linear_over_vma_region mappings " venkatesh.pallipadi
2008-12-18 21:27   ` Nick Piggin
2008-12-18 22:10     ` Pallipadi, Venkatesh
2008-12-18 22:33       ` Nick Piggin
2008-12-18 19:41 ` [patch 2/7] x86 PAT: Add follow_pfnmp_pte routine to help tracking pfnmap pages " venkatesh.pallipadi
2008-12-18 21:31   ` Nick Piggin
2008-12-18 22:15     ` Pallipadi, Venkatesh
2008-12-18 19:41 ` [patch 3/7] x86 PAT: hooks in generic vm code to help archs to track pfnmap regions " venkatesh.pallipadi
2008-12-18 21:35   ` Nick Piggin
2008-12-18 22:23     ` Pallipadi, Venkatesh
2008-12-18 19:41 ` [patch 4/7] x86 PAT: Implement track/untrack of pfnmap regions for x86 " venkatesh.pallipadi
2008-12-18 21:38   ` Nick Piggin
2008-12-18 21:40     ` H. Peter Anvin
2008-12-18 21:46       ` Ingo Molnar
2008-12-18 21:53         ` Pallipadi, Venkatesh
2008-12-18 19:41 ` [patch 5/7] x86 PAT: change pgprot_noncached to uc_minus instead of strong uc " venkatesh.pallipadi
2008-12-18 19:41 ` [patch 6/7] x86 PAT: add pgprot_writecombine() interface for drivers " venkatesh.pallipadi
2008-12-18 19:41 ` [patch 7/7] x86 PAT: update documentation to cover pgprot and remap_pfn related changes " venkatesh.pallipadi
2008-12-18 21:13   ` Randy Dunlap
2008-12-18 21:49     ` Pallipadi, Venkatesh
2008-12-18 21:53       ` Randy Dunlap
2008-12-18 22:03         ` Pallipadi, Venkatesh
2008-12-18 21:17 ` [patch 0/7] x86 PAT: track pfnmap mappings with remap_pfn_range vm_insert_pfn " H. Peter Anvin
