* [PATCH -V7 00/10] THP support for PPC64 (Patchset 2)
@ 2013-04-28 19:51 Aneesh Kumar K.V
  2013-04-28 19:51 ` [PATCH -V7 01/10] powerpc/THP: Double the PMD table size for THP Aneesh Kumar K.V
                   ` (9 more replies)
  0 siblings, 10 replies; 34+ messages in thread
From: Aneesh Kumar K.V @ 2013-04-28 19:51 UTC (permalink / raw)
  To: benh, paulus, dwg, linux-mm; +Cc: linuxppc-dev

Hi,

This is the second patchset needed to support THP on ppc64. Some of the changes
included in this series are tricky in that they subtly change the powerpc linux
page table walk. We also overload a few of the pte flags for ptes at the PMD level
(huge page PTEs). This patchset requires closer review before merging upstream.

I have split the patch series into two patchsets, so that we can look at getting
the prerequisite patches upstream in 3.10.

Some numbers:

The latency measurement code from Anton can be found at
http://ozlabs.org/~anton/junkcode/latency2001.c

64K page size (With THP support)
--------------------------
[root@llmp24l02 test]# ./latency2001 8G
 8589934592    428.49 cycles    120.50 ns
[root@llmp24l02 test]# ./latency2001 -l 8G
 8589934592    471.16 cycles    132.50 ns
[root@llmp24l02 test]# echo never > /sys/kernel/mm/transparent_hugepage/enabled 
[root@llmp24l02 test]# ./latency2001 8G
 8589934592    766.52 cycles    215.56 ns
[root@llmp24l02 test]# 

4K page size (No THP support for 4K)
----------------------------
[root@llmp24l02 test]# ./latency2001 8G
 8589934592    814.88 cycles    229.16 ns
[root@llmp24l02 test]# ./latency2001 -l 8G
 8589934592    463.69 cycles    130.40 ns
[root@llmp24l02 test]# 

We are close to hugetlbfs in latency, and we can achieve this with zero
configuration or page reservation. Most of the allocations above are fault allocated.

Another test that does 50,000,000 random accesses over a 1GB area goes from
2.65 seconds to 1.07 seconds with this patchset.
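
A rough sketch of such a random-access test is below (the actual test program
is not part of this mail, so the sizes and structure are assumptions):

	/* Illustrative only: 50,000,000 random byte accesses over a 1GB area */
	#include <stdlib.h>

	int main(void)
	{
		unsigned long i, size = 1UL << 30;
		volatile char *data = malloc(size);

		if (!data)
			return 1;
		srandom(0);
		for (i = 0; i < 50000000UL; i++)
			data[random() % size] += 1;
		return 0;
	}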

split_huge_page impact:
---------------------
To look at the performance impact of large page invalidation, I tried the
experiment below. The test involves accessing a large contiguous region of
memory as follows:

    for (i = 0; i < size; i += PAGE_SIZE)
	data[i] = i;

We wanted to access the data in sequential order so that we look at the
worst case THP performance. Accessing the data in sequential order implies
we have the page table cached and the overhead of TLB misses is as minimal as
possible. We also don't touch the entire page, because that can result in
cache eviction.

After we have touched the full range as above, we call mprotect on each
of those pages. An mprotect will result in a hugepage split. This should
allow us to measure the impact of a hugepage split.

    for (i = 0; i < size; i += PAGE_SIZE)
	 mprotect(&data[i], PAGE_SIZE, PROT_READ);
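
Below is a minimal, self-contained sketch of the experiment; the actual
split-huge-page-mpro program used for the numbers that follow is not included
in this mail, so the mapping flags, argument handling and 64K base page size
are assumptions:

	/*
	 * Illustrative sketch: touch every base page of a large anonymous
	 * mapping (THP backed when enabled), then mprotect each base page
	 * to force hugepage splits.
	 */
	#include <stdio.h>
	#include <stdlib.h>
	#include <time.h>
	#include <sys/mman.h>

	#define PAGE_SIZE	65536UL		/* 64K base pages assumed */

	int main(int argc, char **argv)
	{
		unsigned long i, size = (argc > 1) ?
			strtoul(argv[1], NULL, 0) : (1UL << 30);
		struct timespec t1, t2;
		char *data;

		data = mmap(NULL, size, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (data == MAP_FAILED)
			return 1;

		clock_gettime(CLOCK_MONOTONIC, &t1);
		for (i = 0; i < size; i += PAGE_SIZE)	/* fault in the range */
			data[i] = i;
		clock_gettime(CLOCK_MONOTONIC, &t2);
		printf("time taken to touch all the data in ns: %lld\n",
		       (long long)(t2.tv_sec - t1.tv_sec) * 1000000000LL +
		       (long long)(t2.tv_nsec - t1.tv_nsec));

		for (i = 0; i < size; i += PAGE_SIZE)	/* split the hugepages */
			mprotect(&data[i], PAGE_SIZE, PROT_READ);
		return 0;
	}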

Split hugepage impact: 
---------------------
THP enabled: 2.851561705 seconds for test completion
THP disable: 3.599146098 seconds for test completion

We are 20.7% better than the non-THP case even when all of the large pages are split.

Detailed output:

THP enabled:
---------------------------------------
[root@llmp24l02 ~]# cat /proc/vmstat  | grep thp
thp_fault_alloc 0
thp_fault_fallback 0
thp_collapse_alloc 0
thp_collapse_alloc_failed 0
thp_split 0
thp_zero_page_alloc 0
thp_zero_page_alloc_failed 0
[root@llmp24l02 ~]# /root/thp/tools/perf/perf stat -e page-faults,dTLB-load-misses ./split-huge-page-mpro 20G                                                                      
time taken to touch all the data in ns: 2763096913 

 Performance counter stats for './split-huge-page-mpro 20G':

             1,581 page-faults                                                 
             3,159 dTLB-load-misses                                            

       2.851561705 seconds time elapsed

[root@llmp24l02 ~]# 
[root@llmp24l02 ~]# cat /proc/vmstat  | grep thp
thp_fault_alloc 1279
thp_fault_fallback 0
thp_collapse_alloc 0
thp_collapse_alloc_failed 0
thp_split 1279
thp_zero_page_alloc 0
thp_zero_page_alloc_failed 0
[root@llmp24l02 ~]# 

    77.05%  split-huge-page  [kernel.kallsyms]     [k] .clear_user_page                        
     7.10%  split-huge-page  [kernel.kallsyms]     [k] .perf_event_mmap_ctx                    
     1.51%  split-huge-page  split-huge-page-mpro  [.] 0x0000000000000a70                      
     0.96%  split-huge-page  [unknown]             [H] 0x000000000157e3bc                      
     0.81%  split-huge-page  [kernel.kallsyms]     [k] .up_write                               
     0.76%  split-huge-page  [kernel.kallsyms]     [k] .perf_event_mmap                        
     0.76%  split-huge-page  [kernel.kallsyms]     [k] .down_write                             
     0.74%  split-huge-page  [kernel.kallsyms]     [k] .lru_add_page_tail                      
     0.61%  split-huge-page  [kernel.kallsyms]     [k] .split_huge_page                        
     0.59%  split-huge-page  [kernel.kallsyms]     [k] .change_protection                      
     0.51%  split-huge-page  [kernel.kallsyms]     [k] .release_pages                          


     0.96%  split-huge-page  [unknown]             [H] 0x000000000157e3bc                      
            |          
            |--79.44%-- reloc_start
            |          |          
            |          |--86.54%-- .__pSeries_lpar_hugepage_invalidate
            |          |          .pSeries_lpar_hugepage_invalidate
            |          |          .hpte_need_hugepage_flush
            |          |          .split_huge_page
            |          |          .__split_huge_page_pmd
            |          |          .vma_adjust
            |          |          .vma_merge
            |          |          .mprotect_fixup
            |          |          .SyS_mprotect


THP disabled:
---------------
[root@llmp24l02 ~]# echo never > /sys/kernel/mm/transparent_hugepage/enabled
[root@llmp24l02 ~]# /root/thp/tools/perf/perf stat -e page-faults,dTLB-load-misses ./split-huge-page-mpro 20G
time taken to touch all the data in ns: 3513767220 

 Performance counter stats for './split-huge-page-mpro 20G':

          3,27,726 page-faults                                                 
          3,29,654 dTLB-load-misses                                            

       3.599146098 seconds time elapsed

[root@llmp24l02 ~]#

Changes from V6:
* split the patch series into two patchsets.
* Address review feedback.

Changes from V5:
* Address review comments
* Added a new patch to not use hugepd for explicit hugepages. Explicit hugepages
  now use a PTE format similar to transparent hugepages.
* We don't use page->_mapcount for tracking free PTE frags in a PTE page.
* rebased to a86d52667d8eda5de39393ce737794403bdce1eb
* Tested with libhugetlbfs test suite

Changes from V4:
* Fix bad page error in page_table_alloc
  BUG: Bad page state in process stream  pfn:f1a59
  page:f0000000034dc378 count:1 mapcount:0 mapping:          (null) index:0x0
  [c000000f322c77d0] [c00000000015e198] .bad_page+0xe8/0x140
  [c000000f322c7860] [c00000000015e3c4] .free_pages_prepare+0x1d4/0x1e0
  [c000000f322c7910] [c000000000160450] .free_hot_cold_page+0x50/0x230
  [c000000f322c79c0] [c00000000003ad18] .page_table_alloc+0x168/0x1c0

Changes from V3:
* PowerNV boot fixes

Changes from V2:
* Change patch "powerpc: Reduce PTE table memory wastage" to use much simpler approach
  for PTE page sharing.
* Changes to handle huge pages in KVM code.
* Address other review comments

Changes from V1:
* Address review comments
* More patch split
* Add batch hpte invalidate for hugepages.

Changes from RFC V2:
* Address review comments
* More code cleanup and patch split

Changes from RFC V1:
* HugeTLB fs now works
* Compile issues fixed
* rebased to v3.8
* Patch series reordered so that the ppc64 cleanups and MM THP changes are moved
  early in the series. This should help in picking those patches up early.

Thanks,
-aneesh


* [PATCH -V7 01/10] powerpc/THP: Double the PMD table size for THP
  2013-04-28 19:51 [PATCH -V7 00/10] THP support for PPC64 (Patchset 2) Aneesh Kumar K.V
@ 2013-04-28 19:51 ` Aneesh Kumar K.V
  2013-05-03  3:21   ` David Gibson
  2013-04-28 19:51 ` [PATCH -V7 02/10] powerpc/THP: Implement transparent hugepages for ppc64 Aneesh Kumar K.V
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Aneesh Kumar K.V @ 2013-04-28 19:51 UTC (permalink / raw)
  To: benh, paulus, dwg, linux-mm; +Cc: linuxppc-dev, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

THP code does PTE page allocation along with a large page request and deposits it
for later use. This is to ensure that we won't have any failures when we split
hugepages into regular pages.

On powerpc we want to use the deposited PTE page for storing the hash pte slot and
secondary bit information for the HPTEs. We use the second half
of the pmd table to save the deposited PTE page.
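
As a rough illustration of the layout this enables, here is a userspace
sketch (not kernel code; PTRS_PER_PMD and the sizes are assumptions for a
64K-page configuration):

	/*
	 * Sketch: a PMD table allocated at twice its normal size. The first
	 * half holds the pmd entries; the slot PTRS_PER_PMD entries above a
	 * given pmd stashes the pointer to its deposited PTE page.
	 */
	#include <stdio.h>
	#include <stdlib.h>

	#define PTRS_PER_PMD	4096UL		/* assumed value, for illustration */

	int main(void)
	{
		unsigned long *pmd_table = calloc(2 * PTRS_PER_PMD, sizeof(*pmd_table));
		unsigned long *pmdp = &pmd_table[10];	/* some pmd entry */
		void *deposited = malloc(4096);		/* stand-in for the PTE page */

		if (!pmd_table || !deposited)
			return 1;
		/* deposit: store the pointer in the second half of the table */
		*(pmdp + PTRS_PER_PMD) = (unsigned long)deposited;

		/* withdraw: read it back later, e.g. on hugepage split */
		printf("withdrew %p\n", (void *)*(pmdp + PTRS_PER_PMD));

		free(deposited);
		free(pmd_table);
		return 0;
	}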

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pgalloc-64.h    | 6 +++---
 arch/powerpc/include/asm/pgtable-ppc64.h | 6 +++++-
 arch/powerpc/mm/init_64.c                | 9 ++++++---
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc-64.h b/arch/powerpc/include/asm/pgalloc-64.h
index 91acb12..c756463 100644
--- a/arch/powerpc/include/asm/pgalloc-64.h
+++ b/arch/powerpc/include/asm/pgalloc-64.h
@@ -221,17 +221,17 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return kmem_cache_alloc(PGT_CACHE(PMD_INDEX_SIZE),
+	return kmem_cache_alloc(PGT_CACHE(PMD_CACHE_INDEX),
 				GFP_KERNEL|__GFP_REPEAT);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 {
-	kmem_cache_free(PGT_CACHE(PMD_INDEX_SIZE), pmd);
+	kmem_cache_free(PGT_CACHE(PMD_CACHE_INDEX), pmd);
 }
 
 #define __pmd_free_tlb(tlb, pmd, addr)		      \
-	pgtable_free_tlb(tlb, pmd, PMD_INDEX_SIZE)
+	pgtable_free_tlb(tlb, pmd, PMD_CACHE_INDEX)
 #ifndef CONFIG_PPC_64K_PAGES
 #define __pud_free_tlb(tlb, pud, addr)		      \
 	pgtable_free_tlb(tlb, pud, PUD_INDEX_SIZE)
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index e3d55f6f..ab84332 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -20,7 +20,11 @@
                 	    PUD_INDEX_SIZE + PGD_INDEX_SIZE + PAGE_SHIFT)
 #define PGTABLE_RANGE (ASM_CONST(1) << PGTABLE_EADDR_SIZE)
 
-
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define PMD_CACHE_INDEX	(PMD_INDEX_SIZE + 1)
+#else
+#define PMD_CACHE_INDEX	PMD_INDEX_SIZE
+#endif
 /*
  * Define the address range of the kernel non-linear virtual area
  */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a56de85..97f741d 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -88,7 +88,11 @@ static void pgd_ctor(void *addr)
 
 static void pmd_ctor(void *addr)
 {
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	memset(addr, 0, PMD_TABLE_SIZE * 2);
+#else
 	memset(addr, 0, PMD_TABLE_SIZE);
+#endif
 }
 
 struct kmem_cache *pgtable_cache[MAX_PGTABLE_INDEX_SIZE];
@@ -137,10 +141,9 @@ void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
 void pgtable_cache_init(void)
 {
 	pgtable_cache_add(PGD_INDEX_SIZE, pgd_ctor);
-	pgtable_cache_add(PMD_INDEX_SIZE, pmd_ctor);
-	if (!PGT_CACHE(PGD_INDEX_SIZE) || !PGT_CACHE(PMD_INDEX_SIZE))
+	pgtable_cache_add(PMD_CACHE_INDEX, pmd_ctor);
+	if (!PGT_CACHE(PGD_INDEX_SIZE) || !PGT_CACHE(PMD_CACHE_INDEX))
 		panic("Couldn't allocate pgtable caches");
-
 	/* In all current configs, when the PUD index exists it's the
 	 * same size as either the pgd or pmd index.  Verify that the
 	 * initialization above has also created a PUD cache.  This
-- 
1.8.1.2


* [PATCH -V7 02/10] powerpc/THP: Implement transparent hugepages for ppc64
  2013-04-28 19:51 [PATCH -V7 00/10] THP support for PPC64 (Patchset 2) Aneesh Kumar K.V
  2013-04-28 19:51 ` [PATCH -V7 01/10] powerpc/THP: Double the PMD table size for THP Aneesh Kumar K.V
@ 2013-04-28 19:51 ` Aneesh Kumar K.V
  2013-05-03  4:52   ` David Gibson
  2013-04-28 19:51 ` [PATCH -V7 03/10] powerpc: move find_linux_pte_or_hugepte and gup_hugepte to common code Aneesh Kumar K.V
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Aneesh Kumar K.V @ 2013-04-28 19:51 UTC (permalink / raw)
  To: benh, paulus, dwg, linux-mm; +Cc: linuxppc-dev, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

We now have pmd entries covering a 16MB range, and the PMD table is double its original size.
We use the second half of the PMD table to deposit the pgtable (PTE page).
The deposited PTE page is further used to track the HPTE information. The information
includes [ secondary group | 3 bit hidx | valid ]. We use one byte per HPTE entry.
With a 16MB hugepage and 64K HPTEs we need 256 entries, and with 4K HPTEs we need
4096 entries. Both will fit in a 4K PTE page. On hugepage invalidation we need to walk
the PTE page and invalidate all valid HPTEs.

This patch implements the necessary arch specific functions for THP support and also
the hugepage invalidate logic. These PMD related functions are intentionally kept
similar to their PTE counterparts.
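
For clarity, a small userspace sketch of the per-HPTE byte format described
above, as it is consumed on invalidate (the helper name and macro values are
assumptions; only the bit layout follows this patch):

	/*
	 * One byte per HPTE entry in the deposited PTE page:
	 *   bit 0    : valid
	 *   bits 1-3 : slot index within the hash group (hidx)
	 *   bit 4    : secondary hash group
	 */
	#include <stdio.h>

	#define HPTE_VALID		0x1
	#define PTEIDX_SECONDARY	0x8	/* assumed, mirrors _PTEIDX_SECONDARY */
	#define PTEIDX_GROUP_IX		0x7	/* assumed, mirrors _PTEIDX_GROUP_IX */

	static void decode_hpte_slot(unsigned char entry)
	{
		unsigned long hidx;

		if (!(entry & HPTE_VALID)) {
			printf("not hashed yet\n");
			return;
		}
		hidx = entry >> 1;
		printf("secondary=%lu slot-in-group=%lu\n",
		       (hidx & PTEIDX_SECONDARY) ? 1UL : 0UL,
		       hidx & PTEIDX_GROUP_IX);
	}

	int main(void)
	{
		decode_hpte_slot(0x00);			/* invalid entry */
		decode_hpte_slot((5UL << 1) | 1);	/* primary group, slot 5 */
		decode_hpte_slot((0xbUL << 1) | 1);	/* secondary group, slot 3 */
		return 0;
	}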

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/page.h              |  11 +-
 arch/powerpc/include/asm/pgtable-ppc64-64k.h |   3 +-
 arch/powerpc/include/asm/pgtable-ppc64.h     | 259 +++++++++++++++++++++-
 arch/powerpc/include/asm/pgtable.h           |   5 +
 arch/powerpc/include/asm/pte-hash64-64k.h    |  17 ++
 arch/powerpc/mm/pgtable_64.c                 | 318 +++++++++++++++++++++++++++
 arch/powerpc/platforms/Kconfig.cputype       |   1 +
 7 files changed, 611 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index 988c812..cbf4be7 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -37,8 +37,17 @@
 #define PAGE_SIZE		(ASM_CONST(1) << PAGE_SHIFT)
 
 #ifndef __ASSEMBLY__
-#ifdef CONFIG_HUGETLB_PAGE
+/*
+ * With hugetlbfs enabled we allow HPAGE_SHIFT to be runtime
+ * configurable. But we enable THP only with 16MB hugepages.
+ * With only THP configured, we force the hugepage size to 16MB.
+ * This should ensure that all subarchs that don't support
+ * THP continue to work fine with HPAGE_SHIFT usage.
+ */
+#if defined(CONFIG_HUGETLB_PAGE)
 extern unsigned int HPAGE_SHIFT;
+#elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
+#define HPAGE_SHIFT PMD_SHIFT
 #else
 #define HPAGE_SHIFT PAGE_SHIFT
 #endif
diff --git a/arch/powerpc/include/asm/pgtable-ppc64-64k.h b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
index 45142d6..a56b82f 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64-64k.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
@@ -33,7 +33,8 @@
 #define PGDIR_MASK	(~(PGDIR_SIZE-1))
 
 /* Bits to mask out from a PMD to get to the PTE page */
-#define PMD_MASKED_BITS		0x1ff
+/* PMDs point to PTE table fragments which are 4K aligned.  */
+#define PMD_MASKED_BITS		0xfff
 /* Bits to mask out from a PGD/PUD to get to the PMD page */
 #define PUD_MASKED_BITS		0x1ff
 
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index ab84332..20133c1 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -154,7 +154,7 @@
 #define	pmd_present(pmd)	(pmd_val(pmd) != 0)
 #define	pmd_clear(pmdp)		(pmd_val(*(pmdp)) = 0)
 #define pmd_page_vaddr(pmd)	(pmd_val(pmd) & ~PMD_MASKED_BITS)
-#define pmd_page(pmd)		virt_to_page(pmd_page_vaddr(pmd))
+extern struct page *pmd_page(pmd_t pmd);
 
 #define pud_set(pudp, pudval)	(pud_val(*(pudp)) = (pudval))
 #define pud_none(pud)		(!pud_val(pud))
@@ -382,4 +382,261 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
 
 #endif /* __ASSEMBLY__ */
 
+#ifndef _PAGE_SPLITTING
+/*
+ * THP pages can't be special. So use the _PAGE_SPECIAL
+ */
+#define _PAGE_SPLITTING _PAGE_SPECIAL
+#endif
+
+#ifndef _PAGE_THP_HUGE
+/*
+ * We need to differentiate between an explicit huge page and a THP huge
+ * page, since a THP huge page also needs to track real subpage details.
+ * We use the _PAGE_COMBO bit here as a dummy for platforms that don't
+ * support THP.
+ */
+#define _PAGE_THP_HUGE  0x10000000
+#endif
+
+/*
+ * PTE flags to conserve for HPTE identification for THP page.
+ */
+#ifndef _PAGE_THP_HPTEFLAGS
+#define _PAGE_THP_HPTEFLAGS	(_PAGE_BUSY | _PAGE_HASHPTE)
+#endif
+
+#define HUGE_PAGE_SIZE		(ASM_CONST(1) << 24)
+#define HUGE_PAGE_MASK		(~(HUGE_PAGE_SIZE - 1))
+
+/*
+ * set of bits not changed in pmd_modify.
+ */
+#define _HPAGE_CHG_MASK (PTE_RPN_MASK | _PAGE_THP_HPTEFLAGS | \
+			 _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_THP_HUGE)
+
+#ifndef __ASSEMBLY__
+extern void hpte_need_hugepage_flush(struct mm_struct *mm, unsigned long addr,
+				     pmd_t *pmdp);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
+extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
+extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot);
+extern void set_pmd_at(struct mm_struct *mm, unsigned long addr,
+		       pmd_t *pmdp, pmd_t pmd);
+extern void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
+				 pmd_t *pmd);
+
+static inline int pmd_trans_huge(pmd_t pmd)
+{
+	/*
+	 * leaf pte for huge page, bottom two bits != 00
+	 */
+	return (pmd_val(pmd) & 0x3) && (pmd_val(pmd) & _PAGE_THP_HUGE);
+}
+
+static inline int pmd_large(pmd_t pmd)
+{
+	/*
+	 * leaf pte for huge page, bottom two bits != 00
+	 */
+	if (pmd_trans_huge(pmd))
+		return pmd_val(pmd) & _PAGE_PRESENT;
+	return 0;
+}
+
+static inline int pmd_trans_splitting(pmd_t pmd)
+{
+	if (pmd_trans_huge(pmd))
+		return pmd_val(pmd) & _PAGE_SPLITTING;
+	return 0;
+}
+
+
+static inline unsigned long pmd_pfn(pmd_t pmd)
+{
+	/*
+	 * Only called for hugepage pmd
+	 */
+	return pmd_val(pmd) >> PTE_RPN_SHIFT;
+}
+
+/* We will enable it in the last patch */
+#define has_transparent_hugepage() 0
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+static inline int pmd_young(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_ACCESSED;
+}
+
+static inline pmd_t pmd_mkhuge(pmd_t pmd)
+{
+	/* Do nothing, mk_pmd() does this part.  */
+	return pmd;
+}
+
+#define __HAVE_ARCH_PMD_WRITE
+static inline int pmd_write(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_RW;
+}
+
+static inline pmd_t pmd_mkold(pmd_t pmd)
+{
+	pmd_val(pmd) &= ~_PAGE_ACCESSED;
+	return pmd;
+}
+
+static inline pmd_t pmd_wrprotect(pmd_t pmd)
+{
+	pmd_val(pmd) &= ~_PAGE_RW;
+	return pmd;
+}
+
+static inline pmd_t pmd_mkdirty(pmd_t pmd)
+{
+	pmd_val(pmd) |= _PAGE_DIRTY;
+	return pmd;
+}
+
+static inline pmd_t pmd_mkyoung(pmd_t pmd)
+{
+	pmd_val(pmd) |= _PAGE_ACCESSED;
+	return pmd;
+}
+
+static inline pmd_t pmd_mkwrite(pmd_t pmd)
+{
+	pmd_val(pmd) |= _PAGE_RW;
+	return pmd;
+}
+
+static inline pmd_t pmd_mknotpresent(pmd_t pmd)
+{
+	pmd_val(pmd) &= ~_PAGE_PRESENT;
+	return pmd;
+}
+
+static inline pmd_t pmd_mksplitting(pmd_t pmd)
+{
+	pmd_val(pmd) |= _PAGE_SPLITTING;
+	return pmd;
+}
+
+/*
+ * Set the dirty and/or accessed bits atomically in a linux hugepage PMD, this
+ * function doesn't need to flush the hash entry
+ */
+static inline void __pmdp_set_access_flags(pmd_t *pmdp, pmd_t entry)
+{
+	unsigned long bits = pmd_val(entry) & (_PAGE_DIRTY |
+					       _PAGE_ACCESSED |
+					       _PAGE_RW | _PAGE_EXEC);
+#ifdef PTE_ATOMIC_UPDATES
+	unsigned long old, tmp;
+
+	__asm__ __volatile__(
+	"1:	ldarx	%0,0,%4\n\
+		andi.	%1,%0,%6\n\
+		bne-	1b \n\
+		or	%0,%3,%0\n\
+		stdcx.	%0,0,%4\n\
+		bne-	1b"
+	:"=&r" (old), "=&r" (tmp), "=m" (*pmdp)
+	:"r" (bits), "r" (pmdp), "m" (*pmdp), "i" (_PAGE_BUSY)
+	:"cc");
+#else
+	unsigned long old = pmd_val(*pmdp);
+	*pmdp = __pmd(old | bits);
+#endif
+}
+
+#define __HAVE_ARCH_PMD_SAME
+static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
+{
+	return (((pmd_val(pmd_a) ^ pmd_val(pmd_b)) & ~_PAGE_THP_HPTEFLAGS) == 0);
+}
+
+#define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
+extern int pmdp_set_access_flags(struct vm_area_struct *vma,
+				 unsigned long address, pmd_t *pmdp,
+				 pmd_t entry, int dirty);
+
+static inline unsigned long pmd_hugepage_update(struct mm_struct *mm,
+						unsigned long addr,
+						pmd_t *pmdp, unsigned long clr)
+{
+#ifdef PTE_ATOMIC_UPDATES
+	unsigned long old, tmp;
+
+	__asm__ __volatile__(
+	"1:	ldarx	%0,0,%3\n\
+		andi.	%1,%0,%6\n\
+		bne-	1b \n\
+		andc	%1,%0,%4 \n\
+		stdcx.	%1,0,%3 \n\
+		bne-	1b"
+	: "=&r" (old), "=&r" (tmp), "=m" (*pmdp)
+	: "r" (pmdp), "r" (clr), "m" (*pmdp), "i" (_PAGE_BUSY)
+	: "cc" );
+#else
+	unsigned long old = pmd_val(*pmdp);
+	*pmdp = __pmd(old & ~clr);
+#endif
+
+#ifdef CONFIG_PPC_STD_MMU_64
+	if (old & _PAGE_HASHPTE)
+		hpte_need_hugepage_flush(mm, addr, pmdp);
+#endif
+	return old;
+}
+
+static inline int __pmdp_test_and_clear_young(struct mm_struct *mm,
+					      unsigned long addr, pmd_t *pmdp)
+{
+	unsigned long old;
+
+	if ((pmd_val(*pmdp) & (_PAGE_ACCESSED | _PAGE_HASHPTE)) == 0)
+		return 0;
+	old = pmd_hugepage_update(mm, addr, pmdp, _PAGE_ACCESSED);
+	return ((old & _PAGE_ACCESSED) != 0);
+}
+
+#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
+extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
+				     unsigned long address, pmd_t *pmdp);
+#define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
+extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
+				  unsigned long address, pmd_t *pmdp);
+
+#define __HAVE_ARCH_PMDP_GET_AND_CLEAR
+extern pmd_t pmdp_get_and_clear(struct mm_struct *mm,
+				unsigned long addr, pmd_t *pmdp);
+
+#define __HAVE_ARCH_PMDP_SET_WRPROTECT
+static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr,
+				      pmd_t *pmdp)
+{
+
+	if ((pmd_val(*pmdp) & _PAGE_RW) == 0)
+		return;
+
+	pmd_hugepage_update(mm, addr, pmdp, _PAGE_RW);
+}
+
+#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+extern void pmdp_splitting_flush(struct vm_area_struct *vma,
+				 unsigned long address, pmd_t *pmdp);
+
+#define __HAVE_ARCH_PGTABLE_DEPOSIT
+extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
+				       pgtable_t pgtable);
+#define __HAVE_ARCH_PGTABLE_WITHDRAW
+extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
+
+#define __HAVE_ARCH_PMDP_INVALIDATE
+extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+			    pmd_t *pmdp);
+#endif /* __ASSEMBLY__ */
 #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 7aeb955..283198e 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -222,5 +222,10 @@ extern int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 		       unsigned long end, int write, struct page **pages, int *nr);
 #endif /* __ASSEMBLY__ */
 
+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmd_large(pmd)		0
+#define has_transparent_hugepage() 0
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_PGTABLE_H */
diff --git a/arch/powerpc/include/asm/pte-hash64-64k.h b/arch/powerpc/include/asm/pte-hash64-64k.h
index 3e13e23..6be70be 100644
--- a/arch/powerpc/include/asm/pte-hash64-64k.h
+++ b/arch/powerpc/include/asm/pte-hash64-64k.h
@@ -38,6 +38,23 @@
  */
 #define PTE_RPN_SHIFT	(30)
 
+/*
+ * THP pages can't be special. So use the _PAGE_SPECIAL
+ */
+#define _PAGE_SPLITTING _PAGE_SPECIAL
+
+/*
+ * PTE flags to conserve for HPTE identification for THP page.
+ * We drop _PAGE_COMBO here, because we overload it with _PAGE_THP_HUGE.
+ */
+#define _PAGE_THP_HPTEFLAGS	(_PAGE_BUSY | _PAGE_HASHPTE)
+
+/*
+ * We need to differentiate between an explicit huge page and a THP huge
+ * page, since a THP huge page also needs to track real subpage details.
+ */
+#define _PAGE_THP_HUGE  _PAGE_COMBO
+
 #ifndef __ASSEMBLY__
 
 /*
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index a854096..54216c1 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -338,6 +338,19 @@ EXPORT_SYMBOL(iounmap);
 EXPORT_SYMBOL(__iounmap);
 EXPORT_SYMBOL(__iounmap_at);
 
+/*
+ * For hugepage we have pfn in the pmd, we use PTE_RPN_SHIFT bits for flags
+ * For PTE page, we have a PTE_FRAG_SIZE (4K) aligned virtual address.
+ */
+struct page *pmd_page(pmd_t pmd)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if (pmd_trans_huge(pmd))
+		return pfn_to_page(pmd_pfn(pmd));
+#endif
+	return virt_to_page(pmd_page_vaddr(pmd));
+}
+
 #ifdef CONFIG_PPC_64K_PAGES
 static pte_t *get_from_cache(struct mm_struct *mm)
 {
@@ -455,3 +468,308 @@ void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
 }
 #endif
 #endif /* CONFIG_PPC_64K_PAGES */
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static pmd_t set_hugepage_access_flags_filter(pmd_t pmd,
+					      struct vm_area_struct *vma,
+					      int dirty)
+{
+	return pmd;
+}
+
+/*
+ * This is called when relaxing access to a hugepage. It's also called in the page
+ * fault path when we don't hit any of the major fault cases, ie, a minor
+ * update of _PAGE_ACCESSED, _PAGE_DIRTY, etc... The generic code will have
+ * handled those two for us, we additionally deal with missing execute
+ * permission here on some processors
+ */
+int pmdp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
+			  pmd_t *pmdp, pmd_t entry, int dirty)
+{
+	int changed;
+	entry = set_hugepage_access_flags_filter(entry, vma, dirty);
+	changed = !pmd_same(*(pmdp), entry);
+	if (changed) {
+		__pmdp_set_access_flags(pmdp, entry);
+		/*
+		 * Since we are not supporting SW TLB systems, we don't
+		 * have any thing similar to flush_tlb_page_nohash()
+		 */
+	}
+	return changed;
+}
+
+int pmdp_test_and_clear_young(struct vm_area_struct *vma,
+			      unsigned long address, pmd_t *pmdp)
+{
+	return __pmdp_test_and_clear_young(vma->vm_mm, address, pmdp);
+}
+
+/*
+ * We currently remove entries from the hashtable regardless of whether
+ * the entry was young or dirty. The generic routines only flush if the
+ * entry was young or dirty which is not good enough.
+ *
+ * We should be more intelligent about this but for the moment we override
+ * these functions and force a tlb flush unconditionally
+ */
+int pmdp_clear_flush_young(struct vm_area_struct *vma,
+				  unsigned long address, pmd_t *pmdp)
+{
+	return __pmdp_test_and_clear_young(vma->vm_mm, address, pmdp);
+}
+
+/*
+ * We mark the pmd splitting and invalidate all the hpte
+ * entries for this hugepage.
+ */
+void pmdp_splitting_flush(struct vm_area_struct *vma,
+			  unsigned long address, pmd_t *pmdp)
+{
+	unsigned long old, tmp;
+
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+#ifdef PTE_ATOMIC_UPDATES
+
+	__asm__ __volatile__(
+	"1:	ldarx	%0,0,%3\n\
+		andi.	%1,%0,%6\n\
+		bne-	1b \n\
+		ori	%1,%0,%4 \n\
+		stdcx.	%1,0,%3 \n\
+		bne-	1b"
+	: "=&r" (old), "=&r" (tmp), "=m" (*pmdp)
+	: "r" (pmdp), "i" (_PAGE_SPLITTING), "m" (*pmdp), "i" (_PAGE_BUSY)
+	: "cc" );
+#else
+	old = pmd_val(*pmdp);
+	*pmdp = __pmd(old | _PAGE_SPLITTING);
+#endif
+	/*
+	 * If we didn't have the splitting flag set, go and flush the
+	 * HPTE entries and serialize against gup fast.
+	 */
+	if (!(old & _PAGE_SPLITTING)) {
+#ifdef CONFIG_PPC_STD_MMU_64
+		/* We need to flush the hpte */
+		if (old & _PAGE_HASHPTE)
+			hpte_need_hugepage_flush(vma->vm_mm, address, pmdp);
+#endif
+		/* need tlb flush only to serialize against gup-fast */
+		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	}
+}
+
+/*
+ * We want to put the pgtable in pmd and use pgtable for tracking
+ * the base page size hptes
+ */
+void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
+				pgtable_t pgtable)
+{
+	unsigned long *pgtable_slot;
+	assert_spin_locked(&mm->page_table_lock);
+	/*
+	 * we store the pgtable in the second half of PMD
+	 */
+	pgtable_slot = pmdp + PTRS_PER_PMD;
+	*pgtable_slot = (unsigned long)pgtable;
+}
+
+pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
+{
+	pgtable_t pgtable;
+	unsigned long *pgtable_slot;
+
+	assert_spin_locked(&mm->page_table_lock);
+	pgtable_slot = pmdp + PTRS_PER_PMD;
+	pgtable = (pgtable_t) *pgtable_slot;
+	/*
+	 * We store HPTE information in the deposited PTE fragment.
+	 * zero out the content on withdraw.
+	 */
+	memset(pgtable, 0, PTE_FRAG_SIZE);
+	return pgtable;
+}
+
+/*
+ * Since we are looking at latest ppc64, we don't need to worry about
+ * i/d cache coherency on exec fault
+ */
+static pmd_t set_pmd_filter(pmd_t pmd, unsigned long addr)
+{
+	pmd = __pmd(pmd_val(pmd) & ~_PAGE_THP_HPTEFLAGS);
+	return pmd;
+}
+
+/*
+ * We can make it less convoluted than __set_pte_at, because
+ * we can ignore a lot of the hardware here, because this is only for
+ * MPSS
+ */
+static inline void __set_pmd_at(struct mm_struct *mm, unsigned long addr,
+				pmd_t *pmdp, pmd_t pmd, int percpu)
+{
+	/*
+	 * There is nothing in hash page table now, so nothing to
+	 * invalidate, set_pte_at is used for adding new entry.
+	 * For updating we should use update_hugepage_pmd()
+	 */
+	*pmdp = pmd;
+}
+
+/*
+ * set a new huge pmd. We should not be called for updating
+ * an existing pmd entry. That should go via pmd_hugepage_update.
+ */
+void set_pmd_at(struct mm_struct *mm, unsigned long addr,
+		pmd_t *pmdp, pmd_t pmd)
+{
+	/*
+	 * Note: mm->context.id might not yet have been assigned as
+	 * this context might not have been activated yet when this
+	 * is called.
+	 */
+	pmd = set_pmd_filter(pmd, addr);
+
+	__set_pmd_at(mm, addr, pmdp, pmd, 0);
+
+}
+
+void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+		     pmd_t *pmdp)
+{
+	pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT);
+	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+}
+
+/*
+ * A linux hugepage PMD was changed and the corresponding hash table entries
+ * need to be flushed.
+ *
+ * The linux hugepage PMD now includes the pmd entries followed by the address
+ * of the stashed pgtable_t. The stashed pgtable_t contains the hpte bits:
+ * [ secondary group | 3 bit hidx | valid ]. We use one byte per HPTE entry.
+ * With a 16MB hugepage and 64K HPTEs we need 256 entries, and with 4K HPTEs we
+ * need 4096 entries. Both will fit in a 4K pgtable_t.
+ */
+void hpte_need_hugepage_flush(struct mm_struct *mm, unsigned long addr,
+			      pmd_t *pmdp)
+{
+	int ssize, i;
+	unsigned long s_addr;
+	unsigned int psize, valid;
+	unsigned char *hpte_slot_array;
+	unsigned long hidx, vpn, vsid, hash, shift, slot;
+
+	/*
+	 * Flush all the hptes mapping this hugepage
+	 */
+	s_addr = addr & HUGE_PAGE_MASK;
+	/*
+	 * The hpte hidx values are stored in the pgtable whose address is in the
+	 * second half of the PMD
+	 */
+	hpte_slot_array = *(char **)(pmdp + PTRS_PER_PMD);
+
+	/* get the base page size */
+	psize = get_slice_psize(mm, s_addr);
+	shift = mmu_psize_defs[psize].shift;
+
+	for (i = 0; i < (HUGE_PAGE_SIZE >> shift); i++) {
+		/*
+		 * 8 bits per hpte entry
+		 * 000| [ secondary group (one bit) | hidx (3 bits) | valid bit]
+		 */
+		valid = hpte_slot_array[i] & 0x1;
+		if (!valid)
+			continue;
+		hidx =  hpte_slot_array[i]  >> 1;
+
+		/* get the vpn */
+		addr = s_addr + (i * (1ul << shift));
+		if (!is_kernel_addr(addr)) {
+			ssize = user_segment_size(addr);
+			vsid = get_vsid(mm->context.id, addr, ssize);
+			WARN_ON(vsid == 0);
+		} else {
+			vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
+			ssize = mmu_kernel_ssize;
+		}
+
+		vpn = hpt_vpn(addr, vsid, ssize);
+		hash = hpt_hash(vpn, shift, ssize);
+		if (hidx & _PTEIDX_SECONDARY)
+			hash = ~hash;
+
+		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+		slot += hidx & _PTEIDX_GROUP_IX;
+		ppc_md.hpte_invalidate(slot, vpn, psize, ssize, 0);
+	}
+}
+
+static pmd_t pmd_set_protbits(pmd_t pmd, pgprot_t pgprot)
+{
+	pmd_val(pmd) |= pgprot_val(pgprot);
+	return pmd;
+}
+
+pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot)
+{
+	pmd_t pmd;
+	/*
+	 * For a valid pte, we would have _PAGE_PRESENT or _PAGE_FILE always
+	 * set. We use this to check THP page at pmd level.
+	 * leaf pte for huge page, bottom two bits != 00
+	 */
+	pmd_val(pmd) = pfn << PTE_RPN_SHIFT;
+	pmd_val(pmd) |= _PAGE_THP_HUGE;
+	pmd = pmd_set_protbits(pmd, pgprot);
+	return pmd;
+}
+
+pmd_t mk_pmd(struct page *page, pgprot_t pgprot)
+{
+	return pfn_pmd(page_to_pfn(page), pgprot);
+}
+
+pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
+{
+
+	pmd_val(pmd) &= _HPAGE_CHG_MASK;
+	pmd = pmd_set_protbits(pmd, newprot);
+	return pmd;
+}
+
+/*
+ * This is called at the end of handling a user page fault, when the
+ * fault has been handled by updating a HUGE PMD entry in the linux page tables.
+ * We use it to preload an HPTE into the hash table corresponding to
+ * the updated linux HUGE PMD entry.
+ */
+void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
+			  pmd_t *pmd)
+{
+	return;
+}
+
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+pmd_t pmdp_get_and_clear(struct mm_struct *mm,
+			 unsigned long addr, pmd_t *pmdp)
+{
+	pmd_t old_pmd;
+	unsigned long old;
+	/*
+	 * khugepaged calls this for normal pmd also
+	 */
+	if (pmd_trans_huge(*pmdp)) {
+		old = pmd_hugepage_update(mm, addr, pmdp, ~0UL);
+		old_pmd = __pmd(old);
+	} else {
+		old_pmd = *pmdp;
+		pmd_clear(pmdp);
+	}
+	return old_pmd;
+}
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 18e3b76..a526144 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -71,6 +71,7 @@ config PPC_BOOK3S_64
 	select PPC_FPU
 	select PPC_HAVE_PMU_SUPPORT
 	select SYS_SUPPORTS_HUGETLBFS
+	select HAVE_ARCH_TRANSPARENT_HUGEPAGE if PPC_64K_PAGES
 
 config PPC_BOOK3E_64
 	bool "Embedded processors"
-- 
1.8.1.2


* [PATCH -V7 03/10] powerpc: move find_linux_pte_or_hugepte and gup_hugepte to common code
  2013-04-28 19:51 [PATCH -V7 00/10] THP support for PPC64 (Patchset 2) Aneesh Kumar K.V
  2013-04-28 19:51 ` [PATCH -V7 01/10] powerpc/THP: Double the PMD table size for THP Aneesh Kumar K.V
  2013-04-28 19:51 ` [PATCH -V7 02/10] powerpc/THP: Implement transparent hugepages for ppc64 Aneesh Kumar K.V
@ 2013-04-28 19:51 ` Aneesh Kumar K.V
  2013-04-28 19:51 ` [PATCH -V7 04/10] powerpc: Update find_linux_pte_or_hugepte to handle transparent hugepages Aneesh Kumar K.V
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 34+ messages in thread
From: Aneesh Kumar K.V @ 2013-04-28 19:51 UTC (permalink / raw)
  To: benh, paulus, dwg, linux-mm; +Cc: linuxppc-dev, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

We will use this in a later patch for handling THP pages.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/hugetlb.h       |   8 +-
 arch/powerpc/include/asm/pgtable-ppc64.h |  11 --
 arch/powerpc/mm/Makefile                 |   2 +-
 arch/powerpc/mm/hugetlbpage.c            | 251 ++++++++++++++++---------------
 4 files changed, 136 insertions(+), 136 deletions(-)

diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h
index 4daf7e6..91aba46 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -190,8 +190,14 @@ static inline void flush_hugetlb_page(struct vm_area_struct *vma,
 				      unsigned long vmaddr)
 {
 }
-#endif /* CONFIG_HUGETLB_PAGE */
 
+#define hugepd_shift(x) 0
+static inline pte_t *hugepte_offset(hugepd_t *hpdp, unsigned long addr,
+				    unsigned pdshift)
+{
+	return 0;
+}
+#endif /* CONFIG_HUGETLB_PAGE */
 
 /*
  * FSL Book3E platforms require special gpage handling - the gpages
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 20133c1..f0effab 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -367,19 +367,8 @@ static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea)
 	return pt;
 }
 
-#ifdef CONFIG_HUGETLB_PAGE
 pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
 				 unsigned *shift);
-#else
-static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
-					       unsigned *shift)
-{
-	if (shift)
-		*shift = 0;
-	return find_linux_pte(pgdir, ea);
-}
-#endif /* !CONFIG_HUGETLB_PAGE */
-
 #endif /* __ASSEMBLY__ */
 
 #ifndef _PAGE_SPLITTING
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index cf16b57..fde36e6 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -28,8 +28,8 @@ obj-$(CONFIG_44x)		+= 44x_mmu.o
 obj-$(CONFIG_PPC_FSL_BOOK3E)	+= fsl_booke_mmu.o
 obj-$(CONFIG_NEED_MULTIPLE_NODES) += numa.o
 obj-$(CONFIG_PPC_MM_SLICES)	+= slice.o
-ifeq ($(CONFIG_HUGETLB_PAGE),y)
 obj-y				+= hugetlbpage.o
+ifeq ($(CONFIG_HUGETLB_PAGE),y)
 obj-$(CONFIG_PPC_STD_MMU_64)	+= hugetlbpage-hash64.o
 obj-$(CONFIG_PPC_BOOK3E_MMU)	+= hugetlbpage-book3e.o
 endif
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index fbe6be7..8601f2d 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -21,6 +21,9 @@
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 #include <asm/setup.h>
+#include <asm/hugetlb.h>
+
+#ifdef CONFIG_HUGETLB_PAGE
 
 #define PAGE_SHIFT_64K	16
 #define PAGE_SHIFT_16M	24
@@ -100,66 +103,6 @@ int pgd_huge(pgd_t pgd)
 }
 #endif
 
-/*
- * We have 4 cases for pgds and pmds:
- * (1) invalid (all zeroes)
- * (2) pointer to next table, as normal; bottom 6 bits == 0
- * (3) leaf pte for huge page, bottom two bits != 00
- * (4) hugepd pointer, bottom two bits == 00, next 4 bits indicate size of table
- */
-pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, unsigned *shift)
-{
-	pgd_t *pg;
-	pud_t *pu;
-	pmd_t *pm;
-	pte_t *ret_pte;
-	hugepd_t *hpdp = NULL;
-	unsigned pdshift = PGDIR_SHIFT;
-
-	if (shift)
-		*shift = 0;
-
-	pg = pgdir + pgd_index(ea);
-
-	if (pgd_huge(*pg)) {
-		ret_pte = (pte_t *) pg;
-		goto out;
-	} else if (is_hugepd(pg))
-		hpdp = (hugepd_t *)pg;
-	else if (!pgd_none(*pg)) {
-		pdshift = PUD_SHIFT;
-		pu = pud_offset(pg, ea);
-
-		if (pud_huge(*pu)) {
-			ret_pte = (pte_t *) pu;
-			goto out;
-		} else if (is_hugepd(pu))
-			hpdp = (hugepd_t *)pu;
-		else if (!pud_none(*pu)) {
-			pdshift = PMD_SHIFT;
-			pm = pmd_offset(pu, ea);
-
-			if (pmd_huge(*pm)) {
-				ret_pte = (pte_t *) pm;
-				goto out;
-			} else if (is_hugepd(pm))
-				hpdp = (hugepd_t *)pm;
-			else if (!pmd_none(*pm))
-				return pte_offset_kernel(pm, ea);
-		}
-	}
-	if (!hpdp)
-		return NULL;
-
-	ret_pte = hugepte_offset(hpdp, ea, pdshift);
-	pdshift = hugepd_shift(*hpdp);
-out:
-	if (shift)
-		*shift = pdshift;
-	return ret_pte;
-}
-EXPORT_SYMBOL_GPL(find_linux_pte_or_hugepte);
-
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 {
 	return find_linux_pte_or_hugepte(mm->pgd, addr, NULL);
@@ -753,69 +696,6 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	return NULL;
 }
 
-int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
-		unsigned long end, int write, struct page **pages, int *nr)
-{
-	unsigned long mask;
-	unsigned long pte_end;
-	struct page *head, *page, *tail;
-	pte_t pte;
-	int refs;
-
-	pte_end = (addr + sz) & ~(sz-1);
-	if (pte_end < end)
-		end = pte_end;
-
-	pte = *ptep;
-	mask = _PAGE_PRESENT | _PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-
-	if ((pte_val(pte) & mask) != mask)
-		return 0;
-
-	/* hugepages are never "special" */
-	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
-
-	refs = 0;
-	head = pte_page(pte);
-
-	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
-	tail = page;
-	do {
-		VM_BUG_ON(compound_head(page) != head);
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (addr += PAGE_SIZE, addr != end);
-
-	if (!page_cache_add_speculative(head, refs)) {
-		*nr -= refs;
-		return 0;
-	}
-
-	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-		/* Could be optimized better */
-		*nr -= refs;
-		while (refs--)
-			put_page(head);
-		return 0;
-	}
-
-	/*
-	 * Any tail page need their mapcount reference taken before we
-	 * return.
-	 */
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
-	return 1;
-}
-
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
 				      unsigned long sz)
 {
@@ -1032,3 +912,128 @@ void flush_dcache_icache_hugepage(struct page *page)
 		}
 	}
 }
+
+#endif /* CONFIG_HUGETLB_PAGE */
+
+/*
+ * We have 4 cases for pgds and pmds:
+ * (1) invalid (all zeroes)
+ * (2) pointer to next table, as normal; bottom 6 bits == 0
+ * (3) leaf pte for huge page, bottom two bits != 00
+ * (4) hugepd pointer, bottom two bits == 00, next 4 bits indicate size of table
+ */
+pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, unsigned *shift)
+{
+	pgd_t *pg;
+	pud_t *pu;
+	pmd_t *pm;
+	pte_t *ret_pte;
+	hugepd_t *hpdp = NULL;
+	unsigned pdshift = PGDIR_SHIFT;
+
+	if (shift)
+		*shift = 0;
+
+	pg = pgdir + pgd_index(ea);
+
+	if (pgd_huge(*pg)) {
+		ret_pte = (pte_t *) pg;
+		goto out;
+	} else if (is_hugepd(pg))
+		hpdp = (hugepd_t *)pg;
+	else if (!pgd_none(*pg)) {
+		pdshift = PUD_SHIFT;
+		pu = pud_offset(pg, ea);
+
+		if (pud_huge(*pu)) {
+			ret_pte = (pte_t *) pu;
+			goto out;
+		} else if (is_hugepd(pu))
+			hpdp = (hugepd_t *)pu;
+		else if (!pud_none(*pu)) {
+			pdshift = PMD_SHIFT;
+			pm = pmd_offset(pu, ea);
+
+			if (pmd_huge(*pm)) {
+				ret_pte = (pte_t *) pm;
+				goto out;
+			} else if (is_hugepd(pm))
+				hpdp = (hugepd_t *)pm;
+			else if (!pmd_none(*pm))
+				return pte_offset_kernel(pm, ea);
+		}
+	}
+	if (!hpdp)
+		return NULL;
+
+	ret_pte = hugepte_offset(hpdp, ea, pdshift);
+	pdshift = hugepd_shift(*hpdp);
+out:
+	if (shift)
+		*shift = pdshift;
+	return ret_pte;
+}
+EXPORT_SYMBOL_GPL(find_linux_pte_or_hugepte);
+
+int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
+		unsigned long end, int write, struct page **pages, int *nr)
+{
+	unsigned long mask;
+	unsigned long pte_end;
+	struct page *head, *page, *tail;
+	pte_t pte;
+	int refs;
+
+	pte_end = (addr + sz) & ~(sz-1);
+	if (pte_end < end)
+		end = pte_end;
+
+	pte = *ptep;
+	mask = _PAGE_PRESENT | _PAGE_USER;
+	if (write)
+		mask |= _PAGE_RW;
+
+	if ((pte_val(pte) & mask) != mask)
+		return 0;
+
+	/* hugepages are never "special" */
+	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+
+	refs = 0;
+	head = pte_page(pte);
+
+	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
+	tail = page;
+	do {
+		VM_BUG_ON(compound_head(page) != head);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+
+	if (!page_cache_add_speculative(head, refs)) {
+		*nr -= refs;
+		return 0;
+	}
+
+	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+		/* Could be optimized better */
+		*nr -= refs;
+		while (refs--)
+			put_page(head);
+		return 0;
+	}
+
+	/*
+	 * Any tail page need their mapcount reference taken before we
+	 * return.
+	 */
+	while (refs--) {
+		if (PageTail(tail))
+			get_huge_page_tail(tail);
+		tail++;
+	}
+
+	return 1;
+}
-- 
1.8.1.2


* [PATCH -V7 04/10] powerpc: Update find_linux_pte_or_hugepte to handle transparent hugepages
  2013-04-28 19:51 [PATCH -V7 00/10] THP support for PPC64 (Patchset 2) Aneesh Kumar K.V
                   ` (2 preceding siblings ...)
  2013-04-28 19:51 ` [PATCH -V7 03/10] powerpc: move find_linux_pte_or_hugepte and gup_hugepte to common code Aneesh Kumar K.V
@ 2013-04-28 19:51 ` Aneesh Kumar K.V
  2013-05-03  4:53   ` David Gibson
  2013-04-28 19:51 ` [PATCH -V7 05/10] powerpc: Replace find_linux_pte with find_linux_pte_or_hugepte Aneesh Kumar K.V
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Aneesh Kumar K.V @ 2013-04-28 19:51 UTC (permalink / raw)
  To: benh, paulus, dwg, linux-mm; +Cc: linuxppc-dev, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/mm/hugetlbpage.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 8601f2d..081c001 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -954,7 +954,7 @@ pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, unsigned *shift
 			pdshift = PMD_SHIFT;
 			pm = pmd_offset(pu, ea);
 
-			if (pmd_huge(*pm)) {
+			if (pmd_huge(*pm) || pmd_large(*pm)) {
 				ret_pte = (pte_t *) pm;
 				goto out;
 			} else if (is_hugepd(pm))
-- 
1.8.1.2


* [PATCH -V7 05/10] powerpc: Replace find_linux_pte with find_linux_pte_or_hugepte
  2013-04-28 19:51 [PATCH -V7 00/10] THP support for PPC64 (Patchset 2) Aneesh Kumar K.V
                   ` (3 preceding siblings ...)
  2013-04-28 19:51 ` [PATCH -V7 04/10] powerpc: Update find_linux_pte_or_hugepte to handle transparent hugepages Aneesh Kumar K.V
@ 2013-04-28 19:51 ` Aneesh Kumar K.V
  2013-05-03  4:56   ` David Gibson
  2013-04-28 19:51 ` [PATCH -V7 06/10] powerpc: Update gup_pmd_range to handle transparent hugepages Aneesh Kumar K.V
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Aneesh Kumar K.V @ 2013-04-28 19:51 UTC (permalink / raw)
  To: benh, paulus, dwg, linux-mm; +Cc: linuxppc-dev, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Replace find_linux_pte with find_linux_pte_or_hugepte and explicitly
document why we don't need to handle transparent hugepages at callsites.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pgtable-ppc64.h | 24 ------------------------
 arch/powerpc/kernel/io-workarounds.c     | 10 ++++++++--
 arch/powerpc/kvm/book3s_hv_rm_mmu.c      |  2 +-
 arch/powerpc/mm/hash_utils_64.c          |  8 +++++++-
 arch/powerpc/mm/hugetlbpage.c            |  8 ++++++--
 arch/powerpc/mm/tlb_hash64.c             |  7 ++++++-
 arch/powerpc/platforms/pseries/eeh.c     |  7 ++++++-
 7 files changed, 34 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index f0effab..97fc839 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -343,30 +343,6 @@ static inline void __ptep_set_access_flags(pte_t *ptep, pte_t entry)
 
 void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
 void pgtable_cache_init(void);
-
-/*
- * find_linux_pte returns the address of a linux pte for a given
- * effective address and directory.  If not found, it returns zero.
- */
-static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea)
-{
-	pgd_t *pg;
-	pud_t *pu;
-	pmd_t *pm;
-	pte_t *pt = NULL;
-
-	pg = pgdir + pgd_index(ea);
-	if (!pgd_none(*pg)) {
-		pu = pud_offset(pg, ea);
-		if (!pud_none(*pu)) {
-			pm = pmd_offset(pu, ea);
-			if (pmd_present(*pm))
-				pt = pte_offset_kernel(pm, ea);
-		}
-	}
-	return pt;
-}
-
 pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
 				 unsigned *shift);
 #endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/kernel/io-workarounds.c b/arch/powerpc/kernel/io-workarounds.c
index 50e90b7..e5263ab 100644
--- a/arch/powerpc/kernel/io-workarounds.c
+++ b/arch/powerpc/kernel/io-workarounds.c
@@ -55,6 +55,7 @@ static struct iowa_bus *iowa_pci_find(unsigned long vaddr, unsigned long paddr)
 
 struct iowa_bus *iowa_mem_find_bus(const PCI_IO_ADDR addr)
 {
+	unsigned shift;
 	struct iowa_bus *bus;
 	int token;
 
@@ -70,11 +71,16 @@ struct iowa_bus *iowa_mem_find_bus(const PCI_IO_ADDR addr)
 		if (vaddr < PHB_IO_BASE || vaddr >= PHB_IO_END)
 			return NULL;
 
-		ptep = find_linux_pte(init_mm.pgd, vaddr);
+		ptep = find_linux_pte_or_hugepte(init_mm.pgd, vaddr, &shift);
 		if (ptep == NULL)
 			paddr = 0;
-		else
+		else {
+			/*
+			 * we don't have hugepages backing iomem
+			 */
+			BUG_ON(shift);
 			paddr = pte_pfn(*ptep) << PAGE_SHIFT;
+		}
 		bus = iowa_pci_find(vaddr, paddr);
 
 		if (bus == NULL)
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 19c93ba..8c345df 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -27,7 +27,7 @@ static void *real_vmalloc_addr(void *x)
 	unsigned long addr = (unsigned long) x;
 	pte_t *p;
 
-	p = find_linux_pte(swapper_pg_dir, addr);
+	p = find_linux_pte_or_hugepte(swapper_pg_dir, addr, NULL);
 	if (!p || !pte_present(*p))
 		return NULL;
 	/* assume we don't have huge pages in vmalloc space... */
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index d0eb6d4..e942ae9 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1131,6 +1131,7 @@ EXPORT_SYMBOL_GPL(hash_page);
 void hash_preload(struct mm_struct *mm, unsigned long ea,
 		  unsigned long access, unsigned long trap)
 {
+	int shift;
 	unsigned long vsid;
 	pgd_t *pgdir;
 	pte_t *ptep;
@@ -1152,10 +1153,15 @@ void hash_preload(struct mm_struct *mm, unsigned long ea,
 	pgdir = mm->pgd;
 	if (pgdir == NULL)
 		return;
-	ptep = find_linux_pte(pgdir, ea);
+	/*
+	 * THP pages use update_mmu_cache_pmd. We don't do
+	 * hash preload there. Hence can ignore THP here
+	 */
+	ptep = find_linux_pte_or_hugepte(pgdir, ea, &shift);
 	if (!ptep)
 		return;
 
+	BUG_ON(shift);
 #ifdef CONFIG_PPC_64K_PAGES
 	/* If either _PAGE_4K_PFN or _PAGE_NO_CACHE is set (and we are on
 	 * a 64K kernel), then we don't preload, hash_page() will take
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 081c001..1154714 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -105,6 +105,7 @@ int pgd_huge(pgd_t pgd)
 
 pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 {
+	/* Only called for HugeTLB pages, hence can ignore THP */
 	return find_linux_pte_or_hugepte(mm->pgd, addr, NULL);
 }
 
@@ -673,11 +674,14 @@ follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
 	struct page *page;
 	unsigned shift;
 	unsigned long mask;
-
+	/*
+	 * Transparent hugepages are handled by generic code. We can skip them
+	 * here.
+	 */
 	ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
 
 	/* Verify it is a huge page else bail. */
-	if (!ptep || !shift)
+	if (!ptep || !shift || pmd_trans_huge((pmd_t)*ptep))
 		return ERR_PTR(-EINVAL);
 
 	mask = (1UL << shift) - 1;
diff --git a/arch/powerpc/mm/tlb_hash64.c b/arch/powerpc/mm/tlb_hash64.c
index 023ec8a..56d9b85 100644
--- a/arch/powerpc/mm/tlb_hash64.c
+++ b/arch/powerpc/mm/tlb_hash64.c
@@ -189,6 +189,7 @@ void tlb_flush(struct mmu_gather *tlb)
 void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
 			      unsigned long end)
 {
+	int shift;
 	unsigned long flags;
 
 	start = _ALIGN_DOWN(start, PAGE_SIZE);
@@ -206,11 +207,15 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
 	local_irq_save(flags);
 	arch_enter_lazy_mmu_mode();
 	for (; start < end; start += PAGE_SIZE) {
-		pte_t *ptep = find_linux_pte(mm->pgd, start);
+		pte_t *ptep = find_linux_pte_or_hugepte(mm->pgd, start, &shift);
 		unsigned long pte;
 
 		if (ptep == NULL)
 			continue;
+		/*
+		 * We won't find hugepages here, this is iomem.
+		 */
+		BUG_ON(shift);
 		pte = pte_val(*ptep);
 		if (!(pte & _PAGE_HASHPTE))
 			continue;
diff --git a/arch/powerpc/platforms/pseries/eeh.c b/arch/powerpc/platforms/pseries/eeh.c
index 6b73d6c..d2e76d2 100644
--- a/arch/powerpc/platforms/pseries/eeh.c
+++ b/arch/powerpc/platforms/pseries/eeh.c
@@ -258,12 +258,17 @@ void eeh_slot_error_detail(struct eeh_pe *pe, int severity)
  */
 static inline unsigned long eeh_token_to_phys(unsigned long token)
 {
+	int shift;
 	pte_t *ptep;
 	unsigned long pa;
 
-	ptep = find_linux_pte(init_mm.pgd, token);
+	/*
+	 * We won't find hugepages here, iomem
+	 */
+	ptep = find_linux_pte_or_hugepte(init_mm.pgd, token, &shift);
 	if (!ptep)
 		return token;
+	BUG_ON(shift);
 	pa = pte_pfn(*ptep) << PAGE_SHIFT;
 
 	return pa | (token & (PAGE_SIZE-1));
-- 
1.8.1.2


* [PATCH -V7 06/10] powerpc: Update gup_pmd_range to handle transparent hugepages
  2013-04-28 19:51 [PATCH -V7 00/10] THP support for PPC64 (Patchset 2) Aneesh Kumar K.V
                   ` (4 preceding siblings ...)
  2013-04-28 19:51 ` [PATCH -V7 05/10] powerpc: Replace find_linux_pte with find_linux_pte_or_hugepte Aneesh Kumar K.V
@ 2013-04-28 19:51 ` Aneesh Kumar K.V
  2013-05-03  4:57   ` David Gibson
  2013-04-28 19:51 ` [PATCH -V7 07/10] powerpc/THP: Add code to handle HPTE faults for large pages Aneesh Kumar K.V
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Aneesh Kumar K.V @ 2013-04-28 19:51 UTC (permalink / raw)
  To: benh, paulus, dwg, linux-mm; +Cc: linuxppc-dev, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/mm/gup.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index 4b921af..3d36fd7 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -66,9 +66,20 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		pmd_t pmd = *pmdp;
 
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(pmd))
+		/*
+		 * The pmd_trans_splitting() check below explains why
+		 * pmdp_splitting_flush has to flush the tlb, to stop
+		 * this gup-fast code from running while we set the
+		 * splitting bit in the pmd. Returning zero will take
+		 * the slow path that will call wait_split_huge_page()
+		 * if the pmd is still in splitting state. gup-fast
+		 * can't because it has irq disabled and
+		 * wait_split_huge_page() would never return as the
+		 * tlb flush IPI wouldn't run.
+		 */
+		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
 			return 0;
-		if (pmd_huge(pmd)) {
+		if (pmd_huge(pmd) || pmd_large(pmd)) {
 			if (!gup_hugepte((pte_t *)pmdp, PMD_SIZE, addr, next,
 					 write, pages, nr))
 				return 0;
-- 
1.8.1.2


* [PATCH -V7 07/10] powerpc/THP: Add code to handle HPTE faults for large pages
  2013-04-28 19:51 [PATCH -V7 00/10] THP support for PPC64 (Patchset 2) Aneesh Kumar K.V
                   ` (5 preceding siblings ...)
  2013-04-28 19:51 ` [PATCH -V7 06/10] powerpc: Update gup_pmd_range to handle transparent hugepages Aneesh Kumar K.V
@ 2013-04-28 19:51 ` Aneesh Kumar K.V
  2013-05-03  5:13   ` David Gibson
  2013-04-28 19:51 ` [PATCH -V7 08/10] powerpc/THP: Enable THP on PPC64 Aneesh Kumar K.V
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 34+ messages in thread
From: Aneesh Kumar K.V @ 2013-04-28 19:51 UTC (permalink / raw)
  To: benh, paulus, dwg, linux-mm; +Cc: linuxppc-dev, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

The deposited PTE page in the second half of the PMD table is used to
track the state of the hash PTEs. After updating the HPTE, we mark the
corresponding slot in the deposited PTE page valid.
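
As a rough sketch of what "marking the slot valid" means, using the byte
format described in patch 02 (the helper below is illustrative; the real
update is done by __hash_page_thp in the diff that follows):

	/*
	 * After inserting the HPTE for subpage 'index' of the hugepage,
	 * record which hash group and slot it went into, so that a later
	 * invalidate can find it again.  Bit layout per entry:
	 * bit 0 = valid, bits 1-3 = slot within the group, bit 4 = secondary.
	 */
	#include <stdio.h>

	static void mark_hpte_slot_valid(unsigned char *hpte_slot_array,
					 unsigned int index, unsigned long hidx)
	{
		hpte_slot_array[index] = (hidx << 1) | 0x1;
	}

	int main(void)
	{
		/* 16MB hugepage with 64K base pages => 256 subpage entries */
		unsigned char hpte_slot_array[256] = { 0 };

		/* say subpage 42 landed in the secondary group (0x8), slot 3 */
		mark_hpte_slot_valid(hpte_slot_array, 42, 0x8 | 3);
		printf("entry 42 = 0x%x\n", hpte_slot_array[42]);
		return 0;
	}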

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/mmu-hash64.h |  13 +++
 arch/powerpc/mm/Makefile              |   1 +
 arch/powerpc/mm/hash_utils_64.c       |  13 ++-
 arch/powerpc/mm/hugepage-hash64.c     | 180 ++++++++++++++++++++++++++++++++++
 4 files changed, 203 insertions(+), 4 deletions(-)
 create mode 100644 arch/powerpc/mm/hugepage-hash64.c

diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
index 2accc96..3d6fbb0 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -340,6 +340,19 @@ extern int hash_page(unsigned long ea, unsigned long access, unsigned long trap)
 int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
 		     pte_t *ptep, unsigned long trap, int local, int ssize,
 		     unsigned int shift, unsigned int mmu_psize);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern int __hash_page_thp(unsigned long ea, unsigned long access,
+			   unsigned long vsid, pmd_t *pmdp, unsigned long trap,
+			   int local, int ssize, unsigned int psize);
+#else
+static inline int __hash_page_thp(unsigned long ea, unsigned long access,
+				  unsigned long vsid, pmd_t *pmdp,
+				  unsigned long trap, int local,
+				  int ssize, unsigned int psize)
+{
+	BUG();
+}
+#endif
 extern void hash_failure_debug(unsigned long ea, unsigned long access,
 			       unsigned long vsid, unsigned long trap,
 			       int ssize, int psize, int lpsize,
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index fde36e6..87671eb 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -33,6 +33,7 @@ ifeq ($(CONFIG_HUGETLB_PAGE),y)
 obj-$(CONFIG_PPC_STD_MMU_64)	+= hugetlbpage-hash64.o
 obj-$(CONFIG_PPC_BOOK3E_MMU)	+= hugetlbpage-book3e.o
 endif
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += hugepage-hash64.o
 obj-$(CONFIG_PPC_SUBPAGE_PROT)	+= subpage-prot.o
 obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
 obj-$(CONFIG_HIGHMEM)		+= highmem.o
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index e942ae9..cea7267 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1041,11 +1041,16 @@ int hash_page(unsigned long ea, unsigned long access, unsigned long trap)
 		return 1;
 	}
 
+	if (hugeshift) {
+		if (pmd_trans_huge((pmd_t) *ptep))
+			return __hash_page_thp(ea, access, vsid, (pmd_t *)ptep,
+					       trap, local, ssize, psize);
 #ifdef CONFIG_HUGETLB_PAGE
-	if (hugeshift)
-		return __hash_page_huge(ea, access, vsid, ptep, trap, local,
-					ssize, hugeshift, psize);
-#endif /* CONFIG_HUGETLB_PAGE */
+		else
+			return __hash_page_huge(ea, access, vsid, ptep, trap,
+						local, ssize, hugeshift, psize);
+#endif
+	}
 
 #ifndef CONFIG_PPC_64K_PAGES
 	DBG_LOW(" i-pte: %016lx\n", pte_val(*ptep));
diff --git a/arch/powerpc/mm/hugepage-hash64.c b/arch/powerpc/mm/hugepage-hash64.c
new file mode 100644
index 0000000..340962a
--- /dev/null
+++ b/arch/powerpc/mm/hugepage-hash64.c
@@ -0,0 +1,180 @@
+/*
+ * Copyright IBM Corporation, 2013
+ * Author Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ */
+
+/*
+ * PPC64 THP Support for hash based MMUs
+ */
+#include <linux/mm.h>
+#include <asm/machdep.h>
+
+/*
+ * The linux hugepage PMD now include the pmd entries followed by the address
+ * to the stashed pgtable_t. The stashed pgtable_t contains the hpte bits.
+ * [ secondary group | 3 bit hidx | valid ]. We use one byte per each HPTE entry.
+ * With 16MB hugepage and 64K HPTE we need 256 entries and with 4K HPTE we need
+ * 4096 entries. Both will fit in a 4K pgtable_t.
+ */
+int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
+		    pmd_t *pmdp, unsigned long trap, int local, int ssize,
+		    unsigned int psize)
+{
+	unsigned int index, valid;
+	unsigned char *hpte_slot_array;
+	unsigned long rflags, pa, hidx;
+	unsigned long old_pmd, new_pmd;
+	int ret, lpsize = MMU_PAGE_16M;
+	unsigned long vpn, hash, shift, slot;
+
+	/*
+	 * atomically mark the linux large page PMD busy and dirty
+	 */
+	do {
+		old_pmd = pmd_val(*pmdp);
+		/* If PMD busy, retry the access */
+		if (unlikely(old_pmd & _PAGE_BUSY))
+			return 0;
+		/* If PMD permissions don't match, take page fault */
+		if (unlikely(access & ~old_pmd))
+			return 1;
+		/*
+		 * Try to lock the PTE, add ACCESSED and DIRTY if it was
+		 * a write access
+		 */
+		new_pmd = old_pmd | _PAGE_BUSY | _PAGE_ACCESSED;
+		if (access & _PAGE_RW)
+			new_pmd |= _PAGE_DIRTY;
+	} while (old_pmd != __cmpxchg_u64((unsigned long *)pmdp,
+					  old_pmd, new_pmd));
+	/*
+	 * PP bits. _PAGE_USER is already PP bit 0x2, so we only
+	 * need to add in 0x1 if it's a read-only user page
+	 */
+	rflags = new_pmd & _PAGE_USER;
+	if ((new_pmd & _PAGE_USER) && !((new_pmd & _PAGE_RW) &&
+					   (new_pmd & _PAGE_DIRTY)))
+		rflags |= 0x1;
+	/*
+	 * _PAGE_EXEC -> HW_NO_EXEC since it's inverted
+	 */
+	rflags |= ((new_pmd & _PAGE_EXEC) ? 0 : HPTE_R_N);
+
+#if 0
+	if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE)) {
+
+		/*
+		 * No CPU has hugepages but lacks no execute, so we
+		 * don't need to worry about that case
+		 */
+		rflags = hash_page_do_lazy_icache(rflags, __pte(old_pte), trap);
+	}
+#endif
+	/*
+	 * Find the slot index details for this ea, using base page size.
+	 */
+	shift = mmu_psize_defs[psize].shift;
+	index = (ea & (HUGE_PAGE_SIZE - 1)) >> shift;
+	BUG_ON(index >= 4096);
+
+	vpn = hpt_vpn(ea, vsid, ssize);
+	hash = hpt_hash(vpn, shift, ssize);
+	/*
+	 * The hpte hindex are stored in the pgtable whose address is in the
+	 * second half of the PMD
+	 */
+	hpte_slot_array = *(char **)(pmdp + PTRS_PER_PMD);
+
+	valid = hpte_slot_array[index]  & 0x1;
+	if (valid) {
+		/* update the hpte bits */
+		hidx =  hpte_slot_array[index]  >> 1;
+		if (hidx & _PTEIDX_SECONDARY)
+			hash = ~hash;
+		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+		slot += hidx & _PTEIDX_GROUP_IX;
+
+		ret = ppc_md.hpte_updatepp(slot, rflags, vpn,
+					   psize, ssize, local);
+		/*
+		 * We failed to update, try to insert a new entry.
+		 */
+		if (ret == -1) {
+			/*
+			 * large pte is marked busy, so we can be sure
+			 * nobody is looking at hpte_slot_array. hence we can
+			 * safely update this here.
+			 */
+			hpte_slot_array[index] = 0;
+			valid = 0;
+		}
+	}
+
+	if (!valid) {
+		unsigned long hpte_group;
+
+		/* insert new entry */
+		pa = pmd_pfn(__pmd(old_pmd)) << PAGE_SHIFT;
+repeat:
+		hpte_group = ((hash & htab_hash_mask) * HPTES_PER_GROUP) & ~0x7UL;
+
+		/* clear the busy bits and set the hash pte bits */
+		new_pmd = (new_pmd & ~_PAGE_THP_HPTEFLAGS) | _PAGE_HASHPTE;
+
+		/* Add in WIMG bits */
+		rflags |= (new_pmd & (_PAGE_WRITETHRU | _PAGE_NO_CACHE |
+				      _PAGE_COHERENT | _PAGE_GUARDED));
+
+		/* Insert into the hash table, primary slot */
+		slot = ppc_md.hpte_insert(hpte_group, vpn, pa, rflags, 0,
+					  psize, lpsize, ssize);
+		/*
+		 * Primary is full, try the secondary
+		 */
+		if (unlikely(slot == -1)) {
+			hpte_group = ((~hash & htab_hash_mask) *
+				      HPTES_PER_GROUP) & ~0x7UL;
+			slot = ppc_md.hpte_insert(hpte_group, vpn, pa,
+						  rflags, HPTE_V_SECONDARY,
+						  psize, lpsize, ssize);
+			if (slot == -1) {
+				if (mftb() & 0x1)
+					hpte_group = ((hash & htab_hash_mask) *
+						      HPTES_PER_GROUP) & ~0x7UL;
+
+				ppc_md.hpte_remove(hpte_group);
+				goto repeat;
+			}
+		}
+		/*
+		 * Hypervisor failure. Restore old pmd and return -1
+		 * similar to __hash_page_*
+		 */
+		if (unlikely(slot == -2)) {
+			*pmdp = __pmd(old_pmd);
+			hash_failure_debug(ea, access, vsid, trap, ssize,
+					   psize, lpsize, old_pmd);
+			return -1;
+		}
+		/*
+		 * large pte is marked busy, so we can be sure
+		 * nobody is looking at hpte_slot_array. hence we can
+		 * safely update this here.
+		 */
+		hpte_slot_array[index] = slot << 1 | 0x1;
+	}
+	/*
+	 * No need to use ldarx/stdcx here
+	 */
+	*pmdp = __pmd(new_pmd & ~_PAGE_BUSY);
+	return 0;
+}
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH -V7 08/10] powerpc/THP: Enable THP on PPC64
  2013-04-28 19:51 [PATCH -V7 00/10] THP support for PPC64 (Patchset 2) Aneesh Kumar K.V
                   ` (6 preceding siblings ...)
  2013-04-28 19:51 ` [PATCH -V7 07/10] powerpc/THP: Add code to handle HPTE faults for large pages Aneesh Kumar K.V
@ 2013-04-28 19:51 ` Aneesh Kumar K.V
  2013-05-03  5:15   ` David Gibson
  2013-04-28 19:51 ` [PATCH -V7 09/10] powerpc: Optimize hugepage invalidate Aneesh Kumar K.V
  2013-04-28 19:51 ` [PATCH -V7 10/10] powerpc: disable assert_pte_locked Aneesh Kumar K.V
  9 siblings, 1 reply; 34+ messages in thread
From: Aneesh Kumar K.V @ 2013-04-28 19:51 UTC (permalink / raw)
  To: benh, paulus, dwg, linux-mm; +Cc: linuxppc-dev, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

We enable THP only if the platform supports a 16MB page size.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pgtable-ppc64.h |  3 +--
 arch/powerpc/mm/pgtable_64.c             | 28 ++++++++++++++++++++++++++++
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 97fc839..d65534b 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -426,8 +426,7 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
 	return pmd_val(pmd) >> PTE_RPN_SHIFT;
 }
 
-/* We will enable it in the last patch */
-#define has_transparent_hugepage() 0
+extern int has_transparent_hugepage(void);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static inline int pmd_young(pmd_t pmd)
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 54216c1..b742d6f 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -754,6 +754,34 @@ void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
 	return;
 }
 
+int has_transparent_hugepage(void)
+{
+	if (!mmu_has_feature(MMU_FTR_16M_PAGE))
+		return 0;
+	/*
+	 * We support THP only if HPAGE_SHIFT is 16MB.
+	 */
+	if (!HPAGE_SHIFT || (HPAGE_SHIFT != mmu_psize_defs[MMU_PAGE_16M].shift))
+		return 0;
+	/*
+	 * We need to make sure that we support 16MB hugepage in a segment
+	 * with base page size 64K or 4K. We only enable THP with a PAGE_SIZE
+	 * of 64K.
+	 */
+	/*
+	 * If we have 64K HPTE, we will be using that by default
+	 */
+	if (mmu_psize_defs[MMU_PAGE_64K].shift &&
+	    (mmu_psize_defs[MMU_PAGE_64K].penc[MMU_PAGE_16M] == -1))
+		return 0;
+	/*
+	 * Ok we only have 4K HPTE
+	 */
+	if (mmu_psize_defs[MMU_PAGE_4K].penc[MMU_PAGE_16M] == -1)
+		return 0;
+
+	return 1;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 pmd_t pmdp_get_and_clear(struct mm_struct *mm,
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH -V7 09/10] powerpc: Optimize hugepage invalidate
  2013-04-28 19:51 [PATCH -V7 00/10] THP support for PPC64 (Patchset 2) Aneesh Kumar K.V
                   ` (7 preceding siblings ...)
  2013-04-28 19:51 ` [PATCH -V7 08/10] powerpc/THP: Enable THP on PPC64 Aneesh Kumar K.V
@ 2013-04-28 19:51 ` Aneesh Kumar K.V
  2013-05-03  5:28   ` David Gibson
  2013-04-28 19:51 ` [PATCH -V7 10/10] powerpc: disable assert_pte_locked Aneesh Kumar K.V
  9 siblings, 1 reply; 34+ messages in thread
From: Aneesh Kumar K.V @ 2013-04-28 19:51 UTC (permalink / raw)
  To: benh, paulus, dwg, linux-mm; +Cc: linuxppc-dev, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Invalidating a hugepage involves invalidating multiple HPTE entries.
Optimize the operation by using H_BULK_REMOVE on LPAR platforms.
On native, reduce the number of TLB flushes.
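
A rough worst-case count for a 16MB THP backed by 64K base-page HPTEs (a
back-of-the-envelope sketch; the macro names are illustrative only):

	#define THP_BYTES	(16UL << 20)			/* 16MB hugepage */
	#define HPTE_BYTES	(64UL << 10)			/* 64K base page */
	#define HPTES_PER_THP	(THP_BYTES / HPTE_BYTES)	/* 256 entries  */
	#define PAIRS_PER_HCALL	4	/* 8 hcall params = 4 (slot, AVPN) pairs */
	/*
	 * Without H_BULK_REMOVE we need up to 256 single invalidate hcalls.
	 * With H_BULK_REMOVE we need at most 256 / 4 = 64 hcalls, issued in
	 * batches of 12 HPTEs so the tlbie lock is held across at most
	 * 3 hcalls at a time.
	 */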

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/machdep.h    |   3 +
 arch/powerpc/mm/hash_native_64.c      |  78 +++++++++++++++++++++
 arch/powerpc/mm/pgtable_64.c          |  13 +++-
 arch/powerpc/platforms/pseries/lpar.c | 126 ++++++++++++++++++++++++++++++++--
 4 files changed, 210 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 3f3f691..5d1e7d2 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -56,6 +56,9 @@ struct machdep_calls {
 	void            (*hpte_removebolted)(unsigned long ea,
 					     int psize, int ssize);
 	void		(*flush_hash_range)(unsigned long number, int local);
+	void		(*hugepage_invalidate)(struct mm_struct *mm,
+					       unsigned char *hpte_slot_array,
+					       unsigned long addr, int psize);
 
 	/* special for kexec, to be called in real mode, linear mapping is
 	 * destroyed as well */
diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
index 6a2aead..8ca178d 100644
--- a/arch/powerpc/mm/hash_native_64.c
+++ b/arch/powerpc/mm/hash_native_64.c
@@ -455,6 +455,83 @@ static void native_hpte_invalidate(unsigned long slot, unsigned long vpn,
 	local_irq_restore(flags);
 }
 
+static void native_hugepage_invalidate(struct mm_struct *mm,
+				       unsigned char *hpte_slot_array,
+				       unsigned long addr, int psize)
+{
+	int ssize = 0, i;
+	int lock_tlbie;
+	struct hash_pte *hptep;
+	int actual_psize = MMU_PAGE_16M;
+	unsigned int max_hpte_count, valid;
+	unsigned long flags, s_addr = addr;
+	unsigned long hpte_v, want_v, shift;
+	unsigned long hidx, vpn = 0, vsid, hash, slot;
+
+	shift = mmu_psize_defs[psize].shift;
+	max_hpte_count = HUGE_PAGE_SIZE >> shift;
+
+	local_irq_save(flags);
+	for (i = 0; i < max_hpte_count; i++) {
+		/*
+		 * 8 bits per each hpte entries
+		 * 000| [ secondary group (one bit) | hidx (3 bits) | valid bit]
+		 */
+		valid = hpte_slot_array[i] & 0x1;
+		if (!valid)
+			continue;
+		hidx =  hpte_slot_array[i]  >> 1;
+
+		/* get the vpn */
+		addr = s_addr + (i * (1ul << shift));
+		if (!is_kernel_addr(addr)) {
+			ssize = user_segment_size(addr);
+			vsid = get_vsid(mm->context.id, addr, ssize);
+			WARN_ON(vsid == 0);
+		} else {
+			vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
+			ssize = mmu_kernel_ssize;
+		}
+
+		vpn = hpt_vpn(addr, vsid, ssize);
+		hash = hpt_hash(vpn, shift, ssize);
+		if (hidx & _PTEIDX_SECONDARY)
+			hash = ~hash;
+
+		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+		slot += hidx & _PTEIDX_GROUP_IX;
+
+		hptep = htab_address + slot;
+		want_v = hpte_encode_avpn(vpn, psize, ssize);
+		native_lock_hpte(hptep);
+		hpte_v = hptep->v;
+
+		/* Even if we miss, we need to invalidate the TLB */
+		if (!HPTE_V_COMPARE(hpte_v, want_v) || !(hpte_v & HPTE_V_VALID))
+			native_unlock_hpte(hptep);
+		else
+			/* Invalidate the hpte. NOTE: this also unlocks it */
+			hptep->v = 0;
+	}
+	/*
+	 * Since this is a hugepage, we just need a single tlbie.
+	 * use the last vpn.
+	 */
+	lock_tlbie = !mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE);
+	if (lock_tlbie)
+		raw_spin_lock(&native_tlbie_lock);
+
+	asm volatile("ptesync":::"memory");
+	__tlbie(vpn, psize, actual_psize, ssize);
+	asm volatile("eieio; tlbsync; ptesync":::"memory");
+
+	if (lock_tlbie)
+		raw_spin_unlock(&native_tlbie_lock);
+
+	local_irq_restore(flags);
+}
+
+
 static void hpte_decode(struct hash_pte *hpte, unsigned long slot,
 			int *psize, int *apsize, int *ssize, unsigned long *vpn)
 {
@@ -658,4 +735,5 @@ void __init hpte_init_native(void)
 	ppc_md.hpte_remove	= native_hpte_remove;
 	ppc_md.hpte_clear_all	= native_hpte_clear;
 	ppc_md.flush_hash_range = native_flush_hash_range;
+	ppc_md.hugepage_invalidate   = native_hugepage_invalidate;
 }
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index b742d6f..504952f 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -659,6 +659,7 @@ void hpte_need_hugepage_flush(struct mm_struct *mm, unsigned long addr,
 {
 	int ssize, i;
 	unsigned long s_addr;
+	int max_hpte_count;
 	unsigned int psize, valid;
 	unsigned char *hpte_slot_array;
 	unsigned long hidx, vpn, vsid, hash, shift, slot;
@@ -672,12 +673,18 @@ void hpte_need_hugepage_flush(struct mm_struct *mm, unsigned long addr,
 	 * second half of the PMD
 	 */
 	hpte_slot_array = *(char **)(pmdp + PTRS_PER_PMD);
-
 	/* get the base page size */
 	psize = get_slice_psize(mm, s_addr);
-	shift = mmu_psize_defs[psize].shift;
 
-	for (i = 0; i < (HUGE_PAGE_SIZE >> shift); i++) {
+	if (ppc_md.hugepage_invalidate)
+		return ppc_md.hugepage_invalidate(mm, hpte_slot_array,
+						  s_addr, psize);
+	/*
+	 * No bulk hpte removal support, invalidate each entry
+	 */
+	shift = mmu_psize_defs[psize].shift;
+	max_hpte_count = HUGE_PAGE_SIZE >> shift;
+	for (i = 0; i < max_hpte_count; i++) {
 		/*
 		 * 8 bits per each hpte entries
 		 * 000| [ secondary group (one bit) | hidx (3 bits) | valid bit]
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 6d62072..58a31db 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -45,6 +45,13 @@
 #include "plpar_wrappers.h"
 #include "pseries.h"
 
+/* Flag bits for H_BULK_REMOVE */
+#define HBR_REQUEST	0x4000000000000000UL
+#define HBR_RESPONSE	0x8000000000000000UL
+#define HBR_END		0xc000000000000000UL
+#define HBR_AVPN	0x0200000000000000UL
+#define HBR_ANDCOND	0x0100000000000000UL
+
 
 /* in hvCall.S */
 EXPORT_SYMBOL(plpar_hcall);
@@ -345,6 +352,117 @@ static void pSeries_lpar_hpte_invalidate(unsigned long slot, unsigned long vpn,
 	BUG_ON(lpar_rc != H_SUCCESS);
 }
 
+/*
+ * Limit iterations holding pSeries_lpar_tlbie_lock to 3. We also need
+ * to make sure that we avoid bouncing the hypervisor tlbie lock.
+ */
+#define PPC64_HUGE_HPTE_BATCH 12
+
+static void __pSeries_lpar_hugepage_invalidate(unsigned long *slot,
+					     unsigned long *vpn, int count,
+					     int psize, int ssize)
+{
+	unsigned long param[9];
+	int i = 0, pix = 0, rc;
+	unsigned long flags = 0;
+	int lock_tlbie = !mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE);
+
+	if (lock_tlbie)
+		spin_lock_irqsave(&pSeries_lpar_tlbie_lock, flags);
+
+	for (i = 0; i < count; i++) {
+
+		if (!firmware_has_feature(FW_FEATURE_BULK_REMOVE)) {
+			pSeries_lpar_hpte_invalidate(slot[i], vpn[i], psize,
+						     ssize, 0);
+		} else {
+			param[pix] = HBR_REQUEST | HBR_AVPN | slot[i];
+			param[pix+1] = hpte_encode_avpn(vpn[i], psize, ssize);
+			pix += 2;
+			if (pix == 8) {
+				rc = plpar_hcall9(H_BULK_REMOVE, param,
+						  param[0], param[1], param[2],
+						  param[3], param[4], param[5],
+						  param[6], param[7]);
+				BUG_ON(rc != H_SUCCESS);
+				pix = 0;
+			}
+		}
+	}
+	if (pix) {
+		param[pix] = HBR_END;
+		rc = plpar_hcall9(H_BULK_REMOVE, param, param[0], param[1],
+				  param[2], param[3], param[4], param[5],
+				  param[6], param[7]);
+		BUG_ON(rc != H_SUCCESS);
+	}
+
+	if (lock_tlbie)
+		spin_unlock_irqrestore(&pSeries_lpar_tlbie_lock, flags);
+}
+
+static void pSeries_lpar_hugepage_invalidate(struct mm_struct *mm,
+				       unsigned char *hpte_slot_array,
+				       unsigned long addr, int psize)
+{
+	int ssize = 0, i, index = 0;
+	unsigned long s_addr = addr;
+	unsigned int max_hpte_count, valid;
+	unsigned long vpn_array[PPC64_HUGE_HPTE_BATCH];
+	unsigned long slot_array[PPC64_HUGE_HPTE_BATCH];
+	unsigned long shift, hidx, vpn = 0, vsid, hash, slot;
+
+	shift = mmu_psize_defs[psize].shift;
+	max_hpte_count = HUGE_PAGE_SIZE >> shift;
+
+	for (i = 0; i < max_hpte_count; i++) {
+		/*
+		 * 8 bits per each hpte entries
+		 * 000| [ secondary group (one bit) | hidx (3 bits) | valid bit]
+		 */
+		valid = hpte_slot_array[i] & 0x1;
+		if (!valid)
+			continue;
+		hidx =  hpte_slot_array[i]  >> 1;
+
+		/* get the vpn */
+		addr = s_addr + (i * (1ul << shift));
+		if (!is_kernel_addr(addr)) {
+			ssize = user_segment_size(addr);
+			vsid = get_vsid(mm->context.id, addr, ssize);
+			WARN_ON(vsid == 0);
+		} else {
+			vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
+			ssize = mmu_kernel_ssize;
+		}
+
+		vpn = hpt_vpn(addr, vsid, ssize);
+		hash = hpt_hash(vpn, shift, ssize);
+		if (hidx & _PTEIDX_SECONDARY)
+			hash = ~hash;
+
+		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+		slot += hidx & _PTEIDX_GROUP_IX;
+
+		slot_array[index] = slot;
+		vpn_array[index] = vpn;
+		if (index == PPC64_HUGE_HPTE_BATCH - 1) {
+			/*
+			 * Now do a bulk invalidate
+			 */
+			__pSeries_lpar_hugepage_invalidate(slot_array,
+							   vpn_array,
+							   PPC64_HUGE_HPTE_BATCH,
+							   psize, ssize);
+			index = 0;
+		} else
+			index++;
+	}
+	if (index)
+		__pSeries_lpar_hugepage_invalidate(slot_array, vpn_array,
+						   index, psize, ssize);
+}
+
 static void pSeries_lpar_hpte_removebolted(unsigned long ea,
 					   int psize, int ssize)
 {
@@ -360,13 +478,6 @@ static void pSeries_lpar_hpte_removebolted(unsigned long ea,
 	pSeries_lpar_hpte_invalidate(slot, vpn, psize, ssize, 0);
 }
 
-/* Flag bits for H_BULK_REMOVE */
-#define HBR_REQUEST	0x4000000000000000UL
-#define HBR_RESPONSE	0x8000000000000000UL
-#define HBR_END		0xc000000000000000UL
-#define HBR_AVPN	0x0200000000000000UL
-#define HBR_ANDCOND	0x0100000000000000UL
-
 /*
  * Take a spinlock around flushes to avoid bouncing the hypervisor tlbie
  * lock.
@@ -452,6 +563,7 @@ void __init hpte_init_lpar(void)
 	ppc_md.hpte_removebolted = pSeries_lpar_hpte_removebolted;
 	ppc_md.flush_hash_range	= pSeries_lpar_flush_hash_range;
 	ppc_md.hpte_clear_all   = pSeries_lpar_hptab_clear;
+	ppc_md.hugepage_invalidate = pSeries_lpar_hugepage_invalidate;
 }
 
 #ifdef CONFIG_PPC_SMLPAR
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH -V7 10/10] powerpc: disable assert_pte_locked
  2013-04-28 19:51 [PATCH -V7 00/10] THP support for PPC64 (Patchset 2) Aneesh Kumar K.V
                   ` (8 preceding siblings ...)
  2013-04-28 19:51 ` [PATCH -V7 09/10] powerpc: Optimize hugepage invalidate Aneesh Kumar K.V
@ 2013-04-28 19:51 ` Aneesh Kumar K.V
  2013-05-03  5:30   ` David Gibson
  9 siblings, 1 reply; 34+ messages in thread
From: Aneesh Kumar K.V @ 2013-04-28 19:51 UTC (permalink / raw)
  To: benh, paulus, dwg, linux-mm; +Cc: linuxppc-dev, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

With THP we set the pmd to none before we do pte_clear. Hence we can't
walk the page table to get the pte lock pointer and verify whether it is
locked. THP does take the pte lock before calling pte_clear, so we don't
change the locking rules here. It is just that with THP we can't use a
page table walk to check whether the pte lock is held.

NOTE: This needs to be rewritten. Not to be merged upstream.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/mm/pgtable.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 214130a..d77f94f 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -224,6 +224,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma, unsigned long address,
 #ifdef CONFIG_DEBUG_VM
 void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
 {
+#if 0
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
@@ -237,6 +238,7 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
 	pmd = pmd_offset(pud, addr);
 	BUG_ON(!pmd_present(*pmd));
 	assert_spin_locked(pte_lockptr(mm, pmd));
+#endif
 }
 #endif /* CONFIG_DEBUG_VM */
 
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH -V7 01/10] powerpc/THP: Double the PMD table size for THP
  2013-04-28 19:51 ` [PATCH -V7 01/10] powerpc/THP: Double the PMD table size for THP Aneesh Kumar K.V
@ 2013-05-03  3:21   ` David Gibson
  0 siblings, 0 replies; 34+ messages in thread
From: David Gibson @ 2013-05-03  3:21 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, linux-mm

[-- Attachment #1: Type: text/plain, Size: 881 bytes --]

On Mon, Apr 29, 2013 at 01:21:42AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> THP code does PTE page allocation along with large page request and deposit them
> for later use. This is to ensure that we won't have any failures when we split
> hugepages to regular pages.
> 
> On powerpc we want to use the deposited PTE page for storing hash pte slot and
> secondary bit information for the HPTEs. We use the second half
> of the pmd table to save the deposited PTE page.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

So far so good.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -V7 02/10] powerpc/THP: Implement transparent hugepages for ppc64
  2013-04-28 19:51 ` [PATCH -V7 02/10] powerpc/THP: Implement transparent hugepages for ppc64 Aneesh Kumar K.V
@ 2013-05-03  4:52   ` David Gibson
  2013-05-03  8:19     ` Benjamin Herrenschmidt
  2013-05-04 19:14     ` Aneesh Kumar K.V
  0 siblings, 2 replies; 34+ messages in thread
From: David Gibson @ 2013-05-03  4:52 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, linux-mm

[-- Attachment #1: Type: text/plain, Size: 25821 bytes --]

On Mon, Apr 29, 2013 at 01:21:43AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> We now have pmd entries covering 16MB range and the PMD table double its original size.
> We use the second half of the PMD table to deposit the pgtable (PTE page).
> The deposited PTE page is further used to track the HPTE information. The information
> include [ secondary group | 3 bit hidx | valid ]. We use one byte per each HPTE entry.
> With 16MB hugepage and 64K HPTE we need 256 entries and with 4K HPTE we need
> 4096 entries. Both will fit in a 4K PTE page. On hugepage invalidate we need to walk
> the PTE page and invalidate all valid HPTEs.
> 
> This patch implements necessary arch specific functions for THP support and also
> hugepage invalidate logic. These PMD related functions are intentionally kept
> similar to their PTE counter-part.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/page.h              |  11 +-
>  arch/powerpc/include/asm/pgtable-ppc64-64k.h |   3 +-
>  arch/powerpc/include/asm/pgtable-ppc64.h     | 259 +++++++++++++++++++++-
>  arch/powerpc/include/asm/pgtable.h           |   5 +
>  arch/powerpc/include/asm/pte-hash64-64k.h    |  17 ++
>  arch/powerpc/mm/pgtable_64.c                 | 318 +++++++++++++++++++++++++++
>  arch/powerpc/platforms/Kconfig.cputype       |   1 +
>  7 files changed, 611 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
> index 988c812..cbf4be7 100644
> --- a/arch/powerpc/include/asm/page.h
> +++ b/arch/powerpc/include/asm/page.h
> @@ -37,8 +37,17 @@
>  #define PAGE_SIZE		(ASM_CONST(1) << PAGE_SHIFT)
>  
>  #ifndef __ASSEMBLY__
> -#ifdef CONFIG_HUGETLB_PAGE
> +/*
> + * With hugetlbfs enabled we allow the HPAGE_SHIFT to run time
> + * configurable. But we enable THP only with 16MB hugepage.
> + * With only THP configured, we force hugepage size to 16MB.
> + * This should ensure that all subarchs that doesn't support
> + * THP continue to work fine with HPAGE_SHIFT usage.
> + */
> +#if defined(CONFIG_HUGETLB_PAGE)
>  extern unsigned int HPAGE_SHIFT;
> +#elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
> +#define HPAGE_SHIFT PMD_SHIFT

As I said in comments on the first patch series, this messing around
with HPAGE_SHIFT for THP is missing the point.  On ppc HPAGE_SHIFT is
nothing more than the _default_ hugepage size for explicit hugepages.
THP should not be dependent on it in any way.

>  #else
>  #define HPAGE_SHIFT PAGE_SHIFT
>  #endif
> diff --git a/arch/powerpc/include/asm/pgtable-ppc64-64k.h b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
> index 45142d6..a56b82f 100644
> --- a/arch/powerpc/include/asm/pgtable-ppc64-64k.h
> +++ b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
> @@ -33,7 +33,8 @@
>  #define PGDIR_MASK	(~(PGDIR_SIZE-1))
>  
>  /* Bits to mask out from a PMD to get to the PTE page */
> -#define PMD_MASKED_BITS		0x1ff
> +/* PMDs point to PTE table fragments which are 4K aligned.  */
> +#define PMD_MASKED_BITS		0xfff

Hrm.  AFAICT this is related to the change in size of PTE tables, and
hence the page sharing stuff, so this belongs in the patch which
implements that, rather than the THP support itself.

>  /* Bits to mask out from a PGD/PUD to get to the PMD page */
>  #define PUD_MASKED_BITS		0x1ff
>  
> diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
> index ab84332..20133c1 100644
> --- a/arch/powerpc/include/asm/pgtable-ppc64.h
> +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
> @@ -154,7 +154,7 @@
>  #define	pmd_present(pmd)	(pmd_val(pmd) != 0)
>  #define	pmd_clear(pmdp)		(pmd_val(*(pmdp)) = 0)
>  #define pmd_page_vaddr(pmd)	(pmd_val(pmd) & ~PMD_MASKED_BITS)
> -#define pmd_page(pmd)		virt_to_page(pmd_page_vaddr(pmd))
> +extern struct page *pmd_page(pmd_t pmd);
>  
>  #define pud_set(pudp, pudval)	(pud_val(*(pudp)) = (pudval))
>  #define pud_none(pud)		(!pud_val(pud))
> @@ -382,4 +382,261 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
>  
>  #endif /* __ASSEMBLY__ */
>  
> +#ifndef _PAGE_SPLITTING
> +/*
> + * THP pages can't be special. So use the _PAGE_SPECIAL
> + */
> +#define _PAGE_SPLITTING _PAGE_SPECIAL
> +#endif
> +
> +#ifndef _PAGE_THP_HUGE
> +/*
> + * We need to differentiate between explicit huge page and THP huge
> + * page, since THP huge page also need to track real subpage details
> + * We use the _PAGE_COMBO bits here as dummy for platform that doesn't
> + * support THP.
> + */
> +#define _PAGE_THP_HUGE  0x10000000

So if it's _PAGE_COMBO, use _PAGE_COMBO, instead of the actual number.

> +#endif
> +
> +/*
> + * PTE flags to conserve for HPTE identification for THP page.
> + */
> +#ifndef _PAGE_THP_HPTEFLAGS
> +#define _PAGE_THP_HPTEFLAGS	(_PAGE_BUSY | _PAGE_HASHPTE)

You have this definition both here and in pte-hash64-64k.h.  More
importantly including _PAGE_BUSY seems like an extremely bad idea -
did you mean _PAGE_THP_HUGE == _PAGE_COMBO?

> +#endif
> +
> +#define HUGE_PAGE_SIZE		(ASM_CONST(1) << 24)
> +#define HUGE_PAGE_MASK		(~(HUGE_PAGE_SIZE - 1))

These constants should be named so it's clear they're THP-specific.
They should also be defined in terms of PMD_SHIFT, instead of
directly.
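
Something along these lines, for instance (the names here are only
illustrative):

	/* derive the THP page size from PMD_SHIFT instead of hard-coding 1 << 24 */
	#define PMD_HUGE_PAGE_SHIFT	PMD_SHIFT
	#define PMD_HUGE_PAGE_SIZE	(ASM_CONST(1) << PMD_HUGE_PAGE_SHIFT)
	#define PMD_HUGE_PAGE_MASK	(~(PMD_HUGE_PAGE_SIZE - 1))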

> +/*
> + * set of bits not changed in pmd_modify.
> + */
> +#define _HPAGE_CHG_MASK (PTE_RPN_MASK | _PAGE_THP_HPTEFLAGS | \
> +			 _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_THP_HUGE)
> +
> +#ifndef __ASSEMBLY__
> +extern void hpte_need_hugepage_flush(struct mm_struct *mm, unsigned long addr,
> +				     pmd_t *pmdp);

This should maybe be called "hpte_do_hugepage_flush()".  The current
name suggests it returns a boolean, rather than performing the actual
flush.

> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
> +extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
> +extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot);
> +extern void set_pmd_at(struct mm_struct *mm, unsigned long addr,
> +		       pmd_t *pmdp, pmd_t pmd);
> +extern void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
> +				 pmd_t *pmd);
> +
> +static inline int pmd_trans_huge(pmd_t pmd)
> +{
> +	/*
> +	 * leaf pte for huge page, bottom two bits != 00
> +	 */
> +	return (pmd_val(pmd) & 0x3) && (pmd_val(pmd) & _PAGE_THP_HUGE);
> +}
> +
> +static inline int pmd_large(pmd_t pmd)
> +{
> +	/*
> +	 * leaf pte for huge page, bottom two bits != 00
> +	 */
> +	if (pmd_trans_huge(pmd))
> +		return pmd_val(pmd) & _PAGE_PRESENT;
> +	return 0;
> +}
> +
> +static inline int pmd_trans_splitting(pmd_t pmd)
> +{
> +	if (pmd_trans_huge(pmd))
> +		return pmd_val(pmd) & _PAGE_SPLITTING;
> +	return 0;
> +}
> +
> +
> +static inline unsigned long pmd_pfn(pmd_t pmd)
> +{
> +	/*
> +	 * Only called for hugepage pmd
> +	 */
> +	return pmd_val(pmd) >> PTE_RPN_SHIFT;
> +}
> +
> +/* We will enable it in the last patch */
> +#define has_transparent_hugepage() 0
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +static inline int pmd_young(pmd_t pmd)
> +{
> +	return pmd_val(pmd) & _PAGE_ACCESSED;
> +}

It would be clearer to define this function, as well as the various
others that operate on PMDs as PTEs, to just cast the parameter and call
the corresponding pte_XXX().
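
A sketch of what that could look like, using the existing __pte()/__pmd()
converters (pmd_pte()/pte_pmd() below are only illustrative helper names):

	static inline pte_t pmd_pte(pmd_t pmd)
	{
		return __pte(pmd_val(pmd));
	}

	static inline pmd_t pte_pmd(pte_t pte)
	{
		return __pmd(pte_val(pte));
	}

	static inline int pmd_young(pmd_t pmd)
	{
		return pte_young(pmd_pte(pmd));
	}

	static inline pmd_t pmd_mkdirty(pmd_t pmd)
	{
		return pte_pmd(pte_mkdirty(pmd_pte(pmd)));
	}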

> +
> +static inline pmd_t pmd_mkhuge(pmd_t pmd)
> +{
> +	/* Do nothing, mk_pmd() does this part.  */
> +	return pmd;
> +}
> +
> +#define __HAVE_ARCH_PMD_WRITE
> +static inline int pmd_write(pmd_t pmd)
> +{
> +	return pmd_val(pmd) & _PAGE_RW;
> +}
> +
> +static inline pmd_t pmd_mkold(pmd_t pmd)
> +{
> +	pmd_val(pmd) &= ~_PAGE_ACCESSED;
> +	return pmd;
> +}
> +
> +static inline pmd_t pmd_wrprotect(pmd_t pmd)
> +{
> +	pmd_val(pmd) &= ~_PAGE_RW;
> +	return pmd;
> +}
> +
> +static inline pmd_t pmd_mkdirty(pmd_t pmd)
> +{
> +	pmd_val(pmd) |= _PAGE_DIRTY;
> +	return pmd;
> +}
> +
> +static inline pmd_t pmd_mkyoung(pmd_t pmd)
> +{
> +	pmd_val(pmd) |= _PAGE_ACCESSED;
> +	return pmd;
> +}
> +
> +static inline pmd_t pmd_mkwrite(pmd_t pmd)
> +{
> +	pmd_val(pmd) |= _PAGE_RW;
> +	return pmd;
> +}
> +
> +static inline pmd_t pmd_mknotpresent(pmd_t pmd)
> +{
> +	pmd_val(pmd) &= ~_PAGE_PRESENT;
> +	return pmd;
> +}
> +
> +static inline pmd_t pmd_mksplitting(pmd_t pmd)
> +{
> +	pmd_val(pmd) |= _PAGE_SPLITTING;
> +	return pmd;
> +}
> +
> +/*
> + * Set the dirty and/or accessed bits atomically in a linux hugepage PMD, this
> + * function doesn't need to flush the hash entry
> + */
> +static inline void __pmdp_set_access_flags(pmd_t *pmdp, pmd_t entry)
> +{
> +	unsigned long bits = pmd_val(entry) & (_PAGE_DIRTY |
> +					       _PAGE_ACCESSED |
> +					       _PAGE_RW | _PAGE_EXEC);
> +#ifdef PTE_ATOMIC_UPDATES
> +	unsigned long old, tmp;
> +
> +	__asm__ __volatile__(
> +	"1:	ldarx	%0,0,%4\n\
> +		andi.	%1,%0,%6\n\
> +		bne-	1b \n\
> +		or	%0,%3,%0\n\
> +		stdcx.	%0,0,%4\n\
> +		bne-	1b"
> +	:"=&r" (old), "=&r" (tmp), "=m" (*pmdp)
> +	:"r" (bits), "r" (pmdp), "m" (*pmdp), "i" (_PAGE_BUSY)
> +	:"cc");
> +#else
> +	unsigned long old = pmd_val(*pmdp);
> +	*pmdp = __pmd(old | bits);
> +#endif

Using parameter casts on the corresponding pte_update() function would
be even more valuable for these more complex functions with asm.

> +}
> +
> +#define __HAVE_ARCH_PMD_SAME
> +static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
> +{
> +	return (((pmd_val(pmd_a) ^ pmd_val(pmd_b)) & ~_PAGE_THP_HPTEFLAGS) == 0);

Here, specifically, the fact that _PAGE_BUSY is in _PAGE_THP_HPTEFLAGS
is likely to be bad.  If the page is busy, it's in the middle of an
update, so it can't stably be considered the same as anything.

> +}
> +
> +#define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
> +extern int pmdp_set_access_flags(struct vm_area_struct *vma,
> +				 unsigned long address, pmd_t *pmdp,
> +				 pmd_t entry, int dirty);
> +
> +static inline unsigned long pmd_hugepage_update(struct mm_struct *mm,
> +						unsigned long addr,
> +						pmd_t *pmdp, unsigned long clr)
> +{
> +#ifdef PTE_ATOMIC_UPDATES
> +	unsigned long old, tmp;
> +
> +	__asm__ __volatile__(
> +	"1:	ldarx	%0,0,%3\n\
> +		andi.	%1,%0,%6\n\
> +		bne-	1b \n\
> +		andc	%1,%0,%4 \n\
> +		stdcx.	%1,0,%3 \n\
> +		bne-	1b"
> +	: "=&r" (old), "=&r" (tmp), "=m" (*pmdp)
> +	: "r" (pmdp), "r" (clr), "m" (*pmdp), "i" (_PAGE_BUSY)
> +	: "cc" );
> +#else
> +	unsigned long old = pmd_val(*pmdp);
> +	*pmdp = __pmd(old & ~clr);
> +#endif
> +
> +#ifdef CONFIG_PPC_STD_MMU_64

THP only works with the standard hash MMU, so this #if seems a bit
pointless.

> +	if (old & _PAGE_HASHPTE)
> +		hpte_need_hugepage_flush(mm, addr, pmdp);
> +#endif
> +	return old;
> +}
> +
> +static inline int __pmdp_test_and_clear_young(struct mm_struct *mm,
> +					      unsigned long addr, pmd_t *pmdp)
> +{
> +	unsigned long old;
> +
> +	if ((pmd_val(*pmdp) & (_PAGE_ACCESSED | _PAGE_HASHPTE)) == 0)
> +		return 0;
> +	old = pmd_hugepage_update(mm, addr, pmdp, _PAGE_ACCESSED);
> +	return ((old & _PAGE_ACCESSED) != 0);
> +}
> +
> +#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
> +extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
> +				     unsigned long address, pmd_t *pmdp);
> +#define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
> +extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
> +				  unsigned long address, pmd_t *pmdp);
> +
> +#define __HAVE_ARCH_PMDP_GET_AND_CLEAR
> +extern pmd_t pmdp_get_and_clear(struct mm_struct *mm,
> +				unsigned long addr, pmd_t *pmdp);
> +
> +#define __HAVE_ARCH_PMDP_SET_WRPROTECT

Now that the PTE format is the same at the bottom and PMD levels, do you
still need this?

> +static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr,
> +				      pmd_t *pmdp)
> +{
> +
> +	if ((pmd_val(*pmdp) & _PAGE_RW) == 0)
> +		return;
> +
> +	pmd_hugepage_update(mm, addr, pmdp, _PAGE_RW);
> +}
> +
> +#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
> +extern void pmdp_splitting_flush(struct vm_area_struct *vma,
> +				 unsigned long address, pmd_t *pmdp);
> +
> +#define __HAVE_ARCH_PGTABLE_DEPOSIT
> +extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
> +				       pgtable_t pgtable);
> +#define __HAVE_ARCH_PGTABLE_WITHDRAW
> +extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
> +
> +#define __HAVE_ARCH_PMDP_INVALIDATE
> +extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> +			    pmd_t *pmdp);
> +#endif /* __ASSEMBLY__ */
>  #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
> diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
> index 7aeb955..283198e 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -222,5 +222,10 @@ extern int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>  		       unsigned long end, int write, struct page **pages, int *nr);
>  #endif /* __ASSEMBLY__ */
>  
> +#ifndef CONFIG_TRANSPARENT_HUGEPAGE
> +#define pmd_large(pmd)		0
> +#define has_transparent_hugepage() 0
> +#endif
> +
>  #endif /* __KERNEL__ */
>  #endif /* _ASM_POWERPC_PGTABLE_H */
> diff --git a/arch/powerpc/include/asm/pte-hash64-64k.h b/arch/powerpc/include/asm/pte-hash64-64k.h
> index 3e13e23..6be70be 100644
> --- a/arch/powerpc/include/asm/pte-hash64-64k.h
> +++ b/arch/powerpc/include/asm/pte-hash64-64k.h
> @@ -38,6 +38,23 @@
>   */
>  #define PTE_RPN_SHIFT	(30)
>  
> +/*
> + * THP pages can't be special. So use the _PAGE_SPECIAL
> + */
> +#define _PAGE_SPLITTING _PAGE_SPECIAL
> +
> +/*
> + * PTE flags to conserve for HPTE identification for THP page.
> + * We drop _PAGE_COMBO here, because we overload that with _PAGE_TH_HUGE.
> + */
> +#define _PAGE_THP_HPTEFLAGS	(_PAGE_BUSY | _PAGE_HASHPTE)
> +
> +/*
> + * We need to differentiate between explicit huge page and THP huge
> + * page, since THP huge page also need to track real subpage details
> + */
> +#define _PAGE_THP_HUGE  _PAGE_COMBO

All 3 of these definitions also appeared elsewhere.

> +
>  #ifndef __ASSEMBLY__
>  
>  /*
> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
> index a854096..54216c1 100644
> --- a/arch/powerpc/mm/pgtable_64.c
> +++ b/arch/powerpc/mm/pgtable_64.c
> @@ -338,6 +338,19 @@ EXPORT_SYMBOL(iounmap);
>  EXPORT_SYMBOL(__iounmap);
>  EXPORT_SYMBOL(__iounmap_at);
>  
> +/*
> + * For hugepage we have pfn in the pmd, we use PTE_RPN_SHIFT bits for flags
> + * For PTE page, we have a PTE_FRAG_SIZE (4K) aligned virtual address.
> + */
> +struct page *pmd_page(pmd_t pmd)
> +{
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	if (pmd_trans_huge(pmd))
> +		return pfn_to_page(pmd_pfn(pmd));

In this case you should be able to define this in terms of pte_pfn().

> +#endif
> +	return virt_to_page(pmd_page_vaddr(pmd));
> +}
> +
>  #ifdef CONFIG_PPC_64K_PAGES
>  static pte_t *get_from_cache(struct mm_struct *mm)
>  {
> @@ -455,3 +468,308 @@ void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
>  }
>  #endif
>  #endif /* CONFIG_PPC_64K_PAGES */
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static pmd_t set_hugepage_access_flags_filter(pmd_t pmd,
> +					      struct vm_area_struct *vma,
> +					      int dirty)
> +{
> +	return pmd;
> +}

This identity function is defined immediately before its only use.  Why
does it exist?

> +/*
> + * This is called when relaxing access to a hugepage. It's also called in the page
> + * fault path when we don't hit any of the major fault cases, ie, a minor
> + * update of _PAGE_ACCESSED, _PAGE_DIRTY, etc... The generic code will have
> + * handled those two for us, we additionally deal with missing execute
> + * permission here on some processors
> + */
> +int pmdp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
> +			  pmd_t *pmdp, pmd_t entry, int dirty)
> +{
> +	int changed;
> +	entry = set_hugepage_access_flags_filter(entry, vma, dirty);
> +	changed = !pmd_same(*(pmdp), entry);
> +	if (changed) {
> +		__pmdp_set_access_flags(pmdp, entry);
> +		/*
> +		 * Since we are not supporting SW TLB systems, we don't
> +		 * have any thing similar to flush_tlb_page_nohash()
> +		 */
> +	}
> +	return changed;
> +}
> +
> +int pmdp_test_and_clear_young(struct vm_area_struct *vma,
> +			      unsigned long address, pmd_t *pmdp)
> +{
> +	return __pmdp_test_and_clear_young(vma->vm_mm, address, pmdp);
> +}
> +
> +/*
> + * We currently remove entries from the hashtable regardless of whether
> + * the entry was young or dirty. The generic routines only flush if the
> + * entry was young or dirty which is not good enough.
> + *
> + * We should be more intelligent about this but for the moment we override
> + * these functions and force a tlb flush unconditionally
> + */
> +int pmdp_clear_flush_young(struct vm_area_struct *vma,
> +				  unsigned long address, pmd_t *pmdp)
> +{
> +	return __pmdp_test_and_clear_young(vma->vm_mm, address, pmdp);
> +}
> +
> +/*
> + * We mark the pmd splitting and invalidate all the hpte
> + * entries for this hugepage.
> + */
> +void pmdp_splitting_flush(struct vm_area_struct *vma,
> +			  unsigned long address, pmd_t *pmdp)
> +{
> +	unsigned long old, tmp;
> +
> +	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +#ifdef PTE_ATOMIC_UPDATES
> +
> +	__asm__ __volatile__(
> +	"1:	ldarx	%0,0,%3\n\
> +		andi.	%1,%0,%6\n\
> +		bne-	1b \n\
> +		ori	%1,%0,%4 \n\
> +		stdcx.	%1,0,%3 \n\
> +		bne-	1b"
> +	: "=&r" (old), "=&r" (tmp), "=m" (*pmdp)
> +	: "r" (pmdp), "i" (_PAGE_SPLITTING), "m" (*pmdp), "i" (_PAGE_BUSY)
> +	: "cc" );
> +#else
> +	old = pmd_val(*pmdp);
> +	*pmdp = __pmd(old | _PAGE_SPLITTING);
> +#endif
> +	/*
> +	 * If we didn't had the splitting flag set, go and flush the
> +	 * HPTE entries and serialize against gup fast.
> +	 */
> +	if (!(old & _PAGE_SPLITTING)) {
> +#ifdef CONFIG_PPC_STD_MMU_64
> +		/* We need to flush the hpte */
> +		if (old & _PAGE_HASHPTE)
> +			hpte_need_hugepage_flush(vma->vm_mm, address, pmdp);
> +#endif
> +		/* need tlb flush only to serialize against gup-fast */
> +		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> +	}
> +}
> +
> +/*
> + * We want to put the pgtable in pmd and use pgtable for tracking
> + * the base page size hptes
> + */
> +void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
> +				pgtable_t pgtable)
> +{
> +	unsigned long *pgtable_slot;
> +	assert_spin_locked(&mm->page_table_lock);
> +	/*
> +	 * we store the pgtable in the second half of PMD
> +	 */
> +	pgtable_slot = pmdp + PTRS_PER_PMD;
> +	*pgtable_slot = (unsigned long)pgtable;

Why not just make pgtable_slot have type (pgtable_t *) and avoid the
cast?
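
For instance, a minimal sketch of that variant:

	pgtable_t *pgtable_slot = (pgtable_t *)(pmdp + PTRS_PER_PMD);

	*pgtable_slot = pgtable;	/* no cast of the value to unsigned long */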

> +}
> +
> +pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
> +{
> +	pgtable_t pgtable;
> +	unsigned long *pgtable_slot;
> +
> +	assert_spin_locked(&mm->page_table_lock);
> +	pgtable_slot = pmdp + PTRS_PER_PMD;
> +	pgtable = (pgtable_t) *pgtable_slot;
> +	/*
> +	 * We store HPTE information in the deposited PTE fragment.
> +	 * zero out the content on withdraw.
> +	 */
> +	memset(pgtable, 0, PTE_FRAG_SIZE);
> +	return pgtable;
> +}
> +
> +/*
> + * Since we are looking at latest ppc64, we don't need to worry about
> + * i/d cache coherency on exec fault
> + */
> +static pmd_t set_pmd_filter(pmd_t pmd, unsigned long addr)
> +{
> +	pmd = __pmd(pmd_val(pmd) & ~_PAGE_THP_HPTEFLAGS);
> +	return pmd;
> +}
> +
> +/*
> + * We can make it less convoluted than __set_pte_at, because
> + * we can ignore lot of hardware here, because this is only for
> + * MPSS
> + */
> +static inline void __set_pmd_at(struct mm_struct *mm, unsigned long addr,
> +				pmd_t *pmdp, pmd_t pmd, int percpu)
> +{
> +	/*
> +	 * There is nothing in hash page table now, so nothing to
> +	 * invalidate, set_pte_at is used for adding new entry.
> +	 * For updating we should use update_hugepage_pmd()
> +	 */
> +	*pmdp = pmd;
> +}

Again you should be able to define this in terms of the set_pte_at()
functions.

> +/*
> + * set a new huge pmd. We should not be called for updating
> + * an existing pmd entry. That should go via pmd_hugepage_update.
> + */
> +void set_pmd_at(struct mm_struct *mm, unsigned long addr,
> +		pmd_t *pmdp, pmd_t pmd)
> +{
> +	/*
> +	 * Note: mm->context.id might not yet have been assigned as
> +	 * this context might not have been activated yet when this
> +	 * is called.

And the relevance of this comment here is...?

> +	 */
> +	pmd = set_pmd_filter(pmd, addr);
> +
> +	__set_pmd_at(mm, addr, pmdp, pmd, 0);
> +
> +}
> +
> +void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> +		     pmd_t *pmdp)
> +{
> +	pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT);
> +	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> +}
> +
> +/*
> + * A linux hugepage PMD was changed and the corresponding hash table entries
> + * neesd to be flushed.
> + *
> + * The linux hugepage PMD now include the pmd entries followed by the address
> + * to the stashed pgtable_t. The stashed pgtable_t contains the hpte bits.
> + * [ secondary group | 3 bit hidx | valid ]. We use one byte per each HPTE entry.
> + * With 16MB hugepage and 64K HPTE we need 256 entries and with 4K HPTE we need
> + * 4096 entries. Both will fit in a 4K pgtable_t.
> + */
> +void hpte_need_hugepage_flush(struct mm_struct *mm, unsigned long addr,
> +			      pmd_t *pmdp)
> +{
> +	int ssize, i;
> +	unsigned long s_addr;
> +	unsigned int psize, valid;
> +	unsigned char *hpte_slot_array;
> +	unsigned long hidx, vpn, vsid, hash, shift, slot;
> +
> +	/*
> +	 * Flush all the hptes mapping this hugepage
> +	 */
> +	s_addr = addr & HUGE_PAGE_MASK;
> +	/*
> +	 * The hpte hindex are stored in the pgtable whose address is in the
> +	 * second half of the PMD
> +	 */
> +	hpte_slot_array = *(char **)(pmdp + PTRS_PER_PMD);
> +
> +	/* get the base page size */
> +	psize = get_slice_psize(mm, s_addr);
> +	shift = mmu_psize_defs[psize].shift;
> +
> +	for (i = 0; i < (HUGE_PAGE_SIZE >> shift); i++) {
> +		/*
> +		 * 8 bits per each hpte entries
> +		 * 000| [ secondary group (one bit) | hidx (3 bits) | valid bit]
> +		 */
> +		valid = hpte_slot_array[i] & 0x1;
> +		if (!valid)
> +			continue;
> +		hidx =  hpte_slot_array[i]  >> 1;
> +
> +		/* get the vpn */
> +		addr = s_addr + (i * (1ul << shift));
> +		if (!is_kernel_addr(addr)) {
> +			ssize = user_segment_size(addr);
> +			vsid = get_vsid(mm->context.id, addr, ssize);
> +			WARN_ON(vsid == 0);
> +		} else {
> +			vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
> +			ssize = mmu_kernel_ssize;
> +		}
> +
> +		vpn = hpt_vpn(addr, vsid, ssize);
> +		hash = hpt_hash(vpn, shift, ssize);
> +		if (hidx & _PTEIDX_SECONDARY)
> +			hash = ~hash;
> +
> +		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> +		slot += hidx & _PTEIDX_GROUP_IX;
> +		ppc_md.hpte_invalidate(slot, vpn, psize, ssize, 0);
> +	}
> +}
> +
> +static pmd_t pmd_set_protbits(pmd_t pmd, pgprot_t pgprot)
> +{
> +	pmd_val(pmd) |= pgprot_val(pgprot);
> +	return pmd;
> +}
> +
> +pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot)
> +{
> +	pmd_t pmd;
> +	/*
> +	 * For a valid pte, we would have _PAGE_PRESENT or _PAGE_FILE always
> +	 * set. We use this to check THP page at pmd level.
> +	 * leaf pte for huge page, bottom two bits != 00
> +	 */
> +	pmd_val(pmd) = pfn << PTE_RPN_SHIFT;
> +	pmd_val(pmd) |= _PAGE_THP_HUGE;
> +	pmd = pmd_set_protbits(pmd, pgprot);
> +	return pmd;
> +}
> +
> +pmd_t mk_pmd(struct page *page, pgprot_t pgprot)
> +{
> +	return pfn_pmd(page_to_pfn(page), pgprot);
> +}
> +
> +pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
> +{
> +
> +	pmd_val(pmd) &= _HPAGE_CHG_MASK;
> +	pmd = pmd_set_protbits(pmd, newprot);
> +	return pmd;
> +}
> +
> +/*
> + * This is called at the end of handling a user page fault, when the
> + * fault has been handled by updating a HUGE PMD entry in the linux page tables.
> + * We use it to preload an HPTE into the hash table corresponding to
> + * the updated linux HUGE PMD entry.
> + */
> +void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
> +			  pmd_t *pmd)
> +{
> +	return;
> +}
> +
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +pmd_t pmdp_get_and_clear(struct mm_struct *mm,
> +			 unsigned long addr, pmd_t *pmdp)
> +{
> +	pmd_t old_pmd;
> +	unsigned long old;
> +	/*
> +	 * khugepaged calls this for normal pmd also
> +	 */
> +	if (pmd_trans_huge(*pmdp)) {
> +		old = pmd_hugepage_update(mm, addr, pmdp, ~0UL);
> +		old_pmd = __pmd(old);
> +	} else {
> +		old_pmd = *pmdp;
> +		pmd_clear(pmdp);
> +	}
> +	return old_pmd;
> +}
> diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
> index 18e3b76..a526144 100644
> --- a/arch/powerpc/platforms/Kconfig.cputype
> +++ b/arch/powerpc/platforms/Kconfig.cputype
> @@ -71,6 +71,7 @@ config PPC_BOOK3S_64
>  	select PPC_FPU
>  	select PPC_HAVE_PMU_SUPPORT
>  	select SYS_SUPPORTS_HUGETLBFS
> +	select HAVE_ARCH_TRANSPARENT_HUGEPAGE if PPC_64K_PAGES
>  
>  config PPC_BOOK3E_64
>  	bool "Embedded processors"

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -V7 04/10] powerpc: Update find_linux_pte_or_hugepte to handle transparent hugepages
  2013-04-28 19:51 ` [PATCH -V7 04/10] powerpc: Update find_linux_pte_or_hugepte to handle transparent hugepages Aneesh Kumar K.V
@ 2013-05-03  4:53   ` David Gibson
  2013-05-03 18:58     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 34+ messages in thread
From: David Gibson @ 2013-05-03  4:53 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1091 bytes --]

On Mon, Apr 29, 2013 at 01:21:45AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

What's the difference in meaning between pmd_huge() and pmd_large()?


> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/mm/hugetlbpage.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 8601f2d..081c001 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -954,7 +954,7 @@ pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, unsigned *shift
>  			pdshift = PMD_SHIFT;
>  			pm = pmd_offset(pu, ea);
>  
> -			if (pmd_huge(*pm)) {
> +			if (pmd_huge(*pm) || pmd_large(*pm)) {
>  				ret_pte = (pte_t *) pm;
>  				goto out;
>  			} else if (is_hugepd(pm))

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -V7 05/10] powerpc: Replace find_linux_pte with find_linux_pte_or_hugepte
  2013-04-28 19:51 ` [PATCH -V7 05/10] powerpc: Replace find_linux_pte with find_linux_pte_or_hugepte Aneesh Kumar K.V
@ 2013-05-03  4:56   ` David Gibson
  0 siblings, 0 replies; 34+ messages in thread
From: David Gibson @ 2013-05-03  4:56 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, linux-mm

[-- Attachment #1: Type: text/plain, Size: 7612 bytes --]

On Mon, Apr 29, 2013 at 01:21:46AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> Replace find_linux_pte with find_linux_pte_or_hugepte and explicitly
> document why we don't need to handle transparent hugepages at callsites.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/pgtable-ppc64.h | 24 ------------------------
>  arch/powerpc/kernel/io-workarounds.c     | 10 ++++++++--
>  arch/powerpc/kvm/book3s_hv_rm_mmu.c      |  2 +-
>  arch/powerpc/mm/hash_utils_64.c          |  8 +++++++-
>  arch/powerpc/mm/hugetlbpage.c            |  8 ++++++--
>  arch/powerpc/mm/tlb_hash64.c             |  7 ++++++-
>  arch/powerpc/platforms/pseries/eeh.c     |  7 ++++++-
>  7 files changed, 34 insertions(+), 32 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
> index f0effab..97fc839 100644
> --- a/arch/powerpc/include/asm/pgtable-ppc64.h
> +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
> @@ -343,30 +343,6 @@ static inline void __ptep_set_access_flags(pte_t *ptep, pte_t entry)
>  
>  void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
>  void pgtable_cache_init(void);
> -
> -/*
> - * find_linux_pte returns the address of a linux pte for a given
> - * effective address and directory.  If not found, it returns zero.
> - */
> -static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea)
> -{
> -	pgd_t *pg;
> -	pud_t *pu;
> -	pmd_t *pm;
> -	pte_t *pt = NULL;
> -
> -	pg = pgdir + pgd_index(ea);
> -	if (!pgd_none(*pg)) {
> -		pu = pud_offset(pg, ea);
> -		if (!pud_none(*pu)) {
> -			pm = pmd_offset(pu, ea);
> -			if (pmd_present(*pm))
> -				pt = pte_offset_kernel(pm, ea);
> -		}
> -	}
> -	return pt;
> -}
> -
>  pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
>  				 unsigned *shift);
>  #endif /* __ASSEMBLY__ */
> diff --git a/arch/powerpc/kernel/io-workarounds.c b/arch/powerpc/kernel/io-workarounds.c
> index 50e90b7..e5263ab 100644
> --- a/arch/powerpc/kernel/io-workarounds.c
> +++ b/arch/powerpc/kernel/io-workarounds.c
> @@ -55,6 +55,7 @@ static struct iowa_bus *iowa_pci_find(unsigned long vaddr, unsigned long paddr)
>  
>  struct iowa_bus *iowa_mem_find_bus(const PCI_IO_ADDR addr)
>  {
> +	unsigned shift;
>  	struct iowa_bus *bus;
>  	int token;
>  
> @@ -70,11 +71,16 @@ struct iowa_bus *iowa_mem_find_bus(const PCI_IO_ADDR addr)
>  		if (vaddr < PHB_IO_BASE || vaddr >= PHB_IO_END)
>  			return NULL;
>  
> -		ptep = find_linux_pte(init_mm.pgd, vaddr);
> +		ptep = find_linux_pte_or_hugepte(init_mm.pgd, vaddr, &shift);
>  		if (ptep == NULL)
>  			paddr = 0;
> -		else
> +		else {
> +			/*
> +			 * we don't have hugepages backing iomem
> +			 */
> +			BUG_ON(shift);
>  			paddr = pte_pfn(*ptep) << PAGE_SHIFT;
> +		}
>  		bus = iowa_pci_find(vaddr, paddr);
>  
>  		if (bus == NULL)
> diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> index 19c93ba..8c345df 100644
> --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> @@ -27,7 +27,7 @@ static void *real_vmalloc_addr(void *x)
>  	unsigned long addr = (unsigned long) x;
>  	pte_t *p;
>  
> -	p = find_linux_pte(swapper_pg_dir, addr);
> +	p = find_linux_pte_or_hugepte(swapper_pg_dir, addr, NULL);
>  	if (!p || !pte_present(*p))
>  		return NULL;
>  	/* assume we don't have huge pages in vmalloc space... */
> diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
> index d0eb6d4..e942ae9 100644
> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -1131,6 +1131,7 @@ EXPORT_SYMBOL_GPL(hash_page);
>  void hash_preload(struct mm_struct *mm, unsigned long ea,
>  		  unsigned long access, unsigned long trap)
>  {
> +	int shift;
>  	unsigned long vsid;
>  	pgd_t *pgdir;
>  	pte_t *ptep;
> @@ -1152,10 +1153,15 @@ void hash_preload(struct mm_struct *mm, unsigned long ea,
>  	pgdir = mm->pgd;
>  	if (pgdir == NULL)
>  		return;
> -	ptep = find_linux_pte(pgdir, ea);
> +	/*
> +	 * THP pages use update_mmu_cache_pmd. We don't do
> +	 * hash preload there. Hence can ignore THP here
> +	 */
> +	ptep = find_linux_pte_or_hugepte(pgdir, ea, &shift);
>  	if (!ptep)
>  		return;
>  
> +	BUG_ON(shift);
>  #ifdef CONFIG_PPC_64K_PAGES
>  	/* If either _PAGE_4K_PFN or _PAGE_NO_CACHE is set (and we are on
>  	 * a 64K kernel), then we don't preload, hash_page() will take
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 081c001..1154714 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -105,6 +105,7 @@ int pgd_huge(pgd_t pgd)
>  
>  pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
>  {
> +	/* Only called for HugeTLB pages, hence can ignore THP */
>  	return find_linux_pte_or_hugepte(mm->pgd, addr, NULL);
>  }
>  
> @@ -673,11 +674,14 @@ follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
>  	struct page *page;
>  	unsigned shift;
>  	unsigned long mask;
> -
> +	/*
> +	 * Transparent hugepages are handled by generic code. We can skip them
> +	 * here.
> +	 */
>  	ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
>  
>  	/* Verify it is a huge page else bail. */
> -	if (!ptep || !shift)
> +	if (!ptep || !shift || pmd_trans_huge((pmd_t)*ptep))
>  		return ERR_PTR(-EINVAL);
>  
>  	mask = (1UL << shift) - 1;
> diff --git a/arch/powerpc/mm/tlb_hash64.c b/arch/powerpc/mm/tlb_hash64.c
> index 023ec8a..56d9b85 100644
> --- a/arch/powerpc/mm/tlb_hash64.c
> +++ b/arch/powerpc/mm/tlb_hash64.c
> @@ -189,6 +189,7 @@ void tlb_flush(struct mmu_gather *tlb)
>  void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
>  			      unsigned long end)
>  {
> +	int shift;
>  	unsigned long flags;
>  
>  	start = _ALIGN_DOWN(start, PAGE_SIZE);
> @@ -206,11 +207,15 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
>  	local_irq_save(flags);
>  	arch_enter_lazy_mmu_mode();
>  	for (; start < end; start += PAGE_SIZE) {
> -		pte_t *ptep = find_linux_pte(mm->pgd, start);
> +		pte_t *ptep = find_linux_pte_or_hugepte(mm->pgd, start, &shift);
>  		unsigned long pte;
>  
>  		if (ptep == NULL)
>  			continue;
> +		/*
> +		 * We won't find hugepages here, this is iomem.
> +		 */

Really?  Why?

> +		BUG_ON(shift);
>  		pte = pte_val(*ptep);
>  		if (!(pte & _PAGE_HASHPTE))
>  			continue;
> diff --git a/arch/powerpc/platforms/pseries/eeh.c b/arch/powerpc/platforms/pseries/eeh.c
> index 6b73d6c..d2e76d2 100644
> --- a/arch/powerpc/platforms/pseries/eeh.c
> +++ b/arch/powerpc/platforms/pseries/eeh.c
> @@ -258,12 +258,17 @@ void eeh_slot_error_detail(struct eeh_pe *pe, int severity)
>   */
>  static inline unsigned long eeh_token_to_phys(unsigned long token)
>  {
> +	int shift;
>  	pte_t *ptep;
>  	unsigned long pa;
>  
> -	ptep = find_linux_pte(init_mm.pgd, token);
> +	/*
> +	 * We won't find hugepages here, iomem
> +	 */
> +	ptep = find_linux_pte_or_hugepte(init_mm.pgd, token, &shift);
>  	if (!ptep)
>  		return token;
> +	BUG_ON(shift);
>  	pa = pte_pfn(*ptep) << PAGE_SHIFT;
>  
>  	return pa | (token & (PAGE_SIZE-1));

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -V7 06/10] powerpc: Update gup_pmd_range to handle transparent hugepages
  2013-04-28 19:51 ` [PATCH -V7 06/10] powerpc: Update gup_pmd_range to handle transparent hugepages Aneesh Kumar K.V
@ 2013-05-03  4:57   ` David Gibson
  0 siblings, 0 replies; 34+ messages in thread
From: David Gibson @ 2013-05-03  4:57 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, linux-mm

[-- Attachment #1: Type: text/plain, Size: 457 bytes --]

On Mon, Apr 29, 2013 at 01:21:47AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -V7 07/10] powerpc/THP: Add code to handle HPTE faults for large pages
  2013-04-28 19:51 ` [PATCH -V7 07/10] powerpc/THP: Add code to handle HPTE faults for large pages Aneesh Kumar K.V
@ 2013-05-03  5:13   ` David Gibson
  0 siblings, 0 replies; 34+ messages in thread
From: David Gibson @ 2013-05-03  5:13 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, linux-mm

[-- Attachment #1: Type: text/plain, Size: 656 bytes --]

On Mon, Apr 29, 2013 at 01:21:48AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> The deposted PTE page in the second half of the PMD table is used to
> track the state on hash PTEs. After updating the HPTE, we mark the
> coresponding slot in the deposted PTE page valid.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -V7 08/10] powerpc/THP: Enable THP on PPC64
  2013-04-28 19:51 ` [PATCH -V7 08/10] powerpc/THP: Enable THP on PPC64 Aneesh Kumar K.V
@ 2013-05-03  5:15   ` David Gibson
  2013-05-03 18:49     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 34+ messages in thread
From: David Gibson @ 2013-05-03  5:15 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2720 bytes --]

On Mon, Apr 29, 2013 at 01:21:49AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> We enable only if the we support 16MB page size.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/pgtable-ppc64.h |  3 +--
>  arch/powerpc/mm/pgtable_64.c             | 28 ++++++++++++++++++++++++++++
>  2 files changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
> index 97fc839..d65534b 100644
> --- a/arch/powerpc/include/asm/pgtable-ppc64.h
> +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
> @@ -426,8 +426,7 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
>  	return pmd_val(pmd) >> PTE_RPN_SHIFT;
>  }
>  
> -/* We will enable it in the last patch */
> -#define has_transparent_hugepage() 0
> +extern int has_transparent_hugepage(void);
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  static inline int pmd_young(pmd_t pmd)
> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
> index 54216c1..b742d6f 100644
> --- a/arch/powerpc/mm/pgtable_64.c
> +++ b/arch/powerpc/mm/pgtable_64.c
> @@ -754,6 +754,34 @@ void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	return;
>  }
>  
> +int has_transparent_hugepage(void)
> +{
> +	if (!mmu_has_feature(MMU_FTR_16M_PAGE))
> +		return 0;
> +	/*
> +	 * We support THP only if HPAGE_SHIFT is 16MB.
> +	 */
> +	if (!HPAGE_SHIFT || (HPAGE_SHIFT != mmu_psize_defs[MMU_PAGE_16M].shift))
> +		return 0;

Again, THP should not be dependent on the value of HPAGE_SHIFT.  Just
checking that mmu_psize_defsz[MMU_PAGE_16M].shift == 24 should be
sufficient (i.e. that 16M hugepages are supported).

> +	/*
> +	 * We need to make sure that we support 16MB hugepage in a segement
> +	 * with base page size 64K or 4K. We only enable THP with a PAGE_SIZE
> +	 * of 64K.
> +	 */
> +	/*
> +	 * If we have 64K HPTE, we will be using that by default
> +	 */
> +	if (mmu_psize_defs[MMU_PAGE_64K].shift &&
> +	    (mmu_psize_defs[MMU_PAGE_64K].penc[MMU_PAGE_16M] == -1))
> +		return 0;
> +	/*
> +	 * Ok we only have 4K HPTE
> +	 */
> +	if (mmu_psize_defs[MMU_PAGE_4K].penc[MMU_PAGE_16M] == -1)
> +		return 0;

Except you don't actually support THP on 4K base page size yet.

> +
> +	return 1;
> +}
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  pmd_t pmdp_get_and_clear(struct mm_struct *mm,

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -V7 09/10] powerpc: Optimize hugepage invalidate
  2013-04-28 19:51 ` [PATCH -V7 09/10] powerpc: Optimize hugepage invalidate Aneesh Kumar K.V
@ 2013-05-03  5:28   ` David Gibson
  2013-05-03 19:05     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 34+ messages in thread
From: David Gibson @ 2013-05-03  5:28 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, linux-mm

[-- Attachment #1: Type: text/plain, Size: 11696 bytes --]

On Mon, Apr 29, 2013 at 01:21:50AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> Hugepage invalidate involves invalidating multiple hpte entries.
> Optimize the operation using H_BULK_REMOVE on lpar platforms.
> On native, reduce the number of tlb flush.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

Since this is purely an optimization, have you tried reproducing the
bugs you're chasing with this patch not included?

> ---
>  arch/powerpc/include/asm/machdep.h    |   3 +
>  arch/powerpc/mm/hash_native_64.c      |  78 +++++++++++++++++++++
>  arch/powerpc/mm/pgtable_64.c          |  13 +++-
>  arch/powerpc/platforms/pseries/lpar.c | 126 ++++++++++++++++++++++++++++++++--
>  4 files changed, 210 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
> index 3f3f691..5d1e7d2 100644
> --- a/arch/powerpc/include/asm/machdep.h
> +++ b/arch/powerpc/include/asm/machdep.h
> @@ -56,6 +56,9 @@ struct machdep_calls {
>  	void            (*hpte_removebolted)(unsigned long ea,
>  					     int psize, int ssize);
>  	void		(*flush_hash_range)(unsigned long number, int local);
> +	void		(*hugepage_invalidate)(struct mm_struct *mm,
> +					       unsigned char *hpte_slot_array,
> +					       unsigned long addr, int psize);
>  
>  	/* special for kexec, to be called in real mode, linear mapping is
>  	 * destroyed as well */
> diff --git a/arch/powerpc/mm/hash_native_64.c b/arch/powerpc/mm/hash_native_64.c
> index 6a2aead..8ca178d 100644
> --- a/arch/powerpc/mm/hash_native_64.c
> +++ b/arch/powerpc/mm/hash_native_64.c
> @@ -455,6 +455,83 @@ static void native_hpte_invalidate(unsigned long slot, unsigned long vpn,
>  	local_irq_restore(flags);
>  }
>  
> +static void native_hugepage_invalidate(struct mm_struct *mm,
> +				       unsigned char *hpte_slot_array,
> +				       unsigned long addr, int psize)
> +{
> +	int ssize = 0, i;
> +	int lock_tlbie;
> +	struct hash_pte *hptep;
> +	int actual_psize = MMU_PAGE_16M;
> +	unsigned int max_hpte_count, valid;
> +	unsigned long flags, s_addr = addr;
> +	unsigned long hpte_v, want_v, shift;
> +	unsigned long hidx, vpn = 0, vsid, hash, slot;
> +
> +	shift = mmu_psize_defs[psize].shift;
> +	max_hpte_count = HUGE_PAGE_SIZE >> shift;
> +
> +	local_irq_save(flags);
> +	for (i = 0; i < max_hpte_count; i++) {
> +		/*
> +		 * 8 bits per each hpte entries
> +		 * 000| [ secondary group (one bit) | hidx (3 bits) | valid bit]
> +		 */
> +		valid = hpte_slot_array[i] & 0x1;
> +		if (!valid)
> +			continue;
> +		hidx =  hpte_slot_array[i]  >> 1;
> +
> +		/* get the vpn */
> +		addr = s_addr + (i * (1ul << shift));
> +		if (!is_kernel_addr(addr)) {
> +			ssize = user_segment_size(addr);
> +			vsid = get_vsid(mm->context.id, addr, ssize);
> +			WARN_ON(vsid == 0);
> +		} else {
> +			vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
> +			ssize = mmu_kernel_ssize;
> +		}
> +
> +		vpn = hpt_vpn(addr, vsid, ssize);
> +		hash = hpt_hash(vpn, shift, ssize);
> +		if (hidx & _PTEIDX_SECONDARY)
> +			hash = ~hash;
> +
> +		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> +		slot += hidx & _PTEIDX_GROUP_IX;
> +
> +		hptep = htab_address + slot;
> +		want_v = hpte_encode_avpn(vpn, psize, ssize);
> +		native_lock_hpte(hptep);
> +		hpte_v = hptep->v;
> +
> +		/* Even if we miss, we need to invalidate the TLB */
> +		if (!HPTE_V_COMPARE(hpte_v, want_v) || !(hpte_v & HPTE_V_VALID))
> +			native_unlock_hpte(hptep);
> +		else
> +			/* Invalidate the hpte. NOTE: this also unlocks it */
> +			hptep->v = 0;
> +	}
> +	/*
> +	 * Since this is a hugepage, we just need a single tlbie.
> +	 * use the last vpn.
> +	 */
> +	lock_tlbie = !mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE);
> +	if (lock_tlbie)
> +		raw_spin_lock(&native_tlbie_lock);
> +
> +	asm volatile("ptesync":::"memory");
> +	__tlbie(vpn, psize, actual_psize, ssize);
> +	asm volatile("eieio; tlbsync; ptesync":::"memory");
> +
> +	if (lock_tlbie)
> +		raw_spin_unlock(&native_tlbie_lock);
> +
> +	local_irq_restore(flags);
> +}
> +
> +
>  static void hpte_decode(struct hash_pte *hpte, unsigned long slot,
>  			int *psize, int *apsize, int *ssize, unsigned long *vpn)
>  {
> @@ -658,4 +735,5 @@ void __init hpte_init_native(void)
>  	ppc_md.hpte_remove	= native_hpte_remove;
>  	ppc_md.hpte_clear_all	= native_hpte_clear;
>  	ppc_md.flush_hash_range = native_flush_hash_range;
> +	ppc_md.hugepage_invalidate   = native_hugepage_invalidate;
>  }
> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
> index b742d6f..504952f 100644
> --- a/arch/powerpc/mm/pgtable_64.c
> +++ b/arch/powerpc/mm/pgtable_64.c
> @@ -659,6 +659,7 @@ void hpte_need_hugepage_flush(struct mm_struct *mm, unsigned long addr,
>  {
>  	int ssize, i;
>  	unsigned long s_addr;
> +	int max_hpte_count;
>  	unsigned int psize, valid;
>  	unsigned char *hpte_slot_array;
>  	unsigned long hidx, vpn, vsid, hash, shift, slot;
> @@ -672,12 +673,18 @@ void hpte_need_hugepage_flush(struct mm_struct *mm, unsigned long addr,
>  	 * second half of the PMD
>  	 */
>  	hpte_slot_array = *(char **)(pmdp + PTRS_PER_PMD);
> -
>  	/* get the base page size */
>  	psize = get_slice_psize(mm, s_addr);
> -	shift = mmu_psize_defs[psize].shift;
>  
> -	for (i = 0; i < (HUGE_PAGE_SIZE >> shift); i++) {
> +	if (ppc_md.hugepage_invalidate)
> +		return ppc_md.hugepage_invalidate(mm, hpte_slot_array,
> +						  s_addr, psize);
> +	/*
> +	 * No bluk hpte removal support, invalidate each entry
> +	 */
> +	shift = mmu_psize_defs[psize].shift;
> +	max_hpte_count = HUGE_PAGE_SIZE >> shift;
> +	for (i = 0; i < max_hpte_count; i++) {
>  		/*
>  		 * 8 bits per each hpte entries
>  		 * 000| [ secondary group (one bit) | hidx (3 bits) | valid bit]
> diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
> index 6d62072..58a31db 100644
> --- a/arch/powerpc/platforms/pseries/lpar.c
> +++ b/arch/powerpc/platforms/pseries/lpar.c
> @@ -45,6 +45,13 @@
>  #include "plpar_wrappers.h"
>  #include "pseries.h"
>  
> +/* Flag bits for H_BULK_REMOVE */
> +#define HBR_REQUEST	0x4000000000000000UL
> +#define HBR_RESPONSE	0x8000000000000000UL
> +#define HBR_END		0xc000000000000000UL
> +#define HBR_AVPN	0x0200000000000000UL
> +#define HBR_ANDCOND	0x0100000000000000UL
> +
>  
>  /* in hvCall.S */
>  EXPORT_SYMBOL(plpar_hcall);
> @@ -345,6 +352,117 @@ static void pSeries_lpar_hpte_invalidate(unsigned long slot, unsigned long vpn,
>  	BUG_ON(lpar_rc != H_SUCCESS);
>  }
>  
> +/*
> + * Limit iterations holding pSeries_lpar_tlbie_lock to 3. We also need
> + * to make sure that we avoid bouncing the hypervisor tlbie lock.
> + */
> +#define PPC64_HUGE_HPTE_BATCH 12
> +
> +static void __pSeries_lpar_hugepage_invalidate(unsigned long *slot,
> +					     unsigned long *vpn, int count,
> +					     int psize, int ssize)
> +{
> +	unsigned long param[9];

[9]?  I only see 8 elements being used.

> +	int i = 0, pix = 0, rc;
> +	unsigned long flags = 0;
> +	int lock_tlbie = !mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE);
> +
> +	if (lock_tlbie)
> +		spin_lock_irqsave(&pSeries_lpar_tlbie_lock, flags);

Why are these hash operations being called with the tlbie lock held?

> +
> +	for (i = 0; i < count; i++) {
> +
> +		if (!firmware_has_feature(FW_FEATURE_BULK_REMOVE)) {
> +			pSeries_lpar_hpte_invalidate(slot[i], vpn[i], psize,
> +						     ssize, 0);

Couldn't you set the ppc_md hook based on the firmware request to
avoid this test in the inner loop?  I don't see any tlbie operations
at all.

> +		} else {
> +			param[pix] = HBR_REQUEST | HBR_AVPN | slot[i];
> +			param[pix+1] = hpte_encode_avpn(vpn[i], psize, ssize);
> +			pix += 2;
> +			if (pix == 8) {
> +				rc = plpar_hcall9(H_BULK_REMOVE, param,
> +						  param[0], param[1], param[2],
> +						  param[3], param[4], param[5],
> +						  param[6], param[7]);
> +				BUG_ON(rc != H_SUCCESS);
> +				pix = 0;
> +			}
> +		}
> +	}
> +	if (pix) {
> +		param[pix] = HBR_END;
> +		rc = plpar_hcall9(H_BULK_REMOVE, param, param[0], param[1],
> +				  param[2], param[3], param[4], param[5],
> +				  param[6], param[7]);
> +		BUG_ON(rc != H_SUCCESS);
> +	}
> +
> +	if (lock_tlbie)
> +		spin_unlock_irqrestore(&pSeries_lpar_tlbie_lock, flags);
> +}
> +
> +static void pSeries_lpar_hugepage_invalidate(struct mm_struct *mm,
> +				       unsigned char *hpte_slot_array,
> +				       unsigned long addr, int psize)
> +{
> +	int ssize = 0, i, index = 0;
> +	unsigned long s_addr = addr;
> +	unsigned int max_hpte_count, valid;
> +	unsigned long vpn_array[PPC64_HUGE_HPTE_BATCH];
> +	unsigned long slot_array[PPC64_HUGE_HPTE_BATCH];
> +	unsigned long shift, hidx, vpn = 0, vsid, hash, slot;
> +
> +	shift = mmu_psize_defs[psize].shift;
> +	max_hpte_count = HUGE_PAGE_SIZE >> shift;
> +
> +	for (i = 0; i < max_hpte_count; i++) {
> +		/*
> +		 * 8 bits per each hpte entries
> +		 * 000| [ secondary group (one bit) | hidx (3 bits) | valid bit]
> +		 */
> +		valid = hpte_slot_array[i] & 0x1;
> +		if (!valid)
> +			continue;
> +		hidx =  hpte_slot_array[i]  >> 1;
> +
> +		/* get the vpn */
> +		addr = s_addr + (i * (1ul << shift));
> +		if (!is_kernel_addr(addr)) {
> +			ssize = user_segment_size(addr);
> +			vsid = get_vsid(mm->context.id, addr, ssize);
> +			WARN_ON(vsid == 0);
> +		} else {
> +			vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
> +			ssize = mmu_kernel_ssize;
> +		}
> +
> +		vpn = hpt_vpn(addr, vsid, ssize);
> +		hash = hpt_hash(vpn, shift, ssize);
> +		if (hidx & _PTEIDX_SECONDARY)
> +			hash = ~hash;
> +
> +		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> +		slot += hidx & _PTEIDX_GROUP_IX;
> +
> +		slot_array[index] = slot;
> +		vpn_array[index] = vpn;
> +		if (index == PPC64_HUGE_HPTE_BATCH - 1) {
> +			/*
> +			 * Now do a bluk invalidate
> +			 */
> +			__pSeries_lpar_hugepage_invalidate(slot_array,
> +							   vpn_array,
> +							   PPC64_HUGE_HPTE_BATCH,
> +							   psize, ssize);

I don't really understand why you have one loop in this function, then
another in the __ function.

> +			index = 0;
> +		} else
> +			index++;
> +	}
> +	if (index)
> +		__pSeries_lpar_hugepage_invalidate(slot_array, vpn_array,
> +						   index, psize, ssize);
> +}
> +
>  static void pSeries_lpar_hpte_removebolted(unsigned long ea,
>  					   int psize, int ssize)
>  {
> @@ -360,13 +478,6 @@ static void pSeries_lpar_hpte_removebolted(unsigned long ea,
>  	pSeries_lpar_hpte_invalidate(slot, vpn, psize, ssize, 0);
>  }
>  
> -/* Flag bits for H_BULK_REMOVE */
> -#define HBR_REQUEST	0x4000000000000000UL
> -#define HBR_RESPONSE	0x8000000000000000UL
> -#define HBR_END		0xc000000000000000UL
> -#define HBR_AVPN	0x0200000000000000UL
> -#define HBR_ANDCOND	0x0100000000000000UL
> -
>  /*
>   * Take a spinlock around flushes to avoid bouncing the hypervisor tlbie
>   * lock.
> @@ -452,6 +563,7 @@ void __init hpte_init_lpar(void)
>  	ppc_md.hpte_removebolted = pSeries_lpar_hpte_removebolted;
>  	ppc_md.flush_hash_range	= pSeries_lpar_flush_hash_range;
>  	ppc_md.hpte_clear_all   = pSeries_lpar_hptab_clear;
> +	ppc_md.hugepage_invalidate = pSeries_lpar_hugepage_invalidate;
>  }
>  
>  #ifdef CONFIG_PPC_SMLPAR

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -V7 10/10] powerpc: disable assert_pte_locked
  2013-04-28 19:51 ` [PATCH -V7 10/10] powerpc: disable assert_pte_locked Aneesh Kumar K.V
@ 2013-05-03  5:30   ` David Gibson
  2013-05-03 19:07     ` Aneesh Kumar K.V
  0 siblings, 1 reply; 34+ messages in thread
From: David Gibson @ 2013-05-03  5:30 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: paulus, linuxppc-dev, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1601 bytes --]

On Mon, Apr 29, 2013 at 01:21:51AM +0530, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> With THP we set pmd to none, before we do pte_clear. Hence we can't
> walk page table to get the pte lock ptr and verify whether it is locked.
> THP do take pte lock before calling pte_clear. So we don't change the locking
> rules here. It is that we can't use page table walking to check whether
> pte locks are help with THP.
> 
> NOTE: This needs to be re-written. Not to be merged upstream.

So, rewrite it..

> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/mm/pgtable.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index 214130a..d77f94f 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -224,6 +224,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma, unsigned long address,
>  #ifdef CONFIG_DEBUG_VM
>  void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
>  {
> +#if 0
>  	pgd_t *pgd;
>  	pud_t *pud;
>  	pmd_t *pmd;
> @@ -237,6 +238,7 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
>  	pmd = pmd_offset(pud, addr);
>  	BUG_ON(!pmd_present(*pmd));
>  	assert_spin_locked(pte_lockptr(mm, pmd));
> +#endif
>  }
>  #endif /* CONFIG_DEBUG_VM */
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -V7 02/10] powerpc/THP: Implement transparent hugepages for ppc64
  2013-05-03  4:52   ` David Gibson
@ 2013-05-03  8:19     ` Benjamin Herrenschmidt
  2013-05-03 11:54       ` David Gibson
  2013-05-04 19:14     ` Aneesh Kumar K.V
  1 sibling, 1 reply; 34+ messages in thread
From: Benjamin Herrenschmidt @ 2013-05-03  8:19 UTC (permalink / raw)
  To: David Gibson; +Cc: linux-mm, linuxppc-dev, paulus, Aneesh Kumar K.V

On Fri, 2013-05-03 at 14:52 +1000, David Gibson wrote:
> Here, specifically, the fact that PAGE_BUSY is in PAGE_THP_HPTEFLAGS
> is likely to be bad.  If the page is busy, it's in the middle of
> update so can't stably be considered the same as anything.

_PAGE_BUSY is more like a read lock. It means it's being hashed, so what
is not stable is _PAGE_HASHPTE, slot index, _ACCESSED and _DIRTY. The
rest is stable and usually is what pmd_same looks at (though I have a
small doubt vs. _ACCESSED and _DIRTY but at least x86 doesn't care since
they are updated by HW).

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -V7 02/10] powerpc/THP: Implement transparent hugepages for ppc64
  2013-05-03  8:19     ` Benjamin Herrenschmidt
@ 2013-05-03 11:54       ` David Gibson
  2013-05-03 13:00         ` Benjamin Herrenschmidt
  2013-05-03 18:54         ` Aneesh Kumar K.V
  0 siblings, 2 replies; 34+ messages in thread
From: David Gibson @ 2013-05-03 11:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linux-mm, linuxppc-dev, Aneesh Kumar K.V, paulus

[-- Attachment #1: Type: text/plain, Size: 990 bytes --]

On Fri, May 03, 2013 at 06:19:03PM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2013-05-03 at 14:52 +1000, David Gibson wrote:
> > Here, specifically, the fact that PAGE_BUSY is in PAGE_THP_HPTEFLAGS
> > is likely to be bad.  If the page is busy, it's in the middle of
> > update so can't stably be considered the same as anything.
> 
> _PAGE_BUSY is more like a read lock. It means it's being hashed, so what
> is not stable is _PAGE_HASHPTE, slot index, _ACCESSED and _DIRTY. The
> rest is stable and usually is what pmd_same looks at (though I have a
> small doubt vs. _ACCESSED and _DIRTY but at least x86 doesn't care since
> they are updated by HW).

Ok.  It still seems very odd to me that _PAGE_BUSY would be in the THP
version of _PAGE_HASHPTE, but not the normal one.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -V7 02/10] powerpc/THP: Implement transparent hugepages for ppc64
  2013-05-03 11:54       ` David Gibson
@ 2013-05-03 13:00         ` Benjamin Herrenschmidt
  2013-05-03 18:54         ` Aneesh Kumar K.V
  1 sibling, 0 replies; 34+ messages in thread
From: Benjamin Herrenschmidt @ 2013-05-03 13:00 UTC (permalink / raw)
  To: David Gibson; +Cc: linux-mm, linuxppc-dev, Aneesh Kumar K.V, paulus

On Fri, 2013-05-03 at 21:54 +1000, David Gibson wrote:
> > _PAGE_BUSY is more like a read lock. It means it's being hashed, so what
> > is not stable is _PAGE_HASHPTE, slot index, _ACCESSED and _DIRTY. The
> > rest is stable and usually is what pmd_same looks at (though I have a
> > small doubt vs. _ACCESSED and _DIRTY but at least x86 doesn't care since
> > they are updated by HW).
> 
> Ok.  It still seems very odd to me that _PAGE_BUSY would be in the THP
> version of _PAGE_HASHPTE, but not the normal one.

Oh I agree, we should be consistent and it shouldn't be there, I was just
correcting some other aspect of your statement :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -V7 08/10] powerpc/THP: Enable THP on PPC64
  2013-05-03  5:15   ` David Gibson
@ 2013-05-03 18:49     ` Aneesh Kumar K.V
  2013-05-05  8:59       ` David Gibson
  0 siblings, 1 reply; 34+ messages in thread
From: Aneesh Kumar K.V @ 2013-05-03 18:49 UTC (permalink / raw)
  To: David Gibson; +Cc: paulus, linuxppc-dev, linux-mm

David Gibson <dwg@au1.ibm.com> writes:

> On Mon, Apr 29, 2013 at 01:21:49AM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>> 
>> We enable only if the we support 16MB page size.
>> 
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/include/asm/pgtable-ppc64.h |  3 +--
>>  arch/powerpc/mm/pgtable_64.c             | 28 ++++++++++++++++++++++++++++
>>  2 files changed, 29 insertions(+), 2 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
>> index 97fc839..d65534b 100644
>> --- a/arch/powerpc/include/asm/pgtable-ppc64.h
>> +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
>> @@ -426,8 +426,7 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
>>  	return pmd_val(pmd) >> PTE_RPN_SHIFT;
>>  }
>>  
>> -/* We will enable it in the last patch */
>> -#define has_transparent_hugepage() 0
>> +extern int has_transparent_hugepage(void);
>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>  
>>  static inline int pmd_young(pmd_t pmd)
>> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
>> index 54216c1..b742d6f 100644
>> --- a/arch/powerpc/mm/pgtable_64.c
>> +++ b/arch/powerpc/mm/pgtable_64.c
>> @@ -754,6 +754,34 @@ void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
>>  	return;
>>  }
>>  
>> +int has_transparent_hugepage(void)
>> +{
>> +	if (!mmu_has_feature(MMU_FTR_16M_PAGE))
>> +		return 0;
>> +	/*
>> +	 * We support THP only if HPAGE_SHIFT is 16MB.
>> +	 */
>> +	if (!HPAGE_SHIFT || (HPAGE_SHIFT != mmu_psize_defs[MMU_PAGE_16M].shift))
>> +		return 0;
>
> Again, THP should not be dependent on the value of HPAGE_SHIFT.  Just
> checking that mmu_psize_defsz[MMU_PAGE_16M].shift == 24 should be
> sufficient (i.e. that 16M hugepages are supported).

done

+	/*
+	 * We support THP only if PMD_SIZE is 16MB.
+	 */
+	if (mmu_psize_defs[MMU_PAGE_16M].shift != PMD_SHIFT)
+		return 0;
+	/*


>
>> +	/*
>> +	 * We need to make sure that we support 16MB hugepage in a segement
>> +	 * with base page size 64K or 4K. We only enable THP with a PAGE_SIZE
>> +	 * of 64K.
>> +	 */
>> +	/*
>> +	 * If we have 64K HPTE, we will be using that by default
>> +	 */
>> +	if (mmu_psize_defs[MMU_PAGE_64K].shift &&
>> +	    (mmu_psize_defs[MMU_PAGE_64K].penc[MMU_PAGE_16M] == -1))
>> +		return 0;
>> +	/*
>> +	 * Ok we only have 4K HPTE
>> +	 */
>> +	if (mmu_psize_defs[MMU_PAGE_4K].penc[MMU_PAGE_16M] == -1)
>> +		return 0;
>
> Except you don't actually support THP on 4K base page size yet.


That is 64K Linux page size with 4K HPTEs. We do support that. The Linux
page size part is taken care of by Kconfig.
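
Concretely, the run-time gating boils down to the penc lookups. A
stand-alone user-space sketch of what the checks quoted above amount to
(the table contents and the thp_possible() name are invented for
illustration; only the "penc == -1 means no encoding" convention comes
from the patch):

#include <stdio.h>

#define MMU_PAGE_4K	0
#define MMU_PAGE_64K	1
#define MMU_PAGE_16M	2
#define MMU_PAGE_COUNT	3

struct mmu_psize_def {
	unsigned int shift;		/* 0 => page size not supported */
	int penc[MMU_PAGE_COUNT];	/* -1 => no HPTE encoding for that size */
};

/* values invented purely for illustration */
static struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT] = {
	[MMU_PAGE_4K]  = { .shift = 12, .penc = {  0, -1,  1 } },
	[MMU_PAGE_64K] = { .shift = 16, .penc = { -1,  0,  1 } },
	[MMU_PAGE_16M] = { .shift = 24, .penc = { -1, -1,  0 } },
};

static int thp_possible(void)
{
	if (mmu_psize_defs[MMU_PAGE_16M].shift != 24)
		return 0;	/* no 16M pages at all */
	/* if 64K HPTEs exist, they are what gets used for the mapping */
	if (mmu_psize_defs[MMU_PAGE_64K].shift &&
	    mmu_psize_defs[MMU_PAGE_64K].penc[MMU_PAGE_16M] == -1)
		return 0;
	/* otherwise we only have 4K HPTEs */
	if (mmu_psize_defs[MMU_PAGE_4K].penc[MMU_PAGE_16M] == -1)
		return 0;
	return 1;
}

int main(void)
{
	printf("16M THP usable: %d\n", thp_possible());
	return 0;
}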

>
>> +
>> +	return 1;
>> +}
>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>  
>>  pmd_t pmdp_get_and_clear(struct mm_struct *mm,
>

-aneesh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -V7 02/10] powerpc/THP: Implement transparent hugepages for ppc64
  2013-05-03 11:54       ` David Gibson
  2013-05-03 13:00         ` Benjamin Herrenschmidt
@ 2013-05-03 18:54         ` Aneesh Kumar K.V
  1 sibling, 0 replies; 34+ messages in thread
From: Aneesh Kumar K.V @ 2013-05-03 18:54 UTC (permalink / raw)
  To: David Gibson, Benjamin Herrenschmidt; +Cc: linux-mm, linuxppc-dev, paulus

David Gibson <dwg@au1.ibm.com> writes:

> On Fri, May 03, 2013 at 06:19:03PM +1000, Benjamin Herrenschmidt wrote:
>> On Fri, 2013-05-03 at 14:52 +1000, David Gibson wrote:
>> > Here, specifically, the fact that PAGE_BUSY is in PAGE_THP_HPTEFLAGS
>> > is likely to be bad.  If the page is busy, it's in the middle of
>> > update so can't stably be considered the same as anything.
>> 
>> _PAGE_BUSY is more like a read lock. It means it's being hashed, so what
>> is not stable is _PAGE_HASHPTE, slot index, _ACCESSED and _DIRTY. The
>> rest is stable and usually is what pmd_same looks at (though I have a
>> small doubt vs. _ACCESSED and _DIRTY but at least x86 doesn't care since
>> they are updated by HW).
>
> Ok.  It still seems very odd to me that _PAGE_BUSY would be in the THP
> version of _PAGE_HASHPTE, but not the normal one.
>

64-4k definition:
/* PTE flags to conserve for HPTE identification */
#define _PAGE_HPTEFLAGS (_PAGE_BUSY | _PAGE_HASHPTE | \
			 _PAGE_SECONDARY | _PAGE_GROUP_IX)

64-64K definition:
/* PTE flags to conserve for HPTE identification */
#define _PAGE_HPTEFLAGS (_PAGE_BUSY | _PAGE_HASHPTE | _PAGE_COMBO)

BTW I have dropped that change in my current patch. I dropped the
usage of _PAGE_COMBO and instead started using _PAGE_4K_PFN for
identifying THP. That enabled me to use _PAGE_HPTEFLAGS as it is.

-aneesh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -V7 04/10] powerpc: Update find_linux_pte_or_hugepte to handle transparent hugepages
  2013-05-03  4:53   ` David Gibson
@ 2013-05-03 18:58     ` Aneesh Kumar K.V
  2013-05-04  6:28       ` David Gibson
  0 siblings, 1 reply; 34+ messages in thread
From: Aneesh Kumar K.V @ 2013-05-03 18:58 UTC (permalink / raw)
  To: David Gibson; +Cc: paulus, linuxppc-dev, linux-mm

David Gibson <dwg@au1.ibm.com> writes:

> On Mon, Apr 29, 2013 at 01:21:45AM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>
> What's the difference in meaning between pmd_huge() and pmd_large()?
>

#ifndef CONFIG_HUGETLB_PAGE
#define pmd_huge(x)	0
#endif

Also pmd_large do check for THP PTE flag, and _PAGE_PRESENT.

-aneesh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -V7 09/10] powerpc: Optimize hugepage invalidate
  2013-05-03  5:28   ` David Gibson
@ 2013-05-03 19:05     ` Aneesh Kumar K.V
  2013-05-03 21:54       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 34+ messages in thread
From: Aneesh Kumar K.V @ 2013-05-03 19:05 UTC (permalink / raw)
  To: David Gibson; +Cc: paulus, linuxppc-dev, linux-mm

David Gibson <dwg@au1.ibm.com> writes:

> On Mon, Apr 29, 2013 at 01:21:50AM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>> 
>> Hugepage invalidate involves invalidating multiple hpte entries.
>> Optimize the operation using H_BULK_REMOVE on lpar platforms.
>> On native, reduce the number of tlb flush.
>> 
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>
> Since this is purely an optimization, have you tried reproducing the
> bugs you're chasing with this patch not included?

That was due to not handling THP split while walking the page table. I have
that fixed. Will post the next version soon.
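
In other words, a lockless walker has to treat a splitting pmd as
unstable and back off. A minimal sketch of that idea (the bit values and
the pmd_usable() helper are made up; only the "check the splitting flag
before dereferencing" part reflects the fix):

#include <stdbool.h>
#include <stdio.h>

#define PAGE_THP_HUGE	0x1UL	/* invented bit values, illustration only */
#define PAGE_SPLITTING	0x2UL

/* can this pmd value be dereferenced right now by a lockless walker? */
static bool pmd_usable(unsigned long pmd_val)
{
	if (!(pmd_val & PAGE_THP_HUGE))
		return true;		/* ordinary pointer to a PTE page */
	if (pmd_val & PAGE_SPLITTING)
		return false;		/* split in progress: back off / retry */
	return true;			/* stable huge pmd */
}

int main(void)
{
	printf("%d\n", pmd_usable(PAGE_THP_HUGE));			/* 1 */
	printf("%d\n", pmd_usable(PAGE_THP_HUGE | PAGE_SPLITTING));	/* 0 */
	return 0;
}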

>
>> ---
>>  arch/powerpc/include/asm/machdep.h    |   3 +
>>  arch/powerpc/mm/hash_native_64.c      |  78 +++++++++++++++++++++
>>  arch/powerpc/mm/pgtable_64.c          |  13 +++-
>>  arch/powerpc/platforms/pseries/lpar.c | 126 ++++++++++++++++++++++++++++++++--
>>  4 files changed, 210 insertions(+), 10 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
>> index 3f3f691..5d1e7d2 100644
>> --- a/arch/powerpc/include/asm/machdep.h
>> +++ b/arch/powerpc/include/asm/machdep.h
>> @@ -56,6 +56,9 @@ struct machdep_calls {

.....

>>  
>> +/*
>> + * Limit iterations holding pSeries_lpar_tlbie_lock to 3. We also need
>> + * to make sure that we avoid bouncing the hypervisor tlbie lock.
>> + */
>> +#define PPC64_HUGE_HPTE_BATCH 12
>> +
>> +static void __pSeries_lpar_hugepage_invalidate(unsigned long *slot,
>> +					     unsigned long *vpn, int count,
>> +					     int psize, int ssize)
>> +{
>> +	unsigned long param[9];
>
> [9]?  I only see 8 elements being used.

Cut-and-paste error from pSeries_lpar_flush_hash_range.

>
>> +	int i = 0, pix = 0, rc;
>> +	unsigned long flags = 0;
>> +	int lock_tlbie = !mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE);
>> +
>> +	if (lock_tlbie)
>> +		spin_lock_irqsave(&pSeries_lpar_tlbie_lock, flags);
>
> Why are these hash operations being called with the tlbie lock held?

if the firmware doesn't support lockless TLBIE, we need to do locking
at the guest side. pSeries_lpar_flush_hash_range does that.
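
That is, the same conditional-locking pattern, roughly as in this
user-space sketch (a pthread mutex stands in for pSeries_lpar_tlbie_lock
and a plain flag stands in for the MMU feature bit; illustration only):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t tlbie_lock = PTHREAD_MUTEX_INITIALIZER;
static bool lockless_tlbie;	/* would come from the MMU feature bits */

static void invalidate_batch(void)
{
	/* serialize only when the platform can't take concurrent tlbie */
	if (!lockless_tlbie)
		pthread_mutex_lock(&tlbie_lock);

	/* ... issue H_BULK_REMOVE / tlbie for the whole batch here ... */

	if (!lockless_tlbie)
		pthread_mutex_unlock(&tlbie_lock);
}

int main(void)
{
	invalidate_batch();
	return 0;
}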

>
>> +
>> +	for (i = 0; i < count; i++) {
>> +
>> +		if (!firmware_has_feature(FW_FEATURE_BULK_REMOVE)) {
>> +			pSeries_lpar_hpte_invalidate(slot[i], vpn[i], psize,
>> +						     ssize, 0);
>
> Couldn't you set the ppc_md hook based on the firmware request to
> avoid this test in the inner loop?  I don't see any tlbie operations
> at all.

didn't get that.

>
>> +		} else {
>> +			param[pix] = HBR_REQUEST | HBR_AVPN | slot[i];
>> +			param[pix+1] = hpte_encode_avpn(vpn[i], psize, ssize);
>> +			pix += 2;
>> +			if (pix == 8) {
>> +				rc = plpar_hcall9(H_BULK_REMOVE, param,
>> +						  param[0], param[1], param[2],
>> +						  param[3], param[4], param[5],
>> +						  param[6], param[7]);
>> +				BUG_ON(rc != H_SUCCESS);
>> +				pix = 0;
>> +			}
>> +		}
>> +	}
>> +	if (pix) {
>> +		param[pix] = HBR_END;
>> +		rc = plpar_hcall9(H_BULK_REMOVE, param, param[0], param[1],
>> +				  param[2], param[3], param[4], param[5],
>> +				  param[6], param[7]);
>> +		BUG_ON(rc != H_SUCCESS);
>> +	}
>> +
>> +	if (lock_tlbie)
>> +		spin_unlock_irqrestore(&pSeries_lpar_tlbie_lock, flags);
>> +}
>> +
>> +static void pSeries_lpar_hugepage_invalidate(struct mm_struct *mm,
>> +				       unsigned char *hpte_slot_array,
>> +				       unsigned long addr, int psize)
>> +{
>> +	int ssize = 0, i, index = 0;
>> +	unsigned long s_addr = addr;
>> +	unsigned int max_hpte_count, valid;
>> +	unsigned long vpn_array[PPC64_HUGE_HPTE_BATCH];
>> +	unsigned long slot_array[PPC64_HUGE_HPTE_BATCH];
>> +	unsigned long shift, hidx, vpn = 0, vsid, hash, slot;
>> +
>> +	shift = mmu_psize_defs[psize].shift;
>> +	max_hpte_count = HUGE_PAGE_SIZE >> shift;
>> +
>> +	for (i = 0; i < max_hpte_count; i++) {
>> +		/*
>> +		 * 8 bits per each hpte entries
>> +		 * 000| [ secondary group (one bit) | hidx (3 bits) | valid bit]
>> +		 */
>> +		valid = hpte_slot_array[i] & 0x1;
>> +		if (!valid)
>> +			continue;
>> +		hidx =  hpte_slot_array[i]  >> 1;
>> +
>> +		/* get the vpn */
>> +		addr = s_addr + (i * (1ul << shift));
>> +		if (!is_kernel_addr(addr)) {
>> +			ssize = user_segment_size(addr);
>> +			vsid = get_vsid(mm->context.id, addr, ssize);
>> +			WARN_ON(vsid == 0);
>> +		} else {
>> +			vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
>> +			ssize = mmu_kernel_ssize;
>> +		}
>> +
>> +		vpn = hpt_vpn(addr, vsid, ssize);
>> +		hash = hpt_hash(vpn, shift, ssize);
>> +		if (hidx & _PTEIDX_SECONDARY)
>> +			hash = ~hash;
>> +
>> +		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
>> +		slot += hidx & _PTEIDX_GROUP_IX;
>> +
>> +		slot_array[index] = slot;
>> +		vpn_array[index] = vpn;
>> +		if (index == PPC64_HUGE_HPTE_BATCH - 1) {
>> +			/*
>> +			 * Now do a bluk invalidate
>> +			 */
>> +			__pSeries_lpar_hugepage_invalidate(slot_array,
>> +							   vpn_array,
>> +							   PPC64_HUGE_HPTE_BATCH,
>> +							   psize, ssize);
>
> I don't really understand why you have one loop in this function, then
> another in the __ function.

?? If we didn't accumulate a full batch worth of entries, we won't call
the above. Hence we have to do the final bulk remove outside the
loop as well.
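
That is the usual batch-and-flush-remainder structure. A user-space
sketch of the same shape (array sizes, names and the printf stand-in are
made up; it is not the kernel code):

#include <stdio.h>

#define BATCH	12

/* stand-in for the __ bulk-invalidate helper */
static void flush_batch(unsigned long *slot, unsigned long *vpn, int count)
{
	printf("flushing %d entries, first slot %lu vpn %lu\n",
	       count, slot[0], vpn[0]);
}

static void invalidate_all(unsigned long *slots, unsigned long *vpns,
			   unsigned char *valid, int n)
{
	unsigned long slot_array[BATCH], vpn_array[BATCH];
	int i, index = 0;

	for (i = 0; i < n; i++) {
		if (!valid[i])
			continue;		/* skip entries with no HPTE */
		slot_array[index] = slots[i];
		vpn_array[index] = vpns[i];
		if (++index == BATCH) {		/* full batch: flush, restart */
			flush_batch(slot_array, vpn_array, BATCH);
			index = 0;
		}
	}
	if (index)				/* leftover partial batch */
		flush_batch(slot_array, vpn_array, index);
}

int main(void)
{
	unsigned long slots[20], vpns[20];
	unsigned char valid[20];
	int i;

	for (i = 0; i < 20; i++) {
		slots[i] = i;
		vpns[i] = 0x1000 + i;
		valid[i] = 1;
	}
	invalidate_all(slots, vpns, valid, 20);	/* prints 12 then 8 */
	return 0;
}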


>
>> +			index = 0;
>> +		} else
>> +			index++;
>> +	}
>> +	if (index)
>> +		__pSeries_lpar_hugepage_invalidate(slot_array, vpn_array,
>> +						   index, psize, ssize);
>> +}
>> +
>>  static void pSeries_lpar_hpte_removebolted(unsigned long ea,
>>  					   int psize, int ssize)
>>  {
>> @@ -360,13 +478,6 @@ static void pSeries_lpar_hpte_removebolted(unsigned long ea,
>>  	pSeries_lpar_hpte_invalidate(slot, vpn, psize, ssize, 0);
>>  }
>>  
>> -/* Flag bits for H_BULK_REMOVE */
>> -#define HBR_REQUEST	0x4000000000000000UL
>> -#define HBR_RESPONSE	0x8000000000000000UL
>> -#define HBR_END		0xc000000000000000UL
>> -#define HBR_AVPN	0x0200000000000000UL
>> -#define HBR_ANDCOND	0x0100000000000000UL
>> -
>>  /*
>>   * Take a spinlock around flushes to avoid bouncing the hypervisor tlbie
>>   * lock.
>> @@ -452,6 +563,7 @@ void __init hpte_init_lpar(void)
>>  	ppc_md.hpte_removebolted = pSeries_lpar_hpte_removebolted;
>>  	ppc_md.flush_hash_range	= pSeries_lpar_flush_hash_range;
>>  	ppc_md.hpte_clear_all   = pSeries_lpar_hptab_clear;
>> +	ppc_md.hugepage_invalidate = pSeries_lpar_hugepage_invalidate;
>>  }
>>  
>>  #ifdef CONFIG_PPC_SMLPAR
>
> -- 

-aneesh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -V7 10/10] powerpc: disable assert_pte_locked
  2013-05-03  5:30   ` David Gibson
@ 2013-05-03 19:07     ` Aneesh Kumar K.V
  0 siblings, 0 replies; 34+ messages in thread
From: Aneesh Kumar K.V @ 2013-05-03 19:07 UTC (permalink / raw)
  To: David Gibson; +Cc: paulus, linuxppc-dev, linux-mm

David Gibson <dwg@au1.ibm.com> writes:

> On Mon, Apr 29, 2013 at 01:21:51AM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>> 
>> With THP we set pmd to none, before we do pte_clear. Hence we can't
>> walk page table to get the pte lock ptr and verify whether it is locked.
>> THP do take pte lock before calling pte_clear. So we don't change the locking
>> rules here. It is that we can't use page table walking to check whether
>> pte locks are help with THP.
>> 
>> NOTE: This needs to be re-written. Not to be merged upstream.
>
> So, rewrite it..


That is something we need to discuss more. We can't do the pte_locked
assert the way we do now, because, as explained above, THP collapse
depends on setting the pmd to none before doing pte_clear. So we clearly
cannot walk the page table and find the ptl to check whether we are
holding that lock. But yes, these asserts are valid. Those functions
should be called holding the ptl. I still haven't found an alternative
way to do those asserts. Any suggestions?
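
The problem in miniature (everything here is a toy stand-in; the real
code derives the lock from pmd_page(*pmd), which is exactly what goes
away once the pmd is cleared):

#include <stdio.h>
#include <stddef.h>

struct toy_lock { int held; };

/* a cleared ("none") pmd no longer points at a PTE page or its lock */
struct toy_pmd { struct toy_lock *ptl; };

static struct toy_lock *lock_from_walk(struct toy_pmd *pmd)
{
	return pmd->ptl;	/* what the assert effectively relies on */
}

int main(void)
{
	struct toy_lock ptl = { .held = 1 };
	struct toy_pmd pmd = { .ptl = &ptl };

	pmd.ptl = NULL;		/* collapse clears the pmd first ... */

	/* ... so the lock is still held, but the walk can't prove it */
	printf("lock found by walk: %p\n", (void *)lock_from_walk(&pmd));
	return 0;
}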


>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/mm/pgtable.c | 2 ++
>>  1 file changed, 2 insertions(+)
>> 
>> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
>> index 214130a..d77f94f 100644
>> --- a/arch/powerpc/mm/pgtable.c
>> +++ b/arch/powerpc/mm/pgtable.c
>> @@ -224,6 +224,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma, unsigned long address,
>>  #ifdef CONFIG_DEBUG_VM
>>  void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
>>  {
>> +#if 0
>>  	pgd_t *pgd;
>>  	pud_t *pud;
>>  	pmd_t *pmd;
>> @@ -237,6 +238,7 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
>>  	pmd = pmd_offset(pud, addr);
>>  	BUG_ON(!pmd_present(*pmd));
>>  	assert_spin_locked(pte_lockptr(mm, pmd));
>> +#endif
>>  }
>>  #endif /* CONFIG_DEBUG_VM */
>>  
>

-aneesh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -V7 09/10] powerpc: Optimize hugepage invalidate
  2013-05-03 19:05     ` Aneesh Kumar K.V
@ 2013-05-03 21:54       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 34+ messages in thread
From: Benjamin Herrenschmidt @ 2013-05-03 21:54 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linux-mm, paulus, linuxppc-dev, David Gibson

On Sat, 2013-05-04 at 00:35 +0530, Aneesh Kumar K.V wrote:
> 
> if the firmware doesn't support lockless TLBIE, we need to do locking
> at the guest side. pSeries_lpar_flush_hash_range does that.

We don't "need" to ... it's an optimization because by experience the FW
locking was horrible (and the HW locking is too).

Beware however that the hash routines can take a lock too on
"native" (instead of pHyp)...

Ben.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -V7 04/10] powerpc: Update find_linux_pte_or_hugepte to handle transparent hugepages
  2013-05-03 18:58     ` Aneesh Kumar K.V
@ 2013-05-04  6:28       ` David Gibson
  0 siblings, 0 replies; 34+ messages in thread
From: David Gibson @ 2013-05-04  6:28 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, paulus, linux-mm

[-- Attachment #1: Type: text/plain, Size: 815 bytes --]

On Sat, May 04, 2013 at 12:28:20AM +0530, Aneesh Kumar K.V wrote:
> David Gibson <dwg@au1.ibm.com> writes:
> 
> > On Mon, Apr 29, 2013 at 01:21:45AM +0530, Aneesh Kumar K.V wrote:
> >> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> >
> > What's the difference in meaning between pmd_huge() and pmd_large()?
> >
> 
> #ifndef CONFIG_HUGETLB_PAGE
> #define pmd_huge(x)	0
> #endif
> 
> Also pmd_large do check for THP PTE flag, and _PAGE_PRESENT.

I don't mean what's the code difference.  I mean what is the semantic
difference between pmd_huge() and pmd_large() supposed to be - in
words.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH -V7 02/10] powerpc/THP: Implement transparent hugepages for ppc64
  2013-05-03  4:52   ` David Gibson
  2013-05-03  8:19     ` Benjamin Herrenschmidt
@ 2013-05-04 19:14     ` Aneesh Kumar K.V
  2013-05-04 21:39       ` Benjamin Herrenschmidt
  2013-05-06  1:28       ` David Gibson
  1 sibling, 2 replies; 34+ messages in thread
From: Aneesh Kumar K.V @ 2013-05-04 19:14 UTC (permalink / raw)
  To: David Gibson; +Cc: paulus, linuxppc-dev, linux-mm

David Gibson <dwg@au1.ibm.com> writes:

> On Mon, Apr 29, 2013 at 01:21:43AM +0530, Aneesh Kumar K.V wrote:
>> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>> 
>> We now have pmd entries covering 16MB range and the PMD table double its original size.
>> We use the second half of the PMD table to deposit the pgtable (PTE page).
>> The depoisted PTE page is further used to track the HPTE information. The information
>> include [ secondary group | 3 bit hidx | valid ]. We use one byte per each HPTE entry.
>> With 16MB hugepage and 64K HPTE we need 256 entries and with 4K HPTE we need
>> 4096 entries. Both will fit in a 4K PTE page. On hugepage invalidate we need to walk
>> the PTE page and invalidate all valid HPTEs.
>> 
>> This patch implements necessary arch specific functions for THP support and also
>> hugepage invalidate logic. These PMD related functions are intentionally kept
>> similar to their PTE counter-part.
>> 
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/include/asm/page.h              |  11 +-
>>  arch/powerpc/include/asm/pgtable-ppc64-64k.h |   3 +-
>>  arch/powerpc/include/asm/pgtable-ppc64.h     | 259 +++++++++++++++++++++-
>>  arch/powerpc/include/asm/pgtable.h           |   5 +
>>  arch/powerpc/include/asm/pte-hash64-64k.h    |  17 ++
>>  arch/powerpc/mm/pgtable_64.c                 | 318 +++++++++++++++++++++++++++
>>  arch/powerpc/platforms/Kconfig.cputype       |   1 +
>>  7 files changed, 611 insertions(+), 3 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
>> index 988c812..cbf4be7 100644
>> --- a/arch/powerpc/include/asm/page.h
>> +++ b/arch/powerpc/include/asm/page.h
>> @@ -37,8 +37,17 @@
>>  #define PAGE_SIZE		(ASM_CONST(1) << PAGE_SHIFT)
>>  
>>  #ifndef __ASSEMBLY__
>> -#ifdef CONFIG_HUGETLB_PAGE
>> +/*
>> + * With hugetlbfs enabled we allow the HPAGE_SHIFT to run time
>> + * configurable. But we enable THP only with 16MB hugepage.
>> + * With only THP configured, we force hugepage size to 16MB.
>> + * This should ensure that all subarchs that doesn't support
>> + * THP continue to work fine with HPAGE_SHIFT usage.
>> + */
>> +#if defined(CONFIG_HUGETLB_PAGE)
>>  extern unsigned int HPAGE_SHIFT;
>> +#elif defined(CONFIG_TRANSPARENT_HUGEPAGE)
>> +#define HPAGE_SHIFT PMD_SHIFT
>
> As I said in comments on the first patch series, this messing around
> with HPAGE_SHIFT for THP is missing the point.  On ppc HPAGE_SHIFT is
> nothing more than the _default_ hugepage size for explicit hugepages.
> THP should not be dependent on it in any way.

fixed. 

>
>>  #else
>>  #define HPAGE_SHIFT PAGE_SHIFT
>>  #endif
>> diff --git a/arch/powerpc/include/asm/pgtable-ppc64-64k.h b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
>> index 45142d6..a56b82f 100644
>> --- a/arch/powerpc/include/asm/pgtable-ppc64-64k.h
>> +++ b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
>> @@ -33,7 +33,8 @@
>>  #define PGDIR_MASK	(~(PGDIR_SIZE-1))
>>  
>>  /* Bits to mask out from a PMD to get to the PTE page */
>> -#define PMD_MASKED_BITS		0x1ff
>> +/* PMDs point to PTE table fragments which are 4K aligned.  */
>> +#define PMD_MASKED_BITS		0xfff
>
> Hrm.  AFAICT this is related to the change in size of PTE tables, and
> hence the page sharing stuff, so this belongs in the patch which
> implements that, rather than the THP support itself.
>

fixed

>>  /* Bits to mask out from a PGD/PUD to get to the PMD page */
>>  #define PUD_MASKED_BITS		0x1ff
>>  
>> diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
>> index ab84332..20133c1 100644
>> --- a/arch/powerpc/include/asm/pgtable-ppc64.h
>> +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
>> @@ -154,7 +154,7 @@
>>  #define	pmd_present(pmd)	(pmd_val(pmd) != 0)
>>  #define	pmd_clear(pmdp)		(pmd_val(*(pmdp)) = 0)
>>  #define pmd_page_vaddr(pmd)	(pmd_val(pmd) & ~PMD_MASKED_BITS)
>> -#define pmd_page(pmd)		virt_to_page(pmd_page_vaddr(pmd))
>> +extern struct page *pmd_page(pmd_t pmd);
>>  
>>  #define pud_set(pudp, pudval)	(pud_val(*(pudp)) = (pudval))
>>  #define pud_none(pud)		(!pud_val(pud))
>> @@ -382,4 +382,261 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
>>  
>>  #endif /* __ASSEMBLY__ */
>>  
>> +#ifndef _PAGE_SPLITTING
>> +/*
>> + * THP pages can't be special. So use the _PAGE_SPECIAL
>> + */
>> +#define _PAGE_SPLITTING _PAGE_SPECIAL
>> +#endif
>> +
>> +#ifndef _PAGE_THP_HUGE
>> +/*
>> + * We need to differentiate between explicit huge page and THP huge
>> + * page, since THP huge page also need to track real subpage details
>> + * We use the _PAGE_COMBO bits here as dummy for platform that doesn't
>> + * support THP.
>> + */
>> +#define _PAGE_THP_HUGE  0x10000000
>
> So if it's _PAGE_COMBO, use _PAGE_COMBO, instead of the actual number.
>

We define the _PAGE_THP_HUGE value in pte-hash64-64k.h. Now the functions
below which depend on _PAGE_THP_HUGE are in pgtable-ppc64.h. The above
#define takes care of compile errors on subarchs that don't include
pte-hash64-64k.h. We really won't be using these functions at run time,
because we will never find a transparent huge page on those subarchs.
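
It is just the usual fallback-define trick, i.e. something like this
header-style sketch (the MY_ names and bit values are invented, not the
real headers):

/*
 * Generic-header side: if the subarch header (the 64K hash one) already
 * defined the real bit, keep it; otherwise supply a dummy so the inline
 * helpers still compile on subarchs that never see a THP pmd.
 */
#ifndef MY_PAGE_THP_HUGE
#define MY_PAGE_THP_HUGE	0x10000000UL	/* dummy; never set at run time */
#endif

static inline int my_pmd_trans_huge(unsigned long pmd_val)
{
	/* leaf pmd with the THP bit set; always false on dummy subarchs */
	return (pmd_val & 0x3) && (pmd_val & MY_PAGE_THP_HUGE);
}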



>> +#endif
>> +
>> +/*
>> + * PTE flags to conserve for HPTE identification for THP page.
>> + */
>> +#ifndef _PAGE_THP_HPTEFLAGS
>> +#define _PAGE_THP_HPTEFLAGS	(_PAGE_BUSY | _PAGE_HASHPTE)
>
> You have this definition both here and in pte-hash64-64k.h.  More
> importantly including _PAGE_BUSY seems like an extremely bad idea -
> did you mean _PAGE_THP_HUGE == _PAGE_COMBO?
>

We have the same definition for _PAGE_HPTEFLAGS. But since I moved
_PAGE_THP_HUGE to _PAGE_4K_PFN in the new series, I will be dropping
this.

>> +#endif
>> +
>> +#define HUGE_PAGE_SIZE		(ASM_CONST(1) << 24)
>> +#define HUGE_PAGE_MASK		(~(HUGE_PAGE_SIZE - 1))
>
> These constants should be named so its clear they're THP specific.
> They should also be defined in terms of PMD_SHIFT, instead of
> directly.
>

I was not able to use HPAGE_PMD_SIZE because of the BUILD_BUG_ON we hit
when THP is not enabled. Shall I switch them to PMD_SIZE and PMD_MASK?


>> +/*
>> + * set of bits not changed in pmd_modify.
>> + */
>> +#define _HPAGE_CHG_MASK (PTE_RPN_MASK | _PAGE_THP_HPTEFLAGS | \
>> +			 _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_THP_HUGE)
>> +
>> +#ifndef __ASSEMBLY__
>> +extern void hpte_need_hugepage_flush(struct mm_struct *mm, unsigned long addr,
>> +				     pmd_t *pmdp);
>
> This should maybe be called "hpge_do_hugepage_flush()".  The current
> name suggests it returns a boolean, rather than performing the actual
> flush.
>

done


>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
>> +extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
>> +extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot);
>> +extern void set_pmd_at(struct mm_struct *mm, unsigned long addr,
>> +		       pmd_t *pmdp, pmd_t pmd);
>> +extern void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
>> +				 pmd_t *pmd);
>> +
>> +static inline int pmd_trans_huge(pmd_t pmd)
>> +{
>> +	/*
>> +	 * leaf pte for huge page, bottom two bits != 00
>> +	 */
>> +	return (pmd_val(pmd) & 0x3) && (pmd_val(pmd) & _PAGE_THP_HUGE);
>> +}
>> +
>> +static inline int pmd_large(pmd_t pmd)
>> +{
>> +	/*
>> +	 * leaf pte for huge page, bottom two bits != 00
>> +	 */
>> +	if (pmd_trans_huge(pmd))
>> +		return pmd_val(pmd) & _PAGE_PRESENT;
>> +	return 0;
>> +}
>> +
>> +static inline int pmd_trans_splitting(pmd_t pmd)
>> +{
>> +	if (pmd_trans_huge(pmd))
>> +		return pmd_val(pmd) & _PAGE_SPLITTING;
>> +	return 0;
>> +}
>> +
>> +
>> +static inline unsigned long pmd_pfn(pmd_t pmd)
>> +{
>> +	/*
>> +	 * Only called for hugepage pmd
>> +	 */
>> +	return pmd_val(pmd) >> PTE_RPN_SHIFT;
>> +}
>> +
>> +/* We will enable it in the last patch */
>> +#define has_transparent_hugepage() 0
>> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>> +
>> +static inline int pmd_young(pmd_t pmd)
>> +{
>> +	return pmd_val(pmd) & _PAGE_ACCESSED;
>> +}
>
> It would be clearer to define this function as well as various others
> that operate on PMDs as PTEs to just cast the parameter and call the
> corresponding pte_XXX(),

I did what the tile arch does. How about:

+#define pmd_pte(pmd)		(pmd)
+#define pte_pmd(pte)		(pte)
+#define pmd_pfn(pmd)		pte_pfn(pmd_pte(pmd))
+#define pmd_young(pmd)		pte_young(pmd_pte(pmd))
+#define pmd_mkold(pmd)		pte_pmd(pte_mkold(pmd_pte(pmd)))
+#define pmd_wrprotect(pmd)	pte_pmd(pte_wrprotect(pmd_pte(pmd)))
+#define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
+#define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
+#define pmd_mkwrite(pmd)	pte_pmd(pte_mkwrite(pmd_pte(pmd)))
 

>
>> +
>> +static inline pmd_t pmd_mkhuge(pmd_t pmd)
>> +{
>> +	/* Do nothing, mk_pmd() does this part.  */
>> +	return pmd;
>> +}
>> +
>> +#define __HAVE_ARCH_PMD_WRITE
>> +static inline int pmd_write(pmd_t pmd)
>> +{
>> +	return pmd_val(pmd) & _PAGE_RW;
>> +}
>> +
>> +static inline pmd_t pmd_mkold(pmd_t pmd)
>> +{
>> +	pmd_val(pmd) &= ~_PAGE_ACCESSED;
>> +	return pmd;
>> +}
>> +
>> +static inline pmd_t pmd_wrprotect(pmd_t pmd)
>> +{
>> +	pmd_val(pmd) &= ~_PAGE_RW;
>> +	return pmd;
>> +}
>> +
>> +static inline pmd_t pmd_mkdirty(pmd_t pmd)
>> +{
>> +	pmd_val(pmd) |= _PAGE_DIRTY;
>> +	return pmd;
>> +}
>> +
>> +static inline pmd_t pmd_mkyoung(pmd_t pmd)
>> +{
>> +	pmd_val(pmd) |= _PAGE_ACCESSED;
>> +	return pmd;
>> +}
>> +
>> +static inline pmd_t pmd_mkwrite(pmd_t pmd)
>> +{
>> +	pmd_val(pmd) |= _PAGE_RW;
>> +	return pmd;
>> +}
>> +
>> +static inline pmd_t pmd_mknotpresent(pmd_t pmd)
>> +{
>> +	pmd_val(pmd) &= ~_PAGE_PRESENT;
>> +	return pmd;
>> +}
>> +
>> +static inline pmd_t pmd_mksplitting(pmd_t pmd)
>> +{
>> +	pmd_val(pmd) |= _PAGE_SPLITTING;
>> +	return pmd;
>> +}
>> +
>> +/*
>> + * Set the dirty and/or accessed bits atomically in a linux hugepage PMD, this
>> + * function doesn't need to flush the hash entry
>> + */
>> +static inline void __pmdp_set_access_flags(pmd_t *pmdp, pmd_t entry)
>> +{
>> +	unsigned long bits = pmd_val(entry) & (_PAGE_DIRTY |
>> +					       _PAGE_ACCESSED |
>> +					       _PAGE_RW | _PAGE_EXEC);
>> +#ifdef PTE_ATOMIC_UPDATES
>> +	unsigned long old, tmp;
>> +
>> +	__asm__ __volatile__(
>> +	"1:	ldarx	%0,0,%4\n\
>> +		andi.	%1,%0,%6\n\
>> +		bne-	1b \n\
>> +		or	%0,%3,%0\n\
>> +		stdcx.	%0,0,%4\n\
>> +		bne-	1b"
>> +	:"=&r" (old), "=&r" (tmp), "=m" (*pmdp)
>> +	:"r" (bits), "r" (pmdp), "m" (*pmdp), "i" (_PAGE_BUSY)
>> +	:"cc");
>> +#else
>> +	unsigned long old = pmd_val(*pmdp);
>> +	*pmdp = __pmd(old | bits);
>> +#endif
>
> Using parameter casts on the corresponding pte_update() function would
> be even more valuable for these more complex functions with asm.


We may want to retain some of these because of the asserts we want to add
for locking. PTE-related functions expect the ptl to be locked; PMD-related
functions expect mm->page_table_lock to be locked.

>
>> +}
>> +
>> +#define __HAVE_ARCH_PMD_SAME
>> +static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
>> +{
>> +	return (((pmd_val(pmd_a) ^ pmd_val(pmd_b)) & ~_PAGE_THP_HPTEFLAGS) == 0);
>
> Here, specifically, the fact that PAGE_BUSY is in PAGE_THP_HPTEFLAGS
> is likely to be bad.  If the page is busy, it's in the middle of
> update so can't stably be considered the same as anything.
>


pte_same has the above definition. We use _PAGE_BUSY to indicate that
we are using the entry to satisfy an hpte hash insert; it is there to
prevent a parallel update. So why should pmd_same consider
_PAGE_BUSY?
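
For what it's worth, the comparison is just "equal modulo the bits the
hasher may scribble on", e.g. this user-space sketch (bit values
invented for illustration):

#include <stdio.h>

#define PAGE_BUSY	0x0001UL	/* invented values */
#define PAGE_HASHPTE	0x0002UL
#define HPTEFLAGS	(PAGE_BUSY | PAGE_HASHPTE)

/* same entry, ignoring the bits a concurrent hash insert may flip */
static int entries_same(unsigned long a, unsigned long b)
{
	return ((a ^ b) & ~HPTEFLAGS) == 0;
}

int main(void)
{
	unsigned long pmd = 0x16000000000005f0UL;

	/* hashing set BUSY/HASHPTE underneath us: still "the same" entry */
	printf("%d\n", entries_same(pmd, pmd | HPTEFLAGS));	/* 1 */
	/* a changed pfn or protection bit is a real difference */
	printf("%d\n", entries_same(pmd, pmd ^ 0x1000UL));	/* 0 */
	return 0;
}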


>> +}
>> +
>> +#define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
>> +extern int pmdp_set_access_flags(struct vm_area_struct *vma,
>> +				 unsigned long address, pmd_t *pmdp,
>> +				 pmd_t entry, int dirty);
>> +
>> +static inline unsigned long pmd_hugepage_update(struct mm_struct *mm,
>> +						unsigned long addr,
>> +						pmd_t *pmdp, unsigned long clr)
>> +{
>> +#ifdef PTE_ATOMIC_UPDATES
>> +	unsigned long old, tmp;
>> +
>> +	__asm__ __volatile__(
>> +	"1:	ldarx	%0,0,%3\n\
>> +		andi.	%1,%0,%6\n\
>> +		bne-	1b \n\
>> +		andc	%1,%0,%4 \n\
>> +		stdcx.	%1,0,%3 \n\
>> +		bne-	1b"
>> +	: "=&r" (old), "=&r" (tmp), "=m" (*pmdp)
>> +	: "r" (pmdp), "r" (clr), "m" (*pmdp), "i" (_PAGE_BUSY)
>> +	: "cc" );
>> +#else
>> +	unsigned long old = pmd_val(*pmdp);
>> +	*pmdp = __pmd(old & ~clr);
>> +#endif
>> +
>> +#ifdef CONFIG_PPC_STD_MMU_64
>
> THP only works with the standard hash MMU, so this #if seems a bit
> pointless.

done


>
>> +	if (old & _PAGE_HASHPTE)
>> +		hpte_need_hugepage_flush(mm, addr, pmdp);
>> +#endif
>> +	return old;
>> +}
>> +
>> +static inline int __pmdp_test_and_clear_young(struct mm_struct *mm,
>> +					      unsigned long addr, pmd_t *pmdp)
>> +{
>> +	unsigned long old;
>> +
>> +	if ((pmd_val(*pmdp) & (_PAGE_ACCESSED | _PAGE_HASHPTE)) == 0)
>> +		return 0;
>> +	old = pmd_hugepage_update(mm, addr, pmdp, _PAGE_ACCESSED);
>> +	return ((old & _PAGE_ACCESSED) != 0);
>> +}
>> +
>> +#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
>> +extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
>> +				     unsigned long address, pmd_t *pmdp);
>> +#define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
>> +extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
>> +				  unsigned long address, pmd_t *pmdp);
>> +
>> +#define __HAVE_ARCH_PMDP_GET_AND_CLEAR
>> +extern pmd_t pmdp_get_and_clear(struct mm_struct *mm,
>> +				unsigned long addr, pmd_t *pmdp);
>> +
>> +#define __HAVE_ARCH_PMDP_SET_WRPROTECT
>
> Now that the PTE format is the same at bottom or PMD level, do you
> still need this?

Some of them we can drop. Others we need to keep, because we want to have
different asserts, as I explained above.  For example, in the wrprotect below we
want to call pmd_hugepage_update.

>
>> +static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr,
>> +				      pmd_t *pmdp)
>> +{
>> +
>> +	if ((pmd_val(*pmdp) & _PAGE_RW) == 0)
>> +		return;
>> +
>> +	pmd_hugepage_update(mm, addr, pmdp, _PAGE_RW);
>> +}
>> +
>> +#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
>> +extern void pmdp_splitting_flush(struct vm_area_struct *vma,
>> +				 unsigned long address, pmd_t *pmdp);
>> +
>> +#define __HAVE_ARCH_PGTABLE_DEPOSIT
>> +extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
>> +				       pgtable_t pgtable);
>> +#define __HAVE_ARCH_PGTABLE_WITHDRAW
>> +extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
>> +
>> +#define __HAVE_ARCH_PMDP_INVALIDATE
>> +extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>> +			    pmd_t *pmdp);
>> +#endif /* __ASSEMBLY__ */
>>  #endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
>> diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
>> index 7aeb955..283198e 100644
>> --- a/arch/powerpc/include/asm/pgtable.h
>> +++ b/arch/powerpc/include/asm/pgtable.h
>> @@ -222,5 +222,10 @@ extern int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>>  		       unsigned long end, int write, struct page **pages, int *nr);
>>  #endif /* __ASSEMBLY__ */
>>  
>> +#ifndef CONFIG_TRANSPARENT_HUGEPAGE
>> +#define pmd_large(pmd)		0
>> +#define has_transparent_hugepage() 0
>> +#endif
>> +
>>  #endif /* __KERNEL__ */
>>  #endif /* _ASM_POWERPC_PGTABLE_H */
>> diff --git a/arch/powerpc/include/asm/pte-hash64-64k.h b/arch/powerpc/include/asm/pte-hash64-64k.h
>> index 3e13e23..6be70be 100644
>> --- a/arch/powerpc/include/asm/pte-hash64-64k.h
>> +++ b/arch/powerpc/include/asm/pte-hash64-64k.h
>> @@ -38,6 +38,23 @@
>>   */
>>  #define PTE_RPN_SHIFT	(30)
>>  
>> +/*
>> + * THP pages can't be special. So use the _PAGE_SPECIAL
>> + */
>> +#define _PAGE_SPLITTING _PAGE_SPECIAL
>> +
>> +/*
>> + * PTE flags to conserve for HPTE identification for THP page.
>> + * We drop _PAGE_COMBO here, because we overload that with _PAGE_THP_HUGE.
>> + */
>> +#define _PAGE_THP_HPTEFLAGS	(_PAGE_BUSY | _PAGE_HASHPTE)
>> +
>> +/*
>> + * We need to differentiate between explicit huge page and THP huge
>> + * page, since THP huge page also need to track real subpage details
>> + */
>> +#define _PAGE_THP_HUGE  _PAGE_COMBO
>
> All 3 of these definitions also appeared elsewhere.

These are the actual values used. The definitions in pgtable-ppc64.h are there to
take care of compilation issues on subarchs that don't support THP.

>
>> +
>>  #ifndef __ASSEMBLY__
>>  
>>  /*
>> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
>> index a854096..54216c1 100644
>> --- a/arch/powerpc/mm/pgtable_64.c
>> +++ b/arch/powerpc/mm/pgtable_64.c
>> @@ -338,6 +338,19 @@ EXPORT_SYMBOL(iounmap);
>>  EXPORT_SYMBOL(__iounmap);
>>  EXPORT_SYMBOL(__iounmap_at);
>>  
>> +/*
>> + * For hugepage we have pfn in the pmd, we use PTE_RPN_SHIFT bits for flags
>> + * For PTE page, we have a PTE_FRAG_SIZE (4K) aligned virtual address.
>> + */
>> +struct page *pmd_page(pmd_t pmd)
>> +{
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +	if (pmd_trans_huge(pmd))
>> +		return pfn_to_page(pmd_pfn(pmd));
>
> In this case you should be able to define this in terms of pte_pfn().

We now have pmd_pfn done in terms of pte_pfn. So we will retain pmd_pfn.

>
>> +#endif
>> +	return virt_to_page(pmd_page_vaddr(pmd));
>> +}
>> +
>>  #ifdef CONFIG_PPC_64K_PAGES
>>  static pte_t *get_from_cache(struct mm_struct *mm)
>>  {
>> @@ -455,3 +468,308 @@ void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
>>  }
>>  #endif
>>  #endif /* CONFIG_PPC_64K_PAGES */
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +static pmd_t set_hugepage_access_flags_filter(pmd_t pmd,
>> +					      struct vm_area_struct *vma,
>> +					      int dirty)
>> +{
>> +	return pmd;
>> +}
>
> This identity function is defined immediately before its only use.  Why does it
> exist?
>

removed

>> +/*
>> + * This is called when relaxing access to a hugepage. It's also called in the page
>> + * fault path when we don't hit any of the major fault cases, ie, a minor
>> + * update of _PAGE_ACCESSED, _PAGE_DIRTY, etc... The generic code will have
>> + * handled those two for us, we additionally deal with missing execute
>> + * permission here on some processors
>> + */
>> +int pmdp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
>> +			  pmd_t *pmdp, pmd_t entry, int dirty)
>> +{
>> +	int changed;
>> +	entry = set_hugepage_access_flags_filter(entry, vma, dirty);
>> +	changed = !pmd_same(*(pmdp), entry);
>> +	if (changed) {
>> +		__pmdp_set_access_flags(pmdp, entry);
>> +		/*
>> +		 * Since we are not supporting SW TLB systems, we don't
>> +		 * have any thing similar to flush_tlb_page_nohash()
>> +		 */
>> +	}
>> +	return changed;
>> +}
>> +
>> +int pmdp_test_and_clear_young(struct vm_area_struct *vma,
>> +			      unsigned long address, pmd_t *pmdp)
>> +{
>> +	return __pmdp_test_and_clear_young(vma->vm_mm, address, pmdp);
>> +}
>> +
>> +/*
>> + * We currently remove entries from the hashtable regardless of whether
>> + * the entry was young or dirty. The generic routines only flush if the
>> + * entry was young or dirty which is not good enough.
>> + *
>> + * We should be more intelligent about this but for the moment we override
>> + * these functions and force a tlb flush unconditionally
>> + */
>> +int pmdp_clear_flush_young(struct vm_area_struct *vma,
>> +				  unsigned long address, pmd_t *pmdp)
>> +{
>> +	return __pmdp_test_and_clear_young(vma->vm_mm, address, pmdp);
>> +}
>> +
>> +/*
>> + * We mark the pmd splitting and invalidate all the hpte
>> + * entries for this hugepage.
>> + */
>> +void pmdp_splitting_flush(struct vm_area_struct *vma,
>> +			  unsigned long address, pmd_t *pmdp)
>> +{
>> +	unsigned long old, tmp;
>> +
>> +	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> +#ifdef PTE_ATOMIC_UPDATES
>> +
>> +	__asm__ __volatile__(
>> +	"1:	ldarx	%0,0,%3\n\
>> +		andi.	%1,%0,%6\n\
>> +		bne-	1b \n\
>> +		ori	%1,%0,%4 \n\
>> +		stdcx.	%1,0,%3 \n\
>> +		bne-	1b"
>> +	: "=&r" (old), "=&r" (tmp), "=m" (*pmdp)
>> +	: "r" (pmdp), "i" (_PAGE_SPLITTING), "m" (*pmdp), "i" (_PAGE_BUSY)
>> +	: "cc" );
>> +#else
>> +	old = pmd_val(*pmdp);
>> +	*pmdp = __pmd(old | _PAGE_SPLITTING);
>> +#endif
>> +	/*
>> +	 * If we didn't have the splitting flag set, go and flush the
>> +	 * HPTE entries and serialize against gup fast.
>> +	 */
>> +	if (!(old & _PAGE_SPLITTING)) {
>> +#ifdef CONFIG_PPC_STD_MMU_64
>> +		/* We need to flush the hpte */
>> +		if (old & _PAGE_HASHPTE)
>> +			hpte_need_hugepage_flush(vma->vm_mm, address, pmdp);
>> +#endif
>> +		/* need tlb flush only to serialize against gup-fast */
>> +		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>> +	}
>> +}
>> +
>> +/*
>> + * We want to put the pgtable in pmd and use pgtable for tracking
>> + * the base page size hptes
>> + */
>> +void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
>> +				pgtable_t pgtable)
>> +{
>> +	unsigned long *pgtable_slot;
>> +	assert_spin_locked(&mm->page_table_lock);
>> +	/*
>> +	 * we store the pgtable in the second half of PMD
>> +	 */
>> +	pgtable_slot = pmdp + PTRS_PER_PMD;
>> +	*pgtable_slot = (unsigned long)pgtable;
>
> Why not just make pgtable_slot have type (pgtable_t *) and avoid the
> cast.
>

done. But we would still have a cast on the line above.
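Roughly (just a sketch of the cast being discussed, not necessarily the
final code):

	pgtable_t *pgtable_slot;

	/* the deposited pgtable lives in the second half of the PMD page */
	pgtable_slot = (pgtable_t *)(pmdp + PTRS_PER_PMD);
	*pgtable_slot = pgtable;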


>> +}
>> +
>> +pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>> +{
>> +	pgtable_t pgtable;
>> +	unsigned long *pgtable_slot;
>> +
>> +	assert_spin_locked(&mm->page_table_lock);
>> +	pgtable_slot = pmdp + PTRS_PER_PMD;
>> +	pgtable = (pgtable_t) *pgtable_slot;
>> +	/*
>> +	 * We store HPTE information in the deposited PTE fragment.
>> +	 * zero out the content on withdraw.
>> +	 */
>> +	memset(pgtable, 0, PTE_FRAG_SIZE);
>> +	return pgtable;
>> +}
>> +
>> +/*
>> + * Since we are looking at latest ppc64, we don't need to worry about
>> + * i/d cache coherency on exec fault
>> + */
>> +static pmd_t set_pmd_filter(pmd_t pmd, unsigned long addr)
>> +{
>> +	pmd = __pmd(pmd_val(pmd) & ~_PAGE_THP_HPTEFLAGS);
>> +	return pmd;
>> +}
>> +
>> +/*
>> + * We can make it less convoluted than __set_pte_at, because
>> + * we can ignore lot of hardware here, because this is only for
>> + * MPSS
>> + */
>> +static inline void __set_pmd_at(struct mm_struct *mm, unsigned long addr,
>> +				pmd_t *pmdp, pmd_t pmd, int percpu)
>> +{
>> +	/*
>> +	 * There is nothing in hash page table now, so nothing to
>> +	 * invalidate, set_pte_at is used for adding new entry.
>> +	 * For updating we should use update_hugepage_pmd()
>> +	 */
>> +	*pmdp = pmd;
>> +}
>
> Again you should be able to define this in terms of the set_pte_at()
> functions.
>

done 
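Something along these lines (sketch only; pmd_pte() here is the identity
helper discussed further down in this thread):

static inline void __set_pmd_at(struct mm_struct *mm, unsigned long addr,
				pmd_t *pmdp, pmd_t pmd, int percpu)
{
	/*
	 * sketch: the huge PMD uses the normal PTE format, so the store
	 * can be delegated to the PTE-level helper
	 */
	__set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd), percpu);
}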


>> +/*
>> + * set a new huge pmd. We should not be called for updating
>> + * an existing pmd entry. That should go via pmd_hugepage_update.
>> + */
>> +void set_pmd_at(struct mm_struct *mm, unsigned long addr,
>> +		pmd_t *pmdp, pmd_t pmd)
>> +{
>> +	/*
>> +	 * Note: mm->context.id might not yet have been assigned as
>> +	 * this context might not have been activated yet when this
>> +	 * is called.
>
> And the relevance of this comment here is...?
>
>> +	 */
>> +	pmd = set_pmd_filter(pmd, addr);
>> +
>> +	__set_pmd_at(mm, addr, pmdp, pmd, 0);
>> +
>> +}
>> +
>> +void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>> +		     pmd_t *pmdp)
>> +{
>> +	pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT);
>> +	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>> +}
>> +
>> +/*
>> + * A linux hugepage PMD was changed and the corresponding hash table entries
>> + * need to be flushed.
>> + *
>> + * The linux hugepage PMD now includes the pmd entries followed by the address
>> + * of the stashed pgtable_t. The stashed pgtable_t contains the hpte bits:
>> + * [ secondary group | 3 bit hidx | valid ]. We use one byte per HPTE entry.
>> + * With 16MB hugepage and 64K HPTE we need 256 entries and with 4K HPTE we need
>> + * 4096 entries. Both will fit in a 4K pgtable_t.
>> + */
>> +void hpte_need_hugepage_flush(struct mm_struct *mm, unsigned long addr,
>> +			      pmd_t *pmdp)
>> +{
>> +	int ssize, i;
>> +	unsigned long s_addr;
>> +	unsigned int psize, valid;
>> +	unsigned char *hpte_slot_array;
>> +	unsigned long hidx, vpn, vsid, hash, shift, slot;
>> +
>> +	/*
>> +	 * Flush all the hptes mapping this hugepage
>> +	 */
>> +	s_addr = addr & HUGE_PAGE_MASK;
>> +	/*
>> +	 * The hpte hindex are stored in the pgtable whose address is in the
>> +	 * second half of the PMD
>> +	 */
>> +	hpte_slot_array = *(char **)(pmdp + PTRS_PER_PMD);
>> +
>> +	/* get the base page size */
>> +	psize = get_slice_psize(mm, s_addr);
>> +	shift = mmu_psize_defs[psize].shift;
>> +
>> +	for (i = 0; i < (HUGE_PAGE_SIZE >> shift); i++) {
>> +		/*
>> +		 * 8 bits per hpte entry:
>> +		 * 000| [ secondary group (one bit) | hidx (3 bits) | valid bit]
>> +		 */
>> +		valid = hpte_slot_array[i] & 0x1;
>> +		if (!valid)
>> +			continue;
>> +		hidx =  hpte_slot_array[i]  >> 1;
>> +
>> +		/* get the vpn */
>> +		addr = s_addr + (i * (1ul << shift));
>> +		if (!is_kernel_addr(addr)) {
>> +			ssize = user_segment_size(addr);
>> +			vsid = get_vsid(mm->context.id, addr, ssize);
>> +			WARN_ON(vsid == 0);
>> +		} else {
>> +			vsid = get_kernel_vsid(addr, mmu_kernel_ssize);
>> +			ssize = mmu_kernel_ssize;
>> +		}
>> +
>> +		vpn = hpt_vpn(addr, vsid, ssize);
>> +		hash = hpt_hash(vpn, shift, ssize);
>> +		if (hidx & _PTEIDX_SECONDARY)
>> +			hash = ~hash;
>> +
>> +		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
>> +		slot += hidx & _PTEIDX_GROUP_IX;
>> +		ppc_md.hpte_invalidate(slot, vpn, psize, ssize, 0);
>> +	}
>> +}
>> +
>> +static pmd_t pmd_set_protbits(pmd_t pmd, pgprot_t pgprot)
>> +{
>> +	pmd_val(pmd) |= pgprot_val(pgprot);
>> +	return pmd;
>> +}
>> +
>> +pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot)
>> +{
>> +	pmd_t pmd;
>> +	/*
>> +	 * For a valid pte, we would have _PAGE_PRESENT or _PAGE_FILE always
>> +	 * set. We use this to check THP page at pmd level.
>> +	 * leaf pte for huge page, bottom two bits != 00
>> +	 */
>> +	pmd_val(pmd) = pfn << PTE_RPN_SHIFT;
>> +	pmd_val(pmd) |= _PAGE_THP_HUGE;
>> +	pmd = pmd_set_protbits(pmd, pgprot);
>> +	return pmd;
>> +}
>> +
>> +pmd_t mk_pmd(struct page *page, pgprot_t pgprot)
>> +{
>> +	return pfn_pmd(page_to_pfn(page), pgprot);
>> +}
>> +
>> +pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
>> +{
>> +
>> +	pmd_val(pmd) &= _HPAGE_CHG_MASK;
>> +	pmd = pmd_set_protbits(pmd, newprot);
>> +	return pmd;
>> +}
>> +
>> +/*
>> + * This is called at the end of handling a user page fault, when the
>> + * fault has been handled by updating a HUGE PMD entry in the linux page tables.
>> + * We use it to preload an HPTE into the hash table corresponding to
>> + * the updated linux HUGE PMD entry.
>> + */
>> +void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
>> +			  pmd_t *pmd)
>> +{
>> +	return;
>> +}
>> +
>> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>> +
>> +pmd_t pmdp_get_and_clear(struct mm_struct *mm,
>> +			 unsigned long addr, pmd_t *pmdp)
>> +{
>> +	pmd_t old_pmd;
>> +	unsigned long old;
>> +	/*
>> +	 * khugepaged calls this for normal pmd also
>> +	 */
>> +	if (pmd_trans_huge(*pmdp)) {
>> +		old = pmd_hugepage_update(mm, addr, pmdp, ~0UL);
>> +		old_pmd = __pmd(old);
>> +	} else {
>> +		old_pmd = *pmdp;
>> +		pmd_clear(pmdp);
>> +	}
>> +	return old_pmd;
>> +}
>> diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
>> index 18e3b76..a526144 100644
>> --- a/arch/powerpc/platforms/Kconfig.cputype
>> +++ b/arch/powerpc/platforms/Kconfig.cputype
>> @@ -71,6 +71,7 @@ config PPC_BOOK3S_64
>>  	select PPC_FPU
>>  	select PPC_HAVE_PMU_SUPPORT
>>  	select SYS_SUPPORTS_HUGETLBFS
>> +	select HAVE_ARCH_TRANSPARENT_HUGEPAGE if PPC_64K_PAGES
>>  
>>  config PPC_BOOK3E_64
>>  	bool "Embedded processors"
>

-aneesh


* Re: [PATCH -V7 02/10] powerpc/THP: Implement transparent hugepages for ppc64
  2013-05-04 19:14     ` Aneesh Kumar K.V
@ 2013-05-04 21:39       ` Benjamin Herrenschmidt
  2013-05-06  1:28       ` David Gibson
  1 sibling, 0 replies; 34+ messages in thread
From: Benjamin Herrenschmidt @ 2013-05-04 21:39 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linux-mm, paulus, linuxppc-dev, David Gibson

On Sun, 2013-05-05 at 00:44 +0530, Aneesh Kumar K.V wrote:
> 
> We may want to retain some of these because of the assert we want to add
> for locking. PTE related functions expect ptl to be locked. PMD related
> functions expect mm->page_table_lock to be locked.

In this case, have a single common inline function __something called
by two different wrappers.
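Something like the below (names purely illustrative):

/* shared implementation, makes no locking assumptions of its own */
static inline int __test_and_clear_young(pte_t *ptep)
{
	if ((pte_val(*ptep) & (_PAGE_ACCESSED | _PAGE_HASHPTE)) == 0)
		return 0;
	/* ... atomic clear of _PAGE_ACCESSED ... */
	return 1;
}

/* PTE wrapper: callers hold the per-page-table ptl */
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
					    unsigned long addr, pte_t *ptep)
{
	return __test_and_clear_young(ptep);
}

/* PMD wrapper: callers hold mm->page_table_lock */
static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
					    unsigned long addr, pmd_t *pmdp)
{
	assert_spin_locked(&vma->vm_mm->page_table_lock);
	return __test_and_clear_young((pte_t *)pmdp);
}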

Cheers,
Ben.


* Re: [PATCH -V7 08/10] powerpc/THP: Enable THP on PPC64
  2013-05-03 18:49     ` Aneesh Kumar K.V
@ 2013-05-05  8:59       ` David Gibson
  0 siblings, 0 replies; 34+ messages in thread
From: David Gibson @ 2013-05-05  8:59 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, paulus, linux-mm


On Sat, May 04, 2013 at 12:19:03AM +0530, Aneesh Kumar K.V wrote:
> David Gibson <dwg@au1.ibm.com> writes:
> 
> > On Mon, Apr 29, 2013 at 01:21:49AM +0530, Aneesh Kumar K.V wrote:
> >> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> >> 
> >> We enable it only if we support the 16MB page size.
> >> 
> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> >> ---
> >>  arch/powerpc/include/asm/pgtable-ppc64.h |  3 +--
> >>  arch/powerpc/mm/pgtable_64.c             | 28 ++++++++++++++++++++++++++++
> >>  2 files changed, 29 insertions(+), 2 deletions(-)
> >> 
> >> diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
> >> index 97fc839..d65534b 100644
> >> --- a/arch/powerpc/include/asm/pgtable-ppc64.h
> >> +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
> >> @@ -426,8 +426,7 @@ static inline unsigned long pmd_pfn(pmd_t pmd)
> >>  	return pmd_val(pmd) >> PTE_RPN_SHIFT;
> >>  }
> >>  
> >> -/* We will enable it in the last patch */
> >> -#define has_transparent_hugepage() 0
> >> +extern int has_transparent_hugepage(void);
> >>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >>  
> >>  static inline int pmd_young(pmd_t pmd)
> >> diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
> >> index 54216c1..b742d6f 100644
> >> --- a/arch/powerpc/mm/pgtable_64.c
> >> +++ b/arch/powerpc/mm/pgtable_64.c
> >> @@ -754,6 +754,34 @@ void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
> >>  	return;
> >>  }
> >>  
> >> +int has_transparent_hugepage(void)
> >> +{
> >> +	if (!mmu_has_feature(MMU_FTR_16M_PAGE))
> >> +		return 0;
> >> +	/*
> >> +	 * We support THP only if HPAGE_SHIFT is 16MB.
> >> +	 */
> >> +	if (!HPAGE_SHIFT || (HPAGE_SHIFT != mmu_psize_defs[MMU_PAGE_16M].shift))
> >> +		return 0;
> >
> > Again, THP should not be dependent on the value of HPAGE_SHIFT.  Just
> > checking that mmu_psize_defsz[MMU_PAGE_16M].shift == 24 should be
> > sufficient (i.e. that 16M hugepages are supported).
> 
> done
> 
> +	/*
> +	 * We support THP only if PMD_SIZE is 16MB.
> +	 */
> +	if (mmu_psize_defs[MMU_PAGE_16M].shift != PMD_SHIFT)
> +		return 0;
> +	/*

Much better.

> >> +	/*
> >> +	 * We need to make sure that we support 16MB hugepages in a segment
> >> +	 * with base page size 64K or 4K. We only enable THP with a PAGE_SIZE
> >> +	 * of 64K.
> >> +	 */
> >> +	/*
> >> +	 * If we have 64K HPTE, we will be using that by default
> >> +	 */
> >> +	if (mmu_psize_defs[MMU_PAGE_64K].shift &&
> >> +	    (mmu_psize_defs[MMU_PAGE_64K].penc[MMU_PAGE_16M] == -1))
> >> +		return 0;
> >> +	/*
> >> +	 * Ok we only have 4K HPTE
> >> +	 */
> >> +	if (mmu_psize_defs[MMU_PAGE_4K].penc[MMU_PAGE_16M] == -1)
> >> +		return 0;
> >
> > Except you don't actually support THP on 4K base page size yet.
> 
> 
> That is a 64K Linux page size with 4K HPTEs. We do support that.

Good point, sorry.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [PATCH -V7 02/10] powerpc/THP: Implement transparent hugepages for ppc64
  2013-05-04 19:14     ` Aneesh Kumar K.V
  2013-05-04 21:39       ` Benjamin Herrenschmidt
@ 2013-05-06  1:28       ` David Gibson
  1 sibling, 0 replies; 34+ messages in thread
From: David Gibson @ 2013-05-06  1:28 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: linuxppc-dev, paulus, linux-mm


On Sun, May 05, 2013 at 12:44:35AM +0530, Aneesh Kumar K.V wrote:
> David Gibson <dwg@au1.ibm.com> writes:
> > On Mon, Apr 29, 2013 at 01:21:43AM +0530, Aneesh Kumar K.V wrote:
> >> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
[snip]
> >> diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
> >> index ab84332..20133c1 100644
> >> --- a/arch/powerpc/include/asm/pgtable-ppc64.h
> >> +++ b/arch/powerpc/include/asm/pgtable-ppc64.h
> >> @@ -154,7 +154,7 @@
> >>  #define	pmd_present(pmd)	(pmd_val(pmd) != 0)
> >>  #define	pmd_clear(pmdp)		(pmd_val(*(pmdp)) = 0)
> >>  #define pmd_page_vaddr(pmd)	(pmd_val(pmd) & ~PMD_MASKED_BITS)
> >> -#define pmd_page(pmd)		virt_to_page(pmd_page_vaddr(pmd))
> >> +extern struct page *pmd_page(pmd_t pmd);
> >>  
> >>  #define pud_set(pudp, pudval)	(pud_val(*(pudp)) = (pudval))
> >>  #define pud_none(pud)		(!pud_val(pud))
> >> @@ -382,4 +382,261 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
> >>  
> >>  #endif /* __ASSEMBLY__ */
> >>  
> >> +#ifndef _PAGE_SPLITTING
> >> +/*
> >> + * THP pages can't be special. So use the _PAGE_SPECIAL
> >> + */
> >> +#define _PAGE_SPLITTING _PAGE_SPECIAL
> >> +#endif
> >> +
> >> +#ifndef _PAGE_THP_HUGE
> >> +/*
> >> + * We need to differentiate between explicit huge page and THP huge
> >> + * page, since THP huge page also need to track real subpage details
> >> + * We use the _PAGE_COMBO bits here as dummy for platform that doesn't
> >> + * support THP.
> >> + */
> >> +#define _PAGE_THP_HUGE  0x10000000
> >
> > So if it's _PAGE_COMBO, use _PAGE_COMBO, instead of the actual number.
> 
> We define the _PAGE_THP_HUGE value in pte-hash64-64k.h. Now the functions
> below which depend on _PAGE_THP_HUGE are in pgtable-ppc64.h. The above
> #define takes care of compile errors on subarchs that don't include
> pte-hash64-64k.h. We really won't be using these functions at run time,
> because we will not find a transparent huge page on those subarchs.

Nonetheless, duplicated definitions really won't do.

[snip]
> >> +#endif
> >> +
> >> +#define HUGE_PAGE_SIZE		(ASM_CONST(1) << 24)
> >> +#define HUGE_PAGE_MASK		(~(HUGE_PAGE_SIZE - 1))
> >
> > These constants should be named so it's clear they're THP specific.
> > They should also be defined in terms of PMD_SHIFT, instead of
> > directly.
> 
> I was not able to use HPAGE_PMD_SIZE because we have that BUILD_BUG_ON
> when THP is not enabled. I will switch them to PMD_SIZE and PMD_MASK?

That would be ok.  THP_PAGE_SIZE or something would also be fine.
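e.g. something like this, deriving the values from the PMD geometry rather
than hard-coding 1 << 24 (the names here are just placeholders):

#define HPAGE_THP_SHIFT		PMD_SHIFT
#define HPAGE_THP_SIZE		(ASM_CONST(1) << HPAGE_THP_SHIFT)
#define HPAGE_THP_MASK		(~(HPAGE_THP_SIZE - 1))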

[snip]
> >> +static inline unsigned long pmd_pfn(pmd_t pmd)
> >> +{
> >> +	/*
> >> +	 * Only called for hugepage pmd
> >> +	 */
> >> +	return pmd_val(pmd) >> PTE_RPN_SHIFT;
> >> +}
> >> +
> >> +/* We will enable it in the last patch */
> >> +#define has_transparent_hugepage() 0
> >> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >> +
> >> +static inline int pmd_young(pmd_t pmd)
> >> +{
> >> +	return pmd_val(pmd) & _PAGE_ACCESSED;
> >> +}
> >
> > It would be clearer to define this function as well as various others
> > that operate on PMDs as PTEs to just cast the parameter and call the
> > corresponding pte_XXX(),
> 
> I did what the tile arch does. How about:
> 
> +#define pmd_pte(pmd)		(pmd)
> +#define pte_pmd(pte)		(pte)
> +#define pmd_pfn(pmd)		pte_pfn(pmd_pte(pmd))
> +#define pmd_young(pmd)		pte_young(pmd_pte(pmd))
> +#define pmd_mkold(pmd)		pte_pmd(pte_mkold(pmd_pte(pmd)))
> +#define pmd_wrprotect(pmd)	pte_pmd(pte_wrprotect(pmd_pte(pmd)))
> +#define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
> +#define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
> +#define pmd_mkwrite(pmd)	pte_pmd(pte_mkwrite(pmd_pte(pmd)))

Probably better for pmd_pte() and pte_pmd() to be inlines, so you
preserve type checking (at least with STRICT_MM_TYPECHECKS), but
otherwise looks ok.
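i.e. something like this (sketch only; with STRICT_MM_TYPECHECKS pmd_t and
pte_t become distinct structs, so the compiler catches accidental mixing):

static inline pte_t pmd_pte(pmd_t pmd)
{
	return __pte(pmd_val(pmd));
}

static inline pmd_t pte_pmd(pte_t pte)
{
	return __pmd(pte_val(pte));
}

#define pmd_young(pmd)		pte_young(pmd_pte(pmd))
#define pmd_mkold(pmd)		pte_pmd(pte_mkold(pmd_pte(pmd)))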

[snip]
> >> +/*
> >> + * We want to put the pgtable in pmd and use pgtable for tracking
> >> + * the base page size hptes
> >> + */
> >> +void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
> >> +				pgtable_t pgtable)
> >> +{
> >> +	unsigned long *pgtable_slot;
> >> +	assert_spin_locked(&mm->page_table_lock);
> >> +	/*
> >> +	 * we store the pgtable in the second half of PMD
> >> +	 */
> >> +	pgtable_slot = pmdp + PTRS_PER_PMD;
> >> +	*pgtable_slot = (unsigned long)pgtable;
> >
> > Why not just make pgtable_slot have type (pgtable_t *) and avoid the
> >> > cast.
> >
> 
> done. But we would still have a cast on the line above.

Sure.  But in fact the above line would need a cast anyway, if you
turned on STRICT_MM_TYPECHECKS.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson




Thread overview: 34+ messages
2013-04-28 19:51 [PATCH -V7 00/10] THP support for PPC64 (Patchset 2) Aneesh Kumar K.V
2013-04-28 19:51 ` [PATCH -V7 01/10] powerpc/THP: Double the PMD table size for THP Aneesh Kumar K.V
2013-05-03  3:21   ` David Gibson
2013-04-28 19:51 ` [PATCH -V7 02/10] powerpc/THP: Implement transparent hugepages for ppc64 Aneesh Kumar K.V
2013-05-03  4:52   ` David Gibson
2013-05-03  8:19     ` Benjamin Herrenschmidt
2013-05-03 11:54       ` David Gibson
2013-05-03 13:00         ` Benjamin Herrenschmidt
2013-05-03 18:54         ` Aneesh Kumar K.V
2013-05-04 19:14     ` Aneesh Kumar K.V
2013-05-04 21:39       ` Benjamin Herrenschmidt
2013-05-06  1:28       ` David Gibson
2013-04-28 19:51 ` [PATCH -V7 03/10] powerpc: move find_linux_pte_or_hugepte and gup_hugepte to common code Aneesh Kumar K.V
2013-04-28 19:51 ` [PATCH -V7 04/10] powerpc: Update find_linux_pte_or_hugepte to handle transparent hugepages Aneesh Kumar K.V
2013-05-03  4:53   ` David Gibson
2013-05-03 18:58     ` Aneesh Kumar K.V
2013-05-04  6:28       ` David Gibson
2013-04-28 19:51 ` [PATCH -V7 05/10] powerpc: Replace find_linux_pte with find_linux_pte_or_hugepte Aneesh Kumar K.V
2013-05-03  4:56   ` David Gibson
2013-04-28 19:51 ` [PATCH -V7 06/10] powerpc: Update gup_pmd_range to handle transparent hugepages Aneesh Kumar K.V
2013-05-03  4:57   ` David Gibson
2013-04-28 19:51 ` [PATCH -V7 07/10] powerpc/THP: Add code to handle HPTE faults for large pages Aneesh Kumar K.V
2013-05-03  5:13   ` David Gibson
2013-04-28 19:51 ` [PATCH -V7 08/10] powerpc/THP: Enable THP on PPC64 Aneesh Kumar K.V
2013-05-03  5:15   ` David Gibson
2013-05-03 18:49     ` Aneesh Kumar K.V
2013-05-05  8:59       ` David Gibson
2013-04-28 19:51 ` [PATCH -V7 09/10] powerpc: Optimize hugepage invalidate Aneesh Kumar K.V
2013-05-03  5:28   ` David Gibson
2013-05-03 19:05     ` Aneesh Kumar K.V
2013-05-03 21:54       ` Benjamin Herrenschmidt
2013-04-28 19:51 ` [PATCH -V7 10/10] powerpc: disable assert_pte_locked Aneesh Kumar K.V
2013-05-03  5:30   ` David Gibson
2013-05-03 19:07     ` Aneesh Kumar K.V
