linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/7 v1] powerpc: Memory Protection Keys
@ 2017-06-06  1:05 Ram Pai
  2017-06-06  1:05 ` [RFC PATCH 1/7 v1]powerpc: Free up four PTE bits to accommodate memory keys Ram Pai
                   ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Ram Pai @ 2017-06-06  1:05 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel
  Cc: benh, paulus, mpe, khandual, aneesh.kumar, bsingharora,
	dave.hansen, hbabu, linuxram

Memory protection keys enable an application to protect its
address space from inadvertent access or corruption by
itself.

The overall idea:

 A process allocates a key and associates it with an
 address range within its address space. The process
 can then dynamically set read/write permissions on
 the key without involving the kernel. Any code that
 violates the permissions of the address range, as
 defined by its associated key, receives a
 segmentation fault.
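
To make the flow concrete, here is a minimal userspace sketch
(illustrative only: it assumes the pkey_alloc/pkey_mprotect/pkey_free
syscall numbers and the PKEY_DISABLE_* flags added by this series are
visible through <sys/syscall.h> and <sys/mman.h>):

	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 1UL << 16;
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

		/* allocate a key whose initial rights deny writes */
		int pkey = syscall(__NR_pkey_alloc, 0, PKEY_DISABLE_WRITE);

		/* associate the key with the mapping */
		syscall(__NR_pkey_mprotect, p, len, PROT_READ | PROT_WRITE, pkey);

		*(int *)p = 1;	/* write is denied by the key -> SIGSEGV */

		syscall(__NR_pkey_free, pkey);
		return 0;
	}

Changing the key's permissions later only requires updating the AMR
register from userspace; no system call is needed for that.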

This patch series enables the feature on PPC64.
It is enabled on the HPTE 64K-page platform.

ISA 3.0 section 5.7.13 describes the detailed specification.

Testing:
	This patch series has passed all the protection key
	tests available in the selftests directory. Though
	the tests are written for x86, I have updated them
	to cater to powerpc. That patch will be sent
	separately, along with documentation updates.

Thanks-to: Dave Hansen, Aneesh, Paul Mackerras,
	   Michael Ellerman  :)

Ram Pai (7):
  Free up four PTE bits to accommodate memory keys
  Implement sys_pkey_alloc and sys_pkey_free system call.
  store and restore the key state across context switches.
  Implementation for sys_mprotect_pkey() system call.
  Program HPTE key protection bits.
  Handle exceptions caused by violation of key protection.
  Deliver SEGV signal on protection key violation.

 arch/powerpc/Kconfig                          |  15 ++
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |  12 ++
 arch/powerpc/include/asm/book3s/64/hash-64k.h |  38 ++--
 arch/powerpc/include/asm/book3s/64/hash.h     |   8 +-
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  10 +
 arch/powerpc/include/asm/book3s/64/mmu.h      |  10 +
 arch/powerpc/include/asm/book3s/64/pgtable.h  |  84 +++++++-
 arch/powerpc/include/asm/mman.h               |  29 +--
 arch/powerpc/include/asm/mmu_context.h        |  12 ++
 arch/powerpc/include/asm/pkeys.h              | 159 +++++++++++++++
 arch/powerpc/include/asm/processor.h          |   5 +
 arch/powerpc/include/asm/reg.h                |  10 +-
 arch/powerpc/include/asm/systbl.h             |   3 +
 arch/powerpc/include/asm/unistd.h             |   6 +-
 arch/powerpc/include/uapi/asm/ptrace.h        |   5 +-
 arch/powerpc/include/uapi/asm/unistd.h        |   3 +
 arch/powerpc/kernel/asm-offsets.c             |   1 +
 arch/powerpc/kernel/exceptions-64s.S          |  10 +-
 arch/powerpc/kernel/process.c                 |  18 ++
 arch/powerpc/kernel/signal_32.c               |  18 +-
 arch/powerpc/kernel/signal_64.c               |  11 ++
 arch/powerpc/kernel/traps.c                   |  49 +++++
 arch/powerpc/mm/Makefile                      |   1 +
 arch/powerpc/mm/dump_linuxpagetables.c        |   3 +-
 arch/powerpc/mm/fault.c                       |  21 +-
 arch/powerpc/mm/hash64_4k.c                   |  12 +-
 arch/powerpc/mm/hash64_64k.c                  |  73 +++----
 arch/powerpc/mm/hash_utils_64.c               |  43 ++++-
 arch/powerpc/mm/hugetlbpage-hash64.c          |  16 +-
 arch/powerpc/mm/mmu_context_book3s64.c        |   5 +
 arch/powerpc/mm/pkeys.c                       | 267 ++++++++++++++++++++++++++
 include/linux/mm.h                            |  32 +--
 include/uapi/asm-generic/mman-common.h        |   2 +-
 33 files changed, 856 insertions(+), 135 deletions(-)
 create mode 100644 arch/powerpc/include/asm/pkeys.h
 create mode 100644 arch/powerpc/mm/pkeys.c

-- 
1.8.3.1


* [RFC PATCH 1/7 v1]powerpc: Free up four PTE bits to accommodate memory keys
  2017-06-06  1:05 [RFC PATCH 0/7 v1] powerpc: Memory Protection Keys Ram Pai
@ 2017-06-06  1:05 ` Ram Pai
  2017-06-12  6:57   ` Aneesh Kumar K.V
  2017-06-13  4:52   ` Aneesh Kumar K.V
  2017-06-06  1:05 ` [RFC PATCH 2/7 v1]powerpc: Implement sys_pkey_alloc and sys_pkey_free system call Ram Pai
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 23+ messages in thread
From: Ram Pai @ 2017-06-06  1:05 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel
  Cc: benh, paulus, mpe, khandual, aneesh.kumar, bsingharora,
	dave.hansen, hbabu, linuxram

Rearrange PTE bits to free up bits 3, 4, 5 and 6 for
memory keys. Bits 3, 4, 5, 6 and 57 shall be used for memory
keys.

The patch makes the following changes to the 64K PTE format:

H_PAGE_BUSY moves from bit 3 to bit 7.
H_PAGE_F_SECOND, which occupied bit 4, moves to the second part
	of the pte.
H_PAGE_F_GIX, which occupied bits 5, 6 and 7, also moves to the
	second part of the pte.

The second part of the PTE will hold
   one (H_PAGE_F_SECOND|H_PAGE_F_GIX) for a 64K-page backed pte,
   and sixteen (H_PAGE_F_SECOND|H_PAGE_F_GIX) for a 4K backed
	pte.

The four bits (H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
are initialized to 0xF, indicating an invalid slot. If an HPTE does
get allocated in the 0xF slot, it is released and not used. In
other words, even though 0xF is a valid slot we discard it and
treat it as an invalid slot (HPTE_SOFT_INVALID). This gives us an
opportunity to not depend on a bit in the primary PTE in order to
determine the validity of a slot.

When we release a 0xF slot we also release a legitimate primary
slot and unmap that entry. This is to ensure that we do get
a legitimate non-0xF slot the next time we retry for a slot.

Though treating the 0xF slot as invalid reduces the number of
available slots and may have an effect on performance, the
probability of hitting a 0xF slot is extremely low.

Compared to the current scheme, the scheme described above reduces
the number of false hash table updates significantly and has the
added advantage of releasing four valuable PTE bits for other
purposes.
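
As an aside, the nibble packing this scheme relies on can be sketched
as follows (illustrative only; it mirrors what __rpte_to_hidx() and
set_hidx_slot() in this patch do):

	/*
	 * The word at (ptep + PTRS_PER_PTE) holds sixteen 4-bit slot
	 * values, one per 4K subpage; a nibble of 0xF means "no valid
	 * HPTE slot".
	 */
	static inline unsigned long hidx_get(unsigned long hidx, int index)
	{
		return (hidx >> (index << 2)) & 0xfUL;
	}

	static inline unsigned long hidx_set(unsigned long hidx, int index,
					     unsigned long slot)
	{
		hidx &= ~(0xfUL << (index << 2));
		return hidx | ((slot & 0xfUL) << (index << 2));
	}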

This idea was jointly developed by Paul Mackerras, Aneesh, Michael
Ellerman and myself.

The 4K PTE format remains unchanged currently.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  | 12 +++++
 arch/powerpc/include/asm/book3s/64/hash-64k.h | 38 ++++++--------
 arch/powerpc/include/asm/book3s/64/hash.h     |  8 ++-
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  5 ++
 arch/powerpc/mm/dump_linuxpagetables.c        |  3 +-
 arch/powerpc/mm/hash64_4k.c                   | 12 ++---
 arch/powerpc/mm/hash64_64k.c                  | 73 +++++++++------------------
 arch/powerpc/mm/hash_utils_64.c               | 38 +++++++++++++-
 arch/powerpc/mm/hugetlbpage-hash64.c          | 16 +++---
 9 files changed, 107 insertions(+), 98 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index b4b5e6b..223d318 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -16,6 +16,18 @@
 #define H_PUD_TABLE_SIZE	(sizeof(pud_t) << H_PUD_INDEX_SIZE)
 #define H_PGD_TABLE_SIZE	(sizeof(pgd_t) << H_PGD_INDEX_SIZE)
 
+
+/*
+ * Only supported by 4k linux page size
+ */
+#define H_PAGE_F_SECOND        _RPAGE_RSV2     /* HPTE is in 2ndary HPTEG */
+#define H_PAGE_F_GIX           (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
+#define H_PAGE_F_GIX_SHIFT     56
+
+#define H_PAGE_BUSY	_RPAGE_RSV1     /* software: PTE & hash are busy */
+#define H_PAGE_HASHPTE	_RPAGE_RPN43    /* PTE has associated HPTE */
+
+
 /* PTE flags to conserve for HPTE identification */
 #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
 			 H_PAGE_F_SECOND | H_PAGE_F_GIX)
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 9732837..87ee22d 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -10,23 +10,21 @@
  * 64k aligned address free up few of the lower bits of RPN for us
  * We steal that here. For more deatils look at pte_pfn/pfn_pte()
  */
-#define H_PAGE_COMBO	_RPAGE_RPN0 /* this is a combo 4k page */
-#define H_PAGE_4K_PFN	_RPAGE_RPN1 /* PFN is for a single 4k page */
+#define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
+#define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
+
+#define H_PAGE_BUSY	_RPAGE_RPN44     /* software: PTE & hash are busy */
+#define H_PAGE_HASHPTE	_RPAGE_RPN43    /* PTE has associated HPTE */
+
 /*
  * We need to differentiate between explicit huge page and THP huge
  * page, since THP huge page also need to track real subpage details
  */
 #define H_PAGE_THP_HUGE  H_PAGE_4K_PFN
 
-/*
- * Used to track subpage group valid if H_PAGE_COMBO is set
- * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
- */
-#define H_PAGE_COMBO_VALID	(H_PAGE_F_GIX | H_PAGE_F_SECOND)
-
 /* PTE flags to conserve for HPTE identification */
-#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
-			 H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
+#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE)
+
 /*
  * we support 16 fragments per PTE page of 64K size.
  */
@@ -54,24 +52,18 @@ static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
 	unsigned long *hidxp;
 
 	rpte.pte = pte;
-	rpte.hidx = 0;
-	if (pte_val(pte) & H_PAGE_COMBO) {
-		/*
-		 * Make sure we order the hidx load against the H_PAGE_COMBO
-		 * check. The store side ordering is done in __hash_page_4K
-		 */
-		smp_rmb();
-		hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
-		rpte.hidx = *hidxp;
-	}
+	/*
+	 * The store side ordering is done in __hash_page_4K
+	 */
+	smp_rmb();
+	hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
+	rpte.hidx = *hidxp;
 	return rpte;
 }
 
 static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
 {
-	if ((pte_val(rpte.pte) & H_PAGE_COMBO))
-		return (rpte.hidx >> (index<<2)) & 0xf;
-	return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
+	return ((rpte.hidx >> (index<<2)) & 0xfUL);
 }
 
 #define __rpte_to_pte(r)	((r).pte)
diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
index 4e957b0..7742782 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -8,11 +8,9 @@
  *
  */
 #define H_PTE_NONE_MASK		_PAGE_HPTEFLAGS
-#define H_PAGE_F_GIX_SHIFT	56
-#define H_PAGE_BUSY		_RPAGE_RSV1 /* software: PTE & hash are busy */
-#define H_PAGE_F_SECOND		_RPAGE_RSV2	/* HPTE is in 2ndary HPTEG */
-#define H_PAGE_F_GIX		(_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
-#define H_PAGE_HASHPTE		_RPAGE_RPN43	/* PTE has associated HPTE */
+
+#define INIT_HIDX (~0x0UL)
+#define HPTE_SOFT_INVALID(slot) ((slot & 0xfUL) == 0xfUL)
 
 #ifdef CONFIG_PPC_64K_PAGES
 #include <asm/book3s/64/hash-64k.h>
diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 6981a52..cfb8169 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -435,6 +435,11 @@ extern int __hash_page_4K(unsigned long ea, unsigned long access,
 extern int __hash_page_64K(unsigned long ea, unsigned long access,
 			   unsigned long vsid, pte_t *ptep, unsigned long trap,
 			   unsigned long flags, int ssize);
+extern unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
+			unsigned int subpg_index, unsigned long slot);
+extern unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
+			int ssize, real_pte_t rpte, unsigned int subpg_index);
+
 struct mm_struct;
 unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
 extern int hash_page_mm(struct mm_struct *mm, unsigned long ea,
diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
index 44fe483..b832ed3 100644
--- a/arch/powerpc/mm/dump_linuxpagetables.c
+++ b/arch/powerpc/mm/dump_linuxpagetables.c
@@ -213,7 +213,7 @@ struct flag_info {
 		.val	= H_PAGE_4K_PFN,
 		.set	= "4K_pfn",
 	}, {
-#endif
+#else
 		.mask	= H_PAGE_F_GIX,
 		.val	= H_PAGE_F_GIX,
 		.set	= "f_gix",
@@ -224,6 +224,7 @@ struct flag_info {
 		.val	= H_PAGE_F_SECOND,
 		.set	= "f_second",
 	}, {
+#endif /* CONFIG_PPC_64K_PAGES */
 #endif
 		.mask	= _PAGE_SPECIAL,
 		.val	= _PAGE_SPECIAL,
diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
index 6fa450c..5b16416 100644
--- a/arch/powerpc/mm/hash64_4k.c
+++ b/arch/powerpc/mm/hash64_4k.c
@@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
 		   pte_t *ptep, unsigned long trap, unsigned long flags,
 		   int ssize, int subpg_prot)
 {
+	real_pte_t rpte;
 	unsigned long hpte_group;
 	unsigned long rflags, pa;
 	unsigned long old_pte, new_pte;
@@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
 	 * need to add in 0x1 if it's a read-only user page
 	 */
 	rflags = htab_convert_pte_flags(new_pte);
+	rpte = __real_pte(__pte(old_pte), ptep);
 
 	if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
 	    !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
@@ -64,12 +66,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
 		/*
 		 * There MIGHT be an HPTE for this pte
 		 */
-		hash = hpt_hash(vpn, shift, ssize);
-		if (old_pte & H_PAGE_F_SECOND)
-			hash = ~hash;
-		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-		slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
-
+		slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
 		if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_4K,
 					       MMU_PAGE_4K, ssize, flags) == -1)
 			old_pte &= ~_PAGE_HPTEFLAGS;
@@ -118,8 +115,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
 			return -1;
 		}
 		new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
-		new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
-			(H_PAGE_F_SECOND | H_PAGE_F_GIX);
+		new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
 	}
 	*ptep = __pte(new_pte & ~H_PAGE_BUSY);
 	return 0;
diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
index 1a68cb1..feb90f1 100644
--- a/arch/powerpc/mm/hash64_64k.c
+++ b/arch/powerpc/mm/hash64_64k.c
@@ -15,34 +15,13 @@
 #include <linux/mm.h>
 #include <asm/machdep.h>
 #include <asm/mmu.h>
-/*
- * index from 0 - 15
- */
-bool __rpte_sub_valid(real_pte_t rpte, unsigned long index)
-{
-	unsigned long g_idx;
-	unsigned long ptev = pte_val(rpte.pte);
 
-	g_idx = (ptev & H_PAGE_COMBO_VALID) >> H_PAGE_F_GIX_SHIFT;
-	index = index >> 2;
-	if (g_idx & (0x1 << index))
-		return true;
-	else
-		return false;
-}
 /*
  * index from 0 - 15
  */
-static unsigned long mark_subptegroup_valid(unsigned long ptev, unsigned long index)
+bool __rpte_sub_valid(real_pte_t rpte, unsigned long index)
 {
-	unsigned long g_idx;
-
-	if (!(ptev & H_PAGE_COMBO))
-		return ptev;
-	index = index >> 2;
-	g_idx = 0x1 << index;
-
-	return ptev | (g_idx << H_PAGE_F_GIX_SHIFT);
+	return !(HPTE_SOFT_INVALID(rpte.hidx >> (index << 2)));
 }
 
 int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
@@ -50,10 +29,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
 		   int ssize, int subpg_prot)
 {
 	real_pte_t rpte;
-	unsigned long *hidxp;
 	unsigned long hpte_group;
 	unsigned int subpg_index;
-	unsigned long rflags, pa, hidx;
+	unsigned long rflags, pa;
 	unsigned long old_pte, new_pte, subpg_pte;
 	unsigned long vpn, hash, slot;
 	unsigned long shift = mmu_psize_defs[MMU_PAGE_4K].shift;
@@ -116,8 +94,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
 		 * On hash insert failure we use old pte value and we don't
 		 * want slot information there if we have a insert failure.
 		 */
-		old_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
-		new_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
+		old_pte &= ~(H_PAGE_HASHPTE);
+		new_pte &= ~(H_PAGE_HASHPTE);
+		rpte.hidx = INIT_HIDX;
 		goto htab_insert_hpte;
 	}
 	/*
@@ -126,18 +105,12 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
 	if (__rpte_sub_valid(rpte, subpg_index)) {
 		int ret;
 
-		hash = hpt_hash(vpn, shift, ssize);
-		hidx = __rpte_to_hidx(rpte, subpg_index);
-		if (hidx & _PTEIDX_SECONDARY)
-			hash = ~hash;
-		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-		slot += hidx & _PTEIDX_GROUP_IX;
-
+		slot = get_hidx_slot(vpn, shift, ssize, rpte, subpg_index);
 		ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn,
 						 MMU_PAGE_4K, MMU_PAGE_4K,
 						 ssize, flags);
 		/*
-		 *if we failed because typically the HPTE wasn't really here
+		 * if we failed because typically the HPTE wasn't really here
 		 * we try an insertion.
 		 */
 		if (ret == -1)
@@ -177,8 +150,14 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
 						rflags, HPTE_V_SECONDARY,
 						MMU_PAGE_4K, MMU_PAGE_4K,
 						ssize);
-		if (slot == -1) {
-			if (mftb() & 0x1)
+		if (unlikely(HPTE_SOFT_INVALID(slot)))
+			mmu_hash_ops.hpte_invalidate(slot, vpn,
+				MMU_PAGE_4K, MMU_PAGE_4K,
+				ssize, (flags & HPTE_LOCAL_UPDATE));
+
+		if (unlikely(slot == -1 || HPTE_SOFT_INVALID(slot))) {
+			if (HPTE_SOFT_INVALID(slot) || (mftb() & 0x1))
+
 				hpte_group = ((hash & htab_hash_mask) *
 					      HPTES_PER_GROUP) & ~0x7UL;
 			mmu_hash_ops.hpte_remove(hpte_group);
@@ -204,11 +183,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
 	 * Since we have H_PAGE_BUSY set on ptep, we can be sure
 	 * nobody is undating hidx.
 	 */
-	hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
-	rpte.hidx &= ~(0xfUL << (subpg_index << 2));
-	*hidxp = rpte.hidx  | (slot << (subpg_index << 2));
-	new_pte = mark_subptegroup_valid(new_pte, subpg_index);
-	new_pte |=  H_PAGE_HASHPTE;
+	new_pte |= set_hidx_slot(ptep, rpte, subpg_index, slot);
+	new_pte |= H_PAGE_HASHPTE;
+
 	/*
 	 * check __real_pte for details on matching smp_rmb()
 	 */
@@ -221,6 +198,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
 		    unsigned long vsid, pte_t *ptep, unsigned long trap,
 		    unsigned long flags, int ssize)
 {
+	real_pte_t rpte;
 	unsigned long hpte_group;
 	unsigned long rflags, pa;
 	unsigned long old_pte, new_pte;
@@ -257,6 +235,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
 	} while (!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
 
 	rflags = htab_convert_pte_flags(new_pte);
+	rpte = __real_pte(__pte(old_pte), ptep);
 
 	if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
 	    !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
@@ -267,12 +246,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
 		/*
 		 * There MIGHT be an HPTE for this pte
 		 */
-		hash = hpt_hash(vpn, shift, ssize);
-		if (old_pte & H_PAGE_F_SECOND)
-			hash = ~hash;
-		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-		slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
-
+		slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
 		if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
 					       MMU_PAGE_64K, ssize,
 					       flags) == -1)
@@ -322,9 +296,8 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
 					   MMU_PAGE_64K, MMU_PAGE_64K, old_pte);
 			return -1;
 		}
+		set_hidx_slot(ptep, rpte, 0, slot);
 		new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
-		new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
-			(H_PAGE_F_SECOND | H_PAGE_F_GIX);
 	}
 	*ptep = __pte(new_pte & ~H_PAGE_BUSY);
 	return 0;
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index f2095ce..b405657 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -975,8 +975,9 @@ void __init hash__early_init_devtree(void)
 
 void __init hash__early_init_mmu(void)
 {
+#ifndef CONFIG_PPC_64K_PAGES
 	/*
-	 * We have code in __hash_page_64K() and elsewhere, which assumes it can
+	 * We have code in __hash_page_4K() and elsewhere, which assumes it can
 	 * do the following:
 	 *   new_pte |= (slot << H_PAGE_F_GIX_SHIFT) & (H_PAGE_F_SECOND | H_PAGE_F_GIX);
 	 *
@@ -987,6 +988,7 @@ void __init hash__early_init_mmu(void)
 	 * with a BUILD_BUG_ON().
 	 */
 	BUILD_BUG_ON(H_PAGE_F_SECOND != (1ul  << (H_PAGE_F_GIX_SHIFT + 3)));
+#endif /* CONFIG_PPC_64K_PAGES */
 
 	htab_init_page_sizes();
 
@@ -1589,6 +1591,40 @@ static inline void tm_flush_hash_page(int local)
 }
 #endif
 
+#ifdef CONFIG_PPC_64K_PAGES
+unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
+		unsigned int subpg_index, unsigned long slot)
+{
+	unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
+
+	rpte.hidx &= ~(0xfUL << (subpg_index << 2));
+	*hidxp = rpte.hidx  | (slot << (subpg_index << 2));
+	return 0x0UL;
+}
+#else /* CONFIG_PPC_64K_PAGES */
+unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
+		unsigned int subpg_index, unsigned long slot)
+{
+	return (slot << H_PAGE_F_GIX_SHIFT) &
+			(H_PAGE_F_SECOND | H_PAGE_F_GIX);
+}
+#endif /* CONFIG_PPC_64K_PAGES */
+
+unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
+			int ssize, real_pte_t rpte, unsigned int subpg_index)
+{
+	unsigned long hash, slot, hidx;
+
+	hash = hpt_hash(vpn, shift, ssize);
+	hidx = __rpte_to_hidx(rpte, subpg_index);
+	if (hidx & _PTEIDX_SECONDARY)
+		hash = ~hash;
+	slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+	slot += hidx & _PTEIDX_GROUP_IX;
+	return slot;
+}
+
+
 /* WARNING: This is called from hash_low_64.S, if you change this prototype,
  *          do not forget to update the assembly call site !
  */
diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c b/arch/powerpc/mm/hugetlbpage-hash64.c
index a84bb44..25a50eb 100644
--- a/arch/powerpc/mm/hugetlbpage-hash64.c
+++ b/arch/powerpc/mm/hugetlbpage-hash64.c
@@ -14,7 +14,7 @@
 #include <asm/cacheflush.h>
 #include <asm/machdep.h>
 
-extern long hpte_insert_repeating(unsigned long hash, unsigned long vpn,
+long hpte_insert_repeating(unsigned long hash, unsigned long vpn,
 				  unsigned long pa, unsigned long rlags,
 				  unsigned long vflags, int psize, int ssize);
 
@@ -22,6 +22,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
 		     pte_t *ptep, unsigned long trap, unsigned long flags,
 		     int ssize, unsigned int shift, unsigned int mmu_psize)
 {
+	real_pte_t rpte;
 	unsigned long vpn;
 	unsigned long old_pte, new_pte;
 	unsigned long rflags, pa, sz;
@@ -61,6 +62,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
 	} while(!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
 
 	rflags = htab_convert_pte_flags(new_pte);
+	rpte = __real_pte(__pte(old_pte), ptep);
 
 	sz = ((1UL) << shift);
 	if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
@@ -71,14 +73,9 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
 	/* Check if pte already has an hpte (case 2) */
 	if (unlikely(old_pte & H_PAGE_HASHPTE)) {
 		/* There MIGHT be an HPTE for this pte */
-		unsigned long hash, slot;
-
-		hash = hpt_hash(vpn, shift, ssize);
-		if (old_pte & H_PAGE_F_SECOND)
-			hash = ~hash;
-		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-		slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
+		unsigned long slot;
 
+		slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
 		if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, mmu_psize,
 					       mmu_psize, ssize, flags) == -1)
 			old_pte &= ~_PAGE_HPTEFLAGS;
@@ -106,8 +103,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
 			return -1;
 		}
 
-		new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
-			(H_PAGE_F_SECOND | H_PAGE_F_GIX);
+		new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
 	}
 
 	/*
-- 
1.8.3.1


* [RFC PATCH 2/7 v1]powerpc: Implement sys_pkey_alloc and sys_pkey_free system call.
  2017-06-06  1:05 [RFC PATCH 0/7 v1] powerpc: Memory Protection Keys Ram Pai
  2017-06-06  1:05 ` [RFC PATCH 1/7 v1]powerpc: Free up four PTE bits to accommodate memory keys Ram Pai
@ 2017-06-06  1:05 ` Ram Pai
  2017-06-06  1:05 ` [RFC PATCH 3/7 v1]powerpc: store and restore the key state across context switches Ram Pai
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Ram Pai @ 2017-06-06  1:05 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel
  Cc: benh, paulus, mpe, khandual, aneesh.kumar, bsingharora,
	dave.hansen, hbabu, linuxram

sys_pkey_alloc() allocates and returns an available pkey.
sys_pkey_free() frees up the key.

A total of 32 keys are supported on powerpc. However, keys 0, 1
and 31 are reserved, so effectively we have 29 keys.
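
For reference, with pkeybit_mask(pkey) defined as
0x1 << (arch_max_pkey() - pkey - 1), the reserved keys map onto the
allocation bitmap as:

	key  0 -> 0x80000000
	key  1 -> 0x40000000
	key 31 -> 0x00000001

which is where the initial value PKEY_INITIAL_ALLOCAION = 0xc0000001
used below comes from.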

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 arch/powerpc/Kconfig                         |  15 ++++
 arch/powerpc/include/asm/book3s/64/mmu.h     |  10 +++
 arch/powerpc/include/asm/book3s/64/pgtable.h |  62 ++++++++++++++
 arch/powerpc/include/asm/pkeys.h             | 124 +++++++++++++++++++++++++++
 arch/powerpc/include/asm/systbl.h            |   2 +
 arch/powerpc/include/asm/unistd.h            |   4 +-
 arch/powerpc/include/uapi/asm/unistd.h       |   2 +
 arch/powerpc/mm/Makefile                     |   1 +
 arch/powerpc/mm/mmu_context_book3s64.c       |   5 ++
 arch/powerpc/mm/pkeys.c                      |  88 +++++++++++++++++++
 include/linux/mm.h                           |  31 ++++---
 include/uapi/asm-generic/mman-common.h       |   2 +-
 12 files changed, 331 insertions(+), 15 deletions(-)
 create mode 100644 arch/powerpc/include/asm/pkeys.h
 create mode 100644 arch/powerpc/mm/pkeys.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index f7c8f99..b6960617 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -871,6 +871,21 @@ config SECCOMP
 
 	  If unsure, say Y. Only embedded should say N here.
 
+config PPC64_MEMORY_PROTECTION_KEYS
+	prompt "PowerPC Memory Protection Keys"
+	def_bool y
+	# Note: only available in 64-bit mode
+	depends on PPC64 && PPC_64K_PAGES
+	select ARCH_USES_HIGH_VMA_FLAGS
+	select ARCH_HAS_PKEYS
+	---help---
+	  Memory Protection Keys provides a mechanism for enforcing
+	  page-based protections, but without requiring modification of the
+	  page tables when an application changes protection domains.
+
+	  For details, see Documentation/powerpc/protection-keys.txt
+
+	  If unsure, say y.
 endmenu
 
 config ISA_DMA_API
diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index 77529a3..0c0a2a8 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -108,6 +108,16 @@ struct patb_entry {
 #ifdef CONFIG_SPAPR_TCE_IOMMU
 	struct list_head iommu_group_mem_list;
 #endif
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+	/*
+	 * Each bit represents one protection key.
+	 * bit set   -> key allocated
+	 * bit unset -> key available for allocation
+	 */
+	u32 pkey_allocation_map;
+	s16 execute_only_pkey; /* key holding execute-only protection */
+#endif
 } mm_context_t;
 
 /*
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 85bc987..87e9a89 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -428,6 +428,68 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
 		pte_update(mm, addr, ptep, 0, _PAGE_PRIVILEGED, 1);
 }
 
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+
+#include <asm/reg.h>
+static inline u64 read_amr(void)
+{
+	return mfspr(SPRN_AMR);
+}
+static inline void write_amr(u64 value)
+{
+	mtspr(SPRN_AMR, value);
+}
+static inline u64 read_iamr(void)
+{
+	return mfspr(SPRN_IAMR);
+}
+static inline void write_iamr(u64 value)
+{
+	mtspr(SPRN_IAMR, value);
+}
+static inline u64 read_uamor(void)
+{
+	return mfspr(SPRN_UAMOR);
+}
+static inline void write_uamor(u64 value)
+{
+	mtspr(SPRN_UAMOR, value);
+}
+
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+static inline u64 read_amr(void)
+{
+	WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+	return -1;
+}
+static inline void write_amr(u64 value)
+{
+	WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+static inline u64 read_uamor(void)
+{
+	WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+	return -1;
+}
+static inline void write_uamor(u64 value)
+{
+	WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+static inline u64 read_iamr(void)
+{
+	WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+	return -1;
+}
+static inline void write_iamr(u64 value)
+{
+	WARN(1, "%s called with MEMORY PROTECTION KEYS disabled\n", __func__);
+}
+
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long addr, pte_t *ptep)
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
new file mode 100644
index 0000000..7bc8746
--- /dev/null
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -0,0 +1,124 @@
+#ifndef _ASM_PPC64_PKEYS_H
+#define _ASM_PPC64_PKEYS_H
+
+
+#define arch_max_pkey()  32
+
+#define AMR_AD_BIT 0x1UL
+#define AMR_WD_BIT 0x2UL
+#define IAMR_EX_BIT 0x1UL
+#define AMR_BITS_PER_PKEY 2
+#define ARCH_VM_PKEY_FLAGS (VM_PKEY_BIT0 | \
+			VM_PKEY_BIT1 | \
+			VM_PKEY_BIT2 | \
+			VM_PKEY_BIT3 | \
+			VM_PKEY_BIT4)
+
+/*
+ * Bits are in BE format.
+ * NOTE: key 31, 1, 0 are not used.
+ * key 0 is used by default. It give read/write/execute permission.
+ * key 31 is reserved by the hypervisor.
+ * key 1 is recommended to be not used.
+ * PowerISA(3.0) page 1015, programming note.
+ */
+#define PKEY_INITIAL_ALLOCAION  0xc0000001
+
+#define pkeybit_mask(pkey) (0x1 << (arch_max_pkey() - pkey - 1))
+
+#define mm_pkey_allocation_map(mm)	(mm->context.pkey_allocation_map)
+
+#define mm_set_pkey_allocated(mm, pkey) {	\
+	mm_pkey_allocation_map(mm) |= pkeybit_mask(pkey); \
+}
+
+#define mm_set_pkey_free(mm, pkey) {	\
+	mm_pkey_allocation_map(mm) &= ~pkeybit_mask(pkey);	\
+}
+
+#define mm_set_pkey_is_allocated(mm, pkey)	\
+	(mm_pkey_allocation_map(mm) & pkeybit_mask(pkey))
+
+#define mm_set_pkey_is_reserved(mm, pkey) (PKEY_INITIAL_ALLOCAION & \
+					pkeybit_mask(pkey))
+
+static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
+{
+	/* a reserved key is never considered as 'explicitly allocated' */
+	return (!mm_set_pkey_is_reserved(mm, pkey) &&
+		mm_set_pkey_is_allocated(mm, pkey));
+}
+
+/*
+ * Returns a positive, 5-bit key on success, or -1 on failure.
+ */
+static inline int mm_pkey_alloc(struct mm_struct *mm)
+{
+	/*
+	 * Note: this is the one and only place we make sure
+	 * that the pkey is valid as far as the hardware is
+	 * concerned.  The rest of the kernel trusts that
+	 * only good, valid pkeys come out of here.
+	 */
+	u32 all_pkeys_mask = (u32)(~(0x0));
+	int ret;
+
+	/*
+	 * Are we out of pkeys?  We must handle this specially
+	 * because ffz() behavior is undefined if there are no
+	 * zeros.
+	 */
+	if (mm_pkey_allocation_map(mm) == all_pkeys_mask)
+		return -1;
+
+	ret = arch_max_pkey() -
+		ffz((u32)mm_pkey_allocation_map(mm))
+		- 1;
+	mm_set_pkey_allocated(mm, ret);
+	return ret;
+}
+
+static inline int mm_pkey_free(struct mm_struct *mm, int pkey)
+{
+	if (!mm_pkey_is_allocated(mm, pkey))
+		return -EINVAL;
+
+	mm_set_pkey_free(mm, pkey);
+
+	return 0;
+}
+
+/*
+ * Try to dedicate one of the protection keys to be used as an
+ * execute-only protection key.
+ */
+extern int __execute_only_pkey(struct mm_struct *mm);
+static inline int execute_only_pkey(struct mm_struct *mm)
+{
+	return __execute_only_pkey(mm);
+}
+
+extern int __arch_override_mprotect_pkey(struct vm_area_struct *vma,
+		int prot, int pkey);
+static inline int arch_override_mprotect_pkey(struct vm_area_struct *vma,
+		int prot, int pkey)
+{
+	return __arch_override_mprotect_pkey(vma, prot, pkey);
+}
+
+extern int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+		unsigned long init_val);
+static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+		unsigned long init_val)
+{
+	return __arch_set_user_pkey_access(tsk, pkey, init_val);
+}
+
+static inline pkey_mm_init(struct mm_struct *mm)
+{
+	mm_pkey_allocation_map(mm) = PKEY_INITIAL_ALLOCAION;
+	/* -1 means unallocated or invalid */
+	mm->context.execute_only_pkey = -1;
+}
+
+#endif /*_ASM_PPC64_PKEYS_H */
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 1c94708..22dd776 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -388,3 +388,5 @@
 COMPAT_SYS_SPU(pwritev2)
 SYSCALL(kexec_file_load)
 SYSCALL(statx)
+SYSCALL(pkey_alloc)
+SYSCALL(pkey_free)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index 9ba11db..e0273bc 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,13 +12,11 @@
 #include <uapi/asm/unistd.h>
 
 
-#define NR_syscalls		384
+#define NR_syscalls		386
 
 #define __NR__exit __NR_exit
 
 #define __IGNORE_pkey_mprotect
-#define __IGNORE_pkey_alloc
-#define __IGNORE_pkey_free
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index b85f142..7993a07 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -394,5 +394,7 @@
 #define __NR_pwritev2		381
 #define __NR_kexec_file_load	382
 #define __NR_statx		383
+#define __NR_pkey_alloc		384
+#define __NR_pkey_free		385
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 7414034..8cc2ff1 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -45,3 +45,4 @@ obj-$(CONFIG_PPC_COPRO_BASE)	+= copro_fault.o
 obj-$(CONFIG_SPAPR_TCE_IOMMU)	+= mmu_context_iommu.o
 obj-$(CONFIG_PPC_PTDUMP)	+= dump_linuxpagetables.o
 obj-$(CONFIG_PPC_HTDUMP)	+= dump_hashpagetable.o
+obj-$(CONFIG_PPC64_MEMORY_PROTECTION_KEYS)	+= pkeys.o
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index c6dca2a..2da9931 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -16,6 +16,7 @@
 #include <linux/string.h>
 #include <linux/types.h>
 #include <linux/mm.h>
+#include <linux/pkeys.h>
 #include <linux/spinlock.h>
 #include <linux/idr.h>
 #include <linux/export.h>
@@ -120,6 +121,10 @@ static int hash__init_new_context(struct mm_struct *mm)
 
 	subpage_prot_init_new_context(mm);
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+	pkey_mm_init(mm);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 	return index;
 }
 
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
new file mode 100644
index 0000000..b97366e
--- /dev/null
+++ b/arch/powerpc/mm/pkeys.c
@@ -0,0 +1,88 @@
+/*
+ * PowerPC Memory Protection Keys management
+ * Copyright (c) 2015, Intel Corporation.
+ * Copyright (c) 2017, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include <linux/pkeys.h>                /* PKEY_*                       */
+#include <uapi/asm-generic/mman-common.h>
+
+
+/*
+ * set the access right in AMR IAMR and UAMOR register
+ * for @pkey to that specified in @init_val.
+ */
+int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+		unsigned long init_val)
+{
+	u64 old_amr, old_uamor, old_iamr;
+	int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
+	u64 new_amr_bits = 0x0ul;
+	u64 new_iamr_bits = 0x0ul;
+	u64 new_uamor_bits = 0x3ul;
+
+	/* Set the bits we need in AMR:  */
+	if (init_val & PKEY_DISABLE_ACCESS)
+		new_amr_bits |= AMR_AD_BIT;
+	if (init_val & PKEY_DISABLE_WRITE)
+		new_amr_bits |= AMR_WD_BIT;
+
+	/*
+	 * By default execute is disabled.
+	 * To enable execute, PKEY_ENABLE_EXECUTE
+	 * needs to be specified.
+	 */
+	if ((init_val & PKEY_DISABLE_EXECUTE))
+		new_iamr_bits |= IAMR_EX_BIT;
+
+	/* Shift the bits in to the correct place in AMR for pkey: */
+	new_amr_bits	<<= pkey_shift;
+	new_iamr_bits	<<= pkey_shift;
+	new_uamor_bits	<<= pkey_shift;
+
+	/* Get old AMR and mask off any old bits in place: */
+	old_amr	= read_amr();
+	old_amr	&= ~((u64)(AMR_AD_BIT|AMR_WD_BIT) << pkey_shift);
+
+	old_iamr = read_iamr();
+	old_iamr &= ~(0x3ul << pkey_shift);
+
+	old_uamor = read_uamor();
+	old_uamor &= ~(0x3ul << pkey_shift);
+
+	/* Write old part along with new part: */
+	write_amr(old_amr | new_amr_bits);
+	write_iamr(old_iamr | new_iamr_bits);
+	write_uamor(old_uamor | new_uamor_bits);
+
+	return 0;
+}
+
+int __execute_only_pkey(struct mm_struct *mm)
+{
+	return -1;
+}
+
+/*
+ * This should only be called for *plain* mprotect calls.
+ */
+int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot,
+		int pkey)
+{
+	/*
+	 * Is this an mprotect_pkey() call?  If so, never
+	 * override the value that came from the user.
+	 */
+	if (pkey != -1)
+		return pkey;
+
+	return 0;
+}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7cb17c6..34ddac7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -204,26 +204,35 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
 
 #ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
-#define VM_HIGH_ARCH_BIT_0	32	/* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_1	33	/* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
-#define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_0	32	/* bit only usable on 64-bit arch */
+#define VM_HIGH_ARCH_BIT_1	33	/* bit only usable on 64-bit arch */
+#define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit arch */
+#define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit arch */
+#define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit arch */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
+#define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
-#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
-# define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
-# define VM_PKEY_BIT0	VM_HIGH_ARCH_0	/* A protection key is a 4-bit value */
-# define VM_PKEY_BIT1	VM_HIGH_ARCH_1
-# define VM_PKEY_BIT2	VM_HIGH_ARCH_2
-# define VM_PKEY_BIT3	VM_HIGH_ARCH_3
-#endif
+#if defined(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) \
+	|| defined(CONFIG_PPC64_MEMORY_PROTECTION_KEYS)
+#define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
+#define VM_PKEY_BIT0	VM_HIGH_ARCH_0	/* A protection key is a 5-bit value */
+#define VM_PKEY_BIT1	VM_HIGH_ARCH_1
+#define VM_PKEY_BIT2	VM_HIGH_ARCH_2
+#define VM_PKEY_BIT3	VM_HIGH_ARCH_3
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 #elif defined(CONFIG_PPC)
+#define VM_PKEY_BIT0	VM_HIGH_ARCH_0	/* A protection key is a 5-bit value */
+#define VM_PKEY_BIT1	VM_HIGH_ARCH_1
+#define VM_PKEY_BIT2	VM_HIGH_ARCH_2
+#define VM_PKEY_BIT3	VM_HIGH_ARCH_3
+#define VM_PKEY_BIT4	VM_HIGH_ARCH_4  /* intel does not use this bit */
+					/* but reserved for future expansion */
 # define VM_SAO		VM_ARCH_1	/* Strong Access Ordering (powerpc) */
 #elif defined(CONFIG_PARISC)
 # define VM_GROWSUP	VM_ARCH_1
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 8c27db0..b13ecc6 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -76,5 +76,5 @@
 #define PKEY_DISABLE_WRITE	0x2
 #define PKEY_ACCESS_MASK	(PKEY_DISABLE_ACCESS |\
 				 PKEY_DISABLE_WRITE)
-
+#define PKEY_DISABLE_EXECUTE	0x4
 #endif /* __ASM_GENERIC_MMAN_COMMON_H */
-- 
1.8.3.1


* [RFC PATCH 3/7 v1]powerpc: store and restore the key state across context switches.
  2017-06-06  1:05 [RFC PATCH 0/7 v1] powerpc: Memory Protection Keys Ram Pai
  2017-06-06  1:05 ` [RFC PATCH 1/7 v1]powerpc: Free up four PTE bits to accommodate memory keys Ram Pai
  2017-06-06  1:05 ` [RFC PATCH 2/7 v1]powerpc: Implement sys_pkey_alloc and sys_pkey_free system call Ram Pai
@ 2017-06-06  1:05 ` Ram Pai
  2017-06-06  1:05 ` [RFC PATCH 4/7 v1]powerpc: Implementation for sys_mprotect_pkey() system call Ram Pai
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Ram Pai @ 2017-06-06  1:05 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel
  Cc: benh, paulus, mpe, khandual, aneesh.kumar, bsingharora,
	dave.hansen, hbabu, linuxram

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 arch/powerpc/include/asm/processor.h |  5 +++++
 arch/powerpc/kernel/process.c        | 18 ++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index a2123f2..1f714df 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -310,6 +310,11 @@ struct thread_struct {
 	struct thread_vr_state ckvr_state; /* Checkpointed VR state */
 	unsigned long	ckvrsave; /* Checkpointed VRSAVE */
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+	unsigned long	amr;
+	unsigned long	iamr;
+	unsigned long	uamor;
+#endif
 #ifdef CONFIG_KVM_BOOK3S_32_HANDLER
 	void*		kvm_shadow_vcpu; /* KVM internal data */
 #endif /* CONFIG_KVM_BOOK3S_32_HANDLER */
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index baae104..37d001a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1096,6 +1096,11 @@ static inline void save_sprs(struct thread_struct *t)
 		t->tar = mfspr(SPRN_TAR);
 	}
 #endif
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+	t->amr = mfspr(SPRN_AMR);
+	t->iamr = mfspr(SPRN_IAMR);
+	t->uamor = mfspr(SPRN_UAMOR);
+#endif
 }
 
 static inline void restore_sprs(struct thread_struct *old_thread,
@@ -1131,6 +1136,14 @@ static inline void restore_sprs(struct thread_struct *old_thread,
 			mtspr(SPRN_TAR, new_thread->tar);
 	}
 #endif
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+	if (old_thread->amr != new_thread->amr)
+		mtspr(SPRN_AMR, new_thread->amr);
+	if (old_thread->iamr != new_thread->iamr)
+		mtspr(SPRN_IAMR, new_thread->iamr);
+	if (old_thread->uamor != new_thread->uamor)
+		mtspr(SPRN_UAMOR, new_thread->uamor);
+#endif
 }
 
 struct task_struct *__switch_to(struct task_struct *prev,
@@ -1686,6 +1699,11 @@ void start_thread(struct pt_regs *regs, unsigned long start, unsigned long sp)
 	current->thread.tm_texasr = 0;
 	current->thread.tm_tfiar = 0;
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+	current->thread.amr   = 0x0ul;
+	current->thread.iamr  = 0x0ul;
+	current->thread.uamor = 0x0ul;
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 }
 EXPORT_SYMBOL(start_thread);
 
-- 
1.8.3.1


* [RFC PATCH 4/7 v1]powerpc: Implementation for sys_mprotect_pkey() system call.
  2017-06-06  1:05 [RFC PATCH 0/7 v1] powerpc: Memory Protection Keys Ram Pai
                   ` (2 preceding siblings ...)
  2017-06-06  1:05 ` [RFC PATCH 3/7 v1]powerpc: store and restore the key state across context switches Ram Pai
@ 2017-06-06  1:05 ` Ram Pai
  2017-06-06  1:05 ` [RFC PATCH 5/7 v1]powerpc: Program HPTE key protection bits Ram Pai
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Ram Pai @ 2017-06-06  1:05 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel
  Cc: benh, paulus, mpe, khandual, aneesh.kumar, bsingharora,
	dave.hansen, hbabu, linuxram

This system call associates the pkey with the PTEs of all
pages corresponding to the given address range.
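
A minimal usage sketch (illustrative only; __NR_pkey_mprotect is the
syscall number added by this patch, and the helper name here is made
up):

	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/*
	 * Tag an existing mapping with a previously allocated pkey;
	 * 'prot' is applied exactly as a plain mprotect() would.
	 */
	static int tag_range(void *addr, size_t len, int prot, int pkey)
	{
		return syscall(__NR_pkey_mprotect, addr, len, prot, pkey);
	}

For a plain mprotect() call (pkey == -1 internally), the key is
instead chosen by __arch_override_mprotect_pkey() below.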

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 22 ++++++-
 arch/powerpc/include/asm/mman.h              | 29 +++++----
 arch/powerpc/include/asm/pkeys.h             | 21 ++++++-
 arch/powerpc/include/asm/systbl.h            |  1 +
 arch/powerpc/include/asm/unistd.h            |  4 +-
 arch/powerpc/include/uapi/asm/unistd.h       |  1 +
 arch/powerpc/mm/pkeys.c                      | 93 +++++++++++++++++++++++++++-
 include/linux/mm.h                           |  1 +
 8 files changed, 154 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 87e9a89..bc845cd 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -37,6 +37,7 @@
 #define _RPAGE_RSV2		0x0800000000000000UL
 #define _RPAGE_RSV3		0x0400000000000000UL
 #define _RPAGE_RSV4		0x0200000000000000UL
+#define _RPAGE_RSV5		0x00040UL
 
 #define _PAGE_PTE		0x4000000000000000UL	/* distinguishes PTEs from pointers */
 #define _PAGE_PRESENT		0x8000000000000000UL	/* pte contains a translation */
@@ -56,6 +57,20 @@
 /* Max physical address bit as per radix table */
 #define _RPAGE_PA_MAX		57
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+#define H_PAGE_PKEY_BIT0	_RPAGE_RSV1
+#define H_PAGE_PKEY_BIT1	_RPAGE_RSV2
+#define H_PAGE_PKEY_BIT2	_RPAGE_RSV3
+#define H_PAGE_PKEY_BIT3	_RPAGE_RSV4
+#define H_PAGE_PKEY_BIT4	_RPAGE_RSV5
+#else /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+#define H_PAGE_PKEY_BIT0	0
+#define H_PAGE_PKEY_BIT1	0
+#define H_PAGE_PKEY_BIT2	0
+#define H_PAGE_PKEY_BIT3	0
+#define H_PAGE_PKEY_BIT4	0
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 /*
  * Max physical address bit we will use for now.
  *
@@ -122,7 +137,12 @@
 #define PAGE_PROT_BITS  (_PAGE_SAO | _PAGE_NON_IDEMPOTENT | _PAGE_TOLERANT | \
 			 H_PAGE_4K_PFN | _PAGE_PRIVILEGED | _PAGE_ACCESSED | \
 			 _PAGE_READ | _PAGE_WRITE |  _PAGE_DIRTY | _PAGE_EXEC | \
-			 _PAGE_SOFT_DIRTY)
+			 _PAGE_SOFT_DIRTY | \
+			 H_PAGE_PKEY_BIT0 | \
+			 H_PAGE_PKEY_BIT1 | \
+			 H_PAGE_PKEY_BIT2 | \
+			 H_PAGE_PKEY_BIT3 | \
+			 H_PAGE_PKEY_BIT4)
 /*
  * We define 2 sets of base prot bits, one for basic pages (ie,
  * cacheable kernel and user pages) and one for non cacheable
diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
index 30922f6..14cc1aa 100644
--- a/arch/powerpc/include/asm/mman.h
+++ b/arch/powerpc/include/asm/mman.h
@@ -13,24 +13,31 @@
 
 #include <asm/cputable.h>
 #include <linux/mm.h>
+#include <linux/pkeys.h>
 #include <asm/cpu_has_feature.h>
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+
 /*
  * This file is included by linux/mman.h, so we can't use cacl_vm_prot_bits()
  * here.  How important is the optimization?
  */
-static inline unsigned long arch_calc_vm_prot_bits(unsigned long prot,
-		unsigned long pkey)
-{
-	return (prot & PROT_SAO) ? VM_SAO : 0;
-}
-#define arch_calc_vm_prot_bits(prot, pkey) arch_calc_vm_prot_bits(prot, pkey)
+#define arch_calc_vm_prot_bits(prot, key) (             \
+		((prot) & PROT_SAO ? VM_SAO : 0) |	\
+			pkey_to_vmflag_bits(key))
+#define arch_vm_get_page_prot(vm_flags) __pgprot(       \
+		((vm_flags) & VM_SAO ? _PAGE_SAO : 0) |	\
+		vmflag_to_page_pkey_bits(vm_flags))
+
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
+#define arch_calc_vm_prot_bits(prot, key) (	\
+		((prot) & PROT_SAO ? VM_SAO : 0))
+#define arch_vm_get_page_prot(vm_flags) __pgprot(	\
+		((vm_flags) & VM_SAO ? _PAGE_SAO : 0))
+
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 
-static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
-{
-	return (vm_flags & VM_SAO) ? __pgprot(_PAGE_SAO) : __pgprot(0);
-}
-#define arch_vm_get_page_prot(vm_flags) arch_vm_get_page_prot(vm_flags)
 
 static inline bool arch_validate_prot(unsigned long prot)
 {
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 7bc8746..0f3dca8 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -14,6 +14,19 @@
 			VM_PKEY_BIT3 | \
 			VM_PKEY_BIT4)
 
+#define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | \
+			((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) |	\
+			((key & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) |	\
+			((key & 0x8UL) ? VM_PKEY_BIT3 : 0x0UL) |	\
+			((key & 0x10UL) ? VM_PKEY_BIT4 : 0x0UL))
+
+#define vmflag_to_page_pkey_bits(vm_flags)  \
+		(((vm_flags & VM_PKEY_BIT0) ? H_PAGE_PKEY_BIT4 : 0x0UL)|     \
+		((vm_flags & VM_PKEY_BIT1) ? H_PAGE_PKEY_BIT3 : 0x0UL) |     \
+		((vm_flags & VM_PKEY_BIT2) ? H_PAGE_PKEY_BIT2 : 0x0UL) |     \
+		((vm_flags & VM_PKEY_BIT3) ? H_PAGE_PKEY_BIT1 : 0x0UL) |     \
+		((vm_flags & VM_PKEY_BIT4) ? H_PAGE_PKEY_BIT0 : 0x0UL))
+
 /*
  * Bits are in BE format.
  * NOTE: key 31, 1, 0 are not used.
@@ -42,6 +55,12 @@
 #define mm_set_pkey_is_reserved(mm, pkey) (PKEY_INITIAL_ALLOCAION & \
 					pkeybit_mask(pkey))
 
+
+static inline int vma_pkey(struct vm_area_struct *vma)
+{
+	return (vma->vm_flags & ARCH_VM_PKEY_FLAGS) >> VM_PKEY_SHIFT;
+}
+
 static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
 {
 	/* a reserved key is never considered as 'explicitly allocated' */
@@ -114,7 +133,7 @@ static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 	return __arch_set_user_pkey_access(tsk, pkey, init_val);
 }
 
-static inline pkey_mm_init(struct mm_struct *mm)
+static inline void pkey_mm_init(struct mm_struct *mm)
 {
 	mm_pkey_allocation_map(mm) = PKEY_INITIAL_ALLOCAION;
 	/* -1 means unallocated or invalid */
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 22dd776..b33b551 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -390,3 +390,4 @@
 SYSCALL(statx)
 SYSCALL(pkey_alloc)
 SYSCALL(pkey_free)
+SYSCALL(pkey_mprotect)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index e0273bc..daf1ba9 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,12 +12,10 @@
 #include <uapi/asm/unistd.h>
 
 
-#define NR_syscalls		386
+#define NR_syscalls		387
 
 #define __NR__exit __NR_exit
 
-#define __IGNORE_pkey_mprotect
-
 #ifndef __ASSEMBLY__
 
 #include <linux/types.h>
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index 7993a07..71ae45e 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -396,5 +396,6 @@
 #define __NR_statx		383
 #define __NR_pkey_alloc		384
 #define __NR_pkey_free		385
+#define __NR_pkey_mprotect	386
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index b97366e..11a32b3 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -15,6 +15,17 @@
 #include <linux/pkeys.h>                /* PKEY_*                       */
 #include <uapi/asm-generic/mman-common.h>
 
+#define pkeyshift(pkey) ((arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY)
+
+static inline bool pkey_allows_readwrite(int pkey)
+{
+	int pkey_shift = pkeyshift(pkey);
+
+	if (!(read_uamor() & (0x3UL << pkey_shift)))
+		return true;
+
+	return !(read_amr() & ((AMR_AD_BIT|AMR_WD_BIT) << pkey_shift));
+}
 
 /*
  * set the access right in AMR IAMR and UAMOR register
@@ -68,7 +79,60 @@ int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 
 int __execute_only_pkey(struct mm_struct *mm)
 {
-	return -1;
+	bool need_to_set_mm_pkey = false;
+	int execute_only_pkey = mm->context.execute_only_pkey;
+	int ret;
+
+	/* Do we need to assign a pkey for mm's execute-only maps? */
+	if (execute_only_pkey == -1) {
+		/* Go allocate one to use, which might fail */
+		execute_only_pkey = mm_pkey_alloc(mm);
+		if (execute_only_pkey < 0)
+			return -1;
+		need_to_set_mm_pkey = true;
+	}
+
+	/*
+	 * We do not want to go through the relatively costly
+	 * dance to set AMR if we do not need to.  Check it
+	 * first and assume that if the execute-only pkey is
+	 * readwrite-disabled than we do not have to set it
+	 * ourselves.
+	 */
+	if (!need_to_set_mm_pkey &&
+	    !pkey_allows_readwrite(execute_only_pkey))
+		return execute_only_pkey;
+
+	/*
+	 * Set up AMR so that it denies access for everything
+	 * other than execution.
+	 */
+	ret = __arch_set_user_pkey_access(current, execute_only_pkey,
+			(PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
+	/*
+	 * If the AMR-set operation failed somehow, just return
+	 * 0 and effectively disable execute-only support.
+	 */
+	if (ret) {
+		mm_set_pkey_free(mm, execute_only_pkey);
+		return -1;
+	}
+
+	/* We got one, store it and use it from here on out */
+	if (need_to_set_mm_pkey)
+		mm->context.execute_only_pkey = execute_only_pkey;
+	return execute_only_pkey;
+}
+
+static inline bool vma_is_pkey_exec_only(struct vm_area_struct *vma)
+{
+	/* Do this check first since the vm_flags should be hot */
+	if ((vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) != VM_EXEC)
+		return false;
+	if (vma_pkey(vma) != vma->vm_mm->context.execute_only_pkey)
+		return false;
+
+	return true;
 }
 
 /*
@@ -84,5 +148,30 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot,
 	if (pkey != -1)
 		return pkey;
 
-	return 0;
+	/*
+	 * Look for a protection-key-drive execute-only mapping
+	 * which is now being given permissions that are not
+	 * execute-only.  Move it back to the default pkey.
+	 */
+	if (vma_is_pkey_exec_only(vma) &&
+	    (prot & (PROT_READ|PROT_WRITE))) {
+		return 0;
+	}
+	/*
+	 * The mapping is execute-only.  Go try to get the
+	 * execute-only protection key.  If we fail to do that,
+	 * fall through as if we do not have execute-only
+	 * support.
+	 */
+	if (prot == PROT_EXEC) {
+		pkey = execute_only_pkey(vma->vm_mm);
+		if (pkey > 0)
+			return pkey;
+	}
+	/*
+	 * This is a vanilla, non-pkey mprotect (or we failed to
+	 * setup execute-only), inherit the pkey from the VMA we
+	 * are working on.
+	 */
+	return vma_pkey(vma);
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 34ddac7..5399031 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -227,6 +227,7 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
 #define VM_PKEY_BIT3	VM_HIGH_ARCH_3
 #endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 #elif defined(CONFIG_PPC)
+#define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
 #define VM_PKEY_BIT0	VM_HIGH_ARCH_0	/* A protection key is a 5-bit value */
 #define VM_PKEY_BIT1	VM_HIGH_ARCH_1
 #define VM_PKEY_BIT2	VM_HIGH_ARCH_2
-- 
1.8.3.1


* [RFC PATCH 5/7 v1]powerpc: Program HPTE key protection bits.
  2017-06-06  1:05 [RFC PATCH 0/7 v1] powerpc: Memory Protection Keys Ram Pai
                   ` (3 preceding siblings ...)
  2017-06-06  1:05 ` [RFC PATCH 4/7 v1]powerpc: Implementation for sys_mprotect_pkey() system call Ram Pai
@ 2017-06-06  1:05 ` Ram Pai
  2017-06-06  1:05 ` [RFC PATCH 6/7 v1]powerpc: Handle exceptions caused by violation of key protection Ram Pai
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Ram Pai @ 2017-06-06  1:05 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel
  Cc: benh, paulus, mpe, khandual, aneesh.kumar, bsingharora,
	dave.hansen, hbabu, linuxram

Map the PTE protection key bits to the HPTE key protection bits
while creating HPTE entries.
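
As a worked example of the mapping (using the definitions from this
and the previous patch): for pkey 5 (0b00101), VM_PKEY_BIT0 and
VM_PKEY_BIT2 are set, which become H_PAGE_PKEY_BIT4 and
H_PAGE_PKEY_BIT2 in the PTE, and calc_pte_to_hpte_pkey_bits() turns
those into HPTE_R_KEY_BIT4 | HPTE_R_KEY_BIT2, i.e. 0x0a00 within the
HPTE_R_KEY_LO field of the HPTE.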

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 5 +++++
 arch/powerpc/include/asm/pkeys.h              | 7 +++++++
 arch/powerpc/mm/hash_utils_64.c               | 5 +++++
 3 files changed, 17 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index cfb8169..3d7872c 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -90,6 +90,8 @@
 #define HPTE_R_PP0		ASM_CONST(0x8000000000000000)
 #define HPTE_R_TS		ASM_CONST(0x4000000000000000)
 #define HPTE_R_KEY_HI		ASM_CONST(0x3000000000000000)
+#define HPTE_R_KEY_BIT0		ASM_CONST(0x2000000000000000)
+#define HPTE_R_KEY_BIT1		ASM_CONST(0x1000000000000000)
 #define HPTE_R_RPN_SHIFT	12
 #define HPTE_R_RPN		ASM_CONST(0x0ffffffffffff000)
 #define HPTE_R_RPN_3_0		ASM_CONST(0x01fffffffffff000)
@@ -104,6 +106,9 @@
 #define HPTE_R_C		ASM_CONST(0x0000000000000080)
 #define HPTE_R_R		ASM_CONST(0x0000000000000100)
 #define HPTE_R_KEY_LO		ASM_CONST(0x0000000000000e00)
+#define HPTE_R_KEY_BIT2		ASM_CONST(0x0000000000000800)
+#define HPTE_R_KEY_BIT3		ASM_CONST(0x0000000000000400)
+#define HPTE_R_KEY_BIT4		ASM_CONST(0x0000000000000200)
 
 #define HPTE_V_1TB_SEG		ASM_CONST(0x4000000000000000)
 #define HPTE_V_VRMA_MASK	ASM_CONST(0x4001ffffff000000)
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 0f3dca8..9b6820d 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -27,6 +27,13 @@
 		((vm_flags & VM_PKEY_BIT3) ? H_PAGE_PKEY_BIT1 : 0x0UL) |     \
 		((vm_flags & VM_PKEY_BIT4) ? H_PAGE_PKEY_BIT0 : 0x0UL))
 
+#define calc_pte_to_hpte_pkey_bits(pteflags)	\
+	(((pteflags & H_PAGE_PKEY_BIT0) ? HPTE_R_KEY_BIT0 : 0x0UL) |	\
+	((pteflags & H_PAGE_PKEY_BIT1) ? HPTE_R_KEY_BIT1 : 0x0UL) |	\
+	((pteflags & H_PAGE_PKEY_BIT2) ? HPTE_R_KEY_BIT2 : 0x0UL) |	\
+	((pteflags & H_PAGE_PKEY_BIT3) ? HPTE_R_KEY_BIT3 : 0x0UL) |	\
+	((pteflags & H_PAGE_PKEY_BIT4) ? HPTE_R_KEY_BIT4 : 0x0UL))
+
 /*
  * Bits are in BE format.
  * NOTE: key 31, 1, 0 are not used.
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index b405657..2276392 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -35,6 +35,7 @@
 #include <linux/memblock.h>
 #include <linux/context_tracking.h>
 #include <linux/libfdt.h>
+#include <linux/pkeys.h>
 
 #include <asm/debugfs.h>
 #include <asm/processor.h>
@@ -230,6 +231,10 @@ unsigned long htab_convert_pte_flags(unsigned long pteflags)
 		 */
 		rflags |= HPTE_R_M;
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+	rflags |= calc_pte_to_hpte_pkey_bits(pteflags);
+#endif
+
 	return rflags;
 }
 
-- 
1.8.3.1


* [RFC PATCH 6/7 v1]powerpc: Handle exceptions caused by violation of key protection.
  2017-06-06  1:05 [RFC PATCH 0/7 v1] powerpc: Memory Protection Keys Ram Pai
                   ` (4 preceding siblings ...)
  2017-06-06  1:05 ` [RFC PATCH 5/7 v1]powerpc: Program HPTE key protection bits Ram Pai
@ 2017-06-06  1:05 ` Ram Pai
  2017-06-06  1:05 ` [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation Ram Pai
  2017-06-20  7:07 ` [RFC PATCH 0/7 v1] powerpc: Memory Protection Keys Pavel Machek
  7 siblings, 0 replies; 23+ messages in thread
From: Ram Pai @ 2017-06-06  1:05 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel
  Cc: benh, paulus, mpe, khandual, aneesh.kumar, bsingharora,
	dave.hansen, hbabu, linuxram

Handle Data and Instruction exceptions caused by memory
protection keys.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 arch/powerpc/include/asm/mmu_context.h | 12 +++++
 arch/powerpc/include/asm/pkeys.h       |  9 ++++
 arch/powerpc/include/asm/reg.h         | 10 +++-
 arch/powerpc/kernel/exceptions-64s.S   |  2 +-
 arch/powerpc/mm/fault.c                | 21 +++++++-
 arch/powerpc/mm/pkeys.c                | 90 ++++++++++++++++++++++++++++++++++
 6 files changed, 140 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index da7e943..71fffe0 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -175,11 +175,23 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
 {
 }
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+bool arch_pte_access_permitted(pte_t pte, bool write);
+bool arch_vma_access_permitted(struct vm_area_struct *vma,
+		bool write, bool execute, bool foreign);
+#else /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+static inline bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	/* by default, allow everything */
+	return true;
+}
 static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
 		bool write, bool execute, bool foreign)
 {
 	/* by default, allow everything */
 	return true;
 }
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_MMU_CONTEXT_H */
diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 9b6820d..405e7db 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -14,6 +14,15 @@
 			VM_PKEY_BIT3 | \
 			VM_PKEY_BIT4)
 
+static inline u16 pte_flags_to_pkey(unsigned long pte_flags)
+{
+	return ((pte_flags & H_PAGE_PKEY_BIT4) ? 0x1 : 0x0) |
+		((pte_flags & H_PAGE_PKEY_BIT3) ? 0x2 : 0x0) |
+		((pte_flags & H_PAGE_PKEY_BIT2) ? 0x4 : 0x0) |
+		((pte_flags & H_PAGE_PKEY_BIT1) ? 0x8 : 0x0) |
+		((pte_flags & H_PAGE_PKEY_BIT0) ? 0x10 : 0x0);
+}
+
 #define pkey_to_vmflag_bits(key) (((key & 0x1UL) ? VM_PKEY_BIT0 : 0x0UL) | \
 			((key & 0x2UL) ? VM_PKEY_BIT1 : 0x0UL) |	\
 			((key & 0x4UL) ? VM_PKEY_BIT2 : 0x0UL) |	\
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 7e50e47..c1ec880 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -272,16 +272,24 @@
 #define SPRN_DAR	0x013	/* Data Address Register */
 #define SPRN_DBCR	0x136	/* e300 Data Breakpoint Control Reg */
 #define SPRN_DSISR	0x012	/* Data Storage Interrupt Status Register */
+#define   DSISR_BIT32		0x80000000	/* not defined */
 #define   DSISR_NOHPTE		0x40000000	/* no translation found */
+#define   DSISR_PAGEATTR_CONFLT	0x20000000	/* page attribute conflict */
+#define   DSISR_BIT35		0x10000000	/* not defined */
 #define   DSISR_PROTFAULT	0x08000000	/* protection fault */
 #define   DSISR_BADACCESS	0x04000000	/* bad access to CI or G */
 #define   DSISR_ISSTORE		0x02000000	/* access was a store */
 #define   DSISR_DABRMATCH	0x00400000	/* hit data breakpoint */
-#define   DSISR_NOSEGMENT	0x00200000	/* SLB miss */
 #define   DSISR_KEYFAULT	0x00200000	/* Key fault */
+#define   DSISR_BIT43		0x00100000	/* Not defined */
 #define   DSISR_UNSUPP_MMU	0x00080000	/* Unsupported MMU config */
 #define   DSISR_SET_RC		0x00040000	/* Failed setting of R/C bits */
 #define   DSISR_PGDIRFAULT      0x00020000      /* Fault on page directory */
+#define   DSISR_PAGE_FAULT_MASK (DSISR_BIT32 |	\
+				DSISR_PAGEATTR_CONFLT |	\
+				DSISR_BADACCESS |	\
+				DSISR_KEYFAULT |	\
+				DSISR_BIT43)
 #define SPRN_TBRL	0x10C	/* Time Base Read Lower Register (user, R/O) */
 #define SPRN_TBRU	0x10D	/* Time Base Read Upper Register (user, R/O) */
 #define SPRN_CIR	0x11B	/* Chip Information Register (hyper, R/0) */
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index ae418b8..5226a9d 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1411,7 +1411,7 @@ USE_TEXT_SECTION()
 	.balign	IFETCH_ALIGN_BYTES
 do_hash_page:
 #ifdef CONFIG_PPC_STD_MMU_64
-	andis.	r0,r4,0xa410		/* weird error? */
+	andis.	r0,r4,DSISR_PAGE_FAULT_MASK@h	/* weird error? */
 	bne-	handle_page_fault	/* if not, try to insert a HPTE */
 	andis.  r0,r4,DSISR_DABRMATCH@h
 	bne-    handle_dabr_fault
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 3a7d580..c31624f 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -216,9 +216,10 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 	 * bits we are interested in.  But there are some bits which
 	 * indicate errors in DSISR but can validly be set in SRR1.
 	 */
-	if (trap == 0x400)
+	if (trap == 0x400) {
 		error_code &= 0x48200000;
-	else
+		flags |= FAULT_FLAG_INSTRUCTION;
+	} else
 		is_write = error_code & DSISR_ISSTORE;
 #else
 	is_write = error_code & ESR_DST;
@@ -261,6 +262,13 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 	}
 #endif
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+	if (error_code & DSISR_KEYFAULT) {
+		code = SEGV_PKUERR;
+		goto bad_area_nosemaphore;
+	}
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 	/* We restore the interrupt state now */
 	if (!arch_irq_disabled_regs(regs))
 		local_irq_enable();
@@ -441,6 +449,15 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 		WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
 #endif /* CONFIG_PPC_STD_MMU */
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
+					flags & FAULT_FLAG_INSTRUCTION,
+					0)) {
+		code = SEGV_PKUERR;
+		goto bad_area;
+	}
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index 11a32b3..a9545f1 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -27,6 +27,37 @@ static inline bool pkey_allows_readwrite(int pkey)
 	return !(read_amr() & ((AMR_AD_BIT|AMR_WD_BIT) << pkey_shift));
 }
 
+static inline bool pkey_allows_read(int pkey)
+{
+	int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
+
+	if (!(read_uamor() & (0x3ul << pkey_shift)))
+		return true;
+
+	return !(read_amr() & (AMR_AD_BIT << pkey_shift));
+}
+
+static inline bool pkey_allows_write(int pkey)
+{
+	int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
+
+	if (!(read_uamor() & (0x3ul << pkey_shift)))
+		return true;
+
+	return !(read_amr() & (AMR_WD_BIT << pkey_shift));
+}
+
+static inline bool pkey_allows_execute(int pkey)
+{
+	int pkey_shift = (arch_max_pkey()-pkey-1) * AMR_BITS_PER_PKEY;
+
+	if (!(read_uamor() & (0x3ul << pkey_shift)))
+		return true;
+
+	return !(read_iamr() & (IAMR_EX_BIT << pkey_shift));
+}
+
+
 /*
  * set the access right in AMR IAMR and UAMOR register
  * for @pkey to that specified in @init_val.
@@ -175,3 +206,62 @@ int __arch_override_mprotect_pkey(struct vm_area_struct *vma, int prot,
 	 */
 	return vma_pkey(vma);
 }
+
+bool arch_pte_access_permitted(pte_t pte, bool write)
+{
+	int pkey = pte_flags_to_pkey(pte_val(pte));
+
+	if (!pkey_allows_read(pkey))
+		return false;
+	if (write && !pkey_allows_write(pkey))
+		return false;
+	return true;
+}
+
+/*
+ * We only want to enforce protection keys on the current process
+ * because we effectively have no access to PKRU for other
+ * processes or any way to tell *which * PKRU in a threaded
+ * process we could use.
+ *
+ * So do not enforce things if the VMA is not from the current
+ * mm, or if we are in a kernel thread.
+ */
+static inline bool vma_is_foreign(struct vm_area_struct *vma)
+{
+	if (!current->mm)
+		return true;
+	/*
+	 * if the VMA is from another process, then PKRU has no
+	 * relevance and should not be enforced.
+	 */
+	if (current->mm != vma->vm_mm)
+		return true;
+
+	return false;
+}
+
+bool arch_vma_access_permitted(struct vm_area_struct *vma,
+		bool write, bool execute, bool foreign)
+{
+	int pkey;
+	/* allow access if the VMA is not one from this process */
+	if (foreign || vma_is_foreign(vma))
+		return true;
+
+	pkey = vma_pkey(vma);
+
+	if (!pkey)
+		return true;
+
+	if (execute)
+		return pkey_allows_execute(pkey);
+
+	if (!pkey_allows_read(pkey))
+		return false;
+
+	if (write)
+		return pkey_allows_write(pkey);
+
+	return true;
+}
-- 
1.8.3.1


* [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation.
  2017-06-06  1:05 [RFC PATCH 0/7 v1] powerpc: Memory Protection Keys Ram Pai
                   ` (5 preceding siblings ...)
  2017-06-06  1:05 ` [RFC PATCH 6/7 v1]powerpc: Handle exceptions caused by violation of key protection Ram Pai
@ 2017-06-06  1:05 ` Ram Pai
  2017-06-16  9:20   ` Anshuman Khandual
  2017-06-16 11:18   ` Michael Ellerman
  2017-06-20  7:07 ` [RFC PATCH 0/7 v1] powerpc: Memory Protection Keys Pavel Machek
  7 siblings, 2 replies; 23+ messages in thread
From: Ram Pai @ 2017-06-06  1:05 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel
  Cc: benh, paulus, mpe, khandual, aneesh.kumar, bsingharora,
	dave.hansen, hbabu, linuxram

The value of the AMR register at the time of the exception
is made available in gp_regs[PT_AMR] of the siginfo.

This field can be used to reprogram the permission bits of
any valid protection key.

Similarly, the value of the key whose protection was violated
is made available in the si_pkey field of the siginfo structure.
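
For illustration only (not part of this patch), a userspace SIGSEGV
handler could consume these fields roughly as below. This is just a
sketch: it assumes libc headers that expose si_pkey and the PT_AMR
index added by this patch, and the handler name is made up.

	#include <signal.h>
	#include <ucontext.h>

	static void segv_handler(int sig, siginfo_t *info, void *ctx)
	{
		ucontext_t *uc = ctx;
		unsigned long amr;
		int pkey;

		if (info->si_code != SEGV_PKUERR)
			return;

		pkey = info->si_pkey;		/* key that faulted (a hint) */
		amr = uc->uc_mcontext.gp_regs[PT_AMR];

		/* adjust the AD/WD bits for 'pkey' in amr as needed, then
		 * write it back; the kernel restores it on sigreturn */
		uc->uc_mcontext.gp_regs[PT_AMR] = amr;
	}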

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 arch/powerpc/include/uapi/asm/ptrace.h |  5 +++-
 arch/powerpc/kernel/asm-offsets.c      |  1 +
 arch/powerpc/kernel/exceptions-64s.S   |  8 ++++++
 arch/powerpc/kernel/signal_32.c        | 18 ++++++++++---
 arch/powerpc/kernel/signal_64.c        | 11 ++++++++
 arch/powerpc/kernel/traps.c            | 49 ++++++++++++++++++++++++++++++++++
 6 files changed, 88 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/uapi/asm/ptrace.h b/arch/powerpc/include/uapi/asm/ptrace.h
index 8036b38..109d0c2 100644
--- a/arch/powerpc/include/uapi/asm/ptrace.h
+++ b/arch/powerpc/include/uapi/asm/ptrace.h
@@ -49,6 +49,8 @@ struct pt_regs {
 	unsigned long dar;		/* Fault registers */
 	unsigned long dsisr;		/* on 4xx/Book-E used for ESR */
 	unsigned long result;		/* Result of a system call */
+	unsigned long dscr;		/* contents of the DSCR register */
+	unsigned long amr;		/* contents of AMR register */
 };
 
 #endif /* __ASSEMBLY__ */
@@ -109,7 +111,8 @@ struct pt_regs {
 #define PT_DSISR 42
 #define PT_RESULT 43
 #define PT_DSCR 44
-#define PT_REGS_COUNT 44
+#define PT_AMR 45
+#define PT_REGS_COUNT 45
 
 #define PT_FPR0	48	/* each FP reg occupies 2 slots in this space */
 
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 709e234..3818dc5 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -301,6 +301,7 @@ int main(void)
 	STACK_PT_REGS_OFFSET(ORIG_GPR3, orig_gpr3);
 	STACK_PT_REGS_OFFSET(RESULT, result);
 	STACK_PT_REGS_OFFSET(_TRAP, trap);
+	STACK_PT_REGS_OFFSET(_AMR, amr);
 #ifndef CONFIG_PPC64
 	/*
 	 * The PowerPC 400-class & Book-E processors have neither the DAR
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 5226a9d..9872dd3 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -493,6 +493,10 @@ EXC_COMMON_BEGIN(data_access_common)
 	ld	r12,_MSR(r1)
 	ld	r3,PACA_EXGEN+EX_DAR(r13)
 	lwz	r4,PACA_EXGEN+EX_DSISR(r13)
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+	mfspr	r5,SPRN_AMR
+	std	r5,_AMR(r1)
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 	li	r5,0x300
 	std	r3,_DAR(r1)
 	std	r4,_DSISR(r1)
@@ -561,6 +565,10 @@ EXC_COMMON_BEGIN(instruction_access_common)
 	ld	r12,_MSR(r1)
 	ld	r3,_NIP(r1)
 	andis.	r4,r12,0x5820
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+	mfspr	r5,SPRN_AMR
+	std	r5,_AMR(r1)
+#endif /*  CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 	li	r5,0x400
 	std	r3,_DAR(r1)
 	std	r4,_DSISR(r1)
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 97bb138..f317638 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -141,9 +141,11 @@ static inline int save_general_regs(struct pt_regs *regs,
 
 	WARN_ON(!FULL_REGS(regs));
 
-	for (i = 0; i <= PT_RESULT; i ++) {
+	for (i = 0; i <= PT_REGS_COUNT; i++) {
 		if (i == 14 && !FULL_REGS(regs))
 			i = 32;
+		if (i == PT_DSCR)
+			continue;
 		if (__put_user((unsigned int)gregs[i], &frame->mc_gregs[i]))
 			return -EFAULT;
 	}
@@ -156,8 +158,8 @@ static inline int restore_general_regs(struct pt_regs *regs,
 	elf_greg_t64 *gregs = (elf_greg_t64 *)regs;
 	int i;
 
-	for (i = 0; i <= PT_RESULT; i++) {
-		if ((i == PT_MSR) || (i == PT_SOFTE))
+	for (i = 0; i <= PT_REGS_COUNT; i++) {
+		if ((i == PT_MSR) || (i == PT_SOFTE) || (i == PT_DSCR))
 			continue;
 		if (__get_user(gregs[i], &sr->mc_gregs[i]))
 			return -EFAULT;
@@ -661,6 +663,9 @@ static long restore_user_regs(struct pt_regs *regs,
 	long err;
 	unsigned int save_r2 = 0;
 	unsigned long msr;
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+	unsigned long amr;
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 #ifdef CONFIG_VSX
 	int i;
 #endif
@@ -750,6 +755,13 @@ static long restore_user_regs(struct pt_regs *regs,
 		return 1;
 #endif /* CONFIG_SPE */
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+	amr = regs->amr;
+	err |= __get_user(regs->amr, &sr->mc_gregs[PT_AMR]);
+	if (!err && amr != regs->amr)
+		mtspr(SPRN_AMR, regs->amr);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 	return 0;
 }
 
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index c83c115..d86d232 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -327,6 +327,9 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
 	unsigned long save_r13 = 0;
 	unsigned long msr;
 	struct pt_regs *regs = tsk->thread.regs;
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+	unsigned long amr;
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
 #ifdef CONFIG_VSX
 	int i;
 #endif
@@ -406,6 +409,14 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
 			tsk->thread.fp_state.fpr[i][TS_VSRLOWOFFSET] = 0;
 	}
 #endif
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+	amr = regs->amr;
+	err |= __get_user(regs->amr, &sc->gp_regs[PT_AMR]);
+	if (!err && amr != regs->amr)
+		mtspr(SPRN_AMR, regs->amr);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 	return err;
 }
 
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index d4e545d..cc4bde8b 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -20,6 +20,7 @@
 #include <linux/sched/debug.h>
 #include <linux/kernel.h>
 #include <linux/mm.h>
+#include <linux/pkeys.h>
 #include <linux/stddef.h>
 #include <linux/unistd.h>
 #include <linux/ptrace.h>
@@ -247,6 +248,49 @@ void user_single_step_siginfo(struct task_struct *tsk,
 	info->si_addr = (void __user *)regs->nip;
 }
 
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+static void fill_sig_info_pkey(int si_code, siginfo_t *info, unsigned long addr)
+{
+	struct vm_area_struct *vma;
+
+	/* Fault not from Protection Keys: nothing to do */
+	if (si_code != SEGV_PKUERR)
+		return;
+
+	down_read(&current->mm->mmap_sem);
+	/*
+	 * we could be racing with pkey_mprotect().
+	 * If pkey_mprotect() wins the key value could
+	 * get modified...xxx
+	 */
+	vma = find_vma(current->mm, addr);
+	up_read(&current->mm->mmap_sem);
+
+	/*
+	 * force_sig_info_fault() is called from a number of
+	 * contexts, some of which have a VMA and some of which
+	 * do not.  The Pkey-fault handling happens after we have a
+	 * valid VMA, so we should never reach this without a
+	 * valid VMA.
+	 */
+	if (!vma) {
+		WARN_ONCE(1, "Pkey fault with no VMA passed in");
+		info->si_pkey = 0;
+		return;
+	}
+
+	/*
+	 * We could report the incorrect key because of the reason
+	 * explained above.
+	 *
+	 * si_pkey should be thought of as a strong hint, but not
+	 * an absolute guarantee because of the race explained
+	 * above.
+	 */
+	info->si_pkey = vma_pkey(vma);
+}
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr)
 {
 	siginfo_t info;
@@ -274,6 +318,11 @@ void _exception(int signr, struct pt_regs *regs, int code, unsigned long addr)
 	info.si_signo = signr;
 	info.si_code = code;
 	info.si_addr = (void __user *) addr;
+
+#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
+	fill_sig_info_pkey(code, &info, addr);
+#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
+
 	force_sig_info(signr, &info, current);
 }
 
-- 
1.8.3.1


* Re: [RFC PATCH 1/7 v1]powerpc: Free up four PTE bits to accommodate memory keys
  2017-06-06  1:05 ` [RFC PATCH 1/7 v1]powerpc: Free up four PTE bits to accommodate memory keys Ram Pai
@ 2017-06-12  6:57   ` Aneesh Kumar K.V
  2017-06-12 22:20     ` Ram Pai
  2017-06-13  4:52   ` Aneesh Kumar K.V
  1 sibling, 1 reply; 23+ messages in thread
From: Aneesh Kumar K.V @ 2017-06-12  6:57 UTC (permalink / raw)
  To: Ram Pai, linuxppc-dev, linux-kernel
  Cc: benh, paulus, mpe, khandual, bsingharora, dave.hansen, hbabu, linuxram

Ram Pai <linuxram@us.ibm.com> writes:

> Rearrange  PTE   bits to  free  up  bits 3, 4, 5  and  6  for
> memory keys. Bit 3, 4, 5, 6 and 57  shall  be used for memory
> keys.
>
> The patch does the following change to the 64K PTE format
>
> H_PAGE_BUSY moves from bit 3 to bit 7
> H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> 	of the pte.
> H_PAGE_F_GIX which  occupied bit 5, 6 and 7 also moves to the
> 	second part of the pte.
>
> The second part of the PTE will hold
>    a (H_PAGE_F_SECOND|H_PAGE_F_GIX)  for  64K page backed pte,
>    and sixteen (H_PAGE_F_SECOND|H_PAGE_F_GIX)  for 4k  backed
> 	pte.
>
> the four  bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
> is initialized to 0xF indicating a invalid slot. if a hashpage does
> get  allocated  to  the  0xF  slot, it is released and not used. In
> other words, even  though  0xF  is  a valid slot we discard it  and
> consider it as invalid slot(HPTE_SOFT_INVALID). This  gives  us  an
> opportunity to  not  depend on a bit in the primary PTE in order to
> determine the validity of a slot.

Do we need to do this for 64K hptes ? H_PAGE_HASHPTE indicates whether a
slot is valid or not. For 4K hptes, we do need this right ? ie, only
when H_PAGE_COMBO is set we need to consider 0xf as an invalid slot


>
> When  we  release  a  0xF slot we also release a legitimate primary
> slot  and  unmap  that  entry. This  is  to  ensure  that we do get
> a legimate non-0xF slot the next time we retry for a slot.

Can you explain this more ? what is a primary slot here ?

>
> Though treating 0xF slot as invalid reduces the number of available
> slots and make have a effect on the performance, the probabilty
> of hitting a 0xF is extermely low.
>
> Compared  to the current scheme, the above described scheme reduces
> the number of false hash table updates  significantly  and  has the
> added  advantage  of  releasing  four  valuable  PTE bits for other
> purpose.
>
> This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> Ellermen and myself.
>
> 4K PTE format remain unchanged currently.
>
> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> ---
>  arch/powerpc/include/asm/book3s/64/hash-4k.h  | 12 +++++
>  arch/powerpc/include/asm/book3s/64/hash-64k.h | 38 ++++++--------
>  arch/powerpc/include/asm/book3s/64/hash.h     |  8 ++-
>  arch/powerpc/include/asm/book3s/64/mmu-hash.h |  5 ++
>  arch/powerpc/mm/dump_linuxpagetables.c        |  3 +-
>  arch/powerpc/mm/hash64_4k.c                   | 12 ++---
>  arch/powerpc/mm/hash64_64k.c                  | 73 +++++++++------------------
>  arch/powerpc/mm/hash_utils_64.c               | 38 +++++++++++++-
>  arch/powerpc/mm/hugetlbpage-hash64.c          | 16 +++---
>  9 files changed, 107 insertions(+), 98 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> index b4b5e6b..223d318 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> @@ -16,6 +16,18 @@
>  #define H_PUD_TABLE_SIZE	(sizeof(pud_t) << H_PUD_INDEX_SIZE)
>  #define H_PGD_TABLE_SIZE	(sizeof(pgd_t) << H_PGD_INDEX_SIZE)
>
> +
> +/*
> + * Only supported by 4k linux page size

that comment is confusing, you can drop that, it is part of hash-4k.h
which means these defines are local to 4k linux page size config.

> + */
> +#define H_PAGE_F_SECOND        _RPAGE_RSV2     /* HPTE is in 2ndary HPTEG */
> +#define H_PAGE_F_GIX           (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> +#define H_PAGE_F_GIX_SHIFT     56
> +
> +#define H_PAGE_BUSY	_RPAGE_RSV1     /* software: PTE & hash are busy */
> +#define H_PAGE_HASHPTE	_RPAGE_RPN43    /* PTE has associated HPTE */
> +
> +
>  /* PTE flags to conserve for HPTE identification */
>  #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
>  			 H_PAGE_F_SECOND | H_PAGE_F_GIX)
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> index 9732837..87ee22d 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> @@ -10,23 +10,21 @@
>   * 64k aligned address free up few of the lower bits of RPN for us
>   * We steal that here. For more deatils look at pte_pfn/pfn_pte()
>   */
> -#define H_PAGE_COMBO	_RPAGE_RPN0 /* this is a combo 4k page */
> -#define H_PAGE_4K_PFN	_RPAGE_RPN1 /* PFN is for a single 4k page */
> +#define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
> +#define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
> +
> +#define H_PAGE_BUSY	_RPAGE_RPN44     /* software: PTE & hash are busy */
> +#define H_PAGE_HASHPTE	_RPAGE_RPN43    /* PTE has associated HPTE */
> +
>  /*
>   * We need to differentiate between explicit huge page and THP huge
>   * page, since THP huge page also need to track real subpage details
>   */
>  #define H_PAGE_THP_HUGE  H_PAGE_4K_PFN
>
> -/*
> - * Used to track subpage group valid if H_PAGE_COMBO is set
> - * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
> - */
> -#define H_PAGE_COMBO_VALID	(H_PAGE_F_GIX | H_PAGE_F_SECOND)
> -
>  /* PTE flags to conserve for HPTE identification */
> -#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
> -			 H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
> +#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE)
> +

You drop H_PAGE_COMBO here. Is that intentional ?

We have code which does

		if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
					       MMU_PAGE_64K, ssize,
					       flags) == -1)
			old_pte &= ~_PAGE_HPTEFLAGS;
	

I guess they expect slot details to be cleared. With the above
_PAGE_HPTEFLAGS that is not true.


>  /*
>   * we support 16 fragments per PTE page of 64K size.
>   */
> @@ -54,24 +52,18 @@ static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
>  	unsigned long *hidxp;
>
>  	rpte.pte = pte;
> -	rpte.hidx = 0;
> -	if (pte_val(pte) & H_PAGE_COMBO) {
> -		/*
> -		 * Make sure we order the hidx load against the H_PAGE_COMBO
> -		 * check. The store side ordering is done in __hash_page_4K
> -		 */
> -		smp_rmb();
> -		hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> -		rpte.hidx = *hidxp;
> -	}
> +	/*
> +	 * The store side ordering is done in __hash_page_4K
> +	 */
> +	smp_rmb();
> +	hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> +	rpte.hidx = *hidxp;
>  	return rpte;
>  }
>
>  static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
>  {
> -	if ((pte_val(rpte.pte) & H_PAGE_COMBO))
> -		return (rpte.hidx >> (index<<2)) & 0xf;
> -	return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
> +	return ((rpte.hidx >> (index<<2)) & 0xfUL);
>  }
>
>  #define __rpte_to_pte(r)	((r).pte)
> diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
> index 4e957b0..7742782 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> @@ -8,11 +8,9 @@
>   *
>   */
>  #define H_PTE_NONE_MASK		_PAGE_HPTEFLAGS
> -#define H_PAGE_F_GIX_SHIFT	56
> -#define H_PAGE_BUSY		_RPAGE_RSV1 /* software: PTE & hash are busy */
> -#define H_PAGE_F_SECOND		_RPAGE_RSV2	/* HPTE is in 2ndary HPTEG */
> -#define H_PAGE_F_GIX		(_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> -#define H_PAGE_HASHPTE		_RPAGE_RPN43	/* PTE has associated HPTE */
> +
> +#define INIT_HIDX (~0x0UL)

But then you use it like

	rpte.hidx = INIT_HIDX;

A better would be INIT_HIDX = 0xF ?. Also may be INIT is the wrong name
there ?

> +#define HPTE_SOFT_INVALID(slot) ((slot & 0xfUL) == 0xfUL)

Can you do that as a static inline ?

>
>  #ifdef CONFIG_PPC_64K_PAGES
>  #include <asm/book3s/64/hash-64k.h>
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> index 6981a52..cfb8169 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> @@ -435,6 +435,11 @@ extern int __hash_page_4K(unsigned long ea, unsigned long access,
>  extern int __hash_page_64K(unsigned long ea, unsigned long access,
>  			   unsigned long vsid, pte_t *ptep, unsigned long trap,
>  			   unsigned long flags, int ssize);
> +extern unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> +			unsigned int subpg_index, unsigned long slot);
> +extern unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
> +			int ssize, real_pte_t rpte, unsigned int subpg_index);
> +
>  struct mm_struct;
>  unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
>  extern int hash_page_mm(struct mm_struct *mm, unsigned long ea,
> diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
> index 44fe483..b832ed3 100644
> --- a/arch/powerpc/mm/dump_linuxpagetables.c
> +++ b/arch/powerpc/mm/dump_linuxpagetables.c
> @@ -213,7 +213,7 @@ struct flag_info {
>  		.val	= H_PAGE_4K_PFN,
>  		.set	= "4K_pfn",
>  	}, {
> -#endif
> +#else
>  		.mask	= H_PAGE_F_GIX,
>  		.val	= H_PAGE_F_GIX,
>  		.set	= "f_gix",
> @@ -224,6 +224,7 @@ struct flag_info {
>  		.val	= H_PAGE_F_SECOND,
>  		.set	= "f_second",
>  	}, {
> +#endif /* CONFIG_PPC_64K_PAGES */
>  #endif
>  		.mask	= _PAGE_SPECIAL,
>  		.val	= _PAGE_SPECIAL,
> diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
> index 6fa450c..5b16416 100644
> --- a/arch/powerpc/mm/hash64_4k.c
> +++ b/arch/powerpc/mm/hash64_4k.c
> @@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>  		   pte_t *ptep, unsigned long trap, unsigned long flags,
>  		   int ssize, int subpg_prot)
>  {
> +	real_pte_t rpte;
>  	unsigned long hpte_group;
>  	unsigned long rflags, pa;
>  	unsigned long old_pte, new_pte;
> @@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>  	 * need to add in 0x1 if it's a read-only user page
>  	 */
>  	rflags = htab_convert_pte_flags(new_pte);
> +	rpte = __real_pte(__pte(old_pte), ptep);
>
>  	if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
>  	    !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> @@ -64,12 +66,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>  		/*
>  		 * There MIGHT be an HPTE for this pte
>  		 */
> -		hash = hpt_hash(vpn, shift, ssize);
> -		if (old_pte & H_PAGE_F_SECOND)
> -			hash = ~hash;
> -		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> -		slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> -
> +		slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);


This confused me, the rest of the code assume slot as the 4 bit value.
But get_hidx_slot is not exactly returning that.



>  		if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_4K,
>  					       MMU_PAGE_4K, ssize, flags) == -1)
>  			old_pte &= ~_PAGE_HPTEFLAGS;
> @@ -118,8 +115,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>  			return -1;
>  		}
>  		new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
> -		new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> -			(H_PAGE_F_SECOND | H_PAGE_F_GIX);
> +		new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
>  	}
>  	*ptep = __pte(new_pte & ~H_PAGE_BUSY);
>  	return 0;
> diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
> index 1a68cb1..feb90f1 100644
> --- a/arch/powerpc/mm/hash64_64k.c
> +++ b/arch/powerpc/mm/hash64_64k.c
> @@ -15,34 +15,13 @@
>  #include <linux/mm.h>
>  #include <asm/machdep.h>
>  #include <asm/mmu.h>
> -/*
> - * index from 0 - 15
> - */
> -bool __rpte_sub_valid(real_pte_t rpte, unsigned long index)
> -{
> -	unsigned long g_idx;
> -	unsigned long ptev = pte_val(rpte.pte);
>
> -	g_idx = (ptev & H_PAGE_COMBO_VALID) >> H_PAGE_F_GIX_SHIFT;
> -	index = index >> 2;
> -	if (g_idx & (0x1 << index))
> -		return true;
> -	else
> -		return false;
> -}
>  /*
>   * index from 0 - 15
>   */
> -static unsigned long mark_subptegroup_valid(unsigned long ptev, unsigned long index)
> +bool __rpte_sub_valid(real_pte_t rpte, unsigned long index)
>  {
> -	unsigned long g_idx;
> -
> -	if (!(ptev & H_PAGE_COMBO))
> -		return ptev;
> -	index = index >> 2;
> -	g_idx = 0x1 << index;
> -
> -	return ptev | (g_idx << H_PAGE_F_GIX_SHIFT);
> +	return !(HPTE_SOFT_INVALID(rpte.hidx >> (index << 2)));
>  }
>
>  int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> @@ -50,10 +29,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>  		   int ssize, int subpg_prot)
>  {
>  	real_pte_t rpte;
> -	unsigned long *hidxp;
>  	unsigned long hpte_group;
>  	unsigned int subpg_index;
> -	unsigned long rflags, pa, hidx;
> +	unsigned long rflags, pa;
>  	unsigned long old_pte, new_pte, subpg_pte;
>  	unsigned long vpn, hash, slot;
>  	unsigned long shift = mmu_psize_defs[MMU_PAGE_4K].shift;
> @@ -116,8 +94,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>  		 * On hash insert failure we use old pte value and we don't
>  		 * want slot information there if we have a insert failure.
>  		 */
> -		old_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
> -		new_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
> +		old_pte &= ~(H_PAGE_HASHPTE);
> +		new_pte &= ~(H_PAGE_HASHPTE);
> +		rpte.hidx = INIT_HIDX;
>  		goto htab_insert_hpte;
>  	}
>  	/*
> @@ -126,18 +105,12 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>  	if (__rpte_sub_valid(rpte, subpg_index)) {
>  		int ret;
>
> -		hash = hpt_hash(vpn, shift, ssize);
> -		hidx = __rpte_to_hidx(rpte, subpg_index);
> -		if (hidx & _PTEIDX_SECONDARY)
> -			hash = ~hash;
> -		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> -		slot += hidx & _PTEIDX_GROUP_IX;
> -
> +		slot = get_hidx_slot(vpn, shift, ssize, rpte, subpg_index);
>  		ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn,
>  						 MMU_PAGE_4K, MMU_PAGE_4K,
>  						 ssize, flags);
>  		/*
> -		 *if we failed because typically the HPTE wasn't really here
> +		 * if we failed because typically the HPTE wasn't really here
>  		 * we try an insertion.
>  		 */
>  		if (ret == -1)
> @@ -177,8 +150,14 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>  						rflags, HPTE_V_SECONDARY,
>  						MMU_PAGE_4K, MMU_PAGE_4K,
>  						ssize);
> -		if (slot == -1) {
> -			if (mftb() & 0x1)
> +		if (unlikely(HPTE_SOFT_INVALID(slot)))
> +			mmu_hash_ops.hpte_invalidate(slot, vpn,
> +				MMU_PAGE_4K, MMU_PAGE_4K,
> +				ssize, (flags & HPTE_LOCAL_UPDATE));

Did this work ? invalidate's first arg is the offset of the hpte from
the htab base, whereas insert returns the 4-bit slot number within the
group. I guess we need a cleanup patch first to differentiate between
the slot number within a group and the slot number in the hash page
table. Maybe name the slot number within a group gslot ?
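
i.e. something along these lines (just a sketch, the name is made up):

	/* gslot: 4-bit value (secondary-bucket bit + index within the group)
	 * slot : offset of the hpte from the htab base
	 */
	static unsigned long gslot_to_slot(unsigned long hash, unsigned long gslot)
	{
		if (gslot & _PTEIDX_SECONDARY)
			hash = ~hash;
		return (hash & htab_hash_mask) * HPTES_PER_GROUP +
			(gslot & _PTEIDX_GROUP_IX);
	}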


> +
> +		if (unlikely(slot == -1 || HPTE_SOFT_INVALID(slot))) {
> +			if (HPTE_SOFT_INVALID(slot) || (mftb() & 0x1))
> +

So we always try to remove from primary if we got 0xf slot by insert ?


>  				hpte_group = ((hash & htab_hash_mask) *
>  					      HPTES_PER_GROUP) & ~0x7UL;
>  			mmu_hash_ops.hpte_remove(hpte_group);
> @@ -204,11 +183,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>  	 * Since we have H_PAGE_BUSY set on ptep, we can be sure
>  	 * nobody is undating hidx.
>  	 */
> -	hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> -	rpte.hidx &= ~(0xfUL << (subpg_index << 2));
> -	*hidxp = rpte.hidx  | (slot << (subpg_index << 2));
> -	new_pte = mark_subptegroup_valid(new_pte, subpg_index);
> -	new_pte |=  H_PAGE_HASHPTE;
> +	new_pte |= set_hidx_slot(ptep, rpte, subpg_index, slot);
> +	new_pte |= H_PAGE_HASHPTE;
> +
>  	/*
>  	 * check __real_pte for details on matching smp_rmb()
>  	 */
> @@ -221,6 +198,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
>  		    unsigned long vsid, pte_t *ptep, unsigned long trap,
>  		    unsigned long flags, int ssize)
>  {
> +	real_pte_t rpte;
>  	unsigned long hpte_group;
>  	unsigned long rflags, pa;
>  	unsigned long old_pte, new_pte;
> @@ -257,6 +235,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
>  	} while (!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
>
>  	rflags = htab_convert_pte_flags(new_pte);
> +	rpte = __real_pte(__pte(old_pte), ptep);
>
>  	if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
>  	    !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> @@ -267,12 +246,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
>  		/*
>  		 * There MIGHT be an HPTE for this pte
>  		 */
> -		hash = hpt_hash(vpn, shift, ssize);
> -		if (old_pte & H_PAGE_F_SECOND)
> -			hash = ~hash;
> -		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> -		slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> -
> +		slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
>  		if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
>  					       MMU_PAGE_64K, ssize,
>  					       flags) == -1)
> @@ -322,9 +296,8 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
>  					   MMU_PAGE_64K, MMU_PAGE_64K, old_pte);
>  			return -1;
>  		}
> +		set_hidx_slot(ptep, rpte, 0, slot);
>  		new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
> -		new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> -			(H_PAGE_F_SECOND | H_PAGE_F_GIX);
>  	}
>  	*ptep = __pte(new_pte & ~H_PAGE_BUSY);
>  	return 0;
> diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
> index f2095ce..b405657 100644
> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -975,8 +975,9 @@ void __init hash__early_init_devtree(void)
>
>  void __init hash__early_init_mmu(void)
>  {
> +#ifndef CONFIG_PPC_64K_PAGES
>  	/*
> -	 * We have code in __hash_page_64K() and elsewhere, which assumes it can
> +	 * We have code in __hash_page_4K() and elsewhere, which assumes it can
>  	 * do the following:
>  	 *   new_pte |= (slot << H_PAGE_F_GIX_SHIFT) & (H_PAGE_F_SECOND | H_PAGE_F_GIX);
>  	 *
> @@ -987,6 +988,7 @@ void __init hash__early_init_mmu(void)
>  	 * with a BUILD_BUG_ON().
>  	 */
>  	BUILD_BUG_ON(H_PAGE_F_SECOND != (1ul  << (H_PAGE_F_GIX_SHIFT + 3)));
> +#endif /* CONFIG_PPC_64K_PAGES */
>
>  	htab_init_page_sizes();
>
> @@ -1589,6 +1591,40 @@ static inline void tm_flush_hash_page(int local)
>  }
>  #endif
>
> +#ifdef CONFIG_PPC_64K_PAGES
> +unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> +		unsigned int subpg_index, unsigned long slot)
> +{
> +	unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> +
> +	rpte.hidx &= ~(0xfUL << (subpg_index << 2));
> +	*hidxp = rpte.hidx  | (slot << (subpg_index << 2));
> +	return 0x0UL;
> +}
> +#else /* CONFIG_PPC_64K_PAGES */
> +unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> +		unsigned int subpg_index, unsigned long slot)
> +{
> +	return (slot << H_PAGE_F_GIX_SHIFT) &
> +			(H_PAGE_F_SECOND | H_PAGE_F_GIX);
> +}

Move them to header files so that we can avoid #ifdef here. Here slot is
4 bit value


> +#endif /* CONFIG_PPC_64K_PAGES */
> +
> +unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
> +			int ssize, real_pte_t rpte, unsigned int subpg_index)
> +{
> +	unsigned long hash, slot, hidx;
> +
> +	hash = hpt_hash(vpn, shift, ssize);
> +	hidx = __rpte_to_hidx(rpte, subpg_index);
> +	if (hidx & _PTEIDX_SECONDARY)
> +		hash = ~hash;
> +	slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> +	slot += hidx & _PTEIDX_GROUP_IX;
> +	return slot;

Here slot is not. Can you rename this function ?

> +}
> +
> +
>  /* WARNING: This is called from hash_low_64.S, if you change this prototype,
>   *          do not forget to update the assembly call site !
>   */
> diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c b/arch/powerpc/mm/hugetlbpage-hash64.c
> index a84bb44..25a50eb 100644
> --- a/arch/powerpc/mm/hugetlbpage-hash64.c
> +++ b/arch/powerpc/mm/hugetlbpage-hash64.c
> @@ -14,7 +14,7 @@
>  #include <asm/cacheflush.h>
>  #include <asm/machdep.h>
>
> -extern long hpte_insert_repeating(unsigned long hash, unsigned long vpn,
> +long hpte_insert_repeating(unsigned long hash, unsigned long vpn,
>  				  unsigned long pa, unsigned long rlags,
>  				  unsigned long vflags, int psize, int ssize);
>
> @@ -22,6 +22,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
>  		     pte_t *ptep, unsigned long trap, unsigned long flags,
>  		     int ssize, unsigned int shift, unsigned int mmu_psize)
>  {
> +	real_pte_t rpte;
>  	unsigned long vpn;
>  	unsigned long old_pte, new_pte;
>  	unsigned long rflags, pa, sz;
> @@ -61,6 +62,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
>  	} while(!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
>
>  	rflags = htab_convert_pte_flags(new_pte);
> +	rpte = __real_pte(__pte(old_pte), ptep);
>
>  	sz = ((1UL) << shift);
>  	if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> @@ -71,14 +73,9 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
>  	/* Check if pte already has an hpte (case 2) */
>  	if (unlikely(old_pte & H_PAGE_HASHPTE)) {
>  		/* There MIGHT be an HPTE for this pte */
> -		unsigned long hash, slot;
> -
> -		hash = hpt_hash(vpn, shift, ssize);
> -		if (old_pte & H_PAGE_F_SECOND)
> -			hash = ~hash;
> -		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> -		slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> +		unsigned long slot;
>
> +		slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
>  		if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, mmu_psize,
>  					       mmu_psize, ssize, flags) == -1)
>  			old_pte &= ~_PAGE_HPTEFLAGS;
> @@ -106,8 +103,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
>  			return -1;
>  		}
>
> -		new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> -			(H_PAGE_F_SECOND | H_PAGE_F_GIX);
> +		new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
>  	}
>
>  	/*
> -- 
> 1.8.3.1


* Re: [RFC PATCH 1/7 v1]powerpc: Free up four PTE bits to accommodate memory keys
  2017-06-12  6:57   ` Aneesh Kumar K.V
@ 2017-06-12 22:20     ` Ram Pai
  2017-06-13  2:02       ` Aneesh Kumar K.V
  0 siblings, 1 reply; 23+ messages in thread
From: Ram Pai @ 2017-06-12 22:20 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linuxppc-dev, linux-kernel, benh, paulus, mpe, khandual,
	bsingharora, dave.hansen, hbabu

On Mon, Jun 12, 2017 at 12:27:44PM +0530, Aneesh Kumar K.V wrote:
> Ram Pai <linuxram@us.ibm.com> writes:
> 
> > Rearrange  PTE   bits to  free  up  bits 3, 4, 5  and  6  for
> > memory keys. Bit 3, 4, 5, 6 and 57  shall  be used for memory
> > keys.
> >
> > The patch does the following change to the 64K PTE format
> >
> > H_PAGE_BUSY moves from bit 3 to bit 7
> > H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> > 	of the pte.
> > H_PAGE_F_GIX which  occupied bit 5, 6 and 7 also moves to the
> > 	second part of the pte.
> >
> > The second part of the PTE will hold
> >    a (H_PAGE_F_SECOND|H_PAGE_F_GIX)  for  64K page backed pte,
> >    and sixteen (H_PAGE_F_SECOND|H_PAGE_F_GIX)  for 4k  backed
> > 	pte.
> >
> > the four  bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
> > is initialized to 0xF indicating a invalid slot. if a hashpage does
> > get  allocated  to  the  0xF  slot, it is released and not used. In
> > other words, even  though  0xF  is  a valid slot we discard it  and
> > consider it as invalid slot(HPTE_SOFT_INVALID). This  gives  us  an
> > opportunity to  not  depend on a bit in the primary PTE in order to
> > determine the validity of a slot.
> 
> Do we need to do this for 64K hptes ? H_PAGE_HASHPTE indicates whether a
> slot is valid or not. For 4K hptes, we do need this right ? ie, only
> when H_PAGE_COMBO is set we need to consider 0xf as an invalid slot

For 64k hptes you are right, we do not use 0xF to track the validity
of a slot. We just depend on the H_PAGE_HASHPTE flag.

For 4k hptes, we need to depend on both H_PAGE_HASHPTE and the value
of the slot. H_PAGE_HASHPTE tells us whether any valid 4k HPTE exists,
and the 4-bit values in the second part of the pte tell us whether each
of them is valid.

However, in either case we do not need H_PAGE_COMBO. That flag is not
used for ptes. But we continue to use it for pmds to track hugepages,
which is why I have not entirely divorced H_PAGE_COMBO from the 64K
pagesize case.
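
In code terms, the 4k case boils down to something like this (sketch
only, the helper name is made up):

	/* valid only if the pte has seen a hash insert at all AND the
	 * subpage's 4-bit entry is not the soft-invalid value 0xF
	 */
	static bool subpage_hpte_valid(pte_t pte, real_pte_t rpte,
				       unsigned long index)
	{
		if (!(pte_val(pte) & H_PAGE_HASHPTE))
			return false;
		return !HPTE_SOFT_INVALID(rpte.hidx >> (index << 2));
	}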

> 
> 
> >
> > When  we  release  a  0xF slot we also release a legitimate primary
> > slot  and  unmap  that  entry. This  is  to  ensure  that we do get
> > a legimate non-0xF slot the next time we retry for a slot.
> 
> Can you explain this more ? what is a primary slot here ?

I may not be using the right terminology here. But when I say slot, I
mean the four bits that give the position of the hpte in the hash
buckets. Bit 0 indicates whether it is the primary or the secondary
hash bucket, and bits 1, 2 and 3 indicate the entry within that bucket.

So when I say primary slot, I mean an entry in the primary hash bucket.

The idea is: when hpte_insert returns an hpte that is cached in the 7th
slot of the secondary bucket, i.e. 0xF, we discard it and also release
a random entry from the primary bucket, so that on retry we can get
that entry.
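
In code terms the decode of that 4-bit value is roughly (illustrative
only, the helper name is made up):

	static void decode_hidx(unsigned long hidx, bool *secondary,
				unsigned long *entry)
	{
		*secondary = !!(hidx & _PTEIDX_SECONDARY); /* top bit         */
		*entry = hidx & _PTEIDX_GROUP_IX;          /* low three bits  */
		/* hidx == 0xF: entry 7 of the secondary bucket, which we
		 * treat as HPTE_SOFT_INVALID */
	}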


> 
> >
> > Though treating 0xF slot as invalid reduces the number of available
> > slots and make have a effect on the performance, the probabilty
> > of hitting a 0xF is extermely low.
> >
> > Compared  to the current scheme, the above described scheme reduces
> > the number of false hash table updates  significantly  and  has the
> > added  advantage  of  releasing  four  valuable  PTE bits for other
> > purpose.
> >
> > This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> > Ellermen and myself.
> >
> > 4K PTE format remain unchanged currently.
> >
> > Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> > ---
> >  arch/powerpc/include/asm/book3s/64/hash-4k.h  | 12 +++++
> >  arch/powerpc/include/asm/book3s/64/hash-64k.h | 38 ++++++--------
> >  arch/powerpc/include/asm/book3s/64/hash.h     |  8 ++-
> >  arch/powerpc/include/asm/book3s/64/mmu-hash.h |  5 ++
> >  arch/powerpc/mm/dump_linuxpagetables.c        |  3 +-
> >  arch/powerpc/mm/hash64_4k.c                   | 12 ++---
> >  arch/powerpc/mm/hash64_64k.c                  | 73 +++++++++------------------
> >  arch/powerpc/mm/hash_utils_64.c               | 38 +++++++++++++-
> >  arch/powerpc/mm/hugetlbpage-hash64.c          | 16 +++---
> >  9 files changed, 107 insertions(+), 98 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > index b4b5e6b..223d318 100644
> > --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > @@ -16,6 +16,18 @@
> >  #define H_PUD_TABLE_SIZE	(sizeof(pud_t) << H_PUD_INDEX_SIZE)
> >  #define H_PGD_TABLE_SIZE	(sizeof(pgd_t) << H_PGD_INDEX_SIZE)
> >
> > +
> > +/*
> > + * Only supported by 4k linux page size
> 
> that comment is confusing, you can drop that, it is part of hash-4k.h
> which means these defines are local to 4k linux page size config.
> 
> > + */

right. ok.

> > +#define H_PAGE_F_SECOND        _RPAGE_RSV2     /* HPTE is in 2ndary HPTEG */
> > +#define H_PAGE_F_GIX           (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> > +#define H_PAGE_F_GIX_SHIFT     56
> > +
> > +#define H_PAGE_BUSY	_RPAGE_RSV1     /* software: PTE & hash are busy */
> > +#define H_PAGE_HASHPTE	_RPAGE_RPN43    /* PTE has associated HPTE */
> > +
> > +
> >  /* PTE flags to conserve for HPTE identification */
> >  #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
> >  			 H_PAGE_F_SECOND | H_PAGE_F_GIX)
> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> > index 9732837..87ee22d 100644
> > --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
> > +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> > @@ -10,23 +10,21 @@
> >   * 64k aligned address free up few of the lower bits of RPN for us
> >   * We steal that here. For more deatils look at pte_pfn/pfn_pte()
> >   */
> > -#define H_PAGE_COMBO	_RPAGE_RPN0 /* this is a combo 4k page */
> > -#define H_PAGE_4K_PFN	_RPAGE_RPN1 /* PFN is for a single 4k page */
> > +#define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
> > +#define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
> > +
> > +#define H_PAGE_BUSY	_RPAGE_RPN44     /* software: PTE & hash are busy */
> > +#define H_PAGE_HASHPTE	_RPAGE_RPN43    /* PTE has associated HPTE */
> > +
> >  /*
> >   * We need to differentiate between explicit huge page and THP huge
> >   * page, since THP huge page also need to track real subpage details
> >   */
> >  #define H_PAGE_THP_HUGE  H_PAGE_4K_PFN
> >
> > -/*
> > - * Used to track subpage group valid if H_PAGE_COMBO is set
> > - * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
> > - */
> > -#define H_PAGE_COMBO_VALID	(H_PAGE_F_GIX | H_PAGE_F_SECOND)
> > -
> >  /* PTE flags to conserve for HPTE identification */
> > -#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
> > -			 H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
> > +#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE)
> > +
> 
> You drop H_PAGE_COMBO here. Is that intentional ?

Yes, that is intentional. Do you see a downside? We don't depend on
H_PAGE_COMBO for the 64K pagesize PTE anymore.

> 
> We have code which does
> 
> 		if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
> 					       MMU_PAGE_64K, ssize,
> 					       flags) == -1)
> 			old_pte &= ~_PAGE_HPTEFLAGS;
> 	
> 
> I guess they expect slot details to be cleared. With the above
> _PAGE_HPTEFLAGS that is not true.

Since we don't depend on the COMBO flag, it should be fine. Let me
know if you see a downside.

> 
> 
> >  /*
> >   * we support 16 fragments per PTE page of 64K size.
> >   */
> > @@ -54,24 +52,18 @@ static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
> >  	unsigned long *hidxp;
> >
> >  	rpte.pte = pte;
> > -	rpte.hidx = 0;
> > -	if (pte_val(pte) & H_PAGE_COMBO) {
> > -		/*
> > -		 * Make sure we order the hidx load against the H_PAGE_COMBO
> > -		 * check. The store side ordering is done in __hash_page_4K
> > -		 */
> > -		smp_rmb();
> > -		hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> > -		rpte.hidx = *hidxp;
> > -	}
> > +	/*
> > +	 * The store side ordering is done in __hash_page_4K
> > +	 */
> > +	smp_rmb();
> > +	hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> > +	rpte.hidx = *hidxp;
> >  	return rpte;
> >  }
> >
> >  static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
> >  {
> > -	if ((pte_val(rpte.pte) & H_PAGE_COMBO))
> > -		return (rpte.hidx >> (index<<2)) & 0xf;
> > -	return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
> > +	return ((rpte.hidx >> (index<<2)) & 0xfUL);
> >  }
> >
> >  #define __rpte_to_pte(r)	((r).pte)
> > diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
> > index 4e957b0..7742782 100644
> > --- a/arch/powerpc/include/asm/book3s/64/hash.h
> > +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> > @@ -8,11 +8,9 @@
> >   *
> >   */
> >  #define H_PTE_NONE_MASK		_PAGE_HPTEFLAGS
> > -#define H_PAGE_F_GIX_SHIFT	56
> > -#define H_PAGE_BUSY		_RPAGE_RSV1 /* software: PTE & hash are busy */
> > -#define H_PAGE_F_SECOND		_RPAGE_RSV2	/* HPTE is in 2ndary HPTEG */
> > -#define H_PAGE_F_GIX		(_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> > -#define H_PAGE_HASHPTE		_RPAGE_RPN43	/* PTE has associated HPTE */
> > +
> > +#define INIT_HIDX (~0x0UL)
> 
> But then you use it like
> 
> 	rpte.hidx = INIT_HIDX;
> 
> A better would be INIT_HIDX = 0xF ?. Also may be INIT is the wrong name
> there ?

No. We are initializing all sixteen entries in the second part of the
pte to 0xF; sixteen 4-bit entries is all 64 bits, which is why that
boils down to ~0x0UL.

> 
> > +#define HPTE_SOFT_INVALID(slot) ((slot & 0xfUL) == 0xfUL)
> 
> Can you do that as a static inline ?

ok.
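
Something like this, perhaps (untested sketch):

	static inline bool hpte_soft_invalid(unsigned long hidx)
	{
		return ((hidx & 0xfUL) == 0xfUL);
	}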

> 
> >
> >  #ifdef CONFIG_PPC_64K_PAGES
> >  #include <asm/book3s/64/hash-64k.h>
> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> > index 6981a52..cfb8169 100644
> > --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> > +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> > @@ -435,6 +435,11 @@ extern int __hash_page_4K(unsigned long ea, unsigned long access,
> >  extern int __hash_page_64K(unsigned long ea, unsigned long access,
> >  			   unsigned long vsid, pte_t *ptep, unsigned long trap,
> >  			   unsigned long flags, int ssize);
> > +extern unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> > +			unsigned int subpg_index, unsigned long slot);
> > +extern unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
> > +			int ssize, real_pte_t rpte, unsigned int subpg_index);
> > +
> >  struct mm_struct;
> >  unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
> >  extern int hash_page_mm(struct mm_struct *mm, unsigned long ea,
> > diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
> > index 44fe483..b832ed3 100644
> > --- a/arch/powerpc/mm/dump_linuxpagetables.c
> > +++ b/arch/powerpc/mm/dump_linuxpagetables.c
> > @@ -213,7 +213,7 @@ struct flag_info {
> >  		.val	= H_PAGE_4K_PFN,
> >  		.set	= "4K_pfn",
> >  	}, {
> > -#endif
> > +#else
> >  		.mask	= H_PAGE_F_GIX,
> >  		.val	= H_PAGE_F_GIX,
> >  		.set	= "f_gix",
> > @@ -224,6 +224,7 @@ struct flag_info {
> >  		.val	= H_PAGE_F_SECOND,
> >  		.set	= "f_second",
> >  	}, {
> > +#endif /* CONFIG_PPC_64K_PAGES */
> >  #endif
> >  		.mask	= _PAGE_SPECIAL,
> >  		.val	= _PAGE_SPECIAL,
> > diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
> > index 6fa450c..5b16416 100644
> > --- a/arch/powerpc/mm/hash64_4k.c
> > +++ b/arch/powerpc/mm/hash64_4k.c
> > @@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >  		   pte_t *ptep, unsigned long trap, unsigned long flags,
> >  		   int ssize, int subpg_prot)
> >  {
> > +	real_pte_t rpte;
> >  	unsigned long hpte_group;
> >  	unsigned long rflags, pa;
> >  	unsigned long old_pte, new_pte;
> > @@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >  	 * need to add in 0x1 if it's a read-only user page
> >  	 */
> >  	rflags = htab_convert_pte_flags(new_pte);
> > +	rpte = __real_pte(__pte(old_pte), ptep);
> >
> >  	if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
> >  	    !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> > @@ -64,12 +66,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >  		/*
> >  		 * There MIGHT be an HPTE for this pte
> >  		 */
> > -		hash = hpt_hash(vpn, shift, ssize);
> > -		if (old_pte & H_PAGE_F_SECOND)
> > -			hash = ~hash;
> > -		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> > -		slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> > -
> > +		slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
> 
> 
> This confused me, the rest of the code assume slot as the 4 bit value.
> But get_hidx_slot is not exactly returning that.

get_hidx_slot() is returning a 4-bit value, no?

> 
> 
> 
> >  		if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_4K,
> >  					       MMU_PAGE_4K, ssize, flags) == -1)
> >  			old_pte &= ~_PAGE_HPTEFLAGS;
> > @@ -118,8 +115,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >  			return -1;
> >  		}
> >  		new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
> > -		new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> > -			(H_PAGE_F_SECOND | H_PAGE_F_GIX);
> > +		new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
> >  	}
> >  	*ptep = __pte(new_pte & ~H_PAGE_BUSY);
> >  	return 0;
> > diff --git a/arch/powerpc/mm/hash64_64k.c b/arch/powerpc/mm/hash64_64k.c
> > index 1a68cb1..feb90f1 100644
> > --- a/arch/powerpc/mm/hash64_64k.c
> > +++ b/arch/powerpc/mm/hash64_64k.c
> > @@ -15,34 +15,13 @@
> >  #include <linux/mm.h>
> >  #include <asm/machdep.h>
> >  #include <asm/mmu.h>
> > -/*
> > - * index from 0 - 15
> > - */
> > -bool __rpte_sub_valid(real_pte_t rpte, unsigned long index)
> > -{
> > -	unsigned long g_idx;
> > -	unsigned long ptev = pte_val(rpte.pte);
> >
> > -	g_idx = (ptev & H_PAGE_COMBO_VALID) >> H_PAGE_F_GIX_SHIFT;
> > -	index = index >> 2;
> > -	if (g_idx & (0x1 << index))
> > -		return true;
> > -	else
> > -		return false;
> > -}
> >  /*
> >   * index from 0 - 15
> >   */
> > -static unsigned long mark_subptegroup_valid(unsigned long ptev, unsigned long index)
> > +bool __rpte_sub_valid(real_pte_t rpte, unsigned long index)
> >  {
> > -	unsigned long g_idx;
> > -
> > -	if (!(ptev & H_PAGE_COMBO))
> > -		return ptev;
> > -	index = index >> 2;
> > -	g_idx = 0x1 << index;
> > -
> > -	return ptev | (g_idx << H_PAGE_F_GIX_SHIFT);
> > +	return !(HPTE_SOFT_INVALID(rpte.hidx >> (index << 2)));
> >  }
> >
> >  int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> > @@ -50,10 +29,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >  		   int ssize, int subpg_prot)
> >  {
> >  	real_pte_t rpte;
> > -	unsigned long *hidxp;
> >  	unsigned long hpte_group;
> >  	unsigned int subpg_index;
> > -	unsigned long rflags, pa, hidx;
> > +	unsigned long rflags, pa;
> >  	unsigned long old_pte, new_pte, subpg_pte;
> >  	unsigned long vpn, hash, slot;
> >  	unsigned long shift = mmu_psize_defs[MMU_PAGE_4K].shift;
> > @@ -116,8 +94,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >  		 * On hash insert failure we use old pte value and we don't
> >  		 * want slot information there if we have a insert failure.
> >  		 */
> > -		old_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
> > -		new_pte &= ~(H_PAGE_HASHPTE | H_PAGE_F_GIX | H_PAGE_F_SECOND);
> > +		old_pte &= ~(H_PAGE_HASHPTE);
> > +		new_pte &= ~(H_PAGE_HASHPTE);
> > +		rpte.hidx = INIT_HIDX;
> >  		goto htab_insert_hpte;
> >  	}
> >  	/*
> > @@ -126,18 +105,12 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >  	if (__rpte_sub_valid(rpte, subpg_index)) {
> >  		int ret;
> >
> > -		hash = hpt_hash(vpn, shift, ssize);
> > -		hidx = __rpte_to_hidx(rpte, subpg_index);
> > -		if (hidx & _PTEIDX_SECONDARY)
> > -			hash = ~hash;
> > -		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> > -		slot += hidx & _PTEIDX_GROUP_IX;
> > -
> > +		slot = get_hidx_slot(vpn, shift, ssize, rpte, subpg_index);
> >  		ret = mmu_hash_ops.hpte_updatepp(slot, rflags, vpn,
> >  						 MMU_PAGE_4K, MMU_PAGE_4K,
> >  						 ssize, flags);
> >  		/*
> > -		 *if we failed because typically the HPTE wasn't really here
> > +		 * if we failed because typically the HPTE wasn't really here
> >  		 * we try an insertion.
> >  		 */
> >  		if (ret == -1)
> > @@ -177,8 +150,14 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >  						rflags, HPTE_V_SECONDARY,
> >  						MMU_PAGE_4K, MMU_PAGE_4K,
> >  						ssize);
> > -		if (slot == -1) {
> > -			if (mftb() & 0x1)
> > +		if (unlikely(HPTE_SOFT_INVALID(slot)))
> > +			mmu_hash_ops.hpte_invalidate(slot, vpn,
> > +				MMU_PAGE_4K, MMU_PAGE_4K,
> > +				ssize, (flags & HPTE_LOCAL_UPDATE));
> 
> Did this work ?

Good catch. I have not been able to exercise this code; getting a 0xf
value back from mmu_hash_ops.hpte_insert() has been impossible.

> invalidate first arg is the offset of hpte from the htab
> base. where as insert return 4 bit slot number within group. I guess we
> need to do a cleanup patch before to differentiate between slot number
> within a group and slot number in hash page table. May be name slot
> number within a group as gslot ?
> 
> 
> > +
> > +		if (unlikely(slot == -1 || HPTE_SOFT_INVALID(slot))) {
> > +			if (HPTE_SOFT_INVALID(slot) || (mftb() & 0x1))
> > +
> 
> So we always try to remove from primary if we got 0xf slot by insert ?


yes.

> 
> 
> >  				hpte_group = ((hash & htab_hash_mask) *
> >  					      HPTES_PER_GROUP) & ~0x7UL;
> >  			mmu_hash_ops.hpte_remove(hpte_group);
> > @@ -204,11 +183,9 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >  	 * Since we have H_PAGE_BUSY set on ptep, we can be sure
> >  	 * nobody is undating hidx.
> >  	 */
> > -	hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> > -	rpte.hidx &= ~(0xfUL << (subpg_index << 2));
> > -	*hidxp = rpte.hidx  | (slot << (subpg_index << 2));
> > -	new_pte = mark_subptegroup_valid(new_pte, subpg_index);
> > -	new_pte |=  H_PAGE_HASHPTE;
> > +	new_pte |= set_hidx_slot(ptep, rpte, subpg_index, slot);
> > +	new_pte |= H_PAGE_HASHPTE;
> > +
> >  	/*
> >  	 * check __real_pte for details on matching smp_rmb()
> >  	 */
> > @@ -221,6 +198,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
> >  		    unsigned long vsid, pte_t *ptep, unsigned long trap,
> >  		    unsigned long flags, int ssize)
> >  {
> > +	real_pte_t rpte;
> >  	unsigned long hpte_group;
> >  	unsigned long rflags, pa;
> >  	unsigned long old_pte, new_pte;
> > @@ -257,6 +235,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
> >  	} while (!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
> >
> >  	rflags = htab_convert_pte_flags(new_pte);
> > +	rpte = __real_pte(__pte(old_pte), ptep);
> >
> >  	if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
> >  	    !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> > @@ -267,12 +246,7 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
> >  		/*
> >  		 * There MIGHT be an HPTE for this pte
> >  		 */
> > -		hash = hpt_hash(vpn, shift, ssize);
> > -		if (old_pte & H_PAGE_F_SECOND)
> > -			hash = ~hash;
> > -		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> > -		slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> > -
> > +		slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
> >  		if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
> >  					       MMU_PAGE_64K, ssize,
> >  					       flags) == -1)
> > @@ -322,9 +296,8 @@ int __hash_page_64K(unsigned long ea, unsigned long access,
> >  					   MMU_PAGE_64K, MMU_PAGE_64K, old_pte);
> >  			return -1;
> >  		}
> > +		set_hidx_slot(ptep, rpte, 0, slot);
> >  		new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | H_PAGE_HASHPTE;
> > -		new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> > -			(H_PAGE_F_SECOND | H_PAGE_F_GIX);
> >  	}
> >  	*ptep = __pte(new_pte & ~H_PAGE_BUSY);
> >  	return 0;
> > diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
> > index f2095ce..b405657 100644
> > --- a/arch/powerpc/mm/hash_utils_64.c
> > +++ b/arch/powerpc/mm/hash_utils_64.c
> > @@ -975,8 +975,9 @@ void __init hash__early_init_devtree(void)
> >
> >  void __init hash__early_init_mmu(void)
> >  {
> > +#ifndef CONFIG_PPC_64K_PAGES
> >  	/*
> > -	 * We have code in __hash_page_64K() and elsewhere, which assumes it can
> > +	 * We have code in __hash_page_4K() and elsewhere, which assumes it can
> >  	 * do the following:
> >  	 *   new_pte |= (slot << H_PAGE_F_GIX_SHIFT) & (H_PAGE_F_SECOND | H_PAGE_F_GIX);
> >  	 *
> > @@ -987,6 +988,7 @@ void __init hash__early_init_mmu(void)
> >  	 * with a BUILD_BUG_ON().
> >  	 */
> >  	BUILD_BUG_ON(H_PAGE_F_SECOND != (1ul  << (H_PAGE_F_GIX_SHIFT + 3)));
> > +#endif /* CONFIG_PPC_64K_PAGES */
> >
> >  	htab_init_page_sizes();
> >
> > @@ -1589,6 +1591,40 @@ static inline void tm_flush_hash_page(int local)
> >  }
> >  #endif
> >
> > +#ifdef CONFIG_PPC_64K_PAGES
> > +unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> > +		unsigned int subpg_index, unsigned long slot)
> > +{
> > +	unsigned long *hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> > +
> > +	rpte.hidx &= ~(0xfUL << (subpg_index << 2));
> > +	*hidxp = rpte.hidx  | (slot << (subpg_index << 2));
> > +	return 0x0UL;
> > +}
> > +#else /* CONFIG_PPC_64K_PAGES */
> > +unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> > +		unsigned int subpg_index, unsigned long slot)
> > +{
> > +	return (slot << H_PAGE_F_GIX_SHIFT) &
> > +			(H_PAGE_F_SECOND | H_PAGE_F_GIX);
> > +}
> 
> Move them to header files so that we can avoid #ifdef here. Here slot is
> 4 bit value

ok.

> 
> 
> > +#endif /* CONFIG_PPC_64K_PAGES */
> > +
> > +unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
> > +			int ssize, real_pte_t rpte, unsigned int subpg_index)
> > +{
> > +	unsigned long hash, slot, hidx;
> > +
> > +	hash = hpt_hash(vpn, shift, ssize);
> > +	hidx = __rpte_to_hidx(rpte, subpg_index);
> > +	if (hidx & _PTEIDX_SECONDARY)
> > +		hash = ~hash;
> > +	slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> > +	slot += hidx & _PTEIDX_GROUP_IX;
> > +	return slot;
> 
> Here slot is not. Can you rename this function ?

Rename it to what? I am possibly using the wrong terminology, which is
getting reflected in the function names. Will have to get that right
to avoid confusion.


Thanks for your comments, 

RP

> 
> > +}
> > +
> > +
> >  /* WARNING: This is called from hash_low_64.S, if you change this prototype,
> >   *          do not forget to update the assembly call site !
> >   */
> > diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c b/arch/powerpc/mm/hugetlbpage-hash64.c
> > index a84bb44..25a50eb 100644
> > --- a/arch/powerpc/mm/hugetlbpage-hash64.c
> > +++ b/arch/powerpc/mm/hugetlbpage-hash64.c
> > @@ -14,7 +14,7 @@
> >  #include <asm/cacheflush.h>
> >  #include <asm/machdep.h>
> >
> > -extern long hpte_insert_repeating(unsigned long hash, unsigned long vpn,
> > +long hpte_insert_repeating(unsigned long hash, unsigned long vpn,
> >  				  unsigned long pa, unsigned long rlags,
> >  				  unsigned long vflags, int psize, int ssize);
> >
> > @@ -22,6 +22,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> >  		     pte_t *ptep, unsigned long trap, unsigned long flags,
> >  		     int ssize, unsigned int shift, unsigned int mmu_psize)
> >  {
> > +	real_pte_t rpte;
> >  	unsigned long vpn;
> >  	unsigned long old_pte, new_pte;
> >  	unsigned long rflags, pa, sz;
> > @@ -61,6 +62,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> >  	} while(!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
> >
> >  	rflags = htab_convert_pte_flags(new_pte);
> > +	rpte = __real_pte(__pte(old_pte), ptep);
> >
> >  	sz = ((1UL) << shift);
> >  	if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> > @@ -71,14 +73,9 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> >  	/* Check if pte already has an hpte (case 2) */
> >  	if (unlikely(old_pte & H_PAGE_HASHPTE)) {
> >  		/* There MIGHT be an HPTE for this pte */
> > -		unsigned long hash, slot;
> > -
> > -		hash = hpt_hash(vpn, shift, ssize);
> > -		if (old_pte & H_PAGE_F_SECOND)
> > -			hash = ~hash;
> > -		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> > -		slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> > +		unsigned long slot;
> >
> > +		slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
> >  		if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, mmu_psize,
> >  					       mmu_psize, ssize, flags) == -1)
> >  			old_pte &= ~_PAGE_HPTEFLAGS;
> > @@ -106,8 +103,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> >  			return -1;
> >  		}
> >
> > -		new_pte |= (slot << H_PAGE_F_GIX_SHIFT) &
> > -			(H_PAGE_F_SECOND | H_PAGE_F_GIX);
> > +		new_pte |= set_hidx_slot(ptep, rpte, 0, slot);
> >  	}
> >
> >  	/*
> > -- 
> > 1.8.3.1

-- 
Ram Pai


* Re: [RFC PATCH 1/7 v1]powerpc: Free up four PTE bits to accommodate memory keys
  2017-06-12 22:20     ` Ram Pai
@ 2017-06-13  2:02       ` Aneesh Kumar K.V
  2017-06-13 21:51         ` Ram Pai
  0 siblings, 1 reply; 23+ messages in thread
From: Aneesh Kumar K.V @ 2017-06-13  2:02 UTC (permalink / raw)
  To: Ram Pai
  Cc: linuxppc-dev, linux-kernel, benh, paulus, mpe, khandual,
	bsingharora, dave.hansen, hbabu

Ram Pai <linuxram@us.ibm.com> writes:

> On Mon, Jun 12, 2017 at 12:27:44PM +0530, Aneesh Kumar K.V wrote:
>> Ram Pai <linuxram@us.ibm.com> writes:
>> 
>> > Rearrange  PTE   bits to  free  up  bits 3, 4, 5  and  6  for
>> > memory keys. Bit 3, 4, 5, 6 and 57  shall  be used for memory
>> > keys.
>> >
>> > The patch does the following change to the 64K PTE format
>> >
>> > H_PAGE_BUSY moves from bit 3 to bit 7
>> > H_PAGE_F_SECOND which occupied bit 4 moves to the second part
>> > 	of the pte.
>> > H_PAGE_F_GIX which  occupied bit 5, 6 and 7 also moves to the
>> > 	second part of the pte.
>> >
>> > The second part of the PTE will hold
>> >    a (H_PAGE_F_SECOND|H_PAGE_F_GIX)  for  64K page backed pte,
>> >    and sixteen (H_PAGE_F_SECOND|H_PAGE_F_GIX)  for 4k  backed
>> > 	pte.
>> >
>> > the four  bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
>> > is initialized to 0xF indicating a invalid slot. if a hashpage does
>> > get  allocated  to  the  0xF  slot, it is released and not used. In
>> > other words, even  though  0xF  is  a valid slot we discard it  and
>> > consider it as invalid slot(HPTE_SOFT_INVALID). This  gives  us  an
>> > opportunity to  not  depend on a bit in the primary PTE in order to
>> > determine the validity of a slot.
>> 
>> Do we need to do this for 64K hptes ? H_PAGE_HASHPTE indicates whether a
>> slot is valid or not. For 4K hptes, we do need this right ? ie, only
>> when H_PAGE_COMBO is set we need to consider 0xf as an invalid slot
>
> for 64k hptes; you are right, we do not use 0xF to
> track the validity of a slot. We just depend on H_PAGE_HASHPTE flag.
>
> for 4k hptes, we need to depend on both H_PAGE_HASHPTE as well as the
> value of the slot.  H_PAGE_HASHPTE tells if there exists any valid
> 4k HPTEs, and the 4-bit values in the second-part-of-the-pte tells
> us if they are valid values.
>
> However in either case we do not need H_PAGE_COMBO. That flag is not
> used for ptes.  But we continue to use that flag for pmd to track
> hugepages, which is why I have not entirely divorced H_PAGE_COMBO from
> the 64K pagesize case.


Really? Maybe I am missing that in the patch. H_PAGE_COMBO indicates
whether a 64K Linux page is mapped via 4k hash page table entries. I
don't see you changing that in __hash_page_4k().

We still continue to do:

		new_pte = old_pte | H_PAGE_BUSY | _PAGE_ACCESSED | H_PAGE_COMBO;

	/*
	 * Check if the pte was already inserted into the hash table
	 * as a 64k HW page, and invalidate the 64k HPTE if so.
	 */
	if (!(old_pte & H_PAGE_COMBO)) {
		flush_hash_page(vpn, rpte, MMU_PAGE_64K, ssize, flags);



>
>> 
>> 
>> >
>> > When  we  release  a  0xF slot we also release a legitimate primary
>> > slot  and  unmap  that  entry. This  is  to  ensure  that we do get
>> > a legimate non-0xF slot the next time we retry for a slot.
>> 
>> Can you explain this more ? what is a primary slot here ?
>
> I may not be using the right terminology here. But when I say slot, i
> mean the four bits that tell the position of the hpte in the hash
> buckets. Bit 0, indicates if it the primary or secondary hash bucket.
> and bit 1,2,3 indicates the entry in the hash bucket.
>
> So when i say primary slot, I mean a entry in the primary hash bucket.
>
> The idea is, when hpte_insert returns a hpte which is cached in the 7slot
> of the secondary bucket i.e 0xF, we discard it, and also release a
> random entry from the primary bucket, so that on retry we can get that
> entry.
>
>
>> 
>> >
>> > Though treating 0xF slot as invalid reduces the number of available
>> > slots and make have a effect on the performance, the probabilty
>> > of hitting a 0xF is extermely low.
>> >
>> > Compared  to the current scheme, the above described scheme reduces
>> > the number of false hash table updates  significantly  and  has the
>> > added  advantage  of  releasing  four  valuable  PTE bits for other
>> > purpose.
>> >
>> > This idea was jointly developed by Paul Mackerras, Aneesh, Michael
>> > Ellermen and myself.
>> >
>> > 4K PTE format remain unchanged currently.
>> >
>> > Signed-off-by: Ram Pai <linuxram@us.ibm.com>
>> > ---
>> >  arch/powerpc/include/asm/book3s/64/hash-4k.h  | 12 +++++
>> >  arch/powerpc/include/asm/book3s/64/hash-64k.h | 38 ++++++--------
>> >  arch/powerpc/include/asm/book3s/64/hash.h     |  8 ++-
>> >  arch/powerpc/include/asm/book3s/64/mmu-hash.h |  5 ++
>> >  arch/powerpc/mm/dump_linuxpagetables.c        |  3 +-
>> >  arch/powerpc/mm/hash64_4k.c                   | 12 ++---
>> >  arch/powerpc/mm/hash64_64k.c                  | 73 +++++++++------------------
>> >  arch/powerpc/mm/hash_utils_64.c               | 38 +++++++++++++-
>> >  arch/powerpc/mm/hugetlbpage-hash64.c          | 16 +++---
>> >  9 files changed, 107 insertions(+), 98 deletions(-)
>> >
>> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
>> > index b4b5e6b..223d318 100644
>> > --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
>> > +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
>> > @@ -16,6 +16,18 @@
>> >  #define H_PUD_TABLE_SIZE	(sizeof(pud_t) << H_PUD_INDEX_SIZE)
>> >  #define H_PGD_TABLE_SIZE	(sizeof(pgd_t) << H_PGD_INDEX_SIZE)
>> >
>> > +
>> > +/*
>> > + * Only supported by 4k linux page size
>> 
>> that comment is confusing, you can drop that, it is part of hash-4k.h
>> which means these defines are local to 4k linux page size config.
>> 
>> > + */
>
> right. ok.
>
>> > +#define H_PAGE_F_SECOND        _RPAGE_RSV2     /* HPTE is in 2ndary HPTEG */
>> > +#define H_PAGE_F_GIX           (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
>> > +#define H_PAGE_F_GIX_SHIFT     56
>> > +
>> > +#define H_PAGE_BUSY	_RPAGE_RSV1     /* software: PTE & hash are busy */
>> > +#define H_PAGE_HASHPTE	_RPAGE_RPN43    /* PTE has associated HPTE */
>> > +
>> > +
>> >  /* PTE flags to conserve for HPTE identification */
>> >  #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
>> >  			 H_PAGE_F_SECOND | H_PAGE_F_GIX)
>> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
>> > index 9732837..87ee22d 100644
>> > --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
>> > +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
>> > @@ -10,23 +10,21 @@
>> >   * 64k aligned address free up few of the lower bits of RPN for us
>> >   * We steal that here. For more deatils look at pte_pfn/pfn_pte()
>> >   */
>> > -#define H_PAGE_COMBO	_RPAGE_RPN0 /* this is a combo 4k page */
>> > -#define H_PAGE_4K_PFN	_RPAGE_RPN1 /* PFN is for a single 4k page */
>> > +#define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
>> > +#define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
>> > +
>> > +#define H_PAGE_BUSY	_RPAGE_RPN44     /* software: PTE & hash are busy */
>> > +#define H_PAGE_HASHPTE	_RPAGE_RPN43    /* PTE has associated HPTE */
>> > +
>> >  /*
>> >   * We need to differentiate between explicit huge page and THP huge
>> >   * page, since THP huge page also need to track real subpage details
>> >   */
>> >  #define H_PAGE_THP_HUGE  H_PAGE_4K_PFN
>> >
>> > -/*
>> > - * Used to track subpage group valid if H_PAGE_COMBO is set
>> > - * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
>> > - */
>> > -#define H_PAGE_COMBO_VALID	(H_PAGE_F_GIX | H_PAGE_F_SECOND)
>> > -
>> >  /* PTE flags to conserve for HPTE identification */
>> > -#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
>> > -			 H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
>> > +#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE)
>> > +
>> 
>> You drop H_PAGE_COMBO here. Is that intentional ?
>
> Yes. that is intentional. See a downside?  We dont depend on
> H_PAGE_COMBO for 64K pagesize PTE anymore.
>
>> 
>> We have code which does
>> 
>> 		if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
>> 					       MMU_PAGE_64K, ssize,
>> 					       flags) == -1)
>> 			old_pte &= ~_PAGE_HPTEFLAGS;
>> 	
>> 
>> I guess they expect slot details to be cleared. With the above
>> _PAGE_HPTEFLAGS that is not true.
>
> since we dont depend on COMBO flag, it should be fine. Let me know if
> you see a downside.
>
>> 
>> 
>> >  /*
>> >   * we support 16 fragments per PTE page of 64K size.
>> >   */
>> > @@ -54,24 +52,18 @@ static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
>> >  	unsigned long *hidxp;
>> >
>> >  	rpte.pte = pte;
>> > -	rpte.hidx = 0;
>> > -	if (pte_val(pte) & H_PAGE_COMBO) {
>> > -		/*
>> > -		 * Make sure we order the hidx load against the H_PAGE_COMBO
>> > -		 * check. The store side ordering is done in __hash_page_4K
>> > -		 */
>> > -		smp_rmb();
>> > -		hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
>> > -		rpte.hidx = *hidxp;
>> > -	}
>> > +	/*
>> > +	 * The store side ordering is done in __hash_page_4K
>> > +	 */
>> > +	smp_rmb();
>> > +	hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
>> > +	rpte.hidx = *hidxp;
>> >  	return rpte;
>> >  }
>> >
>> >  static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
>> >  {
>> > -	if ((pte_val(rpte.pte) & H_PAGE_COMBO))
>> > -		return (rpte.hidx >> (index<<2)) & 0xf;
>> > -	return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
>> > +	return ((rpte.hidx >> (index<<2)) & 0xfUL);
>> >  }
>> >
>> >  #define __rpte_to_pte(r)	((r).pte)
>> > diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
>> > index 4e957b0..7742782 100644
>> > --- a/arch/powerpc/include/asm/book3s/64/hash.h
>> > +++ b/arch/powerpc/include/asm/book3s/64/hash.h
>> > @@ -8,11 +8,9 @@
>> >   *
>> >   */
>> >  #define H_PTE_NONE_MASK		_PAGE_HPTEFLAGS
>> > -#define H_PAGE_F_GIX_SHIFT	56
>> > -#define H_PAGE_BUSY		_RPAGE_RSV1 /* software: PTE & hash are busy */
>> > -#define H_PAGE_F_SECOND		_RPAGE_RSV2	/* HPTE is in 2ndary HPTEG */
>> > -#define H_PAGE_F_GIX		(_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
>> > -#define H_PAGE_HASHPTE		_RPAGE_RPN43	/* PTE has associated HPTE */
>> > +
>> > +#define INIT_HIDX (~0x0UL)
>> 
>> But then you use it like
>> 
>> 	rpte.hidx = INIT_HIDX;
>> 
>> A better would be INIT_HIDX = 0xF ?. Also may be INIT is the wrong name
>> there ?
>
> No. We are initializing all the entries in the second-part-of-the-pte to
> 0xF. which essentially boils down to ~0x0UL.

That unsigned long casting is what I was suggesting to remove. We may
want to treat it as 4 bits everywhere?


>
>> 
>> > +#define HPTE_SOFT_INVALID(slot) ((slot & 0xfUL) == 0xfUL)
>> 
>> Can you do that as a static inline ?
>
> ok.
>
>> 
>> >
>> >  #ifdef CONFIG_PPC_64K_PAGES
>> >  #include <asm/book3s/64/hash-64k.h>
>> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> > index 6981a52..cfb8169 100644
>> > --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> > +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
>> > @@ -435,6 +435,11 @@ extern int __hash_page_4K(unsigned long ea, unsigned long access,
>> >  extern int __hash_page_64K(unsigned long ea, unsigned long access,
>> >  			   unsigned long vsid, pte_t *ptep, unsigned long trap,
>> >  			   unsigned long flags, int ssize);
>> > +extern unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
>> > +			unsigned int subpg_index, unsigned long slot);
>> > +extern unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
>> > +			int ssize, real_pte_t rpte, unsigned int subpg_index);
>> > +
>> >  struct mm_struct;
>> >  unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
>> >  extern int hash_page_mm(struct mm_struct *mm, unsigned long ea,
>> > diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
>> > index 44fe483..b832ed3 100644
>> > --- a/arch/powerpc/mm/dump_linuxpagetables.c
>> > +++ b/arch/powerpc/mm/dump_linuxpagetables.c
>> > @@ -213,7 +213,7 @@ struct flag_info {
>> >  		.val	= H_PAGE_4K_PFN,
>> >  		.set	= "4K_pfn",
>> >  	}, {
>> > -#endif
>> > +#else
>> >  		.mask	= H_PAGE_F_GIX,
>> >  		.val	= H_PAGE_F_GIX,
>> >  		.set	= "f_gix",
>> > @@ -224,6 +224,7 @@ struct flag_info {
>> >  		.val	= H_PAGE_F_SECOND,
>> >  		.set	= "f_second",
>> >  	}, {
>> > +#endif /* CONFIG_PPC_64K_PAGES */
>> >  #endif
>> >  		.mask	= _PAGE_SPECIAL,
>> >  		.val	= _PAGE_SPECIAL,
>> > diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
>> > index 6fa450c..5b16416 100644
>> > --- a/arch/powerpc/mm/hash64_4k.c
>> > +++ b/arch/powerpc/mm/hash64_4k.c
>> > @@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>> >  		   pte_t *ptep, unsigned long trap, unsigned long flags,
>> >  		   int ssize, int subpg_prot)
>> >  {
>> > +	real_pte_t rpte;
>> >  	unsigned long hpte_group;
>> >  	unsigned long rflags, pa;
>> >  	unsigned long old_pte, new_pte;
>> > @@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>> >  	 * need to add in 0x1 if it's a read-only user page
>> >  	 */
>> >  	rflags = htab_convert_pte_flags(new_pte);
>> > +	rpte = __real_pte(__pte(old_pte), ptep);
>> >
>> >  	if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
>> >  	    !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
>> > @@ -64,12 +66,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
>> >  		/*
>> >  		 * There MIGHT be an HPTE for this pte
>> >  		 */
>> > -		hash = hpt_hash(vpn, shift, ssize);
>> > -		if (old_pte & H_PAGE_F_SECOND)
>> > -			hash = ~hash;
>> > -		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
>> > -		slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
>> > -
>> > +		slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
>> 
>> 
>> This confused me, the rest of the code assume slot as the 4 bit value.
>> But get_hidx_slot is not exactly returning that.
>
> get_hidx_slot() is returning a 4bit value. no?

It is not:
unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
			int ssize, real_pte_t rpte, unsigned int subpg_index)
{
	unsigned long hash, slot, hidx;

	hash = hpt_hash(vpn, shift, ssize);
	hidx = __rpte_to_hidx(rpte, subpg_index);
	if (hidx & _PTEIDX_SECONDARY)
		hash = ~hash;
	slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
	slot += hidx & _PTEIDX_GROUP_IX;
	return slot;
}

-aneesh


* Re: [RFC PATCH 1/7 v1]powerpc: Free up four PTE bits to accommodate memory keys
  2017-06-06  1:05 ` [RFC PATCH 1/7 v1]powerpc: Free up four PTE bits to accommodate memory keys Ram Pai
  2017-06-12  6:57   ` Aneesh Kumar K.V
@ 2017-06-13  4:52   ` Aneesh Kumar K.V
  2017-06-13 21:52     ` Ram Pai
  1 sibling, 1 reply; 23+ messages in thread
From: Aneesh Kumar K.V @ 2017-06-13  4:52 UTC (permalink / raw)
  To: Ram Pai, linuxppc-dev, linux-kernel
  Cc: benh, paulus, mpe, khandual, bsingharora, dave.hansen, hbabu, linuxram

Ram Pai <linuxram@us.ibm.com> writes:

> Rearrange  PTE   bits to  free  up  bits 3, 4, 5  and  6  for
> memory keys. Bit 3, 4, 5, 6 and 57  shall  be used for memory
> keys.
>
> The patch does the following change to the 64K PTE format
>
> H_PAGE_BUSY moves from bit 3 to bit 7
> H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> 	of the pte.
> H_PAGE_F_GIX which  occupied bit 5, 6 and 7 also moves to the
> 	second part of the pte.
>
> The second part of the PTE will hold
>    a (H_PAGE_F_SECOND|H_PAGE_F_GIX)  for  64K page backed pte,
>    and sixteen (H_PAGE_F_SECOND|H_PAGE_F_GIX)  for 4k  backed
> 	pte.
>
> the four  bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
> is initialized to 0xF indicating a invalid slot. if a hashpage does
> get  allocated  to  the  0xF  slot, it is released and not used. In
> other words, even  though  0xF  is  a valid slot we discard it  and
> consider it as invalid slot(HPTE_SOFT_INVALID). This  gives  us  an
> opportunity to  not  depend on a bit in the primary PTE in order to
> determine the validity of a slot.
>
> When  we  release  a  0xF slot we also release a legitimate primary
> slot  and  unmap  that  entry. This  is  to  ensure  that we do get
> a legimate non-0xF slot the next time we retry for a slot.
>
> Though treating 0xF slot as invalid reduces the number of available
> slots and make have a effect on the performance, the probabilty
> of hitting a 0xF is extermely low.
>
> Compared  to the current scheme, the above described scheme reduces
> the number of false hash table updates  significantly  and  has the
> added  advantage  of  releasing  four  valuable  PTE bits for other
> purpose.
>
> This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> Ellermen and myself.
>
> 4K PTE format remain unchanged currently.
>

Can you also split this patch into two: one which changes
__hash_page_4k(), i.e. the Linux pte format w.r.t. the 4k hash pte, and
a second patch with the changes w.r.t. __hash_page_64k(), i.e. the pte
format w.r.t. the 64k hash pte.

-aneesh


* Re: [RFC PATCH 1/7 v1]powerpc: Free up four PTE bits to accommodate memory keys
  2017-06-13  2:02       ` Aneesh Kumar K.V
@ 2017-06-13 21:51         ` Ram Pai
  0 siblings, 0 replies; 23+ messages in thread
From: Ram Pai @ 2017-06-13 21:51 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linuxppc-dev, linux-kernel, benh, paulus, mpe, khandual,
	bsingharora, dave.hansen, hbabu

On Tue, Jun 13, 2017 at 07:32:24AM +0530, Aneesh Kumar K.V wrote:
> Ram Pai <linuxram@us.ibm.com> writes:
> 
> > On Mon, Jun 12, 2017 at 12:27:44PM +0530, Aneesh Kumar K.V wrote:
> >> Ram Pai <linuxram@us.ibm.com> writes:
> >> 
> >> > Rearrange  PTE   bits to  free  up  bits 3, 4, 5  and  6  for
> >> > memory keys. Bit 3, 4, 5, 6 and 57  shall  be used for memory
> >> > keys.
> >> >
> >> > The patch does the following change to the 64K PTE format
> >> >
> >> > H_PAGE_BUSY moves from bit 3 to bit 7
> >> > H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> >> > 	of the pte.
> >> > H_PAGE_F_GIX which  occupied bit 5, 6 and 7 also moves to the
> >> > 	second part of the pte.
> >> >
> >> > The second part of the PTE will hold
> >> >    a (H_PAGE_F_SECOND|H_PAGE_F_GIX)  for  64K page backed pte,
> >> >    and sixteen (H_PAGE_F_SECOND|H_PAGE_F_GIX)  for 4k  backed
> >> > 	pte.
> >> >
> >> > the four  bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
> >> > is initialized to 0xF indicating a invalid slot. if a hashpage does
> >> > get  allocated  to  the  0xF  slot, it is released and not used. In
> >> > other words, even  though  0xF  is  a valid slot we discard it  and
> >> > consider it as invalid slot(HPTE_SOFT_INVALID). This  gives  us  an
> >> > opportunity to  not  depend on a bit in the primary PTE in order to
> >> > determine the validity of a slot.
> >> 
> >> Do we need to do this for 64K hptes ? H_PAGE_HASHPTE indicates whether a
> >> slot is valid or not. For 4K hptes, we do need this right ? ie, only
> >> when H_PAGE_COMBO is set we need to consider 0xf as an invalid slot
> >
> > for 64k hptes; you are right, we do not use 0xF to
> > track the validity of a slot. We just depend on H_PAGE_HASHPTE flag.
> >
> > for 4k hptes, we need to depend on both H_PAGE_HASHPTE as well as the
> > value of the slot.  H_PAGE_HASHPTE tells if there exists any valid
> > 4k HPTEs, and the 4-bit values in the second-part-of-the-pte tells
> > us if they are valid values.
> >
> > However in either case we do not need H_PAGE_COMBO. That flag is not
> > used for ptes.  But we continue to use that flag for pmd to track
> > hugepages, which is why I have not entirely divorced H_PAGE_COMBO from
> > the 64K pagesize case.
> 
> 
> Really ? May be i am missing that in the patch. H_PAGE_COMBO indicate
> whether a 64K linux page is mapped via 4k hash page table enries. I
> don't see you changing that in __hash_page_4k()
> 
> we still continue to do.
> 
> 		new_pte = old_pte | H_PAGE_BUSY | _PAGE_ACCESSED | H_PAGE_COMBO;
> 
> 	/*
> 	 * Check if the pte was already inserted into the hash table
> 	 * as a 64k HW page, and invalidate the 64k HPTE if so.
> 	 */
> 	if (!(old_pte & H_PAGE_COMBO)) {
> 		flush_hash_page(vpn, rpte, MMU_PAGE_64K, ssize, flags);
> 

OK, my memory blanked. In this patch we continue to depend on the COMBO
flag to distinguish between ptes backed by 4k hptes and those backed by
64k hptes, so the H_PAGE_COMBO flag cannot go away in this patch for
now. Which means _PAGE_HPTEFLAGS must continue to include the
H_PAGE_COMBO flag too.

I had a patch to get rid of the COMBO bit as well, which was dropped.
Will regenerate a separate patch to remove the COMBO bit from
__hash_page_4k(); that will divorce the COMBO bit entirely from the 64K
pte.
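
Concretely, for this patch I think the 64K _PAGE_HPTEFLAGS has to keep
the COMBO bit, i.e. something like this (only a sketch of what I mean,
not the final hunk):

	/* PTE flags to conserve for HPTE identification */
	#define _PAGE_HPTEFLAGS	(H_PAGE_BUSY | H_PAGE_HASHPTE | H_PAGE_COMBO)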

> 
> 
> >
> >> 
> >> 
> >> >
> >> > When  we  release  a  0xF slot we also release a legitimate primary
> >> > slot  and  unmap  that  entry. This  is  to  ensure  that we do get
> >> > a legimate non-0xF slot the next time we retry for a slot.
> >> 
> >> Can you explain this more ? what is a primary slot here ?
> >
> > I may not be using the right terminology here. But when I say slot, i
> > mean the four bits that tell the position of the hpte in the hash
> > buckets. Bit 0, indicates if it the primary or secondary hash bucket.
> > and bit 1,2,3 indicates the entry in the hash bucket.
> >
> > So when i say primary slot, I mean a entry in the primary hash bucket.
> >
> > The idea is, when hpte_insert returns a hpte which is cached in the 7slot
> > of the secondary bucket i.e 0xF, we discard it, and also release a
> > random entry from the primary bucket, so that on retry we can get that
> > entry.
> >
> >
> >> 
> >> >
> >> > Though treating 0xF slot as invalid reduces the number of available
> >> > slots and make have a effect on the performance, the probabilty
> >> > of hitting a 0xF is extermely low.
> >> >
> >> > Compared  to the current scheme, the above described scheme reduces
> >> > the number of false hash table updates  significantly  and  has the
> >> > added  advantage  of  releasing  four  valuable  PTE bits for other
> >> > purpose.
> >> >
> >> > This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> >> > Ellermen and myself.
> >> >
> >> > 4K PTE format remain unchanged currently.
> >> >
> >> > Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> >> > ---
> >> >  arch/powerpc/include/asm/book3s/64/hash-4k.h  | 12 +++++
> >> >  arch/powerpc/include/asm/book3s/64/hash-64k.h | 38 ++++++--------
> >> >  arch/powerpc/include/asm/book3s/64/hash.h     |  8 ++-
> >> >  arch/powerpc/include/asm/book3s/64/mmu-hash.h |  5 ++
> >> >  arch/powerpc/mm/dump_linuxpagetables.c        |  3 +-
> >> >  arch/powerpc/mm/hash64_4k.c                   | 12 ++---
> >> >  arch/powerpc/mm/hash64_64k.c                  | 73 +++++++++------------------
> >> >  arch/powerpc/mm/hash_utils_64.c               | 38 +++++++++++++-
> >> >  arch/powerpc/mm/hugetlbpage-hash64.c          | 16 +++---
> >> >  9 files changed, 107 insertions(+), 98 deletions(-)
> >> >
> >> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> >> > index b4b5e6b..223d318 100644
> >> > --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> >> > +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> >> > @@ -16,6 +16,18 @@
> >> >  #define H_PUD_TABLE_SIZE	(sizeof(pud_t) << H_PUD_INDEX_SIZE)
> >> >  #define H_PGD_TABLE_SIZE	(sizeof(pgd_t) << H_PGD_INDEX_SIZE)
> >> >
> >> > +
> >> > +/*
> >> > + * Only supported by 4k linux page size
> >> 
> >> that comment is confusing, you can drop that, it is part of hash-4k.h
> >> which means these defines are local to 4k linux page size config.
> >> 
> >> > + */
> >
> > right. ok.
> >
> >> > +#define H_PAGE_F_SECOND        _RPAGE_RSV2     /* HPTE is in 2ndary HPTEG */
> >> > +#define H_PAGE_F_GIX           (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> >> > +#define H_PAGE_F_GIX_SHIFT     56
> >> > +
> >> > +#define H_PAGE_BUSY	_RPAGE_RSV1     /* software: PTE & hash are busy */
> >> > +#define H_PAGE_HASHPTE	_RPAGE_RPN43    /* PTE has associated HPTE */
> >> > +
> >> > +
> >> >  /* PTE flags to conserve for HPTE identification */
> >> >  #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
> >> >  			 H_PAGE_F_SECOND | H_PAGE_F_GIX)
> >> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> >> > index 9732837..87ee22d 100644
> >> > --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
> >> > +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> >> > @@ -10,23 +10,21 @@
> >> >   * 64k aligned address free up few of the lower bits of RPN for us
> >> >   * We steal that here. For more deatils look at pte_pfn/pfn_pte()
> >> >   */
> >> > -#define H_PAGE_COMBO	_RPAGE_RPN0 /* this is a combo 4k page */
> >> > -#define H_PAGE_4K_PFN	_RPAGE_RPN1 /* PFN is for a single 4k page */
> >> > +#define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
> >> > +#define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
> >> > +
> >> > +#define H_PAGE_BUSY	_RPAGE_RPN44     /* software: PTE & hash are busy */
> >> > +#define H_PAGE_HASHPTE	_RPAGE_RPN43    /* PTE has associated HPTE */
> >> > +
> >> >  /*
> >> >   * We need to differentiate between explicit huge page and THP huge
> >> >   * page, since THP huge page also need to track real subpage details
> >> >   */
> >> >  #define H_PAGE_THP_HUGE  H_PAGE_4K_PFN
> >> >
> >> > -/*
> >> > - * Used to track subpage group valid if H_PAGE_COMBO is set
> >> > - * This overloads H_PAGE_F_GIX and H_PAGE_F_SECOND
> >> > - */
> >> > -#define H_PAGE_COMBO_VALID	(H_PAGE_F_GIX | H_PAGE_F_SECOND)
> >> > -
> >> >  /* PTE flags to conserve for HPTE identification */
> >> > -#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \
> >> > -			 H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO)
> >> > +#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE)
> >> > +
> >> 
> >> You drop H_PAGE_COMBO here. Is that intentional ?
> >
> > Yes. that is intentional. See a downside?  We dont depend on
> > H_PAGE_COMBO for 64K pagesize PTE anymore.
> >
> >> 
> >> We have code which does
> >> 
> >> 		if (mmu_hash_ops.hpte_updatepp(slot, rflags, vpn, MMU_PAGE_64K,
> >> 					       MMU_PAGE_64K, ssize,
> >> 					       flags) == -1)
> >> 			old_pte &= ~_PAGE_HPTEFLAGS;
> >> 	
> >> 
> >> I guess they expect slot details to be cleared. With the above
> >> _PAGE_HPTEFLAGS that is not true.
> >
> > since we dont depend on COMBO flag, it should be fine. Let me know if
> > you see a downside.
> >
> >> 
> >> 
> >> >  /*
> >> >   * we support 16 fragments per PTE page of 64K size.
> >> >   */
> >> > @@ -54,24 +52,18 @@ static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
> >> >  	unsigned long *hidxp;
> >> >
> >> >  	rpte.pte = pte;
> >> > -	rpte.hidx = 0;
> >> > -	if (pte_val(pte) & H_PAGE_COMBO) {
> >> > -		/*
> >> > -		 * Make sure we order the hidx load against the H_PAGE_COMBO
> >> > -		 * check. The store side ordering is done in __hash_page_4K
> >> > -		 */
> >> > -		smp_rmb();
> >> > -		hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> >> > -		rpte.hidx = *hidxp;
> >> > -	}
> >> > +	/*
> >> > +	 * The store side ordering is done in __hash_page_4K
> >> > +	 */
> >> > +	smp_rmb();
> >> > +	hidxp = (unsigned long *)(ptep + PTRS_PER_PTE);
> >> > +	rpte.hidx = *hidxp;
> >> >  	return rpte;
> >> >  }
> >> >
> >> >  static inline unsigned long __rpte_to_hidx(real_pte_t rpte, unsigned long index)
> >> >  {
> >> > -	if ((pte_val(rpte.pte) & H_PAGE_COMBO))
> >> > -		return (rpte.hidx >> (index<<2)) & 0xf;
> >> > -	return (pte_val(rpte.pte) >> H_PAGE_F_GIX_SHIFT) & 0xf;
> >> > +	return ((rpte.hidx >> (index<<2)) & 0xfUL);
> >> >  }
> >> >
> >> >  #define __rpte_to_pte(r)	((r).pte)
> >> > diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
> >> > index 4e957b0..7742782 100644
> >> > --- a/arch/powerpc/include/asm/book3s/64/hash.h
> >> > +++ b/arch/powerpc/include/asm/book3s/64/hash.h
> >> > @@ -8,11 +8,9 @@
> >> >   *
> >> >   */
> >> >  #define H_PTE_NONE_MASK		_PAGE_HPTEFLAGS
> >> > -#define H_PAGE_F_GIX_SHIFT	56
> >> > -#define H_PAGE_BUSY		_RPAGE_RSV1 /* software: PTE & hash are busy */
> >> > -#define H_PAGE_F_SECOND		_RPAGE_RSV2	/* HPTE is in 2ndary HPTEG */
> >> > -#define H_PAGE_F_GIX		(_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
> >> > -#define H_PAGE_HASHPTE		_RPAGE_RPN43	/* PTE has associated HPTE */
> >> > +
> >> > +#define INIT_HIDX (~0x0UL)
> >> 
> >> But then you use it like
> >> 
> >> 	rpte.hidx = INIT_HIDX;
> >> 
> >> A better would be INIT_HIDX = 0xF ?. Also may be INIT is the wrong name
> >> there ?
> >
> > No. We are initializing all the entries in the second-part-of-the-pte to
> > 0xF. which essentially boils down to ~0x0UL.
> 
> that you unsigned long casting is what i was suggesting to remove. We
> may want to treat it as a 4 bits every where  ?
> 
> 
> >
> >> 
> >> > +#define HPTE_SOFT_INVALID(slot) ((slot & 0xfUL) == 0xfUL)
> >> 
> >> Can you do that as a static inline ?
> >
> > ok.
> >
> >> 
> >> >
> >> >  #ifdef CONFIG_PPC_64K_PAGES
> >> >  #include <asm/book3s/64/hash-64k.h>
> >> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> >> > index 6981a52..cfb8169 100644
> >> > --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> >> > +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> >> > @@ -435,6 +435,11 @@ extern int __hash_page_4K(unsigned long ea, unsigned long access,
> >> >  extern int __hash_page_64K(unsigned long ea, unsigned long access,
> >> >  			   unsigned long vsid, pte_t *ptep, unsigned long trap,
> >> >  			   unsigned long flags, int ssize);
> >> > +extern unsigned long set_hidx_slot(pte_t *ptep, real_pte_t rpte,
> >> > +			unsigned int subpg_index, unsigned long slot);
> >> > +extern unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
> >> > +			int ssize, real_pte_t rpte, unsigned int subpg_index);
> >> > +
> >> >  struct mm_struct;
> >> >  unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap);
> >> >  extern int hash_page_mm(struct mm_struct *mm, unsigned long ea,
> >> > diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
> >> > index 44fe483..b832ed3 100644
> >> > --- a/arch/powerpc/mm/dump_linuxpagetables.c
> >> > +++ b/arch/powerpc/mm/dump_linuxpagetables.c
> >> > @@ -213,7 +213,7 @@ struct flag_info {
> >> >  		.val	= H_PAGE_4K_PFN,
> >> >  		.set	= "4K_pfn",
> >> >  	}, {
> >> > -#endif
> >> > +#else
> >> >  		.mask	= H_PAGE_F_GIX,
> >> >  		.val	= H_PAGE_F_GIX,
> >> >  		.set	= "f_gix",
> >> > @@ -224,6 +224,7 @@ struct flag_info {
> >> >  		.val	= H_PAGE_F_SECOND,
> >> >  		.set	= "f_second",
> >> >  	}, {
> >> > +#endif /* CONFIG_PPC_64K_PAGES */
> >> >  #endif
> >> >  		.mask	= _PAGE_SPECIAL,
> >> >  		.val	= _PAGE_SPECIAL,
> >> > diff --git a/arch/powerpc/mm/hash64_4k.c b/arch/powerpc/mm/hash64_4k.c
> >> > index 6fa450c..5b16416 100644
> >> > --- a/arch/powerpc/mm/hash64_4k.c
> >> > +++ b/arch/powerpc/mm/hash64_4k.c
> >> > @@ -20,6 +20,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >> >  		   pte_t *ptep, unsigned long trap, unsigned long flags,
> >> >  		   int ssize, int subpg_prot)
> >> >  {
> >> > +	real_pte_t rpte;
> >> >  	unsigned long hpte_group;
> >> >  	unsigned long rflags, pa;
> >> >  	unsigned long old_pte, new_pte;
> >> > @@ -54,6 +55,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >> >  	 * need to add in 0x1 if it's a read-only user page
> >> >  	 */
> >> >  	rflags = htab_convert_pte_flags(new_pte);
> >> > +	rpte = __real_pte(__pte(old_pte), ptep);
> >> >
> >> >  	if (cpu_has_feature(CPU_FTR_NOEXECUTE) &&
> >> >  	    !cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
> >> > @@ -64,12 +66,7 @@ int __hash_page_4K(unsigned long ea, unsigned long access, unsigned long vsid,
> >> >  		/*
> >> >  		 * There MIGHT be an HPTE for this pte
> >> >  		 */
> >> > -		hash = hpt_hash(vpn, shift, ssize);
> >> > -		if (old_pte & H_PAGE_F_SECOND)
> >> > -			hash = ~hash;
> >> > -		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> >> > -		slot += (old_pte & H_PAGE_F_GIX) >> H_PAGE_F_GIX_SHIFT;
> >> > -
> >> > +		slot = get_hidx_slot(vpn, shift, ssize, rpte, 0);
> >> 
> >> 
> >> This confused me, the rest of the code assume slot as the 4 bit value.
> >> But get_hidx_slot is not exactly returning that.
> >
> > get_hidx_slot() is returning a 4bit value. no?
> 
> It is not
> unsigned long get_hidx_slot(unsigned long vpn, unsigned long shift,
> 			int ssize, real_pte_t rpte, unsigned int subpg_index)
> {
> 	unsigned long hash, slot, hidx;
> 
> 	hash = hpt_hash(vpn, shift, ssize);
> 	hidx = __rpte_to_hidx(rpte, subpg_index);
> 	if (hidx & _PTEIDX_SECONDARY)
> 		hash = ~hash;
> 	slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
> 	slot += hidx & _PTEIDX_GROUP_IX;
> 	return slot;
> }

Actually, that redundant piece of code has just been captured into a
function and called from multiple places; there are no logic changes,
only a rearrangement of code. I think the confusion comes from the
naming of the function. Will rename it to get_hidx_slot_offset();
better?
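
Roughly the split I have in mind, picking up your "gslot" naming
suggestion; only a sketch, same logic as the current helper:

	/* 4-bit value: secondary bit plus entry within the group ("gslot") */
	static inline unsigned long get_hidx_gslot(real_pte_t rpte,
						   unsigned int subpg_index)
	{
		return __rpte_to_hidx(rpte, subpg_index);
	}

	/* offset of the hpte from the htab base, derived from the gslot */
	unsigned long get_hidx_slot_offset(unsigned long vpn, unsigned long shift,
				int ssize, real_pte_t rpte, unsigned int subpg_index)
	{
		unsigned long hash, hidx;

		hash = hpt_hash(vpn, shift, ssize);
		hidx = __rpte_to_hidx(rpte, subpg_index);
		if (hidx & _PTEIDX_SECONDARY)
			hash = ~hash;
		return ((hash & htab_hash_mask) * HPTES_PER_GROUP) +
			(hidx & _PTEIDX_GROUP_IX);
	}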


Thanks for your comments, will be sending out a new v2 version soon.

RP
> 
> -aneesh


* Re: [RFC PATCH 1/7 v1]powerpc: Free up four PTE bits to accommodate memory keys
  2017-06-13  4:52   ` Aneesh Kumar K.V
@ 2017-06-13 21:52     ` Ram Pai
  0 siblings, 0 replies; 23+ messages in thread
From: Ram Pai @ 2017-06-13 21:52 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linuxppc-dev, linux-kernel, benh, paulus, mpe, khandual,
	bsingharora, dave.hansen, hbabu

On Tue, Jun 13, 2017 at 10:22:43AM +0530, Aneesh Kumar K.V wrote:
> Ram Pai <linuxram@us.ibm.com> writes:
> 
> > Rearrange  PTE   bits to  free  up  bits 3, 4, 5  and  6  for
> > memory keys. Bit 3, 4, 5, 6 and 57  shall  be used for memory
> > keys.
> >
> > The patch does the following change to the 64K PTE format
> >
> > H_PAGE_BUSY moves from bit 3 to bit 7
> > H_PAGE_F_SECOND which occupied bit 4 moves to the second part
> > 	of the pte.
> > H_PAGE_F_GIX which  occupied bit 5, 6 and 7 also moves to the
> > 	second part of the pte.
> >
> > The second part of the PTE will hold
> >    a (H_PAGE_F_SECOND|H_PAGE_F_GIX)  for  64K page backed pte,
> >    and sixteen (H_PAGE_F_SECOND|H_PAGE_F_GIX)  for 4k  backed
> > 	pte.
> >
> > the four  bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
> > is initialized to 0xF indicating a invalid slot. if a hashpage does
> > get  allocated  to  the  0xF  slot, it is released and not used. In
> > other words, even  though  0xF  is  a valid slot we discard it  and
> > consider it as invalid slot(HPTE_SOFT_INVALID). This  gives  us  an
> > opportunity to  not  depend on a bit in the primary PTE in order to
> > determine the validity of a slot.
> >
> > When  we  release  a  0xF slot we also release a legitimate primary
> > slot  and  unmap  that  entry. This  is  to  ensure  that we do get
> > a legimate non-0xF slot the next time we retry for a slot.
> >
> > Though treating 0xF slot as invalid reduces the number of available
> > slots and make have a effect on the performance, the probabilty
> > of hitting a 0xF is extermely low.
> >
> > Compared  to the current scheme, the above described scheme reduces
> > the number of false hash table updates  significantly  and  has the
> > added  advantage  of  releasing  four  valuable  PTE bits for other
> > purpose.
> >
> > This idea was jointly developed by Paul Mackerras, Aneesh, Michael
> > Ellermen and myself.
> >
> > 4K PTE format remain unchanged currently.
> >
> 
> Can you also split this patch into two. One which changes
> __hash_page_4k() ie, linux pte format w.r.t 4k hash pte. Second patch
> with changes w.r.t __hash_page_64k() ie, pte format w.r.t 64k hash pte.

OK. A v2 of the patch series will be out in a day or two.
RP


* Re: [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation.
  2017-06-06  1:05 ` [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation Ram Pai
@ 2017-06-16  9:20   ` Anshuman Khandual
  2017-06-16 10:33     ` Benjamin Herrenschmidt
  2017-06-16 19:10     ` Ram Pai
  2017-06-16 11:18   ` Michael Ellerman
  1 sibling, 2 replies; 23+ messages in thread
From: Anshuman Khandual @ 2017-06-16  9:20 UTC (permalink / raw)
  To: Ram Pai, linuxppc-dev, linux-kernel; +Cc: dave.hansen, paulus, aneesh.kumar

On 06/06/2017 06:35 AM, Ram Pai wrote:
> The value of the AMR register at the time of the exception
> is made available in gp_regs[PT_AMR] of the siginfo.

But it's already available in uctxt->uc_mcontext.regs->amr while inside
the signal delivery context in user space. The pt_regs has already been
updated with the new AMR value. Then why do we need gp_regs to contain
AMR as well?


* Re: [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation.
  2017-06-16  9:20   ` Anshuman Khandual
@ 2017-06-16 10:33     ` Benjamin Herrenschmidt
  2017-06-16 19:15       ` Ram Pai
  2017-06-16 19:10     ` Ram Pai
  1 sibling, 1 reply; 23+ messages in thread
From: Benjamin Herrenschmidt @ 2017-06-16 10:33 UTC (permalink / raw)
  To: Anshuman Khandual, Ram Pai, linuxppc-dev, linux-kernel
  Cc: dave.hansen, paulus, aneesh.kumar

On Fri, 2017-06-16 at 14:50 +0530, Anshuman Khandual wrote:
> On 06/06/2017 06:35 AM, Ram Pai wrote:
> > The value of the AMR register at the time of the exception
> > is made available in gp_regs[PT_AMR] of the siginfo.
> 
> But its already available there in uctxt->uc_mcontext.regs->amr
> while inside the signal delivery context in the user space. The
> pt_regs already got updated with new AMR register. Then why we
> need gp_regs to also contain AMR as well ?

Also changing gp_regs layout/size is a major ABI issue...

Ben.


* Re: [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation.
  2017-06-06  1:05 ` [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation Ram Pai
  2017-06-16  9:20   ` Anshuman Khandual
@ 2017-06-16 11:18   ` Michael Ellerman
  2017-06-16 19:35     ` Ram Pai
  1 sibling, 1 reply; 23+ messages in thread
From: Michael Ellerman @ 2017-06-16 11:18 UTC (permalink / raw)
  To: Ram Pai, linuxppc-dev, linux-kernel
  Cc: benh, paulus, khandual, aneesh.kumar, bsingharora, dave.hansen,
	hbabu, linuxram

Ram Pai <linuxram@us.ibm.com> writes:
> diff --git a/arch/powerpc/include/uapi/asm/ptrace.h b/arch/powerpc/include/uapi/asm/ptrace.h
> index 8036b38..109d0c2 100644
> --- a/arch/powerpc/include/uapi/asm/ptrace.h
> +++ b/arch/powerpc/include/uapi/asm/ptrace.h
> @@ -49,6 +49,8 @@ struct pt_regs {
>  	unsigned long dar;		/* Fault registers */
>  	unsigned long dsisr;		/* on 4xx/Book-E used for ESR */
>  	unsigned long result;		/* Result of a system call */
> +	unsigned long dscr;		/* contents of the DSCR register */
> +	unsigned long amr;		/* contents of AMR register */
>  };

You can't change pt_regs, it's ABI.

> @@ -109,7 +111,8 @@ struct pt_regs {
>  #define PT_DSISR 42
>  #define PT_RESULT 43
>  #define PT_DSCR 44
> -#define PT_REGS_COUNT 44
> +#define PT_AMR 45
> +#define PT_REGS_COUNT 45

You can add PT_AMR, but it has to be synthetic like DSCR, ie. not
actually in pt_regs but available via ptrace.

But do we want to do that? How does the x86 code export the key(s) of a
process? Or doesn't it?

cheers


* Re: [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation.
  2017-06-16  9:20   ` Anshuman Khandual
  2017-06-16 10:33     ` Benjamin Herrenschmidt
@ 2017-06-16 19:10     ` Ram Pai
  1 sibling, 0 replies; 23+ messages in thread
From: Ram Pai @ 2017-06-16 19:10 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linuxppc-dev, linux-kernel, dave.hansen, paulus, aneesh.kumar

On Fri, Jun 16, 2017 at 02:50:13PM +0530, Anshuman Khandual wrote:
> On 06/06/2017 06:35 AM, Ram Pai wrote:
> > The value of the AMR register at the time of the exception
> > is made available in gp_regs[PT_AMR] of the siginfo.
> 
> But its already available there in uctxt->uc_mcontext.regs->amr
> while inside the signal delivery context in the user space. The
> pt_regs already got updated with new AMR register. Then why we
> need gp_regs to also contain AMR as well ?

It should not be available in uctxt->uc_mcontext.regs->amr.
In fact that field itself should not be there.

The idea was to use one of the unused fields in gp_regs, without
extending gp_regs, to provide the contents of AMR. The PT_AMR offset in
gp_regs is currently not used, which is what I am using in this patch.

However, this patch needs to be modified so that it does not extend
pt_regs or uctxt->uc_mcontext.regs.

Thanks for raising this concern.
RP

-- 
Ram Pai


* Re: [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation.
  2017-06-16 10:33     ` Benjamin Herrenschmidt
@ 2017-06-16 19:15       ` Ram Pai
  2017-06-16 22:54         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 23+ messages in thread
From: Ram Pai @ 2017-06-16 19:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Anshuman Khandual, linuxppc-dev, linux-kernel, dave.hansen,
	paulus, aneesh.kumar

On Fri, Jun 16, 2017 at 08:33:01PM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2017-06-16 at 14:50 +0530, Anshuman Khandual wrote:
> > On 06/06/2017 06:35 AM, Ram Pai wrote:
> > > The value of the AMR register at the time of the exception
> > > is made available in gp_regs[PT_AMR] of the siginfo.
> > 
> > But its already available there in uctxt->uc_mcontext.regs->amr
> > while inside the signal delivery context in the user space. The
> > pt_regs already got updated with new AMR register. Then why we
> > need gp_regs to also contain AMR as well ?
> 
> Also changing gp_regs layout/size is a major ABI issue...

Ben,
	
The gp_regs size is not changed, nor is the layout. An unused field in
gp_regs is used to hold the AMR contents. Old binaries will not know
about this unused field, and hence should not break.

New binaries can leverage this already existing, but newly defined,
field to read the contents of AMR.

Is it still a concern?
RP

> 
> Ben.

-- 
Ram Pai


* Re: [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation.
  2017-06-16 11:18   ` Michael Ellerman
@ 2017-06-16 19:35     ` Ram Pai
  0 siblings, 0 replies; 23+ messages in thread
From: Ram Pai @ 2017-06-16 19:35 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: linuxppc-dev, linux-kernel, benh, paulus, khandual, aneesh.kumar,
	bsingharora, dave.hansen, hbabu

On Fri, Jun 16, 2017 at 09:18:29PM +1000, Michael Ellerman wrote:
> Ram Pai <linuxram@us.ibm.com> writes:
> > diff --git a/arch/powerpc/include/uapi/asm/ptrace.h b/arch/powerpc/include/uapi/asm/ptrace.h
> > index 8036b38..109d0c2 100644
> > --- a/arch/powerpc/include/uapi/asm/ptrace.h
> > +++ b/arch/powerpc/include/uapi/asm/ptrace.h
> > @@ -49,6 +49,8 @@ struct pt_regs {
> >  	unsigned long dar;		/* Fault registers */
> >  	unsigned long dsisr;		/* on 4xx/Book-E used for ESR */
> >  	unsigned long result;		/* Result of a system call */
> > +	unsigned long dscr;		/* contents of the DSCR register */
> > +	unsigned long amr;		/* contents of AMR register */
> >  };
> 
> You can't change pt_regs, it's ABI.
> 
> > @@ -109,7 +111,8 @@ struct pt_regs {
> >  #define PT_DSISR 42
> >  #define PT_RESULT 43
> >  #define PT_DSCR 44
> > -#define PT_REGS_COUNT 44
> > +#define PT_AMR 45
> > +#define PT_REGS_COUNT 45
> 
> You can add PT_AMR, but it has to be synthetic like DSCR, ie. not
> actually in pt_regs but available via ptrace.

ok.

> 
> But do we want to do that? How does the x86 code export the key(s) of a
> process? Or doesn't it?

The semantics defined on x86 are:

    The signal handler has to have a way of knowing the contents of
    PKRU (the x86 equivalent of AMR). The signal handler also has to
    have the ability to modify PKRU before it returns; the modified
    value will be used by the kernel to program the CPU's PKRU
    register.

    If the signal handler does not have that ability, then when it
    returns and the user code resumes executing where it left off, it
    will access the same protected address and fault again, which will
    again invoke the signal handler, and this will continue
    indefinitely.

We have to provide the same semantics on powerpc. The way I intend to
do it is to use one of the unused fields in gp_regs and fill it with
the contents of the AMR register. PT_AMR, at offset 45 in gp_regs, is
not used currently; offsets 45, 46, and 47 are available AFAICT.
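
For illustration, with the interface as currently proposed here (PT_AMR
being the gp_regs slot discussed above, and amr_bits_for_key() a
hypothetical helper for the per-key AMR bits), the intended usage from a
signal handler would be a sketch along these lines, not a settled ABI:

	#include <signal.h>
	#include <ucontext.h>

	#define PT_AMR	45	/* proposed gp_regs slot, per this patch */

	/* hypothetical helper: the two AMR bits that control 'pkey' */
	extern unsigned long amr_bits_for_key(int pkey);

	static void segv_handler(int sig, siginfo_t *info, void *ctx)
	{
		ucontext_t *uctxt = ctx;
		int pkey = 1;	/* the key that faulted, known to the app */

		/* AMR value at the time of the fault */
		unsigned long amr = uctxt->uc_mcontext.gp_regs[PT_AMR];

		/* grant access for that key ... */
		amr &= ~amr_bits_for_key(pkey);

		/* ... and hand it back; the kernel restores it on sigreturn */
		uctxt->uc_mcontext.gp_regs[PT_AMR] = amr;
	}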


Dave: Why is it not OK to reprogram PKRU from the signal handler
directly, instead of telling the kernel to do so on its behalf? Or have
I got my understanding of the semantics wrong?


> 
> cheers

-- 
Ram Pai


* Re: [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation.
  2017-06-16 19:15       ` Ram Pai
@ 2017-06-16 22:54         ` Benjamin Herrenschmidt
  2017-06-22 21:41           ` Ram Pai
  0 siblings, 1 reply; 23+ messages in thread
From: Benjamin Herrenschmidt @ 2017-06-16 22:54 UTC (permalink / raw)
  To: Ram Pai
  Cc: Anshuman Khandual, linuxppc-dev, linux-kernel, dave.hansen,
	paulus, aneesh.kumar

On Fri, 2017-06-16 at 12:15 -0700, Ram Pai wrote:
> gp_regs size is not changed, nor is the layout. An unused field in
> gp_regs is used to hold the AMR contents. Old binaries do not know
> about this previously unused field, and hence should not break.
> 
> New binaries can leverage this already existing, newly defined
> field to read the contents of the AMR.
> 
> Is it still a concern?

Calls to sys_swapcontext with a made-up context will end up with a crap
AMR if done by code that didn't know about that register.

Ben.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/7 v1] powerpc: Memory Protection Keys
  2017-06-06  1:05 [RFC PATCH 0/7 v1] powerpc: Memory Protection Keys Ram Pai
                   ` (6 preceding siblings ...)
  2017-06-06  1:05 ` [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation Ram Pai
@ 2017-06-20  7:07 ` Pavel Machek
  7 siblings, 0 replies; 23+ messages in thread
From: Pavel Machek @ 2017-06-20  7:07 UTC (permalink / raw)
  To: Ram Pai
  Cc: linuxppc-dev, linux-kernel, benh, paulus, mpe, khandual,
	aneesh.kumar, bsingharora, dave.hansen, hbabu

[-- Attachment #1: Type: text/plain, Size: 1093 bytes --]

Hi!

> Memory protection keys enable applications to protect its
> address space from inadvertent access or corruption from
> itself.
> 
> The overall idea:
> 
>  A process allocates a   key  and associates it with
>  a  address  range  within    its   address   space.
>  The process  than  can  dynamically  set read/write 
>  permissions on  the   key   without  involving  the 
>  kernel. Any  code that  violates   the  permissions
>  off the address space; as defined by its associated
>  key, will receive a segmentation fault.

Do you have some documentation on how userspace should use this? Will
it be possible to hide the details in libc so that it works across
architectures? Do you have some kind of library that hides them?

Where would you like it to be used? Web browsers?

How does it interact with ptrace()? With /dev/mem? With /proc/XXX/mem?
Will it enable malware to become very hard to understand?

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation.
  2017-06-16 22:54         ` Benjamin Herrenschmidt
@ 2017-06-22 21:41           ` Ram Pai
  0 siblings, 0 replies; 23+ messages in thread
From: Ram Pai @ 2017-06-22 21:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Anshuman Khandual, linuxppc-dev, linux-kernel, dave.hansen,
	paulus, aneesh.kumar

On Sat, Jun 17, 2017 at 08:54:44AM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2017-06-16 at 12:15 -0700, Ram Pai wrote:
> > gp_regs size is not changed, nor is the layout. An unused field in
> > gp_regs is used to hold the AMR contents. Old binaries do not know
> > about this previously unused field, and hence should not break.
> > 
> > New binaries can leverage this already existing, newly defined
> > field to read the contents of the AMR.
> > 
> > Is it still a concern?
> 
> Calls to sys_swapcontext with a made-up context will end up with a crap
> AMR if done by code that didn't know about that register.

It turns out x86 does not have this problem, because x86 does not
implement sys_swapcontext.  However, unlike x86, powerpc lets the
signal handler program the AMR (the x86 PKRU equivalent), and that
change can persist even after the signal handler returns to the
kernel through sys_sigreturn.

So I am inclined to deviate from the x86 protection-key semantics.

On x86 the persistent way for the signal handler to change the key
register (PKRU) is through a field in the siginfo structure.

On powerpc the persistent way for the signal handler to change the
key register (AMR) will be to program the AMR register directly.

This should resolve your concern on powerpc: a crap AMR value in a
made-up context cannot change the real AMR register, because the
powerpc kernel will not let that happen.  Acceptable?
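
For reference, "program the AMR register directly" from userspace
would amount to something like the helpers below; the AMR is the
problem-state SPR 13, so plain inline assembly is enough.  Treat this
as an illustrative sketch, not as part of the patch:

    /* illustrative only: user-mode read/write of the AMR (SPR 13) */
    static inline unsigned long read_amr(void)
    {
            unsigned long amr;
            asm volatile("mfspr %0, 13" : "=r" (amr));
            return amr;
    }

    static inline void write_amr(unsigned long amr)
    {
            asm volatile("mtspr 13, %0" : : "r" (amr));
    }

    /* e.g. a signal handler could clear the deny bits for a key and the
     * new value would stick, since under this proposal the kernel never
     * overwrites the AMR from a user-supplied context on sigreturn */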

RP

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-06-22 21:41 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-06  1:05 [RFC PATCH 0/7 v1] powerpc: Memory Protection Keys Ram Pai
2017-06-06  1:05 ` [RFC PATCH 1/7 v1]powerpc: Free up four PTE bits to accommodate memory keys Ram Pai
2017-06-12  6:57   ` Aneesh Kumar K.V
2017-06-12 22:20     ` Ram Pai
2017-06-13  2:02       ` Aneesh Kumar K.V
2017-06-13 21:51         ` Ram Pai
2017-06-13  4:52   ` Aneesh Kumar K.V
2017-06-13 21:52     ` Ram Pai
2017-06-06  1:05 ` [RFC PATCH 2/7 v1]powerpc: Implement sys_pkey_alloc and sys_pkey_free system call Ram Pai
2017-06-06  1:05 ` [RFC PATCH 3/7 v1]powerpc: store and restore the key state across context switches Ram Pai
2017-06-06  1:05 ` [RFC PATCH 4/7 v1]powerpc: Implementation for sys_mprotect_pkey() system call Ram Pai
2017-06-06  1:05 ` [RFC PATCH 5/7 v1]powerpc: Program HPTE key protection bits Ram Pai
2017-06-06  1:05 ` [RFC PATCH 6/7 v1]powerpc: Handle exceptions caused by violation of key protection Ram Pai
2017-06-06  1:05 ` [RFC PATCH 7/7 v1]powerpc: Deliver SEGV signal on protection key violation Ram Pai
2017-06-16  9:20   ` Anshuman Khandual
2017-06-16 10:33     ` Benjamin Herrenschmidt
2017-06-16 19:15       ` Ram Pai
2017-06-16 22:54         ` Benjamin Herrenschmidt
2017-06-22 21:41           ` Ram Pai
2017-06-16 19:10     ` Ram Pai
2017-06-16 11:18   ` Michael Ellerman
2017-06-16 19:35     ` Ram Pai
2017-06-20  7:07 ` [RFC PATCH 0/7 v1] powerpc: Memory Protection Keys Pavel Machek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).