* [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement
@ 2017-11-06 22:29 Janosch Frank
  2017-11-06 22:29 ` [RFC/PATCH 01/22] s390/mm: make gmap_protect_range more modular Janosch Frank
                   ` (21 more replies)
  0 siblings, 22 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

s390 has supported 1M pages since the z10 and hugetlbfs support was
added shortly after, but KVM has always used standard 4k pages for
guest backings.

This patchset adds full support for 1M huge page backings for s390
KVM guests. That includes VSIE (nested VMs) for these guests, so we
are able to run all combinations of backings for all layers of
guests.

When running a VSIE guest in a huge page backed guest, we need to
split some huge pages to be able to set granular protection. This way
we avoid a prot/unprot cycle if prefixes and VSIE pages containing
level 3 gmap DAT tables share the same segment, as the prefix has to
be accessible at all times and the VSIE page has to be write
protected.

TODO:
* Cleanups & Documentation
* Refactoring to get rid of a lot of indents
* Storage key support for split pages
* Regression testing
* Testing large setups
* Testing multi level VSIE

Accomplished testing:
l2: KVM guest
l3: nested KVM guest

* 1m l2 guests
* VSIE (l3) 4k and 1m guests on 1m l2
* 1m l2 -> l2 migration with 4k/1m l3 guests
* l3 -> l2 migration
* postcopy works only every second try; this seems to be a QEMU or
  setup issue


The initial prototype was started by Dominik Dingel. I had the
pleasure of adding the VSIE part, the protection transfers and the
optimizations. A huge thanks to Christian and Martin, who reviewed
and helped with debugging and design.


Dominik Dingel (2):
  s390/mm: hugetlb pages within a gmap can not be freed
  s390/mm: clear huge page storage keys on enable_skey

Janosch Frank (20):
  s390/mm: make gmap_protect_range more modular
  s390/mm: Abstract gmap notify bit setting
  s390/mm: add gmap PMD invalidation notification
  s390/mm: Add gmap pmd invalidation and clearing
  s390/mm: Introduce gmap_pmdp_xchg
  RFC: s390/mm: Transfer guest pmd protection to host
  s390/mm: Add huge page dirty sync support
  s390/mm: Add huge pmd storage key handling
  s390/mm: Remove superfluous parameter
  s390/mm: Add gmap_protect_large read protection support
  s390/mm: Make gmap_read_table EDAT1 compatible
  s390/mm: Make protect_rmap EDAT1 compatible
  s390/mm: GMAP read table extensions
  s390/mm: Add shadow segment code
  s390/mm: Add VSIE reverse fake case
  s390/mm: Remove gmap_pte_op_walk
  s390/mm: Split huge pages if granular protection is needed
  s390/mm: Enable gmap huge pmd support
  KVM: s390: Add KVM HPAGE capability
  RFC: s390/mm: Add gmap lock classes

 arch/s390/include/asm/gmap.h    |   39 +-
 arch/s390/include/asm/pgtable.h |   12 +-
 arch/s390/kvm/gaccess.c         |   64 +-
 arch/s390/kvm/kvm-s390.c        |   18 +-
 arch/s390/mm/fault.c            |   10 +-
 arch/s390/mm/gmap.c             | 1296 +++++++++++++++++++++++++++++++++++----
 arch/s390/mm/pageattr.c         |    6 +-
 arch/s390/mm/pgtable.c          |  143 ++++-
 include/uapi/linux/kvm.h        |    1 +
 9 files changed, 1421 insertions(+), 168 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC/PATCH 01/22] s390/mm: make gmap_protect_range more modular
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-11-08 10:40   ` David Hildenbrand
  2017-11-06 22:29 ` [RFC/PATCH 02/22] s390/mm: Abstract gmap notify bit setting Janosch Frank
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

This patch reworks the gmap_protect_range logic and extracts the pte
handling into its own function. We now also walk down to the pmd and
make it accessible in that function for later use. This way huge page
handling logic can be added more easily.
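
The resulting loop in gmap_protect_range then roughly looks like this
(simplified sketch, error handling and the fixup path omitted):

	while (len) {
		pmdp = gmap_pmd_op_walk(gmap, gaddr);	/* takes guest_table_lock */
		if (pmdp) {
			rc = gmap_protect_pte(gmap, gaddr, pmdp, prot, bits);
			if (!rc) {
				len -= PAGE_SIZE;
				gaddr += PAGE_SIZE;
			}
			gmap_pmd_op_end(gmap, pmdp);	/* drops the lock if still held */
		}
		/* on failure: translate gaddr and run the fixup, then retry */
	}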

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/mm/gmap.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 95 insertions(+), 9 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 2f66290..9757242 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -876,6 +876,91 @@ static void gmap_pte_op_end(spinlock_t *ptl)
 	spin_unlock(ptl);
 }
 
+/**
+ * gmap_pmd_op_walk - walk the gmap tables, get the guest table lock
+ *		      and return the pmd pointer
+ * @gmap: pointer to guest mapping meta data structure
+ * @gaddr: virtual address in the guest address space
+ *
+ * Returns a pointer to the pmd for a guest address, or NULL
+ */
+static inline pmd_t *gmap_pmd_op_walk(struct gmap *gmap, unsigned long gaddr)
+{
+	pmd_t *pmdp;
+
+	spin_lock(&gmap->guest_table_lock);
+	pmdp = (pmd_t *) gmap_table_walk(gmap, gaddr, 1);
+
+	/*
+	 * Empty pmds can become large after we give up the
+	 * guest_table_lock, so we have to check for pmd_none
+	 * here.
+	 */
+	if (!pmdp || pmd_none(*pmdp)) {
+		spin_unlock(&gmap->guest_table_lock);
+		return NULL;
+	}
+	/*
+	 * For plain 4k guests that do not run under the vsie it
+	 * suffices to take the pte lock later on. Thus we can unlock
+	 * the guest_table_lock here.
+	 */
+	if (!pmd_large(*pmdp) && !gmap_is_shadow(gmap))
+		spin_unlock(&gmap->guest_table_lock);
+	return pmdp;
+}
+
+/**
+ * gmap_pmd_op_end - release the guest_table_lock if needed
+ * @gmap: pointer to the guest mapping meta data structure
+ * @pmdp: pointer to the pmd
+ */
+static inline void gmap_pmd_op_end(struct gmap *gmap, pmd_t *pmdp)
+{
+	if (pmd_large(*pmdp) || gmap_is_shadow(gmap))
+		spin_unlock(&gmap->guest_table_lock);
+}
+
+/*
+ * gmap_protect_pte - remove access rights to memory and set pgste bits
+ * @gmap: pointer to guest mapping meta data structure
+ * @gaddr: virtual address in the guest address space
+ * @pmdp: pointer to the pmd associated with the pte
+ * @prot: indicates access rights: PROT_NONE, PROT_READ or PROT_WRITE
+ * @bits: pgste notification bits to set
+ *
+ * Returns 0 if successfully protected, -ENOMEM if out of memory and
+ * -EAGAIN if a fixup is needed.
+ *
+ * Expected to be called with sg->mm->mmap_sem in read and
+ * guest_table_lock held for shadow gmaps.
+ */
+static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
+			    pmd_t *pmdp, int prot, unsigned long bits)
+{
+	int rc;
+	pte_t *ptep;
+	spinlock_t *ptl = NULL;
+
+	/* We have no upper segment, let's go back and fix this up. */
+	if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
+		return -EAGAIN;
+
+	if (gmap_is_shadow(gmap)) {
+		ptep = pte_offset_map(pmdp, gaddr);
+	} else {
+		ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr, &ptl);
+		if (!ptep)
+			return -ENOMEM;
+	}
+
+	/* Protect and unlock. */
+	rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, bits);
+	if (ptl)
+		gmap_pte_op_end(ptl);
+	return rc;
+}
+
 /*
  * gmap_protect_range - remove access rights to memory and set pgste bits
  * @gmap: pointer to guest mapping meta data structure
@@ -895,16 +980,20 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 			      unsigned long len, int prot, unsigned long bits)
 {
 	unsigned long vmaddr;
-	spinlock_t *ptl;
-	pte_t *ptep;
+	pmd_t *pmdp;
 	int rc;
 
 	while (len) {
 		rc = -EAGAIN;
-		ptep = gmap_pte_op_walk(gmap, gaddr, &ptl);
-		if (ptep) {
-			rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, bits);
-			gmap_pte_op_end(ptl);
+		pmdp = gmap_pmd_op_walk(gmap, gaddr);
+		if (pmdp) {
+			rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
+					      bits);
+			if (!rc) {
+				len -= PAGE_SIZE;
+				gaddr += PAGE_SIZE;
+			}
+			gmap_pmd_op_end(gmap, pmdp);
 		}
 		if (rc) {
 			vmaddr = __gmap_translate(gmap, gaddr);
@@ -913,10 +1002,7 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 			rc = gmap_pte_op_fixup(gmap, gaddr, vmaddr, prot);
 			if (rc)
 				return rc;
-			continue;
 		}
-		gaddr += PAGE_SIZE;
-		len -= PAGE_SIZE;
 	}
 	return 0;
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 02/22] s390/mm: Abstract gmap notify bit setting
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
  2017-11-06 22:29 ` [RFC/PATCH 01/22] s390/mm: make gmap_protect_range more modular Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-11-10 12:57   ` David Hildenbrand
  2017-11-06 22:29 ` [RFC/PATCH 03/22] s390/mm: add gmap PMD invalidation notification Janosch Frank
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

Currently we use the software PGSTE bits PGSTE_IN_BIT and PGSTE_VSIE_BIT
to notify before an invalidation occurs on a prefix page or a VSIE page
respectively. Both bits only work for a PGSTE, which only exists for
page tables.

For huge page support we also need such bits for segments (pmds), so
let's introduce abstract GMAP_ENTRY_* bits that are translated into
the respective bits when gmap DAT table entries are protected.
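
For ptes the abstract bits are simply translated right before the
protection is applied. A minimal sketch of that mapping, as done in
gmap_protect_pte below:

	unsigned long pbits = 0;

	pbits |= (bits & GMAP_ENTRY_IN) ? PGSTE_IN_BIT : 0;
	pbits |= (bits & GMAP_ENTRY_VSIE) ? PGSTE_VSIE_BIT : 0;
	rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, pbits);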

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
---
 arch/s390/include/asm/gmap.h |  4 ++++
 arch/s390/mm/gmap.c          | 13 ++++++++-----
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 741ddba..f3d84a8 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -8,6 +8,10 @@
 #ifndef _ASM_S390_GMAP_H
 #define _ASM_S390_GMAP_H
 
+/* Generic bits for GMAP notification on DAT table entry changes. */
+#define GMAP_ENTRY_VSIE	0x2
+#define GMAP_ENTRY_IN	0x1
+
 /**
  * struct gmap_struct - guest address space
  * @list: list head for the mm->context gmap list
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 9757242..74e2062 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -927,7 +927,7 @@ static inline void gmap_pmd_op_end(struct gmap *gmap, pmd_t *pmdp)
  * @gaddr: virtual address in the guest address space
  * @pmdp: pointer to the pmd associated with the pte
  * @prot: indicates access rights: PROT_NONE, PROT_READ or PROT_WRITE
- * @bits: pgste notification bits to set
+ * @bits: notification bits to set
  *
  * Returns 0 if successfully protected, -ENOMEM if out of memory and
  * -EAGAIN if a fixup is needed.
@@ -941,6 +941,7 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
 	int rc;
 	pte_t *ptep;
 	spinlock_t *ptl = NULL;
+	unsigned long pbits = 0;
 
 	/* We have no upper segment, let's go back and fix this up. */
 	if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
@@ -954,8 +955,10 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
 			return -ENOMEM;
 	}
 
+	pbits |= (bits & GMAP_ENTRY_IN) ? PGSTE_IN_BIT : 0;
+	pbits |= (bits & GMAP_ENTRY_VSIE) ? PGSTE_VSIE_BIT : 0;
 	/* Protect and unlock. */
-	rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, bits);
+	rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, pbits);
 	if (ptl)
 		gmap_pte_op_end(ptl);
 	return rc;
@@ -1031,7 +1034,7 @@ int gmap_mprotect_notify(struct gmap *gmap, unsigned long gaddr,
 	if (!MACHINE_HAS_ESOP && prot == PROT_READ)
 		return -EINVAL;
 	down_read(&gmap->mm->mmap_sem);
-	rc = gmap_protect_range(gmap, gaddr, len, prot, PGSTE_IN_BIT);
+	rc = gmap_protect_range(gmap, gaddr, len, prot, GMAP_ENTRY_IN);
 	up_read(&gmap->mm->mmap_sem);
 	return rc;
 }
@@ -1153,7 +1156,7 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
 		if (ptep) {
 			spin_lock(&sg->guest_table_lock);
 			rc = ptep_force_prot(parent->mm, paddr, ptep, prot,
-					     PGSTE_VSIE_BIT);
+					     GMAP_ENTRY_VSIE);
 			if (!rc)
 				gmap_insert_rmap(sg, vmaddr, rmap);
 			spin_unlock(&sg->guest_table_lock);
@@ -1622,7 +1625,7 @@ struct gmap *gmap_shadow(struct gmap *parent, unsigned long asce,
 	down_read(&parent->mm->mmap_sem);
 	rc = gmap_protect_range(parent, asce & _ASCE_ORIGIN,
 				((asce & _ASCE_TABLE_LENGTH) + 1) * PAGE_SIZE,
-				PROT_READ, PGSTE_VSIE_BIT);
+				PROT_READ, GMAP_ENTRY_VSIE);
 	up_read(&parent->mm->mmap_sem);
 	spin_lock(&parent->shadow_lock);
 	new->initialized = true;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 03/22] s390/mm: add gmap PMD invalidation notification
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
  2017-11-06 22:29 ` [RFC/PATCH 01/22] s390/mm: make gmap_protect_range more modular Janosch Frank
  2017-11-06 22:29 ` [RFC/PATCH 02/22] s390/mm: Abstract gmap notify bit setting Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-11-15  9:55   ` David Hildenbrand
  2017-11-06 22:29 ` [RFC/PATCH 04/22] s390/mm: Add gmap pmd invalidation and clearing Janosch Frank
                   ` (18 subsequent siblings)
  21 siblings, 1 reply; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

For the later migration of huge pages we want to write-protect guest
PMDs. While doing this, we have to make absolutely sure that the
guest's lowcore is always accessible when the VCPU is running. With
PTEs, this is solved by marking the PGSTEs of the lowcore pages with
the invalidation notification bit and kicking the guest out of the
SIE via a notifier function if we need to invalidate such a page.

With PMDs we do not have PGSTEs or any other software bits we could
use in the host PMD. Instead we pick one of the free bits in the gmap
PMD. Every time a host pmd is invalidated, we check whether the
respective gmap PMD has the bit set and, if so, fire the notifier.

As a first step we only support setting the invalidation bit;
restricting access to guest pmds will follow shortly.
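
The notification path itself is small; roughly (simplified from
pmdp_notify below, per gmap of the mm and with guest_table_lock held):

	table = radix_tree_lookup(&gmap->host_to_guest, vmaddr >> PMD_SHIFT);
	if (table && (*table & _SEGMENT_ENTRY_GMAP_IN)) {
		gaddr = __gmap_segment_gaddr(table);
		*table &= ~_SEGMENT_ENTRY_GMAP_IN;
		gmap_call_notifier(gmap, gaddr, gaddr + HPAGE_SIZE - 1);
	}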

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/include/asm/gmap.h    |  3 ++
 arch/s390/include/asm/pgtable.h |  1 +
 arch/s390/mm/gmap.c             | 95 ++++++++++++++++++++++++++++++++++++-----
 arch/s390/mm/pgtable.c          |  4 ++
 4 files changed, 93 insertions(+), 10 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index f3d84a8..99cf6d8 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -12,6 +12,9 @@
 #define GMAP_ENTRY_VSIE	0x2
 #define GMAP_ENTRY_IN	0x1
 
+/* Status bits in the gmap segment entry. */
+#define _SEGMENT_ENTRY_GMAP_IN		0x0001	/* invalidation notify bit */
+
 /**
  * struct gmap_struct - guest address space
  * @list: list head for the mm->context gmap list
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 20e75a2..4707647 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1092,6 +1092,7 @@ void ptep_set_pte_at(struct mm_struct *mm, unsigned long addr,
 void ptep_set_notify(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
 void ptep_notify(struct mm_struct *mm, unsigned long addr,
 		 pte_t *ptep, unsigned long bits);
+void pmdp_notify(struct mm_struct *mm, unsigned long addr);
 int ptep_force_prot(struct mm_struct *mm, unsigned long gaddr,
 		    pte_t *ptep, int prot, unsigned long bit);
 void ptep_zap_unused(struct mm_struct *mm, unsigned long addr,
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 74e2062..3961589 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -595,10 +595,17 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 	if (*table == _SEGMENT_ENTRY_EMPTY) {
 		rc = radix_tree_insert(&gmap->host_to_guest,
 				       vmaddr >> PMD_SHIFT, table);
-		if (!rc)
-			*table = pmd_val(*pmd);
-	} else
-		rc = 0;
+		if (!rc) {
+			if (pmd_large(*pmd)) {
+				*table = pmd_val(*pmd) &
+					(_SEGMENT_ENTRY_ORIGIN_LARGE
+					 | _SEGMENT_ENTRY_INVALID
+					 | _SEGMENT_ENTRY_LARGE
+					 | _SEGMENT_ENTRY_PROTECT);
+			} else
+				*table = pmd_val(*pmd) & ~0x03UL;
+		}
+	}
 	spin_unlock(&gmap->guest_table_lock);
 	spin_unlock(ptl);
 	radix_tree_preload_end();
@@ -965,6 +972,35 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
 }
 
 /*
+ * gmap_protect_large - set pmd notification bits
+ * @pmdp: pointer to the pmd to be protected
+ * @prot: indicates access rights: PROT_NONE, PROT_READ or PROT_WRITE
+ * @bits: notification bits to set
+ *
+ * Returns 0 if successfully protected, -ENOMEM if out of memory and
+ * -EAGAIN if a fixup is needed.
+ *
+ * Expected to be called with sg->mm->mmap_sem in read and
+ * guest_table_lock held.
+ */
+static int gmap_protect_large(struct gmap *gmap, unsigned long gaddr,
+			      pmd_t *pmdp, int prot, unsigned long bits)
+{
+	int pmd_i, pmd_p;
+
+	pmd_i = pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID;
+	pmd_p = pmd_val(*pmdp) & _SEGMENT_ENTRY_PROTECT;
+
+	/* Fixup needed */
+	if ((pmd_i && (prot != PROT_NONE)) || (pmd_p && (prot & PROT_WRITE)))
+		return -EAGAIN;
+
+	if (bits & GMAP_ENTRY_IN)
+		pmd_val(*pmdp) |=  _SEGMENT_ENTRY_GMAP_IN;
+	return 0;
+}
+
+/*
  * gmap_protect_range - remove access rights to memory and set pgste bits
  * @gmap: pointer to guest mapping meta data structure
  * @gaddr: virtual address in the guest address space
@@ -977,7 +1013,7 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
  *
  * Called with sg->mm->mmap_sem in read.
  *
- * Note: Can also be called for shadow gmaps.
+ * Note: Can also be called for shadow gmaps, but only with 4k pages.
  */
 static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 			      unsigned long len, int prot, unsigned long bits)
@@ -990,11 +1026,20 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 		rc = -EAGAIN;
 		pmdp = gmap_pmd_op_walk(gmap, gaddr);
 		if (pmdp) {
-			rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
-					      bits);
-			if (!rc) {
-				len -= PAGE_SIZE;
-				gaddr += PAGE_SIZE;
+			if (!pmd_large(*pmdp)) {
+				rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
+						      bits);
+				if (!rc) {
+					len -= PAGE_SIZE;
+					gaddr += PAGE_SIZE;
+				}
+			} else {
+				rc =  gmap_protect_large(gmap, gaddr, pmdp,
+							 prot, bits);
+				if (!rc) {
+					len = len < HPAGE_SIZE ? 0 : len - HPAGE_SIZE;
+					gaddr = (gaddr & HPAGE_MASK) + HPAGE_SIZE;
+				}
 			}
 			gmap_pmd_op_end(gmap, pmdp);
 		}
@@ -2191,6 +2236,36 @@ void ptep_notify(struct mm_struct *mm, unsigned long vmaddr,
 }
 EXPORT_SYMBOL_GPL(ptep_notify);
 
+/**
+ * pmdp_notify - call all invalidation callbacks for a specific pmd
+ * @mm: pointer to the process mm_struct
+ * @vmaddr: virtual address in the process address space
+ *
+ * This function is expected to be called with mmap_sem held in read.
+ */
+void pmdp_notify(struct mm_struct *mm, unsigned long vmaddr)
+{
+	unsigned long *table, gaddr;
+	struct gmap *gmap;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
+		spin_lock(&gmap->guest_table_lock);
+		table = radix_tree_lookup(&gmap->host_to_guest,
+					  vmaddr >> PMD_SHIFT);
+		if (!table || !(*table & _SEGMENT_ENTRY_GMAP_IN)) {
+			spin_unlock(&gmap->guest_table_lock);
+			continue;
+		}
+		gaddr = __gmap_segment_gaddr(table);
+		*table &= ~_SEGMENT_ENTRY_GMAP_IN;
+		spin_unlock(&gmap->guest_table_lock);
+		gmap_call_notifier(gmap, gaddr, gaddr + HPAGE_SIZE - 1);
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(pmdp_notify);
+
 static inline void thp_split_mm(struct mm_struct *mm)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index ae677f8..79d35d0 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -404,6 +404,8 @@ pmd_t pmdp_xchg_direct(struct mm_struct *mm, unsigned long addr,
 	pmd_t old;
 
 	preempt_disable();
+	if (mm_has_pgste(mm))
+		pmdp_notify(mm, addr);
 	old = pmdp_flush_direct(mm, addr, pmdp);
 	*pmdp = new;
 	preempt_enable();
@@ -417,6 +419,8 @@ pmd_t pmdp_xchg_lazy(struct mm_struct *mm, unsigned long addr,
 	pmd_t old;
 
 	preempt_disable();
+	if (mm_has_pgste(mm))
+		pmdp_notify(mm, addr);
 	old = pmdp_flush_lazy(mm, addr, pmdp);
 	*pmdp = new;
 	preempt_enable();
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 04/22] s390/mm: Add gmap pmd invalidation and clearing
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (2 preceding siblings ...)
  2017-11-06 22:29 ` [RFC/PATCH 03/22] s390/mm: add gmap PMD invalidation notification Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-11-06 22:29 ` [RFC/PATCH 05/22] s390/mm: hugetlb pages within a gmap can not be freed Janosch Frank
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

If the host invalidates a pmd, we also have to invalidate the
corresponding gmap pmds, as well as flush them from the TLB. This is
necessary, as we don't share the pmd tables between host and guest as
we do with ptes.

The clearing part of these three new functions sets a guest pmd entry
to _SEGMENT_ENTRY_EMPTY, so the guest will fault on it and we will
re-link it.

Flushing the gmap is not necessary in the host's lazy local and csp
cases. Both purge the TLB completely.
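
Sketched, the clearing part boils down to the following (locking
omitted); the guest table entry is reset to empty, so the next guest
access faults and __gmap_link re-links it:

	pmdp = (pmd_t *)radix_tree_delete(&gmap->host_to_guest,
					  vmaddr >> PMD_SHIFT);
	if (pmdp) {
		if (purge)	/* csp variant only */
			__pmdp_csp(pmdp);
		pmd_val(*pmdp) = _SEGMENT_ENTRY_EMPTY;
	}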

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/include/asm/pgtable.h |   4 ++
 arch/s390/mm/gmap.c             | 114 ++++++++++++++++++++++++++++++++++++++++
 arch/s390/mm/pgtable.c          |  17 ++++--
 3 files changed, 132 insertions(+), 3 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 4707647..43c7b01 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1117,6 +1117,10 @@ int set_pgste_bits(struct mm_struct *mm, unsigned long addr,
 int get_pgste(struct mm_struct *mm, unsigned long hva, unsigned long *pgstep);
 int pgste_perform_essa(struct mm_struct *mm, unsigned long hva, int orc,
 			unsigned long *oldpte, unsigned long *oldpgste);
+void gmap_pmdp_csp(struct mm_struct *mm, unsigned long vmaddr);
+void gmap_pmdp_invalidate(struct mm_struct *mm, unsigned long vmaddr);
+void gmap_pmdp_idte_local(struct mm_struct *mm, unsigned long vmaddr);
+void gmap_pmdp_idte_global(struct mm_struct *mm, unsigned long vmaddr);
 
 /*
  * Certain architectures need to do special things when PTEs
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 3961589..25297e1 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2266,6 +2266,120 @@ void pmdp_notify(struct mm_struct *mm, unsigned long vmaddr)
 }
 EXPORT_SYMBOL_GPL(pmdp_notify);
 
+static void gmap_pmdp_clear(struct mm_struct *mm, unsigned long vmaddr,
+			    int purge)
+{
+	pmd_t *pmdp;
+	struct gmap *gmap;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
+		spin_lock(&gmap->guest_table_lock);
+		pmdp = (pmd_t *)radix_tree_delete(&gmap->host_to_guest,
+						   vmaddr >> PMD_SHIFT);
+		if (pmdp) {
+			if (purge)
+				__pmdp_csp(pmdp);
+			pmd_val(*pmdp) = _SEGMENT_ENTRY_EMPTY;
+		}
+		spin_unlock(&gmap->guest_table_lock);
+	}
+	rcu_read_unlock();
+}
+
+/**
+ * gmap_pmdp_invalidate - invalidate all affected guest pmd entries without
+ *                        flushing
+ * @mm: pointer to the process mm_struct
+ * @vmaddr: virtual address in the process address space
+ */
+void gmap_pmdp_invalidate(struct mm_struct *mm, unsigned long vmaddr)
+{
+	gmap_pmdp_clear(mm, vmaddr, 0);
+}
+EXPORT_SYMBOL_GPL(gmap_pmdp_invalidate);
+
+/**
+ * gmap_pmdp_csp - csp all affected guest pmd entries
+ * @mm: pointer to the process mm_struct
+ * @vmaddr: virtual address in the process address space
+ */
+void gmap_pmdp_csp(struct mm_struct *mm, unsigned long vmaddr)
+{
+	gmap_pmdp_clear(mm, vmaddr, 1);
+}
+EXPORT_SYMBOL_GPL(gmap_pmdp_csp);
+
+/**
+ * gmap_pmdp_idte_local - invalidate and clear a guest pmd entry
+ * @mm: pointer to the process mm_struct
+ * @vmaddr: virtual address in the process address space
+ */
+void gmap_pmdp_idte_local(struct mm_struct *mm, unsigned long vmaddr)
+{
+	unsigned long *entry, gaddr;
+	struct gmap *gmap;
+	pmd_t *pmdp;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
+		spin_lock(&gmap->guest_table_lock);
+		entry = radix_tree_delete(&gmap->host_to_guest,
+					  vmaddr >> PMD_SHIFT);
+		if (entry) {
+			pmdp = (pmd_t *)entry;
+			gaddr = __gmap_segment_gaddr(entry);
+			if (MACHINE_HAS_TLB_GUEST)
+				__pmdp_idte(gaddr, pmdp,
+					    IDTE_GUEST_ASCE,
+					    gmap->asce, IDTE_LOCAL);
+			else if (MACHINE_HAS_IDTE)
+				__pmdp_idte(gaddr, pmdp, 0, 0,
+					    IDTE_LOCAL);
+			*entry = _SEGMENT_ENTRY_EMPTY;
+		}
+		spin_unlock(&gmap->guest_table_lock);
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(gmap_pmdp_idte_local);
+
+/**
+ * gmap_pmdp_idte_global - invalidate and clear a guest pmd entry
+ * @mm: pointer to the process mm_struct
+ * @vmaddr: virtual address in the process address space
+ */
+void gmap_pmdp_idte_global(struct mm_struct *mm, unsigned long vmaddr)
+{
+	unsigned long *entry, gaddr;
+	struct gmap *gmap;
+	pmd_t *pmdp;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
+		spin_lock(&gmap->guest_table_lock);
+		entry = radix_tree_delete(&gmap->host_to_guest,
+					  vmaddr >> PMD_SHIFT);
+		if (entry) {
+			pmdp = (pmd_t *)entry;
+			gaddr = __gmap_segment_gaddr(entry);
+			if (MACHINE_HAS_TLB_GUEST)
+				__pmdp_idte(gaddr, pmdp,
+					    IDTE_GUEST_ASCE,
+					    gmap->asce, IDTE_GLOBAL);
+			else if (MACHINE_HAS_IDTE)
+				__pmdp_idte(gaddr, pmdp, 0, 0,
+					    IDTE_GLOBAL);
+			else
+				__pmdp_csp(pmdp);
+			*entry = _SEGMENT_ENTRY_EMPTY;
+		}
+		spin_unlock(&gmap->guest_table_lock);
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(gmap_pmdp_idte_global);
+
 static inline void thp_split_mm(struct mm_struct *mm)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 79d35d0..c98fba9 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -346,18 +346,27 @@ static inline void pmdp_idte_local(struct mm_struct *mm,
 			    mm->context.asce, IDTE_LOCAL);
 	else
 		__pmdp_idte(addr, pmdp, 0, 0, IDTE_LOCAL);
+	if (mm_has_pgste(mm))
+		gmap_pmdp_idte_local(mm, addr);
 }
 
 static inline void pmdp_idte_global(struct mm_struct *mm,
 				    unsigned long addr, pmd_t *pmdp)
 {
-	if (MACHINE_HAS_TLB_GUEST)
+	if (MACHINE_HAS_TLB_GUEST) {
 		__pmdp_idte(addr, pmdp, IDTE_NODAT | IDTE_GUEST_ASCE,
 			    mm->context.asce, IDTE_GLOBAL);
-	else if (MACHINE_HAS_IDTE)
+		if (mm_has_pgste(mm))
+			gmap_pmdp_idte_global(mm, addr);
+	} else if (MACHINE_HAS_IDTE) {
 		__pmdp_idte(addr, pmdp, 0, 0, IDTE_GLOBAL);
-	else
+		if (mm_has_pgste(mm))
+			gmap_pmdp_idte_global(mm, addr);
+	} else {
 		__pmdp_csp(pmdp);
+		if (mm_has_pgste(mm))
+			gmap_pmdp_csp(mm, addr);
+	}
 }
 
 static inline pmd_t pmdp_flush_direct(struct mm_struct *mm,
@@ -391,6 +400,8 @@ static inline pmd_t pmdp_flush_lazy(struct mm_struct *mm,
 			  cpumask_of(smp_processor_id()))) {
 		pmd_val(*pmdp) |= _SEGMENT_ENTRY_INVALID;
 		mm->context.flush_mm = 1;
+		if (mm_has_pgste(mm))
+			gmap_pmdp_invalidate(mm, addr);
 	} else {
 		pmdp_idte_global(mm, addr, pmdp);
 	}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 05/22] s390/mm: hugetlb pages within a gmap can not be freed
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (3 preceding siblings ...)
  2017-11-06 22:29 ` [RFC/PATCH 04/22] s390/mm: Add gmap pmd invalidation and clearing Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-11-06 22:29 ` [RFC/PATCH 06/22] s390/mm: Introduce gmap_pmdp_xchg Janosch Frank
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

From: Dominik Dingel <dingel@linux.vnet.ibm.com>

Guests backed by huge pages could theoretically free unused pages via
the diagnose 10 instruction. We currently don't allow that, so we
don't have to refault the memory once it is needed again.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/mm/gmap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 25297e1..2c0cc16 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -696,6 +696,9 @@ void gmap_discard(struct gmap *gmap, unsigned long from, unsigned long to)
 		vmaddr |= gaddr & ~PMD_MASK;
 		/* Find vma in the parent mm */
 		vma = find_vma(gmap->mm, vmaddr);
+		/* We do not discard pages that are backed by hugetlbfs */
+		if (vma && is_vm_hugetlb_page(vma))
+			continue;
 		size = min(to - gaddr, PMD_SIZE - (gaddr & ~PMD_MASK));
 		zap_page_range(vma, vmaddr, size);
 	}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 06/22] s390/mm: Introduce gmap_pmdp_xchg
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (4 preceding siblings ...)
  2017-11-06 22:29 ` [RFC/PATCH 05/22] s390/mm: hugetlb pages within a gmap can not be freed Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-11-06 22:29 ` [RFC/PATCH 07/22] RFC: s390/mm: Transfer guest pmd protection to host Janosch Frank
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

When changing guest pmds, we don't need to take care of the
corresponding host pmd. This means we don't need to flush the host
TLB entries and we don't need to notify all gmaps.

Let's introduce a function that exchanges a pmd and takes care of the
necessary flushing and notification.
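
A caller holding the guest_table_lock can then exchange a guest pmd
with something like this (illustrative only):

	pmd_t new = *pmdp;

	pmd_val(new) |= _SEGMENT_ENTRY_INVALID;
	gmap_pmdp_xchg(gmap, pmdp, new, gaddr);	/* notify, flush, install */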

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/mm/gmap.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 2c0cc16..d0d56f6 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -22,6 +22,9 @@
 
 #define GMAP_SHADOW_FAKE_TABLE 1ULL
 
+static void gmap_pmdp_xchg(struct gmap *gmap, pmd_t *old, pmd_t new,
+			   unsigned long gaddr);
+
 /**
  * gmap_alloc - allocate and initialize a guest address space
  * @mm: pointer to the parent mm_struct
@@ -2240,6 +2243,26 @@ void ptep_notify(struct mm_struct *mm, unsigned long vmaddr,
 EXPORT_SYMBOL_GPL(ptep_notify);
 
 /**
+ * pmdp_notify_gmap - call all invalidation callbacks for a specific pmd
+ * @gmap: pointer to the guest address space structure
+ * @gaddr: guest address which is affected
+ *
+ * This function is expected to be called with a locked
+ * guest_table_lock.
+ */
+static void pmdp_notify_gmap(struct gmap *gmap, unsigned long gaddr)
+{
+	unsigned long *table;
+
+	gaddr &= HPAGE_MASK;
+	table = gmap_table_walk(gmap, gaddr, 1);
+	if (!table || !(*table & _SEGMENT_ENTRY_GMAP_IN))
+		return;
+	*table &= ~_SEGMENT_ENTRY_GMAP_IN;
+	gmap_call_notifier(gmap, gaddr, gaddr + HPAGE_SIZE - 1);
+}
+
+/**
  * pmdp_notify - call all invalidation callbacks for a specific pmd
  * @mm: pointer to the process mm_struct
  * @vmaddr: virtual address in the process address space
@@ -2383,6 +2406,32 @@ void gmap_pmdp_idte_global(struct mm_struct *mm, unsigned long vmaddr)
 }
 EXPORT_SYMBOL_GPL(gmap_pmdp_idte_global);
 
+/**
+ * gmap_pmdp_xchg - exchange a gmap pmd with another and notify
+ * @gmap: pointer to the guest address space structure
+ * @pmdp: pointer to the pmd entry
+ * @new: replacement entry
+ * @gaddr: the affected guest address
+ *
+ * This function is assumed to be called with the guest_table_lock
+ * held.
+ */
+static void gmap_pmdp_xchg(struct gmap *gmap, pmd_t *pmdp, pmd_t new,
+			   unsigned long gaddr)
+{
+	pmdp_notify_gmap(gmap, gaddr);
+	if (MACHINE_HAS_TLB_GUEST)
+		__pmdp_idte(gaddr, (pmd_t *)pmdp,
+			    IDTE_GUEST_ASCE, gmap->asce,
+			    IDTE_GLOBAL);
+	else if (MACHINE_HAS_IDTE)
+		__pmdp_idte(gaddr, (pmd_t *)pmdp,
+			    0, 0, IDTE_GLOBAL);
+	else
+		__pmdp_csp(pmdp);
+	*pmdp = new;
+}
+
 static inline void thp_split_mm(struct mm_struct *mm)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 07/22] RFC: s390/mm: Transfer guest pmd protection to host
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (5 preceding siblings ...)
  2017-11-06 22:29 ` [RFC/PATCH 06/22] s390/mm: Introduce gmap_pmdp_xchg Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-11-06 22:29 ` [RFC/PATCH 08/22] s390/mm: Add huge page dirty sync support Janosch Frank
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

If we protect the guest pmd, e.g. for dirty tracking, we need to
transfer the protection to the host pmd which we copied when linking
to the guest.

If we don't, we might lose changes during migration, as changes on
the host side don't get tracked.
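
The transfer rule itself is small; roughly (variables as in
gmap_pmdp_transfer_prot below, *_i = invalid bit, *_p = protect bit):

	/* guest pmd is protected, but host pmd is not yet */
	if (gpmd_p && !hpmd_p)
		pmd_val(new) |= _SEGMENT_ENTRY_PROTECT;
	/* both entries are valid, keep the host pmd valid */
	if (!gpmd_i && !hpmd_i)
		pmd_val(new) &= ~_SEGMENT_ENTRY_INVALID;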

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/mm/gmap.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 106 insertions(+), 13 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index d0d56f6..3054b413 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -14,6 +14,7 @@
 #include <linux/swapops.h>
 #include <linux/ksm.h>
 #include <linux/mman.h>
+#include <linux/hugetlb.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -934,6 +935,88 @@ static inline void gmap_pmd_op_end(struct gmap *gmap, pmd_t *pmdp)
 		spin_unlock(&gmap->guest_table_lock);
 }
 
+/**
+ * gmap_pmdp_transfer_prot - transfer protection of guest pmd to host pmd
+ * @mm: the memory context
+ * @addr: the affected host virtual address
+ * @gpmdp: guest pmd ptr
+ * @hpmdp: host pmd ptr
+ *
+ * Transfers the protection from a guest pmd to the associated host
+ * pmd. This has to be done with a plain idte to circumvent the gmap
+ * invalidation hooks in the standard invalidation functions provided
+ * by pgtable.c.
+ */
+static void gmap_pmdp_transfer_prot(struct mm_struct *mm, unsigned long addr,
+				    pmd_t *gpmdp, pmd_t *hpmdp)
+{
+	int gpmd_i, gpmd_p, hpmd_i, hpmd_p;
+	pmd_t new = *hpmdp;
+
+	hpmd_i = pmd_val(*hpmdp) & _SEGMENT_ENTRY_INVALID;
+	hpmd_p = pmd_val(*hpmdp) & _SEGMENT_ENTRY_PROTECT;
+	gpmd_i = pmd_val(*gpmdp) & _SEGMENT_ENTRY_INVALID;
+	gpmd_p = pmd_val(*gpmdp) & _SEGMENT_ENTRY_PROTECT;
+
+	/* Fastpath, change not needed. */
+	if (hpmd_i || (hpmd_p && gpmd_p) || (!gpmd_i && !gpmd_p))
+		return;
+
+	if (gpmd_p && !hpmd_p)
+		pmd_val(new) |= _SEGMENT_ENTRY_PROTECT;
+	if (!gpmd_i && !hpmd_i)
+		pmd_val(new) &= ~_SEGMENT_ENTRY_INVALID;
+
+	if (MACHINE_HAS_TLB_GUEST)
+		__pmdp_idte(addr, hpmdp,
+			    IDTE_GUEST_ASCE,
+			    mm->context.asce, IDTE_GLOBAL);
+	else if (MACHINE_HAS_IDTE)
+		__pmdp_idte(addr, hpmdp, 0, 0,
+			    IDTE_GLOBAL);
+	else
+		__pmdp_csp(hpmdp);
+	*hpmdp = new;
+}
+
+/**
+ * gmap_pmdp_force_prot - change access rights of a locked pmd
+ * @gmap: pointer to guest mapping meta data structure
+ * @addr: virtual address in the guest address space
+ * @pmdp: pointer to the pmd entry
+ * @prot: indicates guest access rights: PROT_NONE, PROT_READ or PROT_WRITE
+ * @bits: software bit to set (e.g. for notification)
+ *
+ * Returns 0 if the access rights were changed and -EAGAIN if the current
+ * and requested access rights are incompatible.
+ */
+static int gmap_pmdp_force_prot(struct gmap *gmap, unsigned long addr,
+				pmd_t *pmdp, int prot, unsigned long bits)
+{
+	int pmd_i, pmd_p;
+	pmd_t new = *pmdp;
+
+	pmd_i = pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID;
+	pmd_p = pmd_val(*pmdp) & _SEGMENT_ENTRY_PROTECT;
+
+	/* Fixup needed */
+	if ((pmd_i && (prot != PROT_NONE)) || (pmd_p && (prot == PROT_WRITE)))
+		return -EAGAIN;
+
+	if (prot == PROT_NONE && !pmd_i) {
+		pmd_val(new) |= _SEGMENT_ENTRY_INVALID;
+		gmap_pmdp_xchg(gmap, pmdp, new, addr);
+	}
+
+	if (prot == PROT_READ && !pmd_p) {
+		pmd_val(new) &= ~_SEGMENT_ENTRY_INVALID;
+		pmd_val(new) |= _SEGMENT_ENTRY_PROTECT;
+		gmap_pmdp_xchg(gmap, pmdp, new, addr);
+	}
+	pmd_val(*pmdp) |=  bits;
+	return 0;
+}
+
 /*
  * gmap_protect_pte - remove access rights to memory and set pgste bits
  * @gmap: pointer to guest mapping meta data structure
@@ -990,20 +1073,23 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
  * guest_table_lock held.
  */
 static int gmap_protect_large(struct gmap *gmap, unsigned long gaddr,
-			      pmd_t *pmdp, int prot, unsigned long bits)
+			      unsigned long vmaddr, pmd_t *pmdp, pmd_t *hpmdp,
+			      int prot, unsigned long bits)
 {
-	int pmd_i, pmd_p;
+	unsigned long sbits = 0;
+	int ret = 0;
 
-	pmd_i = pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID;
-	pmd_p = pmd_val(*pmdp) & _SEGMENT_ENTRY_PROTECT;
+	sbits |= (bits & GMAP_ENTRY_IN) ? _SEGMENT_ENTRY_GMAP_IN : 0;
+	/* Protect gmap pmd */
+	ret = gmap_pmdp_force_prot(gmap, gaddr, pmdp, prot, sbits);
+	/*
+	 * Transfer protection back to the host pmd, so userspace never
+	 * has more access rights than the VM.
+	 */
+	if (!ret)
+		gmap_pmdp_transfer_prot(gmap->mm, vmaddr, pmdp, hpmdp);
 
-	/* Fixup needed */
-	if ((pmd_i && (prot != PROT_NONE)) || (pmd_p && (prot & PROT_WRITE)))
-		return -EAGAIN;
-
-	if (bits & GMAP_ENTRY_IN)
-		pmd_val(*pmdp) |=  _SEGMENT_ENTRY_GMAP_IN;
-	return 0;
+	return ret;
 }
 
 /*
@@ -1024,12 +1110,18 @@ static int gmap_protect_large(struct gmap *gmap, unsigned long gaddr,
 static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 			      unsigned long len, int prot, unsigned long bits)
 {
+	spinlock_t *ptl;
 	unsigned long vmaddr;
-	pmd_t *pmdp;
+	pmd_t *pmdp, *hpmdp;
 	int rc;
 
 	while (len) {
 		rc = -EAGAIN;
+		vmaddr = __gmap_translate(gmap, gaddr);
+		hpmdp = (pmd_t *)huge_pte_offset(gmap->mm, vmaddr, HPAGE_SIZE);
+		/* Do we need tests here? */
+		ptl = pmd_lock(gmap->mm, hpmdp);
+
 		pmdp = gmap_pmd_op_walk(gmap, gaddr);
 		if (pmdp) {
 			if (!pmd_large(*pmdp)) {
@@ -1040,7 +1132,7 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 					gaddr += PAGE_SIZE;
 				}
 			} else {
-				rc =  gmap_protect_large(gmap, gaddr, pmdp,
+				rc =  gmap_protect_large(gmap, gaddr, vmaddr, pmdp, hpmdp,
 							 prot, bits);
 				if (!rc) {
 					len = len < HPAGE_SIZE ? 0 : len - HPAGE_SIZE;
@@ -1049,6 +1141,7 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 			}
 			gmap_pmd_op_end(gmap, pmdp);
 		}
+		spin_unlock(ptl);
 		if (rc) {
 			vmaddr = __gmap_translate(gmap, gaddr);
 			if (IS_ERR_VALUE(vmaddr))
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 08/22] s390/mm: Add huge page dirty sync support
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (6 preceding siblings ...)
  2017-11-06 22:29 ` [RFC/PATCH 07/22] RFC: s390/mm: Transfer guest pmd protection to host Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-11-06 22:29 ` [RFC/PATCH 09/22] s390/mm: clear huge page storage keys on enable_skey Janosch Frank
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

To do dirty logging with huge pages, we protect huge pmds in the
gmap. When they are written to, we unprotect them and mark them dirty.

We introduce the function gmap_test_and_clear_dirty_segment, which
handles the dirty sync for huge pages.
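
With that, syncing one segment boils down to (simplified sketch of
gmap_sync_dirty_log_pmd below):

	if (pmd_large(*pmdp)) {
		/* one huge pmd covers 256 4k pages -> 256 dirty bits */
		if (gmap_test_and_clear_dirty_segment(gmap, pmdp, hpmdp,
						      gaddr, vmaddr))
			memset(bitmap, 0xFF, 32);
	} else {
		for (i = 0; i < _PAGE_ENTRIES; i++, vmaddr += PAGE_SIZE)
			if (test_and_clear_guest_dirty(gmap->mm, vmaddr))
				set_bit_le(i, bitmap);
	}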

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/include/asm/gmap.h |  6 +++-
 arch/s390/kvm/kvm-s390.c     | 18 ++++++----
 arch/s390/mm/gmap.c          | 85 +++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 96 insertions(+), 13 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 99cf6d8..16474c3 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -12,8 +12,10 @@
 #define GMAP_ENTRY_VSIE	0x2
 #define GMAP_ENTRY_IN	0x1
 
-/* Status bits in the gmap segment entry. */
+/* Status bits in huge and non-huge gmap segment entries. */
 #define _SEGMENT_ENTRY_GMAP_IN		0x0001	/* invalidation notify bit */
+/* Status bits only for huge segment entries */
+#define _SEGMENT_ENTRY_GMAP_UC		0x4000	/* user dirty (migration) */
 
 /**
  * struct gmap_struct - guest address space
@@ -138,4 +140,6 @@ void gmap_pte_notify(struct mm_struct *, unsigned long addr, pte_t *,
 int gmap_mprotect_notify(struct gmap *, unsigned long start,
 			 unsigned long len, int prot);
 
+void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long dirty_bitmap[4],
+			     unsigned long gaddr, unsigned long vmaddr);
 #endif /* _ASM_S390_GMAP_H */
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 40d0a1a..62d8bd1 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -430,19 +430,23 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 }
 
 static void kvm_s390_sync_dirty_log(struct kvm *kvm,
-					struct kvm_memory_slot *memslot)
+				    struct kvm_memory_slot *memslot)
 {
 	gfn_t cur_gfn, last_gfn;
-	unsigned long address;
+	unsigned long gaddr, vmaddr;
+	unsigned long *dirty = memslot->dirty_bitmap;
 	struct gmap *gmap = kvm->arch.gmap;
 
-	/* Loop over all guest pages */
+	/* Loop over all guest segments */
 	last_gfn = memslot->base_gfn + memslot->npages;
-	for (cur_gfn = memslot->base_gfn; cur_gfn <= last_gfn; cur_gfn++) {
-		address = gfn_to_hva_memslot(memslot, cur_gfn);
+	for (cur_gfn = memslot->base_gfn; cur_gfn <= last_gfn; cur_gfn += _PAGE_ENTRIES, dirty += 4) {
+		gaddr = gfn_to_gpa(cur_gfn);
+		vmaddr = gfn_to_hva_memslot(memslot, cur_gfn);
+		if (kvm_is_error_hva(vmaddr))
+			continue;
+
+		gmap_sync_dirty_log_pmd(gmap, dirty, gaddr, vmaddr);
 
-		if (test_and_clear_guest_dirty(gmap->mm, address))
-			mark_page_dirty(kvm, cur_gfn);
 		if (fatal_signal_pending(current))
 			return;
 		cond_resched();
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 3054b413..bebf83f 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -544,6 +544,7 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
+	pmd_t unprot;
 	int rc;
 
 	BUG_ON(gmap_is_shadow(gmap));
@@ -601,14 +602,20 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 				       vmaddr >> PMD_SHIFT, table);
 		if (!rc) {
 			if (pmd_large(*pmd)) {
-				*table = pmd_val(*pmd) &
-					(_SEGMENT_ENTRY_ORIGIN_LARGE
-					 | _SEGMENT_ENTRY_INVALID
-					 | _SEGMENT_ENTRY_LARGE
-					 | _SEGMENT_ENTRY_PROTECT);
+				*table = (pmd_val(*pmd) &
+					  (_SEGMENT_ENTRY_ORIGIN_LARGE
+					   | _SEGMENT_ENTRY_INVALID
+					   | _SEGMENT_ENTRY_LARGE
+					   | _SEGMENT_ENTRY_PROTECT))
+					| _SEGMENT_ENTRY_GMAP_UC;
 			} else
 				*table = pmd_val(*pmd) & ~0x03UL;
 		}
+	} else if (*table & _SEGMENT_ENTRY_PROTECT &&
+		   !(pmd_val(*pmd) & _SEGMENT_ENTRY_PROTECT)) {
+		unprot = __pmd((*table & ~_SEGMENT_ENTRY_PROTECT)
+			       | _SEGMENT_ENTRY_GMAP_UC);
+		gmap_pmdp_xchg(gmap, (pmd_t *)table, unprot, gaddr);
 	}
 	spin_unlock(&gmap->guest_table_lock);
 	spin_unlock(ptl);
@@ -2525,6 +2532,74 @@ static void gmap_pmdp_xchg(struct gmap *gmap, pmd_t *pmdp, pmd_t new,
 	*pmdp = new;
 }
 
+/**
+ * gmap_test_and_clear_dirty_segment - test and reset segment dirty status
+ * @gmap: pointer to guest address space
+ * @pmdp: pointer to the pmd to be tested
+ * @gaddr: virtual address in the guest address space
+ *
+ * This function is assumed to be called with the guest_table_lock
+ * held.
+ */
+bool gmap_test_and_clear_dirty_segment(struct gmap *gmap, pmd_t *pmdp,
+				       pmd_t *hpmdp, unsigned long gaddr,
+				       unsigned long vmaddr)
+{
+	if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
+		return false;
+
+	/* Already protected memory, which did not change is clean */
+	if (pmd_val(*pmdp) & _SEGMENT_ENTRY_PROTECT &&
+	    !(pmd_val(*pmdp) & _SEGMENT_ENTRY_GMAP_UC))
+		return false;
+
+	/* Clear UC indication and reset protection */
+	pmd_val(*pmdp) &= ~_SEGMENT_ENTRY_GMAP_UC;
+	gmap_protect_large(gmap, gaddr, vmaddr, pmdp, hpmdp, PROT_READ, 0);
+	return true;
+}
+
+/**
+ * gmap_sync_dirty_log_pmd - set bitmap based on dirty status of segment
+ * @gmap: pointer to guest address space
+ * @bitmap: dirty bitmap for this pmd
+ * @gaddr: virtual address in the guest address space
+ * @vmaddr: virtual address in the host address space
+ *
+ * This function is assumed to be called with the guest_table_lock
+ * held.
+ */
+void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long bitmap[4],
+			     unsigned long gaddr, unsigned long vmaddr)
+{
+	int i = 0;
+	pmd_t *pmdp, *hpmdp;
+	spinlock_t *ptl;
+
+	hpmdp = (pmd_t *)huge_pte_offset(gmap->mm, vmaddr, HPAGE_SIZE);
+	if (!hpmdp)
+		return;
+	ptl = pmd_lock(gmap->mm, hpmdp);
+	pmdp = gmap_pmd_op_walk(gmap, gaddr);
+	if (!pmdp) {
+		spin_unlock(ptl);
+		return;
+	}
+
+	if (pmd_large(*pmdp)) {
+		if (gmap_test_and_clear_dirty_segment(gmap, pmdp, hpmdp,
+						      gaddr, vmaddr))
+			memset(bitmap, 0xFF, 32);
+	} else {
+		for (; i < _PAGE_ENTRIES; i++, vmaddr += PAGE_SIZE) {
+			if (test_and_clear_guest_dirty(gmap->mm, vmaddr))
+				set_bit_le(i, bitmap);
+		}
+	}
+	gmap_pmd_op_end(gmap, pmdp);
+	spin_unlock(ptl);
+}
+
 static inline void thp_split_mm(struct mm_struct *mm)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 09/22] s390/mm: clear huge page storage keys on enable_skey
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (7 preceding siblings ...)
  2017-11-06 22:29 ` [RFC/PATCH 08/22] s390/mm: Add huge page dirty sync support Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-11-06 22:29 ` [RFC/PATCH 10/22] s390/mm: Add huge pmd storage key handling Janosch Frank
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

From: Dominik Dingel <dingel@linux.vnet.ibm.com>

When a guest starts using storage keys, we trap and set a default one
for its whole valid address space. With this patch we are now able to
do that for large pages as well.

To speed up the storage key insertion, we use
__storage_key_init_range, which in turn uses sske_frame to set
multiple storage keys with one instruction. As it was previously only
used for debugging, we have to get rid of the default key check and
make the operation quiescing.
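
For a valid, writable huge pmd the hugetlb walk then simply clears
the keys of the whole 1M frame, roughly:

	start = pmd_val(*pmd) & HPAGE_MASK;	/* segment frame origin */
	end = start + HPAGE_SIZE - 1;
	__storage_key_init_range(start, end);	/* uses sske_frame internally */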

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
[replaced page_set_storage_key loop with __storage_key_init_range]
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/mm/gmap.c     | 26 +++++++++++++++++++++++---
 arch/s390/mm/pageattr.c |  6 ++----
 2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index bebf83f..5f0e696 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2676,17 +2676,37 @@ EXPORT_SYMBOL_GPL(s390_enable_sie);
  * Enable storage key handling from now on and initialize the storage
  * keys with the default key.
  */
-static int __s390_enable_skey(pte_t *pte, unsigned long addr,
-			      unsigned long next, struct mm_walk *walk)
+static int __s390_enable_skey_pte(pte_t *pte, unsigned long addr,
+				  unsigned long next, struct mm_walk *walk)
 {
 	/* Clear storage key */
 	ptep_zap_key(walk->mm, addr, pte);
 	return 0;
 }
 
+static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
+				      unsigned long hmask, unsigned long next,
+				      struct mm_walk *walk)
+{
+	pmd_t *pmd = (pmd_t *)pte;
+	unsigned long start, end;
+
+	if (pmd_val(*pmd) & _SEGMENT_ENTRY_INVALID
+	    || !(pmd_val(*pmd) & _SEGMENT_ENTRY_WRITE))
+		return 0;
+
+	start = pmd_val(*pmd) & HPAGE_MASK;
+	end = start + HPAGE_SIZE - 1;
+	__storage_key_init_range(start, end);
+	return 0;
+}
+
 int s390_enable_skey(void)
 {
-	struct mm_walk walk = { .pte_entry = __s390_enable_skey };
+	struct mm_walk walk = {
+		.hugetlb_entry = __s390_enable_skey_hugetlb,
+		.pte_entry = __s390_enable_skey_pte,
+	};
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	int rc = 0;
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index 552f898..cf05681 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -13,7 +13,7 @@
 
 static inline unsigned long sske_frame(unsigned long addr, unsigned char skey)
 {
-	asm volatile(".insn rrf,0xb22b0000,%[skey],%[addr],9,0"
+	asm volatile(".insn rrf,0xb22b0000,%[skey],%[addr],1,0"
 		     : [addr] "+a" (addr) : [skey] "d" (skey));
 	return addr;
 }
@@ -22,8 +22,6 @@ void __storage_key_init_range(unsigned long start, unsigned long end)
 {
 	unsigned long boundary, size;
 
-	if (!PAGE_DEFAULT_KEY)
-		return;
 	while (start < end) {
 		if (MACHINE_HAS_EDAT1) {
 			/* set storage keys for a 1MB frame */
@@ -36,7 +34,7 @@ void __storage_key_init_range(unsigned long start, unsigned long end)
 				continue;
 			}
 		}
-		page_set_storage_key(start, PAGE_DEFAULT_KEY, 0);
+		page_set_storage_key(start, PAGE_DEFAULT_KEY, 1);
 		start += PAGE_SIZE;
 	}
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 10/22] s390/mm: Add huge pmd storage key handling
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (8 preceding siblings ...)
  2017-11-06 22:29 ` [RFC/PATCH 09/22] s390/mm: clear huge page storage keys on enable_skey Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-11-06 22:29 ` [RFC/PATCH 11/22] s390/mm: Remove superfluous parameter Janosch Frank
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

Storage keys for guests with huge page mappings have to be set
directly in hardware. There are no PGSTEs for PMDs that we could use
to retain the guest's logical view of the key.
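
For a large pmd the key is therefore written straight to the storage
key of the backing 1M frame; sketched from set_guest_storage_key
below (locking omitted):

	if (pmd_large(*pmd)) {
		address = pmd_val(*pmd) & HPAGE_MASK;
		address |= addr & ~HPAGE_MASK;
		/* quiescing sske, huge mappings are always present */
		page_set_storage_key(address, key, 1);
		return 0;
	}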

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/mm/pgtable.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 67 insertions(+), 4 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index c98fba9..5ce1232 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -765,12 +765,45 @@ EXPORT_SYMBOL_GPL(test_and_clear_guest_dirty);
 int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
 			  unsigned char key, bool nq)
 {
-	unsigned long keyul;
+	unsigned long keyul, address;
 	spinlock_t *ptl;
 	pgste_t old, new;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
 	pte_t *ptep;
 
-	ptep = get_locked_pte(mm, addr, &ptl);
+	pgd = pgd_offset(mm, addr);
+	p4d = p4d_alloc(mm, pgd, addr);
+	if (!p4d)
+		return -EFAULT;
+	pud = pud_alloc(mm, p4d, addr);
+	if (!pud)
+		return -EFAULT;
+	pmd = pmd_alloc(mm, pud, addr);
+	if (!pmd)
+		return -EFAULT;
+
+	ptl = pmd_lock(mm, pmd);
+	if (!pmd_present(*pmd)) {
+		spin_unlock(ptl);
+		return -EFAULT;
+	}
+	if (pmd_large(*pmd)) {
+		address = pmd_val(*pmd) & HPAGE_MASK;
+		address |= addr & ~HPAGE_MASK;
+		/*
+		 * Huge pmds need quiescing operations, they are
+		 * always mapped.
+		 */
+		page_set_storage_key(address, key, 1);
+		spin_unlock(ptl);
+		return 0;
+	}
+	spin_unlock(ptl);
+
+	ptep = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (unlikely(!ptep))
 		return -EFAULT;
 
@@ -781,7 +814,7 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
 	pgste_val(new) |= (keyul & (_PAGE_CHANGED | _PAGE_REFERENCED)) << 48;
 	pgste_val(new) |= (keyul & (_PAGE_ACC_BITS | _PAGE_FP_BIT)) << 56;
 	if (!(pte_val(*ptep) & _PAGE_INVALID)) {
-		unsigned long address, bits, skey;
+		unsigned long bits, skey;
 
 		address = pte_val(*ptep) & PAGE_MASK;
 		skey = (unsigned long) page_get_storage_key(address);
@@ -876,11 +909,41 @@ EXPORT_SYMBOL(reset_guest_reference_bit);
 int get_guest_storage_key(struct mm_struct *mm, unsigned long addr,
 			  unsigned char *key)
 {
+	unsigned long address;
 	spinlock_t *ptl;
 	pgste_t pgste;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
 	pte_t *ptep;
 
-	ptep = get_locked_pte(mm, addr, &ptl);
+	pgd = pgd_offset(mm, addr);
+	p4d = p4d_alloc(mm, pgd, addr);
+	if (!p4d)
+		return -EFAULT;
+	pud = pud_alloc(mm, p4d, addr);
+	if (!pud)
+		return -EFAULT;
+	pmd = pmd_alloc(mm, pud, addr);
+	if (!pmd)
+		return -EFAULT;
+
+	ptl = pmd_lock(mm, pmd);
+	if (!pmd_present(*pmd)) {
+		spin_unlock(ptl);
+		return -EFAULT;
+	}
+	if (pmd_large(*pmd)) {
+		address = pmd_val(*pmd) & HPAGE_MASK;
+		address |= addr & ~HPAGE_MASK;
+		*key = page_get_storage_key(address);
+		spin_unlock(ptl);
+		return 0;
+	}
+	spin_unlock(ptl);
+
+	ptep = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (unlikely(!ptep))
 		return -EFAULT;
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 11/22] s390/mm: Remove superfluous parameter
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (9 preceding siblings ...)
  2017-11-06 22:29 ` [RFC/PATCH 10/22] s390/mm: Add huge pmd storage key handling Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-11-06 22:29 ` [RFC/PATCH 12/22] s390/mm: Add gmap_protect_large read protection support Janosch Frank
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

It seems the parameter was not used even before the last cleanup and
was simply overlooked.

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/mm/gmap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 5f0e696..d23b750 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2247,7 +2247,7 @@ EXPORT_SYMBOL_GPL(gmap_shadow_page);
  * Called with sg->parent->shadow_lock.
  */
 static void gmap_shadow_notify(struct gmap *sg, unsigned long vmaddr,
-			       unsigned long gaddr, pte_t *pte)
+			       unsigned long gaddr)
 {
 	struct gmap_rmap *rmap, *rnext, *head;
 	unsigned long start, end, bits, raddr;
@@ -2332,7 +2332,7 @@ void ptep_notify(struct mm_struct *mm, unsigned long vmaddr,
 			spin_lock(&gmap->shadow_lock);
 			list_for_each_entry_safe(sg, next,
 						 &gmap->children, list)
-				gmap_shadow_notify(sg, vmaddr, gaddr, pte);
+				gmap_shadow_notify(sg, vmaddr, gaddr);
 			spin_unlock(&gmap->shadow_lock);
 		}
 		if (bits & PGSTE_IN_BIT)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 12/22] s390/mm: Add gmap_protect_large read protection support
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (10 preceding siblings ...)
  2017-11-06 22:29 ` [RFC/PATCH 11/22] s390/mm: Remove superfluous parameter Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-11-06 22:29 ` [RFC/PATCH 13/22] s390/mm: Make gmap_read_table EDAT1 compatible Janosch Frank
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

We need to be able to write-protect segments when shadowing them for
VSIE.

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/include/asm/gmap.h | 1 +
 arch/s390/mm/gmap.c          | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 16474c3..dfd50d7 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -16,6 +16,7 @@
 #define _SEGMENT_ENTRY_GMAP_IN		0x0001	/* invalidation notify bit */
 /* Status bits only for huge segment entries */
 #define _SEGMENT_ENTRY_GMAP_UC		0x4000	/* user dirty (migration) */
+#define _SEGMENT_ENTRY_GMAP_VSIE	0x8000	/* vsie bit */
 
 /**
  * struct gmap_struct - guest address space
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index d23b750..69a575b 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -1087,6 +1087,7 @@ static int gmap_protect_large(struct gmap *gmap, unsigned long gaddr,
 	int ret = 0;
 
 	sbits |= (bits & GMAP_ENTRY_IN) ? _SEGMENT_ENTRY_GMAP_IN : 0;
+	sbits |= (bits & GMAP_ENTRY_VSIE) ? _SEGMENT_ENTRY_GMAP_VSIE : 0;
 	/* Protect gmap pmd */
 	ret = gmap_pmdp_force_prot(gmap, gaddr, pmdp, prot, sbits);
 	/*
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 13/22] s390/mm: Make gmap_read_table EDAT1 compatible
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (11 preceding siblings ...)
  2017-11-06 22:29 ` [RFC/PATCH 12/22] s390/mm: Add gmap_protect_large read protection support Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-11-06 22:29 ` [RFC/PATCH 14/22] s390/mm: Make protect_rmap " Janosch Frank
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

For the VSIE we shadow the GMAP/DAT tables that a guest N builds for
its N + 1 guest. This means we read the GMAP from the memory of guest
N and use the retrieved information to build a shadow version, which
is then actually used to run the nested guest.

gmap_read_table is used to retrieve the data from the guest's address
space. Unfortunately it currently has no support for reading from huge
guests, so let's add that.

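For illustration, here is a minimal userspace sketch (not the kernel code)
of the address computation the large case adds next to the existing 4K
case. The trailing-underscore macros and the sample values are made up for
this example only:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT_  12
#define PAGE_SIZE_   (1UL << PAGE_SHIFT_)
#define PAGE_MASK_   (~(PAGE_SIZE_ - 1))
#define HPAGE_SHIFT_ 20
#define HPAGE_SIZE_  (1UL << HPAGE_SHIFT_)
#define HPAGE_MASK_  (~(HPAGE_SIZE_ - 1))

/* 4K mapping: frame origin from the pte, page offset from the guest address */
static uint64_t host_addr_4k(uint64_t pte_val, uint64_t gaddr)
{
	return (pte_val & PAGE_MASK_) + (gaddr & ~PAGE_MASK_);
}

/* 1M mapping: segment origin from the pmd, 1M offset from the guest address */
static uint64_t host_addr_1m(uint64_t pmd_val, uint64_t gaddr)
{
	return (pmd_val & HPAGE_MASK_) + (gaddr & ~HPAGE_MASK_);
}

int main(void)
{
	uint64_t gaddr = 0x00345678UL;		/* guest address to read from */

	printf("4k: %#llx\n", (unsigned long long)host_addr_4k(0x7f001000UL, gaddr));
	printf("1m: %#llx\n", (unsigned long long)host_addr_1m(0x7f100000UL, gaddr));
	return 0;
}
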
Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/mm/gmap.c | 44 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 33 insertions(+), 11 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 69a575b..c8296bc 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -1208,23 +1208,45 @@ int gmap_read_table(struct gmap *gmap, unsigned long gaddr, unsigned long *val)
 {
 	unsigned long address, vmaddr;
 	spinlock_t *ptl;
+	pmd_t *pmdp, pmd;
 	pte_t *ptep, pte;
 	int rc;
 
 	while (1) {
 		rc = -EAGAIN;
-		ptep = gmap_pte_op_walk(gmap, gaddr, &ptl);
-		if (ptep) {
-			pte = *ptep;
-			if (pte_present(pte) && (pte_val(pte) & _PAGE_READ)) {
-				address = pte_val(pte) & PAGE_MASK;
-				address += gaddr & ~PAGE_MASK;
-				*val = *(unsigned long *) address;
-				pte_val(*ptep) |= _PAGE_YOUNG;
-				/* Do *NOT* clear the _PAGE_INVALID bit! */
-				rc = 0;
+		pmdp = gmap_pmd_op_walk(gmap, gaddr);
+		if (pmdp) {
+			if (!pmd_large(*pmdp)) {
+				ptl = NULL;
+				ptep = NULL;
+				if (gmap_is_shadow(gmap))
+					ptep = pte_offset_map(pmdp, gaddr);
+				else
+					ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr, &ptl);
+
+				if (ptep) {
+					pte = *ptep;
+					if (pte_present(pte) && (pte_val(pte) & _PAGE_READ)) {
+						address = pte_val(pte) & PAGE_MASK;
+						address += gaddr & ~PAGE_MASK;
+						*val = *(unsigned long *) address;
+						pte_val(*ptep) |= _PAGE_YOUNG;
+						/* Do *NOT* clear the _PAGE_INVALID bit! */
+						rc = 0;
+					}
+					if (ptl)
+						gmap_pte_op_end(ptl);
+				}
+			} else {
+				pmd = *pmdp;
+				if (!(pmd_val(pmd) & _SEGMENT_ENTRY_INVALID)) {
+					address = pmd_val(pmd) & HPAGE_MASK;
+					address += gaddr & ~HPAGE_MASK;
+					*val = *(unsigned long *) address;
+					rc = 0;
+				}
 			}
-			gmap_pte_op_end(ptl);
+			gmap_pmd_op_end(gmap, pmdp);
 		}
 		if (!rc)
 			break;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 14/22] s390/mm: Make protect_rmap EDAT1 compatible
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (12 preceding siblings ...)
  2017-11-06 22:29 ` [RFC/PATCH 13/22] s390/mm: Make gmap_read_table EDAT1 compatible Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-11-06 22:29 ` [RFC/PATCH 15/22] s390/mm: GMAP read table extensions Janosch Frank
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

When shadowing, we must make sure that any changes to the GMAP inside
guest N will also be directly reflected in our shadow GMAP. This is
done by write-protecting guest N memory at the places where it stores
DAT tables for guest N + 1.

This still lacks EDAT1 support, so let's add it.

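As a rough illustration, the following standalone userspace sketch shows
how the protection loop now advances: 4K steps for ptes, a jump to the next
1M boundary for huge segments. is_huge() and all addresses are made-up
stand-ins for pmd_large() and real mappings:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE_  (1UL << 12)
#define HPAGE_SIZE_ (1UL << 20)
#define HPAGE_MASK_ (~(HPAGE_SIZE_ - 1))

/* pretend everything from 2M upwards is mapped by 1M segments */
static bool is_huge(uint64_t paddr)
{
	return paddr >= 0x00200000UL;
}

static void protect_range(uint64_t paddr, uint64_t len)
{
	while (len) {
		if (!is_huge(paddr)) {
			printf("protect 4K page at %#llx\n",
			       (unsigned long long)paddr);
			paddr += PAGE_SIZE_;
			len -= PAGE_SIZE_;
		} else {
			printf("protect 1M segment at %#llx\n",
			       (unsigned long long)(paddr & HPAGE_MASK_));
			len = len < HPAGE_SIZE_ ? 0 : len - HPAGE_SIZE_;
			paddr = (paddr & HPAGE_MASK_) + HPAGE_SIZE_;
		}
	}
}

int main(void)
{
	/* one trailing 4K page, then two 1M segments */
	protect_range(0x001ff000UL, 2 * HPAGE_SIZE_);
	return 0;
}
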
Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/mm/gmap.c | 82 ++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 69 insertions(+), 13 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index c8296bc..f6a28d9 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -1289,6 +1289,56 @@ static inline void gmap_insert_rmap(struct gmap *sg, unsigned long vmaddr,
 	}
 }
 
+static int gmap_protect_rmap_large(struct gmap *sg, struct gmap_rmap *rmap,
+				   unsigned long paddr, unsigned long vmaddr,
+				   pmd_t *pmdp, int prot)
+{
+	int rc = 0;
+
+	/* We have no upper segment, let's go back and fix this up. */
+	if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
+		return -EAGAIN;
+
+	spin_lock(&sg->guest_table_lock);
+	rc = gmap_protect_large(sg->parent, paddr, pmdp,
+				prot, GMAP_ENTRY_VSIE);
+	if (!rc)
+		gmap_insert_rmap(sg, vmaddr & HPAGE_MASK, rmap);
+
+	spin_unlock(&sg->guest_table_lock);
+	return rc;
+}
+
+static int gmap_protect_rmap_pte(struct gmap *sg, struct gmap_rmap *rmap,
+				 unsigned long paddr, unsigned long vmaddr,
+				 pmd_t *pmdp, int prot)
+{
+	int rc = 0;
+	pte_t *ptep = NULL;
+	spinlock_t *ptl = NULL;
+
+	/* We have no upper segment, let's go back and fix this up. */
+	if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
+		return -EAGAIN;
+
+	if (gmap_is_shadow(sg->parent))
+		ptep = pte_offset_map(pmdp, paddr);
+	else
+		ptep = pte_alloc_map_lock(sg->parent->mm, pmdp, paddr, &ptl);
+
+	if (ptep) {
+		spin_lock(&sg->guest_table_lock);
+		rc = ptep_force_prot(sg->parent->mm, paddr, ptep, prot,
+				     PGSTE_VSIE_BIT);
+		if (!rc)
+			gmap_insert_rmap(sg, vmaddr, rmap);
+		spin_unlock(&sg->guest_table_lock);
+		if (ptl)
+			gmap_pte_op_end(ptl);
+	}
+	return rc;
+}
+
 /**
  * gmap_protect_rmap - modify access rights to memory and create an rmap
  * @sg: pointer to the shadow guest address space structure
@@ -1306,8 +1356,7 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
 	struct gmap *parent;
 	struct gmap_rmap *rmap;
 	unsigned long vmaddr;
-	spinlock_t *ptl;
-	pte_t *ptep;
+	pmd_t *pmdp;
 	int rc;
 
 	BUG_ON(!gmap_is_shadow(sg));
@@ -1326,15 +1375,24 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
 			return rc;
 		}
 		rc = -EAGAIN;
-		ptep = gmap_pte_op_walk(parent, paddr, &ptl);
-		if (ptep) {
-			spin_lock(&sg->guest_table_lock);
-			rc = ptep_force_prot(parent->mm, paddr, ptep, prot,
-					     GMAP_ENTRY_VSIE);
-			if (!rc)
-				gmap_insert_rmap(sg, vmaddr, rmap);
-			spin_unlock(&sg->guest_table_lock);
-			gmap_pte_op_end(ptl);
+		pmdp = gmap_pmd_op_walk(parent, paddr);
+		if (pmdp) {
+			if (!pmd_large(*pmdp)) {
+				rc = gmap_protect_rmap_pte(sg, rmap, paddr,
+							   vmaddr, pmdp, prot);
+				if (!rc) {
+					paddr += PAGE_SIZE;
+					len -= PAGE_SIZE;
+				}
+			} else {
+				rc = gmap_protect_rmap_large(sg, rmap, paddr,
+							     vmaddr, pmdp, prot);
+				if (!rc) {
+					len = len < HPAGE_SIZE ? 0 : len - HPAGE_SIZE;
+					paddr = (paddr & HPAGE_MASK) + HPAGE_SIZE;
+				}
+			}
+			gmap_pmd_op_end(parent, pmdp);
 		}
 		radix_tree_preload_end();
 		if (rc) {
@@ -1344,8 +1402,6 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
 				return rc;
 			continue;
 		}
-		paddr += PAGE_SIZE;
-		len -= PAGE_SIZE;
 	}
 	return 0;
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 15/22] s390/mm: GMAP read table extensions
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (13 preceding siblings ...)
  2017-11-06 22:29 ` [RFC/PATCH 14/22] s390/mm: Make protect_rmap " Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-11-06 22:29 ` [RFC/PATCH 16/22] s390/mm: Add shadow segment code Janosch Frank
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

gmap_read_table has to tell us at which guest DAT level the walk
stopped when retrieving the information we requested. With this
information we can start shadowing without searching for the pmd or
pte a second time.

* This commit will most likely be merged into the read table EDAT
* extension or into the shadowing patches.

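Purely as an illustration, a small standalone sketch of how a caller can
use the reported level to pick the segment or page shadowing path without
walking the tables a second time. fake_read_table() and its values are
invented stand-ins for gmap_read_table():

#include <stdint.h>
#include <stdio.h>

struct walk_result {
	uint64_t val;	/* table entry that was read */
	int fc;		/* 0: walk ended at a pte, 1: walk ended at a 1M pmd */
};

/* stand-in for gmap_read_table(); entry and level are made up */
static int fake_read_table(uint64_t gaddr, struct walk_result *res)
{
	res->val = 0x00400000UL;
	res->fc = (gaddr & 1) ? 1 : 0;
	return 0;
}

int main(void)
{
	struct walk_result res;

	if (!fake_read_table(0x1001UL, &res)) {
		if (res.fc)
			printf("shadow a whole segment, entry=%#llx\n",
			       (unsigned long long)res.val);
		else
			printf("shadow a single page, entry=%#llx\n",
			       (unsigned long long)res.val);
	}
	return 0;
}
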
Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/include/asm/gmap.h |  2 +-
 arch/s390/kvm/gaccess.c      | 15 +++++++++------
 arch/s390/mm/gmap.c          |  4 +++-
 3 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index dfd50d7..ff6b2d0 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -116,7 +116,7 @@ void gmap_discard(struct gmap *, unsigned long from, unsigned long to);
 void __gmap_zap(struct gmap *, unsigned long gaddr);
 void gmap_unlink(struct mm_struct *, unsigned long *table, unsigned long vmaddr);
 
-int gmap_read_table(struct gmap *gmap, unsigned long gaddr, unsigned long *val);
+int gmap_read_table(struct gmap *gmap, unsigned long gaddr, unsigned long *val, int *fc);
 
 struct gmap *gmap_shadow(struct gmap *parent, unsigned long asce,
 			 int edat_level);
diff --git a/arch/s390/kvm/gaccess.c b/arch/s390/kvm/gaccess.c
index 3cc7739..4bd15a9f 100644
--- a/arch/s390/kvm/gaccess.c
+++ b/arch/s390/kvm/gaccess.c
@@ -986,7 +986,7 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
 	union asce asce;
 	union vaddress vaddr;
 	unsigned long ptr;
-	int rc;
+	int rc, fc = 0;
 
 	*fake = 0;
 	*dat_protection = 0;
@@ -1033,7 +1033,8 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
 			rfte.val = ptr;
 			goto shadow_r2t;
 		}
-		rc = gmap_read_table(parent, ptr + vaddr.rfx * 8, &rfte.val);
+		rc = gmap_read_table(parent, ptr + vaddr.rfx * 8, &rfte.val,
+				     &fc);
 		if (rc)
 			return rc;
 		if (rfte.i)
@@ -1059,7 +1060,8 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
 			rste.val = ptr;
 			goto shadow_r3t;
 		}
-		rc = gmap_read_table(parent, ptr + vaddr.rsx * 8, &rste.val);
+		rc = gmap_read_table(parent, ptr + vaddr.rsx * 8, &rste.val,
+				     &fc);
 		if (rc)
 			return rc;
 		if (rste.i)
@@ -1086,7 +1088,8 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
 			rtte.val = ptr;
 			goto shadow_sgt;
 		}
-		rc = gmap_read_table(parent, ptr + vaddr.rtx * 8, &rtte.val);
+		rc = gmap_read_table(parent, ptr + vaddr.rtx * 8, &rtte.val,
+				     &fc);
 		if (rc)
 			return rc;
 		if (rtte.i)
@@ -1122,7 +1125,7 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
 			ste.val = ptr;
 			goto shadow_pgt;
 		}
-		rc = gmap_read_table(parent, ptr + vaddr.sx * 8, &ste.val);
+		rc = gmap_read_table(parent, ptr + vaddr.sx * 8, &ste.val, &fc);
 		if (rc)
 			return rc;
 		if (ste.i)
@@ -1191,7 +1194,7 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 		goto shadow_page;
 	}
 	if (!rc)
-		rc = gmap_read_table(sg->parent, pgt + vaddr.px * 8, &pte.val);
+		rc = gmap_read_table(sg->parent, pgt + vaddr.px * 8, &pte.val, &fc);
 	if (!rc && pte.i)
 		rc = PGM_PAGE_TRANSLATION;
 	if (!rc && pte.z)
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index f6a28d9..bc77688 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -1204,7 +1204,8 @@ EXPORT_SYMBOL_GPL(gmap_mprotect_notify);
  *
  * Called with gmap->mm->mmap_sem in read.
  */
-int gmap_read_table(struct gmap *gmap, unsigned long gaddr, unsigned long *val)
+int gmap_read_table(struct gmap *gmap, unsigned long gaddr, unsigned long *val,
+		    int *fc)
 {
 	unsigned long address, vmaddr;
 	spinlock_t *ptl;
@@ -1243,6 +1244,7 @@ int gmap_read_table(struct gmap *gmap, unsigned long gaddr, unsigned long *val)
 					address = pmd_val(pmd) & HPAGE_MASK;
 					address += gaddr & ~HPAGE_MASK;
 					*val = *(unsigned long *) address;
+					*fc = 1;
 					rc = 0;
 				}
 			}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 16/22] s390/mm: Add shadow segment code
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (14 preceding siblings ...)
  2017-11-06 22:29 ` [RFC/PATCH 15/22] s390/mm: GMAP read table extensions Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-11-06 22:29 ` [RFC/PATCH 17/22] s390/mm: Add VSIE reverse fake case Janosch Frank
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

The VSIE code does not yet support shadowing hosts that are backed by
large pages. Let's add large-to-large shadowing, i.e. mapping a
guest's huge segments directly to the parent's huge segments.

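The core of the large-to-large case can be sketched in a few lines of
standalone C. The bit values below are illustrative, not the real s390
definitions; only the combination of origin, large bit and inherited
protection mirrors the patch:

#include <stdint.h>
#include <stdio.h>

#define HPAGE_MASK_	(~((1UL << 20) - 1))
#define SEG_LARGE_	0x0400UL	/* illustrative "large" bit */
#define SEG_PROTECT_	0x0200UL	/* illustrative protection bit */

/*
 * Shadow segment entry: the origin comes from the parent's pmd, the
 * entry stays large, and the protection bit is inherited from the
 * guest's own segment entry.
 */
static uint64_t make_shadow_segment(uint64_t parent_pmd, uint64_t guest_pmd)
{
	return (parent_pmd & HPAGE_MASK_) | SEG_LARGE_ |
	       (guest_pmd & SEG_PROTECT_);
}

int main(void)
{
	uint64_t parent = 0x7f300000UL | SEG_LARGE_;
	uint64_t guest = 0x00a00000UL | SEG_LARGE_ | SEG_PROTECT_;

	printf("shadow pmd = %#llx\n",
	       (unsigned long long)make_shadow_segment(parent, guest));
	return 0;
}
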
Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/include/asm/gmap.h |   6 +-
 arch/s390/kvm/gaccess.c      |  35 +++++-
 arch/s390/mm/gmap.c          | 293 +++++++++++++++++++++++++++++++++++++------
 3 files changed, 293 insertions(+), 41 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index ff6b2d0..2ce861f 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -129,9 +129,11 @@ int gmap_shadow_sgt(struct gmap *sg, unsigned long saddr, unsigned long sgt,
 		    int fake);
 int gmap_shadow_pgt(struct gmap *sg, unsigned long saddr, unsigned long pgt,
 		    int fake);
-int gmap_shadow_pgt_lookup(struct gmap *sg, unsigned long saddr,
-			   unsigned long *pgt, int *dat_protection, int *fake);
+int gmap_shadow_sgt_lookup(struct gmap *sg, unsigned long saddr,
+			   unsigned long *pgt, int *dat_protection,
+			   int *fake);
 int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte);
+int gmap_shadow_segment(struct gmap *sg, unsigned long saddr, pmd_t pmd);
 
 void gmap_register_pte_notifier(struct gmap_notifier *);
 void gmap_unregister_pte_notifier(struct gmap_notifier *);
diff --git a/arch/s390/kvm/gaccess.c b/arch/s390/kvm/gaccess.c
index 4bd15a9f..60a5dda 100644
--- a/arch/s390/kvm/gaccess.c
+++ b/arch/s390/kvm/gaccess.c
@@ -980,7 +980,7 @@ int kvm_s390_check_low_addr_prot_real(struct kvm_vcpu *vcpu, unsigned long gra)
  */
 static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
 				  unsigned long *pgt, int *dat_protection,
-				  int *fake)
+				  int *fake, int *lvl)
 {
 	struct gmap *parent;
 	union asce asce;
@@ -1135,6 +1135,17 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
 		if (ste.cs && asce.p)
 			return PGM_TRANSLATION_SPEC;
 		*dat_protection |= ste.fc0.p;
+
+		/* Parent is mapped by huge pages. */
+		if (fc) {
+			/* Guest is also huge, easy case. */
+			if (ste.fc && sg->edat_level >= 1) {
+				*lvl = 1;
+				*pgt = ptr;
+				return 0;
+			}
+		}
+		/* Small to small and small to huge case */
 		if (ste.fc && sg->edat_level >= 1) {
 			*fake = 1;
 			ptr = ste.fc1.sfaa * _SEGMENT_SIZE;
@@ -1171,8 +1182,9 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 {
 	union vaddress vaddr;
 	union page_table_entry pte;
+	union segment_table_entry ste;
 	unsigned long pgt;
-	int dat_protection, fake;
+	int dat_protection, fake, lvl, fc;
 	int rc;
 
 	down_read(&sg->mm->mmap_sem);
@@ -1183,12 +1195,26 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 	 */
 	ipte_lock(vcpu);
 
-	rc = gmap_shadow_pgt_lookup(sg, saddr, &pgt, &dat_protection, &fake);
+	rc = gmap_shadow_sgt_lookup(sg, saddr, &pgt, &dat_protection, &fake);
 	if (rc)
 		rc = kvm_s390_shadow_tables(sg, saddr, &pgt, &dat_protection,
-					    &fake);
+					    &fake, &lvl);
 
 	vaddr.addr = saddr;
+
+	/* Shadow stopped at segment level, we map pmd to pmd */
+	if (lvl) {
+		if (!rc)
+			rc = gmap_read_table(sg->parent, pgt + vaddr.sx * 8,
+					     &ste.val, &fc);
+		if (!rc && ste.i)
+			rc = PGM_PAGE_TRANSLATION;
+		ste.fc1.p |= dat_protection;
+		if (!rc)
+			rc = gmap_shadow_segment(sg, saddr, __pmd(ste.val));
+		goto out;
+	}
+
 	if (fake) {
 		pte.val = pgt + vaddr.px * PAGE_SIZE;
 		goto shadow_page;
@@ -1203,6 +1229,7 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 	pte.p |= dat_protection;
 	if (!rc)
 		rc = gmap_shadow_page(sg, saddr, __pte(pte.val));
+out:
 	ipte_unlock(vcpu);
 	up_read(&sg->mm->mmap_sem);
 	return rc;
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index bc77688..75c32de 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -613,7 +613,9 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 		}
 	} else if (*table & _SEGMENT_ENTRY_PROTECT &&
 		   !(pmd_val(*pmd) & _SEGMENT_ENTRY_PROTECT)) {
-		unprot = __pmd((*table & ~_SEGMENT_ENTRY_PROTECT)
+		unprot = __pmd((*table & (_SEGMENT_ENTRY_ORIGIN_LARGE
+					  | _SEGMENT_ENTRY_INVALID
+					  | _SEGMENT_ENTRY_LARGE))
 			       | _SEGMENT_ENTRY_GMAP_UC);
 		gmap_pmdp_xchg(gmap, (pmd_t *)table, unprot, gaddr);
 	}
@@ -1293,7 +1295,7 @@ static inline void gmap_insert_rmap(struct gmap *sg, unsigned long vmaddr,
 
 static int gmap_protect_rmap_large(struct gmap *sg, struct gmap_rmap *rmap,
 				   unsigned long paddr, unsigned long vmaddr,
-				   pmd_t *pmdp, int prot)
+				   pmd_t *pmdp, pmd_t *hpmdp, int prot)
 {
 	int rc = 0;
 
@@ -1302,7 +1304,7 @@ static int gmap_protect_rmap_large(struct gmap *sg, struct gmap_rmap *rmap,
 		return -EAGAIN;
 
 	spin_lock(&sg->guest_table_lock);
-	rc = gmap_protect_large(sg->parent, paddr, pmdp,
+	rc = gmap_protect_large(sg->parent, paddr, vmaddr, pmdp, hpmdp,
 				prot, GMAP_ENTRY_VSIE);
 	if (!rc)
 		gmap_insert_rmap(sg, vmaddr & HPAGE_MASK, rmap);
@@ -1358,7 +1360,8 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
 	struct gmap *parent;
 	struct gmap_rmap *rmap;
 	unsigned long vmaddr;
-	pmd_t *pmdp;
+	spinlock_t *ptl;
+	pmd_t *pmdp, *hpmdp;
 	int rc;
 
 	BUG_ON(!gmap_is_shadow(sg));
@@ -1367,12 +1370,18 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
 		vmaddr = __gmap_translate(parent, paddr);
 		if (IS_ERR_VALUE(vmaddr))
 			return vmaddr;
+		hpmdp = (pmd_t *)huge_pte_offset(parent->mm, vmaddr, HPAGE_SIZE);
+		/* Do we need tests here? */
+		ptl = pmd_lock(parent->mm, hpmdp);
 		rmap = kzalloc(sizeof(*rmap), GFP_KERNEL);
-		if (!rmap)
+		if (!rmap) {
+			spin_unlock(ptl);
 			return -ENOMEM;
+		}
 		rmap->raddr = raddr;
 		rc = radix_tree_preload(GFP_KERNEL);
 		if (rc) {
+			spin_unlock(ptl);
 			kfree(rmap);
 			return rc;
 		}
@@ -1388,7 +1397,8 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
 				}
 			} else {
 				rc = gmap_protect_rmap_large(sg, rmap, paddr,
-							     vmaddr, pmdp, prot);
+							     vmaddr, pmdp,
+							     hpmdp, prot);
 				if (!rc) {
 					len = len < HPAGE_SIZE ? 0 : len - HPAGE_SIZE;
 					paddr = (paddr & HPAGE_MASK) + HPAGE_SIZE;
@@ -1396,6 +1406,7 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
 			}
 			gmap_pmd_op_end(parent, pmdp);
 		}
+		spin_unlock(ptl);
 		radix_tree_preload_end();
 		if (rc) {
 			kfree(rmap);
@@ -1409,6 +1420,7 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
 }
 
 #define _SHADOW_RMAP_MASK	0x7
+#define _SHADOW_RMAP_SEGMENT_LP	0x6
 #define _SHADOW_RMAP_REGION1	0x5
 #define _SHADOW_RMAP_REGION2	0x4
 #define _SHADOW_RMAP_REGION3	0x3
@@ -1517,13 +1529,16 @@ static void __gmap_unshadow_sgt(struct gmap *sg, unsigned long raddr,
 	for (i = 0; i < _CRST_ENTRIES; i++, raddr += _SEGMENT_SIZE) {
 		if (!(sgt[i] & _SEGMENT_ENTRY_ORIGIN))
 			continue;
-		pgt = (unsigned long *)(sgt[i] & _REGION_ENTRY_ORIGIN);
+
+		if (!(sgt[i] & _SEGMENT_ENTRY_LARGE)) {
+			pgt = (unsigned long *)(sgt[i] & _REGION_ENTRY_ORIGIN);
+			__gmap_unshadow_pgt(sg, raddr, pgt);
+			/* Free page table */
+			page = pfn_to_page(__pa(pgt) >> PAGE_SHIFT);
+			list_del(&page->lru);
+			page_table_free_pgste(page);
+		}
 		sgt[i] = _SEGMENT_ENTRY_EMPTY;
-		__gmap_unshadow_pgt(sg, raddr, pgt);
-		/* Free page table */
-		page = pfn_to_page(__pa(pgt) >> PAGE_SHIFT);
-		list_del(&page->lru);
-		page_table_free_pgste(page);
 	}
 }
 
@@ -2140,32 +2155,62 @@ EXPORT_SYMBOL_GPL(gmap_shadow_sgt);
  *
  * Called with sg->mm->mmap_sem in read.
  */
-int gmap_shadow_pgt_lookup(struct gmap *sg, unsigned long saddr,
+void gmap_shadow_pgt_lookup(struct gmap *sg, unsigned long *sge,
+			    unsigned long saddr, unsigned long *pgt,
+			    int *dat_protection, int *fake)
+{
+	struct page *page;
+
+	/* Shadow page tables are full pages (pte+pgste) */
+	page = pfn_to_page(*sge >> PAGE_SHIFT);
+	*pgt = page->index & ~GMAP_SHADOW_FAKE_TABLE;
+	*dat_protection = !!(*sge & _SEGMENT_ENTRY_PROTECT);
+	*fake = !!(page->index & GMAP_SHADOW_FAKE_TABLE);
+}
+EXPORT_SYMBOL_GPL(gmap_shadow_pgt_lookup);
+
+int gmap_shadow_sgt_lookup(struct gmap *sg, unsigned long saddr,
 			   unsigned long *pgt, int *dat_protection,
 			   int *fake)
 {
-	unsigned long *table;
+	unsigned long *sge, *r3e = NULL;
 	struct page *page;
-	int rc;
+	int rc = -EAGAIN;
 
 	BUG_ON(!gmap_is_shadow(sg));
 	spin_lock(&sg->guest_table_lock);
-	table = gmap_table_walk(sg, saddr, 1); /* get segment pointer */
-	if (table && !(*table & _SEGMENT_ENTRY_INVALID)) {
-		/* Shadow page tables are full pages (pte+pgste) */
-		page = pfn_to_page(*table >> PAGE_SHIFT);
-		*pgt = page->index & ~GMAP_SHADOW_FAKE_TABLE;
-		*dat_protection = !!(*table & _SEGMENT_ENTRY_PROTECT);
-		*fake = !!(page->index & GMAP_SHADOW_FAKE_TABLE);
-		rc = 0;
-	} else  {
-		rc = -EAGAIN;
+	if (sg->asce & _ASCE_TYPE_MASK) {
+		/* >2 GB guest */
+		r3e = (unsigned long *) gmap_table_walk(sg, saddr, 2);
+		if (!r3e || (*r3e & _REGION_ENTRY_INVALID))
+			goto out;
+		sge = (unsigned long *)(*r3e & _REGION_ENTRY_ORIGIN) + ((saddr & _SEGMENT_INDEX) >> _SEGMENT_SHIFT);
+	} else {
+		sge = (unsigned long *)(sg->asce & PAGE_MASK) + ((saddr & _SEGMENT_INDEX) >> _SEGMENT_SHIFT);
 	}
+	if (*sge & _SEGMENT_ENTRY_INVALID)
+		goto out;
+	rc = 0;
+	if (*sge & _SEGMENT_ENTRY_LARGE) {
+		if (r3e) {
+			page = pfn_to_page(*r3e >> PAGE_SHIFT);
+			*pgt = page->index & ~GMAP_SHADOW_FAKE_TABLE;
+			*dat_protection = !!(*r3e & _SEGMENT_ENTRY_PROTECT);
+			*fake = !!(page->index & GMAP_SHADOW_FAKE_TABLE);
+		} else {
+			*pgt = sg->orig_asce & PAGE_MASK;
+			*dat_protection = 0;
+			*fake = 0;
+		}
+	} else {
+		gmap_shadow_pgt_lookup(sg, sge, saddr, pgt,
+				       dat_protection, fake);
+	}
+out:
 	spin_unlock(&sg->guest_table_lock);
 	return rc;
-
 }
-EXPORT_SYMBOL_GPL(gmap_shadow_pgt_lookup);
+EXPORT_SYMBOL_GPL(gmap_shadow_sgt_lookup);
 
 /**
  * gmap_shadow_pgt - instantiate a shadow page table
@@ -2247,6 +2292,88 @@ int gmap_shadow_pgt(struct gmap *sg, unsigned long saddr, unsigned long pgt,
 }
 EXPORT_SYMBOL_GPL(gmap_shadow_pgt);
 
+int gmap_shadow_segment(struct gmap *sg, unsigned long saddr, pmd_t pmd)
+{
+	struct gmap *parent;
+	struct gmap_rmap *rmap;
+	unsigned long vmaddr, paddr;
+	pmd_t spmd, tpmd, *spmdp = NULL, *tpmdp;
+	int prot;
+	int rc;
+
+	BUG_ON(!gmap_is_shadow(sg));
+	parent = sg->parent;
+
+	prot = (pmd_val(pmd) & _SEGMENT_ENTRY_PROTECT) ? PROT_READ : PROT_WRITE;
+	rmap = kzalloc(sizeof(*rmap), GFP_KERNEL);
+	if (!rmap)
+		return -ENOMEM;
+	rmap->raddr = (saddr & HPAGE_MASK) | _SHADOW_RMAP_SEGMENT_LP;
+
+	while (1) {
+		paddr = pmd_val(pmd) & HPAGE_MASK;
+		vmaddr = __gmap_translate(parent, paddr);
+		if (IS_ERR_VALUE(vmaddr)) {
+			rc = vmaddr;
+			break;
+		}
+		rc = radix_tree_preload(GFP_KERNEL);
+		if (rc)
+			break;
+		rc = -EAGAIN;
+
+		/* Let's look up the parent's mapping */
+		spmdp = gmap_pmd_op_walk(parent, paddr);
+		if (spmdp) {
+			spin_lock(&sg->guest_table_lock);
+			/* Get shadow segment table pointer */
+			tpmdp = (pmd_t *) gmap_table_walk(sg, saddr, 1);
+			if (!tpmdp) {
+				spin_unlock(&sg->guest_table_lock);
+				gmap_pmd_op_end(parent, spmdp);
+				radix_tree_preload_end();
+				break;
+			}
+			/* Shadowing magic happens here. */
+			if (!(pmd_val(*tpmdp) & _SEGMENT_ENTRY_INVALID)) {
+				rc = 0;	/* already shadowed */
+				spin_unlock(&sg->guest_table_lock);
+				gmap_pmd_op_end(parent, spmdp);
+				radix_tree_preload_end();
+				break;
+			}
+			spmd = *spmdp;
+			if (!(pmd_val(spmd) & _SEGMENT_ENTRY_INVALID) &&
+			    !((pmd_val(spmd) & _SEGMENT_ENTRY_PROTECT) &&
+			      !(pmd_val(pmd) & _SEGMENT_ENTRY_PROTECT))) {
+
+				*spmdp =  __pmd(pmd_val(spmd) | _SEGMENT_ENTRY_GMAP_VSIE);
+
+				/* Insert shadow ste */
+				pmd_val(tpmd) = ((pmd_val(spmd) & HPAGE_MASK) |
+						 _SEGMENT_ENTRY_LARGE |
+						 (pmd_val(pmd) & _SEGMENT_ENTRY_PROTECT));
+				*tpmdp = tpmd;
+
+				gmap_insert_rmap(sg, vmaddr, rmap);
+				rc = 0;
+			}
+			spin_unlock(&sg->guest_table_lock);
+			gmap_pmd_op_end(parent, spmdp);
+		}
+		radix_tree_preload_end();
+		if (!rc)
+			break;
+		rc = gmap_pte_op_fixup(parent, paddr, vmaddr, prot);
+		if (rc)
+			break;
+	}
+	if (rc)
+		kfree(rmap);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(gmap_shadow_segment);
+
 /**
  * gmap_shadow_page - create a shadow page mapping
  * @sg: pointer to the shadow guest address space structure
@@ -2323,6 +2450,78 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
 EXPORT_SYMBOL_GPL(gmap_shadow_page);
 
 /**
+ * gmap_unshadow_segment - remove a huge segment from a shadow segment table
+ * @sg: pointer to the shadow guest address space structure
+ * @raddr: rmap address in the shadow guest address space
+ *
+ * Called with the sg->guest_table_lock
+ */
+static void gmap_unshadow_segment(struct gmap *sg, unsigned long raddr)
+{
+	unsigned long *table;
+
+	BUG_ON(!gmap_is_shadow(sg));
+	/* We already have the lock */
+	table = gmap_table_walk(sg, raddr, 1); /* get segment table pointer */
+	if (!table || *table & _SEGMENT_ENTRY_INVALID ||
+	    !(*table & _SEGMENT_ENTRY_LARGE))
+		return;
+	gmap_call_notifier(sg, raddr, raddr + (1UL << 20) - 1);
+	gmap_pmdp_xchg(sg, (pmd_t *)table, __pmd(_SEGMENT_ENTRY_EMPTY), raddr);
+}
+
+static void gmap_shadow_notify_pmd(struct gmap *sg, unsigned long vmaddr,
+				   unsigned long gaddr)
+{
+	struct gmap_rmap *rmap, *rnext, *head;
+	unsigned long start, end, bits, raddr;
+
+
+	BUG_ON(!gmap_is_shadow(sg));
+
+	spin_lock(&sg->guest_table_lock);
+	if (sg->removed) {
+		spin_unlock(&sg->guest_table_lock);
+		return;
+	}
+	/* Check for top level table */
+	start = sg->orig_asce & _ASCE_ORIGIN;
+	end = start + ((sg->orig_asce & _ASCE_TABLE_LENGTH) + 1) * 4096;
+	if (!(sg->orig_asce & _ASCE_REAL_SPACE) && gaddr >= start &&
+	    gaddr < ((end & HPAGE_MASK) + HPAGE_SIZE - 1)) {
+		/* The complete shadow table has to go */
+		gmap_unshadow(sg);
+		spin_unlock(&sg->guest_table_lock);
+		list_del(&sg->list);
+		gmap_put(sg);
+		return;
+	}
+	/* Remove the page table tree for one specific entry */
+	head = radix_tree_delete(&sg->host_to_rmap, (vmaddr & HPAGE_MASK) >> PAGE_SHIFT);
+	gmap_for_each_rmap_safe(rmap, rnext, head) {
+		bits = rmap->raddr & _SHADOW_RMAP_MASK;
+		raddr = rmap->raddr ^ bits;
+		switch (bits) {
+		case _SHADOW_RMAP_REGION1:
+			gmap_unshadow_r2t(sg, raddr);
+			break;
+		case _SHADOW_RMAP_REGION2:
+			gmap_unshadow_r3t(sg, raddr);
+			break;
+		case _SHADOW_RMAP_REGION3:
+			gmap_unshadow_sgt(sg, raddr);
+			break;
+		case _SHADOW_RMAP_SEGMENT_LP:
+			gmap_unshadow_segment(sg, raddr);
+			break;
+		}
+		kfree(rmap);
+	}
+	spin_unlock(&sg->guest_table_lock);
+}
+
+
+/**
  * gmap_shadow_notify - handle notifications for shadow gmap
  *
  * Called with sg->parent->shadow_lock.
@@ -2434,13 +2633,28 @@ EXPORT_SYMBOL_GPL(ptep_notify);
 static void pmdp_notify_gmap(struct gmap *gmap, unsigned long gaddr)
 {
 	unsigned long *table;
+	unsigned long vmaddr, bits;
+	struct gmap *sg, *next;
 
 	gaddr &= HPAGE_MASK;
 	table = gmap_table_walk(gmap, gaddr, 1);
-	if (!table || !(*table & _SEGMENT_ENTRY_GMAP_IN))
+	if (!table)
 		return;
-	*table &= ~_SEGMENT_ENTRY_GMAP_IN;
-	gmap_call_notifier(gmap, gaddr, gaddr + HPAGE_SIZE - 1);
+	bits = *table & (_SEGMENT_ENTRY_GMAP_IN | _SEGMENT_ENTRY_GMAP_VSIE);
+	if (!bits)
+		return;
+	*table ^= bits;
+	vmaddr = __gmap_translate(gmap, gaddr);
+	if (!list_empty(&gmap->children) && (bits & _SEGMENT_ENTRY_GMAP_VSIE)
+	    && (*table & _SEGMENT_ENTRY_PROTECT)) {
+		spin_lock(&gmap->shadow_lock);
+		list_for_each_entry_safe(sg, next,
+					 &gmap->children, list)
+			gmap_shadow_notify_pmd(sg, vmaddr, gaddr);
+		spin_unlock(&gmap->shadow_lock);
+	}
+	if (bits & _SEGMENT_ENTRY_GMAP_IN)
+		gmap_call_notifier(gmap, gaddr, gaddr + HPAGE_SIZE - 1);
 }
 
 /**
@@ -2452,22 +2666,31 @@ static void pmdp_notify_gmap(struct gmap *gmap, unsigned long gaddr)
  */
 void pmdp_notify(struct mm_struct *mm, unsigned long vmaddr)
 {
-	unsigned long *table, gaddr;
-	struct gmap *gmap;
+	unsigned long *table, gaddr, bits;
+	struct gmap *gmap, *sg, *next;
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
 		spin_lock(&gmap->guest_table_lock);
 		table = radix_tree_lookup(&gmap->host_to_guest,
 					  vmaddr >> PMD_SHIFT);
-		if (!table || !(*table & _SEGMENT_ENTRY_GMAP_IN)) {
+		if (!table) {
 			spin_unlock(&gmap->guest_table_lock);
 			continue;
 		}
+		bits = *table & (_SEGMENT_ENTRY_GMAP_IN | _SEGMENT_ENTRY_GMAP_VSIE);
+		*table ^= bits;
 		gaddr = __gmap_segment_gaddr(table);
-		*table &= ~_SEGMENT_ENTRY_GMAP_IN;
 		spin_unlock(&gmap->guest_table_lock);
-		gmap_call_notifier(gmap, gaddr, gaddr + HPAGE_SIZE - 1);
+		if (!list_empty(&gmap->children) && (bits & _SEGMENT_ENTRY_GMAP_VSIE)) {
+			spin_lock(&gmap->shadow_lock);
+			list_for_each_entry_safe(sg, next,
+						 &gmap->children, list)
+				gmap_shadow_notify_pmd(sg, vmaddr, gaddr);
+			spin_unlock(&gmap->shadow_lock);
+		}
+		if (bits & _SEGMENT_ENTRY_GMAP_IN)
+			gmap_call_notifier(gmap, gaddr, gaddr + HPAGE_SIZE - 1);
 	}
 	rcu_read_unlock();
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 17/22] s390/mm: Add VSIE reverse fake case
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (15 preceding siblings ...)
  2017-11-06 22:29 ` [RFC/PATCH 16/22] s390/mm: Add shadow segment code Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-11-06 22:29 ` [RFC/PATCH 18/22] s390/mm: Remove gmap_pte_op_walk Janosch Frank
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

The fake VSIE case lets us run huge VSIE guests on small hosts by
creating fake page tables. When running a small guest on a huge host,
we need to create fake page tables once again.

The fake tables are needed to make sure that the VSIE guest is only
able to access the memory that its host actually mapped for it.

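As a rough standalone sketch, with made-up constants rather than the real
s390 definitions, the reverse fake case boils down to pointing each fake
shadow pte at the right 4K slice of the parent's 1M frame while inheriting
the protection of the guest's pte:

#include <stdint.h>
#include <stdio.h>

#define HPAGE_MASK_	(~((1UL << 20) - 1))
#define PAGE_SHIFT_	12
#define PAGE_PROTECT_	0x200UL		/* illustrative protection bit */

/* 256 ptes cover one 1M segment */
static uint64_t pte_index_(uint64_t addr)
{
	return (addr >> PAGE_SHIFT_) & 0xff;
}

static uint64_t make_fake_pte(uint64_t parent_pmd, uint64_t paddr,
			      uint64_t guest_pte)
{
	return ((parent_pmd & HPAGE_MASK_) +
		(pte_index_(paddr) << PAGE_SHIFT_)) |
	       (guest_pte & PAGE_PROTECT_);
}

int main(void)
{
	printf("fake pte = %#llx\n",
	       (unsigned long long)make_fake_pte(0x7f400000UL, 0x00345000UL,
						  PAGE_PROTECT_));
	return 0;
}
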
Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/include/asm/gmap.h |  2 +-
 arch/s390/kvm/gaccess.c      | 20 +++++++++++++----
 arch/s390/mm/gmap.c          | 52 +++++++++++++++++++++++++++++++++++---------
 3 files changed, 59 insertions(+), 15 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 2ce861f..a50dbc6 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -131,7 +131,7 @@ int gmap_shadow_pgt(struct gmap *sg, unsigned long saddr, unsigned long pgt,
 		    int fake);
 int gmap_shadow_sgt_lookup(struct gmap *sg, unsigned long saddr,
 			   unsigned long *pgt, int *dat_protection,
-			   int *fake);
+			   int *fake, int *lvl);
 int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte);
 int gmap_shadow_segment(struct gmap *sg, unsigned long saddr, pmd_t pmd);
 
diff --git a/arch/s390/kvm/gaccess.c b/arch/s390/kvm/gaccess.c
index 60a5dda..5c6e8d8 100644
--- a/arch/s390/kvm/gaccess.c
+++ b/arch/s390/kvm/gaccess.c
@@ -1143,10 +1143,22 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
 				*lvl = 1;
 				*pgt = ptr;
 				return 0;
+			} else {
+				/*
+				 * Reverse fake case.
+				 * We map a huge parent to a small guest, i.e.
+				 * we need fake shadow pagetables.
+				 *
+				 * We need pagetables here, because
+				 * guests not aligned on 1M could
+				 * read/write from/to the parent or
+				 * host.
+				 */
+				*lvl = 0;
 			}
 		}
 		/* Small to small and small to huge case */
-		if (ste.fc && sg->edat_level >= 1) {
+		if (!fc && ste.fc && sg->edat_level >= 1) {
 			*fake = 1;
 			ptr = ste.fc1.sfaa * _SEGMENT_SIZE;
 			ste.val = ptr;
@@ -1184,7 +1196,7 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 	union page_table_entry pte;
 	union segment_table_entry ste;
 	unsigned long pgt;
-	int dat_protection, fake, lvl, fc;
+	int dat_protection, fake, lvl = 0, fc;
 	int rc;
 
 	down_read(&sg->mm->mmap_sem);
@@ -1195,7 +1207,7 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 	 */
 	ipte_lock(vcpu);
 
-	rc = gmap_shadow_sgt_lookup(sg, saddr, &pgt, &dat_protection, &fake);
+	rc = gmap_shadow_sgt_lookup(sg, saddr, &pgt, &dat_protection, &fake, &lvl);
 	if (rc)
 		rc = kvm_s390_shadow_tables(sg, saddr, &pgt, &dat_protection,
 					    &fake, &lvl);
@@ -1203,7 +1215,7 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 	vaddr.addr = saddr;
 
 	/* Shadow stopped at segment level, we map pmd to pmd */
-	if (lvl) {
+	if (!rc && lvl) {
 		if (!rc)
 			rc = gmap_read_table(sg->parent, pgt + vaddr.sx * 8,
 					     &ste.val, &fc);
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 75c32de..91a0824 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -1527,7 +1527,7 @@ static void __gmap_unshadow_sgt(struct gmap *sg, unsigned long raddr,
 	BUG_ON(!gmap_is_shadow(sg));
 	asce = (unsigned long) sgt | _ASCE_TYPE_SEGMENT;
 	for (i = 0; i < _CRST_ENTRIES; i++, raddr += _SEGMENT_SIZE) {
-		if (!(sgt[i] & _SEGMENT_ENTRY_ORIGIN))
+		if (sgt[i] ==  _SEGMENT_ENTRY_EMPTY)
 			continue;
 
 		if (!(sgt[i] & _SEGMENT_ENTRY_LARGE)) {
@@ -2171,7 +2171,7 @@ EXPORT_SYMBOL_GPL(gmap_shadow_pgt_lookup);
 
 int gmap_shadow_sgt_lookup(struct gmap *sg, unsigned long saddr,
 			   unsigned long *pgt, int *dat_protection,
-			   int *fake)
+			   int *fake, int *lvl)
 {
 	unsigned long *sge, *r3e = NULL;
 	struct page *page;
@@ -2202,9 +2202,11 @@ int gmap_shadow_sgt_lookup(struct gmap *sg, unsigned long saddr,
 			*dat_protection = 0;
 			*fake = 0;
 		}
+		*lvl = 1;
 	} else {
 		gmap_shadow_pgt_lookup(sg, sge, saddr, pgt,
 				       dat_protection, fake);
+		*lvl = 0;
 	}
 out:
 	spin_unlock(&sg->guest_table_lock);
@@ -2392,6 +2394,7 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
 	struct gmap_rmap *rmap;
 	unsigned long vmaddr, paddr;
 	spinlock_t *ptl;
+	pmd_t *spmdp;
 	pte_t *sptep, *tptep;
 	int prot;
 	int rc;
@@ -2416,26 +2419,49 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
 		if (rc)
 			break;
 		rc = -EAGAIN;
-		sptep = gmap_pte_op_walk(parent, paddr, &ptl);
-		if (sptep) {
+		spmdp = gmap_pmd_op_walk(parent, paddr);
+		if (spmdp && !(pmd_val(*spmdp) & _SEGMENT_ENTRY_INVALID)) {
 			spin_lock(&sg->guest_table_lock);
 			/* Get page table pointer */
 			tptep = (pte_t *) gmap_table_walk(sg, saddr, 0);
 			if (!tptep) {
 				spin_unlock(&sg->guest_table_lock);
-				gmap_pte_op_end(ptl);
 				radix_tree_preload_end();
+				gmap_pmd_op_end(parent, spmdp);
 				break;
 			}
-			rc = ptep_shadow_pte(sg->mm, saddr, sptep, tptep, pte);
-			if (rc > 0) {
-				/* Success and a new mapping */
-				gmap_insert_rmap(sg, vmaddr, rmap);
+
+			if (pmd_large(*spmdp)) {
+				/* TODO: Bits and pgstes */
+				*tptep = __pte(((pmd_val(*spmdp) &
+						_SEGMENT_ENTRY_ORIGIN_LARGE)
+					       + (pte_index(paddr) << 12))
+					       | (pte_val(pte) & _PAGE_PROTECT));
+				pmd_val(*spmdp) |= _SEGMENT_ENTRY_GMAP_VSIE;
+				gmap_insert_rmap(sg, vmaddr & HPAGE_MASK, rmap);
 				rmap = NULL;
 				rc = 0;
+			} else {
+				ptl = NULL;
+				if (gmap_is_shadow(parent))
+					sptep = pte_offset_map(spmdp, paddr);
+				else
+					sptep = pte_alloc_map_lock(parent->mm, spmdp, paddr, &ptl);
+
+				if (sptep) {
+					rc = ptep_shadow_pte(sg->mm, saddr, sptep, tptep, pte);
+					if (rc > 0) {
+						/* Success and a new mapping */
+						gmap_insert_rmap(sg, vmaddr, rmap);
+						rmap = NULL;
+						rc = 0;
+					}
+					if (ptl)
+						gmap_pte_op_end(ptl);
+				}
 			}
-			gmap_pte_op_end(ptl);
 			spin_unlock(&sg->guest_table_lock);
+			gmap_pmd_op_end(parent, spmdp);
 		}
 		radix_tree_preload_end();
 		if (!rc)
@@ -2514,6 +2540,12 @@ static void gmap_shadow_notify_pmd(struct gmap *sg, unsigned long vmaddr,
 		case _SHADOW_RMAP_SEGMENT_LP:
 			gmap_unshadow_segment(sg, raddr);
 			break;
+		case _SHADOW_RMAP_SEGMENT:
+			gmap_unshadow_pgt(sg, raddr);
+			break;
+		case _SHADOW_RMAP_PGTABLE:
+			gmap_unshadow_page(sg, raddr);
+			break;
 		}
 		kfree(rmap);
 	}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 18/22] s390/mm: Remove gmap_pte_op_walk
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (16 preceding siblings ...)
  2017-11-06 22:29 ` [RFC/PATCH 17/22] s390/mm: Add VSIE reverse fake case Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-11-06 22:29 ` [RFC/PATCH 19/22] s390/mm: Split huge pages if granular protection is needed Janosch Frank
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

After large page support was added, there are no users of
gmap_pte_op_walk left. Let's get rid of it.

This patch might later be squashed into the previous commit.

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/mm/gmap.c | 32 --------------------------------
 1 file changed, 32 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 91a0824..95d15e2 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -830,38 +830,6 @@ static inline unsigned long *gmap_table_walk(struct gmap *gmap,
 }
 
 /**
- * gmap_pte_op_walk - walk the gmap page table, get the page table lock
- *		      and return the pte pointer
- * @gmap: pointer to guest mapping meta data structure
- * @gaddr: virtual address in the guest address space
- * @ptl: pointer to the spinlock pointer
- *
- * Returns a pointer to the locked pte for a guest address, or NULL
- *
- * Note: Can also be called for shadow gmaps.
- */
-static pte_t *gmap_pte_op_walk(struct gmap *gmap, unsigned long gaddr,
-			       spinlock_t **ptl)
-{
-	unsigned long *table;
-
-	if (gmap_is_shadow(gmap))
-		spin_lock(&gmap->guest_table_lock);
-	/* Walk the gmap page table, lock and get pte pointer */
-	table = gmap_table_walk(gmap, gaddr, 1); /* get segment pointer */
-	if (!table || *table & _SEGMENT_ENTRY_INVALID) {
-		if (gmap_is_shadow(gmap))
-			spin_unlock(&gmap->guest_table_lock);
-		return NULL;
-	}
-	if (gmap_is_shadow(gmap)) {
-		*ptl = &gmap->guest_table_lock;
-		return pte_offset_map((pmd_t *) table, gaddr);
-	}
-	return pte_alloc_map_lock(gmap->mm, (pmd_t *) table, gaddr, ptl);
-}
-
-/**
  * gmap_pte_op_fixup - force a page in and connect the gmap page table
  * @gmap: pointer to guest mapping meta data structure
  * @gaddr: virtual address in the guest address space
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 19/22] s390/mm: Split huge pages if granular protection is needed
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (17 preceding siblings ...)
  2017-11-06 22:29 ` [RFC/PATCH 18/22] s390/mm: Remove gmap_pte_op_walk Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-12-07 16:32   ` David Hildenbrand
  2017-11-06 22:29 ` [RFC/PATCH 20/22] s390/mm: Enable gmap huge pmd support Janosch Frank
                   ` (2 subsequent siblings)
  21 siblings, 1 reply; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

A guest can put DAT tables for a lower level guest in the same huge
segment as one of its prefixes. This would make it necessary for the
segment to be unprotected (because of the prefix) and protected
(because of the shadowing) at the same time. This is not possible in
this universe.

Hence we split the affected huge segment, so that we can protect it
on a per-page basis. Such gmap segments are special and get a new
software bit that helps us handle this edge case.

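A standalone sketch of what the split amounts to, with illustrative flag
values rather than the real s390 definitions: the 1M segment is expanded
into 256 page table entries that cover the same memory, so protection can
afterwards be applied per 4K page:

#include <stdint.h>
#include <stdio.h>

#define HPAGE_MASK_	(~((1UL << 20) - 1))
#define PAGE_SIZE_	(1UL << 12)
#define PTE_PRESENT_	0x001UL		/* illustrative */
#define PTES_PER_SEG	256

static void split_segment(uint64_t pmd_val, uint64_t table[PTES_PER_SEG])
{
	uint64_t origin = pmd_val & HPAGE_MASK_;
	int i;

	/* each pte maps one 4K page of the former 1M segment */
	for (i = 0; i < PTES_PER_SEG; i++)
		table[i] = (origin + i * PAGE_SIZE_) | PTE_PRESENT_;
}

int main(void)
{
	uint64_t table[PTES_PER_SEG];

	split_segment(0x7f500000UL, table);
	printf("pte[0]   = %#llx\n", (unsigned long long)table[0]);
	printf("pte[255] = %#llx\n", (unsigned long long)table[255]);
	return 0;
}
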
Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/include/asm/gmap.h    |  13 ++
 arch/s390/include/asm/pgtable.h |   7 +-
 arch/s390/mm/fault.c            |  10 +-
 arch/s390/mm/gmap.c             | 257 ++++++++++++++++++++++++++++++++++++----
 arch/s390/mm/pgtable.c          |  51 ++++++++
 5 files changed, 311 insertions(+), 27 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index a50dbc6..15a7834 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -14,6 +14,7 @@
 
 /* Status bits in huge and non-huge gmap segment entries. */
 #define _SEGMENT_ENTRY_GMAP_IN		0x0001	/* invalidation notify bit */
+#define _SEGMENT_ENTRY_GMAP_SPLIT	0x0002  /* split huge pmd */
 /* Status bits only for huge segment entries */
 #define _SEGMENT_ENTRY_GMAP_UC		0x4000	/* user dirty (migration) */
 #define _SEGMENT_ENTRY_GMAP_VSIE	0x8000	/* vsie bit */
@@ -57,6 +58,7 @@ struct gmap {
 	struct radix_tree_root host_to_rmap;
 	struct list_head children;
 	struct list_head pt_list;
+	struct list_head split_list;
 	spinlock_t shadow_lock;
 	struct gmap *parent;
 	unsigned long orig_asce;
@@ -97,6 +99,17 @@ static inline int gmap_is_shadow(struct gmap *gmap)
 	return !!gmap->parent;
 }
 
+/**
+ * gmap_pmd_is_split - Returns if a huge gmap pmd has been split.
+ * @pmdp: pointer to the pmd
+ *
+ * Returns true if the passed huge gmap pmd has been split.
+ */
+static inline bool gmap_pmd_is_split(pmd_t *pmdp)
+{
+	return !!(pmd_val(*pmdp) & _SEGMENT_ENTRY_GMAP_SPLIT);
+}
+
 struct gmap *gmap_create(struct mm_struct *mm, unsigned long limit);
 void gmap_remove(struct gmap *gmap);
 struct gmap *gmap_get(struct gmap *gmap);
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 43c7b01..857087b 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1092,6 +1092,8 @@ void ptep_set_pte_at(struct mm_struct *mm, unsigned long addr,
 void ptep_set_notify(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
 void ptep_notify(struct mm_struct *mm, unsigned long addr,
 		 pte_t *ptep, unsigned long bits);
+void ptep_notify_gmap(struct mm_struct *mm, unsigned long vmaddr,
+		      pte_t *pte, unsigned long bits);
 void pmdp_notify(struct mm_struct *mm, unsigned long addr);
 int ptep_force_prot(struct mm_struct *mm, unsigned long gaddr,
 		    pte_t *ptep, int prot, unsigned long bit);
@@ -1101,8 +1103,11 @@ void ptep_zap_key(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
 int ptep_shadow_pte(struct mm_struct *mm, unsigned long saddr,
 		    pte_t *sptep, pte_t *tptep, pte_t pte);
 void ptep_unshadow_pte(struct mm_struct *mm, unsigned long saddr, pte_t *ptep);
-
+void ptep_remove_dirty_protection_split(struct mm_struct *mm, pte_t *ptep,
+					unsigned long vmaddr);
 bool test_and_clear_guest_dirty(struct mm_struct *mm, unsigned long address);
+bool test_and_clear_guest_dirty_split(struct mm_struct *mm, pmd_t *pmdp,
+				      unsigned long vmaddr);
 int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
 			  unsigned char key, bool nq);
 int cond_set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index bdabb01..052d924 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -401,7 +401,7 @@ static inline int do_exception(struct pt_regs *regs, int access)
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
 	unsigned long trans_exc_code;
-	unsigned long address;
+	unsigned long address, gaddress = 0;
 	unsigned int flags;
 	int fault;
 
@@ -448,6 +448,12 @@ static inline int do_exception(struct pt_regs *regs, int access)
 			fault = VM_FAULT_BADMAP;
 			goto out_up;
 		}
+		/*
+		 * The GMAP code needs the full fault address even
+		 * when using large pages. Hence we take an unmasked
+		 * copy to feed to __gmap_link.
+		 */
+		gaddress = address;
 		if (gmap->pfault_enabled)
 			flags |= FAULT_FLAG_RETRY_NOWAIT;
 	}
@@ -527,7 +533,7 @@ static inline int do_exception(struct pt_regs *regs, int access)
 #ifdef CONFIG_PGSTE
 	if (gmap) {
 		address =  __gmap_link(gmap, current->thread.gmap_addr,
-				       address);
+				       gaddress);
 		if (address == -EFAULT) {
 			fault = VM_FAULT_BADMAP;
 			goto out_up;
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 95d15e2..430dcd9 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -63,6 +63,7 @@ static struct gmap *gmap_alloc(unsigned long limit)
 	INIT_LIST_HEAD(&gmap->crst_list);
 	INIT_LIST_HEAD(&gmap->children);
 	INIT_LIST_HEAD(&gmap->pt_list);
+	INIT_LIST_HEAD(&gmap->split_list);
 	INIT_RADIX_TREE(&gmap->guest_to_host, GFP_KERNEL);
 	INIT_RADIX_TREE(&gmap->host_to_guest, GFP_ATOMIC);
 	INIT_RADIX_TREE(&gmap->host_to_rmap, GFP_ATOMIC);
@@ -194,6 +195,12 @@ static void gmap_free(struct gmap *gmap)
 	gmap_radix_tree_free(&gmap->guest_to_host);
 	gmap_radix_tree_free(&gmap->host_to_guest);
 
+	/* Free split pmd page tables */
+	spin_lock(&gmap->guest_table_lock);
+	list_for_each_entry_safe(page, next, &gmap->split_list, lru)
+		page_table_free_pgste(page);
+	spin_unlock(&gmap->guest_table_lock);
+
 	/* Free additional data for a shadow gmap */
 	if (gmap_is_shadow(gmap)) {
 		/* Free all page tables. */
@@ -545,6 +552,7 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 	pud_t *pud;
 	pmd_t *pmd;
 	pmd_t unprot;
+	pte_t *ptep;
 	int rc;
 
 	BUG_ON(gmap_is_shadow(gmap));
@@ -613,11 +621,26 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 		}
 	} else if (*table & _SEGMENT_ENTRY_PROTECT &&
 		   !(pmd_val(*pmd) & _SEGMENT_ENTRY_PROTECT)) {
+		/*
+		 * We have to get rid of notifier bits in the new pmd
+		 * when unprotecting, as the exchange will drop them
+		 * in the old pmd.
+		 */
 		unprot = __pmd((*table & (_SEGMENT_ENTRY_ORIGIN_LARGE
 					  | _SEGMENT_ENTRY_INVALID
 					  | _SEGMENT_ENTRY_LARGE))
 			       | _SEGMENT_ENTRY_GMAP_UC);
 		gmap_pmdp_xchg(gmap, (pmd_t *)table, unprot, gaddr);
+	} else if (gmap_pmd_is_split((pmd_t *)table)) {
+		/*
+		 * Split pmds are somewhere in-between a normal and a
+		 * large pmd. As we don't share the page table, the
+		 * host does not remove protection on a fault and we
+		 * have to do it ourselves for the guest mapping.
+		 */
+		ptep = (pte_t *) pte_offset_map((pmd_t *)table, gaddr);
+		if (pte_val(*ptep) & _PAGE_PROTECT)
+			ptep_remove_dirty_protection_split(mm, ptep, vmaddr);
 	}
 	spin_unlock(&gmap->guest_table_lock);
 	spin_unlock(ptl);
@@ -896,7 +919,7 @@ static inline pmd_t *gmap_pmd_op_walk(struct gmap *gmap, unsigned long gaddr)
 	 * suffices to take the pte lock later on. Thus we can unlock
 	 * the guest_table_lock here.
 	 */
-	if (!pmd_large(*pmdp) && !gmap_is_shadow(gmap))
+	if (!gmap_pmd_is_split(pmdp) && !pmd_large(*pmdp) && !gmap_is_shadow(gmap))
 		spin_unlock(&gmap->guest_table_lock);
 	return pmdp;
 }
@@ -908,7 +931,7 @@ static inline pmd_t *gmap_pmd_op_walk(struct gmap *gmap, unsigned long gaddr)
  */
 static inline void gmap_pmd_op_end(struct gmap *gmap, pmd_t *pmdp)
 {
-	if (pmd_large(*pmdp) || gmap_is_shadow(gmap))
+	if (gmap_pmd_is_split(pmdp) || pmd_large(*pmdp) || gmap_is_shadow(gmap))
 		spin_unlock(&gmap->guest_table_lock);
 }
 
@@ -956,6 +979,18 @@ static void gmap_pmdp_transfer_prot(struct mm_struct *mm, unsigned long addr,
 	*hpmdp = new;
 }
 
+static void gmap_pte_transfer_prot(struct mm_struct *mm, unsigned long addr,
+				   pte_t *gptep, pmd_t *hpmdp)
+{
+	pmd_t mpmd = __pmd(0);
+
+	if (pte_val(*gptep) & _PAGE_PROTECT)
+		pmd_val(mpmd) |= _SEGMENT_ENTRY_PROTECT;
+	if (pte_val(*gptep) & _PAGE_INVALID)
+		pmd_val(mpmd) |= _SEGMENT_ENTRY_INVALID;
+	gmap_pmdp_transfer_prot(mm, addr, &mpmd, hpmdp);
+}
+
 /**
  * gmap_pmdp_force_prot - change access rights of a locked pmd
  * @mm: pointer to the process mm_struct
@@ -994,6 +1029,63 @@ static int gmap_pmdp_force_prot(struct gmap *gmap, unsigned long addr,
 	return 0;
 }
 
+/**
+ * gmap_pmd_split_free - Free a split pmd's page table
+ * @pmdp: the split pmd whose page table we free
+ *
+ * If the userspace pmds are exchanged, we'll remove the gmap pmds as
+ * well, so we fault on them and link them again. We would leak
+ * memory, if we didn't free split pmds here.
+ */
+static inline void gmap_pmd_split_free(pmd_t *pmdp)
+{
+	unsigned long pgt = pmd_val(*pmdp) & _SEGMENT_ENTRY_ORIGIN;
+	struct page *page;
+
+	if (gmap_pmd_is_split(pmdp)) {
+		page = pfn_to_page(pgt >> PAGE_SHIFT);
+		list_del(&page->lru);
+		page_table_free_pgste(page);
+	}
+}
+
+/**
+ * gmap_pmd_split - Split a huge gmap pmd and use a page table instead
+ * @gmap: pointer to guest mapping meta data structure
+ * @gaddr: virtual address in the guest address space
+ * @pmdp: pointer to the pmd that will be split
+ *
+ * When splitting gmap pmds, we have to make the resulting page table
+ * look like it's a normal one to be able to use the common pte
+ * handling functions. Also we need to track these new tables as they
+ * aren't tracked anywhere else.
+ */
+static int gmap_pmd_split(struct gmap *gmap, unsigned long gaddr, pmd_t *pmdp)
+{
+	unsigned long *table;
+	struct page *page;
+	pmd_t new;
+	int i;
+
+	page = page_table_alloc_pgste(gmap->mm);
+	if (!page)
+		return -ENOMEM;
+	table = (unsigned long *) page_to_phys(page);
+	for (i = 0; i < 256; i++) {
+		table[i] = (pmd_val(*pmdp) & HPAGE_MASK) + i * PAGE_SIZE;
+		/* pmd_large() implies pmd/pte_present() */
+		table[i] |=  _PAGE_PRESENT | _PAGE_READ | _PAGE_WRITE;
+		/* ptes are directly marked as dirty */
+		table[i + PTRS_PER_PTE] |= PGSTE_UC_BIT;
+	}
+
+	pmd_val(new) = ((unsigned long)table | _SEGMENT_ENTRY |
+			(_SEGMENT_ENTRY_GMAP_SPLIT));
+	list_add(&page->lru, &gmap->split_list);
+	gmap_pmdp_xchg(gmap, pmdp, new, gaddr);
+	return 0;
+}
+
 /*
  * gmap_protect_pte - remove access rights to memory and set pgste bits
  * @gmap: pointer to guest mapping meta data structure
@@ -1009,7 +1101,8 @@ static int gmap_pmdp_force_prot(struct gmap *gmap, unsigned long addr,
  * guest_table_lock held for shadow gmaps.
  */
 static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
-			    pmd_t *pmdp, int prot, unsigned long bits)
+			    unsigned long vmaddr, pmd_t *pmdp, pmd_t *hpmdp,
+			    int prot, unsigned long bits)
 {
 	int rc;
 	pte_t *ptep;
@@ -1020,7 +1113,7 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
 	if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
 		return -EAGAIN;
 
-	if (gmap_is_shadow(gmap)) {
+	if (gmap_is_shadow(gmap) || gmap_pmd_is_split(pmdp)) {
 		ptep = pte_offset_map(pmdp, gaddr);
 	} else {
 		ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr, &ptl);
@@ -1034,6 +1127,8 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
 	rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, pbits);
 	if (ptl)
 		gmap_pte_op_end(ptl);
+	if (!rc && gmap_pmd_is_split(pmdp))
+		gmap_pte_transfer_prot(gmap->mm, vmaddr, ptep, hpmdp);
 	return rc;
 }
 
@@ -1058,6 +1153,14 @@ static int gmap_protect_large(struct gmap *gmap, unsigned long gaddr,
 
 	sbits |= (bits & GMAP_ENTRY_IN) ? _SEGMENT_ENTRY_GMAP_IN : 0;
 	sbits |= (bits & GMAP_ENTRY_VSIE) ? _SEGMENT_ENTRY_GMAP_VSIE : 0;
+
+	if (((prot != PROT_WRITE) && (bits & GMAP_ENTRY_VSIE))) {
+		ret = gmap_pmd_split(gmap, gaddr, pmdp);
+		if (ret)
+			return ret;
+		return -EFAULT;
+	}
+
 	/* Protect gmap pmd */
 	ret = gmap_pmdp_force_prot(gmap, gaddr, pmdp, prot, sbits);
 	/*
@@ -1091,27 +1194,35 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 	spinlock_t *ptl;
 	unsigned long vmaddr;
 	pmd_t *pmdp, *hpmdp;
-	int rc;
+	int rc = 0;
 
 	while (len) {
 		rc = -EAGAIN;
 		vmaddr = __gmap_translate(gmap, gaddr);
 		hpmdp = (pmd_t *)huge_pte_offset(gmap->mm, vmaddr, HPAGE_SIZE);
+		if (!hpmdp)
+			BUG();
 		/* Do we need tests here? */
 		ptl = pmd_lock(gmap->mm, hpmdp);
 
 		pmdp = gmap_pmd_op_walk(gmap, gaddr);
 		if (pmdp) {
 			if (!pmd_large(*pmdp)) {
-				rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
-						      bits);
+				if (gmap_pmd_is_split(pmdp) &&
+				    (bits & GMAP_ENTRY_IN)) {
+					pmd_val(*pmdp) |= _SEGMENT_ENTRY_GMAP_IN;
+				}
+
+				rc = gmap_protect_pte(gmap, gaddr, vmaddr,
+						      pmdp, hpmdp, prot, bits);
 				if (!rc) {
 					len -= PAGE_SIZE;
 					gaddr += PAGE_SIZE;
 				}
 			} else {
-				rc =  gmap_protect_large(gmap, gaddr, vmaddr, pmdp, hpmdp,
-							 prot, bits);
+				rc =  gmap_protect_large(gmap, gaddr, vmaddr,
+							 pmdp, hpmdp, prot,
+							 bits);
 				if (!rc) {
 					len = len < HPAGE_SIZE ? 0 : len - HPAGE_SIZE;
 					gaddr = (gaddr & HPAGE_MASK) + HPAGE_SIZE;
@@ -1120,7 +1231,9 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 			gmap_pmd_op_end(gmap, pmdp);
 		}
 		spin_unlock(ptl);
-		if (rc) {
+		if (rc == -EFAULT)
+			continue;
+		if (rc == -EAGAIN) {
 			vmaddr = __gmap_translate(gmap, gaddr);
 			if (IS_ERR_VALUE(vmaddr))
 				return vmaddr;
@@ -1190,7 +1303,7 @@ int gmap_read_table(struct gmap *gmap, unsigned long gaddr, unsigned long *val,
 			if (!pmd_large(*pmdp)) {
 				ptl = NULL;
 				ptep = NULL;
-				if (gmap_is_shadow(gmap))
+				if (gmap_is_shadow(gmap) || gmap_pmd_is_split(pmdp))
 					ptep = pte_offset_map(pmdp, gaddr);
 				else
 					ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr, &ptl);
@@ -1203,6 +1316,8 @@ int gmap_read_table(struct gmap *gmap, unsigned long gaddr, unsigned long *val,
 						*val = *(unsigned long *) address;
 						pte_val(*ptep) |= _PAGE_YOUNG;
 						/* Do *NOT* clear the _PAGE_INVALID bit! */
+						if (gmap_pmd_is_split(pmdp))
+							*fc = 1;
 						rc = 0;
 					}
 					if (ptl)
@@ -1293,7 +1408,7 @@ static int gmap_protect_rmap_pte(struct gmap *sg, struct gmap_rmap *rmap,
 	if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
 		return -EAGAIN;
 
-	if (gmap_is_shadow(sg->parent))
+	if (gmap_is_shadow(sg->parent) || gmap_pmd_is_split(pmdp))
 		ptep = pte_offset_map(pmdp, paddr);
 	else
 		ptep = pte_alloc_map_lock(sg->parent->mm, pmdp, paddr, &ptl);
@@ -1376,12 +1491,12 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
 		}
 		spin_unlock(ptl);
 		radix_tree_preload_end();
-		if (rc) {
+		if (rc)
 			kfree(rmap);
+		if (rc == -EAGAIN) {
 			rc = gmap_pte_op_fixup(parent, paddr, vmaddr, prot);
 			if (rc)
 				return rc;
-			continue;
 		}
 	}
 	return 0;
@@ -2267,7 +2382,8 @@ int gmap_shadow_segment(struct gmap *sg, unsigned long saddr, pmd_t pmd)
 	struct gmap *parent;
 	struct gmap_rmap *rmap;
 	unsigned long vmaddr, paddr;
-	pmd_t spmd, tpmd, *spmdp = NULL, *tpmdp;
+	pmd_t spmd, tpmd, *spmdp = NULL, *tpmdp, *hpmdp;
+	spinlock_t *ptl;
 	int prot;
 	int rc;
 
@@ -2287,6 +2403,11 @@ int gmap_shadow_segment(struct gmap *sg, unsigned long saddr, pmd_t pmd)
 			rc = vmaddr;
 			break;
 		}
+		hpmdp = (pmd_t *)huge_pte_offset(sg->mm, vmaddr, HPAGE_SIZE);
+		if (!hpmdp)
+			BUG();
+		/* Do we need tests here? */
+		ptl = pmd_lock(sg->mm, hpmdp);
 		rc = radix_tree_preload(GFP_KERNEL);
 		if (rc)
 			break;
@@ -2295,12 +2416,15 @@ int gmap_shadow_segment(struct gmap *sg, unsigned long saddr, pmd_t pmd)
 		/* Let's look up the parent's mapping */
 		spmdp = gmap_pmd_op_walk(parent, paddr);
 		if (spmdp) {
+			if (!pmd_large(*spmdp))
+				BUG();
 			spin_lock(&sg->guest_table_lock);
 			/* Get shadow segment table pointer */
 			tpmdp = (pmd_t *) gmap_table_walk(sg, saddr, 1);
 			if (!tpmdp) {
 				spin_unlock(&sg->guest_table_lock);
 				gmap_pmd_op_end(parent, spmdp);
+				spin_unlock(ptl);
 				radix_tree_preload_end();
 				break;
 			}
@@ -2309,6 +2433,7 @@ int gmap_shadow_segment(struct gmap *sg, unsigned long saddr, pmd_t pmd)
 				rc = 0;	/* already shadowed */
 				spin_unlock(&sg->guest_table_lock);
 				gmap_pmd_op_end(parent, spmdp);
+				spin_unlock(ptl);
 				radix_tree_preload_end();
 				break;
 			}
@@ -2316,7 +2441,6 @@ int gmap_shadow_segment(struct gmap *sg, unsigned long saddr, pmd_t pmd)
 			if (!(pmd_val(spmd) & _SEGMENT_ENTRY_INVALID) &&
 			    !((pmd_val(spmd) & _SEGMENT_ENTRY_PROTECT) &&
 			      !(pmd_val(pmd) & _SEGMENT_ENTRY_PROTECT))) {
-
 				*spmdp =  __pmd(pmd_val(spmd) | _SEGMENT_ENTRY_GMAP_VSIE);
 
 				/* Insert shadow ste */
@@ -2331,6 +2455,7 @@ int gmap_shadow_segment(struct gmap *sg, unsigned long saddr, pmd_t pmd)
 			spin_unlock(&sg->guest_table_lock);
 			gmap_pmd_op_end(parent, spmdp);
 		}
+		spin_unlock(ptl);
 		radix_tree_preload_end();
 		if (!rc)
 			break;
@@ -2411,7 +2536,7 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
 				rc = 0;
 			} else {
 				ptl = NULL;
-				if (gmap_is_shadow(parent))
+				if (gmap_is_shadow(parent) || gmap_pmd_is_split(spmdp))
 					sptep = pte_offset_map(spmdp, paddr);
 				else
 					sptep = pte_alloc_map_lock(parent->mm, spmdp, paddr, &ptl);
@@ -2566,6 +2691,9 @@ static void gmap_shadow_notify(struct gmap *sg, unsigned long vmaddr,
 		case _SHADOW_RMAP_REGION3:
 			gmap_unshadow_sgt(sg, raddr);
 			break;
+		case _SHADOW_RMAP_SEGMENT_LP:
+			gmap_unshadow_segment(sg, raddr);
+			break;
 		case _SHADOW_RMAP_SEGMENT:
 			gmap_unshadow_pgt(sg, raddr);
 			break;
@@ -2578,6 +2706,46 @@ static void gmap_shadow_notify(struct gmap *sg, unsigned long vmaddr,
 	spin_unlock(&sg->guest_table_lock);
 }
 
+/**
+ * ptep_notify_gmap - call all invalidation callbacks for a specific pte of a gmap
+ * @mm: pointer to the process mm_struct
+ * @vmaddr: virtual address in the process address space
+ * @pte: pointer to the page table entry
+ * @bits: bits from the pgste that caused the notify call
+ *
+ * This function is assumed to be called with the guest_table_lock held.
+ */
+void ptep_notify_gmap(struct mm_struct *mm, unsigned long vmaddr,
+		      pte_t *pte, unsigned long bits)
+{
+	unsigned long offset, gaddr = 0;
+	unsigned long *table;
+	struct gmap *gmap, *sg, *next;
+
+	offset = ((unsigned long) pte) & (255 * sizeof(pte_t));
+	offset = offset * (4096 / sizeof(pte_t));
+	rcu_read_lock();
+	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
+		table = radix_tree_lookup(&gmap->host_to_guest,
+					  vmaddr >> PMD_SHIFT);
+		if (table)
+			gaddr = __gmap_segment_gaddr(table) + offset;
+		else
+			continue;
+
+		if (!list_empty(&gmap->children) && (bits & PGSTE_VSIE_BIT)) {
+			spin_lock(&gmap->shadow_lock);
+			list_for_each_entry_safe(sg, next,
+						 &gmap->children, list)
+				gmap_shadow_notify(sg, vmaddr, gaddr);
+			spin_unlock(&gmap->shadow_lock);
+		}
+		if (bits & PGSTE_IN_BIT)
+			gmap_call_notifier(gmap, gaddr, gaddr + PAGE_SIZE - 1);
+	}
+	rcu_read_unlock();
+}
+
 /**
  * ptep_notify - call all invalidation callbacks for a specific pte.
  * @mm: pointer to the process mm_struct
@@ -2640,10 +2808,12 @@ static void pmdp_notify_gmap(struct gmap *gmap, unsigned long gaddr)
 	table = gmap_table_walk(gmap, gaddr, 1);
 	if (!table)
 		return;
-	bits = *table & (_SEGMENT_ENTRY_GMAP_IN | _SEGMENT_ENTRY_GMAP_VSIE);
+	bits = *table & _SEGMENT_ENTRY_GMAP_IN;
+	if (pmd_large(__pmd(*table)) && (*table & _SEGMENT_ENTRY_GMAP_VSIE))
+		bits |= _SEGMENT_ENTRY_GMAP_VSIE;
 	if (!bits)
 		return;
-	*table ^= bits;
+	*table &= ~bits;
 	vmaddr = __gmap_translate(gmap, gaddr);
 	if (!list_empty(&gmap->children) && (bits & _SEGMENT_ENTRY_GMAP_VSIE)
 	    && (*table & _SEGMENT_ENTRY_PROTECT)) {
@@ -2657,6 +2827,23 @@ static void pmdp_notify_gmap(struct gmap *gmap, unsigned long gaddr)
 		gmap_call_notifier(gmap, gaddr, gaddr + HPAGE_SIZE - 1);
 }
 
+static void pmdp_notify_split(struct mm_struct *mm, unsigned long vmaddr,
+			      unsigned long *table)
+{
+	int i = 0;
+	unsigned long bits;
+	unsigned long *ptep = (unsigned long *)(*table & PAGE_MASK);
+	unsigned long *pgste = ptep + PTRS_PER_PTE;
+
+	for (; i < 256; i++, vmaddr += PAGE_SIZE, ptep++, pgste++) {
+		bits = *pgste & (PGSTE_IN_BIT | PGSTE_VSIE_BIT);
+		if (bits) {
+			*pgste ^= bits;
+			ptep_notify_gmap(mm, vmaddr, (pte_t *)ptep, bits);
+		}
+	}
+}
+
 /**
  * pmdp_notify - call all invalidation callbacks for a specific pmd
  * @mm: pointer to the process mm_struct
@@ -2678,8 +2865,17 @@ void pmdp_notify(struct mm_struct *mm, unsigned long vmaddr)
 			spin_unlock(&gmap->guest_table_lock);
 			continue;
 		}
-		bits = *table & (_SEGMENT_ENTRY_GMAP_IN | _SEGMENT_ENTRY_GMAP_VSIE);
-		*table ^= bits;
+
+		if (gmap_pmd_is_split((pmd_t *)table)) {
+			pmdp_notify_split(mm, vmaddr, table);
+			spin_unlock(&gmap->guest_table_lock);
+			continue;
+		}
+
+		bits = *table & (_SEGMENT_ENTRY_GMAP_IN);
+		if (pmd_large(__pmd(*table)) && (*table & _SEGMENT_ENTRY_GMAP_VSIE))
+			bits |= _SEGMENT_ENTRY_GMAP_VSIE;
+		*table &= ~bits;
 		gaddr = __gmap_segment_gaddr(table);
 		spin_unlock(&gmap->guest_table_lock);
 		if (!list_empty(&gmap->children) && (bits & _SEGMENT_ENTRY_GMAP_VSIE)) {
@@ -2710,6 +2906,7 @@ static void gmap_pmdp_clear(struct mm_struct *mm, unsigned long vmaddr,
 		if (pmdp) {
 			if (purge)
 				__pmdp_csp(pmdp);
+			gmap_pmd_split_free(pmdp);
 			pmd_val(*pmdp) = _SEGMENT_ENTRY_EMPTY;
 		}
 		spin_unlock(&gmap->guest_table_lock);
@@ -2766,6 +2963,7 @@ void gmap_pmdp_idte_local(struct mm_struct *mm, unsigned long vmaddr)
 			else if (MACHINE_HAS_IDTE)
 				__pmdp_idte(gaddr, pmdp, 0, 0,
 					    IDTE_LOCAL);
+			gmap_pmd_split_free(pmdp);
 			*entry = _SEGMENT_ENTRY_EMPTY;
 		}
 		spin_unlock(&gmap->guest_table_lock);
@@ -2802,6 +3000,8 @@ void gmap_pmdp_idte_global(struct mm_struct *mm, unsigned long vmaddr)
 					    IDTE_GLOBAL);
 			else
 				__pmdp_csp(pmdp);
+
+			gmap_pmd_split_free(pmdp);
 			*entry = _SEGMENT_ENTRY_EMPTY;
 		}
 		spin_unlock(&gmap->guest_table_lock);
@@ -2880,6 +3080,7 @@ void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long bitmap[4],
 	pmd_t *pmdp, *hpmdp;
 	spinlock_t *ptl;
 
+	/* Protect against VSIE unprotection through gmap_link. */
 	hpmdp = (pmd_t *)huge_pte_offset(gmap->mm, vmaddr, HPAGE_SIZE);
 	if (!hpmdp)
 		return;
@@ -2895,9 +3096,17 @@ void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long bitmap[4],
 						      gaddr, vmaddr))
 			memset(bitmap, 0xFF, 32);
 	} else {
-		for (; i < _PAGE_ENTRIES; i++, vmaddr += PAGE_SIZE) {
-			if (test_and_clear_guest_dirty(gmap->mm, vmaddr))
-				set_bit_le(i, bitmap);
+		/* Split pmds track dirty state in the mm's pgstes, handle them here. */
+		if (unlikely(gmap_pmd_is_split(pmdp))) {
+			for (; i < _PAGE_ENTRIES; i++, vmaddr += PAGE_SIZE) {
+				if (test_and_clear_guest_dirty_split(gmap->mm, pmdp, vmaddr))
+					set_bit_le(i, bitmap);
+			}
+		} else {
+			for (; i < _PAGE_ENTRIES; i++, vmaddr += PAGE_SIZE) {
+				if (test_and_clear_guest_dirty(gmap->mm, vmaddr))
+					set_bit_le(i, bitmap);
+			}
 		}
 	}
 	gmap_pmd_op_end(gmap, pmdp);
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 5ce1232..0661986 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -762,6 +762,57 @@ bool test_and_clear_guest_dirty(struct mm_struct *mm, unsigned long addr)
 }
 EXPORT_SYMBOL_GPL(test_and_clear_guest_dirty);
 
+void ptep_remove_dirty_protection_split(struct mm_struct *mm,
+					pte_t *ptep, unsigned long vmaddr)
+{
+	pte_t unprot = __pte(pte_val(*ptep) & ~_PAGE_PROTECT);
+	pgste_t pgste;
+	unsigned long bits;
+
+	pgste = pgste_get_lock(ptep);
+	pgste_val(pgste) |= PGSTE_UC_BIT;
+
+	bits = pgste_val(pgste) & (PGSTE_IN_BIT | PGSTE_VSIE_BIT);
+	pgste_val(pgste) ^= bits;
+	ptep_notify_gmap(mm, vmaddr, ptep, bits);
+	ptep_ipte_global(mm, vmaddr, ptep, 0);
+
+	*ptep = unprot;
+	pgste_set_unlock(ptep, pgste);
+}
+EXPORT_SYMBOL_GPL(ptep_remove_dirty_protection_split);
+
+bool test_and_clear_guest_dirty_split(struct mm_struct *mm, pmd_t *pmdp,
+				      unsigned long vmaddr)
+{
+	bool dirty;
+	pte_t *ptep, pte;
+	pgste_t pgste;
+	unsigned long bits;
+
+	ptep = pte_offset_map(pmdp, vmaddr);
+	pgste = pgste_get_lock(ptep);
+	dirty = !!(pgste_val(pgste) & PGSTE_UC_BIT);
+	pgste_val(pgste) &= ~PGSTE_UC_BIT;
+	pte = *ptep;
+	if (dirty) {
+		bits = pgste_val(pgste) & (PGSTE_IN_BIT | PGSTE_VSIE_BIT);
+		if (bits) {
+			pgste_val(pgste) ^= bits;
+			ptep_notify_gmap(mm, vmaddr, ptep, bits);
+		}
+		ptep_ipte_global(mm, vmaddr, ptep, 0);
+		if (MACHINE_HAS_ESOP || !(pte_val(pte) & _PAGE_WRITE))
+			pte_val(pte) |= _PAGE_PROTECT;
+		else
+			pte_val(pte) |= _PAGE_INVALID;
+		*ptep = pte;
+	}
+	pgste_set_unlock(ptep, pgste);
+	return dirty;
+}
+EXPORT_SYMBOL_GPL(test_and_clear_guest_dirty_split);
+
 int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
 			  unsigned char key, bool nq)
 {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 20/22] s390/mm: Enable gmap huge pmd support
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (18 preceding siblings ...)
  2017-11-06 22:29 ` [RFC/PATCH 19/22] s390/mm: Split huge pages if granular protection is needed Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-11-15 10:08   ` David Hildenbrand
  2017-11-06 22:29 ` [RFC/PATCH 21/22] KVM: s390: Add KVM HPAGE capability Janosch Frank
  2017-11-06 22:30 ` [RFC/PATCH 22/22] RFC: s390/mm: Add gmap lock classes Janosch Frank
  21 siblings, 1 reply; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

Now that we have everything in place, let's allow huge pmds for gmap
linking, effectively allowing huge page backed guests.

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/mm/gmap.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 430dcd9..3cc2765 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -1,8 +1,10 @@
 /*
  *  KVM guest address space mapping code
  *
- *    Copyright IBM Corp. 2007, 2016
+ *    Copyright IBM Corp. 2007, 2016, 2017
  *    Author(s): Martin Schwidefsky <schwidefsky@de.ibm.com>
+ *		 David Hildenbrand <david@redhat.com>
+ *		 Janosch Frank <frankja@linux.vnet.ibm.com>
  */
 
 #include <linux/kernel.h>
@@ -596,9 +598,6 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 		return -EFAULT;
 	pmd = pmd_offset(pud, vmaddr);
 	VM_BUG_ON(pmd_none(*pmd));
-	/* large pmds cannot yet be handled */
-	if (pmd_large(*pmd))
-		return -EFAULT;
 	/* Link gmap segment table entry location to page table. */
 	rc = radix_tree_preload(GFP_KERNEL);
 	if (rc)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 21/22] KVM: s390: Add KVM HPAGE capability
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (19 preceding siblings ...)
  2017-11-06 22:29 ` [RFC/PATCH 20/22] s390/mm: Enable gmap huge pmd support Janosch Frank
@ 2017-11-06 22:29 ` Janosch Frank
  2017-11-07 10:07   ` Cornelia Huck
  2017-11-15 10:06   ` David Hildenbrand
  2017-11-06 22:30 ` [RFC/PATCH 22/22] RFC: s390/mm: Add gmap lock classes Janosch Frank
  21 siblings, 2 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:29 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

KVM huge page backing support cannot be easily tested under
s390. Currently testing is only possible after most of the guest has
already been set up.

To indicate that KVM has huge page backing support, we add the
KVM_CAP_S390_HPAGE capability. This does not mean that transparent
huge pages are supported.
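
As a rough illustration (not part of this patch), userspace could probe
the new capability with the standard KVM_CHECK_EXTENSION ioctl. The
helper name and fd handling below are made up for the example:

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Sketch: returns nonzero if the kernel advertises huge page backing. */
	static int kvm_has_s390_hpage(int kvm_fd)
	{
		/* kvm_fd is assumed to be an open file descriptor of /dev/kvm. */
		return ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_S390_HPAGE) > 0;
	}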

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 include/uapi/linux/kvm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 8388875..2cd359a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -930,6 +930,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_SMT_POSSIBLE 147
 #define KVM_CAP_HYPERV_SYNIC2 148
 #define KVM_CAP_HYPERV_VP_INDEX 149
+#define KVM_CAP_S390_HPAGE 150
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [RFC/PATCH 22/22] RFC: s390/mm: Add gmap lock classes
  2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (20 preceding siblings ...)
  2017-11-06 22:29 ` [RFC/PATCH 21/22] KVM: s390: Add KVM HPAGE capability Janosch Frank
@ 2017-11-06 22:30 ` Janosch Frank
  2017-11-15 10:10   ` David Hildenbrand
  21 siblings, 1 reply; 42+ messages in thread
From: Janosch Frank @ 2017-11-06 22:30 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

A shadow gmap and its parent are locked right after each other when
doing VSIE management. Lockdep can't differentiate between the two
classes without some help.

TODO: Not sure yet if I have to annotate all lock sites and whether
gmap_pmd_walk will be used by both shadow and parent gmaps.
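
To illustrate the nesting these subclasses describe, here is a simplified
sketch of the lock order, not an actual call site from this patch:

	spin_lock_nested(&parent->guest_table_lock, GMAP_LOCK_PARENT);
	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
	/* ... update the shadow tables while both locks are held ... */
	spin_unlock(&sg->guest_table_lock);
	spin_unlock(&parent->guest_table_lock);

With distinct subclasses, lockdep no longer reports the nested
guest_table_locks as a self-deadlock of a single lock class.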

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/include/asm/gmap.h |  6 ++++++
 arch/s390/mm/gmap.c          | 40 ++++++++++++++++++++--------------------
 2 files changed, 26 insertions(+), 20 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 15a7834..31fad98 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -19,6 +19,12 @@
 #define _SEGMENT_ENTRY_GMAP_UC		0x4000	/* user dirty (migration) */
 #define _SEGMENT_ENTRY_GMAP_VSIE	0x8000	/* vsie bit */
 
+
+enum gmap_lock_class {
+	GMAP_LOCK_PARENT,
+	GMAP_LOCK_SHADOW
+};
+
 /**
  * struct gmap_struct - guest address space
  * @list: list head for the mm->context gmap list
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 3cc2765..b17f424 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -198,7 +198,7 @@ static void gmap_free(struct gmap *gmap)
 	gmap_radix_tree_free(&gmap->host_to_guest);
 
 	/* Free split pmd page tables */
-	spin_lock(&gmap->guest_table_lock);
+	spin_lock_nested(&gmap->guest_table_lock, GMAP_LOCK_PARENT);
 	list_for_each_entry_safe(page, next, &gmap->split_list, lru)
 		page_table_free_pgste(page);
 	spin_unlock(&gmap->guest_table_lock);
@@ -1385,7 +1385,7 @@ static int gmap_protect_rmap_large(struct gmap *sg, struct gmap_rmap *rmap,
 	if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
 		return -EAGAIN;
 
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	rc = gmap_protect_large(sg->parent, paddr, vmaddr, pmdp, hpmdp,
 				prot, GMAP_ENTRY_VSIE);
 	if (!rc)
@@ -1413,7 +1413,7 @@ static int gmap_protect_rmap_pte(struct gmap *sg, struct gmap_rmap *rmap,
 		ptep = pte_alloc_map_lock(sg->parent->mm, pmdp, paddr, &ptl);
 
 	if (ptep) {
-		spin_lock(&sg->guest_table_lock);
+		spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 		rc = ptep_force_prot(sg->parent->mm, paddr, ptep, prot,
 				     PGSTE_VSIE_BIT);
 		if (!rc)
@@ -1932,7 +1932,7 @@ struct gmap *gmap_shadow(struct gmap *parent, unsigned long asce,
 		/* only allow one real-space gmap shadow */
 		list_for_each_entry(sg, &parent->children, list) {
 			if (sg->orig_asce & _ASCE_REAL_SPACE) {
-				spin_lock(&sg->guest_table_lock);
+				spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 				gmap_unshadow(sg);
 				spin_unlock(&sg->guest_table_lock);
 				list_del(&sg->list);
@@ -2004,7 +2004,7 @@ int gmap_shadow_r2t(struct gmap *sg, unsigned long saddr, unsigned long r2t,
 		page->index |= GMAP_SHADOW_FAKE_TABLE;
 	s_r2t = (unsigned long *) page_to_phys(page);
 	/* Install shadow region second table */
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	table = gmap_table_walk(sg, saddr, 4); /* get region-1 pointer */
 	if (!table) {
 		rc = -EAGAIN;		/* Race with unshadow */
@@ -2037,7 +2037,7 @@ int gmap_shadow_r2t(struct gmap *sg, unsigned long saddr, unsigned long r2t,
 	offset = ((r2t & _REGION_ENTRY_OFFSET) >> 6) * PAGE_SIZE;
 	len = ((r2t & _REGION_ENTRY_LENGTH) + 1) * PAGE_SIZE - offset;
 	rc = gmap_protect_rmap(sg, raddr, origin + offset, len, PROT_READ);
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (!rc) {
 		table = gmap_table_walk(sg, saddr, 4);
 		if (!table || (*table & _REGION_ENTRY_ORIGIN) !=
@@ -2088,7 +2088,7 @@ int gmap_shadow_r3t(struct gmap *sg, unsigned long saddr, unsigned long r3t,
 		page->index |= GMAP_SHADOW_FAKE_TABLE;
 	s_r3t = (unsigned long *) page_to_phys(page);
 	/* Install shadow region second table */
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	table = gmap_table_walk(sg, saddr, 3); /* get region-2 pointer */
 	if (!table) {
 		rc = -EAGAIN;		/* Race with unshadow */
@@ -2120,7 +2120,7 @@ int gmap_shadow_r3t(struct gmap *sg, unsigned long saddr, unsigned long r3t,
 	offset = ((r3t & _REGION_ENTRY_OFFSET) >> 6) * PAGE_SIZE;
 	len = ((r3t & _REGION_ENTRY_LENGTH) + 1) * PAGE_SIZE - offset;
 	rc = gmap_protect_rmap(sg, raddr, origin + offset, len, PROT_READ);
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (!rc) {
 		table = gmap_table_walk(sg, saddr, 3);
 		if (!table || (*table & _REGION_ENTRY_ORIGIN) !=
@@ -2171,7 +2171,7 @@ int gmap_shadow_sgt(struct gmap *sg, unsigned long saddr, unsigned long sgt,
 		page->index |= GMAP_SHADOW_FAKE_TABLE;
 	s_sgt = (unsigned long *) page_to_phys(page);
 	/* Install shadow region second table */
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	table = gmap_table_walk(sg, saddr, 2); /* get region-3 pointer */
 	if (!table) {
 		rc = -EAGAIN;		/* Race with unshadow */
@@ -2204,7 +2204,7 @@ int gmap_shadow_sgt(struct gmap *sg, unsigned long saddr, unsigned long sgt,
 	offset = ((sgt & _REGION_ENTRY_OFFSET) >> 6) * PAGE_SIZE;
 	len = ((sgt & _REGION_ENTRY_LENGTH) + 1) * PAGE_SIZE - offset;
 	rc = gmap_protect_rmap(sg, raddr, origin + offset, len, PROT_READ);
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (!rc) {
 		table = gmap_table_walk(sg, saddr, 2);
 		if (!table || (*table & _REGION_ENTRY_ORIGIN) !=
@@ -2260,7 +2260,7 @@ int gmap_shadow_sgt_lookup(struct gmap *sg, unsigned long saddr,
 	int rc = -EAGAIN;
 
 	BUG_ON(!gmap_is_shadow(sg));
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (sg->asce & _ASCE_TYPE_MASK) {
 		/* >2 GB guest */
 		r3e = (unsigned long *) gmap_table_walk(sg, saddr, 2);
@@ -2327,7 +2327,7 @@ int gmap_shadow_pgt(struct gmap *sg, unsigned long saddr, unsigned long pgt,
 		page->index |= GMAP_SHADOW_FAKE_TABLE;
 	s_pgt = (unsigned long *) page_to_phys(page);
 	/* Install shadow page table */
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	table = gmap_table_walk(sg, saddr, 1); /* get segment pointer */
 	if (!table) {
 		rc = -EAGAIN;		/* Race with unshadow */
@@ -2355,7 +2355,7 @@ int gmap_shadow_pgt(struct gmap *sg, unsigned long saddr, unsigned long pgt,
 	raddr = (saddr & _SEGMENT_MASK) | _SHADOW_RMAP_SEGMENT;
 	origin = pgt & _SEGMENT_ENTRY_ORIGIN & PAGE_MASK;
 	rc = gmap_protect_rmap(sg, raddr, origin, PAGE_SIZE, PROT_READ);
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (!rc) {
 		table = gmap_table_walk(sg, saddr, 1);
 		if (!table || (*table & _SEGMENT_ENTRY_ORIGIN) !=
@@ -2417,7 +2417,7 @@ int gmap_shadow_segment(struct gmap *sg, unsigned long saddr, pmd_t pmd)
 		if (spmdp) {
 			if (!pmd_large(*spmdp))
 				BUG();
-			spin_lock(&sg->guest_table_lock);
+			spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 			/* Get shadow segment table pointer */
 			tpmdp = (pmd_t *) gmap_table_walk(sg, saddr, 1);
 			if (!tpmdp) {
@@ -2513,7 +2513,7 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
 		rc = -EAGAIN;
 		spmdp = gmap_pmd_op_walk(parent, paddr);
 		if (spmdp && !(pmd_val(*spmdp) & _SEGMENT_ENTRY_INVALID)) {
-			spin_lock(&sg->guest_table_lock);
+			spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 			/* Get page table pointer */
 			tptep = (pte_t *) gmap_table_walk(sg, saddr, 0);
 			if (!tptep) {
@@ -2597,7 +2597,7 @@ static void gmap_shadow_notify_pmd(struct gmap *sg, unsigned long vmaddr,
 
 	BUG_ON(!gmap_is_shadow(sg));
 
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (sg->removed) {
 		spin_unlock(&sg->guest_table_lock);
 		return;
@@ -2658,7 +2658,7 @@ static void gmap_shadow_notify(struct gmap *sg, unsigned long vmaddr,
 
 	BUG_ON(!gmap_is_shadow(sg));
 
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (sg->removed) {
 		spin_unlock(&sg->guest_table_lock);
 		return;
@@ -2899,7 +2899,7 @@ static void gmap_pmdp_clear(struct mm_struct *mm, unsigned long vmaddr,
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
-		spin_lock(&gmap->guest_table_lock);
+		spin_lock_nested(&gmap->guest_table_lock, GMAP_LOCK_PARENT);
 		pmdp = (pmd_t *)radix_tree_delete(&gmap->host_to_guest,
 						   vmaddr >> PMD_SHIFT);
 		if (pmdp) {
@@ -2949,7 +2949,7 @@ void gmap_pmdp_idte_local(struct mm_struct *mm, unsigned long vmaddr)
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
-		spin_lock(&gmap->guest_table_lock);
+		spin_lock_nested(&gmap->guest_table_lock, GMAP_LOCK_PARENT);
 		entry = radix_tree_delete(&gmap->host_to_guest,
 					  vmaddr >> PMD_SHIFT);
 		if (entry) {
@@ -2984,7 +2984,7 @@ void gmap_pmdp_idte_global(struct mm_struct *mm, unsigned long vmaddr)
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
-		spin_lock(&gmap->guest_table_lock);
+		spin_lock_nested(&gmap->guest_table_lock, GMAP_LOCK_PARENT);
 		entry = radix_tree_delete(&gmap->host_to_guest,
 					  vmaddr >> PMD_SHIFT);
 		if (entry) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC/PATCH 21/22] KVM: s390: Add KVM HPAGE capability
  2017-11-06 22:29 ` [RFC/PATCH 21/22] KVM: s390: Add KVM HPAGE capability Janosch Frank
@ 2017-11-07 10:07   ` Cornelia Huck
  2017-11-07 10:53     ` Janosch Frank
  2017-11-15 10:06   ` David Hildenbrand
  1 sibling, 1 reply; 42+ messages in thread
From: Cornelia Huck @ 2017-11-07 10:07 UTC (permalink / raw)
  To: Janosch Frank
  Cc: kvm, schwidefsky, borntraeger, david, dominik.dingel, linux-s390

On Mon,  6 Nov 2017 23:29:59 +0100
Janosch Frank <frankja@linux.vnet.ibm.com> wrote:

> KVM huge page backing support can not be easily tested under
> s390. Currently testing is only possible after most of the guest has
> already been set up.
> 
> To indicate, that KVM has huge page backing support, we add the
> KVM_CAP_S390_HPAGE capability. This does not mean, that transparent
> huge pages are supported.
> 
> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
> ---
>  include/uapi/linux/kvm.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 8388875..2cd359a 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -930,6 +930,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_PPC_SMT_POSSIBLE 147
>  #define KVM_CAP_HYPERV_SYNIC2 148
>  #define KVM_CAP_HYPERV_VP_INDEX 149
> +#define KVM_CAP_S390_HPAGE 150
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  

Not looking into the rest of this patch set (for fear of my sanity :),
but I think this needs to be documented in
Documentation/virtual/kvm/api.txt.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC/PATCH 21/22] KVM: s390: Add KVM HPAGE capability
  2017-11-07 10:07   ` Cornelia Huck
@ 2017-11-07 10:53     ` Janosch Frank
  0 siblings, 0 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-07 10:53 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: kvm, schwidefsky, borntraeger, david, dominik.dingel, linux-s390


On 07.11.2017 11:07, Cornelia Huck wrote:
> On Mon,  6 Nov 2017 23:29:59 +0100
> Janosch Frank <frankja@linux.vnet.ibm.com> wrote:
> 
>> KVM huge page backing support can not be easily tested under
>> s390. Currently testing is only possible after most of the guest has
>> already been set up.
>>
>> To indicate, that KVM has huge page backing support, we add the
>> KVM_CAP_S390_HPAGE capability. This does not mean, that transparent
>> huge pages are supported.
>>
>> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
>> ---
>>  include/uapi/linux/kvm.h | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 8388875..2cd359a 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -930,6 +930,7 @@ struct kvm_ppc_resize_hpt {
>>  #define KVM_CAP_PPC_SMT_POSSIBLE 147
>>  #define KVM_CAP_HYPERV_SYNIC2 148
>>  #define KVM_CAP_HYPERV_VP_INDEX 149
>> +#define KVM_CAP_S390_HPAGE 150
>>  
>>  #ifdef KVM_CAP_IRQ_ROUTING
>>  
> 
> Not looking into the rest of this patch set (for fear of my sanity :),
> but I think this needs to be documented in
> Documentation/virtual/kvm/api.txt.

I'll add it right away, before I forget it :)
Thanks for looking and risking your sanity.




^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC/PATCH 01/22] s390/mm: make gmap_protect_range more modular
  2017-11-06 22:29 ` [RFC/PATCH 01/22] s390/mm: make gmap_protect_range more modular Janosch Frank
@ 2017-11-08 10:40   ` David Hildenbrand
  2017-11-08 12:21     ` Janosch Frank
  0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2017-11-08 10:40 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

On 06.11.2017 23:29, Janosch Frank wrote:
> This patch reworks the gmap_protect_range logic and extracts the pte
> handling into an own function. Also we do now walk to the pmd and make
> it accessible in the function for later use. This way we can add huge
> page handling logic more easily.

I just realized (and hope it is correct) that any gmap_is_shadow() checks
in e.g. gmap_pte_op_walk() are superfluous. That code is never reached.

This would imply that also in this patch, you can drop all
gmap_is_shadow(gmap) checks and instead add BUG_ON(gmap_is_shadow()) to
all functions.

Will double check and prepare a cleanup for existing code.

> 
> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
> Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
> ---
>  arch/s390/mm/gmap.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 95 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 2f66290..9757242 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -876,6 +876,91 @@ static void gmap_pte_op_end(spinlock_t *ptl)
>  	spin_unlock(ptl);
>  }
>  
> +/**
> + * gmap_pmd_op_walk - walk the gmap tables, get the guest table lock
> + *		      and return the pmd pointer
> + * @gmap: pointer to guest mapping meta data structure
> + * @gaddr: virtual address in the guest address space
> + *
> + * Returns a pointer to the pmd for a guest address, or NULL
> + */
> +static inline pmd_t *gmap_pmd_op_walk(struct gmap *gmap, unsigned long gaddr)
> +{
> +	pmd_t *pmdp;
> +
> +	spin_lock(&gmap->guest_table_lock);
> +	pmdp = (pmd_t *) gmap_table_walk(gmap, gaddr, 1);
> +
> +	/*
> +	 * Empty pmds can become large after we give up the
> +	 * guest_table_lock, so we have to check for pmd_none
> +	 * here.
> +	 */
> +	if (!pmdp || pmd_none(*pmdp)) {
> +		spin_unlock(&gmap->guest_table_lock);
> +		return NULL;
> +	}
> +	/*
> +	 * For plain 4k guests that do not run under the vsie it
> +	 * suffices to take the pte lock later on. Thus we can unlock
> +	 * the guest_table_lock here.
> +	 */
> +	if (!pmd_large(*pmdp) && !gmap_is_shadow(gmap))
> +		spin_unlock(&gmap->guest_table_lock);
> +	return pmdp;
> +}
> +
> +/**
> + * gmap_pmd_op_end - release the guest_table_lock if needed
> + * @gmap: pointer to the guest mapping meta data structure
> + * @pmdp: pointer to the pmd
> + */
> +static inline void gmap_pmd_op_end(struct gmap *gmap, pmd_t *pmdp)
> +{
> +	if (pmd_large(*pmdp) || gmap_is_shadow(gmap))
> +		spin_unlock(&gmap->guest_table_lock);
> +}
> +
> +/*
> + * gmap_protect_pte - remove access rights to memory and set pgste bits
> + * @gmap: pointer to guest mapping meta data structure
> + * @gaddr: virtual address in the guest address space
> + * @pmdp: pointer to the pmd associated with the pte
> + * @prot: indicates access rights: PROT_NONE, PROT_READ or PROT_WRITE
> + * @bits: pgste notification bits to set
> + *
> + * Returns 0 if successfully protected, -ENOMEM if out of memory and
> + * -EAGAIN if a fixup is needed.
> + *
> + * Expected to be called with sg->mm->mmap_sem in read and
> + * guest_table_lock held for shadow gmaps.
> + */
> +static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
> +			    pmd_t *pmdp, int prot, unsigned long bits)
> +{
> +	int rc;
> +	pte_t *ptep;
> +	spinlock_t *ptl = NULL;
> +
> +	/* We have no upper segment, let's go back and fix this up. */
> +	if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
> +		return -EAGAIN;
> +
> +	if (gmap_is_shadow(gmap)) {
> +		ptep = pte_offset_map(pmdp, gaddr);
> +	} else {
> +		ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr, &ptl);
> +		if (!ptep)
> +			return -ENOMEM;
> +	}

if (gmap_is_shadow(gmap))
	ptep = pte_offset_map(pmdp, gaddr);
else
	ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr, &ptl);

if (!ptep)
	return -ENOMEM;

Makes it a little bit nicer to read.

> +
> +	/* Protect and unlock. */
> +	rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, bits);
> +	if (ptl)
> +		gmap_pte_op_end(ptl);

Would it make sense to move the ptl test into gmap_pte_op_end() ?

> +	return rc;
> +}
> +
>  /*
>   * gmap_protect_range - remove access rights to memory and set pgste bits
>   * @gmap: pointer to guest mapping meta data structure
> @@ -895,16 +980,20 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
>  			      unsigned long len, int prot, unsigned long bits)
>  {
>  	unsigned long vmaddr;
> -	spinlock_t *ptl;
> -	pte_t *ptep;
> +	pmd_t *pmdp;
>  	int rc;
>  
>  	while (len) {
>  		rc = -EAGAIN;
> -		ptep = gmap_pte_op_walk(gmap, gaddr, &ptl);
> -		if (ptep) {
> -			rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, bits);
> -			gmap_pte_op_end(ptl);
> +		pmdp = gmap_pmd_op_walk(gmap, gaddr);
> +		if (pmdp) {
> +			rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
> +					      bits);
> +			if (!rc) {
> +				len -= PAGE_SIZE;
> +				gaddr += PAGE_SIZE;
> +			}
> +			gmap_pmd_op_end(gmap, pmdp);
>  		}
>  		if (rc) {
>  			vmaddr = __gmap_translate(gmap, gaddr);
> @@ -913,10 +1002,7 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
>  			rc = gmap_pte_op_fixup(gmap, gaddr, vmaddr, prot);
>  			if (rc)
>  				return rc;
> -			continue;
>  		}
> -		gaddr += PAGE_SIZE;
> -		len -= PAGE_SIZE;

Not sure if I like this change. I think I prefer it the way it is now
(keeps the loop body shorter).

>  	}
>  	return 0;
>  }
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC/PATCH 01/22] s390/mm: make gmap_protect_range more modular
  2017-11-08 10:40   ` David Hildenbrand
@ 2017-11-08 12:21     ` Janosch Frank
  2017-11-08 12:26       ` David Hildenbrand
  0 siblings, 1 reply; 42+ messages in thread
From: Janosch Frank @ 2017-11-08 12:21 UTC (permalink / raw)
  To: David Hildenbrand, kvm
  Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390


On 08.11.2017 11:40, David Hildenbrand wrote:
> On 06.11.2017 23:29, Janosch Frank wrote:
>> This patch reworks the gmap_protect_range logic and extracts the pte
>> handling into an own function. Also we do now walk to the pmd and make
>> it accessible in the function for later use. This way we can add huge
>> page handling logic more easily.
> 
> I just realized (and hope it is correct), that any gmap_shadow() checks
> in e.g. gmap_pte_op_walk() are superfluous. This code is never reached.
> 
> This would imply that also in this patch, you can drop all
> gmap_is_shadow(gmap) checks and instead add BUG_ON(gmap_is_shadow()) to
> all functions.
> 
> Will double check and prepare a cleanup for existing code.

Hrm, looks like it.

Be aware that I'm very protective about changes in the GMAP, especially
on the VSIE part. I've had minimal changes to these patches set me back
for weeks, so if you want to change things, be absolutely sure that they
will work (also with further changes) or add them after this set is
merged.

Thanks for having a look!

>> +static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
>> +			    pmd_t *pmdp, int prot, unsigned long bits)
>> +{
>> +	int rc;
>> +	pte_t *ptep;
>> +	spinlock_t *ptl = NULL;
>> +
>> +	/* We have no upper segment, let's go back and fix this up. */
>> +	if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
>> +		return -EAGAIN;
>> +
>> +	if (gmap_is_shadow(gmap)) {
>> +		ptep = pte_offset_map(pmdp, gaddr);
>> +	} else {
>> +		ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr, &ptl);
>> +		if (!ptep)
>> +			return -ENOMEM;
>> +	}
> 
> if (gmap_is_shadow(gmap))
> 	ptep = pte_offset_map(pmdp, gaddr);
> else
> 	ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr, &ptl);
> 
> if (!ptep)
> 	return -ENOMEM;
> 
> Makes it a little bit nicer to read.

Sure

> 
>> +
>> +	/* Protect and unlock. */
>> +	rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, bits);
>> +	if (ptl)
>> +		gmap_pte_op_end(ptl);
> 
> Would it make sense to move the ptl test into gmap_pte_op_end() ?

That's a great idea, I have a lot of these in the following patches.
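
For reference, the suggested change would presumably look something like
this (a sketch only, not the final code):

	static void gmap_pte_op_end(spinlock_t *ptl)
	{
		if (ptl)
			spin_unlock(ptl);
	}

so that callers can unconditionally call gmap_pte_op_end(ptl) and drop
their own NULL checks.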

> 
>> +	return rc;
>> +}
>> +
>> @@ -913,10 +1002,7 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
>>  			rc = gmap_pte_op_fixup(gmap, gaddr, vmaddr, prot);
>>  			if (rc)
>>  				return rc;
>> -			continue;
>>  		}
>> -		gaddr += PAGE_SIZE;
>> -		len -= PAGE_SIZE;
> 
> Not sure if I like this change. I think I prefer it the way it is now
> (keeps the loop body shorter).

Well, this loop and a lot of others will get longer through the other
patches in the set.

They need some refactoring, but just to be safe, start getting used to it :)



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC/PATCH 01/22] s390/mm: make gmap_protect_range more modular
  2017-11-08 12:21     ` Janosch Frank
@ 2017-11-08 12:26       ` David Hildenbrand
  0 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2017-11-08 12:26 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

On 08.11.2017 13:21, Janosch Frank wrote:
> On 08.11.2017 11:40, David Hildenbrand wrote:
>> On 06.11.2017 23:29, Janosch Frank wrote:
>>> This patch reworks the gmap_protect_range logic and extracts the pte
>>> handling into an own function. Also we do now walk to the pmd and make
>>> it accessible in the function for later use. This way we can add huge
>>> page handling logic more easily.
>>
>> I just realized (and hope it is correct), that any gmap_shadow() checks
>> in e.g. gmap_pte_op_walk() are superfluous. This code is never reached.
>>
>> This would imply that also in this patch, you can drop all
>> gmap_is_shadow(gmap) checks and instead add BUG_ON(gmap_is_shadow()) to
>> all functions.
>>
>> Will double check and prepare a cleanup for existing code.
> 
> Hrm, looks like it.
> 
> Be aware, that I'm very protective about changes in the GMAP, especially
> on the VSIE part. I've had minimal changes on these patches setting me
> back for weeks, so if you want to change things, be absolutely sure,
> that it will work (also with further changes) or add them after this set
> is merged.

I know that pain, but getting it in a clean shape is the way to go. This
is RFC after all. And understanding the code is better than rushing just
to get changes in.

Especially if it helps to clarify your changes (e.g. mprotect_notify
never applying to shadow gmaps).

> 
> Thanks for having a look!

Will try to have a look at the other patches but might take some time
(complicated stuff ...).

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC/PATCH 02/22] s390/mm: Abstract gmap notify bit setting
  2017-11-06 22:29 ` [RFC/PATCH 02/22] s390/mm: Abstract gmap notify bit setting Janosch Frank
@ 2017-11-10 12:57   ` David Hildenbrand
  2017-11-13 15:57     ` Janosch Frank
  0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2017-11-10 12:57 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

On 06.11.2017 23:29, Janosch Frank wrote:
> Currently we use the software PGSTE bits PGSTE_IN_BIT and PGSTE_VSIE_BIT
> to notify before an invalidation occurs on a prefix page or a VSIE page
> respectively. Both bits only work for a PGSTE, which only exists for
> page tables.
> 
> For huge page support we also need such bits for segments (pmds) so
> let's introduce abstract GMAP_ENTRY_* bits that will be realized into
> the respective bits when gmap DAT table entries are protected.
> 
> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
> ---
>  arch/s390/include/asm/gmap.h |  4 ++++
>  arch/s390/mm/gmap.c          | 13 ++++++++-----
>  2 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
> index 741ddba..f3d84a8 100644
> --- a/arch/s390/include/asm/gmap.h
> +++ b/arch/s390/include/asm/gmap.h
> @@ -8,6 +8,10 @@
>  #ifndef _ASM_S390_GMAP_H
>  #define _ASM_S390_GMAP_H
>  
> +/* Generic bits for GMAP notification on DAT table entry changes. */
> +#define GMAP_ENTRY_VSIE	0x2
> +#define GMAP_ENTRY_IN	0x1

Can we rename
GMAP_ENTRY_VSIE -> GMAP_ENTRY_SHADOW_IN

and also

PGSTE_VSIE_BIT -> PGSTE_SHADOW_IN_BIT

as they are used for managing shadow page table invalidation.

Also, a better fitting name for ENTRY would be nice :)
(GMAP_NOTIFY_...)

> +
>  /**
>   * struct gmap_struct - guest address space
>   * @list: list head for the mm->context gmap list
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 9757242..74e2062 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -927,7 +927,7 @@ static inline void gmap_pmd_op_end(struct gmap *gmap, pmd_t *pmdp)
>   * @gaddr: virtual address in the guest address space
>   * @pmdp: pointer to the pmd associated with the pte
>   * @prot: indicates access rights: PROT_NONE, PROT_READ or PROT_WRITE
> - * @bits: pgste notification bits to set
> + * @bits: notification bits to set
>   *
>   * Returns 0 if successfully protected, -ENOMEM if out of memory and
>   * -EAGAIN if a fixup is needed.
> @@ -941,6 +941,7 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
>  	int rc;
>  	pte_t *ptep;
>  	spinlock_t *ptl = NULL;
> +	unsigned long pbits = 0;
>  
>  	/* We have no upper segment, let's go back and fix this up. */
>  	if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
> @@ -954,8 +955,10 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
>  			return -ENOMEM;
>  	}
>  
> +	pbits |= (bits & GMAP_ENTRY_IN) ? PGSTE_IN_BIT : 0;
> +	pbits |= (bits & GMAP_ENTRY_VSIE) ? PGSTE_VSIE_BIT : 0;
>  	/* Protect and unlock. */
> -	rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, bits);
> +	rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, pbits);
>  	if (ptl)
>  		gmap_pte_op_end(ptl);
>  	return rc;
> @@ -1031,7 +1034,7 @@ int gmap_mprotect_notify(struct gmap *gmap, unsigned long gaddr,
>  	if (!MACHINE_HAS_ESOP && prot == PROT_READ)
>  		return -EINVAL;
>  	down_read(&gmap->mm->mmap_sem);
> -	rc = gmap_protect_range(gmap, gaddr, len, prot, PGSTE_IN_BIT);
> +	rc = gmap_protect_range(gmap, gaddr, len, prot, GMAP_ENTRY_IN);
>  	up_read(&gmap->mm->mmap_sem);
>  	return rc;
>  }
> @@ -1153,7 +1156,7 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
>  		if (ptep) {
>  			spin_lock(&sg->guest_table_lock);
>  			rc = ptep_force_prot(parent->mm, paddr, ptep, prot,
> -					     PGSTE_VSIE_BIT);
> +					     GMAP_ENTRY_VSIE);
>  			if (!rc)
>  				gmap_insert_rmap(sg, vmaddr, rmap);
>  			spin_unlock(&sg->guest_table_lock);
> @@ -1622,7 +1625,7 @@ struct gmap *gmap_shadow(struct gmap *parent, unsigned long asce,
>  	down_read(&parent->mm->mmap_sem);
>  	rc = gmap_protect_range(parent, asce & _ASCE_ORIGIN,
>  				((asce & _ASCE_TABLE_LENGTH) + 1) * PAGE_SIZE,
> -				PROT_READ, PGSTE_VSIE_BIT);
> +				PROT_READ, GMAP_ENTRY_VSIE);
>  	up_read(&parent->mm->mmap_sem);
>  	spin_lock(&parent->shadow_lock);
>  	new->initialized = true;
> 

Makes sense.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC/PATCH 02/22] s390/mm: Abstract gmap notify bit setting
  2017-11-10 12:57   ` David Hildenbrand
@ 2017-11-13 15:57     ` Janosch Frank
  2017-11-15  9:30       ` David Hildenbrand
  0 siblings, 1 reply; 42+ messages in thread
From: Janosch Frank @ 2017-11-13 15:57 UTC (permalink / raw)
  To: David Hildenbrand, kvm
  Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390


On 10.11.2017 13:57, David Hildenbrand wrote:
> On 06.11.2017 23:29, Janosch Frank wrote:
>> Currently we use the software PGSTE bits PGSTE_IN_BIT and PGSTE_VSIE_BIT
>> to notify before an invalidation occurs on a prefix page or a VSIE page
>> respectively. Both bits only work for a PGSTE, which only exists for
>> page tables.
>>
>> For huge page support we also need such bits for segments (pmds) so
>> let's introduce abstract GMAP_ENTRY_* bits that will be realized into
>> the respective bits when gmap DAT table entries are protected.
>>
>> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
>> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
[...]
>> @@ -8,6 +8,10 @@
>>  #ifndef _ASM_S390_GMAP_H
>>  #define _ASM_S390_GMAP_H
>>  
>> +/* Generic bits for GMAP notification on DAT table entry changes. */
>> +#define GMAP_ENTRY_VSIE	0x2
>> +#define GMAP_ENTRY_IN	0x1
> 
> Can we rename
> GMAP_ENTRY_VSIE -> GMAP_ENTRY_SHADOW_IN
> 
> and also
> 
> PGSTE_VSIE_BIT -> PGSTE_SHADOW_IN_BIT

I'd like to keep the PGSTE bit names, the PGSTE bits are nicely
formatted and consistent in pgtable.h. The segment bit names are even
worse because of their length:

_SEGMENT_ENTRY_GMAP_VSIE

and if we'd use your naming it gets even longer:

_SEGMENT_ENTRY_GMAP_SHADOW_IN_BIT


I'll try to come up with something consistent and good sounding, but it
certainly isn't a #1 priority and I might fail to do so :)

> 
> as they are used for managing shadow page table invalidation.
> 
> Also, a better fitting name for ENTRY would be nice :)
> (GMAP_NOTIFY_...)

Yes, the ENTRY part came from the _SEGMENT_ENTRY_* constants, whose
names you might also dislike.

So I think I'll use:
GMAP_NOTIFY_PREFIX(_BIT)
GMAP_NOTIFY_SHADOW(_BIT)
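
Spelled out against the values of the current GMAP_ENTRY_* defines, that
would presumably look like this (a sketch; final names/values may differ):

	/* Generic bits for GMAP notification on DAT table entry changes. */
	#define GMAP_NOTIFY_SHADOW	0x2
	#define GMAP_NOTIFY_PREFIX	0x1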




^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC/PATCH 02/22] s390/mm: Abstract gmap notify bit setting
  2017-11-13 15:57     ` Janosch Frank
@ 2017-11-15  9:30       ` David Hildenbrand
  0 siblings, 0 replies; 42+ messages in thread
From: David Hildenbrand @ 2017-11-15  9:30 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

On 13.11.2017 16:57, Janosch Frank wrote:
> On 10.11.2017 13:57, David Hildenbrand wrote:
>> On 06.11.2017 23:29, Janosch Frank wrote:
>>> Currently we use the software PGSTE bits PGSTE_IN_BIT and PGSTE_VSIE_BIT
>>> to notify before an invalidation occurs on a prefix page or a VSIE page
>>> respectively. Both bits only work for a PGSTE, which only exists for
>>> page tables.
>>>
>>> For huge page support we also need such bits for segments (pmds) so
>>> let's introduce abstract GMAP_ENTRY_* bits that will be realized into
>>> the respective bits when gmap DAT table entries are protected.
>>>
>>> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
>>> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
> [...]
>>> @@ -8,6 +8,10 @@
>>>  #ifndef _ASM_S390_GMAP_H
>>>  #define _ASM_S390_GMAP_H
>>>  
>>> +/* Generic bits for GMAP notification on DAT table entry changes. */
>>> +#define GMAP_ENTRY_VSIE	0x2
>>> +#define GMAP_ENTRY_IN	0x1
>>
>> Can we rename
>> GMAP_ENTRY_VSIE -> GMAP_ENTRY_SHADOW_IN
>>
>> and also
>>
>> PGSTE_VSIE_BIT -> PGSTE_SHADOW_IN_BIT
> 
> I'd like to keep the PGSTE bit names, the PGSTE bits are nicely
> formatted and consistent in pgtable.h. The segment bit names are even
> worse because of their length:
> 
> _SEGMENT_ENTRY_GMAP_VSIE
> 
> and if we'd use your naming it gets even longer:
> 
> _SEGMENT_ENTRY_GMAP_SHADOW_IN_BIT
> 
> 
> I'll try to come up with something consistent and good sounding, but it
> certainly isn't a #1 priority and I might fail to do so :)
> 
>>
>> as they are used for managing shadow page table invalidation.
>>
>> Also, a better fitting name for ENTRY would be nice :)
>> (GMAP_NOTIFY_...)
> 
> Yes, the ENTRY part came from the _SEGMENT_ENTRY_* constants, whose
> names you might also dislike.
> 
> So I think I'll use:
> GMAP_NOTIFY_PREFIX(_BIT)

s/PREFIX/MPROTECT/ ?

Because that's the GMAP interface being used.

> GMAP_NOTIFY_SHADOW(_BIT)
> 
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC/PATCH 03/22] s390/mm: add gmap PMD invalidation notification
  2017-11-06 22:29 ` [RFC/PATCH 03/22] s390/mm: add gmap PMD invalidation notification Janosch Frank
@ 2017-11-15  9:55   ` David Hildenbrand
  2017-11-17  9:02     ` Janosch Frank
  0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2017-11-15  9:55 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

On 06.11.2017 23:29, Janosch Frank wrote:
> For later migration of huge pages we want to write-protect guest
> PMDs. While doing this, we have to make absolutely sure, that the
> guest's lowcore is always accessible when the VCPU is running. With
> PTEs, this is solved by marking the PGSTEs of the lowcore pages with
> the invalidation notification bit and kicking the guest out of the SIE
> via a notifier function if we need to invalidate such a page.
> 
> With PMDs we do not have PGSTEs or some other bits we could use in the
> host PMD. Instead we pick one of the free bits in the gmap PMD. Every
> time a host pmd will be invalidated, we will check if the respective
> gmap PMD has the bit set and in that case fire up the notifier.
> 
> In the first step we only support setting the invalidation bit, but we
> do not support restricting access of guest pmds. It will follow
> shortly.
> 
> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
> ---
>  arch/s390/include/asm/gmap.h    |  3 ++
>  arch/s390/include/asm/pgtable.h |  1 +
>  arch/s390/mm/gmap.c             | 95 ++++++++++++++++++++++++++++++++++++-----
>  arch/s390/mm/pgtable.c          |  4 ++
>  4 files changed, 93 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
> index f3d84a8..99cf6d8 100644
> --- a/arch/s390/include/asm/gmap.h
> +++ b/arch/s390/include/asm/gmap.h
> @@ -12,6 +12,9 @@
>  #define GMAP_ENTRY_VSIE	0x2
>  #define GMAP_ENTRY_IN	0x1
>  
> +/* Status bits in the gmap segment entry. */
> +#define _SEGMENT_ENTRY_GMAP_IN		0x0001	/* invalidation notify bit */
> +
>  /**
>   * struct gmap_struct - guest address space
>   * @list: list head for the mm->context gmap list
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 20e75a2..4707647 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1092,6 +1092,7 @@ void ptep_set_pte_at(struct mm_struct *mm, unsigned long addr,
>  void ptep_set_notify(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
>  void ptep_notify(struct mm_struct *mm, unsigned long addr,
>  		 pte_t *ptep, unsigned long bits);
> +void pmdp_notify(struct mm_struct *mm, unsigned long addr);
>  int ptep_force_prot(struct mm_struct *mm, unsigned long gaddr,
>  		    pte_t *ptep, int prot, unsigned long bit);
>  void ptep_zap_unused(struct mm_struct *mm, unsigned long addr,
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 74e2062..3961589 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -595,10 +595,17 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
>  	if (*table == _SEGMENT_ENTRY_EMPTY) {
>  		rc = radix_tree_insert(&gmap->host_to_guest,
>  				       vmaddr >> PMD_SHIFT, table);
> -		if (!rc)
> -			*table = pmd_val(*pmd);
> -	} else
> -		rc = 0;
> +		if (!rc) {
> +			if (pmd_large(*pmd)) {
> +				*table = pmd_val(*pmd) &
> +					(_SEGMENT_ENTRY_ORIGIN_LARGE
> +					 | _SEGMENT_ENTRY_INVALID
> +					 | _SEGMENT_ENTRY_LARGE
> +					 | _SEGMENT_ENTRY_PROTECT);

Can we reuse/define a mask for that?

Like _SEGMENT_ENTRY_BITS_LARGE

> +			} else
> +				*table = pmd_val(*pmd) & ~0x03UL;

Where exactly does the 0x03UL come from? Can we reuse e.g.
_SEGMENT_ENTRY_BITS here?

Or is there a way to unify both cases?
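
One possible shape for such a mask, just to make the suggestion concrete
(the define name is a placeholder, not an existing symbol):

	#define _SEGMENT_ENTRY_GMAP_LINK_LARGE	(_SEGMENT_ENTRY_ORIGIN_LARGE | \
						 _SEGMENT_ENTRY_INVALID | \
						 _SEGMENT_ENTRY_LARGE | \
						 _SEGMENT_ENTRY_PROTECT)

			if (pmd_large(*pmd))
				*table = pmd_val(*pmd) & _SEGMENT_ENTRY_GMAP_LINK_LARGE;
			else
				*table = pmd_val(*pmd) & ~0x03UL;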

> +		}
> +	}
>  	spin_unlock(&gmap->guest_table_lock);
>  	spin_unlock(ptl);
>  	radix_tree_preload_end();
> @@ -965,6 +972,35 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
>  }
>  
>  /*
> + * gmap_protect_large - set pmd notification bits

Why not gmap_protect_pmd just like gmap_protect_pte?

> + * @pmdp: pointer to the pmd to be protected
> + * @prot: indicates access rights: PROT_NONE, PROT_READ or PROT_WRITE
> + * @bits: notification bits to set
> + *
> + * Returns 0 if successfully protected, -ENOMEM if out of memory and
> + * -EAGAIN if a fixup is needed.
> + *
> + * Expected to be called with sg->mm->mmap_sem in read and
> + * guest_table_lock held.

gmap_pmd_op_walk() guarantees the latter, correct?

> + */
> +static int gmap_protect_large(struct gmap *gmap, unsigned long gaddr,
> +			      pmd_t *pmdp, int prot, unsigned long bits)
> +{
> +	int pmd_i, pmd_p;
> +
> +	pmd_i = pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID;
> +	pmd_p = pmd_val(*pmdp) & _SEGMENT_ENTRY_PROTECT;

initialize both directly and make them const.
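
i.e. something like (sketch):

	const int pmd_i = pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID;
	const int pmd_p = pmd_val(*pmdp) & _SEGMENT_ENTRY_PROTECT;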

> +
> +	/* Fixup needed */
> +	if ((pmd_i && (prot != PROT_NONE)) || (pmd_p && (prot & PROT_WRITE)))
> +		return -EAGAIN;

Looks just like ptep_force_prot, so should be ok.

> +
> +	if (bits & GMAP_ENTRY_IN)
> +		pmd_val(*pmdp) |=  _SEGMENT_ENTRY_GMAP_IN;
> +	return 0;
> +}
> +
> +/*
>   * gmap_protect_range - remove access rights to memory and set pgste bits
>   * @gmap: pointer to guest mapping meta data structure
>   * @gaddr: virtual address in the guest address space
> @@ -977,7 +1013,7 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
>   *
>   * Called with sg->mm->mmap_sem in read.
>   *
> - * Note: Can also be called for shadow gmaps.
> + * Note: Can also be called for shadow gmaps, but only with 4k pages.
>   */
>  static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
>  			      unsigned long len, int prot, unsigned long bits)
> @@ -990,11 +1026,20 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
>  		rc = -EAGAIN;
>  		pmdp = gmap_pmd_op_walk(gmap, gaddr);
>  		if (pmdp) {
> -			rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
> -					      bits);
> -			if (!rc) {
> -				len -= PAGE_SIZE;
> -				gaddr += PAGE_SIZE;
> +			if (!pmd_large(*pmdp)) {
> +				rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
> +						      bits);
> +				if (!rc) {
> +					len -= PAGE_SIZE;
> +					gaddr += PAGE_SIZE;
> +				}
> +			} else {
> +				rc =  gmap_protect_large(gmap, gaddr, pmdp,
> +							 prot, bits);
> +				if (!rc) {
> +					len = len < HPAGE_SIZE ? 0 : len - HPAGE_SIZE;

Shouldn't you also take the difference to (gaddr & HPAGE_MASK) into
account, like you do for gaddr in the next line? (so always subtracting a
full HPAGE_SIZE looks suspicious)
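
A sketch of what that could look like, using only what is already in the
loop (seg_len is an invented local):

	} else {
		/* Only consume the part of this segment starting at gaddr. */
		unsigned long seg_len = HPAGE_SIZE - (gaddr & ~HPAGE_MASK);

		rc = gmap_protect_large(gmap, gaddr, pmdp, prot, bits);
		if (!rc) {
			len = len < seg_len ? 0 : len - seg_len;
			gaddr = (gaddr & HPAGE_MASK) + HPAGE_SIZE;
		}
	}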

> +					gaddr = (gaddr & HPAGE_MASK) + HPAGE_SIZE;
> +				}
>  			}
>  			gmap_pmd_op_end(gmap, pmdp);
>  		}
> @@ -2191,6 +2236,36 @@ void ptep_notify(struct mm_struct *mm, unsigned long vmaddr,
>  }
>  EXPORT_SYMBOL_GPL(ptep_notify);
>  
> +/**
> + * pmdp_notify - call all invalidation callbacks for a specific pmd
> + * @mm: pointer to the process mm_struct
> + * @vmaddr: virtual address in the process address space
> + *
> + * This function is expected to be called with mmap_sem held in read.
> + */
> +void pmdp_notify(struct mm_struct *mm, unsigned long vmaddr)
> +{
> +	unsigned long *table, gaddr;
> +	struct gmap *gmap;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
> +		spin_lock(&gmap->guest_table_lock);
> +		table = radix_tree_lookup(&gmap->host_to_guest,
> +					  vmaddr >> PMD_SHIFT);
> +		if (!table || !(*table & _SEGMENT_ENTRY_GMAP_IN)) {
> +			spin_unlock(&gmap->guest_table_lock);
> +			continue;
> +		}
> +		gaddr = __gmap_segment_gaddr(table);
> +		*table &= ~_SEGMENT_ENTRY_GMAP_IN;

We can perform this page table change without any flushing because we
don't modify bits used by the HW. Is that guaranteed by the architecture,
or are there any special conditions to take into account? (this also
applies to where we set _SEGMENT_ENTRY_GMAP_IN)

> +		spin_unlock(&gmap->guest_table_lock);
> +		gmap_call_notifier(gmap, gaddr, gaddr + HPAGE_SIZE - 1);
> +	}
> +	rcu_read_unlock();
> +}
> +EXPORT_SYMBOL_GPL(pmdp_notify);
> +
>  static inline void thp_split_mm(struct mm_struct *mm)
>  {
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
> index ae677f8..79d35d0 100644
> --- a/arch/s390/mm/pgtable.c
> +++ b/arch/s390/mm/pgtable.c
> @@ -404,6 +404,8 @@ pmd_t pmdp_xchg_direct(struct mm_struct *mm, unsigned long addr,
>  	pmd_t old;
>  
>  	preempt_disable();
> +	if (mm_has_pgste(mm))
> +		pmdp_notify(mm, addr);
>  	old = pmdp_flush_direct(mm, addr, pmdp);
>  	*pmdp = new;
>  	preempt_enable();
> @@ -417,6 +419,8 @@ pmd_t pmdp_xchg_lazy(struct mm_struct *mm, unsigned long addr,
>  	pmd_t old;
>  
>  	preempt_disable();
> +	if (mm_has_pgste(mm))
> +		pmdp_notify(mm, addr);
>  	old = pmdp_flush_lazy(mm, addr, pmdp);
>  	*pmdp = new;
>  	preempt_enable();
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC/PATCH 21/22] KVM: s390: Add KVM HPAGE capability
  2017-11-06 22:29 ` [RFC/PATCH 21/22] KVM: s390: Add KVM HPAGE capability Janosch Frank
  2017-11-07 10:07   ` Cornelia Huck
@ 2017-11-15 10:06   ` David Hildenbrand
  2017-11-15 12:02     ` Janosch Frank
  1 sibling, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2017-11-15 10:06 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

On 06.11.2017 23:29, Janosch Frank wrote:
> KVM huge page backing support can not be easily tested under
> s390. Currently testing is only possible after most of the guest has
> already been set up.
> 
> To indicate, that KVM has huge page backing support, we add the
> KVM_CAP_S390_HPAGE capability. This does not mean, that transparent
> huge pages are supported.

Are there any plans to support it? (my feeling is that it could be
somewhat tricky :) )

> 
> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
> ---
>  include/uapi/linux/kvm.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 8388875..2cd359a 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -930,6 +930,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_PPC_SMT_POSSIBLE 147
>  #define KVM_CAP_HYPERV_SYNIC2 148
>  #define KVM_CAP_HYPERV_VP_INDEX 149
> +#define KVM_CAP_S390_HPAGE 150
>  
What about facilities/features that require PGSTE to be around? Have we
fixed up all handlers? (I remember some features/instructions triggering
special SIE exits in case they hit a large pmd in the gmap).

Which of these facilities will we have to disable in user space? (e.g.
we already handle CMM in QEMU if we detect explicit huge pages).

(lack of public documentation makes this hard to review/understand)

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC/PATCH 20/22] s390/mm: Enable gmap huge pmd support
  2017-11-06 22:29 ` [RFC/PATCH 20/22] s390/mm: Enable gmap huge pmd support Janosch Frank
@ 2017-11-15 10:08   ` David Hildenbrand
  2017-11-15 12:24     ` Janosch Frank
  0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2017-11-15 10:08 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

On 06.11.2017 23:29, Janosch Frank wrote:
> Now that we have everything in place, let's allow huge pmds for gmap
> linking, effectively allowing huge page guests.

Can you add a comment about transparent huge pages?

Reviewed-by: David Hildenbrand <david@redhat.com>

> 
> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
> ---
>  arch/s390/mm/gmap.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 430dcd9..3cc2765 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -1,8 +1,10 @@
>  /*
>   *  KVM guest address space mapping code
>   *
> - *    Copyright IBM Corp. 2007, 2016
> + *    Copyright IBM Corp. 2007, 2016, 2017
>   *    Author(s): Martin Schwidefsky <schwidefsky@de.ibm.com>
> + *		 David Hildenbrand <david@redhat.com>
> + *		 Janosch Frank <frankja@linux.vnet.ibm.com>
>   */
>  
>  #include <linux/kernel.h>
> @@ -596,9 +598,6 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
>  		return -EFAULT;
>  	pmd = pmd_offset(pud, vmaddr);
>  	VM_BUG_ON(pmd_none(*pmd));
> -	/* large pmds cannot yet be handled */
> -	if (pmd_large(*pmd))
> -		return -EFAULT;
>  	/* Link gmap segment table entry location to page table. */
>  	rc = radix_tree_preload(GFP_KERNEL);
>  	if (rc)
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC/PATCH 22/22] RFC: s390/mm: Add gmap lock classes
  2017-11-06 22:30 ` [RFC/PATCH 22/22] RFC: s390/mm: Add gmap lock classes Janosch Frank
@ 2017-11-15 10:10   ` David Hildenbrand
  2017-11-15 12:16     ` Janosch Frank
  0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2017-11-15 10:10 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

On 06.11.2017 23:30, Janosch Frank wrote:
> A shadow gmap and its parent are locked right after each other when
> doing VSIE management. Lockdep can't differentiate between the two
> classes without some help.
> 
> TODO: Not sure yet if I have to annotate all and if gmap_pmd_walk will
> be used by both shadow and parent

Why is that new? I thought we already had the same situation before and
lockdep didn't complain.

(worrying that we are now seeing a real problem and trying to hide it)

> 
> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
> ---
>  arch/s390/include/asm/gmap.h |  6 ++++++
>  arch/s390/mm/gmap.c          | 40 ++++++++++++++++++++--------------------
>  2 files changed, 26 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
> index 15a7834..31fad98 100644
> --- a/arch/s390/include/asm/gmap.h
> +++ b/arch/s390/include/asm/gmap.h
> @@ -19,6 +19,12 @@
>  #define _SEGMENT_ENTRY_GMAP_UC		0x4000	/* user dirty (migration) */
>  #define _SEGMENT_ENTRY_GMAP_VSIE	0x8000	/* vsie bit */
>  
> +
> +enum gmap_lock_class {
> +	GMAP_LOCK_PARENT,
> +	GMAP_LOCK_SHADOW
> +};
> +
>  /**
>   * struct gmap_struct - guest address space
>   * @list: list head for the mm->context gmap list
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 3cc2765..b17f424 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -198,7 +198,7 @@ static void gmap_free(struct gmap *gmap)
>  	gmap_radix_tree_free(&gmap->host_to_guest);
>  
>  	/* Free split pmd page tables */
> -	spin_lock(&gmap->guest_table_lock);
> +	spin_lock_nested(&gmap->guest_table_lock, GMAP_LOCK_PARENT);
>  	list_for_each_entry_safe(page, next, &gmap->split_list, lru)
>  		page_table_free_pgste(page);
>  	spin_unlock(&gmap->guest_table_lock);
> @@ -1385,7 +1385,7 @@ static int gmap_protect_rmap_large(struct gmap *sg, struct gmap_rmap *rmap,
>  	if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
>  		return -EAGAIN;
>  
> -	spin_lock(&sg->guest_table_lock);
> +	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
>  	rc = gmap_protect_large(sg->parent, paddr, vmaddr, pmdp, hpmdp,
>  				prot, GMAP_ENTRY_VSIE);
>  	if (!rc)
> @@ -1413,7 +1413,7 @@ static int gmap_protect_rmap_pte(struct gmap *sg, struct gmap_rmap *rmap,
>  		ptep = pte_alloc_map_lock(sg->parent->mm, pmdp, paddr, &ptl);
>  
>  	if (ptep) {
> -		spin_lock(&sg->guest_table_lock);
> +		spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
>  		rc = ptep_force_prot(sg->parent->mm, paddr, ptep, prot,
>  				     PGSTE_VSIE_BIT);
>  		if (!rc)
> @@ -1932,7 +1932,7 @@ struct gmap *gmap_shadow(struct gmap *parent, unsigned long asce,
>  		/* only allow one real-space gmap shadow */
>  		list_for_each_entry(sg, &parent->children, list) {
>  			if (sg->orig_asce & _ASCE_REAL_SPACE) {
> -				spin_lock(&sg->guest_table_lock);
> +				spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
>  				gmap_unshadow(sg);
>  				spin_unlock(&sg->guest_table_lock);
>  				list_del(&sg->list);
> @@ -2004,7 +2004,7 @@ int gmap_shadow_r2t(struct gmap *sg, unsigned long saddr, unsigned long r2t,
>  		page->index |= GMAP_SHADOW_FAKE_TABLE;
>  	s_r2t = (unsigned long *) page_to_phys(page);
>  	/* Install shadow region second table */
> -	spin_lock(&sg->guest_table_lock);
> +	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
>  	table = gmap_table_walk(sg, saddr, 4); /* get region-1 pointer */
>  	if (!table) {
>  		rc = -EAGAIN;		/* Race with unshadow */
> @@ -2037,7 +2037,7 @@ int gmap_shadow_r2t(struct gmap *sg, unsigned long saddr, unsigned long r2t,
>  	offset = ((r2t & _REGION_ENTRY_OFFSET) >> 6) * PAGE_SIZE;
>  	len = ((r2t & _REGION_ENTRY_LENGTH) + 1) * PAGE_SIZE - offset;
>  	rc = gmap_protect_rmap(sg, raddr, origin + offset, len, PROT_READ);
> -	spin_lock(&sg->guest_table_lock);
> +	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
>  	if (!rc) {
>  		table = gmap_table_walk(sg, saddr, 4);
>  		if (!table || (*table & _REGION_ENTRY_ORIGIN) !=
> @@ -2088,7 +2088,7 @@ int gmap_shadow_r3t(struct gmap *sg, unsigned long saddr, unsigned long r3t,
>  		page->index |= GMAP_SHADOW_FAKE_TABLE;
>  	s_r3t = (unsigned long *) page_to_phys(page);
>  	/* Install shadow region second table */
> -	spin_lock(&sg->guest_table_lock);
> +	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
>  	table = gmap_table_walk(sg, saddr, 3); /* get region-2 pointer */
>  	if (!table) {
>  		rc = -EAGAIN;		/* Race with unshadow */
> @@ -2120,7 +2120,7 @@ int gmap_shadow_r3t(struct gmap *sg, unsigned long saddr, unsigned long r3t,
>  	offset = ((r3t & _REGION_ENTRY_OFFSET) >> 6) * PAGE_SIZE;
>  	len = ((r3t & _REGION_ENTRY_LENGTH) + 1) * PAGE_SIZE - offset;
>  	rc = gmap_protect_rmap(sg, raddr, origin + offset, len, PROT_READ);
> -	spin_lock(&sg->guest_table_lock);
> +	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
>  	if (!rc) {
>  		table = gmap_table_walk(sg, saddr, 3);
>  		if (!table || (*table & _REGION_ENTRY_ORIGIN) !=
> @@ -2171,7 +2171,7 @@ int gmap_shadow_sgt(struct gmap *sg, unsigned long saddr, unsigned long sgt,
>  		page->index |= GMAP_SHADOW_FAKE_TABLE;
>  	s_sgt = (unsigned long *) page_to_phys(page);
>  	/* Install shadow region second table */
> -	spin_lock(&sg->guest_table_lock);
> +	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
>  	table = gmap_table_walk(sg, saddr, 2); /* get region-3 pointer */
>  	if (!table) {
>  		rc = -EAGAIN;		/* Race with unshadow */
> @@ -2204,7 +2204,7 @@ int gmap_shadow_sgt(struct gmap *sg, unsigned long saddr, unsigned long sgt,
>  	offset = ((sgt & _REGION_ENTRY_OFFSET) >> 6) * PAGE_SIZE;
>  	len = ((sgt & _REGION_ENTRY_LENGTH) + 1) * PAGE_SIZE - offset;
>  	rc = gmap_protect_rmap(sg, raddr, origin + offset, len, PROT_READ);
> -	spin_lock(&sg->guest_table_lock);
> +	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
>  	if (!rc) {
>  		table = gmap_table_walk(sg, saddr, 2);
>  		if (!table || (*table & _REGION_ENTRY_ORIGIN) !=
> @@ -2260,7 +2260,7 @@ int gmap_shadow_sgt_lookup(struct gmap *sg, unsigned long saddr,
>  	int rc = -EAGAIN;
>  
>  	BUG_ON(!gmap_is_shadow(sg));
> -	spin_lock(&sg->guest_table_lock);
> +	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
>  	if (sg->asce & _ASCE_TYPE_MASK) {
>  		/* >2 GB guest */
>  		r3e = (unsigned long *) gmap_table_walk(sg, saddr, 2);
> @@ -2327,7 +2327,7 @@ int gmap_shadow_pgt(struct gmap *sg, unsigned long saddr, unsigned long pgt,
>  		page->index |= GMAP_SHADOW_FAKE_TABLE;
>  	s_pgt = (unsigned long *) page_to_phys(page);
>  	/* Install shadow page table */
> -	spin_lock(&sg->guest_table_lock);
> +	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
>  	table = gmap_table_walk(sg, saddr, 1); /* get segment pointer */
>  	if (!table) {
>  		rc = -EAGAIN;		/* Race with unshadow */
> @@ -2355,7 +2355,7 @@ int gmap_shadow_pgt(struct gmap *sg, unsigned long saddr, unsigned long pgt,
>  	raddr = (saddr & _SEGMENT_MASK) | _SHADOW_RMAP_SEGMENT;
>  	origin = pgt & _SEGMENT_ENTRY_ORIGIN & PAGE_MASK;
>  	rc = gmap_protect_rmap(sg, raddr, origin, PAGE_SIZE, PROT_READ);
> -	spin_lock(&sg->guest_table_lock);
> +	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
>  	if (!rc) {
>  		table = gmap_table_walk(sg, saddr, 1);
>  		if (!table || (*table & _SEGMENT_ENTRY_ORIGIN) !=
> @@ -2417,7 +2417,7 @@ int gmap_shadow_segment(struct gmap *sg, unsigned long saddr, pmd_t pmd)
>  		if (spmdp) {
>  			if (!pmd_large(*spmdp))
>  				BUG();
> -			spin_lock(&sg->guest_table_lock);
> +			spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
>  			/* Get shadow segment table pointer */
>  			tpmdp = (pmd_t *) gmap_table_walk(sg, saddr, 1);
>  			if (!tpmdp) {
> @@ -2513,7 +2513,7 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
>  		rc = -EAGAIN;
>  		spmdp = gmap_pmd_op_walk(parent, paddr);
>  		if (spmdp && !(pmd_val(*spmdp) & _SEGMENT_ENTRY_INVALID)) {
> -			spin_lock(&sg->guest_table_lock);
> +			spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
>  			/* Get page table pointer */
>  			tptep = (pte_t *) gmap_table_walk(sg, saddr, 0);
>  			if (!tptep) {
> @@ -2597,7 +2597,7 @@ static void gmap_shadow_notify_pmd(struct gmap *sg, unsigned long vmaddr,
>  
>  	BUG_ON(!gmap_is_shadow(sg));
>  
> -	spin_lock(&sg->guest_table_lock);
> +	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
>  	if (sg->removed) {
>  		spin_unlock(&sg->guest_table_lock);
>  		return;
> @@ -2658,7 +2658,7 @@ static void gmap_shadow_notify(struct gmap *sg, unsigned long vmaddr,
>  
>  	BUG_ON(!gmap_is_shadow(sg));
>  
> -	spin_lock(&sg->guest_table_lock);
> +	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
>  	if (sg->removed) {
>  		spin_unlock(&sg->guest_table_lock);
>  		return;
> @@ -2899,7 +2899,7 @@ static void gmap_pmdp_clear(struct mm_struct *mm, unsigned long vmaddr,
>  
>  	rcu_read_lock();
>  	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
> -		spin_lock(&gmap->guest_table_lock);
> +		spin_lock_nested(&gmap->guest_table_lock, GMAP_LOCK_PARENT);
>  		pmdp = (pmd_t *)radix_tree_delete(&gmap->host_to_guest,
>  						   vmaddr >> PMD_SHIFT);
>  		if (pmdp) {
> @@ -2949,7 +2949,7 @@ void gmap_pmdp_idte_local(struct mm_struct *mm, unsigned long vmaddr)
>  
>  	rcu_read_lock();
>  	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
> -		spin_lock(&gmap->guest_table_lock);
> +		spin_lock_nested(&gmap->guest_table_lock, GMAP_LOCK_PARENT);
>  		entry = radix_tree_delete(&gmap->host_to_guest,
>  					  vmaddr >> PMD_SHIFT);
>  		if (entry) {
> @@ -2984,7 +2984,7 @@ void gmap_pmdp_idte_global(struct mm_struct *mm, unsigned long vmaddr)
>  
>  	rcu_read_lock();
>  	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
> -		spin_lock(&gmap->guest_table_lock);
> +		spin_lock_nested(&gmap->guest_table_lock, GMAP_LOCK_PARENT);
>  		entry = radix_tree_delete(&gmap->host_to_guest,
>  					  vmaddr >> PMD_SHIFT);
>  		if (entry) {
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC/PATCH 21/22] KVM: s390: Add KVM HPAGE capability
  2017-11-15 10:06   ` David Hildenbrand
@ 2017-11-15 12:02     ` Janosch Frank
  0 siblings, 0 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-15 12:02 UTC (permalink / raw)
  To: David Hildenbrand, kvm
  Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390



On 15.11.2017 11:06, David Hildenbrand wrote:
> On 06.11.2017 23:29, Janosch Frank wrote:
>> KVM huge page backing support cannot be easily tested under
>> s390. Currently testing is only possible after most of the guest has
>> already been set up.
>>
>> To indicate that KVM has huge page backing support, we add the
>> KVM_CAP_S390_HPAGE capability. This does not mean that transparent
>> huge pages are supported.
> 
> Are there any plans to support it? (my feeling is that it could be
> somewhat tricky :) )

Better safe than sorry was the idea behind that.
By the way, this lacks the capability announcement in
kvm_vm_ioctl_check_extension() but I already fixed that for v2.
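
For reference, a minimal sketch of what that announcement could look
like (simplified and self-contained; the real kvm_vm_ioctl_check_extension()
in arch/s390/kvm/kvm-s390.c has many more cases, and reporting the
capability unconditionally is an assumption of this sketch):

/*
 * Sketch only: announce KVM_CAP_S390_HPAGE from the extension check.
 * The capability number comes from the patch, the rest is illustrative.
 */
#define KVM_CAP_S390_HPAGE 150

static int check_extension_sketch(long ext)
{
	int r;

	switch (ext) {
	case KVM_CAP_S390_HPAGE:
		r = 1;		/* 1M huge page backing is available */
		break;
	default:
		r = 0;
		break;
	}
	return r;
}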

> 
>>
>> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
>> ---
>>  include/uapi/linux/kvm.h | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 8388875..2cd359a 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -930,6 +930,7 @@ struct kvm_ppc_resize_hpt {
>>  #define KVM_CAP_PPC_SMT_POSSIBLE 147
>>  #define KVM_CAP_HYPERV_SYNIC2 148
>>  #define KVM_CAP_HYPERV_VP_INDEX 149
>> +#define KVM_CAP_S390_HPAGE 150
>>  
> What about facilities/features that require PGSTE to be around? Have we
> fixed up all handlers? (I remember some features/instructions triggering
> special SIE exits in case they hit a large pmd in the gmap).
> 
> Which of these facilities will we have to disable in user space? (e.g.
> we already handle CMM in QEMU if we detect explicit huge pages).
> 
> (lack of public documentation makes this hard to review/understand)

Most facilities tolerate EDAT in a way that makes changes unnecessary.
CMMA, for example, would still be interpreted, but it would only accept
the stable/resident state. SKEYS have the RCP bypass. PFMF gets
intercepted; however, it doesn't use the GMAP but the user mapping for
clearing. Also, you already disabled a lot for VSIE, like pfmf
interpretation. Emulation in l2 for l3 and RRBE are still on the TODO
list for complete evaluation.

This is a valid concern for everyone involved and I scan as much
hardware documentation as possible. I have prepared unit tests for pfmf,
cmm, cmma and storage key operations in an effort to test as much as
possible in each level of SIE. Unfortunately I can't yet publish them,
but I will as soon as I'm allowed.
I still wish for a way to disable skeys completely.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC/PATCH 22/22] RFC: s390/mm: Add gmap lock classes
  2017-11-15 10:10   ` David Hildenbrand
@ 2017-11-15 12:16     ` Janosch Frank
  0 siblings, 0 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-15 12:16 UTC (permalink / raw)
  To: David Hildenbrand, kvm
  Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390



On 15.11.2017 11:10, David Hildenbrand wrote:
> On 06.11.2017 23:30, Janosch Frank wrote:
>> A shadow gmap and its parent are locked right after each other when
>> doing VSIE management. Lockdep can't differentiate between the two
>> classes without some help.
>>
>> TODO: Not sure yet if I have to annotate all and if gmap_pmd_walk will
>> be used by both shadow and parent
> 
> Why is that new? I thought we already had the same situation before and
> lockdep didn't complain.
> 
> (worrying that we are now seeing a real problem and trying to hide it)

In gmap_protect_rmap we now also need to take the parent's gmap page
table lock, which we didn't need before, because we only touched the
parent's ptes and had pte locks for that.
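
To illustrate the nesting (a reduced, self-contained sketch, not the
actual gmap code; the struct and function below are made up for
illustration, only the lock class names come from the patch): both
guest_table_locks belong to the same lock class, so lockdep needs the
subclass annotation to tell the parent's lock apart from the shadow's
when both are held at the same time.

#include <linux/spinlock.h>

enum gmap_lock_class {
	GMAP_LOCK_PARENT,
	GMAP_LOCK_SHADOW
};

struct gmap_sketch {
	spinlock_t guest_table_lock;
	struct gmap_sketch *parent;
};

static void protect_rmap_sketch(struct gmap_sketch *sg)
{
	/* Parent gmap lock first ... */
	spin_lock_nested(&sg->parent->guest_table_lock, GMAP_LOCK_PARENT);
	/* ... then the shadow gmap lock; distinct subclasses keep lockdep quiet. */
	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);

	/* rmap protection work would go here */

	spin_unlock(&sg->guest_table_lock);
	spin_unlock(&sg->parent->guest_table_lock);
}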



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC/PATCH 20/22] s390/mm: Enable gmap huge pmd support
  2017-11-15 10:08   ` David Hildenbrand
@ 2017-11-15 12:24     ` Janosch Frank
  0 siblings, 0 replies; 42+ messages in thread
From: Janosch Frank @ 2017-11-15 12:24 UTC (permalink / raw)
  To: David Hildenbrand, kvm
  Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390



On 15.11.2017 11:08, David Hildenbrand wrote:
> On 06.11.2017 23:29, Janosch Frank wrote:
>> Now that we have everything in place, let's allow huge pmds for gmap
>> linking, effectively allowing huge page guests.
> 
> Can you add a comment about transparent huge pages?

Now that we have everything in place, let's allow huge (1m) pmds for
gmap linking, effectively allowing hugetlbfs backed guests. Transparent
huge pages and 2g huge pages are *not* supported through this change.

> 
> Reviewed-by: David Hildenbrand <david@redhat.com>

Thanks!

> 
>>
>> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
>> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
>> ---
>>  arch/s390/mm/gmap.c | 7 +++----
>>  1 file changed, 3 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
>> index 430dcd9..3cc2765 100644
>> --- a/arch/s390/mm/gmap.c
>> +++ b/arch/s390/mm/gmap.c
>> @@ -1,8 +1,10 @@
>>  /*
>>   *  KVM guest address space mapping code
>>   *
>> - *    Copyright IBM Corp. 2007, 2016
>> + *    Copyright IBM Corp. 2007, 2016, 2017
>>   *    Author(s): Martin Schwidefsky <schwidefsky@de.ibm.com>
>> + *		 David Hildenbrand <david@redhat.com>
>> + *		 Janosch Frank <frankja@linux.vnet.ibm.com>
>>   */
>>  
>>  #include <linux/kernel.h>
>> @@ -596,9 +598,6 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
>>  		return -EFAULT;
>>  	pmd = pmd_offset(pud, vmaddr);
>>  	VM_BUG_ON(pmd_none(*pmd));
>> -	/* large pmds cannot yet be handled */
>> -	if (pmd_large(*pmd))
>> -		return -EFAULT;
>>  	/* Link gmap segment table entry location to page table. */
>>  	rc = radix_tree_preload(GFP_KERNEL);
>>  	if (rc)
>>
> 
> 




^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC/PATCH 03/22] s390/mm: add gmap PMD invalidation notification
  2017-11-15  9:55   ` David Hildenbrand
@ 2017-11-17  9:02     ` Janosch Frank
  2017-11-17  9:19       ` Martin Schwidefsky
  0 siblings, 1 reply; 42+ messages in thread
From: Janosch Frank @ 2017-11-17  9:02 UTC (permalink / raw)
  To: David Hildenbrand, kvm
  Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390



On 15.11.2017 10:55, David Hildenbrand wrote:
> On 06.11.2017 23:29, Janosch Frank wrote:
>> For later migration of huge pages we want to write-protect guest
>> PMDs. While doing this, we have to make absolutely sure that the
>> guest's lowcore is always accessible when the VCPU is running. With
>> PTEs, this is solved by marking the PGSTEs of the lowcore pages with
>> the invalidation notification bit and kicking the guest out of the SIE
>> via a notifier function if we need to invalidate such a page.
>>
>> With PMDs we do not have PGSTEs or some other bits we could use in the
>> host PMD. Instead we pick one of the free bits in the gmap PMD. Every
>> time a host pmd is invalidated, we will check if the respective
>> gmap PMD has the bit set and in that case fire up the notifier.
>>
>> In the first step we only support setting the invalidation bit, but we
>> do not support restricting access of guest pmds. It will follow
>> shortly.
>>
>> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
>> ---

>> index 74e2062..3961589 100644
>> --- a/arch/s390/mm/gmap.c
>> +++ b/arch/s390/mm/gmap.c
>> @@ -595,10 +595,17 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
>>  	if (*table == _SEGMENT_ENTRY_EMPTY) {
>>  		rc = radix_tree_insert(&gmap->host_to_guest,
>>  				       vmaddr >> PMD_SHIFT, table);
>> -		if (!rc)
>> -			*table = pmd_val(*pmd);
>> -	} else
>> -		rc = 0;
>> +		if (!rc) {
>> +			if (pmd_large(*pmd)) {
>> +				*table = pmd_val(*pmd) &
>> +					(_SEGMENT_ENTRY_ORIGIN_LARGE
>> +					 | _SEGMENT_ENTRY_INVALID
>> +					 | _SEGMENT_ENTRY_LARGE
>> +					 | _SEGMENT_ENTRY_PROTECT);
> 
> Can we reuse/define a mask for that?

We could define GMAP masks that only leave the bits needed for GMAP
usage. Integrating this into pgtable.h might be hard, and we only have
one user of it.

> 
> Like _SEGMENT_ENTRY_BITS_LARGE
> 
>> +			} else
>> +				*table = pmd_val(*pmd) & ~0x03UL;
> 
> Where exactly does the 0x03UL come from? Can we reuse e.g.
> _SEGMENT_ENTRY_BITS here?

No, _SEGMENT_ENTRY_BITS contains these bits.

This is effectively ~(_SEGMENT_ENTRY_READ | _SEGMENT_ENTRY_WRITE) to get
rid of the software bits Linux uses for tracking.

> 
> Or is there a way to unify both cases?

That will be difficult because of the different origin masks.
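
For reference, the two cases next to each other, with hypothetical mask
names along the lines of the suggestion above (only the bit combinations
are taken from the patch; the two macro names below do not exist in
pgtable.h):

/* hardware-relevant bits kept for a large gmap pmd */
#define _SEGMENT_ENTRY_GMAP_HW_LARGE	(_SEGMENT_ENTRY_ORIGIN_LARGE | \
					 _SEGMENT_ENTRY_INVALID | \
					 _SEGMENT_ENTRY_LARGE | \
					 _SEGMENT_ENTRY_PROTECT)
/* Linux software bits stripped from a normal gmap pmd (the ~0x03UL above) */
#define _SEGMENT_ENTRY_GMAP_SW_BITS	(_SEGMENT_ENTRY_READ | \
					 _SEGMENT_ENTRY_WRITE)

		if (pmd_large(*pmd))
			*table = pmd_val(*pmd) & _SEGMENT_ENTRY_GMAP_HW_LARGE;
		else
			*table = pmd_val(*pmd) & ~_SEGMENT_ENTRY_GMAP_SW_BITS;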

>>  
>>  /*
>> + * gmap_protect_large - set pmd notification bits
> 
> Why not gmap_protect_pmd just like gmap_protect_pte?

When I started, I thought adding EDAT2 would not be harder than adding
EDAT1. The large suffix could later have been used to do 1m and 2g
handling, because the bits are in the same places.

I'll change this one and all later occurrences.

> 
>> + * @pmdp: pointer to the pmd to be protected
>> + * @prot: indicates access rights: PROT_NONE, PROT_READ or PROT_WRITE
>> + * @bits: notification bits to set
>> + *
>> + * Returns 0 if successfully protected, -ENOMEM if out of memory and
>> + * -EAGAIN if a fixup is needed.
>> + *
>> + * Expected to be called with sg->mm->mmap_sem in read and
>> + * guest_table_lock held.
> 
> gmap_pmd_op_walk() guarantees the latter, correct?

Yes

> 
>> + */
>> +static int gmap_protect_large(struct gmap *gmap, unsigned long gaddr,
>> +			      pmd_t *pmdp, int prot, unsigned long bits)
>> +{
>> +	int pmd_i, pmd_p;
>> +
>> +	pmd_i = pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID;
>> +	pmd_p = pmd_val(*pmdp) & _SEGMENT_ENTRY_PROTECT;
> 
> initialize both directly and make them const.

Sure
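
I.e. something like this (just a sketch of the suggested change):

	const int pmd_i = pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID;
	const int pmd_p = pmd_val(*pmdp) & _SEGMENT_ENTRY_PROTECT;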

>>  static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
>>  			      unsigned long len, int prot, unsigned long bits)
>> @@ -990,11 +1026,20 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
>>  		rc = -EAGAIN;
>>  		pmdp = gmap_pmd_op_walk(gmap, gaddr);
>>  		if (pmdp) {
>> -			rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
>> -					      bits);
>> -			if (!rc) {
>> -				len -= PAGE_SIZE;
>> -				gaddr += PAGE_SIZE;
>> +			if (!pmd_large(*pmdp)) {
>> +				rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
>> +						      bits);
>> +				if (!rc) {
>> +					len -= PAGE_SIZE;
>> +					gaddr += PAGE_SIZE;
>> +				}
>> +			} else {
>> +				rc =  gmap_protect_large(gmap, gaddr, pmdp,
>> +							 prot, bits);
>> +				if (!rc) {
>> +					len = len < HPAGE_SIZE ? 0 : len - HPAGE_SIZE;
> 
> Shouldn't you also take the difference to (gaddr & HPAGE_MASK)
> into account, like you do in the next step? (so always subtracting
> HPAGE_SIZE looks suspicious)

Yes, it seems we have to do that.
We just never ran into trouble because len is generally < HPAGE_SIZE
and tables don't seem to cross segment boundaries.
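
Something along these lines for the large case, then (a sketch of the
fix being discussed here, not the final v2 code):

			} else {
				rc = gmap_protect_large(gmap, gaddr, pmdp,
							prot, bits);
				if (!rc) {
					/* only gaddr up to the end of this
					 * segment has been covered */
					unsigned long done = HPAGE_SIZE -
						(gaddr & ~HPAGE_MASK);

					len = len < done ? 0 : len - done;
					gaddr = (gaddr & HPAGE_MASK) + HPAGE_SIZE;
				}
			}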

>> +/**
>> + * pmdp_notify - call all invalidation callbacks for a specific pmd
>> + * @mm: pointer to the process mm_struct
>> + * @vmaddr: virtual address in the process address space
>> + *
>> + * This function is expected to be called with mmap_sem held in read.
>> + */
>> +void pmdp_notify(struct mm_struct *mm, unsigned long vmaddr)
>> +{
>> +	unsigned long *table, gaddr;
>> +	struct gmap *gmap;
>> +
>> +	rcu_read_lock();
>> +	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
>> +		spin_lock(&gmap->guest_table_lock);
>> +		table = radix_tree_lookup(&gmap->host_to_guest,
>> +					  vmaddr >> PMD_SHIFT);
>> +		if (!table || !(*table & _SEGMENT_ENTRY_GMAP_IN)) {
>> +			spin_unlock(&gmap->guest_table_lock);
>> +			continue;
>> +		}
>> +		gaddr = __gmap_segment_gaddr(table);
>> +		*table &= ~_SEGMENT_ENTRY_GMAP_IN;
> 
> We can perform this page table change without any flushing/stuff like
> that because we don't modify bits used by the HW. Is that guaranteed by
> the architecture or are there any special conditions to take into
> account? (applies also to where we set the _SEGMENT_ENTRY_GMAP_IN)

ACC is ignored when the validity is 0, and bits 62 and 63 (0x3) are
"available for programming".
It pretty much sounds like these are OS bits free for any usage; I'll
ask the hardware team when I find time.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC/PATCH 03/22] s390/mm: add gmap PMD invalidation notification
  2017-11-17  9:02     ` Janosch Frank
@ 2017-11-17  9:19       ` Martin Schwidefsky
  0 siblings, 0 replies; 42+ messages in thread
From: Martin Schwidefsky @ 2017-11-17  9:19 UTC (permalink / raw)
  To: Janosch Frank
  Cc: David Hildenbrand, kvm, borntraeger, dominik.dingel, linux-s390

On Fri, 17 Nov 2017 10:02:57 +0100
Janosch Frank <frankja@linux.vnet.ibm.com> wrote:

> On 15.11.2017 10:55, David Hildenbrand wrote:
> > On 06.11.2017 23:29, Janosch Frank wrote:  
> >> For later migration of huge pages we want to write-protect guest
> >> PMDs. While doing this, we have to make absolutely sure that the
> >> guest's lowcore is always accessible when the VCPU is running. With
> >> PTEs, this is solved by marking the PGSTEs of the lowcore pages with
> >> the invalidation notification bit and kicking the guest out of the SIE
> >> via a notifier function if we need to invalidate such a page.
> >>
> >> With PMDs we do not have PGSTEs or some other bits we could use in the
> >> host PMD. Instead we pick one of the free bits in the gmap PMD. Every
> >> time a host pmd is invalidated, we will check if the respective
> >> gmap PMD has the bit set and in that case fire up the notifier.
> >>
> >> In the first step we only support setting the invalidation bit, but we
> >> do not support restricting access of guest pmds. It will follow
> >> shortly.
> >>
> >> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
> >> ---  
> 
> >> index 74e2062..3961589 100644
> >> --- a/arch/s390/mm/gmap.c
> >> +++ b/arch/s390/mm/gmap.c
> >> @@ -595,10 +595,17 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
> >>  	if (*table == _SEGMENT_ENTRY_EMPTY) {
> >>  		rc = radix_tree_insert(&gmap->host_to_guest,
> >>  				       vmaddr >> PMD_SHIFT, table);
> >> -		if (!rc)
> >> -			*table = pmd_val(*pmd);
> >> -	} else
> >> -		rc = 0;
> >> +		if (!rc) {
> >> +			if (pmd_large(*pmd)) {
> >> +				*table = pmd_val(*pmd) &
> >> +					(_SEGMENT_ENTRY_ORIGIN_LARGE
> >> +					 | _SEGMENT_ENTRY_INVALID
> >> +					 | _SEGMENT_ENTRY_LARGE
> >> +					 | _SEGMENT_ENTRY_PROTECT);  
> > 
> > Can we reuse/define a mask for that?  
> 
> We could define GMAP masks, that only leave the bits needed for GMAP
> usage. Integrating this in pgtable.h might be hard and we only have one
> user of this.
> 
> > 
> > Like _SEGMENT_ENTRY_BITS_LARGE
> >   
> >> +			} else
> >> +				*table = pmd_val(*pmd) & ~0x03UL;  
> > 
> > Where exactly does the 0x03UL come from? Can we reuse e.g.
> > _SEGMENT_ENTRY_BITS here?  
> 
> No, _SEGMENT_ENTRY_BITS contains these bits.
> 
> This is effectively ~(_SEGMENT_ENTRY_READ | _SEGMENT_ENTRY_WRITE) to get
> rid of the software bits linux uses for tracking.
> 
> > 
> > Or is there a way to unify both cases?  
> 
> That will be difficult because of the different origin masks.
> 
> >>  
> >>  /*
> >> + * gmap_protect_large - set pmd notification bits  
> > 
> > Why not gmap_protect_pmd just like gmap_protect_pte?  
> 
> When I started I thought adding edat2 would not be harder than adding
> edat1. The large suffix could have later been used to do 1m and 2g
> handling because the bits are at the same places.
> 
> I'll change this one and all later occurrences.
> 
> >   
> >> + * @pmdp: pointer to the pmd to be protected
> >> + * @prot: indicates access rights: PROT_NONE, PROT_READ or PROT_WRITE
> >> + * @bits: notification bits to set
> >> + *
> >> + * Returns 0 if successfully protected, -ENOMEM if out of memory and
> >> + * -EAGAIN if a fixup is needed.
> >> + *
> >> + * Expected to be called with sg->mm->mmap_sem in read and
> >> + * guest_table_lock held.  
> > 
> > gmap_pmd_op_walk() guarantees the latter, correct?  
> 
> Yes
> 
> >   
> >> + */
> >> +static int gmap_protect_large(struct gmap *gmap, unsigned long gaddr,
> >> +			      pmd_t *pmdp, int prot, unsigned long bits)
> >> +{
> >> +	int pmd_i, pmd_p;
> >> +
> >> +	pmd_i = pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID;
> >> +	pmd_p = pmd_val(*pmdp) & _SEGMENT_ENTRY_PROTECT;  
> > 
> > initialize both directly and make them const.  
> 
> Sure
> 
> >>  static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
> >>  			      unsigned long len, int prot, unsigned long bits)
> >> @@ -990,11 +1026,20 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
> >>  		rc = -EAGAIN;
> >>  		pmdp = gmap_pmd_op_walk(gmap, gaddr);
> >>  		if (pmdp) {
> >> -			rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
> >> -					      bits);
> >> -			if (!rc) {
> >> -				len -= PAGE_SIZE;
> >> -				gaddr += PAGE_SIZE;
> >> +			if (!pmd_large(*pmdp)) {
> >> +				rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
> >> +						      bits);
> >> +				if (!rc) {
> >> +					len -= PAGE_SIZE;
> >> +					gaddr += PAGE_SIZE;
> >> +				}
> >> +			} else {
> >> +				rc =  gmap_protect_large(gmap, gaddr, pmdp,
> >> +							 prot, bits);
> >> +				if (!rc) {
> >> +					len = len < HPAGE_SIZE ? 0 : len - HPAGE_SIZE;  
> > 
> > Shouldn't you also take the difference to (gaddr & HPAGE_MASK)
> > into account, like you do in the next step? (so always subtracting
> > HPAGE_SIZE looks suspicious)  
> 
> Yes, it seems we have to do that.
> We just never ran into trouble because len is generally < HPAGE_SIZE
> and tables don't seem to cross segment boundaries.
> 
> >> +/**
> >> + * pmdp_notify - call all invalidation callbacks for a specific pmd
> >> + * @mm: pointer to the process mm_struct
> >> + * @vmaddr: virtual address in the process address space
> >> + *
> >> + * This function is expected to be called with mmap_sem held in read.
> >> + */
> >> +void pmdp_notify(struct mm_struct *mm, unsigned long vmaddr)
> >> +{
> >> +	unsigned long *table, gaddr;
> >> +	struct gmap *gmap;
> >> +
> >> +	rcu_read_lock();
> >> +	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
> >> +		spin_lock(&gmap->guest_table_lock);
> >> +		table = radix_tree_lookup(&gmap->host_to_guest,
> >> +					  vmaddr >> PMD_SHIFT);
> >> +		if (!table || !(*table & _SEGMENT_ENTRY_GMAP_IN)) {
> >> +			spin_unlock(&gmap->guest_table_lock);
> >> +			continue;
> >> +		}
> >> +		gaddr = __gmap_segment_gaddr(table);
> >> +		*table &= ~_SEGMENT_ENTRY_GMAP_IN;  
> > 
> > We can perform this page table change without any flushing/stuff like
> > that because we don't modify bits used by the HW. Is that guaranteed by
> > the architecture or are there any special conditions to take into
> > account? (applies also to where we set the _SEGMENT_ENTRY_GMAP_IN)  
> 
> ACC is ignored when the validity is 0, and bits 62 and 63 (0x3) are
> "available for programming".
> It pretty much sounds like these are OS bits free for any usage; I'll
> ask the hardware team when I find time.

Bits 62 and 63 used to be reserved bits but they have been defined for OS
usage with the z13 PoP.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC/PATCH 19/22] s390/mm: Split huge pages if granular protection is needed
  2017-11-06 22:29 ` [RFC/PATCH 19/22] s390/mm: Split huge pages if granular protection is needed Janosch Frank
@ 2017-12-07 16:32   ` David Hildenbrand
  2017-12-08  7:00     ` Janosch Frank
  0 siblings, 1 reply; 42+ messages in thread
From: David Hildenbrand @ 2017-12-07 16:32 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

On 06.11.2017 23:29, Janosch Frank wrote:
> A guest can put DAT tables for a lower level guest in the same huge
> segment as one of its prefixes. This would make it necessary for the
> segment to be unprotected (because of the prefix) and protected
> (because of the shadowing) at the same time. This is not possible in
> this universe.
> 
> Hence we split the affected huge segment, so we can protect on a
> per-page basis. Such gmap segments are special and get a new software
> bit that helps us handle this edge case.

I am thinking about another condition and am not sure yet if it is
really a problem and already handled by this patch (if so, feel free to
add it to the description :) ): G2 -> G3 page table and a contained G2
-> G3 page lying on the same G1 huge page

G1 runs G2 with huge pages.
G2 runs G3 without huge pages,
G1 creates shadow page tables for G3.

G2 has no idea of huge pages, so it could happen that a
page table from G2 -> G3 falls into the same G1 huge page as a G2->G3
backing page.

Now, if we're unlucky, it can happen that this page table references
that G3 page, lying on the same G1 huge page.

G1 will create a shadow page table, protecting access to this huge page
(to maintain the shadow properly).

What will happen when G3 tries to write to this page:

1. Shadow page table in G1 is built, huge page is protected in g2 gmap.
2. Part of that huge page is to be used in the shadow page table with
write access. This huge page is protected but we need write access, so
we need to fix up.
3. Fixing up will invalidate the shadow page table.

IOW, G3 will never make progress.


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC/PATCH 19/22] s390/mm: Split huge pages if granular protection is needed
  2017-12-07 16:32   ` David Hildenbrand
@ 2017-12-08  7:00     ` Janosch Frank
  0 siblings, 0 replies; 42+ messages in thread
From: Janosch Frank @ 2017-12-08  7:00 UTC (permalink / raw)
  To: David Hildenbrand, kvm
  Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390



On 07.12.2017 17:32, David Hildenbrand wrote:
> On 06.11.2017 23:29, Janosch Frank wrote:
>> A guest can put DAT tables for a lower level guest in the same huge
>> segment as one of its prefixes. This would make it necessary for the
>> segment to be unprotected (because of the prefix) and protected
>> (because of the shadowing) at the same time. This is not possible in
>> this universe.
>>
>> Hence we split the affected huge segment, so we can protect on a
>> per-page basis. Such gmap segments are special and get a new software
>> bit that helps us handle this edge case.
> 
> I am thinking about another condition and am not sure yet if it is
> really a problem and already handled by this patch (if so, feel free to
> add it to the description :) ): G2 -> G3 page table and a contained G2
> -> G3 page lying on the same G1 huge page

Valid objection, but we (hopefully :) ) got you covered.

We split directly when a pmd protection request carries the VSIE bit
and then re-drive the protection on the ptes:

if (((prot != PROT_WRITE) && (bits & GMAP_ENTRY_VSIE))) {
		ret = gmap_pmd_split(gmap, gaddr, pmdp);
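
In context, the flow in the pmd protection path is roughly the following
(a sketch only; the exact placement and error handling differ in the
real patch):

	if ((prot != PROT_WRITE) && (bits & GMAP_ENTRY_VSIE)) {
		/* Shadowing needs per-page protection: split the huge pmd. */
		ret = gmap_pmd_split(gmap, gaddr, pmdp);
		if (ret)
			return ret;
		/*
		 * Return to the caller so that the protection request is
		 * re-driven and now walks the freshly allocated ptes.
		 */
		return -EFAULT;
	}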

The re-drive is a bit ugly because of the EFAULT, but I'm open to
suggestions.



> 
> G1 runs G2 with huge pages.
> G2 runs G3 without huge pages,
> G1 creates shadow page tables for G3.
> 
> G2 has no idea of huge pages, so it could happen that a
> page table from G2 -> G3 falls into the same G1 huge page as a G2->G3
> backing page.
> 
> Now, if we're unlucky, it can happen that this page table references
> that G3 page, lying on the same G1 huge page.
> 
> G1 will create a shadow page table, protecting access to this huge page
> (to maintain the shadow properly).
> 
> What will happen when G3 tries to write to this page:
> 
> 1. Shadow page table in G1 is built, huge page is protected in g2 gmap.
> 2. Part of that huge page is to be used in the shadow page table with
> write access. This huge page is protected but we need write access, so
> we need to fix up.
> 3. Fixing up will invalidate the shadow page table.
> 
> IOW, G3 will never make progress.
> 
> 




^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2017-12-08  7:00 UTC | newest]

Thread overview: 42+ messages
2017-11-06 22:29 [RFC/PATCH 00/22] KVM/s390: Hugetlbfs enablement Janosch Frank
2017-11-06 22:29 ` [RFC/PATCH 01/22] s390/mm: make gmap_protect_range more modular Janosch Frank
2017-11-08 10:40   ` David Hildenbrand
2017-11-08 12:21     ` Janosch Frank
2017-11-08 12:26       ` David Hildenbrand
2017-11-06 22:29 ` [RFC/PATCH 02/22] s390/mm: Abstract gmap notify bit setting Janosch Frank
2017-11-10 12:57   ` David Hildenbrand
2017-11-13 15:57     ` Janosch Frank
2017-11-15  9:30       ` David Hildenbrand
2017-11-06 22:29 ` [RFC/PATCH 03/22] s390/mm: add gmap PMD invalidation notification Janosch Frank
2017-11-15  9:55   ` David Hildenbrand
2017-11-17  9:02     ` Janosch Frank
2017-11-17  9:19       ` Martin Schwidefsky
2017-11-06 22:29 ` [RFC/PATCH 04/22] s390/mm: Add gmap pmd invalidation and clearing Janosch Frank
2017-11-06 22:29 ` [RFC/PATCH 05/22] s390/mm: hugetlb pages within a gmap can not be freed Janosch Frank
2017-11-06 22:29 ` [RFC/PATCH 06/22] s390/mm: Introduce gmap_pmdp_xchg Janosch Frank
2017-11-06 22:29 ` [RFC/PATCH 07/22] RFC: s390/mm: Transfer guest pmd protection to host Janosch Frank
2017-11-06 22:29 ` [RFC/PATCH 08/22] s390/mm: Add huge page dirty sync support Janosch Frank
2017-11-06 22:29 ` [RFC/PATCH 09/22] s390/mm: clear huge page storage keys on enable_skey Janosch Frank
2017-11-06 22:29 ` [RFC/PATCH 10/22] s390/mm: Add huge pmd storage key handling Janosch Frank
2017-11-06 22:29 ` [RFC/PATCH 11/22] s390/mm: Remove superfluous parameter Janosch Frank
2017-11-06 22:29 ` [RFC/PATCH 12/22] s390/mm: Add gmap_protect_large read protection support Janosch Frank
2017-11-06 22:29 ` [RFC/PATCH 13/22] s390/mm: Make gmap_read_table EDAT1 compatible Janosch Frank
2017-11-06 22:29 ` [RFC/PATCH 14/22] s390/mm: Make protect_rmap " Janosch Frank
2017-11-06 22:29 ` [RFC/PATCH 15/22] s390/mm: GMAP read table extensions Janosch Frank
2017-11-06 22:29 ` [RFC/PATCH 16/22] s390/mm: Add shadow segment code Janosch Frank
2017-11-06 22:29 ` [RFC/PATCH 17/22] s390/mm: Add VSIE reverse fake case Janosch Frank
2017-11-06 22:29 ` [RFC/PATCH 18/22] s390/mm: Remove gmap_pte_op_walk Janosch Frank
2017-11-06 22:29 ` [RFC/PATCH 19/22] s390/mm: Split huge pages if granular protection is needed Janosch Frank
2017-12-07 16:32   ` David Hildenbrand
2017-12-08  7:00     ` Janosch Frank
2017-11-06 22:29 ` [RFC/PATCH 20/22] s390/mm: Enable gmap huge pmd support Janosch Frank
2017-11-15 10:08   ` David Hildenbrand
2017-11-15 12:24     ` Janosch Frank
2017-11-06 22:29 ` [RFC/PATCH 21/22] KVM: s390: Add KVM HPAGE capability Janosch Frank
2017-11-07 10:07   ` Cornelia Huck
2017-11-07 10:53     ` Janosch Frank
2017-11-15 10:06   ` David Hildenbrand
2017-11-15 12:02     ` Janosch Frank
2017-11-06 22:30 ` [RFC/PATCH 22/22] RFC: s390/mm: Add gmap lock classes Janosch Frank
2017-11-15 10:10   ` David Hildenbrand
2017-11-15 12:16     ` Janosch Frank
