* [PATCH 00/14] KVM: s390: Add huge page VSIE support
From: Janosch Frank @ 2021-01-13  9:40 UTC (permalink / raw)
  To: kvm; +Cc: borntraeger, david, linux-s390, imbrenda

As we finally want to get rid of the nested and hpage s390 KVM module
parameters, let's try again to integrate huge page VSIE support.

The following patches have been rebased on 5.11-rc3 and enable us to
start huge page and normal page VSIE guest 3s in a huge page guest 2.

Branch:
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git/log/?h=hlp_vsie


Problems that need to be solved:
	* The VSIE guests crash on migration...
	* I have lost most of my knowledge about this topic and I'm
          currently paging it back in
	* Lots of testing still needs to be done


Janosch Frank (14):
  s390/mm: Code cleanups
  s390/mm: Improve locking for huge page backings
  s390/mm: Take locking out of gmap_protect_pte
  s390/mm: split huge pages in GMAP when protecting
  s390/mm: Split huge pages when migrating
  s390/mm: Provide vmaddr to pmd notification
  s390/mm: factor out idte global flush into gmap_idte_global
  s390/mm: Make gmap_read_table EDAT1 compatible
  s390/mm: Make gmap_protect_rmap EDAT1 compatible
  s390/mm: Add simple ptep shadow function
  s390/mm: Add gmap shadowing for large pmds
  s390/mm: Add gmap lock classes
  s390/mm: Pull pmd invalid check in gmap_pmd_op_walk
  KVM: s390: Allow the VSIE to be used with huge pages

 Documentation/virt/kvm/api.rst  |   9 +-
 arch/s390/include/asm/gmap.h    |  31 +-
 arch/s390/include/asm/pgtable.h |   5 +
 arch/s390/kvm/gaccess.c         |  52 +-
 arch/s390/kvm/kvm-s390.c        |  14 +-
 arch/s390/mm/gmap.c             | 917 ++++++++++++++++++++++++--------
 arch/s390/mm/pgtable.c          |  61 ++-
 7 files changed, 819 insertions(+), 270 deletions(-)

-- 
2.27.0



* [PATCH 01/14] s390/mm: Code cleanups
From: Janosch Frank @ 2021-01-13  9:41 UTC (permalink / raw)
  To: kvm; +Cc: borntraeger, david, linux-s390, imbrenda

Let's clean up leftovers before introducing new code.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
---
 arch/s390/mm/gmap.c    | 8 ++++----
 arch/s390/mm/pgtable.c | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 9bb2c7512cd5..f857104ca6c1 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -5,7 +5,7 @@
  *    Copyright IBM Corp. 2007, 2020
  *    Author(s): Martin Schwidefsky <schwidefsky@de.ibm.com>
  *		 David Hildenbrand <david@redhat.com>
- *		 Janosch Frank <frankja@linux.vnet.ibm.com>
+ *		 Janosch Frank <frankja@linux.ibm.com>
  */
 
 #include <linux/kernel.h>
@@ -2290,10 +2290,10 @@ static void gmap_pmdp_xchg(struct gmap *gmap, pmd_t *pmdp, pmd_t new,
 	pmdp_notify_gmap(gmap, pmdp, gaddr);
 	pmd_val(new) &= ~_SEGMENT_ENTRY_GMAP_IN;
 	if (MACHINE_HAS_TLB_GUEST)
-		__pmdp_idte(gaddr, (pmd_t *)pmdp, IDTE_GUEST_ASCE, gmap->asce,
+		__pmdp_idte(gaddr, pmdp, IDTE_GUEST_ASCE, gmap->asce,
 			    IDTE_GLOBAL);
 	else if (MACHINE_HAS_IDTE)
-		__pmdp_idte(gaddr, (pmd_t *)pmdp, 0, 0, IDTE_GLOBAL);
+		__pmdp_idte(gaddr, pmdp, 0, 0, IDTE_GLOBAL);
 	else
 		__pmdp_csp(pmdp);
 	*pmdp = new;
@@ -2523,7 +2523,7 @@ static inline void thp_split_mm(struct mm_struct *mm)
  * - This must be called after THP was enabled
  */
 static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
-			   unsigned long end, struct mm_walk *walk)
+			    unsigned long end, struct mm_walk *walk)
 {
 	unsigned long addr;
 
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 18205f851c24..5915f3b725bc 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -743,7 +743,7 @@ void ptep_zap_key(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
  * Test and reset if a guest page is dirty
  */
 bool ptep_test_and_clear_uc(struct mm_struct *mm, unsigned long addr,
-		       pte_t *ptep)
+			    pte_t *ptep)
 {
 	pgste_t pgste;
 	pte_t pte;
-- 
2.27.0



* [PATCH 02/14] s390/mm: Improve locking for huge page backings
From: Janosch Frank @ 2021-01-13  9:41 UTC (permalink / raw)
  To: kvm; +Cc: borntraeger, david, linux-s390, imbrenda

The gmap guest_table_lock is used to protect changes to the guest's
DAT tables from the region 1 level down to the segment level. It
therefore also protects the host_to_guest radix tree in which each new
segment mapping created by gmap_link() is tracked. Changes to ptes are
synchronized through the pte lock, which is easy to retrieve, because
the gmap shares the page tables with userspace.

With huge pages the story changes: pmd tables are not shared, so we
are left with the pmd lock on the userspace side and the
guest_table_lock on the gmap side. Having two locks for the same
object is a recipe for locking problems.

Therefore the guest_table_lock will only be used for populating the
gmap tables, and hence for protecting the host_to_guest tree, while
the pmd lock will be used for all changes to the pmd from both
userspace and the gmap.

This means we need to look up the vmaddr before we can retrieve a
gmap pmd, which takes a bit longer than before. In return we can now
operate on pmds in disjoint segment tables concurrently instead of
serializing on a global lock.
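
For illustration, this is roughly what a caller looks like after the
change, condensed from the gmap_protect_range() hunk below (error
handling and the huge page handling are omitted, so this is a sketch
rather than compilable code):

spinlock_t *ptl = NULL;
pmd_t *pmdp;

/* Walks the gmap tables; takes the host pmd lock if the backing is huge. */
pmdp = gmap_pmd_op_walk(gmap, gaddr, &ptl);
if (pmdp) {
        /* ... work on *pmdp, covered by the pmd lock if one was taken ... */
        gmap_pmd_op_end(ptl);   /* drops the pmd lock, if any */
}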

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
---
 arch/s390/include/asm/pgtable.h |  1 +
 arch/s390/mm/gmap.c             | 70 ++++++++++++++++++++-------------
 arch/s390/mm/pgtable.c          |  2 +-
 3 files changed, 45 insertions(+), 28 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 794746a32806..b1643afe1a00 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1519,6 +1519,7 @@ static __always_inline void __pudp_idte(unsigned long addr, pud_t *pudp,
 	}
 }
 
+pmd_t *pmd_alloc_map(struct mm_struct *mm, unsigned long addr);
 pmd_t pmdp_xchg_direct(struct mm_struct *, unsigned long, pmd_t *, pmd_t);
 pmd_t pmdp_xchg_lazy(struct mm_struct *, unsigned long, pmd_t *, pmd_t);
 pud_t pudp_xchg_direct(struct mm_struct *, unsigned long, pud_t *, pud_t);
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index f857104ca6c1..650c51749f4d 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -899,47 +899,62 @@ static void gmap_pte_op_end(spinlock_t *ptl)
 }
 
 /**
- * gmap_pmd_op_walk - walk the gmap tables, get the guest table lock
- *		      and return the pmd pointer
+ * gmap_pmd_op_walk - walk the gmap tables, get the pmd_lock if needed
+ *		      and return the pmd pointer or NULL
  * @gmap: pointer to guest mapping meta data structure
  * @gaddr: virtual address in the guest address space
  *
  * Returns a pointer to the pmd for a guest address, or NULL
  */
-static inline pmd_t *gmap_pmd_op_walk(struct gmap *gmap, unsigned long gaddr)
+static inline pmd_t *gmap_pmd_op_walk(struct gmap *gmap, unsigned long gaddr,
+				      spinlock_t **ptl)
 {
-	pmd_t *pmdp;
+	pmd_t *pmdp, *hpmdp;
+	unsigned long vmaddr;
+
 
 	BUG_ON(gmap_is_shadow(gmap));
-	pmdp = (pmd_t *) gmap_table_walk(gmap, gaddr, 1);
-	if (!pmdp)
-		return NULL;
 
-	/* without huge pages, there is no need to take the table lock */
-	if (!gmap->mm->context.allow_gmap_hpage_1m)
-		return pmd_none(*pmdp) ? NULL : pmdp;
-
-	spin_lock(&gmap->guest_table_lock);
-	if (pmd_none(*pmdp)) {
-		spin_unlock(&gmap->guest_table_lock);
-		return NULL;
+	*ptl = NULL;
+	if (gmap->mm->context.allow_gmap_hpage_1m) {
+		vmaddr = __gmap_translate(gmap, gaddr);
+		if (IS_ERR_VALUE(vmaddr))
+			return NULL;
+		hpmdp = pmd_alloc_map(gmap->mm, vmaddr);
+		if (!hpmdp)
+			return NULL;
+		*ptl = pmd_lock(gmap->mm, hpmdp);
+		if (pmd_none(*hpmdp)) {
+			spin_unlock(*ptl);
+			*ptl = NULL;
+			return NULL;
+		}
+		if (!pmd_large(*hpmdp)) {
+			spin_unlock(*ptl);
+			*ptl = NULL;
+		}
+	}
+
+	pmdp = (pmd_t *) gmap_table_walk(gmap, gaddr, 1);
+	if (!pmdp || pmd_none(*pmdp)) {
+		if (*ptl)
+			spin_unlock(*ptl);
+		pmdp = NULL;
+		*ptl = NULL;
 	}
 
-	/* 4k page table entries are locked via the pte (pte_alloc_map_lock). */
-	if (!pmd_large(*pmdp))
-		spin_unlock(&gmap->guest_table_lock);
 	return pmdp;
 }
 
 /**
- * gmap_pmd_op_end - release the guest_table_lock if needed
+ * gmap_pmd_op_end - release the pmd lock if needed
  * @gmap: pointer to the guest mapping meta data structure
  * @pmdp: pointer to the pmd
  */
-static inline void gmap_pmd_op_end(struct gmap *gmap, pmd_t *pmdp)
+static inline void gmap_pmd_op_end(spinlock_t *ptl)
 {
-	if (pmd_large(*pmdp))
-		spin_unlock(&gmap->guest_table_lock);
+	if (ptl)
+		spin_unlock(ptl);
 }
 
 /*
@@ -1041,13 +1056,14 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 			      unsigned long len, int prot, unsigned long bits)
 {
 	unsigned long vmaddr, dist;
+	spinlock_t *ptl = NULL;
 	pmd_t *pmdp;
 	int rc;
 
 	BUG_ON(gmap_is_shadow(gmap));
 	while (len) {
 		rc = -EAGAIN;
-		pmdp = gmap_pmd_op_walk(gmap, gaddr);
+		pmdp = gmap_pmd_op_walk(gmap, gaddr, &ptl);
 		if (pmdp) {
 			if (!pmd_large(*pmdp)) {
 				rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
@@ -1065,7 +1081,7 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 					gaddr = (gaddr & HPAGE_MASK) + HPAGE_SIZE;
 				}
 			}
-			gmap_pmd_op_end(gmap, pmdp);
+			gmap_pmd_op_end(ptl);
 		}
 		if (rc) {
 			if (rc == -EINVAL)
@@ -2462,9 +2478,9 @@ void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long bitmap[4],
 	int i;
 	pmd_t *pmdp;
 	pte_t *ptep;
-	spinlock_t *ptl;
+	spinlock_t *ptl = NULL;
 
-	pmdp = gmap_pmd_op_walk(gmap, gaddr);
+	pmdp = gmap_pmd_op_walk(gmap, gaddr, &ptl);
 	if (!pmdp)
 		return;
 
@@ -2481,7 +2497,7 @@ void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long bitmap[4],
 			spin_unlock(ptl);
 		}
 	}
-	gmap_pmd_op_end(gmap, pmdp);
+	gmap_pmd_op_end(ptl);
 }
 EXPORT_SYMBOL_GPL(gmap_sync_dirty_log_pmd);
 
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 5915f3b725bc..a0e674a9c70a 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -429,7 +429,7 @@ static inline pmd_t pmdp_flush_lazy(struct mm_struct *mm,
 }
 
 #ifdef CONFIG_PGSTE
-static pmd_t *pmd_alloc_map(struct mm_struct *mm, unsigned long addr)
+pmd_t *pmd_alloc_map(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
-- 
2.27.0



* [PATCH 03/14] s390/mm: Take locking out of gmap_protect_pte
From: Janosch Frank @ 2021-01-13  9:41 UTC (permalink / raw)
  To: kvm; +Cc: borntraeger, david, linux-s390, imbrenda

Taking the lock outside of the function gives us the freedom to order
the locks as needed, which will be important to avoid locking issues
in gmap_protect_rmap().
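
For reference, the 4k path in gmap_protect_range() then looks roughly
like this (condensed from the hunk below):

/* The pte lock is now taken by the caller, not inside gmap_protect_pte(). */
ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr, &ptl_pte);
if (ptep)
        rc = gmap_protect_pte(gmap, gaddr, ptep, prot, bits);
else
        rc = -ENOMEM;
gmap_pte_op_end(ptl_pte);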

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
---
 arch/s390/mm/gmap.c | 30 ++++++++++++++----------------
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 650c51749f4d..c38f49dedf35 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -1017,25 +1017,15 @@ static int gmap_protect_pmd(struct gmap *gmap, unsigned long gaddr,
  * Expected to be called with sg->mm->mmap_lock in read
  */
 static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
-			    pmd_t *pmdp, int prot, unsigned long bits)
+			    pte_t *ptep, int prot, unsigned long bits)
 {
 	int rc;
-	pte_t *ptep;
-	spinlock_t *ptl = NULL;
 	unsigned long pbits = 0;
 
-	if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
-		return -EAGAIN;
-
-	ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr, &ptl);
-	if (!ptep)
-		return -ENOMEM;
-
 	pbits |= (bits & GMAP_NOTIFY_MPROT) ? PGSTE_IN_BIT : 0;
 	pbits |= (bits & GMAP_NOTIFY_SHADOW) ? PGSTE_VSIE_BIT : 0;
 	/* Protect and unlock. */
 	rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, pbits);
-	gmap_pte_op_end(ptl);
 	return rc;
 }
 
@@ -1056,18 +1046,26 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 			      unsigned long len, int prot, unsigned long bits)
 {
 	unsigned long vmaddr, dist;
-	spinlock_t *ptl = NULL;
+	spinlock_t *ptl_pmd = NULL, *ptl_pte = NULL;
 	pmd_t *pmdp;
+	pte_t *ptep;
 	int rc;
 
 	BUG_ON(gmap_is_shadow(gmap));
 	while (len) {
 		rc = -EAGAIN;
-		pmdp = gmap_pmd_op_walk(gmap, gaddr, &ptl);
+		pmdp = gmap_pmd_op_walk(gmap, gaddr, &ptl_pmd);
 		if (pmdp) {
 			if (!pmd_large(*pmdp)) {
-				rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
-						      bits);
+				ptl_pte = NULL;
+				ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr,
+							  &ptl_pte);
+				if (ptep)
+					rc = gmap_protect_pte(gmap, gaddr,
+							      ptep, prot, bits);
+				else
+					rc = -ENOMEM;
+				gmap_pte_op_end(ptl_pte);
 				if (!rc) {
 					len -= PAGE_SIZE;
 					gaddr += PAGE_SIZE;
@@ -1081,7 +1079,7 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 					gaddr = (gaddr & HPAGE_MASK) + HPAGE_SIZE;
 				}
 			}
-			gmap_pmd_op_end(ptl);
+			gmap_pmd_op_end(ptl_pmd);
 		}
 		if (rc) {
 			if (rc == -EINVAL)
-- 
2.27.0



* [PATCH 04/14] s390/mm: split huge pages in GMAP when protecting
From: Janosch Frank @ 2021-01-13  9:41 UTC (permalink / raw)
  To: kvm; +Cc: borntraeger, david, linux-s390, imbrenda

Dirty tracking, VSIE protection and lowcore invalidation notification
are best done on the smallest page size available to avoid unnecessary
flushing and table management operations.

Hence we now split huge pages and introduce a page table whenever a
notification bit is set or memory is protected via gmap_protect_range()
or gmap_protect_rmap().
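
The split needs a pre-allocated pgste page table, and the allocation
must not happen under the pmd lock, so gmap_protect_range() retries
after allocating. Roughly, condensed from the hunks below (a sketch,
not a drop-in replacement):

if (pmd_large(*pmdp)) {
        if (!page) {
                /* Drop the pmd lock for the allocation, then retry. */
                gmap_pmd_op_end(ptl_pmd);
                page = page_table_alloc_pgste(gmap->mm);
                if (!page)
                        return -ENOMEM;
                continue;
        }
        /* Replace the huge pmd with a pre-filled 4k page table. */
        gmap_pmd_split(gmap, gaddr, pmdp, page);
        page = NULL;
        gmap_pmd_op_end(ptl_pmd);
        continue;
}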

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
---
 arch/s390/include/asm/gmap.h    |  18 +++
 arch/s390/include/asm/pgtable.h |   3 +
 arch/s390/mm/gmap.c             | 247 +++++++++++++++++++++++++-------
 arch/s390/mm/pgtable.c          |  33 +++++
 4 files changed, 251 insertions(+), 50 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 40264f60b0da..a5711c189018 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -19,6 +19,11 @@
 /* Status bits only for huge segment entries */
 #define _SEGMENT_ENTRY_GMAP_IN		0x8000	/* invalidation notify bit */
 #define _SEGMENT_ENTRY_GMAP_UC		0x4000	/* dirty (migration) */
+/* Status bits in the gmap segment entry. */
+#define _SEGMENT_ENTRY_GMAP_SPLIT	0x0001  /* split huge pmd */
+
+#define GMAP_SEGMENT_STATUS_BITS (_SEGMENT_ENTRY_GMAP_UC | _SEGMENT_ENTRY_GMAP_SPLIT)
+#define GMAP_SEGMENT_NOTIFY_BITS _SEGMENT_ENTRY_GMAP_IN
 
 /**
  * struct gmap_struct - guest address space
@@ -62,6 +67,8 @@ struct gmap {
 	struct radix_tree_root host_to_rmap;
 	struct list_head children;
 	struct list_head pt_list;
+	struct list_head split_list;
+	spinlock_t split_list_lock;
 	spinlock_t shadow_lock;
 	struct gmap *parent;
 	unsigned long orig_asce;
@@ -102,6 +109,17 @@ static inline int gmap_is_shadow(struct gmap *gmap)
 	return !!gmap->parent;
 }
 
+/**
+ * gmap_pmd_is_split - Returns if a huge gmap pmd has been split.
+ * @pmdp: pointer to the pmd
+ *
+ * Returns true if the passed huge gmap pmd has been split.
+ */
+static inline bool gmap_pmd_is_split(pmd_t *pmdp)
+{
+	return !!(pmd_val(*pmdp) & _SEGMENT_ENTRY_GMAP_SPLIT);
+}
+
 struct gmap *gmap_create(struct mm_struct *mm, unsigned long limit);
 void gmap_remove(struct gmap *gmap);
 struct gmap *gmap_get(struct gmap *gmap);
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index b1643afe1a00..6d6ad508f9c7 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1147,6 +1147,9 @@ int ptep_shadow_pte(struct mm_struct *mm, unsigned long saddr,
 		    pte_t *sptep, pte_t *tptep, pte_t pte);
 void ptep_unshadow_pte(struct mm_struct *mm, unsigned long saddr, pte_t *ptep);
 
+unsigned long ptep_get_and_clear_notification_bits(pte_t *ptep);
+void ptep_remove_protection_split(struct mm_struct *mm, pte_t *ptep,
+				  unsigned long gaddr);
 bool ptep_test_and_clear_uc(struct mm_struct *mm, unsigned long address,
 			    pte_t *ptep);
 int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index c38f49dedf35..41a5bbbc59e6 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -62,11 +62,13 @@ static struct gmap *gmap_alloc(unsigned long limit)
 	INIT_LIST_HEAD(&gmap->crst_list);
 	INIT_LIST_HEAD(&gmap->children);
 	INIT_LIST_HEAD(&gmap->pt_list);
+	INIT_LIST_HEAD(&gmap->split_list);
 	INIT_RADIX_TREE(&gmap->guest_to_host, GFP_KERNEL_ACCOUNT);
 	INIT_RADIX_TREE(&gmap->host_to_guest, GFP_ATOMIC | __GFP_ACCOUNT);
 	INIT_RADIX_TREE(&gmap->host_to_rmap, GFP_ATOMIC | __GFP_ACCOUNT);
 	spin_lock_init(&gmap->guest_table_lock);
 	spin_lock_init(&gmap->shadow_lock);
+	spin_lock_init(&gmap->split_list_lock);
 	refcount_set(&gmap->ref_count, 1);
 	page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
 	if (!page)
@@ -193,6 +195,10 @@ static void gmap_free(struct gmap *gmap)
 	gmap_radix_tree_free(&gmap->guest_to_host);
 	gmap_radix_tree_free(&gmap->host_to_guest);
 
+	/* Free split pmd page tables */
+	list_for_each_entry_safe(page, next, &gmap->split_list, lru)
+		page_table_free_pgste(page);
+
 	/* Free additional data for a shadow gmap */
 	if (gmap_is_shadow(gmap)) {
 		/* Free all page tables. */
@@ -547,6 +553,7 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 	pud_t *pud;
 	pmd_t *pmd;
 	u64 unprot;
+	pte_t *ptep;
 	int rc;
 
 	BUG_ON(gmap_is_shadow(gmap));
@@ -597,9 +604,15 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 	rc = radix_tree_preload(GFP_KERNEL_ACCOUNT);
 	if (rc)
 		return rc;
+	/*
+	 * do_exception() does remove the pte index for huge
+	 * pages, so we need to re-add it here to work on the
+	 * correct pte.
+	 */
+	vmaddr = vmaddr | (gaddr & ~PMD_MASK);
 	ptl = pmd_lock(mm, pmd);
-	spin_lock(&gmap->guest_table_lock);
 	if (*table == _SEGMENT_ENTRY_EMPTY) {
+		spin_lock(&gmap->guest_table_lock);
 		rc = radix_tree_insert(&gmap->host_to_guest,
 				       vmaddr >> PMD_SHIFT, table);
 		if (!rc) {
@@ -611,14 +624,24 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 				*table = pmd_val(*pmd) &
 					_SEGMENT_ENTRY_HARDWARE_BITS;
 		}
+		spin_unlock(&gmap->guest_table_lock);
 	} else if (*table & _SEGMENT_ENTRY_PROTECT &&
 		   !(pmd_val(*pmd) & _SEGMENT_ENTRY_PROTECT)) {
 		unprot = (u64)*table;
 		unprot &= ~_SEGMENT_ENTRY_PROTECT;
 		unprot |= _SEGMENT_ENTRY_GMAP_UC;
 		gmap_pmdp_xchg(gmap, (pmd_t *)table, __pmd(unprot), gaddr);
+	} else if (gmap_pmd_is_split((pmd_t *)table)) {
+		/*
+		 * Split pmds are somewhere in-between a normal and a
+		 * large pmd. As we don't share the page table, the
+		 * host does not remove protection on a fault and we
+		 * have to do it ourselves for the guest mapping.
+		 */
+		ptep = pte_offset_map((pmd_t *)table, vmaddr);
+		if (pte_val(*ptep) & _PAGE_PROTECT)
+			ptep_remove_protection_split(mm, ptep, vmaddr);
 	}
-	spin_unlock(&gmap->guest_table_lock);
 	spin_unlock(ptl);
 	radix_tree_preload_end();
 	return rc;
@@ -860,7 +883,7 @@ static pte_t *gmap_pte_op_walk(struct gmap *gmap, unsigned long gaddr,
 }
 
 /**
- * gmap_pte_op_fixup - force a page in and connect the gmap page table
+ * gmap_fixup - force memory in and connect the gmap table entry
  * @gmap: pointer to guest mapping meta data structure
  * @gaddr: virtual address in the guest address space
  * @vmaddr: address in the host process address space
@@ -868,10 +891,10 @@ static pte_t *gmap_pte_op_walk(struct gmap *gmap, unsigned long gaddr,
  *
  * Returns 0 if the caller can retry __gmap_translate (might fail again),
  * -ENOMEM if out of memory and -EFAULT if anything goes wrong while fixing
- * up or connecting the gmap page table.
+ * up or connecting the gmap table entry.
  */
-static int gmap_pte_op_fixup(struct gmap *gmap, unsigned long gaddr,
-			     unsigned long vmaddr, int prot)
+static int gmap_fixup(struct gmap *gmap, unsigned long gaddr,
+		      unsigned long vmaddr, int prot)
 {
 	struct mm_struct *mm = gmap->mm;
 	unsigned int fault_flags;
@@ -957,6 +980,76 @@ static inline void gmap_pmd_op_end(spinlock_t *ptl)
 		spin_unlock(ptl);
 }
 
+static pte_t *gmap_pte_from_pmd(struct gmap *gmap, pmd_t *pmdp,
+				unsigned long addr, spinlock_t **ptl)
+{
+	*ptl = NULL;
+	if (likely(!gmap_pmd_is_split(pmdp)))
+		return pte_alloc_map_lock(gmap->mm, pmdp, addr, ptl);
+
+	return pte_offset_map(pmdp, addr);
+}
+
+/**
+ * gmap_pmd_split_free - Free a split pmd's page table
+ * @pmdp The split pmd that we free of its page table
+ *
+ * If the userspace pmds are exchanged, we'll remove the gmap pmds as
+ * well, so we fault on them and link them again. We would leak
+ * memory, if we didn't free split pmds here.
+ */
+static inline void gmap_pmd_split_free(struct gmap *gmap, pmd_t *pmdp)
+{
+	unsigned long pgt = pmd_val(*pmdp) & _SEGMENT_ENTRY_ORIGIN;
+	struct page *page;
+
+	if (gmap_pmd_is_split(pmdp)) {
+		page = pfn_to_page(pgt >> PAGE_SHIFT);
+		spin_lock(&gmap->split_list_lock);
+		list_del(&page->lru);
+		spin_unlock(&gmap->split_list_lock);
+		page_table_free_pgste(page);
+	}
+}
+
+/**
+ * gmap_pmd_split - Split a huge gmap pmd and use a page table instead
+ * @gmap: pointer to guest mapping meta data structure
+ * @gaddr: virtual address in the guest address space
+ * @pmdp: pointer to the pmd that will be split
+ * @pgtable: Pre-allocated page table
+ *
+ * When splitting gmap pmds, we have to make the resulting page table
+ * look like it's a normal one to be able to use the common pte
+ * handling functions. Also we need to track these new tables as they
+ * aren't tracked anywhere else.
+ */
+static void gmap_pmd_split(struct gmap *gmap, unsigned long gaddr,
+			   pmd_t *pmdp, struct page *page)
+{
+	unsigned long *ptable = (unsigned long *) page_to_phys(page);
+	pmd_t new;
+	int i;
+
+	for (i = 0; i < 256; i++) {
+		ptable[i] = (pmd_val(*pmdp) & HPAGE_MASK) + i * PAGE_SIZE;
+		/* Carry over hardware permission from the pmd */
+		if (pmd_val(*pmdp) & _SEGMENT_ENTRY_PROTECT)
+			ptable[i] |= _PAGE_PROTECT;
+		/* pmd_large() implies pmd/pte_present() */
+		ptable[i] |=  _PAGE_PRESENT | _PAGE_READ | _PAGE_WRITE;
+		/* ptes are directly marked as dirty */
+		ptable[i + PTRS_PER_PTE] |= PGSTE_UC_BIT;
+	}
+
+	pmd_val(new) = ((unsigned long)ptable | _SEGMENT_ENTRY |
+			(_SEGMENT_ENTRY_GMAP_SPLIT));
+	spin_lock(&gmap->split_list_lock);
+	list_add(&page->lru, &gmap->split_list);
+	spin_unlock(&gmap->split_list_lock);
+	gmap_pmdp_xchg(gmap, pmdp, new, gaddr);
+}
+
 /*
  * gmap_protect_pmd - remove access rights to memory and set pmd notification bits
  * @pmdp: pointer to the pmd to be protected
@@ -1045,7 +1138,8 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
 static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 			      unsigned long len, int prot, unsigned long bits)
 {
-	unsigned long vmaddr, dist;
+	struct page *page = NULL;
+	unsigned long vmaddr;
 	spinlock_t *ptl_pmd = NULL, *ptl_pte = NULL;
 	pmd_t *pmdp;
 	pte_t *ptep;
@@ -1054,12 +1148,12 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 	BUG_ON(gmap_is_shadow(gmap));
 	while (len) {
 		rc = -EAGAIN;
+
 		pmdp = gmap_pmd_op_walk(gmap, gaddr, &ptl_pmd);
-		if (pmdp) {
+		if (pmdp && !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {
 			if (!pmd_large(*pmdp)) {
-				ptl_pte = NULL;
-				ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr,
-							  &ptl_pte);
+				ptep = gmap_pte_from_pmd(gmap, pmdp, gaddr,
+							 &ptl_pte);
 				if (ptep)
 					rc = gmap_protect_pte(gmap, gaddr,
 							      ptep, prot, bits);
@@ -1071,25 +1165,37 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 					gaddr += PAGE_SIZE;
 				}
 			} else {
-				rc = gmap_protect_pmd(gmap, gaddr, pmdp, prot,
-						      bits);
-				if (!rc) {
-					dist = HPAGE_SIZE - (gaddr & ~HPAGE_MASK);
-					len = len < dist ? 0 : len - dist;
-					gaddr = (gaddr & HPAGE_MASK) + HPAGE_SIZE;
+				if (!page) {
+					/* Drop locks for allocation. */
+					gmap_pmd_op_end(ptl_pmd);
+					ptl_pmd = NULL;
+					page = page_table_alloc_pgste(gmap->mm);
+					if (!page)
+						return -ENOMEM;
+					continue;
+				} else {
+					gmap_pmd_split(gmap, gaddr,
+						       pmdp, page);
+					page = NULL;
+					gmap_pmd_op_end(ptl_pmd);
+					continue;
 				}
 			}
 			gmap_pmd_op_end(ptl_pmd);
 		}
+		if (page) {
+			page_table_free_pgste(page);
+			page = NULL;
+		}
 		if (rc) {
-			if (rc == -EINVAL)
+			if (rc == -EINVAL || rc == -ENOMEM)
 				return rc;
 
 			/* -EAGAIN, fixup of userspace mm and gmap */
 			vmaddr = __gmap_translate(gmap, gaddr);
 			if (IS_ERR_VALUE(vmaddr))
 				return vmaddr;
-			rc = gmap_pte_op_fixup(gmap, gaddr, vmaddr, prot);
+			rc = gmap_fixup(gmap, gaddr, vmaddr, prot);
 			if (rc)
 				return rc;
 		}
@@ -1172,7 +1278,7 @@ int gmap_read_table(struct gmap *gmap, unsigned long gaddr, unsigned long *val)
 			rc = vmaddr;
 			break;
 		}
-		rc = gmap_pte_op_fixup(gmap, gaddr, vmaddr, PROT_READ);
+		rc = gmap_fixup(gmap, gaddr, vmaddr, PROT_READ);
 		if (rc)
 			break;
 	}
@@ -1255,7 +1361,7 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
 		radix_tree_preload_end();
 		if (rc) {
 			kfree(rmap);
-			rc = gmap_pte_op_fixup(parent, paddr, vmaddr, PROT_READ);
+			rc = gmap_fixup(parent, paddr, vmaddr, PROT_READ);
 			if (rc)
 				return rc;
 			continue;
@@ -2170,7 +2276,7 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
 		radix_tree_preload_end();
 		if (!rc)
 			break;
-		rc = gmap_pte_op_fixup(parent, paddr, vmaddr, prot);
+		rc = gmap_fixup(parent, paddr, vmaddr, prot);
 		if (rc)
 			break;
 	}
@@ -2236,6 +2342,30 @@ static void gmap_shadow_notify(struct gmap *sg, unsigned long vmaddr,
 	spin_unlock(&sg->guest_table_lock);
 }
 
+/*
+ * ptep_notify_gmap - call all invalidation callbacks for a specific pte of a gmap
+ * @mm: pointer to the process mm_struct
+ * @addr: virtual address in the process address space
+ * @pte: pointer to the page table entry
+ * @bits: bits from the pgste that caused the notify call
+ *
+ * This function is assumed to be called with the guest_table_lock held.
+ */
+static void ptep_notify_gmap(struct gmap *gmap, unsigned long gaddr,
+			     unsigned long vmaddr, unsigned long bits)
+{
+	struct gmap *sg, *next;
+
+	if (!list_empty(&gmap->children) && (bits & PGSTE_VSIE_BIT)) {
+		spin_lock(&gmap->shadow_lock);
+		list_for_each_entry_safe(sg, next, &gmap->children, list)
+			gmap_shadow_notify(sg, vmaddr, gaddr);
+		spin_unlock(&gmap->shadow_lock);
+	}
+	if (bits & PGSTE_IN_BIT)
+		gmap_call_notifier(gmap, gaddr, gaddr + PAGE_SIZE - 1);
+}
+
 /**
  * ptep_notify - call all invalidation callbacks for a specific pte.
  * @mm: pointer to the process mm_struct
@@ -2251,7 +2381,7 @@ void ptep_notify(struct mm_struct *mm, unsigned long vmaddr,
 {
 	unsigned long offset, gaddr = 0;
 	unsigned long *table;
-	struct gmap *gmap, *sg, *next;
+	struct gmap *gmap;
 
 	offset = ((unsigned long) pte) & (255 * sizeof(pte_t));
 	offset = offset * (PAGE_SIZE / sizeof(pte_t));
@@ -2266,23 +2396,34 @@ void ptep_notify(struct mm_struct *mm, unsigned long vmaddr,
 		if (!table)
 			continue;
 
-		if (!list_empty(&gmap->children) && (bits & PGSTE_VSIE_BIT)) {
-			spin_lock(&gmap->shadow_lock);
-			list_for_each_entry_safe(sg, next,
-						 &gmap->children, list)
-				gmap_shadow_notify(sg, vmaddr, gaddr);
-			spin_unlock(&gmap->shadow_lock);
-		}
-		if (bits & PGSTE_IN_BIT)
-			gmap_call_notifier(gmap, gaddr, gaddr + PAGE_SIZE - 1);
+		ptep_notify_gmap(gmap, gaddr, vmaddr, bits);
 	}
 	rcu_read_unlock();
 }
 EXPORT_SYMBOL_GPL(ptep_notify);
 
-static void pmdp_notify_gmap(struct gmap *gmap, pmd_t *pmdp,
-			     unsigned long gaddr)
+static inline void pmdp_notify_split(struct gmap *gmap, pmd_t *pmdp,
+				     unsigned long gaddr, unsigned long vmaddr)
 {
+	int i = 0;
+	unsigned long bits;
+	pte_t *ptep = (pte_t *)(pmd_val(*pmdp) & PAGE_MASK);
+
+	for (; i < 256; i++, gaddr += PAGE_SIZE, vmaddr += PAGE_SIZE, ptep++) {
+		bits = ptep_get_and_clear_notification_bits(ptep);
+		if (bits)
+			ptep_notify_gmap(gmap, gaddr, vmaddr, bits);
+	}
+}
+
+static void pmdp_notify_gmap(struct gmap *gmap, pmd_t *pmdp,
+			     unsigned long gaddr, unsigned long vmaddr)
+{
+	if (gmap_pmd_is_split(pmdp))
+		return pmdp_notify_split(gmap, pmdp, gaddr, vmaddr);
+
+	if (!(pmd_val(*pmdp) & _SEGMENT_ENTRY_GMAP_IN))
+		return;
 	pmd_val(*pmdp) &= ~_SEGMENT_ENTRY_GMAP_IN;
 	gmap_call_notifier(gmap, gaddr, gaddr + HPAGE_SIZE - 1);
 }
@@ -2301,8 +2442,9 @@ static void gmap_pmdp_xchg(struct gmap *gmap, pmd_t *pmdp, pmd_t new,
 			   unsigned long gaddr)
 {
 	gaddr &= HPAGE_MASK;
-	pmdp_notify_gmap(gmap, pmdp, gaddr);
-	pmd_val(new) &= ~_SEGMENT_ENTRY_GMAP_IN;
+	pmdp_notify_gmap(gmap, pmdp, gaddr, 0);
+	if (pmd_large(new))
+		pmd_val(new) &= ~GMAP_SEGMENT_NOTIFY_BITS;
 	if (MACHINE_HAS_TLB_GUEST)
 		__pmdp_idte(gaddr, pmdp, IDTE_GUEST_ASCE, gmap->asce,
 			    IDTE_GLOBAL);
@@ -2327,11 +2469,13 @@ static void gmap_pmdp_clear(struct mm_struct *mm, unsigned long vmaddr,
 						  vmaddr >> PMD_SHIFT);
 		if (pmdp) {
 			gaddr = __gmap_segment_gaddr((unsigned long *)pmdp);
-			pmdp_notify_gmap(gmap, pmdp, gaddr);
-			WARN_ON(pmd_val(*pmdp) & ~(_SEGMENT_ENTRY_HARDWARE_BITS_LARGE |
-						   _SEGMENT_ENTRY_GMAP_UC));
+			pmdp_notify_gmap(gmap, pmdp, gaddr, vmaddr);
+			if (pmd_large(*pmdp))
+				WARN_ON(pmd_val(*pmdp) &
+					GMAP_SEGMENT_NOTIFY_BITS);
 			if (purge)
 				__pmdp_csp(pmdp);
+			gmap_pmd_split_free(gmap, pmdp);
 			pmd_val(*pmdp) = _SEGMENT_ENTRY_EMPTY;
 		}
 		spin_unlock(&gmap->guest_table_lock);
@@ -2381,14 +2525,15 @@ void gmap_pmdp_idte_local(struct mm_struct *mm, unsigned long vmaddr)
 		if (entry) {
 			pmdp = (pmd_t *)entry;
 			gaddr = __gmap_segment_gaddr(entry);
-			pmdp_notify_gmap(gmap, pmdp, gaddr);
-			WARN_ON(*entry & ~(_SEGMENT_ENTRY_HARDWARE_BITS_LARGE |
-					   _SEGMENT_ENTRY_GMAP_UC));
+			pmdp_notify_gmap(gmap, pmdp, gaddr, vmaddr);
+			if (pmd_large(*pmdp))
+				WARN_ON(*entry & GMAP_SEGMENT_NOTIFY_BITS);
 			if (MACHINE_HAS_TLB_GUEST)
 				__pmdp_idte(gaddr, pmdp, IDTE_GUEST_ASCE,
 					    gmap->asce, IDTE_LOCAL);
 			else if (MACHINE_HAS_IDTE)
 				__pmdp_idte(gaddr, pmdp, 0, 0, IDTE_LOCAL);
+			gmap_pmd_split_free(gmap, pmdp);
 			*entry = _SEGMENT_ENTRY_EMPTY;
 		}
 		spin_unlock(&gmap->guest_table_lock);
@@ -2416,9 +2561,9 @@ void gmap_pmdp_idte_global(struct mm_struct *mm, unsigned long vmaddr)
 		if (entry) {
 			pmdp = (pmd_t *)entry;
 			gaddr = __gmap_segment_gaddr(entry);
-			pmdp_notify_gmap(gmap, pmdp, gaddr);
-			WARN_ON(*entry & ~(_SEGMENT_ENTRY_HARDWARE_BITS_LARGE |
-					   _SEGMENT_ENTRY_GMAP_UC));
+			pmdp_notify_gmap(gmap, pmdp, gaddr, vmaddr);
+			if (pmd_large(*pmdp))
+				WARN_ON(*entry & GMAP_SEGMENT_NOTIFY_BITS);
 			if (MACHINE_HAS_TLB_GUEST)
 				__pmdp_idte(gaddr, pmdp, IDTE_GUEST_ASCE,
 					    gmap->asce, IDTE_GLOBAL);
@@ -2426,6 +2571,7 @@ void gmap_pmdp_idte_global(struct mm_struct *mm, unsigned long vmaddr)
 				__pmdp_idte(gaddr, pmdp, 0, 0, IDTE_GLOBAL);
 			else
 				__pmdp_csp(pmdp);
+			gmap_pmd_split_free(gmap, pmdp);
 			*entry = _SEGMENT_ENTRY_EMPTY;
 		}
 		spin_unlock(&gmap->guest_table_lock);
@@ -2476,9 +2622,10 @@ void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long bitmap[4],
 	int i;
 	pmd_t *pmdp;
 	pte_t *ptep;
-	spinlock_t *ptl = NULL;
+	spinlock_t *ptl_pmd = NULL;
+	spinlock_t *ptl_pte = NULL;
 
-	pmdp = gmap_pmd_op_walk(gmap, gaddr, &ptl);
+	pmdp = gmap_pmd_op_walk(gmap, gaddr, &ptl_pmd);
 	if (!pmdp)
 		return;
 
@@ -2487,15 +2634,15 @@ void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long bitmap[4],
 			bitmap_fill(bitmap, _PAGE_ENTRIES);
 	} else {
 		for (i = 0; i < _PAGE_ENTRIES; i++, vmaddr += PAGE_SIZE) {
-			ptep = pte_alloc_map_lock(gmap->mm, pmdp, vmaddr, &ptl);
+			ptep = gmap_pte_from_pmd(gmap, pmdp, vmaddr, &ptl_pte);
 			if (!ptep)
 				continue;
 			if (ptep_test_and_clear_uc(gmap->mm, vmaddr, ptep))
 				set_bit(i, bitmap);
-			spin_unlock(ptl);
+			gmap_pte_op_end(ptl_pte);
 		}
 	}
-	gmap_pmd_op_end(ptl);
+	gmap_pmd_op_end(ptl_pmd);
 }
 EXPORT_SYMBOL_GPL(gmap_sync_dirty_log_pmd);
 
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index a0e674a9c70a..16896f936d32 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -739,6 +739,39 @@ void ptep_zap_key(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 	preempt_enable();
 }
 
+unsigned long ptep_get_and_clear_notification_bits(pte_t *ptep)
+{
+	pgste_t pgste;
+	unsigned long bits;
+
+	pgste = pgste_get_lock(ptep);
+	bits = pgste_val(pgste) & (PGSTE_IN_BIT | PGSTE_VSIE_BIT);
+	pgste_val(pgste) ^= bits;
+	pgste_set_unlock(ptep, pgste);
+
+	return bits;
+}
+EXPORT_SYMBOL_GPL(ptep_get_and_clear_notification_bits);
+
+void ptep_remove_protection_split(struct mm_struct *mm, pte_t *ptep,
+				  unsigned long gaddr)
+{
+	pte_t pte;
+	pgste_t pgste;
+
+	pgste = pgste_get_lock(ptep);
+	pgste_val(pgste) |= PGSTE_UC_BIT;
+	pte = *ptep;
+	pte_val(pte) &= ~_PAGE_PROTECT;
+
+	pgste = pgste_pte_notify(mm, gaddr, ptep, pgste);
+	ptep_ipte_global(mm, gaddr, ptep, 0);
+
+	*ptep = pte;
+	pgste_set_unlock(ptep, pgste);
+}
+EXPORT_SYMBOL_GPL(ptep_remove_protection_split);
+
 /*
  * Test and reset if a guest page is dirty
  */
-- 
2.27.0



* [PATCH 05/14] s390/mm: Split huge pages when migrating
From: Janosch Frank @ 2021-01-13  9:41 UTC (permalink / raw)
  To: kvm; +Cc: borntraeger, david, linux-s390, imbrenda

Right now we mark the whole huge page that is being written to as
dirty, although only a single byte may have changed. This means we
have to migrate a full 1 MB segment although only a very limited
amount of memory in that range might actually be dirty.

To speed up migration, this patch splits write-protected huge pages
into normal pages. The protection is then only removed for the normal
page that caused the fault.
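
In __gmap_link() this becomes a retry loop: on a fault against a
protected huge pmd we drop the locks, allocate a pgste page table,
retry the link, split the pmd and unprotect only the faulting 4k page.
A slightly reordered, condensed sketch of the hunk below:

} else if (*table & _SEGMENT_ENTRY_PROTECT &&
           !(pmd_val(*pmd) & _SEGMENT_ENTRY_PROTECT)) {
        if (!page) {
                /* Drop the locks, allocate a pgste page table, retry. */
                spin_unlock(ptl);
                radix_tree_preload_end();
                page = page_table_alloc_pgste(mm);
                if (page)
                        goto retry_split;
                rc = -ENOMEM;
        } else {
                gmap_pmd_split(gmap, gaddr, (pmd_t *)table, page);
                page = NULL;
                /* The split carries the protection over, so the faulting
                 * 4k page still has to be unprotected explicitly. */
                ptep = pte_offset_map((pmd_t *)table, vmaddr);
                ptep_remove_protection_split(mm, ptep, vmaddr);
        }
}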

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
---
 arch/s390/mm/gmap.c | 34 ++++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 41a5bbbc59e6..d8c9b295294b 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -532,6 +532,9 @@ void gmap_unlink(struct mm_struct *mm, unsigned long *table,
 static void gmap_pmdp_xchg(struct gmap *gmap, pmd_t *old, pmd_t new,
 			   unsigned long gaddr);
 
+static void gmap_pmd_split(struct gmap *gmap, unsigned long gaddr,
+			   pmd_t *pmdp, struct page *page);
+
 /**
  * gmap_link - set up shadow page tables to connect a host to a guest address
  * @gmap: pointer to guest mapping meta data structure
@@ -547,12 +550,12 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 {
 	struct mm_struct *mm;
 	unsigned long *table;
+	struct page *page = NULL;
 	spinlock_t *ptl;
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
-	u64 unprot;
 	pte_t *ptep;
 	int rc;
 
@@ -600,6 +603,7 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 	/* Are we allowed to use huge pages? */
 	if (pmd_large(*pmd) && !gmap->mm->context.allow_gmap_hpage_1m)
 		return -EFAULT;
+retry_split:
 	/* Link gmap segment table entry location to page table. */
 	rc = radix_tree_preload(GFP_KERNEL_ACCOUNT);
 	if (rc)
@@ -627,10 +631,25 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 		spin_unlock(&gmap->guest_table_lock);
 	} else if (*table & _SEGMENT_ENTRY_PROTECT &&
 		   !(pmd_val(*pmd) & _SEGMENT_ENTRY_PROTECT)) {
-		unprot = (u64)*table;
-		unprot &= ~_SEGMENT_ENTRY_PROTECT;
-		unprot |= _SEGMENT_ENTRY_GMAP_UC;
-		gmap_pmdp_xchg(gmap, (pmd_t *)table, __pmd(unprot), gaddr);
+		if (page) {
+			gmap_pmd_split(gmap, gaddr, (pmd_t *)table, page);
+			page = NULL;
+		} else {
+			spin_unlock(ptl);
+			ptl = NULL;
+			radix_tree_preload_end();
+			page = page_table_alloc_pgste(mm);
+			if (!page)
+				rc = -ENOMEM;
+			else
+				goto retry_split;
+		}
+		/*
+		 * The split moves over the protection, so we still
+		 * need to unprotect.
+		 */
+		ptep = pte_offset_map((pmd_t *)table, vmaddr);
+		ptep_remove_protection_split(mm, ptep, vmaddr);
 	} else if (gmap_pmd_is_split((pmd_t *)table)) {
 		/*
 		 * Split pmds are somewhere in-between a normal and a
@@ -642,7 +661,10 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 		if (pte_val(*ptep) & _PAGE_PROTECT)
 			ptep_remove_protection_split(mm, ptep, vmaddr);
 	}
-	spin_unlock(ptl);
+	if (page)
+		page_table_free_pgste(page);
+	if (ptl)
+		spin_unlock(ptl);
 	radix_tree_preload_end();
 	return rc;
 }
-- 
2.27.0



* [PATCH 06/14] s390/mm: Provide vmaddr to pmd notification
From: Janosch Frank @ 2021-01-13  9:41 UTC (permalink / raw)
  To: kvm; +Cc: borntraeger, david, linux-s390, imbrenda

The vmaddr will be needed for shadow tables.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
---
 arch/s390/mm/gmap.c | 52 +++++++++++++++++++++++----------------------
 1 file changed, 27 insertions(+), 25 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index d8c9b295294b..b7199c55f98a 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -530,10 +530,10 @@ void gmap_unlink(struct mm_struct *mm, unsigned long *table,
 }
 
 static void gmap_pmdp_xchg(struct gmap *gmap, pmd_t *old, pmd_t new,
-			   unsigned long gaddr);
+			   unsigned long gaddr, unsigned long vmaddr);
 
 static void gmap_pmd_split(struct gmap *gmap, unsigned long gaddr,
-			   pmd_t *pmdp, struct page *page);
+			   unsigned long vmaddr, pmd_t *pmdp, struct page *page);
 
 /**
  * gmap_link - set up shadow page tables to connect a host to a guest address
@@ -632,7 +632,8 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 	} else if (*table & _SEGMENT_ENTRY_PROTECT &&
 		   !(pmd_val(*pmd) & _SEGMENT_ENTRY_PROTECT)) {
 		if (page) {
-			gmap_pmd_split(gmap, gaddr, (pmd_t *)table, page);
+			gmap_pmd_split(gmap, gaddr, vmaddr,
+				       (pmd_t *)table, page);
 			page = NULL;
 		} else {
 			spin_unlock(ptl);
@@ -952,19 +953,15 @@ static void gmap_pte_op_end(spinlock_t *ptl)
  * Returns a pointer to the pmd for a guest address, or NULL
  */
 static inline pmd_t *gmap_pmd_op_walk(struct gmap *gmap, unsigned long gaddr,
-				      spinlock_t **ptl)
+				      unsigned long vmaddr, spinlock_t **ptl)
 {
 	pmd_t *pmdp, *hpmdp;
-	unsigned long vmaddr;
 
 
 	BUG_ON(gmap_is_shadow(gmap));
 
 	*ptl = NULL;
 	if (gmap->mm->context.allow_gmap_hpage_1m) {
-		vmaddr = __gmap_translate(gmap, gaddr);
-		if (IS_ERR_VALUE(vmaddr))
-			return NULL;
 		hpmdp = pmd_alloc_map(gmap->mm, vmaddr);
 		if (!hpmdp)
 			return NULL;
@@ -1047,7 +1044,7 @@ static inline void gmap_pmd_split_free(struct gmap *gmap, pmd_t *pmdp)
  * aren't tracked anywhere else.
  */
 static void gmap_pmd_split(struct gmap *gmap, unsigned long gaddr,
-			   pmd_t *pmdp, struct page *page)
+			   unsigned long vmaddr, pmd_t *pmdp, struct page *page)
 {
 	unsigned long *ptable = (unsigned long *) page_to_phys(page);
 	pmd_t new;
@@ -1069,7 +1066,7 @@ static void gmap_pmd_split(struct gmap *gmap, unsigned long gaddr,
 	spin_lock(&gmap->split_list_lock);
 	list_add(&page->lru, &gmap->split_list);
 	spin_unlock(&gmap->split_list_lock);
-	gmap_pmdp_xchg(gmap, pmdp, new, gaddr);
+	gmap_pmdp_xchg(gmap, pmdp, new, gaddr, vmaddr);
 }
 
 /*
@@ -1087,7 +1084,8 @@ static void gmap_pmd_split(struct gmap *gmap, unsigned long gaddr,
  * guest_table_lock held.
  */
 static int gmap_protect_pmd(struct gmap *gmap, unsigned long gaddr,
-			    pmd_t *pmdp, int prot, unsigned long bits)
+			    unsigned long vmaddr, pmd_t *pmdp, int prot,
+			    unsigned long bits)
 {
 	int pmd_i = pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID;
 	int pmd_p = pmd_val(*pmdp) & _SEGMENT_ENTRY_PROTECT;
@@ -1099,13 +1097,13 @@ static int gmap_protect_pmd(struct gmap *gmap, unsigned long gaddr,
 
 	if (prot == PROT_NONE && !pmd_i) {
 		pmd_val(new) |= _SEGMENT_ENTRY_INVALID;
-		gmap_pmdp_xchg(gmap, pmdp, new, gaddr);
+		gmap_pmdp_xchg(gmap, pmdp, new, gaddr, vmaddr);
 	}
 
 	if (prot == PROT_READ && !pmd_p) {
 		pmd_val(new) &= ~_SEGMENT_ENTRY_INVALID;
 		pmd_val(new) |= _SEGMENT_ENTRY_PROTECT;
-		gmap_pmdp_xchg(gmap, pmdp, new, gaddr);
+		gmap_pmdp_xchg(gmap, pmdp, new, gaddr, vmaddr);
 	}
 
 	if (bits & GMAP_NOTIFY_MPROT)
@@ -1168,10 +1166,14 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 	int rc;
 
 	BUG_ON(gmap_is_shadow(gmap));
+
 	while (len) {
 		rc = -EAGAIN;
-
-		pmdp = gmap_pmd_op_walk(gmap, gaddr, &ptl_pmd);
+		vmaddr = __gmap_translate(gmap, gaddr);
+		if (IS_ERR_VALUE(vmaddr))
+			return vmaddr;
+		vmaddr |= gaddr & ~PMD_MASK;
+		pmdp = gmap_pmd_op_walk(gmap, gaddr, vmaddr, &ptl_pmd);
 		if (pmdp && !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {
 			if (!pmd_large(*pmdp)) {
 				ptep = gmap_pte_from_pmd(gmap, pmdp, gaddr,
@@ -1196,7 +1198,7 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 						return -ENOMEM;
 					continue;
 				} else {
-					gmap_pmd_split(gmap, gaddr,
+					gmap_pmd_split(gmap, gaddr, vmaddr,
 						       pmdp, page);
 					page = NULL;
 					gmap_pmd_op_end(ptl_pmd);
@@ -1214,9 +1216,6 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 				return rc;
 
 			/* -EAGAIN, fixup of userspace mm and gmap */
-			vmaddr = __gmap_translate(gmap, gaddr);
-			if (IS_ERR_VALUE(vmaddr))
-				return vmaddr;
 			rc = gmap_fixup(gmap, gaddr, vmaddr, prot);
 			if (rc)
 				return rc;
@@ -2441,6 +2440,7 @@ static inline void pmdp_notify_split(struct gmap *gmap, pmd_t *pmdp,
 static void pmdp_notify_gmap(struct gmap *gmap, pmd_t *pmdp,
 			     unsigned long gaddr, unsigned long vmaddr)
 {
+	BUG_ON((gaddr & ~HPAGE_MASK) || (vmaddr & ~HPAGE_MASK));
 	if (gmap_pmd_is_split(pmdp))
 		return pmdp_notify_split(gmap, pmdp, gaddr, vmaddr);
 
@@ -2461,10 +2461,11 @@ static void pmdp_notify_gmap(struct gmap *gmap, pmd_t *pmdp,
  * held.
  */
 static void gmap_pmdp_xchg(struct gmap *gmap, pmd_t *pmdp, pmd_t new,
-			   unsigned long gaddr)
+			   unsigned long gaddr, unsigned long vmaddr)
 {
 	gaddr &= HPAGE_MASK;
-	pmdp_notify_gmap(gmap, pmdp, gaddr, 0);
+	vmaddr &= HPAGE_MASK;
+	pmdp_notify_gmap(gmap, pmdp, gaddr, vmaddr);
 	if (pmd_large(new))
 		pmd_val(new) &= ~GMAP_SEGMENT_NOTIFY_BITS;
 	if (MACHINE_HAS_TLB_GUEST)
@@ -2612,7 +2613,8 @@ EXPORT_SYMBOL_GPL(gmap_pmdp_idte_global);
  * held.
  */
 static bool gmap_test_and_clear_dirty_pmd(struct gmap *gmap, pmd_t *pmdp,
-					  unsigned long gaddr)
+					  unsigned long gaddr,
+					  unsigned long vmaddr)
 {
 	if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
 		return false;
@@ -2624,7 +2626,7 @@ static bool gmap_test_and_clear_dirty_pmd(struct gmap *gmap, pmd_t *pmdp,
 
 	/* Clear UC indication and reset protection */
 	pmd_val(*pmdp) &= ~_SEGMENT_ENTRY_GMAP_UC;
-	gmap_protect_pmd(gmap, gaddr, pmdp, PROT_READ, 0);
+	gmap_protect_pmd(gmap, gaddr, vmaddr, pmdp, PROT_READ, 0);
 	return true;
 }
 
@@ -2647,12 +2649,12 @@ void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long bitmap[4],
 	spinlock_t *ptl_pmd = NULL;
 	spinlock_t *ptl_pte = NULL;
 
-	pmdp = gmap_pmd_op_walk(gmap, gaddr, &ptl_pmd);
+	pmdp = gmap_pmd_op_walk(gmap, gaddr, vmaddr, &ptl_pmd);
 	if (!pmdp)
 		return;
 
 	if (pmd_large(*pmdp)) {
-		if (gmap_test_and_clear_dirty_pmd(gmap, pmdp, gaddr))
+		if (gmap_test_and_clear_dirty_pmd(gmap, pmdp, gaddr, vmaddr))
 			bitmap_fill(bitmap, _PAGE_ENTRIES);
 	} else {
 		for (i = 0; i < _PAGE_ENTRIES; i++, vmaddr += PAGE_SIZE) {
-- 
2.27.0



* [PATCH 07/14] s390/mm: factor out idte global flush into gmap_idte_global
From: Janosch Frank @ 2021-01-13  9:41 UTC (permalink / raw)
  To: kvm; +Cc: borntraeger, david, linux-s390, imbrenda

Introduce a function to do an IDTE global flush on a gmap pmd, and use
it to remove some code duplication.

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/mm/gmap.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index b7199c55f98a..f89e710c31af 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -1009,6 +1009,18 @@ static pte_t *gmap_pte_from_pmd(struct gmap *gmap, pmd_t *pmdp,
 	return pte_offset_map(pmdp, addr);
 }
 
+static inline void gmap_idte_global(unsigned long asce, pmd_t *pmdp,
+				    unsigned long gaddr)
+{
+	if (MACHINE_HAS_TLB_GUEST)
+		__pmdp_idte(gaddr, pmdp, IDTE_GUEST_ASCE, asce,
+			    IDTE_GLOBAL);
+	else if (MACHINE_HAS_IDTE)
+		__pmdp_idte(gaddr, pmdp, 0, 0, IDTE_GLOBAL);
+	else
+		__pmdp_csp(pmdp);
+}
+
 /**
  * gmap_pmd_split_free - Free a split pmd's page table
  * @pmdp The split pmd that we free of its page table
@@ -2468,13 +2480,7 @@ static void gmap_pmdp_xchg(struct gmap *gmap, pmd_t *pmdp, pmd_t new,
 	pmdp_notify_gmap(gmap, pmdp, gaddr, vmaddr);
 	if (pmd_large(new))
 		pmd_val(new) &= ~GMAP_SEGMENT_NOTIFY_BITS;
-	if (MACHINE_HAS_TLB_GUEST)
-		__pmdp_idte(gaddr, pmdp, IDTE_GUEST_ASCE, gmap->asce,
-			    IDTE_GLOBAL);
-	else if (MACHINE_HAS_IDTE)
-		__pmdp_idte(gaddr, pmdp, 0, 0, IDTE_GLOBAL);
-	else
-		__pmdp_csp(pmdp);
+	gmap_idte_global(gmap->asce, pmdp, gaddr);
 	*pmdp = new;
 }
 
@@ -2587,13 +2593,7 @@ void gmap_pmdp_idte_global(struct mm_struct *mm, unsigned long vmaddr)
 			pmdp_notify_gmap(gmap, pmdp, gaddr, vmaddr);
 			if (pmd_large(*pmdp))
 				WARN_ON(*entry & GMAP_SEGMENT_NOTIFY_BITS);
-			if (MACHINE_HAS_TLB_GUEST)
-				__pmdp_idte(gaddr, pmdp, IDTE_GUEST_ASCE,
-					    gmap->asce, IDTE_GLOBAL);
-			else if (MACHINE_HAS_IDTE)
-				__pmdp_idte(gaddr, pmdp, 0, 0, IDTE_GLOBAL);
-			else
-				__pmdp_csp(pmdp);
+			gmap_idte_global(gmap->asce, pmdp, gaddr);
 			gmap_pmd_split_free(gmap, pmdp);
 			*entry = _SEGMENT_ENTRY_EMPTY;
 		}
-- 
2.27.0



* [PATCH 08/14] s390/mm: Make gmap_read_table EDAT1 compatible
From: Janosch Frank @ 2021-01-13  9:41 UTC (permalink / raw)
  To: kvm; +Cc: borntraeger, david, linux-s390, imbrenda

For the upcoming support of VSIE guests on huge-page-backed hosts, we
need to be able to read from large segments.
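
With a large segment there is no pte to walk, so the value can be read
through the segment entry itself. The core of the new huge page branch
in gmap_read_table() (see the hunk below):

} else {
        /* Huge mapping: read through the segment entry directly. */
        address = pmd_val(*pmdp) & HPAGE_MASK;
        address += gaddr & ~HPAGE_MASK;
        *val = *(unsigned long *) address;
        rc = 0;
}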

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
---
 arch/s390/mm/gmap.c | 43 ++++++++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index f89e710c31af..910371dc511d 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -1282,35 +1282,44 @@ EXPORT_SYMBOL_GPL(gmap_mprotect_notify);
 int gmap_read_table(struct gmap *gmap, unsigned long gaddr, unsigned long *val)
 {
 	unsigned long address, vmaddr;
-	spinlock_t *ptl;
+	spinlock_t *ptl_pmd = NULL, *ptl_pte = NULL;
+	pmd_t *pmdp;
 	pte_t *ptep, pte;
 	int rc;
 
-	if (gmap_is_shadow(gmap))
-		return -EINVAL;
+	BUG_ON(gmap_is_shadow(gmap));
 
 	while (1) {
 		rc = -EAGAIN;
-		ptep = gmap_pte_op_walk(gmap, gaddr, &ptl);
-		if (ptep) {
-			pte = *ptep;
-			if (pte_present(pte) && (pte_val(pte) & _PAGE_READ)) {
-				address = pte_val(pte) & PAGE_MASK;
-				address += gaddr & ~PAGE_MASK;
+		vmaddr = __gmap_translate(gmap, gaddr);
+		if (IS_ERR_VALUE(vmaddr))
+			return vmaddr;
+		pmdp = gmap_pmd_op_walk(gmap, gaddr, vmaddr, &ptl_pmd);
+		if (pmdp && !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {
+			if (!pmd_large(*pmdp)) {
+				ptep = gmap_pte_from_pmd(gmap, pmdp, vmaddr, &ptl_pte);
+				if (ptep) {
+					pte = *ptep;
+					if (pte_present(pte) && (pte_val(pte) & _PAGE_READ)) {
+						address = pte_val(pte) & PAGE_MASK;
+						address += gaddr & ~PAGE_MASK;
+						*val = *(unsigned long *) address;
+						pte_val(*ptep) |= _PAGE_YOUNG;
+						/* Do *NOT* clear the _PAGE_INVALID bit! */
+						rc = 0;
+					}
+				}
+				gmap_pte_op_end(ptl_pte);
+			} else {
+				address = pmd_val(*pmdp) & HPAGE_MASK;
+				address += gaddr & ~HPAGE_MASK;
 				*val = *(unsigned long *) address;
-				pte_val(*ptep) |= _PAGE_YOUNG;
-				/* Do *NOT* clear the _PAGE_INVALID bit! */
 				rc = 0;
 			}
-			gmap_pte_op_end(ptl);
+			gmap_pmd_op_end(ptl_pmd);
 		}
 		if (!rc)
 			break;
-		vmaddr = __gmap_translate(gmap, gaddr);
-		if (IS_ERR_VALUE(vmaddr)) {
-			rc = vmaddr;
-			break;
-		}
 		rc = gmap_fixup(gmap, gaddr, vmaddr, PROT_READ);
 		if (rc)
 			break;
-- 
2.27.0



* [PATCH 09/14] s390/mm: Make gmap_protect_rmap EDAT1 compatible
From: Janosch Frank @ 2021-01-13  9:41 UTC (permalink / raw)
  To: kvm; +Cc: borntraeger, david, linux-s390, imbrenda

For the upcoming large page shadowing support, let's add the
possibility to split a huge page and protect it with
gmap_protect_rmap() for shadowing purposes.
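
The flow mirrors gmap_protect_range(): if the pmd is still huge, drop
the locks, allocate a pgste page table, split and retry; once a pte is
reachable, the protection and the rmap insertion are done under the
shadow's guest_table_lock. A condensed sketch of the pte path from the
hunks below:

ptep = gmap_pte_from_pmd(parent, pmdp, paddr, &ptl_pte);
if (ptep)
        rc = gmap_protect_rmap_pte(sg, rmap, paddr, vmaddr,
                                   ptep, PROT_READ);
else
        rc = -ENOMEM;
gmap_pte_op_end(ptl_pte);
gmap_pmd_op_end(ptl_pmd);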

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
---
 arch/s390/mm/gmap.c | 93 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 73 insertions(+), 20 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 910371dc511d..f20aa49c2791 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -1142,7 +1142,8 @@ static int gmap_protect_pmd(struct gmap *gmap, unsigned long gaddr,
  * Expected to be called with sg->mm->mmap_lock in read
  */
 static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
-			    pte_t *ptep, int prot, unsigned long bits)
+			    unsigned long vmaddr, pte_t *ptep,
+			    int prot, unsigned long bits)
 {
 	int rc;
 	unsigned long pbits = 0;
@@ -1191,7 +1192,7 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 				ptep = gmap_pte_from_pmd(gmap, pmdp, gaddr,
 							 &ptl_pte);
 				if (ptep)
-					rc = gmap_protect_pte(gmap, gaddr,
+					rc = gmap_protect_pte(gmap, gaddr, vmaddr,
 							      ptep, prot, bits);
 				else
 					rc = -ENOMEM;
@@ -1354,6 +1355,21 @@ static inline void gmap_insert_rmap(struct gmap *sg, unsigned long vmaddr,
 	}
 }
 
+static int gmap_protect_rmap_pte(struct gmap *sg, struct gmap_rmap *rmap,
+				 unsigned long paddr, unsigned long vmaddr,
+				 pte_t *ptep, int prot)
+{
+	int rc = 0;
+
+	spin_lock(&sg->guest_table_lock);
+	rc = gmap_protect_pte(sg->parent, paddr, vmaddr, ptep,
+			      prot, GMAP_NOTIFY_SHADOW);
+	if (!rc)
+		gmap_insert_rmap(sg, vmaddr, rmap);
+	spin_unlock(&sg->guest_table_lock);
+	return rc;
+}
+
 /**
  * gmap_protect_rmap - restrict access rights to memory (RO) and create an rmap
  * @sg: pointer to the shadow guest address space structure
@@ -1370,16 +1386,15 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
 	struct gmap *parent;
 	struct gmap_rmap *rmap;
 	unsigned long vmaddr;
-	spinlock_t *ptl;
+	pmd_t *pmdp;
 	pte_t *ptep;
+	spinlock_t *ptl_pmd = NULL, *ptl_pte = NULL;
+	struct page *page = NULL;
 	int rc;
 
 	BUG_ON(!gmap_is_shadow(sg));
 	parent = sg->parent;
 	while (len) {
-		vmaddr = __gmap_translate(parent, paddr);
-		if (IS_ERR_VALUE(vmaddr))
-			return vmaddr;
 		rmap = kzalloc(sizeof(*rmap), GFP_KERNEL_ACCOUNT);
 		if (!rmap)
 			return -ENOMEM;
@@ -1390,26 +1405,64 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
 			return rc;
 		}
 		rc = -EAGAIN;
-		ptep = gmap_pte_op_walk(parent, paddr, &ptl);
-		if (ptep) {
-			spin_lock(&sg->guest_table_lock);
-			rc = ptep_force_prot(parent->mm, paddr, ptep, PROT_READ,
-					     PGSTE_VSIE_BIT);
-			if (!rc)
-				gmap_insert_rmap(sg, vmaddr, rmap);
-			spin_unlock(&sg->guest_table_lock);
-			gmap_pte_op_end(ptl);
+		vmaddr = __gmap_translate(parent, paddr);
+		if (IS_ERR_VALUE(vmaddr))
+			return vmaddr;
+		vmaddr |= paddr & ~PMD_MASK;
+		pmdp = gmap_pmd_op_walk(parent, paddr, vmaddr, &ptl_pmd);
+		if (pmdp && !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {
+			if (!pmd_large(*pmdp)) {
+				ptl_pte = NULL;
+				ptep = gmap_pte_from_pmd(parent, pmdp, paddr,
+							 &ptl_pte);
+				if (ptep)
+					rc = gmap_protect_rmap_pte(sg, rmap, paddr,
+								   vmaddr, ptep,
+								   PROT_READ);
+				else
+					rc = -ENOMEM;
+				gmap_pte_op_end(ptl_pte);
+				gmap_pmd_op_end(ptl_pmd);
+				if (!rc) {
+					paddr += PAGE_SIZE;
+					len -= PAGE_SIZE;
+					radix_tree_preload_end();
+					continue;
+				}
+			} else {
+				if (!page) {
+					/* Drop locks for allocation. */
+					gmap_pmd_op_end(ptl_pmd);
+					ptl_pmd = NULL;
+					radix_tree_preload_end();
+					kfree(rmap);
+					page = page_table_alloc_pgste(parent->mm);
+					if (!page)
+						return -ENOMEM;
+					continue;
+				} else {
+					gmap_pmd_split(parent, paddr, vmaddr,
+						       pmdp, page);
+					gmap_pmd_op_end(ptl_pmd);
+					radix_tree_preload_end();
+					kfree(rmap);
+					page = NULL;
+					continue;
+				}
+
+			}
+		}
+		if (page) {
+			page_table_free_pgste(page);
+			page = NULL;
 		}
 		radix_tree_preload_end();
-		if (rc) {
-			kfree(rmap);
+		kfree(rmap);
+		if (rc == -EAGAIN) {
 			rc = gmap_fixup(parent, paddr, vmaddr, PROT_READ);
 			if (rc)
 				return rc;
-			continue;
 		}
-		paddr += PAGE_SIZE;
-		len -= PAGE_SIZE;
 	}
 	return 0;
 }
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 10/14] s390/mm: Add simple ptep shadow function
  2021-01-13  9:40 [PATCH 00/14] KVM: s390: Add huge page VSIE support Janosch Frank
                   ` (8 preceding siblings ...)
  2021-01-13  9:41 ` [PATCH 09/14] s390/mm: Make gmap_protect_rmap " Janosch Frank
@ 2021-01-13  9:41 ` Janosch Frank
  2021-01-13  9:41 ` [PATCH 11/14] s390/mm: Add gmap shadowing for large pmds Janosch Frank
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Janosch Frank @ 2021-01-13  9:41 UTC (permalink / raw)
  To: kvm; +Cc: borntraeger, david, linux-s390, imbrenda

Let's factor out setting the shadow pte, so we can reuse that function
later for huge to 4k shadows where we don't have a source pte (spte) or
source pgste (spgste).
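
For illustration, roughly how a huge page backed shadow path can reuse
it later (a sketch only; the variable names mirror the shadow page path
of a follow-up patch and are not introduced here). The source pte is
computed from the parent's large pmd, so there is no source page table
entry or pgste to lock:

	pte_t spte;

	/* derive the 4k source pte from the parent's large pmd */
	spte = __pte((pmd_val(*spmdp) & _SEGMENT_ENTRY_ORIGIN_LARGE) +
		     (pte_index(paddr) << PAGE_SHIFT));
	/* only the target pgste is locked, inside the helper */
	ptep_shadow_set(spte, tptep, pte);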

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
---
 arch/s390/include/asm/pgtable.h |  1 +
 arch/s390/mm/pgtable.c          | 24 ++++++++++++++++--------
 2 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 6d6ad508f9c7..d2c005f35c9c 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1145,6 +1145,7 @@ void ptep_zap_unused(struct mm_struct *mm, unsigned long addr,
 void ptep_zap_key(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
 int ptep_shadow_pte(struct mm_struct *mm, unsigned long saddr,
 		    pte_t *sptep, pte_t *tptep, pte_t pte);
+void ptep_shadow_set(pte_t spte, pte_t *tptep, pte_t pte);
 void ptep_unshadow_pte(struct mm_struct *mm, unsigned long saddr, pte_t *ptep);
 
 unsigned long ptep_get_and_clear_notification_bits(pte_t *ptep);
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 16896f936d32..6066c7ef027a 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -645,11 +645,24 @@ int ptep_force_prot(struct mm_struct *mm, unsigned long addr,
 	return 0;
 }
 
+void ptep_shadow_set(pte_t spte, pte_t *tptep, pte_t pte)
+{
+	pte_t tpte;
+	pgste_t tpgste;
+
+	tpgste = pgste_get_lock(tptep);
+	pte_val(tpte) = (pte_val(spte) & PAGE_MASK) |
+		(pte_val(pte) & _PAGE_PROTECT);
+	/* don't touch the storage key - it belongs to parent pgste */
+	tpgste = pgste_set_pte(tptep, tpgste, tpte);
+	pgste_set_unlock(tptep, tpgste);
+}
+
 int ptep_shadow_pte(struct mm_struct *mm, unsigned long saddr,
 		    pte_t *sptep, pte_t *tptep, pte_t pte)
 {
-	pgste_t spgste, tpgste;
-	pte_t spte, tpte;
+	pgste_t spgste;
+	pte_t spte;
 	int rc = -EAGAIN;
 
 	if (!(pte_val(*tptep) & _PAGE_INVALID))
@@ -660,12 +673,7 @@ int ptep_shadow_pte(struct mm_struct *mm, unsigned long saddr,
 	    !((pte_val(spte) & _PAGE_PROTECT) &&
 	      !(pte_val(pte) & _PAGE_PROTECT))) {
 		pgste_val(spgste) |= PGSTE_VSIE_BIT;
-		tpgste = pgste_get_lock(tptep);
-		pte_val(tpte) = (pte_val(spte) & PAGE_MASK) |
-				(pte_val(pte) & _PAGE_PROTECT);
-		/* don't touch the storage key - it belongs to parent pgste */
-		tpgste = pgste_set_pte(tptep, tpgste, tpte);
-		pgste_set_unlock(tptep, tpgste);
+		ptep_shadow_set(spte, tptep, pte);
 		rc = 1;
 	}
 	pgste_set_unlock(sptep, spgste);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 11/14] s390/mm: Add gmap shadowing for large pmds
  2021-01-13  9:40 [PATCH 00/14] KVM: s390: Add huge page VSIE support Janosch Frank
                   ` (9 preceding siblings ...)
  2021-01-13  9:41 ` [PATCH 10/14] s390/mm: Add simple ptep shadow function Janosch Frank
@ 2021-01-13  9:41 ` Janosch Frank
  2021-01-13  9:41 ` [PATCH 12/14] s390/mm: Add gmap lock classes Janosch Frank
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Janosch Frank @ 2021-01-13  9:41 UTC (permalink / raw)
  To: kvm; +Cc: borntraeger, david, linux-s390, imbrenda

Up to now we could only shadow large pmds when the parent's mapping
was done with normal-sized pmds. This is done by introducing fake page
tables and effectively running the level 3 guest with a standard 4k
memory backing instead of the large one.

With this patch we add shadowing when the host itself is large page
backed. This allows us to run normal and large page backed VMs inside
a large page backed host.
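
Condensed, the change boils down to this (a simplified sketch of the
kvm_s390_shadow_tables()/kvm_s390_shadow_fault() hunks below, error
handling omitted):

	/* kvm_s390_shadow_tables(): guest 3 uses a huge segment */
	if (ste.fc && sg->edat_level >= 1) {
		if (!parent->mm->context.allow_gmap_hpage_1m)
			*fake = 1;	/* 4k backed host: fake page tables */
		else
			*lvl = 1;	/* 1m backed host: stop at segment level */
	}

	/* kvm_s390_shadow_fault(): map the guest pmd to a shadow pmd */
	if (lvl) {
		rc = gmap_shadow_segment(sg, saddr, __pmd(ste.val));
		if (rc == -EISDIR) {
			/* the parent pmd is split: fall back to a fake pgt */
			pgt = ste.fc1.sfaa * _SEGMENT_SIZE;
			rc = gmap_shadow_pgt(sg, saddr, pgt, 1);
		}
	}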

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
---
 arch/s390/include/asm/gmap.h |   9 +-
 arch/s390/kvm/gaccess.c      |  52 +++++-
 arch/s390/mm/gmap.c          | 328 ++++++++++++++++++++++++++++-------
 3 files changed, 317 insertions(+), 72 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index a5711c189018..4133d09597a5 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -19,11 +19,12 @@
 /* Status bits only for huge segment entries */
 #define _SEGMENT_ENTRY_GMAP_IN		0x8000	/* invalidation notify bit */
 #define _SEGMENT_ENTRY_GMAP_UC		0x4000	/* dirty (migration) */
+#define _SEGMENT_ENTRY_GMAP_VSIE	0x2000	/* vsie bit */
 /* Status bits in the gmap segment entry. */
 #define _SEGMENT_ENTRY_GMAP_SPLIT	0x0001  /* split huge pmd */
 
 #define GMAP_SEGMENT_STATUS_BITS (_SEGMENT_ENTRY_GMAP_UC | _SEGMENT_ENTRY_GMAP_SPLIT)
-#define GMAP_SEGMENT_NOTIFY_BITS _SEGMENT_ENTRY_GMAP_IN
+#define GMAP_SEGMENT_NOTIFY_BITS (_SEGMENT_ENTRY_GMAP_IN | _SEGMENT_ENTRY_GMAP_VSIE)
 
 /**
  * struct gmap_struct - guest address space
@@ -152,9 +153,11 @@ int gmap_shadow_sgt(struct gmap *sg, unsigned long saddr, unsigned long sgt,
 		    int fake);
 int gmap_shadow_pgt(struct gmap *sg, unsigned long saddr, unsigned long pgt,
 		    int fake);
-int gmap_shadow_pgt_lookup(struct gmap *sg, unsigned long saddr,
-			   unsigned long *pgt, int *dat_protection, int *fake);
+int gmap_shadow_sgt_lookup(struct gmap *sg, unsigned long saddr,
+			   unsigned long *pgt, int *dat_protection,
+			   int *fake, int *lvl);
 int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte);
+int gmap_shadow_segment(struct gmap *sg, unsigned long saddr, pmd_t pmd);
 
 void gmap_register_pte_notifier(struct gmap_notifier *);
 void gmap_unregister_pte_notifier(struct gmap_notifier *);
diff --git a/arch/s390/kvm/gaccess.c b/arch/s390/kvm/gaccess.c
index 6d6b57059493..d5f6b5c2c8de 100644
--- a/arch/s390/kvm/gaccess.c
+++ b/arch/s390/kvm/gaccess.c
@@ -981,7 +981,7 @@ int kvm_s390_check_low_addr_prot_real(struct kvm_vcpu *vcpu, unsigned long gra)
  */
 static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
 				  unsigned long *pgt, int *dat_protection,
-				  int *fake)
+				  int *fake, int *lvl)
 {
 	struct gmap *parent;
 	union asce asce;
@@ -1133,14 +1133,25 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
 		if (ste.cs && asce.p)
 			return PGM_TRANSLATION_SPEC;
 		*dat_protection |= ste.fc0.p;
+
+		/* Guest is huge page mapped */
 		if (ste.fc && sg->edat_level >= 1) {
-			*fake = 1;
-			ptr = ste.fc1.sfaa * _SEGMENT_SIZE;
-			ste.val = ptr;
-			goto shadow_pgt;
+			/* 4k to 1m, we absolutely need fake shadow tables. */
+			if (!parent->mm->context.allow_gmap_hpage_1m) {
+				*fake = 1;
+				ptr = ste.fc1.sfaa * _SEGMENT_SIZE;
+				ste.val = ptr;
+				goto shadow_pgt;
+			} else {
+				*lvl = 1;
+				*pgt = ptr;
+				return 0;
+
+			}
 		}
 		ptr = ste.fc0.pto * (PAGE_SIZE / 2);
 shadow_pgt:
+		*lvl = 0;
 		ste.fc0.p |= *dat_protection;
 		rc = gmap_shadow_pgt(sg, saddr, ste.val, *fake);
 		if (rc)
@@ -1169,8 +1180,9 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 {
 	union vaddress vaddr;
 	union page_table_entry pte;
+	union segment_table_entry ste;
 	unsigned long pgt;
-	int dat_protection, fake;
+	int dat_protection, fake, lvl = 0;
 	int rc;
 
 	mmap_read_lock(sg->mm);
@@ -1181,12 +1193,35 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 	 */
 	ipte_lock(vcpu);
 
-	rc = gmap_shadow_pgt_lookup(sg, saddr, &pgt, &dat_protection, &fake);
+	rc = gmap_shadow_sgt_lookup(sg, saddr, &pgt, &dat_protection, &fake, &lvl);
 	if (rc)
 		rc = kvm_s390_shadow_tables(sg, saddr, &pgt, &dat_protection,
-					    &fake);
+					    &fake, &lvl);
 
 	vaddr.addr = saddr;
+
+	/* Shadow stopped at segment level, we map pmd to pmd */
+	if (!rc && lvl) {
+		rc = gmap_read_table(sg->parent, pgt + vaddr.sx * 8, &ste.val);
+		if (!rc && ste.i)
+			rc = PGM_PAGE_TRANSLATION;
+		ste.fc1.p |= dat_protection;
+		if (!rc)
+			rc = gmap_shadow_segment(sg, saddr, __pmd(ste.val));
+		if (rc == -EISDIR) {
+			/* Hit a split pmd, we need to setup a fake page table */
+			fake = 1;
+			pgt = ste.fc1.sfaa * _SEGMENT_SIZE;
+			ste.val = pgt;
+			rc = gmap_shadow_pgt(sg, saddr, ste.val, fake);
+			if (rc)
+				goto out;
+		} else {
+			/* We're done */
+			goto out;
+		}
+	}
+
 	if (fake) {
 		pte.val = pgt + vaddr.px * PAGE_SIZE;
 		goto shadow_page;
@@ -1201,6 +1236,7 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 	pte.p |= dat_protection;
 	if (!rc)
 		rc = gmap_shadow_page(sg, saddr, __pte(pte.val));
+out:
 	ipte_unlock(vcpu);
 	mmap_read_unlock(sg->mm);
 	return rc;
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index f20aa49c2791..50dd95946d32 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -883,28 +883,6 @@ static inline unsigned long *gmap_table_walk(struct gmap *gmap,
 	return table;
 }
 
-/**
- * gmap_pte_op_walk - walk the gmap page table, get the page table lock
- *		      and return the pte pointer
- * @gmap: pointer to guest mapping meta data structure
- * @gaddr: virtual address in the guest address space
- * @ptl: pointer to the spinlock pointer
- *
- * Returns a pointer to the locked pte for a guest address, or NULL
- */
-static pte_t *gmap_pte_op_walk(struct gmap *gmap, unsigned long gaddr,
-			       spinlock_t **ptl)
-{
-	unsigned long *table;
-
-	BUG_ON(gmap_is_shadow(gmap));
-	/* Walk the gmap page table, lock and get pte pointer */
-	table = gmap_table_walk(gmap, gaddr, 1); /* get segment pointer */
-	if (!table || *table & _SEGMENT_ENTRY_INVALID)
-		return NULL;
-	return pte_alloc_map_lock(gmap->mm, (pmd_t *) table, gaddr, ptl);
-}
-
 /**
  * gmap_fixup - force memory in and connect the gmap table entry
  * @gmap: pointer to guest mapping meta data structure
@@ -1468,6 +1446,7 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
 }
 
 #define _SHADOW_RMAP_MASK	0x7
+#define _SHADOW_RMAP_SEGMENT_LP	0x6
 #define _SHADOW_RMAP_REGION1	0x5
 #define _SHADOW_RMAP_REGION2	0x4
 #define _SHADOW_RMAP_REGION3	0x3
@@ -1573,15 +1552,18 @@ static void __gmap_unshadow_sgt(struct gmap *sg, unsigned long raddr,
 
 	BUG_ON(!gmap_is_shadow(sg));
 	for (i = 0; i < _CRST_ENTRIES; i++, raddr += _SEGMENT_SIZE) {
-		if (!(sgt[i] & _SEGMENT_ENTRY_ORIGIN))
+		if (sgt[i] == _SEGMENT_ENTRY_EMPTY)
 			continue;
-		pgt = (unsigned long *)(sgt[i] & _REGION_ENTRY_ORIGIN);
+
+		if (!(sgt[i] & _SEGMENT_ENTRY_LARGE)) {
+			pgt = (unsigned long *)(sgt[i] & _SEGMENT_ENTRY_ORIGIN);
+			__gmap_unshadow_pgt(sg, raddr, pgt);
+			/* Free page table */
+			page = pfn_to_page(__pa(pgt) >> PAGE_SHIFT);
+			list_del(&page->lru);
+			page_table_free_pgste(page);
+		}
 		sgt[i] = _SEGMENT_ENTRY_EMPTY;
-		__gmap_unshadow_pgt(sg, raddr, pgt);
-		/* Free page table */
-		page = pfn_to_page(__pa(pgt) >> PAGE_SHIFT);
-		list_del(&page->lru);
-		page_table_free_pgste(page);
 	}
 }
 
@@ -2188,7 +2170,7 @@ EXPORT_SYMBOL_GPL(gmap_shadow_sgt);
 /**
  * gmap_shadow_lookup_pgtable - find a shadow page table
  * @sg: pointer to the shadow guest address space structure
- * @saddr: the address in the shadow aguest address space
+ * @saddr: the address in the shadow guest address space
  * @pgt: parent gmap address of the page table to get shadowed
  * @dat_protection: if the pgtable is marked as protected by dat
  * @fake: pgt references contiguous guest memory block, not a pgtable
@@ -2198,32 +2180,64 @@ EXPORT_SYMBOL_GPL(gmap_shadow_sgt);
  *
  * Called with sg->mm->mmap_lock in read.
  */
-int gmap_shadow_pgt_lookup(struct gmap *sg, unsigned long saddr,
-			   unsigned long *pgt, int *dat_protection,
-			   int *fake)
+void gmap_shadow_pgt_lookup(struct gmap *sg, unsigned long *sge,
+			    unsigned long saddr, unsigned long *pgt,
+			    int *dat_protection, int *fake)
 {
-	unsigned long *table;
 	struct page *page;
-	int rc;
+
+	/* Shadow page tables are full pages (pte+pgste) */
+	page = pfn_to_page(*sge >> PAGE_SHIFT);
+	*pgt = page->index & ~GMAP_SHADOW_FAKE_TABLE;
+	*dat_protection = !!(*sge & _SEGMENT_ENTRY_PROTECT);
+	*fake = !!(page->index & GMAP_SHADOW_FAKE_TABLE);
+}
+EXPORT_SYMBOL_GPL(gmap_shadow_pgt_lookup);
+
+int gmap_shadow_sgt_lookup(struct gmap *sg, unsigned long saddr,
+			   unsigned long *pgt, int *dat_protection,
+			   int *fake, int *lvl)
+{
+	unsigned long *sge, *r3e = NULL;
+	struct page *page;
+	int rc = -EAGAIN;
 
 	BUG_ON(!gmap_is_shadow(sg));
 	spin_lock(&sg->guest_table_lock);
-	table = gmap_table_walk(sg, saddr, 1); /* get segment pointer */
-	if (table && !(*table & _SEGMENT_ENTRY_INVALID)) {
-		/* Shadow page tables are full pages (pte+pgste) */
-		page = pfn_to_page(*table >> PAGE_SHIFT);
-		*pgt = page->index & ~GMAP_SHADOW_FAKE_TABLE;
-		*dat_protection = !!(*table & _SEGMENT_ENTRY_PROTECT);
-		*fake = !!(page->index & GMAP_SHADOW_FAKE_TABLE);
-		rc = 0;
-	} else  {
-		rc = -EAGAIN;
+	if (sg->asce & _ASCE_TYPE_MASK) {
+		/* >2 GB guest */
+		r3e = (unsigned long *) gmap_table_walk(sg, saddr, 2);
+		if (!r3e || (*r3e & _REGION_ENTRY_INVALID))
+			goto out;
+		sge = (unsigned long *)(*r3e & _REGION_ENTRY_ORIGIN) + ((saddr & _SEGMENT_INDEX) >> _SEGMENT_SHIFT);
+	} else {
+		sge = (unsigned long *)(sg->asce & PAGE_MASK) + ((saddr & _SEGMENT_INDEX) >> _SEGMENT_SHIFT);
 	}
+	if (*sge & _SEGMENT_ENTRY_INVALID)
+		goto out;
+	rc = 0;
+	if (*sge & _SEGMENT_ENTRY_LARGE) {
+		if (r3e) {
+			page = pfn_to_page(*r3e >> PAGE_SHIFT);
+			*pgt = page->index & ~GMAP_SHADOW_FAKE_TABLE;
+			*dat_protection = !!(*r3e & _SEGMENT_ENTRY_PROTECT);
+			*fake = !!(page->index & GMAP_SHADOW_FAKE_TABLE);
+		} else {
+			*pgt = sg->orig_asce & PAGE_MASK;
+			*dat_protection = 0;
+			*fake = 0;
+		}
+		*lvl = 1;
+	} else {
+		gmap_shadow_pgt_lookup(sg, sge, saddr, pgt,
+				       dat_protection, fake);
+		*lvl = 0;
+	}
+out:
 	spin_unlock(&sg->guest_table_lock);
 	return rc;
-
 }
-EXPORT_SYMBOL_GPL(gmap_shadow_pgt_lookup);
+EXPORT_SYMBOL_GPL(gmap_shadow_sgt_lookup);
 
 /**
  * gmap_shadow_pgt - instantiate a shadow page table
@@ -2305,6 +2319,95 @@ int gmap_shadow_pgt(struct gmap *sg, unsigned long saddr, unsigned long pgt,
 }
 EXPORT_SYMBOL_GPL(gmap_shadow_pgt);
 
+int gmap_shadow_segment(struct gmap *sg, unsigned long saddr, pmd_t pmd)
+{
+	struct gmap *parent;
+	struct gmap_rmap *rmap;
+	unsigned long vmaddr, paddr;
+	spinlock_t *ptl = NULL;
+	pmd_t spmd, tpmd, *spmdp = NULL, *tpmdp;
+	int prot;
+	int rc;
+
+	BUG_ON(!gmap_is_shadow(sg));
+	parent = sg->parent;
+
+	prot = (pmd_val(pmd) & _SEGMENT_ENTRY_PROTECT) ? PROT_READ : PROT_WRITE;
+	rmap = kzalloc(sizeof(*rmap), GFP_KERNEL);
+	if (!rmap)
+		return -ENOMEM;
+	rmap->raddr = (saddr & HPAGE_MASK) | _SHADOW_RMAP_SEGMENT_LP;
+
+	while (1) {
+		paddr = pmd_val(pmd) & HPAGE_MASK;
+		vmaddr = __gmap_translate(parent, paddr);
+		if (IS_ERR_VALUE(vmaddr)) {
+			rc = vmaddr;
+			break;
+		}
+		rc = radix_tree_preload(GFP_KERNEL);
+		if (rc)
+			break;
+		rc = -EAGAIN;
+
+		/* Let's look up the parent's mapping */
+		spmdp = gmap_pmd_op_walk(parent, paddr, vmaddr, &ptl);
+		if (spmdp) {
+			if (gmap_pmd_is_split(spmdp)) {
+				gmap_pmd_op_end(ptl);
+				radix_tree_preload_end();
+				rc = -EISDIR;
+				break;
+			}
+			spin_lock(&sg->guest_table_lock);
+			/* Get shadow segment table pointer */
+			tpmdp = (pmd_t *) gmap_table_walk(sg, saddr, 1);
+			if (!tpmdp) {
+				spin_unlock(&sg->guest_table_lock);
+				gmap_pmd_op_end(ptl);
+				radix_tree_preload_end();
+				break;
+			}
+			/* Shadowing magic happens here. */
+			if (!(pmd_val(*tpmdp) & _SEGMENT_ENTRY_INVALID)) {
+				rc = 0;	/* already shadowed */
+				spin_unlock(&sg->guest_table_lock);
+				gmap_pmd_op_end(ptl);
+				radix_tree_preload_end();
+				kfree(rmap);
+				break;
+			}
+			spmd = *spmdp;
+			if (!(pmd_val(spmd) & _SEGMENT_ENTRY_INVALID) &&
+			    !((pmd_val(spmd) & _SEGMENT_ENTRY_PROTECT) &&
+			      !(pmd_val(pmd) & _SEGMENT_ENTRY_PROTECT))) {
+
+				pmd_val(*spmdp) |= _SEGMENT_ENTRY_GMAP_VSIE;
+
+				/* Insert shadow ste */
+				pmd_val(tpmd) = ((pmd_val(spmd) &
+						  _SEGMENT_ENTRY_HARDWARE_BITS_LARGE) |
+						 (pmd_val(pmd) & _SEGMENT_ENTRY_PROTECT));
+				*tpmdp = tpmd;
+				gmap_insert_rmap(sg, vmaddr, rmap);
+				rc = 0;
+			}
+			spin_unlock(&sg->guest_table_lock);
+			gmap_pmd_op_end(ptl);
+		}
+		radix_tree_preload_end();
+		if (!rc)
+			break;
+		rc = gmap_fixup(parent, paddr, vmaddr, prot);
+		if (rc)
+			break;
+	}
+	if (rc)
+		kfree(rmap);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(gmap_shadow_segment);
+
 /**
  * gmap_shadow_page - create a shadow page mapping
  * @sg: pointer to the shadow guest address space structure
@@ -2322,7 +2425,8 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
 	struct gmap *parent;
 	struct gmap_rmap *rmap;
 	unsigned long vmaddr, paddr;
-	spinlock_t *ptl;
+	spinlock_t *ptl_pmd = NULL, *ptl_pte = NULL;
+	pmd_t *spmdp;
 	pte_t *sptep, *tptep;
 	int prot;
 	int rc;
@@ -2347,26 +2451,46 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
 		if (rc)
 			break;
 		rc = -EAGAIN;
-		sptep = gmap_pte_op_walk(parent, paddr, &ptl);
-		if (sptep) {
-			spin_lock(&sg->guest_table_lock);
+		spmdp = gmap_pmd_op_walk(parent, paddr, vmaddr, &ptl_pmd);
+		if (spmdp && !(pmd_val(*spmdp) & _SEGMENT_ENTRY_INVALID)) {
 			/* Get page table pointer */
 			tptep = (pte_t *) gmap_table_walk(sg, saddr, 0);
 			if (!tptep) {
-				spin_unlock(&sg->guest_table_lock);
-				gmap_pte_op_end(ptl);
 				radix_tree_preload_end();
+				gmap_pmd_op_end(ptl_pmd);
 				break;
 			}
-			rc = ptep_shadow_pte(sg->mm, saddr, sptep, tptep, pte);
-			if (rc > 0) {
-				/* Success and a new mapping */
-				gmap_insert_rmap(sg, vmaddr, rmap);
-				rmap = NULL;
-				rc = 0;
+
+			if (pmd_large(*spmdp)) {
+				pte_t spte;
+				if (!(pmd_val(*spmdp) & _SEGMENT_ENTRY_PROTECT)) {
+					spin_lock(&sg->guest_table_lock);
+					spte = __pte((pmd_val(*spmdp) &
+						      _SEGMENT_ENTRY_ORIGIN_LARGE)
+						     + (pte_index(paddr) << 12));
+					ptep_shadow_set(spte, tptep, pte);
+					pmd_val(*spmdp) |= _SEGMENT_ENTRY_GMAP_VSIE;
+					gmap_insert_rmap(sg, vmaddr, rmap);
+					rmap = NULL;
+					rc = 0;
+					spin_unlock(&sg->guest_table_lock);
+				}
+			} else {
+				sptep = gmap_pte_from_pmd(parent, spmdp, paddr, &ptl_pte);
+				spin_lock(&sg->guest_table_lock);
+				if (sptep) {
+					rc = ptep_shadow_pte(sg->mm, saddr, sptep, tptep, pte);
+					if (rc > 0) {
+						/* Success and a new mapping */
+						gmap_insert_rmap(sg, vmaddr, rmap);
+						rmap = NULL;
+						rc = 0;
+					}
+					spin_unlock(&sg->guest_table_lock);
+					gmap_pte_op_end(ptl_pte);
+				}
 			}
-			gmap_pte_op_end(ptl);
-			spin_unlock(&sg->guest_table_lock);
+			gmap_pmd_op_end(ptl_pmd);
 		}
 		radix_tree_preload_end();
 		if (!rc)
@@ -2380,6 +2504,75 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
 }
 EXPORT_SYMBOL_GPL(gmap_shadow_page);
 
+/**
+ * gmap_unshadow_segment - remove a huge segment from a shadow segment table
+ * @sg: pointer to the shadow guest address space structure
+ * @raddr: rmap address in the shadow guest address space
+ *
+ * Called with the sg->guest_table_lock held
+ */
+static void gmap_unshadow_segment(struct gmap *sg, unsigned long raddr)
+{
+	unsigned long *table;
+
+	BUG_ON(!gmap_is_shadow(sg));
+	/* We already have the lock */
+	table = gmap_table_walk(sg, raddr, 1); /* get segment table pointer */
+	if (!table || *table & _SEGMENT_ENTRY_INVALID ||
+	    !(*table & _SEGMENT_ENTRY_LARGE))
+		return;
+	gmap_call_notifier(sg, raddr, raddr + HPAGE_SIZE - 1);
+	gmap_idte_global(sg->asce, (pmd_t *)table, raddr);
+	*table = _SEGMENT_ENTRY_EMPTY;
+}
+
+static void gmap_shadow_notify_pmd(struct gmap *sg, unsigned long vmaddr,
+				   unsigned long gaddr)
+{
+	struct gmap_rmap *rmap, *rnext, *head;
+	unsigned long start, end, bits, raddr;
+
+
+	BUG_ON(!gmap_is_shadow(sg));
+
+	spin_lock(&sg->guest_table_lock);
+	if (sg->removed) {
+		spin_unlock(&sg->guest_table_lock);
+		return;
+	}
+	/* Check for top level table */
+	start = sg->orig_asce & _ASCE_ORIGIN;
+	end = start + ((sg->orig_asce & _ASCE_TABLE_LENGTH) + 1) * PAGE_SIZE;
+	if (!(sg->orig_asce & _ASCE_REAL_SPACE) && gaddr >= start &&
+	    gaddr < ((end & HPAGE_MASK) + HPAGE_SIZE - 1)) {
+		/* The complete shadow table has to go */
+		gmap_unshadow(sg);
+		spin_unlock(&sg->guest_table_lock);
+		list_del(&sg->list);
+		gmap_put(sg);
+		return;
+	}
+	/* Remove the page table tree for one specific entry */
+	head = radix_tree_delete(&sg->host_to_rmap, (vmaddr & HPAGE_MASK) >> PAGE_SHIFT);
+	gmap_for_each_rmap_safe(rmap, rnext, head) {
+		bits = rmap->raddr & _SHADOW_RMAP_MASK;
+		raddr = rmap->raddr ^ bits;
+		switch (bits) {
+		case _SHADOW_RMAP_SEGMENT_LP:
+			gmap_unshadow_segment(sg, raddr);
+			break;
+		case _SHADOW_RMAP_PGTABLE:
+			gmap_unshadow_page(sg, raddr);
+			break;
+		default:
+			BUG();
+		}
+		kfree(rmap);
+	}
+	spin_unlock(&sg->guest_table_lock);
+}
+
+
 /**
  * gmap_shadow_notify - handle notifications for shadow gmap
  *
@@ -2431,6 +2624,8 @@ static void gmap_shadow_notify(struct gmap *sg, unsigned long vmaddr,
 		case _SHADOW_RMAP_PGTABLE:
 			gmap_unshadow_page(sg, raddr);
 			break;
+		default:
+			BUG();
 		}
 		kfree(rmap);
 	}
@@ -2514,10 +2709,21 @@ static inline void pmdp_notify_split(struct gmap *gmap, pmd_t *pmdp,
 static void pmdp_notify_gmap(struct gmap *gmap, pmd_t *pmdp,
 			     unsigned long gaddr, unsigned long vmaddr)
 {
+	struct gmap *sg, *next;
+
 	BUG_ON((gaddr & ~HPAGE_MASK) || (vmaddr & ~HPAGE_MASK));
 	if (gmap_pmd_is_split(pmdp))
 		return pmdp_notify_split(gmap, pmdp, gaddr, vmaddr);
 
+	if (!list_empty(&gmap->children) &&
+	    (pmd_val(*pmdp) & _SEGMENT_ENTRY_GMAP_VSIE)) {
+		spin_lock(&gmap->shadow_lock);
+		list_for_each_entry_safe(sg, next, &gmap->children, list)
+			gmap_shadow_notify_pmd(sg, vmaddr, gaddr);
+		spin_unlock(&gmap->shadow_lock);
+	}
+	pmd_val(*pmdp) &= ~_SEGMENT_ENTRY_GMAP_VSIE;
+
 	if (!(pmd_val(*pmdp) & _SEGMENT_ENTRY_GMAP_IN))
 		return;
 	pmd_val(*pmdp) &= ~_SEGMENT_ENTRY_GMAP_IN;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 12/14] s390/mm: Add gmap lock classes
  2021-01-13  9:40 [PATCH 00/14] KVM: s390: Add huge page VSIE support Janosch Frank
                   ` (10 preceding siblings ...)
  2021-01-13  9:41 ` [PATCH 11/14] s390/mm: Add gmap shadowing for large pmds Janosch Frank
@ 2021-01-13  9:41 ` Janosch Frank
  2021-01-13  9:41 ` [PATCH 13/14] s390/mm: Pull pmd invalid check in gmap_pmd_op_walk Janosch Frank
  2021-01-13  9:41 ` [PATCH 14/14] KVM: s390: Allow the VSIE to be used with huge pages Janosch Frank
  13 siblings, 0 replies; 15+ messages in thread
From: Janosch Frank @ 2021-01-13  9:41 UTC (permalink / raw)
  To: kvm; +Cc: borntraeger, david, linux-s390, imbrenda

A shadow gmap's guest_table_lock and its parent's are taken right
after each other when doing VSIE management. Since they are the same
lock type, lockdep can't tell them apart without some help, so
introduce explicit lock classes.
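
A minimal sketch of the nesting that needs the annotation, assuming the
parent's lock is taken first (for huge pmds it also acts as the page
table lock) and the shadow's lock second:

	/* one lock type, two lockdep classes: parent outer, shadow inner */
	spin_lock_nested(&parent->guest_table_lock, GMAP_LOCK_PARENT);
	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
	/* ... VSIE management ... */
	spin_unlock(&sg->guest_table_lock);
	spin_unlock(&parent->guest_table_lock);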

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
---
 arch/s390/include/asm/gmap.h |  6 ++++++
 arch/s390/mm/gmap.c          | 40 +++++++++++++++++++-----------------
 2 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 4133d09597a5..4edbeb012e2a 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -26,6 +26,12 @@
 #define GMAP_SEGMENT_STATUS_BITS (_SEGMENT_ENTRY_GMAP_UC | _SEGMENT_ENTRY_GMAP_SPLIT)
 #define GMAP_SEGMENT_NOTIFY_BITS (_SEGMENT_ENTRY_GMAP_IN | _SEGMENT_ENTRY_GMAP_VSIE)
 
+
+enum gmap_lock_class {
+	GMAP_LOCK_PARENT,
+	GMAP_LOCK_SHADOW
+};
+
 /**
  * struct gmap_struct - guest address space
  * @list: list head for the mm->context gmap list
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 50dd95946d32..bc89fb974367 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -1339,7 +1339,7 @@ static int gmap_protect_rmap_pte(struct gmap *sg, struct gmap_rmap *rmap,
 {
 	int rc = 0;
 
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	rc = gmap_protect_pte(sg->parent, paddr, vmaddr, ptep,
 			      prot, GMAP_NOTIFY_SHADOW);
 	if (!rc)
@@ -1874,7 +1874,7 @@ struct gmap *gmap_shadow(struct gmap *parent, unsigned long asce,
 		/* only allow one real-space gmap shadow */
 		list_for_each_entry(sg, &parent->children, list) {
 			if (sg->orig_asce & _ASCE_REAL_SPACE) {
-				spin_lock(&sg->guest_table_lock);
+				spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 				gmap_unshadow(sg);
 				spin_unlock(&sg->guest_table_lock);
 				list_del(&sg->list);
@@ -1946,7 +1946,7 @@ int gmap_shadow_r2t(struct gmap *sg, unsigned long saddr, unsigned long r2t,
 		page->index |= GMAP_SHADOW_FAKE_TABLE;
 	s_r2t = (unsigned long *) page_to_phys(page);
 	/* Install shadow region second table */
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	table = gmap_table_walk(sg, saddr, 4); /* get region-1 pointer */
 	if (!table) {
 		rc = -EAGAIN;		/* Race with unshadow */
@@ -1979,7 +1979,7 @@ int gmap_shadow_r2t(struct gmap *sg, unsigned long saddr, unsigned long r2t,
 	offset = ((r2t & _REGION_ENTRY_OFFSET) >> 6) * PAGE_SIZE;
 	len = ((r2t & _REGION_ENTRY_LENGTH) + 1) * PAGE_SIZE - offset;
 	rc = gmap_protect_rmap(sg, raddr, origin + offset, len);
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (!rc) {
 		table = gmap_table_walk(sg, saddr, 4);
 		if (!table || (*table & _REGION_ENTRY_ORIGIN) !=
@@ -2030,7 +2030,7 @@ int gmap_shadow_r3t(struct gmap *sg, unsigned long saddr, unsigned long r3t,
 		page->index |= GMAP_SHADOW_FAKE_TABLE;
 	s_r3t = (unsigned long *) page_to_phys(page);
 	/* Install shadow region second table */
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	table = gmap_table_walk(sg, saddr, 3); /* get region-2 pointer */
 	if (!table) {
 		rc = -EAGAIN;		/* Race with unshadow */
@@ -2063,7 +2063,7 @@ int gmap_shadow_r3t(struct gmap *sg, unsigned long saddr, unsigned long r3t,
 	offset = ((r3t & _REGION_ENTRY_OFFSET) >> 6) * PAGE_SIZE;
 	len = ((r3t & _REGION_ENTRY_LENGTH) + 1) * PAGE_SIZE - offset;
 	rc = gmap_protect_rmap(sg, raddr, origin + offset, len);
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (!rc) {
 		table = gmap_table_walk(sg, saddr, 3);
 		if (!table || (*table & _REGION_ENTRY_ORIGIN) !=
@@ -2114,7 +2114,7 @@ int gmap_shadow_sgt(struct gmap *sg, unsigned long saddr, unsigned long sgt,
 		page->index |= GMAP_SHADOW_FAKE_TABLE;
 	s_sgt = (unsigned long *) page_to_phys(page);
 	/* Install shadow region second table */
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	table = gmap_table_walk(sg, saddr, 2); /* get region-3 pointer */
 	if (!table) {
 		rc = -EAGAIN;		/* Race with unshadow */
@@ -2147,7 +2147,7 @@ int gmap_shadow_sgt(struct gmap *sg, unsigned long saddr, unsigned long sgt,
 	offset = ((sgt & _REGION_ENTRY_OFFSET) >> 6) * PAGE_SIZE;
 	len = ((sgt & _REGION_ENTRY_LENGTH) + 1) * PAGE_SIZE - offset;
 	rc = gmap_protect_rmap(sg, raddr, origin + offset, len);
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (!rc) {
 		table = gmap_table_walk(sg, saddr, 2);
 		if (!table || (*table & _REGION_ENTRY_ORIGIN) !=
@@ -2203,7 +2203,7 @@ int gmap_shadow_sgt_lookup(struct gmap *sg, unsigned long saddr,
 	int rc = -EAGAIN;
 
 	BUG_ON(!gmap_is_shadow(sg));
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (sg->asce & _ASCE_TYPE_MASK) {
 		/* >2 GB guest */
 		r3e = (unsigned long *) gmap_table_walk(sg, saddr, 2);
@@ -2270,7 +2270,7 @@ int gmap_shadow_pgt(struct gmap *sg, unsigned long saddr, unsigned long pgt,
 		page->index |= GMAP_SHADOW_FAKE_TABLE;
 	s_pgt = (unsigned long *) page_to_phys(page);
 	/* Install shadow page table */
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	table = gmap_table_walk(sg, saddr, 1); /* get segment pointer */
 	if (!table) {
 		rc = -EAGAIN;		/* Race with unshadow */
@@ -2298,7 +2298,7 @@ int gmap_shadow_pgt(struct gmap *sg, unsigned long saddr, unsigned long pgt,
 	raddr = (saddr & _SEGMENT_MASK) | _SHADOW_RMAP_SEGMENT;
 	origin = pgt & _SEGMENT_ENTRY_ORIGIN & PAGE_MASK;
 	rc = gmap_protect_rmap(sg, raddr, origin, PAGE_SIZE);
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (!rc) {
 		table = gmap_table_walk(sg, saddr, 1);
 		if (!table || (*table & _SEGMENT_ENTRY_ORIGIN) !=
@@ -2359,7 +2359,7 @@ int gmap_shadow_segment(struct gmap *sg, unsigned long saddr, pmd_t pmd)
 				rc = -EISDIR;
 				break;
 			}
-			spin_lock(&sg->guest_table_lock);
+			spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 			/* Get shadow segment table pointer */
 			tpmdp = (pmd_t *) gmap_table_walk(sg, saddr, 1);
 			if (!tpmdp) {
@@ -2464,7 +2464,8 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
 			if (pmd_large(*spmdp)) {
 				pte_t spte;
 				if (!(pmd_val(*spmdp) & _SEGMENT_ENTRY_PROTECT)) {
-					spin_lock(&sg->guest_table_lock);
+					spin_lock_nested(&sg->guest_table_lock,
+							 GMAP_LOCK_SHADOW);
 					spte = __pte((pmd_val(*spmdp) &
 						      _SEGMENT_ENTRY_ORIGIN_LARGE)
 						     + (pte_index(paddr) << 12));
@@ -2477,7 +2478,8 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
 				}
 			} else {
 				sptep = gmap_pte_from_pmd(parent, spmdp, paddr, &ptl_pte);
-				spin_lock(&sg->guest_table_lock);
+				spin_lock_nested(&sg->guest_table_lock,
+						 GMAP_LOCK_SHADOW);
 				if (sptep) {
 					rc = ptep_shadow_pte(sg->mm, saddr, sptep, tptep, pte);
 					if (rc > 0) {
@@ -2535,7 +2537,7 @@ static void gmap_shadow_notify_pmd(struct gmap *sg, unsigned long vmaddr,
 
 	BUG_ON(!gmap_is_shadow(sg));
 
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (sg->removed) {
 		spin_unlock(&sg->guest_table_lock);
 		return;
@@ -2586,7 +2588,7 @@ static void gmap_shadow_notify(struct gmap *sg, unsigned long vmaddr,
 
 	BUG_ON(!gmap_is_shadow(sg));
 
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (sg->removed) {
 		spin_unlock(&sg->guest_table_lock);
 		return;
@@ -2761,7 +2763,7 @@ static void gmap_pmdp_clear(struct mm_struct *mm, unsigned long vmaddr,
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
-		spin_lock(&gmap->guest_table_lock);
+		spin_lock_nested(&gmap->guest_table_lock, GMAP_LOCK_PARENT);
 		pmdp = (pmd_t *)radix_tree_delete(&gmap->host_to_guest,
 						  vmaddr >> PMD_SHIFT);
 		if (pmdp) {
@@ -2816,7 +2818,7 @@ void gmap_pmdp_idte_local(struct mm_struct *mm, unsigned long vmaddr)
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
-		spin_lock(&gmap->guest_table_lock);
+		spin_lock_nested(&gmap->guest_table_lock, GMAP_LOCK_PARENT);
 		entry = radix_tree_delete(&gmap->host_to_guest,
 					  vmaddr >> PMD_SHIFT);
 		if (entry) {
@@ -2852,7 +2854,7 @@ void gmap_pmdp_idte_global(struct mm_struct *mm, unsigned long vmaddr)
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
-		spin_lock(&gmap->guest_table_lock);
+		spin_lock_nested(&gmap->guest_table_lock, GMAP_LOCK_PARENT);
 		entry = radix_tree_delete(&gmap->host_to_guest,
 					  vmaddr >> PMD_SHIFT);
 		if (entry) {
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 13/14] s390/mm: Pull pmd invalid check in gmap_pmd_op_walk
  2021-01-13  9:40 [PATCH 00/14] KVM: s390: Add huge page VSIE support Janosch Frank
                   ` (11 preceding siblings ...)
  2021-01-13  9:41 ` [PATCH 12/14] s390/mm: Add gmap lock classes Janosch Frank
@ 2021-01-13  9:41 ` Janosch Frank
  2021-01-13  9:41 ` [PATCH 14/14] KVM: s390: Allow the VSIE to be used with huge pages Janosch Frank
  13 siblings, 0 replies; 15+ messages in thread
From: Janosch Frank @ 2021-01-13  9:41 UTC (permalink / raw)
  To: kvm; +Cc: borntraeger, david, linux-s390, imbrenda

Not yet sure if I'll keep this.

Strictly speaking the walk should only walk and not check the invalid
(I) bit, but pulling the check into gmap_pmd_op_walk() makes the
callers look way nicer.
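
Sketched, the effect on the callers:

	/* before */
	pmdp = gmap_pmd_op_walk(gmap, gaddr, vmaddr, &ptl_pmd);
	if (pmdp && !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {
		...
	}

	/* after: the walk already filters out invalid entries */
	pmdp = gmap_pmd_op_walk(gmap, gaddr, vmaddr, &ptl_pmd);
	if (pmdp) {
		...
	}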

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
---
 arch/s390/mm/gmap.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index bc89fb974367..c4778ded8450 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -956,7 +956,8 @@ static inline pmd_t *gmap_pmd_op_walk(struct gmap *gmap, unsigned long gaddr,
 	}
 
 	pmdp = (pmd_t *) gmap_table_walk(gmap, gaddr, 1);
-	if (!pmdp || pmd_none(*pmdp)) {
+	if (!pmdp || pmd_none(*pmdp) ||
+	    pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID) {
 		if (*ptl)
 			spin_unlock(*ptl);
 		pmdp = NULL;
@@ -1165,7 +1166,7 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 			return vmaddr;
 		vmaddr |= gaddr & ~PMD_MASK;
 		pmdp = gmap_pmd_op_walk(gmap, gaddr, vmaddr, &ptl_pmd);
-		if (pmdp && !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {
+		if (pmdp) {
 			if (!pmd_large(*pmdp)) {
 				ptep = gmap_pte_from_pmd(gmap, pmdp, gaddr,
 							 &ptl_pte);
@@ -1274,7 +1275,7 @@ int gmap_read_table(struct gmap *gmap, unsigned long gaddr, unsigned long *val)
 		if (IS_ERR_VALUE(vmaddr))
 			return vmaddr;
 		pmdp = gmap_pmd_op_walk(gmap, gaddr, vmaddr, &ptl_pmd);
-		if (pmdp && !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {
+		if (pmdp) {
 			if (!pmd_large(*pmdp)) {
 				ptep = gmap_pte_from_pmd(gmap, pmdp, vmaddr, &ptl_pte);
 				if (ptep) {
@@ -1388,7 +1389,7 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
 			return vmaddr;
 		vmaddr |= paddr & ~PMD_MASK;
 		pmdp = gmap_pmd_op_walk(parent, paddr, vmaddr, &ptl_pmd);
-		if (pmdp && !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {
+		if (pmdp) {
 			if (!pmd_large(*pmdp)) {
 				ptl_pte = NULL;
 				ptep = gmap_pte_from_pmd(parent, pmdp, paddr,
@@ -2378,8 +2379,7 @@ int gmap_shadow_segment(struct gmap *sg, unsigned long saddr, pmd_t pmd)
 				break;
 			}
 			spmd = *spmdp;
-			if (!(pmd_val(spmd) & _SEGMENT_ENTRY_INVALID) &&
-			    !((pmd_val(spmd) & _SEGMENT_ENTRY_PROTECT) &&
+			if (!((pmd_val(spmd) & _SEGMENT_ENTRY_PROTECT) &&
 			      !(pmd_val(pmd) & _SEGMENT_ENTRY_PROTECT))) {
 
 				pmd_val(*spmdp) |= _SEGMENT_ENTRY_GMAP_VSIE;
@@ -2452,7 +2452,7 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
 			break;
 		rc = -EAGAIN;
 		spmdp = gmap_pmd_op_walk(parent, paddr, vmaddr, &ptl_pmd);
-		if (spmdp && !(pmd_val(*spmdp) & _SEGMENT_ENTRY_INVALID)) {
+		if (spmdp) {
 			/* Get page table pointer */
 			tptep = (pte_t *) gmap_table_walk(sg, saddr, 0);
 			if (!tptep) {
@@ -2886,9 +2886,6 @@ static bool gmap_test_and_clear_dirty_pmd(struct gmap *gmap, pmd_t *pmdp,
 					  unsigned long gaddr,
 					  unsigned long vmaddr)
 {
-	if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
-		return false;
-
 	/* Already protected memory, which did not change is clean */
 	if (pmd_val(*pmdp) & _SEGMENT_ENTRY_PROTECT &&
 	    !(pmd_val(*pmdp) & _SEGMENT_ENTRY_GMAP_UC))
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 14/14] KVM: s390: Allow the VSIE to be used with huge pages
  2021-01-13  9:40 [PATCH 00/14] KVM: s390: Add huge page VSIE support Janosch Frank
                   ` (12 preceding siblings ...)
  2021-01-13  9:41 ` [PATCH 13/14] s390/mm: Pull pmd invalid check in gmap_pmd_op_walk Janosch Frank
@ 2021-01-13  9:41 ` Janosch Frank
  13 siblings, 0 replies; 15+ messages in thread
From: Janosch Frank @ 2021-01-13  9:41 UTC (permalink / raw)
  To: kvm; +Cc: borntraeger, david, linux-s390, imbrenda

Now that we have VSIE support for VMs with huge memory backing, let's
make both features usable at the same time.
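
For reference, a minimal userspace sketch of how the capability is
enabled (unchanged by this patch; vm_fd is assumed to be an open KVM VM
file descriptor, error handling kept minimal):

	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_S390_HPAGE_1M,
	};

	/* must be done before vcpus are created and with cmma disabled */
	if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
		perror("KVM_ENABLE_CAP(KVM_CAP_S390_HPAGE_1M)");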

Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
---
 Documentation/virt/kvm/api.rst |  9 ++++-----
 arch/s390/kvm/kvm-s390.c       | 14 ++------------
 arch/s390/mm/gmap.c            |  1 -
 3 files changed, 6 insertions(+), 18 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index c136e254b496..c7fa31bbaa78 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5876,15 +5876,14 @@ Do not enable KVM_FEATURE_PV_UNHALT if you disable HLT exits.
 
 :Architectures: s390
 :Parameters: none
-:Returns: 0 on success, -EINVAL if hpage module parameter was not set
-	  or cmma is enabled, or the VM has the KVM_VM_S390_UCONTROL
-	  flag set
+:Returns: 0 on success, -EINVAL if cmma is enabled, or the VM has the
+	  KVM_VM_S390_UCONTROL flag set
 
 With this capability the KVM support for memory backing with 1m pages
 through hugetlbfs can be enabled for a VM. After the capability is
 enabled, cmma can't be enabled anymore and pfmfi and the storage key
-interpretation are disabled. If cmma has already been enabled or the
-hpage module parameter is not set to 1, -EINVAL is returned.
+interpretation are disabled. If cmma has already been enabled, -EINVAL
+is returned.
 
 While it is generally possible to create a huge page backed VM without
 this capability, the VM will not be able to run.
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index dbafd057ca6a..ad4fc84bb090 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -175,11 +175,6 @@ static int nested;
 module_param(nested, int, S_IRUGO);
 MODULE_PARM_DESC(nested, "Nested virtualization support");
 
-/* allow 1m huge page guest backing, if !nested */
-static int hpage;
-module_param(hpage, int, 0444);
-MODULE_PARM_DESC(hpage, "1m huge page backing support");
-
 /* maximum percentage of steal time for polling.  >100 is treated like 100 */
 static u8 halt_poll_max_steal = 10;
 module_param(halt_poll_max_steal, byte, 0644);
@@ -551,7 +546,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		break;
 	case KVM_CAP_S390_HPAGE_1M:
 		r = 0;
-		if (hpage && !kvm_is_ucontrol(kvm))
+		if (!kvm_is_ucontrol(kvm))
 			r = 1;
 		break;
 	case KVM_CAP_S390_MEM_OP:
@@ -761,7 +756,7 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
 		mutex_lock(&kvm->lock);
 		if (kvm->created_vcpus)
 			r = -EBUSY;
-		else if (!hpage || kvm->arch.use_cmma || kvm_is_ucontrol(kvm))
+		else if (kvm->arch.use_cmma || kvm_is_ucontrol(kvm))
 			r = -EINVAL;
 		else {
 			r = 0;
@@ -5042,11 +5037,6 @@ static int __init kvm_s390_init(void)
 		return -ENODEV;
 	}
 
-	if (nested && hpage) {
-		pr_info("A KVM host that supports nesting cannot back its KVM guests with huge pages\n");
-		return -EINVAL;
-	}
-
 	for (i = 0; i < 16; i++)
 		kvm_s390_fac_base[i] |=
 			S390_lowcore.stfle_fac_list[i] & nonhyp_mask(i);
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index c4778ded8450..994688946553 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -1844,7 +1844,6 @@ struct gmap *gmap_shadow(struct gmap *parent, unsigned long asce,
 	unsigned long limit;
 	int rc;
 
-	BUG_ON(parent->mm->context.allow_gmap_hpage_1m);
 	BUG_ON(gmap_is_shadow(parent));
 	spin_lock(&parent->shadow_lock);
 	sg = gmap_find_shadow(parent, asce, edat_level);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread
