* [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement
@ 2018-02-09  9:34 Janosch Frank
  2018-02-09  9:34 ` [RFC/PATCH v3 01/16] s390/mm: make gmap_protect_range more modular Janosch Frank
                   ` (16 more replies)
  0 siblings, 17 replies; 49+ messages in thread
From: Janosch Frank @ 2018-02-09  9:34 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

s390 has supported 1M pages since the z10, and hugetlbfs support for them
was added quite quickly, but KVM has always used standard 4k pages for
guest backings.

This patchset adds full support for 1M huge page backings for s390 KVM
guests, i.e. we also support VSIE (nested VMs) for these guests and are
therefore able to run all combinations of backings for all layers of
guests.

When running a VSIE guest in a huge page backed guest, we need to
split some huge pages to be able to set granular protection. This way
we avoid a prot/unprot cycle if prefixes and VSIE pages containing
level 3 gmap DAT tables share the same segment, as the prefix has to
be accessible at all times and the VSIE page has to be write
protected.
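
As a rough illustration of that decision, here is a conceptual sketch
distilled from the gmap_protect_range() changes later in the series
(patch 4); it is not a verbatim excerpt:

	pmdp = gmap_pmd_op_walk(gmap, gaddr);
	if (pmdp && !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {
		if (!pmd_large(*pmdp)) {
			/* normal 4k mapping: protect the single pte */
			rc = gmap_protect_pte(gmap, gaddr, pmdp, prot, bits);
		} else {
			/*
			 * 1M mapping: split it into a gmap-local page table
			 * so the prefix or the VSIE page can be protected at
			 * 4k granularity without a prot/unprot cycle.
			 */
			rc = gmap_pmd_split(gmap, gaddr, pmdp);
		}
		gmap_pmd_op_end(gmap, pmdp);
	}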

Branch:
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git hlp_vsie
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git/log/?h=hlp_vsie


TODO:
* Cleanups & Documentation
* Regression testing
* Testing large setups
* Testing multi level VSIE
* Split notification does not need to go over all gmaps.
* Fixup commit messages and remove some more reviewed-bys

V3:
	* Moved splitting to the front.
	* Cleanups

V2:
	* Incorporated changes from David's cleanup
	* Now flushing with IDTE_NODAT for protection transfers.
	* Added RRBE huge page handling for g2 -> g3 skey emulation
	* Added documentation for capability
	* Renamed GMAP_ENTRY_* constants
	* Added SEGMENT hardware bits constants
	* Improved some patch descriptions
	* General small improvements
	* Introduced pte_from_pmd function

Accomplished testing:
l2: KVM guest
l3: nested KVM guest

* 1m l2 guests
* VSIE (l3) 4k and 1m guests on 1m l2
* 1m l2 -> l2 migration with 4k/1m l3 guests
* l3 -> l2 migration
* postcopy only works every second try; this seems to be a QEMU or setup issue


The initial prototype was started by Dominik Dingel. I had the
pleasure of adding the VSIE part, the protection transfers and the
optimizations. A huge thanks to Christian and Martin, who reviewed the
series and helped with debugging and design.


Dominik Dingel (2):
  s390/mm: clear huge page storage keys on enable_skey
  s390/mm: hugetlb pages within a gmap can not be freed

Janosch Frank (14):
  s390/mm: make gmap_protect_range more modular
  s390/mm: Abstract gmap notify bit setting
  s390/mm: Introduce gmap_pmdp_xchg
  s390/mm: add gmap PMD invalidation notification
  s390/mm: Add gmap pmd invalidation and clearing
  s390/mm: Add huge page dirty sync support
  s390/mm: Make gmap_read_table EDAT1 compatible
  s390/mm: Make protect_rmap EDAT1 compatible
  s390/mm: Add shadow segment code
  s390/mm: Add VSIE reverse fake case
  s390/mm: Enable gmap huge pmd support
  s390/mm: Add huge pmd storage key handling
  KVM: s390: Add KVM HPAGE capability
  s390/mm: Add gmap lock classes

 Documentation/virtual/kvm/api.txt |   10 +
 arch/s390/include/asm/gmap.h      |   38 +-
 arch/s390/include/asm/pgtable.h   |   19 +-
 arch/s390/kvm/gaccess.c           |   64 +-
 arch/s390/kvm/kvm-s390.c          |   19 +-
 arch/s390/mm/gmap.c               | 1156 +++++++++++++++++++++++++++++++++----
 arch/s390/mm/pageattr.c           |    6 +-
 arch/s390/mm/pgtable.c            |  182 +++++-
 include/uapi/linux/kvm.h          |    1 +
 9 files changed, 1324 insertions(+), 171 deletions(-)

-- 
2.7.4


* [RFC/PATCH v3 01/16] s390/mm: make gmap_protect_range more modular
  2018-02-09  9:34 [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement Janosch Frank
@ 2018-02-09  9:34 ` Janosch Frank
  2018-02-13 14:07   ` David Hildenbrand
  2018-02-09  9:34 ` [RFC/PATCH v3 02/16] s390/mm: Abstract gmap notify bit setting Janosch Frank
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 49+ messages in thread
From: Janosch Frank @ 2018-02-09  9:34 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

This patch reworks the gmap_protect_range logic and extracts the pte
handling into its own function. We now also walk down to the pmd and
make it available in that function for later use. This makes it easier
to add the huge page handling logic later on.
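
For readability, the reworked loop body of gmap_protect_range() then
looks roughly like this (reconstructed from the diff below, not a
separate change):

	while (len) {
		rc = -EAGAIN;
		pmdp = gmap_pmd_op_walk(gmap, gaddr);
		if (pmdp && !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {
			rc = gmap_protect_pte(gmap, gaddr, pmdp, prot, bits);
			if (!rc) {
				len -= PAGE_SIZE;
				gaddr += PAGE_SIZE;
			}
			gmap_pmd_op_end(gmap, pmdp);
		}
		if (rc) {
			vmaddr = __gmap_translate(gmap, gaddr);
			if (IS_ERR_VALUE(vmaddr))
				return vmaddr;
			rc = gmap_pte_op_fixup(gmap, gaddr, vmaddr, prot);
			if (rc)
				return rc;
		}
	}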

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/mm/gmap.c | 91 +++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 81 insertions(+), 10 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 032d84e..549a55f 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -874,7 +874,77 @@ static int gmap_pte_op_fixup(struct gmap *gmap, unsigned long gaddr,
  */
 static void gmap_pte_op_end(spinlock_t *ptl)
 {
-	spin_unlock(ptl);
+	if (ptl)
+		spin_unlock(ptl);
+}
+
+/**
+ * gmap_pmd_op_walk - walk the gmap tables, get the guest table lock
+ *		      and return the pmd pointer
+ * @gmap: pointer to guest mapping meta data structure
+ * @gaddr: virtual address in the guest address space
+ *
+ * Returns a pointer to the pmd for a guest address, or NULL
+ */
+static inline pmd_t *gmap_pmd_op_walk(struct gmap *gmap, unsigned long gaddr)
+{
+	pmd_t *pmdp;
+
+	BUG_ON(gmap_is_shadow(gmap));
+	spin_lock(&gmap->guest_table_lock);
+	pmdp = (pmd_t *) gmap_table_walk(gmap, gaddr, 1);
+
+	if (!pmdp || pmd_none(*pmdp)) {
+		spin_unlock(&gmap->guest_table_lock);
+		return NULL;
+	}
+
+	/* 4k page table entries are locked via the pte (pte_alloc_map_lock). */
+	if (!pmd_large(*pmdp))
+		spin_unlock(&gmap->guest_table_lock);
+	return pmdp;
+}
+
+/**
+ * gmap_pmd_op_end - release the guest_table_lock if needed
+ * @gmap: pointer to the guest mapping meta data structure
+ * @pmdp: pointer to the pmd
+ */
+static inline void gmap_pmd_op_end(struct gmap *gmap, pmd_t *pmdp)
+{
+	if (pmd_large(*pmdp))
+		spin_unlock(&gmap->guest_table_lock);
+}
+
+/*
+ * gmap_protect_pte - remove access rights to memory and set pgste bits
+ * @gmap: pointer to guest mapping meta data structure
+ * @gaddr: virtual address in the guest address space
+ * @pmdp: pointer to the pmd associated with the pte
+ * @prot: indicates access rights: PROT_NONE, PROT_READ or PROT_WRITE
+ * @bits: pgste notification bits to set
+ *
+ * Returns 0 if successfully protected, -ENOMEM if out of memory and
+ * -EAGAIN if a fixup is needed.
+ *
+ * Expected to be called with sg->mm->mmap_sem in read and
+ * guest_table_lock held for shadow gmaps.
+ */
+static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
+			    pmd_t *pmdp, int prot, unsigned long bits)
+{
+	int rc;
+	pte_t *ptep;
+	spinlock_t *ptl = NULL;
+
+	ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr, &ptl);
+	if (!ptep)
+		return -ENOMEM;
+
+	/* Protect and unlock. */
+	rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, bits);
+	gmap_pte_op_end(ptl);
+	return rc;
 }
 
 /*
@@ -896,16 +966,20 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 			      unsigned long len, int prot, unsigned long bits)
 {
 	unsigned long vmaddr;
-	spinlock_t *ptl;
-	pte_t *ptep;
+	pmd_t *pmdp;
 	int rc;
 
 	while (len) {
 		rc = -EAGAIN;
-		ptep = gmap_pte_op_walk(gmap, gaddr, &ptl);
-		if (ptep) {
-			rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, bits);
-			gmap_pte_op_end(ptl);
+		pmdp = gmap_pmd_op_walk(gmap, gaddr);
+		if (pmdp && !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {
+			rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
+					      bits);
+			if (!rc) {
+				len -= PAGE_SIZE;
+				gaddr += PAGE_SIZE;
+			}
+			gmap_pmd_op_end(gmap, pmdp);
 		}
 		if (rc) {
 			vmaddr = __gmap_translate(gmap, gaddr);
@@ -914,10 +988,7 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 			rc = gmap_pte_op_fixup(gmap, gaddr, vmaddr, prot);
 			if (rc)
 				return rc;
-			continue;
 		}
-		gaddr += PAGE_SIZE;
-		len -= PAGE_SIZE;
 	}
 	return 0;
 }
-- 
2.7.4


* [RFC/PATCH v3 02/16] s390/mm: Abstract gmap notify bit setting
  2018-02-09  9:34 [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement Janosch Frank
  2018-02-09  9:34 ` [RFC/PATCH v3 01/16] s390/mm: make gmap_protect_range more modular Janosch Frank
@ 2018-02-09  9:34 ` Janosch Frank
  2018-02-13 14:10   ` David Hildenbrand
  2018-02-09  9:34 ` [RFC/PATCH v3 03/16] s390/mm: Introduce gmap_pmdp_xchg Janosch Frank
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 49+ messages in thread
From: Janosch Frank @ 2018-02-09  9:34 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

Currently we use the software PGSTE bits PGSTE_IN_BIT and PGSTE_VSIE_BIT
to get notified before an invalidation occurs on a prefix page or a VSIE
page, respectively. Both bits only work for a PGSTE, which only exists
for page tables.

For huge page support we also need such bits for segments (pmds), so
let's introduce abstract GMAP_NOTIFY_* bits that are translated into
the respective bits when gmap DAT table entries are protected.
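
Inside gmap_protect_pte() the abstract bits are translated back into
concrete PGSTE bits right before the protection is applied; a condensed
excerpt of the hunk below:

	pbits |= (bits & GMAP_NOTIFY_MPROT) ? PGSTE_IN_BIT : 0;
	pbits |= (bits & GMAP_NOTIFY_SHADOW) ? PGSTE_VSIE_BIT : 0;
	rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, pbits);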

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/include/asm/gmap.h |  4 ++++
 arch/s390/mm/gmap.c          | 13 ++++++++-----
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index e07cce8..c1bc563 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -9,6 +9,10 @@
 #ifndef _ASM_S390_GMAP_H
 #define _ASM_S390_GMAP_H
 
+/* Generic bits for GMAP notification on DAT table entry changes. */
+#define GMAP_NOTIFY_SHADOW	0x2
+#define GMAP_NOTIFY_MPROT	0x1
+
 /**
  * struct gmap_struct - guest address space
  * @list: list head for the mm->context gmap list
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 549a55f..4471cb9 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -922,7 +922,7 @@ static inline void gmap_pmd_op_end(struct gmap *gmap, pmd_t *pmdp)
  * @gaddr: virtual address in the guest address space
  * @pmdp: pointer to the pmd associated with the pte
  * @prot: indicates access rights: PROT_NONE, PROT_READ or PROT_WRITE
- * @bits: pgste notification bits to set
+ * @bits: notification bits to set
  *
  * Returns 0 if successfully protected, -ENOMEM if out of memory and
  * -EAGAIN if a fixup is needed.
@@ -936,13 +936,16 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
 	int rc;
 	pte_t *ptep;
 	spinlock_t *ptl = NULL;
+	unsigned long pbits = 0;
 
 	ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr, &ptl);
 	if (!ptep)
 		return -ENOMEM;
 
+	pbits |= (bits & GMAP_NOTIFY_MPROT) ? PGSTE_IN_BIT : 0;
+	pbits |= (bits & GMAP_NOTIFY_SHADOW) ? PGSTE_VSIE_BIT : 0;
 	/* Protect and unlock. */
-	rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, bits);
+	rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, pbits);
 	gmap_pte_op_end(ptl);
 	return rc;
 }
@@ -1017,7 +1020,7 @@ int gmap_mprotect_notify(struct gmap *gmap, unsigned long gaddr,
 	if (!MACHINE_HAS_ESOP && prot == PROT_READ)
 		return -EINVAL;
 	down_read(&gmap->mm->mmap_sem);
-	rc = gmap_protect_range(gmap, gaddr, len, prot, PGSTE_IN_BIT);
+	rc = gmap_protect_range(gmap, gaddr, len, prot, GMAP_NOTIFY_MPROT);
 	up_read(&gmap->mm->mmap_sem);
 	return rc;
 }
@@ -1139,7 +1142,7 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
 		if (ptep) {
 			spin_lock(&sg->guest_table_lock);
 			rc = ptep_force_prot(parent->mm, paddr, ptep, prot,
-					     PGSTE_VSIE_BIT);
+					     GMAP_NOTIFY_SHADOW);
 			if (!rc)
 				gmap_insert_rmap(sg, vmaddr, rmap);
 			spin_unlock(&sg->guest_table_lock);
@@ -1605,7 +1608,7 @@ struct gmap *gmap_shadow(struct gmap *parent, unsigned long asce,
 	down_read(&parent->mm->mmap_sem);
 	rc = gmap_protect_range(parent, asce & _ASCE_ORIGIN,
 				((asce & _ASCE_TABLE_LENGTH) + 1) * PAGE_SIZE,
-				PROT_READ, PGSTE_VSIE_BIT);
+				PROT_READ, GMAP_NOTIFY_SHADOW);
 	up_read(&parent->mm->mmap_sem);
 	spin_lock(&parent->shadow_lock);
 	new->initialized = true;
-- 
2.7.4


* [RFC/PATCH v3 03/16] s390/mm: Introduce gmap_pmdp_xchg
  2018-02-09  9:34 [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement Janosch Frank
  2018-02-09  9:34 ` [RFC/PATCH v3 01/16] s390/mm: make gmap_protect_range more modular Janosch Frank
  2018-02-09  9:34 ` [RFC/PATCH v3 02/16] s390/mm: Abstract gmap notify bit setting Janosch Frank
@ 2018-02-09  9:34 ` Janosch Frank
  2018-02-13 14:16   ` David Hildenbrand
  2018-02-09  9:34 ` [RFC/PATCH v3 04/16] s390/mm: add gmap PMD invalidation notification Janosch Frank
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 49+ messages in thread
From: Janosch Frank @ 2018-02-09  9:34 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

When changing guest pmds, we don't need to take care of the
corresponding host pmd. This means we don't need to flush the host
TLB entries and we don't need to notify all gmaps.

Let's introduce a function that exchanges a pmd and takes care of the
necessary flushing and notification.
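
A minimal usage sketch, assuming a caller that has already resolved the
gmap pmd and holds the guest_table_lock as the function's documentation
requires (concrete callers only appear in later patches of this series;
the example new value is purely illustrative):

	pmd_t new = *pmdp;

	pmd_val(new) |= _SEGMENT_ENTRY_INVALID;	/* example: invalidate the entry */
	spin_lock(&gmap->guest_table_lock);
	gmap_pmdp_xchg(gmap, pmdp, new, gaddr);	/* IDTE/CSP flush, then *pmdp = new */
	spin_unlock(&gmap->guest_table_lock);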

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/mm/gmap.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 4471cb9..16c877a9 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -23,6 +23,9 @@
 
 #define GMAP_SHADOW_FAKE_TABLE 1ULL
 
+static void gmap_pmdp_xchg(struct gmap *gmap, pmd_t *old, pmd_t new,
+			   unsigned long gaddr);
+
 /**
  * gmap_alloc - allocate and initialize a guest address space
  * @mm: pointer to the parent mm_struct
@@ -2174,6 +2177,31 @@ void ptep_notify(struct mm_struct *mm, unsigned long vmaddr,
 }
 EXPORT_SYMBOL_GPL(ptep_notify);
 
+/**
+ * gmap_pmdp_xchg - exchange a gmap pmd with another and notify
+ * @gmap: pointer to the guest address space structure
+ * @pmdp: pointer to the pmd entry
+ * @new: replacement entry
+ * @gaddr: the affected guest address
+ *
+ * This function is assumed to be called with the guest_table_lock
+ * held.
+ */
+static void gmap_pmdp_xchg(struct gmap *gmap, pmd_t *pmdp, pmd_t new,
+			   unsigned long gaddr)
+{
+	if (MACHINE_HAS_TLB_GUEST)
+		__pmdp_idte(gaddr, (pmd_t *)pmdp,
+			    IDTE_GUEST_ASCE, gmap->asce,
+			    IDTE_GLOBAL);
+	if (MACHINE_HAS_IDTE)
+		__pmdp_idte(gaddr, (pmd_t *)pmdp,
+			    0, 0, IDTE_GLOBAL);
+	else
+		__pmdp_csp(pmdp);
+	*pmdp = new;
+}
+
 static inline void thp_split_mm(struct mm_struct *mm)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-- 
2.7.4


* [RFC/PATCH v3 04/16] s390/mm: add gmap PMD invalidation notification
  2018-02-09  9:34 [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (2 preceding siblings ...)
  2018-02-09  9:34 ` [RFC/PATCH v3 03/16] s390/mm: Introduce gmap_pmdp_xchg Janosch Frank
@ 2018-02-09  9:34 ` Janosch Frank
  2018-02-13 14:36   ` David Hildenbrand
  2018-02-09  9:34 ` [RFC/PATCH v3 05/16] s390/mm: Add gmap pmd invalidation and clearing Janosch Frank
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 49+ messages in thread
From: Janosch Frank @ 2018-02-09  9:34 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

For later migration of huge pages we want to write-protect guest
PMDs. While doing this, we have to make absolutely sure that the
guest's lowcore is always accessible while the VCPU is running. With
PTEs, this is solved by marking the PGSTEs of the lowcore pages with
the invalidation notification bit and kicking the guest out of the SIE
via a notifier function if we need to invalidate such a page.

With PMDs we do not have PGSTEs or some other bits we could use in the
host PMD. Instead we pick one of the free bits in the gmap PMD. Every
time a host pmd is invalidated, we check whether the respective gmap
PMD has the bit set and, if so, fire the notifier.

As a first step we only support setting the invalidation bit;
restricting access to guest pmds is not supported yet and will follow
shortly.
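
The hook is wired into the existing pmd exchange primitives in
pgtable.c, e.g. pmdp_xchg_direct(); this is how the function reads once
the hunk below is applied (the lazy variant gets the same treatment):

	pmd_t pmdp_xchg_direct(struct mm_struct *mm, unsigned long addr,
			       pmd_t *pmdp, pmd_t new)
	{
		pmd_t old;

		preempt_disable();
		if (mm_has_pgste(mm))
			pmdp_notify(mm, addr);	/* fire gmap invalidation notifiers first */
		old = pmdp_flush_direct(mm, addr, pmdp);
		*pmdp = new;
		preempt_enable();
		return old;
	}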

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/include/asm/gmap.h    |  15 +++
 arch/s390/include/asm/pgtable.h |   9 +-
 arch/s390/mm/gmap.c             | 197 +++++++++++++++++++++++++++++++++++++---
 arch/s390/mm/pgtable.c          |   4 +
 4 files changed, 209 insertions(+), 16 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index c1bc563..4324b2a 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -13,6 +13,9 @@
 #define GMAP_NOTIFY_SHADOW	0x2
 #define GMAP_NOTIFY_MPROT	0x1
 
+/* Status bits in the gmap segment entry. */
+#define _SEGMENT_ENTRY_GMAP_SPLIT	0x0001  /* split huge pmd */
+
 /**
  * struct gmap_struct - guest address space
  * @list: list head for the mm->context gmap list
@@ -52,6 +55,7 @@ struct gmap {
 	struct radix_tree_root host_to_rmap;
 	struct list_head children;
 	struct list_head pt_list;
+	struct list_head split_list;
 	spinlock_t shadow_lock;
 	struct gmap *parent;
 	unsigned long orig_asce;
@@ -92,6 +96,17 @@ static inline int gmap_is_shadow(struct gmap *gmap)
 	return !!gmap->parent;
 }
 
+/**
+ * gmap_pmd_is_split - Returns if a huge gmap pmd has been split.
+ * @pmdp: pointer to the pmd
+ *
+ * Returns true if the passed huge gmap pmd has been split.
+ */
+static inline bool gmap_pmd_is_split(pmd_t *pmdp)
+{
+	return !!(pmd_val(*pmdp) & _SEGMENT_ENTRY_GMAP_SPLIT);
+}
+
 struct gmap *gmap_create(struct mm_struct *mm, unsigned long limit);
 void gmap_remove(struct gmap *gmap);
 struct gmap *gmap_get(struct gmap *gmap);
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 2d24d33..e9b56b9 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -269,8 +269,10 @@ static inline int is_module_addr(void *addr)
 #define _REGION_ENTRY_BITS_LARGE 0xffffffff8000fe2fUL
 
 /* Bits in the segment table entry */
-#define _SEGMENT_ENTRY_BITS	0xfffffffffffffe33UL
-#define _SEGMENT_ENTRY_BITS_LARGE 0xfffffffffff0ff33UL
+#define _SEGMENT_ENTRY_BITS			0xfffffffffffffe33UL
+#define _SEGMENT_ENTRY_BITS_LARGE 		0xfffffffffff0ff33UL
+#define _SEGMENT_ENTRY_HARDWARE_BITS		0xfffffffffffffe30UL
+#define _SEGMENT_ENTRY_HARDWARE_BITS_LARGE	0xfffffffffff00730UL
 #define _SEGMENT_ENTRY_ORIGIN_LARGE ~0xfffffUL /* large page address	    */
 #define _SEGMENT_ENTRY_ORIGIN	~0x7ffUL/* page table origin		    */
 #define _SEGMENT_ENTRY_PROTECT	0x200	/* segment protection bit	    */
@@ -1093,6 +1095,9 @@ void ptep_set_pte_at(struct mm_struct *mm, unsigned long addr,
 void ptep_set_notify(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
 void ptep_notify(struct mm_struct *mm, unsigned long addr,
 		 pte_t *ptep, unsigned long bits);
+void ptep_notify_gmap(struct mm_struct *mm, unsigned long vmaddr,
+		      pte_t *pte, unsigned long bits);
+void pmdp_notify(struct mm_struct *mm, unsigned long addr);
 int ptep_force_prot(struct mm_struct *mm, unsigned long gaddr,
 		    pte_t *ptep, int prot, unsigned long bit);
 void ptep_zap_unused(struct mm_struct *mm, unsigned long addr,
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 16c877a9..cb03e66 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -63,6 +63,7 @@ static struct gmap *gmap_alloc(unsigned long limit)
 	INIT_LIST_HEAD(&gmap->crst_list);
 	INIT_LIST_HEAD(&gmap->children);
 	INIT_LIST_HEAD(&gmap->pt_list);
+	INIT_LIST_HEAD(&gmap->split_list);
 	INIT_RADIX_TREE(&gmap->guest_to_host, GFP_KERNEL);
 	INIT_RADIX_TREE(&gmap->host_to_guest, GFP_ATOMIC);
 	INIT_RADIX_TREE(&gmap->host_to_rmap, GFP_ATOMIC);
@@ -194,6 +195,10 @@ static void gmap_free(struct gmap *gmap)
 	gmap_radix_tree_free(&gmap->guest_to_host);
 	gmap_radix_tree_free(&gmap->host_to_guest);
 
+	/* Free split pmd page tables */
+	list_for_each_entry_safe(page, next, &gmap->split_list, lru)
+		page_table_free_pgste(page);
+
 	/* Free additional data for a shadow gmap */
 	if (gmap_is_shadow(gmap)) {
 		/* Free all page tables. */
@@ -599,10 +604,15 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 	if (*table == _SEGMENT_ENTRY_EMPTY) {
 		rc = radix_tree_insert(&gmap->host_to_guest,
 				       vmaddr >> PMD_SHIFT, table);
-		if (!rc)
-			*table = pmd_val(*pmd);
-	} else
-		rc = 0;
+		if (!rc) {
+			if (pmd_large(*pmd)) {
+				*table = pmd_val(*pmd) &
+					_SEGMENT_ENTRY_HARDWARE_BITS_LARGE;
+			} else
+				*table = pmd_val(*pmd) &
+					_SEGMENT_ENTRY_HARDWARE_BITS;
+		}
+	}
 	spin_unlock(&gmap->guest_table_lock);
 	spin_unlock(ptl);
 	radix_tree_preload_end();
@@ -902,8 +912,11 @@ static inline pmd_t *gmap_pmd_op_walk(struct gmap *gmap, unsigned long gaddr)
 		return NULL;
 	}
 
-	/* 4k page table entries are locked via the pte (pte_alloc_map_lock). */
-	if (!pmd_large(*pmdp))
+	/*
+	 * Non-split 4k page table entries are locked via the pte
+	 * (pte_alloc_map_lock).
+	 */
+	if (!gmap_pmd_is_split(pmdp) && !pmd_large(*pmdp))
 		spin_unlock(&gmap->guest_table_lock);
 	return pmdp;
 }
@@ -915,10 +928,77 @@ static inline pmd_t *gmap_pmd_op_walk(struct gmap *gmap, unsigned long gaddr)
  */
 static inline void gmap_pmd_op_end(struct gmap *gmap, pmd_t *pmdp)
 {
-	if (pmd_large(*pmdp))
+	if (pmd_large(*pmdp) || gmap_pmd_is_split(pmdp))
 		spin_unlock(&gmap->guest_table_lock);
 }
 
+static pte_t *gmap_pte_from_pmd(struct gmap *gmap, pmd_t *pmdp,
+				unsigned long addr, spinlock_t **ptl)
+{
+	if (likely(!gmap_pmd_is_split(pmdp)))
+		return pte_alloc_map_lock(gmap->mm, pmdp, addr, ptl);
+
+	*ptl = NULL;
+	return pte_offset_map(pmdp, addr);
+}
+
+/**
+ * gmap_pmd_split_free - Free a split pmd's page table
+ * @pmdp The split pmd that we free of its page table
+ *
+ * If the userspace pmds are exchanged, we'll remove the gmap pmds as
+ * well, so we fault on them and link them again. We would leak
+ * memory, if we didn't free split pmds here.
+ */
+static inline void gmap_pmd_split_free(pmd_t *pmdp)
+{
+	unsigned long pgt = pmd_val(*pmdp) & _SEGMENT_ENTRY_ORIGIN;
+	struct page *page;
+
+	if (gmap_pmd_is_split(pmdp)) {
+		page = pfn_to_page(pgt >> PAGE_SHIFT);
+		list_del(&page->lru);
+		page_table_free_pgste(page);
+	}
+}
+
+/**
+ * gmap_pmd_split - Split a huge gmap pmd and use a page table instead
+ * @gmap: pointer to guest mapping meta data structure
+ * @gaddr: virtual address in the guest address space
+ * @pmdp: pointer to the pmd that will be split
+ *
+ * When splitting gmap pmds, we have to make the resulting page table
+ * look like it's a normal one to be able to use the common pte
+ * handling functions. Also we need to track these new tables as they
+ * aren't tracked anywhere else.
+ */
+static int gmap_pmd_split(struct gmap *gmap, unsigned long gaddr, pmd_t *pmdp)
+{
+	unsigned long *table;
+	struct page *page;
+	pmd_t new;
+	int i;
+
+	page = page_table_alloc_pgste(gmap->mm);
+	if (!page)
+		return -ENOMEM;
+	table = (unsigned long *) page_to_phys(page);
+	for (i = 0; i < 256; i++) {
+		table[i] = (pmd_val(*pmdp) & HPAGE_MASK) + i * PAGE_SIZE;
+		/* pmd_large() implies pmd/pte_present() */
+		table[i] |=  _PAGE_PRESENT | _PAGE_READ | _PAGE_WRITE;
+		/* ptes are directly marked as dirty */
+		table[i + PTRS_PER_PTE] |= PGSTE_UC_BIT;
+	}
+
+	pmd_val(new) = ((unsigned long)table | _SEGMENT_ENTRY |
+			(_SEGMENT_ENTRY_GMAP_SPLIT));
+	list_add(&page->lru, &gmap->split_list);
+	gmap_pmdp_xchg(gmap, pmdp, new, gaddr);
+	return 0;
+}
+
 /*
  * gmap_protect_pte - remove access rights to memory and set pgste bits
  * @gmap: pointer to guest mapping meta data structure
@@ -941,7 +1021,7 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
 	spinlock_t *ptl = NULL;
 	unsigned long pbits = 0;
 
-	ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr, &ptl);
+	ptep = gmap_pte_from_pmd(gmap, pmdp, gaddr, &ptl);
 	if (!ptep)
 		return -ENOMEM;
 
@@ -979,15 +1059,21 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 		rc = -EAGAIN;
 		pmdp = gmap_pmd_op_walk(gmap, gaddr);
 		if (pmdp && !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {
-			rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
-					      bits);
-			if (!rc) {
-				len -= PAGE_SIZE;
-				gaddr += PAGE_SIZE;
+			if (!pmd_large(*pmdp)) {
+				rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
+						      bits);
+				if (!rc) {
+					len -= PAGE_SIZE;
+					gaddr += PAGE_SIZE;
+				}
+			} else {
+				rc = gmap_pmd_split(gmap, gaddr, pmdp);
+				if (!rc)
+					rc = -EFAULT;
 			}
 			gmap_pmd_op_end(gmap, pmdp);
 		}
-		if (rc) {
+		if (rc && rc != -EFAULT) {
 			vmaddr = __gmap_translate(gmap, gaddr);
 			if (IS_ERR_VALUE(vmaddr))
 				return vmaddr;
@@ -2133,6 +2219,39 @@ static void gmap_shadow_notify(struct gmap *sg, unsigned long vmaddr,
 	spin_unlock(&sg->guest_table_lock);
 }
 
+/*
+ * ptep_notify_gmap - call all invalidation callbacks for a specific pte of a gmap
+ * @mm: pointer to the process mm_struct
+ * @addr: virtual address in the process address space
+ * @pte: pointer to the page table entry
+ * @bits: bits from the pgste that caused the notify call
+ *
+ * This function is assumed to be called with the guest_table_lock held.
+ */
+void ptep_notify_gmap(struct mm_struct *mm, unsigned long vmaddr,
+		      pte_t *pte, unsigned long bits)
+{
+	unsigned long offset, gaddr = 0;
+	unsigned long *table;
+	struct gmap *gmap;
+
+	offset = ((unsigned long) pte) & (255 * sizeof(pte_t));
+	offset = offset * (4096 / sizeof(pte_t));
+	rcu_read_lock();
+	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
+		table = radix_tree_lookup(&gmap->host_to_guest,
+					  vmaddr >> PMD_SHIFT);
+		if (table)
+			gaddr = __gmap_segment_gaddr(table) + offset;
+		else
+			continue;
+
+		if (bits & PGSTE_IN_BIT)
+			gmap_call_notifier(gmap, gaddr, gaddr + PAGE_SIZE - 1);
+	}
+	rcu_read_unlock();
+}
+
 /**
  * ptep_notify - call all invalidation callbacks for a specific pte.
  * @mm: pointer to the process mm_struct
@@ -2177,6 +2296,23 @@ void ptep_notify(struct mm_struct *mm, unsigned long vmaddr,
 }
 EXPORT_SYMBOL_GPL(ptep_notify);
 
+static void pmdp_notify_split(struct mm_struct *mm, unsigned long vmaddr,
+			      unsigned long *table)
+{
+	int i = 0;
+	unsigned long bits;
+	unsigned long *ptep = (unsigned long *)(*table & PAGE_MASK);
+	unsigned long *pgste = ptep + PTRS_PER_PTE;
+
+	for (; i < 256; i++, vmaddr += PAGE_SIZE, ptep++, pgste++) {
+		bits = *pgste & (PGSTE_IN_BIT | PGSTE_VSIE_BIT);
+		if (bits) {
+			*pgste ^= bits;
+			ptep_notify_gmap(mm, vmaddr, (pte_t *)ptep, bits);
+		}
+	}
+}
+
 /**
  * gmap_pmdp_xchg - exchange a gmap pmd with another and notify
  * @gmap: pointer to the guest address space structure
@@ -2202,6 +2338,39 @@ static void gmap_pmdp_xchg(struct gmap *gmap, pmd_t *pmdp, pmd_t new,
 	*pmdp = new;
 }
 
+/**
+ * pmdp_notify - call all invalidation callbacks for a specific pmd
+ * @mm: pointer to the process mm_struct
+ * @vmaddr: virtual address in the process address space
+ *
+ * This function is expected to be called with mmap_sem held in read.
+ */
+void pmdp_notify(struct mm_struct *mm, unsigned long vmaddr)
+{
+	unsigned long *table, gaddr;
+	struct gmap *gmap;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
+		spin_lock(&gmap->guest_table_lock);
+		table = radix_tree_lookup(&gmap->host_to_guest,
+					  vmaddr >> PMD_SHIFT);
+		if (!table) {
+			spin_unlock(&gmap->guest_table_lock);
+			continue;
+		}
+		gaddr = __gmap_segment_gaddr(table);
+		if (gmap_pmd_is_split((pmd_t *)table)) {
+			pmdp_notify_split(mm, vmaddr, table);
+			spin_unlock(&gmap->guest_table_lock);
+			continue;
+		}
+		spin_unlock(&gmap->guest_table_lock);
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(pmdp_notify);
+
 static inline void thp_split_mm(struct mm_struct *mm)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 4f2b65d..a6cc540 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -405,6 +405,8 @@ pmd_t pmdp_xchg_direct(struct mm_struct *mm, unsigned long addr,
 	pmd_t old;
 
 	preempt_disable();
+	if (mm_has_pgste(mm))
+		pmdp_notify(mm, addr);
 	old = pmdp_flush_direct(mm, addr, pmdp);
 	*pmdp = new;
 	preempt_enable();
@@ -418,6 +420,8 @@ pmd_t pmdp_xchg_lazy(struct mm_struct *mm, unsigned long addr,
 	pmd_t old;
 
 	preempt_disable();
+	if (mm_has_pgste(mm))
+		pmdp_notify(mm, addr);
 	old = pmdp_flush_lazy(mm, addr, pmdp);
 	*pmdp = new;
 	preempt_enable();
-- 
2.7.4


* [RFC/PATCH v3 05/16] s390/mm: Add gmap pmd invalidation and clearing
  2018-02-09  9:34 [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (3 preceding siblings ...)
  2018-02-09  9:34 ` [RFC/PATCH v3 04/16] s390/mm: add gmap PMD invalidation notification Janosch Frank
@ 2018-02-09  9:34 ` Janosch Frank
  2018-02-09  9:34 ` [RFC/PATCH v3 06/16] s390/mm: Add huge page dirty sync support Janosch Frank
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 49+ messages in thread
From: Janosch Frank @ 2018-02-09  9:34 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

If the host invalidates a pmd, we also have to invalidate the
corresponding gmap pmds, as well as flush them from the TLB. This is
necessary, as we don't share the pmd tables between host and guest as
we do with ptes.

The clearing part of these three new functions sets a guest pmd entry
to _SEGMENT_ENTRY_EMPTY, so the guest will fault on it and we will
re-link it.

Flushing the gmap is not necessary in the host's lazy local and csp
cases. Both purge the TLB completely.
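
The clear-and-relink part boils down to the following, condensed from
the gmap_pmdp_clear() hunk below (the cleanup of split page tables
added earlier in the series is omitted here for brevity):

	spin_lock(&gmap->guest_table_lock);
	pmdp = (pmd_t *)radix_tree_delete(&gmap->host_to_guest,
					  vmaddr >> PMD_SHIFT);
	if (pmdp) {
		if (purge)
			__pmdp_csp(pmdp);		/* full TLB purge if requested */
		pmd_val(*pmdp) = _SEGMENT_ENTRY_EMPTY;	/* guest faults and re-links */
	}
	spin_unlock(&gmap->guest_table_lock);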

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/include/asm/pgtable.h |   4 ++
 arch/s390/mm/gmap.c             | 117 ++++++++++++++++++++++++++++++++++++++++
 arch/s390/mm/pgtable.c          |  17 ++++--
 3 files changed, 135 insertions(+), 3 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index e9b56b9..223791b 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1122,6 +1122,10 @@ int set_pgste_bits(struct mm_struct *mm, unsigned long addr,
 int get_pgste(struct mm_struct *mm, unsigned long hva, unsigned long *pgstep);
 int pgste_perform_essa(struct mm_struct *mm, unsigned long hva, int orc,
 			unsigned long *oldpte, unsigned long *oldpgste);
+void gmap_pmdp_csp(struct mm_struct *mm, unsigned long vmaddr);
+void gmap_pmdp_invalidate(struct mm_struct *mm, unsigned long vmaddr);
+void gmap_pmdp_idte_local(struct mm_struct *mm, unsigned long vmaddr);
+void gmap_pmdp_idte_global(struct mm_struct *mm, unsigned long vmaddr);
 
 /*
  * Certain architectures need to do special things when PTEs
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index cb03e66..9046c91 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2371,6 +2371,123 @@ void pmdp_notify(struct mm_struct *mm, unsigned long vmaddr)
 }
 EXPORT_SYMBOL_GPL(pmdp_notify);
 
+static void gmap_pmdp_clear(struct mm_struct *mm, unsigned long vmaddr,
+			    int purge)
+{
+	pmd_t *pmdp;
+	struct gmap *gmap;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
+		spin_lock(&gmap->guest_table_lock);
+		pmdp = (pmd_t *)radix_tree_delete(&gmap->host_to_guest,
+						   vmaddr >> PMD_SHIFT);
+		if (pmdp) {
+			if (purge)
+				__pmdp_csp(pmdp);
+			gmap_pmd_split_free(pmdp);
+			pmd_val(*pmdp) = _SEGMENT_ENTRY_EMPTY;
+		}
+		spin_unlock(&gmap->guest_table_lock);
+	}
+	rcu_read_unlock();
+}
+
+/**
+ * gmap_pmdp_invalidate - invalidate all affected guest pmd entries without
+ *                        flushing
+ * @mm: pointer to the process mm_struct
+ * @vmaddr: virtual address in the process address space
+ */
+void gmap_pmdp_invalidate(struct mm_struct *mm, unsigned long vmaddr)
+{
+	gmap_pmdp_clear(mm, vmaddr, 0);
+}
+EXPORT_SYMBOL_GPL(gmap_pmdp_invalidate);
+
+/**
+ * gmap_pmdp_csp - csp all affected guest pmd entries
+ * @mm: pointer to the process mm_struct
+ * @vmaddr: virtual address in the process address space
+ */
+void gmap_pmdp_csp(struct mm_struct *mm, unsigned long vmaddr)
+{
+	gmap_pmdp_clear(mm, vmaddr, 1);
+}
+EXPORT_SYMBOL_GPL(gmap_pmdp_csp);
+
+/**
+ * gmap_pmdp_idte_local - invalidate and clear a guest pmd entry
+ * @mm: pointer to the process mm_struct
+ * @vmaddr: virtual address in the process address space
+ */
+void gmap_pmdp_idte_local(struct mm_struct *mm, unsigned long vmaddr)
+{
+	unsigned long *entry, gaddr;
+	struct gmap *gmap;
+	pmd_t *pmdp;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
+		spin_lock(&gmap->guest_table_lock);
+		entry = radix_tree_delete(&gmap->host_to_guest,
+					  vmaddr >> PMD_SHIFT);
+		if (entry) {
+			pmdp = (pmd_t *)entry;
+			gaddr = __gmap_segment_gaddr(entry);
+			if (MACHINE_HAS_TLB_GUEST)
+				__pmdp_idte(gaddr, pmdp,
+					    IDTE_GUEST_ASCE,
+					    gmap->asce, IDTE_LOCAL);
+			else if (MACHINE_HAS_IDTE)
+				__pmdp_idte(gaddr, pmdp, 0, 0,
+					    IDTE_LOCAL);
+			gmap_pmd_split_free(pmdp);
+			*entry = _SEGMENT_ENTRY_EMPTY;
+		}
+		spin_unlock(&gmap->guest_table_lock);
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(gmap_pmdp_idte_local);
+
+/**
+ * gmap_pmdp_idte_global - invalidate and clear a guest pmd entry
+ * @mm: pointer to the process mm_struct
+ * @vmaddr: virtual address in the process address space
+ */
+void gmap_pmdp_idte_global(struct mm_struct *mm, unsigned long vmaddr)
+{
+	unsigned long *entry, gaddr;
+	struct gmap *gmap;
+	pmd_t *pmdp;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
+		spin_lock(&gmap->guest_table_lock);
+		entry = radix_tree_delete(&gmap->host_to_guest,
+					  vmaddr >> PMD_SHIFT);
+		if (entry) {
+			pmdp = (pmd_t *)entry;
+			gaddr = __gmap_segment_gaddr(entry);
+			if (MACHINE_HAS_TLB_GUEST)
+				__pmdp_idte(gaddr, pmdp,
+					    IDTE_GUEST_ASCE,
+					    gmap->asce, IDTE_GLOBAL);
+			else if (MACHINE_HAS_IDTE)
+				__pmdp_idte(gaddr, pmdp, 0, 0,
+					    IDTE_GLOBAL);
+			else
+				__pmdp_csp(pmdp);
+			gmap_pmd_split_free(pmdp);
+			*entry = _SEGMENT_ENTRY_EMPTY;
+		}
+		spin_unlock(&gmap->guest_table_lock);
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(gmap_pmdp_idte_global);
+
 static inline void thp_split_mm(struct mm_struct *mm)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index a6cc540..e690879 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -347,18 +347,27 @@ static inline void pmdp_idte_local(struct mm_struct *mm,
 			    mm->context.asce, IDTE_LOCAL);
 	else
 		__pmdp_idte(addr, pmdp, 0, 0, IDTE_LOCAL);
+	if (mm_has_pgste(mm))
+		gmap_pmdp_idte_local(mm, addr);
 }
 
 static inline void pmdp_idte_global(struct mm_struct *mm,
 				    unsigned long addr, pmd_t *pmdp)
 {
-	if (MACHINE_HAS_TLB_GUEST)
+	if (MACHINE_HAS_TLB_GUEST) {
 		__pmdp_idte(addr, pmdp, IDTE_NODAT | IDTE_GUEST_ASCE,
 			    mm->context.asce, IDTE_GLOBAL);
-	else if (MACHINE_HAS_IDTE)
+		if (mm_has_pgste(mm))
+			gmap_pmdp_idte_global(mm, addr);
+	} else if (MACHINE_HAS_IDTE) {
 		__pmdp_idte(addr, pmdp, 0, 0, IDTE_GLOBAL);
-	else
+		if (mm_has_pgste(mm))
+			gmap_pmdp_idte_global(mm, addr);
+	} else {
 		__pmdp_csp(pmdp);
+		if (mm_has_pgste(mm))
+			gmap_pmdp_csp(mm, addr);
+	}
 }
 
 static inline pmd_t pmdp_flush_direct(struct mm_struct *mm,
@@ -392,6 +401,8 @@ static inline pmd_t pmdp_flush_lazy(struct mm_struct *mm,
 			  cpumask_of(smp_processor_id()))) {
 		pmd_val(*pmdp) |= _SEGMENT_ENTRY_INVALID;
 		mm->context.flush_mm = 1;
+		if (mm_has_pgste(mm))
+			gmap_pmdp_invalidate(mm, addr);
 	} else {
 		pmdp_idte_global(mm, addr, pmdp);
 	}
-- 
2.7.4


* [RFC/PATCH v3 06/16] s390/mm: Add huge page dirty sync support
  2018-02-09  9:34 [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (4 preceding siblings ...)
  2018-02-09  9:34 ` [RFC/PATCH v3 05/16] s390/mm: Add gmap pmd invalidation and clearing Janosch Frank
@ 2018-02-09  9:34 ` Janosch Frank
  2018-02-09  9:34 ` [RFC/PATCH v3 07/16] s390/mm: Make gmap_read_table EDAT1 compatible Janosch Frank
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 49+ messages in thread
From: Janosch Frank @ 2018-02-09  9:34 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

To do dirty logging with huge pages, we protect huge pmds in the
gmap. When they are written to, we unprotect them and mark them dirty.

We introduce the function gmap_test_and_clear_dirty_segment, which
handles dirty sync for huge pages.
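
On the KVM side the dirty-log walk then advances in 1M segment steps
and lets the gmap code fill the matching 256-bit slice of the dirty
bitmap; this is how the loop reads once the kvm-s390.c hunk below is
applied:

	/* Loop over all guest segments */
	last_gfn = memslot->base_gfn + memslot->npages;
	for (cur_gfn = memslot->base_gfn; cur_gfn <= last_gfn;
	     cur_gfn += _PAGE_ENTRIES, dirty += 4) {
		gaddr = gfn_to_gpa(cur_gfn);
		vmaddr = gfn_to_hva_memslot(memslot, cur_gfn);
		if (kvm_is_error_hva(vmaddr))
			continue;

		gmap_sync_dirty_log_pmd(gmap, dirty, gaddr, vmaddr);

		if (fatal_signal_pending(current))
			return;
		cond_resched();
	}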

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/include/asm/gmap.h    |   4 +
 arch/s390/include/asm/pgtable.h |   6 +-
 arch/s390/kvm/kvm-s390.c        |  18 ++--
 arch/s390/mm/gmap.c             | 209 +++++++++++++++++++++++++++++++++++++++-
 arch/s390/mm/pgtable.c          |  57 +++++++++--
 5 files changed, 278 insertions(+), 16 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 4324b2a..26a31c3 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -15,6 +15,8 @@
 
 /* Status bits in the gmap segment entry. */
 #define _SEGMENT_ENTRY_GMAP_SPLIT	0x0001  /* split huge pmd */
+/* Status bits only for huge segment entries */
+#define _SEGMENT_ENTRY_GMAP_UC		0x4000	/* user dirty (migration) */
 
 /**
  * struct gmap_struct - guest address space
@@ -151,4 +153,6 @@ void gmap_pte_notify(struct mm_struct *, unsigned long addr, pte_t *,
 int gmap_mprotect_notify(struct gmap *, unsigned long start,
 			 unsigned long len, int prot);
 
+void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long dirty_bitmap[4],
+			     unsigned long gaddr, unsigned long vmaddr);
 #endif /* _ASM_S390_GMAP_H */
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 223791b..e8837a2 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1105,8 +1105,12 @@ void ptep_zap_unused(struct mm_struct *mm, unsigned long addr,
 void ptep_zap_key(struct mm_struct *mm, unsigned long addr, pte_t *ptep);
 int ptep_shadow_pte(struct mm_struct *mm, unsigned long saddr,
 		    pte_t *sptep, pte_t *tptep, pte_t pte);
-void ptep_unshadow_pte(struct mm_struct *mm, unsigned long saddr, pte_t *ptep);
 
+void ptep_unshadow_pte(struct mm_struct *mm, unsigned long saddr, pte_t *ptep);
+void ptep_remove_dirty_protection_split(struct mm_struct *mm, pte_t *ptep,
+					unsigned long vmaddr);
+bool test_and_clear_guest_dirty_split(struct mm_struct *mm, pmd_t *pmdp,
+				      unsigned long vmaddr);
 bool test_and_clear_guest_dirty(struct mm_struct *mm, unsigned long address);
 int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
 			  unsigned char key, bool nq);
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 1371dff..0862bf0 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -431,19 +431,23 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 }
 
 static void kvm_s390_sync_dirty_log(struct kvm *kvm,
-					struct kvm_memory_slot *memslot)
+				    struct kvm_memory_slot *memslot)
 {
 	gfn_t cur_gfn, last_gfn;
-	unsigned long address;
+	unsigned long gaddr, vmaddr;
+	unsigned long *dirty = memslot->dirty_bitmap;
 	struct gmap *gmap = kvm->arch.gmap;
 
-	/* Loop over all guest pages */
+	/* Loop over all guest segments */
 	last_gfn = memslot->base_gfn + memslot->npages;
-	for (cur_gfn = memslot->base_gfn; cur_gfn <= last_gfn; cur_gfn++) {
-		address = gfn_to_hva_memslot(memslot, cur_gfn);
+	for (cur_gfn = memslot->base_gfn; cur_gfn <= last_gfn; cur_gfn += _PAGE_ENTRIES, dirty += 4) {
+		gaddr = gfn_to_gpa(cur_gfn);
+		vmaddr = gfn_to_hva_memslot(memslot, cur_gfn);
+		if (kvm_is_error_hva(vmaddr))
+			continue;
+
+		gmap_sync_dirty_log_pmd(gmap, dirty, gaddr, vmaddr);
 
-		if (test_and_clear_guest_dirty(gmap->mm, address))
-			mark_page_dirty(kvm, cur_gfn);
 		if (fatal_signal_pending(current))
 			return;
 		cond_resched();
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 9046c91..4dafa1e 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -15,6 +15,7 @@
 #include <linux/swapops.h>
 #include <linux/ksm.h>
 #include <linux/mman.h>
+#include <linux/hugetlb.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -549,6 +550,8 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
+	pmd_t unprot;
+	pte_t *ptep;
 	int rc;
 
 	BUG_ON(gmap_is_shadow(gmap));
@@ -606,12 +609,29 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 				       vmaddr >> PMD_SHIFT, table);
 		if (!rc) {
 			if (pmd_large(*pmd)) {
-				*table = pmd_val(*pmd) &
-					_SEGMENT_ENTRY_HARDWARE_BITS_LARGE;
+				*table = (pmd_val(*pmd) &
+					  _SEGMENT_ENTRY_HARDWARE_BITS_LARGE)
+					| _SEGMENT_ENTRY_GMAP_UC;
 			} else
 				*table = pmd_val(*pmd) &
 					_SEGMENT_ENTRY_HARDWARE_BITS;
 		}
+	} else if (*table & _SEGMENT_ENTRY_PROTECT &&
+		   !(pmd_val(*pmd) & _SEGMENT_ENTRY_PROTECT)) {
+		unprot = __pmd((*table & (_SEGMENT_ENTRY_HARDWARE_BITS_LARGE
+					  & ~_SEGMENT_ENTRY_PROTECT))
+			       | _SEGMENT_ENTRY_GMAP_UC);
+		gmap_pmdp_xchg(gmap, (pmd_t *)table, unprot, gaddr);
+	} else if (gmap_pmd_is_split((pmd_t *)table)) {
+		/*
+		 * Split pmds are somewhere in-between a normal and a
+		 * large pmd. As we don't share the page table, the
+		 * host does not remove protection on a fault and we
+		 * have to do it ourselves for the guest mapping.
+		 */
+		ptep = pte_offset_map((pmd_t *)table, gaddr);
+		if (pte_val(*ptep) & _PAGE_PROTECT)
+			ptep_remove_dirty_protection_split(mm, ptep, vmaddr);
 	}
 	spin_unlock(&gmap->guest_table_lock);
 	spin_unlock(ptl);
@@ -999,6 +1019,113 @@ static int gmap_pmd_split(struct gmap *gmap, unsigned long gaddr, pmd_t *pmdp)
 	return 0;
 }
 
+/**
+ * gmap_pmdp_force_prot - change access rights of a locked pmd
+ * @mm: pointer to the process mm_struct
+ * @addr: virtual address in the guest address space
+ * @pmdp: pointer to the page table entry
+ * @prot: indicates guest access rights: PROT_NONE, PROT_READ or PROT_WRITE
+ * @bits: software bit to set (e.g. for notification)
+ *
+ * Returns 0 if the access rights were changed and -EAGAIN if the current
+ * and requested access rights are incompatible.
+ */
+static int gmap_pmdp_force_prot(struct gmap *gmap, unsigned long addr,
+				pmd_t *pmdp, int prot)
+{
+	int pmd_i = pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID;
+	int pmd_p = pmd_val(*pmdp) & _SEGMENT_ENTRY_PROTECT;
+	pmd_t new = *pmdp;
+
+	/* Fixup needed */
+	if ((pmd_i && (prot != PROT_NONE)) || (pmd_p && (prot == PROT_WRITE)))
+		return -EAGAIN;
+
+	if (prot == PROT_NONE && !pmd_i) {
+		pmd_val(new) |= _SEGMENT_ENTRY_INVALID;
+		gmap_pmdp_xchg(gmap, pmdp, new, addr);
+	}
+
+	if (prot == PROT_READ && !pmd_p) {
+		pmd_val(new) &= ~_SEGMENT_ENTRY_INVALID;
+		pmd_val(new) |= _SEGMENT_ENTRY_PROTECT;
+		gmap_pmdp_xchg(gmap, pmdp, new, addr);
+	}
+	return 0;
+}
+
+/**
+ * gmap_pmdp_transfer_prot - transfer protection of guest pmd to host pmd
+ * @mm: the memory context
+ * @address: the affected host virtual address
+ * @gpmdp: guest pmd ptr
+ * @hpmdp: host pmd ptr
+ *
+ * Transfers the protection from a guest pmd to the associated guest
+ * pmd. This has to be done with a plain idte to circumvent the gmap
+ * invalidation hooks in the standard invalidation functions provided
+ * by pgtable.c.
+ */
+static void gmap_pmdp_transfer_prot(struct mm_struct *mm, unsigned long addr,
+				    pmd_t *gpmdp, pmd_t *hpmdp)
+{
+	const int gpmd_i = pmd_val(*gpmdp) & _SEGMENT_ENTRY_INVALID;
+	const int gpmd_p = pmd_val(*gpmdp) & _SEGMENT_ENTRY_PROTECT;
+	const int hpmd_i = pmd_val(*hpmdp) & _SEGMENT_ENTRY_INVALID;
+	const int hpmd_p = pmd_val(*hpmdp) & _SEGMENT_ENTRY_PROTECT;
+	pmd_t new = *hpmdp;
+
+	/* Fastpath, change not needed. */
+	if (hpmd_i || (hpmd_p && gpmd_p) || (!gpmd_i && !gpmd_p))
+		return;
+
+	if (gpmd_p && !hpmd_p)
+		pmd_val(new) |= _SEGMENT_ENTRY_PROTECT;
+	if (!gpmd_i && !hpmd_i)
+		pmd_val(new) &= ~_SEGMENT_ENTRY_INVALID;
+
+	if (MACHINE_HAS_TLB_GUEST)
+		__pmdp_idte(addr, hpmdp,
+			    IDTE_NODAT | IDTE_GUEST_ASCE,
+			    mm->context.asce, IDTE_GLOBAL);
+	else if (MACHINE_HAS_IDTE)
+		__pmdp_idte(addr, hpmdp, 0, 0,
+			    IDTE_GLOBAL);
+	else
+		__pmdp_csp(hpmdp);
+	*hpmdp = new;
+}
+
+/*
+ * gmap_protect_pmd - set pmd notification bits
+ * @pmdp: pointer to the pmd to be protected
+ * @prot: indicates access rights: PROT_NONE, PROT_READ or PROT_WRITE
+ * @bits: notification bits to set
+ *
+ * Returns 0 if successfully protected, -ENOMEM if out of memory and
+ * -EAGAIN if a fixup is needed.
+ *
+ * Expected to be called with sg->mm->mmap_sem in read and
+ * guest_table_lock held.
+ */
+static int gmap_protect_pmd(struct gmap *gmap, unsigned long gaddr,
+			    unsigned long vmaddr, pmd_t *pmdp, pmd_t *hpmdp,
+			    int prot)
+{
+	int ret = 0;
+
+	/* Protect gmap pmd for dirty tracking. */
+	ret = gmap_pmdp_force_prot(gmap, gaddr, pmdp, prot);
+	/*
+	 * Transfer protection back to the host pmd, so userspace has
+	 * never more access rights than the VM.
+	 */
+	if (!ret)
+		gmap_pmdp_transfer_prot(gmap->mm, vmaddr, pmdp, hpmdp);
+	return ret;
+}
+
+
 /*
  * gmap_protect_pte - remove access rights to memory and set pgste bits
  * @gmap: pointer to guest mapping meta data structure
@@ -2488,6 +2615,84 @@ void gmap_pmdp_idte_global(struct mm_struct *mm, unsigned long vmaddr)
 }
 EXPORT_SYMBOL_GPL(gmap_pmdp_idte_global);
 
+/**
+ * gmap_test_and_clear_dirty_segment - test and reset segment dirty status
+ * @gmap: pointer to guest address space
+ * @pmdp: pointer to the pmd to be tested
+ * @gaddr: virtual address in the guest address space
+ *
+ * This function is assumed to be called with the guest_table_lock
+ * held.
+ */
+bool gmap_test_and_clear_dirty_segment(struct gmap *gmap, pmd_t *pmdp,
+				       pmd_t *hpmdp, unsigned long gaddr,
+				       unsigned long vmaddr)
+{
+	if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
+		return false;
+
+	/* Already protected memory, which did not change is clean */
+	if (pmd_val(*pmdp) & _SEGMENT_ENTRY_PROTECT &&
+	    !(pmd_val(*pmdp) & _SEGMENT_ENTRY_GMAP_UC))
+		return false;
+
+	/* Clear UC indication and reset protection */
+	pmd_val(*pmdp) &= ~_SEGMENT_ENTRY_GMAP_UC;
+	gmap_protect_pmd(gmap, gaddr, vmaddr, pmdp, hpmdp, PROT_READ);
+	return true;
+}
+
+/**
+ * gmap_sync_dirty_log_pmd - set bitmap based on dirty status of segment
+ * @gmap: pointer to guest address space
+ * @bitmap: dirty bitmap for this pmd
+ * @gaddr: virtual address in the guest address space
+ * @vmaddr: virtual address in the host address space
+ *
+ * This function is assumed to be called with the guest_table_lock
+ * held.
+ */
+void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long bitmap[4],
+			     unsigned long gaddr, unsigned long vmaddr)
+{
+	int i = 0;
+	pmd_t *pmdp, *hpmdp;
+	spinlock_t *ptl;
+
+	hpmdp = (pmd_t *)huge_pte_offset(gmap->mm, vmaddr, HPAGE_SIZE);
+	if (!hpmdp)
+		return;
+	ptl = pmd_lock(gmap->mm, hpmdp);
+	pmdp = gmap_pmd_op_walk(gmap, gaddr);
+	if (!pmdp) {
+		spin_unlock(ptl);
+		return;
+	}
+
+	if (pmd_large(*pmdp)) {
+		if (gmap_test_and_clear_dirty_segment(gmap, pmdp, hpmdp,
+						      gaddr, vmaddr))
+			memset(bitmap, 0xff, 32);
+	} else {
+		/* We handle this here, as it's of the records from mm. */
+		if (unlikely(gmap_pmd_is_split(pmdp))) {
+			for (; i < _PAGE_ENTRIES; i++, vmaddr += PAGE_SIZE) {
+				if (test_and_clear_guest_dirty_split(gmap->mm, pmdp, vmaddr))
+					set_bit_le(i, bitmap);
+				//TODO: protection transfer
+			}
+		} else {
+			for (; i < _PAGE_ENTRIES; i++, vmaddr += PAGE_SIZE) {
+				if (test_and_clear_guest_dirty(gmap->mm, vmaddr))
+					set_bit_le(i, bitmap);
+			}
+		}
+	}
+	gmap_pmd_op_end(gmap, pmdp);
+	spin_unlock(ptl);
+}
+EXPORT_SYMBOL_GPL(gmap_sync_dirty_log_pmd);
+
 static inline void thp_split_mm(struct mm_struct *mm)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index e690879..497fefe 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -705,6 +705,57 @@ void ptep_zap_key(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 	preempt_enable();
 }
 
+void ptep_remove_dirty_protection_split(struct mm_struct *mm,
+					pte_t *ptep, unsigned long vmaddr)
+{
+	pte_t unprot = __pte(pte_val(*ptep) & ~_PAGE_PROTECT);
+	pgste_t pgste;
+	unsigned long bits;
+
+	pgste = pgste_get_lock(ptep);
+	pgste_val(pgste) |= PGSTE_UC_BIT;
+
+	bits = pgste_val(pgste) & (PGSTE_IN_BIT | PGSTE_VSIE_BIT);
+	pgste_val(pgste) ^= bits;
+	ptep_notify_gmap(mm, vmaddr, ptep, bits);
+	ptep_ipte_global(mm, vmaddr, ptep, 0);
+
+	*ptep = unprot;
+	pgste_set_unlock(ptep, pgste);
+}
+EXPORT_SYMBOL_GPL(ptep_remove_dirty_protection_split);
+
+bool test_and_clear_guest_dirty_split(struct mm_struct *mm, pmd_t *pmdp,
+				      unsigned long vmaddr)
+{
+	bool dirty;
+	pte_t *ptep, pte;
+	pgste_t pgste;
+	unsigned long bits;
+
+	ptep = pte_offset_map(pmdp, vmaddr);
+	pgste = pgste_get_lock(ptep);
+	dirty = !!(pgste_val(pgste) & PGSTE_UC_BIT);
+	pgste_val(pgste) &= ~PGSTE_UC_BIT;
+	pte = *ptep;
+	if (dirty) {
+		bits = pgste_val(pgste) & (PGSTE_IN_BIT | PGSTE_VSIE_BIT);
+		if (bits) {
+			pgste_val(pgste) ^= bits;
+			ptep_notify_gmap(mm, vmaddr, ptep, bits);
+		}
+		ptep_ipte_global(mm, vmaddr, ptep, 0);
+		if (MACHINE_HAS_ESOP || !(pte_val(pte) & _PAGE_WRITE))
+			pte_val(pte) |= _PAGE_PROTECT;
+		else
+			pte_val(pte) |= _PAGE_INVALID;
+		*ptep = pte;
+	}
+	pgste_set_unlock(ptep, pgste);
+	return dirty;
+}
+EXPORT_SYMBOL_GPL(test_and_clear_guest_dirty_split);
+
 /*
  * Test and reset if a guest page is dirty
  */
@@ -731,12 +782,6 @@ bool test_and_clear_guest_dirty(struct mm_struct *mm, unsigned long addr)
 	pmd = pmd_alloc(mm, pud, addr);
 	if (!pmd)
 		return false;
-	/* We can't run guests backed by huge pages, but userspace can
-	 * still set them up and then try to migrate them without any
-	 * migration support.
-	 */
-	if (pmd_large(*pmd))
-		return true;
 
 	ptep = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (unlikely(!ptep))
-- 
2.7.4


* [RFC/PATCH v3 07/16] s390/mm: Make gmap_read_table EDAT1 compatible
  2018-02-09  9:34 [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (5 preceding siblings ...)
  2018-02-09  9:34 ` [RFC/PATCH v3 06/16] s390/mm: Add huge page dirty sync support Janosch Frank
@ 2018-02-09  9:34 ` Janosch Frank
  2018-02-09  9:34 ` [RFC/PATCH v3 08/16] s390/mm: Make protect_rmap " Janosch Frank
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 49+ messages in thread
From: Janosch Frank @ 2018-02-09  9:34 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

For the VSIE we shadow the GMAP/DAT tables that a guest N builds for
its N + 1 guest. This means we read the GMAP from the memory of guest
N and use the retrieved information to build a shadow version, which
is then actually used to run the nested guest.

gmap_read_table is used to retrieve the data from the guest's address
space. Unfortunately it currently has no support for reading from
guests backed by huge pages, so let's add that.

gmap_read_table now has to tell us which guest DAT level it walked down
to in order to get the information we requested. With this information
we can start shadowing without searching for the pmd or pte a second
time.
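
A minimal caller sketch: fc stays 0 when the value was read through a
4k pte and is set to 1 when it was read through a 1M segment, so a
shadowing caller could branch on it. This is illustrative only; the
helper names are hypothetical and the real consumer is added with the
shadow segment patches later in the series:

	unsigned long val;
	int fc = 0;	/* 1 if the entry was read through a 1M segment */
	int rc;

	rc = gmap_read_table(parent, ptr + vaddr.sx * 8, &val, &fc);
	if (rc)
		return rc;
	if (fc)
		prepare_segment_shadow(sg, saddr, val);	/* hypothetical helper */
	else
		prepare_page_shadow(sg, saddr, val);	/* hypothetical helper */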

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/include/asm/gmap.h |  2 +-
 arch/s390/kvm/gaccess.c      | 17 ++++++++++-------
 arch/s390/mm/gmap.c          | 34 ++++++++++++++++++++++++----------
 3 files changed, 35 insertions(+), 18 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 26a31c3..e7592e1 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -128,7 +128,7 @@ void gmap_discard(struct gmap *, unsigned long from, unsigned long to);
 void __gmap_zap(struct gmap *, unsigned long gaddr);
 void gmap_unlink(struct mm_struct *, unsigned long *table, unsigned long vmaddr);
 
-int gmap_read_table(struct gmap *gmap, unsigned long gaddr, unsigned long *val);
+int gmap_read_table(struct gmap *gmap, unsigned long gaddr, unsigned long *val, int *fc);
 
 struct gmap *gmap_shadow(struct gmap *parent, unsigned long asce,
 			 int edat_level);
diff --git a/arch/s390/kvm/gaccess.c b/arch/s390/kvm/gaccess.c
index c24bfa7..f9c5cc0 100644
--- a/arch/s390/kvm/gaccess.c
+++ b/arch/s390/kvm/gaccess.c
@@ -987,7 +987,7 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
 	union asce asce;
 	union vaddress vaddr;
 	unsigned long ptr;
-	int rc;
+	int rc, fc = 0;
 
 	*fake = 0;
 	*dat_protection = 0;
@@ -1034,7 +1034,8 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
 			rfte.val = ptr;
 			goto shadow_r2t;
 		}
-		rc = gmap_read_table(parent, ptr + vaddr.rfx * 8, &rfte.val);
+		rc = gmap_read_table(parent, ptr + vaddr.rfx * 8, &rfte.val,
+				     &fc);
 		if (rc)
 			return rc;
 		if (rfte.i)
@@ -1060,7 +1061,8 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
 			rste.val = ptr;
 			goto shadow_r3t;
 		}
-		rc = gmap_read_table(parent, ptr + vaddr.rsx * 8, &rste.val);
+		rc = gmap_read_table(parent, ptr + vaddr.rsx * 8, &rste.val,
+				     &fc);
 		if (rc)
 			return rc;
 		if (rste.i)
@@ -1087,7 +1089,8 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
 			rtte.val = ptr;
 			goto shadow_sgt;
 		}
-		rc = gmap_read_table(parent, ptr + vaddr.rtx * 8, &rtte.val);
+		rc = gmap_read_table(parent, ptr + vaddr.rtx * 8, &rtte.val,
+				     &fc);
 		if (rc)
 			return rc;
 		if (rtte.i)
@@ -1123,7 +1126,7 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
 			ste.val = ptr;
 			goto shadow_pgt;
 		}
-		rc = gmap_read_table(parent, ptr + vaddr.sx * 8, &ste.val);
+		rc = gmap_read_table(parent, ptr + vaddr.sx * 8, &ste.val, &fc);
 		if (rc)
 			return rc;
 		if (ste.i)
@@ -1170,7 +1173,7 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 	union vaddress vaddr;
 	union page_table_entry pte;
 	unsigned long pgt;
-	int dat_protection, fake;
+	int dat_protection, fake, fc;
 	int rc;
 
 	down_read(&sg->mm->mmap_sem);
@@ -1192,7 +1195,7 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 		goto shadow_page;
 	}
 	if (!rc)
-		rc = gmap_read_table(sg->parent, pgt + vaddr.px * 8, &pte.val);
+		rc = gmap_read_table(sg->parent, pgt + vaddr.px * 8, &pte.val, &fc);
 	if (!rc && pte.i)
 		rc = PGM_PAGE_TRANSLATION;
 	if (!rc && pte.z)
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 4dafa1e..5699770 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -1254,27 +1254,41 @@ EXPORT_SYMBOL_GPL(gmap_mprotect_notify);
  *
  * Called with gmap->mm->mmap_sem in read.
  */
-int gmap_read_table(struct gmap *gmap, unsigned long gaddr, unsigned long *val)
+int gmap_read_table(struct gmap *gmap, unsigned long gaddr, unsigned long *val,
+		    int *fc)
 {
 	unsigned long address, vmaddr;
 	spinlock_t *ptl;
+	pmd_t *pmdp;
 	pte_t *ptep, pte;
 	int rc;
 
 	while (1) {
 		rc = -EAGAIN;
-		ptep = gmap_pte_op_walk(gmap, gaddr, &ptl);
-		if (ptep) {
-			pte = *ptep;
-			if (pte_present(pte) && (pte_val(pte) & _PAGE_READ)) {
-				address = pte_val(pte) & PAGE_MASK;
-				address += gaddr & ~PAGE_MASK;
+		pmdp = gmap_pmd_op_walk(gmap, gaddr);
+		if (pmdp && !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {
+			if (!pmd_large(*pmdp)) {
+				ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr, &ptl);
+				if (ptep) {
+					pte = *ptep;
+					if (pte_present(pte) && (pte_val(pte) & _PAGE_READ)) {
+						address = pte_val(pte) & PAGE_MASK;
+						address += gaddr & ~PAGE_MASK;
+						*val = *(unsigned long *) address;
+						pte_val(*ptep) |= _PAGE_YOUNG;
+						/* Do *NOT* clear the _PAGE_INVALID bit! */
+						rc = 0;
+					}
+					gmap_pte_op_end(ptl);
+				}
+			} else {
+				address = pmd_val(*pmdp) & HPAGE_MASK;
+				address += gaddr & ~HPAGE_MASK;
 				*val = *(unsigned long *) address;
-				pte_val(*ptep) |= _PAGE_YOUNG;
-				/* Do *NOT* clear the _PAGE_INVALID bit! */
+				*fc = 1;
 				rc = 0;
 			}
-			gmap_pte_op_end(ptl);
+			gmap_pmd_op_end(gmap, pmdp);
 		}
 		if (!rc)
 			break;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC/PATCH v3 08/16] s390/mm: Make protect_rmap EDAT1 compatible
  2018-02-09  9:34 [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (6 preceding siblings ...)
  2018-02-09  9:34 ` [RFC/PATCH v3 07/16] s390/mm: Make gmap_read_table EDAT1 compatible Janosch Frank
@ 2018-02-09  9:34 ` Janosch Frank
  2018-02-09  9:34 ` [RFC/PATCH v3 09/16] s390/mm: Add shadow segment code Janosch Frank
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 49+ messages in thread
From: Janosch Frank @ 2018-02-09  9:34 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

When shadowing, we must make sure that any changes to the GMAP inside
guest N are directly reflected in our shadow GMAP. This is done by
write-protecting guest N's memory at the places where it stores DAT
tables for guest N + 1.

This still lacks EDAT1 support, so let's add it.
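
For orientation, here is a minimal, self-contained model of the
protection transfer this patch introduces: PROTECT/INVALID bits of a
split 4k guest entry are mirrored into the host's huge segment entry.
The constants and helper names are illustrative stand-ins, not the
real s390 definitions, and the real code additionally flushes the TLB
before touching the host entry.

	#include <stdint.h>
	#include <stdio.h>

	#define PAGE_PROTECT	0x200ULL	/* stand-in for _PAGE_PROTECT */
	#define PAGE_INVALID	0x400ULL	/* stand-in for _PAGE_INVALID */
	#define SEG_PROTECT	0x200ULL	/* stand-in for _SEGMENT_ENTRY_PROTECT */
	#define SEG_INVALID	0x020ULL	/* stand-in for _SEGMENT_ENTRY_INVALID */

	/* mirror protection/invalid state from a 4k entry into the huge entry */
	static void pte_transfer_prot(uint64_t guest_pte, uint64_t *host_pmd)
	{
		uint64_t bits = 0;

		if (guest_pte & PAGE_PROTECT)
			bits |= SEG_PROTECT;
		if (guest_pte & PAGE_INVALID)
			bits |= SEG_INVALID;
		*host_pmd |= bits;
	}

	int main(void)
	{
		uint64_t host_pmd = 0x100000;	/* fake 1 MB frame origin */

		pte_transfer_prot(PAGE_PROTECT | PAGE_INVALID, &host_pmd);
		printf("host pmd: 0x%llx\n", (unsigned long long)host_pmd);
		return 0;
	}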

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/mm/gmap.c | 97 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 77 insertions(+), 20 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 5699770..66789e2 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -1096,6 +1096,18 @@ static void gmap_pmdp_transfer_prot(struct mm_struct *mm, unsigned long addr,
 	*hpmdp = new;
 }
 
+static void gmap_pte_transfer_prot(struct mm_struct *mm, unsigned long addr,
+				   pte_t *gptep, pmd_t *hpmdp)
+{
+	pmd_t mpmd = __pmd(0);
+
+	if (pte_val(*gptep) & _PAGE_PROTECT)
+		pmd_val(mpmd) |= _SEGMENT_ENTRY_PROTECT;
+	if (pte_val(*gptep) & _PAGE_INVALID)
+		pmd_val(mpmd) |= _SEGMENT_ENTRY_INVALID;
+	gmap_pmdp_transfer_prot(mm, addr, &mpmd, hpmdp);
+}
+
 /*
  * gmap_protect_pmd - set pmd notification bits
  * @pmdp: pointer to the pmd to be protected
@@ -1141,7 +1153,8 @@ static int gmap_protect_pmd(struct gmap *gmap, unsigned long gaddr,
  * guest_table_lock held for shadow gmaps.
  */
 static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
-			    pmd_t *pmdp, int prot, unsigned long bits)
+			    unsigned long vmaddr, pmd_t *pmdp, pmd_t *hpmdp,
+			    int prot, unsigned long bits)
 {
 	int rc;
 	pte_t *ptep;
@@ -1157,6 +1170,8 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
 	/* Protect and unlock. */
 	rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, pbits);
 	gmap_pte_op_end(ptl);
+	if (!rc && gmap_pmd_is_split(pmdp))
+		gmap_pte_transfer_prot(gmap->mm, vmaddr, ptep, hpmdp);
 	return rc;
 }
 
@@ -1178,17 +1193,25 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
 static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 			      unsigned long len, int prot, unsigned long bits)
 {
+	spinlock_t *ptl;
 	unsigned long vmaddr;
-	pmd_t *pmdp;
+	pmd_t *pmdp, *hpmdp;
 	int rc;
 
 	while (len) {
 		rc = -EAGAIN;
+		vmaddr = __gmap_translate(gmap, gaddr);
+		hpmdp = (pmd_t *)huge_pte_offset(gmap->mm, vmaddr, HPAGE_SIZE);
+		if (!hpmdp)
+			BUG();
+		/* Do we need tests here? */
+		ptl = pmd_lock(gmap->mm, hpmdp);
+
 		pmdp = gmap_pmd_op_walk(gmap, gaddr);
 		if (pmdp && !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {
 			if (!pmd_large(*pmdp)) {
-				rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
-						      bits);
+				rc = gmap_protect_pte(gmap, gaddr, vmaddr,
+						      pmdp, hpmdp, prot, bits);
 				if (!rc) {
 					len -= PAGE_SIZE;
 					gaddr += PAGE_SIZE;
@@ -1200,6 +1223,7 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
 			}
 			gmap_pmd_op_end(gmap, pmdp);
 		}
+		spin_unlock(ptl);
 		if (rc && rc != -EFAULT) {
 			vmaddr = __gmap_translate(gmap, gaddr);
 			if (IS_ERR_VALUE(vmaddr))
@@ -1268,7 +1292,7 @@ int gmap_read_table(struct gmap *gmap, unsigned long gaddr, unsigned long *val,
 		pmdp = gmap_pmd_op_walk(gmap, gaddr);
 		if (pmdp && !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {
 			if (!pmd_large(*pmdp)) {
-				ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr, &ptl);
+				ptep = gmap_pte_from_pmd(gmap, pmdp, gaddr, &ptl);
 				if (ptep) {
 					pte = *ptep;
 					if (pte_present(pte) && (pte_val(pte) & _PAGE_READ)) {
@@ -1331,6 +1355,28 @@ static inline void gmap_insert_rmap(struct gmap *sg, unsigned long vmaddr,
 	}
 }
 
+static int gmap_protect_rmap_pte(struct gmap *sg, struct gmap_rmap *rmap,
+				 unsigned long paddr, unsigned long vmaddr,
+				 pmd_t *pmdp, pmd_t *hpmdp, int prot)
+{
+	int rc = 0;
+	pte_t *ptep = NULL;
+	spinlock_t *ptl = NULL;
+
+	ptep = gmap_pte_from_pmd(sg->parent, pmdp, paddr, &ptl);
+	if (unlikely(!ptep))
+		return -ENOMEM;
+
+	spin_lock(&sg->guest_table_lock);
+	rc = gmap_protect_pte(sg->parent, paddr, vmaddr, pmdp, hpmdp,
+			      prot, GMAP_NOTIFY_SHADOW);
+	if (!rc)
+		gmap_insert_rmap(sg, vmaddr, rmap);
+	spin_unlock(&sg->guest_table_lock);
+	gmap_pte_op_end(ptl);
+	return rc;
+}
+
 /**
  * gmap_protect_rmap - modify access rights to memory and create an rmap
  * @sg: pointer to the shadow guest address space structure
@@ -1348,8 +1394,8 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
 	struct gmap *parent;
 	struct gmap_rmap *rmap;
 	unsigned long vmaddr;
+	pmd_t *pmdp, *hpmdp;
 	spinlock_t *ptl;
-	pte_t *ptep;
 	int rc;
 
 	BUG_ON(!gmap_is_shadow(sg));
@@ -1358,36 +1404,47 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
 		vmaddr = __gmap_translate(parent, paddr);
 		if (IS_ERR_VALUE(vmaddr))
 			return vmaddr;
+		hpmdp = (pmd_t *)huge_pte_offset(parent->mm, vmaddr, HPAGE_SIZE);
+		ptl = pmd_lock(parent->mm, hpmdp);
 		rmap = kzalloc(sizeof(*rmap), GFP_KERNEL);
-		if (!rmap)
+		if (!rmap) {
+			spin_unlock(ptl);
 			return -ENOMEM;
+		}
 		rmap->raddr = raddr;
 		rc = radix_tree_preload(GFP_KERNEL);
 		if (rc) {
+			spin_unlock(ptl);
 			kfree(rmap);
 			return rc;
 		}
 		rc = -EAGAIN;
-		ptep = gmap_pte_op_walk(parent, paddr, &ptl);
-		if (ptep) {
-			spin_lock(&sg->guest_table_lock);
-			rc = ptep_force_prot(parent->mm, paddr, ptep, prot,
-					     GMAP_NOTIFY_SHADOW);
-			if (!rc)
-				gmap_insert_rmap(sg, vmaddr, rmap);
-			spin_unlock(&sg->guest_table_lock);
-			gmap_pte_op_end(ptl);
+		pmdp = gmap_pmd_op_walk(parent, paddr);
+		if (pmdp && !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {
+			if (!pmd_large(*pmdp)) {
+				rc = gmap_protect_rmap_pte(sg, rmap, paddr,
+							   vmaddr, pmdp, hpmdp,
+							   prot);
+				if (!rc) {
+					paddr += PAGE_SIZE;
+					len -= PAGE_SIZE;
+				}
+			} else {
+				rc = gmap_pmd_split(parent, paddr, pmdp);
+				if (!rc)
+					rc = -EFAULT;
+			}
+			gmap_pmd_op_end(parent, pmdp);
 		}
+		spin_unlock(ptl);
 		radix_tree_preload_end();
-		if (rc) {
+		if (rc)
 			kfree(rmap);
+		if (rc == -EAGAIN) {
 			rc = gmap_pte_op_fixup(parent, paddr, vmaddr, prot);
 			if (rc)
 				return rc;
-			continue;
 		}
-		paddr += PAGE_SIZE;
-		len -= PAGE_SIZE;
 	}
 	return 0;
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC/PATCH v3 09/16] s390/mm: Add shadow segment code
  2018-02-09  9:34 [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (7 preceding siblings ...)
  2018-02-09  9:34 ` [RFC/PATCH v3 08/16] s390/mm: Make protect_rmap " Janosch Frank
@ 2018-02-09  9:34 ` Janosch Frank
  2018-02-09  9:34 ` [RFC/PATCH v3 10/16] s390/mm: Add VSIE reverse fake case Janosch Frank
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 49+ messages in thread
From: Janosch Frank @ 2018-02-09  9:34 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

The VSIE code does not yet support shadowing huge-page-backed hosts.
Let's add large-to-large shadowing.
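
As a rough guide to the gmap_shadow_segment() code below, this is a
hedged, self-contained sketch of how a shadow segment entry is derived
from the parent's entry and the guest's protection. The constants are
illustrative stand-ins for the s390 segment-entry bits.

	#include <stdint.h>

	#define HPAGE_MASK_M	(~((uint64_t)(1UL << 20) - 1))	/* 1 MB frame origin */
	#define SEG_LARGE	0x400ULL			/* stand-in LARGE bit */
	#define SEG_PROTECT	0x200ULL			/* stand-in PROTECT bit */

	/* shadow entry = parent frame origin + LARGE + guest protection */
	static uint64_t make_shadow_segment(uint64_t parent_pmd,
					    uint64_t guest_pmd)
	{
		return (parent_pmd & HPAGE_MASK_M) | SEG_LARGE |
		       (guest_pmd & SEG_PROTECT);
	}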

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/include/asm/gmap.h |   7 +-
 arch/s390/kvm/gaccess.c      |  35 ++++-
 arch/s390/mm/gmap.c          | 332 ++++++++++++++++++++++++++++++++++++-------
 3 files changed, 314 insertions(+), 60 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index e7592e1..8387fdc 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -17,6 +17,7 @@
 #define _SEGMENT_ENTRY_GMAP_SPLIT	0x0001  /* split huge pmd */
 /* Status bits only for huge segment entries */
 #define _SEGMENT_ENTRY_GMAP_UC		0x4000	/* user dirty (migration) */
+#define _SEGMENT_ENTRY_GMAP_VSIE	0x8000	/* vsie bit */
 
 /**
  * struct gmap_struct - guest address space
@@ -141,9 +142,11 @@ int gmap_shadow_sgt(struct gmap *sg, unsigned long saddr, unsigned long sgt,
 		    int fake);
 int gmap_shadow_pgt(struct gmap *sg, unsigned long saddr, unsigned long pgt,
 		    int fake);
-int gmap_shadow_pgt_lookup(struct gmap *sg, unsigned long saddr,
-			   unsigned long *pgt, int *dat_protection, int *fake);
+int gmap_shadow_sgt_lookup(struct gmap *sg, unsigned long saddr,
+			   unsigned long *pgt, int *dat_protection,
+			   int *fake);
 int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte);
+int gmap_shadow_segment(struct gmap *sg, unsigned long saddr, pmd_t pmd);
 
 void gmap_register_pte_notifier(struct gmap_notifier *);
 void gmap_unregister_pte_notifier(struct gmap_notifier *);
diff --git a/arch/s390/kvm/gaccess.c b/arch/s390/kvm/gaccess.c
index f9c5cc0..045d12e 100644
--- a/arch/s390/kvm/gaccess.c
+++ b/arch/s390/kvm/gaccess.c
@@ -981,7 +981,7 @@ int kvm_s390_check_low_addr_prot_real(struct kvm_vcpu *vcpu, unsigned long gra)
  */
 static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
 				  unsigned long *pgt, int *dat_protection,
-				  int *fake)
+				  int *fake, int *lvl)
 {
 	struct gmap *parent;
 	union asce asce;
@@ -1136,6 +1136,17 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
 		if (ste.cs && asce.p)
 			return PGM_TRANSLATION_SPEC;
 		*dat_protection |= ste.fc0.p;
+
+		/* Parent is mapped by huge pages. */
+		if (fc) {
+			/* Guest is also huge, easy case. */
+			if (ste.fc && sg->edat_level >= 1) {
+				*lvl = 1;
+				*pgt = ptr;
+				return 0;
+			}
+		}
+		/* Small to small and small to huge case */
 		if (ste.fc && sg->edat_level >= 1) {
 			*fake = 1;
 			ptr = ste.fc1.sfaa * _SEGMENT_SIZE;
@@ -1172,8 +1183,9 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 {
 	union vaddress vaddr;
 	union page_table_entry pte;
+	union segment_table_entry ste;
 	unsigned long pgt;
-	int dat_protection, fake, fc;
+	int dat_protection, fake, lvl, fc;
 	int rc;
 
 	down_read(&sg->mm->mmap_sem);
@@ -1184,12 +1196,26 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 	 */
 	ipte_lock(vcpu);
 
-	rc = gmap_shadow_pgt_lookup(sg, saddr, &pgt, &dat_protection, &fake);
+	rc = gmap_shadow_sgt_lookup(sg, saddr, &pgt, &dat_protection, &fake);
 	if (rc)
 		rc = kvm_s390_shadow_tables(sg, saddr, &pgt, &dat_protection,
-					    &fake);
+					    &fake, &lvl);
 
 	vaddr.addr = saddr;
+
+	/* Shadow stopped at segment level, we map pmd to pmd */
+	if (lvl) {
+		if (!rc)
+			rc = gmap_read_table(sg->parent, pgt + vaddr.sx * 8,
+					     &ste.val, &fc);
+		if (!rc && ste.i)
+			rc = PGM_PAGE_TRANSLATION;
+		ste.fc1.p |= dat_protection;
+		if (!rc)
+			rc = gmap_shadow_segment(sg, saddr, __pmd(ste.val));
+		goto out;
+	}
+
 	if (fake) {
 		pte.val = pgt + vaddr.px * PAGE_SIZE;
 		goto shadow_page;
@@ -1204,6 +1230,7 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 	pte.p |= dat_protection;
 	if (!rc)
 		rc = gmap_shadow_page(sg, saddr, __pte(pte.val));
+out:
 	ipte_unlock(vcpu);
 	up_read(&sg->mm->mmap_sem);
 	return rc;
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 66789e2..f805ec9 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -1405,6 +1405,7 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
 		if (IS_ERR_VALUE(vmaddr))
 			return vmaddr;
 		hpmdp = (pmd_t *)huge_pte_offset(parent->mm, vmaddr, HPAGE_SIZE);
+		/* Do we need tests here? */
 		ptl = pmd_lock(parent->mm, hpmdp);
 		rmap = kzalloc(sizeof(*rmap), GFP_KERNEL);
 		if (!rmap) {
@@ -1450,6 +1451,7 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
 }
 
 #define _SHADOW_RMAP_MASK	0x7
+#define _SHADOW_RMAP_SEGMENT_LP	0x6
 #define _SHADOW_RMAP_REGION1	0x5
 #define _SHADOW_RMAP_REGION2	0x4
 #define _SHADOW_RMAP_REGION3	0x3
@@ -1557,13 +1559,16 @@ static void __gmap_unshadow_sgt(struct gmap *sg, unsigned long raddr,
 	for (i = 0; i < _CRST_ENTRIES; i++, raddr += _SEGMENT_SIZE) {
 		if (!(sgt[i] & _SEGMENT_ENTRY_ORIGIN))
 			continue;
-		pgt = (unsigned long *)(sgt[i] & _REGION_ENTRY_ORIGIN);
+
+		if (!(sgt[i] & _SEGMENT_ENTRY_LARGE)) {
+			pgt = (unsigned long *)(sgt[i] & _REGION_ENTRY_ORIGIN);
+			__gmap_unshadow_pgt(sg, raddr, pgt);
+			/* Free page table */
+			page = pfn_to_page(__pa(pgt) >> PAGE_SHIFT);
+			list_del(&page->lru);
+			page_table_free_pgste(page);
+		}
 		sgt[i] = _SEGMENT_ENTRY_EMPTY;
-		__gmap_unshadow_pgt(sg, raddr, pgt);
-		/* Free page table */
-		page = pfn_to_page(__pa(pgt) >> PAGE_SHIFT);
-		list_del(&page->lru);
-		page_table_free_pgste(page);
 	}
 }
 
@@ -2178,32 +2183,62 @@ EXPORT_SYMBOL_GPL(gmap_shadow_sgt);
  *
  * Called with sg->mm->mmap_sem in read.
  */
-int gmap_shadow_pgt_lookup(struct gmap *sg, unsigned long saddr,
+void gmap_shadow_pgt_lookup(struct gmap *sg, unsigned long *sge,
+			    unsigned long saddr, unsigned long *pgt,
+			    int *dat_protection, int *fake)
+{
+	struct page *page;
+
+	/* Shadow page tables are full pages (pte+pgste) */
+	page = pfn_to_page(*sge >> PAGE_SHIFT);
+	*pgt = page->index & ~GMAP_SHADOW_FAKE_TABLE;
+	*dat_protection = !!(*sge & _SEGMENT_ENTRY_PROTECT);
+	*fake = !!(page->index & GMAP_SHADOW_FAKE_TABLE);
+}
+EXPORT_SYMBOL_GPL(gmap_shadow_pgt_lookup);
+
+int gmap_shadow_sgt_lookup(struct gmap *sg, unsigned long saddr,
 			   unsigned long *pgt, int *dat_protection,
 			   int *fake)
 {
-	unsigned long *table;
+	unsigned long *sge, *r3e = NULL;
 	struct page *page;
-	int rc;
+	int rc = -EAGAIN;
 
 	BUG_ON(!gmap_is_shadow(sg));
 	spin_lock(&sg->guest_table_lock);
-	table = gmap_table_walk(sg, saddr, 1); /* get segment pointer */
-	if (table && !(*table & _SEGMENT_ENTRY_INVALID)) {
-		/* Shadow page tables are full pages (pte+pgste) */
-		page = pfn_to_page(*table >> PAGE_SHIFT);
-		*pgt = page->index & ~GMAP_SHADOW_FAKE_TABLE;
-		*dat_protection = !!(*table & _SEGMENT_ENTRY_PROTECT);
-		*fake = !!(page->index & GMAP_SHADOW_FAKE_TABLE);
-		rc = 0;
-	} else  {
-		rc = -EAGAIN;
+	if (sg->asce & _ASCE_TYPE_MASK) {
+		/* >2 GB guest */
+		r3e = (unsigned long *) gmap_table_walk(sg, saddr, 2);
+		if (!r3e || (*r3e & _REGION_ENTRY_INVALID))
+			goto out;
+		sge = (unsigned long *)(*r3e & _REGION_ENTRY_ORIGIN) + ((saddr & _SEGMENT_INDEX) >> _SEGMENT_SHIFT);
+	} else {
+		sge = (unsigned long *)(sg->asce & PAGE_MASK) + ((saddr & _SEGMENT_INDEX) >> _SEGMENT_SHIFT);
 	}
+	if (*sge & _SEGMENT_ENTRY_INVALID)
+		goto out;
+	rc = 0;
+	if (*sge & _SEGMENT_ENTRY_LARGE) {
+		if (r3e) {
+			page = pfn_to_page(*r3e >> PAGE_SHIFT);
+			*pgt = page->index & ~GMAP_SHADOW_FAKE_TABLE;
+			*dat_protection = !!(*r3e & _SEGMENT_ENTRY_PROTECT);
+			*fake = !!(page->index & GMAP_SHADOW_FAKE_TABLE);
+		} else {
+			*pgt = sg->orig_asce & PAGE_MASK;
+			*dat_protection = 0;
+			*fake = 0;
+		}
+	} else {
+		gmap_shadow_pgt_lookup(sg, sge, saddr, pgt,
+				       dat_protection, fake);
+	}
+out:
 	spin_unlock(&sg->guest_table_lock);
 	return rc;
-
 }
-EXPORT_SYMBOL_GPL(gmap_shadow_pgt_lookup);
+EXPORT_SYMBOL_GPL(gmap_shadow_sgt_lookup);
 
 /**
  * gmap_shadow_pgt - instantiate a shadow page table
@@ -2285,6 +2320,89 @@ int gmap_shadow_pgt(struct gmap *sg, unsigned long saddr, unsigned long pgt,
 }
 EXPORT_SYMBOL_GPL(gmap_shadow_pgt);
 
+int gmap_shadow_segment(struct gmap *sg, unsigned long saddr, pmd_t pmd)
+{
+	struct gmap *parent;
+	struct gmap_rmap *rmap;
+	unsigned long vmaddr, paddr;
+	pmd_t spmd, tpmd, *spmdp = NULL, *tpmdp;
+	int prot;
+	int rc;
+
+	BUG_ON(!gmap_is_shadow(sg));
+	parent = sg->parent;
+
+	prot = (pmd_val(pmd) & _SEGMENT_ENTRY_PROTECT) ? PROT_READ : PROT_WRITE;
+	rmap = kzalloc(sizeof(*rmap), GFP_KERNEL);
+	if (!rmap)
+		return -ENOMEM;
+	rmap->raddr = (saddr & HPAGE_MASK) | _SHADOW_RMAP_SEGMENT_LP;
+
+	while (1) {
+		paddr = pmd_val(pmd) & HPAGE_MASK;
+		vmaddr = __gmap_translate(parent, paddr);
+		if (IS_ERR_VALUE(vmaddr)) {
+			rc = vmaddr;
+			break;
+		}
+		rc = radix_tree_preload(GFP_KERNEL);
+		if (rc)
+			break;
+		rc = -EAGAIN;
+
+		/* Let's look up the parent's mapping */
+		spmdp = gmap_pmd_op_walk(parent, paddr);
+		if (spmdp) {
+			spin_lock(&sg->guest_table_lock);
+			/* Get shadow segment table pointer */
+			tpmdp = (pmd_t *) gmap_table_walk(sg, saddr, 1);
+			if (!tpmdp) {
+				spin_unlock(&sg->guest_table_lock);
+				gmap_pmd_op_end(parent, spmdp);
+				radix_tree_preload_end();
+				break;
+			}
+			/* Shadowing magic happens here. */
+			if (!(pmd_val(*tpmdp) & _SEGMENT_ENTRY_INVALID)) {
+				rc = 0;	/* already shadowed */
+				spin_unlock(&sg->guest_table_lock);
+				gmap_pmd_op_end(parent, spmdp);
+				radix_tree_preload_end();
+				break;
+			}
+			spmd = *spmdp;
+			if (!(pmd_val(spmd) & _SEGMENT_ENTRY_INVALID) &&
+			    !((pmd_val(spmd) & _SEGMENT_ENTRY_PROTECT) &&
+			      !(pmd_val(pmd) & _SEGMENT_ENTRY_PROTECT))) {
+
+				*spmdp = __pmd(pmd_val(spmd)
+					       | _SEGMENT_ENTRY_GMAP_VSIE);
+
+				/* Insert shadow ste */
+				pmd_val(tpmd) = ((pmd_val(spmd) & HPAGE_MASK) |
+						 _SEGMENT_ENTRY_LARGE |
+						 (pmd_val(pmd) & _SEGMENT_ENTRY_PROTECT));
+				*tpmdp = tpmd;
+
+				gmap_insert_rmap(sg, vmaddr, rmap);
+				rc = 0;
+			}
+			spin_unlock(&sg->guest_table_lock);
+			gmap_pmd_op_end(parent, spmdp);
+		}
+		radix_tree_preload_end();
+		if (!rc)
+			break;
+		rc = gmap_pte_op_fixup(parent, paddr, vmaddr, prot);
+		if (rc)
+			break;
+	}
+	if (rc)
+		kfree(rmap);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(gmap_shadow_segment);
+
 /**
  * gmap_shadow_page - create a shadow page mapping
  * @sg: pointer to the shadow guest address space structure
@@ -2361,6 +2479,65 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
 EXPORT_SYMBOL_GPL(gmap_shadow_page);
 
 /**
+ * gmap_unshadow_segment - remove a huge segment from a shadow segment table
+ * @sg: pointer to the shadow guest address space structure
+ * @raddr: rmap address in the shadow guest address space
+ *
+ * Called with the sg->guest_table_lock
+ */
+static void gmap_unshadow_segment(struct gmap *sg, unsigned long raddr)
+{
+	unsigned long *table;
+
+	BUG_ON(!gmap_is_shadow(sg));
+	/* We already have the lock */
+	table = gmap_table_walk(sg, raddr, 1); /* get segment table pointer */
+	if (!table || *table & _SEGMENT_ENTRY_INVALID ||
+	    !(*table & _SEGMENT_ENTRY_LARGE))
+		return;
+	gmap_call_notifier(sg, raddr, raddr + (1UL << 20) - 1);
+	gmap_pmdp_xchg(sg, (pmd_t *)table, __pmd(_SEGMENT_ENTRY_EMPTY), raddr);
+}
+
+static void gmap_shadow_notify_pmd(struct gmap *sg, unsigned long vmaddr,
+				   unsigned long gaddr)
+{
+	struct gmap_rmap *rmap, *rnext, *head;
+	unsigned long start, end, bits, raddr;
+
+
+	BUG_ON(!gmap_is_shadow(sg));
+
+	spin_lock(&sg->guest_table_lock);
+	if (sg->removed) {
+		spin_unlock(&sg->guest_table_lock);
+		return;
+	}
+	/* Check for top level table */
+	start = sg->orig_asce & _ASCE_ORIGIN;
+	end = start + ((sg->orig_asce & _ASCE_TABLE_LENGTH) + 1) * 4096;
+	if (!(sg->orig_asce & _ASCE_REAL_SPACE) && gaddr >= start &&
+	    gaddr < ((end & HPAGE_MASK) + HPAGE_SIZE - 1)) {
+		/* The complete shadow table has to go */
+		gmap_unshadow(sg);
+		spin_unlock(&sg->guest_table_lock);
+		list_del(&sg->list);
+		gmap_put(sg);
+		return;
+	}
+	/* Remove the page table tree for one specific entry */
+	head = radix_tree_delete(&sg->host_to_rmap, (vmaddr & HPAGE_MASK) >> PAGE_SHIFT);
+	gmap_for_each_rmap_safe(rmap, rnext, head) {
+		bits = rmap->raddr & _SHADOW_RMAP_MASK;
+		raddr = rmap->raddr ^ bits;
+		gmap_unshadow_segment(sg, raddr);
+		kfree(rmap);
+	}
+	spin_unlock(&sg->guest_table_lock);
+}
+
+
+/**
  * gmap_shadow_notify - handle notifications for shadow gmap
  *
  * Called with sg->parent->shadow_lock.
@@ -2512,6 +2689,85 @@ static void pmdp_notify_split(struct mm_struct *mm, unsigned long vmaddr,
 }
 
 /**
+ * pmdp_notify_gmap - call all invalidation callbacks for a specific pmd
+ * @gmap: pointer to the guest address space structure
+ * @gaddr: guest address which is affected
+ *
+ * This function is expected to be called with a locked
+ * guest_table_lock.
+ */
+static void pmdp_notify_gmap(struct gmap *gmap, unsigned long gaddr)
+{
+	unsigned long *table;
+	unsigned long vmaddr, bits = 0;
+	struct gmap *sg, *next;
+
+	gaddr &= HPAGE_MASK;
+	table = gmap_table_walk(gmap, gaddr, 1);
+	if (!table)
+		return;
+	if (pmd_large(__pmd(*table)) && (*table & _SEGMENT_ENTRY_GMAP_VSIE))
+		bits = _SEGMENT_ENTRY_GMAP_VSIE;
+	if (!bits)
+		return;
+	*table &= ~bits;
+	vmaddr = __gmap_translate(gmap, gaddr);
+	if (!list_empty(&gmap->children) && (bits & _SEGMENT_ENTRY_GMAP_VSIE)
+	    && (*table & _SEGMENT_ENTRY_PROTECT)) {
+		spin_lock(&gmap->shadow_lock);
+		list_for_each_entry_safe(sg, next,
+					 &gmap->children, list)
+			gmap_shadow_notify_pmd(sg, vmaddr, gaddr);
+		spin_unlock(&gmap->shadow_lock);
+	}
+}
+
+/**
+ * pmdp_notify - call all invalidation callbacks for a specific pmd
+ * @mm: pointer to the process mm_struct
+ * @vmaddr: virtual address in the process address space
+ *
+ * This function is expected to be called with mmap_sem held in read.
+ */
+void pmdp_notify(struct mm_struct *mm, unsigned long vmaddr)
+{
+	unsigned long *table, gaddr, bits = 0;
+	struct gmap *gmap, *sg, *next;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
+		spin_lock(&gmap->guest_table_lock);
+		table = radix_tree_lookup(&gmap->host_to_guest,
+					  vmaddr >> PMD_SHIFT);
+		if (!table) {
+			spin_unlock(&gmap->guest_table_lock);
+			continue;
+		}
+
+		if (gmap_pmd_is_split((pmd_t *)table)) {
+			pmdp_notify_split(mm, vmaddr, table);
+			spin_unlock(&gmap->guest_table_lock);
+			continue;
+		}
+
+		if (pmd_large(__pmd(*table)) && (*table & _SEGMENT_ENTRY_GMAP_VSIE))
+			bits = _SEGMENT_ENTRY_GMAP_VSIE;
+		*table &= ~bits;
+		gaddr = __gmap_segment_gaddr(table);
+		spin_unlock(&gmap->guest_table_lock);
+		if (!list_empty(&gmap->children) && (bits & _SEGMENT_ENTRY_GMAP_VSIE)) {
+			spin_lock(&gmap->shadow_lock);
+			list_for_each_entry_safe(sg, next,
+						 &gmap->children, list)
+				gmap_shadow_notify_pmd(sg, vmaddr, gaddr);
+			spin_unlock(&gmap->shadow_lock);
+		}
+	}
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(pmdp_notify);
+
+/**
  * gmap_pmdp_xchg - exchange a gmap pmd with another and notify
  * @gmap: pointer to the guest address space structure
  * @pmdp: pointer to the pmd entry
@@ -2524,6 +2780,7 @@ static void pmdp_notify_split(struct mm_struct *mm, unsigned long vmaddr,
 static void gmap_pmdp_xchg(struct gmap *gmap, pmd_t *pmdp, pmd_t new,
 			   unsigned long gaddr)
 {
+	pmdp_notify_gmap(gmap, gaddr);
 	if (MACHINE_HAS_TLB_GUEST)
 		__pmdp_idte(gaddr, (pmd_t *)pmdp,
 			    IDTE_GUEST_ASCE, gmap->asce,
@@ -2536,39 +2793,6 @@ static void gmap_pmdp_xchg(struct gmap *gmap, pmd_t *pmdp, pmd_t new,
 	*pmdp = new;
 }
 
-/**
- * pmdp_notify - call all invalidation callbacks for a specific pmd
- * @mm: pointer to the process mm_struct
- * @vmaddr: virtual address in the process address space
- *
- * This function is expected to be called with mmap_sem held in read.
- */
-void pmdp_notify(struct mm_struct *mm, unsigned long vmaddr)
-{
-	unsigned long *table, gaddr;
-	struct gmap *gmap;
-
-	rcu_read_lock();
-	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
-		spin_lock(&gmap->guest_table_lock);
-		table = radix_tree_lookup(&gmap->host_to_guest,
-					  vmaddr >> PMD_SHIFT);
-		if (!table) {
-			spin_unlock(&gmap->guest_table_lock);
-			continue;
-		}
-		gaddr = __gmap_segment_gaddr(table);
-		if (gmap_pmd_is_split((pmd_t *)table)) {
-			pmdp_notify_split(mm, vmaddr, table);
-			spin_unlock(&gmap->guest_table_lock);
-			continue;
-		}
-		spin_unlock(&gmap->guest_table_lock);
-	}
-	rcu_read_unlock();
-}
-EXPORT_SYMBOL_GPL(pmdp_notify);
-
 static void gmap_pmdp_clear(struct mm_struct *mm, unsigned long vmaddr,
 			    int purge)
 {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC/PATCH v3 10/16] s390/mm: Add VSIE reverse fake case
  2018-02-09  9:34 [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (8 preceding siblings ...)
  2018-02-09  9:34 ` [RFC/PATCH v3 09/16] s390/mm: Add shadow segment code Janosch Frank
@ 2018-02-09  9:34 ` Janosch Frank
  2018-02-09  9:34 ` [RFC/PATCH v3 11/16] s390/mm: Enable gmap huge pmd support Janosch Frank
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 49+ messages in thread
From: Janosch Frank @ 2018-02-09  9:34 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

The fake VSIE case lets us run huge VSIE guests on small hosts by
creating fake page tables. When running a small guest on a huge host,
we need to create fake tables once again.

The fake tables are needed to make sure that the VSIE guest is only
able to access the memory that its host mapped for it.
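
The core of the reverse fake case can be modelled in a few lines: a 4k
shadow PTE is computed from the parent's 1 MB frame plus the page
index, so the VSIE guest never sees more than its own 4k slice. The
following self-contained sketch uses illustrative constants, not the
real s390 masks.

	#include <stdint.h>

	#define SEG_ORIGIN_LARGE	(~((uint64_t)(1UL << 20) - 1))	/* 1 MB frame */
	#define PAGE_PROTECT		0x200ULL			/* stand-in */
	#define PTES_PER_SEGMENT	256				/* 1 MB / 4 KB */

	/* derive the fake 4k entry for gaddr from the parent's huge entry */
	static uint64_t make_fake_pte(uint64_t parent_pmd, uint64_t gaddr,
				      uint64_t guest_prot)
	{
		uint64_t idx = (gaddr >> 12) & (PTES_PER_SEGMENT - 1);

		return (parent_pmd & SEG_ORIGIN_LARGE) + (idx << 12) +
		       (guest_prot & PAGE_PROTECT);
	}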

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/include/asm/gmap.h |  2 +-
 arch/s390/kvm/gaccess.c      | 20 +++++++++---
 arch/s390/mm/gmap.c          | 76 +++++++++++++++++++-------------------------
 3 files changed, 50 insertions(+), 48 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 8387fdc..1347ff5 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -144,7 +144,7 @@ int gmap_shadow_pgt(struct gmap *sg, unsigned long saddr, unsigned long pgt,
 		    int fake);
 int gmap_shadow_sgt_lookup(struct gmap *sg, unsigned long saddr,
 			   unsigned long *pgt, int *dat_protection,
-			   int *fake);
+			   int *fake, int *lvl);
 int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte);
 int gmap_shadow_segment(struct gmap *sg, unsigned long saddr, pmd_t pmd);
 
diff --git a/arch/s390/kvm/gaccess.c b/arch/s390/kvm/gaccess.c
index 045d12e..de40d17 100644
--- a/arch/s390/kvm/gaccess.c
+++ b/arch/s390/kvm/gaccess.c
@@ -1144,10 +1144,22 @@ static int kvm_s390_shadow_tables(struct gmap *sg, unsigned long saddr,
 				*lvl = 1;
 				*pgt = ptr;
 				return 0;
+			} else {
+				/*
+				 * Reverse fake case.
+				 * We map a huge parent to a small guest, i.e.
+				 * we need fake shadow pagetables.
+				 *
+				 * We need pagetables here, because
+				 * guests not aligned on 1M could
+				 * read/write from/to the parent or
+				 * host.
+				 */
+				*lvl = 0;
 			}
 		}
 		/* Small to small and small to huge case */
-		if (ste.fc && sg->edat_level >= 1) {
+		if (!fc && ste.fc && sg->edat_level >= 1) {
 			*fake = 1;
 			ptr = ste.fc1.sfaa * _SEGMENT_SIZE;
 			ste.val = ptr;
@@ -1185,7 +1197,7 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 	union page_table_entry pte;
 	union segment_table_entry ste;
 	unsigned long pgt;
-	int dat_protection, fake, lvl, fc;
+	int dat_protection, fake, lvl = 0, fc;
 	int rc;
 
 	down_read(&sg->mm->mmap_sem);
@@ -1196,7 +1208,7 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 	 */
 	ipte_lock(vcpu);
 
-	rc = gmap_shadow_sgt_lookup(sg, saddr, &pgt, &dat_protection, &fake);
+	rc = gmap_shadow_sgt_lookup(sg, saddr, &pgt, &dat_protection, &fake, &lvl);
 	if (rc)
 		rc = kvm_s390_shadow_tables(sg, saddr, &pgt, &dat_protection,
 					    &fake, &lvl);
@@ -1204,7 +1216,7 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 	vaddr.addr = saddr;
 
 	/* Shadow stopped at segment level, we map pmd to pmd */
-	if (lvl) {
+	if (!rc && lvl) {
 		if (!rc)
 			rc = gmap_read_table(sg->parent, pgt + vaddr.sx * 8,
 					     &ste.val, &fc);
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index f805ec9..ae25f76 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -841,38 +841,6 @@ static inline unsigned long *gmap_table_walk(struct gmap *gmap,
 }
 
 /**
- * gmap_pte_op_walk - walk the gmap page table, get the page table lock
- *		      and return the pte pointer
- * @gmap: pointer to guest mapping meta data structure
- * @gaddr: virtual address in the guest address space
- * @ptl: pointer to the spinlock pointer
- *
- * Returns a pointer to the locked pte for a guest address, or NULL
- *
- * Note: Can also be called for shadow gmaps.
- */
-static pte_t *gmap_pte_op_walk(struct gmap *gmap, unsigned long gaddr,
-			       spinlock_t **ptl)
-{
-	unsigned long *table;
-
-	if (gmap_is_shadow(gmap))
-		spin_lock(&gmap->guest_table_lock);
-	/* Walk the gmap page table, lock and get pte pointer */
-	table = gmap_table_walk(gmap, gaddr, 1); /* get segment pointer */
-	if (!table || *table & _SEGMENT_ENTRY_INVALID) {
-		if (gmap_is_shadow(gmap))
-			spin_unlock(&gmap->guest_table_lock);
-		return NULL;
-	}
-	if (gmap_is_shadow(gmap)) {
-		*ptl = &gmap->guest_table_lock;
-		return pte_offset_map((pmd_t *) table, gaddr);
-	}
-	return pte_alloc_map_lock(gmap->mm, (pmd_t *) table, gaddr, ptl);
-}
-
-/**
  * gmap_pte_op_fixup - force a page in and connect the gmap page table
  * @gmap: pointer to guest mapping meta data structure
  * @gaddr: virtual address in the guest address space
@@ -1557,7 +1525,7 @@ static void __gmap_unshadow_sgt(struct gmap *sg, unsigned long raddr,
 
 	BUG_ON(!gmap_is_shadow(sg));
 	for (i = 0; i < _CRST_ENTRIES; i++, raddr += _SEGMENT_SIZE) {
-		if (!(sgt[i] & _SEGMENT_ENTRY_ORIGIN))
+		if (sgt[i] ==  _SEGMENT_ENTRY_EMPTY)
 			continue;
 
 		if (!(sgt[i] & _SEGMENT_ENTRY_LARGE)) {
@@ -2199,7 +2167,7 @@ EXPORT_SYMBOL_GPL(gmap_shadow_pgt_lookup);
 
 int gmap_shadow_sgt_lookup(struct gmap *sg, unsigned long saddr,
 			   unsigned long *pgt, int *dat_protection,
-			   int *fake)
+			   int *fake, int *lvl)
 {
 	unsigned long *sge, *r3e = NULL;
 	struct page *page;
@@ -2230,9 +2198,11 @@ int gmap_shadow_sgt_lookup(struct gmap *sg, unsigned long saddr,
 			*dat_protection = 0;
 			*fake = 0;
 		}
+		*lvl = 1;
 	} else {
 		gmap_shadow_pgt_lookup(sg, sge, saddr, pgt,
 				       dat_protection, fake);
+		*lvl = 0;
 	}
 out:
 	spin_unlock(&sg->guest_table_lock);
@@ -2421,6 +2391,7 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
 	struct gmap_rmap *rmap;
 	unsigned long vmaddr, paddr;
 	spinlock_t *ptl;
+	pmd_t *spmdp;
 	pte_t *sptep, *tptep;
 	int prot;
 	int rc;
@@ -2445,26 +2416,42 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
 		if (rc)
 			break;
 		rc = -EAGAIN;
-		sptep = gmap_pte_op_walk(parent, paddr, &ptl);
-		if (sptep) {
+		spmdp = gmap_pmd_op_walk(parent, paddr);
+		if (spmdp && !(pmd_val(*spmdp) & _SEGMENT_ENTRY_INVALID)) {
 			spin_lock(&sg->guest_table_lock);
 			/* Get page table pointer */
 			tptep = (pte_t *) gmap_table_walk(sg, saddr, 0);
 			if (!tptep) {
 				spin_unlock(&sg->guest_table_lock);
-				gmap_pte_op_end(ptl);
 				radix_tree_preload_end();
+				gmap_pmd_op_end(parent, spmdp);
 				break;
 			}
-			rc = ptep_shadow_pte(sg->mm, saddr, sptep, tptep, pte);
-			if (rc > 0) {
-				/* Success and a new mapping */
-				gmap_insert_rmap(sg, vmaddr, rmap);
+
+			if (pmd_large(*spmdp)) {
+				*tptep = __pte(((pmd_val(*spmdp) &
+						_SEGMENT_ENTRY_ORIGIN_LARGE)
+					       + (pte_index(paddr) << 12))
+					       | (pte_val(pte) & _PAGE_PROTECT));
+				pmd_val(*spmdp) |= _SEGMENT_ENTRY_GMAP_VSIE;
+				gmap_insert_rmap(sg, vmaddr & HPAGE_MASK, rmap);
 				rmap = NULL;
 				rc = 0;
+			} else {
+				sptep = gmap_pte_from_pmd(parent, spmdp, paddr, &ptl);
+				if (sptep) {
+					rc = ptep_shadow_pte(sg->mm, saddr, sptep, tptep, pte);
+					if (rc > 0) {
+						/* Success and a new mapping */
+						gmap_insert_rmap(sg, vmaddr, rmap);
+						rmap = NULL;
+						rc = 0;
+					}
+					gmap_pte_op_end(ptl);
+				}
 			}
-			gmap_pte_op_end(ptl);
 			spin_unlock(&sg->guest_table_lock);
+			gmap_pmd_op_end(parent, spmdp);
 		}
 		radix_tree_preload_end();
 		if (!rc)
@@ -2530,7 +2517,10 @@ static void gmap_shadow_notify_pmd(struct gmap *sg, unsigned long vmaddr,
 	gmap_for_each_rmap_safe(rmap, rnext, head) {
 		bits = rmap->raddr & _SHADOW_RMAP_MASK;
 		raddr = rmap->raddr ^ bits;
-		gmap_unshadow_segment(sg, raddr);
+		if (bits == _SHADOW_RMAP_SEGMENT_LP)
+			gmap_unshadow_segment(sg, raddr);
+		else
+			gmap_unshadow_page(sg, raddr);
 		kfree(rmap);
 	}
 	spin_unlock(&sg->guest_table_lock);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC/PATCH v3 11/16] s390/mm: Enable gmap huge pmd support
  2018-02-09  9:34 [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (9 preceding siblings ...)
  2018-02-09  9:34 ` [RFC/PATCH v3 10/16] s390/mm: Add VSIE reverse fake case Janosch Frank
@ 2018-02-09  9:34 ` Janosch Frank
  2018-02-09  9:34 ` [RFC/PATCH v3 12/16] s390/mm: clear huge page storage keys on enable_skey Janosch Frank
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 49+ messages in thread
From: Janosch Frank @ 2018-02-09  9:34 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

Now that we have everything in place, let's allow huge (1 MB) pmds for
gmap linking, effectively allowing hugetlbfs-backed guests. Transparent
huge pages and 2 GB huge pages are *not* supported through this change.

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/mm/gmap.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index ae25f76..ff35643 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2,8 +2,10 @@
 /*
  *  KVM guest address space mapping code
  *
- *    Copyright IBM Corp. 2007, 2016
+ *    Copyright IBM Corp. 2007, 2016, 2017
  *    Author(s): Martin Schwidefsky <schwidefsky@de.ibm.com>
+ *		 David Hildenbrand <david@redhat.com>
+ *		 Janosch Frank <frankja@linux.vnet.ibm.com>
  */
 
 #include <linux/kernel.h>
@@ -595,9 +597,6 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 		return -EFAULT;
 	pmd = pmd_offset(pud, vmaddr);
 	VM_BUG_ON(pmd_none(*pmd));
-	/* large pmds cannot yet be handled */
-	if (pmd_large(*pmd))
-		return -EFAULT;
 	/* Link gmap segment table entry location to page table. */
 	rc = radix_tree_preload(GFP_KERNEL);
 	if (rc)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC/PATCH v3 12/16] s390/mm: clear huge page storage keys on enable_skey
  2018-02-09  9:34 [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (10 preceding siblings ...)
  2018-02-09  9:34 ` [RFC/PATCH v3 11/16] s390/mm: Enable gmap huge pmd support Janosch Frank
@ 2018-02-09  9:34 ` Janosch Frank
  2018-02-09  9:34 ` [RFC/PATCH v3 13/16] s390/mm: Add huge pmd storage key handling Janosch Frank
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 49+ messages in thread
From: Janosch Frank @ 2018-02-09  9:34 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

From: Dominik Dingel <dingel@linux.vnet.ibm.com>

When a guest starts using storage keys, we trap and set a default one
for its whole valid address space. With this patch we are now able to
do that for large pages.

To speed up the storage key insertion, we use
__storage_key_init_range, which in turn will use sske_frame to set
multiple storage keys with one instruction. As it has previously only
been used for debugging, we have to get rid of the default key check
and make the key setting quiescing.
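
For illustration, here is a hedged, self-contained model of the loop
structure __storage_key_init_range uses: with EDAT1 a whole aligned
1 MB frame is keyed in one step (SSKE with the multiple-block control
in the real code), otherwise we fall back to 4 KB steps. The helpers
below are dummies that just print what they would key.

	#include <stdio.h>

	#define MB	(1UL << 20)
	#define KB4	(1UL << 12)

	static void set_key_1m(unsigned long frame) { printf("1M %#lx\n", frame); }
	static void set_key_4k(unsigned long page)  { printf("4K %#lx\n", page); }

	static void key_init_range(unsigned long start, unsigned long end,
				   int has_edat1)
	{
		while (start < end) {
			/* whole aligned 1 MB frame left? key it in one go */
			if (has_edat1 && !(start & (MB - 1)) &&
			    end - start >= MB) {
				set_key_1m(start);
				start += MB;
				continue;
			}
			set_key_4k(start);
			start += KB4;
		}
	}

	int main(void)
	{
		key_init_range(0x0ff000, 0x300000, 1);
		return 0;
	}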

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
[replaced page_set_storage_key loop with __storage_key_init_range]
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/mm/gmap.c     | 26 +++++++++++++++++++++++---
 arch/s390/mm/pageattr.c |  6 ++----
 2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index ff35643..482bbf1 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -3053,17 +3053,37 @@ EXPORT_SYMBOL_GPL(s390_enable_sie);
  * Enable storage key handling from now on and initialize the storage
  * keys with the default key.
  */
-static int __s390_enable_skey(pte_t *pte, unsigned long addr,
-			      unsigned long next, struct mm_walk *walk)
+static int __s390_enable_skey_pte(pte_t *pte, unsigned long addr,
+				  unsigned long next, struct mm_walk *walk)
 {
 	/* Clear storage key */
 	ptep_zap_key(walk->mm, addr, pte);
 	return 0;
 }
 
+static int __s390_enable_skey_hugetlb(pte_t *pte, unsigned long addr,
+				      unsigned long hmask, unsigned long next,
+				      struct mm_walk *walk)
+{
+	pmd_t *pmd = (pmd_t *)pte;
+	unsigned long start, end;
+
+	if (pmd_val(*pmd) & _SEGMENT_ENTRY_INVALID
+	    || !(pmd_val(*pmd) & _SEGMENT_ENTRY_WRITE))
+		return 0;
+
+	start = pmd_val(*pmd) & HPAGE_MASK;
+	end = start + HPAGE_SIZE - 1;
+	__storage_key_init_range(start, end);
+	return 0;
+}
+
 int s390_enable_skey(void)
 {
-	struct mm_walk walk = { .pte_entry = __s390_enable_skey };
+	struct mm_walk walk = {
+		.hugetlb_entry = __s390_enable_skey_hugetlb,
+		.pte_entry = __s390_enable_skey_pte,
+	};
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	int rc = 0;
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index c441715..f8c6faa 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -14,7 +14,7 @@
 
 static inline unsigned long sske_frame(unsigned long addr, unsigned char skey)
 {
-	asm volatile(".insn rrf,0xb22b0000,%[skey],%[addr],9,0"
+	asm volatile(".insn rrf,0xb22b0000,%[skey],%[addr],1,0"
 		     : [addr] "+a" (addr) : [skey] "d" (skey));
 	return addr;
 }
@@ -23,8 +23,6 @@ void __storage_key_init_range(unsigned long start, unsigned long end)
 {
 	unsigned long boundary, size;
 
-	if (!PAGE_DEFAULT_KEY)
-		return;
 	while (start < end) {
 		if (MACHINE_HAS_EDAT1) {
 			/* set storage keys for a 1MB frame */
@@ -37,7 +35,7 @@ void __storage_key_init_range(unsigned long start, unsigned long end)
 				continue;
 			}
 		}
-		page_set_storage_key(start, PAGE_DEFAULT_KEY, 0);
+		page_set_storage_key(start, PAGE_DEFAULT_KEY, 1);
 		start += PAGE_SIZE;
 	}
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC/PATCH v3 13/16] s390/mm: Add huge pmd storage key handling
  2018-02-09  9:34 [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (11 preceding siblings ...)
  2018-02-09  9:34 ` [RFC/PATCH v3 12/16] s390/mm: clear huge page storage keys on enable_skey Janosch Frank
@ 2018-02-09  9:34 ` Janosch Frank
  2018-02-09  9:34 ` [RFC/PATCH v3 14/16] s390/mm: hugetlb pages within a gmap can not be freed Janosch Frank
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 49+ messages in thread
From: Janosch Frank @ 2018-02-09  9:34 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

For guests with huge page mappings, storage keys have to be set
directly in hardware. There are no PGSTEs for PMDs that we could use
to retain the guest's logical view of the key.
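
Since the key now lives only in the real storage key of the backing
1 MB frame, the set/get/reset paths below boil down to computing the
absolute address inside that frame, roughly as in this self-contained
sketch (the mask is an illustrative stand-in for HPAGE_MASK):

	#include <stdint.h>

	#define HPAGE_MASK_M	(~((uint64_t)(1UL << 20) - 1))

	/* frame origin from the pmd, page offset from the request */
	static uint64_t huge_key_address(uint64_t pmd_val, uint64_t addr)
	{
		return (pmd_val & HPAGE_MASK_M) | (addr & ~HPAGE_MASK_M);
	}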

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/mm/pgtable.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 98 insertions(+), 6 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 497fefe..871fc65 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -811,12 +811,45 @@ EXPORT_SYMBOL_GPL(test_and_clear_guest_dirty);
 int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
 			  unsigned char key, bool nq)
 {
-	unsigned long keyul;
+	unsigned long keyul, address;
 	spinlock_t *ptl;
 	pgste_t old, new;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
 	pte_t *ptep;
 
-	ptep = get_locked_pte(mm, addr, &ptl);
+	pgd = pgd_offset(mm, addr);
+	p4d = p4d_alloc(mm, pgd, addr);
+	if (!p4d)
+		return -EFAULT;
+	pud = pud_alloc(mm, p4d, addr);
+	if (!pud)
+		return -EFAULT;
+	pmd = pmd_alloc(mm, pud, addr);
+	if (!pmd)
+		return -EFAULT;
+
+	ptl = pmd_lock(mm, pmd);
+	if (!pmd_present(*pmd)) {
+		spin_unlock(ptl);
+		return -EFAULT;
+	}
+	if (pmd_large(*pmd)) {
+		address = pmd_val(*pmd) & HPAGE_MASK;
+		address |= addr & ~HPAGE_MASK;
+		/*
+		 * Huge pmds need quiescing operations, they are
+		 * always mapped.
+		 */
+		page_set_storage_key(address, key, 1);
+		spin_unlock(ptl);
+		return 0;
+	}
+	spin_unlock(ptl);
+
+	ptep = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (unlikely(!ptep))
 		return -EFAULT;
 
@@ -827,7 +860,7 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
 	pgste_val(new) |= (keyul & (_PAGE_CHANGED | _PAGE_REFERENCED)) << 48;
 	pgste_val(new) |= (keyul & (_PAGE_ACC_BITS | _PAGE_FP_BIT)) << 56;
 	if (!(pte_val(*ptep) & _PAGE_INVALID)) {
-		unsigned long address, bits, skey;
+		unsigned long bits, skey;
 
 		address = pte_val(*ptep) & PAGE_MASK;
 		skey = (unsigned long) page_get_storage_key(address);
@@ -890,14 +923,43 @@ EXPORT_SYMBOL(cond_set_guest_storage_key);
 int reset_guest_reference_bit(struct mm_struct *mm, unsigned long addr)
 {
 	spinlock_t *ptl;
+	unsigned long address;
 	pgste_t old, new;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
 	pte_t *ptep;
 	int cc = 0;
 
-	ptep = get_locked_pte(mm, addr, &ptl);
-	if (unlikely(!ptep))
+	pgd = pgd_offset(mm, addr);
+	p4d = p4d_alloc(mm, pgd, addr);
+	if (!p4d)
+		return -EFAULT;
+	pud = pud_alloc(mm, p4d, addr);
+	if (!pud)
+		return -EFAULT;
+	pmd = pmd_alloc(mm, pud, addr);
+	if (!pmd)
 		return -EFAULT;
 
+	ptl = pmd_lock(mm, pmd);
+	if (!pmd_present(*pmd)) {
+		spin_unlock(ptl);
+		return -EFAULT;
+	}
+	if (pmd_large(*pmd)) {
+		address = pmd_val(*pmd) & HPAGE_MASK;
+		address |= addr & ~HPAGE_MASK;
+		cc = page_reset_referenced(address);
+		spin_unlock(ptl);
+		return cc;
+	}
+	spin_unlock(ptl);
+
+	ptep = pte_alloc_map_lock(mm, pmd, addr, &ptl);
+	if (unlikely(!ptep))
+		return -EFAULT;
 	new = old = pgste_get_lock(ptep);
 	/* Reset guest reference bit only */
 	pgste_val(new) &= ~PGSTE_GR_BIT;
@@ -922,11 +984,41 @@ EXPORT_SYMBOL(reset_guest_reference_bit);
 int get_guest_storage_key(struct mm_struct *mm, unsigned long addr,
 			  unsigned char *key)
 {
+	unsigned long address;
 	spinlock_t *ptl;
 	pgste_t pgste;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
 	pte_t *ptep;
 
-	ptep = get_locked_pte(mm, addr, &ptl);
+	pgd = pgd_offset(mm, addr);
+	p4d = p4d_alloc(mm, pgd, addr);
+	if (!p4d)
+		return -EFAULT;
+	pud = pud_alloc(mm, p4d, addr);
+	if (!pud)
+		return -EFAULT;
+	pmd = pmd_alloc(mm, pud, addr);
+	if (!pmd)
+		return -EFAULT;
+
+	ptl = pmd_lock(mm, pmd);
+	if (!pmd_present(*pmd)) {
+		spin_unlock(ptl);
+		return -EFAULT;
+	}
+	if (pmd_large(*pmd)) {
+		address = pmd_val(*pmd) & HPAGE_MASK;
+		address |= addr & ~HPAGE_MASK;
+		*key = page_get_storage_key(address);
+		spin_unlock(ptl);
+		return 0;
+	}
+	spin_unlock(ptl);
+
+	ptep = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (unlikely(!ptep))
 		return -EFAULT;
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC/PATCH v3 14/16] s390/mm: hugetlb pages within a gmap can not be freed
  2018-02-09  9:34 [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (12 preceding siblings ...)
  2018-02-09  9:34 ` [RFC/PATCH v3 13/16] s390/mm: Add huge pmd storage key handling Janosch Frank
@ 2018-02-09  9:34 ` Janosch Frank
  2018-02-09  9:34 ` [RFC/PATCH v3 15/16] KVM: s390: Add KVM HPAGE capability Janosch Frank
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 49+ messages in thread
From: Janosch Frank @ 2018-02-09  9:34 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

From: Dominik Dingel <dingel@linux.vnet.ibm.com>

Guests backed by huge pages could theoretically free unused pages via
the diagnose 10 instruction. We currently don't allow that, so we do
not have to refault the pages once they are needed again.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/mm/gmap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 482bbf1..0b81863 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -722,6 +722,9 @@ void gmap_discard(struct gmap *gmap, unsigned long from, unsigned long to)
 		vmaddr |= gaddr & ~PMD_MASK;
 		/* Find vma in the parent mm */
 		vma = find_vma(gmap->mm, vmaddr);
+		/* We do not discard pages that are backed by hugetlbfs */
+		if (vma && is_vm_hugetlb_page(vma))
+			continue;
 		size = min(to - gaddr, PMD_SIZE - (gaddr & ~PMD_MASK));
 		zap_page_range(vma, vmaddr, size);
 	}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC/PATCH v3 15/16] KVM: s390: Add KVM HPAGE capability
  2018-02-09  9:34 [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (13 preceding siblings ...)
  2018-02-09  9:34 ` [RFC/PATCH v3 14/16] s390/mm: hugetlb pages within a gmap can not be freed Janosch Frank
@ 2018-02-09  9:34 ` Janosch Frank
  2018-02-09  9:34 ` [RFC/PATCH v3 16/16] s390/mm: Add gmap lock classes Janosch Frank
  2018-02-14 14:30 ` [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement David Hildenbrand
  16 siblings, 0 replies; 49+ messages in thread
From: Janosch Frank @ 2018-02-09  9:34 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

KVM huge page backing support cannot be easily tested under
s390. Currently, testing is only possible after most of the guest has
already been set up.

To indicate that KVM has huge page backing support with 1 MB pages, we
add the KVM_CAP_S390_HPAGE capability. This does not mean that
transparent huge pages are supported.
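
A minimal userspace probe for the capability could look like the
following sketch. KVM_CHECK_EXTENSION on /dev/kvm is the standard
mechanism; the fallback define only matters when the installed uapi
headers do not yet contain the constant added by this patch.

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	#ifndef KVM_CAP_S390_HPAGE
	#define KVM_CAP_S390_HPAGE 153	/* value from this patch */
	#endif

	int main(void)
	{
		int kvm = open("/dev/kvm", O_RDWR);

		if (kvm < 0) {
			perror("open /dev/kvm");
			return 1;
		}
		if (ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_S390_HPAGE) > 0)
			printf("1 MB hugetlbfs backing supported\n");
		else
			printf("1 MB hugetlbfs backing not supported\n");
		return 0;
	}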

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 Documentation/virtual/kvm/api.txt | 10 ++++++++++
 arch/s390/kvm/kvm-s390.c          |  1 +
 include/uapi/linux/kvm.h          |  1 +
 3 files changed, 12 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index fc3ae95..2e0512e 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -4415,3 +4415,13 @@ Parameters: none
 This capability indicates if the flic device will be able to get/set the
 AIS states for migration via the KVM_DEV_FLIC_AISM_ALL attribute and allows
 to discover this without having to create a flic device.
+
+8.14 KVM_CAP_S390_HPAGE
+
+Architectures: s390
+This capability, if KVM_CHECK_EXTENSION indicates that it is
+available, means that KVM supports VMs that are memory backed through
+hugetlbfs with 1 megabyte pages.
+
+While it is generally possible to create and start such a VM without
+this support, the VM will not be functional.
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 0862bf0..171b017 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -393,6 +393,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_S390_CMMA_MIGRATION:
 	case KVM_CAP_S390_AIS:
 	case KVM_CAP_S390_AIS_MIGRATION:
+	case KVM_CAP_S390_HPAGE:
 		r = 1;
 		break;
 	case KVM_CAP_S390_MEM_OP:
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 8fb90a0..99c2175 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -934,6 +934,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_AIS_MIGRATION 150
 #define KVM_CAP_PPC_GET_CPU_CHAR 151
 #define KVM_CAP_S390_BPB 152
+#define KVM_CAP_S390_HPAGE 153
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC/PATCH v3 16/16] s390/mm: Add gmap lock classes
  2018-02-09  9:34 [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (14 preceding siblings ...)
  2018-02-09  9:34 ` [RFC/PATCH v3 15/16] KVM: s390: Add KVM HPAGE capability Janosch Frank
@ 2018-02-09  9:34 ` Janosch Frank
  2018-02-14 14:30 ` [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement David Hildenbrand
  16 siblings, 0 replies; 49+ messages in thread
From: Janosch Frank @ 2018-02-09  9:34 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

A shadow gmap and its parent gmap are locked right after each other
when doing VSIE management. Lockdep cannot differentiate between the
two lock classes without some help.
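
For reference, this is the generic lockdep pattern the patch applies,
as a schematic kernel-style fragment (not standalone; the enum names
here are placeholders, the patch itself uses GMAP_LOCK_PARENT and
GMAP_LOCK_SHADOW):

	#include <linux/spinlock.h>

	enum demo_lock_subclass {
		DEMO_LOCK_PARENT,	/* outer lock, taken first */
		DEMO_LOCK_SHADOW,	/* inner lock of the same class */
	};

	static void demo_lock_both(spinlock_t *parent, spinlock_t *shadow)
	{
		spin_lock_nested(parent, DEMO_LOCK_PARENT);
		spin_lock_nested(shadow, DEMO_LOCK_SHADOW);
		/* ... work that touches both gmaps ... */
		spin_unlock(shadow);
		spin_unlock(parent);
	}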

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/include/asm/gmap.h |  6 ++++++
 arch/s390/mm/gmap.c          | 36 ++++++++++++++++++------------------
 2 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 1347ff5..f0c44fe 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -19,6 +19,12 @@
 #define _SEGMENT_ENTRY_GMAP_UC		0x4000	/* user dirty (migration) */
 #define _SEGMENT_ENTRY_GMAP_VSIE	0x8000	/* vsie bit */
 
+
+enum gmap_lock_class {
+	GMAP_LOCK_PARENT,
+	GMAP_LOCK_SHADOW
+};
+
 /**
  * struct gmap_struct - guest address space
  * @list: list head for the mm->context gmap list
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 0b81863..8edfef5 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -1337,7 +1337,7 @@ static int gmap_protect_rmap_pte(struct gmap *sg, struct gmap_rmap *rmap,
 	if (unlikely(!ptep))
 		return -ENOMEM;
 
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	rc = gmap_protect_pte(sg->parent, paddr, vmaddr, pmdp, hpmdp,
 			      prot, GMAP_NOTIFY_SHADOW);
 	if (!rc)
@@ -1848,7 +1848,7 @@ struct gmap *gmap_shadow(struct gmap *parent, unsigned long asce,
 		/* only allow one real-space gmap shadow */
 		list_for_each_entry(sg, &parent->children, list) {
 			if (sg->orig_asce & _ASCE_REAL_SPACE) {
-				spin_lock(&sg->guest_table_lock);
+				spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 				gmap_unshadow(sg);
 				spin_unlock(&sg->guest_table_lock);
 				list_del(&sg->list);
@@ -1920,7 +1920,7 @@ int gmap_shadow_r2t(struct gmap *sg, unsigned long saddr, unsigned long r2t,
 		page->index |= GMAP_SHADOW_FAKE_TABLE;
 	s_r2t = (unsigned long *) page_to_phys(page);
 	/* Install shadow region second table */
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	table = gmap_table_walk(sg, saddr, 4); /* get region-1 pointer */
 	if (!table) {
 		rc = -EAGAIN;		/* Race with unshadow */
@@ -1953,7 +1953,7 @@ int gmap_shadow_r2t(struct gmap *sg, unsigned long saddr, unsigned long r2t,
 	offset = ((r2t & _REGION_ENTRY_OFFSET) >> 6) * PAGE_SIZE;
 	len = ((r2t & _REGION_ENTRY_LENGTH) + 1) * PAGE_SIZE - offset;
 	rc = gmap_protect_rmap(sg, raddr, origin + offset, len, PROT_READ);
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (!rc) {
 		table = gmap_table_walk(sg, saddr, 4);
 		if (!table || (*table & _REGION_ENTRY_ORIGIN) !=
@@ -2004,7 +2004,7 @@ int gmap_shadow_r3t(struct gmap *sg, unsigned long saddr, unsigned long r3t,
 		page->index |= GMAP_SHADOW_FAKE_TABLE;
 	s_r3t = (unsigned long *) page_to_phys(page);
 	/* Install shadow region second table */
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	table = gmap_table_walk(sg, saddr, 3); /* get region-2 pointer */
 	if (!table) {
 		rc = -EAGAIN;		/* Race with unshadow */
@@ -2036,7 +2036,7 @@ int gmap_shadow_r3t(struct gmap *sg, unsigned long saddr, unsigned long r3t,
 	offset = ((r3t & _REGION_ENTRY_OFFSET) >> 6) * PAGE_SIZE;
 	len = ((r3t & _REGION_ENTRY_LENGTH) + 1) * PAGE_SIZE - offset;
 	rc = gmap_protect_rmap(sg, raddr, origin + offset, len, PROT_READ);
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (!rc) {
 		table = gmap_table_walk(sg, saddr, 3);
 		if (!table || (*table & _REGION_ENTRY_ORIGIN) !=
@@ -2087,7 +2087,7 @@ int gmap_shadow_sgt(struct gmap *sg, unsigned long saddr, unsigned long sgt,
 		page->index |= GMAP_SHADOW_FAKE_TABLE;
 	s_sgt = (unsigned long *) page_to_phys(page);
 	/* Install shadow region second table */
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	table = gmap_table_walk(sg, saddr, 2); /* get region-3 pointer */
 	if (!table) {
 		rc = -EAGAIN;		/* Race with unshadow */
@@ -2120,7 +2120,7 @@ int gmap_shadow_sgt(struct gmap *sg, unsigned long saddr, unsigned long sgt,
 	offset = ((sgt & _REGION_ENTRY_OFFSET) >> 6) * PAGE_SIZE;
 	len = ((sgt & _REGION_ENTRY_LENGTH) + 1) * PAGE_SIZE - offset;
 	rc = gmap_protect_rmap(sg, raddr, origin + offset, len, PROT_READ);
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (!rc) {
 		table = gmap_table_walk(sg, saddr, 2);
 		if (!table || (*table & _REGION_ENTRY_ORIGIN) !=
@@ -2176,7 +2176,7 @@ int gmap_shadow_sgt_lookup(struct gmap *sg, unsigned long saddr,
 	int rc = -EAGAIN;
 
 	BUG_ON(!gmap_is_shadow(sg));
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (sg->asce & _ASCE_TYPE_MASK) {
 		/* >2 GB guest */
 		r3e = (unsigned long *) gmap_table_walk(sg, saddr, 2);
@@ -2243,7 +2243,7 @@ int gmap_shadow_pgt(struct gmap *sg, unsigned long saddr, unsigned long pgt,
 		page->index |= GMAP_SHADOW_FAKE_TABLE;
 	s_pgt = (unsigned long *) page_to_phys(page);
 	/* Install shadow page table */
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	table = gmap_table_walk(sg, saddr, 1); /* get segment pointer */
 	if (!table) {
 		rc = -EAGAIN;		/* Race with unshadow */
@@ -2271,7 +2271,7 @@ int gmap_shadow_pgt(struct gmap *sg, unsigned long saddr, unsigned long pgt,
 	raddr = (saddr & _SEGMENT_MASK) | _SHADOW_RMAP_SEGMENT;
 	origin = pgt & _SEGMENT_ENTRY_ORIGIN & PAGE_MASK;
 	rc = gmap_protect_rmap(sg, raddr, origin, PAGE_SIZE, PROT_READ);
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (!rc) {
 		table = gmap_table_walk(sg, saddr, 1);
 		if (!table || (*table & _SEGMENT_ENTRY_ORIGIN) !=
@@ -2325,7 +2325,7 @@ int gmap_shadow_segment(struct gmap *sg, unsigned long saddr, pmd_t pmd)
 		/* Let's look up the parent's mapping */
 		spmdp = gmap_pmd_op_walk(parent, paddr);
 		if (spmdp) {
-			spin_lock(&sg->guest_table_lock);
+			spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 			/* Get shadow segment table pointer */
 			tpmdp = (pmd_t *) gmap_table_walk(sg, saddr, 1);
 			if (!tpmdp) {
@@ -2420,7 +2420,7 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
 		rc = -EAGAIN;
 		spmdp = gmap_pmd_op_walk(parent, paddr);
 		if (spmdp && !(pmd_val(*spmdp) & _SEGMENT_ENTRY_INVALID)) {
-			spin_lock(&sg->guest_table_lock);
+			spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 			/* Get page table pointer */
 			tptep = (pte_t *) gmap_table_walk(sg, saddr, 0);
 			if (!tptep) {
@@ -2497,7 +2497,7 @@ static void gmap_shadow_notify_pmd(struct gmap *sg, unsigned long vmaddr,
 
 	BUG_ON(!gmap_is_shadow(sg));
 
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (sg->removed) {
 		spin_unlock(&sg->guest_table_lock);
 		return;
@@ -2542,7 +2542,7 @@ static void gmap_shadow_notify(struct gmap *sg, unsigned long vmaddr,
 
 	BUG_ON(!gmap_is_shadow(sg));
 
-	spin_lock(&sg->guest_table_lock);
+	spin_lock_nested(&sg->guest_table_lock, GMAP_LOCK_SHADOW);
 	if (sg->removed) {
 		spin_unlock(&sg->guest_table_lock);
 		return;
@@ -2793,7 +2793,7 @@ static void gmap_pmdp_clear(struct mm_struct *mm, unsigned long vmaddr,
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
-		spin_lock(&gmap->guest_table_lock);
+		spin_lock_nested(&gmap->guest_table_lock, GMAP_LOCK_PARENT);
 		pmdp = (pmd_t *)radix_tree_delete(&gmap->host_to_guest,
 						   vmaddr >> PMD_SHIFT);
 		if (pmdp) {
@@ -2843,7 +2843,7 @@ void gmap_pmdp_idte_local(struct mm_struct *mm, unsigned long vmaddr)
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
-		spin_lock(&gmap->guest_table_lock);
+		spin_lock_nested(&gmap->guest_table_lock, GMAP_LOCK_PARENT);
 		entry = radix_tree_delete(&gmap->host_to_guest,
 					  vmaddr >> PMD_SHIFT);
 		if (entry) {
@@ -2878,7 +2878,7 @@ void gmap_pmdp_idte_global(struct mm_struct *mm, unsigned long vmaddr)
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
-		spin_lock(&gmap->guest_table_lock);
+		spin_lock_nested(&gmap->guest_table_lock, GMAP_LOCK_PARENT);
 		entry = radix_tree_delete(&gmap->host_to_guest,
 					  vmaddr >> PMD_SHIFT);
 		if (entry) {
-- 
2.7.4
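
A note on the hunks above: spin_lock_nested() only changes the lockdep
annotation; the locking behaviour itself is unchanged. The lock class
constants are introduced elsewhere in this patch and their definition is
not part of the quoted hunks, so the sketch below only illustrates the
idea, with the names taken from the hunks and the exact form assumed:

enum gmap_lock_cls {
	GMAP_LOCK_PARENT,	/* guest_table_lock of a regular gmap */
	GMAP_LOCK_SHADOW,	/* guest_table_lock of a shadow gmap; may be
				 * taken while a parent's lock is held */
};

Distinct subclasses keep lockdep from reporting a false positive when a
parent gmap's guest_table_lock and a shadow gmap's guest_table_lock are
held at the same time.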


* Re: [RFC/PATCH v3 01/16] s390/mm: make gmap_protect_range more modular
  2018-02-09  9:34 ` [RFC/PATCH v3 01/16] s390/mm: make gmap_protect_range more modular Janosch Frank
@ 2018-02-13 14:07   ` David Hildenbrand
  0 siblings, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2018-02-13 14:07 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

> +/**
> + * gmap_pmd_op_walk - walk the gmap tables, get the guest table lock
> + *		      and return the pmd pointer
> + * @gmap: pointer to guest mapping meta data structure
> + * @gaddr: virtual address in the guest address space
> + *
> + * Returns a pointer to the pmd for a guest address, or NULL
> + */
> +static inline pmd_t *gmap_pmd_op_walk(struct gmap *gmap, unsigned long gaddr)
> +{
> +	pmd_t *pmdp;
> +
> +	BUG_ON(gmap_is_shadow(gmap));
> +	spin_lock(&gmap->guest_table_lock);
> +	pmdp = (pmd_t *) gmap_table_walk(gmap, gaddr, 1);
> +

So gmap_table_walk() now basically always has to be called with the
guest_table_lock held. We could add a BUG/WARN for that in
gmap_table_walk().
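
For illustration, such a check could look roughly like this (untested
sketch; it assumes gmap_table_walk() keeps its current signature and uses
a lockdep assertion instead of an explicit BUG/WARN):

static inline unsigned long *gmap_table_walk(struct gmap *gmap,
					     unsigned long gaddr, int level)
{
	/* Every caller is expected to hold the guest table lock by now. */
	lockdep_assert_held(&gmap->guest_table_lock);
	/* ... existing table walk unchanged ... */
}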

> +	if (!pmdp || pmd_none(*pmdp)) {
> +		spin_unlock(&gmap->guest_table_lock);
> +		return NULL;
> +	}
> +
> +	/* 4k page table entries are locked via the pte (pte_alloc_map_lock). */
> +	if (!pmd_large(*pmdp))
> +		spin_unlock(&gmap->guest_table_lock);
> +	return pmdp;
> +}
> +
> +/**
> + * gmap_pmd_op_end - release the guest_table_lock if needed
> + * @gmap: pointer to the guest mapping meta data structure
> + * @pmdp: pointer to the pmd
> + */
> +static inline void gmap_pmd_op_end(struct gmap *gmap, pmd_t *pmdp)
> +{
> +	if (pmd_large(*pmdp))
> +		spin_unlock(&gmap->guest_table_lock);
> +}
> +
> +/*
> + * gmap_protect_pte - remove access rights to memory and set pgste bits
> + * @gmap: pointer to guest mapping meta data structure
> + * @gaddr: virtual address in the guest address space
> + * @pmdp: pointer to the pmd associated with the pte
> + * @prot: indicates access rights: PROT_NONE, PROT_READ or PROT_WRITE
> + * @bits: pgste notification bits to set
> + *
> + * Returns 0 if successfully protected, -ENOMEM if out of memory and
> + * -EAGAIN if a fixup is needed.
> + *
> + * Expected to be called with sg->mm->mmap_sem in read and
> + * guest_table_lock held for shadow gmaps.
> + */

stale comment about shadow gmaps.

> +static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
> +			    pmd_t *pmdp, int prot, unsigned long bits)
> +{
> +	int rc;
> +	pte_t *ptep;
> +	spinlock_t *ptl = NULL;
> +
> +	ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr, &ptl);
> +	if (!ptep)
> +		return -ENOMEM;
> +
> +	/* Protect and unlock. */
> +	rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, bits);
> +	gmap_pte_op_end(ptl);
> +	return rc;
>  }
>  
>  /*
> @@ -896,16 +966,20 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
>  			      unsigned long len, int prot, unsigned long bits)
>  {
>  	unsigned long vmaddr;
> -	spinlock_t *ptl;
> -	pte_t *ptep;
> +	pmd_t *pmdp;
>  	int rc;
>  
>  	while (len) {
>  		rc = -EAGAIN;
> -		ptep = gmap_pte_op_walk(gmap, gaddr, &ptl);
> -		if (ptep) {
> -			rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, bits);
> -			gmap_pte_op_end(ptl);
> +		pmdp = gmap_pmd_op_walk(gmap, gaddr);
> +		if (pmdp && !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {

You could move the !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID) check into
either gmap_pmd_op_walk() or gmap_protect_pte(). That would make this
easier to read (and you already have similar checks in gmap_pmd_op_walk()).
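
Roughly, for the gmap_pmd_op_walk() variant (untested sketch, reusing only
names that already appear in the patch):

	pmdp = (pmd_t *) gmap_table_walk(gmap, gaddr, 1);
	if (!pmdp || pmd_none(*pmdp) ||
	    (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {
		spin_unlock(&gmap->guest_table_lock);
		return NULL;
	}

The caller would then only have to check for a NULL return.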

> +			rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
> +					      bits);
> +			if (!rc) {
> +				len -= PAGE_SIZE;
> +				gaddr += PAGE_SIZE;
> +			}
> +			gmap_pmd_op_end(gmap, pmdp);
>  		}
>  		if (rc) {
>  			vmaddr = __gmap_translate(gmap, gaddr);
> @@ -914,10 +988,7 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
>  			rc = gmap_pte_op_fixup(gmap, gaddr, vmaddr, prot);
>  			if (rc)
>  				return rc;
> -			continue;
>  		}
> -		gaddr += PAGE_SIZE;
> -		len -= PAGE_SIZE;
>  	}
>  	return 0;
>  }
> 

Apart from that, looks good to me.

-- 

Thanks,

David / dhildenb


* Re: [RFC/PATCH v3 02/16] s390/mm: Abstract gmap notify bit setting
  2018-02-09  9:34 ` [RFC/PATCH v3 02/16] s390/mm: Abstract gmap notify bit setting Janosch Frank
@ 2018-02-13 14:10   ` David Hildenbrand
  2018-02-13 14:31     ` Janosch Frank
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-02-13 14:10 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

On 09.02.2018 10:34, Janosch Frank wrote:
> Currently we use the software PGSTE bits PGSTE_IN_BIT and PGSTE_VSIE_BIT
> to notify before an invalidation occurs on a prefix page or a VSIE page
> respectively. Both bits only work for a PGSTE, which only exists for
> page tables.
> 
> For huge page support we also need such bits for segments (pmds) so
> let's introduce abstract GMAP_NOTIFY_* bits that will be realized into
> the respective bits when gmap DAT table entries are protected.

This comment is stale.

This patch is no longer needed but also doesn't hurt.

> 
> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> ---
>  arch/s390/include/asm/gmap.h |  4 ++++
>  arch/s390/mm/gmap.c          | 13 ++++++++-----
>  2 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
> index e07cce8..c1bc563 100644
> --- a/arch/s390/include/asm/gmap.h
> +++ b/arch/s390/include/asm/gmap.h
> @@ -9,6 +9,10 @@
>  #ifndef _ASM_S390_GMAP_H
>  #define _ASM_S390_GMAP_H
>  
> +/* Generic bits for GMAP notification on DAT table entry changes. */
> +#define GMAP_NOTIFY_SHADOW	0x2
> +#define GMAP_NOTIFY_MPROT	0x1
> +
>  /**
>   * struct gmap_struct - guest address space
>   * @list: list head for the mm->context gmap list
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 549a55f..4471cb9 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -922,7 +922,7 @@ static inline void gmap_pmd_op_end(struct gmap *gmap, pmd_t *pmdp)
>   * @gaddr: virtual address in the guest address space
>   * @pmdp: pointer to the pmd associated with the pte
>   * @prot: indicates access rights: PROT_NONE, PROT_READ or PROT_WRITE
> - * @bits: pgste notification bits to set
> + * @bits: notification bits to set
>   *
>   * Returns 0 if successfully protected, -ENOMEM if out of memory and
>   * -EAGAIN if a fixup is needed.
> @@ -936,13 +936,16 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
>  	int rc;
>  	pte_t *ptep;
>  	spinlock_t *ptl = NULL;
> +	unsigned long pbits = 0;
>  
>  	ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr, &ptl);
>  	if (!ptep)
>  		return -ENOMEM;
>  
> +	pbits |= (bits & GMAP_NOTIFY_MPROT) ? PGSTE_IN_BIT : 0;
> +	pbits |= (bits & GMAP_NOTIFY_SHADOW) ? PGSTE_VSIE_BIT : 0;
>  	/* Protect and unlock. */
> -	rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, bits);
> +	rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, pbits);
>  	gmap_pte_op_end(ptl);
>  	return rc;
>  }
> @@ -1017,7 +1020,7 @@ int gmap_mprotect_notify(struct gmap *gmap, unsigned long gaddr,
>  	if (!MACHINE_HAS_ESOP && prot == PROT_READ)
>  		return -EINVAL;
>  	down_read(&gmap->mm->mmap_sem);
> -	rc = gmap_protect_range(gmap, gaddr, len, prot, PGSTE_IN_BIT);
> +	rc = gmap_protect_range(gmap, gaddr, len, prot, GMAP_NOTIFY_MPROT);
>  	up_read(&gmap->mm->mmap_sem);
>  	return rc;
>  }
> @@ -1139,7 +1142,7 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
>  		if (ptep) {
>  			spin_lock(&sg->guest_table_lock);
>  			rc = ptep_force_prot(parent->mm, paddr, ptep, prot,
> -					     PGSTE_VSIE_BIT);
> +					     GMAP_NOTIFY_SHADOW);
>  			if (!rc)
>  				gmap_insert_rmap(sg, vmaddr, rmap);
>  			spin_unlock(&sg->guest_table_lock);
> @@ -1605,7 +1608,7 @@ struct gmap *gmap_shadow(struct gmap *parent, unsigned long asce,
>  	down_read(&parent->mm->mmap_sem);
>  	rc = gmap_protect_range(parent, asce & _ASCE_ORIGIN,
>  				((asce & _ASCE_TABLE_LENGTH) + 1) * PAGE_SIZE,
> -				PROT_READ, PGSTE_VSIE_BIT);
> +				PROT_READ, GMAP_NOTIFY_SHADOW);
>  	up_read(&parent->mm->mmap_sem);
>  	spin_lock(&parent->shadow_lock);
>  	new->initialized = true;
>

-- 

Thanks,

David / dhildenb


* Re: [RFC/PATCH v3 03/16] s390/mm: Introduce gmap_pmdp_xchg
  2018-02-09  9:34 ` [RFC/PATCH v3 03/16] s390/mm: Introduce gmap_pmdp_xchg Janosch Frank
@ 2018-02-13 14:16   ` David Hildenbrand
  2018-02-13 14:39     ` Janosch Frank
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-02-13 14:16 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

On 09.02.2018 10:34, Janosch Frank wrote:
> When changing guest pmds, we don't need to take care of the
> corresponding host pmd. This means, we don't need to flush the host
> TLB entries and we don't need to notify on all gmaps.
> 
> Let's introduce a function, that exchanges a pmd and takes care of the
> necessary flushing and notification.

stale comment, notification is handled in a different patch.

> 
> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
> ---
>  arch/s390/mm/gmap.c | 28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
> 
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 4471cb9..16c877a9 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -23,6 +23,9 @@
>  
>  #define GMAP_SHADOW_FAKE_TABLE 1ULL
>  
> +static void gmap_pmdp_xchg(struct gmap *gmap, pmd_t *old, pmd_t new,
> +			   unsigned long gaddr);
> +
>  /**
>   * gmap_alloc - allocate and initialize a guest address space
>   * @mm: pointer to the parent mm_struct
> @@ -2174,6 +2177,31 @@ void ptep_notify(struct mm_struct *mm, unsigned long vmaddr,
>  }
>  EXPORT_SYMBOL_GPL(ptep_notify);
>  
> +/**
> + * gmap_pmdp_xchg - exchange a gmap pmd with another and notify

ditto, no notification yet

> + * @gmap: pointer to the guest address space structure
> + * @pmdp: pointer to the pmd entry
> + * @new: replacement entry
> + * @gaddr: the affected guest address
> + *
> + * This function is assumed to be called with the guest_table_lock
> + * held.
> + */
> +static void gmap_pmdp_xchg(struct gmap *gmap, pmd_t *pmdp, pmd_t new,
> +			   unsigned long gaddr)
> +{
> +	if (MACHINE_HAS_TLB_GUEST)
> +		__pmdp_idte(gaddr, (pmd_t *)pmdp,
> +			    IDTE_GUEST_ASCE, gmap->asce,
> +			    IDTE_GLOBAL);

This fits into two lines

> +	if (MACHINE_HAS_IDTE)
> +		__pmdp_idte(gaddr, (pmd_t *)pmdp,
> +			    0, 0, IDTE_GLOBAL);

This fits into one line

> +	else
> +		__pmdp_csp(pmdp);
> +	*pmdp = new;
> +}
> +
>  static inline void thp_split_mm(struct mm_struct *mm)
>  {
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> 

Apart from that, looks good to me. (although very similar to
gmap_pmdp_idte_global -  can we reuse gmap_pmdp_xchg() in
gmap_pmdp_idte_global?)

-- 

Thanks,

David / dhildenb


* Re: [RFC/PATCH v3 02/16] s390/mm: Abstract gmap notify bit setting
  2018-02-13 14:10   ` David Hildenbrand
@ 2018-02-13 14:31     ` Janosch Frank
  0 siblings, 0 replies; 49+ messages in thread
From: Janosch Frank @ 2018-02-13 14:31 UTC (permalink / raw)
  To: David Hildenbrand, kvm
  Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390



On 13.02.2018 15:10, David Hildenbrand wrote:
> On 09.02.2018 10:34, Janosch Frank wrote:
>> Currently we use the software PGSTE bits PGSTE_IN_BIT and PGSTE_VSIE_BIT
>> to notify before an invalidation occurs on a prefix page or a VSIE page
>> respectively. Both bits only work for a PGSTE, which only exists for
>> page tables.
>>
>> For huge page support we also need such bits for segments (pmds) so
>> let's introduce abstract GMAP_NOTIFY_* bits that will be realized into
>> the respective bits when gmap DAT table entries are protected.
> 
> This comment is stale.

Yes, I'm currently polishing. This version gives an impression of how
the series looks after shuffling it.
Correct comments have not been a priority yet, but I take every hint I get.

> 
> This patch is no longer needed but also doesn't hurt.

I kept this patch, as I disliked the PGSTE constants being used for
notification requests. I wouldn't have any hard feelings about dropping it.



* Re: [RFC/PATCH v3 04/16] s390/mm: add gmap PMD invalidation notification
  2018-02-09  9:34 ` [RFC/PATCH v3 04/16] s390/mm: add gmap PMD invalidation notification Janosch Frank
@ 2018-02-13 14:36   ` David Hildenbrand
  2018-02-13 14:54     ` Janosch Frank
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-02-13 14:36 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

On 09.02.2018 10:34, Janosch Frank wrote:
> For later migration of huge pages we want to write-protect guest
> PMDs. While doing this, we have to make absolutely sure, that the
> guest's lowcore is always accessible when the VCPU is running. With
> PTEs, this is solved by marking the PGSTEs of the lowcore pages with
> the invalidation notification bit and kicking the guest out of the SIE
> via a notifier function if we need to invalidate such a page.
> 
> With PMDs we do not have PGSTEs or some other bits we could use in the
> host PMD. Instead we pick one of the free bits in the gmap PMD. Every
> time a host pmd will be invalidated, we will check if the respective
> gmap PMD has the bit set and in that case fire up the notifier.
> 
> In the first step we only support setting the invalidation bit, but we
> do not support restricting access of guest pmds. It will follow
> shortly.
> 
> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>

[...]

> +
> +/**
> + * gmap_pmd_split - Split a huge gmap pmd and use a page table instead
> + * @gmap: pointer to guest mapping meta data structure
> + * @gaddr: virtual address in the guest address space
> + * @pmdp: pointer to the pmd that will be split
> + *
> + * When splitting gmap pmds, we have to make the resulting page table
> + * look like it's a normal one to be able to use the common pte
> + * handling functions. Also we need to track these new tables as they
> + * aren't tracked anywhere else.
> + */
> +static int gmap_pmd_split(struct gmap *gmap, unsigned long gaddr, pmd_t *pmdp)
> +{
> +	unsigned long *table;
> +	struct page *page;
> +	pmd_t new;
> +	int i;
> +

That's interesting, because the SIE can now suddenly work on these
PGSTEs, e.g. not leading to intercepts on certain events (like setting
storage keys).

How is that intended to be handled? I assume we would somehow have to
forbid the SIE from making use of the PGSTE. But that involves clearing
certain interception controls, which might be problematic.

> +	page = page_table_alloc_pgste(gmap->mm);
> +	if (!page)
> +		return -ENOMEM;
> +	table = (unsigned long *) page_to_phys(page);
> +	for (i = 0; i < 256; i++) {
> +		table[i] = (pmd_val(*pmdp) & HPAGE_MASK) + i * PAGE_SIZE;
> +		/* pmd_large() implies pmd/pte_present() */
> +		table[i] |=  _PAGE_PRESENT | _PAGE_READ | _PAGE_WRITE;
> +		/* ptes are directly marked as dirty */
> +		table[i + PTRS_PER_PTE] |= PGSTE_UC_BIT;
> +	}
> +
> +	pmd_val(new) = ((unsigned long)table | _SEGMENT_ENTRY |
> +			(_SEGMENT_ENTRY_GMAP_SPLIT));
> +	list_add(&page->lru, &gmap->split_list);
> +	gmap_pmdp_xchg(gmap, pmdp, new, gaddr);
> +	return 0;
> +}
> +
>  /*
>   * gmap_protect_pte - remove access rights to memory and set pgste bits
>   * @gmap: pointer to guest mapping meta data structure
> @@ -941,7 +1021,7 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
>  	spinlock_t *ptl = NULL;
>  	unsigned long pbits = 0;
>  
> -	ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr, &ptl);
> +	ptep = gmap_pte_from_pmd(gmap, pmdp, gaddr, &ptl);
>  	if (!ptep)
>  		return -ENOMEM;
>  
> @@ -979,15 +1059,21 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
>  		rc = -EAGAIN;
>  		pmdp = gmap_pmd_op_walk(gmap, gaddr);
>  		if (pmdp && !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {
> -			rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
> -					      bits);
> -			if (!rc) {
> -				len -= PAGE_SIZE;
> -				gaddr += PAGE_SIZE;
> +			if (!pmd_large(*pmdp)) {
> +				rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
> +						      bits);
> +				if (!rc) {
> +					len -= PAGE_SIZE;
> +					gaddr += PAGE_SIZE;
> +				}
> +			} else {
> +				rc = gmap_pmd_split(gmap, gaddr, pmdp);
> +				if (!rc)
> +					rc = -EFAULT;

Not sure if -EFAULT is the right thing to do here. Actually this would
be a much better use case for -EAGAIN.

(or simply avoid this by doing a gmap_pmd_op_end() + continue;)

>  			}
>  			gmap_pmd_op_end(gmap, pmdp);
>  		}
> -		if (rc) {
> +		if (rc && rc != -EFAULT) {
>  			vmaddr = __gmap_translate(gmap, gaddr);
>  			if (IS_ERR_VALUE(vmaddr))
>  				return vmaddr;
> @@ -2133,6 +2219,39 @@ static void gmap_shadow_notify(struct gmap *sg, unsigned long vmaddr,
>  	spin_unlock(&sg->guest_table_lock);
>  }
>  
> +/*
> + * ptep_notify_gmap - call all invalidation callbacks for a specific pte of a gmap
> + * @mm: pointer to the process mm_struct
> + * @addr: virtual address in the process address space
> + * @pte: pointer to the page table entry
> + * @bits: bits from the pgste that caused the notify call
> + *
> + * This function is assumed to be called with the guest_table_lock held.
> + */
> +void ptep_notify_gmap(struct mm_struct *mm, unsigned long vmaddr,
> +		      pte_t *pte, unsigned long bits)
> +{
> +	unsigned long offset, gaddr = 0;
> +	unsigned long *table;
> +	struct gmap *gmap;
> +
> +	offset = ((unsigned long) pte) & (255 * sizeof(pte_t));
> +	offset = offset * (4096 / sizeof(pte_t));
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(gmap, &mm->context.gmap_list, list) {
> +		table = radix_tree_lookup(&gmap->host_to_guest,
> +					  vmaddr >> PMD_SHIFT);
> +		if (table)
> +			gaddr = __gmap_segment_gaddr(table) + offset;
> +		else
> +			continue;
> +
> +		if (bits & PGSTE_IN_BIT)
> +			gmap_call_notifier(gmap, gaddr, gaddr + PAGE_SIZE - 1);
> +	}
> +	rcu_read_unlock();
> +}
> +
>  /**
>   * ptep_notify - call all invalidation callbacks for a specific pte.
>   * @mm: pointer to the process mm_struct
> @@ -2177,6 +2296,23 @@ void ptep_notify(struct mm_struct *mm, unsigned long vmaddr,
>  }
>  EXPORT_SYMBOL_GPL(ptep_notify);
>  
> +static void pmdp_notify_split(struct mm_struct *mm, unsigned long vmaddr,
> +			      unsigned long *table)
> +{
> +	int i = 0;
> +	unsigned long bits;
> +	unsigned long *ptep = (unsigned long *)(*table & PAGE_MASK);
> +	unsigned long *pgste = ptep + PTRS_PER_PTE;
> +
> +	for (; i < 256; i++, vmaddr += PAGE_SIZE, ptep++, pgste++) {
> +		bits = *pgste & (PGSTE_IN_BIT | PGSTE_VSIE_BIT);
> +		if (bits) {
> +			*pgste ^= bits;
> +			ptep_notify_gmap(mm, vmaddr, (pte_t *)ptep, bits);

All existing callers of ptep_notify_gmap() invoke it with the pgste
lock held. I guess we don't need that here, as we always take the
guest_table_lock when working with split PMDs.

Complicated stuff :)

> +		}
> +	}
> +}


-- 

Thanks,

David / dhildenb


* Re: [RFC/PATCH v3 03/16] s390/mm: Introduce gmap_pmdp_xchg
  2018-02-13 14:16   ` David Hildenbrand
@ 2018-02-13 14:39     ` Janosch Frank
  0 siblings, 0 replies; 49+ messages in thread
From: Janosch Frank @ 2018-02-13 14:39 UTC (permalink / raw)
  To: David Hildenbrand, kvm
  Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390



On 13.02.2018 15:16, David Hildenbrand wrote:
> On 09.02.2018 10:34, Janosch Frank wrote:
>> When changing guest pmds, we don't need to take care of the
>> corresponding host pmd. This means, we don't need to flush the host
>> TLB entries and we don't need to notify on all gmaps.
>>
>> Let's introduce a function, that exchanges a pmd and takes care of the
>> necessary flushing and notification.
> 
> stale comment, notification is handled in a different patch.
> 
>>
>> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
>> ---
[...]
> This fits into one line

Thanks!
Fixed for all of them.

> 
>> +	else
>> +		__pmdp_csp(pmdp);
>> +	*pmdp = new;
>> +}
>> +
>>  static inline void thp_split_mm(struct mm_struct *mm)
>>  {
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>
> 
> Apart from that, looks good to me. (although very similar to
> gmap_pmdp_idte_global -  can we reuse gmap_pmdp_xchg() in
> gmap_pmdp_idte_global?)

Hrm, we'd need to disable notifications and do the split_free only for
the idte case. I'm not convinced that we gain a lot by doing that.



* Re: [RFC/PATCH v3 04/16] s390/mm: add gmap PMD invalidation notification
  2018-02-13 14:36   ` David Hildenbrand
@ 2018-02-13 14:54     ` Janosch Frank
  2018-02-13 14:59       ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Janosch Frank @ 2018-02-13 14:54 UTC (permalink / raw)
  To: David Hildenbrand, kvm
  Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390



On 13.02.2018 15:36, David Hildenbrand wrote:
> On 09.02.2018 10:34, Janosch Frank wrote:
>> For later migration of huge pages we want to write-protect guest
>> PMDs. While doing this, we have to make absolutely sure, that the
>> guest's lowcore is always accessible when the VCPU is running. With
>> PTEs, this is solved by marking the PGSTEs of the lowcore pages with
>> the invalidation notification bit and kicking the guest out of the SIE
>> via a notifier function if we need to invalidate such a page.
>>
>> With PMDs we do not have PGSTEs or some other bits we could use in the
>> host PMD. Instead we pick one of the free bits in the gmap PMD. Every
>> time a host pmd will be invalidated, we will check if the respective
>> gmap PMD has the bit set and in that case fire up the notifier.
>>
>> In the first step we only support setting the invalidation bit, but we
>> do not support restricting access of guest pmds. It will follow
>> shortly.
>>
>> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
> 
> [...]
> 
>> +
>> +/**
>> + * gmap_pmd_split - Split a huge gmap pmd and use a page table instead
>> + * @gmap: pointer to guest mapping meta data structure
>> + * @gaddr: virtual address in the guest address space
>> + * @pmdp: pointer to the pmd that will be split
>> + *
>> + * When splitting gmap pmds, we have to make the resulting page table
>> + * look like it's a normal one to be able to use the common pte
>> + * handling functions. Also we need to track these new tables as they
>> + * aren't tracked anywhere else.
>> + */
>> +static int gmap_pmd_split(struct gmap *gmap, unsigned long gaddr, pmd_t *pmdp)
>> +{
>> +	unsigned long *table;
>> +	struct page *page;
>> +	pmd_t new;
>> +	int i;
>> +
> 
> That's interesting, because the SIE can now suddenly work on these
> PGSTEs, e.g. not leading to intercepts on certain events (like setting
> storage keys).
> 
> How is that intended to be handled? I assume we would somehow have to
> forbid the SIE from making use of the PGSTE. But that involves clearing
> certain interception controls, which might be problematic.

Well, cmma is disabled, and storage keys should only be a problem when
the pte is invalid without the pgste lock, which should never be the
case for split pmds.

> 
>> +	page = page_table_alloc_pgste(gmap->mm);
>> +	if (!page)
>> +		return -ENOMEM;
>> +	table = (unsigned long *) page_to_phys(page);
>> +	for (i = 0; i < 256; i++) {
>> +		table[i] = (pmd_val(*pmdp) & HPAGE_MASK) + i * PAGE_SIZE;
>> +		/* pmd_large() implies pmd/pte_present() */
>> +		table[i] |=  _PAGE_PRESENT | _PAGE_READ | _PAGE_WRITE;
>> +		/* ptes are directly marked as dirty */
>> +		table[i + PTRS_PER_PTE] |= PGSTE_UC_BIT;
>> +	}
>> +
>> +	pmd_val(new) = ((unsigned long)table | _SEGMENT_ENTRY |
>> +			(_SEGMENT_ENTRY_GMAP_SPLIT));
>> +	list_add(&page->lru, &gmap->split_list);
>> +	gmap_pmdp_xchg(gmap, pmdp, new, gaddr);
>> +	return 0;
>> +}
>> +
>>  /*
>>   * gmap_protect_pte - remove access rights to memory and set pgste bits
>>   * @gmap: pointer to guest mapping meta data structure
>> @@ -941,7 +1021,7 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
>>  	spinlock_t *ptl = NULL;
>>  	unsigned long pbits = 0;
>>  
>> -	ptep = pte_alloc_map_lock(gmap->mm, pmdp, gaddr, &ptl);
>> +	ptep = gmap_pte_from_pmd(gmap, pmdp, gaddr, &ptl);
>>  	if (!ptep)
>>  		return -ENOMEM;
>>  
>> @@ -979,15 +1059,21 @@ static int gmap_protect_range(struct gmap *gmap, unsigned long gaddr,
>>  		rc = -EAGAIN;
>>  		pmdp = gmap_pmd_op_walk(gmap, gaddr);
>>  		if (pmdp && !(pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)) {
>> -			rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
>> -					      bits);
>> -			if (!rc) {
>> -				len -= PAGE_SIZE;
>> -				gaddr += PAGE_SIZE;
>> +			if (!pmd_large(*pmdp)) {
>> +				rc = gmap_protect_pte(gmap, gaddr, pmdp, prot,
>> +						      bits);
>> +				if (!rc) {
>> +					len -= PAGE_SIZE;
>> +					gaddr += PAGE_SIZE;
>> +				}
>> +			} else {
>> +				rc = gmap_pmd_split(gmap, gaddr, pmdp);
>> +				if (!rc)
>> +					rc = -EFAULT;
> 
> Not sure if -EFAULT is the right thing to do here. Actually this would
> be a much better use case for -EAGAIN.
> 
> (or simply avoid this by doing an gmap_pmd_op_end() + continue;)

That is a leftover from the dark days of pmd protecting/notifying.
Will fix.

> 
>>  			}
>>  			gmap_pmd_op_end(gmap, pmdp);
>>  		}
>> -		if (rc) {
>> +		if (rc && rc != -EFAULT) {
>>  			vmaddr = __gmap_translate(gmap, gaddr);
>>  			if (IS_ERR_VALUE(vmaddr))
>>  				return vmaddr;
>> @@ -2133,6 +2219,39 @@ static void gmap_shadow_notify(struct gmap *sg, unsigned long vmaddr,
>>  	spin_unlock(&sg->guest_table_lock);
>>  }

>>  /**
>>   * ptep_notify - call all invalidation callbacks for a specific pte.
>>   * @mm: pointer to the process mm_struct
>> @@ -2177,6 +2296,23 @@ void ptep_notify(struct mm_struct *mm, unsigned long vmaddr,
>>  }
>>  EXPORT_SYMBOL_GPL(ptep_notify);
>>  
>> +static void pmdp_notify_split(struct mm_struct *mm, unsigned long vmaddr,
>> +			      unsigned long *table)
>> +{
>> +	int i = 0;
>> +	unsigned long bits;
>> +	unsigned long *ptep = (unsigned long *)(*table & PAGE_MASK);
>> +	unsigned long *pgste = ptep + PTRS_PER_PTE;
>> +
>> +	for (; i < 256; i++, vmaddr += PAGE_SIZE, ptep++, pgste++) {
>> +		bits = *pgste & (PGSTE_IN_BIT | PGSTE_VSIE_BIT);
>> +		if (bits) {
>> +			*pgste ^= bits;
>> +			ptep_notify_gmap(mm, vmaddr, (pte_t *)ptep, bits);
> 
> All existing callers of ptep_notify_gmap() are called with the pgste
> lock being held. I guess we don't need that here as we always take the
> guest_table_lock when working with split PMDs.

That's my understanding

> 
> Complicated stuff :)

Yes, and this is also rather horrible performance-wise, as we do another
run through the gmap list in ptep_notify_gmap. We need to do that because
we don't get the gmap in ptep_remove_dirty_protection_split and
test_and_clear_guest_dirty_split.

The reason is that we need the pgste lock for our dirty changes, which we
can only do in pgtable.c for a more or less proper separation of mm and
gmap functions. I've yet to find a nice and clean solution to that.

> 
>> +		}
>> +	}
>> +}
> 
> 




* Re: [RFC/PATCH v3 04/16] s390/mm: add gmap PMD invalidation notification
  2018-02-13 14:54     ` Janosch Frank
@ 2018-02-13 14:59       ` David Hildenbrand
  2018-02-13 15:33         ` Janosch Frank
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-02-13 14:59 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

On 13.02.2018 15:54, Janosch Frank wrote:
> On 13.02.2018 15:36, David Hildenbrand wrote:
>> On 09.02.2018 10:34, Janosch Frank wrote:
>>> For later migration of huge pages we want to write-protect guest
>>> PMDs. While doing this, we have to make absolutely sure, that the
>>> guest's lowcore is always accessible when the VCPU is running. With
>>> PTEs, this is solved by marking the PGSTEs of the lowcore pages with
>>> the invalidation notification bit and kicking the guest out of the SIE
>>> via a notifier function if we need to invalidate such a page.
>>>
>>> With PMDs we do not have PGSTEs or some other bits we could use in the
>>> host PMD. Instead we pick one of the free bits in the gmap PMD. Every
>>> time a host pmd will be invalidated, we will check if the respective
>>> gmap PMD has the bit set and in that case fire up the notifier.
>>>
>>> In the first step we only support setting the invalidation bit, but we
>>> do not support restricting access of guest pmds. It will follow
>>> shortly.
>>>
>>> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
>>
>> [...]
>>
>>> +
>>> +/**
>>> + * gmap_pmd_split - Split a huge gmap pmd and use a page table instead
>>> + * @gmap: pointer to guest mapping meta data structure
>>> + * @gaddr: virtual address in the guest address space
>>> + * @pmdp: pointer to the pmd that will be split
>>> + *
>>> + * When splitting gmap pmds, we have to make the resulting page table
>>> + * look like it's a normal one to be able to use the common pte
>>> + * handling functions. Also we need to track these new tables as they
>>> + * aren't tracked anywhere else.
>>> + */
>>> +static int gmap_pmd_split(struct gmap *gmap, unsigned long gaddr, pmd_t *pmdp)
>>> +{
>>> +	unsigned long *table;
>>> +	struct page *page;
>>> +	pmd_t new;
>>> +	int i;
>>> +
>>
>> That's interesting, because the SIE can now suddenly work on these
>> PGSTEs, e.g. not leading to intercepts on certain events (like setting
>> storage keys).
>>
>> How is that intended to be handled? I assume we would somehow have to
>> forbid the SIE from making use of the PGSTE. But that involves clearing
>> certain interception controls, which might be problematic.
> 
> Well, cmma is disabled and storage keys should only be a problem, when
> the pte is invalid without the pgste lock, which should never be the
> case for split pmds.
> 

Are you sure? Because the SIE would suddenly start working on guest
storage keys stored in the PGSTE, if I am not mistaken? So I would
assume that there would have to be some kind of a sync.

But I don't have any documentation at hand, so i can't tell :)

-- 

Thanks,

David / dhildenb


* Re: [RFC/PATCH v3 04/16] s390/mm: add gmap PMD invalidation notification
  2018-02-13 14:59       ` David Hildenbrand
@ 2018-02-13 15:33         ` Janosch Frank
  2018-02-14 10:42           ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Janosch Frank @ 2018-02-13 15:33 UTC (permalink / raw)
  To: David Hildenbrand, kvm
  Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390



On 13.02.2018 15:59, David Hildenbrand wrote:
> On 13.02.2018 15:54, Janosch Frank wrote:
>> On 13.02.2018 15:36, David Hildenbrand wrote:
>>> On 09.02.2018 10:34, Janosch Frank wrote:
>>>> For later migration of huge pages we want to write-protect guest
>>>> PMDs. While doing this, we have to make absolutely sure, that the
>>>> guest's lowcore is always accessible when the VCPU is running. With
>>>> PTEs, this is solved by marking the PGSTEs of the lowcore pages with
>>>> the invalidation notification bit and kicking the guest out of the SIE
>>>> via a notifier function if we need to invalidate such a page.
>>>>
>>>> With PMDs we do not have PGSTEs or some other bits we could use in the
>>>> host PMD. Instead we pick one of the free bits in the gmap PMD. Every
>>>> time a host pmd will be invalidated, we will check if the respective
>>>> gmap PMD has the bit set and in that case fire up the notifier.
>>>>
>>>> In the first step we only support setting the invalidation bit, but we
>>>> do not support restricting access of guest pmds. It will follow
>>>> shortly.
>>>>
>>>> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
>>>
>>> [...]
>>>
>>>> +
>>>> +/**
>>>> + * gmap_pmd_split - Split a huge gmap pmd and use a page table instead
>>>> + * @gmap: pointer to guest mapping meta data structure
>>>> + * @gaddr: virtual address in the guest address space
>>>> + * @pmdp: pointer to the pmd that will be split
>>>> + *
>>>> + * When splitting gmap pmds, we have to make the resulting page table
>>>> + * look like it's a normal one to be able to use the common pte
>>>> + * handling functions. Also we need to track these new tables as they
>>>> + * aren't tracked anywhere else.
>>>> + */
>>>> +static int gmap_pmd_split(struct gmap *gmap, unsigned long gaddr, pmd_t *pmdp)
>>>> +{
>>>> +	unsigned long *table;
>>>> +	struct page *page;
>>>> +	pmd_t new;
>>>> +	int i;
>>>> +
>>>
>>> That's interesting, because the SIE can now suddenly work on these
>>> PGSTEs, e.g. not leading to intercepts on certain events (like setting
>>> storage keys).
>>>
>>> How is that intended to be handled? I assume we would somehow have to
>>> forbid the SIE from making use of the PGSTE. But that involves clearing
>>> certain interception controls, which might be problematic.
>>
>> Well, cmma is disabled and storage keys should only be a problem, when
>> the pte is invalid without the pgste lock, which should never be the
>> case for split pmds.
>>
> 
> Are you sure? Because the SIE would suddenly stark working on guest
> storage keys stored in the PGSTE if I am not mistaking? So I would
> assume that there would have to be some kind of a sync.
> 
> But I don't have any documentation at hand, so i can't tell :)
> 

The pgste lock is that sync and as the gmap is the only way to get to
the pte, we don't have any ptes invalid without the lock. Also
set_guest_storage_keys will find a (userspace) pmd and do a hardware
sske, like it is supposed to.





* Re: [RFC/PATCH v3 04/16] s390/mm: add gmap PMD invalidation notification
  2018-02-13 15:33         ` Janosch Frank
@ 2018-02-14 10:42           ` David Hildenbrand
  2018-02-14 11:19             ` Janosch Frank
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-02-14 10:42 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390


>>>> That's interesting, because the SIE can now suddenly work on these
>>>> PGSTEs, e.g. not leading to intercepts on certain events (like setting
>>>> storage keys).
>>>>
>>>> How is that intended to be handled? I assume we would somehow have to
>>>> forbid the SIE from making use of the PGSTE. But that involves clearing
>>>> certain interception controls, which might be problematic.
>>>
>>> Well, cmma is disabled and storage keys should only be a problem, when
>>> the pte is invalid without the pgste lock, which should never be the
>>> case for split pmds.
>>>
>>
>> Are you sure? Because the SIE would suddenly stark working on guest
>> storage keys stored in the PGSTE if I am not mistaking? So I would
>> assume that there would have to be some kind of a sync.
>>
>> But I don't have any documentation at hand, so i can't tell :)
>>
> 
> The pgste lock is that sync and as the gmap is the only way to get to
> the pte, we don't have any ptes invalid without the lock. Also
> set_guest_storage_keys will find a (userspace) pmd and do a hardware
> sske, like it is supposed to.

What happens according to the documentation in the following cases:

The HW has the storage-key facility enabled and a SKEY operation (ISKE,
RRBE, SSKE) hits a huge page:

a) Generates an intercept
b) Uses the real machine storage key (as there are no pgste)

-- 

Thanks,

David / dhildenb


* Re: [RFC/PATCH v3 04/16] s390/mm: add gmap PMD invalidation notification
  2018-02-14 10:42           ` David Hildenbrand
@ 2018-02-14 11:19             ` Janosch Frank
  2018-02-14 14:18               ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Janosch Frank @ 2018-02-14 11:19 UTC (permalink / raw)
  To: David Hildenbrand, kvm
  Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390



On 14.02.2018 11:42, David Hildenbrand wrote:
> 
>>>>> That's interesting, because the SIE can now suddenly work on these
>>>>> PGSTEs, e.g. not leading to intercepts on certain events (like setting
>>>>> storage keys).
>>>>>
>>>>> How is that intended to be handled? I assume we would somehow have to
>>>>> forbid the SIE from making use of the PGSTE. But that involves clearing
>>>>> certain interception controls, which might be problematic.
>>>>
>>>> Well, cmma is disabled and storage keys should only be a problem, when
>>>> the pte is invalid without the pgste lock, which should never be the
>>>> case for split pmds.
>>>>
>>>
>>> Are you sure? Because the SIE would suddenly stark working on guest
>>> storage keys stored in the PGSTE if I am not mistaking? So I would
>>> assume that there would have to be some kind of a sync.
>>>
>>> But I don't have any documentation at hand, so i can't tell :)
>>>
>>
>> The pgste lock is that sync and as the gmap is the only way to get to
>> the pte, we don't have any ptes invalid without the lock. Also
>> set_guest_storage_keys will find a (userspace) pmd and do a hardware
>> sske, like it is supposed to.
> 
> What happens according to the documentation in the following cases:
> 
> The HW has the storage-key facility enabled and a SKEY operation (ISKE,
> RRBE, SSKE) hits a huge page:
> 
> a) Generates an intercept
> b) Uses the real machine storage key (as there are no pgste)
> 

b, this is an implicit RCP bypass (same goes for fc=1 r3 entries). This
might even be independent of the SKF, if I'm not mistaken.



* Re: [RFC/PATCH v3 04/16] s390/mm: add gmap PMD invalidation notification
  2018-02-14 11:19             ` Janosch Frank
@ 2018-02-14 14:18               ` David Hildenbrand
  2018-02-14 14:55                 ` Janosch Frank
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-02-14 14:18 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

On 14.02.2018 12:19, Janosch Frank wrote:
> On 14.02.2018 11:42, David Hildenbrand wrote:
>>
>>>>>> That's interesting, because the SIE can now suddenly work on these
>>>>>> PGSTEs, e.g. not leading to intercepts on certain events (like setting
>>>>>> storage keys).
>>>>>>
>>>>>> How is that intended to be handled? I assume we would somehow have to
>>>>>> forbid the SIE from making use of the PGSTE. But that involves clearing
>>>>>> certain interception controls, which might be problematic.
>>>>>
>>>>> Well, cmma is disabled and storage keys should only be a problem, when
>>>>> the pte is invalid without the pgste lock, which should never be the
>>>>> case for split pmds.
>>>>>
>>>>
>>>> Are you sure? Because the SIE would suddenly stark working on guest
>>>> storage keys stored in the PGSTE if I am not mistaking? So I would
>>>> assume that there would have to be some kind of a sync.
>>>>
>>>> But I don't have any documentation at hand, so i can't tell :)
>>>>
>>>
>>> The pgste lock is that sync and as the gmap is the only way to get to
>>> the pte, we don't have any ptes invalid without the lock. Also
>>> set_guest_storage_keys will find a (userspace) pmd and do a hardware
>>> sske, like it is supposed to.
>>
>> What happens according to the documentation in the following cases:
>>
>> The HW has the storage-key facility enabled and a SKEY operation (ISKE,
>> RRBE, SSKE) hits a huge page:
>>
>> a) Generates an intercept
>> b) Uses the real machine storage key (as there are no pgste)
>>
> 
> b, this is an implicit RCP bypass (same goes for fc=1 r3 entries). This
> might be even independent of the SKF if I'm not mistaken.
> 

SKF implements interpretation of these 3 instructions; without it, they
lead to an intercept (e.g. what we force when inside vSIE).

Alright, so that means that we have two cases:

1. A SKEY operation hits a huge PMD. Everything is fine, the real
storage key is used. This is also what get_guest_storage_key() /
set_guest_storage_key() implement now.

2. A SKEY operation hits a split PMD. It will work on the PGSTE values.

As I don't have access to documentation, looking at set_guest_storage_key():

a) The real storage key is read.
b) The _PAGE_ACC_BITS and _PAGE_FP_BIT are written into the real storage key
c) The _PAGE_CHANGED and  _PAGE_REFERENCED are cleared from the real
storage key. They are managed in the PGSTE.

This means that the requested _PAGE_CHANGED and _PAGE_REFERENCED bits
of the storage key are not updated in the real storage key.


So what can happen is (please correct me if I'm wrong)

a) PMD is split. SSKE writes storage key with _PAGE_CHANGED, ends up in
PGSTE. The real storage key doesn't match the requested storage key.
b) Split PMD is replaced, triggers a removal of the split PMD ->
gmap_pmd_split_free(pmdp). The requested storage key is partially lost
(pgste removed).
c) PMD is mapped in again. If the guest reads the storage key now, the
value is wrong.


-- 

Thanks,

David / dhildenb


* Re: [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement
  2018-02-09  9:34 [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement Janosch Frank
                   ` (15 preceding siblings ...)
  2018-02-09  9:34 ` [RFC/PATCH v3 16/16] s390/mm: Add gmap lock classes Janosch Frank
@ 2018-02-14 14:30 ` David Hildenbrand
  2018-02-14 15:01   ` Janosch Frank
  16 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-02-14 14:30 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

On 09.02.2018 10:34, Janosch Frank wrote:
> Since the z10 s390 does support 1M pages, but whereas hugetlbfs
> support was added quite fast, KVM always used standard 4k pages for
> guest backings.
> 
> This patchset adds full support for 1M huge page backings for s390
> KVM guests. I.e. we also support VSIE (nested vms) for these guests
> and are therefore able to run all combinations of backings for all
> layers of guests.
> 
> When running a VSIE guest in a huge page backed guest, we need to
> split some huge pages to be able to set granular protection. This way
> we avoid a prot/unprot cycle if prefixes and VSIE pages containing
> level 3 gmap DAT tables share the same segment, as the prefix has to
> be accessible at all times and the VSIE page has to be write
> protected.
> 
> Branch:
> git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git hlp_vsie
> https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git/log/?h=hlp_vsie
> 

A general proposal: We will have split PMDs with fake PGSTEs. This is
nasty but needed. I think we should hinder virtualization from making
use of these, just like we already do for vSIE.

Should we make the KVM_CAP_S390_HPAGE a configuration option?

Without it being set, don't allow mapping huge pages into the GMAP.
Everything as usual.

With it being set (by user space when it thinks we need huge pages),
allow mapping huge pages into the GMAP AND
- Explicitly disable CMMA. Right now we trust user space to do the
  right thing. ecb2 &= ~ECB2_CMMA
- Disable PFMFI -> ecb2 &= ~ECB2_PFMFI
- Disable SKF by setting scb->ictl |= ICTL_ISKE | ICTL_SSKE | ICTL_RRBE

So user space has to explicitly indicate and allow huge pages. This will
result in all instructions that touch the PGSTE getting intercepted, so
we can properly work on the huge PMDs instead.
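
To make that concrete, the vcpu setup could end up looking roughly like
this (untested sketch; the function name and the kvm->arch.use_hpage flag
are made up here and stand for whatever the capability handler sets):

static void kvm_s390_hpage_vcpu_setup(struct kvm_vcpu *vcpu)
{
	if (!vcpu->kvm->arch.use_hpage)
		return;
	/* no CMMA and no PFMF interpretation with huge page backing */
	vcpu->arch.sie_block->ecb2 &= ~(ECB2_CMMA | ECB2_PFMFI);
	/* always intercept the storage key instructions */
	vcpu->arch.sie_block->ictl |= ICTL_ISKE | ICTL_SSKE | ICTL_RRBE;
}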

-- 

Thanks,

David / dhildenb


* Re: [RFC/PATCH v3 04/16] s390/mm: add gmap PMD invalidation notification
  2018-02-14 14:18               ` David Hildenbrand
@ 2018-02-14 14:55                 ` Janosch Frank
  2018-02-14 15:15                   ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Janosch Frank @ 2018-02-14 14:55 UTC (permalink / raw)
  To: David Hildenbrand, kvm
  Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390



On 14.02.2018 15:18, David Hildenbrand wrote:
> On 14.02.2018 12:19, Janosch Frank wrote:
>> On 14.02.2018 11:42, David Hildenbrand wrote:
>>>
>>>>>>> That's interesting, because the SIE can now suddenly work on these
>>>>>>> PGSTEs, e.g. not leading to intercepts on certain events (like setting
>>>>>>> storage keys).
>>>>>>>
>>>>>>> How is that intended to be handled? I assume we would somehow have to
>>>>>>> forbid the SIE from making use of the PGSTE. But that involves clearing
>>>>>>> certain interception controls, which might be problematic.
>>>>>>
>>>>>> Well, cmma is disabled and storage keys should only be a problem, when
>>>>>> the pte is invalid without the pgste lock, which should never be the
>>>>>> case for split pmds.
>>>>>>
>>>>>
>>>>> Are you sure? Because the SIE would suddenly stark working on guest
>>>>> storage keys stored in the PGSTE if I am not mistaking? So I would
>>>>> assume that there would have to be some kind of a sync.
>>>>>
>>>>> But I don't have any documentation at hand, so i can't tell :)
>>>>>
>>>>
>>>> The pgste lock is that sync and as the gmap is the only way to get to
>>>> the pte, we don't have any ptes invalid without the lock. Also
>>>> set_guest_storage_keys will find a (userspace) pmd and do a hardware
>>>> sske, like it is supposed to.
>>>
>>> What happens according to the documentation in the following cases:
>>>
>>> The HW has the storage-key facility enabled and a SKEY operation (ISKE,
>>> RRBE, SSKE) hits a huge page:
>>>
>>> a) Generates an intercept
>>> b) Uses the real machine storage key (as there are no pgste)
>>>
>>
>> b, this is an implicit RCP bypass (same goes for fc=1 r3 entries). This
>> might be even independent of the SKF if I'm not mistaken.
>>
> 
> SKF implements interpretation of these 3 instructions, without it, they
> lead to an intercept (e.g. what we force when inside vSIE).

I know; the relevant parts read as if it didn't matter whether SKF is
available or not, but it seems I misread a sentence.

> 
> Alright, so that means that we have two cases:
> 
> 1. A SKEY operation hits a huge PMD. Everything is fine, the real
> storage key is used. This is also what get_guest_storage_key() /
> set_guest_storage_key() implement now.

Yep

> 
> 2. A SKEY operation hits a split PMD. It will work on the PGSTE values.
> 
> As I don't have access to documentation, looking at set_guest_storage_key():
> 
> a) The real storage key is read.
> b) The _PAGE_ACC_BITS and _PAGE_FP_BIT are written into the real storage key
> c) The _PAGE_CHANGED and  _PAGE_REFERENCED are cleared from the real
> storage key. They are managed in the PGSTE.
> 
> This means, that the requested _PAGE_CHANGED and _PAGE_REFERENCED bits
> part of the storage key are not updated in the real storage key.
> 
> 
> So what can happen is (please correct me if I'm wrong)
> 
> a) PMD is split. SSKE writes storage key with _PAGE_CHANGED, ends up in
> PGSTE. The real storage key doesn't match the requested storage key.
> b) Split PMD is replaced, triggers a removal of the split PMD ->
> gmap_pmd_split_free(pmdp). The requested storage key is partially lost
> (pgste removed).
> c) PMD is mapped in again. If the guest reads the storage key now, the
> value is wrong.

Yes, we lose GR and GC.
Is there a case where this would happen while the VM is running?



* Re: [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement
  2018-02-14 14:30 ` [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement David Hildenbrand
@ 2018-02-14 15:01   ` Janosch Frank
  2018-02-14 15:07     ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Janosch Frank @ 2018-02-14 15:01 UTC (permalink / raw)
  To: David Hildenbrand, kvm
  Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390



On 14.02.2018 15:30, David Hildenbrand wrote:
> On 09.02.2018 10:34, Janosch Frank wrote:
>> Since the z10 s390 does support 1M pages, but whereas hugetlbfs
>> support was added quite fast, KVM always used standard 4k pages for
>> guest backings.
>>
>> This patchset adds full support for 1M huge page backings for s390
>> KVM guests. I.e. we also support VSIE (nested vms) for these guests
>> and are therefore able to run all combinations of backings for all
>> layers of guests.
>>
>> When running a VSIE guest in a huge page backed guest, we need to
>> split some huge pages to be able to set granular protection. This way
>> we avoid a prot/unprot cycle if prefixes and VSIE pages containing
>> level 3 gmap DAT tables share the same segment, as the prefix has to
>> be accessible at all times and the VSIE page has to be write
>> protected.
>>
>> Branch:
>> git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git hlp_vsie
>> https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git/log/?h=hlp_vsie
>>
> 
> A general proposal: We will have split PMDs with fake PGSTE. This is
> nasty but needed. I think we should hinder virtualization from making
> use of these. Just like we already do for vSIE.
> 
> Should we make the KVM_CAP_S390_HPAGE a configuration option?
> 
> Without it being set, don't allow mapping huge pages into the GMAP.
> Everything as usual.
> 
> With it being set (by user space when it thinks we need huge pages),
> allow mapping huge pages into the GMAP AND
> - Explicitly disable CMMA. Right now we trust on user space to do the
>   right thing. ecb2 &= ~ECB2_CMMA
> - Disable PFMFI -> ecb2 &= ~ECB2_PFMFI
> - Disable SKF by setting scb->ictl |= ICTL_ISKE | ICTL_SSKE | ICTL_RRBE
> 
> So user space has to explicitly indicate and allow huge pages. This will
> result in all instructions that touch the PGSTE getting intercepted, so
> we can properly work on the huge PMDs instead.

My only concern here is:
Can this coexist with the CPU models in a coordinated way?



* Re: [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement
  2018-02-14 15:01   ` Janosch Frank
@ 2018-02-14 15:07     ` David Hildenbrand
  2018-02-14 15:33       ` Janosch Frank
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-02-14 15:07 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

On 14.02.2018 16:01, Janosch Frank wrote:
> On 14.02.2018 15:30, David Hildenbrand wrote:
>> On 09.02.2018 10:34, Janosch Frank wrote:
>>> Since the z10 s390 does support 1M pages, but whereas hugetlbfs
>>> support was added quite fast, KVM always used standard 4k pages for
>>> guest backings.
>>>
>>> This patchset adds full support for 1M huge page backings for s390
>>> KVM guests. I.e. we also support VSIE (nested vms) for these guests
>>> and are therefore able to run all combinations of backings for all
>>> layers of guests.
>>>
>>> When running a VSIE guest in a huge page backed guest, we need to
>>> split some huge pages to be able to set granular protection. This way
>>> we avoid a prot/unprot cycle if prefixes and VSIE pages containing
>>> level 3 gmap DAT tables share the same segment, as the prefix has to
>>> be accessible at all times and the VSIE page has to be write
>>> protected.
>>>
>>> Branch:
>>> git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git hlp_vsie
>>> https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git/log/?h=hlp_vsie
>>>
>>
>> A general proposal: We will have split PMDs with fake PGSTE. This is
>> nasty but needed. I think we should hinder virtualization from making
>> use of these. Just like we already do for vSIE.
>>
>> Should we make the KVM_CAP_S390_HPAGE a configuration option?
>>
>> Without it being set, don't allow mapping huge pages into the GMAP.
>> Everything as usual.
>>
>> With it being set (by user space when it thinks we need huge pages),
>> allow mapping huge pages into the GMAP AND
>> - Explicitly disable CMMA. Right now we trust on user space to do the
>>   right thing. ecb2 &= ~ECB2_CMMA
>> - Disable PFMFI -> ecb2 &= ~ECB2_PFMFI
>> - Disable SKF by setting scb->ictl |= ICTL_ISKE | ICTL_SSKE | ICTL_RRBE
>>
>> So user space has to explicitly indicate and allow huge pages. This will
>> result in all instructions that touch the PGSTE getting intercepted, so
>> we can properly work on the huge PMDs instead.
> 
> My only concern here is:
> Can this coexist with the cpumodels in a coordinated way?
> 

We already have to fake away the CMMA facility in user space, so that
shouldn't be a problem. The other instructions
- PFMF
- ISKE, SSKE ...
will simply always be interpreted, so this should not affect the CPU model.

-- 

Thanks,

David / dhildenb


* Re: [RFC/PATCH v3 04/16] s390/mm: add gmap PMD invalidation notification
  2018-02-14 14:55                 ` Janosch Frank
@ 2018-02-14 15:15                   ` David Hildenbrand
  2018-02-14 15:24                     ` Janosch Frank
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-02-14 15:15 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

>> So what can happen is (please correct me if I'm wrong)
>>
>> a) PMD is split. SSKE writes storage key with _PAGE_CHANGED, ends up in
>> PGSTE. The real storage key doesn't match the requested storage key.
>> b) Split PMD is replaced, triggers a removal of the split PMD ->
>> gmap_pmd_split_free(pmdp). The requested storage key is partially lost
>> (pgste removed).
>> c) PMD is mapped in again. If the guest reads the storage key now, the
>> value is wrong.
> 
> Yes, we loose GR and GC.
> Is there a case when the VM is running, where this would happen?

It should already happen when migrating storage keys. The fake PGSTEs are
not considered in get_guest_storage_key().

For the other parts, the original user space PMD would have to be
changed. A simple mprotect() should achieve that. Or dirty tracking. But
I'm not sure how that applies to huge pages at all.

-- 

Thanks,

David / dhildenb


* Re: [RFC/PATCH v3 04/16] s390/mm: add gmap PMD invalidation notification
  2018-02-14 15:15                   ` David Hildenbrand
@ 2018-02-14 15:24                     ` Janosch Frank
  0 siblings, 0 replies; 49+ messages in thread
From: Janosch Frank @ 2018-02-14 15:24 UTC (permalink / raw)
  To: David Hildenbrand, kvm
  Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390


On 14.02.2018 16:15, David Hildenbrand wrote:
>>> So what can happen is (please correct me if I'm wrong)
>>>
>>> a) PMD is split. SSKE writes storage key with _PAGE_CHANGED, ends up in
>>> PGSTE. The real storage key doesn't match the requested storage key.
>>> b) Split PMD is replaced, triggers a removal of the split PMD ->
>>> gmap_pmd_split_free(pmdp). The requested storage key is partially lost
>>> (pgste removed).
>>> c) PMD is mapped in again. If the guest reads the storage key now, the
>>> value is wrong.
>>
>> Yes, we lose GR and GC.
>> Is there a case when the VM is running, where this would happen?
> 
> It should already happen when migrating storage keys. The fake PGSTE are
> not considered in get_guest_storage_key().

Yes, that should break it reliably :)

> 
> For the other parts, the original user space PMD would have to be
> changed. A simple mprotect() should achieve that. Or dirty tracking. But
> not sure how that applies to huge pages at all.

I've only seen that so far, when the VM is already dead and the kernel
cleans up. Anyhow, you already proposed a solution, so there's no sense
in defending against corner cases.




^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement
  2018-02-14 15:07     ` David Hildenbrand
@ 2018-02-14 15:33       ` Janosch Frank
  2018-02-14 15:48         ` Christian Borntraeger
  2018-02-14 15:56         ` David Hildenbrand
  0 siblings, 2 replies; 49+ messages in thread
From: Janosch Frank @ 2018-02-14 15:33 UTC (permalink / raw)
  To: David Hildenbrand, kvm
  Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390


On 14.02.2018 16:07, David Hildenbrand wrote:
> On 14.02.2018 16:01, Janosch Frank wrote:
>> On 14.02.2018 15:30, David Hildenbrand wrote:
>>> On 09.02.2018 10:34, Janosch Frank wrote:
>>>> Since the z10 s390 does support 1M pages, but whereas hugetlbfs
>>>> support was added quite fast, KVM always used standard 4k pages for
>>>> guest backings.
>>>>
>>>> This patchset adds full support for 1M huge page backings for s390
>>>> KVM guests. I.e. we also support VSIE (nested vms) for these guests
>>>> and are therefore able to run all combinations of backings for all
>>>> layers of guests.
>>>>
>>>> When running a VSIE guest in a huge page backed guest, we need to
>>>> split some huge pages to be able to set granular protection. This way
>>>> we avoid a prot/unprot cycle if prefixes and VSIE pages containing
>>>> level 3 gmap DAT tables share the same segment, as the prefix has to
>>>> be accessible at all times and the VSIE page has to be write
>>>> protected.
>>>>
>>>> Branch:
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git hlp_vsie
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git/log/?h=hlp_vsie
>>>>
>>>
>>> A general proposal: We will have split PMDs with fake PGSTE. This is
>>> nasty but needed. I think we should hinder virtualization from making
>>> use of these. Just like we already do for vSIE.
>>>
>>> Should we make the KVM_CAP_S390_HPAGE a configuration option?
>>>
>>> Without it being set, don't allow mapping huge pages into the GMAP.
>>> Everything as usual.
>>>
>>> With it being set (by user space when it thinks we need huge pages),
>>> allow mapping huge pages into the GMAP AND
>>> - Explicitly disable CMMA. Right now we trust on user space to do the
>>>   right thing. ecb2 &= ~ECB2_CMMA
>>> - Disable PFMFI -> ecb2 &= ~ECB2_PFMFI
>>> - Disable SKF by setting scb->ictl |= ICTL_ISKE | ICTL_SSKE | ICTL_RRBE
>>>
>>> So user space has to explicitly indicate and allow huge pages. This will
>>> result in all instructions that touch the PGSTE getting intercepted, so
>>> we can properly work on the huge PMDs instead.
>>
>> My only concern here is:
>> Can this coexist with the cpumodels in a coordinated way?
>>
> 
> We already have to fake away the CMMA facility in user space. So that
> shouldn't be a problem. The other instructions
> - PFMF
> - ISKE, SSKE ...
> 
> Will simply always be interpreted. Should not affect the CPU model.

Bear with me, it was a long day:

Would it make sense to force user space to configure HPAGE before asking
for model data, so that we can remove these model bits already from
kernel side and wouldn't need extensive handling on two points?



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement
  2018-02-14 15:33       ` Janosch Frank
@ 2018-02-14 15:48         ` Christian Borntraeger
  2018-02-14 15:57           ` David Hildenbrand
  2018-02-14 15:56         ` David Hildenbrand
  1 sibling, 1 reply; 49+ messages in thread
From: Christian Borntraeger @ 2018-02-14 15:48 UTC (permalink / raw)
  To: Janosch Frank, David Hildenbrand, kvm
  Cc: schwidefsky, dominik.dingel, linux-s390



On 02/14/2018 04:33 PM, Janosch Frank wrote:
> On 14.02.2018 16:07, David Hildenbrand wrote:
>> On 14.02.2018 16:01, Janosch Frank wrote:
>>> On 14.02.2018 15:30, David Hildenbrand wrote:
>>>> On 09.02.2018 10:34, Janosch Frank wrote:
>>>>> Since the z10 s390 does support 1M pages, but whereas hugetlbfs
>>>>> support was added quite fast, KVM always used standard 4k pages for
>>>>> guest backings.
>>>>>
>>>>> This patchset adds full support for 1M huge page backings for s390
>>>>> KVM guests. I.e. we also support VSIE (nested vms) for these guests
>>>>> and are therefore able to run all combinations of backings for all
>>>>> layers of guests.
>>>>>
>>>>> When running a VSIE guest in a huge page backed guest, we need to
>>>>> split some huge pages to be able to set granular protection. This way
>>>>> we avoid a prot/unprot cycle if prefixes and VSIE pages containing
>>>>> level 3 gmap DAT tables share the same segment, as the prefix has to
>>>>> be accessible at all times and the VSIE page has to be write
>>>>> protected.
>>>>>
>>>>> Branch:
>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git hlp_vsie
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git/log/?h=hlp_vsie
>>>>>
>>>>
>>>> A general proposal: We will have split PMDs with fake PGSTE. This is
>>>> nasty but needed. I think we should hinder virtualization from making
>>>> use of these. Just like we already do for vSIE.
>>>>
>>>> Should we make the KVM_CAP_S390_HPAGE a configuration option?
>>>>
>>>> Without it being set, don't allow mapping huge pages into the GMAP.
>>>> Everything as usual.
>>>>
>>>> With it being set (by user space when it thinks we need huge pages),
>>>> allow mapping huge pages into the GMAP AND
>>>> - Explicitly disable CMMA. Right now we trust on user space to do the
>>>>   right thing. ecb2 &= ~ECB2_CMMA
>>>> - Disable PFMFI -> ecb2 &= ~ECB2_PFMFI
>>>> - Disable SKF by setting scb->ictl |= ICTL_ISKE | ICTL_SSKE | ICTL_RRBE
>>>>
>>>> So user space has to explicitly indicate and allow huge pages. This will
>>>> result in all instructions that touch the PGSTE getting intercepted, so
>>>> we can properly work on the huge PMDs instead.

I think it would work out fine for the Linux case. We do not use storage keys.
And if a guest uses them they will be slower if the host uses large pages. Tough.

>>>
>>> My only concern here is:
>>> Can this coexist with the cpumodels in a coordinated way?
>>>
>>
>> We already have to fake away the CMMA facility in user space. So that
>> shouldn't be a problem. The other instructions
>> - PFMF
>> - ISKE, SSKE ...
>>
>> Will simply always be interpreted. Should not affect the CPU model.
> 
> Bear with me, it was a long day:
> 
> Would it make sense to force user space to configure HPAGE before asking
> for model data, so that we can remove these model bits already from
> kernel side and wouldn't need extensive handling on two points?
Looks like we do not claim CMMA and storage key interpretation anyway
via the CPU model.

In the kernel we should then modify try_handle_skey to not believe in
sclp.has_skey (but a new flag instead). For VSIE we already disable
storage key interpretation (and do it manually). We could then also make
the HPAGE stuff XOR KVM_S390_VM_MEM_ENABLE_CMMA. Whatever comes first will
trigger an -EINVAL for the 2nd.

Something like this.
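
A rough sketch of the mutual exclusion described above, using the
use_gmap_hpage/has_cmma field names from the patches later in this
thread; enable_hpage_cap() is a hypothetical helper and the ioctl
plumbing around it is omitted:

/*
 * Sketch only: whichever of KVM_CAP_S390_HPAGE and
 * KVM_S390_VM_MEM_ENABLE_CMMA is requested second fails.
 */
static int enable_hpage_cap(struct kvm *kvm)
{
	int r = -EINVAL;

	mutex_lock(&kvm->lock);
	/* refuse if CMMA was enabled first or vcpus already exist */
	if (!kvm->created_vcpus && !kvm->arch.has_cmma) {
		kvm->mm->context.use_gmap_hpage = 1;
		r = 0;
	}
	mutex_unlock(&kvm->lock);
	return r;
}

The CMMA attribute handler would then do the mirrored check against
use_gmap_hpage and also return -EINVAL.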

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement
  2018-02-14 15:33       ` Janosch Frank
  2018-02-14 15:48         ` Christian Borntraeger
@ 2018-02-14 15:56         ` David Hildenbrand
  2018-02-15 15:43           ` [PATCH 0/3] Hpage capability rework Janosch Frank
  1 sibling, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-02-14 15:56 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390


>> Will simply always be interpreted. Should not affect the CPU model.
> 
> Bear with me, it was a long day:
> 
> Would it make sense to force user space to configure HPAGE before asking
> for model data, so that we can remove these model bits already from
> kernel side and wouldn't need extensive handling on two points?

It doesn't make any difference though. The problem is that we will
always have to probe the CPU model without knowing which type of memory
we will have. This is then reported e.g. via "virsh domcapabilities".

So it is better to claim "you're supporting cmma" while actually
silently disabling it when huge pages are used. Otherwise, the expanded
host model won't be able to run. Please note that this is not a problem
for migration: we expect the same command line, and therefore the same
"lying about CMMA", on both sides.

Wonder if there would be any way to fake CMMA support when not having
PGSTE - at least in a sane way. (always intercept and ignore or sth like
that - that would make it even easier to handle)


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement
  2018-02-14 15:48         ` Christian Borntraeger
@ 2018-02-14 15:57           ` David Hildenbrand
  0 siblings, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2018-02-14 15:57 UTC (permalink / raw)
  To: Christian Borntraeger, Janosch Frank, kvm
  Cc: schwidefsky, dominik.dingel, linux-s390


> I think it would work out fine for the Linux case. We do not use storage keys.
> And if a guest uses them they will be slower if the host uses large pages. Tough.

Yes, slower but handled correctly. (e.g. old SLES if I'm not mistaken)

> 
>>>>
>>>> My only concern here is:
>>>> Can this coexist with the cpumodels in a coordinated way?
>>>>
>>>
>>> We already have to fake away the CMMA facility in user space. So that
>>> shouldn't be a problem. The other instructions
>>> - PFMF
>>> - ISKE, SSKE ...
>>>
>>> Will simply always be interpreted. Should not affect the CPU model.
>>
>> Bear with me, it was a long day:
>>
>> Would it make sense to force user space to configure HPAGE before asking
>> for model data, so that we can remove these model bits already from
>> kernel side and wouldn't need extensive handling on two points?
> Looks like we do not claim CMMA and storage key interpretation anyway
> via the CPU model.
> 
> In the kernel we should then modify try_handle_skey to not believe in
> sclp.has_skey (but a new flag instead). For VSIE we already disable
> storage key interpretation (and do it manually). We could then also make
> the HPAGE stuff XOR KVM_S390_VM_MEM_ENABLE_CMMA. Whatever comes first will
> trigger an -EINVAL for the 2nd.
> 
> Something like this.
> 
Exactly what I had in mind.

-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 0/3] Hpage capability rework
  2018-02-14 15:56         ` David Hildenbrand
@ 2018-02-15 15:43           ` Janosch Frank
  2018-02-15 15:43             ` [PATCH 1/3] KVM: s390: Refactor host cmma and pfmfi interpretation controls Janosch Frank
                               ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Janosch Frank @ 2018-02-15 15:43 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

I'm not yet done cleaning up, so I'd rather have an update here than
send a new version. The first two patches also clean up cmma, pfmfi
and skf, fixing the use_cmma naming annoyance.

Janosch Frank (3):
  KVM: s390: Refactor host cmma and pfmfi interpretation controls
  KVM: s390: Add storage key facility interpretation control
  s390/mm: Enable gmap huge pmd support

 Documentation/virtual/kvm/api.txt   | 12 +++++++++++
 arch/s390/include/asm/kvm_host.h    |  4 +++-
 arch/s390/include/asm/mmu.h         |  2 ++
 arch/s390/include/asm/mmu_context.h |  1 +
 arch/s390/kvm/kvm-s390.c            | 41 ++++++++++++++++++++++++++-----------
 arch/s390/kvm/priv.c                | 24 +++++++++++++---------
 arch/s390/mm/gmap.c                 |  8 +++++---
 include/uapi/linux/kvm.h            |  1 +
 8 files changed, 67 insertions(+), 26 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 1/3] KVM: s390: Refactor host cmma and pfmfi interpretation controls
  2018-02-15 15:43           ` [PATCH 0/3] Hpage capability rework Janosch Frank
@ 2018-02-15 15:43             ` Janosch Frank
  2018-02-15 16:08               ` David Hildenbrand
  2018-02-15 15:43             ` [PATCH 2/3] KVM: s390: Add storage key facility interpretation control Janosch Frank
  2018-02-15 15:43             ` [PATCH 3/3] s390/mm: Enable gmap huge pmd support Janosch Frank
  2 siblings, 1 reply; 49+ messages in thread
From: Janosch Frank @ 2018-02-15 15:43 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

use_cmma in kvm_arch means that the vm is allowed to use cmma, whereas
use_cmma in the mm context means it has been used before. Let's rename
the kvm_arch one to has_cmma

Also let's introduce has_pfmfi, so we can remove the pfmfi disablement
when we activate cmma and rather not activate it in the first place.

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/include/asm/kvm_host.h |  3 ++-
 arch/s390/kvm/kvm-s390.c         | 25 +++++++++++++------------
 arch/s390/kvm/priv.c             |  2 +-
 3 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index afb0f08..b768754 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -791,7 +791,8 @@ struct kvm_arch{
 	unsigned long mem_limit;
 	int css_support;
 	int use_irqchip;
-	int use_cmma;
+	int has_cmma;
+	int has_pfmfi;
 	int user_cpu_state_ctrl;
 	int user_sigp;
 	int user_stsi;
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 4a2d68c..781aaf0 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -655,7 +655,9 @@ static int kvm_s390_set_mem_control(struct kvm *kvm, struct kvm_device_attr *att
 		VM_EVENT(kvm, 3, "%s", "ENABLE: CMMA support");
 		mutex_lock(&kvm->lock);
 		if (!kvm->created_vcpus) {
-			kvm->arch.use_cmma = 1;
+			kvm->arch.has_cmma = 1;
+			/* Not compatible with cmma. */
+			kvm->arch.has_pfmfi = 0;
 			ret = 0;
 		}
 		mutex_unlock(&kvm->lock);
@@ -665,7 +667,7 @@ static int kvm_s390_set_mem_control(struct kvm *kvm, struct kvm_device_attr *att
 		if (!sclp.has_cmma)
 			break;
 		ret = -EINVAL;
-		if (!kvm->arch.use_cmma)
+		if (!kvm->arch.has_cmma)
 			break;
 
 		VM_EVENT(kvm, 3, "%s", "RESET: CMMA states");
@@ -810,7 +812,7 @@ static int kvm_s390_vm_start_migration(struct kvm *kvm)
 		return -ENOMEM;
 	kvm->arch.migration_state = mgs;
 
-	if (kvm->arch.use_cmma) {
+	if (kvm->arch.has_cmma) {
 		/*
 		 * Get the first slot. They are reverse sorted by base_gfn, so
 		 * the first slot is also the one at the end of the address
@@ -855,7 +857,7 @@ static int kvm_s390_vm_stop_migration(struct kvm *kvm)
 	mgs = kvm->arch.migration_state;
 	kvm->arch.migration_state = NULL;
 
-	if (kvm->arch.use_cmma) {
+	if (kvm->arch.has_cmma) {
 		kvm_s390_sync_request_broadcast(kvm, KVM_REQ_STOP_MIGRATION);
 		/* We have to wait for the essa emulation to finish */
 		synchronize_srcu(&kvm->srcu);
@@ -1551,7 +1553,7 @@ static int kvm_s390_get_cmma_bits(struct kvm *kvm,
 	cur = args->start_gfn;
 	i = next = pgstev = 0;
 
-	if (unlikely(!kvm->arch.use_cmma))
+	if (unlikely(!kvm->arch.has_cmma))
 		return -ENXIO;
 	/* Invalid/unsupported flags were specified */
 	if (args->flags & ~KVM_S390_CMMA_PEEK)
@@ -1650,7 +1652,7 @@ static int kvm_s390_set_cmma_bits(struct kvm *kvm,
 
 	mask = args->mask;
 
-	if (!kvm->arch.use_cmma)
+	if (!kvm->arch.has_cmma)
 		return -ENXIO;
 	/* invalid/unsupported flags */
 	if (args->flags != 0)
@@ -2007,6 +2009,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
 	kvm->arch.css_support = 0;
 	kvm->arch.use_irqchip = 0;
+	kvm->arch.has_pfmfi = sclp.has_pfmfi;
 	kvm->arch.epoch = 0;
 
 	spin_lock_init(&kvm->arch.start_stop_lock);
@@ -2045,7 +2048,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 	if (kvm_is_ucontrol(vcpu->kvm))
 		gmap_remove(vcpu->arch.gmap);
 
-	if (vcpu->kvm->arch.use_cmma)
+	if (vcpu->kvm->arch.has_cmma)
 		kvm_s390_vcpu_unsetup_cmma(vcpu);
 	free_page((unsigned long)(vcpu->arch.sie_block));
 
@@ -2431,8 +2434,6 @@ int kvm_s390_vcpu_setup_cmma(struct kvm_vcpu *vcpu)
 	vcpu->arch.sie_block->cbrlo = get_zeroed_page(GFP_KERNEL);
 	if (!vcpu->arch.sie_block->cbrlo)
 		return -ENOMEM;
-
-	vcpu->arch.sie_block->ecb2 &= ~ECB2_PFMFI;
 	return 0;
 }
 
@@ -2468,7 +2469,7 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
 	if (test_kvm_facility(vcpu->kvm, 73))
 		vcpu->arch.sie_block->ecb |= ECB_TE;
 
-	if (test_kvm_facility(vcpu->kvm, 8) && sclp.has_pfmfi)
+	if (test_kvm_facility(vcpu->kvm, 8) && vcpu->kvm->arch.has_pfmfi)
 		vcpu->arch.sie_block->ecb2 |= ECB2_PFMFI;
 	if (test_kvm_facility(vcpu->kvm, 130))
 		vcpu->arch.sie_block->ecb2 |= ECB2_IEP;
@@ -2502,7 +2503,7 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
 	else
 		vcpu->arch.sie_block->ictl |= ICTL_ISKE | ICTL_SSKE | ICTL_RRBE;
 
-	if (vcpu->kvm->arch.use_cmma) {
+	if (vcpu->kvm->arch.has_cmma) {
 		rc = kvm_s390_vcpu_setup_cmma(vcpu);
 		if (rc)
 			return rc;
@@ -3013,7 +3014,7 @@ static int kvm_s390_handle_requests(struct kvm_vcpu *vcpu)
 		 * Re-enable CMMA virtualization if CMMA is available and
 		 * was used.
 		 */
-		if ((vcpu->kvm->arch.use_cmma) &&
+		if ((vcpu->kvm->arch.has_cmma) &&
 		    (vcpu->kvm->mm->context.use_cmma))
 			vcpu->arch.sie_block->ecb2 |= ECB2_CMMA;
 		goto retry;
diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index c4c4e15..908b579 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -1050,7 +1050,7 @@ static int handle_essa(struct kvm_vcpu *vcpu)
 	VCPU_EVENT(vcpu, 4, "ESSA: release %d pages", entries);
 	gmap = vcpu->arch.gmap;
 	vcpu->stat.instruction_essa++;
-	if (!vcpu->kvm->arch.use_cmma)
+	if (!vcpu->kvm->arch.has_cmma)
 		return kvm_s390_inject_program_int(vcpu, PGM_OPERATION);
 
 	if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 2/3] KVM: s390: Add storage key facility interpretation control
  2018-02-15 15:43           ` [PATCH 0/3] Hpage capability rework Janosch Frank
  2018-02-15 15:43             ` [PATCH 1/3] KVM: s390: Refactor host cmma and pfmfi interpretation controls Janosch Frank
@ 2018-02-15 15:43             ` Janosch Frank
  2018-02-15 16:09               ` David Hildenbrand
  2018-02-15 20:27               ` Farhan Ali
  2018-02-15 15:43             ` [PATCH 3/3] s390/mm: Enable gmap huge pmd support Janosch Frank
  2 siblings, 2 replies; 49+ messages in thread
From: Janosch Frank @ 2018-02-15 15:43 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

Up to now we always expected to have the storage key facility
available for our (non-VSIE) KVM guests. For huge page support, we
need to be able to disable it, so let's introduce that now.

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 arch/s390/include/asm/kvm_host.h |  1 +
 arch/s390/kvm/kvm-s390.c         |  1 +
 arch/s390/kvm/priv.c             | 22 +++++++++++++---------
 3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index b768754..ccbbd12 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -793,6 +793,7 @@ struct kvm_arch{
 	int use_irqchip;
 	int has_cmma;
 	int has_pfmfi;
+	int has_skf;
 	int user_cpu_state_ctrl;
 	int user_sigp;
 	int user_stsi;
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 781aaf0..ddf3599 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2010,6 +2010,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	kvm->arch.css_support = 0;
 	kvm->arch.use_irqchip = 0;
 	kvm->arch.has_pfmfi = sclp.has_pfmfi;
+	kvm->arch.has_skf = sclp.has_skey;
 	kvm->arch.epoch = 0;
 
 	spin_lock_init(&kvm->arch.start_stop_lock);
diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index 908b579..c74ad22 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -208,19 +208,23 @@ int kvm_s390_skey_check_enable(struct kvm_vcpu *vcpu)
 	struct kvm_s390_sie_block *sie_block = vcpu->arch.sie_block;
 
 	trace_kvm_s390_skey_related_inst(vcpu);
-	if (!(sie_block->ictl & (ICTL_ISKE | ICTL_SSKE | ICTL_RRBE)) &&
+	/* Already enabled? */
+	if (vcpu->kvm->arch.has_skf &&
+	    !(sie_block->ictl & (ICTL_ISKE | ICTL_SSKE | ICTL_RRBE)) &&
 	    !kvm_s390_test_cpuflags(vcpu, CPUSTAT_KSS))
 		return rc;
 
 	rc = s390_enable_skey();
 	VCPU_EVENT(vcpu, 3, "enabling storage keys for guest: %d", rc);
-	if (!rc) {
-		if (kvm_s390_test_cpuflags(vcpu, CPUSTAT_KSS))
-			kvm_s390_clear_cpuflags(vcpu, CPUSTAT_KSS);
-		else
-			sie_block->ictl &= ~(ICTL_ISKE | ICTL_SSKE |
-					     ICTL_RRBE);
-	}
+	if (rc)
+		return rc;
+
+	if (kvm_s390_test_cpuflags(vcpu, CPUSTAT_KSS))
+		kvm_s390_clear_cpuflags(vcpu, CPUSTAT_KSS);
+	if (!vcpu->kvm->arch.has_skf)
+		sie_block->ictl |= ICTL_ISKE | ICTL_SSKE | ICTL_RRBE;
+	else
+		sie_block->ictl &= ~(ICTL_ISKE | ICTL_SSKE | ICTL_RRBE);
 	return rc;
 }
 
@@ -231,7 +235,7 @@ static int try_handle_skey(struct kvm_vcpu *vcpu)
 	rc = kvm_s390_skey_check_enable(vcpu);
 	if (rc)
 		return rc;
-	if (sclp.has_skey) {
+	if (vcpu->kvm->arch.has_skf) {
 		/* with storage-key facility, SIE interprets it for us */
 		kvm_s390_retry_instr(vcpu);
 		VCPU_EVENT(vcpu, 4, "%s", "retrying storage key operation");
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 3/3] s390/mm: Enable gmap huge pmd support
  2018-02-15 15:43           ` [PATCH 0/3] Hpage capability rework Janosch Frank
  2018-02-15 15:43             ` [PATCH 1/3] KVM: s390: Refactor host cmma and pfmfi interpretation controls Janosch Frank
  2018-02-15 15:43             ` [PATCH 2/3] KVM: s390: Add storage key facility interpretation control Janosch Frank
@ 2018-02-15 15:43             ` Janosch Frank
  2018-02-15 16:10               ` David Hildenbrand
  2 siblings, 1 reply; 49+ messages in thread
From: Janosch Frank @ 2018-02-15 15:43 UTC (permalink / raw)
  To: kvm; +Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

Now that we have everything in place, let's allow huge (1m) pmds for
gmap linking, effectively allowing hugetlbfs backed guests. Transparent
huge pages and 2g huge pages are *not* supported through this change.

Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
---
 Documentation/virtual/kvm/api.txt   | 12 ++++++++++++
 arch/s390/include/asm/mmu.h         |  2 ++
 arch/s390/include/asm/mmu_context.h |  1 +
 arch/s390/kvm/kvm-s390.c            | 17 ++++++++++++++++-
 arch/s390/mm/gmap.c                 |  8 +++++---
 include/uapi/linux/kvm.h            |  1 +
 6 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 792fa87..edf248a 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -4270,6 +4270,18 @@ enables QEMU to build error log and branch to guest kernel registered
 machine check handling routine. Without this capability KVM will
 branch to guests' 0x200 interrupt vector.
 
+7.13 KVM_CAP_S390_HPAGE
+
+Architectures: s390
+Parameters: none
+
+With this capability the KVM support for memory backing with 1m pages
+through hugetlbfs can be enabled. This will disable cmm, cmma, pfmfi
+and the storage key interpretation.
+
+While it is generally possible to create and start a huge page backed
+VM without this capability, the VM will not be functional.
+
 8. Other capabilities.
 ----------------------
 
diff --git a/arch/s390/include/asm/mmu.h b/arch/s390/include/asm/mmu.h
index db35c41a..5ab9452 100644
--- a/arch/s390/include/asm/mmu.h
+++ b/arch/s390/include/asm/mmu.h
@@ -24,6 +24,8 @@ typedef struct {
 	unsigned int use_skey:1;
 	/* The mmu context uses CMMA. */
 	unsigned int use_cmma:1;
+	/* The gmap associated with this context uses huge pages. */
+	unsigned int use_gmap_hpage:1;
 } mm_context_t;
 
 #define INIT_MM_CONTEXT(name)						   \
diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
index 65154ea..79f2159 100644
--- a/arch/s390/include/asm/mmu_context.h
+++ b/arch/s390/include/asm/mmu_context.h
@@ -32,6 +32,7 @@ static inline int init_new_context(struct task_struct *tsk,
 	mm->context.has_pgste = 0;
 	mm->context.use_skey = 0;
 	mm->context.use_cmma = 0;
+	mm->context.use_gmap_hpage = 0;
 #endif
 	switch (mm->context.asce_limit) {
 	case _REGION2_SIZE:
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index ddf3599..8ea7025 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -406,6 +406,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_S390_CMMA_MIGRATION:
 	case KVM_CAP_S390_AIS:
 	case KVM_CAP_S390_AIS_MIGRATION:
+	case KVM_CAP_S390_HPAGE:
 		r = 1;
 		break;
 	case KVM_CAP_S390_MEM_OP:
@@ -604,6 +605,19 @@ static int kvm_vm_ioctl_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
 		VM_EVENT(kvm, 3, "ENABLE: CAP_S390_GS %s",
 			 r ? "(not available)" : "(success)");
 		break;
+	case KVM_CAP_S390_HPAGE:
+		mutex_lock(&kvm->lock);
+		if (kvm->created_vcpus) {
+			r = -EBUSY;
+		}
+		kvm->mm->context.use_gmap_hpage = 1;
+		/* They would complicate matters too much. */
+		kvm->arch.has_skf = 0;
+		kvm->arch.has_cmma = 0;
+		kvm->arch.has_pfmfi = 0;
+		mutex_unlock(&kvm->lock);
+		VM_EVENT(kvm, 3, "%s", "ENABLE: KVM_CAP_S390_HPAGE");
+		break;
 	case KVM_CAP_S390_USER_STSI:
 		VM_EVENT(kvm, 3, "%s", "ENABLE: CAP_S390_USER_STSI");
 		kvm->arch.user_stsi = 1;
@@ -655,7 +669,8 @@ static int kvm_s390_set_mem_control(struct kvm *kvm, struct kvm_device_attr *att
 		VM_EVENT(kvm, 3, "%s", "ENABLE: CMMA support");
 		mutex_lock(&kvm->lock);
 		if (!kvm->created_vcpus) {
-			kvm->arch.has_cmma = 1;
+			if (!kvm->mm->context.use_gmap_hpage)
+				kvm->arch.has_cmma = 1;
 			/* Not compatible with cmma. */
 			kvm->arch.has_pfmfi = 0;
 			ret = 0;
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 2cafcba..6937853 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2,8 +2,10 @@
 /*
  *  KVM guest address space mapping code
  *
- *    Copyright IBM Corp. 2007, 2016
+ *    Copyright IBM Corp. 2007, 2016, 2017
  *    Author(s): Martin Schwidefsky <schwidefsky@de.ibm.com>
+ *		 David Hildenbrand <david@redhat.com>
+ *		 Janosch Frank <frankja@linux.vnet.ibm.com>
  */
 
 #include <linux/kernel.h>
@@ -595,8 +597,8 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
 		return -EFAULT;
 	pmd = pmd_offset(pud, vmaddr);
 	VM_BUG_ON(pmd_none(*pmd));
-	/* large pmds cannot yet be handled */
-	if (pmd_large(*pmd))
+	/* Are we allowed to use huge pages? */
+	if (pmd_large(*pmd) && !gmap->mm->context.use_gmap_hpage)
 		return -EFAULT;
 	/* Link gmap segment table entry location to page table. */
 	rc = radix_tree_preload(GFP_KERNEL);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0fb5ef9..4e397d7 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -934,6 +934,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_AIS_MIGRATION 150
 #define KVM_CAP_PPC_GET_CPU_CHAR 151
 #define KVM_CAP_S390_BPB 152
+#define KVM_CAP_S390_HPAGE 153
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.7.4
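
As a usage note for the KVM_CAP_S390_HPAGE documentation added above, a
hedged userspace sketch of how a VMM could enable the capability before
creating any vcpu; enable_s390_hpage() is a made-up helper and vm_fd is
assumed to be a VM file descriptor obtained via KVM_CREATE_VM:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch only: enable huge page backing support for this VM. */
static int enable_s390_hpage(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_S390_HPAGE,
	};

	/* must happen before the first KVM_CREATE_VCPU */
	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}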

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/3] KVM: s390: Refactor host cmma and pfmfi interpretation controls
  2018-02-15 15:43             ` [PATCH 1/3] KVM: s390: Refactor host cmma and pfmfi interpretation controls Janosch Frank
@ 2018-02-15 16:08               ` David Hildenbrand
  2018-02-15 16:42                 ` Janosch Frank
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2018-02-15 16:08 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

On 15.02.2018 16:43, Janosch Frank wrote:
> use_cmma in kvm_arch means that the vm is allowed to use cmma, whereas
> use_cmma in the mm context means it has been used before. Let's rename
> the kvm_arch one to has_cmma

Unfortunately all naming related to CMM(A) is already screwed up.

If the guest can issue the ESSA instruction, it has the CMM facility. We
used to provide it to him by enabling the CMMA control - the
"interpretation facility". We could right now fully emulate it (which is
what we do when we dirty track changes).

At least in the CPU model we did it right. (the feature is called "cmm")

But anyhow, we only provide the CMM facility to the guest right now if
we have CMMA, so keeping the name "cmma" here is not completely wrong.

> 
> Also let's introduce has_pfmfi, so we can remove the pfmfi disablement

Mixed feelings. has_pfmfi sounds more like "this guest has the PFMFI
feature", which is something related to vSIE. It is really more
"use_pfmfi". Because we provide the PFMF instruction either way (by
emulating it - it belongs to edat1).

Can we name this "use_cmma" and "use_pfmfi" and "use_skf", because that
is actually what we do?

The thing in the mm context should rather be renamed to "uses_cmm(a)" or
"used_cmm(a)".

> when we activate cmma and rather not activate it in the first place.
> 
> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
> ---
>  arch/s390/include/asm/kvm_host.h |  3 ++-
>  arch/s390/kvm/kvm-s390.c         | 25 +++++++++++++------------
>  arch/s390/kvm/priv.c             |  2 +-
>  3 files changed, 16 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
> index afb0f08..b768754 100644
> --- a/arch/s390/include/asm/kvm_host.h
> +++ b/arch/s390/include/asm/kvm_host.h
> @@ -791,7 +791,8 @@ struct kvm_arch{
>  	unsigned long mem_limit;
>  	int css_support;
>  	int use_irqchip;
> -	int use_cmma;
> +	int has_cmma;
> +	int has_pfmfi;
>  	int user_cpu_state_ctrl;
>  	int user_sigp;
>  	int user_stsi;
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 4a2d68c..781aaf0 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -655,7 +655,9 @@ static int kvm_s390_set_mem_control(struct kvm *kvm, struct kvm_device_attr *att
>  		VM_EVENT(kvm, 3, "%s", "ENABLE: CMMA support");
>  		mutex_lock(&kvm->lock);
>  		if (!kvm->created_vcpus) {
> -			kvm->arch.use_cmma = 1;
> +			kvm->arch.has_cmma = 1;
> +			/* Not compatible with cmma. */
> +			kvm->arch.has_pfmfi = 0;
>  			ret = 0;
>  		}
>  		mutex_unlock(&kvm->lock);
> @@ -665,7 +667,7 @@ static int kvm_s390_set_mem_control(struct kvm *kvm, struct kvm_device_attr *att
>  		if (!sclp.has_cmma)
>  			break;
>  		ret = -EINVAL;
> -		if (!kvm->arch.use_cmma)
> +		if (!kvm->arch.has_cmma)
>  			break;
>  
>  		VM_EVENT(kvm, 3, "%s", "RESET: CMMA states");
> @@ -810,7 +812,7 @@ static int kvm_s390_vm_start_migration(struct kvm *kvm)
>  		return -ENOMEM;
>  	kvm->arch.migration_state = mgs;
>  
> -	if (kvm->arch.use_cmma) {
> +	if (kvm->arch.has_cmma) {
>  		/*
>  		 * Get the first slot. They are reverse sorted by base_gfn, so
>  		 * the first slot is also the one at the end of the address
> @@ -855,7 +857,7 @@ static int kvm_s390_vm_stop_migration(struct kvm *kvm)
>  	mgs = kvm->arch.migration_state;
>  	kvm->arch.migration_state = NULL;
>  
> -	if (kvm->arch.use_cmma) {
> +	if (kvm->arch.has_cmma) {
>  		kvm_s390_sync_request_broadcast(kvm, KVM_REQ_STOP_MIGRATION);
>  		/* We have to wait for the essa emulation to finish */
>  		synchronize_srcu(&kvm->srcu);
> @@ -1551,7 +1553,7 @@ static int kvm_s390_get_cmma_bits(struct kvm *kvm,
>  	cur = args->start_gfn;
>  	i = next = pgstev = 0;
>  
> -	if (unlikely(!kvm->arch.use_cmma))
> +	if (unlikely(!kvm->arch.has_cmma))
>  		return -ENXIO;
>  	/* Invalid/unsupported flags were specified */
>  	if (args->flags & ~KVM_S390_CMMA_PEEK)
> @@ -1650,7 +1652,7 @@ static int kvm_s390_set_cmma_bits(struct kvm *kvm,
>  
>  	mask = args->mask;
>  
> -	if (!kvm->arch.use_cmma)
> +	if (!kvm->arch.has_cmma)
>  		return -ENXIO;
>  	/* invalid/unsupported flags */
>  	if (args->flags != 0)
> @@ -2007,6 +2009,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  
>  	kvm->arch.css_support = 0;
>  	kvm->arch.use_irqchip = 0;
> +	kvm->arch.has_pfmfi = sclp.has_pfmfi;
>  	kvm->arch.epoch = 0;
>  
>  	spin_lock_init(&kvm->arch.start_stop_lock);
> @@ -2045,7 +2048,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
>  	if (kvm_is_ucontrol(vcpu->kvm))
>  		gmap_remove(vcpu->arch.gmap);
>  
> -	if (vcpu->kvm->arch.use_cmma)
> +	if (vcpu->kvm->arch.has_cmma)
>  		kvm_s390_vcpu_unsetup_cmma(vcpu);
>  	free_page((unsigned long)(vcpu->arch.sie_block));
>  
> @@ -2431,8 +2434,6 @@ int kvm_s390_vcpu_setup_cmma(struct kvm_vcpu *vcpu)
>  	vcpu->arch.sie_block->cbrlo = get_zeroed_page(GFP_KERNEL);
>  	if (!vcpu->arch.sie_block->cbrlo)
>  		return -ENOMEM;
> -
> -	vcpu->arch.sie_block->ecb2 &= ~ECB2_PFMFI;
>  	return 0;
>  }
>  
> @@ -2468,7 +2469,7 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
>  	if (test_kvm_facility(vcpu->kvm, 73))
>  		vcpu->arch.sie_block->ecb |= ECB_TE;
>  
> -	if (test_kvm_facility(vcpu->kvm, 8) && sclp.has_pfmfi)
> +	if (test_kvm_facility(vcpu->kvm, 8) && vcpu->kvm->arch.has_pfmfi)
>  		vcpu->arch.sie_block->ecb2 |= ECB2_PFMFI;
>  	if (test_kvm_facility(vcpu->kvm, 130))
>  		vcpu->arch.sie_block->ecb2 |= ECB2_IEP;
> @@ -2502,7 +2503,7 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
>  	else
>  		vcpu->arch.sie_block->ictl |= ICTL_ISKE | ICTL_SSKE | ICTL_RRBE;
>  
> -	if (vcpu->kvm->arch.use_cmma) {
> +	if (vcpu->kvm->arch.has_cmma) {
>  		rc = kvm_s390_vcpu_setup_cmma(vcpu);
>  		if (rc)
>  			return rc;
> @@ -3013,7 +3014,7 @@ static int kvm_s390_handle_requests(struct kvm_vcpu *vcpu)
>  		 * Re-enable CMMA virtualization if CMMA is available and
>  		 * was used.
>  		 */
> -		if ((vcpu->kvm->arch.use_cmma) &&
> +		if ((vcpu->kvm->arch.has_cmma) &&
>  		    (vcpu->kvm->mm->context.use_cmma))
>  			vcpu->arch.sie_block->ecb2 |= ECB2_CMMA;
>  		goto retry;
> diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
> index c4c4e15..908b579 100644
> --- a/arch/s390/kvm/priv.c
> +++ b/arch/s390/kvm/priv.c
> @@ -1050,7 +1050,7 @@ static int handle_essa(struct kvm_vcpu *vcpu)
>  	VCPU_EVENT(vcpu, 4, "ESSA: release %d pages", entries);
>  	gmap = vcpu->arch.gmap;
>  	vcpu->stat.instruction_essa++;
> -	if (!vcpu->kvm->arch.use_cmma)
> +	if (!vcpu->kvm->arch.has_cmma)
>  		return kvm_s390_inject_program_int(vcpu, PGM_OPERATION);
>  
>  	if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/3] KVM: s390: Add storage key facility interpretation control
  2018-02-15 15:43             ` [PATCH 2/3] KVM: s390: Add storage key facility interpretation control Janosch Frank
@ 2018-02-15 16:09               ` David Hildenbrand
  2018-02-15 20:27               ` Farhan Ali
  1 sibling, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2018-02-15 16:09 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

On 15.02.2018 16:43, Janosch Frank wrote:
> Up to now we always expected to have the storage key facility
> available for our (non-VSIE) KVM guests. For huge page support, we
> need to be able to disable it, so let's introduce that now.
> 
> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
> ---
>  arch/s390/include/asm/kvm_host.h |  1 +
>  arch/s390/kvm/kvm-s390.c         |  1 +
>  arch/s390/kvm/priv.c             | 22 +++++++++++++---------
>  3 files changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
> index b768754..ccbbd12 100644
> --- a/arch/s390/include/asm/kvm_host.h
> +++ b/arch/s390/include/asm/kvm_host.h
> @@ -793,6 +793,7 @@ struct kvm_arch{
>  	int use_irqchip;
>  	int has_cmma;
>  	int has_pfmfi;
> +	int has_skf;
>  	int user_cpu_state_ctrl;
>  	int user_sigp;
>  	int user_stsi;
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 781aaf0..ddf3599 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -2010,6 +2010,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  	kvm->arch.css_support = 0;
>  	kvm->arch.use_irqchip = 0;
>  	kvm->arch.has_pfmfi = sclp.has_pfmfi;
> +	kvm->arch.has_skf = sclp.has_skey;

Ditto, has_skf rather relates to a vSIE feature for the guest ("this guest
has the SKF, therefore can use it for the (v)SIE").

-> use_skf IMHO.

>  	kvm->arch.epoch = 0;
>  
>  	spin_lock_init(&kvm->arch.start_stop_lock);
> diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
> index 908b579..c74ad22 100644
> --- a/arch/s390/kvm/priv.c
> +++ b/arch/s390/kvm/priv.c
> @@ -208,19 +208,23 @@ int kvm_s390_skey_check_enable(struct kvm_vcpu *vcpu)
>  	struct kvm_s390_sie_block *sie_block = vcpu->arch.sie_block;
>  
>  	trace_kvm_s390_skey_related_inst(vcpu);
> -	if (!(sie_block->ictl & (ICTL_ISKE | ICTL_SSKE | ICTL_RRBE)) &&
> +	/* Already enabled? */
> +	if (vcpu->kvm->arch.has_skf &&
> +	    !(sie_block->ictl & (ICTL_ISKE | ICTL_SSKE | ICTL_RRBE)) &&
>  	    !kvm_s390_test_cpuflags(vcpu, CPUSTAT_KSS))
>  		return rc;
>  
>  	rc = s390_enable_skey();
>  	VCPU_EVENT(vcpu, 3, "enabling storage keys for guest: %d", rc);
> -	if (!rc) {
> -		if (kvm_s390_test_cpuflags(vcpu, CPUSTAT_KSS))
> -			kvm_s390_clear_cpuflags(vcpu, CPUSTAT_KSS);
> -		else
> -			sie_block->ictl &= ~(ICTL_ISKE | ICTL_SSKE |
> -					     ICTL_RRBE);
> -	}
> +	if (rc)
> +		return rc;
> +
> +	if (kvm_s390_test_cpuflags(vcpu, CPUSTAT_KSS))
> +		kvm_s390_clear_cpuflags(vcpu, CPUSTAT_KSS);
> +	if (!vcpu->kvm->arch.has_skf)
> +		sie_block->ictl |= ICTL_ISKE | ICTL_SSKE | ICTL_RRBE;
> +	else
> +		sie_block->ictl &= ~(ICTL_ISKE | ICTL_SSKE | ICTL_RRBE);
>  	return rc;
>  }
>  
> @@ -231,7 +235,7 @@ static int try_handle_skey(struct kvm_vcpu *vcpu)
>  	rc = kvm_s390_skey_check_enable(vcpu);
>  	if (rc)
>  		return rc;
> -	if (sclp.has_skey) {
> +	if (vcpu->kvm->arch.has_skf) {
>  		/* with storage-key facility, SIE interprets it for us */
>  		kvm_s390_retry_instr(vcpu);
>  		VCPU_EVENT(vcpu, 4, "%s", "retrying storage key operation");
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/3] s390/mm: Enable gmap huge pmd support
  2018-02-15 15:43             ` [PATCH 3/3] s390/mm: Enable gmap huge pmd support Janosch Frank
@ 2018-02-15 16:10               ` David Hildenbrand
  0 siblings, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2018-02-15 16:10 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

On 15.02.2018 16:43, Janosch Frank wrote:
> Now that we have everything in place, let's allow huge (1m) pmds for
> gmap linking, effectively allowing hugetlbfs backed guests. Transparent
> huge pages and 2g huge pages are *not* supported through this change.
> 
> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
> ---
>  Documentation/virtual/kvm/api.txt   | 12 ++++++++++++
>  arch/s390/include/asm/mmu.h         |  2 ++
>  arch/s390/include/asm/mmu_context.h |  1 +
>  arch/s390/kvm/kvm-s390.c            | 17 ++++++++++++++++-
>  arch/s390/mm/gmap.c                 |  8 +++++---
>  include/uapi/linux/kvm.h            |  1 +
>  6 files changed, 37 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 792fa87..edf248a 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -4270,6 +4270,18 @@ enables QEMU to build error log and branch to guest kernel registered
>  machine check handling routine. Without this capability KVM will
>  branch to guests' 0x200 interrupt vector.
>  
> +7.13 KVM_CAP_S390_HPAGE
> +
> +Architectures: s390
> +Parameters: none
> +
> +With this capability the KVM support for memory backing with 1m pages
> +through hugetlbfs can be enabled. This will disable cmm, cmma, pfmfi
> +and the storage key interpretation.
> +
> +While it is generally possible to create and start a huge page backed
> +VM without this capability, the VM will not be functional.
> +
>  8. Other capabilities.
>  ----------------------
>  
> diff --git a/arch/s390/include/asm/mmu.h b/arch/s390/include/asm/mmu.h
> index db35c41a..5ab9452 100644
> --- a/arch/s390/include/asm/mmu.h
> +++ b/arch/s390/include/asm/mmu.h
> @@ -24,6 +24,8 @@ typedef struct {
>  	unsigned int use_skey:1;
>  	/* The mmu context uses CMMA. */
>  	unsigned int use_cmma:1;
> +	/* The gmap associated with this context uses huge pages. */
> +	unsigned int use_gmap_hpage:1;
>  } mm_context_t;
>  
>  #define INIT_MM_CONTEXT(name)						   \
> diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
> index 65154ea..79f2159 100644
> --- a/arch/s390/include/asm/mmu_context.h
> +++ b/arch/s390/include/asm/mmu_context.h
> @@ -32,6 +32,7 @@ static inline int init_new_context(struct task_struct *tsk,
>  	mm->context.has_pgste = 0;
>  	mm->context.use_skey = 0;
>  	mm->context.use_cmma = 0;
> +	mm->context.use_gmap_hpage = 0;
>  #endif
>  	switch (mm->context.asce_limit) {
>  	case _REGION2_SIZE:
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index ddf3599..8ea7025 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -406,6 +406,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_S390_CMMA_MIGRATION:
>  	case KVM_CAP_S390_AIS:
>  	case KVM_CAP_S390_AIS_MIGRATION:
> +	case KVM_CAP_S390_HPAGE:
>  		r = 1;
>  		break;
>  	case KVM_CAP_S390_MEM_OP:
> @@ -604,6 +605,19 @@ static int kvm_vm_ioctl_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
>  		VM_EVENT(kvm, 3, "ENABLE: CAP_S390_GS %s",
>  			 r ? "(not available)" : "(success)");
>  		break;
> +	case KVM_CAP_S390_HPAGE:
> +		mutex_lock(&kvm->lock);
> +		if (kvm->created_vcpus) {
> +			r = -EBUSY;
> +		}
> +		kvm->mm->context.use_gmap_hpage = 1;
> +		/* They would complicate matters too much. */
> +		kvm->arch.has_skf = 0;
> +		kvm->arch.has_cmma = 0;
> +		kvm->arch.has_pfmfi = 0;
> +		mutex_unlock(&kvm->lock);
> +		VM_EVENT(kvm, 3, "%s", "ENABLE: KVM_CAP_S390_HPAGE");
> +		break;
>  	case KVM_CAP_S390_USER_STSI:
>  		VM_EVENT(kvm, 3, "%s", "ENABLE: CAP_S390_USER_STSI");
>  		kvm->arch.user_stsi = 1;
> @@ -655,7 +669,8 @@ static int kvm_s390_set_mem_control(struct kvm *kvm, struct kvm_device_attr *att
>  		VM_EVENT(kvm, 3, "%s", "ENABLE: CMMA support");
>  		mutex_lock(&kvm->lock);
>  		if (!kvm->created_vcpus) {
> -			kvm->arch.has_cmma = 1;
> +			if (!kvm->mm->context.use_gmap_hpage)
> +				kvm->arch.has_cmma = 1;
>  			/* Not compatible with cmma. */
>  			kvm->arch.has_pfmfi = 0;

As Christian said, this should be an XOR.

Either KVM_CAP_S390_HPAGE or CMMA can be enabled, not both (-EINVAL).

>  			ret = 0;
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 2cafcba..6937853 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -2,8 +2,10 @@
>  /*
>   *  KVM guest address space mapping code
>   *
> - *    Copyright IBM Corp. 2007, 2016
> + *    Copyright IBM Corp. 2007, 2016, 2017
>   *    Author(s): Martin Schwidefsky <schwidefsky@de.ibm.com>
> + *		 David Hildenbrand <david@redhat.com>
> + *		 Janosch Frank <frankja@linux.vnet.ibm.com>
>   */
>  
>  #include <linux/kernel.h>
> @@ -595,8 +597,8 @@ int __gmap_link(struct gmap *gmap, unsigned long gaddr, unsigned long vmaddr)
>  		return -EFAULT;
>  	pmd = pmd_offset(pud, vmaddr);
>  	VM_BUG_ON(pmd_none(*pmd));
> -	/* large pmds cannot yet be handled */
> -	if (pmd_large(*pmd))
> +	/* Are we allowed to use huge pages? */
> +	if (pmd_large(*pmd) && !gmap->mm->context.use_gmap_hpage)
>  		return -EFAULT;
>  	/* Link gmap segment table entry location to page table. */
>  	rc = radix_tree_preload(GFP_KERNEL);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 0fb5ef9..4e397d7 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -934,6 +934,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_S390_AIS_MIGRATION 150
>  #define KVM_CAP_PPC_GET_CPU_CHAR 151
>  #define KVM_CAP_S390_BPB 152
> +#define KVM_CAP_S390_HPAGE 153
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> 


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/3] KVM: s390: Refactor host cmma and pfmfi interpretation controls
  2018-02-15 16:08               ` David Hildenbrand
@ 2018-02-15 16:42                 ` Janosch Frank
  2018-02-16  9:46                   ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Janosch Frank @ 2018-02-15 16:42 UTC (permalink / raw)
  To: David Hildenbrand, kvm
  Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390


On 15.02.2018 17:08, David Hildenbrand wrote:
> On 15.02.2018 16:43, Janosch Frank wrote:
>> use_cmma in kvm_arch means that the vm is allowed to use cmma, whereas
>> use_cmma in the mm context means it has been used before. Let's rename
>> the kvm_arch one to has_cmma
> 
> Unfortunately all naming related to CMM(A) is already screwed up.
> 
> If the guest can issue the ESSA instruction, it has the CMM facility. We
> used to provide it to him by enabling the CMMA control - the
> "interpretation facility". We could right now fully emulate it (which is
> what we do when we dirty track changes).
> 
> At least in the CPU model we did it right. (the feature is called "cmm")
> 
> But anyhow, we only provide the CMM facility to the guest right now if
> we have CMMA, so keeping the name "cmma" here is not completely wrong.
> 
>>
>> Also let's introduce has_pfmfi, so we can remove the pfmfi disablement
> 
> Mixed feelings. has_pfmfi sounds more like "this guest has the PFMFI
> feature", which is something related to vSIE. It is really more
> "use_pfmfi". Because we provide the PFMF instruction either way (by
> emulating it - it belongs to edat1).
> 
> Can we name this "use_cmma" and "use_pfmfi" and "use_skf", because that
> is actually what we do?

As long as we have a difference between the arch and the context one and
the implementation of the patches is not a problem, sure.

We could also invert them and use emulate_pfmf or intercept_* which is
more specific about what we actually (don't) do.

> 
> The thing in the mm context should rather be renamed to "uses_cmm(a)" or
> "used_cmm(a)".

Yes, I like uses more, the rest of the gang likes used, now feel free to
propose something entirely different :)

If there's not much that speaks against the first two patches, I'd spin
them off to merge them earlier; the use_cmma cleanup has been on my list
for a long time. Thoughts?

> 
>> when we activate cmma and rather not activate it in the first place.
>>
>> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
>> ---
>>  arch/s390/include/asm/kvm_host.h |  3 ++-
>>  arch/s390/kvm/kvm-s390.c         | 25 +++++++++++++------------
>>  arch/s390/kvm/priv.c             |  2 +-
>>  3 files changed, 16 insertions(+), 14 deletions(-)
>>
>> diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
>> index afb0f08..b768754 100644
>> --- a/arch/s390/include/asm/kvm_host.h
>> +++ b/arch/s390/include/asm/kvm_host.h
>> @@ -791,7 +791,8 @@ struct kvm_arch{
>>  	unsigned long mem_limit;
>>  	int css_support;
>>  	int use_irqchip;
>> -	int use_cmma;
>> +	int has_cmma;
>> +	int has_pfmfi;
>>  	int user_cpu_state_ctrl;
>>  	int user_sigp;
>>  	int user_stsi;
>> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
>> index 4a2d68c..781aaf0 100644
>> --- a/arch/s390/kvm/kvm-s390.c
>> +++ b/arch/s390/kvm/kvm-s390.c
>> @@ -655,7 +655,9 @@ static int kvm_s390_set_mem_control(struct kvm *kvm, struct kvm_device_attr *att
>>  		VM_EVENT(kvm, 3, "%s", "ENABLE: CMMA support");
>>  		mutex_lock(&kvm->lock);
>>  		if (!kvm->created_vcpus) {
>> -			kvm->arch.use_cmma = 1;
>> +			kvm->arch.has_cmma = 1;
>> +			/* Not compatible with cmma. */
>> +			kvm->arch.has_pfmfi = 0;
>>  			ret = 0;
>>  		}
>>  		mutex_unlock(&kvm->lock);
>> @@ -665,7 +667,7 @@ static int kvm_s390_set_mem_control(struct kvm *kvm, struct kvm_device_attr *att
>>  		if (!sclp.has_cmma)
>>  			break;
>>  		ret = -EINVAL;
>> -		if (!kvm->arch.use_cmma)
>> +		if (!kvm->arch.has_cmma)
>>  			break;
>>  
>>  		VM_EVENT(kvm, 3, "%s", "RESET: CMMA states");
>> @@ -810,7 +812,7 @@ static int kvm_s390_vm_start_migration(struct kvm *kvm)
>>  		return -ENOMEM;
>>  	kvm->arch.migration_state = mgs;
>>  
>> -	if (kvm->arch.use_cmma) {
>> +	if (kvm->arch.has_cmma) {
>>  		/*
>>  		 * Get the first slot. They are reverse sorted by base_gfn, so
>>  		 * the first slot is also the one at the end of the address
>> @@ -855,7 +857,7 @@ static int kvm_s390_vm_stop_migration(struct kvm *kvm)
>>  	mgs = kvm->arch.migration_state;
>>  	kvm->arch.migration_state = NULL;
>>  
>> -	if (kvm->arch.use_cmma) {
>> +	if (kvm->arch.has_cmma) {
>>  		kvm_s390_sync_request_broadcast(kvm, KVM_REQ_STOP_MIGRATION);
>>  		/* We have to wait for the essa emulation to finish */
>>  		synchronize_srcu(&kvm->srcu);
>> @@ -1551,7 +1553,7 @@ static int kvm_s390_get_cmma_bits(struct kvm *kvm,
>>  	cur = args->start_gfn;
>>  	i = next = pgstev = 0;
>>  
>> -	if (unlikely(!kvm->arch.use_cmma))
>> +	if (unlikely(!kvm->arch.has_cmma))
>>  		return -ENXIO;
>>  	/* Invalid/unsupported flags were specified */
>>  	if (args->flags & ~KVM_S390_CMMA_PEEK)
>> @@ -1650,7 +1652,7 @@ static int kvm_s390_set_cmma_bits(struct kvm *kvm,
>>  
>>  	mask = args->mask;
>>  
>> -	if (!kvm->arch.use_cmma)
>> +	if (!kvm->arch.has_cmma)
>>  		return -ENXIO;
>>  	/* invalid/unsupported flags */
>>  	if (args->flags != 0)
>> @@ -2007,6 +2009,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>>  
>>  	kvm->arch.css_support = 0;
>>  	kvm->arch.use_irqchip = 0;
>> +	kvm->arch.has_pfmfi = sclp.has_pfmfi;
>>  	kvm->arch.epoch = 0;
>>  
>>  	spin_lock_init(&kvm->arch.start_stop_lock);
>> @@ -2045,7 +2048,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
>>  	if (kvm_is_ucontrol(vcpu->kvm))
>>  		gmap_remove(vcpu->arch.gmap);
>>  
>> -	if (vcpu->kvm->arch.use_cmma)
>> +	if (vcpu->kvm->arch.has_cmma)
>>  		kvm_s390_vcpu_unsetup_cmma(vcpu);
>>  	free_page((unsigned long)(vcpu->arch.sie_block));
>>  
>> @@ -2431,8 +2434,6 @@ int kvm_s390_vcpu_setup_cmma(struct kvm_vcpu *vcpu)
>>  	vcpu->arch.sie_block->cbrlo = get_zeroed_page(GFP_KERNEL);
>>  	if (!vcpu->arch.sie_block->cbrlo)
>>  		return -ENOMEM;
>> -
>> -	vcpu->arch.sie_block->ecb2 &= ~ECB2_PFMFI;
>>  	return 0;
>>  }
>>  
>> @@ -2468,7 +2469,7 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
>>  	if (test_kvm_facility(vcpu->kvm, 73))
>>  		vcpu->arch.sie_block->ecb |= ECB_TE;
>>  
>> -	if (test_kvm_facility(vcpu->kvm, 8) && sclp.has_pfmfi)
>> +	if (test_kvm_facility(vcpu->kvm, 8) && vcpu->kvm->arch.has_pfmfi)
>>  		vcpu->arch.sie_block->ecb2 |= ECB2_PFMFI;
>>  	if (test_kvm_facility(vcpu->kvm, 130))
>>  		vcpu->arch.sie_block->ecb2 |= ECB2_IEP;
>> @@ -2502,7 +2503,7 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
>>  	else
>>  		vcpu->arch.sie_block->ictl |= ICTL_ISKE | ICTL_SSKE | ICTL_RRBE;
>>  
>> -	if (vcpu->kvm->arch.use_cmma) {
>> +	if (vcpu->kvm->arch.has_cmma) {
>>  		rc = kvm_s390_vcpu_setup_cmma(vcpu);
>>  		if (rc)
>>  			return rc;
>> @@ -3013,7 +3014,7 @@ static int kvm_s390_handle_requests(struct kvm_vcpu *vcpu)
>>  		 * Re-enable CMMA virtualization if CMMA is available and
>>  		 * was used.
>>  		 */
>> -		if ((vcpu->kvm->arch.use_cmma) &&
>> +		if ((vcpu->kvm->arch.has_cmma) &&
>>  		    (vcpu->kvm->mm->context.use_cmma))
>>  			vcpu->arch.sie_block->ecb2 |= ECB2_CMMA;
>>  		goto retry;
>> diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
>> index c4c4e15..908b579 100644
>> --- a/arch/s390/kvm/priv.c
>> +++ b/arch/s390/kvm/priv.c
>> @@ -1050,7 +1050,7 @@ static int handle_essa(struct kvm_vcpu *vcpu)
>>  	VCPU_EVENT(vcpu, 4, "ESSA: release %d pages", entries);
>>  	gmap = vcpu->arch.gmap;
>>  	vcpu->stat.instruction_essa++;
>> -	if (!vcpu->kvm->arch.use_cmma)
>> +	if (!vcpu->kvm->arch.has_cmma)
>>  		return kvm_s390_inject_program_int(vcpu, PGM_OPERATION);
>>  
>>  	if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
>>
> 
> 




^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/3] KVM: s390: Add storage key facility interpretation control
  2018-02-15 15:43             ` [PATCH 2/3] KVM: s390: Add storage key facility interpretation control Janosch Frank
  2018-02-15 16:09               ` David Hildenbrand
@ 2018-02-15 20:27               ` Farhan Ali
  1 sibling, 0 replies; 49+ messages in thread
From: Farhan Ali @ 2018-02-15 20:27 UTC (permalink / raw)
  To: Janosch Frank, kvm
  Cc: schwidefsky, borntraeger, david, dominik.dingel, linux-s390

Reviewed-by: Farhan Ali <alifm@linux.vnet.ibm.com>

On 02/15/2018 10:43 AM, Janosch Frank wrote:
> Up to now we always expected to have the storage key facility
> available for our (non-VSIE) KVM guests. For huge page support, we
> need to be able to disable it, so let's introduce that now.
> 
> Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
> ---
>   arch/s390/include/asm/kvm_host.h |  1 +
>   arch/s390/kvm/kvm-s390.c         |  1 +
>   arch/s390/kvm/priv.c             | 22 +++++++++++++---------
>   3 files changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
> index b768754..ccbbd12 100644
> --- a/arch/s390/include/asm/kvm_host.h
> +++ b/arch/s390/include/asm/kvm_host.h
> @@ -793,6 +793,7 @@ struct kvm_arch{
>   	int use_irqchip;
>   	int has_cmma;
>   	int has_pfmfi;
> +	int has_skf;
>   	int user_cpu_state_ctrl;
>   	int user_sigp;
>   	int user_stsi;
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 781aaf0..ddf3599 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -2010,6 +2010,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>   	kvm->arch.css_support = 0;
>   	kvm->arch.use_irqchip = 0;
>   	kvm->arch.has_pfmfi = sclp.has_pfmfi;
> +	kvm->arch.has_skf = sclp.has_skey;
>   	kvm->arch.epoch = 0;
> 
>   	spin_lock_init(&kvm->arch.start_stop_lock);
> diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
> index 908b579..c74ad22 100644
> --- a/arch/s390/kvm/priv.c
> +++ b/arch/s390/kvm/priv.c
> @@ -208,19 +208,23 @@ int kvm_s390_skey_check_enable(struct kvm_vcpu *vcpu)
>   	struct kvm_s390_sie_block *sie_block = vcpu->arch.sie_block;
> 
>   	trace_kvm_s390_skey_related_inst(vcpu);
> -	if (!(sie_block->ictl & (ICTL_ISKE | ICTL_SSKE | ICTL_RRBE)) &&
> +	/* Already enabled? */
> +	if (vcpu->kvm->arch.has_skf &&
> +	    !(sie_block->ictl & (ICTL_ISKE | ICTL_SSKE | ICTL_RRBE)) &&
>   	    !kvm_s390_test_cpuflags(vcpu, CPUSTAT_KSS))
>   		return rc;
> 
>   	rc = s390_enable_skey();
>   	VCPU_EVENT(vcpu, 3, "enabling storage keys for guest: %d", rc);
> -	if (!rc) {
> -		if (kvm_s390_test_cpuflags(vcpu, CPUSTAT_KSS))
> -			kvm_s390_clear_cpuflags(vcpu, CPUSTAT_KSS);
> -		else
> -			sie_block->ictl &= ~(ICTL_ISKE | ICTL_SSKE |
> -					     ICTL_RRBE);
> -	}
> +	if (rc)
> +		return rc;
> +
> +	if (kvm_s390_test_cpuflags(vcpu, CPUSTAT_KSS))
> +		kvm_s390_clear_cpuflags(vcpu, CPUSTAT_KSS);
> +	if (!vcpu->kvm->arch.has_skf)
> +		sie_block->ictl |= ICTL_ISKE | ICTL_SSKE | ICTL_RRBE;
> +	else
> +		sie_block->ictl &= ~(ICTL_ISKE | ICTL_SSKE | ICTL_RRBE);
>   	return rc;
>   }
> 
> @@ -231,7 +235,7 @@ static int try_handle_skey(struct kvm_vcpu *vcpu)
>   	rc = kvm_s390_skey_check_enable(vcpu);
>   	if (rc)
>   		return rc;
> -	if (sclp.has_skey) {
> +	if (vcpu->kvm->arch.has_skf) {
>   		/* with storage-key facility, SIE interprets it for us */
>   		kvm_s390_retry_instr(vcpu);
>   		VCPU_EVENT(vcpu, 4, "%s", "retrying storage key operation");
> 
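
For readability, here is a condensed sketch of the flow the hunks above
end up with. It only restates the quoted patch; folding it into a single
helper and the name skey_flow_sketch() are illustrative, not part of the
patch (tracing and the VCPU_EVENT logging are omitted):

/*
 * Sketch only: has_skf is seeded from sclp.has_skey at VM creation and
 * decides whether SIE interprets storage key instructions or KVM keeps
 * intercepting and emulating them.
 */
static int skey_flow_sketch(struct kvm_vcpu *vcpu)
{
	struct kvm_s390_sie_block *sie_block = vcpu->arch.sie_block;
	int rc;

	/* Interpretation already set up? Then there is nothing to do. */
	if (vcpu->kvm->arch.has_skf &&
	    !(sie_block->ictl & (ICTL_ISKE | ICTL_SSKE | ICTL_RRBE)) &&
	    !kvm_s390_test_cpuflags(vcpu, CPUSTAT_KSS))
		return 0;

	rc = s390_enable_skey();
	if (rc)
		return rc;

	if (kvm_s390_test_cpuflags(vcpu, CPUSTAT_KSS))
		kvm_s390_clear_cpuflags(vcpu, CPUSTAT_KSS);
	if (!vcpu->kvm->arch.has_skf)
		/* No facility: keep intercepting ISKE/SSKE/RRBE, emulate in KVM. */
		sie_block->ictl |= ICTL_ISKE | ICTL_SSKE | ICTL_RRBE;
	else
		/* Facility available: let SIE interpret the instructions. */
		sie_block->ictl &= ~(ICTL_ISKE | ICTL_SSKE | ICTL_RRBE);
	return 0;
}

try_handle_skey() (last hunk above) then checks has_skf once more: with
the facility it retries the instruction so that SIE interprets it, and
without it handling falls back to emulation in KVM (not visible in the
quoted hunk).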

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 1/3] KVM: s390: Refactor host cmma and pfmfi interpretation controls
  2018-02-15 16:42                 ` Janosch Frank
@ 2018-02-16  9:46                   ` David Hildenbrand
  0 siblings, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2018-02-16  9:46 UTC (permalink / raw)
  To: Janosch Frank, kvm; +Cc: schwidefsky, borntraeger, dominik.dingel, linux-s390

On 15.02.2018 17:42, Janosch Frank wrote:
> On 15.02.2018 17:08, David Hildenbrand wrote:
>> On 15.02.2018 16:43, Janosch Frank wrote:
>>> use_cmma in kvm_arch means that the VM is allowed to use CMMA, whereas
>>> use_cmma in the mm context means it has actually been used before.
>>> Let's rename the kvm_arch one to has_cmma.
>>
>> Unfortunately all naming related to CMM(A) is already screwed up.
>>
>> If the guest can issue the ESSA instruction, it has the CMM facility. We
>> used to provide it to the guest by enabling the CMMA control, the
>> "interpretation facility". We could right now fully emulate it (which is
>> what we do when we dirty-track changes).
>>
>> At least in the CPU model we did it right (the feature is called "cmm").
>>
>> But anyhow, we only provide the CMM facility to the guest right now if
>> we have CMMA, so keeping the name "cmma" here is not completely wrong.
>>
>>>
>>> Also let's introduce has_pfmfi, so we can remove the pfmfi disablement
>>
>> Mixed feelings. has_pfmfi sounds more like "this guest has the PFMFI
>> feature", which is something related to vSIE. It is really more
>> "use_pfmfi", because we provide the PFMF instruction either way (by
>> emulating it; it belongs to EDAT1).
>>
>> Can we name this "use_cmma" and "use_pfmfi" and "use_skf", because that
>> is actually what we do?
> 
> As long as we keep the distinction between the arch one and the mm
> context one, and the implementation of the patches is not a problem, sure.
> 
> We could also invert them and use emulate_pfmf or intercept_*, which is
> more specific about what we actually (don't) do.

I think "using the interpretation facility" is good enough. Because that
is exactly what we do, independent of availability of the instruction
towards the guest.

emulate_pfmf might be misleading, e.g. if the guest doesn't have EDAT1
and we fake it away: emulate_pfmf is set, but we don't emulate the
instruction; we inject an exception right away.

intercept_* would work, however it would then have to be intercept_essa
instead of cmma. Not sure if that is easier to read.
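
To keep the two levels apart, a tiny sketch of what the proposed names
would mean; the field names follow the use_*/uses_* proposal in this
thread and are not merged code:

/* Per-VM flags: "we use the interpretation facility for this guest". */
struct kvm_arch_flags_sketch {
	int use_cmma;	/* set ECB2_CMMA so SIE interprets ESSA */
	int use_pfmfi;	/* let SIE interpret PFMF instead of emulating it */
	int use_skf;	/* let SIE interpret ISKE/SSKE/RRBE */
};

/* Per-mm flag: "the guest has actually made use of CMM". */
struct mm_context_flags_sketch {
	int uses_cmm;	/* the guest has issued ESSA at least once */
};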

> 
>>
>> The thing in the mm context should rather be renamed to "uses_cmm(a)" or
>> "used_cmm(a)".
> 
> Yes, I like "uses" more and the rest of the gang likes "used"; now feel
> free to propose something entirely different :)
> 
> If there's not much that speaks against the first two patches, I'd spin
> them off to get them merged earlier; the use_cmma cleanup has been on my
> list for a long time. Thoughts?
> 

Jup, please resend, they make sense on their own.


-- 

Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2018-02-16  9:46 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-09  9:34 [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement Janosch Frank
2018-02-09  9:34 ` [RFC/PATCH v3 01/16] s390/mm: make gmap_protect_range more modular Janosch Frank
2018-02-13 14:07   ` David Hildenbrand
2018-02-09  9:34 ` [RFC/PATCH v3 02/16] s390/mm: Abstract gmap notify bit setting Janosch Frank
2018-02-13 14:10   ` David Hildenbrand
2018-02-13 14:31     ` Janosch Frank
2018-02-09  9:34 ` [RFC/PATCH v3 03/16] s390/mm: Introduce gmap_pmdp_xchg Janosch Frank
2018-02-13 14:16   ` David Hildenbrand
2018-02-13 14:39     ` Janosch Frank
2018-02-09  9:34 ` [RFC/PATCH v3 04/16] s390/mm: add gmap PMD invalidation notification Janosch Frank
2018-02-13 14:36   ` David Hildenbrand
2018-02-13 14:54     ` Janosch Frank
2018-02-13 14:59       ` David Hildenbrand
2018-02-13 15:33         ` Janosch Frank
2018-02-14 10:42           ` David Hildenbrand
2018-02-14 11:19             ` Janosch Frank
2018-02-14 14:18               ` David Hildenbrand
2018-02-14 14:55                 ` Janosch Frank
2018-02-14 15:15                   ` David Hildenbrand
2018-02-14 15:24                     ` Janosch Frank
2018-02-09  9:34 ` [RFC/PATCH v3 05/16] s390/mm: Add gmap pmd invalidation and clearing Janosch Frank
2018-02-09  9:34 ` [RFC/PATCH v3 06/16] s390/mm: Add huge page dirty sync support Janosch Frank
2018-02-09  9:34 ` [RFC/PATCH v3 07/16] s390/mm: Make gmap_read_table EDAT1 compatible Janosch Frank
2018-02-09  9:34 ` [RFC/PATCH v3 08/16] s390/mm: Make protect_rmap " Janosch Frank
2018-02-09  9:34 ` [RFC/PATCH v3 09/16] s390/mm: Add shadow segment code Janosch Frank
2018-02-09  9:34 ` [RFC/PATCH v3 10/16] s390/mm: Add VSIE reverse fake case Janosch Frank
2018-02-09  9:34 ` [RFC/PATCH v3 11/16] s390/mm: Enable gmap huge pmd support Janosch Frank
2018-02-09  9:34 ` [RFC/PATCH v3 12/16] s390/mm: clear huge page storage keys on enable_skey Janosch Frank
2018-02-09  9:34 ` [RFC/PATCH v3 13/16] s390/mm: Add huge pmd storage key handling Janosch Frank
2018-02-09  9:34 ` [RFC/PATCH v3 14/16] s390/mm: hugetlb pages within a gmap can not be freed Janosch Frank
2018-02-09  9:34 ` [RFC/PATCH v3 15/16] KVM: s390: Add KVM HPAGE capability Janosch Frank
2018-02-09  9:34 ` [RFC/PATCH v3 16/16] s390/mm: Add gmap lock classes Janosch Frank
2018-02-14 14:30 ` [RFC/PATCH v3 00/16] KVM/s390: Hugetlbfs enablement David Hildenbrand
2018-02-14 15:01   ` Janosch Frank
2018-02-14 15:07     ` David Hildenbrand
2018-02-14 15:33       ` Janosch Frank
2018-02-14 15:48         ` Christian Borntraeger
2018-02-14 15:57           ` David Hildenbrand
2018-02-14 15:56         ` David Hildenbrand
2018-02-15 15:43           ` [PATCH 0/3] Hpage capability rework Janosch Frank
2018-02-15 15:43             ` [PATCH 1/3] KVM: s390: Refactor host cmma and pfmfi interpretation controls Janosch Frank
2018-02-15 16:08               ` David Hildenbrand
2018-02-15 16:42                 ` Janosch Frank
2018-02-16  9:46                   ` David Hildenbrand
2018-02-15 15:43             ` [PATCH 2/3] KVM: s390: Add storage key facility interpretation control Janosch Frank
2018-02-15 16:09               ` David Hildenbrand
2018-02-15 20:27               ` Farhan Ali
2018-02-15 15:43             ` [PATCH 3/3] s390/mm: Enable gmap huge pmd support Janosch Frank
2018-02-15 16:10               ` David Hildenbrand
