* [PATCH] software dirty bits for s390
@ 2013-02-05 18:12 Martin Schwidefsky
  2013-02-05 18:12 ` Martin Schwidefsky
  2013-02-05 18:12 ` [PATCH] s390/mm: implement software dirty bits Martin Schwidefsky
  0 siblings, 2 replies; 11+ messages in thread
From: Martin Schwidefsky @ 2013-02-05 18:12 UTC (permalink / raw)
  To: linux-mm, linux-s390
  Cc: Hugh Dickins, Mel Gorman, Jan Kara, Christian Ehrhardt,
	Martin Schwidefsky

Greetings,

good news: I got performance results for a selected set of workloads
with my software dirty bit patch (thanks Christian!). We found no
downsides to the software dirty bits, and a substantial improvement
in CPU utilization for the FIO test with mostly read mappings.

All good, the patch can now go upstream. The patch changes common
memory management code, but the parts that are removed are purely
s390-specific. I can handle it via the linux-s390 tree, but I would
not mind if it gets a sign-off from the mm folks.

Martin Schwidefsky (1):
  s390/mm: implement software dirty bits

 arch/s390/include/asm/page.h    |   22 -------
 arch/s390/include/asm/pgtable.h |  131 ++++++++++++++++++++++++++-------------
 arch/s390/include/asm/sclp.h    |    1 -
 arch/s390/include/asm/setup.h   |   16 ++---
 arch/s390/kvm/kvm-s390.c        |    2 +-
 arch/s390/lib/uaccess_pt.c      |    2 +-
 arch/s390/mm/pageattr.c         |    2 +-
 arch/s390/mm/vmem.c             |   24 +++----
 drivers/s390/char/sclp_cmd.c    |   10 +--
 include/asm-generic/pgtable.h   |   10 ---
 include/linux/page-flags.h      |    8 ---
 mm/rmap.c                       |   24 -------
 12 files changed, 112 insertions(+), 140 deletions(-)

-- 
1.7.9.5


* [PATCH] software dirty bits for s390
  2013-02-05 18:12 [PATCH] software dirty bits for s390 Martin Schwidefsky
@ 2013-02-05 18:12 ` Martin Schwidefsky
  2013-02-05 19:28   ` Martin Schwidefsky
  2013-02-05 18:12 ` [PATCH] s390/mm: implement software dirty bits Martin Schwidefsky
  1 sibling, 1 reply; 11+ messages in thread
From: Martin Schwidefsky @ 2013-02-05 18:12 UTC (permalink / raw)
  To: linux-mm, linux-s390
  Cc: Hugh Dickins, Mel Gorman, Jan Kara, Christian Ehrhardt,
	Martin Schwidefsky

Greetings,

good news: I got performance results for a selected set of workloads
with my software dirty bit patch (thanks Christian!). We found no
downsides to the software dirty bits, and a substantial improvement
in CPU utilization for the FIO test with mostly read mappings.

The patch can now go upstream.

Martin Schwidefsky (1):
  s390/mm: implement software dirty bits

 arch/s390/include/asm/page.h    |   22 -------
 arch/s390/include/asm/pgtable.h |  131 ++++++++++++++++++++++++++-------------
 arch/s390/include/asm/sclp.h    |    1 -
 arch/s390/include/asm/setup.h   |   16 ++---
 arch/s390/kvm/kvm-s390.c        |    2 +-
 arch/s390/lib/uaccess_pt.c      |    2 +-
 arch/s390/mm/pageattr.c         |    2 +-
 arch/s390/mm/vmem.c             |   24 +++----
 drivers/s390/char/sclp_cmd.c    |   10 +--
 include/asm-generic/pgtable.h   |   10 ---
 include/linux/page-flags.h      |    8 ---
 mm/rmap.c                       |   24 -------
 12 files changed, 112 insertions(+), 140 deletions(-)

-- 
1.7.9.5


* [PATCH] s390/mm: implement software dirty bits
  2013-02-05 18:12 [PATCH] software dirty bits for s390 Martin Schwidefsky
  2013-02-05 18:12 ` Martin Schwidefsky
@ 2013-02-05 18:12 ` Martin Schwidefsky
  2013-02-06 11:21   ` Mel Gorman
  2013-02-07  0:20   ` Hugh Dickins
  1 sibling, 2 replies; 11+ messages in thread
From: Martin Schwidefsky @ 2013-02-05 18:12 UTC (permalink / raw)
  To: linux-mm, linux-s390
  Cc: Hugh Dickins, Mel Gorman, Jan Kara, Christian Ehrhardt,
	Martin Schwidefsky

The s390 architecture is unique with respect to dirty page detection:
it uses the change bit in the per-page storage key to track page
modifications. All other architectures track dirty bits by means
of page table entries. This property of s390 has caused numerous
problems in the past, e.g. see git commit ef5d437f71afdf4a
"mm: fix XFS oops due to dirty pages without buffers on s390".

To avoid future issues in regard to per-page dirty bits, convert
s390 to a fault based software dirty bit detection mechanism. All
user page table entries which are marked as clean will be hardware
read-only, even if the pte is supposed to be writable. A write by
the user process will trigger a protection fault which will cause
the user pte to be marked as dirty and the hardware read-only bit
to be removed.
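
In terms of the new pte primitives below, the dirtying transition on a
write-protect fault amounts to the following sketch (simplified, using
the bit names introduced by this patch; the real fault path and
ptep_set_access_flags do more work than this):

	pte_t pte = *ptep;	/* clean + writable: _PAGE_SWW and _PAGE_RO set */
	if (pte_write(pte) && !pte_dirty(pte)) {
		/* pte_mkdirty sets _PAGE_SWC and clears _PAGE_RO */
		pte = pte_mkdirty(pte_mkyoung(pte));
		ptep_set_access_flags(vma, address, ptep, pte, 1);
	}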

With this change the dirty bit in the storage key is irrelevant
for Linux as a host, but the storage key is still required for
KVM guests. The effect is that page_test_and_clear_dirty and the
related code can be removed. The referenced bit in the storage
key is still used by the page_test_and_clear_young primitive to
provide page age information.
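
For reference, the referenced-bit primitive that remains is essentially
the existing helper in arch/s390/include/asm/page.h, unchanged by this
patch:

	#define __HAVE_ARCH_PAGE_TEST_AND_CLEAR_YOUNG
	static inline int page_test_and_clear_young(unsigned long pfn)
	{
		return page_reset_referenced(pfn << PAGE_SHIFT);
	}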

For page cache pages of mappings with mapping_cap_account_dirty
there will not be any change in behavior as the dirty bit tracking
already uses read-only ptes to control the amount of dirty pages.
Only for swap cache pages and pages of mappings without
mapping_cap_account_dirty can there be additional protection faults.
To avoid an excessive number of additional faults the mk_pte
primitive checks for PageDirty if the pgprot value allows for writes
and pre-dirties the pte. That avoids all additional faults for
tmpfs and shmem pages until these pages are added to the swap cache.
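
The resulting pte states, reconstructed from the pte_mk* helpers in the
patch below, are:

	_PAGE_SWW  _PAGE_SWC  hardware view
	    0          0      read-only  (clean, not writable)
	    0          1      read-only  (dirty, but write-protected)
	    1          0      read-only  (clean, a write faults and dirties)
	    1          1      read-write (dirty)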

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/include/asm/page.h    |   22 -------
 arch/s390/include/asm/pgtable.h |  131 ++++++++++++++++++++++++++-------------
 arch/s390/include/asm/sclp.h    |    1 -
 arch/s390/include/asm/setup.h   |   16 ++---
 arch/s390/kvm/kvm-s390.c        |    2 +-
 arch/s390/lib/uaccess_pt.c      |    2 +-
 arch/s390/mm/pageattr.c         |    2 +-
 arch/s390/mm/vmem.c             |   24 +++----
 drivers/s390/char/sclp_cmd.c    |   10 +--
 include/asm-generic/pgtable.h   |   10 ---
 include/linux/page-flags.h      |    8 ---
 mm/rmap.c                       |   24 -------
 12 files changed, 112 insertions(+), 140 deletions(-)

diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
index a86ad40840..75ce9b0 100644
--- a/arch/s390/include/asm/page.h
+++ b/arch/s390/include/asm/page.h
@@ -155,28 +155,6 @@ static inline int page_reset_referenced(unsigned long addr)
 #define _PAGE_ACC_BITS		0xf0	/* HW access control bits	*/
 
 /*
- * Test and clear dirty bit in storage key.
- * We can't clear the changed bit atomically. This is a potential
- * race against modification of the referenced bit. This function
- * should therefore only be called if it is not mapped in any
- * address space.
- *
- * Note that the bit gets set whenever page content is changed. That means
- * also when the page is modified by DMA or from inside the kernel.
- */
-#define __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
-static inline int page_test_and_clear_dirty(unsigned long pfn, int mapped)
-{
-	unsigned char skey;
-
-	skey = page_get_storage_key(pfn << PAGE_SHIFT);
-	if (!(skey & _PAGE_CHANGED))
-		return 0;
-	page_set_storage_key(pfn << PAGE_SHIFT, skey & ~_PAGE_CHANGED, mapped);
-	return 1;
-}
-
-/*
  * Test and clear referenced bit in storage key.
  */
 #define __HAVE_ARCH_PAGE_TEST_AND_CLEAR_YOUNG
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 098adbb..2b3d3b6 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -29,6 +29,7 @@
 #ifndef __ASSEMBLY__
 #include <linux/sched.h>
 #include <linux/mm_types.h>
+#include <linux/page-flags.h>
 #include <asm/bug.h>
 #include <asm/page.h>
 
@@ -221,13 +222,15 @@ extern unsigned long MODULES_END;
 /* Software bits in the page table entry */
 #define _PAGE_SWT	0x001		/* SW pte type bit t */
 #define _PAGE_SWX	0x002		/* SW pte type bit x */
-#define _PAGE_SWC	0x004		/* SW pte changed bit (for KVM) */
-#define _PAGE_SWR	0x008		/* SW pte referenced bit (for KVM) */
-#define _PAGE_SPECIAL	0x010		/* SW associated with special page */
+#define _PAGE_SWC	0x004		/* SW pte changed bit */
+#define _PAGE_SWR	0x008		/* SW pte referenced bit */
+#define _PAGE_SWW	0x010		/* SW pte write bit */
+#define _PAGE_SPECIAL	0x020		/* SW associated with special page */
 #define __HAVE_ARCH_PTE_SPECIAL
 
 /* Set of bits not changed in pte_modify */
-#define _PAGE_CHG_MASK	(PAGE_MASK | _PAGE_SPECIAL | _PAGE_SWC | _PAGE_SWR)
+#define _PAGE_CHG_MASK		(PAGE_MASK | _PAGE_SPECIAL | _PAGE_CO | \
+				 _PAGE_SWC | _PAGE_SWR)
 
 /* Six different types of pages. */
 #define _PAGE_TYPE_EMPTY	0x400
@@ -321,6 +324,7 @@ extern unsigned long MODULES_END;
 
 /* Bits in the region table entry */
 #define _REGION_ENTRY_ORIGIN	~0xfffUL/* region/segment table origin	    */
+#define _REGION_ENTRY_RO	0x200	/* region protection bit	    */
 #define _REGION_ENTRY_INV	0x20	/* invalid region table entry	    */
 #define _REGION_ENTRY_TYPE_MASK	0x0c	/* region/segment table type mask   */
 #define _REGION_ENTRY_TYPE_R1	0x0c	/* region first table type	    */
@@ -382,9 +386,10 @@ extern unsigned long MODULES_END;
  */
 #define PAGE_NONE	__pgprot(_PAGE_TYPE_NONE)
 #define PAGE_RO		__pgprot(_PAGE_TYPE_RO)
-#define PAGE_RW		__pgprot(_PAGE_TYPE_RW)
+#define PAGE_RW		__pgprot(_PAGE_TYPE_RO | _PAGE_SWW)
+#define PAGE_RWC	__pgprot(_PAGE_TYPE_RW | _PAGE_SWW | _PAGE_SWC)
 
-#define PAGE_KERNEL	PAGE_RW
+#define PAGE_KERNEL	PAGE_RWC
 #define PAGE_COPY	PAGE_RO
 
 /*
@@ -631,23 +636,23 @@ static inline pgste_t pgste_update_all(pte_t *ptep, pgste_t pgste)
 	bits = skey & (_PAGE_CHANGED | _PAGE_REFERENCED);
 	/* Clear page changed & referenced bit in the storage key */
 	if (bits & _PAGE_CHANGED)
-		page_set_storage_key(address, skey ^ bits, 1);
+		page_set_storage_key(address, skey ^ bits, 0);
 	else if (bits)
 		page_reset_referenced(address);
 	/* Transfer page changed & referenced bit to guest bits in pgste */
 	pgste_val(pgste) |= bits << 48;		/* RCP_GR_BIT & RCP_GC_BIT */
 	/* Get host changed & referenced bits from pgste */
 	bits |= (pgste_val(pgste) & (RCP_HR_BIT | RCP_HC_BIT)) >> 52;
-	/* Clear host bits in pgste. */
+	/* Transfer page changed & referenced bit to kvm user bits */
+	pgste_val(pgste) |= bits << 45;		/* KVM_UR_BIT & KVM_UC_BIT */
+	/* Clear relevant host bits in pgste. */
 	pgste_val(pgste) &= ~(RCP_HR_BIT | RCP_HC_BIT);
 	pgste_val(pgste) &= ~(RCP_ACC_BITS | RCP_FP_BIT);
 	/* Copy page access key and fetch protection bit to pgste */
 	pgste_val(pgste) |=
 		(unsigned long) (skey & (_PAGE_ACC_BITS | _PAGE_FP_BIT)) << 56;
-	/* Transfer changed and referenced to kvm user bits */
-	pgste_val(pgste) |= bits << 45;		/* KVM_UR_BIT & KVM_UC_BIT */
-	/* Transfer changed & referenced to pte sofware bits */
-	pte_val(*ptep) |= bits << 1;		/* _PAGE_SWR & _PAGE_SWC */
+	/* Transfer referenced bit to pte */
+	pte_val(*ptep) |= (bits & _PAGE_REFERENCED) << 1;
 #endif
 	return pgste;
 
@@ -660,20 +665,25 @@ static inline pgste_t pgste_update_young(pte_t *ptep, pgste_t pgste)
 
 	if (!pte_present(*ptep))
 		return pgste;
+	/* Get referenced bit from storage key */
 	young = page_reset_referenced(pte_val(*ptep) & PAGE_MASK);
-	/* Transfer page referenced bit to pte software bit (host view) */
-	if (young || (pgste_val(pgste) & RCP_HR_BIT))
+	if (young)
+		pgste_val(pgste) |= RCP_GR_BIT;
+	/* Get host referenced bit from pgste */
+	if (pgste_val(pgste) & RCP_HR_BIT) {
+		pgste_val(pgste) &= ~RCP_HR_BIT;
+		young = 1;
+	}
+	/* Transfer referenced bit to kvm user bits and pte */
+	if (young) {
+		pgste_val(pgste) |= KVM_UR_BIT;
 		pte_val(*ptep) |= _PAGE_SWR;
-	/* Clear host referenced bit in pgste. */
-	pgste_val(pgste) &= ~RCP_HR_BIT;
-	/* Transfer page referenced bit to guest bit in pgste */
-	pgste_val(pgste) |= (unsigned long) young << 50; /* set RCP_GR_BIT */
+	}
 #endif
 	return pgste;
-
 }
 
-static inline void pgste_set_pte(pte_t *ptep, pgste_t pgste, pte_t entry)
+static inline void pgste_set_key(pte_t *ptep, pgste_t pgste, pte_t entry)
 {
 #ifdef CONFIG_PGSTE
 	unsigned long address;
@@ -687,10 +697,23 @@ static inline void pgste_set_pte(pte_t *ptep, pgste_t pgste, pte_t entry)
 	/* Set page access key and fetch protection bit from pgste */
 	nkey |= (pgste_val(pgste) & (RCP_ACC_BITS | RCP_FP_BIT)) >> 56;
 	if (okey != nkey)
-		page_set_storage_key(address, nkey, 1);
+		page_set_storage_key(address, nkey, 0);
 #endif
 }
 
+static inline void pgste_set_pte(pte_t *ptep, pte_t entry)
+{
+	if (!MACHINE_HAS_ESOP && (pte_val(entry) & _PAGE_SWW)) {
+		/*
+		 * Without enhanced suppression-on-protection force
+		 * the dirty bit on for all writable ptes.
+		 */
+		pte_val(entry) |= _PAGE_SWC;
+		pte_val(entry) &= ~_PAGE_RO;
+	}
+	*ptep = entry;
+}
+
 /**
  * struct gmap_struct - guest address space
  * @mm: pointer to the parent mm_struct
@@ -749,11 +772,14 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 
 	if (mm_has_pgste(mm)) {
 		pgste = pgste_get_lock(ptep);
-		pgste_set_pte(ptep, pgste, entry);
-		*ptep = entry;
+		pgste_set_key(ptep, pgste, entry);
+		pgste_set_pte(ptep, entry);
 		pgste_set_unlock(ptep, pgste);
-	} else
+	} else {
+		if (!(pte_val(entry) & _PAGE_INVALID) && MACHINE_HAS_EDAT1)
+			pte_val(entry) |= _PAGE_CO;
 		*ptep = entry;
+	}
 }
 
 /*
@@ -762,16 +788,12 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
  */
 static inline int pte_write(pte_t pte)
 {
-	return (pte_val(pte) & _PAGE_RO) == 0;
+	return (pte_val(pte) & _PAGE_SWW) != 0;
 }
 
 static inline int pte_dirty(pte_t pte)
 {
-#ifdef CONFIG_PGSTE
-	if (pte_val(pte) & _PAGE_SWC)
-		return 1;
-#endif
-	return 0;
+	return (pte_val(pte) & _PAGE_SWC) != 0;
 }
 
 static inline int pte_young(pte_t pte)
@@ -821,11 +843,14 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
 	pte_val(pte) &= _PAGE_CHG_MASK;
 	pte_val(pte) |= pgprot_val(newprot);
+	if ((pte_val(pte) & _PAGE_SWC) && (pte_val(pte) & _PAGE_SWW))
+		pte_val(pte) &= ~_PAGE_RO;
 	return pte;
 }
 
 static inline pte_t pte_wrprotect(pte_t pte)
 {
+	pte_val(pte) &= ~_PAGE_SWW;
 	/* Do not clobber _PAGE_TYPE_NONE pages!  */
 	if (!(pte_val(pte) & _PAGE_INVALID))
 		pte_val(pte) |= _PAGE_RO;
@@ -834,20 +859,26 @@ static inline pte_t pte_wrprotect(pte_t pte)
 
 static inline pte_t pte_mkwrite(pte_t pte)
 {
-	pte_val(pte) &= ~_PAGE_RO;
+	pte_val(pte) |= _PAGE_SWW;
+	if (pte_val(pte) & _PAGE_SWC)
+		pte_val(pte) &= ~_PAGE_RO;
 	return pte;
 }
 
 static inline pte_t pte_mkclean(pte_t pte)
 {
-#ifdef CONFIG_PGSTE
 	pte_val(pte) &= ~_PAGE_SWC;
-#endif
+	/* Do not clobber _PAGE_TYPE_NONE pages!  */
+	if (!(pte_val(pte) & _PAGE_INVALID))
+		pte_val(pte) |= _PAGE_RO;
 	return pte;
 }
 
 static inline pte_t pte_mkdirty(pte_t pte)
 {
+	pte_val(pte) |= _PAGE_SWC;
+	if (pte_val(pte) & _PAGE_SWW)
+		pte_val(pte) &= ~_PAGE_RO;
 	return pte;
 }
 
@@ -885,10 +916,10 @@ static inline pte_t pte_mkhuge(pte_t pte)
 		pte_val(pte) |= _SEGMENT_ENTRY_INV;
 	}
 	/*
-	 * Clear SW pte bits SWT and SWX, there are no SW bits in a segment
-	 * table entry.
+	 * Clear SW pte bits, there are no SW bits in a segment table entry.
 	 */
-	pte_val(pte) &= ~(_PAGE_SWT | _PAGE_SWX);
+	pte_val(pte) &= ~(_PAGE_SWT | _PAGE_SWX | _PAGE_SWC |
+			  _PAGE_SWR | _PAGE_SWW);
 	/*
 	 * Also set the change-override bit because we don't need dirty bit
 	 * tracking for hugetlbfs pages.
@@ -1040,9 +1071,11 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
 					   unsigned long address,
 					   pte_t *ptep, pte_t pte)
 {
-	*ptep = pte;
-	if (mm_has_pgste(mm))
+	if (mm_has_pgste(mm)) {
+		pgste_set_pte(ptep, pte);
 		pgste_set_unlock(ptep, *(pgste_t *)(ptep + PTRS_PER_PTE));
+	} else
+		*ptep = pte;
 }
 
 #define __HAVE_ARCH_PTEP_CLEAR_FLUSH
@@ -1110,10 +1143,13 @@ static inline pte_t ptep_set_wrprotect(struct mm_struct *mm,
 
 		if (!mm_exclusive(mm))
 			__ptep_ipte(address, ptep);
-		*ptep = pte_wrprotect(pte);
+		pte = pte_wrprotect(pte);
 
-		if (mm_has_pgste(mm))
+		if (mm_has_pgste(mm)) {
+			pgste_set_pte(ptep, pte);
 			pgste_set_unlock(ptep, pgste);
+		} else
+			*ptep = pte;
 	}
 	return pte;
 }
@@ -1131,10 +1167,12 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
 		pgste = pgste_get_lock(ptep);
 
 	__ptep_ipte(address, ptep);
-	*ptep = entry;
 
-	if (mm_has_pgste(vma->vm_mm))
+	if (mm_has_pgste(vma->vm_mm)) {
+		pgste_set_pte(ptep, entry);
 		pgste_set_unlock(ptep, pgste);
+	} else
+		*ptep = entry;
 	return 1;
 }
 
@@ -1152,8 +1190,13 @@ static inline pte_t mk_pte_phys(unsigned long physpage, pgprot_t pgprot)
 static inline pte_t mk_pte(struct page *page, pgprot_t pgprot)
 {
 	unsigned long physpage = page_to_phys(page);
+	pte_t __pte = mk_pte_phys(physpage, pgprot);
 
-	return mk_pte_phys(physpage, pgprot);
+	if ((pte_val(__pte) & _PAGE_SWW) && PageDirty(page)) {
+		pte_val(__pte) |= _PAGE_SWC;
+		pte_val(__pte) &= ~_PAGE_RO;
+	}
+	return __pte;
 }
 
 #define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD-1))
@@ -1245,6 +1288,8 @@ static inline int pmd_trans_splitting(pmd_t pmd)
 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 			      pmd_t *pmdp, pmd_t entry)
 {
+	if (!(pmd_val(entry) & _SEGMENT_ENTRY_INV) && MACHINE_HAS_EDAT1)
+		pmd_val(entry) |= _SEGMENT_ENTRY_CO;
 	*pmdp = entry;
 }
 
diff --git a/arch/s390/include/asm/sclp.h b/arch/s390/include/asm/sclp.h
index 8337886..06a1361 100644
--- a/arch/s390/include/asm/sclp.h
+++ b/arch/s390/include/asm/sclp.h
@@ -46,7 +46,6 @@ int sclp_cpu_deconfigure(u8 cpu);
 void sclp_facilities_detect(void);
 unsigned long long sclp_get_rnmax(void);
 unsigned long long sclp_get_rzm(void);
-u8 sclp_get_fac85(void);
 int sclp_sdias_blk_count(void);
 int sclp_sdias_copy(void *dest, int blk_num, int nr_blks);
 int sclp_chp_configure(struct chp_id chpid);
diff --git a/arch/s390/include/asm/setup.h b/arch/s390/include/asm/setup.h
index f69f76b..f685751 100644
--- a/arch/s390/include/asm/setup.h
+++ b/arch/s390/include/asm/setup.h
@@ -64,13 +64,14 @@ extern unsigned int s390_user_mode;
 
 #define MACHINE_FLAG_VM		(1UL << 0)
 #define MACHINE_FLAG_IEEE	(1UL << 1)
-#define MACHINE_FLAG_CSP	(1UL << 3)
-#define MACHINE_FLAG_MVPG	(1UL << 4)
-#define MACHINE_FLAG_DIAG44	(1UL << 5)
-#define MACHINE_FLAG_IDTE	(1UL << 6)
-#define MACHINE_FLAG_DIAG9C	(1UL << 7)
-#define MACHINE_FLAG_MVCOS	(1UL << 8)
-#define MACHINE_FLAG_KVM	(1UL << 9)
+#define MACHINE_FLAG_CSP	(1UL << 2)
+#define MACHINE_FLAG_MVPG	(1UL << 3)
+#define MACHINE_FLAG_DIAG44	(1UL << 4)
+#define MACHINE_FLAG_IDTE	(1UL << 5)
+#define MACHINE_FLAG_DIAG9C	(1UL << 6)
+#define MACHINE_FLAG_MVCOS	(1UL << 7)
+#define MACHINE_FLAG_KVM	(1UL << 8)
+#define MACHINE_FLAG_ESOP	(1UL << 9)
 #define MACHINE_FLAG_EDAT1	(1UL << 10)
 #define MACHINE_FLAG_EDAT2	(1UL << 11)
 #define MACHINE_FLAG_LPAR	(1UL << 12)
@@ -84,6 +85,7 @@ extern unsigned int s390_user_mode;
 #define MACHINE_IS_LPAR		(S390_lowcore.machine_flags & MACHINE_FLAG_LPAR)
 
 #define MACHINE_HAS_DIAG9C	(S390_lowcore.machine_flags & MACHINE_FLAG_DIAG9C)
+#define MACHINE_HAS_ESOP	(S390_lowcore.machine_flags & MACHINE_FLAG_ESOP)
 #define MACHINE_HAS_PFMF	MACHINE_HAS_EDAT1
 #define MACHINE_HAS_HPAGE	MACHINE_HAS_EDAT1
 
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index f090e81..2923781 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -147,7 +147,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 		r = KVM_MAX_VCPUS;
 		break;
 	case KVM_CAP_S390_COW:
-		r = sclp_get_fac85() & 0x2;
+		r = MACHINE_HAS_ESOP;
 		break;
 	default:
 		r = 0;
diff --git a/arch/s390/lib/uaccess_pt.c b/arch/s390/lib/uaccess_pt.c
index 9017a63..a70ee84 100644
--- a/arch/s390/lib/uaccess_pt.c
+++ b/arch/s390/lib/uaccess_pt.c
@@ -50,7 +50,7 @@ static __always_inline unsigned long follow_table(struct mm_struct *mm,
 	ptep = pte_offset_map(pmd, addr);
 	if (!pte_present(*ptep))
 		return -0x11UL;
-	if (write && !pte_write(*ptep))
+	if (write && (!pte_write(*ptep) || !pte_dirty(*ptep)))
 		return -0x04UL;
 
 	return (pte_val(*ptep) & PAGE_MASK) + (addr & ~PAGE_MASK);
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index 29ccee3..d21040e 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -127,7 +127,7 @@ void kernel_map_pages(struct page *page, int numpages, int enable)
 			pte_val(*pte) = _PAGE_TYPE_EMPTY;
 			continue;
 		}
-		*pte = mk_pte_phys(address, __pgprot(_PAGE_TYPE_RW));
+		pte_val(*pte) = __pa(address);
 	}
 }
 
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 6ed1426..79699f46 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -85,11 +85,9 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
 	pud_t *pu_dir;
 	pmd_t *pm_dir;
 	pte_t *pt_dir;
-	pte_t  pte;
 	int ret = -ENOMEM;
 
 	while (address < end) {
-		pte = mk_pte_phys(address, __pgprot(ro ? _PAGE_RO : 0));
 		pg_dir = pgd_offset_k(address);
 		if (pgd_none(*pg_dir)) {
 			pu_dir = vmem_pud_alloc();
@@ -101,9 +99,9 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
 #if defined(CONFIG_64BIT) && !defined(CONFIG_DEBUG_PAGEALLOC)
 		if (MACHINE_HAS_EDAT2 && pud_none(*pu_dir) && address &&
 		    !(address & ~PUD_MASK) && (address + PUD_SIZE <= end)) {
-			pte_val(pte) |= _REGION3_ENTRY_LARGE;
-			pte_val(pte) |= _REGION_ENTRY_TYPE_R3;
-			pud_val(*pu_dir) = pte_val(pte);
+			pud_val(*pu_dir) = __pa(address) |
+				_REGION_ENTRY_TYPE_R3 | _REGION3_ENTRY_LARGE |
+				(ro ? _REGION_ENTRY_RO : 0);
 			address += PUD_SIZE;
 			continue;
 		}
@@ -118,8 +116,9 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
 #if defined(CONFIG_64BIT) && !defined(CONFIG_DEBUG_PAGEALLOC)
 		if (MACHINE_HAS_EDAT1 && pmd_none(*pm_dir) && address &&
 		    !(address & ~PMD_MASK) && (address + PMD_SIZE <= end)) {
-			pte_val(pte) |= _SEGMENT_ENTRY_LARGE;
-			pmd_val(*pm_dir) = pte_val(pte);
+			pmd_val(*pm_dir) = __pa(address) |
+				_SEGMENT_ENTRY | _SEGMENT_ENTRY_LARGE |
+				(ro ? _SEGMENT_ENTRY_RO : 0);
 			address += PMD_SIZE;
 			continue;
 		}
@@ -132,7 +131,7 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
 		}
 
 		pt_dir = pte_offset_kernel(pm_dir, address);
-		*pt_dir = pte;
+		pte_val(*pt_dir) = __pa(address) | (ro ? _PAGE_RO : 0);
 		address += PAGE_SIZE;
 	}
 	ret = 0;
@@ -199,7 +198,6 @@ int __meminit vmemmap_populate(struct page *start, unsigned long nr, int node)
 	pud_t *pu_dir;
 	pmd_t *pm_dir;
 	pte_t *pt_dir;
-	pte_t  pte;
 	int ret = -ENOMEM;
 
 	start_addr = (unsigned long) start;
@@ -237,9 +235,8 @@ int __meminit vmemmap_populate(struct page *start, unsigned long nr, int node)
 				new_page = vmemmap_alloc_block(PMD_SIZE, node);
 				if (!new_page)
 					goto out;
-				pte = mk_pte_phys(__pa(new_page), PAGE_RW);
-				pte_val(pte) |= _SEGMENT_ENTRY_LARGE;
-				pmd_val(*pm_dir) = pte_val(pte);
+				pmd_val(*pm_dir) = __pa(new_page) |
+					_SEGMENT_ENTRY | _SEGMENT_ENTRY_LARGE;
 				address = (address + PMD_SIZE) & PMD_MASK;
 				continue;
 			}
@@ -260,8 +257,7 @@ int __meminit vmemmap_populate(struct page *start, unsigned long nr, int node)
 			new_page =__pa(vmem_alloc_pages(0));
 			if (!new_page)
 				goto out;
-			pte = pfn_pte(new_page >> PAGE_SHIFT, PAGE_KERNEL);
-			*pt_dir = pte;
+			pte_val(*pt_dir) = __pa(new_page);
 		}
 		address += PAGE_SIZE;
 	}
diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index c44d13f..30a2255 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -56,7 +56,6 @@ static int __initdata early_read_info_sccb_valid;
 
 u64 sclp_facilities;
 static u8 sclp_fac84;
-static u8 sclp_fac85;
 static unsigned long long rzm;
 static unsigned long long rnmax;
 
@@ -131,7 +130,8 @@ void __init sclp_facilities_detect(void)
 	sccb = &early_read_info_sccb;
 	sclp_facilities = sccb->facilities;
 	sclp_fac84 = sccb->fac84;
-	sclp_fac85 = sccb->fac85;
+	if (sccb->fac85 & 0x02)
+		S390_lowcore.machine_flags |= MACHINE_FLAG_ESOP;
 	rnmax = sccb->rnmax ? sccb->rnmax : sccb->rnmax2;
 	rzm = sccb->rnsize ? sccb->rnsize : sccb->rnsize2;
 	rzm <<= 20;
@@ -171,12 +171,6 @@ unsigned long long sclp_get_rzm(void)
 	return rzm;
 }
 
-u8 sclp_get_fac85(void)
-{
-	return sclp_fac85;
-}
-EXPORT_SYMBOL_GPL(sclp_get_fac85);
-
 /*
  * This function will be called after sclp_facilities_detect(), which gets
  * called from early.c code. Therefore the sccb should have valid contents.
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 5cf680a..bfd8768 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -197,16 +197,6 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
-#ifndef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
-#define page_test_and_clear_dirty(pfn, mapped)	(0)
-#endif
-
-#ifndef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
-#define pte_maybe_dirty(pte)		pte_dirty(pte)
-#else
-#define pte_maybe_dirty(pte)		(1)
-#endif
-
 #ifndef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_YOUNG
 #define page_test_and_clear_young(pfn) (0)
 #endif
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 70473da..6d53675 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -303,21 +303,13 @@ static inline void __SetPageUptodate(struct page *page)
 
 static inline void SetPageUptodate(struct page *page)
 {
-#ifdef CONFIG_S390
-	if (!test_and_set_bit(PG_uptodate, &page->flags))
-		page_set_storage_key(page_to_phys(page), PAGE_DEFAULT_KEY, 0);
-#else
 	/*
 	 * Memory barrier must be issued before setting the PG_uptodate bit,
 	 * so that all previous stores issued in order to bring the page
 	 * uptodate are actually visible before PageUptodate becomes true.
-	 *
-	 * s390 doesn't need an explicit smp_wmb here because the test and
-	 * set bit already provides full barriers.
 	 */
 	smp_wmb();
 	set_bit(PG_uptodate, &(page)->flags);
-#endif
 }
 
 CLEARPAGEFLAG(Uptodate, uptodate)
diff --git a/mm/rmap.c b/mm/rmap.c
index 2c78f8c..3d38edf 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1126,7 +1126,6 @@ void page_add_file_rmap(struct page *page)
  */
 void page_remove_rmap(struct page *page)
 {
-	struct address_space *mapping = page_mapping(page);
 	bool anon = PageAnon(page);
 	bool locked;
 	unsigned long flags;
@@ -1144,29 +1143,6 @@ void page_remove_rmap(struct page *page)
 		goto out;
 
 	/*
-	 * Now that the last pte has gone, s390 must transfer dirty
-	 * flag from storage key to struct page.  We can usually skip
-	 * this if the page is anon, so about to be freed; but perhaps
-	 * not if it's in swapcache - there might be another pte slot
-	 * containing the swap entry, but page not yet written to swap.
-	 *
-	 * And we can skip it on file pages, so long as the filesystem
-	 * participates in dirty tracking (note that this is not only an
-	 * optimization but also solves problems caused by dirty flag in
-	 * storage key getting set by a write from inside kernel); but need to
-	 * catch shm and tmpfs and ramfs pages which have been modified since
-	 * creation by read fault.
-	 *
-	 * Note that mapping must be decided above, before decrementing
-	 * mapcount (which luckily provides a barrier): once page is unmapped,
-	 * it could be truncated and page->mapping reset to NULL at any moment.
-	 * Note also that we are relying on page_mapping(page) to set mapping
-	 * to &swapper_space when PageSwapCache(page).
-	 */
-	if (mapping && !mapping_cap_account_dirty(mapping) &&
-	    page_test_and_clear_dirty(page_to_pfn(page), 1))
-		set_page_dirty(page);
-	/*
 	 * Hugepages are not counted in NR_ANON_PAGES nor NR_FILE_MAPPED
 	 * and not charged by memcg for now.
 	 */
-- 
1.7.9.5


* Re: [PATCH] software dirty bits for s390
  2013-02-05 18:12 ` Martin Schwidefsky
@ 2013-02-05 19:28   ` Martin Schwidefsky
  0 siblings, 0 replies; 11+ messages in thread
From: Martin Schwidefsky @ 2013-02-05 19:28 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: linux-mm, linux-s390, Hugh Dickins, Mel Gorman, Jan Kara,
	Christian Ehrhardt

On Tue,  5 Feb 2013 10:12:04 -0800
Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:

> Greetings,
> 
> good news: I got performance results for a selected set of workloads
> with my software dirty bit patch (thanks Christian!). We found no
> downsides to the software dirty bits, and a substantial improvement
> in CPU utilization for the FIO test with mostly read mappings.
> 
> The patch can now go upstream.

Grumpf, 0000-cover-letter.patch~ in the outgoing directory.
Please ignore.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


* Re: [PATCH] s390/mm: implement software dirty bits
  2013-02-05 18:12 ` [PATCH] s390/mm: implement software dirty bits Martin Schwidefsky
@ 2013-02-06 11:21   ` Mel Gorman
  2013-02-06 18:16     ` Martin Schwidefsky
  2013-02-07  0:20   ` Hugh Dickins
  1 sibling, 1 reply; 11+ messages in thread
From: Mel Gorman @ 2013-02-06 11:21 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: linux-mm, linux-s390, Hugh Dickins, Jan Kara, Christian Ehrhardt

On Tue, Feb 05, 2013 at 10:12:05AM -0800, Martin Schwidefsky wrote:
> The s390 architecture is unique with respect to dirty page detection:
> it uses the change bit in the per-page storage key to track page
> modifications. All other architectures track dirty bits by means
> of page table entries. This property of s390 has caused numerous
> problems in the past, e.g. see git commit ef5d437f71afdf4a
> "mm: fix XFS oops due to dirty pages without buffers on s390".
> 
> To avoid future issues in regard to per-page dirty bits, convert
> s390 to a fault based software dirty bit detection mechanism. All
> user page table entries which are marked as clean will be hardware
> read-only, even if the pte is supposed to be writable. A write by
> the user process will trigger a protection fault which will cause
> the user pte to be marked as dirty and the hardware read-only bit
> to be removed.
> 
> With this change the dirty bit in the storage key is irrelevant
> for Linux as a host, but the storage key is still required for
> KVM guests. The effect is that page_test_and_clear_dirty and the
> related code can be removed. The referenced bit in the storage
> key is still used by the page_test_and_clear_young primitive to
> provide page age information.
> 
> For page cache pages of mappings with mapping_cap_account_dirty
> there will not be any change in behavior as the dirty bit tracking
> already uses read-only ptes to control the amount of dirty pages.
> Only for swap cache pages and pages of mappings without
> mapping_cap_account_dirty can there be additional protection faults.
> To avoid an excessive number of additional faults the mk_pte
> primitive checks for PageDirty if the pgprot value allows for writes
> and pre-dirties the pte. That avoids all additional faults for
> tmpfs and shmem pages until these pages are added to the swap cache.
> 
> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

I have a few clarifications below just to make sure I'm reading it right
but I think it looks fine and the change to mm/rmap.c is welcome.

> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 098adbb..2b3d3b6 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -29,6 +29,7 @@
>  #ifndef __ASSEMBLY__
>  #include <linux/sched.h>
>  #include <linux/mm_types.h>
> +#include <linux/page-flags.h>
>  #include <asm/bug.h>
>  #include <asm/page.h>
>  
> @@ -221,13 +222,15 @@ extern unsigned long MODULES_END;
>  /* Software bits in the page table entry */
>  #define _PAGE_SWT	0x001		/* SW pte type bit t */
>  #define _PAGE_SWX	0x002		/* SW pte type bit x */
> -#define _PAGE_SWC	0x004		/* SW pte changed bit (for KVM) */
> -#define _PAGE_SWR	0x008		/* SW pte referenced bit (for KVM) */
> -#define _PAGE_SPECIAL	0x010		/* SW associated with special page */
> +#define _PAGE_SWC	0x004		/* SW pte changed bit */
> +#define _PAGE_SWR	0x008		/* SW pte referenced bit */
> +#define _PAGE_SWW	0x010		/* SW pte write bit */
> +#define _PAGE_SPECIAL	0x020		/* SW associated with special page */
>  #define __HAVE_ARCH_PTE_SPECIAL
>  
>  /* Set of bits not changed in pte_modify */
> -#define _PAGE_CHG_MASK	(PAGE_MASK | _PAGE_SPECIAL | _PAGE_SWC | _PAGE_SWR)
> +#define _PAGE_CHG_MASK		(PAGE_MASK | _PAGE_SPECIAL | _PAGE_CO | \
> +				 _PAGE_SWC | _PAGE_SWR)
>  

If I'm reading it right, the _PAGE_CO bit is what allows you to force
the hardware to trap even if the PTE says the page is writable. This is
what necessitates the shuffling of _PAGE_SPECIAL so you have a software
write bit and a hardware write bit.

Existing distributions might not be able to use this patch for
bugs like "mm: fix XFS oops due to dirty pages without buffers on s390"
because this shuffling of bits will break kABI, but that's not your problem.

>  /* Six different types of pages. */
>  #define _PAGE_TYPE_EMPTY	0x400
> @@ -321,6 +324,7 @@ extern unsigned long MODULES_END;
>  
>  /* Bits in the region table entry */
>  #define _REGION_ENTRY_ORIGIN	~0xfffUL/* region/segment table origin	    */
> +#define _REGION_ENTRY_RO	0x200	/* region protection bit	    */
>  #define _REGION_ENTRY_INV	0x20	/* invalid region table entry	    */
>  #define _REGION_ENTRY_TYPE_MASK	0x0c	/* region/segment table type mask   */
>  #define _REGION_ENTRY_TYPE_R1	0x0c	/* region first table type	    */
> @@ -382,9 +386,10 @@ extern unsigned long MODULES_END;
>   */
>  #define PAGE_NONE	__pgprot(_PAGE_TYPE_NONE)
>  #define PAGE_RO		__pgprot(_PAGE_TYPE_RO)
> -#define PAGE_RW		__pgprot(_PAGE_TYPE_RW)
> +#define PAGE_RW		__pgprot(_PAGE_TYPE_RO | _PAGE_SWW)
> +#define PAGE_RWC	__pgprot(_PAGE_TYPE_RW | _PAGE_SWW | _PAGE_SWC)
>  
> -#define PAGE_KERNEL	PAGE_RW
> +#define PAGE_KERNEL	PAGE_RWC
>  #define PAGE_COPY	PAGE_RO
>  
>  /*

And this combination of page bits looks consistent. The details are
heavily buried in the arch code so I hope whoever deals with this area
in the future spots the changelog. Of course, that guy is likely to be
you anyway :)

> @@ -631,23 +636,23 @@ static inline pgste_t pgste_update_all(pte_t *ptep, pgste_t pgste)
>  	bits = skey & (_PAGE_CHANGED | _PAGE_REFERENCED);
>  	/* Clear page changed & referenced bit in the storage key */
>  	if (bits & _PAGE_CHANGED)
> -		page_set_storage_key(address, skey ^ bits, 1);
> +		page_set_storage_key(address, skey ^ bits, 0);
>  	else if (bits)
>  		page_reset_referenced(address);
>  	/* Transfer page changed & referenced bit to guest bits in pgste */
>  	pgste_val(pgste) |= bits << 48;		/* RCP_GR_BIT & RCP_GC_BIT */
>  	/* Get host changed & referenced bits from pgste */
>  	bits |= (pgste_val(pgste) & (RCP_HR_BIT | RCP_HC_BIT)) >> 52;
> -	/* Clear host bits in pgste. */
> +	/* Transfer page changed & referenced bit to kvm user bits */
> +	pgste_val(pgste) |= bits << 45;		/* KVM_UR_BIT & KVM_UC_BIT */
> +	/* Clear relevant host bits in pgste. */
>  	pgste_val(pgste) &= ~(RCP_HR_BIT | RCP_HC_BIT);
>  	pgste_val(pgste) &= ~(RCP_ACC_BITS | RCP_FP_BIT);
>  	/* Copy page access key and fetch protection bit to pgste */
>  	pgste_val(pgste) |=
>  		(unsigned long) (skey & (_PAGE_ACC_BITS | _PAGE_FP_BIT)) << 56;
> -	/* Transfer changed and referenced to kvm user bits */
> -	pgste_val(pgste) |= bits << 45;		/* KVM_UR_BIT & KVM_UC_BIT */
> -	/* Transfer changed & referenced to pte sofware bits */
> -	pte_val(*ptep) |= bits << 1;		/* _PAGE_SWR & _PAGE_SWC */
> +	/* Transfer referenced bit to pte */
> +	pte_val(*ptep) |= (bits & _PAGE_REFERENCED) << 1;
>  #endif
>  	return pgste;
>  
> @@ -660,20 +665,25 @@ static inline pgste_t pgste_update_young(pte_t *ptep, pgste_t pgste)
>  
>  	if (!pte_present(*ptep))
>  		return pgste;
> +	/* Get referenced bit from storage key */
>  	young = page_reset_referenced(pte_val(*ptep) & PAGE_MASK);
> -	/* Transfer page referenced bit to pte software bit (host view) */
> -	if (young || (pgste_val(pgste) & RCP_HR_BIT))
> +	if (young)
> +		pgste_val(pgste) |= RCP_GR_BIT;
> +	/* Get host referenced bit from pgste */
> +	if (pgste_val(pgste) & RCP_HR_BIT) {
> +		pgste_val(pgste) &= ~RCP_HR_BIT;
> +		young = 1;
> +	}
> +	/* Transfer referenced bit to kvm user bits and pte */
> +	if (young) {
> +		pgste_val(pgste) |= KVM_UR_BIT;
>  		pte_val(*ptep) |= _PAGE_SWR;
> -	/* Clear host referenced bit in pgste. */
> -	pgste_val(pgste) &= ~RCP_HR_BIT;
> -	/* Transfer page referenced bit to guest bit in pgste */
> -	pgste_val(pgste) |= (unsigned long) young << 50; /* set RCP_GR_BIT */
> +	}
>  #endif
>  	return pgste;
> -
>  }
>  
> -static inline void pgste_set_pte(pte_t *ptep, pgste_t pgste, pte_t entry)
> +static inline void pgste_set_key(pte_t *ptep, pgste_t pgste, pte_t entry)
>  {
>  #ifdef CONFIG_PGSTE
>  	unsigned long address;
> @@ -687,10 +697,23 @@ static inline void pgste_set_pte(pte_t *ptep, pgste_t pgste, pte_t entry)
>  	/* Set page access key and fetch protection bit from pgste */
>  	nkey |= (pgste_val(pgste) & (RCP_ACC_BITS | RCP_FP_BIT)) >> 56;
>  	if (okey != nkey)
> -		page_set_storage_key(address, nkey, 1);
> +		page_set_storage_key(address, nkey, 0);
>  #endif
>  }
>  
> +static inline void pgste_set_pte(pte_t *ptep, pte_t entry)
> +{
> +	if (!MACHINE_HAS_ESOP && (pte_val(entry) & _PAGE_SWW)) {
> +		/*
> +		 * Without enhanced suppression-on-protection force
> +		 * the dirty bit on for all writable ptes.
> +		 */
> +		pte_val(entry) |= _PAGE_SWC;
> +		pte_val(entry) &= ~_PAGE_RO;
> +	}
> +	*ptep = entry;
> +}
> +
>  /**
>   * struct gmap_struct - guest address space
>   * @mm: pointer to the parent mm_struct

So establishing a writable PTE clears the hardware RO override as
well and the changed bit is set so it'll be considered dirty.

Reading down through the other page table updates, I couldn't spot a place
where you were inconsistent with the handling of the hardware _PAGE_CO
and _PAGE_RO. Writes were allowed when the change bit was already set.
I couldn't follow it all, particularly around the KVM bits so take it with
a grain of salt but for me anyway;

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] s390/mm: implement software dirty bits
  2013-02-06 11:21   ` Mel Gorman
@ 2013-02-06 18:16     ` Martin Schwidefsky
  0 siblings, 0 replies; 11+ messages in thread
From: Martin Schwidefsky @ 2013-02-06 18:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-s390, Hugh Dickins, Jan Kara, Christian Ehrhardt

Hi Mel,

On Wed, 6 Feb 2013 11:21:11 +0000
Mel Gorman <mgorman@suse.de> wrote:

> On Tue, Feb 05, 2013 at 10:12:05AM -0800, Martin Schwidefsky wrote:
> > The s390 architecture is unique with respect to dirty page detection:
> > it uses the change bit in the per-page storage key to track page
> > modifications. All other architectures track dirty bits by means
> > of page table entries. This property of s390 has caused numerous
> > problems in the past, e.g. see git commit ef5d437f71afdf4a
> > "mm: fix XFS oops due to dirty pages without buffers on s390".
> > 
> > To avoid future issues in regard to per-page dirty bits, convert
> > s390 to a fault based software dirty bit detection mechanism. All
> > user page table entries which are marked as clean will be hardware
> > read-only, even if the pte is supposed to be writable. A write by
> > the user process will trigger a protection fault which will cause
> > the user pte to be marked as dirty and the hardware read-only bit
> > to be removed.
> > 
> > With this change the dirty bit in the storage key is irrelevant
> > for Linux as a host, but the storage key is still required for
> > KVM guests. The effect is that page_test_and_clear_dirty and the
> > related code can be removed. The referenced bit in the storage
> > key is still used by the page_test_and_clear_young primitive to
> > provide page age information.
> > 
> > For page cache pages of mappings with mapping_cap_account_dirty
> > there will not be any change in behavior as the dirty bit tracking
> > already uses read-only ptes to control the amount of dirty pages.
> > Only for swap cache pages and pages of mappings without
> > mapping_cap_account_dirty can there be additional protection faults.
> > To avoid an excessive number of additional faults the mk_pte
> > primitive checks for PageDirty if the pgprot value allows for writes
> > and pre-dirties the pte. That avoids all additional faults for
> > tmpfs and shmem pages until these pages are added to the swap cache.
> > 
> > Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
> 
> I have a few clarifications below just to make sure I'm reading it right
> but I think it looks fine and the change to mm/rmap.c is welcome.

It is welcome to me as well. That XFS problem really convinced me that this
is the right way to go.

> > diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> > index 098adbb..2b3d3b6 100644
> > --- a/arch/s390/include/asm/pgtable.h
> > +++ b/arch/s390/include/asm/pgtable.h
> > @@ -29,6 +29,7 @@
> >  #ifndef __ASSEMBLY__
> >  #include <linux/sched.h>
> >  #include <linux/mm_types.h>
> > +#include <linux/page-flags.h>
> >  #include <asm/bug.h>
> >  #include <asm/page.h>
> >  
> > @@ -221,13 +222,15 @@ extern unsigned long MODULES_END;
> >  /* Software bits in the page table entry */
> >  #define _PAGE_SWT	0x001		/* SW pte type bit t */
> >  #define _PAGE_SWX	0x002		/* SW pte type bit x */
> > -#define _PAGE_SWC	0x004		/* SW pte changed bit (for KVM) */
> > -#define _PAGE_SWR	0x008		/* SW pte referenced bit (for KVM) */
> > -#define _PAGE_SPECIAL	0x010		/* SW associated with special page */
> > +#define _PAGE_SWC	0x004		/* SW pte changed bit */
> > +#define _PAGE_SWR	0x008		/* SW pte referenced bit */
> > +#define _PAGE_SWW	0x010		/* SW pte write bit */
> > +#define _PAGE_SPECIAL	0x020		/* SW associated with special page */
> >  #define __HAVE_ARCH_PTE_SPECIAL
> >  
> >  /* Set of bits not changed in pte_modify */
> > -#define _PAGE_CHG_MASK	(PAGE_MASK | _PAGE_SPECIAL | _PAGE_SWC | _PAGE_SWR)
> > +#define _PAGE_CHG_MASK		(PAGE_MASK | _PAGE_SPECIAL | _PAGE_CO | \
> > +				 _PAGE_SWC | _PAGE_SWR)
> >  
> 
> If I'm reading it right, the _PAGE_CO bit is what allows you to force
> the hardware to trap even if the PTE says the page is writable. This is
> what necessitates the shuffling of _PAGE_SPECIAL so you have a software
> write bit and a hardware write bit.

No, the _PAGE_CO bit is the change-bit-override. This allows the hardware
to avoid setting the dirty bit in the storage key for a write access
through a pte. It is an optimization; the code would work without
_PAGE_CO as well.
The basic idea is to introduce a software bit which indicates the software
view of writable vs. non-writable (the _PAGE_SWW bit). The hardware
_PAGE_RO is used to disallow writes while the pte is clean.
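
Concretely, this is the new non-pgste path of set_pte_at() in the patch,
restated here with a comment (the EDAT1 check is needed because the
change-bit override in the pte is only available with that facility):

	if (!(pte_val(entry) & _PAGE_INVALID) && MACHINE_HAS_EDAT1)
		/* tell the hardware not to set the storage key
		 * change bit for writes through this pte */
		pte_val(entry) |= _PAGE_CO;
	*ptep = entry;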

> Existing distributions might not be able to use this patch for
> bugs like "mm: fix XFS oops due to dirty pages without buffers on s390"
> because this shuffling of bits will break kABI, but that's not your problem.

I am not really sure if we should backport that change to the existing
distributions. It is a non-trivial change.

> >  /* Six different types of pages. */
> >  #define _PAGE_TYPE_EMPTY	0x400
> > @@ -321,6 +324,7 @@ extern unsigned long MODULES_END;
> >  
> >  /* Bits in the region table entry */
> >  #define _REGION_ENTRY_ORIGIN	~0xfffUL/* region/segment table origin	    */
> > +#define _REGION_ENTRY_RO	0x200	/* region protection bit	    */
> >  #define _REGION_ENTRY_INV	0x20	/* invalid region table entry	    */
> >  #define _REGION_ENTRY_TYPE_MASK	0x0c	/* region/segment table type mask   */
> >  #define _REGION_ENTRY_TYPE_R1	0x0c	/* region first table type	    */
> > @@ -382,9 +386,10 @@ extern unsigned long MODULES_END;
> >   */
> >  #define PAGE_NONE	__pgprot(_PAGE_TYPE_NONE)
> >  #define PAGE_RO		__pgprot(_PAGE_TYPE_RO)
> > -#define PAGE_RW		__pgprot(_PAGE_TYPE_RW)
> > +#define PAGE_RW		__pgprot(_PAGE_TYPE_RO | _PAGE_SWW)
> > +#define PAGE_RWC	__pgprot(_PAGE_TYPE_RW | _PAGE_SWW | _PAGE_SWC)
> >  
> > -#define PAGE_KERNEL	PAGE_RW
> > +#define PAGE_KERNEL	PAGE_RWC
> >  #define PAGE_COPY	PAGE_RO
> >  
> >  /*
> 
> And this combination of page bits looks consistent. The details are
> heavily buried in the arch code so I hope whoever deals with this area
> in the future spots the changelog. Of course, that guy is likely to be
> you anyway :)

Yep, I will be that guy, even more so than before now that the
complexity of handling s390 has moved into the arch implementation.

> > @@ -631,23 +636,23 @@ static inline pgste_t pgste_update_all(pte_t *ptep, pgste_t pgste)
> >  	bits = skey & (_PAGE_CHANGED | _PAGE_REFERENCED);
> >  	/* Clear page changed & referenced bit in the storage key */
> >  	if (bits & _PAGE_CHANGED)
> > -		page_set_storage_key(address, skey ^ bits, 1);
> > +		page_set_storage_key(address, skey ^ bits, 0);
> >  	else if (bits)
> >  		page_reset_referenced(address);
> >  	/* Transfer page changed & referenced bit to guest bits in pgste */
> >  	pgste_val(pgste) |= bits << 48;		/* RCP_GR_BIT & RCP_GC_BIT */
> >  	/* Get host changed & referenced bits from pgste */
> >  	bits |= (pgste_val(pgste) & (RCP_HR_BIT | RCP_HC_BIT)) >> 52;
> > -	/* Clear host bits in pgste. */
> > +	/* Transfer page changed & referenced bit to kvm user bits */
> > +	pgste_val(pgste) |= bits << 45;		/* KVM_UR_BIT & KVM_UC_BIT */
> > +	/* Clear relevant host bits in pgste. */
> >  	pgste_val(pgste) &= ~(RCP_HR_BIT | RCP_HC_BIT);
> >  	pgste_val(pgste) &= ~(RCP_ACC_BITS | RCP_FP_BIT);
> >  	/* Copy page access key and fetch protection bit to pgste */
> >  	pgste_val(pgste) |=
> >  		(unsigned long) (skey & (_PAGE_ACC_BITS | _PAGE_FP_BIT)) << 56;
> > -	/* Transfer changed and referenced to kvm user bits */
> > -	pgste_val(pgste) |= bits << 45;		/* KVM_UR_BIT & KVM_UC_BIT */
> > -	/* Transfer changed & referenced to pte sofware bits */
> > -	pte_val(*ptep) |= bits << 1;		/* _PAGE_SWR & _PAGE_SWC */
> > +	/* Transfer referenced bit to pte */
> > +	pte_val(*ptep) |= (bits & _PAGE_REFERENCED) << 1;
> >  #endif
> >  	return pgste;
> >  
> > @@ -660,20 +665,25 @@ static inline pgste_t pgste_update_young(pte_t *ptep, pgste_t pgste)
> >  
> >  	if (!pte_present(*ptep))
> >  		return pgste;
> > +	/* Get referenced bit from storage key */
> >  	young = page_reset_referenced(pte_val(*ptep) & PAGE_MASK);
> > -	/* Transfer page referenced bit to pte software bit (host view) */
> > -	if (young || (pgste_val(pgste) & RCP_HR_BIT))
> > +	if (young)
> > +		pgste_val(pgste) |= RCP_GR_BIT;
> > +	/* Get host referenced bit from pgste */
> > +	if (pgste_val(pgste) & RCP_HR_BIT) {
> > +		pgste_val(pgste) &= ~RCP_HR_BIT;
> > +		young = 1;
> > +	}
> > +	/* Transfer referenced bit to kvm user bits and pte */
> > +	if (young) {
> > +		pgste_val(pgste) |= KVM_UR_BIT;
> >  		pte_val(*ptep) |= _PAGE_SWR;
> > -	/* Clear host referenced bit in pgste. */
> > -	pgste_val(pgste) &= ~RCP_HR_BIT;
> > -	/* Transfer page referenced bit to guest bit in pgste */
> > -	pgste_val(pgste) |= (unsigned long) young << 50; /* set RCP_GR_BIT */
> > +	}
> >  #endif
> >  	return pgste;
> > -
> >  }
> >  
> > -static inline void pgste_set_pte(pte_t *ptep, pgste_t pgste, pte_t entry)
> > +static inline void pgste_set_key(pte_t *ptep, pgste_t pgste, pte_t entry)
> >  {
> >  #ifdef CONFIG_PGSTE
> >  	unsigned long address;
> > @@ -687,10 +697,23 @@ static inline void pgste_set_pte(pte_t *ptep, pgste_t pgste, pte_t entry)
> >  	/* Set page access key and fetch protection bit from pgste */
> >  	nkey |= (pgste_val(pgste) & (RCP_ACC_BITS | RCP_FP_BIT)) >> 56;
> >  	if (okey != nkey)
> > -		page_set_storage_key(address, nkey, 1);
> > +		page_set_storage_key(address, nkey, 0);
> >  #endif
> >  }
> >  
> > +static inline void pgste_set_pte(pte_t *ptep, pte_t entry)
> > +{
> > +	if (!MACHINE_HAS_ESOP && (pte_val(entry) & _PAGE_SWW)) {
> > +		/*
> > +		 * Without enhanced suppression-on-protection force
> > +		 * the dirty bit on for all writable ptes.
> > +		 */
> > +		pte_val(entry) |= _PAGE_SWC;
> > +		pte_val(entry) &= ~_PAGE_RO;
> > +	}
> > +	*ptep = entry;
> > +}
> > +
> >  /**
> >   * struct gmap_struct - guest address space
> >   * @mm: pointer to the parent mm_struct
> 
> So establishing a writable PTE clears the hardware RO override as
> well and the changed bit is set so it'll be considered dirty.

Yes, that case for old machines running KVM is a bit unfortunate.
I need to force the changed bit so that the read-only bit can be cleared.
KVM will not work with read-only ptes if the enhanced suppression-on-
protection is not available.
 
> Reading down through the other page table updates, I couldn't spot a place
> where you were inconsistent with the handling of the hardware _PAGE_CO
> and _PAGE_RO. Writes were allowed when the change bit was already set.
> I couldn't follow it all, particularly around the KVM bits so take it with
> a grain of salt but for me anyway;
> 
> Acked-by: Mel Gorman <mgorman@suse.de>
 
Much appreciated. Thanks Mel.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


* Re: [PATCH] s390/mm: implement software dirty bits
  2013-02-05 18:12 ` [PATCH] s390/mm: implement software dirty bits Martin Schwidefsky
  2013-02-06 11:21   ` Mel Gorman
@ 2013-02-07  0:20   ` Hugh Dickins
  2013-02-07 19:18     ` Martin Schwidefsky
  1 sibling, 1 reply; 11+ messages in thread
From: Hugh Dickins @ 2013-02-07  0:20 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: linux-mm, linux-s390, Mel Gorman, Jan Kara, Andrew Morton,
	Christian Ehrhardt

On Tue, 5 Feb 2013, Martin Schwidefsky wrote:

> The s390 architecture is unique with respect to dirty page detection:
> it uses the change bit in the per-page storage key to track page
> modifications. All other architectures track dirty bits by means
> of page table entries. This property of s390 has caused numerous
> problems in the past, e.g. see git commit ef5d437f71afdf4a
> "mm: fix XFS oops due to dirty pages without buffers on s390".
> 
> To avoid future issues in regard to per-page dirty bits, convert
> s390 to a fault based software dirty bit detection mechanism. All
> user page table entries which are marked as clean will be hardware
> read-only, even if the pte is supposed to be writable. A write by
> the user process will trigger a protection fault which will cause
> the user pte to be marked as dirty and the hardware read-only bit
> to be removed.
> 
> With this change the dirty bit in the storage key is irrelevant
> for Linux as a host, but the storage key is still required for
> KVM guests. The effect is that page_test_and_clear_dirty and the
> related code can be removed. The referenced bit in the storage
> key is still used by the page_test_and_clear_young primitive to
> provide page age information.
> 
> For page cache pages of mappings with mapping_cap_account_dirty
> there will not be any change in behavior as the dirty bit tracking
> already uses read-only ptes to control the amount of dirty pages.
> Only for swap cache pages and pages of mappings without
> mapping_cap_account_dirty can there be additional protection faults.
> To avoid an excessive number of additional faults the mk_pte
> primitive checks for PageDirty if the pgprot value allows for writes
> and pre-dirties the pte. That avoids all additional faults for
> tmpfs and shmem pages until these pages are added to the swap cache.
> 
> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
> ---
>  arch/s390/include/asm/page.h    |   22 -------
>  arch/s390/include/asm/pgtable.h |  131 ++++++++++++++++++++++++++-------------
>  arch/s390/include/asm/sclp.h    |    1 -
>  arch/s390/include/asm/setup.h   |   16 ++---
>  arch/s390/kvm/kvm-s390.c        |    2 +-
>  arch/s390/lib/uaccess_pt.c      |    2 +-
>  arch/s390/mm/pageattr.c         |    2 +-
>  arch/s390/mm/vmem.c             |   24 +++----
>  drivers/s390/char/sclp_cmd.c    |   10 +--
>  include/asm-generic/pgtable.h   |   10 ---
>  include/linux/page-flags.h      |    8 ---
>  mm/rmap.c                       |   24 -------
>  12 files changed, 112 insertions(+), 140 deletions(-)

Martin, I'd like to say Applauded-by: Hugh Dickins <hughd@google.com>
but I do have one reservation: the PageDirty business you helpfully
draw attention to in your description above.

That makes me nervous, having a PageDirty test buried down there in
one architecture's mk_pte().  Particularly since I know the PageDirty
handling on anon/swap pages is rather odd: it works, but it's hard to
justify some of the SetPageDirtys (when we add to swap, AND when we
remove from swap): partly a leftover from 2.4 days, when vmscan worked
differently, and we had to be more careful about freeing modified pages.

I did a patch a year or two ago, mainly for debugging some particular
issue by announcing "Bad page state" if ever a dirty page is freed, in
which I had to tidy that up.  Now, I don't have any immediate intention
to resurrect that patch, but I'm afraid that if I did, I might interfere
with your optimization in s390's mk_pte() without realizing it.

> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> ...
> @@ -1152,8 +1190,13
>  static inline pte_t mk_pte(struct page *page, pgprot_t pgprot)
>  {
>  	unsigned long physpage = page_to_phys(page);
> +	pte_t __pte = mk_pte_phys(physpage, pgprot);
>  
> -	return mk_pte_phys(physpage, pgprot);
> +	if ((pte_val(__pte) & _PAGE_SWW) && PageDirty(page)) {
> +		pte_val(__pte) |= _PAGE_SWC;
> +		pte_val(__pte) &= ~_PAGE_RO;
> +	}
> +	return __pte;
>  }

Am I right to think that, once you examine the mk_pte() callsites,
this actually would not be affecting anon pages, nor accounted file
pages, just tmpfs/shmem or ramfs pages read-faulted into a read-write
shared vma?  (That fits with what you say above.)  That it amounts to
the patch below - which I think I would prefer, because it's explicit?
(There might be one or two other places it makes a difference e.g.
replacing a writable migration entry, but those too uncommon to matter.)

--- 3.8-rc6/mm/memory.c	2013-01-09 19:25:05.028321379 -0800
+++ linux/mm/memory.c	2013-02-06 15:01:17.904387877 -0800
@@ -3338,6 +3338,10 @@ static int __do_fault(struct mm_struct *
 				dirty_page = page;
 				get_page(dirty_page);
 			}
+#ifdef CONFIG_S390
+			else if (pte_write(entry) && PageDirty(page))
+				pte_mkdirty(entry);
+#endif
 		}
 		set_pte_at(mm, address, page_table, entry);
 
And then I wonder, is that something we should do on all architectures?
On the one hand, it would save a hardware fault when and if the pte is
dirtied later; on the other hand, it seems wrong to claim pte dirty when
not (though I didn't find anywhere that would care).

Thoughts?

Hugh


* Re: [PATCH] s390/mm: implement software dirty bits
  2013-02-07  0:20   ` Hugh Dickins
@ 2013-02-07 19:18     ` Martin Schwidefsky
  2013-02-11 14:27       ` Martin Schwidefsky
  0 siblings, 1 reply; 11+ messages in thread
From: Martin Schwidefsky @ 2013-02-07 19:18 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-mm, linux-s390, Mel Gorman, Jan Kara, Andrew Morton,
	Christian Ehrhardt

On Wed, 6 Feb 2013 16:20:40 -0800 (PST)
Hugh Dickins <hughd@google.com> wrote:

> Martin, I'd like to say Applauded-by: Hugh Dickins <hughd@google.com>
> but I do have one reservation: the PageDirty business you helpfully
> draw attention to in your description above.
> 
> That makes me nervous, having a PageDirty test buried down there in
> one architecture's mk_pte().  Particularly since I know the PageDirty
> handling on anon/swap pages is rather odd: it works, but it's hard to
> justify some of the SetPageDirtys (when we add to swap, AND when we
> remove from swap): partly a leftover from 2.4 days, when vmscan worked
> differently, and we had to be more careful about freeing modified pages.

I tried to solve the whole thing with arch-level code only. The PageDirty
check in mk_pte is essential to avoid additional protection faults for
tmpfs/shmem.
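
To make the bit interplay concrete, here is a stand-alone model of the
pte bit logic from my patch (illustration only, not kernel code; the
_PAGE_SWC/_PAGE_SWW values are taken from the patch, the _PAGE_RO value
is assumed for the demo):

#include <stdio.h>

#define _PAGE_RO   0x200UL	/* HW page protection bit (assumed value) */
#define _PAGE_SWC  0x004UL	/* SW pte changed (dirty) bit */
#define _PAGE_SWW  0x010UL	/* SW pte write bit */

static unsigned long pte_mkwrite(unsigned long pte)
{
	pte |= _PAGE_SWW;
	if (pte & _PAGE_SWC)		/* already dirty: drop HW protection */
		pte &= ~_PAGE_RO;
	return pte;
}

static unsigned long pte_mkdirty(unsigned long pte)
{
	pte |= _PAGE_SWC;
	if (pte & _PAGE_SWW)		/* writable: drop HW protection */
		pte &= ~_PAGE_RO;
	return pte;
}

int main(void)
{
	unsigned long pte = _PAGE_RO;	/* clean, hardware read-only */

	pte = pte_mkwrite(pte);
	/* clean+writable stays HW read-only: the first write faults */
	printf("clean+writable: HW RO = %d\n", !!(pte & _PAGE_RO));

	pte = pte_mkdirty(pte);	/* what mk_pte does for a PageDirty page */
	printf("dirty+writable: HW RO = %d\n", !!(pte & _PAGE_RO));
	return 0;
}

Without the PageDirty check a shared tmpfs page that is already dirty
would be mapped in the clean+writable state, and the very first store
through the new mapping would take a protection fault for nothing.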
 
> I did a patch a year or two ago, mainly for debugging a particular
> issue, which announced "Bad page state" whenever a dirty page was freed,
> and in it I had to tidy that up.  Now, I don't have any immediate intention
> to resurrect that patch, but I'm afraid that if I did, I might interfere
> with your optimization in s390's mk_pte() without realizing it.
> 
> > --- a/arch/s390/include/asm/pgtable.h
> > +++ b/arch/s390/include/asm/pgtable.h
> > ...
> > @@ -1152,8 +1190,13 @@
> >  static inline pte_t mk_pte(struct page *page, pgprot_t pgprot)
> >  {
> >  	unsigned long physpage = page_to_phys(page);
> > +	pte_t __pte = mk_pte_phys(physpage, pgprot);
> >  
> > -	return mk_pte_phys(physpage, pgprot);
> > +	if ((pte_val(__pte) & _PAGE_SWW) && PageDirty(page)) {
> > +		pte_val(__pte) |= _PAGE_SWC;
> > +		pte_val(__pte) &= ~_PAGE_RO;
> > +	}
> > +	return __pte;
> >  }
> 
> Am I right to think that, once you examine the mk_pte() callsites,
> this actually would not be affecting anon pages, nor accounted file
> pages, just tmpfs/shmem or ramfs pages read-faulted into a read-write
> shared vma?  (That fits with what you say above.)  That it amounts to
> the patch below - which I think I would prefer, because it's explicit?
> (There might be one or two other places where it makes a difference,
> e.g. replacing a writable migration entry, but those are too uncommon
> to matter.)

Anon pages and accounted file pages don't need the mk_pte optimization;
it is there for tmpfs/shmem. We could do that in common code as well,
to make the dependency on PageDirty more obvious.

> --- 3.8-rc6/mm/memory.c	2013-01-09 19:25:05.028321379 -0800
> +++ linux/mm/memory.c	2013-02-06 15:01:17.904387877 -0800
> @@ -3338,6 +3338,10 @@ static int __do_fault(struct mm_struct *
>  				dirty_page = page;
>  				get_page(dirty_page);
>  			}
> +#ifdef CONFIG_S390
> +			else if (pte_write(entry) && PageDirty(page))
> +				pte_mkdirty(entry);
> +#endif
>  		}
>  		set_pte_at(mm, address, page_table, entry);
> 
> And then I wonder, is that something we should do on all architectures?
> On the one hand, it would save a hardware fault when and if the pte is
> dirtied later; on the other hand, it seems wrong to claim pte dirty when
> not (though I didn't find anywhere that would care).

I don't like the fact that we are adding another CONFIG_S390 to common
code; if we could pre-dirty the pte for all architectures, that would be
nice. It has no ill effects on s390 to make the pte dirty, and I can
think of no reason why it should hurt on other architectures.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] s390/mm: implement software dirty bits
  2013-02-07 19:18     ` Martin Schwidefsky
@ 2013-02-11 14:27       ` Martin Schwidefsky
  2013-02-11 22:08         ` Hugh Dickins
  0 siblings, 1 reply; 11+ messages in thread
From: Martin Schwidefsky @ 2013-02-11 14:27 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-mm, linux-s390, Mel Gorman, Jan Kara, Andrew Morton,
	Christian Ehrhardt, Martin Schwidefsky

Hi Hugh,

On Thu, 7 Feb 2013 11:18:38 -0800
Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:

> On Wed, 6 Feb 2013 16:20:40 -0800 (PST)
> Hugh Dickins <hughd@google.com> wrote:
> 
> Anon pages and accounted file pages don't need the mk_pte optimization;
> it is there for tmpfs/shmem. We could do that in common code as well,
> to make the dependency on PageDirty more obvious.
> 
> > --- 3.8-rc6/mm/memory.c	2013-01-09 19:25:05.028321379 -0800
> > +++ linux/mm/memory.c	2013-02-06 15:01:17.904387877 -0800
> > @@ -3338,6 +3338,10 @@ static int __do_fault(struct mm_struct *
> >  				dirty_page = page;
> >  				get_page(dirty_page);
> >  			}
> > +#ifdef CONFIG_S390
> > +			else if (pte_write(entry) && PageDirty(page))
> > +				pte_mkdirty(entry);
> > +#endif
> >  		}
> >  		set_pte_at(mm, address, page_table, entry);
> > 
> > And then I wonder, is that something we should do on all architectures?
> > On the one hand, it would save a hardware fault when and if the pte is
> > dirtied later; on the other hand, it seems wrong to claim pte dirty when
> > not (though I didn't find anywhere that would care).
> 
> I don't like the fact that we are adding another CONFIG_S390, if we could
> pre-dirty the pte for all architectures that would be nice. It has no
> ill effects for s390 to make the pte dirty, I can think of no reason
> why it should hurt for other architectures.

Having thought further about the issue, it does not make sense to force
all architectures to set the dirty bit in the pte, as this would make
try_to_unmap_one call set_page_dirty even for ptes which have never
been used for writing. set_page_dirty is a non-trivial function that
calls mapping->a_ops->set_page_dirty or __set_page_dirty_buffers. These
cycles should imho not be spent on architectures with h/w pte dirty
bits.
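
(For reference, the path in question - abbreviated from 3.8-era
mm/rmap.c and quoted from memory, so treat it as a sketch:

	/* Nuke the page table entry. */
	pteval = ptep_clear_flush(vma, address, pte);

	/* Move the dirty bit to the page now that the pte is gone. */
	if (pte_dirty(pteval))
		set_page_dirty(page);

so every pte that reaches try_to_unmap_one with the dirty bit set pays
for a set_page_dirty call, whether or not the page was ever written.)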

To avoid CONFIG_S390 in common code I'd like to introduce a new
__ARCH_WANT_PTE_WRITE_DIRTY define which then is used in __do_fault
like this:

				dirty_page = page;
				get_page(dirty_page);
			}
#ifdef __ARCH_WANT_PTE_WRITE_DIRTY
			/*
			 * Architectures that use software dirty bits may
			 * want to set the dirty bit in the pte if the pte
			 * is writable and the PageDirty bit is set for the
			 * page. This avoids unnecessary protection faults
			 * for writable mappings which do not use
			 * mapping_cap_account_dirty, e.g. tmpfs and shmem.
			 */
			else if (pte_write(entry) && PageDirty(page))
				entry = pte_mkdirty(entry);
#endif
		}
		set_pte_at(mm, address, page_table, entry);

Updated patch below. I guess this is ready to be added to the
features branch of linux-s390.

---
From 881ceba55bc58e69aa6157dd806a5297dae096d1 Mon Sep 17 00:00:00 2001
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
Date: Wed, 7 Nov 2012 13:17:37 +0100
Subject: [PATCH] s390/mm: implement software dirty bits

The s390 architecture is unique with respect to dirty page detection:
it uses the change bit in the per-page storage key to track page
modifications. All other architectures track dirty bits by means
of page table entries. This property of s390 has caused numerous
problems in the past, e.g. see git commit ef5d437f71afdf4a
"mm: fix XFS oops due to dirty pages without buffers on s390".

To avoid future issues with per-page dirty bits, convert s390 to a
fault-based software dirty bit detection mechanism. All user page
table entries which are marked as clean will be hardware read-only,
even if the pte is supposed to be writable. A write by the user
process will trigger a protection fault which marks the user pte as
dirty and removes the hardware read-only bit.

With this change the dirty bit in the storage key is irrelevant
for Linux as a host, but the storage key is still required for
KVM guests. The effect is that page_test_and_clear_dirty and the
related code can be removed. The referenced bit in the storage
key is still used by the page_test_and_clear_young primitive to
provide page age information.

For page cache pages of mappings with mapping_cap_account_dirty
there will not be any change in behavior as the dirty bit tracking
already uses read-only ptes to control the amount of dirty pages.
Only for swap cache pages and pages of mappings without
mapping_cap_account_dirty can there be additional protection faults.
To avoid an excessive number of additional faults, the __do_fault
function checks for PageDirty if the pte is writable and pre-dirties
the pte. That avoids all additional faults for tmpfs and shmem pages
until these pages are added to the swap cache. As this code is only
required for architectures with software dirty bits, it is compiled
only if __ARCH_WANT_PTE_WRITE_DIRTY is defined.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/include/asm/page.h    |  22 -------
 arch/s390/include/asm/pgtable.h | 126 ++++++++++++++++++++++++++--------------
 arch/s390/include/asm/sclp.h    |   1 -
 arch/s390/include/asm/setup.h   |  16 ++---
 arch/s390/kvm/kvm-s390.c        |   2 +-
 arch/s390/lib/uaccess_pt.c      |   2 +-
 arch/s390/mm/pageattr.c         |   2 +-
 arch/s390/mm/vmem.c             |  24 ++++----
 drivers/s390/char/sclp_cmd.c    |  10 +---
 include/asm-generic/pgtable.h   |  10 ----
 include/linux/page-flags.h      |   8 ---
 mm/memory.c                     |  12 ++++
 mm/rmap.c                       |  24 --------
 13 files changed, 120 insertions(+), 139 deletions(-)

diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
index a86ad40840..75ce9b0 100644
--- a/arch/s390/include/asm/page.h
+++ b/arch/s390/include/asm/page.h
@@ -155,28 +155,6 @@ static inline int page_reset_referenced(unsigned long addr)
 #define _PAGE_ACC_BITS		0xf0	/* HW access control bits	*/
 
 /*
- * Test and clear dirty bit in storage key.
- * We can't clear the changed bit atomically. This is a potential
- * race against modification of the referenced bit. This function
- * should therefore only be called if it is not mapped in any
- * address space.
- *
- * Note that the bit gets set whenever page content is changed. That means
- * also when the page is modified by DMA or from inside the kernel.
- */
-#define __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
-static inline int page_test_and_clear_dirty(unsigned long pfn, int mapped)
-{
-	unsigned char skey;
-
-	skey = page_get_storage_key(pfn << PAGE_SHIFT);
-	if (!(skey & _PAGE_CHANGED))
-		return 0;
-	page_set_storage_key(pfn << PAGE_SHIFT, skey & ~_PAGE_CHANGED, mapped);
-	return 1;
-}
-
-/*
  * Test and clear referenced bit in storage key.
  */
 #define __HAVE_ARCH_PAGE_TEST_AND_CLEAR_YOUNG
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index a009d4d..cd90c1d 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -29,6 +29,7 @@
 #ifndef __ASSEMBLY__
 #include <linux/sched.h>
 #include <linux/mm_types.h>
+#include <linux/page-flags.h>
 #include <asm/bug.h>
 #include <asm/page.h>
 
@@ -221,13 +222,15 @@ extern unsigned long MODULES_END;
 /* Software bits in the page table entry */
 #define _PAGE_SWT	0x001		/* SW pte type bit t */
 #define _PAGE_SWX	0x002		/* SW pte type bit x */
-#define _PAGE_SWC	0x004		/* SW pte changed bit (for KVM) */
-#define _PAGE_SWR	0x008		/* SW pte referenced bit (for KVM) */
-#define _PAGE_SPECIAL	0x010		/* SW associated with special page */
+#define _PAGE_SWC	0x004		/* SW pte changed bit */
+#define _PAGE_SWR	0x008		/* SW pte referenced bit */
+#define _PAGE_SWW	0x010		/* SW pte write bit */
+#define _PAGE_SPECIAL	0x020		/* SW associated with special page */
 #define __HAVE_ARCH_PTE_SPECIAL
 
 /* Set of bits not changed in pte_modify */
-#define _PAGE_CHG_MASK	(PAGE_MASK | _PAGE_SPECIAL | _PAGE_SWC | _PAGE_SWR)
+#define _PAGE_CHG_MASK		(PAGE_MASK | _PAGE_SPECIAL | _PAGE_CO | \
+				 _PAGE_SWC | _PAGE_SWR)
 
 /* Six different types of pages. */
 #define _PAGE_TYPE_EMPTY	0x400
@@ -321,6 +324,7 @@ extern unsigned long MODULES_END;
 
 /* Bits in the region table entry */
 #define _REGION_ENTRY_ORIGIN	~0xfffUL/* region/segment table origin	    */
+#define _REGION_ENTRY_RO	0x200	/* region protection bit	    */
 #define _REGION_ENTRY_INV	0x20	/* invalid region table entry	    */
 #define _REGION_ENTRY_TYPE_MASK	0x0c	/* region/segment table type mask   */
 #define _REGION_ENTRY_TYPE_R1	0x0c	/* region first table type	    */
@@ -382,9 +386,10 @@ extern unsigned long MODULES_END;
  */
 #define PAGE_NONE	__pgprot(_PAGE_TYPE_NONE)
 #define PAGE_RO		__pgprot(_PAGE_TYPE_RO)
-#define PAGE_RW		__pgprot(_PAGE_TYPE_RW)
+#define PAGE_RW		__pgprot(_PAGE_TYPE_RO | _PAGE_SWW)
+#define PAGE_RWC	__pgprot(_PAGE_TYPE_RW | _PAGE_SWW | _PAGE_SWC)
 
-#define PAGE_KERNEL	PAGE_RW
+#define PAGE_KERNEL	PAGE_RWC
 #define PAGE_SHARED	PAGE_KERNEL
 #define PAGE_COPY	PAGE_RO
 
@@ -632,23 +637,23 @@ static inline pgste_t pgste_update_all(pte_t *ptep, pgste_t pgste)
 	bits = skey & (_PAGE_CHANGED | _PAGE_REFERENCED);
 	/* Clear page changed & referenced bit in the storage key */
 	if (bits & _PAGE_CHANGED)
-		page_set_storage_key(address, skey ^ bits, 1);
+		page_set_storage_key(address, skey ^ bits, 0);
 	else if (bits)
 		page_reset_referenced(address);
 	/* Transfer page changed & referenced bit to guest bits in pgste */
 	pgste_val(pgste) |= bits << 48;		/* RCP_GR_BIT & RCP_GC_BIT */
 	/* Get host changed & referenced bits from pgste */
 	bits |= (pgste_val(pgste) & (RCP_HR_BIT | RCP_HC_BIT)) >> 52;
-	/* Clear host bits in pgste. */
+	/* Transfer page changed & referenced bit to kvm user bits */
+	pgste_val(pgste) |= bits << 45;		/* KVM_UR_BIT & KVM_UC_BIT */
+	/* Clear relevant host bits in pgste. */
 	pgste_val(pgste) &= ~(RCP_HR_BIT | RCP_HC_BIT);
 	pgste_val(pgste) &= ~(RCP_ACC_BITS | RCP_FP_BIT);
 	/* Copy page access key and fetch protection bit to pgste */
 	pgste_val(pgste) |=
 		(unsigned long) (skey & (_PAGE_ACC_BITS | _PAGE_FP_BIT)) << 56;
-	/* Transfer changed and referenced to kvm user bits */
-	pgste_val(pgste) |= bits << 45;		/* KVM_UR_BIT & KVM_UC_BIT */
-	/* Transfer changed & referenced to pte sofware bits */
-	pte_val(*ptep) |= bits << 1;		/* _PAGE_SWR & _PAGE_SWC */
+	/* Transfer referenced bit to pte */
+	pte_val(*ptep) |= (bits & _PAGE_REFERENCED) << 1;
 #endif
 	return pgste;
 
@@ -661,20 +666,25 @@ static inline pgste_t pgste_update_young(pte_t *ptep, pgste_t pgste)
 
 	if (!pte_present(*ptep))
 		return pgste;
+	/* Get referenced bit from storage key */
 	young = page_reset_referenced(pte_val(*ptep) & PAGE_MASK);
-	/* Transfer page referenced bit to pte software bit (host view) */
-	if (young || (pgste_val(pgste) & RCP_HR_BIT))
+	if (young)
+		pgste_val(pgste) |= RCP_GR_BIT;
+	/* Get host referenced bit from pgste */
+	if (pgste_val(pgste) & RCP_HR_BIT) {
+		pgste_val(pgste) &= ~RCP_HR_BIT;
+		young = 1;
+	}
+	/* Transfer referenced bit to kvm user bits and pte */
+	if (young) {
+		pgste_val(pgste) |= KVM_UR_BIT;
 		pte_val(*ptep) |= _PAGE_SWR;
-	/* Clear host referenced bit in pgste. */
-	pgste_val(pgste) &= ~RCP_HR_BIT;
-	/* Transfer page referenced bit to guest bit in pgste */
-	pgste_val(pgste) |= (unsigned long) young << 50; /* set RCP_GR_BIT */
+	}
 #endif
 	return pgste;
-
 }
 
-static inline void pgste_set_pte(pte_t *ptep, pgste_t pgste, pte_t entry)
+static inline void pgste_set_key(pte_t *ptep, pgste_t pgste, pte_t entry)
 {
 #ifdef CONFIG_PGSTE
 	unsigned long address;
@@ -688,10 +698,23 @@ static inline void pgste_set_pte(pte_t *ptep, pgste_t pgste, pte_t entry)
 	/* Set page access key and fetch protection bit from pgste */
 	nkey |= (pgste_val(pgste) & (RCP_ACC_BITS | RCP_FP_BIT)) >> 56;
 	if (okey != nkey)
-		page_set_storage_key(address, nkey, 1);
+		page_set_storage_key(address, nkey, 0);
 #endif
 }
 
+static inline void pgste_set_pte(pte_t *ptep, pte_t entry)
+{
+	if (!MACHINE_HAS_ESOP && (pte_val(entry) & _PAGE_SWW)) {
+		/*
+		 * Without enhanced suppression-on-protection force
+		 * the dirty bit on for all writable ptes.
+		 */
+		pte_val(entry) |= _PAGE_SWC;
+		pte_val(entry) &= ~_PAGE_RO;
+	}
+	*ptep = entry;
+}
+
 /**
  * struct gmap_struct - guest address space
  * @mm: pointer to the parent mm_struct
@@ -750,11 +773,14 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 
 	if (mm_has_pgste(mm)) {
 		pgste = pgste_get_lock(ptep);
-		pgste_set_pte(ptep, pgste, entry);
-		*ptep = entry;
+		pgste_set_key(ptep, pgste, entry);
+		pgste_set_pte(ptep, entry);
 		pgste_set_unlock(ptep, pgste);
-	} else
+	} else {
+		if (!(pte_val(entry) & _PAGE_INVALID) && MACHINE_HAS_EDAT1)
+			pte_val(entry) |= _PAGE_CO;
 		*ptep = entry;
+	}
 }
 
 /*
@@ -763,16 +789,12 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
  */
 static inline int pte_write(pte_t pte)
 {
-	return (pte_val(pte) & _PAGE_RO) == 0;
+	return (pte_val(pte) & _PAGE_SWW) != 0;
 }
 
 static inline int pte_dirty(pte_t pte)
 {
-#ifdef CONFIG_PGSTE
-	if (pte_val(pte) & _PAGE_SWC)
-		return 1;
-#endif
-	return 0;
+	return (pte_val(pte) & _PAGE_SWC) != 0;
 }
 
 static inline int pte_young(pte_t pte)
@@ -822,11 +844,14 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
 	pte_val(pte) &= _PAGE_CHG_MASK;
 	pte_val(pte) |= pgprot_val(newprot);
+	if ((pte_val(pte) & _PAGE_SWC) && (pte_val(pte) & _PAGE_SWW))
+		pte_val(pte) &= ~_PAGE_RO;
 	return pte;
 }
 
 static inline pte_t pte_wrprotect(pte_t pte)
 {
+	pte_val(pte) &= ~_PAGE_SWW;
 	/* Do not clobber _PAGE_TYPE_NONE pages!  */
 	if (!(pte_val(pte) & _PAGE_INVALID))
 		pte_val(pte) |= _PAGE_RO;
@@ -835,20 +860,26 @@ static inline pte_t pte_wrprotect(pte_t pte)
 
 static inline pte_t pte_mkwrite(pte_t pte)
 {
-	pte_val(pte) &= ~_PAGE_RO;
+	pte_val(pte) |= _PAGE_SWW;
+	if (pte_val(pte) & _PAGE_SWC)
+		pte_val(pte) &= ~_PAGE_RO;
 	return pte;
 }
 
 static inline pte_t pte_mkclean(pte_t pte)
 {
-#ifdef CONFIG_PGSTE
 	pte_val(pte) &= ~_PAGE_SWC;
-#endif
+	/* Do not clobber _PAGE_TYPE_NONE pages!  */
+	if (!(pte_val(pte) & _PAGE_INVALID))
+		pte_val(pte) |= _PAGE_RO;
 	return pte;
 }
 
 static inline pte_t pte_mkdirty(pte_t pte)
 {
+	pte_val(pte) |= _PAGE_SWC;
+	if (pte_val(pte) & _PAGE_SWW)
+		pte_val(pte) &= ~_PAGE_RO;
 	return pte;
 }
 
@@ -886,10 +917,10 @@ static inline pte_t pte_mkhuge(pte_t pte)
 		pte_val(pte) |= _SEGMENT_ENTRY_INV;
 	}
 	/*
-	 * Clear SW pte bits SWT and SWX, there are no SW bits in a segment
-	 * table entry.
+	 * Clear SW pte bits, there are no SW bits in a segment table entry.
 	 */
-	pte_val(pte) &= ~(_PAGE_SWT | _PAGE_SWX);
+	pte_val(pte) &= ~(_PAGE_SWT | _PAGE_SWX | _PAGE_SWC |
+			  _PAGE_SWR | _PAGE_SWW);
 	/*
 	 * Also set the change-override bit because we don't need dirty bit
 	 * tracking for hugetlbfs pages.
@@ -1041,9 +1072,11 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
 					   unsigned long address,
 					   pte_t *ptep, pte_t pte)
 {
-	*ptep = pte;
-	if (mm_has_pgste(mm))
+	if (mm_has_pgste(mm)) {
+		pgste_set_pte(ptep, pte);
 		pgste_set_unlock(ptep, *(pgste_t *)(ptep + PTRS_PER_PTE));
+	} else
+		*ptep = pte;
 }
 
 #define __HAVE_ARCH_PTEP_CLEAR_FLUSH
@@ -1111,10 +1144,13 @@ static inline pte_t ptep_set_wrprotect(struct mm_struct *mm,
 
 		if (!mm_exclusive(mm))
 			__ptep_ipte(address, ptep);
-		*ptep = pte_wrprotect(pte);
+		pte = pte_wrprotect(pte);
 
-		if (mm_has_pgste(mm))
+		if (mm_has_pgste(mm)) {
+			pgste_set_pte(ptep, pte);
 			pgste_set_unlock(ptep, pgste);
+		} else
+			*ptep = pte;
 	}
 	return pte;
 }
@@ -1132,10 +1168,12 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
 		pgste = pgste_get_lock(ptep);
 
 	__ptep_ipte(address, ptep);
-	*ptep = entry;
 
-	if (mm_has_pgste(vma->vm_mm))
+	if (mm_has_pgste(vma->vm_mm)) {
+		pgste_set_pte(ptep, entry);
 		pgste_set_unlock(ptep, pgste);
+	} else
+		*ptep = entry;
 	return 1;
 }
 
@@ -1246,6 +1284,8 @@ static inline int pmd_trans_splitting(pmd_t pmd)
 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 			      pmd_t *pmdp, pmd_t entry)
 {
+	if (!(pmd_val(entry) & _SEGMENT_ENTRY_INV) && MACHINE_HAS_EDAT1)
+		pmd_val(entry) |= _SEGMENT_ENTRY_CO;
 	*pmdp = entry;
 }
 
@@ -1486,6 +1526,8 @@ extern int s390_enable_sie(void);
  */
 #define pgtable_cache_init()	do { } while (0)
 
+#define __ARCH_WANT_PTE_WRITE_DIRTY
+
 #include <asm-generic/pgtable.h>
 
 #endif /* _S390_PAGE_H */
diff --git a/arch/s390/include/asm/sclp.h b/arch/s390/include/asm/sclp.h
index 8337886..06a1361 100644
--- a/arch/s390/include/asm/sclp.h
+++ b/arch/s390/include/asm/sclp.h
@@ -46,7 +46,6 @@ int sclp_cpu_deconfigure(u8 cpu);
 void sclp_facilities_detect(void);
 unsigned long long sclp_get_rnmax(void);
 unsigned long long sclp_get_rzm(void);
-u8 sclp_get_fac85(void);
 int sclp_sdias_blk_count(void);
 int sclp_sdias_copy(void *dest, int blk_num, int nr_blks);
 int sclp_chp_configure(struct chp_id chpid);
diff --git a/arch/s390/include/asm/setup.h b/arch/s390/include/asm/setup.h
index f69f76b..f685751 100644
--- a/arch/s390/include/asm/setup.h
+++ b/arch/s390/include/asm/setup.h
@@ -64,13 +64,14 @@ extern unsigned int s390_user_mode;
 
 #define MACHINE_FLAG_VM		(1UL << 0)
 #define MACHINE_FLAG_IEEE	(1UL << 1)
-#define MACHINE_FLAG_CSP	(1UL << 3)
-#define MACHINE_FLAG_MVPG	(1UL << 4)
-#define MACHINE_FLAG_DIAG44	(1UL << 5)
-#define MACHINE_FLAG_IDTE	(1UL << 6)
-#define MACHINE_FLAG_DIAG9C	(1UL << 7)
-#define MACHINE_FLAG_MVCOS	(1UL << 8)
-#define MACHINE_FLAG_KVM	(1UL << 9)
+#define MACHINE_FLAG_CSP	(1UL << 2)
+#define MACHINE_FLAG_MVPG	(1UL << 3)
+#define MACHINE_FLAG_DIAG44	(1UL << 4)
+#define MACHINE_FLAG_IDTE	(1UL << 5)
+#define MACHINE_FLAG_DIAG9C	(1UL << 6)
+#define MACHINE_FLAG_MVCOS	(1UL << 7)
+#define MACHINE_FLAG_KVM	(1UL << 8)
+#define MACHINE_FLAG_ESOP	(1UL << 9)
 #define MACHINE_FLAG_EDAT1	(1UL << 10)
 #define MACHINE_FLAG_EDAT2	(1UL << 11)
 #define MACHINE_FLAG_LPAR	(1UL << 12)
@@ -84,6 +85,7 @@ extern unsigned int s390_user_mode;
 #define MACHINE_IS_LPAR		(S390_lowcore.machine_flags & MACHINE_FLAG_LPAR)
 
 #define MACHINE_HAS_DIAG9C	(S390_lowcore.machine_flags & MACHINE_FLAG_DIAG9C)
+#define MACHINE_HAS_ESOP	(S390_lowcore.machine_flags & MACHINE_FLAG_ESOP)
 #define MACHINE_HAS_PFMF	MACHINE_HAS_EDAT1
 #define MACHINE_HAS_HPAGE	MACHINE_HAS_EDAT1
 
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index f090e81..2923781 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -147,7 +147,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 		r = KVM_MAX_VCPUS;
 		break;
 	case KVM_CAP_S390_COW:
-		r = sclp_get_fac85() & 0x2;
+		r = MACHINE_HAS_ESOP;
 		break;
 	default:
 		r = 0;
diff --git a/arch/s390/lib/uaccess_pt.c b/arch/s390/lib/uaccess_pt.c
index 9017a63..a70ee84 100644
--- a/arch/s390/lib/uaccess_pt.c
+++ b/arch/s390/lib/uaccess_pt.c
@@ -50,7 +50,7 @@ static __always_inline unsigned long follow_table(struct mm_struct *mm,
 	ptep = pte_offset_map(pmd, addr);
 	if (!pte_present(*ptep))
 		return -0x11UL;
-	if (write && !pte_write(*ptep))
+	if (write && (!pte_write(*ptep) || !pte_dirty(*ptep)))
 		return -0x04UL;
 
 	return (pte_val(*ptep) & PAGE_MASK) + (addr & ~PAGE_MASK);
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index 29ccee3..d21040e 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -127,7 +127,7 @@ void kernel_map_pages(struct page *page, int numpages, int enable)
 			pte_val(*pte) = _PAGE_TYPE_EMPTY;
 			continue;
 		}
-		*pte = mk_pte_phys(address, __pgprot(_PAGE_TYPE_RW));
+		pte_val(*pte) = __pa(address);
 	}
 }
 
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 6ed1426..79699f46 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -85,11 +85,9 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
 	pud_t *pu_dir;
 	pmd_t *pm_dir;
 	pte_t *pt_dir;
-	pte_t  pte;
 	int ret = -ENOMEM;
 
 	while (address < end) {
-		pte = mk_pte_phys(address, __pgprot(ro ? _PAGE_RO : 0));
 		pg_dir = pgd_offset_k(address);
 		if (pgd_none(*pg_dir)) {
 			pu_dir = vmem_pud_alloc();
@@ -101,9 +99,9 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
 #if defined(CONFIG_64BIT) && !defined(CONFIG_DEBUG_PAGEALLOC)
 		if (MACHINE_HAS_EDAT2 && pud_none(*pu_dir) && address &&
 		    !(address & ~PUD_MASK) && (address + PUD_SIZE <= end)) {
-			pte_val(pte) |= _REGION3_ENTRY_LARGE;
-			pte_val(pte) |= _REGION_ENTRY_TYPE_R3;
-			pud_val(*pu_dir) = pte_val(pte);
+			pud_val(*pu_dir) = __pa(address) |
+				_REGION_ENTRY_TYPE_R3 | _REGION3_ENTRY_LARGE |
+				(ro ? _REGION_ENTRY_RO : 0);
 			address += PUD_SIZE;
 			continue;
 		}
@@ -118,8 +116,9 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
 #if defined(CONFIG_64BIT) && !defined(CONFIG_DEBUG_PAGEALLOC)
 		if (MACHINE_HAS_EDAT1 && pmd_none(*pm_dir) && address &&
 		    !(address & ~PMD_MASK) && (address + PMD_SIZE <= end)) {
-			pte_val(pte) |= _SEGMENT_ENTRY_LARGE;
-			pmd_val(*pm_dir) = pte_val(pte);
+			pmd_val(*pm_dir) = __pa(address) |
+				_SEGMENT_ENTRY | _SEGMENT_ENTRY_LARGE |
+				(ro ? _SEGMENT_ENTRY_RO : 0);
 			address += PMD_SIZE;
 			continue;
 		}
@@ -132,7 +131,7 @@ static int vmem_add_mem(unsigned long start, unsigned long size, int ro)
 		}
 
 		pt_dir = pte_offset_kernel(pm_dir, address);
-		*pt_dir = pte;
+		pte_val(*pt_dir) = __pa(address) | (ro ? _PAGE_RO : 0);
 		address += PAGE_SIZE;
 	}
 	ret = 0;
@@ -199,7 +198,6 @@ int __meminit vmemmap_populate(struct page *start, unsigned long nr, int node)
 	pud_t *pu_dir;
 	pmd_t *pm_dir;
 	pte_t *pt_dir;
-	pte_t  pte;
 	int ret = -ENOMEM;
 
 	start_addr = (unsigned long) start;
@@ -237,9 +235,8 @@ int __meminit vmemmap_populate(struct page *start, unsigned long nr, int node)
 				new_page = vmemmap_alloc_block(PMD_SIZE, node);
 				if (!new_page)
 					goto out;
-				pte = mk_pte_phys(__pa(new_page), PAGE_RW);
-				pte_val(pte) |= _SEGMENT_ENTRY_LARGE;
-				pmd_val(*pm_dir) = pte_val(pte);
+				pmd_val(*pm_dir) = __pa(new_page) |
+					_SEGMENT_ENTRY | _SEGMENT_ENTRY_LARGE;
 				address = (address + PMD_SIZE) & PMD_MASK;
 				continue;
 			}
@@ -260,8 +257,7 @@ int __meminit vmemmap_populate(struct page *start, unsigned long nr, int node)
 			new_page =__pa(vmem_alloc_pages(0));
 			if (!new_page)
 				goto out;
-			pte = pfn_pte(new_page >> PAGE_SHIFT, PAGE_KERNEL);
-			*pt_dir = pte;
+			pte_val(*pt_dir) = __pa(new_page);
 		}
 		address += PAGE_SIZE;
 	}
diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index c44d13f..30a2255 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -56,7 +56,6 @@ static int __initdata early_read_info_sccb_valid;
 
 u64 sclp_facilities;
 static u8 sclp_fac84;
-static u8 sclp_fac85;
 static unsigned long long rzm;
 static unsigned long long rnmax;
 
@@ -131,7 +130,8 @@ void __init sclp_facilities_detect(void)
 	sccb = &early_read_info_sccb;
 	sclp_facilities = sccb->facilities;
 	sclp_fac84 = sccb->fac84;
-	sclp_fac85 = sccb->fac85;
+	if (sccb->fac85 & 0x02)
+		S390_lowcore.machine_flags |= MACHINE_FLAG_ESOP;
 	rnmax = sccb->rnmax ? sccb->rnmax : sccb->rnmax2;
 	rzm = sccb->rnsize ? sccb->rnsize : sccb->rnsize2;
 	rzm <<= 20;
@@ -171,12 +171,6 @@ unsigned long long sclp_get_rzm(void)
 	return rzm;
 }
 
-u8 sclp_get_fac85(void)
-{
-	return sclp_fac85;
-}
-EXPORT_SYMBOL_GPL(sclp_get_fac85);
-
 /*
  * This function will be called after sclp_facilities_detect(), which gets
  * called from early.c code. Therefore the sccb should have valid contents.
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 5cf680a..bfd8768 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -197,16 +197,6 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
-#ifndef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
-#define page_test_and_clear_dirty(pfn, mapped)	(0)
-#endif
-
-#ifndef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
-#define pte_maybe_dirty(pte)		pte_dirty(pte)
-#else
-#define pte_maybe_dirty(pte)		(1)
-#endif
-
 #ifndef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_YOUNG
 #define page_test_and_clear_young(pfn) (0)
 #endif
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 70473da..6d53675 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -303,21 +303,13 @@ static inline void __SetPageUptodate(struct page *page)
 
 static inline void SetPageUptodate(struct page *page)
 {
-#ifdef CONFIG_S390
-	if (!test_and_set_bit(PG_uptodate, &page->flags))
-		page_set_storage_key(page_to_phys(page), PAGE_DEFAULT_KEY, 0);
-#else
 	/*
 	 * Memory barrier must be issued before setting the PG_uptodate bit,
 	 * so that all previous stores issued in order to bring the page
 	 * uptodate are actually visible before PageUptodate becomes true.
-	 *
-	 * s390 doesn't need an explicit smp_wmb here because the test and
-	 * set bit already provides full barriers.
 	 */
 	smp_wmb();
 	set_bit(PG_uptodate, &(page)->flags);
-#endif
 }
 
 CLEARPAGEFLAG(Uptodate, uptodate)
diff --git a/mm/memory.c b/mm/memory.c
index bb1369f..b3da90d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3338,6 +3338,18 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 				dirty_page = page;
 				get_page(dirty_page);
 			}
+#ifdef __ARCH_WANT_PTE_WRITE_DIRTY
+			/*
+			 * Architectures that use software dirty bits may
+			 * want to set the dirty bit in the pte if the pte
+			 * is writable and the PageDirty bit is set for the
+			 * page. This avoids unnecessary protection faults
+			 * for writable mappings which do not use
+			 * mapping_cap_account_dirty, e.g. tmpfs and shmem.
+			 */
+			else if (pte_write(entry) && PageDirty(page))
+				entry = pte_mkdirty(entry);
+#endif
 		}
 		set_pte_at(mm, address, page_table, entry);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 2c78f8c..3d38edf 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1126,7 +1126,6 @@ void page_add_file_rmap(struct page *page)
  */
 void page_remove_rmap(struct page *page)
 {
-	struct address_space *mapping = page_mapping(page);
 	bool anon = PageAnon(page);
 	bool locked;
 	unsigned long flags;
@@ -1144,29 +1143,6 @@ void page_remove_rmap(struct page *page)
 		goto out;
 
 	/*
-	 * Now that the last pte has gone, s390 must transfer dirty
-	 * flag from storage key to struct page.  We can usually skip
-	 * this if the page is anon, so about to be freed; but perhaps
-	 * not if it's in swapcache - there might be another pte slot
-	 * containing the swap entry, but page not yet written to swap.
-	 *
-	 * And we can skip it on file pages, so long as the filesystem
-	 * participates in dirty tracking (note that this is not only an
-	 * optimization but also solves problems caused by dirty flag in
-	 * storage key getting set by a write from inside kernel); but need to
-	 * catch shm and tmpfs and ramfs pages which have been modified since
-	 * creation by read fault.
-	 *
-	 * Note that mapping must be decided above, before decrementing
-	 * mapcount (which luckily provides a barrier): once page is unmapped,
-	 * it could be truncated and page->mapping reset to NULL at any moment.
-	 * Note also that we are relying on page_mapping(page) to set mapping
-	 * to &swapper_space when PageSwapCache(page).
-	 */
-	if (mapping && !mapping_cap_account_dirty(mapping) &&
-	    page_test_and_clear_dirty(page_to_pfn(page), 1))
-		set_page_dirty(page);
-	/*
 	 * Hugepages are not counted in NR_ANON_PAGES nor NR_FILE_MAPPED
 	 * and not charged by memcg for now.
 	 */
-- 
1.7.12.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] s390/mm: implement software dirty bits
  2013-02-11 14:27       ` Martin Schwidefsky
@ 2013-02-11 22:08         ` Hugh Dickins
  2013-02-12  8:46           ` Martin Schwidefsky
  0 siblings, 1 reply; 11+ messages in thread
From: Hugh Dickins @ 2013-02-11 22:08 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: linux-mm, linux-s390, Mel Gorman, Jan Kara, Andrew Morton,
	Christian Ehrhardt

On Mon, 11 Feb 2013, Martin Schwidefsky wrote:
> On Thu, 7 Feb 2013 11:18:38 -0800
> Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> > On Wed, 6 Feb 2013 16:20:40 -0800 (PST)
> > Hugh Dickins <hughd@google.com> wrote:
> > 
> > Anon pages and accounted file pages don't need the mk_pte optimization;
> > it is there for tmpfs/shmem. We could do that in common code as well,
> > to make the dependency on PageDirty more obvious.
> > 
> > > --- 3.8-rc6/mm/memory.c	2013-01-09 19:25:05.028321379 -0800
> > > +++ linux/mm/memory.c	2013-02-06 15:01:17.904387877 -0800
> > > @@ -3338,6 +3338,10 @@ static int __do_fault(struct mm_struct *
> > >  				dirty_page = page;
> > >  				get_page(dirty_page);
> > >  			}
> > > +#ifdef CONFIG_S390
> > > +			else if (pte_write(entry) && PageDirty(page))
> > > +				entry = pte_mkdirty(entry);
> > > +#endif
> > >  		}
> > >  		set_pte_at(mm, address, page_table, entry);
> > > 
> > > And then I wonder, is that something we should do on all architectures?
> > > On the one hand, it would save a hardware fault when and if the pte is
> > > dirtied later; on the other hand, it seems wrong to claim pte dirty when
> > > not (though I didn't find anywhere that would care).
> > 
> > I don't like the fact that we are adding another CONFIG_S390 to common
> > code; if we could pre-dirty the pte for all architectures, that would be
> > nice. It has no ill effects on s390 to make the pte dirty, and I can
> > think of no reason why it should hurt on other architectures.
> 
> Having thought further about the issue, it does not make sense to force
> all architectures to set the dirty bit in the pte, as this would make
> try_to_unmap_one call set_page_dirty even for ptes which have never
> been used for writing.

In this particular case of shmem/tmpfs/ramfs (perhaps a few unaccounted
others too, I doubt many are mmap'able), on pages that were already
PageDirty when mapped.  And ramfs doesn't get as far as try_to_unmap_one,
because it has already failed the page_evictable test.
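
(For the record, page_evictable is, if I remember the 3.8 code
correctly, just

	int page_evictable(struct page *page)
	{
		return !mapping_unevictable(page_mapping(page)) &&
			!PageMlocked(page);
	}

and ramfs marks its mappings unevictable at inode creation time, so
its pages are taken off the reclaim path long before unmapping.)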

> set_page_dirty is a non-trivial function that
> calls mapping->a_ops->set_page_dirty or __set_page_dirty_buffers. These
> cycles should imho not be spent on architectures with h/w pte dirty
> bits.

The almost no-op __set_page_dirty_no_writeback is actually the one
that gets called.  Now, I don't disagree with you that I'd prefer not
to have to call it; but I'd also prefer to do the same thing on s390
as other architectures.
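
(That is - quoting mm/page-writeback.c from memory, so take it as a
sketch:

	int __set_page_dirty_no_writeback(struct page *page)
	{
		if (!PageDirty(page))
			return !TestSetPageDirty(page);
		return 0;
	}

just a test and an atomic bit set, none of the buffer and accounting
work of __set_page_dirty_buffers.)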

I'm undecided which I prefer.  Before you wrote, I was going to suggest
that you put your original patch into your tree for linux-next, and then
I would propose an mm patch on top, restoring the s390 mk_pte() to
normalcy and adding the pte_mkdirty() to __do_fault() as above; but with
a comment (which you now have), without the #ifdef, doing it on all
architectures - so that if we see a problem on one (because some code
elsewhere is deducing something from pte_dirty), it would be advance
warning of a problem on s390.  But if anyone objected to my patch, it
would cast doubt upon yours.

> 
> To avoid CONFIG_S390 in common code I'd like to introduce a new
> __ARCH_WANT_PTE_WRITE_DIRTY define which then is used in __do_fault
> like this:

My personal opinion is that an __ARCH_WANT_PTE_WRITE_DIRTY that is set
by only a single architecture just obfuscates the issue, that CONFIG_S390
is clearer for everyone.  Much of my dislike of page_test_and_clear_dirty
was that it looks so brilliantly generic, and yet is so peculiar to s390.
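
(Indeed the asm-generic fallback that your patch removes reads

	#ifndef __HAVE_ARCH_PAGE_TEST_AND_CLEAR_DIRTY
	#define page_test_and_clear_dirty(pfn, mapped)	(0)
	#endif

as if any architecture might implement it, when only s390 ever did.)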

But it's quite likely that I'm in a minority of one on that:
#ifdef CONFIG_HUGHD
#define __DEVELOPER_PREFERS_MORE_EXPLICIT_SINGLE_ARCH_DEPENDENCE 1
#endif

And at least #ifdef CONFIG_S390_OR_WHATEVER flags it as exceptional:
this might be a case where I'd say the ugliness of an #ifdef is good.
I am glad that you've come around to doing it this way, rather than
hiding the PageDirty peculiarity down in arch/s390's mk_pte().

> 
> 				dirty_page = page;
> 				get_page(dirty_page);
> 			}
> #ifdef __ARCH_WANT_PTE_WRITE_DIRTY
> 			/*
> 			 * Architectures that use software dirty bits may
> 			 * want to set the dirty bit in the pte if the pte
> 			 * is writable and the PageDirty bit is set for the
> 			 * page. This avoids unnecessary protection faults
> 			 * for writable mappings which do not use
> 			 * mapping_cap_account_dirty, e.g. tmpfs and shmem.

Good comment, but is s390 really an unmentionable?
Stand proud, and help us all by mentioning s390 in the comment!

> 			 */
> 			else if (pte_write(entry) && PageDirty(page))
> 				entry = pte_mkdirty(entry);
> #endif
> 		}
> 		set_pte_at(mm, address, page_table, entry);
> 
> Updated patch below. I guess this is ready to be added to the
> features branch of linux-s390.

Okay.  Please consider for a moment whether my point on the almost no-op
__set_page_dirty_no_writeback changes your mind; but if you decide you'd
still prefer to go with this no-change-to-other-architectures, I'm happy
with that.  Just, please mention s390 in the comment, to win my Acked-by
(and pray you don't lose Mel's).

Thanks,
Hugh

> [ remainder of quoted patch snipped - identical to the patch above ]
>  
>  static inline void SetPageUptodate(struct page *page)
>  {
> -#ifdef CONFIG_S390
> -	if (!test_and_set_bit(PG_uptodate, &page->flags))
> -		page_set_storage_key(page_to_phys(page), PAGE_DEFAULT_KEY, 0);
> -#else
>  	/*
>  	 * Memory barrier must be issued before setting the PG_uptodate bit,
>  	 * so that all previous stores issued in order to bring the page
>  	 * uptodate are actually visible before PageUptodate becomes true.
> -	 *
> -	 * s390 doesn't need an explicit smp_wmb here because the test and
> -	 * set bit already provides full barriers.
>  	 */
>  	smp_wmb();
>  	set_bit(PG_uptodate, &(page)->flags);
> -#endif
>  }
>  
>  CLEARPAGEFLAG(Uptodate, uptodate)
> diff --git a/mm/memory.c b/mm/memory.c
> index bb1369f..b3da90d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3338,6 +3338,18 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  				dirty_page = page;
>  				get_page(dirty_page);
>  			}
> +#ifdef __ARCH_WANT_PTE_WRITE_DIRTY
> +			/*
> +			 * Architectures that use software dirty bits may
> +			 * want to set the dirty bit in the pte if the pte
> +			 * is writable and the PageDirty bit is set for the
> +			 * page. This avoids unnecessary protection faults
> +			 * for writable mappings which do not use
> +			 * mapping_cap_account_dirty, e.g. tmpfs and shmem.
> +			 */
> +			else if (pte_write(entry) && PageDirty(page))
> +				entry = pte_mkdirty(entry);
> +#endif
>  		}
>  		set_pte_at(mm, address, page_table, entry);
>  
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 2c78f8c..3d38edf 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1126,7 +1126,6 @@ void page_add_file_rmap(struct page *page)
>   */
>  void page_remove_rmap(struct page *page)
>  {
> -	struct address_space *mapping = page_mapping(page);
>  	bool anon = PageAnon(page);
>  	bool locked;
>  	unsigned long flags;
> @@ -1144,29 +1143,6 @@ void page_remove_rmap(struct page *page)
>  		goto out;
>  
>  	/*
> -	 * Now that the last pte has gone, s390 must transfer dirty
> -	 * flag from storage key to struct page.  We can usually skip
> -	 * this if the page is anon, so about to be freed; but perhaps
> -	 * not if it's in swapcache - there might be another pte slot
> -	 * containing the swap entry, but page not yet written to swap.
> -	 *
> -	 * And we can skip it on file pages, so long as the filesystem
> -	 * participates in dirty tracking (note that this is not only an
> -	 * optimization but also solves problems caused by dirty flag in
> -	 * storage key getting set by a write from inside kernel); but need to
> -	 * catch shm and tmpfs and ramfs pages which have been modified since
> -	 * creation by read fault.
> -	 *
> -	 * Note that mapping must be decided above, before decrementing
> -	 * mapcount (which luckily provides a barrier): once page is unmapped,
> -	 * it could be truncated and page->mapping reset to NULL at any moment.
> -	 * Note also that we are relying on page_mapping(page) to set mapping
> -	 * to &swapper_space when PageSwapCache(page).
> -	 */
> -	if (mapping && !mapping_cap_account_dirty(mapping) &&
> -	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> -		set_page_dirty(page);
> -	/*
>  	 * Hugepages are not counted in NR_ANON_PAGES nor NR_FILE_MAPPED
>  	 * and not charged by memcg for now.
>  	 */
> -- 
> 1.7.12.4
> 
> 
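A note on how the scheme works, for readers skimming the diff: a pte
becomes writable in hardware only once it is both writable and dirty in
software, so the first store through a clean mapping takes a protection
fault, and the fault handler sets the software dirty bit.  A condensed
sketch of the pte helpers this implies -- bit values and helper bodies
here illustrate the approach and are not necessarily the exact
definitions in the patch:

#define _PAGE_SWC	0x004UL	/* software pte changed (dirty) bit */
#define _PAGE_SWW	0x008UL	/* software pte write bit */
#define _PAGE_RO	0x200UL	/* hardware read-only bit */

static inline int pte_dirty(pte_t pte)
{
	return (pte_val(pte) & _PAGE_SWC) != 0;
}

/* Writable in hardware only if software-writable and already dirty. */
static inline pte_t pte_mkwrite(pte_t pte)
{
	pte_val(pte) |= _PAGE_SWW;
	if (pte_val(pte) & _PAGE_SWC)
		pte_val(pte) &= ~_PAGE_RO;
	return pte;
}

static inline pte_t pte_mkdirty(pte_t pte)
{
	pte_val(pte) |= _PAGE_SWC;
	if (pte_val(pte) & _PAGE_SWW)
		pte_val(pte) &= ~_PAGE_RO;
	return pte;
}

/* Cleaning re-arms the protection fault for the next write. */
static inline pte_t pte_mkclean(pte_t pte)
{
	pte_val(pte) &= ~_PAGE_SWC;
	pte_val(pte) |= _PAGE_RO;
	return pte;
}

This is also why follow_table() above fails a write access when the pte
is writable but still clean: the hardware would reject the store anyway,
so the lookup must take the fault path that dirties the pte.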


* Re: [PATCH] s390/mm: implement software dirty bits
  2013-02-11 22:08         ` Hugh Dickins
@ 2013-02-12  8:46           ` Martin Schwidefsky
  0 siblings, 0 replies; 11+ messages in thread
From: Martin Schwidefsky @ 2013-02-12  8:46 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-mm, linux-s390, Mel Gorman, Jan Kara, Andrew Morton,
	Christian Ehrhardt, Russell King

On Mon, 11 Feb 2013 14:08:23 -0800 (PST)
Hugh Dickins <hughd@google.com> wrote:

> On Mon, 11 Feb 2013, Martin Schwidefsky wrote:
> > On Thu, 7 Feb 2013 11:18:38 -0800
> > Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> > > On Wed, 6 Feb 2013 16:20:40 -0800 (PST)
> > > Hugh Dickins <hughd@google.com> wrote:
> > > 
> > > Anon pages and accounted file pages won't need the mk_pte optimization;
> > > that is there for tmpfs/shmem. We could do that in common code as well,
> > > to make the dependency on PageDirty more obvious.
> > > 
> > > > --- 3.8-rc6/mm/memory.c	2013-01-09 19:25:05.028321379 -0800
> > > > +++ linux/mm/memory.c	2013-02-06 15:01:17.904387877 -0800
> > > > @@ -3338,6 +3338,10 @@ static int __do_fault(struct mm_struct *
> > > >  				dirty_page = page;
> > > >  				get_page(dirty_page);
> > > >  			}
> > > > +#ifdef CONFIG_S390
> > > > +			else if (pte_write(entry) && PageDirty(page))
> > > > +				entry = pte_mkdirty(entry);
> > > > +#endif
> > > >  		}
> > > >  		set_pte_at(mm, address, page_table, entry);
> > > > 
> > > > And then I wonder, is that something we should do on all architectures?
> > > > On the one hand, it would save a hardware fault when and if the pte is
> > > > dirtied later; on the other hand, it seems wrong to claim pte dirty when
> > > > not (though I didn't find anywhere that would care).
> > > 
> > > I don't like the fact that we are adding another CONFIG_S390; if we could
> > > pre-dirty the pte for all architectures, that would be nice. Making the
> > > pte dirty has no ill effects on s390, and I can think of no reason
> > > why it should hurt other architectures.
> > 
> > Having thought further about the issue, it does not make sense to force
> > all architectures to set the dirty bit in the pte, as this would make
> > try_to_unmap_one call set_page_dirty even for ptes which have never
> > been used for writing.
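The path in question is the unmap side, which transfers the pte dirty
bit to the struct page once the pte has been cleared -- roughly, from
3.8-era try_to_unmap_one() in mm/rmap.c:

	/* Nuke the page table entry. */
	flush_cache_page(vma, address, page_to_pfn(page));
	pteval = ptep_clear_flush(vma, address, pte);

	/* Move the dirty bit to the physical page now the pte is gone. */
	if (pte_dirty(pteval))
		set_page_dirty(page);

Pre-dirtying every writable pte in __do_fault() would send all of them
through set_page_dirty() here, whether or not a write ever happened.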
> 
> Only in this particular case of shmem/tmpfs/ramfs (perhaps a few
> unaccounted others too, though I doubt many are mmap'able), and only on
> pages that were already PageDirty when mapped.  And ramfs doesn't get as
> far as try_to_unmap_one, because it has already failed the page_evictable
> test.
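(Ramfs never gets there because it marks its mappings unevictable at
inode creation time; the test that stops it, as of 3.8, is simply:

int page_evictable(struct page *page)
{
	return !mapping_unevictable(page_mapping(page)) && !PageMlocked(page);
}

so ramfs pages are skipped by the reclaim path long before
try_to_unmap_one.)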

The important case is shmem for databases, no? 

> > set_page_dirty is a non-trivial function that
> > calls mapping->a_ops->set_page_dirty or __set_page_dirty_buffers. These
> > cycles should imho not be spent on architectures with h/w pte dirty
> > bits.
> 
> The almost no-op __set_page_dirty_no_writeback is actually the one
> that gets called.  Now, I don't disagree with you that I'd prefer not
> to have to call it; but I'd also prefer to do the same thing on s390
> as on other architectures.

Even if that would mean that unnecessary cycles are spent on the other
architectures? My feeling is that we should try to avoid that.
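For reference, the near-no-op in question looks like this in 3.8-era
mm/page-writeback.c:

/*
 * For address_spaces which do not use buffers nor write back,
 * e.g. shmem and ramfs.
 */
int __set_page_dirty_no_writeback(struct page *page)
{
	if (!PageDirty(page))
		return !TestSetPageDirty(page);
	return 0;
}

so the cycles at stake are essentially one atomic test-and-set of a page
flag per unmap, plus the indirect call to reach it.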
 
> I'm undecided which I prefer.  Before you wrote, I was going to suggest
> that you put your original patch into your tree for linux-next, and that
> I then propose an mm patch on top, restoring the s390 mk_pte() to
> normalcy and adding the pte_mkdirty() to __do_fault() as above; but with
> a comment (which you have), taking out the #ifdef and doing it on all
> architectures - so that if we see a problem on one (because some code
> elsewhere is deducing something from pte_dirty), it's advance warning of
> a problem on s390.  But if anyone objected to my patch, it would cast
> doubt upon yours.

That is certainly a workable approach.

> > 
> > To avoid CONFIG_S390 in common code I'd like to introduce a new
> > __ARCH_WANT_PTE_WRITE_DIRTY define which then is used in __do_fault
> > like this:
> 
> My personal opinion is that an __ARCH_WANT_PTE_WRITE_DIRTY that is set
> by only a single architecture just obfuscates the issue, and that
> CONFIG_S390 is clearer for everyone.  Much of my dislike of
> page_test_and_clear_dirty was that it looks so brilliantly generic,
> and yet is so peculiar to s390.
> 
> But it's quite likely that I'm in a minority of one on that:
> #ifdef CONFIG_HUGHD
> #define __DEVELOPER_PREFERS_MORE_EXPLICIT_SINGLE_ARCH_DEPENDENCE 1
> #endif
> 
> And at least #ifdef CONFIG_S390_OR_WHATEVER flags it as exceptional:
> this might be a case where I'd say the ugliness of an #ifdef is good.
> I am glad that you've come around to doing it this way, rather than
> hiding the PageDirty peculiarity down in arch/s390's mk_pte().

I am not so sure about that. Arm seems to have exactly the same problem:
it does not set the h/w write bit as long as the user ptes are not dirty.
My guess is that arm is a second architecture that could use the define.
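If so, the arch side of the interface stays tiny: each architecture that
wants the behaviour adds a single define to its pgtable.h.  A
hypothetical sketch, since the name is still under discussion:

/* arch/s390/include/asm/pgtable.h */
#define __ARCH_WANT_PTE_WRITE_DIRTY

/* arch/arm/include/asm/pgtable.h, if it turns out to help there */
#define __ARCH_WANT_PTE_WRITE_DIRTY

Everything else lives in the generic __do_fault() hunk below.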

Putting the arm maintainer on CC. Russell, the question is whether
pre-dirtying writable user PTEs when the PageDirty bit is set would help
arm avoid protection faults for tmpfs/shmem. The relevant hunk from the
patch:

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3338,6 +3338,18 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                                dirty_page = page;
                                get_page(dirty_page);
                        }
+#ifdef __ARCH_WANT_PTE_WRITE_DIRTY
+                       /*
+                        * Architectures that use software dirty bits may
+                        * want to set the dirty bit in the pte if the pte
+                        * is writable and the PageDirty bit is set for the
+                        * page. This avoids unnecessary protection faults
+                        * for writable mappings which do not use
+                        * mapping_cap_account_dirty, e.g. tmpfs and shmem.
+                        */
+                       else if (pte_write(entry) && PageDirty(page))
+                               entry = pte_mkdirty(entry);
+#endif
                }
                set_pte_at(mm, address, page_table, entry);
 
-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


end of thread, other threads:[~2013-02-12  8:46 UTC | newest]

Thread overview: 11+ messages
2013-02-05 18:12 [PATCH] software dirty bits for s390 Martin Schwidefsky
2013-02-05 18:12 ` Martin Schwidefsky
2013-02-05 19:28   ` Martin Schwidefsky
2013-02-05 18:12 ` [PATCH] s390/mm: implement software dirty bits Martin Schwidefsky
2013-02-06 11:21   ` Mel Gorman
2013-02-06 18:16     ` Martin Schwidefsky
2013-02-07  0:20   ` Hugh Dickins
2013-02-07 19:18     ` Martin Schwidefsky
2013-02-11 14:27       ` Martin Schwidefsky
2013-02-11 22:08         ` Hugh Dickins
2013-02-12  8:46           ` Martin Schwidefsky
