linux-kernel.vger.kernel.org archive mirror
* [PATCHv3 00/12] Fix few rmap-related THP bugs
@ 2017-01-29 17:38 Kirill A. Shutemov
  2017-01-29 17:38 ` [PATCHv3 01/12] uprobes: split THPs before trying replace them Kirill A. Shutemov
                   ` (11 more replies)
  0 siblings, 12 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2017-01-29 17:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins, Rik van Riel, Andrew Morton
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov

The patchset fixes handling of PTE-mapped THPs in page_referenced() and
page_idle_clear_pte_refs().

To achieve that I've introduced a new helper -- page_vma_mapped_walk() -- which
replaces page_check_address{,_transhuge}() and covers all THP cases.
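
To illustrate the difference (a simplified sketch, not the exact code from the
patches below -- the old helpers report at most one mapping, while the new
helper is called in a loop and visits every PTE or PMD that maps the page in
the vma):

	/* Old pattern: sees only the first mapping of the page in the vma */
	if (page_check_address_transhuge(page, mm, address, &pmd, &pte, &ptl)) {
		/* examine the single pte/pmd, then unlock */
	}

	/* New pattern: visits every mapping of the page in the vma */
	while (page_vma_mapped_walk(&pvmw)) {
		/* examine pvmw.pte (PTE mapping) or pvmw.pmd (PMD mapping) */
	}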

Patchset overview:
  - First patch fixes one uprobe bug (unrelated to the rest of the
    patchset, just spotted it at the same time);

  - Patches 2-5 fix handling PTE-mapped THPs in page_referenced(),
    page_idle_clear_pte_refs() and rmap core;

  - Patches 6-12 convert all page_check_address{,_transhuge}() users (plus
    remove_migration_pte()) to page_vma_mapped_walk() and drop unused helpers.

I think the fixes are not critical enough for stable@ as they don't lead
to crashes or hangs, only to suboptimal behaviour.

Please review and consider applying.

v3:
  - fix page_vma_mapped_walk() breakage;
  - fix one more build error reported by 0-day testing;
  - add a few Acked-by/Reviewed-by tags;
v2:
  - address feedback from Andrew;
  - fix build errors noticed by 0-day testing.

Kirill A. Shutemov (12):
  uprobes: split THPs before trying replace them
  mm: introduce page_vma_mapped_walk()
  mm: fix handling PTE-mapped THPs in page_referenced()
  mm: fix handling PTE-mapped THPs in page_idle_clear_pte_refs()
  mm, rmap: check all VMAs that PTE-mapped THP can be part of
  mm: convert page_mkclean_one() to use page_vma_mapped_walk()
  mm: convert try_to_unmap_one() to use page_vma_mapped_walk()
  mm, ksm: convert write_protect_page() to use page_vma_mapped_walk()
  mm, uprobes: convert __replace_page() to use page_vma_mapped_walk()
  mm: convert page_mapped_in_vma() to use page_vma_mapped_walk()
  mm: drop page_check_address{,_transhuge}
  mm: convert remove_migration_pte() to use page_vma_mapped_walk()

 include/linux/rmap.h    |  52 ++---
 kernel/events/uprobes.c |  26 ++-
 mm/Makefile             |   6 +-
 mm/huge_memory.c        |  25 +--
 mm/internal.h           |   9 +-
 mm/ksm.c                |  34 +--
 mm/migrate.c            | 104 ++++-----
 mm/page_idle.c          |  34 +--
 mm/page_vma_mapped.c    | 218 ++++++++++++++++++
 mm/rmap.c               | 574 +++++++++++++++++++-----------------------------
 10 files changed, 573 insertions(+), 509 deletions(-)
 create mode 100644 mm/page_vma_mapped.c

-- 
2.11.0


* [PATCHv3 01/12] uprobes: split THPs before trying replace them
  2017-01-29 17:38 [PATCHv3 00/12] Fix few rmap-related THP bugs Kirill A. Shutemov
@ 2017-01-29 17:38 ` Kirill A. Shutemov
  2017-01-31 15:44   ` Oleg Nesterov
  2017-01-29 17:38 ` [PATCHv3 02/12] mm: introduce page_vma_mapped_walk() Kirill A. Shutemov
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 16+ messages in thread
From: Kirill A. Shutemov @ 2017-01-29 17:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins, Rik van Riel, Andrew Morton
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov, Oleg Nesterov,
	Peter Zijlstra

For THPs, page_check_address() always fails, which leads to an endless loop
in uprobe_write_opcode().

Testcase with huge-tmpfs (not sure if it's possible to trigger this
uprobe codepath for anon memory):

	mount -t debugfs none /sys/kernel/debug
	mount -t tmpfs -o huge=always none /mnt
	gcc -Wall -O2 -o /mnt/test -x c - <<EOF
	int main(void)
	{
		return 0;
	}
	/* Padding to map the code segment with huge pmd */
	asm (".zero 2097152");
	EOF
	echo 'p /mnt/test:0' > /sys/kernel/debug/tracing/uprobe_events
	echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
	/mnt/test

Let's split THPs before trying to replace them.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 kernel/events/uprobes.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index d416f3baf392..1e65c79e52a6 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -300,8 +300,8 @@ int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr,
 
 retry:
 	/* Read the page with vaddr into memory */
-	ret = get_user_pages_remote(NULL, mm, vaddr, 1, FOLL_FORCE, &old_page,
-			&vma, NULL);
+	ret = get_user_pages_remote(NULL, mm, vaddr, 1,
+			FOLL_FORCE | FOLL_SPLIT, &old_page, &vma, NULL);
 	if (ret <= 0)
 		return ret;
 
-- 
2.11.0


* [PATCHv3 02/12] mm: introduce page_vma_mapped_walk()
  2017-01-29 17:38 [PATCHv3 00/12] Fix few rmap-related THP bugs Kirill A. Shutemov
  2017-01-29 17:38 ` [PATCHv3 01/12] uprobes: split THPs before trying replace them Kirill A. Shutemov
@ 2017-01-29 17:38 ` Kirill A. Shutemov
  2017-01-29 17:38 ` [PATCHv3 03/12] mm: fix handling PTE-mapped THPs in page_referenced() Kirill A. Shutemov
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2017-01-29 17:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins, Rik van Riel, Andrew Morton
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov

This patch introduces a new interface to check whether a page is mapped into
a vma. It aims to address the shortcomings of page_check_address{,_transhuge}.

The existing interface is not able to handle PTE-mapped THPs: it only finds
the first PTE, and the rest go unnoticed.

page_vma_mapped_walk() iterates over all possible mappings of the page in the
vma.
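
A minimal caller-side sketch (the struct fields and helpers are the ones added
by this patch; the stop condition is a hypothetical placeholder):

	struct page_vma_mapped_walk pvmw = {
		.page = page,
		.vma = vma,
		.address = address,
	};

	while (page_vma_mapped_walk(&pvmw)) {
		/*
		 * pvmw.ptl is held here. pvmw.pte is set for a PTE mapping;
		 * for a PMD-mapped THP pvmw.pte is NULL and pvmw.pmd is set.
		 */
		if (need_to_stop_early) {	/* hypothetical condition */
			page_vma_mapped_walk_done(&pvmw);
			break;
		}
	}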

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/rmap.h |  26 +++++++
 mm/Makefile          |   6 +-
 mm/huge_memory.c     |   9 ++-
 mm/page_vma_mapped.c | 188 +++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 224 insertions(+), 5 deletions(-)
 create mode 100644 mm/page_vma_mapped.c

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 15321fb1df6b..b76343610653 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -9,6 +9,7 @@
 #include <linux/mm.h>
 #include <linux/rwsem.h>
 #include <linux/memcontrol.h>
+#include <linux/highmem.h>
 
 /*
  * The anon_vma heads a list of private "related" vmas, to scan if
@@ -232,6 +233,31 @@ static inline bool page_check_address_transhuge(struct page *page,
 }
 #endif
 
+/* Avoid racy checks */
+#define PVMW_SYNC		(1 << 0)
+/* Look for migration entries rather than present PTEs */
+#define PVMW_MIGRATION		(1 << 1)
+
+struct page_vma_mapped_walk {
+	struct page *page;
+	struct vm_area_struct *vma;
+	unsigned long address;
+	pmd_t *pmd;
+	pte_t *pte;
+	spinlock_t *ptl;
+	unsigned int flags;
+};
+
+static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
+{
+	if (pvmw->pte)
+		pte_unmap(pvmw->pte);
+	if (pvmw->ptl)
+		spin_unlock(pvmw->ptl);
+}
+
+bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
+
 /*
  * Used by swapoff to help locate where page is expected in vma.
  */
diff --git a/mm/Makefile b/mm/Makefile
index 295bd7a9f76b..e375745a88a5 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -23,8 +23,10 @@ KCOV_INSTRUMENT_vmstat.o := n
 
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= gup.o highmem.o memory.o mincore.o \
-			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o pagewalk.o pgtable-generic.o
+			   mlock.o mmap.o mprotect.o mremap.o msync.o \
+			   page_vma_mapped.o pagewalk.o pgtable-generic.o \
+			   rmap.o vmalloc.o
+
 
 ifdef CONFIG_CROSS_MEMORY_ATTACH
 mmu-$(CONFIG_MMU)	+= process_vm_access.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9a6bd6c8d55a..16820e001d79 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1862,9 +1862,12 @@ static void freeze_page(struct page *page)
 static void unfreeze_page(struct page *page)
 {
 	int i;
-
-	for (i = 0; i < HPAGE_PMD_NR; i++)
-		remove_migration_ptes(page + i, page + i, true);
+	if (PageTransHuge(page)) {
+		remove_migration_ptes(page, page, true);
+	} else {
+		for (i = 0; i < HPAGE_PMD_NR; i++)
+			remove_migration_ptes(page + i, page + i, true);
+	}
 }
 
 static void __split_huge_page_tail(struct page *head, int tail,
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
new file mode 100644
index 000000000000..bbd2a39e985d
--- /dev/null
+++ b/mm/page_vma_mapped.c
@@ -0,0 +1,188 @@
+#include <linux/mm.h>
+#include <linux/rmap.h>
+#include <linux/hugetlb.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+
+#include "internal.h"
+
+static inline bool check_pmd(struct page_vma_mapped_walk *pvmw)
+{
+	pmd_t pmde;
+	/*
+	 * Make sure we don't re-load pmd between present and !trans_huge check.
+	 * We need a consistent view.
+	 */
+	pmde = READ_ONCE(*pvmw->pmd);
+	return pmd_present(pmde) && !pmd_trans_huge(pmde);
+}
+
+static inline bool not_found(struct page_vma_mapped_walk *pvmw)
+{
+	page_vma_mapped_walk_done(pvmw);
+	return false;
+}
+
+static bool map_pte(struct page_vma_mapped_walk *pvmw)
+{
+	pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
+	if (!(pvmw->flags & PVMW_SYNC)) {
+		if (pvmw->flags & PVMW_MIGRATION) {
+			if (!is_swap_pte(*pvmw->pte))
+				return false;
+		} else {
+			if (!pte_present(*pvmw->pte))
+				return false;
+		}
+	}
+	pvmw->ptl = pte_lockptr(pvmw->vma->vm_mm, pvmw->pmd);
+	spin_lock(pvmw->ptl);
+	return true;
+}
+
+static bool check_pte(struct page_vma_mapped_walk *pvmw)
+{
+	if (pvmw->flags & PVMW_MIGRATION) {
+#ifdef CONFIG_MIGRATION
+		swp_entry_t entry;
+		if (!is_swap_pte(*pvmw->pte))
+			return false;
+		entry = pte_to_swp_entry(*pvmw->pte);
+		if (!is_migration_entry(entry))
+			return false;
+		if (migration_entry_to_page(entry) - pvmw->page >=
+				hpage_nr_pages(pvmw->page)) {
+			return false;
+		}
+		if (migration_entry_to_page(entry) < pvmw->page)
+			return false;
+#else
+		WARN_ON_ONCE(1);
+#endif
+	} else {
+		if (!pte_present(*pvmw->pte))
+			return false;
+
+		/* THP can be referenced by any subpage */
+		if (pte_page(*pvmw->pte) - pvmw->page >=
+				hpage_nr_pages(pvmw->page)) {
+			return false;
+		}
+		if (pte_page(*pvmw->pte) < pvmw->page)
+			return false;
+	}
+
+	return true;
+}
+
+/**
+ * page_vma_mapped_walk - check if @pvmw->page is mapped in @pvmw->vma at
+ * @pvmw->address
+ * @pvmw: pointer to struct page_vma_mapped_walk. page, vma, address and flags
+ * must be set. pmd, pte and ptl must be NULL.
+ *
+ * Returns true if the page is mapped in the vma. @pvmw->pmd and @pvmw->pte point
+ * to relevant page table entries. @pvmw->ptl is locked. @pvmw->address is
+ * adjusted if needed (for PTE-mapped THPs).
+ *
+ * If @pvmw->pmd is set but @pvmw->pte is not, you have found a PMD-mapped page
+ * (usually THP). For PTE-mapped THP, you should run page_vma_mapped_walk() in
+ * a loop to find all PTEs that map the THP.
+ *
+ * For HugeTLB pages, @pvmw->pte is set to the relevant page table entry
+ * regardless of which page table level the page is mapped at. @pvmw->pmd is
+ * NULL.
+ *
+ * Returns false if there are no more page table entries for the page in
+ * the vma. @pvmw->ptl is unlocked and @pvmw->pte is unmapped.
+ *
+ * If you need to stop the walk before page_vma_mapped_walk() returned false,
+ * use page_vma_mapped_walk_done(). It will do the housekeeping.
+ */
+bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
+{
+	struct mm_struct *mm = pvmw->vma->vm_mm;
+	struct page *page = pvmw->page;
+	pgd_t *pgd;
+	pud_t *pud;
+
+	/* The only possible pmd mapping has been handled on last iteration */
+	if (pvmw->pmd && !pvmw->pte)
+		return not_found(pvmw);
+
+	/* Only for THP, seek to next pte entry makes sense */
+	if (pvmw->pte) {
+		if (!PageTransHuge(pvmw->page) || PageHuge(pvmw->page))
+			return not_found(pvmw);
+		goto next_pte;
+	}
+
+	if (unlikely(PageHuge(pvmw->page))) {
+		/* when pud is not present, pte will be NULL */
+		pvmw->pte = huge_pte_offset(mm, pvmw->address);
+		if (!pvmw->pte)
+			return false;
+
+		pvmw->ptl = huge_pte_lockptr(page_hstate(page), mm, pvmw->pte);
+		spin_lock(pvmw->ptl);
+		if (!check_pte(pvmw))
+			return not_found(pvmw);
+		return true;
+	}
+restart:
+	pgd = pgd_offset(mm, pvmw->address);
+	if (!pgd_present(*pgd))
+		return false;
+	pud = pud_offset(pgd, pvmw->address);
+	if (!pud_present(*pud))
+		return false;
+	pvmw->pmd = pmd_offset(pud, pvmw->address);
+	if (pmd_trans_huge(*pvmw->pmd)) {
+		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
+		if (!pmd_present(*pvmw->pmd))
+			return not_found(pvmw);
+		if (likely(pmd_trans_huge(*pvmw->pmd))) {
+			if (pvmw->flags & PVMW_MIGRATION)
+				return not_found(pvmw);
+			if (pmd_page(*pvmw->pmd) != page)
+				return not_found(pvmw);
+			return true;
+		} else {
+			/* THP pmd was split under us: handle on pte level */
+			spin_unlock(pvmw->ptl);
+			pvmw->ptl = NULL;
+		}
+	} else {
+		if (!check_pmd(pvmw))
+			return false;
+	}
+	if (!map_pte(pvmw))
+		goto next_pte;
+	while (1) {
+		if (check_pte(pvmw))
+			return true;
+next_pte:	do {
+			pvmw->address += PAGE_SIZE;
+			if (pvmw->address >=
+					__vma_address(pvmw->page, pvmw->vma) +
+					hpage_nr_pages(pvmw->page) * PAGE_SIZE)
+				return not_found(pvmw);
+			/* Did we cross page table boundary? */
+			if (pvmw->address % PMD_SIZE == 0) {
+				pte_unmap(pvmw->pte);
+				if (pvmw->ptl) {
+					spin_unlock(pvmw->ptl);
+					pvmw->ptl = NULL;
+				}
+				goto restart;
+			} else {
+				pvmw->pte++;
+			}
+		} while (pte_none(*pvmw->pte));
+
+		if (!pvmw->ptl) {
+			pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
+			spin_lock(pvmw->ptl);
+		}
+	}
+}
-- 
2.11.0


* [PATCHv3 03/12] mm: fix handling PTE-mapped THPs in page_referenced()
  2017-01-29 17:38 [PATCHv3 00/12] Fix few rmap-related THP bugs Kirill A. Shutemov
  2017-01-29 17:38 ` [PATCHv3 01/12] uprobes: split THPs before trying replace them Kirill A. Shutemov
  2017-01-29 17:38 ` [PATCHv3 02/12] mm: introduce page_vma_mapped_walk() Kirill A. Shutemov
@ 2017-01-29 17:38 ` Kirill A. Shutemov
  2017-02-02 15:26   ` Michal Hocko
  2017-01-29 17:38 ` [PATCHv3 04/12] mm: fix handling PTE-mapped THPs in page_idle_clear_pte_refs() Kirill A. Shutemov
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 16+ messages in thread
From: Kirill A. Shutemov @ 2017-01-29 17:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins, Rik van Riel, Andrew Morton
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov

For PTE-mapped THPs, page_check_address_transhuge() is not adequate: it
cannot find all relevant PTEs, only the first one. This means we can miss
some references to the page, which can result in suboptimal decisions by
vmscan.

Let's switch it to page_vma_mapped_walk().

I don't think it's a candidate for stable@: it's not fatal. The only side
effect is that a THP can be swapped out when it shouldn't be.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/rmap.c | 66 ++++++++++++++++++++++++++++++++-------------------------------
 1 file changed, 34 insertions(+), 32 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 91619fd70939..0dff8accd629 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -886,45 +886,48 @@ struct page_referenced_arg {
 static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 			unsigned long address, void *arg)
 {
-	struct mm_struct *mm = vma->vm_mm;
 	struct page_referenced_arg *pra = arg;
-	pmd_t *pmd;
-	pte_t *pte;
-	spinlock_t *ptl;
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.address = address,
+	};
 	int referenced = 0;
 
-	if (!page_check_address_transhuge(page, mm, address, &pmd, &pte, &ptl))
-		return SWAP_AGAIN;
+	while (page_vma_mapped_walk(&pvmw)) {
+		address = pvmw.address;
 
-	if (vma->vm_flags & VM_LOCKED) {
-		if (pte)
-			pte_unmap(pte);
-		spin_unlock(ptl);
-		pra->vm_flags |= VM_LOCKED;
-		return SWAP_FAIL; /* To break the loop */
-	}
+		if (vma->vm_flags & VM_LOCKED) {
+			page_vma_mapped_walk_done(&pvmw);
+			pra->vm_flags |= VM_LOCKED;
+			return SWAP_FAIL; /* To break the loop */
+		}
 
-	if (pte) {
-		if (ptep_clear_flush_young_notify(vma, address, pte)) {
-			/*
-			 * Don't treat a reference through a sequentially read
-			 * mapping as such.  If the page has been used in
-			 * another mapping, we will catch it; if this other
-			 * mapping is already gone, the unmap path will have
-			 * set PG_referenced or activated the page.
-			 */
-			if (likely(!(vma->vm_flags & VM_SEQ_READ)))
+		if (pvmw.pte) {
+			if (ptep_clear_flush_young_notify(vma, address,
+						pvmw.pte)) {
+				/*
+				 * Don't treat a reference through
+				 * a sequentially read mapping as such.
+				 * If the page has been used in another mapping,
+				 * we will catch it; if this other mapping is
+				 * already gone, the unmap path will have set
+				 * PG_referenced or activated the page.
+				 */
+				if (likely(!(vma->vm_flags & VM_SEQ_READ)))
+					referenced++;
+			}
+		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+			if (pmdp_clear_flush_young_notify(vma, address,
+						pvmw.pmd))
 				referenced++;
+		} else {
+			/* unexpected pmd-mapped page? */
+			WARN_ON_ONCE(1);
 		}
-		pte_unmap(pte);
-	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
-		if (pmdp_clear_flush_young_notify(vma, address, pmd))
-			referenced++;
-	} else {
-		/* unexpected pmd-mapped page? */
-		WARN_ON_ONCE(1);
+
+		pra->mapcount--;
 	}
-	spin_unlock(ptl);
 
 	if (referenced)
 		clear_page_idle(page);
@@ -936,7 +939,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		pra->vm_flags |= vma->vm_flags;
 	}
 
-	pra->mapcount--;
 	if (!pra->mapcount)
 		return SWAP_SUCCESS; /* To break the loop */
 
-- 
2.11.0


* [PATCHv3 04/12] mm: fix handling PTE-mapped THPs in page_idle_clear_pte_refs()
  2017-01-29 17:38 [PATCHv3 00/12] Fix few rmap-related THP bugs Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2017-01-29 17:38 ` [PATCHv3 03/12] mm: fix handling PTE-mapped THPs in page_referenced() Kirill A. Shutemov
@ 2017-01-29 17:38 ` Kirill A. Shutemov
  2017-01-29 17:38 ` [PATCHv3 05/12] mm, rmap: check all VMAs that PTE-mapped THP can be part of Kirill A. Shutemov
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2017-01-29 17:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins, Rik van Riel, Andrew Morton
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov, Vladimir Davydov

For PTE-mapped THPs, page_check_address_transhuge() is not adequate: it
cannot find all relevant PTEs, only the first one.

Let's switch it to page_vma_mapped_walk().

I don't think it's a candidate for stable@: it's not fatal.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
---
 mm/page_idle.c | 34 +++++++++++++++++-----------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/mm/page_idle.c b/mm/page_idle.c
index ae11aa914e55..b0ee56c56b58 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -54,27 +54,27 @@ static int page_idle_clear_pte_refs_one(struct page *page,
 					struct vm_area_struct *vma,
 					unsigned long addr, void *arg)
 {
-	struct mm_struct *mm = vma->vm_mm;
-	pmd_t *pmd;
-	pte_t *pte;
-	spinlock_t *ptl;
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.address = addr,
+	};
 	bool referenced = false;
 
-	if (!page_check_address_transhuge(page, mm, addr, &pmd, &pte, &ptl))
-		return SWAP_AGAIN;
-
-	if (pte) {
-		referenced = ptep_clear_young_notify(vma, addr, pte);
-		pte_unmap(pte);
-	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
-		referenced = pmdp_clear_young_notify(vma, addr, pmd);
-	} else {
-		/* unexpected pmd-mapped page? */
-		WARN_ON_ONCE(1);
+	while (page_vma_mapped_walk(&pvmw)) {
+		addr = pvmw.address;
+		if (pvmw.pte) {
+			referenced = ptep_clear_young_notify(vma, addr,
+					pvmw.pte);
+		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+			referenced = pmdp_clear_young_notify(vma, addr,
+					pvmw.pmd);
+		} else {
+			/* unexpected pmd-mapped page? */
+			WARN_ON_ONCE(1);
+		}
 	}
 
-	spin_unlock(ptl);
-
 	if (referenced) {
 		clear_page_idle(page);
 		/*
-- 
2.11.0


* [PATCHv3 05/12] mm, rmap: check all VMAs that PTE-mapped THP can be part of
  2017-01-29 17:38 [PATCHv3 00/12] Fix few rmap-related THP bugs Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2017-01-29 17:38 ` [PATCHv3 04/12] mm: fix handling PTE-mapped THPs in page_idle_clear_pte_refs() Kirill A. Shutemov
@ 2017-01-29 17:38 ` Kirill A. Shutemov
  2017-01-29 17:38 ` [PATCHv3 06/12] mm: convert page_mkclean_one() to use page_vma_mapped_walk() Kirill A. Shutemov
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2017-01-29 17:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins, Rik van Riel, Andrew Morton
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov

The current rmap code can miss a VMA that maps a PTE-mapped THP if the first
subpage of the THP was unmapped from that VMA.

We need to walk rmap for the whole range of offsets that the THP covers, not
only the first one.

vma_address() also needs to be corrected to check the whole range instead of
only the first subpage.
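
A sketch of the new range handling (based on the hunks below; assume a
PMD-sized THP, i.e. hpage_nr_pages(page) == 512):

	pgoff_start = page_to_pgoff(page);
	pgoff_end = pgoff_start + hpage_nr_pages(page) - 1;

	/*
	 * A VMA that maps only subpages 1..511 (subpage 0 already unmapped)
	 * still overlaps [pgoff_start, pgoff_end], so the interval tree walk
	 * now finds it. vma_address() then clamps the start of the range to
	 * the VMA boundary:
	 */
	start = __vma_address(page, vma);
	end = start + PAGE_SIZE * (hpage_nr_pages(page) - 1);
	VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma);
	address = max(start, vma->vm_start);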

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/internal.h |  9 ++++++---
 mm/rmap.c     | 16 ++++++++++------
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 03763f5c42c5..1f90c65df7fb 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -333,12 +333,15 @@ __vma_address(struct page *page, struct vm_area_struct *vma)
 static inline unsigned long
 vma_address(struct page *page, struct vm_area_struct *vma)
 {
-	unsigned long address = __vma_address(page, vma);
+	unsigned long start, end;
+
+	start = __vma_address(page, vma);
+	end = start + PAGE_SIZE * (hpage_nr_pages(page) - 1);
 
 	/* page should be within @vma mapping range */
-	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+	VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma);
 
-	return address;
+	return max(start, vma->vm_start);
 }
 
 #else /* !CONFIG_MMU */
diff --git a/mm/rmap.c b/mm/rmap.c
index 0dff8accd629..c4bad599cc7b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1757,7 +1757,7 @@ static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
 		bool locked)
 {
 	struct anon_vma *anon_vma;
-	pgoff_t pgoff;
+	pgoff_t pgoff_start, pgoff_end;
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
 
@@ -1771,8 +1771,10 @@ static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
 	if (!anon_vma)
 		return ret;
 
-	pgoff = page_to_pgoff(page);
-	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
+	pgoff_start = page_to_pgoff(page);
+	pgoff_end = pgoff_start + hpage_nr_pages(page) - 1;
+	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
+			pgoff_start, pgoff_end) {
 		struct vm_area_struct *vma = avc->vma;
 		unsigned long address = vma_address(page, vma);
 
@@ -1810,7 +1812,7 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc,
 		bool locked)
 {
 	struct address_space *mapping = page_mapping(page);
-	pgoff_t pgoff;
+	pgoff_t pgoff_start, pgoff_end;
 	struct vm_area_struct *vma;
 	int ret = SWAP_AGAIN;
 
@@ -1825,10 +1827,12 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc,
 	if (!mapping)
 		return ret;
 
-	pgoff = page_to_pgoff(page);
+	pgoff_start = page_to_pgoff(page);
+	pgoff_end = pgoff_start + hpage_nr_pages(page) - 1;
 	if (!locked)
 		i_mmap_lock_read(mapping);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+	vma_interval_tree_foreach(vma, &mapping->i_mmap,
+			pgoff_start, pgoff_end) {
 		unsigned long address = vma_address(page, vma);
 
 		cond_resched();
-- 
2.11.0


* [PATCHv3 06/12] mm: convert page_mkclean_one() to use page_vma_mapped_walk()
  2017-01-29 17:38 [PATCHv3 00/12] Fix few rmap-related THP bugs Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2017-01-29 17:38 ` [PATCHv3 05/12] mm, rmap: check all VMAs that PTE-mapped THP can be part of Kirill A. Shutemov
@ 2017-01-29 17:38 ` Kirill A. Shutemov
  2017-01-29 17:38 ` [PATCHv3 07/12] mm: convert try_to_unmap_one() " Kirill A. Shutemov
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2017-01-29 17:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins, Rik van Riel, Andrew Morton
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov

For consistency, it is worth converting all page_check_address() users to
page_vma_mapped_walk(), so we can drop the former.

PMD handling here is future-proofing; we don't have users yet. ext4 with
huge pages will be the first.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/rmap.c | 68 ++++++++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 45 insertions(+), 23 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index c4bad599cc7b..58597de049fd 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1017,34 +1017,56 @@ int page_referenced(struct page *page,
 static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 			    unsigned long address, void *arg)
 {
-	struct mm_struct *mm = vma->vm_mm;
-	pte_t *pte;
-	spinlock_t *ptl;
-	int ret = 0;
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.address = address,
+		.flags = PVMW_SYNC,
+	};
 	int *cleaned = arg;
 
-	pte = page_check_address(page, mm, address, &ptl, 1);
-	if (!pte)
-		goto out;
-
-	if (pte_dirty(*pte) || pte_write(*pte)) {
-		pte_t entry;
+	while (page_vma_mapped_walk(&pvmw)) {
+		int ret = 0;
+		address = pvmw.address;
+		if (pvmw.pte) {
+			pte_t entry;
+			pte_t *pte = pvmw.pte;
+
+			if (!pte_dirty(*pte) && !pte_write(*pte))
+				continue;
+
+			flush_cache_page(vma, address, pte_pfn(*pte));
+			entry = ptep_clear_flush(vma, address, pte);
+			entry = pte_wrprotect(entry);
+			entry = pte_mkclean(entry);
+			set_pte_at(vma->vm_mm, address, pte, entry);
+			ret = 1;
+		} else {
+#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
+			pmd_t *pmd = pvmw.pmd;
+			pmd_t entry;
+
+			if (!pmd_dirty(*pmd) && !pmd_write(*pmd))
+				continue;
+
+			flush_cache_page(vma, address, page_to_pfn(page));
+			entry = pmdp_huge_clear_flush(vma, address, pmd);
+			entry = pmd_wrprotect(entry);
+			entry = pmd_mkclean(entry);
+			set_pmd_at(vma->vm_mm, address, pmd, entry);
+			ret = 1;
+#else
+			/* unexpected pmd-mapped page? */
+			WARN_ON_ONCE(1);
+#endif
+		}
 
-		flush_cache_page(vma, address, pte_pfn(*pte));
-		entry = ptep_clear_flush(vma, address, pte);
-		entry = pte_wrprotect(entry);
-		entry = pte_mkclean(entry);
-		set_pte_at(mm, address, pte, entry);
-		ret = 1;
+		if (ret) {
+			mmu_notifier_invalidate_page(vma->vm_mm, address);
+			(*cleaned)++;
+		}
 	}
 
-	pte_unmap_unlock(pte, ptl);
-
-	if (ret) {
-		mmu_notifier_invalidate_page(mm, address);
-		(*cleaned)++;
-	}
-out:
 	return SWAP_AGAIN;
 }
 
-- 
2.11.0


* [PATCHv3 07/12] mm: convert try_to_unmap_one() to use page_vma_mapped_walk()
  2017-01-29 17:38 [PATCHv3 00/12] Fix few rmap-related THP bugs Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2017-01-29 17:38 ` [PATCHv3 06/12] mm: convert page_mkclean_one() to use page_vma_mapped_walk() Kirill A. Shutemov
@ 2017-01-29 17:38 ` Kirill A. Shutemov
  2017-01-29 17:38 ` [PATCHv3 08/12] mm, ksm: convert write_protect_page() " Kirill A. Shutemov
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2017-01-29 17:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins, Rik van Riel, Andrew Morton
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov

For consistency, it is worth converting all page_check_address() users to
page_vma_mapped_walk(), so we can drop the former.

It also makes freeze_page() faster, as we now walk through rmap only once.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |  16 +---
 mm/rmap.c        | 260 ++++++++++++++++++++++++++++---------------------------
 2 files changed, 137 insertions(+), 139 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 16820e001d79..ca7855f857fa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1839,24 +1839,16 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 static void freeze_page(struct page *page)
 {
 	enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
-		TTU_RMAP_LOCKED;
-	int i, ret;
+		TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
+	int ret;
 
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
 	if (PageAnon(page))
 		ttu_flags |= TTU_MIGRATION;
 
-	/* We only need TTU_SPLIT_HUGE_PMD once */
-	ret = try_to_unmap(page, ttu_flags | TTU_SPLIT_HUGE_PMD);
-	for (i = 1; !ret && i < HPAGE_PMD_NR; i++) {
-		/* Cut short if the page is unmapped */
-		if (page_count(page) == 1)
-			return;
-
-		ret = try_to_unmap(page + i, ttu_flags);
-	}
-	VM_BUG_ON_PAGE(ret, page + i - 1);
+	ret = try_to_unmap(page, ttu_flags);
+	VM_BUG_ON_PAGE(ret, page);
 }
 
 static void unfreeze_page(struct page *page)
diff --git a/mm/rmap.c b/mm/rmap.c
index 58597de049fd..11668fb881d8 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -607,8 +607,7 @@ void try_to_unmap_flush_dirty(void)
 		try_to_unmap_flush();
 }
 
-static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
-		struct page *page, bool writable)
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = &current->tlb_ubc;
 
@@ -643,8 +642,7 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 	return should_defer;
 }
 #else
-static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
-		struct page *page, bool writable)
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
 {
 }
 
@@ -1459,155 +1457,163 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		     unsigned long address, void *arg)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	pte_t *pte;
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.address = address,
+	};
 	pte_t pteval;
-	spinlock_t *ptl;
+	struct page *subpage;
 	int ret = SWAP_AGAIN;
 	struct rmap_private *rp = arg;
 	enum ttu_flags flags = rp->flags;
 
 	/* munlock has nothing to gain from examining un-locked vmas */
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
-		goto out;
+		return SWAP_AGAIN;
 
 	if (flags & TTU_SPLIT_HUGE_PMD) {
 		split_huge_pmd_address(vma, address,
 				flags & TTU_MIGRATION, page);
-		/* check if we have anything to do after split */
-		if (page_mapcount(page) == 0)
-			goto out;
 	}
 
-	pte = page_check_address(page, mm, address, &ptl,
-				 PageTransCompound(page));
-	if (!pte)
-		goto out;
+	while (page_vma_mapped_walk(&pvmw)) {
+		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
+		address = pvmw.address;
 
-	/*
-	 * If the page is mlock()d, we cannot swap it out.
-	 * If it's recently referenced (perhaps page_referenced
-	 * skipped over this mm) then we should reactivate it.
-	 */
-	if (!(flags & TTU_IGNORE_MLOCK)) {
-		if (vma->vm_flags & VM_LOCKED) {
-			/* PTE-mapped THP are never mlocked */
-			if (!PageTransCompound(page)) {
-				/*
-				 * Holding pte lock, we do *not* need
-				 * mmap_sem here
-				 */
-				mlock_vma_page(page);
+		/* Unexpected PMD-mapped THP? */
+		VM_BUG_ON_PAGE(!pvmw.pte, page);
+
+		/*
+		 * If the page is mlock()d, we cannot swap it out.
+		 * If it's recently referenced (perhaps page_referenced
+		 * skipped over this mm) then we should reactivate it.
+		 */
+		if (!(flags & TTU_IGNORE_MLOCK)) {
+			if (vma->vm_flags & VM_LOCKED) {
+				/* PTE-mapped THP are never mlocked */
+				if (!PageTransCompound(page)) {
+					/*
+					 * Holding pte lock, we do *not* need
+					 * mmap_sem here
+					 */
+					mlock_vma_page(page);
+				}
+				ret = SWAP_MLOCK;
+				page_vma_mapped_walk_done(&pvmw);
+				break;
 			}
-			ret = SWAP_MLOCK;
-			goto out_unmap;
+			if (flags & TTU_MUNLOCK)
+				continue;
 		}
-		if (flags & TTU_MUNLOCK)
-			goto out_unmap;
-	}
-	if (!(flags & TTU_IGNORE_ACCESS)) {
-		if (ptep_clear_flush_young_notify(vma, address, pte)) {
-			ret = SWAP_FAIL;
-			goto out_unmap;
+
+		if (!(flags & TTU_IGNORE_ACCESS)) {
+			if (ptep_clear_flush_young_notify(vma, address,
+						pvmw.pte)) {
+				ret = SWAP_FAIL;
+				page_vma_mapped_walk_done(&pvmw);
+				break;
+			}
 		}
-  	}
 
-	/* Nuke the page table entry. */
-	flush_cache_page(vma, address, page_to_pfn(page));
-	if (should_defer_flush(mm, flags)) {
-		/*
-		 * We clear the PTE but do not flush so potentially a remote
-		 * CPU could still be writing to the page. If the entry was
-		 * previously clean then the architecture must guarantee that
-		 * a clear->dirty transition on a cached TLB entry is written
-		 * through and traps if the PTE is unmapped.
-		 */
-		pteval = ptep_get_and_clear(mm, address, pte);
+		/* Nuke the page table entry. */
+		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+		if (should_defer_flush(mm, flags)) {
+			/*
+			 * We clear the PTE but do not flush so potentially
+			 * a remote CPU could still be writing to the page.
+			 * If the entry was previously clean then the
+			 * architecture must guarantee that a clear->dirty
+			 * transition on a cached TLB entry is written through
+			 * and traps if the PTE is unmapped.
+			 */
+			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
+
+			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+		} else {
+			pteval = ptep_clear_flush(vma, address, pvmw.pte);
+		}
 
-		set_tlb_ubc_flush_pending(mm, page, pte_dirty(pteval));
-	} else {
-		pteval = ptep_clear_flush(vma, address, pte);
-	}
+		/* Move the dirty bit to the page. Now the pte is gone. */
+		if (pte_dirty(pteval))
+			set_page_dirty(page);
 
-	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
-		set_page_dirty(page);
+		/* Update high watermark before we lower rss */
+		update_hiwater_rss(mm);
 
-	/* Update high watermark before we lower rss */
-	update_hiwater_rss(mm);
+		if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+			if (PageHuge(page)) {
+				int nr = 1 << compound_order(page);
+				hugetlb_count_sub(nr, mm);
+			} else {
+				dec_mm_counter(mm, mm_counter(page));
+			}
 
-	if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
-		if (PageHuge(page)) {
-			hugetlb_count_sub(1 << compound_order(page), mm);
-		} else {
+			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
+			set_pte_at(mm, address, pvmw.pte, pteval);
+		} else if (pte_unused(pteval)) {
+			/*
+			 * The guest indicated that the page content is of no
+			 * interest anymore. Simply discard the pte, vmscan
+			 * will take care of the rest.
+			 */
 			dec_mm_counter(mm, mm_counter(page));
-		}
-		set_pte_at(mm, address, pte,
-			   swp_entry_to_pte(make_hwpoison_entry(page)));
-	} else if (pte_unused(pteval)) {
-		/*
-		 * The guest indicated that the page content is of no
-		 * interest anymore. Simply discard the pte, vmscan
-		 * will take care of the rest.
-		 */
-		dec_mm_counter(mm, mm_counter(page));
-	} else if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION)) {
-		swp_entry_t entry;
-		pte_t swp_pte;
-		/*
-		 * Store the pfn of the page in a special migration
-		 * pte. do_swap_page() will wait until the migration
-		 * pte is removed and then restart fault handling.
-		 */
-		entry = make_migration_entry(page, pte_write(pteval));
-		swp_pte = swp_entry_to_pte(entry);
-		if (pte_soft_dirty(pteval))
-			swp_pte = pte_swp_mksoft_dirty(swp_pte);
-		set_pte_at(mm, address, pte, swp_pte);
-	} else if (PageAnon(page)) {
-		swp_entry_t entry = { .val = page_private(page) };
-		pte_t swp_pte;
-		/*
-		 * Store the swap location in the pte.
-		 * See handle_pte_fault() ...
-		 */
-		VM_BUG_ON_PAGE(!PageSwapCache(page), page);
+		} else if (IS_ENABLED(CONFIG_MIGRATION) &&
+				(flags & TTU_MIGRATION)) {
+			swp_entry_t entry;
+			pte_t swp_pte;
+			/*
+			 * Store the pfn of the page in a special migration
+			 * pte. do_swap_page() will wait until the migration
+			 * pte is removed and then restart fault handling.
+			 */
+			entry = make_migration_entry(subpage,
+					pte_write(pteval));
+			swp_pte = swp_entry_to_pte(entry);
+			if (pte_soft_dirty(pteval))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			set_pte_at(mm, address, pvmw.pte, swp_pte);
+		} else if (PageAnon(page)) {
+			swp_entry_t entry = { .val = page_private(subpage) };
+			pte_t swp_pte;
+			/*
+			 * Store the swap location in the pte.
+			 * See handle_pte_fault() ...
+			 */
+			VM_BUG_ON_PAGE(!PageSwapCache(page), page);
+
+			if (!PageDirty(page) && (flags & TTU_LZFREE)) {
+				/* It's a freeable page by MADV_FREE */
+				dec_mm_counter(mm, MM_ANONPAGES);
+				rp->lazyfreed++;
+				goto discard;
+			}
 
-		if (!PageDirty(page) && (flags & TTU_LZFREE)) {
-			/* It's a freeable page by MADV_FREE */
+			if (swap_duplicate(entry) < 0) {
+				set_pte_at(mm, address, pvmw.pte, pteval);
+				ret = SWAP_FAIL;
+				page_vma_mapped_walk_done(&pvmw);
+				break;
+			}
+			if (list_empty(&mm->mmlist)) {
+				spin_lock(&mmlist_lock);
+				if (list_empty(&mm->mmlist))
+					list_add(&mm->mmlist, &init_mm.mmlist);
+				spin_unlock(&mmlist_lock);
+			}
 			dec_mm_counter(mm, MM_ANONPAGES);
-			rp->lazyfreed++;
-			goto discard;
-		}
-
-		if (swap_duplicate(entry) < 0) {
-			set_pte_at(mm, address, pte, pteval);
-			ret = SWAP_FAIL;
-			goto out_unmap;
-		}
-		if (list_empty(&mm->mmlist)) {
-			spin_lock(&mmlist_lock);
-			if (list_empty(&mm->mmlist))
-				list_add(&mm->mmlist, &init_mm.mmlist);
-			spin_unlock(&mmlist_lock);
-		}
-		dec_mm_counter(mm, MM_ANONPAGES);
-		inc_mm_counter(mm, MM_SWAPENTS);
-		swp_pte = swp_entry_to_pte(entry);
-		if (pte_soft_dirty(pteval))
-			swp_pte = pte_swp_mksoft_dirty(swp_pte);
-		set_pte_at(mm, address, pte, swp_pte);
-	} else
-		dec_mm_counter(mm, mm_counter_file(page));
-
+			inc_mm_counter(mm, MM_SWAPENTS);
+			swp_pte = swp_entry_to_pte(entry);
+			if (pte_soft_dirty(pteval))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			set_pte_at(mm, address, pvmw.pte, swp_pte);
+		} else
+			dec_mm_counter(mm, mm_counter_file(page));
 discard:
-	page_remove_rmap(page, PageHuge(page));
-	put_page(page);
-
-out_unmap:
-	pte_unmap_unlock(pte, ptl);
-	if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
+		page_remove_rmap(subpage, PageHuge(page));
+		put_page(page);
 		mmu_notifier_invalidate_page(mm, address);
-out:
+	}
 	return ret;
 }
 
@@ -1632,7 +1638,7 @@ static bool invalid_migration_vma(struct vm_area_struct *vma, void *arg)
 
 static int page_mapcount_is_zero(struct page *page)
 {
-	return !page_mapcount(page);
+	return !total_mapcount(page);
 }
 
 /**
-- 
2.11.0


* [PATCHv3 08/12] mm, ksm: convert write_protect_page() to use page_vma_mapped_walk()
  2017-01-29 17:38 [PATCHv3 00/12] Fix few rmap-related THP bugs Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2017-01-29 17:38 ` [PATCHv3 07/12] mm: convert try_to_unmap_one() " Kirill A. Shutemov
@ 2017-01-29 17:38 ` Kirill A. Shutemov
  2017-01-29 17:38 ` [PATCHv3 09/12] mm, uprobes: convert __replace_page() " Kirill A. Shutemov
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2017-01-29 17:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins, Rik van Riel, Andrew Morton
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov

For consistency, it is worth converting all page_check_address() users to
page_vma_mapped_walk(), so we can drop the former.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/ksm.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 9ae6011a41f8..91a2eb048516 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -850,33 +850,35 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 			      pte_t *orig_pte)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	unsigned long addr;
-	pte_t *ptep;
-	spinlock_t *ptl;
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+	};
 	int swapped;
 	int err = -EFAULT;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 
-	addr = page_address_in_vma(page, vma);
-	if (addr == -EFAULT)
+	pvmw.address = page_address_in_vma(page, vma);
+	if (pvmw.address == -EFAULT)
 		goto out;
 
 	BUG_ON(PageTransCompound(page));
 
-	mmun_start = addr;
-	mmun_end   = addr + PAGE_SIZE;
+	mmun_start = pvmw.address;
+	mmun_end   = pvmw.address + PAGE_SIZE;
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 
-	ptep = page_check_address(page, mm, addr, &ptl, 0);
-	if (!ptep)
+	if (!page_vma_mapped_walk(&pvmw))
 		goto out_mn;
+	if (WARN_ONCE(!pvmw.pte, "Unexpected PMD mapping?"))
+		goto out_unlock;
 
-	if (pte_write(*ptep) || pte_dirty(*ptep)) {
+	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte)) {
 		pte_t entry;
 
 		swapped = PageSwapCache(page);
-		flush_cache_page(vma, addr, page_to_pfn(page));
+		flush_cache_page(vma, pvmw.address, page_to_pfn(page));
 		/*
 		 * Ok this is tricky, when get_user_pages_fast() run it doesn't
 		 * take any lock, therefore the check that we are going to make
@@ -886,25 +888,25 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 		 * this assure us that no O_DIRECT can happen after the check
 		 * or in the middle of the check.
 		 */
-		entry = ptep_clear_flush_notify(vma, addr, ptep);
+		entry = ptep_clear_flush_notify(vma, pvmw.address, pvmw.pte);
 		/*
 		 * Check that no O_DIRECT or similar I/O is in progress on the
 		 * page
 		 */
 		if (page_mapcount(page) + 1 + swapped != page_count(page)) {
-			set_pte_at(mm, addr, ptep, entry);
+			set_pte_at(mm, pvmw.address, pvmw.pte, entry);
 			goto out_unlock;
 		}
 		if (pte_dirty(entry))
 			set_page_dirty(page);
 		entry = pte_mkclean(pte_wrprotect(entry));
-		set_pte_at_notify(mm, addr, ptep, entry);
+		set_pte_at_notify(mm, pvmw.address, pvmw.pte, entry);
 	}
-	*orig_pte = *ptep;
+	*orig_pte = *pvmw.pte;
 	err = 0;
 
 out_unlock:
-	pte_unmap_unlock(ptep, ptl);
+	page_vma_mapped_walk_done(&pvmw);
 out_mn:
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 out:
-- 
2.11.0


* [PATCHv3 09/12] mm, uprobes: convert __replace_page() to use page_vma_mapped_walk()
  2017-01-29 17:38 [PATCHv3 00/12] Fix few rmap-related THP bugs Kirill A. Shutemov
                   ` (7 preceding siblings ...)
  2017-01-29 17:38 ` [PATCHv3 08/12] mm, ksm: convert write_protect_page() " Kirill A. Shutemov
@ 2017-01-29 17:38 ` Kirill A. Shutemov
  2017-01-29 17:38 ` [PATCHv3 10/12] mm: convert page_mapped_in_vma() " Kirill A. Shutemov
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2017-01-29 17:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins, Rik van Riel, Andrew Morton
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov

For consistency, it is worth converting all page_check_address() users to
page_vma_mapped_walk(), so we can drop the former.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/events/uprobes.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 1e65c79e52a6..18c6b23edd3c 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -153,14 +153,19 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 				struct page *old_page, struct page *new_page)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	spinlock_t *ptl;
-	pte_t *ptep;
+	struct page_vma_mapped_walk pvmw = {
+		.page = old_page,
+		.vma = vma,
+		.address = addr,
+	};
 	int err;
 	/* For mmu_notifiers */
 	const unsigned long mmun_start = addr;
 	const unsigned long mmun_end   = addr + PAGE_SIZE;
 	struct mem_cgroup *memcg;
 
+	VM_BUG_ON_PAGE(PageTransHuge(old_page), old_page);
+
 	err = mem_cgroup_try_charge(new_page, vma->vm_mm, GFP_KERNEL, &memcg,
 			false);
 	if (err)
@@ -171,11 +176,11 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	err = -EAGAIN;
-	ptep = page_check_address(old_page, mm, addr, &ptl, 0);
-	if (!ptep) {
+	if (!page_vma_mapped_walk(&pvmw)) {
 		mem_cgroup_cancel_charge(new_page, memcg, false);
 		goto unlock;
 	}
+	VM_BUG_ON_PAGE(addr != pvmw.address, old_page);
 
 	get_page(new_page);
 	page_add_new_anon_rmap(new_page, vma, addr, false);
@@ -187,14 +192,15 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 		inc_mm_counter(mm, MM_ANONPAGES);
 	}
 
-	flush_cache_page(vma, addr, pte_pfn(*ptep));
-	ptep_clear_flush_notify(vma, addr, ptep);
-	set_pte_at_notify(mm, addr, ptep, mk_pte(new_page, vma->vm_page_prot));
+	flush_cache_page(vma, addr, pte_pfn(*pvmw.pte));
+	ptep_clear_flush_notify(vma, addr, pvmw.pte);
+	set_pte_at_notify(mm, addr, pvmw.pte,
+			mk_pte(new_page, vma->vm_page_prot));
 
 	page_remove_rmap(old_page, false);
 	if (!page_mapped(old_page))
 		try_to_free_swap(old_page);
-	pte_unmap_unlock(ptep, ptl);
+	page_vma_mapped_walk_done(&pvmw);
 
 	if (vma->vm_flags & VM_LOCKED)
 		munlock_vma_page(old_page);
-- 
2.11.0


* [PATCHv3 10/12] mm: convert page_mapped_in_vma() to use page_vma_mapped_walk()
  2017-01-29 17:38 [PATCHv3 00/12] Fix few rmap-related THP bugs Kirill A. Shutemov
                   ` (8 preceding siblings ...)
  2017-01-29 17:38 ` [PATCHv3 09/12] mm, uprobes: convert __replace_page() " Kirill A. Shutemov
@ 2017-01-29 17:38 ` Kirill A. Shutemov
  2017-01-29 17:38 ` [PATCHv3 11/12] mm: drop page_check_address{,_transhuge} Kirill A. Shutemov
  2017-01-29 17:38 ` [PATCHv3 12/12] mm: convert remove_migration_pte() to use page_vma_mapped_walk() Kirill A. Shutemov
  11 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2017-01-29 17:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins, Rik van Riel, Andrew Morton
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov

For consistency, it is worth converting all page_check_address() users to
page_vma_mapped_walk(), so we can drop the former.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/page_vma_mapped.c | 30 ++++++++++++++++++++++++++++++
 mm/rmap.c            | 26 --------------------------
 2 files changed, 30 insertions(+), 26 deletions(-)

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index bbd2a39e985d..dc4756f878be 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -186,3 +186,33 @@ next_pte:	do {
 		}
 	}
 }
+
+/**
+ * page_mapped_in_vma - check whether a page is really mapped in a VMA
+ * @page: the page to test
+ * @vma: the VMA to test
+ *
+ * Returns 1 if the page is mapped into the page tables of the VMA, 0
+ * if the page is not mapped into the page tables of this VMA.  Only
+ * valid for normal file or anonymous VMAs.
+ */
+int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
+{
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.flags = PVMW_SYNC,
+	};
+	unsigned long start, end;
+
+	start = __vma_address(page, vma);
+	end = start + PAGE_SIZE * (hpage_nr_pages(page) - 1);
+
+	if (unlikely(end < vma->vm_start || start >= vma->vm_end))
+		return 0;
+	pvmw.address = max(start, vma->vm_start);
+	if (!page_vma_mapped_walk(&pvmw))
+		return 0;
+	page_vma_mapped_walk_done(&pvmw);
+	return 1;
+}
diff --git a/mm/rmap.c b/mm/rmap.c
index 11668fb881d8..80525820aada 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -756,32 +756,6 @@ pte_t *__page_check_address(struct page *page, struct mm_struct *mm,
 	return NULL;
 }
 
-/**
- * page_mapped_in_vma - check whether a page is really mapped in a VMA
- * @page: the page to test
- * @vma: the VMA to test
- *
- * Returns 1 if the page is mapped into the page tables of the VMA, 0
- * if the page is not mapped into the page tables of this VMA.  Only
- * valid for normal file or anonymous VMAs.
- */
-int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
-{
-	unsigned long address;
-	pte_t *pte;
-	spinlock_t *ptl;
-
-	address = __vma_address(page, vma);
-	if (unlikely(address < vma->vm_start || address >= vma->vm_end))
-		return 0;
-	pte = page_check_address(page, vma->vm_mm, address, &ptl, 1);
-	if (!pte)			/* the page is not in this mm */
-		return 0;
-	pte_unmap_unlock(pte, ptl);
-
-	return 1;
-}
-
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /*
  * Check that @page is mapped at @address into @mm. In contrast to
-- 
2.11.0


* [PATCHv3 11/12] mm: drop page_check_address{,_transhuge}
  2017-01-29 17:38 [PATCHv3 00/12] Fix few rmap-related THP bugs Kirill A. Shutemov
                   ` (9 preceding siblings ...)
  2017-01-29 17:38 ` [PATCHv3 10/12] mm: convert page_mapped_in_vma() " Kirill A. Shutemov
@ 2017-01-29 17:38 ` Kirill A. Shutemov
  2017-01-29 17:38 ` [PATCHv3 12/12] mm: convert remove_migration_pte() to use page_vma_mapped_walk() Kirill A. Shutemov
  11 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2017-01-29 17:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins, Rik van Riel, Andrew Morton
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov

All users are gone. Let's drop them.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/rmap.h |  36 --------------
 mm/rmap.c            | 138 ---------------------------------------------------
 2 files changed, 174 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b76343610653..8c89e902df3e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -197,42 +197,6 @@ int page_referenced(struct page *, int is_locked,
 
 int try_to_unmap(struct page *, enum ttu_flags flags);
 
-/*
- * Used by uprobes to replace a userspace page safely
- */
-pte_t *__page_check_address(struct page *, struct mm_struct *,
-				unsigned long, spinlock_t **, int);
-
-static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm,
-					unsigned long address,
-					spinlock_t **ptlp, int sync)
-{
-	pte_t *ptep;
-
-	__cond_lock(*ptlp, ptep = __page_check_address(page, mm, address,
-						       ptlp, sync));
-	return ptep;
-}
-
-/*
- * Used by idle page tracking to check if a page was referenced via page
- * tables.
- */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-bool page_check_address_transhuge(struct page *page, struct mm_struct *mm,
-				  unsigned long address, pmd_t **pmdp,
-				  pte_t **ptep, spinlock_t **ptlp);
-#else
-static inline bool page_check_address_transhuge(struct page *page,
-				struct mm_struct *mm, unsigned long address,
-				pmd_t **pmdp, pte_t **ptep, spinlock_t **ptlp)
-{
-	*ptep = page_check_address(page, mm, address, ptlp, 0);
-	*pmdp = NULL;
-	return !!*ptep;
-}
-#endif
-
 /* Avoid racy checks */
 #define PVMW_SYNC		(1 << 0)
 /* Look for migration entries rather than present PTEs */
diff --git a/mm/rmap.c b/mm/rmap.c
index 80525820aada..8774791e2809 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -708,144 +708,6 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
 	return pmd;
 }
 
-/*
- * Check that @page is mapped at @address into @mm.
- *
- * If @sync is false, page_check_address may perform a racy check to avoid
- * the page table lock when the pte is not present (helpful when reclaiming
- * highly shared pages).
- *
- * On success returns with pte mapped and locked.
- */
-pte_t *__page_check_address(struct page *page, struct mm_struct *mm,
-			  unsigned long address, spinlock_t **ptlp, int sync)
-{
-	pmd_t *pmd;
-	pte_t *pte;
-	spinlock_t *ptl;
-
-	if (unlikely(PageHuge(page))) {
-		/* when pud is not present, pte will be NULL */
-		pte = huge_pte_offset(mm, address);
-		if (!pte)
-			return NULL;
-
-		ptl = huge_pte_lockptr(page_hstate(page), mm, pte);
-		goto check;
-	}
-
-	pmd = mm_find_pmd(mm, address);
-	if (!pmd)
-		return NULL;
-
-	pte = pte_offset_map(pmd, address);
-	/* Make a quick check before getting the lock */
-	if (!sync && !pte_present(*pte)) {
-		pte_unmap(pte);
-		return NULL;
-	}
-
-	ptl = pte_lockptr(mm, pmd);
-check:
-	spin_lock(ptl);
-	if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
-		*ptlp = ptl;
-		return pte;
-	}
-	pte_unmap_unlock(pte, ptl);
-	return NULL;
-}
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-/*
- * Check that @page is mapped at @address into @mm. In contrast to
- * page_check_address(), this function can handle transparent huge pages.
- *
- * On success returns true with pte mapped and locked. For PMD-mapped
- * transparent huge pages *@ptep is set to NULL.
- */
-bool page_check_address_transhuge(struct page *page, struct mm_struct *mm,
-				  unsigned long address, pmd_t **pmdp,
-				  pte_t **ptep, spinlock_t **ptlp)
-{
-	pgd_t *pgd;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte;
-	spinlock_t *ptl;
-
-	if (unlikely(PageHuge(page))) {
-		/* when pud is not present, pte will be NULL */
-		pte = huge_pte_offset(mm, address);
-		if (!pte)
-			return false;
-
-		ptl = huge_pte_lockptr(page_hstate(page), mm, pte);
-		pmd = NULL;
-		goto check_pte;
-	}
-
-	pgd = pgd_offset(mm, address);
-	if (!pgd_present(*pgd))
-		return false;
-	pud = pud_offset(pgd, address);
-	if (!pud_present(*pud))
-		return false;
-	pmd = pmd_offset(pud, address);
-
-	if (pmd_trans_huge(*pmd)) {
-		ptl = pmd_lock(mm, pmd);
-		if (!pmd_present(*pmd))
-			goto unlock_pmd;
-		if (unlikely(!pmd_trans_huge(*pmd))) {
-			spin_unlock(ptl);
-			goto map_pte;
-		}
-
-		if (pmd_page(*pmd) != page)
-			goto unlock_pmd;
-
-		pte = NULL;
-		goto found;
-unlock_pmd:
-		spin_unlock(ptl);
-		return false;
-	} else {
-		pmd_t pmde = *pmd;
-
-		barrier();
-		if (!pmd_present(pmde) || pmd_trans_huge(pmde))
-			return false;
-	}
-map_pte:
-	pte = pte_offset_map(pmd, address);
-	if (!pte_present(*pte)) {
-		pte_unmap(pte);
-		return false;
-	}
-
-	ptl = pte_lockptr(mm, pmd);
-check_pte:
-	spin_lock(ptl);
-
-	if (!pte_present(*pte)) {
-		pte_unmap_unlock(pte, ptl);
-		return false;
-	}
-
-	/* THP can be referenced by any subpage */
-	if (pte_pfn(*pte) - page_to_pfn(page) >= hpage_nr_pages(page)) {
-		pte_unmap_unlock(pte, ptl);
-		return false;
-	}
-found:
-	*ptep = pte;
-	*pmdp = pmd;
-	*ptlp = ptl;
-	return true;
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
 struct page_referenced_arg {
 	int mapcount;
 	int referenced;
-- 
2.11.0


* [PATCHv3 12/12] mm: convert remove_migration_pte() to use page_vma_mapped_walk()
  2017-01-29 17:38 [PATCHv3 00/12] Fix few rmap-related THP bugs Kirill A. Shutemov
                   ` (10 preceding siblings ...)
  2017-01-29 17:38 ` [PATCHv3 11/12] mm: drop page_check_address{,_transhuge} Kirill A. Shutemov
@ 2017-01-29 17:38 ` Kirill A. Shutemov
  11 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2017-01-29 17:38 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins, Rik van Riel, Andrew Morton
  Cc: linux-mm, linux-kernel, Kirill A. Shutemov

remove_migration_pte() can also easily be converted to page_vma_mapped_walk().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/migrate.c | 104 +++++++++++++++++++++++------------------------------------
 1 file changed, 41 insertions(+), 63 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 87f4d0f81819..366466ed7fdc 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -193,82 +193,60 @@ void putback_movable_pages(struct list_head *l)
 /*
  * Restore a potential migration pte to a working pte entry
  */
-static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
+static int remove_migration_pte(struct page *page, struct vm_area_struct *vma,
 				 unsigned long addr, void *old)
 {
-	struct mm_struct *mm = vma->vm_mm;
+	struct page_vma_mapped_walk pvmw = {
+		.page = old,
+		.vma = vma,
+		.address = addr,
+		.flags = PVMW_SYNC | PVMW_MIGRATION,
+	};
+	struct page *new;
+	pte_t pte;
 	swp_entry_t entry;
- 	pmd_t *pmd;
-	pte_t *ptep, pte;
- 	spinlock_t *ptl;
 
-	if (unlikely(PageHuge(new))) {
-		ptep = huge_pte_offset(mm, addr);
-		if (!ptep)
-			goto out;
-		ptl = huge_pte_lockptr(hstate_vma(vma), mm, ptep);
-	} else {
-		pmd = mm_find_pmd(mm, addr);
-		if (!pmd)
-			goto out;
+	VM_BUG_ON_PAGE(PageTail(page), page);
+	while (page_vma_mapped_walk(&pvmw)) {
+		new = page - pvmw.page->index +
+			linear_page_index(vma, pvmw.address);
 
-		ptep = pte_offset_map(pmd, addr);
+		get_page(new);
+		pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
+		if (pte_swp_soft_dirty(*pvmw.pte))
+			pte = pte_mksoft_dirty(pte);
 
-		/*
-		 * Peek to check is_swap_pte() before taking ptlock?  No, we
-		 * can race mremap's move_ptes(), which skips anon_vma lock.
-		 */
-
-		ptl = pte_lockptr(mm, pmd);
-	}
-
- 	spin_lock(ptl);
-	pte = *ptep;
-	if (!is_swap_pte(pte))
-		goto unlock;
-
-	entry = pte_to_swp_entry(pte);
-
-	if (!is_migration_entry(entry) ||
-	    migration_entry_to_page(entry) != old)
-		goto unlock;
-
-	get_page(new);
-	pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
-	if (pte_swp_soft_dirty(*ptep))
-		pte = pte_mksoft_dirty(pte);
-
-	/* Recheck VMA as permissions can change since migration started  */
-	if (is_write_migration_entry(entry))
-		pte = maybe_mkwrite(pte, vma);
+		/* Recheck VMA as permissions can change since migration started  */
+		entry = pte_to_swp_entry(*pvmw.pte);
+		if (is_write_migration_entry(entry))
+			pte = maybe_mkwrite(pte, vma);
 
 #ifdef CONFIG_HUGETLB_PAGE
-	if (PageHuge(new)) {
-		pte = pte_mkhuge(pte);
-		pte = arch_make_huge_pte(pte, vma, new, 0);
-	}
+		if (PageHuge(new)) {
+			pte = pte_mkhuge(pte);
+			pte = arch_make_huge_pte(pte, vma, new, 0);
+		}
 #endif
-	flush_dcache_page(new);
-	set_pte_at(mm, addr, ptep, pte);
+		flush_dcache_page(new);
+		set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
 
-	if (PageHuge(new)) {
-		if (PageAnon(new))
-			hugepage_add_anon_rmap(new, vma, addr);
+		if (PageHuge(new)) {
+			if (PageAnon(new))
+				hugepage_add_anon_rmap(new, vma, pvmw.address);
+			else
+				page_dup_rmap(new, true);
+		} else if (PageAnon(new))
+			page_add_anon_rmap(new, vma, pvmw.address, false);
 		else
-			page_dup_rmap(new, true);
-	} else if (PageAnon(new))
-		page_add_anon_rmap(new, vma, addr, false);
-	else
-		page_add_file_rmap(new, false);
+			page_add_file_rmap(new, false);
 
-	if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
-		mlock_vma_page(new);
+		if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
+			mlock_vma_page(new);
+
+		/* No need to invalidate - it was non-present before */
+		update_mmu_cache(vma, pvmw.address, pvmw.pte);
+	}
 
-	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(vma, addr, ptep);
-unlock:
-	pte_unmap_unlock(ptep, ptl);
-out:
 	return SWAP_AGAIN;
 }
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCHv3 01/12] uprobes: split THPs before trying replace them
  2017-01-29 17:38 ` [PATCHv3 01/12] uprobes: split THPs before trying replace them Kirill A. Shutemov
@ 2017-01-31 15:44   ` Oleg Nesterov
  0 siblings, 0 replies; 16+ messages in thread
From: Oleg Nesterov @ 2017-01-31 15:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Hugh Dickins, Rik van Riel, Andrew Morton,
	linux-mm, linux-kernel, Peter Zijlstra

On 01/29, Kirill A. Shutemov wrote:
>
> For THPs page_check_address() always fails. It leads to endless loop in
> uprobe_write_opcode().
>
> Testcase with huge-tmpfs (not sure if it's possible to trigger this
> uprobe codepath for anon memory):

No, you can't probe anonymous memory.

> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -300,8 +300,8 @@ int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr,
>  
>  retry:
>  	/* Read the page with vaddr into memory */
> -	ret = get_user_pages_remote(NULL, mm, vaddr, 1, FOLL_FORCE, &old_page,
> -			&vma, NULL);
> +	ret = get_user_pages_remote(NULL, mm, vaddr, 1,
> +			FOLL_FORCE | FOLL_SPLIT, &old_page, &vma, NULL);
>  	if (ret <= 0)
>  		return ret;

Thanks,

Acked-by: Oleg Nesterov <oleg@redhat.com>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCHv3 03/12] mm: fix handling PTE-mapped THPs in page_referenced()
  2017-01-29 17:38 ` [PATCHv3 03/12] mm: fix handling PTE-mapped THPs in page_referenced() Kirill A. Shutemov
@ 2017-02-02 15:26   ` Michal Hocko
  2017-02-04 10:33     ` Kirill A. Shutemov
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2017-02-02 15:26 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Hugh Dickins, Rik van Riel, Andrew Morton,
	linux-mm, linux-kernel

On Sun 29-01-17 20:38:49, Kirill A. Shutemov wrote:
> For PTE-mapped THP page_check_address_transhuge() is not adequate: it
> cannot find all relevant PTEs, only the first one. It means we can miss
> some references of the page and it can result in suboptimal decisions by
> vmscan.
> 
> Let's switch it to page_vma_mapped_walk().
> 
> I don't think it's subject for stable@: it's not fatal. The only side
> effect is that THP can be swapped out when it shouldn't.

Please be more specific about the situation in which this happens and how a
user can recognize it is going on. In other words, when should I
consider backporting this series?

Also, the interface is quite awkward imho. Why can't we provide a
callback into page_vma_mapped_walk() and call it for each pte/pmd that
matters to the given page? Wouldn't that be much easier than the loop
around the page_vma_mapped_walk() iterator?
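
Something like the below, just to illustrate what I have in mind; the
names and the signature are completely made up, this is not an actual
proposal:

typedef bool (*pvmw_fn)(struct page *page, struct vm_area_struct *vma,
			unsigned long address, pmd_t *pmd, pte_t *pte,
			void *arg);

/* would walk all mappings of @page within @vma and call @fn for each
 * pte/pmd under the page table lock; stop early if @fn returns false */
bool walk_page_vma_mappings(struct page *page, struct vm_area_struct *vma,
			    unsigned long address, pvmw_fn fn, void *arg);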

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/rmap.c | 66 ++++++++++++++++++++++++++++++++-------------------------------
>  1 file changed, 34 insertions(+), 32 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 91619fd70939..0dff8accd629 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -886,45 +886,48 @@ struct page_referenced_arg {
>  static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>  			unsigned long address, void *arg)
>  {
> -	struct mm_struct *mm = vma->vm_mm;
>  	struct page_referenced_arg *pra = arg;
> -	pmd_t *pmd;
> -	pte_t *pte;
> -	spinlock_t *ptl;
> +	struct page_vma_mapped_walk pvmw = {
> +		.page = page,
> +		.vma = vma,
> +		.address = address,
> +	};
>  	int referenced = 0;
>  
> -	if (!page_check_address_transhuge(page, mm, address, &pmd, &pte, &ptl))
> -		return SWAP_AGAIN;
> +	while (page_vma_mapped_walk(&pvmw)) {
> +		address = pvmw.address;
>  
> -	if (vma->vm_flags & VM_LOCKED) {
> -		if (pte)
> -			pte_unmap(pte);
> -		spin_unlock(ptl);
> -		pra->vm_flags |= VM_LOCKED;
> -		return SWAP_FAIL; /* To break the loop */
> -	}
> +		if (vma->vm_flags & VM_LOCKED) {
> +			page_vma_mapped_walk_done(&pvmw);
> +			pra->vm_flags |= VM_LOCKED;
> +			return SWAP_FAIL; /* To break the loop */
> +		}
>  
> -	if (pte) {
> -		if (ptep_clear_flush_young_notify(vma, address, pte)) {
> -			/*
> -			 * Don't treat a reference through a sequentially read
> -			 * mapping as such.  If the page has been used in
> -			 * another mapping, we will catch it; if this other
> -			 * mapping is already gone, the unmap path will have
> -			 * set PG_referenced or activated the page.
> -			 */
> -			if (likely(!(vma->vm_flags & VM_SEQ_READ)))
> +		if (pvmw.pte) {
> +			if (ptep_clear_flush_young_notify(vma, address,
> +						pvmw.pte)) {
> +				/*
> +				 * Don't treat a reference through
> +				 * a sequentially read mapping as such.
> +				 * If the page has been used in another mapping,
> +				 * we will catch it; if this other mapping is
> +				 * already gone, the unmap path will have set
> +				 * PG_referenced or activated the page.
> +				 */
> +				if (likely(!(vma->vm_flags & VM_SEQ_READ)))
> +					referenced++;
> +			}
> +		} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> +			if (pmdp_clear_flush_young_notify(vma, address,
> +						pvmw.pmd))
>  				referenced++;
> +		} else {
> +			/* unexpected pmd-mapped page? */
> +			WARN_ON_ONCE(1);
>  		}
> -		pte_unmap(pte);
> -	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> -		if (pmdp_clear_flush_young_notify(vma, address, pmd))
> -			referenced++;
> -	} else {
> -		/* unexpected pmd-mapped page? */
> -		WARN_ON_ONCE(1);
> +
> +		pra->mapcount--;
>  	}
> -	spin_unlock(ptl);
>  
>  	if (referenced)
>  		clear_page_idle(page);
> @@ -936,7 +939,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>  		pra->vm_flags |= vma->vm_flags;
>  	}
>  
> -	pra->mapcount--;
>  	if (!pra->mapcount)
>  		return SWAP_SUCCESS; /* To break the loop */
>  
> -- 
> 2.11.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCHv3 03/12] mm: fix handling PTE-mapped THPs in page_referenced()
  2017-02-02 15:26   ` Michal Hocko
@ 2017-02-04 10:33     ` Kirill A. Shutemov
  0 siblings, 0 replies; 16+ messages in thread
From: Kirill A. Shutemov @ 2017-02-04 10:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Hugh Dickins, Rik van Riel,
	Andrew Morton, linux-mm, linux-kernel

On Thu, Feb 02, 2017 at 04:26:56PM +0100, Michal Hocko wrote:
> On Sun 29-01-17 20:38:49, Kirill A. Shutemov wrote:
> > For PTE-mapped THP page_check_address_transhuge() is not adequate: it
> > cannot find all relevant PTEs, only the first one. It means we can miss
> > some references of the page and it can result in suboptimal decisions by
> > vmscan.
> > 
> > Let's switch it to page_vma_mapped_walk().
> > 
> > I don't think it's subject for stable@: it's not fatal. The only side
> > effect is that THP can be swapped out when it shouldn't.
> 
> Please be more specific about the situation in which this happens and how a
> user can recognize it is going on. In other words, when should I
> consider backporting this series?

First, you need the huge PMD to get split with split_huge_pmd(). That can
happen due to munmap(), mprotect(), mremap(), etc. After split_huge_pmd()
the THP is mapped with a bunch of PTEs instead of a single PMD.

The bug is that the kernel only sees pte_young() on the PTE that maps the
first 4k of the THP, but not on the rest. So if your access pattern touches
the THP, but not its first 4k, the page can be reclaimed unfairly and
possibly re-faulted from swap soon after.

I don't think it's visible to the user, except as unneeded
swap-out/swap-in on rare occasions.
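
For illustration only, an untested userspace sketch of such an access
pattern (it assumes 2M THPs, 4k base pages, and that mprotect() on a
sub-range of the mapping is what splits the huge PMD):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	size_t thp_size = 2UL << 20;	/* assume 2M THPs */
	size_t off;
	char *p;

	if (posix_memalign((void **)&p, thp_size, thp_size))
		return 1;
	madvise(p, thp_size, MADV_HUGEPAGE);
	memset(p, 1, thp_size);		/* fault in, hopefully as a THP */

	/* mprotect() on a sub-range should force split_huge_pmd() */
	mprotect(p, 4096, PROT_READ);

	/*
	 * Touch every subpage except the first one: before the fix
	 * page_referenced() found only the first PTE, so the THP could
	 * look idle to vmscan despite being actively used.
	 */
	for (off = 4096; off < thp_size; off += 4096)
		p[off]++;

	pause();	/* keep the mapping around for vmscan to see */
	return 0;
}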

> Also, the interface is quite awkward imho. Why can't we provide a
> callback into page_vma_mapped_walk() and call it for each pte/pmd that
> matters to the given page? Wouldn't that be much easier than the loop
> around the page_vma_mapped_walk() iterator?

I don't agree that an interface with a callback would be easier. You would
also need to pass down additional context, packing/unpacking it on both
ends. I don't think that makes the interface any less awkward.
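
Take page_referenced_one() as an example: with a callback (reusing the
made-up names from your sketch, simplified -- VM_LOCKED and VM_SEQ_READ
handling omitted), the state it updates would have to travel through
something like:

struct pr_walk_ctx {
	struct page_referenced_arg *pra;
	int referenced;
};

static bool page_referenced_pvmw(struct page *page,
		struct vm_area_struct *vma, unsigned long address,
		pmd_t *pmd, pte_t *pte, void *arg)
{
	struct pr_walk_ctx *ctx = arg;		/* unpack the context */

	if (pte) {
		if (ptep_clear_flush_young_notify(vma, address, pte))
			ctx->referenced++;
	} else if (pmdp_clear_flush_young_notify(vma, address, pmd)) {
		ctx->referenced++;
	}

	ctx->pra->mapcount--;
	return true;				/* keep walking */
}

With the iterator, all of that state is just local variables around the
while (page_vma_mapped_walk(&pvmw)) loop in the caller.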

But it's a matter of taste.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-02-04 10:34 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-29 17:38 [PATCHv3 00/12] Fix few rmap-related THP bugs Kirill A. Shutemov
2017-01-29 17:38 ` [PATCHv3 01/12] uprobes: split THPs before trying replace them Kirill A. Shutemov
2017-01-31 15:44   ` Oleg Nesterov
2017-01-29 17:38 ` [PATCHv3 02/12] mm: introduce page_vma_mapped_walk() Kirill A. Shutemov
2017-01-29 17:38 ` [PATCHv3 03/12] mm: fix handling PTE-mapped THPs in page_referenced() Kirill A. Shutemov
2017-02-02 15:26   ` Michal Hocko
2017-02-04 10:33     ` Kirill A. Shutemov
2017-01-29 17:38 ` [PATCHv3 04/12] mm: fix handling PTE-mapped THPs in page_idle_clear_pte_refs() Kirill A. Shutemov
2017-01-29 17:38 ` [PATCHv3 05/12] mm, rmap: check all VMAs that PTE-mapped THP can be part of Kirill A. Shutemov
2017-01-29 17:38 ` [PATCHv3 06/12] mm: convert page_mkclean_one() to use page_vma_mapped_walk() Kirill A. Shutemov
2017-01-29 17:38 ` [PATCHv3 07/12] mm: convert try_to_unmap_one() " Kirill A. Shutemov
2017-01-29 17:38 ` [PATCHv3 08/12] mm, ksm: convert write_protect_page() " Kirill A. Shutemov
2017-01-29 17:38 ` [PATCHv3 09/12] mm, uprobes: convert __replace_page() " Kirill A. Shutemov
2017-01-29 17:38 ` [PATCHv3 10/12] mm: convert page_mapped_in_vma() " Kirill A. Shutemov
2017-01-29 17:38 ` [PATCHv3 11/12] mm: drop page_check_address{,_transhuge} Kirill A. Shutemov
2017-01-29 17:38 ` [PATCHv3 12/12] mm: convert remove_migration_pte() to use page_vma_mapped_walk() Kirill A. Shutemov
