* [PATCHv2 00/28] huge tmpfs implementation using compound pages
@ 2016-02-11 14:21 Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 01/28] thp, dax: do not try to withdraw pgtable from non-anon VMA Kirill A. Shutemov
                   ` (27 more replies)
  0 siblings, 28 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

Here is my implementation of huge page support in tmpfs/shmem. It's more
or less complete. I'm comfortable enough with it to run it on my
workstation.

And it hasn't crashed so far. :)

The main difference from Hugh's approach[1] is that I stick with
compound pages, whereas Hugh invents a new way to couple pages: team
pages. I believe the THP refcounting rework made team pages unnecessary:
compound pages are flexible enough to serve the needs of the page cache.

Many ideas and some patches were stolen from Hugh's patchset. Having this
patchset around was very helpful.

I will continue with code validation. I expect mlock to require some
more attention.

Please review and test the code.

Git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git hugetmpfs/v2

== Patchset overview ==

[01/28]
	I posted this patch last night. I hit the bug during my testing of
	huge tmpfs, but I think DAX has the same problem, so the patch
	should be applied now.

[02-05/28]
	These patches were also posted separately. They simplify the
	split_huge_page() code at the cost of some speed. I'm not sure if
	they should go upstream, but they make my life easier for now.
	The patches were slightly adjusted to handle file pages too.

[06-11/28]
	Rework the fault path and rmap to handle file PMDs. Unlike DAX with
	vm_ops->pmd_fault, we don't need to ask the filesystem twice --
	first for a huge page and then for a small one. If ->fault happens
	to return a huge page and the VMA is suitable for mapping it as
	huge, we do so (see the sketch after this overview).

[12-20/28]
	Various preparations of the THP core for file pages.

[21-25/28]
	Various preparations of the MM core for file pages.

[26-28/28]
	And finally, bring huge pages into tmpfs/shmem.
	Two of the three patches came from Hugh's patchset. :)
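
As an aside on [06-11/28], here is a rough sketch of the check that decides
whether a page returned by ->fault can be mapped with a PMD. The function
name and the exact conditions are mine (the real logic lives in the new
do_set_pmd() path introduced later in the series), so treat it as an
illustration, not patch code:

/*
 * Illustrative sketch only: can the huge page returned by ->fault be
 * mapped with a PMD in this VMA?
 */
static bool vma_suitable_for_huge_mapping(struct vm_area_struct *vma,
		unsigned long address, struct page *page)
{
	unsigned long haddr = address & HPAGE_PMD_MASK;

	/* ->fault must have given us a THP */
	if (!PageTransCompound(page))
		return false;

	/* the PMD-sized range must lie entirely within the VMA */
	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
		return false;

	return true;
}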

[1] http://lkml.kernel.org/g/alpine.LSU.2.11.1502201941340.14414@eggly.anvils

Hugh Dickins (2):
  shmem: prepare huge=N mount option and /proc/sys/vm/shmem_huge
  shmem: get_unmapped_area align huge page

Kirill A. Shutemov (26):
  thp, dax: do not try to withdraw pgtable from non-anon VMA
  rmap: introduce rmap_walk_locked()
  rmap: extend try_to_unmap() to be usable by split_huge_page()
  mm: make remove_migration_ptes() beyond mm/migration.c
  thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers
  mm: do not pass mm_struct into handle_mm_fault
  mm: introduce fault_env
  mm: postpone page table allocation until do_set_pte()
  rmap: support file thp
  mm: introduce do_set_pmd()
  mm, rmap: account file thp pages
  thp, vmstats: add counters for huge file pages
  thp: support file pages in zap_huge_pmd()
  thp: handle file pages in split_huge_pmd()
  thp: handle file COW faults
  thp: handle file pages in mremap()
  thp: skip file huge pmd on copy_huge_pmd()
  thp: prepare change_huge_pmd() for file thp
  thp: run vma_adjust_trans_huge() outside i_mmap_rwsem
  thp: file pages support for split_huge_page()
  vmscan: split file huge pages before paging them out
  page-flags: relax policy for PG_mappedtodisk and PG_reclaim
  radix-tree: implement radix_tree_maybe_preload_order()
  filemap: prepare find and delete operations for huge pages
  truncate: handle file thp
  shmem: add huge pages support

 Documentation/filesystems/Locking |  10 +-
 arch/alpha/mm/fault.c             |   2 +-
 arch/arc/mm/fault.c               |   2 +-
 arch/arm/mm/fault.c               |   2 +-
 arch/arm64/mm/fault.c             |   2 +-
 arch/avr32/mm/fault.c             |   2 +-
 arch/cris/mm/fault.c              |   2 +-
 arch/frv/mm/fault.c               |   2 +-
 arch/hexagon/mm/vm_fault.c        |   2 +-
 arch/ia64/mm/fault.c              |   2 +-
 arch/m32r/mm/fault.c              |   2 +-
 arch/m68k/mm/fault.c              |   2 +-
 arch/metag/mm/fault.c             |   2 +-
 arch/microblaze/mm/fault.c        |   2 +-
 arch/mips/mm/fault.c              |   2 +-
 arch/mn10300/mm/fault.c           |   2 +-
 arch/nios2/mm/fault.c             |   2 +-
 arch/openrisc/mm/fault.c          |   2 +-
 arch/parisc/mm/fault.c            |   2 +-
 arch/powerpc/mm/copro_fault.c     |   2 +-
 arch/powerpc/mm/fault.c           |   2 +-
 arch/s390/mm/fault.c              |   2 +-
 arch/score/mm/fault.c             |   2 +-
 arch/sh/mm/fault.c                |   2 +-
 arch/sparc/mm/fault_32.c          |   4 +-
 arch/sparc/mm/fault_64.c          |   2 +-
 arch/tile/mm/fault.c              |   2 +-
 arch/um/kernel/trap.c             |   2 +-
 arch/unicore32/mm/fault.c         |   2 +-
 arch/x86/mm/fault.c               |   2 +-
 arch/xtensa/mm/fault.c            |   2 +-
 drivers/base/node.c               |  10 +-
 drivers/char/mem.c                |  24 ++
 drivers/iommu/amd_iommu_v2.c      |   2 +-
 drivers/iommu/intel-svm.c         |   2 +-
 fs/proc/meminfo.c                 |   5 +-
 fs/userfaultfd.c                  |  22 +-
 include/linux/huge_mm.h           |  29 +-
 include/linux/mm.h                |  33 +-
 include/linux/mmzone.h            |   3 +-
 include/linux/page-flags.h        |   6 +-
 include/linux/radix-tree.h        |   1 +
 include/linux/rmap.h              |   8 +-
 include/linux/shmem_fs.h          |  18 +-
 include/linux/userfaultfd_k.h     |   8 +-
 include/linux/vm_event_item.h     |   7 +
 ipc/shm.c                         |   6 +-
 kernel/sysctl.c                   |  12 +
 lib/radix-tree.c                  |  70 +++-
 mm/filemap.c                      | 220 +++++++----
 mm/gup.c                          |   7 +-
 mm/huge_memory.c                  | 714 ++++++++++++++--------------------
 mm/internal.h                     |  20 +-
 mm/ksm.c                          |   3 +-
 mm/memory.c                       | 796 +++++++++++++++++++++-----------------
 mm/mempolicy.c                    |   4 +-
 mm/migrate.c                      |  17 +-
 mm/mmap.c                         |  20 +-
 mm/mremap.c                       |  22 +-
 mm/nommu.c                        |   3 +-
 mm/page-writeback.c               |   1 +
 mm/rmap.c                         | 125 ++++--
 mm/shmem.c                        | 493 +++++++++++++++++++----
 mm/swap.c                         |   2 +
 mm/truncate.c                     |  22 +-
 mm/util.c                         |   6 +
 mm/vmscan.c                       |  15 +-
 mm/vmstat.c                       |   3 +
 68 files changed, 1727 insertions(+), 1104 deletions(-)

-- 
2.7.0


* [PATCHv2 01/28] thp, dax: do not try to withdraw pgtable from non-anon VMA
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 02/28] rmap: introduce rmap_walk_locked() Kirill A. Shutemov
                   ` (26 subsequent siblings)
  27 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

DAX doesn't deposit pgtables when it maps huge pages, so there is
nothing to withdraw. Trying to withdraw one can lead to a crash.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8ca1718f7df3..2057a3b7cc24 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1730,7 +1730,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 		pmd = pmdp_huge_get_and_clear(mm, old_addr, old_pmd);
 		VM_BUG_ON(!pmd_none(*new_pmd));
 
-		if (pmd_move_must_withdraw(new_ptl, old_ptl)) {
+		if (pmd_move_must_withdraw(new_ptl, old_ptl) &&
+				vma_is_anonymous(vma)) {
 			pgtable_t pgtable;
 			pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
 			pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
-- 
2.7.0


* [PATCHv2 02/28] rmap: introduce rmap_walk_locked()
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 01/28] thp, dax: do not try to withdraw pgtable from non-anon VMA Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-11 18:52   ` Andi Kleen
  2016-02-11 14:21 ` [PATCHv2 03/28] rmap: extend try_to_unmap() to be usable by split_huge_page() Kirill A. Shutemov
                   ` (25 subsequent siblings)
  27 siblings, 1 reply; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

rmap_walk_locked() is the same as rmap_walk(), but the caller takes care
of the relevant rmap lock.

This is preparation for switching THP splitting from the custom rmap walk
in freeze_page()/unfreeze_page() to the generic one.
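
As an illustration (my sketch, not part of the patch), the intended
calling convention looks roughly like this; my_rmap_one and my_arg are
hypothetical placeholders:

/*
 * Sketch: the caller takes the relevant rmap lock itself and
 * rmap_walk_locked() neither takes nor drops it.
 */
struct rmap_walk_control rwc = {
	.rmap_one = my_rmap_one,	/* hypothetical callback */
	.arg = my_arg,			/* hypothetical argument */
};

anon_vma_lock_read(anon_vma);		/* caller holds the anon rmap lock */
rmap_walk_locked(page, &rwc);
anon_vma_unlock_read(anon_vma);

/*
 * For a file page the caller would hold i_mmap_rwsem instead, via
 * i_mmap_lock_read(mapping) / i_mmap_unlock_read(mapping).
 */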

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/rmap.h |  1 +
 mm/rmap.c            | 41 ++++++++++++++++++++++++++++++++---------
 2 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index a07f42bedda3..a5875e9b4a27 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -266,6 +266,7 @@ struct rmap_walk_control {
 };
 
 int rmap_walk(struct page *page, struct rmap_walk_control *rwc);
+int rmap_walk_locked(struct page *page, struct rmap_walk_control *rwc);
 
 #else	/* !CONFIG_MMU */
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 02f0bfc3c80a..30b739ce0ffa 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1715,14 +1715,21 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page,
  * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
  * LOCKED.
  */
-static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
+static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
+		bool locked)
 {
 	struct anon_vma *anon_vma;
 	pgoff_t pgoff;
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
 
-	anon_vma = rmap_walk_anon_lock(page, rwc);
+	if (locked) {
+		anon_vma = page_anon_vma(page);
+		/* anon_vma disappear under us? */
+		VM_BUG_ON_PAGE(!anon_vma, page);
+	} else {
+		anon_vma = rmap_walk_anon_lock(page, rwc);
+	}
 	if (!anon_vma)
 		return ret;
 
@@ -1742,7 +1749,9 @@ static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
 		if (rwc->done && rwc->done(page))
 			break;
 	}
-	anon_vma_unlock_read(anon_vma);
+
+	if (!locked)
+		anon_vma_unlock_read(anon_vma);
 	return ret;
 }
 
@@ -1759,9 +1768,10 @@ static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
  * vm_flags for that VMA.  That should be OK, because that vma shouldn't be
  * LOCKED.
  */
-static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
+static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc,
+		bool locked)
 {
-	struct address_space *mapping = page->mapping;
+	struct address_space *mapping = page_mapping(page);
 	pgoff_t pgoff;
 	struct vm_area_struct *vma;
 	int ret = SWAP_AGAIN;
@@ -1778,7 +1788,8 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
 		return ret;
 
 	pgoff = page_to_pgoff(page);
-	i_mmap_lock_read(mapping);
+	if (!locked)
+		i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
 		unsigned long address = vma_address(page, vma);
 
@@ -1795,7 +1806,8 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
 	}
 
 done:
-	i_mmap_unlock_read(mapping);
+	if (!locked)
+		i_mmap_unlock_read(mapping);
 	return ret;
 }
 
@@ -1804,9 +1816,20 @@ int rmap_walk(struct page *page, struct rmap_walk_control *rwc)
 	if (unlikely(PageKsm(page)))
 		return rmap_walk_ksm(page, rwc);
 	else if (PageAnon(page))
-		return rmap_walk_anon(page, rwc);
+		return rmap_walk_anon(page, rwc, false);
+	else
+		return rmap_walk_file(page, rwc, false);
+}
+
+/* Like rmap_walk, but caller holds relevant rmap lock */
+int rmap_walk_locked(struct page *page, struct rmap_walk_control *rwc)
+{
+	/* no ksm support for now */
+	VM_BUG_ON_PAGE(PageKsm(page), page);
+	if (PageAnon(page))
+		return rmap_walk_anon(page, rwc, true);
 	else
-		return rmap_walk_file(page, rwc);
+		return rmap_walk_file(page, rwc, true);
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
-- 
2.7.0


* [PATCHv2 03/28] rmap: extend try_to_unmap() to be usable by split_huge_page()
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 01/28] thp, dax: do not try to withdraw pgtable from non-anon VMA Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 02/28] rmap: introduce rmap_walk_locked() Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 04/28] mm: make remove_migration_ptes() beyond mm/migration.c Kirill A. Shutemov
                   ` (24 subsequent siblings)
  27 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

The patch adds support for two ttu_flags:

  - TTU_SPLIT_HUGE_PMD splits the PMD, if there is one, before trying to
    unmap the page;

  - TTU_RMAP_LOCKED indicates that the caller holds the relevant rmap lock.

Apart from these flags, the patch changes rwc->done to !page_mapcount()
instead of !page_mapped(). try_to_unmap() works at the pte level, so we
are really interested in whether this small page is mapped, not the
compound page it's part of.
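
For illustration, this is roughly how the freeze_page() rewrite later in
the series (patch 05) combines the new flags: split the PMD only once, on
the head page, and let the caller-held rmap lock cover the whole walk.

/* sketch mirroring the freeze_page() rewrite; 'page' is the THP head page */
enum ttu_flags ttu_flags = TTU_MIGRATION | TTU_IGNORE_MLOCK |
		TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED;
int i, ret;

/* TTU_SPLIT_HUGE_PMD is only needed for the first subpage */
ret = try_to_unmap(page, ttu_flags | TTU_SPLIT_HUGE_PMD);
for (i = 1; !ret && i < HPAGE_PMD_NR; i++)
	ret = try_to_unmap(page + i, ttu_flags);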

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |  7 +++++++
 include/linux/rmap.h    |  3 +++
 mm/huge_memory.c        |  5 +----
 mm/rmap.c               | 24 ++++++++++++++++--------
 4 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 459fd25b378e..c47067151ffd 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -111,6 +111,9 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			__split_huge_pmd(__vma, __pmd, __address);	\
 	}  while (0)
 
+
+void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address);
+
 #if HPAGE_PMD_ORDER >= MAX_ORDER
 #error "hugepages can't be allocated by the buddy allocator"
 #endif
@@ -178,6 +181,10 @@ static inline int split_huge_page(struct page *page)
 static inline void deferred_split_huge_page(struct page *page) {}
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
+
+static inline void split_huge_pmd_address(struct vm_area_struct *vma,
+		unsigned long address) {}
+
 static inline int hugepage_madvise(struct vm_area_struct *vma,
 				   unsigned long *vm_flags, int advice)
 {
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index a5875e9b4a27..3d975e2252d4 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -86,6 +86,7 @@ enum ttu_flags {
 	TTU_MIGRATION = 2,		/* migration mode */
 	TTU_MUNLOCK = 4,		/* munlock mode */
 	TTU_LZFREE = 8,			/* lazy free mode */
+	TTU_SPLIT_HUGE_PMD = 16,	/* split huge PMD if any */
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
@@ -93,6 +94,8 @@ enum ttu_flags {
 	TTU_BATCH_FLUSH = (1 << 11),	/* Batch TLB flushes where possible
 					 * and caller guarantees they will
 					 * do a final flush if necessary */
+	TTU_RMAP_LOCKED = (1 << 12)	/* do not grab rmap lock:
+					 * caller holds it */
 };
 
 #ifdef CONFIG_MMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2057a3b7cc24..801d4f9aac80 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3049,15 +3049,12 @@ out:
 	}
 }
 
-static void split_huge_pmd_address(struct vm_area_struct *vma,
-				    unsigned long address)
+void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address)
 {
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
 
-	VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));
-
 	pgd = pgd_offset(vma->vm_mm, address);
 	if (!pgd_present(*pgd))
 		return;
diff --git a/mm/rmap.c b/mm/rmap.c
index 30b739ce0ffa..945933a01010 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1431,6 +1431,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
 		goto out;
 
+	if (flags & TTU_SPLIT_HUGE_PMD)
+		split_huge_pmd_address(vma, address);
 	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
 		goto out;
@@ -1576,10 +1578,10 @@ static bool invalid_migration_vma(struct vm_area_struct *vma, void *arg)
 	return is_vma_temporary_stack(vma);
 }
 
-static int page_not_mapped(struct page *page)
+static int page_mapcount_is_zero(struct page *page)
 {
-	return !page_mapped(page);
-};
+	return !page_mapcount(page);
+}
 
 /**
  * try_to_unmap - try to remove all page table mappings to a page
@@ -1606,12 +1608,10 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
 	struct rmap_walk_control rwc = {
 		.rmap_one = try_to_unmap_one,
 		.arg = &rp,
-		.done = page_not_mapped,
+		.done = page_mapcount_is_zero,
 		.anon_lock = page_lock_anon_vma_read,
 	};
 
-	VM_BUG_ON_PAGE(!PageHuge(page) && PageTransHuge(page), page);
-
 	/*
 	 * During exec, a temporary VMA is setup and later moved.
 	 * The VMA is moved under the anon_vma lock but not the
@@ -1623,9 +1623,12 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
 	if ((flags & TTU_MIGRATION) && !PageKsm(page) && PageAnon(page))
 		rwc.invalid_vma = invalid_migration_vma;
 
-	ret = rmap_walk(page, &rwc);
+	if (flags & TTU_RMAP_LOCKED)
+		ret = rmap_walk_locked(page, &rwc);
+	else
+		ret = rmap_walk(page, &rwc);
 
-	if (ret != SWAP_MLOCK && !page_mapped(page)) {
+	if (ret != SWAP_MLOCK && !page_mapcount(page)) {
 		ret = SWAP_SUCCESS;
 		if (rp.lazyfreed && !PageDirty(page))
 			ret = SWAP_LZFREE;
@@ -1633,6 +1636,11 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
 	return ret;
 }
 
+static int page_not_mapped(struct page *page)
+{
+	return !page_mapped(page);
+};
+
 /**
  * try_to_munlock - try to munlock a page
  * @page: the page to be munlocked
-- 
2.7.0


* [PATCHv2 04/28] mm: make remove_migration_ptes() beyond mm/migration.c
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 03/28] rmap: extend try_to_unmap() to be usable by split_huge_page() Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-12 16:54   ` Dave Hansen
  2016-02-11 14:21 ` [PATCHv2 05/28] thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers Kirill A. Shutemov
                   ` (23 subsequent siblings)
  27 siblings, 1 reply; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

The patch makes remove_migration_ptes() available for use in
split_huge_page().

A new parameter 'locked' is added: as with try_to_unmap(), we need a way
to indicate that the caller holds the rmap lock.

We also shouldn't try to mlock() pte-mapped huge pages: pte-mapped THP
pages are never mlocked.
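
For illustration, the two calling modes look roughly like this (the
'locked' variant is what the split_huge_page() rework later in the series
needs):

/* regular migration path: remove_migration_ptes() takes the rmap lock itself */
remove_migration_ptes(page, newpage, false);

/* split_huge_page() path: the caller already holds the anon_vma lock and
 * maps each subpage back onto itself */
remove_migration_ptes(page + i, page + i, true);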

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/rmap.h |  2 ++
 mm/migrate.c         | 15 +++++++++------
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 3d975e2252d4..49eb4f8ebac9 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -243,6 +243,8 @@ int page_mkclean(struct page *);
  */
 int try_to_munlock(struct page *);
 
+void remove_migration_ptes(struct page *old, struct page *new, bool locked);
+
 /*
  * Called by memory-failure.c to kill processes.
  */
diff --git a/mm/migrate.c b/mm/migrate.c
index 17db63b2dd36..993390dcf68d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -172,7 +172,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 	else
 		page_add_file_rmap(new);
 
-	if (vma->vm_flags & VM_LOCKED)
+	if (vma->vm_flags & VM_LOCKED && !PageCompound(new))
 		mlock_vma_page(new);
 
 	/* No need to invalidate - it was non-present before */
@@ -187,14 +187,17 @@ out:
  * Get rid of all migration entries and replace them by
  * references to the indicated page.
  */
-static void remove_migration_ptes(struct page *old, struct page *new)
+void remove_migration_ptes(struct page *old, struct page *new, bool locked)
 {
 	struct rmap_walk_control rwc = {
 		.rmap_one = remove_migration_pte,
 		.arg = old,
 	};
 
-	rmap_walk(new, &rwc);
+	if (locked)
+		rmap_walk_locked(new, &rwc);
+	else
+		rmap_walk(new, &rwc);
 }
 
 /*
@@ -706,7 +709,7 @@ static int writeout(struct address_space *mapping, struct page *page)
 	 * At this point we know that the migration attempt cannot
 	 * be successful.
 	 */
-	remove_migration_ptes(page, page);
+	remove_migration_ptes(page, page, false);
 
 	rc = mapping->a_ops->writepage(page, &wbc);
 
@@ -904,7 +907,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 
 	if (page_was_mapped)
 		remove_migration_ptes(page,
-			rc == MIGRATEPAGE_SUCCESS ? newpage : page);
+			rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);
 
 out_unlock_both:
 	unlock_page(newpage);
@@ -1074,7 +1077,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 
 	if (page_was_mapped)
 		remove_migration_ptes(hpage,
-			rc == MIGRATEPAGE_SUCCESS ? new_hpage : hpage);
+			rc == MIGRATEPAGE_SUCCESS ? new_hpage : hpage, false);
 
 	unlock_page(new_hpage);
 
-- 
2.7.0


* [PATCHv2 05/28] thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 04/28] mm: make remove_migration_ptes() beyond mm/migration.c Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 06/28] mm: do not pass mm_struct into handle_mm_fault Kirill A. Shutemov
                   ` (22 subsequent siblings)
  27 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

The freeze_page() and unfreeze_page() helpers have evolved into rather
complex beasts. It would be nice to cut the complexity of this code.

This patch rewrites freeze_page() using the standard try_to_unmap().
unfreeze_page() is rewritten with remove_migration_ptes().

The result is much simpler.

But the new variant is somewhat slower. The current helpers iterate over
the VMAs the compound page is mapped into, and then over the ptes within
each VMA. The new helpers iterate over the small pages, then over the VMAs
each small page is mapped into, and only then find the relevant pte.

We've also lost the optimization that allowed splitting a PMD directly
into migration entries.

I don't think the slowdown is critical, considering how much simpler the
result is and that split_huge_page() is quite rare nowadays. It only
happens due to memory pressure or migration.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c | 214 +++++++------------------------------------------------
 1 file changed, 24 insertions(+), 190 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 801d4f9aac80..388164c3cacd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2904,7 +2904,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 }
 
 static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long haddr, bool freeze)
+		unsigned long haddr)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *page;
@@ -2946,18 +2946,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 * transferred to avoid any possibility of altering
 		 * permissions across VMAs.
 		 */
-		if (freeze) {
-			swp_entry_t swp_entry;
-			swp_entry = make_migration_entry(page + i, write);
-			entry = swp_entry_to_pte(swp_entry);
-		} else {
-			entry = mk_pte(page + i, vma->vm_page_prot);
-			entry = maybe_mkwrite(entry, vma);
-			if (!write)
-				entry = pte_wrprotect(entry);
-			if (!young)
-				entry = pte_mkold(entry);
-		}
+		entry = mk_pte(page + i, vma->vm_page_prot);
+		entry = maybe_mkwrite(entry, vma);
+		if (!write)
+			entry = pte_wrprotect(entry);
+		if (!young)
+			entry = pte_mkold(entry);
 		if (dirty)
 			SetPageDirty(page + i);
 		pte = pte_offset_map(&_pmd, haddr);
@@ -3010,13 +3004,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	 */
 	pmdp_invalidate(vma, haddr, pmd);
 	pmd_populate(mm, pmd, pgtable);
-
-	if (freeze) {
-		for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
-			page_remove_rmap(page + i, false);
-			put_page(page + i);
-		}
-	}
 }
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
@@ -3037,7 +3024,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			page = NULL;
 	} else if (!pmd_devmap(*pmd))
 		goto out;
-	__split_huge_pmd_locked(vma, pmd, haddr, false);
+	__split_huge_pmd_locked(vma, pmd, haddr);
 out:
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
@@ -3114,180 +3101,27 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 	}
 }
 
-static void freeze_page_vma(struct vm_area_struct *vma, struct page *page,
-		unsigned long address)
+static void freeze_page(struct page *page)
 {
-	unsigned long haddr = address & HPAGE_PMD_MASK;
-	spinlock_t *ptl;
-	pgd_t *pgd;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte;
-	int i, nr = HPAGE_PMD_NR;
-
-	/* Skip pages which doesn't belong to the VMA */
-	if (address < vma->vm_start) {
-		int off = (vma->vm_start - address) >> PAGE_SHIFT;
-		page += off;
-		nr -= off;
-		address = vma->vm_start;
-	}
-
-	pgd = pgd_offset(vma->vm_mm, address);
-	if (!pgd_present(*pgd))
-		return;
-	pud = pud_offset(pgd, address);
-	if (!pud_present(*pud))
-		return;
-	pmd = pmd_offset(pud, address);
-	ptl = pmd_lock(vma->vm_mm, pmd);
-	if (!pmd_present(*pmd)) {
-		spin_unlock(ptl);
-		return;
-	}
-	if (pmd_trans_huge(*pmd)) {
-		if (page == pmd_page(*pmd))
-			__split_huge_pmd_locked(vma, pmd, haddr, true);
-		spin_unlock(ptl);
-		return;
-	}
-	spin_unlock(ptl);
-
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, address, &ptl);
-	for (i = 0; i < nr; i++, address += PAGE_SIZE, page++, pte++) {
-		pte_t entry, swp_pte;
-		swp_entry_t swp_entry;
-
-		/*
-		 * We've just crossed page table boundary: need to map next one.
-		 * It can happen if THP was mremaped to non PMD-aligned address.
-		 */
-		if (unlikely(address == haddr + HPAGE_PMD_SIZE)) {
-			pte_unmap_unlock(pte - 1, ptl);
-			pmd = mm_find_pmd(vma->vm_mm, address);
-			if (!pmd)
-				return;
-			pte = pte_offset_map_lock(vma->vm_mm, pmd,
-					address, &ptl);
-		}
-
-		if (!pte_present(*pte))
-			continue;
-		if (page_to_pfn(page) != pte_pfn(*pte))
-			continue;
-		flush_cache_page(vma, address, page_to_pfn(page));
-		entry = ptep_clear_flush(vma, address, pte);
-		if (pte_dirty(entry))
-			SetPageDirty(page);
-		swp_entry = make_migration_entry(page, pte_write(entry));
-		swp_pte = swp_entry_to_pte(swp_entry);
-		if (pte_soft_dirty(entry))
-			swp_pte = pte_swp_mksoft_dirty(swp_pte);
-		set_pte_at(vma->vm_mm, address, pte, swp_pte);
-		page_remove_rmap(page, false);
-		put_page(page);
-	}
-	pte_unmap_unlock(pte - 1, ptl);
-}
-
-static void freeze_page(struct anon_vma *anon_vma, struct page *page)
-{
-	struct anon_vma_chain *avc;
-	pgoff_t pgoff = page_to_pgoff(page);
+	enum ttu_flags ttu_flags = TTU_MIGRATION | TTU_IGNORE_MLOCK |
+		TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED;
+	int i, ret;
 
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
-	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff,
-			pgoff + HPAGE_PMD_NR - 1) {
-		unsigned long address = __vma_address(page, avc->vma);
-
-		mmu_notifier_invalidate_range_start(avc->vma->vm_mm,
-				address, address + HPAGE_PMD_SIZE);
-		freeze_page_vma(avc->vma, page, address);
-		mmu_notifier_invalidate_range_end(avc->vma->vm_mm,
-				address, address + HPAGE_PMD_SIZE);
-	}
-}
-
-static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
-		unsigned long address)
-{
-	spinlock_t *ptl;
-	pmd_t *pmd;
-	pte_t *pte, entry;
-	swp_entry_t swp_entry;
-	unsigned long haddr = address & HPAGE_PMD_MASK;
-	int i, nr = HPAGE_PMD_NR;
-
-	/* Skip pages which doesn't belong to the VMA */
-	if (address < vma->vm_start) {
-		int off = (vma->vm_start - address) >> PAGE_SHIFT;
-		page += off;
-		nr -= off;
-		address = vma->vm_start;
-	}
-
-	pmd = mm_find_pmd(vma->vm_mm, address);
-	if (!pmd)
-		return;
-
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, address, &ptl);
-	for (i = 0; i < nr; i++, address += PAGE_SIZE, page++, pte++) {
-		/*
-		 * We've just crossed page table boundary: need to map next one.
-		 * It can happen if THP was mremaped to non-PMD aligned address.
-		 */
-		if (unlikely(address == haddr + HPAGE_PMD_SIZE)) {
-			pte_unmap_unlock(pte - 1, ptl);
-			pmd = mm_find_pmd(vma->vm_mm, address);
-			if (!pmd)
-				return;
-			pte = pte_offset_map_lock(vma->vm_mm, pmd,
-					address, &ptl);
-		}
-
-		if (!is_swap_pte(*pte))
-			continue;
-
-		swp_entry = pte_to_swp_entry(*pte);
-		if (!is_migration_entry(swp_entry))
-			continue;
-		if (migration_entry_to_page(swp_entry) != page)
-			continue;
-
-		get_page(page);
-		page_add_anon_rmap(page, vma, address, false);
-
-		entry = pte_mkold(mk_pte(page, vma->vm_page_prot));
-		if (PageDirty(page))
-			entry = pte_mkdirty(entry);
-		if (is_write_migration_entry(swp_entry))
-			entry = maybe_mkwrite(entry, vma);
-
-		flush_dcache_page(page);
-		set_pte_at(vma->vm_mm, address, pte, entry);
-
-		/* No need to invalidate - it was non-present before */
-		update_mmu_cache(vma, address, pte);
-	}
-	pte_unmap_unlock(pte - 1, ptl);
+	/* We only need TTU_SPLIT_HUGE_PMD once */
+	ret = try_to_unmap(page, ttu_flags | TTU_SPLIT_HUGE_PMD);
+	for (i = 1; !ret && i < HPAGE_PMD_NR; i++)
+		ret = try_to_unmap(page + i, ttu_flags);
+	VM_BUG_ON(ret);
 }
 
-static void unfreeze_page(struct anon_vma *anon_vma, struct page *page)
+static void unfreeze_page(struct page *page)
 {
-	struct anon_vma_chain *avc;
-	pgoff_t pgoff = page_to_pgoff(page);
-
-	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
-			pgoff, pgoff + HPAGE_PMD_NR - 1) {
-		unsigned long address = __vma_address(page, avc->vma);
+	int i;
 
-		mmu_notifier_invalidate_range_start(avc->vma->vm_mm,
-				address, address + HPAGE_PMD_SIZE);
-		unfreeze_page_vma(avc->vma, page, address);
-		mmu_notifier_invalidate_range_end(avc->vma->vm_mm,
-				address, address + HPAGE_PMD_SIZE);
-	}
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		remove_migration_ptes(page + i, page + i, true);
 }
 
 static void __split_huge_page_tail(struct page *head, int tail,
@@ -3365,7 +3199,7 @@ static void __split_huge_page(struct page *page, struct list_head *list)
 	ClearPageCompound(head);
 	spin_unlock_irq(&zone->lru_lock);
 
-	unfreeze_page(page_anon_vma(head), head);
+	unfreeze_page(head);
 
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		struct page *subpage = head + i;
@@ -3461,7 +3295,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	}
 
 	mlocked = PageMlocked(page);
-	freeze_page(anon_vma, head);
+	freeze_page(head);
 	VM_BUG_ON_PAGE(compound_mapcount(head), head);
 
 	/* Make sure the page is not on per-CPU pagevec as it takes pin */
@@ -3490,7 +3324,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		BUG();
 	} else {
 		spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
-		unfreeze_page(anon_vma, head);
+		unfreeze_page(head);
 		ret = -EBUSY;
 	}
 
-- 
2.7.0


* [PATCHv2 06/28] mm: do not pass mm_struct into handle_mm_fault
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 05/28] thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 07/28] mm: introduce fault_env Kirill A. Shutemov
                   ` (21 subsequent siblings)
  27 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

We always have vma->vm_mm around.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/alpha/mm/fault.c         |  2 +-
 arch/arc/mm/fault.c           |  2 +-
 arch/arm/mm/fault.c           |  2 +-
 arch/arm64/mm/fault.c         |  2 +-
 arch/avr32/mm/fault.c         |  2 +-
 arch/cris/mm/fault.c          |  2 +-
 arch/frv/mm/fault.c           |  2 +-
 arch/hexagon/mm/vm_fault.c    |  2 +-
 arch/ia64/mm/fault.c          |  2 +-
 arch/m32r/mm/fault.c          |  2 +-
 arch/m68k/mm/fault.c          |  2 +-
 arch/metag/mm/fault.c         |  2 +-
 arch/microblaze/mm/fault.c    |  2 +-
 arch/mips/mm/fault.c          |  2 +-
 arch/mn10300/mm/fault.c       |  2 +-
 arch/nios2/mm/fault.c         |  2 +-
 arch/openrisc/mm/fault.c      |  2 +-
 arch/parisc/mm/fault.c        |  2 +-
 arch/powerpc/mm/copro_fault.c |  2 +-
 arch/powerpc/mm/fault.c       |  2 +-
 arch/s390/mm/fault.c          |  2 +-
 arch/score/mm/fault.c         |  2 +-
 arch/sh/mm/fault.c            |  2 +-
 arch/sparc/mm/fault_32.c      |  4 ++--
 arch/sparc/mm/fault_64.c      |  2 +-
 arch/tile/mm/fault.c          |  2 +-
 arch/um/kernel/trap.c         |  2 +-
 arch/unicore32/mm/fault.c     |  2 +-
 arch/x86/mm/fault.c           |  2 +-
 arch/xtensa/mm/fault.c        |  2 +-
 drivers/iommu/amd_iommu_v2.c  |  2 +-
 drivers/iommu/intel-svm.c     |  2 +-
 include/linux/mm.h            |  9 ++++-----
 mm/gup.c                      |  5 ++---
 mm/ksm.c                      |  3 +--
 mm/memory.c                   | 13 +++++++------
 36 files changed, 47 insertions(+), 49 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 4a905bd667e2..83e9eee57a55 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -147,7 +147,7 @@ retry:
 	/* If for any reason at all we couldn't handle the fault,
 	   make sure we exit gracefully rather than endlessly redo
 	   the fault.  */
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 
 	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
 		return;
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index af63f4a13e60..e94e5aa33985 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -137,7 +137,7 @@ good_area:
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 
 	/* If Pagefault was interrupted by SIGKILL, exit page fault "early" */
 	if (unlikely(fatal_signal_pending(current))) {
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index daafcf121ce0..7cd0d5b2ef50 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -243,7 +243,7 @@ good_area:
 		goto out;
 	}
 
-	return handle_mm_fault(mm, vma, addr & PAGE_MASK, flags);
+	return handle_mm_fault(vma, addr & PAGE_MASK, flags);
 
 check_stack:
 	/* Don't allow expansion below FIRST_USER_ADDRESS */
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 92ddac1e8ca2..a36c31adb087 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -183,7 +183,7 @@ good_area:
 		goto out;
 	}
 
-	return handle_mm_fault(mm, vma, addr & PAGE_MASK, mm_flags);
+	return handle_mm_fault(vma, addr & PAGE_MASK, mm_flags);
 
 check_stack:
 	if (vma->vm_flags & VM_GROWSDOWN && !expand_stack(vma, addr))
diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c
index c03533937a9f..a4b7edac8f10 100644
--- a/arch/avr32/mm/fault.c
+++ b/arch/avr32/mm/fault.c
@@ -134,7 +134,7 @@ good_area:
 	 * sure we exit gracefully rather than endlessly redo the
 	 * fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 
 	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
 		return;
diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c
index 3066d40a6db1..112ef26c7f2e 100644
--- a/arch/cris/mm/fault.c
+++ b/arch/cris/mm/fault.c
@@ -168,7 +168,7 @@ retry:
 	 * the fault.
 	 */
 
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 
 	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
 		return;
diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c
index 61d99767fe16..614a46c413d2 100644
--- a/arch/frv/mm/fault.c
+++ b/arch/frv/mm/fault.c
@@ -164,7 +164,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, ear0, flags);
+	fault = handle_mm_fault(vma, ear0, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index 8704c9320032..bd7c251e2bce 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -101,7 +101,7 @@ good_area:
 		break;
 	}
 
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 
 	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
 		return;
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 70b40d1205a6..fa6ad95e992e 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -159,7 +159,7 @@ retry:
 	 * sure we exit gracefully rather than endlessly redo the
 	 * fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 
 	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
 		return;
diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c
index 8f9875b7933d..a3785d3644c2 100644
--- a/arch/m32r/mm/fault.c
+++ b/arch/m32r/mm/fault.c
@@ -196,7 +196,7 @@ good_area:
 	 */
 	addr = (address & PAGE_MASK);
 	set_thread_fault_code(error_code);
-	fault = handle_mm_fault(mm, vma, addr, flags);
+	fault = handle_mm_fault(vma, addr, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 6a94cdd0c830..bd66a0b20c6b 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -136,7 +136,7 @@ good_area:
 	 * the fault.
 	 */
 
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 	pr_debug("handle_mm_fault returns %d\n", fault);
 
 	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
diff --git a/arch/metag/mm/fault.c b/arch/metag/mm/fault.c
index f57edca63609..372783a67dda 100644
--- a/arch/metag/mm/fault.c
+++ b/arch/metag/mm/fault.c
@@ -133,7 +133,7 @@ good_area:
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 
 	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
 		return 0;
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index 177dfc003643..abb678ccde6f 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -216,7 +216,7 @@ good_area:
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 
 	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
 		return;
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 4b88fa031891..9560ad731120 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -153,7 +153,7 @@ good_area:
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 
 	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
 		return;
diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
index 4a1d181ed32f..f23781d6bbb3 100644
--- a/arch/mn10300/mm/fault.c
+++ b/arch/mn10300/mm/fault.c
@@ -254,7 +254,7 @@ good_area:
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 
 	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
 		return;
diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index b51878b0c6b8..affc4eb3f89e 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -131,7 +131,7 @@ good_area:
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 
 	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
 		return;
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 230ac20ae794..e94cd225e816 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -163,7 +163,7 @@ good_area:
 	 * the fault.
 	 */
 
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 
 	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
 		return;
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index a762864ec92e..87436cd7ea1a 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -243,7 +243,7 @@ good_area:
 	 * fault.
 	 */
 
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 
 	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
 		return;
diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c
index 6527882ce05e..bb0354222b11 100644
--- a/arch/powerpc/mm/copro_fault.c
+++ b/arch/powerpc/mm/copro_fault.c
@@ -75,7 +75,7 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
 	}
 
 	ret = 0;
-	*flt = handle_mm_fault(mm, vma, ea, is_write ? FAULT_FLAG_WRITE : 0);
+	*flt = handle_mm_fault(vma, ea, is_write ? FAULT_FLAG_WRITE : 0);
 	if (unlikely(*flt & VM_FAULT_ERROR)) {
 		if (*flt & VM_FAULT_OOM) {
 			ret = -ENOMEM;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index a67c6d781c52..a4db22f65021 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -429,7 +429,7 @@ good_area:
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 	if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
 		if (fault & VM_FAULT_SIGSEGV)
 			goto bad_area;
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index ec1a30d0d11a..9579c92e35b9 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -455,7 +455,7 @@ retry:
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 	/* No reason to continue if interrupted by SIGKILL. */
 	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
 		fault = VM_FAULT_SIGNAL;
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 37a6c2e0e969..995b71e4db4b 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -111,7 +111,7 @@ good_area:
 	* make sure we exit gracefully rather than endlessly redo
 	* the fault.
 	*/
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 79d8276377d1..9bf876780cef 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -487,7 +487,7 @@ good_area:
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 
 	if (unlikely(fault & (VM_FAULT_RETRY | VM_FAULT_ERROR)))
 		if (mm_fault_error(regs, error_code, address, fault))
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index c399e7b3b035..3d2d6686d8e9 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -241,7 +241,7 @@ good_area:
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 
 	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
 		return;
@@ -411,7 +411,7 @@ good_area:
 		if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
 			goto bad_area;
 	}
-	switch (handle_mm_fault(mm, vma, address, flags)) {
+	switch (handle_mm_fault(vma, address, flags)) {
 	case VM_FAULT_SIGBUS:
 	case VM_FAULT_OOM:
 		goto do_sigbus;
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index cb841a33da59..6c43b924a7a2 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -436,7 +436,7 @@ good_area:
 			goto bad_area;
 	}
 
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 
 	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
 		goto exit_exception;
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 13eac59bf16a..4be04f649396 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -435,7 +435,7 @@ good_area:
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 
 	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
 		return 0;
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 98783dd0fa2e..ad8f206ab5e8 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -73,7 +73,7 @@ good_area:
 	do {
 		int fault;
 
-		fault = handle_mm_fault(mm, vma, address, flags);
+		fault = handle_mm_fault(vma, address, flags);
 
 		if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
 			goto out_nosemaphore;
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index afccef5529cc..43c84cb70fc7 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -194,7 +194,7 @@ good_area:
 	 * If for any reason at all we couldn't handle the fault, make
 	 * sure we exit gracefully rather than endlessly redo the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags);
+	fault = handle_mm_fault(vma, addr & PAGE_MASK, flags);
 	return fault;
 
 check_stack:
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eef44d9a3f77..d01fd75adfcb 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1235,7 +1235,7 @@ good_area:
 	 * the fault.  Since we never set FAULT_FLAG_RETRY_NOWAIT, if
 	 * we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
 	 */
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 	major |= fault & VM_FAULT_MAJOR;
 
 	/*
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index c9784c1b18d8..551bba2c9ed8 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -110,7 +110,7 @@ good_area:
 	 * make sure we exit gracefully rather than endlessly redo
 	 * the fault.
 	 */
-	fault = handle_mm_fault(mm, vma, address, flags);
+	fault = handle_mm_fault(vma, address, flags);
 
 	if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
 		return;
diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 7caf2fa237f2..d2f978245ca6 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -539,7 +539,7 @@ static void do_fault(struct work_struct *work)
 		goto out;
 	}
 
-	ret = handle_mm_fault(mm, vma, address, write);
+	ret = handle_mm_fault(vma, address, write);
 	if (ret & VM_FAULT_ERROR) {
 		/* failed to service fault */
 		up_read(&mm->mmap_sem);
diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index 50464833d0b8..48c0f18e7a84 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -559,7 +559,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
 		if (access_error(vma, req))
 			goto invalid;
 
-		ret = handle_mm_fault(svm->mm, vma, address,
+		ret = handle_mm_fault(vma, address,
 				      req->wr_req ? FAULT_FLAG_WRITE : 0);
 		if (ret & VM_FAULT_ERROR)
 			goto invalid;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9eac99d4902d..5a68e2eded1b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1186,15 +1186,14 @@ int generic_error_remove_page(struct address_space *mapping, struct page *page);
 int invalidate_inode_page(struct page *page);
 
 #ifdef CONFIG_MMU
-extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, unsigned int flags);
+extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
+		unsigned int flags);
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long address, unsigned int fault_flags,
 			    bool *unlocked);
 #else
-static inline int handle_mm_fault(struct mm_struct *mm,
-			struct vm_area_struct *vma, unsigned long address,
-			unsigned int flags)
+static inline int handle_mm_fault(struct vm_area_struct *vma,
+		unsigned long address, unsigned int flags)
 {
 	/* should never happen if there's no MMU */
 	BUG();
diff --git a/mm/gup.c b/mm/gup.c
index 7bf19ffa2199..60f422a0af8b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -349,7 +349,6 @@ unmap:
 static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 		unsigned long address, unsigned int *flags, int *nonblocking)
 {
-	struct mm_struct *mm = vma->vm_mm;
 	unsigned int fault_flags = 0;
 	int ret;
 
@@ -372,7 +371,7 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 		fault_flags |= FAULT_FLAG_TRIED;
 	}
 
-	ret = handle_mm_fault(mm, vma, address, fault_flags);
+	ret = handle_mm_fault(vma, address, fault_flags);
 	if (ret & VM_FAULT_ERROR) {
 		if (ret & VM_FAULT_OOM)
 			return -ENOMEM;
@@ -659,7 +658,7 @@ retry:
 	if (!(vm_flags & vma->vm_flags))
 		return -EFAULT;
 
-	ret = handle_mm_fault(mm, vma, address, fault_flags);
+	ret = handle_mm_fault(vma, address, fault_flags);
 	major |= ret & VM_FAULT_MAJOR;
 	if (ret & VM_FAULT_ERROR) {
 		if (ret & VM_FAULT_OOM)
diff --git a/mm/ksm.c b/mm/ksm.c
index 823d78b2a055..1bd92327bf01 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -440,8 +440,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 		if (IS_ERR_OR_NULL(page))
 			break;
 		if (PageKsm(page))
-			ret = handle_mm_fault(vma->vm_mm, vma, addr,
-							FAULT_FLAG_WRITE);
+			ret = handle_mm_fault(vma, addr, FAULT_FLAG_WRITE);
 		else
 			ret = VM_FAULT_WRITE;
 		put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index 5f2c8f0c4998..2b0c10ec3064 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3346,9 +3346,10 @@ unlock:
  * The mmap_sem may have been released depending on flags and our
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
-static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-			     unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
+		unsigned int flags)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
@@ -3421,15 +3422,15 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
  * The mmap_sem may have been released depending on flags and our
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		    unsigned long address, unsigned int flags)
+int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
+		unsigned int flags)
 {
 	int ret;
 
 	__set_current_state(TASK_RUNNING);
 
 	count_vm_event(PGFAULT);
-	mem_cgroup_count_vm_event(mm, PGFAULT);
+	mem_cgroup_count_vm_event(vma->vm_mm, PGFAULT);
 
 	/* do counter updates before entering really critical section. */
 	check_sync_rss_stat(current);
@@ -3441,7 +3442,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (flags & FAULT_FLAG_USER)
 		mem_cgroup_oom_enable();
 
-	ret = __handle_mm_fault(mm, vma, address, flags);
+	ret = __handle_mm_fault(vma, address, flags);
 
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_oom_disable();
-- 
2.7.0


* [PATCHv2 07/28] mm: introduce fault_env
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 06/28] mm: do not pass mm_struct into handle_mm_fault Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 08/28] mm: postpone page table allocation until do_set_pte() Kirill A. Shutemov
                   ` (20 subsequent siblings)
  27 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

The idea is borrowed from Peter's patch from the patchset on speculative
page faults[1]:

Instead of passing around the endless list of function arguments,
replace the lot with a single structure so we can change context
without endless function signature changes.

The changes are mostly mechanical, with the exception of the faultaround
code: filemap_map_pages() got reworked a bit.

This patch is preparation for the next one.

[1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org
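
For reference, a rough sketch of what the structure carries, abridged and
reconstructed from how fe->... is used across the patchset, so treat it as
a sketch rather than the exact definition:

struct fault_env {
	struct vm_area_struct *vma;	/* target VMA */
	unsigned long address;		/* faulting virtual address */
	unsigned int flags;		/* FAULT_FLAG_xxx */
	pte_t *pte;			/* pointer to the pte, once mapped */
	/* plus page-table state (pmd pointer, pte lock, ...) filled in
	 * as the fault handler walks down */
};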

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 Documentation/filesystems/Locking |  10 +-
 fs/userfaultfd.c                  |  22 +-
 include/linux/huge_mm.h           |  20 +-
 include/linux/mm.h                |  22 +-
 include/linux/userfaultfd_k.h     |   8 +-
 mm/filemap.c                      |  28 +-
 mm/huge_memory.c                  | 280 +++++++++----------
 mm/internal.h                     |   4 +-
 mm/memory.c                       | 571 ++++++++++++++++++--------------------
 9 files changed, 455 insertions(+), 510 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 06d443450f21..0e499a7944a5 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -546,13 +546,13 @@ subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
 locked. The VM will unlock the page.
 
 	->map_pages() is called when VM asks to map easy accessible pages.
-Filesystem should find and map pages associated with offsets from "pgoff"
-till "max_pgoff". ->map_pages() is called with page table locked and must
+Filesystem should find and map pages associated with offsets from "start_pgoff"
+till "end_pgoff". ->map_pages() is called with page table locked and must
 not block.  If it's not possible to reach a page without blocking,
 filesystem should skip it. Filesystem should use do_set_pte() to setup
-page table entry. Pointer to entry associated with offset "pgoff" is
-passed in "pte" field in vm_fault structure. Pointers to entries for other
-offsets should be calculated relative to "pte".
+page table entry. Pointer to entry associated with the page is passed in
+"pte" field in fault_env structure. Pointers to entries for other offsets
+should be calculated relative to "pte".
 
 	->page_mkwrite() is called when a previously read-only pte is
 about to become writeable. The filesystem again must ensure that there are
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 50311703135b..0a08143dbc87 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -257,10 +257,9 @@ out:
  * fatal_signal_pending()s, and the mmap_sem must be released before
  * returning it.
  */
-int handle_userfault(struct vm_area_struct *vma, unsigned long address,
-		     unsigned int flags, unsigned long reason)
+int handle_userfault(struct fault_env *fe, unsigned long reason)
 {
-	struct mm_struct *mm = vma->vm_mm;
+	struct mm_struct *mm = fe->vma->vm_mm;
 	struct userfaultfd_ctx *ctx;
 	struct userfaultfd_wait_queue uwq;
 	int ret;
@@ -269,7 +268,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
 	ret = VM_FAULT_SIGBUS;
-	ctx = vma->vm_userfaultfd_ctx.ctx;
+	ctx = fe->vma->vm_userfaultfd_ctx.ctx;
 	if (!ctx)
 		goto out;
 
@@ -296,17 +295,17 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	 * without first stopping userland access to the memory. For
 	 * VM_UFFD_MISSING userfaults this is enough for now.
 	 */
-	if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
+	if (unlikely(!(fe->flags & FAULT_FLAG_ALLOW_RETRY))) {
 		/*
 		 * Validate the invariant that nowait must allow retry
 		 * to be sure not to return SIGBUS erroneously on
 		 * nowait invocations.
 		 */
-		BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
+		BUG_ON(fe->flags & FAULT_FLAG_RETRY_NOWAIT);
 #ifdef CONFIG_DEBUG_VM
 		if (printk_ratelimit()) {
 			printk(KERN_WARNING
-			       "FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
+			       "FAULT_FLAG_ALLOW_RETRY missing %x\n", fe->flags);
 			dump_stack();
 		}
 #endif
@@ -318,7 +317,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 	 * and wait.
 	 */
 	ret = VM_FAULT_RETRY;
-	if (flags & FAULT_FLAG_RETRY_NOWAIT)
+	if (fe->flags & FAULT_FLAG_RETRY_NOWAIT)
 		goto out;
 
 	/* take the reference before dropping the mmap_sem */
@@ -326,10 +325,11 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 
 	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
 	uwq.wq.private = current;
-	uwq.msg = userfault_msg(address, flags, reason);
+	uwq.msg = userfault_msg(fe->address, fe->flags, reason);
 	uwq.ctx = ctx;
 
-	return_to_userland = (flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
+	return_to_userland =
+		(fe->flags & (FAULT_FLAG_USER|FAULT_FLAG_KILLABLE)) ==
 		(FAULT_FLAG_USER|FAULT_FLAG_KILLABLE);
 
 	spin_lock(&ctx->fault_pending_wqh.lock);
@@ -347,7 +347,7 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address,
 			  TASK_KILLABLE);
 	spin_unlock(&ctx->fault_pending_wqh.lock);
 
-	must_wait = userfaultfd_must_wait(ctx, address, flags, reason);
+	must_wait = userfaultfd_must_wait(ctx, fe->address, fe->flags, reason);
 	up_read(&mm->mmap_sem);
 
 	if (likely(must_wait && !ACCESS_ONCE(ctx->released) &&
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c47067151ffd..a9ec30594a81 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -1,20 +1,12 @@
 #ifndef _LINUX_HUGE_MM_H
 #define _LINUX_HUGE_MM_H
 
-extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
-				      struct vm_area_struct *vma,
-				      unsigned long address, pmd_t *pmd,
-				      unsigned int flags);
+extern int do_huge_pmd_anonymous_page(struct fault_env *fe);
 extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 			 struct vm_area_struct *vma);
-extern void huge_pmd_set_accessed(struct mm_struct *mm,
-				  struct vm_area_struct *vma,
-				  unsigned long address, pmd_t *pmd,
-				  pmd_t orig_pmd, int dirty);
-extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			       unsigned long address, pmd_t *pmd,
-			       pmd_t orig_pmd);
+extern void huge_pmd_set_accessed(struct fault_env *fe, pmd_t orig_pmd);
+extern int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd);
 extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 					  unsigned long addr,
 					  pmd_t *pmd,
@@ -142,8 +134,7 @@ static inline int hpage_nr_pages(struct page *page)
 	return 1;
 }
 
-extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-				unsigned long addr, pmd_t pmd, pmd_t *pmdp);
+extern int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t orig_pmd);
 
 extern struct page *huge_zero_page;
 
@@ -203,8 +194,7 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 	return NULL;
 }
 
-static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-					unsigned long addr, pmd_t pmd, pmd_t *pmdp)
+static inline int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t orig_pmd)
 {
 	return 0;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a68e2eded1b..ca99c0ecf52e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -256,10 +256,15 @@ struct vm_fault {
 					 * is set (which is also implied by
 					 * VM_FAULT_ERROR).
 					 */
-	/* for ->map_pages() only */
-	pgoff_t max_pgoff;		/* map pages for offset from pgoff till
-					 * max_pgoff inclusive */
-	pte_t *pte;			/* pte entry associated with ->pgoff */
+};
+
+struct fault_env {
+	struct vm_area_struct *vma;
+	unsigned long address;
+	unsigned int flags;
+	pmd_t *pmd;
+	pte_t *pte;
+	spinlock_t *ptl;
 };
 
 /*
@@ -274,7 +279,8 @@ struct vm_operations_struct {
 	int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
 	int (*pmd_fault)(struct vm_area_struct *, unsigned long address,
 						pmd_t *, unsigned int flags);
-	void (*map_pages)(struct vm_area_struct *vma, struct vm_fault *vmf);
+	void (*map_pages)(struct fault_env *fe,
+			pgoff_t start_pgoff, pgoff_t end_pgoff);
 
 	/* notification that a previously read-only page is about to become
 	 * writable, if an error is returned it will cause a SIGBUS */
@@ -553,8 +559,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 	return pte;
 }
 
-void do_set_pte(struct vm_area_struct *vma, unsigned long address,
-		struct page *page, pte_t *pte, bool write, bool anon);
+void do_set_pte(struct fault_env *fe, struct page *page);
 #endif
 
 /*
@@ -2032,7 +2037,8 @@ extern void truncate_inode_pages_final(struct address_space *);
 
 /* generic vm_area_ops exported for stackable file systems */
 extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
-extern void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf);
+extern void filemap_map_pages(struct fault_env *fe,
+		pgoff_t start_pgoff, pgoff_t end_pgoff);
 extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 /* mm/page-writeback.c */
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 587480ad41b7..dd66a952e8cd 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -27,8 +27,7 @@
 #define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
 #define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
 
-extern int handle_userfault(struct vm_area_struct *vma, unsigned long address,
-			    unsigned int flags, unsigned long reason);
+extern int handle_userfault(struct fault_env *fe, unsigned long reason);
 
 extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
 			    unsigned long src_start, unsigned long len);
@@ -56,10 +55,7 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 #else /* CONFIG_USERFAULTFD */
 
 /* mm helpers */
-static inline int handle_userfault(struct vm_area_struct *vma,
-				   unsigned long address,
-				   unsigned int flags,
-				   unsigned long reason)
+static inline int handle_userfault(struct fault_env *fe, unsigned long reason)
 {
 	return VM_FAULT_SIGBUS;
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index bc3ae0e5c925..28b3875969a8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2130,22 +2130,27 @@ page_not_uptodate:
 }
 EXPORT_SYMBOL(filemap_fault);
 
-void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
+void filemap_map_pages(struct fault_env *fe,
+		pgoff_t start_pgoff, pgoff_t end_pgoff)
 {
 	struct radix_tree_iter iter;
 	void **slot;
-	struct file *file = vma->vm_file;
+	struct file *file = fe->vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
+	pgoff_t last_pgoff = start_pgoff;
 	loff_t size;
 	struct page *page;
-	unsigned long address = (unsigned long) vmf->virtual_address;
-	unsigned long addr;
-	pte_t *pte;
 
 	rcu_read_lock();
-	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
-		if (iter.index > vmf->max_pgoff)
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
+			start_pgoff) {
+		if (iter.index > end_pgoff)
 			break;
+		fe->pte += iter.index - last_pgoff;
+		fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
+		last_pgoff = iter.index;
+		if (!pte_none(*fe->pte))
+			goto next;
 repeat:
 		page = radix_tree_deref_slot(slot);
 		if (unlikely(!page))
@@ -2180,14 +2185,9 @@ repeat:
 		if (page->index >= size >> PAGE_CACHE_SHIFT)
 			goto unlock;
 
-		pte = vmf->pte + page->index - vmf->pgoff;
-		if (!pte_none(*pte))
-			goto unlock;
-
 		if (file->f_ra.mmap_miss > 0)
 			file->f_ra.mmap_miss--;
-		addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
-		do_set_pte(vma, addr, page, pte, false, false);
+		do_set_pte(fe, page);
 		unlock_page(page);
 		goto next;
 unlock:
@@ -2195,7 +2195,7 @@ unlock:
 skip:
 		page_cache_release(page);
 next:
-		if (iter.index == vmf->max_pgoff)
+		if (iter.index == end_pgoff)
 			break;
 	}
 	rcu_read_unlock();
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 388164c3cacd..7ea43b9fbec4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -800,26 +800,23 @@ void prep_transhuge_page(struct page *page)
 	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
 }
 
-static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
-					struct vm_area_struct *vma,
-					unsigned long address, pmd_t *pmd,
-					struct page *page, gfp_t gfp,
-					unsigned int flags)
+static int __do_huge_pmd_anonymous_page(struct fault_env *fe, struct page *page,
+		gfp_t gfp)
 {
+	struct vm_area_struct *vma = fe->vma;
 	struct mem_cgroup *memcg;
 	pgtable_t pgtable;
-	spinlock_t *ptl;
-	unsigned long haddr = address & HPAGE_PMD_MASK;
+	unsigned long haddr = fe->address & HPAGE_PMD_MASK;
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
-	if (mem_cgroup_try_charge(page, mm, gfp, &memcg, true)) {
+	if (mem_cgroup_try_charge(page, vma->vm_mm, gfp, &memcg, true)) {
 		put_page(page);
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
 	}
 
-	pgtable = pte_alloc_one(mm, haddr);
+	pgtable = pte_alloc_one(vma->vm_mm, haddr);
 	if (unlikely(!pgtable)) {
 		mem_cgroup_cancel_charge(page, memcg, true);
 		put_page(page);
@@ -834,12 +831,12 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 	 */
 	__SetPageUptodate(page);
 
-	ptl = pmd_lock(mm, pmd);
-	if (unlikely(!pmd_none(*pmd))) {
-		spin_unlock(ptl);
+	fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+	if (unlikely(!pmd_none(*fe->pmd))) {
+		spin_unlock(fe->ptl);
 		mem_cgroup_cancel_charge(page, memcg, true);
 		put_page(page);
-		pte_free(mm, pgtable);
+		pte_free(vma->vm_mm, pgtable);
 	} else {
 		pmd_t entry;
 
@@ -847,12 +844,11 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		if (userfaultfd_missing(vma)) {
 			int ret;
 
-			spin_unlock(ptl);
+			spin_unlock(fe->ptl);
 			mem_cgroup_cancel_charge(page, memcg, true);
 			put_page(page);
-			pte_free(mm, pgtable);
-			ret = handle_userfault(vma, address, flags,
-					       VM_UFFD_MISSING);
+			pte_free(vma->vm_mm, pgtable);
+			ret = handle_userfault(fe, VM_UFFD_MISSING);
 			VM_BUG_ON(ret & VM_FAULT_FALLBACK);
 			return ret;
 		}
@@ -862,11 +858,11 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		page_add_new_anon_rmap(page, vma, haddr, true);
 		mem_cgroup_commit_charge(page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(page, vma);
-		pgtable_trans_huge_deposit(mm, pmd, pgtable);
-		set_pmd_at(mm, haddr, pmd, entry);
-		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
-		atomic_long_inc(&mm->nr_ptes);
-		spin_unlock(ptl);
+		pgtable_trans_huge_deposit(vma->vm_mm, fe->pmd, pgtable);
+		set_pmd_at(vma->vm_mm, haddr, fe->pmd, entry);
+		add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+		atomic_long_inc(&vma->vm_mm->nr_ptes);
+		spin_unlock(fe->ptl);
 		count_vm_event(THP_FAULT_ALLOC);
 	}
 
@@ -895,13 +891,12 @@ static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
 	return true;
 }
 
-int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			       unsigned long address, pmd_t *pmd,
-			       unsigned int flags)
+int do_huge_pmd_anonymous_page(struct fault_env *fe)
 {
+	struct vm_area_struct *vma = fe->vma;
 	gfp_t gfp;
 	struct page *page;
-	unsigned long haddr = address & HPAGE_PMD_MASK;
+	unsigned long haddr = fe->address & HPAGE_PMD_MASK;
 
 	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
 		return VM_FAULT_FALLBACK;
@@ -909,42 +904,40 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
 	if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
 		return VM_FAULT_OOM;
-	if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm) &&
+	if (!(fe->flags & FAULT_FLAG_WRITE) &&
+			!mm_forbids_zeropage(vma->vm_mm) &&
 			transparent_hugepage_use_zero_page()) {
-		spinlock_t *ptl;
 		pgtable_t pgtable;
 		struct page *zero_page;
 		bool set;
 		int ret;
-		pgtable = pte_alloc_one(mm, haddr);
+		pgtable = pte_alloc_one(vma->vm_mm, haddr);
 		if (unlikely(!pgtable))
 			return VM_FAULT_OOM;
 		zero_page = get_huge_zero_page();
 		if (unlikely(!zero_page)) {
-			pte_free(mm, pgtable);
+			pte_free(vma->vm_mm, pgtable);
 			count_vm_event(THP_FAULT_FALLBACK);
 			return VM_FAULT_FALLBACK;
 		}
-		ptl = pmd_lock(mm, pmd);
+		fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
 		ret = 0;
 		set = false;
-		if (pmd_none(*pmd)) {
+		if (pmd_none(*fe->pmd)) {
 			if (userfaultfd_missing(vma)) {
-				spin_unlock(ptl);
-				ret = handle_userfault(vma, address, flags,
-						       VM_UFFD_MISSING);
+				spin_unlock(fe->ptl);
+				ret = handle_userfault(fe, VM_UFFD_MISSING);
 				VM_BUG_ON(ret & VM_FAULT_FALLBACK);
 			} else {
-				set_huge_zero_page(pgtable, mm, vma,
-						   haddr, pmd,
-						   zero_page);
-				spin_unlock(ptl);
+				set_huge_zero_page(pgtable, vma->vm_mm, vma,
+						   haddr, fe->pmd, zero_page);
+				spin_unlock(fe->ptl);
 				set = true;
 			}
 		} else
-			spin_unlock(ptl);
+			spin_unlock(fe->ptl);
 		if (!set) {
-			pte_free(mm, pgtable);
+			pte_free(vma->vm_mm, pgtable);
 			put_huge_zero_page();
 		}
 		return ret;
@@ -956,8 +949,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return VM_FAULT_FALLBACK;
 	}
 	prep_transhuge_page(page);
-	return __do_huge_pmd_anonymous_page(mm, vma, address, pmd, page, gfp,
-					    flags);
+	return __do_huge_pmd_anonymous_page(fe, page, gfp);
 }
 
 static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
@@ -1129,38 +1121,31 @@ out:
 	return ret;
 }
 
-void huge_pmd_set_accessed(struct mm_struct *mm,
-			   struct vm_area_struct *vma,
-			   unsigned long address,
-			   pmd_t *pmd, pmd_t orig_pmd,
-			   int dirty)
+void huge_pmd_set_accessed(struct fault_env *fe, pmd_t orig_pmd)
 {
-	spinlock_t *ptl;
 	pmd_t entry;
 	unsigned long haddr;
 
-	ptl = pmd_lock(mm, pmd);
-	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+	fe->ptl = pmd_lock(fe->vma->vm_mm, fe->pmd);
+	if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
 		goto unlock;
 
 	entry = pmd_mkyoung(orig_pmd);
-	haddr = address & HPAGE_PMD_MASK;
-	if (pmdp_set_access_flags(vma, haddr, pmd, entry, dirty))
-		update_mmu_cache_pmd(vma, address, pmd);
+	haddr = fe->address & HPAGE_PMD_MASK;
+	if (pmdp_set_access_flags(fe->vma, haddr, fe->pmd, entry,
+				fe->flags & FAULT_FLAG_WRITE))
+		update_mmu_cache_pmd(fe->vma, fe->address, fe->pmd);
 
 unlock:
-	spin_unlock(ptl);
+	spin_unlock(fe->ptl);
 }
 
-static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
-					struct vm_area_struct *vma,
-					unsigned long address,
-					pmd_t *pmd, pmd_t orig_pmd,
-					struct page *page,
-					unsigned long haddr)
+static int do_huge_pmd_wp_page_fallback(struct fault_env *fe, pmd_t orig_pmd,
+		struct page *page)
 {
+	struct vm_area_struct *vma = fe->vma;
+	unsigned long haddr = fe->address & HPAGE_PMD_MASK;
 	struct mem_cgroup *memcg;
-	spinlock_t *ptl;
 	pgtable_t pgtable;
 	pmd_t _pmd;
 	int ret = 0, i;
@@ -1177,11 +1162,11 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE |
-					       __GFP_OTHER_NODE,
-					       vma, address, page_to_nid(page));
+					       __GFP_OTHER_NODE, vma,
+					       fe->address, page_to_nid(page));
 		if (unlikely(!pages[i] ||
-			     mem_cgroup_try_charge(pages[i], mm, GFP_KERNEL,
-						   &memcg, false))) {
+			     mem_cgroup_try_charge(pages[i], vma->vm_mm,
+				     GFP_KERNEL, &memcg, false))) {
 			if (pages[i])
 				put_page(pages[i]);
 			while (--i >= 0) {
@@ -1207,41 +1192,41 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
 
-	ptl = pmd_lock(mm, pmd);
-	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+	fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+	if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
 		goto out_free_pages;
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
-	pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+	pmdp_huge_clear_flush_notify(vma, haddr, fe->pmd);
 	/* leave pmd empty until pte is filled */
 
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
-	pmd_populate(mm, &_pmd, pgtable);
+	pgtable = pgtable_trans_huge_withdraw(vma->vm_mm, fe->pmd);
+	pmd_populate(vma->vm_mm, &_pmd, pgtable);
 
 	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
-		pte_t *pte, entry;
+		pte_t entry;
 		entry = mk_pte(pages[i], vma->vm_page_prot);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
-		page_add_new_anon_rmap(pages[i], vma, haddr, false);
+		page_add_new_anon_rmap(pages[i], fe->vma, haddr, false);
 		mem_cgroup_commit_charge(pages[i], memcg, false, false);
 		lru_cache_add_active_or_unevictable(pages[i], vma);
-		pte = pte_offset_map(&_pmd, haddr);
-		VM_BUG_ON(!pte_none(*pte));
-		set_pte_at(mm, haddr, pte, entry);
-		pte_unmap(pte);
+		fe->pte = pte_offset_map(&_pmd, haddr);
+		VM_BUG_ON(!pte_none(*fe->pte));
+		set_pte_at(vma->vm_mm, haddr, fe->pte, entry);
+		pte_unmap(fe->pte);
 	}
 	kfree(pages);
 
 	smp_wmb(); /* make pte visible before pmd */
-	pmd_populate(mm, pmd, pgtable);
+	pmd_populate(vma->vm_mm, fe->pmd, pgtable);
 	page_remove_rmap(page, true);
-	spin_unlock(ptl);
+	spin_unlock(fe->ptl);
 
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
 
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
@@ -1250,8 +1235,8 @@ out:
 	return ret;
 
 out_free_pages:
-	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	spin_unlock(fe->ptl);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
@@ -1262,25 +1247,23 @@ out_free_pages:
 	goto out;
 }
 
-int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
+int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
 {
-	spinlock_t *ptl;
-	int ret = 0;
+	struct vm_area_struct *vma = fe->vma;
 	struct page *page = NULL, *new_page;
 	struct mem_cgroup *memcg;
-	unsigned long haddr;
+	unsigned long haddr = fe->address & HPAGE_PMD_MASK;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 	gfp_t huge_gfp;			/* for allocation and charge */
+	int ret = 0;
 
-	ptl = pmd_lockptr(mm, pmd);
+	fe->ptl = pmd_lockptr(vma->vm_mm, fe->pmd);
 	VM_BUG_ON_VMA(!vma->anon_vma, vma);
-	haddr = address & HPAGE_PMD_MASK;
 	if (is_huge_zero_pmd(orig_pmd))
 		goto alloc;
-	spin_lock(ptl);
-	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+	spin_lock(fe->ptl);
+	if (unlikely(!pmd_same(*fe->pmd, orig_pmd)))
 		goto out_unlock;
 
 	page = pmd_page(orig_pmd);
@@ -1299,13 +1282,13 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
-		if (pmdp_set_access_flags(vma, haddr, pmd, entry,  1))
-			update_mmu_cache_pmd(vma, address, pmd);
+		if (pmdp_set_access_flags(vma, haddr, fe->pmd, entry,  1))
+			update_mmu_cache_pmd(vma, fe->address, fe->pmd);
 		ret |= VM_FAULT_WRITE;
 		goto out_unlock;
 	}
 	get_page(page);
-	spin_unlock(ptl);
+	spin_unlock(fe->ptl);
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow()) {
@@ -1318,13 +1301,12 @@ alloc:
 		prep_transhuge_page(new_page);
 	} else {
 		if (!page) {
-			split_huge_pmd(vma, pmd, address);
+			split_huge_pmd(vma, fe->pmd, fe->address);
 			ret |= VM_FAULT_FALLBACK;
 		} else {
-			ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
-					pmd, orig_pmd, page, haddr);
+			ret = do_huge_pmd_wp_page_fallback(fe, orig_pmd, page);
 			if (ret & VM_FAULT_OOM) {
-				split_huge_pmd(vma, pmd, address);
+				split_huge_pmd(vma, fe->pmd, fe->address);
 				ret |= VM_FAULT_FALLBACK;
 			}
 			put_page(page);
@@ -1333,14 +1315,12 @@ alloc:
 		goto out;
 	}
 
-	if (unlikely(mem_cgroup_try_charge(new_page, mm, huge_gfp, &memcg,
-					   true))) {
+	if (unlikely(mem_cgroup_try_charge(new_page, vma->vm_mm,
+					huge_gfp, &memcg, true))) {
 		put_page(new_page);
-		if (page) {
-			split_huge_pmd(vma, pmd, address);
+		split_huge_pmd(vma, fe->pmd, fe->address);
+		if (page)
 			put_page(page);
-		} else
-			split_huge_pmd(vma, pmd, address);
 		ret |= VM_FAULT_FALLBACK;
 		count_vm_event(THP_FAULT_FALLBACK);
 		goto out;
@@ -1356,13 +1336,13 @@ alloc:
 
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
 
-	spin_lock(ptl);
+	spin_lock(fe->ptl);
 	if (page)
 		put_page(page);
-	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
-		spin_unlock(ptl);
+	if (unlikely(!pmd_same(*fe->pmd, orig_pmd))) {
+		spin_unlock(fe->ptl);
 		mem_cgroup_cancel_charge(new_page, memcg, true);
 		put_page(new_page);
 		goto out_mn;
@@ -1370,14 +1350,14 @@ alloc:
 		pmd_t entry;
 		entry = mk_huge_pmd(new_page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
-		pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+		pmdp_huge_clear_flush_notify(vma, haddr, fe->pmd);
 		page_add_new_anon_rmap(new_page, vma, haddr, true);
 		mem_cgroup_commit_charge(new_page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(new_page, vma);
-		set_pmd_at(mm, haddr, pmd, entry);
-		update_mmu_cache_pmd(vma, address, pmd);
+		set_pmd_at(vma->vm_mm, haddr, fe->pmd, entry);
+		update_mmu_cache_pmd(vma, fe->address, fe->pmd);
 		if (!page) {
-			add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+			add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 			put_huge_zero_page();
 		} else {
 			VM_BUG_ON_PAGE(!PageHead(page), page);
@@ -1386,13 +1366,13 @@ alloc:
 		}
 		ret |= VM_FAULT_WRITE;
 	}
-	spin_unlock(ptl);
+	spin_unlock(fe->ptl);
 out_mn:
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
 out:
 	return ret;
 out_unlock:
-	spin_unlock(ptl);
+	spin_unlock(fe->ptl);
 	return ret;
 }
 
@@ -1452,13 +1432,12 @@ out:
 }
 
 /* NUMA hinting page fault entry point for trans huge pmds */
-int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-				unsigned long addr, pmd_t pmd, pmd_t *pmdp)
+int do_huge_pmd_numa_page(struct fault_env *fe, pmd_t pmd)
 {
-	spinlock_t *ptl;
+	struct vm_area_struct *vma = fe->vma;
 	struct anon_vma *anon_vma = NULL;
 	struct page *page;
-	unsigned long haddr = addr & HPAGE_PMD_MASK;
+	unsigned long haddr = fe->address & HPAGE_PMD_MASK;
 	int page_nid = -1, this_nid = numa_node_id();
 	int target_nid, last_cpupid = -1;
 	bool page_locked;
@@ -1469,8 +1448,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* A PROT_NONE fault should not end up here */
 	BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
 
-	ptl = pmd_lock(mm, pmdp);
-	if (unlikely(!pmd_same(pmd, *pmdp)))
+	fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+	if (unlikely(!pmd_same(pmd, *fe->pmd)))
 		goto out_unlock;
 
 	/*
@@ -1478,9 +1457,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * without disrupting NUMA hinting information. Do not relock and
 	 * check_same as the page may no longer be mapped.
 	 */
-	if (unlikely(pmd_trans_migrating(*pmdp))) {
-		page = pmd_page(*pmdp);
-		spin_unlock(ptl);
+	if (unlikely(pmd_trans_migrating(*fe->pmd))) {
+		page = pmd_page(*fe->pmd);
+		spin_unlock(fe->ptl);
 		wait_on_page_locked(page);
 		goto out;
 	}
@@ -1513,7 +1492,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	/* Migration could have started since the pmd_trans_migrating check */
 	if (!page_locked) {
-		spin_unlock(ptl);
+		spin_unlock(fe->ptl);
 		wait_on_page_locked(page);
 		page_nid = -1;
 		goto out;
@@ -1524,12 +1503,12 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * to serialises splits
 	 */
 	get_page(page);
-	spin_unlock(ptl);
+	spin_unlock(fe->ptl);
 	anon_vma = page_lock_anon_vma_read(page);
 
 	/* Confirm the PMD did not change while page_table_lock was released */
-	spin_lock(ptl);
-	if (unlikely(!pmd_same(pmd, *pmdp))) {
+	spin_lock(fe->ptl);
+	if (unlikely(!pmd_same(pmd, *fe->pmd))) {
 		unlock_page(page);
 		put_page(page);
 		page_nid = -1;
@@ -1547,9 +1526,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * Migrate the THP to the requested node, returns with page unlocked
 	 * and access rights restored.
 	 */
-	spin_unlock(ptl);
-	migrated = migrate_misplaced_transhuge_page(mm, vma,
-				pmdp, pmd, addr, page, target_nid);
+	spin_unlock(fe->ptl);
+	migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma,
+				fe->pmd, pmd, fe->address, page, target_nid);
 	if (migrated) {
 		flags |= TNF_MIGRATED;
 		page_nid = target_nid;
@@ -1564,18 +1543,18 @@ clear_pmdnuma:
 	pmd = pmd_mkyoung(pmd);
 	if (was_writable)
 		pmd = pmd_mkwrite(pmd);
-	set_pmd_at(mm, haddr, pmdp, pmd);
-	update_mmu_cache_pmd(vma, addr, pmdp);
+	set_pmd_at(vma->vm_mm, haddr, fe->pmd, pmd);
+	update_mmu_cache_pmd(vma, fe->address, fe->pmd);
 	unlock_page(page);
 out_unlock:
-	spin_unlock(ptl);
+	spin_unlock(fe->ptl);
 
 out:
 	if (anon_vma)
 		page_unlock_anon_vma_read(anon_vma);
 
 	if (page_nid != -1)
-		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, flags);
+		task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, fe->flags);
 
 	return 0;
 }
@@ -2356,29 +2335,32 @@ static void __collapse_huge_page_swapin(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long address, pmd_t *pmd)
 {
-	unsigned long _address;
-	pte_t *pte, pteval;
+	pte_t pteval;
 	int swapped_in = 0, ret = 0;
-
-	pte = pte_offset_map(pmd, address);
-	for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
-	     pte++, _address += PAGE_SIZE) {
-		pteval = *pte;
+	struct fault_env fe = {
+		.vma = vma,
+		.address = address,
+		.flags = FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
+		.pmd = pmd,
+	};
+
+	fe.pte = pte_offset_map(pmd, address);
+	for (; fe.address < address + HPAGE_PMD_NR*PAGE_SIZE;
+			fe.pte++, fe.address += PAGE_SIZE) {
+		pteval = *fe.pte;
 		if (!is_swap_pte(pteval))
 			continue;
 		swapped_in++;
-		ret = do_swap_page(mm, vma, _address, pte, pmd,
-				   FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT,
-				   pteval);
+		ret = do_swap_page(&fe, pteval);
 		if (ret & VM_FAULT_ERROR) {
 			trace_mm_collapse_huge_page_swapin(mm, swapped_in, 0);
 			return;
 		}
 		/* pte is unmapped now, we need to map it */
-		pte = pte_offset_map(pmd, _address);
+		fe.pte = pte_offset_map(pmd, fe.address);
 	}
-	pte--;
-	pte_unmap(pte);
+	fe.pte--;
+	pte_unmap(fe.pte);
 	trace_mm_collapse_huge_page_swapin(mm, swapped_in, 1);
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index 271ad9567e32..4ff5f2588430 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -35,9 +35,7 @@
 /* Do not use these with a slab allocator */
 #define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
 
-extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pte_t *page_table, pmd_t *pmd,
-			unsigned int flags, pte_t orig_pte);
+int do_swap_page(struct fault_env *fe, pte_t orig_pte);
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
diff --git a/mm/memory.c b/mm/memory.c
index 2b0c10ec3064..f8f9549fac86 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1994,13 +1994,11 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
  * case, all we need to do here is to mark the page as writable and update
  * any related book-keeping.
  */
-static inline int wp_page_reuse(struct mm_struct *mm,
-			struct vm_area_struct *vma, unsigned long address,
-			pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
-			struct page *page, int page_mkwrite,
-			int dirty_shared)
-	__releases(ptl)
+static inline int wp_page_reuse(struct fault_env *fe, pte_t orig_pte,
+			struct page *page, int page_mkwrite, int dirty_shared)
+	__releases(fe->ptl)
 {
+	struct vm_area_struct *vma = fe->vma;
 	pte_t entry;
 	/*
 	 * Clear the pages cpupid information as the existing
@@ -2010,12 +2008,12 @@ static inline int wp_page_reuse(struct mm_struct *mm,
 	if (page)
 		page_cpupid_xchg_last(page, (1 << LAST_CPUPID_SHIFT) - 1);
 
-	flush_cache_page(vma, address, pte_pfn(orig_pte));
+	flush_cache_page(vma, fe->address, pte_pfn(orig_pte));
 	entry = pte_mkyoung(orig_pte);
 	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-	if (ptep_set_access_flags(vma, address, page_table, entry, 1))
-		update_mmu_cache(vma, address, page_table);
-	pte_unmap_unlock(page_table, ptl);
+	if (ptep_set_access_flags(vma, fe->address, fe->pte, entry, 1))
+		update_mmu_cache(vma, fe->address, fe->pte);
+	pte_unmap_unlock(fe->pte, fe->ptl);
 
 	if (dirty_shared) {
 		struct address_space *mapping;
@@ -2061,30 +2059,31 @@ static inline int wp_page_reuse(struct mm_struct *mm,
  *   held to the old page, as well as updating the rmap.
  * - In any case, unlock the PTL and drop the reference we took to the old page.
  */
-static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pte_t *page_table, pmd_t *pmd,
-			pte_t orig_pte, struct page *old_page)
+static int wp_page_copy(struct fault_env *fe, pte_t orig_pte,
+		struct page *old_page)
 {
+	struct vm_area_struct *vma = fe->vma;
+	struct mm_struct *mm = vma->vm_mm;
 	struct page *new_page = NULL;
-	spinlock_t *ptl = NULL;
 	pte_t entry;
 	int page_copied = 0;
-	const unsigned long mmun_start = address & PAGE_MASK;	/* For mmu_notifiers */
-	const unsigned long mmun_end = mmun_start + PAGE_SIZE;	/* For mmu_notifiers */
+	const unsigned long mmun_start = fe->address & PAGE_MASK;
+	const unsigned long mmun_end = mmun_start + PAGE_SIZE;
 	struct mem_cgroup *memcg;
 
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
 
 	if (is_zero_pfn(pte_pfn(orig_pte))) {
-		new_page = alloc_zeroed_user_highpage_movable(vma, address);
+		new_page = alloc_zeroed_user_highpage_movable(vma, fe->address);
 		if (!new_page)
 			goto oom;
 	} else {
-		new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+		new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
+				fe->address);
 		if (!new_page)
 			goto oom;
-		cow_user_page(new_page, old_page, address, vma);
+		cow_user_page(new_page, old_page, fe->address, vma);
 	}
 
 	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false))
@@ -2097,8 +2096,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
-	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (likely(pte_same(*page_table, orig_pte))) {
+	fe->pte = pte_offset_map_lock(mm, fe->pmd, fe->address, &fe->ptl);
+	if (likely(pte_same(*fe->pte, orig_pte))) {
 		if (old_page) {
 			if (!PageAnon(old_page)) {
 				dec_mm_counter_fast(mm,
@@ -2108,7 +2107,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		} else {
 			inc_mm_counter_fast(mm, MM_ANONPAGES);
 		}
-		flush_cache_page(vma, address, pte_pfn(orig_pte));
+		flush_cache_page(vma, fe->address, pte_pfn(orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		/*
@@ -2117,8 +2116,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * seen in the presence of one thread doing SMC and another
 		 * thread doing COW.
 		 */
-		ptep_clear_flush_notify(vma, address, page_table);
-		page_add_new_anon_rmap(new_page, vma, address, false);
+		ptep_clear_flush_notify(vma, fe->address, fe->pte);
+		page_add_new_anon_rmap(new_page, vma, fe->address, false);
 		mem_cgroup_commit_charge(new_page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		/*
@@ -2126,8 +2125,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * mmu page tables (such as kvm shadow page tables), we want the
 		 * new page to be mapped directly into the secondary page table.
 		 */
-		set_pte_at_notify(mm, address, page_table, entry);
-		update_mmu_cache(vma, address, page_table);
+		set_pte_at_notify(mm, fe->address, fe->pte, entry);
+		update_mmu_cache(vma, fe->address, fe->pte);
 		if (old_page) {
 			/*
 			 * Only after switching the pte to the new page may
@@ -2164,7 +2163,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (new_page)
 		page_cache_release(new_page);
 
-	pte_unmap_unlock(page_table, ptl);
+	pte_unmap_unlock(fe->pte, fe->ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	if (old_page) {
 		/*
@@ -2192,44 +2191,43 @@ oom:
  * Handle write page faults for VM_MIXEDMAP or VM_PFNMAP for a VM_SHARED
  * mapping
  */
-static int wp_pfn_shared(struct mm_struct *mm,
-			struct vm_area_struct *vma, unsigned long address,
-			pte_t *page_table, spinlock_t *ptl, pte_t orig_pte,
-			pmd_t *pmd)
+static int wp_pfn_shared(struct fault_env *fe,  pte_t orig_pte)
 {
+	struct vm_area_struct *vma = fe->vma;
+
 	if (vma->vm_ops && vma->vm_ops->pfn_mkwrite) {
 		struct vm_fault vmf = {
 			.page = NULL,
-			.pgoff = linear_page_index(vma, address),
-			.virtual_address = (void __user *)(address & PAGE_MASK),
+			.pgoff = linear_page_index(vma, fe->address),
+			.virtual_address =
+				(void __user *)(fe->address & PAGE_MASK),
 			.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE,
 		};
 		int ret;
 
-		pte_unmap_unlock(page_table, ptl);
+		pte_unmap_unlock(fe->pte, fe->ptl);
 		ret = vma->vm_ops->pfn_mkwrite(vma, &vmf);
 		if (ret & VM_FAULT_ERROR)
 			return ret;
-		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+		fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+				&fe->ptl);
 		/*
 		 * We might have raced with another page fault while we
 		 * released the pte_offset_map_lock.
 		 */
-		if (!pte_same(*page_table, orig_pte)) {
-			pte_unmap_unlock(page_table, ptl);
+		if (!pte_same(*fe->pte, orig_pte)) {
+			pte_unmap_unlock(fe->pte, fe->ptl);
 			return 0;
 		}
 	}
-	return wp_page_reuse(mm, vma, address, page_table, ptl, orig_pte,
-			     NULL, 0, 0);
+	return wp_page_reuse(fe, orig_pte, NULL, 0, 0);
 }
 
-static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
-			  unsigned long address, pte_t *page_table,
-			  pmd_t *pmd, spinlock_t *ptl, pte_t orig_pte,
-			  struct page *old_page)
-	__releases(ptl)
+static int wp_page_shared(struct fault_env *fe, pte_t orig_pte,
+		struct page *old_page)
+	__releases(fe->ptl)
 {
+	struct vm_area_struct *vma = fe->vma;
 	int page_mkwrite = 0;
 
 	page_cache_get(old_page);
@@ -2237,8 +2235,8 @@ static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
 		int tmp;
 
-		pte_unmap_unlock(page_table, ptl);
-		tmp = do_page_mkwrite(vma, old_page, address);
+		pte_unmap_unlock(fe->pte, fe->ptl);
+		tmp = do_page_mkwrite(vma, old_page, fe->address);
 		if (unlikely(!tmp || (tmp &
 				      (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
 			page_cache_release(old_page);
@@ -2250,19 +2248,18 @@ static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * they did, we just return, as we can count on the
 		 * MMU to tell us if they didn't also make it writable.
 		 */
-		page_table = pte_offset_map_lock(mm, pmd, address,
-						 &ptl);
-		if (!pte_same(*page_table, orig_pte)) {
+		fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+						 &fe->ptl);
+		if (!pte_same(*fe->pte, orig_pte)) {
 			unlock_page(old_page);
-			pte_unmap_unlock(page_table, ptl);
+			pte_unmap_unlock(fe->pte, fe->ptl);
 			page_cache_release(old_page);
 			return 0;
 		}
 		page_mkwrite = 1;
 	}
 
-	return wp_page_reuse(mm, vma, address, page_table, ptl,
-			     orig_pte, old_page, page_mkwrite, 1);
+	return wp_page_reuse(fe, orig_pte, old_page, page_mkwrite, 1);
 }
 
 /*
@@ -2283,14 +2280,13 @@ static int wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
  * but allow concurrent faults), with pte both mapped and locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
-static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		spinlock_t *ptl, pte_t orig_pte)
-	__releases(ptl)
+static int do_wp_page(struct fault_env *fe, pte_t orig_pte)
+	__releases(fe->ptl)
 {
+	struct vm_area_struct *vma = fe->vma;
 	struct page *old_page;
 
-	old_page = vm_normal_page(vma, address, orig_pte);
+	old_page = vm_normal_page(vma, fe->address, orig_pte);
 	if (!old_page) {
 		/*
 		 * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
@@ -2301,12 +2297,10 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 				     (VM_WRITE|VM_SHARED))
-			return wp_pfn_shared(mm, vma, address, page_table, ptl,
-					     orig_pte, pmd);
+			return wp_pfn_shared(fe, orig_pte);
 
-		pte_unmap_unlock(page_table, ptl);
-		return wp_page_copy(mm, vma, address, page_table, pmd,
-				    orig_pte, old_page);
+		pte_unmap_unlock(fe->pte, fe->ptl);
+		return wp_page_copy(fe, orig_pte, old_page);
 	}
 
 	/*
@@ -2316,13 +2310,13 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (PageAnon(old_page) && !PageKsm(old_page)) {
 		if (!trylock_page(old_page)) {
 			page_cache_get(old_page);
-			pte_unmap_unlock(page_table, ptl);
+			pte_unmap_unlock(fe->pte, fe->ptl);
 			lock_page(old_page);
-			page_table = pte_offset_map_lock(mm, pmd, address,
-							 &ptl);
-			if (!pte_same(*page_table, orig_pte)) {
+			fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd,
+					fe->address, &fe->ptl);
+			if (!pte_same(*fe->pte, orig_pte)) {
 				unlock_page(old_page);
-				pte_unmap_unlock(page_table, ptl);
+				pte_unmap_unlock(fe->pte, fe->ptl);
 				page_cache_release(old_page);
 				return 0;
 			}
@@ -2334,16 +2328,14 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 * the rmap code will not search our parent or siblings.
 			 * Protected against the rmap code by the page lock.
 			 */
-			page_move_anon_rmap(old_page, vma, address);
+			page_move_anon_rmap(old_page, vma, fe->address);
 			unlock_page(old_page);
-			return wp_page_reuse(mm, vma, address, page_table, ptl,
-					     orig_pte, old_page, 0, 0);
+			return wp_page_reuse(fe, orig_pte, old_page, 0, 0);
 		}
 		unlock_page(old_page);
 	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 					(VM_WRITE|VM_SHARED))) {
-		return wp_page_shared(mm, vma, address, page_table, pmd,
-				      ptl, orig_pte, old_page);
+		return wp_page_shared(fe, orig_pte, old_page);
 	}
 
 	/*
@@ -2351,9 +2343,8 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 */
 	page_cache_get(old_page);
 
-	pte_unmap_unlock(page_table, ptl);
-	return wp_page_copy(mm, vma, address, page_table, pmd,
-			    orig_pte, old_page);
+	pte_unmap_unlock(fe->pte, fe->ptl);
+	return wp_page_copy(fe, orig_pte, old_page);
 }
 
 static void unmap_mapping_range_vma(struct vm_area_struct *vma,
@@ -2444,11 +2435,9 @@ EXPORT_SYMBOL(unmap_mapping_range);
  * We return with the mmap_sem locked or unlocked in the same cases
  * as does filemap_fault().
  */
-int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		unsigned int flags, pte_t orig_pte)
+int do_swap_page(struct fault_env *fe, pte_t orig_pte)
 {
-	spinlock_t *ptl;
+	struct vm_area_struct *vma = fe->vma;
 	struct page *page, *swapcache;
 	struct mem_cgroup *memcg;
 	swp_entry_t entry;
@@ -2457,17 +2446,17 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int exclusive = 0;
 	int ret = 0;
 
-	if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
+	if (!pte_unmap_same(vma->vm_mm, fe->pmd, fe->pte, orig_pte))
 		goto out;
 
 	entry = pte_to_swp_entry(orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
-			migration_entry_wait(mm, pmd, address);
+			migration_entry_wait(vma->vm_mm, fe->pmd, fe->address);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
 		} else {
-			print_bad_pte(vma, address, orig_pte, NULL);
+			print_bad_pte(vma, fe->address, orig_pte, NULL);
 			ret = VM_FAULT_SIGBUS;
 		}
 		goto out;
@@ -2476,14 +2465,15 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page = lookup_swap_cache(entry);
 	if (!page) {
 		page = swapin_readahead(entry,
-					GFP_HIGHUSER_MOVABLE, vma, address);
+					GFP_HIGHUSER_MOVABLE, vma, fe->address);
 		if (!page) {
 			/*
 			 * Back out if somebody else faulted in this pte
 			 * while we released the pte lock.
 			 */
-			page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-			if (likely(pte_same(*page_table, orig_pte)))
+			fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd,
+					fe->address, &fe->ptl);
+			if (likely(pte_same(*fe->pte, orig_pte)))
 				ret = VM_FAULT_OOM;
 			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 			goto unlock;
@@ -2492,7 +2482,7 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
-		mem_cgroup_count_vm_event(mm, PGMAJFAULT);
+		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 	} else if (PageHWPoison(page)) {
 		/*
 		 * hwpoisoned dirty swapcache pages are kept for killing
@@ -2505,7 +2495,7 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	swapcache = page;
-	locked = lock_page_or_retry(page, mm, flags);
+	locked = lock_page_or_retry(page, vma->vm_mm, fe->flags);
 
 	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 	if (!locked) {
@@ -2522,14 +2512,15 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
 		goto out_page;
 
-	page = ksm_might_need_to_copy(page, vma, address);
+	page = ksm_might_need_to_copy(page, vma, fe->address);
 	if (unlikely(!page)) {
 		ret = VM_FAULT_OOM;
 		page = swapcache;
 		goto out_page;
 	}
 
-	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false)) {
+	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL,
+				&memcg, false)) {
 		ret = VM_FAULT_OOM;
 		goto out_page;
 	}
@@ -2537,8 +2528,9 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/*
 	 * Back out if somebody else already faulted in this pte.
 	 */
-	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (unlikely(!pte_same(*page_table, orig_pte)))
+	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+			&fe->ptl);
+	if (unlikely(!pte_same(*fe->pte, orig_pte)))
 		goto out_nomap;
 
 	if (unlikely(!PageUptodate(page))) {
@@ -2556,24 +2548,24 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * must be called after the swap_free(), or it will never succeed.
 	 */
 
-	inc_mm_counter_fast(mm, MM_ANONPAGES);
-	dec_mm_counter_fast(mm, MM_SWAPENTS);
+	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
+	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
-	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
+	if ((fe->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
-		flags &= ~FAULT_FLAG_WRITE;
+		fe->flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
 		exclusive = RMAP_EXCLUSIVE;
 	}
 	flush_icache_page(vma, page);
 	if (pte_swp_soft_dirty(orig_pte))
 		pte = pte_mksoft_dirty(pte);
-	set_pte_at(mm, address, page_table, pte);
+	set_pte_at(vma->vm_mm, fe->address, fe->pte, pte);
 	if (page == swapcache) {
-		do_page_add_anon_rmap(page, vma, address, exclusive);
+		do_page_add_anon_rmap(page, vma, fe->address, exclusive);
 		mem_cgroup_commit_charge(page, memcg, true, false);
 	} else { /* ksm created a completely new copy */
-		page_add_new_anon_rmap(page, vma, address, false);
+		page_add_new_anon_rmap(page, vma, fe->address, false);
 		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	}
@@ -2596,22 +2588,22 @@ int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_cache_release(swapcache);
 	}
 
-	if (flags & FAULT_FLAG_WRITE) {
-		ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
+	if (fe->flags & FAULT_FLAG_WRITE) {
+		ret |= do_wp_page(fe, pte);
 		if (ret & VM_FAULT_ERROR)
 			ret &= VM_FAULT_ERROR;
 		goto out;
 	}
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(vma, address, page_table);
+	update_mmu_cache(vma, fe->address, fe->pte);
 unlock:
-	pte_unmap_unlock(page_table, ptl);
+	pte_unmap_unlock(fe->pte, fe->ptl);
 out:
 	return ret;
 out_nomap:
 	mem_cgroup_cancel_charge(page, memcg, false);
-	pte_unmap_unlock(page_table, ptl);
+	pte_unmap_unlock(fe->pte, fe->ptl);
 out_page:
 	unlock_page(page);
 out_release:
@@ -2662,37 +2654,36 @@ static inline int check_stack_guard_page(struct vm_area_struct *vma, unsigned lo
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
-static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		unsigned int flags)
+static int do_anonymous_page(struct fault_env *fe)
 {
+	struct vm_area_struct *vma = fe->vma;
 	struct mem_cgroup *memcg;
 	struct page *page;
-	spinlock_t *ptl;
 	pte_t entry;
 
-	pte_unmap(page_table);
+	pte_unmap(fe->pte);
 
 	/* File mapping without ->vm_ops ? */
 	if (vma->vm_flags & VM_SHARED)
 		return VM_FAULT_SIGBUS;
 
 	/* Check if we need to add a guard page to the stack */
-	if (check_stack_guard_page(vma, address) < 0)
+	if (check_stack_guard_page(vma, fe->address) < 0)
 		return VM_FAULT_SIGSEGV;
 
 	/* Use the zero-page for reads */
-	if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm)) {
-		entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
+	if (!(fe->flags & FAULT_FLAG_WRITE) &&
+			!mm_forbids_zeropage(vma->vm_mm)) {
+		entry = pte_mkspecial(pfn_pte(my_zero_pfn(fe->address),
 						vma->vm_page_prot));
-		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-		if (!pte_none(*page_table))
+		fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+				&fe->ptl);
+		if (!pte_none(*fe->pte))
 			goto unlock;
 		/* Deliver the page fault to userland, check inside PT lock */
 		if (userfaultfd_missing(vma)) {
-			pte_unmap_unlock(page_table, ptl);
-			return handle_userfault(vma, address, flags,
-						VM_UFFD_MISSING);
+			pte_unmap_unlock(fe->pte, fe->ptl);
+			return handle_userfault(fe, VM_UFFD_MISSING);
 		}
 		goto setpte;
 	}
@@ -2700,11 +2691,11 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	page = alloc_zeroed_user_highpage_movable(vma, address);
+	page = alloc_zeroed_user_highpage_movable(vma, fe->address);
 	if (!page)
 		goto oom;
 
-	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false))
+	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
 		goto oom_free_page;
 
 	/*
@@ -2718,30 +2709,30 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));
 
-	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (!pte_none(*page_table))
+	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+			&fe->ptl);
+	if (!pte_none(*fe->pte))
 		goto release;
 
 	/* Deliver the page fault to userland, check inside PT lock */
 	if (userfaultfd_missing(vma)) {
-		pte_unmap_unlock(page_table, ptl);
+		pte_unmap_unlock(fe->pte, fe->ptl);
 		mem_cgroup_cancel_charge(page, memcg, false);
 		page_cache_release(page);
-		return handle_userfault(vma, address, flags,
-					VM_UFFD_MISSING);
+		return handle_userfault(fe, VM_UFFD_MISSING);
 	}
 
-	inc_mm_counter_fast(mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, vma, address, false);
+	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
+	page_add_new_anon_rmap(page, vma, fe->address, false);
 	mem_cgroup_commit_charge(page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(page, vma);
 setpte:
-	set_pte_at(mm, address, page_table, entry);
+	set_pte_at(vma->vm_mm, fe->address, fe->pte, entry);
 
 	/* No need to invalidate - it was non-present before */
-	update_mmu_cache(vma, address, page_table);
+	update_mmu_cache(vma, fe->address, fe->pte);
 unlock:
-	pte_unmap_unlock(page_table, ptl);
+	pte_unmap_unlock(fe->pte, fe->ptl);
 	return 0;
 release:
 	mem_cgroup_cancel_charge(page, memcg, false);
@@ -2758,16 +2749,16 @@ oom:
  * released depending on flags and vma->vm_ops->fault() return value.
  * See filemap_fault() and __lock_page_retry().
  */
-static int __do_fault(struct vm_area_struct *vma, unsigned long address,
-			pgoff_t pgoff, unsigned int flags,
-			struct page *cow_page, struct page **page)
+static int __do_fault(struct fault_env *fe, pgoff_t pgoff,
+		struct page *cow_page, struct page **page)
 {
+	struct vm_area_struct *vma = fe->vma;
 	struct vm_fault vmf;
 	int ret;
 
-	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
+	vmf.virtual_address = (void __user *)(fe->address & PAGE_MASK);
 	vmf.pgoff = pgoff;
-	vmf.flags = flags;
+	vmf.flags = fe->flags;
 	vmf.page = NULL;
 	vmf.gfp_mask = __get_fault_gfp_mask(vma);
 	vmf.cow_page = cow_page;
@@ -2798,38 +2789,36 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 /**
  * do_set_pte - setup new PTE entry for given page and add reverse page mapping.
  *
- * @vma: virtual memory area
- * @address: user virtual address
+ * @fe: fault environment
  * @page: page to map
- * @pte: pointer to target page table entry
- * @write: true, if new entry is writable
- * @anon: true, if it's anonymous page
  *
- * Caller must hold page table lock relevant for @pte.
+ * Caller must hold page table lock relevant for @fe->pte.
  *
  * Target users are page handler itself and implementations of
  * vm_ops->map_pages.
  */
-void do_set_pte(struct vm_area_struct *vma, unsigned long address,
-		struct page *page, pte_t *pte, bool write, bool anon)
+void do_set_pte(struct fault_env *fe, struct page *page)
 {
+	struct vm_area_struct *vma = fe->vma;
+	bool write = fe->flags & FAULT_FLAG_WRITE;
 	pte_t entry;
 
 	flush_icache_page(vma, page);
 	entry = mk_pte(page, vma->vm_page_prot);
 	if (write)
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-	if (anon) {
+	/* copy-on-write page */
+	if (write && !(vma->vm_flags & VM_SHARED)) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-		page_add_new_anon_rmap(page, vma, address, false);
+		page_add_new_anon_rmap(page, vma, fe->address, false);
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
 		page_add_file_rmap(page);
 	}
-	set_pte_at(vma->vm_mm, address, pte, entry);
+	set_pte_at(vma->vm_mm, fe->address, fe->pte, entry);
 
 	/* no need to invalidate: a not-present page won't be cached */
-	update_mmu_cache(vma, address, pte);
+	update_mmu_cache(vma, fe->address, fe->pte);
 }
 
 static unsigned long fault_around_bytes __read_mostly =
@@ -2896,57 +2885,53 @@ late_initcall(fault_around_debugfs);
  * fault_around_pages() value (and therefore to page order).  This way it's
  * easier to guarantee that we don't cross page table boundaries.
  */
-static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
-		pte_t *pte, pgoff_t pgoff, unsigned int flags)
+static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
 {
-	unsigned long start_addr, nr_pages, mask;
-	pgoff_t max_pgoff;
-	struct vm_fault vmf;
+	unsigned long address = fe->address, start_addr, nr_pages, mask;
+	pte_t *pte = fe->pte;
+	pgoff_t end_pgoff;
 	int off;
 
 	nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
 	mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
 
-	start_addr = max(address & mask, vma->vm_start);
-	off = ((address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
-	pte -= off;
-	pgoff -= off;
+	start_addr = max(fe->address & mask, fe->vma->vm_start);
+	off = ((fe->address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
+	fe->pte -= off;
+	start_pgoff -= off;
 
 	/*
-	 *  max_pgoff is either end of page table or end of vma
-	 *  or fault_around_pages() from pgoff, depending what is nearest.
+	 *  end_pgoff is either end of page table or end of vma
+	 *  or fault_around_pages() from start_pgoff, depending what is nearest.
 	 */
-	max_pgoff = pgoff - ((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
+	end_pgoff = start_pgoff -
+		((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
 		PTRS_PER_PTE - 1;
-	max_pgoff = min3(max_pgoff, vma_pages(vma) + vma->vm_pgoff - 1,
-			pgoff + nr_pages - 1);
+	end_pgoff = min3(end_pgoff, vma_pages(fe->vma) + fe->vma->vm_pgoff - 1,
+			start_pgoff + nr_pages - 1);
 
 	/* Check if it makes any sense to call ->map_pages */
-	while (!pte_none(*pte)) {
-		if (++pgoff > max_pgoff)
-			return;
-		start_addr += PAGE_SIZE;
-		if (start_addr >= vma->vm_end)
-			return;
-		pte++;
+	fe->address = start_addr;
+	while (!pte_none(*fe->pte)) {
+		if (++start_pgoff > end_pgoff)
+			goto out;
+		fe->address += PAGE_SIZE;
+		if (fe->address >= fe->vma->vm_end)
+			goto out;
+		fe->pte++;
 	}
 
-	vmf.virtual_address = (void __user *) start_addr;
-	vmf.pte = pte;
-	vmf.pgoff = pgoff;
-	vmf.max_pgoff = max_pgoff;
-	vmf.flags = flags;
-	vmf.gfp_mask = __get_fault_gfp_mask(vma);
-	vma->vm_ops->map_pages(vma, &vmf);
+	fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
+out:
+	/* restore fault_env */
+	fe->pte = pte;
+	fe->address = address;
 }
 
-static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pmd_t *pmd,
-		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
 {
+	struct vm_area_struct *vma = fe->vma;
 	struct page *fault_page;
-	spinlock_t *ptl;
-	pte_t *pte;
 	int ret = 0;
 
 	/*
@@ -2955,64 +2940,64 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * something).
 	 */
 	if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
-		pte = pte_offset_map_lock(mm, pmd, address, &ptl);
-		do_fault_around(vma, address, pte, pgoff, flags);
-		if (!pte_same(*pte, orig_pte))
+		fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+				&fe->ptl);
+		do_fault_around(fe, pgoff);
+		if (!pte_same(*fe->pte, orig_pte))
 			goto unlock_out;
-		pte_unmap_unlock(pte, ptl);
+		pte_unmap_unlock(fe->pte, fe->ptl);
 	}
 
-	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
+	ret = __do_fault(fe, pgoff, NULL, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
-	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (unlikely(!pte_same(*pte, orig_pte))) {
-		pte_unmap_unlock(pte, ptl);
+	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address, &fe->ptl);
+	if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+		pte_unmap_unlock(fe->pte, fe->ptl);
 		unlock_page(fault_page);
 		page_cache_release(fault_page);
 		return ret;
 	}
-	do_set_pte(vma, address, fault_page, pte, false, false);
+	do_set_pte(fe, fault_page);
 	unlock_page(fault_page);
 unlock_out:
-	pte_unmap_unlock(pte, ptl);
+	pte_unmap_unlock(fe->pte, fe->ptl);
 	return ret;
 }
 
-static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pmd_t *pmd,
-		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
 {
+	struct vm_area_struct *vma = fe->vma;
 	struct page *fault_page, *new_page;
 	struct mem_cgroup *memcg;
-	spinlock_t *ptl;
-	pte_t *pte;
 	int ret;
 
 	if (unlikely(anon_vma_prepare(vma)))
 		return VM_FAULT_OOM;
 
-	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, fe->address);
 	if (!new_page)
 		return VM_FAULT_OOM;
 
-	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false)) {
+	if (mem_cgroup_try_charge(new_page, vma->vm_mm, GFP_KERNEL,
+				&memcg, false)) {
 		page_cache_release(new_page);
 		return VM_FAULT_OOM;
 	}
 
-	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
+	ret = __do_fault(fe, pgoff, new_page, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		goto uncharge_out;
 
 	if (fault_page)
-		copy_user_highpage(new_page, fault_page, address, vma);
+		copy_user_highpage(new_page, fault_page, fe->address, vma);
 	__SetPageUptodate(new_page);
 
-	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (unlikely(!pte_same(*pte, orig_pte))) {
-		pte_unmap_unlock(pte, ptl);
+	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+			&fe->ptl);
+	if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+		pte_unmap_unlock(fe->pte, fe->ptl);
 		if (fault_page) {
 			unlock_page(fault_page);
 			page_cache_release(fault_page);
@@ -3025,10 +3010,10 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 		goto uncharge_out;
 	}
-	do_set_pte(vma, address, new_page, pte, true, true);
+	do_set_pte(fe, new_page);
 	mem_cgroup_commit_charge(new_page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(new_page, vma);
-	pte_unmap_unlock(pte, ptl);
+	pte_unmap_unlock(fe->pte, fe->ptl);
 	if (fault_page) {
 		unlock_page(fault_page);
 		page_cache_release(fault_page);
@@ -3046,18 +3031,15 @@ uncharge_out:
 	return ret;
 }
 
-static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pmd_t *pmd,
-		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
 {
+	struct vm_area_struct *vma = fe->vma;
 	struct page *fault_page;
 	struct address_space *mapping;
-	spinlock_t *ptl;
-	pte_t *pte;
 	int dirtied = 0;
 	int ret, tmp;
 
-	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
+	ret = __do_fault(fe, pgoff, NULL, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
@@ -3067,7 +3049,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 */
 	if (vma->vm_ops->page_mkwrite) {
 		unlock_page(fault_page);
-		tmp = do_page_mkwrite(vma, fault_page, address);
+		tmp = do_page_mkwrite(vma, fault_page, fe->address);
 		if (unlikely(!tmp ||
 				(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
 			page_cache_release(fault_page);
@@ -3075,15 +3057,16 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 	}
 
-	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (unlikely(!pte_same(*pte, orig_pte))) {
-		pte_unmap_unlock(pte, ptl);
+	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+			&fe->ptl);
+	if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+		pte_unmap_unlock(fe->pte, fe->ptl);
 		unlock_page(fault_page);
 		page_cache_release(fault_page);
 		return ret;
 	}
-	do_set_pte(vma, address, fault_page, pte, true, false);
-	pte_unmap_unlock(pte, ptl);
+	do_set_pte(fe, fault_page);
+	pte_unmap_unlock(fe->pte, fe->ptl);
 
 	if (set_page_dirty(fault_page))
 		dirtied = 1;
@@ -3115,24 +3098,21 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
  * The mmap_sem may have been released depending on flags and our
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
-static int do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		unsigned int flags, pte_t orig_pte)
+static int do_fault(struct fault_env *fe, pte_t orig_pte)
 {
-	pgoff_t pgoff = (((address & PAGE_MASK)
+	struct vm_area_struct *vma = fe->vma;
+	pgoff_t pgoff = (((fe->address & PAGE_MASK)
 			- vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 
-	pte_unmap(page_table);
+	pte_unmap(fe->pte);
 	/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
 	if (!vma->vm_ops->fault)
 		return VM_FAULT_SIGBUS;
-	if (!(flags & FAULT_FLAG_WRITE))
-		return do_read_fault(mm, vma, address, pmd, pgoff, flags,
-				orig_pte);
+	if (!(fe->flags & FAULT_FLAG_WRITE))
+		return do_read_fault(fe, pgoff,	orig_pte);
 	if (!(vma->vm_flags & VM_SHARED))
-		return do_cow_fault(mm, vma, address, pmd, pgoff, flags,
-				orig_pte);
-	return do_shared_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
+		return do_cow_fault(fe, pgoff, orig_pte);
+	return do_shared_fault(fe, pgoff, orig_pte);
 }
 
 static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
@@ -3150,11 +3130,10 @@ static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 	return mpol_misplaced(page, vma, addr);
 }
 
-static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
+static int do_numa_page(struct fault_env *fe, pte_t pte)
 {
+	struct vm_area_struct *vma = fe->vma;
 	struct page *page = NULL;
-	spinlock_t *ptl;
 	int page_nid = -1;
 	int last_cpupid;
 	int target_nid;
@@ -3174,10 +3153,10 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	* page table entry is not accessible, so there would be no
 	* concurrent hardware modifications to the PTE.
 	*/
-	ptl = pte_lockptr(mm, pmd);
-	spin_lock(ptl);
-	if (unlikely(!pte_same(*ptep, pte))) {
-		pte_unmap_unlock(ptep, ptl);
+	fe->ptl = pte_lockptr(vma->vm_mm, fe->pmd);
+	spin_lock(fe->ptl);
+	if (unlikely(!pte_same(*fe->pte, pte))) {
+		pte_unmap_unlock(fe->pte, fe->ptl);
 		goto out;
 	}
 
@@ -3186,18 +3165,18 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte = pte_mkyoung(pte);
 	if (was_writable)
 		pte = pte_mkwrite(pte);
-	set_pte_at(mm, addr, ptep, pte);
-	update_mmu_cache(vma, addr, ptep);
+	set_pte_at(vma->vm_mm, fe->address, fe->pte, pte);
+	update_mmu_cache(vma, fe->address, fe->pte);
 
-	page = vm_normal_page(vma, addr, pte);
+	page = vm_normal_page(vma, fe->address, pte);
 	if (!page) {
-		pte_unmap_unlock(ptep, ptl);
+		pte_unmap_unlock(fe->pte, fe->ptl);
 		return 0;
 	}
 
 	/* TODO: handle PTE-mapped THP */
 	if (PageCompound(page)) {
-		pte_unmap_unlock(ptep, ptl);
+		pte_unmap_unlock(fe->pte, fe->ptl);
 		return 0;
 	}
 
@@ -3221,8 +3200,9 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	last_cpupid = page_cpupid_last(page);
 	page_nid = page_to_nid(page);
-	target_nid = numa_migrate_prep(page, vma, addr, page_nid, &flags);
-	pte_unmap_unlock(ptep, ptl);
+	target_nid = numa_migrate_prep(page, vma, fe->address, page_nid,
+			&flags);
+	pte_unmap_unlock(fe->pte, fe->ptl);
 	if (target_nid == -1) {
 		put_page(page);
 		goto out;
@@ -3242,24 +3222,24 @@ out:
 	return 0;
 }
 
-static int create_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pmd_t *pmd, unsigned int flags)
+static int create_huge_pmd(struct fault_env *fe)
 {
+	struct vm_area_struct *vma = fe->vma;
 	if (vma_is_anonymous(vma))
-		return do_huge_pmd_anonymous_page(mm, vma, address, pmd, flags);
+		return do_huge_pmd_anonymous_page(fe);
 	if (vma->vm_ops->pmd_fault)
-		return vma->vm_ops->pmd_fault(vma, address, pmd, flags);
+		return vma->vm_ops->pmd_fault(vma, fe->address, fe->pmd,
+				fe->flags);
 	return VM_FAULT_FALLBACK;
 }
 
-static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pmd_t *pmd, pmd_t orig_pmd,
-			unsigned int flags)
+static int wp_huge_pmd(struct fault_env *fe, pmd_t orig_pmd)
 {
-	if (vma_is_anonymous(vma))
-		return do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd);
-	if (vma->vm_ops->pmd_fault)
-		return vma->vm_ops->pmd_fault(vma, address, pmd, flags);
+	if (vma_is_anonymous(fe->vma))
+		return do_huge_pmd_wp_page(fe, orig_pmd);
+	if (fe->vma->vm_ops->pmd_fault)
+		return fe->vma->vm_ops->pmd_fault(fe->vma, fe->address, fe->pmd,
+				fe->flags);
 	return VM_FAULT_FALLBACK;
 }
 
@@ -3279,12 +3259,9 @@ static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
  * The mmap_sem may have been released depending on flags and our
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
-static int handle_pte_fault(struct mm_struct *mm,
-		     struct vm_area_struct *vma, unsigned long address,
-		     pte_t *pte, pmd_t *pmd, unsigned int flags)
+static int handle_pte_fault(struct fault_env *fe)
 {
 	pte_t entry;
-	spinlock_t *ptl;
 
 	/*
 	 * some architectures can have larger ptes than wordsize,
@@ -3294,37 +3271,34 @@ static int handle_pte_fault(struct mm_struct *mm,
 	 * we later double check anyway with the ptl lock held. So here
 	 * a barrier will do.
 	 */
-	entry = *pte;
+	entry = *fe->pte;
 	barrier();
 	if (!pte_present(entry)) {
 		if (pte_none(entry)) {
-			if (vma_is_anonymous(vma))
-				return do_anonymous_page(mm, vma, address,
-							 pte, pmd, flags);
+			if (vma_is_anonymous(fe->vma))
+				return do_anonymous_page(fe);
 			else
-				return do_fault(mm, vma, address, pte, pmd,
-						flags, entry);
+				return do_fault(fe, entry);
 		}
-		return do_swap_page(mm, vma, address,
-					pte, pmd, flags, entry);
+		return do_swap_page(fe, entry);
 	}
 
 	if (pte_protnone(entry))
-		return do_numa_page(mm, vma, address, entry, pte, pmd);
+		return do_numa_page(fe, entry);
 
-	ptl = pte_lockptr(mm, pmd);
-	spin_lock(ptl);
-	if (unlikely(!pte_same(*pte, entry)))
+	fe->ptl = pte_lockptr(fe->vma->vm_mm, fe->pmd);
+	spin_lock(fe->ptl);
+	if (unlikely(!pte_same(*fe->pte, entry)))
 		goto unlock;
-	if (flags & FAULT_FLAG_WRITE) {
+	if (fe->flags & FAULT_FLAG_WRITE) {
 		if (!pte_write(entry))
-			return do_wp_page(mm, vma, address,
-					pte, pmd, ptl, entry);
+			return do_wp_page(fe, entry);
 		entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);
-	if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {
-		update_mmu_cache(vma, address, pte);
+	if (ptep_set_access_flags(fe->vma, fe->address, fe->pte, entry,
+				fe->flags & FAULT_FLAG_WRITE)) {
+		update_mmu_cache(fe->vma, fe->address, fe->pte);
 	} else {
 		/*
 		 * This is needed only for protection faults but the arch code
@@ -3332,11 +3306,11 @@ static int handle_pte_fault(struct mm_struct *mm,
 		 * This still avoids useless tlb flushes for .text page faults
 		 * with threads.
 		 */
-		if (flags & FAULT_FLAG_WRITE)
-			flush_tlb_fix_spurious_fault(vma, address);
+		if (fe->flags & FAULT_FLAG_WRITE)
+			flush_tlb_fix_spurious_fault(fe->vma, fe->address);
 	}
 unlock:
-	pte_unmap_unlock(pte, ptl);
+	pte_unmap_unlock(fe->pte, fe->ptl);
 	return 0;
 }
 
@@ -3349,46 +3323,42 @@ unlock:
 static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		unsigned int flags)
 {
+	struct fault_env fe = {
+		.vma = vma,
+		.address = address,
+		.flags = flags,
+	};
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
 	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte;
-
-	if (unlikely(is_vm_hugetlb_page(vma)))
-		return hugetlb_fault(mm, vma, address, flags);
 
 	pgd = pgd_offset(mm, address);
 	pud = pud_alloc(mm, pgd, address);
 	if (!pud)
 		return VM_FAULT_OOM;
-	pmd = pmd_alloc(mm, pud, address);
-	if (!pmd)
+	fe.pmd = pmd_alloc(mm, pud, address);
+	if (!fe.pmd)
 		return VM_FAULT_OOM;
-	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
-		int ret = create_huge_pmd(mm, vma, address, pmd, flags);
+	if (pmd_none(*fe.pmd) && transparent_hugepage_enabled(vma)) {
+		int ret = create_huge_pmd(&fe);
 		if (!(ret & VM_FAULT_FALLBACK))
 			return ret;
 	} else {
-		pmd_t orig_pmd = *pmd;
+		pmd_t orig_pmd = *fe.pmd;
 		int ret;
 
 		barrier();
 		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
-			unsigned int dirty = flags & FAULT_FLAG_WRITE;
-
 			if (pmd_protnone(orig_pmd))
-				return do_huge_pmd_numa_page(mm, vma, address,
-							     orig_pmd, pmd);
+				return do_huge_pmd_numa_page(&fe, orig_pmd);
 
-			if (dirty && !pmd_write(orig_pmd)) {
-				ret = wp_huge_pmd(mm, vma, address, pmd,
-							orig_pmd, flags);
+			if ((fe.flags & FAULT_FLAG_WRITE) &&
+					!pmd_write(orig_pmd)) {
+				ret = wp_huge_pmd(&fe, orig_pmd);
 				if (!(ret & VM_FAULT_FALLBACK))
 					return ret;
 			} else {
-				huge_pmd_set_accessed(mm, vma, address, pmd,
-						      orig_pmd, dirty);
+				huge_pmd_set_accessed(&fe, orig_pmd);
 				return 0;
 			}
 		}
@@ -3399,11 +3369,11 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	 * run pte_offset_map on the pmd, if an huge pmd could
 	 * materialize from under us from a different thread.
 	 */
-	if (unlikely(pmd_none(*pmd)) &&
-	    unlikely(__pte_alloc(mm, vma, pmd, address)))
+	if (unlikely(pmd_none(*fe.pmd)) &&
+	    unlikely(__pte_alloc(fe.vma->vm_mm, fe.vma, fe.pmd, fe.address)))
 		return VM_FAULT_OOM;
 	/* if an huge pmd materialized from under us just retry later */
-	if (unlikely(pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
+	if (unlikely(pmd_trans_huge(*fe.pmd) || pmd_devmap(*fe.pmd)))
 		return 0;
 	/*
 	 * A regular pmd is established and it can't morph into a huge pmd
@@ -3411,9 +3381,9 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	 * read mode and khugepaged takes it in write mode. So now it's
 	 * safe to run pte_offset_map().
 	 */
-	pte = pte_offset_map(pmd, address);
+	fe.pte = pte_offset_map(fe.pmd, fe.address);
 
-	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+	return handle_pte_fault(&fe);
 }
 
 /*
@@ -3442,7 +3412,10 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	if (flags & FAULT_FLAG_USER)
 		mem_cgroup_oom_enable();
 
-	ret = __handle_mm_fault(vma, address, flags);
+	if (unlikely(is_vm_hugetlb_page(vma)))
+		ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
+	else
+		ret = __handle_mm_fault(vma, address, flags);
 
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_oom_disable();
-- 
2.7.0


* [PATCHv2 08/28] mm: postpone page table allocation until do_set_pte()
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 07/28] mm: introduce fault_env Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-12 17:44   ` Dave Hansen
  2016-02-11 14:21 ` [PATCHv2 09/28] rmap: support file thp Kirill A. Shutemov
                   ` (19 subsequent siblings)
  27 siblings, 1 reply; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

The idea (and most of the code) is again borrowed from Hugh's patchset on
huge tmpfs[1].

Instead of allocating the pte page table upfront, we postpone this until
we have the page to map in hand. This approach opens the possibility of
mapping the page as huge if the filesystem supports it.

Compared to Hugh's patch, I've pushed page table allocation a bit
further: into do_set_pte(). This way we can postpone the allocation even
in the faultaround case, without moving do_fault_around() after
__do_fault().

[1] http://lkml.kernel.org/r/alpine.LSU.2.11.1502202015090.14414@eggly.anvils
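
To make the new ordering concrete, here is a rough userspace model of the
flow this patch establishes (stub types and a plain array standing in for
the page table, so this is only a sketch, not kernel code): fe->pte stays
NULL until do_set_pte() actually has a page in hand, and only then is a
page table allocated, through a helper in the spirit of the
pte_alloc_one_map() added below.

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

struct fault_env {
	unsigned long address;
	bool pmd_present;	/* stands in for !pmd_none(*fe->pmd) */
	long *pte;		/* NULL until a page table is mapped */
};

/* model of pte_alloc_one_map(): allocate and "map" a page table */
static int pte_alloc_one_map(struct fault_env *fe)
{
	fe->pte = calloc(512, sizeof(long));
	if (!fe->pte)
		return -1;	/* VM_FAULT_OOM in the real code */
	fe->pmd_present = true;
	return 0;
}

/* model of do_set_pte(): the page table is allocated only here */
static int do_set_pte(struct fault_env *fe, long page)
{
	if (!fe->pte && pte_alloc_one_map(fe))
		return -1;
	fe->pte[fe->address % 512] = page;	/* set_pte_at() */
	return 0;
}

int main(void)
{
	struct fault_env fe = { .address = 42, .pte = NULL };

	/* ->fault or ->map_pages would run first; only then do we map a pte */
	if (do_set_pte(&fe, 0xabcd))
		return 1;
	printf("pte[%lu] = %#lx\n", fe.address % 512,
	       (unsigned long)fe.pte[fe.address % 512]);
	free(fe.pte);
	return 0;
}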

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h |   4 +-
 mm/filemap.c       |  17 ++--
 mm/memory.c        | 254 ++++++++++++++++++++++++++++++-----------------------
 mm/nommu.c         |   3 +-
 4 files changed, 162 insertions(+), 116 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ca99c0ecf52e..172f4d8e798d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -265,6 +265,7 @@ struct fault_env {
 	pmd_t *pmd;
 	pte_t *pte;
 	spinlock_t *ptl;
+	pgtable_t prealloc_pte;
 };
 
 /*
@@ -559,7 +560,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 	return pte;
 }
 
-void do_set_pte(struct fault_env *fe, struct page *page);
+int do_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
+		struct page *page);
 #endif
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 28b3875969a8..ba8150d6dc33 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2146,11 +2146,6 @@ void filemap_map_pages(struct fault_env *fe,
 			start_pgoff) {
 		if (iter.index > end_pgoff)
 			break;
-		fe->pte += iter.index - last_pgoff;
-		fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
-		last_pgoff = iter.index;
-		if (!pte_none(*fe->pte))
-			goto next;
 repeat:
 		page = radix_tree_deref_slot(slot);
 		if (unlikely(!page))
@@ -2187,7 +2182,17 @@ repeat:
 
 		if (file->f_ra.mmap_miss > 0)
 			file->f_ra.mmap_miss--;
-		do_set_pte(fe, page);
+
+		fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
+		if (fe->pte)
+			fe->pte += iter.index - last_pgoff;
+		last_pgoff = iter.index;
+		if (do_set_pte(fe, NULL, page)) {
+			/* failed to setup page table: giving up */
+			if (!fe->pte)
+				break;
+			goto unlock;
+		}
 		unlock_page(page);
 		goto next;
 unlock:
diff --git a/mm/memory.c b/mm/memory.c
index f8f9549fac86..0de6f176674d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2661,8 +2661,6 @@ static int do_anonymous_page(struct fault_env *fe)
 	struct page *page;
 	pte_t entry;
 
-	pte_unmap(fe->pte);
-
 	/* File mapping without ->vm_ops ? */
 	if (vma->vm_flags & VM_SHARED)
 		return VM_FAULT_SIGBUS;
@@ -2671,6 +2669,18 @@ static int do_anonymous_page(struct fault_env *fe)
 	if (check_stack_guard_page(vma, fe->address) < 0)
 		return VM_FAULT_SIGSEGV;
 
+	/*
+	 * Use __pte_alloc instead of pte_alloc_map, because we can't
+	 * run pte_offset_map on the pmd, if an huge pmd could
+	 * materialize from under us from a different thread.
+	 */
+	if (unlikely(pmd_none(*fe->pmd) &&
+			__pte_alloc(vma->vm_mm, vma, fe->pmd, fe->address)))
+		return VM_FAULT_OOM;
+	/* If an huge pmd materialized from under us just retry later */
+	if (unlikely(pmd_trans_huge(*fe->pmd)))
+		return 0;
+
 	/* Use the zero-page for reads */
 	if (!(fe->flags & FAULT_FLAG_WRITE) &&
 			!mm_forbids_zeropage(vma->vm_mm)) {
@@ -2786,23 +2796,66 @@ static int __do_fault(struct fault_env *fe, pgoff_t pgoff,
 	return ret;
 }
 
+static int pte_alloc_one_map(struct fault_env *fe)
+{
+	struct vm_area_struct *vma = fe->vma;
+
+	if (!pmd_none(*fe->pmd))
+		goto map_pte;
+	if (fe->prealloc_pte) {
+		smp_wmb(); /* See comment in __pte_alloc() */
+
+		fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+		if (unlikely(!pmd_none(*fe->pmd))) {
+			spin_unlock(fe->ptl);
+			goto map_pte;
+		}
+
+		atomic_long_inc(&vma->vm_mm->nr_ptes);
+		pmd_populate(vma->vm_mm, fe->pmd, fe->prealloc_pte);
+		spin_unlock(fe->ptl);
+		fe->prealloc_pte = 0;
+	} else if (unlikely(__pte_alloc(vma->vm_mm, vma, fe->pmd,
+					fe->address))) {
+		return VM_FAULT_OOM;
+	}
+map_pte:
+	if (unlikely(pmd_trans_huge(*fe->pmd)))
+		return VM_FAULT_NOPAGE;
+
+	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
+			&fe->ptl);
+	return 0;
+}
+
 /**
  * do_set_pte - setup new PTE entry for given page and add reverse page mapping.
  *
  * @fe: fault environment
+ * @memcg: memcg to charge page (only for private mappings)
  * @page: page to map
  *
- * Caller must hold page table lock relevant for @fe->pte.
+ * Caller must take care of unlocking fe->ptl, if fe->pte is non-NULL on return.
  *
  * Target users are page handler itself and implementations of
  * vm_ops->map_pages.
  */
-void do_set_pte(struct fault_env *fe, struct page *page)
+int do_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
+		struct page *page)
 {
 	struct vm_area_struct *vma = fe->vma;
 	bool write = fe->flags & FAULT_FLAG_WRITE;
 	pte_t entry;
 
+	if (!fe->pte) {
+		int ret = pte_alloc_one_map(fe);
+		if (ret)
+			return ret;
+	}
+
+	if (unlikely(!pte_none(*fe->pte)))
+		return VM_FAULT_NOPAGE;
+
 	flush_icache_page(vma, page);
 	entry = mk_pte(page, vma->vm_page_prot);
 	if (write)
@@ -2811,6 +2864,8 @@ void do_set_pte(struct fault_env *fe, struct page *page)
 	if (write && !(vma->vm_flags & VM_SHARED)) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 		page_add_new_anon_rmap(page, vma, fe->address, false);
+		mem_cgroup_commit_charge(page, memcg, false, false);
+		lru_cache_add_active_or_unevictable(page, vma);
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
 		page_add_file_rmap(page);
@@ -2819,6 +2874,8 @@ void do_set_pte(struct fault_env *fe, struct page *page)
 
 	/* no need to invalidate: a not-present page won't be cached */
 	update_mmu_cache(vma, fe->address, fe->pte);
+
+	return 0;
 }
 
 static unsigned long fault_around_bytes __read_mostly =
@@ -2885,19 +2942,17 @@ late_initcall(fault_around_debugfs);
  * fault_around_pages() value (and therefore to page order).  This way it's
  * easier to guarantee that we don't cross page table boundaries.
  */
-static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
+static int do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
 {
-	unsigned long address = fe->address, start_addr, nr_pages, mask;
-	pte_t *pte = fe->pte;
+	unsigned long address = fe->address, nr_pages, mask;
 	pgoff_t end_pgoff;
-	int off;
+	int off, ret = 0;
 
 	nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
 	mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
 
-	start_addr = max(fe->address & mask, fe->vma->vm_start);
-	off = ((fe->address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
-	fe->pte -= off;
+	fe->address = max(address & mask, fe->vma->vm_start);
+	off = ((address - fe->address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
 	start_pgoff -= off;
 
 	/*
@@ -2905,30 +2960,33 @@ static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
 	 *  or fault_around_pages() from start_pgoff, depending what is nearest.
 	 */
 	end_pgoff = start_pgoff -
-		((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
+		((fe->address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
 		PTRS_PER_PTE - 1;
 	end_pgoff = min3(end_pgoff, vma_pages(fe->vma) + fe->vma->vm_pgoff - 1,
 			start_pgoff + nr_pages - 1);
 
-	/* Check if it makes any sense to call ->map_pages */
-	fe->address = start_addr;
-	while (!pte_none(*fe->pte)) {
-		if (++start_pgoff > end_pgoff)
-			goto out;
-		fe->address += PAGE_SIZE;
-		if (fe->address >= fe->vma->vm_end)
-			goto out;
-		fe->pte++;
+	if (pmd_none(*fe->pmd))
+		fe->prealloc_pte = pte_alloc_one(fe->vma->vm_mm, fe->address);
+	fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
+	if (fe->prealloc_pte) {
+		pte_free(fe->vma->vm_mm, fe->prealloc_pte);
+		fe->prealloc_pte = 0;
 	}
+	if (!fe->pte)
+		goto out;
 
-	fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
+	/* check if the page fault is solved */
+	fe->pte -= (fe->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
+	if (!pte_none(*fe->pte))
+		ret = VM_FAULT_NOPAGE;
+	pte_unmap_unlock(fe->pte, fe->ptl);
 out:
-	/* restore fault_env */
-	fe->pte = pte;
 	fe->address = address;
+	fe->pte = NULL;
+	return ret;
 }
 
-static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
+static int do_read_fault(struct fault_env *fe, pgoff_t pgoff)
 {
 	struct vm_area_struct *vma = fe->vma;
 	struct page *fault_page;
@@ -2940,33 +2998,25 @@ static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
 	 * something).
 	 */
 	if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
-		fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
-				&fe->ptl);
-		do_fault_around(fe, pgoff);
-		if (!pte_same(*fe->pte, orig_pte))
-			goto unlock_out;
-		pte_unmap_unlock(fe->pte, fe->ptl);
+		ret = do_fault_around(fe, pgoff);
+		if (ret)
+			return ret;
 	}
 
 	ret = __do_fault(fe, pgoff, NULL, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
-	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address, &fe->ptl);
-	if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+	ret |= do_set_pte(fe, NULL, fault_page);
+	if (fe->pte)
 		pte_unmap_unlock(fe->pte, fe->ptl);
-		unlock_page(fault_page);
-		page_cache_release(fault_page);
-		return ret;
-	}
-	do_set_pte(fe, fault_page);
 	unlock_page(fault_page);
-unlock_out:
-	pte_unmap_unlock(fe->pte, fe->ptl);
+	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
+		page_cache_release(fault_page);
 	return ret;
 }
 
-static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
+static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff)
 {
 	struct vm_area_struct *vma = fe->vma;
 	struct page *fault_page, *new_page;
@@ -2994,26 +3044,9 @@ static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
 		copy_user_highpage(new_page, fault_page, fe->address, vma);
 	__SetPageUptodate(new_page);
 
-	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
-			&fe->ptl);
-	if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+	ret |= do_set_pte(fe, memcg, new_page);
+	if (fe->pte)
 		pte_unmap_unlock(fe->pte, fe->ptl);
-		if (fault_page) {
-			unlock_page(fault_page);
-			page_cache_release(fault_page);
-		} else {
-			/*
-			 * The fault handler has no page to lock, so it holds
-			 * i_mmap_lock for read to protect against truncate.
-			 */
-			i_mmap_unlock_read(vma->vm_file->f_mapping);
-		}
-		goto uncharge_out;
-	}
-	do_set_pte(fe, new_page);
-	mem_cgroup_commit_charge(new_page, memcg, false, false);
-	lru_cache_add_active_or_unevictable(new_page, vma);
-	pte_unmap_unlock(fe->pte, fe->ptl);
 	if (fault_page) {
 		unlock_page(fault_page);
 		page_cache_release(fault_page);
@@ -3024,6 +3057,8 @@ static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
 		 */
 		i_mmap_unlock_read(vma->vm_file->f_mapping);
 	}
+	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
+		goto uncharge_out;
 	return ret;
 uncharge_out:
 	mem_cgroup_cancel_charge(new_page, memcg, false);
@@ -3031,7 +3066,7 @@ uncharge_out:
 	return ret;
 }
 
-static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
+static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff)
 {
 	struct vm_area_struct *vma = fe->vma;
 	struct page *fault_page;
@@ -3057,16 +3092,15 @@ static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
 		}
 	}
 
-	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
-			&fe->ptl);
-	if (unlikely(!pte_same(*fe->pte, orig_pte))) {
+	ret |= do_set_pte(fe, NULL, fault_page);
+	if (fe->pte)
 		pte_unmap_unlock(fe->pte, fe->ptl);
+	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
+					VM_FAULT_RETRY))) {
 		unlock_page(fault_page);
 		page_cache_release(fault_page);
 		return ret;
 	}
-	do_set_pte(fe, fault_page);
-	pte_unmap_unlock(fe->pte, fe->ptl);
 
 	if (set_page_dirty(fault_page))
 		dirtied = 1;
@@ -3098,21 +3132,19 @@ static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
  * The mmap_sem may have been released depending on flags and our
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
-static int do_fault(struct fault_env *fe, pte_t orig_pte)
+static int do_fault(struct fault_env *fe)
 {
 	struct vm_area_struct *vma = fe->vma;
-	pgoff_t pgoff = (((fe->address & PAGE_MASK)
-			- vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+	pgoff_t pgoff = linear_page_index(vma, fe->address);
 
-	pte_unmap(fe->pte);
 	/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
 	if (!vma->vm_ops->fault)
 		return VM_FAULT_SIGBUS;
 	if (!(fe->flags & FAULT_FLAG_WRITE))
-		return do_read_fault(fe, pgoff,	orig_pte);
+		return do_read_fault(fe, pgoff);
 	if (!(vma->vm_flags & VM_SHARED))
-		return do_cow_fault(fe, pgoff, orig_pte);
-	return do_shared_fault(fe, pgoff, orig_pte);
+		return do_cow_fault(fe, pgoff);
+	return do_shared_fault(fe, pgoff);
 }
 
 static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
@@ -3252,37 +3284,62 @@ static int wp_huge_pmd(struct fault_env *fe, pmd_t orig_pmd)
  * with external mmu caches can use to update those (ie the Sparc or
  * PowerPC hashed page tables that act as extended TLBs).
  *
- * We enter with non-exclusive mmap_sem (to exclude vma changes,
- * but allow concurrent faults), and pte mapped but not yet locked.
- * We return with pte unmapped and unlocked.
+ * We enter with non-exclusive mmap_sem (to exclude vma changes, but allow
+ * concurrent faults).
  *
- * The mmap_sem may have been released depending on flags and our
- * return value.  See filemap_fault() and __lock_page_or_retry().
+ * The mmap_sem may have been released depending on flags and our return value.
+ * See filemap_fault() and __lock_page_or_retry().
  */
 static int handle_pte_fault(struct fault_env *fe)
 {
 	pte_t entry;
 
+	/* If an huge pmd materialized from under us just retry later */
+	if (unlikely(pmd_trans_huge(*fe->pmd)))
+		return 0;
+
+	if (unlikely(pmd_none(*fe->pmd))) {
+		/*
+		 * Leave __pte_alloc() until later: because vm_ops->fault may
+		 * want to allocate huge page, and if we expose page table
+		 * for an instant, it will be difficult to retract from
+		 * concurrent faults and from rmap lookups.
+		 */
+	} else {
+		/*
+		 * A regular pmd is established and it can't morph into a huge
+		 * pmd from under us anymore at this point because we hold the
+		 * mmap_sem read mode and khugepaged takes it in write mode.
+		 * So now it's safe to run pte_offset_map().
+		 */
+		fe->pte = pte_offset_map(fe->pmd, fe->address);
+
+		entry = *fe->pte;
+		barrier();
+		if (pte_none(entry)) {
+			pte_unmap(fe->pte);
+			fe->pte = NULL;
+		}
+	}
+
 	/*
 	 * some architectures can have larger ptes than wordsize,
 	 * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y,
 	 * so READ_ONCE or ACCESS_ONCE cannot guarantee atomic accesses.
-	 * The code below just needs a consistent view for the ifs and
+	 * The code above just needs a consistent view for the ifs and
 	 * we later double check anyway with the ptl lock held. So here
 	 * a barrier will do.
 	 */
-	entry = *fe->pte;
-	barrier();
-	if (!pte_present(entry)) {
-		if (pte_none(entry)) {
-			if (vma_is_anonymous(fe->vma))
-				return do_anonymous_page(fe);
-			else
-				return do_fault(fe, entry);
-		}
-		return do_swap_page(fe, entry);
+	if (!fe->pte) {
+		if (vma_is_anonymous(fe->vma))
+			return do_anonymous_page(fe);
+		else
+			return do_fault(fe);
 	}
 
+	if (!pte_present(entry))
+		return do_swap_page(fe, entry);
+
 	if (pte_protnone(entry))
 		return do_numa_page(fe, entry);
 
@@ -3364,25 +3421,6 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		}
 	}
 
-	/*
-	 * Use __pte_alloc instead of pte_alloc_map, because we can't
-	 * run pte_offset_map on the pmd, if an huge pmd could
-	 * materialize from under us from a different thread.
-	 */
-	if (unlikely(pmd_none(*fe.pmd)) &&
-	    unlikely(__pte_alloc(fe.vma->vm_mm, fe.vma, fe.pmd, fe.address)))
-		return VM_FAULT_OOM;
-	/* if an huge pmd materialized from under us just retry later */
-	if (unlikely(pmd_trans_huge(*fe.pmd) || pmd_devmap(*fe.pmd)))
-		return 0;
-	/*
-	 * A regular pmd is established and it can't morph into a huge pmd
-	 * from under us anymore at this point because we hold the mmap_sem
-	 * read mode and khugepaged takes it in write mode. So now it's
-	 * safe to run pte_offset_map().
-	 */
-	fe.pte = pte_offset_map(fe.pmd, fe.address);
-
 	return handle_pte_fault(&fe);
 }
 
diff --git a/mm/nommu.c b/mm/nommu.c
index fbf6f0f1d6c9..f392488123b5 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1930,7 +1930,8 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 }
 EXPORT_SYMBOL(filemap_fault);
 
-void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
+void filemap_map_pages(struct fault_env *fe, pgoff_t start_pgoff,
+		pgoff_t end_pgoff)
 {
 	BUG();
 }
-- 
2.7.0


* [PATCHv2 09/28] rmap: support file thp
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (7 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 08/28] mm: postpone page table allocation until do_set_pte() Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 10/28] mm: introduce do_set_pmd() Kirill A. Shutemov
                   ` (18 subsequent siblings)
  27 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

Naive approach: on mapping/unmapping the page as compound, we update
->_mapcount on each 4k subpage. That's not efficient, but it's not
obvious how to optimize it; we can look into that later.

The PG_double_map optimization doesn't work for file pages, since the
lifecycle of file pages differs from that of anon pages: a file page can
be mapped again at any time.
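
For illustration, a toy model of the accounting above (plain ints instead
of the kernel's atomics, and HPAGE_PMD_NR shrunk to 8 for brevity):
mapping a file page with a PMD bumps ->_mapcount on every subpage, and
NR_FILE_MAPPED grows by the number of subpages that were previously
unmapped, but only when the compound mapcount itself goes from -1 to 0.

#include <stdio.h>

#define HPAGE_PMD_NR 8

static int subpage_mapcount[HPAGE_PMD_NR];	/* like _mapcount, starts at -1 */
static int compound_mapcount = -1;
static long nr_file_mapped;			/* NR_FILE_MAPPED */

static void page_add_file_rmap_compound(void)
{
	int i, nr = 0;

	for (i = 0; i < HPAGE_PMD_NR; i++)
		if (++subpage_mapcount[i] == 0)	/* atomic_inc_and_test() */
			nr++;
	if (++compound_mapcount == 0)		/* first compound mapping */
		nr_file_mapped += nr;
}

int main(void)
{
	int i;

	for (i = 0; i < HPAGE_PMD_NR; i++)
		subpage_mapcount[i] = -1;

	page_add_file_rmap_compound();
	printf("NR_FILE_MAPPED grew by %ld after the first PMD mapping\n",
	       nr_file_mapped);
	return 0;
}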

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/rmap.h |  2 +-
 mm/huge_memory.c     | 10 +++++++---
 mm/memory.c          |  4 ++--
 mm/migrate.c         |  2 +-
 mm/rmap.c            | 48 +++++++++++++++++++++++++++++++++++-------------
 mm/util.c            |  6 ++++++
 6 files changed, 52 insertions(+), 20 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 49eb4f8ebac9..5704f101b52e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -165,7 +165,7 @@ void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
 			   unsigned long, int);
 void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 		unsigned long, bool);
-void page_add_file_rmap(struct page *);
+void page_add_file_rmap(struct page *, bool);
 void page_remove_rmap(struct page *, bool);
 
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7ea43b9fbec4..0dc081fea9f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3202,18 +3202,22 @@ static void __split_huge_page(struct page *page, struct list_head *list)
 
 int total_mapcount(struct page *page)
 {
-	int i, ret;
+	int i, compound, ret;
 
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
 	if (likely(!PageCompound(page)))
 		return atomic_read(&page->_mapcount) + 1;
 
-	ret = compound_mapcount(page);
+	compound = compound_mapcount(page);
 	if (PageHuge(page))
-		return ret;
+		return compound;
+	ret = compound;
 	for (i = 0; i < HPAGE_PMD_NR; i++)
 		ret += atomic_read(&page[i]._mapcount) + 1;
+	/* File pages have compound_mapcount included in _mapcount */
+	if (!PageAnon(page))
+		ret -= compound * HPAGE_PMD_NR;
 	if (PageDoubleMap(page))
 		ret -= HPAGE_PMD_NR;
 	return ret;
diff --git a/mm/memory.c b/mm/memory.c
index 0de6f176674d..0d204ef02855 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1440,7 +1440,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
 	/* Ok, finally just insert the thing.. */
 	get_page(page);
 	inc_mm_counter_fast(mm, mm_counter_file(page));
-	page_add_file_rmap(page);
+	page_add_file_rmap(page, false);
 	set_pte_at(mm, addr, pte, mk_pte(page, prot));
 
 	retval = 0;
@@ -2868,7 +2868,7 @@ int do_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
 		lru_cache_add_active_or_unevictable(page, vma);
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
-		page_add_file_rmap(page);
+		page_add_file_rmap(page, false);
 	}
 	set_pte_at(vma->vm_mm, fe->address, fe->pte, entry);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 993390dcf68d..20f3ef726bc3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -170,7 +170,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 	} else if (PageAnon(new))
 		page_add_anon_rmap(new, vma, addr, false);
 	else
-		page_add_file_rmap(new);
+		page_add_file_rmap(new, false);
 
 	if (vma->vm_flags & VM_LOCKED && !PageCompound(new))
 		mlock_vma_page(new);
diff --git a/mm/rmap.c b/mm/rmap.c
index 945933a01010..b550bf637ce3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1285,18 +1285,34 @@ void page_add_new_anon_rmap(struct page *page,
  *
  * The caller needs to hold the pte lock.
  */
-void page_add_file_rmap(struct page *page)
+void page_add_file_rmap(struct page *page, bool compound)
 {
+	int i, nr = 1;
+
+	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
 	lock_page_memcg(page);
-	if (atomic_inc_and_test(&page->_mapcount)) {
-		__inc_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
+	if (compound && PageTransHuge(page)) {
+		for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
+			if (atomic_inc_and_test(&page[i]._mapcount))
+				nr++;
+		}
+		if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
+			goto out;
+	} else {
+		if (!atomic_inc_and_test(&page->_mapcount))
+			goto out;
 	}
+	__mod_zone_page_state(page_zone(page), NR_FILE_MAPPED, nr);
+	mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
+out:
 	unlock_page_memcg(page);
 }
 
-static void page_remove_file_rmap(struct page *page)
+static void page_remove_file_rmap(struct page *page, bool compound)
 {
+	int i, nr = 1;
+
+	VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
 	lock_page_memcg(page);
 
 	/* Hugepages are not counted in NR_FILE_MAPPED for now. */
@@ -1307,15 +1323,24 @@ static void page_remove_file_rmap(struct page *page)
 	}
 
 	/* page still mapped by someone else? */
-	if (!atomic_add_negative(-1, &page->_mapcount))
-		goto out;
+	if (compound && PageTransHuge(page)) {
+		for (i = 0, nr = 0; i < HPAGE_PMD_NR; i++) {
+			if (atomic_add_negative(-1, &page[i]._mapcount))
+				nr++;
+		}
+		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
+			goto out;
+	} else {
+		if (!atomic_add_negative(-1, &page->_mapcount))
+			goto out;
+	}
 
 	/*
 	 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	__dec_zone_page_state(page, NR_FILE_MAPPED);
+	__mod_zone_page_state(page_zone(page), NR_FILE_MAPPED, -nr);
 	mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 
 	if (unlikely(PageMlocked(page)))
@@ -1371,11 +1396,8 @@ static void page_remove_anon_compound_rmap(struct page *page)
  */
 void page_remove_rmap(struct page *page, bool compound)
 {
-	if (!PageAnon(page)) {
-		VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
-		page_remove_file_rmap(page);
-		return;
-	}
+	if (!PageAnon(page))
+		return page_remove_file_rmap(page, compound);
 
 	if (compound)
 		return page_remove_anon_compound_rmap(page);
diff --git a/mm/util.c b/mm/util.c
index a36fd2813adf..757bb18b061f 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -357,6 +357,12 @@ int __page_mapcount(struct page *page)
 	int ret;
 
 	ret = atomic_read(&page->_mapcount) + 1;
+	/*
+	 * For file THP page->_mapcount contains total number of mapping
+	 * of the page: no need to look into compound_mapcount.
+	 */
+	if (!PageAnon(page) && !PageHuge(page))
+		return ret;
 	page = compound_head(page);
 	ret += atomic_read(compound_mapcount_ptr(page)) + 1;
 	if (PageDoubleMap(page))
-- 
2.7.0


* [PATCHv2 10/28] mm: introduce do_set_pmd()
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (8 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 09/28] rmap: support file thp Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 11/28] mm, rmap: account file thp pages Kirill A. Shutemov
                   ` (17 subsequent siblings)
  27 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

With postponed page table allocation we now have a chance to set up huge
pages. do_set_pte() calls do_set_pmd() if the following criteria are met
(a standalone sketch of the alignment check follows the list):

 - the page is compound;
 - the pmd entry is pmd_none();
 - the vma has suitable size and alignment.
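
The last criterion is the subtle one: the VMA's virtual start and its file
offset must be congruent modulo the huge page size, and the whole huge
page must fit inside the VMA. A standalone sketch of that check, mirroring
the transhuge_vma_suitable() helper added below (4k base pages and 2M huge
pages assumed):

#include <stdio.h>
#include <stdbool.h>

#define PAGE_SHIFT		12
#define HPAGE_PMD_SIZE		(1UL << 21)
#define HPAGE_PMD_MASK		(~(HPAGE_PMD_SIZE - 1))
#define HPAGE_PMD_NR		(1UL << (21 - PAGE_SHIFT))
#define HPAGE_CACHE_INDEX_MASK	(HPAGE_PMD_NR - 1)

struct vma {
	unsigned long vm_start, vm_end;	/* virtual range */
	unsigned long vm_pgoff;		/* file offset of vm_start, in pages */
};

static bool suitable(const struct vma *vma, unsigned long haddr)
{
	/* virtual address and file offset congruent modulo the huge page */
	if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
	    (vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK))
		return false;
	/* the whole huge page must lie inside the VMA */
	return haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end;
}

int main(void)
{
	struct vma vma = { 0x600000, 0xa00000, 0 };	/* 2M-aligned, 4M long */
	unsigned long haddr = 0x712345 & HPAGE_PMD_MASK;

	printf("haddr %#lx suitable: %d\n", haddr, suitable(&vma, haddr));
	return 0;
}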

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c |  8 -------
 mm/internal.h    | 16 ++++++++++++++
 mm/memory.c      | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 78 insertions(+), 9 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0dc081fea9f1..9d614cee994f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -771,14 +771,6 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 	return pmd;
 }
 
-static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
-{
-	pmd_t entry;
-	entry = mk_pmd(page, prot);
-	entry = pmd_mkhuge(entry);
-	return entry;
-}
-
 static inline struct list_head *page_deferred_list(struct page *page)
 {
 	/*
diff --git a/mm/internal.h b/mm/internal.h
index 4ff5f2588430..4c5e13138c46 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -37,6 +37,22 @@
 
 int do_swap_page(struct fault_env *fe, pte_t orig_pte);
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
+{
+	pmd_t entry;
+	entry = mk_pmd(page, prot);
+	entry = pmd_mkhuge(entry);
+	return entry;
+}
+#else
+static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
+{
+	BUILD_BUG();
+	return __pmd(0);
+}
+#endif
+
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
diff --git a/mm/memory.c b/mm/memory.c
index 0d204ef02855..fb61e82bbb9a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2828,6 +2828,57 @@ map_pte:
 	return 0;
 }
 
+#define HPAGE_CACHE_INDEX_MASK (HPAGE_PMD_NR - 1)
+static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
+		unsigned long haddr)
+{
+	if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
+			(vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK))
+		return false;
+	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+		return false;
+	return true;
+}
+
+static int do_set_pmd(struct fault_env *fe, struct page *page)
+{
+	struct vm_area_struct *vma = fe->vma;
+	bool write = fe->flags & FAULT_FLAG_WRITE;
+	unsigned long haddr = fe->address & HPAGE_PMD_MASK;
+	pmd_t entry;
+	int ret;
+
+	if (!transhuge_vma_suitable(vma, haddr))
+		return VM_FAULT_FALLBACK;
+
+	ret = VM_FAULT_FALLBACK;
+
+	fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
+	if (unlikely(!pmd_none(*fe->pmd)))
+		goto out;
+
+	/* XXX: make flush_icache_page() aware of compound pages? */
+	flush_icache_page(vma, page);
+
+	page = compound_head(page);
+	entry = mk_huge_pmd(page, vma->vm_page_prot);
+	if (write)
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+
+	add_mm_counter(vma->vm_mm, MM_FILEPAGES, HPAGE_PMD_NR);
+	page_add_file_rmap(page, true);
+
+	set_pmd_at(vma->vm_mm, haddr, fe->pmd, entry);
+
+	update_mmu_cache_pmd(vma, haddr, fe->pmd);
+
+	/* fault is handled */
+	ret = 0;
+out:
+	spin_unlock(fe->ptl);
+	return ret;
+}
+
 /**
  * do_set_pte - setup new PTE entry for given page and add reverse page mapping.
  *
@@ -2846,9 +2897,19 @@ int do_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
 	struct vm_area_struct *vma = fe->vma;
 	bool write = fe->flags & FAULT_FLAG_WRITE;
 	pte_t entry;
+	int ret;
+
+	if (pmd_none(*fe->pmd) && PageTransCompound(page)) {
+		/* THP on COW? */
+		VM_BUG_ON_PAGE(memcg, page);
+
+		ret = do_set_pmd(fe, page);
+		if (ret != VM_FAULT_FALLBACK)
+			return ret;
+	}
 
 	if (!fe->pte) {
-		int ret = pte_alloc_one_map(fe);
+		ret = pte_alloc_one_map(fe);
 		if (ret)
 			return ret;
 	}
-- 
2.7.0


* [PATCHv2 11/28] mm, rmap: account file thp pages
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (9 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 10/28] mm: introduce do_set_pmd() Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 12/28] thp, vmstats: add counters for huge file pages Kirill A. Shutemov
                   ` (16 subsequent siblings)
  27 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

Let's add a FileHugePages field to meminfo.

NR_ANON_TRANSPARENT_HUGEPAGES is renamed to NR_ANON_THPS.
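
One detail worth spelling out: the new NR_FILE_THP_MAPPED counter counts
huge pages, not base pages, which is why the meminfo printout below
multiplies by HPAGE_PMD_NR before the usual K() conversion to kilobytes.
A quick arithmetic check (4k base pages and 2M huge pages assumed, counter
value hypothetical):

#include <stdio.h>

#define PAGE_SHIFT	12
#define HPAGE_PMD_NR	(1UL << (21 - PAGE_SHIFT))	/* 512 */
#define K(x)		((x) << (PAGE_SHIFT - 10))	/* pages -> kB */

int main(void)
{
	unsigned long nr_file_thp_mapped = 3;	/* hypothetical counter value */

	/* 3 mapped file THPs = 3 * 512 pages * 4kB = 6144 kB */
	printf("FileHugePages:  %8lu kB\n",
	       K(nr_file_thp_mapped * HPAGE_PMD_NR));
	return 0;
}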

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/base/node.c    | 10 ++++++----
 fs/proc/meminfo.c      |  5 +++--
 include/linux/mmzone.h |  3 ++-
 mm/huge_memory.c       |  2 +-
 mm/rmap.c              | 12 ++++++------
 mm/vmstat.c            |  1 +
 6 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 560751bad294..57b6c1ed0330 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -113,6 +113,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       "Node %d SUnreclaim:     %8lu kB\n"
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		       "Node %d AnonHugePages:  %8lu kB\n"
+		       "Node %d FileHugePages:  %8lu kB\n"
 #endif
 			,
 		       nid, K(node_page_state(nid, NR_FILE_DIRTY)),
@@ -131,10 +132,11 @@ static ssize_t node_read_meminfo(struct device *dev,
 				node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
 		       nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE)),
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
-			, nid,
-			K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
-			HPAGE_PMD_NR));
+		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
+		       nid, K(node_page_state(nid, NR_ANON_THPS) *
+				       HPAGE_PMD_NR),
+		       nid, K(node_page_state(nid, NR_FILE_THP_MAPPED) *
+				       HPAGE_PMD_NR));
 #else
 		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
 #endif
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index df4661abadc4..df07256c71a7 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -134,6 +134,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		"AnonHugePages:  %8lu kB\n"
+		"FileHugePages:  %8lu kB\n"
 #endif
 #ifdef CONFIG_CMA
 		"CmaTotal:       %8lu kB\n"
@@ -191,8 +192,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 		, atomic_long_read(&num_poisoned_pages) << (PAGE_SHIFT - 10)
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		, K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
-		   HPAGE_PMD_NR)
+		, K(global_page_state(NR_ANON_THPS) * HPAGE_PMD_NR)
+		, K(global_page_state(NR_FILE_THP_MAPPED) * HPAGE_PMD_NR)
 #endif
 #ifdef CONFIG_CMA
 		, K(totalcma_pages)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 03cbdd906f55..a29a87ab05a4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -158,7 +158,8 @@ enum zone_stat_item {
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
 	WORKINGSET_NODERECLAIM,
-	NR_ANON_TRANSPARENT_HUGEPAGES,
+	NR_ANON_THPS,
+	NR_FILE_THP_MAPPED,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9d614cee994f..732bda42ca80 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2946,7 +2946,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 	if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
 		/* Last compound_mapcount is gone. */
-		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+		__dec_zone_page_state(page, NR_ANON_THPS);
 		if (TestClearPageDoubleMap(page)) {
 			/* No need in mapcount reference anymore */
 			for (i = 0; i < HPAGE_PMD_NR; i++)
diff --git a/mm/rmap.c b/mm/rmap.c
index b550bf637ce3..765e001836dc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1227,10 +1227,8 @@ void do_page_add_anon_rmap(struct page *page,
 		 * pte lock(a spinlock) is held, which implies preemption
 		 * disabled.
 		 */
-		if (compound) {
-			__inc_zone_page_state(page,
-					      NR_ANON_TRANSPARENT_HUGEPAGES);
-		}
+		if (compound)
+			__inc_zone_page_state(page, NR_ANON_THPS);
 		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
 	}
 	if (unlikely(PageKsm(page)))
@@ -1268,7 +1266,7 @@ void page_add_new_anon_rmap(struct page *page,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		/* increment count (starts at -1) */
 		atomic_set(compound_mapcount_ptr(page), 0);
-		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+		__inc_zone_page_state(page, NR_ANON_THPS);
 	} else {
 		/* Anon THP always mapped first with PMD */
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -1298,6 +1296,7 @@ void page_add_file_rmap(struct page *page, bool compound)
 		}
 		if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
 			goto out;
+		__inc_zone_page_state(page, NR_FILE_THP_MAPPED);
 	} else {
 		if (!atomic_inc_and_test(&page->_mapcount))
 			goto out;
@@ -1330,6 +1329,7 @@ static void page_remove_file_rmap(struct page *page, bool compound)
 		}
 		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
 			goto out;
+		__dec_zone_page_state(page, NR_FILE_THP_MAPPED);
 	} else {
 		if (!atomic_add_negative(-1, &page->_mapcount))
 			goto out;
@@ -1363,7 +1363,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
 		return;
 
-	__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+	__dec_zone_page_state(page, NR_ANON_THPS);
 
 	if (TestClearPageDoubleMap(page)) {
 		/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 05c6ba2534fe..801e6b18fb94 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -762,6 +762,7 @@ const char * const vmstat_text[] = {
 	"workingset_activate",
 	"workingset_nodereclaim",
 	"nr_anon_transparent_hugepages",
+	"nr_file_transparent_hugepages",
 	"nr_free_cma",
 
 	/* enum writeback_stat_item counters */
-- 
2.7.0


* [PATCHv2 12/28] thp, vmstats: add counters for huge file pages
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (10 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 11/28] mm, rmap: account file thp pages Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 13/28] thp: support file pages in zap_huge_pmd() Kirill A. Shutemov
                   ` (15 subsequent siblings)
  27 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

THP_FILE_ALLOC: how many times a huge page was allocated and added to the
page cache.

THP_FILE_MAPPED: how many times a file huge page was mapped.
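
For reference, a minimal userspace reader for the two new events, assuming
a kernel with this series applied (on other kernels the lines are simply
absent and nothing is printed):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "thp_file_alloc ", 15) ||
		    !strncmp(line, "thp_file_mapped ", 16))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}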

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/vm_event_item.h | 7 +++++++
 mm/memory.c                   | 1 +
 mm/vmstat.c                   | 2 ++
 3 files changed, 10 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index b79e831006b0..8359022f6ea1 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -69,6 +69,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_FAULT_FALLBACK,
 		THP_COLLAPSE_ALLOC,
 		THP_COLLAPSE_ALLOC_FAILED,
+		THP_FILE_ALLOC,
+		THP_FILE_MAPPED,
 		THP_SPLIT_PAGE,
 		THP_SPLIT_PAGE_FAILED,
 		THP_DEFERRED_SPLIT_PAGE,
@@ -99,4 +101,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		NR_VM_EVENT_ITEMS
 };
 
+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
+#define THP_FILE_ALLOC ({ BUILD_BUG(); 0; })
+#define THP_FILE_MAPPED ({ BUILD_BUG(); 0; })
+#endif
+
 #endif		/* VM_EVENT_ITEM_H_INCLUDED */
diff --git a/mm/memory.c b/mm/memory.c
index fb61e82bbb9a..6c98ed8e3c4a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2874,6 +2874,7 @@ static int do_set_pmd(struct fault_env *fe, struct page *page)
 
 	/* fault is handled */
 	ret = 0;
+	count_vm_event(THP_FILE_MAPPED);
 out:
 	spin_unlock(fe->ptl);
 	return ret;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 801e6b18fb94..e69031a3b306 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -846,6 +846,8 @@ const char * const vmstat_text[] = {
 	"thp_fault_fallback",
 	"thp_collapse_alloc",
 	"thp_collapse_alloc_failed",
+	"thp_file_alloc",
+	"thp_file_mapped",
 	"thp_split_page",
 	"thp_split_page_failed",
 	"thp_deferred_split_page",
-- 
2.7.0


* [PATCHv2 13/28] thp: support file pages in zap_huge_pmd()
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (11 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 12/28] thp, vmstats: add counters for huge file pages Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-12 18:33   ` Dave Hansen
  2016-02-11 14:21 ` [PATCHv2 14/28] thp: handle file pages in split_huge_pmd() Kirill A. Shutemov
                   ` (14 subsequent siblings)
  27 siblings, 1 reply; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

For file pages we don't deposit a page table on mapping, so there is no
need to withdraw one here.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 732bda42ca80..8fd5a3c58353 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1654,10 +1654,16 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		struct page *page = pmd_page(orig_pmd);
 		page_remove_rmap(page, true);
 		VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
-		add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 		VM_BUG_ON_PAGE(!PageHead(page), page);
-		pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd));
-		atomic_long_dec(&tlb->mm->nr_ptes);
+		if (PageAnon(page)) {
+			pgtable_t pgtable;
+			pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
+			pte_free(tlb->mm, pgtable);
+			atomic_long_dec(&tlb->mm->nr_ptes);
+			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+		} else {
+			add_mm_counter(tlb->mm, MM_FILEPAGES, -HPAGE_PMD_NR);
+		}
 		spin_unlock(ptl);
 		tlb_remove_page(tlb, page);
 	}
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCHv2 14/28] thp: handle file pages in split_huge_pmd()
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (12 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 13/28] thp: support file pages in zap_huge_pmd() Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 15/28] thp: handle file COW faults Kirill A. Shutemov
                   ` (13 subsequent siblings)
  27 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

Splitting a file THP PMD is simple: just unmap it, as in the DAX case.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8fd5a3c58353..4da4e915af61 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2900,10 +2900,16 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 	count_vm_event(THP_SPLIT_PMD);
 
-	if (vma_is_dax(vma)) {
-		pmd_t _pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+	if (!vma_is_anonymous(vma)) {
+		_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
 		if (is_huge_zero_pmd(_pmd))
 			put_huge_zero_page();
+		if (vma_is_dax(vma))
+			return;
+		page = pmd_page(_pmd);
+		page_remove_rmap(page, true);
+		put_page(page);
+		add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR);
 		return;
 	} else if (is_huge_zero_pmd(*pmd)) {
 		return __split_huge_zero_page_pmd(vma, haddr, pmd);
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCHv2 15/28] thp: handle file COW faults
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (13 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 14/28] thp: handle file pages in split_huge_pmd() Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-12 18:36   ` Dave Hansen
  2016-02-11 14:21 ` [PATCHv2 16/28] thp: handle file pages in mremap() Kirill A. Shutemov
                   ` (12 subsequent siblings)
  27 siblings, 1 reply; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

COW for file THP is handled on the pte level: just split the pmd.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/memory.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 6c98ed8e3c4a..19eff2164e5b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3334,6 +3334,11 @@ static int wp_huge_pmd(struct fault_env *fe, pmd_t orig_pmd)
 	if (fe->vma->vm_ops->pmd_fault)
 		return fe->vma->vm_ops->pmd_fault(fe->vma, fe->address, fe->pmd,
 				fe->flags);
+
+	/* COW handled on pte level: split pmd */
+	VM_BUG_ON_VMA(fe->vma->vm_flags & VM_SHARED, fe->vma);
+	split_huge_pmd(fe->vma, fe->pmd, fe->address);
+
 	return VM_FAULT_FALLBACK;
 }
 
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCHv2 16/28] thp: handle file pages in mremap()
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (14 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 15/28] thp: handle file COW faults Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 17/28] thp: skip file huge pmd on copy_huge_pmd() Kirill A. Shutemov
                   ` (11 subsequent siblings)
  27 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

We need to mirror the need_rmap_locks logic of move_ptes() to get proper
serialization for file THP.

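The hunk below open-codes the locking; factored into helpers (hypothetical
names, not part of this patch) the rule mirrored from move_ptes() is: take
i_mmap_rwsem before the anon_vma lock, and release in reverse order:

  static void take_rmap_locks(struct vm_area_struct *vma)
  {
          if (vma->vm_file)
                  i_mmap_lock_write(vma->vm_file->f_mapping);
          if (vma->anon_vma)
                  anon_vma_lock_write(vma->anon_vma);
  }

  static void drop_rmap_locks(struct vm_area_struct *vma)
  {
          if (vma->anon_vma)
                  anon_vma_unlock_write(vma->anon_vma);
          if (vma->vm_file)
                  i_mmap_unlock_write(vma->vm_file->f_mapping);
  }
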
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/mremap.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 8eeba02fc991..b43027a25982 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -193,17 +193,27 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			break;
 		if (pmd_trans_huge(*old_pmd)) {
 			if (extent == HPAGE_PMD_SIZE) {
+				struct address_space *mapping = NULL;
+				struct anon_vma *anon_vma = NULL;
 				bool moved;
-				VM_BUG_ON_VMA(vma->vm_file || !vma->anon_vma,
-					      vma);
 				/* See comment in move_ptes() */
-				if (need_rmap_locks)
-					anon_vma_lock_write(vma->anon_vma);
+				if (need_rmap_locks) {
+					if (vma->vm_file) {
+						mapping = vma->vm_file->f_mapping;
+						i_mmap_lock_write(mapping);
+					}
+					if (vma->anon_vma) {
+						anon_vma = vma->anon_vma;
+						anon_vma_lock_write(anon_vma);
+					}
+				}
 				moved = move_huge_pmd(vma, new_vma, old_addr,
 						    new_addr, old_end,
 						    old_pmd, new_pmd);
-				if (need_rmap_locks)
-					anon_vma_unlock_write(vma->anon_vma);
+				if (anon_vma)
+					anon_vma_unlock_write(anon_vma);
+				if (mapping)
+					i_mmap_unlock_write(mapping);
 				if (moved) {
 					need_flush = true;
 					continue;
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCHv2 17/28] thp: skip file huge pmd on copy_huge_pmd()
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (15 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 16/28] thp: handle file pages in mremap() Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-12 18:42   ` Dave Hansen
  2016-02-11 14:21 ` [PATCHv2 18/28] thp: prepare change_huge_pmd() for file thp Kirill A. Shutemov
                   ` (10 subsequent siblings)
  27 siblings, 1 reply; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

File pmds can be safely skipped on copy_huge_pmd(): we can re-fault them
later. COW for file mappings is handled on the pte level.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c | 34 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4da4e915af61..00f10d323039 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1052,14 +1052,15 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	struct page *src_page;
 	pmd_t pmd;
 	pgtable_t pgtable = NULL;
-	int ret;
+	int ret = -ENOMEM;
 
-	if (!vma_is_dax(vma)) {
-		ret = -ENOMEM;
-		pgtable = pte_alloc_one(dst_mm, addr);
-		if (unlikely(!pgtable))
-			goto out;
-	}
+	/* Skip if can be re-fill on fault */
+	if (!vma_is_anonymous(vma))
+		return 0;
+
+	pgtable = pte_alloc_one(dst_mm, addr);
+	if (unlikely(!pgtable))
+		goto out;
 
 	dst_ptl = pmd_lock(dst_mm, dst_pmd);
 	src_ptl = pmd_lockptr(src_mm, src_pmd);
@@ -1067,7 +1068,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 	ret = -EAGAIN;
 	pmd = *src_pmd;
-	if (unlikely(!pmd_trans_huge(pmd) && !pmd_devmap(pmd))) {
+	if (unlikely(!pmd_trans_huge(pmd))) {
 		pte_free(dst_mm, pgtable);
 		goto out_unlock;
 	}
@@ -1090,16 +1091,13 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		goto out_unlock;
 	}
 
-	if (!vma_is_dax(vma)) {
-		/* thp accounting separate from pmd_devmap accounting */
-		src_page = pmd_page(pmd);
-		VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
-		get_page(src_page);
-		page_dup_rmap(src_page, true);
-		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
-		atomic_long_inc(&dst_mm->nr_ptes);
-		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
-	}
+	src_page = pmd_page(pmd);
+	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+	get_page(src_page);
+	page_dup_rmap(src_page, true);
+	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	atomic_long_inc(&dst_mm->nr_ptes);
+	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
 	pmd = pmd_mkold(pmd_wrprotect(pmd));
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCHv2 18/28] thp: prepare change_huge_pmd() for file thp
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (16 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 17/28] thp: skip file huge pmd on copy_huge_pmd() Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-12 18:48   ` Dave Hansen
  2016-02-11 14:21 ` [PATCHv2 19/28] thp: run vma_adjust_trans_huge() outside i_mmap_rwsem Kirill A. Shutemov
                   ` (9 subsequent siblings)
  27 siblings, 1 reply; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

change_huge_pmd() has an assert which is not relevant for file pages.
For a shared mapping it's perfectly fine to have the page table entry
writable without an explicit mkwrite.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 00f10d323039..8e2d84698c15 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1756,7 +1756,6 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 				entry = pmd_mkwrite(entry);
 			ret = HPAGE_PMD_NR;
 			set_pmd_at(mm, addr, pmd, entry);
-			BUG_ON(!preserve_write && pmd_write(entry));
 		}
 		spin_unlock(ptl);
 	}
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCHv2 19/28] thp: run vma_adjust_trans_huge() outside i_mmap_rwsem
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (17 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 18/28] thp: prepare change_huge_pmd() for file thp Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-12 18:50   ` Dave Hansen
  2016-02-16 15:49   ` Dave Hansen
  2016-02-11 14:21 ` [PATCHv2 20/28] thp: file pages support for split_huge_page() Kirill A. Shutemov
                   ` (8 subsequent siblings)
  27 siblings, 2 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

vma_adjust_trans_huge() splits the pmd if it crosses a VMA boundary.
During the split we munlock the huge page, which requires an rmap walk,
and the rmap walk wants to take i_mmap_rwsem on its own.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/mmap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 2f2415a7a688..c9d0c412b6dd 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -802,6 +802,8 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 	}
 
+	vma_adjust_trans_huge(vma, start, end, adjust_next);
+
 	if (file) {
 		mapping = file->f_mapping;
 		root = &mapping->i_mmap;
@@ -822,8 +824,6 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 	}
 
-	vma_adjust_trans_huge(vma, start, end, adjust_next);
-
 	anon_vma = vma->anon_vma;
 	if (!anon_vma && adjust_next)
 		anon_vma = next->anon_vma;
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCHv2 20/28] thp: file pages support for split_huge_page()
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (18 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 19/28] thp: run vma_adjust_trans_huge() outside i_mmap_rwsem Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 21/28] vmscan: split file huge pages before paging them out Kirill A. Shutemov
                   ` (7 subsequent siblings)
  27 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

Basic scheme is the same as for anon THP.

Main differences:

  - File pages are on the radix-tree, so head->_count is offset by
    HPAGE_PMD_NR; that count gets distributed to the small pages during
    split (see the accounting sketch after this list).

  - mapping->tree_lock prevents non-lockless access to pages under split
    over the radix-tree.

  - Lockless access is prevented by setting head->_count to 0 during
    split.

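The pin accounting the split relies on, spelled out (a worked example,
not part of the patch):

  /*
   * Pins on a file THP head at split time:
   *
   *   page_count(head) == 1                       caller's reference
   *                     + HPAGE_PMD_NR            one per radix-tree slot
   *                     + total_mapcount(head)    page-table mappings
   *
   * extra_pins is therefore HPAGE_PMD_NR for file pages (0 for anon), and
   * once freeze_page() has brought the mapcount to zero the split can only
   * proceed if page_freeze_refs(head, 1 + extra_pins) succeeds.
   */
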
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/gup.c         |   2 +
 mm/huge_memory.c | 137 ++++++++++++++++++++++++++++++++++++++-----------------
 mm/mempolicy.c   |   2 +
 3 files changed, 100 insertions(+), 41 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 60f422a0af8b..76148816c0cd 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -285,6 +285,8 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 			ret = split_huge_page(page);
 			unlock_page(page);
 			put_page(page);
+			if (pmd_none(*pmd))
+				return no_page_table(vma, flags);
 		}
 
 		return ret ? ERR_PTR(ret) :
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8e2d84698c15..ca7f21516c3a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -29,6 +29,7 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/page_idle.h>
 #include <linux/swapops.h>
+#include <linux/shmem_fs.h>
 #include <linux/debugfs.h>
 
 #include <asm/tlb.h>
@@ -3096,7 +3097,7 @@ static void freeze_page(struct page *page)
 	ret = try_to_unmap(page, ttu_flags | TTU_SPLIT_HUGE_PMD);
 	for (i = 1; !ret && i < HPAGE_PMD_NR; i++)
 		ret = try_to_unmap(page + i, ttu_flags);
-	VM_BUG_ON(ret);
+	VM_BUG_ON_PAGE(ret, page + i - 1);
 }
 
 static void unfreeze_page(struct page *page)
@@ -3118,15 +3119,20 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	/*
 	 * tail_page->_count is zero and not changing from under us. But
 	 * get_page_unless_zero() may be running from under us on the
-	 * tail_page. If we used atomic_set() below instead of atomic_inc(), we
-	 * would then run atomic_set() concurrently with
+	 * tail_page. If we used atomic_set() below instead of atomic_inc() or
+	 * atomic_add(), we would then run atomic_set() concurrently with
 	 * get_page_unless_zero(), and atomic_set() is implemented in C not
 	 * using locked ops. spin_unlock on x86 sometime uses locked ops
 	 * because of PPro errata 66, 92, so unless somebody can guarantee
 	 * atomic_set() here would be safe on all archs (and not only on x86),
-	 * it's safer to use atomic_inc().
+	 * it's safer to use atomic_inc()/atomic_add().
 	 */
-	atomic_inc(&page_tail->_count);
+	if (PageAnon(head)) {
+		atomic_inc(&page_tail->_count);
+	} else {
+		/* Additional pin to radix tree */
+		atomic_add(2, &page_tail->_count);
+	}
 
 	page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	page_tail->flags |= (head->flags &
@@ -3162,15 +3168,14 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	lru_add_page_tail(head, page_tail, lruvec, list);
 }
 
-static void __split_huge_page(struct page *page, struct list_head *list)
+static void __split_huge_page(struct page *page, struct list_head *list,
+		unsigned long flags)
 {
 	struct page *head = compound_head(page);
 	struct zone *zone = page_zone(head);
 	struct lruvec *lruvec;
 	int i;
 
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irq(&zone->lru_lock);
 	lruvec = mem_cgroup_page_lruvec(head, zone);
 
 	/* complete memcg works before add pages to LRU */
@@ -3180,7 +3185,16 @@ static void __split_huge_page(struct page *page, struct list_head *list)
 		__split_huge_page_tail(head, i, lruvec, list);
 
 	ClearPageCompound(head);
-	spin_unlock_irq(&zone->lru_lock);
+	/* See comment in __split_huge_page_tail() */
+	if (PageAnon(head)) {
+		atomic_inc(&head->_count);
+	} else {
+		/* Additional pin to radix tree */
+		atomic_add(2, &head->_count);
+		spin_unlock(&head->mapping->tree_lock);
+	}
+
+	spin_unlock_irqrestore(&page_zone(head)->lru_lock, flags);
 
 	unfreeze_page(head);
 
@@ -3248,35 +3262,43 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	struct page *head = compound_head(page);
 	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
 	struct anon_vma *anon_vma;
-	int count, mapcount, ret;
+	int count, mapcount, extra_pins, ret;
 	bool mlocked;
 	unsigned long flags;
 
 	VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
-	VM_BUG_ON_PAGE(!PageAnon(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
-	/*
-	 * The caller does not necessarily hold an mmap_sem that would prevent
-	 * the anon_vma disappearing so we first we take a reference to it
-	 * and then lock the anon_vma for write. This is similar to
-	 * page_lock_anon_vma_read except the write lock is taken to serialise
-	 * against parallel split or collapse operations.
-	 */
-	anon_vma = page_get_anon_vma(head);
-	if (!anon_vma) {
-		ret = -EBUSY;
-		goto out;
+	if (PageAnon(head)) {
+		extra_pins = 0;
+		/*
+		 * The caller does not necessarily hold an mmap_sem that would
+		 * prevent the anon_vma disappearing so we first we take a
+		 * reference to it and then lock the anon_vma for write. This
+		 * is similar to page_lock_anon_vma_read except the write lock
+		 * is taken to serialise against parallel split or collapse
+		 * operations.
+		 */
+		anon_vma = page_get_anon_vma(head);
+		if (!anon_vma) {
+			ret = -EBUSY;
+			goto out;
+		}
+		anon_vma_lock_write(anon_vma);
+	} else {
+		/* Addidional pins from radix tree */
+		extra_pins = HPAGE_PMD_NR;
+		i_mmap_lock_read(head->mapping);
+		anon_vma = NULL;
 	}
-	anon_vma_lock_write(anon_vma);
 
 	/*
 	 * Racy check if we can split the page, before freeze_page() will
 	 * split PMDs
 	 */
-	if (total_mapcount(head) != page_count(head) - 1) {
+	if (total_mapcount(head) != page_count(head) - extra_pins - 1) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
@@ -3289,35 +3311,69 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	if (mlocked)
 		lru_add_drain();
 
+	/* prevent PageLRU to go away from under us, and freeze lru stats */
+	spin_lock_irqsave(&page_zone(head)->lru_lock, flags);
+
+	if (!anon_vma) {
+		void **pslot;
+
+		spin_lock(&head->mapping->tree_lock);
+		pslot = radix_tree_lookup_slot(&head->mapping->page_tree,
+				page_index(head));
+		/*
+		 * Check if the head page is present in radix tree.
+		 * We assume all tail are present too, if head is there.
+		 */
+		if (radix_tree_deref_slot_protected(pslot,
+					&head->mapping->tree_lock) != head)
+			goto fail;
+	}
+
 	/* Prevent deferred_split_scan() touching ->_count */
-	spin_lock_irqsave(&pgdata->split_queue_lock, flags);
+	spin_lock(&pgdata->split_queue_lock);
 	count = page_count(head);
 	mapcount = total_mapcount(head);
-	if (!mapcount && count == 1) {
+	if (!mapcount && page_freeze_refs(head, 1 + extra_pins)) {
 		if (!list_empty(page_deferred_list(head))) {
 			pgdata->split_queue_len--;
 			list_del(page_deferred_list(head));
 		}
-		spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
-		__split_huge_page(page, list);
+		spin_unlock(&pgdata->split_queue_lock);
+		__split_huge_page(page, list, flags);
 		ret = 0;
-	} else if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
-		spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
-		pr_alert("total_mapcount: %u, page_count(): %u\n",
-				mapcount, count);
-		if (PageTail(page))
-			dump_page(head, NULL);
-		dump_page(page, "total_mapcount(head) > 0");
-		BUG();
 	} else {
-		spin_unlock_irqrestore(&pgdata->split_queue_lock, flags);
+		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
+			pr_alert("total_mapcount: %u, page_count(): %u\n",
+					mapcount, count);
+			if (PageTail(page))
+				dump_page(head, NULL);
+			dump_page(page, "total_mapcount(head) > 0");
+			BUG();
+		}
+		spin_unlock(&pgdata->split_queue_lock);
+fail:		if (!anon_vma)
+			spin_unlock(&head->mapping->tree_lock);
+		spin_unlock_irqrestore(&page_zone(head)->lru_lock, flags);
 		unfreeze_page(head);
 		ret = -EBUSY;
 	}
 
 out_unlock:
-	anon_vma_unlock_write(anon_vma);
-	put_anon_vma(anon_vma);
+	if (anon_vma) {
+		anon_vma_unlock_write(anon_vma);
+		put_anon_vma(anon_vma);
+	} else {
+		struct inode *inode = head->mapping->host;
+		i_mmap_unlock_read(head->mapping);
+
+		/* After split, some pages can be beyond i_size.
+		 * We need to drop them.
+		 *
+		 * TODO: Find generic solution.
+		 */
+		unmap_mapping_range(inode->i_mapping, inode->i_size, 0, 1);
+		shmem_truncate_range(inode, inode->i_size, (loff_t)-1);
+	}
 out:
 	count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
 	return ret;
@@ -3440,8 +3496,7 @@ static int split_huge_pages_set(void *data, u64 val)
 			if (zone != page_zone(page))
 				goto next;
 
-			if (!PageHead(page) || !PageAnon(page) ||
-					PageHuge(page))
+			if (!PageHead(page) || PageHuge(page) || !PageLRU(page))
 				goto next;
 
 			total++;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8c5fd08c253c..5742271a026d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -515,6 +515,8 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 		}
 	}
 
+	if (pmd_none(*pmd))
+		return 0;
 retry:
 	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCHv2 21/28] vmscan: split file huge pages before paging them out
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (19 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 20/28] thp: file pages support for split_huge_page() Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 22/28] page-flags: relax policy for PG_mappedtodisk and PG_reclaim Kirill A. Shutemov
                   ` (6 subsequent siblings)
  27 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

This prepares vmscan for file huge pages: we cannot write out huge pages,
so we need to split them on the way out.

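For a file THP the freeable check now expects (worked example, not part
of the patch):

  /*
   *   page_count(page) == 1                       caller (page isolation)
   *                     + HPAGE_PMD_NR            radix-tree references
   *                     + page_has_private(page)  buffer heads, if any
   *
   * hence "page_count(page) - page_has_private(page) == 1 + radix_tree_pins"
   * with radix_tree_pins == HPAGE_PMD_NR for huge pages and 1 otherwise.
   */
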
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/vmscan.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 18b3767136f4..ffd8df7275aa 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -468,12 +468,14 @@ void drop_slab(void)
 
 static inline int is_page_cache_freeable(struct page *page)
 {
+	int radix_tree_pins = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
+
 	/*
 	 * A freeable page cache page is referenced only by the caller
 	 * that isolated the page, the page cache radix tree and
 	 * optional buffer heads at page->private.
 	 */
-	return page_count(page) - page_has_private(page) == 2;
+	return page_count(page) - page_has_private(page) == 1 + radix_tree_pins;
 }
 
 static int may_write_to_inode(struct inode *inode, struct scan_control *sc)
@@ -543,8 +545,6 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 	 * swap_backing_dev_info is bust: it doesn't reflect the
 	 * congestion state of the swapdevs.  Easy to fix, if needed.
 	 */
-	if (!is_page_cache_freeable(page))
-		return PAGE_KEEP;
 	if (!mapping) {
 		/*
 		 * Some data journaling orphaned pages can have
@@ -1107,6 +1107,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			 * starts and then write it out here.
 			 */
 			try_to_unmap_flush_dirty();
+
+			if (!is_page_cache_freeable(page))
+				goto keep_locked;
+
+			if (unlikely(PageTransHuge(page))) {
+				if (split_huge_page_to_list(page, page_list))
+					goto keep_locked;
+			}
+
 			switch (pageout(page, mapping, sc)) {
 			case PAGE_KEEP:
 				goto keep_locked;
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCHv2 22/28] page-flags: relax policy for PG_mappedtodisk and PG_reclaim
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (20 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 21/28] vmscan: split file huge pages before paging them out Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 23/28] radix-tree: implement radix_tree_maybe_preload_order() Kirill A. Shutemov
                   ` (5 subsequent siblings)
  27 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

These flags are in use for file THP.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/page-flags.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 19724e6ebd26..d0dfe4a5be33 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -292,11 +292,11 @@ PAGEFLAG(OwnerPriv1, owner_priv_1, PF_ANY)
  */
 TESTPAGEFLAG(Writeback, writeback, PF_NO_COMPOUND)
 	TESTSCFLAG(Writeback, writeback, PF_NO_COMPOUND)
-PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_COMPOUND)
+PAGEFLAG(MappedToDisk, mappedtodisk, PF_NO_TAIL)
 
 /* PG_readahead is only used for reads; PG_reclaim is only for writes */
-PAGEFLAG(Reclaim, reclaim, PF_NO_COMPOUND)
-	TESTCLEARFLAG(Reclaim, reclaim, PF_NO_COMPOUND)
+PAGEFLAG(Reclaim, reclaim, PF_NO_TAIL)
+	TESTCLEARFLAG(Reclaim, reclaim, PF_NO_TAIL)
 PAGEFLAG(Readahead, reclaim, PF_NO_COMPOUND)
 	TESTCLEARFLAG(Readahead, reclaim, PF_NO_COMPOUND)
 
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCHv2 23/28] radix-tree: implement radix_tree_maybe_preload_order()
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (21 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 22/28] page-flags: relax policy for PG_mappedtodisk and PG_reclaim Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 24/28] filemap: prepare find and delete operations for huge pages Kirill A. Shutemov
                   ` (4 subsequent siblings)
  27 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

The new helper is similar to radix_tree_maybe_preload(), but tries to
preload the number of nodes required to insert (1 << order) contiguous,
naturally-aligned elements.

This is required to push huge pages into the page cache.

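A hypothetical caller (shmem's real use comes later in the series) would
pair it with radix_tree_preload_end(), just like the existing preload
helpers; for instance:

  /* sketch: reserve enough nodes before inserting a huge page's slots */
  error = radix_tree_maybe_preload_order(gfp & GFP_RECLAIM_MASK,
                                         HPAGE_PMD_ORDER);
  if (error)
          return error;

  spin_lock_irq(&mapping->tree_lock);
  /* ... insert HPAGE_PMD_NR naturally-aligned entries here ... */
  spin_unlock_irq(&mapping->tree_lock);

  radix_tree_preload_end();
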
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/radix-tree.h |  1 +
 lib/radix-tree.c           | 70 ++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 63 insertions(+), 8 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 32623d26b62a..20b626160430 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -288,6 +288,7 @@ unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
 			unsigned long first_index, unsigned int max_items);
 int radix_tree_preload(gfp_t gfp_mask);
 int radix_tree_maybe_preload(gfp_t gfp_mask);
+int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
 			unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 6b79e9026e24..197afed56e5d 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -42,6 +42,9 @@
  */
 static unsigned long height_to_maxindex[RADIX_TREE_MAX_PATH + 1] __read_mostly;
 
+/* Number of nodes in fully populated tree of given height */
+static unsigned long height_to_maxnodes[RADIX_TREE_MAX_PATH + 1] __read_mostly;
+
 /*
  * Radix tree node cache.
  */
@@ -251,7 +254,7 @@ radix_tree_node_free(struct radix_tree_node *node)
  * To make use of this facility, the radix tree must be initialised without
  * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
  */
-static int __radix_tree_preload(gfp_t gfp_mask)
+static int __radix_tree_preload(gfp_t gfp_mask, int nr)
 {
 	struct radix_tree_preload *rtp;
 	struct radix_tree_node *node;
@@ -259,14 +262,14 @@ static int __radix_tree_preload(gfp_t gfp_mask)
 
 	preempt_disable();
 	rtp = this_cpu_ptr(&radix_tree_preloads);
-	while (rtp->nr < RADIX_TREE_PRELOAD_SIZE) {
+	while (rtp->nr < nr) {
 		preempt_enable();
 		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
 		if (node == NULL)
 			goto out;
 		preempt_disable();
 		rtp = this_cpu_ptr(&radix_tree_preloads);
-		if (rtp->nr < RADIX_TREE_PRELOAD_SIZE) {
+		if (rtp->nr < nr) {
 			node->private_data = rtp->nodes;
 			rtp->nodes = node;
 			rtp->nr++;
@@ -292,7 +295,7 @@ int radix_tree_preload(gfp_t gfp_mask)
 {
 	/* Warn on non-sensical use... */
 	WARN_ON_ONCE(!gfpflags_allow_blocking(gfp_mask));
-	return __radix_tree_preload(gfp_mask);
+	return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE);
 }
 EXPORT_SYMBOL(radix_tree_preload);
 
@@ -304,7 +307,7 @@ EXPORT_SYMBOL(radix_tree_preload);
 int radix_tree_maybe_preload(gfp_t gfp_mask)
 {
 	if (gfpflags_allow_blocking(gfp_mask))
-		return __radix_tree_preload(gfp_mask);
+		return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE);
 	/* Preloading doesn't help anything with this gfp mask, skip it */
 	preempt_disable();
 	return 0;
@@ -312,6 +315,52 @@ int radix_tree_maybe_preload(gfp_t gfp_mask)
 EXPORT_SYMBOL(radix_tree_maybe_preload);
 
 /*
+ * The same as function above, but preload number of nodes required to insert
+ * (1 << order) continuous naturally-aligned elements.
+ */
+int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order)
+{
+	unsigned long nr_subtrees;
+	int nr_nodes, subtree_height;
+
+	/* Preloading doesn't help anything with this gfp mask, skip it */
+	if (!gfpflags_allow_blocking(gfp_mask)) {
+		preempt_disable();
+		return 0;
+	}
+
+
+	/*
+	 * Calculate number and height of fully populated subtrees it takes to
+	 * store (1 << order) elements.
+	 */
+	nr_subtrees = 1 << order;
+	for (subtree_height = 0; nr_subtrees > RADIX_TREE_MAP_SIZE;
+			subtree_height++)
+		nr_subtrees >>= RADIX_TREE_MAP_SHIFT;
+
+	/*
+	 * The worst case is zero height tree with a single item at index 0 and
+	 * then inserting items starting at ULONG_MAX - (1 << order).
+	 *
+	 * This requires RADIX_TREE_MAX_PATH nodes to build branch from root to
+	 * 0-index item.
+	 */
+	nr_nodes = RADIX_TREE_MAX_PATH;
+
+	/* Plus branch to fully populated subtrees. */
+	nr_nodes += RADIX_TREE_MAX_PATH - subtree_height;
+
+	/* Root node is shared. */
+	nr_nodes--;
+
+	/* Plus nodes required to build subtrees. */
+	nr_nodes += nr_subtrees * height_to_maxnodes[subtree_height];
+
+	return __radix_tree_preload(gfp_mask, nr_nodes);
+}
+
+/*
  *	Return the maximum key which can be store into a
  *	radix tree with height HEIGHT.
  */
@@ -1462,12 +1511,17 @@ static __init unsigned long __maxindex(unsigned int height)
 	return ~0UL >> shift;
 }
 
-static __init void radix_tree_init_maxindex(void)
+static __init void radix_tree_init_arrays(void)
 {
-	unsigned int i;
+	unsigned int i, j;
 
 	for (i = 0; i < ARRAY_SIZE(height_to_maxindex); i++)
 		height_to_maxindex[i] = __maxindex(i);
+	for (i = 0; i < ARRAY_SIZE(height_to_maxnodes); i++) {
+		for (j = i; j > 0; j--)
+			height_to_maxnodes[i] += height_to_maxindex[j - 1] + 1;
+	}
+
 }
 
 static int radix_tree_callback(struct notifier_block *nfb,
@@ -1497,6 +1551,6 @@ void __init radix_tree_init(void)
 			sizeof(struct radix_tree_node), 0,
 			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
 			radix_tree_node_ctor);
-	radix_tree_init_maxindex();
+	radix_tree_init_arrays();
 	hotcpu_notifier(radix_tree_callback, 0);
 }
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCHv2 24/28] filemap: prepare find and delete operations for huge pages
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (22 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 23/28] radix-tree: implement radix_tree_maybe_preload_order() Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 25/28] truncate: handle file thp Kirill A. Shutemov
                   ` (3 subsequent siblings)
  27 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

For now, we have HPAGE_PMD_NR entries in the radix tree for every huge
page. That's suboptimal; it will be changed to use Matthew's multi-order
entries later.

The 'add' operation is not changed, because we don't need it to implement
huge tmpfs: shmem uses its own implementation.

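The recurring change in the hunks below is the lockless lookup pattern,
extended to take the speculative reference on the compound head and to
retry if the huge page got split under us; schematically:

  head = compound_head(page);
  if (!page_cache_get_speculative(head))
          goto repeat;                    /* head is being freed, retry */

  /* The page was split under us? */
  if (compound_head(page) != head) {
          page_cache_release(page);       /* drop the reference, retry */
          goto repeat;
  }

  /* Has the page moved?  Same recheck as before. */
  if (unlikely(page != *slot)) {
          page_cache_release(page);
          goto repeat;
  }
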
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c | 187 ++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 134 insertions(+), 53 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index ba8150d6dc33..d082ca524ba3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -110,43 +110,18 @@
  *   ->tasklist_lock            (memory_failure, collect_procs_ao)
  */
 
-static void page_cache_tree_delete(struct address_space *mapping,
-				   struct page *page, void *shadow)
+static void __page_cache_tree_delete(struct address_space *mapping,
+		struct radix_tree_node *node, void **slot, unsigned long index,
+		void *shadow)
 {
-	struct radix_tree_node *node;
-	unsigned long index;
-	unsigned int offset;
 	unsigned int tag;
-	void **slot;
-
-	VM_BUG_ON(!PageLocked(page));
 
-	__radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
-
-	if (shadow) {
-		mapping->nrexceptional++;
-		/*
-		 * Make sure the nrexceptional update is committed before
-		 * the nrpages update so that final truncate racing
-		 * with reclaim does not see both counters 0 at the
-		 * same time and miss a shadow entry.
-		 */
-		smp_wmb();
-	}
-	mapping->nrpages--;
-
-	if (!node) {
-		/* Clear direct pointer tags in root node */
-		mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
-		radix_tree_replace_slot(slot, shadow);
-		return;
-	}
+	VM_BUG_ON(node == NULL);
+	VM_BUG_ON(*slot == NULL);
 
 	/* Clear tree tags for the removed page */
-	index = page->index;
-	offset = index & RADIX_TREE_MAP_MASK;
 	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
-		if (test_bit(offset, node->tags[tag]))
+		if (test_bit(index & RADIX_TREE_MAP_MASK, node->tags[tag]))
 			radix_tree_tag_clear(&mapping->page_tree, index, tag);
 	}
 
@@ -173,6 +148,54 @@ static void page_cache_tree_delete(struct address_space *mapping,
 	}
 }
 
+static void page_cache_tree_delete(struct address_space *mapping,
+				   struct page *page, void *shadow)
+{
+	struct radix_tree_node *node;
+	unsigned long index;
+	void **slot;
+	int i, nr = PageHuge(page) ? 1 : hpage_nr_pages(page);
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(PageTail(page), page);
+
+	__radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
+
+	if (shadow) {
+		mapping->nrexceptional += nr;
+		/*
+		 * Make sure the nrexceptional update is committed before
+		 * the nrpages update so that final truncate racing
+		 * with reclaim does not see both counters 0 at the
+		 * same time and miss a shadow entry.
+		 */
+		smp_wmb();
+	}
+	mapping->nrpages -= nr;
+
+	if (!node) {
+		/* Clear direct pointer tags in root node */
+		mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
+		VM_BUG_ON(nr != 1);
+		radix_tree_replace_slot(slot, shadow);
+		return;
+	}
+
+	index = page->index;
+	VM_BUG_ON_PAGE(index & (nr - 1), page);
+	for (i = 0; i < nr; i++) {
+		/* Cross node border */
+		if (i && ((index + i) & RADIX_TREE_MAP_MASK) == 0) {
+			__radix_tree_lookup(&mapping->page_tree,
+					page->index + i, &node, &slot);
+		}
+
+		__page_cache_tree_delete(mapping, node,
+				slot + (i & RADIX_TREE_MAP_MASK), index + i,
+				shadow);
+	}
+}
+
 /*
  * Delete a page from the page cache and free it. Caller has to make
  * sure the page is locked and that nobody else uses it - or that usage
@@ -181,6 +204,7 @@ static void page_cache_tree_delete(struct address_space *mapping,
 void __delete_from_page_cache(struct page *page, void *shadow)
 {
 	struct address_space *mapping = page->mapping;
+	int nr = hpage_nr_pages(page);
 
 	trace_mm_filemap_delete_from_page_cache(page);
 	/*
@@ -200,9 +224,10 @@ void __delete_from_page_cache(struct page *page, void *shadow)
 
 	/* hugetlb pages do not participate in page cache accounting. */
 	if (!PageHuge(page))
-		__dec_zone_page_state(page, NR_FILE_PAGES);
+		__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
 	if (PageSwapBacked(page))
-		__dec_zone_page_state(page, NR_SHMEM);
+		__mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
+	VM_BUG_ON_PAGE(PageTail(page), page);
 	VM_BUG_ON_PAGE(page_mapped(page), page);
 
 	/*
@@ -227,9 +252,8 @@ void __delete_from_page_cache(struct page *page, void *shadow)
  */
 void delete_from_page_cache(struct page *page)
 {
-	struct address_space *mapping = page->mapping;
+	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
-
 	void (*freepage)(struct page *);
 
 	BUG_ON(!PageLocked(page));
@@ -242,7 +266,13 @@ void delete_from_page_cache(struct page *page)
 
 	if (freepage)
 		freepage(page);
-	page_cache_release(page);
+
+	if (PageTransHuge(page) && !PageHuge(page)) {
+		atomic_sub(HPAGE_PMD_NR, &page->_count);
+		VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+	} else {
+		page_cache_release(page);
+	}
 }
 EXPORT_SYMBOL(delete_from_page_cache);
 
@@ -1035,7 +1065,7 @@ EXPORT_SYMBOL(page_cache_prev_hole);
 struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
 {
 	void **pagep;
-	struct page *page;
+	struct page *head, *page;
 
 	rcu_read_lock();
 repeat:
@@ -1055,9 +1085,17 @@ repeat:
 			 */
 			goto out;
 		}
-		if (!page_cache_get_speculative(page))
+
+		head = compound_head(page);
+		if (!page_cache_get_speculative(head))
 			goto repeat;
 
+		/* The page was split under us? */
+		if (compound_head(page) != head) {
+			page_cache_release(page);
+			goto repeat;
+		}
+
 		/*
 		 * Has the page moved?
 		 * This is part of the lockless pagecache protocol. See
@@ -1100,12 +1138,12 @@ repeat:
 	if (page && !radix_tree_exception(page)) {
 		lock_page(page);
 		/* Has the page been truncated? */
-		if (unlikely(page->mapping != mapping)) {
+		if (unlikely(page_mapping(page) != mapping)) {
 			unlock_page(page);
 			page_cache_release(page);
 			goto repeat;
 		}
-		VM_BUG_ON_PAGE(page->index != offset, page);
+		VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
 	}
 	return page;
 }
@@ -1238,7 +1276,7 @@ unsigned find_get_entries(struct address_space *mapping,
 	rcu_read_lock();
 restart:
 	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
-		struct page *page;
+		struct page *head, *page;
 repeat:
 		page = radix_tree_deref_slot(slot);
 		if (unlikely(!page))
@@ -1253,8 +1291,16 @@ repeat:
 			 */
 			goto export;
 		}
-		if (!page_cache_get_speculative(page))
+
+		head = compound_head(page);
+		if (!page_cache_get_speculative(head))
+			goto repeat;
+
+		/* The page was split under us? */
+		if (compound_head(page) != head) {
+			page_cache_release(page);
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *slot)) {
@@ -1300,7 +1346,7 @@ unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
 	rcu_read_lock();
 restart:
 	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
-		struct page *page;
+		struct page *head, *page;
 repeat:
 		page = radix_tree_deref_slot(slot);
 		if (unlikely(!page))
@@ -1324,9 +1370,16 @@ repeat:
 			continue;
 		}
 
-		if (!page_cache_get_speculative(page))
+		head = compound_head(page);
+		if (!page_cache_get_speculative(head))
 			goto repeat;
 
+		/* The page was split under us? */
+		if (compound_head(page) != head) {
+			page_cache_release(page);
+			goto repeat;
+		}
+
 		/* Has the page moved? */
 		if (unlikely(page != *slot)) {
 			page_cache_release(page);
@@ -1367,7 +1420,7 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t index,
 	rcu_read_lock();
 restart:
 	radix_tree_for_each_contig(slot, &mapping->page_tree, &iter, index) {
-		struct page *page;
+		struct page *head, *page;
 repeat:
 		page = radix_tree_deref_slot(slot);
 		/* The hole, there no reason to continue */
@@ -1391,8 +1444,14 @@ repeat:
 			break;
 		}
 
-		if (!page_cache_get_speculative(page))
+		head = compound_head(page);
+		if (!page_cache_get_speculative(head))
 			goto repeat;
+		/* The page was split under us? */
+		if (compound_head(page) != head) {
+			page_cache_release(page);
+			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *slot)) {
@@ -1405,7 +1464,7 @@ repeat:
 		 * otherwise we can get both false positives and false
 		 * negatives, which is just confusing to the caller.
 		 */
-		if (page->mapping == NULL || page->index != iter.index) {
+		if (page->mapping == NULL || page_to_pgoff(page) != iter.index) {
 			page_cache_release(page);
 			break;
 		}
@@ -1444,7 +1503,7 @@ unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 restart:
 	radix_tree_for_each_tagged(slot, &mapping->page_tree,
 				   &iter, *index, tag) {
-		struct page *page;
+		struct page *head, *page;
 repeat:
 		page = radix_tree_deref_slot(slot);
 		if (unlikely(!page))
@@ -1473,8 +1532,15 @@ repeat:
 			continue;
 		}
 
-		if (!page_cache_get_speculative(page))
+		head = compound_head(page);
+		if (!page_cache_get_speculative(head))
+			goto repeat;
+
+		/* The page was split under us? */
+		if (compound_head(page) != head) {
+			page_cache_release(page);
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *slot)) {
@@ -1523,7 +1589,7 @@ unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
 restart:
 	radix_tree_for_each_tagged(slot, &mapping->page_tree,
 				   &iter, start, tag) {
-		struct page *page;
+		struct page *head, *page;
 repeat:
 		page = radix_tree_deref_slot(slot);
 		if (unlikely(!page))
@@ -1545,9 +1611,17 @@ repeat:
 			 */
 			goto export;
 		}
-		if (!page_cache_get_speculative(page))
+
+		head = compound_head(page);
+		if (!page_cache_get_speculative(head))
 			goto repeat;
 
+		/* The page was split under us? */
+		if (compound_head(page) != head) {
+			page_cache_release(page);
+			goto repeat;
+		}
+
 		/* Has the page moved? */
 		if (unlikely(page != *slot)) {
 			page_cache_release(page);
@@ -2139,7 +2213,7 @@ void filemap_map_pages(struct fault_env *fe,
 	struct address_space *mapping = file->f_mapping;
 	pgoff_t last_pgoff = start_pgoff;
 	loff_t size;
-	struct page *page;
+	struct page *head, *page;
 
 	rcu_read_lock();
 	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
@@ -2157,8 +2231,15 @@ repeat:
 				goto next;
 		}
 
-		if (!page_cache_get_speculative(page))
+		head = compound_head(page);
+		if (!page_cache_get_speculative(head))
+			goto repeat;
+
+		/* The page was split under us? */
+		if (compound_head(page) != head) {
+			page_cache_release(page);
 			goto repeat;
+		}
 
 		/* Has the page moved? */
 		if (unlikely(page != *slot)) {
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCHv2 25/28] truncate: handle file thp
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (23 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 24/28] filemap: prepare find and delete operations for huge pages Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 26/28] shmem: prepare huge=N mount option and /proc/sys/vm/shmem_huge Kirill A. Shutemov
                   ` (2 subsequent siblings)
  27 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

For shmem/tmpfs we only need to tweak truncate_inode_page() and
invalidate_mapping_pages().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/truncate.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 7598b552ae03..40d3730a8e62 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -157,10 +157,14 @@ invalidate_complete_page(struct address_space *mapping, struct page *page)
 
 int truncate_inode_page(struct address_space *mapping, struct page *page)
 {
+	loff_t holelen;
+	VM_BUG_ON_PAGE(PageTail(page), page);
+
+	holelen = PageTransHuge(page) ? HPAGE_PMD_SIZE : PAGE_CACHE_SIZE;
 	if (page_mapped(page)) {
 		unmap_mapping_range(mapping,
 				   (loff_t)page->index << PAGE_CACHE_SHIFT,
-				   PAGE_CACHE_SIZE, 0);
+				   holelen, 0);
 	}
 	return truncate_complete_page(mapping, page);
 }
@@ -489,7 +493,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 
 			if (!trylock_page(page))
 				continue;
-			WARN_ON(page->index != index);
+
+			WARN_ON(page_to_pgoff(page) != index);
+
+			/* Middle of THP: skip */
+			if (PageTransTail(page)) {
+				unlock_page(page);
+				continue;
+			} else if (PageTransHuge(page)) {
+				index += HPAGE_PMD_NR - 1;
+				i += HPAGE_PMD_NR - 1;
+				/* 'end' is in the middle of THP */
+				if (index ==  round_down(end, HPAGE_PMD_NR))
+					continue;
+			}
+
 			ret = invalidate_inode_page(page);
 			unlock_page(page);
 			/*
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCHv2 26/28] shmem: prepare huge=N mount option and /proc/sys/vm/shmem_huge
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (24 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 25/28] truncate: handle file thp Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 27/28] shmem: get_unmapped_area align huge page Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 28/28] shmem: add huge pages support Kirill A. Shutemov
  27 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A . Shutemov

From: Hugh Dickins <hughd@google.com>

Plumb in a new "huge=1" or "huge=0" mount option to tmpfs: I don't
want to get into a maze of boot options, madvises and fadvises at
this stage, nor extend the use of the existing THP tuning to tmpfs;
though either might be pursued later on.  We just want a way to ask
a tmpfs filesystem to favor huge pages, and a way to turn that off
again when it doesn't work out so well.  Default of course is off.

"mount -o remount,huge=N /mountpoint" works fine after mount:
remounting from huge=1 (on) to huge=0 (off) will not attempt to
break up huge pages at all, just stop more from being allocated.

It's possible that we shall allow more values for the option later,
to select different strategies (e.g. how hard to try when allocating
huge pages, or when to map hugely and when not, or how sparse a huge
page should be before it is split up), either for experiments, or well
baked in: so use an unsigned char in the superblock rather than a bool.

No new config option: put this under CONFIG_TRANSPARENT_HUGEPAGE,
which is the appropriate option to protect those who don't want
the new bloat, and with which we shall share some pmd code.  Use a
"name=numeric_value" format like most other tmpfs options.  Prohibit
the option when !CONFIG_TRANSPARENT_HUGEPAGE, just as mpol is invalid
without CONFIG_NUMA (was hidden in mpol_parse_str(): make it explicit).
Allow setting >0 only if the machine has_transparent_hugepage().

But what about Shmem with no user-visible mount?  SysV SHM, memfds,
shared anonymous mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers'
DRM objects, Ashmem.  Though unlikely to suit all usages, provide
sysctl /proc/sys/vm/shmem_huge to experiment with huge on those.

And allow shmem_huge two further values: -1 for use in emergencies,
to force the huge option off from all mounts; and (currently) 2,
to force the huge option on for all - very useful for testing.

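For illustration only (the mount point and size below are made up), the
equivalent of "mount -t tmpfs -o huge=1,size=1G tmpfs /mnt/hugetmp" from C:

  #include <stdio.h>
  #include <sys/mount.h>

  int main(void)
  {
          /* needs root and an existing /mnt/hugetmp directory */
          if (mount("tmpfs", "/mnt/hugetmp", "tmpfs", 0, "huge=1,size=1G")) {
                  perror("mount");
                  return 1;
          }
          return 0;
  }
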
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/shmem_fs.h | 16 +++++++++----
 kernel/sysctl.c          | 12 ++++++++++
 mm/shmem.c               | 59 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 82 insertions(+), 5 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index a43f41cb3c43..c35482b1dd24 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -31,9 +31,10 @@ struct shmem_sb_info {
 	unsigned long max_inodes;   /* How many inodes are allowed */
 	unsigned long free_inodes;  /* How many are left for allocation */
 	spinlock_t stat_lock;	    /* Serialize shmem_sb_info changes */
+	umode_t mode;		    /* Mount mode for root directory */
+	unsigned char huge;	    /* Whether to try for hugepages */
 	kuid_t uid;		    /* Mount uid for root directory */
 	kgid_t gid;		    /* Mount gid for root directory */
-	umode_t mode;		    /* Mount mode for root directory */
 	struct mempolicy *mpol;     /* default memory policy for mappings */
 };
 
@@ -72,18 +73,23 @@ static inline struct page *shmem_read_mapping_page(
 }
 
 #ifdef CONFIG_TMPFS
-
 extern int shmem_add_seals(struct file *file, unsigned int seals);
 extern int shmem_get_seals(struct file *file);
 extern long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
-
 #else
-
 static inline long shmem_fcntl(struct file *f, unsigned int c, unsigned long a)
 {
 	return -EINVAL;
 }
+#endif /* CONFIG_TMPFS */
 
-#endif
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SHMEM)
+# ifdef CONFIG_SYSCTL
+struct ctl_table;
+extern int shmem_huge, shmem_huge_min, shmem_huge_max;
+extern int shmem_huge_sysctl(struct ctl_table *table, int write,
+			     void __user *buffer, size_t *lenp, loff_t *ppos);
+# endif /* CONFIG_SYSCTL */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SHMEM */
 
 #endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d5e7e24b85a9..be75efd865cd 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -43,6 +43,7 @@
 #include <linux/ratelimit.h>
 #include <linux/compaction.h>
 #include <linux/hugetlb.h>
+#include <linux/shmem_fs.h>
 #include <linux/initrd.h>
 #include <linux/key.h>
 #include <linux/times.h>
@@ -1301,6 +1302,17 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
 	},
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SHMEM)
+	{
+		.procname	= "shmem_huge",
+		.data		= &shmem_huge,
+		.maxlen		= sizeof(shmem_huge),
+		.mode		= 0644,
+		.proc_handler	= shmem_huge_sysctl,
+		.extra1		= &shmem_huge_min,
+		.extra2		= &shmem_huge_max,
+	},
+#endif
 #ifdef CONFIG_HUGETLB_PAGE
 	{
 		.procname	= "nr_hugepages",
diff --git a/mm/shmem.c b/mm/shmem.c
index d60d6335a253..0ba46c92ccc8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -58,6 +58,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/falloc.h>
 #include <linux/splice.h>
 #include <linux/security.h>
+#include <linux/sysctl.h>
 #include <linux/swapops.h>
 #include <linux/mempolicy.h>
 #include <linux/namei.h>
@@ -289,6 +290,25 @@ static bool shmem_confirm_swap(struct address_space *mapping,
 }
 
 /*
+ * Definitions for "huge tmpfs": tmpfs mounted with the huge=1 option
+ */
+
+/* Special values for /proc/sys/vm/shmem_huge */
+#define SHMEM_HUGE_DENY		(-1)
+#define SHMEM_HUGE_FORCE	(2)
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/* ifdef here to avoid bloating shmem.o when not necessary */
+
+int shmem_huge __read_mostly;
+
+#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
+
+#define shmem_huge SHMEM_HUGE_DENY
+
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+/*
  * Like add_to_page_cache_locked, but error if expected item has gone.
  */
 static int shmem_add_to_page_cache(struct page *page,
@@ -2915,11 +2935,21 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
 			sbinfo->gid = make_kgid(current_user_ns(), gid);
 			if (!gid_valid(sbinfo->gid))
 				goto bad_val;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		} else if (!strcmp(this_char, "huge")) {
+			if (kstrtou8(value, 10, &sbinfo->huge) < 0 ||
+			    sbinfo->huge >= SHMEM_HUGE_FORCE)
+				goto bad_val;
+			if (sbinfo->huge && !has_transparent_hugepage())
+				goto bad_val;
+#endif
+#ifdef CONFIG_NUMA
 		} else if (!strcmp(this_char,"mpol")) {
 			mpol_put(mpol);
 			mpol = NULL;
 			if (mpol_parse_str(value, &mpol))
 				goto bad_val;
+#endif
 		} else {
 			printk(KERN_ERR "tmpfs: Bad mount option %s\n",
 			       this_char);
@@ -2966,6 +2996,7 @@ static int shmem_remount_fs(struct super_block *sb, int *flags, char *data)
 		goto out;
 
 	error = 0;
+	sbinfo->huge = config.huge;
 	sbinfo->max_blocks  = config.max_blocks;
 	sbinfo->max_inodes  = config.max_inodes;
 	sbinfo->free_inodes = config.max_inodes - inodes;
@@ -2999,6 +3030,9 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 	if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID))
 		seq_printf(seq, ",gid=%u",
 				from_kgid_munged(&init_user_ns, sbinfo->gid));
+	/* Rightly or wrongly, show huge mount option unmasked by shmem_huge */
+	if (sbinfo->huge)
+		seq_printf(seq, ",huge=%u", sbinfo->huge);
 	shmem_show_mpol(seq, sbinfo->mpol);
 	return 0;
 }
@@ -3347,6 +3381,31 @@ out3:
 	return error;
 }
 
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SYSCTL)
+int shmem_huge_min = SHMEM_HUGE_DENY;
+int shmem_huge_max = SHMEM_HUGE_FORCE;
+/*
+ * /proc/sys/vm/shmem_huge sysctl for internal shm_mnt, and mount override:
+ * -1 disables huge on shm_mnt and all mounts, for emergency use
+ *  0 disables huge on internal shm_mnt (which has no way to be remounted)
+ *  1  enables huge on internal shm_mnt (which has no way to be remounted)
+ *  2  enables huge on shm_mnt and all mounts, w/o needing option, for testing
+ *     (but we may add more huge options, and push that 2 for testing upwards)
+ */
+int shmem_huge_sysctl(struct ctl_table *table, int write,
+		      void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int err;
+
+	if (!has_transparent_hugepage())
+		shmem_huge_max = 0;
+	err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	if (write && !err && !IS_ERR(shm_mnt))
+		SHMEM_SB(shm_mnt->mnt_sb)->huge = (shmem_huge > 0);
+	return err;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSCTL */
+
 #else /* !CONFIG_SHMEM */
 
 /*
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCHv2 27/28] shmem: get_unmapped_area align huge page
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (25 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 26/28] shmem: prepare huge=N mount option and /proc/sys/vm/shmem_huge Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  2016-02-11 14:21 ` [PATCHv2 28/28] shmem: add huge pages support Kirill A. Shutemov
  27 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A . Shutemov

From: Hugh Dickins <hughd@google.com>

Provide a shmem_get_unmapped_area method in file_operations, called
at mmap time to decide the mapping address.  It could be conditional
on CONFIG_TRANSPARENT_HUGEPAGE, but save #ifdefs in other places by
making it unconditional.

shmem_get_unmapped_area() first calls the usual mm->get_unmapped_area
(which we treat as a black box, highly dependent on architecture and
config and executable layout).  Lots of conditions, and in most cases
it just goes with the address it chose; but when our huge stars are
rightly aligned, yet that did not provide a suitable address, go back
to ask for a larger arena, within which to align the mapping suitably.

There have to be some direct calls to shmem_get_unmapped_area(),
not via the file_operations: because of the way shmem_zero_setup()
is called to create a shmem object late in the mmap sequence, when
MAP_SHARED is requested with MAP_ANONYMOUS or /dev/zero.  Though
this only matters when /proc/sys/vm/shmem_huge has been set.
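
As a purely illustrative aside (not part of the patch): the observable effect
can be checked from userspace with a sketch like the one below.  It assumes
x86-64 with a 2MB HPAGE_PMD_SIZE and /proc/sys/vm/shmem_huge set to 2, so the
shared anonymous mapping gets routed through shmem_get_unmapped_area().

	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 16UL << 20;		/* well above 2MB */
		unsigned long mask = (2UL << 20) - 1;	/* assumed PMD size */
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_SHARED | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		printf("addr %p is%s 2MB-aligned\n", p,
		       ((unsigned long)p & mask) ? " not" : "");
		munmap(p, len);
		return 0;
	}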

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 drivers/char/mem.c       | 24 ++++++++++++
 include/linux/shmem_fs.h |  2 +
 ipc/shm.c                |  6 ++-
 mm/mmap.c                | 16 +++++++-
 mm/shmem.c               | 96 ++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 140 insertions(+), 4 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 6b1721f978c2..a4c3ce0c9ece 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -22,6 +22,7 @@
 #include <linux/device.h>
 #include <linux/highmem.h>
 #include <linux/backing-dev.h>
+#include <linux/shmem_fs.h>
 #include <linux/splice.h>
 #include <linux/pfn.h>
 #include <linux/export.h>
@@ -661,6 +662,28 @@ static int mmap_zero(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 
+static unsigned long get_unmapped_area_zero(struct file *file,
+				unsigned long addr, unsigned long len,
+				unsigned long pgoff, unsigned long flags)
+{
+#ifdef CONFIG_MMU
+	if (flags & MAP_SHARED) {
+		/*
+		 * mmap_zero() will call shmem_zero_setup() to create a file,
+		 * so use shmem's get_unmapped_area in case it can be huge;
+		 * and pass NULL for file as in mmap.c's get_unmapped_area(),
+		 * so as not to confuse shmem with our handle on "/dev/zero".
+		 */
+		return shmem_get_unmapped_area(NULL, addr, len, pgoff, flags);
+	}
+
+	/* Otherwise flags & MAP_PRIVATE: with no shmem object beneath it */
+	return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+#else
+	return -ENOSYS;
+#endif
+}
+
 static ssize_t write_full(struct file *file, const char __user *buf,
 			  size_t count, loff_t *ppos)
 {
@@ -768,6 +791,7 @@ static const struct file_operations zero_fops = {
 	.read_iter	= read_iter_zero,
 	.write_iter	= write_iter_zero,
 	.mmap		= mmap_zero,
+	.get_unmapped_area = get_unmapped_area_zero,
 #ifndef CONFIG_MMU
 	.mmap_capabilities = zero_mmap_capabilities,
 #endif
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index c35482b1dd24..85f11d75bfaf 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -53,6 +53,8 @@ extern struct file *shmem_file_setup(const char *name,
 extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
 					    unsigned long flags);
 extern int shmem_zero_setup(struct vm_area_struct *);
+extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
+		unsigned long len, unsigned long pgoff, unsigned long flags);
 extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
 extern bool shmem_mapping(struct address_space *mapping);
 extern void shmem_unlock_mapping(struct address_space *mapping);
diff --git a/ipc/shm.c b/ipc/shm.c
index 3174634ca4e5..b797a6e49d78 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -476,13 +476,15 @@ static const struct file_operations shm_file_operations = {
 	.mmap		= shm_mmap,
 	.fsync		= shm_fsync,
 	.release	= shm_release,
-#ifndef CONFIG_MMU
 	.get_unmapped_area	= shm_get_unmapped_area,
-#endif
 	.llseek		= noop_llseek,
 	.fallocate	= shm_fallocate,
 };
 
+/*
+ * shm_file_operations_huge is now identical to shm_file_operations,
+ * but we keep it distinct for the sake of is_file_shm_hugepages().
+ */
 static const struct file_operations shm_file_operations_huge = {
 	.mmap		= shm_mmap,
 	.fsync		= shm_fsync,
diff --git a/mm/mmap.c b/mm/mmap.c
index c9d0c412b6dd..6c2044ce9af0 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -25,6 +25,7 @@
 #include <linux/personality.h>
 #include <linux/security.h>
 #include <linux/hugetlb.h>
+#include <linux/shmem_fs.h>
 #include <linux/profile.h>
 #include <linux/export.h>
 #include <linux/mount.h>
@@ -2017,8 +2018,19 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 		return -ENOMEM;
 
 	get_area = current->mm->get_unmapped_area;
-	if (file && file->f_op->get_unmapped_area)
-		get_area = file->f_op->get_unmapped_area;
+	if (file) {
+		if (file->f_op->get_unmapped_area)
+			get_area = file->f_op->get_unmapped_area;
+	} else if (flags & MAP_SHARED) {
+		/*
+		 * mmap_region() will call shmem_zero_setup() to create a file,
+		 * so use shmem's get_unmapped_area in case it can be huge.
+		 * do_mmap_pgoff() will clear pgoff, so match alignment.
+		 */
+		pgoff = 0;
+		get_area = shmem_get_unmapped_area;
+	}
+
 	addr = get_area(file, addr, len, pgoff, flags);
 	if (IS_ERR_VALUE(addr))
 		return addr;
diff --git a/mm/shmem.c b/mm/shmem.c
index 0ba46c92ccc8..6069062d93b0 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1470,6 +1470,94 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	return ret;
 }
 
+unsigned long shmem_get_unmapped_area(struct file *file,
+				      unsigned long uaddr, unsigned long len,
+				      unsigned long pgoff, unsigned long flags)
+{
+	unsigned long (*get_area)(struct file *,
+		unsigned long, unsigned long, unsigned long, unsigned long);
+	unsigned long addr;
+	unsigned long offset;
+	unsigned long inflated_len;
+	unsigned long inflated_addr;
+	unsigned long inflated_offset;
+
+	if (len > TASK_SIZE)
+		return -ENOMEM;
+
+	get_area = current->mm->get_unmapped_area;
+	addr = get_area(file, uaddr, len, pgoff, flags);
+
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		return addr;
+	if (IS_ERR_VALUE(addr))
+		return addr;
+	if (addr & ~PAGE_MASK)
+		return addr;
+	if (addr > TASK_SIZE - len)
+		return addr;
+
+	if (shmem_huge == SHMEM_HUGE_DENY)
+		return addr;
+	if (len < HPAGE_PMD_SIZE)
+		return addr;
+	if (flags & MAP_FIXED)
+		return addr;
+	/*
+	 * Our priority is to support MAP_SHARED mapped hugely;
+	 * and support MAP_PRIVATE mapped hugely too, until it is COWed.
+	 * But if caller specified an address hint, respect that as before.
+	 */
+	if (uaddr)
+		return addr;
+
+	if (shmem_huge != SHMEM_HUGE_FORCE) {
+		struct super_block *sb;
+
+		if (file) {
+			VM_BUG_ON(file->f_op != &shmem_file_operations);
+			sb = file_inode(file)->i_sb;
+		} else {
+			/*
+			 * Called directly from mm/mmap.c, or drivers/char/mem.c
+			 * for "/dev/zero", to create a shared anonymous object.
+			 */
+			if (IS_ERR(shm_mnt))
+				return addr;
+			sb = shm_mnt->mnt_sb;
+		}
+		if (!SHMEM_SB(sb)->huge)
+			return addr;
+	}
+
+	offset = (pgoff << PAGE_SHIFT) & (HPAGE_PMD_SIZE-1);
+	if (offset && offset + len < 2 * HPAGE_PMD_SIZE)
+		return addr;
+	if ((addr & (HPAGE_PMD_SIZE-1)) == offset)
+		return addr;
+
+	inflated_len = len + HPAGE_PMD_SIZE - PAGE_SIZE;
+	if (inflated_len > TASK_SIZE)
+		return addr;
+	if (inflated_len < len)
+		return addr;
+
+	inflated_addr = get_area(NULL, 0, inflated_len, 0, flags);
+	if (IS_ERR_VALUE(inflated_addr))
+		return addr;
+	if (inflated_addr & ~PAGE_MASK)
+		return addr;
+
+	inflated_offset = inflated_addr & (HPAGE_PMD_SIZE-1);
+	inflated_addr += offset - inflated_offset;
+	if (inflated_offset > offset)
+		inflated_addr += HPAGE_PMD_SIZE;
+
+	if (inflated_addr > TASK_SIZE - len)
+		return addr;
+	return inflated_addr;
+}
+
 #ifdef CONFIG_NUMA
 static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *mpol)
 {
@@ -3249,6 +3337,7 @@ static const struct address_space_operations shmem_aops = {
 
 static const struct file_operations shmem_file_operations = {
 	.mmap		= shmem_mmap,
+	.get_unmapped_area = shmem_get_unmapped_area,
 #ifdef CONFIG_TMPFS
 	.llseek		= shmem_file_llseek,
 	.read_iter	= shmem_file_read_iter,
@@ -3448,6 +3537,13 @@ void shmem_unlock_mapping(struct address_space *mapping)
 {
 }
 
+unsigned long shmem_get_unmapped_area(struct file *file,
+				      unsigned long addr, unsigned long len,
+				      unsigned long pgoff, unsigned long flags)
+{
+	return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+}
+
 void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
 {
 	truncate_inode_pages_range(inode->i_mapping, lstart, lend);
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCHv2 28/28] shmem: add huge pages support
  2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
                   ` (26 preceding siblings ...)
  2016-02-11 14:21 ` [PATCHv2 27/28] shmem: get_unmapped_area align huge page Kirill A. Shutemov
@ 2016-02-11 14:21 ` Kirill A. Shutemov
  27 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-11 14:21 UTC (permalink / raw)
  To: Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Dave Hansen, Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm,
	Kirill A. Shutemov

Here's a basic implementation of huge pages support for shmem/tmpfs.

It's all pretty straightforward:

  - shmem_getpage() allocates a huge page if it can and tries to insert it
    into the radix tree with shmem_add_to_page_cache();

  - shmem_add_to_page_cache() puts the page onto the radix tree if there's
    space for it;

  - shmem_undo_range() removes huge pages if they are fully within the
    range. A partial truncate of a huge page zeroes out that part of the
    THP (see the sketch below).

  - no need to change shmem_fault(): core mm will map a compound page as
    huge if the VMA is suitable;
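
The truncate sketch referenced above (illustration only, not part of the
patch) would look roughly like this from userspace; /mnt/huge-tmpfs is a
hypothetical tmpfs mount with huge=1:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		/* hypothetical mount point: tmpfs mounted with huge=1 */
		int i, fd = open("/mnt/huge-tmpfs/f", O_CREAT | O_RDWR, 0600);
		char buf[4096];

		if (fd < 0) {
			perror("open");
			return 1;
		}
		memset(buf, 0xaa, sizeof(buf));
		for (i = 0; i < 1024; i++)	/* populate 4MB */
			if (write(fd, buf, sizeof(buf)) != sizeof(buf))
				return 1;
		/*
		 * Punch a hole inside the first huge page: shmem_undo_range()
		 * zeroes that sub-range instead of splitting the THP.
		 */
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      4096, 64 * 4096))
			perror("fallocate");
		close(fd);
		return 0;
	}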

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/huge_mm.h |   2 +
 mm/memory.c             |   5 +-
 mm/mempolicy.c          |   2 +-
 mm/page-writeback.c     |   1 +
 mm/shmem.c              | 338 ++++++++++++++++++++++++++++++++++++------------
 mm/swap.c               |   2 +
 6 files changed, 265 insertions(+), 85 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a9ec30594a81..1e74ac5c9f67 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -159,6 +159,8 @@ struct page *get_huge_zero_page(void);
 
 #define transparent_hugepage_enabled(__vma) 0
 
+static inline void prep_transhuge_page(struct page *page) {}
+
 #define transparent_hugepage_flags 0UL
 static inline int
 split_huge_page_to_list(struct page *page, struct list_head *list)
diff --git a/mm/memory.c b/mm/memory.c
index 19eff2164e5b..2a35fdde7796 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1094,7 +1094,7 @@ again:
 				 * unmap shared but keep private pages.
 				 */
 				if (details->check_mapping &&
-				    details->check_mapping != page->mapping)
+				    details->check_mapping != page_rmapping(page))
 					continue;
 			}
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
@@ -1185,7 +1185,8 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 		next = pmd_addr_end(addr, end);
 		if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE) {
-				VM_BUG_ON_VMA(!rwsem_is_locked(&tlb->mm->mmap_sem), vma);
+				VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
+						!rwsem_is_locked(&tlb->mm->mmap_sem), vma);
 				split_huge_pmd(vma, pmd, addr);
 			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
 				goto next;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 5742271a026d..30befece3782 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -534,7 +534,7 @@ retry:
 		nid = page_to_nid(page);
 		if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
 			continue;
-		if (PageTail(page) && PageAnon(page)) {
+		if (PageTransCompound(page)) {
 			get_page(page);
 			pte_unmap_unlock(pte, ptl);
 			lock_page(page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 11ff8f758631..2c8d5386665d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2554,6 +2554,7 @@ int set_page_dirty(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 
+	page = compound_head(page);
 	if (likely(mapping)) {
 		int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
 		/*
diff --git a/mm/shmem.c b/mm/shmem.c
index 6069062d93b0..27d21a8a671a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -174,10 +174,13 @@ static inline int shmem_reacct_size(unsigned long flags,
  * shmem_getpage reports shmem_acct_block failure as -ENOSPC not -ENOMEM,
  * so that a failure on a sparse tmpfs mapping will give SIGBUS not OOM.
  */
-static inline int shmem_acct_block(unsigned long flags)
+static inline int shmem_acct_block(unsigned long flags, long pages)
 {
-	return (flags & VM_NORESERVE) ?
-		security_vm_enough_memory_mm(current->mm, VM_ACCT(PAGE_CACHE_SIZE)) : 0;
+	if (!(flags & VM_NORESERVE))
+		return 0;
+
+	return security_vm_enough_memory_mm(current->mm,
+			pages * VM_ACCT(PAGE_CACHE_SIZE));
 }
 
 static inline void shmem_unacct_blocks(unsigned long flags, long pages)
@@ -315,30 +318,55 @@ static int shmem_add_to_page_cache(struct page *page,
 				   struct address_space *mapping,
 				   pgoff_t index, void *expected)
 {
-	int error;
+	int error, nr = hpage_nr_pages(page);
 
+	VM_BUG_ON_PAGE(PageTail(page), page);
+	VM_BUG_ON_PAGE(index != round_down(index, nr), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+	VM_BUG_ON(expected && PageTransHuge(page));
 
-	page_cache_get(page);
+	atomic_add(nr, &page->_count);
 	page->mapping = mapping;
 	page->index = index;
 
 	spin_lock_irq(&mapping->tree_lock);
-	if (!expected)
+	if (PageTransHuge(page)) {
+		void __rcu **results;
+		pgoff_t idx;
+		int i;
+
+		error = 0;
+		if (radix_tree_gang_lookup_slot(&mapping->page_tree,
+					&results, &idx, index, 1) &&
+				idx < index + HPAGE_PMD_NR) {
+			error = -EEXIST;
+		}
+
+		if (!error) {
+			for (i = 0; i < HPAGE_PMD_NR; i++) {
+				error = radix_tree_insert(&mapping->page_tree,
+						index + i, page + i);
+				VM_BUG_ON(error);
+			}
+			count_vm_event(THP_FILE_ALLOC);
+		}
+	} else if (!expected) {
 		error = radix_tree_insert(&mapping->page_tree, index, page);
-	else
+	} else {
 		error = shmem_radix_tree_replace(mapping, index, expected,
 								 page);
+	}
+
 	if (!error) {
-		mapping->nrpages++;
-		__inc_zone_page_state(page, NR_FILE_PAGES);
-		__inc_zone_page_state(page, NR_SHMEM);
+		mapping->nrpages += nr;
+		__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
+		__mod_zone_page_state(page_zone(page), NR_SHMEM, nr);
 		spin_unlock_irq(&mapping->tree_lock);
 	} else {
 		page->mapping = NULL;
 		spin_unlock_irq(&mapping->tree_lock);
-		page_cache_release(page);
+		atomic_sub(nr, &page->_count);
 	}
 	return error;
 }
@@ -351,6 +379,8 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
 	struct address_space *mapping = page->mapping;
 	int error;
 
+	VM_BUG_ON_PAGE(PageCompound(page), page);
+
 	spin_lock_irq(&mapping->tree_lock);
 	error = shmem_radix_tree_replace(mapping, page->index, page, radswap);
 	page->mapping = NULL;
@@ -526,6 +556,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			index = indices[i];
 			if (index >= end)
 				break;
+			VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page);
 
 			if (radix_tree_exceptional_entry(page)) {
 				if (unfalloc)
@@ -537,8 +568,29 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 
 			if (!trylock_page(page))
 				continue;
+
+			if (PageTransTail(page)) {
+				/* Middle of THP: zero out the page */
+				clear_highpage(page);
+				unlock_page(page);
+				continue;
+			} else if (PageTransHuge(page)) {
+				if (index == round_down(end, HPAGE_PMD_NR)) {
+					/*
+					 * Range ends in the middle of THP:
+					 * zero out the page
+					 */
+					clear_highpage(page);
+					unlock_page(page);
+					continue;
+				}
+				index += HPAGE_PMD_NR - 1;
+				i += HPAGE_PMD_NR - 1;
+			}
+
 			if (!unfalloc || !PageUptodate(page)) {
-				if (page->mapping == mapping) {
+				VM_BUG_ON_PAGE(PageTail(page), page);
+				if (page_mapping(page) == mapping) {
 					VM_BUG_ON_PAGE(PageWriteback(page), page);
 					truncate_inode_page(mapping, page);
 				}
@@ -614,8 +666,36 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 
 			lock_page(page);
+
+			if (PageTransTail(page)) {
+				/* Middle of THP: zero out the page */
+				clear_highpage(page);
+				unlock_page(page);
+				/*
+				 * Partial thp truncate due 'start' in middle
+				 * of THP: don't need to look on these pages
+				 * again on !pvec.nr restart.
+				 */
+				if (index != round_down(end, HPAGE_PMD_NR))
+					start++;
+				continue;
+			} else if (PageTransHuge(page)) {
+				if (index == round_down(end, HPAGE_PMD_NR)) {
+					/*
+					 * Range ends in the middle of THP:
+					 * zero out the page
+					 */
+					clear_highpage(page);
+					unlock_page(page);
+					continue;
+				}
+				index += HPAGE_PMD_NR - 1;
+				i += HPAGE_PMD_NR - 1;
+			}
+
 			if (!unfalloc || !PageUptodate(page)) {
-				if (page->mapping == mapping) {
+				VM_BUG_ON_PAGE(PageTail(page), page);
+				if (page_mapping(page) == mapping) {
 					VM_BUG_ON_PAGE(PageWriteback(page), page);
 					truncate_inode_page(mapping, page);
 				} else {
@@ -874,6 +954,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	swp_entry_t swap;
 	pgoff_t index;
 
+	VM_BUG_ON_PAGE(PageCompound(page), page);
 	BUG_ON(!PageLocked(page));
 	mapping = page->mapping;
 	index = page->index;
@@ -973,8 +1054,8 @@ redirty:
 	return 0;
 }
 
-#ifdef CONFIG_NUMA
 #ifdef CONFIG_TMPFS
+#ifdef CONFIG_NUMA
 static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
 {
 	char buffer[64];
@@ -998,68 +1079,129 @@ static struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
 	}
 	return mpol;
 }
+
+#else
+
+static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
+{
+}
+#endif /* CONFIG_NUMA */
 #endif /* CONFIG_TMPFS */
 
+static void shmem_pseudo_vma_init(struct vm_area_struct *vma,
+		struct shmem_inode_info *info, pgoff_t index)
+{
+	/* Create a pseudo vma that just contains the policy */
+	vma->vm_start = 0;
+	/* Bias interleave by inode number to distribute better across nodes */
+	vma->vm_pgoff = index + info->vfs_inode.i_ino;
+	vma->vm_ops = NULL;
+
+#ifdef CONFIG_NUMA
+	vma->vm_policy = mpol_shared_policy_lookup(&info->policy, index);
+#endif /* CONFIG_NUMA */
+}
+
+static void shmem_pseudo_vma_destroy(struct vm_area_struct *vma)
+{
+#ifdef CONFIG_NUMA
+	/* Drop reference taken by mpol_shared_policy_lookup() */
+	mpol_cond_put(vma->vm_policy);
+#endif
+}
+
 static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
 			struct shmem_inode_info *info, pgoff_t index)
 {
 	struct vm_area_struct pvma;
 	struct page *page;
 
-	/* Create a pseudo vma that just contains the policy */
-	pvma.vm_start = 0;
-	/* Bias interleave by inode number to distribute better across nodes */
-	pvma.vm_pgoff = index + info->vfs_inode.i_ino;
-	pvma.vm_ops = NULL;
-	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
-
+	shmem_pseudo_vma_init(&pvma, info, index);
 	page = swapin_readahead(swap, gfp, &pvma, 0);
-
-	/* Drop reference taken by mpol_shared_policy_lookup() */
-	mpol_cond_put(pvma.vm_policy);
+	shmem_pseudo_vma_destroy(&pvma);
 
 	return page;
 }
 
-static struct page *shmem_alloc_page(gfp_t gfp,
-			struct shmem_inode_info *info, pgoff_t index)
+static struct page *shmem_alloc_hugepage(gfp_t gfp,
+		struct shmem_inode_info *info, pgoff_t index)
 {
 	struct vm_area_struct pvma;
+	struct inode *inode = &info->vfs_inode;
+	struct address_space *mapping = inode->i_mapping;
+	pgoff_t idx, hindex = round_down(index, HPAGE_PMD_NR);
+	void __rcu **results;
 	struct page *page;
 
-	/* Create a pseudo vma that just contains the policy */
-	pvma.vm_start = 0;
-	/* Bias interleave by inode number to distribute better across nodes */
-	pvma.vm_pgoff = index + info->vfs_inode.i_ino;
-	pvma.vm_ops = NULL;
-	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
-
-	page = alloc_page_vma(gfp, &pvma, 0);
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		return NULL;
 
-	/* Drop reference taken by mpol_shared_policy_lookup() */
-	mpol_cond_put(pvma.vm_policy);
+	rcu_read_lock();
+	if (radix_tree_gang_lookup_slot(&mapping->page_tree, &results, &idx,
+				hindex, 1) && idx < hindex + HPAGE_PMD_NR) {
+		rcu_read_unlock();
+		return NULL;
+	}
+	rcu_read_unlock();
 
+	shmem_pseudo_vma_init(&pvma, info, hindex);
+	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
+			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
+	shmem_pseudo_vma_destroy(&pvma);
+	if (page)
+		prep_transhuge_page(page);
 	return page;
 }
-#else /* !CONFIG_NUMA */
-#ifdef CONFIG_TMPFS
-static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
-{
-}
-#endif /* CONFIG_TMPFS */
 
-static inline struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
+static struct page *shmem_alloc_page(gfp_t gfp,
 			struct shmem_inode_info *info, pgoff_t index)
 {
-	return swapin_readahead(swap, gfp, NULL, 0);
+	struct vm_area_struct pvma;
+	struct page *page;
+
+	shmem_pseudo_vma_init(&pvma, info, index);
+	page = alloc_page_vma(gfp, &pvma, 0);
+	shmem_pseudo_vma_destroy(&pvma);
+
+	return page;
 }
 
-static inline struct page *shmem_alloc_page(gfp_t gfp,
-			struct shmem_inode_info *info, pgoff_t index)
+static struct page *shmem_alloc_and_acct_page(gfp_t gfp,
+		struct shmem_inode_info *info, struct shmem_sb_info *sbinfo,
+		pgoff_t index, bool huge)
 {
-	return alloc_page(gfp);
+	struct page *page;
+	int nr;
+	int err = -ENOSPC;
+
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		huge = false;
+	nr = huge ? HPAGE_PMD_NR : 1;
+
+	if (shmem_acct_block(info->flags, nr))
+		goto failed;
+	if (sbinfo->max_blocks) {
+		if (percpu_counter_compare(&sbinfo->used_blocks,
+					sbinfo->max_blocks + nr) > 0)
+			goto unacct;
+		percpu_counter_add(&sbinfo->used_blocks, nr);
+	}
+
+	if (huge)
+		page = shmem_alloc_hugepage(gfp, info, index);
+	else
+		page = shmem_alloc_page(gfp, info, index);
+	if (page)
+		return page;
+
+	err = -ENOMEM;
+	if (sbinfo->max_blocks)
+		percpu_counter_add(&sbinfo->used_blocks, -nr);
+unacct:
+	shmem_unacct_blocks(info->flags, nr);
+failed:
+	return ERR_PTR(err);
 }
-#endif /* CONFIG_NUMA */
 
 #if !defined(CONFIG_NUMA) || !defined(CONFIG_TMPFS)
 static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
@@ -1167,6 +1309,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	struct mem_cgroup *memcg;
 	struct page *page;
 	swp_entry_t swap;
+	pgoff_t hindex = index;
 	int error;
 	int once = 0;
 	int alloced = 0;
@@ -1283,24 +1426,30 @@ repeat:
 		swap_free(swap);
 
 	} else {
-		if (shmem_acct_block(info->flags)) {
-			error = -ENOSPC;
-			goto failed;
+		/* shmem_symlink() */
+		if (mapping->a_ops != &shmem_aops)
+			goto alloc_nohuge;
+		if (shmem_huge == SHMEM_HUGE_DENY)
+			goto alloc_nohuge;
+		if (shmem_huge != SHMEM_HUGE_FORCE && !sbinfo->huge)
+			goto alloc_nohuge;
+
+		page = shmem_alloc_and_acct_page(gfp, info, sbinfo,
+				index, true);
+		if (IS_ERR(page)) {
+alloc_nohuge:		page = shmem_alloc_and_acct_page(gfp, info, sbinfo,
+					index, false);
 		}
-		if (sbinfo->max_blocks) {
-			if (percpu_counter_compare(&sbinfo->used_blocks,
-						sbinfo->max_blocks) >= 0) {
-				error = -ENOSPC;
-				goto unacct;
-			}
-			percpu_counter_inc(&sbinfo->used_blocks);
+		if (IS_ERR(page)) {
+			error = PTR_ERR(page);
+			page = NULL;
+			goto failed;
 		}
 
-		page = shmem_alloc_page(gfp, info, index);
-		if (!page) {
-			error = -ENOMEM;
-			goto decused;
-		}
+		if (PageTransHuge(page))
+			hindex = round_down(index, HPAGE_PMD_NR);
+		else
+			hindex = index;
 
 		__SetPageSwapBacked(page);
 		__SetPageLocked(page);
@@ -1308,25 +1457,28 @@ repeat:
 			__SetPageReferenced(page);
 
 		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg,
-				false);
+				PageTransHuge(page));
 		if (error)
-			goto decused;
-		error = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK);
+			goto unacct;
+		error = radix_tree_maybe_preload_order(gfp & GFP_RECLAIM_MASK,
+				compound_order(page));
 		if (!error) {
-			error = shmem_add_to_page_cache(page, mapping, index,
+			error = shmem_add_to_page_cache(page, mapping, hindex,
 							NULL);
 			radix_tree_preload_end();
 		}
 		if (error) {
-			mem_cgroup_cancel_charge(page, memcg, false);
-			goto decused;
+			mem_cgroup_cancel_charge(page, memcg,
+					PageTransHuge(page));
+			goto unacct;
 		}
-		mem_cgroup_commit_charge(page, memcg, false, false);
+		mem_cgroup_commit_charge(page, memcg, false,
+				PageTransHuge(page));
 		lru_cache_add_anon(page);
 
 		spin_lock(&info->lock);
-		info->alloced++;
-		inode->i_blocks += BLOCKS_PER_PAGE;
+		info->alloced += 1 << compound_order(page);
+		inode->i_blocks += BLOCKS_PER_PAGE << compound_order(page);
 		shmem_recalc_inode(inode);
 		spin_unlock(&info->lock);
 		alloced = true;
@@ -1342,10 +1494,15 @@ clear:
 		 * but SGP_FALLOC on a page fallocated earlier must initialize
 		 * it now, lest undo on failure cancel our earlier guarantee.
 		 */
-		if (sgp != SGP_WRITE) {
-			clear_highpage(page);
-			flush_dcache_page(page);
-			SetPageUptodate(page);
+		if (sgp != SGP_WRITE && !PageUptodate(page)) {
+			struct page *head = compound_head(page);
+			int i;
+
+			for (i = 0; i < (1 << compound_order(head)); i++) {
+				clear_highpage(head + i);
+				flush_dcache_page(head + i);
+			}
+			SetPageUptodate(head);
 		}
 		if (sgp == SGP_DIRTY)
 			set_page_dirty(page);
@@ -1364,17 +1521,23 @@ clear:
 		error = -EINVAL;
 		goto unlock;
 	}
-	*pagep = page;
+	*pagep = page + index - hindex;
 	return 0;
 
 	/*
 	 * Error recovery.
 	 */
-decused:
-	if (sbinfo->max_blocks)
-		percpu_counter_add(&sbinfo->used_blocks, -1);
 unacct:
-	shmem_unacct_blocks(info->flags, 1);
+	if (sbinfo->max_blocks)
+		percpu_counter_add(&sbinfo->used_blocks,
+				1 << compound_order(page));
+	shmem_unacct_blocks(info->flags, 1 << compound_order(page));
+
+	if (PageTransHuge(page)) {
+		unlock_page(page);
+		page_cache_release(page);
+		goto alloc_nohuge;
+	}
 failed:
 	if (swap.val && !shmem_confirm_swap(mapping, index, swap))
 		error = -EEXIST;
@@ -1715,12 +1878,23 @@ shmem_write_end(struct file *file, struct address_space *mapping,
 		i_size_write(inode, pos + copied);
 
 	if (!PageUptodate(page)) {
+		struct page *head = compound_head(page);
+		if (PageTransCompound(page)) {
+			int i;
+
+			for (i = 0; i < HPAGE_PMD_NR; i++) {
+				if (head + i == page)
+					continue;
+				clear_highpage(head + i);
+				flush_dcache_page(head + i);
+			}
+		}
 		if (copied < PAGE_CACHE_SIZE) {
 			unsigned from = pos & (PAGE_CACHE_SIZE - 1);
 			zero_user_segments(page, 0, from,
 					from + copied, PAGE_CACHE_SIZE);
 		}
-		SetPageUptodate(page);
+		SetPageUptodate(head);
 	}
 	set_page_dirty(page);
 	unlock_page(page);
diff --git a/mm/swap.c b/mm/swap.c
index 09fe5e97714a..5ee5118f45d4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -291,6 +291,7 @@ static bool need_activate_page_drain(int cpu)
 
 void activate_page(struct page *page)
 {
+	page = compound_head(page);
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
 		struct pagevec *pvec = &get_cpu_var(activate_page_pvecs);
 
@@ -315,6 +316,7 @@ void activate_page(struct page *page)
 {
 	struct zone *zone = page_zone(page);
 
+	page = compound_head(page);
 	spin_lock_irq(&zone->lru_lock);
 	__activate_page(page, mem_cgroup_page_lruvec(page, zone), NULL);
 	spin_unlock_irq(&zone->lru_lock);
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 02/28] rmap: introduce rmap_walk_locked()
  2016-02-11 14:21 ` [PATCHv2 02/28] rmap: introduce rmap_walk_locked() Kirill A. Shutemov
@ 2016-02-11 18:52   ` Andi Kleen
  2016-02-16  9:36     ` Kirill A. Shutemov
  0 siblings, 1 reply; 55+ messages in thread
From: Andi Kleen @ 2016-02-11 18:52 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm

"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:

> rmap_walk_locked() is the same as rmap_walk(), but caller takes care
> about relevant rmap lock.
>
> It's preparation to switch THP splitting from custom rmap walk in
> freeze_page()/unfreeze_page() to generic one.

Would be better to move all locking into the callers, with an
appropiate helper for users who don't want to deal with it.
Conditional locking based on flags is always tricky.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 04/28] mm: make remove_migration_ptes() beyond mm/migration.c
  2016-02-11 14:21 ` [PATCHv2 04/28] mm: make remove_migration_ptes() beyond mm/migration.c Kirill A. Shutemov
@ 2016-02-12 16:54   ` Dave Hansen
  2016-02-16  9:54     ` Kirill A. Shutemov
  0 siblings, 1 reply; 55+ messages in thread
From: Dave Hansen @ 2016-02-12 16:54 UTC (permalink / raw)
  To: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm

On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote
> We also shouldn't try to mlock() pte-mapped huge pages: pte-mapeed THP
> pages are never mlocked.

That's kinda subtle.  Can you explain more?

If we did the following:

	ptr = mmap(NULL, 512*PAGE_SIZE, ...);
	mlock(ptr, 512*PAGE_SIZE);
	fork();
	munmap(ptr + 100 * PAGE_SIZE, PAGE_SIZE);

I'd expect to get two processes, each mapping the same compound THP, one
with a PMD and the other with 511 ptes and one hole.  Is there something
different that goes on?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 08/28] mm: postpone page table allocation until do_set_pte()
  2016-02-11 14:21 ` [PATCHv2 08/28] mm: postpone page table allocation until do_set_pte() Kirill A. Shutemov
@ 2016-02-12 17:44   ` Dave Hansen
  2016-02-16 14:26     ` Kirill A. Shutemov
  0 siblings, 1 reply; 55+ messages in thread
From: Dave Hansen @ 2016-02-12 17:44 UTC (permalink / raw)
  To: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm

On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ca99c0ecf52e..172f4d8e798d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -265,6 +265,7 @@ struct fault_env {
>  	pmd_t *pmd;
>  	pte_t *pte;
>  	spinlock_t *ptl;
> +	pgtable_t prealloc_pte;
>  };

If we're going to do this fault_env thing, we need some heavy-duty
comments on what the different fields do and what they mean.  We don't
want to get into a situation where we're doing

	void fault_foo(struct fault_env *fe);..

and then we change the internals of fault_foo() to manipulate a
different set of fe->* variables, or change assumptions, then have
callers randomly break.

One _nice_ part of passing all the arguments explicitly is that it makes
you go visit all the call sites and think about how the conventions change.

It just makes me nervous.

The semantics of having both a ->pte and ->pmd need to be very clearly
spelled out too, please.

>  /*
> @@ -559,7 +560,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>  	return pte;
>  }
>  
> -void do_set_pte(struct fault_env *fe, struct page *page);
> +int do_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
> +		struct page *page);
>  #endif

I think do_set_pte() might be due for a new name if it's going to be
doing allocations internally.

> diff --git a/mm/filemap.c b/mm/filemap.c
> index 28b3875969a8..ba8150d6dc33 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2146,11 +2146,6 @@ void filemap_map_pages(struct fault_env *fe,
>  			start_pgoff) {
>  		if (iter.index > end_pgoff)
>  			break;
> -		fe->pte += iter.index - last_pgoff;
> -		fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
> -		last_pgoff = iter.index;
> -		if (!pte_none(*fe->pte))
> -			goto next;
>  repeat:
>  		page = radix_tree_deref_slot(slot);
>  		if (unlikely(!page))
> @@ -2187,7 +2182,17 @@ repeat:
>  
>  		if (file->f_ra.mmap_miss > 0)
>  			file->f_ra.mmap_miss--;
> -		do_set_pte(fe, page);
> +
> +		fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
> +		if (fe->pte)
> +			fe->pte += iter.index - last_pgoff;
> +		last_pgoff = iter.index;
> +		if (do_set_pte(fe, NULL, page)) {
> +			/* failed to setup page table: giving up */
> +			if (!fe->pte)
> +				break;
> +			goto unlock;
> +		}

What's the failure here, though?  Failed to set PTE or failed to
_allocate_ pte page?  One of them is a harmless race setting the pte and
the other is a pretty crummy allocation failure.  Do we really not want
to differentiate these?

This also throws away the spiffy new error code that comes back from
do_set_pte().  Is that OK?

>  		unlock_page(page);
>  		goto next;
>  unlock:
> diff --git a/mm/memory.c b/mm/memory.c
> index f8f9549fac86..0de6f176674d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2661,8 +2661,6 @@ static int do_anonymous_page(struct fault_env *fe)
>  	struct page *page;
>  	pte_t entry;
>  
> -	pte_unmap(fe->pte);
> -
>  	/* File mapping without ->vm_ops ? */
>  	if (vma->vm_flags & VM_SHARED)
>  		return VM_FAULT_SIGBUS;
> @@ -2671,6 +2669,18 @@ static int do_anonymous_page(struct fault_env *fe)
>  	if (check_stack_guard_page(vma, fe->address) < 0)
>  		return VM_FAULT_SIGSEGV;
>  
> +	/*
> +	 * Use __pte_alloc instead of pte_alloc_map, because we can't
> +	 * run pte_offset_map on the pmd, if an huge pmd could
> +	 * materialize from under us from a different thread.
> +	 */

This comment is a little bit funky.  Maybe:

"Use __pte_alloc() instead of pte_alloc_map().  We can't run
pte_offset_map() on pmds where a huge pmd might be created (from a
different thread)."

Could you also talk a bit about where it _is_ safe to call pte_alloc_map()?

> +	if (unlikely(pmd_none(*fe->pmd) &&
> +			__pte_alloc(vma->vm_mm, vma, fe->pmd, fe->address)))
> +		return VM_FAULT_OOM;

Should we just move this pmd_none() check in to __pte_alloc()?  You do
this same-style check at least twice.

> +	/* If an huge pmd materialized from under us just retry later */
> +	if (unlikely(pmd_trans_huge(*fe->pmd)))
> +		return 0;

Nit: please stop sprinkling unlikely() everywhere.  Is there some
concrete benefit to doing it here?  I really doubt the compiler needs
help putting the code for "return 0" out-of-line.

Why is it important to abort here?  Is this a small-page-only path?

> +static int pte_alloc_one_map(struct fault_env *fe)
> +{
> +	struct vm_area_struct *vma = fe->vma;
> +
> +	if (!pmd_none(*fe->pmd))
> +		goto map_pte;

So the calling convention here is...?  It looks like this has to be
called with fe->pmd == pmd_none().  If not, we assume it's pointing to a
pte page.  This can never be called on a huge pmd.  Right?

> +	if (fe->prealloc_pte) {
> +		smp_wmb(); /* See comment in __pte_alloc() */

Are we trying to make *this* cpu's write visible, or to see the write
from __pte_alloc()?  It seems like we're trying to see the write.  Isn't
smp_rmb() what we want for that?

> +		fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
> +		if (unlikely(!pmd_none(*fe->pmd))) {
> +			spin_unlock(fe->ptl);
> +			goto map_pte;
> +		}

Should we just make pmd_none() likely()?  That seems like it would save
about 20MB of unlikely()'s in the source.

> +		atomic_long_inc(&vma->vm_mm->nr_ptes);
> +		pmd_populate(vma->vm_mm, fe->pmd, fe->prealloc_pte);
> +		spin_unlock(fe->ptl);
> +		fe->prealloc_pte = 0;
> +	} else if (unlikely(__pte_alloc(vma->vm_mm, vma, fe->pmd,
> +					fe->address))) {
> +		return VM_FAULT_OOM;
> +	}
> +map_pte:
> +	if (unlikely(pmd_trans_huge(*fe->pmd)))
> +		return VM_FAULT_NOPAGE;

I think I need a refresher on the locking rules.  pte_offset_map*() is
unsafe to call on a huge pmd.  What in this context makes it impossible
for the pmd to get promoted after the check?

> +	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
> +			&fe->ptl);
> +	return 0;
> +}
> +
>  /**
>   * do_set_pte - setup new PTE entry for given page and add reverse page mapping.
>   *
>   * @fe: fault environment
> + * @memcg: memcg to charge page (only for private mappings)
>   * @page: page to map
>   *
> - * Caller must hold page table lock relevant for @fe->pte.

That's a bit screwy now because fe->pte might not exist.  Right?  I
thought the ptl was derived from the physical address of the pte page.
How can we have a lock for a physical address that doesn't exist yet?

> + * Caller must take care of unlocking fe->ptl, if fe->pte is non-NULL on return.
>   *
>   * Target users are page handler itself and implementations of
>   * vm_ops->map_pages.
>   */
> -void do_set_pte(struct fault_env *fe, struct page *page)
> +int do_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
> +		struct page *page)
>  {
>  	struct vm_area_struct *vma = fe->vma;
>  	bool write = fe->flags & FAULT_FLAG_WRITE;
>  	pte_t entry;
>  
> +	if (!fe->pte) {
> +		int ret = pte_alloc_one_map(fe);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (!pte_none(*fe->pte))
> +		return VM_FAULT_NOPAGE;

Oh, you've got to add another pte_none() check because you're deferring
the acquisition of the ptl lock?

>  	flush_icache_page(vma, page);
>  	entry = mk_pte(page, vma->vm_page_prot);
>  	if (write)
> @@ -2811,6 +2864,8 @@ void do_set_pte(struct fault_env *fe, struct page *page)
>  	if (write && !(vma->vm_flags & VM_SHARED)) {
>  		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>  		page_add_new_anon_rmap(page, vma, fe->address, false);
> +		mem_cgroup_commit_charge(page, memcg, false, false);
> +		lru_cache_add_active_or_unevictable(page, vma);
>  	} else {
>  		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
>  		page_add_file_rmap(page);
> @@ -2819,6 +2874,8 @@ void do_set_pte(struct fault_env *fe, struct page *page)
>  
>  	/* no need to invalidate: a not-present page won't be cached */
>  	update_mmu_cache(vma, fe->address, fe->pte);
> +
> +	return 0;
>  }
>  
>  static unsigned long fault_around_bytes __read_mostly =
> @@ -2885,19 +2942,17 @@ late_initcall(fault_around_debugfs);
>   * fault_around_pages() value (and therefore to page order).  This way it's
>   * easier to guarantee that we don't cross page table boundaries.
>   */
> -static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
> +static int do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
>  {
> -	unsigned long address = fe->address, start_addr, nr_pages, mask;
> -	pte_t *pte = fe->pte;
> +	unsigned long address = fe->address, nr_pages, mask;
>  	pgoff_t end_pgoff;
> -	int off;
> +	int off, ret = 0;
>  
>  	nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
>  	mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
>  
> -	start_addr = max(fe->address & mask, fe->vma->vm_start);
> -	off = ((fe->address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
> -	fe->pte -= off;
> +	fe->address = max(address & mask, fe->vma->vm_start);
> +	off = ((address - fe->address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
>  	start_pgoff -= off;

Considering what's in this patch already, I think I'd leave the trivial
local variable replacement for another patch.

>  	/*
> @@ -2905,30 +2960,33 @@ static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
>  	 *  or fault_around_pages() from start_pgoff, depending what is nearest.
>  	 */
>  	end_pgoff = start_pgoff -
> -		((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
> +		((fe->address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
>  		PTRS_PER_PTE - 1;
>  	end_pgoff = min3(end_pgoff, vma_pages(fe->vma) + fe->vma->vm_pgoff - 1,
>  			start_pgoff + nr_pages - 1);
>  
> -	/* Check if it makes any sense to call ->map_pages */
> -	fe->address = start_addr;
> -	while (!pte_none(*fe->pte)) {
> -		if (++start_pgoff > end_pgoff)
> -			goto out;
> -		fe->address += PAGE_SIZE;
> -		if (fe->address >= fe->vma->vm_end)
> -			goto out;
> -		fe->pte++;
> +	if (pmd_none(*fe->pmd))
> +		fe->prealloc_pte = pte_alloc_one(fe->vma->vm_mm, fe->address);
> +	fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
> +	if (fe->prealloc_pte) {
> +		pte_free(fe->vma->vm_mm, fe->prealloc_pte);
> +		fe->prealloc_pte = 0;
>  	}
> +	if (!fe->pte)
> +		goto out;

What does !fe->pte *mean* here?  No pte page?  No pte present?  Huge pte
present?

> -	fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
> +	/* check if the page fault is solved */
> +	fe->pte -= (fe->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
> +	if (!pte_none(*fe->pte))
> +		ret = VM_FAULT_NOPAGE;
> +	pte_unmap_unlock(fe->pte, fe->ptl);
>  out:
> -	/* restore fault_env */
> -	fe->pte = pte;
>  	fe->address = address;
> +	fe->pte = NULL;
> +	return ret;
>  }
>  
> -static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
> +static int do_read_fault(struct fault_env *fe, pgoff_t pgoff)
>  {
>  	struct vm_area_struct *vma = fe->vma;
>  	struct page *fault_page;
> @@ -2940,33 +2998,25 @@ static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
>  	 * something).
>  	 */
>  	if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
> -		fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
> -				&fe->ptl);
> -		do_fault_around(fe, pgoff);
> -		if (!pte_same(*fe->pte, orig_pte))
> -			goto unlock_out;
> -		pte_unmap_unlock(fe->pte, fe->ptl);
> +		ret = do_fault_around(fe, pgoff);
> +		if (ret)
> +			return ret;
>  	}
>  
>  	ret = __do_fault(fe, pgoff, NULL, &fault_page);
>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>  		return ret;
>  
> -	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address, &fe->ptl);
> -	if (unlikely(!pte_same(*fe->pte, orig_pte))) {
> +	ret |= do_set_pte(fe, NULL, fault_page);
> +	if (fe->pte)
>  		pte_unmap_unlock(fe->pte, fe->ptl);
> -		unlock_page(fault_page);
> -		page_cache_release(fault_page);
> -		return ret;
> -	}
> -	do_set_pte(fe, fault_page);
>  	unlock_page(fault_page);
> -unlock_out:
> -	pte_unmap_unlock(fe->pte, fe->ptl);
> +	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> +		page_cache_release(fault_page);
>  	return ret;
>  }
>  
> -static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
> +static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff)
>  {
>  	struct vm_area_struct *vma = fe->vma;
>  	struct page *fault_page, *new_page;
> @@ -2994,26 +3044,9 @@ static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
>  		copy_user_highpage(new_page, fault_page, fe->address, vma);
>  	__SetPageUptodate(new_page);
>  
> -	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
> -			&fe->ptl);
> -	if (unlikely(!pte_same(*fe->pte, orig_pte))) {
> +	ret |= do_set_pte(fe, memcg, new_page);
> +	if (fe->pte)
>  		pte_unmap_unlock(fe->pte, fe->ptl);
> -		if (fault_page) {
> -			unlock_page(fault_page);
> -			page_cache_release(fault_page);
> -		} else {
> -			/*
> -			 * The fault handler has no page to lock, so it holds
> -			 * i_mmap_lock for read to protect against truncate.
> -			 */
> -			i_mmap_unlock_read(vma->vm_file->f_mapping);
> -		}
> -		goto uncharge_out;
> -	}
> -	do_set_pte(fe, new_page);
> -	mem_cgroup_commit_charge(new_page, memcg, false, false);
> -	lru_cache_add_active_or_unevictable(new_page, vma);
> -	pte_unmap_unlock(fe->pte, fe->ptl);
>  	if (fault_page) {
>  		unlock_page(fault_page);
>  		page_cache_release(fault_page);
> @@ -3024,6 +3057,8 @@ static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
>  		 */
>  		i_mmap_unlock_read(vma->vm_file->f_mapping);
>  	}
> +	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> +		goto uncharge_out;
>  	return ret;
>  uncharge_out:
>  	mem_cgroup_cancel_charge(new_page, memcg, false);
> @@ -3031,7 +3066,7 @@ uncharge_out:
>  	return ret;
>  }
>  
> -static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
> +static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff)
>  {
>  	struct vm_area_struct *vma = fe->vma;
>  	struct page *fault_page;
> @@ -3057,16 +3092,15 @@ static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
>  		}
>  	}
>  
> -	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
> -			&fe->ptl);
> -	if (unlikely(!pte_same(*fe->pte, orig_pte))) {
> +	ret |= do_set_pte(fe, NULL, fault_page);
> +	if (fe->pte)
>  		pte_unmap_unlock(fe->pte, fe->ptl);
> +	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
> +					VM_FAULT_RETRY))) {
>  		unlock_page(fault_page);
>  		page_cache_release(fault_page);
>  		return ret;
>  	}
> -	do_set_pte(fe, fault_page);
> -	pte_unmap_unlock(fe->pte, fe->ptl);
>  
>  	if (set_page_dirty(fault_page))
>  		dirtied = 1;
> @@ -3098,21 +3132,19 @@ static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
>   * The mmap_sem may have been released depending on flags and our
>   * return value.  See filemap_fault() and __lock_page_or_retry().
>   */
> -static int do_fault(struct fault_env *fe, pte_t orig_pte)
> +static int do_fault(struct fault_env *fe)
>  {
>  	struct vm_area_struct *vma = fe->vma;
> -	pgoff_t pgoff = (((fe->address & PAGE_MASK)
> -			- vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> +	pgoff_t pgoff = linear_page_index(vma, fe->address);

Looks like another trivial cleanup.

> -	pte_unmap(fe->pte);
>  	/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
>  	if (!vma->vm_ops->fault)
>  		return VM_FAULT_SIGBUS;
>  	if (!(fe->flags & FAULT_FLAG_WRITE))
> -		return do_read_fault(fe, pgoff,	orig_pte);
> +		return do_read_fault(fe, pgoff);
>  	if (!(vma->vm_flags & VM_SHARED))
> -		return do_cow_fault(fe, pgoff, orig_pte);
> -	return do_shared_fault(fe, pgoff, orig_pte);
> +		return do_cow_fault(fe, pgoff);
> +	return do_shared_fault(fe, pgoff);
>  }
>  
>  static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
> @@ -3252,37 +3284,62 @@ static int wp_huge_pmd(struct fault_env *fe, pmd_t orig_pmd)
>   * with external mmu caches can use to update those (ie the Sparc or
>   * PowerPC hashed page tables that act as extended TLBs).
>   *
> - * We enter with non-exclusive mmap_sem (to exclude vma changes,
> - * but allow concurrent faults), and pte mapped but not yet locked.
> - * We return with pte unmapped and unlocked.
> + * We enter with non-exclusive mmap_sem (to exclude vma changes, but allow
> + * concurrent faults).
>   *
> - * The mmap_sem may have been released depending on flags and our
> - * return value.  See filemap_fault() and __lock_page_or_retry().
> + * The mmap_sem may have been released depending on flags and our return value.
> + * See filemap_fault() and __lock_page_or_retry().
>   */
>  static int handle_pte_fault(struct fault_env *fe)
>  {
>  	pte_t entry;
>  
> +	/* If an huge pmd materialized from under us just retry later */
> +	if (unlikely(pmd_trans_huge(*fe->pmd)))
> +		return 0;
> +
> +	if (unlikely(pmd_none(*fe->pmd))) {
> +		/*
> +		 * Leave __pte_alloc() until later: because vm_ops->fault may
> +		 * want to allocate huge page, and if we expose page table
> +		 * for an instant, it will be difficult to retract from
> +		 * concurrent faults and from rmap lookups.
> +		 */
> +	} else {
> +		/*
> +		 * A regular pmd is established and it can't morph into a huge
> +		 * pmd from under us anymore at this point because we hold the
> +		 * mmap_sem read mode and khugepaged takes it in write mode.
> +		 * So now it's safe to run pte_offset_map().
> +		 */
> +		fe->pte = pte_offset_map(fe->pmd, fe->address);
> +
> +		entry = *fe->pte;
> +		barrier();

Barrier because....?

> +		if (pte_none(entry)) {
> +			pte_unmap(fe->pte);
> +			fe->pte = NULL;
> +		}
> +	}
> +
>  	/*
>  	 * some architectures can have larger ptes than wordsize,
>  	 * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y,
>  	 * so READ_ONCE or ACCESS_ONCE cannot guarantee atomic accesses.
> -	 * The code below just needs a consistent view for the ifs and
> +	 * The code above just needs a consistent view for the ifs and
>  	 * we later double check anyway with the ptl lock held. So here
>  	 * a barrier will do.
>  	 */

Looks like the barrier got moved, but not the comment.

Man, that's a lot of code.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 13/28] thp: support file pages in zap_huge_pmd()
  2016-02-11 14:21 ` [PATCHv2 13/28] thp: support file pages in zap_huge_pmd() Kirill A. Shutemov
@ 2016-02-12 18:33   ` Dave Hansen
  2016-02-16 10:00     ` Kirill A. Shutemov
  0 siblings, 1 reply; 55+ messages in thread
From: Dave Hansen @ 2016-02-12 18:33 UTC (permalink / raw)
  To: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm

On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
> For file pages we don't deposit page table on mapping: no need to
> withdraw it.

I thought the deposit thing was to guarantee we could always do a PMD
split.  It still seems like if you wanted to split a huge-tmpfs page,
you'd need to first split the PMD which might need the deposited one.

Why not?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 15/28] thp: handle file COW faults
  2016-02-11 14:21 ` [PATCHv2 15/28] thp: handle file COW faults Kirill A. Shutemov
@ 2016-02-12 18:36   ` Dave Hansen
  2016-02-16 10:08     ` Kirill A. Shutemov
  0 siblings, 1 reply; 55+ messages in thread
From: Dave Hansen @ 2016-02-12 18:36 UTC (permalink / raw)
  To: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm

On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
> File COW for THP is handled on pte level: just split the pmd.

More changelog.  More comments, please.

We don't want to COW THP's because we'll waste memory?  A COW that we
could handle with 4k, we would have to handle with 2M, and that's
inefficient and high-latency?

Seems like a good idea to me.  It would just be nice to ensure every
reviewer doesn't have to think their way through it.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 17/28] thp: skip file huge pmd on copy_huge_pmd()
  2016-02-11 14:21 ` [PATCHv2 17/28] thp: skip file huge pmd on copy_huge_pmd() Kirill A. Shutemov
@ 2016-02-12 18:42   ` Dave Hansen
  2016-02-16 10:14     ` Kirill A. Shutemov
  0 siblings, 1 reply; 55+ messages in thread
From: Dave Hansen @ 2016-02-12 18:42 UTC (permalink / raw)
  To: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm

On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
> File pmds can be safely skip on copy_huge_pmd(), we can re-fault them
> later. COW for file mappings handled on pte level.

Is this different from 4k pages?  I figured we might skip copying
file-backed ptes on fork, but I couldn't find the code.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 18/28] thp: prepare change_huge_pmd() for file thp
  2016-02-11 14:21 ` [PATCHv2 18/28] thp: prepare change_huge_pmd() for file thp Kirill A. Shutemov
@ 2016-02-12 18:48   ` Dave Hansen
  2016-02-16 10:15     ` Kirill A. Shutemov
  0 siblings, 1 reply; 55+ messages in thread
From: Dave Hansen @ 2016-02-12 18:48 UTC (permalink / raw)
  To: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm

On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
> change_huge_pmd() has assert which is not relvant for file page.
> For shared mapping it's perfectly fine to have page table entry
> writable, without explicit mkwrite.

Should we have the bug only trigger on anonymous VMAs instead of
removing it?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 19/28] thp: run vma_adjust_trans_huge() outside i_mmap_rwsem
  2016-02-11 14:21 ` [PATCHv2 19/28] thp: run vma_adjust_trans_huge() outside i_mmap_rwsem Kirill A. Shutemov
@ 2016-02-12 18:50   ` Dave Hansen
  2016-02-16 10:16     ` Kirill A. Shutemov
  2016-02-16 15:49   ` Dave Hansen
  1 sibling, 1 reply; 55+ messages in thread
From: Dave Hansen @ 2016-02-12 18:50 UTC (permalink / raw)
  To: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm

On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
> vma_addjust_trans_huge() splits pmd if it's crossing VMA boundary.
> During split we munlock the huge page which requires rmap walk.
> rmap wants to take the lock on its own.

Which lock are you talking about here?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 02/28] rmap: introduce rmap_walk_locked()
  2016-02-11 18:52   ` Andi Kleen
@ 2016-02-16  9:36     ` Kirill A. Shutemov
  0 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-16  9:36 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Hugh Dickins, Andrea Arcangeli, Andrew Morton, Dave Hansen,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm

On Thu, Feb 11, 2016 at 10:52:08AM -0800, Andi Kleen wrote:
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> 
> > rmap_walk_locked() is the same as rmap_walk(), but caller takes care
> > about relevant rmap lock.
> >
> > It's preparation to switch THP splitting from custom rmap walk in
> > freeze_page()/unfreeze_page() to generic one.
> 
> Would be better to move all locking into the callers, with an
> appropiate helper for users who don't want to deal with it.
> Conditional locking based on flags is always tricky.

Hm. That's kinda tricky for rmap_walk_ksm()..

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 04/28] mm: make remove_migration_ptes() beyond mm/migration.c
  2016-02-12 16:54   ` Dave Hansen
@ 2016-02-16  9:54     ` Kirill A. Shutemov
  2016-02-16 15:29       ` Dave Hansen
  0 siblings, 1 reply; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-16  9:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Hugh Dickins, Andrea Arcangeli, Andrew Morton, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Jerome Marchand, Yang Shi,
	Sasha Levin, linux-kernel, linux-mm

On Fri, Feb 12, 2016 at 08:54:58AM -0800, Dave Hansen wrote:
> On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
> > We also shouldn't try to mlock() pte-mapped huge pages: pte-mapped THP
> > pages are never mlocked.
> 
> That's kinda subtle.  Can you explain more?
> 
> If we did the following:
> 
> 	ptr = mmap(NULL, 512*PAGE_SIZE, ...);
> 	mlock(ptr, 512*PAGE_SIZE);
> 	fork();
> 	munmap(ptr + 100 * PAGE_SIZE, PAGE_SIZE);
> 
> I'd expect to get two processes, each mapping the same compound THP, one
> with a PMD and the other with 511 ptes and one hole.  Is there something
> different that goes on?

I'm not sure what exactly you want to ask with this code, but it will have
the following result:

 - After fork(), the process will split the pmd in munlock(). For file THP,
   splitting the pmd means clearing it out. split_huge_pmd() would munlock
   the page, as we do for anon THP;

 - In the child process the page is never mapped, as mlock() is not inherited
   and we don't copy page tables for a shared VMA since they can be re-faulted
   later.

The basic semantics for mlock()ed file THP would be the same as for anon
THP: we only keep the page mlocked as long as it's mapped only with PMDs.
This way it's relatively simple to make sure that we don't leak mlocked
pages.
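
To illustrate the rule (a sketch only -- the helper name is made up, not
something from the patchset):

	/*
	 * Illustrative check of the invariant described above: a file THP
	 * may stay mlocked only while every mapping of it is a PMD
	 * mapping (ignoring the fully-unmapped case here).
	 */
	static bool file_thp_mlockable(struct page *page)
	{
		return total_mapcount(page) == compound_mapcount(page);
	}

As soon as the page gets pte-mapped anywhere, we munlock it instead of
trying to track partial mlock state.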

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 13/28] thp: support file pages in zap_huge_pmd()
  2016-02-12 18:33   ` Dave Hansen
@ 2016-02-16 10:00     ` Kirill A. Shutemov
  2016-02-16 15:31       ` Dave Hansen
  0 siblings, 1 reply; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-16 10:00 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Hugh Dickins, Andrea Arcangeli, Andrew Morton, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Jerome Marchand, Yang Shi,
	Sasha Levin, linux-kernel, linux-mm

On Fri, Feb 12, 2016 at 10:33:37AM -0800, Dave Hansen wrote:
> On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
> > For file pages we don't deposit a page table on mapping: no need to
> > withdraw it.
> 
> I thought the deposit thing was to guarantee we could always do a PMD
> split.  It still seems like if you wanted to split a huge-tmpfs page,
> you'd need to first split the PMD which might need the deposited one.
> 
> Why not?

For file THP, split_huge_pmd() is implemented by clearing out the pmd: we
can set up and fill the pte table later. Therefore there is no need to
deposit a page table -- we would not use it. DAX does the same.
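
The file-pmd case in __split_huge_pmd_locked() then boils down to roughly
this (a simplified sketch, not the exact hunk from the patch; 'vma', 'pmd'
and 'address' are the function's usual arguments):

	if (!vma_is_anonymous(vma)) {
		struct mm_struct *mm = vma->vm_mm;
		unsigned long haddr = address & HPAGE_PMD_MASK;
		struct page *page;
		pmd_t _pmd;

		/*
		 * File THP: no page table was deposited at map time, so
		 * there is nothing to withdraw.  Just clear the pmd; the
		 * ptes can be re-faulted from the page cache later.
		 */
		_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
		page = pmd_page(_pmd);
		page_remove_rmap(page, true);	/* drop the PMD mapping */
		put_page(page);
		add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
		return;
	}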

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 15/28] thp: handle file COW faults
  2016-02-12 18:36   ` Dave Hansen
@ 2016-02-16 10:08     ` Kirill A. Shutemov
  0 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-16 10:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Hugh Dickins, Andrea Arcangeli, Andrew Morton, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Jerome Marchand, Yang Shi,
	Sasha Levin, linux-kernel, linux-mm

On Fri, Feb 12, 2016 at 10:36:25AM -0800, Dave Hansen wrote:
> On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
> > File COW for THP is handled on the pte level: just split the pmd.
> 
> More changelog.  More comments, please.

Okay, I'll add more.

> We don't want to COW THP's because we'll waste memory?  A COW that we
> could handle with 4k, we would have to handle with 2M, and that's
> inefficient and high-latency?

All of the above.

It's not clear how beneficial THP file COW mappings would be. And it would
require some code to make them work.

I think at some point we can consider teaching khugepaged to collapse such
pages, but allocating a huge page on fault is probably overkill.

> Seems like a good idea to me.  It would just be nice to ensure every
> reviewer doesn't have to think their way through it.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 17/28] thp: skip file huge pmd on copy_huge_pmd()
  2016-02-12 18:42   ` Dave Hansen
@ 2016-02-16 10:14     ` Kirill A. Shutemov
  2016-02-16 15:46       ` Dave Hansen
  0 siblings, 1 reply; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-16 10:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Hugh Dickins, Andrea Arcangeli, Andrew Morton, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Jerome Marchand, Yang Shi,
	Sasha Levin, linux-kernel, linux-mm

On Fri, Feb 12, 2016 at 10:42:09AM -0800, Dave Hansen wrote:
> On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
> > File pmds can be safely skipped on copy_huge_pmd(); we can re-fault them
> > later. COW for file mappings is handled on the pte level.
> 
> Is this different from 4k pages?  I figured we might skip copying
> file-backed ptes on fork, but I couldn't find the code.

Currently, we only filter out on a per-VMA basis. See the first comment in
copy_page_range().

Here we handle PMD-mapped file pages in a COW mapping. A file THP can be
mapped into a COW mapping as the result of a read page fault.
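
So the patch effectively adds an early bail-out along these lines to
copy_huge_pmd() (a sketch, not the literal hunk):

	/*
	 * File pmds can always be re-faulted from the page cache and COW
	 * for file mappings is handled on the pte level, so there is
	 * nothing we have to copy for a non-anonymous VMA.
	 */
	if (!vma_is_anonymous(vma))
		return 0;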

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 18/28] thp: prepare change_huge_pmd() for file thp
  2016-02-12 18:48   ` Dave Hansen
@ 2016-02-16 10:15     ` Kirill A. Shutemov
  0 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-16 10:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Hugh Dickins, Andrea Arcangeli, Andrew Morton, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Jerome Marchand, Yang Shi,
	Sasha Levin, linux-kernel, linux-mm

On Fri, Feb 12, 2016 at 10:48:59AM -0800, Dave Hansen wrote:
> On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
> > change_huge_pmd() has an assert which is not relevant for file pages.
> > For a shared mapping it's perfectly fine to have the page table entry
> > writable, without explicit mkwrite.
> 
> Should we have the bug only trigger on anonymous VMAs instead of
> removing it?

Makes sense.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 19/28] thp: run vma_adjust_trans_huge() outside i_mmap_rwsem
  2016-02-12 18:50   ` Dave Hansen
@ 2016-02-16 10:16     ` Kirill A. Shutemov
  0 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-16 10:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Hugh Dickins, Andrea Arcangeli, Andrew Morton, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Jerome Marchand, Yang Shi,
	Sasha Levin, linux-kernel, linux-mm

On Fri, Feb 12, 2016 at 10:50:02AM -0800, Dave Hansen wrote:
> On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
> > vma_adjust_trans_huge() splits the pmd if it's crossing a VMA boundary.
> > During the split we munlock the huge page, which requires an rmap walk.
> > rmap wants to take the lock on its own.
> 
> Which lock are you talking about here?

i_mmap_rwsem. It's in the patch subject. I'll update the body.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 08/28] mm: postpone page table allocation until do_set_pte()
  2016-02-12 17:44   ` Dave Hansen
@ 2016-02-16 14:26     ` Kirill A. Shutemov
  2016-02-16 17:17       ` Dave Hansen
  2016-02-16 17:38       ` Dave Hansen
  0 siblings, 2 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-16 14:26 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli,
	Andrew Morton, Vlastimil Babka, Christoph Lameter,
	Naoya Horiguchi, Jerome Marchand, Yang Shi, Sasha Levin,
	linux-kernel, linux-mm

On Fri, Feb 12, 2016 at 09:44:41AM -0800, Dave Hansen wrote:
> On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index ca99c0ecf52e..172f4d8e798d 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -265,6 +265,7 @@ struct fault_env {
> >  	pmd_t *pmd;
> >  	pte_t *pte;
> >  	spinlock_t *ptl;
> > +	pgtable_t prealloc_pte;
> >  };
> 
> If we're going to do this fault_env thing, we need some heavy-duty
> comments on what the different fields do and what they mean.  We don't
> want to get in to a situation where we're doing
> 
> 	void fault_foo(struct fault_env *fe);..
> 
> and then we change the internals of fault_foo() to manipulate a
> different set of fe->* variables, or change assumptions, then have
> callers randomly break.
> 
> One _nice_ part of passing all the arguments explicitly is that it makes
> you go visit all the call sites and think about how the conventions change.
> 
> It just makes me nervous.
> 
> The semantics of having both a ->pte and ->pmd need to be very clearly
> spelled out too, please.

I've updated this to:

/*
 * Page fault context: passed through the page fault handler instead of an
 * endless list of function arguments.
 */
struct fault_env {
	struct vm_area_struct *vma;	/* Target VMA */
	unsigned long address;		/* Faulting virtual address */
	unsigned int flags;		/* FAULT_FLAG_xxx flags */
	pmd_t *pmd;			/* Pointer to pmd entry matching
					 * the 'address'
					 */
	pte_t *pte;			/* Pointer to pte entry matching
					 * the 'address'. NULL if the page
					 * table hasn't been allocated.
					 */
	spinlock_t *ptl;		/* Page table lock.
					 * Protects pte page table if 'pte'
					 * is not NULL, otherwise pmd.
					 */
	pgtable_t prealloc_pte;		/* Pre-allocated pte page table.
					 * vm_ops->map_pages() calls
					 * do_set_pte() from atomic context.
					 * do_fault_around() pre-allocates
					 * page table to avoid allocation from
					 * atomic context.
					 */
};

> 
> >  /*
> > @@ -559,7 +560,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> >  	return pte;
> >  }
> >  
> > -void do_set_pte(struct fault_env *fe, struct page *page);
> > +int do_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
> > +		struct page *page);
> >  #endif
> 
> I think do_set_pte() might be due for a new name if it's going to be
> doing allocations internally.

Any suggestions?

> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 28b3875969a8..ba8150d6dc33 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2146,11 +2146,6 @@ void filemap_map_pages(struct fault_env *fe,
> >  			start_pgoff) {
> >  		if (iter.index > end_pgoff)
> >  			break;
> > -		fe->pte += iter.index - last_pgoff;
> > -		fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
> > -		last_pgoff = iter.index;
> > -		if (!pte_none(*fe->pte))
> > -			goto next;
> >  repeat:
> >  		page = radix_tree_deref_slot(slot);
> >  		if (unlikely(!page))
> > @@ -2187,7 +2182,17 @@ repeat:
> >  
> >  		if (file->f_ra.mmap_miss > 0)
> >  			file->f_ra.mmap_miss--;
> > -		do_set_pte(fe, page);
> > +
> > +		fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
> > +		if (fe->pte)
> > +			fe->pte += iter.index - last_pgoff;
> > +		last_pgoff = iter.index;
> > +		if (do_set_pte(fe, NULL, page)) {
> > +			/* failed to setup page table: giving up */
> > +			if (!fe->pte)
> > +				break;
> > +			goto unlock;
> > +		}
> 
> What's the failure here, though?

At this point in the patchset it never fails: allocation failure is not
possible as we pre-allocate the page table for faultaround.

Later, after do_set_pmd() is introduced, a huge page can be mapped here,
either by us or under us.

I'll update the comment.

> Failed to set PTE or failed to _allocate_ pte page?  One of them is a
> harmless race setting the pte and the other is a pretty crummy
> allocation failure.  Do we really not want to differentiate these?

Not really. That's a speculative codepath: do_read_fault() will check
whether faultaround solved the fault or not.

> This also throws away the spiffy new error code that comes baqck from
> do_set_pte().  Is that OK?

Yes. We will try harder in do_read_fault(), with all the proper locks and
error handling, once the faultaround code has failed to solve the page fault.

> >  		unlock_page(page);
> >  		goto next;
> >  unlock:
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f8f9549fac86..0de6f176674d 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2661,8 +2661,6 @@ static int do_anonymous_page(struct fault_env *fe)
> >  	struct page *page;
> >  	pte_t entry;
> >  
> > -	pte_unmap(fe->pte);
> > -
> >  	/* File mapping without ->vm_ops ? */
> >  	if (vma->vm_flags & VM_SHARED)
> >  		return VM_FAULT_SIGBUS;
> > @@ -2671,6 +2669,18 @@ static int do_anonymous_page(struct fault_env *fe)
> >  	if (check_stack_guard_page(vma, fe->address) < 0)
> >  		return VM_FAULT_SIGSEGV;
> >  
> > +	/*
> > +	 * Use __pte_alloc instead of pte_alloc_map, because we can't
> > +	 * run pte_offset_map on the pmd, if an huge pmd could
> > +	 * materialize from under us from a different thread.
> > +	 */
> 
> This comment is a little bit funky.  Maybe:
> 
> "Use __pte_alloc() instead of pte_alloc_map().  We can't run
> pte_offset_map() on pmds where a huge pmd might be created (from a
> different thread)."
> 
> Could you also talk a bit about where it _is_ safe to call pte_alloc_map()?

That comment was just moved from __handle_mm_fault().

Would this be okay:

        /*
         * Use __pte_alloc() instead of pte_alloc_map().  We can't run
         * pte_offset_map() on pmds where a huge pmd might be created (from
         * a different thread).
         *
         * pte_alloc_map() is safe to use under down_write(mmap_sem) or when
         * parallel threads are excluded by other means.
         */

> > +	if (unlikely(pmd_none(*fe->pmd) &&
> > +			__pte_alloc(vma->vm_mm, vma, fe->pmd, fe->address)))
> > +		return VM_FAULT_OOM;
> 
> Should we just move this pmd_none() check in to __pte_alloc()?  You do
> this same-style check at least twice.

We have it there. The check here is speculative, to avoid taking the ptl.

> > +	/* If an huge pmd materialized from under us just retry later */
> > +	if (unlikely(pmd_trans_huge(*fe->pmd)))
> > +		return 0;
> 
> Nit: please stop sprinkling unlikely() everywhere.  Is there some
> concrete benefit to doing it here?  I really doubt the compiler needs
> help putting the code for "return 0" out-of-line.
> 
> Why is it important to abort here?  Is this a small-page-only path?

This unlikely() was moved from __handle_mm_fault(). I didn't put much
consideration into it.
 
> > +static int pte_alloc_one_map(struct fault_env *fe)
> > +{
> > +	struct vm_area_struct *vma = fe->vma;
> > +
> > +	if (!pmd_none(*fe->pmd))
> > +		goto map_pte;
> 
> So the calling convention here is...?  It looks like this has to be
> called with fe->pmd == pmd_none().  If not, we assume it's pointing to a
> pte page.  This can never be called on a huge pmd.  Right?

It's not called under the ptl, so the pmd can be filled under us. There's a
huge pmd check in the 'map_pte' goto path.
 
> > +	if (fe->prealloc_pte) {
> > +		smp_wmb(); /* See comment in __pte_alloc() */
> 
> Are we trying to make *this* cpu's write visible, or to see the write
> from __pte_alloc()?  It seems like we're trying to see the write.  Isn't
> smp_rmb() what we want for that?

See commit 362a61ad6119.

I think a more logical way would be to put it into do_fault_around(), just
after pte_alloc_one().
 
> > +		fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
> > +		if (unlikely(!pmd_none(*fe->pmd))) {
> > +			spin_unlock(fe->ptl);
> > +			goto map_pte;
> > +		}
> 
> Should we just make pmd_none() likely()?  That seems like it would save
> about 20MB of unlikely()'s in the source.

Heh.

> > +		atomic_long_inc(&vma->vm_mm->nr_ptes);
> > +		pmd_populate(vma->vm_mm, fe->pmd, fe->prealloc_pte);
> > +		spin_unlock(fe->ptl);
> > +		fe->prealloc_pte = 0;
> > +	} else if (unlikely(__pte_alloc(vma->vm_mm, vma, fe->pmd,
> > +					fe->address))) {
> > +		return VM_FAULT_OOM;
> > +	}
> > +map_pte:
> > +	if (unlikely(pmd_trans_huge(*fe->pmd)))
> > +		return VM_FAULT_NOPAGE;
> 
> I think I need a refresher on the locking rules.  pte_offset_map*() is
> unsafe to call on a huge pmd.  What in this context makes it impossible
> for the pmd to get promoted after the check?

Do you mean what stops the pte page table from being collapsed into a huge
pmd? The answer is mmap_sem. The collapse code takes the lock for write to
be able to retract the page table.
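
To spell the exclusion out (illustrative pseudo-code only, the function
names are made up):

	static void fault_side(struct mm_struct *mm)
	{
		down_read(&mm->mmap_sem);
		/*
		 * While mmap_sem is held for read, a pte page table under
		 * a non-huge pmd cannot be collapsed into a huge pmd, so
		 * pte_offset_map*() on that pmd is safe.
		 */
		up_read(&mm->mmap_sem);
	}

	static void khugepaged_retract_side(struct mm_struct *mm)
	{
		down_write(&mm->mmap_sem);
		/* retract the pte page table and install a huge pmd */
		up_write(&mm->mmap_sem);
	}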
 
> > +	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
> > +			&fe->ptl);
> > +	return 0;
> > +}
> > +
> >  /**
> >   * do_set_pte - setup new PTE entry for given page and add reverse page mapping.
> >   *
> >   * @fe: fault environment
> > + * @memcg: memcg to charge page (only for private mappings)
> >   * @page: page to map
> >   *
> > - * Caller must hold page table lock relevant for @fe->pte.
> 
> That's a bit screwy now because fe->pte might not exist.  Right?  I

[ you're commenting on a deleted line ]

Right.

> thought the ptl was derived from the physical address of the pte page.
> How can we have a lock for a physical address that doesn't exist yet?

If fe->pte is NULL, pte_alloc_one_map() would take care of allocating,
mapping and locking the page table.
 
> > + * Caller must take care of unlocking fe->ptl, if fe->pte is non-NULL on return.
> >   *
> >   * Target users are page handler itself and implementations of
> >   * vm_ops->map_pages.
> >   */
> > -void do_set_pte(struct fault_env *fe, struct page *page)
> > +int do_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
> > +		struct page *page)
> >  {
> >  	struct vm_area_struct *vma = fe->vma;
> >  	bool write = fe->flags & FAULT_FLAG_WRITE;
> >  	pte_t entry;
> >  
> > +	if (!fe->pte) {
> > +		int ret = pte_alloc_one_map(fe);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	if (!pte_none(*fe->pte))
> > +		return VM_FAULT_NOPAGE;
> 
> Oh, you've got to add another pte_none() check because you're deferring
> the acquisition of the ptl lock?

Yes, we need to re-check once the ptl is taken.

> >  	flush_icache_page(vma, page);
> >  	entry = mk_pte(page, vma->vm_page_prot);
> >  	if (write)
> > @@ -2811,6 +2864,8 @@ void do_set_pte(struct fault_env *fe, struct page *page)
> >  	if (write && !(vma->vm_flags & VM_SHARED)) {
> >  		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> >  		page_add_new_anon_rmap(page, vma, fe->address, false);
> > +		mem_cgroup_commit_charge(page, memcg, false, false);
> > +		lru_cache_add_active_or_unevictable(page, vma);
> >  	} else {
> >  		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
> >  		page_add_file_rmap(page);
> > @@ -2819,6 +2874,8 @@ void do_set_pte(struct fault_env *fe, struct page *page)
> >  
> >  	/* no need to invalidate: a not-present page won't be cached */
> >  	update_mmu_cache(vma, fe->address, fe->pte);
> > +
> > +	return 0;
> >  }
> >  
> >  static unsigned long fault_around_bytes __read_mostly =
> > @@ -2885,19 +2942,17 @@ late_initcall(fault_around_debugfs);
> >   * fault_around_pages() value (and therefore to page order).  This way it's
> >   * easier to guarantee that we don't cross page table boundaries.
> >   */
> > -static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
> > +static int do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
> >  {
> > -	unsigned long address = fe->address, start_addr, nr_pages, mask;
> > -	pte_t *pte = fe->pte;
> > +	unsigned long address = fe->address, nr_pages, mask;
> >  	pgoff_t end_pgoff;
> > -	int off;
> > +	int off, ret = 0;
> >  
> >  	nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
> >  	mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
> >  
> > -	start_addr = max(fe->address & mask, fe->vma->vm_start);
> > -	off = ((fe->address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
> > -	fe->pte -= off;
> > +	fe->address = max(address & mask, fe->vma->vm_start);
> > +	off = ((address - fe->address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
> >  	start_pgoff -= off;
> 
> Considering what's in this patch already, I think I'd leave the trivial
> local variable replacement for another patch.

fe->address is not a local variable: it gets passed into ->map_pages().

> >  	/*
> > @@ -2905,30 +2960,33 @@ static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
> >  	 *  or fault_around_pages() from start_pgoff, depending what is nearest.
> >  	 */
> >  	end_pgoff = start_pgoff -
> > -		((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
> > +		((fe->address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
> >  		PTRS_PER_PTE - 1;
> >  	end_pgoff = min3(end_pgoff, vma_pages(fe->vma) + fe->vma->vm_pgoff - 1,
> >  			start_pgoff + nr_pages - 1);
> >  
> > -	/* Check if it makes any sense to call ->map_pages */
> > -	fe->address = start_addr;
> > -	while (!pte_none(*fe->pte)) {
> > -		if (++start_pgoff > end_pgoff)
> > -			goto out;
> > -		fe->address += PAGE_SIZE;
> > -		if (fe->address >= fe->vma->vm_end)
> > -			goto out;
> > -		fe->pte++;
> > +	if (pmd_none(*fe->pmd))
> > +		fe->prealloc_pte = pte_alloc_one(fe->vma->vm_mm, fe->address);
> > +	fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
> > +	if (fe->prealloc_pte) {
> > +		pte_free(fe->vma->vm_mm, fe->prealloc_pte);
> > +		fe->prealloc_pte = 0;
> >  	}
> > +	if (!fe->pte)
> > +		goto out;
> 
> What does !fe->pte *mean* here?  No pte page?  No pte present?  Huge pte
> present?

A huge pmd is mapped.

Comment added.

> > -	fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
> > +	/* check if the page fault is solved */
> > +	fe->pte -= (fe->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
> > +	if (!pte_none(*fe->pte))
> > +		ret = VM_FAULT_NOPAGE;
> > +	pte_unmap_unlock(fe->pte, fe->ptl);
> >  out:
> > -	/* restore fault_env */
> > -	fe->pte = pte;
> >  	fe->address = address;
> > +	fe->pte = NULL;
> > +	return ret;
> >  }
> >  
> > -static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
> > +static int do_read_fault(struct fault_env *fe, pgoff_t pgoff)
> >  {
> >  	struct vm_area_struct *vma = fe->vma;
> >  	struct page *fault_page;
> > @@ -2940,33 +2998,25 @@ static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
> >  	 * something).
> >  	 */
> >  	if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
> > -		fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
> > -				&fe->ptl);
> > -		do_fault_around(fe, pgoff);
> > -		if (!pte_same(*fe->pte, orig_pte))
> > -			goto unlock_out;
> > -		pte_unmap_unlock(fe->pte, fe->ptl);
> > +		ret = do_fault_around(fe, pgoff);
> > +		if (ret)
> > +			return ret;
> >  	}
> >  
> >  	ret = __do_fault(fe, pgoff, NULL, &fault_page);
> >  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> >  		return ret;
> >  
> > -	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address, &fe->ptl);
> > -	if (unlikely(!pte_same(*fe->pte, orig_pte))) {
> > +	ret |= do_set_pte(fe, NULL, fault_page);
> > +	if (fe->pte)
> >  		pte_unmap_unlock(fe->pte, fe->ptl);
> > -		unlock_page(fault_page);
> > -		page_cache_release(fault_page);
> > -		return ret;
> > -	}
> > -	do_set_pte(fe, fault_page);
> >  	unlock_page(fault_page);
> > -unlock_out:
> > -	pte_unmap_unlock(fe->pte, fe->ptl);
> > +	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> > +		page_cache_release(fault_page);
> >  	return ret;
> >  }
> >  
> > -static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
> > +static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff)
> >  {
> >  	struct vm_area_struct *vma = fe->vma;
> >  	struct page *fault_page, *new_page;
> > @@ -2994,26 +3044,9 @@ static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
> >  		copy_user_highpage(new_page, fault_page, fe->address, vma);
> >  	__SetPageUptodate(new_page);
> >  
> > -	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
> > -			&fe->ptl);
> > -	if (unlikely(!pte_same(*fe->pte, orig_pte))) {
> > +	ret |= do_set_pte(fe, memcg, new_page);
> > +	if (fe->pte)
> >  		pte_unmap_unlock(fe->pte, fe->ptl);
> > -		if (fault_page) {
> > -			unlock_page(fault_page);
> > -			page_cache_release(fault_page);
> > -		} else {
> > -			/*
> > -			 * The fault handler has no page to lock, so it holds
> > -			 * i_mmap_lock for read to protect against truncate.
> > -			 */
> > -			i_mmap_unlock_read(vma->vm_file->f_mapping);
> > -		}
> > -		goto uncharge_out;
> > -	}
> > -	do_set_pte(fe, new_page);
> > -	mem_cgroup_commit_charge(new_page, memcg, false, false);
> > -	lru_cache_add_active_or_unevictable(new_page, vma);
> > -	pte_unmap_unlock(fe->pte, fe->ptl);
> >  	if (fault_page) {
> >  		unlock_page(fault_page);
> >  		page_cache_release(fault_page);
> > @@ -3024,6 +3057,8 @@ static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
> >  		 */
> >  		i_mmap_unlock_read(vma->vm_file->f_mapping);
> >  	}
> > +	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> > +		goto uncharge_out;
> >  	return ret;
> >  uncharge_out:
> >  	mem_cgroup_cancel_charge(new_page, memcg, false);
> > @@ -3031,7 +3066,7 @@ uncharge_out:
> >  	return ret;
> >  }
> >  
> > -static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
> > +static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff)
> >  {
> >  	struct vm_area_struct *vma = fe->vma;
> >  	struct page *fault_page;
> > @@ -3057,16 +3092,15 @@ static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
> >  		}
> >  	}
> >  
> > -	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
> > -			&fe->ptl);
> > -	if (unlikely(!pte_same(*fe->pte, orig_pte))) {
> > +	ret |= do_set_pte(fe, NULL, fault_page);
> > +	if (fe->pte)
> >  		pte_unmap_unlock(fe->pte, fe->ptl);
> > +	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
> > +					VM_FAULT_RETRY))) {
> >  		unlock_page(fault_page);
> >  		page_cache_release(fault_page);
> >  		return ret;
> >  	}
> > -	do_set_pte(fe, fault_page);
> > -	pte_unmap_unlock(fe->pte, fe->ptl);
> >  
> >  	if (set_page_dirty(fault_page))
> >  		dirtied = 1;
> > @@ -3098,21 +3132,19 @@ static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
> >   * The mmap_sem may have been released depending on flags and our
> >   * return value.  See filemap_fault() and __lock_page_or_retry().
> >   */
> > -static int do_fault(struct fault_env *fe, pte_t orig_pte)
> > +static int do_fault(struct fault_env *fe)
> >  {
> >  	struct vm_area_struct *vma = fe->vma;
> > -	pgoff_t pgoff = (((fe->address & PAGE_MASK)
> > -			- vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> > +	pgoff_t pgoff = linear_page_index(vma, fe->address);
> 
> Looks like another trivial cleanup.

Okay, I'll move it into a separate patch.

> > -	pte_unmap(fe->pte);
> >  	/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
> >  	if (!vma->vm_ops->fault)
> >  		return VM_FAULT_SIGBUS;
> >  	if (!(fe->flags & FAULT_FLAG_WRITE))
> > -		return do_read_fault(fe, pgoff,	orig_pte);
> > +		return do_read_fault(fe, pgoff);
> >  	if (!(vma->vm_flags & VM_SHARED))
> > -		return do_cow_fault(fe, pgoff, orig_pte);
> > -	return do_shared_fault(fe, pgoff, orig_pte);
> > +		return do_cow_fault(fe, pgoff);
> > +	return do_shared_fault(fe, pgoff);
> >  }
> >  
> >  static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
> > @@ -3252,37 +3284,62 @@ static int wp_huge_pmd(struct fault_env *fe, pmd_t orig_pmd)
> >   * with external mmu caches can use to update those (ie the Sparc or
> >   * PowerPC hashed page tables that act as extended TLBs).
> >   *
> > - * We enter with non-exclusive mmap_sem (to exclude vma changes,
> > - * but allow concurrent faults), and pte mapped but not yet locked.
> > - * We return with pte unmapped and unlocked.
> > + * We enter with non-exclusive mmap_sem (to exclude vma changes, but allow
> > + * concurrent faults).
> >   *
> > - * The mmap_sem may have been released depending on flags and our
> > - * return value.  See filemap_fault() and __lock_page_or_retry().
> > + * The mmap_sem may have been released depending on flags and our return value.
> > + * See filemap_fault() and __lock_page_or_retry().
> >   */
> >  static int handle_pte_fault(struct fault_env *fe)
> >  {
> >  	pte_t entry;
> >  
> > +	/* If an huge pmd materialized from under us just retry later */
> > +	if (unlikely(pmd_trans_huge(*fe->pmd)))
> > +		return 0;
> > +
> > +	if (unlikely(pmd_none(*fe->pmd))) {
> > +		/*
> > +		 * Leave __pte_alloc() until later: because vm_ops->fault may
> > +		 * want to allocate huge page, and if we expose page table
> > +		 * for an instant, it will be difficult to retract from
> > +		 * concurrent faults and from rmap lookups.
> > +		 */
> > +	} else {
> > +		/*
> > +		 * A regular pmd is established and it can't morph into a huge
> > +		 * pmd from under us anymore at this point because we hold the
> > +		 * mmap_sem read mode and khugepaged takes it in write mode.
> > +		 * So now it's safe to run pte_offset_map().
> > +		 */
> > +		fe->pte = pte_offset_map(fe->pmd, fe->address);
> > +
> > +		entry = *fe->pte;
> > +		barrier();
> 
> Barrier because....?
> 
> > +		if (pte_none(entry)) {
> > +			pte_unmap(fe->pte);
> > +			fe->pte = NULL;
> > +		}
> > +	}
> > +
> >  	/*
> >  	 * some architectures can have larger ptes than wordsize,
> >  	 * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y,
> >  	 * so READ_ONCE or ACCESS_ONCE cannot guarantee atomic accesses.
> > -	 * The code below just needs a consistent view for the ifs and
> > +	 * The code above just needs a consistent view for the ifs and
> >  	 * we later double check anyway with the ptl lock held. So here
> >  	 * a barrier will do.
> >  	 */
> 
> Looks like the barrier got moved, but not the comment.

Moved.

> Man, that's a lot of code.

Yeah. I don't see a sensible way to split it. :-/

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 04/28] mm: make remove_migration_ptes() beyond mm/migration.c
  2016-02-16  9:54     ` Kirill A. Shutemov
@ 2016-02-16 15:29       ` Dave Hansen
  0 siblings, 0 replies; 55+ messages in thread
From: Dave Hansen @ 2016-02-16 15:29 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrea Arcangeli, Andrew Morton, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Jerome Marchand, Yang Shi,
	Sasha Levin, linux-kernel, linux-mm

On 02/16/2016 01:54 AM, Kirill A. Shutemov wrote:
> On Fri, Feb 12, 2016 at 08:54:58AM -0800, Dave Hansen wrote:
>> On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
>>> We also shouldn't try to mlock() pte-mapped huge pages: pte-mapped THP
>>> pages are never mlocked.
>>
>> That's kinda subtle.  Can you explain more?
>>
>> If we did the following:
>>
>> 	ptr = mmap(NULL, 512*PAGE_SIZE, ...);
>> 	mlock(ptr, 512*PAGE_SIZE);
>> 	fork();
>> 	munmap(ptr + 100 * PAGE_SIZE, PAGE_SIZE);
>>
>> I'd expect to get two processes, each mapping the same compound THP, one
>> with a PMD and the other with 511 ptes and one hole.  Is there something
>> different that goes on?
> 
> I'm not sure what exactly you want to ask with this code, but it will have
> the following result:
> 
>  - After fork(), the process will split the pmd in munlock(). For file THP,
>    splitting the pmd means clearing it out. split_huge_pmd() would munlock
>    the page, as we do for anon THP;
> 
>  - In the child process the page is never mapped, as mlock() is not inherited
>    and we don't copy page tables for a shared VMA since they can be re-faulted
>    later.

Huh, I didn't realize we don't inherit mlock() across fork(). Learn
something every day!

> The basic semantics for mlock()ed file THP would be the same as for anon
> THP: we only keep the page mlocked as long as it's mapped only with PMDs.
> This way it's relatively simple to make sure that we don't leak mlocked
> pages.

Ahh, I forgot about that bit.  Could you add some of that description to
the changelog so I don't forget again?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 13/28] thp: support file pages in zap_huge_pmd()
  2016-02-16 10:00     ` Kirill A. Shutemov
@ 2016-02-16 15:31       ` Dave Hansen
  2016-02-18 12:19         ` Kirill A. Shutemov
  0 siblings, 1 reply; 55+ messages in thread
From: Dave Hansen @ 2016-02-16 15:31 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrea Arcangeli, Andrew Morton, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Jerome Marchand, Yang Shi,
	Sasha Levin, linux-kernel, linux-mm

On 02/16/2016 02:00 AM, Kirill A. Shutemov wrote:
> On Fri, Feb 12, 2016 at 10:33:37AM -0800, Dave Hansen wrote:
>> On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
>>> For file pages we don't deposit a page table on mapping: no need to
>>> withdraw it.
>>
>> I thought the deposit thing was to guarantee we could always do a PMD
>> split.  It still seems like if you wanted to split a huge-tmpfs page,
>> you'd need to first split the PMD which might need the deposited one.
>>
>> Why not?
> 
> For file THP, split_huge_pmd() is implemented by clearing out the pmd: we
> can set up and fill the pte table later. Therefore there is no need to
> deposit a page table -- we would not use it. DAX does the same.

Ahh...  Do we just never split in any fault contexts, or do we just
retry the fault?

In any case, that seems like fine enough (although subtle) behavior.
Can you call it out a bit more explicitly in the patch text?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 17/28] thp: skip file huge pmd on copy_huge_pmd()
  2016-02-16 10:14     ` Kirill A. Shutemov
@ 2016-02-16 15:46       ` Dave Hansen
  2016-02-18 12:41         ` Kirill A. Shutemov
  0 siblings, 1 reply; 55+ messages in thread
From: Dave Hansen @ 2016-02-16 15:46 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Andrea Arcangeli, Andrew Morton, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Jerome Marchand, Yang Shi,
	Sasha Levin, linux-kernel, linux-mm

On 02/16/2016 02:14 AM, Kirill A. Shutemov wrote:
> On Fri, Feb 12, 2016 at 10:42:09AM -0800, Dave Hansen wrote:
>> On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
>>> File pmds can be safely skipped on copy_huge_pmd(); we can re-fault them
>>> later. COW for file mappings is handled on the pte level.
>>
>> Is this different from 4k pages?  I figured we might skip copying
>> file-backed ptes on fork, but I couldn't find the code.
> 
> Currently, we only filter out on a per-VMA basis. See the first comment in
> copy_page_range().
> 
> Here we handle PMD-mapped file pages in a COW mapping. A file THP can be
> mapped into a COW mapping as the result of a read page fault.

OK...  So, copy_page_range() has a check for "Don't copy ptes where a
page fault will fill them correctly."  Seems sane enough, but the check
is implemented using a check for the VMA having !vma->anon_vma, which is
a head-scratcher for a moment.  Why does that apply to huge tmpfs?
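
For the record, the check I mean looks roughly like this (quoting from
memory, so treat the exact flag list as approximate):

	/*
	 * copy_page_range(): don't copy page tables that a later page
	 * fault can fill in correctly.
	 */
	if (!(vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
			!vma->anon_vma)
		return 0;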

Ahh, MAP_PRIVATE.  MAP_PRIVATE vmas have ->anon_vma because they have
essentially-anonymous pages for when they do a COW, so they don't hit
that check and they go through the copy_*() functions, including
copy_huge_pmd().

We don't handle 2M COW operations yet so we simply decline to copy these
pages.  Might cost us page faults down the road, but it makes things
easier to implement for now.

Did I get that right?

Any chance we could get a bit of that into the patch descriptions so
that the next hapless reviewer can spend their time looking at your code
instead of relearning the fork() handling for MAP_PRIVATE?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 19/28] thp: run vma_adjust_trans_huge() outside i_mmap_rwsem
  2016-02-11 14:21 ` [PATCHv2 19/28] thp: run vma_adjust_trans_huge() outside i_mmap_rwsem Kirill A. Shutemov
  2016-02-12 18:50   ` Dave Hansen
@ 2016-02-16 15:49   ` Dave Hansen
  1 sibling, 0 replies; 55+ messages in thread
From: Dave Hansen @ 2016-02-16 15:49 UTC (permalink / raw)
  To: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli, Andrew Morton
  Cc: Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Jerome Marchand, Yang Shi, Sasha Levin, linux-kernel, linux-mm

On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
> vma_adjust_trans_huge() splits the pmd if it's crossing a VMA boundary.
> During the split we munlock the huge page, which requires an rmap walk.
> rmap wants to take the lock on its own.

Ahhh, ... so we $SUBJECT in order to fix it.

Now it all makes sense.  Maybe I'm old fashioned, but I tend to have
forgotten $SUBJECT by the time I start to read the patch body text.
It's really handy for me when the body text stands on its own.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCHv2 08/28] mm: postpone page table allocation until do_set_pte()
  2016-02-16 14:26     ` Kirill A. Shutemov
@ 2016-02-16 17:17       ` Dave Hansen
  2016-02-23 13:05         ` Kirill A. Shutemov
  2016-02-16 17:38       ` Dave Hansen
  1 sibling, 1 reply; 55+ messages in thread
From: Dave Hansen @ 2016-02-16 17:17 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli,
	Andrew Morton, Vlastimil Babka, Christoph Lameter,
	Naoya Horiguchi, Jerome Marchand, Yang Shi, Sasha Levin,
	linux-kernel, linux-mm

On 02/16/2016 06:26 AM, Kirill A. Shutemov wrote:
> On Fri, Feb 12, 2016 at 09:44:41AM -0800, Dave Hansen wrote:
>> On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index ca99c0ecf52e..172f4d8e798d 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -265,6 +265,7 @@ struct fault_env {
>>>  	pmd_t *pmd;
>>>  	pte_t *pte;
>>>  	spinlock_t *ptl;
>>> +	pgtable_t prealloc_pte;
>>>  };
>>
>> If we're going to do this fault_env thing, we need some heavy-duty
>> comments on what the different fields do and what they mean.  We don't
>> want to get in to a situation where we're doing
>>
>> 	void fault_foo(struct fault_env *fe);..
>>
>> and then we change the internals of fault_foo() to manipulate a
>> different set of fe->* variables, or change assumptions, then have
>> callers randomly break.
>>
>> One _nice_ part of passing all the arguments explicitly is that it makes
>> you go visit all the call sites and think about how the conventions change.
>>
>> It just makes me nervous.
>>
>> The semantics of having both a ->pte and ->pmd need to be very clearly
>> spelled out too, please.
> 
> I've updated this to:
> 
> /*
>  * Page fault context: passed through the page fault handler instead of an
>  * endless list of function arguments.
>  */
> struct fault_env {
> 	struct vm_area_struct *vma;	/* Target VMA */
> 	unsigned long address;		/* Faulting virtual address */
> 	unsigned int flags;		/* FAULT_FLAG_xxx flags */
> 	pmd_t *pmd;			/* Pointer to pmd entry matching
> 					 * the 'address'
> 					 */

Is this just for huge PMDs, or does it also cover normal PMDs pointing
to PTE pages?  Is it populated every time we're at or below the PMD
during a fault?  Is it always valid?

> 	pte_t *pte;			/* Pointer to pte entry matching
> 					 * the 'address'. NULL if the page
> 					 * table hasn't been allocated.
> 					 */

What's the relationship between pmd and pte?  Can both be set at the
same time, etc...?

> 	spinlock_t *ptl;		/* Page table lock.
> 					 * Protects pte page table if 'pte'
> 					 * is not NULL, otherwise pmd.
> 					 */

Are there any rules for callers when a callee puts a value in here?

> 	pgtable_t prealloc_pte;		/* Pre-allocated pte page table.
> 					 * vm_ops->map_pages() calls
> 					 * do_set_pte() from atomic context.
> 					 * do_fault_around() pre-allocates
> 					 * page table to avoid allocation from
> 					 * atomic context.
> 					 */
> };

Who's responsible for freeing this and when?

>>>  /*
>>> @@ -559,7 +560,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>>>  	return pte;
>>>  }
>>>  
>>> -void do_set_pte(struct fault_env *fe, struct page *page);
>>> +int do_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
>>> +		struct page *page);
>>>  #endif
>>
>> I think do_set_pte() might be due for a new name if it's going to be
>> doing allocations internally.
> 
> Any suggestions?

alloc_set_pte() is probably fine.  Just make it clear early in some
comments that the allocation is conditional.
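
Maybe something along these lines at the top of the function (just a
suggestion for the wording):

	/*
	 * alloc_set_pte - set up a new PTE entry for the given page and
	 * add reverse page mapping.  If fe->pte is NULL, a pte page table
	 * is allocated, mapped and locked first -- hence the "alloc".
	 */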

>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index 28b3875969a8..ba8150d6dc33 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -2146,11 +2146,6 @@ void filemap_map_pages(struct fault_env *fe,
>>>  			start_pgoff) {
>>>  		if (iter.index > end_pgoff)
>>>  			break;
>>> -		fe->pte += iter.index - last_pgoff;
>>> -		fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
>>> -		last_pgoff = iter.index;
>>> -		if (!pte_none(*fe->pte))
>>> -			goto next;
>>>  repeat:
>>>  		page = radix_tree_deref_slot(slot);
>>>  		if (unlikely(!page))
>>> @@ -2187,7 +2182,17 @@ repeat:
>>>  
>>>  		if (file->f_ra.mmap_miss > 0)
>>>  			file->f_ra.mmap_miss--;
>>> -		do_set_pte(fe, page);
>>> +
>>> +		fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
>>> +		if (fe->pte)
>>> +			fe->pte += iter.index - last_pgoff;
>>> +		last_pgoff = iter.index;
>>> +		if (do_set_pte(fe, NULL, page)) {
>>> +			/* failed to setup page table: giving up */
>>> +			if (!fe->pte)
>>> +				break;
>>> +			goto unlock;
>>> +		}
>>
>> What's the failure here, though?
> 
> At this point in the patchset it never fails: allocation failure is not
> possible as we pre-allocate the page table for faultaround.
> 
> Later, after do_set_pmd() is introduced, a huge page can be mapped here,
> either by us or under us.
> 
> I'll update the comment.

So why check the return value of do_set_pte()?  Why can it return nonzero?

>> This also throws away the spiffy new error code that comes baqck from
>> do_set_pte().  Is that OK?
> 
> Yes. We will try harder in do_read_fault(), with all the proper locks and
> error handling, once the faultaround code has failed to solve the page
> fault.

OK, I hope the new comment addresses this.

>>> +	/*
>>> +	 * Use __pte_alloc instead of pte_alloc_map, because we can't
>>> +	 * run pte_offset_map on the pmd, if an huge pmd could
>>> +	 * materialize from under us from a different thread.
>>> +	 */
>>
>> This comment is a little bit funky.  Maybe:
>>
>> "Use __pte_alloc() instead of pte_alloc_map().  We can't run
>> pte_offset_map() on pmds where a huge pmd might be created (from a
>> different thread)."
>>
>> Could you also talk a bit about where it _is_ safe to call pte_alloc_map()?
> 
> That comment was just moved from __handle_mm_fault().
> 
> Would this be okay:
> 
>         /*
>          * Use __pte_alloc() instead of pte_alloc_map().  We can't run
>          * pte_offset_map() on pmds where a huge pmd might be created (from
>          * a different thread).
>          *
>          * pte_alloc_map() is safe to use under down_write(mmap_sem) or when
>          * parallel threads are excluded by other means.
>          */

Yep, that looks good.  Just make sure to make it clear that mmap_sem
isn't held in *this* context.
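
i.e. something roughly like (again, just a wording suggestion):

        /*
         * Use __pte_alloc() instead of pte_alloc_map().  We can't run
         * pte_offset_map() on pmds where a huge pmd might be created
         * (from a different thread).
         *
         * pte_alloc_map() is safe to use under down_write(mmap_sem) or
         * when parallel threads are excluded by other means; here we
         * only hold mmap_sem for read.
         */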

* Re: [PATCHv2 08/28] mm: postpone page table allocation until do_set_pte()
  2016-02-16 14:26     ` Kirill A. Shutemov
  2016-02-16 17:17       ` Dave Hansen
@ 2016-02-16 17:38       ` Dave Hansen
  2016-02-23 22:58         ` Kirill A. Shutemov
  1 sibling, 1 reply; 55+ messages in thread
From: Dave Hansen @ 2016-02-16 17:38 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli,
	Andrew Morton, Vlastimil Babka, Christoph Lameter,
	Naoya Horiguchi, Jerome Marchand, Yang Shi, Sasha Levin,
	linux-kernel, linux-mm

Sorry, fat-fingered the send on the last one.

On 02/16/2016 06:26 AM, Kirill A. Shutemov wrote:
> On Fri, Feb 12, 2016 at 09:44:41AM -0800, Dave Hansen wrote:
>>> +	if (unlikely(pmd_none(*fe->pmd) &&
>>> +			__pte_alloc(vma->vm_mm, vma, fe->pmd, fe->address)))
>>> +		return VM_FAULT_OOM;
>>
>> Should we just move this pmd_none() check in to __pte_alloc()?  You do
>> this same-style check at least twice.
> 
> We have it there. The check here is speculative to avoid taking ptl.

OK, that's a performance optimization.  Why shouldn't all callers of
__pte_alloc() get the same optimization?

>>> +	/* If an huge pmd materialized from under us just retry later */
>>> +	if (unlikely(pmd_trans_huge(*fe->pmd)))
>>> +		return 0;
>>
>> Nit: please stop sprinkling unlikely() everywhere.  Is there some
>> concrete benefit to doing it here?  I really doubt the compiler needs
>> help putting the code for "return 0" out-of-line.
>>
>> Why is it important to abort here?  Is this a small-page-only path?
> 
> This unlikely() was moved from __handle_mm_fault(). I didn't put much
> consideration in it.

OK, but separately from the unlikely()...  Why is it important to jump
out of this code when we see a pmd_trans_huge() pmd?

>>> +static int pte_alloc_one_map(struct fault_env *fe)
>>> +{
>>> +	struct vm_area_struct *vma = fe->vma;
>>> +
>>> +	if (!pmd_none(*fe->pmd))
>>> +		goto map_pte;
>>
>> So the calling convention here is...?  It looks like this has to be
>> called with fe->pmd == pmd_none().  If not, we assume it's pointing to a
>> pte page.  This can never be called on a huge pmd.  Right?
> 
> It's not under ptl, so pmd can be filled under us. There's huge pmd check in
> 'map_pte' goto path.

OK, could we add some comments on that?  We expect to be called to
______, but if there is a race, we might also have to handle ______, etc...?

>>> +	if (fe->prealloc_pte) {
>>> +		smp_wmb(); /* See comment in __pte_alloc() */
>>
>> Are we trying to make *this* cpu's write visible, or to see the write
>> from __pte_alloc()?  It seems like we're trying to see the write.  Isn't
>> smp_rmb() what we want for that?
> 
> See 362a61ad6119.

That patch explains that anyone allocating and initializing a page table
page must ensure that all CPUs can see the initialization writes
*before* the page can be linked into the page tables.  __pte_alloc()
performs a smp_wmb() to ensure that other processors can see its writes.

That still doesn't answer my question though.  What does this barrier
do?  What does it make visible to this processor?  __pte_alloc() already
made its initialization visible, so what's the purpose *here*?

>>> +		atomic_long_inc(&vma->vm_mm->nr_ptes);
>>> +		pmd_populate(vma->vm_mm, fe->pmd, fe->prealloc_pte);
>>> +		spin_unlock(fe->ptl);
>>> +		fe->prealloc_pte = 0;
>>> +	} else if (unlikely(__pte_alloc(vma->vm_mm, vma, fe->pmd,
>>> +					fe->address))) {
>>> +		return VM_FAULT_OOM;
>>> +	}
>>> +map_pte:
>>> +	if (unlikely(pmd_trans_huge(*fe->pmd)))
>>> +		return VM_FAULT_NOPAGE;
>>
>> I think I need a refresher on the locking rules.  pte_offset_map*() is
>> unsafe to call on a huge pmd.  What in this context makes it impossible
>> for the pmd to get promoted after the check?
> 
> Do you mean what stops the pte page table from being collapsed into a huge pmd?
> The answer is mmap_sem. The collapse code takes the lock for write to be able
> to retract the page table.

What I learned in this set is that pte_offset_map_lock() is dangerous to
call unless THPs have been excluded somehow from the PMD it's being
called on.

What I'm looking for is something to make sure that the context has been
thought through and is thoroughly THP-free.

It sounds like you've thought through all the cases, but your thoughts
aren't clear from the way the code is laid out currently.

>>> + * Caller must take care of unlocking fe->ptl, if fe->pte is non-NULL on return.
>>>   *
>>>   * Target users are page handler itself and implementations of
>>>   * vm_ops->map_pages.
>>>   */
>>> -void do_set_pte(struct fault_env *fe, struct page *page)
>>> +int do_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
>>> +		struct page *page)
>>>  {
>>>  	struct vm_area_struct *vma = fe->vma;
>>>  	bool write = fe->flags & FAULT_FLAG_WRITE;
>>>  	pte_t entry;
>>>  
>>> +	if (!fe->pte) {
>>> +		int ret = pte_alloc_one_map(fe);
>>> +		if (ret)
>>> +			return ret;
>>> +	}
>>> +
>>> +	if (!pte_none(*fe->pte))
>>> +		return VM_FAULT_NOPAGE;
>>
>> Oh, you've got to add another pte_none() check because you're deferring
>> the acquisition of the ptl lock?
> 
> Yes, we need to re-check once ptl is taken.

Another good comment to add, I think. :)


>>> -	/* Check if it makes any sense to call ->map_pages */
>>> -	fe->address = start_addr;
>>> -	while (!pte_none(*fe->pte)) {
>>> -		if (++start_pgoff > end_pgoff)
>>> -			goto out;
>>> -		fe->address += PAGE_SIZE;
>>> -		if (fe->address >= fe->vma->vm_end)
>>> -			goto out;
>>> -		fe->pte++;
>>> +	if (pmd_none(*fe->pmd))
>>> +		fe->prealloc_pte = pte_alloc_one(fe->vma->vm_mm, fe->address);
>>> +	fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
>>> +	if (fe->prealloc_pte) {
>>> +		pte_free(fe->vma->vm_mm, fe->prealloc_pte);
>>> +		fe->prealloc_pte = 0;
>>>  	}
>>> +	if (!fe->pte)
>>> +		goto out;
>>
>> What does !fe->pte *mean* here?  No pte page?  No pte present?  Huge pte
>> present?
> 
> Huge pmd is mapped.
> 
> Comment added.

Huh, so in _some_ contexts, !fe->pte means that we've got a huge pmd.  I
don't remember seeing that spelled out in the structure comments.



>>> +	if (unlikely(pmd_none(*fe->pmd))) {
>>> +		/*
>>> +		 * Leave __pte_alloc() until later: because vm_ops->fault may
>>> +		 * want to allocate huge page, and if we expose page table
>>> +		 * for an instant, it will be difficult to retract from
>>> +		 * concurrent faults and from rmap lookups.
>>> +		 */
>>> +	} else {
>>> +		/*
>>> +		 * A regular pmd is established and it can't morph into a huge
>>> +		 * pmd from under us anymore at this point because we hold the
>>> +		 * mmap_sem read mode and khugepaged takes it in write mode.
>>> +		 * So now it's safe to run pte_offset_map().
>>> +		 */
>>> +		fe->pte = pte_offset_map(fe->pmd, fe->address);
>>> +
>>> +		entry = *fe->pte;
>>> +		barrier();
>>
>> Barrier because....?

Did you miss a response here, Kirill?

>>> +		if (pte_none(entry)) {
>>> +			pte_unmap(fe->pte);
>>> +			fe->pte = NULL;
>>> +		}
>>> +	}
>>> +
>>>  	/*
>>>  	 * some architectures can have larger ptes than wordsize,
>>>  	 * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y,
>>>  	 * so READ_ONCE or ACCESS_ONCE cannot guarantee atomic accesses.
>>> -	 * The code below just needs a consistent view for the ifs and
>>> +	 * The code above just needs a consistent view for the ifs and
>>>  	 * we later double check anyway with the ptl lock held. So here
>>>  	 * a barrier will do.
>>>  	 */
>>
>> Looks like the barrier got moved, but not the comment.
> 
> Moved.
> 
>> Man, that's a lot of code.
> 
> Yeah. I don't see a sensible way to split it. :-/

Can you do the "postpone allocation" parts without adding additional THP
code?  Or does the postponement just add all of the extra THP-handling
spots?

* Re: [PATCHv2 13/28] thp: support file pages in zap_huge_pmd()
  2016-02-16 15:31       ` Dave Hansen
@ 2016-02-18 12:19         ` Kirill A. Shutemov
  0 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-18 12:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli,
	Andrew Morton, Vlastimil Babka, Christoph Lameter,
	Naoya Horiguchi, Jerome Marchand, Yang Shi, Sasha Levin,
	linux-kernel, linux-mm

On Tue, Feb 16, 2016 at 07:31:58AM -0800, Dave Hansen wrote:
> On 02/16/2016 02:00 AM, Kirill A. Shutemov wrote:
> > On Fri, Feb 12, 2016 at 10:33:37AM -0800, Dave Hansen wrote:
> >> On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
> >>> For file pages we don't deposit page table on mapping: no need to
> >>> withdraw it.
> >>
> >> I thought the deposit thing was to guarantee we could always do a PMD
> >> split.  It still seems like if you wanted to split a huge-tmpfs page,
> >> you'd need to first split the PMD which might need the deposited one.
> >>
> >> Why not?
> > 
> > For file thp, split_huge_pmd() is implemented by clearing out the pmd: we
> > can set up and fill the pte table later. Therefore there is no need to
> > deposit a page table -- we would not use it. DAX does the same.
> 
> Ahh...  Do we just never split in any fault contexts, or do we just
> retry the fault?

In fault contexts we would just continue fault handling as if we had
pmd_none().

-- 
 Kirill A. Shutemov

* Re: [PATCHv2 17/28] thp: skip file huge pmd on copy_huge_pmd()
  2016-02-16 15:46       ` Dave Hansen
@ 2016-02-18 12:41         ` Kirill A. Shutemov
  0 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-18 12:41 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli,
	Andrew Morton, Vlastimil Babka, Christoph Lameter,
	Naoya Horiguchi, Jerome Marchand, Yang Shi, Sasha Levin,
	linux-kernel, linux-mm

On Tue, Feb 16, 2016 at 07:46:37AM -0800, Dave Hansen wrote:
> On 02/16/2016 02:14 AM, Kirill A. Shutemov wrote:
> > On Fri, Feb 12, 2016 at 10:42:09AM -0800, Dave Hansen wrote:
> >> On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
> >>> File pmds can be safely skipped on copy_huge_pmd(); we can re-fault them
> >>> later. COW for file mappings is handled at the pte level.
> >>
> >> Is this different from 4k pages?  I figured we might skip copying
> >> file-backed ptes on fork, but I couldn't find the code.
> > 
> > Currently, we only filter out on a per-VMA basis. See the first comment in
> > copy_page_range().
> > 
> > Here we handle PMD-mapped file pages in a COW mapping. A file THP can be
> > mapped into a COW mapping as the result of a read page fault.
> 
> OK...  So, copy_page_range() has a check for "Don't copy ptes where a
> page fault will fill them correctly."  Seems sane enough, but the check
> is implemented using a check for the VMA having !vma->anon_vma, which is
> a head-scratcher for a moment.  Why does that apply to huge tmpfs?
> 
> Ahh, MAP_PRIVATE.  MAP_PRIVATE vmas have ->anon_vma because they have
> essentially-anonymous pages for when they do a COW, so they don't hit
> that check and they go through the copy_*() functions, including
> copy_huge_pmd().
> 
> We don't handle 2M COW operations yet so we simply decline to copy these
> pages.  Might cost us page faults down the road, but it makes things
> easier to implement for now.
> 
> Did I get that right?

Yep.

> Any chance we could get a bit of that into the patch descriptions so
> that the next hapless reviewer can spend their time looking at your code
> instead of relearning the fork() handling for MAP_PRIVATE?

Sure.
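
In the meantime, here is a user-space sketch of the scenario for anyone who
wants to poke at it (my own toy, not part of the series): a MAP_PRIVATE
mapping of a tmpfs file is read-faulted in the parent, fork() runs
copy_page_range(), and a write in the child triggers a COW fault at pte
level.  It assumes /dev/shm is a tmpfs mount and only exercises the path;
it doesn't prove which kernel code handled it.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define LEN	(4UL << 20)	/* 4 MiB: covers a pmd-sized (2 MiB on x86-64) page */

int main(void)
{
	const char *path = "/dev/shm/huge-tmpfs-demo";
	int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);

	if (fd < 0 || ftruncate(fd, LEN)) {
		perror("setup");
		return 1;
	}

	char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Read faults: map file pages into the private (COW) mapping. */
	volatile char sum = 0;
	for (size_t i = 0; i < LEN; i += 4096)
		sum += p[i];

	pid_t pid = fork();		/* copy_page_range() runs here */
	if (pid == 0) {
		p[0] = 42;		/* write fault in the child: COW at pte level */
		_exit(0);
	}
	waitpid(pid, NULL, 0);
	printf("parent still sees %d at offset 0 (child's write was private)\n", p[0]);

	munmap(p, LEN);
	close(fd);
	unlink(path);
	return 0;
}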

-- 
 Kirill A. Shutemov

* Re: [PATCHv2 08/28] mm: postpone page table allocation until do_set_pte()
  2016-02-16 17:17       ` Dave Hansen
@ 2016-02-23 13:05         ` Kirill A. Shutemov
  0 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-23 13:05 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli,
	Andrew Morton, Vlastimil Babka, Christoph Lameter,
	Naoya Horiguchi, Jerome Marchand, Yang Shi, Sasha Levin,
	linux-kernel, linux-mm

On Tue, Feb 16, 2016 at 09:17:17AM -0800, Dave Hansen wrote:
> On 02/16/2016 06:26 AM, Kirill A. Shutemov wrote:
> > On Fri, Feb 12, 2016 at 09:44:41AM -0800, Dave Hansen wrote:
> >> On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
> >>> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >>> index ca99c0ecf52e..172f4d8e798d 100644
> >>> --- a/include/linux/mm.h
> >>> +++ b/include/linux/mm.h
> >>> @@ -265,6 +265,7 @@ struct fault_env {
> >>>  	pmd_t *pmd;
> >>>  	pte_t *pte;
> >>>  	spinlock_t *ptl;
> >>> +	pgtable_t prealloc_pte;
> >>>  };
> >>
> >> If we're going to do this fault_env thing, we need some heavy-duty
> >> comments on what the different fields do and what they mean.  We don't
> >> want to get in to a situation where we're doing
> >>
> >> 	void fault_foo(struct fault_env *fe);..
> >>
> >> and then we change the internals of fault_foo() to manipulate a
> >> different set of fe->* variables, or change assumptions, then have
> >> callers randomly break.
> >>
> >> One _nice_ part of passing all the arguments explicitly is that it makes
> >> you go visit all the call sites and think about how the conventions change.
> >>
> >> It just makes me nervous.
> >>
> >> The semantics of having both a ->pte and ->pmd need to be very clearly
> >> spelled out too, please.
> > 
> > I've updated this to:
> > 
> > /*
> >  * Page fault context: passed through the page fault handler instead of an
> >  * endless list of function arguments.
> >  */
> > struct fault_env {
> > 	struct vm_area_struct *vma;	/* Target VMA */
> > 	unsigned long address;		/* Faulting virtual address */
> > 	unsigned int flags;		/* FAULT_FLAG_xxx flags */
> > 	pmd_t *pmd;			/* Pointer to pmd entry matching
> > 					 * the 'address'
> > 					 */
> 
> Is this just for huge PMDs, or does it also cover normal PMDs pointing
> to PTE pages?

Any.

> Is it populated every time we're at or below the PMD during a fault?

Yes.

> Is it always valid?

It points to the relevant entry. There is nothing to say about the content of
the entry in general.

> > 	pte_t *pte;			/* Pointer to pte entry matching
> > 					 * the 'address'. NULL if the page
> > 					 * table hasn't been allocated.
> > 					 */
> 
> What's the relationship between pmd and pte?  Can both be set at the
> same time, etc...?

If pte is set, pmd is set too. In that case pmd points to the page table the
pte is part of.

It's pretty straightforward.

> 
> > 	spinlock_t *ptl;		/* Page table lock.
> > 					 * Protects pte page table if 'pte'
> > 					 * is not NULL, otherwise pmd.
> > 					 */
> 
> Are there any rules for callers when a callee puts a value in here?

Nothing in particular. In most cases we acquire and release the ptl in the
same function, with a few exceptions: the write-protect fault path and
do_set_pte(). That's documented around these functions.

> > 	pgtable_t prealloc_pte;		/* Pre-allocated pte page table.
> > 					 * vm_ops->map_pages() calls
> > 					 * do_set_pte() from atomic context.
> > 					 * do_fault_around() pre-allocates
> > 					 * page table to avoid allocation from
> > 					 * atomic context.
> > 					 */
> > };
> 
> Who's responsible for freeing this and when?

do_fault_around() frees the page table if it wasn't used.
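
If it helps, the ownership rule is the usual preallocate-then-use-or-free
pattern.  A plain user-space toy of the same idea (nothing here is kernel
code; the names are made up):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int *slot;			/* stand-in for the pmd we may populate */

static void toy_fault_around(int need_table)
{
	/* Allocate up front, outside the region where we must not allocate. */
	int *prealloc = malloc(sizeof(*prealloc));

	if (!prealloc)
		return;
	*prealloc = 0;

	pthread_mutex_lock(&lock);	/* stand-in for the no-allocation region */
	if (need_table && !slot) {
		slot = prealloc;	/* consumed: ownership transferred */
		prealloc = NULL;
	}
	pthread_mutex_unlock(&lock);

	free(prealloc);			/* no-op if the preallocation was consumed */
}

int main(void)
{
	toy_fault_around(0);	/* not needed: the preallocation is freed */
	toy_fault_around(1);	/* needed: the preallocation is installed */
	printf("slot %s installed\n", slot ? "was" : "was not");
	free(slot);
	return 0;
}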

> >>>  /*
> >>> @@ -559,7 +560,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> >>>  	return pte;
> >>>  }
> >>>  
> >>> -void do_set_pte(struct fault_env *fe, struct page *page);
> >>> +int do_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
> >>> +		struct page *page);
> >>>  #endif
> >>
> >> I think do_set_pte() might be due for a new name if it's going to be
> >> doing allocations internally.
> > 
> > Any suggestions?
> 
> alloc_set_pte() is probably fine.  Just make it clear early in some
> comments that the allocation is conditional.

Ok.

> >>> diff --git a/mm/filemap.c b/mm/filemap.c
> >>> index 28b3875969a8..ba8150d6dc33 100644
> >>> --- a/mm/filemap.c
> >>> +++ b/mm/filemap.c
> >>> @@ -2146,11 +2146,6 @@ void filemap_map_pages(struct fault_env *fe,
> >>>  			start_pgoff) {
> >>>  		if (iter.index > end_pgoff)
> >>>  			break;
> >>> -		fe->pte += iter.index - last_pgoff;
> >>> -		fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
> >>> -		last_pgoff = iter.index;
> >>> -		if (!pte_none(*fe->pte))
> >>> -			goto next;
> >>>  repeat:
> >>>  		page = radix_tree_deref_slot(slot);
> >>>  		if (unlikely(!page))
> >>> @@ -2187,7 +2182,17 @@ repeat:
> >>>  
> >>>  		if (file->f_ra.mmap_miss > 0)
> >>>  			file->f_ra.mmap_miss--;
> >>> -		do_set_pte(fe, page);
> >>> +
> >>> +		fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
> >>> +		if (fe->pte)
> >>> +			fe->pte += iter.index - last_pgoff;
> >>> +		last_pgoff = iter.index;
> >>> +		if (do_set_pte(fe, NULL, page)) {
> >>> +			/* failed to setup page table: giving up */
> >>> +			if (!fe->pte)
> >>> +				break;
> >>> +			goto unlock;
> >>> +		}
> >>
> >> What's the failure here, though?
> > 
> > At this point in the patchset it never fails: allocation failure is not
> > possible as we pre-allocate the page table for faultaround.
> > 
> > Later, after do_set_pmd() is introduced, a huge page can be mapped here,
> > either by us or under us.
> > 
> > I'll update the comment.
> 
> So why check the return value of do_set_pte()?  Why can it return nonzero?

Actually, this part is buggy (it loops without result). I used to return
VM_FAULT_NOPAGE when a huge page is set up, but not anymore.

I'll replace it with this:

diff --git a/mm/filemap.c b/mm/filemap.c
index de3bb308f5a9..5f655220df69 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2269,12 +2269,12 @@ repeat:
 		if (fe->pte)
 			fe->pte += iter.index - last_pgoff;
 		last_pgoff = iter.index;
-		if (alloc_set_pte(fe, NULL, page)) {
-			/* Huge page is mapped? */
-			if (!fe->pte)
-				break;
-			goto unlock;
-		}
+		alloc_set_pte(fe, NULL, page);
+		/* Huge page is mapped? No need to proceed. */
+		if (pmd_trans_huge(*fe->pmd))
+			break;
+		/* Failed to setup page table? */
+		VM_BUG_ON(!fe->pte);
 		unlock_page(page);
 		goto next;
 unlock:

-- 
 Kirill A. Shutemov

* Re: [PATCHv2 08/28] mm: postpone page table allocation until do_set_pte()
  2016-02-16 17:38       ` Dave Hansen
@ 2016-02-23 22:58         ` Kirill A. Shutemov
  0 siblings, 0 replies; 55+ messages in thread
From: Kirill A. Shutemov @ 2016-02-23 22:58 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Hugh Dickins, Andrea Arcangeli,
	Andrew Morton, Vlastimil Babka, Christoph Lameter,
	Naoya Horiguchi, Jerome Marchand, Yang Shi, Sasha Levin,
	linux-kernel, linux-mm

On Tue, Feb 16, 2016 at 09:38:48AM -0800, Dave Hansen wrote:
> Sorry, fat-fingered the send on the last one.
> 
> On 02/16/2016 06:26 AM, Kirill A. Shutemov wrote:
> > On Fri, Feb 12, 2016 at 09:44:41AM -0800, Dave Hansen wrote:
> >>> +	if (unlikely(pmd_none(*fe->pmd) &&
> >>> +			__pte_alloc(vma->vm_mm, vma, fe->pmd, fe->address)))
> >>> +		return VM_FAULT_OOM;
> >>
> >> Should we just move this pmd_none() check in to __pte_alloc()?  You do
> >> this same-style check at least twice.
> > 
> > We have it there. The check here is speculative to avoid taking ptl.
> 
> OK, that's a performance optimization.  Why shouldn't all callers of
> __pte_alloc() get the same optimization?

I've sent a patch for this.

> >>> +	/* If an huge pmd materialized from under us just retry later */
> >>> +	if (unlikely(pmd_trans_huge(*fe->pmd)))
> >>> +		return 0;
> >>
> >> Nit: please stop sprinkling unlikely() everywhere.  Is there some
> >> concrete benefit to doing it here?  I really doubt the compiler needs
> >> help putting the code for "return 0" out-of-line.
> >>
> >> Why is it important to abort here?  Is this a small-page-only path?
> > 
> > This unlikely() was moved from __handle_mm_fault(). I didn't put much
> > consideration in it.
> 
> OK, but separately from the unlikely()...  Why is it important to jump
> out of this code when we see a pmd_trans_huge() pmd?

The code below works at the pte level, so it expects the pmd to point to a
page table.

And the page fault has most likely been resolved anyway.

> >>> +static int pte_alloc_one_map(struct fault_env *fe)
> >>> +{
> >>> +	struct vm_area_struct *vma = fe->vma;
> >>> +
> >>> +	if (!pmd_none(*fe->pmd))
> >>> +		goto map_pte;
> >>
> >> So the calling convention here is...?  It looks like this has to be
> >> called with fe->pmd == pmd_none().  If not, we assume it's pointing to a
> >> pte page.  This can never be called on a huge pmd.  Right?
> > 
> > It's not under ptl, so pmd can be filled under us. There's huge pmd check in
> > 'map_pte' goto path.
> 
> OK, could we add some comments on that?  We expect to be called to
> ______, but if there is a race, we might also have to handle ______, etc...?

Ok.

> >>> +	if (fe->prealloc_pte) {
> >>> +		smp_wmb(); /* See comment in __pte_alloc() */
> >>
> >> Are we trying to make *this* cpu's write visible, or to see the write
> >> from __pte_alloc()?  It seems like we're trying to see the write.  Isn't
> >> smp_rmb() what we want for that?
> > 
> > See 362a61ad6119.
> 
> That patch explains that anyone allocating and initializing a page table
> page must ensure that all CPUs can see the initialization writes
> *before* the page can be linked into the page tables.  __pte_alloc()
> performs a smp_wmb() to ensure that other processors can see its writes.
> 
> That still doesn't answer my question though.  What does this barrier
> do?  What does it make visible to this processor?  __pte_alloc() already
> made its initialization visible, so what's the purpose *here*?

We don't call __pte_alloc() to allocate the page table for ->prealloc_pte;
we call pte_alloc_one(), which doesn't include the barrier.
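
To illustrate the ordering requirement outside the kernel: the rule is
"fully initialize the table, then publish the pointer", and the smp_wmb()
before pmd_populate() provides the first half of that.  Below is a
C11-atomics analogue I put together for illustration only (in the kernel
the reader side relies on the address dependency of the page-table walk
rather than an explicit acquire, but the idea is the same).

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct table { int entries[4]; };

static _Atomic(struct table *) published;

static void *writer(void *arg)
{
	struct table *t = malloc(sizeof(*t));	/* pte_alloc_one() analogue */

	(void)arg;
	for (int i = 0; i < 4; i++)
		t->entries[i] = i + 1;		/* initialize the "page table" */
	/* Release store: the initialization above cannot be reordered past it. */
	atomic_store_explicit(&published, t, memory_order_release);
	return NULL;
}

static void *reader(void *arg)
{
	struct table *t;

	(void)arg;
	/* Spin until the pointer is published; acquire pairs with the release. */
	while (!(t = atomic_load_explicit(&published, memory_order_acquire)))
		;
	for (int i = 0; i < 4; i++)
		printf("entry[%d] = %d\n", i, t->entries[i]);	/* always initialized */
	return NULL;
}

int main(void)
{
	pthread_t w, r;

	pthread_create(&r, NULL, reader, NULL);
	pthread_create(&w, NULL, writer, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	free(atomic_load(&published));
	return 0;
}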

> >>> +		atomic_long_inc(&vma->vm_mm->nr_ptes);
> >>> +		pmd_populate(vma->vm_mm, fe->pmd, fe->prealloc_pte);
> >>> +		spin_unlock(fe->ptl);
> >>> +		fe->prealloc_pte = 0;
> >>> +	} else if (unlikely(__pte_alloc(vma->vm_mm, vma, fe->pmd,
> >>> +					fe->address))) {
> >>> +		return VM_FAULT_OOM;
> >>> +	}
> >>> +map_pte:
> >>> +	if (unlikely(pmd_trans_huge(*fe->pmd)))
> >>> +		return VM_FAULT_NOPAGE;
> >>
> >> I think I need a refresher on the locking rules.  pte_offset_map*() is
> >> unsafe to call on a huge pmd.  What in this context makes it impossible
> >> for the pmd to get promoted after the check?
> > 
> > Do you mean what stops the pte page table from being collapsed into a huge pmd?
> > The answer is mmap_sem. The collapse code takes the lock for write to be able
> > to retract the page table.
> 
> What I learned in this set is that pte_offset_map_lock() is dangerous to
> call unless THPs have been excluded somehow from the PMD it's being
> called on.
> 
> What I'm looking for is something to make sure that the context has been
> thought through and is thoroughly THP-free.
> 
> It sounds like you've thought through all the cases, but your thoughts
> aren't clear from the way the code is laid out currently.

Actually, I've discovered a race while looking into this code. Andrea has
fixed it in __handle_mm_fault(), and I will move the comment here.

> >>> + * Caller must take care of unlocking fe->ptl, if fe->pte is non-NULL on return.
> >>>   *
> >>>   * Target users are page handler itself and implementations of
> >>>   * vm_ops->map_pages.
> >>>   */
> >>> -void do_set_pte(struct fault_env *fe, struct page *page)
> >>> +int do_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
> >>> +		struct page *page)
> >>>  {
> >>>  	struct vm_area_struct *vma = fe->vma;
> >>>  	bool write = fe->flags & FAULT_FLAG_WRITE;
> >>>  	pte_t entry;
> >>>  
> >>> +	if (!fe->pte) {
> >>> +		int ret = pte_alloc_one_map(fe);
> >>> +		if (ret)
> >>> +			return ret;
> >>> +	}
> >>> +
> >>> +	if (!pte_none(*fe->pte))
> >>> +		return VM_FAULT_NOPAGE;
> >>
> >> Oh, you've got to add another pte_none() check because you're deferring
> >> the acquisition of the ptl lock?
> > 
> > Yes, we need to re-check once ptl is taken.
> 
> Another good comment to add, I think. :)

Ok.

> >>> -	/* Check if it makes any sense to call ->map_pages */
> >>> -	fe->address = start_addr;
> >>> -	while (!pte_none(*fe->pte)) {
> >>> -		if (++start_pgoff > end_pgoff)
> >>> -			goto out;
> >>> -		fe->address += PAGE_SIZE;
> >>> -		if (fe->address >= fe->vma->vm_end)
> >>> -			goto out;
> >>> -		fe->pte++;
> >>> +	if (pmd_none(*fe->pmd))
> >>> +		fe->prealloc_pte = pte_alloc_one(fe->vma->vm_mm, fe->address);
> >>> +	fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
> >>> +	if (fe->prealloc_pte) {
> >>> +		pte_free(fe->vma->vm_mm, fe->prealloc_pte);
> >>> +		fe->prealloc_pte = 0;
> >>>  	}
> >>> +	if (!fe->pte)
> >>> +		goto out;
> >>
> >> What does !fe->pte *mean* here?  No pte page?  No pte present?  Huge pte
> >> present?
> > 
> > Huge pmd is mapped.
> > 
> > Comment added.
> 
> Huh, so in _some_ contexts, !fe->pte means that we've got a huge pmd.  I
> don't remember seeing that spelled out in the structure comments.

I'll change it to "if (pmd_trans_huge(*fe->pmd))".

> >>> +	if (unlikely(pmd_none(*fe->pmd))) {
> >>> +		/*
> >>> +		 * Leave __pte_alloc() until later: because vm_ops->fault may
> >>> +		 * want to allocate huge page, and if we expose page table
> >>> +		 * for an instant, it will be difficult to retract from
> >>> +		 * concurrent faults and from rmap lookups.
> >>> +		 */
> >>> +	} else {
> >>> +		/*
> >>> +		 * A regular pmd is established and it can't morph into a huge
> >>> +		 * pmd from under us anymore at this point because we hold the
> >>> +		 * mmap_sem read mode and khugepaged takes it in write mode.
> >>> +		 * So now it's safe to run pte_offset_map().
> >>> +		 */
> >>> +		fe->pte = pte_offset_map(fe->pmd, fe->address);
> >>> +
> >>> +		entry = *fe->pte;
> >>> +		barrier();
> >>
> >> Barrier because....?
> 
> Did you miss a response here, Kirill?

The comment below is about this barrier.
Isn't it sufficient?
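
To spell the pattern out: the unlocked read is only used to choose a path,
and anything that matters is re-checked once the ptl is taken, so even a
torn read of a wide pte is harmless.  A stripped-down user-space sketch of
that pattern (purely illustrative, all names made up):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER;
static long long entry;		/* may be wider than a machine word on 32-bit */

static bool handle_fault(void)
{
	long long snapshot = entry;	/* unlocked, potentially torn read */
	bool installed = false;

	/* Compiler barrier, mirroring barrier(): don't re-read 'entry' below. */
	__asm__ __volatile__("" ::: "memory");

	if (snapshot != 0)
		return false;		/* heuristic only: looks already handled */

	pthread_mutex_lock(&ptl);
	if (entry == 0) {		/* authoritative re-check under the lock */
		entry = 0x1234;
		installed = true;
	}
	pthread_mutex_unlock(&ptl);
	return installed;
}

int main(void)
{
	printf("first fault installed the entry: %s\n",
	       handle_fault() ? "yes" : "no");
	printf("second fault installed the entry: %s\n",
	       handle_fault() ? "yes" : "no");
	return 0;
}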

> >>> +		if (pte_none(entry)) {
> >>> +			pte_unmap(fe->pte);
> >>> +			fe->pte = NULL;
> >>> +		}
> >>> +	}
> >>> +
> >>>  	/*
> >>>  	 * some architectures can have larger ptes than wordsize,
> >>>  	 * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y,
> >>>  	 * so READ_ONCE or ACCESS_ONCE cannot guarantee atomic accesses.
> >>> -	 * The code below just needs a consistent view for the ifs and
> >>> +	 * The code above just needs a consistent view for the ifs and
> >>>  	 * we later double check anyway with the ptl lock held. So here
> >>>  	 * a barrier will do.
> >>>  	 */
> >>
> >> Looks like the barrier got moved, but not the comment.
> > 
> > Moved.
> > 
> >> Man, that's a lot of code.
> > 
> > Yeah. I don't see a sensible way to split it. :-/
> 
> Can you do the "postpone allocation" parts without adding additional THP
> code?  Or does the postponement just add all of the extra THP-handling
> spots?

I'll check. But I wouldn't expect moving the THP handling out of the commit
to make it much smaller.

-- 
 Kirill A. Shutemov

end of thread, other threads:[~2016-02-24  9:51 UTC | newest]

Thread overview: 55+ messages
2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 01/28] thp, dax: do not try to withdraw pgtable from non-anon VMA Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 02/28] rmap: introduce rmap_walk_locked() Kirill A. Shutemov
2016-02-11 18:52   ` Andi Kleen
2016-02-16  9:36     ` Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 03/28] rmap: extend try_to_unmap() to be usable by split_huge_page() Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 04/28] mm: make remove_migration_ptes() beyond mm/migration.c Kirill A. Shutemov
2016-02-12 16:54   ` Dave Hansen
2016-02-16  9:54     ` Kirill A. Shutemov
2016-02-16 15:29       ` Dave Hansen
2016-02-11 14:21 ` [PATCHv2 05/28] thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 06/28] mm: do not pass mm_struct into handle_mm_fault Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 07/28] mm: introduce fault_env Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 08/28] mm: postpone page table allocation until do_set_pte() Kirill A. Shutemov
2016-02-12 17:44   ` Dave Hansen
2016-02-16 14:26     ` Kirill A. Shutemov
2016-02-16 17:17       ` Dave Hansen
2016-02-23 13:05         ` Kirill A. Shutemov
2016-02-16 17:38       ` Dave Hansen
2016-02-23 22:58         ` Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 09/28] rmap: support file thp Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 10/28] mm: introduce do_set_pmd() Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 11/28] mm, rmap: account file thp pages Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 12/28] thp, vmstats: add counters for huge file pages Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 13/28] thp: support file pages in zap_huge_pmd() Kirill A. Shutemov
2016-02-12 18:33   ` Dave Hansen
2016-02-16 10:00     ` Kirill A. Shutemov
2016-02-16 15:31       ` Dave Hansen
2016-02-18 12:19         ` Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 14/28] thp: handle file pages in split_huge_pmd() Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 15/28] thp: handle file COW faults Kirill A. Shutemov
2016-02-12 18:36   ` Dave Hansen
2016-02-16 10:08     ` Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 16/28] thp: handle file pages in mremap() Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 17/28] thp: skip file huge pmd on copy_huge_pmd() Kirill A. Shutemov
2016-02-12 18:42   ` Dave Hansen
2016-02-16 10:14     ` Kirill A. Shutemov
2016-02-16 15:46       ` Dave Hansen
2016-02-18 12:41         ` Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 18/28] thp: prepare change_huge_pmd() for file thp Kirill A. Shutemov
2016-02-12 18:48   ` Dave Hansen
2016-02-16 10:15     ` Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 19/28] thp: run vma_adjust_trans_huge() outside i_mmap_rwsem Kirill A. Shutemov
2016-02-12 18:50   ` Dave Hansen
2016-02-16 10:16     ` Kirill A. Shutemov
2016-02-16 15:49   ` Dave Hansen
2016-02-11 14:21 ` [PATCHv2 20/28] thp: file pages support for split_huge_page() Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 21/28] vmscan: split file huge pages before paging them out Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 22/28] page-flags: relax policy for PG_mappedtodisk and PG_reclaim Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 23/28] radix-tree: implement radix_tree_maybe_preload_order() Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 24/28] filemap: prepare find and delete operations for huge pages Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 25/28] truncate: handle file thp Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 26/28] shmem: prepare huge=N mount option and /proc/sys/vm/shmem_huge Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 27/28] shmem: get_unmapped_area align huge page Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 28/28] shmem: add huge pages support Kirill A. Shutemov
