* [PATCH 00/24] huge tmpfs: an alternative approach to THPageCache
@ 2015-02-21  3:49 ` Hugh Dickins
  0 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  3:49 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, Hugh Dickins,
	linux-kernel, linux-mm

I warned last month that I have been working on "huge tmpfs":
an implementation of Transparent Huge Page Cache in tmpfs,
for those who are tired of the limitations of hugetlbfs.

Here's a fully working patchset, against v3.19 so that you can give it
a try against a stable base.  I've not yet studied how well it applies
to current git: probably lots of easily resolved clashes with nonlinear
removal.  Against mmotm, the rmap.c differences looked nontrivial.

Fully working?  Well, at present page migration just keeps away from
these teams of pages.  And once memory pressure has disbanded a team
to swap it out, there is nothing to put it together again later on,
to restore the original hugepage performance.  Those pieces must follow,
but I've given them no real thought yet (khugepaged, maybe).

Yes, I realize there's nothing yet under Documentation, nor fs/proc
beyond meminfo, nor other debug/visibility files: those must follow,
but I've cared more about providing the basic functionality.

I don't expect to update this patchset in the next few weeks: now that
it's posted, my priority is to look at other people's work before LSF/MM;
and in particular, of course, your (Kirill's) THP refcounting redesign.

01 mm: update_lru_size warn and reset bad lru_size
02 mm: update_lru_size do the __mod_zone_page_state
03 mm: use __SetPageSwapBacked and don't ClearPageSwapBacked
04 mm: make page migration's newpage handling more robust
05 tmpfs: preliminary minor tidyups
06 huge tmpfs: prepare counts in meminfo, vmstat and SysRq-m
07 huge tmpfs: include shmem freeholes in available memory counts
08 huge tmpfs: prepare huge=N mount option and /proc/sys/vm/shmem_huge
09 huge tmpfs: try to allocate huge pages, split into a team
10 huge tmpfs: avoid team pages in a few places
11 huge tmpfs: shrinker to migrate and free underused holes
12 huge tmpfs: get_unmapped_area align and fault supply huge page
13 huge tmpfs: extend get_user_pages_fast to shmem pmd
14 huge tmpfs: extend vma_adjust_trans_huge to shmem pmd
15 huge tmpfs: rework page_referenced_one and try_to_unmap_one
16 huge tmpfs: fix problems from premature exposure of pagetable
17 huge tmpfs: map shmem by huge page pmd or by page team ptes
18 huge tmpfs: mmap_sem is unlocked when truncation splits huge pmd
19 huge tmpfs: disband split huge pmds on race or memory failure
20 huge tmpfs: use Unevictable lru with variable hpage_nr_pages()
21 huge tmpfs: fix Mlocked meminfo, tracking huge and unhuge mlocks
22 huge tmpfs: fix Mapped meminfo, tracking huge and unhuge mappings
23 kvm: plumb return of hva when resolving page fault.
24 kvm: teach kvm to map page teams as huge pages.
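
(Aside, for orientation: a minimal userspace sketch of how 08's huge=N
mount option and /proc/sys/vm/shmem_huge knob might be exercised.  The
option and sysctl names come from the patch title above; the particular
values "huge=1" and "1" are assumptions for illustration, not a claim
about the accepted range.)

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* Ask for huge pages on this tmpfs instance at mount time. */
	if (mount("tmpfs", "/mnt/hugetmpfs", "tmpfs", 0,
		  "size=2G,huge=1") != 0)
		perror("mount");

	/* Or adjust the global default via the sysctl from patch 08. */
	FILE *f = fopen("/proc/sys/vm/shmem_huge", "w");
	if (f) {
		fputs("1\n", f);
		fclose(f);
	}
	return 0;
}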

 arch/mips/mm/gup.c             |   17 
 arch/powerpc/mm/pgtable_64.c   |    7 
 arch/s390/mm/gup.c             |   22 
 arch/sparc/mm/gup.c            |   22 
 arch/x86/kvm/mmu.c             |  171 +++-
 arch/x86/kvm/paging_tmpl.h     |    6 
 arch/x86/mm/gup.c              |   17 
 drivers/base/node.c            |   20 
 drivers/char/mem.c             |   23 
 fs/proc/meminfo.c              |   17 
 include/linux/huge_mm.h        |   18 
 include/linux/kvm_host.h       |    2 
 include/linux/memcontrol.h     |   11 
 include/linux/mempolicy.h      |    6 
 include/linux/migrate.h        |    3 
 include/linux/mm.h             |   95 +-
 include/linux/mm_inline.h      |   24 
 include/linux/mm_types.h       |    1 
 include/linux/mmzone.h         |    5 
 include/linux/page-flags.h     |    6 
 include/linux/pageteam.h       |  289 +++++++
 include/linux/shmem_fs.h       |   21 
 include/trace/events/migrate.h |    3 
 ipc/shm.c                      |    6 
 kernel/sysctl.c                |   12 
 mm/balloon_compaction.c        |   10 
 mm/compaction.c                |    6 
 mm/filemap.c                   |   10 
 mm/gup.c                       |   22 
 mm/huge_memory.c               |  281 ++++++-
 mm/internal.h                  |   25 
 mm/memcontrol.c                |   42 -
 mm/memory-failure.c            |    8 
 mm/memory.c                    |  227 +++--
 mm/migrate.c                   |  139 +--
 mm/mlock.c                     |  181 ++--
 mm/mmap.c                      |   17 
 mm/page-writeback.c            |    2 
 mm/page_alloc.c                |   13 
 mm/rmap.c                      |  207 +++--
 mm/shmem.c                     | 1235 +++++++++++++++++++++++++++++--
 mm/swap.c                      |    5 
 mm/swap_state.c                |    3 
 mm/truncate.c                  |    2 
 mm/vmscan.c                    |   76 +
 mm/vmstat.c                    |    3 
 mm/zswap.c                     |    3 
 virt/kvm/kvm_main.c            |   24 
 48 files changed, 2790 insertions(+), 575 deletions(-)

* [PATCH 01/24] mm: update_lru_size warn and reset bad lru_size
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  3:51   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  3:51 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

Though debug kernels have a VM_BUG_ON to help protect from misaccounting
lru_size, non-debug kernels are liable to wrap it around: and then the
vast unsigned long size draws page reclaim into a loop of repeatedly
doing nothing on an empty list, without even a cond_resched().

That soft lockup looks confusingly like an over-busy reclaim scenario,
with lots of contention on the lruvec lock in shrink_inactive_list():
yet has a totally different origin.

Help differentiate with a custom warning in mem_cgroup_update_lru_size(),
even in non-debug kernels; and reset the size to avoid the lockup.  But
the particular bug which suggested this change was mine alone, and since
fixed.
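
Illustration only (not part of the patch): because lru_size is an
unsigned long, a small misaccounting does not go negative but wraps to
an enormous value, which is what keeps reclaim busy on an empty list.
A trivial userspace sketch of that wraparound:

#include <stdio.h>

int main(void)
{
	unsigned long lru_size = 2;	/* pretend two pages are accounted */

	lru_size -= 3;			/* remove one page too many */
	/* On 64-bit this prints 18446744073709551615: reclaim then sees
	 * a vast nonzero size for what is really an empty list. */
	printf("lru_size = %lu\n", lru_size);
	return 0;
}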

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/mm_inline.h |    2 +-
 mm/memcontrol.c           |   24 ++++++++++++++++++++----
 2 files changed, 21 insertions(+), 5 deletions(-)

--- thpfs.orig/include/linux/mm_inline.h	2013-11-03 15:41:51.000000000 -0800
+++ thpfs/include/linux/mm_inline.h	2015-02-20 19:33:25.928096883 -0800
@@ -35,8 +35,8 @@ static __always_inline void del_page_fro
 				struct lruvec *lruvec, enum lru_list lru)
 {
 	int nr_pages = hpage_nr_pages(page);
-	mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
 	list_del(&page->lru);
+	mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
 	__mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, -nr_pages);
 }
 
--- thpfs.orig/mm/memcontrol.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/mm/memcontrol.c	2015-02-20 19:33:25.928096883 -0800
@@ -1296,22 +1296,38 @@ out:
  * @lru: index of lru list the page is sitting on
  * @nr_pages: positive when adding or negative when removing
  *
- * This function must be called when a page is added to or removed from an
- * lru list.
+ * This function must be called under lruvec lock, just before a page is added
+ * to or just after a page is removed from an lru list (that ordering being so
+ * as to allow it to check that lru_size 0 is consistent with list_empty).
  */
 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 				int nr_pages)
 {
 	struct mem_cgroup_per_zone *mz;
 	unsigned long *lru_size;
+	long size;
+	bool empty;
 
 	if (mem_cgroup_disabled())
 		return;
 
 	mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
 	lru_size = mz->lru_size + lru;
-	*lru_size += nr_pages;
-	VM_BUG_ON((long)(*lru_size) < 0);
+	empty = list_empty(lruvec->lists + lru);
+
+	if (nr_pages < 0)
+		*lru_size += nr_pages;
+
+	size = *lru_size;
+	if (WARN(size < 0 || empty != !size,
+	"mem_cgroup_update_lru_size(%p, %d, %d): lru_size %ld but %sempty\n",
+			lruvec, lru, nr_pages, size, empty ? "" : "not ")) {
+		VM_BUG_ON(1);
+		*lru_size = 0;
+	}
+
+	if (nr_pages > 0)
+		*lru_size += nr_pages;
 }
 
 bool mem_cgroup_is_descendant(struct mem_cgroup *memcg, struct mem_cgroup *root)

* [PATCH 02/24] mm: update_lru_size do the __mod_zone_page_state
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  3:54   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  3:54 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, Konstantin Khlebnikov,
	linux-kernel, linux-mm

Konstantin Khlebnikov pointed out (nearly three years ago, when lumpy
reclaim was removed) that lru_size can be updated by -nr_taken once
per call to isolate_lru_pages(), instead of page by page.

Update it inside isolate_lru_pages(), or at its two callsites?  I
chose to update it at the callsites, rearranging and grouping the
updates by nr_taken and nr_scanned together in both.

With one exception, mem_cgroup_update_lru_size(,lru,) is then used
where __mod_zone_page_state(,NR_LRU_BASE+lru,) is used; and we shall
be adding some more calls in a later commit.  Make the code a little
smaller and simpler by incorporating the stat update into the lru_size
update.

The exception was move_active_pages_to_lru(), which aggregated the
pgmoved stat update separately from the individual lru_size updates;
but I still think this is a simplification worth making.

However, the __mod_zone_page_state is not peculiar to mem_cgroups: so
better to use the name update_lru_size, which calls
mem_cgroup_update_lru_size when CONFIG_MEMCG is enabled.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/memcontrol.h |    6 ------
 include/linux/mm_inline.h  |   24 ++++++++++++++++++------
 mm/memcontrol.c            |    4 +++-
 mm/vmscan.c                |   23 ++++++++++-------------
 4 files changed, 31 insertions(+), 26 deletions(-)

--- thpfs.orig/include/linux/memcontrol.h	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/include/linux/memcontrol.h	2015-02-20 19:33:31.052085168 -0800
@@ -275,12 +275,6 @@ mem_cgroup_get_lru_size(struct lruvec *l
 }
 
 static inline void
-mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
-			      int increment)
-{
-}
-
-static inline void
 mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
--- thpfs.orig/include/linux/mm_inline.h	2015-02-20 19:33:25.928096883 -0800
+++ thpfs/include/linux/mm_inline.h	2015-02-20 19:33:31.052085168 -0800
@@ -22,22 +22,34 @@ static inline int page_is_file_cache(str
 	return !PageSwapBacked(page);
 }
 
+static __always_inline void __update_lru_size(struct lruvec *lruvec,
+				enum lru_list lru, int nr_pages)
+{
+	__mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, nr_pages);
+}
+
+static __always_inline void update_lru_size(struct lruvec *lruvec,
+				enum lru_list lru, int nr_pages)
+{
+#ifdef CONFIG_MEMCG
+	mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
+#else
+	__update_lru_size(lruvec, lru, nr_pages);
+#endif
+}
+
 static __always_inline void add_page_to_lru_list(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
-	int nr_pages = hpage_nr_pages(page);
-	mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
+	update_lru_size(lruvec, lru, hpage_nr_pages(page));
 	list_add(&page->lru, &lruvec->lists[lru]);
-	__mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, nr_pages);
 }
 
 static __always_inline void del_page_from_lru_list(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
-	int nr_pages = hpage_nr_pages(page);
 	list_del(&page->lru);
-	mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
-	__mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, -nr_pages);
+	update_lru_size(lruvec, lru, -hpage_nr_pages(page));
 }
 
 /**
--- thpfs.orig/mm/memcontrol.c	2015-02-20 19:33:25.928096883 -0800
+++ thpfs/mm/memcontrol.c	2015-02-20 19:33:31.052085168 -0800
@@ -1309,7 +1309,7 @@ void mem_cgroup_update_lru_size(struct l
 	bool empty;
 
 	if (mem_cgroup_disabled())
-		return;
+		goto out;
 
 	mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
 	lru_size = mz->lru_size + lru;
@@ -1328,6 +1328,8 @@ void mem_cgroup_update_lru_size(struct l
 
 	if (nr_pages > 0)
 		*lru_size += nr_pages;
+out:
+	__update_lru_size(lruvec, lru, nr_pages);
 }
 
 bool mem_cgroup_is_descendant(struct mem_cgroup *memcg, struct mem_cgroup *root)
--- thpfs.orig/mm/vmscan.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/mm/vmscan.c	2015-02-20 19:33:31.056085158 -0800
@@ -1280,7 +1280,6 @@ static unsigned long isolate_lru_pages(u
 
 	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
 		struct page *page;
-		int nr_pages;
 
 		page = lru_to_page(src);
 		prefetchw_prev_lru_page(page, src, flags);
@@ -1289,10 +1288,8 @@ static unsigned long isolate_lru_pages(u
 
 		switch (__isolate_lru_page(page, mode)) {
 		case 0:
-			nr_pages = hpage_nr_pages(page);
-			mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
+			nr_taken += hpage_nr_pages(page);
 			list_move(&page->lru, dst);
-			nr_taken += nr_pages;
 			break;
 
 		case -EBUSY:
@@ -1507,8 +1504,9 @@ shrink_inactive_list(unsigned long nr_to
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, isolate_mode, lru);
 
-	__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
+	update_lru_size(lruvec, lru, -nr_taken);
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
+	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	if (global_reclaim(sc)) {
 		__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
@@ -1529,8 +1527,6 @@ shrink_inactive_list(unsigned long nr_to
 
 	spin_lock_irq(&zone->lru_lock);
 
-	reclaim_stat->recent_scanned[file] += nr_taken;
-
 	if (global_reclaim(sc)) {
 		if (current_is_kswapd())
 			__count_zone_vm_events(PGSTEAL_KSWAPD, zone,
@@ -1650,7 +1646,7 @@ static void move_active_pages_to_lru(str
 		SetPageLRU(page);
 
 		nr_pages = hpage_nr_pages(page);
-		mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
+		update_lru_size(lruvec, lru, nr_pages);
 		list_move(&page->lru, &lruvec->lists[lru]);
 		pgmoved += nr_pages;
 
@@ -1668,7 +1664,7 @@ static void move_active_pages_to_lru(str
 				list_add(&page->lru, pages_to_free);
 		}
 	}
-	__mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
+
 	if (!is_active_lru(lru))
 		__count_vm_events(PGDEACTIVATE, pgmoved);
 }
@@ -1702,14 +1698,15 @@ static void shrink_active_list(unsigned
 
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, isolate_mode, lru);
-	if (global_reclaim(sc))
-		__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
 
+	update_lru_size(lruvec, lru, -nr_taken);
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
+	if (global_reclaim(sc))
+		__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
 	__count_zone_vm_events(PGREFILL, zone, nr_scanned);
-	__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
+
 	spin_unlock_irq(&zone->lru_lock);
 
 	while (!list_empty(&l_hold)) {

* [PATCH 03/24] mm: use __SetPageSwapBacked and don't ClearPageSwapBacked
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  3:56   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  3:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, Mel Gorman,
	linux-kernel, linux-mm

Commit 07a427884348 ("mm: shmem: avoid atomic operation during
shmem_getpage_gfp") rightly replaced one instance of SetPageSwapBacked
by __SetPageSwapBacked, pointing out that the newly allocated page is
not yet visible to other users (except speculative get_page_unless_zero-
ers, who may not update page flags before their further checks).

That was part of a series in which Mel was focused on tmpfs profiles:
but almost all SetPageSwapBacked uses can be so optimized, with the
same justification.  And remove the ClearPageSwapBacked from
read_swap_cache_async()'s and zswap_get_swap_cache_page()'s error
paths: it's not an error to free a page with PG_swapbacked set.

(There's probably scope for further __SetPageFlags in other places,
but SwapBacked is the one I'm interested in at the moment.)
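
For clarity, the ordering this relies on, sketched in kernel style
(illustration only, not an additional hunk; helper names as in v3.19):

#include <linux/gfp.h>
#include <linux/page-flags.h>
#include <linux/pagemap.h>

/* A freshly allocated page is not yet reachable by anyone else, so its
 * flags may be set with the non-atomic __SetPage* forms; once the page
 * has been added to the swap cache, page cache or LRU, other CPUs can
 * find it and the atomic SetPage* forms are needed again. */
static struct page *alloc_private_swapbacked(gfp_t gfp)
{
	struct page *page = alloc_page(gfp);

	if (page) {
		__set_page_locked(page);	/* still invisible to others */
		__SetPageSwapBacked(page);	/* non-atomic set is safe here */
	}
	return page;	/* caller publishes it (and unlocks) afterwards */
}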

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/migrate.c    |    6 +++---
 mm/rmap.c       |    2 +-
 mm/shmem.c      |    4 ++--
 mm/swap_state.c |    3 +--
 mm/zswap.c      |    3 +--
 5 files changed, 8 insertions(+), 10 deletions(-)

--- thpfs.orig/mm/migrate.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/mm/migrate.c	2015-02-20 19:33:35.676074594 -0800
@@ -763,7 +763,7 @@ static int move_to_new_page(struct page
 	newpage->index = page->index;
 	newpage->mapping = page->mapping;
 	if (PageSwapBacked(page))
-		SetPageSwapBacked(newpage);
+		__SetPageSwapBacked(newpage);
 
 	mapping = page_mapping(page);
 	if (!mapping)
@@ -978,7 +978,7 @@ out:
 	 * during isolation.
 	 */
 	if (rc != MIGRATEPAGE_SUCCESS && put_new_page) {
-		ClearPageSwapBacked(newpage);
+		__ClearPageSwapBacked(newpage);
 		put_new_page(newpage, private);
 	} else if (unlikely(__is_movable_balloon_page(newpage))) {
 		/* drop our reference, page already in the balloon */
@@ -1792,7 +1792,7 @@ int migrate_misplaced_transhuge_page(str
 
 	/* Prepare a page as a migration target */
 	__set_page_locked(new_page);
-	SetPageSwapBacked(new_page);
+	__SetPageSwapBacked(new_page);
 
 	/* anon mapping, we can simply copy page->mapping to the new page: */
 	new_page->mapping = page->mapping;
--- thpfs.orig/mm/rmap.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/mm/rmap.c	2015-02-20 19:33:35.676074594 -0800
@@ -1068,7 +1068,7 @@ void page_add_new_anon_rmap(struct page
 	struct vm_area_struct *vma, unsigned long address)
 {
 	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
-	SetPageSwapBacked(page);
+	__SetPageSwapBacked(page);
 	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
 	if (PageTransHuge(page))
 		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
--- thpfs.orig/mm/shmem.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/mm/shmem.c	2015-02-20 19:33:35.676074594 -0800
@@ -987,8 +987,8 @@ static int shmem_replace_page(struct pag
 	flush_dcache_page(newpage);
 
 	__set_page_locked(newpage);
+	__SetPageSwapBacked(newpage);
 	SetPageUptodate(newpage);
-	SetPageSwapBacked(newpage);
 	set_page_private(newpage, swap_index);
 	SetPageSwapCache(newpage);
 
@@ -1177,8 +1177,8 @@ repeat:
 			goto decused;
 		}
 
-		__SetPageSwapBacked(page);
 		__set_page_locked(page);
+		__SetPageSwapBacked(page);
 		if (sgp == SGP_WRITE)
 			__SetPageReferenced(page);
 
--- thpfs.orig/mm/swap_state.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/mm/swap_state.c	2015-02-20 19:33:35.676074594 -0800
@@ -364,7 +364,7 @@ struct page *read_swap_cache_async(swp_e
 
 		/* May fail (-ENOMEM) if radix-tree node allocation failed. */
 		__set_page_locked(new_page);
-		SetPageSwapBacked(new_page);
+		__SetPageSwapBacked(new_page);
 		err = __add_to_swap_cache(new_page, entry);
 		if (likely(!err)) {
 			radix_tree_preload_end();
@@ -376,7 +376,6 @@ struct page *read_swap_cache_async(swp_e
 			return new_page;
 		}
 		radix_tree_preload_end();
-		ClearPageSwapBacked(new_page);
 		__clear_page_locked(new_page);
 		/*
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
--- thpfs.orig/mm/zswap.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/mm/zswap.c	2015-02-20 19:33:35.676074594 -0800
@@ -491,7 +491,7 @@ static int zswap_get_swap_cache_page(swp
 
 		/* May fail (-ENOMEM) if radix-tree node allocation failed. */
 		__set_page_locked(new_page);
-		SetPageSwapBacked(new_page);
+		__SetPageSwapBacked(new_page);
 		err = __add_to_swap_cache(new_page, entry);
 		if (likely(!err)) {
 			radix_tree_preload_end();
@@ -500,7 +500,6 @@ static int zswap_get_swap_cache_page(swp
 			return ZSWAP_SWAPCACHE_NEW;
 		}
 		radix_tree_preload_end();
-		ClearPageSwapBacked(new_page);
 		__clear_page_locked(new_page);
 		/*
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely

* [PATCH 04/24] mm: make page migration's newpage handling more robust
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  3:58   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  3:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, David Rientjes,
	linux-kernel, linux-mm

I don't know of any problem in the current tree, but huge tmpfs wants
to use the custom put_new_page feature of page migration, with a pool
of its own pages: and met some surprises and difficulties in doing so.

An unused newpage is expected to be released with the put_new_page(),
but there was one MIGRATEPAGE_SUCCESS (0) path which released it with
putback_lru_page(): wrong for this custom pool.  That is fixed more
easily by resetting put_new_page once it is no longer needed than by
adding a further flag to modify the rc test.

Definitely an extension rather than a bugfix: pages of that newpage
pool might in rare cases still be speculatively accessed and locked
by another task, so relax move_to_new_page()'s !trylock_page() BUG
to an -EAGAIN, just as when the old page is found locked.  Do its
__ClearPageSwapBacked on failure while still holding the page lock;
and don't reset the old page->mapping if PageAnon, since we often
assume that PageAnon is persistent once set.

Actually, move the trylock_page(newpage) and unlock_page(newpage)
up a level into __unmap_and_move(), to the same level as the trylock
and unlock of old page: though I moved them originally to suit an old
tree (one using mem_cgroup_prepare_migration()), it still seems a
better placement.

Then the remove_migration_ptes() can be done in one place,
instead of at one level on success but another level on failure.

Add some VM_BUG_ONs to enforce the new convention, and adjust
unmap_and_move_huge_page() and balloon_page_migrate() to fit too.

Finally, clean up __unmap_and_move()'s increasingly weird block
"if (anon_vma) nothing; else if (PageSwapCache) nothing; else out;"
while keeping its useful comment on unmapped swapcache.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/balloon_compaction.c |   10 ---
 mm/migrate.c            |  120 +++++++++++++++++++-------------------
 2 files changed, 63 insertions(+), 67 deletions(-)

--- thpfs.orig/mm/balloon_compaction.c	2014-12-07 14:21:05.000000000 -0800
+++ thpfs/mm/balloon_compaction.c	2015-02-20 19:33:40.872062714 -0800
@@ -199,23 +199,17 @@ int balloon_page_migrate(struct page *ne
 	struct balloon_dev_info *balloon = balloon_page_device(page);
 	int rc = -EAGAIN;
 
-	/*
-	 * Block others from accessing the 'newpage' when we get around to
-	 * establishing additional references. We should be the only one
-	 * holding a reference to the 'newpage' at this point.
-	 */
-	BUG_ON(!trylock_page(newpage));
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
 
 	if (WARN_ON(!__is_movable_balloon_page(page))) {
 		dump_page(page, "not movable balloon page");
-		unlock_page(newpage);
 		return rc;
 	}
 
 	if (balloon && balloon->migratepage)
 		rc = balloon->migratepage(balloon, newpage, page, mode);
 
-	unlock_page(newpage);
 	return rc;
 }
 #endif /* CONFIG_BALLOON_COMPACTION */
--- thpfs.orig/mm/migrate.c	2015-02-20 19:33:35.676074594 -0800
+++ thpfs/mm/migrate.c	2015-02-20 19:33:40.876062705 -0800
@@ -746,18 +746,13 @@ static int fallback_migrate_page(struct
  *  MIGRATEPAGE_SUCCESS - success
  */
 static int move_to_new_page(struct page *newpage, struct page *page,
-				int page_was_mapped, enum migrate_mode mode)
+				enum migrate_mode mode)
 {
 	struct address_space *mapping;
 	int rc;
 
-	/*
-	 * Block others from accessing the page when we get around to
-	 * establishing additional references. We are the only one
-	 * holding a reference to the new page at this point.
-	 */
-	if (!trylock_page(newpage))
-		BUG();
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
 
 	/* Prepare mapping for the new page.*/
 	newpage->index = page->index;
@@ -781,16 +776,14 @@ static int move_to_new_page(struct page
 		rc = fallback_migrate_page(mapping, newpage, page, mode);
 
 	if (rc != MIGRATEPAGE_SUCCESS) {
+		__ClearPageSwapBacked(newpage);
 		newpage->mapping = NULL;
 	} else {
 		mem_cgroup_migrate(page, newpage, false);
-		if (page_was_mapped)
-			remove_migration_ptes(page, newpage);
-		page->mapping = NULL;
+		if (!PageAnon(page))
+			page->mapping = NULL;
 	}
 
-	unlock_page(newpage);
-
 	return rc;
 }
 
@@ -839,6 +832,7 @@ static int __unmap_and_move(struct page
 			goto out_unlock;
 		wait_on_page_writeback(page);
 	}
+
 	/*
 	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
 	 * we cannot notice that anon_vma is freed while we migrates a page.
@@ -853,28 +847,29 @@ static int __unmap_and_move(struct page
 		 * getting a hold on an anon_vma from outside one of its mms.
 		 */
 		anon_vma = page_get_anon_vma(page);
-		if (anon_vma) {
-			/*
-			 * Anon page
-			 */
-		} else if (PageSwapCache(page)) {
-			/*
-			 * We cannot be sure that the anon_vma of an unmapped
-			 * swapcache page is safe to use because we don't
-			 * know in advance if the VMA that this page belonged
-			 * to still exists. If the VMA and others sharing the
-			 * data have been freed, then the anon_vma could
-			 * already be invalid.
-			 *
-			 * To avoid this possibility, swapcache pages get
-			 * migrated but are not remapped when migration
-			 * completes
-			 */
-		} else {
+		if (!anon_vma && !PageSwapCache(page))
 			goto out_unlock;
-		}
+		/*
+		 * We cannot be sure that the anon_vma of an unmapped swapcache
+		 * page is safe to use because we don't know in advance if the
+		 * VMA that this page belonged to still exists. If the VMA and
+		 * others sharing it have been freed, then the anon_vma could
+		 * be invalid.  To avoid this possibility, swapcache pages
+		 * are migrated, but not remapped when migration completes.
+		 */
 	}
 
+	/*
+	 * Block others from accessing the new page when we get around to
+	 * establishing additional references. We are usually the only one
+	 * holding a reference to newpage at this point. We used to have a BUG
+	 * here if trylock_page(newpage) fails, but intend to introduce a case
+	 * where there might be a race with the previous use of newpage.  This
+	 * is much like races on the refcount of oldpage: just don't BUG().
+	 */
+	if (unlikely(!trylock_page(newpage)))
+		goto out_unlock;
+
 	if (unlikely(isolated_balloon_page(page))) {
 		/*
 		 * A ballooned page does not need any special attention from
@@ -884,7 +879,7 @@ static int __unmap_and_move(struct page
 		 * the page migration right away (proteced by page lock).
 		 */
 		rc = balloon_page_migrate(newpage, page, mode);
-		goto out_unlock;
+		goto out_unlock_both;
 	}
 
 	/*
@@ -903,30 +898,28 @@ static int __unmap_and_move(struct page
 		VM_BUG_ON_PAGE(PageAnon(page), page);
 		if (page_has_private(page)) {
 			try_to_free_buffers(page);
-			goto out_unlock;
+			goto out_unlock_both;
 		}
-		goto skip_unmap;
-	}
-
-	/* Establish migration ptes or remove ptes */
-	if (page_mapped(page)) {
+	} else if (page_mapped(page)) {
+		/* Establish migration ptes or remove ptes */
 		try_to_unmap(page,
 			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 		page_was_mapped = 1;
 	}
 
-skip_unmap:
 	if (!page_mapped(page))
-		rc = move_to_new_page(newpage, page, page_was_mapped, mode);
+		rc = move_to_new_page(newpage, page, mode);
 
-	if (rc && page_was_mapped)
-		remove_migration_ptes(page, page);
+	if (page_was_mapped)
+		remove_migration_ptes(page,
+			rc == MIGRATEPAGE_SUCCESS ? newpage : page);
 
+out_unlock_both:
+	unlock_page(newpage);
+out_unlock:
 	/* Drop an anon_vma reference if we took one */
 	if (anon_vma)
 		put_anon_vma(anon_vma);
-
-out_unlock:
 	unlock_page(page);
 out:
 	return rc;
@@ -940,10 +933,11 @@ static int unmap_and_move(new_page_t get
 			unsigned long private, struct page *page, int force,
 			enum migrate_mode mode)
 {
-	int rc = 0;
+	int rc = MIGRATEPAGE_SUCCESS;
 	int *result = NULL;
-	struct page *newpage = get_new_page(page, private, &result);
+	struct page *newpage;
 
+	newpage = get_new_page(page, private, &result);
 	if (!newpage)
 		return -ENOMEM;
 
@@ -957,6 +951,8 @@ static int unmap_and_move(new_page_t get
 			goto out;
 
 	rc = __unmap_and_move(page, newpage, force, mode);
+	if (rc == MIGRATEPAGE_SUCCESS)
+		put_new_page = NULL;
 
 out:
 	if (rc != -EAGAIN) {
@@ -977,10 +973,9 @@ out:
 	 * it.  Otherwise, putback_lru_page() will drop the reference grabbed
 	 * during isolation.
 	 */
-	if (rc != MIGRATEPAGE_SUCCESS && put_new_page) {
-		__ClearPageSwapBacked(newpage);
+	if (put_new_page)
 		put_new_page(newpage, private);
-	} else if (unlikely(__is_movable_balloon_page(newpage))) {
+	else if (unlikely(__is_movable_balloon_page(newpage))) {
 		/* drop our reference, page already in the balloon */
 		put_page(newpage);
 	} else
@@ -1018,7 +1013,7 @@ static int unmap_and_move_huge_page(new_
 				struct page *hpage, int force,
 				enum migrate_mode mode)
 {
-	int rc = 0;
+	int rc = -EAGAIN;
 	int *result = NULL;
 	int page_was_mapped = 0;
 	struct page *new_hpage;
@@ -1040,8 +1035,6 @@ static int unmap_and_move_huge_page(new_
 	if (!new_hpage)
 		return -ENOMEM;
 
-	rc = -EAGAIN;
-
 	if (!trylock_page(hpage)) {
 		if (!force || mode != MIGRATE_SYNC)
 			goto out;
@@ -1051,6 +1044,9 @@ static int unmap_and_move_huge_page(new_
 	if (PageAnon(hpage))
 		anon_vma = page_get_anon_vma(hpage);
 
+	if (unlikely(!trylock_page(new_hpage)))
+		goto put_anon;
+
 	if (page_mapped(hpage)) {
 		try_to_unmap(hpage,
 			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
@@ -1058,16 +1054,22 @@ static int unmap_and_move_huge_page(new_
 	}
 
 	if (!page_mapped(hpage))
-		rc = move_to_new_page(new_hpage, hpage, page_was_mapped, mode);
+		rc = move_to_new_page(new_hpage, hpage, mode);
 
-	if (rc != MIGRATEPAGE_SUCCESS && page_was_mapped)
-		remove_migration_ptes(hpage, hpage);
+	if (page_was_mapped)
+		remove_migration_ptes(hpage,
+			rc == MIGRATEPAGE_SUCCESS ? new_hpage : hpage);
 
+	unlock_page(new_hpage);
+
+put_anon:
 	if (anon_vma)
 		put_anon_vma(anon_vma);
 
-	if (rc == MIGRATEPAGE_SUCCESS)
+	if (rc == MIGRATEPAGE_SUCCESS) {
 		hugetlb_cgroup_migrate(hpage, new_hpage);
+		put_new_page = NULL;
+	}
 
 	unlock_page(hpage);
 out:
@@ -1079,7 +1081,7 @@ out:
 	 * it.  Otherwise, put_page() will drop the reference grabbed during
 	 * isolation.
 	 */
-	if (rc != MIGRATEPAGE_SUCCESS && put_new_page)
+	if (put_new_page)
 		put_new_page(new_hpage, private);
 	else
 		put_page(new_hpage);
@@ -1109,7 +1111,7 @@ out:
  *
  * The function returns after 10 attempts or if no pages are movable any more
  * because the list has become empty or no retryable pages exist any more.
- * The caller should call putback_lru_pages() to return pages to the LRU
+ * The caller should call putback_movable_pages() to return pages to the LRU
  * or free list only if ret != 0.
  *
  * Returns the number of pages that were not migrated, or an error code.

-
-out_unlock:
 	unlock_page(page);
 out:
 	return rc;
@@ -940,10 +933,11 @@ static int unmap_and_move(new_page_t get
 			unsigned long private, struct page *page, int force,
 			enum migrate_mode mode)
 {
-	int rc = 0;
+	int rc = MIGRATEPAGE_SUCCESS;
 	int *result = NULL;
-	struct page *newpage = get_new_page(page, private, &result);
+	struct page *newpage;
 
+	newpage = get_new_page(page, private, &result);
 	if (!newpage)
 		return -ENOMEM;
 
@@ -957,6 +951,8 @@ static int unmap_and_move(new_page_t get
 			goto out;
 
 	rc = __unmap_and_move(page, newpage, force, mode);
+	if (rc == MIGRATEPAGE_SUCCESS)
+		put_new_page = NULL;
 
 out:
 	if (rc != -EAGAIN) {
@@ -977,10 +973,9 @@ out:
 	 * it.  Otherwise, putback_lru_page() will drop the reference grabbed
 	 * during isolation.
 	 */
-	if (rc != MIGRATEPAGE_SUCCESS && put_new_page) {
-		__ClearPageSwapBacked(newpage);
+	if (put_new_page)
 		put_new_page(newpage, private);
-	} else if (unlikely(__is_movable_balloon_page(newpage))) {
+	else if (unlikely(__is_movable_balloon_page(newpage))) {
 		/* drop our reference, page already in the balloon */
 		put_page(newpage);
 	} else
@@ -1018,7 +1013,7 @@ static int unmap_and_move_huge_page(new_
 				struct page *hpage, int force,
 				enum migrate_mode mode)
 {
-	int rc = 0;
+	int rc = -EAGAIN;
 	int *result = NULL;
 	int page_was_mapped = 0;
 	struct page *new_hpage;
@@ -1040,8 +1035,6 @@ static int unmap_and_move_huge_page(new_
 	if (!new_hpage)
 		return -ENOMEM;
 
-	rc = -EAGAIN;
-
 	if (!trylock_page(hpage)) {
 		if (!force || mode != MIGRATE_SYNC)
 			goto out;
@@ -1051,6 +1044,9 @@ static int unmap_and_move_huge_page(new_
 	if (PageAnon(hpage))
 		anon_vma = page_get_anon_vma(hpage);
 
+	if (unlikely(!trylock_page(new_hpage)))
+		goto put_anon;
+
 	if (page_mapped(hpage)) {
 		try_to_unmap(hpage,
 			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
@@ -1058,16 +1054,22 @@ static int unmap_and_move_huge_page(new_
 	}
 
 	if (!page_mapped(hpage))
-		rc = move_to_new_page(new_hpage, hpage, page_was_mapped, mode);
+		rc = move_to_new_page(new_hpage, hpage, mode);
 
-	if (rc != MIGRATEPAGE_SUCCESS && page_was_mapped)
-		remove_migration_ptes(hpage, hpage);
+	if (page_was_mapped)
+		remove_migration_ptes(hpage,
+			rc == MIGRATEPAGE_SUCCESS ? new_hpage : hpage);
 
+	unlock_page(new_hpage);
+
+put_anon:
 	if (anon_vma)
 		put_anon_vma(anon_vma);
 
-	if (rc == MIGRATEPAGE_SUCCESS)
+	if (rc == MIGRATEPAGE_SUCCESS) {
 		hugetlb_cgroup_migrate(hpage, new_hpage);
+		put_new_page = NULL;
+	}
 
 	unlock_page(hpage);
 out:
@@ -1079,7 +1081,7 @@ out:
 	 * it.  Otherwise, put_page() will drop the reference grabbed during
 	 * isolation.
 	 */
-	if (rc != MIGRATEPAGE_SUCCESS && put_new_page)
+	if (put_new_page)
 		put_new_page(new_hpage, private);
 	else
 		put_page(new_hpage);
@@ -1109,7 +1111,7 @@ out:
  *
  * The function returns after 10 attempts or if no pages are movable any more
  * because the list has become empty or no retryable pages exist any more.
- * The caller should call putback_lru_pages() to return pages to the LRU
+ * The caller should call putback_movable_pages() to return pages to the LRU
  * or free list only if ret != 0.
  *
  * Returns the number of pages that were not migrated, or an error code.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 05/24] tmpfs: preliminary minor tidyups
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  4:00   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

Make a few cleanups in mm/shmem.c, before going on to complicate it.

shmem_alloc_page() will become more complicated: we can't afford
to have that complication duplicated between a CONFIG_NUMA version
and a !CONFIG_NUMA version, so rearrange the #ifdef'ery there to
yield a single shmem_swapin() and a single shmem_alloc_page().

Yes, it's a shame to inflict the horrid pseudo-vma on non-NUMA
configurations, but one day we'll get around to eliminating it
(elsewhere I have an alloc_pages_mpol() patch, but mpol handling is
subtle and bug-prone, and changed yet again since my last version).
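
For reference, the resulting single shmem_alloc_page() looks roughly
like this (the pseudo-vma setup lines above the hunk are paraphrased
from the existing code, so treat this as a sketch rather than the
exact text):

	static struct page *shmem_alloc_page(gfp_t gfp,
			struct shmem_inode_info *info, pgoff_t index)
	{
		struct vm_area_struct pvma;
		struct page *page;

		/* Pseudo-vma just carries the shared policy (vm_private_data if !NUMA) */
		pvma.vm_start = 0;
		pvma.vm_pgoff = index + info->vfs_inode.i_ino;
		pvma.vm_ops = NULL;
		pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);

		page = alloc_pages_vma(gfp, 0, &pvma, 0, numa_node_id());
		if (page) {
			__set_page_locked(page);
			__SetPageSwapBacked(page);
		}

		/* Drop reference taken by mpol_shared_policy_lookup() */
		mpol_cond_put(pvma.vm_policy);
		return page;
	}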

Move __set_page_locked and __SetPageSwapBacked from shmem_getpage_gfp()
to shmem_alloc_page(): that SwapBacked flag will be useful in future,
to help it to distinguish different cases appropriately.

And the SGP_DIRTY variant of SGP_CACHE is hard to understand and of
little use (IIRC it dates back to when shmem_getpage() returned the
page unlocked): let's kill it and just do the necessary in
shmem_file_read_iter().
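
The read path then does its own dirtying (from the hunk below):

	if (!iter_is_iovec(to))
		sgp = SGP_CACHE;
	...
	if (page) {
		if (sgp == SGP_CACHE)
			set_page_dirty(page);
		unlock_page(page);
	}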

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/mempolicy.h |    6 ++
 mm/shmem.c                |   73 +++++++++++++-----------------------
 2 files changed, 34 insertions(+), 45 deletions(-)

--- thpfs.orig/include/linux/mempolicy.h	2014-12-07 14:21:05.000000000 -0800
+++ thpfs/include/linux/mempolicy.h	2015-02-20 19:33:46.112050733 -0800
@@ -228,6 +228,12 @@ static inline void mpol_free_shared_poli
 {
 }
 
+static inline struct mempolicy *
+mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
+{
+	return NULL;
+}
+
 #define vma_policy(vma) NULL
 
 static inline int
--- thpfs.orig/mm/shmem.c	2015-02-20 19:33:35.676074594 -0800
+++ thpfs/mm/shmem.c	2015-02-20 19:33:46.116050724 -0800
@@ -6,8 +6,8 @@
  *		 2000-2001 Christoph Rohland
  *		 2000-2001 SAP AG
  *		 2002 Red Hat Inc.
- * Copyright (C) 2002-2011 Hugh Dickins.
- * Copyright (C) 2011 Google Inc.
+ * Copyright (C) 2002-2015 Hugh Dickins.
+ * Copyright (C) 2011-2015 Google Inc.
  * Copyright (C) 2002-2005 VERITAS Software Corporation.
  * Copyright (C) 2004 Andi Kleen, SuSE Labs
  *
@@ -99,7 +99,6 @@ struct shmem_falloc {
 enum sgp_type {
 	SGP_READ,	/* don't exceed i_size, don't allocate page */
 	SGP_CACHE,	/* don't exceed i_size, may allocate page */
-	SGP_DIRTY,	/* like SGP_CACHE, but set new page dirty */
 	SGP_WRITE,	/* may exceed i_size, may allocate !Uptodate page */
 	SGP_FALLOC,	/* like SGP_WRITE, but make existing page Uptodate */
 };
@@ -167,7 +166,7 @@ static inline int shmem_reacct_size(unsi
 
 /*
  * ... whereas tmpfs objects are accounted incrementally as
- * pages are allocated, in order to allow huge sparse files.
+ * pages are allocated, in order to allow large sparse files.
  * shmem_getpage reports shmem_acct_block failure as -ENOSPC not -ENOMEM,
  * so that a failure on a sparse tmpfs mapping will give SIGBUS not OOM.
  */
@@ -849,8 +848,7 @@ redirty:
 	return 0;
 }
 
-#ifdef CONFIG_NUMA
-#ifdef CONFIG_TMPFS
+#if defined(CONFIG_NUMA) && defined(CONFIG_TMPFS)
 static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
 {
 	char buffer[64];
@@ -874,7 +872,18 @@ static struct mempolicy *shmem_get_sbmpo
 	}
 	return mpol;
 }
-#endif /* CONFIG_TMPFS */
+#else /* !CONFIG_NUMA || !CONFIG_TMPFS */
+static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
+{
+}
+static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
+{
+	return NULL;
+}
+#endif /* CONFIG_NUMA && CONFIG_TMPFS */
+#ifndef CONFIG_NUMA
+#define vm_policy vm_private_data
+#endif
 
 static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
 			struct shmem_inode_info *info, pgoff_t index)
@@ -910,39 +919,17 @@ static struct page *shmem_alloc_page(gfp
 	pvma.vm_ops = NULL;
 	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
 
-	page = alloc_page_vma(gfp, &pvma, 0);
+	page = alloc_pages_vma(gfp, 0, &pvma, 0, numa_node_id());
+	if (page) {
+		__set_page_locked(page);
+		__SetPageSwapBacked(page);
+	}
 
 	/* Drop reference taken by mpol_shared_policy_lookup() */
 	mpol_cond_put(pvma.vm_policy);
 
 	return page;
 }
-#else /* !CONFIG_NUMA */
-#ifdef CONFIG_TMPFS
-static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
-{
-}
-#endif /* CONFIG_TMPFS */
-
-static inline struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
-			struct shmem_inode_info *info, pgoff_t index)
-{
-	return swapin_readahead(swap, gfp, NULL, 0);
-}
-
-static inline struct page *shmem_alloc_page(gfp_t gfp,
-			struct shmem_inode_info *info, pgoff_t index)
-{
-	return alloc_page(gfp);
-}
-#endif /* CONFIG_NUMA */
-
-#if !defined(CONFIG_NUMA) || !defined(CONFIG_TMPFS)
-static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
-{
-	return NULL;
-}
-#endif
 
 /*
  * When a page is moved from swapcache to shmem filecache (either by the
@@ -986,8 +973,6 @@ static int shmem_replace_page(struct pag
 	copy_highpage(newpage, oldpage);
 	flush_dcache_page(newpage);
 
-	__set_page_locked(newpage);
-	__SetPageSwapBacked(newpage);
 	SetPageUptodate(newpage);
 	set_page_private(newpage, swap_index);
 	SetPageSwapCache(newpage);
@@ -1177,11 +1162,6 @@ repeat:
 			goto decused;
 		}
 
-		__set_page_locked(page);
-		__SetPageSwapBacked(page);
-		if (sgp == SGP_WRITE)
-			__SetPageReferenced(page);
-
 		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
 		if (error)
 			goto decused;
@@ -1205,6 +1185,8 @@ repeat:
 		spin_unlock(&info->lock);
 		alloced = true;
 
+		if (sgp == SGP_WRITE)
+			__SetPageReferenced(page);
 		/*
 		 * Let SGP_FALLOC use the SGP_WRITE optimization on a new page.
 		 */
@@ -1221,8 +1203,6 @@ clear:
 			flush_dcache_page(page);
 			SetPageUptodate(page);
 		}
-		if (sgp == SGP_DIRTY)
-			set_page_dirty(page);
 	}
 
 	/* Perhaps the file has been truncated since we checked */
@@ -1537,7 +1517,7 @@ static ssize_t shmem_file_read_iter(stru
 	 * and even mark them dirty, so it cannot exceed the max_blocks limit.
 	 */
 	if (!iter_is_iovec(to))
-		sgp = SGP_DIRTY;
+		sgp = SGP_CACHE;
 
 	index = *ppos >> PAGE_CACHE_SHIFT;
 	offset = *ppos & ~PAGE_CACHE_MASK;
@@ -1563,8 +1543,11 @@ static ssize_t shmem_file_read_iter(stru
 				error = 0;
 			break;
 		}
-		if (page)
+		if (page) {
+			if (sgp == SGP_CACHE)
+				set_page_dirty(page);
 			unlock_page(page);
+		}
 
 		/*
 		 * We must evaluate after, since reads (unlike writes)

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 06/24] huge tmpfs: prepare counts in meminfo, vmstat and SysRq-m
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  4:01   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

Abbreviate NR_ANON_TRANSPARENT_HUGEPAGES to NR_ANON_HUGEPAGES,
add NR_SHMEM_HUGEPAGES, NR_SHMEM_PMDMAPPED, NR_SHMEM_FREEHOLES:
to be accounted in later commits, when we shall need some visibility.

Shown in /proc/meminfo and /sys/devices/system/node/nodeN/meminfo
as AnonHugePages (as before), ShmemHugePages, ShmemPmdMapped,
ShmemFreeHoles; /proc/vmstat and /sys/devices/system/node/nodeN/vmstat
as nr_anon_transparent_hugepages (as before), nr_shmem_hugepages,
nr_shmem_pmdmapped, nr_shmem_freeholes.

Be upfront about this being Shmem, neither file nor anon: Shmem
is sometimes counted as file (as in Cached) and sometimes as anon
(as in Active(anon)), which is too confusing.  Shmem is already
shown in meminfo, so use that term, rather than tmpfs or shm.

ShmemHugePages will show that portion of Shmem which is allocated
on complete huge pages.  ShmemPmdMapped (named not to misalign the
%8lu) will show that portion of ShmemHugePages which is mapped into
userspace with huge pmds.  ShmemFreeHoles will show the wastage
from using huge pages for small, or sparsely occupied, or unrounded
files: wastage not included in Shmem or MemFree, but will be freed
under memory pressure.  (But no count for the partially occupied
portions of huge pages: seems less important, but could be added.)
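
As a rough illustration of how these will be consumed, a trivial
userspace reader (not part of this patch; it simply prints the
/proc/meminfo lines whose names are introduced above, plus Shmem and
AnonHugePages) might be:

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char line[128];
		FILE *f = fopen("/proc/meminfo", "r");

		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f)) {
			/* AnonHugePages, Shmem, ShmemHugePages, ShmemPmdMapped, ShmemFreeHoles */
			if (!strncmp(line, "AnonHugePages", 13) ||
			    !strncmp(line, "Shmem", 5))
				fputs(line, stdout);
		}
		fclose(f);
		return 0;
	}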

Since shmem_freeholes are otherwise hidden, they ought to be shown by
show_free_areas(), in OOM-kill or ALT-SysRq-m or /proc/sysrq-trigger m.
shmem_hugepages is a subset of shmem, and shmem_pmdmapped a subset of
shmem_hugepages: there is not a strong argument for adding them here
(anon_hugepages is not shown), but include them anyway for reassurance.

The lines get rather long: abbreviate thus
  mapped:19778 shmem:38 pagetables:1153 bounce:0
  shmem_hugepages:0 _pmdmapped:0 _freeholes:2044
  free_cma:0
and
... shmem:92kB _hugepages:0kB _pmdmapped:0kB _freeholes:0kB ...

Tidy up the CONFIG_TRANSPARENT_HUGEPAGE printf blocks in
fs/proc/meminfo.c and drivers/base/node.c: the shorter names help.
Clarify a comment in page_remove_file_rmap() to refer to "hugetlbfs pages"
rather than hugepages generally.  I left arch/tile/mm/pgtable.c's
show_mem() unchanged: tile does not HAVE_ARCH_TRANSPARENT_HUGEPAGE.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 drivers/base/node.c    |   20 +++++++++++---------
 fs/proc/meminfo.c      |   11 ++++++++---
 include/linux/mmzone.h |    5 ++++-
 mm/huge_memory.c       |    2 +-
 mm/page_alloc.c        |   13 +++++++++++--
 mm/rmap.c              |    9 ++++-----
 mm/vmstat.c            |    3 +++
 7 files changed, 42 insertions(+), 21 deletions(-)

--- thpfs.orig/drivers/base/node.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/drivers/base/node.c	2015-02-20 19:33:51.488038441 -0800
@@ -111,9 +111,6 @@ static ssize_t node_read_meminfo(struct
 		       "Node %d Slab:           %8lu kB\n"
 		       "Node %d SReclaimable:   %8lu kB\n"
 		       "Node %d SUnreclaim:     %8lu kB\n"
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		       "Node %d AnonHugePages:  %8lu kB\n"
-#endif
 			,
 		       nid, K(node_page_state(nid, NR_FILE_DIRTY)),
 		       nid, K(node_page_state(nid, NR_WRITEBACK)),
@@ -130,13 +127,18 @@ static ssize_t node_read_meminfo(struct
 		       nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE) +
 				node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
 		       nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE)),
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
-			, nid,
-			K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
-			HPAGE_PMD_NR));
-#else
 		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	n += sprintf(buf + n,
+		"Node %d AnonHugePages:  %8lu kB\n"
+		"Node %d ShmemHugePages: %8lu kB\n"
+		"Node %d ShmemPmdMapped: %8lu kB\n"
+		"Node %d ShmemFreeHoles: %8lu kB\n",
+		nid, K(node_page_state(nid, NR_ANON_HUGEPAGES)*HPAGE_PMD_NR),
+		nid, K(node_page_state(nid, NR_SHMEM_HUGEPAGES)*HPAGE_PMD_NR),
+		nid, K(node_page_state(nid, NR_SHMEM_PMDMAPPED)*HPAGE_PMD_NR),
+		nid, K(node_page_state(nid, NR_SHMEM_FREEHOLES)));
 #endif
 	n += hugetlb_report_node_meminfo(nid, buf + n);
 	return n;
--- thpfs.orig/fs/proc/meminfo.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/fs/proc/meminfo.c	2015-02-20 19:33:51.488038441 -0800
@@ -140,6 +140,9 @@ static int meminfo_proc_show(struct seq_
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 		"AnonHugePages:  %8lu kB\n"
+		"ShmemHugePages: %8lu kB\n"
+		"ShmemPmdMapped: %8lu kB\n"
+		"ShmemFreeHoles: %8lu kB\n"
 #endif
 #ifdef CONFIG_CMA
 		"CmaTotal:       %8lu kB\n"
@@ -194,11 +197,13 @@ static int meminfo_proc_show(struct seq_
 		vmi.used >> 10,
 		vmi.largest_chunk >> 10
 #ifdef CONFIG_MEMORY_FAILURE
-		, atomic_long_read(&num_poisoned_pages) << (PAGE_SHIFT - 10)
+		, K(atomic_long_read(&num_poisoned_pages))
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		, K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
-		   HPAGE_PMD_NR)
+		, K(global_page_state(NR_ANON_HUGEPAGES) * HPAGE_PMD_NR)
+		, K(global_page_state(NR_SHMEM_HUGEPAGES) * HPAGE_PMD_NR)
+		, K(global_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR)
+		, K(global_page_state(NR_SHMEM_FREEHOLES))
 #endif
 #ifdef CONFIG_CMA
 		, K(totalcma_pages)
--- thpfs.orig/include/linux/mmzone.h	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/include/linux/mmzone.h	2015-02-20 19:33:51.492038431 -0800
@@ -155,7 +155,10 @@ enum zone_stat_item {
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
 	WORKINGSET_NODERECLAIM,
-	NR_ANON_TRANSPARENT_HUGEPAGES,
+	NR_ANON_HUGEPAGES,	/* transparent anon huge pages */
+	NR_SHMEM_HUGEPAGES,	/* transparent shmem huge pages */
+	NR_SHMEM_PMDMAPPED,	/* shmem huge pages currently mapped hugely */
+	NR_SHMEM_FREEHOLES,	/* unused memory of high-order allocations */
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
 
--- thpfs.orig/mm/huge_memory.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/mm/huge_memory.c	2015-02-20 19:33:51.492038431 -0800
@@ -1747,7 +1747,7 @@ static void __split_huge_page_refcount(s
 	atomic_sub(tail_count, &page->_count);
 	BUG_ON(atomic_read(&page->_count) <= 0);
 
-	__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
+	__mod_zone_page_state(zone, NR_ANON_HUGEPAGES, -1);
 
 	ClearPageCompound(page);
 	compound_unlock(page);
--- thpfs.orig/mm/page_alloc.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/mm/page_alloc.c	2015-02-20 19:33:51.492038431 -0800
@@ -3279,10 +3279,10 @@ void show_free_areas(unsigned int filter
 
 	printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
 		" active_file:%lu inactive_file:%lu isolated_file:%lu\n"
-		" unevictable:%lu"
-		" dirty:%lu writeback:%lu unstable:%lu\n"
+		" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
 		" free:%lu slab_reclaimable:%lu slab_unreclaimable:%lu\n"
 		" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
+		" shmem_hugepages:%lu _pmdmapped:%lu _freeholes:%lu\n"
 		" free_cma:%lu\n",
 		global_page_state(NR_ACTIVE_ANON),
 		global_page_state(NR_INACTIVE_ANON),
@@ -3301,6 +3301,9 @@ void show_free_areas(unsigned int filter
 		global_page_state(NR_SHMEM),
 		global_page_state(NR_PAGETABLE),
 		global_page_state(NR_BOUNCE),
+		global_page_state(NR_SHMEM_HUGEPAGES),
+		global_page_state(NR_SHMEM_PMDMAPPED),
+		global_page_state(NR_SHMEM_FREEHOLES),
 		global_page_state(NR_FREE_CMA_PAGES));
 
 	for_each_populated_zone(zone) {
@@ -3328,6 +3331,9 @@ void show_free_areas(unsigned int filter
 			" writeback:%lukB"
 			" mapped:%lukB"
 			" shmem:%lukB"
+			" _hugepages:%lukB"
+			" _pmdmapped:%lukB"
+			" _freeholes:%lukB"
 			" slab_reclaimable:%lukB"
 			" slab_unreclaimable:%lukB"
 			" kernel_stack:%lukB"
@@ -3358,6 +3364,9 @@ void show_free_areas(unsigned int filter
 			K(zone_page_state(zone, NR_WRITEBACK)),
 			K(zone_page_state(zone, NR_FILE_MAPPED)),
 			K(zone_page_state(zone, NR_SHMEM)),
+			K(zone_page_state(zone, NR_SHMEM_HUGEPAGES)),
+			K(zone_page_state(zone, NR_SHMEM_PMDMAPPED)),
+			K(zone_page_state(zone, NR_SHMEM_FREEHOLES)),
 			K(zone_page_state(zone, NR_SLAB_RECLAIMABLE)),
 			K(zone_page_state(zone, NR_SLAB_UNRECLAIMABLE)),
 			zone_page_state(zone, NR_KERNEL_STACK) *
--- thpfs.orig/mm/rmap.c	2015-02-20 19:33:35.676074594 -0800
+++ thpfs/mm/rmap.c	2015-02-20 19:33:51.496038422 -0800
@@ -1038,8 +1038,7 @@ void do_page_add_anon_rmap(struct page *
 		 * disabled.
 		 */
 		if (PageTransHuge(page))
-			__inc_zone_page_state(page,
-					      NR_ANON_TRANSPARENT_HUGEPAGES);
+			__inc_zone_page_state(page, NR_ANON_HUGEPAGES);
 		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
 				hpage_nr_pages(page));
 	}
@@ -1071,7 +1070,7 @@ void page_add_new_anon_rmap(struct page
 	__SetPageSwapBacked(page);
 	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
 	if (PageTransHuge(page))
-		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+		__inc_zone_page_state(page, NR_ANON_HUGEPAGES);
 	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
 			hpage_nr_pages(page));
 	__page_set_anon_rmap(page, vma, address, 1);
@@ -1109,7 +1108,7 @@ static void page_remove_file_rmap(struct
 	if (!atomic_add_negative(-1, &page->_mapcount))
 		goto out;
 
-	/* Hugepages are not counted in NR_FILE_MAPPED for now. */
+	/* hugetlbfs pages are not counted in NR_FILE_MAPPED for now. */
 	if (unlikely(PageHuge(page)))
 		goto out;
 
@@ -1154,7 +1153,7 @@ void page_remove_rmap(struct page *page)
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
 	if (PageTransHuge(page))
-		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+		__dec_zone_page_state(page, NR_ANON_HUGEPAGES);
 
 	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
 			      -hpage_nr_pages(page));
--- thpfs.orig/mm/vmstat.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/mm/vmstat.c	2015-02-20 19:33:51.496038422 -0800
@@ -795,6 +795,9 @@ const char * const vmstat_text[] = {
 	"workingset_activate",
 	"workingset_nodereclaim",
 	"nr_anon_transparent_hugepages",
+	"nr_shmem_hugepages",
+	"nr_shmem_pmdmapped",
+	"nr_shmem_freeholes",
 	"nr_free_cma",
 
 	/* enum writeback_stat_item counters */

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 07/24] huge tmpfs: include shmem freeholes in available memory counts
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  4:03   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:03 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

ShmemFreeHoles will be freed under memory pressure, but are not included
in MemFree: they need to be added into MemAvailable, and wherever the
kernel calculates freeable pages; but in a few other places also.
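
In each place the adjustment is the same one-liner as for
MemAvailable (from the hunks below; the zone-level sites use
zone_page_state() instead):

	available += global_page_state(NR_SHMEM_FREEHOLES);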

There is certainly room for debate about those places: I've made my
selection (and kept some notes); you may come up with a different list.
I decided against max_sane_readahead(), because I suspect it's already
too much; and left drivers/staging/android/lowmemorykiller.c out for now.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 fs/proc/meminfo.c   |    6 ++++++
 mm/mmap.c           |    1 +
 mm/page-writeback.c |    2 ++
 mm/vmscan.c         |    3 ++-
 4 files changed, 11 insertions(+), 1 deletion(-)

--- thpfs.orig/fs/proc/meminfo.c	2015-02-20 19:33:51.488038441 -0800
+++ thpfs/fs/proc/meminfo.c	2015-02-20 19:33:56.528026917 -0800
@@ -76,6 +76,12 @@ static int meminfo_proc_show(struct seq_
 	available += pagecache;
 
 	/*
+	 * Shmem freeholes help to keep huge pages intact, but contain
+	 * no data, and can be shrunk whenever small pages are needed.
+	 */
+	available += global_page_state(NR_SHMEM_FREEHOLES);
+
+	/*
 	 * Part of the reclaimable slab consists of items that are in use,
 	 * and cannot be freed. Cap this estimate at the low watermark.
 	 */
--- thpfs.orig/mm/mmap.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/mm/mmap.c	2015-02-20 19:33:56.528026917 -0800
@@ -168,6 +168,7 @@ int __vm_enough_memory(struct mm_struct
 
 	if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
 		free = global_page_state(NR_FREE_PAGES);
+		free += global_page_state(NR_SHMEM_FREEHOLES);
 		free += global_page_state(NR_FILE_PAGES);
 
 		/*
--- thpfs.orig/mm/page-writeback.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/mm/page-writeback.c	2015-02-20 19:33:56.532026908 -0800
@@ -187,6 +187,7 @@ static unsigned long zone_dirtyable_memo
 	nr_pages = zone_page_state(zone, NR_FREE_PAGES);
 	nr_pages -= min(nr_pages, zone->dirty_balance_reserve);
 
+	nr_pages += zone_page_state(zone, NR_SHMEM_FREEHOLES);
 	nr_pages += zone_page_state(zone, NR_INACTIVE_FILE);
 	nr_pages += zone_page_state(zone, NR_ACTIVE_FILE);
 
@@ -241,6 +242,7 @@ static unsigned long global_dirtyable_me
 	x = global_page_state(NR_FREE_PAGES);
 	x -= min(x, dirty_balance_reserve);
 
+	x += global_page_state(NR_SHMEM_FREEHOLES);
 	x += global_page_state(NR_INACTIVE_FILE);
 	x += global_page_state(NR_ACTIVE_FILE);
 
--- thpfs.orig/mm/vmscan.c	2015-02-20 19:33:31.056085158 -0800
+++ thpfs/mm/vmscan.c	2015-02-20 19:33:56.532026908 -0800
@@ -1946,7 +1946,8 @@ static void get_scan_count(struct lruvec
 		unsigned long zonefile;
 		unsigned long zonefree;
 
-		zonefree = zone_page_state(zone, NR_FREE_PAGES);
+		zonefree = zone_page_state(zone, NR_FREE_PAGES) +
+			   zone_page_state(zone, NR_SHMEM_FREEHOLES);
 		zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
 			   zone_page_state(zone, NR_INACTIVE_FILE);
 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 08/24] huge tmpfs: prepare huge=N mount option and /proc/sys/vm/shmem_huge
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  4:05   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:05 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

Plumb in a new "huge=1" or "huge=0" mount option to tmpfs: I don't
want to get into a maze of boot options, madvises and fadvises at
this stage, nor extend the use of the existing THP tuning to tmpfs;
though either might be pursued later on.  We just want a way to ask
a tmpfs filesystem to favor huge pages, and a way to turn that off
again when it doesn't work out so well.  Default of course is off.

"mount -o remount,huge=N /mountpoint" works fine after mount:
remounting from huge=1 (on) to huge=0 (off) will not attempt to
break up huge pages at all, just stop more from being allocated.
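
The same can be done programmatically; a minimal sketch (the
mountpoint and size are arbitrary, only the "huge=" string is the
option added by this patch):

	#include <stdio.h>
	#include <sys/mount.h>

	int main(void)
	{
		/* mount a tmpfs that favors huge pages; directory must already exist */
		if (mount("tmpfs", "/mnt/hugetmp", "tmpfs", 0, "huge=1,size=1G")) {
			perror("mount");
			return 1;
		}
		/* later: stop further huge allocations, without breaking up existing ones */
		if (mount("tmpfs", "/mnt/hugetmp", "tmpfs", MS_REMOUNT, "huge=0,size=1G")) {
			perror("remount");
			return 1;
		}
		return 0;
	}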

It's possible that we shall allow more values for the option later,
to select different strategies (e.g. how hard to try when allocating
huge pages, or when to map hugely and when not, or how sparse a huge
page should be before it is split up), either for experiments, or well
baked in: so use an unsigned char in the superblock rather than a bool.

No new config option: put this under CONFIG_TRANSPARENT_HUGEPAGE,
which is the appropriate option to protect those who don't want
the new bloat, and with which we shall share some pmd code.  Use a
"name=numeric_value" format like most other tmpfs options.  Prohibit
the option when !CONFIG_TRANSPARENT_HUGEPAGE, just as mpol is invalid
without CONFIG_NUMA (was hidden in mpol_parse_str(): make it explicit).
Allow setting >0 only if the machine has_transparent_hugepage().

But what about Shmem with no user-visible mount?  SysV SHM, memfds,
shared anonymous mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers'
DRM objects, Ashmem.  Though unlikely to suit all usages, provide
sysctl /proc/sys/vm/shmem_huge to experiment with huge on those.

And allow shmem_huge to take two further values: -1 for use in
emergencies, to force the huge option off for all mounts; and
(currently) 2, to force the huge option on for all mounts, which is
very useful for testing.
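
For testing, the sysctl can be flipped like this (a sketch; it needs
root and a kernel with this patch applied):

	#include <stdio.h>

	static int set_shmem_huge(int val)
	{
		FILE *f = fopen("/proc/sys/vm/shmem_huge", "w");

		if (!f)
			return -1;
		fprintf(f, "%d\n", val);
		return fclose(f);	/* 0 on success */
	}

	int main(void)
	{
		if (set_shmem_huge(2))	/* force huge on for all shmem mounts */
			perror("shmem_huge");
		/* ... run the workload under test ... */
		if (set_shmem_huge(0))	/* back to default: only mounts with huge=1 */
			perror("shmem_huge");
		return 0;
	}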

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/shmem_fs.h |   16 ++++++----
 kernel/sysctl.c          |   12 +++++++
 mm/shmem.c               |   59 +++++++++++++++++++++++++++++++++++++
 3 files changed, 82 insertions(+), 5 deletions(-)

--- thpfs.orig/include/linux/shmem_fs.h	2014-10-05 12:23:04.000000000 -0700
+++ thpfs/include/linux/shmem_fs.h	2015-02-20 19:34:01.464015631 -0800
@@ -31,9 +31,10 @@ struct shmem_sb_info {
 	unsigned long max_inodes;   /* How many inodes are allowed */
 	unsigned long free_inodes;  /* How many are left for allocation */
 	spinlock_t stat_lock;	    /* Serialize shmem_sb_info changes */
+	umode_t mode;		    /* Mount mode for root directory */
+	unsigned char huge;	    /* Whether to try for hugepages */
 	kuid_t uid;		    /* Mount uid for root directory */
 	kgid_t gid;		    /* Mount gid for root directory */
-	umode_t mode;		    /* Mount mode for root directory */
 	struct mempolicy *mpol;     /* default memory policy for mappings */
 };
 
@@ -68,18 +69,23 @@ static inline struct page *shmem_read_ma
 }
 
 #ifdef CONFIG_TMPFS
-
 extern int shmem_add_seals(struct file *file, unsigned int seals);
 extern int shmem_get_seals(struct file *file);
 extern long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
-
 #else
-
 static inline long shmem_fcntl(struct file *f, unsigned int c, unsigned long a)
 {
 	return -EINVAL;
 }
+#endif /* CONFIG_TMPFS */
 
-#endif
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SHMEM)
+# ifdef CONFIG_SYSCTL
+struct ctl_table;
+extern int shmem_huge, shmem_huge_min, shmem_huge_max;
+extern int shmem_huge_sysctl(struct ctl_table *table, int write,
+			     void __user *buffer, size_t *lenp, loff_t *ppos);
+# endif /* CONFIG_SYSCTL */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SHMEM */
 
 #endif
--- thpfs.orig/kernel/sysctl.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/kernel/sysctl.c	2015-02-20 19:34:01.464015631 -0800
@@ -42,6 +42,7 @@
 #include <linux/ratelimit.h>
 #include <linux/compaction.h>
 #include <linux/hugetlb.h>
+#include <linux/shmem_fs.h>
 #include <linux/initrd.h>
 #include <linux/key.h>
 #include <linux/times.h>
@@ -1241,6 +1242,17 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
 	},
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SHMEM)
+	{
+		.procname	= "shmem_huge",
+		.data		= &shmem_huge,
+		.maxlen		= sizeof(shmem_huge),
+		.mode		= 0644,
+		.proc_handler	= shmem_huge_sysctl,
+		.extra1		= &shmem_huge_min,
+		.extra2		= &shmem_huge_max,
+	},
+#endif
 #ifdef CONFIG_HUGETLB_PAGE
 	{
 		.procname	= "nr_hugepages",
--- thpfs.orig/mm/shmem.c	2015-02-20 19:33:46.116050724 -0800
+++ thpfs/mm/shmem.c	2015-02-20 19:34:01.464015631 -0800
@@ -58,6 +58,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/falloc.h>
 #include <linux/splice.h>
 #include <linux/security.h>
+#include <linux/sysctl.h>
 #include <linux/swapops.h>
 #include <linux/mempolicy.h>
 #include <linux/namei.h>
@@ -291,6 +292,25 @@ static bool shmem_confirm_swap(struct ad
 }
 
 /*
+ * Definitions for "huge tmpfs": tmpfs mounted with the huge=1 option
+ */
+
+/* Special values for /proc/sys/vm/shmem_huge */
+#define SHMEM_HUGE_DENY		(-1)
+#define SHMEM_HUGE_FORCE	(2)
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/* ifdef here to avoid bloating shmem.o when not necessary */
+
+int shmem_huge __read_mostly;
+
+#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
+
+#define shmem_huge SHMEM_HUGE_DENY
+
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+/*
  * Like add_to_page_cache_locked, but error if expected item has gone.
  */
 static int shmem_add_to_page_cache(struct page *page,
@@ -2802,11 +2822,21 @@ static int shmem_parse_options(char *opt
 			sbinfo->gid = make_kgid(current_user_ns(), gid);
 			if (!gid_valid(sbinfo->gid))
 				goto bad_val;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		} else if (!strcmp(this_char, "huge")) {
+			if (kstrtou8(value, 10, &sbinfo->huge) < 0 ||
+			    sbinfo->huge >= SHMEM_HUGE_FORCE)
+				goto bad_val;
+			if (sbinfo->huge && !has_transparent_hugepage())
+				goto bad_val;
+#endif
+#ifdef CONFIG_NUMA
 		} else if (!strcmp(this_char,"mpol")) {
 			mpol_put(mpol);
 			mpol = NULL;
 			if (mpol_parse_str(value, &mpol))
 				goto bad_val;
+#endif
 		} else {
 			printk(KERN_ERR "tmpfs: Bad mount option %s\n",
 			       this_char);
@@ -2853,6 +2883,7 @@ static int shmem_remount_fs(struct super
 		goto out;
 
 	error = 0;
+	sbinfo->huge = config.huge;
 	sbinfo->max_blocks  = config.max_blocks;
 	sbinfo->max_inodes  = config.max_inodes;
 	sbinfo->free_inodes = config.max_inodes - inodes;
@@ -2886,6 +2917,9 @@ static int shmem_show_options(struct seq
 	if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID))
 		seq_printf(seq, ",gid=%u",
 				from_kgid_munged(&init_user_ns, sbinfo->gid));
+	/* Rightly or wrongly, show huge mount option unmasked by shmem_huge */
+	if (sbinfo->huge)
+		seq_printf(seq, ",huge=%u", sbinfo->huge);
 	shmem_show_mpol(seq, sbinfo->mpol);
 	return 0;
 }
@@ -3242,6 +3276,31 @@ out4:
 	return error;
 }
 
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SYSCTL)
+int shmem_huge_min = SHMEM_HUGE_DENY;
+int shmem_huge_max = SHMEM_HUGE_FORCE;
+/*
+ * /proc/sys/vm/shmem_huge sysctl for internal shm_mnt, and mount override:
+ * -1 disables huge on shm_mnt and all mounts, for emergency use
+ *  0 disables huge on internal shm_mnt (which has no way to be remounted)
+ *  1  enables huge on internal shm_mnt (which has no way to be remounted)
+ *  2  enables huge on shm_mnt and all mounts, w/o needing option, for testing
+ *     (but we may add more huge options, and push that 2 for testing upwards)
+ */
+int shmem_huge_sysctl(struct ctl_table *table, int write,
+		      void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int err;
+
+	if (!has_transparent_hugepage())
+		shmem_huge_max = 0;
+	err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	if (write && !err && !IS_ERR(shm_mnt))
+		SHMEM_SB(shm_mnt->mnt_sb)->huge = (shmem_huge > 0);
+	return err;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSCTL */
+
 #else /* !CONFIG_SHMEM */
 
 /*
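
For orientation, and not itself a hunk of this patch: the way the
sysctl above and the per-mount huge option combine is easiest to see
in the test which the next patch adds to shmem_getpage_gfp().  A
condensed sketch, with the local variable named here only for
illustration:

	/* -1 denies huge everywhere, 2 forces it everywhere, 0/1 defer to huge= */
	bool try_huge = shmem_huge == SHMEM_HUGE_FORCE ||
			(sbinfo->huge && shmem_huge != SHMEM_HUGE_DENY);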

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 09/24] huge tmpfs: try to allocate huge pages, split into a team
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  4:06   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:06 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

Now we get down to work.  The idea here is that compound pages were
ideal for hugetlbfs, with its own separate pool to which huge pages
must be freed.  Not so suitable for anonymous THP, which was forced
to adopt a strange refcount-in-mapcount technique to track its tails -
compare v2.6.37's mm/swap.c:put_compound_page() with a current version!
And not at all suitable for pagecache THP, where one process may want
to map 4kB of a file while another maps 2MB spanning the same offset.

And since anonymous THP was confined to private mappings, that blurred
the distinction between the mapping and the object mapped: so splitting
the mapping entailed splitting the object (the compound page).  For a
long time, pagecache THP appeared to be an even greater challenge: but
that's when you try to follow the anonymous lead too closely.  Actually
pagecache THP is easier, once you abandon compound pages, and consider
the object and its mapping separately.

(I think it was probably a mistake to use compound pages for anonymous
THP, and that it can be simplified by conversion to the same as we do
here: but I have spent no time thinking that through, I may be quite
wrong, and it's certainly no part of this patch series to change how
anonymous THP works today.)

This and the next patches are entirely concerned with the object and
not its mapping: but there will be no chance of mapping the object
with huge pmds, unless it is allocated in huge extents.  Mounting
a tmpfs with the huge=1 option requests that objects be allocated
in huge extents, when memory fragmentation and pressure permit.

The main change here is, of course, to shmem_alloc_page(), and to
shmem_add_to_page_cache(): with attention to the races which may
well occur in between the two calls - which involves a rather ugly
"hugehint" interface between them, and the shmem_hugeteam_lookup()
helper which checks the surrounding area for a previously allocated
huge page, or a small page implying earlier huge allocation failure.
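
Condensed from the hunks below, the protocol between the two is
roughly: shmem_alloc_page() asks shmem_hugeteam_lookup() speculatively
under RCU, then shmem_add_to_page_cache() repeats the lookup under
tree_lock to confirm that nothing changed in between (a sketch only,
surrounding code elided):

	/* shmem_alloc_page(): decide speculatively what to allocate */
	rcu_read_lock();
	*hugehint = shmem_hugeteam_lookup(mapping, index, true);
	rcu_read_unlock();
	/* ... allocate a new huge team, a small page, or reuse a team page ... */

	/* shmem_add_to_page_cache(): reconfirm before inserting */
	spin_lock_irq(&mapping->tree_lock);
	if (shmem_hugeteam_lookup(mapping, index, false) != hugehint)
		error = -EEXIST;	/* lost the race: caller will retry */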

shmem_getpage_gfp() works in terms of small (meaning typically 4kB)
pages just as before; the radix_tree holds a slot for each small
page just as before; memcg is charged for small pages just as before;
the LRUs hold small pages just as before; get_user_pages() will work
on ordinarily-refcounted small pages just as before.  Which keeps it
all reassuringly simple, but is sure to show up in greater overhead
than hugetlbfs, when first establishing an object; and reclaim from
LRU (with 512 items to go through when only 1 will free them) is sure
to demand cleverer handling in later patches.

The huge page itself is allocated (currently with __GFP_NORETRY)
as a high-order page, but not as a compound page; and that high-order
page is immediately split into its separately refcounted subpages (no
overhead to that: establishing a compound page itself has to set up
each tail page).  Only the small page that was asked for is put into
the radix_tree ("page cache") at that time, the remainder left unused
(but with page count 1).  The whole is loosely "held together" with a
new PageTeam flag on the head page (whether or not it was put in the
cache), and then set one by one on each tail page as it is instantiated.
There is no requirement that the file be written sequentially.
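
The heart of that, condensed from the shmem_alloc_page() hunk below
(locking and refcounting details elided):

	head = alloc_pages_vma(gfp | __GFP_NORETRY | __GFP_NOWARN,
			       HPAGE_PMD_ORDER, &pvma, 0, numa_node_id());
	if (head) {
		split_page(head, HPAGE_PMD_ORDER);	/* 512 order-0 pages */
		__SetPageTeam(head);			/* loose team binding */
		head->mapping = mapping;
		head->index = round_down(index, HPAGE_PMD_NR);
		/* only the page actually asked for enters the cache now */
		page = head + (index & (HPAGE_PMD_NR - 1));
	}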

PageSwapBacked proves useful to distinguish a page which has been
instantiated from one which has not: particularly in the case of that
head page marked PageTeam even when not yet instantiated.  Although
conceptually very different, PageTeam is successfully reusing the
CONFIG_TRANSPARENT_HUGEPAGE PG_compound_lock, so no need to beg for a
new flag bit: just a few places may also need to check for !PageHead
or !PageAnon to distinguish - see next patch.
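
In other words, within a team (a hypothetical helper, named here only
for illustration: the patch open-codes these tests where needed):

	static inline bool shmem_team_page_instantiated(struct page *page)
	{
		/* SwapBacked is set only once the page really is in the cache */
		return PageTeam(page) && PageSwapBacked(page);
	}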

Truncation (and hole-punch and eviction) needs to disband the
team before any page is freed from it; and although it will only be
important once we get to mapping the page, even now take the lock on
the head page when truncating any team member (though commonly the head
page will be the first truncated anyway).  That does need a trylock,
and sometimes even a busyloop waiting for PageTeam to be cleared, but
I don't see an actual problem with it (no worse than waiting to take
a bitspinlock).  When disbanding a team, ask free_hot_cold_page() to
free to the cold end of the pcp list, so the subpages are more likely
to be buddied back together.
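
That second lock shows up in shmem_disband_hugeteam() below: arriving
at a tail page, take the head's lock first, and back off quietly if it
cannot be had, leaving the caller to recheck PageTeam (condensed):

	if (head != page) {
		if (!get_page_unless_zero(head))
			return;			/* head already freed */
		if (!trylock_page(head)) {
			page_cache_release(head);
			return;			/* lock held elsewhere: retry later */
		}
	}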

In reclaim (shmem_writepage), simply redirty any tail page of the team,
and only when the head is to be reclaimed, proceed to disband and swap.
(Unless head remains uninstantiated: then tail may disband and swap.)
This strategy will still be safe once we get to mapping the huge page:
the head (and hence the huge) can never be mapped at this point.
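
That is the first PageTeam test in the shmem_writepage() hunk below,
condensed:

	if (PageTeam(page)) {
		struct page *head = team_head(page);
		/* only the head, or a tail whose head is still unpopulated */
		if (page != head && PageSwapBacked(head))
			goto redirty;
	}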

With this patch, the ShmemHugePages line of /proc/meminfo is shown,
but it totals the amount of huge page memory allocated, not the
amount fully used: so it may show ShmemHugePages exceeding Shmem.

Disclaimer: I have used PAGE_SIZE, PAGE_SHIFT throughout this series,
paying no attention to when it should actually say PAGE_CACHE_SIZE,
PAGE_CACHE_SHIFT: enforcing that hypothetical distinction requires
a different mindset, better left to a later exercise.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/page-flags.h |    6 
 include/linux/pageteam.h   |   32 +++
 mm/shmem.c                 |  357 ++++++++++++++++++++++++++++++++---
 3 files changed, 367 insertions(+), 28 deletions(-)

--- thpfs.orig/include/linux/page-flags.h	2014-10-05 12:23:04.000000000 -0700
+++ thpfs/include/linux/page-flags.h	2015-02-20 19:34:06.224004747 -0800
@@ -108,6 +108,7 @@ enum pageflags {
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
+	PG_team = PG_compound_lock,	/* used for huge shmem (thpfs) */
 #endif
 	__NR_PAGEFLAGS,
 
@@ -180,6 +181,9 @@ static inline int Page##uname(const stru
 #define SETPAGEFLAG_NOOP(uname)						\
 static inline void SetPage##uname(struct page *page) {  }
 
+#define __SETPAGEFLAG_NOOP(uname)					\
+static inline void __SetPage##uname(struct page *page) {  }
+
 #define CLEARPAGEFLAG_NOOP(uname)					\
 static inline void ClearPage##uname(struct page *page) {  }
 
@@ -458,7 +462,9 @@ static inline int PageTransTail(struct p
 	return PageTail(page);
 }
 
+PAGEFLAG(Team, team) __SETPAGEFLAG(Team, team)
 #else
+PAGEFLAG_FALSE(Team) __SETPAGEFLAG_NOOP(Team)
 
 static inline int PageTransHuge(struct page *page)
 {
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ thpfs/include/linux/pageteam.h	2015-02-20 19:34:06.224004747 -0800
@@ -0,0 +1,32 @@
+#ifndef _LINUX_PAGETEAM_H
+#define _LINUX_PAGETEAM_H
+
+/*
+ * Declarations and definitions for PageTeam pages and page->team_usage:
+ * as implemented for "huge tmpfs" in mm/shmem.c and mm/huge_memory.c, when
+ * CONFIG_TRANSPARENT_HUGEPAGE=y, and tmpfs is mounted with the huge=1 option.
+ */
+
+#include <linux/huge_mm.h>
+#include <linux/mm_types.h>
+#include <linux/mmdebug.h>
+#include <asm/page.h>
+
+static inline struct page *team_head(struct page *page)
+{
+	struct page *head = page - (page->index & (HPAGE_PMD_NR-1));
+	/*
+	 * Locating head by page->index is a faster calculation than by
+	 * pfn_to_page(page_to_pfn), and we only use this function after
+	 * page->index has been set (never on tail holes): but check that.
+	 *
+	 * Although this is only used on a PageTeam(page), the team might be
+	 * disbanded racily, so it's not safe to VM_BUG_ON(!PageTeam(page));
+	 * but page->index remains stable across disband and truncation.
+	 */
+	VM_BUG_ON_PAGE(head != pfn_to_page(page_to_pfn(page) &
+			~((unsigned long)HPAGE_PMD_NR-1)), page);
+	return head;
+}
+
+#endif /* _LINUX_PAGETEAM_H */
--- thpfs.orig/mm/shmem.c	2015-02-20 19:34:01.464015631 -0800
+++ thpfs/mm/shmem.c	2015-02-20 19:34:06.224004747 -0800
@@ -60,6 +60,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/security.h>
 #include <linux/sysctl.h>
 #include <linux/swapops.h>
+#include <linux/pageteam.h>
 #include <linux/mempolicy.h>
 #include <linux/namei.h>
 #include <linux/ctype.h>
@@ -299,49 +300,236 @@ static bool shmem_confirm_swap(struct ad
 #define SHMEM_HUGE_DENY		(-1)
 #define SHMEM_HUGE_FORCE	(2)
 
+/* hugehint values: NULL to choose a small page always */
+#define SHMEM_ALLOC_SMALL_PAGE	((struct page *)1)
+#define SHMEM_ALLOC_HUGE_PAGE	((struct page *)2)
+#define SHMEM_RETRY_HUGE_PAGE	((struct page *)3)
+/* otherwise hugehint is the hugeteam page to be used */
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /* ifdef here to avoid bloating shmem.o when not necessary */
 
 int shmem_huge __read_mostly;
 
+static struct page *shmem_hugeteam_lookup(struct address_space *mapping,
+					  pgoff_t index, bool speculative)
+{
+	pgoff_t start;
+	pgoff_t indice;
+	void __rcu **pagep;
+	struct page *cachepage;
+	struct page *headpage;
+	struct page *page;
+
+	/*
+	 * First called speculatively, under rcu_read_lock(), by the huge
+	 * shmem_alloc_page(): to decide whether to allocate a new huge page,
+	 * or a new small page, or use a previously allocated huge team page.
+	 *
+	 * Later called under mapping->tree_lock, by shmem_add_to_page_cache(),
+	 * to confirm the decision just before inserting into the radix_tree.
+	 */
+
+	start = round_down(index, HPAGE_PMD_NR);
+restart:
+	if (!radix_tree_gang_lookup_slot(&mapping->page_tree,
+					 &pagep, &indice, start, 1))
+		return SHMEM_ALLOC_HUGE_PAGE;
+	cachepage = rcu_dereference_check(*pagep,
+		lockdep_is_held(&mapping->tree_lock));
+	if (!cachepage || indice >= start + HPAGE_PMD_NR)
+		return SHMEM_ALLOC_HUGE_PAGE;
+	if (radix_tree_exception(cachepage)) {
+		if (radix_tree_deref_retry(cachepage))
+			goto restart;
+		return SHMEM_ALLOC_SMALL_PAGE;
+	}
+	if (!PageTeam(cachepage))
+		return SHMEM_ALLOC_SMALL_PAGE;
+	/* headpage is very often its first cachepage, but not necessarily */
+	headpage = cachepage - (indice - start);
+	page = headpage + (index - start);
+	if (speculative && !page_cache_get_speculative(page))
+		goto restart;
+	if (!PageTeam(headpage) ||
+	    headpage->mapping != mapping || headpage->index != start) {
+		if (speculative)
+			page_cache_release(page);
+		goto restart;
+	}
+	return page;
+}
+
+static int shmem_disband_hugehead(struct page *head)
+{
+	struct address_space *mapping;
+	struct zone *zone;
+	int nr = -1;
+
+	mapping = head->mapping;
+	zone = page_zone(head);
+
+	spin_lock_irq(&mapping->tree_lock);
+	if (PageTeam(head)) {
+		ClearPageTeam(head);
+		__dec_zone_state(zone, NR_SHMEM_HUGEPAGES);
+		nr = 1;
+	}
+	spin_unlock_irq(&mapping->tree_lock);
+	return nr;
+}
+
+static void shmem_disband_hugetails(struct page *head)
+{
+	struct page *page;
+	struct page *endpage;
+
+	page = head;
+	endpage = head + HPAGE_PMD_NR;
+
+	/* Condition follows in next commit */ {
+		/*
+		 * The usual case: disbanding team and freeing holes as cold
+		 * (cold being more likely to preserve high-order extents).
+		 */
+		if (!PageSwapBacked(page)) {	/* head was not in cache */
+			page->mapping = NULL;
+			if (put_page_testzero(page))
+				free_hot_cold_page(page, 1);
+		}
+		while (++page < endpage) {
+			if (PageTeam(page))
+				ClearPageTeam(page);
+			else if (put_page_testzero(page))
+				free_hot_cold_page(page, 1);
+		}
+	}
+}
+
+static void shmem_disband_hugeteam(struct page *page)
+{
+	struct page *head = team_head(page);
+	int nr_used;
+
+	/*
+	 * In most cases, shmem_disband_hugeteam() is called with this page
+	 * locked.  But shmem_getpage_gfp()'s alloced_huge failure case calls
+	 * it after unlocking and releasing: because it has not exposed the
+	 * page, and prefers free_hot_cold_page to free it all cold together.
+	 *
+	 * The truncation case may need a second lock, on the head page,
+	 * to guard against races while shmem fault prepares a huge pmd.
+	 * Little point in returning error, it has to check PageTeam anyway.
+	 */
+	if (head != page) {
+		if (!get_page_unless_zero(head))
+			return;
+		if (!trylock_page(head)) {
+			page_cache_release(head);
+			return;
+		}
+		if (!PageTeam(head)) {
+			unlock_page(head);
+			page_cache_release(head);
+			return;
+		}
+	}
+
+	/*
+	 * Disable preemption because truncation may end up spinning until a
+	 * tail PageTeam has been cleared: we hold the lock as briefly as we
+	 * can (splitting disband in two stages), but better not be preempted.
+	 */
+	preempt_disable();
+	nr_used = shmem_disband_hugehead(head);
+	if (head != page)
+		unlock_page(head);
+	if (nr_used >= 0)
+		shmem_disband_hugetails(head);
+	if (head != page)
+		page_cache_release(head);
+	preempt_enable();
+}
+
 #else /* !CONFIG_TRANSPARENT_HUGEPAGE */
 
 #define shmem_huge SHMEM_HUGE_DENY
 
+static inline struct page *shmem_hugeteam_lookup(struct address_space *mapping,
+					pgoff_t index, bool speculative)
+{
+	BUILD_BUG();
+	return SHMEM_ALLOC_SMALL_PAGE;
+}
+
+static inline void shmem_disband_hugeteam(struct page *page)
+{
+	BUILD_BUG();
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 /*
  * Like add_to_page_cache_locked, but error if expected item has gone.
  */
-static int shmem_add_to_page_cache(struct page *page,
-				   struct address_space *mapping,
-				   pgoff_t index, void *expected)
+static int
+shmem_add_to_page_cache(struct page *page, struct address_space *mapping,
+			pgoff_t index, void *expected, struct page *hugehint)
 {
+	struct zone *zone = page_zone(page);
 	int error;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+	VM_BUG_ON(expected && hugehint);
+
+	spin_lock_irq(&mapping->tree_lock);
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugehint) {
+		if (shmem_hugeteam_lookup(mapping, index, false) != hugehint) {
+			error = -EEXIST;	/* will retry */
+			goto errout;
+		}
+		if (!PageSwapBacked(page)) {	/* huge needs special care */
+			SetPageSwapBacked(page);
+			SetPageTeam(page);
+		}
+	}
 
-	page_cache_get(page);
 	page->mapping = mapping;
 	page->index = index;
+	/* smp_wmb()?  That's in radix_tree_insert()'s rcu_assign_pointer() */
 
-	spin_lock_irq(&mapping->tree_lock);
 	if (!expected)
 		error = radix_tree_insert(&mapping->page_tree, index, page);
 	else
 		error = shmem_radix_tree_replace(mapping, index, expected,
 								 page);
-	if (!error) {
-		mapping->nrpages++;
-		__inc_zone_page_state(page, NR_FILE_PAGES);
-		__inc_zone_page_state(page, NR_SHMEM);
-		spin_unlock_irq(&mapping->tree_lock);
-	} else {
-		page->mapping = NULL;
-		spin_unlock_irq(&mapping->tree_lock);
-		page_cache_release(page);
+	if (unlikely(error)) {
+		/* Beware: did above make some flags fleetingly visible? */
+		VM_BUG_ON(page == hugehint);
+		goto errout;
 	}
+
+	if (!PageTeam(page))
+		page_cache_get(page);
+	else if (hugehint == SHMEM_ALLOC_HUGE_PAGE)
+		__inc_zone_state(zone, NR_SHMEM_HUGEPAGES);
+
+	mapping->nrpages++;
+	__inc_zone_state(zone, NR_FILE_PAGES);
+	__inc_zone_state(zone, NR_SHMEM);
+	spin_unlock_irq(&mapping->tree_lock);
+	return 0;
+
+errout:
+	if (PageTeam(page)) {
+		/* We use SwapBacked to indicate if already in cache */
+		ClearPageSwapBacked(page);
+		if (index & (HPAGE_PMD_NR-1)) {
+			ClearPageTeam(page);
+			page->mapping = NULL;
+		}
+	} else
+		page->mapping = NULL;
+	spin_unlock_irq(&mapping->tree_lock);
 	return error;
 }
 
@@ -427,15 +615,16 @@ static void shmem_undo_range(struct inod
 	struct pagevec pvec;
 	pgoff_t indices[PAGEVEC_SIZE];
 	long nr_swaps_freed = 0;
+	pgoff_t warm_index = 0;
 	pgoff_t index;
 	int i;
 
 	if (lend == -1)
 		end = -1;	/* unsigned, so actually very big */
 
-	pagevec_init(&pvec, 0);
 	index = start;
 	while (index < end) {
+		pagevec_init(&pvec, index < warm_index);
 		pvec.nr = find_get_entries(mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE),
 			pvec.pages, indices);
@@ -461,7 +650,21 @@ static void shmem_undo_range(struct inod
 			if (!unfalloc || !PageUptodate(page)) {
 				if (page->mapping == mapping) {
 					VM_BUG_ON_PAGE(PageWriteback(page), page);
-					truncate_inode_page(mapping, page);
+					if (PageTeam(page)) {
+						/*
+						 * Try preserve huge pages by
+						 * freeing to tail of pcp list.
+						 */
+						pvec.cold = 1;
+						warm_index = round_up(
+						    index + 1, HPAGE_PMD_NR);
+						shmem_disband_hugeteam(page);
+						/* but that may not succeed */
+					}
+					if (!PageTeam(page)) {
+						truncate_inode_page(mapping,
+								    page);
+					}
 				}
 			}
 			unlock_page(page);
@@ -503,7 +706,8 @@ static void shmem_undo_range(struct inod
 	index = start;
 	while (index < end) {
 		cond_resched();
-
+		/* Carrying warm_index from first pass is the best we can do */
+		pagevec_init(&pvec, index < warm_index);
 		pvec.nr = find_get_entries(mapping, index,
 				min(end - index, (pgoff_t)PAGEVEC_SIZE),
 				pvec.pages, indices);
@@ -538,7 +742,26 @@ static void shmem_undo_range(struct inod
 			if (!unfalloc || !PageUptodate(page)) {
 				if (page->mapping == mapping) {
 					VM_BUG_ON_PAGE(PageWriteback(page), page);
-					truncate_inode_page(mapping, page);
+					if (PageTeam(page)) {
+						/*
+						 * Try preserve huge pages by
+						 * freeing to tail of pcp list.
+						 */
+						pvec.cold = 1;
+						warm_index = round_up(
+						    index + 1, HPAGE_PMD_NR);
+						shmem_disband_hugeteam(page);
+						/* but that may not succeed */
+					}
+					if (!PageTeam(page)) {
+						truncate_inode_page(mapping,
+								    page);
+					} else if (end != -1) {
+						/* Punch retry disband now */
+						unlock_page(page);
+						index--;
+						break;
+					}
 				} else {
 					/* Page was replaced by swap: retry */
 					unlock_page(page);
@@ -690,7 +913,7 @@ static int shmem_unuse_inode(struct shme
 	 */
 	if (!error)
 		error = shmem_add_to_page_cache(*pagep, mapping, index,
-						radswap);
+						radswap, NULL);
 	if (error != -ENOMEM) {
 		/*
 		 * Truncation and eviction use free_swap_and_cache(), which
@@ -827,10 +1050,25 @@ static int shmem_writepage(struct page *
 		SetPageUptodate(page);
 	}
 
+	if (PageTeam(page)) {
+		struct page *head = team_head(page);
+		/*
+		 * Only proceed if this is head, or if head is unpopulated.
+		 */
+		if (page != head && PageSwapBacked(head))
+			goto redirty;
+	}
+
 	swap = get_swap_page();
 	if (!swap.val)
 		goto redirty;
 
+	if (PageTeam(page)) {
+		shmem_disband_hugeteam(page);
+		if (PageTeam(page))
+			goto putswap;
+	}
+
 	/*
 	 * Add inode to shmem_unuse()'s list of swapped-out inodes,
 	 * if it's not already there.  Do it now before the page is
@@ -859,6 +1097,7 @@ static int shmem_writepage(struct page *
 	}
 
 	mutex_unlock(&shmem_swaplist_mutex);
+putswap:
 	swapcache_free(swap);
 redirty:
 	set_page_dirty(page);
@@ -926,8 +1165,8 @@ static struct page *shmem_swapin(swp_ent
 	return page;
 }
 
-static struct page *shmem_alloc_page(gfp_t gfp,
-			struct shmem_inode_info *info, pgoff_t index)
+static struct page *shmem_alloc_page(gfp_t gfp, struct shmem_inode_info *info,
+	pgoff_t index, struct page **hugehint, struct page **alloced_huge)
 {
 	struct vm_area_struct pvma;
 	struct page *page;
@@ -939,12 +1178,54 @@ static struct page *shmem_alloc_page(gfp
 	pvma.vm_ops = NULL;
 	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);
 
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && *hugehint) {
+		struct address_space *mapping = info->vfs_inode.i_mapping;
+		struct page *head;
+
+		rcu_read_lock();
+		*hugehint = shmem_hugeteam_lookup(mapping, index, true);
+		rcu_read_unlock();
+
+		if (*hugehint == SHMEM_ALLOC_HUGE_PAGE) {
+			head = alloc_pages_vma(gfp|__GFP_NORETRY|__GFP_NOWARN,
+				HPAGE_PMD_ORDER, &pvma, 0, numa_node_id());
+			if (head) {
+				split_page(head, HPAGE_PMD_ORDER);
+
+				/* Prepare head page for add_to_page_cache */
+				__SetPageTeam(head);
+				head->mapping = mapping;
+				head->index = round_down(index, HPAGE_PMD_NR);
+				*alloced_huge = head;
+
+				/* Prepare wanted page for add_to_page_cache */
+				page = head + (index & (HPAGE_PMD_NR-1));
+				page_cache_get(page);
+				__set_page_locked(page);
+				goto out;
+			}
+		} else if (*hugehint != SHMEM_ALLOC_SMALL_PAGE) {
+			page = *hugehint;
+			head = page - (index & (HPAGE_PMD_NR-1));
+			/*
+			 * This page is already visible: so we cannot use the
+			 * __nonatomic ops, must check that it has not already
+			 * been added, and cannot set the flags it needs until
+			 * add_to_page_cache has the tree_lock.
+			 */
+			lock_page(page);
+			if (PageSwapBacked(page) || !PageTeam(head))
+				*hugehint = SHMEM_RETRY_HUGE_PAGE;
+			goto out;
+		}
+	}
+
 	page = alloc_pages_vma(gfp, 0, &pvma, 0, numa_node_id());
 	if (page) {
 		__set_page_locked(page);
 		__SetPageSwapBacked(page);
 	}
-
+out:
 	/* Drop reference taken by mpol_shared_policy_lookup() */
 	mpol_cond_put(pvma.vm_policy);
 
@@ -975,6 +1256,7 @@ static int shmem_replace_page(struct pag
 	struct address_space *swap_mapping;
 	pgoff_t swap_index;
 	int error;
+	struct page *hugehint = NULL;
 
 	oldpage = *pagep;
 	swap_index = page_private(oldpage);
@@ -985,7 +1267,7 @@ static int shmem_replace_page(struct pag
 	 * limit chance of success by further cpuset and node constraints.
 	 */
 	gfp &= ~GFP_CONSTRAINT_MASK;
-	newpage = shmem_alloc_page(gfp, info, index);
+	newpage = shmem_alloc_page(gfp, info, index, &hugehint, &hugehint);
 	if (!newpage)
 		return -ENOMEM;
 
@@ -1051,6 +1333,8 @@ static int shmem_getpage_gfp(struct inod
 	int error;
 	int once = 0;
 	int alloced = 0;
+	struct page *hugehint;
+	struct page *alloced_huge = NULL;
 
 	if (index > (MAX_LFS_FILESIZE >> PAGE_CACHE_SHIFT))
 		return -EFBIG;
@@ -1127,7 +1411,7 @@ repeat:
 		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
 		if (!error) {
 			error = shmem_add_to_page_cache(page, mapping, index,
-						swp_to_radix_entry(swap));
+						swp_to_radix_entry(swap), NULL);
 			/*
 			 * We already confirmed swap under page lock, and make
 			 * no memory allocation here, so usually no possibility
@@ -1176,11 +1460,23 @@ repeat:
 			percpu_counter_inc(&sbinfo->used_blocks);
 		}
 
-		page = shmem_alloc_page(gfp, info, index);
+		/* Take huge hint from super, except for shmem_symlink() */
+		hugehint = NULL;
+		if (mapping->a_ops == &shmem_aops &&
+		    (shmem_huge == SHMEM_HUGE_FORCE ||
+		     (sbinfo->huge && shmem_huge != SHMEM_HUGE_DENY)))
+			hugehint = SHMEM_ALLOC_HUGE_PAGE;
+
+		page = shmem_alloc_page(gfp, info, index,
+					&hugehint, &alloced_huge);
 		if (!page) {
 			error = -ENOMEM;
 			goto decused;
 		}
+		if (hugehint == SHMEM_RETRY_HUGE_PAGE) {
+			error = -EEXIST;
+			goto decused;
+		}
 
 		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
 		if (error)
@@ -1188,7 +1484,7 @@ repeat:
 		error = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK);
 		if (!error) {
 			error = shmem_add_to_page_cache(page, mapping, index,
-							NULL);
+							NULL, hugehint);
 			radix_tree_preload_end();
 		}
 		if (error) {
@@ -1229,7 +1525,8 @@ clear:
 	if (sgp != SGP_WRITE && sgp != SGP_FALLOC &&
 	    ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) {
 		error = -EINVAL;
-		if (alloced)
+		alloced_huge = NULL;	/* already exposed: maybe now in use */
+		if (alloced && !PageTeam(page))
 			goto trunc;
 		else
 			goto failed;
@@ -1263,6 +1560,10 @@ unlock:
 		unlock_page(page);
 		page_cache_release(page);
 	}
+	if (alloced_huge) {
+		shmem_disband_hugeteam(alloced_huge);
+		alloced_huge = NULL;
+	}
 	if (error == -ENOSPC && !once++) {
 		info = SHMEM_I(inode);
 		spin_lock(&info->lock);

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 10/24] huge tmpfs: avoid team pages in a few places
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  4:07   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

A few functions outside of mm/shmem.c must take care not to damage a
team accidentally.  In particular, although huge tmpfs will make its
own use of page migration, we don't want compaction or other users
of page migration to stomp on teams by mistake: a backstop check
in unmap_and_move() secures most cases, and an earlier check in
isolate_migratepages_block() saves compaction from wasting time.
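
To make that compaction check concrete, here is the arithmetic the skip
relies on (illustration only, not part of the patch; it assumes 4kB base
pages, so HPAGE_PMD_NR == 512):

	/*
	 * isolate_migratepages_block() advances low_pfn in its for-loop,
	 * so the skip rounds up to just before the next team boundary:
	 *
	 *	low_pfn = round_up(low_pfn + 1, HPAGE_PMD_NR) - 1;
	 *
	 * e.g. low_pfn == 0x1234: round_up(0x1235, 512) == 0x1400, minus 1
	 * gives 0x13ff; the loop's low_pfn++ then resumes at 0x1400, the
	 * first pfn of the next 2MB-aligned block.
	 */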

These checks are certainly too strong: we shall want NUMA mempolicy
and balancing, and memory hot-remove, and soft-offline of failing
memory, to work with team pages; but defer those to a later series,
probably to be implemented along with rebanding disbanded teams (to
recover their original performance once memory pressure is relieved).

However, a PageTeam test alone is often not sufficient: because PG_team
is shared with PG_compound_lock, there's a danger that a momentarily
compound-locked page will look as if it were PageTeam.  (The same danger
arises in places in shmem.c where we check PageTeam(head) when that head
might already have been freed and reused for a smaller compound page.)

Mostly use !PageAnon to check for this: !PageHead can also work, but
there's an instant in __split_huge_page_refcount() when PageHead is
cleared before the compound_unlock() - the durability of PageAnon is
easier to think about.
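
Read together, the check used at those call sites amounts to a single
predicate; a minimal sketch, not part of the patch (the helper name is
invented here, and PageTeam() comes from earlier in this series):

	/* Sketch only: the test is racy, callers must tolerate staleness */
	static inline bool page_is_probably_team(struct page *page)
	{
		/*
		 * PG_team aliases PG_compound_lock, but only anonymous THP
		 * is ever compound-locked, so !PageAnon filters that out.
		 */
		return PageTeam(page) && !PageAnon(page);
	}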

Hoist the declaration of PageAnon (and its associated definitions) in
linux/mm.h up before the declaration of __compound_tail_refcounted()
to facilitate this: compound tail refcounting (and compound locking)
is only needed when the head might be an anonymous THP, hence the
PageAnon check.

Of course, the danger of confusion between PG_compound_lock and
PG_team could more easily be addressed by assigning a separate page
flag bit for PageTeam; but I'm reluctant to ask for that, and in the
longer term hopeful that PG_compound_lock can be removed altogether.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/mm.h |   92 +++++++++++++++++++++----------------------
 mm/compaction.c    |    6 ++
 mm/memcontrol.c    |    4 -
 mm/migrate.c       |   12 +++++
 mm/truncate.c      |    2 
 mm/vmscan.c        |    2 
 6 files changed, 68 insertions(+), 50 deletions(-)

--- thpfs.orig/include/linux/mm.h	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/include/linux/mm.h	2015-02-20 19:34:11.231993296 -0800
@@ -473,6 +473,48 @@ static inline int page_count(struct page
 	return atomic_read(&compound_head(page)->_count);
 }
 
+/*
+ * On an anonymous page mapped into a user virtual memory area,
+ * page->mapping points to its anon_vma, not to a struct address_space;
+ * with the PAGE_MAPPING_ANON bit set to distinguish it.  See rmap.h.
+ *
+ * On an anonymous page in a VM_MERGEABLE area, if CONFIG_KSM is enabled,
+ * the PAGE_MAPPING_KSM bit may be set along with the PAGE_MAPPING_ANON bit;
+ * and then page->mapping points, not to an anon_vma, but to a private
+ * structure which KSM associates with that merged page.  See ksm.h.
+ *
+ * PAGE_MAPPING_KSM without PAGE_MAPPING_ANON is currently never used.
+ *
+ * Please note that, confusingly, "page_mapping" refers to the inode
+ * address_space which maps the page from disk; whereas "page_mapped"
+ * refers to user virtual address space into which the page is mapped.
+ */
+#define PAGE_MAPPING_ANON	1
+#define PAGE_MAPPING_KSM	2
+#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)
+
+extern struct address_space *page_mapping(struct page *page);
+
+/* Neutral page->mapping pointer to address_space or anon_vma or other */
+static inline void *page_rmapping(struct page *page)
+{
+	return (void *)((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
+}
+
+extern struct address_space *__page_file_mapping(struct page *);
+
+static inline struct address_space *page_file_mapping(struct page *page)
+{
+	if (unlikely(PageSwapCache(page)))
+		return __page_file_mapping(page);
+	return page->mapping;
+}
+
+static inline int PageAnon(struct page *page)
+{
+	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
+}
+
 #ifdef CONFIG_HUGETLB_PAGE
 extern int PageHeadHuge(struct page *page_head);
 #else /* CONFIG_HUGETLB_PAGE */
@@ -484,15 +526,15 @@ static inline int PageHeadHuge(struct pa
 
 static inline bool __compound_tail_refcounted(struct page *page)
 {
-	return !PageSlab(page) && !PageHeadHuge(page);
+	return PageAnon(page) && !PageSlab(page) && !PageHeadHuge(page);
 }
 
 /*
  * This takes a head page as parameter and tells if the
  * tail page reference counting can be skipped.
  *
- * For this to be safe, PageSlab and PageHeadHuge must remain true on
- * any given page where they return true here, until all tail pins
+ * For this to be safe, PageAnon and PageSlab and PageHeadHuge must remain
+ * true on any given page where they return true here, until all tail pins
  * have been released.
  */
 static inline bool compound_tail_refcounted(struct page *page)
@@ -980,50 +1022,6 @@ void page_address_init(void);
 #endif
 
 /*
- * On an anonymous page mapped into a user virtual memory area,
- * page->mapping points to its anon_vma, not to a struct address_space;
- * with the PAGE_MAPPING_ANON bit set to distinguish it.  See rmap.h.
- *
- * On an anonymous page in a VM_MERGEABLE area, if CONFIG_KSM is enabled,
- * the PAGE_MAPPING_KSM bit may be set along with the PAGE_MAPPING_ANON bit;
- * and then page->mapping points, not to an anon_vma, but to a private
- * structure which KSM associates with that merged page.  See ksm.h.
- *
- * PAGE_MAPPING_KSM without PAGE_MAPPING_ANON is currently never used.
- *
- * Please note that, confusingly, "page_mapping" refers to the inode
- * address_space which maps the page from disk; whereas "page_mapped"
- * refers to user virtual address space into which the page is mapped.
- */
-#define PAGE_MAPPING_ANON	1
-#define PAGE_MAPPING_KSM	2
-#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)
-
-extern struct address_space *page_mapping(struct page *page);
-
-/* Neutral page->mapping pointer to address_space or anon_vma or other */
-static inline void *page_rmapping(struct page *page)
-{
-	return (void *)((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
-}
-
-extern struct address_space *__page_file_mapping(struct page *);
-
-static inline
-struct address_space *page_file_mapping(struct page *page)
-{
-	if (unlikely(PageSwapCache(page)))
-		return __page_file_mapping(page);
-
-	return page->mapping;
-}
-
-static inline int PageAnon(struct page *page)
-{
-	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
-}
-
-/*
  * Return the pagecache index of the passed page.  Regular pagecache pages
  * use ->index whereas swapcache pages use ->private
  */
--- thpfs.orig/mm/compaction.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/mm/compaction.c	2015-02-20 19:34:11.231993296 -0800
@@ -676,6 +676,12 @@ isolate_migratepages_block(struct compac
 			continue;
 		}
 
+		/* Team bit == compound_lock bit: racy check before skipping */
+		if (PageTeam(page) && !PageAnon(page)) {
+			low_pfn = round_up(low_pfn + 1, HPAGE_PMD_NR) - 1;
+			continue;
+		}
+
 		/*
 		 * Migration will fail if an anonymous page is pinned in memory,
 		 * so avoid taking lru_lock and isolating it unnecessarily in an
--- thpfs.orig/mm/memcontrol.c	2015-02-20 19:33:31.052085168 -0800
+++ thpfs/mm/memcontrol.c	2015-02-20 19:34:11.231993296 -0800
@@ -5021,8 +5021,8 @@ static enum mc_target_type get_mctgt_typ
 	enum mc_target_type ret = MC_TARGET_NONE;
 
 	page = pmd_page(pmd);
-	VM_BUG_ON_PAGE(!page || !PageHead(page), page);
-	if (!move_anon())
+	/* Don't attempt to move huge tmpfs pages: could be enabled later */
+	if (!move_anon() || !PageAnon(page))
 		return ret;
 	if (page->mem_cgroup == mc.from) {
 		ret = MC_TARGET_PAGE;
--- thpfs.orig/mm/migrate.c	2015-02-20 19:33:40.876062705 -0800
+++ thpfs/mm/migrate.c	2015-02-20 19:34:11.235993287 -0800
@@ -937,6 +937,10 @@ static int unmap_and_move(new_page_t get
 	int *result = NULL;
 	struct page *newpage;
 
+	/* Team bit == compound_lock bit: racy check before refusing */
+	if (PageTeam(page) && !PageAnon(page))
+		return -EBUSY;
+
 	newpage = get_new_page(page, private, &result);
 	if (!newpage)
 		return -ENOMEM;
@@ -1770,6 +1774,14 @@ int migrate_misplaced_transhuge_page(str
 	pmd_t orig_entry;
 
 	/*
+	 * Leave support for NUMA balancing on huge tmpfs pages to the future.
+	 * The pmd marking up to this point should work okay, but from here on
+	 * there is work to be done: e.g. anon page->mapping assumption below.
+	 */
+	if (!PageAnon(page))
+		goto out_dropref;
+
+	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
 	 * all the time is being spent migrating!
--- thpfs.orig/mm/truncate.c	2014-12-07 14:21:05.000000000 -0800
+++ thpfs/mm/truncate.c	2015-02-20 19:34:11.235993287 -0800
@@ -542,7 +542,7 @@ invalidate_complete_page2(struct address
 		return 0;
 
 	spin_lock_irq(&mapping->tree_lock);
-	if (PageDirty(page))
+	if (PageDirty(page) || PageTeam(page))
 		goto failed;
 
 	BUG_ON(page_has_private(page));
--- thpfs.orig/mm/vmscan.c	2015-02-20 19:33:56.532026908 -0800
+++ thpfs/mm/vmscan.c	2015-02-20 19:34:11.235993287 -0800
@@ -567,6 +567,8 @@ static int __remove_mapping(struct addre
 	 * Note that if SetPageDirty is always performed via set_page_dirty,
 	 * and thus under tree_lock, then this ordering is not required.
 	 */
+	if (unlikely(PageTeam(page)))
+		goto cannot_free;
 	if (!page_freeze_refs(page, 2))
 		goto cannot_free;
 	/* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */

* [PATCH 11/24] huge tmpfs: shrinker to migrate and free underused holes
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  4:09   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:09 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

Using 2MB for each small file is wasteful, and on average even a large
file is likely to waste 1MB at the end.  We could say that a huge tmpfs
is only suitable for huge files, but I would much prefer not to limit
it in that way, and would not be well placed to test such a filesystem.

In our model, the unused space in the team is not put on any LRU (nor
charged to any memcg), so not yet accessible to page reclaim: we need
a shrinker to disband the team, and free up the unused space, under
memory pressure.  (Typically the freeable space is at the end, but
there's no assumption that it's at the end of the huge page or of the file.)

shmem_shrink_hugehole() is usually called from vmscan's shrink_slab();
but I've found that a direct call from shmem_alloc_page(), when it fails
to allocate a huge page (perhaps because too much memory is occupied
by shmem huge holes), is also helpful before a retry.

But each team holds a valuable resource: an extent of contiguous
memory that could be used for another team (or for an anonymous THP).
So try to proceed in such a way as to conserve that resource: rather
than just freeing the unused space and leaving yet another huge page
fragmented, also try to migrate the used space to another partially
occupied huge page.

The algorithm in shmem_choose_hugehole() (find the least occupied huge
page in the older half of the shrinklist, and migrate its cache pages
into the most occupied huge page with enough space to fit, again chosen
from the older half of the shrinklist) is unlikely to be ideal; but it
is easy to implement as a demonstration of the pieces which can be used
by any algorithm, and good enough for now.  A radix_tree tag helps to
locate the partially occupied huge pages more quickly: that tag is
available because shmem does not participate in dirty/writeback
accounting.
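
A condensed sketch of that policy, with invented helper names standing
in for the shrinklist loops below (the real code also handles refcounts,
locking and node matching):

	/*
	 * Sketch only:
	 *
	 *	from = team_with_most_freeholes(older_half(shrinklist));
	 *	fromused = disband(from);	// frees its holes at once
	 *	if (fromused > 0 && fromused <= HPAGE_PMD_NR / 2) {
	 *		to = tightest_fitting_team(older_half(shrinklist),
	 *					   fromused, nid_of(from));
	 *		if (to)
	 *			migrate used pages of "from" into holes of "to";
	 *	}
	 */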

The "team_usage" field added to struct page (in union with "private")
is somewhat vaguely named: because while the huge page is sparsely
occupied, it counts the occupancy; but once the huge page is fully
occupied, it will come to be used differently in a later patch, as
the huge mapcount (offset by the HPAGE_PMD_NR occupancy) - it is
never possible to map a sparsely occupied huge page, because that
would expose stale data to the user.
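
A worked example of that dual use (figures illustrative only, assuming
4kB base pages so HPAGE_PMD_NR == 512):

	/*
	 * While sparsely occupied, team_usage counts pages in cache:
	 *	3 subpages in cache	-> team_usage == 3,   freeholes == 509
	 *	fully populated		-> team_usage == 512, freeholes == 0
	 * and once the later patch maps it by pmd, team_usage also carries
	 * the huge mapcount on top of that occupancy:
	 *	mapped by one pmd	-> team_usage == 513, freeholes == -1
	 * which is why callers of shmem_freeholes() treat negative as none.
	 */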

With this patch, the ShmemHugePages and ShmemFreeHoles lines of
/proc/meminfo are shown correctly; but ShmemPmdMapped remains 0.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/migrate.h        |    3 
 include/linux/mm_types.h       |    1 
 include/linux/shmem_fs.h       |    3 
 include/trace/events/migrate.h |    3 
 mm/shmem.c                     |  439 ++++++++++++++++++++++++++++++-
 5 files changed, 436 insertions(+), 13 deletions(-)

--- thpfs.orig/include/linux/migrate.h	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/include/linux/migrate.h	2015-02-20 19:34:16.135982083 -0800
@@ -23,7 +23,8 @@ enum migrate_reason {
 	MR_SYSCALL,		/* also applies to cpusets */
 	MR_MEMPOLICY_MBIND,
 	MR_NUMA_MISPLACED,
-	MR_CMA
+	MR_CMA,
+	MR_SHMEM_HUGEHOLE,
 };
 
 #ifdef CONFIG_MIGRATION
--- thpfs.orig/include/linux/mm_types.h	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/include/linux/mm_types.h	2015-02-20 19:34:16.135982083 -0800
@@ -165,6 +165,7 @@ struct page {
 #endif
 		struct kmem_cache *slab_cache;	/* SL[AU]B: Pointer to slab */
 		struct page *first_page;	/* Compound tail pages */
+		atomic_long_t team_usage;	/* In shmem's PageTeam page */
 	};
 
 #ifdef CONFIG_MEMCG
--- thpfs.orig/include/linux/shmem_fs.h	2015-02-20 19:34:01.464015631 -0800
+++ thpfs/include/linux/shmem_fs.h	2015-02-20 19:34:16.135982083 -0800
@@ -19,8 +19,9 @@ struct shmem_inode_info {
 		unsigned long	swapped;	/* subtotal assigned to swap */
 		char		*symlink;	/* unswappable short symlink */
 	};
-	struct shared_policy	policy;		/* NUMA memory alloc policy */
+	struct list_head	shrinklist;	/* shrinkable hpage inodes */
 	struct list_head	swaplist;	/* chain of maybes on swap */
+	struct shared_policy	policy;		/* NUMA memory alloc policy */
 	struct simple_xattrs	xattrs;		/* list of xattrs */
 	struct inode		vfs_inode;
 };
--- thpfs.orig/include/trace/events/migrate.h	2014-10-05 12:23:04.000000000 -0700
+++ thpfs/include/trace/events/migrate.h	2015-02-20 19:34:16.135982083 -0800
@@ -18,7 +18,8 @@
 	{MR_SYSCALL,		"syscall_or_cpuset"},		\
 	{MR_MEMPOLICY_MBIND,	"mempolicy_mbind"},		\
 	{MR_NUMA_MISPLACED,	"numa_misplaced"},		\
-	{MR_CMA,		"cma"}
+	{MR_CMA,		"cma"},				\
+	{MR_SHMEM_HUGEHOLE,	"shmem_hugehole"}
 
 TRACE_EVENT(mm_migrate_pages,
 
--- thpfs.orig/mm/shmem.c	2015-02-20 19:34:06.224004747 -0800
+++ thpfs/mm/shmem.c	2015-02-20 19:34:16.139982074 -0800
@@ -58,6 +58,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/falloc.h>
 #include <linux/splice.h>
 #include <linux/security.h>
+#include <linux/shrinker.h>
 #include <linux/sysctl.h>
 #include <linux/swapops.h>
 #include <linux/pageteam.h>
@@ -74,6 +75,7 @@ static struct vfsmount *shm_mnt;
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
+#include "internal.h"
 
 #define BLOCKS_PER_PAGE  (PAGE_CACHE_SIZE/512)
 #define VM_ACCT(size)    (PAGE_CACHE_ALIGN(size) >> PAGE_SHIFT)
@@ -306,6 +308,13 @@ static bool shmem_confirm_swap(struct ad
 #define SHMEM_RETRY_HUGE_PAGE	((struct page *)3)
 /* otherwise hugehint is the hugeteam page to be used */
 
+/* tag for shrinker to locate unfilled hugepages */
+#define SHMEM_TAG_HUGEHOLE	PAGECACHE_TAG_DIRTY
+
+static LIST_HEAD(shmem_shrinklist);
+static unsigned long shmem_shrinklist_depth;
+static DEFINE_SPINLOCK(shmem_shrinklist_lock);
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /* ifdef here to avoid bloating shmem.o when not necessary */
 
@@ -360,26 +369,104 @@ restart:
 	return page;
 }
 
+static int shmem_freeholes(struct page *head)
+{
+	/*
+	 * Note: team_usage will also be used to count huge mappings,
+	 * so treat a negative value from shmem_freeholes() as none.
+	 */
+	return HPAGE_PMD_NR - atomic_long_read(&head->team_usage);
+}
+
+static void shmem_clear_tag_hugehole(struct address_space *mapping,
+				     pgoff_t index)
+{
+	struct page *page = NULL;
+
+	/*
+	 * The tag was set on the first subpage to be inserted in cache.
+	 * When written sequentially, or instantiated by a huge fault,
+	 * it will be on the head page, but that's not always so.  And
+	 * radix_tree_tag_clear() succeeds when it finds a slot, whether
+	 * tag was set on it or not.  So first lookup and then clear.
+	 */
+	radix_tree_gang_lookup_tag(&mapping->page_tree, (void **)&page,
+					index, 1, SHMEM_TAG_HUGEHOLE);
+	VM_BUG_ON(!page || page->index >= index + HPAGE_PMD_NR);
+	radix_tree_tag_clear(&mapping->page_tree, page->index,
+					SHMEM_TAG_HUGEHOLE);
+}
+
+static void shmem_added_to_hugeteam(struct page *page, struct zone *zone,
+				    struct page *hugehint)
+{
+	struct address_space *mapping = page->mapping;
+	struct page *head = team_head(page);
+	int nr;
+
+	if (hugehint == SHMEM_ALLOC_HUGE_PAGE) {
+		atomic_long_set(&head->team_usage, 1);
+		radix_tree_tag_set(&mapping->page_tree, page->index,
+					SHMEM_TAG_HUGEHOLE);
+		__mod_zone_page_state(zone, NR_SHMEM_FREEHOLES, HPAGE_PMD_NR-1);
+	} else {
+		/* We do not need atomic ops until huge page gets mapped */
+		nr = atomic_long_read(&head->team_usage) + 1;
+		atomic_long_set(&head->team_usage, nr);
+		if (nr == HPAGE_PMD_NR) {
+			shmem_clear_tag_hugehole(mapping, head->index);
+			__inc_zone_state(zone, NR_SHMEM_HUGEPAGES);
+		}
+		__dec_zone_state(zone, NR_SHMEM_FREEHOLES);
+	}
+}
+
 static int shmem_disband_hugehead(struct page *head)
 {
 	struct address_space *mapping;
 	struct zone *zone;
 	int nr = -1;
 
-	mapping = head->mapping;
-	zone = page_zone(head);
+	/*
+	 * Only in the shrinker migration case might head have been truncated.
+	 * But although head->mapping may then be zeroed at any moment, mapping
+	 * stays safe because shmem_evict_inode must take our shrinklist lock.
+	 */
+	mapping = ACCESS_ONCE(head->mapping);
+	if (!mapping)
+		return nr;
 
+	zone = page_zone(head);
 	spin_lock_irq(&mapping->tree_lock);
+
 	if (PageTeam(head)) {
+		nr = atomic_long_read(&head->team_usage);
+		atomic_long_set(&head->team_usage, 0);
+		/*
+		 * Disable additions to the team.
+		 * Ensure head->private is written before PageTeam is
+		 * cleared, so shmem_writepage() cannot write swap into
+		 * head->private, then have it overwritten by that 0!
+		 */
+		smp_mb__before_atomic();
 		ClearPageTeam(head);
-		__dec_zone_state(zone, NR_SHMEM_HUGEPAGES);
-		nr = 1;
+
+		if (nr >= HPAGE_PMD_NR) {
+			__dec_zone_state(zone, NR_SHMEM_HUGEPAGES);
+			VM_BUG_ON(nr != HPAGE_PMD_NR);
+		} else if (nr) {
+			shmem_clear_tag_hugehole(mapping, head->index);
+			__mod_zone_page_state(zone, NR_SHMEM_FREEHOLES,
+						nr - HPAGE_PMD_NR);
+		} /* else shmem_getpage_gfp disbanding a failed alloced_huge */
 	}
+
 	spin_unlock_irq(&mapping->tree_lock);
 	return nr;
 }
 
-static void shmem_disband_hugetails(struct page *head)
+static void shmem_disband_hugetails(struct page *head,
+				    struct list_head *list, int nr)
 {
 	struct page *page;
 	struct page *endpage;
@@ -387,7 +474,7 @@ static void shmem_disband_hugetails(stru
 	page = head;
 	endpage = head + HPAGE_PMD_NR;
 
-	/* Condition follows in next commit */ {
+	if (!nr) {
 		/*
 		 * The usual case: disbanding team and freeing holes as cold
 		 * (cold being more likely to preserve high-order extents).
@@ -403,7 +490,52 @@ static void shmem_disband_hugetails(stru
 			else if (put_page_testzero(page))
 				free_hot_cold_page(page, 1);
 		}
+	} else if (nr < 0) {
+		struct zone *zone = page_zone(page);
+		int orig_nr = nr;
+		/*
+		 * Shrinker wants to migrate cache pages from this team.
+		 */
+		if (!PageSwapBacked(page)) {	/* head was not in cache */
+			page->mapping = NULL;
+			if (put_page_testzero(page))
+				free_hot_cold_page(page, 1);
+		} else if (isolate_lru_page(page) == 0) {
+			list_add_tail(&page->lru, list);
+			nr++;
+		}
+		while (++page < endpage) {
+			if (PageTeam(page)) {
+				if (isolate_lru_page(page) == 0) {
+					list_add_tail(&page->lru, list);
+					nr++;
+				}
+				ClearPageTeam(page);
+			} else if (put_page_testzero(page))
+				free_hot_cold_page(page, 1);
+		}
+		/* Yes, shmem counts in NR_ISOLATED_ANON but NR_FILE_PAGES */
+		mod_zone_page_state(zone, NR_ISOLATED_ANON, nr - orig_nr);
+	} else {
+		/*
+		 * Shrinker wants free pages from this team to migrate into.
+		 */
+		if (!PageSwapBacked(page)) {	/* head was not in cache */
+			page->mapping = NULL;
+			list_add_tail(&page->lru, list);
+			nr--;
+		}
+		while (++page < endpage) {
+			if (PageTeam(page))
+				ClearPageTeam(page);
+			else if (nr) {
+				list_add_tail(&page->lru, list);
+				nr--;
+			} else if (put_page_testzero(page))
+				free_hot_cold_page(page, 1);
+		}
 	}
+	VM_BUG_ON(nr > 0);	/* maybe a few were not isolated */
 }
 
 static void shmem_disband_hugeteam(struct page *page)
@@ -445,12 +577,252 @@ static void shmem_disband_hugeteam(struc
 	if (head != page)
 		unlock_page(head);
 	if (nr_used >= 0)
-		shmem_disband_hugetails(head);
+		shmem_disband_hugetails(head, NULL, 0);
 	if (head != page)
 		page_cache_release(head);
 	preempt_enable();
 }
 
+static struct page *shmem_get_hugehole(struct address_space *mapping,
+				       unsigned long *index)
+{
+	struct page *page;
+	struct page *head;
+
+	rcu_read_lock();
+	while (radix_tree_gang_lookup_tag(&mapping->page_tree, (void **)&page,
+					  *index, 1, SHMEM_TAG_HUGEHOLE)) {
+		if (radix_tree_exception(page))
+			continue;
+		if (!page_cache_get_speculative(page))
+			continue;
+		if (!PageTeam(page) || page->mapping != mapping)
+			goto release;
+		head = team_head(page);
+		if (head != page) {
+			if (!page_cache_get_speculative(head))
+				goto release;
+			page_cache_release(page);
+			page = head;
+			if (!PageTeam(page) || page->mapping != mapping)
+				goto release;
+		}
+		if (shmem_freeholes(head) > 0) {
+			rcu_read_unlock();
+			*index = head->index + HPAGE_PMD_NR;
+			return head;
+		}
+release:
+		page_cache_release(page);
+	}
+	rcu_read_unlock();
+	return NULL;
+}
+
+static unsigned long shmem_choose_hugehole(struct list_head *fromlist,
+					   struct list_head *tolist)
+{
+	unsigned long freed = 0;
+	unsigned long double_depth;
+	struct list_head *this, *next;
+	struct shmem_inode_info *info;
+	struct address_space *mapping;
+	struct page *frompage = NULL;
+	struct page *topage = NULL;
+	struct page *page;
+	pgoff_t index;
+	int fromused;
+	int toused;
+	int nid;
+
+	double_depth = 0;
+	spin_lock(&shmem_shrinklist_lock);
+	list_for_each_safe(this, next, &shmem_shrinklist) {
+		info = list_entry(this, struct shmem_inode_info, shrinklist);
+		mapping = info->vfs_inode.i_mapping;
+		if (!radix_tree_tagged(&mapping->page_tree,
+					SHMEM_TAG_HUGEHOLE)) {
+			list_del_init(&info->shrinklist);
+			shmem_shrinklist_depth--;
+			continue;
+		}
+		index = 0;
+		while ((page = shmem_get_hugehole(mapping, &index))) {
+			/* Choose to migrate from page with least in use */
+			if (!frompage ||
+			    shmem_freeholes(page) > shmem_freeholes(frompage)) {
+				if (frompage)
+					page_cache_release(frompage);
+				frompage = page;
+				if (shmem_freeholes(page) == HPAGE_PMD_NR-1) {
+					/* No point searching further */
+					double_depth = -3;
+					break;
+				}
+			} else
+				page_cache_release(page);
+		}
+
+		/* Only reclaim from the older half of the shrinklist */
+		double_depth += 2;
+		if (double_depth >= min(shmem_shrinklist_depth, 2000UL))
+			break;
+	}
+
+	if (!frompage)
+		goto unlock;
+	preempt_disable();
+	fromused = shmem_disband_hugehead(frompage);
+	spin_unlock(&shmem_shrinklist_lock);
+	if (fromused > 0)
+		shmem_disband_hugetails(frompage, fromlist, -fromused);
+	preempt_enable();
+	nid = page_to_nid(frompage);
+	page_cache_release(frompage);
+
+	if (fromused <= 0)
+		return 0;
+	freed = HPAGE_PMD_NR - fromused;
+	if (fromused > HPAGE_PMD_NR/2)
+		return freed;
+
+	double_depth = 0;
+	spin_lock(&shmem_shrinklist_lock);
+	list_for_each_safe(this, next, &shmem_shrinklist) {
+		info = list_entry(this, struct shmem_inode_info, shrinklist);
+		mapping = info->vfs_inode.i_mapping;
+		if (!radix_tree_tagged(&mapping->page_tree,
+					SHMEM_TAG_HUGEHOLE)) {
+			list_del_init(&info->shrinklist);
+			shmem_shrinklist_depth--;
+			continue;
+		}
+		index = 0;
+		while ((page = shmem_get_hugehole(mapping, &index))) {
+			/* Choose to migrate to page with just enough free */
+			if (shmem_freeholes(page) >= fromused &&
+			    page_to_nid(page) == nid) {
+				if (!topage || shmem_freeholes(page) <
+					      shmem_freeholes(topage)) {
+					if (topage)
+						page_cache_release(topage);
+					topage = page;
+					if (shmem_freeholes(page) == fromused) {
+						/* No point searching further */
+						double_depth = -3;
+						break;
+					}
+				} else
+					page_cache_release(page);
+			} else
+				page_cache_release(page);
+		}
+
+		/* Only reclaim from the older half of the shrinklist */
+		double_depth += 2;
+		if (double_depth >= min(shmem_shrinklist_depth, 2000UL))
+			break;
+	}
+
+	if (!topage)
+		goto unlock;
+	preempt_disable();
+	toused = shmem_disband_hugehead(topage);
+	spin_unlock(&shmem_shrinklist_lock);
+	if (toused > 0) {
+		if (HPAGE_PMD_NR - toused >= fromused)
+			shmem_disband_hugetails(topage, tolist, fromused);
+		else
+			shmem_disband_hugetails(topage, NULL, 0);
+		freed += HPAGE_PMD_NR - toused;
+	}
+	preempt_enable();
+	page_cache_release(topage);
+	return freed;
+unlock:
+	spin_unlock(&shmem_shrinklist_lock);
+	return freed;
+}
+
+static struct page *shmem_get_migrate_page(struct page *frompage,
+					   unsigned long private, int **result)
+{
+	struct list_head *tolist = (struct list_head *)private;
+	struct page *topage;
+
+	VM_BUG_ON(list_empty(tolist));
+	topage = list_first_entry(tolist, struct page, lru);
+	list_del(&topage->lru);
+	return topage;
+}
+
+static void shmem_put_migrate_page(struct page *topage, unsigned long private)
+{
+	struct list_head *tolist = (struct list_head *)private;
+
+	list_add(&topage->lru, tolist);
+}
+
+static void shmem_putback_migrate_pages(struct list_head *tolist)
+{
+	struct page *topage;
+	struct page *next;
+
+	/*
+	 * The tolist pages were not counted in NR_ISOLATED, so stats
+	 * would go wrong if putback_movable_pages() were used on them.
+	 * Indeed, even putback_lru_page() is wrong for these pages.
+	 */
+	list_for_each_entry_safe(topage, next, tolist, lru) {
+		list_del(&topage->lru);
+		if (put_page_testzero(topage))
+			free_hot_cold_page(topage, 1);
+	}
+}
+
+static unsigned long shmem_shrink_hugehole(struct shrinker *shrink,
+					   struct shrink_control *sc)
+{
+	unsigned long freed;
+	LIST_HEAD(fromlist);
+	LIST_HEAD(tolist);
+
+	freed = shmem_choose_hugehole(&fromlist, &tolist);
+	if (list_empty(&fromlist))
+		return SHRINK_STOP;
+	if (!list_empty(&tolist)) {
+		migrate_pages(&fromlist, shmem_get_migrate_page,
+			      shmem_put_migrate_page, (unsigned long)&tolist,
+			      MIGRATE_SYNC, MR_SHMEM_HUGEHOLE);
+		preempt_disable();
+		drain_local_pages(NULL);  /* try to preserve huge freed page */
+		preempt_enable();
+		shmem_putback_migrate_pages(&tolist);
+	}
+	putback_movable_pages(&fromlist); /* if any were left behind */
+	return freed;
+}
+
+static unsigned long shmem_count_hugehole(struct shrinker *shrink,
+					  struct shrink_control *sc)
+{
+	/*
+	 * Huge hole space is not charged to any memcg:
+	 * only shrink it for global reclaim.
+	 * But at present we're only called for global reclaim anyway.
+	 */
+	if (list_empty(&shmem_shrinklist))
+		return 0;
+	return global_page_state(NR_SHMEM_FREEHOLES);
+}
+
+static struct shrinker shmem_hugehole_shrinker = {
+	.count_objects = shmem_count_hugehole,
+	.scan_objects = shmem_shrink_hugehole,
+	.seeks = DEFAULT_SEEKS,		/* would another value work better? */
+	.batch = HPAGE_PMD_NR,		/* would another value work better? */
+};
+
 #else /* !CONFIG_TRANSPARENT_HUGEPAGE */
 
 #define shmem_huge SHMEM_HUGE_DENY
@@ -466,6 +838,17 @@ static inline void shmem_disband_hugetea
 {
 	BUILD_BUG();
 }
+
+static inline void shmem_added_to_hugeteam(struct page *page,
+				struct zone *zone, struct page *hugehint)
+{
+}
+
+static inline unsigned long shmem_shrink_hugehole(struct shrinker *shrink,
+						  struct shrink_control *sc)
+{
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 /*
@@ -508,10 +891,10 @@ shmem_add_to_page_cache(struct page *pag
 		goto errout;
 	}
 
-	if (!PageTeam(page))
+	if (PageTeam(page))
+		shmem_added_to_hugeteam(page, zone, hugehint);
+	else
 		page_cache_get(page);
-	else if (hugehint == SHMEM_ALLOC_HUGE_PAGE)
-		__inc_zone_state(zone, NR_SHMEM_HUGEPAGES);
 
 	mapping->nrpages++;
 	__inc_zone_state(zone, NR_FILE_PAGES);
@@ -839,6 +1222,14 @@ static void shmem_evict_inode(struct ino
 		shmem_unacct_size(info->flags, inode->i_size);
 		inode->i_size = 0;
 		shmem_truncate_range(inode, 0, (loff_t)-1);
+		if (!list_empty(&info->shrinklist)) {
+			spin_lock(&shmem_shrinklist_lock);
+			if (!list_empty(&info->shrinklist)) {
+				list_del_init(&info->shrinklist);
+				shmem_shrinklist_depth--;
+			}
+			spin_unlock(&shmem_shrinklist_lock);
+		}
 		if (!list_empty(&info->swaplist)) {
 			mutex_lock(&shmem_swaplist_mutex);
 			list_del_init(&info->swaplist);
@@ -1189,10 +1580,18 @@ static struct page *shmem_alloc_page(gfp
 		if (*hugehint == SHMEM_ALLOC_HUGE_PAGE) {
 			head = alloc_pages_vma(gfp|__GFP_NORETRY|__GFP_NOWARN,
 				HPAGE_PMD_ORDER, &pvma, 0, numa_node_id());
+			if (!head) {
+				shmem_shrink_hugehole(NULL, NULL);
+				head = alloc_pages_vma(
+					gfp|__GFP_NORETRY|__GFP_NOWARN,
+					HPAGE_PMD_ORDER, &pvma, 0,
+					numa_node_id());
+			}
 			if (head) {
 				split_page(head, HPAGE_PMD_ORDER);
 
 				/* Prepare head page for add_to_page_cache */
+				atomic_long_set(&head->team_usage, 0);
 				__SetPageTeam(head);
 				head->mapping = mapping;
 				head->index = round_down(index, HPAGE_PMD_NR);
@@ -1504,6 +1903,21 @@ repeat:
 		if (sgp == SGP_WRITE)
 			__SetPageReferenced(page);
 		/*
+		 * Might we see !list_empty a moment before the shrinker
+		 * removes this inode from its list?  Unlikely, since we
+		 * already set a tag in the tree.  Some barrier required?
+		 */
+		if (alloced_huge && list_empty(&info->shrinklist)) {
+			spin_lock(&shmem_shrinklist_lock);
+			if (list_empty(&info->shrinklist)) {
+				list_add_tail(&info->shrinklist,
+					      &shmem_shrinklist);
+				shmem_shrinklist_depth++;
+			}
+			spin_unlock(&shmem_shrinklist_lock);
+		}
+
+		/*
 		 * Let SGP_FALLOC use the SGP_WRITE optimization on a new page.
 		 */
 		if (sgp == SGP_FALLOC)
@@ -1724,6 +2138,7 @@ static struct inode *shmem_get_inode(str
 		spin_lock_init(&info->lock);
 		info->seals = F_SEAL_SEAL;
 		info->flags = flags & VM_NORESERVE;
+		INIT_LIST_HEAD(&info->shrinklist);
 		INIT_LIST_HEAD(&info->swaplist);
 		simple_xattrs_init(&info->xattrs);
 		cache_no_acl(inode);
@@ -3564,6 +3979,10 @@ int __init shmem_init(void)
 		printk(KERN_ERR "Could not kern_mount tmpfs\n");
 		goto out1;
 	}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	register_shrinker(&shmem_hugehole_shrinker);
+#endif
 	return 0;
 
 out1:

 
 	mapping->nrpages++;
 	__inc_zone_state(zone, NR_FILE_PAGES);
@@ -839,6 +1222,14 @@ static void shmem_evict_inode(struct ino
 		shmem_unacct_size(info->flags, inode->i_size);
 		inode->i_size = 0;
 		shmem_truncate_range(inode, 0, (loff_t)-1);
+		if (!list_empty(&info->shrinklist)) {
+			spin_lock(&shmem_shrinklist_lock);
+			if (!list_empty(&info->shrinklist)) {
+				list_del_init(&info->shrinklist);
+				shmem_shrinklist_depth--;
+			}
+			spin_unlock(&shmem_shrinklist_lock);
+		}
 		if (!list_empty(&info->swaplist)) {
 			mutex_lock(&shmem_swaplist_mutex);
 			list_del_init(&info->swaplist);
@@ -1189,10 +1580,18 @@ static struct page *shmem_alloc_page(gfp
 		if (*hugehint == SHMEM_ALLOC_HUGE_PAGE) {
 			head = alloc_pages_vma(gfp|__GFP_NORETRY|__GFP_NOWARN,
 				HPAGE_PMD_ORDER, &pvma, 0, numa_node_id());
+			if (!head) {
+				shmem_shrink_hugehole(NULL, NULL);
+				head = alloc_pages_vma(
+					gfp|__GFP_NORETRY|__GFP_NOWARN,
+					HPAGE_PMD_ORDER, &pvma, 0,
+					numa_node_id());
+			}
 			if (head) {
 				split_page(head, HPAGE_PMD_ORDER);
 
 				/* Prepare head page for add_to_page_cache */
+				atomic_long_set(&head->team_usage, 0);
 				__SetPageTeam(head);
 				head->mapping = mapping;
 				head->index = round_down(index, HPAGE_PMD_NR);
@@ -1504,6 +1903,21 @@ repeat:
 		if (sgp == SGP_WRITE)
 			__SetPageReferenced(page);
 		/*
+		 * Might we see !list_empty a moment before the shrinker
+		 * removes this inode from its list?  Unlikely, since we
+		 * already set a tag in the tree.  Some barrier required?
+		 */
+		if (alloced_huge && list_empty(&info->shrinklist)) {
+			spin_lock(&shmem_shrinklist_lock);
+			if (list_empty(&info->shrinklist)) {
+				list_add_tail(&info->shrinklist,
+					      &shmem_shrinklist);
+				shmem_shrinklist_depth++;
+			}
+			spin_unlock(&shmem_shrinklist_lock);
+		}
+
+		/*
 		 * Let SGP_FALLOC use the SGP_WRITE optimization on a new page.
 		 */
 		if (sgp == SGP_FALLOC)
@@ -1724,6 +2138,7 @@ static struct inode *shmem_get_inode(str
 		spin_lock_init(&info->lock);
 		info->seals = F_SEAL_SEAL;
 		info->flags = flags & VM_NORESERVE;
+		INIT_LIST_HEAD(&info->shrinklist);
 		INIT_LIST_HEAD(&info->swaplist);
 		simple_xattrs_init(&info->xattrs);
 		cache_no_acl(inode);
@@ -3564,6 +3979,10 @@ int __init shmem_init(void)
 		printk(KERN_ERR "Could not kern_mount tmpfs\n");
 		goto out1;
 	}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	register_shrinker(&shmem_hugehole_shrinker);
+#endif
 	return 0;
 
 out1:

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 12/24] huge tmpfs: get_unmapped_area align and fault supply huge page
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  4:11   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

Now make the shmem.c changes necessary for mapping its huge pages into
userspace with huge pmds: without actually doing so, since that needs
changes in huge_memory.c and across mm, better left to another patch.

Provide a shmem_get_unmapped_area method in file_operations, called
at mmap time to decide the mapping address.  It could be conditional
on CONFIG_TRANSPARENT_HUGEPAGE, but we save #ifdefs in other places by
making it unconditional.

shmem_get_unmapped_area() first calls the usual mm->get_unmapped_area
(which we treat as a black box, highly dependent on architecture and
config and executable layout).  Lots of conditions, and in most cases
it just goes with the address thus chosen; but when our huge stars are
rightly aligned, yet that did not provide a suitably aligned address,
it goes back to ask for a larger arena, within which to align the
mapping suitably.
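
For illustration, a minimal userspace sketch of that arithmetic, assuming
x86_64's 2MB huge pmd size (the function and variable names here are
illustrative only; the real logic is shmem_get_unmapped_area() in the
diff below):

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define HPAGE_PMD_SIZE	(512 * PAGE_SIZE)

static unsigned long align_in_arena(unsigned long arena_addr,
				    unsigned long file_offset)
{
	unsigned long offset = file_offset & (HPAGE_PMD_SIZE - 1);
	unsigned long arena_offset = arena_addr & (HPAGE_PMD_SIZE - 1);
	unsigned long addr = arena_addr + offset - arena_offset;

	/* Never step backwards out of the arena we were given */
	if (arena_offset > offset)
		addr += HPAGE_PMD_SIZE;
	return addr;
}

int main(void)
{
	/* Inflating by HPAGE_PMD_SIZE - PAGE_SIZE guarantees room to align */
	unsigned long len = 4 * HPAGE_PMD_SIZE;
	unsigned long inflated_len = len + HPAGE_PMD_SIZE - PAGE_SIZE;
	unsigned long arena = 0x7f0123456000UL;	/* as if returned by mmap() */

	printf("ask for %#lx bytes, then map %#lx bytes at %#lx\n",
	       inflated_len, len, align_in_arena(arena, 0));
	return 0;
}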

There have to be some direct calls to shmem_get_unmapped_area(),
not via the file_operations: because of the way shmem_zero_setup()
is called to create a shmem object late in the mmap sequence, when
MAP_SHARED is requested with MAP_ANONYMOUS or /dev/zero.  Though
this only matters when /proc/sys/vm/shmem_huge has been set.
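
For illustration, a hypothetical userspace sketch of that path, assuming
shmem_huge has been enabled and a 2MB pmd size (nothing here is part of
the patch itself):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL << 20)

int main(void)
{
	size_t len = 8 * HPAGE_SIZE;
	int fd = open("/dev/zero", O_RDWR);
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	if (p == MAP_FAILED)
		return 1;
	/* With the rest of the series applied, this fault can go hugely */
	p[0] = 1;
	printf("shared /dev/zero map at %p (%s2MB-aligned)\n", (void *)p,
	       ((unsigned long)p & (HPAGE_SIZE - 1)) ? "not " : "");
	munmap(p, len);
	close(fd);
	return 0;
}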

Then at fault time, shmem_fault() does its usual shmem_getpage_gfp(),
and if caller __do_fault() passed FAULT_FLAG_MAY_HUGE (in later patch),
checks if the 4kB page returned is PageTeam, and, subject to further
conditions, proceeds to populate the whole of the huge page (if it
was not already fully populated and uptodate: use PG_owner_priv_1
PageChecked to save repeating all this each time the object is mapped);
then returns it to __do_fault() with a VM_FAULT_HUGE flag to request
a huge pmd.

Among shmem_fault()'s conditions: don't attempt huge if VM_NONLINEAR.
But that raises the question, what if the remap_file_pages(2) system
call were used on an area with huge pmds?  Turns out that it populates
the area using __get_locked_pte(): VM_BUG_ON(pmd_trans_huge(*pmd))
replaced by split_huge_page_pmd_mm() and we should be okay.

Two conditions you might expect are not enforced here.  Originally
I intended to support just MAP_SHARED at this stage, which should be
good enough for a first implementation; but support for MAP_PRIVATE
(on read fault) needs so little further change, that it was well worth
supporting too - it opens up the opportunity to copy your x86_64 ELF
executables to huge tmpfs, their text then automatically mapped huge.
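
For illustration, a hypothetical sketch of that read-fault MAP_PRIVATE
case, assuming an x86_64 2MB pmd size and a tmpfs mounted with huge
pages enabled; the path and rounding here are illustrative only, not
part of the patch:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL << 20)

int main(void)
{
	struct stat st;
	size_t len;
	char *p;
	int fd = open("/mnt/hugetmpfs/mybinary", O_RDONLY);	/* made-up path */

	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;
	/* A hugely-rounded-up length asks for huge pmds even over the tail */
	len = ((size_t)st.st_size + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
	p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	/* Read faults can be served hugely, until any page is COWed */
	printf("first byte %#x mapped at %p\n", p[0], (void *)p);
	munmap(p, len);
	close(fd);
	return 0;
}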

The other missing condition: shmem_getpage_gfp() is checking that
the fault falls within (4kB-rounded-up) i_size, but shmem_fault() maps
hugely even when the tail of the 2MB falls outside the (4kB-rounded-up)
i_size.  This is intentional, but may need reconsideration - especially
in the MAP_PRIVATE case (is it right for a private mapping to allocate
"hidden" pages to the object beyond its EOF?).  The intent is that an
application can indicate its desire for huge pmds throughout, even of
the tail, by using a hugely-rounded-up mmap size; but we might end up
retracting this, asking for fallocate to be used explicitly for that.
(hugetlbfs behaves even less standardly: its mmap extends the i_size
of the object.)

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 drivers/char/mem.c       |   23 ++++
 include/linux/mm.h       |    3 
 include/linux/shmem_fs.h |    2 
 ipc/shm.c                |    6 -
 mm/memory.c              |    3 
 mm/mmap.c                |   16 ++
 mm/shmem.c               |  200 ++++++++++++++++++++++++++++++++++++-
 7 files changed, 243 insertions(+), 10 deletions(-)

--- thpfs.orig/drivers/char/mem.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/drivers/char/mem.c	2015-02-20 19:34:21.595969599 -0800
@@ -22,6 +22,7 @@
 #include <linux/device.h>
 #include <linux/highmem.h>
 #include <linux/backing-dev.h>
+#include <linux/shmem_fs.h>
 #include <linux/splice.h>
 #include <linux/pfn.h>
 #include <linux/export.h>
@@ -654,6 +655,27 @@ static int mmap_zero(struct file *file,
 	return 0;
 }
 
+static unsigned long get_unmapped_area_zero(struct file *file,
+				unsigned long addr, unsigned long len,
+				unsigned long pgoff, unsigned long flags)
+{
+#ifndef CONFIG_MMU
+	return -ENOSYS;
+#endif
+	if (flags & MAP_SHARED) {
+		/*
+		 * mmap_zero() will call shmem_zero_setup() to create a file,
+		 * so use shmem's get_unmapped_area in case it can be huge;
+		 * and pass NULL for file as in mmap.c's get_unmapped_area(),
+		 * so as not to confuse shmem with our handle on "/dev/zero".
+		 */
+		return shmem_get_unmapped_area(NULL, addr, len, pgoff, flags);
+	}
+
+	/* Otherwise flags & MAP_PRIVATE: with no shmem object beneath it */
+	return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+}
+
 static ssize_t write_full(struct file *file, const char __user *buf,
 			  size_t count, loff_t *ppos)
 {
@@ -760,6 +782,7 @@ static const struct file_operations zero
 	.read_iter	= read_iter_zero,
 	.aio_write	= aio_write_zero,
 	.mmap		= mmap_zero,
+	.get_unmapped_area = get_unmapped_area_zero,
 };
 
 /*
--- thpfs.orig/include/linux/mm.h	2015-02-20 19:34:11.231993296 -0800
+++ thpfs/include/linux/mm.h	2015-02-20 19:34:21.599969589 -0800
@@ -213,6 +213,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_KILLABLE	0x20	/* The fault task is in SIGKILL killable region */
 #define FAULT_FLAG_TRIED	0x40	/* second try */
 #define FAULT_FLAG_USER		0x80	/* The fault originated in userspace */
+#define FAULT_FLAG_MAY_HUGE	0x100	/* PT not alloced: could use huge pmd */
 
 /*
  * vm_fault is filled by the the pagefault handler and passed to the vma's
@@ -1069,7 +1070,7 @@ static inline int page_mapped(struct pag
 #define VM_FAULT_HWPOISON 0x0010	/* Hit poisoned small page */
 #define VM_FAULT_HWPOISON_LARGE 0x0020  /* Hit poisoned large page. Index encoded in upper bits */
 #define VM_FAULT_SIGSEGV 0x0040
-
+#define VM_FAULT_HUGE	0x0080	/* ->fault needs page installed as huge pmd */
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
--- thpfs.orig/include/linux/shmem_fs.h	2015-02-20 19:34:16.135982083 -0800
+++ thpfs/include/linux/shmem_fs.h	2015-02-20 19:34:21.599969589 -0800
@@ -54,6 +54,8 @@ extern struct file *shmem_file_setup(con
 extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
 					    unsigned long flags);
 extern int shmem_zero_setup(struct vm_area_struct *);
+extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
+		unsigned long len, unsigned long pgoff, unsigned long flags);
 extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
 extern bool shmem_mapping(struct address_space *mapping);
 extern void shmem_unlock_mapping(struct address_space *mapping);
--- thpfs.orig/ipc/shm.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/ipc/shm.c	2015-02-20 19:34:21.599969589 -0800
@@ -442,13 +442,15 @@ static const struct file_operations shm_
 	.mmap		= shm_mmap,
 	.fsync		= shm_fsync,
 	.release	= shm_release,
-#ifndef CONFIG_MMU
 	.get_unmapped_area	= shm_get_unmapped_area,
-#endif
 	.llseek		= noop_llseek,
 	.fallocate	= shm_fallocate,
 };
 
+/*
+ * shm_file_operations_huge is now identical to shm_file_operations,
+ * but we keep it distinct for the sake of is_file_shm_hugepages().
+ */
 static const struct file_operations shm_file_operations_huge = {
 	.mmap		= shm_mmap,
 	.fsync		= shm_fsync,
--- thpfs.orig/mm/memory.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/mm/memory.c	2015-02-20 19:34:21.599969589 -0800
@@ -1448,7 +1448,8 @@ pte_t *__get_locked_pte(struct mm_struct
 	if (pud) {
 		pmd_t * pmd = pmd_alloc(mm, pud, addr);
 		if (pmd) {
-			VM_BUG_ON(pmd_trans_huge(*pmd));
+			/* VM_NONLINEAR install_file_pte() must split hugepmd */
+			split_huge_page_pmd_mm(mm, addr, pmd);
 			return pte_alloc_map_lock(mm, pmd, addr, ptl);
 		}
 	}
--- thpfs.orig/mm/mmap.c	2015-02-20 19:33:56.528026917 -0800
+++ thpfs/mm/mmap.c	2015-02-20 19:34:21.603969581 -0800
@@ -25,6 +25,7 @@
 #include <linux/personality.h>
 #include <linux/security.h>
 #include <linux/hugetlb.h>
+#include <linux/shmem_fs.h>
 #include <linux/profile.h>
 #include <linux/export.h>
 #include <linux/mount.h>
@@ -2017,8 +2018,19 @@ get_unmapped_area(struct file *file, uns
 		return -ENOMEM;
 
 	get_area = current->mm->get_unmapped_area;
-	if (file && file->f_op->get_unmapped_area)
-		get_area = file->f_op->get_unmapped_area;
+	if (file) {
+		if (file->f_op->get_unmapped_area)
+			get_area = file->f_op->get_unmapped_area;
+	} else if (flags & MAP_SHARED) {
+		/*
+		 * mmap_region() will call shmem_zero_setup() to create a file,
+		 * so use shmem's get_unmapped_area in case it can be huge.
+		 * do_mmap_pgoff() will clear pgoff, so match alignment.
+		 */
+		pgoff = 0;
+		get_area = shmem_get_unmapped_area;
+	}
+
 	addr = get_area(file, addr, len, pgoff, flags);
 	if (IS_ERR_VALUE(addr))
 		return addr;
--- thpfs.orig/mm/shmem.c	2015-02-20 19:34:16.139982074 -0800
+++ thpfs/mm/shmem.c	2015-02-20 19:34:21.603969581 -0800
@@ -103,6 +103,8 @@ struct shmem_falloc {
 enum sgp_type {
 	SGP_READ,	/* don't exceed i_size, don't allocate page */
 	SGP_CACHE,	/* don't exceed i_size, may allocate page */
+			/* ordering assumed: those above don't check i_size */
+	SGP_TEAM,	/* may exceed i_size, may make team page Uptodate */
 	SGP_WRITE,	/* may exceed i_size, may allocate !Uptodate page */
 	SGP_FALLOC,	/* like SGP_WRITE, but make existing page Uptodate */
 };
@@ -421,6 +423,42 @@ static void shmem_added_to_hugeteam(stru
 	}
 }
 
+static int shmem_populate_hugeteam(struct inode *inode, struct page *head)
+{
+	struct page *page;
+	pgoff_t index;
+	int error;
+	int i;
+
+	/* We only have to do this once */
+	if (PageChecked(head))
+		return 0;
+
+	index = head->index;
+	for (i = 0; i < HPAGE_PMD_NR; i++, index++) {
+		if (!PageTeam(head))
+			return -EAGAIN;
+		if (PageChecked(head))
+			return 0;
+		/* Mark all pages dirty even when map is readonly, for now */
+		if (PageUptodate(head + i) && PageDirty(head + i))
+			continue;
+		error = shmem_getpage(inode, index, &page, SGP_TEAM, NULL);
+		if (error)
+			return error;
+		SetPageDirty(page);
+		unlock_page(page);
+		page_cache_release(page);
+		if (page != head + i)
+			return -EAGAIN;
+		cond_resched();
+	}
+
+	/* Now safe from the shrinker, but not yet from truncate */
+	SetPageChecked(head);
+	return 0;
+}
+
 static int shmem_disband_hugehead(struct page *head)
 {
 	struct address_space *mapping;
@@ -844,6 +882,12 @@ static inline void shmem_added_to_hugete
 {
 }
 
+static inline int shmem_populate_hugeteam(struct inode *inode,
+					  struct page *head)
+{
+	return -EAGAIN;
+}
+
 static inline unsigned long shmem_shrink_hugehole(struct shrinker *shrink,
 						  struct shrink_control *sc)
 {
@@ -1745,7 +1789,7 @@ repeat:
 		page = NULL;
 	}
 
-	if (sgp != SGP_WRITE && sgp != SGP_FALLOC &&
+	if (sgp <= SGP_CACHE &&
 	    ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) {
 		error = -EINVAL;
 		goto failed;
@@ -1936,7 +1980,7 @@ clear:
 	}
 
 	/* Perhaps the file has been truncated since we checked */
-	if (sgp != SGP_WRITE && sgp != SGP_FALLOC &&
+	if (sgp <= SGP_CACHE &&
 	    ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) {
 		error = -EINVAL;
 		alloced_huge = NULL;	/* already exposed: maybe now in use */
@@ -1992,9 +2036,12 @@ unlock:
 
 static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
+	unsigned long addr = (unsigned long)vmf->virtual_address;
 	struct inode *inode = file_inode(vma->vm_file);
-	int error;
+	struct page *head;
 	int ret = VM_FAULT_LOCKED;
+	int once = 0;
+	int error;
 
 	/*
 	 * Trinity finds that probing a hole which tmpfs is punching can
@@ -2054,6 +2101,8 @@ static int shmem_fault(struct vm_area_st
 		spin_unlock(&inode->i_lock);
 	}
 
+single:
+	vmf->page = NULL;
 	error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
 	if (error)
 		return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
@@ -2062,7 +2111,142 @@ static int shmem_fault(struct vm_area_st
 		count_vm_event(PGMAJFAULT);
 		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 	}
-	return ret;
+
+	/*
+	 * Shall we map a huge page hugely?
+	 */
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		return ret;
+	if (!(vmf->flags & FAULT_FLAG_MAY_HUGE))
+		return ret;
+	if (!PageTeam(vmf->page))
+		return ret;
+	if (once++)
+		return ret;
+	if (vma->vm_flags & VM_NONLINEAR)
+		return ret;
+	if (!(vma->vm_flags & VM_SHARED) && (vmf->flags & FAULT_FLAG_WRITE))
+		return ret;
+	if ((vma->vm_start-(vma->vm_pgoff<<PAGE_SHIFT)) & (HPAGE_PMD_SIZE-1))
+		return ret;
+	if (round_down(addr, HPAGE_PMD_SIZE) < vma->vm_start)
+		return ret;
+	if (round_up(addr + 1, HPAGE_PMD_SIZE) > vma->vm_end)
+		return ret;
+	/* But omit i_size check: allow up to huge page boundary */
+
+	head = team_head(vmf->page);
+	if (!get_page_unless_zero(head))
+		return ret;
+	if (!PageTeam(head)) {
+		page_cache_release(head);
+		return ret;
+	}
+
+	unlock_page(vmf->page);
+	page_cache_release(vmf->page);
+	if (shmem_populate_hugeteam(inode, head) < 0) {
+		page_cache_release(head);
+		goto single;
+	}
+	lock_page(head);
+	if (!PageTeam(head)) {
+		unlock_page(head);
+		page_cache_release(head);
+		goto single;
+	}
+
+	/* Now safe from truncation */
+	vmf->page = head;
+	return ret | VM_FAULT_HUGE;
+}
+
+unsigned long shmem_get_unmapped_area(struct file *file,
+				      unsigned long uaddr, unsigned long len,
+				      unsigned long pgoff, unsigned long flags)
+{
+	unsigned long (*get_area)(struct file *,
+		unsigned long, unsigned long, unsigned long, unsigned long);
+	unsigned long addr;
+	unsigned long offset;
+	unsigned long inflated_len;
+	unsigned long inflated_addr;
+	unsigned long inflated_offset;
+
+	if (len > TASK_SIZE)
+		return -ENOMEM;
+
+	get_area = current->mm->get_unmapped_area;
+	addr = get_area(file, uaddr, len, pgoff, flags);
+
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		return addr;
+	if (IS_ERR_VALUE(addr))
+		return addr;
+	if (addr & ~PAGE_MASK)
+		return addr;
+	if (addr > TASK_SIZE - len)
+		return addr;
+
+	if (shmem_huge == SHMEM_HUGE_DENY)
+		return addr;
+	if (len < HPAGE_PMD_SIZE)
+		return addr;
+	if (flags & MAP_FIXED)
+		return addr;
+	/*
+	 * Our priority is to support MAP_SHARED mapped hugely;
+	 * and support MAP_PRIVATE mapped hugely too, until it is COWed.
+	 * But if caller specified an address hint, respect that as before.
+	 */
+	if (uaddr)
+		return addr;
+
+	if (shmem_huge != SHMEM_HUGE_FORCE) {
+		struct super_block *sb;
+
+		if (file) {
+			VM_BUG_ON(file->f_op != &shmem_file_operations);
+			sb = file_inode(file)->i_sb;
+		} else {
+			/*
+			 * Called directly from mm/mmap.c, or drivers/char/mem.c
+			 * for "/dev/zero", to create a shared anonymous object.
+			 */
+			if (IS_ERR(shm_mnt))
+				return addr;
+			sb = shm_mnt->mnt_sb;
+		}
+		if (!SHMEM_SB(sb)->huge)
+			return addr;
+	}
+
+	offset = (pgoff << PAGE_SHIFT) & (HPAGE_PMD_SIZE-1);
+	if (offset && offset + len < 2 * HPAGE_PMD_SIZE)
+		return addr;
+	if ((addr & (HPAGE_PMD_SIZE-1)) == offset)
+		return addr;
+
+	inflated_len = len + HPAGE_PMD_SIZE - PAGE_SIZE;
+	if (inflated_len > TASK_SIZE)
+		return addr;
+	if (inflated_len < len)
+		return addr;
+
+	inflated_addr = get_area(NULL, 0, inflated_len, 0, flags);
+	if (IS_ERR_VALUE(inflated_addr))
+		return addr;
+	if (inflated_addr & ~PAGE_MASK)
+		return addr;
+
+	inflated_offset = inflated_addr & (HPAGE_PMD_SIZE-1);
+	inflated_addr += offset - inflated_offset;
+	if (inflated_offset > offset)
+		inflated_addr += HPAGE_PMD_SIZE;
+
+	if (inflated_addr > TASK_SIZE - len)
+		return addr;
+	return inflated_addr;
 }
 
 #ifdef CONFIG_NUMA
@@ -3852,6 +4036,7 @@ static const struct address_space_operat
 
 static const struct file_operations shmem_file_operations = {
 	.mmap		= shmem_mmap,
+	.get_unmapped_area = shmem_get_unmapped_area,
 #ifdef CONFIG_TMPFS
 	.llseek		= shmem_file_llseek,
 	.read		= new_sync_read,
@@ -4063,6 +4248,13 @@ void shmem_unlock_mapping(struct address
 {
 }
 
+unsigned long shmem_get_unmapped_area(struct file *file,
+				      unsigned long addr, unsigned long len,
+				      unsigned long pgoff, unsigned long flags)
+{
+	return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+}
+
 void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
 {
 	truncate_inode_pages_range(inode->i_mapping, lstart, lend);

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 13/24] huge tmpfs: extend get_user_pages_fast to shmem pmd
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  4:12   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:12 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

Factor out one small part of the shmem pmd handling: the arch-specific
get_user_pages_fast() has special code to cope with the peculiar
refcounting on anonymous THP tail pages (and on hugetlbfs tail pages),
which must be avoided in the straightforward shmem pmd case.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/mips/mm/gup.c  |   17 ++++++++++++-----
 arch/s390/mm/gup.c  |   22 +++++++++++++++++++++-
 arch/sparc/mm/gup.c |   22 +++++++++++++++++++++-
 arch/x86/mm/gup.c   |   17 ++++++++++++-----
 mm/gup.c            |   22 +++++++++++++++++++++-
 5 files changed, 87 insertions(+), 13 deletions(-)

--- thpfs.orig/arch/mips/mm/gup.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/arch/mips/mm/gup.c	2015-02-20 19:34:26.971957306 -0800
@@ -64,7 +64,8 @@ static inline void get_head_page_multipl
 {
 	VM_BUG_ON(page != compound_head(page));
 	VM_BUG_ON(page_count(page) == 0);
-	atomic_add(nr, &page->_count);
+	if (nr)
+		atomic_add(nr, &page->_count);
 	SetPageReferenced(page);
 }
 
@@ -85,13 +86,19 @@ static int gup_huge_pmd(pmd_t pmd, unsig
 	head = pte_page(pte);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	do {
-		VM_BUG_ON(compound_head(page) != head);
-		pages[*nr] = page;
-		if (PageTail(page))
+		if (PageTail(page)) {
+			VM_BUG_ON(compound_head(page) != head);
 			get_huge_page_tail(page);
+			refs++;
+		} else {
+			/*
+			 * Handle head or huge tmpfs with normal refcounting.
+			 */
+			get_page(page);
+		}
+		pages[*nr] = page;
 		(*nr)++;
 		page++;
-		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 
 	get_head_page_multiple(head, refs);
--- thpfs.orig/arch/s390/mm/gup.c	2014-01-19 18:40:07.000000000 -0800
+++ thpfs/arch/s390/mm/gup.c	2015-02-20 19:34:26.971957306 -0800
@@ -61,10 +61,30 @@ static inline int gup_huge_pmd(pmd_t *pm
 		return 0;
 	VM_BUG_ON(!pfn_valid(pmd_val(pmd) >> PAGE_SHIFT));
 
-	refs = 0;
 	head = pmd_page(pmd);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+
+	if (!PageHead(head)) {
+		/*
+		 * Handle a huge tmpfs team with normal refcounting.
+		 */
+		do {
+			if (!page_cache_get_speculative(page))
+				return 0;
+			if (unlikely(pmd_val(pmd) != pmd_val(*pmdp))) {
+				put_page(page);
+				return 0;
+			}
+			pages[*nr] = page;
+			(*nr)++;
+			page++;
+		} while (addr += PAGE_SIZE, addr != end);
+		return 1;
+	}
+
 	tail = page;
+	refs = 0;
+
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
--- thpfs.orig/arch/sparc/mm/gup.c	2014-12-07 14:21:05.000000000 -0800
+++ thpfs/arch/sparc/mm/gup.c	2015-02-20 19:34:26.975957297 -0800
@@ -79,10 +79,30 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd
 	if (write && !pmd_write(pmd))
 		return 0;
 
-	refs = 0;
 	head = pmd_page(pmd);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+
+	if (!PageHead(head)) {
+		/*
+		 * Handle a huge tmpfs team with normal refcounting.
+		 */
+		do {
+			if (!page_cache_get_speculative(page))
+				return 0;
+			if (unlikely(pmd_val(pmd) != pmd_val(*pmdp))) {
+				put_page(page);
+				return 0;
+			}
+			pages[*nr] = page;
+			(*nr)++;
+			page++;
+		} while (addr += PAGE_SIZE, addr != end);
+		return 1;
+	}
+
 	tail = page;
+	refs = 0;
+
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
--- thpfs.orig/arch/x86/mm/gup.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/arch/x86/mm/gup.c	2015-02-20 19:34:26.975957297 -0800
@@ -110,7 +110,8 @@ static inline void get_head_page_multipl
 {
 	VM_BUG_ON_PAGE(page != compound_head(page), page);
 	VM_BUG_ON_PAGE(page_count(page) == 0, page);
-	atomic_add(nr, &page->_count);
+	if (nr)
+		atomic_add(nr, &page->_count);
 	SetPageReferenced(page);
 }
 
@@ -135,13 +136,19 @@ static noinline int gup_huge_pmd(pmd_t p
 	head = pte_page(pte);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	do {
-		VM_BUG_ON_PAGE(compound_head(page) != head, page);
-		pages[*nr] = page;
-		if (PageTail(page))
+		if (PageTail(page)) {
+			VM_BUG_ON_PAGE(compound_head(page) != head, page);
 			get_huge_page_tail(page);
+			refs++;
+		} else {
+			/*
+			 * Handle head or huge tmpfs with normal refcounting.
+			 */
+			get_page(page);
+		}
+		pages[*nr] = page;
 		(*nr)++;
 		page++;
-		refs++;
 	} while (addr += PAGE_SIZE, addr != end);
 	get_head_page_multiple(head, refs);
 
--- thpfs.orig/mm/gup.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/mm/gup.c	2015-02-20 19:34:26.975957297 -0800
@@ -795,10 +795,30 @@ static int gup_huge_pmd(pmd_t orig, pmd_
 	if (write && !pmd_write(orig))
 		return 0;
 
-	refs = 0;
 	head = pmd_page(orig);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+
+	if (!PageHead(head)) {
+		/*
+		 * Handle a huge tmpfs team with normal refcounting.
+		 */
+		do {
+			if (!page_cache_get_speculative(page))
+				return 0;
+			if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
+				put_page(page);
+				return 0;
+			}
+			pages[*nr] = page;
+			(*nr)++;
+			page++;
+		} while (addr += PAGE_SIZE, addr != end);
+		return 1;
+	}
+
 	tail = page;
+	refs = 0;
+
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 14/24] huge tmpfs: extend vma_adjust_trans_huge to shmem pmd
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  4:13   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:13 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

Factor out one small part of the shmem pmd handling: the inline function
vma_adjust_trans_huge() (called when vmas are split or merged) contains
a preliminary !anon_vma || vm_ops check to avoid the overhead of
__vma_adjust_trans_huge() on areas which could not possibly contain an
anonymous THP pmd.  But with huge tmpfs, we shall need it to be called
even in those excluded cases.

Before the split pmd ptlocks, there was a nice alternative optimization
to make: avoid the overhead of __vma_adjust_trans_huge() on mms which
could not possibly contain a huge pmd - those with NULL pmd_huge_pte
(using a huge pmd demands the deposit of a spare page table, typically
stored in a list at pmd_huge_pte, withdrawn for use when splitting the
pmd; and huge tmpfs will follow that protocol too).

Still use that optimization when !USE_SPLIT_PMD_PTLOCKS, when
mm->pmd_huge_pte is updated under mm->page_table_lock (but beware:
unlike other arches, powerpc made no use of pmd_huge_pte before, so
this patch hacks it to update pmd_huge_pte as a count).  In common
configs, no equivalent optimization on x86 now: if that's a visible
problem, we can add an atomic count or flag to mm for the purpose.

And looking into the overhead of __vma_adjust_trans_huge(): it is
silly for split_huge_page_pmd_mm() to be calling find_vma() followed
by split_huge_page_pmd(), when it can check the pmd directly first,
and usually avoid the find_vma() call.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/powerpc/mm/pgtable_64.c |    7 ++++++-
 include/linux/huge_mm.h      |    5 ++++-
 mm/huge_memory.c             |    7 ++-----
 3 files changed, 12 insertions(+), 7 deletions(-)

--- thpfs.orig/arch/powerpc/mm/pgtable_64.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/arch/powerpc/mm/pgtable_64.c	2015-02-20 19:34:32.363944978 -0800
@@ -675,9 +675,12 @@ void pgtable_trans_huge_deposit(struct m
 				pgtable_t pgtable)
 {
 	pgtable_t *pgtable_slot;
+
 	assert_spin_locked(&mm->page_table_lock);
+	mm->pmd_huge_pte++;
 	/*
-	 * we store the pgtable in the second half of PMD
+	 * we store the pgtable in the second half of PMD; but must also
+	 * set pmd_huge_pte for the optimization in vma_adjust_trans_huge().
 	 */
 	pgtable_slot = (pgtable_t *)pmdp + PTRS_PER_PMD;
 	*pgtable_slot = pgtable;
@@ -696,6 +699,8 @@ pgtable_t pgtable_trans_huge_withdraw(st
 	pgtable_t *pgtable_slot;
 
 	assert_spin_locked(&mm->page_table_lock);
+	mm->pmd_huge_pte--;
+
 	pgtable_slot = (pgtable_t *)pmdp + PTRS_PER_PMD;
 	pgtable = *pgtable_slot;
 	/*
--- thpfs.orig/include/linux/huge_mm.h	2014-12-07 14:21:05.000000000 -0800
+++ thpfs/include/linux/huge_mm.h	2015-02-20 19:34:32.363944978 -0800
@@ -143,8 +143,11 @@ static inline void vma_adjust_trans_huge
 					 unsigned long end,
 					 long adjust_next)
 {
-	if (!vma->anon_vma || vma->vm_ops)
+#if !USE_SPLIT_PMD_PTLOCKS
+	/* If no pgtable is deposited, there is no huge pmd to worry about */
+	if (!vma->vm_mm->pmd_huge_pte)
 		return;
+#endif
 	__vma_adjust_trans_huge(vma, start, end, adjust_next);
 }
 static inline int hpage_nr_pages(struct page *page)
--- thpfs.orig/mm/huge_memory.c	2015-02-20 19:33:51.492038431 -0800
+++ thpfs/mm/huge_memory.c	2015-02-20 19:34:32.367944969 -0800
@@ -2905,11 +2905,8 @@ again:
 void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
 		pmd_t *pmd)
 {
-	struct vm_area_struct *vma;
-
-	vma = find_vma(mm, address);
-	BUG_ON(vma == NULL);
-	split_huge_page_pmd(vma, address, pmd);
+	if (unlikely(pmd_trans_huge(*pmd)))
+		__split_huge_page_pmd(find_vma(mm, address), address, pmd);
 }
 
 static void split_huge_page_address(struct mm_struct *mm,

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 14/24] huge tmpfs: extend vma_adjust_trans_huge to shmem pmd
@ 2015-02-21  4:13   ` Hugh Dickins
  0 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:13 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

Factor out one small part of the shmem pmd handling: the inline function
vma_adjust_trans_huge() (called when vmas are split or merged) contains
a preliminary !anon_vma || vm_ops check to avoid the overhead of
__vma_adjust_trans_huge() on areas which could not possibly contain an
anonymous THP pmd.  But with huge tmpfs, we shall need it to be called
even in those excluded cases.

Before the split pmd ptlocks, there was a nice alternative optimization
to make: avoid the overhead of __vma_adjust_trans_huge() on mms which
could not possibly contain a huge pmd - those with NULL pmd_huge_pte
(using a huge pmd demands the deposit of a spare page table, typically
stored in a list at pmd_huge_pte, withdrawn for use when splitting the
pmd; and huge tmpfs will follow that protocol too).

Still use that optimization when !USE_SPLIT_PMD_PTLOCKS, where
mm->pmd_huge_pte is updated under mm->page_table_lock (but beware:
unlike other arches, powerpc made no use of pmd_huge_pte before, so
this patch hacks it to update pmd_huge_pte as a count).  In common
configs there is no equivalent optimization on x86 for now: if that
becomes a visible problem, we can add an atomic count or flag to mm
for the purpose.

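If it ever does become visible, the shape of that would be something
like the following (purely illustrative: huge_pmd_count is a made-up
mm_struct field, not something this patch adds):

/* hypothetical: atomic_long_t huge_pmd_count; added to struct mm_struct */

static inline void mm_inc_huge_pmd_count(struct mm_struct *mm)
{
	atomic_long_inc(&mm->huge_pmd_count);	/* at pgtable deposit */
}

static inline void mm_dec_huge_pmd_count(struct mm_struct *mm)
{
	atomic_long_dec(&mm->huge_pmd_count);	/* at pgtable withdraw */
}

static inline bool mm_may_have_huge_pmd(struct mm_struct *mm)
{
	return atomic_long_read(&mm->huge_pmd_count) != 0;
}
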
And looking into the overhead of __vma_adjust_trans_huge(): it is
silly for split_huge_page_pmd_mm() to be calling find_vma() followed
by split_huge_page_pmd(), when it can check the pmd directly first,
and usually avoid the find_vma() call.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/powerpc/mm/pgtable_64.c |    7 ++++++-
 include/linux/huge_mm.h      |    5 ++++-
 mm/huge_memory.c             |    7 ++-----
 3 files changed, 12 insertions(+), 7 deletions(-)

--- thpfs.orig/arch/powerpc/mm/pgtable_64.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/arch/powerpc/mm/pgtable_64.c	2015-02-20 19:34:32.363944978 -0800
@@ -675,9 +675,12 @@ void pgtable_trans_huge_deposit(struct m
 				pgtable_t pgtable)
 {
 	pgtable_t *pgtable_slot;
+
 	assert_spin_locked(&mm->page_table_lock);
+	mm->pmd_huge_pte++;
 	/*
-	 * we store the pgtable in the second half of PMD
+	 * we store the pgtable in the second half of PMD; but must also
+	 * set pmd_huge_pte for the optimization in vma_adjust_trans_huge().
 	 */
 	pgtable_slot = (pgtable_t *)pmdp + PTRS_PER_PMD;
 	*pgtable_slot = pgtable;
@@ -696,6 +699,8 @@ pgtable_t pgtable_trans_huge_withdraw(st
 	pgtable_t *pgtable_slot;
 
 	assert_spin_locked(&mm->page_table_lock);
+	mm->pmd_huge_pte--;
+
 	pgtable_slot = (pgtable_t *)pmdp + PTRS_PER_PMD;
 	pgtable = *pgtable_slot;
 	/*
--- thpfs.orig/include/linux/huge_mm.h	2014-12-07 14:21:05.000000000 -0800
+++ thpfs/include/linux/huge_mm.h	2015-02-20 19:34:32.363944978 -0800
@@ -143,8 +143,11 @@ static inline void vma_adjust_trans_huge
 					 unsigned long end,
 					 long adjust_next)
 {
-	if (!vma->anon_vma || vma->vm_ops)
+#if !USE_SPLIT_PMD_PTLOCKS
+	/* If no pgtable is deposited, there is no huge pmd to worry about */
+	if (!vma->vm_mm->pmd_huge_pte)
 		return;
+#endif
 	__vma_adjust_trans_huge(vma, start, end, adjust_next);
 }
 static inline int hpage_nr_pages(struct page *page)
--- thpfs.orig/mm/huge_memory.c	2015-02-20 19:33:51.492038431 -0800
+++ thpfs/mm/huge_memory.c	2015-02-20 19:34:32.367944969 -0800
@@ -2905,11 +2905,8 @@ again:
 void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
 		pmd_t *pmd)
 {
-	struct vm_area_struct *vma;
-
-	vma = find_vma(mm, address);
-	BUG_ON(vma == NULL);
-	split_huge_page_pmd(vma, address, pmd);
+	if (unlikely(pmd_trans_huge(*pmd)))
+		__split_huge_page_pmd(find_vma(mm, address), address, pmd);
 }
 
 static void split_huge_page_address(struct mm_struct *mm,

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 15/24] huge tmpfs: rework page_referenced_one and try_to_unmap_one
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  4:15   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

page_referenced_one() currently decides whether to go the huge
pmd route or the small pte route by looking at PageTransHuge(page).
But with huge tmpfs pages simultaneously mappable as small and as huge,
it's not deducible from page flags which is the case.  And the "helpers"
page_check_address, page_check_address_pmd, mm_find_pmd are designed to
hide the information we need now, instead of helping.

Open code (as it once was) with pgd,pud,pmd,pte: get *pmd speculatively,
and if it appears pmd_trans_huge, then acquire pmd_lock and recheck.
The same code is then valid for anon THP and for huge tmpfs, without
any page flag test.

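Distilled to a sketch (not the mm/rmap.c hunk itself, and
sketch_find_huge_pmd() is an invented name), the lookup pattern is:

static bool sketch_find_huge_pmd(struct mm_struct *mm, pmd_t *pmd,
				 struct page *page)
{
	pmd_t pmdval;
	spinlock_t *ptl;
again:
	pmdval = *pmd;			/* speculative read, no lock held */
	barrier();
	if (!pmd_present(pmdval) || !pmd_trans_huge(pmdval))
		return false;		/* go the small pte route instead */
	if (pmd_page(pmdval) != page)
		return false;
	ptl = pmd_lock(mm, pmd);	/* looks huge: confirm under lock */
	if (!pmd_same(*pmd, pmdval)) {
		spin_unlock(ptl);
		goto again;
	}
	/* the real callers do their pmd work here, before unlocking */
	spin_unlock(ptl);
	return true;
}
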
Copy from this template in try_to_unmap_one(), to prepare for its
use on huge tmpfs pages (whereas anon THPs have already been split in
add_to_swap() before getting here); with a stub for unmap_team_by_pmd()
until a later patch implements it.  But unlike page_referenced_one(),
here we must allow for hugetlbfs pages (including non-pmd-based ones),
so must still use huge_pte_offset instead of pmd_trans_huge for those.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/pageteam.h |    6 +
 mm/rmap.c                |  158 +++++++++++++++++++++++++++++--------
 2 files changed, 133 insertions(+), 31 deletions(-)

--- thpfs.orig/include/linux/pageteam.h	2015-02-20 19:34:06.224004747 -0800
+++ thpfs/include/linux/pageteam.h	2015-02-20 19:34:37.851932430 -0800
@@ -29,4 +29,10 @@ static inline struct page *team_head(str
 	return head;
 }
 
+/* Temporary stub for mm/rmap.c until implemented in mm/huge_memory.c */
+static inline void unmap_team_by_pmd(struct vm_area_struct *vma,
+			unsigned long addr, pmd_t *pmd, struct page *page)
+{
+}
+
 #endif /* _LINUX_PAGETEAM_H */
--- thpfs.orig/mm/rmap.c	2015-02-20 19:33:51.496038422 -0800
+++ thpfs/mm/rmap.c	2015-02-20 19:34:37.851932430 -0800
@@ -44,6 +44,7 @@
 
 #include <linux/mm.h>
 #include <linux/pagemap.h>
+#include <linux/pageteam.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <linux/slab.h>
@@ -607,7 +608,7 @@ pmd_t *mm_find_pmd(struct mm_struct *mm,
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd = NULL;
-	pmd_t pmde;
+	pmd_t pmdval;
 
 	pgd = pgd_offset(mm, address);
 	if (!pgd_present(*pgd))
@@ -620,12 +621,12 @@ pmd_t *mm_find_pmd(struct mm_struct *mm,
 	pmd = pmd_offset(pud, address);
 	/*
 	 * Some THP functions use the sequence pmdp_clear_flush(), set_pmd_at()
-	 * without holding anon_vma lock for write.  So when looking for a
-	 * genuine pmde (in which to find pte), test present and !THP together.
+	 * without locking out concurrent rmap lookups.  So when looking for a
+	 * pmd entry, in which to find a pte, test present and !THP together.
 	 */
-	pmde = *pmd;
+	pmdval = *pmd;
 	barrier();
-	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
+	if (!pmd_present(pmdval) || pmd_trans_huge(pmdval))
 		pmd = NULL;
 out:
 	return pmd;
@@ -718,22 +719,41 @@ static int page_referenced_one(struct pa
 			unsigned long address, void *arg)
 {
 	struct mm_struct *mm = vma->vm_mm;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pmd_t pmdval;
+	pte_t *pte;
 	spinlock_t *ptl;
 	int referenced = 0;
 	struct page_referenced_arg *pra = arg;
 
-	if (unlikely(PageTransHuge(page))) {
-		pmd_t *pmd;
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		return SWAP_AGAIN;
 
-		/*
-		 * rmap might return false positives; we must filter
-		 * these out using page_check_address_pmd().
-		 */
-		pmd = page_check_address_pmd(page, mm, address,
-					     PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
-		if (!pmd)
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return SWAP_AGAIN;
+
+	pmd = pmd_offset(pud, address);
+again:
+	/* See comment in mm_find_pmd() for why we use pmdval+barrier here */
+	pmdval = *pmd;
+	barrier();
+	if (!pmd_present(pmdval))
+		return SWAP_AGAIN;
+
+	if (pmd_trans_huge(pmdval)) {
+		if (pmd_page(pmdval) != page)
 			return SWAP_AGAIN;
 
+		ptl = pmd_lock(mm, pmd);
+		if (!pmd_same(*pmd, pmdval)) {
+			spin_unlock(ptl);
+			goto again;
+		}
+
 		if (vma->vm_flags & VM_LOCKED) {
 			spin_unlock(ptl);
 			pra->vm_flags |= VM_LOCKED;
@@ -745,15 +765,22 @@ static int page_referenced_one(struct pa
 			referenced++;
 		spin_unlock(ptl);
 	} else {
-		pte_t *pte;
+		pte = pte_offset_map(pmd, address);
 
-		/*
-		 * rmap might return false positives; we must filter
-		 * these out using page_check_address().
-		 */
-		pte = page_check_address(page, mm, address, &ptl, 0);
-		if (!pte)
+		/* Make a quick check before getting the lock */
+		if (!pte_present(*pte)) {
+			pte_unmap(pte);
 			return SWAP_AGAIN;
+		}
+
+		ptl = pte_lockptr(mm, pmd);
+		spin_lock(ptl);
+
+		if (!pte_present(*pte) ||
+		    page_to_pfn(page) != pte_pfn(*pte)) {
+			pte_unmap_unlock(pte, ptl);
+			return SWAP_AGAIN;
+		}
 
 		if (vma->vm_flags & VM_LOCKED) {
 			pte_unmap_unlock(pte, ptl);
@@ -1179,15 +1206,84 @@ static int try_to_unmap_one(struct page
 		     unsigned long address, void *arg)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	pte_t *pte;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pmd_t pmdval;
+	pte_t *pte = NULL;
 	pte_t pteval;
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
 	enum ttu_flags flags = (enum ttu_flags)arg;
 
-	pte = page_check_address(page, mm, address, &ptl, 0);
-	if (!pte)
-		goto out;
+	if (unlikely(PageHuge(page))) {
+		pte = huge_pte_offset(mm, address);
+		if (!pte)
+			return ret;
+		ptl = huge_pte_lockptr(page_hstate(page), mm, pte);
+		goto check;
+	}
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		return ret;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return ret;
+
+	pmd = pmd_offset(pud, address);
+again:
+	/* See comment in mm_find_pmd() for why we use pmdval+barrier here */
+	pmdval = *pmd;
+	barrier();
+	if (!pmd_present(pmdval))
+		return ret;
+
+	if (pmd_trans_huge(pmdval)) {
+		if (pmd_page(pmdval) != page)
+			return ret;
+
+		ptl = pmd_lock(mm, pmd);
+		if (!pmd_same(*pmd, pmdval)) {
+			spin_unlock(ptl);
+			goto again;
+		}
+
+		if (!(flags & TTU_IGNORE_MLOCK)) {
+			if (vma->vm_flags & VM_LOCKED)
+				goto out_mlock;
+			if (flags & TTU_MUNLOCK)
+				goto out_unmap;
+		}
+		if (!(flags & TTU_IGNORE_ACCESS) &&
+		    pmdp_clear_flush_young_notify(vma, address, pmd)) {
+			ret = SWAP_FAIL;
+			goto out_unmap;
+		}
+
+		spin_unlock(ptl);
+		unmap_team_by_pmd(vma, address, pmd, page);
+		return ret;
+	}
+
+	pte = pte_offset_map(pmd, address);
+
+	/* Make a quick check before getting the lock */
+	if (!pte_present(*pte)) {
+		pte_unmap(pte);
+		return ret;
+	}
+
+	ptl = pte_lockptr(mm, pmd);
+check:
+	spin_lock(ptl);
+
+	if (!pte_present(*pte) ||
+	    page_to_pfn(page) != pte_pfn(*pte)) {
+		pte_unmap_unlock(pte, ptl);
+		return ret;
+	}
 
 	/*
 	 * If the page is mlock()d, we cannot swap it out.
@@ -1197,7 +1293,6 @@ static int try_to_unmap_one(struct page
 	if (!(flags & TTU_IGNORE_MLOCK)) {
 		if (vma->vm_flags & VM_LOCKED)
 			goto out_mlock;
-
 		if (flags & TTU_MUNLOCK)
 			goto out_unmap;
 	}
@@ -1287,16 +1382,17 @@ static int try_to_unmap_one(struct page
 	page_cache_release(page);
 
 out_unmap:
-	pte_unmap_unlock(pte, ptl);
+	spin_unlock(ptl);
+	if (pte)
+		pte_unmap(pte);
 	if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
 		mmu_notifier_invalidate_page(mm, address);
-out:
 	return ret;
 
 out_mlock:
-	pte_unmap_unlock(pte, ptl);
-
-
+	spin_unlock(ptl);
+	if (pte)
+		pte_unmap(pte);
 	/*
 	 * We need mmap_sem locking, Otherwise VM_LOCKED check makes
 	 * unstable result and race. Plus, We can't wait here because

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 16/24] huge tmpfs: fix problems from premature exposure of pagetable
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  4:16   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:16 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

Andrea wrote a very interesting comment on THP in mm/memory.c,
just before the end of __handle_mm_fault():

 * A regular pmd is established and it can't morph into a huge pmd
 * from under us anymore at this point because we hold the mmap_sem
 * read mode and khugepaged takes it in write mode. So now it's
 * safe to run pte_offset_map().

This comment hints at several difficulties, which anon THP solved
for itself with mmap_sem and anon_vma lock, but which huge tmpfs
may need to solve differently.

The reference to pte_offset_map() above: I believe that's a hint
that on a 32-bit machine, the pagetables might need to come from
kernel-mapped memory, but a huge pmd pointing to user memory beyond
that limit could be racily substituted, causing undefined behavior
in the architecture-dependent pte_offset_map().

That itself is not a problem on x86_64, but there's plenty more:
how about those places which use pte_offset_map_lock() - if that
spinlock is in the struct page of a pagetable, which has been
deposited and might be withdrawn and freed at any moment (being
on a list unattached to the allocating pmd in the case of x86),
taking the spinlock might corrupt someone else's struct page.

Because THP has departed from the earlier rules (when pagetable
was only freed under exclusive mmap_sem, or at exit_mmap, after
removing all affected vmas from the rmap list): zap_huge_pmd()
does pte_free() even when serving MADV_DONTNEED under down_read
of mmap_sem.

And what of the "entry = *pte" at the start of handle_pte_fault(),
getting the entry used in pte_same(,orig_pte) tests to validate all
fault handling?  If that entry can itself be junk picked out of some
freed and reused pagetable, it's hard to estimate the consequences.

We need to consider the safety of concurrent faults, and the
safety of rmap lookups, and the safety of miscellaneous operations
such as smaps_pte_range() for reading /proc/<pid>/smaps.

I set out to make safe the places which descend pgd,pud,pmd,pte,
using more careful access techniques like mm_find_pmd(); but with
pte_offset_map() being architecture-defined, it's too big a job to
tighten it up all over.

Instead, approach from the opposite direction: just do not expose
a pagetable in an empty *pmd, until vm_ops->fault has had a chance
to ask for a huge pmd there.  This is a much easier change to make,
and we are lucky that all the driver faults appear to be using
interfaces (like vm_insert_page() and remap_pfn_range()) which
automatically do the pte_alloc() if it was not already done.

But we must not get stuck refaulting: need FAULT_FLAG_MAY_HUGE for
__do_fault() to tell shmem_fault() to try for huge only when *pmd is
empty (could instead add pmd to vmf and let shmem work that out for
itself, but probably better to hide pmd from vm_ops->faults).

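In __do_fault() that amounts to something like this (a sketch, not the
hunk below: FAULT_FLAG_MAY_HUGE itself is wired up elsewhere in this
series, and sketch_ask_for_huge() is an invented wrapper):

static int sketch_ask_for_huge(struct vm_area_struct *vma,
			       unsigned long address, pmd_t *pmd,
			       pgoff_t pgoff, unsigned int flags)
{
	struct vm_fault vmf = {
		.virtual_address = (void __user *)(address & PAGE_MASK),
		.pgoff = pgoff,
		.flags = flags,
	};

	/* only invite a huge page while no pagetable is exposed here */
	if (pmd_none(*pmd))
		vmf.flags |= FAULT_FLAG_MAY_HUGE;

	return vma->vm_ops->fault(vma, &vmf);
}
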
Without a pagetable to hold the pte_none() entry found in a newly
allocated pagetable, handle_pte_fault() would like to provide a static
none entry for later orig_pte checks.  But architectures have never had
to provide that definition before; and although almost all use zeroes
for an empty pagetable, a few do not - nios2, s390, um, xtensa.

Never mind, forget about pte_same(,orig_pte), the three __do_fault()
callers can follow do_anonymous_page()'s example, and just use a
pte_none() check instead - supplemented by a pte_file pte_to_pgoff
check until the day VM_NONLINEAR is removed.

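Written out as a helper, the check is just this (hypothetical helper,
not added by the patch - the hunks below open-code it at each site):

static inline bool sketch_pte_still_expected(pte_t pte, pgoff_t pgoff)
{
	if (pte_none(pte))
		return true;
	/* until VM_NONLINEAR goes: a file pte for this offset is fine too */
	return pte_file(pte) && pte_to_pgoff(pte) == pgoff;
}
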
do_fault_around() presents one last problem: it wants the pagetable
to have been allocated, but was being called by do_read_fault() before
__do_fault().  I see no disadvantage to moving it after __do_fault(),
allowing a huge pmd to be chosen first.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/filemap.c |   10 +-
 mm/memory.c  |  202 +++++++++++++++++++++++++++----------------------
 2 files changed, 118 insertions(+), 94 deletions(-)

--- thpfs.orig/mm/filemap.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/mm/filemap.c	2015-02-20 19:34:42.875920943 -0800
@@ -2000,6 +2000,10 @@ void filemap_map_pages(struct vm_area_st
 	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
 		if (iter.index > vmf->max_pgoff)
 			break;
+
+		pte = vmf->pte + iter.index - vmf->pgoff;
+		if (!pte_none(*pte))
+			goto next;
 repeat:
 		page = radix_tree_deref_slot(slot);
 		if (unlikely(!page))
@@ -2020,6 +2024,8 @@ repeat:
 			goto repeat;
 		}
 
+		VM_BUG_ON_PAGE(page->index != iter.index, page);
+
 		if (!PageUptodate(page) ||
 				PageReadahead(page) ||
 				PageHWPoison(page))
@@ -2034,10 +2040,6 @@ repeat:
 		if (page->index >= size >> PAGE_CACHE_SHIFT)
 			goto unlock;
 
-		pte = vmf->pte + page->index - vmf->pgoff;
-		if (!pte_none(*pte))
-			goto unlock;
-
 		if (file->f_ra.mmap_miss > 0)
 			file->f_ra.mmap_miss--;
 		addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
--- thpfs.orig/mm/memory.c	2015-02-20 19:34:21.599969589 -0800
+++ thpfs/mm/memory.c	2015-02-20 19:34:42.875920943 -0800
@@ -2617,24 +2617,33 @@ static inline int check_stack_guard_page
 
 /*
  * We enter with non-exclusive mmap_sem (to exclude vma changes,
- * but allow concurrent faults), and pte mapped but not yet locked.
- * We return with mmap_sem still held, but pte unmapped and unlocked.
+ * but allow concurrent faults).  We return with mmap_sem still held.
  */
 static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		unsigned int flags)
+		unsigned long address, pmd_t *pmd, unsigned int flags)
 {
 	struct mem_cgroup *memcg;
+	pte_t *page_table;
 	struct page *page;
 	spinlock_t *ptl;
 	pte_t entry;
 
-	pte_unmap(page_table);
-
 	/* Check if we need to add a guard page to the stack */
 	if (check_stack_guard_page(vma, address) < 0)
 		return VM_FAULT_SIGSEGV;
 
+	/*
+	 * Use __pte_alloc instead of pte_alloc_map, because we can't
+	 * run pte_offset_map on the pmd, if an huge pmd could
+	 * materialize from under us from a different thread.
+	 */
+	if (unlikely(pmd_none(*pmd)) &&
+	    unlikely(__pte_alloc(mm, vma, pmd, address)))
+		return VM_FAULT_OOM;
+	/* If an huge pmd materialized from under us just retry later */
+	if (unlikely(pmd_trans_huge(*pmd)))
+		return 0;
+
 	/* Use the zero-page for reads */
 	if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm)) {
 		entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
@@ -2697,7 +2706,7 @@ oom:
  * See filemap_fault() and __lock_page_retry().
  */
 static int __do_fault(struct vm_area_struct *vma, unsigned long address,
-		pgoff_t pgoff, unsigned int flags, struct page **page)
+	pmd_t *pmd, pgoff_t pgoff, unsigned int flags, struct page **page)
 {
 	struct vm_fault vmf;
 	int ret;
@@ -2711,20 +2720,41 @@ static int __do_fault(struct vm_area_str
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
-	if (unlikely(PageHWPoison(vmf.page))) {
-		if (ret & VM_FAULT_LOCKED)
-			unlock_page(vmf.page);
-		page_cache_release(vmf.page);
-		return VM_FAULT_HWPOISON;
-	}
-
 	if (unlikely(!(ret & VM_FAULT_LOCKED)))
 		lock_page(vmf.page);
 	else
 		VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);
 
+	if (unlikely(PageHWPoison(vmf.page))) {
+		ret = VM_FAULT_HWPOISON;
+		goto err;
+	}
+
+	/*
+	 * Use __pte_alloc instead of pte_alloc_map, because we can't
+	 * run pte_offset_map on the pmd, if an huge pmd could
+	 * materialize from under us from a different thread.
+	 */
+	if (unlikely(pmd_none(*pmd)) &&
+	    unlikely(__pte_alloc(vma->vm_mm, vma, pmd, address))) {
+		ret = VM_FAULT_OOM;
+		goto err;
+	}
+	/*
+	 * If an huge pmd materialized from under us just retry later.
+	 * Allow for racing transition of huge pmd to none to pagetable.
+	 */
+	if (unlikely(pmd_trans_huge(*pmd) || pmd_none(*pmd))) {
+		ret = VM_FAULT_NOPAGE;
+		goto err;
+	}
+
 	*page = vmf.page;
 	return ret;
+err:
+	unlock_page(vmf.page);
+	page_cache_release(vmf.page);
+	return ret;
 }
 
 /**
@@ -2875,33 +2905,20 @@ static void do_fault_around(struct vm_ar
 
 static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd,
-		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+		pgoff_t pgoff, unsigned int flags)
 {
 	struct page *fault_page;
 	spinlock_t *ptl;
 	pte_t *pte;
-	int ret = 0;
-
-	/*
-	 * Let's call ->map_pages() first and use ->fault() as fallback
-	 * if page by the offset is not ready to be mapped (cold cache or
-	 * something).
-	 */
-	if (vma->vm_ops->map_pages && !(flags & FAULT_FLAG_NONLINEAR) &&
-	    fault_around_bytes >> PAGE_SHIFT > 1) {
-		pte = pte_offset_map_lock(mm, pmd, address, &ptl);
-		do_fault_around(vma, address, pte, pgoff, flags);
-		if (!pte_same(*pte, orig_pte))
-			goto unlock_out;
-		pte_unmap_unlock(pte, ptl);
-	}
+	int ret;
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pmd, pgoff, flags, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (unlikely(!pte_same(*pte, orig_pte))) {
+	if (unlikely(!pte_none(*pte) &&
+	    !(pte_file(*pte) && pte_to_pgoff(*pte) == pgoff))) {
 		pte_unmap_unlock(pte, ptl);
 		unlock_page(fault_page);
 		page_cache_release(fault_page);
@@ -2909,14 +2926,21 @@ static int do_read_fault(struct mm_struc
 	}
 	do_set_pte(vma, address, fault_page, pte, false, false);
 	unlock_page(fault_page);
-unlock_out:
+
+	/*
+	 * Finally call ->map_pages() to fault around the pte we just set.
+	 */
+	if (vma->vm_ops->map_pages && !(flags & FAULT_FLAG_NONLINEAR) &&
+	    fault_around_bytes >> PAGE_SHIFT > 1)
+		do_fault_around(vma, address, pte, pgoff, flags);
+
 	pte_unmap_unlock(pte, ptl);
 	return ret;
 }
 
 static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd,
-		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+		pgoff_t pgoff, unsigned int flags)
 {
 	struct page *fault_page, *new_page;
 	struct mem_cgroup *memcg;
@@ -2936,7 +2960,7 @@ static int do_cow_fault(struct mm_struct
 		return VM_FAULT_OOM;
 	}
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pmd, pgoff, flags, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		goto uncharge_out;
 
@@ -2944,7 +2968,8 @@ static int do_cow_fault(struct mm_struct
 	__SetPageUptodate(new_page);
 
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (unlikely(!pte_same(*pte, orig_pte))) {
+	if (unlikely(!pte_none(*pte) &&
+	    !(pte_file(*pte) && pte_to_pgoff(*pte) == pgoff))) {
 		pte_unmap_unlock(pte, ptl);
 		unlock_page(fault_page);
 		page_cache_release(fault_page);
@@ -2965,7 +2990,7 @@ uncharge_out:
 
 static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd,
-		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+		pgoff_t pgoff, unsigned int flags)
 {
 	struct page *fault_page;
 	struct address_space *mapping;
@@ -2974,7 +2999,7 @@ static int do_shared_fault(struct mm_str
 	int dirtied = 0;
 	int ret, tmp;
 
-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pmd, pgoff, flags, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 
@@ -2993,7 +3018,8 @@ static int do_shared_fault(struct mm_str
 	}
 
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (unlikely(!pte_same(*pte, orig_pte))) {
+	if (unlikely(!pte_none(*pte) &&
+	    !(pte_file(*pte) && pte_to_pgoff(*pte) == pgoff))) {
 		pte_unmap_unlock(pte, ptl);
 		unlock_page(fault_page);
 		page_cache_release(fault_page);
@@ -3034,20 +3060,16 @@ static int do_shared_fault(struct mm_str
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
 static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		unsigned int flags, pte_t orig_pte)
+		unsigned long address, pmd_t *pmd, unsigned int flags)
 {
 	pgoff_t pgoff = (((address & PAGE_MASK)
 			- vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 
-	pte_unmap(page_table);
 	if (!(flags & FAULT_FLAG_WRITE))
-		return do_read_fault(mm, vma, address, pmd, pgoff, flags,
-				orig_pte);
+		return do_read_fault(mm, vma, address, pmd, pgoff, flags);
 	if (!(vma->vm_flags & VM_SHARED))
-		return do_cow_fault(mm, vma, address, pmd, pgoff, flags,
-				orig_pte);
-	return do_shared_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
+		return do_cow_fault(mm, vma, address, pmd, pgoff, flags);
+	return do_shared_fault(mm, vma, address, pmd, pgoff, flags);
 }
 
 /*
@@ -3082,12 +3104,10 @@ static int do_nonlinear_fault(struct mm_
 
 	pgoff = pte_to_pgoff(orig_pte);
 	if (!(flags & FAULT_FLAG_WRITE))
-		return do_read_fault(mm, vma, address, pmd, pgoff, flags,
-				orig_pte);
+		return do_read_fault(mm, vma, address, pmd, pgoff, flags);
 	if (!(vma->vm_flags & VM_SHARED))
-		return do_cow_fault(mm, vma, address, pmd, pgoff, flags,
-				orig_pte);
-	return do_shared_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
+		return do_cow_fault(mm, vma, address, pmd, pgoff, flags);
+	return do_shared_fault(mm, vma, address, pmd, pgoff, flags);
 }
 
 static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
@@ -3189,40 +3209,62 @@ out:
  * with external mmu caches can use to update those (ie the Sparc or
  * PowerPC hashed page tables that act as extended TLBs).
  *
- * We enter with non-exclusive mmap_sem (to exclude vma changes,
- * but allow concurrent faults), and pte mapped but not yet locked.
- * We return with pte unmapped and unlocked.
- *
+ * We enter with non-exclusive mmap_sem
+ * (to exclude vma changes, but allow concurrent faults).
  * The mmap_sem may have been released depending on flags and our
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
 static int handle_pte_fault(struct mm_struct *mm,
 		     struct vm_area_struct *vma, unsigned long address,
-		     pte_t *pte, pmd_t *pmd, unsigned int flags)
+		     pmd_t *pmd, unsigned int flags)
 {
+	pte_t *pte;
 	pte_t entry;
 	spinlock_t *ptl;
 
+	/* If an huge pmd materialized from under us just retry later */
+	if (unlikely(pmd_trans_huge(*pmd)))
+		return 0;
+
+	if (unlikely(pmd_none(*pmd))) {
+		/*
+		 * Leave __pte_alloc() until later: because huge tmpfs may
+		 * want to map_team_by_pmd(), and if we expose page table
+		 * for an instant, it will be difficult to retract from
+		 * concurrent faults and from rmap lookups.
+		 */
+		pte = NULL;
+	} else {
+		/*
+		 * A regular pmd is established and it can't morph into a huge
+		 * pmd from under us anymore at this point because we hold the
+		 * mmap_sem read mode and khugepaged takes it in write mode.
+		 * So now it's safe to run pte_offset_map().
+		 */
+		pte = pte_offset_map(pmd, address);
+		entry = *pte;
+		barrier();
+		if (pte_none(entry)) {
+			pte_unmap(pte);
+			pte = NULL;
+		}
+	}
+
 	/*
 	 * some architectures can have larger ptes than wordsize,
 	 * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y,
 	 * so READ_ONCE or ACCESS_ONCE cannot guarantee atomic accesses.
-	 * The code below just needs a consistent view for the ifs and
+	 * The code above just needs a consistent view for the ifs and
 	 * we later double check anyway with the ptl lock held. So here
 	 * a barrier will do.
 	 */
-	entry = *pte;
-	barrier();
+
+	if (!pte) {
+		if (vma->vm_ops && vma->vm_ops->fault)
+			return do_linear_fault(mm, vma, address, pmd, flags);
+		return do_anonymous_page(mm, vma, address, pmd, flags);
+	}
 	if (!pte_present(entry)) {
-		if (pte_none(entry)) {
-			if (vma->vm_ops) {
-				if (likely(vma->vm_ops->fault))
-					return do_linear_fault(mm, vma, address,
-						pte, pmd, flags, entry);
-			}
-			return do_anonymous_page(mm, vma, address,
-						 pte, pmd, flags);
-		}
 		if (pte_file(entry))
 			return do_nonlinear_fault(mm, vma, address,
 					pte, pmd, flags, entry);
@@ -3273,7 +3315,6 @@ static int __handle_mm_fault(struct mm_s
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte;
 
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		return hugetlb_fault(mm, vma, address, flags);
@@ -3325,26 +3366,7 @@ static int __handle_mm_fault(struct mm_s
 		}
 	}
 
-	/*
-	 * Use __pte_alloc instead of pte_alloc_map, because we can't
-	 * run pte_offset_map on the pmd, if an huge pmd could
-	 * materialize from under us from a different thread.
-	 */
-	if (unlikely(pmd_none(*pmd)) &&
-	    unlikely(__pte_alloc(mm, vma, pmd, address)))
-		return VM_FAULT_OOM;
-	/* if an huge pmd materialized from under us just retry later */
-	if (unlikely(pmd_trans_huge(*pmd)))
-		return 0;
-	/*
-	 * A regular pmd is established and it can't morph into a huge pmd
-	 * from under us anymore at this point because we hold the mmap_sem
-	 * read mode and khugepaged takes it in write mode. So now it's
-	 * safe to run pte_offset_map().
-	 */
-	pte = pte_offset_map(pmd, address);
-
-	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+	return handle_pte_fault(mm, vma, address, pmd, flags);
 }
 
 /*

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 17/24] huge tmpfs: map shmem by huge page pmd or by page team ptes
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  4:18   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:18 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

This is the commit which at last gets huge mappings of tmpfs working,
as can be seen from the ShmemPmdMapped line of /proc/meminfo.

The main thing here is the trio of functions map_team_by_pmd(),
unmap_team_by_pmd() and remap_team_by_ptes() added to huge_memory.c;
and of course the enablement of FAULT_FLAG_MAY_HUGE from memory.c
to shmem.c, with VM_FAULT_HUGE back from shmem.c to memory.c.  There
are also one-line and few-line changes scattered throughout huge_memory.c.

Huge tmpfs relies on the pmd_trans_huge() page table hooks which the
original Anonymous THP project placed throughout mm, but skips almost
all of its complications, going to its own simpler handling.

One odd little change: removal of the VM_NOHUGEPAGE check from
move_huge_pmd().  That's a helper for mremap() move: the new_vma
should be following the same rules as the old vma, so if there's a
trans_huge pmd in the old vma, then it can go in the new, alignment
permitting.  It was a very minor optimization for Anonymous THP; but
now we can reach the same code for huge tmpfs, which is nowhere else
respecting VM_NOHUGEPAGE (whether it should is a different question;
but for now it's simplest to ignore all the various THP switches).

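Roughly, on the memory.c side (a sketch under assumptions, not the
actual hunk: the real code also juggles the page lock and error bits,
and sketch_map_team_or_pte() is an invented name):

static int sketch_map_team_or_pte(struct vm_area_struct *vma,
				  unsigned long address, pmd_t *pmd,
				  struct page *fault_page, int ret)
{
	if (ret & VM_FAULT_HUGE) {
		/*
		 * shmem handed back a team head: map the team by one pmd;
		 * assume here that it reports back in VM_FAULT_* style.
		 */
		ret |= map_team_by_pmd(vma, address, pmd, fault_page);
		return ret;
	}
	/* otherwise the caller goes on to map this small page by pte */
	return ret;
}
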
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/pageteam.h |   41 ++++++
 mm/huge_memory.c         |  238 ++++++++++++++++++++++++++++++++++---
 mm/memory.c              |   11 +
 3 files changed, 273 insertions(+), 17 deletions(-)

--- thpfs.orig/include/linux/pageteam.h	2015-02-20 19:34:37.851932430 -0800
+++ thpfs/include/linux/pageteam.h	2015-02-20 19:34:48.083909034 -0800
@@ -29,10 +29,49 @@ static inline struct page *team_head(str
 	return head;
 }
 
-/* Temporary stub for mm/rmap.c until implemented in mm/huge_memory.c */
+/*
+ * Returns true if this team is mapped by pmd somewhere.
+ */
+static inline bool team_hugely_mapped(struct page *head)
+{
+	return atomic_long_read(&head->team_usage) > HPAGE_PMD_NR;
+}
+
+/*
+ * Returns true if this was the first mapping by pmd, whereupon mapped stats
+ * need to be updated.
+ */
+static inline bool inc_hugely_mapped(struct page *head)
+{
+	return atomic_long_inc_return(&head->team_usage) == HPAGE_PMD_NR+1;
+}
+
+/*
+ * Returns true if this was the last mapping by pmd, whereupon mapped stats
+ * need to be updated.
+ */
+static inline bool dec_hugely_mapped(struct page *head)
+{
+	return atomic_long_dec_return(&head->team_usage) == HPAGE_PMD_NR;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int map_team_by_pmd(struct vm_area_struct *vma,
+			unsigned long addr, pmd_t *pmd, struct page *page);
+void unmap_team_by_pmd(struct vm_area_struct *vma,
+			unsigned long addr, pmd_t *pmd, struct page *page);
+#else
+static inline int map_team_by_pmd(struct vm_area_struct *vma,
+			unsigned long addr, pmd_t *pmd, struct page *page)
+{
+	VM_BUG_ON_PAGE(1, page);
+	return 0;
+}
 static inline void unmap_team_by_pmd(struct vm_area_struct *vma,
 			unsigned long addr, pmd_t *pmd, struct page *page)
 {
+	VM_BUG_ON_PAGE(1, page);
 }
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_PAGETEAM_H */
--- thpfs.orig/mm/huge_memory.c	2015-02-20 19:34:32.367944969 -0800
+++ thpfs/mm/huge_memory.c	2015-02-20 19:34:48.083909034 -0800
@@ -21,6 +21,7 @@
 #include <linux/freezer.h>
 #include <linux/mman.h>
 #include <linux/pagemap.h>
+#include <linux/pageteam.h>
 #include <linux/migrate.h>
 #include <linux/hashtable.h>
 
@@ -28,6 +29,10 @@
 #include <asm/pgalloc.h>
 #include "internal.h"
 
+static void page_remove_team_rmap(struct page *);
+static void remap_team_by_ptes(struct vm_area_struct *vma, unsigned long addr,
+			       pmd_t *pmd, struct page *page);
+
 /*
  * By default transparent hugepage support is disabled in order that avoid
  * to risk increase the memory footprint of applications without a guaranteed
@@ -901,13 +906,19 @@ int copy_huge_pmd(struct mm_struct *dst_
 		goto out;
 	}
 	src_page = pmd_page(pmd);
-	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
 	get_page(src_page);
 	page_dup_rmap(src_page);
-	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
-
-	pmdp_set_wrprotect(src_mm, addr, src_pmd);
-	pmd = pmd_mkold(pmd_wrprotect(pmd));
+	if (PageAnon(src_page)) {
+		VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+		pmdp_set_wrprotect(src_mm, addr, src_pmd);
+		pmd = pmd_wrprotect(pmd);
+	} else {
+		VM_BUG_ON_PAGE(!PageTeam(src_page), src_page);
+		inc_hugely_mapped(src_page);
+	}
+	add_mm_counter(dst_mm, PageAnon(src_page) ?
+		MM_ANONPAGES : MM_FILEPAGES, HPAGE_PMD_NR);
+	pmd = pmd_mkold(pmd);
 	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 	atomic_long_inc(&dst_mm->nr_ptes);
@@ -1088,22 +1099,28 @@ int do_huge_pmd_wp_page(struct mm_struct
 {
 	spinlock_t *ptl;
 	int ret = 0;
-	struct page *page = NULL, *new_page;
+	struct page *page, *new_page;
 	struct mem_cgroup *memcg;
 	unsigned long haddr;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	ptl = pmd_lockptr(mm, pmd);
-	VM_BUG_ON_VMA(!vma->anon_vma, vma);
 	haddr = address & HPAGE_PMD_MASK;
-	if (is_huge_zero_pmd(orig_pmd))
+	page = pmd_page(orig_pmd);
+	if (is_huge_zero_page(page)) {
+		page = NULL;
 		goto alloc;
+	}
+	if (!PageAnon(page)) {
+		remap_team_by_ptes(vma, address, pmd, page);
+		/* Let's just take another fault to do the COW */
+		return 0;
+	}
 	spin_lock(ptl);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
 		goto out_unlock;
 
-	page = pmd_page(orig_pmd);
 	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
 	if (page_mapcount(page) == 1) {
 		pmd_t entry;
@@ -1117,6 +1134,7 @@ int do_huge_pmd_wp_page(struct mm_struct
 	get_user_huge_page(page);
 	spin_unlock(ptl);
 alloc:
+	VM_BUG_ON(!vma->anon_vma);
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow())
 		new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
@@ -1226,7 +1244,7 @@ struct page *follow_trans_huge_pmd(struc
 		goto out;
 
 	page = pmd_page(*pmd);
-	VM_BUG_ON_PAGE(!PageHead(page), page);
+	VM_BUG_ON_PAGE(!PageHead(page) && !PageTeam(page), page);
 	if (flags & FOLL_TOUCH) {
 		pmd_t _pmd;
 		/*
@@ -1251,7 +1269,7 @@ struct page *follow_trans_huge_pmd(struc
 		}
 	}
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
-	VM_BUG_ON_PAGE(!PageCompound(page), page);
+	VM_BUG_ON_PAGE(!PageCompound(page) && !PageTeam(page), page);
 	if (flags & FOLL_GET)
 		get_page_foll(page);
 
@@ -1409,10 +1427,12 @@ int zap_huge_pmd(struct mmu_gather *tlb,
 			put_huge_zero_page();
 		} else {
 			page = pmd_page(orig_pmd);
+			if (!PageAnon(page))
+				page_remove_team_rmap(page);
 			page_remove_rmap(page);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
-			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
-			VM_BUG_ON_PAGE(!PageHead(page), page);
+			add_mm_counter(tlb->mm, PageAnon(page) ?
+				MM_ANONPAGES : MM_FILEPAGES, -HPAGE_PMD_NR);
 			atomic_long_dec(&tlb->mm->nr_ptes);
 			spin_unlock(ptl);
 			tlb_remove_page(tlb, page);
@@ -1456,8 +1476,7 @@ int move_huge_pmd(struct vm_area_struct
 
 	if ((old_addr & ~HPAGE_PMD_MASK) ||
 	    (new_addr & ~HPAGE_PMD_MASK) ||
-	    old_end - old_addr < HPAGE_PMD_SIZE ||
-	    (new_vma->vm_flags & VM_NOHUGEPAGE))
+	    old_end - old_addr < HPAGE_PMD_SIZE)
 		goto out;
 
 	/*
@@ -1518,7 +1537,6 @@ int change_huge_pmd(struct vm_area_struc
 			entry = pmd_modify(entry, newprot);
 			ret = HPAGE_PMD_NR;
 			set_pmd_at(mm, addr, pmd, entry);
-			BUG_ON(pmd_write(entry));
 		} else {
 			struct page *page = pmd_page(*pmd);
 
@@ -2864,6 +2882,17 @@ void __split_huge_page_pmd(struct vm_are
 	unsigned long haddr = address & HPAGE_PMD_MASK;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
+	pmd_t pmdval;
+
+	pmdval = *pmd;
+	barrier();
+	if (!pmd_present(pmdval) || !pmd_trans_huge(pmdval))
+		return;
+	page = pmd_page(pmdval);
+	if (!PageAnon(page) && !is_huge_zero_page(page)) {
+		remap_team_by_ptes(vma, address, pmd, page);
+		return;
+	}
 
 	BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
 
@@ -2976,3 +3005,180 @@ void __vma_adjust_trans_huge(struct vm_a
 			split_huge_page_address(next->vm_mm, nstart);
 	}
 }
+
+/*
+ * huge pmd support for huge tmpfs
+ */
+
+static void page_add_team_rmap(struct page *page)
+{
+	VM_BUG_ON_PAGE(PageAnon(page), page);
+	VM_BUG_ON_PAGE(!PageTeam(page), page);
+	if (inc_hugely_mapped(page))
+		__inc_zone_page_state(page, NR_SHMEM_PMDMAPPED);
+}
+
+static void page_remove_team_rmap(struct page *page)
+{
+	VM_BUG_ON_PAGE(PageAnon(page), page);
+	VM_BUG_ON_PAGE(!PageTeam(page), page);
+	if (dec_hugely_mapped(page))
+		__dec_zone_page_state(page, NR_SHMEM_PMDMAPPED);
+}
+
+int map_team_by_pmd(struct vm_area_struct *vma, unsigned long addr,
+		    pmd_t *pmd, struct page *page)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgtable_t pgtable;
+	spinlock_t *pml;
+	pmd_t pmdval;
+	int ret = VM_FAULT_NOPAGE;
+
+	/*
+	 * Another task may have mapped it in just ahead of us; but we
+	 * have the huge page locked, so others will wait on us now... or,
+	 * is there perhaps some way another might still map in a single pte?
+	 */
+	VM_BUG_ON_PAGE(!PageTeam(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	if (!pmd_none(*pmd))
+		goto raced2;
+
+	addr &= HPAGE_PMD_MASK;
+	pgtable = pte_alloc_one(mm, addr);
+	if (!pgtable) {
+		ret = VM_FAULT_OOM;
+		goto raced2;
+	}
+
+	pml = pmd_lock(mm, pmd);
+	if (!pmd_none(*pmd))
+		goto raced1;
+	pmdval = mk_pmd(page, vma->vm_page_prot);
+	pmdval = pmd_mkhuge(pmd_mkdirty(pmdval));
+	set_pmd_at(mm, addr, pmd, pmdval);
+	page_add_file_rmap(page);
+	page_add_team_rmap(page);
+	update_mmu_cache_pmd(vma, addr, pmd);
+	pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	atomic_long_inc(&mm->nr_ptes);
+	spin_unlock(pml);
+
+	unlock_page(page);
+	add_mm_counter(mm, MM_FILEPAGES, HPAGE_PMD_NR);
+	return ret;
+raced1:
+	spin_unlock(pml);
+	pte_free(mm, pgtable);
+raced2:
+	unlock_page(page);
+	page_cache_release(page);
+	return ret;
+}
+
+void unmap_team_by_pmd(struct vm_area_struct *vma, unsigned long addr,
+		       pmd_t *pmd, struct page *page)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgtable_t pgtable = NULL;
+	unsigned long end;
+	spinlock_t *pml;
+
+	VM_BUG_ON_PAGE(!PageTeam(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	/*
+	 * But even so there might be a racing zap_huge_pmd() or
+	 * remap_team_by_ptes() while the page_table_lock is dropped.
+	 */
+
+	addr &= HPAGE_PMD_MASK;
+	end = addr + HPAGE_PMD_SIZE;
+
+	mmu_notifier_invalidate_range_start(mm, addr, end);
+	pml = pmd_lock(mm, pmd);
+	if (pmd_trans_huge(*pmd) && pmd_page(*pmd) == page) {
+		pmdp_clear_flush(vma, addr, pmd);
+		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+		page_remove_team_rmap(page);
+		page_remove_rmap(page);
+		atomic_long_dec(&mm->nr_ptes);
+	}
+	spin_unlock(pml);
+	mmu_notifier_invalidate_range_end(mm, addr, end);
+
+	if (!pgtable)
+		return;
+
+	pte_free(mm, pgtable);
+	update_hiwater_rss(mm);
+	add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR);
+	page_cache_release(page);
+}
+
+static void remap_team_by_ptes(struct vm_area_struct *vma, unsigned long addr,
+			       pmd_t *pmd, struct page *page)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *head = page;
+	pgtable_t pgtable;
+	unsigned long end;
+	spinlock_t *pml;
+	spinlock_t *ptl;
+	pte_t *pte;
+	pmd_t pmdval;
+	pte_t pteval;
+
+	addr &= HPAGE_PMD_MASK;
+	end = addr + HPAGE_PMD_SIZE;
+
+	mmu_notifier_invalidate_range_start(mm, addr, end);
+	pml = pmd_lock(mm, pmd);
+	if (!pmd_trans_huge(*pmd) || pmd_page(*pmd) != page)
+		goto raced;
+
+	pmdval = pmdp_clear_flush(vma, addr, pmd);
+	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	pmd_populate(mm, pmd, pgtable);
+	ptl = pte_lockptr(mm, pmd);
+	if (ptl != pml)
+		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+	page_remove_team_rmap(page);
+	update_mmu_cache_pmd(vma, addr, pmd);
+
+	/*
+	 * It would be nice to have prepared this page table in advance,
+	 * so we could just switch from pmd to ptes under one lock.
+	 * But a comment in zap_huge_pmd() warns that ppc64 needs
+	 * to look at the deposited page table when clearing the pmd.
+	 */
+	pte = pte_offset_map(pmd, addr);
+	do {
+		pteval = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
+		if (!pmd_young(pmdval))
+			pteval = pte_mkold(pteval);
+		set_pte_at(mm, addr, pte, pteval);
+		if (page != head) {
+			/*
+			 * We did not remove the head's rmap count above: that
+			 * seems better than letting it slip to 0 for a moment.
+			 */
+			page_add_file_rmap(page);
+			page_cache_get(page);
+		}
+		/*
+		 * Move page flags from head to page,
+		 * as __split_huge_page_refcount() does for anon?
+		 * Start off by assuming not, but reconsider later.
+		 */
+	} while (pte++, page++, addr += PAGE_SIZE, addr != end);
+
+	pte -= HPAGE_PMD_NR;
+	addr -= HPAGE_PMD_SIZE;
+	if (ptl != pml)
+		spin_unlock(ptl);
+	pte_unmap(pte);
+raced:
+	spin_unlock(pml);
+	mmu_notifier_invalidate_range_end(mm, addr, end);
+}
--- thpfs.orig/mm/memory.c	2015-02-20 19:34:42.875920943 -0800
+++ thpfs/mm/memory.c	2015-02-20 19:34:48.083909034 -0800
@@ -45,6 +45,7 @@
 #include <linux/swap.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
+#include <linux/pageteam.h>
 #include <linux/ksm.h>
 #include <linux/rmap.h>
 #include <linux/export.h>
@@ -2716,9 +2717,19 @@ static int __do_fault(struct vm_area_str
 	vmf.flags = flags;
 	vmf.page = NULL;
 
+	/*
+	 * Give huge pmd a chance before allocating pte or trying fault around.
+	 */
+	if (unlikely(pmd_none(*pmd)))
+		vmf.flags |= FAULT_FLAG_MAY_HUGE;
+
 	ret = vma->vm_ops->fault(vma, &vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
+	if (unlikely(ret & VM_FAULT_HUGE)) {
+		ret |= map_team_by_pmd(vma, address, pmd, vmf.page);
+		return ret;
+	}
 
 	if (unlikely(!(ret & VM_FAULT_LOCKED)))
 		lock_page(vmf.page);
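
For reference, the team_usage convention which the pageteam.h helpers
above rely on, worked through for x86_64's HPAGE_PMD_NR == 512 (a
summary of what the patch does, not extra code): while the team is
being built up, team_usage counts the pages instantiated, 1..512; once
the team is complete, each huge pmd mapping adds one more, so

	team_usage == 512	complete, but not mapped by pmd anywhere
	team_usage == 513	first pmd mapping: inc_hugely_mapped() returns
				true and ShmemPmdMapped is accounted
	team_usage == 514...	further pmd mappings
	back to 512		last pmd unmapping: dec_hugely_mapped() returns
				true and ShmemPmdMapped is unaccounted

which is why team_hugely_mapped() is simply team_usage > HPAGE_PMD_NR.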

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 18/24] huge tmpfs: mmap_sem is unlocked when truncation splits huge pmd
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  4:20   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:20 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, David Rientjes,
	linux-kernel, linux-mm

zap_pmd_range()'s CONFIG_DEBUG_VM !rwsem_is_locked(&mmap_sem) BUG()
is invalid with huge tmpfs, where truncation of a hugely-mapped file
to an unhugely-aligned size easily hits it.

(Although anon THP could in principle apply khugepaged to private file
mappings, which are not excluded by the MADV_HUGEPAGE restrictions, in
practice there's a vm_ops check which excludes them, so it never hits
this BUG() - there's no interface to "truncate" an anonymous mapping.)

We could complicate the test, to check i_mmap_rwsem also when there's
a vm_file; but I'm inclined to make zap_pmd_range() more readable by
simply deleting this check.  A search has shown no report of the issue
in the 2.5 years since e0897d75f0b2 ("mm, thp: print useful information
when mmap_sem is unlocked in zap_pmd_range") expanded it from VM_BUG_ON()
- though I cannot point to which commit might then have fixed the issue.
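
For concreteness, the kind of sequence that hits the check being
deleted - a sketch only, with error checking omitted, and assuming a
huge tmpfs is already mounted at /mnt/huge (the mount option comes
earlier in the series):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/mnt/huge/f", O_RDWR | O_CREAT, 0600);
		char *p;

		ftruncate(fd, 4 << 20);			/* room for two huge pmds */
		p = mmap(NULL, 4 << 20, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
		p[0] = 1;
		p[2 << 20] = 1;				/* fault in both hugely */
		/*
		 * Truncation to an unhugely-aligned size: the truncation path
		 * reaches zap_pmd_range() via unmap_mapping_range(), without
		 * mmap_sem held, and must split the second pmd rather than
		 * zap it whole.
		 */
		ftruncate(fd, (2 << 20) + 4096);
		munmap(p, 4 << 20);
		close(fd);
		return 0;
	}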

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/memory.c |   13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

--- thpfs.orig/mm/memory.c	2015-02-20 19:34:48.083909034 -0800
+++ thpfs/mm/memory.c	2015-02-20 19:34:53.467896724 -0800
@@ -1219,18 +1219,9 @@ static inline unsigned long zap_pmd_rang
 	do {
 		next = pmd_addr_end(addr, end);
 		if (pmd_trans_huge(*pmd)) {
-			if (next - addr != HPAGE_PMD_SIZE) {
-#ifdef CONFIG_DEBUG_VM
-				if (!rwsem_is_locked(&tlb->mm->mmap_sem)) {
-					pr_err("%s: mmap_sem is unlocked! addr=0x%lx end=0x%lx vma->vm_start=0x%lx vma->vm_end=0x%lx\n",
-						__func__, addr, end,
-						vma->vm_start,
-						vma->vm_end);
-					BUG();
-				}
-#endif
+			if (next - addr != HPAGE_PMD_SIZE)
 				split_huge_page_pmd(vma, addr, pmd);
-			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
+			else if (zap_huge_pmd(tlb, vma, pmd, addr))
 				goto next;
 			/* fall through */
 		}

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 19/24] huge tmpfs: disband split huge pmds on race or memory failure
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  4:22   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

Andres L-C has pointed out that the single-page unmap_mapping_range()
fallback in truncate_inode_page() cannot protect against the case when
a huge page was faulted in after the full-range unmap_mapping_range():
because page_mapped(page) checks the tail page's mapcount, not the head's.

So, there's a danger that hole-punching (and maybe even truncation)
can free pages while they are mapped into userspace with a huge pmd.
And I don't believe that the CVE-2014-4171 protection in shmem_fault()
can fully protect from this, although it does make it much harder.

Fix that by adding a duplicate single-page unmap_mapping_range()
into shmem_disband_hugeteam() (called when punching or truncating
a PageTeam), at the point when we also hold the head's page lock
(without which there would still be races): which will then split
all huge pmd mappings covering the page into team pte mappings.

This is also just what's needed to handle memory_failure() correctly:
provide custom shmem_error_remove_page(), call shmem_disband_hugeteam()
from that before proceeding to generic_error_remove_page(), then this
additional unmap_mapping_range() will remap team by ptes as needed.

(There is an unlikely case that we're racing with another disbander,
or disband didn't get trylock on head page at first: memory_failure()
has almost finished with the page, so it's safe to unlock and relock
before retrying.)

But there is one further change needed in hwpoison_user_mappings():
it must recognize a hugely mapped team before concluding that the
page is not mapped.  (And still no support for soft_offline(),
which will have to wait for page migration of teams.)
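
The underlying rule, as a sketch (team_tail_hugely_mapped() is only an
illustrative name, not something this patch adds; the memory-failure.c
hunk below open-codes the same test): a tail of a hugely mapped team
has page_mapcount() zero, because map_team_by_pmd() bumped only the
head's rmap count, so any "is this page mapped?" test on a team page
must also consult the head:

	static inline bool team_tail_hugely_mapped(struct page *page)
	{
		return PageTeam(page) && !PageAnon(page) &&
		       team_hugely_mapped(team_head(page));
	}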

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/memory-failure.c |    8 +++++++-
 mm/shmem.c          |   27 ++++++++++++++++++++++++++-
 2 files changed, 33 insertions(+), 2 deletions(-)

--- thpfs.orig/mm/memory-failure.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/mm/memory-failure.c	2015-02-20 19:34:59.047883965 -0800
@@ -44,6 +44,7 @@
 #include <linux/rmap.h>
 #include <linux/export.h>
 #include <linux/pagemap.h>
+#include <linux/pageteam.h>
 #include <linux/swap.h>
 #include <linux/backing-dev.h>
 #include <linux/migrate.h>
@@ -889,6 +890,7 @@ static int hwpoison_user_mappings(struct
 	int kill = 1, forcekill;
 	struct page *hpage = *hpagep;
 	struct page *ppage;
+	bool mapped;
 
 	/*
 	 * Here we are interested only in user-mapped pages, so skip any
@@ -903,7 +905,11 @@ static int hwpoison_user_mappings(struct
 	 * This check implies we don't kill processes if their pages
 	 * are in the swap cache early. Those are always late kills.
 	 */
-	if (!page_mapped(hpage))
+	mapped = page_mapped(hpage);
+	if (PageTeam(p) && !PageAnon(p) &&
+	    team_hugely_mapped(team_head(p)))
+		mapped = true;
+	if (!mapped)
 		return SWAP_SUCCESS;
 
 	if (PageKsm(p)) {
--- thpfs.orig/mm/shmem.c	2015-02-20 19:34:21.603969581 -0800
+++ thpfs/mm/shmem.c	2015-02-20 19:34:59.051883956 -0800
@@ -603,6 +603,17 @@ static void shmem_disband_hugeteam(struc
 			page_cache_release(head);
 			return;
 		}
+		/*
+		 * truncate_inode_page() will unmap page if page_mapped(page),
+		 * but there's a race by which the team could be hugely mapped,
+		 * with page_mapped(page) saying false.  So check here if the
+		 * head is hugely mapped, and if so unmap page to remap team.
+		 */
+		if (team_hugely_mapped(head)) {
+			unmap_mapping_range(page->mapping,
+				(loff_t)page->index << PAGE_CACHE_SHIFT,
+				PAGE_CACHE_SIZE, 0);
+		}
 	}
 
 	/*
@@ -1216,6 +1227,20 @@ void shmem_truncate_range(struct inode *
 }
 EXPORT_SYMBOL_GPL(shmem_truncate_range);
 
+int shmem_error_remove_page(struct address_space *mapping, struct page *page)
+{
+	if (PageTeam(page)) {
+		shmem_disband_hugeteam(page);
+		while (unlikely(PageTeam(page))) {
+			unlock_page(page);
+			cond_resched();
+			lock_page(page);
+			shmem_disband_hugeteam(page);
+		}
+	}
+	return generic_error_remove_page(mapping, page);
+}
+
 static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = dentry->d_inode;
@@ -4031,7 +4056,7 @@ static const struct address_space_operat
 #ifdef CONFIG_MIGRATION
 	.migratepage	= migrate_page,
 #endif
-	.error_remove_page = generic_error_remove_page,
+	.error_remove_page = shmem_error_remove_page,
 };
 
 static const struct file_operations shmem_file_operations = {

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 20/24] huge tmpfs: use Unevictable lru with variable hpage_nr_pages()
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  4:23   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

A big advantage of huge tmpfs over hugetlbfs is that its pages can
be swapped out; but too often it OOMs before swapping them out.

At first I tried changing page_evictable(), to treat all tail pages
of a hugely mapped team as unevictable: the anon LRUs were otherwise
swamped by pages that could not be freed before the head.

That worked quite well, some of the time, but has some drawbacks.

Most obviously, /proc/meminfo is liable to show 511/512ths of all
the ShmemPmdMapped as Unevictable; which is rather sad for a feature
intended to improve on hugetlbfs by letting the pages be swappable.

But more seriously, although it is helpful to have those tails out
of the way on the Unevictable list, page reclaim can very easily come
to a point where all the team heads to be freed are on the Active list,
but the Inactive is large enough that !inactive_anon_is_low(), so the
Active is never scanned to unmap those heads to release all the tails.
Eventually we OOM.

Perhaps that could be dealt with by hacking inactive_anon_is_low():
but it wouldn't help the Unevictable numbers, and has never been
necessary for anon THP.  How does anon THP avoid this?  It doesn't
put tails on the LRU at all, so doesn't then need to shift them to
Unevictable; but there would still be the danger of an Active list
full of heads, holding the unseen tails, but the ratio too high for
for Active scanning - except that hpage_nr_pages() weights each THP
head by the number of small pages the huge page holds, instead of the
usual 1, and that is what keeps the Active/Inactive balance working.

So in this patch we try to do the same for huge tmpfs pages.  However,
a team is not one huge compound page, but a collection of independent
pages, and the fair and lazy way to accomplish this seems to be to
transfer each tail's weight to head at the time when shmem_writepage()
has been asked to evict the page, but refuses because the head has not
yet been evicted.  So although the failed-to-be-evicted tails are moved
to the Unevictable LRU, each counts for 0kB in the Unevictable amount,
its 4kB going to the head in the Active(anon) or Inactive(anon) amount.
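
Spelling out the repacked team_usage for the usual x86_64 case
(HPAGE_PMD_ORDER == 9, so HPAGE_PMD_NR == 512), as defined in the
pageteam.h hunk below:

	TEAM_LRU_WEIGHT_MASK	(1 << 10) - 1	bits 0-9: lru weight 0..512
	TEAM_PAGE_COUNTER	1 << 10		added per page instantiated
	TEAM_COMPLETE		512 << 10	all 512 pages instantiated
	TEAM_MAPPING_COUNTER	1 << 10		added per huge pmd mapping
	TEAM_HUGELY_MAPPED	513 << 10	complete plus one pmd mapping

So, for example, a complete team hugely mapped by two mms, whose head
has so far gathered the weight of ten such refused tails, has

	team_usage == (512 + 2) * 1024 + (1 + 10) == 526347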

Apart from mlock.c (next patch), hpage_nr_pages() is now only called
on a maybe-PageTeam page while under lruvec lock, and we do need to
hold lruvec lock when transferring weight from one page to another.
That is a new overhead, which shmem_disband_hugehead() prefers to
avoid, if the head's weight is just the default 1.  And it's not
clear how well this will all play out if different pages of a team
are charged to different memcgs: but the code allows for that, and
it should be fine while that's just an exceptional minority case.

A change I like in principle, but have not made, and do not intend
to make unless we see a workload that demands it: it would be natural
for mark_page_accessed() to retrieve such a 0-weight page from the
Unevictable LRU, assigning it weight again and giving it a new life
on the Active and Inactive LRUs.  As it is, I'm hoping PageReferenced
gives a good enough hint as to whether a page should be retained, when
shmem_evictify_hugetails() brings it back from Unevictable to Inactive.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/huge_mm.h  |   13 +++
 include/linux/pageteam.h |   48 ++++++++++-
 mm/memcontrol.c          |   10 ++
 mm/shmem.c               |  158 ++++++++++++++++++++++++++++++-------
 mm/swap.c                |    5 +
 mm/vmscan.c              |   42 +++++++++
 6 files changed, 243 insertions(+), 33 deletions(-)

--- thpfs.orig/include/linux/huge_mm.h	2015-02-20 19:34:32.363944978 -0800
+++ thpfs/include/linux/huge_mm.h	2015-02-20 19:35:04.303871947 -0800
@@ -150,10 +150,23 @@ static inline void vma_adjust_trans_huge
 #endif
 	__vma_adjust_trans_huge(vma, start, end, adjust_next);
 }
+
+/* Repeat definition from linux/pageteam.h to force error if different */
+#define TEAM_LRU_WEIGHT_MASK	((1L << (HPAGE_PMD_ORDER + 1)) - 1)
+
 static inline int hpage_nr_pages(struct page *page)
 {
 	if (unlikely(PageTransHuge(page)))
 		return HPAGE_PMD_NR;
+	/*
+	 * PG_team == PG_compound_lock, but PageTransHuge == PageHead.
+	 * The question of races here is interesting, but not for now:
+	 * this can only be relied on while holding the lruvec lock,
+	 * or knowing that the page is anonymous, not from huge tmpfs.
+	 */
+	if (PageTeam(page))
+		return atomic_long_read(&page->team_usage) &
+					TEAM_LRU_WEIGHT_MASK;
 	return 1;
 }
 
--- thpfs.orig/include/linux/pageteam.h	2015-02-20 19:34:48.083909034 -0800
+++ thpfs/include/linux/pageteam.h	2015-02-20 19:35:04.303871947 -0800
@@ -30,11 +30,32 @@ static inline struct page *team_head(str
 }
 
 /*
+ * Mask for lower bits of team_usage, giving the weight 0..HPAGE_PMD_NR of the
+ * page on its LRU: normal pages have weight 1, tails held unevictable until
+ * head is evicted have weight 0, and the head gathers weight 1..HPAGE_PMD_NR.
+ */
+#define TEAM_LRU_WEIGHT_ONE	1L
+#define TEAM_LRU_WEIGHT_MASK	((1L << (HPAGE_PMD_ORDER + 1)) - 1)
+
+#define TEAM_HIGH_COUNTER	(1L << (HPAGE_PMD_ORDER + 1))
+/*
+ * Count how many pages of team are instantiated, as it is built up.
+ */
+#define TEAM_PAGE_COUNTER	TEAM_HIGH_COUNTER
+#define TEAM_COMPLETE		(TEAM_PAGE_COUNTER << HPAGE_PMD_ORDER)
+/*
+ * And when complete, count how many huge mappings (like page_mapcount): an
+ * incomplete team cannot be hugely mapped (would expose uninitialized holes).
+ */
+#define TEAM_MAPPING_COUNTER	TEAM_HIGH_COUNTER
+#define TEAM_HUGELY_MAPPED	(TEAM_COMPLETE + TEAM_MAPPING_COUNTER)
+
+/*
  * Returns true if this team is mapped by pmd somewhere.
  */
 static inline bool team_hugely_mapped(struct page *head)
 {
-	return atomic_long_read(&head->team_usage) > HPAGE_PMD_NR;
+	return atomic_long_read(&head->team_usage) >= TEAM_HUGELY_MAPPED;
 }
 
 /*
@@ -43,7 +64,8 @@ static inline bool team_hugely_mapped(st
  */
 static inline bool inc_hugely_mapped(struct page *head)
 {
-	return atomic_long_inc_return(&head->team_usage) == HPAGE_PMD_NR+1;
+	return atomic_long_add_return(TEAM_MAPPING_COUNTER, &head->team_usage)
+		< TEAM_HUGELY_MAPPED + TEAM_MAPPING_COUNTER;
 }
 
 /*
@@ -52,7 +74,27 @@ static inline bool inc_hugely_mapped(str
  */
 static inline bool dec_hugely_mapped(struct page *head)
 {
-	return atomic_long_dec_return(&head->team_usage) == HPAGE_PMD_NR;
+	return atomic_long_sub_return(TEAM_MAPPING_COUNTER, &head->team_usage)
+		< TEAM_HUGELY_MAPPED;
+}
+
+static inline void inc_lru_weight(struct page *head)
+{
+	atomic_long_inc(&head->team_usage);
+	VM_BUG_ON_PAGE((atomic_long_read(&head->team_usage) &
+			TEAM_LRU_WEIGHT_MASK) > HPAGE_PMD_NR, head);
+}
+
+static inline void set_lru_weight(struct page *page)
+{
+	VM_BUG_ON_PAGE(atomic_long_read(&page->team_usage) != 0, page);
+	atomic_long_set(&page->team_usage, 1);
+}
+
+static inline void clear_lru_weight(struct page *page)
+{
+	VM_BUG_ON_PAGE(atomic_long_read(&page->team_usage) != 1, page);
+	atomic_long_set(&page->team_usage, 0);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
--- thpfs.orig/mm/memcontrol.c	2015-02-20 19:34:11.231993296 -0800
+++ thpfs/mm/memcontrol.c	2015-02-20 19:35:04.303871947 -0800
@@ -1319,6 +1319,16 @@ void mem_cgroup_update_lru_size(struct l
 		*lru_size += nr_pages;
 
 	size = *lru_size;
+	if (!size && !empty && lru == LRU_UNEVICTABLE) {
+		struct page *page;
+		/*
+		 * The unevictable list might be full of team tail pages of 0
+		 * weight: check the first, and skip the warning if that fits.
+		 */
+		page = list_first_entry(lruvec->lists + lru, struct page, lru);
+		if (hpage_nr_pages(page) == 0)
+			empty = true;
+	}
 	if (WARN(size < 0 || empty != !size,
 	"mem_cgroup_update_lru_size(%p, %d, %d): lru_size %ld but %sempty\n",
 			lruvec, lru, nr_pages, size, empty ? "" : "not ")) {
--- thpfs.orig/mm/shmem.c	2015-02-20 19:34:59.051883956 -0800
+++ thpfs/mm/shmem.c	2015-02-20 19:35:04.307871938 -0800
@@ -63,6 +63,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/swapops.h>
 #include <linux/pageteam.h>
 #include <linux/mempolicy.h>
+#include <linux/mm_inline.h>
 #include <linux/namei.h>
 #include <linux/ctype.h>
 #include <linux/migrate.h>
@@ -373,11 +374,10 @@ restart:
 
 static int shmem_freeholes(struct page *head)
 {
-	/*
-	 * Note: team_usage will also be used to count huge mappings,
-	 * so treat a negative value from shmem_freeholes() as none.
-	 */
-	return HPAGE_PMD_NR - atomic_long_read(&head->team_usage);
+	long nr = atomic_long_read(&head->team_usage);
+
+	return (nr >= TEAM_COMPLETE) ? 0 :
+		HPAGE_PMD_NR - (nr / TEAM_PAGE_COUNTER);
 }
 
 static void shmem_clear_tag_hugehole(struct address_space *mapping,
@@ -404,18 +404,16 @@ static void shmem_added_to_hugeteam(stru
 {
 	struct address_space *mapping = page->mapping;
 	struct page *head = team_head(page);
-	int nr;
 
 	if (hugehint == SHMEM_ALLOC_HUGE_PAGE) {
-		atomic_long_set(&head->team_usage, 1);
+		atomic_long_set(&head->team_usage,
+				TEAM_PAGE_COUNTER + TEAM_LRU_WEIGHT_ONE);
 		radix_tree_tag_set(&mapping->page_tree, page->index,
 					SHMEM_TAG_HUGEHOLE);
 		__mod_zone_page_state(zone, NR_SHMEM_FREEHOLES, HPAGE_PMD_NR-1);
 	} else {
-		/* We do not need atomic ops until huge page gets mapped */
-		nr = atomic_long_read(&head->team_usage) + 1;
-		atomic_long_set(&head->team_usage, nr);
-		if (nr == HPAGE_PMD_NR) {
+		if (atomic_long_add_return(TEAM_PAGE_COUNTER,
+				&head->team_usage) >= TEAM_COMPLETE) {
 			shmem_clear_tag_hugehole(mapping, head->index);
 			__inc_zone_state(zone, NR_SHMEM_HUGEPAGES);
 		}
@@ -459,36 +457,61 @@ static int shmem_populate_hugeteam(struc
 	return 0;
 }
 
-static int shmem_disband_hugehead(struct page *head)
+static int shmem_disband_hugehead(struct page *head, int *head_lru_weight)
 {
 	struct address_space *mapping;
+	bool lru_locked = false;
+	unsigned long flags;
 	struct zone *zone;
-	int nr = -1;
+	long team_usage;
+	long nr = -1;
 
 	/*
 	 * Only in the shrinker migration case might head have been truncated.
 	 * But although head->mapping may then be zeroed at any moment, mapping
 	 * stays safe because shmem_evict_inode must take our shrinklist lock.
 	 */
+	*head_lru_weight = 0;
 	mapping = ACCESS_ONCE(head->mapping);
 	if (!mapping)
 		return nr;
 
 	zone = page_zone(head);
-	spin_lock_irq(&mapping->tree_lock);
+	team_usage = atomic_long_read(&head->team_usage);
+again1:
+	if ((team_usage & TEAM_LRU_WEIGHT_MASK) != TEAM_LRU_WEIGHT_ONE) {
+		spin_lock_irq(&zone->lru_lock);
+		lru_locked = true;
+	}
+	spin_lock_irqsave(&mapping->tree_lock, flags);
 
 	if (PageTeam(head)) {
-		nr = atomic_long_read(&head->team_usage);
-		atomic_long_set(&head->team_usage, 0);
+again2:
+		nr = atomic_long_cmpxchg(&head->team_usage, team_usage,
+					 TEAM_LRU_WEIGHT_ONE);
+		if (unlikely(nr != team_usage)) {
+			team_usage = nr;
+			if (lru_locked ||
+			    (team_usage & TEAM_LRU_WEIGHT_MASK) ==
+						    TEAM_LRU_WEIGHT_ONE)
+				goto again2;
+			spin_unlock_irqrestore(&mapping->tree_lock, flags);
+			goto again1;
+		}
+		*head_lru_weight = nr & TEAM_LRU_WEIGHT_MASK;
+		nr /= TEAM_PAGE_COUNTER;
+
 		/*
-		 * Disable additions to the team.
-		 * Ensure head->private is written before PageTeam is
-		 * cleared, so shmem_writepage() cannot write swap into
-		 * head->private, then have it overwritten by that 0!
+		 * Disable additions to the team.  The cmpxchg above
+		 * ensures head->team_usage is read before PageTeam is cleared,
+		 * when shmem_writepage() might write swap into head->private.
 		 */
-		smp_mb__before_atomic();
 		ClearPageTeam(head);
 
+		if (PageLRU(head) && *head_lru_weight > 1)
+			update_lru_size(mem_cgroup_page_lruvec(head, zone),
+					page_lru(head), 1 - *head_lru_weight);
+
 		if (nr >= HPAGE_PMD_NR) {
 			__dec_zone_state(zone, NR_SHMEM_HUGEPAGES);
 			VM_BUG_ON(nr != HPAGE_PMD_NR);
@@ -499,10 +522,72 @@ static int shmem_disband_hugehead(struct
 		} /* else shmem_getpage_gfp disbanding a failed alloced_huge */
 	}
 
-	spin_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irqrestore(&mapping->tree_lock, flags);
+	if (lru_locked)
+		spin_unlock_irq(&zone->lru_lock);
 	return nr;
 }
 
+static void shmem_evictify_hugetails(struct page *head, int head_lru_weight)
+{
+	struct page *page;
+	struct lruvec *lruvec = NULL;
+	struct zone *zone = page_zone(head);
+	bool lru_locked = false;
+
+	/*
+	 * The head has been sheltering the rest of its team from reclaim:
+	 * if any were moved to the unevictable list, now make them evictable.
+	 */
+again:
+	for (page = head + HPAGE_PMD_NR - 1; page > head; page--) {
+		if (!PageTeam(page))
+			continue;
+		if (atomic_long_read(&page->team_usage) == TEAM_LRU_WEIGHT_ONE)
+			continue;
+
+		/*
+		 * Delay getting lru lock until we reach a page that needs it.
+		 */
+		if (!lru_locked) {
+			spin_lock_irq(&zone->lru_lock);
+			lru_locked = true;
+		}
+		lruvec = mem_cgroup_page_lruvec(page, zone);
+
+		VM_BUG_ON_PAGE(atomic_long_read(&page->team_usage), page);
+		VM_BUG_ON_PAGE(!PageLRU(page), page);
+		VM_BUG_ON_PAGE(!PageUnevictable(page), page);
+		VM_BUG_ON_PAGE(PageActive(page), page);
+
+		set_lru_weight(page);
+		head_lru_weight--;
+
+		if (!page_evictable(page)) {
+			update_lru_size(lruvec, LRU_UNEVICTABLE, 1);
+			continue;
+		}
+
+		ClearPageUnevictable(page);
+		update_lru_size(lruvec, LRU_INACTIVE_ANON, 1);
+
+		list_del(&page->lru);
+		list_add_tail(&page->lru, lruvec->lists + LRU_INACTIVE_ANON);
+	}
+
+	if (lru_locked) {
+		spin_unlock_irq(&zone->lru_lock);
+		lru_locked = false;
+	}
+
+	/*
+	 * But how can we be sure that a racing putback_inactive_pages()
+	 * did its clear_lru_weight() before we checked team_usage above?
+	 */
+	if (unlikely(head_lru_weight != TEAM_LRU_WEIGHT_ONE))
+		goto again;
+}
+
 static void shmem_disband_hugetails(struct page *head,
 				    struct list_head *list, int nr)
 {
@@ -579,6 +664,7 @@ static void shmem_disband_hugetails(stru
 static void shmem_disband_hugeteam(struct page *page)
 {
 	struct page *head = team_head(page);
+	int head_lru_weight;
 	int nr_used;
 
 	/*
@@ -622,9 +708,11 @@ static void shmem_disband_hugeteam(struc
 	 * can (splitting disband in two stages), but better not be preempted.
 	 */
 	preempt_disable();
-	nr_used = shmem_disband_hugehead(head);
+	nr_used = shmem_disband_hugehead(head, &head_lru_weight);
 	if (head != page)
 		unlock_page(head);
+	if (head_lru_weight > TEAM_LRU_WEIGHT_ONE)
+		shmem_evictify_hugetails(head, head_lru_weight);
 	if (nr_used >= 0)
 		shmem_disband_hugetails(head, NULL, 0);
 	if (head != page)
@@ -680,6 +768,7 @@ static unsigned long shmem_choose_hugeho
 	struct page *topage = NULL;
 	struct page *page;
 	pgoff_t index;
+	int head_lru_weight;
 	int fromused;
 	int toused;
 	int nid;
@@ -721,8 +810,10 @@ static unsigned long shmem_choose_hugeho
 	if (!frompage)
 		goto unlock;
 	preempt_disable();
-	fromused = shmem_disband_hugehead(frompage);
+	fromused = shmem_disband_hugehead(frompage, &head_lru_weight);
 	spin_unlock(&shmem_shrinklist_lock);
+	if (head_lru_weight > TEAM_LRU_WEIGHT_ONE)
+		shmem_evictify_hugetails(frompage, head_lru_weight);
 	if (fromused > 0)
 		shmem_disband_hugetails(frompage, fromlist, -fromused);
 	preempt_enable();
@@ -776,8 +867,10 @@ static unsigned long shmem_choose_hugeho
 	if (!topage)
 		goto unlock;
 	preempt_disable();
-	toused = shmem_disband_hugehead(topage);
+	toused = shmem_disband_hugehead(topage, &head_lru_weight);
 	spin_unlock(&shmem_shrinklist_lock);
+	if (head_lru_weight > TEAM_LRU_WEIGHT_ONE)
+		shmem_evictify_hugetails(topage, head_lru_weight);
 	if (toused > 0) {
 		if (HPAGE_PMD_NR - toused >= fromused)
 			shmem_disband_hugetails(topage, tolist, fromused);
@@ -927,7 +1020,11 @@ shmem_add_to_page_cache(struct page *pag
 		}
 		if (!PageSwapBacked(page)) {	/* huge needs special care */
 			SetPageSwapBacked(page);
-			SetPageTeam(page);
+			if (!PageTeam(page)) {
+				atomic_long_set(&page->team_usage,
+						TEAM_LRU_WEIGHT_ONE);
+				SetPageTeam(page);
+			}
 		}
 	}
 
@@ -1514,9 +1611,13 @@ static int shmem_writepage(struct page *
 		struct page *head = team_head(page);
 		/*
 		 * Only proceed if this is head, or if head is unpopulated.
+		 * Redirty any others, without setting PageActive, and then
+		 * putback_inactive_pages() will shift them to unevictable.
 		 */
-		if (page != head && PageSwapBacked(head))
+		if (page != head && PageSwapBacked(head)) {
+			wbc->for_reclaim = 0;
 			goto redirty;
+		}
 	}
 
 	swap = get_swap_page();
@@ -1660,7 +1761,8 @@ static struct page *shmem_alloc_page(gfp
 				split_page(head, HPAGE_PMD_ORDER);
 
 				/* Prepare head page for add_to_page_cache */
-				atomic_long_set(&head->team_usage, 0);
+				atomic_long_set(&head->team_usage,
+						TEAM_LRU_WEIGHT_ONE);
 				__SetPageTeam(head);
 				head->mapping = mapping;
 				head->index = round_down(index, HPAGE_PMD_NR);
--- thpfs.orig/mm/swap.c	2014-12-07 14:21:05.000000000 -0800
+++ thpfs/mm/swap.c	2015-02-20 19:35:04.307871938 -0800
@@ -702,6 +702,11 @@ void lru_cache_add_active_or_unevictable
 					 struct vm_area_struct *vma)
 {
 	VM_BUG_ON_PAGE(PageLRU(page), page);
+	/*
+	 * Using hpage_nr_pages() on a huge tmpfs team page might not give the
+	 * 1 NR_MLOCK needs below; but this seems to be for anon pages only.
+	 */
+	VM_BUG_ON_PAGE(!PageAnon(page), page);
 
 	if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
 		SetPageActive(page);
--- thpfs.orig/mm/vmscan.c	2015-02-20 19:34:11.235993287 -0800
+++ thpfs/mm/vmscan.c	2015-02-20 19:35:04.307871938 -0800
@@ -19,6 +19,7 @@
 #include <linux/kernel_stat.h>
 #include <linux/swap.h>
 #include <linux/pagemap.h>
+#include <linux/pageteam.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
 #include <linux/vmpressure.h>
@@ -1419,6 +1420,42 @@ putback_inactive_pages(struct lruvec *lr
 			continue;
 		}
 
+		if (PageTeam(page) && !PageAnon(page) && !PageActive(page)) {
+			struct page *head = team_head(page);
+			struct address_space *mapping = head->mapping;
+			bool transferring_weight = false;
+			unsigned long flags;
+			/*
+			 * Team tail page was ready for eviction, but has
+			 * been sent back from shmem_writepage(): transfer
+			 * its weight to head, and move tail to unevictable.
+			 *
+			 * Barrier below so PageTeam guarantees good "mapping".
+			 */
+			smp_rmb();
+			if (page != head && PageTeam(head)) {
+				lruvec = mem_cgroup_page_lruvec(head, zone);
+				spin_lock_irqsave(&mapping->tree_lock, flags);
+				if (PageTeam(head)) {
+					inc_lru_weight(head);
+					transferring_weight = true;
+				}
+				spin_unlock_irqrestore(
+						&mapping->tree_lock, flags);
+			}
+			if (transferring_weight) {
+				if (PageLRU(head))
+					update_lru_size(lruvec,
+							page_lru(head), 1);
+				/* Get this tail page out of the way for now */
+				SetPageUnevictable(page);
+				clear_lru_weight(page);
+			} else {
+				/* Traditional case of unswapped & redirtied */
+				SetPageActive(page);
+			}
+		}
+
 		lruvec = mem_cgroup_page_lruvec(page, zone);
 
 		SetPageLRU(page);
@@ -3705,11 +3742,12 @@ int zone_reclaim(struct zone *zone, gfp_
  * Reasons page might not be evictable:
  * (1) page's mapping marked unevictable
  * (2) page is part of an mlocked VMA
- *
+ * (3) page is held in memory as part of a team
  */
 int page_evictable(struct page *page)
 {
-	return !mapping_unevictable(page_mapping(page)) && !PageMlocked(page);
+	return !mapping_unevictable(page_mapping(page)) &&
+		!PageMlocked(page) && hpage_nr_pages(page);
 }
 
 #ifdef CONFIG_SHMEM

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 20/24] huge tmpfs: use Unevictable lru with variable hpage_nr_pages()
@ 2015-02-21  4:23   ` Hugh Dickins
  0 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:23 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

A big advantage of huge tmpfs over hugetlbfs is that its pages can
be swapped out; but too often it OOMs before swapping them out.

At first I tried changing page_evictable(), to treat all tail pages
of a hugely mapped team as unevictable: the anon LRUs were otherwise
swamped by pages that could not be freed before the head.

That worked quite well, some of the time, but has some drawbacks.

Most obviously, /proc/meminfo is liable to show 511/512ths of all
the ShmemPmdMapped as Unevictable; which is rather sad for a feature
intended to improve on hugetlbfs by letting the pages be swappable.

But more seriously, although it is helpful to have those tails out
of the way on the Unevictable list, page reclaim can very easily come
to a point where all the team heads to be freed are on the Active list,
but the Inactive is large enough that !inactive_anon_is_low(), so the
Active is never scanned to unmap those heads to release all the tails.
Eventually we OOM.

Perhaps that could be dealt with by hacking inactive_anon_is_low():
but it wouldn't help the Unevictable numbers, and has never been
necessary for anon THP.  How does anon THP avoid this?  It doesn't
put tails on the LRU at all, so doesn't then need to shift them to
Unevictable; but there would still be the danger of an Active list
full of heads, holding the unseen tails, but the ratio too high for
for Active scanning - except that hpage_nr_pages() weights each THP
head by the number of small pages the huge page holds, instead of the
usual 1, and that is what keeps the Active/Inactive balance working.
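
For instance, a sketch of the balance arithmetic with made-up round
numbers (the precise threshold depends on the zone's inactive_ratio):

	 2000 team heads on Active(anon), each sheltering 511 tails
	20000 ordinary anon pages on Inactive(anon)

	unweighted: active =    2000 vs inactive = 20000
	            -> !inactive_anon_is_low(), heads never scanned
	weighted:   active = 1024000 vs inactive = 20000
	            -> inactive_anon_is_low(), heads scanned and unmapped,
	               teams disbanded, tails freed or swapped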

So in this patch we try to do the same for huge tmpfs pages.  However,
a team is not one huge compound page, but a collection of independent
pages, and the fair and lazy way to accomplish this seems to be to
transfer each tail's weight to head at the time when shmem_writepage()
has been asked to evict the page, but refuses because the head has not
yet been evicted.  So although the failed-to-be-evicted tails are moved
to the Unevictable LRU, each counts for 0kB in the Unevictable amount,
its 4kB going to the head in the Active(anon) or Inactive(anon) amount.
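
In terms of the lru weight added below (the low bits of team_usage),
the bookkeeping for one team goes roughly like this, a sketch rather
than a trace of any particular run:

	freshly added:	head weight 1, each tail weight 1
	tail refused by shmem_writepage(), putback_inactive_pages():
			inc_lru_weight(head); clear_lru_weight(tail);
			SetPageUnevictable(tail)
	after all 511 tails refused:
			head weight 512, hpage_nr_pages(head) == 512;
			tails weight 0, each counting 0kB on Unevictable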

Apart from mlock.c (next patch), hpage_nr_pages() is now only called
on a maybe-PageTeam page while under lruvec lock, and we do need to
hold lruvec lock when transferring weight from one page to another.
That is a new overhead, which shmem_disband_hugehead() prefers to
avoid, if the head's weight is just the default 1.  And it's not
clear how well this will all play out if different pages of a team
are charged to different memcgs: but the code allows for that, and
it should be fine while that's just an exceptional minority case.
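
The fast path avoiding that overhead, condensed from the
shmem_disband_hugehead() hunk below (cmpxchg retry and error paths
omitted):

	/* Take lru_lock only if tails have donated weight to the head */
	team_usage = atomic_long_read(&head->team_usage);
	if ((team_usage & TEAM_LRU_WEIGHT_MASK) != TEAM_LRU_WEIGHT_ONE) {
		spin_lock_irq(&zone->lru_lock);
		lru_locked = true;
	}
	spin_lock_irqsave(&mapping->tree_lock, flags);
	/* ... disband, then update_lru_size() by 1 - head_lru_weight ... */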

A change I like in principle, but have not made, and do not intend
to make unless we see a workload that demands it: it would be natural
for mark_page_accessed() to retrieve such a 0-weight page from the
Unevictable LRU, assigning it weight again and giving it a new life
on the Active and Inactive LRUs.  As it is, I'm hoping PageReferenced
gives a good enough hint as to whether a page should be retained, when
shmem_evictify_hugetails() brings it back from Unevictable to Inactive.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/huge_mm.h  |   13 +++
 include/linux/pageteam.h |   48 ++++++++++-
 mm/memcontrol.c          |   10 ++
 mm/shmem.c               |  158 ++++++++++++++++++++++++++++++-------
 mm/swap.c                |    5 +
 mm/vmscan.c              |   42 +++++++++
 6 files changed, 243 insertions(+), 33 deletions(-)

--- thpfs.orig/include/linux/huge_mm.h	2015-02-20 19:34:32.363944978 -0800
+++ thpfs/include/linux/huge_mm.h	2015-02-20 19:35:04.303871947 -0800
@@ -150,10 +150,23 @@ static inline void vma_adjust_trans_huge
 #endif
 	__vma_adjust_trans_huge(vma, start, end, adjust_next);
 }
+
+/* Repeat definition from linux/pageteam.h to force error if different */
+#define TEAM_LRU_WEIGHT_MASK	((1L << (HPAGE_PMD_ORDER + 1)) - 1)
+
 static inline int hpage_nr_pages(struct page *page)
 {
 	if (unlikely(PageTransHuge(page)))
 		return HPAGE_PMD_NR;
+	/*
+	 * PG_team == PG_compound_lock, but PageTransHuge == PageHead.
+	 * The question of races here is interesting, but not for now:
+	 * this can only be relied on while holding the lruvec lock,
+	 * or knowing that the page is anonymous, not from huge tmpfs.
+	 */
+	if (PageTeam(page))
+		return atomic_long_read(&page->team_usage) &
+					TEAM_LRU_WEIGHT_MASK;
 	return 1;
 }
 
--- thpfs.orig/include/linux/pageteam.h	2015-02-20 19:34:48.083909034 -0800
+++ thpfs/include/linux/pageteam.h	2015-02-20 19:35:04.303871947 -0800
@@ -30,11 +30,32 @@ static inline struct page *team_head(str
 }
 
 /*
+ * Mask for lower bits of team_usage, giving the weight 0..HPAGE_PMD_NR of the
+ * page on its LRU: normal pages have weight 1, tails held unevictable until
+ * head is evicted have weight 0, and the head gathers weight 1..HPAGE_PMD_NR.
+ */
+#define TEAM_LRU_WEIGHT_ONE	1L
+#define TEAM_LRU_WEIGHT_MASK	((1L << (HPAGE_PMD_ORDER + 1)) - 1)
+
+#define TEAM_HIGH_COUNTER	(1L << (HPAGE_PMD_ORDER + 1))
+/*
+ * Count how many pages of team are instantiated, as it is built up.
+ */
+#define TEAM_PAGE_COUNTER	TEAM_HIGH_COUNTER
+#define TEAM_COMPLETE		(TEAM_PAGE_COUNTER << HPAGE_PMD_ORDER)
+/*
+ * And when complete, count how many huge mappings (like page_mapcount): an
+ * incomplete team cannot be hugely mapped (would expose uninitialized holes).
+ */
+#define TEAM_MAPPING_COUNTER	TEAM_HIGH_COUNTER
+#define TEAM_HUGELY_MAPPED	(TEAM_COMPLETE + TEAM_MAPPING_COUNTER)
+
+/*
  * Returns true if this team is mapped by pmd somewhere.
  */
 static inline bool team_hugely_mapped(struct page *head)
 {
-	return atomic_long_read(&head->team_usage) > HPAGE_PMD_NR;
+	return atomic_long_read(&head->team_usage) >= TEAM_HUGELY_MAPPED;
 }
 
 /*
@@ -43,7 +64,8 @@ static inline bool team_hugely_mapped(st
  */
 static inline bool inc_hugely_mapped(struct page *head)
 {
-	return atomic_long_inc_return(&head->team_usage) == HPAGE_PMD_NR+1;
+	return atomic_long_add_return(TEAM_MAPPING_COUNTER, &head->team_usage)
+		< TEAM_HUGELY_MAPPED + TEAM_MAPPING_COUNTER;
 }
 
 /*
@@ -52,7 +74,27 @@ static inline bool inc_hugely_mapped(str
  */
 static inline bool dec_hugely_mapped(struct page *head)
 {
-	return atomic_long_dec_return(&head->team_usage) == HPAGE_PMD_NR;
+	return atomic_long_sub_return(TEAM_MAPPING_COUNTER, &head->team_usage)
+		< TEAM_HUGELY_MAPPED;
+}
+
+static inline void inc_lru_weight(struct page *head)
+{
+	atomic_long_inc(&head->team_usage);
+	VM_BUG_ON_PAGE((atomic_long_read(&head->team_usage) &
+			TEAM_LRU_WEIGHT_MASK) > HPAGE_PMD_NR, head);
+}
+
+static inline void set_lru_weight(struct page *page)
+{
+	VM_BUG_ON_PAGE(atomic_long_read(&page->team_usage) != 0, page);
+	atomic_long_set(&page->team_usage, 1);
+}
+
+static inline void clear_lru_weight(struct page *page)
+{
+	VM_BUG_ON_PAGE(atomic_long_read(&page->team_usage) != 1, page);
+	atomic_long_set(&page->team_usage, 0);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
--- thpfs.orig/mm/memcontrol.c	2015-02-20 19:34:11.231993296 -0800
+++ thpfs/mm/memcontrol.c	2015-02-20 19:35:04.303871947 -0800
@@ -1319,6 +1319,16 @@ void mem_cgroup_update_lru_size(struct l
 		*lru_size += nr_pages;
 
 	size = *lru_size;
+	if (!size && !empty && lru == LRU_UNEVICTABLE) {
+		struct page *page;
+		/*
+		 * The unevictable list might be full of team tail pages of 0
+		 * weight: check the first, and skip the warning if that fits.
+		 */
+		page = list_first_entry(lruvec->lists + lru, struct page, lru);
+		if (hpage_nr_pages(page) == 0)
+			empty = true;
+	}
 	if (WARN(size < 0 || empty != !size,
 	"mem_cgroup_update_lru_size(%p, %d, %d): lru_size %ld but %sempty\n",
 			lruvec, lru, nr_pages, size, empty ? "" : "not ")) {
--- thpfs.orig/mm/shmem.c	2015-02-20 19:34:59.051883956 -0800
+++ thpfs/mm/shmem.c	2015-02-20 19:35:04.307871938 -0800
@@ -63,6 +63,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/swapops.h>
 #include <linux/pageteam.h>
 #include <linux/mempolicy.h>
+#include <linux/mm_inline.h>
 #include <linux/namei.h>
 #include <linux/ctype.h>
 #include <linux/migrate.h>
@@ -373,11 +374,10 @@ restart:
 
 static int shmem_freeholes(struct page *head)
 {
-	/*
-	 * Note: team_usage will also be used to count huge mappings,
-	 * so treat a negative value from shmem_freeholes() as none.
-	 */
-	return HPAGE_PMD_NR - atomic_long_read(&head->team_usage);
+	long nr = atomic_long_read(&head->team_usage);
+
+	return (nr >= TEAM_COMPLETE) ? 0 :
+		HPAGE_PMD_NR - (nr / TEAM_PAGE_COUNTER);
 }
 
 static void shmem_clear_tag_hugehole(struct address_space *mapping,
@@ -404,18 +404,16 @@ static void shmem_added_to_hugeteam(stru
 {
 	struct address_space *mapping = page->mapping;
 	struct page *head = team_head(page);
-	int nr;
 
 	if (hugehint == SHMEM_ALLOC_HUGE_PAGE) {
-		atomic_long_set(&head->team_usage, 1);
+		atomic_long_set(&head->team_usage,
+				TEAM_PAGE_COUNTER + TEAM_LRU_WEIGHT_ONE);
 		radix_tree_tag_set(&mapping->page_tree, page->index,
 					SHMEM_TAG_HUGEHOLE);
 		__mod_zone_page_state(zone, NR_SHMEM_FREEHOLES, HPAGE_PMD_NR-1);
 	} else {
-		/* We do not need atomic ops until huge page gets mapped */
-		nr = atomic_long_read(&head->team_usage) + 1;
-		atomic_long_set(&head->team_usage, nr);
-		if (nr == HPAGE_PMD_NR) {
+		if (atomic_long_add_return(TEAM_PAGE_COUNTER,
+				&head->team_usage) >= TEAM_COMPLETE) {
 			shmem_clear_tag_hugehole(mapping, head->index);
 			__inc_zone_state(zone, NR_SHMEM_HUGEPAGES);
 		}
@@ -459,36 +457,61 @@ static int shmem_populate_hugeteam(struc
 	return 0;
 }
 
-static int shmem_disband_hugehead(struct page *head)
+static int shmem_disband_hugehead(struct page *head, int *head_lru_weight)
 {
 	struct address_space *mapping;
+	bool lru_locked = false;
+	unsigned long flags;
 	struct zone *zone;
-	int nr = -1;
+	long team_usage;
+	long nr = -1;
 
 	/*
 	 * Only in the shrinker migration case might head have been truncated.
 	 * But although head->mapping may then be zeroed at any moment, mapping
 	 * stays safe because shmem_evict_inode must take our shrinklist lock.
 	 */
+	*head_lru_weight = 0;
 	mapping = ACCESS_ONCE(head->mapping);
 	if (!mapping)
 		return nr;
 
 	zone = page_zone(head);
-	spin_lock_irq(&mapping->tree_lock);
+	team_usage = atomic_long_read(&head->team_usage);
+again1:
+	if ((team_usage & TEAM_LRU_WEIGHT_MASK) != TEAM_LRU_WEIGHT_ONE) {
+		spin_lock_irq(&zone->lru_lock);
+		lru_locked = true;
+	}
+	spin_lock_irqsave(&mapping->tree_lock, flags);
 
 	if (PageTeam(head)) {
-		nr = atomic_long_read(&head->team_usage);
-		atomic_long_set(&head->team_usage, 0);
+again2:
+		nr = atomic_long_cmpxchg(&head->team_usage, team_usage,
+					 TEAM_LRU_WEIGHT_ONE);
+		if (unlikely(nr != team_usage)) {
+			team_usage = nr;
+			if (lru_locked ||
+			    (team_usage & TEAM_LRU_WEIGHT_MASK) ==
+						    TEAM_LRU_WEIGHT_ONE)
+				goto again2;
+			spin_unlock_irqrestore(&mapping->tree_lock, flags);
+			goto again1;
+		}
+		*head_lru_weight = nr & TEAM_LRU_WEIGHT_MASK;
+		nr /= TEAM_PAGE_COUNTER;
+
 		/*
-		 * Disable additions to the team.
-		 * Ensure head->private is written before PageTeam is
-		 * cleared, so shmem_writepage() cannot write swap into
-		 * head->private, then have it overwritten by that 0!
+		 * Disable additions to the team.  The cmpxchg above
+		 * ensures head->team_usage is read before PageTeam is cleared,
+		 * when shmem_writepage() might write swap into head->private.
 		 */
-		smp_mb__before_atomic();
 		ClearPageTeam(head);
 
+		if (PageLRU(head) && *head_lru_weight > 1)
+			update_lru_size(mem_cgroup_page_lruvec(head, zone),
+					page_lru(head), 1 - *head_lru_weight);
+
 		if (nr >= HPAGE_PMD_NR) {
 			__dec_zone_state(zone, NR_SHMEM_HUGEPAGES);
 			VM_BUG_ON(nr != HPAGE_PMD_NR);
@@ -499,10 +522,72 @@ static int shmem_disband_hugehead(struct
 		} /* else shmem_getpage_gfp disbanding a failed alloced_huge */
 	}
 
-	spin_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irqrestore(&mapping->tree_lock, flags);
+	if (lru_locked)
+		spin_unlock_irq(&zone->lru_lock);
 	return nr;
 }
 
+static void shmem_evictify_hugetails(struct page *head, int head_lru_weight)
+{
+	struct page *page;
+	struct lruvec *lruvec = NULL;
+	struct zone *zone = page_zone(head);
+	bool lru_locked = false;
+
+	/*
+	 * The head has been sheltering the rest of its team from reclaim:
+	 * if any were moved to the unevictable list, now make them evictable.
+	 */
+again:
+	for (page = head + HPAGE_PMD_NR - 1; page > head; page--) {
+		if (!PageTeam(page))
+			continue;
+		if (atomic_long_read(&page->team_usage) == TEAM_LRU_WEIGHT_ONE)
+			continue;
+
+		/*
+		 * Delay getting lru lock until we reach a page that needs it.
+		 */
+		if (!lru_locked) {
+			spin_lock_irq(&zone->lru_lock);
+			lru_locked = true;
+		}
+		lruvec = mem_cgroup_page_lruvec(page, zone);
+
+		VM_BUG_ON_PAGE(atomic_long_read(&page->team_usage), page);
+		VM_BUG_ON_PAGE(!PageLRU(page), page);
+		VM_BUG_ON_PAGE(!PageUnevictable(page), page);
+		VM_BUG_ON_PAGE(PageActive(page), page);
+
+		set_lru_weight(page);
+		head_lru_weight--;
+
+		if (!page_evictable(page)) {
+			update_lru_size(lruvec, LRU_UNEVICTABLE, 1);
+			continue;
+		}
+
+		ClearPageUnevictable(page);
+		update_lru_size(lruvec, LRU_INACTIVE_ANON, 1);
+
+		list_del(&page->lru);
+		list_add_tail(&page->lru, lruvec->lists + LRU_INACTIVE_ANON);
+	}
+
+	if (lru_locked) {
+		spin_unlock_irq(&zone->lru_lock);
+		lru_locked = false;
+	}
+
+	/*
+	 * But how can we be sure that a racing putback_inactive_pages()
+	 * did its clear_lru_weight() before we checked team_usage above?
+	 */
+	if (unlikely(head_lru_weight != TEAM_LRU_WEIGHT_ONE))
+		goto again;
+}
+
 static void shmem_disband_hugetails(struct page *head,
 				    struct list_head *list, int nr)
 {
@@ -579,6 +664,7 @@ static void shmem_disband_hugetails(stru
 static void shmem_disband_hugeteam(struct page *page)
 {
 	struct page *head = team_head(page);
+	int head_lru_weight;
 	int nr_used;
 
 	/*
@@ -622,9 +708,11 @@ static void shmem_disband_hugeteam(struc
 	 * can (splitting disband in two stages), but better not be preempted.
 	 */
 	preempt_disable();
-	nr_used = shmem_disband_hugehead(head);
+	nr_used = shmem_disband_hugehead(head, &head_lru_weight);
 	if (head != page)
 		unlock_page(head);
+	if (head_lru_weight > TEAM_LRU_WEIGHT_ONE)
+		shmem_evictify_hugetails(head, head_lru_weight);
 	if (nr_used >= 0)
 		shmem_disband_hugetails(head, NULL, 0);
 	if (head != page)
@@ -680,6 +768,7 @@ static unsigned long shmem_choose_hugeho
 	struct page *topage = NULL;
 	struct page *page;
 	pgoff_t index;
+	int head_lru_weight;
 	int fromused;
 	int toused;
 	int nid;
@@ -721,8 +810,10 @@ static unsigned long shmem_choose_hugeho
 	if (!frompage)
 		goto unlock;
 	preempt_disable();
-	fromused = shmem_disband_hugehead(frompage);
+	fromused = shmem_disband_hugehead(frompage, &head_lru_weight);
 	spin_unlock(&shmem_shrinklist_lock);
+	if (head_lru_weight > TEAM_LRU_WEIGHT_ONE)
+		shmem_evictify_hugetails(frompage, head_lru_weight);
 	if (fromused > 0)
 		shmem_disband_hugetails(frompage, fromlist, -fromused);
 	preempt_enable();
@@ -776,8 +867,10 @@ static unsigned long shmem_choose_hugeho
 	if (!topage)
 		goto unlock;
 	preempt_disable();
-	toused = shmem_disband_hugehead(topage);
+	toused = shmem_disband_hugehead(topage, &head_lru_weight);
 	spin_unlock(&shmem_shrinklist_lock);
+	if (head_lru_weight > TEAM_LRU_WEIGHT_ONE)
+		shmem_evictify_hugetails(topage, head_lru_weight);
 	if (toused > 0) {
 		if (HPAGE_PMD_NR - toused >= fromused)
 			shmem_disband_hugetails(topage, tolist, fromused);
@@ -927,7 +1020,11 @@ shmem_add_to_page_cache(struct page *pag
 		}
 		if (!PageSwapBacked(page)) {	/* huge needs special care */
 			SetPageSwapBacked(page);
-			SetPageTeam(page);
+			if (!PageTeam(page)) {
+				atomic_long_set(&page->team_usage,
+						TEAM_LRU_WEIGHT_ONE);
+				SetPageTeam(page);
+			}
 		}
 	}
 
@@ -1514,9 +1611,13 @@ static int shmem_writepage(struct page *
 		struct page *head = team_head(page);
 		/*
 		 * Only proceed if this is head, or if head is unpopulated.
+		 * Redirty any others, without setting PageActive, and then
+		 * putback_inactive_pages() will shift them to unevictable.
 		 */
-		if (page != head && PageSwapBacked(head))
+		if (page != head && PageSwapBacked(head)) {
+			wbc->for_reclaim = 0;
 			goto redirty;
+		}
 	}
 
 	swap = get_swap_page();
@@ -1660,7 +1761,8 @@ static struct page *shmem_alloc_page(gfp
 				split_page(head, HPAGE_PMD_ORDER);
 
 				/* Prepare head page for add_to_page_cache */
-				atomic_long_set(&head->team_usage, 0);
+				atomic_long_set(&head->team_usage,
+						TEAM_LRU_WEIGHT_ONE);
 				__SetPageTeam(head);
 				head->mapping = mapping;
 				head->index = round_down(index, HPAGE_PMD_NR);
--- thpfs.orig/mm/swap.c	2014-12-07 14:21:05.000000000 -0800
+++ thpfs/mm/swap.c	2015-02-20 19:35:04.307871938 -0800
@@ -702,6 +702,11 @@ void lru_cache_add_active_or_unevictable
 					 struct vm_area_struct *vma)
 {
 	VM_BUG_ON_PAGE(PageLRU(page), page);
+	/*
+	 * Using hpage_nr_pages() on a huge tmpfs team page might not give the
+	 * 1 NR_MLOCK needs below; but this seems to be for anon pages only.
+	 */
+	VM_BUG_ON_PAGE(!PageAnon(page), page);
 
 	if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
 		SetPageActive(page);
--- thpfs.orig/mm/vmscan.c	2015-02-20 19:34:11.235993287 -0800
+++ thpfs/mm/vmscan.c	2015-02-20 19:35:04.307871938 -0800
@@ -19,6 +19,7 @@
 #include <linux/kernel_stat.h>
 #include <linux/swap.h>
 #include <linux/pagemap.h>
+#include <linux/pageteam.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
 #include <linux/vmpressure.h>
@@ -1419,6 +1420,42 @@ putback_inactive_pages(struct lruvec *lr
 			continue;
 		}
 
+		if (PageTeam(page) && !PageAnon(page) && !PageActive(page)) {
+			struct page *head = team_head(page);
+			struct address_space *mapping = head->mapping;
+			bool transferring_weight = false;
+			unsigned long flags;
+			/*
+			 * Team tail page was ready for eviction, but has
+			 * been sent back from shmem_writepage(): transfer
+			 * its weight to head, and move tail to unevictable.
+			 *
+			 * Barrier below so PageTeam guarantees good "mapping".
+			 */
+			smp_rmb();
+			if (page != head && PageTeam(head)) {
+				lruvec = mem_cgroup_page_lruvec(head, zone);
+				spin_lock_irqsave(&mapping->tree_lock, flags);
+				if (PageTeam(head)) {
+					inc_lru_weight(head);
+					transferring_weight = true;
+				}
+				spin_unlock_irqrestore(
+						&mapping->tree_lock, flags);
+			}
+			if (transferring_weight) {
+				if (PageLRU(head))
+					update_lru_size(lruvec,
+							page_lru(head), 1);
+				/* Get this tail page out of the way for now */
+				SetPageUnevictable(page);
+				clear_lru_weight(page);
+			} else {
+				/* Traditional case of unswapped & redirtied */
+				SetPageActive(page);
+			}
+		}
+
 		lruvec = mem_cgroup_page_lruvec(page, zone);
 
 		SetPageLRU(page);
@@ -3705,11 +3742,12 @@ int zone_reclaim(struct zone *zone, gfp_
  * Reasons page might not be evictable:
  * (1) page's mapping marked unevictable
  * (2) page is part of an mlocked VMA
- *
+ * (3) page is held in memory as part of a team
  */
 int page_evictable(struct page *page)
 {
-	return !mapping_unevictable(page_mapping(page)) && !PageMlocked(page);
+	return !mapping_unevictable(page_mapping(page)) &&
+		!PageMlocked(page) && hpage_nr_pages(page);
 }
 
 #ifdef CONFIG_SHMEM


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 21/24] huge tmpfs: fix Mlocked meminfo, tracking huge and unhuge mlocks
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  4:25   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:25 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

Up to this point, the huge tmpfs effort hasn't looked at or touched
mm/mlock.c at all, and it was surprising that regular tests did not
therefore crash machines.

/proc/meminfo's Mlocked count has been whatever happens to be shown
if we do nothing extra: a hugely mapped and mlocked team page would
count as 4kB instead of the 2MB you'd expect; or at least until the
previous (Unevictable) patch, which now requires lruvec locking for
hpage_nr_pages() on a team page (locking not given it in mlock.c),
and varies the amount returned by hpage_nr_pages().

It would be easy to correct the 4kB or variable amount to 2MB
by using an alternative to hpage_nr_pages() here.  And it would be
fairly easy to maintain an entirely independent HugelyMlocked count,
such that Mlocked+HugelyMlocked might amount to (almost) twice RAM
size.  But is that what observers of Mlocked want?  Probably not.

So we need a huge pmd mlock to count as 2MB, but discount 4kB for
each page within it that is already mlocked by pte somewhere, in
this or another process; and a small pte mlock to count usually as
4kB, but 0 if the team head is already mlocked by pmd somewhere.
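
A worked example of that arithmetic (a sketch only, assuming 2MB
teams of 512 4kB pages):

	pte mlock of 3 tail pages first:	Mlocked  = 3 x 4kB = 12kB
	huge pmd mlock of the whole team:	Mlocked += 2048kB - 12kB
						total    = 2048kB
	(the other way round: pmd mlock first shows Mlocked = 2048kB,
	 and later pte mlocks of its pages add 0kB)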

Can this be done by maintaining extra counts per team?  I did
intend so, but (a) space in team_usage is limited, and (b) mlock
and munlock already involve slow LRU switching, so might as well
keep 4kB and 2MB in synch manually; but most significantly (c) the
trylocking around which mlock (and restoration of mlock in munlock)
is currently designed, makes it hard to work out just when a count
does need to be incremented.

The hard-won solution looks much simpler than I thought possible,
but has an odd interface in its current implementation.  Not so much
needed changing, mainly just clear_page_mlock(), mlock_vma_page(),
munlock_vma_page() and try_to_"unmap"_one().  The big difference
from before is that a team head page might be being mlocked as a
4kB page or as a 2MB page, and the called functions cannot tell:
so they now need an nr_pages argument.  It is odd because the PageTeam
case immediately converts that to an iteration count, whereas
the anon THP case keeps it as the weight for a single iteration.
Not very nice, but will do for now: it was so hard to get here,
I'm very reluctant to pull it apart in a hurry.
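
Schematically, condensed from the mlock_vma_pages() hunk below, with
set_hugely_mlocked() and the LRU isolation details left out:

	void mlock_vma_pages(struct page *page, int nr_pages)
	{
		struct page *endpage = page + 1;

		if (nr_pages > 1 && PageTeam(page) && !PageAnon(page)) {
			/* team: walk each of its nr_pages small pages... */
			endpage = page + nr_pages;
			nr_pages = 1;	/* ...counting 4kB apiece */
		}
		/* anon THP head: one iteration, counted with weight nr_pages */
		for (; page < endpage; page++)
			if (!TestSetPageMlocked(page))
				mod_zone_page_state(page_zone(page),
						    NR_MLOCK, nr_pages);
	}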

The TEAM_HUGELY_MLOCKED flag in team_usage does not play a large part,
just optimizes out the overhead in a couple of cases: we don't want to
make yet another pass down the team, whenever a team is last unmapped,
just to handle the unlikely mlocked-then-truncated case; and we don't
want munlocking one of many parallel huge mlocks to check every page.

Notes in passing:  Wouldn't mlock and munlock be better off using
proper anon_vma and i_mmap_rwsem locking, instead of the current page
and mmap_sem trylocking?  And if try_to_munlock() was crying out for
its own rmap walk before, instead of abusing try_to_unuse(), now it
is screaming for it.  But I haven't the time for such cleanups now,
and may be mistaken.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/pageteam.h |   38 +++++++
 mm/huge_memory.c         |    6 +
 mm/internal.h            |   25 +++--
 mm/mlock.c               |  181 ++++++++++++++++++++++---------------
 mm/rmap.c                |   34 ++++--
 5 files changed, 193 insertions(+), 91 deletions(-)

--- thpfs.orig/include/linux/pageteam.h	2015-02-20 19:35:04.303871947 -0800
+++ thpfs/include/linux/pageteam.h	2015-02-20 19:35:09.991858941 -0800
@@ -36,8 +36,14 @@ static inline struct page *team_head(str
  */
 #define TEAM_LRU_WEIGHT_ONE	1L
 #define TEAM_LRU_WEIGHT_MASK	((1L << (HPAGE_PMD_ORDER + 1)) - 1)
+/*
+ * Single bit to indicate whether team is hugely mlocked (like PageMlocked).
+ * Then another bit reserved for experiments with other team flags.
+ */
+#define TEAM_HUGELY_MLOCKED	(1L << (HPAGE_PMD_ORDER + 1))
+#define TEAM_RESERVED_FLAG	(1L << (HPAGE_PMD_ORDER + 2))
 
-#define TEAM_HIGH_COUNTER	(1L << (HPAGE_PMD_ORDER + 1))
+#define TEAM_HIGH_COUNTER	(1L << (HPAGE_PMD_ORDER + 3))
 /*
  * Count how many pages of team are instantiated, as it is built up.
  */
@@ -97,6 +103,36 @@ static inline void clear_lru_weight(stru
 	atomic_long_set(&page->team_usage, 0);
 }
 
+static inline bool team_hugely_mlocked(struct page *head)
+{
+	VM_BUG_ON_PAGE(head != team_head(head), head);
+	return atomic_long_read(&head->team_usage) & TEAM_HUGELY_MLOCKED;
+}
+
+static inline void set_hugely_mlocked(struct page *head)
+{
+	long team_usage;
+
+	VM_BUG_ON_PAGE(head != team_head(head), head);
+	team_usage = atomic_long_read(&head->team_usage);
+	while (!(team_usage & TEAM_HUGELY_MLOCKED)) {
+		team_usage = atomic_long_cmpxchg(&head->team_usage,
+				team_usage, team_usage | TEAM_HUGELY_MLOCKED);
+	}
+}
+
+static inline void clear_hugely_mlocked(struct page *head)
+{
+	long team_usage;
+
+	VM_BUG_ON_PAGE(head != team_head(head), head);
+	team_usage = atomic_long_read(&head->team_usage);
+	while (team_usage & TEAM_HUGELY_MLOCKED) {
+		team_usage = atomic_long_cmpxchg(&head->team_usage,
+				team_usage, team_usage & ~TEAM_HUGELY_MLOCKED);
+	}
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int map_team_by_pmd(struct vm_area_struct *vma,
 			unsigned long addr, pmd_t *pmd, struct page *page);
--- thpfs.orig/mm/huge_memory.c	2015-02-20 19:34:48.083909034 -0800
+++ thpfs/mm/huge_memory.c	2015-02-20 19:35:09.991858941 -0800
@@ -1264,7 +1264,7 @@ struct page *follow_trans_huge_pmd(struc
 		if (page->mapping && trylock_page(page)) {
 			lru_add_drain();
 			if (page->mapping)
-				mlock_vma_page(page);
+				mlock_vma_pages(page, HPAGE_PMD_NR);
 			unlock_page(page);
 		}
 	}
@@ -1435,6 +1435,10 @@ int zap_huge_pmd(struct mmu_gather *tlb,
 				MM_ANONPAGES : MM_FILEPAGES, -HPAGE_PMD_NR);
 			atomic_long_dec(&tlb->mm->nr_ptes);
 			spin_unlock(ptl);
+			if (!PageAnon(page) &&
+			    !team_hugely_mapped(page) &&
+			    team_hugely_mlocked(page))
+				clear_pages_mlock(page, HPAGE_PMD_NR);
 			tlb_remove_page(tlb, page);
 		}
 		pte_free(tlb->mm, pgtable);
--- thpfs.orig/mm/internal.h	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/mm/internal.h	2015-02-20 19:35:09.991858941 -0800
@@ -230,8 +230,16 @@ static inline void munlock_vma_pages_all
 /*
  * must be called with vma's mmap_sem held for read or write, and page locked.
  */
-extern void mlock_vma_page(struct page *page);
-extern unsigned int munlock_vma_page(struct page *page);
+extern void mlock_vma_pages(struct page *page, int nr_pages);
+static inline void mlock_vma_page(struct page *page)
+{
+	mlock_vma_pages(page, 1);
+}
+extern int munlock_vma_pages(struct page *page, int nr_pages);
+static inline void munlock_vma_page(struct page *page)
+{
+	munlock_vma_pages(page, 1);
+}
 
 /*
  * Clear the page's PageMlocked().  This can be useful in a situation where
@@ -242,7 +250,11 @@ extern unsigned int munlock_vma_page(str
  * If called for a page that is still mapped by mlocked vmas, all we do
  * is revert to lazy LRU behaviour -- semantics are not broken.
  */
-extern void clear_page_mlock(struct page *page);
+extern void clear_pages_mlock(struct page *page, int nr_pages);
+static inline void clear_page_mlock(struct page *page)
+{
+	clear_pages_mlock(page, 1);
+}
 
 /*
  * mlock_migrate_page - called only from migrate_page_copy() to
@@ -268,12 +280,7 @@ extern pmd_t maybe_pmd_mkwrite(pmd_t pmd
 extern unsigned long vma_address(struct page *page,
 				 struct vm_area_struct *vma);
 #endif
-#else /* !CONFIG_MMU */
-static inline void clear_page_mlock(struct page *page) { }
-static inline void mlock_vma_page(struct page *page) { }
-static inline void mlock_migrate_page(struct page *new, struct page *old) { }
-
-#endif /* !CONFIG_MMU */
+#endif /* CONFIG_MMU */
 
 /*
  * Return the mem_map entry representing the 'offset' subpage within
--- thpfs.orig/mm/mlock.c	2014-12-07 14:21:05.000000000 -0800
+++ thpfs/mm/mlock.c	2015-02-20 19:35:09.991858941 -0800
@@ -11,6 +11,7 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <linux/pagemap.h>
+#include <linux/pageteam.h>
 #include <linux/pagevec.h>
 #include <linux/mempolicy.h>
 #include <linux/syscalls.h>
@@ -51,40 +52,70 @@ EXPORT_SYMBOL(can_do_mlock);
  * (see mm/rmap.c).
  */
 
-/*
- *  LRU accounting for clear_page_mlock()
+/**
+ * clear_pages_mlock - clear mlock from a page or pages
+ * @page - page to be unlocked
+ * @nr_pages - usually 1, but HPAGE_PMD_NR if pmd mapping is zapped.
+ *
+ * Clear the page's PageMlocked().  This can be useful in a situation where
+ * we want to unconditionally remove a page from the pagecache -- e.g.,
+ * on truncation or freeing.
+ *
+ * It is legal to call this function for any page, mlocked or not.
+ * If called for a page that is still mapped by mlocked vmas, all we do
+ * is revert to lazy LRU behaviour -- semantics are not broken.
  */
-void clear_page_mlock(struct page *page)
+void clear_pages_mlock(struct page *page, int nr_pages)
 {
-	if (!TestClearPageMlocked(page))
-		return;
+	struct zone *zone = page_zone(page);
+	struct page *endpage = page + 1;
 
-	mod_zone_page_state(page_zone(page), NR_MLOCK,
-			    -hpage_nr_pages(page));
-	count_vm_event(UNEVICTABLE_PGCLEARED);
-	if (!isolate_lru_page(page)) {
-		putback_lru_page(page);
-	} else {
-		/*
-		 * We lost the race. the page already moved to evictable list.
-		 */
-		if (PageUnevictable(page))
+	if (nr_pages > 1 && PageTeam(page) && !PageAnon(page)) {
+		clear_hugely_mlocked(page);	/* page is team head */
+		endpage = page + nr_pages;
+		nr_pages = 1;
+	}
+
+	for (; page < endpage; page++) {
+		if (page_mapped(page))
+			continue;
+		if (!TestClearPageMlocked(page))
+			continue;
+		mod_zone_page_state(zone, NR_MLOCK, -nr_pages);
+		count_vm_event(UNEVICTABLE_PGCLEARED);
+		if (!isolate_lru_page(page))
+			putback_lru_page(page);
+		else if (PageUnevictable(page))
 			count_vm_event(UNEVICTABLE_PGSTRANDED);
 	}
 }
 
-/*
- * Mark page as mlocked if not already.
+/**
+ * mlock_vma_pages - mlock a vma page or pages
+ * @page - page to be unlocked
+ * @nr_pages - usually 1, but HPAGE_PMD_NR if pmd mapping is mlocked.
+ *
+ * Mark pages as mlocked if not already.
  * If page on LRU, isolate and putback to move to unevictable list.
  */
-void mlock_vma_page(struct page *page)
+void mlock_vma_pages(struct page *page, int nr_pages)
 {
+	struct zone *zone = page_zone(page);
+	struct page *endpage = page + 1;
+
 	/* Serialize with page migration */
-	BUG_ON(!PageLocked(page));
+	VM_BUG_ON_PAGE(!PageLocked(page) && !PageTeam(page), page);
+
+	if (nr_pages > 1 && PageTeam(page) && !PageAnon(page)) {
+		set_hugely_mlocked(page);	/* page is team head */
+		endpage = page + nr_pages;
+		nr_pages = 1;
+	}
 
-	if (!TestSetPageMlocked(page)) {
-		mod_zone_page_state(page_zone(page), NR_MLOCK,
-				    hpage_nr_pages(page));
+	for (; page < endpage; page++) {
+		if (TestSetPageMlocked(page))
+			continue;
+		mod_zone_page_state(zone, NR_MLOCK, nr_pages);
 		count_vm_event(UNEVICTABLE_PGMLOCKED);
 		if (!isolate_lru_page(page))
 			putback_lru_page(page);
@@ -108,6 +139,18 @@ static bool __munlock_isolate_lru_page(s
 		return true;
 	}
 
+	/*
+	 * Perform accounting when page isolation fails in munlock.
+	 * There is nothing else to do because it means some other task has
+	 * already removed the page from the LRU. putback_lru_page() will take
+	 * care of removing the page from the unevictable list, if necessary.
+	 * vmscan [page_referenced()] will move the page back to the
+	 * unevictable list if some other vma has it mlocked.
+	 */
+	if (PageUnevictable(page))
+		__count_vm_event(UNEVICTABLE_PGSTRANDED);
+	else
+		__count_vm_event(UNEVICTABLE_PGMUNLOCKED);
 	return false;
 }
 
@@ -125,7 +168,7 @@ static void __munlock_isolated_page(stru
 	 * Optimization: if the page was mapped just once, that's our mapping
 	 * and we don't need to check all the other vmas.
 	 */
-	if (page_mapcount(page) > 1)
+	if (page_mapcount(page) > 1 || PageTeam(page))
 		ret = try_to_munlock(page);
 
 	/* Did try_to_unlock() succeed or punt? */
@@ -135,29 +178,12 @@ static void __munlock_isolated_page(stru
 	putback_lru_page(page);
 }
 
-/*
- * Accounting for page isolation fail during munlock
- *
- * Performs accounting when page isolation fails in munlock. There is nothing
- * else to do because it means some other task has already removed the page
- * from the LRU. putback_lru_page() will take care of removing the page from
- * the unevictable list, if necessary. vmscan [page_referenced()] will move
- * the page back to the unevictable list if some other vma has it mlocked.
- */
-static void __munlock_isolation_failed(struct page *page)
-{
-	if (PageUnevictable(page))
-		__count_vm_event(UNEVICTABLE_PGSTRANDED);
-	else
-		__count_vm_event(UNEVICTABLE_PGMUNLOCKED);
-}
-
 /**
- * munlock_vma_page - munlock a vma page
- * @page - page to be unlocked, either a normal page or THP page head
+ * munlock_vma_pages - munlock a vma page or pages
+ * @page - page to be unlocked
+ * @nr_pages - usually 1, but HPAGE_PMD_NR if pmd mapping is munlocked
  *
- * returns the size of the page as a page mask (0 for normal page,
- *         HPAGE_PMD_NR - 1 for THP head page)
+ * returns the size of the page (usually 1, but HPAGE_PMD_NR for huge page)
  *
  * called from munlock()/munmap() path with page supposedly on the LRU.
  * When we munlock a page, because the vma where we found the page is being
@@ -170,39 +196,55 @@ static void __munlock_isolation_failed(s
  * can't isolate the page, we leave it for putback_lru_page() and vmscan
  * [page_referenced()/try_to_unmap()] to deal with.
  */
-unsigned int munlock_vma_page(struct page *page)
+int munlock_vma_pages(struct page *page, int nr_pages)
 {
-	unsigned int nr_pages;
 	struct zone *zone = page_zone(page);
+	struct page *endpage = page + 1;
+	struct page *head = NULL;
+	int ret = nr_pages;
+	bool isolated;
 
 	/* For try_to_munlock() and to serialize with page migration */
-	BUG_ON(!PageLocked(page));
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+	if (nr_pages > 1 && PageTeam(page) && !PageAnon(page)) {
+		head = page;
+		clear_hugely_mlocked(page);	/* page is team head */
+		endpage = page + nr_pages;
+		nr_pages = 1;
+	}
 
 	/*
-	 * Serialize with any parallel __split_huge_page_refcount() which
-	 * might otherwise copy PageMlocked to part of the tail pages before
+	 * Serialize THP with any parallel __split_huge_page_refcount() which
+	 * might otherwise copy PageMlocked to some of the tail pages before
 	 * we clear it in the head page. It also stabilizes hpage_nr_pages().
 	 */
 	spin_lock_irq(&zone->lru_lock);
+	if (PageAnon(page))
+		ret = nr_pages = hpage_nr_pages(page);
 
-	nr_pages = hpage_nr_pages(page);
-	if (!TestClearPageMlocked(page))
-		goto unlock_out;
-
-	__mod_zone_page_state(zone, NR_MLOCK, -nr_pages);
+	for (; page < endpage; page++) {
+		if (!TestClearPageMlocked(page))
+			continue;
 
-	if (__munlock_isolate_lru_page(page, true)) {
+		__mod_zone_page_state(zone, NR_MLOCK, -nr_pages);
+		isolated = __munlock_isolate_lru_page(page, true);
 		spin_unlock_irq(&zone->lru_lock);
-		__munlock_isolated_page(page);
-		goto out;
-	}
-	__munlock_isolation_failed(page);
+		if (isolated)
+			__munlock_isolated_page(page);
 
-unlock_out:
+		/*
+		 * If try_to_munlock() found the huge page to be still
+		 * mlocked, don't waste more time munlocking and rmap
+		 * walking and re-mlocking each of the team's pages.
+		 */
+		if (!head || team_hugely_mlocked(head))
+			goto out;
+		spin_lock_irq(&zone->lru_lock);
+	}
 	spin_unlock_irq(&zone->lru_lock);
-
 out:
-	return nr_pages - 1;
+	return ret;
 }
 
 /**
@@ -351,8 +393,6 @@ static void __munlock_pagevec(struct pag
 			 */
 			if (__munlock_isolate_lru_page(page, false))
 				continue;
-			else
-				__munlock_isolation_failed(page);
 		}
 
 		/*
@@ -500,15 +540,18 @@ void munlock_vma_pages_range(struct vm_a
 				&page_mask);
 
 		if (page && !IS_ERR(page)) {
-			if (PageTransHuge(page)) {
+			if (PageTransHuge(page) || PageTeam(page)) {
 				lock_page(page);
 				/*
 				 * Any THP page found by follow_page_mask() may
-				 * have gotten split before reaching
-				 * munlock_vma_page(), so we need to recompute
-				 * the page_mask here.
+				 * be split before reaching munlock_vma_pages()
+				 * so we need to recompute the page_mask here.
 				 */
-				page_mask = munlock_vma_page(page);
+				if (page_mask &&
+				    !PageTeam(page) && !PageHead(page))
+					page_mask = 0;
+				page_mask = munlock_vma_pages(page,
+							page_mask + 1) - 1;
 				unlock_page(page);
 				put_page(page); /* follow_page_mask() */
 			} else {
--- thpfs.orig/mm/rmap.c	2015-02-20 19:34:37.851932430 -0800
+++ thpfs/mm/rmap.c	2015-02-20 19:35:09.995858933 -0800
@@ -1161,6 +1161,8 @@ out:
  */
 void page_remove_rmap(struct page *page)
 {
+	int nr_pages;
+
 	if (!PageAnon(page)) {
 		page_remove_file_rmap(page);
 		return;
@@ -1179,14 +1181,16 @@ void page_remove_rmap(struct page *page)
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	if (PageTransHuge(page))
+	nr_pages = 1;
+	if (PageTransHuge(page)) {
 		__dec_zone_page_state(page, NR_ANON_HUGEPAGES);
+		nr_pages = hpage_nr_pages(page);
+	}
 
-	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
-			      -hpage_nr_pages(page));
+	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr_pages);
 
 	if (unlikely(PageMlocked(page)))
-		clear_page_mlock(page);
+		clear_pages_mlock(page, nr_pages);
 
 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
@@ -1214,6 +1218,7 @@ static int try_to_unmap_one(struct page
 	pte_t pteval;
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
+	int mlock_pages = 1;
 	enum ttu_flags flags = (enum ttu_flags)arg;
 
 	if (unlikely(PageHuge(page))) {
@@ -1241,8 +1246,13 @@ again:
 		return ret;
 
 	if (pmd_trans_huge(pmdval)) {
-		if (pmd_page(pmdval) != page)
-			return ret;
+		if (pmd_page(pmdval) != page) {
+			if (!PageTeam(page) || !(flags & TTU_MUNLOCK))
+				return ret;
+			page = team_head(page);
+			if (pmd_page(pmdval) != page)
+				return ret;
+		}
 
 		ptl = pmd_lock(mm, pmd);
 		if (!pmd_same(*pmd, pmdval)) {
@@ -1251,8 +1261,10 @@ again:
 		}
 
 		if (!(flags & TTU_IGNORE_MLOCK)) {
-			if (vma->vm_flags & VM_LOCKED)
+			if (vma->vm_flags & VM_LOCKED) {
+				mlock_pages = HPAGE_PMD_NR;
 				goto out_mlock;
+			}
 			if (flags & TTU_MUNLOCK)
 				goto out_unmap;
 		}
@@ -1403,7 +1415,7 @@ out_mlock:
 	 */
 	if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
 		if (vma->vm_flags & VM_LOCKED) {
-			mlock_vma_page(page);
+			mlock_vma_pages(page, mlock_pages);
 			ret = SWAP_MLOCK;
 		}
 		up_read(&vma->vm_mm->mmap_sem);
@@ -1706,7 +1718,6 @@ int try_to_munlock(struct page *page)
 	struct rmap_walk_control rwc = {
 		.rmap_one = try_to_unmap_one,
 		.arg = (void *)TTU_MUNLOCK,
-		.done = page_not_mapped,
 		/*
 		 * We don't bother to try to find the munlocked page in
 		 * nonlinears. It's costly. Instead, later, page reclaim logic
@@ -1717,7 +1728,8 @@ int try_to_munlock(struct page *page)
 
 	};
 
-	VM_BUG_ON_PAGE(!PageLocked(page) || PageLRU(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page) && !PageTeam(page), page);
+	VM_BUG_ON_PAGE(PageLRU(page), page);
 
 	ret = rmap_walk(page, &rwc);
 	return ret;
@@ -1823,7 +1835,7 @@ static int rmap_walk_file(struct page *p
 	 * structure at mapping cannot be freed and reused yet,
 	 * so we can safely take mapping->i_mmap_rwsem.
 	 */
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page) && !PageTeam(page), page);
 
 	if (!mapping)
 		return ret;

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 21/24] huge tmpfs: fix Mlocked meminfo, tracking huge and unhuge mlocks
@ 2015-02-21  4:25   ` Hugh Dickins
  0 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:25 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

Up to this point, the huge tmpfs effort hasn't looked at or touched
mm/mlock.c at all, and it was surprising that regular tests did not
therefore crash machines.

/proc/meminfo's Mlocked count has been whatever happens to be shown
if we do nothing extra: a hugely mapped and mlocked team page would
count as 4kB instead of the 2MB you'd expect; or at least until the
previous (Unevictable) patch, which now requires lruvec locking for
hpage_nr_pages() on a team page (locking not given it in mlock.c),
and varies the amount returned by hpage_nr_pages().

It would be easy to correct the 4kB or variable amount to 2MB
by using an alternative to hpage_nr_pages() here.  And it would be
fairly easy to maintain an entirely independent HugelyMlocked count,
such that Mlocked+HugelyMlocked might amount to (almost) twice RAM
size.  But is that what observers of Mlocked want?  Probably not.

So we need a huge pmd mlock to count as 2MB, but discount 4kB for
each page within it that is already mlocked by pte somewhere, in
this or another process; and a small pte mlock to count usually as
4kB, but 0 if the team head is already mlocked by pmd somewhere.

Can this be done by maintaining extra counts per team?  I did
intend so, but (a) space in team_usage is limited, and (b) mlock
and munlock already involve slow LRU switching, so might as well
keep 4kB and 2MB in synch manually; but most significantly (c) the
trylocking around which mlock (and restoration of mlock in munlock)
is currently designed, makes it hard to work out just when a count
does need to be incremented.

The hard-won solution looks much simpler than I thought possible,
but has an odd interface in its current implementation.  Not so much
needed changing, mainly just clear_page_mlock(), mlock_vma_page(),
munlock_vma_page() and try_to_"unmap"_one().  The big difference
from before is that a team head page might be being mlocked as a
4kB page or as a 2MB page, and the called functions cannot tell:
so they now need an nr_pages argument.  It is odd because the PageTeam
case immediately converts that to an iteration count, whereas
the anon THP case keeps it as the weight for a single iteration.
Not very nice, but will do for now: it was so hard to get here,
I'm very reluctant to pull it apart in a hurry.

The TEAM_HUGELY_MLOCKED flag in team_usage does not play a large part,
just optimizes out the overhead in a couple of cases: we don't want to
make yet another pass down the team, whenever a team is last unmapped,
just to handle the unlikely mlocked-then-truncated case; and we don't
want munlocking one of many parallel huge mlocks to check every page.

Notes in passing:  Wouldn't mlock and munlock be better off using
proper anon_vma and i_mmap_rwsem locking, instead of the current page
and mmap_sem trylocking?  And if try_to_munlock() was crying out for
its own rmap walk before, instead of abusing try_to_unuse(), now it
is screaming for it.  But I haven't the time for such cleanups now,
and may be mistaken.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/pageteam.h |   38 +++++++
 mm/huge_memory.c         |    6 +
 mm/internal.h            |   25 +++--
 mm/mlock.c               |  181 ++++++++++++++++++++++---------------
 mm/rmap.c                |   34 ++++--
 5 files changed, 193 insertions(+), 91 deletions(-)

--- thpfs.orig/include/linux/pageteam.h	2015-02-20 19:35:04.303871947 -0800
+++ thpfs/include/linux/pageteam.h	2015-02-20 19:35:09.991858941 -0800
@@ -36,8 +36,14 @@ static inline struct page *team_head(str
  */
 #define TEAM_LRU_WEIGHT_ONE	1L
 #define TEAM_LRU_WEIGHT_MASK	((1L << (HPAGE_PMD_ORDER + 1)) - 1)
+/*
+ * Single bit to indicate whether team is hugely mlocked (like PageMlocked).
+ * Then another bit reserved for experiments with other team flags.
+ */
+#define TEAM_HUGELY_MLOCKED	(1L << (HPAGE_PMD_ORDER + 1))
+#define TEAM_RESERVED_FLAG	(1L << (HPAGE_PMD_ORDER + 2))
 
-#define TEAM_HIGH_COUNTER	(1L << (HPAGE_PMD_ORDER + 1))
+#define TEAM_HIGH_COUNTER	(1L << (HPAGE_PMD_ORDER + 3))
 /*
  * Count how many pages of team are instantiated, as it is built up.
  */
@@ -97,6 +103,36 @@ static inline void clear_lru_weight(stru
 	atomic_long_set(&page->team_usage, 0);
 }
 
+static inline bool team_hugely_mlocked(struct page *head)
+{
+	VM_BUG_ON_PAGE(head != team_head(head), head);
+	return atomic_long_read(&head->team_usage) & TEAM_HUGELY_MLOCKED;
+}
+
+static inline void set_hugely_mlocked(struct page *head)
+{
+	long team_usage;
+
+	VM_BUG_ON_PAGE(head != team_head(head), head);
+	team_usage = atomic_long_read(&head->team_usage);
+	while (!(team_usage & TEAM_HUGELY_MLOCKED)) {
+		team_usage = atomic_long_cmpxchg(&head->team_usage,
+				team_usage, team_usage | TEAM_HUGELY_MLOCKED);
+	}
+}
+
+static inline void clear_hugely_mlocked(struct page *head)
+{
+	long team_usage;
+
+	VM_BUG_ON_PAGE(head != team_head(head), head);
+	team_usage = atomic_long_read(&head->team_usage);
+	while (team_usage & TEAM_HUGELY_MLOCKED) {
+		team_usage = atomic_long_cmpxchg(&head->team_usage,
+				team_usage, team_usage & ~TEAM_HUGELY_MLOCKED);
+	}
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int map_team_by_pmd(struct vm_area_struct *vma,
 			unsigned long addr, pmd_t *pmd, struct page *page);
--- thpfs.orig/mm/huge_memory.c	2015-02-20 19:34:48.083909034 -0800
+++ thpfs/mm/huge_memory.c	2015-02-20 19:35:09.991858941 -0800
@@ -1264,7 +1264,7 @@ struct page *follow_trans_huge_pmd(struc
 		if (page->mapping && trylock_page(page)) {
 			lru_add_drain();
 			if (page->mapping)
-				mlock_vma_page(page);
+				mlock_vma_pages(page, HPAGE_PMD_NR);
 			unlock_page(page);
 		}
 	}
@@ -1435,6 +1435,10 @@ int zap_huge_pmd(struct mmu_gather *tlb,
 				MM_ANONPAGES : MM_FILEPAGES, -HPAGE_PMD_NR);
 			atomic_long_dec(&tlb->mm->nr_ptes);
 			spin_unlock(ptl);
+			if (!PageAnon(page) &&
+			    !team_hugely_mapped(page) &&
+			    team_hugely_mlocked(page))
+				clear_pages_mlock(page, HPAGE_PMD_NR);
 			tlb_remove_page(tlb, page);
 		}
 		pte_free(tlb->mm, pgtable);
--- thpfs.orig/mm/internal.h	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/mm/internal.h	2015-02-20 19:35:09.991858941 -0800
@@ -230,8 +230,16 @@ static inline void munlock_vma_pages_all
 /*
  * must be called with vma's mmap_sem held for read or write, and page locked.
  */
-extern void mlock_vma_page(struct page *page);
-extern unsigned int munlock_vma_page(struct page *page);
+extern void mlock_vma_pages(struct page *page, int nr_pages);
+static inline void mlock_vma_page(struct page *page)
+{
+	mlock_vma_pages(page, 1);
+}
+extern int munlock_vma_pages(struct page *page, int nr_pages);
+static inline void munlock_vma_page(struct page *page)
+{
+	munlock_vma_pages(page, 1);
+}
 
 /*
  * Clear the page's PageMlocked().  This can be useful in a situation where
@@ -242,7 +250,11 @@ extern unsigned int munlock_vma_page(str
  * If called for a page that is still mapped by mlocked vmas, all we do
  * is revert to lazy LRU behaviour -- semantics are not broken.
  */
-extern void clear_page_mlock(struct page *page);
+extern void clear_pages_mlock(struct page *page, int nr_pages);
+static inline void clear_page_mlock(struct page *page)
+{
+	clear_pages_mlock(page, 1);
+}
 
 /*
  * mlock_migrate_page - called only from migrate_page_copy() to
@@ -268,12 +280,7 @@ extern pmd_t maybe_pmd_mkwrite(pmd_t pmd
 extern unsigned long vma_address(struct page *page,
 				 struct vm_area_struct *vma);
 #endif
-#else /* !CONFIG_MMU */
-static inline void clear_page_mlock(struct page *page) { }
-static inline void mlock_vma_page(struct page *page) { }
-static inline void mlock_migrate_page(struct page *new, struct page *old) { }
-
-#endif /* !CONFIG_MMU */
+#endif /* CONFIG_MMU */
 
 /*
  * Return the mem_map entry representing the 'offset' subpage within
--- thpfs.orig/mm/mlock.c	2014-12-07 14:21:05.000000000 -0800
+++ thpfs/mm/mlock.c	2015-02-20 19:35:09.991858941 -0800
@@ -11,6 +11,7 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <linux/pagemap.h>
+#include <linux/pageteam.h>
 #include <linux/pagevec.h>
 #include <linux/mempolicy.h>
 #include <linux/syscalls.h>
@@ -51,40 +52,70 @@ EXPORT_SYMBOL(can_do_mlock);
  * (see mm/rmap.c).
  */
 
-/*
- *  LRU accounting for clear_page_mlock()
+/**
+ * clear_pages_mlock - clear mlock from a page or pages
+ * @page - page to be unlocked
+ * @nr_pages - usually 1, but HPAGE_PMD_NR if pmd mapping is zapped.
+ *
+ * Clear the page's PageMlocked().  This can be useful in a situation where
+ * we want to unconditionally remove a page from the pagecache -- e.g.,
+ * on truncation or freeing.
+ *
+ * It is legal to call this function for any page, mlocked or not.
+ * If called for a page that is still mapped by mlocked vmas, all we do
+ * is revert to lazy LRU behaviour -- semantics are not broken.
  */
-void clear_page_mlock(struct page *page)
+void clear_pages_mlock(struct page *page, int nr_pages)
 {
-	if (!TestClearPageMlocked(page))
-		return;
+	struct zone *zone = page_zone(page);
+	struct page *endpage = page + 1;
 
-	mod_zone_page_state(page_zone(page), NR_MLOCK,
-			    -hpage_nr_pages(page));
-	count_vm_event(UNEVICTABLE_PGCLEARED);
-	if (!isolate_lru_page(page)) {
-		putback_lru_page(page);
-	} else {
-		/*
-		 * We lost the race. the page already moved to evictable list.
-		 */
-		if (PageUnevictable(page))
+	if (nr_pages > 1 && PageTeam(page) && !PageAnon(page)) {
+		clear_hugely_mlocked(page);	/* page is team head */
+		endpage = page + nr_pages;
+		nr_pages = 1;
+	}
+
+	for (; page < endpage; page++) {
+		if (page_mapped(page))
+			continue;
+		if (!TestClearPageMlocked(page))
+			continue;
+		mod_zone_page_state(zone, NR_MLOCK, -nr_pages);
+		count_vm_event(UNEVICTABLE_PGCLEARED);
+		if (!isolate_lru_page(page))
+			putback_lru_page(page);
+		else if (PageUnevictable(page))
 			count_vm_event(UNEVICTABLE_PGSTRANDED);
 	}
 }
 
-/*
- * Mark page as mlocked if not already.
+/**
+ * mlock_vma_pages - mlock a vma page or pages
+ * @page - page to be unlocked
+ * @nr_pages - usually 1, but HPAGE_PMD_NR if pmd mapping is mlocked.
+ *
+ * Mark pages as mlocked if not already.
  * If page on LRU, isolate and putback to move to unevictable list.
  */
-void mlock_vma_page(struct page *page)
+void mlock_vma_pages(struct page *page, int nr_pages)
 {
+	struct zone *zone = page_zone(page);
+	struct page *endpage = page + 1;
+
 	/* Serialize with page migration */
-	BUG_ON(!PageLocked(page));
+	VM_BUG_ON_PAGE(!PageLocked(page) && !PageTeam(page), page);
+
+	if (nr_pages > 1 && PageTeam(page) && !PageAnon(page)) {
+		set_hugely_mlocked(page);	/* page is team head */
+		endpage = page + nr_pages;
+		nr_pages = 1;
+	}
 
-	if (!TestSetPageMlocked(page)) {
-		mod_zone_page_state(page_zone(page), NR_MLOCK,
-				    hpage_nr_pages(page));
+	for (; page < endpage; page++) {
+		if (TestSetPageMlocked(page))
+			continue;
+		mod_zone_page_state(zone, NR_MLOCK, nr_pages);
 		count_vm_event(UNEVICTABLE_PGMLOCKED);
 		if (!isolate_lru_page(page))
 			putback_lru_page(page);
@@ -108,6 +139,18 @@ static bool __munlock_isolate_lru_page(s
 		return true;
 	}
 
+	/*
+	 * Perform accounting when page isolation fails in munlock.
+	 * There is nothing else to do because it means some other task has
+	 * already removed the page from the LRU. putback_lru_page() will take
+	 * care of removing the page from the unevictable list, if necessary.
+	 * vmscan [page_referenced()] will move the page back to the
+	 * unevictable list if some other vma has it mlocked.
+	 */
+	if (PageUnevictable(page))
+		__count_vm_event(UNEVICTABLE_PGSTRANDED);
+	else
+		__count_vm_event(UNEVICTABLE_PGMUNLOCKED);
 	return false;
 }
 
@@ -125,7 +168,7 @@ static void __munlock_isolated_page(stru
 	 * Optimization: if the page was mapped just once, that's our mapping
 	 * and we don't need to check all the other vmas.
 	 */
-	if (page_mapcount(page) > 1)
+	if (page_mapcount(page) > 1 || PageTeam(page))
 		ret = try_to_munlock(page);
 
 	/* Did try_to_unlock() succeed or punt? */
@@ -135,29 +178,12 @@ static void __munlock_isolated_page(stru
 	putback_lru_page(page);
 }
 
-/*
- * Accounting for page isolation fail during munlock
- *
- * Performs accounting when page isolation fails in munlock. There is nothing
- * else to do because it means some other task has already removed the page
- * from the LRU. putback_lru_page() will take care of removing the page from
- * the unevictable list, if necessary. vmscan [page_referenced()] will move
- * the page back to the unevictable list if some other vma has it mlocked.
- */
-static void __munlock_isolation_failed(struct page *page)
-{
-	if (PageUnevictable(page))
-		__count_vm_event(UNEVICTABLE_PGSTRANDED);
-	else
-		__count_vm_event(UNEVICTABLE_PGMUNLOCKED);
-}
-
 /**
- * munlock_vma_page - munlock a vma page
- * @page - page to be unlocked, either a normal page or THP page head
+ * munlock_vma_pages - munlock a vma page or pages
+ * @page - page to be unlocked
+ * @nr_pages - usually 1, but HPAGE_PMD_NR if pmd mapping is munlocked
  *
- * returns the size of the page as a page mask (0 for normal page,
- *         HPAGE_PMD_NR - 1 for THP head page)
+ * returns the size of the page (usually 1, but HPAGE_PMD_NR for huge page)
  *
  * called from munlock()/munmap() path with page supposedly on the LRU.
  * When we munlock a page, because the vma where we found the page is being
@@ -170,39 +196,55 @@ static void __munlock_isolation_failed(s
  * can't isolate the page, we leave it for putback_lru_page() and vmscan
  * [page_referenced()/try_to_unmap()] to deal with.
  */
-unsigned int munlock_vma_page(struct page *page)
+int munlock_vma_pages(struct page *page, int nr_pages)
 {
-	unsigned int nr_pages;
 	struct zone *zone = page_zone(page);
+	struct page *endpage = page + 1;
+	struct page *head = NULL;
+	int ret = nr_pages;
+	bool isolated;
 
 	/* For try_to_munlock() and to serialize with page migration */
-	BUG_ON(!PageLocked(page));
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+	if (nr_pages > 1 && PageTeam(page) && !PageAnon(page)) {
+		head = page;
+		clear_hugely_mlocked(page);	/* page is team head */
+		endpage = page + nr_pages;
+		nr_pages = 1;
+	}
 
 	/*
-	 * Serialize with any parallel __split_huge_page_refcount() which
-	 * might otherwise copy PageMlocked to part of the tail pages before
+	 * Serialize THP with any parallel __split_huge_page_refcount() which
+	 * might otherwise copy PageMlocked to some of the tail pages before
 	 * we clear it in the head page. It also stabilizes hpage_nr_pages().
 	 */
 	spin_lock_irq(&zone->lru_lock);
+	if (PageAnon(page))
+		ret = nr_pages = hpage_nr_pages(page);
 
-	nr_pages = hpage_nr_pages(page);
-	if (!TestClearPageMlocked(page))
-		goto unlock_out;
-
-	__mod_zone_page_state(zone, NR_MLOCK, -nr_pages);
+	for (; page < endpage; page++) {
+		if (!TestClearPageMlocked(page))
+			continue;
 
-	if (__munlock_isolate_lru_page(page, true)) {
+		__mod_zone_page_state(zone, NR_MLOCK, -nr_pages);
+		isolated = __munlock_isolate_lru_page(page, true);
 		spin_unlock_irq(&zone->lru_lock);
-		__munlock_isolated_page(page);
-		goto out;
-	}
-	__munlock_isolation_failed(page);
+		if (isolated)
+			__munlock_isolated_page(page);
 
-unlock_out:
+		/*
+		 * If try_to_munlock() found the huge page to be still
+		 * mlocked, don't waste more time munlocking and rmap
+		 * walking and re-mlocking each of the team's pages.
+		 */
+		if (!head || team_hugely_mlocked(head))
+			goto out;
+		spin_lock_irq(&zone->lru_lock);
+	}
 	spin_unlock_irq(&zone->lru_lock);
-
 out:
-	return nr_pages - 1;
+	return ret;
 }
 
 /**
@@ -351,8 +393,6 @@ static void __munlock_pagevec(struct pag
 			 */
 			if (__munlock_isolate_lru_page(page, false))
 				continue;
-			else
-				__munlock_isolation_failed(page);
 		}
 
 		/*
@@ -500,15 +540,18 @@ void munlock_vma_pages_range(struct vm_a
 				&page_mask);
 
 		if (page && !IS_ERR(page)) {
-			if (PageTransHuge(page)) {
+			if (PageTransHuge(page) || PageTeam(page)) {
 				lock_page(page);
 				/*
 				 * Any THP page found by follow_page_mask() may
-				 * have gotten split before reaching
-				 * munlock_vma_page(), so we need to recompute
-				 * the page_mask here.
+				 * be split before reaching munlock_vma_pages()
+				 * so we need to recompute the page_mask here.
 				 */
-				page_mask = munlock_vma_page(page);
+				if (page_mask &&
+				    !PageTeam(page) && !PageHead(page))
+					page_mask = 0;
+				page_mask = munlock_vma_pages(page,
+							page_mask + 1) - 1;
 				unlock_page(page);
 				put_page(page); /* follow_page_mask() */
 			} else {
--- thpfs.orig/mm/rmap.c	2015-02-20 19:34:37.851932430 -0800
+++ thpfs/mm/rmap.c	2015-02-20 19:35:09.995858933 -0800
@@ -1161,6 +1161,8 @@ out:
  */
 void page_remove_rmap(struct page *page)
 {
+	int nr_pages;
+
 	if (!PageAnon(page)) {
 		page_remove_file_rmap(page);
 		return;
@@ -1179,14 +1181,16 @@ void page_remove_rmap(struct page *page)
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	if (PageTransHuge(page))
+	nr_pages = 1;
+	if (PageTransHuge(page)) {
 		__dec_zone_page_state(page, NR_ANON_HUGEPAGES);
+		nr_pages = hpage_nr_pages(page);
+	}
 
-	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
-			      -hpage_nr_pages(page));
+	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr_pages);
 
 	if (unlikely(PageMlocked(page)))
-		clear_page_mlock(page);
+		clear_pages_mlock(page, nr_pages);
 
 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
@@ -1214,6 +1218,7 @@ static int try_to_unmap_one(struct page
 	pte_t pteval;
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
+	int mlock_pages = 1;
 	enum ttu_flags flags = (enum ttu_flags)arg;
 
 	if (unlikely(PageHuge(page))) {
@@ -1241,8 +1246,13 @@ again:
 		return ret;
 
 	if (pmd_trans_huge(pmdval)) {
-		if (pmd_page(pmdval) != page)
-			return ret;
+		if (pmd_page(pmdval) != page) {
+			if (!PageTeam(page) || !(flags & TTU_MUNLOCK))
+				return ret;
+			page = team_head(page);
+			if (pmd_page(pmdval) != page)
+				return ret;
+		}
 
 		ptl = pmd_lock(mm, pmd);
 		if (!pmd_same(*pmd, pmdval)) {
@@ -1251,8 +1261,10 @@ again:
 		}
 
 		if (!(flags & TTU_IGNORE_MLOCK)) {
-			if (vma->vm_flags & VM_LOCKED)
+			if (vma->vm_flags & VM_LOCKED) {
+				mlock_pages = HPAGE_PMD_NR;
 				goto out_mlock;
+			}
 			if (flags & TTU_MUNLOCK)
 				goto out_unmap;
 		}
@@ -1403,7 +1415,7 @@ out_mlock:
 	 */
 	if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
 		if (vma->vm_flags & VM_LOCKED) {
-			mlock_vma_page(page);
+			mlock_vma_pages(page, mlock_pages);
 			ret = SWAP_MLOCK;
 		}
 		up_read(&vma->vm_mm->mmap_sem);
@@ -1706,7 +1718,6 @@ int try_to_munlock(struct page *page)
 	struct rmap_walk_control rwc = {
 		.rmap_one = try_to_unmap_one,
 		.arg = (void *)TTU_MUNLOCK,
-		.done = page_not_mapped,
 		/*
 		 * We don't bother to try to find the munlocked page in
 		 * nonlinears. It's costly. Instead, later, page reclaim logic
@@ -1717,7 +1728,8 @@ int try_to_munlock(struct page *page)
 
 	};
 
-	VM_BUG_ON_PAGE(!PageLocked(page) || PageLRU(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page) && !PageTeam(page), page);
+	VM_BUG_ON_PAGE(PageLRU(page), page);
 
 	ret = rmap_walk(page, &rwc);
 	return ret;
@@ -1823,7 +1835,7 @@ static int rmap_walk_file(struct page *p
 	 * structure at mapping cannot be freed and reused yet,
 	 * so we can safely take mapping->i_mmap_rwsem.
 	 */
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page) && !PageTeam(page), page);
 
 	if (!mapping)
 		return ret;



* [PATCH 22/24] huge tmpfs: fix Mapped meminfo, tracking huge and unhuge mappings
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  4:27   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:27 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

Maintaining Mlocked was the difficult one, but now that it is correctly
tracked, without duplication between the 4kB and 2MB amounts, I think
we have to make a similar effort with Mapped.

But whereas mlock and munlock were already rare and slow operations,
to which we could fairly add a little more overhead in the huge tmpfs
case, ordinary mmap is not something we want to slow down further,
relative to hugetlbfs.

In the Mapped case, I think we can take small or misaligned mmaps of
huge tmpfs files as the exceptional operation, and add a little more
overhead to those, by maintaining another count for them in the head;
and by keeping both hugely and unhugely mapped counts in the one long,
we can rely on cmpxchg to manage their racing transitions atomically.

That's good on 64-bit, but there are not enough free bits in a 32-bit
atomic_long_t team_usage to support this: I think we should continue
to permit huge tmpfs on 32-bit, but accept that Mapped may be doubly
counted there.  (A more serious problem on 32-bit is that it would,
I think, be possible to overflow the huge mapping counter: protection
against that will need to be added.)
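
For illustration only, here is a minimal user-space sketch of the
packed-counter idea, assuming the 64-bit layout described in the patch
below (HPAGE_PMD_ORDER == 9, so the unhugely count sits at bit 12 and
the instantiated/pmd-mapped counter at bit 22); the names are made up
for the sketch and are not the kernel's:

	#include <stdatomic.h>
	#include <stdbool.h>

	#define UNHUGELY_UNIT	(1L << 12)	/* one pte-mapped page     */
	#define MAPPING_UNIT	(1L << 22)	/* one page or pmd counted */

	/* Count one more pte-mapped page, unless the team is disbanded. */
	static bool sketch_inc_unhugely(_Atomic long *team_usage)
	{
		long old = atomic_load(team_usage);

		for (;;) {
			/* disbanded: the upper counter has been reset */
			if (old / MAPPING_UNIT == 0)
				return false;
			/* one cmpxchg updates the count; on failure,
			 * old is refreshed and we retry */
			if (atomic_compare_exchange_weak(team_usage, &old,
						old + UNHUGELY_UNIT))
				return true;
		}
	}

Keeping both counts in the one long is what lets a single cmpxchg
update the unhugely count against a consistent snapshot of the huge
mapping count, without adding a lock to the common mmap path.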

Now that we are maintaining NR_FILE_MAPPED correctly for huge
tmpfs, adjust vmscan's zone_unmapped_file_pages() to exclude
NR_SHMEM_PMDMAPPED, which it clearly would not want included.
What about minimum_image_size() in kernel/power/snapshot.c?  I have
not grasped the basis for that calculation, so I have left it untouched.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 include/linux/memcontrol.h |    5 +
 include/linux/pageteam.h   |  152 ++++++++++++++++++++++++++++++++---
 mm/huge_memory.c           |   40 ++++++++-
 mm/rmap.c                  |   10 +-
 mm/vmscan.c                |    6 +
 5 files changed, 194 insertions(+), 19 deletions(-)

--- thpfs.orig/include/linux/memcontrol.h	2015-02-20 19:33:31.052085168 -0800
+++ thpfs/include/linux/memcontrol.h	2015-02-20 19:35:15.207847015 -0800
@@ -308,6 +308,11 @@ static inline bool mem_cgroup_oom_synchr
 	return false;
 }
 
+static inline void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
+				 enum mem_cgroup_stat_index idx, int val)
+{
+}
+
 static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg,
 					    enum mem_cgroup_stat_index idx)
 {
--- thpfs.orig/include/linux/pageteam.h	2015-02-20 19:35:09.991858941 -0800
+++ thpfs/include/linux/pageteam.h	2015-02-20 19:35:15.207847015 -0800
@@ -30,6 +30,30 @@ static inline struct page *team_head(str
 }
 
 /*
+ * Layout of team head's page->team_usage field, as on x86_64 and arm64_4K:
+ *
+ *  63        32 31          22 21      12     11         10    9          0
+ * +------------+--------------+----------+----------+---------+------------+
+ * | pmd_mapped & instantiated | unhugely | reserved | mlocked | lru_weight |
+ * |   42 bits       10 bits   |  10 bits |  1 bit   |  1 bit  |   10 bits  |
+ * +------------+--------------+----------+----------+---------+------------+
+ *
+ * TEAM_LRU_WEIGHT_ONE               1  (1<<0)
+ * TEAM_LRU_WEIGHT_MASK            3ff  (1<<10)-1
+ * TEAM_HUGELY_MLOCKED             400  (1<<10)
+ * TEAM_RESERVED_FLAG              800  (1<<11)
+ * TEAM_UNHUGELY_COUNTER          1000  (1<<12)
+ * TEAM_UNHUGELY_MASK           3ff000  (1<<22)-(1<<12)
+ * TEAM_PAGE_COUNTER            400000  (1<<22)
+ * TEAM_COMPLETE              80000000  (1<<31)
+ * TEAM_MAPPING_COUNTER         400000  (1<<22)
+ * TEAM_HUGELY_MAPPED         80400000  (1<<31)
+ *
+ * The upper bits count up to TEAM_COMPLETE as pages are instantiated,
+ * and then, above TEAM_COMPLETE, they count huge mappings of the team.
+ * Team tails have team_usage either 1 (lru_weight 1) or 0 (lru_weight 0).
+ */
+/*
  * Mask for lower bits of team_usage, giving the weight 0..HPAGE_PMD_NR of the
  * page on its LRU: normal pages have weight 1, tails held unevictable until
  * head is evicted have weight 0, and the head gathers weight 1..HPAGE_PMD_NR.
@@ -42,8 +66,20 @@ static inline struct page *team_head(str
  */
 #define TEAM_HUGELY_MLOCKED	(1L << (HPAGE_PMD_ORDER + 1))
 #define TEAM_RESERVED_FLAG	(1L << (HPAGE_PMD_ORDER + 2))
-
+#ifdef CONFIG_64BIT
+/*
+ * Count how many pages of team are individually mapped into userspace.
+ */
+#define TEAM_UNHUGELY_COUNTER	(1L << (HPAGE_PMD_ORDER + 3))
+#define TEAM_HIGH_COUNTER	(1L << (2*HPAGE_PMD_ORDER + 4))
+#define TEAM_UNHUGELY_MASK	(TEAM_HIGH_COUNTER - TEAM_UNHUGELY_COUNTER)
+#else /* 32-bit */
+/*
+ * Not enough bits in atomic_long_t: we prefer not to bloat struct page just to
+ * avoid duplication in Mapped, when a page is mapped both hugely and unhugely.
+ */
 #define TEAM_HIGH_COUNTER	(1L << (HPAGE_PMD_ORDER + 3))
+#endif /* CONFIG_64BIT */
 /*
  * Count how many pages of team are instantiated, as it is built up.
  */
@@ -66,22 +102,120 @@ static inline bool team_hugely_mapped(st
 
 /*
  * Returns true if this was the first mapping by pmd, whereupon mapped stats
- * need to be updated.
+ * need to be updated.  Together with the number of pages which then need
+ * to be accounted (can be ignored when false returned): because some team
+ * members may have been mapped unhugely by pte, so already counted as Mapped.
  */
-static inline bool inc_hugely_mapped(struct page *head)
+static inline bool inc_hugely_mapped(struct page *head, int *nr_pages)
 {
-	return atomic_long_add_return(TEAM_MAPPING_COUNTER, &head->team_usage)
-		< TEAM_HUGELY_MAPPED + TEAM_MAPPING_COUNTER;
+	long team_usage;
+
+	team_usage = atomic_long_add_return(TEAM_MAPPING_COUNTER,
+					    &head->team_usage);
+	*nr_pages = HPAGE_PMD_NR -
+#ifdef CONFIG_64BIT
+		(team_usage & TEAM_UNHUGELY_MASK) / TEAM_UNHUGELY_COUNTER;
+#else
+		1;	/* 1 allows for the additional page_add_file_rmap() */
+#endif
+	return team_usage < TEAM_HUGELY_MAPPED + TEAM_MAPPING_COUNTER;
 }
 
 /*
  * Returns true if this was the last mapping by pmd, whereupon mapped stats
- * need to be updated.
+ * need to be updated.  Together with the number of pages which then need
+ * to be accounted (can be ignored when false returned): because some team
+ * members may still be mapped unhugely by pte, so remain counted as Mapped.
+ */
+static inline bool dec_hugely_mapped(struct page *head, int *nr_pages)
+{
+	long team_usage;
+
+	team_usage = atomic_long_sub_return(TEAM_MAPPING_COUNTER,
+					    &head->team_usage);
+	*nr_pages = HPAGE_PMD_NR -
+#ifdef CONFIG_64BIT
+		(team_usage & TEAM_UNHUGELY_MASK) / TEAM_UNHUGELY_COUNTER;
+#else
+		1;	/* 1 allows for the additional page_remove_rmap() */
+#endif
+	return team_usage < TEAM_HUGELY_MAPPED;
+}
+
+/*
+ * Returns true if this pte mapping is of a non-team page, or of a team page not
+ * covered by an existing huge pmd mapping: whereupon stats need to be updated.
+ * Only called when mapcount goes up from 0 to 1 i.e. _mapcount from -1 to 0.
+ */
+static inline bool inc_unhugely_mapped(struct page *page)
+{
+#ifdef CONFIG_64BIT
+	struct page *head;
+	long team_usage;
+	long old;
+
+	if (likely(!PageTeam(page)))
+		return true;
+	head = team_head(page);
+	team_usage = atomic_long_read(&head->team_usage);
+	for (;;) {
+		/* Is team now being disbanded? Stop once team_usage is reset */
+		if (unlikely(!PageTeam(head) ||
+			     team_usage / TEAM_PAGE_COUNTER == 0))
+			return true;
+		/*
+		 * XXX: but despite the impressive-looking cmpxchg, gthelen
+		 * points out that head might be freed and reused and assigned
+		 * a matching value in ->private now: tiny chance, must revisit.
+		 */
+		old = atomic_long_cmpxchg(&head->team_usage,
+			team_usage, team_usage + TEAM_UNHUGELY_COUNTER);
+		if (likely(old == team_usage))
+			break;
+		team_usage = old;
+	}
+	return team_usage < TEAM_HUGELY_MAPPED;
+#else /* 32-bit */
+	return true;
+#endif
+}
+
+/*
+ * Returns true if this pte mapping is of a non-team page, or of a team page not
+ * covered by a remaining huge pmd mapping: whereupon stats need to be updated.
+ * Only called when mapcount goes down from 1 to 0 i.e. _mapcount from 0 to -1.
  */
-static inline bool dec_hugely_mapped(struct page *head)
+static inline bool dec_unhugely_mapped(struct page *page)
 {
-	return atomic_long_sub_return(TEAM_MAPPING_COUNTER, &head->team_usage)
-		< TEAM_HUGELY_MAPPED;
+#ifdef CONFIG_64BIT
+	struct page *head;
+	long team_usage;
+	long old;
+
+	if (likely(!PageTeam(page)))
+		return true;
+	head = team_head(page);
+	team_usage = atomic_long_read(&head->team_usage);
+	for (;;) {
+		/* Is team now being disbanded? Stop once team_usage is reset */
+		if (unlikely(!PageTeam(head) ||
+			     team_usage / TEAM_PAGE_COUNTER == 0))
+			return true;
+		/*
+		 * XXX: but despite the impressive-looking cmpxchg, gthelen
+		 * points out that head might be freed and reused and assigned
+		 * a matching value in ->private now: tiny chance, must revisit.
+		 */
+		old = atomic_long_cmpxchg(&head->team_usage,
+			team_usage, team_usage - TEAM_UNHUGELY_COUNTER);
+		if (likely(old == team_usage))
+			break;
+		team_usage = old;
+	}
+	return team_usage < TEAM_HUGELY_MAPPED + TEAM_MAPPING_COUNTER;
+#else /* 32-bit */
+	return true;
+#endif
 }
 
 static inline void inc_lru_weight(struct page *head)
--- thpfs.orig/mm/huge_memory.c	2015-02-20 19:35:09.991858941 -0800
+++ thpfs/mm/huge_memory.c	2015-02-20 19:35:15.207847015 -0800
@@ -913,8 +913,10 @@ int copy_huge_pmd(struct mm_struct *dst_
 		pmdp_set_wrprotect(src_mm, addr, src_pmd);
 		pmd = pmd_wrprotect(pmd);
 	} else {
+		int nr_pages;	/* not interesting here */
+
 		VM_BUG_ON_PAGE(!PageTeam(src_page), src_page);
-		inc_hugely_mapped(src_page);
+		inc_hugely_mapped(src_page, &nr_pages);
 	}
 	add_mm_counter(dst_mm, PageAnon(src_page) ?
 		MM_ANONPAGES : MM_FILEPAGES, HPAGE_PMD_NR);
@@ -3016,18 +3018,46 @@ void __vma_adjust_trans_huge(struct vm_a
 
 static void page_add_team_rmap(struct page *page)
 {
+	struct mem_cgroup *memcg;
+	unsigned long flags;
+	bool locked;
+	int nr_pages;
+
 	VM_BUG_ON_PAGE(PageAnon(page), page);
 	VM_BUG_ON_PAGE(!PageTeam(page), page);
-	if (inc_hugely_mapped(page))
-		__inc_zone_page_state(page, NR_SHMEM_PMDMAPPED);
+
+	memcg = mem_cgroup_begin_page_stat(page, &locked, &flags);
+	if (inc_hugely_mapped(page, &nr_pages)) {
+		struct zone *zone = page_zone(page);
+
+		__inc_zone_state(zone, NR_SHMEM_PMDMAPPED);
+		__mod_zone_page_state(zone, NR_FILE_MAPPED, nr_pages);
+		mem_cgroup_update_page_stat(memcg,
+				MEM_CGROUP_STAT_FILE_MAPPED, nr_pages);
+	}
+	mem_cgroup_end_page_stat(memcg, &locked, &flags);
 }
 
 static void page_remove_team_rmap(struct page *page)
 {
+	struct mem_cgroup *memcg;
+	unsigned long flags;
+	bool locked;
+	int nr_pages;
+
 	VM_BUG_ON_PAGE(PageAnon(page), page);
 	VM_BUG_ON_PAGE(!PageTeam(page), page);
-	if (dec_hugely_mapped(page))
-		__dec_zone_page_state(page, NR_SHMEM_PMDMAPPED);
+
+	memcg = mem_cgroup_begin_page_stat(page, &locked, &flags);
+	if (dec_hugely_mapped(page, &nr_pages)) {
+		struct zone *zone = page_zone(page);
+
+		__dec_zone_state(zone, NR_SHMEM_PMDMAPPED);
+		__mod_zone_page_state(zone, NR_FILE_MAPPED, -nr_pages);
+		mem_cgroup_update_page_stat(memcg,
+				MEM_CGROUP_STAT_FILE_MAPPED, -nr_pages);
+	}
+	mem_cgroup_end_page_stat(memcg, &locked, &flags);
 }
 
 int map_team_by_pmd(struct vm_area_struct *vma, unsigned long addr,
--- thpfs.orig/mm/rmap.c	2015-02-20 19:35:09.995858933 -0800
+++ thpfs/mm/rmap.c	2015-02-20 19:35:15.207847015 -0800
@@ -1116,7 +1116,8 @@ void page_add_file_rmap(struct page *pag
 	bool locked;
 
 	memcg = mem_cgroup_begin_page_stat(page, &locked, &flags);
-	if (atomic_inc_and_test(&page->_mapcount)) {
+	if (atomic_inc_and_test(&page->_mapcount) &&
+	    inc_unhugely_mapped(page)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_FILE_MAPPED);
 	}
@@ -1144,9 +1145,10 @@ static void page_remove_file_rmap(struct
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	__dec_zone_page_state(page, NR_FILE_MAPPED);
-	mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_FILE_MAPPED);
-
+	if (dec_unhugely_mapped(page)) {
+		__dec_zone_page_state(page, NR_FILE_MAPPED);
+		mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_FILE_MAPPED);
+	}
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
 out:
--- thpfs.orig/mm/vmscan.c	2015-02-20 19:35:04.307871938 -0800
+++ thpfs/mm/vmscan.c	2015-02-20 19:35:15.211847007 -0800
@@ -3602,8 +3602,12 @@ static inline unsigned long zone_unmappe
 	/*
 	 * It's possible for there to be more file mapped pages than
 	 * accounted for by the pages on the file LRU lists because
-	 * tmpfs pages accounted for as ANON can also be FILE_MAPPED
+	 * tmpfs pages accounted for as ANON can also be FILE_MAPPED.
+	 * We don't know how many, beyond the PMDMAPPED excluded below.
 	 */
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		file_mapped -= zone_page_state(zone, NR_SHMEM_PMDMAPPED) <<
+							HPAGE_PMD_ORDER;
 	return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0;
 }
 



* [PATCH 23/24] kvm: plumb return of hva when resolving page fault.
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  4:29   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:29 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, Andres Lagar-Cavilla,
	linux-kernel, linux-mm

From: Andres Lagar-Cavilla <andreslc@google.com>

So we don't have to redo this work later. Note that the hva is not
racy: it is simple arithmetic based on the memslot.

This will be used in the huge tmpfs commits.
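
To illustrate the "simple arithmetic" mentioned above (a simplified
sketch, not the kernel's memslot structure or helpers; the field and
function names here are assumptions made up for the example), the hva
follows directly from the memslot that maps the gfn:

	struct memslot_sketch {
		unsigned long base_gfn;		/* first guest frame in slot */
		unsigned long npages;		/* slot size in pages        */
		unsigned long userspace_addr;	/* hva of base_gfn           */
	};

	/* Returns 0 when gfn falls outside the slot. */
	static unsigned long sketch_gfn_to_hva(const struct memslot_sketch *slot,
					       unsigned long gfn,
					       unsigned int page_shift)
	{
		if (gfn - slot->base_gfn >= slot->npages)
			return 0;
		return slot->userspace_addr +
			((gfn - slot->base_gfn) << page_shift);
	}

No page table walk and no lock is involved, which is why the hva handed
back from try_async_pf() is not racy: it is fixed by the slot layout
rather than by any page table state.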

Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/x86/kvm/mmu.c         |   16 +++++++++++-----
 arch/x86/kvm/paging_tmpl.h |    3 ++-
 include/linux/kvm_host.h   |    2 +-
 virt/kvm/kvm_main.c        |   24 ++++++++++++++----------
 4 files changed, 28 insertions(+), 17 deletions(-)

--- thpfs.orig/arch/x86/kvm/mmu.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/arch/x86/kvm/mmu.c	2015-02-20 19:35:20.095835839 -0800
@@ -2907,7 +2907,8 @@ exit:
 }
 
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
-			 gva_t gva, pfn_t *pfn, bool write, bool *writable);
+			 gva_t gva, pfn_t *pfn, bool write, bool *writable,
+			 unsigned long *hva);
 static void make_mmu_pages_available(struct kvm_vcpu *vcpu);
 
 static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
@@ -2918,6 +2919,7 @@ static int nonpaging_map(struct kvm_vcpu
 	int force_pt_level;
 	pfn_t pfn;
 	unsigned long mmu_seq;
+	unsigned long hva;
 	bool map_writable, write = error_code & PFERR_WRITE_MASK;
 
 	force_pt_level = mapping_level_dirty_bitmap(vcpu, gfn);
@@ -2941,7 +2943,8 @@ static int nonpaging_map(struct kvm_vcpu
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (try_async_pf(vcpu, prefault, gfn, v, &pfn, write, &map_writable))
+	if (try_async_pf(vcpu, prefault, gfn, v, &pfn, write,
+			 &map_writable, &hva))
 		return 0;
 
 	if (handle_abnormal_pfn(vcpu, v, gfn, pfn, ACC_ALL, &r))
@@ -3360,11 +3363,12 @@ static bool can_do_async_pf(struct kvm_v
 }
 
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
-			 gva_t gva, pfn_t *pfn, bool write, bool *writable)
+			 gva_t gva, pfn_t *pfn, bool write, bool *writable,
+			 unsigned long *hva)
 {
 	bool async;
 
-	*pfn = gfn_to_pfn_async(vcpu->kvm, gfn, &async, write, writable);
+	*pfn = gfn_to_pfn_async(vcpu->kvm, gfn, &async, write, writable, hva);
 
 	if (!async)
 		return false; /* *pfn has correct page already */
@@ -3393,6 +3397,7 @@ static int tdp_page_fault(struct kvm_vcp
 	int force_pt_level;
 	gfn_t gfn = gpa >> PAGE_SHIFT;
 	unsigned long mmu_seq;
+	unsigned long hva;
 	int write = error_code & PFERR_WRITE_MASK;
 	bool map_writable;
 
@@ -3423,7 +3428,8 @@ static int tdp_page_fault(struct kvm_vcp
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
+	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write,
+			 &map_writable, &hva))
 		return 0;
 
 	if (handle_abnormal_pfn(vcpu, 0, gfn, pfn, ACC_ALL, &r))
--- thpfs.orig/arch/x86/kvm/paging_tmpl.h	2014-12-07 14:21:05.000000000 -0800
+++ thpfs/arch/x86/kvm/paging_tmpl.h	2015-02-20 19:35:20.095835839 -0800
@@ -709,6 +709,7 @@ static int FNAME(page_fault)(struct kvm_
 	int level = PT_PAGE_TABLE_LEVEL;
 	int force_pt_level;
 	unsigned long mmu_seq;
+	unsigned long hva;
 	bool map_writable, is_self_change_mapping;
 
 	pgprintk("%s: addr %lx err %x\n", __func__, addr, error_code);
@@ -759,7 +760,7 @@ static int FNAME(page_fault)(struct kvm_
 	smp_rmb();
 
 	if (try_async_pf(vcpu, prefault, walker.gfn, addr, &pfn, write_fault,
-			 &map_writable))
+			 &map_writable, &hva))
 		return 0;
 
 	if (handle_abnormal_pfn(vcpu, mmu_is_nested(vcpu) ? 0 : addr,
--- thpfs.orig/include/linux/kvm_host.h	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/include/linux/kvm_host.h	2015-02-20 19:35:20.095835839 -0800
@@ -554,7 +554,7 @@ void kvm_set_page_accessed(struct page *
 
 pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn);
 pfn_t gfn_to_pfn_async(struct kvm *kvm, gfn_t gfn, bool *async,
-		       bool write_fault, bool *writable);
+		       bool write_fault, bool *writable, unsigned long *hva);
 pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
 pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable);
--- thpfs.orig/virt/kvm/kvm_main.c	2015-02-08 18:54:22.000000000 -0800
+++ thpfs/virt/kvm/kvm_main.c	2015-02-20 19:35:20.095835839 -0800
@@ -1328,7 +1328,8 @@ exit:
 
 static pfn_t
 __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn, bool atomic,
-		     bool *async, bool write_fault, bool *writable)
+		     bool *async, bool write_fault, bool *writable,
+		     unsigned long *hva)
 {
 	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
 
@@ -1344,12 +1345,15 @@ __gfn_to_pfn_memslot(struct kvm_memory_s
 		writable = NULL;
 	}
 
+	if (hva)
+		*hva = addr;
+
 	return hva_to_pfn(addr, atomic, async, write_fault,
 			  writable);
 }
 
 static pfn_t __gfn_to_pfn(struct kvm *kvm, gfn_t gfn, bool atomic, bool *async,
-			  bool write_fault, bool *writable)
+			  bool write_fault, bool *writable, unsigned long *hva)
 {
 	struct kvm_memory_slot *slot;
 
@@ -1359,43 +1363,43 @@ static pfn_t __gfn_to_pfn(struct kvm *kv
 	slot = gfn_to_memslot(kvm, gfn);
 
 	return __gfn_to_pfn_memslot(slot, gfn, atomic, async, write_fault,
-				    writable);
+				    writable, hva);
 }
 
 pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn)
 {
-	return __gfn_to_pfn(kvm, gfn, true, NULL, true, NULL);
+	return __gfn_to_pfn(kvm, gfn, true, NULL, true, NULL, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_atomic);
 
 pfn_t gfn_to_pfn_async(struct kvm *kvm, gfn_t gfn, bool *async,
-		       bool write_fault, bool *writable)
+		       bool write_fault, bool *writable, unsigned long *hva)
 {
-	return __gfn_to_pfn(kvm, gfn, false, async, write_fault, writable);
+	return __gfn_to_pfn(kvm, gfn, false, async, write_fault, writable, hva);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_async);
 
 pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
 {
-	return __gfn_to_pfn(kvm, gfn, false, NULL, true, NULL);
+	return __gfn_to_pfn(kvm, gfn, false, NULL, true, NULL, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn);
 
 pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable)
 {
-	return __gfn_to_pfn(kvm, gfn, false, NULL, write_fault, writable);
+	return __gfn_to_pfn(kvm, gfn, false, NULL, write_fault, writable, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
 
 pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL);
+	return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL, NULL);
 }
 
 pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL);
+	return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
 



* [PATCH 24/24] kvm: teach kvm to map page teams as huge pages.
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-21  4:31   ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-02-21  4:31 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, Andres Lagar-Cavilla,
	linux-kernel, linux-mm

From: Andres Lagar-Cavilla <andreslc@google.com>

Include a small treatise on the locking rules around page teams.

Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 arch/x86/kvm/mmu.c         |  155 +++++++++++++++++++++++++++++------
 arch/x86/kvm/paging_tmpl.h |    3 
 2 files changed, 132 insertions(+), 26 deletions(-)

--- thpfs.orig/arch/x86/kvm/mmu.c	2015-02-20 19:35:20.095835839 -0800
+++ thpfs/arch/x86/kvm/mmu.c	2015-02-20 19:35:24.775825138 -0800
@@ -32,6 +32,7 @@
 #include <linux/module.h>
 #include <linux/swap.h>
 #include <linux/hugetlb.h>
+#include <linux/pageteam.h>
 #include <linux/compiler.h>
 #include <linux/srcu.h>
 #include <linux/slab.h>
@@ -2723,7 +2724,106 @@ static int kvm_handle_bad_page(struct kv
 	return -EFAULT;
 }
 
+/*
+ * We are holding kvm->mmu_lock, serializing against mmu notifiers.
+ * We have a ref on page.
+ *
+ * A team of 512 tmpfs pages can be mapped as an integral hugepage as long as
+ * the team is not disbanded. The head page is !PageTeam if disbanded.
+ *
+ * Huge tmpfs pages are disbanded for page freeing, shrinking, or swap out.
+ *
+ * Freeing (punch hole, truncation):
+ *  shmem_undo_range
+ *     disband
+ *       lock head page
+ *       unmap_mapping_range
+ *         zap_page_range_single
+ *           mmu_notifier_invalidate_range_start
+ *           split_huge_page_pmd or zap_huge_pmd
+ *             remap_team_by_ptes
+ *           mmu_notifier_invalidate_range_end
+ *       unlock head page
+ *     pagevec_release
+ *        pages are freed
+ * If we race with disband, MMUN will fix us up. The head page lock also
+ * serializes any gup() against resolving the page team.
+ *
+ * Shrinker, disbands, but once a page team is fully banded up it no longer is
+ * tagged as shrinkable in the radix tree and hence can't be shrunk.
+ *  shmem_shrink_hugehole
+ *     shmem_choose_hugehole
+ *        disband
+ *     migrate_pages
+ *        try_to_unmap
+ *           mmu_notifier_invalidate_page
+ * Double-indemnity: if we race with disband, MMUN will fix us up.
+ *
+ * Swap out:
+ *  shrink_page_list
+ *    try_to_unmap
+ *      unmap_team_by_pmd
+ *         mmu_notifier_invalidate_range
+ *    pageout
+ *      shmem_writepage
+ *         disband
+ *    free_hot_cold_page_list
+ *       pages are freed
+ * If we race with disband, no one will come to fix us up. So, we check for a
+ * pmd mapping, serializing against the MMUN in unmap_team_by_pmd, which will
+ * break the pmd mapping if it runs before us (or invalidate our mapping if it
+ * runs after).
+ *
+ * N.B. migration requires further thought all around.
+ */
+static bool is_huge_tmpfs(struct mm_struct *mm, struct page *page,
+			  unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	struct page *head;
+
+	if (PageAnon(page) || !PageTeam(page))
+		return false;
+	/*
+	 * This strictly assumes PMD-level huge-ing.
+	 * Which is the only thing KVM can handle here.
+	 * N.B. Assume (like everywhere else) PAGE_SIZE == PAGE_CACHE_SIZE.
+	 */
+	if (((address & (HPAGE_PMD_SIZE - 1)) >> PAGE_SHIFT) !=
+	    (page->index & (HPAGE_PMD_NR-1)))
+		return false;
+	head = team_head(page);
+	if (!PageTeam(head))
+		return false;
+	/*
+	 * Attempt at early discard. If the head races into becoming SwapCache,
+	 * and thus having a bogus team_usage, we'll know for sure next.
+	 */
+	if (!team_hugely_mapped(head))
+		return false;
+	/*
+	 * Open code page_check_address_pmd, otherwise we'd have to make it
+	 * a module-visible symbol. Simplify it. No need for page table lock,
+	 * as mmu notifier serialization ensures we are on either side of
+	 * unmap_team_by_pmd or remap_team_by_ptes.
+	 */
+	address &= HPAGE_PMD_MASK;
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		return false;
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return false;
+	pmd = pmd_offset(pud, address);
+	if (!pmd_trans_huge(*pmd))
+		return false;
+	return pmd_page(*pmd) == head;
+}
+
 static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
+					unsigned long address,
 					gfn_t *gfnp, pfn_t *pfnp, int *levelp)
 {
 	pfn_t pfn = *pfnp;
@@ -2737,29 +2837,34 @@ static void transparent_hugepage_adjust(
 	 * here.
 	 */
 	if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
-	    level == PT_PAGE_TABLE_LEVEL &&
-	    PageTransCompound(pfn_to_page(pfn)) &&
-	    !has_wrprotected_page(vcpu->kvm, gfn, PT_DIRECTORY_LEVEL)) {
-		unsigned long mask;
-		/*
-		 * mmu_notifier_retry was successful and we hold the
-		 * mmu_lock here, so the pmd can't become splitting
-		 * from under us, and in turn
-		 * __split_huge_page_refcount() can't run from under
-		 * us and we can safely transfer the refcount from
-		 * PG_tail to PG_head as we switch the pfn to tail to
-		 * head.
-		 */
-		*levelp = level = PT_DIRECTORY_LEVEL;
-		mask = KVM_PAGES_PER_HPAGE(level) - 1;
-		VM_BUG_ON((gfn & mask) != (pfn & mask));
-		if (pfn & mask) {
-			gfn &= ~mask;
-			*gfnp = gfn;
-			kvm_release_pfn_clean(pfn);
-			pfn &= ~mask;
-			kvm_get_pfn(pfn);
-			*pfnp = pfn;
+	    level == PT_PAGE_TABLE_LEVEL) {
+		struct page *page = pfn_to_page(pfn);
+
+		if ((PageTransCompound(page) ||
+		     is_huge_tmpfs(vcpu->kvm->mm, page, address)) &&
+		    !has_wrprotected_page(vcpu->kvm, gfn,
+					  PT_DIRECTORY_LEVEL)) {
+			unsigned long mask;
+			/*
+			 * mmu_notifier_retry was successful and we hold the
+			 * mmu_lock here, so the pmd can't become splitting
+			 * from under us, and in turn
+			 * __split_huge_page_refcount() can't run from under
+			 * us and we can safely transfer the refcount from
+			 * PG_tail to PG_head as we switch the pfn to tail to
+			 * head.
+			 */
+			*levelp = level = PT_DIRECTORY_LEVEL;
+			mask = KVM_PAGES_PER_HPAGE(level) - 1;
+			VM_BUG_ON((gfn & mask) != (pfn & mask));
+			if (pfn & mask) {
+				gfn &= ~mask;
+				*gfnp = gfn;
+				kvm_release_pfn_clean(pfn);
+				pfn &= ~mask;
+				kvm_get_pfn(pfn);
+				*pfnp = pfn;
+			}
 		}
 	}
 }
@@ -2955,7 +3060,7 @@ static int nonpaging_map(struct kvm_vcpu
 		goto out_unlock;
 	make_mmu_pages_available(vcpu);
 	if (likely(!force_pt_level))
-		transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level);
+		transparent_hugepage_adjust(vcpu, hva, &gfn, &pfn, &level);
 	r = __direct_map(vcpu, v, write, map_writable, level, gfn, pfn,
 			 prefault);
 	spin_unlock(&vcpu->kvm->mmu_lock);
@@ -3440,7 +3545,7 @@ static int tdp_page_fault(struct kvm_vcp
 		goto out_unlock;
 	make_mmu_pages_available(vcpu);
 	if (likely(!force_pt_level))
-		transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level);
+		transparent_hugepage_adjust(vcpu, hva, &gfn, &pfn, &level);
 	r = __direct_map(vcpu, gpa, write, map_writable,
 			 level, gfn, pfn, prefault);
 	spin_unlock(&vcpu->kvm->mmu_lock);
--- thpfs.orig/arch/x86/kvm/paging_tmpl.h	2015-02-20 19:35:20.095835839 -0800
+++ thpfs/arch/x86/kvm/paging_tmpl.h	2015-02-20 19:35:24.775825138 -0800
@@ -794,7 +794,8 @@ static int FNAME(page_fault)(struct kvm_
 	kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
 	make_mmu_pages_available(vcpu);
 	if (!force_pt_level)
-		transparent_hugepage_adjust(vcpu, &walker.gfn, &pfn, &level);
+		transparent_hugepage_adjust(vcpu, hva, &walker.gfn, &pfn,
+					    &level);
 	r = FNAME(fetch)(vcpu, addr, &walker, write_fault,
 			 level, pfn, map_writable, prefault);
 	++vcpu->stat.pf_fixed;


* Re: [PATCH 01/24] mm: update_lru_size warn and reset bad lru_size
  2015-02-21  3:51   ` Hugh Dickins
@ 2015-02-23  9:30     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2015-02-23  9:30 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Ning Qu, Andrew Morton,
	linux-kernel, linux-mm

On Fri, Feb 20, 2015 at 07:51:16PM -0800, Hugh Dickins wrote:
> Though debug kernels have a VM_BUG_ON to help protect from misaccounting
> lru_size, non-debug kernels are liable to wrap it around: and then the
> vast unsigned long size draws page reclaim into a loop of repeatedly
> doing nothing on an empty list, without even a cond_resched().
> 
> That soft lockup looks confusingly like an over-busy reclaim scenario,
> with lots of contention on the lruvec lock in shrink_inactive_list():
> yet has a totally different origin.
> 
> Help differentiate with a custom warning in mem_cgroup_update_lru_size(),
> even in non-debug kernels; and reset the size to avoid the lockup.  But
> the particular bug which suggested this change was mine alone, and since
> fixed.

Do we need this kind of check for !MEMCG kernels?
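
(A minimal illustration of the wraparound behind the lockup described in the
quoted commit message -- just the arithmetic, not kernel code:

	#include <stdio.h>

	int main(void)
	{
		unsigned long lru_size = 0;

		lru_size -= 1;	/* wraps to ULONG_MAX */
		/* reclaim would treat this as an enormous, yet empty, list */
		printf("lru_size = %lu\n", lru_size);
		return 0;
	}

hence the appeal of warning and resetting even in non-debug kernels.)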

> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  include/linux/mm_inline.h |    2 +-
>  mm/memcontrol.c           |   24 ++++++++++++++++++++----
>  2 files changed, 21 insertions(+), 5 deletions(-)
> 
> --- thpfs.orig/include/linux/mm_inline.h	2013-11-03 15:41:51.000000000 -0800
> +++ thpfs/include/linux/mm_inline.h	2015-02-20 19:33:25.928096883 -0800
> @@ -35,8 +35,8 @@ static __always_inline void del_page_fro
>  				struct lruvec *lruvec, enum lru_list lru)
>  {
>  	int nr_pages = hpage_nr_pages(page);
> -	mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
>  	list_del(&page->lru);
> +	mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
>  	__mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, -nr_pages);
>  }
>  
> --- thpfs.orig/mm/memcontrol.c	2015-02-08 18:54:22.000000000 -0800
> +++ thpfs/mm/memcontrol.c	2015-02-20 19:33:25.928096883 -0800
> @@ -1296,22 +1296,38 @@ out:
>   * @lru: index of lru list the page is sitting on
>   * @nr_pages: positive when adding or negative when removing
>   *
> - * This function must be called when a page is added to or removed from an
> - * lru list.
> + * This function must be called under lruvec lock, just before a page is added
> + * to or just after a page is removed from an lru list (that ordering being so
> + * as to allow it to check that lru_size 0 is consistent with list_empty).
>   */
>  void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
>  				int nr_pages)
>  {
>  	struct mem_cgroup_per_zone *mz;
>  	unsigned long *lru_size;
> +	long size;
> +	bool empty;
>  
>  	if (mem_cgroup_disabled())
>  		return;
>  
>  	mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
>  	lru_size = mz->lru_size + lru;
> -	*lru_size += nr_pages;
> -	VM_BUG_ON((long)(*lru_size) < 0);
> +	empty = list_empty(lruvec->lists + lru);
> +
> +	if (nr_pages < 0)
> +		*lru_size += nr_pages;
> +
> +	size = *lru_size;
> +	if (WARN(size < 0 || empty != !size,
> +	"mem_cgroup_update_lru_size(%p, %d, %d): lru_size %ld but %sempty\n",
> +			lruvec, lru, nr_pages, size, empty ? "" : "not ")) {

Formatting can be unscrewed this way:

	if (WARN(size < 0 || empty != !size,
		"%s(%p, %d, %d): lru_size %ld but %sempty\n",
		__func__, lruvec, lru, nr_pages, size, empty ? "" : "not ")) {

> +		VM_BUG_ON(1);
> +		*lru_size = 0;
> +	}
> +
> +	if (nr_pages > 0)
> +		*lru_size += nr_pages;
>  }
>  
>  bool mem_cgroup_is_descendant(struct mem_cgroup *memcg, struct mem_cgroup *root)

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 00/24] huge tmpfs: an alternative approach to THPageCache
  2015-02-21  3:49 ` Hugh Dickins
@ 2015-02-23 13:48   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2015-02-23 13:48 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Ning Qu, Andrew Morton,
	linux-kernel, linux-mm

On Fri, Feb 20, 2015 at 07:49:16PM -0800, Hugh Dickins wrote:
> I warned last month that I have been working on "huge tmpfs":
> an implementation of Transparent Huge Page Cache in tmpfs,
> for those who are tired of the limitations of hugetlbfs.
> 
> Here's a fully working patchset, against v3.19 so that you can give it
> a try against a stable base.  I've not yet studied how well it applies
> to current git: probably lots of easily resolved clashes with nonlinear
> removal.  Against mmotm, the rmap.c differences looked nontrivial.
> 
> Fully working?  Well, at present page migration just keeps away from
> these teams of pages.  And once memory pressure has disbanded a team
> to swap it out, there is nothing to put it together again later on,
> to restore the original hugepage performance.  Those must follow,
> but no thought yet (khugepaged? maybe).
> 
> Yes, I realize there's nothing yet under Documentation, nor fs/proc
> beyond meminfo, nor other debug/visibility files: must follow, but
> I've cared more to provide the basic functionality.
> 
> I don't expect to update this patchset in the next few weeks: now that
> it's posted, my priority is look at other people's work before LSF/MM;
> and in particular, of course, your (Kirill's) THP refcounting redesign.

I scanned through the patches to get a general idea of how it works. I'm not
sure that I will have the time and willpower to do proper code-digging before
the summit. I found a few bugs in my own patchset which I want to troubleshoot
first.

One thing I'm not really comfortable with is introducing yet another way
to couple pages together. It's less risky in the short term than my approach,
since fewer existing codepaths are affected, but it raises the maintenance
cost later. I'm not sure that's what we want.

After Johannes' work which added exceptional entries to the normal page cache,
I hoped to see the shmem/tmpfs implementation moving toward the generic page
cache. But this patchset is a step in the other direction -- it makes
shmem/tmpfs even more of a special case. :(

Do you have any insight into how this approach applies to real filesystems?
I don't think there's any showstopper, but it's better to ask early ;)

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 03/24] mm: use __SetPageSwapBacked and don't ClearPageSwapBacked
  2015-02-21  3:56   ` Hugh Dickins
@ 2015-02-25 10:53     ` Mel Gorman
  -1 siblings, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2015-02-25 10:53 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Ning Qu, Andrew Morton,
	linux-kernel, linux-mm

On Fri, Feb 20, 2015 at 07:56:15PM -0800, Hugh Dickins wrote:
> Commit 07a427884348 ("mm: shmem: avoid atomic operation during
> shmem_getpage_gfp") rightly replaced one instance of SetPageSwapBacked
> by __SetPageSwapBacked, pointing out that the newly allocated page is
> not yet visible to other users (except speculative get_page_unless_zero-
> ers, who may not update page flags before their further checks).
> 
> That was part of a series in which Mel was focused on tmpfs profiles:
> but almost all SetPageSwapBacked uses can be so optimized, with the
> same justification.  And remove the ClearPageSwapBacked from
> read_swap_cache_async()'s and zswap_get_swap_cache_page()'s error
> paths: it's not an error to free a page with PG_swapbacked set.
> 
> (There's probably scope for further __SetPageFlags in other places,
> but SwapBacked is the one I'm interested in at the moment.)
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/migrate.c    |    6 +++---
>  mm/rmap.c       |    2 +-
>  mm/shmem.c      |    4 ++--
>  mm/swap_state.c |    3 +--
>  mm/zswap.c      |    3 +--
>  5 files changed, 8 insertions(+), 10 deletions(-)
> 
> <SNIP>
> --- thpfs.orig/mm/shmem.c	2015-02-08 18:54:22.000000000 -0800
> +++ thpfs/mm/shmem.c	2015-02-20 19:33:35.676074594 -0800
> @@ -987,8 +987,8 @@ static int shmem_replace_page(struct pag
>  	flush_dcache_page(newpage);
>  
>  	__set_page_locked(newpage);
> +	__SetPageSwapBacked(newpage);
>  	SetPageUptodate(newpage);
> -	SetPageSwapBacked(newpage);
>  	set_page_private(newpage, swap_index);
>  	SetPageSwapCache(newpage);
>  

It's clear why you did this but ...

> @@ -1177,8 +1177,8 @@ repeat:
>  			goto decused;
>  		}
>  
> -		__SetPageSwapBacked(page);
>  		__set_page_locked(page);
> +		__SetPageSwapBacked(page);
>  		if (sgp == SGP_WRITE)
>  			__SetPageReferenced(page);
>  

It's less clear why this one was necessary. I don't think it causes any
problems, though, so

Reviewed-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs
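
For reference, the difference being reviewed here is only the atomicity of
the flag update; roughly (a sketch of intent, not the exact macro expansion):

	SetPageSwapBacked(page);	/* atomic set_bit(PG_swapbacked, &page->flags) */
	__SetPageSwapBacked(page);	/* non-atomic __set_bit(): safe only while the
					 * page is not yet visible to other users */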

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 11/24] huge tmpfs: shrinker to migrate and free underused holes
  2015-02-21  4:09   ` Hugh Dickins
@ 2015-03-19 16:56     ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 76+ messages in thread
From: Konstantin Khlebnikov @ 2015-03-19 16:56 UTC (permalink / raw)
  To: Hugh Dickins, Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

On 21.02.2015 07:09, Hugh Dickins wrote:
> Using 2MB for each small file is wasteful, and on average even a large
> file is likely to waste 1MB at the end.  We could say that a huge tmpfs
> is only suitable for huge files, but I would much prefer not to limit
> it in that way, and would not be very able to test such a filesystem.
>
> In our model, the unused space in the team is not put on any LRU (nor
> charged to any memcg), so not yet accessible to page reclaim: we need
> a shrinker to disband the team, and free up the unused space, under
> memory pressure.  (Typically the freeable space is at the end, but
> there's no assumption that it's at end of huge page or end of file.)
>
> shmem_shrink_hugehole() is usually called from vmscan's shrink_slabs();
> but I've found a direct call from shmem_alloc_page(), when it fails
> to allocate a huge page (perhaps because too much memory is occupied
> by shmem huge holes), is also helpful before a retry.
>
> But each team holds a valuable resource: an extent of contiguous
> memory that could be used for another team (or for an anonymous THP).
> So try to proceed in such a way as to conserve that resource: rather
> than just freeing the unused space and leaving yet another huge page
> fragmented, also try to migrate the used space to another partially
> occupied huge page.
>
> The algorithm in shmem_choose_hugehole() (find least occupied huge page
> in older half of shrinklist, and migrate its cachepages into the most
> occupied huge page with enough space to fit, again chosen from older
> half of shrinklist) is unlikely to be ideal; but easy to implement as
> a demonstration of the pieces which can be used by any algorithm,
> and good enough for now.  A radix_tree tag helps to locate the
> partially occupied huge pages more quickly: the tag available
> since shmem does not participate in dirty/writeback accounting.
>
> The "team_usage" field added to struct page (in union with "private")
> is somewhat vaguely named: because while the huge page is sparsely
> occupied, it counts the occupancy; but once the huge page is fully
> occupied, it will come to be used differently in a later patch, as
> the huge mapcount (offset by the HPAGE_PMD_NR occupancy) - it is
> never possible to map a sparsely occupied huge page, because that
> would expose stale data to the user.

That might be a problem if this approach is supposed to be used for
normal filesystems. Instead of adding a dedicated counter, shmem could
detect a partially occupied huge page by scanning through all the tail
pages and checking PageUptodate(), and bump the mapcount of all tail
pages to prevent races between mmap and truncate. The overhead shouldn't
be that big, and we could add a fastpath: mark a completely uptodate huge
page with one of the unused page flags (PG_private or something).
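
A rough sketch of the scan Konstantin describes (the helper name is invented
here, and the locking against mmap and truncate that he mentions is left out):

	static bool team_fully_populated(struct page *head)
	{
		int i;

		/* head and its tails are physically contiguous in the 2MB block */
		for (i = 0; i < HPAGE_PMD_NR; i++) {
			if (!PageUptodate(head + i))
				return false;
		}
		return true;
	}

(The quoted shmem_disband_hugetails() iterates the same head .. head +
HPAGE_PMD_NR range.)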

Another (strange) idea is adding a separate array of struct huge_page
to each zone. These would act as headers for huge pages and hold fields
of that kind. Pageblock flags could also be stored there.
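
The shape of that idea, purely hypothetical (no such structure exists in the
tree; names are invented for illustration):

	/* one entry per 2MB block in the zone, parallel to its struct page array */
	struct huge_page {
		atomic_long_t	team_usage;	/* occupancy, later the huge mapcount */
		unsigned long	pageblock_flags;
	};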

>
> With this patch, the ShmemHugePages and ShmemFreeHoles lines of
> /proc/meminfo are shown correctly; but ShmemPmdMapped remains 0.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>   include/linux/migrate.h        |    3
>   include/linux/mm_types.h       |    1
>   include/linux/shmem_fs.h       |    3
>   include/trace/events/migrate.h |    3
>   mm/shmem.c                     |  439 ++++++++++++++++++++++++++++++-
>   5 files changed, 436 insertions(+), 13 deletions(-)
>
> --- thpfs.orig/include/linux/migrate.h	2015-02-08 18:54:22.000000000 -0800
> +++ thpfs/include/linux/migrate.h	2015-02-20 19:34:16.135982083 -0800
> @@ -23,7 +23,8 @@ enum migrate_reason {
>   	MR_SYSCALL,		/* also applies to cpusets */
>   	MR_MEMPOLICY_MBIND,
>   	MR_NUMA_MISPLACED,
> -	MR_CMA
> +	MR_CMA,
> +	MR_SHMEM_HUGEHOLE,
>   };
>
>   #ifdef CONFIG_MIGRATION
> --- thpfs.orig/include/linux/mm_types.h	2015-02-08 18:54:22.000000000 -0800
> +++ thpfs/include/linux/mm_types.h	2015-02-20 19:34:16.135982083 -0800
> @@ -165,6 +165,7 @@ struct page {
>   #endif
>   		struct kmem_cache *slab_cache;	/* SL[AU]B: Pointer to slab */
>   		struct page *first_page;	/* Compound tail pages */
> +		atomic_long_t team_usage;	/* In shmem's PageTeam page */
>   	};
>
>   #ifdef CONFIG_MEMCG
> --- thpfs.orig/include/linux/shmem_fs.h	2015-02-20 19:34:01.464015631 -0800
> +++ thpfs/include/linux/shmem_fs.h	2015-02-20 19:34:16.135982083 -0800
> @@ -19,8 +19,9 @@ struct shmem_inode_info {
>   		unsigned long	swapped;	/* subtotal assigned to swap */
>   		char		*symlink;	/* unswappable short symlink */
>   	};
> -	struct shared_policy	policy;		/* NUMA memory alloc policy */
> +	struct list_head	shrinklist;	/* shrinkable hpage inodes */
>   	struct list_head	swaplist;	/* chain of maybes on swap */
> +	struct shared_policy	policy;		/* NUMA memory alloc policy */
>   	struct simple_xattrs	xattrs;		/* list of xattrs */
>   	struct inode		vfs_inode;
>   };
> --- thpfs.orig/include/trace/events/migrate.h	2014-10-05 12:23:04.000000000 -0700
> +++ thpfs/include/trace/events/migrate.h	2015-02-20 19:34:16.135982083 -0800
> @@ -18,7 +18,8 @@
>   	{MR_SYSCALL,		"syscall_or_cpuset"},		\
>   	{MR_MEMPOLICY_MBIND,	"mempolicy_mbind"},		\
>   	{MR_NUMA_MISPLACED,	"numa_misplaced"},		\
> -	{MR_CMA,		"cma"}
> +	{MR_CMA,		"cma"},				\
> +	{MR_SHMEM_HUGEHOLE,	"shmem_hugehole"}
>
>   TRACE_EVENT(mm_migrate_pages,
>
> --- thpfs.orig/mm/shmem.c	2015-02-20 19:34:06.224004747 -0800
> +++ thpfs/mm/shmem.c	2015-02-20 19:34:16.139982074 -0800
> @@ -58,6 +58,7 @@ static struct vfsmount *shm_mnt;
>   #include <linux/falloc.h>
>   #include <linux/splice.h>
>   #include <linux/security.h>
> +#include <linux/shrinker.h>
>   #include <linux/sysctl.h>
>   #include <linux/swapops.h>
>   #include <linux/pageteam.h>
> @@ -74,6 +75,7 @@ static struct vfsmount *shm_mnt;
>
>   #include <asm/uaccess.h>
>   #include <asm/pgtable.h>
> +#include "internal.h"
>
>   #define BLOCKS_PER_PAGE  (PAGE_CACHE_SIZE/512)
>   #define VM_ACCT(size)    (PAGE_CACHE_ALIGN(size) >> PAGE_SHIFT)
> @@ -306,6 +308,13 @@ static bool shmem_confirm_swap(struct ad
>   #define SHMEM_RETRY_HUGE_PAGE	((struct page *)3)
>   /* otherwise hugehint is the hugeteam page to be used */
>
> +/* tag for shrinker to locate unfilled hugepages */
> +#define SHMEM_TAG_HUGEHOLE	PAGECACHE_TAG_DIRTY
> +
> +static LIST_HEAD(shmem_shrinklist);
> +static unsigned long shmem_shrinklist_depth;
> +static DEFINE_SPINLOCK(shmem_shrinklist_lock);
> +
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>   /* ifdef here to avoid bloating shmem.o when not necessary */
>
> @@ -360,26 +369,104 @@ restart:
>   	return page;
>   }
>
> +static int shmem_freeholes(struct page *head)
> +{
> +	/*
> +	 * Note: team_usage will also be used to count huge mappings,
> +	 * so treat a negative value from shmem_freeholes() as none.
> +	 */
> +	return HPAGE_PMD_NR - atomic_long_read(&head->team_usage);
> +}
> +
> +static void shmem_clear_tag_hugehole(struct address_space *mapping,
> +				     pgoff_t index)
> +{
> +	struct page *page = NULL;
> +
> +	/*
> +	 * The tag was set on the first subpage to be inserted in cache.
> +	 * When written sequentially, or instantiated by a huge fault,
> +	 * it will be on the head page, but that's not always so.  And
> +	 * radix_tree_tag_clear() succeeds when it finds a slot, whether
> +	 * tag was set on it or not.  So first lookup and then clear.
> +	 */
> +	radix_tree_gang_lookup_tag(&mapping->page_tree, (void **)&page,
> +					index, 1, SHMEM_TAG_HUGEHOLE);
> +	VM_BUG_ON(!page || page->index >= index + HPAGE_PMD_NR);
> +	radix_tree_tag_clear(&mapping->page_tree, page->index,
> +					SHMEM_TAG_HUGEHOLE);
> +}
> +
> +static void shmem_added_to_hugeteam(struct page *page, struct zone *zone,
> +				    struct page *hugehint)
> +{
> +	struct address_space *mapping = page->mapping;
> +	struct page *head = team_head(page);
> +	int nr;
> +
> +	if (hugehint == SHMEM_ALLOC_HUGE_PAGE) {
> +		atomic_long_set(&head->team_usage, 1);
> +		radix_tree_tag_set(&mapping->page_tree, page->index,
> +					SHMEM_TAG_HUGEHOLE);
> +		__mod_zone_page_state(zone, NR_SHMEM_FREEHOLES, HPAGE_PMD_NR-1);
> +	} else {
> +		/* We do not need atomic ops until huge page gets mapped */
> +		nr = atomic_long_read(&head->team_usage) + 1;
> +		atomic_long_set(&head->team_usage, nr);
> +		if (nr == HPAGE_PMD_NR) {
> +			shmem_clear_tag_hugehole(mapping, head->index);
> +			__inc_zone_state(zone, NR_SHMEM_HUGEPAGES);
> +		}
> +		__dec_zone_state(zone, NR_SHMEM_FREEHOLES);
> +	}
> +}
> +
>   static int shmem_disband_hugehead(struct page *head)
>   {
>   	struct address_space *mapping;
>   	struct zone *zone;
>   	int nr = -1;
>
> -	mapping = head->mapping;
> -	zone = page_zone(head);
> +	/*
> +	 * Only in the shrinker migration case might head have been truncated.
> +	 * But although head->mapping may then be zeroed at any moment, mapping
> +	 * stays safe because shmem_evict_inode must take our shrinklist lock.
> +	 */
> +	mapping = ACCESS_ONCE(head->mapping);
> +	if (!mapping)
> +		return nr;
>
> +	zone = page_zone(head);
>   	spin_lock_irq(&mapping->tree_lock);
> +
>   	if (PageTeam(head)) {
> +		nr = atomic_long_read(&head->team_usage);
> +		atomic_long_set(&head->team_usage, 0);
> +		/*
> +		 * Disable additions to the team.
> +		 * Ensure head->private is written before PageTeam is
> +		 * cleared, so shmem_writepage() cannot write swap into
> +		 * head->private, then have it overwritten by that 0!
> +		 */
> +		smp_mb__before_atomic();
>   		ClearPageTeam(head);
> -		__dec_zone_state(zone, NR_SHMEM_HUGEPAGES);
> -		nr = 1;
> +
> +		if (nr >= HPAGE_PMD_NR) {
> +			__dec_zone_state(zone, NR_SHMEM_HUGEPAGES);
> +			VM_BUG_ON(nr != HPAGE_PMD_NR);
> +		} else if (nr) {
> +			shmem_clear_tag_hugehole(mapping, head->index);
> +			__mod_zone_page_state(zone, NR_SHMEM_FREEHOLES,
> +						nr - HPAGE_PMD_NR);
> +		} /* else shmem_getpage_gfp disbanding a failed alloced_huge */
>   	}
> +
>   	spin_unlock_irq(&mapping->tree_lock);
>   	return nr;
>   }
>
> -static void shmem_disband_hugetails(struct page *head)
> +static void shmem_disband_hugetails(struct page *head,
> +				    struct list_head *list, int nr)
>   {
>   	struct page *page;
>   	struct page *endpage;
> @@ -387,7 +474,7 @@ static void shmem_disband_hugetails(stru
>   	page = head;
>   	endpage = head + HPAGE_PMD_NR;
>
> -	/* Condition follows in next commit */ {
> +	if (!nr) {
>   		/*
>   		 * The usual case: disbanding team and freeing holes as cold
>   		 * (cold being more likely to preserve high-order extents).
> @@ -403,7 +490,52 @@ static void shmem_disband_hugetails(stru
>   			else if (put_page_testzero(page))
>   				free_hot_cold_page(page, 1);
>   		}
> +	} else if (nr < 0) {
> +		struct zone *zone = page_zone(page);
> +		int orig_nr = nr;
> +		/*
> +		 * Shrinker wants to migrate cache pages from this team.
> +		 */
> +		if (!PageSwapBacked(page)) {	/* head was not in cache */
> +			page->mapping = NULL;
> +			if (put_page_testzero(page))
> +				free_hot_cold_page(page, 1);
> +		} else if (isolate_lru_page(page) == 0) {
> +			list_add_tail(&page->lru, list);
> +			nr++;
> +		}
> +		while (++page < endpage) {
> +			if (PageTeam(page)) {
> +				if (isolate_lru_page(page) == 0) {
> +					list_add_tail(&page->lru, list);
> +					nr++;
> +				}
> +				ClearPageTeam(page);
> +			} else if (put_page_testzero(page))
> +				free_hot_cold_page(page, 1);
> +		}
> +		/* Yes, shmem counts in NR_ISOLATED_ANON but NR_FILE_PAGES */
> +		mod_zone_page_state(zone, NR_ISOLATED_ANON, nr - orig_nr);
> +	} else {
> +		/*
> +		 * Shrinker wants free pages from this team to migrate into.
> +		 */
> +		if (!PageSwapBacked(page)) {	/* head was not in cache */
> +			page->mapping = NULL;
> +			list_add_tail(&page->lru, list);
> +			nr--;
> +		}
> +		while (++page < endpage) {
> +			if (PageTeam(page))
> +				ClearPageTeam(page);
> +			else if (nr) {
> +				list_add_tail(&page->lru, list);
> +				nr--;
> +			} else if (put_page_testzero(page))
> +				free_hot_cold_page(page, 1);
> +		}
>   	}
> +	VM_BUG_ON(nr > 0);	/* maybe a few were not isolated */
>   }
>
>   static void shmem_disband_hugeteam(struct page *page)
> @@ -445,12 +577,252 @@ static void shmem_disband_hugeteam(struc
>   	if (head != page)
>   		unlock_page(head);
>   	if (nr_used >= 0)
> -		shmem_disband_hugetails(head);
> +		shmem_disband_hugetails(head, NULL, 0);
>   	if (head != page)
>   		page_cache_release(head);
>   	preempt_enable();
>   }
>
> +static struct page *shmem_get_hugehole(struct address_space *mapping,
> +				       unsigned long *index)
> +{
> +	struct page *page;
> +	struct page *head;
> +
> +	rcu_read_lock();
> +	while (radix_tree_gang_lookup_tag(&mapping->page_tree, (void **)&page,
> +					  *index, 1, SHMEM_TAG_HUGEHOLE)) {
> +		if (radix_tree_exception(page))
> +			continue;
> +		if (!page_cache_get_speculative(page))
> +			continue;
> +		if (!PageTeam(page) || page->mapping != mapping)
> +			goto release;
> +		head = team_head(page);
> +		if (head != page) {
> +			if (!page_cache_get_speculative(head))
> +				goto release;
> +			page_cache_release(page);
> +			page = head;
> +			if (!PageTeam(page) || page->mapping != mapping)
> +				goto release;
> +		}
> +		if (shmem_freeholes(head) > 0) {
> +			rcu_read_unlock();
> +			*index = head->index + HPAGE_PMD_NR;
> +			return head;
> +		}
> +release:
> +		page_cache_release(page);
> +	}
> +	rcu_read_unlock();
> +	return NULL;
> +}
> +
> +static unsigned long shmem_choose_hugehole(struct list_head *fromlist,
> +					   struct list_head *tolist)
> +{
> +	unsigned long freed = 0;
> +	unsigned long double_depth;
> +	struct list_head *this, *next;
> +	struct shmem_inode_info *info;
> +	struct address_space *mapping;
> +	struct page *frompage = NULL;
> +	struct page *topage = NULL;
> +	struct page *page;
> +	pgoff_t index;
> +	int fromused;
> +	int toused;
> +	int nid;
> +
> +	double_depth = 0;
> +	spin_lock(&shmem_shrinklist_lock);
> +	list_for_each_safe(this, next, &shmem_shrinklist) {
> +		info = list_entry(this, struct shmem_inode_info, shrinklist);
> +		mapping = info->vfs_inode.i_mapping;
> +		if (!radix_tree_tagged(&mapping->page_tree,
> +					SHMEM_TAG_HUGEHOLE)) {
> +			list_del_init(&info->shrinklist);
> +			shmem_shrinklist_depth--;
> +			continue;
> +		}
> +		index = 0;
> +		while ((page = shmem_get_hugehole(mapping, &index))) {
> +			/* Choose to migrate from page with least in use */
> +			if (!frompage ||
> +			    shmem_freeholes(page) > shmem_freeholes(frompage)) {
> +				if (frompage)
> +					page_cache_release(frompage);
> +				frompage = page;
> +				if (shmem_freeholes(page) == HPAGE_PMD_NR-1) {
> +					/* No point searching further */
> +					double_depth = -3;
> +					break;
> +				}
> +			} else
> +				page_cache_release(page);
> +		}
> +
> +		/* Only reclaim from the older half of the shrinklist */
> +		double_depth += 2;
> +		if (double_depth >= min(shmem_shrinklist_depth, 2000UL))
> +			break;
> +	}
> +
> +	if (!frompage)
> +		goto unlock;
> +	preempt_disable();
> +	fromused = shmem_disband_hugehead(frompage);
> +	spin_unlock(&shmem_shrinklist_lock);
> +	if (fromused > 0)
> +		shmem_disband_hugetails(frompage, fromlist, -fromused);
> +	preempt_enable();
> +	nid = page_to_nid(frompage);
> +	page_cache_release(frompage);
> +
> +	if (fromused <= 0)
> +		return 0;
> +	freed = HPAGE_PMD_NR - fromused;
> +	if (fromused > HPAGE_PMD_NR/2)
> +		return freed;
> +
> +	double_depth = 0;
> +	spin_lock(&shmem_shrinklist_lock);
> +	list_for_each_safe(this, next, &shmem_shrinklist) {
> +		info = list_entry(this, struct shmem_inode_info, shrinklist);
> +		mapping = info->vfs_inode.i_mapping;
> +		if (!radix_tree_tagged(&mapping->page_tree,
> +					SHMEM_TAG_HUGEHOLE)) {
> +			list_del_init(&info->shrinklist);
> +			shmem_shrinklist_depth--;
> +			continue;
> +		}
> +		index = 0;
> +		while ((page = shmem_get_hugehole(mapping, &index))) {
> +			/* Choose to migrate to page with just enough free */
> +			if (shmem_freeholes(page) >= fromused &&
> +			    page_to_nid(page) == nid) {
> +				if (!topage || shmem_freeholes(page) <
> +					      shmem_freeholes(topage)) {
> +					if (topage)
> +						page_cache_release(topage);
> +					topage = page;
> +					if (shmem_freeholes(page) == fromused) {
> +						/* No point searching further */
> +						double_depth = -3;
> +						break;
> +					}
> +				} else
> +					page_cache_release(page);
> +			} else
> +				page_cache_release(page);
> +		}
> +
> +		/* Only reclaim from the older half of the shrinklist */
> +		double_depth += 2;
> +		if (double_depth >= min(shmem_shrinklist_depth, 2000UL))
> +			break;
> +	}
> +
> +	if (!topage)
> +		goto unlock;
> +	preempt_disable();
> +	toused = shmem_disband_hugehead(topage);
> +	spin_unlock(&shmem_shrinklist_lock);
> +	if (toused > 0) {
> +		if (HPAGE_PMD_NR - toused >= fromused)
> +			shmem_disband_hugetails(topage, tolist, fromused);
> +		else
> +			shmem_disband_hugetails(topage, NULL, 0);
> +		freed += HPAGE_PMD_NR - toused;
> +	}
> +	preempt_enable();
> +	page_cache_release(topage);
> +	return freed;
> +unlock:
> +	spin_unlock(&shmem_shrinklist_lock);
> +	return freed;
> +}
> +
> +static struct page *shmem_get_migrate_page(struct page *frompage,
> +					   unsigned long private, int **result)
> +{
> +	struct list_head *tolist = (struct list_head *)private;
> +	struct page *topage;
> +
> +	VM_BUG_ON(list_empty(tolist));
> +	topage = list_first_entry(tolist, struct page, lru);
> +	list_del(&topage->lru);
> +	return topage;
> +}
> +
> +static void shmem_put_migrate_page(struct page *topage, unsigned long private)
> +{
> +	struct list_head *tolist = (struct list_head *)private;
> +
> +	list_add(&topage->lru, tolist);
> +}
> +
> +static void shmem_putback_migrate_pages(struct list_head *tolist)
> +{
> +	struct page *topage;
> +	struct page *next;
> +
> +	/*
> +	 * The tolist pages were not counted in NR_ISOLATED, so stats
> +	 * would go wrong if putback_movable_pages() were used on them.
> +	 * Indeed, even putback_lru_page() is wrong for these pages.
> +	 */
> +	list_for_each_entry_safe(topage, next, tolist, lru) {
> +		list_del(&topage->lru);
> +		if (put_page_testzero(topage))
> +			free_hot_cold_page(topage, 1);
> +	}
> +}
> +
> +static unsigned long shmem_shrink_hugehole(struct shrinker *shrink,
> +					   struct shrink_control *sc)
> +{
> +	unsigned long freed;
> +	LIST_HEAD(fromlist);
> +	LIST_HEAD(tolist);
> +
> +	freed = shmem_choose_hugehole(&fromlist, &tolist);
> +	if (list_empty(&fromlist))
> +		return SHRINK_STOP;
> +	if (!list_empty(&tolist)) {
> +		migrate_pages(&fromlist, shmem_get_migrate_page,
> +			      shmem_put_migrate_page, (unsigned long)&tolist,
> +			      MIGRATE_SYNC, MR_SHMEM_HUGEHOLE);
> +		preempt_disable();
> +		drain_local_pages(NULL);  /* try to preserve huge freed page */
> +		preempt_enable();
> +		shmem_putback_migrate_pages(&tolist);
> +	}
> +	putback_movable_pages(&fromlist); /* if any were left behind */
> +	return freed;
> +}
> +
> +static unsigned long shmem_count_hugehole(struct shrinker *shrink,
> +					  struct shrink_control *sc)
> +{
> +	/*
> +	 * Huge hole space is not charged to any memcg:
> +	 * only shrink it for global reclaim.
> +	 * But at present we're only called for global reclaim anyway.
> +	 */
> +	if (list_empty(&shmem_shrinklist))
> +		return 0;
> +	return global_page_state(NR_SHMEM_FREEHOLES);
> +}
> +
> +static struct shrinker shmem_hugehole_shrinker = {
> +	.count_objects = shmem_count_hugehole,
> +	.scan_objects = shmem_shrink_hugehole,
> +	.seeks = DEFAULT_SEEKS,		/* would another value work better? */
> +	.batch = HPAGE_PMD_NR,		/* would another value work better? */
> +};
> +
>   #else /* !CONFIG_TRANSPARENT_HUGEPAGE */
>
>   #define shmem_huge SHMEM_HUGE_DENY
> @@ -466,6 +838,17 @@ static inline void shmem_disband_hugetea
>   {
>   	BUILD_BUG();
>   }
> +
> +static inline void shmem_added_to_hugeteam(struct page *page,
> +				struct zone *zone, struct page *hugehint)
> +{
> +}
> +
> +static inline unsigned long shmem_shrink_hugehole(struct shrinker *shrink,
> +						  struct shrink_control *sc)
> +{
> +	return 0;
> +}
>   #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
>   /*
> @@ -508,10 +891,10 @@ shmem_add_to_page_cache(struct page *pag
>   		goto errout;
>   	}
>
> -	if (!PageTeam(page))
> +	if (PageTeam(page))
> +		shmem_added_to_hugeteam(page, zone, hugehint);
> +	else
>   		page_cache_get(page);
> -	else if (hugehint == SHMEM_ALLOC_HUGE_PAGE)
> -		__inc_zone_state(zone, NR_SHMEM_HUGEPAGES);
>
>   	mapping->nrpages++;
>   	__inc_zone_state(zone, NR_FILE_PAGES);
> @@ -839,6 +1222,14 @@ static void shmem_evict_inode(struct ino
>   		shmem_unacct_size(info->flags, inode->i_size);
>   		inode->i_size = 0;
>   		shmem_truncate_range(inode, 0, (loff_t)-1);
> +		if (!list_empty(&info->shrinklist)) {
> +			spin_lock(&shmem_shrinklist_lock);
> +			if (!list_empty(&info->shrinklist)) {
> +				list_del_init(&info->shrinklist);
> +				shmem_shrinklist_depth--;
> +			}
> +			spin_unlock(&shmem_shrinklist_lock);
> +		}
>   		if (!list_empty(&info->swaplist)) {
>   			mutex_lock(&shmem_swaplist_mutex);
>   			list_del_init(&info->swaplist);
> @@ -1189,10 +1580,18 @@ static struct page *shmem_alloc_page(gfp
>   		if (*hugehint == SHMEM_ALLOC_HUGE_PAGE) {
>   			head = alloc_pages_vma(gfp|__GFP_NORETRY|__GFP_NOWARN,
>   				HPAGE_PMD_ORDER, &pvma, 0, numa_node_id());
> +			if (!head) {
> +				shmem_shrink_hugehole(NULL, NULL);
> +				head = alloc_pages_vma(
> +					gfp|__GFP_NORETRY|__GFP_NOWARN,
> +					HPAGE_PMD_ORDER, &pvma, 0,
> +					numa_node_id());
> +			}
>   			if (head) {
>   				split_page(head, HPAGE_PMD_ORDER);
>
>   				/* Prepare head page for add_to_page_cache */
> +				atomic_long_set(&head->team_usage, 0);
>   				__SetPageTeam(head);
>   				head->mapping = mapping;
>   				head->index = round_down(index, HPAGE_PMD_NR);
> @@ -1504,6 +1903,21 @@ repeat:
>   		if (sgp == SGP_WRITE)
>   			__SetPageReferenced(page);
>   		/*
> +		 * Might we see !list_empty a moment before the shrinker
> +		 * removes this inode from its list?  Unlikely, since we
> +		 * already set a tag in the tree.  Some barrier required?
> +		 */
> +		if (alloced_huge && list_empty(&info->shrinklist)) {
> +			spin_lock(&shmem_shrinklist_lock);
> +			if (list_empty(&info->shrinklist)) {
> +				list_add_tail(&info->shrinklist,
> +					      &shmem_shrinklist);
> +				shmem_shrinklist_depth++;
> +			}
> +			spin_unlock(&shmem_shrinklist_lock);
> +		}
> +
> +		/*
>   		 * Let SGP_FALLOC use the SGP_WRITE optimization on a new page.
>   		 */
>   		if (sgp == SGP_FALLOC)
> @@ -1724,6 +2138,7 @@ static struct inode *shmem_get_inode(str
>   		spin_lock_init(&info->lock);
>   		info->seals = F_SEAL_SEAL;
>   		info->flags = flags & VM_NORESERVE;
> +		INIT_LIST_HEAD(&info->shrinklist);
>   		INIT_LIST_HEAD(&info->swaplist);
>   		simple_xattrs_init(&info->xattrs);
>   		cache_no_acl(inode);
> @@ -3564,6 +3979,10 @@ int __init shmem_init(void)
>   		printk(KERN_ERR "Could not kern_mount tmpfs\n");
>   		goto out1;
>   	}
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	register_shrinker(&shmem_hugehole_shrinker);
> +#endif
>   	return 0;
>
>   out1:
>


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 11/24] huge tmpfs: shrinker to migrate and free underused holes
@ 2015-03-19 16:56     ` Konstantin Khlebnikov
  0 siblings, 0 replies; 76+ messages in thread
From: Konstantin Khlebnikov @ 2015-03-19 16:56 UTC (permalink / raw)
  To: Hugh Dickins, Kirill A. Shutemov
  Cc: Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

On 21.02.2015 07:09, Hugh Dickins wrote:
> Using 2MB for each small file is wasteful, and on average even a large
> file is likely to waste 1MB at the end.  We could say that a huge tmpfs
> is only suitable for huge files, but I would much prefer not to limit
> it in that way, and would not be very able to test such a filesystem.
>
> In our model, the unused space in the team is not put on any LRU (nor
> charged to any memcg), so not yet accessible to page reclaim: we need
> a shrinker to disband the team, and free up the unused space, under
> memory pressure.  (Typically the freeable space is at the end, but
> there's no assumption that it's at end of huge page or end of file.)
>
> shmem_shrink_hugehole() is usually called from vmscan's shrink_slabs();
> but I've found a direct call from shmem_alloc_page(), when it fails
> to allocate a huge page (perhaps because too much memory is occupied
> by shmem huge holes), is also helpful before a retry.
>
> But each team holds a valuable resource: an extent of contiguous
> memory that could be used for another team (or for an anonymous THP).
> So try to proceed in such a way as to conserve that resource: rather
> than just freeing the unused space and leaving yet another huge page
> fragmented, also try to migrate the used space to another partially
> occupied huge page.
>
> The algorithm in shmem_choose_hugehole() (find least occupied huge page
> in older half of shrinklist, and migrate its cachepages into the most
> occupied huge page with enough space to fit, again chosen from older
> half of shrinklist) is unlikely to be ideal; but easy to implement as
> a demonstration of the pieces which can be used by any algorithm,
> and good enough for now.  A radix_tree tag helps to locate the
> partially occupied huge pages more quickly: the tag available
> since shmem does not participate in dirty/writeback accounting.
>
> The "team_usage" field added to struct page (in union with "private")
> is somewhat vaguely named: because while the huge page is sparsely
> occupied, it counts the occupancy; but once the huge page is fully
> occupied, it will come to be used differently in a later patch, as
> the huge mapcount (offset by the HPAGE_PMD_NR occupancy) - it is
> never possible to map a sparsely occupied huge page, because that
> would expose stale data to the user.

That might be a problem if this approach is supposed to be used for
normal filesystems. Instead of adding dedicated counter shmem could
detect partially occupied page by scanning though all tail pages and
checking PageUptodate() and bump mapcount for all tail pages prevent
races between mmap and truncate. Overhead shouldn't be that big, also
we can add fastpath - mark completely uptodate page with one of unused
page flag (PG_private or something).

Another (strange) idea is adding separate array of struct huge_page
into each zone. They will work as headers for huge pages and hold
that kind of fields. Pageblock flags also could be stored here.

>
> With this patch, the ShmemHugePages and ShmemFreeHoles lines of
> /proc/meminfo are shown correctly; but ShmemPmdMapped remains 0.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>   include/linux/migrate.h        |    3
>   include/linux/mm_types.h       |    1
>   include/linux/shmem_fs.h       |    3
>   include/trace/events/migrate.h |    3
>   mm/shmem.c                     |  439 ++++++++++++++++++++++++++++++-
>   5 files changed, 436 insertions(+), 13 deletions(-)
>
> --- thpfs.orig/include/linux/migrate.h	2015-02-08 18:54:22.000000000 -0800
> +++ thpfs/include/linux/migrate.h	2015-02-20 19:34:16.135982083 -0800
> @@ -23,7 +23,8 @@ enum migrate_reason {
>   	MR_SYSCALL,		/* also applies to cpusets */
>   	MR_MEMPOLICY_MBIND,
>   	MR_NUMA_MISPLACED,
> -	MR_CMA
> +	MR_CMA,
> +	MR_SHMEM_HUGEHOLE,
>   };
>
>   #ifdef CONFIG_MIGRATION
> --- thpfs.orig/include/linux/mm_types.h	2015-02-08 18:54:22.000000000 -0800
> +++ thpfs/include/linux/mm_types.h	2015-02-20 19:34:16.135982083 -0800
> @@ -165,6 +165,7 @@ struct page {
>   #endif
>   		struct kmem_cache *slab_cache;	/* SL[AU]B: Pointer to slab */
>   		struct page *first_page;	/* Compound tail pages */
> +		atomic_long_t team_usage;	/* In shmem's PageTeam page */
>   	};
>
>   #ifdef CONFIG_MEMCG
> --- thpfs.orig/include/linux/shmem_fs.h	2015-02-20 19:34:01.464015631 -0800
> +++ thpfs/include/linux/shmem_fs.h	2015-02-20 19:34:16.135982083 -0800
> @@ -19,8 +19,9 @@ struct shmem_inode_info {
>   		unsigned long	swapped;	/* subtotal assigned to swap */
>   		char		*symlink;	/* unswappable short symlink */
>   	};
> -	struct shared_policy	policy;		/* NUMA memory alloc policy */
> +	struct list_head	shrinklist;	/* shrinkable hpage inodes */
>   	struct list_head	swaplist;	/* chain of maybes on swap */
> +	struct shared_policy	policy;		/* NUMA memory alloc policy */
>   	struct simple_xattrs	xattrs;		/* list of xattrs */
>   	struct inode		vfs_inode;
>   };
> --- thpfs.orig/include/trace/events/migrate.h	2014-10-05 12:23:04.000000000 -0700
> +++ thpfs/include/trace/events/migrate.h	2015-02-20 19:34:16.135982083 -0800
> @@ -18,7 +18,8 @@
>   	{MR_SYSCALL,		"syscall_or_cpuset"},		\
>   	{MR_MEMPOLICY_MBIND,	"mempolicy_mbind"},		\
>   	{MR_NUMA_MISPLACED,	"numa_misplaced"},		\
> -	{MR_CMA,		"cma"}
> +	{MR_CMA,		"cma"},				\
> +	{MR_SHMEM_HUGEHOLE,	"shmem_hugehole"}
>
>   TRACE_EVENT(mm_migrate_pages,
>
> --- thpfs.orig/mm/shmem.c	2015-02-20 19:34:06.224004747 -0800
> +++ thpfs/mm/shmem.c	2015-02-20 19:34:16.139982074 -0800
> @@ -58,6 +58,7 @@ static struct vfsmount *shm_mnt;
>   #include <linux/falloc.h>
>   #include <linux/splice.h>
>   #include <linux/security.h>
> +#include <linux/shrinker.h>
>   #include <linux/sysctl.h>
>   #include <linux/swapops.h>
>   #include <linux/pageteam.h>
> @@ -74,6 +75,7 @@ static struct vfsmount *shm_mnt;
>
>   #include <asm/uaccess.h>
>   #include <asm/pgtable.h>
> +#include "internal.h"
>
>   #define BLOCKS_PER_PAGE  (PAGE_CACHE_SIZE/512)
>   #define VM_ACCT(size)    (PAGE_CACHE_ALIGN(size) >> PAGE_SHIFT)
> @@ -306,6 +308,13 @@ static bool shmem_confirm_swap(struct ad
>   #define SHMEM_RETRY_HUGE_PAGE	((struct page *)3)
>   /* otherwise hugehint is the hugeteam page to be used */
>
> +/* tag for shrinker to locate unfilled hugepages */
> +#define SHMEM_TAG_HUGEHOLE	PAGECACHE_TAG_DIRTY
> +
> +static LIST_HEAD(shmem_shrinklist);
> +static unsigned long shmem_shrinklist_depth;
> +static DEFINE_SPINLOCK(shmem_shrinklist_lock);
> +
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>   /* ifdef here to avoid bloating shmem.o when not necessary */
>
> @@ -360,26 +369,104 @@ restart:
>   	return page;
>   }
>
> +static int shmem_freeholes(struct page *head)
> +{
> +	/*
> +	 * Note: team_usage will also be used to count huge mappings,
> +	 * so treat a negative value from shmem_freeholes() as none.
> +	 */
> +	return HPAGE_PMD_NR - atomic_long_read(&head->team_usage);
> +}
> +
> +static void shmem_clear_tag_hugehole(struct address_space *mapping,
> +				     pgoff_t index)
> +{
> +	struct page *page = NULL;
> +
> +	/*
> +	 * The tag was set on the first subpage to be inserted in cache.
> +	 * When written sequentially, or instantiated by a huge fault,
> +	 * it will be on the head page, but that's not always so.  And
> +	 * radix_tree_tag_clear() succeeds when it finds a slot, whether
> +	 * tag was set on it or not.  So first lookup and then clear.
> +	 */
> +	radix_tree_gang_lookup_tag(&mapping->page_tree, (void **)&page,
> +					index, 1, SHMEM_TAG_HUGEHOLE);
> +	VM_BUG_ON(!page || page->index >= index + HPAGE_PMD_NR);
> +	radix_tree_tag_clear(&mapping->page_tree, page->index,
> +					SHMEM_TAG_HUGEHOLE);
> +}
> +
> +static void shmem_added_to_hugeteam(struct page *page, struct zone *zone,
> +				    struct page *hugehint)
> +{
> +	struct address_space *mapping = page->mapping;
> +	struct page *head = team_head(page);
> +	int nr;
> +
> +	if (hugehint == SHMEM_ALLOC_HUGE_PAGE) {
> +		atomic_long_set(&head->team_usage, 1);
> +		radix_tree_tag_set(&mapping->page_tree, page->index,
> +					SHMEM_TAG_HUGEHOLE);
> +		__mod_zone_page_state(zone, NR_SHMEM_FREEHOLES, HPAGE_PMD_NR-1);
> +	} else {
> +		/* We do not need atomic ops until huge page gets mapped */
> +		nr = atomic_long_read(&head->team_usage) + 1;
> +		atomic_long_set(&head->team_usage, nr);
> +		if (nr == HPAGE_PMD_NR) {
> +			shmem_clear_tag_hugehole(mapping, head->index);
> +			__inc_zone_state(zone, NR_SHMEM_HUGEPAGES);
> +		}
> +		__dec_zone_state(zone, NR_SHMEM_FREEHOLES);
> +	}
> +}
> +
>   static int shmem_disband_hugehead(struct page *head)
>   {
>   	struct address_space *mapping;
>   	struct zone *zone;
>   	int nr = -1;
>
> -	mapping = head->mapping;
> -	zone = page_zone(head);
> +	/*
> +	 * Only in the shrinker migration case might head have been truncated.
> +	 * But although head->mapping may then be zeroed at any moment, mapping
> +	 * stays safe because shmem_evict_inode must take our shrinklist lock.
> +	 */
> +	mapping = ACCESS_ONCE(head->mapping);
> +	if (!mapping)
> +		return nr;
>
> +	zone = page_zone(head);
>   	spin_lock_irq(&mapping->tree_lock);
> +
>   	if (PageTeam(head)) {
> +		nr = atomic_long_read(&head->team_usage);
> +		atomic_long_set(&head->team_usage, 0);
> +		/*
> +		 * Disable additions to the team.
> +		 * Ensure head->private is written before PageTeam is
> +		 * cleared, so shmem_writepage() cannot write swap into
> +		 * head->private, then have it overwritten by that 0!
> +		 */
> +		smp_mb__before_atomic();
>   		ClearPageTeam(head);
> -		__dec_zone_state(zone, NR_SHMEM_HUGEPAGES);
> -		nr = 1;
> +
> +		if (nr >= HPAGE_PMD_NR) {
> +			__dec_zone_state(zone, NR_SHMEM_HUGEPAGES);
> +			VM_BUG_ON(nr != HPAGE_PMD_NR);
> +		} else if (nr) {
> +			shmem_clear_tag_hugehole(mapping, head->index);
> +			__mod_zone_page_state(zone, NR_SHMEM_FREEHOLES,
> +						nr - HPAGE_PMD_NR);
> +		} /* else shmem_getpage_gfp disbanding a failed alloced_huge */
>   	}
> +
>   	spin_unlock_irq(&mapping->tree_lock);
>   	return nr;
>   }
>
> -static void shmem_disband_hugetails(struct page *head)
> +static void shmem_disband_hugetails(struct page *head,
> +				    struct list_head *list, int nr)
>   {
>   	struct page *page;
>   	struct page *endpage;
> @@ -387,7 +474,7 @@ static void shmem_disband_hugetails(stru
>   	page = head;
>   	endpage = head + HPAGE_PMD_NR;
>
> -	/* Condition follows in next commit */ {
> +	if (!nr) {
>   		/*
>   		 * The usual case: disbanding team and freeing holes as cold
>   		 * (cold being more likely to preserve high-order extents).
> @@ -403,7 +490,52 @@ static void shmem_disband_hugetails(stru
>   			else if (put_page_testzero(page))
>   				free_hot_cold_page(page, 1);
>   		}
> +	} else if (nr < 0) {
> +		struct zone *zone = page_zone(page);
> +		int orig_nr = nr;
> +		/*
> +		 * Shrinker wants to migrate cache pages from this team.
> +		 */
> +		if (!PageSwapBacked(page)) {	/* head was not in cache */
> +			page->mapping = NULL;
> +			if (put_page_testzero(page))
> +				free_hot_cold_page(page, 1);
> +		} else if (isolate_lru_page(page) == 0) {
> +			list_add_tail(&page->lru, list);
> +			nr++;
> +		}
> +		while (++page < endpage) {
> +			if (PageTeam(page)) {
> +				if (isolate_lru_page(page) == 0) {
> +					list_add_tail(&page->lru, list);
> +					nr++;
> +				}
> +				ClearPageTeam(page);
> +			} else if (put_page_testzero(page))
> +				free_hot_cold_page(page, 1);
> +		}
> +		/* Yes, shmem counts in NR_ISOLATED_ANON but NR_FILE_PAGES */
> +		mod_zone_page_state(zone, NR_ISOLATED_ANON, nr - orig_nr);
> +	} else {
> +		/*
> +		 * Shrinker wants free pages from this team to migrate into.
> +		 */
> +		if (!PageSwapBacked(page)) {	/* head was not in cache */
> +			page->mapping = NULL;
> +			list_add_tail(&page->lru, list);
> +			nr--;
> +		}
> +		while (++page < endpage) {
> +			if (PageTeam(page))
> +				ClearPageTeam(page);
> +			else if (nr) {
> +				list_add_tail(&page->lru, list);
> +				nr--;
> +			} else if (put_page_testzero(page))
> +				free_hot_cold_page(page, 1);
> +		}
>   	}
> +	VM_BUG_ON(nr > 0);	/* maybe a few were not isolated */
>   }
>
>   static void shmem_disband_hugeteam(struct page *page)
> @@ -445,12 +577,252 @@ static void shmem_disband_hugeteam(struc
>   	if (head != page)
>   		unlock_page(head);
>   	if (nr_used >= 0)
> -		shmem_disband_hugetails(head);
> +		shmem_disband_hugetails(head, NULL, 0);
>   	if (head != page)
>   		page_cache_release(head);
>   	preempt_enable();
>   }
>
> +static struct page *shmem_get_hugehole(struct address_space *mapping,
> +				       unsigned long *index)
> +{
> +	struct page *page;
> +	struct page *head;
> +
> +	rcu_read_lock();
> +	while (radix_tree_gang_lookup_tag(&mapping->page_tree, (void **)&page,
> +					  *index, 1, SHMEM_TAG_HUGEHOLE)) {
> +		if (radix_tree_exception(page))
> +			continue;
> +		if (!page_cache_get_speculative(page))
> +			continue;
> +		if (!PageTeam(page) || page->mapping != mapping)
> +			goto release;
> +		head = team_head(page);
> +		if (head != page) {
> +			if (!page_cache_get_speculative(head))
> +				goto release;
> +			page_cache_release(page);
> +			page = head;
> +			if (!PageTeam(page) || page->mapping != mapping)
> +				goto release;
> +		}
> +		if (shmem_freeholes(head) > 0) {
> +			rcu_read_unlock();
> +			*index = head->index + HPAGE_PMD_NR;
> +			return head;
> +		}
> +release:
> +		page_cache_release(page);
> +	}
> +	rcu_read_unlock();
> +	return NULL;
> +}
> +
> +static unsigned long shmem_choose_hugehole(struct list_head *fromlist,
> +					   struct list_head *tolist)
> +{
> +	unsigned long freed = 0;
> +	unsigned long double_depth;
> +	struct list_head *this, *next;
> +	struct shmem_inode_info *info;
> +	struct address_space *mapping;
> +	struct page *frompage = NULL;
> +	struct page *topage = NULL;
> +	struct page *page;
> +	pgoff_t index;
> +	int fromused;
> +	int toused;
> +	int nid;
> +
> +	double_depth = 0;
> +	spin_lock(&shmem_shrinklist_lock);
> +	list_for_each_safe(this, next, &shmem_shrinklist) {
> +		info = list_entry(this, struct shmem_inode_info, shrinklist);
> +		mapping = info->vfs_inode.i_mapping;
> +		if (!radix_tree_tagged(&mapping->page_tree,
> +					SHMEM_TAG_HUGEHOLE)) {
> +			list_del_init(&info->shrinklist);
> +			shmem_shrinklist_depth--;
> +			continue;
> +		}
> +		index = 0;
> +		while ((page = shmem_get_hugehole(mapping, &index))) {
> +			/* Choose to migrate from page with least in use */
> +			if (!frompage ||
> +			    shmem_freeholes(page) > shmem_freeholes(frompage)) {
> +				if (frompage)
> +					page_cache_release(frompage);
> +				frompage = page;
> +				if (shmem_freeholes(page) == HPAGE_PMD_NR-1) {
> +					/* No point searching further */
> +					double_depth = -3;
> +					break;
> +				}
> +			} else
> +				page_cache_release(page);
> +		}
> +
> +		/* Only reclaim from the older half of the shrinklist */
> +		double_depth += 2;
> +		if (double_depth >= min(shmem_shrinklist_depth, 2000UL))
> +			break;
> +	}
> +
> +	if (!frompage)
> +		goto unlock;
> +	preempt_disable();
> +	fromused = shmem_disband_hugehead(frompage);
> +	spin_unlock(&shmem_shrinklist_lock);
> +	if (fromused > 0)
> +		shmem_disband_hugetails(frompage, fromlist, -fromused);
> +	preempt_enable();
> +	nid = page_to_nid(frompage);
> +	page_cache_release(frompage);
> +
> +	if (fromused <= 0)
> +		return 0;
> +	freed = HPAGE_PMD_NR - fromused;
> +	if (fromused > HPAGE_PMD_NR/2)
> +		return freed;
> +
> +	double_depth = 0;
> +	spin_lock(&shmem_shrinklist_lock);
> +	list_for_each_safe(this, next, &shmem_shrinklist) {
> +		info = list_entry(this, struct shmem_inode_info, shrinklist);
> +		mapping = info->vfs_inode.i_mapping;
> +		if (!radix_tree_tagged(&mapping->page_tree,
> +					SHMEM_TAG_HUGEHOLE)) {
> +			list_del_init(&info->shrinklist);
> +			shmem_shrinklist_depth--;
> +			continue;
> +		}
> +		index = 0;
> +		while ((page = shmem_get_hugehole(mapping, &index))) {
> +			/* Choose to migrate to page with just enough free */
> +			if (shmem_freeholes(page) >= fromused &&
> +			    page_to_nid(page) == nid) {
> +				if (!topage || shmem_freeholes(page) <
> +					      shmem_freeholes(topage)) {
> +					if (topage)
> +						page_cache_release(topage);
> +					topage = page;
> +					if (shmem_freeholes(page) == fromused) {
> +						/* No point searching further */
> +						double_depth = -3;
> +						break;
> +					}
> +				} else
> +					page_cache_release(page);
> +			} else
> +				page_cache_release(page);
> +		}
> +
> +		/* Only reclaim from the older half of the shrinklist */
> +		double_depth += 2;
> +		if (double_depth >= min(shmem_shrinklist_depth, 2000UL))
> +			break;
> +	}
> +
> +	if (!topage)
> +		goto unlock;
> +	preempt_disable();
> +	toused = shmem_disband_hugehead(topage);
> +	spin_unlock(&shmem_shrinklist_lock);
> +	if (toused > 0) {
> +		if (HPAGE_PMD_NR - toused >= fromused)
> +			shmem_disband_hugetails(topage, tolist, fromused);
> +		else
> +			shmem_disband_hugetails(topage, NULL, 0);
> +		freed += HPAGE_PMD_NR - toused;
> +	}
> +	preempt_enable();
> +	page_cache_release(topage);
> +	return freed;
> +unlock:
> +	spin_unlock(&shmem_shrinklist_lock);
> +	return freed;
> +}
> +
> +static struct page *shmem_get_migrate_page(struct page *frompage,
> +					   unsigned long private, int **result)
> +{
> +	struct list_head *tolist = (struct list_head *)private;
> +	struct page *topage;
> +
> +	VM_BUG_ON(list_empty(tolist));
> +	topage = list_first_entry(tolist, struct page, lru);
> +	list_del(&topage->lru);
> +	return topage;
> +}
> +
> +static void shmem_put_migrate_page(struct page *topage, unsigned long private)
> +{
> +	struct list_head *tolist = (struct list_head *)private;
> +
> +	list_add(&topage->lru, tolist);
> +}
> +
> +static void shmem_putback_migrate_pages(struct list_head *tolist)
> +{
> +	struct page *topage;
> +	struct page *next;
> +
> +	/*
> +	 * The tolist pages were not counted in NR_ISOLATED, so stats
> +	 * would go wrong if putback_movable_pages() were used on them.
> +	 * Indeed, even putback_lru_page() is wrong for these pages.
> +	 */
> +	list_for_each_entry_safe(topage, next, tolist, lru) {
> +		list_del(&topage->lru);
> +		if (put_page_testzero(topage))
> +			free_hot_cold_page(topage, 1);
> +	}
> +}
> +
> +static unsigned long shmem_shrink_hugehole(struct shrinker *shrink,
> +					   struct shrink_control *sc)
> +{
> +	unsigned long freed;
> +	LIST_HEAD(fromlist);
> +	LIST_HEAD(tolist);
> +
> +	freed = shmem_choose_hugehole(&fromlist, &tolist);
> +	if (list_empty(&fromlist))
> +		return SHRINK_STOP;
> +	if (!list_empty(&tolist)) {
> +		migrate_pages(&fromlist, shmem_get_migrate_page,
> +			      shmem_put_migrate_page, (unsigned long)&tolist,
> +			      MIGRATE_SYNC, MR_SHMEM_HUGEHOLE);
> +		preempt_disable();
> +		drain_local_pages(NULL);  /* try to preserve huge freed page */
> +		preempt_enable();
> +		shmem_putback_migrate_pages(&tolist);
> +	}
> +	putback_movable_pages(&fromlist); /* if any were left behind */
> +	return freed;
> +}
> +
> +static unsigned long shmem_count_hugehole(struct shrinker *shrink,
> +					  struct shrink_control *sc)
> +{
> +	/*
> +	 * Huge hole space is not charged to any memcg:
> +	 * only shrink it for global reclaim.
> +	 * But at present we're only called for global reclaim anyway.
> +	 */
> +	if (list_empty(&shmem_shrinklist))
> +		return 0;
> +	return global_page_state(NR_SHMEM_FREEHOLES);
> +}
> +
> +static struct shrinker shmem_hugehole_shrinker = {
> +	.count_objects = shmem_count_hugehole,
> +	.scan_objects = shmem_shrink_hugehole,
> +	.seeks = DEFAULT_SEEKS,		/* would another value work better? */
> +	.batch = HPAGE_PMD_NR,		/* would another value work better? */
> +};
> +
>   #else /* !CONFIG_TRANSPARENT_HUGEPAGE */
>
>   #define shmem_huge SHMEM_HUGE_DENY
> @@ -466,6 +838,17 @@ static inline void shmem_disband_hugetea
>   {
>   	BUILD_BUG();
>   }
> +
> +static inline void shmem_added_to_hugeteam(struct page *page,
> +				struct zone *zone, struct page *hugehint)
> +{
> +}
> +
> +static inline unsigned long shmem_shrink_hugehole(struct shrinker *shrink,
> +						  struct shrink_control *sc)
> +{
> +	return 0;
> +}
>   #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
>   /*
> @@ -508,10 +891,10 @@ shmem_add_to_page_cache(struct page *pag
>   		goto errout;
>   	}
>
> -	if (!PageTeam(page))
> +	if (PageTeam(page))
> +		shmem_added_to_hugeteam(page, zone, hugehint);
> +	else
>   		page_cache_get(page);
> -	else if (hugehint == SHMEM_ALLOC_HUGE_PAGE)
> -		__inc_zone_state(zone, NR_SHMEM_HUGEPAGES);
>
>   	mapping->nrpages++;
>   	__inc_zone_state(zone, NR_FILE_PAGES);
> @@ -839,6 +1222,14 @@ static void shmem_evict_inode(struct ino
>   		shmem_unacct_size(info->flags, inode->i_size);
>   		inode->i_size = 0;
>   		shmem_truncate_range(inode, 0, (loff_t)-1);
> +		if (!list_empty(&info->shrinklist)) {
> +			spin_lock(&shmem_shrinklist_lock);
> +			if (!list_empty(&info->shrinklist)) {
> +				list_del_init(&info->shrinklist);
> +				shmem_shrinklist_depth--;
> +			}
> +			spin_unlock(&shmem_shrinklist_lock);
> +		}
>   		if (!list_empty(&info->swaplist)) {
>   			mutex_lock(&shmem_swaplist_mutex);
>   			list_del_init(&info->swaplist);
> @@ -1189,10 +1580,18 @@ static struct page *shmem_alloc_page(gfp
>   		if (*hugehint == SHMEM_ALLOC_HUGE_PAGE) {
>   			head = alloc_pages_vma(gfp|__GFP_NORETRY|__GFP_NOWARN,
>   				HPAGE_PMD_ORDER, &pvma, 0, numa_node_id());
> +			if (!head) {
> +				shmem_shrink_hugehole(NULL, NULL);
> +				head = alloc_pages_vma(
> +					gfp|__GFP_NORETRY|__GFP_NOWARN,
> +					HPAGE_PMD_ORDER, &pvma, 0,
> +					numa_node_id());
> +			}
>   			if (head) {
>   				split_page(head, HPAGE_PMD_ORDER);
>
>   				/* Prepare head page for add_to_page_cache */
> +				atomic_long_set(&head->team_usage, 0);
>   				__SetPageTeam(head);
>   				head->mapping = mapping;
>   				head->index = round_down(index, HPAGE_PMD_NR);
> @@ -1504,6 +1903,21 @@ repeat:
>   		if (sgp == SGP_WRITE)
>   			__SetPageReferenced(page);
>   		/*
> +		 * Might we see !list_empty a moment before the shrinker
> +		 * removes this inode from its list?  Unlikely, since we
> +		 * already set a tag in the tree.  Some barrier required?
> +		 */
> +		if (alloced_huge && list_empty(&info->shrinklist)) {
> +			spin_lock(&shmem_shrinklist_lock);
> +			if (list_empty(&info->shrinklist)) {
> +				list_add_tail(&info->shrinklist,
> +					      &shmem_shrinklist);
> +				shmem_shrinklist_depth++;
> +			}
> +			spin_unlock(&shmem_shrinklist_lock);
> +		}
> +
> +		/*
>   		 * Let SGP_FALLOC use the SGP_WRITE optimization on a new page.
>   		 */
>   		if (sgp == SGP_FALLOC)
> @@ -1724,6 +2138,7 @@ static struct inode *shmem_get_inode(str
>   		spin_lock_init(&info->lock);
>   		info->seals = F_SEAL_SEAL;
>   		info->flags = flags & VM_NORESERVE;
> +		INIT_LIST_HEAD(&info->shrinklist);
>   		INIT_LIST_HEAD(&info->swaplist);
>   		simple_xattrs_init(&info->xattrs);
>   		cache_no_acl(inode);
> @@ -3564,6 +3979,10 @@ int __init shmem_init(void)
>   		printk(KERN_ERR "Could not kern_mount tmpfs\n");
>   		goto out1;
>   	}
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	register_shrinker(&shmem_hugehole_shrinker);
> +#endif
>   	return 0;
>
>   out1:
>
> --

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 00/24] huge tmpfs: an alternative approach to THPageCache
  2015-02-23 13:48   ` Kirill A. Shutemov
@ 2015-03-23  2:25     ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-03-23  2:25 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Kirill A. Shutemov, Andrea Arcangeli, Ning Qu,
	Andrew Morton, linux-kernel, linux-mm

On Mon, 23 Feb 2015, Kirill A. Shutemov wrote:
> 
> I scanned through the patches to get a general idea of how it works.

Thanks!

> I'm not
> sure that I will have time and willpower to do proper code-digging before
> the summit. I found a few bugs in my patchset which I want to troubleshoot
> first.

Yes, I agree that should take priority.

> 
> One thing I'm not really comfortable with is introducing yet another way
> to couple pages together. It's less risky in the short term than my approach
> -- fewer existing codepaths affected, but it raises the maintenance cost
> later. Not sure it's what we want.

Yes, I appreciate your reluctance to add another way of achieving the
same thing.  I still believe that compound pages were a wrong direction
for THP; but until I've posted an implementation of anon THP my way,
and you've posted an implementation of huge tmpfs your way, it's going
to be hard to compare the advantages and disadvantages of each, to
decide between them.

And (as we said at LSF/MM) we each have a priority to attend to before
that: I need to support page migration, and recovery of hugeness after
swap; and you your bugfixes.  (The only bug I've noticed in mine since
posting, a consequence of developing on an earlier release then not
reauditing pmd_trans, is that I need to relax your VM_BUG_ON_VMA in
mm/mremap.c move_page_tables().)

For now, huge tmpfs is giving us useful "transparent hugetlbfs"
functionality, and we're happy to continue developing it that way;
but can switch it over to compound pages, if they win the argument
without sacrificing too much.

> 
> After Johannes' work which added exceptional entries to the normal page
> cache, I hoped to see the shmem/tmpfs implementation moving toward the
> generic page cache. But this patchset is a step in the other direction --
> it makes shmem/tmpfs even more special-cased. :(

Well, Johannes's use for the exceptional entries was rather different
from tmpfs's.  I think tmpfs will always be a special case, and one
especially entitled to huge pages, and that does not distress me at
all - though I wasn't deaf to Chris Mason asking for huge pages too.

(I do wonder if Boaz and persistent memory and the dynamic 4k struct
pages discussion will overtake and re-inform both of our designs.)

> 
> Do you have any insights on how this approach applies to real filesystems?
> I don't think there's any show stopper, but better to ask early ;)

The not-quite-a-show-stopper is my use of page->private, as Konstantin
observes in other mail: I'll muse on that a little in replying to him.

Aside from the page->private issue, the changes outside of shmem.c
should be easily applicable to other filesystems, and some of them
perhaps already useful to you.

But frankly I've given next to no thought as to how easily the code
added in shmem.c could be moved out and used for others: tmpfs was
where we wanted it.

Hugh

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 01/24] mm: update_lru_size warn and reset bad lru_size
  2015-02-23  9:30     ` Kirill A. Shutemov
@ 2015-03-23  2:44       ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-03-23  2:44 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Kirill A. Shutemov, Andrea Arcangeli, Ning Qu,
	Andrew Morton, linux-kernel, linux-mm

On Mon, 23 Feb 2015, Kirill A. Shutemov wrote:
> On Fri, Feb 20, 2015 at 07:51:16PM -0800, Hugh Dickins wrote:
> > Though debug kernels have a VM_BUG_ON to help protect from misaccounting
> > lru_size, non-debug kernels are liable to wrap it around: and then the
> > vast unsigned long size draws page reclaim into a loop of repeatedly
> > doing nothing on an empty list, without even a cond_resched().
> > 
> > That soft lockup looks confusingly like an over-busy reclaim scenario,
> > with lots of contention on the lruvec lock in shrink_inactive_list():
> > yet has a totally different origin.
> > 
> > Help differentiate with a custom warning in mem_cgroup_update_lru_size(),
> > even in non-debug kernels; and reset the size to avoid the lockup.  But
> > the particular bug which suggested this change was mine alone, and since
> > fixed.
> 
> Do we need this kind of check for !MEMCG kernels?

I hope we don't: I hope that the MEMCG case can be a good enough canary
to catch the issues for !MEMCG too.  I thought that the !MEMCG stats were
maintained in such a different (per-cpu) way, whose batching would defeat
such checks without imposing unwelcome overhead - or am I wrong on that?
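
To illustrate what I mean by batching (just a conceptual sketch, nothing
like the real vmstat code, and every name below is invented): each CPU
keeps a small local delta and only folds it into the global count once a
threshold is crossed, so the global value lags and cannot be compared
against list_empty at any instant.

#define NR_CPUS		64
#define BATCH_THRESHOLD	32

struct batched_counter {
	long global;			/* approximate, lags the truth */
	long cpu_delta[NR_CPUS];	/* unfolded per-CPU deltas */
};

/* Add nr on this cpu; fold into the global count only past the threshold. */
static void counter_mod(struct batched_counter *c, int cpu, long nr)
{
	long delta = c->cpu_delta[cpu] + nr;

	if (delta > BATCH_THRESHOLD || delta < -BATCH_THRESHOLD) {
		c->global += delta;	/* exact again, briefly */
		delta = 0;
	}
	c->cpu_delta[cpu] = delta;	/* global may now disagree with the list */
}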

> 
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > ---
> >  include/linux/mm_inline.h |    2 +-
> >  mm/memcontrol.c           |   24 ++++++++++++++++++++----
> >  2 files changed, 21 insertions(+), 5 deletions(-)
> > 
> > --- thpfs.orig/include/linux/mm_inline.h	2013-11-03 15:41:51.000000000 -0800
> > +++ thpfs/include/linux/mm_inline.h	2015-02-20 19:33:25.928096883 -0800
> > @@ -35,8 +35,8 @@ static __always_inline void del_page_fro
> >  				struct lruvec *lruvec, enum lru_list lru)
> >  {
> >  	int nr_pages = hpage_nr_pages(page);
> > -	mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
> >  	list_del(&page->lru);
> > +	mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
> >  	__mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, -nr_pages);
> >  }
> >  
> > --- thpfs.orig/mm/memcontrol.c	2015-02-08 18:54:22.000000000 -0800
> > +++ thpfs/mm/memcontrol.c	2015-02-20 19:33:25.928096883 -0800
> > @@ -1296,22 +1296,38 @@ out:
> >   * @lru: index of lru list the page is sitting on
> >   * @nr_pages: positive when adding or negative when removing
> >   *
> > - * This function must be called when a page is added to or removed from an
> > - * lru list.
> > + * This function must be called under lruvec lock, just before a page is added
> > + * to or just after a page is removed from an lru list (that ordering being so
> > + * as to allow it to check that lru_size 0 is consistent with list_empty).
> >   */
> >  void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> >  				int nr_pages)
> >  {
> >  	struct mem_cgroup_per_zone *mz;
> >  	unsigned long *lru_size;
> > +	long size;
> > +	bool empty;
> >  
> >  	if (mem_cgroup_disabled())
> >  		return;
> >  
> >  	mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
> >  	lru_size = mz->lru_size + lru;
> > -	*lru_size += nr_pages;
> > -	VM_BUG_ON((long)(*lru_size) < 0);
> > +	empty = list_empty(lruvec->lists + lru);
> > +
> > +	if (nr_pages < 0)
> > +		*lru_size += nr_pages;
> > +
> > +	size = *lru_size;
> > +	if (WARN(size < 0 || empty != !size,
> > +	"mem_cgroup_update_lru_size(%p, %d, %d): lru_size %ld but %sempty\n",
> > +			lruvec, lru, nr_pages, size, empty ? "" : "not ")) {
> 
> Formatting can be unscrewed this way:
> 
> 	if (WARN(size < 0 || empty != !size,
> 		"%s(%p, %d, %d): lru_size %ld but %sempty\n",
> 		__func__, lruvec, lru, nr_pages, size, empty ? "" : "not ")) {

Indeed, thanks.  Greg Thelen had made the same suggestion for a different
reason, I just didn't get to incorporate it this time around, but will do
better next time.

I don't expect to be reposting the whole series very soon, unless
someone asks for it: I think migration and recovery ought to be
supported before reposting.  But I might send a mini-series of
this and the other preparatory patches, maybe.

> 
> > +		VM_BUG_ON(1);
> > +		*lru_size = 0;
> > +	}
> > +
> > +	if (nr_pages > 0)
> > +		*lru_size += nr_pages;
> >  }
> >  
> >  bool mem_cgroup_is_descendant(struct mem_cgroup *memcg, struct mem_cgroup *root)
> 
> -- 
>  Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 03/24] mm: use __SetPageSwapBacked and don't ClearPageSwapBacked
  2015-02-25 10:53     ` Mel Gorman
@ 2015-03-23  3:01       ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-03-23  3:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Hugh Dickins, Kirill A. Shutemov, Andrea Arcangeli, Ning Qu,
	Andrew Morton, linux-kernel, linux-mm

On Wed, 25 Feb 2015, Mel Gorman wrote:
> On Fri, Feb 20, 2015 at 07:56:15PM -0800, Hugh Dickins wrote:
> > Commit 07a427884348 ("mm: shmem: avoid atomic operation during
> > shmem_getpage_gfp") rightly replaced one instance of SetPageSwapBacked
> > by __SetPageSwapBacked, pointing out that the newly allocated page is
> > not yet visible to other users (except speculative get_page_unless_zero-
> > ers, who may not update page flags before their further checks).
> > 
> > That was part of a series in which Mel was focused on tmpfs profiles:
> > but almost all SetPageSwapBacked uses can be so optimized, with the
> > same justification.  And remove the ClearPageSwapBacked from
> > read_swap_cache_async()'s and zswap_get_swap_cache_page()'s error
> > paths: it's not an error to free a page with PG_swapbacked set.
> > 
> > (There's probably scope for further __SetPageFlags in other places,
> > but SwapBacked is the one I'm interested in at the moment.)
> > 
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > ---
> >  mm/migrate.c    |    6 +++---
> >  mm/rmap.c       |    2 +-
> >  mm/shmem.c      |    4 ++--
> >  mm/swap_state.c |    3 +--
> >  mm/zswap.c      |    3 +--
> >  5 files changed, 8 insertions(+), 10 deletions(-)
> > 
> > <SNIP>
> > --- thpfs.orig/mm/shmem.c	2015-02-08 18:54:22.000000000 -0800
> > +++ thpfs/mm/shmem.c	2015-02-20 19:33:35.676074594 -0800
> > @@ -987,8 +987,8 @@ static int shmem_replace_page(struct pag
> >  	flush_dcache_page(newpage);
> >  
> >  	__set_page_locked(newpage);
> > +	__SetPageSwapBacked(newpage);
> >  	SetPageUptodate(newpage);
> > -	SetPageSwapBacked(newpage);
> >  	set_page_private(newpage, swap_index);
> >  	SetPageSwapCache(newpage);
> >  
> 
> It's clear why you did this but ...
> 
> > @@ -1177,8 +1177,8 @@ repeat:
> >  			goto decused;
> >  		}
> >  
> > -		__SetPageSwapBacked(page);
> >  		__set_page_locked(page);
> > +		__SetPageSwapBacked(page);
> >  		if (sgp == SGP_WRITE)
> >  			__SetPageReferenced(page);
> >  
> 
> It's less clear why this was necessary.

I don't think the reordering was necessary in either case
(though perhaps the first hunk makes a subsequent patch smaller).
I just get irritated by seeing the same lines of code permuted
in different ways for no reason, and thought I'd tidy them up
to establish one familiar sequence, that's all.

> I don't think it causes any problems though so
> 
> Reviewed-by: Mel Gorman <mgorman@suse.de>

Thanks!

Hugh

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 11/24] huge tmpfs: shrinker to migrate and free underused holes
  2015-03-19 16:56     ` Konstantin Khlebnikov
@ 2015-03-23  4:40       ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-03-23  4:40 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Hugh Dickins, Kirill A. Shutemov, Andrea Arcangeli, Ning Qu,
	Andrew Morton, linux-kernel, linux-mm

On Thu, 19 Mar 2015, Konstantin Khlebnikov wrote:
> On 21.02.2015 07:09, Hugh Dickins wrote:
> > 
> > The "team_usage" field added to struct page (in union with "private")
> > is somewhat vaguely named: because while the huge page is sparsely
> > occupied, it counts the occupancy; but once the huge page is fully
> > occupied, it will come to be used differently in a later patch, as
> > the huge mapcount (offset by the HPAGE_PMD_NR occupancy) - it is
> > never possible to map a sparsely occupied huge page, because that
> > would expose stale data to the user.
> 
> That might be a problem if this approach is supposed to be used for
> normal filesystems.

Yes, most filesystems have their own use for page->private.
My concern at this stage has just been to have a good implementation
for tmpfs, but Kirill and others are certainly interested in looking
beyond that.

> Instead of adding a dedicated counter, shmem could
> detect a partially occupied page by scanning through all tail pages and
> checking PageUptodate(), and bump the mapcount for all tail pages to prevent
> races between mmap and truncate. The overhead shouldn't be that big; also
> we could add a fast path - mark a completely uptodate page with one of the
> unused page flags (PG_private or something).

I do already use PageChecked (PG_owner_priv_1) for just that purpose:
noting all subpages Uptodate (and marked Dirty) when first mapping by
pmd (in 12/24).

But don't bump mapcount on the subpages, just the head: I don't mind
doing a pass down the subpages when it's first hugely mapped, but prefer
to avoid such a pass on every huge map and unmap - seems unnecessary.
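
The shape of that one-time pass is roughly as below - a simplified sketch
only, not the actual 12/24 code (the function name is invented, and the
locking and error handling are omitted):

/*
 * Simplified sketch, not the patchset's code: on the first huge (pmd)
 * mapping of a team, make one pass down the subpages marking them all
 * Uptodate and Dirty, and record that with PageChecked on the head so
 * later huge mappings can skip the pass.
 */
static void sketch_first_huge_mapping(struct page *head)
{
	struct page *page;

	if (PageChecked(head))		/* pass already done earlier */
		return;

	for (page = head; page < head + HPAGE_PMD_NR; page++) {
		SetPageUptodate(page);
		SetPageDirty(page);
	}
	SetPageChecked(head);
}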

The team_usage (== private) field ends up with three or four separate
counts (and an mlocked flag) packed into it: I expect we could trade
some of those counts for scans down the 512 subpages when necessary,
but I doubt it's a good tradeoff; and keeping atomicity would be
difficult (I've never wanted to have to take page_lock or somesuch
on every page in zap_pte_range).  Without atomicity the stats go wrong
(I think Kirill has a problem of that kind in his page_remove_rmap scan).
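
For illustration only (the field widths and helper names below are
invented, not the layout the patches actually use), packing a couple of
counts into one atomic_long_t such as the team_usage field added earlier
in this series looks something like:

#include <linux/atomic.h>
#include <linux/mm_types.h>

/* Invented layout: low bits count pages in cache, high bits the huge mapcount. */
#define TEAM_CACHE_BITS		10	/* enough for HPAGE_PMD_NR == 512 */
#define TEAM_CACHE_MASK		((1L << TEAM_CACHE_BITS) - 1)
#define TEAM_MAPCOUNT_UNIT	(1L << TEAM_CACHE_BITS)

static inline long team_pages_in_cache(struct page *head)
{
	return atomic_long_read(&head->team_usage) & TEAM_CACHE_MASK;
}

static inline long team_huge_mapcount(struct page *head)
{
	return atomic_long_read(&head->team_usage) >> TEAM_CACHE_BITS;
}

static inline void team_page_added(struct page *head)
{
	atomic_long_inc(&head->team_usage);	/* one more page in cache */
}

static inline void team_huge_mapped(struct page *head)
{
	atomic_long_add(TEAM_MAPCOUNT_UNIT, &head->team_usage);
}

The point of packing them is that a single atomic op then covers both
counts, without taking a lock on every subpage.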

It will be interesting to see what Kirill does to maintain the stats
for huge pagecache: but he will have no difficulty in finding fields
to store counts, because he's got lots of spare fields in those 511
tail pages - that's a useful benefit of the compound page, but does
prevent the tails from being used in ordinary ways.  (I did try using
team_head[1].team_usage for more, but atomicity needs prevented it.)

> 
> Another (strange) idea is adding a separate array of struct huge_page
> to each zone. They would work as headers for huge pages and hold
> those kinds of fields. Pageblock flags could also be stored there.

It's not such a strange idea, it is a definite possibility.  Though
I've tended to think of them more as a separate array of struct pages,
one for each of the hugepages.

It's a complication I'll keep away from as long as I can, but something
like that will probably have to come.  Consider the ambiguity of the
head page, whose flags and counts may represent the 4k page mapped
by pte and the 2M page mapped by pmd: there's an absurdity to that,
one that I can live with for now, but expect some nasty case to demand
a change (the way I have it at present, just mlocking the 4k head is
enough to hold the 2M hugepage in memory: that's not good, but should
be quite easily fixed without needing the hugepage array itself).

And I think ideas may emerge from the persistent memory struct page
discussions, which feed in here.  One reason to hold back for now.

Hugh

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 11/24] huge tmpfs: shrinker to migrate and free underused holes
  2015-03-23  4:40       ` Hugh Dickins
@ 2015-03-23 12:50         ` Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2015-03-23 12:50 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Konstantin Khlebnikov, Kirill A. Shutemov, Andrea Arcangeli,
	Ning Qu, Andrew Morton, linux-kernel, linux-mm

On Sun, Mar 22, 2015 at 09:40:02PM -0700, Hugh Dickins wrote:
> (I think Kirill has a problem of that kind in his page_remove_rmap scan).

Ouch! Thanks for noticing this. 

It should work fine while we are anon-THP only, but it needs to be fixed to
work with files.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 11/24] huge tmpfs: shrinker to migrate and free underused holes
  2015-03-23 12:50         ` Kirill A. Shutemov
@ 2015-03-23 13:50           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2015-03-23 13:50 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Konstantin Khlebnikov, Kirill A. Shutemov, Andrea Arcangeli,
	Ning Qu, Andrew Morton, linux-kernel, linux-mm

On Mon, Mar 23, 2015 at 02:50:09PM +0200, Kirill A. Shutemov wrote:
> On Sun, Mar 22, 2015 at 09:40:02PM -0700, Hugh Dickins wrote:
> > (I think Kirill has a problem of that kind in his page_remove_rmap scan).
> 
> Ouch! Thanks for noticing this. 
> 
> It should work fine while we are anon-THP only, but it needs to be fixed to
> work with files.

Err. No, it must be fixed for anon-THP too.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 11/24] huge tmpfs: shrinker to migrate and free underused holes
  2015-03-23  4:40       ` Hugh Dickins
@ 2015-03-24 12:57         ` Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2015-03-24 12:57 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Konstantin Khlebnikov, Kirill A. Shutemov, Andrea Arcangeli,
	Ning Qu, Andrew Morton, linux-kernel, linux-mm

On Sun, Mar 22, 2015 at 09:40:02PM -0700, Hugh Dickins wrote:
> (I think Kirill has a problem of that kind in his page_remove_rmap scan).
> 
> It will be interesting to see what Kirill does to maintain the stats
> for huge pagecache: but he will have no difficulty in finding fields
> to store counts, because he's got lots of spare fields in those 511
> tail pages - that's a useful benefit of the compound page, but does
> prevent the tails from being used in ordinary ways.  (I did try using
> team_head[1].team_usage for more, but atomicity needs prevented it.)

The patch below should address the race you pointed out, if I've got it all
right. Not hugely happy with the change, though.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 435c90f59227..a3e6b35520f8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -423,8 +423,17 @@ static inline void page_mapcount_reset(struct page *page)
 
 static inline int page_mapcount(struct page *page)
 {
+	int ret;
 	VM_BUG_ON_PAGE(PageSlab(page), page);
-	return atomic_read(&page->_mapcount) + compound_mapcount(page) + 1;
+	ret = atomic_read(&page->_mapcount) + 1;
+	if (compound_mapcount(page)) {
+		/*
+		 * positive compound_mapcount() offsets ->_mapcount by one --
+		 * subtract here.
+		*/
+	       ret += compound_mapcount(page) - 1;
+	}
+	return ret;
 }
 
 static inline int page_count(struct page *page)
diff --git a/mm/rmap.c b/mm/rmap.c
index fc6eee4ed476..f4ab976276e7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1066,9 +1066,17 @@ void do_page_add_anon_rmap(struct page *page,
 		 * disabled.
 		 */
 		if (compound) {
+			int i;
 			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 			__inc_zone_page_state(page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
+			/*
+			 * While compound_mapcount() is positive we keep *one*
+			 * mapcount reference in all subpages. It's required
+			 * for atomic removal from rmap.
+			 */
+			for (i = 0; i < nr; i++)
+				atomic_set(&page[i]._mapcount, 0);
 		}
 		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
 	}
@@ -1103,10 +1111,19 @@ void page_add_new_anon_rmap(struct page *page,
 	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
 	SetPageSwapBacked(page);
 	if (compound) {
+		int i;
+
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		/* increment count (starts at -1) */
 		atomic_set(compound_mapcount_ptr(page), 0);
 		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+		/*
+		 * While compound_mapcount() is positive we keep *one* mapcount
+		 * reference in all subpages. It's required for atomic removal
+		 * from rmap.
+		 */
+		for (i = 0; i < nr; i++)
+			atomic_set(&page[i]._mapcount, 0);
 	} else {
 		/* Anon THP always mapped first with PMD */
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -1174,9 +1191,6 @@ out:
  */
 void page_remove_rmap(struct page *page, bool compound)
 {
-	int nr = compound ? hpage_nr_pages(page) : 1;
-	bool partial_thp_unmap;
-
 	if (!PageAnon(page)) {
 		VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
 		page_remove_file_rmap(page);
@@ -1184,10 +1198,20 @@ void page_remove_rmap(struct page *page, bool compound)
 	}
 
 	/* page still mapped by someone else? */
-	if (!atomic_add_negative(-1, compound ?
-			       compound_mapcount_ptr(page) :
-			       &page->_mapcount))
+	if (compound) {
+		int i;
+
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
+			return;
+		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+		for (i = 0; i < hpage_nr_pages(page); i++)
+			page_remove_rmap(page + i, false);
 		return;
+	} else {
+		if (!atomic_add_negative(-1, &page->_mapcount))
+			return;
+	}
 
 	/* Hugepages are not counted in NR_ANON_PAGES for now. */
 	if (unlikely(PageHuge(page)))
@@ -1198,26 +1222,12 @@ void page_remove_rmap(struct page *page, bool compound)
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	if (compound) {
-		int i;
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
-		/* The page can be mapped with ptes */
-		for (i = 0; i < hpage_nr_pages(page); i++)
-			if (page_mapcount(page + i))
-				nr--;
-		partial_thp_unmap = nr != hpage_nr_pages(page);
-	} else if (PageTransCompound(page)) {
-		partial_thp_unmap = !compound_mapcount(page);
-	} else
-		partial_thp_unmap = false;
-
-	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
+	__dec_zone_page_state(page, NR_ANON_PAGES);
 
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
 
-	if (partial_thp_unmap)
+	if (PageTransCompound(page))
 		deferred_split_huge_page(compound_head(page));
 
 	/*
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 11/24] huge tmpfs: shrinker to migrate and free underused holes
  2015-03-24 12:57         ` Kirill A. Shutemov
@ 2015-03-25  0:41           ` Hugh Dickins
  -1 siblings, 0 replies; 76+ messages in thread
From: Hugh Dickins @ 2015-03-25  0:41 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Konstantin Khlebnikov, Kirill A. Shutemov,
	Andrea Arcangeli, Ning Qu, Andrew Morton, linux-kernel, linux-mm

On Tue, 24 Mar 2015, Kirill A. Shutemov wrote:
> On Sun, Mar 22, 2015 at 09:40:02PM -0700, Hugh Dickins wrote:
> > (I think Kirill has a problem of that kind in his page_remove_rmap scan).

(And this one I mentioned to you at the conference :)

> > 
> > It will be interesting to see what Kirill does to maintain the stats
> > for huge pagecache: but he will have no difficulty in finding fields
> > to store counts, because he's got lots of spare fields in those 511
> > tail pages - that's a useful benefit of the compound page, but does
> > prevent the tails from being used in ordinary ways.  (I did try using
> > team_head[1].team_usage for more, but atomicity needs prevented it.)
> 
> The patch below should address the race you pointed out, if I've got it
> all right. Not hugely happy with the change, though.

Yes, without thinking too hard about it, something like what you have
below should do it.  Not very pretty; maybe a neater idea will come up.
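
In case it helps others follow, here is a minimal standalone model of the
invariant the patch below sets up: while compound_mapcount() is positive,
every subpage carries one extra _mapcount reference, which the final pmd
unmap then drops.  Plain C11 with a toy team size -- all names here are
stand-ins, not the kernel's.

#include <stdatomic.h>
#include <stdio.h>

#define TEAM_NR 4				/* stand-in for HPAGE_PMD_NR (512) */

struct toy_page {
	atomic_int _mapcount;			/* starts at -1, like page->_mapcount */
};

static struct toy_page team[TEAM_NR];
static atomic_int compound_mapcount;		/* set to -1 in main() */

/* models page_add_new_anon_rmap(..., compound=true) from the patch below */
static void add_compound_rmap(void)
{
	atomic_store(&compound_mapcount, 0);	/* "increment count (starts at -1)" */
	for (int i = 0; i < TEAM_NR; i++)	/* keep one reference in every subpage */
		atomic_store(&team[i]._mapcount, 0);
}

/* models page_remove_rmap(..., compound=true) */
static void remove_compound_rmap(void)
{
	/* like atomic_add_negative(-1, compound_mapcount_ptr(page)) */
	if (atomic_fetch_sub(&compound_mapcount, 1) > 0)
		return;				/* still pmd-mapped elsewhere */
	for (int i = 0; i < TEAM_NR; i++)	/* drop the per-subpage references */
		atomic_fetch_sub(&team[i]._mapcount, 1);
}

int main(void)
{
	atomic_store(&compound_mapcount, -1);
	for (int i = 0; i < TEAM_NR; i++)
		atomic_store(&team[i]._mapcount, -1);

	add_compound_rmap();
	printf("mapped:   compound=%d subpage0=%d\n",
	       atomic_load(&compound_mapcount), atomic_load(&team[0]._mapcount));

	remove_compound_rmap();
	printf("unmapped: compound=%d subpage0=%d\n",	/* both back to -1 */
	       atomic_load(&compound_mapcount), atomic_load(&team[0]._mapcount));
	return 0;
}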

Hugh

> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 435c90f59227..a3e6b35520f8 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -423,8 +423,17 @@ static inline void page_mapcount_reset(struct page *page)
>  
>  static inline int page_mapcount(struct page *page)
>  {
> +	int ret;
>  	VM_BUG_ON_PAGE(PageSlab(page), page);
> -	return atomic_read(&page->_mapcount) + compound_mapcount(page) + 1;
> +	ret = atomic_read(&page->_mapcount) + 1;
> +	if (compound_mapcount(page)) {
> +		/*
> +		 * positive compound_mapcount() offsets ->_mapcount by one --
> +		 * subtract here.
> +		*/
> +	       ret += compound_mapcount(page) - 1;
> +	}
> +	return ret;
>  }
>  
>  static inline int page_count(struct page *page)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index fc6eee4ed476..f4ab976276e7 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1066,9 +1066,17 @@ void do_page_add_anon_rmap(struct page *page,
>  		 * disabled.
>  		 */
>  		if (compound) {
> +			int i;
>  			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
>  			__inc_zone_page_state(page,
>  					      NR_ANON_TRANSPARENT_HUGEPAGES);
> +			/*
> +			 * While compound_mapcount() is positive we keep *one*
> +			 * mapcount reference in all subpages. It's required
> +			 * for atomic removal from rmap.
> +			 */
> +			for (i = 0; i < nr; i++)
> +				atomic_set(&page[i]._mapcount, 0);
>  		}
>  		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
>  	}
> @@ -1103,10 +1111,19 @@ void page_add_new_anon_rmap(struct page *page,
>  	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
>  	SetPageSwapBacked(page);
>  	if (compound) {
> +		int i;
> +
>  		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
>  		/* increment count (starts at -1) */
>  		atomic_set(compound_mapcount_ptr(page), 0);
>  		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
> +		/*
> +		 * While compound_mapcount() is positive we keep *one* mapcount
> +		 * reference in all subpages. It's required for atomic removal
> +		 * from rmap.
> +		 */
> +		for (i = 0; i < nr; i++)
> +			atomic_set(&page[i]._mapcount, 0);
>  	} else {
>  		/* Anon THP always mapped first with PMD */
>  		VM_BUG_ON_PAGE(PageTransCompound(page), page);
> @@ -1174,9 +1191,6 @@ out:
>   */
>  void page_remove_rmap(struct page *page, bool compound)
>  {
> -	int nr = compound ? hpage_nr_pages(page) : 1;
> -	bool partial_thp_unmap;
> -
>  	if (!PageAnon(page)) {
>  		VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
>  		page_remove_file_rmap(page);
> @@ -1184,10 +1198,20 @@ void page_remove_rmap(struct page *page, bool compound)
>  	}
>  
>  	/* page still mapped by someone else? */
> -	if (!atomic_add_negative(-1, compound ?
> -			       compound_mapcount_ptr(page) :
> -			       &page->_mapcount))
> +	if (compound) {
> +		int i;
> +
> +		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> +		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
> +			return;
> +		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
> +		for (i = 0; i < hpage_nr_pages(page); i++)
> +			page_remove_rmap(page + i, false);
>  		return;
> +	} else {
> +		if (!atomic_add_negative(-1, &page->_mapcount))
> +			return;
> +	}
>  
>  	/* Hugepages are not counted in NR_ANON_PAGES for now. */
>  	if (unlikely(PageHuge(page)))
> @@ -1198,26 +1222,12 @@ void page_remove_rmap(struct page *page, bool compound)
>  	 * these counters are not modified in interrupt context, and
>  	 * pte lock(a spinlock) is held, which implies preemption disabled.
>  	 */
> -	if (compound) {
> -		int i;
> -		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> -		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
> -		/* The page can be mapped with ptes */
> -		for (i = 0; i < hpage_nr_pages(page); i++)
> -			if (page_mapcount(page + i))
> -				nr--;
> -		partial_thp_unmap = nr != hpage_nr_pages(page);
> -	} else if (PageTransCompound(page)) {
> -		partial_thp_unmap = !compound_mapcount(page);
> -	} else
> -		partial_thp_unmap = false;
> -
> -	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
> +	__dec_zone_page_state(page, NR_ANON_PAGES);
>  
>  	if (unlikely(PageMlocked(page)))
>  		clear_page_mlock(page);
>  
> -	if (partial_thp_unmap)
> +	if (PageTransCompound(page))
>  		deferred_split_huge_page(compound_head(page));
>  
>  	/*
> -- 
>  Kirill A. Shutemov
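
(And the page_mapcount() arithmetic from the mm.h hunk quoted above, as a
standalone check: a pmd mapping has already left one reference in the
subpage's _mapcount, so a positive compound_mapcount() is added with one
subtracted back out.  Again toy ints, not kernel types.)

#include <assert.h>
#include <stdio.h>

/* both raw counters start at -1, as in struct page */
static int toy_page_mapcount(int raw_mapcount, int raw_compound_mapcount)
{
	int compound_mapcount = raw_compound_mapcount + 1;	/* compound_mapcount() */
	int ret = raw_mapcount + 1;

	if (compound_mapcount) {
		/*
		 * A positive compound_mapcount() already offset ->_mapcount
		 * by one -- subtract it so the pmd mapping is not counted twice.
		 */
		ret += compound_mapcount - 1;
	}
	return ret;
}

int main(void)
{
	assert(toy_page_mapcount(-1, -1) == 0);	/* not mapped at all */
	assert(toy_page_mapcount(0, 0) == 1);	/* one pmd mapping only */
	assert(toy_page_mapcount(1, 0) == 2);	/* one pmd + one pte mapping */
	printf("page_mapcount model: ok\n");
	return 0;
}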

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 16/24] huge tmpfs: fix problems from premature exposure of pagetable
  2015-02-21  4:16   ` Hugh Dickins
@ 2015-07-01 10:53     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 76+ messages in thread
From: Kirill A. Shutemov @ 2015-07-01 10:53 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Ning Qu, Andrew Morton,
	linux-kernel, linux-mm

On Fri, Feb 20, 2015 at 08:16:32PM -0800, Hugh Dickins wrote:
> Andrea wrote a very interesting comment on THP in mm/memory.c,
> just before the end of __handle_mm_fault():
> 
>  * A regular pmd is established and it can't morph into a huge pmd
>  * from under us anymore at this point because we hold the mmap_sem
>  * read mode and khugepaged takes it in write mode. So now it's
>  * safe to run pte_offset_map().
> 
> This comment hints at several difficulties, which anon THP solved
> for itself with mmap_sem and anon_vma lock, but which huge tmpfs
> may need to solve differently.
> 
> The reference to pte_offset_map() above: I believe that's a hint
> that on a 32-bit machine, the pagetables might need to come from
> kernel-mapped memory, but a huge pmd pointing to user memory beyond
> that limit could be racily substituted, causing undefined behavior
> in the architecture-dependent pte_offset_map().
> 
> That itself is not a problem on x86_64, but there's plenty more:
> how about those places which use pte_offset_map_lock() - if that
> spinlock is in the struct page of a pagetable, which has been
> deposited and might be withdrawn and freed at any moment (being
> on a list unattached to the allocating pmd in the case of x86),
> taking the spinlock might corrupt someone else's struct page.
> 
> Because THP has departed from the earlier rules (when pagetable
> was only freed under exclusive mmap_sem, or at exit_mmap, after
> removing all affected vmas from the rmap list): zap_huge_pmd()
> does pte_free() even when serving MADV_DONTNEED under down_read
> of mmap_sem.
> 
> And what of the "entry = *pte" at the start of handle_pte_fault(),
> getting the entry used in pte_same(,orig_pte) tests to validate all
> fault handling?  If that entry can itself be junk picked out of some
> freed and reused pagetable, it's hard to estimate the consequences.
> 
> We need to consider the safety of concurrent faults, and the
> safety of rmap lookups, and the safety of miscellaneous operations
> such as smaps_pte_range() for reading /proc/<pid>/smaps.
> 
> I set out to make safe the places which descend pgd,pud,pmd,pte,
> using more careful access techniques like mm_find_pmd(); but with
> pte_offset_map() being architecture-defined, it's too big a job to
> tighten it up all over.
> 
> Instead, approach from the opposite direction: just do not expose
> a pagetable in an empty *pmd, until vm_ops->fault has had a chance
> to ask for a huge pmd there.  This is a much easier change to make,
> and we are lucky that all the driver faults appear to be using
> interfaces (like vm_insert_page() and remap_pfn_range()) which
> automatically do the pte_alloc() if it was not already done.
> 
> But we must not get stuck refaulting: need FAULT_FLAG_MAY_HUGE for
> __do_fault() to tell shmem_fault() to try for huge only when *pmd is
> empty (could instead add pmd to vmf and let shmem work that out for
> itself, but probably better to hide pmd from vm_ops->faults).
> 
> Without a pagetable to hold the pte_none() entry found in a newly
> allocated pagetable, handle_pte_fault() would like to provide a static
> none entry for later orig_pte checks.  But architectures have never had
> to provide that definition before; and although almost all use zeroes
> for an empty pagetable, a few do not - nios2, s390, um, xtensa.
> 
> Never mind, forget about pte_same(,orig_pte), the three __do_fault()
> callers can follow do_anonymous_page()'s example, and just use a
> pte_none() check instead - supplemented by a pte_file pte_to_pgoff
> check until the day VM_NONLINEAR is removed.
> 
> do_fault_around() presents one last problem: it wants pagetable to
> have been allocated, but was being called by do_read_fault() before
> __do_fault().  But I see no disadvantage to moving it after,
> allowing huge pmd to be chosen first.

One disadvantage is an additional radix-tree lookup for the page-cache-hot
case.  IIRC, the difference was small, but measurable, back when I
implemented faultaround.
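
Roughly, counting only the radix-tree walks for the cache-hot read fault
(a toy sketch of the two orderings, not the real code paths):

#include <stdbool.h>
#include <stdio.h>

static int lookups;				/* radix-tree walks performed */

static bool pagecache_lookup(void)		/* page is assumed cache-hot */
{
	lookups++;
	return true;
}

/* current order: faultaround first; when it maps the page we are done */
static void read_fault_current(void)
{
	if (pagecache_lookup())			/* faultaround finds and maps it */
		return;
	pagecache_lookup();			/* only then would ->fault() look */
}

/* proposed order: ->fault() first (so a huge pmd can be chosen),
 * faultaround afterwards -- a second walk even when the cache is hot */
static void read_fault_proposed(void)
{
	pagecache_lookup();			/* ->fault() */
	pagecache_lookup();			/* faultaround */
}

int main(void)
{
	lookups = 0;
	read_fault_current();
	printf("current order, hot case:  %d lookup(s)\n", lookups);

	lookups = 0;
	read_fault_proposed();
	printf("proposed order, hot case: %d lookup(s)\n", lookups);
	return 0;
}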

Have you considered pushing page table allocation even further -- into
do_set_pte()?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2015-07-01 10:53 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-21  3:49 [PATCH 00/24] huge tmpfs: an alternative approach to THPageCache Hugh Dickins
2015-02-21  3:49 ` Hugh Dickins
2015-02-21  3:51 ` [PATCH 01/24] mm: update_lru_size warn and reset bad lru_size Hugh Dickins
2015-02-21  3:51   ` Hugh Dickins
2015-02-23  9:30   ` Kirill A. Shutemov
2015-02-23  9:30     ` Kirill A. Shutemov
2015-03-23  2:44     ` Hugh Dickins
2015-03-23  2:44       ` Hugh Dickins
2015-02-21  3:54 ` [PATCH 02/24] mm: update_lru_size do the __mod_zone_page_state Hugh Dickins
2015-02-21  3:54   ` Hugh Dickins
2015-02-21  3:56 ` [PATCH 03/24] mm: use __SetPageSwapBacked and don't ClearPageSwapBacked Hugh Dickins
2015-02-21  3:56   ` Hugh Dickins
2015-02-25 10:53   ` Mel Gorman
2015-02-25 10:53     ` Mel Gorman
2015-03-23  3:01     ` Hugh Dickins
2015-03-23  3:01       ` Hugh Dickins
2015-02-21  3:58 ` [PATCH 04/24] mm: make page migration's newpage handling more robust Hugh Dickins
2015-02-21  3:58   ` Hugh Dickins
2015-02-21  4:00 ` [PATCH 05/24] tmpfs: preliminary minor tidyups Hugh Dickins
2015-02-21  4:00   ` Hugh Dickins
2015-02-21  4:01 ` [PATCH 06/24] huge tmpfs: prepare counts in meminfo, vmstat and SysRq-m Hugh Dickins
2015-02-21  4:01   ` Hugh Dickins
2015-02-21  4:03 ` [PATCH 07/24] huge tmpfs: include shmem freeholes in available memory counts Hugh Dickins
2015-02-21  4:03   ` Hugh Dickins
2015-02-21  4:05 ` [PATCH 08/24] huge tmpfs: prepare huge=N mount option and /proc/sys/vm/shmem_huge Hugh Dickins
2015-02-21  4:05   ` Hugh Dickins
2015-02-21  4:06 ` [PATCH 09/24] huge tmpfs: try to allocate huge pages, split into a team Hugh Dickins
2015-02-21  4:06   ` Hugh Dickins
2015-02-21  4:07 ` [PATCH 10/24] huge tmpfs: avoid team pages in a few places Hugh Dickins
2015-02-21  4:07   ` Hugh Dickins
2015-02-21  4:09 ` [PATCH 11/24] huge tmpfs: shrinker to migrate and free underused holes Hugh Dickins
2015-02-21  4:09   ` Hugh Dickins
2015-03-19 16:56   ` Konstantin Khlebnikov
2015-03-19 16:56     ` Konstantin Khlebnikov
2015-03-23  4:40     ` Hugh Dickins
2015-03-23  4:40       ` Hugh Dickins
2015-03-23 12:50       ` Kirill A. Shutemov
2015-03-23 12:50         ` Kirill A. Shutemov
2015-03-23 13:50         ` Kirill A. Shutemov
2015-03-23 13:50           ` Kirill A. Shutemov
2015-03-24 12:57       ` Kirill A. Shutemov
2015-03-24 12:57         ` Kirill A. Shutemov
2015-03-25  0:41         ` Hugh Dickins
2015-03-25  0:41           ` Hugh Dickins
2015-02-21  4:11 ` [PATCH 12/24] huge tmpfs: get_unmapped_area align and fault supply huge page Hugh Dickins
2015-02-21  4:11   ` Hugh Dickins
2015-02-21  4:12 ` [PATCH 13/24] huge tmpfs: extend get_user_pages_fast to shmem pmd Hugh Dickins
2015-02-21  4:12   ` Hugh Dickins
2015-02-21  4:13 ` [PATCH 14/24] huge tmpfs: extend vma_adjust_trans_huge " Hugh Dickins
2015-02-21  4:13   ` Hugh Dickins
2015-02-21  4:15 ` [PATCH 15/24] huge tmpfs: rework page_referenced_one and try_to_unmap_one Hugh Dickins
2015-02-21  4:15   ` Hugh Dickins
2015-02-21  4:16 ` [PATCH 16/24] huge tmpfs: fix problems from premature exposure of pagetable Hugh Dickins
2015-02-21  4:16   ` Hugh Dickins
2015-07-01 10:53   ` Kirill A. Shutemov
2015-07-01 10:53     ` Kirill A. Shutemov
2015-02-21  4:18 ` [PATCH 17/24] huge tmpfs: map shmem by huge page pmd or by page team ptes Hugh Dickins
2015-02-21  4:18   ` Hugh Dickins
2015-02-21  4:20 ` [PATCH 18/24] huge tmpfs: mmap_sem is unlocked when truncation splits huge pmd Hugh Dickins
2015-02-21  4:20   ` Hugh Dickins
2015-02-21  4:22 ` [PATCH 19/24] huge tmpfs: disband split huge pmds on race or memory failure Hugh Dickins
2015-02-21  4:22   ` Hugh Dickins
2015-02-21  4:23 ` [PATCH 20/24] huge tmpfs: use Unevictable lru with variable hpage_nr_pages() Hugh Dickins
2015-02-21  4:23   ` Hugh Dickins
2015-02-21  4:25 ` [PATCH 21/24] huge tmpfs: fix Mlocked meminfo, tracking huge and unhuge mlocks Hugh Dickins
2015-02-21  4:25   ` Hugh Dickins
2015-02-21  4:27 ` [PATCH 22/24] huge tmpfs: fix Mapped meminfo, tracking huge and unhuge mappings Hugh Dickins
2015-02-21  4:27   ` Hugh Dickins
2015-02-21  4:29 ` [PATCH 23/24] kvm: plumb return of hva when resolving page fault Hugh Dickins
2015-02-21  4:29   ` Hugh Dickins
2015-02-21  4:31 ` [PATCH 24/24] kvm: teach kvm to map page teams as huge pages Hugh Dickins
2015-02-21  4:31   ` Hugh Dickins
2015-02-23 13:48 ` [PATCH 00/24] huge tmpfs: an alternative approach to THPageCache Kirill A. Shutemov
2015-02-23 13:48   ` Kirill A. Shutemov
2015-03-23  2:25   ` Hugh Dickins
2015-03-23  2:25     ` Hugh Dickins
