* [PATCHv2 RFC 00/19] THP refcounting redesign
@ 2014-11-05 14:49 Kirill A. Shutemov
  2014-11-05 14:49 ` [PATCH 01/19] mm, thp: drop FOLL_SPLIT Kirill A. Shutemov
                   ` (18 more replies)
  0 siblings, 19 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm, Kirill A. Shutemov

Hello everybody,

Here's my second version of the patchset on the THP refcounting redesign.
It's still an RFC and I have quite a few items on the todo list. The
patches are on top of next-20140811 + Naoya's patchset on the pagewalker.
I'll rebase it once the updated pagewalker hits -mm. It's relatively
stable: trinity is not able to crash it in my setup.

The goal of the patchset is to make refcounting on THP pages cheaper, with
simpler semantics, and to allow the same THP compound page to be mapped
with a PMD and with PTEs. This is required to get a reasonable
THP-pagecache implementation.

With the new refcounting design it's much easier to protect against
split_huge_page(): holding a simple reference on a page is enough. It
makes the gup_fast() implementation simpler and removes the need for a
special case in the futex code to handle tail THP pages.

It should improve THP utilization across the system, since splitting a THP
in one process no longer forces the page to be split in all other
processes that have it mapped.

= Design overview =

The main reason we can't map a THP with 4k pages is how refcounting on THP
is designed. It is built around two requirements:

  - split of a huge page should never fail;
  - we can't change the interface of get_user_pages();

To be able to split a huge page at any point we have to track which tail
pages were pinned. This leads to a tricky and expensive get_page() on tail
pages and also occupies tail_page->_mapcount.

Most split_huge_page*() users want the PMD to be split into a table of
PTEs and don't care whether the compound page itself is going to be split
or not.

The plan is:

 - allow split_huge_page() to fail if the page is pinned. It's trivial to
   split a non-pinned page, and it doesn't require tail page refcounting,
   so tail_page->_mapcount is free to be reused.

 - introduce a new routine -- split_huge_pmd() -- to split a PMD into a
   table of PTEs. It splits only one PMD, without touching other PMDs the
   page is mapped with or the underlying compound page. Unlike the new
   split_huge_page(), split_huge_pmd() never fails.

Fortunately, we have only a few places where split_huge_page() is needed:
swap out, memory failure, migration, KSM. And all of them can handle a
split_huge_page() failure.
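
To illustrate, the caller pattern ends up looking roughly like the
migration hunk in patch 1. This is only a sketch, not the exact code: the
'err' variable and the 'set_status' label come from the surrounding
migration function, and real callers pick their own error handling:

	page = follow_page(vma, addr, FOLL_GET);
	if (IS_ERR(page) || !page)
		goto set_status;

	/* split_huge_page() can now fail, e.g. when the page is pinned */
	if (PageTransHuge(page) && split_huge_page(page)) {
		err = -EBUSY;
		goto set_status;
	}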

In the new scheme, page->_mapcount is used to account how many times the
page is mapped with PTEs. We have a separate compound_mapcount() to count
mappings with a PMD. page_mapcount() returns the sum of PTE and PMD
mappings of the page.
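
In code, that is roughly the following (a simplified sketch of the helpers
introduced in patch 6; see the include/linux/mm.h changes there for the
real thing):

	/* PMD mapcount lives in the first tail page's ->mapping slot */
	static inline atomic_t *compound_mapcount_ptr(struct page *page)
	{
		return (atomic_t *)&page[1].mapping;
	}

	static inline int compound_mapcount(struct page *page)
	{
		if (!PageCompound(page))
			return 0;
		page = compound_head(page);
		return atomic_read(compound_mapcount_ptr(page)) + 1;
	}

	/* total mapcount: PTE mappings plus PMD mappings */
	static inline int page_mapcount(struct page *page)
	{
		return atomic_read(&page->_mapcount) + compound_mapcount(page) + 1;
	}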

Introducing split_huge_pmd() effectively allows a THP to be mapped with 4k
PTEs. It may be a surprise to some code to see a PTE which points to a
tail page, or a VMA start/end in the middle of a compound page.
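
For example, pageflag helpers have to normalize to the head page now; this
is what patches 3 and 4 do for PageAnon() and the PG_locked helpers:

	static inline int PageAnon(struct page *page)
	{
		/* pte_page() may hand us a tail page now */
		page = compound_head(page);
		return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
	}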

= Patches overview =

Patches 1-2:
	Remove barely-used FOLL_SPLIT and rearrange split-related code to
	make future changes simpler.

Patches 3-4:
	Make PageAnon() and the PG_locked related helpers look at the
	head page if a tail page is passed. This is required since
	pte_page() can now point to a tail page. It's likely that we
	need to change other pageflags-related helpers too.

Patch 5:
	With PTE-mapped THP, rmap cannot rely on the PageTransHuge()
	check to decide whether to map a small page or a THP. We need to
	get this information from the caller (see the interface sketch
	after this list).

Patch 6:
	Store the mapcount for compound pages separately: in the first
	tail page's ->mapping.

Patch 7:
	Adjust the conditions under which we can reuse the page on a
	write-protection fault.

Patch 8:
	Prepare migration code to deal with tail pages.

Patch 9:
	Rename split_huge_page_pmd() to split_huge_pmd() to reflect that
	the page is not going to be split, only the PMD.

Patch 10:
	Temporarily make split_huge_page() return -EBUSY on all split
	requests. This allows us to drop tail-page refcounting and change
	the implementation of split_huge_pmd() to split a PMD into a
	table of PTEs without splitting the compound page.

Patch 11:
	New THP_SPLIT_* vmstats.

Patch 12:
	Implement the new split_huge_page(), which fails if the page is
	pinned. For now, we rely on compound_lock() to make page counts
	stable.

Patches 13-14:
	Drop the infrastructure for handling splitting PMDs. We don't use
	it anymore in split_huge_page(). For now we only remove it from
	generic code and x86.

Patch 15:
	Remove the ugly special case for a futex that happens to be in a
	tail THP page. With the new refcounting it's much easier to
	protect against split.

Patch 16:
	Documentation update.

Patch 17:
	Hack to split all THPs on mlock(). To be replaced with a real
	solution for the new refcounting.

Patches 18-19:
	Replace compound_lock with migration entries as the mechanism to
	freeze page counts during split_huge_page(). We don't need
	compound_lock anymore. This makes get_page()/put_page() on tail
	pages faster.
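
To make the interface change from patch 5 concrete, here are the new rmap
entry points with the compound flag, lifted from the include/linux/rmap.h
hunk in that patch (shown here only for orientation):

	/* flags for do_page_add_anon_rmap() */
	enum {
		RMAP_EXCLUSIVE = 1,
		RMAP_COMPOUND = 2,
	};

	void page_add_anon_rmap(struct page *, struct vm_area_struct *,
			unsigned long, bool);
	void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
			unsigned long, bool);
	void page_remove_rmap(struct page *, bool);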

= TODO =

 - As we discussed on the first RFC, we need to split a THP on munmap() if
   the page is not mapped with a PMD anymore. If the split fails (page is
   pinned) we need to queue it for splitting (to vmscan?).

 - Proper mlock implementation is required.

 - Review all PageTransHuge() users -- consider getting rid of the helper.

 - Memory cgroup adaptation to the new refcounting -- I haven't checked
   yet what needs to be done there, but I would expect some breakage.

 - Check page-flags: whether they should be on the compound page or
   per-4k.

 - Drop the PMD splitting infrastructure from the remaining architectures.
   Should be easy.

 - Check whether khugepaged needs to be adjusted.

Also, munmap() on part of a huge page will not split and free the unmapped
part immediately. We need to be careful here to keep the memory footprint
under control.

As a side effect, we no longer need to mark PMDs as splitting since we
have split_huge_pmd(). get_page()/put_page() on tail pages of a THP is
cheaper (and cleaner) now.

I will continue with stabilizing this. The patchset is also available in
git[1].

Any comments?

Kirill A. Shutemov (19):
  mm, thp: drop FOLL_SPLIT
  thp: cluster split_huge_page* code together
  mm: change PageAnon() to work on tail pages
  mm: avoid PG_locked on tail pages
  rmap: add argument to charge compound page
  mm: store mapcount for compound page separate
  mm, thp: adjust conditions when we can reuse the page on WP fault
  mm: prepare migration code for new THP refcounting
  thp: rename split_huge_page_pmd() to split_huge_pmd()
  thp: PMD splitting without splitting compound page
  mm, vmstats: new THP splitting event
  thp: implement new split_huge_page()
  mm, thp: remove infrastructure for handling splitting PMDs
  x86, thp: remove infrastructure for handling splitting PMDs
  futex, thp: remove special case for THP in get_futex_key
  thp: update documentation
  mlock, thp: HACK: split all pages in VM_LOCKED vma
  thp, mm: use migration entries to freeze page counts on split
  mm, thp: remove compound_lock

 Documentation/vm/transhuge.txt       |  95 ++---
 arch/mips/mm/gup.c                   |   4 -
 arch/powerpc/mm/hugetlbpage.c        |  12 -
 arch/powerpc/mm/subpage-prot.c       |   2 +-
 arch/s390/mm/gup.c                   |  13 +-
 arch/s390/mm/pgtable.c               |  17 +-
 arch/sparc/mm/gup.c                  |  14 +-
 arch/x86/include/asm/pgtable.h       |   9 -
 arch/x86/include/asm/pgtable_types.h |   2 -
 arch/x86/kernel/vm86_32.c            |   6 +-
 arch/x86/mm/gup.c                    |  17 +-
 arch/x86/mm/pgtable.c                |  14 -
 fs/proc/task_mmu.c                   |   8 +-
 include/asm-generic/pgtable.h        |   5 -
 include/linux/huge_mm.h              |  55 +--
 include/linux/hugetlb_inline.h       |   9 +-
 include/linux/migrate.h              |   3 +
 include/linux/mm.h                   | 112 +-----
 include/linux/page-flags.h           |  15 +-
 include/linux/pagemap.h              |   5 +
 include/linux/rmap.h                 |  18 +-
 include/linux/swap.h                 |   8 +-
 include/linux/vm_event_item.h        |   4 +-
 kernel/events/uprobes.c              |   4 +-
 kernel/futex.c                       |  61 +--
 mm/filemap.c                         |   1 +
 mm/filemap_xip.c                     |   2 +-
 mm/gup.c                             |  18 +-
 mm/huge_memory.c                     | 730 ++++++++++++++++-------------------
 mm/hugetlb.c                         |   8 +-
 mm/internal.h                        |  31 +-
 mm/ksm.c                             |   4 +-
 mm/memcontrol.c                      |  14 +-
 mm/memory.c                          |  36 +-
 mm/mempolicy.c                       |   2 +-
 mm/migrate.c                         |  50 ++-
 mm/mincore.c                         |   2 +-
 mm/mlock.c                           | 130 +++++--
 mm/mprotect.c                        |   2 +-
 mm/mremap.c                          |   2 +-
 mm/page_alloc.c                      |  16 +-
 mm/pagewalk.c                        |   2 +-
 mm/pgtable-generic.c                 |  14 -
 mm/rmap.c                            |  98 +++--
 mm/slub.c                            |   2 +
 mm/swap.c                            | 260 +++----------
 mm/swapfile.c                        |   7 +-
 mm/vmstat.c                          |   4 +-
 48 files changed, 789 insertions(+), 1158 deletions(-)

-- 
2.1.1



* [PATCH 01/19] mm, thp: drop FOLL_SPLIT
  2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
@ 2014-11-05 14:49 ` Kirill A. Shutemov
  2014-11-25  3:01   ` Naoya Horiguchi
  2014-11-05 14:49 ` [PATCH 02/19] thp: cluster split_huge_page* code together Kirill A. Shutemov
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm, Kirill A. Shutemov

FOLL_SPLIT is used in only two places: migration and s390.

Let's replace it with an explicit split and remove FOLL_SPLIT.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/transhuge.txt | 11 -----------
 arch/s390/mm/pgtable.c         | 17 +++++++++++------
 include/linux/mm.h             |  1 -
 mm/gup.c                       |  4 ----
 mm/migrate.c                   |  7 ++++++-
 5 files changed, 17 insertions(+), 23 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 6b31cfbe2a9a..df1794a9071f 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -263,17 +263,6 @@ same constrains that applies to hugetlbfs too, so any driver capable
 of handling GUP on hugetlbfs will also work fine on transparent
 hugepage backed mappings.
 
-In case you can't handle compound pages if they're returned by
-follow_page, the FOLL_SPLIT bit can be specified as parameter to
-follow_page, so that it will split the hugepages before returning
-them. Migration for example passes FOLL_SPLIT as parameter to
-follow_page because it's not hugepage aware and in fact it can't work
-at all on hugetlbfs (but it instead works fine on transparent
-hugepages thanks to FOLL_SPLIT). migration simply can't deal with
-hugepages being returned (as it's not only checking the pfn of the
-page and pinning it during the copy but it pretends to migrate the
-memory in regular page sizes and with regular pte/pmd mappings).
-
 == Optimizing the applications ==
 
 To be guaranteed that the kernel will map a 2M page immediately in any
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 19daa53a3da4..a43f4d33f376 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -1248,20 +1248,25 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline void thp_split_vma(struct vm_area_struct *vma)
+static int thp_split_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
+		struct mm_walk *walk)
 {
-	unsigned long addr;
-
-	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE)
-		follow_page(vma, addr, FOLL_SPLIT);
+	struct vm_area_struct *vma = walk->vma;
+	split_huge_page_pmd(vma, addr, pmd);
+	return 0;
 }
 
 static inline void thp_split_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 
+	struct mm_walk thp_split_walk = {
+		.mm = mm,
+		.pmd_entry = thp_split_pmd,
+
+	};
 	for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
-		thp_split_vma(vma);
+		walk_page_vma(vma, &thp_split_walk);
 		vma->vm_flags &= ~VM_HUGEPAGE;
 		vma->vm_flags |= VM_NOHUGEPAGE;
 	}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c9f866760df8..98c11c5be0ad 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1985,7 +1985,6 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
 #define FOLL_NOWAIT	0x20	/* if a disk transfer is needed, start the IO
 				 * and return without waiting upon it */
 #define FOLL_MLOCK	0x40	/* mark page as mlocked */
-#define FOLL_SPLIT	0x80	/* don't return transhuge pages, split them */
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
 #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
diff --git a/mm/gup.c b/mm/gup.c
index 91d044b1600d..03f34c417591 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -192,10 +192,6 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 	if ((flags & FOLL_NUMA) && pmd_numa(*pmd))
 		return no_page_table(vma, flags);
 	if (pmd_trans_huge(*pmd)) {
-		if (flags & FOLL_SPLIT) {
-			split_huge_page_pmd(vma, address, pmd);
-			return follow_page_pte(vma, address, pmd, flags);
-		}
 		ptl = pmd_lock(mm, pmd);
 		if (likely(pmd_trans_huge(*pmd))) {
 			if (unlikely(pmd_trans_splitting(*pmd))) {
diff --git a/mm/migrate.c b/mm/migrate.c
index f78ec9bd454d..ad4694515f31 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1236,7 +1236,7 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 		if (!vma || pp->addr < vma->vm_start || !vma_migratable(vma))
 			goto set_status;
 
-		page = follow_page(vma, pp->addr, FOLL_GET|FOLL_SPLIT);
+		page = follow_page(vma, pp->addr, FOLL_GET);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
@@ -1246,6 +1246,11 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 		if (!page)
 			goto set_status;
 
+		if (PageTransHuge(page) && split_huge_page(page)) {
+			err = -EBUSY;
+			goto set_status;
+		}
+
 		/* Use PageReserved to check for zero page */
 		if (PageReserved(page))
 			goto put_and_set;
-- 
2.1.1



* [PATCH 02/19] thp: cluster split_huge_page* code together
  2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
  2014-11-05 14:49 ` [PATCH 01/19] mm, thp: drop FOLL_SPLIT Kirill A. Shutemov
@ 2014-11-05 14:49 ` Kirill A. Shutemov
  2014-11-05 14:49 ` [PATCH 03/19] mm: change PageAnon() to work on tail pages Kirill A. Shutemov
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm, Kirill A. Shutemov

Rearrange code in mm/huge_memory.c to make future changes somewhat
easier.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/huge_memory.c | 223 +++++++++++++++++++++++++++----------------------------
 1 file changed, 111 insertions(+), 112 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8efe27b86370..52973809777f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1600,6 +1600,117 @@ unlock:
 	return NULL;
 }
 
+static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
+		unsigned long haddr, pmd_t *pmd)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgtable_t pgtable;
+	pmd_t _pmd;
+	int i;
+
+	pmdp_clear_flush(vma, haddr, pmd);
+	/* leave pmd empty until pte is filled */
+
+	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	pmd_populate(mm, &_pmd, pgtable);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+		pte_t *pte, entry;
+		entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+		entry = pte_mkspecial(entry);
+		pte = pte_offset_map(&_pmd, haddr);
+		VM_BUG_ON(!pte_none(*pte));
+		set_pte_at(mm, haddr, pte, entry);
+		pte_unmap(pte);
+	}
+	smp_wmb(); /* make pte visible before pmd */
+	pmd_populate(mm, pmd, pgtable);
+	put_huge_zero_page();
+}
+
+void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd)
+{
+	spinlock_t *ptl;
+	struct page *page;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	unsigned long mmun_start;	/* For mmu_notifiers */
+	unsigned long mmun_end;		/* For mmu_notifiers */
+
+	BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
+
+	mmun_start = haddr;
+	mmun_end   = haddr + HPAGE_PMD_SIZE;
+again:
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+	ptl = pmd_lock(mm, pmd);
+	if (unlikely(!pmd_trans_huge(*pmd))) {
+		spin_unlock(ptl);
+		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		return;
+	}
+	if (is_huge_zero_pmd(*pmd)) {
+		__split_huge_zero_page_pmd(vma, haddr, pmd);
+		spin_unlock(ptl);
+		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+		return;
+	}
+	page = pmd_page(*pmd);
+	VM_BUG_ON_PAGE(!page_count(page), page);
+	get_page(page);
+	spin_unlock(ptl);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+
+	split_huge_page(page);
+
+	put_page(page);
+
+	/*
+	 * We don't always have down_write of mmap_sem here: a racing
+	 * do_huge_pmd_wp_page() might have copied-on-write to another
+	 * huge page before our split_huge_page() got the anon_vma lock.
+	 */
+	if (unlikely(pmd_trans_huge(*pmd)))
+		goto again;
+}
+
+void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
+		pmd_t *pmd)
+{
+	struct vm_area_struct *vma;
+
+	vma = find_vma(mm, address);
+	BUG_ON(vma == NULL);
+	split_huge_page_pmd(vma, address, pmd);
+}
+
+static void split_huge_page_address(struct mm_struct *mm,
+				    unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		return;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return;
+
+	pmd = pmd_offset(pud, address);
+	if (!pmd_present(*pmd))
+		return;
+	/*
+	 * Caller holds the mmap_sem write mode, so a huge pmd cannot
+	 * materialize from under us.
+	 */
+	split_huge_page_pmd_mm(mm, address, pmd);
+}
 static int __split_huge_page_splitting(struct page *page,
 				       struct vm_area_struct *vma,
 				       unsigned long address)
@@ -2808,118 +2919,6 @@ static int khugepaged(void *none)
 	return 0;
 }
 
-static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
-		unsigned long haddr, pmd_t *pmd)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	pgtable_t pgtable;
-	pmd_t _pmd;
-	int i;
-
-	pmdp_clear_flush(vma, haddr, pmd);
-	/* leave pmd empty until pte is filled */
-
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
-	pmd_populate(mm, &_pmd, pgtable);
-
-	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
-		pte_t *pte, entry;
-		entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
-		entry = pte_mkspecial(entry);
-		pte = pte_offset_map(&_pmd, haddr);
-		VM_BUG_ON(!pte_none(*pte));
-		set_pte_at(mm, haddr, pte, entry);
-		pte_unmap(pte);
-	}
-	smp_wmb(); /* make pte visible before pmd */
-	pmd_populate(mm, pmd, pgtable);
-	put_huge_zero_page();
-}
-
-void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
-		pmd_t *pmd)
-{
-	spinlock_t *ptl;
-	struct page *page;
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long haddr = address & HPAGE_PMD_MASK;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
-
-	BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
-
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
-again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-	ptl = pmd_lock(mm, pmd);
-	if (unlikely(!pmd_trans_huge(*pmd))) {
-		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-		return;
-	}
-	if (is_huge_zero_pmd(*pmd)) {
-		__split_huge_zero_page_pmd(vma, haddr, pmd);
-		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-		return;
-	}
-	page = pmd_page(*pmd);
-	VM_BUG_ON_PAGE(!page_count(page), page);
-	get_page(page);
-	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-
-	split_huge_page(page);
-
-	put_page(page);
-
-	/*
-	 * We don't always have down_write of mmap_sem here: a racing
-	 * do_huge_pmd_wp_page() might have copied-on-write to another
-	 * huge page before our split_huge_page() got the anon_vma lock.
-	 */
-	if (unlikely(pmd_trans_huge(*pmd)))
-		goto again;
-}
-
-void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd)
-{
-	struct vm_area_struct *vma;
-
-	vma = find_vma(mm, address);
-	BUG_ON(vma == NULL);
-	split_huge_page_pmd(vma, address, pmd);
-}
-
-static void split_huge_page_address(struct mm_struct *mm,
-				    unsigned long address)
-{
-	pgd_t *pgd;
-	pud_t *pud;
-	pmd_t *pmd;
-
-	VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));
-
-	pgd = pgd_offset(mm, address);
-	if (!pgd_present(*pgd))
-		return;
-
-	pud = pud_offset(pgd, address);
-	if (!pud_present(*pud))
-		return;
-
-	pmd = pmd_offset(pud, address);
-	if (!pmd_present(*pmd))
-		return;
-	/*
-	 * Caller holds the mmap_sem write mode, so a huge pmd cannot
-	 * materialize from under us.
-	 */
-	split_huge_page_pmd_mm(mm, address, pmd);
-}
-
 void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 			     unsigned long start,
 			     unsigned long end,
-- 
2.1.1



* [PATCH 03/19] mm: change PageAnon() to work on tail pages
  2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
  2014-11-05 14:49 ` [PATCH 01/19] mm, thp: drop FOLL_SPLIT Kirill A. Shutemov
  2014-11-05 14:49 ` [PATCH 02/19] thp: cluster split_huge_page* code together Kirill A. Shutemov
@ 2014-11-05 14:49 ` Kirill A. Shutemov
  2014-11-05 14:49 ` [PATCH 04/19] mm: avoid PG_locked " Kirill A. Shutemov
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm, Kirill A. Shutemov

Currently, PageAnon() always returns false for tail pages. We need to look
at the head page for the correct answer. Let's change the function to give
the right result.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 98c11c5be0ad..1825c468f158 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -983,6 +983,7 @@ struct address_space *page_file_mapping(struct page *page)
 
 static inline int PageAnon(struct page *page)
 {
+	page = compound_head(page);
 	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
 }
 
-- 
2.1.1



* [PATCH 04/19] mm: avoid PG_locked on tail pages
  2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2014-11-05 14:49 ` [PATCH 03/19] mm: change PageAnon() to work on tail pages Kirill A. Shutemov
@ 2014-11-05 14:49 ` Kirill A. Shutemov
  2014-11-05 14:49 ` [PATCH 05/19] rmap: add argument to charge compound page Kirill A. Shutemov
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting, PTE entries can point to tail pages. It doesn't
make much sense to mark a tail page locked -- we need to protect the whole
compound page.

This patch adjusts the helpers related to PG_locked to operate on the head
page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/page-flags.h | 3 ++-
 include/linux/pagemap.h    | 5 +++++
 mm/filemap.c               | 1 +
 mm/slub.c                  | 2 ++
 4 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e1f5fcd79792..676f72d29ac2 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -203,7 +203,8 @@ static inline int __TestClearPage##uname(struct page *page) { return 0; }
 
 struct page;	/* forward declaration */
 
-TESTPAGEFLAG(Locked, locked)
+#define PageLocked(page) test_bit(PG_locked, &compound_head(page)->flags)
+
 PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error)
 PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced)
 	__SETPAGEFLAG(Referenced, referenced)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3df8c7db7a4e..110e86e480bb 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -445,16 +445,19 @@ extern void unlock_page(struct page *page);
 
 static inline void __set_page_locked(struct page *page)
 {
+	VM_BUG_ON_PAGE(PageTail(page), page);
 	__set_bit(PG_locked, &page->flags);
 }
 
 static inline void __clear_page_locked(struct page *page)
 {
+	VM_BUG_ON_PAGE(PageTail(page), page);
 	__clear_bit(PG_locked, &page->flags);
 }
 
 static inline int trylock_page(struct page *page)
 {
+	page = compound_head(page);
 	return (likely(!test_and_set_bit_lock(PG_locked, &page->flags)));
 }
 
@@ -505,6 +508,7 @@ extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
 
 static inline int wait_on_page_locked_killable(struct page *page)
 {
+	page = compound_head(page);
 	if (PageLocked(page))
 		return wait_on_page_bit_killable(page, PG_locked);
 	return 0;
@@ -519,6 +523,7 @@ static inline int wait_on_page_locked_killable(struct page *page)
  */
 static inline void wait_on_page_locked(struct page *page)
 {
+	page = compound_head(page);
 	if (PageLocked(page))
 		wait_on_page_bit(page, PG_locked);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index f501b56ec2c6..020d4afd45df 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -735,6 +735,7 @@ EXPORT_SYMBOL_GPL(add_page_wait_queue);
  */
 void unlock_page(struct page *page)
 {
+	page = compound_head(page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	clear_bit_unlock(PG_locked, &page->flags);
 	smp_mb__after_atomic();
diff --git a/mm/slub.c b/mm/slub.c
index 3e8afcc07a76..de37b20abaa9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -347,11 +347,13 @@ static inline int oo_objects(struct kmem_cache_order_objects x)
  */
 static __always_inline void slab_lock(struct page *page)
 {
+	VM_BUG_ON_PAGE(PageTail(page), page);
 	bit_spin_lock(PG_locked, &page->flags);
 }
 
 static __always_inline void slab_unlock(struct page *page)
 {
+	VM_BUG_ON_PAGE(PageTail(page), page);
 	__bit_spin_unlock(PG_locked, &page->flags);
 }
 
-- 
2.1.1



* [PATCH 05/19] rmap: add argument to charge compound page
  2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2014-11-05 14:49 ` [PATCH 04/19] mm: avoid PG_locked " Kirill A. Shutemov
@ 2014-11-05 14:49 ` Kirill A. Shutemov
  2014-11-05 14:49 ` [PATCH 06/19] mm: store mapcount for compound page separate Kirill A. Shutemov
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm, Kirill A. Shutemov

We're going to allow mapping of individual 4k pages of a THP compound
page. It means we cannot rely on the PageTransHuge() check to decide
whether to map a small page or a THP.

The patch adds a new argument to the rmap functions to indicate whether we
want to map the whole compound page or only a small page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/rmap.h    | 14 +++++++++++---
 kernel/events/uprobes.c |  4 ++--
 mm/filemap_xip.c        |  2 +-
 mm/huge_memory.c        | 16 ++++++++--------
 mm/hugetlb.c            |  4 ++--
 mm/ksm.c                |  4 ++--
 mm/memory.c             | 14 +++++++-------
 mm/migrate.c            |  8 ++++----
 mm/rmap.c               | 46 ++++++++++++++++++++++++++++------------------
 mm/swapfile.c           |  4 ++--
 10 files changed, 67 insertions(+), 49 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index be574506e6a9..ef09ca48c789 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -156,16 +156,24 @@ static inline void anon_vma_merge(struct vm_area_struct *vma,
 
 struct anon_vma *page_get_anon_vma(struct page *page);
 
+/* flags for do_page_add_anon_rmap() */
+enum {
+	RMAP_EXCLUSIVE = 1,
+	RMAP_COMPOUND = 2,
+};
+
 /*
  * rmap interfaces called when adding or removing pte of page
  */
 void page_move_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
-void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
+void page_add_anon_rmap(struct page *, struct vm_area_struct *,
+		unsigned long, bool);
 void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
 			   unsigned long, int);
-void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
+void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
+		unsigned long, bool);
 void page_add_file_rmap(struct page *);
-void page_remove_rmap(struct page *);
+void page_remove_rmap(struct page *, bool);
 
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 			    unsigned long);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 1d0af8a2c646..de133050e948 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -183,7 +183,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 		goto unlock;
 
 	get_page(kpage);
-	page_add_new_anon_rmap(kpage, vma, addr);
+	page_add_new_anon_rmap(kpage, vma, addr, false);
 	mem_cgroup_commit_charge(kpage, memcg, false);
 	lru_cache_add_active_or_unevictable(kpage, vma);
 
@@ -196,7 +196,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	ptep_clear_flush(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
 
-	page_remove_rmap(page);
+	page_remove_rmap(page, false);
 	if (!page_mapped(page))
 		try_to_free_swap(page);
 	pte_unmap_unlock(ptep, ptl);
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3f685c..8f7587e44004 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -193,7 +193,7 @@ retry:
 			/* Nuke the page table entry. */
 			flush_cache_page(vma, address, pte_pfn(*pte));
 			pteval = ptep_clear_flush(vma, address, pte);
-			page_remove_rmap(page);
+			page_remove_rmap(page, false);
 			dec_mm_counter(mm, MM_FILEPAGES);
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 52973809777f..9c53800c4eea 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -748,7 +748,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		pmd_t entry;
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
-		page_add_new_anon_rmap(page, vma, haddr);
+		page_add_new_anon_rmap(page, vma, haddr, true);
 		mem_cgroup_commit_charge(page, memcg, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(mm, pmd, pgtable);
@@ -1048,7 +1048,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
-		page_add_new_anon_rmap(pages[i], vma, haddr);
+		page_add_new_anon_rmap(pages[i], vma, haddr, false);
 		mem_cgroup_commit_charge(pages[i], memcg, false);
 		lru_cache_add_active_or_unevictable(pages[i], vma);
 		pte = pte_offset_map(&_pmd, haddr);
@@ -1060,7 +1060,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
-	page_remove_rmap(page);
+	page_remove_rmap(page, true);
 	spin_unlock(ptl);
 
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
@@ -1180,7 +1180,7 @@ alloc:
 		entry = mk_huge_pmd(new_page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		pmdp_clear_flush(vma, haddr, pmd);
-		page_add_new_anon_rmap(new_page, vma, haddr);
+		page_add_new_anon_rmap(new_page, vma, haddr, true);
 		mem_cgroup_commit_charge(new_page, memcg, false);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		set_pmd_at(mm, haddr, pmd, entry);
@@ -1190,7 +1190,7 @@ alloc:
 			put_huge_zero_page();
 		} else {
 			VM_BUG_ON_PAGE(!PageHead(page), page);
-			page_remove_rmap(page);
+			page_remove_rmap(page, true);
 			put_page(page);
 		}
 		ret |= VM_FAULT_WRITE;
@@ -1409,7 +1409,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			put_huge_zero_page();
 		} else {
 			page = pmd_page(orig_pmd);
-			page_remove_rmap(page);
+			page_remove_rmap(page, true);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 			VM_BUG_ON_PAGE(!PageHead(page), page);
@@ -2319,7 +2319,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			 * superfluous.
 			 */
 			pte_clear(vma->vm_mm, address, _pte);
-			page_remove_rmap(src_page);
+			page_remove_rmap(src_page, false);
 			spin_unlock(ptl);
 			free_page_and_swap_cache(src_page);
 		}
@@ -2615,7 +2615,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
-	page_add_new_anon_rmap(new_page, vma, address);
+	page_add_new_anon_rmap(new_page, vma, address, true);
 	mem_cgroup_commit_charge(new_page, memcg, false);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index eeceeeb09019..dad8e0732922 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2683,7 +2683,7 @@ again:
 		if (huge_pte_dirty(pte))
 			set_page_dirty(page);
 
-		page_remove_rmap(page);
+		page_remove_rmap(page, true);
 		force_flush = !__tlb_remove_page(tlb, page);
 		if (force_flush) {
 			spin_unlock(ptl);
@@ -2901,7 +2901,7 @@ retry_avoidcopy:
 		huge_ptep_clear_flush(vma, address, ptep);
 		set_huge_pte_at(mm, address, ptep,
 				make_huge_pte(vma, new_page, 1));
-		page_remove_rmap(old_page);
+		page_remove_rmap(old_page, true);
 		hugepage_add_new_anon_rmap(new_page, vma, address);
 		/* Make the old page be freed below */
 		new_page = old_page;
diff --git a/mm/ksm.c b/mm/ksm.c
index f7de4c07c693..00da250cc560 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -960,13 +960,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	}
 
 	get_page(kpage);
-	page_add_anon_rmap(kpage, vma, addr);
+	page_add_anon_rmap(kpage, vma, addr, false);
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
 
-	page_remove_rmap(page);
+	page_remove_rmap(page, false);
 	if (!page_mapped(page))
 		try_to_free_swap(page);
 	put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index 042f8b3cabc1..6f84c8a51cc0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1141,7 +1141,7 @@ again:
 					mark_page_accessed(page);
 				rss[MM_FILEPAGES]--;
 			}
-			page_remove_rmap(page);
+			page_remove_rmap(page, false);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(!__tlb_remove_page(tlb, page))) {
@@ -2232,7 +2232,7 @@ gotten:
 		 * thread doing COW.
 		 */
 		ptep_clear_flush(vma, address, page_table);
-		page_add_new_anon_rmap(new_page, vma, address);
+		page_add_new_anon_rmap(new_page, vma, address, false);
 		mem_cgroup_commit_charge(new_page, memcg, false);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		/*
@@ -2265,7 +2265,7 @@ gotten:
 			 * mapcount is visible. So transitively, TLBs to
 			 * old page will be flushed before it can be reused.
 			 */
-			page_remove_rmap(old_page);
+			page_remove_rmap(old_page, false);
 		}
 
 		/* Free the old page.. */
@@ -2524,7 +2524,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
-		exclusive = 1;
+		exclusive = RMAP_EXCLUSIVE;
 	}
 	flush_icache_page(vma, page);
 	if (pte_swp_soft_dirty(orig_pte))
@@ -2534,7 +2534,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		do_page_add_anon_rmap(page, vma, address, exclusive);
 		mem_cgroup_commit_charge(page, memcg, true);
 	} else { /* ksm created a completely new copy */
-		page_add_new_anon_rmap(page, vma, address);
+		page_add_new_anon_rmap(page, vma, address, false);
 		mem_cgroup_commit_charge(page, memcg, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	}
@@ -2672,7 +2672,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto release;
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, vma, address);
+	page_add_new_anon_rmap(page, vma, address, false);
 	mem_cgroup_commit_charge(page, memcg, false);
 	lru_cache_add_active_or_unevictable(page, vma);
 setpte:
@@ -2757,7 +2757,7 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 		entry = pte_mksoft_dirty(entry);
 	if (anon) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-		page_add_new_anon_rmap(page, vma, address);
+		page_add_new_anon_rmap(page, vma, address, false);
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, MM_FILEPAGES);
 		page_add_file_rmap(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index ad4694515f31..6b9413df1661 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -163,7 +163,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		else
 			page_dup_rmap(new);
 	} else if (PageAnon(new))
-		page_add_anon_rmap(new, vma, addr);
+		page_add_anon_rmap(new, vma, addr, false);
 	else
 		page_add_file_rmap(new);
 
@@ -1863,7 +1863,7 @@ fail_putback:
 	 * guarantee the copy is visible before the pagetable update.
 	 */
 	flush_cache_range(vma, mmun_start, mmun_end);
-	page_add_anon_rmap(new_page, vma, mmun_start);
+	page_add_anon_rmap(new_page, vma, mmun_start, true);
 	pmdp_clear_flush(vma, mmun_start, pmd);
 	set_pmd_at(mm, mmun_start, pmd, entry);
 	flush_tlb_range(vma, mmun_start, mmun_end);
@@ -1873,13 +1873,13 @@ fail_putback:
 		set_pmd_at(mm, mmun_start, pmd, orig_entry);
 		flush_tlb_range(vma, mmun_start, mmun_end);
 		update_mmu_cache_pmd(vma, address, &entry);
-		page_remove_rmap(new_page);
+		page_remove_rmap(new_page, true);
 		goto fail_putback;
 	}
 
 	mem_cgroup_migrate(page, new_page, false);
 
-	page_remove_rmap(page);
+	page_remove_rmap(page, true);
 
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
diff --git a/mm/rmap.c b/mm/rmap.c
index 3e8491c504f8..f706a6af1801 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -973,9 +973,9 @@ static void __page_check_anon_rmap(struct page *page,
  * (but PageKsm is never downgraded to PageAnon).
  */
 void page_add_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address)
+	struct vm_area_struct *vma, unsigned long address, bool compound)
 {
-	do_page_add_anon_rmap(page, vma, address, 0);
+	do_page_add_anon_rmap(page, vma, address, compound ? RMAP_COMPOUND : 0);
 }
 
 /*
@@ -984,21 +984,24 @@ void page_add_anon_rmap(struct page *page,
  * Everybody else should continue to use page_add_anon_rmap above.
  */
 void do_page_add_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address, int exclusive)
+	struct vm_area_struct *vma, unsigned long address, int flags)
 {
 	int first = atomic_inc_and_test(&page->_mapcount);
 	if (first) {
+		bool compound = flags & RMAP_COMPOUND;
+		int nr = compound ? hpage_nr_pages(page) : 1;
 		/*
 		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
 		 * these counters are not modified in interrupt context, and
 		 * pte lock(a spinlock) is held, which implies preemption
 		 * disabled.
 		 */
-		if (PageTransHuge(page))
+		if (compound) {
+			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 			__inc_zone_page_state(page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
-		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
-				hpage_nr_pages(page));
+		}
+		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
 	}
 	if (unlikely(PageKsm(page)))
 		return;
@@ -1006,7 +1009,8 @@ void do_page_add_anon_rmap(struct page *page,
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	/* address might be in next vma when migration races vma_adjust */
 	if (first)
-		__page_set_anon_rmap(page, vma, address, exclusive);
+		__page_set_anon_rmap(page, vma, address,
+				flags & RMAP_EXCLUSIVE);
 	else
 		__page_check_anon_rmap(page, vma, address);
 }
@@ -1022,15 +1026,18 @@ void do_page_add_anon_rmap(struct page *page,
  * Page does not have to be locked.
  */
 void page_add_new_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address)
+	struct vm_area_struct *vma, unsigned long address, bool compound)
 {
+	int nr = compound ? hpage_nr_pages(page) : 1;
+
 	VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
 	SetPageSwapBacked(page);
 	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
-	if (PageTransHuge(page))
+	if (compound) {
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
-	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
-			hpage_nr_pages(page));
+	}
+	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
 	__page_set_anon_rmap(page, vma, address, 1);
 }
 
@@ -1059,7 +1066,7 @@ void page_add_file_rmap(struct page *page)
  *
  * The caller needs to hold the pte lock.
  */
-void page_remove_rmap(struct page *page)
+void page_remove_rmap(struct page *page, bool compound)
 {
 	bool anon = PageAnon(page);
 	bool locked;
@@ -1089,12 +1096,15 @@ void page_remove_rmap(struct page *page)
 	if (unlikely(PageHuge(page)))
 		goto out;
 	if (anon) {
-		if (PageTransHuge(page))
+		int nr = compound ? hpage_nr_pages(page) : 1;
+		if (compound) {
+			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 			__dec_zone_page_state(page,
-					      NR_ANON_TRANSPARENT_HUGEPAGES);
-		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
-				-hpage_nr_pages(page));
+					NR_ANON_TRANSPARENT_HUGEPAGES);
+		}
+		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
 	} else {
+		VM_BUG_ON_PAGE(compound, page);
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
 		mem_cgroup_end_update_page_stat(page, &locked, &flags);
@@ -1227,7 +1237,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	} else
 		dec_mm_counter(mm, MM_FILEPAGES);
 
-	page_remove_rmap(page);
+	page_remove_rmap(page, false);
 	page_cache_release(page);
 
 out_unmap:
@@ -1374,7 +1384,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		if (pte_dirty(pteval))
 			set_page_dirty(page);
 
-		page_remove_rmap(page);
+		page_remove_rmap(page, false);
 		page_cache_release(page);
 		dec_mm_counter(mm, MM_FILEPAGES);
 		(*mapcount)--;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8798b2e0ac59..57252bb35041 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1121,10 +1121,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	set_pte_at(vma->vm_mm, addr, pte,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	if (page == swapcache) {
-		page_add_anon_rmap(page, vma, addr);
+		page_add_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, true);
 	} else { /* ksm created a completely new copy */
-		page_add_new_anon_rmap(page, vma, addr);
+		page_add_new_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	}
-- 
2.1.1



* [PATCH 06/19] mm: store mapcount for compound page separate
  2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2014-11-05 14:49 ` [PATCH 05/19] rmap: add argument to charge compound page Kirill A. Shutemov
@ 2014-11-05 14:49 ` Kirill A. Shutemov
  2014-11-18  8:43   ` Naoya Horiguchi
                     ` (2 more replies)
  2014-11-05 14:49 ` [PATCH 07/19] mm, thp: adjust conditions when we can reuse the page on WP fault Kirill A. Shutemov
                   ` (12 subsequent siblings)
  18 siblings, 3 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm, Kirill A. Shutemov

We're going to allow mapping of individual 4k pages of a THP compound
page, and we need a cheap way to find out how many times the compound page
is mapped with a PMD -- compound_mapcount() does this.

page_mapcount() counts both PTE and PMD mappings of the page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h   | 17 +++++++++++++++--
 include/linux/rmap.h |  4 ++--
 mm/huge_memory.c     | 23 ++++++++++++++---------
 mm/hugetlb.c         |  4 ++--
 mm/memory.c          |  2 +-
 mm/migrate.c         |  2 +-
 mm/page_alloc.c      | 13 ++++++++++---
 mm/rmap.c            | 50 +++++++++++++++++++++++++++++++++++++++++++-------
 8 files changed, 88 insertions(+), 27 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1825c468f158..aef03acff228 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -435,6 +435,19 @@ static inline struct page *compound_head(struct page *page)
 	return page;
 }
 
+static inline atomic_t *compound_mapcount_ptr(struct page *page)
+{
+	return (atomic_t *)&page[1].mapping;
+}
+
+static inline int compound_mapcount(struct page *page)
+{
+	if (!PageCompound(page))
+		return 0;
+	page = compound_head(page);
+	return atomic_read(compound_mapcount_ptr(page)) + 1;
+}
+
 /*
  * The atomic page->_mapcount, starts from -1: so that transitions
  * both from it and to it can be tracked, using atomic_inc_and_test
@@ -447,7 +460,7 @@ static inline void page_mapcount_reset(struct page *page)
 
 static inline int page_mapcount(struct page *page)
 {
-	return atomic_read(&(page)->_mapcount) + 1;
+	return atomic_read(&(page)->_mapcount) + compound_mapcount(page) + 1;
 }
 
 static inline int page_count(struct page *page)
@@ -1017,7 +1030,7 @@ static inline pgoff_t page_file_index(struct page *page)
  */
 static inline int page_mapped(struct page *page)
 {
-	return atomic_read(&(page)->_mapcount) >= 0;
+	return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
 }
 
 /*
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index ef09ca48c789..a9499ad8c037 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -180,9 +180,9 @@ void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 				unsigned long);
 
-static inline void page_dup_rmap(struct page *page)
+static inline void page_dup_rmap(struct page *page, bool compound)
 {
-	atomic_inc(&page->_mapcount);
+	atomic_inc(compound ? compound_mapcount_ptr(page) : &page->_mapcount);
 }
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9c53800c4eea..869f9bcf481e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -904,7 +904,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	src_page = pmd_page(pmd);
 	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
 	get_page(src_page);
-	page_dup_rmap(src_page);
+	page_dup_rmap(src_page, true);
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
@@ -1763,8 +1763,8 @@ static void __split_huge_page_refcount(struct page *page,
 		struct page *page_tail = page + i;
 
 		/* tail_page->_mapcount cannot change */
-		BUG_ON(page_mapcount(page_tail) < 0);
-		tail_count += page_mapcount(page_tail);
+		BUG_ON(atomic_read(&page_tail->_mapcount) + 1 < 0);
+		tail_count += atomic_read(&page_tail->_mapcount) + 1;
 		/* check for overflow */
 		BUG_ON(tail_count < 0);
 		BUG_ON(atomic_read(&page_tail->_count) != 0);
@@ -1781,8 +1781,7 @@ static void __split_huge_page_refcount(struct page *page,
 		 * atomic_set() here would be safe on all archs (and
 		 * not only on x86), it's safer to use atomic_add().
 		 */
-		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
-			   &page_tail->_count);
+		atomic_add(page_mapcount(page_tail) + 1, &page_tail->_count);
 
 		/* after clearing PageTail the gup refcount can be released */
 		smp_mb__after_atomic();
@@ -1819,15 +1818,18 @@ static void __split_huge_page_refcount(struct page *page,
 		 * status is achieved setting a reserved bit in the
 		 * pmd, not by clearing the present bit.
 		*/
-		page_tail->_mapcount = page->_mapcount;
+		atomic_set(&page_tail->_mapcount, compound_mapcount(page) - 1);
 
-		BUG_ON(page_tail->mapping);
-		page_tail->mapping = page->mapping;
+		/* ->mapping in first tail page is compound_mapcount */
+		if (i != 1) {
+			BUG_ON(page_tail->mapping);
+			page_tail->mapping = page->mapping;
+			BUG_ON(!PageAnon(page_tail));
+		}
 
 		page_tail->index = page->index + i;
 		page_cpupid_xchg_last(page_tail, page_cpupid_last(page));
 
-		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
 		BUG_ON(!PageDirty(page_tail));
 		BUG_ON(!PageSwapBacked(page_tail));
@@ -1837,6 +1839,9 @@ static void __split_huge_page_refcount(struct page *page,
 	atomic_sub(tail_count, &page->_count);
 	BUG_ON(atomic_read(&page->_count) <= 0);
 
+	page->_mapcount = *compound_mapcount_ptr(page);
+	page[1].mapping = page->mapping;
+
 	__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
 
 	ClearPageCompound(page);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dad8e0732922..445db64a8b08 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2603,7 +2603,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
 			get_page(ptepage);
-			page_dup_rmap(ptepage);
+			page_dup_rmap(ptepage, true);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		}
 		spin_unlock(src_ptl);
@@ -3058,7 +3058,7 @@ retry:
 		ClearPagePrivate(page);
 		hugepage_add_new_anon_rmap(page, vma, address);
 	} else
-		page_dup_rmap(page);
+		page_dup_rmap(page, true);
 	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
 				&& (vma->vm_flags & VM_SHARED)));
 	set_huge_pte_at(mm, address, ptep, new_pte);
diff --git a/mm/memory.c b/mm/memory.c
index 6f84c8a51cc0..1b17a72dc93f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -872,7 +872,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
 		get_page(page);
-		page_dup_rmap(page);
+		page_dup_rmap(page, false);
 		if (PageAnon(page))
 			rss[MM_ANONPAGES]++;
 		else
diff --git a/mm/migrate.c b/mm/migrate.c
index 6b9413df1661..f1a12ced2531 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -161,7 +161,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		if (PageAnon(new))
 			hugepage_add_anon_rmap(new, vma, addr);
 		else
-			page_dup_rmap(new);
+			page_dup_rmap(new, false);
 	} else if (PageAnon(new))
 		page_add_anon_rmap(new, vma, addr, false);
 	else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d0e3d2fee585..b19d1e69ca12 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -369,6 +369,7 @@ void prep_compound_page(struct page *page, unsigned long order)
 
 	set_compound_page_dtor(page, free_compound_page);
 	set_compound_order(page, order);
+	atomic_set(compound_mapcount_ptr(page), -1);
 	__SetPageHead(page);
 	for (i = 1; i < nr_pages; i++) {
 		struct page *p = page + i;
@@ -643,7 +644,9 @@ static inline int free_pages_check(struct page *page)
 
 	if (unlikely(page_mapcount(page)))
 		bad_reason = "nonzero mapcount";
-	if (unlikely(page->mapping != NULL))
+	if (unlikely(compound_mapcount(page)))
+		bad_reason = "nonzero compound_mapcount";
+	if (unlikely(page->mapping != NULL) && !PageTail(page))
 		bad_reason = "non-NULL mapping";
 	if (unlikely(atomic_read(&page->_count) != 0))
 		bad_reason = "nonzero _count";
@@ -760,6 +763,8 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
 		bad += free_pages_check(page + i);
 	if (bad)
 		return false;
+	if (order)
+		page[1].mapping = NULL;
 
 	if (!PageHighMem(page)) {
 		debug_check_no_locks_freed(page_address(page),
@@ -6632,10 +6637,12 @@ static void dump_page_flags(unsigned long flags)
 void dump_page_badflags(struct page *page, const char *reason,
 		unsigned long badflags)
 {
-	printk(KERN_ALERT
-	       "page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
+	pr_alert("page:%p count:%d mapcount:%d mapping:%p index:%#lx",
 		page, atomic_read(&page->_count), page_mapcount(page),
 		page->mapping, page->index);
+	if (PageCompound(page))
+		printk(" compound_mapcount: %d", compound_mapcount(page));
+	printk("\n");
 	dump_page_flags(page->flags);
 	if (reason)
 		pr_alert("page dumped because: %s\n", reason);
diff --git a/mm/rmap.c b/mm/rmap.c
index f706a6af1801..eecc9301847d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -986,9 +986,30 @@ void page_add_anon_rmap(struct page *page,
 void do_page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address, int flags)
 {
-	int first = atomic_inc_and_test(&page->_mapcount);
+	bool compound = flags & RMAP_COMPOUND;
+	bool first;
+
+	VM_BUG_ON_PAGE(!PageLocked(compound_head(page)), page);
+
+	if (PageTransCompound(page)) {
+		struct page *head_page = compound_head(page);
+
+		if (compound) {
+			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+			first = atomic_inc_and_test(compound_mapcount_ptr(page));
+		} else {
+			/* Anon THP always mapped first with PMD */
+			first = 0;
+			VM_BUG_ON_PAGE(!compound_mapcount(head_page),
+					head_page);
+			atomic_inc(&page->_mapcount);
+		}
+	} else {
+		VM_BUG_ON_PAGE(compound, page);
+		first = atomic_inc_and_test(&page->_mapcount);
+	}
+
 	if (first) {
-		bool compound = flags & RMAP_COMPOUND;
 		int nr = compound ? hpage_nr_pages(page) : 1;
 		/*
 		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
@@ -1006,7 +1027,6 @@ void do_page_add_anon_rmap(struct page *page,
 	if (unlikely(PageKsm(page)))
 		return;
 
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	/* address might be in next vma when migration races vma_adjust */
 	if (first)
 		__page_set_anon_rmap(page, vma, address,
@@ -1032,10 +1052,19 @@ void page_add_new_anon_rmap(struct page *page,
 
 	VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
 	SetPageSwapBacked(page);
-	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
 	if (compound) {
+		atomic_t *compound_mapcount;
+
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+		compound_mapcount = (atomic_t *)&page[1].mapping;
+		/* increment count (starts at -1) */
+		atomic_set(compound_mapcount, 0);
 		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+	} else {
+		/* Anon THP always mapped first with PMD */
+		VM_BUG_ON_PAGE(PageTransCompound(page), page);
+		/* increment count (starts at -1) */
+		atomic_set(&page->_mapcount, 0);
 	}
 	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
 	__page_set_anon_rmap(page, vma, address, 1);
@@ -1081,7 +1110,9 @@ void page_remove_rmap(struct page *page, bool compound)
 		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 
 	/* page still mapped by someone else? */
-	if (!atomic_add_negative(-1, &page->_mapcount))
+	if (!atomic_add_negative(-1, compound ?
+				compound_mapcount_ptr(page) :
+				&page->_mapcount))
 		goto out;
 
 	/*
@@ -1098,9 +1129,14 @@ void page_remove_rmap(struct page *page, bool compound)
 	if (anon) {
 		int nr = compound ? hpage_nr_pages(page) : 1;
 		if (compound) {
+			int i;
 			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 			__dec_zone_page_state(page,
 					NR_ANON_TRANSPARENT_HUGEPAGES);
+			/* The page can be mapped with ptes */
+			for (i = 0; i < HPAGE_PMD_NR; i++)
+				if (page_mapcount(page + i))
+					nr--;
 		}
 		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
 	} else {
@@ -1749,7 +1785,7 @@ void hugepage_add_anon_rmap(struct page *page,
 	BUG_ON(!PageLocked(page));
 	BUG_ON(!anon_vma);
 	/* address might be in next vma when migration races vma_adjust */
-	first = atomic_inc_and_test(&page->_mapcount);
+	first = atomic_inc_and_test(compound_mapcount_ptr(page));
 	if (first)
 		__hugepage_set_anon_rmap(page, vma, address, 0);
 }
@@ -1758,7 +1794,7 @@ void hugepage_add_new_anon_rmap(struct page *page,
 			struct vm_area_struct *vma, unsigned long address)
 {
 	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
-	atomic_set(&page->_mapcount, 0);
+	atomic_set(compound_mapcount_ptr(page), 0);
 	__hugepage_set_anon_rmap(page, vma, address, 1);
 }
 #endif /* CONFIG_HUGETLB_PAGE */
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 07/19] mm, thp: adjust conditions when we can reuse the page on WP fault
  2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2014-11-05 14:49 ` [PATCH 06/19] mm: store mapcount for compound page separate Kirill A. Shutemov
@ 2014-11-05 14:49 ` Kirill A. Shutemov
  2014-11-05 14:49 ` [PATCH 08/19] mm: prepare migration code for new THP refcounting Kirill A. Shutemov
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting we will be able to map the same compound page
with both PTEs and PMDs. This requires adjusting the conditions under
which we can reuse the page on a write-protection fault.

For a PTE fault we can't reuse the page if it's part of a huge page.

For a PMD fault we can only reuse the page if nobody else maps the huge
page or its part. We could check page_mapcount() on each sub-page, but
that's expensive.
The cheaper way is to check that page_count() equals 1: every mapcount
takes a page reference, so this guarantees that the PMD is the only
mapping.
This can give a false negative if somebody pinned the page, but that's
fine.
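
For illustration, a minimal sketch of the resulting check on the PMD
write-protection fault path (condensed from the mm/huge_memory.c hunk
below; it relies on the new rule that every mapping takes a page
reference):

	/*
	 * One mapcount plus page_count() == 1 means this PMD is the only
	 * mapping and nobody holds an extra pin (a GUP pin would bump
	 * page_count() to 2 and force a copy instead of reuse).
	 */
	if (page_mapcount(page) == 1 && page_count(page) == 1) {
		/* reuse in place: just make the PMD young/dirty/writable */
	} else {
		/* somebody else maps or pins the page: fall back to copy */
	}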

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/swap.h |  8 +++++++-
 mm/huge_memory.c     | 12 +++++++++++-
 mm/swapfile.c        |  3 +++
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1b72060f093a..79333ea921c8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -555,7 +555,13 @@ static inline int page_swapcount(struct page *page)
 	return 0;
 }
 
-#define reuse_swap_page(page)	(page_mapcount(page) == 1)
+static inline int reuse_swap_page(struct page *page)
+{
+	/* The page is part of THP and cannot be reused */
+	if (PageTransCompound(page))
+		return 0;
+	return page_mapcount(page) == 1;
+}
 
 static inline int try_to_free_swap(struct page *page)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 869f9bcf481e..aa22350673a7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1106,7 +1106,17 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	page = pmd_page(orig_pmd);
 	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
-	if (page_mapcount(page) == 1) {
+	/*
+	 * We can only reuse the page if nobody else maps the huge page or
+	 * its part. We can do it by checking page_mapcount() on each
+	 * sub-page, but it's expensive.
+	 * The cheaper way is to check that page_count() equals 1: every
+	 * mapcount takes a page reference, so this way we can guarantee
+	 * that the PMD is the only mapping.
+	 * This can give a false negative if somebody pinned the page, but
+	 * that's fine.
+	 */
+	if (page_mapcount(page) == 1 && page_count(page) == 1) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 57252bb35041..dfc81ba15697 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -887,6 +887,9 @@ int reuse_swap_page(struct page *page)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	if (unlikely(PageKsm(page)))
 		return 0;
+	/* The page is part of THP and cannot be reused */
+	if (PageTransCompound(page))
+		return 0;
 	count = page_mapcount(page);
 	if (count <= 1 && PageSwapCache(page)) {
 		count += page_swapcount(page);
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 08/19] mm: prepare migration code for new THP refcounting
  2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2014-11-05 14:49 ` [PATCH 07/19] mm, thp: adjust conditions when we can reuse the page on WP fault Kirill A. Shutemov
@ 2014-11-05 14:49 ` Kirill A. Shutemov
  2014-11-05 14:49 ` [PATCH 09/19] thp: rename split_huge_page_pmd() to split_huge_pmd() Kirill A. Shutemov
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting VMAs can start or end in the middle of a huge
page. We need to modify the code to call split_huge_page() properly.
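
The pattern used below for do_move_page_to_node_array() is the general
one: move the pin from the tail page to the head page, try to split, and
retry the lookup afterwards. A condensed sketch of that pattern (names as
in the hunk below):

	head_page = compound_head(page);
	/* split_huge_page() wants the pin to be only on the head page */
	if (page != head_page) {
		get_page(head_page);
		put_page(page);
	}
	err = split_huge_page(head_page);
	if (err) {
		put_page(head_page);
		goto set_status;	/* split failed, e.g. page is pinned */
	}
	if (page != head_page) {
		put_page(head_page);
		goto retry;		/* look up the now-small page again */
	}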

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/migrate.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index f1a12ced2531..4dc941100388 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1235,7 +1235,7 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 		vma = find_vma(mm, pp->addr);
 		if (!vma || pp->addr < vma->vm_start || !vma_migratable(vma))
 			goto set_status;
-
+retry:
 		page = follow_page(vma, pp->addr, FOLL_GET);
 
 		err = PTR_ERR(page);
@@ -1246,9 +1246,27 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
 		if (!page)
 			goto set_status;
 
-		if (PageTransHuge(page) && split_huge_page(page)) {
-			err = -EBUSY;
-			goto set_status;
+		if (PageTransCompound(page)) {
+			struct page *head_page = compound_head(page);
+
+			/*
+			 * split_huge_page() wants pin to be only on head page
+			 */
+			if (page != head_page) {
+				get_page(head_page);
+				put_page(page);
+			}
+
+			err = split_huge_page(head_page);
+			if (err) {
+				put_page(head_page);
+				goto set_status;
+			}
+
+			if (page != head_page) {
+				put_page(head_page);
+				goto retry;
+			}
 		}
 
 		/* Use PageReserved to check for zero page */
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 09/19] thp: rename split_huge_page_pmd() to split_huge_pmd()
  2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
                   ` (7 preceding siblings ...)
  2014-11-05 14:49 ` [PATCH 08/19] mm: prepare migration code for new THP refcounting Kirill A. Shutemov
@ 2014-11-05 14:49 ` Kirill A. Shutemov
  2014-11-05 14:49 ` [PATCH 10/19] thp: PMD splitting without splitting compound page Kirill A. Shutemov
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm, Kirill A. Shutemov

We are going to decouple splitting a THP PMD from splitting the
underlying compound page.

This patch renames the split_huge_page_pmd*() functions to
split_huge_pmd*() to reflect the fact that they only imply splitting the
PMD, not the page.
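
The conversion is mechanical apart from the argument order, e.g.:

	/* old: name suggests the compound page gets split too */
	split_huge_page_pmd(vma, addr, pmd);

	/* new: only a PMD split is implied by the interface */
	split_huge_pmd(vma, pmd, addr);

split_huge_page_pmd_mm() callers, which had no VMA at hand, now look the
VMA up themselves (see the vm86_32.c and pagewalk.c hunks below).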

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/powerpc/mm/subpage-prot.c |  2 +-
 arch/s390/mm/pgtable.c         |  2 +-
 arch/x86/kernel/vm86_32.c      |  6 +++++-
 include/linux/huge_mm.h        |  8 ++------
 mm/huge_memory.c               | 33 +++++++++++++--------------------
 mm/memory.c                    |  2 +-
 mm/mempolicy.c                 |  2 +-
 mm/mprotect.c                  |  2 +-
 mm/mremap.c                    |  2 +-
 mm/pagewalk.c                  |  2 +-
 10 files changed, 27 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/mm/subpage-prot.c b/arch/powerpc/mm/subpage-prot.c
index fa9fb5b4c66c..d5543514c1df 100644
--- a/arch/powerpc/mm/subpage-prot.c
+++ b/arch/powerpc/mm/subpage-prot.c
@@ -135,7 +135,7 @@ static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr,
 				  unsigned long end, struct mm_walk *walk)
 {
 	struct vm_area_struct *vma = walk->vma;
-	split_huge_page_pmd(vma, addr, pmd);
+	split_huge_pmd(vma, pmd, addr);
 	return 0;
 }
 
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index a43f4d33f376..a14585c0b4e3 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -1252,7 +1252,7 @@ static int thp_split_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
 		struct mm_walk *walk)
 {
 	struct vm_area_struct *vma = walk->vma;
-	split_huge_page_pmd(vma, addr, pmd);
+	split_huge_pmd(vma, pmd, addr);
 	return 0;
 }
 
diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index e8edcf52e069..883160599965 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -182,7 +182,11 @@ static void mark_screen_rdonly(struct mm_struct *mm)
 	if (pud_none_or_clear_bad(pud))
 		goto out;
 	pmd = pmd_offset(pud, 0xA0000);
-	split_huge_page_pmd_mm(mm, 0xA0000, pmd);
+
+	if (pmd_trans_huge(*pmd)) {
+		struct vm_area_struct *vma = find_vma(mm, 0xA0000);
+		split_huge_pmd(vma, pmd, 0xA0000);
+	}
 	if (pmd_none_or_clear_bad(pmd))
 		goto out;
 	pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 63579cb8d3dc..e7fc53ddfe43 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -100,7 +100,7 @@ static inline int split_huge_page(struct page *page)
 }
 extern void __split_huge_page_pmd(struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd);
-#define split_huge_page_pmd(__vma, __address, __pmd)			\
+#define split_huge_pmd(__vma, __pmd, __address)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
 		if (unlikely(pmd_trans_huge(*____pmd)))			\
@@ -115,8 +115,6 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
 		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
 		       pmd_trans_huge(*____pmd));			\
 	} while (0)
-extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd);
 #if HPAGE_PMD_ORDER >= MAX_ORDER
 #error "hugepages can't be allocated by the buddy allocator"
 #endif
@@ -176,11 +174,9 @@ static inline int split_huge_page(struct page *page)
 {
 	return 0;
 }
-#define split_huge_page_pmd(__vma, __address, __pmd)	\
-	do { } while (0)
 #define wait_split_huge_page(__anon_vma, __pmd)	\
 	do { } while (0)
-#define split_huge_page_pmd_mm(__mm, __address, __pmd)	\
+#define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
 static inline int hugepage_madvise(struct vm_area_struct *vma,
 				   unsigned long *vm_flags, int advice)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index aa22350673a7..9411f2c93dd6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1137,13 +1137,13 @@ alloc:
 
 	if (unlikely(!new_page)) {
 		if (!page) {
-			split_huge_page_pmd(vma, address, pmd);
+			split_huge_pmd(vma, pmd, address);
 			ret |= VM_FAULT_FALLBACK;
 		} else {
 			ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
 					pmd, orig_pmd, page, haddr);
 			if (ret & VM_FAULT_OOM) {
-				split_huge_page(page);
+				split_huge_pmd(vma, pmd, address);
 				ret |= VM_FAULT_FALLBACK;
 			}
 			put_user_huge_page(page);
@@ -1156,10 +1156,10 @@ alloc:
 					   GFP_TRANSHUGE, &memcg))) {
 		put_page(new_page);
 		if (page) {
-			split_huge_page(page);
+			split_huge_pmd(vma, pmd, address);
 			put_user_huge_page(page);
 		} else
-			split_huge_page_pmd(vma, address, pmd);
+			split_huge_pmd(vma, pmd, address);
 		ret |= VM_FAULT_FALLBACK;
 		count_vm_event(THP_FAULT_FALLBACK);
 		goto out;
@@ -1685,17 +1685,7 @@ again:
 		goto again;
 }
 
-void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd)
-{
-	struct vm_area_struct *vma;
-
-	vma = find_vma(mm, address);
-	BUG_ON(vma == NULL);
-	split_huge_page_pmd(vma, address, pmd);
-}
-
-static void split_huge_page_address(struct mm_struct *mm,
+static void split_huge_page_address(struct vm_area_struct *vma,
 				    unsigned long address)
 {
 	pgd_t *pgd;
@@ -1704,7 +1694,7 @@ static void split_huge_page_address(struct mm_struct *mm,
 
 	VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));
 
-	pgd = pgd_offset(mm, address);
+	pgd = pgd_offset(vma->vm_mm, address);
 	if (!pgd_present(*pgd))
 		return;
 
@@ -1715,11 +1705,14 @@ static void split_huge_page_address(struct mm_struct *mm,
 	pmd = pmd_offset(pud, address);
 	if (!pmd_present(*pmd))
 		return;
+
+	if (!pmd_trans_huge(*pmd))
+		return;
 	/*
 	 * Caller holds the mmap_sem write mode, so a huge pmd cannot
 	 * materialize from under us.
 	 */
-	split_huge_page_pmd_mm(mm, address, pmd);
+	__split_huge_page_pmd(vma, address, pmd);
 }
 static int __split_huge_page_splitting(struct page *page,
 				       struct vm_area_struct *vma,
@@ -2947,7 +2940,7 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 	if (start & ~HPAGE_PMD_MASK &&
 	    (start & HPAGE_PMD_MASK) >= vma->vm_start &&
 	    (start & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
-		split_huge_page_address(vma->vm_mm, start);
+		split_huge_page_address(vma, start);
 
 	/*
 	 * If the new end address isn't hpage aligned and it could
@@ -2957,7 +2950,7 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 	if (end & ~HPAGE_PMD_MASK &&
 	    (end & HPAGE_PMD_MASK) >= vma->vm_start &&
 	    (end & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
-		split_huge_page_address(vma->vm_mm, end);
+		split_huge_page_address(vma, end);
 
 	/*
 	 * If we're also updating the vma->vm_next->vm_start, if the new
@@ -2971,6 +2964,6 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 		if (nstart & ~HPAGE_PMD_MASK &&
 		    (nstart & HPAGE_PMD_MASK) >= next->vm_start &&
 		    (nstart & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= next->vm_end)
-			split_huge_page_address(next->vm_mm, nstart);
+			split_huge_page_address(next, nstart);
 	}
 }
diff --git a/mm/memory.c b/mm/memory.c
index 1b17a72dc93f..3f7a8bd768de 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1239,7 +1239,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 					BUG();
 				}
 #endif
-				split_huge_page_pmd(vma, addr, pmd);
+				split_huge_pmd(vma, pmd, addr);
 			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
 				goto next;
 			/* fall through */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d1ac73927003..a6396a0df018 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -501,7 +501,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	split_huge_page_pmd(vma, addr, pmd);
+	split_huge_pmd(vma, pmd, addr);
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c43d557941f8..775a66a598dc 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -162,7 +162,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 		if (pmd_trans_huge(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
-				split_huge_page_pmd(vma, addr, pmd);
+				split_huge_pmd(vma, pmd, addr);
 			else {
 				int nr_ptes = change_huge_pmd(vma, pmd, addr,
 						newprot, prot_numa);
diff --git a/mm/mremap.c b/mm/mremap.c
index 05f1180e9f21..d2d1047e5dc3 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -209,7 +209,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 				need_flush = true;
 				continue;
 			} else if (!err) {
-				split_huge_page_pmd(vma, old_addr, old_pmd);
+				split_huge_pmd(vma, old_pmd, old_addr);
 			}
 			VM_BUG_ON(pmd_trans_huge(*old_pmd));
 		}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 65fb68df3aa2..4d2386364561 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,7 +58,7 @@ again:
 		if (!walk->pte_entry)
 			continue;
 
-		split_huge_page_pmd_mm(walk->mm, addr, pmd);
+		split_huge_pmd(walk->vma, pmd, addr);
 		if (pmd_trans_unstable(pmd))
 			goto again;
 		err = walk_pte_range(pmd, addr, next, walk);
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 10/19] thp: PMD splitting without splitting compound page
  2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
                   ` (8 preceding siblings ...)
  2014-11-05 14:49 ` [PATCH 09/19] thp: rename split_huge_page_pmd() to split_huge_pmd() Kirill A. Shutemov
@ 2014-11-05 14:49 ` Kirill A. Shutemov
  2014-11-19  6:57   ` Naoya Horiguchi
  2014-11-05 14:49 ` [PATCH 11/19] mm, vmstats: new THP splitting event Kirill A. Shutemov
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm, Kirill A. Shutemov

Current split_huge_page() combines two operations: splitting PMDs into
tables of PTEs and splitting the underlying compound page. This patch
changes the split_huge_pmd() implementation to split the given PMD
without touching other PMDs this page is mapped with or the underlying
compound page.

In order to do this we have to get rid of tail page refcounting, which
uses the _mapcount of tail pages. Tail page refcounting is needed to be
able to split a THP page at any point: we always know which of the tail
pages is pinned (i.e. by get_user_pages()) and can distribute the page
count correctly.

We can avoid this by allowing split_huge_page() to fail if the compound
page is pinned. This patch removes all infrastructure for tail page
refcounting and makes split_huge_page() always return -EBUSY. All
split_huge_page() users already know how to handle this failure. A
proper implementation will be added later.

Without tail page refcounting, the implementation of split_huge_pmd() is
pretty straightforward.
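
Since split_huge_page() is now allowed to fail, callers have to treat it
as best effort. A condensed sketch of the expected caller pattern (purely
illustrative; the label and surrounding error handling are placeholders,
and the caller is assumed to already hold a reference on the head page):

	if (PageTransHuge(page) && split_huge_page(page)) {
		/*
		 * -EBUSY: the compound page is pinned (or, with this
		 * patch, always, until the real implementation lands);
		 * back off, retry later, or fall back to another path.
		 */
		err = -EBUSY;
		goto fallback;
	}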

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/mips/mm/gup.c            |   4 -
 arch/powerpc/mm/hugetlbpage.c |  12 --
 arch/s390/mm/gup.c            |  13 +-
 arch/sparc/mm/gup.c           |  14 +--
 arch/x86/mm/gup.c             |   4 -
 include/linux/huge_mm.h       |   7 +-
 include/linux/mm.h            |  62 +---------
 mm/huge_memory.c              | 270 ++++++++++--------------------------------
 mm/internal.h                 |  31 +----
 mm/swap.c                     | 245 +-------------------------------------
 10 files changed, 79 insertions(+), 583 deletions(-)

diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
index 06ce17c2a905..8e56e7a2558b 100644
--- a/arch/mips/mm/gup.c
+++ b/arch/mips/mm/gup.c
@@ -87,8 +87,6 @@ static int gup_huge_pmd(pmd_t pmd, unsigned long addr, unsigned long end,
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
-		if (PageTail(page))
-			get_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;
@@ -153,8 +151,6 @@ static int gup_huge_pud(pud_t pud, unsigned long addr, unsigned long end,
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
-		if (PageTail(page))
-			get_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 7e70ae968e5f..e4ba17694b6b 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -1022,7 +1022,6 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 {
 	unsigned long mask;
 	unsigned long pte_end;
-	struct page *head, *page, *tail;
 	pte_t pte;
 	int refs;
 
@@ -1053,7 +1052,6 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 	head = pte_page(pte);
 
 	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
@@ -1075,15 +1073,5 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 		return 0;
 	}
 
-	/*
-	 * Any tail page need their mapcount reference taken before we
-	 * return.
-	 */
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 639fce464008..e4c5ca753abe 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -52,7 +52,7 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
 	unsigned long mask, result;
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 	int refs;
 
 	result = write ? 0 : _SEGMENT_ENTRY_PROTECT;
@@ -64,7 +64,6 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 	refs = 0;
 	head = pmd_page(pmd);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
@@ -85,16 +84,6 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		return 0;
 	}
 
-	/*
-	 * Any tail page need their mapcount reference taken before we
-	 * return.
-	 */
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
 
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 1aed0432c64b..04bc1aa350fa 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -56,8 +56,6 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 			put_page(head);
 			return 0;
 		}
-		if (head != page)
-			get_huge_page_tail(page);
 
 		pages[*nr] = page;
 		(*nr)++;
@@ -70,7 +68,7 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 			unsigned long end, int write, struct page **pages,
 			int *nr)
 {
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 	int refs;
 
 	if (!(pmd_val(pmd) & _PAGE_VALID))
@@ -82,7 +80,6 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 	refs = 0;
 	head = pmd_page(pmd);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
@@ -103,15 +100,6 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		return 0;
 	}
 
-	/* Any tail page need their mapcount reference taken before we
-	 * return.
-	 */
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
 
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 207d9aef662d..754bca23ec1b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -137,8 +137,6 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
-		if (PageTail(page))
-			get_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;
@@ -214,8 +212,6 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
-		if (PageTail(page))
-			get_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e7fc53ddfe43..bd6506a724f0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -98,14 +98,13 @@ static inline int split_huge_page(struct page *page)
 {
 	return split_huge_page_to_list(page, NULL);
 }
-extern void __split_huge_page_pmd(struct vm_area_struct *vma,
-		unsigned long address, pmd_t *pmd);
+extern void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long address);
 #define split_huge_pmd(__vma, __pmd, __address)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
 		if (unlikely(pmd_trans_huge(*____pmd)))			\
-			__split_huge_page_pmd(__vma, __address,		\
-					____pmd);			\
+			__split_huge_pmd(__vma, __pmd, __address);	\
 	}  while (0)
 #define wait_split_huge_page(__anon_vma, __pmd)				\
 	do {								\
diff --git a/include/linux/mm.h b/include/linux/mm.h
index aef03acff228..d36ad0575e58 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -413,25 +413,10 @@ static inline void compound_unlock_irqrestore(struct page *page,
 #endif
 }
 
-static inline struct page *compound_head_by_tail(struct page *tail)
-{
-	struct page *head = tail->first_page;
-
-	/*
-	 * page->first_page may be a dangling pointer to an old
-	 * compound page, so recheck that it is still a tail
-	 * page before returning.
-	 */
-	smp_rmb();
-	if (likely(PageTail(tail)))
-		return head;
-	return tail;
-}
-
 static inline struct page *compound_head(struct page *page)
 {
 	if (unlikely(PageTail(page)))
-		return compound_head_by_tail(page);
+		return page->first_page;
 	return page;
 }
 
@@ -477,50 +462,11 @@ static inline int PageHeadHuge(struct page *page_head)
 }
 #endif /* CONFIG_HUGETLB_PAGE */
 
-static inline bool __compound_tail_refcounted(struct page *page)
-{
-	return !PageSlab(page) && !PageHeadHuge(page);
-}
-
-/*
- * This takes a head page as parameter and tells if the
- * tail page reference counting can be skipped.
- *
- * For this to be safe, PageSlab and PageHeadHuge must remain true on
- * any given page where they return true here, until all tail pins
- * have been released.
- */
-static inline bool compound_tail_refcounted(struct page *page)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	return __compound_tail_refcounted(page);
-}
-
-static inline void get_huge_page_tail(struct page *page)
-{
-	/*
-	 * __split_huge_page_refcount() cannot run from under us.
-	 */
-	VM_BUG_ON_PAGE(!PageTail(page), page);
-	VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
-	VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
-	if (compound_tail_refcounted(page->first_page))
-		atomic_inc(&page->_mapcount);
-}
-
-extern bool __get_page_tail(struct page *page);
-
 static inline void get_page(struct page *page)
 {
-	if (unlikely(PageTail(page)))
-		if (likely(__get_page_tail(page)))
-			return;
-	/*
-	 * Getting a normal page or the head of a compound page
-	 * requires to already have an elevated page->_count.
-	 */
-	VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
-	atomic_inc(&page->_count);
+	struct page *page_head = compound_head(page);
+	VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page);
+	atomic_inc(&page_head->_count);
 }
 
 static inline struct page *virt_to_head_page(const void *x)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9411f2c93dd6..e51059eea5bc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -944,37 +944,6 @@ unlock:
 	spin_unlock(ptl);
 }
 
-/*
- * Save CONFIG_DEBUG_PAGEALLOC from faulting falsely on tail pages
- * during copy_user_huge_page()'s copy_page_rep(): in the case when
- * the source page gets split and a tail freed before copy completes.
- * Called under pmd_lock of checked pmd, so safe from splitting itself.
- */
-static void get_user_huge_page(struct page *page)
-{
-	if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
-		struct page *endpage = page + HPAGE_PMD_NR;
-
-		atomic_add(HPAGE_PMD_NR, &page->_count);
-		while (++page < endpage)
-			get_huge_page_tail(page);
-	} else {
-		get_page(page);
-	}
-}
-
-static void put_user_huge_page(struct page *page)
-{
-	if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
-		struct page *endpage = page + HPAGE_PMD_NR;
-
-		while (page < endpage)
-			put_page(page++);
-	} else {
-		put_page(page);
-	}
-}
-
 static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long address,
@@ -1125,7 +1094,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		ret |= VM_FAULT_WRITE;
 		goto out_unlock;
 	}
-	get_user_huge_page(page);
+	get_page(page);
 	spin_unlock(ptl);
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
@@ -1146,7 +1115,7 @@ alloc:
 				split_huge_pmd(vma, pmd, address);
 				ret |= VM_FAULT_FALLBACK;
 			}
-			put_user_huge_page(page);
+			put_page(page);
 		}
 		count_vm_event(THP_FAULT_FALLBACK);
 		goto out;
@@ -1157,7 +1126,7 @@ alloc:
 		put_page(new_page);
 		if (page) {
 			split_huge_pmd(vma, pmd, address);
-			put_user_huge_page(page);
+			put_page(page);
 		} else
 			split_huge_pmd(vma, pmd, address);
 		ret |= VM_FAULT_FALLBACK;
@@ -1179,7 +1148,7 @@ alloc:
 
 	spin_lock(ptl);
 	if (page)
-		put_user_huge_page(page);
+		put_page(page);
 	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
 		spin_unlock(ptl);
 		mem_cgroup_cancel_charge(new_page, memcg);
@@ -1638,51 +1607,73 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	put_huge_zero_page();
 }
 
-void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
-		pmd_t *pmd)
+
+static void __split_huge_pmd_locked(struct vm_area_struct *vma,
+		pmd_t *pmd, unsigned long address)
 {
-	spinlock_t *ptl;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
 	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
-	unsigned long haddr = address & HPAGE_PMD_MASK;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
+	pgtable_t pgtable;
+	pmd_t _pmd;
+	int i;
 
 	BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
 
+	if (is_huge_zero_pmd(*pmd))
+		return __split_huge_zero_page_pmd(vma, haddr, pmd);
+
+	page = pmd_page(*pmd);
+	VM_BUG_ON_PAGE(!page_count(page), page);
+	atomic_add(HPAGE_PMD_NR - 1, &page->_count);
+
+	/* leave pmd empty until pte is filled */
+	pmdp_clear_flush(vma, haddr, pmd);
+
+	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	pmd_populate(mm, &_pmd, pgtable);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+		pte_t entry, *pte;
+		/*
+		 * Note that pmd_numa is not transferred deliberately to avoid
+		 * any possibility that pte_numa leaks to a PROT_NONE VMA by
+		 * accident.
+		 */
+		entry = mk_pte(page + i, vma->vm_page_prot);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		if (!pmd_write(*pmd))
+			entry = pte_wrprotect(entry);
+		if (!pmd_young(*pmd))
+			entry = pte_mkold(entry);
+		pte = pte_offset_map(&_pmd, haddr);
+		BUG_ON(!pte_none(*pte));
+		atomic_inc(&page[i]._mapcount);
+		set_pte_at(mm, haddr, pte, entry);
+		pte_unmap(pte);
+	}
+	smp_wmb(); /* make pte visible before pmd */
+	pmd_populate(mm, pmd, pgtable);
+	atomic_dec(compound_mapcount_ptr(page));
+}
+
+void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long address)
+{
+	spinlock_t *ptl;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	unsigned long mmun_start;       /* For mmu_notifiers */
+	unsigned long mmun_end;         /* For mmu_notifiers */
+
 	mmun_start = haddr;
 	mmun_end   = haddr + HPAGE_PMD_SIZE;
-again:
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	ptl = pmd_lock(mm, pmd);
-	if (unlikely(!pmd_trans_huge(*pmd))) {
-		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-		return;
-	}
-	if (is_huge_zero_pmd(*pmd)) {
-		__split_huge_zero_page_pmd(vma, haddr, pmd);
-		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-		return;
-	}
-	page = pmd_page(*pmd);
-	VM_BUG_ON_PAGE(!page_count(page), page);
-	get_page(page);
+	if (likely(pmd_trans_huge(*pmd)))
+		__split_huge_pmd_locked(vma, pmd, address);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-
-	split_huge_page(page);
-
-	put_page(page);
-
-	/*
-	 * We don't always have down_write of mmap_sem here: a racing
-	 * do_huge_pmd_wp_page() might have copied-on-write to another
-	 * huge page before our split_huge_page() got the anon_vma lock.
-	 */
-	if (unlikely(pmd_trans_huge(*pmd)))
-		goto again;
 }
 
 static void split_huge_page_address(struct vm_area_struct *vma,
@@ -1712,40 +1703,10 @@ static void split_huge_page_address(struct vm_area_struct *vma,
 	 * Caller holds the mmap_sem write mode, so a huge pmd cannot
 	 * materialize from under us.
 	 */
-	__split_huge_page_pmd(vma, address, pmd);
-}
-static int __split_huge_page_splitting(struct page *page,
-				       struct vm_area_struct *vma,
-				       unsigned long address)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	spinlock_t *ptl;
-	pmd_t *pmd;
-	int ret = 0;
-	/* For mmu_notifiers */
-	const unsigned long mmun_start = address;
-	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
-
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-	pmd = page_check_address_pmd(page, mm, address,
-			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
-	if (pmd) {
-		/*
-		 * We can't temporarily set the pmd to null in order
-		 * to split it, the pmd must remain marked huge at all
-		 * times or the VM won't take the pmd_trans_huge paths
-		 * and it won't wait on the anon_vma->root->rwsem to
-		 * serialize against split_huge_page*.
-		 */
-		pmdp_splitting_flush(vma, address, pmd);
-		ret = 1;
-		spin_unlock(ptl);
-	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-
-	return ret;
+	__split_huge_pmd(vma, pmd, address);
 }
 
+#if 0
 static void __split_huge_page_refcount(struct page *page,
 				       struct list_head *list)
 {
@@ -1871,79 +1832,6 @@ static void __split_huge_page_refcount(struct page *page,
 	BUG_ON(page_count(page) <= 0);
 }
 
-static int __split_huge_page_map(struct page *page,
-				 struct vm_area_struct *vma,
-				 unsigned long address)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	spinlock_t *ptl;
-	pmd_t *pmd, _pmd;
-	int ret = 0, i;
-	pgtable_t pgtable;
-	unsigned long haddr;
-
-	pmd = page_check_address_pmd(page, mm, address,
-			PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG, &ptl);
-	if (pmd) {
-		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
-		pmd_populate(mm, &_pmd, pgtable);
-		if (pmd_write(*pmd))
-			BUG_ON(page_mapcount(page) != 1);
-
-		haddr = address;
-		for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
-			pte_t *pte, entry;
-			BUG_ON(PageCompound(page+i));
-			entry = mk_pte(page + i, vma->vm_page_prot);
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-			if (!pmd_write(*pmd))
-				entry = pte_wrprotect(entry);
-			if (!pmd_young(*pmd))
-				entry = pte_mkold(entry);
-			if (pmd_numa(*pmd))
-				entry = pte_mknuma(entry);
-			pte = pte_offset_map(&_pmd, haddr);
-			BUG_ON(!pte_none(*pte));
-			set_pte_at(mm, haddr, pte, entry);
-			pte_unmap(pte);
-		}
-
-		smp_wmb(); /* make pte visible before pmd */
-		/*
-		 * Up to this point the pmd is present and huge and
-		 * userland has the whole access to the hugepage
-		 * during the split (which happens in place). If we
-		 * overwrite the pmd with the not-huge version
-		 * pointing to the pte here (which of course we could
-		 * if all CPUs were bug free), userland could trigger
-		 * a small page size TLB miss on the small sized TLB
-		 * while the hugepage TLB entry is still established
-		 * in the huge TLB. Some CPU doesn't like that. See
-		 * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
-		 * Erratum 383 on page 93. Intel should be safe but is
-		 * also warns that it's only safe if the permission
-		 * and cache attributes of the two entries loaded in
-		 * the two TLB is identical (which should be the case
-		 * here). But it is generally safer to never allow
-		 * small and huge TLB entries for the same virtual
-		 * address to be loaded simultaneously. So instead of
-		 * doing "pmd_populate(); flush_tlb_range();" we first
-		 * mark the current pmd notpresent (atomically because
-		 * here the pmd_trans_huge and pmd_trans_splitting
-		 * must remain set at all times on the pmd until the
-		 * split is complete for this pmd), then we flush the
-		 * SMP TLB and finally we write the non-huge version
-		 * of the pmd entry with pmd_populate.
-		 */
-		pmdp_invalidate(vma, address, pmd);
-		pmd_populate(mm, pmd, pgtable);
-		ret = 1;
-		spin_unlock(ptl);
-	}
-
-	return ret;
-}
-
 /* must be called with anon_vma->root->rwsem held */
 static void __split_huge_page(struct page *page,
 			      struct anon_vma *anon_vma,
@@ -1994,48 +1882,18 @@ static void __split_huge_page(struct page *page,
 		BUG();
 	}
 }
+#endif
 
 /*
  * Split a hugepage into normal pages. This doesn't change the position of head
  * page. If @list is null, tail pages will be added to LRU list, otherwise, to
  * @list. Both head page and tail pages will inherit mapping, flags, and so on
  * from the hugepage.
- * Return 0 if the hugepage is split successfully otherwise return 1.
+ * Return 0 if the hugepage is split successfully otherwise return -errno.
  */
 int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
-	struct anon_vma *anon_vma;
-	int ret = 1;
-
-	BUG_ON(is_huge_zero_page(page));
-	BUG_ON(!PageAnon(page));
-
-	/*
-	 * The caller does not necessarily hold an mmap_sem that would prevent
-	 * the anon_vma disappearing so we first we take a reference to it
-	 * and then lock the anon_vma for write. This is similar to
-	 * page_lock_anon_vma_read except the write lock is taken to serialise
-	 * against parallel split or collapse operations.
-	 */
-	anon_vma = page_get_anon_vma(page);
-	if (!anon_vma)
-		goto out;
-	anon_vma_lock_write(anon_vma);
-
-	ret = 0;
-	if (!PageCompound(page))
-		goto out_unlock;
-
-	BUG_ON(!PageSwapBacked(page));
-	__split_huge_page(page, anon_vma, list);
-	count_vm_event(THP_SPLIT);
-
-	BUG_ON(PageCompound(page));
-out_unlock:
-	anon_vma_unlock_write(anon_vma);
-	put_anon_vma(anon_vma);
-out:
-	return ret;
+	return -EBUSY;
 }
 
 #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
diff --git a/mm/internal.h b/mm/internal.h
index a1b651b11c5f..1e4c6f5dd38a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -47,26 +47,6 @@ static inline void set_page_refcounted(struct page *page)
 	set_page_count(page, 1);
 }
 
-static inline void __get_page_tail_foll(struct page *page,
-					bool get_page_head)
-{
-	/*
-	 * If we're getting a tail page, the elevated page->_count is
-	 * required only in the head page and we will elevate the head
-	 * page->_count and tail page->_mapcount.
-	 *
-	 * We elevate page_tail->_mapcount for tail pages to force
-	 * page_tail->_count to be zero at all times to avoid getting
-	 * false positives from get_page_unless_zero() with
-	 * speculative page access (like in
-	 * page_cache_get_speculative()) on tail pages.
-	 */
-	VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
-	if (get_page_head)
-		atomic_inc(&page->first_page->_count);
-	get_huge_page_tail(page);
-}
-
 /*
  * This is meant to be called as the FOLL_GET operation of
  * follow_page() and it must be called while holding the proper PT
@@ -74,14 +54,9 @@ static inline void __get_page_tail_foll(struct page *page,
  */
 static inline void get_page_foll(struct page *page)
 {
-	if (unlikely(PageTail(page)))
-		/*
-		 * This is safe only because
-		 * __split_huge_page_refcount() can't run under
-		 * get_page_foll() because we hold the proper PT lock.
-		 */
-		__get_page_tail_foll(page, true);
-	else {
+	if (unlikely(PageTail(page))) {
+		atomic_inc(&page->first_page->_count);
+	} else {
 		/*
 		 * Getting a normal page or the head of a compound page
 		 * requires to already have an elevated page->_count.
diff --git a/mm/swap.c b/mm/swap.c
index 6b2dc3897cd5..826cab5f725a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -80,185 +80,12 @@ static void __put_compound_page(struct page *page)
 	(*dtor)(page);
 }
 
-/**
- * Two special cases here: we could avoid taking compound_lock_irqsave
- * and could skip the tail refcounting(in _mapcount).
- *
- * 1. Hugetlbfs page:
- *
- *    PageHeadHuge will remain true until the compound page
- *    is released and enters the buddy allocator, and it could
- *    not be split by __split_huge_page_refcount().
- *
- *    So if we see PageHeadHuge set, and we have the tail page pin,
- *    then we could safely put head page.
- *
- * 2. Slab THP page:
- *
- *    PG_slab is cleared before the slab frees the head page, and
- *    tail pin cannot be the last reference left on the head page,
- *    because the slab code is free to reuse the compound page
- *    after a kfree/kmem_cache_free without having to check if
- *    there's any tail pin left.  In turn all tail pinsmust be always
- *    released while the head is still pinned by the slab code
- *    and so we know PG_slab will be still set too.
- *
- *    So if we see PageSlab set, and we have the tail page pin,
- *    then we could safely put head page.
- */
-static __always_inline
-void put_unrefcounted_compound_page(struct page *page_head, struct page *page)
-{
-	/*
-	 * If @page is a THP tail, we must read the tail page
-	 * flags after the head page flags. The
-	 * __split_huge_page_refcount side enforces write memory barriers
-	 * between clearing PageTail and before the head page
-	 * can be freed and reallocated.
-	 */
-	smp_rmb();
-	if (likely(PageTail(page))) {
-		/*
-		 * __split_huge_page_refcount cannot race
-		 * here, see the comment above this function.
-		 */
-		VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
-		VM_BUG_ON_PAGE(page_mapcount(page) != 0, page);
-		if (put_page_testzero(page_head)) {
-			/*
-			 * If this is the tail of a slab THP page,
-			 * the tail pin must not be the last reference
-			 * held on the page, because the PG_slab cannot
-			 * be cleared before all tail pins (which skips
-			 * the _mapcount tail refcounting) have been
-			 * released.
-			 *
-			 * If this is the tail of a hugetlbfs page,
-			 * the tail pin may be the last reference on
-			 * the page instead, because PageHeadHuge will
-			 * not go away until the compound page enters
-			 * the buddy allocator.
-			 */
-			VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
-			__put_compound_page(page_head);
-		}
-	} else
-		/*
-		 * __split_huge_page_refcount run before us,
-		 * @page was a THP tail. The split @page_head
-		 * has been freed and reallocated as slab or
-		 * hugetlbfs page of smaller order (only
-		 * possible if reallocated as slab on x86).
-		 */
-		if (put_page_testzero(page))
-			__put_single_page(page);
-}
-
-static __always_inline
-void put_refcounted_compound_page(struct page *page_head, struct page *page)
-{
-	if (likely(page != page_head && get_page_unless_zero(page_head))) {
-		unsigned long flags;
-
-		/*
-		 * @page_head wasn't a dangling pointer but it may not
-		 * be a head page anymore by the time we obtain the
-		 * lock. That is ok as long as it can't be freed from
-		 * under us.
-		 */
-		flags = compound_lock_irqsave(page_head);
-		if (unlikely(!PageTail(page))) {
-			/* __split_huge_page_refcount run before us */
-			compound_unlock_irqrestore(page_head, flags);
-			if (put_page_testzero(page_head)) {
-				/*
-				 * The @page_head may have been freed
-				 * and reallocated as a compound page
-				 * of smaller order and then freed
-				 * again.  All we know is that it
-				 * cannot have become: a THP page, a
-				 * compound page of higher order, a
-				 * tail page.  That is because we
-				 * still hold the refcount of the
-				 * split THP tail and page_head was
-				 * the THP head before the split.
-				 */
-				if (PageHead(page_head))
-					__put_compound_page(page_head);
-				else
-					__put_single_page(page_head);
-			}
-out_put_single:
-			if (put_page_testzero(page))
-				__put_single_page(page);
-			return;
-		}
-		VM_BUG_ON_PAGE(page_head != page->first_page, page);
-		/*
-		 * We can release the refcount taken by
-		 * get_page_unless_zero() now that
-		 * __split_huge_page_refcount() is blocked on the
-		 * compound_lock.
-		 */
-		if (put_page_testzero(page_head))
-			VM_BUG_ON_PAGE(1, page_head);
-		/* __split_huge_page_refcount will wait now */
-		VM_BUG_ON_PAGE(page_mapcount(page) <= 0, page);
-		atomic_dec(&page->_mapcount);
-		VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page_head);
-		VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
-		compound_unlock_irqrestore(page_head, flags);
-
-		if (put_page_testzero(page_head)) {
-			if (PageHead(page_head))
-				__put_compound_page(page_head);
-			else
-				__put_single_page(page_head);
-		}
-	} else {
-		/* @page_head is a dangling pointer */
-		VM_BUG_ON_PAGE(PageTail(page), page);
-		goto out_put_single;
-	}
-}
-
 static void put_compound_page(struct page *page)
 {
-	struct page *page_head;
-
-	/*
-	 * We see the PageCompound set and PageTail not set, so @page maybe:
-	 *  1. hugetlbfs head page, or
-	 *  2. THP head page.
-	 */
-	if (likely(!PageTail(page))) {
-		if (put_page_testzero(page)) {
-			/*
-			 * By the time all refcounts have been released
-			 * split_huge_page cannot run anymore from under us.
-			 */
-			if (PageHead(page))
-				__put_compound_page(page);
-			else
-				__put_single_page(page);
-		}
-		return;
-	}
+	struct page *page_head = compound_head(page);
 
-	/*
-	 * We see the PageCompound set and PageTail set, so @page maybe:
-	 *  1. a tail hugetlbfs page, or
-	 *  2. a tail THP page, or
-	 *  3. a split THP page.
-	 *
-	 *  Case 3 is possible, as we may race with
-	 *  __split_huge_page_refcount tearing down a THP page.
-	 */
-	page_head = compound_head_by_tail(page);
-	if (!__compound_tail_refcounted(page_head))
-		put_unrefcounted_compound_page(page_head, page);
-	else
-		put_refcounted_compound_page(page_head, page);
+	if (put_page_testzero(page_head))
+		__put_compound_page(page_head);
 }
 
 void put_page(struct page *page)
@@ -270,72 +97,6 @@ void put_page(struct page *page)
 }
 EXPORT_SYMBOL(put_page);
 
-/*
- * This function is exported but must not be called by anything other
- * than get_page(). It implements the slow path of get_page().
- */
-bool __get_page_tail(struct page *page)
-{
-	/*
-	 * This takes care of get_page() if run on a tail page
-	 * returned by one of the get_user_pages/follow_page variants.
-	 * get_user_pages/follow_page itself doesn't need the compound
-	 * lock because it runs __get_page_tail_foll() under the
-	 * proper PT lock that already serializes against
-	 * split_huge_page().
-	 */
-	unsigned long flags;
-	bool got;
-	struct page *page_head = compound_head(page);
-
-	/* Ref to put_compound_page() comment. */
-	if (!__compound_tail_refcounted(page_head)) {
-		smp_rmb();
-		if (likely(PageTail(page))) {
-			/*
-			 * This is a hugetlbfs page or a slab
-			 * page. __split_huge_page_refcount
-			 * cannot race here.
-			 */
-			VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
-			__get_page_tail_foll(page, true);
-			return true;
-		} else {
-			/*
-			 * __split_huge_page_refcount run
-			 * before us, "page" was a THP
-			 * tail. The split page_head has been
-			 * freed and reallocated as slab or
-			 * hugetlbfs page of smaller order
-			 * (only possible if reallocated as
-			 * slab on x86).
-			 */
-			return false;
-		}
-	}
-
-	got = false;
-	if (likely(page != page_head && get_page_unless_zero(page_head))) {
-		/*
-		 * page_head wasn't a dangling pointer but it
-		 * may not be a head page anymore by the time
-		 * we obtain the lock. That is ok as long as it
-		 * can't be freed from under us.
-		 */
-		flags = compound_lock_irqsave(page_head);
-		/* here __split_huge_page_refcount won't run anymore */
-		if (likely(PageTail(page))) {
-			__get_page_tail_foll(page, false);
-			got = true;
-		}
-		compound_unlock_irqrestore(page_head, flags);
-		if (unlikely(!got))
-			put_page(page_head);
-	}
-	return got;
-}
-EXPORT_SYMBOL(__get_page_tail);
-
 /**
  * put_pages_list() - release a list of pages
  * @pages: list of pages threaded on page->lru
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 11/19] mm, vmstats: new THP splitting event
  2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
                   ` (9 preceding siblings ...)
  2014-11-05 14:49 ` [PATCH 10/19] thp: PMD splitting without splitting compound page Kirill A. Shutemov
@ 2014-11-05 14:49 ` Kirill A. Shutemov
  2014-11-05 14:49 ` [PATCH 12/19] thp: implement new split_huge_page() Kirill A. Shutemov
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm, Kirill A. Shutemov

The patch replaces THP_SPLIT with three events: THP_SPLIT_PAGE,
THP_SPLIT_PAGE_FAILED and THP_SPLIT_PMD. This reflects the fact that we
can now split a PMD without splitting the compound page and that
split_huge_page() can fail.
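
For reference, a condensed view of where the new events get accounted
(matching the hunks below; the successful-split counter only starts
firing once split_huge_page() is reimplemented later in the series):

	/* __split_huge_pmd_locked(): one PMD split, compound page kept */
	count_vm_event(THP_SPLIT_PMD);

	/* split_huge_page_to_list(): whole-page split did not happen */
	count_vm_event(THP_SPLIT_PAGE_FAILED);

In /proc/vmstat the counters appear as thp_split_page,
thp_split_page_failed and thp_split_pmd.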

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/vm_event_item.h | 4 +++-
 mm/huge_memory.c              | 3 +++
 mm/vmstat.c                   | 4 +++-
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index ced92345c963..b44dffa769b9 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -68,7 +68,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_FAULT_FALLBACK,
 		THP_COLLAPSE_ALLOC,
 		THP_COLLAPSE_ALLOC_FAILED,
-		THP_SPLIT,
+		THP_SPLIT_PAGE,
+		THP_SPLIT_PAGE_FAILED,
+		THP_SPLIT_PMD,
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
 #endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e51059eea5bc..c63911b10143 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1620,6 +1620,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma,
 
 	BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
 
+	count_vm_event(THP_SPLIT_PMD);
+
 	if (is_huge_zero_pmd(*pmd))
 		return __split_huge_zero_page_pmd(vma, haddr, pmd);
 
@@ -1893,6 +1895,7 @@ static void __split_huge_page(struct page *page,
  */
 int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
+	count_vm_event(THP_SPLIT_PAGE_FAILED);
 	return -EBUSY;
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c4d42e27ca9e..7cedd10c636d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -872,7 +872,9 @@ const char * const vmstat_text[] = {
 	"thp_fault_fallback",
 	"thp_collapse_alloc",
 	"thp_collapse_alloc_failed",
-	"thp_split",
+	"thp_split_page",
+	"thp_split_page_failed",
+	"thp_split_pmd",
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
 #endif
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 12/19] thp: implement new split_huge_page()
  2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
                   ` (10 preceding siblings ...)
  2014-11-05 14:49 ` [PATCH 11/19] mm, vmstats: new THP splitting event Kirill A. Shutemov
@ 2014-11-05 14:49 ` Kirill A. Shutemov
  2014-11-05 14:49 ` [PATCH 13/19] mm, thp: remove infrastructure for handling splitting PMDs Kirill A. Shutemov
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm, Kirill A. Shutemov

The new split_huge_page() can fail if the compound page is pinned: we
expect the caller's reference to be the only extra reference to the head
page. If the page is pinned, split_huge_page() returns -EBUSY and the
caller must handle this correctly.

We don't need to mark PMDs as splitting any more since we can now split
one PMD at a time with split_huge_pmd().
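
The "is it pinned?" test boils down to a reference-count identity: the
only references we expect on the head page are one per mapping of each
sub-page plus the single reference held by the split_huge_page() caller.
A condensed sketch of the check (see __split_huge_page_refcount() in the
mm/huge_memory.c hunk below):

	tail_mapcount = 0;
	for (i = 0; i < HPAGE_PMD_NR; i++)
		tail_mapcount += page_mapcount(page + i);
	/* anything above mapcounts + 1 is a pin, e.g. get_user_pages() */
	if (tail_mapcount != page_count(page) - 1)
		return -EBUSY;	/* refuse to split; caller must cope */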

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/hugetlb_inline.h |   9 +-
 include/linux/mm.h             |  22 +++--
 mm/huge_memory.c               | 183 +++++++++++++++++++++++------------------
 mm/swap.c                      | 126 +++++++++++++++++++++++++++-
 4 files changed, 244 insertions(+), 96 deletions(-)

diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index 2bb681fbeb35..c5cd37479731 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -10,6 +10,8 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 	return !!(vma->vm_flags & VM_HUGETLB);
 }
 
+int PageHeadHuge(struct page *page_head);
+
 #else
 
 static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
@@ -17,6 +19,11 @@ static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
 	return 0;
 }
 
-#endif
+static inline int PageHeadHuge(struct page *page_head)
+{
+       return 0;
+}
+
+#endif /* CONFIG_HUGETLB_PAGE */
 
 #endif
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d36ad0575e58..f2f95469f1c3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -453,20 +453,18 @@ static inline int page_count(struct page *page)
 	return atomic_read(&compound_head(page)->_count);
 }
 
-#ifdef CONFIG_HUGETLB_PAGE
-extern int PageHeadHuge(struct page *page_head);
-#else /* CONFIG_HUGETLB_PAGE */
-static inline int PageHeadHuge(struct page *page_head)
-{
-	return 0;
-}
-#endif /* CONFIG_HUGETLB_PAGE */
-
+void __get_page_tail(struct page *page);
 static inline void get_page(struct page *page)
 {
-	struct page *page_head = compound_head(page);
-	VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page);
-	atomic_inc(&page_head->_count);
+	if (unlikely(PageTail(page)))
+		return __get_page_tail(page);
+
+	/*
+	 * Getting a normal page or the head of a compound page
+	 * requires to already have an elevated page->_count.
+	 */
+	VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+	atomic_inc(&page->_count);
 }
 
 static inline struct page *virt_to_head_page(const void *x)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c63911b10143..36fa0d505956 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1708,31 +1708,52 @@ static void split_huge_page_address(struct vm_area_struct *vma,
 	__split_huge_pmd(vma, pmd, address);
 }
 
-#if 0
-static void __split_huge_page_refcount(struct page *page,
+static int __split_huge_page_refcount(struct page *page,
 				       struct list_head *list)
 {
 	int i;
 	struct zone *zone = page_zone(page);
 	struct lruvec *lruvec;
-	int tail_count = 0;
+	int tail_mapcount = 0;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
 	lruvec = mem_cgroup_page_lruvec(page, zone);
 
 	compound_lock(page);
+
+	/*
+	 * We cannot split a pinned THP page: we expect the page count to
+	 * equal the sum of mapcounts of all sub-pages plus one (the
+	 * split_huge_page() caller must take a reference on the head page).
+	 *
+	 * The compound lock only prevents page->_count from being updated by
+	 * get_page() or put_page() on a tail page. It means page_count()
+	 * can change under us from the head page after the check, but it's
+	 * okay: all new references will stay on the head page after split.
+	 */
+	tail_mapcount = 0;
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		tail_mapcount += page_mapcount(page + i);
+	if (tail_mapcount != page_count(page) - 1) {
+		BUG_ON(tail_mapcount > page_count(page) - 1);
+		compound_unlock(page);
+		spin_unlock_irq(&zone->lru_lock);
+		return -EBUSY;
+	}
+
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(page);
 
+	tail_mapcount = 0;
 	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
 		struct page *page_tail = page + i;
 
 		/* tail_page->_mapcount cannot change */
-		BUG_ON(atomic_read(&page_tail->_mapcount) + 1 < 0);
-		tail_count += atomic_read(&page_tail->_mapcount) + 1;
+		BUG_ON(page_mapcount(page_tail) < 0);
+		tail_mapcount += page_mapcount(page_tail);
 		/* check for overflow */
-		BUG_ON(tail_count < 0);
+		BUG_ON(tail_mapcount < 0);
 		BUG_ON(atomic_read(&page_tail->_count) != 0);
 		/*
 		 * tail_page->_count is zero and not changing from
@@ -1770,28 +1791,9 @@ static void __split_huge_page_refcount(struct page *page,
 		/* clear PageTail before overwriting first_page */
 		smp_wmb();
 
-		/*
-		 * __split_huge_page_splitting() already set the
-		 * splitting bit in all pmd that could map this
-		 * hugepage, that will ensure no CPU can alter the
-		 * mapcount on the head page. The mapcount is only
-		 * accounted in the head page and it has to be
-		 * transferred to all tail pages in the below code. So
-		 * for this code to be safe, the split the mapcount
-		 * can't change. But that doesn't mean userland can't
-		 * keep changing and reading the page contents while
-		 * we transfer the mapcount, so the pmd splitting
-		 * status is achieved setting a reserved bit in the
-		 * pmd, not by clearing the present bit.
-		*/
-		atomic_set(&page_tail->_mapcount, compound_mapcount(page) - 1);
-
 		/* ->mapping in first tail page is compound_mapcount */
-		if (i != 1) {
-			BUG_ON(page_tail->mapping);
-			page_tail->mapping = page->mapping;
-			BUG_ON(!PageAnon(page_tail));
-		}
+		BUG_ON(i != 1 && page_tail->mapping);
+		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
 		page_cpupid_xchg_last(page_tail, page_cpupid_last(page));
@@ -1802,12 +1804,9 @@ static void __split_huge_page_refcount(struct page *page,
 
 		lru_add_page_tail(page, page_tail, lruvec, list);
 	}
-	atomic_sub(tail_count, &page->_count);
+	atomic_sub(tail_mapcount, &page->_count);
 	BUG_ON(atomic_read(&page->_count) <= 0);
 
-	page->_mapcount = *compound_mapcount_ptr(page);
-	page[1].mapping = page->mapping;
-
 	__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
 
 	ClearPageCompound(page);
@@ -1832,71 +1831,95 @@ static void __split_huge_page_refcount(struct page *page,
 	 * to be pinned by the caller.
 	 */
 	BUG_ON(page_count(page) <= 0);
+	return 0;
 }
 
-/* must be called with anon_vma->root->rwsem held */
-static void __split_huge_page(struct page *page,
-			      struct anon_vma *anon_vma,
-			      struct list_head *list)
+/*
+ * Split a hugepage into normal pages. This doesn't change the position of head
+ * page. If @list is null, tail pages will be added to LRU list, otherwise, to
+ * @list. Both head page and tail pages will inherit mapping, flags, and so on
+ * from the hugepage.
+ * Return 0 if the hugepage is split successfully otherwise return -errno.
+ */
+int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
-	int mapcount, mapcount2;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	struct anon_vma *anon_vma;
 	struct anon_vma_chain *avc;
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	int i, tail_mapcount;
+	int ret = -EBUSY;
 
-	BUG_ON(!PageHead(page));
-	BUG_ON(PageTail(page));
+	BUG_ON(is_huge_zero_page(page));
+	BUG_ON(!PageAnon(page));
 
-	mapcount = 0;
-	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
-		struct vm_area_struct *vma = avc->vma;
-		unsigned long addr = vma_address(page, vma);
-		BUG_ON(is_vma_temporary_stack(vma));
-		mapcount += __split_huge_page_splitting(page, vma, addr);
-	}
 	/*
-	 * It is critical that new vmas are added to the tail of the
-	 * anon_vma list. This guarantes that if copy_huge_pmd() runs
-	 * and establishes a child pmd before
-	 * __split_huge_page_splitting() freezes the parent pmd (so if
-	 * we fail to prevent copy_huge_pmd() from running until the
-	 * whole __split_huge_page() is complete), we will still see
-	 * the newly established pmd of the child later during the
-	 * walk, to be able to set it as pmd_trans_splitting too.
+	 * The caller does not necessarily hold an mmap_sem that would prevent
+	 * the anon_vma disappearing, so we first take a reference to it
+	 * and then lock the anon_vma for write. This is similar to
+	 * page_lock_anon_vma_read except the write lock is taken to serialise
+	 * against parallel split or collapse operations.
 	 */
-	if (mapcount != page_mapcount(page)) {
-		pr_err("mapcount %d page_mapcount %d\n",
-			mapcount, page_mapcount(page));
-		BUG();
+	anon_vma = page_get_anon_vma(page);
+	if (!anon_vma)
+		goto out;
+	anon_vma_lock_write(anon_vma);
+
+	if (!PageCompound(page)) {
+		ret = 0;
+		goto out_unlock;
 	}
 
-	__split_huge_page_refcount(page, list);
+	BUG_ON(!PageSwapBacked(page));
+
+	/*
+	 * Racy check if __split_huge_page_refcount() can be successful, before
+	 * splitting PMDs.
+	 */
+	tail_mapcount = compound_mapcount(page);
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		tail_mapcount += atomic_read(&page[i]._mapcount) + 1;
+	if (tail_mapcount != page_count(page) - 1) {
+		VM_BUG_ON_PAGE(tail_mapcount > page_count(page) - 1, page);
+		ret = -EBUSY;
+		goto out_unlock;
+	}
 
-	mapcount2 = 0;
 	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
 		struct vm_area_struct *vma = avc->vma;
 		unsigned long addr = vma_address(page, vma);
-		BUG_ON(is_vma_temporary_stack(vma));
-		mapcount2 += __split_huge_page_map(page, vma, addr);
-	}
-	if (mapcount != mapcount2) {
-		pr_err("mapcount %d mapcount2 %d page_mapcount %d\n",
-			mapcount, mapcount2, page_mapcount(page));
-		BUG();
+		spinlock_t *ptl;
+		pmd_t *pmd;
+		unsigned long haddr = addr & HPAGE_PMD_MASK;
+		unsigned long mmun_start;	/* For mmu_notifiers */
+		unsigned long mmun_end;		/* For mmu_notifiers */
+
+		mmun_start = haddr;
+		mmun_end   = haddr + HPAGE_PMD_SIZE;
+		mmu_notifier_invalidate_range_start(vma->vm_mm,
+				mmun_start, mmun_end);
+		pmd = page_check_address_pmd(page, vma->vm_mm, addr,
+				PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+		if (pmd) {
+			__split_huge_pmd_locked(vma, pmd, addr);
+			spin_unlock(ptl);
+		}
+		mmu_notifier_invalidate_range_end(vma->vm_mm,
+				mmun_start, mmun_end);
 	}
-}
-#endif
 
-/*
- * Split a hugepage into normal pages. This doesn't change the position of head
- * page. If @list is null, tail pages will be added to LRU list, otherwise, to
- * @list. Both head page and tail pages will inherit mapping, flags, and so on
- * from the hugepage.
- * Return 0 if the hugepage is split successfully otherwise return -errno.
- */
-int split_huge_page_to_list(struct page *page, struct list_head *list)
-{
-	count_vm_event(THP_SPLIT_PAGE_FAILED);
-	return -EBUSY;
+	BUG_ON(compound_mapcount(page));
+	ret = __split_huge_page_refcount(page, list);
+	BUG_ON(!ret && PageCompound(page));
+
+out_unlock:
+	anon_vma_unlock_write(anon_vma);
+	put_anon_vma(anon_vma);
+out:
+	if (ret)
+		count_vm_event(THP_SPLIT_PAGE_FAILED);
+	else
+		count_vm_event(THP_SPLIT_PAGE);
+	return ret;
 }
 
 #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
diff --git a/mm/swap.c b/mm/swap.c
index 826cab5f725a..da28e0767088 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -80,12 +80,86 @@ static void __put_compound_page(struct page *page)
 	(*dtor)(page);
 }
 
+static inline bool compound_lock_needed(struct page *page)
+{
+	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+		!PageSlab(page) && !PageHeadHuge(page);
+}
+
 static void put_compound_page(struct page *page)
 {
-	struct page *page_head = compound_head(page);
+	struct page *page_head;
+	unsigned long flags;
+
+	if (likely(!PageTail(page))) {
+		if (put_page_testzero(page)) {
+			/*
+			 * By the time all refcounts have been released
+			 * split_huge_page cannot run anymore from under us.
+			 */
+			if (PageHead(page))
+				__put_compound_page(page);
+			else
+				__put_single_page(page);
+		}
+		return;
+	}
+
+	/* __split_huge_page_refcount can run under us */
+	page_head = compound_head(page);
+
+	if (!compound_lock_needed(page_head)) {
+		/*
+		 * If "page" is a THP tail, we must read the tail page flags
+		 * after the head page flags. The split_huge_page side enforces
+		 * write memory barriers between clearing PageTail and before
+		 * the head page can be freed and reallocated.
+		 */
+		smp_rmb();
+		if (likely(PageTail(page))) {
+			/* __split_huge_page_refcount cannot race here. */
+			VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
+			VM_BUG_ON_PAGE(page_mapcount(page) != 0, page);
+			if (put_page_testzero(page_head)) {
+				/*
+				 * If this is the tail of a slab compound page,
+				 * the tail pin must not be the last reference
+				 * held on the page, because the PG_slab cannot
+				 * be cleared before all tail pins (which skips
+				 * the _mapcount tail refcounting) have been
+				 * released. For hugetlbfs the tail pin may be
+				 * the last reference on the page instead,
+				 * because PageHeadHuge will not go away until
+				 * the compound page enters the buddy
+				 * allocator.
+				 */
+				VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
+				__put_compound_page(page_head);
+			}
+		} else if (put_page_testzero(page))
+			__put_single_page(page);
+		return;
+	}
 
-	if (put_page_testzero(page_head))
-			__put_compound_page(page_head);
+	flags = compound_lock_irqsave(page_head);
+	/* here __split_huge_page_refcount won't run anymore */
+	if (likely(page != page_head && PageTail(page))) {
+		bool free;
+
+		free = put_page_testzero(page_head);
+		compound_unlock_irqrestore(page_head, flags);
+		if (free) {
+			if (PageHead(page_head))
+				__put_compound_page(page_head);
+			else
+				__put_single_page(page_head);
+		}
+	} else {
+		compound_unlock_irqrestore(page_head, flags);
+		VM_BUG_ON_PAGE(PageTail(page), page);
+		if (put_page_testzero(page))
+			__put_single_page(page);
+	}
 }
 
 void put_page(struct page *page)
@@ -97,6 +171,52 @@ void put_page(struct page *page)
 }
 EXPORT_SYMBOL(put_page);
 
+/*
+ * This function is exported but must not be called by anything other
+ * than get_page(). It implements the slow path of get_page().
+ */
+void __get_page_tail(struct page *page)
+{
+	struct page *page_head = compound_head(page);
+	unsigned long flags;
+
+	if (!compound_lock_needed(page_head)) {
+		smp_rmb();
+		if (likely(PageTail(page))) {
+			/*
+			 * This is a hugetlbfs page or a slab page.
+			 * __split_huge_page_refcount cannot race here.
+			 */
+			VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
+			VM_BUG_ON(page_head != page->first_page);
+			VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0,
+					page);
+			atomic_inc(&page_head->_count);
+		} else {
+			/*
+			 * __split_huge_page_refcount run before us, "page" was
+			 * a thp tail. the split page_head has been freed and
+			 * reallocated as slab or hugetlbfs page of smaller
+			 * order (only possible if reallocated as slab on x86).
+			 */
+			VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+			atomic_inc(&page->_count);
+		}
+		return;
+	}
+
+	flags = compound_lock_irqsave(page_head);
+	/* here __split_huge_page_refcount won't run anymore */
+	if (unlikely(page == page_head || !PageTail(page) ||
+				!get_page_unless_zero(page_head))) {
+		/* page is not part of THP page anymore */
+		VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
+		atomic_inc(&page->_count);
+	}
+	compound_unlock_irqrestore(page_head, flags);
+}
+EXPORT_SYMBOL(__get_page_tail);
+
 /**
  * put_pages_list() - release a list of pages
  * @pages: list of pages threaded on page->lru
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 13/19] mm, thp: remove infrastructure for handling splitting PMDs
  2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
                   ` (11 preceding siblings ...)
  2014-11-05 14:49 ` [PATCH 12/19] thp: implement new split_huge_page() Kirill A. Shutemov
@ 2014-11-05 14:49 ` Kirill A. Shutemov
  2014-11-05 14:49 ` [PATCH 14/19] x86, thp: remove " Kirill A. Shutemov
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting we don't need to mark PMDs splitting. Let's drop
the code that handles this.

Arch-specific code will be removed separately.
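
For reference, a minimal sketch (not taken from the patch) of what a
pagetable walker looks like after this change; the only contract assumed is
the boolean return of pmd_trans_huge_lock(), as in the callers updated below:

static int example_pmd_entry(pmd_t *pmd, unsigned long addr,
			     unsigned long end, struct mm_walk *walk)
{
	spinlock_t *ptl;

	if (pmd_trans_huge_lock(pmd, walk->vma, &ptl)) {
		/* *pmd is a stable huge entry while ptl is held */
		spin_unlock(ptl);
		return 0;
	}
	/* no splitting state to wait for: fall through to the pte level */
	return 0;
}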

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 fs/proc/task_mmu.c            |  8 +++---
 include/asm-generic/pgtable.h |  5 ----
 include/linux/huge_mm.h       | 16 ------------
 mm/gup.c                      | 14 +++--------
 mm/huge_memory.c              | 57 +++++++++----------------------------------
 mm/memcontrol.c               | 14 ++---------
 mm/memory.c                   | 18 ++------------
 mm/mincore.c                  |  2 +-
 mm/pgtable-generic.c          | 14 -----------
 mm/rmap.c                     |  4 +--
 10 files changed, 25 insertions(+), 127 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index d61fd9251197..887156e33474 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -517,7 +517,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_PMD_SIZE, walk);
 		spin_unlock(ptl);
 		mss->anonymous_thp += HPAGE_PMD_SIZE;
@@ -791,7 +791,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	struct page *page;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
 			clear_soft_dirty_pmd(vma, addr, pmd);
 			goto out;
@@ -1072,7 +1072,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 
 	/* find the first VMA at or above 'addr' */
 	vma = find_vma(walk->mm, addr);
-	if (vma && pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (vma && pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		int pmd_flags2;
 
 		if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd))
@@ -1387,7 +1387,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	pte_t *orig_pte;
 	pte_t *pte;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		pte_t huge_pte = *(pte_t *)pmd;
 		struct page *page;
 
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 53b2acc38213..204fa5db3068 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -167,11 +167,6 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
-#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
-				 unsigned long address, pmd_t *pmdp);
-#endif
-
 #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
 extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 				       pgtable_t pgtable);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index bd6506a724f0..94f331166974 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -46,15 +46,9 @@ enum transparent_hugepage_flag {
 #endif
 };
 
-enum page_check_address_pmd_flag {
-	PAGE_CHECK_ADDRESS_PMD_FLAG,
-	PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
-	PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
-};
 extern pmd_t *page_check_address_pmd(struct page *page,
 				     struct mm_struct *mm,
 				     unsigned long address,
-				     enum page_check_address_pmd_flag flag,
 				     spinlock_t **ptl);
 
 #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
@@ -106,14 +100,6 @@ extern void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		if (unlikely(pmd_trans_huge(*____pmd)))			\
 			__split_huge_pmd(__vma, __pmd, __address);	\
 	}  while (0)
-#define wait_split_huge_page(__anon_vma, __pmd)				\
-	do {								\
-		pmd_t *____pmd = (__pmd);				\
-		anon_vma_lock_write(__anon_vma);			\
-		anon_vma_unlock_write(__anon_vma);			\
-		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
-		       pmd_trans_huge(*____pmd));			\
-	} while (0)
 #if HPAGE_PMD_ORDER >= MAX_ORDER
 #error "hugepages can't be allocated by the buddy allocator"
 #endif
@@ -173,8 +159,6 @@ static inline int split_huge_page(struct page *page)
 {
 	return 0;
 }
-#define wait_split_huge_page(__anon_vma, __pmd)	\
-	do { } while (0)
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
 static inline int hugepage_madvise(struct vm_area_struct *vma,
diff --git a/mm/gup.c b/mm/gup.c
index 03f34c417591..9c8cd3f10422 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -194,16 +194,10 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 	if (pmd_trans_huge(*pmd)) {
 		ptl = pmd_lock(mm, pmd);
 		if (likely(pmd_trans_huge(*pmd))) {
-			if (unlikely(pmd_trans_splitting(*pmd))) {
-				spin_unlock(ptl);
-				wait_split_huge_page(vma->anon_vma, pmd);
-			} else {
-				page = follow_trans_huge_pmd(vma, address,
-							     pmd, flags);
-				spin_unlock(ptl);
-				*page_mask = HPAGE_PMD_NR - 1;
-				return page;
-			}
+			page = follow_trans_huge_pmd(vma, address, pmd, flags);
+			spin_unlock(ptl);
+			*page_mask = HPAGE_PMD_NR - 1;
+			return page;
 		} else
 			spin_unlock(ptl);
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 36fa0d505956..95f2a83ad9d8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -892,15 +892,6 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		goto out_unlock;
 	}
 
-	if (unlikely(pmd_trans_splitting(pmd))) {
-		/* split huge page running from under us */
-		spin_unlock(src_ptl);
-		spin_unlock(dst_ptl);
-		pte_free(dst_mm, pgtable);
-
-		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
-		goto out;
-	}
 	src_page = pmd_page(pmd);
 	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
 	get_page(src_page);
@@ -1369,7 +1360,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int ret = 0;
 
-	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (__pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		struct page *page;
 		pgtable_t pgtable;
 		pmd_t orig_pmd;
@@ -1408,7 +1399,6 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 		  pmd_t *old_pmd, pmd_t *new_pmd)
 {
 	spinlock_t *old_ptl, *new_ptl;
-	int ret = 0;
 	pmd_t pmd;
 
 	struct mm_struct *mm = vma->vm_mm;
@@ -1417,7 +1407,7 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 	    (new_addr & ~HPAGE_PMD_MASK) ||
 	    old_end - old_addr < HPAGE_PMD_SIZE ||
 	    (new_vma->vm_flags & VM_NOHUGEPAGE))
-		goto out;
+		return 0;
 
 	/*
 	 * The destination pmd shouldn't be established, free_pgtables()
@@ -1425,15 +1415,14 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 	 */
 	if (WARN_ON(!pmd_none(*new_pmd))) {
 		VM_BUG_ON(pmd_trans_huge(*new_pmd));
-		goto out;
+		return 0;
 	}
 
 	/*
 	 * We don't have to worry about the ordering of src and dst
 	 * ptlocks because exclusive mmap_sem prevents deadlock.
 	 */
-	ret = __pmd_trans_huge_lock(old_pmd, vma, &old_ptl);
-	if (ret == 1) {
+	if (__pmd_trans_huge_lock(old_pmd, vma, &old_ptl)) {
 		new_ptl = pmd_lockptr(mm, new_pmd);
 		if (new_ptl != old_ptl)
 			spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
@@ -1449,9 +1438,9 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 		if (new_ptl != old_ptl)
 			spin_unlock(new_ptl);
 		spin_unlock(old_ptl);
+		return 1;
 	}
-out:
-	return ret;
+	return 0;
 }
 
 /*
@@ -1467,7 +1456,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	spinlock_t *ptl;
 	int ret = 0;
 
-	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (__pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		pmd_t entry;
 		ret = 1;
 		if (!prot_numa) {
@@ -1510,17 +1499,8 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
 		spinlock_t **ptl)
 {
 	*ptl = pmd_lock(vma->vm_mm, pmd);
-	if (likely(pmd_trans_huge(*pmd))) {
-		if (unlikely(pmd_trans_splitting(*pmd))) {
-			spin_unlock(*ptl);
-			wait_split_huge_page(vma->anon_vma, pmd);
-			return -1;
-		} else {
-			/* Thp mapped by 'pmd' is stable, so we can
-			 * handle it as it is. */
-			return 1;
-		}
-	}
+	if (likely(pmd_trans_huge(*pmd)))
+		return 1;
 	spin_unlock(*ptl);
 	return 0;
 }
@@ -1536,7 +1516,6 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
 pmd_t *page_check_address_pmd(struct page *page,
 			      struct mm_struct *mm,
 			      unsigned long address,
-			      enum page_check_address_pmd_flag flag,
 			      spinlock_t **ptl)
 {
 	pgd_t *pgd;
@@ -1559,21 +1538,8 @@ pmd_t *page_check_address_pmd(struct page *page,
 		goto unlock;
 	if (pmd_page(*pmd) != page)
 		goto unlock;
-	/*
-	 * split_vma() may create temporary aliased mappings. There is
-	 * no risk as long as all huge pmd are found and have their
-	 * splitting bit set before __split_huge_page_refcount
-	 * runs. Finding the same huge pmd more than once during the
-	 * same rmap walk is not a problem.
-	 */
-	if (flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
-	    pmd_trans_splitting(*pmd))
-		goto unlock;
-	if (pmd_trans_huge(*pmd)) {
-		VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
-			  !pmd_trans_splitting(*pmd));
+	if (pmd_trans_huge(*pmd))
 		return pmd;
-	}
 unlock:
 	spin_unlock(*ptl);
 	return NULL;
@@ -1897,8 +1863,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		mmun_end   = haddr + HPAGE_PMD_SIZE;
 		mmu_notifier_invalidate_range_start(vma->vm_mm,
 				mmun_start, mmun_end);
-		pmd = page_check_address_pmd(page, vma->vm_mm, addr,
-				PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+		pmd = page_check_address_pmd(page, vma->vm_mm, addr, &ptl);
 		if (pmd) {
 			__split_huge_pmd_locked(vma, pmd, addr);
 			spin_unlock(ptl);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d1d6e560c8e9..46d2f03659d3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5893,7 +5893,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
 			mc.precharge += HPAGE_PMD_NR;
 		spin_unlock(ptl);
@@ -6065,17 +6065,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	struct page *page;
 	struct page_cgroup *pc;
 
-	/*
-	 * We don't take compound_lock() here but no race with splitting thp
-	 * happens because:
-	 *  - if pmd_trans_huge_lock() returns 1, the relevant thp is not
-	 *    under splitting, which means there's no concurrent thp split,
-	 *  - if another thread runs into split_huge_page() just after we
-	 *    entered this if-block, the thread must wait for page table lock
-	 *    to be unlocked in __split_huge_page_splitting(), where the main
-	 *    part of thp split is not executed yet.
-	 */
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		if (mc.precharge < HPAGE_PMD_NR) {
 			spin_unlock(ptl);
 			return 0;
diff --git a/mm/memory.c b/mm/memory.c
index 3f7a8bd768de..812205d0ee5f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -563,7 +563,6 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	spinlock_t *ptl;
 	pgtable_t new = pte_alloc_one(mm, address);
-	int wait_split_huge_page;
 	if (!new)
 		return -ENOMEM;
 
@@ -583,18 +582,14 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
 
 	ptl = pmd_lock(mm, pmd);
-	wait_split_huge_page = 0;
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		atomic_long_inc(&mm->nr_ptes);
 		pmd_populate(mm, pmd, new);
 		new = NULL;
-	} else if (unlikely(pmd_trans_splitting(*pmd)))
-		wait_split_huge_page = 1;
+	}
 	spin_unlock(ptl);
 	if (new)
 		pte_free(mm, new);
-	if (wait_split_huge_page)
-		wait_split_huge_page(vma->anon_vma, pmd);
 	return 0;
 }
 
@@ -610,8 +605,7 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		pmd_populate_kernel(&init_mm, pmd, new);
 		new = NULL;
-	} else
-		VM_BUG_ON(pmd_trans_splitting(*pmd));
+	}
 	spin_unlock(&init_mm.page_table_lock);
 	if (new)
 		pte_free_kernel(&init_mm, new);
@@ -3295,14 +3289,6 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (pmd_trans_huge(orig_pmd)) {
 			unsigned int dirty = flags & FAULT_FLAG_WRITE;
 
-			/*
-			 * If the pmd is splitting, return and retry the
-			 * the fault.  Alternative: wait until the split
-			 * is done, and goto retry.
-			 */
-			if (pmd_trans_splitting(orig_pmd))
-				return 0;
-
 			if (pmd_numa(orig_pmd))
 				return do_huge_pmd_numa_page(mm, vma, address,
 							     orig_pmd, pmd);
diff --git a/mm/mincore.c b/mm/mincore.c
index 0e548fbce19e..819b0f3adee0 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -111,7 +111,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	struct vm_area_struct *vma = walk->vma;
 	pte_t *ptep;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		memset(walk->private, 1, (end - addr) >> PAGE_SHIFT);
 		walk->private += (end - addr) >> PAGE_SHIFT;
 		spin_unlock(ptl);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a8b919925934..414f36c6e8f9 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -133,20 +133,6 @@ pmd_t pmdp_clear_flush(struct vm_area_struct *vma, unsigned long address,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
-#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
-			  pmd_t *pmdp)
-{
-	pmd_t pmd = pmd_mksplitting(*pmdp);
-	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-	set_pmd_at(vma->vm_mm, address, pmdp, pmd);
-	/* tlb flush only to serialize against gup-fast */
-	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-#endif
-
 #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
diff --git a/mm/rmap.c b/mm/rmap.c
index eecc9301847d..c5d8fa899093 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -686,8 +686,7 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		 * rmap might return false positives; we must filter
 		 * these out using page_check_address_pmd().
 		 */
-		pmd = page_check_address_pmd(page, mm, address,
-					     PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+		pmd = page_check_address_pmd(page, mm, address, &ptl);
 		if (!pmd)
 			return SWAP_AGAIN;
 
@@ -697,7 +696,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 			return SWAP_FAIL; /* To break the loop */
 		}
 
-		/* go ahead even if the pmd is pmd_trans_splitting() */
 		if (pmdp_clear_flush_young_notify(vma, address, pmd))
 			referenced++;
 		spin_unlock(ptl);
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 14/19] x86, thp: remove infrastructure for handling splitting PMDs
  2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
                   ` (12 preceding siblings ...)
  2014-11-05 14:49 ` [PATCH 13/19] mm, thp: remove infrastructure for handling splitting PMDs Kirill A. Shutemov
@ 2014-11-05 14:49 ` Kirill A. Shutemov
  2014-11-05 14:49 ` [PATCH 15/19] futex, thp: remove special case for THP in get_futex_key Kirill A. Shutemov
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting we don't need to mark PMDs splitting. Let's drop
the code that handles this.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h       |  9 ---------
 arch/x86/include/asm/pgtable_types.h |  2 --
 arch/x86/mm/gup.c                    | 13 +------------
 arch/x86/mm/pgtable.c                | 14 --------------
 4 files changed, 1 insertion(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0ec056012618..1c60bfca6b65 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -158,11 +158,6 @@ static inline int pmd_large(pmd_t pte)
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline int pmd_trans_splitting(pmd_t pmd)
-{
-	return pmd_val(pmd) & _PAGE_SPLITTING;
-}
-
 static inline int pmd_trans_huge(pmd_t pmd)
 {
 	return pmd_val(pmd) & _PAGE_PSE;
@@ -799,10 +794,6 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
 				  unsigned long address, pmd_t *pmdp);
 
 
-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
-				 unsigned long addr, pmd_t *pmdp);
-
 #define __HAVE_ARCH_PMD_WRITE
 static inline int pmd_write(pmd_t pmd)
 {
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index f216963760e5..7d8066d1d9c0 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -22,7 +22,6 @@
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
 #define _PAGE_BIT_SPECIAL	_PAGE_BIT_SOFTW1
 #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_SOFTW1
-#define _PAGE_BIT_SPLITTING	_PAGE_BIT_SOFTW2 /* only valid on a PSE pmd */
 #define _PAGE_BIT_IOMAP		_PAGE_BIT_SOFTW2 /* flag used to indicate IO mapping */
 #define _PAGE_BIT_HIDDEN	_PAGE_BIT_SOFTW3 /* hidden by kmemcheck */
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
@@ -57,7 +56,6 @@
 #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
 #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
-#define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
 #define __HAVE_ARCH_PTE_SPECIAL
 
 #ifdef CONFIG_KMEMCHECK
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 754bca23ec1b..b65b3fc4494a 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -157,18 +157,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		pmd_t pmd = *pmdp;
 
 		next = pmd_addr_end(addr, end);
-		/*
-		 * The pmd_trans_splitting() check below explains why
-		 * pmdp_splitting_flush has to flush the tlb, to stop
-		 * this gup-fast code from running while we set the
-		 * splitting bit in the pmd. Returning zero will take
-		 * the slow path that will call wait_split_huge_page()
-		 * if the pmd is still in splitting state. gup-fast
-		 * can't because it has irq disabled and
-		 * wait_split_huge_page() would never return as the
-		 * tlb flush IPI wouldn't run.
-		 */
-		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+		if (pmd_none(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
 			/*
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 6fb6927f9e76..336847f5719e 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -429,20 +429,6 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 
 	return young;
 }
-
-void pmdp_splitting_flush(struct vm_area_struct *vma,
-			  unsigned long address, pmd_t *pmdp)
-{
-	int set;
-	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-	set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
-				(unsigned long *)pmdp);
-	if (set) {
-		pmd_update(vma->vm_mm, address, pmdp);
-		/* need tlb flush only to serialize against gup-fast */
-		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
-	}
-}
 #endif
 
 /**
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 15/19] futex, thp: remove special case for THP in get_futex_key
  2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
                   ` (13 preceding siblings ...)
  2014-11-05 14:49 ` [PATCH 14/19] x86, thp: remove " Kirill A. Shutemov
@ 2014-11-05 14:49 ` Kirill A. Shutemov
  2014-11-05 14:49 ` [PATCH 16/19] thp: update documentation Kirill A. Shutemov
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm, Kirill A. Shutemov

With the new THP refcounting, we don't need tricks to stabilize the huge
page. If we've got a reference to a tail page, it can't be split under us.

This patch effectively reverts a5b338f2b0b1.
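
A hedged illustration of why the retry dance can go away: under the new
refcounting a pin taken by __get_user_pages_fast() on any sub-page keeps the
compound page stable, since split_huge_page() would see the extra reference
and return -EBUSY. The snippet is a simplified sketch, not the futex code
changed below:

static int pin_and_inspect(unsigned long address)
{
	struct page *page;

	if (__get_user_pages_fast(address, 1, 1, &page) != 1)
		return -EFAULT;
	/*
	 * The pin keeps the page (head or tail) from being split, so
	 * page->mapping and page->index stay stable until put_page().
	 */
	lock_page(page);
	/* ... derive the key from page->mapping here ... */
	unlock_page(page);
	put_page(page);
	return 0;
}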

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 kernel/futex.c | 61 ++++++++++++----------------------------------------------
 1 file changed, 12 insertions(+), 49 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index d3a9d946d0b7..fb71ccba683b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -391,7 +391,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
-	struct page *page, *page_head;
+	struct page *page;
 	int err, ro = 0;
 
 	/*
@@ -434,46 +434,9 @@ again:
 	else
 		err = 0;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	page_head = page;
-	if (unlikely(PageTail(page))) {
-		put_page(page);
-		/* serialize against __split_huge_page_splitting() */
-		local_irq_disable();
-		if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {
-			page_head = compound_head(page);
-			/*
-			 * page_head is valid pointer but we must pin
-			 * it before taking the PG_lock and/or
-			 * PG_compound_lock. The moment we re-enable
-			 * irqs __split_huge_page_splitting() can
-			 * return and the head page can be freed from
-			 * under us. We can't take the PG_lock and/or
-			 * PG_compound_lock on a page that could be
-			 * freed from under us.
-			 */
-			if (page != page_head) {
-				get_page(page_head);
-				put_page(page);
-			}
-			local_irq_enable();
-		} else {
-			local_irq_enable();
-			goto again;
-		}
-	}
-#else
-	page_head = compound_head(page);
-	if (page != page_head) {
-		get_page(page_head);
-		put_page(page);
-	}
-#endif
-
-	lock_page(page_head);
-
+	lock_page(page);
 	/*
-	 * If page_head->mapping is NULL, then it cannot be a PageAnon
+	 * If page->mapping is NULL, then it cannot be a PageAnon
 	 * page; but it might be the ZERO_PAGE or in the gate area or
 	 * in a special mapping (all cases which we are happy to fail);
 	 * or it may have been a good file page when get_user_pages_fast
@@ -485,12 +448,12 @@ again:
 	 *
 	 * The case we do have to guard against is when memory pressure made
 	 * shmem_writepage move it from filecache to swapcache beneath us:
-	 * an unlikely race, but we do need to retry for page_head->mapping.
+	 * an unlikely race, but we do need to retry for page->mapping.
 	 */
-	if (!page_head->mapping) {
-		int shmem_swizzled = PageSwapCache(page_head);
-		unlock_page(page_head);
-		put_page(page_head);
+	if (!page->mapping) {
+		int shmem_swizzled = PageSwapCache(page);
+		unlock_page(page);
+		put_page(page);
 		if (shmem_swizzled)
 			goto again;
 		return -EFAULT;
@@ -503,7 +466,7 @@ again:
 	 * it's a read-only handle, it's expected that futexes attach to
 	 * the object not the particular process.
 	 */
-	if (PageAnon(page_head)) {
+	if (PageAnon(page)) {
 		/*
 		 * A RO anonymous page will never change and thus doesn't make
 		 * sense for futex operations.
@@ -518,15 +481,15 @@ again:
 		key->private.address = address;
 	} else {
 		key->both.offset |= FUT_OFF_INODE; /* inode-based key */
-		key->shared.inode = page_head->mapping->host;
+		key->shared.inode = page->mapping->host;
 		key->shared.pgoff = basepage_index(page);
 	}
 
 	get_futex_key_refs(key); /* implies MB (B) */
 
 out:
-	unlock_page(page_head);
-	put_page(page_head);
+	unlock_page(page);
+	put_page(page);
 	return err;
 }
 
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 16/19] thp: update documentation
  2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
                   ` (14 preceding siblings ...)
  2014-11-05 14:49 ` [PATCH 15/19] futex, thp: remove special case for THP in get_futex_key Kirill A. Shutemov
@ 2014-11-05 14:49 ` Kirill A. Shutemov
  2014-11-19  8:07   ` Naoya Horiguchi
  2014-11-05 14:49 ` [PATCH 17/19] mlock, thp: HACK: split all pages in VM_LOCKED vma Kirill A. Shutemov
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm, Kirill A. Shutemov

The patch updates Documentation/vm/transhuge.txt to reflect changes in
THP design.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/transhuge.txt | 84 +++++++++++++++++++-----------------------
 1 file changed, 38 insertions(+), 46 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index df1794a9071f..33465e7b0d9b 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -200,9 +200,18 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range
 	of pages that should be collapsed into one huge page but failed
 	the allocation.
 
-thp_split is incremented every time a huge page is split into base
+thp_split_page is incremented every time a huge page is split into base
 	pages. This can happen for a variety of reasons but a common
 	reason is that a huge page is old and is being reclaimed.
+	This action implies splitting all PMDs the page is mapped with.
+
+thp_split_page_failed is incremented if the kernel fails to split a huge
+	page. This can happen if the page was pinned by somebody.
+
+thp_split_pmd is incremented every time a PMD is split into a table of PTEs.
+	This can happen, for instance, when an application calls mprotect() or
+	munmap() on part of a huge page. It doesn't split the huge page, only
+	the page table entry.
 
 thp_zero_page_alloc is incremented every time a huge zero page is
 	successfully allocated. It includes allocations which where
@@ -280,9 +289,9 @@ unaffected. libhugetlbfs will also work fine as usual.
 == Graceful fallback ==
 
 Code walking pagetables but unware about huge pmds can simply call
-split_huge_page_pmd(vma, addr, pmd) where the pmd is the one returned by
+split_huge_pmd(vma, pmd, addr) where the pmd is the one returned by
 pmd_offset. It's trivial to make the code transparent hugepage aware
-by just grepping for "pmd_offset" and adding split_huge_page_pmd where
+by just grepping for "pmd_offset" and adding split_huge_pmd where
 missing after pmd_offset returns the pmd. Thanks to the graceful
 fallback design, with a one liner change, you can avoid to write
 hundred if not thousand of lines of complex code to make your code
@@ -291,7 +300,8 @@ hugepage aware.
 If you're not walking pagetables but you run into a physical hugepage
 but you can't handle it natively in your code, you can split it by
 calling split_huge_page(page). This is what the Linux VM does before
-it tries to swapout the hugepage for example.
+it tries to swapout the hugepage for example. split_huge_page can fail
+if the page is pinned and you must handle this correctly.
 
 Example to make mremap.c transparent hugepage aware with a one liner
 change:
@@ -303,14 +313,14 @@ diff --git a/mm/mremap.c b/mm/mremap.c
 		return NULL;
 
 	pmd = pmd_offset(pud, addr);
-+	split_huge_page_pmd(vma, addr, pmd);
++	split_huge_pmd(vma, pmd, addr);
 	if (pmd_none_or_clear_bad(pmd))
 		return NULL;
 
 == Locking in hugepage aware code ==
 
 We want as much code as possible hugepage aware, as calling
-split_huge_page() or split_huge_page_pmd() has a cost.
+split_huge_page() or split_huge_pmd() has a cost.
 
 To make pagetable walks huge pmd aware, all you need to do is to call
 pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
@@ -319,47 +329,29 @@ created from under you by khugepaged (khugepaged collapse_huge_page
 takes the mmap_sem in write mode in addition to the anon_vma lock). If
 pmd_trans_huge returns false, you just fallback in the old code
 paths. If instead pmd_trans_huge returns true, you have to take the
-mm->page_table_lock and re-run pmd_trans_huge. Taking the
-page_table_lock will prevent the huge pmd to be converted into a
-regular pmd from under you (split_huge_page can run in parallel to the
+page table lock (pmd_lock()) and re-run pmd_trans_huge. Taking the
+page table lock will prevent the huge pmd from being converted into a
+regular pmd from under you (split_huge_pmd can run in parallel to the
 pagetable walk). If the second pmd_trans_huge returns false, you
 should just drop the page_table_lock and fallback to the old code as
-before. Otherwise you should run pmd_trans_splitting on the pmd. In
-case pmd_trans_splitting returns true, it means split_huge_page is
-already in the middle of splitting the page. So if pmd_trans_splitting
-returns true it's enough to drop the page_table_lock and call
-wait_split_huge_page and then fallback the old code paths. You are
-guaranteed by the time wait_split_huge_page returns, the pmd isn't
-huge anymore. If pmd_trans_splitting returns false, you can proceed to
-process the huge pmd and the hugepage natively. Once finished you can
-drop the page_table_lock.
-
-== compound_lock, get_user_pages and put_page ==
+before. Otherwise you can proceed to process the huge pmd and the
+hugepage natively. Once finished you can drop the page_table_lock.
+
+== Refcounts and transparent huge pages ==
 
+As with other compound page types we do all refcounting for THP on head
+page, but unlike other compound pages THP supports splitting.
 split_huge_page internally has to distribute the refcounts in the head
-page to the tail pages before clearing all PG_head/tail bits from the
-page structures. It can do that easily for refcounts taken by huge pmd
-mappings. But the GUI API as created by hugetlbfs (that returns head
-and tail pages if running get_user_pages on an address backed by any
-hugepage), requires the refcount to be accounted on the tail pages and
-not only in the head pages, if we want to be able to run
-split_huge_page while there are gup pins established on any tail
-page. Failure to be able to run split_huge_page if there's any gup pin
-on any tail page, would mean having to split all hugepages upfront in
-get_user_pages which is unacceptable as too many gup users are
-performance critical and they must work natively on hugepages like
-they work natively on hugetlbfs already (hugetlbfs is simpler because
-hugetlbfs pages cannot be split so there wouldn't be requirement of
-accounting the pins on the tail pages for hugetlbfs). If we wouldn't
-account the gup refcounts on the tail pages during gup, we won't know
-anymore which tail page is pinned by gup and which is not while we run
-split_huge_page. But we still have to add the gup pin to the head page
-too, to know when we can free the compound page in case it's never
-split during its lifetime. That requires changing not just
-get_page, but put_page as well so that when put_page runs on a tail
-page (and only on a tail page) it will find its respective head page,
-and then it will decrease the head page refcount in addition to the
-tail page refcount. To obtain a head page reliably and to decrease its
-refcount without race conditions, put_page has to serialize against
-__split_huge_page_refcount using a special per-page lock called
-compound_lock.
+page to the tail pages before clearing all PG_head/tail bits from the page
+structures. It can be done easily for refcounts taken by page table
+entries. But we don't have enough information on how to distribute any
+additional pins (e.g. from get_user_pages). split_huge_page fails any
+request to split a pinned huge page: it expects page count to be equal to
+sum of mapcount of all sub-pages plus one (split_huge_page caller must
+have reference for head page).
+
+split_huge_page uses the per-page compound_lock to protect page->_count from
+being updated by get_page()/put_page() on tail pages.
+
+Note that split_huge_pmd doesn't have any limitation on refcounting: PMD
+can be split at any point and never fails.
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 17/19] mlock, thp: HACK: split all pages in VM_LOCKED vma
  2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
                   ` (15 preceding siblings ...)
  2014-11-05 14:49 ` [PATCH 16/19] thp: update documentation Kirill A. Shutemov
@ 2014-11-05 14:49 ` Kirill A. Shutemov
  2014-11-19  9:02   ` Naoya Horiguchi
  2014-11-05 14:49 ` [PATCH 18/19] thp, mm: use migration entries to freeze page counts on split Kirill A. Shutemov
  2014-11-05 14:49 ` [PATCH 19/19] mm, thp: remove compound_lock Kirill A. Shutemov
  18 siblings, 1 reply; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm, Kirill A. Shutemov

We don't yet handle mlocked pages properly with the new THP refcounting.
For now we split all pages in the VMA on mlock and disallow khugepaged
from collapsing pages in the VMA. If the split fails on mlock() we fail
the syscall with -EBUSY.
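
For completeness, a hedged userspace-side sketch of the behaviour this hack
introduces (the helper and its error handling are illustrative, not mandated
by the patch):

#include <errno.h>
#include <sys/mman.h>

/* mlock() may now fail with EBUSY if a THP in the range is pinned
 * and cannot be split. */
static int lock_range(void *addr, size_t len)
{
	if (mlock(addr, len) == 0)
		return 0;
	if (errno == EBUSY)
		return -EBUSY;	/* pinned THP in the range: retry later */
	return -errno;
}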
---
 include/linux/huge_mm.h |  24 +++++++++
 mm/huge_memory.c        |  17 ++-----
 mm/mlock.c              | 130 +++++++++++++++++++++++++++++++++---------------
 3 files changed, 118 insertions(+), 53 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 94f331166974..abe146bd8ed7 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -140,6 +140,18 @@ static inline int hpage_nr_pages(struct page *page)
 extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				unsigned long addr, pmd_t pmd, pmd_t *pmdp);
 
+extern struct page *huge_zero_page __read_mostly;
+
+static inline bool is_huge_zero_page(struct page *page)
+{
+	return ACCESS_ONCE(huge_zero_page) == page;
+}
+
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+	return is_huge_zero_page(pmd_page(pmd));
+}
+
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -185,6 +197,18 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str
 	return 0;
 }
 
+static inline bool is_huge_zero_page(struct page *page)
+{
+	BUG();
+	return 0;
+}
+
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+	BUG();
+	return 0;
+}
+
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 95f2a83ad9d8..555a9134dfa0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -171,17 +171,7 @@ static int start_khugepaged(void)
 }
 
 static atomic_t huge_zero_refcount;
-static struct page *huge_zero_page __read_mostly;
-
-static inline bool is_huge_zero_page(struct page *page)
-{
-	return ACCESS_ONCE(huge_zero_page) == page;
-}
-
-static inline bool is_huge_zero_pmd(pmd_t pmd)
-{
-	return is_huge_zero_page(pmd_page(pmd));
-}
+struct page *huge_zero_page __read_mostly;
 
 static struct page *get_huge_zero_page(void)
 {
@@ -801,6 +791,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
 		return VM_FAULT_FALLBACK;
+	if (vma->vm_flags & VM_LOCKED)
+		return VM_FAULT_FALLBACK;
 	if (unlikely(anon_vma_prepare(vma)))
 		return VM_FAULT_OOM;
 	if (unlikely(khugepaged_enter(vma)))
@@ -2352,7 +2344,8 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
 	if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
 	    (vma->vm_flags & VM_NOHUGEPAGE))
 		return false;
-
+	if (vma->vm_flags & VM_LOCKED)
+		return false;
 	if (!vma->anon_vma || vma->vm_ops)
 		return false;
 	if (is_vma_temporary_stack(vma))
diff --git a/mm/mlock.c b/mm/mlock.c
index ce84cb0b83ef..e3a367685503 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -500,38 +500,26 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 				&page_mask);
 
 		if (page && !IS_ERR(page)) {
-			if (PageTransHuge(page)) {
-				lock_page(page);
-				/*
-				 * Any THP page found by follow_page_mask() may
-				 * have gotten split before reaching
-				 * munlock_vma_page(), so we need to recompute
-				 * the page_mask here.
-				 */
-				page_mask = munlock_vma_page(page);
-				unlock_page(page);
-				put_page(page); /* follow_page_mask() */
-			} else {
-				/*
-				 * Non-huge pages are handled in batches via
-				 * pagevec. The pin from follow_page_mask()
-				 * prevents them from collapsing by THP.
-				 */
-				pagevec_add(&pvec, page);
-				zone = page_zone(page);
-				zoneid = page_zone_id(page);
+			VM_BUG_ON_PAGE(PageTransCompound(page), page);
+			/*
+			 * Non-huge pages are handled in batches via
+			 * pagevec. The pin from follow_page_mask()
+			 * prevents them from collapsing by THP.
+			 */
+			pagevec_add(&pvec, page);
+			zone = page_zone(page);
+			zoneid = page_zone_id(page);
 
-				/*
-				 * Try to fill the rest of pagevec using fast
-				 * pte walk. This will also update start to
-				 * the next page to process. Then munlock the
-				 * pagevec.
-				 */
-				start = __munlock_pagevec_fill(&pvec, vma,
-						zoneid, start, end);
-				__munlock_pagevec(&pvec, zone);
-				goto next;
-			}
+			/*
+			 * Try to fill the rest of pagevec using fast
+			 * pte walk. This will also update start to
+			 * the next page to process. Then munlock the
+			 * pagevec.
+			 */
+			start = __munlock_pagevec_fill(&pvec, vma,
+					zoneid, start, end);
+			__munlock_pagevec(&pvec, zone);
+			goto next;
 		}
 		/* It's a bug to munlock in the middle of a THP page */
 		VM_BUG_ON((start >> PAGE_SHIFT) & page_mask);
@@ -542,6 +530,60 @@ next:
 	}
 }
 
+static int thp_split(pmd_t *pmd, unsigned long addr, unsigned long end,
+		struct mm_walk *walk)
+{
+	spinlock_t *ptl;
+	struct page *page = NULL;
+	pte_t *pte;
+	int err = 0;
+
+retry:
+	if (pmd_none(*pmd))
+		return 0;
+	if (pmd_trans_huge(*pmd)) {
+		if (is_huge_zero_pmd(*pmd)) {
+			split_huge_pmd(walk->vma, pmd, addr);
+			return 0;
+		}
+		ptl = pmd_lock(walk->mm, pmd);
+		if (!pmd_trans_huge(*pmd)) {
+			spin_unlock(ptl);
+			goto retry;
+		}
+		page = pmd_page(*pmd);
+		VM_BUG_ON_PAGE(!PageHead(page), page);
+		get_page(page);
+		spin_unlock(ptl);
+		err = split_huge_page(page);
+		put_page(page);
+		return err;
+	}
+	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	do {
+		if (!pte_present(*pte))
+			continue;
+		page = vm_normal_page(walk->vma, addr, *pte);
+		if (!page)
+			continue;
+		if (PageTransCompound(page)) {
+			page = compound_head(page);
+			get_page(page);
+			spin_unlock(ptl);
+			err = split_huge_page(page);
+			spin_lock(ptl);
+			put_page(page);
+			if (!err) {
+				VM_BUG_ON_PAGE(compound_mapcount(page), page);
+				VM_BUG_ON_PAGE(PageTransCompound(page), page);
+			} else
+				break;
+		}
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+	pte_unmap_unlock(pte - 1, ptl);
+	return err;
+}
+
 /*
  * mlock_fixup  - handle mlock[all]/munlock[all] requests.
  *
@@ -586,24 +628,30 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
 success:
 	/*
-	 * Keep track of amount of locked VM.
-	 */
-	nr_pages = (end - start) >> PAGE_SHIFT;
-	if (!lock)
-		nr_pages = -nr_pages;
-	mm->locked_vm += nr_pages;
-
-	/*
 	 * vm_flags is protected by the mmap_sem held in write mode.
 	 * It's okay if try_to_unmap_one unmaps a page just after we
 	 * set VM_LOCKED, __mlock_vma_pages_range will bring it back.
 	 */
 
-	if (lock)
+	if (lock) {
+		struct mm_walk thp_split_walk = {
+			.mm = mm,
+			.pmd_entry = thp_split,
+		};
+		ret = walk_page_vma(vma, &thp_split_walk);
+		if (ret)
+			goto out;
 		vma->vm_flags = newflags;
-	else
+	} else
 		munlock_vma_pages_range(vma, start, end);
 
+	/*
+	 * Keep track of amount of locked VM.
+	 */
+	nr_pages = (end - start) >> PAGE_SHIFT;
+	if (!lock)
+		nr_pages = -nr_pages;
+	mm->locked_vm += nr_pages;
 out:
 	*prev = vma;
 	return ret;
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 18/19] thp, mm: use migration entries to freeze page counts on split
  2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
                   ` (16 preceding siblings ...)
  2014-11-05 14:49 ` [PATCH 17/19] mlock, thp: HACK: split all pages in VM_LOCKED vma Kirill A. Shutemov
@ 2014-11-05 14:49 ` Kirill A. Shutemov
  2014-11-05 14:49 ` [PATCH 19/19] mm, thp: remove compound_lock Kirill A. Shutemov
  18 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm, Kirill A. Shutemov

Currently, we rely on compound_lock() to keep page counts stable while
splitting page refcounting. To make that work we also take the lock in
get_page() and put_page(), which are hot paths.

This patch reworks the splitting code to set up migration entries that
stabilize page count/mapcount before refcounts are distributed. It means
we don't need the compound lock in get_page()/put_page().
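
A condensed sketch of the resulting flow, assembled from the helpers added
below (the exact call site is outside this hunk):

/*
 * freeze_page() replaces the huge PMD / PTEs mapping the page with
 * migration entries, so concurrent faults wait on them and the counts
 * stay frozen without compound_lock in get_page()/put_page().
 */
freeze_page(anon_vma, page);
ret = __split_huge_page_refcount(anon_vma, page, list);
unfreeze_page(anon_vma, page);	/* migration entries -> regular ptes */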

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/migrate.h |   3 +
 mm/huge_memory.c        | 173 ++++++++++++++++++++++++++++++++++--------------
 mm/migrate.c            |  15 +++--
 3 files changed, 135 insertions(+), 56 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a2901c414664..edbbed27fb7c 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -55,6 +55,9 @@ extern int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page,
 		struct buffer_head *head, enum migrate_mode mode,
 		int extra_count);
+extern int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
+		unsigned long addr, void *old);
+
 #else
 
 static inline void putback_movable_pages(struct list_head *l) {}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 555a9134dfa0..4e087091a809 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -23,6 +23,7 @@
 #include <linux/pagemap.h>
 #include <linux/migrate.h>
 #include <linux/hashtable.h>
+#include <linux/swapops.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -1567,7 +1568,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 
 
 static void __split_huge_pmd_locked(struct vm_area_struct *vma,
-		pmd_t *pmd, unsigned long address)
+		pmd_t *pmd, unsigned long address, int freeze)
 {
 	unsigned long haddr = address & HPAGE_PMD_MASK;
 	struct page *page;
@@ -1600,12 +1601,19 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma,
 		 * any possibility that pte_numa leaks to a PROT_NONE VMA by
 		 * accident.
 		 */
-		entry = mk_pte(page + i, vma->vm_page_prot);
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-		if (!pmd_write(*pmd))
-			entry = pte_wrprotect(entry);
-		if (!pmd_young(*pmd))
-			entry = pte_mkold(entry);
+		if (freeze) {
+			swp_entry_t swp_entry;
+			swp_entry = make_migration_entry(page + i,
+					pmd_write(*pmd));
+			entry = swp_entry_to_pte(swp_entry);
+		} else {
+			entry = mk_pte(page + i, vma->vm_page_prot);
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+			if (!pmd_write(*pmd))
+				entry = pte_wrprotect(entry);
+			if (!pmd_young(*pmd))
+				entry = pte_mkold(entry);
+		}
 		pte = pte_offset_map(&_pmd, haddr);
 		BUG_ON(!pte_none(*pte));
 		atomic_inc(&page[i]._mapcount);
@@ -1631,7 +1639,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	ptl = pmd_lock(mm, pmd);
 	if (likely(pmd_trans_huge(*pmd)))
-		__split_huge_pmd_locked(vma, pmd, address);
+		__split_huge_pmd_locked(vma, pmd, address, 0);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 }
@@ -1666,20 +1674,106 @@ static void split_huge_page_address(struct vm_area_struct *vma,
 	__split_huge_pmd(vma, pmd, address);
 }
 
-static int __split_huge_page_refcount(struct page *page,
-				       struct list_head *list)
+static void freeze_page(struct anon_vma *anon_vma, struct page *page)
+{
+	struct anon_vma_chain *avc;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	unsigned long addr, haddr;
+	unsigned long mmun_start, mmun_end;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *start_pte, *pte;
+	spinlock_t *ptl;
+
+	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
+		vma = avc->vma;
+		mm = vma->vm_mm;
+		haddr = addr = vma_address(page, vma) & HPAGE_PMD_MASK;
+		mmun_start = haddr;
+		mmun_end   = haddr + HPAGE_PMD_SIZE;
+		mmu_notifier_invalidate_range_start(vma->vm_mm,
+				mmun_start, mmun_end);
+
+		pgd = pgd_offset(vma->vm_mm, addr);
+		if (!pgd_present(*pgd))
+			goto next;
+		pud = pud_offset(pgd, addr);
+		if (!pud_present(*pud))
+			goto next;
+		pmd = pmd_offset(pud, addr);
+
+		ptl = pmd_lock(vma->vm_mm, pmd);
+		if (!pmd_present(*pmd)) {
+			spin_unlock(ptl);
+			goto next;
+		}
+		if (pmd_trans_huge(*pmd)) {
+			if (page == pmd_page(*pmd))
+				__split_huge_pmd_locked(vma, pmd, addr, 1);
+			spin_unlock(ptl);
+			goto next;
+		}
+		spin_unlock(ptl);
+
+		start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+		pte = start_pte;
+		do {
+			pte_t entry, swp_pte;
+			swp_entry_t swp_entry;
+
+			if (!pte_present(*pte))
+				continue;
+			if (page_to_pfn(page) != pte_pfn(*pte))
+				continue;
+			flush_cache_page(vma, addr, page_to_pfn(page));
+			entry = ptep_clear_flush(vma, addr, pte);
+			swp_entry = make_migration_entry(page,
+					pte_write(entry));
+			swp_pte = swp_entry_to_pte(swp_entry);
+			if (pte_soft_dirty(entry))
+				swp_pte = pte_swp_mksoft_dirty(swp_pte);
+			set_pte_at(vma->vm_mm, addr, pte, swp_pte);
+		} while (pte++, addr += PAGE_SIZE, page++, addr != mmun_end);
+		pte_unmap_unlock(start_pte, ptl);
+next:
+		mmu_notifier_invalidate_range_end(vma->vm_mm,
+				mmun_start, mmun_end);
+	}
+}
+
+static void unfreeze_page(struct anon_vma *anon_vma, struct page *page)
+{
+	struct anon_vma_chain *avc;
+	struct vm_area_struct *vma;
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	unsigned long addr;
+
+	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
+		vma = avc->vma;
+		addr = vma_address(page, vma);
+		remove_migration_pte(page, vma, addr, page);
+	}
+}
+
+static int __split_huge_page_refcount(struct anon_vma *anon_vma,
+		struct page *page, struct list_head *list)
 {
 	int i;
 	struct zone *zone = page_zone(page);
 	struct lruvec *lruvec;
 	int tail_mapcount = 0;
 
+	lock_page(page);
+	freeze_page(anon_vma, page);
+	BUG_ON(compound_mapcount(page));
+
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
 	lruvec = mem_cgroup_page_lruvec(page, zone);
 
-	compound_lock(page);
-
 	/*
 	 * We cannot split pinned THP page: we expect page count to be equal
 	 * to sum of mapcount of all sub-pages plus one (split_huge_page()
@@ -1695,8 +1789,9 @@ static int __split_huge_page_refcount(struct page *page,
 		tail_mapcount += page_mapcount(page + i);
 	if (tail_mapcount != page_count(page) - 1) {
 		BUG_ON(tail_mapcount > page_count(page) - 1);
-		compound_unlock(page);
 		spin_unlock_irq(&zone->lru_lock);
+		unfreeze_page(anon_vma, page);
+		unlock_page(page);
 		return -EBUSY;
 	}
 
@@ -1743,6 +1838,7 @@ static int __split_huge_page_refcount(struct page *page,
 				      (1L << PG_mlocked) |
 				      (1L << PG_uptodate) |
 				      (1L << PG_active) |
+				      (1L << PG_locked) |
 				      (1L << PG_unevictable)));
 		page_tail->flags |= (1L << PG_dirty);
 
@@ -1768,12 +1864,16 @@ static int __split_huge_page_refcount(struct page *page,
 	__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
 
 	ClearPageCompound(page);
-	compound_unlock(page);
 	spin_unlock_irq(&zone->lru_lock);
 
+	unfreeze_page(anon_vma, page);
+	unlock_page(page);
+
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		struct page *page_tail = page + i;
 		BUG_ON(page_count(page_tail) <= 0);
+		unfreeze_page(anon_vma, page_tail);
+		unlock_page(page_tail);
 		/*
 		 * Tail pages may be freed if there wasn't any mapping
 		 * like if add_to_swap() is running on a lru page that
@@ -1802,10 +1902,8 @@ static int __split_huge_page_refcount(struct page *page,
 int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
 	struct anon_vma *anon_vma;
-	struct anon_vma_chain *avc;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
 	int i, tail_mapcount;
-	int ret = -EBUSY;
+	int ret = 0;
 
 	BUG_ON(is_huge_zero_page(page));
 	BUG_ON(!PageAnon(page));
@@ -1819,15 +1917,12 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	 */
 	anon_vma = page_get_anon_vma(page);
 	if (!anon_vma)
-		goto out;
+		return -EBUSY;
 	anon_vma_lock_write(anon_vma);
 
-	if (!PageCompound(page)) {
-		ret = 0;
-		goto out_unlock;
-	}
-
 	BUG_ON(!PageSwapBacked(page));
+	if (!PageCompound(page))
+		goto out;
 
 	/*
 	 * Racy check if __split_huge_page_refcount() can be successful, before
@@ -1839,39 +1934,15 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	if (tail_mapcount != page_count(page) - 1) {
 		VM_BUG_ON_PAGE(tail_mapcount > page_count(page) - 1, page);
 		ret = -EBUSY;
-		goto out_unlock;
-	}
-
-	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
-		struct vm_area_struct *vma = avc->vma;
-		unsigned long addr = vma_address(page, vma);
-		spinlock_t *ptl;
-		pmd_t *pmd;
-		unsigned long haddr = addr & HPAGE_PMD_MASK;
-		unsigned long mmun_start;	/* For mmu_notifiers */
-		unsigned long mmun_end;		/* For mmu_notifiers */
-
-		mmun_start = haddr;
-		mmun_end   = haddr + HPAGE_PMD_SIZE;
-		mmu_notifier_invalidate_range_start(vma->vm_mm,
-				mmun_start, mmun_end);
-		pmd = page_check_address_pmd(page, vma->vm_mm, addr, &ptl);
-		if (pmd) {
-			__split_huge_pmd_locked(vma, pmd, addr);
-			spin_unlock(ptl);
-		}
-		mmu_notifier_invalidate_range_end(vma->vm_mm,
-				mmun_start, mmun_end);
+		goto out;
 	}
 
-	BUG_ON(compound_mapcount(page));
-	ret = __split_huge_page_refcount(page, list);
+	ret = __split_huge_page_refcount(anon_vma, page, list);
 	BUG_ON(!ret && PageCompound(page));
-
-out_unlock:
+out:
 	anon_vma_unlock_write(anon_vma);
 	put_anon_vma(anon_vma);
-out:
+
 	if (ret)
 		count_vm_event(THP_SPLIT_PAGE_FAILED);
 	else
diff --git a/mm/migrate.c b/mm/migrate.c
index 4dc941100388..326064547b51 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -102,7 +102,7 @@ void putback_movable_pages(struct list_head *l)
 /*
  * Restore a potential migration pte to a working pte entry
  */
-static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
+int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 				 unsigned long addr, void *old)
 {
 	struct mm_struct *mm = vma->vm_mm;
@@ -139,7 +139,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 	entry = pte_to_swp_entry(pte);
 
 	if (!is_migration_entry(entry) ||
-	    migration_entry_to_page(entry) != old)
+	    compound_head(migration_entry_to_page(entry)) != old)
 		goto unlock;
 
 	get_page(new);
@@ -162,9 +162,14 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 			hugepage_add_anon_rmap(new, vma, addr);
 		else
 			page_dup_rmap(new, false);
-	} else if (PageAnon(new))
-		page_add_anon_rmap(new, vma, addr, false);
-	else
+	} else if (PageAnon(new)) {
+		/* unfreeze_page() case: the page wasn't removed from rmap */
+		if (PageCompound(new)) {
+			VM_BUG_ON(compound_head(new) != old);
+			put_page(new);
+		} else
+			page_add_anon_rmap(new, vma, addr, false);
+	} else
 		page_add_file_rmap(new);
 
 	/* No need to invalidate - it was non-present before */
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 19/19] mm, thp: remove compound_lock
  2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
                   ` (17 preceding siblings ...)
  2014-11-05 14:49 ` [PATCH 18/19] thp, mm: use migration entries to freeze page counts on split Kirill A. Shutemov
@ 2014-11-05 14:49 ` Kirill A. Shutemov
  18 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-05 14:49 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm, Kirill A. Shutemov

We don't need compound_lock() anymore: split_huge_page() no longer needs
it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h         |  35 ------------
 include/linux/page-flags.h |  12 +---
 mm/page_alloc.c            |   3 -
 mm/swap.c                  | 135 +++++++++++++++------------------------------
 4 files changed, 46 insertions(+), 139 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f2f95469f1c3..61f745f1fb2e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -378,41 +378,6 @@ static inline int is_vmalloc_or_module_addr(const void *x)
 
 extern void kvfree(const void *addr);
 
-static inline void compound_lock(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	VM_BUG_ON_PAGE(PageSlab(page), page);
-	bit_spin_lock(PG_compound_lock, &page->flags);
-#endif
-}
-
-static inline void compound_unlock(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	VM_BUG_ON_PAGE(PageSlab(page), page);
-	bit_spin_unlock(PG_compound_lock, &page->flags);
-#endif
-}
-
-static inline unsigned long compound_lock_irqsave(struct page *page)
-{
-	unsigned long uninitialized_var(flags);
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	local_irq_save(flags);
-	compound_lock(page);
-#endif
-	return flags;
-}
-
-static inline void compound_unlock_irqrestore(struct page *page,
-					      unsigned long flags)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	compound_unlock(page);
-	local_irq_restore(flags);
-#endif
-}
-
 static inline struct page *compound_head(struct page *page)
 {
 	if (unlikely(PageTail(page)))
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 676f72d29ac2..46ebd9c05a59 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -106,9 +106,6 @@ enum pageflags {
 #ifdef CONFIG_MEMORY_FAILURE
 	PG_hwpoison,		/* hardware poisoned page. Don't touch */
 #endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	PG_compound_lock,
-#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -511,12 +508,6 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 #define __PG_MLOCKED		0
 #endif
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define __PG_COMPOUND_LOCK		(1 << PG_compound_lock)
-#else
-#define __PG_COMPOUND_LOCK		0
-#endif
-
 /*
  * Flags checked when a page is freed.  Pages being freed should not have
  * these flags set.  It they are, there is a problem.
@@ -526,8 +517,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 	 1 << PG_private | 1 << PG_private_2 | \
 	 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
-	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
-	 __PG_COMPOUND_LOCK)
+	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON)
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b19d1e69ca12..cf3096f97c6d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6598,9 +6598,6 @@ static const struct trace_print_flags pageflag_names[] = {
 #ifdef CONFIG_MEMORY_FAILURE
 	{1UL << PG_hwpoison,		"hwpoison"	},
 #endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	{1UL << PG_compound_lock,	"compound_lock"	},
-#endif
 };
 
 static void dump_page_flags(unsigned long flags)
diff --git a/mm/swap.c b/mm/swap.c
index da28e0767088..537592dfc6c4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -80,16 +80,9 @@ static void __put_compound_page(struct page *page)
 	(*dtor)(page);
 }
 
-static inline bool compound_lock_needed(struct page *page)
-{
-	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
-		!PageSlab(page) && !PageHeadHuge(page);
-}
-
 static void put_compound_page(struct page *page)
 {
 	struct page *page_head;
-	unsigned long flags;
 
 	if (likely(!PageTail(page))) {
 		if (put_page_testzero(page)) {
@@ -108,58 +101,33 @@ static void put_compound_page(struct page *page)
 	/* __split_huge_page_refcount can run under us */
 	page_head = compound_head(page);
 
-	if (!compound_lock_needed(page_head)) {
-		/*
-		 * If "page" is a THP tail, we must read the tail page flags
-		 * after the head page flags. The split_huge_page side enforces
-		 * write memory barriers between clearing PageTail and before
-		 * the head page can be freed and reallocated.
-		 */
-		smp_rmb();
-		if (likely(PageTail(page))) {
-			/* __split_huge_page_refcount cannot race here. */
-			VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
-			VM_BUG_ON_PAGE(page_mapcount(page) != 0, page);
-			if (put_page_testzero(page_head)) {
-				/*
-				 * If this is the tail of a slab compound page,
-				 * the tail pin must not be the last reference
-				 * held on the page, because the PG_slab cannot
-				 * be cleared before all tail pins (which skips
-				 * the _mapcount tail refcounting) have been
-				 * released. For hugetlbfs the tail pin may be
-				 * the last reference on the page instead,
-				 * because PageHeadHuge will not go away until
-				 * the compound page enters the buddy
-				 * allocator.
-				 */
-				VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
-				__put_compound_page(page_head);
-			}
-		} else if (put_page_testzero(page))
-			__put_single_page(page);
-		return;
-	}
-
-	flags = compound_lock_irqsave(page_head);
-	/* here __split_huge_page_refcount won't run anymore */
-	if (likely(page != page_head && PageTail(page))) {
-		bool free;
-
-		free = put_page_testzero(page_head);
-		compound_unlock_irqrestore(page_head, flags);
-		if (free) {
-			if (PageHead(page_head))
-				__put_compound_page(page_head);
-			else
-				__put_single_page(page_head);
+	/*
+	 * If "page" is a THP tail, we must read the tail page flags after the
+	 * head page flags. The split_huge_page side enforces write memory
+	 * barriers between clearing PageTail and before the head page can be
+	 * freed and reallocated.
+	 */
+	smp_rmb();
+	if (likely(PageTail(page))) {
+		/* __split_huge_page_refcount cannot race here. */
+		VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
+		if (put_page_testzero(page_head)) {
+			/*
+			 * If this is the tail of a slab compound page, the
+			 * tail pin must not be the last reference held on the
+			 * page, because the PG_slab cannot be cleared before
+			 * all tail pins (which skips the _mapcount tail
+			 * refcounting) have been released. For hugetlbfs the
+			 * tail pin may be the last reference on the page
+			 * instead, because PageHeadHuge will not go away until
+			 * the compound page enters the buddy allocator.
+			 */
+			VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
+			__put_compound_page(page_head);
 		}
-	} else {
-		compound_unlock_irqrestore(page_head, flags);
-		VM_BUG_ON_PAGE(PageTail(page), page);
-		if (put_page_testzero(page))
-			__put_single_page(page);
-	}
+	} else if (put_page_testzero(page))
+		__put_single_page(page);
+	return;
 }
 
 void put_page(struct page *page)
@@ -178,42 +146,29 @@ EXPORT_SYMBOL(put_page);
 void __get_page_tail(struct page *page)
 {
 	struct page *page_head = compound_head(page);
-	unsigned long flags;
 
-	if (!compound_lock_needed(page_head)) {
-		smp_rmb();
-		if (likely(PageTail(page))) {
-			/*
-			 * This is a hugetlbfs page or a slab page.
-			 * __split_huge_page_refcount cannot race here.
-			 */
-			VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
-			VM_BUG_ON(page_head != page->first_page);
-			VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0,
-					page);
-			atomic_inc(&page_head->_count);
-		} else {
-			/*
-			 * __split_huge_page_refcount run before us, "page" was
-			 * a thp tail. the split page_head has been freed and
-			 * reallocated as slab or hugetlbfs page of smaller
-			 * order (only possible if reallocated as slab on x86).
-			 */
-			VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
-			atomic_inc(&page->_count);
-		}
-		return;
-	}
-
-	flags = compound_lock_irqsave(page_head);
-	/* here __split_huge_page_refcount won't run anymore */
-	if (unlikely(page == page_head || !PageTail(page) ||
-				!get_page_unless_zero(page_head))) {
-		/* page is not part of THP page anymore */
+	smp_rmb();
+	if (likely(PageTail(page))) {
+		/*
+		 * This is a hugetlbfs page or a slab page.
+		 * __split_huge_page_refcount cannot race here.
+		 */
+		VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
+		VM_BUG_ON(page_head != page->first_page);
+		VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0,
+				page);
+		atomic_inc(&page_head->_count);
+	} else {
+		/*
+		 * __split_huge_page_refcount run before us, "page" was
+		 * a thp tail. the split page_head has been freed and
+		 * reallocated as slab or hugetlbfs page of smaller
+		 * order (only possible if reallocated as slab on x86).
+		 */
 		VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
 		atomic_inc(&page->_count);
 	}
-	compound_unlock_irqrestore(page_head, flags);
+	return;
 }
 EXPORT_SYMBOL(__get_page_tail);
 
-- 
2.1.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 06/19] mm: store mapcount for compound page separate
  2014-11-05 14:49 ` [PATCH 06/19] mm: store mapcount for compound page separate Kirill A. Shutemov
@ 2014-11-18  8:43   ` Naoya Horiguchi
  2014-11-18  9:58     ` Kirill A. Shutemov
  2014-11-19 10:51   ` Jerome Marchand
  2014-11-21  6:12   ` Aneesh Kumar K.V
  2 siblings, 1 reply; 41+ messages in thread
From: Naoya Horiguchi @ 2014-11-18  8:43 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Andrea Arcangeli, Dave Hansen, Hugh Dickins,
	Mel Gorman, Rik van Riel, Vlastimil Babka, Christoph Lameter,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm

On Wed, Nov 05, 2014 at 04:49:41PM +0200, Kirill A. Shutemov wrote:
> We're going to allow mapping of individual 4k pages of THP compound and
> we need a cheap way to find out how many time the compound page is
> mapped with PMD -- compound_mapcount() does this.
> 
> page_mapcount() counts both: PTE and PMD mappings of the page.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---

...

> @@ -1837,6 +1839,9 @@ static void __split_huge_page_refcount(struct page *page,
>  	atomic_sub(tail_count, &page->_count);
>  	BUG_ON(atomic_read(&page->_count) <= 0);
>  
> +	page->_mapcount = *compound_mapcount_ptr(page);

Is atomic_set() necessary?

> +	page[1].mapping = page->mapping;
> +
>  	__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
>  
>  	ClearPageCompound(page);

...

> @@ -760,6 +763,8 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
>  		bad += free_pages_check(page + i);
>  	if (bad)
>  		return false;
> +	if (order)
> +		page[1].mapping = NULL;
>  
>  	if (!PageHighMem(page)) {
>  		debug_check_no_locks_freed(page_address(page),
> @@ -6632,10 +6637,12 @@ static void dump_page_flags(unsigned long flags)
>  void dump_page_badflags(struct page *page, const char *reason,
>  		unsigned long badflags)
>  {
> -	printk(KERN_ALERT
> -	       "page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
> +	pr_alert("page:%p count:%d mapcount:%d mapping:%p index:%#lx",
>  		page, atomic_read(&page->_count), page_mapcount(page),
>  		page->mapping, page->index);
> +	if (PageCompound(page))

> +		printk(" compound_mapcount: %d", compound_mapcount(page));
> +	printk("\n");

These two printk() should be pr_alert(), too?

>  	dump_page_flags(page->flags);
>  	if (reason)
>  		pr_alert("page dumped because: %s\n", reason);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f706a6af1801..eecc9301847d 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -986,9 +986,30 @@ void page_add_anon_rmap(struct page *page,
>  void do_page_add_anon_rmap(struct page *page,
>  	struct vm_area_struct *vma, unsigned long address, int flags)
>  {
> -	int first = atomic_inc_and_test(&page->_mapcount);
> +	bool compound = flags & RMAP_COMPOUND;
> +	bool first;
> +
> +	VM_BUG_ON_PAGE(!PageLocked(compound_head(page)), page);
> +
> +	if (PageTransCompound(page)) {
> +		struct page *head_page = compound_head(page);
> +
> +		if (compound) {
> +			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> +			first = atomic_inc_and_test(compound_mapcount_ptr(page));

Is compound_mapcount_ptr() well-defined for tail pages?
This function accesses the struct page next to the given page, so if the
given page is the last tail page of a THP, a page outside the THP will be
accessed. Is there anything that prevents this?
Did you mean atomic_inc_and_test(compound_mapcount_ptr(head_page))?

> +		} else {
> +			/* Anon THP always mapped first with PMD */
> +			first = 0;
> +			VM_BUG_ON_PAGE(!compound_mapcount(head_page),
> +					head_page);
> +			atomic_inc(&page->_mapcount);
> +		}
> +	} else {
> +		VM_BUG_ON_PAGE(compound, page);
> +		first = atomic_inc_and_test(&page->_mapcount);
> +	}
> +
>  	if (first) {
> -		bool compound = flags & RMAP_COMPOUND;
>  		int nr = compound ? hpage_nr_pages(page) : 1;
>  		/*
>  		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because

...

> @@ -1032,10 +1052,19 @@ void page_add_new_anon_rmap(struct page *page,
>  
>  	VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
>  	SetPageSwapBacked(page);
> -	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
>  	if (compound) {
> +		atomic_t *compound_mapcount;
> +
>  		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> +		compound_mapcount = (atomic_t *)&page[1].mapping;

You can use compound_mapcount_ptr() here.

Thanks,
Naoya Horiguchi

> +		/* increment count (starts at -1) */
> +		atomic_set(compound_mapcount, 0);
>  		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
> +	} else {
> +		/* Anon THP always mapped first with PMD */
> +		VM_BUG_ON_PAGE(PageTransCompound(page), page);
> +		/* increment count (starts at -1) */
> +		atomic_set(&page->_mapcount, 0);
>  	}
>  	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
>  	__page_set_anon_rmap(page, vma, address, 1);

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 06/19] mm: store mapcount for compound page separate
  2014-11-18  8:43   ` Naoya Horiguchi
@ 2014-11-18  9:58     ` Kirill A. Shutemov
  2014-11-18 23:41       ` Naoya Horiguchi
  2014-11-21  6:41       ` Aneesh Kumar K.V
  0 siblings, 2 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-18  9:58 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Steve Capper, Aneesh Kumar K.V,
	Johannes Weiner, Michal Hocko, linux-kernel, linux-mm

On Tue, Nov 18, 2014 at 08:43:00AM +0000, Naoya Horiguchi wrote:
> > @@ -1837,6 +1839,9 @@ static void __split_huge_page_refcount(struct page *page,
> >  	atomic_sub(tail_count, &page->_count);
> >  	BUG_ON(atomic_read(&page->_count) <= 0);
> >  
> > +	page->_mapcount = *compound_mapcount_ptr(page);
> 
> Is atomic_set() necessary?

Do you mean
	atomic_set(&page->_mapcount, atomic_read(compound_mapcount_ptr(page)));
?

I don't see why we would need this. Simple assignment should work just
fine. Or do we have archs on which this would break?

> > @@ -6632,10 +6637,12 @@ static void dump_page_flags(unsigned long flags)
> >  void dump_page_badflags(struct page *page, const char *reason,
> >  		unsigned long badflags)
> >  {
> > -	printk(KERN_ALERT
> > -	       "page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
> > +	pr_alert("page:%p count:%d mapcount:%d mapping:%p index:%#lx",
> >  		page, atomic_read(&page->_count), page_mapcount(page),
> >  		page->mapping, page->index);
> > +	if (PageCompound(page))
> 
> > +		printk(" compound_mapcount: %d", compound_mapcount(page));
> > +	printk("\n");
> 
> These two printk() should be pr_alert(), too?

No. It will split the line into several messages in dmesg.

> > @@ -986,9 +986,30 @@ void page_add_anon_rmap(struct page *page,
> >  void do_page_add_anon_rmap(struct page *page,
> >  	struct vm_area_struct *vma, unsigned long address, int flags)
> >  {
> > -	int first = atomic_inc_and_test(&page->_mapcount);
> > +	bool compound = flags & RMAP_COMPOUND;
> > +	bool first;
> > +
> > +	VM_BUG_ON_PAGE(!PageLocked(compound_head(page)), page);
> > +
> > +	if (PageTransCompound(page)) {
> > +		struct page *head_page = compound_head(page);
> > +
> > +		if (compound) {
> > +			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> > +			first = atomic_inc_and_test(compound_mapcount_ptr(page));
> 
> Is compound_mapcount_ptr() well-defined for tail pages?

The page is a head page, otherwise the VM_BUG_ON on the line above would trigger.

> > @@ -1032,10 +1052,19 @@ void page_add_new_anon_rmap(struct page *page,
> >  
> >  	VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> >  	SetPageSwapBacked(page);
> > -	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
> >  	if (compound) {
> > +		atomic_t *compound_mapcount;
> > +
> >  		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> > +		compound_mapcount = (atomic_t *)&page[1].mapping;
> 
> You can use compound_mapcount_ptr() here.

Right, thanks.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 06/19] mm: store mapcount for compound page separate
  2014-11-18  9:58     ` Kirill A. Shutemov
@ 2014-11-18 23:41       ` Naoya Horiguchi
  2014-11-19  0:54         ` Kirill A. Shutemov
  2014-11-21  6:41       ` Aneesh Kumar K.V
  1 sibling, 1 reply; 41+ messages in thread
From: Naoya Horiguchi @ 2014-11-18 23:41 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Steve Capper, Aneesh Kumar K.V,
	Johannes Weiner, Michal Hocko, linux-kernel, linux-mm

On Tue, Nov 18, 2014 at 11:58:11AM +0200, Kirill A. Shutemov wrote:
> On Tue, Nov 18, 2014 at 08:43:00AM +0000, Naoya Horiguchi wrote:
> > > @@ -1837,6 +1839,9 @@ static void __split_huge_page_refcount(struct page *page,
> > >  	atomic_sub(tail_count, &page->_count);
> > >  	BUG_ON(atomic_read(&page->_count) <= 0);
> > >  
> > > +	page->_mapcount = *compound_mapcount_ptr(page);
> > 
> > Is atomic_set() necessary?
> 
> Do you mean
> 	atomic_set(&page->_mapcount, atomic_read(compound_mapcount_ptr(page)));
> ?
> 
> I don't see why we would need this. Simple assignment should work just
> fine. Or do we have archs on which this would break?

Sorry, I was wrong, please ignore this comment.

> > > @@ -6632,10 +6637,12 @@ static void dump_page_flags(unsigned long flags)
> > >  void dump_page_badflags(struct page *page, const char *reason,
> > >  		unsigned long badflags)
> > >  {
> > > -	printk(KERN_ALERT
> > > -	       "page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
> > > +	pr_alert("page:%p count:%d mapcount:%d mapping:%p index:%#lx",
> > >  		page, atomic_read(&page->_count), page_mapcount(page),
> > >  		page->mapping, page->index);
> > > +	if (PageCompound(page))
> > 
> > > +		printk(" compound_mapcount: %d", compound_mapcount(page));
> > > +	printk("\n");
> > 
> > These two printk() should be pr_alert(), too?
> 
> No. It will split the line into several messages in dmesg.

This splitting is fine. I meant that these printk()s are for one series
of messages, so setting the same log level looks reasonable to me.

> > > @@ -986,9 +986,30 @@ void page_add_anon_rmap(struct page *page,
> > >  void do_page_add_anon_rmap(struct page *page,
> > >  	struct vm_area_struct *vma, unsigned long address, int flags)
> > >  {
> > > -	int first = atomic_inc_and_test(&page->_mapcount);
> > > +	bool compound = flags & RMAP_COMPOUND;
> > > +	bool first;
> > > +
> > > +	VM_BUG_ON_PAGE(!PageLocked(compound_head(page)), page);
> > > +
> > > +	if (PageTransCompound(page)) {
> > > +		struct page *head_page = compound_head(page);
> > > +
> > > +		if (compound) {
> > > +			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> > > +			first = atomic_inc_and_test(compound_mapcount_ptr(page));
> > 
> > Is compound_mapcount_ptr() well-defined for tail pages?
> 
> The page is a head page, otherwise the VM_BUG_ON on the line above would trigger.

Ah, OK.

Thanks,
Naoya Horiguchi

> > > @@ -1032,10 +1052,19 @@ void page_add_new_anon_rmap(struct page *page,
> > >  
> > >  	VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> > >  	SetPageSwapBacked(page);
> > > -	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
> > >  	if (compound) {
> > > +		atomic_t *compound_mapcount;
> > > +
> > >  		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> > > +		compound_mapcount = (atomic_t *)&page[1].mapping;
> > 
> > You can use compound_mapcount_ptr() here.
> 
> Right, thanks.
> 
> -- 
>  Kirill A. Shutemov
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 06/19] mm: store mapcount for compound page separate
  2014-11-18 23:41       ` Naoya Horiguchi
@ 2014-11-19  0:54         ` Kirill A. Shutemov
  0 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-19  0:54 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Steve Capper, Aneesh Kumar K.V,
	Johannes Weiner, Michal Hocko, linux-kernel, linux-mm

On Tue, Nov 18, 2014 at 11:41:08PM +0000, Naoya Horiguchi wrote:
> > > > @@ -6632,10 +6637,12 @@ static void dump_page_flags(unsigned long flags)
> > > >  void dump_page_badflags(struct page *page, const char *reason,
> > > >  		unsigned long badflags)
> > > >  {
> > > > -	printk(KERN_ALERT
> > > > -	       "page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
> > > > +	pr_alert("page:%p count:%d mapcount:%d mapping:%p index:%#lx",
> > > >  		page, atomic_read(&page->_count), page_mapcount(page),
> > > >  		page->mapping, page->index);
> > > > +	if (PageCompound(page))
> > > 
> > > > +		printk(" compound_mapcount: %d", compound_mapcount(page));
> > > > +	printk("\n");
> > > 
> > > These two printk() should be pr_alert(), too?
> > 
> > No. It will split the line into several messages in dmesg.
> 
> This splitting is fine. I meant that these printk()s are for one series
> of messages, so setting the same log level looks reasonable to me.

Hm. It seems what I really need to use there is pr_cont(). I didn't know
it existed. Thanks for the hint ;)
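
For reference, a sketch of how that might look in dump_page_badflags()
(assuming pr_cont() is otherwise a drop-in for the bare printk()s above):

	pr_alert("page:%p count:%d mapcount:%d mapping:%p index:%#lx",
		page, atomic_read(&page->_count), page_mapcount(page),
		page->mapping, page->index);
	if (PageCompound(page))
		/* continue the same line instead of starting a new record */
		pr_cont(" compound_mapcount: %d", compound_mapcount(page));
	pr_cont("\n");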

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 10/19] thp: PMD splitting without splitting compound page
  2014-11-05 14:49 ` [PATCH 10/19] thp: PMD splitting without splitting compound page Kirill A. Shutemov
@ 2014-11-19  6:57   ` Naoya Horiguchi
  2014-11-19 13:02     ` Kirill A. Shutemov
  0 siblings, 1 reply; 41+ messages in thread
From: Naoya Horiguchi @ 2014-11-19  6:57 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Andrea Arcangeli, Dave Hansen, Hugh Dickins,
	Mel Gorman, Rik van Riel, Vlastimil Babka, Christoph Lameter,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm

On Wed, Nov 05, 2014 at 04:49:45PM +0200, Kirill A. Shutemov wrote:
> Current split_huge_page() combines two operations: splitting PMDs into
> tables of PTEs and splitting underlying compound page. This patch
> changes split_huge_pmd() implementation to split the given PMD without
> splitting other PMDs this page mapped with or underlying compound page.
> 
> In order to do this we have to get rid of tail page refcounting, which
> uses _mapcount of tail pages. Tail page refcounting is needed to be able
> to split THP page at any point: we always know which of tail pages is
> pinned (i.e. by get_user_pages()) and can distribute page count
> correctly.
> 
> We can avoid this by allowing split_huge_page() to fail if the compound
> page is pinned. This patch removes all infrastructure for tail page
> refcounting and make split_huge_page() to always return -EBUSY. All
> split_huge_page() users already know how to handle its fail. Proper
> implementation will be added later.
> 
> Without tail page refcounting, implementation of split_huge_pmd() is
> pretty straight-forward.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
...

> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 7e70ae968e5f..e4ba17694b6b 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -1022,7 +1022,6 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>  {
>  	unsigned long mask;
>  	unsigned long pte_end;
> -	struct page *head, *page, *tail;
>  	pte_t pte;
>  	int refs;
>  

This breaks the build on powerpc, so you need to keep *head and *page as
you do for other architectures.
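
Presumably the fixup is just to keep the declarations gup_hugepte() still
uses and drop only the now-unused one, something like (a sketch, not the
posted change):

	-	struct page *head, *page, *tail;
	+	struct page *head, *page;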

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 16/19] thp: update documentation
  2014-11-05 14:49 ` [PATCH 16/19] thp: update documentation Kirill A. Shutemov
@ 2014-11-19  8:07   ` Naoya Horiguchi
  2014-11-19 13:11     ` Kirill A. Shutemov
  0 siblings, 1 reply; 41+ messages in thread
From: Naoya Horiguchi @ 2014-11-19  8:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Andrea Arcangeli, Dave Hansen, Hugh Dickins,
	Mel Gorman, Rik van Riel, Vlastimil Babka, Christoph Lameter,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm

On Wed, Nov 05, 2014 at 04:49:51PM +0200, Kirill A. Shutemov wrote:
> The patch updates Documentation/vm/transhuge.txt to reflect changes in
> THP design.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  Documentation/vm/transhuge.txt | 84 +++++++++++++++++++-----------------------
>  1 file changed, 38 insertions(+), 46 deletions(-)
> 
> diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> index df1794a9071f..33465e7b0d9b 100644
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -200,9 +200,18 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range
>  	of pages that should be collapsed into one huge page but failed
>  	the allocation.
>  
> -thp_split is incremented every time a huge page is split into base
> +thp_split_page is incremented every time a huge page is split into base
>  	pages. This can happen for a variety of reasons but a common
>  	reason is that a huge page is old and is being reclaimed.
> +	This action implies splitting all PMD the page mapped with.
> +
> +thp_split_page_failed is is incremented if kernel fails to split huge

'is' appears twice.

> +	page. This can happen if the page was pinned by somebody.
> +
> +thp_split_pmd is incremented every time a PMD split into table of PTEs.
> +	This can happen, for instance, when application calls mprotect() or
> +	munmap() on part of huge page. It doesn't split huge page, only
> +	page table entry.
>  
>  thp_zero_page_alloc is incremented every time a huge zero page is
>  	successfully allocated. It includes allocations which where

There is a sentence in the "get_user_pages and follow_page" section related
to the futex-code adjustment you just removed in patch 15/19.

  ...
  split_huge_page() to avoid the head and tail pages to disappear from
  under it, see the futex code to see an example of that, hugetlbfs also
  needed special handling in futex code for similar reasons).

This seems obsolete, so do we need some change here?

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 17/19] mlock, thp: HACK: split all pages in VM_LOCKED vma
  2014-11-05 14:49 ` [PATCH 17/19] mlock, thp: HACK: split all pages in VM_LOCKED vma Kirill A. Shutemov
@ 2014-11-19  9:02   ` Naoya Horiguchi
  2014-11-19 13:08     ` Kirill A. Shutemov
  0 siblings, 1 reply; 41+ messages in thread
From: Naoya Horiguchi @ 2014-11-19  9:02 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Andrea Arcangeli, Dave Hansen, Hugh Dickins,
	Mel Gorman, Rik van Riel, Vlastimil Babka, Christoph Lameter,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm

On Wed, Nov 05, 2014 at 04:49:52PM +0200, Kirill A. Shutemov wrote:
> We don't yet handle mlocked pages properly with new THP refcounting.
> For now we split all pages in VMA on mlock and disallow khugepaged
> collapse pages in the VMA. If split failed on mlock() we fail the
> syscall with -EBUSY.
> ---
...

> @@ -542,6 +530,60 @@ next:
>  	}
>  }
>  
> +static int thp_split(pmd_t *pmd, unsigned long addr, unsigned long end,
> +		struct mm_walk *walk)
> +{
> +	spinlock_t *ptl;
> +	struct page *page = NULL;
> +	pte_t *pte;
> +	int err = 0;
> +
> +retry:
> +	if (pmd_none(*pmd))
> +		return 0;
> +	if (pmd_trans_huge(*pmd)) {
> +		if (is_huge_zero_pmd(*pmd)) {
> +			split_huge_pmd(walk->vma, pmd, addr);
> +			return 0;
> +		}
> +		ptl = pmd_lock(walk->mm, pmd);
> +		if (!pmd_trans_huge(*pmd)) {
> +			spin_unlock(ptl);
> +			goto retry;
> +		}
> +		page = pmd_page(*pmd);
> +		VM_BUG_ON_PAGE(!PageHead(page), page);
> +		get_page(page);
> +		spin_unlock(ptl);
> +		err = split_huge_page(page);
> +		put_page(page);
> +		return err;
> +	}
> +	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> +	do {
> +		if (!pte_present(*pte))
> +			continue;
> +		page = vm_normal_page(walk->vma, addr, *pte);
> +		if (!page)
> +			continue;
> +		if (PageTransCompound(page)) {
> +			page = compound_head(page);
> +			get_page(page);
> +			spin_unlock(ptl);
> +			err = split_huge_page(page);
> +			spin_lock(ptl);
> +			put_page(page);
> +			if (!err) {
> +				VM_BUG_ON_PAGE(compound_mapcount(page), page);
> +				VM_BUG_ON_PAGE(PageTransCompound(page), page);

If split_huge_page() succeeded, we don't have to continue the iteration,
so should we break out of this loop here?
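
I.e. something along these lines (a sketch of the suggested change, not
part of the posted patch):

	if (!err) {
		VM_BUG_ON_PAGE(compound_mapcount(page), page);
		VM_BUG_ON_PAGE(PageTransCompound(page), page);
	}
	/* split done (or failed): either way stop walking this range */
	break;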

Thanks,
Naoya Horiguchi

> +			} else
> +				break;
> +		}
> +	} while (pte++, addr += PAGE_SIZE, addr != end);
> +	pte_unmap_unlock(pte - 1, ptl);
> +	return err;
> +}
> +
>  /*
>   * mlock_fixup  - handle mlock[all]/munlock[all] requests.
>   *

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 06/19] mm: store mapcount for compound page separate
  2014-11-05 14:49 ` [PATCH 06/19] mm: store mapcount for compound page separate Kirill A. Shutemov
  2014-11-18  8:43   ` Naoya Horiguchi
@ 2014-11-19 10:51   ` Jerome Marchand
  2014-11-19 13:00     ` Kirill A. Shutemov
  2014-11-21  6:12   ` Aneesh Kumar K.V
  2 siblings, 1 reply; 41+ messages in thread
From: Jerome Marchand @ 2014-11-19 10:51 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm

On 11/05/2014 03:49 PM, Kirill A. Shutemov wrote:
> We're going to allow mapping of individual 4k pages of THP compound and
> we need a cheap way to find out how many time the compound page is
> mapped with PMD -- compound_mapcount() does this.
> 
> page_mapcount() counts both: PTE and PMD mappings of the page.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/linux/mm.h   | 17 +++++++++++++++--
>  include/linux/rmap.h |  4 ++--
>  mm/huge_memory.c     | 23 ++++++++++++++---------
>  mm/hugetlb.c         |  4 ++--
>  mm/memory.c          |  2 +-
>  mm/migrate.c         |  2 +-
>  mm/page_alloc.c      | 13 ++++++++++---
>  mm/rmap.c            | 50 +++++++++++++++++++++++++++++++++++++++++++-------
>  8 files changed, 88 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1825c468f158..aef03acff228 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -435,6 +435,19 @@ static inline struct page *compound_head(struct page *page)
>  	return page;
>  }
>  
> +static inline atomic_t *compound_mapcount_ptr(struct page *page)
> +{
> +	return (atomic_t *)&page[1].mapping;
> +}

IIUC your patch overloads the unused mapping field of the first tail
page to store the PMD mapcount. That's a non-obvious trick. Why not make
it more explicit by adding a new field (say compound_mapcount - and the
appropriate comment, of course) to the union to which mapping already belongs?
The patch description would benefit from more explanation too.
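
Concretely, the alternative might look something like this (just a sketch
of the suggested struct page layout, going by the union that mapping
currently lives in):

	struct page {
		/* ... */
		union {
			struct address_space *mapping;
			void *s_mem;			/* slab first object */
			atomic_t compound_mapcount;	/* first tail page of THP */
		};
		/* ... */
	};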

Jerome

> +
> +static inline int compound_mapcount(struct page *page)
> +{
> +	if (!PageCompound(page))
> +		return 0;
> +	page = compound_head(page);
> +	return atomic_read(compound_mapcount_ptr(page)) + 1;
> +}
> +
>  /*
>   * The atomic page->_mapcount, starts from -1: so that transitions
>   * both from it and to it can be tracked, using atomic_inc_and_test
> @@ -447,7 +460,7 @@ static inline void page_mapcount_reset(struct page *page)
>  
>  static inline int page_mapcount(struct page *page)
>  {
> -	return atomic_read(&(page)->_mapcount) + 1;
> +	return atomic_read(&(page)->_mapcount) + compound_mapcount(page) + 1;
>  }
>  
>  static inline int page_count(struct page *page)
> @@ -1017,7 +1030,7 @@ static inline pgoff_t page_file_index(struct page *page)
>   */
>  static inline int page_mapped(struct page *page)
>  {
> -	return atomic_read(&(page)->_mapcount) >= 0;
> +	return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
>  }
>  
>  /*
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index ef09ca48c789..a9499ad8c037 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -180,9 +180,9 @@ void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
>  void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
>  				unsigned long);
>  
> -static inline void page_dup_rmap(struct page *page)
> +static inline void page_dup_rmap(struct page *page, bool compound)
>  {
> -	atomic_inc(&page->_mapcount);
> +	atomic_inc(compound ? compound_mapcount_ptr(page) : &page->_mapcount);
>  }
>  
>  /*
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9c53800c4eea..869f9bcf481e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -904,7 +904,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	src_page = pmd_page(pmd);
>  	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
>  	get_page(src_page);
> -	page_dup_rmap(src_page);
> +	page_dup_rmap(src_page, true);
>  	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>  
>  	pmdp_set_wrprotect(src_mm, addr, src_pmd);
> @@ -1763,8 +1763,8 @@ static void __split_huge_page_refcount(struct page *page,
>  		struct page *page_tail = page + i;
>  
>  		/* tail_page->_mapcount cannot change */
> -		BUG_ON(page_mapcount(page_tail) < 0);
> -		tail_count += page_mapcount(page_tail);
> +		BUG_ON(atomic_read(&page_tail->_mapcount) + 1 < 0);
> +		tail_count += atomic_read(&page_tail->_mapcount) + 1;
>  		/* check for overflow */
>  		BUG_ON(tail_count < 0);
>  		BUG_ON(atomic_read(&page_tail->_count) != 0);
> @@ -1781,8 +1781,7 @@ static void __split_huge_page_refcount(struct page *page,
>  		 * atomic_set() here would be safe on all archs (and
>  		 * not only on x86), it's safer to use atomic_add().
>  		 */
> -		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
> -			   &page_tail->_count);
> +		atomic_add(page_mapcount(page_tail) + 1, &page_tail->_count);
>  
>  		/* after clearing PageTail the gup refcount can be released */
>  		smp_mb__after_atomic();
> @@ -1819,15 +1818,18 @@ static void __split_huge_page_refcount(struct page *page,
>  		 * status is achieved setting a reserved bit in the
>  		 * pmd, not by clearing the present bit.
>  		*/
> -		page_tail->_mapcount = page->_mapcount;
> +		atomic_set(&page_tail->_mapcount, compound_mapcount(page) - 1);
>  
> -		BUG_ON(page_tail->mapping);
> -		page_tail->mapping = page->mapping;
> +		/* ->mapping in first tail page is compound_mapcount */
> +		if (i != 1) {
> +			BUG_ON(page_tail->mapping);
> +			page_tail->mapping = page->mapping;
> +			BUG_ON(!PageAnon(page_tail));
> +		}
>  
>  		page_tail->index = page->index + i;
>  		page_cpupid_xchg_last(page_tail, page_cpupid_last(page));
>  
> -		BUG_ON(!PageAnon(page_tail));
>  		BUG_ON(!PageUptodate(page_tail));
>  		BUG_ON(!PageDirty(page_tail));
>  		BUG_ON(!PageSwapBacked(page_tail));
> @@ -1837,6 +1839,9 @@ static void __split_huge_page_refcount(struct page *page,
>  	atomic_sub(tail_count, &page->_count);
>  	BUG_ON(atomic_read(&page->_count) <= 0);
>  
> +	page->_mapcount = *compound_mapcount_ptr(page);
> +	page[1].mapping = page->mapping;
> +
>  	__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
>  
>  	ClearPageCompound(page);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index dad8e0732922..445db64a8b08 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2603,7 +2603,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  			entry = huge_ptep_get(src_pte);
>  			ptepage = pte_page(entry);
>  			get_page(ptepage);
> -			page_dup_rmap(ptepage);
> +			page_dup_rmap(ptepage, true);
>  			set_huge_pte_at(dst, addr, dst_pte, entry);
>  		}
>  		spin_unlock(src_ptl);
> @@ -3058,7 +3058,7 @@ retry:
>  		ClearPagePrivate(page);
>  		hugepage_add_new_anon_rmap(page, vma, address);
>  	} else
> -		page_dup_rmap(page);
> +		page_dup_rmap(page, true);
>  	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
>  				&& (vma->vm_flags & VM_SHARED)));
>  	set_huge_pte_at(mm, address, ptep, new_pte);
> diff --git a/mm/memory.c b/mm/memory.c
> index 6f84c8a51cc0..1b17a72dc93f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -872,7 +872,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	page = vm_normal_page(vma, addr, pte);
>  	if (page) {
>  		get_page(page);
> -		page_dup_rmap(page);
> +		page_dup_rmap(page, false);
>  		if (PageAnon(page))
>  			rss[MM_ANONPAGES]++;
>  		else
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6b9413df1661..f1a12ced2531 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -161,7 +161,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
>  		if (PageAnon(new))
>  			hugepage_add_anon_rmap(new, vma, addr);
>  		else
> -			page_dup_rmap(new);
> +			page_dup_rmap(new, false);
>  	} else if (PageAnon(new))
>  		page_add_anon_rmap(new, vma, addr, false);
>  	else
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d0e3d2fee585..b19d1e69ca12 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -369,6 +369,7 @@ void prep_compound_page(struct page *page, unsigned long order)
>  
>  	set_compound_page_dtor(page, free_compound_page);
>  	set_compound_order(page, order);
> +	atomic_set(compound_mapcount_ptr(page), -1);
>  	__SetPageHead(page);
>  	for (i = 1; i < nr_pages; i++) {
>  		struct page *p = page + i;
> @@ -643,7 +644,9 @@ static inline int free_pages_check(struct page *page)
>  
>  	if (unlikely(page_mapcount(page)))
>  		bad_reason = "nonzero mapcount";
> -	if (unlikely(page->mapping != NULL))
> +	if (unlikely(compound_mapcount(page)))
> +		bad_reason = "nonzero compound_mapcount";
> +	if (unlikely(page->mapping != NULL) && !PageTail(page))
>  		bad_reason = "non-NULL mapping";
>  	if (unlikely(atomic_read(&page->_count) != 0))
>  		bad_reason = "nonzero _count";
> @@ -760,6 +763,8 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
>  		bad += free_pages_check(page + i);
>  	if (bad)
>  		return false;
> +	if (order)
> +		page[1].mapping = NULL;
>  
>  	if (!PageHighMem(page)) {
>  		debug_check_no_locks_freed(page_address(page),
> @@ -6632,10 +6637,12 @@ static void dump_page_flags(unsigned long flags)
>  void dump_page_badflags(struct page *page, const char *reason,
>  		unsigned long badflags)
>  {
> -	printk(KERN_ALERT
> -	       "page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
> +	pr_alert("page:%p count:%d mapcount:%d mapping:%p index:%#lx",
>  		page, atomic_read(&page->_count), page_mapcount(page),
>  		page->mapping, page->index);
> +	if (PageCompound(page))
> +		printk(" compound_mapcount: %d", compound_mapcount(page));
> +	printk("\n");
>  	dump_page_flags(page->flags);
>  	if (reason)
>  		pr_alert("page dumped because: %s\n", reason);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f706a6af1801..eecc9301847d 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -986,9 +986,30 @@ void page_add_anon_rmap(struct page *page,
>  void do_page_add_anon_rmap(struct page *page,
>  	struct vm_area_struct *vma, unsigned long address, int flags)
>  {
> -	int first = atomic_inc_and_test(&page->_mapcount);
> +	bool compound = flags & RMAP_COMPOUND;
> +	bool first;
> +
> +	VM_BUG_ON_PAGE(!PageLocked(compound_head(page)), page);
> +
> +	if (PageTransCompound(page)) {
> +		struct page *head_page = compound_head(page);
> +
> +		if (compound) {
> +			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> +			first = atomic_inc_and_test(compound_mapcount_ptr(page));
> +		} else {
> +			/* Anon THP always mapped first with PMD */
> +			first = 0;
> +			VM_BUG_ON_PAGE(!compound_mapcount(head_page),
> +					head_page);
> +			atomic_inc(&page->_mapcount);
> +		}
> +	} else {
> +		VM_BUG_ON_PAGE(compound, page);
> +		first = atomic_inc_and_test(&page->_mapcount);
> +	}
> +
>  	if (first) {
> -		bool compound = flags & RMAP_COMPOUND;
>  		int nr = compound ? hpage_nr_pages(page) : 1;
>  		/*
>  		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
> @@ -1006,7 +1027,6 @@ void do_page_add_anon_rmap(struct page *page,
>  	if (unlikely(PageKsm(page)))
>  		return;
>  
> -	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  	/* address might be in next vma when migration races vma_adjust */
>  	if (first)
>  		__page_set_anon_rmap(page, vma, address,
> @@ -1032,10 +1052,19 @@ void page_add_new_anon_rmap(struct page *page,
>  
>  	VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
>  	SetPageSwapBacked(page);
> -	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
>  	if (compound) {
> +		atomic_t *compound_mapcount;
> +
>  		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> +		compound_mapcount = (atomic_t *)&page[1].mapping;
> +		/* increment count (starts at -1) */
> +		atomic_set(compound_mapcount, 0);
>  		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
> +	} else {
> +		/* Anon THP always mapped first with PMD */
> +		VM_BUG_ON_PAGE(PageTransCompound(page), page);
> +		/* increment count (starts at -1) */
> +		atomic_set(&page->_mapcount, 0);
>  	}
>  	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
>  	__page_set_anon_rmap(page, vma, address, 1);
> @@ -1081,7 +1110,9 @@ void page_remove_rmap(struct page *page, bool compound)
>  		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  
>  	/* page still mapped by someone else? */
> -	if (!atomic_add_negative(-1, &page->_mapcount))
> +	if (!atomic_add_negative(-1, compound ?
> +				compound_mapcount_ptr(page) :
> +				&page->_mapcount))
>  		goto out;
>  
>  	/*
> @@ -1098,9 +1129,14 @@ void page_remove_rmap(struct page *page, bool compound)
>  	if (anon) {
>  		int nr = compound ? hpage_nr_pages(page) : 1;
>  		if (compound) {
> +			int i;
>  			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
>  			__dec_zone_page_state(page,
>  					NR_ANON_TRANSPARENT_HUGEPAGES);
> +			/* The page can be mapped with ptes */
> +			for (i = 0; i < HPAGE_PMD_NR; i++)
> +				if (page_mapcount(page + i))
> +					nr--;
>  		}
>  		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
>  	} else {
> @@ -1749,7 +1785,7 @@ void hugepage_add_anon_rmap(struct page *page,
>  	BUG_ON(!PageLocked(page));
>  	BUG_ON(!anon_vma);
>  	/* address might be in next vma when migration races vma_adjust */
> -	first = atomic_inc_and_test(&page->_mapcount);
> +	first = atomic_inc_and_test(compound_mapcount_ptr(page));
>  	if (first)
>  		__hugepage_set_anon_rmap(page, vma, address, 0);
>  }
> @@ -1758,7 +1794,7 @@ void hugepage_add_new_anon_rmap(struct page *page,
>  			struct vm_area_struct *vma, unsigned long address)
>  {
>  	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> -	atomic_set(&page->_mapcount, 0);
> +	atomic_set(compound_mapcount_ptr(page), 0);
>  	__hugepage_set_anon_rmap(page, vma, address, 1);
>  }
>  #endif /* CONFIG_HUGETLB_PAGE */
> 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 06/19] mm: store mapcount for compound page separate
  2014-11-19 10:51   ` Jerome Marchand
@ 2014-11-19 13:00     ` Kirill A. Shutemov
  2014-11-19 13:15       ` Jerome Marchand
  2014-11-20 20:06       ` Christoph Lameter
  0 siblings, 2 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-19 13:00 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, linux-kernel,
	linux-mm

On Wed, Nov 19, 2014 at 11:51:09AM +0100, Jerome Marchand wrote:
> On 11/05/2014 03:49 PM, Kirill A. Shutemov wrote:
> > We're going to allow mapping of individual 4k pages of THP compound and
> > we need a cheap way to find out how many time the compound page is
> > mapped with PMD -- compound_mapcount() does this.
> > 
> > page_mapcount() counts both: PTE and PMD mappings of the page.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  include/linux/mm.h   | 17 +++++++++++++++--
> >  include/linux/rmap.h |  4 ++--
> >  mm/huge_memory.c     | 23 ++++++++++++++---------
> >  mm/hugetlb.c         |  4 ++--
> >  mm/memory.c          |  2 +-
> >  mm/migrate.c         |  2 +-
> >  mm/page_alloc.c      | 13 ++++++++++---
> >  mm/rmap.c            | 50 +++++++++++++++++++++++++++++++++++++++++++-------
> >  8 files changed, 88 insertions(+), 27 deletions(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 1825c468f158..aef03acff228 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -435,6 +435,19 @@ static inline struct page *compound_head(struct page *page)
> >  	return page;
> >  }
> >  
> > +static inline atomic_t *compound_mapcount_ptr(struct page *page)
> > +{
> > +	return (atomic_t *)&page[1].mapping;
> > +}
> 
> IIUC your patch overloads the unused mapping field of the first tail
> page to store the PMD mapcount. That's a non-obvious trick. Why not make
> it more explicit by adding a new field (say compound_mapcount - and the
> appropriate comment, of course) to the union to which mapping already belongs?

I don't think we want to bloat the struct page description: nobody outside
of the helpers should use it directly. And it's exactly what we already did
to store the compound page destructor and compound page order.
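
For comparison, this is roughly how the existing helpers already pack
compound metadata into the first tail page (as of this kernel), and
compound_mapcount_ptr() just follows the same pattern with page[1].mapping:

	static inline void set_compound_page_dtor(struct page *page,
						  compound_page_dtor *dtor)
	{
		page[1].lru.next = (void *)dtor;	/* reuse first tail page */
	}

	static inline void set_compound_order(struct page *page, unsigned long order)
	{
		page[1].lru.prev = (void *)order;
	}

	static inline atomic_t *compound_mapcount_ptr(struct page *page)
	{
		return (atomic_t *)&page[1].mapping;
	}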

> The patch description would benefit from more explanation too.

Agreed.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 10/19] thp: PMD splitting without splitting compound page
  2014-11-19  6:57   ` Naoya Horiguchi
@ 2014-11-19 13:02     ` Kirill A. Shutemov
  0 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-19 13:02 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Steve Capper, Aneesh Kumar K.V,
	Johannes Weiner, Michal Hocko, linux-kernel, linux-mm

On Wed, Nov 19, 2014 at 06:57:47AM +0000, Naoya Horiguchi wrote:
> On Wed, Nov 05, 2014 at 04:49:45PM +0200, Kirill A. Shutemov wrote:
> > Current split_huge_page() combines two operations: splitting PMDs into
> > tables of PTEs and splitting underlying compound page. This patch
> > changes split_huge_pmd() implementation to split the given PMD without
> > splitting other PMDs this page mapped with or underlying compound page.
> > 
> > In order to do this we have to get rid of tail page refcounting, which
> > uses _mapcount of tail pages. Tail page refcounting is needed to be able
> > to split THP page at any point: we always know which of tail pages is
> > pinned (i.e. by get_user_pages()) and can distribute page count
> > correctly.
> > 
> > We can avoid this by allowing split_huge_page() to fail if the compound
> > page is pinned. This patch removes all infrastructure for tail page
> > refcounting and make split_huge_page() to always return -EBUSY. All
> > split_huge_page() users already know how to handle its fail. Proper
> > implementation will be added later.
> > 
> > Without tail page refcounting, implementation of split_huge_pmd() is
> > pretty straight-forward.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> ...
> 
> > diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> > index 7e70ae968e5f..e4ba17694b6b 100644
> > --- a/arch/powerpc/mm/hugetlbpage.c
> > +++ b/arch/powerpc/mm/hugetlbpage.c
> > @@ -1022,7 +1022,6 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
> >  {
> >  	unsigned long mask;
> >  	unsigned long pte_end;
> > -	struct page *head, *page, *tail;
> >  	pte_t pte;
> >  	int refs;
> >  
> 
> This breaks build of powerpc, so you need keep *head and *page as
> you do for other architectures.

Yeah. I've already fixed this locally after 0-DAY kernel testing
complained.
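
For the record, the hunk just needs to keep *head and *page and drop only
*tail, i.e. something like:

 	unsigned long mask;
 	unsigned long pte_end;
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 	pte_t pte;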

Thanks.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 17/19] mlock, thp: HACK: split all pages in VM_LOCKED vma
  2014-11-19  9:02   ` Naoya Horiguchi
@ 2014-11-19 13:08     ` Kirill A. Shutemov
  0 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-19 13:08 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Steve Capper, Aneesh Kumar K.V,
	Johannes Weiner, Michal Hocko, linux-kernel, linux-mm

On Wed, Nov 19, 2014 at 09:02:42AM +0000, Naoya Horiguchi wrote:
> On Wed, Nov 05, 2014 at 04:49:52PM +0200, Kirill A. Shutemov wrote:
> > We don't yet handle mlocked pages properly with new THP refcounting.
> > For now we split all pages in VMA on mlock and disallow khugepaged
> > collapse pages in the VMA. If split failed on mlock() we fail the
> > syscall with -EBUSY.
> > ---
> ...
> 
> > @@ -542,6 +530,60 @@ next:
> >  	}
> >  }
> >  
> > +static int thp_split(pmd_t *pmd, unsigned long addr, unsigned long end,
> > +		struct mm_walk *walk)
> > +{
> > +	spinlock_t *ptl;
> > +	struct page *page = NULL;
> > +	pte_t *pte;
> > +	int err = 0;
> > +
> > +retry:
> > +	if (pmd_none(*pmd))
> > +		return 0;
> > +	if (pmd_trans_huge(*pmd)) {
> > +		if (is_huge_zero_pmd(*pmd)) {
> > +			split_huge_pmd(walk->vma, pmd, addr);
> > +			return 0;
> > +		}
> > +		ptl = pmd_lock(walk->mm, pmd);
> > +		if (!pmd_trans_huge(*pmd)) {
> > +			spin_unlock(ptl);
> > +			goto retry;
> > +		}
> > +		page = pmd_page(*pmd);
> > +		VM_BUG_ON_PAGE(!PageHead(page), page);
> > +		get_page(page);
> > +		spin_unlock(ptl);
> > +		err = split_huge_page(page);
> > +		put_page(page);
> > +		return err;
> > +	}
> > +	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> > +	do {
> > +		if (!pte_present(*pte))
> > +			continue;
> > +		page = vm_normal_page(walk->vma, addr, *pte);
> > +		if (!page)
> > +			continue;
> > +		if (PageTransCompound(page)) {
> > +			page = compound_head(page);
> > +			get_page(page);
> > +			spin_unlock(ptl);
> > +			err = split_huge_page(page);
> > +			spin_lock(ptl);
> > +			put_page(page);
> > +			if (!err) {
> > +				VM_BUG_ON_PAGE(compound_mapcount(page), page);
> > +				VM_BUG_ON_PAGE(PageTransCompound(page), page);
> 
> If split_huge_page() succeeded, we don't have to continue the iteration,
> so break this loop here?

We may want to skip to the next huge page region, but the patch is crap
anyway -- don't bother.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 16/19] thp: update documentation
  2014-11-19  8:07   ` Naoya Horiguchi
@ 2014-11-19 13:11     ` Kirill A. Shutemov
  0 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-19 13:11 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Steve Capper, Aneesh Kumar K.V,
	Johannes Weiner, Michal Hocko, linux-kernel, linux-mm

On Wed, Nov 19, 2014 at 08:07:59AM +0000, Naoya Horiguchi wrote:
> On Wed, Nov 05, 2014 at 04:49:51PM +0200, Kirill A. Shutemov wrote:
> > The patch updates Documentation/vm/transhuge.txt to reflect changes in
> > THP design.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  Documentation/vm/transhuge.txt | 84 +++++++++++++++++++-----------------------
> >  1 file changed, 38 insertions(+), 46 deletions(-)
> > 
> > diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> > index df1794a9071f..33465e7b0d9b 100644
> > --- a/Documentation/vm/transhuge.txt
> > +++ b/Documentation/vm/transhuge.txt
> > @@ -200,9 +200,18 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range
> >  	of pages that should be collapsed into one huge page but failed
> >  	the allocation.
> >  
> > -thp_split is incremented every time a huge page is split into base
> > +thp_split_page is incremented every time a huge page is split into base
> >  	pages. This can happen for a variety of reasons but a common
> >  	reason is that a huge page is old and is being reclaimed.
> > +	This action implies splitting all PMD the page mapped with.
> > +
> > +thp_split_page_failed is is incremented if kernel fails to split huge
> 
> 'is' appears twice.
> 
> > +	page. This can happen if the page was pinned by somebody.
> > +
> > +thp_split_pmd is incremented every time a PMD split into table of PTEs.
> > +	This can happen, for instance, when application calls mprotect() or
> > +	munmap() on part of huge page. It doesn't split huge page, only
> > +	page table entry.
> >  
> >  thp_zero_page_alloc is incremented every time a huge zero page is
> >  	successfully allocated. It includes allocations which where
> 
> There is a sentense related to the adjustment on futex code you just
> removed in patch 15/19 in "get_user_pages and follow_page" section.
> 
>   ...
>   split_huge_page() to avoid the head and tail pages to disappear from
>   under it, see the futex code to see an example of that, hugetlbfs also
>   needed special handling in futex code for similar reasons).
> 
> this seems obsolete, so we need some change on this?

I'll update the documentation further once the patchset is closer to a
ready state.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 06/19] mm: store mapcount for compound page separate
  2014-11-19 13:00     ` Kirill A. Shutemov
@ 2014-11-19 13:15       ` Jerome Marchand
  2014-11-20 20:06       ` Christoph Lameter
  1 sibling, 0 replies; 41+ messages in thread
From: Jerome Marchand @ 2014-11-19 13:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, linux-kernel,
	linux-mm

[-- Attachment #1: Type: text/plain, Size: 2234 bytes --]

On 11/19/2014 02:00 PM, Kirill A. Shutemov wrote:
> On Wed, Nov 19, 2014 at 11:51:09AM +0100, Jerome Marchand wrote:
>> On 11/05/2014 03:49 PM, Kirill A. Shutemov wrote:
>>> We're going to allow mapping of individual 4k pages of THP compound and
>>> we need a cheap way to find out how many time the compound page is
>>> mapped with PMD -- compound_mapcount() does this.
>>>
>>> page_mapcount() counts both: PTE and PMD mappings of the page.
>>>
>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> ---
>>>  include/linux/mm.h   | 17 +++++++++++++++--
>>>  include/linux/rmap.h |  4 ++--
>>>  mm/huge_memory.c     | 23 ++++++++++++++---------
>>>  mm/hugetlb.c         |  4 ++--
>>>  mm/memory.c          |  2 +-
>>>  mm/migrate.c         |  2 +-
>>>  mm/page_alloc.c      | 13 ++++++++++---
>>>  mm/rmap.c            | 50 +++++++++++++++++++++++++++++++++++++++++++-------
>>>  8 files changed, 88 insertions(+), 27 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 1825c468f158..aef03acff228 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -435,6 +435,19 @@ static inline struct page *compound_head(struct page *page)
>>>  	return page;
>>>  }
>>>  
>>> +static inline atomic_t *compound_mapcount_ptr(struct page *page)
>>> +{
>>> +	return (atomic_t *)&page[1].mapping;
>>> +}
>>
>> IIUC your patch overloads the unused mapping field of the first tail
>> page to store the PMD mapcount. That's a non obvious trick. Why not make
>> it more explicit by adding a new field (say compound_mapcount - and the
>> appropriate comment of course) to the union to which mapping already belong?
> 
> I don't think we want to bloat struct page description: nobody outside of
> helpers should use it direcly. And it's exactly what we did to store
> compound page destructor and compound page order.

Yes, but hiding it might make people think this field is unused when
it's not. If it has been done that way for a while, maybe it's not as
much trouble as I think it is, but could you at least add a comment to
the helper?
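
Something as simple as this would already help (just a sketch):

	/*
	 * The PMD mapcount of a THP is stored in the otherwise unused
	 * ->mapping field of the first tail page -- the same trick used
	 * for the compound page destructor and compound order.  Nobody
	 * should touch page[1].mapping directly; use the helpers.
	 */
	static inline atomic_t *compound_mapcount_ptr(struct page *page)
	{
		return (atomic_t *)&page[1].mapping;
	}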

> 
>> The patch description would benefit from more explanation too.
> 
> Agreed.
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 06/19] mm: store mapcount for compound page separate
  2014-11-19 13:00     ` Kirill A. Shutemov
  2014-11-19 13:15       ` Jerome Marchand
@ 2014-11-20 20:06       ` Christoph Lameter
  2014-11-21 12:01         ` Kirill A. Shutemov
  1 sibling, 1 reply; 41+ messages in thread
From: Christoph Lameter @ 2014-11-20 20:06 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Jerome Marchand, Kirill A. Shutemov, Andrew Morton,
	Andrea Arcangeli, Dave Hansen, Hugh Dickins, Mel Gorman,
	Rik van Riel, Vlastimil Babka, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, linux-kernel,
	linux-mm

On Wed, 19 Nov 2014, Kirill A. Shutemov wrote:

> I don't think we want to bloat struct page description: nobody outside of
> helpers should use it direcly. And it's exactly what we did to store
> compound page destructor and compound page order.

This is more about describing what overloading is occurring. Either add
the new way of using it there, including a comment explaining things, or
please do not overload the field.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 06/19] mm: store mapcount for compound page separate
  2014-11-05 14:49 ` [PATCH 06/19] mm: store mapcount for compound page separate Kirill A. Shutemov
  2014-11-18  8:43   ` Naoya Horiguchi
  2014-11-19 10:51   ` Jerome Marchand
@ 2014-11-21  6:12   ` Aneesh Kumar K.V
  2014-11-21 12:02     ` Kirill A. Shutemov
  2 siblings, 1 reply; 41+ messages in thread
From: Aneesh Kumar K.V @ 2014-11-21  6:12 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli
  Cc: Dave Hansen, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Johannes Weiner, Michal Hocko, linux-kernel,
	linux-mm, Kirill A. Shutemov

"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:

> We're going to allow mapping of individual 4k pages of THP compound and
> we need a cheap way to find out how many time the compound page is
> mapped with PMD -- compound_mapcount() does this.
>
> page_mapcount() counts both: PTE and PMD mappings of the page.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  include/linux/mm.h   | 17 +++++++++++++++--
>  include/linux/rmap.h |  4 ++--
>  mm/huge_memory.c     | 23 ++++++++++++++---------
>  mm/hugetlb.c         |  4 ++--
>  mm/memory.c          |  2 +-
>  mm/migrate.c         |  2 +-
>  mm/page_alloc.c      | 13 ++++++++++---
>  mm/rmap.c            | 50 +++++++++++++++++++++++++++++++++++++++++++-------
>  8 files changed, 88 insertions(+), 27 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1825c468f158..aef03acff228 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -435,6 +435,19 @@ static inline struct page *compound_head(struct page *page)
>  	return page;
>  }
>  
> +static inline atomic_t *compound_mapcount_ptr(struct page *page)
> +{
> +	return (atomic_t *)&page[1].mapping;
> +}
> +
> +static inline int compound_mapcount(struct page *page)
> +{
> +	if (!PageCompound(page))
> +		return 0;
> +	page = compound_head(page);
> +	return atomic_read(compound_mapcount_ptr(page)) + 1;
> +}


How about 

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e0b286649f1..59c9cf3d8510 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -46,6 +46,11 @@ struct page {
 	unsigned long flags;		/* Atomic flags, some possibly
 					 * updated asynchronously */
 	union {
+		/*
+		  * For THP we use this to track the compound
+		  * page mapcount.
+		  */
+		atomic_t _compound_mapcount;
 		struct address_space *mapping;	/* If low bit clear, points to
 						 * inode address_space, or NULL.
 						 * If page mapped as anonymous

and 

static inline atomic_t *compound_mapcount_ptr(struct page *page)
{
        return (atomic_t *)&page[1]._compound_mapcount;
}



-aneesh


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 06/19] mm: store mapcount for compound page separate
  2014-11-18  9:58     ` Kirill A. Shutemov
  2014-11-18 23:41       ` Naoya Horiguchi
@ 2014-11-21  6:41       ` Aneesh Kumar K.V
  2014-11-21 11:47         ` Kirill A. Shutemov
  1 sibling, 1 reply; 41+ messages in thread
From: Aneesh Kumar K.V @ 2014-11-21  6:41 UTC (permalink / raw)
  To: Kirill A. Shutemov, Naoya Horiguchi
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Steve Capper, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> On Tue, Nov 18, 2014 at 08:43:00AM +0000, Naoya Horiguchi wrote:
>> > @@ -1837,6 +1839,9 @@ static void __split_huge_page_refcount(struct page *page,
>> >  	atomic_sub(tail_count, &page->_count);
>> >  	BUG_ON(atomic_read(&page->_count) <= 0);
>> >  
>> > +	page->_mapcount = *compound_mapcount_ptr(page);
>> 
>> Is atomic_set() necessary?
>
> Do you mean
> 	atomic_set(&page->_mapcount, atomic_read(compound_mapcount_ptr(page)));
> ?
>
> I don't see why we would need this. Simple assignment should work just
> fine. Or we have archs which will break?

Are you looking at architecture-related atomic_set() issues, or at the
fact that we cannot have parallel _mapcount updates and hence the above
assignment should be OK? If the former, the current THP code uses
atomic_add() instead of even using atomic_set() when updating
page_tail->_count.

		 * from under us on the tail_page. If we used
		 * atomic_set() below instead of atomic_add(), we
		 * would then run atomic_set() concurrently with
		 * get_page_unless_zero(), and atomic_set() is
		 * implemented in C not using locked ops. spin_unlock
		 * on x86 sometime uses locked ops because of PPro
		 * errata 66, 92, so unless somebody can guarantee
		 * atomic_set() here would be safe on all archs (and
		 * not only on x86), it's safer to use atomic_add().
		 */
		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
			   &page_tail->_count);



-aneesh


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 06/19] mm: store mapcount for compound page separate
  2014-11-21  6:41       ` Aneesh Kumar K.V
@ 2014-11-21 11:47         ` Kirill A. Shutemov
  0 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-21 11:47 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Naoya Horiguchi, Kirill A. Shutemov, Andrew Morton,
	Andrea Arcangeli, Dave Hansen, Hugh Dickins, Mel Gorman,
	Rik van Riel, Vlastimil Babka, Christoph Lameter, Steve Capper,
	Johannes Weiner, Michal Hocko, linux-kernel, linux-mm

On Fri, Nov 21, 2014 at 12:11:34PM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> 
> > On Tue, Nov 18, 2014 at 08:43:00AM +0000, Naoya Horiguchi wrote:
> >> > @@ -1837,6 +1839,9 @@ static void __split_huge_page_refcount(struct page *page,
> >> >  	atomic_sub(tail_count, &page->_count);
> >> >  	BUG_ON(atomic_read(&page->_count) <= 0);
> >> >  
> >> > +	page->_mapcount = *compound_mapcount_ptr(page);
> >> 
> >> Is atomic_set() necessary?
> >
> > Do you mean
> > 	atomic_set(&page->_mapcount, atomic_read(compound_mapcount_ptr(page)));
> > ?
> >
> > I don't see why we would need this. Simple assignment should work just
> > fine. Or we have archs which will break?
> 
> Are you looking at architecture related atomic_set issues, or the fact
> that we cannot have parallel _mapcount update and hence the above
> assignment should be ok ? If the former, current thp code
> use atomic_add instead of even using atomic_set when
> updatinge page_tail->_count.  
> 
> 		 * from under us on the tail_page. If we used
> 		 * atomic_set() below instead of atomic_add(), we
> 		 * would then run atomic_set() concurrently with
> 		 * get_page_unless_zero(), and atomic_set() is
> 		 * implemented in C not using locked ops. spin_unlock
> 		 * on x86 sometime uses locked ops because of PPro
> 		 * errata 66, 92, so unless somebody can guarantee
> 		 * atomic_set() here would be safe on all archs (and
> 		 * not only on x86), it's safer to use atomic_add().
> 		 */
> 		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
> 			   &page_tail->_count);

We don't have anything like get_page_unless_zero() for _mapcount as far
as I can see. And we already have a similar plain assignment there now:

	page_tail->_mapcount = page->_mapcount;

Anyway, the assignment goes away by the end of the patchset.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 06/19] mm: store mapcount for compound page separate
  2014-11-20 20:06       ` Christoph Lameter
@ 2014-11-21 12:01         ` Kirill A. Shutemov
  0 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-21 12:01 UTC (permalink / raw)
  To: Christoph Lameter, Dave Hansen
  Cc: Jerome Marchand, Kirill A. Shutemov, Andrew Morton,
	Andrea Arcangeli, Hugh Dickins, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V,
	Johannes Weiner, Michal Hocko, linux-kernel, linux-mm

On Thu, Nov 20, 2014 at 02:06:53PM -0600, Christoph Lameter wrote:
> On Wed, 19 Nov 2014, Kirill A. Shutemov wrote:
> 
> > I don't think we want to bloat struct page description: nobody outside of
> > helpers should use it direcly. And it's exactly what we did to store
> > compound page destructor and compound page order.
> 
> This is more like a description what overloading is occurring. Either
> add the new way of using it there including a comment explainng things or
> please do not overload the field.

I can do this, although I don't see much value in it. At the least we
need to be consistent and do the same for the compound destructor and
compound order.

Dave, you tried to sort out the mess around struct page recently. Any
opinion?

BTW, how far should we go there? Should things like
PAGE_BUDDY_MAPCOUNT_VALUE and PAGE_BALLOON_MAPCOUNT_VALUE be described
in the struct page definition too?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 06/19] mm: store mapcount for compound page separate
  2014-11-21  6:12   ` Aneesh Kumar K.V
@ 2014-11-21 12:02     ` Kirill A. Shutemov
  0 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-21 12:02 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Johannes Weiner, Michal Hocko, linux-kernel, linux-mm

On Fri, Nov 21, 2014 at 11:42:51AM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:
> 
> > We're going to allow mapping of individual 4k pages of THP compound and
> > we need a cheap way to find out how many time the compound page is
> > mapped with PMD -- compound_mapcount() does this.
> >
> > page_mapcount() counts both: PTE and PMD mappings of the page.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  include/linux/mm.h   | 17 +++++++++++++++--
> >  include/linux/rmap.h |  4 ++--
> >  mm/huge_memory.c     | 23 ++++++++++++++---------
> >  mm/hugetlb.c         |  4 ++--
> >  mm/memory.c          |  2 +-
> >  mm/migrate.c         |  2 +-
> >  mm/page_alloc.c      | 13 ++++++++++---
> >  mm/rmap.c            | 50 +++++++++++++++++++++++++++++++++++++++++++-------
> >  8 files changed, 88 insertions(+), 27 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 1825c468f158..aef03acff228 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -435,6 +435,19 @@ static inline struct page *compound_head(struct page *page)
> >  	return page;
> >  }
> >  
> > +static inline atomic_t *compound_mapcount_ptr(struct page *page)
> > +{
> > +	return (atomic_t *)&page[1].mapping;
> > +}
> > +
> > +static inline int compound_mapcount(struct page *page)
> > +{
> > +	if (!PageCompound(page))
> > +		return 0;
> > +	page = compound_head(page);
> > +	return atomic_read(compound_mapcount_ptr(page)) + 1;
> > +}
> 
> 
> How about 
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6e0b286649f1..59c9cf3d8510 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -46,6 +46,11 @@ struct page {
>  	unsigned long flags;		/* Atomic flags, some possibly
>  					 * updated asynchronously */
>  	union {
> +		/*
> +		  * For THP we use this to track the compound
> +		  * page mapcount.
> +		  */
> +		atomic_t _compound_mapcount;
>  		struct address_space *mapping;	/* If low bit clear, points to
>  						 * inode address_space, or NULL.
>  						 * If page mapped as anonymous
> 
> and 
> 
> static inline atomic_t *compound_mapcount_ptr(struct page *page)
> {
>         return (atomic_t *)&page[1]._compound_mapcount;
> }

Cast is redundant ;)
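
With the suggested _compound_mapcount field in the union, the helper
would then reduce to something like:

	static inline atomic_t *compound_mapcount_ptr(struct page *page)
	{
		return &page[1]._compound_mapcount;
	}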

See answer to Christoph.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 01/19] mm, thp: drop FOLL_SPLIT
  2014-11-05 14:49 ` [PATCH 01/19] mm, thp: drop FOLL_SPLIT Kirill A. Shutemov
@ 2014-11-25  3:01   ` Naoya Horiguchi
  2014-11-25 14:04     ` Kirill A. Shutemov
  0 siblings, 1 reply; 41+ messages in thread
From: Naoya Horiguchi @ 2014-11-25  3:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Andrea Arcangeli, Dave Hansen, Hugh Dickins,
	Mel Gorman, Rik van Riel, Vlastimil Babka, Christoph Lameter,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	linux-kernel, linux-mm

On Wed, Nov 05, 2014 at 04:49:36PM +0200, Kirill A. Shutemov wrote:
> FOLL_SPLIT is used only in two places: migration and s390.
> 
> Let's replace it with explicit split and remove FOLL_SPLIT.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
...
> @@ -1246,6 +1246,11 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
>  		if (!page)
>  			goto set_status;
>  
> +		if (PageTransHuge(page) && split_huge_page(page)) {
> +			err = -EBUSY;
> +			goto set_status;
> +		}
> +

This check makes split_huge_page() be called for hugetlb pages, which
triggers a BUG_ON. So could you do this after the if (PageHuge) block
below? And I think we have the "Node already in the right place" check
afterward, so I hope that moving this check down also helps us reduce
THP splitting.

Thanks,
Naoya Horiguchi

>  		/* Use PageReserved to check for zero page */
>  		if (PageReserved(page))
>  			goto put_and_set;

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 01/19] mm, thp: drop FOLL_SPLIT
  2014-11-25  3:01   ` Naoya Horiguchi
@ 2014-11-25 14:04     ` Kirill A. Shutemov
  0 siblings, 0 replies; 41+ messages in thread
From: Kirill A. Shutemov @ 2014-11-25 14:04 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Dave Hansen,
	Hugh Dickins, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Steve Capper, Aneesh Kumar K.V,
	Johannes Weiner, Michal Hocko, linux-kernel, linux-mm

On Tue, Nov 25, 2014 at 03:01:16AM +0000, Naoya Horiguchi wrote:
> On Wed, Nov 05, 2014 at 04:49:36PM +0200, Kirill A. Shutemov wrote:
> > FOLL_SPLIT is used only in two places: migration and s390.
> > 
> > Let's replace it with explicit split and remove FOLL_SPLIT.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> ...
> > @@ -1246,6 +1246,11 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
> >  		if (!page)
> >  			goto set_status;
> >  
> > +		if (PageTransHuge(page) && split_huge_page(page)) {
> > +			err = -EBUSY;
> > +			goto set_status;
> > +		}
> > +
> 
> This check makes split_huge_page() be called for hugetlb pages, which
> triggers BUG_ON. So could you do this after if (PageHuge) block below?
> And I think that we have "Node already in the right place" check afterward,
> so I hope that moving down this check also helps us reduce thp splitting.

That makes sense. Thanks for the report.

The other problem here is that we need to goto put_and_set, not
set_status :-/
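
Roughly this, assuming the existing PageHuge() handling in
do_move_page_to_node_array() stays where it is (a sketch, not the final
hunk):

		if (PageHuge(page)) {
			/* existing hugetlb handling */
			goto put_and_set;
		}

		/*
		 * Moved below the PageHuge() check so hugetlb pages never
		 * reach split_huge_page(); put_and_set drops the reference
		 * taken by follow_page() on failure.
		 */
		if (PageTransHuge(page) && split_huge_page(page)) {
			err = -EBUSY;
			goto put_and_set;
		}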

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2014-11-25 14:04 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 01/19] mm, thp: drop FOLL_SPLIT Kirill A. Shutemov
2014-11-25  3:01   ` Naoya Horiguchi
2014-11-25 14:04     ` Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 02/19] thp: cluster split_huge_page* code together Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 03/19] mm: change PageAnon() to work on tail pages Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 04/19] mm: avoid PG_locked " Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 05/19] rmap: add argument to charge compound page Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 06/19] mm: store mapcount for compound page separate Kirill A. Shutemov
2014-11-18  8:43   ` Naoya Horiguchi
2014-11-18  9:58     ` Kirill A. Shutemov
2014-11-18 23:41       ` Naoya Horiguchi
2014-11-19  0:54         ` Kirill A. Shutemov
2014-11-21  6:41       ` Aneesh Kumar K.V
2014-11-21 11:47         ` Kirill A. Shutemov
2014-11-19 10:51   ` Jerome Marchand
2014-11-19 13:00     ` Kirill A. Shutemov
2014-11-19 13:15       ` Jerome Marchand
2014-11-20 20:06       ` Christoph Lameter
2014-11-21 12:01         ` Kirill A. Shutemov
2014-11-21  6:12   ` Aneesh Kumar K.V
2014-11-21 12:02     ` Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 07/19] mm, thp: adjust conditions when we can reuse the page on WP fault Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 08/19] mm: prepare migration code for new THP refcounting Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 09/19] thp: rename split_huge_page_pmd() to split_huge_pmd() Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 10/19] thp: PMD splitting without splitting compound page Kirill A. Shutemov
2014-11-19  6:57   ` Naoya Horiguchi
2014-11-19 13:02     ` Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 11/19] mm, vmstats: new THP splitting event Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 12/19] thp: implement new split_huge_page() Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 13/19] mm, thp: remove infrastructure for handling splitting PMDs Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 14/19] x86, thp: remove " Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 15/19] futex, thp: remove special case for THP in get_futex_key Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 16/19] thp: update documentation Kirill A. Shutemov
2014-11-19  8:07   ` Naoya Horiguchi
2014-11-19 13:11     ` Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 17/19] mlock, thp: HACK: split all pages in VM_LOCKED vma Kirill A. Shutemov
2014-11-19  9:02   ` Naoya Horiguchi
2014-11-19 13:08     ` Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 18/19] tho, mm: use migration entries to freeze page counts on split Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 19/19] mm, thp: remove compound_lock Kirill A. Shutemov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).