* [PATCHv6 00/36] THP refcounting redesign
@ 2015-06-03 17:05 Kirill A. Shutemov
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

Hello everybody,

Here's the new revision of the refcounting patchset. Please review and
consider applying.

The goal of the patchset is to make refcounting on THP pages cheaper, with
simpler semantics, and to allow the same THP compound page to be mapped with
both PMD and PTEs. This is required to get a reasonable THP-pagecache
implementation.

With the new refcounting design it's much easier to protect against
split_huge_page(): a simple reference on a page is enough. It makes the
gup_fast() implementation simpler and removes the special case in the futex
code for handling tail THP pages.

It should improve THP utilization across the system, since splitting a THP
in one process no longer forces the page to be split in all other processes
that have it mapped.

The patchset drastically lowers the complexity of the get_page()/put_page()
codepaths. I encourage people to look at this code before and after to
justify the time budget for reviewing this patchset.

= Changelog =

v6:
  - rebase to since-4.0;
  - optimize mapcount handling: significantly reduce overhead for the most
    common cases;
  - split pages on migrate_pages();
  - remove infrastructure for handling splitting PMDs on all architectures;
  - fix page_mapcount() for hugetlb pages;

v5:
  - Tested-by: Sasha Levin!™
  - re-split the patchset in the hope of improving readability;
  - rebased on top of the page flags and ->mapping sanitizing patchset;
  - uncharge compound_mapcount rather than mapcount for hugetlb pages
    during removal from rmap;
  - differentiate page_mapped() from page_mapcount() for compound pages;
  - rework deferred_split_huge_page() to use shrinker interface;
  - fix race in page_remove_rmap();
  - get rid of __get_page_tail();
  - a few random bug fixes;

v4:
  - fix sizes reported in smaps;
  - defines instead of enum for RMAP_{EXCLUSIVE,COMPOUND};
  - skip THP pages on munlock_vma_pages_range(): they are never mlocked;
  - properly handle huge zero page on FOLL_SPLIT;
  - fix lock_page() slow path on tail pages;
  - account page_get_anon_vma() failure to THP_SPLIT_PAGE_FAILED;
  - fix split_huge_page() on huge page with unmapped head page;
  - fix transferring 'write' and 'young' from pmd to ptes on split_huge_pmd;
  - call page_remove_rmap() in unfreeze_page under ptl.

= Design overview =

The main reason why we can't map THP with 4k pages is how refcounting on THP
is designed. It is built around two requirements:

  - split of a huge page should never fail;
  - we can't change the interface of get_user_page();

To be able to split a huge page at any point, we have to track which tail
page was pinned. This leads to tricky and expensive get_page() on tail pages
and also occupies tail_page->_mapcount.

Most split_huge_page*() users want the PMD to be split into a table of PTEs
and don't care whether the compound page is going to be split or not.

The plan is:

 - allow split_huge_page() to fail if the page is pinned. It's trivial to
   split a non-pinned page, and it doesn't require tail page refcounting, so
   tail_page->_mapcount is free to be reused.

 - introduce a new routine -- split_huge_pmd() -- to split a PMD into a
   table of PTEs. It splits only one PMD, without touching other PMDs the
   page is mapped with or the underlying compound page. Unlike the new
   split_huge_page(), split_huge_pmd() never fails.

Fortunately, we have only a few places where split_huge_page() is needed:
swap out, memory failure, migration, KSM. And all of them can handle a
split_huge_page() failure.
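
To illustrate what handling the failure means for a caller, here is a
minimal, hypothetical sketch (not code from the series; it only assumes
that the new split_huge_page() requires the page to be locked and returns
non-zero when the page is pinned and cannot be split):

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/huge_mm.h>

/* Hypothetical caller: try to split, fall back gracefully if pinned. */
static int try_split_or_skip(struct page *page)
{
	int ret;

	lock_page(page);		/* new split_huge_page() expects the page locked */
	ret = split_huge_page(page);	/* fails (non-zero) if the page has extra pins */
	unlock_page(page);

	if (ret)
		return -EBUSY;	/* keep the huge page intact; retry or skip it later */
	return 0;
}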

In the new scheme, page->_mapcount is used to account how many times the
page is mapped with PTEs. We have a separate compound_mapcount() to count
mappings with a PMD. page_mapcount() returns the sum of PTE and PMD mappings
of the page.
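
For illustration, a simplified sketch of how the two counters might combine
(hypothetical helper names; the series keeps the compound mapcount in the
first tail page -- see patch 26 -- and the real helpers handle bias and
ordering more carefully):

#include <linux/mm.h>
#include <linux/mm_types.h>

/* Assumed: the series adds a compound-mapcount slot to the first tail page. */
static inline atomic_t *compound_mapcount_ptr(struct page *head)
{
	return &head[1].compound_mapcount;	/* hypothetical field name */
}

/* PMD mappings of the whole THP; same -1 bias as ->_mapcount assumed. */
static inline int compound_mapcount_sketch(struct page *page)
{
	return atomic_read(compound_mapcount_ptr(compound_head(page))) + 1;
}

/* page_mapcount(): PTE mappings of this subpage plus PMD mappings of the THP. */
static inline int page_mapcount_sketch(struct page *page)
{
	int ret = atomic_read(&page->_mapcount) + 1;

	if (PageCompound(page))
		ret += compound_mapcount_sketch(page);
	return ret;
}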

Introducing split_huge_pmd() effectively allows a THP to be mapped with 4k
pages. It may be a surprise to some code to see a PTE which points to a tail
page, or a VMA start/end in the middle of a compound page.

munmap() of part of a THP will split the PMD, but doesn't split the huge
page. In order to keep memory consumption under control, we put partially
unmapped huge pages on a list. The pages will be split by a shrinker if
memory pressure comes. This way we also avoid an unnecessary
split_huge_page() on exit(2) if a THP belongs to more than one VMA.
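
A minimal sketch of the deferred-split idea (hypothetical and heavily
simplified -- the list/lock names are made up and page lifetime handling is
elided; the series implements this via deferred_split_huge_page() and a
shrinker in patch 34):

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/shrinker.h>
#include <linux/huge_mm.h>

static LIST_HEAD(deferred_split_list);		/* partially unmapped THPs */
static DEFINE_SPINLOCK(deferred_split_lock);

/* Called when a THP loses a mapping but remains partially mapped. */
static void queue_deferred_split(struct page *head)
{
	spin_lock(&deferred_split_lock);
	if (list_empty(&head->lru))		/* assumes ->lru is reusable here */
		list_add_tail(&head->lru, &deferred_split_list);
	spin_unlock(&deferred_split_lock);
}

/* Shrinker scan callback: split queued pages under memory pressure. */
static unsigned long deferred_split_scan(struct shrinker *shrink,
					 struct shrink_control *sc)
{
	unsigned long split = 0;
	struct page *page;

	spin_lock(&deferred_split_lock);
	while (!list_empty(&deferred_split_list)) {
		page = list_first_entry(&deferred_split_list, struct page, lru);
		list_del_init(&page->lru);
		spin_unlock(&deferred_split_lock);

		lock_page(page);
		if (!split_huge_page(page))	/* may fail if the page is pinned */
			split++;
		unlock_page(page);

		spin_lock(&deferred_split_lock);
	}
	spin_unlock(&deferred_split_lock);
	return split;
}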

= Refcounts and transparent huge pages =

  - get_page() and put_page() work *only* on the head page's ->_count.
    We don't touch tail pages at all for these operations. We no longer
    touch ->_mapcount in tail pages to store their pins.

  - ->_count in tail pages is always zero: get_page_unless_zero() never
    succeeds on tail pages. Nothing changed in this respect.

  - map/unmap of a page with a PTE entry increments/decrements ->_mapcount
    of the relevant sub-page of the compound page.

  - map/unmap of the whole compound page is accounted in compound_mapcount
    (stored in the first tail page).

  - PageDoubleMap() indicates that ->_mapcount in all subpages is offset
    up by one. This additional reference is required to get race-free
    detection of unmap of subpages when we have them mapped with both PMDs
    and PTEs.

    This is an optimization required to lower the overhead of per-subpage
    mapcount tracking. The alternative is to alter ->_mapcount in all
    subpages on each map/unmap of the whole compound page.

    We set PG_double_map when a PMD of the page gets split for the first
    time, but the page still has a PMD mapping. The additional references go
    away with the last compound_mapcount; see the sketch after this list.
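
For illustration, a much-simplified sketch of the PG_double_map bias
(hypothetical names; the real code in the series is spread across
split_huge_pmd() and the rmap helpers and is more careful about races):

#include <linux/mm.h>
#include <linux/page-flags.h>
#include <linux/huge_mm.h>

/* Writer side: first PMD split while the page is still PMD-mapped elsewhere. */
static void set_double_map_sketch(struct page *head)
{
	int i;

	if (!TestSetPageDoubleMap(head))	/* flag assumed to be added by the series */
		for (i = 0; i < HPAGE_PMD_NR; i++)
			atomic_inc(&head[i]._mapcount);	/* the +1 offset on every subpage */
}

/* Reader side: number of real PTE mappings of subpage i. */
static int subpage_pte_mapcount_sketch(struct page *head, int i)
{
	int mapcount = atomic_read(&head[i]._mapcount) + 1;

	if (PageDoubleMap(head))
		mapcount--;		/* subtract the PG_double_map offset */
	return mapcount;
}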

= Benchmarks =

Kernel build benchmark:

			baseline		v6

Amean    user-2       447.76 (  0.00%)      451.24 ( -0.78%)
Amean    user-4       314.94 (  0.00%)      310.10 (  1.54%)
Amean    user-8       388.91 (  0.00%)      388.95 ( -0.01%)
Amean    user-16      518.68 (  0.00%)      518.60 (  0.02%)
Amean    user-24      533.58 (  0.00%)      535.35 ( -0.33%)
Amean    syst-2        77.52 (  0.00%)       70.16 (  9.49%)
Amean    syst-4        51.21 (  0.00%)       44.55 ( 13.00%)
Amean    syst-8        42.12 (  0.00%)       42.59 ( -1.12%)
Amean    syst-16       50.29 (  0.00%)       50.14 (  0.30%)
Amean    syst-24       49.36 (  0.00%)       48.54 (  1.65%)
Amean    elsp-2       242.64 (  0.00%)      244.46 ( -0.75%)
Amean    elsp-4        93.78 (  0.00%)       92.89 (  0.95%)
Amean    elsp-8        61.51 (  0.00%)       61.92 ( -0.66%)
Amean    elsp-16       53.95 (  0.00%)       53.80 (  0.29%)
Amean    elsp-24       52.75 (  0.00%)       53.14 ( -0.74%)
Stddev   user-2        15.49 (  0.00%)       13.75 ( 11.24%)
Stddev   user-4         7.85 (  0.00%)        4.42 ( 43.68%)
Stddev   user-8         1.29 (  0.00%)        2.77 (-114.28%)
Stddev   user-16        2.56 (  0.00%)        1.54 ( 39.89%)
Stddev   user-24        1.75 (  0.00%)        1.06 ( 39.75%)
Stddev   syst-2         3.02 (  0.00%)        2.00 ( 33.86%)
Stddev   syst-4         1.23 (  0.00%)        0.91 ( 26.65%)
Stddev   syst-8         0.41 (  0.00%)        0.30 ( 28.32%)
Stddev   syst-16        0.51 (  0.00%)        0.71 (-38.07%)
Stddev   syst-24        0.92 (  0.00%)        0.70 ( 23.86%)
Stddev   elsp-2         8.70 (  0.00%)        7.99 (  8.13%)
Stddev   elsp-4         1.74 (  0.00%)        0.59 ( 66.08%)
Stddev   elsp-8         0.40 (  0.00%)        0.30 ( 25.11%)
Stddev   elsp-16        0.45 (  0.00%)        0.37 ( 17.73%)
Stddev   elsp-24        0.57 (  0.00%)        0.38 ( 33.64%)

Changes are mostly not significant. The only noticeable part is the
reduction of system time for -j2 and -j4: 9.49% and 13.00%.

specjvm
                                    base                    v6
Ops compiler                   569.52 (  0.00%)       618.74 (  8.64%)
Ops compress                   456.32 (  0.00%)       469.20 (  2.82%)
Ops crypto                     424.34 (  0.00%)       413.64 ( -2.52%)
Ops derby                      535.15 (  0.00%)       536.96 (  0.34%)
Ops mpegaudio                  291.03 (  0.00%)       286.35 ( -1.61%)
Ops scimark.large               75.91 (  0.00%)        77.21 (  1.71%)
Ops scimark.small              529.19 (  0.00%)       527.07 ( -0.40%)
Ops serial                     316.13 (  0.00%)       316.40 (  0.09%)
Ops sunflow                    154.90 (  0.00%)       154.85 ( -0.03%)
Ops xml                        612.94 (  0.00%)       575.20 ( -6.16%)
Ops compiler.compiler          770.11 (  0.00%)       878.22 ( 14.04%)
Ops compiler.sunflow           421.17 (  0.00%)       435.92 (  3.50%)
Ops compress                   456.32 (  0.00%)       469.20 (  2.82%)
Ops crypto.aes                 153.17 (  0.00%)       151.30 ( -1.22%)
Ops crypto.rsa                 607.43 (  0.00%)       564.65 ( -7.04%)
Ops crypto.signverify          821.23 (  0.00%)       828.40 (  0.87%)
Ops derby                      535.15 (  0.00%)       536.96 (  0.34%)
Ops mpegaudio                  291.03 (  0.00%)       286.35 ( -1.61%)
Ops scimark.fft.large           69.59 (  0.00%)        69.81 (  0.32%)
Ops scimark.lu.large            20.31 (  0.00%)        20.32 (  0.05%)
Ops scimark.sor.large          114.57 (  0.00%)       113.99 ( -0.51%)
Ops scimark.sparse.large        55.56 (  0.00%)        61.71 ( 11.07%)
Ops scimark.monte_carlo        280.13 (  0.00%)       275.09 ( -1.80%)
Ops scimark.fft.small          815.19 (  0.00%)       819.55 (  0.53%)
Ops scimark.lu.small          1072.62 (  0.00%)      1081.47 (  0.83%)
Ops scimark.sor.small          674.47 (  0.00%)       674.24 ( -0.03%)
Ops scimark.sparse.small       251.20 (  0.00%)       247.43 ( -1.50%)
Ops serial                     316.13 (  0.00%)       316.40 (  0.09%)
Ops sunflow                    154.90 (  0.00%)       154.85 ( -0.03%)
Ops xml.transform              538.16 (  0.00%)       519.64 ( -3.44%)
Ops xml.validation             698.10 (  0.00%)       636.70 ( -8.80%)

Results are mixed.

= Patches overview =

Patch 1:
        We need to look at all subpages of a compound page to calculate
        correct PSS, because they can have different mapcounts.

Patch 2:
        With PTE-mapped THP, rmap cannot rely on the PageTransHuge() check
        to decide whether to map a small page or a THP. We need to get the
        info from the caller.

Patch 3:
        Make memcg aware of the new refcounting. Validation needed.

Patch 4:
        Adjust conditions when we can re-use the page on write-protection
        fault.

Patch 5:
        FOLL_SPLIT should be handled on PTE level too.

Patch 6:
        Make the generic fast GUP implementation aware of PTE-mapped huge
        pages.

Patch 7:
        Split all pages in mlocked VMA. That should be good enough for
	now.

Patch 8:
        Make khugepaged aware of PTE-mapped huge pages.

Patch 9:
        Rename split_huge_page_pmd() to split_huge_pmd() to reflect that the
        page is not going to be split, only the PMD.

Patch 10:
        New THP_SPLIT_* vmstats.

Patch 11:
        Up to this point we tried to keep the patchset bisectable, but the
        next patches are going to change how the core of THP refcounting
        works. It's easier to review the changes if we disable THP
        temporarily and bring it back once everything is ready.

Patch 12:
        Remove all split_huge_page()-related code. This also removes the
        need for tail page refcounting.

Patch 13:
	Drop tail page refcounting. Diffstat is nice! :)

Patch 14:
        Remove the ugly special case for a futex that happens to be in a
        tail THP page. With the new refcounting it's much easier to protect
        against split.

Patch 15:
        Simplify the KSM code which handles THP.

Patch 16:
        No need for compound_lock anymore.

Patches 17-25:
        Drop infrastructure for handling PMD splitting. We don't use it
        anymore in split_huge_page().

Patch 26:
        Store mapcount for compound pages separately: in the first tail
        page ->mapping.

Patch 27:
        Let's define page_mapped() to be true for compound pages if any
        sub-page of the compound page is mapped (with a PMD or a PTE).
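
As a rough illustration of the definition in patch 27 (hypothetical sketch
reusing the compound_mapcount_ptr() helper assumed earlier; the real helper
is more careful about ordering):

static inline bool page_mapped_sketch(struct page *page)
{
	int i;

	page = compound_head(page);
	if (!PageCompound(page))
		return atomic_read(&page->_mapcount) >= 0;
	if (atomic_read(compound_mapcount_ptr(page)) >= 0)	/* mapped with a PMD? */
		return true;
	for (i = 0; i < hpage_nr_pages(page); i++)		/* any subpage mapped with a PTE? */
		if (atomic_read(&page[i]._mapcount) >= 0)
			return true;
	return false;
}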

Patch 28:
        Make NUMA balancing aware of PTE-mapped THP.

Patch 29:
	Implement new split_huge_pmd().

Patches 30-32:
	Implement new split_huge_page().

Patch 33:
        Split pages instead of PMDs on migrate_pages().

Patch 34:
        Handle partial unmap of a THP. We put partially unmapped huge pages
        on a list. Pages from the list will be split via a shrinker if
        memory pressure comes. This way we also avoid an unnecessary
        split_huge_page() on exit(2) if a THP belongs to more than one VMA.

Patch 35:
	Everything is in place. Re-enable THP.

Patch 36:
        Documentation update.

The patchset is also available in git:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/refcounting/v5

Please review.

Kirill A. Shutemov (36):
  mm, proc: adjust PSS calculation
  rmap: add argument to charge compound page
  memcg: adjust to support new THP refcounting
  mm, thp: adjust conditions when we can reuse the page on WP fault
  mm: adjust FOLL_SPLIT for new refcounting
  mm: handle PTE-mapped tail pages in generic fast gup implementation
  thp, mlock: do not allow huge pages in mlocked area
  khugepaged: ignore pmd tables with THP mapped with ptes
  thp: rename split_huge_page_pmd() to split_huge_pmd()
  mm, vmstats: new THP splitting event
  mm: temporarily mark THP broken
  thp: drop all split_huge_page()-related code
  mm: drop tail page refcounting
  futex, thp: remove special case for THP in get_futex_key
  ksm: prepare to new THP semantics
  mm, thp: remove compound_lock
  arm64, thp: remove infrastructure for handling splitting PMDs
  arm, thp: remove infrastructure for handling splitting PMDs
  mips, thp: remove infrastructure for handling splitting PMDs
  powerpc, thp: remove infrastructure for handling splitting PMDs
  s390, thp: remove infrastructure for handling splitting PMDs
  sparc, thp: remove infrastructure for handling splitting PMDs
  tile, thp: remove infrastructure for handling splitting PMDs
  x86, thp: remove infrastructure for handling splitting PMDs
  mm, thp: remove infrastructure for handling splitting PMDs
  mm: rework mapcount accounting to enable 4k mapping of THPs
  mm: differentiate page_mapped() from page_mapcount() for compound
    pages
  mm, numa: skip PTE-mapped THP on numa fault
  thp: implement split_huge_pmd()
  thp: add option to set up migration entries during PMD split
  thp, mm: split_huge_page(): caller needs to lock page
  thp: reintroduce split_huge_page()
  migrate_pages: try to split pages on queueing
  thp: introduce deferred_split_huge_page()
  mm: re-enable THP
  thp: update documentation

 Documentation/vm/transhuge.txt           |  124 ++--
 arch/arc/mm/cache_arc700.c               |    4 +-
 arch/arm/include/asm/pgtable-3level.h    |   10 -
 arch/arm/lib/uaccess_with_memcpy.c       |    5 +-
 arch/arm/mm/flush.c                      |   17 +-
 arch/arm64/include/asm/pgtable.h         |    9 -
 arch/arm64/mm/flush.c                    |   16 -
 arch/mips/include/asm/pgtable-bits.h     |    6 +-
 arch/mips/include/asm/pgtable.h          |   18 -
 arch/mips/mm/c-r4k.c                     |    3 +-
 arch/mips/mm/cache.c                     |    2 +-
 arch/mips/mm/gup.c                       |   17 +-
 arch/mips/mm/init.c                      |    6 +-
 arch/mips/mm/pgtable-64.c                |   14 -
 arch/mips/mm/tlbex.c                     |    1 -
 arch/powerpc/include/asm/kvm_book3s_64.h |    6 -
 arch/powerpc/include/asm/pgtable-ppc64.h |   25 +-
 arch/powerpc/mm/hugepage-hash64.c        |    3 -
 arch/powerpc/mm/hugetlbpage.c            |   20 +-
 arch/powerpc/mm/pgtable_64.c             |   49 --
 arch/powerpc/mm/subpage-prot.c           |    2 +-
 arch/s390/include/asm/pgtable.h          |   15 +-
 arch/s390/mm/gup.c                       |   24 +-
 arch/s390/mm/pgtable.c                   |   16 -
 arch/sh/mm/cache-sh4.c                   |    2 +-
 arch/sh/mm/cache.c                       |    8 +-
 arch/sparc/include/asm/pgtable_64.h      |   16 -
 arch/sparc/mm/fault_64.c                 |    3 -
 arch/sparc/mm/gup.c                      |   16 +-
 arch/tile/include/asm/pgtable.h          |   10 -
 arch/x86/include/asm/pgtable.h           |    9 -
 arch/x86/include/asm/pgtable_types.h     |    2 -
 arch/x86/kernel/vm86_32.c                |    6 +-
 arch/x86/mm/gup.c                        |   17 +-
 arch/x86/mm/pgtable.c                    |   14 -
 arch/xtensa/mm/tlb.c                     |    2 +-
 fs/proc/page.c                           |    4 +-
 fs/proc/task_mmu.c                       |   56 +-
 include/asm-generic/pgtable.h            |    9 -
 include/linux/huge_mm.h                  |   55 +-
 include/linux/memcontrol.h               |   16 +-
 include/linux/mm.h                       |  113 ++--
 include/linux/mm_types.h                 |   18 +-
 include/linux/page-flags.h               |   49 +-
 include/linux/pagemap.h                  |   13 +-
 include/linux/rmap.h                     |   16 +-
 include/linux/swap.h                     |    3 +-
 include/linux/vm_event_item.h            |    4 +-
 kernel/events/uprobes.c                  |   11 +-
 kernel/futex.c                           |   61 +-
 mm/debug.c                               |    8 +-
 mm/filemap.c                             |   10 +-
 mm/gup.c                                 |  111 ++-
 mm/huge_memory.c                         | 1092 ++++++++++++++++--------------
 mm/hugetlb.c                             |   10 +-
 mm/internal.h                            |   70 +-
 mm/ksm.c                                 |   61 +-
 mm/madvise.c                             |    2 +-
 mm/memcontrol.c                          |   84 +--
 mm/memory-failure.c                      |   10 +-
 mm/memory.c                              |   73 +-
 mm/mempolicy.c                           |   37 +-
 mm/migrate.c                             |   19 +-
 mm/mincore.c                             |    2 +-
 mm/mlock.c                               |   51 +-
 mm/mprotect.c                            |    2 +-
 mm/mremap.c                              |   15 +-
 mm/page_alloc.c                          |   16 +-
 mm/pagewalk.c                            |    2 +-
 mm/pgtable-generic.c                     |   14 -
 mm/rmap.c                                |  162 +++--
 mm/shmem.c                               |   21 +-
 mm/swap.c                                |  273 +-------
 mm/swapfile.c                            |   16 +-
 mm/userfaultfd.c                         |    8 +-
 mm/vmstat.c                              |    4 +-
 76 files changed, 1310 insertions(+), 1808 deletions(-)

-- 
2.1.4



* [PATCHv6 01/36] mm, proc: adjust PSS calculation
@ 2015-06-03 17:05 ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting, all subpages of a compound page do not necessarily
have the same mapcount. We need to take into account the mapcount of every
sub-page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 fs/proc/task_mmu.c | 48 +++++++++++++++++++++++++++++++-----------------
 1 file changed, 31 insertions(+), 17 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 58be92e11939..f9b285761bc0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -449,9 +449,10 @@ struct mem_size_stats {
 };
 
 static void smaps_account(struct mem_size_stats *mss, struct page *page,
-		unsigned long size, bool young, bool dirty)
+		bool compound, bool young, bool dirty)
 {
-	int mapcount;
+	int i, nr = compound ? HPAGE_PMD_NR : 1;
+	unsigned long size = nr * PAGE_SIZE;
 
 	if (PageAnon(page))
 		mss->anonymous += size;
@@ -460,23 +461,36 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 	/* Accumulate the size in pages that have been accessed. */
 	if (young || PageReferenced(page))
 		mss->referenced += size;
-	mapcount = page_mapcount(page);
-	if (mapcount >= 2) {
-		u64 pss_delta;
 
-		if (dirty || PageDirty(page))
-			mss->shared_dirty += size;
-		else
-			mss->shared_clean += size;
-		pss_delta = (u64)size << PSS_SHIFT;
-		do_div(pss_delta, mapcount);
-		mss->pss += pss_delta;
-	} else {
+	/*
+	 * page_count(page) == 1 guarantees the page is mapped exactly once.
+	 * If any subpage of the compound page mapped with PTE it would elevate
+	 * page_count().
+	 */
+	if (page_count(page) == 1) {
 		if (dirty || PageDirty(page))
 			mss->private_dirty += size;
 		else
 			mss->private_clean += size;
-		mss->pss += (u64)size << PSS_SHIFT;
+		return;
+	}
+
+	for (i = 0; i < nr; i++, page++) {
+		int mapcount = page_mapcount(page);
+
+		if (mapcount >= 2) {
+			if (dirty || PageDirty(page))
+				mss->shared_dirty += PAGE_SIZE;
+			else
+				mss->shared_clean += PAGE_SIZE;
+			mss->pss += (PAGE_SIZE << PSS_SHIFT) / mapcount;
+		} else {
+			if (dirty || PageDirty(page))
+				mss->private_dirty += PAGE_SIZE;
+			else
+				mss->private_clean += PAGE_SIZE;
+			mss->pss += PAGE_SIZE << PSS_SHIFT;
+		}
 	}
 }
 
@@ -500,7 +514,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 
 	if (!page)
 		return;
-	smaps_account(mss, page, PAGE_SIZE, pte_young(*pte), pte_dirty(*pte));
+
+	smaps_account(mss, page, false, pte_young(*pte), pte_dirty(*pte));
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -516,8 +531,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 	if (IS_ERR_OR_NULL(page))
 		return;
 	mss->anonymous_thp += HPAGE_PMD_SIZE;
-	smaps_account(mss, page, HPAGE_PMD_SIZE,
-			pmd_young(*pmd), pmd_dirty(*pmd));
+	smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd));
 }
 #else
 static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
-- 
2.1.4



* [PATCHv6 02/36] rmap: add argument to charge compound page
@ 2015-06-03 17:05 ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We're going to allow mapping of individual 4k pages of a THP compound
page. This means we cannot rely on the PageTransHuge() check to decide
whether to map/unmap a small page or a THP.

The patch adds a new argument to the rmap functions to indicate whether we
want to operate on the whole compound page or only the small page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/rmap.h    | 12 +++++++++---
 kernel/events/uprobes.c |  4 ++--
 mm/huge_memory.c        | 16 ++++++++--------
 mm/hugetlb.c            |  4 ++--
 mm/ksm.c                |  4 ++--
 mm/memory.c             | 14 +++++++-------
 mm/migrate.c            |  8 ++++----
 mm/rmap.c               | 48 +++++++++++++++++++++++++++++++-----------------
 mm/swapfile.c           |  4 ++--
 mm/userfaultfd.c        |  2 +-
 10 files changed, 68 insertions(+), 48 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bf36b6e644c4..2bc86dc14305 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -159,16 +159,22 @@ static inline void anon_vma_merge(struct vm_area_struct *vma,
 
 struct anon_vma *page_get_anon_vma(struct page *page);
 
+/* bitflags for do_page_add_anon_rmap() */
+#define RMAP_EXCLUSIVE 0x01
+#define RMAP_COMPOUND 0x02
+
 /*
  * rmap interfaces called when adding or removing pte of page
  */
 void page_move_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
-void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
+void page_add_anon_rmap(struct page *, struct vm_area_struct *,
+		unsigned long, bool);
 void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
 			   unsigned long, int);
-void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
+void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
+		unsigned long, bool);
 void page_add_file_rmap(struct page *);
-void page_remove_rmap(struct page *);
+void page_remove_rmap(struct page *, bool);
 
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 			    unsigned long);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index cb346f26a22d..5523daf59953 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -183,7 +183,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 		goto unlock;
 
 	get_page(kpage);
-	page_add_new_anon_rmap(kpage, vma, addr);
+	page_add_new_anon_rmap(kpage, vma, addr, false);
 	mem_cgroup_commit_charge(kpage, memcg, false);
 	lru_cache_add_active_or_unevictable(kpage, vma);
 
@@ -196,7 +196,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	ptep_clear_flush_notify(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
 
-	page_remove_rmap(page);
+	page_remove_rmap(page, false);
 	if (!page_mapped(page))
 		try_to_free_swap(page);
 	pte_unmap_unlock(ptep, ptl);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9671f51e954d..310b0650abe0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -773,7 +773,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
-		page_add_new_anon_rmap(page, vma, haddr);
+		page_add_new_anon_rmap(page, vma, haddr, true);
 		mem_cgroup_commit_charge(page, memcg, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(mm, pmd, pgtable);
@@ -1068,7 +1068,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
-		page_add_new_anon_rmap(pages[i], vma, haddr);
+		page_add_new_anon_rmap(pages[i], vma, haddr, false);
 		mem_cgroup_commit_charge(pages[i], memcg, false);
 		lru_cache_add_active_or_unevictable(pages[i], vma);
 		pte = pte_offset_map(&_pmd, haddr);
@@ -1080,7 +1080,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
-	page_remove_rmap(page);
+	page_remove_rmap(page, true);
 	spin_unlock(ptl);
 
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
@@ -1200,7 +1200,7 @@ alloc:
 		entry = mk_huge_pmd(new_page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		pmdp_huge_clear_flush_notify(vma, haddr, pmd);
-		page_add_new_anon_rmap(new_page, vma, haddr);
+		page_add_new_anon_rmap(new_page, vma, haddr, true);
 		mem_cgroup_commit_charge(new_page, memcg, false);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		set_pmd_at(mm, haddr, pmd, entry);
@@ -1210,7 +1210,7 @@ alloc:
 			put_huge_zero_page();
 		} else {
 			VM_BUG_ON_PAGE(!PageHead(page), page);
-			page_remove_rmap(page);
+			page_remove_rmap(page, true);
 			put_page(page);
 		}
 		ret |= VM_FAULT_WRITE;
@@ -1465,7 +1465,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			put_huge_zero_page();
 		} else {
 			page = pmd_page(orig_pmd);
-			page_remove_rmap(page);
+			page_remove_rmap(page, true);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 			VM_BUG_ON_PAGE(!PageHead(page), page);
@@ -2311,7 +2311,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			 * superfluous.
 			 */
 			pte_clear(vma->vm_mm, address, _pte);
-			page_remove_rmap(src_page);
+			page_remove_rmap(src_page, false);
 			spin_unlock(ptl);
 			free_page_and_swap_cache(src_page);
 		}
@@ -2606,7 +2606,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
-	page_add_new_anon_rmap(new_page, vma, address);
+	page_add_new_anon_rmap(new_page, vma, address, true);
 	mem_cgroup_commit_charge(new_page, memcg, false);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 290984b6a702..ddbcecab781f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2824,7 +2824,7 @@ again:
 		if (huge_pte_dirty(pte))
 			set_page_dirty(page);
 
-		page_remove_rmap(page);
+		page_remove_rmap(page, true);
 		force_flush = !__tlb_remove_page(tlb, page);
 		if (force_flush) {
 			address += sz;
@@ -3045,7 +3045,7 @@ retry_avoidcopy:
 		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
 		set_huge_pte_at(mm, address, ptep,
 				make_huge_pte(vma, new_page, 1));
-		page_remove_rmap(old_page);
+		page_remove_rmap(old_page, true);
 		hugepage_add_new_anon_rmap(new_page, vma, address);
 		/* Make the old page be freed below */
 		new_page = old_page;
diff --git a/mm/ksm.c b/mm/ksm.c
index bc7be0ee2080..fe09f3ddc912 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -957,13 +957,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	}
 
 	get_page(kpage);
-	page_add_anon_rmap(kpage, vma, addr);
+	page_add_anon_rmap(kpage, vma, addr, false);
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush_notify(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
 
-	page_remove_rmap(page);
+	page_remove_rmap(page, false);
 	if (!page_mapped(page))
 		try_to_free_swap(page);
 	put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index 8a2fc9945b46..4cd6d9392004 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1125,7 +1125,7 @@ again:
 					mark_page_accessed(page);
 				rss[MM_FILEPAGES]--;
 			}
-			page_remove_rmap(page);
+			page_remove_rmap(page, false);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(!__tlb_remove_page(tlb, page))) {
@@ -2113,7 +2113,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * thread doing COW.
 		 */
 		ptep_clear_flush_notify(vma, address, page_table);
-		page_add_new_anon_rmap(new_page, vma, address);
+		page_add_new_anon_rmap(new_page, vma, address, false);
 		mem_cgroup_commit_charge(new_page, memcg, false);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		/*
@@ -2146,7 +2146,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 			 * mapcount is visible. So transitively, TLBs to
 			 * old page will be flushed before it can be reused.
 			 */
-			page_remove_rmap(old_page);
+			page_remove_rmap(old_page, false);
 		}
 
 		/* Free the old page.. */
@@ -2562,7 +2562,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
-		exclusive = 1;
+		exclusive = RMAP_EXCLUSIVE;
 	}
 	flush_icache_page(vma, page);
 	if (pte_swp_soft_dirty(orig_pte))
@@ -2572,7 +2572,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		do_page_add_anon_rmap(page, vma, address, exclusive);
 		mem_cgroup_commit_charge(page, memcg, true);
 	} else { /* ksm created a completely new copy */
-		page_add_new_anon_rmap(page, vma, address);
+		page_add_new_anon_rmap(page, vma, address, false);
 		mem_cgroup_commit_charge(page, memcg, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	}
@@ -2726,7 +2726,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, vma, address);
+	page_add_new_anon_rmap(page, vma, address, false);
 	mem_cgroup_commit_charge(page, memcg, false);
 	lru_cache_add_active_or_unevictable(page, vma);
 setpte:
@@ -2814,7 +2814,7 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 	if (anon) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-		page_add_new_anon_rmap(page, vma, address);
+		page_add_new_anon_rmap(page, vma, address, false);
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, MM_FILEPAGES);
 		page_add_file_rmap(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index 236ee25e79d9..9cabb25fb16e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -166,7 +166,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		else
 			page_dup_rmap(new);
 	} else if (PageAnon(new))
-		page_add_anon_rmap(new, vma, addr);
+		page_add_anon_rmap(new, vma, addr, false);
 	else
 		page_add_file_rmap(new);
 
@@ -1798,7 +1798,7 @@ fail_putback:
 	 * guarantee the copy is visible before the pagetable update.
 	 */
 	flush_cache_range(vma, mmun_start, mmun_end);
-	page_add_anon_rmap(new_page, vma, mmun_start);
+	page_add_anon_rmap(new_page, vma, mmun_start, true);
 	pmdp_huge_clear_flush_notify(vma, mmun_start, pmd);
 	set_pmd_at(mm, mmun_start, pmd, entry);
 	flush_tlb_range(vma, mmun_start, mmun_end);
@@ -1809,13 +1809,13 @@ fail_putback:
 		flush_tlb_range(vma, mmun_start, mmun_end);
 		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
 		update_mmu_cache_pmd(vma, address, &entry);
-		page_remove_rmap(new_page);
+		page_remove_rmap(new_page, true);
 		goto fail_putback;
 	}
 
 	mem_cgroup_migrate(page, new_page, false);
 
-	page_remove_rmap(page);
+	page_remove_rmap(page, true);
 
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
diff --git a/mm/rmap.c b/mm/rmap.c
index 9c045940ed10..51807d605ebf 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1046,6 +1046,7 @@ static void __page_check_anon_rmap(struct page *page,
  * @page:	the page to add the mapping to
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
+ * @compound:	charge the page as compound or small page
  *
  * The caller needs to hold the pte lock, and the page must be locked in
  * the anon_vma case: to serialize mapping,index checking after setting,
@@ -1053,9 +1054,9 @@ static void __page_check_anon_rmap(struct page *page,
  * (but PageKsm is never downgraded to PageAnon).
  */
 void page_add_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address)
+	struct vm_area_struct *vma, unsigned long address, bool compound)
 {
-	do_page_add_anon_rmap(page, vma, address, 0);
+	do_page_add_anon_rmap(page, vma, address, compound ? RMAP_COMPOUND : 0);
 }
 
 /*
@@ -1064,21 +1065,24 @@ void page_add_anon_rmap(struct page *page,
  * Everybody else should continue to use page_add_anon_rmap above.
  */
 void do_page_add_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address, int exclusive)
+	struct vm_area_struct *vma, unsigned long address, int flags)
 {
 	int first = atomic_inc_and_test(&page->_mapcount);
 	if (first) {
+		bool compound = flags & RMAP_COMPOUND;
+		int nr = compound ? hpage_nr_pages(page) : 1;
 		/*
 		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
 		 * these counters are not modified in interrupt context, and
 		 * pte lock(a spinlock) is held, which implies preemption
 		 * disabled.
 		 */
-		if (PageTransHuge(page))
+		if (compound) {
+			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 			__inc_zone_page_state(page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
-		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
-				hpage_nr_pages(page));
+		}
+		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
 	}
 	if (unlikely(PageKsm(page)))
 		return;
@@ -1086,7 +1090,8 @@ void do_page_add_anon_rmap(struct page *page,
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	/* address might be in next vma when migration races vma_adjust */
 	if (first)
-		__page_set_anon_rmap(page, vma, address, exclusive);
+		__page_set_anon_rmap(page, vma, address,
+				flags & RMAP_EXCLUSIVE);
 	else
 		__page_check_anon_rmap(page, vma, address);
 }
@@ -1096,21 +1101,25 @@ void do_page_add_anon_rmap(struct page *page,
  * @page:	the page to add the mapping to
  * @vma:	the vm area in which the mapping is added
  * @address:	the user virtual address mapped
+ * @compound:	charge the page as compound or small page
  *
  * Same as page_add_anon_rmap but must only be called on *new* pages.
  * This means the inc-and-test can be bypassed.
  * Page does not have to be locked.
  */
 void page_add_new_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address)
+	struct vm_area_struct *vma, unsigned long address, bool compound)
 {
+	int nr = compound ? hpage_nr_pages(page) : 1;
+
 	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
 	SetPageSwapBacked(page);
 	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
-	if (PageTransHuge(page))
+	if (compound) {
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
-	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
-			hpage_nr_pages(page));
+	}
+	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
 	__page_set_anon_rmap(page, vma, address, 1);
 }
 
@@ -1162,13 +1171,17 @@ out:
 
 /**
  * page_remove_rmap - take down pte mapping from a page
- * @page: page to remove mapping from
+ * @page:	page to remove mapping from
+ * @compound:	uncharge the page as compound or small page
  *
  * The caller needs to hold the pte lock.
  */
-void page_remove_rmap(struct page *page)
+void page_remove_rmap(struct page *page, bool compound)
 {
+	int nr = compound ? hpage_nr_pages(page) : 1;
+
 	if (!PageAnon(page)) {
+		VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
 		page_remove_file_rmap(page);
 		return;
 	}
@@ -1186,11 +1199,12 @@ void page_remove_rmap(struct page *page)
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	if (PageTransHuge(page))
+	if (compound) {
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+	}
 
-	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
-			      -hpage_nr_pages(page));
+	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
 
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
@@ -1332,7 +1346,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		dec_mm_counter(mm, MM_FILEPAGES);
 
 discard:
-	page_remove_rmap(page);
+	page_remove_rmap(page, false);
 	page_cache_release(page);
 
 out_unmap:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a7e72103f23b..65825c2687f5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1121,10 +1121,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	set_pte_at(vma->vm_mm, addr, pte,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	if (page == swapcache) {
-		page_add_anon_rmap(page, vma, addr);
+		page_add_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, true);
 	} else { /* ksm created a completely new copy */
-		page_add_new_anon_rmap(page, vma, addr);
+		page_add_new_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	}
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 77fee9325a57..ae21a1f309c2 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -76,7 +76,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 		goto out_release_uncharge_unlock;
 
 	inc_mm_counter(dst_mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, dst_vma, dst_addr);
+	page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
 	mem_cgroup_commit_charge(page, memcg, false);
 	lru_cache_add_active_or_unevictable(page, dst_vma);
 
-- 
2.1.4



* [PATCHv6 03/36] memcg: adjust to support new THP refcounting
@ 2015-06-03 17:05 ` Kirill A. Shutemov
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

As with rmap, with the new refcounting we cannot rely on PageTransHuge() to
check whether we need to charge the size of a huge page to the cgroup. We
need to get the information from the caller to know whether it was mapped
with a PMD or a PTE.

We uncharge when the last reference on the page is gone. At that point, if
we see PageTransHuge() it means we need to uncharge the whole huge page.

The tricky part is partial unmap -- when we try to unmap part of a huge
page. We don't do any special handling of this situation, meaning we don't
uncharge the part of the huge page unless the last user is gone or
split_huge_page() is triggered. In case cgroup memory pressure happens, the
partially unmapped page will be split through the shrinker. This should be
good enough.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/memcontrol.h | 16 +++++++-----
 kernel/events/uprobes.c    |  7 +++---
 mm/filemap.c               |  8 +++---
 mm/huge_memory.c           | 33 ++++++++++++------------
 mm/memcontrol.c            | 62 +++++++++++++++++-----------------------------
 mm/memory.c                | 28 ++++++++++-----------
 mm/shmem.c                 | 21 +++++++++-------
 mm/swapfile.c              |  9 ++++---
 mm/userfaultfd.c           |  6 ++---
 9 files changed, 92 insertions(+), 98 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6c8918114804..44e62357d876 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -74,10 +74,12 @@ void mem_cgroup_events(struct mem_cgroup *memcg,
 bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
 
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
-			  gfp_t gfp_mask, struct mem_cgroup **memcgp);
+			  gfp_t gfp_mask, struct mem_cgroup **memcgp,
+			  bool compound);
 void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
-			      bool lrucare);
-void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
+			      bool lrucare, bool compound);
+void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
+		bool compound);
 void mem_cgroup_uncharge(struct page *page);
 void mem_cgroup_uncharge_list(struct list_head *page_list);
 
@@ -209,7 +211,8 @@ static inline bool mem_cgroup_low(struct mem_cgroup *root,
 
 static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask,
-					struct mem_cgroup **memcgp)
+					struct mem_cgroup **memcgp,
+					bool compound)
 {
 	*memcgp = NULL;
 	return 0;
@@ -217,12 +220,13 @@ static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 
 static inline void mem_cgroup_commit_charge(struct page *page,
 					    struct mem_cgroup *memcg,
-					    bool lrucare)
+					    bool lrucare, bool compound)
 {
 }
 
 static inline void mem_cgroup_cancel_charge(struct page *page,
-					    struct mem_cgroup *memcg)
+					    struct mem_cgroup *memcg,
+					    bool compound)
 {
 }
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 5523daf59953..04e26bdf0717 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -169,7 +169,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	const unsigned long mmun_end   = addr + PAGE_SIZE;
 	struct mem_cgroup *memcg;
 
-	err = mem_cgroup_try_charge(kpage, vma->vm_mm, GFP_KERNEL, &memcg);
+	err = mem_cgroup_try_charge(kpage, vma->vm_mm, GFP_KERNEL, &memcg,
+			false);
 	if (err)
 		return err;
 
@@ -184,7 +185,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	get_page(kpage);
 	page_add_new_anon_rmap(kpage, vma, addr, false);
-	mem_cgroup_commit_charge(kpage, memcg, false);
+	mem_cgroup_commit_charge(kpage, memcg, false, false);
 	lru_cache_add_active_or_unevictable(kpage, vma);
 
 	if (!PageAnon(page)) {
@@ -207,7 +208,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	err = 0;
  unlock:
-	mem_cgroup_cancel_charge(kpage, memcg);
+	mem_cgroup_cancel_charge(kpage, memcg, false);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	unlock_page(page);
 	return err;
diff --git a/mm/filemap.c b/mm/filemap.c
index df533a10e8c3..cb41cf3069d2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -562,7 +562,7 @@ static int __add_to_page_cache_locked(struct page *page,
 
 	if (!huge) {
 		error = mem_cgroup_try_charge(page, current->mm,
-					      gfp_mask, &memcg);
+					      gfp_mask, &memcg, false);
 		if (error)
 			return error;
 	}
@@ -570,7 +570,7 @@ static int __add_to_page_cache_locked(struct page *page,
 	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error) {
 		if (!huge)
-			mem_cgroup_cancel_charge(page, memcg);
+			mem_cgroup_cancel_charge(page, memcg, false);
 		return error;
 	}
 
@@ -589,7 +589,7 @@ static int __add_to_page_cache_locked(struct page *page,
 		__inc_zone_page_state(page, NR_FILE_PAGES);
 	spin_unlock_irq(&mapping->tree_lock);
 	if (!huge)
-		mem_cgroup_commit_charge(page, memcg, false);
+		mem_cgroup_commit_charge(page, memcg, false, false);
 	trace_mm_filemap_add_to_page_cache(page);
 	return 0;
 err_insert:
@@ -597,7 +597,7 @@ err_insert:
 	/* Leave page->index set: truncation relies upon it */
 	spin_unlock_irq(&mapping->tree_lock);
 	if (!huge)
-		mem_cgroup_cancel_charge(page, memcg);
+		mem_cgroup_cancel_charge(page, memcg, false);
 	page_cache_release(page);
 	return error;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 310b0650abe0..1043b9b0659c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -727,7 +727,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
-	if (mem_cgroup_try_charge(page, mm, gfp, &memcg)) {
+	if (mem_cgroup_try_charge(page, mm, gfp, &memcg, true)) {
 		put_page(page);
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
@@ -735,7 +735,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 
 	pgtable = pte_alloc_one(mm, haddr);
 	if (unlikely(!pgtable)) {
-		mem_cgroup_cancel_charge(page, memcg);
+		mem_cgroup_cancel_charge(page, memcg, true);
 		put_page(page);
 		return VM_FAULT_OOM;
 	}
@@ -751,7 +751,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_none(*pmd))) {
 		spin_unlock(ptl);
-		mem_cgroup_cancel_charge(page, memcg);
+		mem_cgroup_cancel_charge(page, memcg, true);
 		put_page(page);
 		pte_free(mm, pgtable);
 	} else {
@@ -762,7 +762,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 			int ret;
 
 			spin_unlock(ptl);
-			mem_cgroup_cancel_charge(page, memcg);
+			mem_cgroup_cancel_charge(page, memcg, true);
 			put_page(page);
 			pte_free(mm, pgtable);
 			ret = handle_userfault(vma, haddr, flags,
@@ -774,7 +774,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		page_add_new_anon_rmap(page, vma, haddr, true);
-		mem_cgroup_commit_charge(page, memcg, false);
+		mem_cgroup_commit_charge(page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(mm, pmd, pgtable);
 		set_pmd_at(mm, haddr, pmd, entry);
@@ -1024,13 +1024,14 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					       vma, address, page_to_nid(page));
 		if (unlikely(!pages[i] ||
 			     mem_cgroup_try_charge(pages[i], mm, GFP_KERNEL,
-						   &memcg))) {
+						   &memcg, false))) {
 			if (pages[i])
 				put_page(pages[i]);
 			while (--i >= 0) {
 				memcg = (void *)page_private(pages[i]);
 				set_page_private(pages[i], 0);
-				mem_cgroup_cancel_charge(pages[i], memcg);
+				mem_cgroup_cancel_charge(pages[i], memcg,
+						false);
 				put_page(pages[i]);
 			}
 			kfree(pages);
@@ -1069,7 +1070,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
 		page_add_new_anon_rmap(pages[i], vma, haddr, false);
-		mem_cgroup_commit_charge(pages[i], memcg, false);
+		mem_cgroup_commit_charge(pages[i], memcg, false, false);
 		lru_cache_add_active_or_unevictable(pages[i], vma);
 		pte = pte_offset_map(&_pmd, haddr);
 		VM_BUG_ON(!pte_none(*pte));
@@ -1097,7 +1098,7 @@ out_free_pages:
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
-		mem_cgroup_cancel_charge(pages[i], memcg);
+		mem_cgroup_cancel_charge(pages[i], memcg, false);
 		put_page(pages[i]);
 	}
 	kfree(pages);
@@ -1163,7 +1164,8 @@ alloc:
 		goto out;
 	}
 
-	if (unlikely(mem_cgroup_try_charge(new_page, mm, huge_gfp, &memcg))) {
+	if (unlikely(mem_cgroup_try_charge(new_page, mm, huge_gfp,
+					&memcg, true))) {
 		put_page(new_page);
 		if (page) {
 			split_huge_page(page);
@@ -1192,7 +1194,7 @@ alloc:
 		put_user_huge_page(page);
 	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
 		spin_unlock(ptl);
-		mem_cgroup_cancel_charge(new_page, memcg);
+		mem_cgroup_cancel_charge(new_page, memcg, true);
 		put_page(new_page);
 		goto out_mn;
 	} else {
@@ -1201,7 +1203,7 @@ alloc:
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		pmdp_huge_clear_flush_notify(vma, haddr, pmd);
 		page_add_new_anon_rmap(new_page, vma, haddr, true);
-		mem_cgroup_commit_charge(new_page, memcg, false);
+		mem_cgroup_commit_charge(new_page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		set_pmd_at(mm, haddr, pmd, entry);
 		update_mmu_cache_pmd(vma, address, pmd);
@@ -2519,8 +2521,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	if (!new_page)
 		return;
 
-	if (unlikely(mem_cgroup_try_charge(new_page, mm,
-					   gfp, &memcg)))
+	if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg, true)))
 		return;
 
 	/*
@@ -2607,7 +2608,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
 	page_add_new_anon_rmap(new_page, vma, address, true);
-	mem_cgroup_commit_charge(new_page, memcg, false);
+	mem_cgroup_commit_charge(new_page, memcg, false, true);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
@@ -2622,7 +2623,7 @@ out_up_write:
 	return;
 
 out:
-	mem_cgroup_cancel_charge(new_page, memcg);
+	mem_cgroup_cancel_charge(new_page, memcg, true);
 	goto out_up_write;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5fd273d22714..1c7c6bb69d3c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -827,7 +827,7 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 					 struct page *page,
-					 int nr_pages)
+					 bool compound, int nr_pages)
 {
 	/*
 	 * Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
@@ -840,9 +840,11 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_CACHE],
 				nr_pages);
 
-	if (PageTransHuge(page))
+	if (compound) {
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
 				nr_pages);
+	}
 
 	/* pagein of a big page is an event. So, ignore page size */
 	if (nr_pages > 0)
@@ -4744,30 +4746,24 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
  * from old cgroup.
  */
 static int mem_cgroup_move_account(struct page *page,
-				   unsigned int nr_pages,
+				   bool compound,
 				   struct mem_cgroup *from,
 				   struct mem_cgroup *to)
 {
 	unsigned long flags;
+	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
 	int ret;
 
 	VM_BUG_ON(from == to);
 	VM_BUG_ON_PAGE(PageLRU(page), page);
-	/*
-	 * The page is isolated from LRU. So, collapse function
-	 * will not handle this page. But page splitting can happen.
-	 * Do this check under compound_page_lock(). The caller should
-	 * hold it.
-	 */
-	ret = -EBUSY;
-	if (nr_pages > 1 && !PageTransHuge(page))
-		goto out;
+	VM_BUG_ON(compound && !PageTransHuge(page));
 
 	/*
 	 * Prevent mem_cgroup_migrate() from looking at page->mem_cgroup
 	 * of its source page while we change it: page migration takes
 	 * both pages off the LRU, but page cache replacement doesn't.
 	 */
+	ret = -EBUSY;
 	if (!trylock_page(page))
 		goto out;
 
@@ -4804,9 +4800,9 @@ static int mem_cgroup_move_account(struct page *page,
 	ret = 0;
 
 	local_irq_disable();
-	mem_cgroup_charge_statistics(to, page, nr_pages);
+	mem_cgroup_charge_statistics(to, page, compound, nr_pages);
 	memcg_check_events(to, page);
-	mem_cgroup_charge_statistics(from, page, -nr_pages);
+	mem_cgroup_charge_statistics(from, page, compound, -nr_pages);
 	memcg_check_events(from, page);
 	local_irq_enable();
 out_unlock:
@@ -5083,7 +5079,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		if (target_type == MC_TARGET_PAGE) {
 			page = target.page;
 			if (!isolate_lru_page(page)) {
-				if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
+				if (!mem_cgroup_move_account(page, true,
 							     mc.from, mc.to)) {
 					mc.precharge -= HPAGE_PMD_NR;
 					mc.moved_charge += HPAGE_PMD_NR;
@@ -5112,7 +5108,8 @@ retry:
 			page = target.page;
 			if (isolate_lru_page(page))
 				goto put;
-			if (!mem_cgroup_move_account(page, 1, mc.from, mc.to)) {
+			if (!mem_cgroup_move_account(page, false,
+						mc.from, mc.to)) {
 				mc.precharge--;
 				/* we uncharge from mc.from later. */
 				mc.moved_charge++;
@@ -5460,10 +5457,11 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
  * with mem_cgroup_cancel_charge() in case page instantiation fails.
  */
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
-			  gfp_t gfp_mask, struct mem_cgroup **memcgp)
+			  gfp_t gfp_mask, struct mem_cgroup **memcgp,
+			  bool compound)
 {
 	struct mem_cgroup *memcg = NULL;
-	unsigned int nr_pages = 1;
+	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
 	int ret = 0;
 
 	if (mem_cgroup_disabled())
@@ -5481,11 +5479,6 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 			goto out;
 	}
 
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-	}
-
 	if (do_swap_account && PageSwapCache(page))
 		memcg = try_get_mem_cgroup_from_page(page);
 	if (!memcg)
@@ -5521,9 +5514,9 @@ out:
  * Use mem_cgroup_cancel_charge() to cancel the transaction instead.
  */
 void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
-			      bool lrucare)
+			      bool lrucare, bool compound)
 {
-	unsigned int nr_pages = 1;
+	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
 
 	VM_BUG_ON_PAGE(!page->mapping, page);
 	VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
@@ -5540,13 +5533,8 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 
 	commit_charge(page, memcg, lrucare);
 
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-	}
-
 	local_irq_disable();
-	mem_cgroup_charge_statistics(memcg, page, nr_pages);
+	mem_cgroup_charge_statistics(memcg, page, compound, nr_pages);
 	memcg_check_events(memcg, page);
 	local_irq_enable();
 
@@ -5568,9 +5556,10 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
  *
  * Cancel a charge transaction started by mem_cgroup_try_charge().
  */
-void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
+void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
+		bool compound)
 {
-	unsigned int nr_pages = 1;
+	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
 
 	if (mem_cgroup_disabled())
 		return;
@@ -5582,11 +5571,6 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
 	if (!memcg)
 		return;
 
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-	}
-
 	cancel_charge(memcg, nr_pages);
 }
 
@@ -5840,7 +5824,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	/* XXX: caller holds IRQ-safe mapping->tree_lock */
 	VM_BUG_ON(!irqs_disabled());
 
-	mem_cgroup_charge_statistics(memcg, page, -1);
+	mem_cgroup_charge_statistics(memcg, page, false, -1);
 	memcg_check_events(memcg, page);
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 4cd6d9392004..f48dfa6d8859 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2083,7 +2083,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		cow_user_page(new_page, old_page, address, vma);
 	}
 
-	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg))
+	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false))
 		goto oom_free_new;
 
 	__SetPageUptodate(new_page);
@@ -2114,7 +2114,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		ptep_clear_flush_notify(vma, address, page_table);
 		page_add_new_anon_rmap(new_page, vma, address, false);
-		mem_cgroup_commit_charge(new_page, memcg, false);
+		mem_cgroup_commit_charge(new_page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		/*
 		 * We call the notify macro here because, when using secondary
@@ -2153,7 +2153,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		new_page = old_page;
 		page_copied = 1;
 	} else {
-		mem_cgroup_cancel_charge(new_page, memcg);
+		mem_cgroup_cancel_charge(new_page, memcg, false);
 	}
 
 	if (new_page)
@@ -2528,7 +2528,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_page;
 	}
 
-	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg)) {
+	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false)) {
 		ret = VM_FAULT_OOM;
 		goto out_page;
 	}
@@ -2570,10 +2570,10 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	set_pte_at(mm, address, page_table, pte);
 	if (page == swapcache) {
 		do_page_add_anon_rmap(page, vma, address, exclusive);
-		mem_cgroup_commit_charge(page, memcg, true);
+		mem_cgroup_commit_charge(page, memcg, true, false);
 	} else { /* ksm created a completely new copy */
 		page_add_new_anon_rmap(page, vma, address, false);
-		mem_cgroup_commit_charge(page, memcg, false);
+		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	}
 
@@ -2608,7 +2608,7 @@ unlock:
 out:
 	return ret;
 out_nomap:
-	mem_cgroup_cancel_charge(page, memcg);
+	mem_cgroup_cancel_charge(page, memcg, false);
 	pte_unmap_unlock(page_table, ptl);
 out_page:
 	unlock_page(page);
@@ -2698,7 +2698,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!page)
 		goto oom;
 
-	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
+	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false))
 		goto oom_free_page;
 
 	/*
@@ -2719,7 +2719,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Deliver the page fault to userland, check inside PT lock */
 	if (userfaultfd_missing(vma)) {
 		pte_unmap_unlock(page_table, ptl);
-		mem_cgroup_cancel_charge(page, memcg);
+		mem_cgroup_cancel_charge(page, memcg, false);
 		page_cache_release(page);
 		return handle_userfault(vma, address, flags,
 					VM_UFFD_MISSING);
@@ -2727,7 +2727,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, address, false);
-	mem_cgroup_commit_charge(page, memcg, false);
+	mem_cgroup_commit_charge(page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(page, vma);
 setpte:
 	set_pte_at(mm, address, page_table, entry);
@@ -2738,7 +2738,7 @@ unlock:
 	pte_unmap_unlock(page_table, ptl);
 	return 0;
 release:
-	mem_cgroup_cancel_charge(page, memcg);
+	mem_cgroup_cancel_charge(page, memcg, false);
 	page_cache_release(page);
 	goto unlock;
 oom_free_page:
@@ -2989,7 +2989,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!new_page)
 		return VM_FAULT_OOM;
 
-	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg)) {
+	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false)) {
 		page_cache_release(new_page);
 		return VM_FAULT_OOM;
 	}
@@ -3018,7 +3018,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto uncharge_out;
 	}
 	do_set_pte(vma, address, new_page, pte, true, true);
-	mem_cgroup_commit_charge(new_page, memcg, false);
+	mem_cgroup_commit_charge(new_page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pte_unmap_unlock(pte, ptl);
 	if (fault_page) {
@@ -3033,7 +3033,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	return ret;
 uncharge_out:
-	mem_cgroup_cancel_charge(new_page, memcg);
+	mem_cgroup_cancel_charge(new_page, memcg, false);
 	page_cache_release(new_page);
 	return ret;
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index e001168d340a..41c648485161 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -706,7 +706,8 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
 	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
 	 * Charged back to the user (not to caller) when swap account is used.
 	 */
-	error = mem_cgroup_try_charge(page, current->mm, GFP_KERNEL, &memcg);
+	error = mem_cgroup_try_charge(page, current->mm, GFP_KERNEL, &memcg,
+			false);
 	if (error)
 		goto out;
 	/* No radix_tree_preload: swap entry keeps a place for page in tree */
@@ -729,9 +730,9 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
 	if (error) {
 		if (error != -ENOMEM)
 			error = 0;
-		mem_cgroup_cancel_charge(page, memcg);
+		mem_cgroup_cancel_charge(page, memcg, false);
 	} else
-		mem_cgroup_commit_charge(page, memcg, true);
+		mem_cgroup_commit_charge(page, memcg, true, false);
 out:
 	unlock_page(page);
 	page_cache_release(page);
@@ -1114,7 +1115,8 @@ repeat:
 				goto failed;
 		}
 
-		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
+		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg,
+				false);
 		if (!error) {
 			error = shmem_add_to_page_cache(page, mapping, index,
 						swp_to_radix_entry(swap));
@@ -1131,14 +1133,14 @@ repeat:
 			 * "repeat": reading a hole and writing should succeed.
 			 */
 			if (error) {
-				mem_cgroup_cancel_charge(page, memcg);
+				mem_cgroup_cancel_charge(page, memcg, false);
 				delete_from_swap_cache(page);
 			}
 		}
 		if (error)
 			goto failed;
 
-		mem_cgroup_commit_charge(page, memcg, true);
+		mem_cgroup_commit_charge(page, memcg, true, false);
 
 		spin_lock(&info->lock);
 		info->swapped--;
@@ -1177,7 +1179,8 @@ repeat:
 		if (sgp == SGP_WRITE)
 			__SetPageReferenced(page);
 
-		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
+		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg,
+				false);
 		if (error)
 			goto decused;
 		error = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK);
@@ -1187,10 +1190,10 @@ repeat:
 			radix_tree_preload_end();
 		}
 		if (error) {
-			mem_cgroup_cancel_charge(page, memcg);
+			mem_cgroup_cancel_charge(page, memcg, false);
 			goto decused;
 		}
-		mem_cgroup_commit_charge(page, memcg, false);
+		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_anon(page);
 
 		spin_lock(&info->lock);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 65825c2687f5..6dd365d1c488 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1103,14 +1103,15 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	if (unlikely(!page))
 		return -ENOMEM;
 
-	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg)) {
+	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
+	{
 		ret = -ENOMEM;
 		goto out_nolock;
 	}
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	if (unlikely(!maybe_same_pte(*pte, swp_entry_to_pte(entry)))) {
-		mem_cgroup_cancel_charge(page, memcg);
+		mem_cgroup_cancel_charge(page, memcg, false);
 		ret = 0;
 		goto out;
 	}
@@ -1122,10 +1123,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	if (page == swapcache) {
 		page_add_anon_rmap(page, vma, addr, false);
-		mem_cgroup_commit_charge(page, memcg, true);
+		mem_cgroup_commit_charge(page, memcg, true, false);
 	} else { /* ksm created a completely new copy */
 		page_add_new_anon_rmap(page, vma, addr, false);
-		mem_cgroup_commit_charge(page, memcg, false);
+		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	}
 	swap_free(entry);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index ae21a1f309c2..806b0c758c5b 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -63,7 +63,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 	__SetPageUptodate(page);
 
 	ret = -ENOMEM;
-	if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg))
+	if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg, false))
 		goto out_release;
 
 	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
@@ -77,7 +77,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 
 	inc_mm_counter(dst_mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
-	mem_cgroup_commit_charge(page, memcg, false);
+	mem_cgroup_commit_charge(page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(page, dst_vma);
 
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
@@ -91,7 +91,7 @@ out:
 	return ret;
 out_release_uncharge_unlock:
 	pte_unmap_unlock(dst_pte, ptl);
-	mem_cgroup_cancel_charge(page, memcg);
+	mem_cgroup_cancel_charge(page, memcg, false);
 out_release:
 	page_cache_release(page);
 	goto out;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 04/36] mm, thp: adjust conditions when we can reuse the page on WP fault
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (2 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 03/36] memcg: adjust to support new THP refcounting Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 05/36] mm: adjust FOLL_SPLIT for new refcounting Kirill A. Shutemov
                   ` (32 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting we will be able to map the same compound page
with both PTEs and PMDs. This requires adjusting the conditions under
which we can reuse the page on a write-protection fault.

For a PTE fault we can't reuse the page if it's part of a huge page.

For a PMD we can only reuse the page if nobody else maps the huge page
or its part. We could do that by checking page_mapcount() on each
sub-page, but it's expensive.

The cheaper way is to check that page_count() is equal to 1: every
mapcount takes a page reference, so this guarantees that the PMD is the
only mapping.

This approach can give a false negative if somebody pinned the page, but
that doesn't affect correctness.
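
To make the reasoning concrete, here is a minimal sketch (not part of
the patch; the helper name is hypothetical) of the test the hunk below
adds to do_huge_pmd_wp_page():

	/*
	 * Hypothetical helper: each mapcount contributes one page
	 * reference, so mapcount == 1 together with page_count() == 1
	 * means our PMD is the only mapping and nobody holds an extra pin.
	 */
	static inline bool can_reuse_pmd_mapped_page(struct page *page)
	{
		return page_mapcount(page) == 1 && page_count(page) == 1;
	}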

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/swap.h |  3 ++-
 mm/huge_memory.c     | 12 +++++++++++-
 mm/swapfile.c        |  3 +++
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0428e4c84e1d..17cdd6b9456b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -524,7 +524,8 @@ static inline int page_swapcount(struct page *page)
 	return 0;
 }
 
-#define reuse_swap_page(page)	(page_mapcount(page) == 1)
+#define reuse_swap_page(page) \
+	(!PageTransCompound(page) && page_mapcount(page) == 1)
 
 static inline int try_to_free_swap(struct page *page)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1043b9b0659c..6035cc92be6b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1128,7 +1128,17 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	page = pmd_page(orig_pmd);
 	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
-	if (page_mapcount(page) == 1) {
+	/*
+	 * We can only reuse the page if nobody else maps the huge page or
+	 * its part. We could do that by checking page_mapcount() on each
+	 * sub-page, but it's expensive.
+	 * The cheaper way is to check that page_count() is equal to 1:
+	 * every mapcount takes a page reference, so this guarantees that
+	 * the PMD is the only mapping.
+	 * This can give a false negative if somebody pinned the page, but
+	 * that's fine.
+	 */
+	if (page_mapcount(page) == 1 && page_count(page) == 1) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6dd365d1c488..3cd5f188b996 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -887,6 +887,9 @@ int reuse_swap_page(struct page *page)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	if (unlikely(PageKsm(page)))
 		return 0;
+	/* The page is part of THP and cannot be reused */
+	if (PageTransCompound(page))
+		return 0;
 	count = page_mapcount(page);
 	if (count <= 1 && PageSwapCache(page)) {
 		count += page_swapcount(page);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 05/36] mm: adjust FOLL_SPLIT for new refcounting
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (3 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 04/36] mm, thp: adjust conditions when we can reuse the page on WP fault Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 06/36] mm: handle PTE-mapped tail pages in generic fast gup implementation Kirill A. Shutemov
                   ` (31 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We need to prepare the kernel to allow transhuge pages to be mapped with
ptes too. That means handling FOLL_SPLIT in follow_page_pte().

We also use split_huge_page() directly instead of split_huge_page_pmd(),
since split_huge_page_pmd() is going away.
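
For reference, a caller-side sketch (not taken from this patch; error
handling simplified, caller assumed to hold mmap_sem) of how FOLL_SPLIT
is meant to be used together with follow_page():

	struct page *page;

	/* take a reference and split any THP covering the address first */
	page = follow_page(vma, addr, FOLL_GET | FOLL_SPLIT);
	if (IS_ERR_OR_NULL(page))
		return -EFAULT;	/* nothing mapped or the split failed */
	/* 'page' is a small page now; the PMD, if any, has been split */
	put_page(page);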

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/gup.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 49 insertions(+), 18 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 6297f6bccfb1..2dd6706ea29f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -79,6 +79,19 @@ retry:
 		page = pte_page(pte);
 	}
 
+	if (flags & FOLL_SPLIT && PageTransCompound(page)) {
+		int ret;
+		get_page(page);
+		pte_unmap_unlock(ptep, ptl);
+		lock_page(page);
+		ret = split_huge_page(page);
+		unlock_page(page);
+		put_page(page);
+		if (ret)
+			return ERR_PTR(ret);
+		goto retry;
+	}
+
 	if (flags & FOLL_GET)
 		get_page_foll(page);
 	if (flags & FOLL_TOUCH) {
@@ -186,27 +199,45 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 	}
 	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
 		return no_page_table(vma, flags);
-	if (pmd_trans_huge(*pmd)) {
-		if (flags & FOLL_SPLIT) {
+	if (likely(!pmd_trans_huge(*pmd)))
+		return follow_page_pte(vma, address, pmd, flags);
+
+	ptl = pmd_lock(mm, pmd);
+	if (unlikely(!pmd_trans_huge(*pmd))) {
+		spin_unlock(ptl);
+		return follow_page_pte(vma, address, pmd, flags);
+	}
+
+	if (unlikely(pmd_trans_splitting(*pmd))) {
+		spin_unlock(ptl);
+		wait_split_huge_page(vma->anon_vma, pmd);
+		return follow_page_pte(vma, address, pmd, flags);
+	}
+
+	if (flags & FOLL_SPLIT) {
+		int ret;
+		page = pmd_page(*pmd);
+		if (is_huge_zero_page(page)) {
+			spin_unlock(ptl);
+			ret = 0;
 			split_huge_page_pmd(vma, address, pmd);
-			return follow_page_pte(vma, address, pmd, flags);
-		}
-		ptl = pmd_lock(mm, pmd);
-		if (likely(pmd_trans_huge(*pmd))) {
-			if (unlikely(pmd_trans_splitting(*pmd))) {
-				spin_unlock(ptl);
-				wait_split_huge_page(vma->anon_vma, pmd);
-			} else {
-				page = follow_trans_huge_pmd(vma, address,
-							     pmd, flags);
-				spin_unlock(ptl);
-				*page_mask = HPAGE_PMD_NR - 1;
-				return page;
-			}
-		} else
+		} else {
+			get_page(page);
 			spin_unlock(ptl);
+			lock_page(page);
+			ret = split_huge_page(page);
+			unlock_page(page);
+			put_page(page);
+		}
+
+		return ret ? ERR_PTR(ret) :
+			follow_page_pte(vma, address, pmd, flags);
 	}
-	return follow_page_pte(vma, address, pmd, flags);
+
+	page = follow_trans_huge_pmd(vma, address, pmd, flags);
+	spin_unlock(ptl);
+	*page_mask = HPAGE_PMD_NR - 1;
+	return page;
 }
 
 static int get_gate_page(struct mm_struct *mm, unsigned long address,
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 06/36] mm: handle PTE-mapped tail pages in generic fast gup implementation
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (4 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 05/36] mm: adjust FOLL_SPLIT for new refcounting Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 07/36] thp, mlock: do not allow huge pages in mlocked area Kirill A. Shutemov
                   ` (30 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting we are going to see THP tail pages mapped with
PTEs. Generic fast GUP relies on page_cache_get_speculative() to obtain
a reference on the page, and page_cache_get_speculative() always fails
on tail pages, because ->_count on tail pages is always zero.

Let's handle tail pages in gup_pte_range().

The new split_huge_page() will rely on migration entries to freeze the
page's counts. Rechecking the PTE value after page_cache_get_speculative()
on the head page should be enough to serialize against the split.
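
A rough illustration of the serialization (a sketch, not new code from
the patch): if split_huge_page() starts freezing the counts between our
READ_ONCE(*ptep) and page_cache_get_speculative(head), it will have
replaced the PTE with a migration entry, so the recheck

	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
		put_page(head);
		goto pte_unmap;
	}

notices the change and backs off instead of returning a page whose
compound page is in the middle of being split.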

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/gup.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 2dd6706ea29f..d975bdc2465f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1051,7 +1051,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		 * for an example see gup_get_pte in arch/x86/mm/gup.c
 		 */
 		pte_t pte = READ_ONCE(*ptep);
-		struct page *page;
+		struct page *head, *page;
 
 		/*
 		 * Similar to the PMD case below, NUMA hinting must take slow
@@ -1063,15 +1063,17 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
+		head = compound_head(page);
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(head))
 			goto pte_unmap;
 
 		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-			put_page(page);
+			put_page(head);
 			goto pte_unmap;
 		}
 
+		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
 		(*nr)++;
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 07/36] thp, mlock: do not allow huge pages in mlocked area
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (5 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 06/36] mm: handle PTE-mapped tail pages in generic fast gup implementation Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 08/36] khugepaged: ignore pmd tables with THP mapped with ptes Kirill A. Shutemov
                   ` (29 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting a THP can belong to several VMAs. This makes it
tricky to track THP pages when they are partially mlocked; it can lead
to leaking mlocked pages into non-VM_LOCKED vmas and other problems.

With this patch we split all pages on mlock and avoid faulting in or
collapsing new THP in VM_LOCKED vmas.

I've tried an alternative approach: do not mark THP pages mlocked and
keep them on normal LRUs. This way vmscan could try to split huge pages
under memory pressure and free up the subpages which don't belong to
VM_LOCKED vmas. But this is a user-visible change: it screws up the
Mlocked accounting reported in meminfo, so I had to leave this approach
aside.

We can bring something better later, but this should be good enough for
now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
---
 mm/gup.c         |  2 ++
 mm/huge_memory.c |  5 ++++-
 mm/memory.c      |  3 ++-
 mm/mlock.c       | 51 +++++++++++++++++++--------------------------------
 4 files changed, 27 insertions(+), 34 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index d975bdc2465f..2c2fa62f54f0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -882,6 +882,8 @@ long populate_vma_page_range(struct vm_area_struct *vma,
 	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
 
 	gup_flags = FOLL_TOUCH | FOLL_POPULATE;
+	if (vma->vm_flags & VM_LOCKED)
+		gup_flags |= FOLL_SPLIT;
 	/*
 	 * We want to touch writable mappings with a write fault in order
 	 * to break COW, except for shared mappings because these don't COW
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6035cc92be6b..4ad975506c1b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -815,6 +815,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
 		return VM_FAULT_FALLBACK;
+	if (vma->vm_flags & VM_LOCKED)
+		return VM_FAULT_FALLBACK;
 	if (unlikely(anon_vma_prepare(vma)))
 		return VM_FAULT_OOM;
 	if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
@@ -2493,7 +2495,8 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
 	if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
 	    (vma->vm_flags & VM_NOHUGEPAGE))
 		return false;
-
+	if (vma->vm_flags & VM_LOCKED)
+		return false;
 	if (!vma->anon_vma || vma->vm_ops)
 		return false;
 	if (is_vma_temporary_stack(vma))
diff --git a/mm/memory.c b/mm/memory.c
index f48dfa6d8859..8067c47ecf14 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2161,7 +2161,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	pte_unmap_unlock(page_table, ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-	if (old_page) {
+	/* THP pages are never mlocked */
+	if (old_page && !PageTransCompound(old_page)) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
 		 * keep the mlocked page.
diff --git a/mm/mlock.c b/mm/mlock.c
index 25936680064f..1d47b98407de 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -443,39 +443,26 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 		page = follow_page_mask(vma, start, FOLL_GET | FOLL_DUMP,
 				&page_mask);
 
-		if (page && !IS_ERR(page)) {
-			if (PageTransHuge(page)) {
-				lock_page(page);
-				/*
-				 * Any THP page found by follow_page_mask() may
-				 * have gotten split before reaching
-				 * munlock_vma_page(), so we need to recompute
-				 * the page_mask here.
-				 */
-				page_mask = munlock_vma_page(page);
-				unlock_page(page);
-				put_page(page); /* follow_page_mask() */
-			} else {
-				/*
-				 * Non-huge pages are handled in batches via
-				 * pagevec. The pin from follow_page_mask()
-				 * prevents them from collapsing by THP.
-				 */
-				pagevec_add(&pvec, page);
-				zone = page_zone(page);
-				zoneid = page_zone_id(page);
+		if (page && !IS_ERR(page) && !PageTransCompound(page)) {
+			/*
+			 * Non-huge pages are handled in batches via
+			 * pagevec. The pin from follow_page_mask()
+			 * prevents them from collapsing by THP.
+			 */
+			pagevec_add(&pvec, page);
+			zone = page_zone(page);
+			zoneid = page_zone_id(page);
 
-				/*
-				 * Try to fill the rest of pagevec using fast
-				 * pte walk. This will also update start to
-				 * the next page to process. Then munlock the
-				 * pagevec.
-				 */
-				start = __munlock_pagevec_fill(&pvec, vma,
-						zoneid, start, end);
-				__munlock_pagevec(&pvec, zone);
-				goto next;
-			}
+			/*
+			 * Try to fill the rest of pagevec using fast
+			 * pte walk. This will also update start to
+			 * the next page to process. Then munlock the
+			 * pagevec.
+			 */
+			start = __munlock_pagevec_fill(&pvec, vma,
+					zoneid, start, end);
+			__munlock_pagevec(&pvec, zone);
+			goto next;
 		}
 		/* It's a bug to munlock in the middle of a THP page */
 		VM_BUG_ON((start >> PAGE_SHIFT) & page_mask);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 08/36] khugepaged: ignore pmd tables with THP mapped with ptes
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (6 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 07/36] thp, mlock: do not allow huge pages in mlocked area Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 09/36] thp: rename split_huge_page_pmd() to split_huge_pmd() Kirill A. Shutemov
                   ` (28 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

Prepare khugepaged to see compound pages mapped with ptes. For now we
won't collapse a pmd table containing such a pte.

khugepaged is subject to future rework wrt the new refcounting.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/huge_memory.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4ad975506c1b..a5423bee0109 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2680,6 +2680,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		page = vm_normal_page(vma, _address, pteval);
 		if (unlikely(!page))
 			goto out_unmap;
+
+		/* TODO: teach khugepaged to collapse THP mapped with pte */
+		if (PageCompound(page))
+			goto out_unmap;
+
 		/*
 		 * Record which node the original page is from and save this
 		 * information to khugepaged_node_load[].
@@ -2690,7 +2695,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		if (khugepaged_scan_abort(node))
 			goto out_unmap;
 		khugepaged_node_load[node]++;
-		VM_BUG_ON_PAGE(PageCompound(page), page);
 		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
 			goto out_unmap;
 		/*
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 09/36] thp: rename split_huge_page_pmd() to split_huge_pmd()
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (7 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 08/36] khugepaged: ignore pmd tables with THP mapped with ptes Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 10/36] mm, vmstats: new THP splitting event Kirill A. Shutemov
                   ` (27 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We are going to decouple splitting a THP PMD from splitting the
underlying compound page.

This patch renames the split_huge_page_pmd*() functions to
split_huge_pmd*() to reflect the fact that they only split the PMD, not
the page.
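
The mechanical call-site conversion looks like this (an illustrative
before/after; note the swapped order of the last two arguments):

	/* before */
	split_huge_page_pmd(vma, addr, pmd);
	/* after: splits only the PMD, never the compound page */
	split_huge_pmd(vma, pmd, addr);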

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 arch/powerpc/mm/subpage-prot.c |  2 +-
 arch/x86/kernel/vm86_32.c      |  6 +++++-
 include/linux/huge_mm.h        |  8 ++------
 mm/gup.c                       |  2 +-
 mm/huge_memory.c               | 32 +++++++++++---------------------
 mm/madvise.c                   |  2 +-
 mm/memory.c                    |  2 +-
 mm/mempolicy.c                 |  2 +-
 mm/mprotect.c                  |  2 +-
 mm/mremap.c                    |  2 +-
 mm/pagewalk.c                  |  2 +-
 11 files changed, 26 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/mm/subpage-prot.c b/arch/powerpc/mm/subpage-prot.c
index fa9fb5b4c66c..d5543514c1df 100644
--- a/arch/powerpc/mm/subpage-prot.c
+++ b/arch/powerpc/mm/subpage-prot.c
@@ -135,7 +135,7 @@ static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr,
 				  unsigned long end, struct mm_walk *walk)
 {
 	struct vm_area_struct *vma = walk->vma;
-	split_huge_page_pmd(vma, addr, pmd);
+	split_huge_pmd(vma, pmd, addr);
 	return 0;
 }
 
diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index e8edcf52e069..883160599965 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -182,7 +182,11 @@ static void mark_screen_rdonly(struct mm_struct *mm)
 	if (pud_none_or_clear_bad(pud))
 		goto out;
 	pmd = pmd_offset(pud, 0xA0000);
-	split_huge_page_pmd_mm(mm, 0xA0000, pmd);
+
+	if (pmd_trans_huge(*pmd)) {
+		struct vm_area_struct *vma = find_vma(mm, 0xA0000);
+		split_huge_pmd(vma, pmd, 0xA0000);
+	}
 	if (pmd_none_or_clear_bad(pmd))
 		goto out;
 	pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 44a840a53974..34bbf769d52e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -104,7 +104,7 @@ static inline int split_huge_page(struct page *page)
 }
 extern void __split_huge_page_pmd(struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd);
-#define split_huge_page_pmd(__vma, __address, __pmd)			\
+#define split_huge_pmd(__vma, __pmd, __address)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
 		if (unlikely(pmd_trans_huge(*____pmd)))			\
@@ -119,8 +119,6 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
 		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
 		       pmd_trans_huge(*____pmd));			\
 	} while (0)
-extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd);
 #if HPAGE_PMD_ORDER >= MAX_ORDER
 #error "hugepages can't be allocated by the buddy allocator"
 #endif
@@ -187,11 +185,9 @@ static inline int split_huge_page(struct page *page)
 {
 	return 0;
 }
-#define split_huge_page_pmd(__vma, __address, __pmd)	\
-	do { } while (0)
 #define wait_split_huge_page(__anon_vma, __pmd)	\
 	do { } while (0)
-#define split_huge_page_pmd_mm(__mm, __address, __pmd)	\
+#define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
 static inline int hugepage_madvise(struct vm_area_struct *vma,
 				   unsigned long *vm_flags, int advice)
diff --git a/mm/gup.c b/mm/gup.c
index 2c2fa62f54f0..5c44a3d87856 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -220,7 +220,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 		if (is_huge_zero_page(page)) {
 			spin_unlock(ptl);
 			ret = 0;
-			split_huge_page_pmd(vma, address, pmd);
+			split_huge_pmd(vma, pmd, address);
 		} else {
 			get_page(page);
 			spin_unlock(ptl);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a5423bee0109..2374dbb57c44 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1161,13 +1161,13 @@ alloc:
 
 	if (unlikely(!new_page)) {
 		if (!page) {
-			split_huge_page_pmd(vma, address, pmd);
+			split_huge_pmd(vma, pmd, address);
 			ret |= VM_FAULT_FALLBACK;
 		} else {
 			ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
 					pmd, orig_pmd, page, haddr);
 			if (ret & VM_FAULT_OOM) {
-				split_huge_page(page);
+				split_huge_pmd(vma, pmd, address);
 				ret |= VM_FAULT_FALLBACK;
 			}
 			put_user_huge_page(page);
@@ -1180,10 +1180,10 @@ alloc:
 					&memcg, true))) {
 		put_page(new_page);
 		if (page) {
-			split_huge_page(page);
+			split_huge_pmd(vma, pmd, address);
 			put_user_huge_page(page);
 		} else
-			split_huge_page_pmd(vma, address, pmd);
+			split_huge_pmd(vma, pmd, address);
 		ret |= VM_FAULT_FALLBACK;
 		count_vm_event(THP_FAULT_FALLBACK);
 		goto out;
@@ -3010,17 +3010,7 @@ again:
 		goto again;
 }
 
-void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd)
-{
-	struct vm_area_struct *vma;
-
-	vma = find_vma(mm, address);
-	BUG_ON(vma == NULL);
-	split_huge_page_pmd(vma, address, pmd);
-}
-
-static void split_huge_page_address(struct mm_struct *mm,
+static void split_huge_pmd_address(struct vm_area_struct *vma,
 				    unsigned long address)
 {
 	pgd_t *pgd;
@@ -3029,7 +3019,7 @@ static void split_huge_page_address(struct mm_struct *mm,
 
 	VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));
 
-	pgd = pgd_offset(mm, address);
+	pgd = pgd_offset(vma->vm_mm, address);
 	if (!pgd_present(*pgd))
 		return;
 
@@ -3038,13 +3028,13 @@ static void split_huge_page_address(struct mm_struct *mm,
 		return;
 
 	pmd = pmd_offset(pud, address);
-	if (!pmd_present(*pmd))
+	if (!pmd_present(*pmd) || !pmd_trans_huge(*pmd))
 		return;
 	/*
 	 * Caller holds the mmap_sem write mode, so a huge pmd cannot
 	 * materialize from under us.
 	 */
-	split_huge_page_pmd_mm(mm, address, pmd);
+	__split_huge_page_pmd(vma, address, pmd);
 }
 
 void __vma_adjust_trans_huge(struct vm_area_struct *vma,
@@ -3060,7 +3050,7 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 	if (start & ~HPAGE_PMD_MASK &&
 	    (start & HPAGE_PMD_MASK) >= vma->vm_start &&
 	    (start & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
-		split_huge_page_address(vma->vm_mm, start);
+		split_huge_pmd_address(vma, start);
 
 	/*
 	 * If the new end address isn't hpage aligned and it could
@@ -3070,7 +3060,7 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 	if (end & ~HPAGE_PMD_MASK &&
 	    (end & HPAGE_PMD_MASK) >= vma->vm_start &&
 	    (end & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
-		split_huge_page_address(vma->vm_mm, end);
+		split_huge_pmd_address(vma, end);
 
 	/*
 	 * If we're also updating the vma->vm_next->vm_start, if the new
@@ -3084,6 +3074,6 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 		if (nstart & ~HPAGE_PMD_MASK &&
 		    (nstart & HPAGE_PMD_MASK) >= next->vm_start &&
 		    (nstart & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= next->vm_end)
-			split_huge_page_address(next->vm_mm, nstart);
+			split_huge_pmd_address(next, nstart);
 	}
 }
diff --git a/mm/madvise.c b/mm/madvise.c
index d215ea949630..fa86472ea5d5 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -282,7 +282,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd)) {
 		if (next - addr != HPAGE_PMD_SIZE)
-			split_huge_page_pmd(vma, addr, pmd);
+			split_huge_pmd(vma, pmd, addr);
 		else if (!madvise_free_huge_pmd(tlb, vma, pmd, addr))
 			goto next;
 		/* fall through */
diff --git a/mm/memory.c b/mm/memory.c
index 8067c47ecf14..d37aeda3a8fe 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1204,7 +1204,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 					BUG();
 				}
 #endif
-				split_huge_page_pmd(vma, addr, pmd);
+				split_huge_pmd(vma, pmd, addr);
 			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
 				goto next;
 			/* fall through */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index eeec77abd168..528f6c467cf1 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -493,7 +493,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	split_huge_page_pmd(vma, addr, pmd);
+	split_huge_pmd(vma, pmd, addr);
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ef5be8eaab00..9c1445dc8a4c 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -160,7 +160,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 		if (pmd_trans_huge(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
-				split_huge_page_pmd(vma, addr, pmd);
+				split_huge_pmd(vma, pmd, addr);
 			else {
 				int nr_ptes = change_huge_pmd(vma, pmd, addr,
 						newprot, prot_numa);
diff --git a/mm/mremap.c b/mm/mremap.c
index a7c93eceb1c8..d9cc4debd3aa 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -209,7 +209,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 				need_flush = true;
 				continue;
 			} else if (!err) {
-				split_huge_page_pmd(vma, old_addr, old_pmd);
+				split_huge_pmd(vma, old_pmd, old_addr);
 			}
 			VM_BUG_ON(pmd_trans_huge(*old_pmd));
 		}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 29f2f8b853ae..207244489a68 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,7 +58,7 @@ again:
 		if (!walk->pte_entry)
 			continue;
 
-		split_huge_page_pmd_mm(walk->mm, addr, pmd);
+		split_huge_pmd(walk->vma, pmd, addr);
 		if (pmd_trans_unstable(pmd))
 			goto again;
 		err = walk_pte_range(pmd, addr, next, walk);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 10/36] mm, vmstats: new THP splitting event
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (8 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 09/36] thp: rename split_huge_page_pmd() to split_huge_pmd() Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 11/36] mm: temporarily mark THP broken Kirill A. Shutemov
                   ` (26 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

The patch replaces THP_SPLIT with three events: THP_SPLIT_PAGE,
THP_SPLIT_PAGE_FAILED and THP_SPLIT_PMD. This reflects the fact that we
are going to be able to split a PMD without splitting the compound page,
and that split_huge_page() can fail.
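
With the patch applied, /proc/vmstat would expose the new counters, for
example (the values here are hypothetical):

	thp_split_page 12
	thp_split_page_failed 0
	thp_split_pmd 34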

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Christoph Lameter <cl@linux.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/vm_event_item.h | 4 +++-
 mm/huge_memory.c              | 2 +-
 mm/vmstat.c                   | 4 +++-
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 2b1cef88b827..3261bfe2156a 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -69,7 +69,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_FAULT_FALLBACK,
 		THP_COLLAPSE_ALLOC,
 		THP_COLLAPSE_ALLOC_FAILED,
-		THP_SPLIT,
+		THP_SPLIT_PAGE,
+		THP_SPLIT_PAGE_FAILED,
+		THP_SPLIT_PMD,
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
 #endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2374dbb57c44..cdc3a358eb17 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1986,7 +1986,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 
 	BUG_ON(!PageSwapBacked(page));
 	__split_huge_page(page, anon_vma, list);
-	count_vm_event(THP_SPLIT);
+	count_vm_event(THP_SPLIT_PAGE);
 
 	BUG_ON(PageCompound(page));
 out_unlock:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1fd0886a389f..e1c87425fe11 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -821,7 +821,9 @@ const char * const vmstat_text[] = {
 	"thp_fault_fallback",
 	"thp_collapse_alloc",
 	"thp_collapse_alloc_failed",
-	"thp_split",
+	"thp_split_page",
+	"thp_split_page_failed",
+	"thp_split_pmd",
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
 #endif
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 11/36] mm: temporarily mark THP broken
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (9 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 10/36] mm, vmstats: new THP splitting event Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 12/36] thp: drop all split_huge_page()-related code Kirill A. Shutemov
                   ` (25 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

Up to this point we have tried to keep the patchset bisectable, but the
next patches are going to change how the core of THP refcounting works.

It would be beneficial to split the change into several patches and make
it more reviewable. Unfortunately, I don't see how we can achieve that
while keeping THP working.

Let's hide THP under CONFIG_BROKEN for now and bring it back once the
new refcounting is established.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 mm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index e79de2bd12cd..c973f416cbe5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -410,7 +410,7 @@ config NOMMU_INITIAL_TRIM_EXCESS
 
 config TRANSPARENT_HUGEPAGE
 	bool "Transparent Hugepage Support"
-	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE
+	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && BROKEN
 	select COMPACTION
 	help
 	  Transparent Hugepages allows the kernel to use huge pages and
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 12/36] thp: drop all split_huge_page()-related code
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (10 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 11/36] mm: temporarily mark THP broken Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 13/36] mm: drop tail page refcounting Kirill A. Shutemov
                   ` (24 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We will re-introduce a new version with the new refcounting later in the
patchset.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 include/linux/huge_mm.h |  28 +---
 mm/huge_memory.c        | 400 +-----------------------------------------------
 2 files changed, 7 insertions(+), 421 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 34bbf769d52e..47f80207782f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -97,28 +97,12 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 #endif /* CONFIG_DEBUG_VM */
 
 extern unsigned long transparent_hugepage_flags;
-extern int split_huge_page_to_list(struct page *page, struct list_head *list);
-static inline int split_huge_page(struct page *page)
-{
-	return split_huge_page_to_list(page, NULL);
-}
-extern void __split_huge_page_pmd(struct vm_area_struct *vma,
-		unsigned long address, pmd_t *pmd);
-#define split_huge_pmd(__vma, __pmd, __address)				\
-	do {								\
-		pmd_t *____pmd = (__pmd);				\
-		if (unlikely(pmd_trans_huge(*____pmd)))			\
-			__split_huge_page_pmd(__vma, __address,		\
-					____pmd);			\
-	}  while (0)
-#define wait_split_huge_page(__anon_vma, __pmd)				\
-	do {								\
-		pmd_t *____pmd = (__pmd);				\
-		anon_vma_lock_write(__anon_vma);			\
-		anon_vma_unlock_write(__anon_vma);			\
-		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
-		       pmd_trans_huge(*____pmd));			\
-	} while (0)
+
+#define split_huge_page_to_list(page, list) BUILD_BUG()
+#define split_huge_page(page) BUILD_BUG()
+#define split_huge_pmd(__vma, __pmd, __address) BUILD_BUG()
+
+#define wait_split_huge_page(__anon_vma, __pmd) BUILD_BUG();
 #if HPAGE_PMD_ORDER >= MAX_ORDER
 #error "hugepages can't be allocated by the buddy allocator"
 #endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cdc3a358eb17..daa12e51fe70 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1673,329 +1673,6 @@ int pmd_freeable(pmd_t pmd)
 	return !pmd_dirty(pmd);
 }
 
-static int __split_huge_page_splitting(struct page *page,
-				       struct vm_area_struct *vma,
-				       unsigned long address)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	spinlock_t *ptl;
-	pmd_t *pmd;
-	int ret = 0;
-	/* For mmu_notifiers */
-	const unsigned long mmun_start = address;
-	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
-
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-	pmd = page_check_address_pmd(page, mm, address,
-			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
-	if (pmd) {
-		/*
-		 * We can't temporarily set the pmd to null in order
-		 * to split it, the pmd must remain marked huge at all
-		 * times or the VM won't take the pmd_trans_huge paths
-		 * and it won't wait on the anon_vma->root->rwsem to
-		 * serialize against split_huge_page*.
-		 */
-		pmdp_splitting_flush(vma, address, pmd);
-
-		ret = 1;
-		spin_unlock(ptl);
-	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-
-	return ret;
-}
-
-static void __split_huge_page_refcount(struct page *page,
-				       struct list_head *list)
-{
-	int i;
-	struct zone *zone = page_zone(page);
-	struct lruvec *lruvec;
-	int tail_count = 0;
-
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irq(&zone->lru_lock);
-	lruvec = mem_cgroup_page_lruvec(page, zone);
-
-	compound_lock(page);
-	/* complete memcg works before add pages to LRU */
-	mem_cgroup_split_huge_fixup(page);
-
-	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
-		struct page *page_tail = page + i;
-
-		/* tail_page->_mapcount cannot change */
-		BUG_ON(page_mapcount(page_tail) < 0);
-		tail_count += page_mapcount(page_tail);
-		/* check for overflow */
-		BUG_ON(tail_count < 0);
-		BUG_ON(atomic_read(&page_tail->_count) != 0);
-		/*
-		 * tail_page->_count is zero and not changing from
-		 * under us. But get_page_unless_zero() may be running
-		 * from under us on the tail_page. If we used
-		 * atomic_set() below instead of atomic_add(), we
-		 * would then run atomic_set() concurrently with
-		 * get_page_unless_zero(), and atomic_set() is
-		 * implemented in C not using locked ops. spin_unlock
-		 * on x86 sometime uses locked ops because of PPro
-		 * errata 66, 92, so unless somebody can guarantee
-		 * atomic_set() here would be safe on all archs (and
-		 * not only on x86), it's safer to use atomic_add().
-		 */
-		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
-			   &page_tail->_count);
-
-		/* after clearing PageTail the gup refcount can be released */
-		smp_mb__after_atomic();
-
-		/*
-		 * retain hwpoison flag of the poisoned tail page:
-		 *   fix for the unsuitable process killed on Guest Machine(KVM)
-		 *   by the memory-failure.
-		 */
-		page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP | __PG_HWPOISON;
-		page_tail->flags |= (page->flags &
-				     ((1L << PG_referenced) |
-				      (1L << PG_swapbacked) |
-				      (1L << PG_mlocked) |
-				      (1L << PG_uptodate) |
-				      (1L << PG_active) |
-				      (1L << PG_unevictable)));
-		page_tail->flags |= (1L << PG_dirty);
-
-		/* clear PageTail before overwriting first_page */
-		smp_wmb();
-
-		/*
-		 * __split_huge_page_splitting() already set the
-		 * splitting bit in all pmd that could map this
-		 * hugepage, that will ensure no CPU can alter the
-		 * mapcount on the head page. The mapcount is only
-		 * accounted in the head page and it has to be
-		 * transferred to all tail pages in the below code. So
-		 * for this code to be safe, the split the mapcount
-		 * can't change. But that doesn't mean userland can't
-		 * keep changing and reading the page contents while
-		 * we transfer the mapcount, so the pmd splitting
-		 * status is achieved setting a reserved bit in the
-		 * pmd, not by clearing the present bit.
-		*/
-		page_tail->_mapcount = page->_mapcount;
-
-		BUG_ON(page_tail->mapping != TAIL_MAPPING);
-		page_tail->mapping = page->mapping;
-
-		page_tail->index = page->index + i;
-		page_cpupid_xchg_last(page_tail, page_cpupid_last(page));
-
-		BUG_ON(!PageAnon(page_tail));
-		BUG_ON(!PageUptodate(page_tail));
-		BUG_ON(!PageDirty(page_tail));
-		BUG_ON(!PageSwapBacked(page_tail));
-
-		lru_add_page_tail(page, page_tail, lruvec, list);
-	}
-	atomic_sub(tail_count, &page->_count);
-	BUG_ON(atomic_read(&page->_count) <= 0);
-
-	__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
-
-	ClearPageCompound(page);
-	compound_unlock(page);
-	spin_unlock_irq(&zone->lru_lock);
-
-	for (i = 1; i < HPAGE_PMD_NR; i++) {
-		struct page *page_tail = page + i;
-		BUG_ON(page_count(page_tail) <= 0);
-		/*
-		 * Tail pages may be freed if there wasn't any mapping
-		 * like if add_to_swap() is running on a lru page that
-		 * had its mapping zapped. And freeing these pages
-		 * requires taking the lru_lock so we do the put_page
-		 * of the tail pages after the split is complete.
-		 */
-		put_page(page_tail);
-	}
-
-	/*
-	 * Only the head page (now become a regular page) is required
-	 * to be pinned by the caller.
-	 */
-	BUG_ON(page_count(page) <= 0);
-}
-
-static int __split_huge_page_map(struct page *page,
-				 struct vm_area_struct *vma,
-				 unsigned long address)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	spinlock_t *ptl;
-	pmd_t *pmd, _pmd;
-	int ret = 0, i;
-	pgtable_t pgtable;
-	unsigned long haddr;
-
-	pmd = page_check_address_pmd(page, mm, address,
-			PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG, &ptl);
-	if (pmd) {
-		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
-		pmd_populate(mm, &_pmd, pgtable);
-		if (pmd_write(*pmd))
-			BUG_ON(page_mapcount(page) != 1);
-
-		haddr = address;
-		for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
-			pte_t *pte, entry;
-			BUG_ON(PageCompound(page+i));
-			/*
-			 * Note that NUMA hinting access restrictions are not
-			 * transferred to avoid any possibility of altering
-			 * permissions across VMAs.
-			 */
-			entry = mk_pte(page + i, vma->vm_page_prot);
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-			if (!pmd_write(*pmd))
-				entry = pte_wrprotect(entry);
-			if (!pmd_young(*pmd))
-				entry = pte_mkold(entry);
-			pte = pte_offset_map(&_pmd, haddr);
-			BUG_ON(!pte_none(*pte));
-			set_pte_at(mm, haddr, pte, entry);
-			pte_unmap(pte);
-		}
-
-		smp_wmb(); /* make pte visible before pmd */
-		/*
-		 * Up to this point the pmd is present and huge and
-		 * userland has the whole access to the hugepage
-		 * during the split (which happens in place). If we
-		 * overwrite the pmd with the not-huge version
-		 * pointing to the pte here (which of course we could
-		 * if all CPUs were bug free), userland could trigger
-		 * a small page size TLB miss on the small sized TLB
-		 * while the hugepage TLB entry is still established
-		 * in the huge TLB. Some CPU doesn't like that. See
-		 * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
-		 * Erratum 383 on page 93. Intel should be safe but is
-		 * also warns that it's only safe if the permission
-		 * and cache attributes of the two entries loaded in
-		 * the two TLB is identical (which should be the case
-		 * here). But it is generally safer to never allow
-		 * small and huge TLB entries for the same virtual
-		 * address to be loaded simultaneously. So instead of
-		 * doing "pmd_populate(); flush_tlb_range();" we first
-		 * mark the current pmd notpresent (atomically because
-		 * here the pmd_trans_huge and pmd_trans_splitting
-		 * must remain set at all times on the pmd until the
-		 * split is complete for this pmd), then we flush the
-		 * SMP TLB and finally we write the non-huge version
-		 * of the pmd entry with pmd_populate.
-		 */
-		pmdp_invalidate(vma, address, pmd);
-		pmd_populate(mm, pmd, pgtable);
-		ret = 1;
-		spin_unlock(ptl);
-	}
-
-	return ret;
-}
-
-/* must be called with anon_vma->root->rwsem held */
-static void __split_huge_page(struct page *page,
-			      struct anon_vma *anon_vma,
-			      struct list_head *list)
-{
-	int mapcount, mapcount2;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
-	struct anon_vma_chain *avc;
-
-	BUG_ON(!PageHead(page));
-	BUG_ON(PageTail(page));
-
-	mapcount = 0;
-	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
-		struct vm_area_struct *vma = avc->vma;
-		unsigned long addr = vma_address(page, vma);
-		BUG_ON(is_vma_temporary_stack(vma));
-		mapcount += __split_huge_page_splitting(page, vma, addr);
-	}
-	/*
-	 * It is critical that new vmas are added to the tail of the
-	 * anon_vma list. This guarantes that if copy_huge_pmd() runs
-	 * and establishes a child pmd before
-	 * __split_huge_page_splitting() freezes the parent pmd (so if
-	 * we fail to prevent copy_huge_pmd() from running until the
-	 * whole __split_huge_page() is complete), we will still see
-	 * the newly established pmd of the child later during the
-	 * walk, to be able to set it as pmd_trans_splitting too.
-	 */
-	if (mapcount != page_mapcount(page)) {
-		pr_err("mapcount %d page_mapcount %d\n",
-			mapcount, page_mapcount(page));
-		BUG();
-	}
-
-	__split_huge_page_refcount(page, list);
-
-	mapcount2 = 0;
-	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
-		struct vm_area_struct *vma = avc->vma;
-		unsigned long addr = vma_address(page, vma);
-		BUG_ON(is_vma_temporary_stack(vma));
-		mapcount2 += __split_huge_page_map(page, vma, addr);
-	}
-	if (mapcount != mapcount2) {
-		pr_err("mapcount %d mapcount2 %d page_mapcount %d\n",
-			mapcount, mapcount2, page_mapcount(page));
-		BUG();
-	}
-}
-
-/*
- * Split a hugepage into normal pages. This doesn't change the position of head
- * page. If @list is null, tail pages will be added to LRU list, otherwise, to
- * @list. Both head page and tail pages will inherit mapping, flags, and so on
- * from the hugepage.
- * Return 0 if the hugepage is split successfully otherwise return 1.
- */
-int split_huge_page_to_list(struct page *page, struct list_head *list)
-{
-	struct anon_vma *anon_vma;
-	int ret = 1;
-
-	BUG_ON(is_huge_zero_page(page));
-	BUG_ON(!PageAnon(page));
-
-	/*
-	 * The caller does not necessarily hold an mmap_sem that would prevent
-	 * the anon_vma disappearing so we first we take a reference to it
-	 * and then lock the anon_vma for write. This is similar to
-	 * page_lock_anon_vma_read except the write lock is taken to serialise
-	 * against parallel split or collapse operations.
-	 */
-	anon_vma = page_get_anon_vma(page);
-	if (!anon_vma)
-		goto out;
-	anon_vma_lock_write(anon_vma);
-
-	ret = 0;
-	if (!PageCompound(page))
-		goto out_unlock;
-
-	BUG_ON(!PageSwapBacked(page));
-	__split_huge_page(page, anon_vma, list);
-	count_vm_event(THP_SPLIT_PAGE);
-
-	BUG_ON(PageCompound(page));
-out_unlock:
-	anon_vma_unlock_write(anon_vma);
-	put_anon_vma(anon_vma);
-out:
-	return ret;
-}
-
 #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
 
 int hugepage_madvise(struct vm_area_struct *vma,
@@ -2935,81 +2612,6 @@ static int khugepaged(void *none)
 	return 0;
 }
 
-static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
-		unsigned long haddr, pmd_t *pmd)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	pgtable_t pgtable;
-	pmd_t _pmd;
-	int i;
-
-	pmdp_huge_clear_flush_notify(vma, haddr, pmd);
-	/* leave pmd empty until pte is filled */
-
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
-	pmd_populate(mm, &_pmd, pgtable);
-
-	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
-		pte_t *pte, entry;
-		entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
-		entry = pte_mkspecial(entry);
-		pte = pte_offset_map(&_pmd, haddr);
-		VM_BUG_ON(!pte_none(*pte));
-		set_pte_at(mm, haddr, pte, entry);
-		pte_unmap(pte);
-	}
-	smp_wmb(); /* make pte visible before pmd */
-	pmd_populate(mm, pmd, pgtable);
-	put_huge_zero_page();
-}
-
-void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
-		pmd_t *pmd)
-{
-	spinlock_t *ptl;
-	struct page *page;
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long haddr = address & HPAGE_PMD_MASK;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
-
-	BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
-
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
-again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-	ptl = pmd_lock(mm, pmd);
-	if (unlikely(!pmd_trans_huge(*pmd))) {
-		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-		return;
-	}
-	if (is_huge_zero_pmd(*pmd)) {
-		__split_huge_zero_page_pmd(vma, haddr, pmd);
-		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-		return;
-	}
-	page = pmd_page(*pmd);
-	VM_BUG_ON_PAGE(!page_count(page), page);
-	get_page(page);
-	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-
-	split_huge_page(page);
-
-	put_page(page);
-
-	/*
-	 * We don't always have down_write of mmap_sem here: a racing
-	 * do_huge_pmd_wp_page() might have copied-on-write to another
-	 * huge page before our split_huge_page() got the anon_vma lock.
-	 */
-	if (unlikely(pmd_trans_huge(*pmd)))
-		goto again;
-}
-
 static void split_huge_pmd_address(struct vm_area_struct *vma,
 				    unsigned long address)
 {
@@ -3034,7 +2636,7 @@ static void split_huge_pmd_address(struct vm_area_struct *vma,
 	 * Caller holds the mmap_sem write mode, so a huge pmd cannot
 	 * materialize from under us.
 	 */
-	__split_huge_page_pmd(vma, address, pmd);
+	split_huge_pmd(vma, pmd, address);
 }
 
 void __vma_adjust_trans_huge(struct vm_area_struct *vma,
-- 
2.1.4



* [PATCHv6 13/36] mm: drop tail page refcounting
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (11 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 12/36] thp: drop all split_huge_page()-related code Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-09 13:59   ` Vlastimil Babka
  2015-06-03 17:05 ` [PATCHv6 14/36] futex, thp: remove special case for THP in get_futex_key Kirill A. Shutemov
                   ` (23 subsequent siblings)
  36 siblings, 1 reply; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

Tail page refcounting is utterly complicated and painful to support.

It uses ->_mapcount on tail pages to store how many times this page is
pinned. get_page() bumps ->_mapcount on the tail page in addition to
->_count on the head. This information is required by split_huge_page()
to be able to distribute pins from the head of the compound page to the
tails during the split.

We will need ->_mapcount to account PTE mappings of subpages of the
compound page. We eliminate the need for the current meaning of
->_mapcount in tail pages by forbidding the split entirely if the page
is pinned.

The only user of tail page refcounting is THP, which is marked BROKEN
for now.

Let's drop all this mess. It makes get_page() and put_page() much
simpler.
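
To illustrate the simplification (a condensed sketch, not code added on
top of the patch; it mirrors the include/linux/mm.h hunk below): every
pin now lands on the compound head, so both helpers reduce to

	static inline void get_page(struct page *page)
	{
		page = compound_head(page);
		/*
		 * Getting a normal page or the head of a compound page
		 * requires an already elevated page->_count.
		 */
		VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
		atomic_inc(&page->_count);
	}

	static inline void put_page(struct page *page)
	{
		page = compound_head(page);
		if (put_page_testzero(page))
			__put_page(page);	/* frees single or compound page */
	}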

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 arch/mips/mm/gup.c            |   4 -
 arch/powerpc/mm/hugetlbpage.c |  13 +-
 arch/s390/mm/gup.c            |  13 +-
 arch/sparc/mm/gup.c           |  14 +--
 arch/x86/mm/gup.c             |   4 -
 include/linux/mm.h            |  47 ++------
 include/linux/mm_types.h      |  17 +--
 mm/gup.c                      |  34 +-----
 mm/huge_memory.c              |  41 +------
 mm/hugetlb.c                  |   2 +-
 mm/internal.h                 |  44 -------
 mm/swap.c                     | 273 +++---------------------------------------
 12 files changed, 40 insertions(+), 466 deletions(-)

diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
index 349995d19c7f..36a35115dc2e 100644
--- a/arch/mips/mm/gup.c
+++ b/arch/mips/mm/gup.c
@@ -87,8 +87,6 @@ static int gup_huge_pmd(pmd_t pmd, unsigned long addr, unsigned long end,
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
-		if (PageTail(page))
-			get_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;
@@ -153,8 +151,6 @@ static int gup_huge_pud(pud_t pud, unsigned long addr, unsigned long end,
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
-		if (PageTail(page))
-			get_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index dde6ff52a3c6..5f979e51ef7b 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -1032,7 +1032,7 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 {
 	unsigned long mask;
 	unsigned long pte_end;
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 	pte_t pte;
 	int refs;
 
@@ -1055,7 +1055,6 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 	head = pte_page(pte);
 
 	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
@@ -1077,15 +1076,5 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 		return 0;
 	}
 
-	/*
-	 * Any tail page need their mapcount reference taken before we
-	 * return.
-	 */
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 5c586c78ca8d..dab30527ad41 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -52,7 +52,7 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
 	unsigned long mask, result;
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 	int refs;
 
 	result = write ? 0 : _SEGMENT_ENTRY_PROTECT;
@@ -64,7 +64,6 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 	refs = 0;
 	head = pmd_page(pmd);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
@@ -85,16 +84,6 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		return 0;
 	}
 
-	/*
-	 * Any tail page need their mapcount reference taken before we
-	 * return.
-	 */
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
 
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 2e5c4fc2daa9..9091c5daa2e1 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -56,8 +56,6 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 			put_page(head);
 			return 0;
 		}
-		if (head != page)
-			get_huge_page_tail(page);
 
 		pages[*nr] = page;
 		(*nr)++;
@@ -70,7 +68,7 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 			unsigned long end, int write, struct page **pages,
 			int *nr)
 {
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 	int refs;
 
 	if (!(pmd_val(pmd) & _PAGE_VALID))
@@ -82,7 +80,6 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 	refs = 0;
 	head = pmd_page(pmd);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
@@ -103,15 +100,6 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		return 0;
 	}
 
-	/* Any tail page need their mapcount reference taken before we
-	 * return.
-	 */
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
 
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 81bf3d2af3eb..62a887a3cf50 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -137,8 +137,6 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
-		if (PageTail(page))
-			get_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;
@@ -214,8 +212,6 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
-		if (PageTail(page))
-			get_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 76376e04988a..f079d9ffc797 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -459,44 +459,9 @@ static inline int page_count(struct page *page)
 	return atomic_read(&compound_head(page)->_count);
 }
 
-static inline bool __compound_tail_refcounted(struct page *page)
-{
-	return PageAnon(page) && !PageSlab(page) && !PageHeadHuge(page);
-}
-
-/*
- * This takes a head page as parameter and tells if the
- * tail page reference counting can be skipped.
- *
- * For this to be safe, PageSlab and PageHeadHuge must remain true on
- * any given page where they return true here, until all tail pins
- * have been released.
- */
-static inline bool compound_tail_refcounted(struct page *page)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	return __compound_tail_refcounted(page);
-}
-
-static inline void get_huge_page_tail(struct page *page)
-{
-	/*
-	 * __split_huge_page_refcount() cannot run from under us.
-	 */
-	VM_BUG_ON_PAGE(!PageTail(page), page);
-	VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
-	VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
-	if (compound_tail_refcounted(page->first_page))
-		atomic_inc(&page->_mapcount);
-}
-
-extern bool __get_page_tail(struct page *page);
-
 static inline void get_page(struct page *page)
 {
-	if (unlikely(PageTail(page)))
-		if (likely(__get_page_tail(page)))
-			return;
+	page = compound_head(page);
 	/*
 	 * Getting a normal page or the head of a compound page
 	 * requires to already have an elevated page->_count.
@@ -527,7 +492,15 @@ static inline void init_page_count(struct page *page)
 	atomic_set(&page->_count, 1);
 }
 
-void put_page(struct page *page);
+void __put_page(struct page* page);
+
+static inline void put_page(struct page *page)
+{
+	page = compound_head(page);
+	if (put_page_testzero(page))
+		__put_page(page);
+}
+
 void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f6266742ce1f..4b51a59160ab 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -92,20 +92,9 @@ struct page {
 
 				union {
 					/*
-					 * Count of ptes mapped in
-					 * mms, to show when page is
-					 * mapped & limit reverse map
-					 * searches.
-					 *
-					 * Used also for tail pages
-					 * refcounting instead of
-					 * _count. Tail pages cannot
-					 * be mapped and keeping the
-					 * tail page _count zero at
-					 * all times guarantees
-					 * get_page_unless_zero() will
-					 * never succeed on tail
-					 * pages.
+					 * Count of ptes mapped in mms, to show
+					 * when page is mapped & limit reverse
+					 * map searches.
 					 */
 					atomic_t _mapcount;
 
diff --git a/mm/gup.c b/mm/gup.c
index 5c44a3d87856..f0f390fb9213 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -93,7 +93,7 @@ retry:
 	}
 
 	if (flags & FOLL_GET)
-		get_page_foll(page);
+		get_page(page);
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
@@ -1108,7 +1108,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 	int refs;
 
 	if (write && !pmd_write(orig))
@@ -1117,7 +1117,6 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	refs = 0;
 	head = pmd_page(orig);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
@@ -1138,24 +1137,13 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		return 0;
 	}
 
-	/*
-	 * Any tail pages need their mapcount reference taken before we
-	 * return. (This allows the THP code to bump their ref count when
-	 * they are split into base pages).
-	 */
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
 
 static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 	int refs;
 
 	if (write && !pud_write(orig))
@@ -1164,7 +1152,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	refs = 0;
 	head = pud_page(orig);
 	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
@@ -1185,12 +1172,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		return 0;
 	}
 
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
 
@@ -1199,7 +1180,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 			struct page **pages, int *nr)
 {
 	int refs;
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 
 	if (write && !pgd_write(orig))
 		return 0;
@@ -1207,7 +1188,6 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 	refs = 0;
 	head = pgd_page(orig);
 	page = head + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
@@ -1228,12 +1208,6 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 		return 0;
 	}
 
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index daa12e51fe70..70fff4c9e067 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -966,37 +966,6 @@ unlock:
 	spin_unlock(ptl);
 }
 
-/*
- * Save CONFIG_DEBUG_PAGEALLOC from faulting falsely on tail pages
- * during copy_user_huge_page()'s copy_page_rep(): in the case when
- * the source page gets split and a tail freed before copy completes.
- * Called under pmd_lock of checked pmd, so safe from splitting itself.
- */
-static void get_user_huge_page(struct page *page)
-{
-	if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
-		struct page *endpage = page + HPAGE_PMD_NR;
-
-		atomic_add(HPAGE_PMD_NR, &page->_count);
-		while (++page < endpage)
-			get_huge_page_tail(page);
-	} else {
-		get_page(page);
-	}
-}
-
-static void put_user_huge_page(struct page *page)
-{
-	if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
-		struct page *endpage = page + HPAGE_PMD_NR;
-
-		while (page < endpage)
-			put_page(page++);
-	} else {
-		put_page(page);
-	}
-}
-
 static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long address,
@@ -1149,7 +1118,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		ret |= VM_FAULT_WRITE;
 		goto out_unlock;
 	}
-	get_user_huge_page(page);
+	get_page(page);
 	spin_unlock(ptl);
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
@@ -1170,7 +1139,7 @@ alloc:
 				split_huge_pmd(vma, pmd, address);
 				ret |= VM_FAULT_FALLBACK;
 			}
-			put_user_huge_page(page);
+			put_page(page);
 		}
 		count_vm_event(THP_FAULT_FALLBACK);
 		goto out;
@@ -1181,7 +1150,7 @@ alloc:
 		put_page(new_page);
 		if (page) {
 			split_huge_pmd(vma, pmd, address);
-			put_user_huge_page(page);
+			put_page(page);
 		} else
 			split_huge_pmd(vma, pmd, address);
 		ret |= VM_FAULT_FALLBACK;
@@ -1203,7 +1172,7 @@ alloc:
 
 	spin_lock(ptl);
 	if (page)
-		put_user_huge_page(page);
+		put_page(page);
 	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
 		spin_unlock(ptl);
 		mem_cgroup_cancel_charge(new_page, memcg, true);
@@ -1288,7 +1257,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 	if (flags & FOLL_GET)
-		get_page_foll(page);
+		get_page(page);
 
 out:
 	return page;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ddbcecab781f..9d44d8f17760 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3480,7 +3480,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 same_page:
 		if (pages) {
 			pages[i] = mem_map_offset(page, pfn_offset);
-			get_page_foll(pages[i]);
+			get_page(pages[i]);
 		}
 
 		if (vmas)
diff --git a/mm/internal.h b/mm/internal.h
index 36b23f1e2ca6..6aa08db10589 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -47,50 +47,6 @@ static inline void set_page_refcounted(struct page *page)
 	set_page_count(page, 1);
 }
 
-static inline void __get_page_tail_foll(struct page *page,
-					bool get_page_head)
-{
-	/*
-	 * If we're getting a tail page, the elevated page->_count is
-	 * required only in the head page and we will elevate the head
-	 * page->_count and tail page->_mapcount.
-	 *
-	 * We elevate page_tail->_mapcount for tail pages to force
-	 * page_tail->_count to be zero at all times to avoid getting
-	 * false positives from get_page_unless_zero() with
-	 * speculative page access (like in
-	 * page_cache_get_speculative()) on tail pages.
-	 */
-	VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
-	if (get_page_head)
-		atomic_inc(&page->first_page->_count);
-	get_huge_page_tail(page);
-}
-
-/*
- * This is meant to be called as the FOLL_GET operation of
- * follow_page() and it must be called while holding the proper PT
- * lock while the pte (or pmd_trans_huge) is still mapping the page.
- */
-static inline void get_page_foll(struct page *page)
-{
-	if (unlikely(PageTail(page)))
-		/*
-		 * This is safe only because
-		 * __split_huge_page_refcount() can't run under
-		 * get_page_foll() because we hold the proper PT lock.
-		 */
-		__get_page_tail_foll(page, true);
-	else {
-		/*
-		 * Getting a normal page or the head of a compound page
-		 * requires to already have an elevated page->_count.
-		 */
-		VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
-		atomic_inc(&page->_count);
-	}
-}
-
 extern unsigned long highest_memmap_pfn;
 
 /*
diff --git a/mm/swap.c b/mm/swap.c
index ab7c338eda87..39166c05e5f3 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -89,260 +89,14 @@ static void __put_compound_page(struct page *page)
 	(*dtor)(page);
 }
 
-/**
- * Two special cases here: we could avoid taking compound_lock_irqsave
- * and could skip the tail refcounting(in _mapcount).
- *
- * 1. Hugetlbfs page:
- *
- *    PageHeadHuge will remain true until the compound page
- *    is released and enters the buddy allocator, and it could
- *    not be split by __split_huge_page_refcount().
- *
- *    So if we see PageHeadHuge set, and we have the tail page pin,
- *    then we could safely put head page.
- *
- * 2. Slab THP page:
- *
- *    PG_slab is cleared before the slab frees the head page, and
- *    tail pin cannot be the last reference left on the head page,
- *    because the slab code is free to reuse the compound page
- *    after a kfree/kmem_cache_free without having to check if
- *    there's any tail pin left.  In turn all tail pinsmust be always
- *    released while the head is still pinned by the slab code
- *    and so we know PG_slab will be still set too.
- *
- *    So if we see PageSlab set, and we have the tail page pin,
- *    then we could safely put head page.
- */
-static __always_inline
-void put_unrefcounted_compound_page(struct page *page_head, struct page *page)
-{
-	/*
-	 * If @page is a THP tail, we must read the tail page
-	 * flags after the head page flags. The
-	 * __split_huge_page_refcount side enforces write memory barriers
-	 * between clearing PageTail and before the head page
-	 * can be freed and reallocated.
-	 */
-	smp_rmb();
-	if (likely(PageTail(page))) {
-		/*
-		 * __split_huge_page_refcount cannot race
-		 * here, see the comment above this function.
-		 */
-		VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
-		if (put_page_testzero(page_head)) {
-			/*
-			 * If this is the tail of a slab THP page,
-			 * the tail pin must not be the last reference
-			 * held on the page, because the PG_slab cannot
-			 * be cleared before all tail pins (which skips
-			 * the _mapcount tail refcounting) have been
-			 * released.
-			 *
-			 * If this is the tail of a hugetlbfs page,
-			 * the tail pin may be the last reference on
-			 * the page instead, because PageHeadHuge will
-			 * not go away until the compound page enters
-			 * the buddy allocator.
-			 */
-			VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
-			__put_compound_page(page_head);
-		}
-	} else
-		/*
-		 * __split_huge_page_refcount run before us,
-		 * @page was a THP tail. The split @page_head
-		 * has been freed and reallocated as slab or
-		 * hugetlbfs page of smaller order (only
-		 * possible if reallocated as slab on x86).
-		 */
-		if (put_page_testzero(page))
-			__put_single_page(page);
-}
-
-static __always_inline
-void put_refcounted_compound_page(struct page *page_head, struct page *page)
-{
-	if (likely(page != page_head && get_page_unless_zero(page_head))) {
-		unsigned long flags;
-
-		/*
-		 * @page_head wasn't a dangling pointer but it may not
-		 * be a head page anymore by the time we obtain the
-		 * lock. That is ok as long as it can't be freed from
-		 * under us.
-		 */
-		flags = compound_lock_irqsave(page_head);
-		if (unlikely(!PageTail(page))) {
-			/* __split_huge_page_refcount run before us */
-			compound_unlock_irqrestore(page_head, flags);
-			if (put_page_testzero(page_head)) {
-				/*
-				 * The @page_head may have been freed
-				 * and reallocated as a compound page
-				 * of smaller order and then freed
-				 * again.  All we know is that it
-				 * cannot have become: a THP page, a
-				 * compound page of higher order, a
-				 * tail page.  That is because we
-				 * still hold the refcount of the
-				 * split THP tail and page_head was
-				 * the THP head before the split.
-				 */
-				if (PageHead(page_head))
-					__put_compound_page(page_head);
-				else
-					__put_single_page(page_head);
-			}
-out_put_single:
-			if (put_page_testzero(page))
-				__put_single_page(page);
-			return;
-		}
-		VM_BUG_ON_PAGE(page_head != page->first_page, page);
-		/*
-		 * We can release the refcount taken by
-		 * get_page_unless_zero() now that
-		 * __split_huge_page_refcount() is blocked on the
-		 * compound_lock.
-		 */
-		if (put_page_testzero(page_head))
-			VM_BUG_ON_PAGE(1, page_head);
-		/* __split_huge_page_refcount will wait now */
-		VM_BUG_ON_PAGE(page_mapcount(page) <= 0, page);
-		atomic_dec(&page->_mapcount);
-		VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page_head);
-		VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
-		compound_unlock_irqrestore(page_head, flags);
-
-		if (put_page_testzero(page_head)) {
-			if (PageHead(page_head))
-				__put_compound_page(page_head);
-			else
-				__put_single_page(page_head);
-		}
-	} else {
-		/* @page_head is a dangling pointer */
-		VM_BUG_ON_PAGE(PageTail(page), page);
-		goto out_put_single;
-	}
-}
-
-static void put_compound_page(struct page *page)
-{
-	struct page *page_head;
-
-	/*
-	 * We see the PageCompound set and PageTail not set, so @page maybe:
-	 *  1. hugetlbfs head page, or
-	 *  2. THP head page.
-	 */
-	if (likely(!PageTail(page))) {
-		if (put_page_testzero(page)) {
-			/*
-			 * By the time all refcounts have been released
-			 * split_huge_page cannot run anymore from under us.
-			 */
-			if (PageHead(page))
-				__put_compound_page(page);
-			else
-				__put_single_page(page);
-		}
-		return;
-	}
-
-	/*
-	 * We see the PageCompound set and PageTail set, so @page maybe:
-	 *  1. a tail hugetlbfs page, or
-	 *  2. a tail THP page, or
-	 *  3. a split THP page.
-	 *
-	 *  Case 3 is possible, as we may race with
-	 *  __split_huge_page_refcount tearing down a THP page.
-	 */
-	page_head = compound_head_by_tail(page);
-	if (!__compound_tail_refcounted(page_head))
-		put_unrefcounted_compound_page(page_head, page);
-	else
-		put_refcounted_compound_page(page_head, page);
-}
-
-void put_page(struct page *page)
+void __put_page(struct page *page)
 {
 	if (unlikely(PageCompound(page)))
-		put_compound_page(page);
-	else if (put_page_testzero(page))
+		__put_compound_page(page);
+	else
 		__put_single_page(page);
 }
-EXPORT_SYMBOL(put_page);
-
-/*
- * This function is exported but must not be called by anything other
- * than get_page(). It implements the slow path of get_page().
- */
-bool __get_page_tail(struct page *page)
-{
-	/*
-	 * This takes care of get_page() if run on a tail page
-	 * returned by one of the get_user_pages/follow_page variants.
-	 * get_user_pages/follow_page itself doesn't need the compound
-	 * lock because it runs __get_page_tail_foll() under the
-	 * proper PT lock that already serializes against
-	 * split_huge_page().
-	 */
-	unsigned long flags;
-	bool got;
-	struct page *page_head = compound_head(page);
-
-	/* Ref to put_compound_page() comment. */
-	if (!__compound_tail_refcounted(page_head)) {
-		smp_rmb();
-		if (likely(PageTail(page))) {
-			/*
-			 * This is a hugetlbfs page or a slab
-			 * page. __split_huge_page_refcount
-			 * cannot race here.
-			 */
-			VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
-			__get_page_tail_foll(page, true);
-			return true;
-		} else {
-			/*
-			 * __split_huge_page_refcount run
-			 * before us, "page" was a THP
-			 * tail. The split page_head has been
-			 * freed and reallocated as slab or
-			 * hugetlbfs page of smaller order
-			 * (only possible if reallocated as
-			 * slab on x86).
-			 */
-			return false;
-		}
-	}
-
-	got = false;
-	if (likely(page != page_head && get_page_unless_zero(page_head))) {
-		/*
-		 * page_head wasn't a dangling pointer but it
-		 * may not be a head page anymore by the time
-		 * we obtain the lock. That is ok as long as it
-		 * can't be freed from under us.
-		 */
-		flags = compound_lock_irqsave(page_head);
-		/* here __split_huge_page_refcount won't run anymore */
-		if (likely(PageTail(page))) {
-			__get_page_tail_foll(page, false);
-			got = true;
-		}
-		compound_unlock_irqrestore(page_head, flags);
-		if (unlikely(!got))
-			put_page(page_head);
-	}
-	return got;
-}
-EXPORT_SYMBOL(__get_page_tail);
+EXPORT_SYMBOL(__put_page);
 
 /**
  * put_pages_list() - release a list of pages
@@ -959,15 +713,6 @@ void release_pages(struct page **pages, int nr, bool cold)
 	for (i = 0; i < nr; i++) {
 		struct page *page = pages[i];
 
-		if (unlikely(PageCompound(page))) {
-			if (zone) {
-				spin_unlock_irqrestore(&zone->lru_lock, flags);
-				zone = NULL;
-			}
-			put_compound_page(page);
-			continue;
-		}
-
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
 		 * excessive with a continuous string of pages from the
@@ -978,9 +723,19 @@ void release_pages(struct page **pages, int nr, bool cold)
 			zone = NULL;
 		}
 
+		page = compound_head(page);
 		if (!put_page_testzero(page))
 			continue;
 
+		if (PageCompound(page)) {
+			if (zone) {
+				spin_unlock_irqrestore(&zone->lru_lock, flags);
+				zone = NULL;
+			}
+			__put_compound_page(page);
+			continue;
+		}
+
 		if (PageLRU(page)) {
 			struct zone *pagezone = page_zone(page);
 
-- 
2.1.4



* [PATCHv6 14/36] futex, thp: remove special case for THP in get_futex_key
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (12 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 13/36] mm: drop tail page refcounting Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 15/36] ksm: prepare to new THP semantics Kirill A. Shutemov
                   ` (22 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new THP refcounting, we don't need tricks to stabilize a huge
page. If we've got a reference to a tail page, it can't be split under us.

This patch effectively reverts a5b338f2b0b1.
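
A rough sketch of the resulting flow (condensed, not lifted verbatim from
the patch): the reference taken by get_user_pages_fast() alone keeps a
THP from being split, so the page it returns can be locked and used for
the key directly, without re-pinning the head under disabled IRQs:

	err = get_user_pages_fast(address, 1, 1, &page);
	if (err < 0)
		return err;

	lock_page(page);
	if (!page->mapping) {	/* e.g. shmem moved it to swapcache */
		int shmem_swizzled = PageSwapCache(page);

		unlock_page(page);
		put_page(page);
		if (shmem_swizzled)
			goto again;
		return -EFAULT;
	}
	/* ... build the key from page->mapping / basepage_index() ... */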

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 kernel/futex.c | 61 ++++++++++++----------------------------------------------
 1 file changed, 12 insertions(+), 49 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index 2a5e3830e953..1809371ebef8 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -399,7 +399,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
-	struct page *page, *page_head;
+	struct page *page;
 	int err, ro = 0;
 
 	/*
@@ -442,46 +442,9 @@ again:
 	else
 		err = 0;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	page_head = page;
-	if (unlikely(PageTail(page))) {
-		put_page(page);
-		/* serialize against __split_huge_page_splitting() */
-		local_irq_disable();
-		if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {
-			page_head = compound_head(page);
-			/*
-			 * page_head is valid pointer but we must pin
-			 * it before taking the PG_lock and/or
-			 * PG_compound_lock. The moment we re-enable
-			 * irqs __split_huge_page_splitting() can
-			 * return and the head page can be freed from
-			 * under us. We can't take the PG_lock and/or
-			 * PG_compound_lock on a page that could be
-			 * freed from under us.
-			 */
-			if (page != page_head) {
-				get_page(page_head);
-				put_page(page);
-			}
-			local_irq_enable();
-		} else {
-			local_irq_enable();
-			goto again;
-		}
-	}
-#else
-	page_head = compound_head(page);
-	if (page != page_head) {
-		get_page(page_head);
-		put_page(page);
-	}
-#endif
-
-	lock_page(page_head);
-
+	lock_page(page);
 	/*
-	 * If page_head->mapping is NULL, then it cannot be a PageAnon
+	 * If page->mapping is NULL, then it cannot be a PageAnon
 	 * page; but it might be the ZERO_PAGE or in the gate area or
 	 * in a special mapping (all cases which we are happy to fail);
 	 * or it may have been a good file page when get_user_pages_fast
@@ -493,12 +456,12 @@ again:
 	 *
 	 * The case we do have to guard against is when memory pressure made
 	 * shmem_writepage move it from filecache to swapcache beneath us:
-	 * an unlikely race, but we do need to retry for page_head->mapping.
+	 * an unlikely race, but we do need to retry for page->mapping.
 	 */
-	if (!page_head->mapping) {
-		int shmem_swizzled = PageSwapCache(page_head);
-		unlock_page(page_head);
-		put_page(page_head);
+	if (!page->mapping) {
+		int shmem_swizzled = PageSwapCache(page);
+		unlock_page(page);
+		put_page(page);
 		if (shmem_swizzled)
 			goto again;
 		return -EFAULT;
@@ -511,7 +474,7 @@ again:
 	 * it's a read-only handle, it's expected that futexes attach to
 	 * the object not the particular process.
 	 */
-	if (PageAnon(page_head)) {
+	if (PageAnon(page)) {
 		/*
 		 * A RO anonymous page will never change and thus doesn't make
 		 * sense for futex operations.
@@ -526,15 +489,15 @@ again:
 		key->private.address = address;
 	} else {
 		key->both.offset |= FUT_OFF_INODE; /* inode-based key */
-		key->shared.inode = page_head->mapping->host;
+		key->shared.inode = page->mapping->host;
 		key->shared.pgoff = basepage_index(page);
 	}
 
 	get_futex_key_refs(key); /* implies MB (B) */
 
 out:
-	unlock_page(page_head);
-	put_page(page_head);
+	unlock_page(page);
+	put_page(page);
 	return err;
 }
 
-- 
2.1.4



* [PATCHv6 15/36] ksm: prepare to new THP semantics
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (13 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 14/36] futex, thp: remove special case for THP in get_futex_key Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 16/36] mm, thp: remove compound_lock Kirill A. Shutemov
                   ` (21 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We don't need special code to stabilize THP. If you've got a reference to
any subpage of a THP, it will not be split under you.

The new split_huge_page() also accepts tail pages: there is no need for
special code to get a reference to the head page.
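
A condensed sketch of how the hunk below restructures
try_to_merge_one_page(): the THP handling moves under the page lock and
becomes a plain split, with split_huge_page() taking the (possibly tail)
page as is:

	if (!trylock_page(page))
		goto out;

	if (PageTransCompound(page)) {
		err = split_huge_page(page);
		if (err)
			goto out_unlock;
	}

	/* ... write-protect the pte and try the merge as before ... */

out_unlock:
	unlock_page(page);
out:
	return err;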

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/ksm.c | 57 ++++++++++-----------------------------------------------
 1 file changed, 10 insertions(+), 47 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index fe09f3ddc912..fb333d8188fc 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -441,20 +441,6 @@ static void break_cow(struct rmap_item *rmap_item)
 	up_read(&mm->mmap_sem);
 }
 
-static struct page *page_trans_compound_anon(struct page *page)
-{
-	if (PageTransCompound(page)) {
-		struct page *head = compound_head(page);
-		/*
-		 * head may actually be splitted and freed from under
-		 * us but it's ok here.
-		 */
-		if (PageAnon(head))
-			return head;
-	}
-	return NULL;
-}
-
 static struct page *get_mergeable_page(struct rmap_item *rmap_item)
 {
 	struct mm_struct *mm = rmap_item->mm;
@@ -470,7 +456,7 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
 	page = follow_page(vma, addr, FOLL_GET);
 	if (IS_ERR_OR_NULL(page))
 		goto out;
-	if (PageAnon(page) || page_trans_compound_anon(page)) {
+	if (PageAnon(page)) {
 		flush_anon_page(vma, page, addr);
 		flush_dcache_page(page);
 	} else {
@@ -976,33 +962,6 @@ out:
 	return err;
 }
 
-static int page_trans_compound_anon_split(struct page *page)
-{
-	int ret = 0;
-	struct page *transhuge_head = page_trans_compound_anon(page);
-	if (transhuge_head) {
-		/* Get the reference on the head to split it. */
-		if (get_page_unless_zero(transhuge_head)) {
-			/*
-			 * Recheck we got the reference while the head
-			 * was still anonymous.
-			 */
-			if (PageAnon(transhuge_head))
-				ret = split_huge_page(transhuge_head);
-			else
-				/*
-				 * Retry later if split_huge_page run
-				 * from under us.
-				 */
-				ret = 1;
-			put_page(transhuge_head);
-		} else
-			/* Retry later if split_huge_page run from under us. */
-			ret = 1;
-	}
-	return ret;
-}
-
 /*
  * try_to_merge_one_page - take two pages and merge them into one
  * @vma: the vma that holds the pte pointing to page
@@ -1023,9 +982,6 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
 
 	if (!(vma->vm_flags & VM_MERGEABLE))
 		goto out;
-	if (PageTransCompound(page) && page_trans_compound_anon_split(page))
-		goto out;
-	BUG_ON(PageTransCompound(page));
 	if (!PageAnon(page))
 		goto out;
 
@@ -1038,6 +994,13 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
 	 */
 	if (!trylock_page(page))
 		goto out;
+
+	if (PageTransCompound(page)) {
+		err = split_huge_page(page);
+		if (err)
+			goto out_unlock;
+	}
+
 	/*
 	 * If this anonymous page is mapped only here, its pte may need
 	 * to be write-protected.  If it's mapped elsewhere, all of its
@@ -1068,6 +1031,7 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
 		}
 	}
 
+out_unlock:
 	unlock_page(page);
 out:
 	return err;
@@ -1620,8 +1584,7 @@ next_mm:
 				cond_resched();
 				continue;
 			}
-			if (PageAnon(*page) ||
-			    page_trans_compound_anon(*page)) {
+			if (PageAnon(*page)) {
 				flush_anon_page(vma, *page, ksm_scan.address);
 				flush_dcache_page(*page);
 				rmap_item = get_next_rmap_item(slot,
-- 
2.1.4



* [PATCHv6 16/36] mm, thp: remove compound_lock
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (14 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 15/36] ksm: prepare to new THP semantics Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 17/36] arm64, thp: remove infrastructure for handling splitting PMDs Kirill A. Shutemov
                   ` (20 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We are going to use migration entries to stabilize page counts, which
means we no longer need compound_lock() for that.
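
A minimal conceptual sketch of the idea (the helper below is illustrative
only, not something this patch adds): once every mapping of the compound
page has been replaced with migration entries, no new references can be
taken through the page tables, so extra pins can be detected with a plain
refcount comparison instead of under compound_lock():

	/* illustrative only: detect extra pins on a frozen compound page */
	static bool thp_has_extra_pins(struct page *head, int expected_refs)
	{
		/* expected_refs: the caller's pin plus known bookkeeping refs */
		return page_count(head) != expected_refs;
	}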

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm.h         | 35 -----------------------------------
 include/linux/page-flags.h | 12 +-----------
 mm/debug.c                 |  3 ---
 mm/memcontrol.c            | 11 +++--------
 4 files changed, 4 insertions(+), 57 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f079d9ffc797..31cd5be081cf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -403,41 +403,6 @@ static inline int is_vmalloc_or_module_addr(const void *x)
 
 extern void kvfree(const void *addr);
 
-static inline void compound_lock(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	VM_BUG_ON_PAGE(PageSlab(page), page);
-	bit_spin_lock(PG_compound_lock, &page->flags);
-#endif
-}
-
-static inline void compound_unlock(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	VM_BUG_ON_PAGE(PageSlab(page), page);
-	bit_spin_unlock(PG_compound_lock, &page->flags);
-#endif
-}
-
-static inline unsigned long compound_lock_irqsave(struct page *page)
-{
-	unsigned long uninitialized_var(flags);
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	local_irq_save(flags);
-	compound_lock(page);
-#endif
-	return flags;
-}
-
-static inline void compound_unlock_irqrestore(struct page *page,
-					      unsigned long flags)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	compound_unlock(page);
-	local_irq_restore(flags);
-#endif
-}
-
 /*
  * The atomic page->_mapcount, starts from -1: so that transitions
  * both from it and to it can be tracked, using atomic_inc_and_test
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 91b7f9b2b774..74b7cece1dfa 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -106,9 +106,6 @@ enum pageflags {
 #ifdef CONFIG_MEMORY_FAILURE
 	PG_hwpoison,		/* hardware poisoned page. Don't touch */
 #endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	PG_compound_lock,
-#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -683,12 +680,6 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 #define __PG_MLOCKED		0
 #endif
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define __PG_COMPOUND_LOCK		(1 << PG_compound_lock)
-#else
-#define __PG_COMPOUND_LOCK		0
-#endif
-
 /*
  * Flags checked when a page is freed.  Pages being freed should not have
  * these flags set.  It they are, there is a problem.
@@ -698,8 +689,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 	 1 << PG_private | 1 << PG_private_2 | \
 	 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
-	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
-	 __PG_COMPOUND_LOCK)
+	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON )
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.
diff --git a/mm/debug.c b/mm/debug.c
index 3eb3ac2fcee7..9dfcd77e7354 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -45,9 +45,6 @@ static const struct trace_print_flags pageflag_names[] = {
 #ifdef CONFIG_MEMORY_FAILURE
 	{1UL << PG_hwpoison,		"hwpoison"	},
 #endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	{1UL << PG_compound_lock,	"compound_lock"	},
-#endif
 };
 
 static void dump_flags(unsigned long flags,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1c7c6bb69d3c..d0d862e4ca74 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2761,9 +2761,7 @@ struct mem_cgroup *__mem_cgroup_from_kmem(void *ptr)
 
 /*
  * Because tail pages are not marked as "used", set it. We're under
- * zone->lru_lock, 'splitting on pmd' and compound_lock.
- * charge/uncharge will be never happen and move_account() is done under
- * compound_lock(), so we don't have to take care of races.
+ * zone->lru_lock and migration entries setup in all page mappings.
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
@@ -4738,9 +4736,7 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
  * @from: mem_cgroup which the page is moved from.
  * @to:	mem_cgroup which the page is moved to. @from != @to.
  *
- * The caller must confirm following.
- * - page is not on LRU (isolate_page() is useful.)
- * - compound_lock is held when nr_pages > 1
+ * The caller must make sure the page is not on LRU (isolate_page() is useful.)
  *
  * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
  * from old cgroup.
@@ -5061,8 +5057,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	struct page *page;
 
 	/*
-	 * We don't take compound_lock() here but no race with splitting thp
-	 * happens because:
+	 * No race with splitting thp happens because:
 	 *  - if pmd_trans_huge_lock() returns 1, the relevant thp is not
 	 *    under splitting, which means there's no concurrent thp split,
 	 *  - if another thread runs into split_huge_page() just after we
-- 
2.1.4



* [PATCHv6 17/36] arm64, thp: remove infrastructure for handling splitting PMDs
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (15 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 16/36] mm, thp: remove compound_lock Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 18/36] arm, " Kirill A. Shutemov
                   ` (19 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting we don't need to mark PMDs splitting. Let's drop
the code that handles this.

pmdp_splitting_flush() is not needed either: on splitting a PMD we will do
pmdp_clear_flush() + set_pte_at(). pmdp_clear_flush() will do an IPI as
needed for fast_gup.
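
For reference, a rough generic-code sketch of the split sequence the
commit message describes (notifier hooks and dirty/write/young bit
propagation omitted); the TLB flush in pmdp_clear_flush() already
provides the serialisation against fast_gup, so no transient "splitting"
PMD state is needed:

	/* assumes the usual locals: mm, vma, pmd, haddr, page */
	pgtable_t pgtable;
	pmd_t _pmd;
	int i;

	pmdp_clear_flush(vma, haddr, pmd);	/* clear huge entry + flush TLB */

	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
	pmd_populate(mm, &_pmd, pgtable);

	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
		pte_t *pte, entry;

		entry = mk_pte(page + i, vma->vm_page_prot);
		pte = pte_offset_map(&_pmd, haddr);
		set_pte_at(mm, haddr, pte, entry);
		pte_unmap(pte);
	}
	smp_wmb();	/* make ptes visible before the new pmd */
	pmd_populate(mm, pmd, pgtable);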

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/arm64/include/asm/pgtable.h |  9 ---------
 arch/arm64/mm/flush.c            | 16 ----------------
 2 files changed, 25 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index bd5db28324ba..37cdbf37934c 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -274,20 +274,11 @@ static inline pgprot_t mk_sect_prot(pgprot_t prot)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define pmd_trans_huge(pmd)	(pmd_val(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT))
-#define pmd_trans_splitting(pmd)	pte_special(pmd_pte(pmd))
-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-struct vm_area_struct;
-void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
-			  pmd_t *pmdp);
-#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #define pmd_dirty(pmd)		pte_dirty(pmd_pte(pmd))
 #define pmd_young(pmd)		pte_young(pmd_pte(pmd))
 #define pmd_dirty(pmd)		pte_dirty(pmd_pte(pmd))
 #define pmd_wrprotect(pmd)	pte_pmd(pte_wrprotect(pmd_pte(pmd)))
-#define pmd_mksplitting(pmd)	pte_pmd(pte_mkspecial(pmd_pte(pmd)))
 #define pmd_mkold(pmd)		pte_pmd(pte_mkold(pmd_pte(pmd)))
 #define pmd_mkwrite(pmd)	pte_pmd(pte_mkwrite(pmd_pte(pmd)))
 #define pmd_mkclean(pmd)	pte_pmd(pte_mkclean(pmd_pte(pmd)))
diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index b6f14e8d2121..0d64089d28b5 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -104,19 +104,3 @@ EXPORT_SYMBOL(flush_dcache_page);
  */
 EXPORT_SYMBOL(flush_cache_all);
 EXPORT_SYMBOL(flush_icache_range);
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
-void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
-			  pmd_t *pmdp)
-{
-	pmd_t pmd = pmd_mksplitting(*pmdp);
-
-	VM_BUG_ON(address & ~PMD_MASK);
-	set_pmd_at(vma->vm_mm, address, pmdp, pmd);
-
-	/* dummy IPI to serialise against fast_gup */
-	kick_all_cpus_sync();
-}
-#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
2.1.4



* [PATCHv6 18/36] arm, thp: remove infrastructure for handling splitting PMDs
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (16 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 17/36] arm64, thp: remove infrastructure for handling splitting PMDs Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 19/36] mips, " Kirill A. Shutemov
                   ` (18 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting we don't need to mark PMDs splitting. Let's drop
the code that handles this.

pmdp_splitting_flush() is not needed either: on splitting a PMD we will do
pmdp_clear_flush() + set_pte_at(). pmdp_clear_flush() will do an IPI as
needed for fast_gup.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/arm/include/asm/pgtable-3level.h | 10 ----------
 arch/arm/lib/uaccess_with_memcpy.c    |  5 ++---
 arch/arm/mm/flush.c                   | 15 ---------------
 3 files changed, 2 insertions(+), 28 deletions(-)

diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index 6d6012a320b2..d42f81f13618 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -88,7 +88,6 @@
 
 #define L_PMD_SECT_VALID	(_AT(pmdval_t, 1) << 0)
 #define L_PMD_SECT_DIRTY	(_AT(pmdval_t, 1) << 55)
-#define L_PMD_SECT_SPLITTING	(_AT(pmdval_t, 1) << 56)
 #define L_PMD_SECT_NONE		(_AT(pmdval_t, 1) << 57)
 #define L_PMD_SECT_RDONLY	(_AT(pteval_t, 1) << 58)
 
@@ -232,21 +231,12 @@ static inline pte_t pte_mkspecial(pte_t pte)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define pmd_trans_huge(pmd)	(pmd_val(pmd) && !pmd_table(pmd))
-#define pmd_trans_splitting(pmd) (pmd_isset((pmd), L_PMD_SECT_SPLITTING))
-
-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
-			  pmd_t *pmdp);
-#endif
-#endif
 
 #define PMD_BIT_FUNC(fn,op) \
 static inline pmd_t pmd_##fn(pmd_t pmd) { pmd_val(pmd) op; return pmd; }
 
 PMD_BIT_FUNC(wrprotect,	|= L_PMD_SECT_RDONLY);
 PMD_BIT_FUNC(mkold,	&= ~PMD_SECT_AF);
-PMD_BIT_FUNC(mksplitting, |= L_PMD_SECT_SPLITTING);
 PMD_BIT_FUNC(mkwrite,   &= ~L_PMD_SECT_RDONLY);
 PMD_BIT_FUNC(mkdirty,   |= L_PMD_SECT_DIRTY);
 PMD_BIT_FUNC(mkclean,   &= ~L_PMD_SECT_DIRTY);
diff --git a/arch/arm/lib/uaccess_with_memcpy.c b/arch/arm/lib/uaccess_with_memcpy.c
index 3e58d710013c..11af98f46f05 100644
--- a/arch/arm/lib/uaccess_with_memcpy.c
+++ b/arch/arm/lib/uaccess_with_memcpy.c
@@ -52,14 +52,13 @@ pin_page_for_write(const void __user *_addr, pte_t **ptep, spinlock_t **ptlp)
 	 *
 	 * Lock the page table for the destination and check
 	 * to see that it's still huge and whether or not we will
-	 * need to fault on write, or if we have a splitting THP.
+	 * need to fault on write.
 	 */
 	if (unlikely(pmd_thp_or_huge(*pmd))) {
 		ptl = &current->mm->page_table_lock;
 		spin_lock(ptl);
 		if (unlikely(!pmd_thp_or_huge(*pmd)
-			|| pmd_hugewillfault(*pmd)
-			|| pmd_trans_splitting(*pmd))) {
+			|| pmd_hugewillfault(*pmd))) {
 			spin_unlock(ptl);
 			return 0;
 		}
diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
index 34b66af516ea..77f229302032 100644
--- a/arch/arm/mm/flush.c
+++ b/arch/arm/mm/flush.c
@@ -400,18 +400,3 @@ void __flush_anon_page(struct vm_area_struct *vma, struct page *page, unsigned l
 	 */
 	__cpuc_flush_dcache_area(page_address(page), PAGE_SIZE);
 }
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#ifdef CONFIG_HAVE_RCU_TABLE_FREE
-void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
-			  pmd_t *pmdp)
-{
-	pmd_t pmd = pmd_mksplitting(*pmdp);
-	VM_BUG_ON(address & ~PMD_MASK);
-	set_pmd_at(vma->vm_mm, address, pmdp, pmd);
-
-	/* dummy IPI to serialise against fast_gup */
-	kick_all_cpus_sync();
-}
-#endif /* CONFIG_HAVE_RCU_TABLE_FREE */
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
2.1.4



* [PATCHv6 19/36] mips, thp: remove infrastructure for handling splitting PMDs
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (17 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 18/36] arm, " Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 20/36] powerpc, " Kirill A. Shutemov
                   ` (17 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting we don't need to mark PMDs splitting. Let's drop
the code that handles this.

pmdp_splitting_flush() is not needed either: on splitting a PMD we will do
pmdp_clear_flush() + set_pte_at(). pmdp_clear_flush() will do an IPI as
needed for fast_gup.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/mips/include/asm/pgtable-bits.h |  6 +-----
 arch/mips/include/asm/pgtable.h      | 18 ------------------
 arch/mips/mm/gup.c                   | 13 +------------
 arch/mips/mm/pgtable-64.c            | 14 --------------
 arch/mips/mm/tlbex.c                 |  1 -
 5 files changed, 2 insertions(+), 50 deletions(-)

diff --git a/arch/mips/include/asm/pgtable-bits.h b/arch/mips/include/asm/pgtable-bits.h
index 91747c282bb3..403e4b226c36 100644
--- a/arch/mips/include/asm/pgtable-bits.h
+++ b/arch/mips/include/asm/pgtable-bits.h
@@ -120,17 +120,13 @@
 /* huge tlb page */
 #define _PAGE_HUGE_SHIFT	(_PAGE_MODIFIED_SHIFT + 1)
 #define _PAGE_HUGE		(1 << _PAGE_HUGE_SHIFT)
-#define _PAGE_SPLITTING_SHIFT	(_PAGE_HUGE_SHIFT + 1)
-#define _PAGE_SPLITTING		(1 << _PAGE_SPLITTING_SHIFT)
 #else
 #define _PAGE_HUGE_SHIFT	(_PAGE_MODIFIED_SHIFT)
 #define _PAGE_HUGE		({BUG(); 1; })	/* Dummy value */
-#define _PAGE_SPLITTING_SHIFT	(_PAGE_HUGE_SHIFT)
-#define _PAGE_SPLITTING		({BUG(); 1; })	/* Dummy value */
 #endif
 
 /* Page cannot be executed */
-#define _PAGE_NO_EXEC_SHIFT	(cpu_has_rixi ? _PAGE_SPLITTING_SHIFT + 1 : _PAGE_SPLITTING_SHIFT)
+#define _PAGE_NO_EXEC_SHIFT	(cpu_has_rixi ? _PAGE_HUGE_SHIFT + 1 : _PAGE_HUGE_SHIFT)
 #define _PAGE_NO_EXEC		({BUG_ON(!cpu_has_rixi); 1 << _PAGE_NO_EXEC_SHIFT; })
 
 /* Page cannot be read */
diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index 5dc360bfc1a7..a8ed1627bfc0 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -456,27 +456,9 @@ static inline pmd_t pmd_mkhuge(pmd_t pmd)
 	return pmd;
 }
 
-static inline int pmd_trans_splitting(pmd_t pmd)
-{
-	return !!(pmd_val(pmd) & _PAGE_SPLITTING);
-}
-
-static inline pmd_t pmd_mksplitting(pmd_t pmd)
-{
-	pmd_val(pmd) |= _PAGE_SPLITTING;
-
-	return pmd;
-}
-
 extern void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 		       pmd_t *pmdp, pmd_t pmd);
 
-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-/* Extern to avoid header file madness */
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
-					unsigned long address,
-					pmd_t *pmdp);
-
 #define __HAVE_ARCH_PMD_WRITE
 static inline int pmd_write(pmd_t pmd)
 {
diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
index 36a35115dc2e..1afd87c999b0 100644
--- a/arch/mips/mm/gup.c
+++ b/arch/mips/mm/gup.c
@@ -107,18 +107,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		pmd_t pmd = *pmdp;
 
 		next = pmd_addr_end(addr, end);
-		/*
-		 * The pmd_trans_splitting() check below explains why
-		 * pmdp_splitting_flush has to flush the tlb, to stop
-		 * this gup-fast code from running while we set the
-		 * splitting bit in the pmd. Returning zero will take
-		 * the slow path that will call wait_split_huge_page()
-		 * if the pmd is still in splitting state. gup-fast
-		 * can't because it has irq disabled and
-		 * wait_split_huge_page() would never return as the
-		 * tlb flush IPI wouldn't run.
-		 */
-		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+		if (pmd_none(pmd))
 			return 0;
 		if (unlikely(pmd_huge(pmd))) {
 			if (!gup_huge_pmd(pmd, addr, next, write, pages,nr))
diff --git a/arch/mips/mm/pgtable-64.c b/arch/mips/mm/pgtable-64.c
index e8adc0069d66..ce4473e7c0d2 100644
--- a/arch/mips/mm/pgtable-64.c
+++ b/arch/mips/mm/pgtable-64.c
@@ -62,20 +62,6 @@ void pmd_init(unsigned long addr, unsigned long pagetable)
 }
 #endif
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-
-void pmdp_splitting_flush(struct vm_area_struct *vma,
-			 unsigned long address,
-			 pmd_t *pmdp)
-{
-	if (!pmd_trans_splitting(*pmdp)) {
-		pmd_t pmd = pmd_mksplitting(*pmdp);
-		set_pmd_at(vma->vm_mm, address, pmdp, pmd);
-	}
-}
-
-#endif
-
 pmd_t mk_pmd(struct page *page, pgprot_t prot)
 {
 	pmd_t pmd;
diff --git a/arch/mips/mm/tlbex.c b/arch/mips/mm/tlbex.c
index d75ff73a2012..d12412bdd506 100644
--- a/arch/mips/mm/tlbex.c
+++ b/arch/mips/mm/tlbex.c
@@ -229,7 +229,6 @@ static void output_pgtable_bits_defines(void)
 	pr_define("_PAGE_MODIFIED_SHIFT %d\n", _PAGE_MODIFIED_SHIFT);
 #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
 	pr_define("_PAGE_HUGE_SHIFT %d\n", _PAGE_HUGE_SHIFT);
-	pr_define("_PAGE_SPLITTING_SHIFT %d\n", _PAGE_SPLITTING_SHIFT);
 #endif
 	if (cpu_has_rixi) {
 #ifdef _PAGE_NO_EXEC_SHIFT
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 20/36] powerpc, thp: remove infrastructure for handling splitting PMDs
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (18 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 19/36] mips, " Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 21/36] s390, " Kirill A. Shutemov
                   ` (16 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting we don't need to mark PMDs splitting. Let's drop
the code that handles this.

pmdp_splitting_flush() is not needed either: when splitting a PMD we will
do pmdp_clear_flush() + set_pte_at(). pmdp_clear_flush() will send an IPI
as needed to serialize against fast_gup.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/powerpc/include/asm/kvm_book3s_64.h |  6 ----
 arch/powerpc/include/asm/pgtable-ppc64.h | 25 +---------------
 arch/powerpc/mm/hugepage-hash64.c        |  3 --
 arch/powerpc/mm/hugetlbpage.c            |  7 +----
 arch/powerpc/mm/pgtable_64.c             | 49 --------------------------------
 5 files changed, 2 insertions(+), 88 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 2d81e202bdcc..9a96fe3caa48 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -298,12 +298,6 @@ static inline pte_t kvmppc_read_update_linux_pte(pte_t *ptep, int writing,
 			cpu_relax();
 			continue;
 		}
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		/* If hugepage and is trans splitting return None */
-		if (unlikely(hugepage &&
-			     pmd_trans_splitting(pte_pmd(old_pte))))
-			return __pte(0);
-#endif
 		/* If pte is not present return None */
 		if (unlikely(!(old_pte & _PAGE_PRESENT)))
 			return __pte(0);
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 012e3adab7f8..5729b303d206 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -358,11 +358,6 @@ void pgtable_cache_init(void);
 #endif /* __ASSEMBLY__ */
 
 /*
- * THP pages can't be special. So use the _PAGE_SPECIAL
- */
-#define _PAGE_SPLITTING _PAGE_SPECIAL
-
-/*
  * We need to differentiate between explicit huge page and THP huge
  * page, since THP huge page also need to track real subpage details
  */
@@ -372,8 +367,7 @@ void pgtable_cache_init(void);
  * set of bits not changed in pmd_modify.
  */
 #define _HPAGE_CHG_MASK (PTE_RPN_MASK | _PAGE_HPTEFLAGS |		\
-			 _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_SPLITTING | \
-			 _PAGE_THP_HUGE)
+			 _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_THP_HUGE)
 
 #ifndef __ASSEMBLY__
 /*
@@ -455,13 +449,6 @@ static inline int pmd_trans_huge(pmd_t pmd)
 	return (pmd_val(pmd) & 0x3) && (pmd_val(pmd) & _PAGE_THP_HUGE);
 }
 
-static inline int pmd_trans_splitting(pmd_t pmd)
-{
-	if (pmd_trans_huge(pmd))
-		return pmd_val(pmd) & _PAGE_SPLITTING;
-	return 0;
-}
-
 extern int has_transparent_hugepage(void);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -514,12 +501,6 @@ static inline pmd_t pmd_mknotpresent(pmd_t pmd)
 	return pmd;
 }
 
-static inline pmd_t pmd_mksplitting(pmd_t pmd)
-{
-	pmd_val(pmd) |= _PAGE_SPLITTING;
-	return pmd;
-}
-
 #define __HAVE_ARCH_PMD_SAME
 static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 {
@@ -570,10 +551,6 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr,
 	pmd_hugepage_update(mm, addr, pmdp, _PAGE_RW, 0);
 }
 
-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
-				 unsigned long address, pmd_t *pmdp);
-
 #define pmdp_collapse_flush pmdp_collapse_flush
 extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
 				 unsigned long address, pmd_t *pmdp);
diff --git a/arch/powerpc/mm/hugepage-hash64.c b/arch/powerpc/mm/hugepage-hash64.c
index 86686514ae13..078f7207afd2 100644
--- a/arch/powerpc/mm/hugepage-hash64.c
+++ b/arch/powerpc/mm/hugepage-hash64.c
@@ -39,9 +39,6 @@ int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
 		/* If PMD busy, retry the access */
 		if (unlikely(old_pmd & _PAGE_BUSY))
 			return 0;
-		/* If PMD is trans splitting retry the access */
-		if (unlikely(old_pmd & _PAGE_SPLITTING))
-			return 0;
 		/* If PMD permissions don't match, take page fault */
 		if (unlikely(access & ~old_pmd))
 			return 1;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 5f979e51ef7b..a09ff52b906e 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -997,13 +997,8 @@ pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, unsigned *shift
 			/*
 			 * A hugepage collapse is captured by pmd_none, because
 			 * it mark the pmd none and do a hpte invalidate.
-			 *
-			 * A hugepage split is captured by pmd_trans_splitting
-			 * because we mark the pmd trans splitting and do a
-			 * hpte invalidate
-			 *
 			 */
-			if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+			if (pmd_none(pmd))
 				return NULL;
 
 			if (pmd_huge(pmd) || pmd_large(pmd)) {
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 52c4827bceb0..7bad475bc235 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -614,55 +614,6 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 }
 
 /*
- * We mark the pmd splitting and invalidate all the hpte
- * entries for this hugepage.
- */
-void pmdp_splitting_flush(struct vm_area_struct *vma,
-			  unsigned long address, pmd_t *pmdp)
-{
-	unsigned long old, tmp;
-
-	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-
-#ifdef CONFIG_DEBUG_VM
-	WARN_ON(!pmd_trans_huge(*pmdp));
-	assert_spin_locked(&vma->vm_mm->page_table_lock);
-#endif
-
-#ifdef PTE_ATOMIC_UPDATES
-
-	__asm__ __volatile__(
-	"1:	ldarx	%0,0,%3\n\
-		andi.	%1,%0,%6\n\
-		bne-	1b \n\
-		ori	%1,%0,%4 \n\
-		stdcx.	%1,0,%3 \n\
-		bne-	1b"
-	: "=&r" (old), "=&r" (tmp), "=m" (*pmdp)
-	: "r" (pmdp), "i" (_PAGE_SPLITTING), "m" (*pmdp), "i" (_PAGE_BUSY)
-	: "cc" );
-#else
-	old = pmd_val(*pmdp);
-	*pmdp = __pmd(old | _PAGE_SPLITTING);
-#endif
-	/*
-	 * If we didn't had the splitting flag set, go and flush the
-	 * HPTE entries.
-	 */
-	trace_hugepage_splitting(address, old);
-	if (!(old & _PAGE_SPLITTING)) {
-		/* We need to flush the hpte */
-		if (old & _PAGE_HASHPTE)
-			hpte_do_hugepage_flush(vma->vm_mm, address, pmdp, old);
-	}
-	/*
-	 * This ensures that generic code that rely on IRQ disabling
-	 * to prevent a parallel THP split work as expected.
-	 */
-	kick_all_cpus_sync();
-}
-
-/*
  * We want to put the pgtable in pmd and use pgtable for tracking
  * the base page size hptes
  */
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 21/36] s390, thp: remove infrastructure for handling splitting PMDs
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (19 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 20/36] powerpc, " Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 22/36] sparc, " Kirill A. Shutemov
                   ` (15 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting we don't need to mark PMDs splitting. Let's drop
the code that handles this.

pmdp_splitting_flush() is not needed either: when splitting a PMD we will
do pmdp_clear_flush() + set_pte_at(). pmdp_clear_flush() will send an IPI
as needed to serialize against fast_gup.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/s390/include/asm/pgtable.h | 15 +--------------
 arch/s390/mm/gup.c              | 11 +----------
 arch/s390/mm/pgtable.c          | 16 ----------------
 3 files changed, 2 insertions(+), 40 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index cf7e7c6bab9d..06bf2f1367f9 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -380,7 +380,6 @@ static inline int is_module_addr(void *addr)
 
 #define _SEGMENT_ENTRY_DIRTY	0x2000	/* SW segment dirty bit */
 #define _SEGMENT_ENTRY_YOUNG	0x1000	/* SW segment young bit */
-#define _SEGMENT_ENTRY_SPLIT	0x0800	/* THP splitting bit */
 #define _SEGMENT_ENTRY_LARGE	0x0400	/* STE-format control, large page */
 #define _SEGMENT_ENTRY_READ	0x0002	/* SW segment read bit */
 #define _SEGMENT_ENTRY_WRITE	0x0001	/* SW segment write bit */
@@ -404,8 +403,6 @@ static inline int is_module_addr(void *addr)
  * read-write, old segment table entries (origin!=0)
  */
 
-#define _SEGMENT_ENTRY_SPLIT_BIT 11	/* THP splitting bit number */
-
 /* Page status table bits for virtualization */
 #define PGSTE_ACC_BITS	0xf000000000000000UL
 #define PGSTE_FP_BIT	0x0800000000000000UL
@@ -617,10 +614,6 @@ static inline int pmd_bad(pmd_t pmd)
 	return (pmd_val(pmd) & ~_SEGMENT_ENTRY_BITS) != 0;
 }
 
-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
-				 unsigned long addr, pmd_t *pmdp);
-
 #define  __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
 extern int pmdp_set_access_flags(struct vm_area_struct *vma,
 				 unsigned long address, pmd_t *pmdp,
@@ -1494,7 +1487,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 	if (pmd_large(pmd)) {
 		pmd_val(pmd) &= _SEGMENT_ENTRY_ORIGIN_LARGE |
 			_SEGMENT_ENTRY_DIRTY | _SEGMENT_ENTRY_YOUNG |
-			_SEGMENT_ENTRY_LARGE | _SEGMENT_ENTRY_SPLIT;
+			_SEGMENT_ENTRY_LARGE;
 		pmd_val(pmd) |= massage_pgprot_pmd(newprot);
 		if (!(pmd_val(pmd) & _SEGMENT_ENTRY_DIRTY))
 			pmd_val(pmd) |= _SEGMENT_ENTRY_PROTECT;
@@ -1602,12 +1595,6 @@ extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 #define __HAVE_ARCH_PGTABLE_WITHDRAW
 extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
 
-static inline int pmd_trans_splitting(pmd_t pmd)
-{
-	return (pmd_val(pmd) & _SEGMENT_ENTRY_LARGE) &&
-		(pmd_val(pmd) & _SEGMENT_ENTRY_SPLIT);
-}
-
 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 			      pmd_t *pmdp, pmd_t entry)
 {
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index dab30527ad41..83eb4629bdb2 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -104,16 +104,7 @@ static inline int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr,
 		pmd = *pmdp;
 		barrier();
 		next = pmd_addr_end(addr, end);
-		/*
-		 * The pmd_trans_splitting() check below explains why
-		 * pmdp_splitting_flush() has to serialize with
-		 * smp_call_function() against our disabled IRQs, to stop
-		 * this gup-fast code from running while we set the
-		 * splitting bit in the pmd. Returning zero will take
-		 * the slow path that will call wait_split_huge_page()
-		 * if the pmd is still in splitting state.
-		 */
-		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+		if (pmd_none(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
 			if (!gup_huge_pmd(pmdp, pmd, addr, next,
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index b2c1542f2ba2..5736232f7936 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -1419,22 +1419,6 @@ int pmdp_set_access_flags(struct vm_area_struct *vma,
 	return 1;
 }
 
-static void pmdp_splitting_flush_sync(void *arg)
-{
-	/* Simply deliver the interrupt */
-}
-
-void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
-			  pmd_t *pmdp)
-{
-	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-	if (!test_and_set_bit(_SEGMENT_ENTRY_SPLIT_BIT,
-			      (unsigned long *) pmdp)) {
-		/* need to serialize against gup-fast (IRQ disabled) */
-		smp_call_function(pmdp_splitting_flush_sync, NULL, 1);
-	}
-}
-
 void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 				pgtable_t pgtable)
 {
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 22/36] sparc, thp: remove infrastructure for handling splitting PMDs
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (20 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 21/36] s390, " Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 23/36] tile, " Kirill A. Shutemov
                   ` (14 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting we don't need to mark PMDs splitting. Let's drop
the code that handles this.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/sparc/include/asm/pgtable_64.h | 16 ----------------
 arch/sparc/mm/fault_64.c            |  3 ---
 arch/sparc/mm/gup.c                 |  2 +-
 3 files changed, 1 insertion(+), 20 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 9953d3c86e8b..46fcb17d226b 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -661,13 +661,6 @@ static inline unsigned long pmd_trans_huge(pmd_t pmd)
 	return pte_val(pte) & _PAGE_PMD_HUGE;
 }
 
-static inline unsigned long pmd_trans_splitting(pmd_t pmd)
-{
-	pte_t pte = __pte(pmd_val(pmd));
-
-	return pmd_trans_huge(pmd) && pte_special(pte);
-}
-
 #define has_transparent_hugepage() 1
 
 static inline pmd_t pmd_mkold(pmd_t pmd)
@@ -724,15 +717,6 @@ static inline pmd_t pmd_mkwrite(pmd_t pmd)
 	return __pmd(pte_val(pte));
 }
 
-static inline pmd_t pmd_mksplitting(pmd_t pmd)
-{
-	pte_t pte = __pte(pmd_val(pmd));
-
-	pte = pte_mkspecial(pte);
-
-	return __pmd(pte_val(pte));
-}
-
 static inline pgprot_t pmd_pgprot(pmd_t entry)
 {
 	unsigned long val = pmd_val(entry);
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 479823249429..a504a4d85ebd 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -113,9 +113,6 @@ static unsigned int get_user_insn(unsigned long tpc)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if (pmd_trans_huge(*pmdp)) {
-		if (pmd_trans_splitting(*pmdp))
-			goto out_irq_enable;
-
 		pa  = pmd_pfn(*pmdp) << PAGE_SHIFT;
 		pa += tpc & ~HPAGE_MASK;
 
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 9091c5daa2e1..eb3d8e8ebc6b 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -114,7 +114,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		pmd_t pmd = *pmdp;
 
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+		if (pmd_none(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
 			if (!gup_huge_pmd(pmdp, pmd, addr, next,
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 23/36] tile, thp: remove infrastructure for handling splitting PMDs
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (21 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 22/36] sparc, " Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 24/36] x86, " Kirill A. Shutemov
                   ` (13 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting we don't need to mark PMDs splitting. Let's drop
the code that handles this.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/tile/include/asm/pgtable.h | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/arch/tile/include/asm/pgtable.h b/arch/tile/include/asm/pgtable.h
index 2b05ccbebed9..96cecf55522e 100644
--- a/arch/tile/include/asm/pgtable.h
+++ b/arch/tile/include/asm/pgtable.h
@@ -489,16 +489,6 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define has_transparent_hugepage() 1
 #define pmd_trans_huge pmd_huge_page
-
-static inline pmd_t pmd_mksplitting(pmd_t pmd)
-{
-	return pte_pmd(hv_pte_set_client2(pmd_pte(pmd)));
-}
-
-static inline int pmd_trans_splitting(pmd_t pmd)
-{
-	return hv_pte_get_client2(pmd_pte(pmd));
-}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 /*
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 24/36] x86, thp: remove infrastructure for handling splitting PMDs
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (22 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 23/36] tile, " Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 25/36] mm, " Kirill A. Shutemov
                   ` (12 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting we don't need to mark PMDs splitting. Let's drop
the code that handles this.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 arch/x86/include/asm/pgtable.h       |  9 ---------
 arch/x86/include/asm/pgtable_types.h |  2 --
 arch/x86/mm/gup.c                    | 13 +------------
 arch/x86/mm/pgtable.c                | 14 --------------
 4 files changed, 1 insertion(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 4383012950b0..37c280e0827a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -158,11 +158,6 @@ static inline int pmd_large(pmd_t pte)
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline int pmd_trans_splitting(pmd_t pmd)
-{
-	return pmd_val(pmd) & _PAGE_SPLITTING;
-}
-
 static inline int pmd_trans_huge(pmd_t pmd)
 {
 	return pmd_val(pmd) & _PAGE_PSE;
@@ -794,10 +789,6 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
 				  unsigned long address, pmd_t *pmdp);
 
 
-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
-				 unsigned long addr, pmd_t *pmdp);
-
 #define __HAVE_ARCH_PMD_WRITE
 static inline int pmd_write(pmd_t pmd)
 {
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 78f0c8cbe316..45f7cff1baac 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -22,7 +22,6 @@
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
 #define _PAGE_BIT_SPECIAL	_PAGE_BIT_SOFTW1
 #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_SOFTW1
-#define _PAGE_BIT_SPLITTING	_PAGE_BIT_SOFTW2 /* only valid on a PSE pmd */
 #define _PAGE_BIT_HIDDEN	_PAGE_BIT_SOFTW3 /* hidden by kmemcheck */
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
@@ -46,7 +45,6 @@
 #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
 #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
-#define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
 #define __HAVE_ARCH_PTE_SPECIAL
 
 #ifdef CONFIG_KMEMCHECK
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 62a887a3cf50..49bbbc57603b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -157,18 +157,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		pmd_t pmd = *pmdp;
 
 		next = pmd_addr_end(addr, end);
-		/*
-		 * The pmd_trans_splitting() check below explains why
-		 * pmdp_splitting_flush has to flush the tlb, to stop
-		 * this gup-fast code from running while we set the
-		 * splitting bit in the pmd. Returning zero will take
-		 * the slow path that will call wait_split_huge_page()
-		 * if the pmd is still in splitting state. gup-fast
-		 * can't because it has irq disabled and
-		 * wait_split_huge_page() would never return as the
-		 * tlb flush IPI wouldn't run.
-		 */
-		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+		if (pmd_none(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
 			/*
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 66d5aa27a7a5..23006b1797a0 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -434,20 +434,6 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 
 	return young;
 }
-
-void pmdp_splitting_flush(struct vm_area_struct *vma,
-			  unsigned long address, pmd_t *pmdp)
-{
-	int set;
-	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-	set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
-				(unsigned long *)pmdp);
-	if (set) {
-		pmd_update(vma->vm_mm, address, pmdp);
-		/* need tlb flush only to serialize against gup-fast */
-		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
-	}
-}
 #endif
 
 /**
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 25/36] mm, thp: remove infrastructure for handling splitting PMDs
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (23 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 24/36] x86, " Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:05 ` [PATCHv6 26/36] mm: rework mapcount accounting to enable 4k mapping of THPs Kirill A. Shutemov
                   ` (11 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting we don't need to mark PMDs splitting. Let's drop
the code that handles this.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 fs/proc/task_mmu.c            |  8 +++---
 include/asm-generic/pgtable.h |  9 -------
 include/linux/huge_mm.h       | 21 +++++----------
 mm/gup.c                      | 12 +--------
 mm/huge_memory.c              | 60 ++++++++++---------------------------------
 mm/memcontrol.c               | 13 ++--------
 mm/memory.c                   | 18 ++-----------
 mm/mincore.c                  |  2 +-
 mm/mremap.c                   | 15 +++++------
 mm/pgtable-generic.c          | 14 ----------
 mm/rmap.c                     |  4 +--
 11 files changed, 37 insertions(+), 139 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f9b285761bc0..b9e94cb7671b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -547,7 +547,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		smaps_pmd_entry(pmd, addr, walk);
 		spin_unlock(ptl);
 		return 0;
@@ -814,7 +814,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	struct page *page;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
 			clear_soft_dirty_pmd(vma, addr, pmd);
 			goto out;
@@ -1127,7 +1127,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *pte, *orig_pte;
 	int err = 0;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		int pmd_flags2;
 
 		if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd))
@@ -1434,7 +1434,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	pte_t *orig_pte;
 	pte_t *pte;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		pte_t huge_pte = *(pte_t *)pmd;
 		struct page *page;
 
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 7a70de9ee890..2a952995d8ed 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -184,11 +184,6 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
-#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
-				 unsigned long address, pmd_t *pmdp);
-#endif
-
 #ifndef pmdp_collapse_flush
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
@@ -581,10 +576,6 @@ static inline int pmd_trans_huge(pmd_t pmd)
 {
 	return 0;
 }
-static inline int pmd_trans_splitting(pmd_t pmd)
-{
-	return 0;
-}
 #ifndef __HAVE_ARCH_PMD_WRITE
 static inline int pmd_write(pmd_t pmd)
 {
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 47f80207782f..67b3f3760f4a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -28,7 +28,7 @@ extern int zap_huge_pmd(struct mmu_gather *tlb,
 extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, unsigned long end,
 			unsigned char *vec);
-extern int move_huge_pmd(struct vm_area_struct *vma,
+extern bool move_huge_pmd(struct vm_area_struct *vma,
 			 struct vm_area_struct *new_vma,
 			 unsigned long old_addr,
 			 unsigned long new_addr, unsigned long old_end,
@@ -49,15 +49,9 @@ enum transparent_hugepage_flag {
 #endif
 };
 
-enum page_check_address_pmd_flag {
-	PAGE_CHECK_ADDRESS_PMD_FLAG,
-	PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
-	PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
-};
 extern pmd_t *page_check_address_pmd(struct page *page,
 				     struct mm_struct *mm,
 				     unsigned long address,
-				     enum page_check_address_pmd_flag flag,
 				     spinlock_t **ptl);
 extern int pmd_freeable(pmd_t pmd);
 
@@ -102,7 +96,6 @@ extern unsigned long transparent_hugepage_flags;
 #define split_huge_page(page) BUILD_BUG()
 #define split_huge_pmd(__vma, __pmd, __address) BUILD_BUG()
 
-#define wait_split_huge_page(__anon_vma, __pmd) BUILD_BUG();
 #if HPAGE_PMD_ORDER >= MAX_ORDER
 #error "hugepages can't be allocated by the buddy allocator"
 #endif
@@ -112,17 +105,17 @@ extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 				    unsigned long start,
 				    unsigned long end,
 				    long adjust_next);
-extern int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
+extern bool __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
 		spinlock_t **ptl);
 /* mmap_sem must be held on entry */
-static inline int pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
+static inline bool pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
 		spinlock_t **ptl)
 {
 	VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
 	if (pmd_trans_huge(*pmd))
 		return __pmd_trans_huge_lock(pmd, vma, ptl);
 	else
-		return 0;
+		return false;
 }
 static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long start,
@@ -169,8 +162,6 @@ static inline int split_huge_page(struct page *page)
 {
 	return 0;
 }
-#define wait_split_huge_page(__anon_vma, __pmd)	\
-	do { } while (0)
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
 static inline int hugepage_madvise(struct vm_area_struct *vma,
@@ -185,10 +176,10 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 long adjust_next)
 {
 }
-static inline int pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
+static inline bool pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
 		spinlock_t **ptl)
 {
-	return 0;
+	return false;
 }
 
 static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
diff --git a/mm/gup.c b/mm/gup.c
index f0f390fb9213..3ab0a8a7b255 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -207,13 +207,6 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 		spin_unlock(ptl);
 		return follow_page_pte(vma, address, pmd, flags);
 	}
-
-	if (unlikely(pmd_trans_splitting(*pmd))) {
-		spin_unlock(ptl);
-		wait_split_huge_page(vma->anon_vma, pmd);
-		return follow_page_pte(vma, address, pmd, flags);
-	}
-
 	if (flags & FOLL_SPLIT) {
 		int ret;
 		page = pmd_page(*pmd);
@@ -1023,9 +1016,6 @@ struct page *get_dump_page(unsigned long addr)
  *  *) HAVE_RCU_TABLE_FREE is enabled, and tlb_remove_table is used to free
  *      pages containing page tables.
  *
- *  *) THP splits will broadcast an IPI, this can be achieved by overriding
- *      pmdp_splitting_flush.
- *
  *  *) ptes can be read atomically by the architecture.
  *
  *  *) access_ok is sufficient to validate userspace address ranges.
@@ -1222,7 +1212,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		pmd_t pmd = READ_ONCE(*pmdp);
 
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+		if (pmd_none(pmd))
 			return 0;
 
 		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 70fff4c9e067..efe52b5bd979 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -914,15 +914,6 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		goto out_unlock;
 	}
 
-	if (unlikely(pmd_trans_splitting(pmd))) {
-		/* split huge page running from under us */
-		spin_unlock(src_ptl);
-		spin_unlock(dst_ptl);
-		pte_free(dst_mm, pgtable);
-
-		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
-		goto out;
-	}
 	src_page = pmd_page(pmd);
 	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
 	get_page(src_page);
@@ -1428,7 +1419,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int ret = 0;
 
-	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (__pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		struct page *page;
 		pgtable_t pgtable;
 		pmd_t orig_pmd;
@@ -1462,13 +1453,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	return ret;
 }
 
-int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
+bool move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 		  unsigned long old_addr,
 		  unsigned long new_addr, unsigned long old_end,
 		  pmd_t *old_pmd, pmd_t *new_pmd)
 {
 	spinlock_t *old_ptl, *new_ptl;
-	int ret = 0;
 	pmd_t pmd;
 
 	struct mm_struct *mm = vma->vm_mm;
@@ -1477,7 +1467,7 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 	    (new_addr & ~HPAGE_PMD_MASK) ||
 	    old_end - old_addr < HPAGE_PMD_SIZE ||
 	    (new_vma->vm_flags & VM_NOHUGEPAGE))
-		goto out;
+		return false;
 
 	/*
 	 * The destination pmd shouldn't be established, free_pgtables()
@@ -1485,15 +1475,14 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 	 */
 	if (WARN_ON(!pmd_none(*new_pmd))) {
 		VM_BUG_ON(pmd_trans_huge(*new_pmd));
-		goto out;
+		return false;
 	}
 
 	/*
 	 * We don't have to worry about the ordering of src and dst
 	 * ptlocks because exclusive mmap_sem prevents deadlock.
 	 */
-	ret = __pmd_trans_huge_lock(old_pmd, vma, &old_ptl);
-	if (ret == 1) {
+	if (__pmd_trans_huge_lock(old_pmd, vma, &old_ptl)) {
 		new_ptl = pmd_lockptr(mm, new_pmd);
 		if (new_ptl != old_ptl)
 			spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
@@ -1509,9 +1498,9 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 		if (new_ptl != old_ptl)
 			spin_unlock(new_ptl);
 		spin_unlock(old_ptl);
+		return true;
 	}
-out:
-	return ret;
+	return false;
 }
 
 /*
@@ -1527,7 +1516,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	spinlock_t *ptl;
 	int ret = 0;
 
-	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (__pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		pmd_t entry;
 		bool preserve_write = prot_numa && pmd_write(*pmd);
 		ret = 1;
@@ -1564,23 +1553,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
  * Note that if it returns 1, this routine returns without unlocking page
  * table locks. So callers must unlock them.
  */
-int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
+bool __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
 		spinlock_t **ptl)
 {
 	*ptl = pmd_lock(vma->vm_mm, pmd);
-	if (likely(pmd_trans_huge(*pmd))) {
-		if (unlikely(pmd_trans_splitting(*pmd))) {
-			spin_unlock(*ptl);
-			wait_split_huge_page(vma->anon_vma, pmd);
-			return -1;
-		} else {
-			/* Thp mapped by 'pmd' is stable, so we can
-			 * handle it as it is. */
-			return 1;
-		}
-	}
+	if (likely(pmd_trans_huge(*pmd)))
+		return true;
 	spin_unlock(*ptl);
-	return 0;
+	return false;
 }
 
 /*
@@ -1594,7 +1574,6 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
 pmd_t *page_check_address_pmd(struct page *page,
 			      struct mm_struct *mm,
 			      unsigned long address,
-			      enum page_check_address_pmd_flag flag,
 			      spinlock_t **ptl)
 {
 	pgd_t *pgd;
@@ -1617,21 +1596,8 @@ pmd_t *page_check_address_pmd(struct page *page,
 		goto unlock;
 	if (pmd_page(*pmd) != page)
 		goto unlock;
-	/*
-	 * split_vma() may create temporary aliased mappings. There is
-	 * no risk as long as all huge pmd are found and have their
-	 * splitting bit set before __split_huge_page_refcount
-	 * runs. Finding the same huge pmd more than once during the
-	 * same rmap walk is not a problem.
-	 */
-	if (flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
-	    pmd_trans_splitting(*pmd))
-		goto unlock;
-	if (pmd_trans_huge(*pmd)) {
-		VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
-			  !pmd_trans_splitting(*pmd));
+	if (pmd_trans_huge(*pmd))
 		return pmd;
-	}
 unlock:
 	spin_unlock(*ptl);
 	return NULL;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d0d862e4ca74..bb57ecc62031 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4888,7 +4888,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
 			mc.precharge += HPAGE_PMD_NR;
 		spin_unlock(ptl);
@@ -5056,16 +5056,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	union mc_target target;
 	struct page *page;
 
-	/*
-	 * No race with splitting thp happens because:
-	 *  - if pmd_trans_huge_lock() returns 1, the relevant thp is not
-	 *    under splitting, which means there's no concurrent thp split,
-	 *  - if another thread runs into split_huge_page() just after we
-	 *    entered this if-block, the thread must wait for page table lock
-	 *    to be unlocked in __split_huge_page_splitting(), where the main
-	 *    part of thp split is not executed yet.
-	 */
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		if (mc.precharge < HPAGE_PMD_NR) {
 			spin_unlock(ptl);
 			return 0;
diff --git a/mm/memory.c b/mm/memory.c
index d37aeda3a8fe..ff2e8ed5b82a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -566,7 +566,6 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	spinlock_t *ptl;
 	pgtable_t new = pte_alloc_one(mm, address);
-	int wait_split_huge_page;
 	if (!new)
 		return -ENOMEM;
 
@@ -586,18 +585,14 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
 
 	ptl = pmd_lock(mm, pmd);
-	wait_split_huge_page = 0;
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		atomic_long_inc(&mm->nr_ptes);
 		pmd_populate(mm, pmd, new);
 		new = NULL;
-	} else if (unlikely(pmd_trans_splitting(*pmd)))
-		wait_split_huge_page = 1;
+	}
 	spin_unlock(ptl);
 	if (new)
 		pte_free(mm, new);
-	if (wait_split_huge_page)
-		wait_split_huge_page(vma->anon_vma, pmd);
 	return 0;
 }
 
@@ -613,8 +608,7 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		pmd_populate_kernel(&init_mm, pmd, new);
 		new = NULL;
-	} else
-		VM_BUG_ON(pmd_trans_splitting(*pmd));
+	}
 	spin_unlock(&init_mm.page_table_lock);
 	if (new)
 		pte_free_kernel(&init_mm, new);
@@ -3343,14 +3337,6 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (pmd_trans_huge(orig_pmd)) {
 			unsigned int dirty = flags & FAULT_FLAG_WRITE;
 
-			/*
-			 * If the pmd is splitting, return and retry the
-			 * the fault.  Alternative: wait until the split
-			 * is done, and goto retry.
-			 */
-			if (pmd_trans_splitting(orig_pmd))
-				return 0;
-
 			if (pmd_protnone(orig_pmd))
 				return do_huge_pmd_numa_page(mm, vma, address,
 							     orig_pmd, pmd);
diff --git a/mm/mincore.c b/mm/mincore.c
index be25efde64a4..feb867f5fdf4 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -117,7 +117,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	unsigned char *vec = walk->private;
 	int nr = (end - addr) >> PAGE_SHIFT;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		memset(vec, 1, nr);
 		spin_unlock(ptl);
 		goto out;
diff --git a/mm/mremap.c b/mm/mremap.c
index d9cc4debd3aa..09b8b9b24c05 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -192,25 +192,24 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		if (!new_pmd)
 			break;
 		if (pmd_trans_huge(*old_pmd)) {
-			int err = 0;
 			if (extent == HPAGE_PMD_SIZE) {
+				bool moved;
 				VM_BUG_ON_VMA(vma->vm_file || !vma->anon_vma,
 					      vma);
 				/* See comment in move_ptes() */
 				if (need_rmap_locks)
 					anon_vma_lock_write(vma->anon_vma);
-				err = move_huge_pmd(vma, new_vma, old_addr,
+				moved = move_huge_pmd(vma, new_vma, old_addr,
 						    new_addr, old_end,
 						    old_pmd, new_pmd);
 				if (need_rmap_locks)
 					anon_vma_unlock_write(vma->anon_vma);
+				if (moved) {
+					need_flush = true;
+					continue;
+				}
 			}
-			if (err > 0) {
-				need_flush = true;
-				continue;
-			} else if (!err) {
-				split_huge_pmd(vma, old_pmd, old_addr);
-			}
+			split_huge_pmd(vma, old_pmd, old_addr);
 			VM_BUG_ON(pmd_trans_huge(*old_pmd));
 		}
 		if (pmd_none(*new_pmd) && __pte_alloc(new_vma->vm_mm, new_vma,
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 6b674e00153c..89b150f8c920 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -134,20 +134,6 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
-#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
-			  pmd_t *pmdp)
-{
-	pmd_t pmd = pmd_mksplitting(*pmdp);
-	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-	set_pmd_at(vma->vm_mm, address, pmdp, pmd);
-	/* tlb flush only to serialize against gup-fast */
-	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-#endif
-
 #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
diff --git a/mm/rmap.c b/mm/rmap.c
index 51807d605ebf..0e3851661487 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -737,8 +737,7 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		 * rmap might return false positives; we must filter
 		 * these out using page_check_address_pmd().
 		 */
-		pmd = page_check_address_pmd(page, mm, address,
-					     PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+		pmd = page_check_address_pmd(page, mm, address, &ptl);
 		if (!pmd)
 			return SWAP_AGAIN;
 
@@ -748,7 +747,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 			return SWAP_FAIL; /* To break the loop */
 		}
 
-		/* go ahead even if the pmd is pmd_trans_splitting() */
 		if (pmdp_clear_flush_young_notify(vma, address, pmd))
 			referenced++;
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 26/36] mm: rework mapcount accounting to enable 4k mapping of THPs
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (24 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 25/36] mm, " Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-10 13:47   ` Vlastimil Babka
  2015-06-03 17:05 ` [PATCHv6 27/36] mm: differentiate page_mapped() from page_mapcount() for compound pages Kirill A. Shutemov
                   ` (10 subsequent siblings)
  36 siblings, 1 reply; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We're going to allow mapping of individual 4k pages of a THP compound
page. It means we need to track the mapcount on a per-small-page basis.

The straightforward approach is to use ->_mapcount in all subpages to
track how many times the subpage is mapped, with PMDs and PTEs combined.
But this is rather expensive: mapping or unmapping a THP page with a PMD
would require HPAGE_PMD_NR atomic operations instead of the single one we
have now.

The idea is to store separately how many times the page was mapped as a
whole -- compound_mapcount. This frees up ->_mapcount in subpages to
track the PTE mapcount.

We use the same approach as for the compound page destructor and compound
order to store compound_mapcount: reuse space in the first tail page,
->mapping this time.

Any time we map/unmap a whole compound page (THP or hugetlb) we
increment/decrement compound_mapcount. When we map part of a compound
page with a PTE we operate on ->_mapcount of the subpage.

page_mapcount() counts both PTE and PMD mappings of the page.

Basically, we have the mapcount of a subpage spread over two counters,
which makes it tricky to detect when the last mapping of a page goes
away.

We introduce PageDoubleMap() for this. When we split a THP PMD for the
first time and there is another PMD mapping left, we offset ->_mapcount
in all subpages up by one and set PG_double_map on the compound page.
These additional references go away with the last compound_mapcount.

This approach provides a way to detect when the last mapping of a small
page goes away without introducing new overhead for the most common
cases.
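
To make the two-counter scheme concrete, here is a stand-alone user-space
sketch of the arithmetic for one possible sequence of events: two PMD
mappings, then one of them is split into PTEs, then the last PMD mapping
goes away. The struct and helper names are made up for the example, plain
ints stand in for atomic_t, and the split step assumes each subpage is
also bumped once for its newly installed PTE, as the later split patch
does. Treat it as an illustration of the accounting described above, not
as kernel code.

#include <stdio.h>
#include <stdbool.h>

#define HPAGE_PMD_NR 512			/* subpages per THP (x86-64, 2M) */

/* Toy model of the counters; like the kernel, both start at -1 == unmapped. */
struct thp_model {
	int compound_mapcount;			/* lives in the first tail page */
	int pte_mapcount[HPAGE_PMD_NR];		/* ->_mapcount of each subpage */
	bool double_map;			/* PG_double_map on the head page */
};

/* Same arithmetic as the new page_mapcount() for a subpage of a compound page. */
static int subpage_mapcount(const struct thp_model *p, int i)
{
	int ret = p->pte_mapcount[i] + 1;

	ret += p->compound_mapcount + 1;
	if (p->double_map)
		ret--;
	return ret;
}

int main(void)
{
	struct thp_model p = { .compound_mapcount = -1, .double_map = false };
	int i;

	for (i = 0; i < HPAGE_PMD_NR; i++)
		p.pte_mapcount[i] = -1;

	/* Two processes map the THP with a PMD: one counter touched per map. */
	p.compound_mapcount += 2;
	printf("two PMD maps:      subpage 0 mapcount = %d\n", subpage_mapcount(&p, 0));

	/*
	 * First process splits its PMD while the other PMD mapping remains:
	 * each subpage gains a PTE mapping, PG_double_map is set and every
	 * subpage is offset up by one, then the split PMD mapping is dropped.
	 */
	for (i = 0; i < HPAGE_PMD_NR; i++)
		p.pte_mapcount[i] += 2;		/* +1 PTE, +1 DoubleMap offset */
	p.double_map = true;
	p.compound_mapcount--;
	printf("after PMD split:   subpage 0 mapcount = %d\n", subpage_mapcount(&p, 0));

	/* Last PMD mapping goes away: the DoubleMap offsets go with it. */
	if (--p.compound_mapcount == -1 && p.double_map) {
		p.double_map = false;
		for (i = 0; i < HPAGE_PMD_NR; i++)
			p.pte_mapcount[i]--;
	}
	printf("last PMD unmapped: subpage 0 mapcount = %d\n", subpage_mapcount(&p, 0));
	return 0;
}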

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h         | 26 +++++++++++-
 include/linux/mm_types.h   |  1 +
 include/linux/page-flags.h | 37 +++++++++++++++++
 include/linux/rmap.h       |  4 +-
 mm/debug.c                 |  5 ++-
 mm/huge_memory.c           |  2 +-
 mm/hugetlb.c               |  4 +-
 mm/memory.c                |  2 +-
 mm/migrate.c               |  2 +-
 mm/page_alloc.c            | 14 +++++--
 mm/rmap.c                  | 98 +++++++++++++++++++++++++++++++++++-----------
 11 files changed, 160 insertions(+), 35 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 31cd5be081cf..22cd540104ec 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -403,6 +403,19 @@ static inline int is_vmalloc_or_module_addr(const void *x)
 
 extern void kvfree(const void *addr);
 
+static inline atomic_t *compound_mapcount_ptr(struct page *page)
+{
+	return &page[1].compound_mapcount;
+}
+
+static inline int compound_mapcount(struct page *page)
+{
+	if (!PageCompound(page))
+		return 0;
+	page = compound_head(page);
+	return atomic_read(compound_mapcount_ptr(page)) + 1;
+}
+
 /*
  * The atomic page->_mapcount, starts from -1: so that transitions
  * both from it and to it can be tracked, using atomic_inc_and_test
@@ -415,8 +428,17 @@ static inline void page_mapcount_reset(struct page *page)
 
 static inline int page_mapcount(struct page *page)
 {
+	int ret;
 	VM_BUG_ON_PAGE(PageSlab(page), page);
-	return atomic_read(&page->_mapcount) + 1;
+
+	ret = atomic_read(&page->_mapcount) + 1;
+	if (PageCompound(page)) {
+		page = compound_head(page);
+		ret += compound_mapcount(page);
+		if (PageDoubleMap(page))
+			ret--;
+	}
+	return ret;
 }
 
 static inline int page_count(struct page *page)
@@ -898,7 +920,7 @@ static inline pgoff_t page_file_index(struct page *page)
  */
 static inline int page_mapped(struct page *page)
 {
-	return atomic_read(&(page)->_mapcount) >= 0;
+	return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
 }
 
 /*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4b51a59160ab..4d182cd14c1f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -56,6 +56,7 @@ struct page {
 						 * see PAGE_MAPPING_ANON below.
 						 */
 		void *s_mem;			/* slab first object */
+		atomic_t compound_mapcount;	/* first tail page */
 	};
 
 	/* Second double word */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74b7cece1dfa..a8d47c1edf6a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -127,6 +127,9 @@ enum pageflags {
 
 	/* SLOB */
 	PG_slob_free = PG_private,
+
+	/* THP. Stored in first tail page's flags */
+	PG_double_map = PG_private_2,
 };
 
 #ifndef __GENERATING_BOUNDS_H
@@ -593,10 +596,44 @@ static inline int PageTransTail(struct page *page)
 	return PageTail(page);
 }
 
+/*
+ * PageDoubleMap indicates that the compound page is mapped with PTEs as well
+ * as PMDs.
+ *
+ * This is required for optimization of rmap operations for THP: we can postpone
+ * per small page mapcount accounting (and its overhead from atomic operations)
+ * until the first PMD split.
+ *
+ * For the page PageDoubleMap means ->_mapcount in all sub-pages is offset up
+ * by one. This reference will go away with last compound_mapcount.
+ *
+ * See also __split_huge_pmd_locked() and page_remove_anon_compound_rmap().
+ */
+static inline int PageDoubleMap(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+	return test_bit(PG_double_map, &page[1].flags);
+}
+
+static inline int TestSetPageDoubleMap(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+	return test_and_set_bit(PG_double_map, &page[1].flags);
+}
+
+static inline void ClearPageDoubleMap(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+	clear_bit(PG_double_map, &page[1].flags);
+}
+
 #else
 TESTPAGEFLAG_FALSE(TransHuge)
 TESTPAGEFLAG_FALSE(TransCompound)
 TESTPAGEFLAG_FALSE(TransTail)
+TESTPAGEFLAG_FALSE(DoubleMap)
+	TESTSETFLAG_FALSE(DoubleMap)
+	CLEARPAGEFLAG_NOOP(DoubleMap)
 #endif
 
 /*
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 2bc86dc14305..1757854cd35c 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -181,9 +181,9 @@ void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 				unsigned long);
 
-static inline void page_dup_rmap(struct page *page)
+static inline void page_dup_rmap(struct page *page, bool compound)
 {
-	atomic_inc(&page->_mapcount);
+	atomic_inc(compound ? compound_mapcount_ptr(page) : &page->_mapcount);
 }
 
 /*
diff --git a/mm/debug.c b/mm/debug.c
index 9dfcd77e7354..4a82f639b964 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -80,9 +80,12 @@ static void dump_flags(unsigned long flags,
 void dump_page_badflags(struct page *page, const char *reason,
 		unsigned long badflags)
 {
-	pr_emerg("page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
+	pr_emerg("page:%p count:%d mapcount:%d mapping:%p index:%#lx",
 		  page, atomic_read(&page->_count), page_mapcount(page),
 		  page->mapping, page->index);
+	if (PageCompound(page))
+		pr_cont(" compound_mapcount: %d", compound_mapcount(page));
+	pr_cont("\n");
 	BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS);
 	dump_flags(page->flags, pageflag_names, ARRAY_SIZE(pageflag_names));
 	if (reason)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index efe52b5bd979..5b0a13d2f28c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -917,7 +917,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	src_page = pmd_page(pmd);
 	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
 	get_page(src_page);
-	page_dup_rmap(src_page);
+	page_dup_rmap(src_page, true);
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9d44d8f17760..f1e6ff471729 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2742,7 +2742,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
 			get_page(ptepage);
-			page_dup_rmap(ptepage);
+			page_dup_rmap(ptepage, true);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		}
 		spin_unlock(src_ptl);
@@ -3203,7 +3203,7 @@ retry:
 		ClearPagePrivate(page);
 		hugepage_add_new_anon_rmap(page, vma, address);
 	} else
-		page_dup_rmap(page);
+		page_dup_rmap(page, true);
 	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
 				&& (vma->vm_flags & VM_SHARED)));
 	set_huge_pte_at(mm, address, ptep, new_pte);
diff --git a/mm/memory.c b/mm/memory.c
index ff2e8ed5b82a..cdc20d674675 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -867,7 +867,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
 		get_page(page);
-		page_dup_rmap(page);
+		page_dup_rmap(page, false);
 		if (PageAnon(page))
 			rss[MM_ANONPAGES]++;
 		else
diff --git a/mm/migrate.c b/mm/migrate.c
index 9cabb25fb16e..dfd24cb7afc6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -164,7 +164,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		if (PageAnon(new))
 			hugepage_add_anon_rmap(new, vma, addr);
 		else
-			page_dup_rmap(new);
+			page_dup_rmap(new, false);
 	} else if (PageAnon(new))
 		page_add_anon_rmap(new, vma, addr, false);
 	else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1f9ffbb087cb..6e3b42763894 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -452,6 +452,7 @@ void prep_compound_page(struct page *page, unsigned long order)
 		smp_wmb();
 		__SetPageTail(p);
 	}
+	atomic_set(compound_mapcount_ptr(page), -1);
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
@@ -716,7 +717,7 @@ static inline int free_pages_check(struct page *page)
 	const char *bad_reason = NULL;
 	unsigned long bad_flags = 0;
 
-	if (unlikely(page_mapcount(page)))
+	if (unlikely(atomic_read(&page->_mapcount) != -1))
 		bad_reason = "nonzero mapcount";
 	if (unlikely(page->mapping != NULL))
 		bad_reason = "non-NULL mapping";
@@ -825,7 +826,14 @@ static void free_one_page(struct zone *zone,
 
 static int free_tail_pages_check(struct page *head_page, struct page *page)
 {
-	if (page->mapping != TAIL_MAPPING) {
+	/* mapping in first tail page is used for compound_mapcount() */
+	if (page - head_page == 1) {
+		if (unlikely(compound_mapcount(page))) {
+			bad_page(page, "nonzero compound_mapcount", 0);
+			page->mapping = NULL;
+			return 1;
+		}
+	} else if (page->mapping != TAIL_MAPPING) {
 		bad_page(page, "corrupted mapping in tail page", 0);
 		page->mapping = NULL;
 		return 1;
@@ -1288,7 +1296,7 @@ static inline int check_new_page(struct page *page)
 	const char *bad_reason = NULL;
 	unsigned long bad_flags = 0;
 
-	if (unlikely(page_mapcount(page)))
+	if (unlikely(atomic_read(&page->_mapcount) != -1))
 		bad_reason = "nonzero mapcount";
 	if (unlikely(page->mapping != NULL))
 		bad_reason = "non-NULL mapping";
diff --git a/mm/rmap.c b/mm/rmap.c
index 0e3851661487..2bfa15faff1e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1035,7 +1035,7 @@ static void __page_check_anon_rmap(struct page *page,
 	 * over the call to page_add_new_anon_rmap.
 	 */
 	BUG_ON(page_anon_vma(page)->root != vma->anon_vma->root);
-	BUG_ON(page->index != linear_page_index(vma, address));
+	BUG_ON(page_to_pgoff(page) != linear_page_index(vma, address));
 #endif
 }
 
@@ -1065,9 +1065,26 @@ void page_add_anon_rmap(struct page *page,
 void do_page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address, int flags)
 {
-	int first = atomic_inc_and_test(&page->_mapcount);
+	bool compound = flags & RMAP_COMPOUND;
+	bool first;
+
+	if (PageTransCompound(page)) {
+		VM_BUG_ON_PAGE(!PageLocked(page), page);
+		if (compound) {
+			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+			first = atomic_inc_and_test(compound_mapcount_ptr(page));
+		} else {
+			/* Anon THP always mapped first with PMD */
+			first = 0;
+			VM_BUG_ON_PAGE(!page_mapcount(page), page);
+			atomic_inc(&page->_mapcount);
+		}
+	} else {
+		VM_BUG_ON_PAGE(compound, page);
+		first = atomic_inc_and_test(&page->_mapcount);
+	}
+
 	if (first) {
-		bool compound = flags & RMAP_COMPOUND;
 		int nr = compound ? hpage_nr_pages(page) : 1;
 		/*
 		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
@@ -1086,6 +1103,7 @@ void do_page_add_anon_rmap(struct page *page,
 		return;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
 	/* address might be in next vma when migration races vma_adjust */
 	if (first)
 		__page_set_anon_rmap(page, vma, address,
@@ -1112,10 +1130,16 @@ void page_add_new_anon_rmap(struct page *page,
 
 	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
 	SetPageSwapBacked(page);
-	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
 	if (compound) {
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+		/* increment count (starts at -1) */
+		atomic_set(compound_mapcount_ptr(page), 0);
 		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+	} else {
+		/* Anon THP always mapped first with PMD */
+		VM_BUG_ON_PAGE(PageTransCompound(page), page);
+		/* increment count (starts at -1) */
+		atomic_set(&page->_mapcount, 0);
 	}
 	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
 	__page_set_anon_rmap(page, vma, address, 1);
@@ -1145,12 +1169,15 @@ static void page_remove_file_rmap(struct page *page)
 
 	memcg = mem_cgroup_begin_page_stat(page);
 
-	/* page still mapped by someone else? */
-	if (!atomic_add_negative(-1, &page->_mapcount))
+	/* Hugepages are not counted in NR_FILE_MAPPED for now. */
+	if (unlikely(PageHuge(page))) {
+		/* hugetlb pages are always mapped with pmds */
+		atomic_dec(compound_mapcount_ptr(page));
 		goto out;
+	}
 
-	/* Hugepages are not counted in NR_FILE_MAPPED for now. */
-	if (unlikely(PageHuge(page)))
+	/* page still mapped by someone else? */
+	if (!atomic_add_negative(-1, &page->_mapcount))
 		goto out;
 
 	/*
@@ -1167,6 +1194,41 @@ out:
 	mem_cgroup_end_page_stat(memcg);
 }
 
+static void page_remove_anon_compound_rmap(struct page *page)
+{
+	int i, nr;
+
+	if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
+		return;
+
+	/* Hugepages are not counted in NR_ANON_PAGES for now. */
+	if (unlikely(PageHuge(page)))
+		return;
+
+	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+		return;
+
+	__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+
+	if (PageDoubleMap(page)) {
+		nr = 0;
+		ClearPageDoubleMap(page);
+		/*
+		 * Subpages can be mapped with PTEs too. Check how many of
+		 * them are still mapped.
+		 */
+		for (i = 0; i < HPAGE_PMD_NR; i++) {
+			if (atomic_add_negative(-1, &page[i]._mapcount))
+				nr++;
+		}
+	} else {
+		nr = HPAGE_PMD_NR;
+	}
+
+	if (nr)
+		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
+}
+
 /**
  * page_remove_rmap - take down pte mapping from a page
  * @page:	page to remove mapping from
@@ -1176,33 +1238,25 @@ out:
  */
 void page_remove_rmap(struct page *page, bool compound)
 {
-	int nr = compound ? hpage_nr_pages(page) : 1;
-
 	if (!PageAnon(page)) {
 		VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
 		page_remove_file_rmap(page);
 		return;
 	}
 
+	if (compound)
+		return page_remove_anon_compound_rmap(page);
+
 	/* page still mapped by someone else? */
 	if (!atomic_add_negative(-1, &page->_mapcount))
 		return;
 
-	/* Hugepages are not counted in NR_ANON_PAGES for now. */
-	if (unlikely(PageHuge(page)))
-		return;
-
 	/*
 	 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	if (compound) {
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
-	}
-
-	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
+	__dec_zone_page_state(page, NR_ANON_PAGES);
 
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
@@ -1643,7 +1697,7 @@ void hugepage_add_anon_rmap(struct page *page,
 	BUG_ON(!PageLocked(page));
 	BUG_ON(!anon_vma);
 	/* address might be in next vma when migration races vma_adjust */
-	first = atomic_inc_and_test(&page->_mapcount);
+	first = atomic_inc_and_test(compound_mapcount_ptr(page));
 	if (first)
 		__hugepage_set_anon_rmap(page, vma, address, 0);
 }
@@ -1652,7 +1706,7 @@ void hugepage_add_new_anon_rmap(struct page *page,
 			struct vm_area_struct *vma, unsigned long address)
 {
 	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
-	atomic_set(&page->_mapcount, 0);
+	atomic_set(compound_mapcount_ptr(page), 0);
 	__hugepage_set_anon_rmap(page, vma, address, 1);
 }
 #endif /* CONFIG_HUGETLB_PAGE */
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 27/36] mm: differentiate page_mapped() from page_mapcount() for compound pages
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (25 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 26/36] mm: rework mapcount accounting to enable 4k mapping of THPs Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-09 10:58   ` Kirill A. Shutemov
  2015-06-10 14:34   ` Vlastimil Babka
  2015-06-03 17:05 ` [PATCHv6 28/36] mm, numa: skip PTE-mapped THP on numa fault Kirill A. Shutemov
                   ` (9 subsequent siblings)
  36 siblings, 2 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

Let's define page_mapped() to be true for compound pages if any
sub-page of the compound page is mapped (with a PMD or a PTE).

On the other hand, page_mapcount() returns the mapcount of this
particular small page.

This will make cases like page_get_anon_vma() behave correctly once we
allow huge pages to be mapped with PTEs.

Most users outside core-mm should use page_mapcount() instead of
page_mapped().
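
To illustrate the new semantics (this sketch is not part of the patch and
the helper names are made up): for a PTE-mapped THP where only one subpage
is mapped, page_mapped() is true for every subpage, while page_mapcount()
is non-zero only for the subpage that is actually mapped. The arch
cache-maintenance code below asks the per-subpage question, hence the
switch to page_mapcount():

	/* hypothetical helpers, for illustration only */
	static inline bool subpage_mapped_to_userspace(struct page *page)
	{
		/* per-subpage question: this exact small page */
		return page_mapcount(page) > 0;
	}

	static inline bool thp_mapped_anywhere(struct page *page)
	{
		/* whole-compound-page question: any subpage or the PMD */
		return page_mapped(compound_head(page));
	}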

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 arch/arc/mm/cache_arc700.c |  4 ++--
 arch/arm/mm/flush.c        |  2 +-
 arch/mips/mm/c-r4k.c       |  3 ++-
 arch/mips/mm/cache.c       |  2 +-
 arch/mips/mm/init.c        |  6 +++---
 arch/sh/mm/cache-sh4.c     |  2 +-
 arch/sh/mm/cache.c         |  8 ++++----
 arch/xtensa/mm/tlb.c       |  2 +-
 fs/proc/page.c             |  4 ++--
 include/linux/mm.h         | 15 +++++++++++++--
 mm/filemap.c               |  2 +-
 11 files changed, 31 insertions(+), 19 deletions(-)

diff --git a/arch/arc/mm/cache_arc700.c b/arch/arc/mm/cache_arc700.c
index 8c3a3e02ba92..1baa4d23314b 100644
--- a/arch/arc/mm/cache_arc700.c
+++ b/arch/arc/mm/cache_arc700.c
@@ -490,7 +490,7 @@ void flush_dcache_page(struct page *page)
 	 */
 	if (!mapping_mapped(mapping)) {
 		clear_bit(PG_dc_clean, &page->flags);
-	} else if (page_mapped(page)) {
+	} else if (page_mapcount(page)) {
 
 		/* kernel reading from page with U-mapping */
 		void *paddr = page_address(page);
@@ -675,7 +675,7 @@ void copy_user_highpage(struct page *to, struct page *from,
 	 * Note that while @u_vaddr refers to DST page's userspace vaddr, it is
 	 * equally valid for SRC page as well
 	 */
-	if (page_mapped(from) && addr_not_cache_congruent(kfrom, u_vaddr)) {
+	if (page_mapcount(from) && addr_not_cache_congruent(kfrom, u_vaddr)) {
 		__flush_dcache_page(kfrom, u_vaddr);
 		clean_src_k_mappings = 1;
 	}
diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
index 77f229302032..4da544aa25ef 100644
--- a/arch/arm/mm/flush.c
+++ b/arch/arm/mm/flush.c
@@ -315,7 +315,7 @@ void flush_dcache_page(struct page *page)
 	mapping = page_mapping(page);
 
 	if (!cache_ops_need_broadcast() &&
-	    mapping && !page_mapped(page))
+	    mapping && !page_mapcount(page))
 		clear_bit(PG_dcache_clean, &page->flags);
 	else {
 		__flush_dcache_page(mapping, page);
diff --git a/arch/mips/mm/c-r4k.c b/arch/mips/mm/c-r4k.c
index 3f8059602765..b28bf185ef77 100644
--- a/arch/mips/mm/c-r4k.c
+++ b/arch/mips/mm/c-r4k.c
@@ -578,7 +578,8 @@ static inline void local_r4k_flush_cache_page(void *args)
 		 * another ASID than the current one.
 		 */
 		map_coherent = (cpu_has_dc_aliases &&
-				page_mapped(page) && !Page_dcache_dirty(page));
+				page_mapcount(page) &&
+				!Page_dcache_dirty(page));
 		if (map_coherent)
 			vaddr = kmap_coherent(page, addr);
 		else
diff --git a/arch/mips/mm/cache.c b/arch/mips/mm/cache.c
index 7e3ea7766822..e695b28dc32c 100644
--- a/arch/mips/mm/cache.c
+++ b/arch/mips/mm/cache.c
@@ -106,7 +106,7 @@ void __flush_anon_page(struct page *page, unsigned long vmaddr)
 	unsigned long addr = (unsigned long) page_address(page);
 
 	if (pages_do_alias(addr, vmaddr)) {
-		if (page_mapped(page) && !Page_dcache_dirty(page)) {
+		if (page_mapcount(page) && !Page_dcache_dirty(page)) {
 			void *kaddr;
 
 			kaddr = kmap_coherent(page, vmaddr);
diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
index 448cde372af0..2c8e44aa536e 100644
--- a/arch/mips/mm/init.c
+++ b/arch/mips/mm/init.c
@@ -156,7 +156,7 @@ void copy_user_highpage(struct page *to, struct page *from,
 
 	vto = kmap_atomic(to);
 	if (cpu_has_dc_aliases &&
-	    page_mapped(from) && !Page_dcache_dirty(from)) {
+	    page_mapcount(from) && !Page_dcache_dirty(from)) {
 		vfrom = kmap_coherent(from, vaddr);
 		copy_page(vto, vfrom);
 		kunmap_coherent();
@@ -178,7 +178,7 @@ void copy_to_user_page(struct vm_area_struct *vma,
 	unsigned long len)
 {
 	if (cpu_has_dc_aliases &&
-	    page_mapped(page) && !Page_dcache_dirty(page)) {
+	    page_mapcount(page) && !Page_dcache_dirty(page)) {
 		void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
 		memcpy(vto, src, len);
 		kunmap_coherent();
@@ -196,7 +196,7 @@ void copy_from_user_page(struct vm_area_struct *vma,
 	unsigned long len)
 {
 	if (cpu_has_dc_aliases &&
-	    page_mapped(page) && !Page_dcache_dirty(page)) {
+	    page_mapcount(page) && !Page_dcache_dirty(page)) {
 		void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
 		memcpy(dst, vfrom, len);
 		kunmap_coherent();
diff --git a/arch/sh/mm/cache-sh4.c b/arch/sh/mm/cache-sh4.c
index 51d8f7f31d1d..58aaa4f33b81 100644
--- a/arch/sh/mm/cache-sh4.c
+++ b/arch/sh/mm/cache-sh4.c
@@ -241,7 +241,7 @@ static void sh4_flush_cache_page(void *args)
 		 */
 		map_coherent = (current_cpu_data.dcache.n_aliases &&
 			test_bit(PG_dcache_clean, &page->flags) &&
-			page_mapped(page));
+			page_mapcount(page));
 		if (map_coherent)
 			vaddr = kmap_coherent(page, address);
 		else
diff --git a/arch/sh/mm/cache.c b/arch/sh/mm/cache.c
index f770e3992620..e58cfbf45150 100644
--- a/arch/sh/mm/cache.c
+++ b/arch/sh/mm/cache.c
@@ -59,7 +59,7 @@ void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
 		       unsigned long vaddr, void *dst, const void *src,
 		       unsigned long len)
 {
-	if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
+	if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
 	    test_bit(PG_dcache_clean, &page->flags)) {
 		void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
 		memcpy(vto, src, len);
@@ -78,7 +78,7 @@ void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
 			 unsigned long vaddr, void *dst, const void *src,
 			 unsigned long len)
 {
-	if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
+	if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
 	    test_bit(PG_dcache_clean, &page->flags)) {
 		void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
 		memcpy(dst, vfrom, len);
@@ -97,7 +97,7 @@ void copy_user_highpage(struct page *to, struct page *from,
 
 	vto = kmap_atomic(to);
 
-	if (boot_cpu_data.dcache.n_aliases && page_mapped(from) &&
+	if (boot_cpu_data.dcache.n_aliases && page_mapcount(from) &&
 	    test_bit(PG_dcache_clean, &from->flags)) {
 		vfrom = kmap_coherent(from, vaddr);
 		copy_page(vto, vfrom);
@@ -153,7 +153,7 @@ void __flush_anon_page(struct page *page, unsigned long vmaddr)
 	unsigned long addr = (unsigned long) page_address(page);
 
 	if (pages_do_alias(addr, vmaddr)) {
-		if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
+		if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
 		    test_bit(PG_dcache_clean, &page->flags)) {
 			void *kaddr;
 
diff --git a/arch/xtensa/mm/tlb.c b/arch/xtensa/mm/tlb.c
index 5ece856c5725..35c822286bbe 100644
--- a/arch/xtensa/mm/tlb.c
+++ b/arch/xtensa/mm/tlb.c
@@ -245,7 +245,7 @@ static int check_tlb_entry(unsigned w, unsigned e, bool dtlb)
 						page_mapcount(p));
 				if (!page_count(p))
 					rc |= TLB_INSANE;
-				else if (page_mapped(p))
+				else if (page_mapcount(p))
 					rc |= TLB_SUSPICIOUS;
 			} else {
 				rc |= TLB_INSANE;
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 7eee2d8b97d9..e99c059339f6 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -97,9 +97,9 @@ u64 stable_page_flags(struct page *page)
 	 * pseudo flags for the well known (anonymous) memory mapped pages
 	 *
 	 * Note that page->_mapcount is overloaded in SLOB/SLUB/SLQB, so the
-	 * simple test in page_mapped() is not enough.
+	 * simple test in page_mapcount() is not enough.
 	 */
-	if (!PageSlab(page) && page_mapped(page))
+	if (!PageSlab(page) && page_mapcount(page))
 		u |= 1 << KPF_MMAP;
 	if (PageAnon(page))
 		u |= 1 << KPF_ANON;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 22cd540104ec..16add6692f49 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -917,10 +917,21 @@ static inline pgoff_t page_file_index(struct page *page)
 
 /*
  * Return true if this page is mapped into pagetables.
+ * For compound page it returns true if any subpage of compound page is mapped.
  */
-static inline int page_mapped(struct page *page)
+static inline bool page_mapped(struct page *page)
 {
-	return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
+	int i;
+	if (likely(!PageCompound(page)))
+		return atomic_read(&page->_mapcount) >= 0;
+	page = compound_head(page);
+	if (compound_mapcount(page))
+		return true;
+	for (i = 0; i < hpage_nr_pages(page); i++) {
+		if (atomic_read(&page[i]._mapcount) >= 0)
+			return true;
+	}
+	return false;
 }
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index cb41cf3069d2..c6cf03303ded 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -202,7 +202,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
 		__dec_zone_page_state(page, NR_FILE_PAGES);
 	if (PageSwapBacked(page))
 		__dec_zone_page_state(page, NR_SHMEM);
-	BUG_ON(page_mapped(page));
+	VM_BUG_ON_PAGE(page_mapped(page), page);
 
 	/*
 	 * At this point page must be either written or cleaned by truncate.
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 28/36] mm, numa: skip PTE-mapped THP on numa fault
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (26 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 27/36] mm: differentiate page_mapped() from page_mapcount() for compound pages Kirill A. Shutemov
@ 2015-06-03 17:05 ` Kirill A. Shutemov
  2015-06-03 17:06 ` [PATCHv6 29/36] thp: implement split_huge_pmd() Kirill A. Shutemov
                   ` (8 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:05 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We're going to have THPs mapped with PTEs. It will confuse NUMA balancing.
Let's skip them for now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 mm/memory.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index cdc20d674675..89dae13db3d3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3179,6 +3179,12 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
+	/* TODO: handle PTE-mapped THP */
+	if (PageCompound(page)) {
+		pte_unmap_unlock(ptep, ptl);
+		return 0;
+	}
+
 	/*
 	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
 	 * much anyway since they can be in shared cache state. This misses
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 29/36] thp: implement split_huge_pmd()
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (27 preceding siblings ...)
  2015-06-03 17:05 ` [PATCHv6 28/36] mm, numa: skip PTE-mapped THP on numa fault Kirill A. Shutemov
@ 2015-06-03 17:06 ` Kirill A. Shutemov
  2015-06-11  9:49   ` Vlastimil Babka
  2015-06-03 17:06 ` [PATCHv6 30/36] thp: add option to setup migration entries during PMD split Kirill A. Shutemov
                   ` (7 subsequent siblings)
  36 siblings, 1 reply; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:06 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

The original split_huge_page() combined two operations: splitting PMDs
into tables of PTEs and splitting the underlying compound page. This
patch implements split_huge_pmd(), which splits the given PMD without
splitting other PMDs this page is mapped with and without splitting the
underlying compound page.

Without tail page refcounting, the implementation of split_huge_pmd() is
pretty straightforward.
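
For illustration only (not part of the patch; vma/pmd/address come from
the caller's usual context), a caller that needs a PTE table but not a
compound-page split would now simply do:

	/* split only this mapping's PMD; the compound page stays intact */
	split_huge_pmd(vma, pmd, address);

	/*
	 * The PMD now points to a page table of PTEs mapping the same
	 * compound page. The operation never fails and is a no-op if the
	 * PMD is not huge.
	 */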

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 include/linux/huge_mm.h |  11 ++++-
 mm/huge_memory.c        | 119 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 129 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 67b3f3760f4a..9cfce23ab9d0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -94,7 +94,16 @@ extern unsigned long transparent_hugepage_flags;
 
 #define split_huge_page_to_list(page, list) BUILD_BUG()
 #define split_huge_page(page) BUILD_BUG()
-#define split_huge_pmd(__vma, __pmd, __address) BUILD_BUG()
+
+void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long address);
+
+#define split_huge_pmd(__vma, __pmd, __address)				\
+	do {								\
+		pmd_t *____pmd = (__pmd);				\
+		if (pmd_trans_huge(*____pmd))				\
+			__split_huge_pmd(__vma, __pmd, __address);	\
+	}  while (0)
 
 #if HPAGE_PMD_ORDER >= MAX_ORDER
 #error "hugepages can't be allocated by the buddy allocator"
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5b0a13d2f28c..0f1f5731a893 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2547,6 +2547,125 @@ static int khugepaged(void *none)
 	return 0;
 }
 
+static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
+		unsigned long haddr, pmd_t *pmd)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgtable_t pgtable;
+	pmd_t _pmd;
+	int i;
+
+	/* leave pmd empty until pte is filled */
+	pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+
+	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	pmd_populate(mm, &_pmd, pgtable);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+		pte_t *pte, entry;
+		entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+		entry = pte_mkspecial(entry);
+		pte = pte_offset_map(&_pmd, haddr);
+		VM_BUG_ON(!pte_none(*pte));
+		set_pte_at(mm, haddr, pte, entry);
+		pte_unmap(pte);
+	}
+	smp_wmb(); /* make pte visible before pmd */
+	pmd_populate(mm, pmd, pgtable);
+	put_huge_zero_page();
+}
+
+static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long haddr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *page;
+	pgtable_t pgtable;
+	pmd_t _pmd;
+	bool young, write;
+	int i;
+
+	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
+	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
+	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
+	VM_BUG_ON(!pmd_trans_huge(*pmd));
+
+	count_vm_event(THP_SPLIT_PMD);
+
+	if (is_huge_zero_pmd(*pmd))
+		return __split_huge_zero_page_pmd(vma, haddr, pmd);
+
+	page = pmd_page(*pmd);
+	VM_BUG_ON_PAGE(!page_count(page), page);
+	atomic_add(HPAGE_PMD_NR - 1, &page->_count);
+	write = pmd_write(*pmd);
+	young = pmd_young(*pmd);
+
+	/* leave pmd empty until pte is filled */
+	pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+
+	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	pmd_populate(mm, &_pmd, pgtable);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+		pte_t entry, *pte;
+		/*
+		 * Note that NUMA hinting access restrictions are not
+		 * transferred to avoid any possibility of altering
+		 * permissions across VMAs.
+		 */
+		entry = mk_pte(page + i, vma->vm_page_prot);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		if (!write)
+			entry = pte_wrprotect(entry);
+		if (!young)
+			entry = pte_mkold(entry);
+		pte = pte_offset_map(&_pmd, haddr);
+		BUG_ON(!pte_none(*pte));
+		set_pte_at(mm, haddr, pte, entry);
+		atomic_inc(&page[i]._mapcount);
+		pte_unmap(pte);
+	}
+
+	if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
+		/* Last compound_mapcount is gone. */
+		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+		if (PageDoubleMap(page)) {
+			/* No need in mapcount reference anymore */
+			ClearPageDoubleMap(page);
+			for (i = 0; i < HPAGE_PMD_NR; i++)
+				atomic_dec(&page[i]._mapcount);
+		}
+	} else if (!TestSetPageDoubleMap(page)) {
+		/*
+		 * The first PMD split for the compound page and we still
+		 * have other PMD mapping of the page: bump _mapcount in
+		 * every small page.
+		 * This reference will go away with last compound_mapcount.
+		 */
+		for (i = 0; i < HPAGE_PMD_NR; i++)
+			atomic_inc(&page[i]._mapcount);
+	}
+
+	smp_wmb(); /* make pte visible before pmd */
+	pmd_populate(mm, pmd, pgtable);
+}
+
+void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long address)
+{
+	spinlock_t *ptl;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+
+	mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
+	ptl = pmd_lock(mm, pmd);
+	if (likely(pmd_trans_huge(*pmd)))
+		__split_huge_pmd_locked(vma, pmd, haddr);
+	spin_unlock(ptl);
+	mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
+}
+
 static void split_huge_pmd_address(struct vm_area_struct *vma,
 				    unsigned long address)
 {
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 30/36] thp: add option to setup migration entries during PMD split
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (28 preceding siblings ...)
  2015-06-03 17:06 ` [PATCHv6 29/36] thp: implement split_huge_pmd() Kirill A. Shutemov
@ 2015-06-03 17:06 ` Kirill A. Shutemov
  2015-06-03 17:06 ` [PATCHv6 31/36] thp, mm: split_huge_page(): caller needs to lock page Kirill A. Shutemov
                   ` (6 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:06 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We are going to use migration PTE entries to stabilize page counts.
If the page is mapped with PMDs, we need to split the PMD and set up
migration entries. It's reasonable to combine these operations to avoid
double-scanning the page table.
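
A rough sketch of the two modes this adds (not verbatim from the patch;
__split_huge_pmd_locked() is the static helper changed below):

	/* normal split: install real PTEs pointing at the subpages */
	__split_huge_pmd_locked(vma, pmd, haddr, false);

	/*
	 * freeze: install migration entries instead, so the page counts
	 * stay stable while split_huge_page() does its work
	 */
	__split_huge_pmd_locked(vma, pmd, haddr, true);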

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/huge_memory.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0f1f5731a893..aa852094c7b9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -24,6 +24,7 @@
 #include <linux/migrate.h>
 #include <linux/hashtable.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/swapops.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -2576,7 +2577,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 }
 
 static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long haddr)
+		unsigned long haddr, bool freeze)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *page;
@@ -2614,12 +2615,18 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 * transferred to avoid any possibility of altering
 		 * permissions across VMAs.
 		 */
-		entry = mk_pte(page + i, vma->vm_page_prot);
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-		if (!write)
-			entry = pte_wrprotect(entry);
-		if (!young)
-			entry = pte_mkold(entry);
+		if (freeze) {
+			swp_entry_t swp_entry;
+			swp_entry = make_migration_entry(page + i, write);
+			entry = swp_entry_to_pte(swp_entry);
+		} else {
+			entry = mk_pte(page + i, vma->vm_page_prot);
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+			if (!write)
+				entry = pte_wrprotect(entry);
+			if (!young)
+				entry = pte_mkold(entry);
+		}
 		pte = pte_offset_map(&_pmd, haddr);
 		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, haddr, pte, entry);
@@ -2661,7 +2668,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
 	ptl = pmd_lock(mm, pmd);
 	if (likely(pmd_trans_huge(*pmd)))
-		__split_huge_pmd_locked(vma, pmd, haddr);
+		__split_huge_pmd_locked(vma, pmd, haddr, false);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
 }
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 31/36] thp, mm: split_huge_page(): caller needs to lock page
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (29 preceding siblings ...)
  2015-06-03 17:06 ` [PATCHv6 30/36] thp: add option to setup migration entries during PMD split Kirill A. Shutemov
@ 2015-06-03 17:06 ` Kirill A. Shutemov
  2015-06-03 17:06 ` [PATCHv6 32/36] thp: reintroduce split_huge_page() Kirill A. Shutemov
                   ` (5 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:06 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We're going to use migration entries instead of compound_lock() to
stabilize page refcounts. Setting up and removing migration entries
requires the page to be locked.

Some split_huge_page() callers already have the page locked. Let's
require everybody to lock the page before calling split_huge_page().
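
A minimal sketch of the calling convention this establishes (not from the
patch; it assumes the caller already holds a reference on the page):

	static int split_locked(struct page *page)
	{
		int ret;

		lock_page(page);		/* now required */
		ret = split_huge_page(page);	/* 0 on success, -EBUSY on failure */
		unlock_page(page);
		return ret;
	}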

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/memory-failure.c | 10 ++++++++--
 mm/migrate.c        |  8 ++++++--
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 1cf7f2988422..0d9989a36d32 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1143,15 +1143,18 @@ int memory_failure(unsigned long pfn, int trapno, int flags)
 				put_page(hpage);
 			return -EBUSY;
 		}
+		lock_page(hpage);
 		if (unlikely(split_huge_page(hpage))) {
 			pr_err("MCE: %#lx: thp split failed\n", pfn);
 			if (TestClearPageHWPoison(p))
 				atomic_long_sub(nr_pages, &num_poisoned_pages);
+			unlock_page(hpage);
 			put_page(p);
 			if (p != hpage)
 				put_page(hpage);
 			return -EBUSY;
 		}
+		unlock_page(hpage);
 		VM_BUG_ON_PAGE(!page_count(p), p);
 		hpage = compound_head(p);
 	}
@@ -1714,10 +1717,13 @@ int soft_offline_page(struct page *page, int flags)
 		return -EBUSY;
 	}
 	if (!PageHuge(page) && PageTransHuge(hpage)) {
-		if (PageAnon(hpage) && unlikely(split_huge_page(hpage))) {
+		lock_page(page);
+		ret = split_huge_page(hpage);
+		unlock_page(page);
+		if (unlikely(ret)) {
 			pr_info("soft offline: %#lx: failed to split THP\n",
 				pfn);
-			return -EBUSY;
+			return ret;
 		}
 	}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index dfd24cb7afc6..8bb2107b8751 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -933,9 +933,13 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		goto out;
 	}
 
-	if (unlikely(PageTransHuge(page)))
-		if (unlikely(split_huge_page(page)))
+	if (unlikely(PageTransHuge(page))) {
+		lock_page(page);
+		rc = split_huge_page(page);
+		unlock_page(page);
+		if (rc)
 			goto out;
+	}
 
 	rc = __unmap_and_move(page, newpage, force, mode);
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 32/36] thp: reintroduce split_huge_page()
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (30 preceding siblings ...)
  2015-06-03 17:06 ` [PATCHv6 31/36] thp, mm: split_huge_page(): caller needs to lock page Kirill A. Shutemov
@ 2015-06-03 17:06 ` Kirill A. Shutemov
  2015-06-10 15:44   ` Vlastimil Babka
  2015-06-03 17:06 ` [PATCHv6 33/36] migrate_pages: try to split pages on queueing Kirill A. Shutemov
                   ` (4 subsequent siblings)
  36 siblings, 1 reply; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:06 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

This patch adds an implementation of split_huge_page() for the new
refcounting.

Unlike the previous implementation, the new split_huge_page() can fail if
somebody holds a GUP pin on the page. It also means that a pin on a page
will prevent it from being split under you. It makes the situation in
many places much cleaner.

The basic scheme of split_huge_page():

  - Check that the sum of mapcounts of all subpages is equal to
    page_count() plus one (the caller's pin). Fail with -EBUSY otherwise.
    This way we can avoid useless PMD-splits. (A worked example follows
    this list.)

  - Freeze the page counts by splitting all PMDs and setting up migration
    PTEs.

  - Re-check the sum of mapcounts against page_count(). The page's counts
    are stable now. -EBUSY if the page is pinned.

  - Split the compound page.

  - Unfreeze the page by removing the migration entries.
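
A worked example of the first check, assuming nothing but the mappings and
the caller's pin reference the page: a THP that is PMD-mapped by two
processes and pinned once by the caller has total_mapcount(head) == 2 and
page_count(head) == 3, so total_mapcount == page_count - 1 and the split
may proceed. An extra GUP pin bumps page_count to 4 without changing any
mapcount, and split_huge_page() backs off with -EBUSY.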

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 include/linux/huge_mm.h |   7 +-
 include/linux/pagemap.h |  13 +-
 mm/huge_memory.c        | 318 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/internal.h           |  26 +++-
 mm/rmap.c               |  21 ----
 5 files changed, 357 insertions(+), 28 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9cfce23ab9d0..89981a042d85 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -92,8 +92,11 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 
 extern unsigned long transparent_hugepage_flags;
 
-#define split_huge_page_to_list(page, list) BUILD_BUG()
-#define split_huge_page(page) BUILD_BUG()
+int split_huge_page_to_list(struct page *page, struct list_head *list);
+static inline int split_huge_page(struct page *page)
+{
+	return split_huge_page_to_list(page, NULL);
+}
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 7c3790764795..3cca325c1560 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -387,10 +387,21 @@ static inline struct page *read_mapping_page(struct address_space *mapping,
  */
 static inline pgoff_t page_to_pgoff(struct page *page)
 {
+	pgoff_t pgoff;
+
 	if (unlikely(PageHeadHuge(page)))
 		return page->index << compound_order(page);
-	else
+
+	if (likely(!PageTransTail(page)))
 		return page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+	/*
+	 *  We don't initialize ->index for tail pages: calculate based on
+	 *  head page
+	 */
+	pgoff = page->first_page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff += page - page->first_page;
+	return pgoff;
 }
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index aa852094c7b9..27045ea7f37e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2740,3 +2740,321 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 			split_huge_pmd_address(next, nstart);
 	}
 }
+
+static void freeze_page_vma(struct vm_area_struct *vma, struct page *page,
+		unsigned long address)
+{
+	spinlock_t *ptl;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	int i;
+
+	pgd = pgd_offset(vma->vm_mm, address);
+	if (!pgd_present(*pgd))
+		return;
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return;
+	pmd = pmd_offset(pud, address);
+	ptl = pmd_lock(vma->vm_mm, pmd);
+	if (!pmd_present(*pmd)) {
+		spin_unlock(ptl);
+		return;
+	}
+	if (pmd_trans_huge(*pmd)) {
+		if (page == pmd_page(*pmd))
+			__split_huge_pmd_locked(vma, pmd, address, true);
+		spin_unlock(ptl);
+		return;
+	}
+	spin_unlock(ptl);
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, address, &ptl);
+	for (i = 0; i < HPAGE_PMD_NR; i++, address += PAGE_SIZE, page++) {
+		pte_t entry, swp_pte;
+		swp_entry_t swp_entry;
+
+		if (!pte_present(pte[i]))
+			continue;
+		if (page_to_pfn(page) != pte_pfn(pte[i]))
+			continue;
+		flush_cache_page(vma, address, page_to_pfn(page));
+		entry = ptep_clear_flush(vma, address, pte + i);
+		swp_entry = make_migration_entry(page, pte_write(entry));
+		swp_pte = swp_entry_to_pte(swp_entry);
+		if (pte_soft_dirty(entry))
+			swp_pte = pte_swp_mksoft_dirty(swp_pte);
+		set_pte_at(vma->vm_mm, address, pte + i, swp_pte);
+	}
+	pte_unmap_unlock(pte, ptl);
+}
+
+static void freeze_page(struct anon_vma *anon_vma, struct page *page)
+{
+	struct anon_vma_chain *avc;
+	pgoff_t pgoff = page_to_pgoff(page);
+
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+
+	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff,
+			pgoff + HPAGE_PMD_NR - 1) {
+		unsigned long haddr;
+
+		haddr = __vma_address(page, avc->vma) & HPAGE_PMD_MASK;
+		mmu_notifier_invalidate_range_start(avc->vma->vm_mm,
+				haddr, haddr + HPAGE_PMD_SIZE);
+		freeze_page_vma(avc->vma, page, haddr);
+		mmu_notifier_invalidate_range_end(avc->vma->vm_mm,
+				haddr, haddr + HPAGE_PMD_SIZE);
+	}
+}
+
+static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
+		unsigned long address)
+{
+	spinlock_t *ptl;
+	pmd_t *pmd;
+	pte_t *pte, entry;
+	swp_entry_t swp_entry;
+	int i;
+
+	pmd = mm_find_pmd(vma->vm_mm, address);
+	if (!pmd)
+		return;
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, address, &ptl);
+	for (i = 0; i < HPAGE_PMD_NR; i++, address += PAGE_SIZE, page++) {
+		if (!page_mapped(page))
+			continue;
+		if (!is_swap_pte(pte[i]))
+			continue;
+
+		swp_entry = pte_to_swp_entry(pte[i]);
+		if (!is_migration_entry(swp_entry))
+			continue;
+		if (migration_entry_to_page(swp_entry) != page)
+			continue;
+
+		entry = pte_mkold(mk_pte(page, vma->vm_page_prot));
+		if (is_write_migration_entry(swp_entry))
+			entry = maybe_mkwrite(entry, vma);
+
+		flush_dcache_page(page);
+		set_pte_at(vma->vm_mm, address, pte + i, entry);
+
+		/* No need to invalidate - it was non-present before */
+		update_mmu_cache(vma, address, pte + i);
+	}
+	pte_unmap_unlock(pte, ptl);
+}
+
+static void unfreeze_page(struct anon_vma *anon_vma, struct page *page)
+{
+	struct anon_vma_chain *avc;
+	pgoff_t pgoff = page_to_pgoff(page);
+
+	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
+			pgoff, pgoff + HPAGE_PMD_NR - 1) {
+		unsigned long address = __vma_address(page, avc->vma);
+
+		mmu_notifier_invalidate_range_start(avc->vma->vm_mm,
+				address, address + HPAGE_PMD_SIZE);
+		unfreeze_page_vma(avc->vma, page, address);
+		mmu_notifier_invalidate_range_end(avc->vma->vm_mm,
+				address, address + HPAGE_PMD_SIZE);
+	}
+}
+
+static int total_mapcount(struct page *page)
+{
+	int i, ret;
+
+	ret = compound_mapcount(page);
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		ret += atomic_read(&page[i]._mapcount) + 1;
+
+	if (PageDoubleMap(page))
+		ret -= HPAGE_PMD_NR;
+
+	return ret;
+}
+
+static int __split_huge_page_tail(struct page *head, int tail,
+		struct lruvec *lruvec, struct list_head *list)
+{
+	int mapcount;
+	struct page *page_tail = head + tail;
+
+	mapcount = page_mapcount(page_tail);
+	VM_BUG_ON_PAGE(atomic_read(&page_tail->_count) != 0, page_tail);
+
+	/*
+	 * tail_page->_count is zero and not changing from under us. But
+	 * get_page_unless_zero() may be running from under us on the
+	 * tail_page. If we used atomic_set() below instead of atomic_add(), we
+	 * would then run atomic_set() concurrently with
+	 * get_page_unless_zero(), and atomic_set() is implemented in C not
+	 * using locked ops. spin_unlock on x86 sometime uses locked ops
+	 * because of PPro errata 66, 92, so unless somebody can guarantee
+	 * atomic_set() here would be safe on all archs (and not only on x86),
+	 * it's safer to use atomic_add().
+	 */
+	atomic_add(page_mapcount(page_tail) + 1, &page_tail->_count);
+
+	/* after clearing PageTail the gup refcount can be released */
+	smp_mb__after_atomic();
+
+	/*
+	 * retain hwpoison flag of the poisoned tail page:
+	 *   fix for the unsuitable process killed on Guest Machine(KVM)
+	 *   by the memory-failure.
+	 */
+	page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP | __PG_HWPOISON;
+	page_tail->flags |= (head->flags &
+			((1L << PG_referenced) |
+			 (1L << PG_swapbacked) |
+			 (1L << PG_mlocked) |
+			 (1L << PG_uptodate) |
+			 (1L << PG_active) |
+			 (1L << PG_locked) |
+			 (1L << PG_unevictable)));
+	page_tail->flags |= (1L << PG_dirty);
+
+	/* clear PageTail before overwriting first_page */
+	smp_wmb();
+
+	/* ->mapping in first tail page is compound_mapcount */
+	VM_BUG_ON_PAGE(tail != 1 && page_tail->mapping != TAIL_MAPPING,
+			page_tail);
+	page_tail->mapping = head->mapping;
+
+	page_tail->index = head->index + tail;
+	page_cpupid_xchg_last(page_tail, page_cpupid_last(head));
+	lru_add_page_tail(head, page_tail, lruvec, list);
+
+	return mapcount;
+}
+
+static void __split_huge_page(struct page *page, struct list_head *list)
+{
+	struct page *head = compound_head(page);
+	struct zone *zone = page_zone(head);
+	struct lruvec *lruvec;
+	int i, tail_mapcount;
+
+	/* prevent PageLRU to go away from under us, and freeze lru stats */
+	spin_lock_irq(&zone->lru_lock);
+	lruvec = mem_cgroup_page_lruvec(head, zone);
+
+	/* complete memcg works before add pages to LRU */
+	mem_cgroup_split_huge_fixup(head);
+
+	tail_mapcount = 0;
+	for (i = HPAGE_PMD_NR - 1; i >= 1; i--)
+		tail_mapcount += __split_huge_page_tail(head, i, lruvec, list);
+	atomic_sub(tail_mapcount, &head->_count);
+
+	ClearPageCompound(head);
+	spin_unlock_irq(&zone->lru_lock);
+
+	unfreeze_page(page_anon_vma(head), head);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		struct page *subpage = head + i;
+		if (subpage == page)
+			continue;
+		unlock_page(subpage);
+
+		/*
+		 * Subpages may be freed if there wasn't any mapping
+		 * like if add_to_swap() is running on a lru page that
+		 * had its mapping zapped. And freeing these pages
+		 * requires taking the lru_lock so we do the put_page
+		 * of the tail pages after the split is complete.
+		 */
+		put_page(subpage);
+	}
+}
+
+/*
+ * This function splits huge page into normal pages. @page can point to any
+ * subpage of huge page to split. Split doesn't change the position of @page.
+ *
+ * Only the caller may hold a pin on the @page, otherwise the split fails with -EBUSY.
+ * The huge page must be locked.
+ *
+ * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
+ *
+ * Both head page and tail pages will inherit mapping, flags, and so on from
+ * the hugepage.
+ *
+ * The GUP pin and PG_locked are transferred to @page. The rest of the
+ * subpages can be freed if they are not mapped.
+ *
+ * Returns 0 if the hugepage is split successfully.
+ * Returns -EBUSY if the page is pinned or if anon_vma disappeared from under
+ * us.
+ */
+int split_huge_page_to_list(struct page *page, struct list_head *list)
+{
+	struct page *head = compound_head(page);
+	struct anon_vma *anon_vma;
+	int count, mapcount, ret;
+
+	VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
+	VM_BUG_ON_PAGE(!PageAnon(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+	VM_BUG_ON_PAGE(!PageCompound(page), page);
+
+	/*
+	 * The caller does not necessarily hold an mmap_sem that would prevent
+ * the anon_vma disappearing, so we first take a reference to it
+	 * and then lock the anon_vma for write. This is similar to
+	 * page_lock_anon_vma_read except the write lock is taken to serialise
+	 * against parallel split or collapse operations.
+	 */
+	anon_vma = page_get_anon_vma(head);
+	if (!anon_vma) {
+		ret = -EBUSY;
+		goto out;
+	}
+	anon_vma_lock_write(anon_vma);
+
+	/*
+	 * Racy check if we can split the page, before freeze_page() will
+	 * split PMDs
+	 */
+	if (total_mapcount(head) != page_count(head) - 1) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	freeze_page(anon_vma, head);
+	VM_BUG_ON_PAGE(compound_mapcount(head), head);
+
+	count = page_count(head);
+	mapcount = total_mapcount(head);
+	if (mapcount == count - 1) {
+		__split_huge_page(page, list);
+		ret = 0;
+	} else if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount > count - 1) {
+		pr_alert("total_mapcount: %u, page_count(): %u\n",
+				mapcount, count);
+		if (PageTail(page))
+			dump_page(head, NULL);
+		dump_page(page, "total_mapcount(head) > page_count(head) - 1");
+		BUG();
+	} else {
+		unfreeze_page(anon_vma, head);
+		ret = -EBUSY;
+	}
+
+out_unlock:
+	anon_vma_unlock_write(anon_vma);
+	put_anon_vma(anon_vma);
+out:
+	count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
+	return ret;
+}
diff --git a/mm/internal.h b/mm/internal.h
index 6aa08db10589..a2e466bc60b2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -13,6 +13,7 @@
 
 #include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/pagemap.h>
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
@@ -245,10 +246,27 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
 
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-extern unsigned long vma_address(struct page *page,
-				 struct vm_area_struct *vma);
-#endif
+/*
+ * At what user virtual address is page expected in @vma?
+ */
+static inline unsigned long
+__vma_address(struct page *page, struct vm_area_struct *vma)
+{
+	pgoff_t pgoff = page_to_pgoff(page);
+	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+}
+
+static inline unsigned long
+vma_address(struct page *page, struct vm_area_struct *vma)
+{
+	unsigned long address = __vma_address(page, vma);
+
+	/* page should be within @vma mapping range */
+	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+
+	return address;
+}
+
 #else /* !CONFIG_MMU */
 static inline void clear_page_mlock(struct page *page) { }
 static inline void mlock_vma_page(struct page *page) { }
diff --git a/mm/rmap.c b/mm/rmap.c
index 2bfa15faff1e..7a374d1ddcd3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -561,27 +561,6 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
 }
 
 /*
- * At what user virtual address is page expected in @vma?
- */
-static inline unsigned long
-__vma_address(struct page *page, struct vm_area_struct *vma)
-{
-	pgoff_t pgoff = page_to_pgoff(page);
-	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
-}
-
-inline unsigned long
-vma_address(struct page *page, struct vm_area_struct *vma)
-{
-	unsigned long address = __vma_address(page, vma);
-
-	/* page should be within @vma mapping range */
-	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
-
-	return address;
-}
-
-/*
  * At what user virtual address is page expected in vma?
  * Caller should check the page is actually part of the vma.
  */
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 33/36] migrate_pages: try to split pages on queueing
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (31 preceding siblings ...)
  2015-06-03 17:06 ` [PATCHv6 32/36] thp: reintroduce split_huge_page() Kirill A. Shutemov
@ 2015-06-03 17:06 ` Kirill A. Shutemov
  2015-06-11  9:27   ` Vlastimil Babka
  2015-06-03 17:06 ` [PATCHv6 34/36] thp: introduce deferred_split_huge_page() Kirill A. Shutemov
                   ` (3 subsequent siblings)
  36 siblings, 1 reply; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:06 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We are not able to migrate THPs. It means it's not enough to split only
the PMD on migration -- we need to split the compound page under it too.
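
Condensed sketch of the policy applied at queueing time (not the patch
itself; page/vma/pmd/addr come from the page-table walk below):

	if (is_huge_zero_page(page)) {
		/* the huge zero page has no compound page to split */
		split_huge_pmd(vma, pmd, addr);
	} else {
		/*
		 * real THP: split the compound page itself, under the page
		 * lock; on failure (extra pins) the page is skipped
		 */
		get_page(page);
		lock_page(page);
		ret = split_huge_page(page);
		unlock_page(page);
		put_page(page);
	}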

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/mempolicy.c | 37 +++++++++++++++++++++++++++++++++----
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 528f6c467cf1..0b1499c2f890 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -489,14 +489,31 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 	struct queue_pages *qp = walk->private;
 	unsigned long flags = qp->flags;
-	int nid;
+	int nid, ret;
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	split_huge_pmd(vma, pmd, addr);
-	if (pmd_trans_unstable(pmd))
-		return 0;
+	if (pmd_trans_huge(*pmd)) {
+		ptl = pmd_lock(walk->mm, pmd);
+		if (pmd_trans_huge(*pmd)) {
+			page = pmd_page(*pmd);
+			if (is_huge_zero_page(page)) {
+				spin_unlock(ptl);
+				split_huge_pmd(vma, pmd, addr);
+			} else {
+				get_page(page);
+				spin_unlock(ptl);
+				lock_page(page);
+				ret = split_huge_page(page);
+				unlock_page(page);
+				put_page(page);
+				if (ret)
+					return 0;
+			}
+		}
+	}
 
+retry:
 	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		if (!pte_present(*pte))
@@ -513,6 +530,18 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 		nid = page_to_nid(page);
 		if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
 			continue;
+		if (PageTail(page) && PageAnon(page)) {
+			get_page(page);
+			pte_unmap_unlock(pte - 1, ptl);
+			lock_page(page);
+			ret = split_huge_page(page);
+			unlock_page(page);
+			put_page(page);
+			/* Failed to split -- skip. */
+			if (ret)
+				continue;
+			goto retry;
+		}
 
 		if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
 			migrate_page_add(page, qp->pagelist, flags);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 34/36] thp: introduce deferred_split_huge_page()
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (32 preceding siblings ...)
  2015-06-03 17:06 ` [PATCHv6 33/36] migrate_pages: try to split pages on queueing Kirill A. Shutemov
@ 2015-06-03 17:06 ` Kirill A. Shutemov
  2015-06-03 17:06 ` [PATCHv6 35/36] mm: re-enable THP Kirill A. Shutemov
                   ` (2 subsequent siblings)
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:06 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

Currently we don't split a huge page on partial unmap. It's not an ideal
situation: it can lead to memory overhead.

Fortunately, we can detect partial unmap in page_remove_rmap(). But we
cannot call split_huge_page() from there due to the locking context.

It's also counterproductive to do it directly from the munmap() codepath:
in many cases we will hit this from exit(2), and splitting the huge page
just to free it up as small pages is not what we really want.

The patch introduces deferred_split_huge_page(), which puts the huge page
into a queue for splitting. The splitting itself happens when we get
memory pressure, via the shrinker interface. The page is dropped from the
list on freeing, through the compound page destructor.
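
The queue linkage lives in the second tail page's otherwise unused ->lru,
so no extra allocation is needed. A hypothetical accessor, for
illustration only, would look like:

	static inline struct list_head *thp_deferred_list(struct page *head)
	{
		/* THP order is >= 2, so page[2] exists; its ->lru is free */
		return &head[2].lru;
	}

Queueing from page_remove_rmap(), dequeueing from the shrinker, and the
compound page destructor all manipulate this list under split_queue_lock.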

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/huge_mm.h |   4 ++
 include/linux/mm.h      |   2 +
 mm/huge_memory.c        | 127 ++++++++++++++++++++++++++++++++++++++++++++++--
 mm/migrate.c            |   1 +
 mm/page_alloc.c         |   2 +-
 mm/rmap.c               |   7 ++-
 6 files changed, 138 insertions(+), 5 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 89981a042d85..c1cca36c73db 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -92,11 +92,14 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 
 extern unsigned long transparent_hugepage_flags;
 
+extern void prep_transhuge_page(struct page *page);
+
 int split_huge_page_to_list(struct page *page, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
 	return split_huge_page_to_list(page, NULL);
 }
+void deferred_split_huge_page(struct page *page);
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address);
@@ -174,6 +177,7 @@ static inline int split_huge_page(struct page *page)
 {
 	return 0;
 }
+static inline void deferred_split_huge_page(struct page *page) {}
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
 static inline int hugepage_madvise(struct vm_area_struct *vma,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 16add6692f49..9c594fbb3f6d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -522,6 +522,8 @@ static inline void set_compound_order(struct page *page, unsigned long order)
 	page[1].compound_order = order;
 }
 
+void free_compound_page(struct page *page);
+
 #ifdef CONFIG_MMU
 /*
  * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 27045ea7f37e..bee9bf073543 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -71,6 +71,8 @@ static int khugepaged(void *none);
 static int khugepaged_slab_init(void);
 static void khugepaged_slab_exit(void);
 
+static void free_transhuge_page(struct page *page);
+
 #define MM_SLOTS_HASH_BITS 10
 static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
@@ -105,6 +107,10 @@ static struct khugepaged_scan khugepaged_scan = {
 	.mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
 };
 
+static DEFINE_SPINLOCK(split_queue_lock);
+static LIST_HEAD(split_queue);
+static unsigned long split_queue_len;
+static struct shrinker deferred_split_shrinker;
 
 static int set_recommended_min_free_kbytes(void)
 {
@@ -643,6 +649,9 @@ static int __init hugepage_init(void)
 	err = register_shrinker(&huge_zero_page_shrinker);
 	if (err)
 		goto err_hzp_shrinker;
+	err = register_shrinker(&deferred_split_shrinker);
+	if (err)
+		goto err_split_shrinker;
 
 	/*
 	 * By default disable transparent hugepages on smaller systems,
@@ -660,6 +669,8 @@ static int __init hugepage_init(void)
 
 	return 0;
 err_khugepaged:
+	unregister_shrinker(&deferred_split_shrinker);
+err_split_shrinker:
 	unregister_shrinker(&huge_zero_page_shrinker);
 err_hzp_shrinker:
 	khugepaged_slab_exit();
@@ -716,6 +727,19 @@ static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
 	return entry;
 }
 
+void prep_transhuge_page(struct page *page)
+{
+	/* we use page->lru in second tail page: assuming THP order >= 2 */
+	BUILD_BUG_ON(HPAGE_PMD_ORDER < 2);
+
+	/*
+	 *  ->lru in the first tail page is occupied by destructor
+	 *  and order of the compound page
+	 */
+	INIT_LIST_HEAD(&page[2].lru);
+	set_compound_page_dtor(page, free_transhuge_page);
+}
+
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long haddr, pmd_t *pmd,
@@ -868,6 +892,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
 	}
+	prep_transhuge_page(page);
 	return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page, gfp, flags);
 }
 
@@ -1120,7 +1145,9 @@ alloc:
 	} else
 		new_page = NULL;
 
-	if (unlikely(!new_page)) {
+	if (likely(new_page)) {
+		prep_transhuge_page(new_page);
+	} else {
 		if (!page) {
 			split_huge_pmd(vma, pmd, address);
 			ret |= VM_FAULT_FALLBACK;
@@ -2045,6 +2072,7 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
 		return NULL;
 	}
 
+	prep_transhuge_page(*hpage);
 	count_vm_event(THP_COLLAPSE_ALLOC);
 	return *hpage;
 }
@@ -2056,8 +2084,12 @@ static int khugepaged_find_target_node(void)
 
 static inline struct page *alloc_hugepage(int defrag)
 {
-	return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
-			   HPAGE_PMD_ORDER);
+	struct page *page;
+
+	page = alloc_pages(alloc_hugepage_gfpmask(defrag, 0), HPAGE_PMD_ORDER);
+	if (page)
+		prep_transhuge_page(page);
+	return page;
 }
 
 static struct page *khugepaged_alloc_hugepage(bool *wait)
@@ -2947,6 +2979,13 @@ static void __split_huge_page(struct page *page, struct list_head *list)
 	spin_lock_irq(&zone->lru_lock);
 	lruvec = mem_cgroup_page_lruvec(head, zone);
 
+	spin_lock(&split_queue_lock);
+	if (!list_empty(&head[2].lru)) {
+		split_queue_len--;
+		list_del(&head[2].lru);
+	}
+	spin_unlock(&split_queue_lock);
+
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(head);
 
@@ -3058,3 +3097,85 @@ out:
 	count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
 	return ret;
 }
+
+static void free_transhuge_page(struct page *page)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&split_queue_lock, flags);
+	if (!list_empty(&page[2].lru)) {
+		split_queue_len--;
+		list_del(&page[2].lru);
+	}
+	spin_unlock_irqrestore(&split_queue_lock, flags);
+	free_compound_page(page);
+}
+
+void deferred_split_huge_page(struct page *page)
+{
+	unsigned long flags;
+
+	VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+
+	spin_lock_irqsave(&split_queue_lock, flags);
+	if (list_empty(&page[2].lru)) {
+		list_add_tail(&page[2].lru, &split_queue);
+		split_queue_len++;
+	}
+	spin_unlock_irqrestore(&split_queue_lock, flags);
+}
+
+static unsigned long deferred_split_count(struct shrinker *shrink,
+		struct shrink_control *sc)
+{
+	/*
+	 * Splitting a page from the split_queue will free up at least one
+	 * page, at most HPAGE_PMD_NR - 1. We don't track the exact number.
+	 * Let's use HPAGE_PMD_NR / 2 as a ballpark.
+	 */
+	return ACCESS_ONCE(split_queue_len) * HPAGE_PMD_NR / 2;
+}
+
+static unsigned long deferred_split_scan(struct shrinker *shrink,
+		struct shrink_control *sc)
+{
+	unsigned long flags;
+	LIST_HEAD(list);
+	struct page *page, *next;
+	int split = 0;
+
+	spin_lock_irqsave(&split_queue_lock, flags);
+	list_splice_init(&split_queue, &list);
+
+	/* Take pin on all head pages to avoid freeing them under us */
+	list_for_each_entry_safe(page, next, &list, lru) {
+		page = compound_head(page);
+		/* race with put_compound_page() */
+		if (!get_page_unless_zero(page)) {
+			list_del_init(&page[2].lru);
+			split_queue_len--;
+		}
+	}
+	spin_unlock_irqrestore(&split_queue_lock, flags);
+
+	list_for_each_entry_safe(page, next, &list, lru) {
+		lock_page(page);
+		/* split_huge_page() removes page from list on success */
+		if (!split_huge_page(page))
+			split++;
+		unlock_page(page);
+		put_page(page);
+	}
+
+	spin_lock_irqsave(&split_queue_lock, flags);
+	list_splice_tail(&list, &split_queue);
+	spin_unlock_irqrestore(&split_queue_lock, flags);
+
+	return split * HPAGE_PMD_NR / 2;
+}
+
+static struct shrinker deferred_split_shrinker = {
+	.count_objects = deferred_split_count,
+	.scan_objects = deferred_split_scan,
+	.seeks = DEFAULT_SEEKS,
+};
diff --git a/mm/migrate.c b/mm/migrate.c
index 8bb2107b8751..4c79c5447623 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1742,6 +1742,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		HPAGE_PMD_ORDER);
 	if (!new_page)
 		goto out_fail;
+	prep_transhuge_page(new_page);
 
 	isolated = numamigrate_isolate_page(pgdat, page);
 	if (!isolated) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e3b42763894..f46b6ef06748 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -430,7 +430,7 @@ out:
  * This usage means that zero-order pages may not be compound.
  */
 
-static void free_compound_page(struct page *page)
+void free_compound_page(struct page *page)
 {
 	__free_pages_ok(page, compound_order(page));
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 7a374d1ddcd3..c9939ac3d5a1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1204,8 +1204,10 @@ static void page_remove_anon_compound_rmap(struct page *page)
 		nr = HPAGE_PMD_NR;
 	}
 
-	if (nr)
+	if (nr) {
 		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
+		deferred_split_huge_page(page);
+	}
 }
 
 /**
@@ -1240,6 +1242,9 @@ void page_remove_rmap(struct page *page, bool compound)
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
 
+	if (PageTransCompound(page))
+		deferred_split_huge_page(compound_head(page));
+
 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
 	 * but that might overwrite a racing page_add_anon_rmap
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread
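
The hunks above use split_queue, split_queue_lock, split_queue_len and
deferred_split_shrinker without showing where they are declared and
registered; that part of the patch is not quoted above. The supporting
declarations and registration presumably look roughly like the sketch below
(the init-function name is made up for illustration):

static DEFINE_SPINLOCK(split_queue_lock);
static LIST_HEAD(split_queue);
static unsigned long split_queue_len;

static int __init deferred_split_init(void)
{
	/* Assumed registration point: once registered, the shrinker core
	 * calls deferred_split_count()/deferred_split_scan() under memory
	 * pressure. */
	return register_shrinker(&deferred_split_shrinker);
}

As a worked example of the count heuristic: with HPAGE_PMD_NR == 512 and four
pages sitting on split_queue, deferred_split_count() reports 4 * 512 / 2 =
1024 potentially freeable objects, which is what prompts the shrinker core to
call deferred_split_scan().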

* [PATCHv6 35/36] mm: re-enable THP
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (33 preceding siblings ...)
  2015-06-03 17:06 ` [PATCHv6 34/36] thp: introduce deferred_split_huge_page() Kirill A. Shutemov
@ 2015-06-03 17:06 ` Kirill A. Shutemov
  2015-06-03 17:06 ` [PATCHv6 36/36] thp: update documentation Kirill A. Shutemov
  2015-06-16 13:17 ` [PATCHv6 00/36] THP refcounting redesign Jerome Marchand
  36 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:06 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

All parts of THP with the new refcounting are now in place. We can now
re-enable THP.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 mm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index c973f416cbe5..e79de2bd12cd 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -410,7 +410,7 @@ config NOMMU_INITIAL_TRIM_EXCESS
 
 config TRANSPARENT_HUGEPAGE
 	bool "Transparent Hugepage Support"
-	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && BROKEN
+	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select COMPACTION
 	help
 	  Transparent Hugepages allows the kernel to use huge pages and
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCHv6 36/36] thp: update documentation
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (34 preceding siblings ...)
  2015-06-03 17:06 ` [PATCHv6 35/36] mm: re-enable THP Kirill A. Shutemov
@ 2015-06-03 17:06 ` Kirill A. Shutemov
  2015-06-11 12:30   ` Vlastimil Babka
  2015-06-16 13:17 ` [PATCHv6 00/36] THP refcounting redesign Jerome Marchand
  36 siblings, 1 reply; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-03 17:06 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

The patch updates Documentation/vm/transhuge.txt to reflect changes in
THP design.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/transhuge.txt | 124 +++++++++++++++++++++++------------------
 1 file changed, 69 insertions(+), 55 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 6b31cfbe2a9a..2352b12cae93 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -35,10 +35,10 @@ miss is going to run faster.
 
 == Design ==
 
-- "graceful fallback": mm components which don't have transparent
-  hugepage knowledge fall back to breaking a transparent hugepage and
-  working on the regular pages and their respective regular pmd/pte
-  mappings
+- "graceful fallback": mm components which don't have transparent hugepage
+  knowledge fall back to breaking huge pmd mapping into table of ptes and,
+  if nesessary, split a transparent hugepage. Therefore these components
+  can continue working on the regular pages or regular pte mappings.
 
 - if a hugepage allocation fails because of memory fragmentation,
   regular pages should be gracefully allocated instead and mixed in
@@ -200,9 +200,18 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range
 	of pages that should be collapsed into one huge page but failed
 	the allocation.
 
-thp_split is incremented every time a huge page is split into base
+thp_split_page is incremented every time a huge page is split into base
 	pages. This can happen for a variety of reasons but a common
 	reason is that a huge page is old and is being reclaimed.
+	This action implies splitting all PMDs the page is mapped with.
+
+thp_split_page_failed is incremented if the kernel fails to split a huge
+	page. This can happen if the page was pinned by somebody.
+
+thp_split_pmd is incremented every time a PMD is split into a table of PTEs.
+	This can happen, for instance, when an application calls mprotect() or
+	munmap() on part of a huge page. It doesn't split the huge page, only
+	the page table entry.
 
 thp_zero_page_alloc is incremented every time a huge zero page is
 	successfully allocated. It includes allocations which where
@@ -253,10 +262,8 @@ is complete, so they won't ever notice the fact the page is huge. But
 if any driver is going to mangle over the page structure of the tail
 page (like for checking page->mapping or other bits that are relevant
 for the head page and not the tail page), it should be updated to jump
-to check head page instead (while serializing properly against
-split_huge_page() to avoid the head and tail pages to disappear from
-under it, see the futex code to see an example of that, hugetlbfs also
-needed special handling in futex code for similar reasons).
+to check the head page instead. Taking a reference on any head/tail page
+would prevent the page from being split by anyone.
 
 NOTE: these aren't new constraints to the GUP API, and they match the
 same constrains that applies to hugetlbfs too, so any driver capable
@@ -291,9 +298,9 @@ unaffected. libhugetlbfs will also work fine as usual.
 == Graceful fallback ==
 
 Code walking pagetables but unware about huge pmds can simply call
-split_huge_page_pmd(vma, addr, pmd) where the pmd is the one returned by
+split_huge_pmd(vma, pmd, addr) where the pmd is the one returned by
 pmd_offset. It's trivial to make the code transparent hugepage aware
-by just grepping for "pmd_offset" and adding split_huge_page_pmd where
+by just grepping for "pmd_offset" and adding split_huge_pmd where
 missing after pmd_offset returns the pmd. Thanks to the graceful
 fallback design, with a one liner change, you can avoid to write
 hundred if not thousand of lines of complex code to make your code
@@ -302,7 +309,8 @@ hugepage aware.
 If you're not walking pagetables but you run into a physical hugepage
 but you can't handle it natively in your code, you can split it by
 calling split_huge_page(page). This is what the Linux VM does before
-it tries to swapout the hugepage for example.
+it tries to swapout the hugepage for example. split_huge_page() can fail
+if the page is pinned and you must handle this correctly.
 
 Example to make mremap.c transparent hugepage aware with a one liner
 change:
@@ -314,14 +322,14 @@ diff --git a/mm/mremap.c b/mm/mremap.c
 		return NULL;
 
 	pmd = pmd_offset(pud, addr);
-+	split_huge_page_pmd(vma, addr, pmd);
++	split_huge_pmd(vma, pmd, addr);
 	if (pmd_none_or_clear_bad(pmd))
 		return NULL;
 
 == Locking in hugepage aware code ==
 
 We want as much code as possible hugepage aware, as calling
-split_huge_page() or split_huge_page_pmd() has a cost.
+split_huge_page() or split_huge_pmd() has a cost.
 
 To make pagetable walks huge pmd aware, all you need to do is to call
 pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
@@ -330,47 +338,53 @@ created from under you by khugepaged (khugepaged collapse_huge_page
 takes the mmap_sem in write mode in addition to the anon_vma lock). If
 pmd_trans_huge returns false, you just fallback in the old code
 paths. If instead pmd_trans_huge returns true, you have to take the
-mm->page_table_lock and re-run pmd_trans_huge. Taking the
-page_table_lock will prevent the huge pmd to be converted into a
-regular pmd from under you (split_huge_page can run in parallel to the
+page table lock (pmd_lock()) and re-run pmd_trans_huge. Taking the
+page table lock will prevent the huge pmd from being converted into a
+regular pmd from under you (split_huge_pmd can run in parallel to the
 pagetable walk). If the second pmd_trans_huge returns false, you
-should just drop the page_table_lock and fallback to the old code as
-before. Otherwise you should run pmd_trans_splitting on the pmd. In
-case pmd_trans_splitting returns true, it means split_huge_page is
-already in the middle of splitting the page. So if pmd_trans_splitting
-returns true it's enough to drop the page_table_lock and call
-wait_split_huge_page and then fallback the old code paths. You are
-guaranteed by the time wait_split_huge_page returns, the pmd isn't
-huge anymore. If pmd_trans_splitting returns false, you can proceed to
-process the huge pmd and the hugepage natively. Once finished you can
-drop the page_table_lock.
-
-== compound_lock, get_user_pages and put_page ==
+should just drop the page table lock and fallback to the old code as
+before. Otherwise you can proceed to process the huge pmd and the
+hugepage natively. Once finished you can drop the page table lock.
+
+== Refcounts and transparent huge pages ==
+
+Refcounting on THP is mostly consistent with refcounting on other compound
+pages:
+
+  - get_page()/put_page() and GUP operate on the head page's ->_count.
+
+  - ->_count in tail pages is always zero: get_page_unless_zero() never
+    succeeds on tail pages.
+
+  - map/unmap of the pages with PTE entry increment/decrement ->_mapcount
+    on the relevant sub-page of the compound page.
+
+  - map/unmap of the whole compound page is accounted in compound_mapcount
+    (stored in the first tail page).
+
+PageDoubleMap() indicates that ->_mapcount in all subpages is offset up by one.
+This additional reference is required to get race-free detection of unmap of
+subpages when we have them mapped with both PMDs and PTEs.
+
+This optimization is required to lower the overhead of per-subpage mapcount
+tracking. The alternative is to alter ->_mapcount in all subpages on each
+map/unmap of the whole compound page.
+
+We set PG_double_map when a PMD of the page is split for the first time,
+but the page still has a PMD mapping. The additional references go away
+with the last compound_mapcount.
 
 split_huge_page internally has to distribute the refcounts in the head
-page to the tail pages before clearing all PG_head/tail bits from the
-page structures. It can do that easily for refcounts taken by huge pmd
-mappings. But the GUI API as created by hugetlbfs (that returns head
-and tail pages if running get_user_pages on an address backed by any
-hugepage), requires the refcount to be accounted on the tail pages and
-not only in the head pages, if we want to be able to run
-split_huge_page while there are gup pins established on any tail
-page. Failure to be able to run split_huge_page if there's any gup pin
-on any tail page, would mean having to split all hugepages upfront in
-get_user_pages which is unacceptable as too many gup users are
-performance critical and they must work natively on hugepages like
-they work natively on hugetlbfs already (hugetlbfs is simpler because
-hugetlbfs pages cannot be split so there wouldn't be requirement of
-accounting the pins on the tail pages for hugetlbfs). If we wouldn't
-account the gup refcounts on the tail pages during gup, we won't know
-anymore which tail page is pinned by gup and which is not while we run
-split_huge_page. But we still have to add the gup pin to the head page
-too, to know when we can free the compound page in case it's never
-split during its lifetime. That requires changing not just
-get_page, but put_page as well so that when put_page runs on a tail
-page (and only on a tail page) it will find its respective head page,
-and then it will decrease the head page refcount in addition to the
-tail page refcount. To obtain a head page reliably and to decrease its
-refcount without race conditions, put_page has to serialize against
-__split_huge_page_refcount using a special per-page lock called
-compound_lock.
+page to the tail pages before clearing all PG_head/tail bits from the page
+structures. It can be done easily for refcounts taken by page table
+entries. But we don't have enough information on how to distribute any
+additional pins (e.g. from get_user_pages). split_huge_page() fails any
+request to split a pinned huge page: it expects the page count to be equal
+to the sum of mapcounts of all sub-pages plus one (the split_huge_page
+caller must hold a reference on the head page).
+
+split_huge_page uses migration entries to stabilize page->_count and
+page->_mapcount.
+
+Note that split_huge_pmd() doesn't have any limitations on refcounting:
+the pmd can be split at any point and the split never fails.
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 58+ messages in thread
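
Expressed as code, the pmd-walk pattern described in the locking section
above boils down to roughly the following sketch (the function name is made
up for illustration and is not a hunk from the patch; the caller is assumed
to hold mmap_sem):

static void example_walk_pmd(struct mm_struct *mm, pmd_t *pmd)
{
	if (pmd_trans_huge(*pmd)) {
		spinlock_t *ptl = pmd_lock(mm, pmd);

		if (pmd_trans_huge(*pmd)) {
			/* The huge pmd is stable under the page table lock:
			 * process the hugepage natively here. */
			spin_unlock(ptl);
			return;
		}
		/* Raced with split_huge_pmd(): fall through to ptes. */
		spin_unlock(ptl);
	}
	/* ... regular pte-level processing ... */
}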

* Re: [PATCHv6 27/36] mm: differentiate page_mapped() from page_mapcount() for compound pages
  2015-06-03 17:05 ` [PATCHv6 27/36] mm: differentiate page_mapped() from page_mapcount() for compound pages Kirill A. Shutemov
@ 2015-06-09 10:58   ` Kirill A. Shutemov
  2015-06-10 14:34   ` Vlastimil Babka
  1 sibling, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-09 10:58 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Andrea Arcangeli, Hugh Dickins, Dave Hansen,
	Mel Gorman, Rik van Riel, Vlastimil Babka, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On Wed, Jun 03, 2015 at 08:05:58PM +0300, Kirill A. Shutemov wrote:
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 22cd540104ec..16add6692f49 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -917,10 +917,21 @@ static inline pgoff_t page_file_index(struct page *page)
>  
>  /*
>   * Return true if this page is mapped into pagetables.
> + * For compound page it returns true if any subpage of compound page is mapped.
>   */
> -static inline int page_mapped(struct page *page)
> +static inline bool page_mapped(struct page *page)
>  {
> -	return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
> +	int i;
> +	if (likely(!PageCompound(page)))
> +		return atomic_read(&page->_mapcount) >= 0;
> +	page = compound_head(page);
> +	if (compound_mapcount(page))
> +		return true;
> +	for (i = 0; i < hpage_nr_pages(page); i++) {
> +		if (atomic_read(&page[i]._mapcount) >= 0)
> +			return true;
> +	}
> +	return true;

Oops. 'false' should be here. Updated patch is below.

>From f0a84566f1a3d2545a3e557f93a668ecba0c842d Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Fri, 20 Mar 2015 12:24:55 +0200
Subject: [PATCH] mm: differentiate page_mapped() from page_mapcount() for
 compound pages

Let's define page_mapped() to be true for compound pages if any
sub-page of the compound page is mapped (with PMD or PTE).

On the other hand, page_mapcount() returns the mapcount for this particular
small page.

This will make cases like page_get_anon_vma() behave correctly once we
allow huge pages to be mapped with PTE.

Most users outside core-mm should use page_mapcount() instead of
page_mapped().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 arch/arc/mm/cache_arc700.c |  4 ++--
 arch/arm/mm/flush.c        |  2 +-
 arch/mips/mm/c-r4k.c       |  3 ++-
 arch/mips/mm/cache.c       |  2 +-
 arch/mips/mm/init.c        |  6 +++---
 arch/sh/mm/cache-sh4.c     |  2 +-
 arch/sh/mm/cache.c         |  8 ++++----
 arch/xtensa/mm/tlb.c       |  2 +-
 fs/proc/page.c             |  4 ++--
 include/linux/mm.h         | 15 +++++++++++++--
 mm/filemap.c               |  2 +-
 11 files changed, 31 insertions(+), 19 deletions(-)

diff --git a/arch/arc/mm/cache_arc700.c b/arch/arc/mm/cache_arc700.c
index 8c3a3e02ba92..1baa4d23314b 100644
--- a/arch/arc/mm/cache_arc700.c
+++ b/arch/arc/mm/cache_arc700.c
@@ -490,7 +490,7 @@ void flush_dcache_page(struct page *page)
 	 */
 	if (!mapping_mapped(mapping)) {
 		clear_bit(PG_dc_clean, &page->flags);
-	} else if (page_mapped(page)) {
+	} else if (page_mapcount(page)) {
 
 		/* kernel reading from page with U-mapping */
 		void *paddr = page_address(page);
@@ -675,7 +675,7 @@ void copy_user_highpage(struct page *to, struct page *from,
 	 * Note that while @u_vaddr refers to DST page's userspace vaddr, it is
 	 * equally valid for SRC page as well
 	 */
-	if (page_mapped(from) && addr_not_cache_congruent(kfrom, u_vaddr)) {
+	if (page_mapcount(from) && addr_not_cache_congruent(kfrom, u_vaddr)) {
 		__flush_dcache_page(kfrom, u_vaddr);
 		clean_src_k_mappings = 1;
 	}
diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
index 77f229302032..4da544aa25ef 100644
--- a/arch/arm/mm/flush.c
+++ b/arch/arm/mm/flush.c
@@ -315,7 +315,7 @@ void flush_dcache_page(struct page *page)
 	mapping = page_mapping(page);
 
 	if (!cache_ops_need_broadcast() &&
-	    mapping && !page_mapped(page))
+	    mapping && !page_mapcount(page))
 		clear_bit(PG_dcache_clean, &page->flags);
 	else {
 		__flush_dcache_page(mapping, page);
diff --git a/arch/mips/mm/c-r4k.c b/arch/mips/mm/c-r4k.c
index 3f8059602765..b28bf185ef77 100644
--- a/arch/mips/mm/c-r4k.c
+++ b/arch/mips/mm/c-r4k.c
@@ -578,7 +578,8 @@ static inline void local_r4k_flush_cache_page(void *args)
 		 * another ASID than the current one.
 		 */
 		map_coherent = (cpu_has_dc_aliases &&
-				page_mapped(page) && !Page_dcache_dirty(page));
+				page_mapcount(page) &&
+				!Page_dcache_dirty(page));
 		if (map_coherent)
 			vaddr = kmap_coherent(page, addr);
 		else
diff --git a/arch/mips/mm/cache.c b/arch/mips/mm/cache.c
index 7e3ea7766822..e695b28dc32c 100644
--- a/arch/mips/mm/cache.c
+++ b/arch/mips/mm/cache.c
@@ -106,7 +106,7 @@ void __flush_anon_page(struct page *page, unsigned long vmaddr)
 	unsigned long addr = (unsigned long) page_address(page);
 
 	if (pages_do_alias(addr, vmaddr)) {
-		if (page_mapped(page) && !Page_dcache_dirty(page)) {
+		if (page_mapcount(page) && !Page_dcache_dirty(page)) {
 			void *kaddr;
 
 			kaddr = kmap_coherent(page, vmaddr);
diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
index 448cde372af0..2c8e44aa536e 100644
--- a/arch/mips/mm/init.c
+++ b/arch/mips/mm/init.c
@@ -156,7 +156,7 @@ void copy_user_highpage(struct page *to, struct page *from,
 
 	vto = kmap_atomic(to);
 	if (cpu_has_dc_aliases &&
-	    page_mapped(from) && !Page_dcache_dirty(from)) {
+	    page_mapcount(from) && !Page_dcache_dirty(from)) {
 		vfrom = kmap_coherent(from, vaddr);
 		copy_page(vto, vfrom);
 		kunmap_coherent();
@@ -178,7 +178,7 @@ void copy_to_user_page(struct vm_area_struct *vma,
 	unsigned long len)
 {
 	if (cpu_has_dc_aliases &&
-	    page_mapped(page) && !Page_dcache_dirty(page)) {
+	    page_mapcount(page) && !Page_dcache_dirty(page)) {
 		void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
 		memcpy(vto, src, len);
 		kunmap_coherent();
@@ -196,7 +196,7 @@ void copy_from_user_page(struct vm_area_struct *vma,
 	unsigned long len)
 {
 	if (cpu_has_dc_aliases &&
-	    page_mapped(page) && !Page_dcache_dirty(page)) {
+	    page_mapcount(page) && !Page_dcache_dirty(page)) {
 		void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
 		memcpy(dst, vfrom, len);
 		kunmap_coherent();
diff --git a/arch/sh/mm/cache-sh4.c b/arch/sh/mm/cache-sh4.c
index 51d8f7f31d1d..58aaa4f33b81 100644
--- a/arch/sh/mm/cache-sh4.c
+++ b/arch/sh/mm/cache-sh4.c
@@ -241,7 +241,7 @@ static void sh4_flush_cache_page(void *args)
 		 */
 		map_coherent = (current_cpu_data.dcache.n_aliases &&
 			test_bit(PG_dcache_clean, &page->flags) &&
-			page_mapped(page));
+			page_mapcount(page));
 		if (map_coherent)
 			vaddr = kmap_coherent(page, address);
 		else
diff --git a/arch/sh/mm/cache.c b/arch/sh/mm/cache.c
index f770e3992620..e58cfbf45150 100644
--- a/arch/sh/mm/cache.c
+++ b/arch/sh/mm/cache.c
@@ -59,7 +59,7 @@ void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
 		       unsigned long vaddr, void *dst, const void *src,
 		       unsigned long len)
 {
-	if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
+	if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
 	    test_bit(PG_dcache_clean, &page->flags)) {
 		void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
 		memcpy(vto, src, len);
@@ -78,7 +78,7 @@ void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
 			 unsigned long vaddr, void *dst, const void *src,
 			 unsigned long len)
 {
-	if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
+	if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
 	    test_bit(PG_dcache_clean, &page->flags)) {
 		void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
 		memcpy(dst, vfrom, len);
@@ -97,7 +97,7 @@ void copy_user_highpage(struct page *to, struct page *from,
 
 	vto = kmap_atomic(to);
 
-	if (boot_cpu_data.dcache.n_aliases && page_mapped(from) &&
+	if (boot_cpu_data.dcache.n_aliases && page_mapcount(from) &&
 	    test_bit(PG_dcache_clean, &from->flags)) {
 		vfrom = kmap_coherent(from, vaddr);
 		copy_page(vto, vfrom);
@@ -153,7 +153,7 @@ void __flush_anon_page(struct page *page, unsigned long vmaddr)
 	unsigned long addr = (unsigned long) page_address(page);
 
 	if (pages_do_alias(addr, vmaddr)) {
-		if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
+		if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
 		    test_bit(PG_dcache_clean, &page->flags)) {
 			void *kaddr;
 
diff --git a/arch/xtensa/mm/tlb.c b/arch/xtensa/mm/tlb.c
index 5ece856c5725..35c822286bbe 100644
--- a/arch/xtensa/mm/tlb.c
+++ b/arch/xtensa/mm/tlb.c
@@ -245,7 +245,7 @@ static int check_tlb_entry(unsigned w, unsigned e, bool dtlb)
 						page_mapcount(p));
 				if (!page_count(p))
 					rc |= TLB_INSANE;
-				else if (page_mapped(p))
+				else if (page_mapcount(p))
 					rc |= TLB_SUSPICIOUS;
 			} else {
 				rc |= TLB_INSANE;
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 7eee2d8b97d9..e99c059339f6 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -97,9 +97,9 @@ u64 stable_page_flags(struct page *page)
 	 * pseudo flags for the well known (anonymous) memory mapped pages
 	 *
 	 * Note that page->_mapcount is overloaded in SLOB/SLUB/SLQB, so the
-	 * simple test in page_mapped() is not enough.
+	 * simple test in page_mapcount() is not enough.
 	 */
-	if (!PageSlab(page) && page_mapped(page))
+	if (!PageSlab(page) && page_mapcount(page))
 		u |= 1 << KPF_MMAP;
 	if (PageAnon(page))
 		u |= 1 << KPF_ANON;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 22cd540104ec..8cd6ee853eee 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -917,10 +917,21 @@ static inline pgoff_t page_file_index(struct page *page)
 
 /*
  * Return true if this page is mapped into pagetables.
+ * For compound page it returns true if any subpage of compound page is mapped.
  */
-static inline int page_mapped(struct page *page)
+static inline bool page_mapped(struct page *page)
 {
-	return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
+	int i;
+	if (likely(!PageCompound(page)))
+		return atomic_read(&page->_mapcount) >= 0;
+	page = compound_head(page);
+	if (compound_mapcount(page))
+		return true;
+	for (i = 0; i < hpage_nr_pages(page); i++) {
+		if (atomic_read(&page[i]._mapcount) >= 0)
+			return true;
+	}
+	return false;
 }
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index fb0ba54c3ac5..c5a40dd4f846 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -202,7 +202,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
 		__dec_zone_page_state(page, NR_FILE_PAGES);
 	if (PageSwapBacked(page))
 		__dec_zone_page_state(page, NR_SHMEM);
-	BUG_ON(page_mapped(page));
+	VM_BUG_ON_PAGE(page_mapped(page), page);
 
 	/*
 	 * At this point page must be either written or cleaned by truncate.
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 58+ messages in thread
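
To illustrate the head/subpage distinction the commit message draws, a
hypothetical caller (made up for illustration, not from the patch) might do:

	/* "page" may be any subpage of a THP */
	if (page_mapped(page)) {
		/* true if any subpage of the compound page is mapped,
		 * with a PMD or a PTE */
	}

	if (page_mapcount(page) > 1) {
		/* this particular small page is mapped more than once */
	}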

* Re: [PATCHv6 01/36] mm, proc: adjust PSS calculation
  2015-06-03 17:05 ` [PATCHv6 01/36] mm, proc: adjust PSS calculation Kirill A. Shutemov
@ 2015-06-09 12:29   ` Vlastimil Babka
  2015-06-22 10:02     ` Kirill A. Shutemov
  0 siblings, 1 reply; 58+ messages in thread
From: Vlastimil Babka @ 2015-06-09 12:29 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 06/03/2015 07:05 PM, Kirill A. Shutemov wrote:
> With the new refcounting, subpages of a compound page do not necessarily
> have the same mapcount. We need to take into account the mapcount of every
> sub-page.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>
> Acked-by: Jerome Marchand <jmarchan@redhat.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>   fs/proc/task_mmu.c | 48 +++++++++++++++++++++++++++++++-----------------
>   1 file changed, 31 insertions(+), 17 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 58be92e11939..f9b285761bc0 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -449,9 +449,10 @@ struct mem_size_stats {
>   };
>
>   static void smaps_account(struct mem_size_stats *mss, struct page *page,
> -		unsigned long size, bool young, bool dirty)
> +		bool compound, bool young, bool dirty)
>   {
> -	int mapcount;
> +	int i, nr = compound ? HPAGE_PMD_NR : 1;
> +	unsigned long size = nr * PAGE_SIZE;
>
>   	if (PageAnon(page))
>   		mss->anonymous += size;
> @@ -460,23 +461,36 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
>   	/* Accumulate the size in pages that have been accessed. */
>   	if (young || PageReferenced(page))
>   		mss->referenced += size;
> -	mapcount = page_mapcount(page);
> -	if (mapcount >= 2) {
> -		u64 pss_delta;
>
> -		if (dirty || PageDirty(page))
> -			mss->shared_dirty += size;
> -		else
> -			mss->shared_clean += size;
> -		pss_delta = (u64)size << PSS_SHIFT;
> -		do_div(pss_delta, mapcount);
> -		mss->pss += pss_delta;
> -	} else {
> +	/*
> +	 * page_count(page) == 1 guarantees the page is mapped exactly once.
> +	 * If any subpage of the compound page mapped with PTE it would elevate
> +	 * page_count().
> +	 */
> +	if (page_count(page) == 1) {
>   		if (dirty || PageDirty(page))
>   			mss->private_dirty += size;
>   		else
>   			mss->private_clean += size;
> -		mss->pss += (u64)size << PSS_SHIFT;

Deleting the line above was a mistake, right?

> +		return;
> +	}
> +
> +	for (i = 0; i < nr; i++, page++) {
> +		int mapcount = page_mapcount(page);
> +
> +		if (mapcount >= 2) {
> +			if (dirty || PageDirty(page))
> +				mss->shared_dirty += PAGE_SIZE;
> +			else
> +				mss->shared_clean += PAGE_SIZE;
> +			mss->pss += (PAGE_SIZE << PSS_SHIFT) / mapcount;
> +		} else {
> +			if (dirty || PageDirty(page))
> +				mss->private_dirty += PAGE_SIZE;
> +			else
> +				mss->private_clean += PAGE_SIZE;
> +			mss->pss += PAGE_SIZE << PSS_SHIFT;
> +		}
>   	}
>   }
>
> @@ -500,7 +514,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>
>   	if (!page)
>   		return;
> -	smaps_account(mss, page, PAGE_SIZE, pte_young(*pte), pte_dirty(*pte));
> +
> +	smaps_account(mss, page, false, pte_young(*pte), pte_dirty(*pte));
>   }
>
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> @@ -516,8 +531,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
>   	if (IS_ERR_OR_NULL(page))
>   		return;
>   	mss->anonymous_thp += HPAGE_PMD_SIZE;
> -	smaps_account(mss, page, HPAGE_PMD_SIZE,
> -			pmd_young(*pmd), pmd_dirty(*pmd));
> +	smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd));
>   }
>   #else
>   static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
>


^ permalink raw reply	[flat|nested] 58+ messages in thread
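
As a worked example of the PSS arithmetic in the hunk above: a 4k subpage
with page_mapcount() == 4 adds (PAGE_SIZE << PSS_SHIFT) / 4 to mss->pss,
i.e. a quarter of 4 KiB once the accumulated value is shifted back down for
reporting, while a subpage with mapcount 1 adds the full 4 KiB. The early
page_count() == 1 path should do the same, which is why dropping the
mss->pss line there looks like a mistake, as noted above.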

* Re: [PATCHv6 13/36] mm: drop tail page refcounting
  2015-06-03 17:05 ` [PATCHv6 13/36] mm: drop tail page refcounting Kirill A. Shutemov
@ 2015-06-09 13:59   ` Vlastimil Babka
  0 siblings, 0 replies; 58+ messages in thread
From: Vlastimil Babka @ 2015-06-09 13:59 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 06/03/2015 07:05 PM, Kirill A. Shutemov wrote:
> Tail page refcounting is utterly complicated and painful to support.
>
> It uses ->_mapcount on tail pages to store how many times this page is
> pinned. get_page() bumps ->_mapcount on tail page in addition to
> ->_count on head. This information is required by split_huge_page() to
> be able to distribute pins from head of compound page to tails during
> the split.
>
> We will need ->_mapcount acoount PTE mappings of subpages of the

                            to account

> compound page. We eliminate need in current meaning of ->_mapcount in
> tail pages by forbidding split entirely if the page is pinned.
>
> The only user of tail page refcounting is THP which is marked BROKEN for
> now.
>
> Let's drop all this mess. It makes get_page() and put_page() much
> simpler.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCHv6 26/36] mm: rework mapcount accounting to enable 4k mapping of THPs
  2015-06-03 17:05 ` [PATCHv6 26/36] mm: rework mapcount accounting to enable 4k mapping of THPs Kirill A. Shutemov
@ 2015-06-10 13:47   ` Vlastimil Babka
  2015-06-22 10:22     ` Kirill A. Shutemov
  0 siblings, 1 reply; 58+ messages in thread
From: Vlastimil Babka @ 2015-06-10 13:47 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 06/03/2015 07:05 PM, Kirill A. Shutemov wrote:
> We're going to allow mapping of individual 4k pages of THP compound.
> It means we need to track mapcount on per small page basis.
>
> The straightforward approach is to use ->_mapcount in all subpages to track
> how many times this subpage is mapped with PMDs or PTEs combined. But
> this is rather expensive: mapping or unmapping of a THP page with a PMD
> would require HPAGE_PMD_NR atomic operations instead of the single one we
> have now.
>
> The idea is to store separately how many times the page was mapped as
> whole -- compound_mapcount. This frees up ->_mapcount in subpages to
> track PTE mapcount.
>
> We use the same approach as with compound page destructor and compound
> order to store compound_mapcount: use space in first tail page,
> ->mapping this time.
>
> Any time we map/unmap whole compound page (THP or hugetlb) -- we
> increment/decrement compound_mapcount. When we map part of compound page
> with PTE we operate on ->_mapcount of the subpage.
>
> page_mapcount() counts both: PTE and PMD mappings of the page.
>
> Basically, we have the mapcount for a subpage spread over two counters.
> It makes it tricky to detect when the last mapcount for a page goes away.
>
> We introduced PageDoubleMap() for this. When we split a THP PMD for the
> first time and there's another PMD mapping left, we offset ->_mapcount
> in all subpages up by one and set PG_double_map on the compound page.
> These additional references go away with the last compound_mapcount.
>
> This approach provides a way to detect when last mapcount goes away on
> per small page basis without introducing new overhead for most common
> cases.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>   include/linux/mm.h         | 26 +++++++++++-
>   include/linux/mm_types.h   |  1 +
>   include/linux/page-flags.h | 37 +++++++++++++++++
>   include/linux/rmap.h       |  4 +-
>   mm/debug.c                 |  5 ++-
>   mm/huge_memory.c           |  2 +-
>   mm/hugetlb.c               |  4 +-
>   mm/memory.c                |  2 +-
>   mm/migrate.c               |  2 +-
>   mm/page_alloc.c            | 14 +++++--
>   mm/rmap.c                  | 98 +++++++++++++++++++++++++++++++++++-----------
>   11 files changed, 160 insertions(+), 35 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 31cd5be081cf..22cd540104ec 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -403,6 +403,19 @@ static inline int is_vmalloc_or_module_addr(const void *x)
>
>   extern void kvfree(const void *addr);
>
> +static inline atomic_t *compound_mapcount_ptr(struct page *page)
> +{
> +	return &page[1].compound_mapcount;
> +}
> +
> +static inline int compound_mapcount(struct page *page)
> +{
> +	if (!PageCompound(page))
> +		return 0;
> +	page = compound_head(page);
> +	return atomic_read(compound_mapcount_ptr(page)) + 1;
> +}
> +
>   /*
>    * The atomic page->_mapcount, starts from -1: so that transitions
>    * both from it and to it can be tracked, using atomic_inc_and_test
> @@ -415,8 +428,17 @@ static inline void page_mapcount_reset(struct page *page)
>
>   static inline int page_mapcount(struct page *page)
>   {
> +	int ret;
>   	VM_BUG_ON_PAGE(PageSlab(page), page);
> -	return atomic_read(&page->_mapcount) + 1;
> +
> +	ret = atomic_read(&page->_mapcount) + 1;
> +	if (PageCompound(page)) {
> +		page = compound_head(page);
> +		ret += compound_mapcount(page);

compound_mapcount() means another PageCompound() and compound_head(), 
which you just did. I've tried this to see the effect on a function that 
"calls" (inlines) page_mapcount() once:

-               ret += compound_mapcount(page);
+               ret += atomic_read(compound_mapcount_ptr(page)) + 1;

bloat-o-meter on compaction.o:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-59 (-59)
function                                     old     new   delta
isolate_migratepages_block                  1769    1710     -59

> +		if (PageDoubleMap(page))
> +			ret--;
> +	}
> +	return ret;
>   }
>
>   static inline int page_count(struct page *page)
> @@ -898,7 +920,7 @@ static inline pgoff_t page_file_index(struct page *page)
>    */
>   static inline int page_mapped(struct page *page)
>   {
> -	return atomic_read(&(page)->_mapcount) >= 0;
> +	return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
>   }
>
>   /*
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 4b51a59160ab..4d182cd14c1f 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -56,6 +56,7 @@ struct page {
>   						 * see PAGE_MAPPING_ANON below.
>   						 */
>   		void *s_mem;			/* slab first object */
> +		atomic_t compound_mapcount;	/* first tail page */
>   	};
>
>   	/* Second double word */
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 74b7cece1dfa..a8d47c1edf6a 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -127,6 +127,9 @@ enum pageflags {
>
>   	/* SLOB */
>   	PG_slob_free = PG_private,
> +
> +	/* THP. Stored in first tail page's flags */
> +	PG_double_map = PG_private_2,

Well, not just THP. Any user of compound pages must make sure not to use 
PG_private_2 on the first tail page. At least where the page is going to 
be user-mapped. And same thing about fields that are in union with 
compound_mapcount. Should that be documented more prominently somewhere? 
I guess there's no such user so far, right?

>   };
>
>   #ifndef __GENERATING_BOUNDS_H

[...]

> @@ -1167,6 +1194,41 @@ out:
>   	mem_cgroup_end_page_stat(memcg);
>   }
>
> +static void page_remove_anon_compound_rmap(struct page *page)
> +{
> +	int i, nr;
> +
> +	if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
> +		return;
> +
> +	/* Hugepages are not counted in NR_ANON_PAGES for now. */
> +	if (unlikely(PageHuge(page)))
> +		return;
> +
> +	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> +		return;
> +
> +	__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
> +
> +	if (PageDoubleMap(page)) {
> +		nr = 0;
> +		ClearPageDoubleMap(page);
> +		/*
> +		 * Subpages can be mapped with PTEs too. Check how many of
> +		 * them are still mapped.
> +		 */
> +		for (i = 0; i < HPAGE_PMD_NR; i++) {
> +			if (atomic_add_negative(-1, &page[i]._mapcount))
> +				nr++;
> +		}
> +	} else {
> +		nr = HPAGE_PMD_NR;
> +	}
> +
> +	if (nr)
> +		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);

-nr as we discussed on IRC.

> +}
> +
>   /**
>    * page_remove_rmap - take down pte mapping from a page
>    * @page:	page to remove mapping from
> @@ -1176,33 +1238,25 @@ out:
>    */
>   void page_remove_rmap(struct page *page, bool compound)
>   {
> -	int nr = compound ? hpage_nr_pages(page) : 1;
> -
>   	if (!PageAnon(page)) {
>   		VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
>   		page_remove_file_rmap(page);
>   		return;
>   	}
>
> +	if (compound)
> +		return page_remove_anon_compound_rmap(page);
> +
>   	/* page still mapped by someone else? */
>   	if (!atomic_add_negative(-1, &page->_mapcount))
>   		return;
>
> -	/* Hugepages are not counted in NR_ANON_PAGES for now. */
> -	if (unlikely(PageHuge(page)))
> -		return;
> -
>   	/*
>   	 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
>   	 * these counters are not modified in interrupt context, and
>   	 * pte lock(a spinlock) is held, which implies preemption disabled.
>   	 */
> -	if (compound) {
> -		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> -		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
> -	}
> -
> -	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
> +	__dec_zone_page_state(page, NR_ANON_PAGES);
>
>   	if (unlikely(PageMlocked(page)))
>   		clear_page_mlock(page);
> @@ -1643,7 +1697,7 @@ void hugepage_add_anon_rmap(struct page *page,
>   	BUG_ON(!PageLocked(page));
>   	BUG_ON(!anon_vma);
>   	/* address might be in next vma when migration races vma_adjust */
> -	first = atomic_inc_and_test(&page->_mapcount);
> +	first = atomic_inc_and_test(compound_mapcount_ptr(page));
>   	if (first)
>   		__hugepage_set_anon_rmap(page, vma, address, 0);
>   }
> @@ -1652,7 +1706,7 @@ void hugepage_add_new_anon_rmap(struct page *page,
>   			struct vm_area_struct *vma, unsigned long address)
>   {
>   	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> -	atomic_set(&page->_mapcount, 0);
> +	atomic_set(compound_mapcount_ptr(page), 0);
>   	__hugepage_set_anon_rmap(page, vma, address, 1);
>   }
>   #endif /* CONFIG_HUGETLB_PAGE */
>


^ permalink raw reply	[flat|nested] 58+ messages in thread
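
As a concrete illustration of the scheme discussed in this patch: a THP
mapped only with PMDs by two processes has compound_mapcount == 2 while every
subpage's ->_mapcount stays at its initial -1, so page_mapcount() on any
subpage returns 2 and a map or unmap of the whole page touches a single
counter instead of HPAGE_PMD_NR of them. PG_double_map only enters the
picture once one of those PMDs is split while the other is still present.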

* Re: [PATCHv6 27/36] mm: differentiate page_mapped() from page_mapcount() for compound pages
  2015-06-03 17:05 ` [PATCHv6 27/36] mm: differentiate page_mapped() from page_mapcount() for compound pages Kirill A. Shutemov
  2015-06-09 10:58   ` Kirill A. Shutemov
@ 2015-06-10 14:34   ` Vlastimil Babka
  1 sibling, 0 replies; 58+ messages in thread
From: Vlastimil Babka @ 2015-06-10 14:34 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 06/03/2015 07:05 PM, Kirill A. Shutemov wrote:
> Let's define page_mapped() to be true for compound pages if any
> sub-page of the compound page is mapped (with PMD or PTE).
>
> On the other hand, page_mapcount() returns the mapcount for this particular
> small page.
>
> This will make cases like page_get_anon_vma() behave correctly once we
> allow huge pages to be mapped with PTE.
>
> Most users outside core-mm should use page_mapcount() instead of
> page_mapped().
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

[...]

> @@ -917,10 +917,21 @@ static inline pgoff_t page_file_index(struct page *page)
>
>   /*
>    * Return true if this page is mapped into pagetables.
> + * For compound page it returns true if any subpage of compound page is mapped.
>    */
> -static inline int page_mapped(struct page *page)
> +static inline bool page_mapped(struct page *page)
>   {
> -	return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
> +	int i;
> +	if (likely(!PageCompound(page)))
> +		return atomic_read(&page->_mapcount) >= 0;
> +	page = compound_head(page);
> +	if (compound_mapcount(page))

Same optimization opportunity as I pointed out in previous patch for 
page_mapcount().

> +		return true;
> +	for (i = 0; i < hpage_nr_pages(page); i++) {
> +		if (atomic_read(&page[i]._mapcount) >= 0)
> +			return true;
> +	}
> +	return true;
>   }
>
>   /*
> diff --git a/mm/filemap.c b/mm/filemap.c
> index cb41cf3069d2..c6cf03303ded 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -202,7 +202,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
>   		__dec_zone_page_state(page, NR_FILE_PAGES);
>   	if (PageSwapBacked(page))
>   		__dec_zone_page_state(page, NR_SHMEM);
> -	BUG_ON(page_mapped(page));
> +	VM_BUG_ON_PAGE(page_mapped(page), page);
>
>   	/*
>   	 * At this point page must be either written or cleaned by truncate.
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCHv6 32/36] thp: reintroduce split_huge_page()
  2015-06-03 17:06 ` [PATCHv6 32/36] thp: reintroduce split_huge_page() Kirill A. Shutemov
@ 2015-06-10 15:44   ` Vlastimil Babka
  2015-06-22 11:28     ` Kirill A. Shutemov
  0 siblings, 1 reply; 58+ messages in thread
From: Vlastimil Babka @ 2015-06-10 15:44 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 06/03/2015 07:06 PM, Kirill A. Shutemov wrote:
> This patch adds implementation of split_huge_page() for new
> refcountings.
>
> Unlike the previous implementation, the new split_huge_page() can fail if
> somebody holds a GUP pin on the page. It also means that a pin on a page
> would prevent it from being split under you. It makes the situation in
> many places much cleaner.
>
> The basic scheme of split_huge_page():
>
>    - Check that the sum of mapcounts of all subpages is equal to page_count()
>      plus one (caller pin). Fail with -EBUSY otherwise. This way we can avoid
>      useless PMD-splits.
>
>    - Freeze the page counters by splitting all PMD and setup migration
>      PTEs.
>
>    - Re-check sum of mapcounts against page_count(). Page's counts are
>      stable now. -EBUSY if page is pinned.
>
>    - Split compound page.
>
>    - Unfreeze the page by removing migration entries.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

[...]

> +
> +static int __split_huge_page_tail(struct page *head, int tail,
> +		struct lruvec *lruvec, struct list_head *list)
> +{
> +	int mapcount;
> +	struct page *page_tail = head + tail;
> +
> +	mapcount = page_mapcount(page_tail);

Isn't page_mapcount() unnecessarily heavyweight here? When you are 
splitting a page, it already should have zero compound_mapcount() and 
shouldn't be PageDoubleMap(), no? So you should care about 
page->_mapcount only? Sure, splitting THP is not a hotpath, but when 
done 512 times per split, it could make some difference in the split's 
latency.
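
I.e. something along these lines (untested), relying on the assumption above
that compound_mapcount is already zero and PageDoubleMap is clear at this
point:

	/* pte mapcount only, per the assumption above */
	mapcount = atomic_read(&page_tail->_mapcount) + 1;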

> +	VM_BUG_ON_PAGE(atomic_read(&page_tail->_count) != 0, page_tail);
> +
> +	/*
> +	 * tail_page->_count is zero and not changing from under us. But
> +	 * get_page_unless_zero() may be running from under us on the
> +	 * tail_page. If we used atomic_set() below instead of atomic_add(), we
> +	 * would then run atomic_set() concurrently with
> +	 * get_page_unless_zero(), and atomic_set() is implemented in C not
> +	 * using locked ops. spin_unlock on x86 sometime uses locked ops
> +	 * because of PPro errata 66, 92, so unless somebody can guarantee
> +	 * atomic_set() here would be safe on all archs (and not only on x86),
> +	 * it's safer to use atomic_add().

I would be surprised if this was the first place to use atomic_set() 
with potential concurrent atomic_add(). Shouldn't atomic_*() API 
guarantee that this works?

> +	 */
> +	atomic_add(page_mapcount(page_tail) + 1, &page_tail->_count);

You already have the value in mapcount variable, so why read it again.



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCHv6 33/36] migrate_pages: try to split pages on qeueuing
  2015-06-03 17:06 ` [PATCHv6 33/36] migrate_pages: try to split pages on qeueuing Kirill A. Shutemov
@ 2015-06-11  9:27   ` Vlastimil Babka
  2015-06-22 11:35     ` Kirill A. Shutemov
  0 siblings, 1 reply; 58+ messages in thread
From: Vlastimil Babka @ 2015-06-11  9:27 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 06/03/2015 07:06 PM, Kirill A. Shutemov wrote:
> We are not able to migrate THPs. It means it's not enough to split only
> PMD on migration -- we need to split compound page under it too.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>   mm/mempolicy.c | 37 +++++++++++++++++++++++++++++++++----
>   1 file changed, 33 insertions(+), 4 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 528f6c467cf1..0b1499c2f890 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -489,14 +489,31 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
>   	struct page *page;
>   	struct queue_pages *qp = walk->private;
>   	unsigned long flags = qp->flags;
> -	int nid;
> +	int nid, ret;
>   	pte_t *pte;
>   	spinlock_t *ptl;
>
> -	split_huge_pmd(vma, pmd, addr);
> -	if (pmd_trans_unstable(pmd))
> -		return 0;
> +	if (pmd_trans_huge(*pmd)) {
> +		ptl = pmd_lock(walk->mm, pmd);
> +		if (pmd_trans_huge(*pmd)) {
> +			page = pmd_page(*pmd);
> +			if (is_huge_zero_page(page)) {
> +				spin_unlock(ptl);
> +				split_huge_pmd(vma, pmd, addr);
> +			} else {
> +				get_page(page);
> +				spin_unlock(ptl);
> +				lock_page(page);
> +				ret = split_huge_page(page);
> +				unlock_page(page);
> +				put_page(page);
> +				if (ret)
> +					return 0;
> +			}
> +		}
> +	}
>
> +retry:
>   	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
>   	for (; addr != end; pte++, addr += PAGE_SIZE) {
>   		if (!pte_present(*pte))
> @@ -513,6 +530,18 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
>   		nid = page_to_nid(page);
>   		if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
>   			continue;
> +		if (PageTail(page) && PageAnon(page)) {

Hm, can it really happen that we stumble upon THP tail page here, 
without first stumbling upon it in the previous hunk above? If so, when?

> +			get_page(page);
> +			pte_unmap_unlock(pte - 1, ptl);
> +			lock_page(page);
> +			ret = split_huge_page(page);
> +			unlock_page(page);
> +			put_page(page);
> +			/* Failed to split -- skip. */
> +			if (ret)
> +				continue;
> +			goto retry;
> +		}
>
>   		if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
>   			migrate_page_add(page, qp->pagelist, flags);
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCHv6 29/36] thp: implement split_huge_pmd()
  2015-06-03 17:06 ` [PATCHv6 29/36] thp: implement split_huge_pmd() Kirill A. Shutemov
@ 2015-06-11  9:49   ` Vlastimil Babka
  2015-06-22 11:14     ` Kirill A. Shutemov
  0 siblings, 1 reply; 58+ messages in thread
From: Vlastimil Babka @ 2015-06-11  9:49 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 06/03/2015 07:06 PM, Kirill A. Shutemov wrote:
> Original split_huge_page() combined two operations: splitting PMDs into
> tables of PTEs and splitting the underlying compound page. This patch
> implements split_huge_pmd(), which splits the given PMD without splitting
> other PMDs this page is mapped with or the underlying compound page.
>
> Without tail page refcounting, implementation of split_huge_pmd() is
> pretty straight-forward.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

[...]

> +
> +	if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
> +		/* Last compound_mapcount is gone. */
> +		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
> +		if (PageDoubleMap(page)) {
> +			/* No need in mapcount reference anymore */
> +			ClearPageDoubleMap(page);
> +			for (i = 0; i < HPAGE_PMD_NR; i++)
> +				atomic_dec(&page[i]._mapcount);
> +		}
> +	} else if (!TestSetPageDoubleMap(page)) {
> +		/*
> +		 * The first PMD split for the compound page and we still
> +		 * have other PMD mapping of the page: bump _mapcount in
> +		 * every small page.
> +		 * This reference will go away with last compound_mapcount.
> +		 */
> +		for (i = 0; i < HPAGE_PMD_NR; i++)
> +			atomic_inc(&page[i]._mapcount);

The order of actions here means that between TestSetPageDoubleMap() and 
the atomic incs, anyone calling page_mapcount() on one of the pages not 
processed by the for loop yet, will see a value lower by 1 from what he 
should see. I wonder if that can cause any trouble somewhere, especially 
if there's only one other compound mapping and page_mapcount() will 
return 0 instead of 1?

Conversely, when clearing PageDoubleMap() above (or in one of those rmap 
functions IIRC), one could see mapcount inflated by one. But I guess 
that's less dangerous.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCHv6 36/36] thp: update documentation
  2015-06-03 17:06 ` [PATCHv6 36/36] thp: update documentation Kirill A. Shutemov
@ 2015-06-11 12:30   ` Vlastimil Babka
  2015-06-22 13:18     ` Kirill A. Shutemov
  0 siblings, 1 reply; 58+ messages in thread
From: Vlastimil Babka @ 2015-06-11 12:30 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 06/03/2015 07:06 PM, Kirill A. Shutemov wrote:
> The patch updates Documentation/vm/transhuge.txt to reflect changes in
> THP design.

One thing I'm missing is info about the deferred splitting.

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>   Documentation/vm/transhuge.txt | 124 +++++++++++++++++++++++------------------
>   1 file changed, 69 insertions(+), 55 deletions(-)
>
> diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> index 6b31cfbe2a9a..2352b12cae93 100644
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -35,10 +35,10 @@ miss is going to run faster.
>
>   == Design ==
>
> -- "graceful fallback": mm components which don't have transparent
> -  hugepage knowledge fall back to breaking a transparent hugepage and
> -  working on the regular pages and their respective regular pmd/pte
> -  mappings
> +- "graceful fallback": mm components which don't have transparent hugepage
> +  knowledge fall back to breaking huge pmd mapping into table of ptes and,
> +  if nesessary, split a transparent hugepage. Therefore these components

         necessary
> +
> +split_huge_page uses migration entries to stabilize page->_count and
> +page->_mapcount.

Hm, what if there's some physical memory scanner taking page->_count 
pins? I think compaction shouldn't be an issue, but maybe some others?


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCHv6 00/36] THP refcounting redesign
  2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
                   ` (35 preceding siblings ...)
  2015-06-03 17:06 ` [PATCHv6 36/36] thp: update documentation Kirill A. Shutemov
@ 2015-06-16 13:17 ` Jerome Marchand
  2015-06-22 13:21   ` Kirill A. Shutemov
  36 siblings, 1 reply; 58+ messages in thread
From: Jerome Marchand @ 2015-06-16 13:17 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Sasha Levin,
	linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 8495 bytes --]

On 06/03/2015 07:05 PM, Kirill A. Shutemov wrote:
> Hello everybody,
> 
> Here's new revision of refcounting patchset. Please review and consider
> applying.
> 
> The goal of patchset is to make refcounting on THP pages cheaper with
> simpler semantics and allow the same THP compound page to be mapped with
> PMD and PTEs. This is required to get reasonable THP-pagecache
> implementation.
> 
> With the new refcounting design it's much easier to protect against
> split_huge_page(): simple reference on a page will make you the deal.
> It makes gup_fast() implementation simpler and doesn't require
> special-case in futex code to handle tail THP pages.
> 
> It should improve THP utilization over the system since splitting THP in
> one process doesn't necessary lead to splitting the page in all other
> processes have the page mapped.
> 
> The patchset drastically lower complexity of get_page()/put_page()
> codepaths. I encourage people look on this code before-and-after to
> justify time budget on reviewing this patchset.
> 
> = Changelog =
> 
> v6:
>   - rebase to since-4.0;
>   - optimize mapcount handling: significantely reduce overhead for most
>     common cases.
>   - split pages on migrate_pages();
>   - remove infrastructure for handling splitting PMDs on all architectures;
>   - fix page_mapcount() for hugetlb pages;
> 

Hi Kirill,

I ran some LTP mm tests and hugemmap tests trigger the following:

[  438.749457] page:ffffea0000df8000 count:2 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
[  438.750089] flags: 0x3ffc0000004001(locked|head)
[  438.750089] page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
[  438.750089] ------------[ cut here ]------------
[  438.768046] kernel BUG at mm/filemap.c:205!
[  438.768046] invalid opcode: 0000 [#1] SMP 
[  438.768046] Modules linked in: loop ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ppdev iosf_mbi crct10dif_pclmul crc32_pclmul crc32c_intel joydev ghash_clmulni_intel virtio_balloon pcspkr virtio_console nfsd parport_pc parport floppy pvpanic i2c_piix4 acpi_cpufreq auth_rpcgss nfs_acl lockd grace sunrpc virtio_net qxl virtio_blk drm_kms_helper ttm drm serio_raw ata_generic virtio_pci virtio_ring virtio pata_acpi
[  438.768046] CPU: 1 PID: 12918 Comm: hugemmap01 Not tainted 4.0.0thprfc-kasv6+ #247
[  438.768046] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  438.768046] task: ffff88007b09cc40 ti: ffff880077b88000 task.ti: ffff880077b88000
[  438.768046] RIP: 0010:[<ffffffff811e2aac>]  [<ffffffff811e2aac>] __delete_from_page_cache+0x4bc/0x5a0
[  438.768046] RSP: 0018:ffff880077b8bc58  EFLAGS: 00010086
[  438.768046] RAX: 0000000000000036 RBX: ffffea0000df8000 RCX: 0000000000000006
[  438.768046] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88007d5ce9c0
[  438.768046] RBP: ffff880077b8bcb8 R08: 0000000000000001 R09: 0000000000000001
[  438.768046] R10: 0000000000000001 R11: ffff880034e44210 R12: ffffea0000df8000
[  438.768046] R13: ffff88003562cac0 R14: 0000000000000000 R15: ffff88003562cac8
[  438.768046] FS:  00007fda9ccbb700(0000) GS:ffff88007d400000(0000) knlGS:0000000000000000
[  438.768046] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  438.768046] CR2: 00007fda9ccc7000 CR3: 00000000785e6000 CR4: 00000000001407e0
[  438.768046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  438.768046] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  438.768046] Stack:
[  438.768046]  0000000000000246 ffff88003562cad8 ffff88003562caf0 0000000000000000
[  438.768046]  ffff88003562cad0 000000009bfc6d69 ffff880077b8bcb8 ffffea0000df8000
[  438.768046]  ffff88003562cad8 0000000000000000 ffffea0000df8000 0000000000000000
[  438.768046] Call Trace:
[  438.768046]  [<ffffffff811e2be5>] delete_from_page_cache+0x55/0xd0
[  438.768046]  [<ffffffff81380be5>] truncate_hugepages+0x135/0x290
[  438.768046]  [<ffffffff810e7df5>] ? local_clock+0x15/0x30
[  438.768046]  [<ffffffff8110647f>] ? lock_release_holdtime.part.31+0xf/0x190
[  438.768046]  [<ffffffff81380eb8>] hugetlbfs_evict_inode+0x18/0x40
[  438.768046]  [<ffffffff812982bb>] evict+0xab/0x180
[  438.768046]  [<ffffffff81298cee>] iput+0x1ce/0x390
[  438.768046]  [<ffffffff8128aba9>] do_unlinkat+0x209/0x330
[  438.768046]  [<ffffffff81884632>] ? ret_from_sys_call+0x24/0x5f
[  438.768046]  [<ffffffff811095ed>] ? trace_hardirqs_on_caller+0xfd/0x1c0
[  438.768046]  [<ffffffff8128bf66>] SyS_unlink+0x16/0x20
[  438.768046]  [<ffffffff81884609>] system_call_fastpath+0x12/0x17
[  438.768046] Code: 49 8b 14 24 4c 89 e0 80 e6 80 74 08 4c 89 e7 e8 15 2e 69 00 8b 40 48 83 c0 01 74 25 48 c7 c6 28 fb c6 81 48 89 df e8 d4 43 03 00 <0f> 0b 48 89 df e8 f4 2d 69 00 48 f7 00 00 c0 00 00 49 89 c4 75 
[  438.768046] RIP  [<ffffffff811e2aac>] __delete_from_page_cache+0x4bc/0x5a0
[  438.768046]  RSP <ffff880077b8bc58>
[  438.768046] ---[ end trace 3903188dcb3f3d48 ]---
[  438.768046] BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:41
[  438.768046] in_atomic(): 1, irqs_disabled(): 1, pid: 12918, name: hugemmap01
[  438.768046] INFO: lockdep is turned off.
[  438.768046] irq event stamp: 6218
[  438.768046] hardirqs last  enabled at (6217): [<ffffffff818812df>] __mutex_unlock_slowpath+0xbf/0x190
[  438.768046] hardirqs last disabled at (6218): [<ffffffff8188387f>] _raw_spin_lock_irq+0x1f/0x80
[  438.768046] softirqs last  enabled at (6042): [<ffffffff810b0df7>] __do_softirq+0x377/0x670
[  438.768046] softirqs last disabled at (6027): [<ffffffff810b14ad>] irq_exit+0x11d/0x130
[  438.768046] CPU: 1 PID: 12918 Comm: hugemmap01 Tainted: G      D         4.0.0thprfc-kasv6+ #247
[  438.768046] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  438.768046]  0000000000000000 000000009bfc6d69 ffff880077b8b8a8 ffffffff81879afa
[  438.768046]  0000000000000000 ffff88007b09cc40 ffff880077b8b8d8 ffffffff810da0cc
[  438.768046]  0000000000000000 ffffffff81c68746 0000000000000029 0000000000000000
[  438.768046] Call Trace:
[  438.768046]  [<ffffffff81879afa>] dump_stack+0x4c/0x65
[  438.768046]  [<ffffffff810da0cc>] ___might_sleep+0x18c/0x250
[  438.768046]  [<ffffffff810da1dd>] __might_sleep+0x4d/0x90
[  438.768046]  [<ffffffff8188163a>] down_read+0x2a/0xa0
[  438.768046]  [<ffffffff810be6c3>] exit_signals+0x33/0x150
[  438.768046]  [<ffffffff810adc2f>] do_exit+0xcf/0xd20
[  438.768046]  [<ffffffff81121006>] ? kmsg_dump+0x166/0x220
[  438.768046]  [<ffffffff81120ed4>] ? kmsg_dump+0x34/0x220
[  438.768046]  [<ffffffff81021cce>] oops_end+0x9e/0xe0
[  438.768046]  [<ffffffff8102224b>] die+0x4b/0x70
[  438.768046]  [<ffffffff8101df80>] do_trap+0xb0/0x150
[  438.768046]  [<ffffffff8101e2f4>] do_error_trap+0xa4/0x180
[  438.768046]  [<ffffffff811e2aac>] ? __delete_from_page_cache+0x4bc/0x5a0
[  438.768046]  [<ffffffff81120255>] ? vprintk_emit+0x285/0x620
[  438.768046]  [<ffffffff81435b9d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[  438.768046]  [<ffffffff8101ee90>] do_invalid_op+0x20/0x30
[  438.768046]  [<ffffffff818860de>] invalid_op+0x1e/0x30
[  438.768046]  [<ffffffff811e2aac>] ? __delete_from_page_cache+0x4bc/0x5a0
[  438.768046]  [<ffffffff811e2aac>] ? __delete_from_page_cache+0x4bc/0x5a0
[  438.768046]  [<ffffffff811e2be5>] delete_from_page_cache+0x55/0xd0
[  438.768046]  [<ffffffff81380be5>] truncate_hugepages+0x135/0x290
[  438.768046]  [<ffffffff810e7df5>] ? local_clock+0x15/0x30
[  438.768046]  [<ffffffff8110647f>] ? lock_release_holdtime.part.31+0xf/0x190
[  438.768046]  [<ffffffff81380eb8>] hugetlbfs_evict_inode+0x18/0x40
[  438.768046]  [<ffffffff812982bb>] evict+0xab/0x180
[  438.768046]  [<ffffffff81298cee>] iput+0x1ce/0x390
[  438.768046]  [<ffffffff8128aba9>] do_unlinkat+0x209/0x330
[  438.768046]  [<ffffffff81884632>] ? ret_from_sys_call+0x24/0x5f
[  438.768046]  [<ffffffff811095ed>] ? trace_hardirqs_on_caller+0xfd/0x1c0
[  438.768046]  [<ffffffff8128bf66>] SyS_unlink+0x16/0x20
[  438.768046]  [<ffffffff81884609>] system_call_fastpath+0x12/0x17
[  438.768046] note: hugemmap01[12918] exited with preempt_count 1

Jerome


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCHv6 01/36] mm, proc: adjust PSS calculation
  2015-06-09 12:29   ` Vlastimil Babka
@ 2015-06-22 10:02     ` Kirill A. Shutemov
  0 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-22 10:02 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Tue, Jun 09, 2015 at 02:29:17PM +0200, Vlastimil Babka wrote:
> On 06/03/2015 07:05 PM, Kirill A. Shutemov wrote:
> >With new refcounting all subpages of the compound page are not nessessary
> >have the same mapcount. We need to take into account mapcount of every
> >sub-page.
> >
> >Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >Tested-by: Sasha Levin <sasha.levin@oracle.com>
> >Acked-by: Jerome Marchand <jmarchan@redhat.com>
> >Acked-by: Vlastimil Babka <vbabka@suse.cz>
> >---
> >  fs/proc/task_mmu.c | 48 +++++++++++++++++++++++++++++++-----------------
> >  1 file changed, 31 insertions(+), 17 deletions(-)
> >
> >diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> >index 58be92e11939..f9b285761bc0 100644
> >--- a/fs/proc/task_mmu.c
> >+++ b/fs/proc/task_mmu.c
> >@@ -449,9 +449,10 @@ struct mem_size_stats {
> >  };
> >
> >  static void smaps_account(struct mem_size_stats *mss, struct page *page,
> >-		unsigned long size, bool young, bool dirty)
> >+		bool compound, bool young, bool dirty)
> >  {
> >-	int mapcount;
> >+	int i, nr = compound ? HPAGE_PMD_NR : 1;
> >+	unsigned long size = nr * PAGE_SIZE;
> >
> >  	if (PageAnon(page))
> >  		mss->anonymous += size;
> >@@ -460,23 +461,36 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
> >  	/* Accumulate the size in pages that have been accessed. */
> >  	if (young || PageReferenced(page))
> >  		mss->referenced += size;
> >-	mapcount = page_mapcount(page);
> >-	if (mapcount >= 2) {
> >-		u64 pss_delta;
> >
> >-		if (dirty || PageDirty(page))
> >-			mss->shared_dirty += size;
> >-		else
> >-			mss->shared_clean += size;
> >-		pss_delta = (u64)size << PSS_SHIFT;
> >-		do_div(pss_delta, mapcount);
> >-		mss->pss += pss_delta;
> >-	} else {
> >+	/*
> >+	 * page_count(page) == 1 guarantees the page is mapped exactly once.
> >+	 * If any subpage of the compound page mapped with PTE it would elevate
> >+	 * page_count().
> >+	 */
> >+	if (page_count(page) == 1) {
> >  		if (dirty || PageDirty(page))
> >  			mss->private_dirty += size;
> >  		else
> >  			mss->private_clean += size;
> >-		mss->pss += (u64)size << PSS_SHIFT;
> 
> Deleting the line above was a mistake, right?

Yep :-/
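
Presumably the respin just keeps that line in the page_count() == 1
branch, i.e. something like (a sketch based on the hunk quoted above,
not the actual updated patch):

	if (dirty || PageDirty(page))
		mss->private_dirty += size;
	else
		mss->private_clean += size;
	mss->pss += (u64)size << PSS_SHIFT;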

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCHv6 26/36] mm: rework mapcount accounting to enable 4k mapping of THPs
  2015-06-10 13:47   ` Vlastimil Babka
@ 2015-06-22 10:22     ` Kirill A. Shutemov
  0 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-22 10:22 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Wed, Jun 10, 2015 at 03:47:06PM +0200, Vlastimil Babka wrote:
> On 06/03/2015 07:05 PM, Kirill A. Shutemov wrote:
> >@@ -415,8 +428,17 @@ static inline void page_mapcount_reset(struct page *page)
> >
> >  static inline int page_mapcount(struct page *page)
> >  {
> >+	int ret;
> >  	VM_BUG_ON_PAGE(PageSlab(page), page);
> >-	return atomic_read(&page->_mapcount) + 1;
> >+
> >+	ret = atomic_read(&page->_mapcount) + 1;
> >+	if (PageCompound(page)) {
> >+		page = compound_head(page);
> >+		ret += compound_mapcount(page);
> 
> compound_mapcount() means another PageCompound() and compound_head(), which
> you just did. I've tried this to see the effect on a function that "calls"
> (inlines) page_mapcount() once:
> 
> -               ret += compound_mapcount(page);
> +               ret += atomic_read(compound_mapcount_ptr(page)) + 1;
> 
> bloat-o-meter on compaction.o:
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-59 (-59)
> function                                     old     new   delta
> isolate_migratepages_block                  1769    1710     -59

Okay, fair enough.

> >diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> >index 74b7cece1dfa..a8d47c1edf6a 100644
> >--- a/include/linux/page-flags.h
> >+++ b/include/linux/page-flags.h
> >@@ -127,6 +127,9 @@ enum pageflags {
> >
> >  	/* SLOB */
> >  	PG_slob_free = PG_private,
> >+
> >+	/* THP. Stored in first tail page's flags */
> >+	PG_double_map = PG_private_2,
> 
> Well, not just THP. Any user of compound pages must make sure not to use
> PG_private_2 on the first tail page. At least where the page is going to be
> user-mapped. And same thing about fields that are in union with
> compound_mapcount. Should that be documented more prominently somewhere?

I would substitute "Compound pages" for "THP" in the comment.
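
I.e. the comment would presumably end up reading along these lines (a
sketch of the intended wording, not the final patch):

	/* Compound pages. Stored in first tail page's flags */
	PG_double_map = PG_private_2,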

> I guess there's no such user so far, right?

I believe so.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCHv6 29/36] thp: implement split_huge_pmd()
  2015-06-11  9:49   ` Vlastimil Babka
@ 2015-06-22 11:14     ` Kirill A. Shutemov
  2015-06-22 16:01       ` Vlastimil Babka
  0 siblings, 1 reply; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-22 11:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Thu, Jun 11, 2015 at 11:49:48AM +0200, Vlastimil Babka wrote:
> On 06/03/2015 07:06 PM, Kirill A. Shutemov wrote:
> >Original split_huge_page() combined two operations: splitting PMDs into
> >tables of PTEs and splitting underlying compound page. This patch
> >implements split_huge_pmd() which split given PMD without splitting
> >other PMDs this page mapped with or underlying compound page.
> >
> >Without tail page refcounting, implementation of split_huge_pmd() is
> >pretty straight-forward.
> >
> >Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >Tested-by: Sasha Levin <sasha.levin@oracle.com>
> 
> [...]
> 
> >+
> >+	if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
> >+		/* Last compound_mapcount is gone. */
> >+		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
> >+		if (PageDoubleMap(page)) {
> >+			/* No need in mapcount reference anymore */
> >+			ClearPageDoubleMap(page);
> >+			for (i = 0; i < HPAGE_PMD_NR; i++)
> >+				atomic_dec(&page[i]._mapcount);
> >+		}
> >+	} else if (!TestSetPageDoubleMap(page)) {
> >+		/*
> >+		 * The first PMD split for the compound page and we still
> >+		 * have other PMD mapping of the page: bump _mapcount in
> >+		 * every small page.
> >+		 * This reference will go away with last compound_mapcount.
> >+		 */
> >+		for (i = 0; i < HPAGE_PMD_NR; i++)
> >+			atomic_inc(&page[i]._mapcount);
> 
> The order of actions here means that between TestSetPageDoubleMap() and the
> atomic incs, anyone calling page_mapcount() on one of the pages not
> processed by the for loop yet, will see a value lower by 1 from what he
> should see. I wonder if that can cause any trouble somewhere, especially if
> there's only one other compound mapping and page_mapcount() will return 0
> instead of 1?

Good catch. Thanks.

What about this?

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0f1f5731a893..cd0e6addb662 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2636,15 +2636,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
                        for (i = 0; i < HPAGE_PMD_NR; i++)
                                atomic_dec(&page[i]._mapcount);
                }
-       } else if (!TestSetPageDoubleMap(page)) {
+       } else if (!PageDoubleMap(page)) {
                /*
                 * The first PMD split for the compound page and we still
                 * have other PMD mapping of the page: bump _mapcount in
                 * every small page.
+                *
                 * This reference will go away with last compound_mapcount.
+                *
+                * Note, we need to increment mapcounts before setting
+                * PG_double_map to avoid false-negative page_mapped().
                 */
                for (i = 0; i < HPAGE_PMD_NR; i++)
                        atomic_inc(&page[i]._mapcount);
+
+               if (TestSetPageDoubleMap(page)) {
+                       /* Race with another  __split_huge_pmd() for the page */
+                       for (i = 0; i < HPAGE_PMD_NR; i++)
+                               atomic_dec(&page[i]._mapcount);
+               }
        }
 
        smp_wmb(); /* make pte visible before pmd */

> Conversely, when clearing PageDoubleMap() above (or in one of those rmap
> functions IIRC), one could see mapcount inflated by one. But I guess that's
> less dangerous.

I think it's safe.

-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCHv6 32/36] thp: reintroduce split_huge_page()
  2015-06-10 15:44   ` Vlastimil Babka
@ 2015-06-22 11:28     ` Kirill A. Shutemov
  0 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-22 11:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Wed, Jun 10, 2015 at 05:44:30PM +0200, Vlastimil Babka wrote:
> On 06/03/2015 07:06 PM, Kirill A. Shutemov wrote:
> >+static int __split_huge_page_tail(struct page *head, int tail,
> >+		struct lruvec *lruvec, struct list_head *list)
> >+{
> >+	int mapcount;
> >+	struct page *page_tail = head + tail;
> >+
> >+	mapcount = page_mapcount(page_tail);
> 
> Isn't page_mapcount() unnecessarily heavyweight here? When you are splitting
> a page, it already should have zero compound_mapcount() and shouldn't be
> PageDoubleMap(), no? So you should care about page->_mapcount only? Sure,
> splitting THP is not a hotpath, but when done 512 times per split, it could
> make some difference in the split's latency.

Okay, replaced with direct atomic_read().
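
A minimal sketch of what that replacement might look like, given that a
page under split should have zero compound_mapcount() as noted above and
that ->_mapcount is stored offset by -1 (a sketch, not the actual respin):

	/* compound_mapcount is zero here, so the per-subpage _mapcount is enough */
	mapcount = atomic_read(&page_tail->_mapcount) + 1;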

> >+	VM_BUG_ON_PAGE(atomic_read(&page_tail->_count) != 0, page_tail);
> >+
> >+	/*
> >+	 * tail_page->_count is zero and not changing from under us. But
> >+	 * get_page_unless_zero() may be running from under us on the
> >+	 * tail_page. If we used atomic_set() below instead of atomic_add(), we
> >+	 * would then run atomic_set() concurrently with
> >+	 * get_page_unless_zero(), and atomic_set() is implemented in C not
> >+	 * using locked ops. spin_unlock on x86 sometime uses locked ops
> >+	 * because of PPro errata 66, 92, so unless somebody can guarantee
> >+	 * atomic_set() here would be safe on all archs (and not only on x86),
> >+	 * it's safer to use atomic_add().
> 
> I would be surprised if this was the first place to use atomic_set() with
> potential concurrent atomic_add(). Shouldn't atomic_*() API guarantee that
> this works?

I don't have much insight into the issue. This part is carried over from
the pre-rework split_huge_page().

> 
> >+	 */
> >+	atomic_add(page_mapcount(page_tail) + 1, &page_tail->_count);
> 
> You already have the value in mapcount variable, so why read it again.

Fixed.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCHv6 33/36] migrate_pages: try to split pages on qeueuing
  2015-06-11  9:27   ` Vlastimil Babka
@ 2015-06-22 11:35     ` Kirill A. Shutemov
  0 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-22 11:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Thu, Jun 11, 2015 at 11:27:19AM +0200, Vlastimil Babka wrote:
> On 06/03/2015 07:06 PM, Kirill A. Shutemov wrote:
> >We are not able to migrate THPs. It means it's not enough to split only
> >PMD on migration -- we need to split compound page under it too.
> >
> >Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >---
> >  mm/mempolicy.c | 37 +++++++++++++++++++++++++++++++++----
> >  1 file changed, 33 insertions(+), 4 deletions(-)
> >
> >diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> >index 528f6c467cf1..0b1499c2f890 100644
> >--- a/mm/mempolicy.c
> >+++ b/mm/mempolicy.c
> >@@ -489,14 +489,31 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
> >  	struct page *page;
> >  	struct queue_pages *qp = walk->private;
> >  	unsigned long flags = qp->flags;
> >-	int nid;
> >+	int nid, ret;
> >  	pte_t *pte;
> >  	spinlock_t *ptl;
> >
> >-	split_huge_pmd(vma, pmd, addr);
> >-	if (pmd_trans_unstable(pmd))
> >-		return 0;
> >+	if (pmd_trans_huge(*pmd)) {
> >+		ptl = pmd_lock(walk->mm, pmd);
> >+		if (pmd_trans_huge(*pmd)) {
> >+			page = pmd_page(*pmd);
> >+			if (is_huge_zero_page(page)) {
> >+				spin_unlock(ptl);
> >+				split_huge_pmd(vma, pmd, addr);
> >+			} else {
> >+				get_page(page);
> >+				spin_unlock(ptl);
> >+				lock_page(page);
> >+				ret = split_huge_page(page);
> >+				unlock_page(page);
> >+				put_page(page);
> >+				if (ret)
> >+					return 0;
> >+			}
> >+		}
> >+	}
> >
> >+retry:
> >  	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> >  	for (; addr != end; pte++, addr += PAGE_SIZE) {
> >  		if (!pte_present(*pte))
> >@@ -513,6 +530,18 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
> >  		nid = page_to_nid(page);
> >  		if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
> >  			continue;
> >+		if (PageTail(page) && PageAnon(page)) {
> 
> Hm, can it really happen that we stumble upon THP tail page here, without
> first stumbling upon it in the previous hunk above? If so, when?

The first hunk catches PMD-mapped THPs; here we deal with PTE-mapped ones.
The scenario: fault in a THP, split the PMD (but not the page), e.g. with
mprotect(), and then try to migrate.
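
A hedged userspace sketch of that scenario (the syscalls are real, but the
2M alignment handling is omitted and numaif.h/libnuma is assumed for the
mbind() wrapper):

#include <numaif.h>		/* mbind(); from the libnuma headers */
#include <string.h>
#include <sys/mman.h>

#define SZ_2M	(2UL << 20)

int main(void)
{
	/* 2M alignment of the mapping is assumed/omitted for brevity */
	char *p = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long nodemask = 1;	/* only node 0 set */

	madvise(p, SZ_2M, MADV_HUGEPAGE);
	memset(p, 1, SZ_2M);			/* fault in a THP */
	mprotect(p, 4096, PROT_READ);		/* split the PMD, not the page */
	/* queue_pages_pte_range() now finds PTE-mapped tail pages */
	mbind(p, SZ_2M, MPOL_BIND, &nodemask, 8 * sizeof(nodemask),
	      MPOL_MF_MOVE);
	return 0;
}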

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCHv6 36/36] thp: update documentation
  2015-06-11 12:30   ` Vlastimil Babka
@ 2015-06-22 13:18     ` Kirill A. Shutemov
  2015-06-22 16:07       ` Vlastimil Babka
  0 siblings, 1 reply; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-22 13:18 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Thu, Jun 11, 2015 at 02:30:15PM +0200, Vlastimil Babka wrote:
> On 06/03/2015 07:06 PM, Kirill A. Shutemov wrote:
> >The patch updates Documentation/vm/transhuge.txt to reflect changes in
> >THP design.
> 
> One thing I'm missing is info about the deferred splitting.

Okay, I'll add this.

> >Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >---
> >  Documentation/vm/transhuge.txt | 124 +++++++++++++++++++++++------------------
> >  1 file changed, 69 insertions(+), 55 deletions(-)
> >
> >diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> >index 6b31cfbe2a9a..2352b12cae93 100644
> >--- a/Documentation/vm/transhuge.txt
> >+++ b/Documentation/vm/transhuge.txt
> >@@ -35,10 +35,10 @@ miss is going to run faster.
> >
> >  == Design ==
> >
> >-- "graceful fallback": mm components which don't have transparent
> >-  hugepage knowledge fall back to breaking a transparent hugepage and
> >-  working on the regular pages and their respective regular pmd/pte
> >-  mappings
> >+- "graceful fallback": mm components which don't have transparent hugepage
> >+  knowledge fall back to breaking huge pmd mapping into table of ptes and,
> >+  if nesessary, split a transparent hugepage. Therefore these components
> 
>         necessary
> >+
> >+split_huge_page uses migration entries to stabilize page->_count and
> >+page->_mapcount.
> 
> Hm, what if there's some physical memory scanner taking page->_count pins? I
> think compaction shouldn't be an issue, but maybe some others?

The only legitimate way a scanner can get a reference to a page is
get_page_unless_zero(), right?

All tail pages have zero ->_count until the atomic_add() in
__split_huge_page_tail() -- get_page_unless_zero() will fail.
After the atomic_add() we don't care about the ->_count value:
we already know how many references we should uncharge from the
head page.

For the head page get_page_unless_zero() will succeed and we don't
mind. It's clear where the reference should go after the split: it
stays on the head page.
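
For reference, the helper the discussion is about is (roughly) just:

static inline int get_page_unless_zero(struct page *page)
{
	return atomic_inc_not_zero(&page->_count);
}

so as long as a tail page's ->_count is still zero, a scanner's
speculative reference attempt fails, as described above.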

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCHv6 00/36] THP refcounting redesign
  2015-06-16 13:17 ` [PATCHv6 00/36] THP refcounting redesign Jerome Marchand
@ 2015-06-22 13:21   ` Kirill A. Shutemov
  2015-06-22 13:32     ` Jerome Marchand
  0 siblings, 1 reply; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-22 13:21 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Sasha Levin, linux-kernel, linux-mm

On Tue, Jun 16, 2015 at 03:17:13PM +0200, Jerome Marchand wrote:
> On 06/03/2015 07:05 PM, Kirill A. Shutemov wrote:
> > Hello everybody,
> > 
> > Here's new revision of refcounting patchset. Please review and consider
> > applying.
> > 
> > The goal of patchset is to make refcounting on THP pages cheaper with
> > simpler semantics and allow the same THP compound page to be mapped with
> > PMD and PTEs. This is required to get reasonable THP-pagecache
> > implementation.
> > 
> > With the new refcounting design it's much easier to protect against
> > split_huge_page(): simple reference on a page will make you the deal.
> > It makes gup_fast() implementation simpler and doesn't require
> > special-case in futex code to handle tail THP pages.
> > 
> > It should improve THP utilization over the system since splitting THP in
> > one process doesn't necessary lead to splitting the page in all other
> > processes have the page mapped.
> > 
> > The patchset drastically lower complexity of get_page()/put_page()
> > codepaths. I encourage people look on this code before-and-after to
> > justify time budget on reviewing this patchset.
> > 
> > = Changelog =
> > 
> > v6:
> >   - rebase to since-4.0;
> >   - optimize mapcount handling: significantely reduce overhead for most
> >     common cases.
> >   - split pages on migrate_pages();
> >   - remove infrastructure for handling splitting PMDs on all architectures;
> >   - fix page_mapcount() for hugetlb pages;
> > 
> 
> Hi Kirill,
> 
> I ran some LTP mm tests and hugemmap tests trigger the following:
> 
> [  438.749457] page:ffffea0000df8000 count:2 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
> [  438.750089] flags: 0x3ffc0000004001(locked|head)
> [  438.750089] page dumped because: VM_BUG_ON_PAGE(page_mapped(page))

Did you run with the original or the updated version of patch 27/36?
In the original post of v6 there was a bug: page_mapped() always returned
true.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCHv6 00/36] THP refcounting redesign
  2015-06-22 13:21   ` Kirill A. Shutemov
@ 2015-06-22 13:32     ` Jerome Marchand
  2015-06-22 13:39       ` Kirill A. Shutemov
  0 siblings, 1 reply; 58+ messages in thread
From: Jerome Marchand @ 2015-06-22 13:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Sasha Levin, linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2128 bytes --]

On 06/22/2015 03:21 PM, Kirill A. Shutemov wrote:
> On Tue, Jun 16, 2015 at 03:17:13PM +0200, Jerome Marchand wrote:
>> On 06/03/2015 07:05 PM, Kirill A. Shutemov wrote:
>>> Hello everybody,
>>>
>>> Here's new revision of refcounting patchset. Please review and consider
>>> applying.
>>>
>>> The goal of patchset is to make refcounting on THP pages cheaper with
>>> simpler semantics and allow the same THP compound page to be mapped with
>>> PMD and PTEs. This is required to get reasonable THP-pagecache
>>> implementation.
>>>
>>> With the new refcounting design it's much easier to protect against
>>> split_huge_page(): simple reference on a page will make you the deal.
>>> It makes gup_fast() implementation simpler and doesn't require
>>> special-case in futex code to handle tail THP pages.
>>>
>>> It should improve THP utilization over the system since splitting THP in
>>> one process doesn't necessary lead to splitting the page in all other
>>> processes have the page mapped.
>>>
>>> The patchset drastically lower complexity of get_page()/put_page()
>>> codepaths. I encourage people look on this code before-and-after to
>>> justify time budget on reviewing this patchset.
>>>
>>> = Changelog =
>>>
>>> v6:
>>>   - rebase to since-4.0;
>>>   - optimize mapcount handling: significantely reduce overhead for most
>>>     common cases.
>>>   - split pages on migrate_pages();
>>>   - remove infrastructure for handling splitting PMDs on all architectures;
>>>   - fix page_mapcount() for hugetlb pages;
>>>
>>
>> Hi Kirill,
>>
>> I ran some LTP mm tests and hugemmap tests trigger the following:
>>
>> [  438.749457] page:ffffea0000df8000 count:2 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
>> [  438.750089] flags: 0x3ffc0000004001(locked|head)
>> [  438.750089] page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
> 
> Did you run with original or updated version of patch 27/36?
> In original post of v6 there was bug: page_mapped() always returned true.
> 

Indeed! I'll try again with the corrected patch.

Thanks,
Jerome




[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCHv6 00/36] THP refcounting redesign
  2015-06-22 13:32     ` Jerome Marchand
@ 2015-06-22 13:39       ` Kirill A. Shutemov
  0 siblings, 0 replies; 58+ messages in thread
From: Kirill A. Shutemov @ 2015-06-22 13:39 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Sasha Levin, linux-kernel, linux-mm

On Mon, Jun 22, 2015 at 03:32:54PM +0200, Jerome Marchand wrote:
> On 06/22/2015 03:21 PM, Kirill A. Shutemov wrote:
> > On Tue, Jun 16, 2015 at 03:17:13PM +0200, Jerome Marchand wrote:
> >> On 06/03/2015 07:05 PM, Kirill A. Shutemov wrote:
> >>> Hello everybody,
> >>>
> >>> Here's new revision of refcounting patchset. Please review and consider
> >>> applying.
> >>>
> >>> The goal of patchset is to make refcounting on THP pages cheaper with
> >>> simpler semantics and allow the same THP compound page to be mapped with
> >>> PMD and PTEs. This is required to get reasonable THP-pagecache
> >>> implementation.
> >>>
> >>> With the new refcounting design it's much easier to protect against
> >>> split_huge_page(): simple reference on a page will make you the deal.
> >>> It makes gup_fast() implementation simpler and doesn't require
> >>> special-case in futex code to handle tail THP pages.
> >>>
> >>> It should improve THP utilization over the system since splitting THP in
> >>> one process doesn't necessary lead to splitting the page in all other
> >>> processes have the page mapped.
> >>>
> >>> The patchset drastically lower complexity of get_page()/put_page()
> >>> codepaths. I encourage people look on this code before-and-after to
> >>> justify time budget on reviewing this patchset.
> >>>
> >>> = Changelog =
> >>>
> >>> v6:
> >>>   - rebase to since-4.0;
> >>>   - optimize mapcount handling: significantely reduce overhead for most
> >>>     common cases.
> >>>   - split pages on migrate_pages();
> >>>   - remove infrastructure for handling splitting PMDs on all architectures;
> >>>   - fix page_mapcount() for hugetlb pages;
> >>>
> >>
> >> Hi Kirill,
> >>
> >> I ran some LTP mm tests and hugemmap tests trigger the following:
> >>
> >> [  438.749457] page:ffffea0000df8000 count:2 mapcount:0 mapping:          (null) index:0x0 compound_mapcount: 0
> >> [  438.750089] flags: 0x3ffc0000004001(locked|head)
> >> [  438.750089] page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
> > 
> > Did you run with original or updated version of patch 27/36?
> > In original post of v6 there was bug: page_mapped() always returned true.
> > 
> 
> Indeed! I'll try again with the corrected patch.

I'm going to post an updated patchset soon. Probably tomorrow.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCHv6 29/36] thp: implement split_huge_pmd()
  2015-06-22 11:14     ` Kirill A. Shutemov
@ 2015-06-22 16:01       ` Vlastimil Babka
  0 siblings, 0 replies; 58+ messages in thread
From: Vlastimil Babka @ 2015-06-22 16:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On 06/22/2015 01:14 PM, Kirill A. Shutemov wrote:
> On Thu, Jun 11, 2015 at 11:49:48AM +0200, Vlastimil Babka wrote:
>> On 06/03/2015 07:06 PM, Kirill A. Shutemov wrote:
>>
>> The order of actions here means that between TestSetPageDoubleMap() and the
>> atomic incs, anyone calling page_mapcount() on one of the pages not
>> processed by the for loop yet, will see a value lower by 1 from what he
>> should see. I wonder if that can cause any trouble somewhere, especially if
>> there's only one other compound mapping and page_mapcount() will return 0
>> instead of 1?
>
> Good catch. Thanks.
>
> What about this?
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 0f1f5731a893..cd0e6addb662 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2636,15 +2636,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>                          for (i = 0; i < HPAGE_PMD_NR; i++)
>                                  atomic_dec(&page[i]._mapcount);
>                  }
> -       } else if (!TestSetPageDoubleMap(page)) {
> +       } else if (!PageDoubleMap(page)) {
>                  /*
>                   * The first PMD split for the compound page and we still
>                   * have other PMD mapping of the page: bump _mapcount in
>                   * every small page.
> +                *
>                   * This reference will go away with last compound_mapcount.
> +                *
> +                * Note, we need to increment mapcounts before setting
> +                * PG_double_map to avoid false-negative page_mapped().
>                   */
>                  for (i = 0; i < HPAGE_PMD_NR; i++)
>                          atomic_inc(&page[i]._mapcount);
> +
> +               if (TestSetPageDoubleMap(page)) {
> +                       /* Race with another  __split_huge_pmd() for the page */
> +                       for (i = 0; i < HPAGE_PMD_NR; i++)
> +                               atomic_dec(&page[i]._mapcount);
> +               }
>          }

Yeah that should work.

>          smp_wmb(); /* make pte visible before pmd */



>
>> Conversely, when clearing PageDoubleMap() above (or in one of those rmap
>> functions IIRC), one could see mapcount inflated by one. But I guess that's
>> less dangerous.
>
> I think it's safe.

OK.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCHv6 36/36] thp: update documentation
  2015-06-22 13:18     ` Kirill A. Shutemov
@ 2015-06-22 16:07       ` Vlastimil Babka
  0 siblings, 0 replies; 58+ messages in thread
From: Vlastimil Babka @ 2015-06-22 16:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On 06/22/2015 03:18 PM, Kirill A. Shutemov wrote:
> On Thu, Jun 11, 2015 at 02:30:15PM +0200, Vlastimil Babka wrote:
>> On 06/03/2015 07:06 PM, Kirill A. Shutemov wrote:
>>> The patch updates Documentation/vm/transhuge.txt to reflect changes in
>>> THP design.
>>
>> One thing I'm missing is info about the deferred splitting.
>
> Okay, I'll add this.

Thanks.

>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> ---
>>>   Documentation/vm/transhuge.txt | 124 +++++++++++++++++++++++------------------
>>>   1 file changed, 69 insertions(+), 55 deletions(-)
>>>
>>> diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
>>> index 6b31cfbe2a9a..2352b12cae93 100644
>>> --- a/Documentation/vm/transhuge.txt
>>> +++ b/Documentation/vm/transhuge.txt
>>> @@ -35,10 +35,10 @@ miss is going to run faster.
>>>
>>>   == Design ==
>>>
>>> -- "graceful fallback": mm components which don't have transparent
>>> -  hugepage knowledge fall back to breaking a transparent hugepage and
>>> -  working on the regular pages and their respective regular pmd/pte
>>> -  mappings
>>> +- "graceful fallback": mm components which don't have transparent hugepage
>>> +  knowledge fall back to breaking huge pmd mapping into table of ptes and,
>>> +  if nesessary, split a transparent hugepage. Therefore these components
>>
>>          necessary
>>> +
>>> +split_huge_page uses migration entries to stabilize page->_count and
>>> +page->_mapcount.
>>
>> Hm, what if there's some physical memory scanner taking page->_count pins? I
>> think compaction shouldn't be an issue, but maybe some others?
>
> The only legitimate way scanner can get reference to a page is
> get_page_unless_zero(), right?

I think so.

> All tail pages has zero ->_count until atomic_add() in
> __split_huge_page_tail() -- get_page_unless_zero() will fail.
> After the atomic_add() we don't care about ->_count value.
> We already known how many references with should uncharge from
> head page.
>
> For head page get_page_unless_zero() will succeed and we don't
> mind. It's clear where reference should go after split: it will
> stay on head page.

I guess that works, but it's rather non-obvious and thus worth documenting
somewhere.

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2015-06-22 16:07 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-03 17:05 [PATCHv6 00/36] THP refcounting redesign Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 01/36] mm, proc: adjust PSS calculation Kirill A. Shutemov
2015-06-09 12:29   ` Vlastimil Babka
2015-06-22 10:02     ` Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 02/36] rmap: add argument to charge compound page Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 03/36] memcg: adjust to support new THP refcounting Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 04/36] mm, thp: adjust conditions when we can reuse the page on WP fault Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 05/36] mm: adjust FOLL_SPLIT for new refcounting Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 06/36] mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 07/36] thp, mlock: do not allow huge pages in mlocked area Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 08/36] khugepaged: ignore pmd tables with THP mapped with ptes Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 09/36] thp: rename split_huge_page_pmd() to split_huge_pmd() Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 10/36] mm, vmstats: new THP splitting event Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 11/36] mm: temporally mark THP broken Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 12/36] thp: drop all split_huge_page()-related code Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 13/36] mm: drop tail page refcounting Kirill A. Shutemov
2015-06-09 13:59   ` Vlastimil Babka
2015-06-03 17:05 ` [PATCHv6 14/36] futex, thp: remove special case for THP in get_futex_key Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 15/36] ksm: prepare to new THP semantics Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 16/36] mm, thp: remove compound_lock Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 17/36] arm64, thp: remove infrastructure for handling splitting PMDs Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 18/36] arm, " Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 19/36] mips, " Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 20/36] powerpc, " Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 21/36] s390, " Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 22/36] sparc, " Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 23/36] tile, " Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 24/36] x86, " Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 25/36] mm, " Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 26/36] mm: rework mapcount accounting to enable 4k mapping of THPs Kirill A. Shutemov
2015-06-10 13:47   ` Vlastimil Babka
2015-06-22 10:22     ` Kirill A. Shutemov
2015-06-03 17:05 ` [PATCHv6 27/36] mm: differentiate page_mapped() from page_mapcount() for compound pages Kirill A. Shutemov
2015-06-09 10:58   ` Kirill A. Shutemov
2015-06-10 14:34   ` Vlastimil Babka
2015-06-03 17:05 ` [PATCHv6 28/36] mm, numa: skip PTE-mapped THP on numa fault Kirill A. Shutemov
2015-06-03 17:06 ` [PATCHv6 29/36] thp: implement split_huge_pmd() Kirill A. Shutemov
2015-06-11  9:49   ` Vlastimil Babka
2015-06-22 11:14     ` Kirill A. Shutemov
2015-06-22 16:01       ` Vlastimil Babka
2015-06-03 17:06 ` [PATCHv6 30/36] thp: add option to setup migration entiries during PMD split Kirill A. Shutemov
2015-06-03 17:06 ` [PATCHv6 31/36] thp, mm: split_huge_page(): caller need to lock page Kirill A. Shutemov
2015-06-03 17:06 ` [PATCHv6 32/36] thp: reintroduce split_huge_page() Kirill A. Shutemov
2015-06-10 15:44   ` Vlastimil Babka
2015-06-22 11:28     ` Kirill A. Shutemov
2015-06-03 17:06 ` [PATCHv6 33/36] migrate_pages: try to split pages on qeueuing Kirill A. Shutemov
2015-06-11  9:27   ` Vlastimil Babka
2015-06-22 11:35     ` Kirill A. Shutemov
2015-06-03 17:06 ` [PATCHv6 34/36] thp: introduce deferred_split_huge_page() Kirill A. Shutemov
2015-06-03 17:06 ` [PATCHv6 35/36] mm: re-enable THP Kirill A. Shutemov
2015-06-03 17:06 ` [PATCHv6 36/36] thp: update documentation Kirill A. Shutemov
2015-06-11 12:30   ` Vlastimil Babka
2015-06-22 13:18     ` Kirill A. Shutemov
2015-06-22 16:07       ` Vlastimil Babka
2015-06-16 13:17 ` [PATCHv6 00/36] THP refcounting redesign Jerome Marchand
2015-06-22 13:21   ` Kirill A. Shutemov
2015-06-22 13:32     ` Jerome Marchand
2015-06-22 13:39       ` Kirill A. Shutemov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).