* [PATCHv5 00/28] THP refcounting redesign
@ 2015-04-23 21:03 ` Kirill A. Shutemov
  0 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

Hello everybody,

Here's the reworked version of my patchset. All known issues have been addressed.

The goal of the patchset is to make refcounting on THP pages cheaper, with
simpler semantics, and to allow the same THP compound page to be mapped with
a PMD and with PTEs at the same time. This is required to get a reasonable
THP-pagecache implementation.

With the new refcounting design it's much easier to protect against
split_huge_page(): holding a simple reference on the page is enough. It
makes the gup_fast() implementation simpler and removes the need for a
special case in the futex code to handle tail THP pages.

It should also improve THP utilization across the system, since splitting a
THP in one process no longer forces the page to be split in all other
processes that have the page mapped.

The patchset drastically lowers the complexity of the get_page()/put_page()
codepaths. I encourage reviewers to compare this code before and after to
justify the time budget for reviewing this patchset.

= Changelog =

v5:
  - Tested-by: Sasha Levin!™
  - re-split the patchset in the hope of improving readability;
  - rebased on top of page flags and ->mapping sanitizing patchset;
  - uncharge compound_mapcount rather than mapcount for hugetlb pages
    during removal from rmap;
  - differentiate page_mapped() from page_mapcount() for compound pages;
  - rework deferred_split_huge_page() to use shrinker interface;
  - fix race in page_remove_rmap();
  - get rid of __get_page_tail();
  - few random bug fixes;
v4:
  - fix sizes reported in smaps;
  - defines instead of enum for RMAP_{EXCLUSIVE,COMPOUND};
  - skip THP pages on munlock_vma_pages_range(): they are never mlocked;
  - properly handle huge zero page on FOLL_SPLIT;
  - fix lock_page() slow path on tail pages;
  - account page_get_anon_vma() failure to THP_SPLIT_PAGE_FAILED;
  - fix split_huge_page() on huge page with unmapped head page;
  - fix transferring 'write' and 'young' from pmd to ptes on split_huge_pmd;
  - call page_remove_rmap() in unfreeze_page under ptl.

= Design overview =

The main reason we can't map a THP with 4k pages is how refcounting on THP
is designed. It is built around two requirements:

  - splitting a huge page should never fail;
  - we can't change the interface of get_user_page();

To be able to split a huge page at any point, we have to track which tail
page was pinned. This leads to tricky and expensive get_page() on tail pages
and also occupies tail_page->_mapcount.

Most split_huge_page*() users want the PMD to be split into a table of PTEs
and don't care whether the compound page is going to be split or not.

The plan is:

 - allow split_huge_page() to fail if the page is pinned. It's trivial to
   split a non-pinned page, and it doesn't require tail page refcounting, so
   tail_page->_mapcount is free to be reused.

 - introduce a new routine -- split_huge_pmd() -- to split a PMD into a
   table of PTEs. It splits only one PMD, leaving other PMDs the page is
   mapped with, and the underlying compound page, untouched. Unlike the new
   split_huge_page(), split_huge_pmd() never fails.

Fortunately, there are only a few places where split_huge_page() is needed:
swap out, memory failure, migration, and KSM. All of them can handle
split_huge_page() failure.
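
For illustration only, a caller that wants the compound page split would do
roughly the following (a sketch, not code from the patches; the page-lock
requirement comes from patch 24):

	int err;

	lock_page(page);
	err = split_huge_page(page);	/* non-zero if the page is pinned */
	unlock_page(page);
	if (err)
		return err;		/* caller falls back to the huge-page path */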

In the new scheme, page->_mapcount is used to account for how many times the
page is mapped with PTEs. We have a separate compound_mapcount() to count
mappings with a PMD. page_mapcount() returns the sum of the PTE and PMD
mappings of the page.
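
Roughly, the accounting can be modelled like this (an illustrative sketch
only; the real helpers are introduced in patches 19-20 and differ in
detail):

	static inline int page_mapcount(struct page *page)
	{
		/* times this 4k page is mapped with PTEs (counter starts at -1) */
		int ret = atomic_read(&page->_mapcount) + 1;

		/* plus times the whole compound page is mapped with a PMD */
		if (PageCompound(page))
			ret += compound_mapcount(compound_head(page));
		return ret;
	}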

Introducing split_huge_pmd() effectively allows a THP to be mapped with 4k
PTEs. Some code may be surprised to see a PTE that points to a tail page, or
a VMA start/end in the middle of a compound page.

munmap() of part of a THP splits the PMD, but doesn't split the huge page
itself. To keep memory consumption under control, we put partially unmapped
huge pages on a list; a shrinker splits them if memory pressure comes. This
way we also avoid an unnecessary split_huge_page() on exit(2) if a THP
belongs to more than one VMA.
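
Wiring the list into the shrinker interface could look roughly like the
sketch below (names, locking and the queue itself are illustrative; the real
code is in patch 26, "thp: introduce deferred_split_huge_page()"):

	static unsigned long deferred_split_count(struct shrinker *shrink,
						  struct shrink_control *sc)
	{
		/* split_queue_len: assumed length of the deferred-split list */
		return READ_ONCE(split_queue_len);
	}

	static unsigned long deferred_split_scan(struct shrinker *shrink,
						 struct shrink_control *sc)
	{
		unsigned long split = 0;

		/*
		 * Take THPs off the list, lock them, try split_huge_page()
		 * and count how many were actually split.
		 */
		return split;
	}

	/* registered with register_shrinker() at init time */
	static struct shrinker deferred_split_shrinker = {
		.count_objects	= deferred_split_count,
		.scan_objects	= deferred_split_scan,
		.seeks		= DEFAULT_SEEKS,
	};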

= Patches overview =

Patch 1:
        We need to look at all subpages of a compound page to calculate
        the correct PSS, because they can have different mapcounts.

Patch 2:
        With PTE-mapped THP, rmap cannot rely on the PageTransHuge() check
        to decide whether to map a small page or a THP. We need to get this
        information from the caller.

Patch 3:
        Make memcg aware of the new refcounting. Validation needed.

Patch 4:
        Adjust the conditions under which we can reuse the page on a
        write-protection fault.

Patch 5:
        FOLL_SPLIT should be handled at the PTE level too.

Patch 6:
        Make the generic fast GUP implementation aware of PTE-mapped huge
        pages.

Patch 7:
        Split all pages in an mlocked VMA. That should be good enough for
        now.

Patch 8:
        Make khugepaged aware of PTE-mapped huge pages.

Patch 9:
        Rename split_huge_page_pmd() to split_huge_pmd() to reflect that
        the page is not going to be split, only the PMD.

Patch 10:
        New THP_SPLIT_* vmstats.

Patch 11:
        Up to this point we tried to keep the patchset bisectable, but the
        next patches change how the core of THP refcounting works. It's
        easier to review the change if we disable THP temporarily and bring
        it back once everything is ready.

Patch 12:
        Remove all split_huge_page()-related code. This also removes the
        need for tail page refcounting.

Patch 13:
	Drop tail page refcounting. Diffstat is nice! :)

Patch 14:
        Remove the ugly special case for a futex that happens to be in a
        tail THP page. With the new refcounting it is much easier to protect
        against splitting.

Patch 15:
        Simplify the KSM code which handles THP.

Patch 16:
        No need for compound_lock anymore.

Patches 17-18:
        Drop the infrastructure for handling PMD splitting. We don't use it
        anymore in split_huge_page(). For now we only remove it from the
        generic code and x86. I'll clean up other architectures later.

Patch 19:
        Store the mapcount for compound pages separately: in the first tail
        page's ->mapping.

Patch 20:
        Define page_mapped() to be true for compound pages if any sub-page
        of the compound page is mapped (with a PMD or PTE).

Patch 21:
        Make NUMA balancing aware of PTE-mapped THP.

Patch 22:
	Implement new split_huge_pmd().

Patches 23-25:
	Implement new split_huge_page().

Patch 26:
        Handle partial unmap of a THP. We put the partially unmapped huge
        page on a list; pages from the list will be split by a shrinker if
        memory pressure comes. This way we also avoid an unnecessary
        split_huge_page() on exit(2) if a THP belongs to more than one VMA.

Patch 27:
	Everything is in place. Re-enable THP.

Patch 28:
        Documentation update.

The patchset is also available in git:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/refcounting/v5

Please review.

Kirill A. Shutemov (28):
  mm, proc: adjust PSS calculation
  rmap: add argument to charge compound page
  memcg: adjust to support new THP refcounting
  mm, thp: adjust conditions when we can reuse the page on WP fault
  mm: adjust FOLL_SPLIT for new refcounting
  mm: handle PTE-mapped tail pages in generic fast gup implementation
  thp, mlock: do not allow huge pages in mlocked area
  khugepaged: ignore pmd tables with THP mapped with ptes
  thp: rename split_huge_page_pmd() to split_huge_pmd()
  mm, vmstats: new THP splitting event
  mm: temporarily mark THP broken
  thp: drop all split_huge_page()-related code
  mm: drop tail page refcounting
  futex, thp: remove special case for THP in get_futex_key
  ksm: prepare to new THP semantics
  mm, thp: remove compound_lock
  mm, thp: remove infrastructure for handling splitting PMDs
  x86, thp: remove infrastructure for handling splitting PMDs
  mm: store mapcount for compound page separately
  mm: differentiate page_mapped() from page_mapcount() for compound
    pages
  mm, numa: skip PTE-mapped THP on numa fault
  thp: implement split_huge_pmd()
  thp: add option to set up migration entries during PMD split
  thp, mm: split_huge_page(): caller needs to lock page
  thp: reintroduce split_huge_page()
  thp: introduce deferred_split_huge_page()
  mm: re-enable THP
  thp: update documentation

 Documentation/vm/transhuge.txt       |  100 ++--
 arch/arc/mm/cache_arc700.c           |    4 +-
 arch/arm/mm/flush.c                  |    2 +-
 arch/mips/mm/c-r4k.c                 |    3 +-
 arch/mips/mm/cache.c                 |    2 +-
 arch/mips/mm/gup.c                   |    4 -
 arch/mips/mm/init.c                  |    6 +-
 arch/powerpc/mm/hugetlbpage.c        |   13 +-
 arch/powerpc/mm/subpage-prot.c       |    2 +-
 arch/s390/mm/gup.c                   |   13 +-
 arch/sh/mm/cache-sh4.c               |    2 +-
 arch/sh/mm/cache.c                   |    8 +-
 arch/sparc/mm/gup.c                  |   14 +-
 arch/x86/include/asm/pgtable.h       |    9 -
 arch/x86/include/asm/pgtable_types.h |    2 -
 arch/x86/kernel/vm86_32.c            |    6 +-
 arch/x86/mm/gup.c                    |   17 +-
 arch/x86/mm/pgtable.c                |   14 -
 arch/xtensa/mm/tlb.c                 |    2 +-
 fs/proc/page.c                       |    4 +-
 fs/proc/task_mmu.c                   |   51 +-
 include/asm-generic/pgtable.h        |    5 -
 include/linux/huge_mm.h              |   41 +-
 include/linux/memcontrol.h           |   16 +-
 include/linux/mm.h                   |  106 ++--
 include/linux/mm_types.h             |   18 +-
 include/linux/page-flags.h           |   12 +-
 include/linux/pagemap.h              |    9 +-
 include/linux/rmap.h                 |   16 +-
 include/linux/swap.h                 |    3 +-
 include/linux/vm_event_item.h        |    4 +-
 kernel/events/uprobes.c              |   11 +-
 kernel/futex.c                       |   61 +-
 mm/debug.c                           |    8 +-
 mm/filemap.c                         |   10 +-
 mm/filemap_xip.c                     |    2 +-
 mm/gup.c                             |  106 ++--
 mm/huge_memory.c                     | 1076 +++++++++++++++++++---------------
 mm/hugetlb.c                         |   10 +-
 mm/internal.h                        |   70 +--
 mm/ksm.c                             |   61 +-
 mm/madvise.c                         |    2 +-
 mm/memcontrol.c                      |   76 +--
 mm/memory-failure.c                  |   12 +-
 mm/memory.c                          |   71 +--
 mm/mempolicy.c                       |    2 +-
 mm/migrate.c                         |   19 +-
 mm/mincore.c                         |    2 +-
 mm/mlock.c                           |   51 +-
 mm/mprotect.c                        |    2 +-
 mm/mremap.c                          |    2 +-
 mm/page_alloc.c                      |   16 +-
 mm/pagewalk.c                        |    2 +-
 mm/pgtable-generic.c                 |   14 -
 mm/rmap.c                            |  144 +++--
 mm/shmem.c                           |   21 +-
 mm/swap.c                            |  274 +--------
 mm/swapfile.c                        |   16 +-
 mm/vmstat.c                          |    4 +-
 59 files changed, 1144 insertions(+), 1509 deletions(-)

-- 
2.1.4


^ permalink raw reply	[flat|nested] 189+ messages in thread

* [PATCHv5 01/28] mm, proc: adjust PSS calculation
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting, the subpages of a compound page do not necessarily
all have the same mapcount. We need to take the mapcount of every sub-page
into account.
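
For example (illustrative numbers): if 256 subpages of a 2MB THP have
mapcount 2 and the other 256 have mapcount 1, this mapping's PSS is
256 * 4KB / 2 + 256 * 4KB = 1.5MB -- a value no single per-THP mapcount
could produce.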

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 fs/proc/task_mmu.c | 43 ++++++++++++++++++++++---------------------
 1 file changed, 22 insertions(+), 21 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 956b75d61809..95bc384ee3f7 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -449,9 +449,10 @@ struct mem_size_stats {
 };
 
 static void smaps_account(struct mem_size_stats *mss, struct page *page,
-		unsigned long size, bool young, bool dirty)
+		bool compound, bool young, bool dirty)
 {
-	int mapcount;
+	int i, nr = compound ? hpage_nr_pages(page) : 1;
+	unsigned long size = nr * PAGE_SIZE;
 
 	if (PageAnon(page))
 		mss->anonymous += size;
@@ -460,23 +461,23 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 	/* Accumulate the size in pages that have been accessed. */
 	if (young || PageReferenced(page))
 		mss->referenced += size;
-	mapcount = page_mapcount(page);
-	if (mapcount >= 2) {
-		u64 pss_delta;
 
-		if (dirty || PageDirty(page))
-			mss->shared_dirty += size;
-		else
-			mss->shared_clean += size;
-		pss_delta = (u64)size << PSS_SHIFT;
-		do_div(pss_delta, mapcount);
-		mss->pss += pss_delta;
-	} else {
-		if (dirty || PageDirty(page))
-			mss->private_dirty += size;
-		else
-			mss->private_clean += size;
-		mss->pss += (u64)size << PSS_SHIFT;
+	for (i = 0; i < nr; i++) {
+		int mapcount = page_mapcount(page + i);
+
+		if (mapcount >= 2) {
+			if (dirty || PageDirty(page + i))
+				mss->shared_dirty += PAGE_SIZE;
+			else
+				mss->shared_clean += PAGE_SIZE;
+			mss->pss += (PAGE_SIZE << PSS_SHIFT) / mapcount;
+		} else {
+			if (dirty || PageDirty(page + i))
+				mss->private_dirty += PAGE_SIZE;
+			else
+				mss->private_clean += PAGE_SIZE;
+			mss->pss += PAGE_SIZE << PSS_SHIFT;
+		}
 	}
 }
 
@@ -500,7 +501,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 
 	if (!page)
 		return;
-	smaps_account(mss, page, PAGE_SIZE, pte_young(*pte), pte_dirty(*pte));
+
+	smaps_account(mss, page, false, pte_young(*pte), pte_dirty(*pte));
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -516,8 +518,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 	if (IS_ERR_OR_NULL(page))
 		return;
 	mss->anonymous_thp += HPAGE_PMD_SIZE;
-	smaps_account(mss, page, HPAGE_PMD_SIZE,
-			pmd_young(*pmd), pmd_dirty(*pmd));
+	smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd));
 }
 #else
 static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 02/28] rmap: add argument to charge compound page
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We're going to allow mapping of individual 4k pages of a THP compound
page. This means we cannot rely on the PageTransHuge() check to decide
whether to map/unmap a small page or a THP.

The patch adds a new argument to the rmap functions to indicate whether we
want to operate on the whole compound page or only the small page.
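
For example, as in the hunks below, PMD-level call sites now pass true and
PTE-level call sites pass false:

	page_add_new_anon_rmap(page, vma, haddr, true);	/* whole THP via PMD */
	page_remove_rmap(page, false);			/* a single 4k page */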

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 include/linux/rmap.h    | 12 +++++++++---
 kernel/events/uprobes.c |  4 ++--
 mm/filemap_xip.c        |  2 +-
 mm/huge_memory.c        | 16 ++++++++--------
 mm/hugetlb.c            |  4 ++--
 mm/ksm.c                |  4 ++--
 mm/memory.c             | 14 +++++++-------
 mm/migrate.c            |  8 ++++----
 mm/rmap.c               | 43 +++++++++++++++++++++++++++----------------
 mm/swapfile.c           |  4 ++--
 10 files changed, 64 insertions(+), 47 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d3630fa3a17b..e7ecba43ae71 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -159,16 +159,22 @@ static inline void anon_vma_merge(struct vm_area_struct *vma,
 
 struct anon_vma *page_get_anon_vma(struct page *page);
 
+/* bitflags for do_page_add_anon_rmap() */
+#define RMAP_EXCLUSIVE 0x01
+#define RMAP_COMPOUND 0x02
+
 /*
  * rmap interfaces called when adding or removing pte of page
  */
 void page_move_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
-void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
+void page_add_anon_rmap(struct page *, struct vm_area_struct *,
+		unsigned long, bool);
 void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
 			   unsigned long, int);
-void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
+void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
+		unsigned long, bool);
 void page_add_file_rmap(struct page *);
-void page_remove_rmap(struct page *);
+void page_remove_rmap(struct page *, bool);
 
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 			    unsigned long);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index cb346f26a22d..5523daf59953 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -183,7 +183,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 		goto unlock;
 
 	get_page(kpage);
-	page_add_new_anon_rmap(kpage, vma, addr);
+	page_add_new_anon_rmap(kpage, vma, addr, false);
 	mem_cgroup_commit_charge(kpage, memcg, false);
 	lru_cache_add_active_or_unevictable(kpage, vma);
 
@@ -196,7 +196,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	ptep_clear_flush_notify(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
 
-	page_remove_rmap(page);
+	page_remove_rmap(page, false);
 	if (!page_mapped(page))
 		try_to_free_swap(page);
 	pte_unmap_unlock(ptep, ptl);
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index c175f9f25210..791d9043a983 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -189,7 +189,7 @@ retry:
 			/* Nuke the page table entry. */
 			flush_cache_page(vma, address, pte_pfn(*pte));
 			pteval = ptep_clear_flush(vma, address, pte);
-			page_remove_rmap(page);
+			page_remove_rmap(page, false);
 			dec_mm_counter(mm, MM_FILEPAGES);
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5a137c3a7f2f..b40fc0ff9315 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -752,7 +752,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		pmd_t entry;
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
-		page_add_new_anon_rmap(page, vma, haddr);
+		page_add_new_anon_rmap(page, vma, haddr, true);
 		mem_cgroup_commit_charge(page, memcg, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(mm, pmd, pgtable);
@@ -1043,7 +1043,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
-		page_add_new_anon_rmap(pages[i], vma, haddr);
+		page_add_new_anon_rmap(pages[i], vma, haddr, false);
 		mem_cgroup_commit_charge(pages[i], memcg, false);
 		lru_cache_add_active_or_unevictable(pages[i], vma);
 		pte = pte_offset_map(&_pmd, haddr);
@@ -1055,7 +1055,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
-	page_remove_rmap(page);
+	page_remove_rmap(page, true);
 	spin_unlock(ptl);
 
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
@@ -1175,7 +1175,7 @@ alloc:
 		entry = mk_huge_pmd(new_page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		pmdp_clear_flush_notify(vma, haddr, pmd);
-		page_add_new_anon_rmap(new_page, vma, haddr);
+		page_add_new_anon_rmap(new_page, vma, haddr, true);
 		mem_cgroup_commit_charge(new_page, memcg, false);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		set_pmd_at(mm, haddr, pmd, entry);
@@ -1185,7 +1185,7 @@ alloc:
 			put_huge_zero_page();
 		} else {
 			VM_BUG_ON_PAGE(!PageHead(page), page);
-			page_remove_rmap(page);
+			page_remove_rmap(page, true);
 			put_page(page);
 		}
 		ret |= VM_FAULT_WRITE;
@@ -1440,7 +1440,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			put_huge_zero_page();
 		} else {
 			page = pmd_page(orig_pmd);
-			page_remove_rmap(page);
+			page_remove_rmap(page, true);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 			VM_BUG_ON_PAGE(!PageHead(page), page);
@@ -2285,7 +2285,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			 * superfluous.
 			 */
 			pte_clear(vma->vm_mm, address, _pte);
-			page_remove_rmap(src_page);
+			page_remove_rmap(src_page, false);
 			spin_unlock(ptl);
 			free_page_and_swap_cache(src_page);
 		}
@@ -2580,7 +2580,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
-	page_add_new_anon_rmap(new_page, vma, address);
+	page_add_new_anon_rmap(new_page, vma, address, true);
 	mem_cgroup_commit_charge(new_page, memcg, false);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e8c92ae35b4b..eb2a0430535e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2797,7 +2797,7 @@ again:
 		if (huge_pte_dirty(pte))
 			set_page_dirty(page);
 
-		page_remove_rmap(page);
+		page_remove_rmap(page, true);
 		force_flush = !__tlb_remove_page(tlb, page);
 		if (force_flush) {
 			address += sz;
@@ -3018,7 +3018,7 @@ retry_avoidcopy:
 		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
 		set_huge_pte_at(mm, address, ptep,
 				make_huge_pte(vma, new_page, 1));
-		page_remove_rmap(old_page);
+		page_remove_rmap(old_page, true);
 		hugepage_add_new_anon_rmap(new_page, vma, address);
 		/* Make the old page be freed below */
 		new_page = old_page;
diff --git a/mm/ksm.c b/mm/ksm.c
index bc7be0ee2080..fe09f3ddc912 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -957,13 +957,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	}
 
 	get_page(kpage);
-	page_add_anon_rmap(kpage, vma, addr);
+	page_add_anon_rmap(kpage, vma, addr, false);
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush_notify(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
 
-	page_remove_rmap(page);
+	page_remove_rmap(page, false);
 	if (!page_mapped(page))
 		try_to_free_swap(page);
 	put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index f150f7ed4e84..d6171752ea59 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1122,7 +1122,7 @@ again:
 					mark_page_accessed(page);
 				rss[MM_FILEPAGES]--;
 			}
-			page_remove_rmap(page);
+			page_remove_rmap(page, false);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(!__tlb_remove_page(tlb, page))) {
@@ -2108,7 +2108,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * thread doing COW.
 		 */
 		ptep_clear_flush_notify(vma, address, page_table);
-		page_add_new_anon_rmap(new_page, vma, address);
+		page_add_new_anon_rmap(new_page, vma, address, false);
 		mem_cgroup_commit_charge(new_page, memcg, false);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		/*
@@ -2141,7 +2141,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 			 * mapcount is visible. So transitively, TLBs to
 			 * old page will be flushed before it can be reused.
 			 */
-			page_remove_rmap(old_page);
+			page_remove_rmap(old_page, false);
 		}
 
 		/* Free the old page.. */
@@ -2556,7 +2556,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
-		exclusive = 1;
+		exclusive = RMAP_EXCLUSIVE;
 	}
 	flush_icache_page(vma, page);
 	if (pte_swp_soft_dirty(orig_pte))
@@ -2566,7 +2566,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		do_page_add_anon_rmap(page, vma, address, exclusive);
 		mem_cgroup_commit_charge(page, memcg, true);
 	} else { /* ksm created a completely new copy */
-		page_add_new_anon_rmap(page, vma, address);
+		page_add_new_anon_rmap(page, vma, address, false);
 		mem_cgroup_commit_charge(page, memcg, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	}
@@ -2704,7 +2704,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto release;
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, vma, address);
+	page_add_new_anon_rmap(page, vma, address, false);
 	mem_cgroup_commit_charge(page, memcg, false);
 	lru_cache_add_active_or_unevictable(page, vma);
 setpte:
@@ -2787,7 +2787,7 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 	if (anon) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-		page_add_new_anon_rmap(page, vma, address);
+		page_add_new_anon_rmap(page, vma, address, false);
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, MM_FILEPAGES);
 		page_add_file_rmap(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index 022adc253cd4..9a380238a4d0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -166,7 +166,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		else
 			page_dup_rmap(new);
 	} else if (PageAnon(new))
-		page_add_anon_rmap(new, vma, addr);
+		page_add_anon_rmap(new, vma, addr, false);
 	else
 		page_add_file_rmap(new);
 
@@ -1795,7 +1795,7 @@ fail_putback:
 	 * guarantee the copy is visible before the pagetable update.
 	 */
 	flush_cache_range(vma, mmun_start, mmun_end);
-	page_add_anon_rmap(new_page, vma, mmun_start);
+	page_add_anon_rmap(new_page, vma, mmun_start, true);
 	pmdp_clear_flush_notify(vma, mmun_start, pmd);
 	set_pmd_at(mm, mmun_start, pmd, entry);
 	flush_tlb_range(vma, mmun_start, mmun_end);
@@ -1806,13 +1806,13 @@ fail_putback:
 		flush_tlb_range(vma, mmun_start, mmun_end);
 		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
 		update_mmu_cache_pmd(vma, address, &entry);
-		page_remove_rmap(new_page);
+		page_remove_rmap(new_page, true);
 		goto fail_putback;
 	}
 
 	mem_cgroup_migrate(page, new_page, false);
 
-	page_remove_rmap(page);
+	page_remove_rmap(page, true);
 
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
diff --git a/mm/rmap.c b/mm/rmap.c
index dad23a43e42c..4ca4b5cffd95 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1048,9 +1048,9 @@ static void __page_check_anon_rmap(struct page *page,
  * (but PageKsm is never downgraded to PageAnon).
  */
 void page_add_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address)
+	struct vm_area_struct *vma, unsigned long address, bool compound)
 {
-	do_page_add_anon_rmap(page, vma, address, 0);
+	do_page_add_anon_rmap(page, vma, address, compound ? RMAP_COMPOUND : 0);
 }
 
 /*
@@ -1059,21 +1059,24 @@ void page_add_anon_rmap(struct page *page,
  * Everybody else should continue to use page_add_anon_rmap above.
  */
 void do_page_add_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address, int exclusive)
+	struct vm_area_struct *vma, unsigned long address, int flags)
 {
 	int first = atomic_inc_and_test(&page->_mapcount);
 	if (first) {
+		bool compound = flags & RMAP_COMPOUND;
+		int nr = compound ? hpage_nr_pages(page) : 1;
 		/*
 		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
 		 * these counters are not modified in interrupt context, and
 		 * pte lock(a spinlock) is held, which implies preemption
 		 * disabled.
 		 */
-		if (PageTransHuge(page))
+		if (compound) {
+			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 			__inc_zone_page_state(page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
-		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
-				hpage_nr_pages(page));
+		}
+		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
 	}
 	if (unlikely(PageKsm(page)))
 		return;
@@ -1081,7 +1084,8 @@ void do_page_add_anon_rmap(struct page *page,
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	/* address might be in next vma when migration races vma_adjust */
 	if (first)
-		__page_set_anon_rmap(page, vma, address, exclusive);
+		__page_set_anon_rmap(page, vma, address,
+				flags & RMAP_EXCLUSIVE);
 	else
 		__page_check_anon_rmap(page, vma, address);
 }
@@ -1097,15 +1101,18 @@ void do_page_add_anon_rmap(struct page *page,
  * Page does not have to be locked.
  */
 void page_add_new_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address)
+	struct vm_area_struct *vma, unsigned long address, bool compound)
 {
+	int nr = compound ? hpage_nr_pages(page) : 1;
+
 	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
 	SetPageSwapBacked(page);
 	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
-	if (PageTransHuge(page))
+	if (compound) {
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
-	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
-			hpage_nr_pages(page));
+	}
+	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
 	__page_set_anon_rmap(page, vma, address, 1);
 }
 
@@ -1161,9 +1168,12 @@ out:
  *
  * The caller needs to hold the pte lock.
  */
-void page_remove_rmap(struct page *page)
+void page_remove_rmap(struct page *page, bool compound)
 {
+	int nr = compound ? hpage_nr_pages(page) : 1;
+
 	if (!PageAnon(page)) {
+		VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
 		page_remove_file_rmap(page);
 		return;
 	}
@@ -1181,11 +1191,12 @@ void page_remove_rmap(struct page *page)
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	if (PageTransHuge(page))
+	if (compound) {
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+	}
 
-	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
-			      -hpage_nr_pages(page));
+	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
 
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
@@ -1327,7 +1338,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		dec_mm_counter(mm, MM_FILEPAGES);
 
 discard:
-	page_remove_rmap(page);
+	page_remove_rmap(page, false);
 	page_cache_release(page);
 
 out_unmap:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a7e72103f23b..65825c2687f5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1121,10 +1121,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	set_pte_at(vma->vm_mm, addr, pte,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	if (page == swapcache) {
-		page_add_anon_rmap(page, vma, addr);
+		page_add_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, true);
 	} else { /* ksm created a completely new copy */
-		page_add_new_anon_rmap(page, vma, addr);
+		page_add_new_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	}
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 02/28] rmap: add argument to charge compound page
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  0 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We're going to allow mapping of individual 4k pages of a THP compound
page. This means we cannot rely on the PageTransHuge() check to decide
whether to map/unmap a small page or a THP.

The patch adds a new argument to the rmap functions to indicate whether we
want to operate on the whole compound page or only the small page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 include/linux/rmap.h    | 12 +++++++++---
 kernel/events/uprobes.c |  4 ++--
 mm/filemap_xip.c        |  2 +-
 mm/huge_memory.c        | 16 ++++++++--------
 mm/hugetlb.c            |  4 ++--
 mm/ksm.c                |  4 ++--
 mm/memory.c             | 14 +++++++-------
 mm/migrate.c            |  8 ++++----
 mm/rmap.c               | 43 +++++++++++++++++++++++++++----------------
 mm/swapfile.c           |  4 ++--
 10 files changed, 64 insertions(+), 47 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d3630fa3a17b..e7ecba43ae71 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -159,16 +159,22 @@ static inline void anon_vma_merge(struct vm_area_struct *vma,
 
 struct anon_vma *page_get_anon_vma(struct page *page);
 
+/* bitflags for do_page_add_anon_rmap() */
+#define RMAP_EXCLUSIVE 0x01
+#define RMAP_COMPOUND 0x02
+
 /*
  * rmap interfaces called when adding or removing pte of page
  */
 void page_move_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
-void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
+void page_add_anon_rmap(struct page *, struct vm_area_struct *,
+		unsigned long, bool);
 void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
 			   unsigned long, int);
-void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
+void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
+		unsigned long, bool);
 void page_add_file_rmap(struct page *);
-void page_remove_rmap(struct page *);
+void page_remove_rmap(struct page *, bool);
 
 void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 			    unsigned long);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index cb346f26a22d..5523daf59953 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -183,7 +183,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 		goto unlock;
 
 	get_page(kpage);
-	page_add_new_anon_rmap(kpage, vma, addr);
+	page_add_new_anon_rmap(kpage, vma, addr, false);
 	mem_cgroup_commit_charge(kpage, memcg, false);
 	lru_cache_add_active_or_unevictable(kpage, vma);
 
@@ -196,7 +196,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	ptep_clear_flush_notify(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
 
-	page_remove_rmap(page);
+	page_remove_rmap(page, false);
 	if (!page_mapped(page))
 		try_to_free_swap(page);
 	pte_unmap_unlock(ptep, ptl);
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index c175f9f25210..791d9043a983 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -189,7 +189,7 @@ retry:
 			/* Nuke the page table entry. */
 			flush_cache_page(vma, address, pte_pfn(*pte));
 			pteval = ptep_clear_flush(vma, address, pte);
-			page_remove_rmap(page);
+			page_remove_rmap(page, false);
 			dec_mm_counter(mm, MM_FILEPAGES);
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5a137c3a7f2f..b40fc0ff9315 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -752,7 +752,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		pmd_t entry;
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
-		page_add_new_anon_rmap(page, vma, haddr);
+		page_add_new_anon_rmap(page, vma, haddr, true);
 		mem_cgroup_commit_charge(page, memcg, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(mm, pmd, pgtable);
@@ -1043,7 +1043,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
-		page_add_new_anon_rmap(pages[i], vma, haddr);
+		page_add_new_anon_rmap(pages[i], vma, haddr, false);
 		mem_cgroup_commit_charge(pages[i], memcg, false);
 		lru_cache_add_active_or_unevictable(pages[i], vma);
 		pte = pte_offset_map(&_pmd, haddr);
@@ -1055,7 +1055,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
-	page_remove_rmap(page);
+	page_remove_rmap(page, true);
 	spin_unlock(ptl);
 
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
@@ -1175,7 +1175,7 @@ alloc:
 		entry = mk_huge_pmd(new_page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		pmdp_clear_flush_notify(vma, haddr, pmd);
-		page_add_new_anon_rmap(new_page, vma, haddr);
+		page_add_new_anon_rmap(new_page, vma, haddr, true);
 		mem_cgroup_commit_charge(new_page, memcg, false);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		set_pmd_at(mm, haddr, pmd, entry);
@@ -1185,7 +1185,7 @@ alloc:
 			put_huge_zero_page();
 		} else {
 			VM_BUG_ON_PAGE(!PageHead(page), page);
-			page_remove_rmap(page);
+			page_remove_rmap(page, true);
 			put_page(page);
 		}
 		ret |= VM_FAULT_WRITE;
@@ -1440,7 +1440,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			put_huge_zero_page();
 		} else {
 			page = pmd_page(orig_pmd);
-			page_remove_rmap(page);
+			page_remove_rmap(page, true);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 			VM_BUG_ON_PAGE(!PageHead(page), page);
@@ -2285,7 +2285,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 			 * superfluous.
 			 */
 			pte_clear(vma->vm_mm, address, _pte);
-			page_remove_rmap(src_page);
+			page_remove_rmap(src_page, false);
 			spin_unlock(ptl);
 			free_page_and_swap_cache(src_page);
 		}
@@ -2580,7 +2580,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
-	page_add_new_anon_rmap(new_page, vma, address);
+	page_add_new_anon_rmap(new_page, vma, address, true);
 	mem_cgroup_commit_charge(new_page, memcg, false);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e8c92ae35b4b..eb2a0430535e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2797,7 +2797,7 @@ again:
 		if (huge_pte_dirty(pte))
 			set_page_dirty(page);
 
-		page_remove_rmap(page);
+		page_remove_rmap(page, true);
 		force_flush = !__tlb_remove_page(tlb, page);
 		if (force_flush) {
 			address += sz;
@@ -3018,7 +3018,7 @@ retry_avoidcopy:
 		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
 		set_huge_pte_at(mm, address, ptep,
 				make_huge_pte(vma, new_page, 1));
-		page_remove_rmap(old_page);
+		page_remove_rmap(old_page, true);
 		hugepage_add_new_anon_rmap(new_page, vma, address);
 		/* Make the old page be freed below */
 		new_page = old_page;
diff --git a/mm/ksm.c b/mm/ksm.c
index bc7be0ee2080..fe09f3ddc912 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -957,13 +957,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	}
 
 	get_page(kpage);
-	page_add_anon_rmap(kpage, vma, addr);
+	page_add_anon_rmap(kpage, vma, addr, false);
 
 	flush_cache_page(vma, addr, pte_pfn(*ptep));
 	ptep_clear_flush_notify(vma, addr, ptep);
 	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
 
-	page_remove_rmap(page);
+	page_remove_rmap(page, false);
 	if (!page_mapped(page))
 		try_to_free_swap(page);
 	put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index f150f7ed4e84..d6171752ea59 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1122,7 +1122,7 @@ again:
 					mark_page_accessed(page);
 				rss[MM_FILEPAGES]--;
 			}
-			page_remove_rmap(page);
+			page_remove_rmap(page, false);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(!__tlb_remove_page(tlb, page))) {
@@ -2108,7 +2108,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * thread doing COW.
 		 */
 		ptep_clear_flush_notify(vma, address, page_table);
-		page_add_new_anon_rmap(new_page, vma, address);
+		page_add_new_anon_rmap(new_page, vma, address, false);
 		mem_cgroup_commit_charge(new_page, memcg, false);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		/*
@@ -2141,7 +2141,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 			 * mapcount is visible. So transitively, TLBs to
 			 * old page will be flushed before it can be reused.
 			 */
-			page_remove_rmap(old_page);
+			page_remove_rmap(old_page, false);
 		}
 
 		/* Free the old page.. */
@@ -2556,7 +2556,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
-		exclusive = 1;
+		exclusive = RMAP_EXCLUSIVE;
 	}
 	flush_icache_page(vma, page);
 	if (pte_swp_soft_dirty(orig_pte))
@@ -2566,7 +2566,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		do_page_add_anon_rmap(page, vma, address, exclusive);
 		mem_cgroup_commit_charge(page, memcg, true);
 	} else { /* ksm created a completely new copy */
-		page_add_new_anon_rmap(page, vma, address);
+		page_add_new_anon_rmap(page, vma, address, false);
 		mem_cgroup_commit_charge(page, memcg, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	}
@@ -2704,7 +2704,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto release;
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
-	page_add_new_anon_rmap(page, vma, address);
+	page_add_new_anon_rmap(page, vma, address, false);
 	mem_cgroup_commit_charge(page, memcg, false);
 	lru_cache_add_active_or_unevictable(page, vma);
 setpte:
@@ -2787,7 +2787,7 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 	if (anon) {
 		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-		page_add_new_anon_rmap(page, vma, address);
+		page_add_new_anon_rmap(page, vma, address, false);
 	} else {
 		inc_mm_counter_fast(vma->vm_mm, MM_FILEPAGES);
 		page_add_file_rmap(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index 022adc253cd4..9a380238a4d0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -166,7 +166,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		else
 			page_dup_rmap(new);
 	} else if (PageAnon(new))
-		page_add_anon_rmap(new, vma, addr);
+		page_add_anon_rmap(new, vma, addr, false);
 	else
 		page_add_file_rmap(new);
 
@@ -1795,7 +1795,7 @@ fail_putback:
 	 * guarantee the copy is visible before the pagetable update.
 	 */
 	flush_cache_range(vma, mmun_start, mmun_end);
-	page_add_anon_rmap(new_page, vma, mmun_start);
+	page_add_anon_rmap(new_page, vma, mmun_start, true);
 	pmdp_clear_flush_notify(vma, mmun_start, pmd);
 	set_pmd_at(mm, mmun_start, pmd, entry);
 	flush_tlb_range(vma, mmun_start, mmun_end);
@@ -1806,13 +1806,13 @@ fail_putback:
 		flush_tlb_range(vma, mmun_start, mmun_end);
 		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
 		update_mmu_cache_pmd(vma, address, &entry);
-		page_remove_rmap(new_page);
+		page_remove_rmap(new_page, true);
 		goto fail_putback;
 	}
 
 	mem_cgroup_migrate(page, new_page, false);
 
-	page_remove_rmap(page);
+	page_remove_rmap(page, true);
 
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
diff --git a/mm/rmap.c b/mm/rmap.c
index dad23a43e42c..4ca4b5cffd95 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1048,9 +1048,9 @@ static void __page_check_anon_rmap(struct page *page,
  * (but PageKsm is never downgraded to PageAnon).
  */
 void page_add_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address)
+	struct vm_area_struct *vma, unsigned long address, bool compound)
 {
-	do_page_add_anon_rmap(page, vma, address, 0);
+	do_page_add_anon_rmap(page, vma, address, compound ? RMAP_COMPOUND : 0);
 }
 
 /*
@@ -1059,21 +1059,24 @@ void page_add_anon_rmap(struct page *page,
  * Everybody else should continue to use page_add_anon_rmap above.
  */
 void do_page_add_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address, int exclusive)
+	struct vm_area_struct *vma, unsigned long address, int flags)
 {
 	int first = atomic_inc_and_test(&page->_mapcount);
 	if (first) {
+		bool compound = flags & RMAP_COMPOUND;
+		int nr = compound ? hpage_nr_pages(page) : 1;
 		/*
 		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
 		 * these counters are not modified in interrupt context, and
 		 * pte lock(a spinlock) is held, which implies preemption
 		 * disabled.
 		 */
-		if (PageTransHuge(page))
+		if (compound) {
+			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 			__inc_zone_page_state(page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
-		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
-				hpage_nr_pages(page));
+		}
+		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
 	}
 	if (unlikely(PageKsm(page)))
 		return;
@@ -1081,7 +1084,8 @@ void do_page_add_anon_rmap(struct page *page,
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	/* address might be in next vma when migration races vma_adjust */
 	if (first)
-		__page_set_anon_rmap(page, vma, address, exclusive);
+		__page_set_anon_rmap(page, vma, address,
+				flags & RMAP_EXCLUSIVE);
 	else
 		__page_check_anon_rmap(page, vma, address);
 }
@@ -1097,15 +1101,18 @@ void do_page_add_anon_rmap(struct page *page,
  * Page does not have to be locked.
  */
 void page_add_new_anon_rmap(struct page *page,
-	struct vm_area_struct *vma, unsigned long address)
+	struct vm_area_struct *vma, unsigned long address, bool compound)
 {
+	int nr = compound ? hpage_nr_pages(page) : 1;
+
 	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
 	SetPageSwapBacked(page);
 	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
-	if (PageTransHuge(page))
+	if (compound) {
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
-	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
-			hpage_nr_pages(page));
+	}
+	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
 	__page_set_anon_rmap(page, vma, address, 1);
 }
 
@@ -1161,9 +1168,12 @@ out:
  *
  * The caller needs to hold the pte lock.
  */
-void page_remove_rmap(struct page *page)
+void page_remove_rmap(struct page *page, bool compound)
 {
+	int nr = compound ? hpage_nr_pages(page) : 1;
+
 	if (!PageAnon(page)) {
+		VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
 		page_remove_file_rmap(page);
 		return;
 	}
@@ -1181,11 +1191,12 @@ void page_remove_rmap(struct page *page)
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	if (PageTransHuge(page))
+	if (compound) {
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+	}
 
-	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
-			      -hpage_nr_pages(page));
+	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
 
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
@@ -1327,7 +1338,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		dec_mm_counter(mm, MM_FILEPAGES);
 
 discard:
-	page_remove_rmap(page);
+	page_remove_rmap(page, false);
 	page_cache_release(page);
 
 out_unmap:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a7e72103f23b..65825c2687f5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1121,10 +1121,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	set_pte_at(vma->vm_mm, addr, pte,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	if (page == swapcache) {
-		page_add_anon_rmap(page, vma, addr);
+		page_add_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, true);
 	} else { /* ksm created a completely new copy */
-		page_add_new_anon_rmap(page, vma, addr);
+		page_add_new_anon_rmap(page, vma, addr, false);
 		mem_cgroup_commit_charge(page, memcg, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	}
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 03/28] memcg: adjust to support new THP refcounting
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

As with rmap, with the new refcounting we cannot rely on PageTransHuge()
to decide whether to charge the size of the whole huge page to the
cgroup. We need the caller to tell us whether the page is mapped with a
PMD or with PTEs.

We uncharge when the last reference on the page is gone. At that point,
if we see PageTransHuge() it means we need to uncharge the whole huge
page.

The tricky part is partial unmap -- when we unmap only part of a huge
page. We don't handle this case specially, meaning we don't uncharge the
unmapped part of the huge page until the last user is gone or
split_huge_page() is triggered. If the cgroup comes under memory
pressure, the partially unmapped page will be split via the shrinker.
This should be good enough.
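
To illustrate the extended interface: a PMD-mapping fault path passes
compound == true through the whole charge sequence. A minimal sketch,
following the __do_huge_pmd_anonymous_page() hunk below (setup, locking
and error handling omitted; 'gfp' stands for whatever allocation mask
the caller uses):

	struct mem_cgroup *memcg;

	/* charge the whole huge page (HPAGE_PMD_NR pages) up front */
	if (mem_cgroup_try_charge(page, mm, gfp, &memcg, true))
		return VM_FAULT_OOM;

	/* ... allocate the pgtable, take the PMD lock, build the entry ... */

	page_add_new_anon_rmap(page, vma, haddr, true);
	mem_cgroup_commit_charge(page, memcg, false, true);
	lru_cache_add_active_or_unevictable(page, vma);

	/* if anything fails before the PMD is installed, roll back: */
	/* mem_cgroup_cancel_charge(page, memcg, true); */

PTE-mapping callers pass compound == false and keep the old single-page
charging behaviour.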

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 include/linux/memcontrol.h | 16 +++++++-----
 kernel/events/uprobes.c    |  7 +++---
 mm/filemap.c               |  8 +++---
 mm/huge_memory.c           | 31 ++++++++++++-----------
 mm/memcontrol.c            | 62 +++++++++++++++++-----------------------------
 mm/memory.c                | 26 +++++++++----------
 mm/shmem.c                 | 21 +++++++++-------
 mm/swapfile.c              |  9 ++++---
 8 files changed, 87 insertions(+), 93 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 72dff5fb0d0c..6a70e6c4bece 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -74,10 +74,12 @@ void mem_cgroup_events(struct mem_cgroup *memcg,
 bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
 
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
-			  gfp_t gfp_mask, struct mem_cgroup **memcgp);
+			  gfp_t gfp_mask, struct mem_cgroup **memcgp,
+			  bool compound);
 void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
-			      bool lrucare);
-void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
+			      bool lrucare, bool compound);
+void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
+		bool compound);
 void mem_cgroup_uncharge(struct page *page);
 void mem_cgroup_uncharge_list(struct list_head *page_list);
 
@@ -209,7 +211,8 @@ static inline bool mem_cgroup_low(struct mem_cgroup *root,
 
 static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask,
-					struct mem_cgroup **memcgp)
+					struct mem_cgroup **memcgp,
+					bool compound)
 {
 	*memcgp = NULL;
 	return 0;
@@ -217,12 +220,13 @@ static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 
 static inline void mem_cgroup_commit_charge(struct page *page,
 					    struct mem_cgroup *memcg,
-					    bool lrucare)
+					    bool lrucare, bool compound)
 {
 }
 
 static inline void mem_cgroup_cancel_charge(struct page *page,
-					    struct mem_cgroup *memcg)
+					    struct mem_cgroup *memcg,
+					    bool compound)
 {
 }
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 5523daf59953..04e26bdf0717 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -169,7 +169,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	const unsigned long mmun_end   = addr + PAGE_SIZE;
 	struct mem_cgroup *memcg;
 
-	err = mem_cgroup_try_charge(kpage, vma->vm_mm, GFP_KERNEL, &memcg);
+	err = mem_cgroup_try_charge(kpage, vma->vm_mm, GFP_KERNEL, &memcg,
+			false);
 	if (err)
 		return err;
 
@@ -184,7 +185,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	get_page(kpage);
 	page_add_new_anon_rmap(kpage, vma, addr, false);
-	mem_cgroup_commit_charge(kpage, memcg, false);
+	mem_cgroup_commit_charge(kpage, memcg, false, false);
 	lru_cache_add_active_or_unevictable(kpage, vma);
 
 	if (!PageAnon(page)) {
@@ -207,7 +208,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	err = 0;
  unlock:
-	mem_cgroup_cancel_charge(kpage, memcg);
+	mem_cgroup_cancel_charge(kpage, memcg, false);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	unlock_page(page);
 	return err;
diff --git a/mm/filemap.c b/mm/filemap.c
index 7aeeb33618bf..ce4d6e3d740f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -555,7 +555,7 @@ static int __add_to_page_cache_locked(struct page *page,
 
 	if (!huge) {
 		error = mem_cgroup_try_charge(page, current->mm,
-					      gfp_mask, &memcg);
+					      gfp_mask, &memcg, false);
 		if (error)
 			return error;
 	}
@@ -563,7 +563,7 @@ static int __add_to_page_cache_locked(struct page *page,
 	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error) {
 		if (!huge)
-			mem_cgroup_cancel_charge(page, memcg);
+			mem_cgroup_cancel_charge(page, memcg, false);
 		return error;
 	}
 
@@ -579,7 +579,7 @@ static int __add_to_page_cache_locked(struct page *page,
 	__inc_zone_page_state(page, NR_FILE_PAGES);
 	spin_unlock_irq(&mapping->tree_lock);
 	if (!huge)
-		mem_cgroup_commit_charge(page, memcg, false);
+		mem_cgroup_commit_charge(page, memcg, false, false);
 	trace_mm_filemap_add_to_page_cache(page);
 	return 0;
 err_insert:
@@ -587,7 +587,7 @@ err_insert:
 	/* Leave page->index set: truncation relies upon it */
 	spin_unlock_irq(&mapping->tree_lock);
 	if (!huge)
-		mem_cgroup_cancel_charge(page, memcg);
+		mem_cgroup_cancel_charge(page, memcg, false);
 	page_cache_release(page);
 	return error;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b40fc0ff9315..534f353e12bf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -725,12 +725,12 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
-	if (mem_cgroup_try_charge(page, mm, gfp, &memcg))
+	if (mem_cgroup_try_charge(page, mm, gfp, &memcg, true))
 		return VM_FAULT_OOM;
 
 	pgtable = pte_alloc_one(mm, haddr);
 	if (unlikely(!pgtable)) {
-		mem_cgroup_cancel_charge(page, memcg);
+		mem_cgroup_cancel_charge(page, memcg, true);
 		return VM_FAULT_OOM;
 	}
 
@@ -745,7 +745,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_none(*pmd))) {
 		spin_unlock(ptl);
-		mem_cgroup_cancel_charge(page, memcg);
+		mem_cgroup_cancel_charge(page, memcg, true);
 		put_page(page);
 		pte_free(mm, pgtable);
 	} else {
@@ -753,7 +753,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		page_add_new_anon_rmap(page, vma, haddr, true);
-		mem_cgroup_commit_charge(page, memcg, false);
+		mem_cgroup_commit_charge(page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(mm, pmd, pgtable);
 		set_pmd_at(mm, haddr, pmd, entry);
@@ -999,13 +999,14 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					       vma, address, page_to_nid(page));
 		if (unlikely(!pages[i] ||
 			     mem_cgroup_try_charge(pages[i], mm, GFP_KERNEL,
-						   &memcg))) {
+						   &memcg, false))) {
 			if (pages[i])
 				put_page(pages[i]);
 			while (--i >= 0) {
 				memcg = (void *)page_private(pages[i]);
 				set_page_private(pages[i], 0);
-				mem_cgroup_cancel_charge(pages[i], memcg);
+				mem_cgroup_cancel_charge(pages[i], memcg,
+						false);
 				put_page(pages[i]);
 			}
 			kfree(pages);
@@ -1044,7 +1045,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
 		page_add_new_anon_rmap(pages[i], vma, haddr, false);
-		mem_cgroup_commit_charge(pages[i], memcg, false);
+		mem_cgroup_commit_charge(pages[i], memcg, false, false);
 		lru_cache_add_active_or_unevictable(pages[i], vma);
 		pte = pte_offset_map(&_pmd, haddr);
 		VM_BUG_ON(!pte_none(*pte));
@@ -1072,7 +1073,7 @@ out_free_pages:
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
-		mem_cgroup_cancel_charge(pages[i], memcg);
+		mem_cgroup_cancel_charge(pages[i], memcg, false);
 		put_page(pages[i]);
 	}
 	kfree(pages);
@@ -1138,7 +1139,8 @@ alloc:
 		goto out;
 	}
 
-	if (unlikely(mem_cgroup_try_charge(new_page, mm, huge_gfp, &memcg))) {
+	if (unlikely(mem_cgroup_try_charge(new_page, mm, huge_gfp,
+					&memcg, true))) {
 		put_page(new_page);
 		if (page) {
 			split_huge_page(page);
@@ -1167,7 +1169,7 @@ alloc:
 		put_user_huge_page(page);
 	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
 		spin_unlock(ptl);
-		mem_cgroup_cancel_charge(new_page, memcg);
+		mem_cgroup_cancel_charge(new_page, memcg, true);
 		put_page(new_page);
 		goto out_mn;
 	} else {
@@ -1176,7 +1178,7 @@ alloc:
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		pmdp_clear_flush_notify(vma, haddr, pmd);
 		page_add_new_anon_rmap(new_page, vma, haddr, true);
-		mem_cgroup_commit_charge(new_page, memcg, false);
+		mem_cgroup_commit_charge(new_page, memcg, false, true);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		set_pmd_at(mm, haddr, pmd, entry);
 		update_mmu_cache_pmd(vma, address, pmd);
@@ -2493,8 +2495,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	if (!new_page)
 		return;
 
-	if (unlikely(mem_cgroup_try_charge(new_page, mm,
-					   gfp, &memcg)))
+	if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg, true)))
 		return;
 
 	/*
@@ -2581,7 +2582,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
 	page_add_new_anon_rmap(new_page, vma, address, true);
-	mem_cgroup_commit_charge(new_page, memcg, false);
+	mem_cgroup_commit_charge(new_page, memcg, false, true);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
@@ -2596,7 +2597,7 @@ out_up_write:
 	return;
 
 out:
-	mem_cgroup_cancel_charge(new_page, memcg);
+	mem_cgroup_cancel_charge(new_page, memcg, true);
 	goto out_up_write;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 14c2f2017e37..f659d4f77138 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -827,7 +827,7 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 					 struct page *page,
-					 int nr_pages)
+					 bool compound, int nr_pages)
 {
 	/*
 	 * Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
@@ -840,9 +840,11 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_CACHE],
 				nr_pages);
 
-	if (PageTransHuge(page))
+	if (compound) {
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
 				nr_pages);
+	}
 
 	/* pagein of a big page is an event. So, ignore page size */
 	if (nr_pages > 0)
@@ -4740,30 +4742,24 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
  * from old cgroup.
  */
 static int mem_cgroup_move_account(struct page *page,
-				   unsigned int nr_pages,
+				   bool compound,
 				   struct mem_cgroup *from,
 				   struct mem_cgroup *to)
 {
 	unsigned long flags;
+	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
 	int ret;
 
 	VM_BUG_ON(from == to);
 	VM_BUG_ON_PAGE(PageLRU(page), page);
-	/*
-	 * The page is isolated from LRU. So, collapse function
-	 * will not handle this page. But page splitting can happen.
-	 * Do this check under compound_page_lock(). The caller should
-	 * hold it.
-	 */
-	ret = -EBUSY;
-	if (nr_pages > 1 && !PageTransHuge(page))
-		goto out;
+	VM_BUG_ON(compound && !PageTransHuge(page));
 
 	/*
 	 * Prevent mem_cgroup_migrate() from looking at page->mem_cgroup
 	 * of its source page while we change it: page migration takes
 	 * both pages off the LRU, but page cache replacement doesn't.
 	 */
+	ret = -EBUSY;
 	if (!trylock_page(page))
 		goto out;
 
@@ -4800,9 +4796,9 @@ static int mem_cgroup_move_account(struct page *page,
 	ret = 0;
 
 	local_irq_disable();
-	mem_cgroup_charge_statistics(to, page, nr_pages);
+	mem_cgroup_charge_statistics(to, page, compound, nr_pages);
 	memcg_check_events(to, page);
-	mem_cgroup_charge_statistics(from, page, -nr_pages);
+	mem_cgroup_charge_statistics(from, page, compound, -nr_pages);
 	memcg_check_events(from, page);
 	local_irq_enable();
 out_unlock:
@@ -5079,7 +5075,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		if (target_type == MC_TARGET_PAGE) {
 			page = target.page;
 			if (!isolate_lru_page(page)) {
-				if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
+				if (!mem_cgroup_move_account(page, true,
 							     mc.from, mc.to)) {
 					mc.precharge -= HPAGE_PMD_NR;
 					mc.moved_charge += HPAGE_PMD_NR;
@@ -5108,7 +5104,8 @@ retry:
 			page = target.page;
 			if (isolate_lru_page(page))
 				goto put;
-			if (!mem_cgroup_move_account(page, 1, mc.from, mc.to)) {
+			if (!mem_cgroup_move_account(page, false,
+						mc.from, mc.to)) {
 				mc.precharge--;
 				/* we uncharge from mc.from later. */
 				mc.moved_charge++;
@@ -5456,10 +5453,11 @@ bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
  * with mem_cgroup_cancel_charge() in case page instantiation fails.
  */
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
-			  gfp_t gfp_mask, struct mem_cgroup **memcgp)
+			  gfp_t gfp_mask, struct mem_cgroup **memcgp,
+			  bool compound)
 {
 	struct mem_cgroup *memcg = NULL;
-	unsigned int nr_pages = 1;
+	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
 	int ret = 0;
 
 	if (mem_cgroup_disabled())
@@ -5477,11 +5475,6 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 			goto out;
 	}
 
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-	}
-
 	if (do_swap_account && PageSwapCache(page))
 		memcg = try_get_mem_cgroup_from_page(page);
 	if (!memcg)
@@ -5517,9 +5510,9 @@ out:
  * Use mem_cgroup_cancel_charge() to cancel the transaction instead.
  */
 void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
-			      bool lrucare)
+			      bool lrucare, bool compound)
 {
-	unsigned int nr_pages = 1;
+	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
 
 	VM_BUG_ON_PAGE(!page->mapping, page);
 	VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
@@ -5536,13 +5529,8 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 
 	commit_charge(page, memcg, lrucare);
 
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-	}
-
 	local_irq_disable();
-	mem_cgroup_charge_statistics(memcg, page, nr_pages);
+	mem_cgroup_charge_statistics(memcg, page, compound, nr_pages);
 	memcg_check_events(memcg, page);
 	local_irq_enable();
 
@@ -5564,9 +5552,10 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
  *
  * Cancel a charge transaction started by mem_cgroup_try_charge().
  */
-void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
+void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
+		bool compound)
 {
-	unsigned int nr_pages = 1;
+	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
 
 	if (mem_cgroup_disabled())
 		return;
@@ -5578,11 +5567,6 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
 	if (!memcg)
 		return;
 
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-	}
-
 	cancel_charge(memcg, nr_pages);
 }
 
@@ -5836,7 +5820,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	/* XXX: caller holds IRQ-safe mapping->tree_lock */
 	VM_BUG_ON(!irqs_disabled());
 
-	mem_cgroup_charge_statistics(memcg, page, -1);
+	mem_cgroup_charge_statistics(memcg, page, false, -1);
 	memcg_check_events(memcg, page);
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index d6171752ea59..559c6651d6b6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2080,7 +2080,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 	__SetPageUptodate(new_page);
 
-	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg))
+	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false))
 		goto oom_free_new;
 
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
@@ -2109,7 +2109,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		ptep_clear_flush_notify(vma, address, page_table);
 		page_add_new_anon_rmap(new_page, vma, address, false);
-		mem_cgroup_commit_charge(new_page, memcg, false);
+		mem_cgroup_commit_charge(new_page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(new_page, vma);
 		/*
 		 * We call the notify macro here because, when using secondary
@@ -2148,7 +2148,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 		new_page = old_page;
 		page_copied = 1;
 	} else {
-		mem_cgroup_cancel_charge(new_page, memcg);
+		mem_cgroup_cancel_charge(new_page, memcg, false);
 	}
 
 	if (new_page)
@@ -2522,7 +2522,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_page;
 	}
 
-	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg)) {
+	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false)) {
 		ret = VM_FAULT_OOM;
 		goto out_page;
 	}
@@ -2564,10 +2564,10 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	set_pte_at(mm, address, page_table, pte);
 	if (page == swapcache) {
 		do_page_add_anon_rmap(page, vma, address, exclusive);
-		mem_cgroup_commit_charge(page, memcg, true);
+		mem_cgroup_commit_charge(page, memcg, true, false);
 	} else { /* ksm created a completely new copy */
 		page_add_new_anon_rmap(page, vma, address, false);
-		mem_cgroup_commit_charge(page, memcg, false);
+		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	}
 
@@ -2602,7 +2602,7 @@ unlock:
 out:
 	return ret;
 out_nomap:
-	mem_cgroup_cancel_charge(page, memcg);
+	mem_cgroup_cancel_charge(page, memcg, false);
 	pte_unmap_unlock(page_table, ptl);
 out_page:
 	unlock_page(page);
@@ -2692,7 +2692,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 */
 	__SetPageUptodate(page);
 
-	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
+	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg, false))
 		goto oom_free_page;
 
 	entry = mk_pte(page, vma->vm_page_prot);
@@ -2705,7 +2705,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, address, false);
-	mem_cgroup_commit_charge(page, memcg, false);
+	mem_cgroup_commit_charge(page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(page, vma);
 setpte:
 	set_pte_at(mm, address, page_table, entry);
@@ -2716,7 +2716,7 @@ unlock:
 	pte_unmap_unlock(page_table, ptl);
 	return 0;
 release:
-	mem_cgroup_cancel_charge(page, memcg);
+	mem_cgroup_cancel_charge(page, memcg, false);
 	page_cache_release(page);
 	goto unlock;
 oom_free_page:
@@ -2962,7 +2962,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!new_page)
 		return VM_FAULT_OOM;
 
-	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg)) {
+	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg, false)) {
 		page_cache_release(new_page);
 		return VM_FAULT_OOM;
 	}
@@ -2982,14 +2982,14 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto uncharge_out;
 	}
 	do_set_pte(vma, address, new_page, pte, true, true);
-	mem_cgroup_commit_charge(new_page, memcg, false);
+	mem_cgroup_commit_charge(new_page, memcg, false, false);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pte_unmap_unlock(pte, ptl);
 	unlock_page(fault_page);
 	page_cache_release(fault_page);
 	return ret;
 uncharge_out:
-	mem_cgroup_cancel_charge(new_page, memcg);
+	mem_cgroup_cancel_charge(new_page, memcg, false);
 	page_cache_release(new_page);
 	return ret;
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index 2026b3b8dd28..d0ab2f9e83f6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -706,7 +706,8 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
 	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
 	 * Charged back to the user (not to caller) when swap account is used.
 	 */
-	error = mem_cgroup_try_charge(page, current->mm, GFP_KERNEL, &memcg);
+	error = mem_cgroup_try_charge(page, current->mm, GFP_KERNEL, &memcg,
+			false);
 	if (error)
 		goto out;
 	/* No radix_tree_preload: swap entry keeps a place for page in tree */
@@ -729,9 +730,9 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
 	if (error) {
 		if (error != -ENOMEM)
 			error = 0;
-		mem_cgroup_cancel_charge(page, memcg);
+		mem_cgroup_cancel_charge(page, memcg, false);
 	} else
-		mem_cgroup_commit_charge(page, memcg, true);
+		mem_cgroup_commit_charge(page, memcg, true, false);
 out:
 	unlock_page(page);
 	page_cache_release(page);
@@ -1114,7 +1115,8 @@ repeat:
 				goto failed;
 		}
 
-		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
+		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg,
+				false);
 		if (!error) {
 			error = shmem_add_to_page_cache(page, mapping, index,
 						swp_to_radix_entry(swap));
@@ -1131,14 +1133,14 @@ repeat:
 			 * "repeat": reading a hole and writing should succeed.
 			 */
 			if (error) {
-				mem_cgroup_cancel_charge(page, memcg);
+				mem_cgroup_cancel_charge(page, memcg, false);
 				delete_from_swap_cache(page);
 			}
 		}
 		if (error)
 			goto failed;
 
-		mem_cgroup_commit_charge(page, memcg, true);
+		mem_cgroup_commit_charge(page, memcg, true, false);
 
 		spin_lock(&info->lock);
 		info->swapped--;
@@ -1177,7 +1179,8 @@ repeat:
 		if (sgp == SGP_WRITE)
 			__SetPageReferenced(page);
 
-		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
+		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg,
+				false);
 		if (error)
 			goto decused;
 		error = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK);
@@ -1187,10 +1190,10 @@ repeat:
 			radix_tree_preload_end();
 		}
 		if (error) {
-			mem_cgroup_cancel_charge(page, memcg);
+			mem_cgroup_cancel_charge(page, memcg, false);
 			goto decused;
 		}
-		mem_cgroup_commit_charge(page, memcg, false);
+		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_anon(page);
 
 		spin_lock(&info->lock);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 65825c2687f5..6dd365d1c488 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1103,14 +1103,15 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	if (unlikely(!page))
 		return -ENOMEM;
 
-	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg)) {
+	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
+	{
 		ret = -ENOMEM;
 		goto out_nolock;
 	}
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	if (unlikely(!maybe_same_pte(*pte, swp_entry_to_pte(entry)))) {
-		mem_cgroup_cancel_charge(page, memcg);
+		mem_cgroup_cancel_charge(page, memcg, false);
 		ret = 0;
 		goto out;
 	}
@@ -1122,10 +1123,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	if (page == swapcache) {
 		page_add_anon_rmap(page, vma, addr, false);
-		mem_cgroup_commit_charge(page, memcg, true);
+		mem_cgroup_commit_charge(page, memcg, true, false);
 	} else { /* ksm created a completely new copy */
 		page_add_new_anon_rmap(page, vma, addr, false);
-		mem_cgroup_commit_charge(page, memcg, false);
+		mem_cgroup_commit_charge(page, memcg, false, false);
 		lru_cache_add_active_or_unevictable(page, vma);
 	}
 	swap_free(entry);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 04/28] mm, thp: adjust conditions when we can reuse the page on WP fault
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting we will be able to map the same compound page
with PTEs and PMDs. This requires adjusting the conditions under which
we can reuse the page on a write-protection fault.

For a PTE fault we can't reuse the page if it's part of a huge page.

For a PMD fault we can only reuse the page if nobody else maps the huge
page or any of its sub-pages. We could check page_mapcount() on each
sub-page, but that's expensive.

The cheaper way is to check that page_count() is equal to 1: every
mapcount takes a page reference, so this guarantees that the PMD is the
only mapping.

This approach can give a false negative if somebody pinned the page, but
that doesn't affect correctness.
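
As an illustration, the PMD-case check boils down to something like the
sketch below (not part of the patch; the helper name is made up):

#include <linux/mm.h>

static bool can_reuse_pmd_mapped_thp(struct page *page)
{
	/* 'page' is the head page of the PMD-mapped THP */
	VM_BUG_ON_PAGE(!PageHead(page), page);

	/*
	 * A single mapping and a single reference: the only reference
	 * left is the one taken by that mapping, so no other PMD/PTE
	 * maps the page and nobody holds an extra pin. A concurrent pin
	 * can only turn "reuse" into "copy", never the other way around.
	 */
	return page_mapcount(page) == 1 && page_count(page) == 1;
}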

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 include/linux/swap.h |  3 ++-
 mm/huge_memory.c     | 12 +++++++++++-
 mm/swapfile.c        |  3 +++
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0428e4c84e1d..17cdd6b9456b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -524,7 +524,8 @@ static inline int page_swapcount(struct page *page)
 	return 0;
 }
 
-#define reuse_swap_page(page)	(page_mapcount(page) == 1)
+#define reuse_swap_page(page) \
+	(!PageTransCompound(page) && page_mapcount(page) == 1)
 
 static inline int try_to_free_swap(struct page *page)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 534f353e12bf..fd8af5b9917f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1103,7 +1103,17 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	page = pmd_page(orig_pmd);
 	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
-	if (page_mapcount(page) == 1) {
+	/*
+	 * We can only reuse the page if nobody else maps the huge page or
+	 * its part. We could check page_mapcount() on each sub-page, but
+	 * that's expensive.
+	 * The cheaper way is to check that page_count() is equal to 1:
+	 * every mapcount takes a page reference, so this guarantees that
+	 * the PMD is the only mapping.
+	 * This can give a false negative if somebody pinned the page, but
+	 * that's fine.
+	 */
+	if (page_mapcount(page) == 1 && page_count(page) == 1) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6dd365d1c488..3cd5f188b996 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -887,6 +887,9 @@ int reuse_swap_page(struct page *page)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	if (unlikely(PageKsm(page)))
 		return 0;
+	/* The page is part of THP and cannot be reused */
+	if (PageTransCompound(page))
+		return 0;
 	count = page_mapcount(page);
 	if (count <= 1 && PageSwapCache(page)) {
 		count += page_swapcount(page);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 05/28] mm: adjust FOLL_SPLIT for new refcounting
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We need to prepare the kernel to allow transhuge pages to be mapped with
PTEs too, which means handling FOLL_SPLIT in follow_page_pte().

We also use split_huge_page() directly instead of split_huge_page_pmd(),
since split_huge_page_pmd() is going away.
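
For illustration, a rough sketch of a caller that wants an address backed
by small pages (hypothetical helper, not part of the patch; assumes
mmap_sem is held and the vma covers the address):

#include <linux/err.h>
#include <linux/mm.h>

static int ensure_small_page(struct vm_area_struct *vma, unsigned long addr)
{
	struct page *page;

	/* FOLL_SPLIT splits a THP, PMD- or PTE-mapped, before returning */
	page = follow_page(vma, addr, FOLL_GET | FOLL_SPLIT);
	if (IS_ERR_OR_NULL(page))
		return page ? PTR_ERR(page) : -EFAULT;

	/* the returned page is a small page now */
	VM_BUG_ON_PAGE(PageTransCompound(page), page);
	put_page(page);
	return 0;
}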

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 mm/gup.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 49 insertions(+), 18 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 203781fa96a5..ebdb39b3e820 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -79,6 +79,19 @@ retry:
 		page = pte_page(pte);
 	}
 
+	if (flags & FOLL_SPLIT && PageTransCompound(page)) {
+		int ret;
+		get_page(page);
+		pte_unmap_unlock(ptep, ptl);
+		lock_page(page);
+		ret = split_huge_page(page);
+		unlock_page(page);
+		put_page(page);
+		if (ret)
+			return ERR_PTR(ret);
+		goto retry;
+	}
+
 	if (flags & FOLL_GET)
 		get_page_foll(page);
 	if (flags & FOLL_TOUCH) {
@@ -186,27 +199,45 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 	}
 	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
 		return no_page_table(vma, flags);
-	if (pmd_trans_huge(*pmd)) {
-		if (flags & FOLL_SPLIT) {
+	if (likely(!pmd_trans_huge(*pmd)))
+		return follow_page_pte(vma, address, pmd, flags);
+
+	ptl = pmd_lock(mm, pmd);
+	if (unlikely(!pmd_trans_huge(*pmd))) {
+		spin_unlock(ptl);
+		return follow_page_pte(vma, address, pmd, flags);
+	}
+
+	if (unlikely(pmd_trans_splitting(*pmd))) {
+		spin_unlock(ptl);
+		wait_split_huge_page(vma->anon_vma, pmd);
+		return follow_page_pte(vma, address, pmd, flags);
+	}
+
+	if (flags & FOLL_SPLIT) {
+		int ret;
+		page = pmd_page(*pmd);
+		if (is_huge_zero_page(page)) {
+			spin_unlock(ptl);
+			ret = 0;
 			split_huge_page_pmd(vma, address, pmd);
-			return follow_page_pte(vma, address, pmd, flags);
-		}
-		ptl = pmd_lock(mm, pmd);
-		if (likely(pmd_trans_huge(*pmd))) {
-			if (unlikely(pmd_trans_splitting(*pmd))) {
-				spin_unlock(ptl);
-				wait_split_huge_page(vma->anon_vma, pmd);
-			} else {
-				page = follow_trans_huge_pmd(vma, address,
-							     pmd, flags);
-				spin_unlock(ptl);
-				*page_mask = HPAGE_PMD_NR - 1;
-				return page;
-			}
-		} else
+		} else {
+			get_page(page);
 			spin_unlock(ptl);
+			lock_page(page);
+			ret = split_huge_page(page);
+			unlock_page(page);
+			put_page(page);
+		}
+
+		return ret ? ERR_PTR(ret) :
+			follow_page_pte(vma, address, pmd, flags);
 	}
-	return follow_page_pte(vma, address, pmd, flags);
+
+	page = follow_trans_huge_pmd(vma, address, pmd, flags);
+	spin_unlock(ptl);
+	*page_mask = HPAGE_PMD_NR - 1;
+	return page;
 }
 
 static int get_gate_page(struct mm_struct *mm, unsigned long address,
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 06/28] mm: handle PTE-mapped tail pages in generic fast gup implementation
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting we are going to see THP tail pages mapped with
PTEs. Generic fast GUP relies on page_cache_get_speculative() to obtain a
reference on the page, and page_cache_get_speculative() always fails on
tail pages because ->_count on tail pages is always zero.

Let's handle tail pages in gup_pte_range().

The new split_huge_page() will rely on migration entries to freeze the
page's counts. Rechecking the PTE value after page_cache_get_speculative()
on the head page should be enough to serialize against split.
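
To illustrate, the same pin-and-recheck logic as a stand-alone helper (a
sketch only; the helper name is made up and the real code lives in
gup_pte_range() below):

#include <linux/mm.h>
#include <linux/pagemap.h>

static bool gup_pin_pte_page(pte_t pte, pte_t *ptep, struct page *page)
{
	struct page *head = compound_head(page);

	/* tail pages keep _count == 0, so the reference goes to the head */
	if (!page_cache_get_speculative(head))
		return false;

	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
		/* the PTE changed under us, e.g. a split started: back off */
		put_page(head);
		return false;
	}
	return true;
}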

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 mm/gup.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index ebdb39b3e820..eaeeae15006b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1051,7 +1051,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		 * for an example see gup_get_pte in arch/x86/mm/gup.c
 		 */
 		pte_t pte = READ_ONCE(*ptep);
-		struct page *page;
+		struct page *head, *page;
 
 		/*
 		 * Similar to the PMD case below, NUMA hinting must take slow
@@ -1063,15 +1063,17 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
+		head = compound_head(page);
 
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(head))
 			goto pte_unmap;
 
 		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-			put_page(page);
+			put_page(head);
 			goto pte_unmap;
 		}
 
+		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
 		(*nr)++;
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 07/28] thp, mlock: do not allow huge pages in mlocked area
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting a THP can belong to several VMAs. This makes it
tricky to track THP pages when they are partially mlocked, and can lead
to leaking mlocked pages into non-VM_LOCKED vmas and other problems.

With this patch we split all pages on mlock and avoid faulting in or
collapsing new THP in VM_LOCKED vmas.

I've tried an alternative approach: do not mark THP pages mlocked and
keep them on normal LRUs. This way vmscan could try to split huge pages
under memory pressure and free up subpages which don't belong to
VM_LOCKED vmas. But that is a user-visible change: it screws up the
Mlocked accounting reported in meminfo, so I had to leave this approach
aside.

We can bring something better later, but this should be good enough for
now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 mm/gup.c         |  2 ++
 mm/huge_memory.c |  5 ++++-
 mm/memory.c      |  3 ++-
 mm/mlock.c       | 51 +++++++++++++++++++--------------------------------
 4 files changed, 27 insertions(+), 34 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index eaeeae15006b..7334eb24f414 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -882,6 +882,8 @@ long populate_vma_page_range(struct vm_area_struct *vma,
 	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
 
 	gup_flags = FOLL_TOUCH | FOLL_POPULATE;
+	if (vma->vm_flags & VM_LOCKED)
+		gup_flags |= FOLL_SPLIT;
 	/*
 	 * We want to touch writable mappings with a write fault in order
 	 * to break COW, except for shared mappings because these don't COW
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fd8af5b9917f..fa3d4f78b716 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -796,6 +796,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
 		return VM_FAULT_FALLBACK;
+	if (vma->vm_flags & VM_LOCKED)
+		return VM_FAULT_FALLBACK;
 	if (unlikely(anon_vma_prepare(vma)))
 		return VM_FAULT_OOM;
 	if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
@@ -2467,7 +2469,8 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
 	if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
 	    (vma->vm_flags & VM_NOHUGEPAGE))
 		return false;
-
+	if (vma->vm_flags & VM_LOCKED)
+		return false;
 	if (!vma->anon_vma || vma->vm_ops)
 		return false;
 	if (is_vma_temporary_stack(vma))
diff --git a/mm/memory.c b/mm/memory.c
index 559c6651d6b6..8bbd3f88544b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2156,7 +2156,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	pte_unmap_unlock(page_table, ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-	if (old_page) {
+	/* THP pages are never mlocked */
+	if (old_page && !PageTransCompound(old_page)) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
 		 * keep the mlocked page.
diff --git a/mm/mlock.c b/mm/mlock.c
index 6fd2cf15e868..76cde3967483 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -443,39 +443,26 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 		page = follow_page_mask(vma, start, FOLL_GET | FOLL_DUMP,
 				&page_mask);
 
-		if (page && !IS_ERR(page)) {
-			if (PageTransHuge(page)) {
-				lock_page(page);
-				/*
-				 * Any THP page found by follow_page_mask() may
-				 * have gotten split before reaching
-				 * munlock_vma_page(), so we need to recompute
-				 * the page_mask here.
-				 */
-				page_mask = munlock_vma_page(page);
-				unlock_page(page);
-				put_page(page); /* follow_page_mask() */
-			} else {
-				/*
-				 * Non-huge pages are handled in batches via
-				 * pagevec. The pin from follow_page_mask()
-				 * prevents them from collapsing by THP.
-				 */
-				pagevec_add(&pvec, page);
-				zone = page_zone(page);
-				zoneid = page_zone_id(page);
+		if (page && !IS_ERR(page) && !PageTransCompound(page)) {
+			/*
+			 * Non-huge pages are handled in batches via
+			 * pagevec. The pin from follow_page_mask()
+			 * prevents them from collapsing by THP.
+			 */
+			pagevec_add(&pvec, page);
+			zone = page_zone(page);
+			zoneid = page_zone_id(page);
 
-				/*
-				 * Try to fill the rest of pagevec using fast
-				 * pte walk. This will also update start to
-				 * the next page to process. Then munlock the
-				 * pagevec.
-				 */
-				start = __munlock_pagevec_fill(&pvec, vma,
-						zoneid, start, end);
-				__munlock_pagevec(&pvec, zone);
-				goto next;
-			}
+			/*
+			 * Try to fill the rest of pagevec using fast
+			 * pte walk. This will also update start to
+			 * the next page to process. Then munlock the
+			 * pagevec.
+			 */
+			start = __munlock_pagevec_fill(&pvec, vma,
+					zoneid, start, end);
+			__munlock_pagevec(&pvec, zone);
+			goto next;
 		}
 		/* It's a bug to munlock in the middle of a THP page */
 		VM_BUG_ON((start >> PAGE_SHIFT) & page_mask);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 08/28] khugepaged: ignore pmd tables with THP mapped with ptes
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

Prepare khugepaged to see compound pages mapped with PTEs. For now we
won't collapse a pmd table containing such PTEs.

khugepaged is subject to future rework wrt the new refcounting.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 mm/huge_memory.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fa3d4f78b716..ffc30e4462c1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2653,6 +2653,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		page = vm_normal_page(vma, _address, pteval);
 		if (unlikely(!page))
 			goto out_unmap;
+
+		/* TODO: teach khugepaged to collapse THP mapped with pte */
+		if (PageCompound(page))
+			goto out_unmap;
+
 		/*
 		 * Record which node the original page is from and save this
 		 * information to khugepaged_node_load[].
@@ -2663,7 +2668,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		if (khugepaged_scan_abort(node))
 			goto out_unmap;
 		khugepaged_node_load[node]++;
-		VM_BUG_ON_PAGE(PageCompound(page), page);
 		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
 			goto out_unmap;
 		/*
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 09/28] thp: rename split_huge_page_pmd() to split_huge_pmd()
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We are going to decouple splitting a THP PMD from splitting the
underlying compound page.

This patch renames the split_huge_page_pmd*() functions to
split_huge_pmd*() to reflect the fact that they split only the PMD, not
the compound page.
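
Call sites are converted mechanically; note that the last two arguments
swap places:

	/* old call */
	split_huge_page_pmd(vma, addr, pmd);

	/* new call */
	split_huge_pmd(vma, pmd, addr);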

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 arch/powerpc/mm/subpage-prot.c |  2 +-
 arch/x86/kernel/vm86_32.c      |  6 +++++-
 include/linux/huge_mm.h        |  8 ++------
 mm/gup.c                       |  2 +-
 mm/huge_memory.c               | 32 +++++++++++---------------------
 mm/madvise.c                   |  2 +-
 mm/memory.c                    |  2 +-
 mm/mempolicy.c                 |  2 +-
 mm/mprotect.c                  |  2 +-
 mm/mremap.c                    |  2 +-
 mm/pagewalk.c                  |  2 +-
 11 files changed, 26 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/mm/subpage-prot.c b/arch/powerpc/mm/subpage-prot.c
index fa9fb5b4c66c..d5543514c1df 100644
--- a/arch/powerpc/mm/subpage-prot.c
+++ b/arch/powerpc/mm/subpage-prot.c
@@ -135,7 +135,7 @@ static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr,
 				  unsigned long end, struct mm_walk *walk)
 {
 	struct vm_area_struct *vma = walk->vma;
-	split_huge_page_pmd(vma, addr, pmd);
+	split_huge_pmd(vma, pmd, addr);
 	return 0;
 }
 
diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index e8edcf52e069..883160599965 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -182,7 +182,11 @@ static void mark_screen_rdonly(struct mm_struct *mm)
 	if (pud_none_or_clear_bad(pud))
 		goto out;
 	pmd = pmd_offset(pud, 0xA0000);
-	split_huge_page_pmd_mm(mm, 0xA0000, pmd);
+
+	if (pmd_trans_huge(*pmd)) {
+		struct vm_area_struct *vma = find_vma(mm, 0xA0000);
+		split_huge_pmd(vma, pmd, 0xA0000);
+	}
 	if (pmd_none_or_clear_bad(pmd))
 		goto out;
 	pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 44a840a53974..34bbf769d52e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -104,7 +104,7 @@ static inline int split_huge_page(struct page *page)
 }
 extern void __split_huge_page_pmd(struct vm_area_struct *vma,
 		unsigned long address, pmd_t *pmd);
-#define split_huge_page_pmd(__vma, __address, __pmd)			\
+#define split_huge_pmd(__vma, __pmd, __address)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
 		if (unlikely(pmd_trans_huge(*____pmd)))			\
@@ -119,8 +119,6 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
 		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
 		       pmd_trans_huge(*____pmd));			\
 	} while (0)
-extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd);
 #if HPAGE_PMD_ORDER >= MAX_ORDER
 #error "hugepages can't be allocated by the buddy allocator"
 #endif
@@ -187,11 +185,9 @@ static inline int split_huge_page(struct page *page)
 {
 	return 0;
 }
-#define split_huge_page_pmd(__vma, __address, __pmd)	\
-	do { } while (0)
 #define wait_split_huge_page(__anon_vma, __pmd)	\
 	do { } while (0)
-#define split_huge_page_pmd_mm(__mm, __address, __pmd)	\
+#define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
 static inline int hugepage_madvise(struct vm_area_struct *vma,
 				   unsigned long *vm_flags, int advice)
diff --git a/mm/gup.c b/mm/gup.c
index 7334eb24f414..19e01f156abb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -220,7 +220,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 		if (is_huge_zero_page(page)) {
 			spin_unlock(ptl);
 			ret = 0;
-			split_huge_page_pmd(vma, address, pmd);
+			split_huge_pmd(vma, pmd, address);
 		} else {
 			get_page(page);
 			spin_unlock(ptl);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ffc30e4462c1..ccbfacf07160 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1136,13 +1136,13 @@ alloc:
 
 	if (unlikely(!new_page)) {
 		if (!page) {
-			split_huge_page_pmd(vma, address, pmd);
+			split_huge_pmd(vma, pmd, address);
 			ret |= VM_FAULT_FALLBACK;
 		} else {
 			ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
 					pmd, orig_pmd, page, haddr);
 			if (ret & VM_FAULT_OOM) {
-				split_huge_page(page);
+				split_huge_pmd(vma, pmd, address);
 				ret |= VM_FAULT_FALLBACK;
 			}
 			put_user_huge_page(page);
@@ -1155,10 +1155,10 @@ alloc:
 					&memcg, true))) {
 		put_page(new_page);
 		if (page) {
-			split_huge_page(page);
+			split_huge_pmd(vma, pmd, address);
 			put_user_huge_page(page);
 		} else
-			split_huge_page_pmd(vma, address, pmd);
+			split_huge_pmd(vma, pmd, address);
 		ret |= VM_FAULT_FALLBACK;
 		count_vm_event(THP_FAULT_FALLBACK);
 		goto out;
@@ -2985,17 +2985,7 @@ again:
 		goto again;
 }
 
-void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
-		pmd_t *pmd)
-{
-	struct vm_area_struct *vma;
-
-	vma = find_vma(mm, address);
-	BUG_ON(vma == NULL);
-	split_huge_page_pmd(vma, address, pmd);
-}
-
-static void split_huge_page_address(struct mm_struct *mm,
+static void split_huge_pmd_address(struct vm_area_struct *vma,
 				    unsigned long address)
 {
 	pgd_t *pgd;
@@ -3004,7 +2994,7 @@ static void split_huge_page_address(struct mm_struct *mm,
 
 	VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));
 
-	pgd = pgd_offset(mm, address);
+	pgd = pgd_offset(vma->vm_mm, address);
 	if (!pgd_present(*pgd))
 		return;
 
@@ -3013,13 +3003,13 @@ static void split_huge_page_address(struct mm_struct *mm,
 		return;
 
 	pmd = pmd_offset(pud, address);
-	if (!pmd_present(*pmd))
+	if (!pmd_present(*pmd) || !pmd_trans_huge(*pmd))
 		return;
 	/*
 	 * Caller holds the mmap_sem write mode, so a huge pmd cannot
 	 * materialize from under us.
 	 */
-	split_huge_page_pmd_mm(mm, address, pmd);
+	__split_huge_page_pmd(vma, address, pmd);
 }
 
 void __vma_adjust_trans_huge(struct vm_area_struct *vma,
@@ -3035,7 +3025,7 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 	if (start & ~HPAGE_PMD_MASK &&
 	    (start & HPAGE_PMD_MASK) >= vma->vm_start &&
 	    (start & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
-		split_huge_page_address(vma->vm_mm, start);
+		split_huge_pmd_address(vma, start);
 
 	/*
 	 * If the new end address isn't hpage aligned and it could
@@ -3045,7 +3035,7 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 	if (end & ~HPAGE_PMD_MASK &&
 	    (end & HPAGE_PMD_MASK) >= vma->vm_start &&
 	    (end & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
-		split_huge_page_address(vma->vm_mm, end);
+		split_huge_pmd_address(vma, end);
 
 	/*
 	 * If we're also updating the vma->vm_next->vm_start, if the new
@@ -3059,6 +3049,6 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 		if (nstart & ~HPAGE_PMD_MASK &&
 		    (nstart & HPAGE_PMD_MASK) >= next->vm_start &&
 		    (nstart & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= next->vm_end)
-			split_huge_page_address(next->vm_mm, nstart);
+			split_huge_pmd_address(next, nstart);
 	}
 }
diff --git a/mm/madvise.c b/mm/madvise.c
index 22b86daf6b94..f5a81ca0dca7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -281,7 +281,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd)) {
 		if (next - addr != HPAGE_PMD_SIZE)
-			split_huge_page_pmd(vma, addr, pmd);
+			split_huge_pmd(vma, pmd, addr);
 		else if (!madvise_free_huge_pmd(tlb, vma, pmd, addr))
 			goto next;
 		/* fall through */
diff --git a/mm/memory.c b/mm/memory.c
index 8bbd3f88544b..61e7ed722760 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1201,7 +1201,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 					BUG();
 				}
 #endif
-				split_huge_page_pmd(vma, addr, pmd);
+				split_huge_pmd(vma, pmd, addr);
 			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
 				goto next;
 			/* fall through */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8badb84c013e..aac490fdc91f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -493,7 +493,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	split_huge_page_pmd(vma, addr, pmd);
+	split_huge_pmd(vma, pmd, addr);
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 88584838e704..714d2fbbaafd 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -158,7 +158,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 		if (pmd_trans_huge(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
-				split_huge_page_pmd(vma, addr, pmd);
+				split_huge_pmd(vma, pmd, addr);
 			else {
 				int nr_ptes = change_huge_pmd(vma, pmd, addr,
 						newprot, prot_numa);
diff --git a/mm/mremap.c b/mm/mremap.c
index afa3ab740d8c..3e40ea27edc4 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -208,7 +208,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 				need_flush = true;
 				continue;
 			} else if (!err) {
-				split_huge_page_pmd(vma, old_addr, old_pmd);
+				split_huge_pmd(vma, old_pmd, old_addr);
 			}
 			VM_BUG_ON(pmd_trans_huge(*old_pmd));
 		}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 29f2f8b853ae..207244489a68 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,7 +58,7 @@ again:
 		if (!walk->pte_entry)
 			continue;
 
-		split_huge_page_pmd_mm(walk->mm, addr, pmd);
+		split_huge_pmd(walk->vma, pmd, addr);
 		if (pmd_trans_unstable(pmd))
 			goto again;
 		err = walk_pte_range(pmd, addr, next, walk);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 10/28] mm, vmstats: new THP splitting event
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

The patch replaces THP_SPLIT with three events: THP_SPLIT_PAGE,
THP_SPLIT_PAGE_FAILED and THP_SPLIT_PMD. It reflects the fact that we are
going to be able to split a PMD without splitting the underlying compound
page and that split_huge_page() can fail.
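
The counters are used roughly like this (THP_SPLIT_PAGE is switched over
in this patch; the other two are expected to be wired up later in the
series, so treat this as a sketch):

	count_vm_event(THP_SPLIT_PMD);		/* a PMD split into a PTE table */
	count_vm_event(THP_SPLIT_PAGE);		/* the compound page was split */
	count_vm_event(THP_SPLIT_PAGE_FAILED);	/* split_huge_page() failed */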

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Christoph Lameter <cl@linux.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 include/linux/vm_event_item.h | 4 +++-
 mm/huge_memory.c              | 2 +-
 mm/vmstat.c                   | 4 +++-
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 2b1cef88b827..3261bfe2156a 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -69,7 +69,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_FAULT_FALLBACK,
 		THP_COLLAPSE_ALLOC,
 		THP_COLLAPSE_ALLOC_FAILED,
-		THP_SPLIT,
+		THP_SPLIT_PAGE,
+		THP_SPLIT_PAGE_FAILED,
+		THP_SPLIT_PMD,
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
 #endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ccbfacf07160..be6d0e0f5050 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1961,7 +1961,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 
 	BUG_ON(!PageSwapBacked(page));
 	__split_huge_page(page, anon_vma, list);
-	count_vm_event(THP_SPLIT);
+	count_vm_event(THP_SPLIT_PAGE);
 
 	BUG_ON(PageCompound(page));
 out_unlock:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1fd0886a389f..e1c87425fe11 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -821,7 +821,9 @@ const char * const vmstat_text[] = {
 	"thp_fault_fallback",
 	"thp_collapse_alloc",
 	"thp_collapse_alloc_failed",
-	"thp_split",
+	"thp_split_page",
+	"thp_split_page_failed",
+	"thp_split_pmd",
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
 #endif
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 11/28] mm: temporarily mark THP broken
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

Up to this point we have tried to keep the patchset bisectable, but the
next patches are going to change how the core of THP refcounting works.

It would be beneficial to split the change into several patches to make
it more reviewable. Unfortunately, I don't see how we can achieve that
while keeping THP working.

Let's hide THP under CONFIG_BROKEN for now and bring it back once the
new refcounting is established.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 mm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index baeb0c4a686a..2c96d2484527 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -408,7 +408,7 @@ config NOMMU_INITIAL_TRIM_EXCESS
 
 config TRANSPARENT_HUGEPAGE
 	bool "Transparent Hugepage Support"
-	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE
+	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && BROKEN
 	select COMPACTION
 	help
 	  Transparent Hugepages allows the kernel to use huge pages and
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 12/28] thp: drop all split_huge_page()-related code
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We will re-introduce a new version with new refcounting later in the patchset.
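
A minimal sketch of the stop-gap state this patch leaves behind, with an
illustrative caller added (the #defines are restated from the huge_mm.h
hunk below; the caller and its name are made up). It relies on the
previous patch having marked THP BROKEN, so the real THP call sites are
not built in the meantime:

	#include <linux/bug.h>
	#include <linux/mm_types.h>

	/* Each dropped entry point becomes a compile-time trap: BUILD_BUG()
	 * breaks the build wherever the call is actually compiled. */
	#define split_huge_page(page)			BUILD_BUG()
	#define split_huge_pmd(__vma, __pmd, __address)	BUILD_BUG()

	#ifdef CONFIG_TRANSPARENT_HUGEPAGE	/* cannot be set while THP depends on BROKEN */
	static void leftover_caller_sketch(struct page *page)
	{
		/* would fail the build if this block were ever compiled */
		split_huge_page(page);
	}
	#endif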

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 include/linux/huge_mm.h |  28 +---
 mm/huge_memory.c        | 400 +-----------------------------------------------
 2 files changed, 7 insertions(+), 421 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 34bbf769d52e..47f80207782f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -97,28 +97,12 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 #endif /* CONFIG_DEBUG_VM */
 
 extern unsigned long transparent_hugepage_flags;
-extern int split_huge_page_to_list(struct page *page, struct list_head *list);
-static inline int split_huge_page(struct page *page)
-{
-	return split_huge_page_to_list(page, NULL);
-}
-extern void __split_huge_page_pmd(struct vm_area_struct *vma,
-		unsigned long address, pmd_t *pmd);
-#define split_huge_pmd(__vma, __pmd, __address)				\
-	do {								\
-		pmd_t *____pmd = (__pmd);				\
-		if (unlikely(pmd_trans_huge(*____pmd)))			\
-			__split_huge_page_pmd(__vma, __address,		\
-					____pmd);			\
-	}  while (0)
-#define wait_split_huge_page(__anon_vma, __pmd)				\
-	do {								\
-		pmd_t *____pmd = (__pmd);				\
-		anon_vma_lock_write(__anon_vma);			\
-		anon_vma_unlock_write(__anon_vma);			\
-		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
-		       pmd_trans_huge(*____pmd));			\
-	} while (0)
+
+#define split_huge_page_to_list(page, list) BUILD_BUG()
+#define split_huge_page(page) BUILD_BUG()
+#define split_huge_pmd(__vma, __pmd, __address) BUILD_BUG()
+
+#define wait_split_huge_page(__anon_vma, __pmd) BUILD_BUG();
 #if HPAGE_PMD_ORDER >= MAX_ORDER
 #error "hugepages can't be allocated by the buddy allocator"
 #endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index be6d0e0f5050..f3cc576dad73 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1648,329 +1648,6 @@ int pmd_freeable(pmd_t pmd)
 	return !pmd_dirty(pmd);
 }
 
-static int __split_huge_page_splitting(struct page *page,
-				       struct vm_area_struct *vma,
-				       unsigned long address)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	spinlock_t *ptl;
-	pmd_t *pmd;
-	int ret = 0;
-	/* For mmu_notifiers */
-	const unsigned long mmun_start = address;
-	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
-
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-	pmd = page_check_address_pmd(page, mm, address,
-			PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG, &ptl);
-	if (pmd) {
-		/*
-		 * We can't temporarily set the pmd to null in order
-		 * to split it, the pmd must remain marked huge at all
-		 * times or the VM won't take the pmd_trans_huge paths
-		 * and it won't wait on the anon_vma->root->rwsem to
-		 * serialize against split_huge_page*.
-		 */
-		pmdp_splitting_flush(vma, address, pmd);
-
-		ret = 1;
-		spin_unlock(ptl);
-	}
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-
-	return ret;
-}
-
-static void __split_huge_page_refcount(struct page *page,
-				       struct list_head *list)
-{
-	int i;
-	struct zone *zone = page_zone(page);
-	struct lruvec *lruvec;
-	int tail_count = 0;
-
-	/* prevent PageLRU to go away from under us, and freeze lru stats */
-	spin_lock_irq(&zone->lru_lock);
-	lruvec = mem_cgroup_page_lruvec(page, zone);
-
-	compound_lock(page);
-	/* complete memcg works before add pages to LRU */
-	mem_cgroup_split_huge_fixup(page);
-
-	for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
-		struct page *page_tail = page + i;
-
-		/* tail_page->_mapcount cannot change */
-		BUG_ON(page_mapcount(page_tail) < 0);
-		tail_count += page_mapcount(page_tail);
-		/* check for overflow */
-		BUG_ON(tail_count < 0);
-		BUG_ON(atomic_read(&page_tail->_count) != 0);
-		/*
-		 * tail_page->_count is zero and not changing from
-		 * under us. But get_page_unless_zero() may be running
-		 * from under us on the tail_page. If we used
-		 * atomic_set() below instead of atomic_add(), we
-		 * would then run atomic_set() concurrently with
-		 * get_page_unless_zero(), and atomic_set() is
-		 * implemented in C not using locked ops. spin_unlock
-		 * on x86 sometime uses locked ops because of PPro
-		 * errata 66, 92, so unless somebody can guarantee
-		 * atomic_set() here would be safe on all archs (and
-		 * not only on x86), it's safer to use atomic_add().
-		 */
-		atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
-			   &page_tail->_count);
-
-		/* after clearing PageTail the gup refcount can be released */
-		smp_mb__after_atomic();
-
-		/*
-		 * retain hwpoison flag of the poisoned tail page:
-		 *   fix for the unsuitable process killed on Guest Machine(KVM)
-		 *   by the memory-failure.
-		 */
-		page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP | __PG_HWPOISON;
-		page_tail->flags |= (page->flags &
-				     ((1L << PG_referenced) |
-				      (1L << PG_swapbacked) |
-				      (1L << PG_mlocked) |
-				      (1L << PG_uptodate) |
-				      (1L << PG_active) |
-				      (1L << PG_unevictable)));
-		page_tail->flags |= (1L << PG_dirty);
-
-		/* clear PageTail before overwriting first_page */
-		smp_wmb();
-
-		/*
-		 * __split_huge_page_splitting() already set the
-		 * splitting bit in all pmd that could map this
-		 * hugepage, that will ensure no CPU can alter the
-		 * mapcount on the head page. The mapcount is only
-		 * accounted in the head page and it has to be
-		 * transferred to all tail pages in the below code. So
-		 * for this code to be safe, the split the mapcount
-		 * can't change. But that doesn't mean userland can't
-		 * keep changing and reading the page contents while
-		 * we transfer the mapcount, so the pmd splitting
-		 * status is achieved setting a reserved bit in the
-		 * pmd, not by clearing the present bit.
-		*/
-		page_tail->_mapcount = page->_mapcount;
-
-		BUG_ON(page_tail->mapping != TAIL_MAPPING);
-		page_tail->mapping = page->mapping;
-
-		page_tail->index = page->index + i;
-		page_cpupid_xchg_last(page_tail, page_cpupid_last(page));
-
-		BUG_ON(!PageAnon(page_tail));
-		BUG_ON(!PageUptodate(page_tail));
-		BUG_ON(!PageDirty(page_tail));
-		BUG_ON(!PageSwapBacked(page_tail));
-
-		lru_add_page_tail(page, page_tail, lruvec, list);
-	}
-	atomic_sub(tail_count, &page->_count);
-	BUG_ON(atomic_read(&page->_count) <= 0);
-
-	__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
-
-	ClearPageCompound(page);
-	compound_unlock(page);
-	spin_unlock_irq(&zone->lru_lock);
-
-	for (i = 1; i < HPAGE_PMD_NR; i++) {
-		struct page *page_tail = page + i;
-		BUG_ON(page_count(page_tail) <= 0);
-		/*
-		 * Tail pages may be freed if there wasn't any mapping
-		 * like if add_to_swap() is running on a lru page that
-		 * had its mapping zapped. And freeing these pages
-		 * requires taking the lru_lock so we do the put_page
-		 * of the tail pages after the split is complete.
-		 */
-		put_page(page_tail);
-	}
-
-	/*
-	 * Only the head page (now become a regular page) is required
-	 * to be pinned by the caller.
-	 */
-	BUG_ON(page_count(page) <= 0);
-}
-
-static int __split_huge_page_map(struct page *page,
-				 struct vm_area_struct *vma,
-				 unsigned long address)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	spinlock_t *ptl;
-	pmd_t *pmd, _pmd;
-	int ret = 0, i;
-	pgtable_t pgtable;
-	unsigned long haddr;
-
-	pmd = page_check_address_pmd(page, mm, address,
-			PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG, &ptl);
-	if (pmd) {
-		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
-		pmd_populate(mm, &_pmd, pgtable);
-		if (pmd_write(*pmd))
-			BUG_ON(page_mapcount(page) != 1);
-
-		haddr = address;
-		for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
-			pte_t *pte, entry;
-			BUG_ON(PageCompound(page+i));
-			/*
-			 * Note that NUMA hinting access restrictions are not
-			 * transferred to avoid any possibility of altering
-			 * permissions across VMAs.
-			 */
-			entry = mk_pte(page + i, vma->vm_page_prot);
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-			if (!pmd_write(*pmd))
-				entry = pte_wrprotect(entry);
-			if (!pmd_young(*pmd))
-				entry = pte_mkold(entry);
-			pte = pte_offset_map(&_pmd, haddr);
-			BUG_ON(!pte_none(*pte));
-			set_pte_at(mm, haddr, pte, entry);
-			pte_unmap(pte);
-		}
-
-		smp_wmb(); /* make pte visible before pmd */
-		/*
-		 * Up to this point the pmd is present and huge and
-		 * userland has the whole access to the hugepage
-		 * during the split (which happens in place). If we
-		 * overwrite the pmd with the not-huge version
-		 * pointing to the pte here (which of course we could
-		 * if all CPUs were bug free), userland could trigger
-		 * a small page size TLB miss on the small sized TLB
-		 * while the hugepage TLB entry is still established
-		 * in the huge TLB. Some CPU doesn't like that. See
-		 * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
-		 * Erratum 383 on page 93. Intel should be safe but is
-		 * also warns that it's only safe if the permission
-		 * and cache attributes of the two entries loaded in
-		 * the two TLB is identical (which should be the case
-		 * here). But it is generally safer to never allow
-		 * small and huge TLB entries for the same virtual
-		 * address to be loaded simultaneously. So instead of
-		 * doing "pmd_populate(); flush_tlb_range();" we first
-		 * mark the current pmd notpresent (atomically because
-		 * here the pmd_trans_huge and pmd_trans_splitting
-		 * must remain set at all times on the pmd until the
-		 * split is complete for this pmd), then we flush the
-		 * SMP TLB and finally we write the non-huge version
-		 * of the pmd entry with pmd_populate.
-		 */
-		pmdp_invalidate(vma, address, pmd);
-		pmd_populate(mm, pmd, pgtable);
-		ret = 1;
-		spin_unlock(ptl);
-	}
-
-	return ret;
-}
-
-/* must be called with anon_vma->root->rwsem held */
-static void __split_huge_page(struct page *page,
-			      struct anon_vma *anon_vma,
-			      struct list_head *list)
-{
-	int mapcount, mapcount2;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
-	struct anon_vma_chain *avc;
-
-	BUG_ON(!PageHead(page));
-	BUG_ON(PageTail(page));
-
-	mapcount = 0;
-	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
-		struct vm_area_struct *vma = avc->vma;
-		unsigned long addr = vma_address(page, vma);
-		BUG_ON(is_vma_temporary_stack(vma));
-		mapcount += __split_huge_page_splitting(page, vma, addr);
-	}
-	/*
-	 * It is critical that new vmas are added to the tail of the
-	 * anon_vma list. This guarantes that if copy_huge_pmd() runs
-	 * and establishes a child pmd before
-	 * __split_huge_page_splitting() freezes the parent pmd (so if
-	 * we fail to prevent copy_huge_pmd() from running until the
-	 * whole __split_huge_page() is complete), we will still see
-	 * the newly established pmd of the child later during the
-	 * walk, to be able to set it as pmd_trans_splitting too.
-	 */
-	if (mapcount != page_mapcount(page)) {
-		pr_err("mapcount %d page_mapcount %d\n",
-			mapcount, page_mapcount(page));
-		BUG();
-	}
-
-	__split_huge_page_refcount(page, list);
-
-	mapcount2 = 0;
-	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
-		struct vm_area_struct *vma = avc->vma;
-		unsigned long addr = vma_address(page, vma);
-		BUG_ON(is_vma_temporary_stack(vma));
-		mapcount2 += __split_huge_page_map(page, vma, addr);
-	}
-	if (mapcount != mapcount2) {
-		pr_err("mapcount %d mapcount2 %d page_mapcount %d\n",
-			mapcount, mapcount2, page_mapcount(page));
-		BUG();
-	}
-}
-
-/*
- * Split a hugepage into normal pages. This doesn't change the position of head
- * page. If @list is null, tail pages will be added to LRU list, otherwise, to
- * @list. Both head page and tail pages will inherit mapping, flags, and so on
- * from the hugepage.
- * Return 0 if the hugepage is split successfully otherwise return 1.
- */
-int split_huge_page_to_list(struct page *page, struct list_head *list)
-{
-	struct anon_vma *anon_vma;
-	int ret = 1;
-
-	BUG_ON(is_huge_zero_page(page));
-	BUG_ON(!PageAnon(page));
-
-	/*
-	 * The caller does not necessarily hold an mmap_sem that would prevent
-	 * the anon_vma disappearing so we first we take a reference to it
-	 * and then lock the anon_vma for write. This is similar to
-	 * page_lock_anon_vma_read except the write lock is taken to serialise
-	 * against parallel split or collapse operations.
-	 */
-	anon_vma = page_get_anon_vma(page);
-	if (!anon_vma)
-		goto out;
-	anon_vma_lock_write(anon_vma);
-
-	ret = 0;
-	if (!PageCompound(page))
-		goto out_unlock;
-
-	BUG_ON(!PageSwapBacked(page));
-	__split_huge_page(page, anon_vma, list);
-	count_vm_event(THP_SPLIT_PAGE);
-
-	BUG_ON(PageCompound(page));
-out_unlock:
-	anon_vma_unlock_write(anon_vma);
-	put_anon_vma(anon_vma);
-out:
-	return ret;
-}
-
 #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
 
 int hugepage_madvise(struct vm_area_struct *vma,
@@ -2910,81 +2587,6 @@ static int khugepaged(void *none)
 	return 0;
 }
 
-static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
-		unsigned long haddr, pmd_t *pmd)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	pgtable_t pgtable;
-	pmd_t _pmd;
-	int i;
-
-	pmdp_clear_flush_notify(vma, haddr, pmd);
-	/* leave pmd empty until pte is filled */
-
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
-	pmd_populate(mm, &_pmd, pgtable);
-
-	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
-		pte_t *pte, entry;
-		entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
-		entry = pte_mkspecial(entry);
-		pte = pte_offset_map(&_pmd, haddr);
-		VM_BUG_ON(!pte_none(*pte));
-		set_pte_at(mm, haddr, pte, entry);
-		pte_unmap(pte);
-	}
-	smp_wmb(); /* make pte visible before pmd */
-	pmd_populate(mm, pmd, pgtable);
-	put_huge_zero_page();
-}
-
-void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
-		pmd_t *pmd)
-{
-	spinlock_t *ptl;
-	struct page *page;
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long haddr = address & HPAGE_PMD_MASK;
-	unsigned long mmun_start;	/* For mmu_notifiers */
-	unsigned long mmun_end;		/* For mmu_notifiers */
-
-	BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
-
-	mmun_start = haddr;
-	mmun_end   = haddr + HPAGE_PMD_SIZE;
-again:
-	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
-	ptl = pmd_lock(mm, pmd);
-	if (unlikely(!pmd_trans_huge(*pmd))) {
-		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-		return;
-	}
-	if (is_huge_zero_pmd(*pmd)) {
-		__split_huge_zero_page_pmd(vma, haddr, pmd);
-		spin_unlock(ptl);
-		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-		return;
-	}
-	page = pmd_page(*pmd);
-	VM_BUG_ON_PAGE(!page_count(page), page);
-	get_page(page);
-	spin_unlock(ptl);
-	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
-
-	split_huge_page(page);
-
-	put_page(page);
-
-	/*
-	 * We don't always have down_write of mmap_sem here: a racing
-	 * do_huge_pmd_wp_page() might have copied-on-write to another
-	 * huge page before our split_huge_page() got the anon_vma lock.
-	 */
-	if (unlikely(pmd_trans_huge(*pmd)))
-		goto again;
-}
-
 static void split_huge_pmd_address(struct vm_area_struct *vma,
 				    unsigned long address)
 {
@@ -3009,7 +2611,7 @@ static void split_huge_pmd_address(struct vm_area_struct *vma,
 	 * Caller holds the mmap_sem write mode, so a huge pmd cannot
 	 * materialize from under us.
 	 */
-	__split_huge_page_pmd(vma, address, pmd);
+	split_huge_pmd(vma, pmd, address);
 }
 
 void __vma_adjust_trans_huge(struct vm_area_struct *vma,
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 13/28] mm: drop tail page refcounting
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

Tail page refcounting is utterly complicated and painful to support.
It also makes use of ->_mapcount to account for pins on tail pages. We
will need ->_mapcount to account for PTE mappings of subpages of the
compound page instead.

The only user of tail page refcounting is THP, which is marked BROKEN
for now.

Let's drop all this mess. It makes get_page() and put_page() much
simpler.
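
Condensed from the include/linux/mm.h hunk below (the unchanged tail of
get_page() is filled in from the surrounding context, so treat this as a
paraphrase rather than a literal quote), the resulting model is that a
pin on any subpage is simply redirected to the compound head:

	/* Paraphrased end result of this patch: one ->_count per compound
	 * page, no per-tail accounting anywhere. */
	static inline void get_page(struct page *page)
	{
		page = compound_head(page);
		/* taking a pin requires an already-elevated ->_count */
		VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
		atomic_inc(&page->_count);
	}

	static inline void put_page(struct page *page)
	{
		page = compound_head(page);
		if (put_page_testzero(page))
			__put_page(page);	/* last pin: free single or compound page */
	}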

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 arch/mips/mm/gup.c            |   4 -
 arch/powerpc/mm/hugetlbpage.c |  13 +-
 arch/s390/mm/gup.c            |  13 +-
 arch/sparc/mm/gup.c           |  14 +--
 arch/x86/mm/gup.c             |   4 -
 include/linux/mm.h            |  47 ++------
 include/linux/mm_types.h      |  17 +--
 mm/gup.c                      |  34 +-----
 mm/huge_memory.c              |  41 +------
 mm/hugetlb.c                  |   2 +-
 mm/internal.h                 |  44 -------
 mm/swap.c                     | 274 +++---------------------------------------
 12 files changed, 40 insertions(+), 467 deletions(-)

diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
index 349995d19c7f..36a35115dc2e 100644
--- a/arch/mips/mm/gup.c
+++ b/arch/mips/mm/gup.c
@@ -87,8 +87,6 @@ static int gup_huge_pmd(pmd_t pmd, unsigned long addr, unsigned long end,
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
-		if (PageTail(page))
-			get_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;
@@ -153,8 +151,6 @@ static int gup_huge_pud(pud_t pud, unsigned long addr, unsigned long end,
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
-		if (PageTail(page))
-			get_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index cf0464f4284f..f30ae0f7f570 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -1037,7 +1037,7 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 {
 	unsigned long mask;
 	unsigned long pte_end;
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 	pte_t pte;
 	int refs;
 
@@ -1060,7 +1060,6 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 	head = pte_page(pte);
 
 	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
@@ -1082,15 +1081,5 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 		return 0;
 	}
 
-	/*
-	 * Any tail page need their mapcount reference taken before we
-	 * return.
-	 */
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
index 5c586c78ca8d..dab30527ad41 100644
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -52,7 +52,7 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
 	unsigned long mask, result;
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 	int refs;
 
 	result = write ? 0 : _SEGMENT_ENTRY_PROTECT;
@@ -64,7 +64,6 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 	refs = 0;
 	head = pmd_page(pmd);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
@@ -85,16 +84,6 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		return 0;
 	}
 
-	/*
-	 * Any tail page need their mapcount reference taken before we
-	 * return.
-	 */
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
 
diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
index 2e5c4fc2daa9..9091c5daa2e1 100644
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -56,8 +56,6 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
 			put_page(head);
 			return 0;
 		}
-		if (head != page)
-			get_huge_page_tail(page);
 
 		pages[*nr] = page;
 		(*nr)++;
@@ -70,7 +68,7 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 			unsigned long end, int write, struct page **pages,
 			int *nr)
 {
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 	int refs;
 
 	if (!(pmd_val(pmd) & _PAGE_VALID))
@@ -82,7 +80,6 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 	refs = 0;
 	head = pmd_page(pmd);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
@@ -103,15 +100,6 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
 		return 0;
 	}
 
-	/* Any tail page need their mapcount reference taken before we
-	 * return.
-	 */
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
 
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 81bf3d2af3eb..62a887a3cf50 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -137,8 +137,6 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
-		if (PageTail(page))
-			get_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;
@@ -214,8 +212,6 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
-		if (PageTail(page))
-			get_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index efe8417360a2..dd1b5f2b1966 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -449,44 +449,9 @@ static inline int page_count(struct page *page)
 	return atomic_read(&compound_head(page)->_count);
 }
 
-static inline bool __compound_tail_refcounted(struct page *page)
-{
-	return PageAnon(page) && !PageSlab(page) && !PageHeadHuge(page);
-}
-
-/*
- * This takes a head page as parameter and tells if the
- * tail page reference counting can be skipped.
- *
- * For this to be safe, PageSlab and PageHeadHuge must remain true on
- * any given page where they return true here, until all tail pins
- * have been released.
- */
-static inline bool compound_tail_refcounted(struct page *page)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	return __compound_tail_refcounted(page);
-}
-
-static inline void get_huge_page_tail(struct page *page)
-{
-	/*
-	 * __split_huge_page_refcount() cannot run from under us.
-	 */
-	VM_BUG_ON_PAGE(!PageTail(page), page);
-	VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
-	VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
-	if (compound_tail_refcounted(page->first_page))
-		atomic_inc(&page->_mapcount);
-}
-
-extern bool __get_page_tail(struct page *page);
-
 static inline void get_page(struct page *page)
 {
-	if (unlikely(PageTail(page)))
-		if (likely(__get_page_tail(page)))
-			return;
+	page = compound_head(page);
 	/*
 	 * Getting a normal page or the head of a compound page
 	 * requires to already have an elevated page->_count.
@@ -517,7 +482,15 @@ static inline void init_page_count(struct page *page)
 	atomic_set(&page->_count, 1);
 }
 
-void put_page(struct page *page);
+void __put_page(struct page* page);
+
+static inline void put_page(struct page *page)
+{
+	page = compound_head(page);
+	if (put_page_testzero(page))
+		__put_page(page);
+}
+
 void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 590630eb59ba..126f481bb95a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -92,20 +92,9 @@ struct page {
 
 				union {
 					/*
-					 * Count of ptes mapped in
-					 * mms, to show when page is
-					 * mapped & limit reverse map
-					 * searches.
-					 *
-					 * Used also for tail pages
-					 * refcounting instead of
-					 * _count. Tail pages cannot
-					 * be mapped and keeping the
-					 * tail page _count zero at
-					 * all times guarantees
-					 * get_page_unless_zero() will
-					 * never succeed on tail
-					 * pages.
+					 * Count of ptes mapped in mms, to show
+					 * when page is mapped & limit reverse
+					 * map searches.
 					 */
 					atomic_t _mapcount;
 
diff --git a/mm/gup.c b/mm/gup.c
index 19e01f156abb..53f9681b7b30 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -93,7 +93,7 @@ retry:
 	}
 
 	if (flags & FOLL_GET)
-		get_page_foll(page);
+		get_page(page);
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
 		    !pte_dirty(pte) && !PageDirty(page))
@@ -1108,7 +1108,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 	int refs;
 
 	if (write && !pmd_write(orig))
@@ -1117,7 +1117,6 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	refs = 0;
 	head = pmd_page(orig);
 	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
@@ -1138,24 +1137,13 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		return 0;
 	}
 
-	/*
-	 * Any tail pages need their mapcount reference taken before we
-	 * return. (This allows the THP code to bump their ref count when
-	 * they are split into base pages).
-	 */
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
 
 static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 	int refs;
 
 	if (write && !pud_write(orig))
@@ -1164,7 +1152,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	refs = 0;
 	head = pud_page(orig);
 	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
@@ -1185,12 +1172,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		return 0;
 	}
 
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
 
@@ -1199,7 +1180,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 			struct page **pages, int *nr)
 {
 	int refs;
-	struct page *head, *page, *tail;
+	struct page *head, *page;
 
 	if (write && !pgd_write(orig))
 		return 0;
@@ -1207,7 +1188,6 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 	refs = 0;
 	head = pgd_page(orig);
 	page = head + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
-	tail = page;
 	do {
 		VM_BUG_ON_PAGE(compound_head(page) != head, page);
 		pages[*nr] = page;
@@ -1228,12 +1208,6 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 		return 0;
 	}
 
-	while (refs--) {
-		if (PageTail(tail))
-			get_huge_page_tail(tail);
-		tail++;
-	}
-
 	return 1;
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f3cc576dad73..16c6c262385c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -941,37 +941,6 @@ unlock:
 	spin_unlock(ptl);
 }
 
-/*
- * Save CONFIG_DEBUG_PAGEALLOC from faulting falsely on tail pages
- * during copy_user_huge_page()'s copy_page_rep(): in the case when
- * the source page gets split and a tail freed before copy completes.
- * Called under pmd_lock of checked pmd, so safe from splitting itself.
- */
-static void get_user_huge_page(struct page *page)
-{
-	if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
-		struct page *endpage = page + HPAGE_PMD_NR;
-
-		atomic_add(HPAGE_PMD_NR, &page->_count);
-		while (++page < endpage)
-			get_huge_page_tail(page);
-	} else {
-		get_page(page);
-	}
-}
-
-static void put_user_huge_page(struct page *page)
-{
-	if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
-		struct page *endpage = page + HPAGE_PMD_NR;
-
-		while (page < endpage)
-			put_page(page++);
-	} else {
-		put_page(page);
-	}
-}
-
 static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long address,
@@ -1124,7 +1093,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		ret |= VM_FAULT_WRITE;
 		goto out_unlock;
 	}
-	get_user_huge_page(page);
+	get_page(page);
 	spin_unlock(ptl);
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
@@ -1145,7 +1114,7 @@ alloc:
 				split_huge_pmd(vma, pmd, address);
 				ret |= VM_FAULT_FALLBACK;
 			}
-			put_user_huge_page(page);
+			put_page(page);
 		}
 		count_vm_event(THP_FAULT_FALLBACK);
 		goto out;
@@ -1156,7 +1125,7 @@ alloc:
 		put_page(new_page);
 		if (page) {
 			split_huge_pmd(vma, pmd, address);
-			put_user_huge_page(page);
+			put_page(page);
 		} else
 			split_huge_pmd(vma, pmd, address);
 		ret |= VM_FAULT_FALLBACK;
@@ -1178,7 +1147,7 @@ alloc:
 
 	spin_lock(ptl);
 	if (page)
-		put_user_huge_page(page);
+		put_page(page);
 	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
 		spin_unlock(ptl);
 		mem_cgroup_cancel_charge(new_page, memcg, true);
@@ -1263,7 +1232,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 	if (flags & FOLL_GET)
-		get_page_foll(page);
+		get_page(page);
 
 out:
 	return page;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index eb2a0430535e..f27d4edada3a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3453,7 +3453,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 same_page:
 		if (pages) {
 			pages[i] = mem_map_offset(page, pfn_offset);
-			get_page_foll(pages[i]);
+			get_page(pages[i]);
 		}
 
 		if (vmas)
diff --git a/mm/internal.h b/mm/internal.h
index a25e359a4039..98bce4d12a16 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -47,50 +47,6 @@ static inline void set_page_refcounted(struct page *page)
 	set_page_count(page, 1);
 }
 
-static inline void __get_page_tail_foll(struct page *page,
-					bool get_page_head)
-{
-	/*
-	 * If we're getting a tail page, the elevated page->_count is
-	 * required only in the head page and we will elevate the head
-	 * page->_count and tail page->_mapcount.
-	 *
-	 * We elevate page_tail->_mapcount for tail pages to force
-	 * page_tail->_count to be zero at all times to avoid getting
-	 * false positives from get_page_unless_zero() with
-	 * speculative page access (like in
-	 * page_cache_get_speculative()) on tail pages.
-	 */
-	VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
-	if (get_page_head)
-		atomic_inc(&page->first_page->_count);
-	get_huge_page_tail(page);
-}
-
-/*
- * This is meant to be called as the FOLL_GET operation of
- * follow_page() and it must be called while holding the proper PT
- * lock while the pte (or pmd_trans_huge) is still mapping the page.
- */
-static inline void get_page_foll(struct page *page)
-{
-	if (unlikely(PageTail(page)))
-		/*
-		 * This is safe only because
-		 * __split_huge_page_refcount() can't run under
-		 * get_page_foll() because we hold the proper PT lock.
-		 */
-		__get_page_tail_foll(page, true);
-	else {
-		/*
-		 * Getting a normal page or the head of a compound page
-		 * requires to already have an elevated page->_count.
-		 */
-		VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
-		atomic_inc(&page->_count);
-	}
-}
-
 extern unsigned long highest_memmap_pfn;
 
 /*
diff --git a/mm/swap.c b/mm/swap.c
index 8773de093171..39166c05e5f3 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -89,261 +89,14 @@ static void __put_compound_page(struct page *page)
 	(*dtor)(page);
 }
 
-/**
- * Two special cases here: we could avoid taking compound_lock_irqsave
- * and could skip the tail refcounting(in _mapcount).
- *
- * 1. Hugetlbfs page:
- *
- *    PageHeadHuge will remain true until the compound page
- *    is released and enters the buddy allocator, and it could
- *    not be split by __split_huge_page_refcount().
- *
- *    So if we see PageHeadHuge set, and we have the tail page pin,
- *    then we could safely put head page.
- *
- * 2. Slab THP page:
- *
- *    PG_slab is cleared before the slab frees the head page, and
- *    tail pin cannot be the last reference left on the head page,
- *    because the slab code is free to reuse the compound page
- *    after a kfree/kmem_cache_free without having to check if
- *    there's any tail pin left.  In turn all tail pinsmust be always
- *    released while the head is still pinned by the slab code
- *    and so we know PG_slab will be still set too.
- *
- *    So if we see PageSlab set, and we have the tail page pin,
- *    then we could safely put head page.
- */
-static __always_inline
-void put_unrefcounted_compound_page(struct page *page_head, struct page *page)
-{
-	/*
-	 * If @page is a THP tail, we must read the tail page
-	 * flags after the head page flags. The
-	 * __split_huge_page_refcount side enforces write memory barriers
-	 * between clearing PageTail and before the head page
-	 * can be freed and reallocated.
-	 */
-	smp_rmb();
-	if (likely(PageTail(page))) {
-		/*
-		 * __split_huge_page_refcount cannot race
-		 * here, see the comment above this function.
-		 */
-		VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
-		VM_BUG_ON_PAGE(page_mapcount(page) != 0, page);
-		if (put_page_testzero(page_head)) {
-			/*
-			 * If this is the tail of a slab THP page,
-			 * the tail pin must not be the last reference
-			 * held on the page, because the PG_slab cannot
-			 * be cleared before all tail pins (which skips
-			 * the _mapcount tail refcounting) have been
-			 * released.
-			 *
-			 * If this is the tail of a hugetlbfs page,
-			 * the tail pin may be the last reference on
-			 * the page instead, because PageHeadHuge will
-			 * not go away until the compound page enters
-			 * the buddy allocator.
-			 */
-			VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
-			__put_compound_page(page_head);
-		}
-	} else
-		/*
-		 * __split_huge_page_refcount run before us,
-		 * @page was a THP tail. The split @page_head
-		 * has been freed and reallocated as slab or
-		 * hugetlbfs page of smaller order (only
-		 * possible if reallocated as slab on x86).
-		 */
-		if (put_page_testzero(page))
-			__put_single_page(page);
-}
-
-static __always_inline
-void put_refcounted_compound_page(struct page *page_head, struct page *page)
-{
-	if (likely(page != page_head && get_page_unless_zero(page_head))) {
-		unsigned long flags;
-
-		/*
-		 * @page_head wasn't a dangling pointer but it may not
-		 * be a head page anymore by the time we obtain the
-		 * lock. That is ok as long as it can't be freed from
-		 * under us.
-		 */
-		flags = compound_lock_irqsave(page_head);
-		if (unlikely(!PageTail(page))) {
-			/* __split_huge_page_refcount run before us */
-			compound_unlock_irqrestore(page_head, flags);
-			if (put_page_testzero(page_head)) {
-				/*
-				 * The @page_head may have been freed
-				 * and reallocated as a compound page
-				 * of smaller order and then freed
-				 * again.  All we know is that it
-				 * cannot have become: a THP page, a
-				 * compound page of higher order, a
-				 * tail page.  That is because we
-				 * still hold the refcount of the
-				 * split THP tail and page_head was
-				 * the THP head before the split.
-				 */
-				if (PageHead(page_head))
-					__put_compound_page(page_head);
-				else
-					__put_single_page(page_head);
-			}
-out_put_single:
-			if (put_page_testzero(page))
-				__put_single_page(page);
-			return;
-		}
-		VM_BUG_ON_PAGE(page_head != page->first_page, page);
-		/*
-		 * We can release the refcount taken by
-		 * get_page_unless_zero() now that
-		 * __split_huge_page_refcount() is blocked on the
-		 * compound_lock.
-		 */
-		if (put_page_testzero(page_head))
-			VM_BUG_ON_PAGE(1, page_head);
-		/* __split_huge_page_refcount will wait now */
-		VM_BUG_ON_PAGE(page_mapcount(page) <= 0, page);
-		atomic_dec(&page->_mapcount);
-		VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page_head);
-		VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
-		compound_unlock_irqrestore(page_head, flags);
-
-		if (put_page_testzero(page_head)) {
-			if (PageHead(page_head))
-				__put_compound_page(page_head);
-			else
-				__put_single_page(page_head);
-		}
-	} else {
-		/* @page_head is a dangling pointer */
-		VM_BUG_ON_PAGE(PageTail(page), page);
-		goto out_put_single;
-	}
-}
-
-static void put_compound_page(struct page *page)
-{
-	struct page *page_head;
-
-	/*
-	 * We see the PageCompound set and PageTail not set, so @page maybe:
-	 *  1. hugetlbfs head page, or
-	 *  2. THP head page.
-	 */
-	if (likely(!PageTail(page))) {
-		if (put_page_testzero(page)) {
-			/*
-			 * By the time all refcounts have been released
-			 * split_huge_page cannot run anymore from under us.
-			 */
-			if (PageHead(page))
-				__put_compound_page(page);
-			else
-				__put_single_page(page);
-		}
-		return;
-	}
-
-	/*
-	 * We see the PageCompound set and PageTail set, so @page maybe:
-	 *  1. a tail hugetlbfs page, or
-	 *  2. a tail THP page, or
-	 *  3. a split THP page.
-	 *
-	 *  Case 3 is possible, as we may race with
-	 *  __split_huge_page_refcount tearing down a THP page.
-	 */
-	page_head = compound_head_by_tail(page);
-	if (!__compound_tail_refcounted(page_head))
-		put_unrefcounted_compound_page(page_head, page);
-	else
-		put_refcounted_compound_page(page_head, page);
-}
-
-void put_page(struct page *page)
+void __put_page(struct page *page)
 {
 	if (unlikely(PageCompound(page)))
-		put_compound_page(page);
-	else if (put_page_testzero(page))
+		__put_compound_page(page);
+	else
 		__put_single_page(page);
 }
-EXPORT_SYMBOL(put_page);
-
-/*
- * This function is exported but must not be called by anything other
- * than get_page(). It implements the slow path of get_page().
- */
-bool __get_page_tail(struct page *page)
-{
-	/*
-	 * This takes care of get_page() if run on a tail page
-	 * returned by one of the get_user_pages/follow_page variants.
-	 * get_user_pages/follow_page itself doesn't need the compound
-	 * lock because it runs __get_page_tail_foll() under the
-	 * proper PT lock that already serializes against
-	 * split_huge_page().
-	 */
-	unsigned long flags;
-	bool got;
-	struct page *page_head = compound_head(page);
-
-	/* Ref to put_compound_page() comment. */
-	if (!__compound_tail_refcounted(page_head)) {
-		smp_rmb();
-		if (likely(PageTail(page))) {
-			/*
-			 * This is a hugetlbfs page or a slab
-			 * page. __split_huge_page_refcount
-			 * cannot race here.
-			 */
-			VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
-			__get_page_tail_foll(page, true);
-			return true;
-		} else {
-			/*
-			 * __split_huge_page_refcount run
-			 * before us, "page" was a THP
-			 * tail. The split page_head has been
-			 * freed and reallocated as slab or
-			 * hugetlbfs page of smaller order
-			 * (only possible if reallocated as
-			 * slab on x86).
-			 */
-			return false;
-		}
-	}
-
-	got = false;
-	if (likely(page != page_head && get_page_unless_zero(page_head))) {
-		/*
-		 * page_head wasn't a dangling pointer but it
-		 * may not be a head page anymore by the time
-		 * we obtain the lock. That is ok as long as it
-		 * can't be freed from under us.
-		 */
-		flags = compound_lock_irqsave(page_head);
-		/* here __split_huge_page_refcount won't run anymore */
-		if (likely(PageTail(page))) {
-			__get_page_tail_foll(page, false);
-			got = true;
-		}
-		compound_unlock_irqrestore(page_head, flags);
-		if (unlikely(!got))
-			put_page(page_head);
-	}
-	return got;
-}
-EXPORT_SYMBOL(__get_page_tail);
+EXPORT_SYMBOL(__put_page);
 
 /**
  * put_pages_list() - release a list of pages
@@ -960,15 +713,6 @@ void release_pages(struct page **pages, int nr, bool cold)
 	for (i = 0; i < nr; i++) {
 		struct page *page = pages[i];
 
-		if (unlikely(PageCompound(page))) {
-			if (zone) {
-				spin_unlock_irqrestore(&zone->lru_lock, flags);
-				zone = NULL;
-			}
-			put_compound_page(page);
-			continue;
-		}
-
 		/*
 		 * Make sure the IRQ-safe lock-holding time does not get
 		 * excessive with a continuous string of pages from the
@@ -979,9 +723,19 @@ void release_pages(struct page **pages, int nr, bool cold)
 			zone = NULL;
 		}
 
+		page = compound_head(page);
 		if (!put_page_testzero(page))
 			continue;
 
+		if (PageCompound(page)) {
+			if (zone) {
+				spin_unlock_irqrestore(&zone->lru_lock, flags);
+				zone = NULL;
+			}
+			__put_compound_page(page);
+			continue;
+		}
+
 		if (PageLRU(page)) {
 			struct zone *pagezone = page_zone(page);
 
-- 
2.1.4
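
To make the net effect of this patch concrete: every get/put now acts on the
compound head, and only the final put of the head reaches the compound
destructor. A minimal sketch of the resulting fast paths (condensed from the
include/linux/mm.h and mm/swap.c changes in this patch, not a verbatim copy):

    static inline void get_page(struct page *page)
    {
            page = compound_head(page);
            /* caller must already hold a reference on the compound page */
            VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
            atomic_inc(&page->_count);
    }

    static inline void put_page(struct page *page)
    {
            page = compound_head(page);
            if (put_page_testzero(page))
                    __put_page(page);       /* single or compound slow path */
    }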


^ permalink raw reply related	[flat|nested] 189+ messages in thread


* [PATCHv5 14/28] futex, thp: remove special case for THP in get_futex_key
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new THP refcounting, we don't need tricks to stabilize the huge
page: if we've got a reference to a tail page, it can't be split under us.

This patch effectively reverts a5b338f2b0b1.
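
Concretely, the page_head dance collapses into using the pinned page
directly. A condensed before/after view of the hunk below (error handling
omitted):

    /* before: find and pin the head page separately (plus IRQ games when
     * CONFIG_TRANSPARENT_HUGEPAGE is set) */
    page_head = compound_head(page);
    if (page != page_head) {
            get_page(page_head);
            put_page(page);
    }
    lock_page(page_head);

    /* after: the reference taken by get_user_pages_fast() already pins the
     * whole compound page, so it cannot be split while we hold it */
    lock_page(page);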

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 kernel/futex.c | 61 ++++++++++++----------------------------------------------
 1 file changed, 12 insertions(+), 49 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index f4d8a85641ed..cf0192e60ef9 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -399,7 +399,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
-	struct page *page, *page_head;
+	struct page *page;
 	int err, ro = 0;
 
 	/*
@@ -442,46 +442,9 @@ again:
 	else
 		err = 0;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	page_head = page;
-	if (unlikely(PageTail(page))) {
-		put_page(page);
-		/* serialize against __split_huge_page_splitting() */
-		local_irq_disable();
-		if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {
-			page_head = compound_head(page);
-			/*
-			 * page_head is valid pointer but we must pin
-			 * it before taking the PG_lock and/or
-			 * PG_compound_lock. The moment we re-enable
-			 * irqs __split_huge_page_splitting() can
-			 * return and the head page can be freed from
-			 * under us. We can't take the PG_lock and/or
-			 * PG_compound_lock on a page that could be
-			 * freed from under us.
-			 */
-			if (page != page_head) {
-				get_page(page_head);
-				put_page(page);
-			}
-			local_irq_enable();
-		} else {
-			local_irq_enable();
-			goto again;
-		}
-	}
-#else
-	page_head = compound_head(page);
-	if (page != page_head) {
-		get_page(page_head);
-		put_page(page);
-	}
-#endif
-
-	lock_page(page_head);
-
+	lock_page(page);
 	/*
-	 * If page_head->mapping is NULL, then it cannot be a PageAnon
+	 * If page->mapping is NULL, then it cannot be a PageAnon
 	 * page; but it might be the ZERO_PAGE or in the gate area or
 	 * in a special mapping (all cases which we are happy to fail);
 	 * or it may have been a good file page when get_user_pages_fast
@@ -493,12 +456,12 @@ again:
 	 *
 	 * The case we do have to guard against is when memory pressure made
 	 * shmem_writepage move it from filecache to swapcache beneath us:
-	 * an unlikely race, but we do need to retry for page_head->mapping.
+	 * an unlikely race, but we do need to retry for page->mapping.
 	 */
-	if (!page_head->mapping) {
-		int shmem_swizzled = PageSwapCache(page_head);
-		unlock_page(page_head);
-		put_page(page_head);
+	if (!page->mapping) {
+		int shmem_swizzled = PageSwapCache(page);
+		unlock_page(page);
+		put_page(page);
 		if (shmem_swizzled)
 			goto again;
 		return -EFAULT;
@@ -511,7 +474,7 @@ again:
 	 * it's a read-only handle, it's expected that futexes attach to
 	 * the object not the particular process.
 	 */
-	if (PageAnon(page_head)) {
+	if (PageAnon(page)) {
 		/*
 		 * A RO anonymous page will never change and thus doesn't make
 		 * sense for futex operations.
@@ -526,15 +489,15 @@ again:
 		key->private.address = address;
 	} else {
 		key->both.offset |= FUT_OFF_INODE; /* inode-based key */
-		key->shared.inode = page_head->mapping->host;
+		key->shared.inode = page->mapping->host;
 		key->shared.pgoff = basepage_index(page);
 	}
 
 	get_futex_key_refs(key); /* implies MB (B) */
 
 out:
-	unlock_page(page_head);
-	put_page(page_head);
+	unlock_page(page);
+	put_page(page);
 	return err;
 }
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 15/28] ksm: prepare to new THP semantics
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We don't need special code to stabilize THP: if you've got a reference to
any subpage of a THP, it will not be split under you.

The new split_huge_page() also accepts tail pages, so there is no need for
special code to take a reference on the head page.
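
The merge path can now simply try to split the page under the page lock.
Condensed from the try_to_merge_one_page() hunk below:

    if (!trylock_page(page))
            goto out;

    if (PageTransCompound(page)) {
            /*
             * split_huge_page() accepts any subpage now; if the split
             * fails (the page may be pinned), skip this page for now.
             */
            err = split_huge_page(page);
            if (err)
                    goto out_unlock;
    }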

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 mm/ksm.c | 57 ++++++++++-----------------------------------------------
 1 file changed, 10 insertions(+), 47 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index fe09f3ddc912..fb333d8188fc 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -441,20 +441,6 @@ static void break_cow(struct rmap_item *rmap_item)
 	up_read(&mm->mmap_sem);
 }
 
-static struct page *page_trans_compound_anon(struct page *page)
-{
-	if (PageTransCompound(page)) {
-		struct page *head = compound_head(page);
-		/*
-		 * head may actually be splitted and freed from under
-		 * us but it's ok here.
-		 */
-		if (PageAnon(head))
-			return head;
-	}
-	return NULL;
-}
-
 static struct page *get_mergeable_page(struct rmap_item *rmap_item)
 {
 	struct mm_struct *mm = rmap_item->mm;
@@ -470,7 +456,7 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
 	page = follow_page(vma, addr, FOLL_GET);
 	if (IS_ERR_OR_NULL(page))
 		goto out;
-	if (PageAnon(page) || page_trans_compound_anon(page)) {
+	if (PageAnon(page)) {
 		flush_anon_page(vma, page, addr);
 		flush_dcache_page(page);
 	} else {
@@ -976,33 +962,6 @@ out:
 	return err;
 }
 
-static int page_trans_compound_anon_split(struct page *page)
-{
-	int ret = 0;
-	struct page *transhuge_head = page_trans_compound_anon(page);
-	if (transhuge_head) {
-		/* Get the reference on the head to split it. */
-		if (get_page_unless_zero(transhuge_head)) {
-			/*
-			 * Recheck we got the reference while the head
-			 * was still anonymous.
-			 */
-			if (PageAnon(transhuge_head))
-				ret = split_huge_page(transhuge_head);
-			else
-				/*
-				 * Retry later if split_huge_page run
-				 * from under us.
-				 */
-				ret = 1;
-			put_page(transhuge_head);
-		} else
-			/* Retry later if split_huge_page run from under us. */
-			ret = 1;
-	}
-	return ret;
-}
-
 /*
  * try_to_merge_one_page - take two pages and merge them into one
  * @vma: the vma that holds the pte pointing to page
@@ -1023,9 +982,6 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
 
 	if (!(vma->vm_flags & VM_MERGEABLE))
 		goto out;
-	if (PageTransCompound(page) && page_trans_compound_anon_split(page))
-		goto out;
-	BUG_ON(PageTransCompound(page));
 	if (!PageAnon(page))
 		goto out;
 
@@ -1038,6 +994,13 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
 	 */
 	if (!trylock_page(page))
 		goto out;
+
+	if (PageTransCompound(page)) {
+		err = split_huge_page(page);
+		if (err)
+			goto out_unlock;
+	}
+
 	/*
 	 * If this anonymous page is mapped only here, its pte may need
 	 * to be write-protected.  If it's mapped elsewhere, all of its
@@ -1068,6 +1031,7 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
 		}
 	}
 
+out_unlock:
 	unlock_page(page);
 out:
 	return err;
@@ -1620,8 +1584,7 @@ next_mm:
 				cond_resched();
 				continue;
 			}
-			if (PageAnon(*page) ||
-			    page_trans_compound_anon(*page)) {
+			if (PageAnon(*page)) {
 				flush_anon_page(vma, *page, ksm_scan.address);
 				flush_dcache_page(*page);
 				rmap_item = get_next_rmap_item(slot,
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 16/28] mm, thp: remove compound_lock
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We are going to use migration entries to stabilize page counts, which means
we no longer need compound_lock() for that.
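
The stabilization itself is introduced later in the series; the idea, as an
illustrative pseudo-C sketch (the helper names below are made up for the
example, not the real functions):

    /*
     * While splitting, the page's PMD/PTE mappings are replaced with
     * migration entries, so no new reference can be taken through the
     * page tables and the refcount stays stable without a per-page lock.
     */
    freeze_mappings(page);                  /* mappings -> migration entries */
    if (page_count(page) == expected_refs)  /* only our own pins remain */
            do_the_split(page);             /* counts cannot race with us */
    unfreeze_mappings(page);                /* restore mappings, split or not */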

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 include/linux/mm.h         | 35 -----------------------------------
 include/linux/page-flags.h | 12 +-----------
 mm/debug.c                 |  3 ---
 3 files changed, 1 insertion(+), 49 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dd1b5f2b1966..dad667d99304 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -393,41 +393,6 @@ static inline int is_vmalloc_or_module_addr(const void *x)
 
 extern void kvfree(const void *addr);
 
-static inline void compound_lock(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	VM_BUG_ON_PAGE(PageSlab(page), page);
-	bit_spin_lock(PG_compound_lock, &page->flags);
-#endif
-}
-
-static inline void compound_unlock(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	VM_BUG_ON_PAGE(PageSlab(page), page);
-	bit_spin_unlock(PG_compound_lock, &page->flags);
-#endif
-}
-
-static inline unsigned long compound_lock_irqsave(struct page *page)
-{
-	unsigned long uninitialized_var(flags);
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	local_irq_save(flags);
-	compound_lock(page);
-#endif
-	return flags;
-}
-
-static inline void compound_unlock_irqrestore(struct page *page,
-					      unsigned long flags)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	compound_unlock(page);
-	local_irq_restore(flags);
-#endif
-}
-
 /*
  * The atomic page->_mapcount, starts from -1: so that transitions
  * both from it and to it can be tracked, using atomic_inc_and_test
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 91b7f9b2b774..74b7cece1dfa 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -106,9 +106,6 @@ enum pageflags {
 #ifdef CONFIG_MEMORY_FAILURE
 	PG_hwpoison,		/* hardware poisoned page. Don't touch */
 #endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	PG_compound_lock,
-#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -683,12 +680,6 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 #define __PG_MLOCKED		0
 #endif
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define __PG_COMPOUND_LOCK		(1 << PG_compound_lock)
-#else
-#define __PG_COMPOUND_LOCK		0
-#endif
-
 /*
  * Flags checked when a page is freed.  Pages being freed should not have
  * these flags set.  It they are, there is a problem.
@@ -698,8 +689,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 	 1 << PG_private | 1 << PG_private_2 | \
 	 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
-	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
-	 __PG_COMPOUND_LOCK)
+	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON )
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.
diff --git a/mm/debug.c b/mm/debug.c
index 3eb3ac2fcee7..9dfcd77e7354 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -45,9 +45,6 @@ static const struct trace_print_flags pageflag_names[] = {
 #ifdef CONFIG_MEMORY_FAILURE
 	{1UL << PG_hwpoison,		"hwpoison"	},
 #endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	{1UL << PG_compound_lock,	"compound_lock"	},
-#endif
 };
 
 static void dump_flags(unsigned long flags,
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 16/28] mm, thp: remove compound_lock
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  0 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We are going to use migration entries to stabilize page counts. It means
we don't need compound_lock() for that.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 include/linux/mm.h         | 35 -----------------------------------
 include/linux/page-flags.h | 12 +-----------
 mm/debug.c                 |  3 ---
 3 files changed, 1 insertion(+), 49 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dd1b5f2b1966..dad667d99304 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -393,41 +393,6 @@ static inline int is_vmalloc_or_module_addr(const void *x)
 
 extern void kvfree(const void *addr);
 
-static inline void compound_lock(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	VM_BUG_ON_PAGE(PageSlab(page), page);
-	bit_spin_lock(PG_compound_lock, &page->flags);
-#endif
-}
-
-static inline void compound_unlock(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	VM_BUG_ON_PAGE(PageSlab(page), page);
-	bit_spin_unlock(PG_compound_lock, &page->flags);
-#endif
-}
-
-static inline unsigned long compound_lock_irqsave(struct page *page)
-{
-	unsigned long uninitialized_var(flags);
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	local_irq_save(flags);
-	compound_lock(page);
-#endif
-	return flags;
-}
-
-static inline void compound_unlock_irqrestore(struct page *page,
-					      unsigned long flags)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	compound_unlock(page);
-	local_irq_restore(flags);
-#endif
-}
-
 /*
  * The atomic page->_mapcount, starts from -1: so that transitions
  * both from it and to it can be tracked, using atomic_inc_and_test
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 91b7f9b2b774..74b7cece1dfa 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -106,9 +106,6 @@ enum pageflags {
 #ifdef CONFIG_MEMORY_FAILURE
 	PG_hwpoison,		/* hardware poisoned page. Don't touch */
 #endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	PG_compound_lock,
-#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -683,12 +680,6 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 #define __PG_MLOCKED		0
 #endif
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define __PG_COMPOUND_LOCK		(1 << PG_compound_lock)
-#else
-#define __PG_COMPOUND_LOCK		0
-#endif
-
 /*
  * Flags checked when a page is freed.  Pages being freed should not have
  * these flags set.  It they are, there is a problem.
@@ -698,8 +689,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 	 1 << PG_private | 1 << PG_private_2 | \
 	 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
-	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
-	 __PG_COMPOUND_LOCK)
+	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON )
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.
diff --git a/mm/debug.c b/mm/debug.c
index 3eb3ac2fcee7..9dfcd77e7354 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -45,9 +45,6 @@ static const struct trace_print_flags pageflag_names[] = {
 #ifdef CONFIG_MEMORY_FAILURE
 	{1UL << PG_hwpoison,		"hwpoison"	},
 #endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	{1UL << PG_compound_lock,	"compound_lock"	},
-#endif
 };
 
 static void dump_flags(unsigned long flags,
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 17/28] mm, thp: remove infrastructure for handling splitting PMDs
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting we don't need to mark PMDs as splitting. Let's
drop the code that handles this.

Arch-specific code will be removed separately.
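
For illustration only (not part of this patch), the caller pattern
after this change: pmd_trans_huge_lock() either takes the page table
lock and returns non-zero for a stable huge PMD, or returns zero.  The
"== 1" comparison and the splitting check are gone.  A minimal sketch;
the callback name is made up:

	/* Sketch of a page-walk callback using the new convention. */
	static int sketch_pmd_entry(pmd_t *pmd, struct vm_area_struct *vma)
	{
		spinlock_t *ptl;

		if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
			/* the huge PMD is stable while ptl is held */
			spin_unlock(ptl);
			return 0;
		}
		/* not a huge PMD: fall back to PTE-level handling */
		return 0;
	}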

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 fs/proc/task_mmu.c            |  8 +++----
 include/asm-generic/pgtable.h |  5 ----
 include/linux/huge_mm.h       |  9 --------
 mm/gup.c                      |  7 ------
 mm/huge_memory.c              | 54 ++++++++-----------------------------------
 mm/memcontrol.c               | 14 ++---------
 mm/memory.c                   | 18 ++-------------
 mm/mincore.c                  |  2 +-
 mm/pgtable-generic.c          | 14 -----------
 mm/rmap.c                     |  4 +---
 10 files changed, 20 insertions(+), 115 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 95bc384ee3f7..edd63c40ed71 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -534,7 +534,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		smaps_pmd_entry(pmd, addr, walk);
 		spin_unlock(ptl);
 		return 0;
@@ -799,7 +799,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	struct page *page;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
 			clear_soft_dirty_pmd(vma, addr, pmd);
 			goto out;
@@ -1112,7 +1112,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *pte, *orig_pte;
 	int err = 0;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		int pmd_flags2;
 
 		if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd))
@@ -1416,7 +1416,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	pte_t *orig_pte;
 	pte_t *pte;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		pte_t huge_pte = *(pte_t *)pmd;
 		struct page *page;
 
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 39f1d6a2b04d..fe617b7e4be6 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -184,11 +184,6 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
-#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
-				 unsigned long address, pmd_t *pmdp);
-#endif
-
 #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
 extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 				       pgtable_t pgtable);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 47f80207782f..0382230b490f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -49,15 +49,9 @@ enum transparent_hugepage_flag {
 #endif
 };
 
-enum page_check_address_pmd_flag {
-	PAGE_CHECK_ADDRESS_PMD_FLAG,
-	PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
-	PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
-};
 extern pmd_t *page_check_address_pmd(struct page *page,
 				     struct mm_struct *mm,
 				     unsigned long address,
-				     enum page_check_address_pmd_flag flag,
 				     spinlock_t **ptl);
 extern int pmd_freeable(pmd_t pmd);
 
@@ -102,7 +96,6 @@ extern unsigned long transparent_hugepage_flags;
 #define split_huge_page(page) BUILD_BUG()
 #define split_huge_pmd(__vma, __pmd, __address) BUILD_BUG()
 
-#define wait_split_huge_page(__anon_vma, __pmd) BUILD_BUG();
 #if HPAGE_PMD_ORDER >= MAX_ORDER
 #error "hugepages can't be allocated by the buddy allocator"
 #endif
@@ -169,8 +162,6 @@ static inline int split_huge_page(struct page *page)
 {
 	return 0;
 }
-#define wait_split_huge_page(__anon_vma, __pmd)	\
-	do { } while (0)
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
 static inline int hugepage_madvise(struct vm_area_struct *vma,
diff --git a/mm/gup.c b/mm/gup.c
index 53f9681b7b30..0cebfa76fd0c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -207,13 +207,6 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
 		spin_unlock(ptl);
 		return follow_page_pte(vma, address, pmd, flags);
 	}
-
-	if (unlikely(pmd_trans_splitting(*pmd))) {
-		spin_unlock(ptl);
-		wait_split_huge_page(vma->anon_vma, pmd);
-		return follow_page_pte(vma, address, pmd, flags);
-	}
-
 	if (flags & FOLL_SPLIT) {
 		int ret;
 		page = pmd_page(*pmd);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 16c6c262385c..23181f836b62 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -889,15 +889,6 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		goto out_unlock;
 	}
 
-	if (unlikely(pmd_trans_splitting(pmd))) {
-		/* split huge page running from under us */
-		spin_unlock(src_ptl);
-		spin_unlock(dst_ptl);
-		pte_free(dst_mm, pgtable);
-
-		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
-		goto out;
-	}
 	src_page = pmd_page(pmd);
 	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
 	get_page(src_page);
@@ -1403,7 +1394,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int ret = 0;
 
-	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (__pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		struct page *page;
 		pgtable_t pgtable;
 		pmd_t orig_pmd;
@@ -1443,7 +1434,6 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 		  pmd_t *old_pmd, pmd_t *new_pmd)
 {
 	spinlock_t *old_ptl, *new_ptl;
-	int ret = 0;
 	pmd_t pmd;
 
 	struct mm_struct *mm = vma->vm_mm;
@@ -1452,7 +1442,7 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 	    (new_addr & ~HPAGE_PMD_MASK) ||
 	    old_end - old_addr < HPAGE_PMD_SIZE ||
 	    (new_vma->vm_flags & VM_NOHUGEPAGE))
-		goto out;
+		return 0;
 
 	/*
 	 * The destination pmd shouldn't be established, free_pgtables()
@@ -1460,15 +1450,14 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 	 */
 	if (WARN_ON(!pmd_none(*new_pmd))) {
 		VM_BUG_ON(pmd_trans_huge(*new_pmd));
-		goto out;
+		return 0;
 	}
 
 	/*
 	 * We don't have to worry about the ordering of src and dst
 	 * ptlocks because exclusive mmap_sem prevents deadlock.
 	 */
-	ret = __pmd_trans_huge_lock(old_pmd, vma, &old_ptl);
-	if (ret == 1) {
+	if (__pmd_trans_huge_lock(old_pmd, vma, &old_ptl)) {
 		new_ptl = pmd_lockptr(mm, new_pmd);
 		if (new_ptl != old_ptl)
 			spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
@@ -1484,9 +1473,9 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
 		if (new_ptl != old_ptl)
 			spin_unlock(new_ptl);
 		spin_unlock(old_ptl);
+		return 1;
 	}
-out:
-	return ret;
+	return 0;
 }
 
 /*
@@ -1502,7 +1491,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	spinlock_t *ptl;
 	int ret = 0;
 
-	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (__pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		pmd_t entry;
 		bool preserve_write = prot_numa && pmd_write(*pmd);
 		ret = 1;
@@ -1543,17 +1532,8 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
 		spinlock_t **ptl)
 {
 	*ptl = pmd_lock(vma->vm_mm, pmd);
-	if (likely(pmd_trans_huge(*pmd))) {
-		if (unlikely(pmd_trans_splitting(*pmd))) {
-			spin_unlock(*ptl);
-			wait_split_huge_page(vma->anon_vma, pmd);
-			return -1;
-		} else {
-			/* Thp mapped by 'pmd' is stable, so we can
-			 * handle it as it is. */
-			return 1;
-		}
-	}
+	if (likely(pmd_trans_huge(*pmd)))
+		return 1;
 	spin_unlock(*ptl);
 	return 0;
 }
@@ -1569,7 +1549,6 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
 pmd_t *page_check_address_pmd(struct page *page,
 			      struct mm_struct *mm,
 			      unsigned long address,
-			      enum page_check_address_pmd_flag flag,
 			      spinlock_t **ptl)
 {
 	pgd_t *pgd;
@@ -1592,21 +1571,8 @@ pmd_t *page_check_address_pmd(struct page *page,
 		goto unlock;
 	if (pmd_page(*pmd) != page)
 		goto unlock;
-	/*
-	 * split_vma() may create temporary aliased mappings. There is
-	 * no risk as long as all huge pmd are found and have their
-	 * splitting bit set before __split_huge_page_refcount
-	 * runs. Finding the same huge pmd more than once during the
-	 * same rmap walk is not a problem.
-	 */
-	if (flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
-	    pmd_trans_splitting(*pmd))
-		goto unlock;
-	if (pmd_trans_huge(*pmd)) {
-		VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
-			  !pmd_trans_splitting(*pmd));
+	if (pmd_trans_huge(*pmd))
 		return pmd;
-	}
 unlock:
 	spin_unlock(*ptl);
 	return NULL;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f659d4f77138..1bc6a77067ad 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4888,7 +4888,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
 			mc.precharge += HPAGE_PMD_NR;
 		spin_unlock(ptl);
@@ -5056,17 +5056,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	union mc_target target;
 	struct page *page;
 
-	/*
-	 * We don't take compound_lock() here but no race with splitting thp
-	 * happens because:
-	 *  - if pmd_trans_huge_lock() returns 1, the relevant thp is not
-	 *    under splitting, which means there's no concurrent thp split,
-	 *  - if another thread runs into split_huge_page() just after we
-	 *    entered this if-block, the thread must wait for page table lock
-	 *    to be unlocked in __split_huge_page_splitting(), where the main
-	 *    part of thp split is not executed yet.
-	 */
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		if (mc.precharge < HPAGE_PMD_NR) {
 			spin_unlock(ptl);
 			return 0;
diff --git a/mm/memory.c b/mm/memory.c
index 61e7ed722760..1bad3766b00c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -565,7 +565,6 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	spinlock_t *ptl;
 	pgtable_t new = pte_alloc_one(mm, address);
-	int wait_split_huge_page;
 	if (!new)
 		return -ENOMEM;
 
@@ -585,18 +584,14 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
 
 	ptl = pmd_lock(mm, pmd);
-	wait_split_huge_page = 0;
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		atomic_long_inc(&mm->nr_ptes);
 		pmd_populate(mm, pmd, new);
 		new = NULL;
-	} else if (unlikely(pmd_trans_splitting(*pmd)))
-		wait_split_huge_page = 1;
+	}
 	spin_unlock(ptl);
 	if (new)
 		pte_free(mm, new);
-	if (wait_split_huge_page)
-		wait_split_huge_page(vma->anon_vma, pmd);
 	return 0;
 }
 
@@ -612,8 +607,7 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		pmd_populate_kernel(&init_mm, pmd, new);
 		new = NULL;
-	} else
-		VM_BUG_ON(pmd_trans_splitting(*pmd));
+	}
 	spin_unlock(&init_mm.page_table_lock);
 	if (new)
 		pte_free_kernel(&init_mm, new);
@@ -3299,14 +3293,6 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (pmd_trans_huge(orig_pmd)) {
 			unsigned int dirty = flags & FAULT_FLAG_WRITE;
 
-			/*
-			 * If the pmd is splitting, return and retry the
-			 * the fault.  Alternative: wait until the split
-			 * is done, and goto retry.
-			 */
-			if (pmd_trans_splitting(orig_pmd))
-				return 0;
-
 			if (pmd_protnone(orig_pmd))
 				return do_huge_pmd_numa_page(mm, vma, address,
 							     orig_pmd, pmd);
diff --git a/mm/mincore.c b/mm/mincore.c
index be25efde64a4..feb867f5fdf4 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -117,7 +117,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	unsigned char *vec = walk->private;
 	int nr = (end - addr) >> PAGE_SHIFT;
 
-	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		memset(vec, 1, nr);
 		spin_unlock(ptl);
 		goto out;
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index c25f94b33811..2fe699cedd4d 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -133,20 +133,6 @@ pmd_t pmdp_clear_flush(struct vm_area_struct *vma, unsigned long address,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
-#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
-			  pmd_t *pmdp)
-{
-	pmd_t pmd = pmd_mksplitting(*pmdp);
-	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-	set_pmd_at(vma->vm_mm, address, pmdp, pmd);
-	/* tlb flush only to serialize against gup-fast */
-	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-#endif
-
 #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
diff --git a/mm/rmap.c b/mm/rmap.c
index 4ca4b5cffd95..1636a96e5f71 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -737,8 +737,7 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		 * rmap might return false positives; we must filter
 		 * these out using page_check_address_pmd().
 		 */
-		pmd = page_check_address_pmd(page, mm, address,
-					     PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
+		pmd = page_check_address_pmd(page, mm, address, &ptl);
 		if (!pmd)
 			return SWAP_AGAIN;
 
@@ -748,7 +747,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 			return SWAP_FAIL; /* To break the loop */
 		}
 
-		/* go ahead even if the pmd is pmd_trans_splitting() */
 		if (pmdp_clear_flush_young_notify(vma, address, pmd))
 			referenced++;
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 18/28] x86, thp: remove infrastructure for handling splitting PMDs
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

With the new refcounting we don't need to mark PMDs as splitting. Let's
drop the code that handles this.
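
For illustration only (not part of this patch): on x86 the only THP
state left in a PMD is _PAGE_PSE, so a huge PMD is identified by
pmd_trans_huge() alone, with no intermediate "splitting" software bit
to test.  A minimal sketch; the helper name is made up:

	/* Illustrative sketch -- not code from this series. */
	static inline bool sketch_x86_pmd_is_thp(pmd_t pmd)
	{
		/* _PAGE_SPLITTING is gone: PSE alone marks a huge PMD */
		return pmd_trans_huge(pmd);
	}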

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 arch/x86/include/asm/pgtable.h       |  9 ---------
 arch/x86/include/asm/pgtable_types.h |  2 --
 arch/x86/mm/gup.c                    | 13 +------------
 arch/x86/mm/pgtable.c                | 14 --------------
 4 files changed, 1 insertion(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index f89d6c9943ea..21a2e25a5393 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -158,11 +158,6 @@ static inline int pmd_large(pmd_t pte)
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline int pmd_trans_splitting(pmd_t pmd)
-{
-	return pmd_val(pmd) & _PAGE_SPLITTING;
-}
-
 static inline int pmd_trans_huge(pmd_t pmd)
 {
 	return pmd_val(pmd) & _PAGE_PSE;
@@ -792,10 +787,6 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
 				  unsigned long address, pmd_t *pmdp);
 
 
-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
-				 unsigned long addr, pmd_t *pmdp);
-
 #define __HAVE_ARCH_PMD_WRITE
 static inline int pmd_write(pmd_t pmd)
 {
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 78f0c8cbe316..45f7cff1baac 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -22,7 +22,6 @@
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
 #define _PAGE_BIT_SPECIAL	_PAGE_BIT_SOFTW1
 #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_SOFTW1
-#define _PAGE_BIT_SPLITTING	_PAGE_BIT_SOFTW2 /* only valid on a PSE pmd */
 #define _PAGE_BIT_HIDDEN	_PAGE_BIT_SOFTW3 /* hidden by kmemcheck */
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
@@ -46,7 +45,6 @@
 #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
 #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
-#define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
 #define __HAVE_ARCH_PTE_SPECIAL
 
 #ifdef CONFIG_KMEMCHECK
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 62a887a3cf50..49bbbc57603b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -157,18 +157,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		pmd_t pmd = *pmdp;
 
 		next = pmd_addr_end(addr, end);
-		/*
-		 * The pmd_trans_splitting() check below explains why
-		 * pmdp_splitting_flush has to flush the tlb, to stop
-		 * this gup-fast code from running while we set the
-		 * splitting bit in the pmd. Returning zero will take
-		 * the slow path that will call wait_split_huge_page()
-		 * if the pmd is still in splitting state. gup-fast
-		 * can't because it has irq disabled and
-		 * wait_split_huge_page() would never return as the
-		 * tlb flush IPI wouldn't run.
-		 */
-		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+		if (pmd_none(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd) || !pmd_present(pmd))) {
 			/*
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 66d5aa27a7a5..23006b1797a0 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -434,20 +434,6 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 
 	return young;
 }
-
-void pmdp_splitting_flush(struct vm_area_struct *vma,
-			  unsigned long address, pmd_t *pmdp)
-{
-	int set;
-	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-	set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
-				(unsigned long *)pmdp);
-	if (set) {
-		pmd_update(vma->vm_mm, address, pmdp);
-		/* need tlb flush only to serialize against gup-fast */
-		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
-	}
-}
 #endif
 
 /**
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 19/28] mm: store mapcount for compound page separately
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We're going to allow mapping of individual 4k pages of a THP compound
page, so we need a cheap way to find out how many times the compound
page is mapped with a PMD -- compound_mapcount() does this.

We use the same approach as with the compound page destructor and
compound order: use space in the first tail page, ->mapping this time.

page_mapcount() counts both PTE and PMD mappings of the page.
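
For illustration only (not part of this patch), how the two counters
read back after this change, using only the interfaces introduced
here; the helper name is made up:

	/* Illustrative sketch -- not code from this series. */
	static void sketch_dump_mapcounts(struct page *page)
	{
		struct page *head = compound_head(page);

		/* PMD mappings, stored in page[1].compound_mapcount */
		pr_info("compound_mapcount: %d\n", compound_mapcount(head));
		/* PTE mappings of this subpage plus the PMD mappings above */
		pr_info("page_mapcount:     %d\n", page_mapcount(page));
	}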

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 include/linux/mm.h       | 25 ++++++++++++--
 include/linux/mm_types.h |  1 +
 include/linux/rmap.h     |  4 +--
 mm/debug.c               |  5 ++-
 mm/huge_memory.c         |  2 +-
 mm/hugetlb.c             |  4 +--
 mm/memory.c              |  2 +-
 mm/migrate.c             |  2 +-
 mm/page_alloc.c          | 14 ++++++--
 mm/rmap.c                | 87 +++++++++++++++++++++++++++++++++++++-----------
 10 files changed, 114 insertions(+), 32 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dad667d99304..33cb3aa647a6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -393,6 +393,19 @@ static inline int is_vmalloc_or_module_addr(const void *x)
 
 extern void kvfree(const void *addr);
 
+static inline atomic_t *compound_mapcount_ptr(struct page *page)
+{
+	return &page[1].compound_mapcount;
+}
+
+static inline int compound_mapcount(struct page *page)
+{
+	if (!PageCompound(page))
+		return 0;
+	page = compound_head(page);
+	return atomic_read(compound_mapcount_ptr(page)) + 1;
+}
+
 /*
  * The atomic page->_mapcount, starts from -1: so that transitions
  * both from it and to it can be tracked, using atomic_inc_and_test
@@ -405,8 +418,16 @@ static inline void page_mapcount_reset(struct page *page)
 
 static inline int page_mapcount(struct page *page)
 {
+	int ret;
 	VM_BUG_ON_PAGE(PageSlab(page), page);
-	return atomic_read(&page->_mapcount) + 1;
+	ret = atomic_read(&page->_mapcount) + 1;
+	/*
+	 * Positive compound_mapcount() offsets ->_mapcount in every page by
+	 * one. Let's subtract it here.
+	 */
+	if (compound_mapcount(page))
+	       ret += compound_mapcount(page) - 1;
+	return ret;
 }
 
 static inline int page_count(struct page *page)
@@ -888,7 +909,7 @@ static inline pgoff_t page_file_index(struct page *page)
  */
 static inline int page_mapped(struct page *page)
 {
-	return atomic_read(&(page)->_mapcount) >= 0;
+	return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
 }
 
 /*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 126f481bb95a..c8485fe2381c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -56,6 +56,7 @@ struct page {
 						 * see PAGE_MAPPING_ANON below.
 						 */
 		void *s_mem;			/* slab first object */
+		atomic_t compound_mapcount;	/* first tail page */
 	};
 
 	/* Second double word */
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index e7ecba43ae71..bb16ec73eeb7 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -181,9 +181,9 @@ void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 				unsigned long);
 
-static inline void page_dup_rmap(struct page *page)
+static inline void page_dup_rmap(struct page *page, bool compound)
 {
-	atomic_inc(&page->_mapcount);
+	atomic_inc(compound ? compound_mapcount_ptr(page) : &page->_mapcount);
 }
 
 /*
diff --git a/mm/debug.c b/mm/debug.c
index 9dfcd77e7354..4a82f639b964 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -80,9 +80,12 @@ static void dump_flags(unsigned long flags,
 void dump_page_badflags(struct page *page, const char *reason,
 		unsigned long badflags)
 {
-	pr_emerg("page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
+	pr_emerg("page:%p count:%d mapcount:%d mapping:%p index:%#lx",
 		  page, atomic_read(&page->_count), page_mapcount(page),
 		  page->mapping, page->index);
+	if (PageCompound(page))
+		pr_cont(" compound_mapcount: %d", compound_mapcount(page));
+	pr_cont("\n");
 	BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS);
 	dump_flags(page->flags, pageflag_names, ARRAY_SIZE(pageflag_names));
 	if (reason)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 23181f836b62..06adbe3f2100 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -892,7 +892,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	src_page = pmd_page(pmd);
 	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
 	get_page(src_page);
-	page_dup_rmap(src_page);
+	page_dup_rmap(src_page, true);
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f27d4edada3a..94d70a16395e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2715,7 +2715,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
 			get_page(ptepage);
-			page_dup_rmap(ptepage);
+			page_dup_rmap(ptepage, true);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		}
 		spin_unlock(src_ptl);
@@ -3176,7 +3176,7 @@ retry:
 		ClearPagePrivate(page);
 		hugepage_add_new_anon_rmap(page, vma, address);
 	} else
-		page_dup_rmap(page);
+		page_dup_rmap(page, true);
 	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
 				&& (vma->vm_flags & VM_SHARED)));
 	set_huge_pte_at(mm, address, ptep, new_pte);
diff --git a/mm/memory.c b/mm/memory.c
index 1bad3766b00c..0b295f7094b1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -864,7 +864,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
 		get_page(page);
-		page_dup_rmap(page);
+		page_dup_rmap(page, false);
 		if (PageAnon(page))
 			rss[MM_ANONPAGES]++;
 		else
diff --git a/mm/migrate.c b/mm/migrate.c
index 9a380238a4d0..b51e88c9dba2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -164,7 +164,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		if (PageAnon(new))
 			hugepage_add_anon_rmap(new, vma, addr);
 		else
-			page_dup_rmap(new);
+			page_dup_rmap(new, false);
 	} else if (PageAnon(new))
 		page_add_anon_rmap(new, vma, addr, false);
 	else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index df2e25424b71..ac331be78308 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -378,6 +378,7 @@ void prep_compound_page(struct page *page, unsigned long order)
 		smp_wmb();
 		__SetPageTail(p);
 	}
+	atomic_set(compound_mapcount_ptr(page), -1);
 }
 
 static inline void prep_zero_page(struct page *page, unsigned int order,
@@ -656,7 +657,7 @@ static inline int free_pages_check(struct page *page)
 	const char *bad_reason = NULL;
 	unsigned long bad_flags = 0;
 
-	if (unlikely(page_mapcount(page)))
+	if (unlikely(atomic_read(&page->_mapcount) != -1))
 		bad_reason = "nonzero mapcount";
 	if (unlikely(page->mapping != NULL))
 		bad_reason = "non-NULL mapping";
@@ -765,7 +766,14 @@ static void free_one_page(struct zone *zone,
 
 static int free_tail_pages_check(struct page *head_page, struct page *page)
 {
-	if (page->mapping != TAIL_MAPPING) {
+	/* mapping in first tail page is used for compound_mapcount() */
+	if (page - head_page == 1) {
+		if (unlikely(compound_mapcount(page))) {
+			bad_page(page, "nonzero compound_mapcount", 0);
+			page->mapping = NULL;
+			return 1;
+		}
+	} else if (page->mapping != TAIL_MAPPING) {
 		bad_page(page, "corrupted mapping in tail page", 0);
 		page->mapping = NULL;
 		return 1;
@@ -940,7 +948,7 @@ static inline int check_new_page(struct page *page)
 	const char *bad_reason = NULL;
 	unsigned long bad_flags = 0;
 
-	if (unlikely(page_mapcount(page)))
+	if (unlikely(atomic_read(&page->_mapcount) != -1))
 		bad_reason = "nonzero mapcount";
 	if (unlikely(page->mapping != NULL))
 		bad_reason = "non-NULL mapping";
diff --git a/mm/rmap.c b/mm/rmap.c
index 1636a96e5f71..047953145710 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1030,7 +1030,7 @@ static void __page_check_anon_rmap(struct page *page,
 	 * over the call to page_add_new_anon_rmap.
 	 */
 	BUG_ON(page_anon_vma(page)->root != vma->anon_vma->root);
-	BUG_ON(page->index != linear_page_index(vma, address));
+	BUG_ON(page_to_pgoff(page) != linear_page_index(vma, address));
 #endif
 }
 
@@ -1059,9 +1059,26 @@ void page_add_anon_rmap(struct page *page,
 void do_page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address, int flags)
 {
-	int first = atomic_inc_and_test(&page->_mapcount);
+	bool compound = flags & RMAP_COMPOUND;
+	bool first;
+
+	if (PageTransCompound(page)) {
+		VM_BUG_ON_PAGE(!PageLocked(page), page);
+		if (compound) {
+			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+			first = atomic_inc_and_test(compound_mapcount_ptr(page));
+		} else {
+			/* Anon THP always mapped first with PMD */
+			first = 0;
+			VM_BUG_ON_PAGE(!page_mapcount(page), page);
+			atomic_inc(&page->_mapcount);
+		}
+	} else {
+		VM_BUG_ON_PAGE(compound, page);
+		first = atomic_inc_and_test(&page->_mapcount);
+	}
+
 	if (first) {
-		bool compound = flags & RMAP_COMPOUND;
 		int nr = compound ? hpage_nr_pages(page) : 1;
 		/*
 		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
@@ -1070,9 +1087,17 @@ void do_page_add_anon_rmap(struct page *page,
 		 * disabled.
 		 */
 		if (compound) {
+			int i;
 			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 			__inc_zone_page_state(page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
+			/*
+			 * While compound_mapcount() is positive we keep *one*
+			 * mapcount reference in all subpages. It's required
+			 * for atomic removal from rmap.
+			 */
+			for (i = 0; i < nr; i++)
+				atomic_set(&page[i]._mapcount, 0);
 		}
 		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
 	}
@@ -1080,6 +1105,7 @@ void do_page_add_anon_rmap(struct page *page,
 		return;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
 	/* address might be in next vma when migration races vma_adjust */
 	if (first)
 		__page_set_anon_rmap(page, vma, address,
@@ -1105,10 +1131,25 @@ void page_add_new_anon_rmap(struct page *page,
 
 	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
 	SetPageSwapBacked(page);
-	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
 	if (compound) {
+		int i;
+
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+		/* increment count (starts at -1) */
+		atomic_set(compound_mapcount_ptr(page), 0);
 		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+		/*
+		 * While compound_mapcount() is positive we keep *one* mapcount
+		 * reference in all subpages. It's required for atomic removal
+		 * from rmap.
+		 */
+		for (i = 0; i < nr; i++)
+			atomic_set(&page[i]._mapcount, 0);
+	} else {
+		/* Anon THP always mapped first with PMD */
+		VM_BUG_ON_PAGE(PageTransCompound(page), page);
+		/* increment count (starts at -1) */
+		atomic_set(&page->_mapcount, 0);
 	}
 	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
 	__page_set_anon_rmap(page, vma, address, 1);
@@ -1138,12 +1179,15 @@ static void page_remove_file_rmap(struct page *page)
 
 	memcg = mem_cgroup_begin_page_stat(page);
 
-	/* page still mapped by someone else? */
-	if (!atomic_add_negative(-1, &page->_mapcount))
+	/* Hugepages are not counted in NR_FILE_MAPPED for now. */
+	if (unlikely(PageHuge(page))) {
+		/* hugetlb pages are always mapped with pmds */
+		atomic_dec(compound_mapcount_ptr(page));
 		goto out;
+	}
 
-	/* Hugepages are not counted in NR_FILE_MAPPED for now. */
-	if (unlikely(PageHuge(page)))
+	/* page still mapped by someone else? */
+	if (!atomic_add_negative(-1, &page->_mapcount))
 		goto out;
 
 	/*
@@ -1168,8 +1212,6 @@ out:
  */
 void page_remove_rmap(struct page *page, bool compound)
 {
-	int nr = compound ? hpage_nr_pages(page) : 1;
-
 	if (!PageAnon(page)) {
 		VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
 		page_remove_file_rmap(page);
@@ -1177,8 +1219,20 @@ void page_remove_rmap(struct page *page, bool compound)
 	}
 
 	/* page still mapped by someone else? */
-	if (!atomic_add_negative(-1, &page->_mapcount))
+	if (compound) {
+		int i;
+
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
+			return;
+		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+		for (i = 0; i < hpage_nr_pages(page); i++)
+			page_remove_rmap(page + i, false);
 		return;
+	} else {
+		if (!atomic_add_negative(-1, &page->_mapcount))
+			return;
+	}
 
 	/* Hugepages are not counted in NR_ANON_PAGES for now. */
 	if (unlikely(PageHuge(page)))
@@ -1189,12 +1243,7 @@ void page_remove_rmap(struct page *page, bool compound)
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	if (compound) {
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
-	}
-
-	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
+	__dec_zone_page_state(page, NR_ANON_PAGES);
 
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
@@ -1635,7 +1684,7 @@ void hugepage_add_anon_rmap(struct page *page,
 	BUG_ON(!PageLocked(page));
 	BUG_ON(!anon_vma);
 	/* address might be in next vma when migration races vma_adjust */
-	first = atomic_inc_and_test(&page->_mapcount);
+	first = atomic_inc_and_test(compound_mapcount_ptr(page));
 	if (first)
 		__hugepage_set_anon_rmap(page, vma, address, 0);
 }
@@ -1644,7 +1693,7 @@ void hugepage_add_new_anon_rmap(struct page *page,
 			struct vm_area_struct *vma, unsigned long address)
 {
 	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
-	atomic_set(&page->_mapcount, 0);
+	atomic_set(compound_mapcount_ptr(page), 0);
 	__hugepage_set_anon_rmap(page, vma, address, 1);
 }
 #endif /* CONFIG_HUGETLB_PAGE */
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 19/28] mm: store mapcount for compound page separately
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  0 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We're going to allow mapping of individual 4k pages of a THP compound
page, so we need a cheap way to find out how many times the compound
page is mapped with a PMD -- compound_mapcount() does this.

We use the same approach as with the compound page destructor and
compound order: use space in the first tail page, ->mapping this time.

page_mapcount() counts both PTE and PMD mappings of the page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 include/linux/mm.h       | 25 ++++++++++++--
 include/linux/mm_types.h |  1 +
 include/linux/rmap.h     |  4 +--
 mm/debug.c               |  5 ++-
 mm/huge_memory.c         |  2 +-
 mm/hugetlb.c             |  4 +--
 mm/memory.c              |  2 +-
 mm/migrate.c             |  2 +-
 mm/page_alloc.c          | 14 ++++++--
 mm/rmap.c                | 87 +++++++++++++++++++++++++++++++++++++-----------
 10 files changed, 114 insertions(+), 32 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dad667d99304..33cb3aa647a6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -393,6 +393,19 @@ static inline int is_vmalloc_or_module_addr(const void *x)
 
 extern void kvfree(const void *addr);
 
+static inline atomic_t *compound_mapcount_ptr(struct page *page)
+{
+	return &page[1].compound_mapcount;
+}
+
+static inline int compound_mapcount(struct page *page)
+{
+	if (!PageCompound(page))
+		return 0;
+	page = compound_head(page);
+	return atomic_read(compound_mapcount_ptr(page)) + 1;
+}
+
 /*
  * The atomic page->_mapcount, starts from -1: so that transitions
  * both from it and to it can be tracked, using atomic_inc_and_test
@@ -405,8 +418,16 @@ static inline void page_mapcount_reset(struct page *page)
 
 static inline int page_mapcount(struct page *page)
 {
+	int ret;
 	VM_BUG_ON_PAGE(PageSlab(page), page);
-	return atomic_read(&page->_mapcount) + 1;
+	ret = atomic_read(&page->_mapcount) + 1;
+	/*
+	 * Positive compound_mapcount() offsets ->_mapcount in every page by
+	 * one. Let's subtract it here.
+	 */
+	if (compound_mapcount(page))
+	       ret += compound_mapcount(page) - 1;
+	return ret;
 }
 
 static inline int page_count(struct page *page)
@@ -888,7 +909,7 @@ static inline pgoff_t page_file_index(struct page *page)
  */
 static inline int page_mapped(struct page *page)
 {
-	return atomic_read(&(page)->_mapcount) >= 0;
+	return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
 }
 
 /*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 126f481bb95a..c8485fe2381c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -56,6 +56,7 @@ struct page {
 						 * see PAGE_MAPPING_ANON below.
 						 */
 		void *s_mem;			/* slab first object */
+		atomic_t compound_mapcount;	/* first tail page */
 	};
 
 	/* Second double word */
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index e7ecba43ae71..bb16ec73eeb7 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -181,9 +181,9 @@ void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
 void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
 				unsigned long);
 
-static inline void page_dup_rmap(struct page *page)
+static inline void page_dup_rmap(struct page *page, bool compound)
 {
-	atomic_inc(&page->_mapcount);
+	atomic_inc(compound ? compound_mapcount_ptr(page) : &page->_mapcount);
 }
 
 /*
diff --git a/mm/debug.c b/mm/debug.c
index 9dfcd77e7354..4a82f639b964 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -80,9 +80,12 @@ static void dump_flags(unsigned long flags,
 void dump_page_badflags(struct page *page, const char *reason,
 		unsigned long badflags)
 {
-	pr_emerg("page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
+	pr_emerg("page:%p count:%d mapcount:%d mapping:%p index:%#lx",
 		  page, atomic_read(&page->_count), page_mapcount(page),
 		  page->mapping, page->index);
+	if (PageCompound(page))
+		pr_cont(" compound_mapcount: %d", compound_mapcount(page));
+	pr_cont("\n");
 	BUILD_BUG_ON(ARRAY_SIZE(pageflag_names) != __NR_PAGEFLAGS);
 	dump_flags(page->flags, pageflag_names, ARRAY_SIZE(pageflag_names));
 	if (reason)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 23181f836b62..06adbe3f2100 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -892,7 +892,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	src_page = pmd_page(pmd);
 	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
 	get_page(src_page);
-	page_dup_rmap(src_page);
+	page_dup_rmap(src_page, true);
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f27d4edada3a..94d70a16395e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2715,7 +2715,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			entry = huge_ptep_get(src_pte);
 			ptepage = pte_page(entry);
 			get_page(ptepage);
-			page_dup_rmap(ptepage);
+			page_dup_rmap(ptepage, true);
 			set_huge_pte_at(dst, addr, dst_pte, entry);
 		}
 		spin_unlock(src_ptl);
@@ -3176,7 +3176,7 @@ retry:
 		ClearPagePrivate(page);
 		hugepage_add_new_anon_rmap(page, vma, address);
 	} else
-		page_dup_rmap(page);
+		page_dup_rmap(page, true);
 	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
 				&& (vma->vm_flags & VM_SHARED)));
 	set_huge_pte_at(mm, address, ptep, new_pte);
diff --git a/mm/memory.c b/mm/memory.c
index 1bad3766b00c..0b295f7094b1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -864,7 +864,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
 		get_page(page);
-		page_dup_rmap(page);
+		page_dup_rmap(page, false);
 		if (PageAnon(page))
 			rss[MM_ANONPAGES]++;
 		else
diff --git a/mm/migrate.c b/mm/migrate.c
index 9a380238a4d0..b51e88c9dba2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -164,7 +164,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		if (PageAnon(new))
 			hugepage_add_anon_rmap(new, vma, addr);
 		else
-			page_dup_rmap(new);
+			page_dup_rmap(new, false);
 	} else if (PageAnon(new))
 		page_add_anon_rmap(new, vma, addr, false);
 	else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index df2e25424b71..ac331be78308 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -378,6 +378,7 @@ void prep_compound_page(struct page *page, unsigned long order)
 		smp_wmb();
 		__SetPageTail(p);
 	}
+	atomic_set(compound_mapcount_ptr(page), -1);
 }
 
 static inline void prep_zero_page(struct page *page, unsigned int order,
@@ -656,7 +657,7 @@ static inline int free_pages_check(struct page *page)
 	const char *bad_reason = NULL;
 	unsigned long bad_flags = 0;
 
-	if (unlikely(page_mapcount(page)))
+	if (unlikely(atomic_read(&page->_mapcount) != -1))
 		bad_reason = "nonzero mapcount";
 	if (unlikely(page->mapping != NULL))
 		bad_reason = "non-NULL mapping";
@@ -765,7 +766,14 @@ static void free_one_page(struct zone *zone,
 
 static int free_tail_pages_check(struct page *head_page, struct page *page)
 {
-	if (page->mapping != TAIL_MAPPING) {
+	/* mapping in first tail page is used for compound_mapcount() */
+	if (page - head_page == 1) {
+		if (unlikely(compound_mapcount(page))) {
+			bad_page(page, "nonzero compound_mapcount", 0);
+			page->mapping = NULL;
+			return 1;
+		}
+	} else if (page->mapping != TAIL_MAPPING) {
 		bad_page(page, "corrupted mapping in tail page", 0);
 		page->mapping = NULL;
 		return 1;
@@ -940,7 +948,7 @@ static inline int check_new_page(struct page *page)
 	const char *bad_reason = NULL;
 	unsigned long bad_flags = 0;
 
-	if (unlikely(page_mapcount(page)))
+	if (unlikely(atomic_read(&page->_mapcount) != -1))
 		bad_reason = "nonzero mapcount";
 	if (unlikely(page->mapping != NULL))
 		bad_reason = "non-NULL mapping";
diff --git a/mm/rmap.c b/mm/rmap.c
index 1636a96e5f71..047953145710 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1030,7 +1030,7 @@ static void __page_check_anon_rmap(struct page *page,
 	 * over the call to page_add_new_anon_rmap.
 	 */
 	BUG_ON(page_anon_vma(page)->root != vma->anon_vma->root);
-	BUG_ON(page->index != linear_page_index(vma, address));
+	BUG_ON(page_to_pgoff(page) != linear_page_index(vma, address));
 #endif
 }
 
@@ -1059,9 +1059,26 @@ void page_add_anon_rmap(struct page *page,
 void do_page_add_anon_rmap(struct page *page,
 	struct vm_area_struct *vma, unsigned long address, int flags)
 {
-	int first = atomic_inc_and_test(&page->_mapcount);
+	bool compound = flags & RMAP_COMPOUND;
+	bool first;
+
+	if (PageTransCompound(page)) {
+		VM_BUG_ON_PAGE(!PageLocked(page), page);
+		if (compound) {
+			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+			first = atomic_inc_and_test(compound_mapcount_ptr(page));
+		} else {
+			/* Anon THP always mapped first with PMD */
+			first = 0;
+			VM_BUG_ON_PAGE(!page_mapcount(page), page);
+			atomic_inc(&page->_mapcount);
+		}
+	} else {
+		VM_BUG_ON_PAGE(compound, page);
+		first = atomic_inc_and_test(&page->_mapcount);
+	}
+
 	if (first) {
-		bool compound = flags & RMAP_COMPOUND;
 		int nr = compound ? hpage_nr_pages(page) : 1;
 		/*
 		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
@@ -1070,9 +1087,17 @@ void do_page_add_anon_rmap(struct page *page,
 		 * disabled.
 		 */
 		if (compound) {
+			int i;
 			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 			__inc_zone_page_state(page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
+			/*
+			 * While compound_mapcount() is positive we keep *one*
+			 * mapcount reference in all subpages. It's required
+			 * for atomic removal from rmap.
+			 */
+			for (i = 0; i < nr; i++)
+				atomic_set(&page[i]._mapcount, 0);
 		}
 		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
 	}
@@ -1080,6 +1105,7 @@ void do_page_add_anon_rmap(struct page *page,
 		return;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
 	/* address might be in next vma when migration races vma_adjust */
 	if (first)
 		__page_set_anon_rmap(page, vma, address,
@@ -1105,10 +1131,25 @@ void page_add_new_anon_rmap(struct page *page,
 
 	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
 	SetPageSwapBacked(page);
-	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
 	if (compound) {
+		int i;
+
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+		/* increment count (starts at -1) */
+		atomic_set(compound_mapcount_ptr(page), 0);
 		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+		/*
+		 * While compound_mapcount() is positive we keep *one* mapcount
+		 * reference in all subpages. It's required for atomic removal
+		 * from rmap.
+		 */
+		for (i = 0; i < nr; i++)
+			atomic_set(&page[i]._mapcount, 0);
+	} else {
+		/* Anon THP always mapped first with PMD */
+		VM_BUG_ON_PAGE(PageTransCompound(page), page);
+		/* increment count (starts at -1) */
+		atomic_set(&page->_mapcount, 0);
 	}
 	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
 	__page_set_anon_rmap(page, vma, address, 1);
@@ -1138,12 +1179,15 @@ static void page_remove_file_rmap(struct page *page)
 
 	memcg = mem_cgroup_begin_page_stat(page);
 
-	/* page still mapped by someone else? */
-	if (!atomic_add_negative(-1, &page->_mapcount))
+	/* Hugepages are not counted in NR_FILE_MAPPED for now. */
+	if (unlikely(PageHuge(page))) {
+		/* hugetlb pages are always mapped with pmds */
+		atomic_dec(compound_mapcount_ptr(page));
 		goto out;
+	}
 
-	/* Hugepages are not counted in NR_FILE_MAPPED for now. */
-	if (unlikely(PageHuge(page)))
+	/* page still mapped by someone else? */
+	if (!atomic_add_negative(-1, &page->_mapcount))
 		goto out;
 
 	/*
@@ -1168,8 +1212,6 @@ out:
  */
 void page_remove_rmap(struct page *page, bool compound)
 {
-	int nr = compound ? hpage_nr_pages(page) : 1;
-
 	if (!PageAnon(page)) {
 		VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
 		page_remove_file_rmap(page);
@@ -1177,8 +1219,20 @@ void page_remove_rmap(struct page *page, bool compound)
 	}
 
 	/* page still mapped by someone else? */
-	if (!atomic_add_negative(-1, &page->_mapcount))
+	if (compound) {
+		int i;
+
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
+			return;
+		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+		for (i = 0; i < hpage_nr_pages(page); i++)
+			page_remove_rmap(page + i, false);
 		return;
+	} else {
+		if (!atomic_add_negative(-1, &page->_mapcount))
+			return;
+	}
 
 	/* Hugepages are not counted in NR_ANON_PAGES for now. */
 	if (unlikely(PageHuge(page)))
@@ -1189,12 +1243,7 @@ void page_remove_rmap(struct page *page, bool compound)
 	 * these counters are not modified in interrupt context, and
 	 * pte lock(a spinlock) is held, which implies preemption disabled.
 	 */
-	if (compound) {
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
-	}
-
-	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
+	__dec_zone_page_state(page, NR_ANON_PAGES);
 
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
@@ -1635,7 +1684,7 @@ void hugepage_add_anon_rmap(struct page *page,
 	BUG_ON(!PageLocked(page));
 	BUG_ON(!anon_vma);
 	/* address might be in next vma when migration races vma_adjust */
-	first = atomic_inc_and_test(&page->_mapcount);
+	first = atomic_inc_and_test(compound_mapcount_ptr(page));
 	if (first)
 		__hugepage_set_anon_rmap(page, vma, address, 0);
 }
@@ -1644,7 +1693,7 @@ void hugepage_add_new_anon_rmap(struct page *page,
 			struct vm_area_struct *vma, unsigned long address)
 {
 	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
-	atomic_set(&page->_mapcount, 0);
+	atomic_set(compound_mapcount_ptr(page), 0);
 	__hugepage_set_anon_rmap(page, vma, address, 1);
 }
 #endif /* CONFIG_HUGETLB_PAGE */
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 20/28] mm: differentiate page_mapped() from page_mapcount() for compound pages
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

Let's define page_mapped() to be true for compound pages if any
sub-page of the compound page is mapped (with PMD or PTE).

On the other hand, page_mapcount() returns the mapcount of this
particular small page.

This will make cases like page_get_anon_vma() behave correctly once we
allow huge pages to be mapped with PTEs.

Most users outside core-mm should use page_mapcount() instead of
page_mapped().
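
A small user-space sketch of the distinction (again not kernel code;
the data layout is invented for the example): once all PMD mappings are
gone but one subpage is still PTE-mapped, page_mapped() on the compound
page stays true while page_mapcount() of the other subpages is 0.

/* cc -std=c99 page_mapped_model.c && ./a.out */
#include <stdio.h>
#include <stdbool.h>

#define NR_SUBPAGES 512

static int compound_raw = -1;		/* no PMD mappings left */
static int subpage_raw[NR_SUBPAGES];	/* per-subpage PTE mapcount - 1 */

static int compound_mapcount(void)
{
	return compound_raw + 1;
}

static int page_mapcount(int i)
{
	int ret = subpage_raw[i] + 1;

	if (compound_mapcount())
		ret += compound_mapcount() - 1;
	return ret;
}

/* true if any subpage is mapped, with PMD or PTE */
static bool page_mapped(void)
{
	if (compound_mapcount())
		return true;
	for (int i = 0; i < NR_SUBPAGES; i++)
		if (subpage_raw[i] >= 0)
			return true;
	return false;
}

int main(void)
{
	for (int i = 0; i < NR_SUBPAGES; i++)
		subpage_raw[i] = -1;
	subpage_raw[7] = 0;		/* only subpage 7 is PTE-mapped */

	printf("page_mapped()    = %d\n", page_mapped());	/* 1 */
	printf("page_mapcount(0) = %d\n", page_mapcount(0));	/* 0 */
	printf("page_mapcount(7) = %d\n", page_mapcount(7));	/* 1 */
	return 0;
}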

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 arch/arc/mm/cache_arc700.c |  4 ++--
 arch/arm/mm/flush.c        |  2 +-
 arch/mips/mm/c-r4k.c       |  3 ++-
 arch/mips/mm/cache.c       |  2 +-
 arch/mips/mm/init.c        |  6 +++---
 arch/sh/mm/cache-sh4.c     |  2 +-
 arch/sh/mm/cache.c         |  8 ++++----
 arch/xtensa/mm/tlb.c       |  2 +-
 fs/proc/page.c             |  4 ++--
 include/linux/mm.h         | 11 ++++++++++-
 mm/filemap.c               |  2 +-
 11 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/arch/arc/mm/cache_arc700.c b/arch/arc/mm/cache_arc700.c
index 8c3a3e02ba92..1baa4d23314b 100644
--- a/arch/arc/mm/cache_arc700.c
+++ b/arch/arc/mm/cache_arc700.c
@@ -490,7 +490,7 @@ void flush_dcache_page(struct page *page)
 	 */
 	if (!mapping_mapped(mapping)) {
 		clear_bit(PG_dc_clean, &page->flags);
-	} else if (page_mapped(page)) {
+	} else if (page_mapcount(page)) {
 
 		/* kernel reading from page with U-mapping */
 		void *paddr = page_address(page);
@@ -675,7 +675,7 @@ void copy_user_highpage(struct page *to, struct page *from,
 	 * Note that while @u_vaddr refers to DST page's userspace vaddr, it is
 	 * equally valid for SRC page as well
 	 */
-	if (page_mapped(from) && addr_not_cache_congruent(kfrom, u_vaddr)) {
+	if (page_mapcount(from) && addr_not_cache_congruent(kfrom, u_vaddr)) {
 		__flush_dcache_page(kfrom, u_vaddr);
 		clean_src_k_mappings = 1;
 	}
diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
index 34b66af516ea..8f972fc8933d 100644
--- a/arch/arm/mm/flush.c
+++ b/arch/arm/mm/flush.c
@@ -315,7 +315,7 @@ void flush_dcache_page(struct page *page)
 	mapping = page_mapping(page);
 
 	if (!cache_ops_need_broadcast() &&
-	    mapping && !page_mapped(page))
+	    mapping && !page_mapcount(page))
 		clear_bit(PG_dcache_clean, &page->flags);
 	else {
 		__flush_dcache_page(mapping, page);
diff --git a/arch/mips/mm/c-r4k.c b/arch/mips/mm/c-r4k.c
index dd261df005c2..c4960b2d6682 100644
--- a/arch/mips/mm/c-r4k.c
+++ b/arch/mips/mm/c-r4k.c
@@ -578,7 +578,8 @@ static inline void local_r4k_flush_cache_page(void *args)
 		 * another ASID than the current one.
 		 */
 		map_coherent = (cpu_has_dc_aliases &&
-				page_mapped(page) && !Page_dcache_dirty(page));
+				page_mapcount(page) &&
+				!Page_dcache_dirty(page));
 		if (map_coherent)
 			vaddr = kmap_coherent(page, addr);
 		else
diff --git a/arch/mips/mm/cache.c b/arch/mips/mm/cache.c
index 7e3ea7766822..e695b28dc32c 100644
--- a/arch/mips/mm/cache.c
+++ b/arch/mips/mm/cache.c
@@ -106,7 +106,7 @@ void __flush_anon_page(struct page *page, unsigned long vmaddr)
 	unsigned long addr = (unsigned long) page_address(page);
 
 	if (pages_do_alias(addr, vmaddr)) {
-		if (page_mapped(page) && !Page_dcache_dirty(page)) {
+		if (page_mapcount(page) && !Page_dcache_dirty(page)) {
 			void *kaddr;
 
 			kaddr = kmap_coherent(page, vmaddr);
diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
index 448cde372af0..2c8e44aa536e 100644
--- a/arch/mips/mm/init.c
+++ b/arch/mips/mm/init.c
@@ -156,7 +156,7 @@ void copy_user_highpage(struct page *to, struct page *from,
 
 	vto = kmap_atomic(to);
 	if (cpu_has_dc_aliases &&
-	    page_mapped(from) && !Page_dcache_dirty(from)) {
+	    page_mapcount(from) && !Page_dcache_dirty(from)) {
 		vfrom = kmap_coherent(from, vaddr);
 		copy_page(vto, vfrom);
 		kunmap_coherent();
@@ -178,7 +178,7 @@ void copy_to_user_page(struct vm_area_struct *vma,
 	unsigned long len)
 {
 	if (cpu_has_dc_aliases &&
-	    page_mapped(page) && !Page_dcache_dirty(page)) {
+	    page_mapcount(page) && !Page_dcache_dirty(page)) {
 		void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
 		memcpy(vto, src, len);
 		kunmap_coherent();
@@ -196,7 +196,7 @@ void copy_from_user_page(struct vm_area_struct *vma,
 	unsigned long len)
 {
 	if (cpu_has_dc_aliases &&
-	    page_mapped(page) && !Page_dcache_dirty(page)) {
+	    page_mapcount(page) && !Page_dcache_dirty(page)) {
 		void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
 		memcpy(dst, vfrom, len);
 		kunmap_coherent();
diff --git a/arch/sh/mm/cache-sh4.c b/arch/sh/mm/cache-sh4.c
index 51d8f7f31d1d..58aaa4f33b81 100644
--- a/arch/sh/mm/cache-sh4.c
+++ b/arch/sh/mm/cache-sh4.c
@@ -241,7 +241,7 @@ static void sh4_flush_cache_page(void *args)
 		 */
 		map_coherent = (current_cpu_data.dcache.n_aliases &&
 			test_bit(PG_dcache_clean, &page->flags) &&
-			page_mapped(page));
+			page_mapcount(page));
 		if (map_coherent)
 			vaddr = kmap_coherent(page, address);
 		else
diff --git a/arch/sh/mm/cache.c b/arch/sh/mm/cache.c
index f770e3992620..e58cfbf45150 100644
--- a/arch/sh/mm/cache.c
+++ b/arch/sh/mm/cache.c
@@ -59,7 +59,7 @@ void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
 		       unsigned long vaddr, void *dst, const void *src,
 		       unsigned long len)
 {
-	if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
+	if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
 	    test_bit(PG_dcache_clean, &page->flags)) {
 		void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
 		memcpy(vto, src, len);
@@ -78,7 +78,7 @@ void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
 			 unsigned long vaddr, void *dst, const void *src,
 			 unsigned long len)
 {
-	if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
+	if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
 	    test_bit(PG_dcache_clean, &page->flags)) {
 		void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
 		memcpy(dst, vfrom, len);
@@ -97,7 +97,7 @@ void copy_user_highpage(struct page *to, struct page *from,
 
 	vto = kmap_atomic(to);
 
-	if (boot_cpu_data.dcache.n_aliases && page_mapped(from) &&
+	if (boot_cpu_data.dcache.n_aliases && page_mapcount(from) &&
 	    test_bit(PG_dcache_clean, &from->flags)) {
 		vfrom = kmap_coherent(from, vaddr);
 		copy_page(vto, vfrom);
@@ -153,7 +153,7 @@ void __flush_anon_page(struct page *page, unsigned long vmaddr)
 	unsigned long addr = (unsigned long) page_address(page);
 
 	if (pages_do_alias(addr, vmaddr)) {
-		if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
+		if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
 		    test_bit(PG_dcache_clean, &page->flags)) {
 			void *kaddr;
 
diff --git a/arch/xtensa/mm/tlb.c b/arch/xtensa/mm/tlb.c
index 5ece856c5725..35c822286bbe 100644
--- a/arch/xtensa/mm/tlb.c
+++ b/arch/xtensa/mm/tlb.c
@@ -245,7 +245,7 @@ static int check_tlb_entry(unsigned w, unsigned e, bool dtlb)
 						page_mapcount(p));
 				if (!page_count(p))
 					rc |= TLB_INSANE;
-				else if (page_mapped(p))
+				else if (page_mapcount(p))
 					rc |= TLB_SUSPICIOUS;
 			} else {
 				rc |= TLB_INSANE;
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 7eee2d8b97d9..e99c059339f6 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -97,9 +97,9 @@ u64 stable_page_flags(struct page *page)
 	 * pseudo flags for the well known (anonymous) memory mapped pages
 	 *
 	 * Note that page->_mapcount is overloaded in SLOB/SLUB/SLQB, so the
-	 * simple test in page_mapped() is not enough.
+	 * simple test in page_mapcount() is not enough.
 	 */
-	if (!PageSlab(page) && page_mapped(page))
+	if (!PageSlab(page) && page_mapcount(page))
 		u |= 1 << KPF_MMAP;
 	if (PageAnon(page))
 		u |= 1 << KPF_ANON;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 33cb3aa647a6..8ddc184c55d6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -909,7 +909,16 @@ static inline pgoff_t page_file_index(struct page *page)
  */
 static inline int page_mapped(struct page *page)
 {
-	return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
+	int i;
+	if (likely(!PageCompound(page)))
+		return atomic_read(&page->_mapcount) >= 0;
+	if (compound_mapcount(page))
+		return 1;
+	for (i = 0; i < hpage_nr_pages(page); i++) {
+		if (atomic_read(&page[i]._mapcount) >= 0)
+			return 1;
+	}
+	return 0;
 }
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index ce4d6e3d740f..c25ba3b4e7a2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -200,7 +200,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	if (PageSwapBacked(page))
 		__dec_zone_page_state(page, NR_SHMEM);
-	BUG_ON(page_mapped(page));
+	VM_BUG_ON_PAGE(page_mapped(page), page);
 
 	/*
 	 * At this point page must be either written or cleaned by truncate.
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 21/28] mm, numa: skip PTE-mapped THP on numa fault
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We're going to have THPs mapped with PTEs. It will confuse NUMA
balancing. Let's skip them for now.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 mm/memory.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 0b295f7094b1..68c6002ee8ba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3135,6 +3135,12 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return 0;
 	}
 
+	/* TODO: handle PTE-mapped THP */
+	if (PageCompound(page)) {
+		pte_unmap_unlock(ptep, ptl);
+		return 0;
+	}
+
 	/*
 	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
 	 * much anyway since they can be in shared cache state. This misses
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 22/28] thp: implement split_huge_pmd()
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

The original split_huge_page() combined two operations: splitting PMDs
into tables of PTEs and splitting the underlying compound page. This
patch implements split_huge_pmd(), which splits a given PMD without
touching the other PMDs the page is mapped with or the underlying
compound page.

Without tail page refcounting, the implementation of split_huge_pmd()
is pretty straightforward.
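
A rough user-space sketch of what the split does to the mapping bits
(illustrative only; struct entry and the helpers are invented, and the
locking, refcount distribution and mapcount transfer done by the real
patch are left out):

/* cc split_pmd_model.c && ./a.out */
#include <stdio.h>
#include <stdbool.h>

#define NR_SUBPAGES 512

struct entry {
	bool present;
	bool write;		/* pmd_write() / pte_write() */
	bool young;		/* pmd_young() / pte_young() */
};

static struct entry huge = { true, true, false };
static struct entry small[NR_SUBPAGES];

static void split_huge_entry(void)
{
	bool write = huge.write;
	bool young = huge.young;
	int i;

	/* leave the huge entry empty until the small ones are filled */
	huge.present = false;

	for (i = 0; i < NR_SUBPAGES; i++) {
		small[i].present = true;
		/* write/young copied from the PMD; the kernel applies
		 * pte_wrprotect()/pte_mkold() when they are clear */
		small[i].write = write;
		small[i].young = young;
	}
	/*
	 * The real code issues smp_wmb() here so the PTE stores are
	 * visible before pmd_populate() publishes the page table.
	 */
}

int main(void)
{
	split_huge_entry();
	printf("small[0]: present=%d write=%d young=%d\n",
	       small[0].present, small[0].write, small[0].young);
	return 0;
}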

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 include/linux/huge_mm.h |  11 ++++-
 mm/huge_memory.c        | 108 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 118 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0382230b490f..b7844c73b7db 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -94,7 +94,16 @@ extern unsigned long transparent_hugepage_flags;
 
 #define split_huge_page_to_list(page, list) BUILD_BUG()
 #define split_huge_page(page) BUILD_BUG()
-#define split_huge_pmd(__vma, __pmd, __address) BUILD_BUG()
+
+void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long address);
+
+#define split_huge_pmd(__vma, __pmd, __address)				\
+	do {								\
+		pmd_t *____pmd = (__pmd);				\
+		if (unlikely(pmd_trans_huge(*____pmd)))			\
+			__split_huge_pmd(__vma, __pmd, __address);	\
+	}  while (0)
 
 #if HPAGE_PMD_ORDER >= MAX_ORDER
 #error "hugepages can't be allocated by the buddy allocator"
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 06adbe3f2100..5885ef8f0fad 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2522,6 +2522,114 @@ static int khugepaged(void *none)
 	return 0;
 }
 
+static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
+		unsigned long haddr, pmd_t *pmd)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgtable_t pgtable;
+	pmd_t _pmd;
+	int i;
+
+	/* leave pmd empty until pte is filled */
+	pmdp_clear_flush_notify(vma, haddr, pmd);
+
+	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	pmd_populate(mm, &_pmd, pgtable);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+		pte_t *pte, entry;
+		entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+		entry = pte_mkspecial(entry);
+		pte = pte_offset_map(&_pmd, haddr);
+		VM_BUG_ON(!pte_none(*pte));
+		set_pte_at(mm, haddr, pte, entry);
+		pte_unmap(pte);
+	}
+	smp_wmb(); /* make pte visible before pmd */
+	pmd_populate(mm, pmd, pgtable);
+	put_huge_zero_page();
+}
+
+static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long haddr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *page;
+	pgtable_t pgtable;
+	pmd_t _pmd;
+	bool young, write, last;
+	int i;
+
+	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
+	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
+	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
+	VM_BUG_ON(!pmd_trans_huge(*pmd));
+
+	count_vm_event(THP_SPLIT_PMD);
+
+	if (is_huge_zero_pmd(*pmd))
+		return __split_huge_zero_page_pmd(vma, haddr, pmd);
+
+	page = pmd_page(*pmd);
+	VM_BUG_ON_PAGE(!page_count(page), page);
+	atomic_add(HPAGE_PMD_NR - 1, &page->_count);
+	last = atomic_add_negative(-1, compound_mapcount_ptr(page));
+	if (last)
+		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+
+	write = pmd_write(*pmd);
+	young = pmd_young(*pmd);
+
+	/* leave pmd empty until pte is filled */
+	pmdp_clear_flush_notify(vma, haddr, pmd);
+
+	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	pmd_populate(mm, &_pmd, pgtable);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+		pte_t entry, *pte;
+		/*
+		 * Note that NUMA hinting access restrictions are not
+		 * transferred to avoid any possibility of altering
+		 * permissions across VMAs.
+		 */
+		entry = mk_pte(page + i, vma->vm_page_prot);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		if (!write)
+			entry = pte_wrprotect(entry);
+		if (!young)
+			entry = pte_mkold(entry);
+		pte = pte_offset_map(&_pmd, haddr);
+		BUG_ON(!pte_none(*pte));
+		set_pte_at(mm, haddr, pte, entry);
+		/*
+		 * Positive compound_mapcount also offsets ->_mapcount of
+		 * every subpage by one -- no need to increase mapcount when
+		 * splitting last PMD.
+		 */
+		if (!last)
+			atomic_inc(&page[i]._mapcount);
+		pte_unmap(pte);
+	}
+	smp_wmb(); /* make pte visible before pmd */
+	pmd_populate(mm, pmd, pgtable);
+}
+
+void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long address)
+{
+	spinlock_t *ptl;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+
+	mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
+	ptl = pmd_lock(mm, pmd);
+	if (likely(pmd_trans_huge(*pmd)))
+		__split_huge_pmd_locked(vma, pmd, haddr);
+	spin_unlock(ptl);
+	mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
+}
+
 static void split_huge_pmd_address(struct vm_area_struct *vma,
 				    unsigned long address)
 {
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 23/28] thp: add option to set up migration entries during PMD split
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We are going to use migration PTE entries to stabilize page counts.
If the page is mapped with PMDs, we need to split the PMD and set up
migration entries. It's reasonable to combine these operations to avoid
double-scanning over the page table.
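
A toy single-pass sketch of the idea (made-up types; the real code
builds swp_entry_t values with make_migration_entry() under the page
table lock): the same loop that fills the PTE table installs either
present PTEs or migration entries, depending on a 'freeze' flag, so no
second scan is needed.

/* cc freeze_model.c && ./a.out */
#include <stdio.h>
#include <stdbool.h>

#define NR_SUBPAGES 8			/* small, to keep the output short */

enum entry_type { EMPTY, PRESENT, MIGRATION };

struct entry {
	enum entry_type type;
	bool write;
};

static struct entry ptes[NR_SUBPAGES];

static void split_huge_entry(bool freeze, bool write)
{
	int i;

	for (i = 0; i < NR_SUBPAGES; i++) {
		/* one pass: either a normal PTE or a migration entry */
		ptes[i].type = freeze ? MIGRATION : PRESENT;
		ptes[i].write = write;	/* preserved for unfreeze */
	}
}

int main(void)
{
	split_huge_entry(true, false);
	printf("pte[0]: %s\n", ptes[0].type == MIGRATION ?
	       "migration entry (faults will wait)" : "present");
	return 0;
}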

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 mm/huge_memory.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5885ef8f0fad..2f9e2e882bab 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -23,6 +23,7 @@
 #include <linux/pagemap.h>
 #include <linux/migrate.h>
 #include <linux/hashtable.h>
+#include <linux/swapops.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -2551,7 +2552,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 }
 
 static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
-		unsigned long haddr)
+		unsigned long haddr, bool freeze)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *page;
@@ -2593,12 +2594,18 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 * transferred to avoid any possibility of altering
 		 * permissions across VMAs.
 		 */
-		entry = mk_pte(page + i, vma->vm_page_prot);
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-		if (!write)
-			entry = pte_wrprotect(entry);
-		if (!young)
-			entry = pte_mkold(entry);
+		if (freeze) {
+			swp_entry_t swp_entry;
+			swp_entry = make_migration_entry(page + i, write);
+			entry = swp_entry_to_pte(swp_entry);
+		} else {
+			entry = mk_pte(page + i, vma->vm_page_prot);
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+			if (!write)
+				entry = pte_wrprotect(entry);
+			if (!young)
+				entry = pte_mkold(entry);
+		}
 		pte = pte_offset_map(&_pmd, haddr);
 		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, haddr, pte, entry);
@@ -2625,7 +2632,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
 	ptl = pmd_lock(mm, pmd);
 	if (likely(pmd_trans_huge(*pmd)))
-		__split_huge_pmd_locked(vma, pmd, haddr);
+		__split_huge_pmd_locked(vma, pmd, haddr, false);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
 }
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 24/28] thp, mm: split_huge_page(): caller needs to lock page
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:03   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:03 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

We're going to use migration entries instead of compound_lock() to
stabilize page refcounts. Setting up and removing migration entries
requires the page to be locked.

Some split_huge_page() callers already have the page locked. Let's
require everybody to lock the page before calling split_huge_page().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 mm/memory-failure.c | 12 +++++++++---
 mm/migrate.c        |  8 ++++++--
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 441eff52d099..d9b06727e480 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -996,7 +996,10 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 		 * enough * to be safe.
 		 */
 		if (!PageHuge(hpage) && PageAnon(hpage)) {
-			if (unlikely(split_huge_page(hpage))) {
+			lock_page(hpage);
+			ret = split_huge_page(hpage);
+			unlock_page(hpage);
+			if (unlikely(ret)) {
 				/*
 				 * FIXME: if splitting THP is failed, it is
 				 * better to stop the following operation rather
@@ -1750,10 +1753,13 @@ int soft_offline_page(struct page *page, int flags)
 		return -EBUSY;
 	}
 	if (!PageHuge(page) && PageTransHuge(hpage)) {
-		if (PageAnon(hpage) && unlikely(split_huge_page(hpage))) {
+		lock_page(page);
+		ret = split_huge_page(hpage);
+		unlock_page(page);
+		if (unlikely(ret)) {
 			pr_info("soft offline: %#lx: failed to split THP\n",
 				pfn);
-			return -EBUSY;
+			return ret;
 		}
 	}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index b51e88c9dba2..03b9c4ba56dc 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -932,9 +932,13 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		goto out;
 	}
 
-	if (unlikely(PageTransHuge(page)))
-		if (unlikely(split_huge_page(page)))
+	if (unlikely(PageTransHuge(page))) {
+		lock_page(page);
+		rc = split_huge_page(page);
+		unlock_page(page);
+		if (rc)
 			goto out;
+	}
 
 	rc = __unmap_and_move(page, newpage, force, mode);
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 25/28] thp: reintroduce split_huge_page()
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:04   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:04 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

This patch adds an implementation of split_huge_page() for the new
refcounting.

Unlike the previous implementation, the new split_huge_page() can fail
if somebody holds a GUP pin on the page. It also means that a pin on a
page will prevent it from being split under you. It makes the situation
in many places much cleaner.

The basic scheme of split_huge_page():

  - Check that sum of mapcounts of all subpage is equal to page_count()
    plus one (caller pin). Foll off with -EBUSY. This way we can avoid
    useless PMD-splits.

  - Freeze the page counters by splitting all PMD and setup migration
    PTEs.

  - Re-check sum of mapcounts against page_count(). Page's counts are
    stable now. -EBUSY if page is pinned.

  - Split compound page.

  - Unfreeze the page by removing migration entries.
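
As a sketch of the first check only (an illustration, not part of the
patch; example_can_split() is hypothetical and it assumes, as the check
in this series does, that mappings of the compound page contribute to
page_count() of the head page):

	/*
	 * Example: a page mapped with a PMD in two processes gives
	 * total_mapcount() == 2; with the caller's pin, page_count(head)
	 * is 3 == mapcount + 1, so the split may proceed. One extra GUP
	 * pin makes page_count(head) == 4 and the check fails: -EBUSY.
	 */
	static bool example_can_split(struct page *head)
	{
		return total_mapcount(head) == page_count(head) - 1;
	}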

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 include/linux/huge_mm.h |   7 +-
 include/linux/pagemap.h |   9 +-
 mm/huge_memory.c        | 322 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/internal.h           |  26 +++-
 mm/rmap.c               |  21 ----
 5 files changed, 357 insertions(+), 28 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b7844c73b7db..3c0a50ed3eb8 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -92,8 +92,11 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 
 extern unsigned long transparent_hugepage_flags;
 
-#define split_huge_page_to_list(page, list) BUILD_BUG()
-#define split_huge_page(page) BUILD_BUG()
+int split_huge_page_to_list(struct page *page, struct list_head *list);
+static inline int split_huge_page(struct page *page)
+{
+	return split_huge_page_to_list(page, NULL);
+}
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 7c3790764795..ffbb23dbebba 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -387,10 +387,17 @@ static inline struct page *read_mapping_page(struct address_space *mapping,
  */
 static inline pgoff_t page_to_pgoff(struct page *page)
 {
+	pgoff_t pgoff;
+
 	if (unlikely(PageHeadHuge(page)))
 		return page->index << compound_order(page);
-	else
+
+	if (likely(!PageTransTail(page)))
 		return page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+
+	pgoff = page->first_page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff += page - page->first_page;
+	return pgoff;
 }
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2f9e2e882bab..7ad338ab2ac8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2704,3 +2704,325 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
 			split_huge_pmd_address(next, nstart);
 	}
 }
+
+static void freeze_page_vma(struct vm_area_struct *vma, struct page *page,
+		unsigned long address)
+{
+	spinlock_t *ptl;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	int i;
+
+	pgd = pgd_offset(vma->vm_mm, address);
+	if (!pgd_present(*pgd))
+		return;
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return;
+	pmd = pmd_offset(pud, address);
+	ptl = pmd_lock(vma->vm_mm, pmd);
+	if (!pmd_present(*pmd)) {
+		spin_unlock(ptl);
+		return;
+	}
+	if (pmd_trans_huge(*pmd)) {
+		if (page == pmd_page(*pmd))
+			__split_huge_pmd_locked(vma, pmd, address, true);
+		spin_unlock(ptl);
+		return;
+	}
+	spin_unlock(ptl);
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, address, &ptl);
+	for (i = 0; i < HPAGE_PMD_NR; i++, address += PAGE_SIZE, page++) {
+		pte_t entry, swp_pte;
+		swp_entry_t swp_entry;
+
+		if (!pte_present(pte[i]))
+			continue;
+		if (page_to_pfn(page) != pte_pfn(pte[i]))
+			continue;
+		flush_cache_page(vma, address, page_to_pfn(page));
+		entry = ptep_clear_flush(vma, address, pte + i);
+		swp_entry = make_migration_entry(page, pte_write(entry));
+		swp_pte = swp_entry_to_pte(swp_entry);
+		if (pte_soft_dirty(entry))
+			swp_pte = pte_swp_mksoft_dirty(swp_pte);
+		set_pte_at(vma->vm_mm, address, pte + i, swp_pte);
+	}
+	pte_unmap_unlock(pte, ptl);
+}
+
+static void freeze_page(struct anon_vma *anon_vma, struct page *page)
+{
+	struct anon_vma_chain *avc;
+	pgoff_t pgoff = page_to_pgoff(page);
+
+	VM_BUG_ON_PAGE(!PageHead(page), page);
+
+	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff,
+			pgoff + HPAGE_PMD_NR - 1) {
+		unsigned long haddr;
+
+		haddr = __vma_address(page, avc->vma) & HPAGE_PMD_MASK;
+		mmu_notifier_invalidate_range_start(avc->vma->vm_mm,
+				haddr, haddr + HPAGE_PMD_SIZE);
+		freeze_page_vma(avc->vma, page, haddr);
+		mmu_notifier_invalidate_range_end(avc->vma->vm_mm,
+				haddr, haddr + HPAGE_PMD_SIZE);
+	}
+}
+
+static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
+		unsigned long address)
+{
+	spinlock_t *ptl;
+	pmd_t *pmd;
+	pte_t *pte, entry;
+	swp_entry_t swp_entry;
+
+	pmd = mm_find_pmd(vma->vm_mm, address);
+	if (!pmd)
+		return;
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, address, &ptl);
+
+	if (!is_swap_pte(*pte))
+		goto unlock;
+
+	swp_entry = pte_to_swp_entry(*pte);
+	if (!is_migration_entry(swp_entry) ||
+			migration_entry_to_page(swp_entry) != page)
+		goto unlock;
+
+	entry = pte_mkold(mk_pte(page, vma->vm_page_prot));
+	if (is_write_migration_entry(swp_entry))
+		entry = maybe_mkwrite(entry, vma);
+
+	flush_dcache_page(page);
+	set_pte_at(vma->vm_mm, address, pte, entry);
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(vma, address, pte);
+unlock:
+	pte_unmap_unlock(pte, ptl);
+}
+
+static void unfreeze_page(struct anon_vma *anon_vma, struct page *page)
+{
+	struct anon_vma_chain *avc;
+	pgoff_t pgoff = page_to_pgoff(page);
+	int i;
+
+	for (i = 0; i < HPAGE_PMD_NR; i++, pgoff++, page++) {
+		if (!page_mapcount(page))
+			continue;
+
+		anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
+				pgoff, pgoff) {
+			unsigned long address = vma_address(page, avc->vma);
+
+			mmu_notifier_invalidate_range_start(avc->vma->vm_mm,
+					address, address + PAGE_SIZE);
+			unfreeze_page_vma(avc->vma, page, address);
+			mmu_notifier_invalidate_range_end(avc->vma->vm_mm,
+					address, address + PAGE_SIZE);
+		}
+	}
+}
+
+static int total_mapcount(struct page *page)
+{
+	int i, ret;
+
+	ret = compound_mapcount(page);
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		ret += atomic_read(&page[i]._mapcount) + 1;
+
+	/*
+	 * Positive compound_mapcount() offsets ->_mapcount in every subpage by
+	 * one. Let's subtract it here.
+	 */
+	if (compound_mapcount(page))
+		ret -= HPAGE_PMD_NR;
+
+	return ret;
+}
+
+static int __split_huge_page_tail(struct page *head, int tail,
+		struct lruvec *lruvec, struct list_head *list)
+{
+	int mapcount;
+	struct page *page_tail = head + tail;
+
+	mapcount = page_mapcount(page_tail);
+	BUG_ON(atomic_read(&page_tail->_count) != 0);
+
+	/*
+	 * tail_page->_count is zero and not changing from under us. But
+	 * get_page_unless_zero() may be running from under us on the
+	 * tail_page. If we used atomic_set() below instead of atomic_add(), we
+	 * would then run atomic_set() concurrently with
+	 * get_page_unless_zero(), and atomic_set() is implemented in C not
+	 * using locked ops. spin_unlock on x86 sometime uses locked ops
+	 * because of PPro errata 66, 92, so unless somebody can guarantee
+	 * atomic_set() here would be safe on all archs (and not only on x86),
+	 * it's safer to use atomic_add().
+	 */
+	atomic_add(page_mapcount(page_tail) + 1, &page_tail->_count);
+
+	/* after clearing PageTail the gup refcount can be released */
+	smp_mb__after_atomic();
+
+	/*
+	 * retain hwpoison flag of the poisoned tail page:
+	 *   fix for the unsuitable process killed on Guest Machine(KVM)
+	 *   by the memory-failure.
+	 */
+	page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP | __PG_HWPOISON;
+	page_tail->flags |= (head->flags &
+			((1L << PG_referenced) |
+			 (1L << PG_swapbacked) |
+			 (1L << PG_mlocked) |
+			 (1L << PG_uptodate) |
+			 (1L << PG_active) |
+			 (1L << PG_locked) |
+			 (1L << PG_unevictable)));
+	page_tail->flags |= (1L << PG_dirty);
+
+	/* clear PageTail before overwriting first_page */
+	smp_wmb();
+
+	/* ->mapping in first tail page is compound_mapcount */
+	BUG_ON(tail != 1 && page_tail->mapping != TAIL_MAPPING);
+	page_tail->mapping = head->mapping;
+
+	page_tail->index = head->index + tail;
+	page_cpupid_xchg_last(page_tail, page_cpupid_last(head));
+	lru_add_page_tail(head, page_tail, lruvec, list);
+
+	return mapcount;
+}
+
+static void __split_huge_page(struct page *page, struct list_head *list)
+{
+	struct page *head = compound_head(page);
+	struct zone *zone = page_zone(head);
+	struct lruvec *lruvec;
+	int i, tail_mapcount;
+
+	/* prevent PageLRU to go away from under us, and freeze lru stats */
+	spin_lock_irq(&zone->lru_lock);
+	lruvec = mem_cgroup_page_lruvec(head, zone);
+
+	/* complete memcg works before add pages to LRU */
+	mem_cgroup_split_huge_fixup(head);
+
+	tail_mapcount = 0;
+	for (i = HPAGE_PMD_NR - 1; i >= 1; i--)
+		tail_mapcount += __split_huge_page_tail(head, i, lruvec, list);
+	atomic_sub(tail_mapcount, &head->_count);
+
+	ClearPageCompound(head);
+	spin_unlock_irq(&zone->lru_lock);
+
+	unfreeze_page(page_anon_vma(head), head);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		struct page *subpage = head + i;
+		if (subpage == page)
+			continue;
+		unlock_page(subpage);
+
+		/*
+		 * Subpages may be freed if there wasn't any mapping
+		 * like if add_to_swap() is running on a lru page that
+		 * had its mapping zapped. And freeing these pages
+		 * requires taking the lru_lock so we do the put_page
+		 * of the tail pages after the split is complete.
+		 */
+		put_page(subpage);
+	}
+}
+
+/*
+ * This function splits huge page into normal pages. @page can point to any
+ * subpage of huge page to split. Split doesn't change the position of @page.
+ *
+ * The caller must hold a pin on @page, otherwise split fails with -EBUSY.
+ * The huge page must be locked.
+ *
+ * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
+ *
+ * Both head page and tail pages will inherit mapping, flags, and so on from
+ * the hugepage.
+ *
+ * The GUP pin and PG_locked are transferred to @page. The rest of the
+ * subpages can be freed if they are not mapped.
+ *
+ * Returns 0 if the hugepage is split successfully.
+ * Returns -EBUSY if the page is pinned or if anon_vma disappeared from under
+ * us.
+ */
+int split_huge_page_to_list(struct page *page, struct list_head *list)
+{
+	struct page *head = compound_head(page);
+	struct anon_vma *anon_vma;
+	int mapcount, ret;
+
+	BUG_ON(is_huge_zero_page(page));
+	BUG_ON(!PageAnon(page));
+	BUG_ON(!PageLocked(page));
+	BUG_ON(!PageSwapBacked(page));
+	BUG_ON(!PageCompound(page));
+
+	/*
+	 * The caller does not necessarily hold an mmap_sem that would prevent
+	 * the anon_vma disappearing so we first we take a reference to it
+	 * and then lock the anon_vma for write. This is similar to
+	 * page_lock_anon_vma_read except the write lock is taken to serialise
+	 * against parallel split or collapse operations.
+	 */
+	anon_vma = page_get_anon_vma(head);
+	if (!anon_vma) {
+		ret = -EBUSY;
+		goto out;
+	}
+	anon_vma_lock_write(anon_vma);
+
+	/*
+	 * Racy check if we can split the page, before freeze_page() will
+	 * split PMDs
+	 */
+	if (total_mapcount(head) != page_count(head) - 1) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	freeze_page(anon_vma, head);
+	VM_BUG_ON_PAGE(compound_mapcount(head), head);
+
+	mapcount = total_mapcount(head);
+	if (mapcount == page_count(head) - 1) {
+		__split_huge_page(page, list);
+		ret = 0;
+	} else if (mapcount > page_count(page) - 1) {
+		pr_alert("total_mapcount: %u, page_count(): %u\n",
+				mapcount, page_count(page));
+		if (PageTail(page))
+			dump_page(head, NULL);
+		dump_page(page, "tail_mapcount > page_count(page) - 1");
+		BUG();
+	} else {
+		unfreeze_page(anon_vma, head);
+		ret = -EBUSY;
+	}
+
+out_unlock:
+	anon_vma_unlock_write(anon_vma);
+	put_anon_vma(anon_vma);
+out:
+	count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
+	return ret;
+}
diff --git a/mm/internal.h b/mm/internal.h
index 98bce4d12a16..aee0f2566fdd 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -13,6 +13,7 @@
 
 #include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/pagemap.h>
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
@@ -244,10 +245,27 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
 
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-extern unsigned long vma_address(struct page *page,
-				 struct vm_area_struct *vma);
-#endif
+/*
+ * At what user virtual address is page expected in @vma?
+ */
+static inline unsigned long
+__vma_address(struct page *page, struct vm_area_struct *vma)
+{
+	pgoff_t pgoff = page_to_pgoff(page);
+	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+}
+
+static inline unsigned long
+vma_address(struct page *page, struct vm_area_struct *vma)
+{
+	unsigned long address = __vma_address(page, vma);
+
+	/* page should be within @vma mapping range */
+	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+
+	return address;
+}
+
 #else /* !CONFIG_MMU */
 static inline void clear_page_mlock(struct page *page) { }
 static inline void mlock_vma_page(struct page *page) { }
diff --git a/mm/rmap.c b/mm/rmap.c
index 047953145710..723af5bbeb02 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -561,27 +561,6 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
 }
 
 /*
- * At what user virtual address is page expected in @vma?
- */
-static inline unsigned long
-__vma_address(struct page *page, struct vm_area_struct *vma)
-{
-	pgoff_t pgoff = page_to_pgoff(page);
-	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
-}
-
-inline unsigned long
-vma_address(struct page *page, struct vm_area_struct *vma)
-{
-	unsigned long address = __vma_address(page, vma);
-
-	/* page should be within @vma mapping range */
-	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
-
-	return address;
-}
-
-/*
  * At what user virtual address is page expected in vma?
  * Caller should check the page is actually part of the vma.
  */
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 26/28] thp: introduce deferred_split_huge_page()
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:04   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:04 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

Currently we don't split a huge page on partial unmap. That's not
ideal: it can lead to memory overhead.

Fortunately, we can detect partial unmap in page_remove_rmap(). But we
cannot call split_huge_page() from there due to the locking context.

It's also counterproductive to do it directly from the munmap()
codepath: in many cases we will hit this from exit(2), and splitting
the huge page just to free it up as small pages is not what we really
want.

The patch introduces deferred_split_huge_page(), which puts the huge
page on a queue for splitting. The splitting itself happens under
memory pressure, via the shrinker interface. The page is dropped from
the list on freeing, through the compound page destructor.
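
A rough illustration of the intended flow (a sketch, not part of the
patch; example_on_partial_unmap() is a hypothetical caller standing in
for page_remove_rmap()):

	static void example_on_partial_unmap(struct page *subpage)
	{
		/* queue the head page for later splitting; nothing is split here */
		deferred_split_huge_page(compound_head(subpage));

		/*
		 * Under memory pressure the shrinker registered below walks
		 * split_queue and calls split_huge_page() on each queued
		 * head page; freeing the compound page drops it from the
		 * queue via the destructor.
		 */
	}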

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 include/linux/huge_mm.h |   4 ++
 include/linux/mm.h      |   2 +
 mm/huge_memory.c        | 126 ++++++++++++++++++++++++++++++++++++++++++++++--
 mm/migrate.c            |   1 +
 mm/page_alloc.c         |   2 +-
 mm/rmap.c               |   3 ++
 6 files changed, 133 insertions(+), 5 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3c0a50ed3eb8..8bf0f8d1c796 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -92,11 +92,14 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 
 extern unsigned long transparent_hugepage_flags;
 
+extern void prep_transhuge_page(struct page *page);
+
 int split_huge_page_to_list(struct page *page, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
 	return split_huge_page_to_list(page, NULL);
 }
+void deferred_split_huge_page(struct page *page);
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address);
@@ -174,6 +177,7 @@ static inline int split_huge_page(struct page *page)
 {
 	return 0;
 }
+static inline void deferred_split_huge_page(struct page *page) {}
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
 static inline int hugepage_madvise(struct vm_area_struct *vma,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ddc184c55d6..331b15b02514 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -511,6 +511,8 @@ static inline void set_compound_order(struct page *page, unsigned long order)
 	page[1].compound_order = order;
 }
 
+void free_compound_page(struct page *page);
+
 #ifdef CONFIG_MMU
 /*
  * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7ad338ab2ac8..cce4604c192f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -70,6 +70,8 @@ static int khugepaged(void *none);
 static int khugepaged_slab_init(void);
 static void khugepaged_slab_exit(void);
 
+static void free_transhuge_page(struct page *page);
+
 #define MM_SLOTS_HASH_BITS 10
 static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
 
@@ -104,6 +106,10 @@ static struct khugepaged_scan khugepaged_scan = {
 	.mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
 };
 
+static DEFINE_SPINLOCK(split_queue_lock);
+static LIST_HEAD(split_queue);
+static unsigned long split_queue_len;
+static struct shrinker deferred_split_shrinker;
 
 static int set_recommended_min_free_kbytes(void)
 {
@@ -642,6 +648,9 @@ static int __init hugepage_init(void)
 	err = register_shrinker(&huge_zero_page_shrinker);
 	if (err)
 		goto err_hzp_shrinker;
+	err = register_shrinker(&deferred_split_shrinker);
+	if (err)
+		goto err_split_shrinker;
 
 	/*
 	 * By default disable transparent hugepages on smaller systems,
@@ -659,6 +668,8 @@ static int __init hugepage_init(void)
 
 	return 0;
 err_khugepaged:
+	unregister_shrinker(&deferred_split_shrinker);
+err_split_shrinker:
 	unregister_shrinker(&huge_zero_page_shrinker);
 err_hzp_shrinker:
 	khugepaged_slab_exit();
@@ -715,6 +726,12 @@ static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
 	return entry;
 }
 
+void prep_transhuge_page(struct page *page)
+{
+	INIT_LIST_HEAD(&page[2].lru);
+	set_compound_page_dtor(page, free_transhuge_page);
+}
+
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 					struct vm_area_struct *vma,
 					unsigned long haddr, pmd_t *pmd,
@@ -834,7 +851,9 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
 	}
-	if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page, gfp))) {
+	prep_transhuge_page(page);
+	if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr,
+					pmd, page, gfp))) {
 		put_page(page);
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
@@ -1095,7 +1114,9 @@ alloc:
 	} else
 		new_page = NULL;
 
-	if (unlikely(!new_page)) {
+	if (likely(new_page)) {
+		prep_transhuge_page(new_page);
+	} else {
 		if (!page) {
 			split_huge_pmd(vma, pmd, address);
 			ret |= VM_FAULT_FALLBACK;
@@ -2019,6 +2040,7 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm,
 		return NULL;
 	}
 
+	prep_transhuge_page(*hpage);
 	count_vm_event(THP_COLLAPSE_ALLOC);
 	return *hpage;
 }
@@ -2030,8 +2052,12 @@ static int khugepaged_find_target_node(void)
 
 static inline struct page *alloc_hugepage(int defrag)
 {
-	return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
-			   HPAGE_PMD_ORDER);
+	struct page *page;
+
+	page = alloc_pages(alloc_hugepage_gfpmask(defrag, 0), HPAGE_PMD_ORDER);
+	if (page)
+		prep_transhuge_page(page);
+	return page;
 }
 
 static struct page *khugepaged_alloc_hugepage(bool *wait)
@@ -2916,6 +2942,13 @@ static void __split_huge_page(struct page *page, struct list_head *list)
 	spin_lock_irq(&zone->lru_lock);
 	lruvec = mem_cgroup_page_lruvec(head, zone);
 
+	spin_lock(&split_queue_lock);
+	if (!list_empty(&head[2].lru)) {
+		split_queue_len--;
+		list_del(&head[2].lru);
+	}
+	spin_unlock(&split_queue_lock);
+
 	/* complete memcg works before add pages to LRU */
 	mem_cgroup_split_huge_fixup(head);
 
@@ -3026,3 +3059,88 @@ out:
 	count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
 	return ret;
 }
+
+static void free_transhuge_page(struct page *page)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&split_queue_lock, flags);
+	if (!list_empty(&page[2].lru)) {
+		split_queue_len--;
+		list_del(&page[2].lru);
+	}
+	spin_unlock_irqrestore(&split_queue_lock, flags);
+	free_compound_page(page);
+}
+
+void deferred_split_huge_page(struct page *page)
+{
+	unsigned long flags;
+
+	VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+
+	/* we use page->lru in second tail page: assuming THP order >= 2 */
+	BUILD_BUG_ON(HPAGE_PMD_ORDER < 2);
+
+	spin_lock_irqsave(&split_queue_lock, flags);
+	if (list_empty(&page[2].lru)) {
+		list_add_tail(&page[2].lru, &split_queue);
+		split_queue_len++;
+	}
+	spin_unlock_irqrestore(&split_queue_lock, flags);
+}
+
+static unsigned long deferred_split_count(struct shrinker *shrink,
+		struct shrink_control *sc)
+{
+	/*
+	 * Splitting a page from split_queue will free up at least one page,
+	 * at most HPAGE_PMD_NR - 1. We don't track exact number.
+	 * Let's use HPAGE_PMD_NR / 2 as ballpark.
+	 */
+	return ACCESS_ONCE(split_queue_len) * HPAGE_PMD_NR / 2;
+}
+
+static unsigned long deferred_split_scan(struct shrinker *shrink,
+		struct shrink_control *sc)
+{
+	unsigned long flags;
+	LIST_HEAD(list);
+	struct page *page, *next;
+	int split = 0;
+
+	spin_lock_irqsave(&split_queue_lock, flags);
+	list_splice_init(&split_queue, &list);
+
+	/* Take pin on all head pages to avoid freeing them under us */
+	list_for_each_entry_safe(page, next, &list, lru) {
+		page = compound_head(page);
+		/* race with put_compound_page() */
+		if (!get_page_unless_zero(page)) {
+			list_del_init(&page[2].lru);
+			split_queue_len--;
+		}
+	}
+	spin_unlock_irqrestore(&split_queue_lock, flags);
+
+	list_for_each_entry_safe(page, next, &list, lru) {
+		lock_page(page);
+		/* split_huge_page() removes page from list on success */
+		if (!split_huge_page(page))
+			split++;
+		unlock_page(page);
+		put_page(page);
+	}
+
+	spin_lock_irqsave(&split_queue_lock, flags);
+	list_splice_tail(&list, &split_queue);
+	spin_unlock_irqrestore(&split_queue_lock, flags);
+
+	return split * HPAGE_PMD_NR / 2;
+}
+
+static struct shrinker deferred_split_shrinker = {
+	.count_objects = deferred_split_count,
+	.scan_objects = deferred_split_scan,
+	.seeks = DEFAULT_SEEKS,
+};
diff --git a/mm/migrate.c b/mm/migrate.c
index 03b9c4ba56dc..ef3472397ced 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1739,6 +1739,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		HPAGE_PMD_ORDER);
 	if (!new_page)
 		goto out_fail;
+	prep_transhuge_page(new_page);
 
 	isolated = numamigrate_isolate_page(pgdat, page);
 	if (!isolated) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ac331be78308..f3ffce74d9dc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -356,7 +356,7 @@ out:
  * This usage means that zero-order pages may not be compound.
  */
 
-static void free_compound_page(struct page *page)
+void free_compound_page(struct page *page)
 {
 	__free_pages_ok(page, compound_order(page));
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 723af5bbeb02..55a0108bec99 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1227,6 +1227,9 @@ void page_remove_rmap(struct page *page, bool compound)
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
 
+	if (PageTransCompound(page))
+		deferred_split_huge_page(compound_head(page));
+
 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
 	 * but that might overwrite a racing page_add_anon_rmap
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 27/28] mm: re-enable THP
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:04   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:04 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

All parts of THP with the new refcounting are now in place. We can now
allow THP to be enabled.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
---
 mm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 2c96d2484527..baeb0c4a686a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -408,7 +408,7 @@ config NOMMU_INITIAL_TRIM_EXCESS
 
 config TRANSPARENT_HUGEPAGE
 	bool "Transparent Hugepage Support"
-	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && BROKEN
+	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select COMPACTION
 	help
 	  Transparent Hugepages allows the kernel to use huge pages and
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [PATCHv5 28/28] thp: update documentation
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-23 21:04   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-23 21:04 UTC (permalink / raw)
  To: Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm, Kirill A. Shutemov

The patch updates Documentation/vm/transhuge.txt to reflect changes in
THP design.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/transhuge.txt | 100 +++++++++++++++++++----------------------
 1 file changed, 45 insertions(+), 55 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 6b31cfbe2a9a..a12171e850d4 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -35,10 +35,10 @@ miss is going to run faster.
 
 == Design ==
 
-- "graceful fallback": mm components which don't have transparent
-  hugepage knowledge fall back to breaking a transparent hugepage and
-  working on the regular pages and their respective regular pmd/pte
-  mappings
+- "graceful fallback": mm components which don't have transparent hugepage
+  knowledge fall back to breaking huge pmd mapping into table of ptes and,
+  if necessary, splitting a transparent hugepage. Therefore these components
+  can continue working on the regular pages or regular pte mappings.
 
 - if a hugepage allocation fails because of memory fragmentation,
   regular pages should be gracefully allocated instead and mixed in
@@ -200,9 +200,18 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range
 	of pages that should be collapsed into one huge page but failed
 	the allocation.
 
-thp_split is incremented every time a huge page is split into base
+thp_split_page is incremented every time a huge page is split into base
 	pages. This can happen for a variety of reasons but a common
 	reason is that a huge page is old and is being reclaimed.
+	This action implies splitting all PMDs the page is mapped with.
+
+thp_split_page_failed is incremented if the kernel fails to split a huge
+	page. This can happen if the page was pinned by somebody.
+
+thp_split_pmd is incremented every time a PMD is split into a table of PTEs.
+	This can happen, for instance, when an application calls mprotect()
+	or munmap() on part of a huge page. It doesn't split the huge page,
+	only the page table entry.
 
 thp_zero_page_alloc is incremented every time a huge zero page is
 	successfully allocated. It includes allocations which where
@@ -253,10 +262,8 @@ is complete, so they won't ever notice the fact the page is huge. But
 if any driver is going to mangle over the page structure of the tail
 page (like for checking page->mapping or other bits that are relevant
 for the head page and not the tail page), it should be updated to jump
-to check head page instead (while serializing properly against
-split_huge_page() to avoid the head and tail pages to disappear from
-under it, see the futex code to see an example of that, hugetlbfs also
-needed special handling in futex code for similar reasons).
+to check head page instead. Taking reference on any head/tail page would
+prevent page from being split by anyone.
 
 NOTE: these aren't new constraints to the GUP API, and they match the
 same constrains that applies to hugetlbfs too, so any driver capable
@@ -291,9 +298,9 @@ unaffected. libhugetlbfs will also work fine as usual.
 == Graceful fallback ==
 
 Code walking pagetables but unware about huge pmds can simply call
-split_huge_page_pmd(vma, addr, pmd) where the pmd is the one returned by
+split_huge_pmd(vma, pmd, addr) where the pmd is the one returned by
 pmd_offset. It's trivial to make the code transparent hugepage aware
-by just grepping for "pmd_offset" and adding split_huge_page_pmd where
+by just grepping for "pmd_offset" and adding split_huge_pmd where
 missing after pmd_offset returns the pmd. Thanks to the graceful
 fallback design, with a one liner change, you can avoid to write
 hundred if not thousand of lines of complex code to make your code
@@ -302,7 +309,8 @@ hugepage aware.
 If you're not walking pagetables but you run into a physical hugepage
 but you can't handle it natively in your code, you can split it by
 calling split_huge_page(page). This is what the Linux VM does before
-it tries to swapout the hugepage for example.
+it tries to swapout the hugepage for example. split_huge_page() can fail
+if the page is pinned, and you must handle this case correctly.
 
 Example to make mremap.c transparent hugepage aware with a one liner
 change:
@@ -314,14 +322,14 @@ diff --git a/mm/mremap.c b/mm/mremap.c
 		return NULL;
 
 	pmd = pmd_offset(pud, addr);
-+	split_huge_page_pmd(vma, addr, pmd);
++	split_huge_pmd(vma, pmd, addr);
 	if (pmd_none_or_clear_bad(pmd))
 		return NULL;
 
 == Locking in hugepage aware code ==
 
 We want as much code as possible hugepage aware, as calling
-split_huge_page() or split_huge_page_pmd() has a cost.
+split_huge_page() or split_huge_pmd() has a cost.
 
 To make pagetable walks huge pmd aware, all you need to do is to call
 pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
@@ -330,47 +338,29 @@ created from under you by khugepaged (khugepaged collapse_huge_page
 takes the mmap_sem in write mode in addition to the anon_vma lock). If
 pmd_trans_huge returns false, you just fallback in the old code
 paths. If instead pmd_trans_huge returns true, you have to take the
-mm->page_table_lock and re-run pmd_trans_huge. Taking the
-page_table_lock will prevent the huge pmd to be converted into a
-regular pmd from under you (split_huge_page can run in parallel to the
+page table lock (pmd_lock()) and re-run pmd_trans_huge. Taking the
+page table lock will prevent the huge pmd from being converted into a
+regular pmd from under you (split_huge_pmd can run in parallel to the
 pagetable walk). If the second pmd_trans_huge returns false, you
-should just drop the page_table_lock and fallback to the old code as
-before. Otherwise you should run pmd_trans_splitting on the pmd. In
-case pmd_trans_splitting returns true, it means split_huge_page is
-already in the middle of splitting the page. So if pmd_trans_splitting
-returns true it's enough to drop the page_table_lock and call
-wait_split_huge_page and then fallback the old code paths. You are
-guaranteed by the time wait_split_huge_page returns, the pmd isn't
-huge anymore. If pmd_trans_splitting returns false, you can proceed to
-process the huge pmd and the hugepage natively. Once finished you can
-drop the page_table_lock.
-
-== compound_lock, get_user_pages and put_page ==
+should just drop the page table lock and fallback to the old code as
+before. Otherwise you can proceed to process the huge pmd and the
+hugepage natively. Once finished you can drop the page table lock.
+
+== Refcounts and transparent huge pages ==
 
+As with other compound page types we do all refcounting for THP on the
+head page, but unlike other compound pages THP supports splitting.
 split_huge_page internally has to distribute the refcounts in the head
-page to the tail pages before clearing all PG_head/tail bits from the
-page structures. It can do that easily for refcounts taken by huge pmd
-mappings. But the GUI API as created by hugetlbfs (that returns head
-and tail pages if running get_user_pages on an address backed by any
-hugepage), requires the refcount to be accounted on the tail pages and
-not only in the head pages, if we want to be able to run
-split_huge_page while there are gup pins established on any tail
-page. Failure to be able to run split_huge_page if there's any gup pin
-on any tail page, would mean having to split all hugepages upfront in
-get_user_pages which is unacceptable as too many gup users are
-performance critical and they must work natively on hugepages like
-they work natively on hugetlbfs already (hugetlbfs is simpler because
-hugetlbfs pages cannot be split so there wouldn't be requirement of
-accounting the pins on the tail pages for hugetlbfs). If we wouldn't
-account the gup refcounts on the tail pages during gup, we won't know
-anymore which tail page is pinned by gup and which is not while we run
-split_huge_page. But we still have to add the gup pin to the head page
-too, to know when we can free the compound page in case it's never
-split during its lifetime. That requires changing not just
-get_page, but put_page as well so that when put_page runs on a tail
-page (and only on a tail page) it will find its respective head page,
-and then it will decrease the head page refcount in addition to the
-tail page refcount. To obtain a head page reliably and to decrease its
-refcount without race conditions, put_page has to serialize against
-__split_huge_page_refcount using a special per-page lock called
-compound_lock.
+page to the tail pages before clearing all PG_head/tail bits from the page
+structures. It can be done easily for refcounts taken by page table
+entries. But we don't have enough information on how to distribute any
+additional pins (e.g. from get_user_pages). split_huge_page fails any
+request to split a pinned huge page: it expects the page count to be equal
+to the sum of the mapcounts of all sub-pages plus one (the split_huge_page
+caller must hold a reference on the head page).
+
+split_huge_page uses migration entries to stabilize page->_count and
+page->_mapcount.
+
+Note that split_huge_pmd() doesn't have any limitation on refcounting:
+the pmd can be split at any point and the operation never fails.
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread
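
For readers less familiar with the locking rules in the updated "Locking in
hugepage aware code" section above, here is a minimal sketch of the
pagetable-walk pattern it prescribes. It assumes the post-patchset helpers
(pmd_trans_huge(), pmd_lock(), split_huge_pmd()); the function name is made
up and this is not the exact upstream code.

static void walk_one_pmd_sketch(struct vm_area_struct *vma,
				unsigned long addr, pmd_t *pmd)
{
	spinlock_t *ptl;

	/* caller holds mmap_sem, as required above */
	if (pmd_trans_huge(*pmd)) {
		ptl = pmd_lock(vma->vm_mm, pmd);
		if (pmd_trans_huge(*pmd)) {
			/* still huge: process the huge pmd natively
			 * under the page table lock */
			spin_unlock(ptl);
			return;
		}
		/* lost the race: it became a regular pmd under us */
		spin_unlock(ptl);
	}
	/* regular pmd: walk the 4k ptes as usual.  Code that cannot
	 * handle a huge pmd at all would instead call
	 * split_huge_pmd(vma, pmd, addr) up front ("graceful
	 * fallback") and always end up here. */
}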

* Re: [PATCHv5 00/28] THP refcounting redesign
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-27 23:03   ` Andrew Morton
  -1 siblings, 0 replies; 189+ messages in thread
From: Andrew Morton @ 2015-04-27 23:03 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Hugh Dickins, Dave Hansen, Mel Gorman,
	Rik van Riel, Vlastimil Babka, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On Fri, 24 Apr 2015 00:03:35 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> Hello everybody,
> 
> Here's reworked version of my patchset. All known issues were addressed.
> 
> The goal of patchset is to make refcounting on THP pages cheaper with
> simpler semantics and allow the same THP compound page to be mapped with
> PMD and PTEs. This is required to get reasonable THP-pagecache
> implementation.

Are there any measurable performance improvements?

> With the new refcounting design it's much easier to protect against
> split_huge_page(): simple reference on a page will make you the deal.
> It makes gup_fast() implementation simpler and doesn't require
> special-case in futex code to handle tail THP pages.
> 
> It should improve THP utilization over the system since splitting THP in
> one process doesn't necessary lead to splitting the page in all other
> processes have the page mapped.
> 
> The patchset drastically lower complexity of get_page()/put_page()
> codepaths. I encourage reviewers look on this code before-and-after to
> justify time budget on reviewing this patchset.
>
> ...
>
>  59 files changed, 1144 insertions(+), 1509 deletions(-)

It's huge.  I'm going to need help reviewing this.  Have earlier
versions been reviewed much?  Who do you believe are suitable
reviewers?

^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 00/28] THP refcounting redesign
  2015-04-27 23:03   ` Andrew Morton
@ 2015-04-27 23:33     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-27 23:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Hugh Dickins, Dave Hansen,
	Mel Gorman, Rik van Riel, Vlastimil Babka, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On Mon, Apr 27, 2015 at 04:03:48PM -0700, Andrew Morton wrote:
> On Fri, 24 Apr 2015 00:03:35 +0300 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > Hello everybody,
> > 
> > Here's reworked version of my patchset. All known issues were addressed.
> > 
> > The goal of patchset is to make refcounting on THP pages cheaper with
> > simpler semantics and allow the same THP compound page to be mapped with
> > PMD and PTEs. This is required to get reasonable THP-pagecache
> > implementation.
> 
> Are there any measurable performance improvements?

I was focused on stability up to this point. I'll bring some numbers.

> > With the new refcounting design it's much easier to protect against
> > split_huge_page(): simple reference on a page will make you the deal.
> > It makes gup_fast() implementation simpler and doesn't require
> > special-case in futex code to handle tail THP pages.
> > 
> > It should improve THP utilization over the system since splitting THP in
> > one process doesn't necessary lead to splitting the page in all other
> > processes have the page mapped.
> > 
> > The patchset drastically lower complexity of get_page()/put_page()
> > codepaths. I encourage reviewers look on this code before-and-after to
> > justify time budget on reviewing this patchset.
> >
> > ...
> >
> >  59 files changed, 1144 insertions(+), 1509 deletions(-)
> 
> It's huge.  I'm going to need help reviewing this.  Have earlier
> versions been reviewed much?

The most helpful was feedback from Aneesh for v4. Hugh pointed to a few weak
parts. But I can't say that the patchset was reviewed much.

Sasha helped with testing. A few bugs he found were fixed while preparing v5
for posting. One more issue was pointed out after posting the patchset. I'm
working on it now.

> Who do you believe are suitable reviewers?

Andrea is an obvious candidate. Hugh recently looked into the same area with
his team pages idea.

In general, I tried to keep people who can be helpful with review or
testing on the CC list.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 18/28] x86, thp: remove infrastructure for handling splitting PMDs
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-04-29  9:13     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 189+ messages in thread
From: Aneesh Kumar K.V @ 2015-04-29  9:13 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Johannes Weiner, Michal Hocko, Jerome Marchand, Sasha Levin,
	linux-kernel, linux-mm, Kirill A. Shutemov

"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> writes:

> With new refcounting we don't need to mark PMDs splitting. Let's drop
> code to handle this.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>
> ---
>  arch/x86/include/asm/pgtable.h       |  9 ---------
>  arch/x86/include/asm/pgtable_types.h |  2 --
>  arch/x86/mm/gup.c                    | 13 +------------
>  arch/x86/mm/pgtable.c                | 14 --------------
>  4 files changed, 1 insertion(+), 37 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index f89d6c9943ea..21a2e25a5393 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -158,11 +158,6 @@ static inline int pmd_large(pmd_t pte)
>  }
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -static inline int pmd_trans_splitting(pmd_t pmd)
> -{
> -	return pmd_val(pmd) & _PAGE_SPLITTING;
> -}
> -
>  static inline int pmd_trans_huge(pmd_t pmd)
>  {
>  	return pmd_val(pmd) & _PAGE_PSE;
> @@ -792,10 +787,6 @@ extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
>  				  unsigned long address, pmd_t *pmdp);
>  
>  
> -#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
> -extern void pmdp_splitting_flush(struct vm_area_struct *vma,
> -				 unsigned long addr, pmd_t *pmdp);
> -

Can we keep pmdp_splitting_flush or a variant and use that before a
hugepage split? That is, to have a special function to do pmd_clear
before splitting? We still depend on an IPI being sent to other cpus on
split, and not all archs will do that in pmdp_clear_flush_notify. I
guess we need the IPI to make sure a local_irq_disable can prevent a
parallel split?

Something like this?

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cce4604c192f..0a0d00b21f76 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2608,7 +2608,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	young = pmd_young(*pmd);
 
 	/* leave pmd empty until pte is filled */
-	pmdp_clear_flush_notify(vma, haddr, pmd);
+	pmdp_splitting_clear_flush_notify(vma, haddr, pmd);
 
 	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
 	pmd_populate(mm, &_pmd, pgtable);


-aneesh


^ permalink raw reply related	[flat|nested] 189+ messages in thread
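
The serialization Aneesh is asking about can be pictured roughly as follows;
this is a sketch only, with a made-up function name, not the actual arch
code. Fast GUP walks the page tables with local interrupts disabled, so a
pmd flush that is broadcast via IPI cannot complete until the walker
finishes; whether that guarantee still holds without a dedicated
pmdp_splitting_flush() depends on the architecture sending an IPI from
pmdp_clear_flush_notify(), which is the point of the question above.

int fast_gup_sketch(unsigned long start, unsigned long nr_pages,
		    struct page **pages)
{
	unsigned long flags;
	int nr = 0;

	local_irq_save(flags);
	/*
	 * ... pgd/pud/pmd/pte walk goes here ...
	 *
	 * While interrupts are off, any pmdp flush that relies on an
	 * IPI to reach this cpu cannot complete, so a parallel split
	 * is expected to wait until we re-enable interrupts below.
	 */
	local_irq_restore(flags);

	return nr;
}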

* Re: [PATCHv5 01/28] mm, proc: adjust PSS calculation
  2015-04-23 21:03   ` Kirill A. Shutemov
  (?)
@ 2015-04-29 15:49   ` Jerome Marchand
  -1 siblings, 0 replies; 189+ messages in thread
From: Jerome Marchand @ 2015-04-29 15:49 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Sasha Levin,
	linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 3073 bytes --]

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> With new refcounting all subpages of the compound page do not necessarily
> have the same mapcount. We need to take into account the mapcount of every
> sub-page.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Jerome Marchand <jmarchan@redhat.com>

> ---
>  fs/proc/task_mmu.c | 43 ++++++++++++++++++++++---------------------
>  1 file changed, 22 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 956b75d61809..95bc384ee3f7 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -449,9 +449,10 @@ struct mem_size_stats {
>  };
>  
>  static void smaps_account(struct mem_size_stats *mss, struct page *page,
> -		unsigned long size, bool young, bool dirty)
> +		bool compound, bool young, bool dirty)
>  {
> -	int mapcount;
> +	int i, nr = compound ? hpage_nr_pages(page) : 1;
> +	unsigned long size = nr * PAGE_SIZE;
>  
>  	if (PageAnon(page))
>  		mss->anonymous += size;
> @@ -460,23 +461,23 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
>  	/* Accumulate the size in pages that have been accessed. */
>  	if (young || PageReferenced(page))
>  		mss->referenced += size;
> -	mapcount = page_mapcount(page);
> -	if (mapcount >= 2) {
> -		u64 pss_delta;
>  
> -		if (dirty || PageDirty(page))
> -			mss->shared_dirty += size;
> -		else
> -			mss->shared_clean += size;
> -		pss_delta = (u64)size << PSS_SHIFT;
> -		do_div(pss_delta, mapcount);
> -		mss->pss += pss_delta;
> -	} else {
> -		if (dirty || PageDirty(page))
> -			mss->private_dirty += size;
> -		else
> -			mss->private_clean += size;
> -		mss->pss += (u64)size << PSS_SHIFT;
> +	for (i = 0; i < nr; i++) {
> +		int mapcount = page_mapcount(page + i);
> +
> +		if (mapcount >= 2) {
> +			if (dirty || PageDirty(page + i))
> +				mss->shared_dirty += PAGE_SIZE;
> +			else
> +				mss->shared_clean += PAGE_SIZE;
> +			mss->pss += (PAGE_SIZE << PSS_SHIFT) / mapcount;
> +		} else {
> +			if (dirty || PageDirty(page + i))
> +				mss->private_dirty += PAGE_SIZE;
> +			else
> +				mss->private_clean += PAGE_SIZE;
> +			mss->pss += PAGE_SIZE << PSS_SHIFT;
> +		}
>  	}
>  }
>  
> @@ -500,7 +501,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>  
>  	if (!page)
>  		return;
> -	smaps_account(mss, page, PAGE_SIZE, pte_young(*pte), pte_dirty(*pte));
> +
> +	smaps_account(mss, page, false, pte_young(*pte), pte_dirty(*pte));
>  }
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> @@ -516,8 +518,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
>  	if (IS_ERR_OR_NULL(page))
>  		return;
>  	mss->anonymous_thp += HPAGE_PMD_SIZE;
> -	smaps_account(mss, page, HPAGE_PMD_SIZE,
> -			pmd_young(*pmd), pmd_dirty(*pmd));
> +	smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd));
>  }
>  #else
>  static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 189+ messages in thread
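
A condensed sketch of the accounting rule in the patch above, for
illustration only: PSS_SHIFT and page_mapcount() are the ones used in the
quoted hunk, the helper name is made up. With PTE-mapped THP every 4k
sub-page can have its own mapcount, so PSS is summed per sub-page instead of
dividing the whole huge page by a single mapcount.

static u64 pss_of_thp_sketch(struct page *head, int nr_subpages)
{
	u64 pss = 0;
	int i;

	for (i = 0; i < nr_subpages; i++) {
		int mapcount = page_mapcount(head + i);

		/* each mapped sub-page contributes PAGE_SIZE/mapcount */
		if (mapcount)
			pss += (u64)(PAGE_SIZE << PSS_SHIFT) / mapcount;
	}
	return pss;	/* still scaled by PSS_SHIFT, as in smaps */
}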

* Re: [PATCHv5 02/28] rmap: add argument to charge compound page
  2015-04-23 21:03   ` Kirill A. Shutemov
  (?)
@ 2015-04-29 15:53   ` Jerome Marchand
  2015-04-30 11:52       ` Kirill A. Shutemov
  -1 siblings, 1 reply; 189+ messages in thread
From: Jerome Marchand @ 2015-04-29 15:53 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Sasha Levin,
	linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 18049 bytes --]

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> We're going to allow mapping of individual 4k pages of THP compound
> page. It means we cannot rely on the PageTransHuge() check to decide
> whether to map/unmap a small page or the whole THP.
> 
> The patch adds new argument to rmap functions to indicate whether we want
> to operate on whole compound page or only the small page.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>
> ---
>  include/linux/rmap.h    | 12 +++++++++---
>  kernel/events/uprobes.c |  4 ++--
>  mm/filemap_xip.c        |  2 +-
>  mm/huge_memory.c        | 16 ++++++++--------
>  mm/hugetlb.c            |  4 ++--
>  mm/ksm.c                |  4 ++--
>  mm/memory.c             | 14 +++++++-------
>  mm/migrate.c            |  8 ++++----
>  mm/rmap.c               | 43 +++++++++++++++++++++++++++----------------
>  mm/swapfile.c           |  4 ++--
>  10 files changed, 64 insertions(+), 47 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index d3630fa3a17b..e7ecba43ae71 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -159,16 +159,22 @@ static inline void anon_vma_merge(struct vm_area_struct *vma,
>  
>  struct anon_vma *page_get_anon_vma(struct page *page);
>  
> +/* bitflags for do_page_add_anon_rmap() */
> +#define RMAP_EXCLUSIVE 0x01
> +#define RMAP_COMPOUND 0x02
> +
>  /*
>   * rmap interfaces called when adding or removing pte of page
>   */
>  void page_move_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
> -void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
> +void page_add_anon_rmap(struct page *, struct vm_area_struct *,
> +		unsigned long, bool);
>  void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
>  			   unsigned long, int);
> -void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
> +void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
> +		unsigned long, bool);
>  void page_add_file_rmap(struct page *);
> -void page_remove_rmap(struct page *);
> +void page_remove_rmap(struct page *, bool);
>  
>  void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
>  			    unsigned long);
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index cb346f26a22d..5523daf59953 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -183,7 +183,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>  		goto unlock;
>  
>  	get_page(kpage);
> -	page_add_new_anon_rmap(kpage, vma, addr);
> +	page_add_new_anon_rmap(kpage, vma, addr, false);
>  	mem_cgroup_commit_charge(kpage, memcg, false);
>  	lru_cache_add_active_or_unevictable(kpage, vma);
>  
> @@ -196,7 +196,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>  	ptep_clear_flush_notify(vma, addr, ptep);
>  	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
>  
> -	page_remove_rmap(page);
> +	page_remove_rmap(page, false);
>  	if (!page_mapped(page))
>  		try_to_free_swap(page);
>  	pte_unmap_unlock(ptep, ptl);
> diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
> index c175f9f25210..791d9043a983 100644
> --- a/mm/filemap_xip.c
> +++ b/mm/filemap_xip.c
> @@ -189,7 +189,7 @@ retry:
>  			/* Nuke the page table entry. */
>  			flush_cache_page(vma, address, pte_pfn(*pte));
>  			pteval = ptep_clear_flush(vma, address, pte);
> -			page_remove_rmap(page);
> +			page_remove_rmap(page, false);
>  			dec_mm_counter(mm, MM_FILEPAGES);
>  			BUG_ON(pte_dirty(pteval));
>  			pte_unmap_unlock(pte, ptl);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5a137c3a7f2f..b40fc0ff9315 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -752,7 +752,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
>  		pmd_t entry;
>  		entry = mk_huge_pmd(page, vma->vm_page_prot);
>  		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> -		page_add_new_anon_rmap(page, vma, haddr);
> +		page_add_new_anon_rmap(page, vma, haddr, true);
>  		mem_cgroup_commit_charge(page, memcg, false);
>  		lru_cache_add_active_or_unevictable(page, vma);
>  		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> @@ -1043,7 +1043,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  		memcg = (void *)page_private(pages[i]);
>  		set_page_private(pages[i], 0);
> -		page_add_new_anon_rmap(pages[i], vma, haddr);
> +		page_add_new_anon_rmap(pages[i], vma, haddr, false);
>  		mem_cgroup_commit_charge(pages[i], memcg, false);
>  		lru_cache_add_active_or_unevictable(pages[i], vma);
>  		pte = pte_offset_map(&_pmd, haddr);
> @@ -1055,7 +1055,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  
>  	smp_wmb(); /* make pte visible before pmd */
>  	pmd_populate(mm, pmd, pgtable);
> -	page_remove_rmap(page);
> +	page_remove_rmap(page, true);
>  	spin_unlock(ptl);
>  
>  	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> @@ -1175,7 +1175,7 @@ alloc:
>  		entry = mk_huge_pmd(new_page, vma->vm_page_prot);
>  		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
>  		pmdp_clear_flush_notify(vma, haddr, pmd);
> -		page_add_new_anon_rmap(new_page, vma, haddr);
> +		page_add_new_anon_rmap(new_page, vma, haddr, true);
>  		mem_cgroup_commit_charge(new_page, memcg, false);
>  		lru_cache_add_active_or_unevictable(new_page, vma);
>  		set_pmd_at(mm, haddr, pmd, entry);
> @@ -1185,7 +1185,7 @@ alloc:
>  			put_huge_zero_page();
>  		} else {
>  			VM_BUG_ON_PAGE(!PageHead(page), page);
> -			page_remove_rmap(page);
> +			page_remove_rmap(page, true);
>  			put_page(page);
>  		}
>  		ret |= VM_FAULT_WRITE;
> @@ -1440,7 +1440,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  			put_huge_zero_page();
>  		} else {
>  			page = pmd_page(orig_pmd);
> -			page_remove_rmap(page);
> +			page_remove_rmap(page, true);
>  			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
>  			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
>  			VM_BUG_ON_PAGE(!PageHead(page), page);
> @@ -2285,7 +2285,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
>  			 * superfluous.
>  			 */
>  			pte_clear(vma->vm_mm, address, _pte);
> -			page_remove_rmap(src_page);
> +			page_remove_rmap(src_page, false);
>  			spin_unlock(ptl);
>  			free_page_and_swap_cache(src_page);
>  		}
> @@ -2580,7 +2580,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>  
>  	spin_lock(pmd_ptl);
>  	BUG_ON(!pmd_none(*pmd));
> -	page_add_new_anon_rmap(new_page, vma, address);
> +	page_add_new_anon_rmap(new_page, vma, address, true);
>  	mem_cgroup_commit_charge(new_page, memcg, false);
>  	lru_cache_add_active_or_unevictable(new_page, vma);
>  	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e8c92ae35b4b..eb2a0430535e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2797,7 +2797,7 @@ again:
>  		if (huge_pte_dirty(pte))
>  			set_page_dirty(page);
>  
> -		page_remove_rmap(page);
> +		page_remove_rmap(page, true);
>  		force_flush = !__tlb_remove_page(tlb, page);
>  		if (force_flush) {
>  			address += sz;
> @@ -3018,7 +3018,7 @@ retry_avoidcopy:
>  		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
>  		set_huge_pte_at(mm, address, ptep,
>  				make_huge_pte(vma, new_page, 1));
> -		page_remove_rmap(old_page);
> +		page_remove_rmap(old_page, true);
>  		hugepage_add_new_anon_rmap(new_page, vma, address);
>  		/* Make the old page be freed below */
>  		new_page = old_page;
> diff --git a/mm/ksm.c b/mm/ksm.c
> index bc7be0ee2080..fe09f3ddc912 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -957,13 +957,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
>  	}
>  
>  	get_page(kpage);
> -	page_add_anon_rmap(kpage, vma, addr);
> +	page_add_anon_rmap(kpage, vma, addr, false);
>  
>  	flush_cache_page(vma, addr, pte_pfn(*ptep));
>  	ptep_clear_flush_notify(vma, addr, ptep);
>  	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
>  
> -	page_remove_rmap(page);
> +	page_remove_rmap(page, false);
>  	if (!page_mapped(page))
>  		try_to_free_swap(page);
>  	put_page(page);
> diff --git a/mm/memory.c b/mm/memory.c
> index f150f7ed4e84..d6171752ea59 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1122,7 +1122,7 @@ again:
>  					mark_page_accessed(page);
>  				rss[MM_FILEPAGES]--;
>  			}
> -			page_remove_rmap(page);
> +			page_remove_rmap(page, false);
>  			if (unlikely(page_mapcount(page) < 0))
>  				print_bad_pte(vma, addr, ptent, page);
>  			if (unlikely(!__tlb_remove_page(tlb, page))) {
> @@ -2108,7 +2108,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * thread doing COW.
>  		 */
>  		ptep_clear_flush_notify(vma, address, page_table);
> -		page_add_new_anon_rmap(new_page, vma, address);
> +		page_add_new_anon_rmap(new_page, vma, address, false);
>  		mem_cgroup_commit_charge(new_page, memcg, false);
>  		lru_cache_add_active_or_unevictable(new_page, vma);
>  		/*
> @@ -2141,7 +2141,7 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
>  			 * mapcount is visible. So transitively, TLBs to
>  			 * old page will be flushed before it can be reused.
>  			 */
> -			page_remove_rmap(old_page);
> +			page_remove_rmap(old_page, false);
>  		}
>  
>  		/* Free the old page.. */
> @@ -2556,7 +2556,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  		flags &= ~FAULT_FLAG_WRITE;
>  		ret |= VM_FAULT_WRITE;
> -		exclusive = 1;
> +		exclusive = RMAP_EXCLUSIVE;
>  	}
>  	flush_icache_page(vma, page);
>  	if (pte_swp_soft_dirty(orig_pte))
> @@ -2566,7 +2566,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		do_page_add_anon_rmap(page, vma, address, exclusive);
>  		mem_cgroup_commit_charge(page, memcg, true);
>  	} else { /* ksm created a completely new copy */
> -		page_add_new_anon_rmap(page, vma, address);
> +		page_add_new_anon_rmap(page, vma, address, false);
>  		mem_cgroup_commit_charge(page, memcg, false);
>  		lru_cache_add_active_or_unevictable(page, vma);
>  	}
> @@ -2704,7 +2704,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		goto release;
>  
>  	inc_mm_counter_fast(mm, MM_ANONPAGES);
> -	page_add_new_anon_rmap(page, vma, address);
> +	page_add_new_anon_rmap(page, vma, address, false);
>  	mem_cgroup_commit_charge(page, memcg, false);
>  	lru_cache_add_active_or_unevictable(page, vma);
>  setpte:
> @@ -2787,7 +2787,7 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
>  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  	if (anon) {
>  		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> -		page_add_new_anon_rmap(page, vma, address);
> +		page_add_new_anon_rmap(page, vma, address, false);
>  	} else {
>  		inc_mm_counter_fast(vma->vm_mm, MM_FILEPAGES);
>  		page_add_file_rmap(page);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 022adc253cd4..9a380238a4d0 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -166,7 +166,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
>  		else
>  			page_dup_rmap(new);
>  	} else if (PageAnon(new))
> -		page_add_anon_rmap(new, vma, addr);
> +		page_add_anon_rmap(new, vma, addr, false);
>  	else
>  		page_add_file_rmap(new);
>  
> @@ -1795,7 +1795,7 @@ fail_putback:
>  	 * guarantee the copy is visible before the pagetable update.
>  	 */
>  	flush_cache_range(vma, mmun_start, mmun_end);
> -	page_add_anon_rmap(new_page, vma, mmun_start);
> +	page_add_anon_rmap(new_page, vma, mmun_start, true);
>  	pmdp_clear_flush_notify(vma, mmun_start, pmd);
>  	set_pmd_at(mm, mmun_start, pmd, entry);
>  	flush_tlb_range(vma, mmun_start, mmun_end);
> @@ -1806,13 +1806,13 @@ fail_putback:
>  		flush_tlb_range(vma, mmun_start, mmun_end);
>  		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
>  		update_mmu_cache_pmd(vma, address, &entry);
> -		page_remove_rmap(new_page);
> +		page_remove_rmap(new_page, true);
>  		goto fail_putback;
>  	}
>  
>  	mem_cgroup_migrate(page, new_page, false);
>  
> -	page_remove_rmap(page);
> +	page_remove_rmap(page, true);
>  
>  	spin_unlock(ptl);
>  	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index dad23a43e42c..4ca4b5cffd95 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1048,9 +1048,9 @@ static void __page_check_anon_rmap(struct page *page,
>   * (but PageKsm is never downgraded to PageAnon).
>   */

The comment above should be updated to include the new argument.

>  void page_add_anon_rmap(struct page *page,
> -	struct vm_area_struct *vma, unsigned long address)
> +	struct vm_area_struct *vma, unsigned long address, bool compound)
>  {
> -	do_page_add_anon_rmap(page, vma, address, 0);
> +	do_page_add_anon_rmap(page, vma, address, compound ? RMAP_COMPOUND : 0);
>  }
>  
>  /*
> @@ -1059,21 +1059,24 @@ void page_add_anon_rmap(struct page *page,
>   * Everybody else should continue to use page_add_anon_rmap above.
>   */
>  void do_page_add_anon_rmap(struct page *page,
> -	struct vm_area_struct *vma, unsigned long address, int exclusive)
> +	struct vm_area_struct *vma, unsigned long address, int flags)
>  {
>  	int first = atomic_inc_and_test(&page->_mapcount);
>  	if (first) {
> +		bool compound = flags & RMAP_COMPOUND;
> +		int nr = compound ? hpage_nr_pages(page) : 1;
>  		/*
>  		 * We use the irq-unsafe __{inc|mod}_zone_page_stat because
>  		 * these counters are not modified in interrupt context, and
>  		 * pte lock(a spinlock) is held, which implies preemption
>  		 * disabled.
>  		 */
> -		if (PageTransHuge(page))
> +		if (compound) {
> +			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
>  			__inc_zone_page_state(page,
>  					      NR_ANON_TRANSPARENT_HUGEPAGES);
> -		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
> -				hpage_nr_pages(page));
> +		}
> +		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
>  	}
>  	if (unlikely(PageKsm(page)))
>  		return;
> @@ -1081,7 +1084,8 @@ void do_page_add_anon_rmap(struct page *page,
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  	/* address might be in next vma when migration races vma_adjust */
>  	if (first)
> -		__page_set_anon_rmap(page, vma, address, exclusive);
> +		__page_set_anon_rmap(page, vma, address,
> +				flags & RMAP_EXCLUSIVE);
>  	else
>  		__page_check_anon_rmap(page, vma, address);
>  }
> @@ -1097,15 +1101,18 @@ void do_page_add_anon_rmap(struct page *page,
>   * Page does not have to be locked.
>   */

Again, the description of the function should be updated.

>  void page_add_new_anon_rmap(struct page *page,
> -	struct vm_area_struct *vma, unsigned long address)
> +	struct vm_area_struct *vma, unsigned long address, bool compound)
>  {
> +	int nr = compound ? hpage_nr_pages(page) : 1;
> +
>  	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
>  	SetPageSwapBacked(page);
>  	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
> -	if (PageTransHuge(page))
> +	if (compound) {
> +		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
>  		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
> -	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
> -			hpage_nr_pages(page));
> +	}
> +	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
>  	__page_set_anon_rmap(page, vma, address, 1);
>  }
>  
> @@ -1161,9 +1168,12 @@ out:
>   *
>   * The caller needs to hold the pte lock.
>   */

Same here.

Jerome

> -void page_remove_rmap(struct page *page)
> +void page_remove_rmap(struct page *page, bool compound)
>  {
> +	int nr = compound ? hpage_nr_pages(page) : 1;
> +
>  	if (!PageAnon(page)) {
> +		VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
>  		page_remove_file_rmap(page);
>  		return;
>  	}
> @@ -1181,11 +1191,12 @@ void page_remove_rmap(struct page *page)
>  	 * these counters are not modified in interrupt context, and
>  	 * pte lock(a spinlock) is held, which implies preemption disabled.
>  	 */
> -	if (PageTransHuge(page))
> +	if (compound) {
> +		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
>  		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
> +	}
>  
> -	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
> -			      -hpage_nr_pages(page));
> +	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
>  
>  	if (unlikely(PageMlocked(page)))
>  		clear_page_mlock(page);
> @@ -1327,7 +1338,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  		dec_mm_counter(mm, MM_FILEPAGES);
>  
>  discard:
> -	page_remove_rmap(page);
> +	page_remove_rmap(page, false);
>  	page_cache_release(page);
>  
>  out_unmap:
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index a7e72103f23b..65825c2687f5 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1121,10 +1121,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>  	set_pte_at(vma->vm_mm, addr, pte,
>  		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
>  	if (page == swapcache) {
> -		page_add_anon_rmap(page, vma, addr);
> +		page_add_anon_rmap(page, vma, addr, false);
>  		mem_cgroup_commit_charge(page, memcg, true);
>  	} else { /* ksm created a completely new copy */
> -		page_add_new_anon_rmap(page, vma, addr);
> +		page_add_new_anon_rmap(page, vma, addr, false);
>  		mem_cgroup_commit_charge(page, memcg, false);
>  		lru_cache_add_active_or_unevictable(page, vma);
>  	}
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 189+ messages in thread
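
To make the new calling convention concrete, here is a hypothetical caller
sketch matching the signatures introduced by the patch above; the function
and variable names are made up, and the calls are only meant to show the two
flavours side by side, not a real code path.

static void rmap_flavours_sketch(struct page *head, struct page *subpage,
				 struct vm_area_struct *vma,
				 unsigned long haddr, unsigned long addr)
{
	/* a PMD fault maps the whole THP: compound = true */
	page_add_new_anon_rmap(head, vma, haddr, true);

	/* a PTE fault maps a single 4k sub-page: compound = false */
	page_add_new_anon_rmap(subpage, vma, addr, false);

	/* teardown mirrors the same distinction */
	page_remove_rmap(head, true);
	page_remove_rmap(subpage, false);
}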

* Re: [PATCHv5 04/28] mm, thp: adjust conditions when we can reuse the page on WP fault
  2015-04-23 21:03   ` Kirill A. Shutemov
  (?)
@ 2015-04-29 15:54   ` Jerome Marchand
  -1 siblings, 0 replies; 189+ messages in thread
From: Jerome Marchand @ 2015-04-29 15:54 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Sasha Levin,
	linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 3176 bytes --]

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> With new refcounting we will be able to map the same compound page with
> PTEs and PMDs. It requires adjusting the conditions under which we can reuse
> the page on write-protection fault.
> 
> For PTE fault we can't reuse the page if it's part of huge page.
> 
> For PMD we can only reuse the page if nobody else maps the huge page or
> it's part. We can do it by checking page_mapcount() on each sub-page,
> but it's expensive.
> 
> The cheaper way is to check page_count() to be equal to 1: every mapcount
> takes a page reference, so this way we can guarantee that the PMD is the
> only mapping.
> 
> This approach can give false negative if somebody pinned the page, but
> that doesn't affect correctness.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Jerome Marchand <jmarchan@redhat.com>

> ---
>  include/linux/swap.h |  3 ++-
>  mm/huge_memory.c     | 12 +++++++++++-
>  mm/swapfile.c        |  3 +++
>  3 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0428e4c84e1d..17cdd6b9456b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -524,7 +524,8 @@ static inline int page_swapcount(struct page *page)
>  	return 0;
>  }
>  
> -#define reuse_swap_page(page)	(page_mapcount(page) == 1)
> +#define reuse_swap_page(page) \
> +	(!PageTransCompound(page) && page_mapcount(page) == 1)
>  
>  static inline int try_to_free_swap(struct page *page)
>  {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 534f353e12bf..fd8af5b9917f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1103,7 +1103,17 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  
>  	page = pmd_page(orig_pmd);
>  	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
> -	if (page_mapcount(page) == 1) {
> +	/*
> +	 * We can only reuse the page if nobody else maps the huge page or it's
> +	 * part. We can do it by checking page_mapcount() on each sub-page, but
> +	 * it's expensive.
> +	 * The cheaper way is to check page_count() to be equal to 1: every
> +	 * mapcount takes a page reference, so this way we can
> +	 * guarantee that the PMD is the only mapping.
> +	 * This can give false negative if somebody pinned the page, but that's
> +	 * fine.
> +	 */
> +	if (page_mapcount(page) == 1 && page_count(page) == 1) {
>  		pmd_t entry;
>  		entry = pmd_mkyoung(orig_pmd);
>  		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 6dd365d1c488..3cd5f188b996 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -887,6 +887,9 @@ int reuse_swap_page(struct page *page)
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  	if (unlikely(PageKsm(page)))
>  		return 0;
> +	/* The page is part of THP and cannot be reused */
> +	if (PageTransCompound(page))
> +		return 0;
>  	count = page_mapcount(page);
>  	if (count <= 1 && PageSwapCache(page)) {
>  		count += page_swapcount(page);
> 
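A minimal sketch of the PMD-side reuse test described above, for
illustration only (not part of the patch; the helper name is made up and
<linux/mm.h> is assumed):

#include <linux/mm.h>

/*
 * Every mapping of the compound page takes a reference on it, so
 * page_mapcount(head) == 1 together with page_count(head) == 1 means
 * this PMD is the only mapping and there are no extra pins (gup, swap
 * cache, ...).  A pinned page only makes the test fail and forces a
 * copy -- a false negative, never a correctness problem.
 */
static inline bool can_reuse_huge_pmd_page(struct page *head)
{
        VM_BUG_ON_PAGE(!PageHead(head), head);
        return page_mapcount(head) == 1 && page_count(head) == 1;
}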



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 06/28] mm: handle PTE-mapped tail pages in generic fast gup implementation
  2015-04-23 21:03   ` Kirill A. Shutemov
  (?)
@ 2015-04-29 15:56   ` Jerome Marchand
  -1 siblings, 0 replies; 189+ messages in thread
From: Jerome Marchand @ 2015-04-29 15:56 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Sasha Levin,
	linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1886 bytes --]

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> With the new refcounting we are going to see THP tail pages mapped with
> PTEs. The generic fast GUP relies on page_cache_get_speculative() to
> obtain a reference on the page. page_cache_get_speculative() always
> fails on tail pages, because ->_count on tail pages is always zero.
> 
> Let's handle tail pages in gup_pte_range().
> 
> The new split_huge_page() will rely on migration entries to freeze the
> page's counts. Rechecking the PTE value after
> page_cache_get_speculative() on the head page should be enough to
> serialize against a split.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Jerome Marchand <jmarchan@redhat.com>

> ---
>  mm/gup.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index ebdb39b3e820..eaeeae15006b 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1051,7 +1051,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  		 * for an example see gup_get_pte in arch/x86/mm/gup.c
>  		 */
>  		pte_t pte = READ_ONCE(*ptep);
> -		struct page *page;
> +		struct page *head, *page;
>  
>  		/*
>  		 * Similar to the PMD case below, NUMA hinting must take slow
> @@ -1063,15 +1063,17 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  
>  		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>  		page = pte_page(pte);
> +		head = compound_head(page);
>  
> -		if (!page_cache_get_speculative(page))
> +		if (!page_cache_get_speculative(head))
>  			goto pte_unmap;
>  
>  		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> -			put_page(page);
> +			put_page(head);
>  			goto pte_unmap;
>  		}
>  
> +		VM_BUG_ON_PAGE(compound_head(page) != head, page);
>  		pages[*nr] = page;
>  		(*nr)++;
>  
> 
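The lockless pattern the hunk above relies on, shown as a simplified
sketch (illustrative only, not the actual gup_pte_range() code; the
function name is made up and <linux/mm.h> plus <linux/pagemap.h> are
assumed):

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Take the speculative reference on the *head* page, then re-read the
 * PTE.  The new split_huge_page() freezes the page's counts via
 * migration entries before it can proceed, so either the speculative
 * get fails or the PTE no longer matches and we bail out to the slow
 * path.  Returns 1 and sets *pagep on success, 0 to fall back.
 */
static int grab_pte_page_fast(pte_t *ptep, pte_t pte, struct page **pagep)
{
        struct page *page = pte_page(pte);
        struct page *head = compound_head(page);

        if (!page_cache_get_speculative(head))
                return 0;

        if (unlikely(pte_val(pte) != pte_val(*ptep))) {
                put_page(head);         /* PTE changed under us */
                return 0;
        }

        /* the caller records the (possibly tail) page itself */
        *pagep = page;
        return 1;
}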



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 07/28] thp, mlock: do not allow huge pages in mlocked area
  2015-04-23 21:03   ` Kirill A. Shutemov
  (?)
@ 2015-04-29 15:58   ` Jerome Marchand
  -1 siblings, 0 replies; 189+ messages in thread
From: Jerome Marchand @ 2015-04-29 15:58 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Sasha Levin,
	linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 5408 bytes --]

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> With the new refcounting a THP can belong to several VMAs. This makes it
> tricky to track THP pages when they are partially mlocked. It can lead
> to leaking mlocked pages into non-VM_LOCKED VMAs and other problems.
> 
> With this patch we split all pages on mlock and avoid faulting in or
> collapsing new THPs in VM_LOCKED VMAs.
> 
> I've tried an alternative approach: do not mark THP pages mlocked and
> keep them on the normal LRUs. This way vmscan could try to split huge
> pages under memory pressure and free up sub-pages which don't belong to
> VM_LOCKED VMAs.  But this is a user-visible change: it screws up the
> Mlocked accounting reported in meminfo, so I had to leave this approach
> aside.
> 
> We can bring something better later, but this should be good enough for
> now.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Jerome Marchand <jmarchan@redhat.com>

> ---
>  mm/gup.c         |  2 ++
>  mm/huge_memory.c |  5 ++++-
>  mm/memory.c      |  3 ++-
>  mm/mlock.c       | 51 +++++++++++++++++++--------------------------------
>  4 files changed, 27 insertions(+), 34 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index eaeeae15006b..7334eb24f414 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -882,6 +882,8 @@ long populate_vma_page_range(struct vm_area_struct *vma,
>  	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
>  
>  	gup_flags = FOLL_TOUCH | FOLL_POPULATE;
> +	if (vma->vm_flags & VM_LOCKED)
> +		gup_flags |= FOLL_SPLIT;
>  	/*
>  	 * We want to touch writable mappings with a write fault in order
>  	 * to break COW, except for shared mappings because these don't COW
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fd8af5b9917f..fa3d4f78b716 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -796,6 +796,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  
>  	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
>  		return VM_FAULT_FALLBACK;
> +	if (vma->vm_flags & VM_LOCKED)
> +		return VM_FAULT_FALLBACK;
>  	if (unlikely(anon_vma_prepare(vma)))
>  		return VM_FAULT_OOM;
>  	if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
> @@ -2467,7 +2469,8 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
>  	if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
>  	    (vma->vm_flags & VM_NOHUGEPAGE))
>  		return false;
> -
> +	if (vma->vm_flags & VM_LOCKED)
> +		return false;
>  	if (!vma->anon_vma || vma->vm_ops)
>  		return false;
>  	if (is_vma_temporary_stack(vma))
> diff --git a/mm/memory.c b/mm/memory.c
> index 559c6651d6b6..8bbd3f88544b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2156,7 +2156,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
>  
>  	pte_unmap_unlock(page_table, ptl);
>  	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> -	if (old_page) {
> +	/* THP pages are never mlocked */
> +	if (old_page && !PageTransCompound(old_page)) {
>  		/*
>  		 * Don't let another task, with possibly unlocked vma,
>  		 * keep the mlocked page.
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 6fd2cf15e868..76cde3967483 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -443,39 +443,26 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
>  		page = follow_page_mask(vma, start, FOLL_GET | FOLL_DUMP,
>  				&page_mask);
>  
> -		if (page && !IS_ERR(page)) {
> -			if (PageTransHuge(page)) {
> -				lock_page(page);
> -				/*
> -				 * Any THP page found by follow_page_mask() may
> -				 * have gotten split before reaching
> -				 * munlock_vma_page(), so we need to recompute
> -				 * the page_mask here.
> -				 */
> -				page_mask = munlock_vma_page(page);
> -				unlock_page(page);
> -				put_page(page); /* follow_page_mask() */
> -			} else {
> -				/*
> -				 * Non-huge pages are handled in batches via
> -				 * pagevec. The pin from follow_page_mask()
> -				 * prevents them from collapsing by THP.
> -				 */
> -				pagevec_add(&pvec, page);
> -				zone = page_zone(page);
> -				zoneid = page_zone_id(page);
> +		if (page && !IS_ERR(page) && !PageTransCompound(page)) {
> +			/*
> +			 * Non-huge pages are handled in batches via
> +			 * pagevec. The pin from follow_page_mask()
> +			 * prevents them from collapsing by THP.
> +			 */
> +			pagevec_add(&pvec, page);
> +			zone = page_zone(page);
> +			zoneid = page_zone_id(page);
>  
> -				/*
> -				 * Try to fill the rest of pagevec using fast
> -				 * pte walk. This will also update start to
> -				 * the next page to process. Then munlock the
> -				 * pagevec.
> -				 */
> -				start = __munlock_pagevec_fill(&pvec, vma,
> -						zoneid, start, end);
> -				__munlock_pagevec(&pvec, zone);
> -				goto next;
> -			}
> +			/*
> +			 * Try to fill the rest of pagevec using fast
> +			 * pte walk. This will also update start to
> +			 * the next page to process. Then munlock the
> +			 * pagevec.
> +			 */
> +			start = __munlock_pagevec_fill(&pvec, vma,
> +					zoneid, start, end);
> +			__munlock_pagevec(&pvec, zone);
> +			goto next;
>  		}
>  		/* It's a bug to munlock in the middle of a THP page */
>  		VM_BUG_ON((start >> PAGE_SHIFT) & page_mask);
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 08/28] khugepaged: ignore pmd tables with THP mapped with ptes
  2015-04-23 21:03   ` Kirill A. Shutemov
  (?)
@ 2015-04-29 15:59   ` Jerome Marchand
  -1 siblings, 0 replies; 189+ messages in thread
From: Jerome Marchand @ 2015-04-29 15:59 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Sasha Levin,
	linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1422 bytes --]

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> Prepare khugepaged to see compound pages mapped with PTEs. For now we
> won't collapse a PMD table containing such PTEs.
> 
> khugepaged is subject to future rework with respect to the new refcounting.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Jerome Marchand <jmarchan@redhat.com>

> ---
>  mm/huge_memory.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fa3d4f78b716..ffc30e4462c1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2653,6 +2653,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  		page = vm_normal_page(vma, _address, pteval);
>  		if (unlikely(!page))
>  			goto out_unmap;
> +
> +		/* TODO: teach khugepaged to collapse THP mapped with pte */
> +		if (PageCompound(page))
> +			goto out_unmap;
> +
>  		/*
>  		 * Record which node the original page is from and save this
>  		 * information to khugepaged_node_load[].
> @@ -2663,7 +2668,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  		if (khugepaged_scan_abort(node))
>  			goto out_unmap;
>  		khugepaged_node_load[node]++;
> -		VM_BUG_ON_PAGE(PageCompound(page), page);
>  		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
>  			goto out_unmap;
>  		/*
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 09/28] thp: rename split_huge_page_pmd() to split_huge_pmd()
  2015-04-23 21:03   ` Kirill A. Shutemov
  (?)
@ 2015-04-29 16:00   ` Jerome Marchand
  -1 siblings, 0 replies; 189+ messages in thread
From: Jerome Marchand @ 2015-04-29 16:00 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Sasha Levin,
	linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 10380 bytes --]

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> We are going to decouple splitting a THP PMD from splitting the
> underlying compound page.
> 
> This patch renames the split_huge_page_pmd*() functions to
> split_huge_pmd*() to reflect the fact that they don't imply page
> splitting, only PMD splitting.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Jerome Marchand <jmarchan@redhat.com>

> ---
>  arch/powerpc/mm/subpage-prot.c |  2 +-
>  arch/x86/kernel/vm86_32.c      |  6 +++++-
>  include/linux/huge_mm.h        |  8 ++------
>  mm/gup.c                       |  2 +-
>  mm/huge_memory.c               | 32 +++++++++++---------------------
>  mm/madvise.c                   |  2 +-
>  mm/memory.c                    |  2 +-
>  mm/mempolicy.c                 |  2 +-
>  mm/mprotect.c                  |  2 +-
>  mm/mremap.c                    |  2 +-
>  mm/pagewalk.c                  |  2 +-
>  11 files changed, 26 insertions(+), 36 deletions(-)
> 
> diff --git a/arch/powerpc/mm/subpage-prot.c b/arch/powerpc/mm/subpage-prot.c
> index fa9fb5b4c66c..d5543514c1df 100644
> --- a/arch/powerpc/mm/subpage-prot.c
> +++ b/arch/powerpc/mm/subpage-prot.c
> @@ -135,7 +135,7 @@ static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr,
>  				  unsigned long end, struct mm_walk *walk)
>  {
>  	struct vm_area_struct *vma = walk->vma;
> -	split_huge_page_pmd(vma, addr, pmd);
> +	split_huge_pmd(vma, pmd, addr);
>  	return 0;
>  }
>  
> diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
> index e8edcf52e069..883160599965 100644
> --- a/arch/x86/kernel/vm86_32.c
> +++ b/arch/x86/kernel/vm86_32.c
> @@ -182,7 +182,11 @@ static void mark_screen_rdonly(struct mm_struct *mm)
>  	if (pud_none_or_clear_bad(pud))
>  		goto out;
>  	pmd = pmd_offset(pud, 0xA0000);
> -	split_huge_page_pmd_mm(mm, 0xA0000, pmd);
> +
> +	if (pmd_trans_huge(*pmd)) {
> +		struct vm_area_struct *vma = find_vma(mm, 0xA0000);
> +		split_huge_pmd(vma, pmd, 0xA0000);
> +	}
>  	if (pmd_none_or_clear_bad(pmd))
>  		goto out;
>  	pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 44a840a53974..34bbf769d52e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -104,7 +104,7 @@ static inline int split_huge_page(struct page *page)
>  }
>  extern void __split_huge_page_pmd(struct vm_area_struct *vma,
>  		unsigned long address, pmd_t *pmd);
> -#define split_huge_page_pmd(__vma, __address, __pmd)			\
> +#define split_huge_pmd(__vma, __pmd, __address)				\
>  	do {								\
>  		pmd_t *____pmd = (__pmd);				\
>  		if (unlikely(pmd_trans_huge(*____pmd)))			\
> @@ -119,8 +119,6 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
>  		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
>  		       pmd_trans_huge(*____pmd));			\
>  	} while (0)
> -extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
> -		pmd_t *pmd);
>  #if HPAGE_PMD_ORDER >= MAX_ORDER
>  #error "hugepages can't be allocated by the buddy allocator"
>  #endif
> @@ -187,11 +185,9 @@ static inline int split_huge_page(struct page *page)
>  {
>  	return 0;
>  }
> -#define split_huge_page_pmd(__vma, __address, __pmd)	\
> -	do { } while (0)
>  #define wait_split_huge_page(__anon_vma, __pmd)	\
>  	do { } while (0)
> -#define split_huge_page_pmd_mm(__mm, __address, __pmd)	\
> +#define split_huge_pmd(__vma, __pmd, __address)	\
>  	do { } while (0)
>  static inline int hugepage_madvise(struct vm_area_struct *vma,
>  				   unsigned long *vm_flags, int advice)
> diff --git a/mm/gup.c b/mm/gup.c
> index 7334eb24f414..19e01f156abb 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -220,7 +220,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>  		if (is_huge_zero_page(page)) {
>  			spin_unlock(ptl);
>  			ret = 0;
> -			split_huge_page_pmd(vma, address, pmd);
> +			split_huge_pmd(vma, pmd, address);
>  		} else {
>  			get_page(page);
>  			spin_unlock(ptl);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ffc30e4462c1..ccbfacf07160 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1136,13 +1136,13 @@ alloc:
>  
>  	if (unlikely(!new_page)) {
>  		if (!page) {
> -			split_huge_page_pmd(vma, address, pmd);
> +			split_huge_pmd(vma, pmd, address);
>  			ret |= VM_FAULT_FALLBACK;
>  		} else {
>  			ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
>  					pmd, orig_pmd, page, haddr);
>  			if (ret & VM_FAULT_OOM) {
> -				split_huge_page(page);
> +				split_huge_pmd(vma, pmd, address);
>  				ret |= VM_FAULT_FALLBACK;
>  			}
>  			put_user_huge_page(page);
> @@ -1155,10 +1155,10 @@ alloc:
>  					&memcg, true))) {
>  		put_page(new_page);
>  		if (page) {
> -			split_huge_page(page);
> +			split_huge_pmd(vma, pmd, address);
>  			put_user_huge_page(page);
>  		} else
> -			split_huge_page_pmd(vma, address, pmd);
> +			split_huge_pmd(vma, pmd, address);
>  		ret |= VM_FAULT_FALLBACK;
>  		count_vm_event(THP_FAULT_FALLBACK);
>  		goto out;
> @@ -2985,17 +2985,7 @@ again:
>  		goto again;
>  }
>  
> -void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
> -		pmd_t *pmd)
> -{
> -	struct vm_area_struct *vma;
> -
> -	vma = find_vma(mm, address);
> -	BUG_ON(vma == NULL);
> -	split_huge_page_pmd(vma, address, pmd);
> -}
> -
> -static void split_huge_page_address(struct mm_struct *mm,
> +static void split_huge_pmd_address(struct vm_area_struct *vma,
>  				    unsigned long address)
>  {
>  	pgd_t *pgd;
> @@ -3004,7 +2994,7 @@ static void split_huge_page_address(struct mm_struct *mm,
>  
>  	VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));
>  
> -	pgd = pgd_offset(mm, address);
> +	pgd = pgd_offset(vma->vm_mm, address);
>  	if (!pgd_present(*pgd))
>  		return;
>  
> @@ -3013,13 +3003,13 @@ static void split_huge_page_address(struct mm_struct *mm,
>  		return;
>  
>  	pmd = pmd_offset(pud, address);
> -	if (!pmd_present(*pmd))
> +	if (!pmd_present(*pmd) || !pmd_trans_huge(*pmd))
>  		return;
>  	/*
>  	 * Caller holds the mmap_sem write mode, so a huge pmd cannot
>  	 * materialize from under us.
>  	 */
> -	split_huge_page_pmd_mm(mm, address, pmd);
> +	__split_huge_page_pmd(vma, address, pmd);
>  }
>  
>  void __vma_adjust_trans_huge(struct vm_area_struct *vma,
> @@ -3035,7 +3025,7 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
>  	if (start & ~HPAGE_PMD_MASK &&
>  	    (start & HPAGE_PMD_MASK) >= vma->vm_start &&
>  	    (start & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
> -		split_huge_page_address(vma->vm_mm, start);
> +		split_huge_pmd_address(vma, start);
>  
>  	/*
>  	 * If the new end address isn't hpage aligned and it could
> @@ -3045,7 +3035,7 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
>  	if (end & ~HPAGE_PMD_MASK &&
>  	    (end & HPAGE_PMD_MASK) >= vma->vm_start &&
>  	    (end & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
> -		split_huge_page_address(vma->vm_mm, end);
> +		split_huge_pmd_address(vma, end);
>  
>  	/*
>  	 * If we're also updating the vma->vm_next->vm_start, if the new
> @@ -3059,6 +3049,6 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
>  		if (nstart & ~HPAGE_PMD_MASK &&
>  		    (nstart & HPAGE_PMD_MASK) >= next->vm_start &&
>  		    (nstart & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= next->vm_end)
> -			split_huge_page_address(next->vm_mm, nstart);
> +			split_huge_pmd_address(next, nstart);
>  	}
>  }
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 22b86daf6b94..f5a81ca0dca7 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -281,7 +281,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  	next = pmd_addr_end(addr, end);
>  	if (pmd_trans_huge(*pmd)) {
>  		if (next - addr != HPAGE_PMD_SIZE)
> -			split_huge_page_pmd(vma, addr, pmd);
> +			split_huge_pmd(vma, pmd, addr);
>  		else if (!madvise_free_huge_pmd(tlb, vma, pmd, addr))
>  			goto next;
>  		/* fall through */
> diff --git a/mm/memory.c b/mm/memory.c
> index 8bbd3f88544b..61e7ed722760 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1201,7 +1201,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
>  					BUG();
>  				}
>  #endif
> -				split_huge_page_pmd(vma, addr, pmd);
> +				split_huge_pmd(vma, pmd, addr);
>  			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
>  				goto next;
>  			/* fall through */
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 8badb84c013e..aac490fdc91f 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -493,7 +493,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
>  	pte_t *pte;
>  	spinlock_t *ptl;
>  
> -	split_huge_page_pmd(vma, addr, pmd);
> +	split_huge_pmd(vma, pmd, addr);
>  	if (pmd_trans_unstable(pmd))
>  		return 0;
>  
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 88584838e704..714d2fbbaafd 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -158,7 +158,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  
>  		if (pmd_trans_huge(*pmd)) {
>  			if (next - addr != HPAGE_PMD_SIZE)
> -				split_huge_page_pmd(vma, addr, pmd);
> +				split_huge_pmd(vma, pmd, addr);
>  			else {
>  				int nr_ptes = change_huge_pmd(vma, pmd, addr,
>  						newprot, prot_numa);
> diff --git a/mm/mremap.c b/mm/mremap.c
> index afa3ab740d8c..3e40ea27edc4 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -208,7 +208,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  				need_flush = true;
>  				continue;
>  			} else if (!err) {
> -				split_huge_page_pmd(vma, old_addr, old_pmd);
> +				split_huge_pmd(vma, old_pmd, old_addr);
>  			}
>  			VM_BUG_ON(pmd_trans_huge(*old_pmd));
>  		}
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 29f2f8b853ae..207244489a68 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -58,7 +58,7 @@ again:
>  		if (!walk->pte_entry)
>  			continue;
>  
> -		split_huge_page_pmd_mm(walk->mm, addr, pmd);
> +		split_huge_pmd(walk->vma, pmd, addr);
>  		if (pmd_trans_unstable(pmd))
>  			goto again;
>  		err = walk_pte_range(pmd, addr, next, walk);
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 10/28] mm, vmstats: new THP splitting event
  2015-04-23 21:03   ` Kirill A. Shutemov
  (?)
@ 2015-04-29 16:02   ` Jerome Marchand
  -1 siblings, 0 replies; 189+ messages in thread
From: Jerome Marchand @ 2015-04-29 16:02 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Sasha Levin,
	linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 2149 bytes --]

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> The patch replaces THP_SPLIT with three events: THP_SPLIT_PAGE,
> THP_SPLIT_PAGE_FAILT and THP_SPLIT_PMD. It reflects the fact that we

s/FAILT/FAILED

> are going to be able to split a PMD without splitting the compound page
> and that split_huge_page() can fail.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Christoph Lameter <cl@linux.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Jerome Marchand <jmarchan@redhat.com>

> ---
>  include/linux/vm_event_item.h | 4 +++-
>  mm/huge_memory.c              | 2 +-
>  mm/vmstat.c                   | 4 +++-
>  3 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 2b1cef88b827..3261bfe2156a 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -69,7 +69,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		THP_FAULT_FALLBACK,
>  		THP_COLLAPSE_ALLOC,
>  		THP_COLLAPSE_ALLOC_FAILED,
> -		THP_SPLIT,
> +		THP_SPLIT_PAGE,
> +		THP_SPLIT_PAGE_FAILED,
> +		THP_SPLIT_PMD,
>  		THP_ZERO_PAGE_ALLOC,
>  		THP_ZERO_PAGE_ALLOC_FAILED,
>  #endif
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ccbfacf07160..be6d0e0f5050 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1961,7 +1961,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  
>  	BUG_ON(!PageSwapBacked(page));
>  	__split_huge_page(page, anon_vma, list);
> -	count_vm_event(THP_SPLIT);
> +	count_vm_event(THP_SPLIT_PAGE);
>  
>  	BUG_ON(PageCompound(page));
>  out_unlock:
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 1fd0886a389f..e1c87425fe11 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -821,7 +821,9 @@ const char * const vmstat_text[] = {
>  	"thp_fault_fallback",
>  	"thp_collapse_alloc",
>  	"thp_collapse_alloc_failed",
> -	"thp_split",
> +	"thp_split_page",
> +	"thp_split_page_failed",
> +	"thp_split_pmd",
>  	"thp_zero_page_alloc",
>  	"thp_zero_page_alloc_failed",
>  #endif
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 16/28] mm, thp: remove compound_lock
  2015-04-23 21:03   ` Kirill A. Shutemov
  (?)
@ 2015-04-29 16:11   ` Jerome Marchand
  2015-04-30 11:58       ` Kirill A. Shutemov
  -1 siblings, 1 reply; 189+ messages in thread
From: Jerome Marchand @ 2015-04-29 16:11 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Sasha Levin,
	linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 3905 bytes --]

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> We are going to use migration entries to stabilize page counts. It means

By "stabilize" do you mean "protect" from concurrent access? I've seen
that you use the same term in seemingly the same sense several times (at
least in patches 15, 16, 23, 24 and 28).

Jerome

> we don't need compound_lock() for that.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>
> ---
>  include/linux/mm.h         | 35 -----------------------------------
>  include/linux/page-flags.h | 12 +-----------
>  mm/debug.c                 |  3 ---
>  3 files changed, 1 insertion(+), 49 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index dd1b5f2b1966..dad667d99304 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -393,41 +393,6 @@ static inline int is_vmalloc_or_module_addr(const void *x)
>  
>  extern void kvfree(const void *addr);
>  
> -static inline void compound_lock(struct page *page)
> -{
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	VM_BUG_ON_PAGE(PageSlab(page), page);
> -	bit_spin_lock(PG_compound_lock, &page->flags);
> -#endif
> -}
> -
> -static inline void compound_unlock(struct page *page)
> -{
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	VM_BUG_ON_PAGE(PageSlab(page), page);
> -	bit_spin_unlock(PG_compound_lock, &page->flags);
> -#endif
> -}
> -
> -static inline unsigned long compound_lock_irqsave(struct page *page)
> -{
> -	unsigned long uninitialized_var(flags);
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	local_irq_save(flags);
> -	compound_lock(page);
> -#endif
> -	return flags;
> -}
> -
> -static inline void compound_unlock_irqrestore(struct page *page,
> -					      unsigned long flags)
> -{
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	compound_unlock(page);
> -	local_irq_restore(flags);
> -#endif
> -}
> -
>  /*
>   * The atomic page->_mapcount, starts from -1: so that transitions
>   * both from it and to it can be tracked, using atomic_inc_and_test
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 91b7f9b2b774..74b7cece1dfa 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -106,9 +106,6 @@ enum pageflags {
>  #ifdef CONFIG_MEMORY_FAILURE
>  	PG_hwpoison,		/* hardware poisoned page. Don't touch */
>  #endif
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	PG_compound_lock,
> -#endif
>  	__NR_PAGEFLAGS,
>  
>  	/* Filesystems */
> @@ -683,12 +680,6 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
>  #define __PG_MLOCKED		0
>  #endif
>  
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -#define __PG_COMPOUND_LOCK		(1 << PG_compound_lock)
> -#else
> -#define __PG_COMPOUND_LOCK		0
> -#endif
> -
>  /*
>   * Flags checked when a page is freed.  Pages being freed should not have
>   * these flags set.  It they are, there is a problem.
> @@ -698,8 +689,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
>  	 1 << PG_private | 1 << PG_private_2 | \
>  	 1 << PG_writeback | 1 << PG_reserved | \
>  	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
> -	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
> -	 __PG_COMPOUND_LOCK)
> +	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON )
>  
>  /*
>   * Flags checked when a page is prepped for return by the page allocator.
> diff --git a/mm/debug.c b/mm/debug.c
> index 3eb3ac2fcee7..9dfcd77e7354 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -45,9 +45,6 @@ static const struct trace_print_flags pageflag_names[] = {
>  #ifdef CONFIG_MEMORY_FAILURE
>  	{1UL << PG_hwpoison,		"hwpoison"	},
>  #endif
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	{1UL << PG_compound_lock,	"compound_lock"	},
> -#endif
>  };
>  
>  static void dump_flags(unsigned long flags,
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 17/28] mm, thp: remove infrastructure for handling splitting PMDs
  2015-04-23 21:03   ` Kirill A. Shutemov
  (?)
@ 2015-04-29 16:14   ` Jerome Marchand
  2015-04-30 12:03       ` Kirill A. Shutemov
  -1 siblings, 1 reply; 189+ messages in thread
From: Jerome Marchand @ 2015-04-29 16:14 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Sasha Levin,
	linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 15294 bytes --]

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> With the new refcounting we don't need to mark PMDs splitting. Let's
> drop the code to handle this.
> 
> Arch-specific code will be removed separately.

This series only removes code from the x86 arch. Does that mean patches
for other arches will come later?

Jerome

> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>
> ---
>  fs/proc/task_mmu.c            |  8 +++----
>  include/asm-generic/pgtable.h |  5 ----
>  include/linux/huge_mm.h       |  9 --------
>  mm/gup.c                      |  7 ------
>  mm/huge_memory.c              | 54 ++++++++-----------------------------------
>  mm/memcontrol.c               | 14 ++---------
>  mm/memory.c                   | 18 ++-------------
>  mm/mincore.c                  |  2 +-
>  mm/pgtable-generic.c          | 14 -----------
>  mm/rmap.c                     |  4 +---
>  10 files changed, 20 insertions(+), 115 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 95bc384ee3f7..edd63c40ed71 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -534,7 +534,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	pte_t *pte;
>  	spinlock_t *ptl;
>  
> -	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> +	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
>  		smaps_pmd_entry(pmd, addr, walk);
>  		spin_unlock(ptl);
>  		return 0;
> @@ -799,7 +799,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>  	spinlock_t *ptl;
>  	struct page *page;
>  
> -	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> +	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
>  		if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
>  			clear_soft_dirty_pmd(vma, addr, pmd);
>  			goto out;
> @@ -1112,7 +1112,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	pte_t *pte, *orig_pte;
>  	int err = 0;
>  
> -	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> +	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
>  		int pmd_flags2;
>  
>  		if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd))
> @@ -1416,7 +1416,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
>  	pte_t *orig_pte;
>  	pte_t *pte;
>  
> -	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> +	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
>  		pte_t huge_pte = *(pte_t *)pmd;
>  		struct page *page;
>  
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 39f1d6a2b04d..fe617b7e4be6 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -184,11 +184,6 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  #endif
>  
> -#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
> -extern void pmdp_splitting_flush(struct vm_area_struct *vma,
> -				 unsigned long address, pmd_t *pmdp);
> -#endif
> -
>  #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
>  extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
>  				       pgtable_t pgtable);
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 47f80207782f..0382230b490f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -49,15 +49,9 @@ enum transparent_hugepage_flag {
>  #endif
>  };
>  
> -enum page_check_address_pmd_flag {
> -	PAGE_CHECK_ADDRESS_PMD_FLAG,
> -	PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
> -	PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
> -};
>  extern pmd_t *page_check_address_pmd(struct page *page,
>  				     struct mm_struct *mm,
>  				     unsigned long address,
> -				     enum page_check_address_pmd_flag flag,
>  				     spinlock_t **ptl);
>  extern int pmd_freeable(pmd_t pmd);
>  
> @@ -102,7 +96,6 @@ extern unsigned long transparent_hugepage_flags;
>  #define split_huge_page(page) BUILD_BUG()
>  #define split_huge_pmd(__vma, __pmd, __address) BUILD_BUG()
>  
> -#define wait_split_huge_page(__anon_vma, __pmd) BUILD_BUG();
>  #if HPAGE_PMD_ORDER >= MAX_ORDER
>  #error "hugepages can't be allocated by the buddy allocator"
>  #endif
> @@ -169,8 +162,6 @@ static inline int split_huge_page(struct page *page)
>  {
>  	return 0;
>  }
> -#define wait_split_huge_page(__anon_vma, __pmd)	\
> -	do { } while (0)
>  #define split_huge_pmd(__vma, __pmd, __address)	\
>  	do { } while (0)
>  static inline int hugepage_madvise(struct vm_area_struct *vma,
> diff --git a/mm/gup.c b/mm/gup.c
> index 53f9681b7b30..0cebfa76fd0c 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -207,13 +207,6 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>  		spin_unlock(ptl);
>  		return follow_page_pte(vma, address, pmd, flags);
>  	}
> -
> -	if (unlikely(pmd_trans_splitting(*pmd))) {
> -		spin_unlock(ptl);
> -		wait_split_huge_page(vma->anon_vma, pmd);
> -		return follow_page_pte(vma, address, pmd, flags);
> -	}
> -
>  	if (flags & FOLL_SPLIT) {
>  		int ret;
>  		page = pmd_page(*pmd);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 16c6c262385c..23181f836b62 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -889,15 +889,6 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  		goto out_unlock;
>  	}
>  
> -	if (unlikely(pmd_trans_splitting(pmd))) {
> -		/* split huge page running from under us */
> -		spin_unlock(src_ptl);
> -		spin_unlock(dst_ptl);
> -		pte_free(dst_mm, pgtable);
> -
> -		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
> -		goto out;
> -	}
>  	src_page = pmd_page(pmd);
>  	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
>  	get_page(src_page);
> @@ -1403,7 +1394,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	spinlock_t *ptl;
>  	int ret = 0;
>  
> -	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> +	if (__pmd_trans_huge_lock(pmd, vma, &ptl)) {
>  		struct page *page;
>  		pgtable_t pgtable;
>  		pmd_t orig_pmd;
> @@ -1443,7 +1434,6 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
>  		  pmd_t *old_pmd, pmd_t *new_pmd)
>  {
>  	spinlock_t *old_ptl, *new_ptl;
> -	int ret = 0;
>  	pmd_t pmd;
>  
>  	struct mm_struct *mm = vma->vm_mm;
> @@ -1452,7 +1442,7 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
>  	    (new_addr & ~HPAGE_PMD_MASK) ||
>  	    old_end - old_addr < HPAGE_PMD_SIZE ||
>  	    (new_vma->vm_flags & VM_NOHUGEPAGE))
> -		goto out;
> +		return 0;
>  
>  	/*
>  	 * The destination pmd shouldn't be established, free_pgtables()
> @@ -1460,15 +1450,14 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
>  	 */
>  	if (WARN_ON(!pmd_none(*new_pmd))) {
>  		VM_BUG_ON(pmd_trans_huge(*new_pmd));
> -		goto out;
> +		return 0;
>  	}
>  
>  	/*
>  	 * We don't have to worry about the ordering of src and dst
>  	 * ptlocks because exclusive mmap_sem prevents deadlock.
>  	 */
> -	ret = __pmd_trans_huge_lock(old_pmd, vma, &old_ptl);
> -	if (ret == 1) {
> +	if (__pmd_trans_huge_lock(old_pmd, vma, &old_ptl)) {
>  		new_ptl = pmd_lockptr(mm, new_pmd);
>  		if (new_ptl != old_ptl)
>  			spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> @@ -1484,9 +1473,9 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
>  		if (new_ptl != old_ptl)
>  			spin_unlock(new_ptl);
>  		spin_unlock(old_ptl);
> +		return 1;
>  	}
> -out:
> -	return ret;
> +	return 0;
>  }
>  
>  /*
> @@ -1502,7 +1491,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  	spinlock_t *ptl;
>  	int ret = 0;
>  
> -	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> +	if (__pmd_trans_huge_lock(pmd, vma, &ptl)) {
>  		pmd_t entry;
>  		bool preserve_write = prot_numa && pmd_write(*pmd);
>  		ret = 1;
> @@ -1543,17 +1532,8 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
>  		spinlock_t **ptl)
>  {
>  	*ptl = pmd_lock(vma->vm_mm, pmd);
> -	if (likely(pmd_trans_huge(*pmd))) {
> -		if (unlikely(pmd_trans_splitting(*pmd))) {
> -			spin_unlock(*ptl);
> -			wait_split_huge_page(vma->anon_vma, pmd);
> -			return -1;
> -		} else {
> -			/* Thp mapped by 'pmd' is stable, so we can
> -			 * handle it as it is. */
> -			return 1;
> -		}
> -	}
> +	if (likely(pmd_trans_huge(*pmd)))
> +		return 1;
>  	spin_unlock(*ptl);
>  	return 0;
>  }
> @@ -1569,7 +1549,6 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
>  pmd_t *page_check_address_pmd(struct page *page,
>  			      struct mm_struct *mm,
>  			      unsigned long address,
> -			      enum page_check_address_pmd_flag flag,
>  			      spinlock_t **ptl)
>  {
>  	pgd_t *pgd;
> @@ -1592,21 +1571,8 @@ pmd_t *page_check_address_pmd(struct page *page,
>  		goto unlock;
>  	if (pmd_page(*pmd) != page)
>  		goto unlock;
> -	/*
> -	 * split_vma() may create temporary aliased mappings. There is
> -	 * no risk as long as all huge pmd are found and have their
> -	 * splitting bit set before __split_huge_page_refcount
> -	 * runs. Finding the same huge pmd more than once during the
> -	 * same rmap walk is not a problem.
> -	 */
> -	if (flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
> -	    pmd_trans_splitting(*pmd))
> -		goto unlock;
> -	if (pmd_trans_huge(*pmd)) {
> -		VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
> -			  !pmd_trans_splitting(*pmd));
> +	if (pmd_trans_huge(*pmd))
>  		return pmd;
> -	}
>  unlock:
>  	spin_unlock(*ptl);
>  	return NULL;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f659d4f77138..1bc6a77067ad 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4888,7 +4888,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
>  	pte_t *pte;
>  	spinlock_t *ptl;
>  
> -	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> +	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
>  		if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
>  			mc.precharge += HPAGE_PMD_NR;
>  		spin_unlock(ptl);
> @@ -5056,17 +5056,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>  	union mc_target target;
>  	struct page *page;
>  
> -	/*
> -	 * We don't take compound_lock() here but no race with splitting thp
> -	 * happens because:
> -	 *  - if pmd_trans_huge_lock() returns 1, the relevant thp is not
> -	 *    under splitting, which means there's no concurrent thp split,
> -	 *  - if another thread runs into split_huge_page() just after we
> -	 *    entered this if-block, the thread must wait for page table lock
> -	 *    to be unlocked in __split_huge_page_splitting(), where the main
> -	 *    part of thp split is not executed yet.
> -	 */
> -	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> +	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
>  		if (mc.precharge < HPAGE_PMD_NR) {
>  			spin_unlock(ptl);
>  			return 0;
> diff --git a/mm/memory.c b/mm/memory.c
> index 61e7ed722760..1bad3766b00c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -565,7 +565,6 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>  {
>  	spinlock_t *ptl;
>  	pgtable_t new = pte_alloc_one(mm, address);
> -	int wait_split_huge_page;
>  	if (!new)
>  		return -ENOMEM;
>  
> @@ -585,18 +584,14 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>  	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
>  
>  	ptl = pmd_lock(mm, pmd);
> -	wait_split_huge_page = 0;
>  	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
>  		atomic_long_inc(&mm->nr_ptes);
>  		pmd_populate(mm, pmd, new);
>  		new = NULL;
> -	} else if (unlikely(pmd_trans_splitting(*pmd)))
> -		wait_split_huge_page = 1;
> +	}
>  	spin_unlock(ptl);
>  	if (new)
>  		pte_free(mm, new);
> -	if (wait_split_huge_page)
> -		wait_split_huge_page(vma->anon_vma, pmd);
>  	return 0;
>  }
>  
> @@ -612,8 +607,7 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
>  	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
>  		pmd_populate_kernel(&init_mm, pmd, new);
>  		new = NULL;
> -	} else
> -		VM_BUG_ON(pmd_trans_splitting(*pmd));
> +	}
>  	spin_unlock(&init_mm.page_table_lock);
>  	if (new)
>  		pte_free_kernel(&init_mm, new);
> @@ -3299,14 +3293,6 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		if (pmd_trans_huge(orig_pmd)) {
>  			unsigned int dirty = flags & FAULT_FLAG_WRITE;
>  
> -			/*
> -			 * If the pmd is splitting, return and retry the
> -			 * the fault.  Alternative: wait until the split
> -			 * is done, and goto retry.
> -			 */
> -			if (pmd_trans_splitting(orig_pmd))
> -				return 0;
> -
>  			if (pmd_protnone(orig_pmd))
>  				return do_huge_pmd_numa_page(mm, vma, address,
>  							     orig_pmd, pmd);
> diff --git a/mm/mincore.c b/mm/mincore.c
> index be25efde64a4..feb867f5fdf4 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -117,7 +117,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  	unsigned char *vec = walk->private;
>  	int nr = (end - addr) >> PAGE_SHIFT;
>  
> -	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> +	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
>  		memset(vec, 1, nr);
>  		spin_unlock(ptl);
>  		goto out;
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index c25f94b33811..2fe699cedd4d 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -133,20 +133,6 @@ pmd_t pmdp_clear_flush(struct vm_area_struct *vma, unsigned long address,
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  #endif
>  
> -#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
> -			  pmd_t *pmdp)
> -{
> -	pmd_t pmd = pmd_mksplitting(*pmdp);
> -	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -	set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> -	/* tlb flush only to serialize against gup-fast */
> -	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> -}
> -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> -#endif
> -
>  #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 4ca4b5cffd95..1636a96e5f71 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -737,8 +737,7 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>  		 * rmap might return false positives; we must filter
>  		 * these out using page_check_address_pmd().
>  		 */
> -		pmd = page_check_address_pmd(page, mm, address,
> -					     PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
> +		pmd = page_check_address_pmd(page, mm, address, &ptl);
>  		if (!pmd)
>  			return SWAP_AGAIN;
>  
> @@ -748,7 +747,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>  			return SWAP_FAIL; /* To break the loop */
>  		}
>  
> -		/* go ahead even if the pmd is pmd_trans_splitting() */
>  		if (pmdp_clear_flush_young_notify(vma, address, pmd))
>  			referenced++;
>  
> 
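With the splitting state gone, pmd_trans_huge_lock() above becomes a
plain boolean-style helper: it either takes the ptl for a stable huge
pmd or returns 0. A caller sketch, modeled on the smaps/mincore hunks
above (illustrative only, not part of the patch; the function name is
made up):

/*
 * If the pmd is huge, handle the whole range under the pmd lock;
 * otherwise fall through to the pte-level walk.
 */
static int example_pmd_entry(pmd_t *pmd, unsigned long addr,
                             unsigned long end, struct mm_walk *walk)
{
        struct vm_area_struct *vma = walk->vma;
        spinlock_t *ptl;

        if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
                /* act on the stable huge pmd here */
                spin_unlock(ptl);
                return 0;
        }
        /* ... walk individual ptes ... */
        return 0;
}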



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 20/28] mm: differentiate page_mapped() from page_mapcount() for compound pages
  2015-04-23 21:03   ` Kirill A. Shutemov
  (?)
@ 2015-04-29 16:20   ` Jerome Marchand
  2015-04-30 12:06       ` Kirill A. Shutemov
  -1 siblings, 1 reply; 189+ messages in thread
From: Jerome Marchand @ 2015-04-29 16:20 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Vlastimil Babka,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Sasha Levin,
	linux-kernel, linux-mm

[-- Attachment #1: Type: text/plain, Size: 9913 bytes --]

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> Let's define page_mapped() to be true for compound pages if any
> sub-page of the compound page is mapped (with a PMD or a PTE).
> 
> On the other hand, page_mapcount() returns the mapcount of this
> particular small page.
> 
> This will make cases like page_get_anon_vma() behave correctly once we
> allow huge pages to be mapped with PTEs.
> 
> Most users outside core-mm should use page_mapcount() instead of
> page_mapped().
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>
> ---
>  arch/arc/mm/cache_arc700.c |  4 ++--
>  arch/arm/mm/flush.c        |  2 +-
>  arch/mips/mm/c-r4k.c       |  3 ++-
>  arch/mips/mm/cache.c       |  2 +-
>  arch/mips/mm/init.c        |  6 +++---
>  arch/sh/mm/cache-sh4.c     |  2 +-
>  arch/sh/mm/cache.c         |  8 ++++----
>  arch/xtensa/mm/tlb.c       |  2 +-
>  fs/proc/page.c             |  4 ++--
>  include/linux/mm.h         | 11 ++++++++++-
>  mm/filemap.c               |  2 +-
>  11 files changed, 28 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/arc/mm/cache_arc700.c b/arch/arc/mm/cache_arc700.c
> index 8c3a3e02ba92..1baa4d23314b 100644
> --- a/arch/arc/mm/cache_arc700.c
> +++ b/arch/arc/mm/cache_arc700.c
> @@ -490,7 +490,7 @@ void flush_dcache_page(struct page *page)
>  	 */
>  	if (!mapping_mapped(mapping)) {
>  		clear_bit(PG_dc_clean, &page->flags);
> -	} else if (page_mapped(page)) {
> +	} else if (page_mapcount(page)) {
>  
>  		/* kernel reading from page with U-mapping */
>  		void *paddr = page_address(page);
> @@ -675,7 +675,7 @@ void copy_user_highpage(struct page *to, struct page *from,
>  	 * Note that while @u_vaddr refers to DST page's userspace vaddr, it is
>  	 * equally valid for SRC page as well
>  	 */
> -	if (page_mapped(from) && addr_not_cache_congruent(kfrom, u_vaddr)) {
> +	if (page_mapcount(from) && addr_not_cache_congruent(kfrom, u_vaddr)) {
>  		__flush_dcache_page(kfrom, u_vaddr);
>  		clean_src_k_mappings = 1;
>  	}
> diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
> index 34b66af516ea..8f972fc8933d 100644
> --- a/arch/arm/mm/flush.c
> +++ b/arch/arm/mm/flush.c
> @@ -315,7 +315,7 @@ void flush_dcache_page(struct page *page)
>  	mapping = page_mapping(page);
>  
>  	if (!cache_ops_need_broadcast() &&
> -	    mapping && !page_mapped(page))
> +	    mapping && !page_mapcount(page))
>  		clear_bit(PG_dcache_clean, &page->flags);
>  	else {
>  		__flush_dcache_page(mapping, page);
> diff --git a/arch/mips/mm/c-r4k.c b/arch/mips/mm/c-r4k.c
> index dd261df005c2..c4960b2d6682 100644
> --- a/arch/mips/mm/c-r4k.c
> +++ b/arch/mips/mm/c-r4k.c
> @@ -578,7 +578,8 @@ static inline void local_r4k_flush_cache_page(void *args)
>  		 * another ASID than the current one.
>  		 */
>  		map_coherent = (cpu_has_dc_aliases &&
> -				page_mapped(page) && !Page_dcache_dirty(page));
> +				page_mapcount(page) &&
> +				!Page_dcache_dirty(page));
>  		if (map_coherent)
>  			vaddr = kmap_coherent(page, addr);
>  		else
> diff --git a/arch/mips/mm/cache.c b/arch/mips/mm/cache.c
> index 7e3ea7766822..e695b28dc32c 100644
> --- a/arch/mips/mm/cache.c
> +++ b/arch/mips/mm/cache.c
> @@ -106,7 +106,7 @@ void __flush_anon_page(struct page *page, unsigned long vmaddr)
>  	unsigned long addr = (unsigned long) page_address(page);
>  
>  	if (pages_do_alias(addr, vmaddr)) {
> -		if (page_mapped(page) && !Page_dcache_dirty(page)) {
> +		if (page_mapcount(page) && !Page_dcache_dirty(page)) {
>  			void *kaddr;
>  
>  			kaddr = kmap_coherent(page, vmaddr);
> diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
> index 448cde372af0..2c8e44aa536e 100644
> --- a/arch/mips/mm/init.c
> +++ b/arch/mips/mm/init.c
> @@ -156,7 +156,7 @@ void copy_user_highpage(struct page *to, struct page *from,
>  
>  	vto = kmap_atomic(to);
>  	if (cpu_has_dc_aliases &&
> -	    page_mapped(from) && !Page_dcache_dirty(from)) {
> +	    page_mapcount(from) && !Page_dcache_dirty(from)) {
>  		vfrom = kmap_coherent(from, vaddr);
>  		copy_page(vto, vfrom);
>  		kunmap_coherent();
> @@ -178,7 +178,7 @@ void copy_to_user_page(struct vm_area_struct *vma,
>  	unsigned long len)
>  {
>  	if (cpu_has_dc_aliases &&
> -	    page_mapped(page) && !Page_dcache_dirty(page)) {
> +	    page_mapcount(page) && !Page_dcache_dirty(page)) {
>  		void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
>  		memcpy(vto, src, len);
>  		kunmap_coherent();
> @@ -196,7 +196,7 @@ void copy_from_user_page(struct vm_area_struct *vma,
>  	unsigned long len)
>  {
>  	if (cpu_has_dc_aliases &&
> -	    page_mapped(page) && !Page_dcache_dirty(page)) {
> +	    page_mapcount(page) && !Page_dcache_dirty(page)) {
>  		void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
>  		memcpy(dst, vfrom, len);
>  		kunmap_coherent();
> diff --git a/arch/sh/mm/cache-sh4.c b/arch/sh/mm/cache-sh4.c
> index 51d8f7f31d1d..58aaa4f33b81 100644
> --- a/arch/sh/mm/cache-sh4.c
> +++ b/arch/sh/mm/cache-sh4.c
> @@ -241,7 +241,7 @@ static void sh4_flush_cache_page(void *args)
>  		 */
>  		map_coherent = (current_cpu_data.dcache.n_aliases &&
>  			test_bit(PG_dcache_clean, &page->flags) &&
> -			page_mapped(page));
> +			page_mapcount(page));
>  		if (map_coherent)
>  			vaddr = kmap_coherent(page, address);
>  		else
> diff --git a/arch/sh/mm/cache.c b/arch/sh/mm/cache.c
> index f770e3992620..e58cfbf45150 100644
> --- a/arch/sh/mm/cache.c
> +++ b/arch/sh/mm/cache.c
> @@ -59,7 +59,7 @@ void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
>  		       unsigned long vaddr, void *dst, const void *src,
>  		       unsigned long len)
>  {
> -	if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
> +	if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
>  	    test_bit(PG_dcache_clean, &page->flags)) {
>  		void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
>  		memcpy(vto, src, len);
> @@ -78,7 +78,7 @@ void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
>  			 unsigned long vaddr, void *dst, const void *src,
>  			 unsigned long len)
>  {
> -	if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
> +	if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
>  	    test_bit(PG_dcache_clean, &page->flags)) {
>  		void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
>  		memcpy(dst, vfrom, len);
> @@ -97,7 +97,7 @@ void copy_user_highpage(struct page *to, struct page *from,
>  
>  	vto = kmap_atomic(to);
>  
> -	if (boot_cpu_data.dcache.n_aliases && page_mapped(from) &&
> +	if (boot_cpu_data.dcache.n_aliases && page_mapcount(from) &&
>  	    test_bit(PG_dcache_clean, &from->flags)) {
>  		vfrom = kmap_coherent(from, vaddr);
>  		copy_page(vto, vfrom);
> @@ -153,7 +153,7 @@ void __flush_anon_page(struct page *page, unsigned long vmaddr)
>  	unsigned long addr = (unsigned long) page_address(page);
>  
>  	if (pages_do_alias(addr, vmaddr)) {
> -		if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
> +		if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
>  		    test_bit(PG_dcache_clean, &page->flags)) {
>  			void *kaddr;
>  
> diff --git a/arch/xtensa/mm/tlb.c b/arch/xtensa/mm/tlb.c
> index 5ece856c5725..35c822286bbe 100644
> --- a/arch/xtensa/mm/tlb.c
> +++ b/arch/xtensa/mm/tlb.c
> @@ -245,7 +245,7 @@ static int check_tlb_entry(unsigned w, unsigned e, bool dtlb)
>  						page_mapcount(p));
>  				if (!page_count(p))
>  					rc |= TLB_INSANE;
> -				else if (page_mapped(p))
> +				else if (page_mapcount(p))
>  					rc |= TLB_SUSPICIOUS;
>  			} else {
>  				rc |= TLB_INSANE;
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index 7eee2d8b97d9..e99c059339f6 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -97,9 +97,9 @@ u64 stable_page_flags(struct page *page)
>  	 * pseudo flags for the well known (anonymous) memory mapped pages
>  	 *
>  	 * Note that page->_mapcount is overloaded in SLOB/SLUB/SLQB, so the
> -	 * simple test in page_mapped() is not enough.
> +	 * simple test in page_mapcount() is not enough.
>  	 */
> -	if (!PageSlab(page) && page_mapped(page))
> +	if (!PageSlab(page) && page_mapcount(page))
>  		u |= 1 << KPF_MMAP;
>  	if (PageAnon(page))
>  		u |= 1 << KPF_ANON;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 33cb3aa647a6..8ddc184c55d6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -909,7 +909,16 @@ static inline pgoff_t page_file_index(struct page *page)
>   */
>  static inline int page_mapped(struct page *page)
>  {
> -	return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
> +	int i;
> +	if (likely(!PageCompound(page)))
> +		return atomic_read(&page->_mapcount) >= 0;
> +	if (compound_mapcount(page))
> +		return 1;
> +	for (i = 0; i < hpage_nr_pages(page); i++) {
> +		if (atomic_read(&page[i]._mapcount) >= 0)
> +			return 1;
> +	}
> +	return 0;
>  }

page_mapped() won't work with tail pages. Maybe I'm missing something
that makes passing a tail page here impossible. Otherwise, have you
checked that this holds for all call sites?  Should we add some check at
the beginning of the function? Something like:

VM_BUG_ON_PAGE(PageTail(page), page)?

>  
>  /*
> diff --git a/mm/filemap.c b/mm/filemap.c
> index ce4d6e3d740f..c25ba3b4e7a2 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -200,7 +200,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
>  	__dec_zone_page_state(page, NR_FILE_PAGES);
>  	if (PageSwapBacked(page))
>  		__dec_zone_page_state(page, NR_SHMEM);
> -	BUG_ON(page_mapped(page));
> +	VM_BUG_ON_PAGE(page_mapped(page), page);
>  
>  	/*
>  	 * At this point page must be either written or cleaned by truncate.
> 
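Concretely, the check suggested above would make the new helper look
something like this (illustrative sketch only; whether the assertion is
safe for every caller is exactly the open question):

static inline int page_mapped(struct page *page)
{
        int i;

        /* suggested guard: callers must pass a small or head page */
        VM_BUG_ON_PAGE(PageTail(page), page);

        if (likely(!PageCompound(page)))
                return atomic_read(&page->_mapcount) >= 0;
        if (compound_mapcount(page))
                return 1;
        for (i = 0; i < hpage_nr_pages(page); i++) {
                if (atomic_read(&page[i]._mapcount) >= 0)
                        return 1;
        }
        return 0;
}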



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 189+ messages in thread

* [RFC PATCH 0/3] Remove _PAGE_SPLITTING from ppc64
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-04-30  8:25   ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 189+ messages in thread
From: Aneesh Kumar K.V @ 2015-04-30  8:25 UTC (permalink / raw)
  To: akpm, paulus, benh, kirill.shutemov
  Cc: linux-mm, linux-kernel, Aneesh Kumar K.V

The changes are on top of what is posted at

http://mid.gmane.org/1429823043-157133-1-git-send-email-kirill.shutemov@linux.intel.com

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/refcounting/v5

Aneesh Kumar K.V (3):
  mm/thp: Use pmdp_splitting_flush_notify to clear pmd on splitting
  powerpc/thp: Remove _PAGE_SPLITTING and related code
  mm/thp: Add new function to clear pmd on collapse

 arch/powerpc/include/asm/kvm_book3s_64.h |   6 --
 arch/powerpc/include/asm/pgtable-ppc64.h |  29 ++------
 arch/powerpc/mm/hugepage-hash64.c        |   3 -
 arch/powerpc/mm/hugetlbpage.c            |   2 +-
 arch/powerpc/mm/pgtable_64.c             | 111 ++++++++++++-------------------
 include/asm-generic/pgtable.h            |  14 ++++
 mm/gup.c                                 |   2 +-
 mm/huge_memory.c                         |   9 +--
 mm/pgtable-generic.c                     |  11 +++
 9 files changed, 82 insertions(+), 105 deletions(-)

-- 
2.1.4


^ permalink raw reply	[flat|nested] 189+ messages in thread

* [RFC PATCH 1/3] mm/thp: Use pmdp_splitting_flush_notify to clear pmd on splitting
  2015-04-30  8:25   ` Aneesh Kumar K.V
@ 2015-04-30  8:25     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 189+ messages in thread
From: Aneesh Kumar K.V @ 2015-04-30  8:25 UTC (permalink / raw)
  To: akpm, paulus, benh, kirill.shutemov
  Cc: linux-mm, linux-kernel, Aneesh Kumar K.V

Some archs may require an explicit IPI before a THP PMD split. This
ensures that local_irq_disable() can prevent a parallel THP PMD split.
So use a new function which the arch can override.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 include/asm-generic/pgtable.h |  5 +++++
 mm/huge_memory.c              |  7 ++++---
 mm/pgtable-generic.c          | 11 +++++++++++
 3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index fe617b7e4be6..d091a666f5b1 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -184,6 +184,11 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH_NOTIFY
+extern void pmdp_splitting_flush_notify(struct vm_area_struct *vma,
+					unsigned long address, pmd_t *pmdp);
+#endif
+
 #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
 extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 				       pgtable_t pgtable);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cce4604c192f..81e9578bf43a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2606,9 +2606,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 	write = pmd_write(*pmd);
 	young = pmd_young(*pmd);
-
-	/* leave pmd empty until pte is filled */
-	pmdp_clear_flush_notify(vma, haddr, pmd);
+	/*
+	 * leave pmd empty until pte is filled.
+	 */
+	pmdp_splitting_flush_notify(vma, haddr, pmd);
 
 	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
 	pmd_populate(mm, &_pmd, pgtable);
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 2fe699cedd4d..0fc1f5a06979 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -7,6 +7,7 @@
  */
 
 #include <linux/pagemap.h>
+#include <linux/mmu_notifier.h>
 #include <asm/tlb.h>
 #include <asm-generic/pgtable.h>
 
@@ -184,3 +185,13 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
+
+#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH_NOTIFY
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void pmdp_splitting_flush_notify(struct vm_area_struct *vma,
+				 unsigned long address, pmd_t *pmdp)
+{
+	pmdp_clear_flush_notify(vma, address, pmdp);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [RFC PATCH 2/3] powerpc/thp: Remove _PAGE_SPLITTING and related code
  2015-04-30  8:25   ` Aneesh Kumar K.V
@ 2015-04-30  8:25     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 189+ messages in thread
From: Aneesh Kumar K.V @ 2015-04-30  8:25 UTC (permalink / raw)
  To: akpm, paulus, benh, kirill.shutemov
  Cc: linux-mm, linux-kernel, Aneesh Kumar K.V

With the new THP refcounting we don't need to mark the PMD as
splitting. Drop the code that handles this.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/kvm_book3s_64.h |  6 -----
 arch/powerpc/include/asm/pgtable-ppc64.h | 27 ++++-------------------
 arch/powerpc/mm/hugepage-hash64.c        |  3 ---
 arch/powerpc/mm/hugetlbpage.c            |  2 +-
 arch/powerpc/mm/pgtable_64.c             | 38 +++++---------------------------
 mm/gup.c                                 |  2 +-
 6 files changed, 12 insertions(+), 66 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 2d81e202bdcc..9a96fe3caa48 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -298,12 +298,6 @@ static inline pte_t kvmppc_read_update_linux_pte(pte_t *ptep, int writing,
 			cpu_relax();
 			continue;
 		}
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-		/* If hugepage and is trans splitting return None */
-		if (unlikely(hugepage &&
-			     pmd_trans_splitting(pte_pmd(old_pte))))
-			return __pte(0);
-#endif
 		/* If pte is not present return None */
 		if (unlikely(!(old_pte & _PAGE_PRESENT)))
 			return __pte(0);
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 843cb35e6add..ff275443a040 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -361,11 +361,6 @@ void pgtable_cache_init(void);
 #endif /* __ASSEMBLY__ */
 
 /*
- * THP pages can't be special. So use the _PAGE_SPECIAL
- */
-#define _PAGE_SPLITTING _PAGE_SPECIAL
-
-/*
  * We need to differentiate between explicit huge page and THP huge
  * page, since THP huge page also need to track real subpage details
  */
@@ -375,8 +370,7 @@ void pgtable_cache_init(void);
  * set of bits not changed in pmd_modify.
  */
 #define _HPAGE_CHG_MASK (PTE_RPN_MASK | _PAGE_HPTEFLAGS |		\
-			 _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_SPLITTING | \
-			 _PAGE_THP_HUGE)
+			 _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_THP_HUGE)
 
 #ifndef __ASSEMBLY__
 /*
@@ -458,13 +452,6 @@ static inline int pmd_trans_huge(pmd_t pmd)
 	return (pmd_val(pmd) & 0x3) && (pmd_val(pmd) & _PAGE_THP_HUGE);
 }
 
-static inline int pmd_trans_splitting(pmd_t pmd)
-{
-	if (pmd_trans_huge(pmd))
-		return pmd_val(pmd) & _PAGE_SPLITTING;
-	return 0;
-}
-
 extern int has_transparent_hugepage(void);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
@@ -517,12 +504,6 @@ static inline pmd_t pmd_mknotpresent(pmd_t pmd)
 	return pmd;
 }
 
-static inline pmd_t pmd_mksplitting(pmd_t pmd)
-{
-	pmd_val(pmd) |= _PAGE_SPLITTING;
-	return pmd;
-}
-
 #define __HAVE_ARCH_PMD_SAME
 static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 {
@@ -577,9 +558,9 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr,
 	pmd_hugepage_update(mm, addr, pmdp, _PAGE_RW, 0);
 }
 
-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
-				 unsigned long address, pmd_t *pmdp);
+#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH_NOTIFY
+extern void pmdp_splitting_flush_notify(struct vm_area_struct *vma,
+					unsigned long address, pmd_t *pmdp);
 
 #define __HAVE_ARCH_PGTABLE_DEPOSIT
 extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
diff --git a/arch/powerpc/mm/hugepage-hash64.c b/arch/powerpc/mm/hugepage-hash64.c
index 86686514ae13..078f7207afd2 100644
--- a/arch/powerpc/mm/hugepage-hash64.c
+++ b/arch/powerpc/mm/hugepage-hash64.c
@@ -39,9 +39,6 @@ int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
 		/* If PMD busy, retry the access */
 		if (unlikely(old_pmd & _PAGE_BUSY))
 			return 0;
-		/* If PMD is trans splitting retry the access */
-		if (unlikely(old_pmd & _PAGE_SPLITTING))
-			return 0;
 		/* If PMD permissions don't match, take page fault */
 		if (unlikely(access & ~old_pmd))
 			return 1;
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index f30ae0f7f570..dfd7db0cfbee 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -1008,7 +1008,7 @@ pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, unsigned *shift
 			 * hpte invalidate
 			 *
 			 */
-			if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+			if (pmd_none(pmd))
 				return NULL;
 
 			if (pmd_huge(pmd) || pmd_large(pmd)) {
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 91bb8836825a..89b356250be3 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -36,6 +36,7 @@
 #include <linux/memblock.h>
 #include <linux/slab.h>
 #include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/pgalloc.h>
 #include <asm/page.h>
@@ -622,47 +623,20 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
  * We mark the pmd splitting and invalidate all the hpte
  * entries for this hugepage.
  */
-void pmdp_splitting_flush(struct vm_area_struct *vma,
-			  unsigned long address, pmd_t *pmdp)
+void pmdp_splitting_flush_notify(struct vm_area_struct *vma,
+				 unsigned long address, pmd_t *pmdp)
 {
-	unsigned long old, tmp;
-
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
 #ifdef CONFIG_DEBUG_VM
 	WARN_ON(!pmd_trans_huge(*pmdp));
 	assert_spin_locked(&vma->vm_mm->page_table_lock);
 #endif
-
-#ifdef PTE_ATOMIC_UPDATES
-
-	__asm__ __volatile__(
-	"1:	ldarx	%0,0,%3\n\
-		andi.	%1,%0,%6\n\
-		bne-	1b \n\
-		ori	%1,%0,%4 \n\
-		stdcx.	%1,0,%3 \n\
-		bne-	1b"
-	: "=&r" (old), "=&r" (tmp), "=m" (*pmdp)
-	: "r" (pmdp), "i" (_PAGE_SPLITTING), "m" (*pmdp), "i" (_PAGE_BUSY)
-	: "cc" );
-#else
-	old = pmd_val(*pmdp);
-	*pmdp = __pmd(old | _PAGE_SPLITTING);
-#endif
-	/*
-	 * If we didn't had the splitting flag set, go and flush the
-	 * HPTE entries.
-	 */
-	trace_hugepage_splitting(address, old);
-	if (!(old & _PAGE_SPLITTING)) {
-		/* We need to flush the hpte */
-		if (old & _PAGE_HASHPTE)
-			hpte_do_hugepage_flush(vma->vm_mm, address, pmdp, old);
-	}
+	trace_hugepage_splitting(address, *pmdp);
+	pmdp_clear_flush_notify(vma, address, pmdp);
 	/*
 	 * This ensures that generic code that rely on IRQ disabling
-	 * to prevent a parallel THP split work as expected.
+	 * to prevent a parallel THP PMD split work as expected.
 	 */
 	kick_all_cpus_sync();
 }
diff --git a/mm/gup.c b/mm/gup.c
index 0cebfa76fd0c..8375781b76f0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1215,7 +1215,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		pmd_t pmd = ACCESS_ONCE(*pmdp);
 
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+		if (pmd_none(pmd))
 			return 0;
 
 		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd))) {
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* [RFC PATCH 3/3] mm/thp: Add new function to clear pmd on collapse
  2015-04-30  8:25   ` Aneesh Kumar K.V
@ 2015-04-30  8:25     ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 189+ messages in thread
From: Aneesh Kumar K.V @ 2015-04-30  8:25 UTC (permalink / raw)
  To: akpm, paulus, benh, kirill.shutemov
  Cc: linux-mm, linux-kernel, Aneesh Kumar K.V

Some archs may need an explicit IPI when clearing a pmd on collapse.
Add a new function which the arch can override. After this,
pmdp_clear_flush() is used only in the THP case to invalidate a
huge page pte.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pgtable-ppc64.h |  4 ++
 arch/powerpc/mm/pgtable_64.c             | 77 ++++++++++++++++----------------
 include/asm-generic/pgtable.h            |  9 ++++
 mm/huge_memory.c                         |  2 +-
 4 files changed, 53 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index ff275443a040..655dde8e9683 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -562,6 +562,10 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr,
 extern void pmdp_splitting_flush_notify(struct vm_area_struct *vma,
 					unsigned long address, pmd_t *pmdp);
 
+#define __HAVE_ARCH_PMDP_COLLAPSE_FLUSH
+extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
+				 unsigned long address, pmd_t *pmdp);
+
 #define __HAVE_ARCH_PGTABLE_DEPOSIT
 extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 				       pgtable_t pgtable);
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 89b356250be3..fa49e2ff042b 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -558,45 +558,9 @@ unsigned long pmd_hugepage_update(struct mm_struct *mm, unsigned long addr,
 pmd_t pmdp_clear_flush(struct vm_area_struct *vma, unsigned long address,
 		       pmd_t *pmdp)
 {
-	pmd_t pmd;
-
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
-	if (pmd_trans_huge(*pmdp)) {
-		pmd = pmdp_get_and_clear(vma->vm_mm, address, pmdp);
-	} else {
-		/*
-		 * khugepaged calls this for normal pmd
-		 */
-		pmd = *pmdp;
-		pmd_clear(pmdp);
-		/*
-		 * Wait for all pending hash_page to finish. This is needed
-		 * in case of subpage collapse. When we collapse normal pages
-		 * to hugepage, we first clear the pmd, then invalidate all
-		 * the PTE entries. The assumption here is that any low level
-		 * page fault will see a none pmd and take the slow path that
-		 * will wait on mmap_sem. But we could very well be in a
-		 * hash_page with local ptep pointer value. Such a hash page
-		 * can result in adding new HPTE entries for normal subpages.
-		 * That means we could be modifying the page content as we
-		 * copy them to a huge page. So wait for parallel hash_page
-		 * to finish before invalidating HPTE entries. We can do this
-		 * by sending an IPI to all the cpus and executing a dummy
-		 * function there.
-		 */
-		kick_all_cpus_sync();
-		/*
-		 * Now invalidate the hpte entries in the range
-		 * covered by pmd. This make sure we take a
-		 * fault and will find the pmd as none, which will
-		 * result in a major fault which takes mmap_sem and
-		 * hence wait for collapse to complete. Without this
-		 * the __collapse_huge_page_copy can result in copying
-		 * the old content.
-		 */
-		flush_tlb_pmd_range(vma->vm_mm, &pmd, address);
-	}
-	return pmd;
+	VM_BUG_ON(!pmd_trans_huge(*pmdp));
+	return pmdp_get_and_clear(vma->vm_mm, address, pmdp);
 }
 
 int pmdp_test_and_clear_young(struct vm_area_struct *vma,
@@ -641,6 +605,43 @@ void pmdp_splitting_flush_notify(struct vm_area_struct *vma,
 	kick_all_cpus_sync();
 }
 
+pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
+			  pmd_t *pmdp)
+{
+	pmd_t pmd;
+
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	pmd = *pmdp;
+	pmd_clear(pmdp);
+	/*
+	 * Wait for all pending hash_page to finish. This is needed
+	 * in case of subpage collapse. When we collapse normal pages
+	 * to hugepage, we first clear the pmd, then invalidate all
+	 * the PTE entries. The assumption here is that any low level
+	 * page fault will see a none pmd and take the slow path that
+	 * will wait on mmap_sem. But we could very well be in a
+	 * hash_page with local ptep pointer value. Such a hash page
+	 * can result in adding new HPTE entries for normal subpages.
+	 * That means we could be modifying the page content as we
+	 * copy them to a huge page. So wait for parallel hash_page
+	 * to finish before invalidating HPTE entries. We can do this
+	 * by sending an IPI to all the cpus and executing a dummy
+	 * function there.
+	 */
+	kick_all_cpus_sync();
+	/*
+	 * Now invalidate the hpte entries in the range
+	 * covered by pmd. This make sure we take a
+	 * fault and will find the pmd as none, which will
+	 * result in a major fault which takes mmap_sem and
+	 * hence wait for collapse to complete. Without this
+	 * the __collapse_huge_page_copy can result in copying
+	 * the old content.
+	 */
+	flush_tlb_pmd_range(vma->vm_mm, &pmd, address);
+	return pmd;
+}
+
 /*
  * We want to put the pgtable in pmd and use pgtable for tracking
  * the base page size hptes
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index d091a666f5b1..2e1e4653ae7c 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -189,6 +189,15 @@ extern void pmdp_splitting_flush_notify(struct vm_area_struct *vma,
 					unsigned long address, pmd_t *pmdp);
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_COLLAPSE_FLUSH
+static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
+				       unsigned long address,
+				       pmd_t *pmdp)
+{
+	return pmdp_clear_flush(vma, address, pmdp);
+}
+#endif
+
 #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
 extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 				       pgtable_t pgtable);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 81e9578bf43a..30c1b46fcf6d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2187,7 +2187,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * huge and small TLB entries for the same virtual address
 	 * to avoid the risk of CPU bugs in that area.
 	 */
-	_pmd = pmdp_clear_flush(vma, address, pmd);
+	_pmd = pmdp_collapse_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 02/28] rmap: add argument to charge compound page
  2015-04-29 15:53   ` Jerome Marchand
@ 2015-04-30 11:52       ` Kirill A. Shutemov
  0 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-30 11:52 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Sasha Levin, linux-kernel, linux-mm

On Wed, Apr 29, 2015 at 05:53:04PM +0200, Jerome Marchand wrote:
> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index dad23a43e42c..4ca4b5cffd95 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1048,9 +1048,9 @@ static void __page_check_anon_rmap(struct page *page,
> >   * (but PageKsm is never downgraded to PageAnon).
> >   */
> 
> The comment above should be updated to include the new argument.
> 
...
> > @@ -1097,15 +1101,18 @@ void do_page_add_anon_rmap(struct page *page,
> >   * Page does not have to be locked.
> >   */
> 
> Again, the description of the function should be updated.
> 
...
> > @@ -1161,9 +1168,12 @@ out:
> >   *
> >   * The caller needs to hold the pte lock.
> >   */
> 
> Same here.

Will be fixed for v6. Thanks.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 16/28] mm, thp: remove compound_lock
  2015-04-29 16:11   ` Jerome Marchand
@ 2015-04-30 11:58       ` Kirill A. Shutemov
  0 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-30 11:58 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Sasha Levin, linux-kernel, linux-mm

On Wed, Apr 29, 2015 at 06:11:08PM +0200, Jerome Marchand wrote:
> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> > We are going to use migration entries to stabilize page counts. It means
> 
> By "stabilize" do you mean "protect" from concurrent access? I've seen
> that you use the same term in seemingly the same sense several times (at
> least in patches 15, 16, 23, 24 and 28).

Here it means protecting against concurrent changes of the page's
->_count or ->_mapcount.

In some contexts I use "stabilize" to mean "protect from concurrent
split".
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 17/28] mm, thp: remove infrastructure for handling splitting PMDs
  2015-04-29 16:14   ` Jerome Marchand
@ 2015-04-30 12:03       ` Kirill A. Shutemov
  0 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-30 12:03 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Sasha Levin, linux-kernel, linux-mm

On Wed, Apr 29, 2015 at 06:14:13PM +0200, Jerome Marchand wrote:
> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> > With new refcounting we don't need to mark PMDs splitting. Let's drop code
> > to handle this.
> > 
> > Arch-specific code will be removed separately.
> 
> This series only removes code from the x86 arch. Does that mean patches
> for other arches will come later?

Initially I hoped it would just be a trivial removal of dead code that
could be done later. But we need to do a bit more, at least for powerpc
(see the patchset by Aneesh). I will need to check the other arches'
code.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 20/28] mm: differentiate page_mapped() from page_mapcount() for compound pages
  2015-04-29 16:20   ` Jerome Marchand
@ 2015-04-30 12:06       ` Kirill A. Shutemov
  0 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-30 12:06 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Vlastimil Babka, Christoph Lameter, Naoya Horiguchi,
	Steve Capper, Aneesh Kumar K.V, Johannes Weiner, Michal Hocko,
	Sasha Levin, linux-kernel, linux-mm

On Wed, Apr 29, 2015 at 06:20:03PM +0200, Jerome Marchand wrote:
> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> > Let's define page_mapped() to be true for compound pages if any
> > sub-pages of the compound page is mapped (with PMD or PTE).
> > 
> > On other hand page_mapcount() return mapcount for this particular small
> > page.
> > 
> > This will make cases like page_get_anon_vma() behave correctly once we
> > allow huge pages to be mapped with PTE.
> > 
> > Most users outside core-mm should use page_mapcount() instead of
> > page_mapped().
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Tested-by: Sasha Levin <sasha.levin@oracle.com>
> > ---
> >  arch/arc/mm/cache_arc700.c |  4 ++--
> >  arch/arm/mm/flush.c        |  2 +-
> >  arch/mips/mm/c-r4k.c       |  3 ++-
> >  arch/mips/mm/cache.c       |  2 +-
> >  arch/mips/mm/init.c        |  6 +++---
> >  arch/sh/mm/cache-sh4.c     |  2 +-
> >  arch/sh/mm/cache.c         |  8 ++++----
> >  arch/xtensa/mm/tlb.c       |  2 +-
> >  fs/proc/page.c             |  4 ++--
> >  include/linux/mm.h         | 11 ++++++++++-
> >  mm/filemap.c               |  2 +-
> >  11 files changed, 28 insertions(+), 18 deletions(-)
> > 
> > diff --git a/arch/arc/mm/cache_arc700.c b/arch/arc/mm/cache_arc700.c
> > index 8c3a3e02ba92..1baa4d23314b 100644
> > --- a/arch/arc/mm/cache_arc700.c
> > +++ b/arch/arc/mm/cache_arc700.c
> > @@ -490,7 +490,7 @@ void flush_dcache_page(struct page *page)
> >  	 */
> >  	if (!mapping_mapped(mapping)) {
> >  		clear_bit(PG_dc_clean, &page->flags);
> > -	} else if (page_mapped(page)) {
> > +	} else if (page_mapcount(page)) {
> >  
> >  		/* kernel reading from page with U-mapping */
> >  		void *paddr = page_address(page);
> > @@ -675,7 +675,7 @@ void copy_user_highpage(struct page *to, struct page *from,
> >  	 * Note that while @u_vaddr refers to DST page's userspace vaddr, it is
> >  	 * equally valid for SRC page as well
> >  	 */
> > -	if (page_mapped(from) && addr_not_cache_congruent(kfrom, u_vaddr)) {
> > +	if (page_mapcount(from) && addr_not_cache_congruent(kfrom, u_vaddr)) {
> >  		__flush_dcache_page(kfrom, u_vaddr);
> >  		clean_src_k_mappings = 1;
> >  	}
> > diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
> > index 34b66af516ea..8f972fc8933d 100644
> > --- a/arch/arm/mm/flush.c
> > +++ b/arch/arm/mm/flush.c
> > @@ -315,7 +315,7 @@ void flush_dcache_page(struct page *page)
> >  	mapping = page_mapping(page);
> >  
> >  	if (!cache_ops_need_broadcast() &&
> > -	    mapping && !page_mapped(page))
> > +	    mapping && !page_mapcount(page))
> >  		clear_bit(PG_dcache_clean, &page->flags);
> >  	else {
> >  		__flush_dcache_page(mapping, page);
> > diff --git a/arch/mips/mm/c-r4k.c b/arch/mips/mm/c-r4k.c
> > index dd261df005c2..c4960b2d6682 100644
> > --- a/arch/mips/mm/c-r4k.c
> > +++ b/arch/mips/mm/c-r4k.c
> > @@ -578,7 +578,8 @@ static inline void local_r4k_flush_cache_page(void *args)
> >  		 * another ASID than the current one.
> >  		 */
> >  		map_coherent = (cpu_has_dc_aliases &&
> > -				page_mapped(page) && !Page_dcache_dirty(page));
> > +				page_mapcount(page) &&
> > +				!Page_dcache_dirty(page));
> >  		if (map_coherent)
> >  			vaddr = kmap_coherent(page, addr);
> >  		else
> > diff --git a/arch/mips/mm/cache.c b/arch/mips/mm/cache.c
> > index 7e3ea7766822..e695b28dc32c 100644
> > --- a/arch/mips/mm/cache.c
> > +++ b/arch/mips/mm/cache.c
> > @@ -106,7 +106,7 @@ void __flush_anon_page(struct page *page, unsigned long vmaddr)
> >  	unsigned long addr = (unsigned long) page_address(page);
> >  
> >  	if (pages_do_alias(addr, vmaddr)) {
> > -		if (page_mapped(page) && !Page_dcache_dirty(page)) {
> > +		if (page_mapcount(page) && !Page_dcache_dirty(page)) {
> >  			void *kaddr;
> >  
> >  			kaddr = kmap_coherent(page, vmaddr);
> > diff --git a/arch/mips/mm/init.c b/arch/mips/mm/init.c
> > index 448cde372af0..2c8e44aa536e 100644
> > --- a/arch/mips/mm/init.c
> > +++ b/arch/mips/mm/init.c
> > @@ -156,7 +156,7 @@ void copy_user_highpage(struct page *to, struct page *from,
> >  
> >  	vto = kmap_atomic(to);
> >  	if (cpu_has_dc_aliases &&
> > -	    page_mapped(from) && !Page_dcache_dirty(from)) {
> > +	    page_mapcount(from) && !Page_dcache_dirty(from)) {
> >  		vfrom = kmap_coherent(from, vaddr);
> >  		copy_page(vto, vfrom);
> >  		kunmap_coherent();
> > @@ -178,7 +178,7 @@ void copy_to_user_page(struct vm_area_struct *vma,
> >  	unsigned long len)
> >  {
> >  	if (cpu_has_dc_aliases &&
> > -	    page_mapped(page) && !Page_dcache_dirty(page)) {
> > +	    page_mapcount(page) && !Page_dcache_dirty(page)) {
> >  		void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
> >  		memcpy(vto, src, len);
> >  		kunmap_coherent();
> > @@ -196,7 +196,7 @@ void copy_from_user_page(struct vm_area_struct *vma,
> >  	unsigned long len)
> >  {
> >  	if (cpu_has_dc_aliases &&
> > -	    page_mapped(page) && !Page_dcache_dirty(page)) {
> > +	    page_mapcount(page) && !Page_dcache_dirty(page)) {
> >  		void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
> >  		memcpy(dst, vfrom, len);
> >  		kunmap_coherent();
> > diff --git a/arch/sh/mm/cache-sh4.c b/arch/sh/mm/cache-sh4.c
> > index 51d8f7f31d1d..58aaa4f33b81 100644
> > --- a/arch/sh/mm/cache-sh4.c
> > +++ b/arch/sh/mm/cache-sh4.c
> > @@ -241,7 +241,7 @@ static void sh4_flush_cache_page(void *args)
> >  		 */
> >  		map_coherent = (current_cpu_data.dcache.n_aliases &&
> >  			test_bit(PG_dcache_clean, &page->flags) &&
> > -			page_mapped(page));
> > +			page_mapcount(page));
> >  		if (map_coherent)
> >  			vaddr = kmap_coherent(page, address);
> >  		else
> > diff --git a/arch/sh/mm/cache.c b/arch/sh/mm/cache.c
> > index f770e3992620..e58cfbf45150 100644
> > --- a/arch/sh/mm/cache.c
> > +++ b/arch/sh/mm/cache.c
> > @@ -59,7 +59,7 @@ void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
> >  		       unsigned long vaddr, void *dst, const void *src,
> >  		       unsigned long len)
> >  {
> > -	if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
> > +	if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
> >  	    test_bit(PG_dcache_clean, &page->flags)) {
> >  		void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
> >  		memcpy(vto, src, len);
> > @@ -78,7 +78,7 @@ void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
> >  			 unsigned long vaddr, void *dst, const void *src,
> >  			 unsigned long len)
> >  {
> > -	if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
> > +	if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
> >  	    test_bit(PG_dcache_clean, &page->flags)) {
> >  		void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
> >  		memcpy(dst, vfrom, len);
> > @@ -97,7 +97,7 @@ void copy_user_highpage(struct page *to, struct page *from,
> >  
> >  	vto = kmap_atomic(to);
> >  
> > -	if (boot_cpu_data.dcache.n_aliases && page_mapped(from) &&
> > +	if (boot_cpu_data.dcache.n_aliases && page_mapcount(from) &&
> >  	    test_bit(PG_dcache_clean, &from->flags)) {
> >  		vfrom = kmap_coherent(from, vaddr);
> >  		copy_page(vto, vfrom);
> > @@ -153,7 +153,7 @@ void __flush_anon_page(struct page *page, unsigned long vmaddr)
> >  	unsigned long addr = (unsigned long) page_address(page);
> >  
> >  	if (pages_do_alias(addr, vmaddr)) {
> > -		if (boot_cpu_data.dcache.n_aliases && page_mapped(page) &&
> > +		if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
> >  		    test_bit(PG_dcache_clean, &page->flags)) {
> >  			void *kaddr;
> >  
> > diff --git a/arch/xtensa/mm/tlb.c b/arch/xtensa/mm/tlb.c
> > index 5ece856c5725..35c822286bbe 100644
> > --- a/arch/xtensa/mm/tlb.c
> > +++ b/arch/xtensa/mm/tlb.c
> > @@ -245,7 +245,7 @@ static int check_tlb_entry(unsigned w, unsigned e, bool dtlb)
> >  						page_mapcount(p));
> >  				if (!page_count(p))
> >  					rc |= TLB_INSANE;
> > -				else if (page_mapped(p))
> > +				else if (page_mapcount(p))
> >  					rc |= TLB_SUSPICIOUS;
> >  			} else {
> >  				rc |= TLB_INSANE;
> > diff --git a/fs/proc/page.c b/fs/proc/page.c
> > index 7eee2d8b97d9..e99c059339f6 100644
> > --- a/fs/proc/page.c
> > +++ b/fs/proc/page.c
> > @@ -97,9 +97,9 @@ u64 stable_page_flags(struct page *page)
> >  	 * pseudo flags for the well known (anonymous) memory mapped pages
> >  	 *
> >  	 * Note that page->_mapcount is overloaded in SLOB/SLUB/SLQB, so the
> > -	 * simple test in page_mapped() is not enough.
> > +	 * simple test in page_mapcount() is not enough.
> >  	 */
> > -	if (!PageSlab(page) && page_mapped(page))
> > +	if (!PageSlab(page) && page_mapcount(page))
> >  		u |= 1 << KPF_MMAP;
> >  	if (PageAnon(page))
> >  		u |= 1 << KPF_ANON;
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 33cb3aa647a6..8ddc184c55d6 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -909,7 +909,16 @@ static inline pgoff_t page_file_index(struct page *page)
> >   */
> >  static inline int page_mapped(struct page *page)
> >  {
> > -	return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
> > +	int i;
> > +	if (likely(!PageCompound(page)))
> > +		return atomic_read(&page->_mapcount) >= 0;
> > +	if (compound_mapcount(page))
> > +		return 1;
> > +	for (i = 0; i < hpage_nr_pages(page); i++) {
> > +		if (atomic_read(&page[i]._mapcount) >= 0)
> > +			return 1;
> > +	}
> > +	return 0;
> >  }
> 
> page_mapped() won't work with tail pages. Maybe I'm missing something
> that makes that impossible. Otherwise, have you checked that this
> condition holds for all call sites?  Should we add a check at the
> beginning of the function? Something like:
> 
> VM_BUG_ON_PAGE(PageTail(page), page)?

Good catch. I will probably put compound_head() there. Thanks.
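
A sketch of what that could look like, assuming the rest of the helper
stays as quoted above (illustration only, not the actual v6 code):

static inline int page_mapped(struct page *page)
{
	int i;

	/* tail pages are resolved to their compound head first */
	page = compound_head(page);
	if (likely(!PageCompound(page)))
		return atomic_read(&page->_mapcount) >= 0;
	if (compound_mapcount(page))
		return 1;
	for (i = 0; i < hpage_nr_pages(page); i++) {
		if (atomic_read(&page[i]._mapcount) >= 0)
			return 1;
	}
	return 0;
}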

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [RFC PATCH 1/3] mm/thp: Use pmdp_splitting_flush_notify to clear pmd on splitting
  2015-04-30  8:25     ` Aneesh Kumar K.V
@ 2015-04-30 13:30       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-04-30 13:30 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: akpm, paulus, benh, kirill.shutemov, linux-mm, linux-kernel

On Thu, Apr 30, 2015 at 01:55:39PM +0530, Aneesh Kumar K.V wrote:
> Some arch may require an explicit IPI before a THP PMD split. This
> ensures that a local_irq_disable can prevent a parallel THP PMD split.
> So use a new function which the arch can override
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  include/asm-generic/pgtable.h |  5 +++++
>  mm/huge_memory.c              |  7 ++++---
>  mm/pgtable-generic.c          | 11 +++++++++++
>  3 files changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index fe617b7e4be6..d091a666f5b1 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -184,6 +184,11 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  #endif
>  
> +#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH_NOTIFY
> +extern void pmdp_splitting_flush_notify(struct vm_area_struct *vma,
> +					unsigned long address, pmd_t *pmdp);
> +#endif
> +
>  #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
>  extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
>  				       pgtable_t pgtable);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index cce4604c192f..81e9578bf43a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2606,9 +2606,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  
>  	write = pmd_write(*pmd);
>  	young = pmd_young(*pmd);
> -
> -	/* leave pmd empty until pte is filled */
> -	pmdp_clear_flush_notify(vma, haddr, pmd);
> +	/*
> +	 * leave pmd empty until pte is filled.
> +	 */
> +	pmdp_splitting_flush_notify(vma, haddr, pmd);
>  
>  	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
>  	pmd_populate(mm, &_pmd, pgtable);
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 2fe699cedd4d..0fc1f5a06979 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -7,6 +7,7 @@
>   */
>  
>  #include <linux/pagemap.h>
> +#include <linux/mmu_notifier.h>
>  #include <asm/tlb.h>
>  #include <asm-generic/pgtable.h>
>  
> @@ -184,3 +185,13 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  #endif
> +
> +#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH_NOTIFY
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +void pmdp_splitting_flush_notify(struct vm_area_struct *vma,
> +				 unsigned long address, pmd_t *pmdp)
> +{
> +	pmdp_clear_flush_notify(vma, address, pmdp);
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +#endif

I think it's worth inlining. Let's put it in <asm-generic/pgtable.h>.

It's probably worth combining with the collapse counterpart in the same patch.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [RFC PATCH 1/3] mm/thp: Use pmdp_splitting_flush_notify to clear pmd on splitting
  2015-04-30 13:30       ` Kirill A. Shutemov
@ 2015-04-30 15:59         ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 189+ messages in thread
From: Aneesh Kumar K.V @ 2015-04-30 15:59 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, paulus, benh, kirill.shutemov, linux-mm, linux-kernel

"Kirill A. Shutemov" <kirill@shutemov.name> writes:

>> @@ -184,3 +185,13 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>>  }
>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>  #endif
>> +
>> +#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH_NOTIFY
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +void pmdp_splitting_flush_notify(struct vm_area_struct *vma,
>> +				 unsigned long address, pmd_t *pmdp)
>> +{
>> +	pmdp_clear_flush_notify(vma, address, pmdp);
>> +}
>> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>> +#endif
>
> I think it's worth inlining. Let's put it in <asm-generic/pgtable.h>.
>
> It's probably worth combining with the collapse counterpart in the same patch.
>

I tried that first, but that pulls in the mmu_notifier.h and huge_mm.h
headers and causes other build failures.

-aneesh


^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [RFC PATCH 1/3] mm/thp: Use pmdp_splitting_flush_notify to clear pmd on splitting
  2015-04-30 15:59         ` Aneesh Kumar K.V
@ 2015-04-30 16:47           ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 189+ messages in thread
From: Aneesh Kumar K.V @ 2015-04-30 16:47 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, paulus, benh, kirill.shutemov, linux-mm, linux-kernel

"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> writes:

> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
>
>>> @@ -184,3 +185,13 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>>>  }
>>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>>  #endif
>>> +
>>> +#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH_NOTIFY
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +void pmdp_splitting_flush_notify(struct vm_area_struct *vma,
>>> +				 unsigned long address, pmd_t *pmdp)
>>> +{
>>> +	pmdp_clear_flush_notify(vma, address, pmdp);
>>> +}
>>> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>> +#endif
>>
>> I think it's worth inlining. Let's put it in <asm-generic/pgtable.h>.
>>
>> It's probably worth combining with the collapse counterpart in the same patch.
>>
>
> I tried that first, but that pulls in the mmu_notifier.h and huge_mm.h
> headers and causes other build failures.
>

Putting them under CONFIG_TRANSPARENT_HUGEPAGE helped.

commit 9c60ab5d1d684db2ba454ee1c7f3e9a6bf57f026
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date:   Wed Apr 29 14:57:30 2015 +0530

    mm/thp: Use pmdp_splitting_flush_notify to clear pmd on splitting
    
    Some arch may require an explicit IPI before a THP PMD split. This
    ensures that a local_irq_disable can prevent a parallel THP PMD split.
    So use a new function which the arch can override
    
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index fe617b7e4be6..6a0b2ab899d1 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -184,6 +184,24 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH_NOTIFY
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void pmdp_splitting_flush_notify(struct vm_area_struct *vma,
+					       unsigned long address,
+					       pmd_t *pmdp)
+{
+	pmdp_clear_flush_notify(vma, address, pmdp);
+}
+#else
+static inline void pmdp_splitting_flush_notify(struct vm_area_struct *vma,
+					       unsigned long address,
+					       pmd_t *pmdp)
+{
+	BUG();
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
 #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
 extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
 				       pgtable_t pgtable);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cce4604c192f..81e9578bf43a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2606,9 +2606,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 	write = pmd_write(*pmd);
 	young = pmd_young(*pmd);
-
-	/* leave pmd empty until pte is filled */
-	pmdp_clear_flush_notify(vma, haddr, pmd);
+	/*
+	 * leave pmd empty until pte is filled.
+	 */
+	pmdp_splitting_flush_notify(vma, haddr, pmd);
 
 	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
 	pmd_populate(mm, &_pmd, pgtable);


^ permalink raw reply related	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 01/28] mm, proc: adjust PSS calculation
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-05-14 14:12     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-14 14:12 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> With the new refcounting, all subpages of a compound page do not necessarily
> have the same mapcount. We need to take into account the mapcount of every
> sub-page.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

(some nitpicks below)

> ---
>   fs/proc/task_mmu.c | 43 ++++++++++++++++++++++---------------------
>   1 file changed, 22 insertions(+), 21 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 956b75d61809..95bc384ee3f7 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -449,9 +449,10 @@ struct mem_size_stats {
>   };
>
>   static void smaps_account(struct mem_size_stats *mss, struct page *page,
> -		unsigned long size, bool young, bool dirty)
> +		bool compound, bool young, bool dirty)
>   {
> -	int mapcount;
> +	int i, nr = compound ? hpage_nr_pages(page) : 1;

Why not just HPAGE_PMD_NR instead of hpage_nr_pages(page)? We already 
came here through a pmd mapping. Even if the page stopped being a 
hugepage meanwhile (I'm not sure whether any locking prevents that), 
it would be more accurate to continue assuming it's a hugepage; 
otherwise we account only the base page (formerly the head) and skip 
the 511 formerly-tail pages.

Also, is there some shortcut way to tell us that we are the only one 
mapping the whole compound page, and nobody has any base pages, so we 
don't need to loop on each tail page? I guess not under the new design, 
right...

> +	unsigned long size = nr * PAGE_SIZE;
>
>   	if (PageAnon(page))
>   		mss->anonymous += size;
> @@ -460,23 +461,23 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
>   	/* Accumulate the size in pages that have been accessed. */
>   	if (young || PageReferenced(page))
>   		mss->referenced += size;
> -	mapcount = page_mapcount(page);
> -	if (mapcount >= 2) {
> -		u64 pss_delta;
>
> -		if (dirty || PageDirty(page))
> -			mss->shared_dirty += size;
> -		else
> -			mss->shared_clean += size;
> -		pss_delta = (u64)size << PSS_SHIFT;
> -		do_div(pss_delta, mapcount);
> -		mss->pss += pss_delta;
> -	} else {
> -		if (dirty || PageDirty(page))
> -			mss->private_dirty += size;
> -		else
> -			mss->private_clean += size;
> -		mss->pss += (u64)size << PSS_SHIFT;
> +	for (i = 0; i < nr; i++) {
> +		int mapcount = page_mapcount(page + i);
> +
> +		if (mapcount >= 2) {
> +			if (dirty || PageDirty(page + i))
> +				mss->shared_dirty += PAGE_SIZE;
> +			else
> +				mss->shared_clean += PAGE_SIZE;
> +			mss->pss += (PAGE_SIZE << PSS_SHIFT) / mapcount;
> +		} else {
> +			if (dirty || PageDirty(page + i))
> +				mss->private_dirty += PAGE_SIZE;
> +			else
> +				mss->private_clean += PAGE_SIZE;
> +			mss->pss += PAGE_SIZE << PSS_SHIFT;
> +		}

That's 3 instances of "page + i"; why not just use page and do page++ 
in the for loop?
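
For illustration, the loop could be restructured along those lines roughly
as below (a sketch against the hunk above, not a tested replacement):

	for (i = 0; i < nr; i++, page++) {
		int mapcount = page_mapcount(page);

		if (mapcount >= 2) {
			if (dirty || PageDirty(page))
				mss->shared_dirty += PAGE_SIZE;
			else
				mss->shared_clean += PAGE_SIZE;
			mss->pss += (PAGE_SIZE << PSS_SHIFT) / mapcount;
		} else {
			if (dirty || PageDirty(page))
				mss->private_dirty += PAGE_SIZE;
			else
				mss->private_clean += PAGE_SIZE;
			mss->pss += PAGE_SIZE << PSS_SHIFT;
		}
	}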

>   	}
>   }
>
> @@ -500,7 +501,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>
>   	if (!page)
>   		return;
> -	smaps_account(mss, page, PAGE_SIZE, pte_young(*pte), pte_dirty(*pte));
> +
> +	smaps_account(mss, page, false, pte_young(*pte), pte_dirty(*pte));
>   }
>
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> @@ -516,8 +518,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
>   	if (IS_ERR_OR_NULL(page))
>   		return;
>   	mss->anonymous_thp += HPAGE_PMD_SIZE;
> -	smaps_account(mss, page, HPAGE_PMD_SIZE,
> -			pmd_young(*pmd), pmd_dirty(*pmd));
> +	smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd));
>   }
>   #else
>   static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
>


^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 02/28] rmap: add argument to charge compound page
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-05-14 16:07     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-14 16:07 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> We're going to allow mapping of individual 4k pages of THP compound
> page. It means we cannot rely on the PageTransHuge() check to decide
> whether to map/unmap a small page or the whole THP.
>
> The patch adds new argument to rmap functions to indicate whether we want
> to operate on whole compound page or only the small page.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

But I wonder about one thing:

> -void page_remove_rmap(struct page *page)
> +void page_remove_rmap(struct page *page, bool compound)
>   {
> +	int nr = compound ? hpage_nr_pages(page) : 1;
> +
>   	if (!PageAnon(page)) {
> +		VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
>   		page_remove_file_rmap(page);
>   		return;
>   	}

The function continues by:

         /* page still mapped by someone else? */
         if (!atomic_add_negative(-1, &page->_mapcount))
                 return;

         /* Hugepages are not counted in NR_ANON_PAGES for now. */
         if (unlikely(PageHuge(page)))
                 return;

The handling of the compound parameter for PageHuge() pages feels just 
weird. You use hpage_nr_pages() for them, which tests PageTransHuge(). It 
doesn't break anything, and the value of nr is effectively ignored 
anyway, but still...

So I wonder: if all callers of page_remove_rmap() for PageHuge() pages 
are the two in mm/hugetlb.c, why not just create a special-case 
function? Or are some callers elsewhere, not aware whether they are 
calling this on a PageHuge() page, so compound might even be false for 
those? If that's all possible and legal, then maybe explain it in a 
comment to reduce confusion for future readers. And move the 'nr' 
assignment to a place where we are sure it's not a PageHuge() page, 
i.e. right above the place the value is used, perhaps?
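
Purely as an illustration of that special-case idea (the helper name and the
assumption that only the compound mapcount matters for hugetlb pages are
mine, not something from this patchset):

	/* hypothetical helper for the two mm/hugetlb.c call sites */
	void hugetlb_remove_rmap(struct page *page)
	{
		/*
		 * hugetlb pages are only ever mapped as a whole compound
		 * page, so only the compound mapcount would need to go
		 * down here; they are not accounted in NR_ANON_PAGES.
		 */
		atomic_dec(compound_mapcount_ptr(page));
	}

(assuming a compound_mapcount_ptr()-style accessor for the counter that
compound_mapcount() reads)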


> @@ -1181,11 +1191,12 @@ void page_remove_rmap(struct page *page)
>   	 * these counters are not modified in interrupt context, and
>   	 * pte lock(a spinlock) is held, which implies preemption disabled.
>   	 */
> -	if (PageTransHuge(page))
> +	if (compound) {
> +		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
>   		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
> +	}
>
> -	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
> -			      -hpage_nr_pages(page));
> +	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
>
>   	if (unlikely(PageMlocked(page)))
>   		clear_page_mlock(page);
> @@ -1327,7 +1338,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>   		dec_mm_counter(mm, MM_FILEPAGES);
>
>   discard:
> -	page_remove_rmap(page);
> +	page_remove_rmap(page, false);
>   	page_cache_release(page);
>
>   out_unmap:


^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 03/28] memcg: adjust to support new THP refcounting
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-05-15  7:44     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-15  7:44 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> As with rmap, with the new refcounting we cannot rely on PageTransHuge() to
> check whether we need to charge the size of the huge page from the cgroup.
> We need the caller to tell us whether the page was mapped with a PMD or
> with PTEs.
>
> We do uncharge when the last reference on the page is gone. At that point,
> if we see PageTransHuge() it means we need to uncharge the whole huge page.
>
> The tricky part is partial unmap -- when we try to unmap part of the huge
> page. We don't do any special handling of this situation, meaning we don't
> uncharge the part of the huge page unless the last user is gone or
> split_huge_page() is triggered. If cgroup memory pressure happens, the
> partially unmapped page will be split through the shrinker. This should
> be good enough.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

But the same question as before applies: should it be using hpage_nr_pages() 
instead of a constant?


^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 00/28] THP refcounting redesign
  2015-04-23 21:03 ` Kirill A. Shutemov
@ 2015-05-15  8:55   ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-15  8:55 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> Hello everybody,
>
> Here's reworked version of my patchset. All known issues were addressed.
>
> The goal of patchset is to make refcounting on THP pages cheaper with
> simpler semantics and allow the same THP compound page to be mapped with
> PMD and PTEs. This is required to get reasonable THP-pagecache
> implementation.
>
> With the new refcounting design it's much easier to protect against
> split_huge_page(): simple reference on a page will make you the deal.
> It makes gup_fast() implementation simpler and doesn't require
> special-case in futex code to handle tail THP pages.
>
> It should improve THP utilization over the system since splitting THP in
> one process doesn't necessary lead to splitting the page in all other
> processes have the page mapped.
>
> The patchset drastically lower complexity of get_page()/put_page()
> codepaths. I encourage reviewers look on this code before-and-after to
> justify time budget on reviewing this patchset.
>
> = Changelog =
>
> v5:
>    - Tested-by: Sasha Levin!™
>    - re-split patchset in hope to improve readability;
>    - rebased on top of page flags and ->mapping sanitizing patchset;
>    - uncharge compound_mapcount rather than mapcount for hugetlb pages
>      during removing from rmap;
>    - differentiate page_mapped() from page_mapcount() for compound pages;
>    - rework deferred_split_huge_page() to use shrinker interface;
>    - fix race in page_remove_rmap();
>    - get rid of __get_page_tail();
>    - few random bug fixes;
> v4:
>    - fix sizes reported in smaps;
>    - defines instead of enum for RMAP_{EXCLUSIVE,COMPOUND};
>    - skip THP pages on munlock_vma_pages_range(): they are never mlocked;
>    - properly handle huge zero page on FOLL_SPLIT;
>    - fix lock_page() slow path on tail pages;
>    - account page_get_anon_vma() fail to THP_SPLIT_PAGE_FAILED;
>    - fix split_huge_page() on huge page with unmapped head page;
>    - fix transfering 'write' and 'young' from pmd to ptes on split_huge_pmd;
>    - call page_remove_rmap() in unfreeze_page under ptl.
>
> = Design overview =
>
> The main reason why we can't map THP with 4k is how refcounting on THP
> designed. It built around two requirements:
>
>    - split of huge page should never fail;
>    - we can't change interface of get_user_page();
>
> To be able to split huge page at any point we have to track which tail
> page was pinned. It leads to tricky and expensive get_page() on tail pages
> and also occupy tail_page->_mapcount.
>
> Most split_huge_page*() users want PMD to be split into table of PTEs and
> don't care whether compound page is going to be split or not.
>
> The plan is:
>
>   - allow split_huge_page() to fail if the page is pinned. It's trivial to
>     split non-pinned page and it doesn't require tail page refcounting, so
>     tail_page->_mapcount is free to be reused.
>
>   - introduce new routine -- split_huge_pmd() -- to split PMD into table of
>     PTEs. It splits only one PMD, not touching other PMDs the page is
>     mapped with or underlying compound page. Unlike new split_huge_page(),
>     split_huge_pmd() never fails.
>
> Fortunately, we have only few places where split_huge_page() is needed:
> swap out, memory failure, migration, KSM. And all of them can handle
> split_huge_page() failure.
>
> In the new scheme, page->_mapcount is used to account how many times
> the page is mapped with PTEs. We have a separate compound_mapcount() to
> count mappings with PMDs. page_mapcount() returns the sum of PTE and PMD
> mappings of the page.

It would be very beneficial to describe the scheme in full, both before 
and after. The same goes for the Documentation patch, where you fixed 
what was no longer true, but I think the picture wasn't complete before, 
nor is it now. There's the LWN article [1], which helps a lot, but we 
shouldn't rely on that exclusively.

So the full scheme should include at least:
- where were/are pins and mapcounts stored
- what exactly get_page()/put_page() did/does now
- etc.

[1] https://lwn.net/Articles/619738/






^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 04/28] mm, thp: adjust conditions when we can reuse the page on WP fault
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-05-15  9:15     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-15  9:15 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> With the new refcounting we will be able to map the same compound page
> with PTEs and PMDs. It requires an adjustment to the conditions under
> which we can reuse the page on a write-protection fault.
>
> For a PTE fault we can't reuse the page if it's part of a huge page.
>
> For a PMD we can only reuse the page if nobody else maps the huge page
> or its part. We can do it by checking page_mapcount() on each sub-page,
> but it's expensive.
>
> The cheaper way is to check that page_count() is equal to 1: every
> mapcount takes a page reference, so this way we can guarantee that the
> PMD is the only mapping.
>
> This approach can give a false negative if somebody pinned the page, but
> that doesn't affect correctness.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

So couldn't the same trick be used in Patch 1 to avoid counting 
individual order-0 pages?
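
(Illustration only, not something from the patchset: applied to the
smaps_account() hunk of patch 1, that shortcut might look roughly like the
sketch below. Whether the reasoning is actually safe there -- e.g. with
respect to per-sub-page dirty bits -- is exactly the open question.)

	/* hypothetical fast path placed just before the per-sub-page loop */
	if (compound && page_count(page) == 1) {
		/* only this PMD maps the THP, so it is all private */
		if (dirty || PageDirty(page))
			mss->private_dirty += size;
		else
			mss->private_clean += size;
		mss->pss += (u64)size << PSS_SHIFT;
		return;
	}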

> ---
>   include/linux/swap.h |  3 ++-
>   mm/huge_memory.c     | 12 +++++++++++-
>   mm/swapfile.c        |  3 +++
>   3 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0428e4c84e1d..17cdd6b9456b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -524,7 +524,8 @@ static inline int page_swapcount(struct page *page)
>   	return 0;
>   }
>
> -#define reuse_swap_page(page)	(page_mapcount(page) == 1)
> +#define reuse_swap_page(page) \
> +	(!PageTransCompound(page) && page_mapcount(page) == 1)
>
>   static inline int try_to_free_swap(struct page *page)
>   {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 534f353e12bf..fd8af5b9917f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1103,7 +1103,17 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>
>   	page = pmd_page(orig_pmd);
>   	VM_BUG_ON_PAGE(!PageCompound(page) || !PageHead(page), page);
> -	if (page_mapcount(page) == 1) {
> +	/*
> +	 * We can only reuse the page if nobody else maps the huge page or it's
> +	 * part. We can do it by checking page_mapcount() on each sub-page, but
> +	 * it's expensive.
> +	 * The cheaper way is to check page_count() to be equal 1: every
> +	 * mapcount takes page reference reference, so this way we can
> +	 * guarantee, that the PMD is the only mapping.
> +	 * This can give false negative if somebody pinned the page, but that's
> +	 * fine.
> +	 */
> +	if (page_mapcount(page) == 1 && page_count(page) == 1) {
>   		pmd_t entry;
>   		entry = pmd_mkyoung(orig_pmd);
>   		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 6dd365d1c488..3cd5f188b996 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -887,6 +887,9 @@ int reuse_swap_page(struct page *page)
>   	VM_BUG_ON_PAGE(!PageLocked(page), page);
>   	if (unlikely(PageKsm(page)))
>   		return 0;
> +	/* The page is part of THP and cannot be reused */
> +	if (PageTransCompound(page))
> +		return 0;
>   	count = page_mapcount(page);
>   	if (count <= 1 && PageSwapCache(page)) {
>   		count += page_swapcount(page);
>


^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 01/28] mm, proc: adjust PSS calculation
  2015-05-14 14:12     ` Vlastimil Babka
@ 2015-05-15 10:56       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-05-15 10:56 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Thu, May 14, 2015 at 04:12:29PM +0200, Vlastimil Babka wrote:
> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> >With new refcounting all subpages of the compound page are not nessessary
> >have the same mapcount. We need to take into account mapcount of every
> >sub-page.
> >
> >Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >Tested-by: Sasha Levin <sasha.levin@oracle.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> (some nitpicks below)
> 
> >---
> >  fs/proc/task_mmu.c | 43 ++++++++++++++++++++++---------------------
> >  1 file changed, 22 insertions(+), 21 deletions(-)
> >
> >diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> >index 956b75d61809..95bc384ee3f7 100644
> >--- a/fs/proc/task_mmu.c
> >+++ b/fs/proc/task_mmu.c
> >@@ -449,9 +449,10 @@ struct mem_size_stats {
> >  };
> >
> >  static void smaps_account(struct mem_size_stats *mss, struct page *page,
> >-		unsigned long size, bool young, bool dirty)
> >+		bool compound, bool young, bool dirty)
> >  {
> >-	int mapcount;
> >+	int i, nr = compound ? hpage_nr_pages(page) : 1;
> 
> Why not just HPAGE_PMD_NR instead of hpage_nr_pages(page)?

Okay, makes sense. The compiler is smart enough to optimize
HPAGE_PMD_NR away for THP=n. (HPAGE_PMD_NR is BUILD_BUG() for THP=n.)
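
For reference, the definitions in question look roughly like this (a
paraphrase of include/linux/huge_mm.h from memory, so treat the exact
spelling as an approximation rather than a quote):

        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
        #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT - PAGE_SHIFT)
        #define HPAGE_PMD_NR    (1 << HPAGE_PMD_ORDER)

        static inline int hpage_nr_pages(struct page *page)
        {
                /* a THP head page covers HPAGE_PMD_NR base pages */
                if (unlikely(PageTransHuge(page)))
                        return HPAGE_PMD_NR;
                return 1;
        }
        #else
        /* any use must be proven dead code, otherwise the build breaks */
        #define HPAGE_PMD_NR    ({ BUILD_BUG(); 0; })
        #define hpage_nr_pages(page)    1
        #endif

With THP=n the only caller that can pass compound=true is itself
compiled out, so the "compound ? HPAGE_PMD_NR : 1" arm can be proven
dead and the BUILD_BUG() reference disappears.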

> We already came here through a pmd mapping. Even if the page stopped
> being a hugepage meanwhile (I'm not sure if any locking prevents that or
> not?),

We're under ptl here. PMD will not go away under us.

> it would be more accurate to continue assuming it's a hugepage,
> otherwise we account only the base page (formerly head) and skip the 511
> formerly tail pages?
> 
> Also, is there some shortcut way to tell us that we are the only one mapping
> the whole compound page, and nobody has any base pages, so we don't need to
> loop on each tail page? I guess not under the new design, right...

No, we don't have a shortcut here.

> >+	unsigned long size = nr * PAGE_SIZE;
> >
> >  	if (PageAnon(page))
> >  		mss->anonymous += size;
> >@@ -460,23 +461,23 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
> >  	/* Accumulate the size in pages that have been accessed. */
> >  	if (young || PageReferenced(page))
> >  		mss->referenced += size;
> >-	mapcount = page_mapcount(page);
> >-	if (mapcount >= 2) {
> >-		u64 pss_delta;
> >
> >-		if (dirty || PageDirty(page))
> >-			mss->shared_dirty += size;
> >-		else
> >-			mss->shared_clean += size;
> >-		pss_delta = (u64)size << PSS_SHIFT;
> >-		do_div(pss_delta, mapcount);
> >-		mss->pss += pss_delta;
> >-	} else {
> >-		if (dirty || PageDirty(page))
> >-			mss->private_dirty += size;
> >-		else
> >-			mss->private_clean += size;
> >-		mss->pss += (u64)size << PSS_SHIFT;
> >+	for (i = 0; i < nr; i++) {
> >+		int mapcount = page_mapcount(page + i);
> >+
> >+		if (mapcount >= 2) {
> >+			if (dirty || PageDirty(page + i))
> >+				mss->shared_dirty += PAGE_SIZE;
> >+			else
> >+				mss->shared_clean += PAGE_SIZE;
> >+			mss->pss += (PAGE_SIZE << PSS_SHIFT) / mapcount;
> >+		} else {
> >+			if (dirty || PageDirty(page + i))
> >+				mss->private_dirty += PAGE_SIZE;
> >+			else
> >+				mss->private_clean += PAGE_SIZE;
> >+			mss->pss += PAGE_SIZE << PSS_SHIFT;
> >+		}
> 
> That's 3 instances of "page + i", why not just use page and do a page++ in
> the for loop?

Okay.

-- 
 Kirill A. Shutemov

* Re: [PATCHv5 05/28] mm: adjust FOLL_SPLIT for new refcounting
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-05-15 11:05     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-15 11:05 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> We need to prepare kernel to allow transhuge pages to be mapped with
> ptes too. We need to handle FOLL_SPLIT in follow_page_pte().
>
> Also we use split_huge_page() directly instead of split_huge_page_pmd().
> split_huge_page_pmd() will gone.

You still call split_huge_page_pmd() for the is_huge_zero_page(page) 
case. Also, in the code around split_huge_page() that you basically 
took from split_huge_page_pmd() and open-coded into follow_page_mask(), 
you didn't include the mmu notifier calls. Why are they needed in 
split_huge_page_pmd() but not here?

>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>
> ---
>   mm/gup.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++-----------------
>   1 file changed, 49 insertions(+), 18 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 203781fa96a5..ebdb39b3e820 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -79,6 +79,19 @@ retry:
>   		page = pte_page(pte);
>   	}
>
> +	if (flags & FOLL_SPLIT && PageTransCompound(page)) {
> +		int ret;
> +		get_page(page);
> +		pte_unmap_unlock(ptep, ptl);
> +		lock_page(page);
> +		ret = split_huge_page(page);
> +		unlock_page(page);
> +		put_page(page);
> +		if (ret)
> +			return ERR_PTR(ret);
> +		goto retry;
> +	}
> +
>   	if (flags & FOLL_GET)
>   		get_page_foll(page);
>   	if (flags & FOLL_TOUCH) {
> @@ -186,27 +199,45 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>   	}
>   	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
>   		return no_page_table(vma, flags);
> -	if (pmd_trans_huge(*pmd)) {
> -		if (flags & FOLL_SPLIT) {
> +	if (likely(!pmd_trans_huge(*pmd)))
> +		return follow_page_pte(vma, address, pmd, flags);
> +
> +	ptl = pmd_lock(mm, pmd);
> +	if (unlikely(!pmd_trans_huge(*pmd))) {
> +		spin_unlock(ptl);
> +		return follow_page_pte(vma, address, pmd, flags);
> +	}
> +
> +	if (unlikely(pmd_trans_splitting(*pmd))) {
> +		spin_unlock(ptl);
> +		wait_split_huge_page(vma->anon_vma, pmd);
> +		return follow_page_pte(vma, address, pmd, flags);
> +	}
> +
> +	if (flags & FOLL_SPLIT) {
> +		int ret;
> +		page = pmd_page(*pmd);
> +		if (is_huge_zero_page(page)) {
> +			spin_unlock(ptl);
> +			ret = 0;
>   			split_huge_page_pmd(vma, address, pmd);
> -			return follow_page_pte(vma, address, pmd, flags);
> -		}
> -		ptl = pmd_lock(mm, pmd);
> -		if (likely(pmd_trans_huge(*pmd))) {
> -			if (unlikely(pmd_trans_splitting(*pmd))) {
> -				spin_unlock(ptl);
> -				wait_split_huge_page(vma->anon_vma, pmd);
> -			} else {
> -				page = follow_trans_huge_pmd(vma, address,
> -							     pmd, flags);
> -				spin_unlock(ptl);
> -				*page_mask = HPAGE_PMD_NR - 1;
> -				return page;
> -			}
> -		} else
> +		} else {
> +			get_page(page);
>   			spin_unlock(ptl);
> +			lock_page(page);
> +			ret = split_huge_page(page);
> +			unlock_page(page);
> +			put_page(page);
> +		}
> +
> +		return ret ? ERR_PTR(ret) :
> +			follow_page_pte(vma, address, pmd, flags);
>   	}
> -	return follow_page_pte(vma, address, pmd, flags);
> +
> +	page = follow_trans_huge_pmd(vma, address, pmd, flags);
> +	spin_unlock(ptl);
> +	*page_mask = HPAGE_PMD_NR - 1;
> +	return page;
>   }
>
>   static int get_gate_page(struct mm_struct *mm, unsigned long address,
>


* Re: [PATCHv5 02/28] rmap: add argument to charge compound page
  2015-05-14 16:07     ` Vlastimil Babka
@ 2015-05-15 11:14       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-05-15 11:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Thu, May 14, 2015 at 06:07:48PM +0200, Vlastimil Babka wrote:
> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> >We're going to allow mapping of individual 4k pages of THP compound
> >page. It means we cannot rely on PageTransHuge() check to decide if
> >map/unmap small page or THP.
> >
> >The patch adds new argument to rmap functions to indicate whether we want
> >to operate on whole compound page or only the small page.
> >
> >Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >Tested-by: Sasha Levin <sasha.levin@oracle.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> But I wonder about one thing:
> 
> >-void page_remove_rmap(struct page *page)
> >+void page_remove_rmap(struct page *page, bool compound)
> >  {
> >+	int nr = compound ? hpage_nr_pages(page) : 1;
> >+
> >  	if (!PageAnon(page)) {
> >+		VM_BUG_ON_PAGE(compound && !PageHuge(page), page);
> >  		page_remove_file_rmap(page);
> >  		return;
> >  	}
> 
> The function continues by:
> 
>         /* page still mapped by someone else? */
>         if (!atomic_add_negative(-1, &page->_mapcount))
>                 return;
> 
>         /* Hugepages are not counted in NR_ANON_PAGES for now. */
>         if (unlikely(PageHuge(page)))
>                 return;
> 
> The handling of compound parameter for PageHuge() pages feels just weird.
> You use hpage_nr_pages() for them which tests PageTransHuge(). It doesn't
> break anything and the value of nr is effectively ignored anyway, but
> still...
> 
> So I wonder, if all callers of page_remove_rmap() for PageHuge() pages are
> the two in mm/hugetlb.c, why not just create a special case function?

It's a fair question, but I think we shouldn't do this. It would make
hugetlb an even more special place, alien to the rest of mm.

And this is out of scope for the patchset in question.

> Or are some callers elsewhere, not aware whether they are calling this
> on a PageHuge()? So compound might be even false for those?

The caller sets compound==true based on whether the page is mapped with
a PMD/PUD or not. It has nothing to do with what page type it is.
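
To illustrate (a sketch of what call sites look like under this series,
based on the description above rather than the exact hunks):

        /* PMD mapping of a THP: operate on the whole compound page */
        page_add_new_anon_rmap(page, vma, haddr, true);
        ...
        page_remove_rmap(page, true);

        /* PTE mapping of a single 4k page */
        page_add_new_anon_rmap(page, vma, address, false);
        ...
        page_remove_rmap(page, false);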

> If that's all possible and legal, then maybe explain it in a comment to
> reduce confusion of further readers. And move the 'nr' assignment to a
> place where we are sure it's not a PageHuge(), i.e. right above the
> place the value is used, perhaps?

I'll rework the code a bit in v6.

-- 
 Kirill A. Shutemov

* Re: [PATCHv5 03/28] memcg: adjust to support new THP refcounting
  2015-05-15  7:44     ` Vlastimil Babka
@ 2015-05-15 11:18       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-05-15 11:18 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Fri, May 15, 2015 at 09:44:17AM +0200, Vlastimil Babka wrote:
> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> >As with rmap, with new refcounting we cannot rely on PageTransHuge() to
> >check if we need to charge size of huge page form the cgroup. We need to
> >get information from caller to know whether it was mapped with PMD or
> >PTE.
> >
> >We do uncharge when last reference on the page gone. At that point if we
> >see PageTransHuge() it means we need to unchange whole huge page.
> >
> >The tricky part is partial unmap -- when we try to unmap part of huge
> >page. We don't do a special handing of this situation, meaning we don't
> >uncharge the part of huge page unless last user is gone or
> >split_huge_page() is triggered. In case of cgroup memory pressure
> >happens the partial unmapped page will be split through shrinker. This
> >should be good enough.
> >
> >Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >Tested-by: Sasha Levin <sasha.levin@oracle.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> But same question about whether it should be using hpage_nr_pages() instead
> of a constant.

No. The compiler wouldn't be able to optimize HPAGE_PMD_NR away for
THP=n, since the compound value crosses a compilation-unit boundary.
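
In other words (a minimal made-up example of the problem, with
hypothetical names, not actual memcg code):

        /* 'compound' arrives at runtime from another object file */
        static unsigned int nr_pages_to_charge(bool compound)
        {
                /*
                 * Even with THP=n the compiler cannot prove 'compound'
                 * is always false here, so it cannot drop the
                 * HPAGE_PMD_NR arm and the BUILD_BUG() inside it would
                 * trigger.
                 */
                return compound ? HPAGE_PMD_NR : 1;
        }

hpage_nr_pages(page), which compiles down to a plain 1 for THP=n,
avoids that problem.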

-- 
 Kirill A. Shutemov

* Re: [PATCHv5 04/28] mm, thp: adjust conditions when we can reuse the page on WP fault
  2015-05-15  9:15     ` Vlastimil Babka
@ 2015-05-15 11:21       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-05-15 11:21 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Fri, May 15, 2015 at 11:15:00AM +0200, Vlastimil Babka wrote:
> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> >With new refcounting we will be able map the same compound page with
> >PTEs and PMDs. It requires adjustment to conditions when we can reuse
> >the page on write-protection fault.
> >
> >For PTE fault we can't reuse the page if it's part of huge page.
> >
> >For PMD we can only reuse the page if nobody else maps the huge page or
> >it's part. We can do it by checking page_mapcount() on each sub-page,
> >but it's expensive.
> >
> >The cheaper way is to check page_count() to be equal 1: every mapcount
> >takes page reference, so this way we can guarantee, that the PMD is the
> >only mapping.
> >
> >This approach can give false negative if somebody pinned the page, but
> >that doesn't affect correctness.
> >
> >Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >Tested-by: Sasha Levin <sasha.levin@oracle.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> So couldn't the same trick be used in Patch 1 to avoid counting individual
> oder-0 pages?

Hm. You're right, we could. But is smaps performance-sensitive enough
to bother?

-- 
 Kirill A. Shutemov

* Re: [PATCHv5 01/28] mm, proc: adjust PSS calculation
  2015-05-15 10:56       ` Kirill A. Shutemov
@ 2015-05-15 11:33         ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-15 11:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On 05/15/2015 12:56 PM, Kirill A. Shutemov wrote:
> On Thu, May 14, 2015 at 04:12:29PM +0200, Vlastimil Babka wrote:
>> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
>>> With new refcounting all subpages of the compound page are not nessessary
>>> have the same mapcount. We need to take into account mapcount of every
>>> sub-page.
>>>
>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> Tested-by: Sasha Levin <sasha.levin@oracle.com>
>>
>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>>
>> (some nitpicks below)
>>
>>> ---
>>>   fs/proc/task_mmu.c | 43 ++++++++++++++++++++++---------------------
>>>   1 file changed, 22 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>> index 956b75d61809..95bc384ee3f7 100644
>>> --- a/fs/proc/task_mmu.c
>>> +++ b/fs/proc/task_mmu.c
>>> @@ -449,9 +449,10 @@ struct mem_size_stats {
>>>   };
>>>
>>>   static void smaps_account(struct mem_size_stats *mss, struct page *page,
>>> -		unsigned long size, bool young, bool dirty)
>>> +		bool compound, bool young, bool dirty)
>>>   {
>>> -	int mapcount;
>>> +	int i, nr = compound ? hpage_nr_pages(page) : 1;
>>
>> Why not just HPAGE_PMD_NR instead of hpage_nr_pages(page)?
>
> Okay, makes sense. Compiler is smart enough to optimize away HPAGE_PMD_NR
> for THP=n. (HPAGE_PMD_NR is BUILD_BUG() for THP=n)

Ah, BUILD_BUG()... I'm not sure we can rely on optimization to avoid 
BUILD_BUG(); what if somebody compiles with all optimizations off?
So why not replace BUILD_BUG() with "1", or create a variant of 
HPAGE_PMD_NR that does that, for this case and patch 3? That seems 
better than testing PageTransHuge() everywhere...

* Re: [PATCHv5 04/28] mm, thp: adjust conditions when we can reuse the page on WP fault
  2015-05-15 11:21       ` Kirill A. Shutemov
@ 2015-05-15 11:35         ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-15 11:35 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On 05/15/2015 01:21 PM, Kirill A. Shutemov wrote:
> On Fri, May 15, 2015 at 11:15:00AM +0200, Vlastimil Babka wrote:
>> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
>>> With new refcounting we will be able map the same compound page with
>>> PTEs and PMDs. It requires adjustment to conditions when we can reuse
>>> the page on write-protection fault.
>>>
>>> For PTE fault we can't reuse the page if it's part of huge page.
>>>
>>> For PMD we can only reuse the page if nobody else maps the huge page or
>>> it's part. We can do it by checking page_mapcount() on each sub-page,
>>> but it's expensive.
>>>
>>> The cheaper way is to check page_count() to be equal 1: every mapcount
>>> takes page reference, so this way we can guarantee, that the PMD is the
>>> only mapping.
>>>
>>> This approach can give false negative if somebody pinned the page, but
>>> that doesn't affect correctness.
>>>
>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> Tested-by: Sasha Levin <sasha.levin@oracle.com>
>>
>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>>
>> So couldn't the same trick be used in Patch 1 to avoid counting individual
>> oder-0 pages?
>
> Hm. You're right, we could. But is smaps that performance sensitive to
> bother?

Well, I was nudged to optimize it when doing the shmem swap accounting 
changes there :) The user may not care about the latency of obtaining 
the smaps file contents, but since mmap_sem is held for that, the 
process might care...
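
For the record, the shortcut being discussed would look something like
this in smaps_account() (only a sketch of the idea from this patch
applied to the smaps loop, not a tested change):

        /* sole reference: nobody else maps or pins any sub-page */
        if (compound && page_count(page) == 1) {
                if (dirty || PageDirty(page))
                        mss->private_dirty += size;
                else
                        mss->private_clean += size;
                mss->pss += (u64)size << PSS_SHIFT;
                return;
        }
        /* otherwise fall back to the per-sub-page loop */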



* Re: [PATCHv5 05/28] mm: adjust FOLL_SPLIT for new refcounting
  2015-05-15 11:05     ` Vlastimil Babka
@ 2015-05-15 11:36       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-05-15 11:36 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Fri, May 15, 2015 at 01:05:27PM +0200, Vlastimil Babka wrote:
> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> >We need to prepare kernel to allow transhuge pages to be mapped with
> >ptes too. We need to handle FOLL_SPLIT in follow_page_pte().
> >
> >Also we use split_huge_page() directly instead of split_huge_page_pmd().
> >split_huge_page_pmd() will gone.
> 
> You still call split_huge_page_pmd() for the is_huge_zero_page(page) case.

For the huge zero page we split the PMD into a table of zero-page PTEs
and don't touch the compound page under it. That's what
split_huge_page_pmd() (renamed to split_huge_pmd()) will do by the end
of the patchset.

> Also, of the code around split_huge_page() you basically took from
> split_huge_page_pmd() and open-coded into follow_page_mask(), you didn't
> include the mmu notifier calls. Why are they needed in split_huge_page_pmd()
> but not here?

We do need the mmu notifier in split_huge_page_pmd() for the huge zero
page. When we need to split a compound page, we go into
split_huge_page(), which takes care of the mmu notifiers.
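
That is, the FOLL_SPLIT path boils down to something like this (locking
and refcounting elided; split_huge_page_pmd() is the helper that later
becomes split_huge_pmd()):

        if (is_huge_zero_page(page))
                /* replace the PMD with a table of zero-page PTEs */
                split_huge_page_pmd(vma, address, pmd);
        else
                /* split the compound page itself; handles mmu notifiers */
                split_huge_page(page);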

-- 
 Kirill A. Shutemov

* Re: [PATCHv5 01/28] mm, proc: adjust PSS calculation
  2015-05-15 11:33         ` Vlastimil Babka
@ 2015-05-15 11:43           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-05-15 11:43 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Fri, May 15, 2015 at 01:33:31PM +0200, Vlastimil Babka wrote:
> On 05/15/2015 12:56 PM, Kirill A. Shutemov wrote:
> >On Thu, May 14, 2015 at 04:12:29PM +0200, Vlastimil Babka wrote:
> >>On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> >>>With new refcounting all subpages of the compound page are not nessessary
> >>>have the same mapcount. We need to take into account mapcount of every
> >>>sub-page.
> >>>
> >>>Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >>>Tested-by: Sasha Levin <sasha.levin@oracle.com>
> >>
> >>Acked-by: Vlastimil Babka <vbabka@suse.cz>
> >>
> >>(some nitpicks below)
> >>
> >>>---
> >>>  fs/proc/task_mmu.c | 43 ++++++++++++++++++++++---------------------
> >>>  1 file changed, 22 insertions(+), 21 deletions(-)
> >>>
> >>>diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> >>>index 956b75d61809..95bc384ee3f7 100644
> >>>--- a/fs/proc/task_mmu.c
> >>>+++ b/fs/proc/task_mmu.c
> >>>@@ -449,9 +449,10 @@ struct mem_size_stats {
> >>>  };
> >>>
> >>>  static void smaps_account(struct mem_size_stats *mss, struct page *page,
> >>>-		unsigned long size, bool young, bool dirty)
> >>>+		bool compound, bool young, bool dirty)
> >>>  {
> >>>-	int mapcount;
> >>>+	int i, nr = compound ? hpage_nr_pages(page) : 1;
> >>
> >>Why not just HPAGE_PMD_NR instead of hpage_nr_pages(page)?
> >
> >Okay, makes sense. Compiler is smart enough to optimize away HPAGE_PMD_NR
> >for THP=n. (HPAGE_PMD_NR is BUILD_BUG() for THP=n)
> 
> Ah, BUILD_BUG()... I'm not sure we can rely on optimization to avoid
> BUILD_BUG(), what if somebody compiles with all optimizations off?

The kernel relies on dead-code elimination; you cannot build the kernel with -O0.

> So why not replace BUILD_BUG() with "1", or create a variant of HPAGE_PMD_NR
> that does that, for this case and patch 3. Seems better than testing
> PageTransHuge everywhere...

I think we could try to downgrade it to BUG(), although I've found
BUILD_BUG() useful a few times.

HPAGE_PMD_NR==1 would be just wrong. It would mean you can map an
order-0 page with a PMD %-|

-- 
 Kirill A. Shutemov

* Re: [PATCHv5 05/28] mm: adjust FOLL_SPLIT for new refcounting
  2015-05-15 11:36       ` Kirill A. Shutemov
@ 2015-05-15 12:01         ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-15 12:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On 05/15/2015 01:36 PM, Kirill A. Shutemov wrote:
> On Fri, May 15, 2015 at 01:05:27PM +0200, Vlastimil Babka wrote:
>> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
>>> We need to prepare kernel to allow transhuge pages to be mapped with
>>> ptes too. We need to handle FOLL_SPLIT in follow_page_pte().
>>>
>>> Also we use split_huge_page() directly instead of split_huge_page_pmd().
>>> split_huge_page_pmd() will gone.
>>
>> You still call split_huge_page_pmd() for the is_huge_zero_page(page) case.
>
> For huge zero page we split PMD into table of zero pages and don't touch
> compound page under it. That's what split_huge_page_pmd() (renamed into
> split_huge_pmd()) will do by the end of patchset.

Ah, I see.

>> Also, of the code around split_huge_page() you basically took from
>> split_huge_page_pmd() and open-coded into follow_page_mask(), you didn't
>> include the mmu notifier calls. Why are they needed in split_huge_page_pmd()
>> but not here?
>
> We do need mmu notifier in split_huge_page_pmd() for huge zero page. When

Oh, I guess that's obvious then... to someone, anyway. Thanks.

In that case the patch seems fine.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> we need to split compound page we go into split_huge_page() which takes
> care about mmut notifiers.
>


* Re: [PATCHv5 01/28] mm, proc: adjust PSS calculation
  2015-05-15 11:43           ` Kirill A. Shutemov
@ 2015-05-15 12:37             ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-15 12:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On 05/15/2015 01:43 PM, Kirill A. Shutemov wrote:
> On Fri, May 15, 2015 at 01:33:31PM +0200, Vlastimil Babka wrote:
>> On 05/15/2015 12:56 PM, Kirill A. Shutemov wrote:
>>> On Thu, May 14, 2015 at 04:12:29PM +0200, Vlastimil Babka wrote:
>>>> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
>>>>> With the new refcounting, all subpages of a compound page do not
>>>>> necessarily have the same mapcount. We need to take into account the
>>>>> mapcount of every sub-page.
>>>>>
>>>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>>>> Tested-by: Sasha Levin <sasha.levin@oracle.com>
>>>>
>>>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>>>>
>>>> (some nitpicks below)
>>>>
>>>>> ---
>>>>>   fs/proc/task_mmu.c | 43 ++++++++++++++++++++++---------------------
>>>>>   1 file changed, 22 insertions(+), 21 deletions(-)
>>>>>
>>>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>>>> index 956b75d61809..95bc384ee3f7 100644
>>>>> --- a/fs/proc/task_mmu.c
>>>>> +++ b/fs/proc/task_mmu.c
>>>>> @@ -449,9 +449,10 @@ struct mem_size_stats {
>>>>>   };
>>>>>
>>>>>   static void smaps_account(struct mem_size_stats *mss, struct page *page,
>>>>> -		unsigned long size, bool young, bool dirty)
>>>>> +		bool compound, bool young, bool dirty)
>>>>>   {
>>>>> -	int mapcount;
>>>>> +	int i, nr = compound ? hpage_nr_pages(page) : 1;
>>>>
>>>> Why not just HPAGE_PMD_NR instead of hpage_nr_pages(page)?
>>>
>>> Okay, makes sense. The compiler is smart enough to optimize away
>>> HPAGE_PMD_NR for THP=n. (HPAGE_PMD_NR is BUILD_BUG() for THP=n.)
>>
>> Ah, BUILD_BUG()... I'm not sure we can rely on optimization to avoid
>> BUILD_BUG(). What if somebody compiles with all optimizations off?
>
> The kernel relies on dead-code elimination. You cannot build the kernel with -O0.

Ah, OK.

>> So why not replace BUILD_BUG() with "1", or create a variant of HPAGE_PMD_NR
>> that does that, for this case and patch 3. Seems better than testing
>> PageTransHuge everywhere...
>
> I think we could try to downgrade it to BUG(). Although I've found
> BUILD_BUG() useful a few times.

BUILD_BUG() seems like a better match here. Better than polluting the code
for THP=n (as in Patch 3, where it isn't eliminated).

> HPAGE_PMD_NR==1 would be just wrong. It would mean you can map an order-0
> page with a PMD %-|

That's why I suggested a different variant of the "variable".
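
The dead-code-elimination point above is easy to demonstrate outside the
kernel. The snippet below is a userspace sketch with made-up stand-ins
(build_bug(), NR_SUBPAGES, page_is_huge()), not the kernel's real definitions:
the error attribute only fires if a reference to build_bug() survives into
the generated code, which is exactly what BUILD_BUG() counts on.

	#include <stdio.h>

	struct page { int huge; };

	/* Stand-in for the kernel's BUILD_BUG() helper: compiling fails only
	 * if a call to this survives dead-code elimination. */
	extern int build_bug(void) __attribute__((error("NR_SUBPAGES used with THP disabled")));

	#ifdef THP_ENABLED
	#define NR_SUBPAGES		512		/* stand-in for HPAGE_PMD_NR */
	#define page_is_huge(p)		((p)->huge)
	#else
	#define NR_SUBPAGES		(build_bug())
	#define page_is_huge(p)		0
	#endif

	static int pages_to_account(const struct page *page)
	{
		/*
		 * With THP disabled, page_is_huge() is a constant 0, so the
		 * NR_SUBPAGES arm is dead code and no reference to build_bug()
		 * is emitted.  In the kernel the analogous condition comes from
		 * PageTransHuge(), which only becomes compile-time false after
		 * constant propagation and inlining, so the whole scheme
		 * depends on building with optimization enabled.
		 */
		return page_is_huge(page) ? NR_SUBPAGES : 1;
	}

	int main(void)
	{
		struct page p = { .huge = 0 };

		printf("%d\n", pages_to_account(&p));	/* prints 1 */
		return 0;
	}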

^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 06/28] mm: handle PTE-mapped tail pages in generic fast gup implementation
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-05-15 12:46     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-15 12:46 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> With the new refcounting we are going to see THP tail pages mapped with PTEs.
> Generic fast GUP relies on page_cache_get_speculative() to obtain a
> reference on the page. page_cache_get_speculative() always fails on tail
> pages, because ->_count on tail pages is always zero.
>
> Let's handle tail pages in gup_pte_range().
>
> The new split_huge_page() will rely on migration entries to freeze the
> page's counts. Rechecking the PTE value after page_cache_get_speculative()
> on the head page should be enough to serialize against split.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>   mm/gup.c | 8 +++++---
>   1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index ebdb39b3e820..eaeeae15006b 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1051,7 +1051,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>   		 * for an example see gup_get_pte in arch/x86/mm/gup.c
>   		 */
>   		pte_t pte = READ_ONCE(*ptep);
> -		struct page *page;
> +		struct page *head, *page;
>
>   		/*
>   		 * Similar to the PMD case below, NUMA hinting must take slow
> @@ -1063,15 +1063,17 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>
>   		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>   		page = pte_page(pte);
> +		head = compound_head(page);
>
> -		if (!page_cache_get_speculative(page))
> +		if (!page_cache_get_speculative(head))
>   			goto pte_unmap;
>
>   		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> -			put_page(page);
> +			put_page(head);
>   			goto pte_unmap;
>   		}
>
> +		VM_BUG_ON_PAGE(compound_head(page) != head, page);
>   		pages[*nr] = page;
>   		(*nr)++;
>
>
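
A toy userspace model of that pattern, purely illustrative (the names
toy_page, toy_gup_fast() and so on are invented here, and this is not the
kernel's code): take a speculative reference on the head only while its count
is non-zero, then re-read the PTE and back off to the slow path if it changed.

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	struct toy_page { atomic_int count; };

	static bool get_page_speculative(struct toy_page *page)
	{
		int old = atomic_load(&page->count);

		/* Mirrors inc_not_zero: fails once the count is frozen to zero. */
		while (old != 0) {
			if (atomic_compare_exchange_weak(&page->count, &old, old + 1))
				return true;
		}
		return false;
	}

	static bool toy_gup_fast(_Atomic(struct toy_page *) *pte, struct toy_page **out)
	{
		struct toy_page *head = atomic_load(pte);

		if (!head || !get_page_speculative(head))
			return false;			/* fall back to slow path */

		if (atomic_load(pte) != head) {
			/* PTE changed under us (split/unmap in progress): back off. */
			atomic_fetch_sub(&head->count, 1);
			return false;
		}

		*out = head;
		return true;
	}

	int main(void)
	{
		struct toy_page head = { .count = 1 };
		_Atomic(struct toy_page *) pte = &head;
		struct toy_page *pinned;

		if (toy_gup_fast(&pte, &pinned))
			printf("pinned, count is now %d\n", atomic_load(&pinned->count));
		return 0;
	}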


^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 07/28] thp, mlock: do not allow huge pages in mlocked area
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-05-15 12:56     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-15 12:56 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> With the new refcounting a THP can belong to several VMAs. This makes it
> tricky to track THP pages when they are partially mlocked. It can lead to
> leaking mlocked pages to non-VM_LOCKED vmas and other problems.
>
> With this patch we split all pages on mlock and avoid faulting in or
> collapsing new THPs in VM_LOCKED vmas.
>
> I've tried an alternative approach: do not mark THP pages mlocked and keep
> them on the normal LRUs. This way vmscan could try to split huge pages under
> memory pressure and free up subpages which don't belong to VM_LOCKED
> vmas.  But this is a user-visible change: it screws up the Mlocked accounting
> reported in meminfo, so I had to leave this approach aside.
>
> We can bring something better later, but this should be good enough for
> now.

I can imagine people won't be happy about losing the benefits of THPs when
they mlock().
How difficult would it be to support mlocked THP pages without splitting
until something actually tries to do a partial (un)mapping, and only
then do the split? That would cover the most common case, no?

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>
> ---
>   mm/gup.c         |  2 ++
>   mm/huge_memory.c |  5 ++++-
>   mm/memory.c      |  3 ++-
>   mm/mlock.c       | 51 +++++++++++++++++++--------------------------------
>   4 files changed, 27 insertions(+), 34 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index eaeeae15006b..7334eb24f414 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -882,6 +882,8 @@ long populate_vma_page_range(struct vm_area_struct *vma,
>   	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
>
>   	gup_flags = FOLL_TOUCH | FOLL_POPULATE;
> +	if (vma->vm_flags & VM_LOCKED)
> +		gup_flags |= FOLL_SPLIT;
>   	/*
>   	 * We want to touch writable mappings with a write fault in order
>   	 * to break COW, except for shared mappings because these don't COW
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fd8af5b9917f..fa3d4f78b716 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -796,6 +796,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
>
>   	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
>   		return VM_FAULT_FALLBACK;
> +	if (vma->vm_flags & VM_LOCKED)
> +		return VM_FAULT_FALLBACK;
>   	if (unlikely(anon_vma_prepare(vma)))
>   		return VM_FAULT_OOM;
>   	if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
> @@ -2467,7 +2469,8 @@ static bool hugepage_vma_check(struct vm_area_struct *vma)
>   	if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
>   	    (vma->vm_flags & VM_NOHUGEPAGE))
>   		return false;
> -
> +	if (vma->vm_flags & VM_LOCKED)
> +		return false;
>   	if (!vma->anon_vma || vma->vm_ops)
>   		return false;
>   	if (is_vma_temporary_stack(vma))
> diff --git a/mm/memory.c b/mm/memory.c
> index 559c6651d6b6..8bbd3f88544b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2156,7 +2156,8 @@ static int wp_page_copy(struct mm_struct *mm, struct vm_area_struct *vma,
>
>   	pte_unmap_unlock(page_table, ptl);
>   	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> -	if (old_page) {
> +	/* THP pages are never mlocked */
> +	if (old_page && !PageTransCompound(old_page)) {
>   		/*
>   		 * Don't let another task, with possibly unlocked vma,
>   		 * keep the mlocked page.
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 6fd2cf15e868..76cde3967483 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -443,39 +443,26 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
>   		page = follow_page_mask(vma, start, FOLL_GET | FOLL_DUMP,
>   				&page_mask);
>
> -		if (page && !IS_ERR(page)) {
> -			if (PageTransHuge(page)) {
> -				lock_page(page);
> -				/*
> -				 * Any THP page found by follow_page_mask() may
> -				 * have gotten split before reaching
> -				 * munlock_vma_page(), so we need to recompute
> -				 * the page_mask here.
> -				 */
> -				page_mask = munlock_vma_page(page);
> -				unlock_page(page);
> -				put_page(page); /* follow_page_mask() */
> -			} else {
> -				/*
> -				 * Non-huge pages are handled in batches via
> -				 * pagevec. The pin from follow_page_mask()
> -				 * prevents them from collapsing by THP.
> -				 */
> -				pagevec_add(&pvec, page);
> -				zone = page_zone(page);
> -				zoneid = page_zone_id(page);
> +		if (page && !IS_ERR(page) && !PageTransCompound(page)) {
> +			/*
> +			 * Non-huge pages are handled in batches via
> +			 * pagevec. The pin from follow_page_mask()
> +			 * prevents them from collapsing by THP.
> +			 */
> +			pagevec_add(&pvec, page);
> +			zone = page_zone(page);
> +			zoneid = page_zone_id(page);
>
> -				/*
> -				 * Try to fill the rest of pagevec using fast
> -				 * pte walk. This will also update start to
> -				 * the next page to process. Then munlock the
> -				 * pagevec.
> -				 */
> -				start = __munlock_pagevec_fill(&pvec, vma,
> -						zoneid, start, end);
> -				__munlock_pagevec(&pvec, zone);
> -				goto next;
> -			}
> +			/*
> +			 * Try to fill the rest of pagevec using fast
> +			 * pte walk. This will also update start to
> +			 * the next page to process. Then munlock the
> +			 * pagevec.
> +			 */
> +			start = __munlock_pagevec_fill(&pvec, vma,
> +					zoneid, start, end);
> +			__munlock_pagevec(&pvec, zone);
> +			goto next;
>   		}
>   		/* It's a bug to munlock in the middle of a THP page */
>   		VM_BUG_ON((start >> PAGE_SHIFT) & page_mask);
>


^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 08/28] khugepaged: ignore pmd tables with THP mapped with ptes
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-05-15 12:59     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-15 12:59 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> Prepare khugepaged to see compound pages mapped with ptes. For now we
> won't collapse a pmd table with such ptes.
>
> khugepaged is subject to future rework wrt the new refcounting.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>   mm/huge_memory.c | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fa3d4f78b716..ffc30e4462c1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2653,6 +2653,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   		page = vm_normal_page(vma, _address, pteval);
>   		if (unlikely(!page))
>   			goto out_unmap;
> +
> +		/* TODO: teach khugepaged to collapse THP mapped with pte */
> +		if (PageCompound(page))
> +			goto out_unmap;
> +
>   		/*
>   		 * Record which node the original page is from and save this
>   		 * information to khugepaged_node_load[].
> @@ -2663,7 +2668,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>   		if (khugepaged_scan_abort(node))
>   			goto out_unmap;
>   		khugepaged_node_load[node]++;
> -		VM_BUG_ON_PAGE(PageCompound(page), page);
>   		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
>   			goto out_unmap;
>   		/*
>


^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 09/28] thp: rename split_huge_page_pmd() to split_huge_pmd()
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-05-15 13:08     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-15 13:08 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> We are going to decouple splitting a THP PMD from splitting the underlying
> compound page.
>
> This patch renames the split_huge_page_pmd*() functions to split_huge_pmd*()
> to reflect the fact that they split only the PMD, not the compound page.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>   arch/powerpc/mm/subpage-prot.c |  2 +-
>   arch/x86/kernel/vm86_32.c      |  6 +++++-
>   include/linux/huge_mm.h        |  8 ++------
>   mm/gup.c                       |  2 +-
>   mm/huge_memory.c               | 32 +++++++++++---------------------
>   mm/madvise.c                   |  2 +-
>   mm/memory.c                    |  2 +-
>   mm/mempolicy.c                 |  2 +-
>   mm/mprotect.c                  |  2 +-
>   mm/mremap.c                    |  2 +-
>   mm/pagewalk.c                  |  2 +-
>   11 files changed, 26 insertions(+), 36 deletions(-)
>
> diff --git a/arch/powerpc/mm/subpage-prot.c b/arch/powerpc/mm/subpage-prot.c
> index fa9fb5b4c66c..d5543514c1df 100644
> --- a/arch/powerpc/mm/subpage-prot.c
> +++ b/arch/powerpc/mm/subpage-prot.c
> @@ -135,7 +135,7 @@ static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr,
>   				  unsigned long end, struct mm_walk *walk)
>   {
>   	struct vm_area_struct *vma = walk->vma;
> -	split_huge_page_pmd(vma, addr, pmd);
> +	split_huge_pmd(vma, pmd, addr);
>   	return 0;
>   }
>
> diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
> index e8edcf52e069..883160599965 100644
> --- a/arch/x86/kernel/vm86_32.c
> +++ b/arch/x86/kernel/vm86_32.c
> @@ -182,7 +182,11 @@ static void mark_screen_rdonly(struct mm_struct *mm)
>   	if (pud_none_or_clear_bad(pud))
>   		goto out;
>   	pmd = pmd_offset(pud, 0xA0000);
> -	split_huge_page_pmd_mm(mm, 0xA0000, pmd);
> +
> +	if (pmd_trans_huge(*pmd)) {
> +		struct vm_area_struct *vma = find_vma(mm, 0xA0000);
> +		split_huge_pmd(vma, pmd, 0xA0000);
> +	}
>   	if (pmd_none_or_clear_bad(pmd))
>   		goto out;
>   	pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 44a840a53974..34bbf769d52e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -104,7 +104,7 @@ static inline int split_huge_page(struct page *page)
>   }
>   extern void __split_huge_page_pmd(struct vm_area_struct *vma,
>   		unsigned long address, pmd_t *pmd);
> -#define split_huge_page_pmd(__vma, __address, __pmd)			\
> +#define split_huge_pmd(__vma, __pmd, __address)				\
>   	do {								\
>   		pmd_t *____pmd = (__pmd);				\
>   		if (unlikely(pmd_trans_huge(*____pmd)))			\
> @@ -119,8 +119,6 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
>   		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
>   		       pmd_trans_huge(*____pmd));			\
>   	} while (0)
> -extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
> -		pmd_t *pmd);
>   #if HPAGE_PMD_ORDER >= MAX_ORDER
>   #error "hugepages can't be allocated by the buddy allocator"
>   #endif
> @@ -187,11 +185,9 @@ static inline int split_huge_page(struct page *page)
>   {
>   	return 0;
>   }
> -#define split_huge_page_pmd(__vma, __address, __pmd)	\
> -	do { } while (0)
>   #define wait_split_huge_page(__anon_vma, __pmd)	\
>   	do { } while (0)
> -#define split_huge_page_pmd_mm(__mm, __address, __pmd)	\
> +#define split_huge_pmd(__vma, __pmd, __address)	\
>   	do { } while (0)
>   static inline int hugepage_madvise(struct vm_area_struct *vma,
>   				   unsigned long *vm_flags, int advice)
> diff --git a/mm/gup.c b/mm/gup.c
> index 7334eb24f414..19e01f156abb 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -220,7 +220,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>   		if (is_huge_zero_page(page)) {
>   			spin_unlock(ptl);
>   			ret = 0;
> -			split_huge_page_pmd(vma, address, pmd);
> +			split_huge_pmd(vma, pmd, address);
>   		} else {
>   			get_page(page);
>   			spin_unlock(ptl);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ffc30e4462c1..ccbfacf07160 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1136,13 +1136,13 @@ alloc:
>
>   	if (unlikely(!new_page)) {
>   		if (!page) {
> -			split_huge_page_pmd(vma, address, pmd);
> +			split_huge_pmd(vma, pmd, address);
>   			ret |= VM_FAULT_FALLBACK;
>   		} else {
>   			ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
>   					pmd, orig_pmd, page, haddr);
>   			if (ret & VM_FAULT_OOM) {
> -				split_huge_page(page);
> +				split_huge_pmd(vma, pmd, address);
>   				ret |= VM_FAULT_FALLBACK;
>   			}
>   			put_user_huge_page(page);
> @@ -1155,10 +1155,10 @@ alloc:
>   					&memcg, true))) {
>   		put_page(new_page);
>   		if (page) {
> -			split_huge_page(page);
> +			split_huge_pmd(vma, pmd, address);
>   			put_user_huge_page(page);
>   		} else
> -			split_huge_page_pmd(vma, address, pmd);
> +			split_huge_pmd(vma, pmd, address);
>   		ret |= VM_FAULT_FALLBACK;
>   		count_vm_event(THP_FAULT_FALLBACK);
>   		goto out;
> @@ -2985,17 +2985,7 @@ again:
>   		goto again;
>   }
>
> -void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
> -		pmd_t *pmd)
> -{
> -	struct vm_area_struct *vma;
> -
> -	vma = find_vma(mm, address);
> -	BUG_ON(vma == NULL);
> -	split_huge_page_pmd(vma, address, pmd);
> -}
> -
> -static void split_huge_page_address(struct mm_struct *mm,
> +static void split_huge_pmd_address(struct vm_area_struct *vma,
>   				    unsigned long address)
>   {
>   	pgd_t *pgd;
> @@ -3004,7 +2994,7 @@ static void split_huge_page_address(struct mm_struct *mm,
>
>   	VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));
>
> -	pgd = pgd_offset(mm, address);
> +	pgd = pgd_offset(vma->vm_mm, address);
>   	if (!pgd_present(*pgd))
>   		return;
>
> @@ -3013,13 +3003,13 @@ static void split_huge_page_address(struct mm_struct *mm,
>   		return;
>
>   	pmd = pmd_offset(pud, address);
> -	if (!pmd_present(*pmd))
> +	if (!pmd_present(*pmd) || !pmd_trans_huge(*pmd))
>   		return;
>   	/*
>   	 * Caller holds the mmap_sem write mode, so a huge pmd cannot
>   	 * materialize from under us.
>   	 */
> -	split_huge_page_pmd_mm(mm, address, pmd);
> +	__split_huge_page_pmd(vma, address, pmd);
>   }
>
>   void __vma_adjust_trans_huge(struct vm_area_struct *vma,
> @@ -3035,7 +3025,7 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
>   	if (start & ~HPAGE_PMD_MASK &&
>   	    (start & HPAGE_PMD_MASK) >= vma->vm_start &&
>   	    (start & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
> -		split_huge_page_address(vma->vm_mm, start);
> +		split_huge_pmd_address(vma, start);
>
>   	/*
>   	 * If the new end address isn't hpage aligned and it could
> @@ -3045,7 +3035,7 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
>   	if (end & ~HPAGE_PMD_MASK &&
>   	    (end & HPAGE_PMD_MASK) >= vma->vm_start &&
>   	    (end & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
> -		split_huge_page_address(vma->vm_mm, end);
> +		split_huge_pmd_address(vma, end);
>
>   	/*
>   	 * If we're also updating the vma->vm_next->vm_start, if the new
> @@ -3059,6 +3049,6 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
>   		if (nstart & ~HPAGE_PMD_MASK &&
>   		    (nstart & HPAGE_PMD_MASK) >= next->vm_start &&
>   		    (nstart & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= next->vm_end)
> -			split_huge_page_address(next->vm_mm, nstart);
> +			split_huge_pmd_address(next, nstart);
>   	}
>   }
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 22b86daf6b94..f5a81ca0dca7 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -281,7 +281,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>   	next = pmd_addr_end(addr, end);
>   	if (pmd_trans_huge(*pmd)) {
>   		if (next - addr != HPAGE_PMD_SIZE)
> -			split_huge_page_pmd(vma, addr, pmd);
> +			split_huge_pmd(vma, pmd, addr);
>   		else if (!madvise_free_huge_pmd(tlb, vma, pmd, addr))
>   			goto next;
>   		/* fall through */
> diff --git a/mm/memory.c b/mm/memory.c
> index 8bbd3f88544b..61e7ed722760 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1201,7 +1201,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
>   					BUG();
>   				}
>   #endif
> -				split_huge_page_pmd(vma, addr, pmd);
> +				split_huge_pmd(vma, pmd, addr);
>   			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
>   				goto next;
>   			/* fall through */
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 8badb84c013e..aac490fdc91f 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -493,7 +493,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
>   	pte_t *pte;
>   	spinlock_t *ptl;
>
> -	split_huge_page_pmd(vma, addr, pmd);
> +	split_huge_pmd(vma, pmd, addr);
>   	if (pmd_trans_unstable(pmd))
>   		return 0;
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 88584838e704..714d2fbbaafd 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -158,7 +158,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>
>   		if (pmd_trans_huge(*pmd)) {
>   			if (next - addr != HPAGE_PMD_SIZE)
> -				split_huge_page_pmd(vma, addr, pmd);
> +				split_huge_pmd(vma, pmd, addr);
>   			else {
>   				int nr_ptes = change_huge_pmd(vma, pmd, addr,
>   						newprot, prot_numa);
> diff --git a/mm/mremap.c b/mm/mremap.c
> index afa3ab740d8c..3e40ea27edc4 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -208,7 +208,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>   				need_flush = true;
>   				continue;
>   			} else if (!err) {
> -				split_huge_page_pmd(vma, old_addr, old_pmd);
> +				split_huge_pmd(vma, old_pmd, old_addr);
>   			}
>   			VM_BUG_ON(pmd_trans_huge(*old_pmd));
>   		}
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 29f2f8b853ae..207244489a68 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -58,7 +58,7 @@ again:
>   		if (!walk->pte_entry)
>   			continue;
>
> -		split_huge_page_pmd_mm(walk->mm, addr, pmd);
> +		split_huge_pmd(walk->vma, pmd, addr);
>   		if (pmd_trans_unstable(pmd))
>   			goto again;
>   		err = walk_pte_range(pmd, addr, next, walk);
>


^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 10/28] mm, vmstats: new THP splitting event
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-05-15 13:10     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-15 13:10 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> The patch replaces THP_SPLIT with three events: THP_SPLIT_PAGE,
> THP_SPLIT_PAGE_FAILED and THP_SPLIT_PMD. It reflects the fact that we
> are going to be able to split a PMD without splitting the compound page
> and that split_huge_page() can fail.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Christoph Lameter <cl@linux.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>   include/linux/vm_event_item.h | 4 +++-
>   mm/huge_memory.c              | 2 +-
>   mm/vmstat.c                   | 4 +++-
>   3 files changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 2b1cef88b827..3261bfe2156a 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -69,7 +69,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>   		THP_FAULT_FALLBACK,
>   		THP_COLLAPSE_ALLOC,
>   		THP_COLLAPSE_ALLOC_FAILED,
> -		THP_SPLIT,
> +		THP_SPLIT_PAGE,
> +		THP_SPLIT_PAGE_FAILED,
> +		THP_SPLIT_PMD,
>   		THP_ZERO_PAGE_ALLOC,
>   		THP_ZERO_PAGE_ALLOC_FAILED,
>   #endif
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ccbfacf07160..be6d0e0f5050 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1961,7 +1961,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>
>   	BUG_ON(!PageSwapBacked(page));
>   	__split_huge_page(page, anon_vma, list);
> -	count_vm_event(THP_SPLIT);
> +	count_vm_event(THP_SPLIT_PAGE);
>
>   	BUG_ON(PageCompound(page));
>   out_unlock:
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 1fd0886a389f..e1c87425fe11 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -821,7 +821,9 @@ const char * const vmstat_text[] = {
>   	"thp_fault_fallback",
>   	"thp_collapse_alloc",
>   	"thp_collapse_alloc_failed",
> -	"thp_split",
> +	"thp_split_page",
> +	"thp_split_page_failed",
> +	"thp_split_pmd",
>   	"thp_zero_page_alloc",
>   	"thp_zero_page_alloc_failed",
>   #endif
>


^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 04/28] mm, thp: adjust conditions when we can reuse the page on WP fault
  2015-05-15 11:35         ` Vlastimil Babka
@ 2015-05-15 13:29           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-05-15 13:29 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Fri, May 15, 2015 at 01:35:49PM +0200, Vlastimil Babka wrote:
> On 05/15/2015 01:21 PM, Kirill A. Shutemov wrote:
> >On Fri, May 15, 2015 at 11:15:00AM +0200, Vlastimil Babka wrote:
> >>On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> >>>With the new refcounting we will be able to map the same compound page with
> >>>PTEs and PMDs. It requires adjusting the conditions under which we can
> >>>reuse the page on a write-protection fault.
> >>>
> >>>For a PTE fault we can't reuse the page if it's part of a huge page.
> >>>
> >>>For a PMD we can only reuse the page if nobody else maps the huge page or
> >>>any part of it. We could do that by checking page_mapcount() on each
> >>>sub-page, but it's expensive.
> >>>
> >>>The cheaper way is to check that page_count() is equal to 1: every mapcount
> >>>takes a page reference, so this way we can guarantee that the PMD is the
> >>>only mapping.
> >>>
> >>>This approach can give a false negative if somebody pinned the page, but
> >>>that doesn't affect correctness.
> >>>
> >>>Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >>>Tested-by: Sasha Levin <sasha.levin@oracle.com>
> >>
> >>Acked-by: Vlastimil Babka <vbabka@suse.cz>
> >>
> >>So couldn't the same trick be used in Patch 1 to avoid counting individual
> >>order-0 pages?
> >
> >Hm. You're right, we could. But is smaps that performance sensitive to
> >bother?
> 
> Well, I was nudged to optimize it when doing the shmem swap accounting
> changes there :) The user may not care about the latency of obtaining the
> smaps file contents, but since mmap_sem is held for that, the process might
> care...

Something like this?

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e04399e53965..5bc3d2b1176e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -462,6 +462,19 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
        if (young || PageReferenced(page))
                mss->referenced += size;
 
+       /*
+        * page_count(page) == 1 guarantees the page is mapped exactly once.
+        * If any subpage of the compound page is mapped with a PTE, it would
+        * elevate page_count().
+        */
+       if (page_count(page) == 1) {
+               if (dirty || PageDirty(page))
+                       mss->private_dirty += size;
+               else
+                       mss->private_clean += size;
+               return;
+       }
+
        for (i = 0; i < nr; i++, page++) {
                int mapcount = page_mapcount(page);
 
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 189+ messages in thread
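
A minimal sketch of the reuse test discussed above, for a write-protect fault on a PMD-mapped THP; the helper name is made up for illustration and this is not the actual patch:

	/*
	 * Every mapping of the compound page -- the PMD itself or a PTE of any
	 * subpage -- holds a reference on the head page, so page_count() == 1
	 * means the faulting PMD is the only mapping and there are no extra
	 * pins.  A pin only causes a false negative: we then fall back to
	 * copying the page, which is always safe.
	 */
	static bool huge_pmd_can_reuse(struct page *page)	/* hypothetical name */
	{
		VM_BUG_ON_PAGE(!PageHead(page), page);
		return page_count(page) == 1;
	}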

* Re: [PATCHv5 00/28] THP refcounting redesign
  2015-05-15  8:55   ` Vlastimil Babka
@ 2015-05-15 13:31     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-05-15 13:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Fri, May 15, 2015 at 10:55:55AM +0200, Vlastimil Babka wrote:
> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> >Hello everybody,
> >
> >Here's reworked version of my patchset. All known issues were addressed.
> >
> >The goal of patchset is to make refcounting on THP pages cheaper with
> >simpler semantics and allow the same THP compound page to be mapped with
> >PMD and PTEs. This is required to get reasonable THP-pagecache
> >implementation.
> >
> >With the new refcounting design it's much easier to protect against
> >split_huge_page(): simple reference on a page will make you the deal.
> >It makes gup_fast() implementation simpler and doesn't require
> >special-case in futex code to handle tail THP pages.
> >
> >It should improve THP utilization over the system since splitting THP in
> >one process doesn't necessary lead to splitting the page in all other
> >processes have the page mapped.
> >
> >The patchset drastically lower complexity of get_page()/put_page()
> >codepaths. I encourage reviewers look on this code before-and-after to
> >justify time budget on reviewing this patchset.
> >
> >= Changelog =
> >
> >v5:
> >   - Tested-by: Sasha Levin!™
> >   - re-split patchset in hope to improve readability;
> >   - rebased on top of page flags and ->mapping sanitizing patchset;
> >   - uncharge compound_mapcount rather than mapcount for hugetlb pages
> >     during removing from rmap;
> >   - differentiate page_mapped() from page_mapcount() for compound pages;
> >   - rework deferred_split_huge_page() to use shrinker interface;
> >   - fix race in page_remove_rmap();
> >   - get rid of __get_page_tail();
> >   - few random bug fixes;
> >v4:
> >   - fix sizes reported in smaps;
> >   - defines instead of enum for RMAP_{EXCLUSIVE,COMPOUND};
> >   - skip THP pages on munlock_vma_pages_range(): they are never mlocked;
> >   - properly handle huge zero page on FOLL_SPLIT;
> >   - fix lock_page() slow path on tail pages;
> >   - account page_get_anon_vma() fail to THP_SPLIT_PAGE_FAILED;
> >   - fix split_huge_page() on huge page with unmapped head page;
> >   - fix transfering 'write' and 'young' from pmd to ptes on split_huge_pmd;
> >   - call page_remove_rmap() in unfreeze_page under ptl.
> >
> >= Design overview =
> >
> >The main reason why we can't map THP with 4k is how refcounting on THP
> >designed. It built around two requirements:
> >
> >   - split of huge page should never fail;
> >   - we can't change interface of get_user_page();
> >
> >To be able to split huge page at any point we have to track which tail
> >page was pinned. It leads to tricky and expensive get_page() on tail pages
> >and also occupy tail_page->_mapcount.
> >
> >Most split_huge_page*() users want PMD to be split into table of PTEs and
> >don't care whether compound page is going to be split or not.
> >
> >The plan is:
> >
> >  - allow split_huge_page() to fail if the page is pinned. It's trivial to
> >    split non-pinned page and it doesn't require tail page refcounting, so
> >    tail_page->_mapcount is free to be reused.
> >
> >  - introduce new routine -- split_huge_pmd() -- to split PMD into table of
> >    PTEs. It splits only one PMD, not touching other PMDs the page is
> >    mapped with or underlying compound page. Unlike new split_huge_page(),
> >    split_huge_pmd() never fails.
> >
> >Fortunately, we have only few places where split_huge_page() is needed:
> >swap out, memory failure, migration, KSM. And all of them can handle
> >split_huge_page() failing.
> >
> >In the new scheme page->_mapcount is used to account how many times the
> >page is mapped with PTEs. We have a separate compound_mapcount() to count
> >mappings with PMD. page_mapcount() returns the sum of PTE and PMD mappings
> >of the page.
> 
> It would be very beneficial to describe the scheme in full, both before and
> after. The same goes for the Documentation patch, where you fixed what
> wasn't true anymore, but I think the picture wasn't complete before, nor is
> it now. There's the LWN article [1] which helps a lot, but we shouldn't
> rely on that exclusively.
> 
> So the full scheme should include at least:
> - where were/are pins and mapcounts stored
> - what exactly get_page()/put_page() did/does now
> - etc.
> 
> [1] https://lwn.net/Articles/619738/

Okay. Will do.
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 189+ messages in thread
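
As for what get_page()/put_page() look like under the new scheme, patch 13/28 quoted later in this thread gives the short answer: every reference, including one taken through a tail page, lands on the head page. A condensed sketch of the new fast paths (simplified from that patch, with the VM_BUG_ON sanity checks omitted):

	static inline void get_page(struct page *page)
	{
		page = compound_head(page);
		/* The caller must already hold a reference on the page. */
		atomic_inc(&page->_count);
	}

	static inline void put_page(struct page *page)
	{
		page = compound_head(page);
		if (put_page_testzero(page))
			__put_page(page);	/* slow path: free single or compound page */
	}

Tail pages are no longer refcounted on their own, which is what frees tail_page->_mapcount to count PTE mappings of subpages.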

* Re: [PATCHv5 07/28] thp, mlock: do not allow huge pages in mlocked area
  2015-05-15 12:56     ` Vlastimil Babka
@ 2015-05-15 13:41       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-05-15 13:41 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Fri, May 15, 2015 at 02:56:42PM +0200, Vlastimil Babka wrote:
> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> >With the new refcounting a THP can belong to several VMAs. This makes it
> >tricky to track THP pages when they are partially mlocked, and it can lead
> >to leaking mlocked pages into non-VM_LOCKED vmas and other problems.
> >
> >With this patch we split all pages on mlock and avoid faulting in or
> >collapsing new THP in VM_LOCKED vmas.
> >
> >I've tried an alternative approach: do not mark THP pages mlocked and keep
> >them on the normal LRUs. This way vmscan could try to split huge pages
> >under memory pressure and free up the subpages which don't belong to
> >VM_LOCKED vmas. But this is a user-visible change: it screws up the Mlocked
> >accounting reported in meminfo, so I had to leave this approach aside.
> >
> >We can bring something better later, but this should be good enough for
> >now.
> 
> I can imagine people won't be happy about losing the benefits of THPs when
> they mlock().
> How difficult would it be to support mlocked THP pages without splitting
> until something actually tries to do a partial (un)mapping, and only then do
> the split? That will support the most common case, no?

Yes, it will.

But what will we do if we fail to split the huge page on munmap()? Fail
munmap() with -EBUSY? 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 03/28] memcg: adjust to support new THP refcounting
  2015-05-15 11:18       ` Kirill A. Shutemov
@ 2015-05-15 14:57         ` Dave Hansen
  -1 siblings, 0 replies; 189+ messages in thread
From: Dave Hansen @ 2015-05-15 14:57 UTC (permalink / raw)
  To: Kirill A. Shutemov, Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 05/15/2015 04:18 AM, Kirill A. Shutemov wrote:
>> > But same question about whether it should be using hpage_nr_pages() instead
>> > of a constant.
> No. The compiler wouldn't be able to optimize HPAGE_PMD_NR away for THP=n,
> since the compound value crosses a compilation-unit boundary.

What code are you talking about here, specifically?  This?

static inline int hpage_nr_pages(struct page *page)
{
        if (unlikely(PageTransHuge(page)))
                return HPAGE_PMD_NR;
        return 1;
}


^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 03/28] memcg: adjust to support new THP refcounting
  2015-05-15 14:57         ` Dave Hansen
@ 2015-05-16 23:17           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-05-16 23:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Andrea Arcangeli, Hugh Dickins, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Fri, May 15, 2015 at 07:57:39AM -0700, Dave Hansen wrote:
> On 05/15/2015 04:18 AM, Kirill A. Shutemov wrote:
> >> > But same question about whether it should be using hpage_nr_pages() instead
> >> > of a constant.
> > No. The compiler wouldn't be able to optimize HPAGE_PMD_NR away for THP=n,
> > since the compound value crosses a compilation-unit boundary.
> 
> What code are you talking about here, specifically?  This?
> 
> static inline int hpage_nr_pages(struct page *page)
> {
>         if (unlikely(PageTransHuge(page)))
>                 return HPAGE_PMD_NR;
>         return 1;
> }

No. See for instance mem_cgroup_try_charge(). Vlastimil would like to
replace the hpage_nr_pages() call with plain HPAGE_PMD_NR.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 189+ messages in thread
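
For reference, both halves of the helper under discussion; the THP=y version is the one quoted above, while the THP=n stub is what lets the compiler fold the page count down to a compile-time 1 (reproduced from memory of include/linux/huge_mm.h of that era, so treat it as approximate):

	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	static inline int hpage_nr_pages(struct page *page)
	{
		if (unlikely(PageTransHuge(page)))
			return HPAGE_PMD_NR;	/* 512 with 4k pages and 2M PMDs */
		return 1;
	}
	#else
	#define hpage_nr_pages(x) 1		/* !THP: always a single page */
	#endif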

* Re: [PATCHv5 13/28] mm: drop tail page refcounting
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-05-18  9:48     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-18  9:48 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> Tail page refcounting is utterly complicated and painful to support.
> It also make use of ->_mapcount to account pins on tail pages. We will
> need ->_mapcount acoount PTE mappings of subpages of the compound page.
>
> The only user of tail page refcounting is THP which is marked BROKEN for
> now.
>
> Let's drop all this mess. It makes get_page() and put_pag() much simplier.

Apart from several typos, this is another place where more details 
wouldn't hurt.

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>   arch/mips/mm/gup.c            |   4 -
>   arch/powerpc/mm/hugetlbpage.c |  13 +-
>   arch/s390/mm/gup.c            |  13 +-
>   arch/sparc/mm/gup.c           |  14 +--
>   arch/x86/mm/gup.c             |   4 -
>   include/linux/mm.h            |  47 ++------
>   include/linux/mm_types.h      |  17 +--
>   mm/gup.c                      |  34 +-----
>   mm/huge_memory.c              |  41 +------
>   mm/hugetlb.c                  |   2 +-
>   mm/internal.h                 |  44 -------
>   mm/swap.c                     | 274 +++---------------------------------------
>   12 files changed, 40 insertions(+), 467 deletions(-)
>
> diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
> index 349995d19c7f..36a35115dc2e 100644
> --- a/arch/mips/mm/gup.c
> +++ b/arch/mips/mm/gup.c
> @@ -87,8 +87,6 @@ static int gup_huge_pmd(pmd_t pmd, unsigned long addr, unsigned long end,
>   	do {
>   		VM_BUG_ON(compound_head(page) != head);
>   		pages[*nr] = page;
> -		if (PageTail(page))
> -			get_huge_page_tail(page);
>   		(*nr)++;
>   		page++;
>   		refs++;
> @@ -153,8 +151,6 @@ static int gup_huge_pud(pud_t pud, unsigned long addr, unsigned long end,
>   	do {
>   		VM_BUG_ON(compound_head(page) != head);
>   		pages[*nr] = page;
> -		if (PageTail(page))
> -			get_huge_page_tail(page);
>   		(*nr)++;
>   		page++;
>   		refs++;
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index cf0464f4284f..f30ae0f7f570 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -1037,7 +1037,7 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>   {
>   	unsigned long mask;
>   	unsigned long pte_end;
> -	struct page *head, *page, *tail;
> +	struct page *head, *page;
>   	pte_t pte;
>   	int refs;
>
> @@ -1060,7 +1060,6 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>   	head = pte_page(pte);
>
>   	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
> -	tail = page;
>   	do {
>   		VM_BUG_ON(compound_head(page) != head);
>   		pages[*nr] = page;
> @@ -1082,15 +1081,5 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>   		return 0;
>   	}
>
> -	/*
> -	 * Any tail page need their mapcount reference taken before we
> -	 * return.
> -	 */
> -	while (refs--) {
> -		if (PageTail(tail))
> -			get_huge_page_tail(tail);
> -		tail++;
> -	}
> -
>   	return 1;
>   }
> diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
> index 5c586c78ca8d..dab30527ad41 100644
> --- a/arch/s390/mm/gup.c
> +++ b/arch/s390/mm/gup.c
> @@ -52,7 +52,7 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
>   		unsigned long end, int write, struct page **pages, int *nr)
>   {
>   	unsigned long mask, result;
> -	struct page *head, *page, *tail;
> +	struct page *head, *page;
>   	int refs;
>
>   	result = write ? 0 : _SEGMENT_ENTRY_PROTECT;
> @@ -64,7 +64,6 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
>   	refs = 0;
>   	head = pmd_page(pmd);
>   	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> -	tail = page;
>   	do {
>   		VM_BUG_ON(compound_head(page) != head);
>   		pages[*nr] = page;
> @@ -85,16 +84,6 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
>   		return 0;
>   	}
>
> -	/*
> -	 * Any tail page need their mapcount reference taken before we
> -	 * return.
> -	 */
> -	while (refs--) {
> -		if (PageTail(tail))
> -			get_huge_page_tail(tail);
> -		tail++;
> -	}
> -
>   	return 1;
>   }
>
> diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
> index 2e5c4fc2daa9..9091c5daa2e1 100644
> --- a/arch/sparc/mm/gup.c
> +++ b/arch/sparc/mm/gup.c
> @@ -56,8 +56,6 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
>   			put_page(head);
>   			return 0;
>   		}
> -		if (head != page)
> -			get_huge_page_tail(page);
>
>   		pages[*nr] = page;
>   		(*nr)++;
> @@ -70,7 +68,7 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
>   			unsigned long end, int write, struct page **pages,
>   			int *nr)
>   {
> -	struct page *head, *page, *tail;
> +	struct page *head, *page;
>   	int refs;
>
>   	if (!(pmd_val(pmd) & _PAGE_VALID))
> @@ -82,7 +80,6 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
>   	refs = 0;
>   	head = pmd_page(pmd);
>   	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> -	tail = page;
>   	do {
>   		VM_BUG_ON(compound_head(page) != head);
>   		pages[*nr] = page;
> @@ -103,15 +100,6 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
>   		return 0;
>   	}
>
> -	/* Any tail page need their mapcount reference taken before we
> -	 * return.
> -	 */
> -	while (refs--) {
> -		if (PageTail(tail))
> -			get_huge_page_tail(tail);
> -		tail++;
> -	}
> -
>   	return 1;
>   }
>
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index 81bf3d2af3eb..62a887a3cf50 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -137,8 +137,6 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
>   	do {
>   		VM_BUG_ON_PAGE(compound_head(page) != head, page);
>   		pages[*nr] = page;
> -		if (PageTail(page))
> -			get_huge_page_tail(page);
>   		(*nr)++;
>   		page++;
>   		refs++;
> @@ -214,8 +212,6 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
>   	do {
>   		VM_BUG_ON_PAGE(compound_head(page) != head, page);
>   		pages[*nr] = page;
> -		if (PageTail(page))
> -			get_huge_page_tail(page);
>   		(*nr)++;
>   		page++;
>   		refs++;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index efe8417360a2..dd1b5f2b1966 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -449,44 +449,9 @@ static inline int page_count(struct page *page)
>   	return atomic_read(&compound_head(page)->_count);
>   }
>
> -static inline bool __compound_tail_refcounted(struct page *page)
> -{
> -	return PageAnon(page) && !PageSlab(page) && !PageHeadHuge(page);
> -}
> -
> -/*
> - * This takes a head page as parameter and tells if the
> - * tail page reference counting can be skipped.
> - *
> - * For this to be safe, PageSlab and PageHeadHuge must remain true on
> - * any given page where they return true here, until all tail pins
> - * have been released.
> - */
> -static inline bool compound_tail_refcounted(struct page *page)
> -{
> -	VM_BUG_ON_PAGE(!PageHead(page), page);
> -	return __compound_tail_refcounted(page);
> -}
> -
> -static inline void get_huge_page_tail(struct page *page)
> -{
> -	/*
> -	 * __split_huge_page_refcount() cannot run from under us.
> -	 */
> -	VM_BUG_ON_PAGE(!PageTail(page), page);
> -	VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> -	VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
> -	if (compound_tail_refcounted(page->first_page))
> -		atomic_inc(&page->_mapcount);
> -}
> -
> -extern bool __get_page_tail(struct page *page);
> -
>   static inline void get_page(struct page *page)
>   {
> -	if (unlikely(PageTail(page)))
> -		if (likely(__get_page_tail(page)))
> -			return;
> +	page = compound_head(page);
>   	/*
>   	 * Getting a normal page or the head of a compound page
>   	 * requires to already have an elevated page->_count.
> @@ -517,7 +482,15 @@ static inline void init_page_count(struct page *page)
>   	atomic_set(&page->_count, 1);
>   }
>
> -void put_page(struct page *page);
> +void __put_page(struct page* page);
> +
> +static inline void put_page(struct page *page)
> +{
> +	page = compound_head(page);
> +	if (put_page_testzero(page))
> +		__put_page(page);
> +}
> +
>   void put_pages_list(struct list_head *pages);
>
>   void split_page(struct page *page, unsigned int order);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 590630eb59ba..126f481bb95a 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -92,20 +92,9 @@ struct page {
>
>   				union {
>   					/*
> -					 * Count of ptes mapped in
> -					 * mms, to show when page is
> -					 * mapped & limit reverse map
> -					 * searches.
> -					 *
> -					 * Used also for tail pages
> -					 * refcounting instead of
> -					 * _count. Tail pages cannot
> -					 * be mapped and keeping the
> -					 * tail page _count zero at
> -					 * all times guarantees
> -					 * get_page_unless_zero() will
> -					 * never succeed on tail
> -					 * pages.
> +					 * Count of ptes mapped in mms, to show
> +					 * when page is mapped & limit reverse
> +					 * map searches.
>   					 */
>   					atomic_t _mapcount;
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 19e01f156abb..53f9681b7b30 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -93,7 +93,7 @@ retry:
>   	}
>
>   	if (flags & FOLL_GET)
> -		get_page_foll(page);
> +		get_page(page);
>   	if (flags & FOLL_TOUCH) {
>   		if ((flags & FOLL_WRITE) &&
>   		    !pte_dirty(pte) && !PageDirty(page))
> @@ -1108,7 +1108,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>   static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>   		unsigned long end, int write, struct page **pages, int *nr)
>   {
> -	struct page *head, *page, *tail;
> +	struct page *head, *page;
>   	int refs;
>
>   	if (write && !pmd_write(orig))
> @@ -1117,7 +1117,6 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>   	refs = 0;
>   	head = pmd_page(orig);
>   	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> -	tail = page;
>   	do {
>   		VM_BUG_ON_PAGE(compound_head(page) != head, page);
>   		pages[*nr] = page;
> @@ -1138,24 +1137,13 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>   		return 0;
>   	}
>
> -	/*
> -	 * Any tail pages need their mapcount reference taken before we
> -	 * return. (This allows the THP code to bump their ref count when
> -	 * they are split into base pages).
> -	 */
> -	while (refs--) {
> -		if (PageTail(tail))
> -			get_huge_page_tail(tail);
> -		tail++;
> -	}
> -
>   	return 1;
>   }
>
>   static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>   		unsigned long end, int write, struct page **pages, int *nr)
>   {
> -	struct page *head, *page, *tail;
> +	struct page *head, *page;
>   	int refs;
>
>   	if (write && !pud_write(orig))
> @@ -1164,7 +1152,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>   	refs = 0;
>   	head = pud_page(orig);
>   	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> -	tail = page;
>   	do {
>   		VM_BUG_ON_PAGE(compound_head(page) != head, page);
>   		pages[*nr] = page;
> @@ -1185,12 +1172,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>   		return 0;
>   	}
>
> -	while (refs--) {
> -		if (PageTail(tail))
> -			get_huge_page_tail(tail);
> -		tail++;
> -	}
> -
>   	return 1;
>   }
>
> @@ -1199,7 +1180,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>   			struct page **pages, int *nr)
>   {
>   	int refs;
> -	struct page *head, *page, *tail;
> +	struct page *head, *page;
>
>   	if (write && !pgd_write(orig))
>   		return 0;
> @@ -1207,7 +1188,6 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>   	refs = 0;
>   	head = pgd_page(orig);
>   	page = head + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
> -	tail = page;
>   	do {
>   		VM_BUG_ON_PAGE(compound_head(page) != head, page);
>   		pages[*nr] = page;
> @@ -1228,12 +1208,6 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>   		return 0;
>   	}
>
> -	while (refs--) {
> -		if (PageTail(tail))
> -			get_huge_page_tail(tail);
> -		tail++;
> -	}
> -
>   	return 1;
>   }
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index f3cc576dad73..16c6c262385c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -941,37 +941,6 @@ unlock:
>   	spin_unlock(ptl);
>   }
>
> -/*
> - * Save CONFIG_DEBUG_PAGEALLOC from faulting falsely on tail pages
> - * during copy_user_huge_page()'s copy_page_rep(): in the case when
> - * the source page gets split and a tail freed before copy completes.
> - * Called under pmd_lock of checked pmd, so safe from splitting itself.
> - */
> -static void get_user_huge_page(struct page *page)
> -{
> -	if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
> -		struct page *endpage = page + HPAGE_PMD_NR;
> -
> -		atomic_add(HPAGE_PMD_NR, &page->_count);
> -		while (++page < endpage)
> -			get_huge_page_tail(page);
> -	} else {
> -		get_page(page);
> -	}
> -}
> -
> -static void put_user_huge_page(struct page *page)
> -{
> -	if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
> -		struct page *endpage = page + HPAGE_PMD_NR;
> -
> -		while (page < endpage)
> -			put_page(page++);
> -	} else {
> -		put_page(page);
> -	}
> -}
> -
>   static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>   					struct vm_area_struct *vma,
>   					unsigned long address,
> @@ -1124,7 +1093,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   		ret |= VM_FAULT_WRITE;
>   		goto out_unlock;
>   	}
> -	get_user_huge_page(page);
> +	get_page(page);
>   	spin_unlock(ptl);
>   alloc:
>   	if (transparent_hugepage_enabled(vma) &&
> @@ -1145,7 +1114,7 @@ alloc:
>   				split_huge_pmd(vma, pmd, address);
>   				ret |= VM_FAULT_FALLBACK;
>   			}
> -			put_user_huge_page(page);
> +			put_page(page);
>   		}
>   		count_vm_event(THP_FAULT_FALLBACK);
>   		goto out;
> @@ -1156,7 +1125,7 @@ alloc:
>   		put_page(new_page);
>   		if (page) {
>   			split_huge_pmd(vma, pmd, address);
> -			put_user_huge_page(page);
> +			put_page(page);
>   		} else
>   			split_huge_pmd(vma, pmd, address);
>   		ret |= VM_FAULT_FALLBACK;
> @@ -1178,7 +1147,7 @@ alloc:
>
>   	spin_lock(ptl);
>   	if (page)
> -		put_user_huge_page(page);
> +		put_page(page);
>   	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
>   		spin_unlock(ptl);
>   		mem_cgroup_cancel_charge(new_page, memcg, true);
> @@ -1263,7 +1232,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>   	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
>   	VM_BUG_ON_PAGE(!PageCompound(page), page);
>   	if (flags & FOLL_GET)
> -		get_page_foll(page);
> +		get_page(page);
>
>   out:
>   	return page;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index eb2a0430535e..f27d4edada3a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3453,7 +3453,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   same_page:
>   		if (pages) {
>   			pages[i] = mem_map_offset(page, pfn_offset);
> -			get_page_foll(pages[i]);
> +			get_page(pages[i]);
>   		}
>
>   		if (vmas)
> diff --git a/mm/internal.h b/mm/internal.h
> index a25e359a4039..98bce4d12a16 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -47,50 +47,6 @@ static inline void set_page_refcounted(struct page *page)
>   	set_page_count(page, 1);
>   }
>
> -static inline void __get_page_tail_foll(struct page *page,
> -					bool get_page_head)
> -{
> -	/*
> -	 * If we're getting a tail page, the elevated page->_count is
> -	 * required only in the head page and we will elevate the head
> -	 * page->_count and tail page->_mapcount.
> -	 *
> -	 * We elevate page_tail->_mapcount for tail pages to force
> -	 * page_tail->_count to be zero at all times to avoid getting
> -	 * false positives from get_page_unless_zero() with
> -	 * speculative page access (like in
> -	 * page_cache_get_speculative()) on tail pages.
> -	 */
> -	VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
> -	if (get_page_head)
> -		atomic_inc(&page->first_page->_count);
> -	get_huge_page_tail(page);
> -}
> -
> -/*
> - * This is meant to be called as the FOLL_GET operation of
> - * follow_page() and it must be called while holding the proper PT
> - * lock while the pte (or pmd_trans_huge) is still mapping the page.
> - */
> -static inline void get_page_foll(struct page *page)
> -{
> -	if (unlikely(PageTail(page)))
> -		/*
> -		 * This is safe only because
> -		 * __split_huge_page_refcount() can't run under
> -		 * get_page_foll() because we hold the proper PT lock.
> -		 */
> -		__get_page_tail_foll(page, true);
> -	else {
> -		/*
> -		 * Getting a normal page or the head of a compound page
> -		 * requires to already have an elevated page->_count.
> -		 */
> -		VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
> -		atomic_inc(&page->_count);
> -	}
> -}
> -
>   extern unsigned long highest_memmap_pfn;
>
>   /*
> diff --git a/mm/swap.c b/mm/swap.c
> index 8773de093171..39166c05e5f3 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -89,261 +89,14 @@ static void __put_compound_page(struct page *page)
>   	(*dtor)(page);
>   }
>
> -/**
> - * Two special cases here: we could avoid taking compound_lock_irqsave
> - * and could skip the tail refcounting(in _mapcount).
> - *
> - * 1. Hugetlbfs page:
> - *
> - *    PageHeadHuge will remain true until the compound page
> - *    is released and enters the buddy allocator, and it could
> - *    not be split by __split_huge_page_refcount().
> - *
> - *    So if we see PageHeadHuge set, and we have the tail page pin,
> - *    then we could safely put head page.
> - *
> - * 2. Slab THP page:
> - *
> - *    PG_slab is cleared before the slab frees the head page, and
> - *    tail pin cannot be the last reference left on the head page,
> - *    because the slab code is free to reuse the compound page
> - *    after a kfree/kmem_cache_free without having to check if
> - *    there's any tail pin left.  In turn all tail pinsmust be always
> - *    released while the head is still pinned by the slab code
> - *    and so we know PG_slab will be still set too.
> - *
> - *    So if we see PageSlab set, and we have the tail page pin,
> - *    then we could safely put head page.
> - */
> -static __always_inline
> -void put_unrefcounted_compound_page(struct page *page_head, struct page *page)
> -{
> -	/*
> -	 * If @page is a THP tail, we must read the tail page
> -	 * flags after the head page flags. The
> -	 * __split_huge_page_refcount side enforces write memory barriers
> -	 * between clearing PageTail and before the head page
> -	 * can be freed and reallocated.
> -	 */
> -	smp_rmb();
> -	if (likely(PageTail(page))) {
> -		/*
> -		 * __split_huge_page_refcount cannot race
> -		 * here, see the comment above this function.
> -		 */
> -		VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
> -		VM_BUG_ON_PAGE(page_mapcount(page) != 0, page);
> -		if (put_page_testzero(page_head)) {
> -			/*
> -			 * If this is the tail of a slab THP page,
> -			 * the tail pin must not be the last reference
> -			 * held on the page, because the PG_slab cannot
> -			 * be cleared before all tail pins (which skips
> -			 * the _mapcount tail refcounting) have been
> -			 * released.
> -			 *
> -			 * If this is the tail of a hugetlbfs page,
> -			 * the tail pin may be the last reference on
> -			 * the page instead, because PageHeadHuge will
> -			 * not go away until the compound page enters
> -			 * the buddy allocator.
> -			 */
> -			VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
> -			__put_compound_page(page_head);
> -		}
> -	} else
> -		/*
> -		 * __split_huge_page_refcount run before us,
> -		 * @page was a THP tail. The split @page_head
> -		 * has been freed and reallocated as slab or
> -		 * hugetlbfs page of smaller order (only
> -		 * possible if reallocated as slab on x86).
> -		 */
> -		if (put_page_testzero(page))
> -			__put_single_page(page);
> -}
> -
> -static __always_inline
> -void put_refcounted_compound_page(struct page *page_head, struct page *page)
> -{
> -	if (likely(page != page_head && get_page_unless_zero(page_head))) {
> -		unsigned long flags;
> -
> -		/*
> -		 * @page_head wasn't a dangling pointer but it may not
> -		 * be a head page anymore by the time we obtain the
> -		 * lock. That is ok as long as it can't be freed from
> -		 * under us.
> -		 */
> -		flags = compound_lock_irqsave(page_head);
> -		if (unlikely(!PageTail(page))) {
> -			/* __split_huge_page_refcount run before us */
> -			compound_unlock_irqrestore(page_head, flags);
> -			if (put_page_testzero(page_head)) {
> -				/*
> -				 * The @page_head may have been freed
> -				 * and reallocated as a compound page
> -				 * of smaller order and then freed
> -				 * again.  All we know is that it
> -				 * cannot have become: a THP page, a
> -				 * compound page of higher order, a
> -				 * tail page.  That is because we
> -				 * still hold the refcount of the
> -				 * split THP tail and page_head was
> -				 * the THP head before the split.
> -				 */
> -				if (PageHead(page_head))
> -					__put_compound_page(page_head);
> -				else
> -					__put_single_page(page_head);
> -			}
> -out_put_single:
> -			if (put_page_testzero(page))
> -				__put_single_page(page);
> -			return;
> -		}
> -		VM_BUG_ON_PAGE(page_head != page->first_page, page);
> -		/*
> -		 * We can release the refcount taken by
> -		 * get_page_unless_zero() now that
> -		 * __split_huge_page_refcount() is blocked on the
> -		 * compound_lock.
> -		 */
> -		if (put_page_testzero(page_head))
> -			VM_BUG_ON_PAGE(1, page_head);
> -		/* __split_huge_page_refcount will wait now */
> -		VM_BUG_ON_PAGE(page_mapcount(page) <= 0, page);
> -		atomic_dec(&page->_mapcount);
> -		VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page_head);
> -		VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
> -		compound_unlock_irqrestore(page_head, flags);
> -
> -		if (put_page_testzero(page_head)) {
> -			if (PageHead(page_head))
> -				__put_compound_page(page_head);
> -			else
> -				__put_single_page(page_head);
> -		}
> -	} else {
> -		/* @page_head is a dangling pointer */
> -		VM_BUG_ON_PAGE(PageTail(page), page);
> -		goto out_put_single;
> -	}
> -}
> -
> -static void put_compound_page(struct page *page)
> -{
> -	struct page *page_head;
> -
> -	/*
> -	 * We see the PageCompound set and PageTail not set, so @page maybe:
> -	 *  1. hugetlbfs head page, or
> -	 *  2. THP head page.
> -	 */
> -	if (likely(!PageTail(page))) {
> -		if (put_page_testzero(page)) {
> -			/*
> -			 * By the time all refcounts have been released
> -			 * split_huge_page cannot run anymore from under us.
> -			 */
> -			if (PageHead(page))
> -				__put_compound_page(page);
> -			else
> -				__put_single_page(page);
> -		}
> -		return;
> -	}
> -
> -	/*
> -	 * We see the PageCompound set and PageTail set, so @page maybe:
> -	 *  1. a tail hugetlbfs page, or
> -	 *  2. a tail THP page, or
> -	 *  3. a split THP page.
> -	 *
> -	 *  Case 3 is possible, as we may race with
> -	 *  __split_huge_page_refcount tearing down a THP page.
> -	 */
> -	page_head = compound_head_by_tail(page);
> -	if (!__compound_tail_refcounted(page_head))
> -		put_unrefcounted_compound_page(page_head, page);
> -	else
> -		put_refcounted_compound_page(page_head, page);
> -}
> -
> -void put_page(struct page *page)
> +void __put_page(struct page *page)
>   {
>   	if (unlikely(PageCompound(page)))
> -		put_compound_page(page);
> -	else if (put_page_testzero(page))
> +		__put_compound_page(page);
> +	else
>   		__put_single_page(page);
>   }
> -EXPORT_SYMBOL(put_page);
> -
> -/*
> - * This function is exported but must not be called by anything other
> - * than get_page(). It implements the slow path of get_page().
> - */
> -bool __get_page_tail(struct page *page)
> -{
> -	/*
> -	 * This takes care of get_page() if run on a tail page
> -	 * returned by one of the get_user_pages/follow_page variants.
> -	 * get_user_pages/follow_page itself doesn't need the compound
> -	 * lock because it runs __get_page_tail_foll() under the
> -	 * proper PT lock that already serializes against
> -	 * split_huge_page().
> -	 */
> -	unsigned long flags;
> -	bool got;
> -	struct page *page_head = compound_head(page);
> -
> -	/* Ref to put_compound_page() comment. */
> -	if (!__compound_tail_refcounted(page_head)) {
> -		smp_rmb();
> -		if (likely(PageTail(page))) {
> -			/*
> -			 * This is a hugetlbfs page or a slab
> -			 * page. __split_huge_page_refcount
> -			 * cannot race here.
> -			 */
> -			VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
> -			__get_page_tail_foll(page, true);
> -			return true;
> -		} else {
> -			/*
> -			 * __split_huge_page_refcount run
> -			 * before us, "page" was a THP
> -			 * tail. The split page_head has been
> -			 * freed and reallocated as slab or
> -			 * hugetlbfs page of smaller order
> -			 * (only possible if reallocated as
> -			 * slab on x86).
> -			 */
> -			return false;
> -		}
> -	}
> -
> -	got = false;
> -	if (likely(page != page_head && get_page_unless_zero(page_head))) {
> -		/*
> -		 * page_head wasn't a dangling pointer but it
> -		 * may not be a head page anymore by the time
> -		 * we obtain the lock. That is ok as long as it
> -		 * can't be freed from under us.
> -		 */
> -		flags = compound_lock_irqsave(page_head);
> -		/* here __split_huge_page_refcount won't run anymore */
> -		if (likely(PageTail(page))) {
> -			__get_page_tail_foll(page, false);
> -			got = true;
> -		}
> -		compound_unlock_irqrestore(page_head, flags);
> -		if (unlikely(!got))
> -			put_page(page_head);
> -	}
> -	return got;
> -}
> -EXPORT_SYMBOL(__get_page_tail);
> +EXPORT_SYMBOL(__put_page);
>
>   /**
>    * put_pages_list() - release a list of pages
> @@ -960,15 +713,6 @@ void release_pages(struct page **pages, int nr, bool cold)
>   	for (i = 0; i < nr; i++) {
>   		struct page *page = pages[i];
>
> -		if (unlikely(PageCompound(page))) {
> -			if (zone) {
> -				spin_unlock_irqrestore(&zone->lru_lock, flags);
> -				zone = NULL;
> -			}
> -			put_compound_page(page);
> -			continue;
> -		}
> -
>   		/*
>   		 * Make sure the IRQ-safe lock-holding time does not get
>   		 * excessive with a continuous string of pages from the
> @@ -979,9 +723,19 @@ void release_pages(struct page **pages, int nr, bool cold)
>   			zone = NULL;
>   		}
>
> +		page = compound_head(page);
>   		if (!put_page_testzero(page))
>   			continue;
>
> +		if (PageCompound(page)) {
> +			if (zone) {
> +				spin_unlock_irqrestore(&zone->lru_lock, flags);
> +				zone = NULL;
> +			}
> +			__put_compound_page(page);
> +			continue;
> +		}
> +
>   		if (PageLRU(page)) {
>   			struct zone *pagezone = page_zone(page);
>
>


^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 13/28] mm: drop tail page refcounting
@ 2015-05-18  9:48     ` Vlastimil Babka
  0 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-18  9:48 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> Tail page refcounting is utterly complicated and painful to support.
> It also make use of ->_mapcount to account pins on tail pages. We will
> need ->_mapcount acoount PTE mappings of subpages of the compound page.
>
> The only user of tail page refcounting is THP which is marked BROKEN for
> now.
>
> Let's drop all this mess. It makes get_page() and put_pag() much simplier.

Apart from several typos, this is another place where more details 
wouldn't hurt.

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>   arch/mips/mm/gup.c            |   4 -
>   arch/powerpc/mm/hugetlbpage.c |  13 +-
>   arch/s390/mm/gup.c            |  13 +-
>   arch/sparc/mm/gup.c           |  14 +--
>   arch/x86/mm/gup.c             |   4 -
>   include/linux/mm.h            |  47 ++------
>   include/linux/mm_types.h      |  17 +--
>   mm/gup.c                      |  34 +-----
>   mm/huge_memory.c              |  41 +------
>   mm/hugetlb.c                  |   2 +-
>   mm/internal.h                 |  44 -------
>   mm/swap.c                     | 274 +++---------------------------------------
>   12 files changed, 40 insertions(+), 467 deletions(-)
>
> diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
> index 349995d19c7f..36a35115dc2e 100644
> --- a/arch/mips/mm/gup.c
> +++ b/arch/mips/mm/gup.c
> @@ -87,8 +87,6 @@ static int gup_huge_pmd(pmd_t pmd, unsigned long addr, unsigned long end,
>   	do {
>   		VM_BUG_ON(compound_head(page) != head);
>   		pages[*nr] = page;
> -		if (PageTail(page))
> -			get_huge_page_tail(page);
>   		(*nr)++;
>   		page++;
>   		refs++;
> @@ -153,8 +151,6 @@ static int gup_huge_pud(pud_t pud, unsigned long addr, unsigned long end,
>   	do {
>   		VM_BUG_ON(compound_head(page) != head);
>   		pages[*nr] = page;
> -		if (PageTail(page))
> -			get_huge_page_tail(page);
>   		(*nr)++;
>   		page++;
>   		refs++;
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index cf0464f4284f..f30ae0f7f570 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -1037,7 +1037,7 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>   {
>   	unsigned long mask;
>   	unsigned long pte_end;
> -	struct page *head, *page, *tail;
> +	struct page *head, *page;
>   	pte_t pte;
>   	int refs;
>
> @@ -1060,7 +1060,6 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>   	head = pte_page(pte);
>
>   	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
> -	tail = page;
>   	do {
>   		VM_BUG_ON(compound_head(page) != head);
>   		pages[*nr] = page;
> @@ -1082,15 +1081,5 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>   		return 0;
>   	}
>
> -	/*
> -	 * Any tail page need their mapcount reference taken before we
> -	 * return.
> -	 */
> -	while (refs--) {
> -		if (PageTail(tail))
> -			get_huge_page_tail(tail);
> -		tail++;
> -	}
> -
>   	return 1;
>   }
> diff --git a/arch/s390/mm/gup.c b/arch/s390/mm/gup.c
> index 5c586c78ca8d..dab30527ad41 100644
> --- a/arch/s390/mm/gup.c
> +++ b/arch/s390/mm/gup.c
> @@ -52,7 +52,7 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
>   		unsigned long end, int write, struct page **pages, int *nr)
>   {
>   	unsigned long mask, result;
> -	struct page *head, *page, *tail;
> +	struct page *head, *page;
>   	int refs;
>
>   	result = write ? 0 : _SEGMENT_ENTRY_PROTECT;
> @@ -64,7 +64,6 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
>   	refs = 0;
>   	head = pmd_page(pmd);
>   	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> -	tail = page;
>   	do {
>   		VM_BUG_ON(compound_head(page) != head);
>   		pages[*nr] = page;
> @@ -85,16 +84,6 @@ static inline int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
>   		return 0;
>   	}
>
> -	/*
> -	 * Any tail page need their mapcount reference taken before we
> -	 * return.
> -	 */
> -	while (refs--) {
> -		if (PageTail(tail))
> -			get_huge_page_tail(tail);
> -		tail++;
> -	}
> -
>   	return 1;
>   }
>
> diff --git a/arch/sparc/mm/gup.c b/arch/sparc/mm/gup.c
> index 2e5c4fc2daa9..9091c5daa2e1 100644
> --- a/arch/sparc/mm/gup.c
> +++ b/arch/sparc/mm/gup.c
> @@ -56,8 +56,6 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned long addr,
>   			put_page(head);
>   			return 0;
>   		}
> -		if (head != page)
> -			get_huge_page_tail(page);
>
>   		pages[*nr] = page;
>   		(*nr)++;
> @@ -70,7 +68,7 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
>   			unsigned long end, int write, struct page **pages,
>   			int *nr)
>   {
> -	struct page *head, *page, *tail;
> +	struct page *head, *page;
>   	int refs;
>
>   	if (!(pmd_val(pmd) & _PAGE_VALID))
> @@ -82,7 +80,6 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
>   	refs = 0;
>   	head = pmd_page(pmd);
>   	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> -	tail = page;
>   	do {
>   		VM_BUG_ON(compound_head(page) != head);
>   		pages[*nr] = page;
> @@ -103,15 +100,6 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd_t pmd, unsigned long addr,
>   		return 0;
>   	}
>
> -	/* Any tail page need their mapcount reference taken before we
> -	 * return.
> -	 */
> -	while (refs--) {
> -		if (PageTail(tail))
> -			get_huge_page_tail(tail);
> -		tail++;
> -	}
> -
>   	return 1;
>   }
>
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index 81bf3d2af3eb..62a887a3cf50 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -137,8 +137,6 @@ static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
>   	do {
>   		VM_BUG_ON_PAGE(compound_head(page) != head, page);
>   		pages[*nr] = page;
> -		if (PageTail(page))
> -			get_huge_page_tail(page);
>   		(*nr)++;
>   		page++;
>   		refs++;
> @@ -214,8 +212,6 @@ static noinline int gup_huge_pud(pud_t pud, unsigned long addr,
>   	do {
>   		VM_BUG_ON_PAGE(compound_head(page) != head, page);
>   		pages[*nr] = page;
> -		if (PageTail(page))
> -			get_huge_page_tail(page);
>   		(*nr)++;
>   		page++;
>   		refs++;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index efe8417360a2..dd1b5f2b1966 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -449,44 +449,9 @@ static inline int page_count(struct page *page)
>   	return atomic_read(&compound_head(page)->_count);
>   }
>
> -static inline bool __compound_tail_refcounted(struct page *page)
> -{
> -	return PageAnon(page) && !PageSlab(page) && !PageHeadHuge(page);
> -}
> -
> -/*
> - * This takes a head page as parameter and tells if the
> - * tail page reference counting can be skipped.
> - *
> - * For this to be safe, PageSlab and PageHeadHuge must remain true on
> - * any given page where they return true here, until all tail pins
> - * have been released.
> - */
> -static inline bool compound_tail_refcounted(struct page *page)
> -{
> -	VM_BUG_ON_PAGE(!PageHead(page), page);
> -	return __compound_tail_refcounted(page);
> -}
> -
> -static inline void get_huge_page_tail(struct page *page)
> -{
> -	/*
> -	 * __split_huge_page_refcount() cannot run from under us.
> -	 */
> -	VM_BUG_ON_PAGE(!PageTail(page), page);
> -	VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> -	VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
> -	if (compound_tail_refcounted(page->first_page))
> -		atomic_inc(&page->_mapcount);
> -}
> -
> -extern bool __get_page_tail(struct page *page);
> -
>   static inline void get_page(struct page *page)
>   {
> -	if (unlikely(PageTail(page)))
> -		if (likely(__get_page_tail(page)))
> -			return;
> +	page = compound_head(page);
>   	/*
>   	 * Getting a normal page or the head of a compound page
>   	 * requires to already have an elevated page->_count.
> @@ -517,7 +482,15 @@ static inline void init_page_count(struct page *page)
>   	atomic_set(&page->_count, 1);
>   }
>
> -void put_page(struct page *page);
> +void __put_page(struct page* page);
> +
> +static inline void put_page(struct page *page)
> +{
> +	page = compound_head(page);
> +	if (put_page_testzero(page))
> +		__put_page(page);
> +}
> +
>   void put_pages_list(struct list_head *pages);
>
>   void split_page(struct page *page, unsigned int order);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 590630eb59ba..126f481bb95a 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -92,20 +92,9 @@ struct page {
>
>   				union {
>   					/*
> -					 * Count of ptes mapped in
> -					 * mms, to show when page is
> -					 * mapped & limit reverse map
> -					 * searches.
> -					 *
> -					 * Used also for tail pages
> -					 * refcounting instead of
> -					 * _count. Tail pages cannot
> -					 * be mapped and keeping the
> -					 * tail page _count zero at
> -					 * all times guarantees
> -					 * get_page_unless_zero() will
> -					 * never succeed on tail
> -					 * pages.
> +					 * Count of ptes mapped in mms, to show
> +					 * when page is mapped & limit reverse
> +					 * map searches.
>   					 */
>   					atomic_t _mapcount;
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 19e01f156abb..53f9681b7b30 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -93,7 +93,7 @@ retry:
>   	}
>
>   	if (flags & FOLL_GET)
> -		get_page_foll(page);
> +		get_page(page);
>   	if (flags & FOLL_TOUCH) {
>   		if ((flags & FOLL_WRITE) &&
>   		    !pte_dirty(pte) && !PageDirty(page))
> @@ -1108,7 +1108,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>   static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>   		unsigned long end, int write, struct page **pages, int *nr)
>   {
> -	struct page *head, *page, *tail;
> +	struct page *head, *page;
>   	int refs;
>
>   	if (write && !pmd_write(orig))
> @@ -1117,7 +1117,6 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>   	refs = 0;
>   	head = pmd_page(orig);
>   	page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> -	tail = page;
>   	do {
>   		VM_BUG_ON_PAGE(compound_head(page) != head, page);
>   		pages[*nr] = page;
> @@ -1138,24 +1137,13 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>   		return 0;
>   	}
>
> -	/*
> -	 * Any tail pages need their mapcount reference taken before we
> -	 * return. (This allows the THP code to bump their ref count when
> -	 * they are split into base pages).
> -	 */
> -	while (refs--) {
> -		if (PageTail(tail))
> -			get_huge_page_tail(tail);
> -		tail++;
> -	}
> -
>   	return 1;
>   }
>
>   static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>   		unsigned long end, int write, struct page **pages, int *nr)
>   {
> -	struct page *head, *page, *tail;
> +	struct page *head, *page;
>   	int refs;
>
>   	if (write && !pud_write(orig))
> @@ -1164,7 +1152,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>   	refs = 0;
>   	head = pud_page(orig);
>   	page = head + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> -	tail = page;
>   	do {
>   		VM_BUG_ON_PAGE(compound_head(page) != head, page);
>   		pages[*nr] = page;
> @@ -1185,12 +1172,6 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>   		return 0;
>   	}
>
> -	while (refs--) {
> -		if (PageTail(tail))
> -			get_huge_page_tail(tail);
> -		tail++;
> -	}
> -
>   	return 1;
>   }
>
> @@ -1199,7 +1180,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>   			struct page **pages, int *nr)
>   {
>   	int refs;
> -	struct page *head, *page, *tail;
> +	struct page *head, *page;
>
>   	if (write && !pgd_write(orig))
>   		return 0;
> @@ -1207,7 +1188,6 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>   	refs = 0;
>   	head = pgd_page(orig);
>   	page = head + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
> -	tail = page;
>   	do {
>   		VM_BUG_ON_PAGE(compound_head(page) != head, page);
>   		pages[*nr] = page;
> @@ -1228,12 +1208,6 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>   		return 0;
>   	}
>
> -	while (refs--) {
> -		if (PageTail(tail))
> -			get_huge_page_tail(tail);
> -		tail++;
> -	}
> -
>   	return 1;
>   }
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index f3cc576dad73..16c6c262385c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -941,37 +941,6 @@ unlock:
>   	spin_unlock(ptl);
>   }
>
> -/*
> - * Save CONFIG_DEBUG_PAGEALLOC from faulting falsely on tail pages
> - * during copy_user_huge_page()'s copy_page_rep(): in the case when
> - * the source page gets split and a tail freed before copy completes.
> - * Called under pmd_lock of checked pmd, so safe from splitting itself.
> - */
> -static void get_user_huge_page(struct page *page)
> -{
> -	if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
> -		struct page *endpage = page + HPAGE_PMD_NR;
> -
> -		atomic_add(HPAGE_PMD_NR, &page->_count);
> -		while (++page < endpage)
> -			get_huge_page_tail(page);
> -	} else {
> -		get_page(page);
> -	}
> -}
> -
> -static void put_user_huge_page(struct page *page)
> -{
> -	if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
> -		struct page *endpage = page + HPAGE_PMD_NR;
> -
> -		while (page < endpage)
> -			put_page(page++);
> -	} else {
> -		put_page(page);
> -	}
> -}
> -
>   static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>   					struct vm_area_struct *vma,
>   					unsigned long address,
> @@ -1124,7 +1093,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   		ret |= VM_FAULT_WRITE;
>   		goto out_unlock;
>   	}
> -	get_user_huge_page(page);
> +	get_page(page);
>   	spin_unlock(ptl);
>   alloc:
>   	if (transparent_hugepage_enabled(vma) &&
> @@ -1145,7 +1114,7 @@ alloc:
>   				split_huge_pmd(vma, pmd, address);
>   				ret |= VM_FAULT_FALLBACK;
>   			}
> -			put_user_huge_page(page);
> +			put_page(page);
>   		}
>   		count_vm_event(THP_FAULT_FALLBACK);
>   		goto out;
> @@ -1156,7 +1125,7 @@ alloc:
>   		put_page(new_page);
>   		if (page) {
>   			split_huge_pmd(vma, pmd, address);
> -			put_user_huge_page(page);
> +			put_page(page);
>   		} else
>   			split_huge_pmd(vma, pmd, address);
>   		ret |= VM_FAULT_FALLBACK;
> @@ -1178,7 +1147,7 @@ alloc:
>
>   	spin_lock(ptl);
>   	if (page)
> -		put_user_huge_page(page);
> +		put_page(page);
>   	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
>   		spin_unlock(ptl);
>   		mem_cgroup_cancel_charge(new_page, memcg, true);
> @@ -1263,7 +1232,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>   	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
>   	VM_BUG_ON_PAGE(!PageCompound(page), page);
>   	if (flags & FOLL_GET)
> -		get_page_foll(page);
> +		get_page(page);
>
>   out:
>   	return page;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index eb2a0430535e..f27d4edada3a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3453,7 +3453,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>   same_page:
>   		if (pages) {
>   			pages[i] = mem_map_offset(page, pfn_offset);
> -			get_page_foll(pages[i]);
> +			get_page(pages[i]);
>   		}
>
>   		if (vmas)
> diff --git a/mm/internal.h b/mm/internal.h
> index a25e359a4039..98bce4d12a16 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -47,50 +47,6 @@ static inline void set_page_refcounted(struct page *page)
>   	set_page_count(page, 1);
>   }
>
> -static inline void __get_page_tail_foll(struct page *page,
> -					bool get_page_head)
> -{
> -	/*
> -	 * If we're getting a tail page, the elevated page->_count is
> -	 * required only in the head page and we will elevate the head
> -	 * page->_count and tail page->_mapcount.
> -	 *
> -	 * We elevate page_tail->_mapcount for tail pages to force
> -	 * page_tail->_count to be zero at all times to avoid getting
> -	 * false positives from get_page_unless_zero() with
> -	 * speculative page access (like in
> -	 * page_cache_get_speculative()) on tail pages.
> -	 */
> -	VM_BUG_ON_PAGE(atomic_read(&page->first_page->_count) <= 0, page);
> -	if (get_page_head)
> -		atomic_inc(&page->first_page->_count);
> -	get_huge_page_tail(page);
> -}
> -
> -/*
> - * This is meant to be called as the FOLL_GET operation of
> - * follow_page() and it must be called while holding the proper PT
> - * lock while the pte (or pmd_trans_huge) is still mapping the page.
> - */
> -static inline void get_page_foll(struct page *page)
> -{
> -	if (unlikely(PageTail(page)))
> -		/*
> -		 * This is safe only because
> -		 * __split_huge_page_refcount() can't run under
> -		 * get_page_foll() because we hold the proper PT lock.
> -		 */
> -		__get_page_tail_foll(page, true);
> -	else {
> -		/*
> -		 * Getting a normal page or the head of a compound page
> -		 * requires to already have an elevated page->_count.
> -		 */
> -		VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
> -		atomic_inc(&page->_count);
> -	}
> -}
> -
>   extern unsigned long highest_memmap_pfn;
>
>   /*
> diff --git a/mm/swap.c b/mm/swap.c
> index 8773de093171..39166c05e5f3 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -89,261 +89,14 @@ static void __put_compound_page(struct page *page)
>   	(*dtor)(page);
>   }
>
> -/**
> - * Two special cases here: we could avoid taking compound_lock_irqsave
> - * and could skip the tail refcounting(in _mapcount).
> - *
> - * 1. Hugetlbfs page:
> - *
> - *    PageHeadHuge will remain true until the compound page
> - *    is released and enters the buddy allocator, and it could
> - *    not be split by __split_huge_page_refcount().
> - *
> - *    So if we see PageHeadHuge set, and we have the tail page pin,
> - *    then we could safely put head page.
> - *
> - * 2. Slab THP page:
> - *
> - *    PG_slab is cleared before the slab frees the head page, and
> - *    tail pin cannot be the last reference left on the head page,
> - *    because the slab code is free to reuse the compound page
> - *    after a kfree/kmem_cache_free without having to check if
> - *    there's any tail pin left.  In turn all tail pinsmust be always
> - *    released while the head is still pinned by the slab code
> - *    and so we know PG_slab will be still set too.
> - *
> - *    So if we see PageSlab set, and we have the tail page pin,
> - *    then we could safely put head page.
> - */
> -static __always_inline
> -void put_unrefcounted_compound_page(struct page *page_head, struct page *page)
> -{
> -	/*
> -	 * If @page is a THP tail, we must read the tail page
> -	 * flags after the head page flags. The
> -	 * __split_huge_page_refcount side enforces write memory barriers
> -	 * between clearing PageTail and before the head page
> -	 * can be freed and reallocated.
> -	 */
> -	smp_rmb();
> -	if (likely(PageTail(page))) {
> -		/*
> -		 * __split_huge_page_refcount cannot race
> -		 * here, see the comment above this function.
> -		 */
> -		VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
> -		VM_BUG_ON_PAGE(page_mapcount(page) != 0, page);
> -		if (put_page_testzero(page_head)) {
> -			/*
> -			 * If this is the tail of a slab THP page,
> -			 * the tail pin must not be the last reference
> -			 * held on the page, because the PG_slab cannot
> -			 * be cleared before all tail pins (which skips
> -			 * the _mapcount tail refcounting) have been
> -			 * released.
> -			 *
> -			 * If this is the tail of a hugetlbfs page,
> -			 * the tail pin may be the last reference on
> -			 * the page instead, because PageHeadHuge will
> -			 * not go away until the compound page enters
> -			 * the buddy allocator.
> -			 */
> -			VM_BUG_ON_PAGE(PageSlab(page_head), page_head);
> -			__put_compound_page(page_head);
> -		}
> -	} else
> -		/*
> -		 * __split_huge_page_refcount run before us,
> -		 * @page was a THP tail. The split @page_head
> -		 * has been freed and reallocated as slab or
> -		 * hugetlbfs page of smaller order (only
> -		 * possible if reallocated as slab on x86).
> -		 */
> -		if (put_page_testzero(page))
> -			__put_single_page(page);
> -}
> -
> -static __always_inline
> -void put_refcounted_compound_page(struct page *page_head, struct page *page)
> -{
> -	if (likely(page != page_head && get_page_unless_zero(page_head))) {
> -		unsigned long flags;
> -
> -		/*
> -		 * @page_head wasn't a dangling pointer but it may not
> -		 * be a head page anymore by the time we obtain the
> -		 * lock. That is ok as long as it can't be freed from
> -		 * under us.
> -		 */
> -		flags = compound_lock_irqsave(page_head);
> -		if (unlikely(!PageTail(page))) {
> -			/* __split_huge_page_refcount run before us */
> -			compound_unlock_irqrestore(page_head, flags);
> -			if (put_page_testzero(page_head)) {
> -				/*
> -				 * The @page_head may have been freed
> -				 * and reallocated as a compound page
> -				 * of smaller order and then freed
> -				 * again.  All we know is that it
> -				 * cannot have become: a THP page, a
> -				 * compound page of higher order, a
> -				 * tail page.  That is because we
> -				 * still hold the refcount of the
> -				 * split THP tail and page_head was
> -				 * the THP head before the split.
> -				 */
> -				if (PageHead(page_head))
> -					__put_compound_page(page_head);
> -				else
> -					__put_single_page(page_head);
> -			}
> -out_put_single:
> -			if (put_page_testzero(page))
> -				__put_single_page(page);
> -			return;
> -		}
> -		VM_BUG_ON_PAGE(page_head != page->first_page, page);
> -		/*
> -		 * We can release the refcount taken by
> -		 * get_page_unless_zero() now that
> -		 * __split_huge_page_refcount() is blocked on the
> -		 * compound_lock.
> -		 */
> -		if (put_page_testzero(page_head))
> -			VM_BUG_ON_PAGE(1, page_head);
> -		/* __split_huge_page_refcount will wait now */
> -		VM_BUG_ON_PAGE(page_mapcount(page) <= 0, page);
> -		atomic_dec(&page->_mapcount);
> -		VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page_head);
> -		VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
> -		compound_unlock_irqrestore(page_head, flags);
> -
> -		if (put_page_testzero(page_head)) {
> -			if (PageHead(page_head))
> -				__put_compound_page(page_head);
> -			else
> -				__put_single_page(page_head);
> -		}
> -	} else {
> -		/* @page_head is a dangling pointer */
> -		VM_BUG_ON_PAGE(PageTail(page), page);
> -		goto out_put_single;
> -	}
> -}
> -
> -static void put_compound_page(struct page *page)
> -{
> -	struct page *page_head;
> -
> -	/*
> -	 * We see the PageCompound set and PageTail not set, so @page maybe:
> -	 *  1. hugetlbfs head page, or
> -	 *  2. THP head page.
> -	 */
> -	if (likely(!PageTail(page))) {
> -		if (put_page_testzero(page)) {
> -			/*
> -			 * By the time all refcounts have been released
> -			 * split_huge_page cannot run anymore from under us.
> -			 */
> -			if (PageHead(page))
> -				__put_compound_page(page);
> -			else
> -				__put_single_page(page);
> -		}
> -		return;
> -	}
> -
> -	/*
> -	 * We see the PageCompound set and PageTail set, so @page maybe:
> -	 *  1. a tail hugetlbfs page, or
> -	 *  2. a tail THP page, or
> -	 *  3. a split THP page.
> -	 *
> -	 *  Case 3 is possible, as we may race with
> -	 *  __split_huge_page_refcount tearing down a THP page.
> -	 */
> -	page_head = compound_head_by_tail(page);
> -	if (!__compound_tail_refcounted(page_head))
> -		put_unrefcounted_compound_page(page_head, page);
> -	else
> -		put_refcounted_compound_page(page_head, page);
> -}
> -
> -void put_page(struct page *page)
> +void __put_page(struct page *page)
>   {
>   	if (unlikely(PageCompound(page)))
> -		put_compound_page(page);
> -	else if (put_page_testzero(page))
> +		__put_compound_page(page);
> +	else
>   		__put_single_page(page);
>   }
> -EXPORT_SYMBOL(put_page);
> -
> -/*
> - * This function is exported but must not be called by anything other
> - * than get_page(). It implements the slow path of get_page().
> - */
> -bool __get_page_tail(struct page *page)
> -{
> -	/*
> -	 * This takes care of get_page() if run on a tail page
> -	 * returned by one of the get_user_pages/follow_page variants.
> -	 * get_user_pages/follow_page itself doesn't need the compound
> -	 * lock because it runs __get_page_tail_foll() under the
> -	 * proper PT lock that already serializes against
> -	 * split_huge_page().
> -	 */
> -	unsigned long flags;
> -	bool got;
> -	struct page *page_head = compound_head(page);
> -
> -	/* Ref to put_compound_page() comment. */
> -	if (!__compound_tail_refcounted(page_head)) {
> -		smp_rmb();
> -		if (likely(PageTail(page))) {
> -			/*
> -			 * This is a hugetlbfs page or a slab
> -			 * page. __split_huge_page_refcount
> -			 * cannot race here.
> -			 */
> -			VM_BUG_ON_PAGE(!PageHead(page_head), page_head);
> -			__get_page_tail_foll(page, true);
> -			return true;
> -		} else {
> -			/*
> -			 * __split_huge_page_refcount run
> -			 * before us, "page" was a THP
> -			 * tail. The split page_head has been
> -			 * freed and reallocated as slab or
> -			 * hugetlbfs page of smaller order
> -			 * (only possible if reallocated as
> -			 * slab on x86).
> -			 */
> -			return false;
> -		}
> -	}
> -
> -	got = false;
> -	if (likely(page != page_head && get_page_unless_zero(page_head))) {
> -		/*
> -		 * page_head wasn't a dangling pointer but it
> -		 * may not be a head page anymore by the time
> -		 * we obtain the lock. That is ok as long as it
> -		 * can't be freed from under us.
> -		 */
> -		flags = compound_lock_irqsave(page_head);
> -		/* here __split_huge_page_refcount won't run anymore */
> -		if (likely(PageTail(page))) {
> -			__get_page_tail_foll(page, false);
> -			got = true;
> -		}
> -		compound_unlock_irqrestore(page_head, flags);
> -		if (unlikely(!got))
> -			put_page(page_head);
> -	}
> -	return got;
> -}
> -EXPORT_SYMBOL(__get_page_tail);
> +EXPORT_SYMBOL(__put_page);
>
>   /**
>    * put_pages_list() - release a list of pages
> @@ -960,15 +713,6 @@ void release_pages(struct page **pages, int nr, bool cold)
>   	for (i = 0; i < nr; i++) {
>   		struct page *page = pages[i];
>
> -		if (unlikely(PageCompound(page))) {
> -			if (zone) {
> -				spin_unlock_irqrestore(&zone->lru_lock, flags);
> -				zone = NULL;
> -			}
> -			put_compound_page(page);
> -			continue;
> -		}
> -
>   		/*
>   		 * Make sure the IRQ-safe lock-holding time does not get
>   		 * excessive with a continuous string of pages from the
> @@ -979,9 +723,19 @@ void release_pages(struct page **pages, int nr, bool cold)
>   			zone = NULL;
>   		}
>
> +		page = compound_head(page);
>   		if (!put_page_testzero(page))
>   			continue;
>
> +		if (PageCompound(page)) {
> +			if (zone) {
> +				spin_unlock_irqrestore(&zone->lru_lock, flags);
> +				zone = NULL;
> +			}
> +			__put_compound_page(page);
> +			continue;
> +		}
> +
>   		if (PageLRU(page)) {
>   			struct zone *pagezone = page_zone(page);
>
>
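
For illustration only (this snippet is not part of the patch): with the
reworked get_page()/put_page() above, a reference taken through any subpage
of a compound page always lands on the head page's _count, so callers need
no tail-page special casing. A minimal sketch, assuming <linux/mm.h> with
this patch applied; the function name is made up for the example:

	#include <linux/mm.h>

	static void pin_subpage_sketch(struct page *subpage)
	{
		get_page(subpage);	/* pins compound_head(subpage) */
		/* ... use the page; it cannot be freed under us ... */
		put_page(subpage);	/* drops the same head reference */
	}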

^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 14/28] futex, thp: remove special case for THP in get_futex_key
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-05-18 11:49     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-18 11:49 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> With the new THP refcounting, we don't need tricks to stabilize a huge
> page. If we've got a reference to a tail page, it can't be split under us.
>
> This patch effectively reverts a5b338f2b0b1.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>
> ---
>   kernel/futex.c | 61 ++++++++++++----------------------------------------------
>   1 file changed, 12 insertions(+), 49 deletions(-)
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index f4d8a85641ed..cf0192e60ef9 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -399,7 +399,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
>   {
>   	unsigned long address = (unsigned long)uaddr;
>   	struct mm_struct *mm = current->mm;
> -	struct page *page, *page_head;
> +	struct page *page;
>   	int err, ro = 0;
>
>   	/*
> @@ -442,46 +442,9 @@ again:
>   	else
>   		err = 0;
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	page_head = page;
> -	if (unlikely(PageTail(page))) {
> -		put_page(page);
> -		/* serialize against __split_huge_page_splitting() */
> -		local_irq_disable();
> -		if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {
> -			page_head = compound_head(page);
> -			/*
> -			 * page_head is valid pointer but we must pin
> -			 * it before taking the PG_lock and/or
> -			 * PG_compound_lock. The moment we re-enable
> -			 * irqs __split_huge_page_splitting() can
> -			 * return and the head page can be freed from
> -			 * under us. We can't take the PG_lock and/or
> -			 * PG_compound_lock on a page that could be
> -			 * freed from under us.
> -			 */
> -			if (page != page_head) {
> -				get_page(page_head);
> -				put_page(page);
> -			}
> -			local_irq_enable();
> -		} else {
> -			local_irq_enable();
> -			goto again;
> -		}
> -	}
> -#else
> -	page_head = compound_head(page);
> -	if (page != page_head) {
> -		get_page(page_head);
> -		put_page(page);
> -	}

Hmm, any idea why this was there? Without THP, it was already guaranteed
that get/put_page() on a tail page operates on the head page's _count, no?

> -#endif
> -
> -	lock_page(page_head);
> -
> +	lock_page(page);
>   	/*
> -	 * If page_head->mapping is NULL, then it cannot be a PageAnon
> +	 * If page->mapping is NULL, then it cannot be a PageAnon
>   	 * page; but it might be the ZERO_PAGE or in the gate area or
>   	 * in a special mapping (all cases which we are happy to fail);
>   	 * or it may have been a good file page when get_user_pages_fast
> @@ -493,12 +456,12 @@ again:
>   	 *
>   	 * The case we do have to guard against is when memory pressure made
>   	 * shmem_writepage move it from filecache to swapcache beneath us:
> -	 * an unlikely race, but we do need to retry for page_head->mapping.
> +	 * an unlikely race, but we do need to retry for page->mapping.
>   	 */
> -	if (!page_head->mapping) {
> -		int shmem_swizzled = PageSwapCache(page_head);
> -		unlock_page(page_head);
> -		put_page(page_head);
> +	if (!page->mapping) {
> +		int shmem_swizzled = PageSwapCache(page);
> +		unlock_page(page);
> +		put_page(page);
>   		if (shmem_swizzled)
>   			goto again;
>   		return -EFAULT;
> @@ -511,7 +474,7 @@ again:
>   	 * it's a read-only handle, it's expected that futexes attach to
>   	 * the object not the particular process.
>   	 */
> -	if (PageAnon(page_head)) {
> +	if (PageAnon(page)) {
>   		/*
>   		 * A RO anonymous page will never change and thus doesn't make
>   		 * sense for futex operations.
> @@ -526,15 +489,15 @@ again:
>   		key->private.address = address;
>   	} else {
>   		key->both.offset |= FUT_OFF_INODE; /* inode-based key */
> -		key->shared.inode = page_head->mapping->host;
> +		key->shared.inode = page->mapping->host;
>   		key->shared.pgoff = basepage_index(page);
>   	}
>
>   	get_futex_key_refs(key); /* implies MB (B) */
>
>   out:
> -	unlock_page(page_head);
> -	put_page(page_head);
> +	unlock_page(page);
> +	put_page(page);
>   	return err;
>   }
>
>


^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 14/28] futex, thp: remove special case for THP in get_futex_key
  2015-05-18 11:49     ` Vlastimil Babka
@ 2015-05-18 12:13       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-05-18 12:13 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Mon, May 18, 2015 at 01:49:39PM +0200, Vlastimil Babka wrote:
> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> >With the new THP refcounting, we don't need tricks to stabilize a huge
> >page. If we've got a reference to a tail page, it can't be split under us.
> >
> >This patch effectively reverts a5b338f2b0b1.
> >
> >Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >Tested-by: Sasha Levin <sasha.levin@oracle.com>
> >---
> >  kernel/futex.c | 61 ++++++++++++----------------------------------------------
> >  1 file changed, 12 insertions(+), 49 deletions(-)
> >
> >diff --git a/kernel/futex.c b/kernel/futex.c
> >index f4d8a85641ed..cf0192e60ef9 100644
> >--- a/kernel/futex.c
> >+++ b/kernel/futex.c
> >@@ -399,7 +399,7 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
> >  {
> >  	unsigned long address = (unsigned long)uaddr;
> >  	struct mm_struct *mm = current->mm;
> >-	struct page *page, *page_head;
> >+	struct page *page;
> >  	int err, ro = 0;
> >
> >  	/*
> >@@ -442,46 +442,9 @@ again:
> >  	else
> >  		err = 0;
> >
> >-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >-	page_head = page;
> >-	if (unlikely(PageTail(page))) {
> >-		put_page(page);
> >-		/* serialize against __split_huge_page_splitting() */
> >-		local_irq_disable();
> >-		if (likely(__get_user_pages_fast(address, 1, !ro, &page) == 1)) {
> >-			page_head = compound_head(page);
> >-			/*
> >-			 * page_head is valid pointer but we must pin
> >-			 * it before taking the PG_lock and/or
> >-			 * PG_compound_lock. The moment we re-enable
> >-			 * irqs __split_huge_page_splitting() can
> >-			 * return and the head page can be freed from
> >-			 * under us. We can't take the PG_lock and/or
> >-			 * PG_compound_lock on a page that could be
> >-			 * freed from under us.
> >-			 */
> >-			if (page != page_head) {
> >-				get_page(page_head);
> >-				put_page(page);
> >-			}
> >-			local_irq_enable();
> >-		} else {
> >-			local_irq_enable();
> >-			goto again;
> >-		}
> >-	}
> >-#else
> >-	page_head = compound_head(page);
> >-	if (page != page_head) {
> >-		get_page(page_head);
> >-		put_page(page);
> >-	}
> 
> Hmm, any idea why this was there? Without THP, it was already guaranteed
> that get/put_page() on a tail page operates on the head page's _count, no?

I guess it's just so we keep dealing with the same page from this point
forward. Pinning/unpinning one page but locking another could look strange.
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 15/28] ksm: prepare to new THP semantics
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-05-18 12:41     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-18 12:41 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> We don't need special code to stabilize THP. If you've got a reference
> to any subpage of a THP, it will not be split under you.
>
> New split_huge_page() also accepts tail pages: no need for special code
> to get a reference to the head page.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>   mm/ksm.c | 57 ++++++++++-----------------------------------------------
>   1 file changed, 10 insertions(+), 47 deletions(-)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index fe09f3ddc912..fb333d8188fc 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -441,20 +441,6 @@ static void break_cow(struct rmap_item *rmap_item)
>   	up_read(&mm->mmap_sem);
>   }
>
> -static struct page *page_trans_compound_anon(struct page *page)
> -{
> -	if (PageTransCompound(page)) {
> -		struct page *head = compound_head(page);
> -		/*
> -		 * head may actually be splitted and freed from under
> -		 * us but it's ok here.
> -		 */
> -		if (PageAnon(head))
> -			return head;
> -	}
> -	return NULL;
> -}
> -
>   static struct page *get_mergeable_page(struct rmap_item *rmap_item)
>   {
>   	struct mm_struct *mm = rmap_item->mm;
> @@ -470,7 +456,7 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
>   	page = follow_page(vma, addr, FOLL_GET);
>   	if (IS_ERR_OR_NULL(page))
>   		goto out;
> -	if (PageAnon(page) || page_trans_compound_anon(page)) {
> +	if (PageAnon(page)) {
>   		flush_anon_page(vma, page, addr);
>   		flush_dcache_page(page);
>   	} else {
> @@ -976,33 +962,6 @@ out:
>   	return err;
>   }
>
> -static int page_trans_compound_anon_split(struct page *page)
> -{
> -	int ret = 0;
> -	struct page *transhuge_head = page_trans_compound_anon(page);
> -	if (transhuge_head) {
> -		/* Get the reference on the head to split it. */
> -		if (get_page_unless_zero(transhuge_head)) {
> -			/*
> -			 * Recheck we got the reference while the head
> -			 * was still anonymous.
> -			 */
> -			if (PageAnon(transhuge_head))
> -				ret = split_huge_page(transhuge_head);
> -			else
> -				/*
> -				 * Retry later if split_huge_page run
> -				 * from under us.
> -				 */
> -				ret = 1;
> -			put_page(transhuge_head);
> -		} else
> -			/* Retry later if split_huge_page run from under us. */
> -			ret = 1;
> -	}
> -	return ret;
> -}
> -
>   /*
>    * try_to_merge_one_page - take two pages and merge them into one
>    * @vma: the vma that holds the pte pointing to page
> @@ -1023,9 +982,6 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
>
>   	if (!(vma->vm_flags & VM_MERGEABLE))
>   		goto out;
> -	if (PageTransCompound(page) && page_trans_compound_anon_split(page))
> -		goto out;
> -	BUG_ON(PageTransCompound(page));
>   	if (!PageAnon(page))
>   		goto out;
>
> @@ -1038,6 +994,13 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
>   	 */
>   	if (!trylock_page(page))
>   		goto out;
> +
> +	if (PageTransCompound(page)) {
> +		err = split_huge_page(page);
> +		if (err)
> +			goto out_unlock;
> +	}
> +
>   	/*
>   	 * If this anonymous page is mapped only here, its pte may need
>   	 * to be write-protected.  If it's mapped elsewhere, all of its
> @@ -1068,6 +1031,7 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
>   		}
>   	}
>
> +out_unlock:
>   	unlock_page(page);
>   out:
>   	return err;
> @@ -1620,8 +1584,7 @@ next_mm:
>   				cond_resched();
>   				continue;
>   			}
> -			if (PageAnon(*page) ||
> -			    page_trans_compound_anon(*page)) {
> +			if (PageAnon(*page)) {
>   				flush_anon_page(vma, *page, ksm_scan.address);
>   				flush_dcache_page(*page);
>   				rmap_item = get_next_rmap_item(slot,
>


^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 16/28] mm, thp: remove compound_lock
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-05-18 12:57     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-18 12:57 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> We are going to use migration entries to stabilize page counts. It means
> we don't need compound_lock() for that.

git grep says you didn't clean up enough :)

mm/memcontrol.c: * zone->lru_lock, 'splitting on pmd' and compound_lock.
mm/memcontrol.c: * compound_lock(), so we don't have to take care of races.
mm/memcontrol.c: * - compound_lock is held when nr_pages > 1

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

When that's amended,

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>   include/linux/mm.h         | 35 -----------------------------------
>   include/linux/page-flags.h | 12 +-----------
>   mm/debug.c                 |  3 ---
>   3 files changed, 1 insertion(+), 49 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index dd1b5f2b1966..dad667d99304 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -393,41 +393,6 @@ static inline int is_vmalloc_or_module_addr(const void *x)
>
>   extern void kvfree(const void *addr);
>
> -static inline void compound_lock(struct page *page)
> -{
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	VM_BUG_ON_PAGE(PageSlab(page), page);
> -	bit_spin_lock(PG_compound_lock, &page->flags);
> -#endif
> -}
> -
> -static inline void compound_unlock(struct page *page)
> -{
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	VM_BUG_ON_PAGE(PageSlab(page), page);
> -	bit_spin_unlock(PG_compound_lock, &page->flags);
> -#endif
> -}
> -
> -static inline unsigned long compound_lock_irqsave(struct page *page)
> -{
> -	unsigned long uninitialized_var(flags);
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	local_irq_save(flags);
> -	compound_lock(page);
> -#endif
> -	return flags;
> -}
> -
> -static inline void compound_unlock_irqrestore(struct page *page,
> -					      unsigned long flags)
> -{
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	compound_unlock(page);
> -	local_irq_restore(flags);
> -#endif
> -}
> -
>   /*
>    * The atomic page->_mapcount, starts from -1: so that transitions
>    * both from it and to it can be tracked, using atomic_inc_and_test
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 91b7f9b2b774..74b7cece1dfa 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -106,9 +106,6 @@ enum pageflags {
>   #ifdef CONFIG_MEMORY_FAILURE
>   	PG_hwpoison,		/* hardware poisoned page. Don't touch */
>   #endif
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	PG_compound_lock,
> -#endif
>   	__NR_PAGEFLAGS,
>
>   	/* Filesystems */
> @@ -683,12 +680,6 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
>   #define __PG_MLOCKED		0
>   #endif
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -#define __PG_COMPOUND_LOCK		(1 << PG_compound_lock)
> -#else
> -#define __PG_COMPOUND_LOCK		0
> -#endif
> -
>   /*
>    * Flags checked when a page is freed.  Pages being freed should not have
>    * these flags set.  It they are, there is a problem.
> @@ -698,8 +689,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
>   	 1 << PG_private | 1 << PG_private_2 | \
>   	 1 << PG_writeback | 1 << PG_reserved | \
>   	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
> -	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
> -	 __PG_COMPOUND_LOCK)
> +	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON )
>
>   /*
>    * Flags checked when a page is prepped for return by the page allocator.
> diff --git a/mm/debug.c b/mm/debug.c
> index 3eb3ac2fcee7..9dfcd77e7354 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -45,9 +45,6 @@ static const struct trace_print_flags pageflag_names[] = {
>   #ifdef CONFIG_MEMORY_FAILURE
>   	{1UL << PG_hwpoison,		"hwpoison"	},
>   #endif
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	{1UL << PG_compound_lock,	"compound_lock"	},
> -#endif
>   };
>
>   static void dump_flags(unsigned long flags,
>


^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 17/28] mm, thp: remove infrastructure for handling splitting PMDs
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-05-18 13:40     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-18 13:40 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> With the new refcounting we don't need to mark PMDs as splitting. Let's
> drop the code to handle this.
>
> Arch-specific code will be removed separately.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Some functions could be changed to return bool instead of int:
pmd_trans_huge_lock()
__pmd_trans_huge_lock()
move_huge_pmd()

And there are some leftover references to pmd_trans_splitting():
include/asm-generic/pgtable.h:static inline int pmd_trans_splitting(pmd_t pmd)
(for the !TRANSPARENT_HUGEPAGE config)

mm/gup.c:               if (pmd_none(pmd) || pmd_trans_splitting(pmd))
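
For reference, a minimal sketch of what the suggested bool conversion could
look like for the inline wrapper in include/linux/huge_mm.h (assuming
__pmd_trans_huge_lock() is converted to return bool as well; exact context
may differ):

	/* Returns true and takes *ptl if the pmd maps a transparent huge page. */
	static inline bool pmd_trans_huge_lock(pmd_t *pmd,
			struct vm_area_struct *vma, spinlock_t **ptl)
	{
		VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_sem), vma);
		if (pmd_trans_huge(*pmd))
			return __pmd_trans_huge_lock(pmd, vma, ptl);
		return false;
	}

Callers like the ones touched above then read naturally as
"if (pmd_trans_huge_lock(pmd, vma, &ptl)) { ... }".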


> ---
>   fs/proc/task_mmu.c            |  8 +++----
>   include/asm-generic/pgtable.h |  5 ----
>   include/linux/huge_mm.h       |  9 --------
>   mm/gup.c                      |  7 ------
>   mm/huge_memory.c              | 54 ++++++++-----------------------------------
>   mm/memcontrol.c               | 14 ++---------
>   mm/memory.c                   | 18 ++-------------
>   mm/mincore.c                  |  2 +-
>   mm/pgtable-generic.c          | 14 -----------
>   mm/rmap.c                     |  4 +---
>   10 files changed, 20 insertions(+), 115 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 95bc384ee3f7..edd63c40ed71 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -534,7 +534,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>   	pte_t *pte;
>   	spinlock_t *ptl;
>
> -	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> +	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
>   		smaps_pmd_entry(pmd, addr, walk);
>   		spin_unlock(ptl);
>   		return 0;
> @@ -799,7 +799,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>   	spinlock_t *ptl;
>   	struct page *page;
>
> -	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> +	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
>   		if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
>   			clear_soft_dirty_pmd(vma, addr, pmd);
>   			goto out;
> @@ -1112,7 +1112,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>   	pte_t *pte, *orig_pte;
>   	int err = 0;
>
> -	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> +	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
>   		int pmd_flags2;
>
>   		if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd))
> @@ -1416,7 +1416,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
>   	pte_t *orig_pte;
>   	pte_t *pte;
>
> -	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> +	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
>   		pte_t huge_pte = *(pte_t *)pmd;
>   		struct page *page;
>
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 39f1d6a2b04d..fe617b7e4be6 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -184,11 +184,6 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
>   #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>   #endif
>
> -#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
> -extern void pmdp_splitting_flush(struct vm_area_struct *vma,
> -				 unsigned long address, pmd_t *pmdp);
> -#endif
> -
>   #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
>   extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
>   				       pgtable_t pgtable);
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 47f80207782f..0382230b490f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -49,15 +49,9 @@ enum transparent_hugepage_flag {
>   #endif
>   };
>
> -enum page_check_address_pmd_flag {
> -	PAGE_CHECK_ADDRESS_PMD_FLAG,
> -	PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
> -	PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
> -};
>   extern pmd_t *page_check_address_pmd(struct page *page,
>   				     struct mm_struct *mm,
>   				     unsigned long address,
> -				     enum page_check_address_pmd_flag flag,
>   				     spinlock_t **ptl);
>   extern int pmd_freeable(pmd_t pmd);
>
> @@ -102,7 +96,6 @@ extern unsigned long transparent_hugepage_flags;
>   #define split_huge_page(page) BUILD_BUG()
>   #define split_huge_pmd(__vma, __pmd, __address) BUILD_BUG()
>
> -#define wait_split_huge_page(__anon_vma, __pmd) BUILD_BUG();
>   #if HPAGE_PMD_ORDER >= MAX_ORDER
>   #error "hugepages can't be allocated by the buddy allocator"
>   #endif
> @@ -169,8 +162,6 @@ static inline int split_huge_page(struct page *page)
>   {
>   	return 0;
>   }
> -#define wait_split_huge_page(__anon_vma, __pmd)	\
> -	do { } while (0)
>   #define split_huge_pmd(__vma, __pmd, __address)	\
>   	do { } while (0)
>   static inline int hugepage_madvise(struct vm_area_struct *vma,
> diff --git a/mm/gup.c b/mm/gup.c
> index 53f9681b7b30..0cebfa76fd0c 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -207,13 +207,6 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
>   		spin_unlock(ptl);
>   		return follow_page_pte(vma, address, pmd, flags);
>   	}
> -
> -	if (unlikely(pmd_trans_splitting(*pmd))) {
> -		spin_unlock(ptl);
> -		wait_split_huge_page(vma->anon_vma, pmd);
> -		return follow_page_pte(vma, address, pmd, flags);
> -	}
> -
>   	if (flags & FOLL_SPLIT) {
>   		int ret;
>   		page = pmd_page(*pmd);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 16c6c262385c..23181f836b62 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -889,15 +889,6 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>   		goto out_unlock;
>   	}
>
> -	if (unlikely(pmd_trans_splitting(pmd))) {
> -		/* split huge page running from under us */
> -		spin_unlock(src_ptl);
> -		spin_unlock(dst_ptl);
> -		pte_free(dst_mm, pgtable);
> -
> -		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
> -		goto out;
> -	}
>   	src_page = pmd_page(pmd);
>   	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
>   	get_page(src_page);
> @@ -1403,7 +1394,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>   	spinlock_t *ptl;
>   	int ret = 0;
>
> -	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> +	if (__pmd_trans_huge_lock(pmd, vma, &ptl)) {
>   		struct page *page;
>   		pgtable_t pgtable;
>   		pmd_t orig_pmd;
> @@ -1443,7 +1434,6 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
>   		  pmd_t *old_pmd, pmd_t *new_pmd)
>   {
>   	spinlock_t *old_ptl, *new_ptl;
> -	int ret = 0;
>   	pmd_t pmd;
>
>   	struct mm_struct *mm = vma->vm_mm;
> @@ -1452,7 +1442,7 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
>   	    (new_addr & ~HPAGE_PMD_MASK) ||
>   	    old_end - old_addr < HPAGE_PMD_SIZE ||
>   	    (new_vma->vm_flags & VM_NOHUGEPAGE))
> -		goto out;
> +		return 0;
>
>   	/*
>   	 * The destination pmd shouldn't be established, free_pgtables()
> @@ -1460,15 +1450,14 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
>   	 */
>   	if (WARN_ON(!pmd_none(*new_pmd))) {
>   		VM_BUG_ON(pmd_trans_huge(*new_pmd));
> -		goto out;
> +		return 0;
>   	}
>
>   	/*
>   	 * We don't have to worry about the ordering of src and dst
>   	 * ptlocks because exclusive mmap_sem prevents deadlock.
>   	 */
> -	ret = __pmd_trans_huge_lock(old_pmd, vma, &old_ptl);
> -	if (ret == 1) {
> +	if (__pmd_trans_huge_lock(old_pmd, vma, &old_ptl)) {
>   		new_ptl = pmd_lockptr(mm, new_pmd);
>   		if (new_ptl != old_ptl)
>   			spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> @@ -1484,9 +1473,9 @@ int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
>   		if (new_ptl != old_ptl)
>   			spin_unlock(new_ptl);
>   		spin_unlock(old_ptl);
> +		return 1;
>   	}
> -out:
> -	return ret;
> +	return 0;
>   }
>
>   /*
> @@ -1502,7 +1491,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>   	spinlock_t *ptl;
>   	int ret = 0;
>
> -	if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> +	if (__pmd_trans_huge_lock(pmd, vma, &ptl)) {
>   		pmd_t entry;
>   		bool preserve_write = prot_numa && pmd_write(*pmd);
>   		ret = 1;
> @@ -1543,17 +1532,8 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
>   		spinlock_t **ptl)
>   {
>   	*ptl = pmd_lock(vma->vm_mm, pmd);
> -	if (likely(pmd_trans_huge(*pmd))) {
> -		if (unlikely(pmd_trans_splitting(*pmd))) {
> -			spin_unlock(*ptl);
> -			wait_split_huge_page(vma->anon_vma, pmd);
> -			return -1;
> -		} else {
> -			/* Thp mapped by 'pmd' is stable, so we can
> -			 * handle it as it is. */
> -			return 1;
> -		}
> -	}
> +	if (likely(pmd_trans_huge(*pmd)))
> +		return 1;
>   	spin_unlock(*ptl);
>   	return 0;
>   }
> @@ -1569,7 +1549,6 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
>   pmd_t *page_check_address_pmd(struct page *page,
>   			      struct mm_struct *mm,
>   			      unsigned long address,
> -			      enum page_check_address_pmd_flag flag,
>   			      spinlock_t **ptl)
>   {
>   	pgd_t *pgd;
> @@ -1592,21 +1571,8 @@ pmd_t *page_check_address_pmd(struct page *page,
>   		goto unlock;
>   	if (pmd_page(*pmd) != page)
>   		goto unlock;
> -	/*
> -	 * split_vma() may create temporary aliased mappings. There is
> -	 * no risk as long as all huge pmd are found and have their
> -	 * splitting bit set before __split_huge_page_refcount
> -	 * runs. Finding the same huge pmd more than once during the
> -	 * same rmap walk is not a problem.
> -	 */
> -	if (flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
> -	    pmd_trans_splitting(*pmd))
> -		goto unlock;
> -	if (pmd_trans_huge(*pmd)) {
> -		VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
> -			  !pmd_trans_splitting(*pmd));
> +	if (pmd_trans_huge(*pmd))
>   		return pmd;
> -	}
>   unlock:
>   	spin_unlock(*ptl);
>   	return NULL;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f659d4f77138..1bc6a77067ad 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4888,7 +4888,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
>   	pte_t *pte;
>   	spinlock_t *ptl;
>
> -	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> +	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
>   		if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
>   			mc.precharge += HPAGE_PMD_NR;
>   		spin_unlock(ptl);
> @@ -5056,17 +5056,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>   	union mc_target target;
>   	struct page *page;
>
> -	/*
> -	 * We don't take compound_lock() here but no race with splitting thp
> -	 * happens because:
> -	 *  - if pmd_trans_huge_lock() returns 1, the relevant thp is not
> -	 *    under splitting, which means there's no concurrent thp split,
> -	 *  - if another thread runs into split_huge_page() just after we
> -	 *    entered this if-block, the thread must wait for page table lock
> -	 *    to be unlocked in __split_huge_page_splitting(), where the main
> -	 *    part of thp split is not executed yet.
> -	 */
> -	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> +	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
>   		if (mc.precharge < HPAGE_PMD_NR) {
>   			spin_unlock(ptl);
>   			return 0;
> diff --git a/mm/memory.c b/mm/memory.c
> index 61e7ed722760..1bad3766b00c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -565,7 +565,6 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>   {
>   	spinlock_t *ptl;
>   	pgtable_t new = pte_alloc_one(mm, address);
> -	int wait_split_huge_page;
>   	if (!new)
>   		return -ENOMEM;
>
> @@ -585,18 +584,14 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>   	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
>
>   	ptl = pmd_lock(mm, pmd);
> -	wait_split_huge_page = 0;
>   	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
>   		atomic_long_inc(&mm->nr_ptes);
>   		pmd_populate(mm, pmd, new);
>   		new = NULL;
> -	} else if (unlikely(pmd_trans_splitting(*pmd)))
> -		wait_split_huge_page = 1;
> +	}
>   	spin_unlock(ptl);
>   	if (new)
>   		pte_free(mm, new);
> -	if (wait_split_huge_page)
> -		wait_split_huge_page(vma->anon_vma, pmd);
>   	return 0;
>   }
>
> @@ -612,8 +607,7 @@ int __pte_alloc_kernel(pmd_t *pmd, unsigned long address)
>   	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
>   		pmd_populate_kernel(&init_mm, pmd, new);
>   		new = NULL;
> -	} else
> -		VM_BUG_ON(pmd_trans_splitting(*pmd));
> +	}
>   	spin_unlock(&init_mm.page_table_lock);
>   	if (new)
>   		pte_free_kernel(&init_mm, new);
> @@ -3299,14 +3293,6 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   		if (pmd_trans_huge(orig_pmd)) {
>   			unsigned int dirty = flags & FAULT_FLAG_WRITE;
>
> -			/*
> -			 * If the pmd is splitting, return and retry the
> -			 * the fault.  Alternative: wait until the split
> -			 * is done, and goto retry.
> -			 */
> -			if (pmd_trans_splitting(orig_pmd))
> -				return 0;
> -
>   			if (pmd_protnone(orig_pmd))
>   				return do_huge_pmd_numa_page(mm, vma, address,
>   							     orig_pmd, pmd);
> diff --git a/mm/mincore.c b/mm/mincore.c
> index be25efde64a4..feb867f5fdf4 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -117,7 +117,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>   	unsigned char *vec = walk->private;
>   	int nr = (end - addr) >> PAGE_SHIFT;
>
> -	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> +	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
>   		memset(vec, 1, nr);
>   		spin_unlock(ptl);
>   		goto out;
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index c25f94b33811..2fe699cedd4d 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -133,20 +133,6 @@ pmd_t pmdp_clear_flush(struct vm_area_struct *vma, unsigned long address,
>   #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>   #endif
>
> -#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
> -			  pmd_t *pmdp)
> -{
> -	pmd_t pmd = pmd_mksplitting(*pmdp);
> -	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -	set_pmd_at(vma->vm_mm, address, pmdp, pmd);
> -	/* tlb flush only to serialize against gup-fast */
> -	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> -}
> -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> -#endif
> -
>   #ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>   void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 4ca4b5cffd95..1636a96e5f71 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -737,8 +737,7 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>   		 * rmap might return false positives; we must filter
>   		 * these out using page_check_address_pmd().
>   		 */
> -		pmd = page_check_address_pmd(page, mm, address,
> -					     PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
> +		pmd = page_check_address_pmd(page, mm, address, &ptl);
>   		if (!pmd)
>   			return SWAP_AGAIN;
>
> @@ -748,7 +747,6 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>   			return SWAP_FAIL; /* To break the loop */
>   		}
>
> -		/* go ahead even if the pmd is pmd_trans_splitting() */
>   		if (pmdp_clear_flush_young_notify(vma, address, pmd))
>   			referenced++;
>
>


* Re: [PATCHv5 19/28] mm: store mapcount for compound page separately
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-05-18 14:32     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-18 14:32 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> We're going to allow mapping of individual 4k pages of a THP compound page,
> and we need a cheap way to find out how many times the compound page is
> mapped with a PMD -- compound_mapcount() does this.
>
> We use the same approach as with compound page destructor and compound
> order: use space in first tail page, ->mapping this time.
>
> page_mapcount() counts both: PTE and PMD mappings of the page.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>
> ---
>   include/linux/mm.h       | 25 ++++++++++++--
>   include/linux/mm_types.h |  1 +
>   include/linux/rmap.h     |  4 +--
>   mm/debug.c               |  5 ++-
>   mm/huge_memory.c         |  2 +-
>   mm/hugetlb.c             |  4 +--
>   mm/memory.c              |  2 +-
>   mm/migrate.c             |  2 +-
>   mm/page_alloc.c          | 14 ++++++--
>   mm/rmap.c                | 87 +++++++++++++++++++++++++++++++++++++-----------
>   10 files changed, 114 insertions(+), 32 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index dad667d99304..33cb3aa647a6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -393,6 +393,19 @@ static inline int is_vmalloc_or_module_addr(const void *x)
>
>   extern void kvfree(const void *addr);
>
> +static inline atomic_t *compound_mapcount_ptr(struct page *page)
> +{
> +	return &page[1].compound_mapcount;
> +}
> +
> +static inline int compound_mapcount(struct page *page)
> +{
> +	if (!PageCompound(page))
> +		return 0;
> +	page = compound_head(page);
> +	return atomic_read(compound_mapcount_ptr(page)) + 1;
> +}
> +
>   /*
>    * The atomic page->_mapcount, starts from -1: so that transitions
>    * both from it and to it can be tracked, using atomic_inc_and_test

What's not shown here is the implementation of page_mapcount_reset(),
which is unchanged... is that still correct for all callers?

> @@ -405,8 +418,16 @@ static inline void page_mapcount_reset(struct page *page)
>
>   static inline int page_mapcount(struct page *page)
>   {
> +	int ret;
>   	VM_BUG_ON_PAGE(PageSlab(page), page);
> -	return atomic_read(&page->_mapcount) + 1;
> +	ret = atomic_read(&page->_mapcount) + 1;
> +	/*
> +	 * Positive compound_mapcount() offsets ->_mapcount in every page by
> +	 * one. Let's substract it here.
> +	 */

This could use some more detailed explanation, or at least pointers to 
the relevant rmap functions. Also in commit message.

> +	if (compound_mapcount(page))
> +	       ret += compound_mapcount(page) - 1;

This looks like it could uselessly duplicate-inline the code for 
compound_mapcount(). It has atomics and smp_rmb() so I'm not sure if the 
compiler can just "squash it".
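
For example, the duplicate read could be avoided by caching the value in a
local variable -- a sketch of the suggestion only, not the actual rework:

	static inline int page_mapcount(struct page *page)
	{
		int ret, compound;

		VM_BUG_ON_PAGE(PageSlab(page), page);
		ret = atomic_read(&page->_mapcount) + 1;
		/* Read compound_mapcount() once and reuse the result. */
		compound = compound_mapcount(page);
		if (compound)
			ret += compound - 1;
		return ret;
	}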

On the other hand, a simple atomic read that was page_mapcount() has 
turned into multiple atomic reads and flag checks. What about the 
stability of the whole result? Are all callers ok? (maybe a later patch
deals with it).


* Re: [PATCHv5 20/28] mm: differentiate page_mapped() from page_mapcount() for compound pages
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-05-18 15:35     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-18 15:35 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> Let's define page_mapped() to be true for compound pages if any
> sub-page of the compound page is mapped (with PMD or PTE).
>
> On the other hand, page_mapcount() returns the mapcount for this particular
> small page.
>
> This will make cases like page_get_anon_vma() behave correctly once we
> allow huge pages to be mapped with PTE.
>
> Most users outside core-mm should use page_mapcount() instead of
> page_mapped().

Does "should" mean that they do that now, or just that you would like 
them to? Should there be a warning before the function then?

>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -909,7 +909,16 @@ static inline pgoff_t page_file_index(struct page *page)

(not shown in the diff)

  * Return true if this page is mapped into pagetables.
>    */

Expand the comment? Especially if you put compound_head() there.

>   static inline int page_mapped(struct page *page)

Convert to proper bool while at it?

>   {
> -	return atomic_read(&(page)->_mapcount) + compound_mapcount(page) >= 0;
> +	int i;
> +	if (likely(!PageCompound(page)))
> +		return atomic_read(&page->_mapcount) >= 0;
> +	if (compound_mapcount(page))
> +		return 1;
> +	for (i = 0; i < hpage_nr_pages(page); i++) {
> +		if (atomic_read(&page[i]._mapcount) >= 0)
> +			return 1;
> +	}
> +	return 0;
>   }
>
>   /*
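
A sketch of what the bool variant might look like (illustrative only, same
logic as the quoted code):

	static inline bool page_mapped(struct page *page)
	{
		int i;

		if (likely(!PageCompound(page)))
			return atomic_read(&page->_mapcount) >= 0;
		if (compound_mapcount(page))
			return true;
		for (i = 0; i < hpage_nr_pages(page); i++) {
			if (atomic_read(&page[i]._mapcount) >= 0)
				return true;
		}
		return false;
	}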


* Re: [PATCHv5 19/28] mm: store mapcount for compound page separately
  2015-05-18 14:32     ` Vlastimil Babka
@ 2015-05-19  3:55       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-05-19  3:55 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Mon, May 18, 2015 at 04:32:22PM +0200, Vlastimil Babka wrote:
> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> >We're going to allow mapping of individual 4k pages of THP compound and
> >we need a cheap way to find out how many time the compound page is
> >mapped with PMD -- compound_mapcount() does this.
> >
> >We use the same approach as with compound page destructor and compound
> >order: use space in first tail page, ->mapping this time.
> >
> >page_mapcount() counts both: PTE and PMD mappings of the page.
> >
> >Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >Tested-by: Sasha Levin <sasha.levin@oracle.com>
> >---
> >  include/linux/mm.h       | 25 ++++++++++++--
> >  include/linux/mm_types.h |  1 +
> >  include/linux/rmap.h     |  4 +--
> >  mm/debug.c               |  5 ++-
> >  mm/huge_memory.c         |  2 +-
> >  mm/hugetlb.c             |  4 +--
> >  mm/memory.c              |  2 +-
> >  mm/migrate.c             |  2 +-
> >  mm/page_alloc.c          | 14 ++++++--
> >  mm/rmap.c                | 87 +++++++++++++++++++++++++++++++++++++-----------
> >  10 files changed, 114 insertions(+), 32 deletions(-)
> >
> >diff --git a/include/linux/mm.h b/include/linux/mm.h
> >index dad667d99304..33cb3aa647a6 100644
> >--- a/include/linux/mm.h
> >+++ b/include/linux/mm.h
> >@@ -393,6 +393,19 @@ static inline int is_vmalloc_or_module_addr(const void *x)
> >
> >  extern void kvfree(const void *addr);
> >
> >+static inline atomic_t *compound_mapcount_ptr(struct page *page)
> >+{
> >+	return &page[1].compound_mapcount;
> >+}
> >+
> >+static inline int compound_mapcount(struct page *page)
> >+{
> >+	if (!PageCompound(page))
> >+		return 0;
> >+	page = compound_head(page);
> >+	return atomic_read(compound_mapcount_ptr(page)) + 1;
> >+}
> >+
> >  /*
> >   * The atomic page->_mapcount, starts from -1: so that transitions
> >   * both from it and to it can be tracked, using atomic_inc_and_test
> 
> What's not shown here is the implementation of page_mapcount_reset() that's
> unchanged... is that correct from all callers?

Looks like page_mapcount_reset() is mostly used to deal with PageBuddy()
and such. We don't have that kind of trick for compound_mapcount.
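
For reference, the helper just resets ->_mapcount and never touches the new
compound mapcount (quoting the existing definition for context, unchanged by
this series):

	static inline void page_mapcount_reset(struct page *page)
	{
		atomic_set(&(page)->_mapcount, -1);
	}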

> >@@ -405,8 +418,16 @@ static inline void page_mapcount_reset(struct page *page)
> >
> >  static inline int page_mapcount(struct page *page)
> >  {
> >+	int ret;
> >  	VM_BUG_ON_PAGE(PageSlab(page), page);
> >-	return atomic_read(&page->_mapcount) + 1;
> >+	ret = atomic_read(&page->_mapcount) + 1;
> >+	/*
> >+	 * Positive compound_mapcount() offsets ->_mapcount in every page by
> >+	 * one. Let's substract it here.
> >+	 */
> 
> This could use some more detailed explanation, or at least pointers to the
> relevant rmap functions. Also in commit message.

Okay. Will do.

> 
> >+	if (compound_mapcount(page))
> >+	       ret += compound_mapcount(page) - 1;
> 
> This looks like it could uselessly duplicate-inline the code for
> compound_mapcount(). It has atomics and smp_rmb() so I'm not sure if the
> compiler can just "squash it".

Good point. I'll rework this.
>
> On the other hand, a simple atomic read that was page_mapcount() has turned
> into multiple atomic reads and flag checks. What about the stability of the
> whole result? Are all callers ok? (maybe a later page deals with it).

Urghh.. I'll look into this.

-- 
 Kirill A. Shutemov

* Re: [PATCHv5 20/28] mm: differentiate page_mapped() from page_mapcount() for compound pages
  2015-05-18 15:35     ` Vlastimil Babka
@ 2015-05-19  4:00       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-05-19  4:00 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Mon, May 18, 2015 at 05:35:16PM +0200, Vlastimil Babka wrote:
> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> >Let's define page_mapped() to be true for compound pages if any
> >sub-pages of the compound page is mapped (with PMD or PTE).
> >
> >On other hand page_mapcount() return mapcount for this particular small
> >page.
> >
> >This will make cases like page_get_anon_vma() behave correctly once we
> >allow huge pages to be mapped with PTE.
> >
> >Most users outside core-mm should use page_mapcount() instead of
> >page_mapped().
> 
> Does "should" mean that they do that now, or just that you would like them
> to?

I would like them to.

> Should there be a warning before the function then?

Ok.
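
For instance, something along these lines could work as the expanded comment
and warning (wording is only a suggestion):

	/*
	 * Return true if this page is mapped into pagetables.
	 * For compound page it returns true if any subpage of compound page
	 * is mapped.
	 *
	 * NOTE: most callers outside of core mm should use page_mapcount()
	 * instead; page_mapped() only says whether any mapping exists.
	 */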

> >Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >Tested-by: Sasha Levin <sasha.levin@oracle.com>
> 
> >--- a/include/linux/mm.h
> >+++ b/include/linux/mm.h
> >@@ -909,7 +909,16 @@ static inline pgoff_t page_file_index(struct page *page)
> 
> (not shown in the diff)
> 
>  * Return true if this page is mapped into pagetables.
> >   */
> 
> Expand the comment? Especially if you put compound_head() there.

Ok.

> >  static inline int page_mapped(struct page *page)
> 
> Convert to proper bool while at it?

Ok.

-- 
 Kirill A. Shutemov

* Re: [PATCHv5 22/28] thp: implement split_huge_pmd()
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-05-19  8:25     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-19  8:25 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> The original split_huge_page() combined two operations: splitting PMDs into
> tables of PTEs and splitting the underlying compound page. This patch
> implements split_huge_pmd(), which splits a given PMD without splitting
> other PMDs this page is mapped with or the underlying compound page.
>
> Without tail page refcounting, the implementation of split_huge_pmd() is
> pretty straightforward.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>
> ---
>   include/linux/huge_mm.h |  11 ++++-
>   mm/huge_memory.c        | 108 ++++++++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 118 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 0382230b490f..b7844c73b7db 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -94,7 +94,16 @@ extern unsigned long transparent_hugepage_flags;
>
>   #define split_huge_page_to_list(page, list) BUILD_BUG()
>   #define split_huge_page(page) BUILD_BUG()
> -#define split_huge_pmd(__vma, __pmd, __address) BUILD_BUG()
> +
> +void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> +		unsigned long address);
> +
> +#define split_huge_pmd(__vma, __pmd, __address)				\
> +	do {								\
> +		pmd_t *____pmd = (__pmd);				\
> +		if (unlikely(pmd_trans_huge(*____pmd)))			\

Given that most calls to split_huge_pmd() appear to be in
if (pmd_trans_huge(...)) branches, this unlikely() seems counter-productive.
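
For example, a typical call site looks roughly like this (simplified
illustration):

	/* the caller has usually already established that the PMD is huge */
	if (pmd_trans_huge(*pmd))
		split_huge_pmd(vma, pmd, address);

so the unlikely() inside the macro hints the compiler against the branch that
is in fact the common one at such call sites.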

> +			__split_huge_pmd(__vma, __pmd, __address);	\
> +	}  while (0)
>
>   #if HPAGE_PMD_ORDER >= MAX_ORDER
>   #error "hugepages can't be allocated by the buddy allocator"
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 06adbe3f2100..5885ef8f0fad 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2522,6 +2522,114 @@ static int khugepaged(void *none)
>   	return 0;
>   }
>
> +static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> +		unsigned long haddr, pmd_t *pmd)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pgtable_t pgtable;
> +	pmd_t _pmd;
> +	int i;
> +
> +	/* leave pmd empty until pte is filled */
> +	pmdp_clear_flush_notify(vma, haddr, pmd);
> +
> +	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> +	pmd_populate(mm, &_pmd, pgtable);
> +
> +	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> +		pte_t *pte, entry;
> +		entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
> +		entry = pte_mkspecial(entry);
> +		pte = pte_offset_map(&_pmd, haddr);
> +		VM_BUG_ON(!pte_none(*pte));
> +		set_pte_at(mm, haddr, pte, entry);
> +		pte_unmap(pte);
> +	}
> +	smp_wmb(); /* make pte visible before pmd */
> +	pmd_populate(mm, pmd, pgtable);
> +	put_huge_zero_page();
> +}
> +
> +static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> +		unsigned long haddr)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct page *page;
> +	pgtable_t pgtable;
> +	pmd_t _pmd;
> +	bool young, write, last;
> +	int i;
> +
> +	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
> +	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
> +	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
> +	VM_BUG_ON(!pmd_trans_huge(*pmd));
> +
> +	count_vm_event(THP_SPLIT_PMD);
> +
> +	if (is_huge_zero_pmd(*pmd))
> +		return __split_huge_zero_page_pmd(vma, haddr, pmd);
> +
> +	page = pmd_page(*pmd);
> +	VM_BUG_ON_PAGE(!page_count(page), page);
> +	atomic_add(HPAGE_PMD_NR - 1, &page->_count);
> +	last = atomic_add_negative(-1, compound_mapcount_ptr(page));
> +	if (last)
> +		__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
> +
> +	write = pmd_write(*pmd);
> +	young = pmd_young(*pmd);
> +
> +	/* leave pmd empty until pte is filled */
> +	pmdp_clear_flush_notify(vma, haddr, pmd);
> +
> +	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> +	pmd_populate(mm, &_pmd, pgtable);
> +
> +	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> +		pte_t entry, *pte;
> +		/*
> +		 * Note that NUMA hinting access restrictions are not
> +		 * transferred to avoid any possibility of altering
> +		 * permissions across VMAs.
> +		 */
> +		entry = mk_pte(page + i, vma->vm_page_prot);
> +		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +		if (!write)
> +			entry = pte_wrprotect(entry);
> +		if (!young)
> +			entry = pte_mkold(entry);
> +		pte = pte_offset_map(&_pmd, haddr);
> +		BUG_ON(!pte_none(*pte));
> +		set_pte_at(mm, haddr, pte, entry);
> +		/*
> +		 * Positive compound_mapcount also offsets ->_mapcount of
> +		 * every subpage by one -- no need to increase mapcount when
> +		 * splitting last PMD.
> +		 */
> +		if (!last)
> +			atomic_inc(&page[i]._mapcount);
> +		pte_unmap(pte);
> +	}
> +	smp_wmb(); /* make pte visible before pmd */
> +	pmd_populate(mm, pmd, pgtable);
> +}
> +
> +void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> +		unsigned long address)
> +{
> +	spinlock_t *ptl;
> +	struct mm_struct *mm = vma->vm_mm;
> +	unsigned long haddr = address & HPAGE_PMD_MASK;
> +
> +	mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
> +	ptl = pmd_lock(mm, pmd);
> +	if (likely(pmd_trans_huge(*pmd)))

This likely is likely useless :)

> +		__split_huge_pmd_locked(vma, pmd, haddr);
> +	spin_unlock(ptl);
> +	mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
> +}
> +
>   static void split_huge_pmd_address(struct vm_area_struct *vma,
>   				    unsigned long address)
>   {
>


^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 19/28] mm: store mapcount for compound page separately
  2015-05-19  3:55       ` Kirill A. Shutemov
@ 2015-05-19  9:01         ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-19  9:01 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On 05/19/2015 05:55 AM, Kirill A. Shutemov wrote:
>>
>>> +	if (compound_mapcount(page))
>>> +	       ret += compound_mapcount(page) - 1;
>>
>> This looks like it could uselessly duplicate-inline the code for
>> compound_mapcount(). It has atomics and smp_rmb() so I'm not sure if the
>> compiler can just "squash it".
>
> Good point. I'll rework this.
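
A minimal sketch of that rework, assuming the intent is simply to read
compound_mapcount() once into a local (this is not the actual follow-up patch):

	/* read once: compound_mapcount() hides atomics and an smp_rmb() */
	int compound = compound_mapcount(page);

	if (compound)
		ret += compound - 1;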

Hm, BTW I think the same duplication of compound_head() happens in
lock_page(), where it's done by trylock_page() and then __lock_page(),
which is also in a different compilation unit, to make things worse.

I can imagine it's solvable by introducing variants of __lock_page* that 
expect to be already given a head page... if it's worth the trouble.

>>
>> On the other hand, a simple atomic read that was page_mapcount() has turned
>> into multiple atomic reads and flag checks. What about the stability of the
>> whole result? Are all callers ok? (maybe a later patch deals with it).
>
> Urghh.. I'll look into this.
>


^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 25/28] thp: reintroduce split_huge_page()
  2015-04-23 21:04   ` Kirill A. Shutemov
@ 2015-05-19 12:43     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-19 12:43 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:04 PM, Kirill A. Shutemov wrote:
> This patch adds implementation of split_huge_page() for new
> refcountings.
>
> Unlike the previous implementation, the new split_huge_page() can fail if
> somebody holds a GUP pin on the page. It also means that a pin on the page
> prevents it from being split under you. It makes the situation in
> many places much cleaner.
>
> The basic scheme of split_huge_page():
>
>    - Check that the sum of mapcounts of all subpages is equal to page_count()
>      plus one (the caller's pin). Fail with -EBUSY otherwise. This way we can
>      avoid useless PMD splits.
>
>    - Freeze the page counters by splitting all PMD and setup migration
>      PTEs.
>
>    - Re-check sum of mapcounts against page_count(). Page's counts are
>      stable now. -EBUSY if page is pinned.
>
>    - Split compound page.
>
>    - Unfreeze the page by removing migration entries.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>
> ---
>   include/linux/huge_mm.h |   7 +-
>   include/linux/pagemap.h |   9 +-
>   mm/huge_memory.c        | 322 ++++++++++++++++++++++++++++++++++++++++++++++++
>   mm/internal.h           |  26 +++-
>   mm/rmap.c               |  21 ----
>   5 files changed, 357 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index b7844c73b7db..3c0a50ed3eb8 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -92,8 +92,11 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
>
>   extern unsigned long transparent_hugepage_flags;
>
> -#define split_huge_page_to_list(page, list) BUILD_BUG()
> -#define split_huge_page(page) BUILD_BUG()
> +int split_huge_page_to_list(struct page *page, struct list_head *list);
> +static inline int split_huge_page(struct page *page)
> +{
> +	return split_huge_page_to_list(page, NULL);
> +}
>
>   void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>   		unsigned long address);
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 7c3790764795..ffbb23dbebba 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -387,10 +387,17 @@ static inline struct page *read_mapping_page(struct address_space *mapping,
>    */
>   static inline pgoff_t page_to_pgoff(struct page *page)
>   {
> +	pgoff_t pgoff;
> +
>   	if (unlikely(PageHeadHuge(page)))
>   		return page->index << compound_order(page);
> -	else
> +
> +	if (likely(!PageTransTail(page)))
>   		return page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> +
> +	pgoff = page->first_page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> +	pgoff += page - page->first_page;
> +	return pgoff;

This could use some comment, or maybe a separate preparatory patch?
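
For what it's worth, a sketch of what such a comment might say (the wording is
mine and reflects my reading of the code, not Kirill's intent):

	/*
	 * THP tail pages do not maintain a meaningful ->index of their
	 * own: take the offset of the head page (->first_page) and add
	 * the tail's position within the compound page.
	 */
	pgoff = page->first_page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
	pgoff += page - page->first_page;
	return pgoff;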

>   }
>
>   /*
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2f9e2e882bab..7ad338ab2ac8 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2704,3 +2704,325 @@ void __vma_adjust_trans_huge(struct vm_area_struct *vma,
>   			split_huge_pmd_address(next, nstart);
>   	}
>   }
> +
> +static void freeze_page_vma(struct vm_area_struct *vma, struct page *page,
> +		unsigned long address)
> +{
> +	spinlock_t *ptl;
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *pte;
> +	int i;
> +
> +	pgd = pgd_offset(vma->vm_mm, address);
> +	if (!pgd_present(*pgd))
> +		return;
> +	pud = pud_offset(pgd, address);
> +	if (!pud_present(*pud))
> +		return;
> +	pmd = pmd_offset(pud, address);
> +	ptl = pmd_lock(vma->vm_mm, pmd);
> +	if (!pmd_present(*pmd)) {
> +		spin_unlock(ptl);
> +		return;
> +	}
> +	if (pmd_trans_huge(*pmd)) {
> +		if (page == pmd_page(*pmd))
> +			__split_huge_pmd_locked(vma, pmd, address, true);
> +		spin_unlock(ptl);
> +		return;
> +	}
> +	spin_unlock(ptl);
> +
> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, address, &ptl);
> +	for (i = 0; i < HPAGE_PMD_NR; i++, address += PAGE_SIZE, page++) {
> +		pte_t entry, swp_pte;
> +		swp_entry_t swp_entry;
> +
> +		if (!pte_present(pte[i]))
> +			continue;
> +		if (page_to_pfn(page) != pte_pfn(pte[i]))
> +			continue;
> +		flush_cache_page(vma, address, page_to_pfn(page));
> +		entry = ptep_clear_flush(vma, address, pte + i);
> +		swp_entry = make_migration_entry(page, pte_write(entry));
> +		swp_pte = swp_entry_to_pte(swp_entry);
> +		if (pte_soft_dirty(entry))
> +			swp_pte = pte_swp_mksoft_dirty(swp_pte);
> +		set_pte_at(vma->vm_mm, address, pte + i, swp_pte);
> +	}
> +	pte_unmap_unlock(pte, ptl);
> +}
> +
> +static void freeze_page(struct anon_vma *anon_vma, struct page *page)
> +{
> +	struct anon_vma_chain *avc;
> +	pgoff_t pgoff = page_to_pgoff(page);
> +
> +	VM_BUG_ON_PAGE(!PageHead(page), page);
> +
> +	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff,
> +			pgoff + HPAGE_PMD_NR - 1) {
> +		unsigned long haddr;
> +
> +		haddr = __vma_address(page, avc->vma) & HPAGE_PMD_MASK;
> +		mmu_notifier_invalidate_range_start(avc->vma->vm_mm,
> +				haddr, haddr + HPAGE_PMD_SIZE);
> +		freeze_page_vma(avc->vma, page, haddr);
> +		mmu_notifier_invalidate_range_end(avc->vma->vm_mm,
> +				haddr, haddr + HPAGE_PMD_SIZE);
> +	}
> +}
> +
> +static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
> +		unsigned long address)
> +{
> +	spinlock_t *ptl;
> +	pmd_t *pmd;
> +	pte_t *pte, entry;
> +	swp_entry_t swp_entry;
> +
> +	pmd = mm_find_pmd(vma->vm_mm, address);
> +	if (!pmd)
> +		return;
> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, address, &ptl);
> +
> +	if (!is_swap_pte(*pte))
> +		goto unlock;
> +
> +	swp_entry = pte_to_swp_entry(*pte);
> +	if (!is_migration_entry(swp_entry) ||
> +			migration_entry_to_page(swp_entry) != page)
> +		goto unlock;
> +
> +	entry = pte_mkold(mk_pte(page, vma->vm_page_prot));
> +	if (is_write_migration_entry(swp_entry))
> +		entry = maybe_mkwrite(entry, vma);
> +
> +	flush_dcache_page(page);
> +	set_pte_at(vma->vm_mm, address, pte, entry);
> +
> +	/* No need to invalidate - it was non-present before */
> +	update_mmu_cache(vma, address, pte);
> +unlock:
> +	pte_unmap_unlock(pte, ptl);
> +}
> +
> +static void unfreeze_page(struct anon_vma *anon_vma, struct page *page)
> +{
> +	struct anon_vma_chain *avc;
> +	pgoff_t pgoff = page_to_pgoff(page);
> +	int i;
> +
> +	for (i = 0; i < HPAGE_PMD_NR; i++, pgoff++, page++) {

In the case of freeze_page() this cycle is the inner one and it can batch the
ptl lock. Why not here?
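
As an illustration of the kind of batching being suggested, a sketch only (it
mirrors freeze_page_vma() above, assumes @address is the huge-page-aligned
start and @page is the head page, and is not meant as the actual
implementation):

	static void unfreeze_page_vma_batched(struct vm_area_struct *vma,
			struct page *page, unsigned long address)
	{
		spinlock_t *ptl;
		pmd_t *pmd;
		pte_t *pte;
		int i;

		pmd = mm_find_pmd(vma->vm_mm, address);
		if (!pmd)
			return;
		/* take the pte lock once for all HPAGE_PMD_NR entries */
		pte = pte_offset_map_lock(vma->vm_mm, pmd, address, &ptl);
		for (i = 0; i < HPAGE_PMD_NR; i++, address += PAGE_SIZE, page++) {
			swp_entry_t swp_entry;
			pte_t entry;

			if (!is_swap_pte(pte[i]))
				continue;
			swp_entry = pte_to_swp_entry(pte[i]);
			if (!is_migration_entry(swp_entry) ||
					migration_entry_to_page(swp_entry) != page)
				continue;
			entry = pte_mkold(mk_pte(page, vma->vm_page_prot));
			if (is_write_migration_entry(swp_entry))
				entry = maybe_mkwrite(entry, vma);
			flush_dcache_page(page);
			set_pte_at(vma->vm_mm, address, pte + i, entry);
			/* no need to invalidate - it was non-present before */
			update_mmu_cache(vma, address, pte + i);
		}
		pte_unmap_unlock(pte, ptl);
	}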

> +		if (!page_mapcount(page))
> +			continue;
> +
> +		anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
> +				pgoff, pgoff) {
> +			unsigned long address = vma_address(page, avc->vma);
> +
> +			mmu_notifier_invalidate_range_start(avc->vma->vm_mm,
> +					address, address + PAGE_SIZE);
> +			unfreeze_page_vma(avc->vma, page, address);
> +			mmu_notifier_invalidate_range_end(avc->vma->vm_mm,
> +					address, address + PAGE_SIZE);
> +		}
> +	}
> +}
> +
> +static int total_mapcount(struct page *page)
> +{
> +	int i, ret;
> +
> +	ret = compound_mapcount(page);
> +	for (i = 0; i < HPAGE_PMD_NR; i++)
> +		ret += atomic_read(&page[i]._mapcount) + 1;
> +
> +	/*
> +	 * Positive compound_mapcount() offsets ->_mapcount in every subpage by
> +	 * one. Let's substract it here.
> +	 */
> +	if (compound_mapcount(page))
> +		ret -= HPAGE_PMD_NR;
> +
> +	return ret;
> +}
> +
> +static int __split_huge_page_tail(struct page *head, int tail,
> +		struct lruvec *lruvec, struct list_head *list)
> +{
> +	int mapcount;
> +	struct page *page_tail = head + tail;
> +
> +	mapcount = page_mapcount(page_tail);
> +	BUG_ON(atomic_read(&page_tail->_count) != 0);

VM_BUG_ON?

> +
> +	/*
> +	 * tail_page->_count is zero and not changing from under us. But
> +	 * get_page_unless_zero() may be running from under us on the
> +	 * tail_page. If we used atomic_set() below instead of atomic_add(), we
> +	 * would then run atomic_set() concurrently with
> +	 * get_page_unless_zero(), and atomic_set() is implemented in C not
> +	 * using locked ops. spin_unlock on x86 sometime uses locked ops
> +	 * because of PPro errata 66, 92, so unless somebody can guarantee
> +	 * atomic_set() here would be safe on all archs (and not only on x86),
> +	 * it's safer to use atomic_add().
> +	 */
> +	atomic_add(page_mapcount(page_tail) + 1, &page_tail->_count);
> +
> +	/* after clearing PageTail the gup refcount can be released */
> +	smp_mb__after_atomic();
> +
> +	/*
> +	 * retain hwpoison flag of the poisoned tail page:
> +	 *   fix for the unsuitable process killed on Guest Machine(KVM)
> +	 *   by the memory-failure.
> +	 */
> +	page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP | __PG_HWPOISON;
> +	page_tail->flags |= (head->flags &
> +			((1L << PG_referenced) |
> +			 (1L << PG_swapbacked) |
> +			 (1L << PG_mlocked) |
> +			 (1L << PG_uptodate) |
> +			 (1L << PG_active) |
> +			 (1L << PG_locked) |
> +			 (1L << PG_unevictable)));
> +	page_tail->flags |= (1L << PG_dirty);
> +
> +	/* clear PageTail before overwriting first_page */
> +	smp_wmb();
> +
> +	/* ->mapping in first tail page is compound_mapcount */
> +	BUG_ON(tail != 1 && page_tail->mapping != TAIL_MAPPING);

VM_BUG_ON?

> +	page_tail->mapping = head->mapping;
> +
> +	page_tail->index = head->index + tail;
> +	page_cpupid_xchg_last(page_tail, page_cpupid_last(head));
> +	lru_add_page_tail(head, page_tail, lruvec, list);
> +
> +	return mapcount;
> +}
> +
> +static void __split_huge_page(struct page *page, struct list_head *list)
> +{
> +	struct page *head = compound_head(page);
> +	struct zone *zone = page_zone(head);
> +	struct lruvec *lruvec;
> +	int i, tail_mapcount;
> +
> +	/* prevent PageLRU to go away from under us, and freeze lru stats */
> +	spin_lock_irq(&zone->lru_lock);
> +	lruvec = mem_cgroup_page_lruvec(head, zone);
> +
> +	/* complete memcg works before add pages to LRU */
> +	mem_cgroup_split_huge_fixup(head);
> +
> +	tail_mapcount = 0;
> +	for (i = HPAGE_PMD_NR - 1; i >= 1; i--)
> +		tail_mapcount += __split_huge_page_tail(head, i, lruvec, list);
> +	atomic_sub(tail_mapcount, &head->_count);
> +
> +	ClearPageCompound(head);
> +	spin_unlock_irq(&zone->lru_lock);
> +
> +	unfreeze_page(page_anon_vma(head), head);
> +
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		struct page *subpage = head + i;
> +		if (subpage == page)
> +			continue;
> +		unlock_page(subpage);
> +
> +		/*
> +		 * Subpages may be freed if there wasn't any mapping
> +		 * like if add_to_swap() is running on a lru page that
> +		 * had its mapping zapped. And freeing these pages
> +		 * requires taking the lru_lock so we do the put_page
> +		 * of the tail pages after the split is complete.
> +		 */
> +		put_page(subpage);
> +	}
> +}
> +
> +/*
> + * This function splits huge page into normal pages. @page can point to any
> + * subpage of huge page to split. Split doesn't change the position of @page.
> + *
> + * Only caller must hold pin on the @page, otherwise split fails with -EBUSY.
> + * The huge page must be locked.
> + *
> + * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
> + *
> + * Both head page and tail pages will inherit mapping, flags, and so on from
> + * the hugepage.
> + *
> + * GUP pin and PG_locked transfered to @page. Rest subpages can be freed if
> + * they are not mapped.
> + *
> + * Returns 0 if the hugepage is split successfully.
> + * Returns -EBUSY if the page is pinned or if anon_vma disappeared from under
> + * us.
> + */
> +int split_huge_page_to_list(struct page *page, struct list_head *list)
> +{
> +	struct page *head = compound_head(page);
> +	struct anon_vma *anon_vma;
> +	int mapcount, ret;
> +
> +	BUG_ON(is_huge_zero_page(page));
> +	BUG_ON(!PageAnon(page));
> +	BUG_ON(!PageLocked(page));
> +	BUG_ON(!PageSwapBacked(page));
> +	BUG_ON(!PageCompound(page));

VM_BUG_ONs?

> +
> +	/*
> +	 * The caller does not necessarily hold an mmap_sem that would prevent
> +	 * the anon_vma disappearing so we first we take a reference to it
> +	 * and then lock the anon_vma for write. This is similar to
> +	 * page_lock_anon_vma_read except the write lock is taken to serialise
> +	 * against parallel split or collapse operations.
> +	 */
> +	anon_vma = page_get_anon_vma(head);
> +	if (!anon_vma) {
> +		ret = -EBUSY;
> +		goto out;
> +	}
> +	anon_vma_lock_write(anon_vma);
> +
> +	/*
> +	 * Racy check if we can split the page, before freeze_page() will
> +	 * split PMDs
> +	 */
> +	if (total_mapcount(head) != page_count(head) - 1) {
> +		ret = -EBUSY;
> +		goto out_unlock;
> +	}
> +
> +	freeze_page(anon_vma, head);
> +	VM_BUG_ON_PAGE(compound_mapcount(head), head);
> +
> +	mapcount = total_mapcount(head);
> +	if (mapcount == page_count(head) - 1) {
> +		__split_huge_page(page, list);
> +		ret = 0;
> +	} else if (mapcount > page_count(page) - 1) {

It's confusing to use page_count(head) in one test and page_count(page)
in the other, although I know they should be the same. Also, what if you read
a different value because something broke?

> +		pr_alert("total_mapcount: %u, page_count(): %u\n",
> +				mapcount, page_count(page));

Here you determine page_count(page) again, although it could have
changed in the meantime (we are in a path where something went wrong already),
so you potentially print a different value than the one that was tested.
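
A minimal sketch of that cleanup (the local variable is mine, not from the
patch): read page_count(head) once, so the tests and the diagnostic print
cannot disagree:

	int count = page_count(head);

	mapcount = total_mapcount(head);
	if (mapcount == count - 1) {
		__split_huge_page(page, list);
		ret = 0;
	} else if (mapcount > count - 1) {
		pr_alert("total_mapcount: %u, page_count(): %u\n",
				mapcount, count);
		/* ... dump_page()/BUG() handling unchanged ... */
	}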


> +		if (PageTail(page))
> +			dump_page(head, NULL);
> +		dump_page(page, "tail_mapcount > page_count(page) - 1");

Here you say "tail_mapcount", which means something else in other places.
Also, isn't the whole "else if" test DEBUG_VM material as well?

> +		BUG();
> +	} else {
> +		unfreeze_page(anon_vma, head);
> +		ret = -EBUSY;
> +	}
> +
> +out_unlock:
> +	anon_vma_unlock_write(anon_vma);
> +	put_anon_vma(anon_vma);
> +out:
> +	count_vm_event(!ret ? THP_SPLIT_PAGE : THP_SPLIT_PAGE_FAILED);
> +	return ret;
> +}
> diff --git a/mm/internal.h b/mm/internal.h
> index 98bce4d12a16..aee0f2566fdd 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -13,6 +13,7 @@
>
>   #include <linux/fs.h>
>   #include <linux/mm.h>
> +#include <linux/pagemap.h>
>
>   void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>   		unsigned long floor, unsigned long ceiling);
> @@ -244,10 +245,27 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
>
>   extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -extern unsigned long vma_address(struct page *page,
> -				 struct vm_area_struct *vma);
> -#endif
> +/*
> + * At what user virtual address is page expected in @vma?
> + */
> +static inline unsigned long
> +__vma_address(struct page *page, struct vm_area_struct *vma)
> +{
> +	pgoff_t pgoff = page_to_pgoff(page);
> +	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> +}
> +
> +static inline unsigned long
> +vma_address(struct page *page, struct vm_area_struct *vma)
> +{
> +	unsigned long address = __vma_address(page, vma);
> +
> +	/* page should be within @vma mapping range */
> +	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> +
> +	return address;
> +}
> +
>   #else /* !CONFIG_MMU */
>   static inline void clear_page_mlock(struct page *page) { }
>   static inline void mlock_vma_page(struct page *page) { }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 047953145710..723af5bbeb02 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -561,27 +561,6 @@ void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
>   }
>
>   /*
> - * At what user virtual address is page expected in @vma?
> - */
> -static inline unsigned long
> -__vma_address(struct page *page, struct vm_area_struct *vma)
> -{
> -	pgoff_t pgoff = page_to_pgoff(page);
> -	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> -}
> -
> -inline unsigned long
> -vma_address(struct page *page, struct vm_area_struct *vma)
> -{
> -	unsigned long address = __vma_address(page, vma);
> -
> -	/* page should be within @vma mapping range */
> -	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
> -
> -	return address;
> -}
> -
> -/*
>    * At what user virtual address is page expected in vma?
>    * Caller should check the page is actually part of the vma.
>    */
>


^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 04/28] mm, thp: adjust conditions when we can reuse the page on WP fault
  2015-05-15 13:29           ` Kirill A. Shutemov
@ 2015-05-19 13:00             ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-19 13:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On 05/15/2015 03:29 PM, Kirill A. Shutemov wrote:
> On Fri, May 15, 2015 at 01:35:49PM +0200, Vlastimil Babka wrote:
>> On 05/15/2015 01:21 PM, Kirill A. Shutemov wrote:
>>> On Fri, May 15, 2015 at 11:15:00AM +0200, Vlastimil Babka wrote:
>>>> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
>>>>> With the new refcounting we will be able to map the same compound page with
>>>>> PTEs and PMDs. This requires adjusting the conditions under which we can
>>>>> reuse the page on a write-protection fault.
>>>>>
>>>>> For a PTE fault we can't reuse the page if it's part of a huge page.
>>>>>
>>>>> For a PMD we can only reuse the page if nobody else maps the huge page or
>>>>> any part of it. We could do that by checking page_mapcount() on each
>>>>> sub-page, but it's expensive.
>>>>>
>>>>> The cheaper way is to check that page_count() is equal to 1: every mapcount
>>>>> takes a page reference, so this way we can guarantee that the PMD is the
>>>>> only mapping.
>>>>>
>>>>> This approach can give a false negative if somebody pinned the page, but
>>>>> that doesn't affect correctness.
>>>>>
>>>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>>>> Tested-by: Sasha Levin <sasha.levin@oracle.com>
>>>>
>>>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>>>>
>>>> So couldn't the same trick be used in Patch 1 to avoid counting individual
>>>> order-0 pages?
>>>
>>> Hm. You're right, we could. But is smaps that performance sensitive to
>>> bother?
>>
>> Well, I was nudged to optimize it when doing the shmem swap accounting
>> changes there :) User may not care about the latency of obtaining the smaps
>> file contents, but since it has mmap_sem locked for that, the process might
>> care...
>
> Something like this?

Yeah, that should work.

>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index e04399e53965..5bc3d2b1176e 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -462,6 +462,19 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
>          if (young || PageReferenced(page))
>                  mss->referenced += size;
>
> +       /*
> +        * page_count(page) == 1 guarantees the page is mapped exactly once.
> +        * If any subpage of the compound page mapped with PTE it would elevate
> +        * page_count().
> +        */
> +       if (page_count(page) == 1) {
> +               if (dirty || PageDirty(page))
> +                       mss->private_dirty += size;
> +               else
> +                       mss->private_clean += size;
> +               return;
> +       }
> +
>          for (i = 0; i < nr; i++, page++) {
>                  int mapcount = page_mapcount(page);
>
>


^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 26/28] thp: introduce deferred_split_huge_page()
  2015-04-23 21:04   ` Kirill A. Shutemov
@ 2015-05-19 13:54     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-19 13:54 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:04 PM, Kirill A. Shutemov wrote:
> Currently we don't split a huge page on partial unmap. It's not an ideal
> situation. It can lead to memory overhead.
>
> Fortunately, we can detect partial unmap in page_remove_rmap(). But we
> cannot call split_huge_page() from there due to the locking context.
>
> It's also counterproductive to do it directly from the munmap() codepath: in
> many cases we will hit this from exit(2), and splitting the huge page
> just to free it up in small pages is not what we really want.
>
> The patch introduces deferred_split_huge_page(), which puts the huge page
> into a queue for splitting. The splitting itself will happen when we get
> memory pressure, via the shrinker interface. The page will be dropped from
> the list on freeing, through the compound page destructor.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> @@ -715,6 +726,12 @@ static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
>   	return entry;
>   }
>
> +void prep_transhuge_page(struct page *page)
> +{
> +	INIT_LIST_HEAD(&page[2].lru);

Wouldn't hurt to mention that you use page[2] because lru in page 1 
would collide with the dtor (right?).
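
A sketch of the comment being asked for (the stated rationale is the guess
above plus my wording, not text from the patch):

	void prep_transhuge_page(struct page *page)
	{
		/*
		 * page[1] already carries compound metadata (destructor,
		 * order), so keep the deferred-split list head in
		 * page[2].lru instead.
		 */
		INIT_LIST_HEAD(&page[2].lru);
		set_compound_page_dtor(page, free_transhuge_page);
	}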

> +	set_compound_page_dtor(page, free_transhuge_page);
> +}
> +
>   static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
>   					struct vm_area_struct *vma,
>   					unsigned long haddr, pmd_t *pmd,


^ permalink raw reply	[flat|nested] 189+ messages in thread

* Re: [PATCHv5 23/28] thp: add option to setup migration entiries during PMD split
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-05-19 13:55     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-19 13:55 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> We are going to use migration PTE entries to stabilize page counts.
> If the page is mapped with PMDs, we need to split the PMD and set up
> migration entries. It's reasonable to combine these operations to avoid
> double-scanning over the page table.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>   mm/huge_memory.c | 23 +++++++++++++++--------
>   1 file changed, 15 insertions(+), 8 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5885ef8f0fad..2f9e2e882bab 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -23,6 +23,7 @@
>   #include <linux/pagemap.h>
>   #include <linux/migrate.h>
>   #include <linux/hashtable.h>
> +#include <linux/swapops.h>
>
>   #include <asm/tlb.h>
>   #include <asm/pgalloc.h>
> @@ -2551,7 +2552,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>   }
>
>   static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> -		unsigned long haddr)
> +		unsigned long haddr, bool freeze)
>   {
>   	struct mm_struct *mm = vma->vm_mm;
>   	struct page *page;
> @@ -2593,12 +2594,18 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>   		 * transferred to avoid any possibility of altering
>   		 * permissions across VMAs.
>   		 */
> -		entry = mk_pte(page + i, vma->vm_page_prot);
> -		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> -		if (!write)
> -			entry = pte_wrprotect(entry);
> -		if (!young)
> -			entry = pte_mkold(entry);
> +		if (freeze) {
> +			swp_entry_t swp_entry;
> +			swp_entry = make_migration_entry(page + i, write);
> +			entry = swp_entry_to_pte(swp_entry);
> +		} else {
> +			entry = mk_pte(page + i, vma->vm_page_prot);
> +			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +			if (!write)
> +				entry = pte_wrprotect(entry);
> +			if (!young)
> +				entry = pte_mkold(entry);
> +		}
>   		pte = pte_offset_map(&_pmd, haddr);
>   		BUG_ON(!pte_none(*pte));
>   		set_pte_at(mm, haddr, pte, entry);
> @@ -2625,7 +2632,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>   	mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
>   	ptl = pmd_lock(mm, pmd);
>   	if (likely(pmd_trans_huge(*pmd)))
> -		__split_huge_pmd_locked(vma, pmd, haddr);
> +		__split_huge_pmd_locked(vma, pmd, haddr, false);
>   	spin_unlock(ptl);
>   	mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
>   }
>
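
As a point of reference for what the freeze case means to the rest of the MM:
a PTE converted into a migration entry looks like a non-present swap entry to
the fault path, which waits on it with the existing swapops helpers. A rough
sketch of that handling (mm, pmd, address and pte are assumed to come from the
surrounding fault handler; this is not code from the patchset):

	if (!pte_present(pte)) {
		swp_entry_t entry = pte_to_swp_entry(pte);

		if (is_migration_entry(entry)) {
			/*
			 * The PMD was split with freeze == true (or the page
			 * is under migration): wait until the migration entry
			 * is replaced by a real PTE, then retry the fault.
			 */
			migration_entry_wait(mm, pmd, address);
			return 0;
		}
	}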



* Re: [PATCHv5 24/28] thp, mm: split_huge_page(): caller need to lock page
  2015-04-23 21:03   ` Kirill A. Shutemov
@ 2015-05-19 13:55     ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-19 13:55 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli, Hugh Dickins
  Cc: Dave Hansen, Mel Gorman, Rik van Riel, Christoph Lameter,
	Naoya Horiguchi, Steve Capper, Aneesh Kumar K.V, Johannes Weiner,
	Michal Hocko, Jerome Marchand, Sasha Levin, linux-kernel,
	linux-mm

On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> We're going to use migration entries instead of compound_lock() to
> stabilize page refcounts. Setting up and removing migration entries
> requires the page to be locked.
>
> Some split_huge_page() callers already have the page locked. Let's
> require everybody to lock the page before calling split_huge_page().
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Tested-by: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
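
In practice the calling convention this patch establishes looks roughly like
the fragment below (an illustrative sketch, not a hunk from the series): the
caller takes a reference, locks the page, and only then attempts the split,
which in the new scheme may fail and leave the page intact.

	int ret;

	get_page(page);
	lock_page(page);
	ret = split_huge_page(page);	/* may fail; the page stays huge then */
	unlock_page(page);
	put_page(page);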



* Re: [PATCHv5 07/28] thp, mlock: do not allow huge pages in mlocked area
  2015-05-15 13:41       ` Kirill A. Shutemov
@ 2015-05-19 14:37         ` Vlastimil Babka
  -1 siblings, 0 replies; 189+ messages in thread
From: Vlastimil Babka @ 2015-05-19 14:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On 05/15/2015 03:41 PM, Kirill A. Shutemov wrote:
> On Fri, May 15, 2015 at 02:56:42PM +0200, Vlastimil Babka wrote:
>> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
>>> With the new refcounting a THP can belong to several VMAs. This makes it
>>> tricky to track THP pages when they are partially mlocked. It can lead to
>>> leaking mlocked pages to non-VM_LOCKED vmas and other problems.
>>> With this patch we will split all pages on mlock and avoid
>>> faulting in/collapsing new THP in VM_LOCKED vmas.
>>>
>>> I've tried an alternative approach: do not mark THP pages mlocked and
>>> keep them on normal LRUs. This way vmscan could try to split huge pages
>>> under memory pressure and free up subpages which don't belong to
>>> VM_LOCKED vmas.  But this is a user-visible change: we screw up Mlocked
>>> accounting reported in meminfo, so I had to leave this approach aside.
>>>
>>> We can bring something better later, but this should be good enough for
>>> now.
>>
>> I can imagine people won't be happy about losing the benefits of THPs when
>> they mlock().
>> How difficult would it be to support mlocked THP pages without splitting
>> until something actually tries to do a partial (un)mapping, and only then
>> do the split? That will support the most common case, no?
>
> Yes, it will.
>
> But what will we do if we fail to split huge page on munmap()? Fail
> munmap() with -EBUSY?

We could just munlock the whole THP page, and if we could make the
deferred split happen ASAP instead of waiting for memory pressure, the
window with NR_MLOCK being undercounted would be minimized. Since
RLIMIT_MEMLOCK is tracked independently from NR_MLOCK, there should be
no danger wrt breaching the limit due to undercounting here?
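
For background, the behaviour being questioned here is what the patch as
posted does: VM_LOCKED VMAs simply never get huge pages, which boils down to a
check of roughly this shape (vma_allows_thp() is a hypothetical name used only
for illustration; the actual patch spreads the test over the fault-in and
khugepaged collapse paths):

#include <linux/mm.h>

/* Hypothetical helper, for illustration only. */
static inline bool vma_allows_thp(struct vm_area_struct *vma)
{
	/* Mlocked VMAs fall back to 4k pages under this patchset. */
	if (vma->vm_flags & VM_LOCKED)
		return false;
	return true;
}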




* Re: [PATCHv5 07/28] thp, mlock: do not allow huge pages in mlocked area
  2015-05-19 14:37         ` Vlastimil Babka
@ 2015-05-20 12:10           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-05-20 12:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Tue, May 19, 2015 at 04:37:25PM +0200, Vlastimil Babka wrote:
> On 05/15/2015 03:41 PM, Kirill A. Shutemov wrote:
> >On Fri, May 15, 2015 at 02:56:42PM +0200, Vlastimil Babka wrote:
> >>On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> >>>With the new refcounting a THP can belong to several VMAs. This makes
> >>>it tricky to track THP pages when they are partially mlocked. It can
> >>>lead to leaking mlocked pages to non-VM_LOCKED vmas and other problems.
> >>>With this patch we will split all pages on mlock and avoid
> >>>faulting in/collapsing new THP in VM_LOCKED vmas.
> >>>
> >>>I've tried an alternative approach: do not mark THP pages mlocked and
> >>>keep them on normal LRUs. This way vmscan could try to split huge pages
> >>>under memory pressure and free up subpages which don't belong to
> >>>VM_LOCKED vmas.  But this is a user-visible change: we screw up Mlocked
> >>>accounting reported in meminfo, so I had to leave this approach aside.
> >>>
> >>>We can bring something better later, but this should be good enough for
> >>>now.
> >>
> >>I can imagine people won't be happy about losing the benefits of THPs
> >>when they mlock().
> >>How difficult would it be to support mlocked THP pages without splitting
> >>until something actually tries to do a partial (un)mapping, and only then
> >>do the split? That will support the most common case, no?
> >
> >Yes, it will.
> >
> >But what will we do if we fail to split huge page on munmap()? Fail
> >munmap() with -EBUSY?
> 
> We could just munlock the whole THP page, and if we could make the deferred
> split happen ASAP instead of waiting for memory pressure, the window with
> NR_MLOCK being undercounted would be minimized. Since RLIMIT_MEMLOCK is
> tracked independently from NR_MLOCK, there should be no danger wrt breaching
> the limit due to undercounting here?

I'm not sure what "ASAP" should mean here and how to implement it.

I would really prefer to address mlock separately. The patchset is already
huge enough. :-/

-- 
 Kirill A. Shutemov


* Re: [PATCHv5 22/28] thp: implement split_huge_pmd()
  2015-05-19  8:25     ` Vlastimil Babka
@ 2015-05-20 14:38       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 189+ messages in thread
From: Kirill A. Shutemov @ 2015-05-20 14:38 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Dave Hansen, Mel Gorman, Rik van Riel,
	Christoph Lameter, Naoya Horiguchi, Steve Capper,
	Aneesh Kumar K.V, Johannes Weiner, Michal Hocko, Jerome Marchand,
	Sasha Levin, linux-kernel, linux-mm

On Tue, May 19, 2015 at 10:25:30AM +0200, Vlastimil Babka wrote:
> On 04/23/2015 11:03 PM, Kirill A. Shutemov wrote:
> >The original split_huge_page() combined two operations: splitting PMDs
> >into tables of PTEs and splitting the underlying compound page. This
> >patch implements split_huge_pmd(), which splits the given PMD without
> >splitting other PMDs this page is mapped with or the underlying compound
> >page.
> >
> >Without tail page refcounting, the implementation of split_huge_pmd() is
> >pretty straightforward.
> >
> >Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> >Tested-by: Sasha Levin <sasha.levin@oracle.com>
> >---
> >  include/linux/huge_mm.h |  11 ++++-
> >  mm/huge_memory.c        | 108 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 118 insertions(+), 1 deletion(-)
> >
> >diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >index 0382230b490f..b7844c73b7db 100644
> >--- a/include/linux/huge_mm.h
> >+++ b/include/linux/huge_mm.h
> >@@ -94,7 +94,16 @@ extern unsigned long transparent_hugepage_flags;
> >
> >  #define split_huge_page_to_list(page, list) BUILD_BUG()
> >  #define split_huge_page(page) BUILD_BUG()
> >-#define split_huge_pmd(__vma, __pmd, __address) BUILD_BUG()
> >+
> >+void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> >+		unsigned long address);
> >+
> >+#define split_huge_pmd(__vma, __pmd, __address)				\
> >+	do {								\
> >+		pmd_t *____pmd = (__pmd);				\
> >+		if (unlikely(pmd_trans_huge(*____pmd)))			\
> 
> Given that most of the calls to split_huge_pmd() appear to be inside
> if (pmd_trans_huge(...)) branches, this unlikely() seems counter-productive.

Fair enough.

> >+void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> >+		unsigned long address)
> >+{
> >+	spinlock_t *ptl;
> >+	struct mm_struct *mm = vma->vm_mm;
> >+	unsigned long haddr = address & HPAGE_PMD_MASK;
> >+
> >+	mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
> >+	ptl = pmd_lock(mm, pmd);
> >+	if (likely(pmd_trans_huge(*pmd)))
> 
> This likely is likely useless :)
 
No, it's not. This is the first time we check the pmd with pmd_trans_huge()
under the ptl, and __split_huge_pmd_locked() assumes the pmd is huge.

> >+		__split_huge_pmd_locked(vma, pmd, haddr);
> >+	spin_unlock(ptl);
> >+	mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
> >+}
> >+
-- 
 Kirill A. Shutemov
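
For reference, the caller pattern being discussed typically looks like the
fragment below (an illustrative sketch; vma, pmd, addr and mm are assumed to
come from the surrounding page-table walk). Since callers already test
pmd_trans_huge() themselves, the unlikely() inside the macro buys little:

	pte_t *pte;
	spinlock_t *ptl;

	if (pmd_trans_huge(*pmd))
		/*
		 * Split only this PMD mapping; other PMD mappings of the
		 * same compound page, and the page itself, stay intact.
		 */
		split_huge_pmd(vma, pmd, addr);

	/* From here on a regular PTE page table is in place. */
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	/* ... operate on the individual PTEs ... */
	pte_unmap_unlock(pte, ptl);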



Thread overview: 189+ messages
2015-04-23 21:03 [PATCHv5 00/28] THP refcounting redesign Kirill A. Shutemov
2015-04-23 21:03 ` Kirill A. Shutemov
2015-04-23 21:03 ` [PATCHv5 01/28] mm, proc: adjust PSS calculation Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-04-29 15:49   ` Jerome Marchand
2015-05-14 14:12   ` Vlastimil Babka
2015-05-14 14:12     ` Vlastimil Babka
2015-05-15 10:56     ` Kirill A. Shutemov
2015-05-15 10:56       ` Kirill A. Shutemov
2015-05-15 11:33       ` Vlastimil Babka
2015-05-15 11:33         ` Vlastimil Babka
2015-05-15 11:43         ` Kirill A. Shutemov
2015-05-15 11:43           ` Kirill A. Shutemov
2015-05-15 12:37           ` Vlastimil Babka
2015-05-15 12:37             ` Vlastimil Babka
2015-04-23 21:03 ` [PATCHv5 02/28] rmap: add argument to charge compound page Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-04-29 15:53   ` Jerome Marchand
2015-04-30 11:52     ` Kirill A. Shutemov
2015-04-30 11:52       ` Kirill A. Shutemov
2015-05-14 16:07   ` Vlastimil Babka
2015-05-14 16:07     ` Vlastimil Babka
2015-05-15 11:14     ` Kirill A. Shutemov
2015-05-15 11:14       ` Kirill A. Shutemov
2015-04-23 21:03 ` [PATCHv5 03/28] memcg: adjust to support new THP refcounting Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-05-15  7:44   ` Vlastimil Babka
2015-05-15  7:44     ` Vlastimil Babka
2015-05-15 11:18     ` Kirill A. Shutemov
2015-05-15 11:18       ` Kirill A. Shutemov
2015-05-15 14:57       ` Dave Hansen
2015-05-15 14:57         ` Dave Hansen
2015-05-16 23:17         ` Kirill A. Shutemov
2015-05-16 23:17           ` Kirill A. Shutemov
2015-04-23 21:03 ` [PATCHv5 04/28] mm, thp: adjust conditions when we can reuse the page on WP fault Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-04-29 15:54   ` Jerome Marchand
2015-05-15  9:15   ` Vlastimil Babka
2015-05-15  9:15     ` Vlastimil Babka
2015-05-15 11:21     ` Kirill A. Shutemov
2015-05-15 11:21       ` Kirill A. Shutemov
2015-05-15 11:35       ` Vlastimil Babka
2015-05-15 11:35         ` Vlastimil Babka
2015-05-15 13:29         ` Kirill A. Shutemov
2015-05-15 13:29           ` Kirill A. Shutemov
2015-05-19 13:00           ` Vlastimil Babka
2015-05-19 13:00             ` Vlastimil Babka
2015-04-23 21:03 ` [PATCHv5 05/28] mm: adjust FOLL_SPLIT for new refcounting Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-05-15 11:05   ` Vlastimil Babka
2015-05-15 11:05     ` Vlastimil Babka
2015-05-15 11:36     ` Kirill A. Shutemov
2015-05-15 11:36       ` Kirill A. Shutemov
2015-05-15 12:01       ` Vlastimil Babka
2015-05-15 12:01         ` Vlastimil Babka
2015-04-23 21:03 ` [PATCHv5 06/28] mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-04-29 15:56   ` Jerome Marchand
2015-05-15 12:46   ` Vlastimil Babka
2015-05-15 12:46     ` Vlastimil Babka
2015-04-23 21:03 ` [PATCHv5 07/28] thp, mlock: do not allow huge pages in mlocked area Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-04-29 15:58   ` Jerome Marchand
2015-05-15 12:56   ` Vlastimil Babka
2015-05-15 12:56     ` Vlastimil Babka
2015-05-15 13:41     ` Kirill A. Shutemov
2015-05-15 13:41       ` Kirill A. Shutemov
2015-05-19 14:37       ` Vlastimil Babka
2015-05-19 14:37         ` Vlastimil Babka
2015-05-20 12:10         ` Kirill A. Shutemov
2015-05-20 12:10           ` Kirill A. Shutemov
2015-04-23 21:03 ` [PATCHv5 08/28] khugepaged: ignore pmd tables with THP mapped with ptes Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-04-29 15:59   ` Jerome Marchand
2015-05-15 12:59   ` Vlastimil Babka
2015-05-15 12:59     ` Vlastimil Babka
2015-04-23 21:03 ` [PATCHv5 09/28] thp: rename split_huge_page_pmd() to split_huge_pmd() Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-04-29 16:00   ` Jerome Marchand
2015-05-15 13:08   ` Vlastimil Babka
2015-05-15 13:08     ` Vlastimil Babka
2015-04-23 21:03 ` [PATCHv5 10/28] mm, vmstats: new THP splitting event Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-04-29 16:02   ` Jerome Marchand
2015-05-15 13:10   ` Vlastimil Babka
2015-05-15 13:10     ` Vlastimil Babka
2015-04-23 21:03 ` [PATCHv5 11/28] mm: temporally mark THP broken Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-04-23 21:03 ` [PATCHv5 12/28] thp: drop all split_huge_page()-related code Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-04-23 21:03 ` [PATCHv5 13/28] mm: drop tail page refcounting Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-05-18  9:48   ` Vlastimil Babka
2015-05-18  9:48     ` Vlastimil Babka
2015-04-23 21:03 ` [PATCHv5 14/28] futex, thp: remove special case for THP in get_futex_key Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-05-18 11:49   ` Vlastimil Babka
2015-05-18 11:49     ` Vlastimil Babka
2015-05-18 12:13     ` Kirill A. Shutemov
2015-05-18 12:13       ` Kirill A. Shutemov
2015-04-23 21:03 ` [PATCHv5 15/28] ksm: prepare to new THP semantics Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-05-18 12:41   ` Vlastimil Babka
2015-05-18 12:41     ` Vlastimil Babka
2015-04-23 21:03 ` [PATCHv5 16/28] mm, thp: remove compound_lock Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-04-29 16:11   ` Jerome Marchand
2015-04-30 11:58     ` Kirill A. Shutemov
2015-04-30 11:58       ` Kirill A. Shutemov
2015-05-18 12:57   ` Vlastimil Babka
2015-05-18 12:57     ` Vlastimil Babka
2015-04-23 21:03 ` [PATCHv5 17/28] mm, thp: remove infrastructure for handling splitting PMDs Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-04-29 16:14   ` Jerome Marchand
2015-04-30 12:03     ` Kirill A. Shutemov
2015-04-30 12:03       ` Kirill A. Shutemov
2015-05-18 13:40   ` Vlastimil Babka
2015-05-18 13:40     ` Vlastimil Babka
2015-04-23 21:03 ` [PATCHv5 18/28] x86, " Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-04-29  9:13   ` Aneesh Kumar K.V
2015-04-29  9:13     ` Aneesh Kumar K.V
2015-04-23 21:03 ` [PATCHv5 19/28] mm: store mapcount for compound page separately Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-05-18 14:32   ` Vlastimil Babka
2015-05-18 14:32     ` Vlastimil Babka
2015-05-19  3:55     ` Kirill A. Shutemov
2015-05-19  3:55       ` Kirill A. Shutemov
2015-05-19  9:01       ` Vlastimil Babka
2015-05-19  9:01         ` Vlastimil Babka
2015-04-23 21:03 ` [PATCHv5 20/28] mm: differentiate page_mapped() from page_mapcount() for compound pages Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-04-29 16:20   ` Jerome Marchand
2015-04-30 12:06     ` Kirill A. Shutemov
2015-04-30 12:06       ` Kirill A. Shutemov
2015-05-18 15:35   ` Vlastimil Babka
2015-05-18 15:35     ` Vlastimil Babka
2015-05-19  4:00     ` Kirill A. Shutemov
2015-05-19  4:00       ` Kirill A. Shutemov
2015-04-23 21:03 ` [PATCHv5 21/28] mm, numa: skip PTE-mapped THP on numa fault Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-04-23 21:03 ` [PATCHv5 22/28] thp: implement split_huge_pmd() Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-05-19  8:25   ` Vlastimil Babka
2015-05-19  8:25     ` Vlastimil Babka
2015-05-20 14:38     ` Kirill A. Shutemov
2015-05-20 14:38       ` Kirill A. Shutemov
2015-04-23 21:03 ` [PATCHv5 23/28] thp: add option to setup migration entiries during PMD split Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-05-19 13:55   ` Vlastimil Babka
2015-05-19 13:55     ` Vlastimil Babka
2015-04-23 21:03 ` [PATCHv5 24/28] thp, mm: split_huge_page(): caller need to lock page Kirill A. Shutemov
2015-04-23 21:03   ` Kirill A. Shutemov
2015-05-19 13:55   ` Vlastimil Babka
2015-05-19 13:55     ` Vlastimil Babka
2015-04-23 21:04 ` [PATCHv5 25/28] thp: reintroduce split_huge_page() Kirill A. Shutemov
2015-04-23 21:04   ` Kirill A. Shutemov
2015-05-19 12:43   ` Vlastimil Babka
2015-05-19 12:43     ` Vlastimil Babka
2015-04-23 21:04 ` [PATCHv5 26/28] thp: introduce deferred_split_huge_page() Kirill A. Shutemov
2015-04-23 21:04   ` Kirill A. Shutemov
2015-05-19 13:54   ` Vlastimil Babka
2015-05-19 13:54     ` Vlastimil Babka
2015-04-23 21:04 ` [PATCHv5 27/28] mm: re-enable THP Kirill A. Shutemov
2015-04-23 21:04   ` Kirill A. Shutemov
2015-04-23 21:04 ` [PATCHv5 28/28] thp: update documentation Kirill A. Shutemov
2015-04-23 21:04   ` Kirill A. Shutemov
2015-04-27 23:03 ` [PATCHv5 00/28] THP refcounting redesign Andrew Morton
2015-04-27 23:03   ` Andrew Morton
2015-04-27 23:33   ` Kirill A. Shutemov
2015-04-27 23:33     ` Kirill A. Shutemov
2015-04-30  8:25 ` [RFC PATCH 0/3] Remove _PAGE_SPLITTING from ppc64 Aneesh Kumar K.V
2015-04-30  8:25   ` Aneesh Kumar K.V
2015-04-30  8:25   ` [RFC PATCH 1/3] mm/thp: Use pmdp_splitting_flush_notify to clear pmd on splitting Aneesh Kumar K.V
2015-04-30  8:25     ` Aneesh Kumar K.V
2015-04-30 13:30     ` Kirill A. Shutemov
2015-04-30 13:30       ` Kirill A. Shutemov
2015-04-30 15:59       ` Aneesh Kumar K.V
2015-04-30 15:59         ` Aneesh Kumar K.V
2015-04-30 16:47         ` Aneesh Kumar K.V
2015-04-30 16:47           ` Aneesh Kumar K.V
2015-04-30  8:25   ` [RFC PATCH 2/3] powerpc/thp: Remove _PAGE_SPLITTING and related code Aneesh Kumar K.V
2015-04-30  8:25     ` Aneesh Kumar K.V
2015-04-30  8:25   ` [RFC PATCH 3/3] mm/thp: Add new function to clear pmd on collapse Aneesh Kumar K.V
2015-04-30  8:25     ` Aneesh Kumar K.V
2015-05-15  8:55 ` [PATCHv5 00/28] THP refcounting redesign Vlastimil Babka
2015-05-15  8:55   ` Vlastimil Babka
2015-05-15 13:31   ` Kirill A. Shutemov
2015-05-15 13:31     ` Kirill A. Shutemov
