linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
To: Andrew Morton <akpm@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>,
	Hugh Dickins <hughd@google.com>, Mel Gorman <mgorman@suse.de>,
	Rik van Riel <riel@redhat.com>, Vlastimil Babka <vbabka@suse.cz>,
	Christoph Lameter <cl@gentwo.org>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Steve Capper <steve.capper@linaro.org>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.cz>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Subject: [PATCH 16/19] thp: update documentation
Date: Wed,  5 Nov 2014 16:49:51 +0200	[thread overview]
Message-ID: <1415198994-15252-17-git-send-email-kirill.shutemov@linux.intel.com> (raw)
In-Reply-To: <1415198994-15252-1-git-send-email-kirill.shutemov@linux.intel.com>

The patch updates Documentation/vm/transhuge.txt to reflect changes in
THP design.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 Documentation/vm/transhuge.txt | 84 +++++++++++++++++++-----------------------
 1 file changed, 38 insertions(+), 46 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index df1794a9071f..33465e7b0d9b 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -200,9 +200,18 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range
 	of pages that should be collapsed into one huge page but failed
 	the allocation.
 
-thp_split is incremented every time a huge page is split into base
+thp_split_page is incremented every time a huge page is split into base
 	pages. This can happen for a variety of reasons but a common
 	reason is that a huge page is old and is being reclaimed.
+	This action implies splitting all PMD the page mapped with.
+
+thp_split_page_failed is is incremented if kernel fails to split huge
+	page. This can happen if the page was pinned by somebody.
+
+thp_split_pmd is incremented every time a PMD split into table of PTEs.
+	This can happen, for instance, when application calls mprotect() or
+	munmap() on part of huge page. It doesn't split huge page, only
+	page table entry.
 
 thp_zero_page_alloc is incremented every time a huge zero page is
 	successfully allocated. It includes allocations which where
@@ -280,9 +289,9 @@ unaffected. libhugetlbfs will also work fine as usual.
 == Graceful fallback ==
 
 Code walking pagetables but unware about huge pmds can simply call
-split_huge_page_pmd(vma, addr, pmd) where the pmd is the one returned by
+split_huge_pmd(vma, pmd, addr) where the pmd is the one returned by
 pmd_offset. It's trivial to make the code transparent hugepage aware
-by just grepping for "pmd_offset" and adding split_huge_page_pmd where
+by just grepping for "pmd_offset" and adding split_huge_pmd where
 missing after pmd_offset returns the pmd. Thanks to the graceful
 fallback design, with a one liner change, you can avoid to write
 hundred if not thousand of lines of complex code to make your code
@@ -291,7 +300,8 @@ hugepage aware.
 If you're not walking pagetables but you run into a physical hugepage
 but you can't handle it natively in your code, you can split it by
 calling split_huge_page(page). This is what the Linux VM does before
-it tries to swapout the hugepage for example.
+it tries to swapout the hugepage for example. split_huge_page can fail
+if the page is pinned and you must handle this correctly.
 
 Example to make mremap.c transparent hugepage aware with a one liner
 change:
@@ -303,14 +313,14 @@ diff --git a/mm/mremap.c b/mm/mremap.c
 		return NULL;
 
 	pmd = pmd_offset(pud, addr);
-+	split_huge_page_pmd(vma, addr, pmd);
++	split_huge_pmd(vma, pmd, addr);
 	if (pmd_none_or_clear_bad(pmd))
 		return NULL;
 
 == Locking in hugepage aware code ==
 
 We want as much code as possible hugepage aware, as calling
-split_huge_page() or split_huge_page_pmd() has a cost.
+split_huge_page() or split_huge_pmd() has a cost.
 
 To make pagetable walks huge pmd aware, all you need to do is to call
 pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
@@ -319,47 +329,29 @@ created from under you by khugepaged (khugepaged collapse_huge_page
 takes the mmap_sem in write mode in addition to the anon_vma lock). If
 pmd_trans_huge returns false, you just fallback in the old code
 paths. If instead pmd_trans_huge returns true, you have to take the
-mm->page_table_lock and re-run pmd_trans_huge. Taking the
-page_table_lock will prevent the huge pmd to be converted into a
-regular pmd from under you (split_huge_page can run in parallel to the
+page table lock (pmd_lock()) and re-run pmd_trans_huge. Taking the
+page table lock will prevent the huge pmd to be converted into a
+regular pmd from under you (split_huge_pmd can run in parallel to the
 pagetable walk). If the second pmd_trans_huge returns false, you
 should just drop the page_table_lock and fallback to the old code as
-before. Otherwise you should run pmd_trans_splitting on the pmd. In
-case pmd_trans_splitting returns true, it means split_huge_page is
-already in the middle of splitting the page. So if pmd_trans_splitting
-returns true it's enough to drop the page_table_lock and call
-wait_split_huge_page and then fallback the old code paths. You are
-guaranteed by the time wait_split_huge_page returns, the pmd isn't
-huge anymore. If pmd_trans_splitting returns false, you can proceed to
-process the huge pmd and the hugepage natively. Once finished you can
-drop the page_table_lock.
-
-== compound_lock, get_user_pages and put_page ==
+before. Otherwise you can proceed to process the huge pmd and the
+hugepage natively. Once finished you can drop the page_table_lock.
+
+== Refcounts and transparent huge pages ==
 
+As with other compound page types we do all refcounting for THP on head
+page, but unlike other compound pages THP support splitting.
 split_huge_page internally has to distribute the refcounts in the head
-page to the tail pages before clearing all PG_head/tail bits from the
-page structures. It can do that easily for refcounts taken by huge pmd
-mappings. But the GUI API as created by hugetlbfs (that returns head
-and tail pages if running get_user_pages on an address backed by any
-hugepage), requires the refcount to be accounted on the tail pages and
-not only in the head pages, if we want to be able to run
-split_huge_page while there are gup pins established on any tail
-page. Failure to be able to run split_huge_page if there's any gup pin
-on any tail page, would mean having to split all hugepages upfront in
-get_user_pages which is unacceptable as too many gup users are
-performance critical and they must work natively on hugepages like
-they work natively on hugetlbfs already (hugetlbfs is simpler because
-hugetlbfs pages cannot be split so there wouldn't be requirement of
-accounting the pins on the tail pages for hugetlbfs). If we wouldn't
-account the gup refcounts on the tail pages during gup, we won't know
-anymore which tail page is pinned by gup and which is not while we run
-split_huge_page. But we still have to add the gup pin to the head page
-too, to know when we can free the compound page in case it's never
-split during its lifetime. That requires changing not just
-get_page, but put_page as well so that when put_page runs on a tail
-page (and only on a tail page) it will find its respective head page,
-and then it will decrease the head page refcount in addition to the
-tail page refcount. To obtain a head page reliably and to decrease its
-refcount without race conditions, put_page has to serialize against
-__split_huge_page_refcount using a special per-page lock called
-compound_lock.
+page to the tail pages before clearing all PG_head/tail bits from the page
+structures. It can be done easily for refcounts taken by page table
+entries. But we don't have enough information on how to distribute any
+additional pins (i.e. from get_user_pages). split_huge_page fails any
+requests to split pinned huge page: it expects page count to be equal to
+sum of mapcount of all sub-pages plus one (split_huge_page caller must
+have reference for head page).
+
+split_huge_page uses per-page compound_lock to protect page->_count to be
+updated by get_page()/put_page() on tail pages.
+
+Note that split_huge_pmd doesn't have any limitation on refcounting: PMD
+can be split at any point and never fails.
-- 
2.1.1


  parent reply	other threads:[~2014-11-05 14:50 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-11-05 14:49 [PATCHv2 RFC 00/19] THP refcounting redesign Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 01/19] mm, thp: drop FOLL_SPLIT Kirill A. Shutemov
2014-11-25  3:01   ` Naoya Horiguchi
2014-11-25 14:04     ` Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 02/19] thp: cluster split_huge_page* code together Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 03/19] mm: change PageAnon() to work on tail pages Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 04/19] mm: avoid PG_locked " Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 05/19] rmap: add argument to charge compound page Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 06/19] mm: store mapcount for compound page separate Kirill A. Shutemov
2014-11-18  8:43   ` Naoya Horiguchi
2014-11-18  9:58     ` Kirill A. Shutemov
2014-11-18 23:41       ` Naoya Horiguchi
2014-11-19  0:54         ` Kirill A. Shutemov
2014-11-21  6:41       ` Aneesh Kumar K.V
2014-11-21 11:47         ` Kirill A. Shutemov
2014-11-19 10:51   ` Jerome Marchand
2014-11-19 13:00     ` Kirill A. Shutemov
2014-11-19 13:15       ` Jerome Marchand
2014-11-20 20:06       ` Christoph Lameter
2014-11-21 12:01         ` Kirill A. Shutemov
2014-11-21  6:12   ` Aneesh Kumar K.V
2014-11-21 12:02     ` Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 07/19] mm, thp: adjust conditions when we can reuse the page on WP fault Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 08/19] mm: prepare migration code for new THP refcounting Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 09/19] thp: rename split_huge_page_pmd() to split_huge_pmd() Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 10/19] thp: PMD splitting without splitting compound page Kirill A. Shutemov
2014-11-19  6:57   ` Naoya Horiguchi
2014-11-19 13:02     ` Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 11/19] mm, vmstats: new THP splitting event Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 12/19] thp: implement new split_huge_page() Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 13/19] mm, thp: remove infrastructure for handling splitting PMDs Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 14/19] x86, thp: remove " Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 15/19] futex, thp: remove special case for THP in get_futex_key Kirill A. Shutemov
2014-11-05 14:49 ` Kirill A. Shutemov [this message]
2014-11-19  8:07   ` [PATCH 16/19] thp: update documentation Naoya Horiguchi
2014-11-19 13:11     ` Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 17/19] mlock, thp: HACK: split all pages in VM_LOCKED vma Kirill A. Shutemov
2014-11-19  9:02   ` Naoya Horiguchi
2014-11-19 13:08     ` Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 18/19] tho, mm: use migration entries to freeze page counts on split Kirill A. Shutemov
2014-11-05 14:49 ` [PATCH 19/19] mm, thp: remove compound_lock Kirill A. Shutemov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1415198994-15252-17-git-send-email-kirill.shutemov@linux.intel.com \
    --to=kirill.shutemov@linux.intel.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.vnet.ibm.com \
    --cc=cl@gentwo.org \
    --cc=dave.hansen@intel.com \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.cz \
    --cc=n-horiguchi@ah.jp.nec.com \
    --cc=riel@redhat.com \
    --cc=steve.capper@linaro.org \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).