From: Mike Kravetz <mike.kravetz@oracle.com>
To: Matthew Wilcox <willy@infradead.org>,
Waiman Long <longman@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>, Will Deacon <will.deacon@arm.com>
Subject: Re: [PATCH] hugetlbfs: Take read_lock on i_mmap for PMD sharing
Date: Fri, 8 Nov 2019 17:47:06 -0800
Message-ID: <5059733e-95aa-2c9e-6f5d-4f45f6a130b3@oracle.com>
In-Reply-To: <ea057d15-5205-9992-af95-b2727df577c4@oracle.com>
On 11/8/19 11:10 AM, Mike Kravetz wrote:
> On 11/7/19 6:04 PM, Davidlohr Bueso wrote:
>> On Thu, 07 Nov 2019, Mike Kravetz wrote:
>>
>>> Note that huge_pmd_share now increments the page count with the semaphore
>>> held just in read mode. It is OK to do increments in parallel without
>>> synchronization. However, we don't want anyone else changing the count
>>> while that check in huge_pmd_unshare is happening. Hence, the need for
>>> taking the semaphore in write mode.
>>
>> This would be a nice addition to the changelog methinks.
>
> Last night I remembered there is one place where we currently take
> i_mmap_rwsem in read mode and potentially call huge_pmd_unshare. That
> is in try_to_unmap_one. Yes, there is a potential race here today.
Actually, there is no race there today. Callers of huge_pmd_unshare
hold the page table lock, which synchronizes the unshare calls coming
from page migration and page poisoning.
> But that race is somewhat contained as you need two threads doing some
> combination of page migration and page poisoning to race. This change
> now allows migration or poisoning to race with page fault. I would
> really prefer if we do not open up the race window in this manner.
However, we do open a race window if huge_pmd_share is changed to take
i_mmap_rwsem in read mode, as in the original patch.
Here is the additional code needed to take the semaphore in write mode
for the huge_pmd_unshare calls made via try_to_unmap_one. It would need to
be combined with Waiman's patch. Please take a look and provide feedback.
Some of the changes are subtle, especially the exception for MAP_PRIVATE
mappings, but I tried to add sufficient comments.
From 21735818a520705c8573b8d543b8f91aa187bd5d Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Fri, 8 Nov 2019 17:25:37 -0800
Subject: [PATCH] Changes needed for taking i_mmap_rwsem in write mode before
call to huge_pmd_unshare in try_to_unmap_one.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c        |  9 ++++++++-
 mm/memory-failure.c | 28 +++++++++++++++++++++++++++-
 mm/migrate.c        | 27 +++++++++++++++++++++++++--
 3 files changed, 60 insertions(+), 4 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f78891f92765..73d9136549a5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4883,7 +4883,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
  * indicated by page_count > 1, unmap is achieved by clearing pud and
  * decrementing the ref count. If count == 1, the pte page is not shared.
  *
- * called with page table lock held.
+ * Must be called while holding page table lock.
+ * In general, the caller should also hold the i_mmap_rwsem in write mode.
+ * This is to prevent races with page faults calling huge_pmd_share which
+ * will not be holding the page table lock, but will be holding i_mmap_rwsem
+ * in read mode. It is possible to call without holding i_mmap_rwsem in
+ * write mode if the caller KNOWS the page table is associated with a private
+ * mapping. This is because private mappings can not share PMDs and can
+ * not race with huge_pmd_share calls during page faults.
  *
  * returns: 1 successfully unmapped a shared pte page
  *          0 the underlying pte page is not shared, or it is the last user
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3151c87dff73..8f52b22cf71b 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1030,7 +1030,33 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	if (kill)
 		collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
 
-	unmap_success = try_to_unmap(hpage, ttu);
+	if (!PageHuge(hpage)) {
+		unmap_success = try_to_unmap(hpage, ttu);
+	} else {
+		mapping = page_mapping(hpage);
+		if (mapping) {
+			/*
+			 * For hugetlb pages, try_to_unmap could potentially
+			 * call huge_pmd_unshare. Because of this, take
+			 * semaphore in write mode here and set TTU_RMAP_LOCKED
+			 * to indicate we have taken the lock at this higher
+			 * level.
+			 */
+			i_mmap_lock_write(mapping);
+			unmap_success = try_to_unmap(hpage,
+						     ttu|TTU_RMAP_LOCKED);
+			i_mmap_unlock_write(mapping);
+		} else {
+			/*
+			 * !mapping implies a MAP_PRIVATE huge page mapping.
+			 * Since PMDs will never be shared in a private
+			 * mapping, it is safe to let huge_pmd_unshare be
+			 * called with the semaphore in read mode.
+			 */
+			unmap_success = try_to_unmap(hpage, ttu);
+		}
+	}
+
 	if (!unmap_success)
 		pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
 		       pfn, page_mapcount(hpage));
diff --git a/mm/migrate.c b/mm/migrate.c
index 4fe45d1428c8..9cae5a4f1e48 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1333,8 +1333,31 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 		goto put_anon;
 
 	if (page_mapped(hpage)) {
-		try_to_unmap(hpage,
-			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+		struct address_space *mapping = page_mapping(hpage);
+
+		if (mapping) {
+			/*
+			 * try_to_unmap could potentially call huge_pmd_unshare.
+			 * Because of this, take semaphore in write mode here
+			 * and set TTU_RMAP_LOCKED to indicate we have taken
+			 * the lock at this higher level.
+			 */
+			i_mmap_lock_write(mapping);
+			try_to_unmap(hpage,
+				TTU_MIGRATION|TTU_IGNORE_MLOCK|
+				TTU_IGNORE_ACCESS|TTU_RMAP_LOCKED);
+			i_mmap_unlock_write(mapping);
+		} else {
+			/*
+			 * !mapping implies a MAP_PRIVATE huge page mapping.
+			 * Since PMDs will never be shared in a private
+			 * mapping, it is safe to let huge_pmd_unshare be
+			 * called with the semaphore in read mode.
+			 */
+			try_to_unmap(hpage,
+				TTU_MIGRATION|TTU_IGNORE_MLOCK|
+				TTU_IGNORE_ACCESS);
+		}
 		page_was_mapped = 1;
 	}
--
2.23.0