linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mike Kravetz <mike.kravetz@oracle.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@kernel.org>, Hugh Dickins <hughd@google.com>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Prakash Sangappa <prakash.sangappa@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	kernel test robot <rong.a.chen@intel.com>
Subject: [RFC PATCH 3/3] huegtlbfs: handle page fault/truncate races
Date: Mon,  6 Jul 2020 13:26:15 -0700	[thread overview]
Message-ID: <20200706202615.32111-4-mike.kravetz@oracle.com> (raw)
In-Reply-To: <20200706202615.32111-1-mike.kravetz@oracle.com>

A huegtlb page fault can race with page truncation.  Make the code
identifying and handling these races more robust.

Page fault handling needs to back out pages added to page cache beyond
file size (i_size).  When backing out the page, take care to restore
reserve map  entries and counts as necessary.

File truncation (remove_inode_hugepages) needs to handle page mapping
changes before locking the page.  This could happen if page was added
to page cache and later backed out in fault processing.

Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c | 41 ++++++++++++++++++++++-------------------
 mm/hugetlb.c         | 37 +++++++++++++++++++++++++++++++++++--
 2 files changed, 57 insertions(+), 21 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index b4bb82815dd4..eeddd43b8809 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -494,13 +494,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			 * unmapped in caller.  Unmap (again) now after taking
 			 * the fault mutex.  The mutex will prevent faults
 			 * until we finish removing the page.
-			 *
-			 * This race can only happen in the hole punch case.
-			 * Getting here in a truncate operation is a bug.
 			 */
 			if (unlikely(page_mapped(page))) {
-				BUG_ON(truncate_op);
-
 				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 				i_mmap_lock_write(mapping);
 				mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -512,23 +507,31 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 
 			lock_page(page);
 			/*
-			 * We must free the huge page and remove from page
-			 * cache (remove_huge_page) BEFORE removing the
-			 * region/reserve map (hugetlb_unreserve_pages).  In
-			 * rare out of memory conditions, removal of the
-			 * region/reserve map could fail. Correspondingly,
-			 * the subpool and global reserve usage count can need
-			 * to be adjusted.
+			 * After locking page, make sure mapping is the same.
+			 * We could have raced with page fault populate and
+			 * backout code.
 			 */
-			VM_BUG_ON(PagePrivate(page));
-			remove_huge_page(page);
-			freed++;
-			if (!truncate_op) {
-				if (unlikely(hugetlb_unreserve_pages(inode,
+			if (page_mapping(page) == mapping) {
+				/*
+				 * We must free the huge page and remove from
+				 * page cache (remove_huge_page) BEFORE
+				 * removing the region/reserve map.  In rare
+				 * out of memory conditions, removal of the
+				 * region/reserve map could fail.
+				 * Correspondingly, the subpool and global
+				 * reserve usage count can need to be adjusted.
+				 */
+				VM_BUG_ON(PagePrivate(page));
+				remove_huge_page(page);
+				freed++;
+				if (!truncate_op) {
+					if (unlikely(
+					    hugetlb_unreserve_pages(inode,
 							index, index + 1, 1)))
-					hugetlb_fix_reserve_counts(inode);
+						hugetlb_fix_reserve_counts(
+							inode);
+				}
 			}
-
 			unlock_page(page);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6e9085464e78..68785cc80523 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4339,6 +4339,9 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
 	bool new_page = false;
+	bool page_cache = false;
+	bool reserve_alloc = false;
+	bool beyond_i_size = false;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -4423,6 +4426,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		clear_huge_page(page, address, pages_per_huge_page(h));
 		__SetPageUptodate(page);
 		new_page = true;
+		if (PagePrivate(page))
+			reserve_alloc = true;
 
 		if (vma->vm_flags & VM_MAYSHARE) {
 			int err = huge_add_to_page_cache(page, mapping, idx);
@@ -4432,6 +4437,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 					goto retry;
 				goto out;
 			}
+			page_cache = true;
 		} else {
 			lock_page(page);
 			if (unlikely(anon_vma_prepare(vma))) {
@@ -4470,8 +4476,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 
 	ptl = huge_pte_lock(h, mm, ptep);
 	size = i_size_read(mapping->host) >> huge_page_shift(h);
-	if (idx >= size)
+	if (idx >= size) {
+		beyond_i_size = true;
 		goto backout;
+	}
 
 	ret = 0;
 	if (!huge_pte_none(huge_ptep_get(ptep)))
@@ -4509,8 +4517,33 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 backout:
 	spin_unlock(ptl);
 backout_unlocked:
+	if (new_page) {
+		/*
+		 * Back out pages added to page cache beyond i_size.  Otherwise,
+		 * they will 'sit' there until the file is removed.
+		 */
+		if (page_cache && beyond_i_size) {
+			/* FIXME - following lines are remove_huge_page() */
+			ClearPageDirty(page);
+			ClearPageUptodate(page);
+			delete_from_page_cache(page);
+		}
+
+		/*
+		 * If reserve was consumed, set PagePrivate so that it will
+		 * be restored in free_huge_page().
+		 */
+		if (reserve_alloc)
+			SetPagePrivate(page);
+
+		/*
+		 * Do not restore reserve map entries beyond i_size.  Otherwise,
+		 * there will be leaks when the file is removed.
+		 */
+		if (!beyond_i_size)
+			restore_reserve_on_error(h, vma, haddr, page);
+	}
 	unlock_page(page);
-	restore_reserve_on_error(h, vma, haddr, page);
 	put_page(page);
 	goto out;
 }
-- 
2.25.4


  parent reply	other threads:[~2020-07-06 20:28 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-22  0:55 [hugetlbfs] c0d0381ade: vm-scalability.throughput -33.4% regression kernel test robot
2020-06-22 22:01 ` Mike Kravetz
2020-06-25 21:33   ` Mike Kravetz
2020-08-21  8:39     ` [LKP] " Xing Zhengjun
2020-08-21 21:02       ` Mike Kravetz
2020-08-21 23:36         ` Mike Kravetz
2020-10-12  5:29           ` Xing Zhengjun
2020-10-12 17:40             ` Mike Kravetz
2020-10-13  1:59               ` Xing Zhengjun
2020-10-13  3:01                 ` Mike Kravetz
2020-10-13  6:05                   ` Xing Zhengjun
2020-07-06 20:26 ` [RFC PATCH 0/3] hugetlbfs: address fault time regression Mike Kravetz
2020-07-06 20:26   ` [RFC PATCH 1/3] Revert: "hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race" Mike Kravetz
2020-07-06 20:26   ` [RFC PATCH 2/3] hugetlbfs: Only take i_mmap_rwsem when sharing is possible Mike Kravetz
2020-07-21  0:56     ` [hugetlbfs] 878308e2e0: vm-scalability.throughput 1.2% improvement kernel test robot
2020-07-06 20:26   ` Mike Kravetz [this message]
2020-10-13 23:10 [RFC PATCH 0/3] hugetlbfs: introduce hinode_rwsem for pmd sharing synchronization Mike Kravetz
2020-10-13 23:11 ` [RFC PATCH 3/3] huegtlbfs: handle page fault/truncate races Mike Kravetz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200706202615.32111-4-mike.kravetz@oracle.com \
    --to=mike.kravetz@oracle.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.vnet.ibm.com \
    --cc=dave@stgolabs.net \
    --cc=hughd@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=n-horiguchi@ah.jp.nec.com \
    --cc=prakash.sangappa@oracle.com \
    --cc=rong.a.chen@intel.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).