* [PATCH RFC 0/1] hugetlbfs: fix truncate/fault races
From: Mike Kravetz @ 2018-10-07 23:38 UTC
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Michal Hocko, Hugh Dickins, Naoya Horiguchi,
	Aneesh Kumar K . V, Andrea Arcangeli, Kirill A . Shutemov,
	Davidlohr Bueso, Mike Kravetz

Our DB team noticed negative hugetlb reserved page counts during development
testing.  The relevant meminfo fields on one system were as follows:

HugePages_Total:   47143
HugePages_Free:    45610
HugePages_Rsvd:    18446744073709551613
HugePages_Surp:        0
Hugepagesize:       2048 kB 

Code inspection revealed that the most likely cause was races between
truncate and page faults.  In fact, I was able to write a not too
complicated program that triggers the races and recreates the issue.
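
For reference, a minimal reproducer along these lines (an illustrative
sketch only, not the exact program used; the hugetlbfs mount point,
2MB huge page size and iteration counts are assumptions) looks like:

#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)		/* assumed huge page size */
#define NR_HPAGES	4UL
#define FILE_PATH	"/dev/hugepages/race-test"	/* assumed mount point */

static sigjmp_buf env;

static void sigbus_handler(int sig)
{
	/* touching an area truncated away raises SIGBUS; just retry */
	siglongjmp(env, 1);
}

int main(void)
{
	char *addr;
	int fd, i;

	fd = open(FILE_PATH, O_CREAT | O_RDWR, 0600);
	if (fd < 0)
		exit(1);

	if (fork() == 0) {
		/* child: extend the file and fault its pages, repeatedly */
		signal(SIGBUS, sigbus_handler);
		for (i = 0; i < 100000; i++) {
			ftruncate(fd, NR_HPAGES * HPAGE_SIZE);
			addr = mmap(NULL, NR_HPAGES * HPAGE_SIZE,
				    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
			if (addr == MAP_FAILED)
				continue;
			if (sigsetjmp(env, 1) == 0)
				memset(addr, 0, NR_HPAGES * HPAGE_SIZE);
			munmap(addr, NR_HPAGES * HPAGE_SIZE);
		}
		exit(0);
	}

	/* parent: truncate the file to 0 while the child faults pages in */
	for (i = 0; i < 100000; i++)
		ftruncate(fd, 0);

	wait(NULL);
	unlink(FILE_PATH);
	return 0;
}

Watching HugePages_Rsvd in /proc/meminfo while something like this runs,
the count can be seen wrapping "negative" as above.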

Way back in 2006, Hugh Dickins created a patch (ebed4bfc8da8) with this
message:

"[PATCH] hugetlb: fix absurd HugePages_Rsvd
    
 If you truncated an mmap'ed hugetlbfs file, then faulted on the truncated
 area, /proc/meminfo's HugePages_Rsvd wrapped hugely "negative".  Reinstate my
 preliminary i_size check before attempting to allocate the page (though this
 only fixes the most obvious case: more work will be needed here)."

Looks like we need to do more work.

While looking at the code, it became clear that correctly handling the
races and backing out partially made changes would require many fixes.
Instead, why not simply introduce an rw semaphore to prevent the races?
Page faults would take the semaphore in read mode, allowing multiple
faults in parallel as they do today.  Truncate code would take the
semaphore in write mode, blocking faults for the duration of truncate
processing.  This seems almost too obvious; something must be wrong
with this approach, or others would have employed it earlier.
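
In outline, the scheme is simply the following (a sketch of the intended
usage; trunc_rwsem is the new rw semaphore added by the patch below):

	/* page fault path (hugetlb_fault) */
	down_read(&HUGETLBFS_I(inode)->trunc_rwsem);
	/* ... take the fault mutex and handle the fault as today ... */
	up_read(&HUGETLBFS_I(inode)->trunc_rwsem);

	/* truncate path (hugetlb_vmtruncate) */
	down_write(&HUGETLBFS_I(inode)->trunc_rwsem);
	/* ... set i_size, unmap, remove pages, adjust reserves ... */
	up_write(&HUGETLBFS_I(inode)->trunc_rwsem);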

The following patch describes the current race in detail and adds the mutex
to prevent truncate/fault races.

Mike Kravetz (1):
  hugetlbfs: introduce truncation/fault mutex to avoid races

 fs/hugetlbfs/inode.c    | 24 ++++++++++++++++++++----
 include/linux/hugetlb.h |  1 +
 mm/hugetlb.c            | 25 +++++++++++++++++++------
 mm/userfaultfd.c        |  8 +++++++-
 4 files changed, 47 insertions(+), 11 deletions(-)

-- 
2.17.1


* [PATCH RFC 1/1] hugetlbfs: introduce truncation/fault mutex to avoid races
From: Mike Kravetz @ 2018-10-07 23:38 UTC
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, Michal Hocko, Hugh Dickins, Naoya Horiguchi,
	Aneesh Kumar K . V, Andrea Arcangeli, Kirill A . Shutemov,
	Davidlohr Bueso, Mike Kravetz

The following hugetlbfs truncate/page fault race can be recreated
with a program doing something like the following.

A hugetlbfs file is mmap'ed MAP_SHARED with a size of 4 huge pages.
At mmap time, 4 huge pages are reserved for the file/mapping.  So,
the global reserve count is 4.  In addition, since this is a shared
mapping, an entry for 4 pages is added to the file's reserve map.
The first 3 of the 4 pages are faulted into the file.  As a result,
the global reserve count is now 1.

Task A starts to fault in the last page (routines hugetlb_fault,
hugetlb_no_page).  It allocates a huge page (alloc_huge_page).
The reserve map indicates there is a reserved page, so this is
used and the global reserve count goes to 0.

Now, task B truncates the file to size 0.  It starts by setting the
inode size to 0 (hugetlb_vmtruncate).  It then unmaps all mappings
of the file (hugetlb_vmdelete_list).  Since task A's page table
lock is not held at the time, truncation is not blocked.  Truncation
removes the 3 pages from the file (remove_inode_hugepages).  When
cleaning up the reserved pages (hugetlb_unreserve_pages), it notices
the reserve map was for 4 pages.  However, it has only freed 3 pages.
So it assumes there is still (4 - 3 =) 1 reserved page.  It then
decrements the global reserve count by 1, and the count goes negative.

Task A then continues the page fault process and adds its newly
acquired page to the page cache.  Note that the index of this page
is beyond the size of the truncated file (0).  The page fault process
then notices the file has been truncated and exits.  However, the
page is left in the cache associated with the file.

Now, if the file is immediately deleted, the truncate code runs again.
It will find and free the one page associated with the file.  When
cleaning up reserves, it notices the reserve map is empty, yet one
page was freed.  So, the global reserve count is decremented by
(0 - 1) = -1.  This returns the global count to 0 as it should be.
But, it is possible for someone else to mmap this file/range before
it is deleted.  If this happens, a reserve map entry for the allocated
page is created and the reserved page is leaked forever.
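
To summarize, the global reserve count moves as follows during the
above sequence (all values taken from the description above):

  mmap(MAP_SHARED) of 4 pages, all reserved:          4
  first 3 pages faulted in:                    4 - 3       =  1
  task A faults in the last page:              1 - 1       =  0
  task B truncates (3 freed, map says 4):      0 - (4 - 3) = -1
  file deleted (1 freed, map says 0):         -1 - (0 - 1) =  0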

To avoid all these conditions, let's simply prevent faults to a file
while it is being truncated.  Add a new truncation-specific rw
semaphore to the hugetlbfs inode extensions.  Faults take the
semaphore in read mode; truncation takes it in write mode.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c    | 24 ++++++++++++++++++++----
 include/linux/hugetlb.h |  1 +
 mm/hugetlb.c            | 25 +++++++++++++++++++------
 mm/userfaultfd.c        |  8 +++++++-
 4 files changed, 47 insertions(+), 11 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 40d4c66c7751..07b0ba049c37 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -427,10 +427,17 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			u32 hash;
 
 			index = page->index;
-			hash = hugetlb_fault_mutex_hash(h, current->mm,
+			/*
+			 * Only need to acquire fault mutex in hole punch case.
+			 * For truncation, we are synchronized via truncation
+			 * mutex.
+			 */
+			if (!truncate_op) {
+				hash = hugetlb_fault_mutex_hash(h, current->mm,
 							&pseudo_vma,
 							mapping, index, 0);
-			mutex_lock(&hugetlb_fault_mutex_table[hash]);
+				mutex_lock(&hugetlb_fault_mutex_table[hash]);
+			}
 
 			/*
 			 * If page is mapped, it was faulted in after being
@@ -471,7 +478,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			}
 
 			unlock_page(page);
-			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			if (!truncate_op)
+				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
 		huge_pagevec_release(&pvec);
 		cond_resched();
@@ -498,16 +506,19 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	pgoff_t pgoff;
 	struct address_space *mapping = inode->i_mapping;
 	struct hstate *h = hstate_inode(inode);
+	struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode);
 
 	BUG_ON(offset & ~huge_page_mask(h));
 	pgoff = offset >> PAGE_SHIFT;
 
+	down_write(&info->trunc_rwsem);
 	i_size_write(inode, offset);
 	i_mmap_lock_write(mapping);
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
 	i_mmap_unlock_write(mapping);
 	remove_inode_hugepages(inode, offset, LLONG_MAX);
+	up_write(&info->trunc_rwsem);
 	return 0;
 }
 
@@ -626,7 +637,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		/* addr is the offset within the file (zero based) */
 		addr = index * hpage_size;
 
-		/* mutex taken here, fault path and hole punch */
+		/*
+		 * mutex taken here, for fault path and hole punch.
+		 * No need to worry about truncation as we are synchronized
+		 * with inode mutex
+		 */
 		hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
 						index, addr);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -761,6 +776,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
 		inode->i_mapping->private_data = resv_map;
 		info->seals = F_SEAL_SEAL;
+		init_rwsem(&info->trunc_rwsem);
 		switch (mode & S_IFMT) {
 		default:
 			init_special_inode(inode, mode, dev);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 36fa6a2a82e3..73844107ee8a 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -277,6 +277,7 @@ struct hugetlbfs_inode_info {
 	struct shared_policy policy;
 	struct inode vfs_inode;
 	unsigned int seals;
+	struct rw_semaphore trunc_rwsem;
 };
 
 static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3103099f64fd..10142c922aab 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3696,6 +3696,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t new_pte;
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
+	struct hugetlbfs_inode_info *hinode_info = HUGETLBFS_I(mapping->host);
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -3738,14 +3739,18 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			};
 
 			/*
-			 * hugetlb_fault_mutex must be dropped before
-			 * handling userfault.  Reacquire after handling
-			 * fault to make calling code simpler.
+			 * hugetlb_fault_mutex and truncation mutex must be
+			 * dropped before handling userfault.  Reacquire after
+			 * handling fault to make calling code simpler.
 			 */
 			hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping,
 							idx, haddr);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			up_read(&hinode_info->trunc_rwsem);
+
 			ret = handle_userfault(&vmf, VM_UFFD_MISSING);
+
+			down_read(&hinode_info->trunc_rwsem);
 			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 			goto out;
 		}
@@ -3894,6 +3899,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct address_space *mapping;
 	int need_wait_lock = 0;
 	unsigned long haddr = address & huge_page_mask(h);
+	struct hugetlbfs_inode_info *hinode_info;
 
 	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (ptep) {
@@ -3914,10 +3920,16 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	idx = vma_hugecache_offset(h, vma, haddr);
 
 	/*
-	 * Serialize hugepage allocation and instantiation, so that we don't
-	 * get spurious allocation failures if two CPUs race to instantiate
-	 * the same page in the page cache.
+	 * Use truncate mutex to serialize truncation and page faults.  This
+	 * prevents ANY faults from happening on the file during truncation.
+	 * The fault mutex serializes hugepage allocation and instantiation
+	 * on the same page.  This prevents spurious allocation failures if
+	 * two CPUs race to instantiate the same page in the page cache.
+	 *
+	 * Acquire truncate mutex BEFORE fault mutex.
 	 */
+	hinode_info = HUGETLBFS_I(mapping->host);
+	down_read(&hinode_info->trunc_rwsem);
 	hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, haddr);
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
@@ -4005,6 +4017,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 out_mutex:
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+	up_read(&hinode_info->trunc_rwsem);
 	/*
 	 * Generally it's safe to hold refcount during waiting page lock. But
 	 * here we just wait to defer the next page fault to avoid busy loop and
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 5029f241908f..554d1731028e 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -169,6 +169,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 	pgoff_t idx;
 	u32 hash;
 	struct address_space *mapping;
+	struct hugetlbfs_inode_info *hinode_info;
 
 	/*
 	 * There is no default zero huge page for all huge page sizes as
@@ -244,10 +245,12 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		VM_BUG_ON(dst_addr & ~huge_page_mask(h));
 
 		/*
-		 * Serialize via hugetlb_fault_mutex
+		 * Serialize via truncation and hugetlb_fault_mutex
 		 */
 		idx = linear_page_index(dst_vma, dst_addr);
 		mapping = dst_vma->vm_file->f_mapping;
+		hinode_info = HUGETLBFS_I(mapping->host);
+		down_read(&hinode_info->trunc_rwsem);
 		hash = hugetlb_fault_mutex_hash(h, dst_mm, dst_vma, mapping,
 								idx, dst_addr);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -256,6 +259,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		dst_pte = huge_pte_alloc(dst_mm, dst_addr, huge_page_size(h));
 		if (!dst_pte) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			up_read(&hinode_info->trunc_rwsem);
 			goto out_unlock;
 		}
 
@@ -263,6 +267,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		dst_pteval = huge_ptep_get(dst_pte);
 		if (!huge_pte_none(dst_pteval)) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			up_read(&hinode_info->trunc_rwsem);
 			goto out_unlock;
 		}
 
@@ -270,6 +275,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 						dst_addr, src_addr, &page);
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+		up_read(&hinode_info->trunc_rwsem);
 		vm_alloc_shared = vm_shared;
 
 		cond_resched();
-- 
2.17.1


* Re: [PATCH RFC 1/1] hugetlbfs: introduce truncation/fault mutex to avoid races
From: Kirill A. Shutemov @ 2018-10-08  8:03 UTC
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Andrew Morton, Michal Hocko,
	Hugh Dickins, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso

On Sun, Oct 07, 2018 at 04:38:48PM -0700, Mike Kravetz wrote:
> To avoid all these conditions, let's simply prevent faults to a file
> while it is being truncated.  Add a new truncation-specific rw
> semaphore to the hugetlbfs inode extensions.  Faults take the
> semaphore in read mode; truncation takes it in write mode.

Hm. Don't we have already a lock for this? I mean i_mmap_lock.

-- 
 Kirill A. Shutemov

* Re: [PATCH RFC 1/1] hugetlbfs: introduce truncation/fault mutex to avoid races
From: Mike Kravetz @ 2018-10-09  0:20 UTC
  To: Kirill A. Shutemov
  Cc: linux-mm, linux-kernel, Andrew Morton, Michal Hocko,
	Hugh Dickins, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso

On 10/8/18 1:03 AM, Kirill A. Shutemov wrote:
> On Sun, Oct 07, 2018 at 04:38:48PM -0700, Mike Kravetz wrote:
>> To avoid all these conditions, let's simply prevent faults to a file
>> while it is being truncated.  Add a new truncation-specific rw
>> semaphore to the hugetlbfs inode extensions.  Faults take the
>> semaphore in read mode; truncation takes it in write mode.
> 
> Hm. Don't we have already a lock for this? I mean i_mmap_lock.
> 

Thanks Kirill,

Yes, we could use i_mmap_rwsem for this synchronization.  I don't
see anyone else using it in this manner.  hugetlb code only explicitly
takes i_mmap_rwsem in write mode today.  I suspect that is not
optimal and could be improved.  Certainly, the use within
hugetlb_fault->huge_pte_alloc->huge_pmd_share would need to be changed
if we always wanted to take the semaphore in read mode during faults.

I'll work on the changes to use i_mmap_rwsem.
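
Roughly, I am thinking of something like the following in the fault
path (just a sketch of the direction, untested):

	/* hugetlb_fault(): take i_mmap_rwsem instead of a new rwsem */
	mapping = vma->vm_file->f_mapping;
	i_mmap_lock_read(mapping);
	hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, haddr);
	mutex_lock(&hugetlb_fault_mutex_table[hash]);
	...
	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
	i_mmap_unlock_read(mapping);

The truncate side would then need to hold i_mmap_lock_write(mapping)
across the entire truncate operation, not just hugetlb_vmdelete_list(),
and huge_pmd_share()'s use of the semaphore would have to be reworked
as noted above.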

However, right now our DB team informs me that the truncate/fault race
is not the cause of their negative huge page reserve count issue.
So, I am searching for more bugs in this area.  I found another where
an allocation for migration could race with a fault in a VM_NORESERVE
vma.  But, there were no migrations noted on the system, so there must
be yet another bug.  Sigh!
-- 
Mike Kravetz
