linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/2] A couple hugetlbfs fixes
@ 2019-03-28 23:47 Mike Kravetz
  2019-03-28 23:47 ` [PATCH v2 1/2] hugetlbfs: on restore reserve error path retain subpool reservation Mike Kravetz
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Mike Kravetz @ 2019-03-28 23:47 UTC (permalink / raw)
  To: linux-kernel, linux-mm, Davidlohr Bueso
  Cc: Joonsoo Kim, Michal Hocko, Naoya Horiguchi, Kirill A . Shutemov,
	Andrew Morton, Mike Kravetz

I stumbled on these two hugetlbfs issues while looking at other things:
- The 'restore reserve' functionality at page free time should not
  be adjusting subpool counts.
- A BUG can be triggered (not easily) due to temporarily mapping a
  page before doing a COW.

Both are described in detail in the commit message of the patches.
I would appreciate comments from Davidlohr Bueso as one patch is
directly related to code he added in commit 8382d914ebf7.

I did not cc stable as the first problem has been around since reserves
were added to hugetlbfs and nobody has noticed.  The second is very hard
to hit/reproduce.

v2 - Update definition and all callers of hugetlb_fault_mutex_hash as
     the arguments mm and vma are no longer used or necessary.

Mike Kravetz (2):
  hugetlbfs: on restore reserve error path retain subpool reservation
  hugetlb: use same fault hash key for shared and private mappings

 fs/hugetlbfs/inode.c    |  7 ++-----
 include/linux/hugetlb.h |  4 +---
 mm/hugetlb.c            | 43 +++++++++++++++++++++--------------------
 mm/userfaultfd.c        |  3 +--
 4 files changed, 26 insertions(+), 31 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2 1/2] hugetlbfs: on restore reserve error path retain subpool reservation
  2019-03-28 23:47 [PATCH v2 0/2] A couple hugetlbfs fixes Mike Kravetz
@ 2019-03-28 23:47 ` Mike Kravetz
  2019-03-28 23:47 ` [PATCH v2 2/2] hugetlb: use same fault hash key for shared and private mappings Mike Kravetz
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Mike Kravetz @ 2019-03-28 23:47 UTC (permalink / raw)
  To: linux-kernel, linux-mm, Davidlohr Bueso
  Cc: Joonsoo Kim, Michal Hocko, Naoya Horiguchi, Kirill A . Shutemov,
	Andrew Morton, Mike Kravetz

When a huge page is allocated, PagePrivate() is set if the allocation
consumed a reservation.  When freeing a huge page, PagePrivate is checked.
If set, it indicates the reservation should be restored.  PagePrivate
being set at free huge page time mostly happens on error paths.

When huge page reservations are created, a check is made to determine if
the mapping is associated with an explicitly mounted filesystem.  If so,
pages are also reserved within the filesystem.  The default action when
freeing a huge page is to decrement the usage count in any associated
explicitly mounted filesystem.  However, if the reservation is to be
restored, the reservation/use count within the filesystem should not be
decremented.  Otherwise, a subsequent page allocation and free for the same
mapping location will cause the filesystem usage to go 'negative'.

Filesystem                         Size  Used Avail Use% Mounted on
nodev                              4.0G -4.0M  4.1G    - /opt/hugepool

To fix, when freeing a huge page do not adjust filesystem usage if
PagePrivate() is set to indicate the reservation should be restored.
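
The accounting in question can be sketched in a few lines of userspace C
(all names below are illustrative stand-ins for the sketch, not the
kernel's):

```c
#include <assert.h>

/*
 * Hypothetical userspace model of the subpool accounting described
 * above.  Creating a reservation charges the subpool once, up front;
 * an allocation that consumes a reservation adds no new charge.
 */
struct subpool { long used; };

static void reserve(struct subpool *sp)
{
	sp->used++;		/* charged at reservation time */
}

static void alloc_from_reserve(struct subpool *sp)
{
	(void)sp;		/* reservation consumed, no new charge */
}

/*
 * Free path.  'restore' models PagePrivate(): the reservation is being
 * restored, so the up-front charge is still owed.  'fixed' selects the
 * behavior after this patch.
 */
static void free_hpage(struct subpool *sp, int restore, int fixed)
{
	if (fixed && restore)
		return;		/* fixed: retain the subpool reservation */
	sp->used--;		/* old behavior: decrement unconditionally */
}
```

With the old behavior, an error-path free followed by a reuse of the same
reservation leaves the usage count at -1, which is the 'negative' usage
shown above; with the fix it returns to zero.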

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f79ae4e42159..8651d6a602f9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1268,12 +1268,23 @@ void free_huge_page(struct page *page)
 	ClearPagePrivate(page);
 
 	/*
-	 * A return code of zero implies that the subpool will be under its
-	 * minimum size if the reservation is not restored after page is free.
-	 * Therefore, force restore_reserve operation.
+	 * If PagePrivate() was set on page, page allocation consumed a
+	 * reservation.  If the page was associated with a subpool, there
+	 * would have been a page reserved in the subpool before allocation
+	 * via hugepage_subpool_get_pages().  Since we are 'restoring' the
+	 * reservation, do not call hugepage_subpool_put_pages() as this will
+	 * remove the reserved page from the subpool.
 	 */
-	if (hugepage_subpool_put_pages(spool, 1) == 0)
-		restore_reserve = true;
+	if (!restore_reserve) {
+		/*
+		 * A return code of zero implies that the subpool will be
+		 * under its minimum size if the reservation is not restored
+		 * after page is free.  Therefore, force restore_reserve
+		 * operation.
+		 */
+		if (hugepage_subpool_put_pages(spool, 1) == 0)
+			restore_reserve = true;
+	}
 
 	spin_lock(&hugetlb_lock);
 	clear_page_huge_active(page);
-- 
2.20.1



* [PATCH v2 2/2] hugetlb: use same fault hash key for shared and private mappings
  2019-03-28 23:47 [PATCH v2 0/2] A couple hugetlbfs fixes Mike Kravetz
  2019-03-28 23:47 ` [PATCH v2 1/2] hugetlbfs: on restore reserve error path retain subpool reservation Mike Kravetz
@ 2019-03-28 23:47 ` Mike Kravetz
  2019-04-11 18:32   ` Mike Kravetz
  2019-04-03  5:29 ` [PATCH v2 0/2] A couple hugetlbfs fixes Naoya Horiguchi
  2019-04-08 19:48 ` Davidlohr Bueso
  3 siblings, 1 reply; 8+ messages in thread
From: Mike Kravetz @ 2019-03-28 23:47 UTC (permalink / raw)
  To: linux-kernel, linux-mm, Davidlohr Bueso
  Cc: Joonsoo Kim, Michal Hocko, Naoya Horiguchi, Kirill A . Shutemov,
	Andrew Morton, Mike Kravetz

hugetlb uses a fault mutex hash table to prevent page faults of the
same pages concurrently.  The key for shared and private mappings is
different.  Shared mappings key off the address_space and file index.
Private mappings key off the mm and virtual address.  Consider a private
mapping of a populated hugetlbfs file.  A write fault will first map the
page from the file and then do a COW to map a writable page.

Hugetlbfs hole punch uses the fault mutex to prevent mappings of file
pages.  It uses the address_space file index key.  However, private
mappings will use a different key and could temporarily map the file
page before COW.  This causes problems (BUG) for the hole punch code
as it expects the mutex to prevent additional uses/mappings of the page.

There seems to be another potential COW issue/race with this approach
of different private and shared keys as noted in commit 8382d914ebf7
("mm, hugetlb: improve page-fault scalability").

Since every hugetlb mapping (even anon and private) is actually a file
mapping, just use the address_space index key for all mappings.  This
results in potentially more hash collisions.  However, this should not
be the common case.
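
The mismatch can be demonstrated with a small userspace sketch.  The
mixer below is a stand-in, not the kernel's jhash2, and the 2MB page
shift and 256-bucket table are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in mixer over a two-word key; 256 pretend mutex buckets. */
static uint32_t hash_key(unsigned long k0, unsigned long k1)
{
	uint64_t h = (uint64_t)k0 * 0x9e3779b97f4a7c15ULL ^ (uint64_t)k1;

	return (uint32_t)(h >> 32) & 255;
}

/* Old shared-mapping key: (address_space, file index). */
static uint32_t shared_hash(unsigned long mapping, unsigned long idx)
{
	return hash_key(mapping, idx);
}

/* Old private-mapping key: (mm, address >> huge_page_shift). */
static uint32_t private_hash(unsigned long mm, unsigned long addr)
{
	return hash_key(mm, addr >> 21);	/* assume 2MB huge pages */
}
```

Hole punch locks the bucket picked by shared_hash() while a private
fault locks the one picked by private_hash(), so the two can proceed
concurrently on the same file page; the patch makes both sides compute
the shared key.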

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c    |  7 ++-----
 include/linux/hugetlb.h |  4 +---
 mm/hugetlb.c            | 22 ++++++----------------
 mm/userfaultfd.c        |  3 +--
 4 files changed, 10 insertions(+), 26 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ec32fece5e1e..6189ba80b57b 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -440,9 +440,7 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			u32 hash;
 
 			index = page->index;
-			hash = hugetlb_fault_mutex_hash(h, current->mm,
-							&pseudo_vma,
-							mapping, index, 0);
+			hash = hugetlb_fault_mutex_hash(h, mapping, index, 0);
 			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 			/*
@@ -639,8 +637,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		addr = index * hpage_size;
 
 		/* mutex taken here, fault path and hole punch */
-		hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
-						index, addr);
+		hash = hugetlb_fault_mutex_hash(h, mapping, index, addr);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 		/* See if already present in mapping to avoid alloc/free */
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ea35263eb76b..3bc0d02649fe 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -123,9 +123,7 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
 void free_huge_page(struct page *page);
 void hugetlb_fix_reserve_counts(struct inode *inode);
 extern struct mutex *hugetlb_fault_mutex_table;
-u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
-				struct vm_area_struct *vma,
-				struct address_space *mapping,
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct address_space *mapping,
 				pgoff_t idx, unsigned long address);
 
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8651d6a602f9..4409a87434f1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3837,8 +3837,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			 * handling userfault.  Reacquire after handling
 			 * fault to make calling code simpler.
 			 */
-			hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping,
-							idx, haddr);
+			hash = hugetlb_fault_mutex_hash(h, mapping, idx, haddr);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			ret = handle_userfault(&vmf, VM_UFFD_MISSING);
 			mutex_lock(&hugetlb_fault_mutex_table[hash]);
@@ -3946,21 +3945,14 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 }
 
 #ifdef CONFIG_SMP
-u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
-			    struct vm_area_struct *vma,
-			    struct address_space *mapping,
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct address_space *mapping,
 			    pgoff_t idx, unsigned long address)
 {
 	unsigned long key[2];
 	u32 hash;
 
-	if (vma->vm_flags & VM_SHARED) {
-		key[0] = (unsigned long) mapping;
-		key[1] = idx;
-	} else {
-		key[0] = (unsigned long) mm;
-		key[1] = address >> huge_page_shift(h);
-	}
+	key[0] = (unsigned long) mapping;
+	key[1] = idx;
 
 	hash = jhash2((u32 *)&key, sizeof(key)/sizeof(u32), 0);
 
@@ -3971,9 +3963,7 @@ u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
  * For uniprocesor systems we always use a single mutex, so just
  * return 0 and avoid the hashing overhead.
  */
-u32 hugetlb_fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
-			    struct vm_area_struct *vma,
-			    struct address_space *mapping,
+u32 hugetlb_fault_mutex_hash(struct hstate *h, struct address_space *mapping,
 			    pgoff_t idx, unsigned long address)
 {
 	return 0;
@@ -4018,7 +4008,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
-	hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, haddr);
+	hash = hugetlb_fault_mutex_hash(h, mapping, idx, haddr);
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 	entry = huge_ptep_get(ptep);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d59b5a73dfb3..9932d5755e4c 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -271,8 +271,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		 */
 		idx = linear_page_index(dst_vma, dst_addr);
 		mapping = dst_vma->vm_file->f_mapping;
-		hash = hugetlb_fault_mutex_hash(h, dst_mm, dst_vma, mapping,
-								idx, dst_addr);
+		hash = hugetlb_fault_mutex_hash(h, mapping, idx, dst_addr);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 		err = -ENOMEM;
-- 
2.20.1



* Re: [PATCH v2 0/2] A couple hugetlbfs fixes
  2019-03-28 23:47 [PATCH v2 0/2] A couple hugetlbfs fixes Mike Kravetz
  2019-03-28 23:47 ` [PATCH v2 1/2] hugetlbfs: on restore reserve error path retain subpool reservation Mike Kravetz
  2019-03-28 23:47 ` [PATCH v2 2/2] hugetlb: use same fault hash key for shared and private mappings Mike Kravetz
@ 2019-04-03  5:29 ` Naoya Horiguchi
  2019-04-08 19:48 ` Davidlohr Bueso
  3 siblings, 0 replies; 8+ messages in thread
From: Naoya Horiguchi @ 2019-04-03  5:29 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-kernel, linux-mm, Davidlohr Bueso, Joonsoo Kim,
	Michal Hocko, Kirill A . Shutemov, Andrew Morton

On Thu, Mar 28, 2019 at 04:47:02PM -0700, Mike Kravetz wrote:
> I stumbled on these two hugetlbfs issues while looking at other things:
> - The 'restore reserve' functionality at page free time should not
>   be adjusting subpool counts.
> - A BUG can be triggered (not easily) due to temporarily mapping a
>   page before doing a COW.
> 
> Both are described in detail in the commit message of the patches.
> I would appreciate comments from Davidlohr Bueso as one patch is
> directly related to code he added in commit 8382d914ebf7.
> 
> I did not cc stable as the first problem has been around since reserves
> were added to hugetlbfs and nobody has noticed.  The second is very hard
> to hit/reproduce.
> 
> v2 - Update definition and all callers of hugetlb_fault_mutex_hash as
>      the arguments mm and vma are no longer used or necessary.
> 
> Mike Kravetz (2):
>   hugetlbfs: on restore reserve error path retain subpool reservation
>   hugetlb: use same fault hash key for shared and private mappings
> 
>  fs/hugetlbfs/inode.c    |  7 ++-----
>  include/linux/hugetlb.h |  4 +---
>  mm/hugetlb.c            | 43 +++++++++++++++++++++--------------------
>  mm/userfaultfd.c        |  3 +--
>  4 files changed, 26 insertions(+), 31 deletions(-)

Both fixes look fine to me.

Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>


* Re: [PATCH v2 0/2] A couple hugetlbfs fixes
  2019-03-28 23:47 [PATCH v2 0/2] A couple hugetlbfs fixes Mike Kravetz
                   ` (2 preceding siblings ...)
  2019-04-03  5:29 ` [PATCH v2 0/2] A couple hugetlbfs fixes Naoya Horiguchi
@ 2019-04-08 19:48 ` Davidlohr Bueso
  2019-04-09  3:30   ` Mike Kravetz
  3 siblings, 1 reply; 8+ messages in thread
From: Davidlohr Bueso @ 2019-04-08 19:48 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-kernel, linux-mm, Joonsoo Kim, Michal Hocko,
	Naoya Horiguchi, Kirill A . Shutemov, Andrew Morton

On Thu, 28 Mar 2019, Mike Kravetz wrote:

>- A BUG can be triggered (not easily) due to temporarily mapping a
>  page before doing a COW.

But you actually _have_ seen it? Do you have the traces? I ask
not because of the patches per se, but because it would be nice
to have a real snippet in the Changelog for patch 2.

Thanks,
Davidlohr


* Re: [PATCH v2 0/2] A couple hugetlbfs fixes
  2019-04-08 19:48 ` Davidlohr Bueso
@ 2019-04-09  3:30   ` Mike Kravetz
  0 siblings, 0 replies; 8+ messages in thread
From: Mike Kravetz @ 2019-04-09  3:30 UTC (permalink / raw)
  To: linux-kernel, linux-mm, Joonsoo Kim, Michal Hocko,
	Naoya Horiguchi, Kirill A . Shutemov, Andrew Morton

On 4/8/19 12:48 PM, Davidlohr Bueso wrote:
> On Thu, 28 Mar 2019, Mike Kravetz wrote:
> 
>> - A BUG can be triggered (not easily) due to temporarily mapping a
>>  page before doing a COW.
> 
> But you actually _have_ seen it? Do you have the traces? I ask
> not because of the patches per se, but because it would be nice
> to have a real snippet in the Changelog for patch 2.

Yes, I actually saw this problem.  It happened while I was debugging and
testing some patches for hugetlb migration.  The BUG I hit was in
unaccount_page_cache_page(): VM_BUG_ON_PAGE(page_mapped(page), page).

Stack trace was something like:
unaccount_page_cache_page
  __delete_from_page_cache
    delete_from_page_cache
      remove_huge_page
        remove_inode_hugepages
          hugetlbfs_punch_hole
            hugetlbfs_fallocate

When I hit that, it took me a while to figure out how it could happen.
i.e. How could a page be mapped at that point in remove_inode_hugepages?
It checks page_mapped and we are holding the fault mutex.  With some
additional debug code (strategic udelays) I could hit the issue on a
somewhat regular basis and verified another thread was in the
hugetlb_no_page/hugetlb_cow path for the same page at the same time.

Unfortunately, I did not save the traces.  I am trying to recreate now.
However, my test system was recently updated and it might take a little
time to recreate.
-- 
Mike Kravetz


* Re: [PATCH v2 2/2] hugetlb: use same fault hash key for shared and private mappings
  2019-03-28 23:47 ` [PATCH v2 2/2] hugetlb: use same fault hash key for shared and private mappings Mike Kravetz
@ 2019-04-11 18:32   ` Mike Kravetz
  2019-04-12 16:52     ` Davidlohr Bueso
  0 siblings, 1 reply; 8+ messages in thread
From: Mike Kravetz @ 2019-04-11 18:32 UTC (permalink / raw)
  To: linux-kernel, linux-mm, Davidlohr Bueso, Andrew Morton
  Cc: Joonsoo Kim, Michal Hocko, Naoya Horiguchi, Kirill A . Shutemov

On 3/28/19 4:47 PM, Mike Kravetz wrote:
> hugetlb uses a fault mutex hash table to prevent page faults of the
> same pages concurrently.  The key for shared and private mappings is
> different.  Shared keys off address_space and file index.  Private
> keys off mm and virtual address.  Consider a private mapping of a
> populated hugetlbfs file.  A write fault will first map the page from
> the file and then do a COW to map a writable page.

Davidlohr suggested adding the stack trace to the commit log.  When I
originally 'discovered' this issue I was debugging something else.  The
routine remove_inode_hugepages() contains the following:

			 * ...
			 * This race can only happen in the hole punch case.
			 * Getting here in a truncate operation is a bug.
			 */
			if (unlikely(page_mapped(page))) {
				BUG_ON(truncate_op);

				i_mmap_lock_write(mapping);
				hugetlb_vmdelete_list(&mapping->i_mmap,
					index * pages_per_huge_page(h),
					(index + 1) * pages_per_huge_page(h));
				i_mmap_unlock_write(mapping);
			}

			lock_page(page);
			/*
			 * We must free the huge page and remove from page
			 * ...
			 */
			VM_BUG_ON(PagePrivate(page));
			remove_huge_page(page);
			freed++;

I observed that the page could be mapped (again) before the call to lock_page
if we raced with a private write fault.  However, for COW faults the faulting
code is holding the page lock until it unmaps the file page.  Hence, we will
not call remove_huge_page() with the page mapped.  That is good.  However, for
simple read faults the page remains mapped after releasing the page lock and
we can call remove_huge_page with a mapped page and BUG.

Sorry, the original commit message was not completely accurate in describing
the issue.  I was basing the change on behavior experienced during debug of
another issue.  Actually, it is MUCH easier to BUG by making private read
faults race with hole punch.  As a result, I now think this should go to
stable.

Andrew, below is an updated commit message.  No changes to code.  Would you
like me to send an updated patch?  Also, need to add stable.

hugetlb uses a fault mutex hash table to prevent page faults of the
same pages concurrently.  The key for shared and private mappings is
different.  Shared mappings key off the address_space and file index.
Private mappings key off the mm and virtual address.  Consider a private
mapping of a populated hugetlbfs file.  A fault will map the page from
the file and, if needed, do a COW to map a writable page.

Hugetlbfs hole punch uses the fault mutex to prevent mappings of file
pages.  It uses the address_space file index key.  However, private
mappings will use a different key and could race with this code to map
the file page.  This causes problems (BUG) for the page cache remove
code as it expects the page to be unmapped.  A sample stack is:

page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
kernel BUG at mm/filemap.c:169!
...
RIP: 0010:unaccount_page_cache_page+0x1b8/0x200
...
Call Trace:
__delete_from_page_cache+0x39/0x220
delete_from_page_cache+0x45/0x70
remove_inode_hugepages+0x13c/0x380
? __add_to_page_cache_locked+0x162/0x380
hugetlbfs_fallocate+0x403/0x540
? _cond_resched+0x15/0x30
? __inode_security_revalidate+0x5d/0x70
? selinux_file_permission+0x100/0x130
vfs_fallocate+0x13f/0x270
ksys_fallocate+0x3c/0x80
__x64_sys_fallocate+0x1a/0x20
do_syscall_64+0x5b/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9

There seems to be another potential COW issue/race with this approach
of different private and shared keys as noted in commit 8382d914ebf7
("mm, hugetlb: improve page-fault scalability").

Since every hugetlb mapping (even anon and private) is actually a file
mapping, just use the address_space index key for all mappings.  This
results in potentially more hash collisions.  However, this should not
be the common case.

Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
Cc: <stable@vger.kernel.org>

-- 
Mike Kravetz


* Re: [PATCH v2 2/2] hugetlb: use same fault hash key for shared and private mappings
  2019-04-11 18:32   ` Mike Kravetz
@ 2019-04-12 16:52     ` Davidlohr Bueso
  0 siblings, 0 replies; 8+ messages in thread
From: Davidlohr Bueso @ 2019-04-12 16:52 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-kernel, linux-mm, Andrew Morton, Joonsoo Kim, Michal Hocko,
	Naoya Horiguchi, Kirill A . Shutemov

On Thu, 11 Apr 2019, Mike Kravetz wrote:

>On 3/28/19 4:47 PM, Mike Kravetz wrote:
>> hugetlb uses a fault mutex hash table to prevent page faults of the
>> same pages concurrently.  The key for shared and private mappings is
>> different.  Shared keys off address_space and file index.  Private
>> keys off mm and virtual address.  Consider a private mapping of a
>> populated hugetlbfs file.  A write fault will first map the page from
>> the file and then do a COW to map a writable page.
>
>Davidlohr suggested adding the stack trace to the commit log.  When I
>originally 'discovered' this issue I was debugging something else.  The
>routine remove_inode_hugepages() contains the following:
>
>			 * ...
>			 * This race can only happen in the hole punch case.
>			 * Getting here in a truncate operation is a bug.
>			 */
>			if (unlikely(page_mapped(page))) {
>				BUG_ON(truncate_op);
>
>				i_mmap_lock_write(mapping);
>				hugetlb_vmdelete_list(&mapping->i_mmap,
>					index * pages_per_huge_page(h),
>					(index + 1) * pages_per_huge_page(h));
>				i_mmap_unlock_write(mapping);
>			}
>
>			lock_page(page);
>			/*
>			 * We must free the huge page and remove from page
>			 * ...
>			 */
>			VM_BUG_ON(PagePrivate(page));
>			remove_huge_page(page);
>			freed++;
>
>I observed that the page could be mapped (again) before the call to lock_page
>if we raced with a private write fault.  However, for COW faults the faulting
>code is holding the page lock until it unmaps the file page.  Hence, we will
>not call remove_huge_page() with the page mapped.  That is good.  However, for
>simple read faults the page remains mapped after releasing the page lock and
>we can call remove_huge_page with a mapped page and BUG.
>
>Sorry, the original commit message was not completely accurate in describing
>the issue.  I was basing the change on behavior experienced during debug of
>another issue.  Actually, it is MUCH easier to BUG by making private read
>faults race with hole punch.  As a result, I now think this should go to
>stable.
>
>Andrew, below is an updated commit message.  No changes to code.  Would you
>like me to send an updated patch?  Also, need to add stable.
>
>hugetlb uses a fault mutex hash table to prevent page faults of the
>same pages concurrently.  The key for shared and private mappings is
>different.  Shared keys off address_space and file index.  Private
>keys off mm and virtual address.  Consider a private mapping of a
>populated hugetlbfs file.  A fault will map the page from the file
>and if needed do a COW to map a writable page.
>
>Hugetlbfs hole punch uses the fault mutex to prevent mappings of file
>pages.  It uses the address_space file index key.  However, private
>mappings will use a different key and could race with this code to map
>the file page.  This causes problems (BUG) for the page cache remove
>code as it expects the page to be unmapped.  A sample stack is:
>
>page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
>kernel BUG at mm/filemap.c:169!
>...
>RIP: 0010:unaccount_page_cache_page+0x1b8/0x200
>...
>Call Trace:
>__delete_from_page_cache+0x39/0x220
>delete_from_page_cache+0x45/0x70
>remove_inode_hugepages+0x13c/0x380
>? __add_to_page_cache_locked+0x162/0x380
>hugetlbfs_fallocate+0x403/0x540
>? _cond_resched+0x15/0x30
>? __inode_security_revalidate+0x5d/0x70
>? selinux_file_permission+0x100/0x130
>vfs_fallocate+0x13f/0x270
>ksys_fallocate+0x3c/0x80
>__x64_sys_fallocate+0x1a/0x20
>do_syscall_64+0x5b/0x180
>entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
>There seems to be another potential COW issue/race with this approach
>of different private and shared keys as noted in commit 8382d914ebf7
>("mm, hugetlb: improve page-fault scalability").
>
>Since every hugetlb mapping (even anon and private) is actually a file
>mapping, just use the address_space index key for all mappings.  This
>results in potentially more hash collisions.  However, this should not
>be the common case.

This is fair enough as most mappings will be shared anyway (it would be
lovely to have some machinery to measure collisions in kernel hash tables,
in general).
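
For what it's worth, such machinery can be sketched in a few lines of
userspace C (toy bucket count and hash, purely illustrative):

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 8	/* toy table size */

struct hash_stats {
	unsigned long hits[NBUCKETS];	/* insertions per bucket */
};

static unsigned int toy_hash(unsigned long key)
{
	return (unsigned int)((key * 2654435761UL) % NBUCKETS);
}

static void account(struct hash_stats *st, unsigned long key)
{
	st->hits[toy_hash(key)]++;
}

/* Collisions: insertions beyond the first in each bucket. */
static unsigned long collisions(const struct hash_stats *st)
{
	unsigned long n = 0;
	size_t i;

	for (i = 0; i < NBUCKETS; i++)
		if (st->hits[i] > 1)
			n += st->hits[i] - 1;
	return n;
}
```

A per-bucket counter like this, exported somewhere like debugfs, would
answer the "how bad are the collisions" question empirically.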

>Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")

Ok the issue was introduced after we had the mutex table.

>Cc: <stable@vger.kernel.org>

Thanks for the details, I'm definitely seeing the idx mismatch issue now.

Reviewed-by: Davidlohr Bueso <dbueso@suse.de>

