* [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization
@ 2022-09-14 22:18 Mike Kravetz
  2022-09-14 22:18 ` [PATCH v2 1/9] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
                   ` (8 more replies)
  0 siblings, 9 replies; 18+ messages in thread
From: Mike Kravetz @ 2022-09-14 22:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton, Mike Kravetz

hugetlb fault scalability regressions have recently been reported [1].
This is not the first such report, as regressions were also noted when
commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7.  At that time, a proposal to
address the regression was suggested [3] but went nowhere.

The regression and the benefit of this patch series are not evident when
using the vm_scalability benchmark reported in [2] on a recent kernel.
Results from running:
"./usemem -n 48 --prealloc --prefault -O -U 3448054972"

			48 sample Avg
next-20220913		next-20220913			next-20220913
unmodified	revert i_mmap_sema locking	vma sema locking, this series
-----------------------------------------------------------------------------
498150 KB/s		501934 KB/s			504793 KB/s

The recent regression report [1] notes page fault and fork latency of
shared hugetlb mappings.  To measure this, I created two simple programs:
1) map a shared hugetlb area, write fault all pages, unmap area
   Do this in a continuous loop to measure faults per second
2) map a shared hugetlb area, write fault a few pages, fork and exit
   Do this in a continuous loop to measure forks per second
These programs were run on a 48 CPU VM with 320GB memory.  The shared
mapping size was 250GB.  For comparison, a single instance of the program
was run.  Then, multiple instances were run in parallel to introduce
lock contention.  Changing the locking scheme results in a significant
performance benefit.

test		instances	unmodified	revert		vma
--------------------------------------------------------------------------
faults per sec	1		393043		395680		389932
faults per sec  24		 71405		 81191		 79048
forks per sec   1		  2802		  2747		  2725
forks per sec   24		   439		   536		   500
Combined faults 24		  1621		 68070		 53662
Combined forks  24		   358		    67		   142

Combined test is when running both faulting program and forking program
simultaneously.
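
For reference, a minimal sketch of the fault test (1) described above is
included below.  This is only an approximation: it assumes an anonymous
MAP_SHARED|MAP_HUGETLB mapping, 2MB huge pages and a fixed reporting
interval, and is not the exact program used to produce the numbers above.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define MAP_SIZE	(250UL << 30)	/* 250GB shared hugetlb area */
#define HPAGE_SIZE	(2UL << 20)	/* assumes 2MB huge pages */

int main(void)
{
	unsigned long faults = 0;
	time_t start = time(NULL);

	for (;;) {
		/* map a shared hugetlb area */
		char *area = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
				  MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB,
				  -1, 0);
		if (area == MAP_FAILED)
			return 1;

		/* write fault every huge page in the area */
		for (unsigned long off = 0; off < MAP_SIZE; off += HPAGE_SIZE) {
			area[off] = 1;
			faults++;
		}

		/* unmap area and repeat */
		munmap(area, MAP_SIZE);

		/* report an approximate faults per second rate */
		if (time(NULL) - start >= 10) {
			printf("%lu faults/sec\n", faults / 10);
			faults = 0;
			start = time(NULL);
		}
	}
}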

Patches 1 and 2 of this series revert c0d0381ade79 and 87bf91d39bb5, which
depends on c0d0381ade79.  Acquisition of i_mmap_rwsem is still required in
the fault path to establish pmd sharing, so this is moved back to
huge_pmd_share.  With c0d0381ade79 reverted, this race is exposed:

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)

Reverting 87bf91d39bb5 exposes races between page faults and file
truncation.  The new vma lock, put to use in patch 8, handles these
races.  This is explained in patch 9, where code associated with the
races is cleaned up.

Patches 3 - 5 restructure existing code in preparation for using the
new vma lock (rw semaphore) for pmd sharing synchronization.  The idea
is that this semaphore will be held in read mode for the duration of
fault processing, and held in write mode for unmap operations which may
call huge_pmd_unshare.  Acquiring i_mmap_rwsem is also still required
to synchronize huge pmd sharing.  However, it is only required in the
fault path when setting up sharing, and will be acquired in huge_pmd_share().
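
Conceptually, the resulting lock ordering looks like the sketch below.
This is illustrative only, using the helper names added in patch 6; the
real call sites are added in patch 8 and differ in detail.

/* Fault path: vma lock held in read mode for the whole fault. */
static void fault_path_sketch(struct vm_area_struct *vma)
{
	hugetlb_vma_lock_read(vma);
	/* huge_pte_alloc()/huge_pmd_share(); i_mmap_rwsem taken inside */
	/* ... handle the fault, set_pte_at(), etc. ... */
	hugetlb_vma_unlock_read(vma);
}

/* Unmap path: anything that may call huge_pmd_unshare needs write mode. */
static void unmap_path_sketch(struct vm_area_struct *vma,
			      struct address_space *mapping)
{
	hugetlb_vma_lock_write(vma);
	i_mmap_lock_write(mapping);
	/* __unmap_hugepage_range() may call huge_pmd_unshare() */
	i_mmap_unlock_write(mapping);
	hugetlb_vma_unlock_write(vma);
}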

Patch 6 adds the new vma lock and all supporting routines, but does not
actually change code to use the new lock.

Patch 7 refactors code in preparation for using the new lock, and patch
8 finally adds code to make use of this new vma lock.  Unfortunately, the
fault code and the truncate/hole punch code would naturally take locks in
opposite orders, which could lead to deadlock.  Since page fault
performance is more important, the truncation/hole punch code is modified
to back out and take locks in the correct order if necessary (see the
sketch below).
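
A rough sketch of that back out, in terms of the patch 6 interfaces, is
below.  It is illustrative only; the actual code lives in
hugetlb_unmap_file_folio (patches 7 and 8) and, among other details,
rechecks that the folio is still mapped by the vma after the locks are
reacquired.

static void truncate_unmap_vma_sketch(struct vm_area_struct *vma,
				      struct address_space *mapping)
{
	struct hugetlb_vma_lock *vma_lock = NULL;

	/* caller already holds i_mmap_rwsem in write mode */
	if (!hugetlb_vma_trylock_write(vma)) {
		/*
		 * A fault holds the vma lock and may be waiting on
		 * i_mmap_rwsem.  Take a ref so the lock structure cannot
		 * be freed, drop i_mmap_rwsem, then reacquire both in the
		 * order the fault path uses.
		 */
		vma_lock = vma->vm_private_data;
		kref_get(&vma_lock->refs);
		i_mmap_unlock_write(mapping);

		down_write(&vma_lock->rw_sema);
		i_mmap_lock_write(mapping);
	}

	/* ... unmap the folio from this vma ... */

	hugetlb_vma_unlock_write(vma);
	if (vma_lock)
		kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
}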

[1] https://lore.kernel.org/linux-mm/43faf292-245b-5db5-cce9-369d8fb6bd21@infradead.org/
[2] https://lore.kernel.org/lkml/20200622005551.GK5535@shao2-debian/
[3] https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@oracle.com/

v1  -> v2
- Addressed more issues pointed out by Miaohe Lin.  Thanks again!  And,
  added some Reviewed-by's.
- Addressed a performance issue with files that have large holes, as
  reported by Sven Schnelle.  Ultimately, removed the code that used the
  fault mutex to address fault/truncation races, as the new vma lock is
  sufficient for this.
- Made the vma lock a ref counted structure so that it can exist
  separately from its associated vma.  This makes the unmap code that
  backs out and takes locks in order cleaner and a little simpler.
- Rebased and retested on next-20220913

RFC -> v1
- Addressed many issues pointed out by Miaohe Lin.  Thank you!  Most
  significant was not attempting to backout pages in fault code due to
  races with truncation (patch 4).
- Rebased and retested on next-20220822

Mike Kravetz (9):
  hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
  hugetlbfs: revert use i_mmap_rwsem for more pmd sharing
    synchronization
  hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache
  hugetlb: create remove_inode_single_folio to remove single file folio
  hugetlb: rename vma_shareable() and refactor code
  hugetlb: add vma based lock for pmd sharing
  hugetlb: create hugetlb_unmap_file_folio to unmap single file folio
  hugetlb: use new vma_lock for pmd sharing synchronization
  hugetlb: clean up code checking for fault/truncation races

 fs/hugetlbfs/inode.c    | 300 +++++++++++++++++++++++---------
 include/linux/hugetlb.h |  45 ++++-
 kernel/fork.c           |   6 +-
 mm/hugetlb.c            | 373 ++++++++++++++++++++++++++++++----------
 mm/memory.c             |   2 +
 mm/rmap.c               | 114 +++++++-----
 mm/userfaultfd.c        |  14 +-
 7 files changed, 621 insertions(+), 233 deletions(-)

-- 
2.37.2


* [PATCH v2 1/9] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
  2022-09-14 22:18 [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization Mike Kravetz
@ 2022-09-14 22:18 ` Mike Kravetz
  2022-09-14 22:18 ` [PATCH v2 2/9] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization Mike Kravetz
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Mike Kravetz @ 2022-09-14 22:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton, Mike Kravetz

Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing.  The use of i_mmap_rwsem to prevent
fault/truncate races depends on this.  However, this has been shown to
cause performance/scaling issues.  As a result, that code will be
reverted.  Since the use of i_mmap_rwsem to address page fault/truncate
races depends on this, it must also be reverted.

In a subsequent patch, code will be added to detect the fault/truncate
race and back out operations as required.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
---
 fs/hugetlbfs/inode.c | 30 +++++++++---------------------
 mm/hugetlb.c         | 22 +++++++++++-----------
 2 files changed, 20 insertions(+), 32 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index f7a5b5124d8a..a32031e751d1 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -419,9 +419,10 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
  *	In this case, we first scan the range and release found pages.
  *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
  *	maps and global counts.  Page faults can not race with truncation
- *	in this routine.  hugetlb_no_page() holds i_mmap_rwsem and prevents
- *	page faults in the truncated range by checking i_size.  i_size is
- *	modified while holding i_mmap_rwsem.
+ *	in this routine.  hugetlb_no_page() prevents page faults in the
+ *	truncated range.  It checks i_size before allocation, and again after
+ *	with the page table lock for the page held.  The same lock must be
+ *	acquired to unmap a page.
  * hole punch is indicated if end is not LLONG_MAX
  *	In the hole punch case we scan the range and release found pages.
  *	Only when releasing a page is the associated region/reserve map
@@ -451,16 +452,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			u32 hash = 0;
 
 			index = folio->index;
-			if (!truncate_op) {
-				/*
-				 * Only need to hold the fault mutex in the
-				 * hole punch case.  This prevents races with
-				 * page faults.  Races are not possible in the
-				 * case of truncation.
-				 */
-				hash = hugetlb_fault_mutex_hash(mapping, index);
-				mutex_lock(&hugetlb_fault_mutex_table[hash]);
-			}
+			hash = hugetlb_fault_mutex_hash(mapping, index);
+			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 			/*
 			 * If folio is mapped, it was faulted in after being
@@ -504,8 +497,7 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			}
 
 			folio_unlock(folio);
-			if (!truncate_op)
-				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
 		folio_batch_release(&fbatch);
 		cond_resched();
@@ -543,8 +535,8 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	BUG_ON(offset & ~huge_page_mask(h));
 	pgoff = offset >> PAGE_SHIFT;
 
-	i_mmap_lock_write(mapping);
 	i_size_write(inode, offset);
+	i_mmap_lock_write(mapping);
 	if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
 		hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
 				      ZAP_FLAG_DROP_MARKER);
@@ -703,11 +695,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		/* addr is the offset within the file (zero based) */
 		addr = index * hpage_size;
 
-		/*
-		 * fault mutex taken here, protects against fault path
-		 * and hole punch.  inode_lock previously taken protects
-		 * against truncation.
-		 */
+		/* mutex taken here, fault path and hole punch */
 		hash = hugetlb_fault_mutex_hash(mapping, index);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c6b53bcf823d..6c97b97aa252 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5559,17 +5559,15 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	}
 
 	/*
-	 * We can not race with truncation due to holding i_mmap_rwsem.
-	 * i_size is modified when holding i_mmap_rwsem, so check here
-	 * once for faults beyond end of file.
+	 * Use page lock to guard against racing truncation
+	 * before we get page_table_lock.
 	 */
-	size = i_size_read(mapping->host) >> huge_page_shift(h);
-	if (idx >= size)
-		goto out;
-
 	new_page = false;
 	page = find_lock_page(mapping, idx);
 	if (!page) {
+		size = i_size_read(mapping->host) >> huge_page_shift(h);
+		if (idx >= size)
+			goto out;
 		/* Check for page in userfault range */
 		if (userfaultfd_missing(vma)) {
 			ret = hugetlb_handle_userfault(vma, mapping, idx,
@@ -5665,6 +5663,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	}
 
 	ptl = huge_pte_lock(h, mm, ptep);
+	size = i_size_read(mapping->host) >> huge_page_shift(h);
+	if (idx >= size)
+		goto backout;
+
 	ret = 0;
 	/* If pte changed from under us, retry */
 	if (!pte_same(huge_ptep_get(ptep), old_pte))
@@ -5773,10 +5775,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	/*
 	 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
-	 * until finished with ptep.  This serves two purposes:
-	 * 1) It prevents huge_pmd_unshare from being called elsewhere
-	 *    and making the ptep no longer valid.
-	 * 2) It synchronizes us with i_size modifications during truncation.
+	 * until finished with ptep.  This prevents huge_pmd_unshare from
+	 * being called elsewhere and making the ptep no longer valid.
 	 *
 	 * ptep could have already be assigned via huge_pte_offset.  That
 	 * is OK, as huge_pte_alloc will return the same value unless
-- 
2.37.2


* [PATCH v2 2/9] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization
  2022-09-14 22:18 [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization Mike Kravetz
  2022-09-14 22:18 ` [PATCH v2 1/9] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
@ 2022-09-14 22:18 ` Mike Kravetz
  2022-09-14 22:18 ` [PATCH v2 3/9] hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache Mike Kravetz
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Mike Kravetz @ 2022-09-14 22:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton, Mike Kravetz

Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing.  However, this has been shown to cause
performance/scaling issues.  Revert the code and go back to only taking
the semaphore in huge_pmd_share during the fault path.

Keep the code that takes i_mmap_rwsem in write mode before calling
try_to_unmap as this is required if huge_pmd_unshare is called.

NOTE: Reverting this code does expose the following race condition.

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)

It is unknown if the above race was ever experienced by a user.  It was
discovered via code inspection when the race was initially addressed.

In subsequent patches, a new synchronization mechanism will be added to
coordinate pmd sharing and eliminate this race.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
---
 fs/hugetlbfs/inode.c |  2 --
 mm/hugetlb.c         | 77 +++++++-------------------------------------
 mm/rmap.c            |  8 +----
 mm/userfaultfd.c     | 11 ++-----
 4 files changed, 15 insertions(+), 83 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a32031e751d1..dfb735a91bbb 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -467,9 +467,7 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			if (unlikely(folio_mapped(folio))) {
 				BUG_ON(truncate_op);
 
-				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 				i_mmap_lock_write(mapping);
-				mutex_lock(&hugetlb_fault_mutex_table[hash]);
 				hugetlb_vmdelete_list(&mapping->i_mmap,
 					index * pages_per_huge_page(h),
 					(index + 1) * pages_per_huge_page(h),
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6c97b97aa252..00fba195a439 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4769,7 +4769,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	struct hstate *h = hstate_vma(src_vma);
 	unsigned long sz = huge_page_size(h);
 	unsigned long npages = pages_per_huge_page(h);
-	struct address_space *mapping = src_vma->vm_file->f_mapping;
 	struct mmu_notifier_range range;
 	unsigned long last_addr_mask;
 	int ret = 0;
@@ -4781,14 +4780,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		mmu_notifier_invalidate_range_start(&range);
 		mmap_assert_write_locked(src);
 		raw_write_seqcount_begin(&src->write_protect_seq);
-	} else {
-		/*
-		 * For shared mappings i_mmap_rwsem must be held to call
-		 * huge_pte_alloc, otherwise the returned ptep could go
-		 * away if part of a shared pmd and another thread calls
-		 * huge_pmd_unshare.
-		 */
-		i_mmap_lock_read(mapping);
 	}
 
 	last_addr_mask = hugetlb_mask_last_page(h);
@@ -4935,8 +4926,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	if (cow) {
 		raw_write_seqcount_end(&src->write_protect_seq);
 		mmu_notifier_invalidate_range_end(&range);
-	} else {
-		i_mmap_unlock_read(mapping);
 	}
 
 	return ret;
@@ -5345,29 +5334,8 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * may get SIGKILLed if it later faults.
 		 */
 		if (outside_reserve) {
-			struct address_space *mapping = vma->vm_file->f_mapping;
-			pgoff_t idx;
-			u32 hash;
-
 			put_page(old_page);
-			/*
-			 * Drop hugetlb_fault_mutex and i_mmap_rwsem before
-			 * unmapping.  unmapping needs to hold i_mmap_rwsem
-			 * in write mode.  Dropping i_mmap_rwsem in read mode
-			 * here is OK as COW mappings do not interact with
-			 * PMD sharing.
-			 *
-			 * Reacquire both after unmap operation.
-			 */
-			idx = vma_hugecache_offset(h, vma, haddr);
-			hash = hugetlb_fault_mutex_hash(mapping, idx);
-			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			i_mmap_unlock_read(mapping);
-
 			unmap_ref_private(mm, vma, old_page, haddr);
-
-			i_mmap_lock_read(mapping);
-			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 			spin_lock(ptl);
 			ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 			if (likely(ptep &&
@@ -5522,9 +5490,7 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
 	 */
 	hash = hugetlb_fault_mutex_hash(mapping, idx);
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-	i_mmap_unlock_read(mapping);
 	ret = handle_userfault(&vmf, reason);
-	i_mmap_lock_read(mapping);
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 	return ret;
@@ -5759,11 +5725,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (ptep) {
-		/*
-		 * Since we hold no locks, ptep could be stale.  That is
-		 * OK as we are only making decisions based on content and
-		 * not actually modifying content here.
-		 */
 		entry = huge_ptep_get(ptep);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
 			migration_entry_wait_huge(vma, ptep);
@@ -5771,31 +5732,20 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON_LARGE |
 				VM_FAULT_SET_HINDEX(hstate_index(h));
+	} else {
+		ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
+		if (!ptep)
+			return VM_FAULT_OOM;
 	}
 
-	/*
-	 * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
-	 * until finished with ptep.  This prevents huge_pmd_unshare from
-	 * being called elsewhere and making the ptep no longer valid.
-	 *
-	 * ptep could have already be assigned via huge_pte_offset.  That
-	 * is OK, as huge_pte_alloc will return the same value unless
-	 * something has changed.
-	 */
 	mapping = vma->vm_file->f_mapping;
-	i_mmap_lock_read(mapping);
-	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
-	if (!ptep) {
-		i_mmap_unlock_read(mapping);
-		return VM_FAULT_OOM;
-	}
+	idx = vma_hugecache_offset(h, vma, haddr);
 
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
-	idx = vma_hugecache_offset(h, vma, haddr);
 	hash = hugetlb_fault_mutex_hash(mapping, idx);
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
@@ -5860,7 +5810,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			put_page(pagecache_page);
 		}
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-		i_mmap_unlock_read(mapping);
 		return handle_userfault(&vmf, VM_UFFD_WP);
 	}
 
@@ -5904,7 +5853,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 out_mutex:
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-	i_mmap_unlock_read(mapping);
 	/*
 	 * Generally it's safe to hold refcount during waiting page lock. But
 	 * here we just wait to defer the next page fault to avoid busy loop and
@@ -6744,12 +6692,10 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
  * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
  * and returns the corresponding pte. While this is not necessary for the
  * !shared pmd case because we can allocate the pmd later as well, it makes the
- * code much cleaner.
- *
- * This routine must be called with i_mmap_rwsem held in at least read mode if
- * sharing is possible.  For hugetlbfs, this prevents removal of any page
- * table entries associated with the address space.  This is important as we
- * are setting up sharing based on existing page table entries (mappings).
+ * code much cleaner. pmd allocation is essential for the shared case because
+ * pud has to be populated inside the same i_mmap_rwsem section - otherwise
+ * racing tasks could either miss the sharing (see huge_pte_offset) or select a
+ * bad pmd for sharing.
  */
 pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, pud_t *pud)
@@ -6763,7 +6709,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	i_mmap_assert_locked(mapping);
+	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -6793,6 +6739,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_unlock(ptl);
 out:
 	pte = (pte_t *)pmd_alloc(mm, pud, addr);
+	i_mmap_unlock_read(mapping);
 	return pte;
 }
 
@@ -6803,7 +6750,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
  * indicated by page_count > 1, unmap is achieved by clearing pud and
  * decrementing the ref count. If count == 1, the pte page is not shared.
  *
- * Called with page table lock held and i_mmap_rwsem held in write mode.
+ * Called with page table lock held.
  *
  * returns: 1 successfully unmapped a shared pte page
  *	    0 the underlying pte page is not shared, or it is the last user
diff --git a/mm/rmap.c b/mm/rmap.c
index 08d552ea4ceb..d17d68a9b15b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -23,10 +23,9 @@
  * inode->i_rwsem	(while writing or truncating, not reading or faulting)
  *   mm->mmap_lock
  *     mapping->invalidate_lock (in filemap_fault)
- *       page->flags PG_locked (lock_page)   * (see hugetlbfs below)
+ *       page->flags PG_locked (lock_page)
  *         hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
  *           mapping->i_mmap_rwsem
- *             hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
  *             anon_vma->rwsem
  *               mm->page_table_lock or pte_lock
  *                 swap_lock (in swap_duplicate, swap_info_get)
@@ -45,11 +44,6 @@
  * anon_vma->rwsem,mapping->i_mmap_rwsem   (memory_failure, collect_procs_anon)
  *   ->tasklist_lock
  *     pte map lock
- *
- * * hugetlbfs PageHuge() pages take locks in this order:
- *         mapping->i_mmap_rwsem
- *           hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
- *             page->flags PG_locked (lock_page)
  */
 
 #include <linux/mm.h>
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9c035be2148b..0fdbd2c05587 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -379,14 +379,10 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		BUG_ON(dst_addr >= dst_start + len);
 
 		/*
-		 * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
-		 * i_mmap_rwsem ensures the dst_pte remains valid even
-		 * in the case of shared pmds.  fault mutex prevents
-		 * races with other faulting threads.
+		 * Serialize via hugetlb_fault_mutex.
 		 */
-		mapping = dst_vma->vm_file->f_mapping;
-		i_mmap_lock_read(mapping);
 		idx = linear_page_index(dst_vma, dst_addr);
+		mapping = dst_vma->vm_file->f_mapping;
 		hash = hugetlb_fault_mutex_hash(mapping, idx);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
@@ -394,7 +390,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
 		if (!dst_pte) {
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			i_mmap_unlock_read(mapping);
 			goto out_unlock;
 		}
 
@@ -402,7 +397,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		    !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
 			err = -EEXIST;
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-			i_mmap_unlock_read(mapping);
 			goto out_unlock;
 		}
 
@@ -411,7 +405,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 					       wp_copy);
 
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-		i_mmap_unlock_read(mapping);
 
 		cond_resched();
 
-- 
2.37.2


* [PATCH v2 3/9] hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache
  2022-09-14 22:18 [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization Mike Kravetz
  2022-09-14 22:18 ` [PATCH v2 1/9] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
  2022-09-14 22:18 ` [PATCH v2 2/9] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization Mike Kravetz
@ 2022-09-14 22:18 ` Mike Kravetz
  2022-09-14 22:18 ` [PATCH v2 4/9] hugetlb: create remove_inode_single_folio to remove single file folio Mike Kravetz
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Mike Kravetz @ 2022-09-14 22:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton, Mike Kravetz

remove_huge_page removes a hugetlb page from the page cache.  Change
to hugetlb_delete_from_page_cache as it is a more descriptive name.
huge_add_to_page_cache is global in scope, but only deals with hugetlb
pages.  For consistency and clarity, rename to hugetlb_add_to_page_cache.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
---
 fs/hugetlbfs/inode.c    | 21 ++++++++++-----------
 include/linux/hugetlb.h |  2 +-
 mm/hugetlb.c            |  8 ++++----
 3 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index dfb735a91bbb..edd69cc43ca5 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -364,7 +364,7 @@ static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
 	return -EINVAL;
 }
 
-static void remove_huge_page(struct page *page)
+static void hugetlb_delete_from_page_cache(struct page *page)
 {
 	ClearPageDirty(page);
 	ClearPageUptodate(page);
@@ -478,15 +478,14 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			folio_lock(folio);
 			/*
 			 * We must free the huge page and remove from page
-			 * cache (remove_huge_page) BEFORE removing the
-			 * region/reserve map (hugetlb_unreserve_pages).  In
-			 * rare out of memory conditions, removal of the
-			 * region/reserve map could fail. Correspondingly,
-			 * the subpool and global reserve usage count can need
-			 * to be adjusted.
+			 * cache BEFORE removing the region/reserve map
+			 * (hugetlb_unreserve_pages).  In rare out of memory
+			 * conditions, removal of the region/reserve map could
+			 * fail. Correspondingly, the subpool and global
+			 * reserve usage count can need to be adjusted.
 			 */
 			VM_BUG_ON(HPageRestoreReserve(&folio->page));
-			remove_huge_page(&folio->page);
+			hugetlb_delete_from_page_cache(&folio->page);
 			freed++;
 			if (!truncate_op) {
 				if (unlikely(hugetlb_unreserve_pages(inode,
@@ -723,7 +722,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		}
 		clear_huge_page(page, addr, pages_per_huge_page(h));
 		__SetPageUptodate(page);
-		error = huge_add_to_page_cache(page, mapping, index);
+		error = hugetlb_add_to_page_cache(page, mapping, index);
 		if (unlikely(error)) {
 			restore_reserve_on_error(h, &pseudo_vma, addr, page);
 			put_page(page);
@@ -735,7 +734,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 
 		SetHPageMigratable(page);
 		/*
-		 * unlock_page because locked by huge_add_to_page_cache()
+		 * unlock_page because locked by hugetlb_add_to_page_cache()
 		 * put_page() due to reference from alloc_huge_page()
 		 */
 		unlock_page(page);
@@ -980,7 +979,7 @@ static int hugetlbfs_error_remove_page(struct address_space *mapping,
 	struct inode *inode = mapping->host;
 	pgoff_t index = page->index;
 
-	remove_huge_page(page);
+	hugetlb_delete_from_page_cache(page);
 	if (unlikely(hugetlb_unreserve_pages(inode, index, index + 1, 1)))
 		hugetlb_fix_reserve_counts(inode);
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 890f7b6a2eff..0ce916d1afca 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -665,7 +665,7 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
 				nodemask_t *nmask, gfp_t gfp_mask);
 struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
 				unsigned long address);
-int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+int hugetlb_add_to_page_cache(struct page *page, struct address_space *mapping,
 			pgoff_t idx);
 void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
 				unsigned long address, struct page *page);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 00fba195a439..eb38ae3e7a83 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5429,7 +5429,7 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
 	return page != NULL;
 }
 
-int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+int hugetlb_add_to_page_cache(struct page *page, struct address_space *mapping,
 			   pgoff_t idx)
 {
 	struct folio *folio = page_folio(page);
@@ -5568,7 +5568,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		new_page = true;
 
 		if (vma->vm_flags & VM_MAYSHARE) {
-			int err = huge_add_to_page_cache(page, mapping, idx);
+			int err = hugetlb_add_to_page_cache(page, mapping, idx);
 			if (err) {
 				/*
 				 * err can't be -EEXIST which implies someone
@@ -5980,11 +5980,11 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 
 		/*
 		 * Serialization between remove_inode_hugepages() and
-		 * huge_add_to_page_cache() below happens through the
+		 * hugetlb_add_to_page_cache() below happens through the
 		 * hugetlb_fault_mutex_table that here must be hold by
 		 * the caller.
 		 */
-		ret = huge_add_to_page_cache(page, mapping, idx);
+		ret = hugetlb_add_to_page_cache(page, mapping, idx);
 		if (ret)
 			goto out_release_nounlock;
 		page_in_pagecache = true;
-- 
2.37.2


* [PATCH v2 4/9] hugetlb: create remove_inode_single_folio to remove single file folio
  2022-09-14 22:18 [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization Mike Kravetz
                   ` (2 preceding siblings ...)
  2022-09-14 22:18 ` [PATCH v2 3/9] hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache Mike Kravetz
@ 2022-09-14 22:18 ` Mike Kravetz
  2022-09-24 12:36   ` Miaohe Lin
  2022-09-14 22:18 ` [PATCH v2 5/9] hugetlb: rename vma_shareable() and refactor code Mike Kravetz
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Mike Kravetz @ 2022-09-14 22:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton, Mike Kravetz

Create the new routine remove_inode_single_folio that will remove a
single folio from a file.  This is refactored code from
remove_inode_hugepages.  It checks for the uncommon case in which the
folio is still mapped and unmaps it.

No functional change.  This refactoring will be put to use and expanded
upon in subsequent patches.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c | 105 ++++++++++++++++++++++++++-----------------
 1 file changed, 63 insertions(+), 42 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index edd69cc43ca5..7112a9a9f54d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -411,6 +411,60 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
 	}
 }
 
+/*
+ * Called with hugetlb fault mutex held.
+ * Returns true if page was actually removed, false otherwise.
+ */
+static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
+					struct address_space *mapping,
+					struct folio *folio, pgoff_t index,
+					bool truncate_op)
+{
+	bool ret = false;
+
+	/*
+	 * If folio is mapped, it was faulted in after being
+	 * unmapped in caller.  Unmap (again) while holding
+	 * the fault mutex.  The mutex will prevent faults
+	 * until we finish removing the folio.
+	 */
+	if (unlikely(folio_mapped(folio))) {
+		i_mmap_lock_write(mapping);
+		hugetlb_vmdelete_list(&mapping->i_mmap,
+			index * pages_per_huge_page(h),
+			(index + 1) * pages_per_huge_page(h),
+			ZAP_FLAG_DROP_MARKER);
+		i_mmap_unlock_write(mapping);
+	}
+
+	folio_lock(folio);
+	/*
+	 * After locking page, make sure mapping is the same.
+	 * We could have raced with page fault populate and
+	 * backout code.
+	 */
+	if (folio_mapping(folio) == mapping) {
+		/*
+		 * We must remove the folio from page cache before removing
+		 * the region/ reserve map (hugetlb_unreserve_pages).  In
+		 * rare out of memory conditions, removal of the region/reserve
+		 * map could fail.  Correspondingly, the subpool and global
+		 * reserve usage count can need to be adjusted.
+		 */
+		VM_BUG_ON(HPageRestoreReserve(&folio->page));
+		hugetlb_delete_from_page_cache(&folio->page);
+		ret = true;
+		if (!truncate_op) {
+			if (unlikely(hugetlb_unreserve_pages(inode, index,
+								index + 1, 1)))
+				hugetlb_fix_reserve_counts(inode);
+		}
+	}
+
+	folio_unlock(folio);
+	return ret;
+}
+
 /*
  * remove_inode_hugepages handles two distinct cases: truncation and hole
  * punch.  There are subtle differences in operation for each case.
@@ -418,11 +472,10 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
  * truncation is indicated by end of range being LLONG_MAX
  *	In this case, we first scan the range and release found pages.
  *	After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
- *	maps and global counts.  Page faults can not race with truncation
- *	in this routine.  hugetlb_no_page() prevents page faults in the
- *	truncated range.  It checks i_size before allocation, and again after
- *	with the page table lock for the page held.  The same lock must be
- *	acquired to unmap a page.
+ *	maps and global counts.  Page faults can race with truncation.
+ *	During faults, hugetlb_no_page() checks i_size before page allocation,
+ *	and again after obtaining page table lock.  It will 'back out'
+ *	allocations in the truncated range.
  * hole punch is indicated if end is not LLONG_MAX
  *	In the hole punch case we scan the range and release found pages.
  *	Only when releasing a page is the associated region/reserve map
@@ -456,44 +509,12 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
 			/*
-			 * If folio is mapped, it was faulted in after being
-			 * unmapped in caller.  Unmap (again) now after taking
-			 * the fault mutex.  The mutex will prevent faults
-			 * until we finish removing the folio.
-			 *
-			 * This race can only happen in the hole punch case.
-			 * Getting here in a truncate operation is a bug.
+			 * Remove folio that was part of folio_batch.
 			 */
-			if (unlikely(folio_mapped(folio))) {
-				BUG_ON(truncate_op);
-
-				i_mmap_lock_write(mapping);
-				hugetlb_vmdelete_list(&mapping->i_mmap,
-					index * pages_per_huge_page(h),
-					(index + 1) * pages_per_huge_page(h),
-					ZAP_FLAG_DROP_MARKER);
-				i_mmap_unlock_write(mapping);
-			}
-
-			folio_lock(folio);
-			/*
-			 * We must free the huge page and remove from page
-			 * cache BEFORE removing the region/reserve map
-			 * (hugetlb_unreserve_pages).  In rare out of memory
-			 * conditions, removal of the region/reserve map could
-			 * fail. Correspondingly, the subpool and global
-			 * reserve usage count can need to be adjusted.
-			 */
-			VM_BUG_ON(HPageRestoreReserve(&folio->page));
-			hugetlb_delete_from_page_cache(&folio->page);
-			freed++;
-			if (!truncate_op) {
-				if (unlikely(hugetlb_unreserve_pages(inode,
-							index, index + 1, 1)))
-					hugetlb_fix_reserve_counts(inode);
-			}
-
-			folio_unlock(folio);
+			if (remove_inode_single_folio(h, inode, mapping, folio,
+							index, truncate_op))
+				freed++;
+
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
 		folio_batch_release(&fbatch);
-- 
2.37.2


* [PATCH v2 5/9] hugetlb: rename vma_shareable() and refactor code
  2022-09-14 22:18 [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization Mike Kravetz
                   ` (3 preceding siblings ...)
  2022-09-14 22:18 ` [PATCH v2 4/9] hugetlb: create remove_inode_single_folio to remove single file folio Mike Kravetz
@ 2022-09-14 22:18 ` Mike Kravetz
  2022-09-14 22:18 ` [PATCH v2 6/9] hugetlb: add vma based lock for pmd sharing Mike Kravetz
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Mike Kravetz @ 2022-09-14 22:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton, Mike Kravetz

Rename the routine vma_shareable to vma_addr_pmd_shareable as it is
checking a specific address within the vma.  Refactor code to check if
an aligned range is shareable as this will be needed in a subsequent
patch.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/hugetlb.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index eb38ae3e7a83..8117bc299c46 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6639,26 +6639,33 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
 	return saddr;
 }
 
-static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
+static bool __vma_aligned_range_pmd_shareable(struct vm_area_struct *vma,
+				unsigned long start, unsigned long end)
 {
-	unsigned long base = addr & PUD_MASK;
-	unsigned long end = base + PUD_SIZE;
-
 	/*
 	 * check on proper vm_flags and page table alignment
 	 */
-	if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, base, end))
+	if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, start, end))
 		return true;
 	return false;
 }
 
+static bool vma_addr_pmd_shareable(struct vm_area_struct *vma,
+						unsigned long addr)
+{
+	unsigned long start = addr & PUD_MASK;
+	unsigned long end = start + PUD_SIZE;
+
+	return __vma_aligned_range_pmd_shareable(vma, start, end);
+}
+
 bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
 {
 #ifdef CONFIG_USERFAULTFD
 	if (uffd_disable_huge_pmd_share(vma))
 		return false;
 #endif
-	return vma_shareable(vma, addr);
+	return vma_addr_pmd_shareable(vma, addr);
 }
 
 /*
-- 
2.37.2


* [PATCH v2 6/9] hugetlb: add vma based lock for pmd sharing
  2022-09-14 22:18 [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization Mike Kravetz
                   ` (4 preceding siblings ...)
  2022-09-14 22:18 ` [PATCH v2 5/9] hugetlb: rename vma_shareable() and refactor code Mike Kravetz
@ 2022-09-14 22:18 ` Mike Kravetz
  2022-09-15 20:25   ` Mike Kravetz
  2022-09-24 13:11   ` Miaohe Lin
  2022-09-14 22:18 ` [PATCH v2 7/9] hugetlb: create hugetlb_unmap_file_folio to unmap single file folio Mike Kravetz
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 18+ messages in thread
From: Mike Kravetz @ 2022-09-14 22:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton, Mike Kravetz

Allocate a new hugetlb_vma_lock structure and hang off vm_private_data
for synchronization use by vmas that could be involved in pmd sharing.
This data structure contains a rw semaphore that is the primary tool
used for synchronization.

This new structure is ref counted, so that it can exist when NOT attached
to a vma.  This is only helpful in resolving lock ordering issues where
code may need to obtain the vma_lock while there is no guarantee the
vma will not go away.  By obtaining a ref on the structure, it can be
guaranteed that at least the rw semaphore will not go away.

Only add infrastructure for the new lock here.  Actual use will be added
in subsequent patches.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h |  43 ++++++++-
 kernel/fork.c           |   6 +-
 mm/hugetlb.c            | 202 ++++++++++++++++++++++++++++++++++++----
 mm/rmap.c               |   8 +-
 4 files changed, 235 insertions(+), 24 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 0ce916d1afca..6a1bd172f943 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -114,6 +114,12 @@ struct file_region {
 #endif
 };
 
+struct hugetlb_vma_lock {
+	struct kref refs;
+	struct rw_semaphore rw_sema;
+	struct vm_area_struct *vma;
+};
+
 extern struct resv_map *resv_map_alloc(void);
 void resv_map_release(struct kref *ref);
 
@@ -126,7 +132,7 @@ struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
 						long min_hpages);
 void hugepage_put_subpool(struct hugepage_subpool *spool);
 
-void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
+void hugetlb_dup_vma_private(struct vm_area_struct *vma);
 void clear_vma_resv_huge_pages(struct vm_area_struct *vma);
 int hugetlb_sysctl_handler(struct ctl_table *, int, void *, size_t *, loff_t *);
 int hugetlb_overcommit_handler(struct ctl_table *, int, void *, size_t *,
@@ -214,6 +220,14 @@ struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
 struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address,
 			     pgd_t *pgd, int flags);
 
+void hugetlb_vma_lock_read(struct vm_area_struct *vma);
+void hugetlb_vma_unlock_read(struct vm_area_struct *vma);
+void hugetlb_vma_lock_write(struct vm_area_struct *vma);
+void hugetlb_vma_unlock_write(struct vm_area_struct *vma);
+int hugetlb_vma_trylock_write(struct vm_area_struct *vma);
+void hugetlb_vma_assert_locked(struct vm_area_struct *vma);
+void hugetlb_vma_lock_release(struct kref *kref);
+
 int pmd_huge(pmd_t pmd);
 int pud_huge(pud_t pud);
 unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
@@ -225,7 +239,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
 
 #else /* !CONFIG_HUGETLB_PAGE */
 
-static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
+static inline void hugetlb_dup_vma_private(struct vm_area_struct *vma)
 {
 }
 
@@ -336,6 +350,31 @@ static inline int prepare_hugepage_range(struct file *file,
 	return -EINVAL;
 }
 
+static inline void hugetlb_vma_lock_read(struct vm_area_struct *vma)
+{
+}
+
+static inline void hugetlb_vma_unlock_read(struct vm_area_struct *vma)
+{
+}
+
+static inline void hugetlb_vma_lock_write(struct vm_area_struct *vma)
+{
+}
+
+static inline void hugetlb_vma_unlock_write(struct vm_area_struct *vma)
+{
+}
+
+static inline int hugetlb_vma_trylock_write(struct vm_area_struct *vma)
+{
+	return 1;
+}
+
+static inline void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
+{
+}
+
 static inline int pmd_huge(pmd_t pmd)
 {
 	return 0;
diff --git a/kernel/fork.c b/kernel/fork.c
index b3399184706c..e85e923537a2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -677,12 +677,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		}
 
 		/*
-		 * Clear hugetlb-related page reserves for children. This only
-		 * affects MAP_PRIVATE mappings. Faults generated by the child
-		 * are not guaranteed to succeed, even if read-only
+		 * Copy/update hugetlb private vma information.
 		 */
 		if (is_vm_hugetlb_page(tmp))
-			reset_vma_resv_huge_pages(tmp);
+			hugetlb_dup_vma_private(tmp);
 
 		/* Link the vma into the MT */
 		mas.index = tmp->vm_start;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8117bc299c46..616be891b798 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -90,6 +90,8 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
 
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);
+static void hugetlb_vma_lock_free(struct vm_area_struct *vma);
+static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
 
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
 {
@@ -858,7 +860,7 @@ __weak unsigned long vma_mmu_pagesize(struct vm_area_struct *vma)
  * faults in a MAP_PRIVATE mapping. Only the process that called mmap()
  * is guaranteed to have their future faults succeed.
  *
- * With the exception of reset_vma_resv_huge_pages() which is called at fork(),
+ * With the exception of hugetlb_dup_vma_private() which is called at fork(),
  * the reserve counters are updated with the hugetlb_lock held. It is safe
  * to reset the VMA at fork() time as it is not in use yet and there is no
  * chance of the global counters getting corrupted as a result of the values.
@@ -1005,12 +1007,20 @@ static int is_vma_resv_set(struct vm_area_struct *vma, unsigned long flag)
 	return (get_vma_private_data(vma) & flag) != 0;
 }
 
-/* Reset counters to 0 and clear all HPAGE_RESV_* flags */
-void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
+void hugetlb_dup_vma_private(struct vm_area_struct *vma)
 {
 	VM_BUG_ON_VMA(!is_vm_hugetlb_page(vma), vma);
+	/*
+	 * Clear vm_private_data
+	 * - For MAP_PRIVATE mappings, this is the reserve map which does
+	 *   not apply to children.  Faults generated by the children are
+	 *   not guaranteed to succeed, even if read-only.
+	 * - For shared mappings this is a per-vma semaphore that may be
+	 *   allocated in a subsequent call to hugetlb_vm_op_open.
+	 */
+	vma->vm_private_data = (void *)0;
 	if (!(vma->vm_flags & VM_MAYSHARE))
-		vma->vm_private_data = (void *)0;
+		return;
 }
 
 /*
@@ -1041,7 +1051,7 @@ void clear_vma_resv_huge_pages(struct vm_area_struct *vma)
 		kref_put(&reservations->refs, resv_map_release);
 	}
 
-	reset_vma_resv_huge_pages(vma);
+	hugetlb_dup_vma_private(vma);
 }
 
 /* Returns true if the VMA has associated reserve pages */
@@ -4622,16 +4632,21 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 		resv_map_dup_hugetlb_cgroup_uncharge_info(resv);
 		kref_get(&resv->refs);
 	}
+
+	hugetlb_vma_lock_alloc(vma);
 }
 
 static void hugetlb_vm_op_close(struct vm_area_struct *vma)
 {
 	struct hstate *h = hstate_vma(vma);
-	struct resv_map *resv = vma_resv_map(vma);
+	struct resv_map *resv;
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	unsigned long reserve, start, end;
 	long gbl_reserve;
 
+	hugetlb_vma_lock_free(vma);
+
+	resv = vma_resv_map(vma);
 	if (!resv || !is_vma_resv_set(vma, HPAGE_RESV_OWNER))
 		return;
 
@@ -6438,6 +6453,11 @@ bool hugetlb_reserve_pages(struct inode *inode,
 		return false;
 	}
 
+	/*
+	 * vma specific semaphore used for pmd sharing synchronization
+	 */
+	hugetlb_vma_lock_alloc(vma);
+
 	/*
 	 * Only apply hugepage reservation if asked. At fault time, an
 	 * attempt will be made for VM_NORESERVE to allocate a page
@@ -6461,12 +6481,11 @@ bool hugetlb_reserve_pages(struct inode *inode,
 		resv_map = inode_resv_map(inode);
 
 		chg = region_chg(resv_map, from, to, &regions_needed);
-
 	} else {
 		/* Private mapping. */
 		resv_map = resv_map_alloc();
 		if (!resv_map)
-			return false;
+			goto out_err;
 
 		chg = to - from;
 
@@ -6561,6 +6580,7 @@ bool hugetlb_reserve_pages(struct inode *inode,
 	hugetlb_cgroup_uncharge_cgroup_rsvd(hstate_index(h),
 					    chg * pages_per_huge_page(h), h_cg);
 out_err:
+	hugetlb_vma_lock_free(vma);
 	if (!vma || vma->vm_flags & VM_MAYSHARE)
 		/* Only call region_abort if the region_chg succeeded but the
 		 * region_add failed or didn't run.
@@ -6640,14 +6660,34 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
 }
 
 static bool __vma_aligned_range_pmd_shareable(struct vm_area_struct *vma,
-				unsigned long start, unsigned long end)
+				unsigned long start, unsigned long end,
+				bool check_vma_lock)
 {
+#ifdef CONFIG_USERFAULTFD
+	if (uffd_disable_huge_pmd_share(vma))
+		return false;
+#endif
 	/*
 	 * check on proper vm_flags and page table alignment
 	 */
-	if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, start, end))
-		return true;
-	return false;
+	if (!(vma->vm_flags & VM_MAYSHARE))
+		return false;
+	if (check_vma_lock && !vma->vm_private_data)
+		return false;
+	if (!range_in_vma(vma, start, end))
+		return false;
+	return true;
+}
+
+static bool vma_pmd_shareable(struct vm_area_struct *vma)
+{
+	unsigned long start = ALIGN(vma->vm_start, PUD_SIZE),
+		      end = ALIGN_DOWN(vma->vm_end, PUD_SIZE);
+
+	if (start >= end)
+		return false;
+
+	return __vma_aligned_range_pmd_shareable(vma, start, end, false);
 }
 
 static bool vma_addr_pmd_shareable(struct vm_area_struct *vma,
@@ -6656,15 +6696,11 @@ static bool vma_addr_pmd_shareable(struct vm_area_struct *vma,
 	unsigned long start = addr & PUD_MASK;
 	unsigned long end = start + PUD_SIZE;
 
-	return __vma_aligned_range_pmd_shareable(vma, start, end);
+	return __vma_aligned_range_pmd_shareable(vma, start, end, true);
 }
 
 bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
 {
-#ifdef CONFIG_USERFAULTFD
-	if (uffd_disable_huge_pmd_share(vma))
-		return false;
-#endif
 	return vma_addr_pmd_shareable(vma, addr);
 }
 
@@ -6695,6 +6731,130 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
 		*end = ALIGN(*end, PUD_SIZE);
 }
 
+static bool __vma_shareable_flags_pmd(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & (VM_MAYSHARE | VM_SHARED) &&
+		vma->vm_private_data;
+}
+
+void hugetlb_vma_lock_read(struct vm_area_struct *vma)
+{
+	if (__vma_shareable_flags_pmd(vma)) {
+		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+
+		down_read(&vma_lock->rw_sema);
+	}
+}
+
+void hugetlb_vma_unlock_read(struct vm_area_struct *vma)
+{
+	if (__vma_shareable_flags_pmd(vma)) {
+		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+
+		up_read(&vma_lock->rw_sema);
+	}
+}
+
+void hugetlb_vma_lock_write(struct vm_area_struct *vma)
+{
+	if (__vma_shareable_flags_pmd(vma)) {
+		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+
+		down_write(&vma_lock->rw_sema);
+	}
+}
+
+void hugetlb_vma_unlock_write(struct vm_area_struct *vma)
+{
+	if (__vma_shareable_flags_pmd(vma)) {
+		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+
+		up_write(&vma_lock->rw_sema);
+	}
+}
+
+int hugetlb_vma_trylock_write(struct vm_area_struct *vma)
+{
+	struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+
+	if (!__vma_shareable_flags_pmd(vma))
+		return 1;
+
+	return down_write_trylock(&vma_lock->rw_sema);
+}
+
+void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
+{
+	if (__vma_shareable_flags_pmd(vma)) {
+		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+
+		lockdep_assert_held(&vma_lock->rw_sema);
+	}
+}
+
+void hugetlb_vma_lock_release(struct kref *kref)
+{
+	struct hugetlb_vma_lock *vma_lock = container_of(kref,
+			struct hugetlb_vma_lock, refs);
+
+	kfree(vma_lock);
+}
+
+static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
+{
+	/*
+	 * Only present in sharable vmas.  See comment in
+	 * __unmap_hugepage_range_final about how VM_SHARED could
+	 * be set without VM_MAYSHARE.  As a result, we need to
+	 * check if either is set in the free path.
+	 */
+	if (!vma || !(vma->vm_flags & (VM_MAYSHARE | VM_SHARED)))
+		return;
+
+	if (vma->vm_private_data) {
+		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
+
+		/*
+		 * vma_lock structure may or not be released, but it
+		 * certainly will no longer be attached to vma so clear
+		 * pointer.
+		 */
+		vma_lock->vma = NULL;
+		kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
+		vma->vm_private_data = NULL;
+	}
+}
+
+static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
+{
+	struct hugetlb_vma_lock *vma_lock;
+
+	/* Only establish in (flags) sharable vmas */
+	if (!vma || !(vma->vm_flags & VM_MAYSHARE))
+		return;
+
+	/* Should never get here with non-NULL vm_private_data */
+	if (vma->vm_private_data)
+		return;
+
+	/* Check size/alignment for pmd sharing possible */
+	if (!vma_pmd_shareable(vma))
+		return;
+
+	vma_lock = kmalloc(sizeof(*vma_lock), GFP_KERNEL);
+	if (!vma_lock)
+		/*
+		 * If we can not allocate structure, then vma can not
+		 * participate in pmd sharing.
+		 */
+		return;
+
+	kref_init(&vma_lock->refs);
+	init_rwsem(&vma_lock->rw_sema);
+	vma_lock->vma = vma;
+	vma->vm_private_data = vma_lock;
+}
+
 /*
  * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
  * and returns the corresponding pte. While this is not necessary for the
@@ -6781,6 +6941,14 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 #else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
+static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
+{
+}
+
+static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
+{
+}
+
 pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, pud_t *pud)
 {
diff --git a/mm/rmap.c b/mm/rmap.c
index d17d68a9b15b..744faaef0489 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -24,7 +24,7 @@
  *   mm->mmap_lock
  *     mapping->invalidate_lock (in filemap_fault)
  *       page->flags PG_locked (lock_page)
- *         hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
+ *         hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
  *           mapping->i_mmap_rwsem
  *             anon_vma->rwsem
  *               mm->page_table_lock or pte_lock
@@ -44,6 +44,12 @@
  * anon_vma->rwsem,mapping->i_mmap_rwsem   (memory_failure, collect_procs_anon)
  *   ->tasklist_lock
  *     pte map lock
+ *
+ * hugetlbfs PageHuge() take locks in this order:
+ *   hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
+ *     vma_lock (hugetlb specific lock for pmd_sharing)
+ *       mapping->i_mmap_rwsem (also used for hugetlb pmd sharing)
+ *         page->flags PG_locked (lock_page)
  */
 
 #include <linux/mm.h>
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 7/9] hugetlb: create hugetlb_unmap_file_folio to unmap single file folio
  2022-09-14 22:18 [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization Mike Kravetz
                   ` (5 preceding siblings ...)
  2022-09-14 22:18 ` [PATCH v2 6/9] hugetlb: add vma based lock for pmd sharing Mike Kravetz
@ 2022-09-14 22:18 ` Mike Kravetz
  2022-09-14 22:18 ` [PATCH v2 8/9] hugetlb: use new vma_lock for pmd sharing synchronization Mike Kravetz
  2022-09-14 22:18 ` [PATCH v2 9/9] hugetlb: clean up code checking for fault/truncation races Mike Kravetz
  8 siblings, 0 replies; 18+ messages in thread
From: Mike Kravetz @ 2022-09-14 22:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton, Mike Kravetz

Create the new routine hugetlb_unmap_file_folio that will unmap a single
file folio.  This is refactored code from hugetlb_vmdelete_list.  It is
modified to do locking within the routine itself and check whether the
page is mapped within a specific vma before unmapping.

This refactoring will be put to use and expanded upon in a subsequent
patch adding vma specific locking.
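
For illustration only (the numbers are made up): the new vma_offset_start
and vma_offset_end helpers convert a file page-offset range into virtual
offsets within each vma found in the interval tree.  For a vma with
vm_pgoff == 1024, removing the folio covering file pages [1536, 2048)
works out to:

        v_start = vma_offset_start(vma, 1536);  /* (1536 - 1024) << PAGE_SHIFT */
        v_end   = vma_offset_end(vma, 2048);    /* capped at vma->vm_end */
        if (hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
                unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
                                     NULL, ZAP_FLAG_DROP_MARKER);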

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
---
 fs/hugetlbfs/inode.c | 123 +++++++++++++++++++++++++++++++++----------
 1 file changed, 94 insertions(+), 29 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 7112a9a9f54d..3bb1772fce2f 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -371,6 +371,94 @@ static void hugetlb_delete_from_page_cache(struct page *page)
 	delete_from_page_cache(page);
 }
 
+/*
+ * Called with i_mmap_rwsem held for inode based vma maps.  This makes
+ * sure vma (and vm_mm) will not go away.  We also hold the hugetlb fault
+ * mutex for the page in the mapping.  So, we can not race with page being
+ * faulted into the vma.
+ */
+static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
+				unsigned long addr, struct page *page)
+{
+	pte_t *ptep, pte;
+
+	ptep = huge_pte_offset(vma->vm_mm, addr,
+			huge_page_size(hstate_vma(vma)));
+
+	if (!ptep)
+		return false;
+
+	pte = huge_ptep_get(ptep);
+	if (huge_pte_none(pte) || !pte_present(pte))
+		return false;
+
+	if (pte_page(pte) == page)
+		return true;
+
+	return false;
+}
+
+/*
+ * Can vma_offset_start/vma_offset_end overflow on 32-bit arches?
+ * No, because the interval tree returns us only those vmas
+ * which overlap the truncated area starting at pgoff,
+ * and no vma on a 32-bit arch can span beyond the 4GB.
+ */
+static unsigned long vma_offset_start(struct vm_area_struct *vma, pgoff_t start)
+{
+	if (vma->vm_pgoff < start)
+		return (start - vma->vm_pgoff) << PAGE_SHIFT;
+	else
+		return 0;
+}
+
+static unsigned long vma_offset_end(struct vm_area_struct *vma, pgoff_t end)
+{
+	unsigned long t_end;
+
+	if (!end)
+		return vma->vm_end;
+
+	t_end = ((end - vma->vm_pgoff) << PAGE_SHIFT) + vma->vm_start;
+	if (t_end > vma->vm_end)
+		t_end = vma->vm_end;
+	return t_end;
+}
+
+/*
+ * Called with hugetlb fault mutex held.  Therefore, no more mappings to
+ * this folio can be created while executing the routine.
+ */
+static void hugetlb_unmap_file_folio(struct hstate *h,
+					struct address_space *mapping,
+					struct folio *folio, pgoff_t index)
+{
+	struct rb_root_cached *root = &mapping->i_mmap;
+	struct page *page = &folio->page;
+	struct vm_area_struct *vma;
+	unsigned long v_start;
+	unsigned long v_end;
+	pgoff_t start, end;
+
+	start = index * pages_per_huge_page(h);
+	end = (index + 1) * pages_per_huge_page(h);
+
+	i_mmap_lock_write(mapping);
+
+	vma_interval_tree_foreach(vma, root, start, end - 1) {
+		v_start = vma_offset_start(vma, start);
+		v_end = vma_offset_end(vma, end);
+
+		if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
+			continue;
+
+		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
+				NULL, ZAP_FLAG_DROP_MARKER);
+	}
+
+	i_mmap_unlock_write(mapping);
+}
+
 static void
 hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
 		      zap_flags_t zap_flags)
@@ -383,30 +471,13 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
 	 * an inclusive "last".
 	 */
 	vma_interval_tree_foreach(vma, root, start, end ? end - 1 : ULONG_MAX) {
-		unsigned long v_offset;
+		unsigned long v_start;
 		unsigned long v_end;
 
-		/*
-		 * Can the expression below overflow on 32-bit arches?
-		 * No, because the interval tree returns us only those vmas
-		 * which overlap the truncated area starting at pgoff,
-		 * and no vma on a 32-bit arch can span beyond the 4GB.
-		 */
-		if (vma->vm_pgoff < start)
-			v_offset = (start - vma->vm_pgoff) << PAGE_SHIFT;
-		else
-			v_offset = 0;
-
-		if (!end)
-			v_end = vma->vm_end;
-		else {
-			v_end = ((end - vma->vm_pgoff) << PAGE_SHIFT)
-							+ vma->vm_start;
-			if (v_end > vma->vm_end)
-				v_end = vma->vm_end;
-		}
+		v_start = vma_offset_start(vma, start);
+		v_end = vma_offset_end(vma, end);
 
-		unmap_hugepage_range(vma, vma->vm_start + v_offset, v_end,
+		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
 				     NULL, zap_flags);
 	}
 }
@@ -428,14 +499,8 @@ static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
 	 * the fault mutex.  The mutex will prevent faults
 	 * until we finish removing the folio.
 	 */
-	if (unlikely(folio_mapped(folio))) {
-		i_mmap_lock_write(mapping);
-		hugetlb_vmdelete_list(&mapping->i_mmap,
-			index * pages_per_huge_page(h),
-			(index + 1) * pages_per_huge_page(h),
-			ZAP_FLAG_DROP_MARKER);
-		i_mmap_unlock_write(mapping);
-	}
+	if (unlikely(folio_mapped(folio)))
+		hugetlb_unmap_file_folio(h, mapping, folio, index);
 
 	folio_lock(folio);
 	/*
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 8/9] hugetlb: use new vma_lock for pmd sharing synchronization
  2022-09-14 22:18 [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization Mike Kravetz
                   ` (6 preceding siblings ...)
  2022-09-14 22:18 ` [PATCH v2 7/9] hugetlb: create hugetlb_unmap_file_folio to unmap single file folio Mike Kravetz
@ 2022-09-14 22:18 ` Mike Kravetz
  2022-09-29  6:08   ` Miaohe Lin
  2022-09-14 22:18 ` [PATCH v2 9/9] hugetlb: clean up code checking for fault/truncation races Mike Kravetz
  8 siblings, 1 reply; 18+ messages in thread
From: Mike Kravetz @ 2022-09-14 22:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton, Mike Kravetz

The new hugetlb vma lock is used to address this race:

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...

The vma_lock is used as follows:
- During fault processing. The lock is acquired in read mode before
  doing a page table lock and allocation (huge_pte_alloc).  The lock is
  held until code is finished with the page table entry (ptep).
- The lock must be held in write mode whenever huge_pmd_unshare is
  called.
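
For illustration only (a simplified sketch, not the exact hunks below),
the fault path usage implied by these rules is:

        mutex_lock(&hugetlb_fault_mutex_table[hash]);
        hugetlb_vma_lock_read(vma);     /* before walking/allocating page tables */
        ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
        /* ... use ptep; huge_pmd_unshare can not invalidate it here ... */
        hugetlb_vma_unlock_read(vma);
        mutex_unlock(&hugetlb_fault_mutex_table[hash]);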

Lock ordering issues come into play when unmapping a page from all
vmas mapping the page.  The i_mmap_rwsem must be held to search for the
vmas, and the vma lock must be held before calling unmap which will
call huge_pmd_unshare.  This is done today in:
- try_to_migrate_one and try_to_unmap_one for page migration and memory
  error handling.  In these routines we 'try' to obtain the vma lock and
  fail to unmap if unsuccessful.  Calling routines already deal with the
  failure of unmapping.
- hugetlb_vmdelete_list for truncation and hole punch.  This routine
  also tries to acquire the vma lock.  If it fails, it skips the
  unmapping.  However, we can not have file truncation or hole punch
  fail because of contention.  After hugetlb_vmdelete_list, truncation
  and hole punch call remove_inode_hugepages.  remove_inode_hugepages
  checks for mapped pages and calls hugetlb_unmap_file_folio to unmap them.
  hugetlb_unmap_file_folio is designed to drop locks and reacquire them in the
  correct order to guarantee unmap success.
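
A rough sketch (illustrative only, simplified from the hunks below) of the
two strategies:

        /* rmap paths: i_mmap_rwsem already held (TTU_RMAP_LOCKED), only try */
        if (!hugetlb_vma_trylock_write(vma))
                return false;           /* caller copes with the failed unmap */
        huge_pmd_unshare(mm, vma, address, ptep);
        hugetlb_vma_unlock_write(vma);

        /* hugetlb_unmap_file_folio: unmap must not fail, retake locks in order */
        kref_get(&vma_lock->refs);      /* lock structure can outlive the vma */
        i_mmap_unlock_write(mapping);
        down_write(&vma_lock->rw_sema); /* vma_lock first ... */
        i_mmap_lock_write(mapping);     /* ... then i_mmap_rwsem */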

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c |  66 +++++++++++++++++++++++++++-
 mm/hugetlb.c         | 102 +++++++++++++++++++++++++++++++++++++++----
 mm/memory.c          |   2 +
 mm/rmap.c            | 100 +++++++++++++++++++++++++++---------------
 mm/userfaultfd.c     |   9 +++-
 5 files changed, 233 insertions(+), 46 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3bb1772fce2f..009ae539b9b2 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -434,6 +434,7 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
 					struct folio *folio, pgoff_t index)
 {
 	struct rb_root_cached *root = &mapping->i_mmap;
+	struct hugetlb_vma_lock *vma_lock;
 	struct page *page = &folio->page;
 	struct vm_area_struct *vma;
 	unsigned long v_start;
@@ -444,7 +445,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
 	end = (index + 1) * pages_per_huge_page(h);
 
 	i_mmap_lock_write(mapping);
-
+retry:
+	vma_lock = NULL;
 	vma_interval_tree_foreach(vma, root, start, end - 1) {
 		v_start = vma_offset_start(vma, start);
 		v_end = vma_offset_end(vma, end);
@@ -452,11 +454,63 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
 		if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
 			continue;
 
+		if (!hugetlb_vma_trylock_write(vma)) {
+			vma_lock = vma->vm_private_data;
+			/*
+			 * If we can not get vma lock, we need to drop
+			 * immap_sema and take locks in order.  First,
+			 * take a ref on the vma_lock structure so that
+			 * we can be guaranteed it will not go away when
+			 * dropping immap_sema.
+			 */
+			kref_get(&vma_lock->refs);
+			break;
+		}
+
 		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
 				NULL, ZAP_FLAG_DROP_MARKER);
+		hugetlb_vma_unlock_write(vma);
 	}
 
 	i_mmap_unlock_write(mapping);
+
+	if (vma_lock) {
+		/*
+		 * Wait on vma_lock.  We know it is still valid as we have
+		 * a reference.  We must 'open code' vma locking as we do
+		 * not know if vma_lock is still attached to vma.
+		 */
+		down_write(&vma_lock->rw_sema);
+		i_mmap_lock_write(mapping);
+
+		vma = vma_lock->vma;
+		if (!vma) {
+			/*
+			 * If lock is no longer attached to vma, then just
+			 * unlock, drop our reference and retry looking for
+			 * other vmas.
+			 */
+			up_write(&vma_lock->rw_sema);
+			kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
+			goto retry;
+		}
+
+		/*
+		 * vma_lock is still attached to vma.  Check to see if vma
+		 * still maps page and if so, unmap.
+		 */
+		v_start = vma_offset_start(vma, start);
+		v_end = vma_offset_end(vma, end);
+		if (hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
+			unmap_hugepage_range(vma, vma->vm_start + v_start,
+						v_end, NULL,
+						ZAP_FLAG_DROP_MARKER);
+
+		kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
+		hugetlb_vma_unlock_write(vma);
+
+		goto retry;
+	}
 }
 
 static void
@@ -474,11 +528,21 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
 		unsigned long v_start;
 		unsigned long v_end;
 
+		if (!hugetlb_vma_trylock_write(vma))
+			continue;
+
 		v_start = vma_offset_start(vma, start);
 		v_end = vma_offset_end(vma, end);
 
 		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
 				     NULL, zap_flags);
+
+		/*
+		 * Note that vma lock only exists for shared/non-private
+		 * vmas.  Therefore, lock is not held when calling
+		 * unmap_hugepage_range for private vmas.
+		 */
+		hugetlb_vma_unlock_write(vma);
 	}
 }
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 616be891b798..e8cbc0f7cdaa 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4795,6 +4795,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		mmu_notifier_invalidate_range_start(&range);
 		mmap_assert_write_locked(src);
 		raw_write_seqcount_begin(&src->write_protect_seq);
+	} else {
+		/*
+		 * For shared mappings the vma lock must be held before
+		 * calling huge_pte_offset in the src vma. Otherwise, the
+		 * returned ptep could go away if part of a shared pmd and
+		 * another thread calls huge_pmd_unshare.
+		 */
+		hugetlb_vma_lock_read(src_vma);
 	}
 
 	last_addr_mask = hugetlb_mask_last_page(h);
@@ -4941,6 +4949,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 	if (cow) {
 		raw_write_seqcount_end(&src->write_protect_seq);
 		mmu_notifier_invalidate_range_end(&range);
+	} else {
+		hugetlb_vma_unlock_read(src_vma);
 	}
 
 	return ret;
@@ -4999,6 +5009,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 	mmu_notifier_invalidate_range_start(&range);
 	last_addr_mask = hugetlb_mask_last_page(h);
 	/* Prevent race with file truncation */
+	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(mapping);
 	for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
 		src_pte = huge_pte_offset(mm, old_addr, sz);
@@ -5030,6 +5041,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 		flush_tlb_range(vma, old_end - len, old_end);
 	mmu_notifier_invalidate_range_end(&range);
 	i_mmap_unlock_write(mapping);
+	hugetlb_vma_unlock_write(vma);
 
 	return len + old_addr - old_end;
 }
@@ -5349,8 +5361,29 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * may get SIGKILLed if it later faults.
 		 */
 		if (outside_reserve) {
+			struct address_space *mapping = vma->vm_file->f_mapping;
+			pgoff_t idx;
+			u32 hash;
+
 			put_page(old_page);
+			/*
+			 * Drop hugetlb_fault_mutex and vma_lock before
+			 * unmapping.  unmapping needs to hold vma_lock
+			 * in write mode.  Dropping vma_lock in read mode
+			 * here is OK as COW mappings do not interact with
+			 * PMD sharing.
+			 *
+			 * Reacquire both after unmap operation.
+			 */
+			idx = vma_hugecache_offset(h, vma, haddr);
+			hash = hugetlb_fault_mutex_hash(mapping, idx);
+			hugetlb_vma_unlock_read(vma);
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+
 			unmap_ref_private(mm, vma, old_page, haddr);
+
+			mutex_lock(&hugetlb_fault_mutex_table[hash]);
+			hugetlb_vma_lock_read(vma);
 			spin_lock(ptl);
 			ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 			if (likely(ptep &&
@@ -5499,14 +5532,16 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
 	};
 
 	/*
-	 * hugetlb_fault_mutex and i_mmap_rwsem must be
+	 * vma_lock and hugetlb_fault_mutex must be
 	 * dropped before handling userfault.  Reacquire
 	 * after handling fault to make calling code simpler.
 	 */
+	hugetlb_vma_unlock_read(vma);
 	hash = hugetlb_fault_mutex_hash(mapping, idx);
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 	ret = handle_userfault(&vmf, reason);
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
+	hugetlb_vma_lock_read(vma);
 
 	return ret;
 }
@@ -5740,6 +5775,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (ptep) {
+		/*
+		 * Since we hold no locks, ptep could be stale.  That is
+		 * OK as we are only making decisions based on content and
+		 * not actually modifying content here.
+		 */
 		entry = huge_ptep_get(ptep);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
 			migration_entry_wait_huge(vma, ptep);
@@ -5747,23 +5787,35 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON_LARGE |
 				VM_FAULT_SET_HINDEX(hstate_index(h));
-	} else {
-		ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
-		if (!ptep)
-			return VM_FAULT_OOM;
 	}
 
-	mapping = vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, vma, haddr);
-
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
+	mapping = vma->vm_file->f_mapping;
+	idx = vma_hugecache_offset(h, vma, haddr);
 	hash = hugetlb_fault_mutex_hash(mapping, idx);
 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
 
+	/*
+	 * Acquire vma lock before calling huge_pte_alloc and hold
+	 * until finished with ptep.  This prevents huge_pmd_unshare from
+	 * being called elsewhere and making the ptep no longer valid.
+	 *
+	 * ptep could have already be assigned via huge_pte_offset.  That
+	 * is OK, as huge_pte_alloc will return the same value unless
+	 * something has changed.
+	 */
+	hugetlb_vma_lock_read(vma);
+	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
+	if (!ptep) {
+		hugetlb_vma_unlock_read(vma);
+		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+		return VM_FAULT_OOM;
+	}
+
 	entry = huge_ptep_get(ptep);
 	/* PTE markers should be handled the same way as none pte */
 	if (huge_pte_none_mostly(entry)) {
@@ -5824,6 +5876,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unlock_page(pagecache_page);
 			put_page(pagecache_page);
 		}
+		hugetlb_vma_unlock_read(vma);
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		return handle_userfault(&vmf, VM_UFFD_WP);
 	}
@@ -5867,6 +5920,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		put_page(pagecache_page);
 	}
 out_mutex:
+	hugetlb_vma_unlock_read(vma);
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 	/*
 	 * Generally it's safe to hold refcount during waiting page lock. But
@@ -6329,8 +6383,9 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	flush_cache_range(vma, range.start, range.end);
 
 	mmu_notifier_invalidate_range_start(&range);
-	last_addr_mask = hugetlb_mask_last_page(h);
+	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
+	last_addr_mask = hugetlb_mask_last_page(h);
 	for (; address < end; address += psize) {
 		spinlock_t *ptl;
 		ptep = huge_pte_offset(mm, address, psize);
@@ -6429,6 +6484,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 	 * See Documentation/mm/mmu_notifier.rst
 	 */
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
+	hugetlb_vma_unlock_write(vma);
 	mmu_notifier_invalidate_range_end(&range);
 
 	return pages << h->order;
@@ -6930,6 +6986,7 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
 	pud_t *pud = pud_offset(p4d, addr);
 
 	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
+	hugetlb_vma_assert_locked(vma);
 	BUG_ON(page_count(virt_to_page(ptep)) == 0);
 	if (page_count(virt_to_page(ptep)) == 1)
 		return 0;
@@ -6941,6 +6998,31 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
 }
 
 #else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
+void hugetlb_vma_lock_read(struct vm_area_struct *vma)
+{
+}
+
+void hugetlb_vma_unlock_read(struct vm_area_struct *vma)
+{
+}
+
+void hugetlb_vma_lock_write(struct vm_area_struct *vma)
+{
+}
+
+void hugetlb_vma_unlock_write(struct vm_area_struct *vma)
+{
+}
+
+int hugetlb_vma_trylock_write(struct vm_area_struct *vma)
+{
+	return 1;
+}
+
+void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
+{
+}
+
 static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
 {
 }
@@ -7318,6 +7400,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
 				start, end);
 	mmu_notifier_invalidate_range_start(&range);
+	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 	for (address = start; address < end; address += PUD_SIZE) {
 		ptep = huge_pte_offset(mm, address, sz);
@@ -7329,6 +7412,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
 	}
 	flush_hugetlb_tlb_range(vma, start, end);
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
+	hugetlb_vma_unlock_write(vma);
 	/*
 	 * No need to call mmu_notifier_invalidate_range(), see
 	 * Documentation/mm/mmu_notifier.rst.
diff --git a/mm/memory.c b/mm/memory.c
index c4c3c2fd4f45..118e5f023597 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1685,10 +1685,12 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 			if (vma->vm_file) {
 				zap_flags_t zap_flags = details ?
 				    details->zap_flags : 0;
+				hugetlb_vma_lock_write(vma);
 				i_mmap_lock_write(vma->vm_file->f_mapping);
 				__unmap_hugepage_range_final(tlb, vma, start, end,
 							     NULL, zap_flags);
 				i_mmap_unlock_write(vma->vm_file->f_mapping);
+				hugetlb_vma_unlock_write(vma);
 			}
 		} else
 			unmap_page_range(tlb, vma, start, end, details);
diff --git a/mm/rmap.c b/mm/rmap.c
index 744faaef0489..2ec925e5fa6a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1554,24 +1554,39 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			 * To call huge_pmd_unshare, i_mmap_rwsem must be
 			 * held in write mode.  Caller needs to explicitly
 			 * do this outside rmap routines.
+			 *
+			 * We also must hold hugetlb vma_lock in write mode.
+			 * Lock order dictates acquiring vma_lock BEFORE
+			 * i_mmap_rwsem.  We can only try lock here and fail
+			 * if unsuccessful.
 			 */
-			VM_BUG_ON(!anon && !(flags & TTU_RMAP_LOCKED));
-			if (!anon && huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
-				flush_tlb_range(vma, range.start, range.end);
-				mmu_notifier_invalidate_range(mm, range.start,
-							      range.end);
-
-				/*
-				 * The ref count of the PMD page was dropped
-				 * which is part of the way map counting
-				 * is done for shared PMDs.  Return 'true'
-				 * here.  When there is no other sharing,
-				 * huge_pmd_unshare returns false and we will
-				 * unmap the actual page and drop map count
-				 * to zero.
-				 */
-				page_vma_mapped_walk_done(&pvmw);
-				break;
+			if (!anon) {
+				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+				if (!hugetlb_vma_trylock_write(vma)) {
+					page_vma_mapped_walk_done(&pvmw);
+					ret = false;
+					break;
+				}
+				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
+					hugetlb_vma_unlock_write(vma);
+					flush_tlb_range(vma,
+						range.start, range.end);
+					mmu_notifier_invalidate_range(mm,
+						range.start, range.end);
+					/*
+					 * The ref count of the PMD page was
+					 * dropped which is part of the way map
+					 * counting is done for shared PMDs.
+					 * Return 'true' here.  When there is
+					 * no other sharing, huge_pmd_unshare
+					 * returns false and we will unmap the
+					 * actual page and drop map count
+					 * to zero.
+					 */
+					page_vma_mapped_walk_done(&pvmw);
+					break;
+				}
+				hugetlb_vma_unlock_write(vma);
 			}
 			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
 		} else {
@@ -1929,26 +1944,41 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			 * To call huge_pmd_unshare, i_mmap_rwsem must be
 			 * held in write mode.  Caller needs to explicitly
 			 * do this outside rmap routines.
+			 *
+			 * We also must hold hugetlb vma_lock in write mode.
+			 * Lock order dictates acquiring vma_lock BEFORE
+			 * i_mmap_rwsem.  We can only try lock here and
+			 * fail if unsuccessful.
 			 */
-			VM_BUG_ON(!anon && !(flags & TTU_RMAP_LOCKED));
-			if (!anon && huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
-				flush_tlb_range(vma, range.start, range.end);
-				mmu_notifier_invalidate_range(mm, range.start,
-							      range.end);
-
-				/*
-				 * The ref count of the PMD page was dropped
-				 * which is part of the way map counting
-				 * is done for shared PMDs.  Return 'true'
-				 * here.  When there is no other sharing,
-				 * huge_pmd_unshare returns false and we will
-				 * unmap the actual page and drop map count
-				 * to zero.
-				 */
-				page_vma_mapped_walk_done(&pvmw);
-				break;
+			if (!anon) {
+				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+				if (!hugetlb_vma_trylock_write(vma)) {
+					page_vma_mapped_walk_done(&pvmw);
+					ret = false;
+					break;
+				}
+				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
+					hugetlb_vma_unlock_write(vma);
+					flush_tlb_range(vma,
+						range.start, range.end);
+					mmu_notifier_invalidate_range(mm,
+						range.start, range.end);
+
+					/*
+					 * The ref count of the PMD page was
+					 * dropped which is part of the way map
+					 * counting is done for shared PMDs.
+					 * Return 'true' here.  When there is
+					 * no other sharing, huge_pmd_unshare
+					 * returns false and we will unmap the
+					 * actual page and drop map count
+					 * to zero.
+					 */
+					page_vma_mapped_walk_done(&pvmw);
+					break;
+				}
+				hugetlb_vma_unlock_write(vma);
 			}
-
 			/* Nuke the hugetlb page table entry */
 			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
 		} else {
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0fdbd2c05587..e24e8a47ce8a 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -379,16 +379,21 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		BUG_ON(dst_addr >= dst_start + len);
 
 		/*
-		 * Serialize via hugetlb_fault_mutex.
+		 * Serialize via vma_lock and hugetlb_fault_mutex.
+		 * vma_lock ensures the dst_pte remains valid even
+		 * in the case of shared pmds.  fault mutex prevents
+		 * races with other faulting threads.
 		 */
 		idx = linear_page_index(dst_vma, dst_addr);
 		mapping = dst_vma->vm_file->f_mapping;
 		hash = hugetlb_fault_mutex_hash(mapping, idx);
 		mutex_lock(&hugetlb_fault_mutex_table[hash]);
+		hugetlb_vma_lock_read(dst_vma);
 
 		err = -ENOMEM;
 		dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
 		if (!dst_pte) {
+			hugetlb_vma_unlock_read(dst_vma);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			goto out_unlock;
 		}
@@ -396,6 +401,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		if (mode != MCOPY_ATOMIC_CONTINUE &&
 		    !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
 			err = -EEXIST;
+			hugetlb_vma_unlock_read(dst_vma);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			goto out_unlock;
 		}
@@ -404,6 +410,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 					       dst_addr, src_addr, mode, &page,
 					       wp_copy);
 
+		hugetlb_vma_unlock_read(dst_vma);
 		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 
 		cond_resched();
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 9/9] hugetlb: clean up code checking for fault/truncation races
  2022-09-14 22:18 [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization Mike Kravetz
                   ` (7 preceding siblings ...)
  2022-09-14 22:18 ` [PATCH v2 8/9] hugetlb: use new vma_lock for pmd sharing synchronization Mike Kravetz
@ 2022-09-14 22:18 ` Mike Kravetz
  2022-09-19 23:32   ` Mike Kravetz
  2022-09-29  6:25   ` Miaohe Lin
  8 siblings, 2 replies; 18+ messages in thread
From: Mike Kravetz @ 2022-09-14 22:18 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton, Mike Kravetz

With the new hugetlb vma lock in place, it can also be used to handle
page fault races with file truncation.  The lock is taken at the
beginning of the page fault path in read mode.  During truncation, it
is taken in write mode for each vma which has the file mapped.  The
file's size (i_size) is modified before taking the vma lock to unmap.

How are races handled?

The page fault code checks i_size early in processing after taking the
vma lock.  If the fault is beyond i_size, the fault is aborted.  If the
fault is not beyond i_size the fault will continue and a new page will
be added to the file.  It could be that truncation code modifies i_size
after the check in fault code.  That is OK, as truncation code will soon
remove the page.  The truncation code will wait until the fault is
finished, as it must obtain the vma lock in write mode.

This patch cleans up/removes late checks in the fault paths that try to
back out pages racing with truncation.  As noted above, we just let the
truncation code remove the pages.
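
As an illustrative sketch only (not the patch itself), the ordering that
makes this safe is roughly:

        /* page fault path, vma lock held in read mode */
        size = i_size_read(mapping->host) >> huge_page_shift(h);
        if (idx >= size)
                goto out;               /* abort faults beyond i_size */

        /* truncation path */
        i_size_write(inode, offset);    /* shrink i_size first ... */
        hugetlb_vma_lock_write(vma);    /* ... then wait for in-flight faults */
        /* unmap and remove pages past the new i_size */
        hugetlb_vma_unlock_write(vma);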

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c | 31 ++++++++++++-------------------
 mm/hugetlb.c         | 27 ++++++---------------------
 2 files changed, 18 insertions(+), 40 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 009ae539b9b2..ed57a029eab0 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -568,26 +568,19 @@ static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
 
 	folio_lock(folio);
 	/*
-	 * After locking page, make sure mapping is the same.
-	 * We could have raced with page fault populate and
-	 * backout code.
+	 * We must remove the folio from page cache before removing
+	 * the region/ reserve map (hugetlb_unreserve_pages).  In
+	 * rare out of memory conditions, removal of the region/reserve
+	 * map could fail.  Correspondingly, the subpool and global
+	 * reserve usage count can need to be adjusted.
 	 */
-	if (folio_mapping(folio) == mapping) {
-		/*
-		 * We must remove the folio from page cache before removing
-		 * the region/ reserve map (hugetlb_unreserve_pages).  In
-		 * rare out of memory conditions, removal of the region/reserve
-		 * map could fail.  Correspondingly, the subpool and global
-		 * reserve usage count can need to be adjusted.
-		 */
-		VM_BUG_ON(HPageRestoreReserve(&folio->page));
-		hugetlb_delete_from_page_cache(&folio->page);
-		ret = true;
-		if (!truncate_op) {
-			if (unlikely(hugetlb_unreserve_pages(inode, index,
-								index + 1, 1)))
-				hugetlb_fix_reserve_counts(inode);
-		}
+	VM_BUG_ON(HPageRestoreReserve(&folio->page));
+	hugetlb_delete_from_page_cache(&folio->page);
+	ret = true;
+	if (!truncate_op) {
+		if (unlikely(hugetlb_unreserve_pages(inode, index,
+							index + 1, 1)))
+			hugetlb_fix_reserve_counts(inode);
 	}
 
 	folio_unlock(folio);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e8cbc0f7cdaa..2207300791e5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5561,6 +5561,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	unsigned long haddr = address & huge_page_mask(h);
 	bool new_page, new_pagecache_page = false;
+	bool reserve_alloc = false;
 
 	/*
 	 * Currently, we are forced to kill the process in the event the
@@ -5616,6 +5617,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		clear_huge_page(page, address, pages_per_huge_page(h));
 		__SetPageUptodate(page);
 		new_page = true;
+		if (HPageRestoreReserve(page))
+			reserve_alloc = true;
 
 		if (vma->vm_flags & VM_MAYSHARE) {
 			int err = hugetlb_add_to_page_cache(page, mapping, idx);
@@ -5679,10 +5682,6 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	}
 
 	ptl = huge_pte_lock(h, mm, ptep);
-	size = i_size_read(mapping->host) >> huge_page_shift(h);
-	if (idx >= size)
-		goto backout;
-
 	ret = 0;
 	/* If pte changed from under us, retry */
 	if (!pte_same(huge_ptep_get(ptep), old_pte))
@@ -5726,10 +5725,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 backout:
 	spin_unlock(ptl);
 backout_unlocked:
-	unlock_page(page);
-	/* restore reserve for newly allocated pages not in page cache */
 	if (new_page && !new_pagecache_page)
 		restore_reserve_on_error(h, vma, haddr, page);
+
+	unlock_page(page);
 	put_page(page);
 	goto out;
 }
@@ -6061,26 +6060,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 
 	ptl = huge_pte_lock(h, dst_mm, dst_pte);
 
-	/*
-	 * Recheck the i_size after holding PT lock to make sure not
-	 * to leave any page mapped (as page_mapped()) beyond the end
-	 * of the i_size (remove_inode_hugepages() is strict about
-	 * enforcing that). If we bail out here, we'll also leave a
-	 * page in the radix tree in the vm_shared case beyond the end
-	 * of the i_size, but remove_inode_hugepages() will take care
-	 * of it as soon as we drop the hugetlb_fault_mutex_table.
-	 */
-	size = i_size_read(mapping->host) >> huge_page_shift(h);
-	ret = -EFAULT;
-	if (idx >= size)
-		goto out_release_unlock;
-
-	ret = -EEXIST;
 	/*
 	 * We allow to overwrite a pte marker: consider when both MISSING|WP
 	 * registered, we firstly wr-protect a none pte which has no page cache
 	 * page backing it, then access the page.
 	 */
+	ret = -EEXIST;
 	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
 		goto out_release_unlock;
 
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 6/9] hugetlb: add vma based lock for pmd sharing
  2022-09-14 22:18 ` [PATCH v2 6/9] hugetlb: add vma based lock for pmd sharing Mike Kravetz
@ 2022-09-15 20:25   ` Mike Kravetz
  2022-09-24 13:11   ` Miaohe Lin
  1 sibling, 0 replies; 18+ messages in thread
From: Mike Kravetz @ 2022-09-15 20:25 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton

On 09/14/22 15:18, Mike Kravetz wrote:
> Allocate a new hugetlb_vma_lock structure and hang off vm_private_data
> for synchronization use by vmas that could be involved in pmd sharing.
> This data structure contains a rw semaphore that is the primary tool
> used for synchronization.
> 
> This new structure is ref counted, so that it can exist when NOT attached
> to a vma.  This is only helpful in resolving lock ordering issues where
> code may need to obtain the vma_lock while there is no guarantee the
> vma will not go away.  By obtaining a ref on the structure, it can be
> guaranteed that at least the rw semaphore will not go away.
> 
> Only add infrastructure for the new lock here.  Actual use will be added
> in subsequent patches.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  include/linux/hugetlb.h |  43 ++++++++-
>  kernel/fork.c           |   6 +-
>  mm/hugetlb.c            | 202 ++++++++++++++++++++++++++++++++++++----
>  mm/rmap.c               |   8 +-
>  4 files changed, 235 insertions(+), 24 deletions(-)

Reviewers - FYI, the following was added to address a build issue caused
by this patch.

From 8b3031350154e8e401ccfbc5e71cb95ef654d017 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Thu, 15 Sep 2022 09:33:44 -0700
Subject: [PATCH] hugetlb: fix build issue for missing hugetlb_vma_lock_release

Add a stub for hugetlb_vma_lock_release to build in the case
CONFIG_HUGETLB_PAGE && !CONFIG_ARCH_WANT_HUGE_PMD_SHARE.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2207300791e5..cc7877da18d7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7008,6 +7008,10 @@ void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
 {
 }
 
+void hugetlb_vma_lock_release(struct kref *kref)
+{
+}
+
 static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
 {
 }
-- 
2.37.2

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 9/9] hugetlb: clean up code checking for fault/truncation races
  2022-09-14 22:18 ` [PATCH v2 9/9] hugetlb: clean up code checking for fault/truncation races Mike Kravetz
@ 2022-09-19 23:32   ` Mike Kravetz
  2022-09-29  6:25   ` Miaohe Lin
  1 sibling, 0 replies; 18+ messages in thread
From: Mike Kravetz @ 2022-09-19 23:32 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Miaohe Lin, David Hildenbrand, Sven Schnelle,
	Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V,
	Andrea Arcangeli, Kirill A . Shutemov, Davidlohr Bueso,
	Prakash Sangappa, James Houghton, Mina Almasry, Pasha Tatashin,
	Axel Rasmussen, Ray Fucillo, Andrew Morton

On 09/14/22 15:18, Mike Kravetz wrote:
> With the new hugetlb vma lock in place, it can also be used to handle
> page fault races with file truncation.  The lock is taken at the
> beginning of the page fault path in read mode.  During truncation, it
> is taken in write mode for each vma which has the file mapped.  The
> file's size (i_size) is modified before taking the vma lock to unmap.
> 
> How are races handled?
> 
> The page fault code checks i_size early in processing after taking the
> vma lock.  If the fault is beyond i_size, the fault is aborted.  If the
> fault is not beyond i_size the fault will continue and a new page will
> be added to the file.  It could be that truncation code modifies i_size
> after the check in fault code.  That is OK, as truncation code will soon
> remove the page.  The truncation code will wait until the fault is
> finished, as it must obtain the vma lock in write mode.
> 
> This patch cleans up/removes late checks in the fault paths that try to
> back out pages racing with truncation.  As noted above, we just let the
> truncation code remove the pages.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  fs/hugetlbfs/inode.c | 31 ++++++++++++-------------------
>  mm/hugetlb.c         | 27 ++++++---------------------
>  2 files changed, 18 insertions(+), 40 deletions(-)

This patch introduced a compiler warning addressed here,

https://lore.kernel.org/linux-mm/Yyj7HsJWfHDoU24U@monkey/

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 4/9] hugetlb: create remove_inode_single_folio to remove single file folio
  2022-09-14 22:18 ` [PATCH v2 4/9] hugetlb: create remove_inode_single_folio to remove single file folio Mike Kravetz
@ 2022-09-24 12:36   ` Miaohe Lin
  0 siblings, 0 replies; 18+ messages in thread
From: Miaohe Lin @ 2022-09-24 12:36 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Muchun Song, David Hildenbrand, Sven Schnelle, Michal Hocko,
	Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V, Andrea Arcangeli,
	Kirill A . Shutemov, Davidlohr Bueso, Prakash Sangappa,
	James Houghton, Mina Almasry, Pasha Tatashin, Axel Rasmussen,
	Ray Fucillo, Andrew Morton

On 2022/9/15 6:18, Mike Kravetz wrote:
> Create the new routine remove_inode_single_folio that will remove a
> single folio from a file.  This is refactored code from
> remove_inode_hugepages.  It checks for the uncommon case in which the
> folio is still mapped and unmaps.
> 
> No functional change.  This refactoring will be put to use and expanded
> upon in subsequent patches.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

LGTM with one nit below.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

> ---
>  fs/hugetlbfs/inode.c | 105 ++++++++++++++++++++++++++-----------------
>  1 file changed, 63 insertions(+), 42 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index edd69cc43ca5..7112a9a9f54d 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -411,6 +411,60 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
>  	}
>  }
>  
> +/*
> + * Called with hugetlb fault mutex held.
> + * Returns true if page was actually removed, false otherwise.
> + */
> +static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
> +					struct address_space *mapping,
> +					struct folio *folio, pgoff_t index,
> +					bool truncate_op)
> +{
> +	bool ret = false;
> +
> +	/*
> +	 * If folio is mapped, it was faulted in after being
> +	 * unmapped in caller.  Unmap (again) while holding
> +	 * the fault mutex.  The mutex will prevent faults
> +	 * until we finish removing the folio.
> +	 */
> +	if (unlikely(folio_mapped(folio))) {
> +		i_mmap_lock_write(mapping);
> +		hugetlb_vmdelete_list(&mapping->i_mmap,
> +			index * pages_per_huge_page(h),
> +			(index + 1) * pages_per_huge_page(h),
> +			ZAP_FLAG_DROP_MARKER);
> +		i_mmap_unlock_write(mapping);
> +	}
> +
> +	folio_lock(folio);
> +	/*
> +	 * After locking page, make sure mapping is the same.
> +	 * We could have raced with page fault populate and
> +	 * backout code.

Is this needed? remove_inode_single_folio() is called with hugetlb fault mutex held so it can't
race with page fault code?

Thanks,
Miaohe Lin



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 6/9] hugetlb: add vma based lock for pmd sharing
  2022-09-14 22:18 ` [PATCH v2 6/9] hugetlb: add vma based lock for pmd sharing Mike Kravetz
  2022-09-15 20:25   ` Mike Kravetz
@ 2022-09-24 13:11   ` Miaohe Lin
  1 sibling, 0 replies; 18+ messages in thread
From: Miaohe Lin @ 2022-09-24 13:11 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Muchun Song, David Hildenbrand, Sven Schnelle, Michal Hocko,
	Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V, Andrea Arcangeli,
	Kirill A . Shutemov, Davidlohr Bueso, Prakash Sangappa,
	James Houghton, Mina Almasry, Pasha Tatashin, Axel Rasmussen,
	Ray Fucillo, Andrew Morton

On 2022/9/15 6:18, Mike Kravetz wrote:
> Allocate a new hugetlb_vma_lock structure and hang off vm_private_data
> for synchronization use by vmas that could be involved in pmd sharing.
> This data structure contains a rw semaphore that is the primary tool
> used for synchronization.
> 
> This new structure is ref counted, so that it can exist when NOT attached
> to a vma.  This is only helpful in resolving lock ordering issues where
> code may need to obtain the vma_lock while there is no guarantee the
> vma will not go away.  By obtaining a ref on the structure, it can be
> guaranteed that at least the rw semaphore will not go away.
> 
> Only add infrastructure for the new lock here.  Actual use will be added
> in subsequent patches.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

LGTM with some nits below. Thanks for your work, Mike.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

> -/* Reset counters to 0 and clear all HPAGE_RESV_* flags */
> -void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
> +void hugetlb_dup_vma_private(struct vm_area_struct *vma)
>  {
>  	VM_BUG_ON_VMA(!is_vm_hugetlb_page(vma), vma);
> +	/*
> +	 * Clear vm_private_data
> +	 * - For MAP_PRIVATE mappings, this is the reserve map which does
> +	 *   not apply to children.  Faults generated by the children are
> +	 *   not guaranteed to succeed, even if read-only.
> +	 * - For shared mappings this is a per-vma semaphore that may be
> +	 *   allocated in a subsequent call to hugetlb_vm_op_open.
> +	 */
> +	vma->vm_private_data = (void *)0;
>  	if (!(vma->vm_flags & VM_MAYSHARE))
> -		vma->vm_private_data = (void *)0;
> +		return;

This if block can be deleted? It doesn't do anything here.

>  }
>  
>  /*

<snip>

> +static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
> +{
> +	/*
> +	 * Only present in sharable vmas.  See comment in
> +	 * __unmap_hugepage_range_final about how VM_SHARED could
> +	 * be set without VM_MAYSHARE.  As a result, we need to
> +	 * check if either is set in the free path.
> +	 */
> +	if (!vma || !(vma->vm_flags & (VM_MAYSHARE | VM_SHARED)))
> +		return;
> +
> +	if (vma->vm_private_data) {
> +		struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
> +
> +		/*
> +		 * vma_lock structure may or not be released, but it

may or not be released?

Thanks,
Miaohe Lin


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 8/9] hugetlb: use new vma_lock for pmd sharing synchronization
  2022-09-14 22:18 ` [PATCH v2 8/9] hugetlb: use new vma_lock for pmd sharing synchronization Mike Kravetz
@ 2022-09-29  6:08   ` Miaohe Lin
  2022-10-01  0:00     ` Mike Kravetz
  0 siblings, 1 reply; 18+ messages in thread
From: Miaohe Lin @ 2022-09-29  6:08 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Muchun Song, David Hildenbrand, Sven Schnelle, Michal Hocko,
	Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V, Andrea Arcangeli,
	Kirill A . Shutemov, Davidlohr Bueso, Prakash Sangappa,
	James Houghton, Mina Almasry, Pasha Tatashin, Axel Rasmussen,
	Ray Fucillo, Andrew Morton

On 2022/9/15 6:18, Mike Kravetz wrote:
> The new hugetlb vma lock is used to address this race:
> 
> Faulting thread                                 Unsharing thread
> ...                                                  ...
> ptep = huge_pte_offset()
>       or
> ptep = huge_pte_alloc()
> ...
>                                                 i_mmap_lock_write
>                                                 lock page table
> ptep invalid   <------------------------        huge_pmd_unshare()
> Could be in a previously                        unlock_page_table
> sharing process or worse                        i_mmap_unlock_write
> ...
> 
> The vma_lock is used as follows:
> - During fault processing. The lock is acquired in read mode before
>   doing a page table lock and allocation (huge_pte_alloc).  The lock is
>   held until code is finished with the page table entry (ptep).
> - The lock must be held in write mode whenever huge_pmd_unshare is
>   called.
> 
> Lock ordering issues come into play when unmapping a page from all
> vmas mapping the page.  The i_mmap_rwsem must be held to search for the
> vmas, and the vma lock must be held before calling unmap which will
> call huge_pmd_unshare.  This is done today in:
> - try_to_migrate_one and try_to_unmap_one for page migration and memory
>   error handling.  In these routines we 'try' to obtain the vma lock and
>   fail to unmap if unsuccessful.  Calling routines already deal with the
>   failure of unmapping.
> - hugetlb_vmdelete_list for truncation and hole punch.  This routine
>   also tries to acquire the vma lock.  If it fails, it skips the
>   unmapping.  However, we can not have file truncation or hole punch
>   fail because of contention.  After hugetlb_vmdelete_list, truncation
>   and hole punch call remove_inode_hugepages.  remove_inode_hugepages
>   checks for mapped pages and calls hugetlb_unmap_file_folio to unmap them.
>   hugetlb_unmap_file_folio is designed to drop locks and reacquire them in the
>   correct order to guarantee unmap success.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  fs/hugetlbfs/inode.c |  66 +++++++++++++++++++++++++++-
>  mm/hugetlb.c         | 102 +++++++++++++++++++++++++++++++++++++++----
>  mm/memory.c          |   2 +
>  mm/rmap.c            | 100 +++++++++++++++++++++++++++---------------
>  mm/userfaultfd.c     |   9 +++-
>  5 files changed, 233 insertions(+), 46 deletions(-)
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 3bb1772fce2f..009ae539b9b2 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -434,6 +434,7 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>  					struct folio *folio, pgoff_t index)
>  {
>  	struct rb_root_cached *root = &mapping->i_mmap;
> +	struct hugetlb_vma_lock *vma_lock;
>  	struct page *page = &folio->page;
>  	struct vm_area_struct *vma;
>  	unsigned long v_start;
> @@ -444,7 +445,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>  	end = (index + 1) * pages_per_huge_page(h);
>  
>  	i_mmap_lock_write(mapping);
> -
> +retry:
> +	vma_lock = NULL;
>  	vma_interval_tree_foreach(vma, root, start, end - 1) {
>  		v_start = vma_offset_start(vma, start);
>  		v_end = vma_offset_end(vma, end);
> @@ -452,11 +454,63 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>  		if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
>  			continue;
>  
> +		if (!hugetlb_vma_trylock_write(vma)) {
> +			vma_lock = vma->vm_private_data;
> +			/*
> +			 * If we can not get vma lock, we need to drop
> +			 * immap_sema and take locks in order.  First,
> +			 * take a ref on the vma_lock structure so that
> +			 * we can be guaranteed it will not go away when
> +			 * dropping immap_sema.
> +			 */
> +			kref_get(&vma_lock->refs);
> +			break;
> +		}
> +
>  		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
>  				NULL, ZAP_FLAG_DROP_MARKER);
> +		hugetlb_vma_unlock_write(vma);
>  	}
>  
>  	i_mmap_unlock_write(mapping);
> +
> +	if (vma_lock) {
> +		/*
> +		 * Wait on vma_lock.  We know it is still valid as we have
> +		 * a reference.  We must 'open code' vma locking as we do
> +		 * not know if vma_lock is still attached to vma.
> +		 */
> +		down_write(&vma_lock->rw_sema);
> +		i_mmap_lock_write(mapping);
> +
> +		vma = vma_lock->vma;
> +		if (!vma) {

Thanks Mike. This method looks much simpler. But IIUC, this code can race with exit_mmap:

CPU 1					CPU 2
hugetlb_unmap_file_folio		exit_mmap
  kref_get(&vma_lock->refs);
  down_write(&vma_lock->rw_sema);
					  free_pgtables // i_mmap_lock_write is held inside it.
  i_mmap_lock_write(mapping);
  vma = vma_lock->vma;
					  remove_vma
					    hugetlb_vm_op_close
					      hugetlb_vma_lock_free
					        vma_lock->vma = NULL;
					    vm_area_free(vma);
  vma is used-after-free??

The root cause is that free_pgtables is protected with i_mmap_lock_write while remove_vma is not.
Or am I missing something again? ;)

> +			/*
> +			 * If lock is no longer attached to vma, then just
> +			 * unlock, drop our reference and retry looking for
> +			 * other vmas.
> +			 */
> +			up_write(&vma_lock->rw_sema);
> +			kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
> +			goto retry;
> +		}
> +
> +		/*
> +		 * vma_lock is still attached to vma.  Check to see if vma
> +		 * still maps page and if so, unmap.
> +		 */
> +		v_start = vma_offset_start(vma, start);
> +		v_end = vma_offset_end(vma, end);
> +		if (hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
> +			unmap_hugepage_range(vma, vma->vm_start + v_start,
> +						v_end, NULL,
> +						ZAP_FLAG_DROP_MARKER);
> +
> +		kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
> +		hugetlb_vma_unlock_write(vma);
> +
> +		goto retry;
> +	}
>  }
>  
>  static void
> @@ -474,11 +528,21 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
>  		unsigned long v_start;
>  		unsigned long v_end;
>  
> +		if (!hugetlb_vma_trylock_write(vma))
> +			continue;
> +
>  		v_start = vma_offset_start(vma, start);
>  		v_end = vma_offset_end(vma, end);
>  
>  		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
>  				     NULL, zap_flags);
> +
> +		/*
> +		 * Note that vma lock only exists for shared/non-private
> +		 * vmas.  Therefore, lock is not held when calling
> +		 * unmap_hugepage_range for private vmas.
> +		 */
> +		hugetlb_vma_unlock_write(vma);
>  	}
>  }
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 616be891b798..e8cbc0f7cdaa 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4795,6 +4795,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  		mmu_notifier_invalidate_range_start(&range);
>  		mmap_assert_write_locked(src);
>  		raw_write_seqcount_begin(&src->write_protect_seq);
> +	} else {
> +		/*
> +		 * For shared mappings the vma lock must be held before
> +		 * calling huge_pte_offset in the src vma. Otherwise, the
> +		 * returned ptep could go away if part of a shared pmd and
> +		 * another thread calls huge_pmd_unshare.
> +		 */
> +		hugetlb_vma_lock_read(src_vma);
>  	}
>  
>  	last_addr_mask = hugetlb_mask_last_page(h);
> @@ -4941,6 +4949,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>  	if (cow) {
>  		raw_write_seqcount_end(&src->write_protect_seq);
>  		mmu_notifier_invalidate_range_end(&range);
> +	} else {
> +		hugetlb_vma_unlock_read(src_vma);
>  	}
>  
>  	return ret;
> @@ -4999,6 +5009,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>  	mmu_notifier_invalidate_range_start(&range);
>  	last_addr_mask = hugetlb_mask_last_page(h);
>  	/* Prevent race with file truncation */
> +	hugetlb_vma_lock_write(vma);
>  	i_mmap_lock_write(mapping);
>  	for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
>  		src_pte = huge_pte_offset(mm, old_addr, sz);
> @@ -5030,6 +5041,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>  		flush_tlb_range(vma, old_end - len, old_end);
>  	mmu_notifier_invalidate_range_end(&range);
>  	i_mmap_unlock_write(mapping);
> +	hugetlb_vma_unlock_write(vma);
>  
>  	return len + old_addr - old_end;
>  }
> @@ -5349,8 +5361,29 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * may get SIGKILLed if it later faults.
>  		 */
>  		if (outside_reserve) {
> +			struct address_space *mapping = vma->vm_file->f_mapping;
> +			pgoff_t idx;
> +			u32 hash;
> +
>  			put_page(old_page);
> +			/*
> +			 * Drop hugetlb_fault_mutex and vma_lock before
> +			 * unmapping.  unmapping needs to hold vma_lock
> +			 * in write mode.  Dropping vma_lock in read mode
> +			 * here is OK as COW mappings do not interact with
> +			 * PMD sharing.
> +			 *
> +			 * Reacquire both after unmap operation.
> +			 */
> +			idx = vma_hugecache_offset(h, vma, haddr);
> +			hash = hugetlb_fault_mutex_hash(mapping, idx);
> +			hugetlb_vma_unlock_read(vma);
> +			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> +
>  			unmap_ref_private(mm, vma, old_page, haddr);
> +
> +			mutex_lock(&hugetlb_fault_mutex_table[hash]);
> +			hugetlb_vma_lock_read(vma);
>  			spin_lock(ptl);
>  			ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
>  			if (likely(ptep &&
> @@ -5499,14 +5532,16 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
>  	};
>  
>  	/*
> -	 * hugetlb_fault_mutex and i_mmap_rwsem must be
> +	 * vma_lock and hugetlb_fault_mutex must be
>  	 * dropped before handling userfault.  Reacquire
>  	 * after handling fault to make calling code simpler.
>  	 */
> +	hugetlb_vma_unlock_read(vma);
>  	hash = hugetlb_fault_mutex_hash(mapping, idx);
>  	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  	ret = handle_userfault(&vmf, reason);
>  	mutex_lock(&hugetlb_fault_mutex_table[hash]);
> +	hugetlb_vma_lock_read(vma);
>  
>  	return ret;
>  }
> @@ -5740,6 +5775,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  
>  	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
>  	if (ptep) {
> +		/*
> +		 * Since we hold no locks, ptep could be stale.  That is
> +		 * OK as we are only making decisions based on content and
> +		 * not actually modifying content here.
> +		 */
>  		entry = huge_ptep_get(ptep);
>  		if (unlikely(is_hugetlb_entry_migration(entry))) {
>  			migration_entry_wait_huge(vma, ptep);
> @@ -5747,23 +5787,35 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
>  			return VM_FAULT_HWPOISON_LARGE |
>  				VM_FAULT_SET_HINDEX(hstate_index(h));
> -	} else {
> -		ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
> -		if (!ptep)
> -			return VM_FAULT_OOM;
>  	}
>  
> -	mapping = vma->vm_file->f_mapping;
> -	idx = vma_hugecache_offset(h, vma, haddr);
> -
>  	/*
>  	 * Serialize hugepage allocation and instantiation, so that we don't
>  	 * get spurious allocation failures if two CPUs race to instantiate
>  	 * the same page in the page cache.
>  	 */
> +	mapping = vma->vm_file->f_mapping;
> +	idx = vma_hugecache_offset(h, vma, haddr);
>  	hash = hugetlb_fault_mutex_hash(mapping, idx);
>  	mutex_lock(&hugetlb_fault_mutex_table[hash]);
>  
> +	/*
> +	 * Acquire vma lock before calling huge_pte_alloc and hold
> +	 * until finished with ptep.  This prevents huge_pmd_unshare from
> +	 * being called elsewhere and making the ptep no longer valid.
> +	 *
> +	 * ptep could have already been assigned via huge_pte_offset.  That
> +	 * is OK, as huge_pte_alloc will return the same value unless
> +	 * something has changed.
> +	 */
> +	hugetlb_vma_lock_read(vma);

[1] says vma_lock for each vma mapping the file provides the same type of synchronization
around i_size as provided by the fault mutex. But what if vma->vm_private_data is NULL,
i.e. hugetlb_vma_lock_alloc fails to alloc vma_lock? There won't be such synchronization
in this case.

[1] https://lore.kernel.org/lkml/Yxiv0SkMkZ0JWGGp@monkey/#t
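
To illustrate the concern: when vm_private_data is NULL the lock helper silently
degrades to a no-op, roughly like the sketch below. This is only my paraphrase of
the shared-mapping path of the helper from patch 6, not the exact code:

void hugetlb_vma_lock_read(struct vm_area_struct *vma)
{
	struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

	/*
	 * Paraphrased sketch: the real helper also checks that this is a
	 * sharable hugetlb mapping.  If hugetlb_vma_lock_alloc failed,
	 * vma_lock is NULL, we return without taking anything, and the
	 * fault path's i_size check runs unsynchronized with truncation.
	 */
	if (!vma_lock)
		return;

	down_read(&vma_lock->rw_sema);
}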


Other parts of the patch look good to me. Thanks for your work.

Thanks,
Miaohe Lin


> +	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
> +	if (!ptep) {
> +		hugetlb_vma_unlock_read(vma);
> +		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> +		return VM_FAULT_OOM;
> +	}
> +
>  	entry = huge_ptep_get(ptep);
>  	/* PTE markers should be handled the same way as none pte */
>  	if (huge_pte_none_mostly(entry)) {
> @@ -5824,6 +5876,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unlock_page(pagecache_page);
>  			put_page(pagecache_page);
>  		}
> +		hugetlb_vma_unlock_read(vma);
>  		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  		return handle_userfault(&vmf, VM_UFFD_WP);
>  	}
> @@ -5867,6 +5920,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		put_page(pagecache_page);
>  	}
>  out_mutex:
> +	hugetlb_vma_unlock_read(vma);
>  	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  	/*
>  	 * Generally it's safe to hold refcount during waiting page lock. But
> @@ -6329,8 +6383,9 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  	flush_cache_range(vma, range.start, range.end);
>  
>  	mmu_notifier_invalidate_range_start(&range);
> -	last_addr_mask = hugetlb_mask_last_page(h);
> +	hugetlb_vma_lock_write(vma);
>  	i_mmap_lock_write(vma->vm_file->f_mapping);
> +	last_addr_mask = hugetlb_mask_last_page(h);
>  	for (; address < end; address += psize) {
>  		spinlock_t *ptl;
>  		ptep = huge_pte_offset(mm, address, psize);
> @@ -6429,6 +6484,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  	 * See Documentation/mm/mmu_notifier.rst
>  	 */
>  	i_mmap_unlock_write(vma->vm_file->f_mapping);
> +	hugetlb_vma_unlock_write(vma);
>  	mmu_notifier_invalidate_range_end(&range);
>  
>  	return pages << h->order;
> @@ -6930,6 +6986,7 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
>  	pud_t *pud = pud_offset(p4d, addr);
>  
>  	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
> +	hugetlb_vma_assert_locked(vma);
>  	BUG_ON(page_count(virt_to_page(ptep)) == 0);
>  	if (page_count(virt_to_page(ptep)) == 1)
>  		return 0;
> @@ -6941,6 +6998,31 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
>  }
>  
>  #else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
> +void hugetlb_vma_lock_read(struct vm_area_struct *vma)
> +{
> +}
> +
> +void hugetlb_vma_unlock_read(struct vm_area_struct *vma)
> +{
> +}
> +
> +void hugetlb_vma_lock_write(struct vm_area_struct *vma)
> +{
> +}
> +
> +void hugetlb_vma_unlock_write(struct vm_area_struct *vma)
> +{
> +}
> +
> +int hugetlb_vma_trylock_write(struct vm_area_struct *vma)
> +{
> +	return 1;
> +}
> +
> +void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
> +{
> +}
> +
>  static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
>  {
>  }
> @@ -7318,6 +7400,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
>  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
>  				start, end);
>  	mmu_notifier_invalidate_range_start(&range);
> +	hugetlb_vma_lock_write(vma);
>  	i_mmap_lock_write(vma->vm_file->f_mapping);
>  	for (address = start; address < end; address += PUD_SIZE) {
>  		ptep = huge_pte_offset(mm, address, sz);
> @@ -7329,6 +7412,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
>  	}
>  	flush_hugetlb_tlb_range(vma, start, end);
>  	i_mmap_unlock_write(vma->vm_file->f_mapping);
> +	hugetlb_vma_unlock_write(vma);
>  	/*
>  	 * No need to call mmu_notifier_invalidate_range(), see
>  	 * Documentation/mm/mmu_notifier.rst.
> diff --git a/mm/memory.c b/mm/memory.c
> index c4c3c2fd4f45..118e5f023597 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1685,10 +1685,12 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>  			if (vma->vm_file) {
>  				zap_flags_t zap_flags = details ?
>  				    details->zap_flags : 0;
> +				hugetlb_vma_lock_write(vma);
>  				i_mmap_lock_write(vma->vm_file->f_mapping);
>  				__unmap_hugepage_range_final(tlb, vma, start, end,
>  							     NULL, zap_flags);
>  				i_mmap_unlock_write(vma->vm_file->f_mapping);
> +				hugetlb_vma_unlock_write(vma);
>  			}
>  		} else
>  			unmap_page_range(tlb, vma, start, end, details);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 744faaef0489..2ec925e5fa6a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1554,24 +1554,39 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  			 * To call huge_pmd_unshare, i_mmap_rwsem must be
>  			 * held in write mode.  Caller needs to explicitly
>  			 * do this outside rmap routines.
> +			 *
> +			 * We also must hold hugetlb vma_lock in write mode.
> +			 * Lock order dictates acquiring vma_lock BEFORE
> +			 * i_mmap_rwsem.  We can only try lock here and fail
> +			 * if unsuccessful.
>  			 */
> -			VM_BUG_ON(!anon && !(flags & TTU_RMAP_LOCKED));
> -			if (!anon && huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> -				flush_tlb_range(vma, range.start, range.end);
> -				mmu_notifier_invalidate_range(mm, range.start,
> -							      range.end);
> -
> -				/*
> -				 * The ref count of the PMD page was dropped
> -				 * which is part of the way map counting
> -				 * is done for shared PMDs.  Return 'true'
> -				 * here.  When there is no other sharing,
> -				 * huge_pmd_unshare returns false and we will
> -				 * unmap the actual page and drop map count
> -				 * to zero.
> -				 */
> -				page_vma_mapped_walk_done(&pvmw);
> -				break;
> +			if (!anon) {
> +				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> +				if (!hugetlb_vma_trylock_write(vma)) {
> +					page_vma_mapped_walk_done(&pvmw);
> +					ret = false;
> +					break;
> +				}
> +				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> +					hugetlb_vma_unlock_write(vma);
> +					flush_tlb_range(vma,
> +						range.start, range.end);
> +					mmu_notifier_invalidate_range(mm,
> +						range.start, range.end);
> +					/*
> +					 * The ref count of the PMD page was
> +					 * dropped which is part of the way map
> +					 * counting is done for shared PMDs.
> +					 * Return 'true' here.  When there is
> +					 * no other sharing, huge_pmd_unshare
> +					 * returns false and we will unmap the
> +					 * actual page and drop map count
> +					 * to zero.
> +					 */
> +					page_vma_mapped_walk_done(&pvmw);
> +					break;
> +				}
> +				hugetlb_vma_unlock_write(vma);
>  			}
>  			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
>  		} else {
> @@ -1929,26 +1944,41 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  			 * To call huge_pmd_unshare, i_mmap_rwsem must be
>  			 * held in write mode.  Caller needs to explicitly
>  			 * do this outside rmap routines.
> +			 *
> +			 * We also must hold hugetlb vma_lock in write mode.
> +			 * Lock order dictates acquiring vma_lock BEFORE
> +			 * i_mmap_rwsem.  We can only try lock here and
> +			 * fail if unsuccessful.
>  			 */
> -			VM_BUG_ON(!anon && !(flags & TTU_RMAP_LOCKED));
> -			if (!anon && huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> -				flush_tlb_range(vma, range.start, range.end);
> -				mmu_notifier_invalidate_range(mm, range.start,
> -							      range.end);
> -
> -				/*
> -				 * The ref count of the PMD page was dropped
> -				 * which is part of the way map counting
> -				 * is done for shared PMDs.  Return 'true'
> -				 * here.  When there is no other sharing,
> -				 * huge_pmd_unshare returns false and we will
> -				 * unmap the actual page and drop map count
> -				 * to zero.
> -				 */
> -				page_vma_mapped_walk_done(&pvmw);
> -				break;
> +			if (!anon) {
> +				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> +				if (!hugetlb_vma_trylock_write(vma)) {
> +					page_vma_mapped_walk_done(&pvmw);
> +					ret = false;
> +					break;
> +				}
> +				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> +					hugetlb_vma_unlock_write(vma);
> +					flush_tlb_range(vma,
> +						range.start, range.end);
> +					mmu_notifier_invalidate_range(mm,
> +						range.start, range.end);
> +
> +					/*
> +					 * The ref count of the PMD page was
> +					 * dropped which is part of the way map
> +					 * counting is done for shared PMDs.
> +					 * Return 'true' here.  When there is
> +					 * no other sharing, huge_pmd_unshare
> +					 * returns false and we will unmap the
> +					 * actual page and drop map count
> +					 * to zero.
> +					 */
> +					page_vma_mapped_walk_done(&pvmw);
> +					break;
> +				}
> +				hugetlb_vma_unlock_write(vma);
>  			}
> -
>  			/* Nuke the hugetlb page table entry */
>  			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
>  		} else {
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 0fdbd2c05587..e24e8a47ce8a 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -379,16 +379,21 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  		BUG_ON(dst_addr >= dst_start + len);
>  
>  		/*
> -		 * Serialize via hugetlb_fault_mutex.
> +		 * Serialize via vma_lock and hugetlb_fault_mutex.
> +		 * vma_lock ensures the dst_pte remains valid even
> +		 * in the case of shared pmds.  fault mutex prevents
> +		 * races with other faulting threads.
>  		 */
>  		idx = linear_page_index(dst_vma, dst_addr);
>  		mapping = dst_vma->vm_file->f_mapping;
>  		hash = hugetlb_fault_mutex_hash(mapping, idx);
>  		mutex_lock(&hugetlb_fault_mutex_table[hash]);
> +		hugetlb_vma_lock_read(dst_vma);
>  
>  		err = -ENOMEM;
>  		dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
>  		if (!dst_pte) {
> +			hugetlb_vma_unlock_read(dst_vma);
>  			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  			goto out_unlock;
>  		}
> @@ -396,6 +401,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  		if (mode != MCOPY_ATOMIC_CONTINUE &&
>  		    !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
>  			err = -EEXIST;
> +			hugetlb_vma_unlock_read(dst_vma);
>  			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  			goto out_unlock;
>  		}
> @@ -404,6 +410,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>  					       dst_addr, src_addr, mode, &page,
>  					       wp_copy);
>  
> +		hugetlb_vma_unlock_read(dst_vma);
>  		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  
>  		cond_resched();
> 



* Re: [PATCH v2 9/9] hugetlb: clean up code checking for fault/truncation races
  2022-09-14 22:18 ` [PATCH v2 9/9] hugetlb: clean up code checking for fault/truncation races Mike Kravetz
  2022-09-19 23:32   ` Mike Kravetz
@ 2022-09-29  6:25   ` Miaohe Lin
  1 sibling, 0 replies; 18+ messages in thread
From: Miaohe Lin @ 2022-09-29  6:25 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Muchun Song, David Hildenbrand, Sven Schnelle, Michal Hocko,
	Peter Xu, Naoya Horiguchi, Aneesh Kumar K . V, Andrea Arcangeli,
	Kirill A . Shutemov, Davidlohr Bueso, Prakash Sangappa,
	James Houghton, Mina Almasry, Pasha Tatashin, Axel Rasmussen,
	Ray Fucillo, Andrew Morton

On 2022/9/15 6:18, Mike Kravetz wrote:
> With the new hugetlb vma lock in place, it can also be used to handle
> page fault races with file truncation.  The lock is taken at the
> beginning of the code fault path in read mode.  During truncation, it
> is taken in write mode for each vma which has the file mapped.  The
> file's size (i_size) is modified before taking the vma lock to unmap.
> 
> How are races handled?
> 
> The page fault code checks i_size early in processing after taking the
> vma lock.  If the fault is beyond i_size, the fault is aborted.  If the
> fault is not beyond i_size the fault will continue and a new page will
> be added to the file.  It could be that truncation code modifies i_size
> after the check in fault code.  That is OK, as truncation code will soon
> remove the page.  The truncation code will wait until the fault is
> finished, as it must obtain the vma lock in write mode.

As the previous thread [1] points out, if vma->vm_private_data is NULL there is nothing that provides
the same type of synchronization around i_size as the fault mutex does in [2].
([2] takes the fault mutex for EVERY index in the truncated range.)

[1] https://lore.kernel.org/lkml/YyOKIhygl66cG8Yr@monkey/T/#m6b69af9e8cdba01246c2b210bd044bf895b815ee
[2] https://lore.kernel.org/lkml/20220824175757.20590-5-mike.kravetz@oracle.com/
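
For context, the per-index serialization [2] relies on is roughly the following.
This is a sketch from memory, not code quoted from [2]; only hugetlb_fault_mutex_hash
and hugetlb_fault_mutex_table are real names, the wrapper function is made up:

/* Sketch: how [2] serializes truncation against faults, index by index. */
static void fault_mutex_serialize_range(struct address_space *mapping,
					pgoff_t start, pgoff_t end)
{
	pgoff_t index;

	for (index = start; index < end; index++) {
		u32 hash = hugetlb_fault_mutex_hash(mapping, index);

		/* A fault on this index waits here until removal is done. */
		mutex_lock(&hugetlb_fault_mutex_table[hash]);
		/* ... remove the page cache folio for this index ... */
		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
	}
}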

Except from that, this patch looks good to me. Thanks Mike.

Thanks,
Miaohe Lin




* Re: [PATCH v2 8/9] hugetlb: use new vma_lock for pmd sharing synchronization
  2022-09-29  6:08   ` Miaohe Lin
@ 2022-10-01  0:00     ` Mike Kravetz
  2022-10-08  2:29       ` Miaohe Lin
  0 siblings, 1 reply; 18+ messages in thread
From: Mike Kravetz @ 2022-10-01  0:00 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: linux-mm, linux-kernel, Muchun Song, David Hildenbrand,
	Sven Schnelle, Michal Hocko, Peter Xu, Naoya Horiguchi,
	Aneesh Kumar K . V, Andrea Arcangeli, Kirill A . Shutemov,
	Davidlohr Bueso, Prakash Sangappa, James Houghton, Mina Almasry,
	Pasha Tatashin, Axel Rasmussen, Ray Fucillo, Andrew Morton

On 09/29/22 14:08, Miaohe Lin wrote:
> On 2022/9/15 6:18, Mike Kravetz wrote:
> > @@ -434,6 +434,7 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >  					struct folio *folio, pgoff_t index)
> >  {
> >  	struct rb_root_cached *root = &mapping->i_mmap;
> > +	struct hugetlb_vma_lock *vma_lock;
> >  	struct page *page = &folio->page;
> >  	struct vm_area_struct *vma;
> >  	unsigned long v_start;
> > @@ -444,7 +445,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >  	end = (index + 1) * pages_per_huge_page(h);
> >  
> >  	i_mmap_lock_write(mapping);
> > -
> > +retry:
> > +	vma_lock = NULL;
> >  	vma_interval_tree_foreach(vma, root, start, end - 1) {
> >  		v_start = vma_offset_start(vma, start);
> >  		v_end = vma_offset_end(vma, end);
> > @@ -452,11 +454,63 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >  		if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
> >  			continue;
> >  
> > +		if (!hugetlb_vma_trylock_write(vma)) {
> > +			vma_lock = vma->vm_private_data;
> > +			/*
> > +			 * If we can not get vma lock, we need to drop
> > +			 * immap_sema and take locks in order.  First,
> > +			 * take a ref on the vma_lock structure so that
> > +			 * we can be guaranteed it will not go away when
> > +			 * dropping immap_sema.
> > +			 */
> > +			kref_get(&vma_lock->refs);
> > +			break;
> > +		}
> > +
> >  		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> >  				NULL, ZAP_FLAG_DROP_MARKER);
> > +		hugetlb_vma_unlock_write(vma);
> >  	}
> >  
> >  	i_mmap_unlock_write(mapping);
> > +
> > +	if (vma_lock) {
> > +		/*
> > +		 * Wait on vma_lock.  We know it is still valid as we have
> > +		 * a reference.  We must 'open code' vma locking as we do
> > +		 * not know if vma_lock is still attached to vma.
> > +		 */
> > +		down_write(&vma_lock->rw_sema);
> > +		i_mmap_lock_write(mapping);
> > +
> > +		vma = vma_lock->vma;
> > +		if (!vma) {
> 
> Thanks Mike. This method looks much simpler. But IIUC, this code can race with exit_mmap:
> 
> CPU 1					CPU 2
> hugetlb_unmap_file_folio		exit_mmap
>   kref_get(&vma_lock->refs);
>   down_write(&vma_lock->rw_sema);
> 					  free_pgtables // i_mmap_lock_write is held inside it.
>   i_mmap_lock_write(mapping);
>   vma = vma_lock->vma;
> 					  remove_vma
> 					    hugetlb_vm_op_close
> 					      hugetlb_vma_lock_free
> 					        vma_lock->vma = NULL;
> 					    vm_area_free(vma);
>   vma is used-after-free??
> 
> The root cause is that free_pgtables is protected by i_mmap_lock_write while remove_vma is not.
> Or am I missing something again? ;)

Thank you Miaohe!  Sorry for the delay in responding.

Yes, I agree this is a possible race.  My first thought is that we may be
able to address this by simply taking the vma_lock when we clear the
vma_lock->vma field.  Something like this,

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4cb44a4629b8..bf0c220ebc32 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6918,7 +6918,9 @@ static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
 		 * certainly will no longer be attached to vma so clear
 		 * pointer.
 		 */
+		down_write(&vma_lock->rw_sema);
 		vma_lock->vma = NULL;
+		up_write(&vma_lock->rw_sema);
 		kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
 		vma->vm_private_data = NULL;
 	}
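
For clarity, with that change folded in the helper would read roughly as below.
This is only a sketch; the guard at the top is simplified (the real code also
checks that the vma is a sharable mapping) and the rest follows the existing
function:

static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
{
	struct hugetlb_vma_lock *vma_lock;

	/* Simplified guard: only sharable vmas carry a vma_lock. */
	if (!vma || !vma->vm_private_data)
		return;

	vma_lock = vma->vm_private_data;

	/*
	 * Detach under the semaphore so a racing hugetlb_unmap_file_folio
	 * that already took a reference sees vma_lock->vma == NULL once it
	 * acquires the lock, never a vma that is about to be freed.
	 */
	down_write(&vma_lock->rw_sema);
	vma_lock->vma = NULL;
	up_write(&vma_lock->rw_sema);

	/* The semaphore is no longer held when the vma itself is freed. */
	kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
	vma->vm_private_data = NULL;
}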

I still need to do a bit more work to verify.

Andrew, if you are concerned I do not think this is a show stopper.  The
race should be extremely rare, and a fix should be coming quickly.

<snip>
> > +	mapping = vma->vm_file->f_mapping;
> > +	idx = vma_hugecache_offset(h, vma, haddr);
> >  	hash = hugetlb_fault_mutex_hash(mapping, idx);
> >  	mutex_lock(&hugetlb_fault_mutex_table[hash]);
> >  
> > +	/*
> > +	 * Acquire vma lock before calling huge_pte_alloc and hold
> > +	 * until finished with ptep.  This prevents huge_pmd_unshare from
> > +	 * being called elsewhere and making the ptep no longer valid.
> > +	 *
> > +	 * ptep could have already been assigned via huge_pte_offset.  That
> > +	 * is OK, as huge_pte_alloc will return the same value unless
> > +	 * something has changed.
> > +	 */
> > +	hugetlb_vma_lock_read(vma);
> 
> [1] says vma_lock for each vma mapping the file provides the same type of synchronization
> around i_size as provided by the fault mutex. But what if vma->vm_private_data is NULL,
> i.e. hugetlb_vma_lock_alloc fails to alloc vma_lock? There won't be such synchronization
> in this case.
> 
> [1] https://lore.kernel.org/lkml/Yxiv0SkMkZ0JWGGp@monkey/#t
> 

Right.

Of course, this (checking i_size) only applies to shared/file mappings.
The only time hugetlb_vma_lock_alloc should fail in such cases is when
we can not allocate the small vma_lock structure.  Since the vma_lock
is primarily for huge pmd sharing synchronization, my thought was that
allocation errors would just prevent sharing.  But, as you point out
it could also impact these checks.

It would be easy to check for the lock allocation failure at mmap time
and fail the mmap.  It would be a little more tricky at fork time.

This also is something that is highly unlikely to occur.  And, if we
can't allocate a vma_lock I suspect we will not be up and running long
enough for this to be an issue. :) Let me think about the best way to handle.
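
As a strawman for the mmap side, something like the sketch below would do it.
This is purely hypothetical: the helper name and its placement in the mmap path
are made up, and it assumes hugetlb_vma_lock_alloc reports failure only by
leaving vm_private_data NULL:

/* Hypothetical sketch: fail the mmap if a shared mapping gets no vma_lock. */
static int hugetlb_vma_lock_alloc_or_fail(struct vm_area_struct *vma)
{
	if (!(vma->vm_flags & VM_MAYSHARE))
		return 0;		/* only shared mappings need the lock */

	hugetlb_vma_lock_alloc(vma);
	if (!vma->vm_private_data)
		return -ENOMEM;		/* do not silently lose synchronization */

	return 0;
}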

> 
> Other parts of the patch look good to me. Thanks for your work.
> 

Thanks again,
-- 
Mike Kravetz


* Re: [PATCH v2 8/9] hugetlb: use new vma_lock for pmd sharing synchronization
  2022-10-01  0:00     ` Mike Kravetz
@ 2022-10-08  2:29       ` Miaohe Lin
  0 siblings, 0 replies; 18+ messages in thread
From: Miaohe Lin @ 2022-10-08  2:29 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Muchun Song, David Hildenbrand,
	Sven Schnelle, Michal Hocko, Peter Xu, Naoya Horiguchi,
	Aneesh Kumar K . V, Andrea Arcangeli, Kirill A . Shutemov,
	Davidlohr Bueso, Prakash Sangappa, James Houghton, Mina Almasry,
	Pasha Tatashin, Axel Rasmussen, Ray Fucillo, Andrew Morton

On 2022/10/1 8:00, Mike Kravetz wrote:
> On 09/29/22 14:08, Miaohe Lin wrote:
>> On 2022/9/15 6:18, Mike Kravetz wrote:
>>> @@ -434,6 +434,7 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>>  					struct folio *folio, pgoff_t index)
>>>  {
>>>  	struct rb_root_cached *root = &mapping->i_mmap;
>>> +	struct hugetlb_vma_lock *vma_lock;
>>>  	struct page *page = &folio->page;
>>>  	struct vm_area_struct *vma;
>>>  	unsigned long v_start;
>>> @@ -444,7 +445,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>>  	end = (index + 1) * pages_per_huge_page(h);
>>>  
>>>  	i_mmap_lock_write(mapping);
>>> -
>>> +retry:
>>> +	vma_lock = NULL;
>>>  	vma_interval_tree_foreach(vma, root, start, end - 1) {
>>>  		v_start = vma_offset_start(vma, start);
>>>  		v_end = vma_offset_end(vma, end);
>>> @@ -452,11 +454,63 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>>  		if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
>>>  			continue;
>>>  
>>> +		if (!hugetlb_vma_trylock_write(vma)) {
>>> +			vma_lock = vma->vm_private_data;
>>> +			/*
>>> +			 * If we can not get vma lock, we need to drop
>>> +			 * immap_sema and take locks in order.  First,
>>> +			 * take a ref on the vma_lock structure so that
>>> +			 * we can be guaranteed it will not go away when
>>> +			 * dropping immap_sema.
>>> +			 */
>>> +			kref_get(&vma_lock->refs);
>>> +			break;
>>> +		}
>>> +
>>>  		unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
>>>  				NULL, ZAP_FLAG_DROP_MARKER);
>>> +		hugetlb_vma_unlock_write(vma);
>>>  	}
>>>  
>>>  	i_mmap_unlock_write(mapping);
>>> +
>>> +	if (vma_lock) {
>>> +		/*
>>> +		 * Wait on vma_lock.  We know it is still valid as we have
>>> +		 * a reference.  We must 'open code' vma locking as we do
>>> +		 * not know if vma_lock is still attached to vma.
>>> +		 */
>>> +		down_write(&vma_lock->rw_sema);
>>> +		i_mmap_lock_write(mapping);
>>> +
>>> +		vma = vma_lock->vma;
>>> +		if (!vma) {
>>
>> Thanks Mike. This method looks much simpler. But IIUC, this code can race with exit_mmap:
>>
>> CPU 1					CPU 2
>> hugetlb_unmap_file_folio		exit_mmap
>>   kref_get(&vma_lock->refs);
>>   down_write(&vma_lock->rw_sema);
>> 					  free_pgtables // i_mmap_lock_write is held inside it.
>>   i_mmap_lock_write(mapping);
>>   vma = vma_lock->vma;
>> 					  remove_vma
>> 					    hugetlb_vm_op_close
>> 					      hugetlb_vma_lock_free
>> 					        vma_lock->vma = NULL;
>> 					    vm_area_free(vma);
>>   vma is used-after-free??
>>
>> The root cause is that free_pgtables is protected by i_mmap_lock_write while remove_vma is not.
>> Or am I missing something again? ;)
> 
> Thank you Miaohe!  Sorry for the delay in responding.
> 
> Yes, I agree this is a possible race.  My first thought is that we may be
> able to address this by simply taking the vma_lock when we clear the
> vma_lock->vma field.  Something like this,
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 4cb44a4629b8..bf0c220ebc32 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6918,7 +6918,9 @@ static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
>  		 * certainly will no longer be attached to vma so clear
>  		 * pointer.
>  		 */
> +		down_write(&vma_lock->rw_sema);
>  		vma_lock->vma = NULL;
> +		up_write(&vma_lock->rw_sema);
>  		kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
>  		vma->vm_private_data = NULL;
>  	}

AFAICT, this should work. And since we won't hold the vma_lock->rw_sema when deleting the vma,
there should be no possible deadlock.

> 
> I still need to do a bit more work to verify.
> 
> Andrew, if you are concerned I do not think this is a show stopper.  The
> race should be extremely rare, and a fix should be coming quickly.

Agree, this should be really rare.

> 
> <snip>
>>> +	mapping = vma->vm_file->f_mapping;
>>> +	idx = vma_hugecache_offset(h, vma, haddr);
>>>  	hash = hugetlb_fault_mutex_hash(mapping, idx);
>>>  	mutex_lock(&hugetlb_fault_mutex_table[hash]);
>>>  
>>> +	/*
>>> +	 * Acquire vma lock before calling huge_pte_alloc and hold
>>> +	 * until finished with ptep.  This prevents huge_pmd_unshare from
>>> +	 * being called elsewhere and making the ptep no longer valid.
>>> +	 *
>>> +	 * ptep could have already been assigned via huge_pte_offset.  That
>>> +	 * is OK, as huge_pte_alloc will return the same value unless
>>> +	 * something has changed.
>>> +	 */
>>> +	hugetlb_vma_lock_read(vma);
>>
>> [1] says vma_lock for each vma mapping the file provides the same type of synchronization
>> around i_size as provided by the fault mutex. But what if vma->vm_private_data is NULL,
>> i.e. hugetlb_vma_lock_alloc fails to alloc vma_lock? There won't be such synchronization
>> in this case.
>>
>> [1] https://lore.kernel.org/lkml/Yxiv0SkMkZ0JWGGp@monkey/#t
>>
> 
> Right.
> 
> Of course, this (checking i_size) only applies to shared/file mappings.
> The only time hugetlb_vma_lock_alloc should fail in such cases is when
> we can not allocate the small vma_lock structure.  Since the vma_lock
> is primarily for huge pmd sharing synchronization, my thought was that
> allocation errors would just prevent sharing.  But, as you point out
> it could also impact these checks.
> 
> It would be easy to check for the lock allocation failure at mmap time
> and fail the mmap.  It would be a little more tricky at fork time.
> 
> This also is something that is highly unlikely to occur.  And, if we
> can't allocate a vma_lock I suspect we will not be up and running long
> enough for this to be an issue. :) Let me think about the best way to handle.

Agree. This should be really rare too. Thanks for your work.

Thanks,
Miaohe Lin

> 
>>
>> Other parts of the patch look good to me. Thanks for your work.
>>
> 
> Thanks again,
> 



end of thread, other threads:[~2022-10-08  2:29 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
2022-09-14 22:18 [PATCH v2 0/9] hugetlb: Use new vma lock for huge pmd sharing synchronization Mike Kravetz
2022-09-14 22:18 ` [PATCH v2 1/9] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race Mike Kravetz
2022-09-14 22:18 ` [PATCH v2 2/9] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization Mike Kravetz
2022-09-14 22:18 ` [PATCH v2 3/9] hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache Mike Kravetz
2022-09-14 22:18 ` [PATCH v2 4/9] hugetlb: create remove_inode_single_folio to remove single file folio Mike Kravetz
2022-09-24 12:36   ` Miaohe Lin
2022-09-14 22:18 ` [PATCH v2 5/9] hugetlb: rename vma_shareable() and refactor code Mike Kravetz
2022-09-14 22:18 ` [PATCH v2 6/9] hugetlb: add vma based lock for pmd sharing Mike Kravetz
2022-09-15 20:25   ` Mike Kravetz
2022-09-24 13:11   ` Miaohe Lin
2022-09-14 22:18 ` [PATCH v2 7/9] hugetlb: create hugetlb_unmap_file_folio to unmap single file folio Mike Kravetz
2022-09-14 22:18 ` [PATCH v2 8/9] hugetlb: use new vma_lock for pmd sharing synchronization Mike Kravetz
2022-09-29  6:08   ` Miaohe Lin
2022-10-01  0:00     ` Mike Kravetz
2022-10-08  2:29       ` Miaohe Lin
2022-09-14 22:18 ` [PATCH v2 9/9] hugetlb: clean up code checking for fault/truncation races Mike Kravetz
2022-09-19 23:32   ` Mike Kravetz
2022-09-29  6:25   ` Miaohe Lin
