* [PATCH v5 0/3] mm/khugepaged: fix khugepaged+shmem races
@ 2023-03-07  5:20 David Stevens
  2023-03-07  5:20 ` [PATCH v5 1/3] mm/khugepaged: refactor collapse_file control flow David Stevens
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: David Stevens @ 2023-03-07  5:20 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Peter Xu, Matthew Wilcox, Kirill A . Shutemov, Yang Shi,
	David Hildenbrand, Hugh Dickins, Jiaqi Yan, linux-kernel,
	David Stevens

From: David Stevens <stevensd@chromium.org>

Fix two races in khugepaged+shmem that cause issues with userfaultfd and
lseek, respectively.

v4 -> v5:
 - Rebase on mm-unstable (9caa15b8a499)
 - Gather acks
v3 -> v4:
 - Base changes on mm-everything (fba720cb4dc0)
 - Add patch to refactor error handling control flow in collapse_file
 - Rebase userfaultfd patch with no significant logic changes
 - Different approach for fixing lseek race
v2 -> v3:
 - Use XA_RETRY_ENTRY to synchronize with reads from the page cache
   under the RCU read lock in userfaultfd fix
 - Add patch to fix lseek race
v1 -> v2:
 - Different approach for userfaultfd fix

David Stevens (3):
  mm/khugepaged: refactor collapse_file control flow
  mm/khugepaged: skip shmem with userfaultfd
  mm/khugepaged: maintain page cache uptodate flag

 include/trace/events/huge_memory.h |   3 +-
 mm/khugepaged.c                    | 259 ++++++++++++++++-------------
 2 files changed, 142 insertions(+), 120 deletions(-)

-- 
2.40.0.rc0.216.gc4246ad0f0-goog



* [PATCH v5 1/3] mm/khugepaged: refactor collapse_file control flow
  2023-03-07  5:20 [PATCH v5 0/3] mm/khugepaged: fix khugepaged+shmem races David Stevens
@ 2023-03-07  5:20 ` David Stevens
  2023-03-23 19:51   ` Hugh Dickins
  2023-03-07  5:20 ` [PATCH v5 2/3] mm/khugepaged: skip shmem with userfaultfd David Stevens
  2023-03-07  5:20 ` [PATCH v5 3/3] mm/khugepaged: maintain page cache uptodate flag David Stevens
  2 siblings, 1 reply; 17+ messages in thread
From: David Stevens @ 2023-03-07  5:20 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Peter Xu, Matthew Wilcox, Kirill A . Shutemov, Yang Shi,
	David Hildenbrand, Hugh Dickins, Jiaqi Yan, linux-kernel,
	David Stevens

From: David Stevens <stevensd@chromium.org>

Add a rollback label to deal with failure, instead of repeatedly
checking for SCAN_SUCCEED, to make it easier to add more failure
cases. The refactoring also allows the collapse_file tracepoint to
include hpage on success (instead of NULL).
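
As a shape-only sketch of the control flow this moves to (the enum is
abridged and the step functions are placeholders, not code from
mm/khugepaged.c):

/* Illustration of the early-exit-to-rollback pattern, not the real code. */
enum scan_result { SCAN_SUCCEED, SCAN_FAIL };	/* abridged */

static enum scan_result step(int ok)	/* stand-in for one collapse step */
{
	return ok ? SCAN_SUCCEED : SCAN_FAIL;
}

static enum scan_result collapse_shape(int a, int b)
{
	enum scan_result result;

	result = step(a);
	if (result != SCAN_SUCCEED)
		goto rollback;

	result = step(b);
	if (result != SCAN_SUCCEED)
		goto rollback;

	/* Success path: publish the new page, then skip the rollback. */
	goto out;

rollback:
	/* One place to undo page cache state for every failure case. */
out:
	return result;
}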

Signed-off-by: David Stevens <stevensd@chromium.org>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
---
 mm/khugepaged.c | 219 ++++++++++++++++++++++++------------------------
 1 file changed, 108 insertions(+), 111 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3ea2aa55c2c5..b954e3c685e7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1907,6 +1907,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	if (result != SCAN_SUCCEED)
 		goto out;
 
+	__SetPageLocked(hpage);
+	if (is_shmem)
+		__SetPageSwapBacked(hpage);
+	hpage->index = start;
+	hpage->mapping = mapping;
+
 	/*
 	 * Ensure we have slots for all the pages in the range.  This is
 	 * almost certainly a no-op because most of the pages must be present
@@ -1919,16 +1925,10 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		xas_unlock_irq(&xas);
 		if (!xas_nomem(&xas, GFP_KERNEL)) {
 			result = SCAN_FAIL;
-			goto out;
+			goto rollback;
 		}
 	} while (1);
 
-	__SetPageLocked(hpage);
-	if (is_shmem)
-		__SetPageSwapBacked(hpage);
-	hpage->index = start;
-	hpage->mapping = mapping;
-
 	/*
 	 * At this point the hpage is locked and not up-to-date.
 	 * It's safe to insert it into the page cache, because nobody would
@@ -2145,129 +2145,126 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	 */
 	try_to_unmap_flush();
 
-	if (result == SCAN_SUCCEED) {
-		/*
-		 * Replacing old pages with new one has succeeded, now we
-		 * attempt to copy the contents.
-		 */
-		index = start;
-		list_for_each_entry(page, &pagelist, lru) {
-			while (index < page->index) {
-				clear_highpage(hpage + (index % HPAGE_PMD_NR));
-				index++;
-			}
-			if (copy_mc_highpage(hpage + (page->index % HPAGE_PMD_NR),
-					     page) > 0) {
-				result = SCAN_COPY_MC;
-				break;
-			}
-			index++;
-		}
-		while (result == SCAN_SUCCEED && index < end) {
+	if (result != SCAN_SUCCEED)
+		goto rollback;
+
+	/*
+	 * Replacing old pages with new one has succeeded, now we
+	 * attempt to copy the contents.
+	 */
+	index = start;
+	list_for_each_entry(page, &pagelist, lru) {
+		while (index < page->index) {
 			clear_highpage(hpage + (index % HPAGE_PMD_NR));
 			index++;
 		}
+		if (copy_mc_highpage(hpage + (page->index % HPAGE_PMD_NR), page) > 0) {
+			result = SCAN_COPY_MC;
+			goto rollback;
+		}
+		index++;
+	}
+	while (index < end) {
+		clear_highpage(hpage + (index % HPAGE_PMD_NR));
+		index++;
 	}
 
-	if (result == SCAN_SUCCEED) {
-		/*
-		 * Copying old pages to huge one has succeeded, now we
-		 * need to free the old pages.
-		 */
-		list_for_each_entry_safe(page, tmp, &pagelist, lru) {
-			list_del(&page->lru);
-			page->mapping = NULL;
-			page_ref_unfreeze(page, 1);
-			ClearPageActive(page);
-			ClearPageUnevictable(page);
-			unlock_page(page);
-			put_page(page);
-		}
+	/*
+	 * Copying old pages to huge one has succeeded, now we
+	 * need to free the old pages.
+	 */
+	list_for_each_entry_safe(page, tmp, &pagelist, lru) {
+		list_del(&page->lru);
+		page->mapping = NULL;
+		page_ref_unfreeze(page, 1);
+		ClearPageActive(page);
+		ClearPageUnevictable(page);
+		unlock_page(page);
+		put_page(page);
+	}
 
-		xas_lock_irq(&xas);
-		if (is_shmem)
-			__mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
-		else
-			__mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
+	xas_lock_irq(&xas);
+	if (is_shmem)
+		__mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
+	else
+		__mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
 
-		if (nr_none) {
-			__mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
-			/* nr_none is always 0 for non-shmem. */
-			__mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
-		}
-		/* Join all the small entries into a single multi-index entry. */
-		xas_set_order(&xas, start, HPAGE_PMD_ORDER);
-		xas_store(&xas, hpage);
-		xas_unlock_irq(&xas);
+	if (nr_none) {
+		__mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
+		/* nr_none is always 0 for non-shmem. */
+		__mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
+	}
+	/* Join all the small entries into a single multi-index entry. */
+	xas_set_order(&xas, start, HPAGE_PMD_ORDER);
+	xas_store(&xas, hpage);
+	xas_unlock_irq(&xas);
 
-		folio = page_folio(hpage);
-		folio_mark_uptodate(folio);
-		folio_ref_add(folio, HPAGE_PMD_NR - 1);
+	folio = page_folio(hpage);
+	folio_mark_uptodate(folio);
+	folio_ref_add(folio, HPAGE_PMD_NR - 1);
 
-		if (is_shmem)
-			folio_mark_dirty(folio);
-		folio_add_lru(folio);
+	if (is_shmem)
+		folio_mark_dirty(folio);
+	folio_add_lru(folio);
 
-		/*
-		 * Remove pte page tables, so we can re-fault the page as huge.
-		 */
-		result = retract_page_tables(mapping, start, mm, addr, hpage,
-					     cc);
-		unlock_page(hpage);
-		hpage = NULL;
-	} else {
-		/* Something went wrong: roll back page cache changes */
-		xas_lock_irq(&xas);
-		if (nr_none) {
-			mapping->nrpages -= nr_none;
-			shmem_uncharge(mapping->host, nr_none);
+	/*
+	 * Remove pte page tables, so we can re-fault the page as huge.
+	 */
+	result = retract_page_tables(mapping, start, mm, addr, hpage,
+				     cc);
+	unlock_page(hpage);
+	goto out;
+
+rollback:
+	/* Something went wrong: roll back page cache changes */
+	xas_lock_irq(&xas);
+	if (nr_none) {
+		mapping->nrpages -= nr_none;
+		shmem_uncharge(mapping->host, nr_none);
+	}
+
+	xas_set(&xas, start);
+	xas_for_each(&xas, page, end - 1) {
+		page = list_first_entry_or_null(&pagelist,
+				struct page, lru);
+		if (!page || xas.xa_index < page->index) {
+			if (!nr_none)
+				break;
+			nr_none--;
+			/* Put holes back where they were */
+			xas_store(&xas, NULL);
+			continue;
 		}
 
-		xas_set(&xas, start);
-		xas_for_each(&xas, page, end - 1) {
-			page = list_first_entry_or_null(&pagelist,
-					struct page, lru);
-			if (!page || xas.xa_index < page->index) {
-				if (!nr_none)
-					break;
-				nr_none--;
-				/* Put holes back where they were */
-				xas_store(&xas, NULL);
-				continue;
-			}
+		VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
 
-			VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
+		/* Unfreeze the page. */
+		list_del(&page->lru);
+		page_ref_unfreeze(page, 2);
+		xas_store(&xas, page);
+		xas_pause(&xas);
+		xas_unlock_irq(&xas);
+		unlock_page(page);
+		putback_lru_page(page);
+		xas_lock_irq(&xas);
+	}
+	VM_BUG_ON(nr_none);
+	/*
+	 * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only.
+	 * This undo is not needed unless failure is due to SCAN_COPY_MC.
+	 */
+	if (!is_shmem && result == SCAN_COPY_MC)
+		filemap_nr_thps_dec(mapping);
 
-			/* Unfreeze the page. */
-			list_del(&page->lru);
-			page_ref_unfreeze(page, 2);
-			xas_store(&xas, page);
-			xas_pause(&xas);
-			xas_unlock_irq(&xas);
-			unlock_page(page);
-			putback_lru_page(page);
-			xas_lock_irq(&xas);
-		}
-		VM_BUG_ON(nr_none);
-		/*
-		 * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only.
-		 * This undo is not needed unless failure is due to SCAN_COPY_MC.
-		 */
-		if (!is_shmem && result == SCAN_COPY_MC)
-			filemap_nr_thps_dec(mapping);
+	xas_unlock_irq(&xas);
 
-		xas_unlock_irq(&xas);
+	hpage->mapping = NULL;
 
-		hpage->mapping = NULL;
-	}
+	unlock_page(hpage);
+	put_page(hpage);
 
-	if (hpage)
-		unlock_page(hpage);
 out:
 	VM_BUG_ON(!list_empty(&pagelist));
-	if (hpage)
-		put_page(hpage);
-
 	trace_mm_khugepaged_collapse_file(mm, hpage, index, is_shmem, addr, file, nr, result);
 	return result;
 }
-- 
2.40.0.rc0.216.gc4246ad0f0-goog



* [PATCH v5 2/3] mm/khugepaged: skip shmem with userfaultfd
  2023-03-07  5:20 [PATCH v5 0/3] mm/khugepaged: fix khugepaged+shmem races David Stevens
  2023-03-07  5:20 ` [PATCH v5 1/3] mm/khugepaged: refactor collapse_file control flow David Stevens
@ 2023-03-07  5:20 ` David Stevens
  2023-03-23 19:48   ` Hugh Dickins
  2023-03-07  5:20 ` [PATCH v5 3/3] mm/khugepaged: maintain page cache uptodate flag David Stevens
  2 siblings, 1 reply; 17+ messages in thread
From: David Stevens @ 2023-03-07  5:20 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Peter Xu, Matthew Wilcox, Kirill A . Shutemov, Yang Shi,
	David Hildenbrand, Hugh Dickins, Jiaqi Yan, linux-kernel,
	David Stevens

From: David Stevens <stevensd@chromium.org>

Make sure that collapse_file respects any userfaultfds registered with
MODE_MISSING. If userspace has any such userfaultfds registered, then
for any page which it knows to be missing, it may expect a
UFFD_EVENT_PAGEFAULT. This means collapse_file needs to be careful when
collapsing a shmem range would result in replacing an empty page with a
THP, to avoid breaking userfaultfd.
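
For reference, the userspace arrangement being protected looks roughly
like this (standard userfaultfd API; watch_missing() and its parameters
are made up for illustration):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch: arm a MODE_MISSING userfaultfd over an existing shmem mapping. */
static int watch_missing(void *addr, size_t len)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg))
		return -1;
	/*
	 * From here on, userspace expects to see a fault event for any page
	 * it knows to be missing in [addr, addr + len), and to resolve it
	 * with UFFDIO_COPY or UFFDIO_ZEROPAGE; a collapse must not silently
	 * fill such a page behind its back.
	 */
	return uffd;
}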

Synchronization when checking for userfaultfds in collapse_file is
tricky because the mmap locks can't be used to prevent races with the
registration of new userfaultfds. Instead, we provide synchronization by
ensuring that userspace cannot observe the fact that pages are missing
before we check for userfaultfds. Although this allows registration of a
userfaultfd to race with collapse_file, it ensures that userspace cannot
observe any pages transition from missing to present after such a race
occurs. This makes such a race indistinguishable from a collapse
occurring immediately before the userfaultfd registration.

The first step to provide this synchronization is to stop filling gaps
during the loop iterating over the target range, since the page cache
lock can be dropped during that loop. The second step is to fill the
gaps with XA_RETRY_ENTRY after the page cache lock is acquired the final
time, to avoid races with accesses to the page cache that only take the
RCU read lock.

The fact that we don't fill holes during the initial iteration means
that collapse_file now has to handle faults occurring during the
collapse. This is done by re-validating the number of missing pages
after acquiring the page cache lock for the final time.

This fix is targeted at khugepaged, but the change also applies to
MADV_COLLAPSE. MADV_COLLAPSE on a range with a userfaultfd will now
return EBUSY if there are any missing pages (instead of succeeding on
shmem and returning EINVAL on anonymous memory). There is also now a
window during MADV_COLLAPSE where a fault on a missing page will cause
the syscall to fail with EAGAIN.
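
As a usage-level sketch of what that means for callers (MADV_COLLAPSE
needs a 6.1+ kernel; the retry policy here is only an example):

#include <errno.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* from <asm-generic/mman-common.h> */
#endif

/* Sketch: retry the collapse while the failure is only transient. */
static int try_collapse(void *addr, size_t len)
{
	for (int attempt = 0; attempt < 3; attempt++) {
		if (!madvise(addr, len, MADV_COLLAPSE))
			return 0;	/* range now backed by a THP */
		if (errno != EAGAIN)
			return -1;	/* e.g. EBUSY: missing pages + armed uffd */
		/* EAGAIN: a fault raced with the collapse, try again. */
	}
	return -1;
}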

The fact that intermediate page cache state can no longer be observed
before the rollback of a failed collapse is also technically a
userspace-visible change (via at least SEEK_DATA and SEEK_HOLE), but it
is exceedingly unlikely that anything relies on being able to observe
that transient state.

Signed-off-by: David Stevens <stevensd@chromium.org>
Acked-by: Peter Xu <peterx@redhat.com>
---
 include/trace/events/huge_memory.h |  3 +-
 mm/khugepaged.c                    | 92 +++++++++++++++++++++++-------
 2 files changed, 73 insertions(+), 22 deletions(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 46cce509957b..7ee85fff89a3 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -37,7 +37,8 @@
 	EM( SCAN_CGROUP_CHARGE_FAIL,	"ccgroup_charge_failed")	\
 	EM( SCAN_TRUNCATED,		"truncated")			\
 	EM( SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
-	EMe(SCAN_COPY_MC,		"copy_poisoned_page")		\
+	EM( SCAN_COPY_MC,		"copy_poisoned_page")		\
+	EMe(SCAN_PAGE_FILLED,		"page_filled")			\
 
 #undef EM
 #undef EMe
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b954e3c685e7..51ae399f2035 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -57,6 +57,7 @@ enum scan_result {
 	SCAN_TRUNCATED,
 	SCAN_PAGE_HAS_PRIVATE,
 	SCAN_COPY_MC,
+	SCAN_PAGE_FILLED,
 };
 
 #define CREATE_TRACE_POINTS
@@ -1873,8 +1874,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
  *  - allocate and lock a new huge page;
  *  - scan page cache replacing old pages with the new one
  *    + swap/gup in pages if necessary;
- *    + fill in gaps;
  *    + keep old pages around in case rollback is required;
+ *  - finalize updates to the page cache;
  *  - if replacing succeeds:
  *    + copy data over;
  *    + free old pages;
@@ -1952,13 +1953,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 						result = SCAN_TRUNCATED;
 						goto xa_locked;
 					}
-					xas_set(&xas, index);
+					xas_set(&xas, index + 1);
 				}
 				if (!shmem_charge(mapping->host, 1)) {
 					result = SCAN_FAIL;
 					goto xa_locked;
 				}
-				xas_store(&xas, hpage);
 				nr_none++;
 				continue;
 			}
@@ -2169,21 +2169,57 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		index++;
 	}
 
-	/*
-	 * Copying old pages to huge one has succeeded, now we
-	 * need to free the old pages.
-	 */
-	list_for_each_entry_safe(page, tmp, &pagelist, lru) {
-		list_del(&page->lru);
-		page->mapping = NULL;
-		page_ref_unfreeze(page, 1);
-		ClearPageActive(page);
-		ClearPageUnevictable(page);
-		unlock_page(page);
-		put_page(page);
+	if (nr_none) {
+		struct vm_area_struct *vma;
+		int nr_none_check = 0;
+
+		i_mmap_lock_read(mapping);
+		xas_lock_irq(&xas);
+
+		xas_set(&xas, start);
+		for (index = start; index < end; index++) {
+			if (!xas_next(&xas)) {
+				xas_store(&xas, XA_RETRY_ENTRY);
+				nr_none_check++;
+			}
+		}
+
+		if (nr_none != nr_none_check) {
+			result = SCAN_PAGE_FILLED;
+			goto immap_locked;
+		}
+
+		/*
+		 * If userspace observed a missing page in a VMA with an armed
+		 * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for
+		 * that page, so we need to roll back to avoid suppressing such
+		 * an event. Any userfaultfds armed after this point will not be
+		 * able to observe any missing pages due to the previously
+		 * inserted retry entries.
+		 */
+		vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
+			if (userfaultfd_missing(vma)) {
+				result = SCAN_EXCEED_NONE_PTE;
+				goto immap_locked;
+			}
+		}
+
+immap_locked:
+		i_mmap_unlock_read(mapping);
+		if (result != SCAN_SUCCEED) {
+			xas_set(&xas, start);
+			for (index = start; index < end; index++) {
+				if (xas_next(&xas) == XA_RETRY_ENTRY)
+					xas_store(&xas, NULL);
+			}
+
+			xas_unlock_irq(&xas);
+			goto rollback;
+		}
+	} else {
+		xas_lock_irq(&xas);
 	}
 
-	xas_lock_irq(&xas);
 	if (is_shmem)
 		__mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
 	else
@@ -2213,6 +2249,20 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	result = retract_page_tables(mapping, start, mm, addr, hpage,
 				     cc);
 	unlock_page(hpage);
+
+	/*
+	 * The collapse has succeeded, so free the old pages.
+	 */
+	list_for_each_entry_safe(page, tmp, &pagelist, lru) {
+		list_del(&page->lru);
+		page->mapping = NULL;
+		page_ref_unfreeze(page, 1);
+		ClearPageActive(page);
+		ClearPageUnevictable(page);
+		unlock_page(page);
+		put_page(page);
+	}
+
 	goto out;
 
 rollback:
@@ -2224,15 +2274,13 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	}
 
 	xas_set(&xas, start);
-	xas_for_each(&xas, page, end - 1) {
+	end = index;
+	for (index = start; index < end; index++) {
+		xas_next(&xas);
 		page = list_first_entry_or_null(&pagelist,
 				struct page, lru);
 		if (!page || xas.xa_index < page->index) {
-			if (!nr_none)
-				break;
 			nr_none--;
-			/* Put holes back where they were */
-			xas_store(&xas, NULL);
 			continue;
 		}
 
@@ -2750,12 +2798,14 @@ static int madvise_collapse_errno(enum scan_result r)
 	case SCAN_ALLOC_HUGE_PAGE_FAIL:
 		return -ENOMEM;
 	case SCAN_CGROUP_CHARGE_FAIL:
+	case SCAN_EXCEED_NONE_PTE:
 		return -EBUSY;
 	/* Resource temporary unavailable - trying again might succeed */
 	case SCAN_PAGE_COUNT:
 	case SCAN_PAGE_LOCK:
 	case SCAN_PAGE_LRU:
 	case SCAN_DEL_PAGE_LRU:
+	case SCAN_PAGE_FILLED:
 		return -EAGAIN;
 	/*
 	 * Other: Trying again likely not to succeed / error intrinsic to
-- 
2.40.0.rc0.216.gc4246ad0f0-goog



* [PATCH v5 3/3] mm/khugepaged: maintain page cache uptodate flag
  2023-03-07  5:20 [PATCH v5 0/3] mm/khugepaged: fix khugepaged+shmem races David Stevens
  2023-03-07  5:20 ` [PATCH v5 1/3] mm/khugepaged: refactor collapse_file control flow David Stevens
  2023-03-07  5:20 ` [PATCH v5 2/3] mm/khugepaged: skip shmem with userfaultfd David Stevens
@ 2023-03-07  5:20 ` David Stevens
  2023-03-23 19:07   ` Hugh Dickins
  2 siblings, 1 reply; 17+ messages in thread
From: David Stevens @ 2023-03-07  5:20 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: Peter Xu, Matthew Wilcox, Kirill A . Shutemov, Yang Shi,
	David Hildenbrand, Hugh Dickins, Jiaqi Yan, linux-kernel,
	David Stevens

From: David Stevens <stevensd@chromium.org>

Make sure that collapse_file doesn't interfere with checking the
uptodate flag in the page cache by only inserting hpage into the page
cache after it has been updated and marked uptodate. This is achieved by
simply not replacing present pages with hpage when iterating over the
target range. The present pages are already locked, so replacing them
with the locked hpage before the collapse is finalized is unnecessary.

This fixes a race where folio_seek_hole_data would mistake hpage for
a fallocated but unwritten page. This race is visible to userspace via
data temporarily disappearing from SEEK_DATA/SEEK_HOLE.
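
The symptom can be probed from userspace with nothing more than lseek()
(illustrative sketch; fd is assumed to be an open tmpfs file with data
written at offset off):

#define _GNU_SOURCE	/* for SEEK_DATA */
#include <unistd.h>

/* Sketch: data at 'off' should always be reported by SEEK_DATA. */
static int data_visible(int fd, off_t off)
{
	off_t pos = lseek(fd, off, SEEK_DATA);

	/*
	 * Without the fix, a concurrent collapse briefly leaves a locked,
	 * !uptodate hpage in the page cache, so this can skip ahead to a
	 * later offset (or fail with ENXIO) as if 'off' were a hole.
	 */
	return pos == off;
}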

Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: David Stevens <stevensd@chromium.org>
Acked-by: Peter Xu <peterx@redhat.com>
---
 mm/khugepaged.c | 50 ++++++++++++-------------------------------------
 1 file changed, 12 insertions(+), 38 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 51ae399f2035..bdde0a02811b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1930,12 +1930,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		}
 	} while (1);
 
-	/*
-	 * At this point the hpage is locked and not up-to-date.
-	 * It's safe to insert it into the page cache, because nobody would
-	 * be able to map it or use it in another way until we unlock it.
-	 */
-
 	xas_set(&xas, start);
 	for (index = start; index < end; index++) {
 		page = xas_next(&xas);
@@ -2104,13 +2098,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		}
 
 		/*
-		 * Add the page to the list to be able to undo the collapse if
-		 * something go wrong.
+		 * Accumulate the pages that are being collapsed.
 		 */
 		list_add_tail(&page->lru, &pagelist);
-
-		/* Finally, replace with the new page. */
-		xas_store(&xas, hpage);
 		continue;
 out_unlock:
 		unlock_page(page);
@@ -2149,8 +2139,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		goto rollback;
 
 	/*
-	 * Replacing old pages with new one has succeeded, now we
-	 * attempt to copy the contents.
+	 * The old pages are locked, so they won't change anymore.
 	 */
 	index = start;
 	list_for_each_entry(page, &pagelist, lru) {
@@ -2230,11 +2219,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		/* nr_none is always 0 for non-shmem. */
 		__mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
 	}
-	/* Join all the small entries into a single multi-index entry. */
-	xas_set_order(&xas, start, HPAGE_PMD_ORDER);
-	xas_store(&xas, hpage);
-	xas_unlock_irq(&xas);
 
+	/*
+	 * Mark hpage as uptodate before inserting it into the page cache so
+	 * that it isn't mistaken for a fallocated but unwritten page.
+	 */
 	folio = page_folio(hpage);
 	folio_mark_uptodate(folio);
 	folio_ref_add(folio, HPAGE_PMD_NR - 1);
@@ -2243,6 +2232,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 		folio_mark_dirty(folio);
 	folio_add_lru(folio);
 
+	/* Join all the small entries into a single multi-index entry. */
+	xas_set_order(&xas, start, HPAGE_PMD_ORDER);
+	xas_store(&xas, hpage);
+	xas_unlock_irq(&xas);
+
 	/*
 	 * Remove pte page tables, so we can re-fault the page as huge.
 	 */
@@ -2267,36 +2261,18 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 
 rollback:
 	/* Something went wrong: roll back page cache changes */
-	xas_lock_irq(&xas);
 	if (nr_none) {
 		mapping->nrpages -= nr_none;
 		shmem_uncharge(mapping->host, nr_none);
 	}
 
-	xas_set(&xas, start);
-	end = index;
-	for (index = start; index < end; index++) {
-		xas_next(&xas);
-		page = list_first_entry_or_null(&pagelist,
-				struct page, lru);
-		if (!page || xas.xa_index < page->index) {
-			nr_none--;
-			continue;
-		}
-
-		VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
-
+	list_for_each_entry_safe(page, tmp, &pagelist, lru) {
 		/* Unfreeze the page. */
 		list_del(&page->lru);
 		page_ref_unfreeze(page, 2);
-		xas_store(&xas, page);
-		xas_pause(&xas);
-		xas_unlock_irq(&xas);
 		unlock_page(page);
 		putback_lru_page(page);
-		xas_lock_irq(&xas);
 	}
-	VM_BUG_ON(nr_none);
 	/*
 	 * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only.
 	 * This undo is not needed unless failure is due to SCAN_COPY_MC.
@@ -2304,8 +2280,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
 	if (!is_shmem && result == SCAN_COPY_MC)
 		filemap_nr_thps_dec(mapping);
 
-	xas_unlock_irq(&xas);
-
 	hpage->mapping = NULL;
 
 	unlock_page(hpage);
-- 
2.40.0.rc0.216.gc4246ad0f0-goog



* Re: [PATCH v5 3/3] mm/khugepaged: maintain page cache uptodate flag
  2023-03-07  5:20 ` [PATCH v5 3/3] mm/khugepaged: maintain page cache uptodate flag David Stevens
@ 2023-03-23 19:07   ` Hugh Dickins
  2023-03-23 21:56     ` Matthew Wilcox
  2023-03-28  9:48     ` David Stevens
  0 siblings, 2 replies; 17+ messages in thread
From: Hugh Dickins @ 2023-03-23 19:07 UTC (permalink / raw)
  To: David Stevens
  Cc: linux-mm, Andrew Morton, Peter Xu, Matthew Wilcox,
	Kirill A . Shutemov, Yang Shi, David Hildenbrand, Hugh Dickins,
	Jiaqi Yan, linux-kernel

On Tue, 7 Mar 2023, David Stevens wrote:

> From: David Stevens <stevensd@chromium.org>
> 
> Make sure that collapse_file doesn't interfere with checking the
> uptodate flag in the page cache by only inserting hpage into the page
> cache after it has been updated and marked uptodate. This is achieved by
> simply not replacing present pages with hpage when iterating over the
> target range. The present pages are already locked, so replacing them
> with the locked hpage before the collapse is finalized is unnecessary.
> 
> This fixes a race where folio_seek_hole_data would mistake hpage for
> a fallocated but unwritten page. This race is visible to userspace via
> data temporarily disappearing from SEEK_DATA/SEEK_HOLE.
> 
> Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> Signed-off-by: David Stevens <stevensd@chromium.org>
> Acked-by: Peter Xu <peterx@redhat.com>

NAK to this patch, I'm afraid: it deadlocks.

What I know it to deadlock against does not make the most persuasive
argument: cgroup v1 deprecated memcg moving, where mc_handle_file_pte()
uses filemap_get_incore_folio() while holding page table lock, and spins
around doing "goto repeat" in filemap_get_entry() while folio_try_get_rcu()
keeps failing because collapse_file()'s old page has been left in the
xarray with its refcount frozen to 0.  Meanwhile, collapse_file() is
spinning to get that page table lock, to unmap pte of a later page.

mincore()'s use of filemap_get_incore_folio() would be liable to hit
the same deadlock.  If we think for longer, we may find more examples.
But even when not actually deadlocking, it's wasting lots of CPU on
concurrent lookups (e.g. faults) spinning in filemap_get_entry().

I don't suppose it's entirely accurate, but think of keeping a folio
refcount frozen to 0 as like holding a spinlock (and this lock sadly out
of sight from lockdep).  The pre-existing code works because the old page
with refcount frozen to 0 is immediately replaced in the xarray by an
entry for the new hpage, so the old page is no longer discoverable:
and the new hpage is locked, not with a spinlock but the usual
folio/page lock, on which concurrent lookups will sleep.

Your discovery of the SEEK_DATA/SEEK_HOLE issue is important - thank
you - but I believe collapse_file() should be left as is, and the fix
made instead in mapping_seek_hole_data() or folio_seek_hole_data():
I believe that should not jump to assume that a !uptodate folio is a
hole (as was reasonable to assume for shmem, before collapsing to huge
got added), but should lock the folio if !uptodate, and check again
after getting the lock - if still !uptodate, it's a shmem hole, not
a transient race with collapse_file().
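
A rough sketch of the shape of that check (schematic only: the helper
name is invented, and the real change in mapping_seek_hole_data() or
folio_seek_hole_data() would also have to cope with the folio having
been dropped from the mapping by a failed collapse):

/* Sketch: only conclude "hole" from !uptodate after locking the folio. */
static bool folio_counts_as_data(struct folio *folio)
{
	bool data = folio_test_uptodate(folio);

	if (!data) {
		/*
		 * A racing collapse_file() keeps its new folio locked and
		 * marks it uptodate just before unlocking, so wait for the
		 * folio lock and look again; a folio that is still
		 * !uptodate after that is a genuine shmem hole.
		 */
		folio_lock(folio);
		data = folio_test_uptodate(folio);
		folio_unlock(folio);
	}
	return data;
}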

I was (pleased but) a little surprised when Matthew found in 5.12 that
shmem_seek_hole_data() could be generalized to filemap_seek_hole_data():
he will have a surer grasp of what's safe or unsafe to assume of
!uptodate in non-shmem folios.

On an earlier audit, for different reasons, I did also run across
lib/buildid.c build_id_parse() using find_get_page() without checking
PageUptodate() - looks as if it might do the wrong thing if it races
with khugepaged collapsing text to huge, and should probably have a
similar fix.

Hugh

> ---
>  mm/khugepaged.c | 50 ++++++++++++-------------------------------------
>  1 file changed, 12 insertions(+), 38 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 51ae399f2035..bdde0a02811b 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1930,12 +1930,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		}
>  	} while (1);
>  
> -	/*
> -	 * At this point the hpage is locked and not up-to-date.
> -	 * It's safe to insert it into the page cache, because nobody would
> -	 * be able to map it or use it in another way until we unlock it.
> -	 */
> -
>  	xas_set(&xas, start);
>  	for (index = start; index < end; index++) {
>  		page = xas_next(&xas);
> @@ -2104,13 +2098,9 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		}
>  
>  		/*
> -		 * Add the page to the list to be able to undo the collapse if
> -		 * something go wrong.
> +		 * Accumulate the pages that are being collapsed.
>  		 */
>  		list_add_tail(&page->lru, &pagelist);
> -
> -		/* Finally, replace with the new page. */
> -		xas_store(&xas, hpage);
>  		continue;
>  out_unlock:
>  		unlock_page(page);
> @@ -2149,8 +2139,7 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		goto rollback;
>  
>  	/*
> -	 * Replacing old pages with new one has succeeded, now we
> -	 * attempt to copy the contents.
> +	 * The old pages are locked, so they won't change anymore.
>  	 */
>  	index = start;
>  	list_for_each_entry(page, &pagelist, lru) {
> @@ -2230,11 +2219,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		/* nr_none is always 0 for non-shmem. */
>  		__mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
>  	}
> -	/* Join all the small entries into a single multi-index entry. */
> -	xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> -	xas_store(&xas, hpage);
> -	xas_unlock_irq(&xas);
>  
> +	/*
> +	 * Mark hpage as uptodate before inserting it into the page cache so
> +	 * that it isn't mistaken for an fallocated but unwritten page.
> +	 * that it isn't mistaken for a fallocated but unwritten page.
>  	folio = page_folio(hpage);
>  	folio_mark_uptodate(folio);
>  	folio_ref_add(folio, HPAGE_PMD_NR - 1);
> @@ -2243,6 +2232,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		folio_mark_dirty(folio);
>  	folio_add_lru(folio);
>  
> +	/* Join all the small entries into a single multi-index entry. */
> +	xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> +	xas_store(&xas, hpage);
> +	xas_unlock_irq(&xas);
> +
>  	/*
>  	 * Remove pte page tables, so we can re-fault the page as huge.
>  	 */
> @@ -2267,36 +2261,18 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  
>  rollback:
>  	/* Something went wrong: roll back page cache changes */
> -	xas_lock_irq(&xas);
>  	if (nr_none) {
>  		mapping->nrpages -= nr_none;
>  		shmem_uncharge(mapping->host, nr_none);
>  	}
>  
> -	xas_set(&xas, start);
> -	end = index;
> -	for (index = start; index < end; index++) {
> -		xas_next(&xas);
> -		page = list_first_entry_or_null(&pagelist,
> -				struct page, lru);
> -		if (!page || xas.xa_index < page->index) {
> -			nr_none--;
> -			continue;
> -		}
> -
> -		VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> -
> +	list_for_each_entry_safe(page, tmp, &pagelist, lru) {
>  		/* Unfreeze the page. */
>  		list_del(&page->lru);
>  		page_ref_unfreeze(page, 2);
> -		xas_store(&xas, page);
> -		xas_pause(&xas);
> -		xas_unlock_irq(&xas);
>  		unlock_page(page);
>  		putback_lru_page(page);
> -		xas_lock_irq(&xas);
>  	}
> -	VM_BUG_ON(nr_none);
>  	/*
>  	 * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only.
>  	 * This undo is not needed unless failure is due to SCAN_COPY_MC.
> @@ -2304,8 +2280,6 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  	if (!is_shmem && result == SCAN_COPY_MC)
>  		filemap_nr_thps_dec(mapping);
>  
> -	xas_unlock_irq(&xas);
> -
>  	hpage->mapping = NULL;
>  
>  	unlock_page(hpage);
> -- 
> 2.40.0.rc0.216.gc4246ad0f0-goog


* Re: [PATCH v5 2/3] mm/khugepaged: skip shmem with userfaultfd
  2023-03-07  5:20 ` [PATCH v5 2/3] mm/khugepaged: skip shmem with userfaultfd David Stevens
@ 2023-03-23 19:48   ` Hugh Dickins
  2023-03-24  5:34     ` David Stevens
  0 siblings, 1 reply; 17+ messages in thread
From: Hugh Dickins @ 2023-03-23 19:48 UTC (permalink / raw)
  To: David Stevens
  Cc: linux-mm, Andrew Morton, Peter Xu, Matthew Wilcox,
	Kirill A . Shutemov, Yang Shi, David Hildenbrand, Hugh Dickins,
	Jiaqi Yan, Christoph Hellwig, linux-kernel

On Tue, 7 Mar 2023, David Stevens wrote:

> From: David Stevens <stevensd@chromium.org>
> 
> Make sure that collapse_file respects any userfaultfds registered with
> MODE_MISSING. If userspace has any such userfaultfds registered, then
> for any page which it knows to be missing, it may expect a
> UFFD_EVENT_PAGEFAULT. This means collapse_file needs to be careful when
> collapsing a shmem range would result in replacing an empty page with a
> THP, to avoid breaking userfaultfd.
> 
> Synchronization when checking for userfaultfds in collapse_file is
> tricky because the mmap locks can't be used to prevent races with the
> registration of new userfaultfds. Instead, we provide synchronization by
> ensuring that userspace cannot observe the fact that pages are missing
> before we check for userfaultfds. Although this allows registration of a
> userfaultfd to race with collapse_file, it ensures that userspace cannot
> observe any pages transition from missing to present after such a race
> occurs. This makes such a race indistinguishable from a collapse
> occurring immediately before the userfaultfd registration.
> 
> The first step to provide this synchronization is to stop filling gaps
> during the loop iterating over the target range, since the page cache
> lock can be dropped during that loop. The second step is to fill the
> gaps with XA_RETRY_ENTRY after the page cache lock is acquired the final
> time, to avoid races with accesses to the page cache that only take the
> RCU read lock.
> 
> The fact that we don't fill holes during the initial iteration means
> that collapse_file now has to handle faults occurring during the
> collapse. This is done by re-validating the number of missing pages
> after acquiring the page cache lock for the final time.
> 
> This fix is targeted at khugepaged, but the change also applies to
> MADV_COLLAPSE. MADV_COLLAPSE on a range with a userfaultfd will now
> return EBUSY if there are any missing pages (instead of succeeding on
> shmem and returning EINVAL on anonymous memory). There is also now a
> window during MADV_COLLAPSE where a fault on a missing page will cause
> the syscall to fail with EAGAIN.
> 
> The fact that intermediate page cache state can no longer be observed
> before the rollback of a failed collapse is also technically a
> userspace-visible change (via at least SEEK_DATA and SEEK_HOLE), but it
> is exceedingly unlikely that anything relies on being able to observe
> that transient state.
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> Acked-by: Peter Xu <peterx@redhat.com>

This patch looks to me like a lot of unnecessary (and not very pretty)
change, with surprising use of XA_RETRY_ENTRY outside of xarray internals.

I think you probably worked on this one, knowing what you intended in 3/3
to follow.  But if 3/3 as such does not follow, why can't this one just
leave collapse_file()'s code flow unchanged, but insert the
> +		vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
> +			if (userfaultfd_missing(vma)) {
you need somewhere just before copying and clearing the pages data?
Limited to CONFIG_USERFAULTFD=y and shmem and nr_none to minimize impact.
The locked !uptodate hpage in the xarray keeping others at bay as before.

Hmm, looking at that extract, is "start, start" correct?  And everywhere
else in mm/khugepaged.c is checking userfaultfd_armed(vma): I imagine
one or the other is more correct, but I'm not at all a userfaultfder.

And now I think that I was mistaken to Ack hch's "shmem: open code
the page cache lookup in shmem_get_folio_gfp" a few days ago: its
userfaultfd_missing() case needs folio_lock() to be taken after
filemap_get_entry(), in order to exclude the race with collapse_file()
as it used to do.  I think.  He and I will sort that out in due course,
maybe mm/shmem.c wants its own replacement for what he's removing.

Hugh

> ---
>  include/trace/events/huge_memory.h |  3 +-
>  mm/khugepaged.c                    | 92 +++++++++++++++++++++++-------
>  2 files changed, 73 insertions(+), 22 deletions(-)
> 
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 46cce509957b..7ee85fff89a3 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -37,7 +37,8 @@
>  	EM( SCAN_CGROUP_CHARGE_FAIL,	"ccgroup_charge_failed")	\
>  	EM( SCAN_TRUNCATED,		"truncated")			\
>  	EM( SCAN_PAGE_HAS_PRIVATE,	"page_has_private")		\
> -	EMe(SCAN_COPY_MC,		"copy_poisoned_page")		\
> +	EM( SCAN_COPY_MC,		"copy_poisoned_page")		\
> +	EMe(SCAN_PAGE_FILLED,		"page_filled")			\
>  
>  #undef EM
>  #undef EMe
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b954e3c685e7..51ae399f2035 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -57,6 +57,7 @@ enum scan_result {
>  	SCAN_TRUNCATED,
>  	SCAN_PAGE_HAS_PRIVATE,
>  	SCAN_COPY_MC,
> +	SCAN_PAGE_FILLED,
>  };
>  
>  #define CREATE_TRACE_POINTS
> @@ -1873,8 +1874,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
>   *  - allocate and lock a new huge page;
>   *  - scan page cache replacing old pages with the new one
>   *    + swap/gup in pages if necessary;
> - *    + fill in gaps;
>   *    + keep old pages around in case rollback is required;
> + *  - finalize updates to the page cache;
>   *  - if replacing succeeds:
>   *    + copy data over;
>   *    + free old pages;
> @@ -1952,13 +1953,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  						result = SCAN_TRUNCATED;
>  						goto xa_locked;
>  					}
> -					xas_set(&xas, index);
> +					xas_set(&xas, index + 1);
>  				}
>  				if (!shmem_charge(mapping->host, 1)) {
>  					result = SCAN_FAIL;
>  					goto xa_locked;
>  				}
> -				xas_store(&xas, hpage);
>  				nr_none++;
>  				continue;
>  			}
> @@ -2169,21 +2169,57 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		index++;
>  	}
>  
> -	/*
> -	 * Copying old pages to huge one has succeeded, now we
> -	 * need to free the old pages.
> -	 */
> -	list_for_each_entry_safe(page, tmp, &pagelist, lru) {
> -		list_del(&page->lru);
> -		page->mapping = NULL;
> -		page_ref_unfreeze(page, 1);
> -		ClearPageActive(page);
> -		ClearPageUnevictable(page);
> -		unlock_page(page);
> -		put_page(page);
> +	if (nr_none) {
> +		struct vm_area_struct *vma;
> +		int nr_none_check = 0;
> +
> +		i_mmap_lock_read(mapping);
> +		xas_lock_irq(&xas);
> +
> +		xas_set(&xas, start);
> +		for (index = start; index < end; index++) {
> +			if (!xas_next(&xas)) {
> +				xas_store(&xas, XA_RETRY_ENTRY);
> +				nr_none_check++;
> +			}
> +		}
> +
> +		if (nr_none != nr_none_check) {
> +			result = SCAN_PAGE_FILLED;
> +			goto immap_locked;
> +		}
> +
> +		/*
> +		 * If userspace observed a missing page in a VMA with an armed
> +		 * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for
> +		 * that page, so we need to roll back to avoid suppressing such
> +		 * an event. Any userfaultfds armed after this point will not be
> +		 * able to observe any missing pages due to the previously
> +		 * inserted retry entries.
> +		 */
> +		vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
> +			if (userfaultfd_missing(vma)) {
> +				result = SCAN_EXCEED_NONE_PTE;
> +				goto immap_locked;
> +			}
> +		}
> +
> +immap_locked:
> +		i_mmap_unlock_read(mapping);
> +		if (result != SCAN_SUCCEED) {
> +			xas_set(&xas, start);
> +			for (index = start; index < end; index++) {
> +				if (xas_next(&xas) == XA_RETRY_ENTRY)
> +					xas_store(&xas, NULL);
> +			}
> +
> +			xas_unlock_irq(&xas);
> +			goto rollback;
> +		}
> +	} else {
> +		xas_lock_irq(&xas);
>  	}
>  
> -	xas_lock_irq(&xas);
>  	if (is_shmem)
>  		__mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
>  	else
> @@ -2213,6 +2249,20 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  	result = retract_page_tables(mapping, start, mm, addr, hpage,
>  				     cc);
>  	unlock_page(hpage);
> +
> +	/*
> +	 * The collapse has succeeded, so free the old pages.
> +	 */
> +	list_for_each_entry_safe(page, tmp, &pagelist, lru) {
> +		list_del(&page->lru);
> +		page->mapping = NULL;
> +		page_ref_unfreeze(page, 1);
> +		ClearPageActive(page);
> +		ClearPageUnevictable(page);
> +		unlock_page(page);
> +		put_page(page);
> +	}
> +
>  	goto out;
>  
>  rollback:
> @@ -2224,15 +2274,13 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  	}
>  
>  	xas_set(&xas, start);
> -	xas_for_each(&xas, page, end - 1) {
> +	end = index;
> +	for (index = start; index < end; index++) {
> +		xas_next(&xas);
>  		page = list_first_entry_or_null(&pagelist,
>  				struct page, lru);
>  		if (!page || xas.xa_index < page->index) {
> -			if (!nr_none)
> -				break;
>  			nr_none--;
> -			/* Put holes back where they were */
> -			xas_store(&xas, NULL);
>  			continue;
>  		}
>  
> @@ -2750,12 +2798,14 @@ static int madvise_collapse_errno(enum scan_result r)
>  	case SCAN_ALLOC_HUGE_PAGE_FAIL:
>  		return -ENOMEM;
>  	case SCAN_CGROUP_CHARGE_FAIL:
> +	case SCAN_EXCEED_NONE_PTE:
>  		return -EBUSY;
>  	/* Resource temporary unavailable - trying again might succeed */
>  	case SCAN_PAGE_COUNT:
>  	case SCAN_PAGE_LOCK:
>  	case SCAN_PAGE_LRU:
>  	case SCAN_DEL_PAGE_LRU:
> +	case SCAN_PAGE_FILLED:
>  		return -EAGAIN;
>  	/*
>  	 * Other: Trying again likely not to succeed / error intrinsic to
> -- 
> 2.40.0.rc0.216.gc4246ad0f0-goog


* Re: [PATCH v5 1/3] mm/khugepaged: refactor collapse_file control flow
  2023-03-07  5:20 ` [PATCH v5 1/3] mm/khugepaged: refactor collapse_file control flow David Stevens
@ 2023-03-23 19:51   ` Hugh Dickins
  0 siblings, 0 replies; 17+ messages in thread
From: Hugh Dickins @ 2023-03-23 19:51 UTC (permalink / raw)
  To: David Stevens
  Cc: linux-mm, Andrew Morton, Peter Xu, Matthew Wilcox,
	Kirill A . Shutemov, Yang Shi, David Hildenbrand, Hugh Dickins,
	Jiaqi Yan, linux-kernel

On Tue, 7 Mar 2023, David Stevens wrote:

> From: David Stevens <stevensd@chromium.org>
> 
> Add a rollback label to deal with failure, instead of repeatedly
> checking for SCAN_SUCCEED, to make it easier to add more failure
> cases. The refactoring also allows the collapse_file tracepoint to
> include hpage on success (instead of NULL).
> 
> Signed-off-by: David Stevens <stevensd@chromium.org>
> Acked-by: Peter Xu <peterx@redhat.com>
> Reviewed-by: Yang Shi <shy828301@gmail.com>

This one looks like a nice de-indentation to me,
even if it's not followed by 2/3 and 3/3.

Acked-by: Hugh Dickins <hughd@google.com>

> ---
>  mm/khugepaged.c | 219 ++++++++++++++++++++++++------------------------
>  1 file changed, 108 insertions(+), 111 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 3ea2aa55c2c5..b954e3c685e7 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1907,6 +1907,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  	if (result != SCAN_SUCCEED)
>  		goto out;
>  
> +	__SetPageLocked(hpage);
> +	if (is_shmem)
> +		__SetPageSwapBacked(hpage);
> +	hpage->index = start;
> +	hpage->mapping = mapping;
> +
>  	/*
>  	 * Ensure we have slots for all the pages in the range.  This is
>  	 * almost certainly a no-op because most of the pages must be present
> @@ -1919,16 +1925,10 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  		xas_unlock_irq(&xas);
>  		if (!xas_nomem(&xas, GFP_KERNEL)) {
>  			result = SCAN_FAIL;
> -			goto out;
> +			goto rollback;
>  		}
>  	} while (1);
>  
> -	__SetPageLocked(hpage);
> -	if (is_shmem)
> -		__SetPageSwapBacked(hpage);
> -	hpage->index = start;
> -	hpage->mapping = mapping;
> -
>  	/*
>  	 * At this point the hpage is locked and not up-to-date.
>  	 * It's safe to insert it into the page cache, because nobody would
> @@ -2145,129 +2145,126 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>  	 */
>  	try_to_unmap_flush();
>  
> -	if (result == SCAN_SUCCEED) {
> -		/*
> -		 * Replacing old pages with new one has succeeded, now we
> -		 * attempt to copy the contents.
> -		 */
> -		index = start;
> -		list_for_each_entry(page, &pagelist, lru) {
> -			while (index < page->index) {
> -				clear_highpage(hpage + (index % HPAGE_PMD_NR));
> -				index++;
> -			}
> -			if (copy_mc_highpage(hpage + (page->index % HPAGE_PMD_NR),
> -					     page) > 0) {
> -				result = SCAN_COPY_MC;
> -				break;
> -			}
> -			index++;
> -		}
> -		while (result == SCAN_SUCCEED && index < end) {
> +	if (result != SCAN_SUCCEED)
> +		goto rollback;
> +
> +	/*
> +	 * Replacing old pages with new one has succeeded, now we
> +	 * attempt to copy the contents.
> +	 */
> +	index = start;
> +	list_for_each_entry(page, &pagelist, lru) {
> +		while (index < page->index) {
>  			clear_highpage(hpage + (index % HPAGE_PMD_NR));
>  			index++;
>  		}
> +		if (copy_mc_highpage(hpage + (page->index % HPAGE_PMD_NR), page) > 0) {
> +			result = SCAN_COPY_MC;
> +			goto rollback;
> +		}
> +		index++;
> +	}
> +	while (index < end) {
> +		clear_highpage(hpage + (index % HPAGE_PMD_NR));
> +		index++;
>  	}
>  
> -	if (result == SCAN_SUCCEED) {
> -		/*
> -		 * Copying old pages to huge one has succeeded, now we
> -		 * need to free the old pages.
> -		 */
> -		list_for_each_entry_safe(page, tmp, &pagelist, lru) {
> -			list_del(&page->lru);
> -			page->mapping = NULL;
> -			page_ref_unfreeze(page, 1);
> -			ClearPageActive(page);
> -			ClearPageUnevictable(page);
> -			unlock_page(page);
> -			put_page(page);
> -		}
> +	/*
> +	 * Copying old pages to huge one has succeeded, now we
> +	 * need to free the old pages.
> +	 */
> +	list_for_each_entry_safe(page, tmp, &pagelist, lru) {
> +		list_del(&page->lru);
> +		page->mapping = NULL;
> +		page_ref_unfreeze(page, 1);
> +		ClearPageActive(page);
> +		ClearPageUnevictable(page);
> +		unlock_page(page);
> +		put_page(page);
> +	}
>  
> -		xas_lock_irq(&xas);
> -		if (is_shmem)
> -			__mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
> -		else
> -			__mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
> +	xas_lock_irq(&xas);
> +	if (is_shmem)
> +		__mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
> +	else
> +		__mod_lruvec_page_state(hpage, NR_FILE_THPS, nr);
>  
> -		if (nr_none) {
> -			__mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
> -			/* nr_none is always 0 for non-shmem. */
> -			__mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
> -		}
> -		/* Join all the small entries into a single multi-index entry. */
> -		xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> -		xas_store(&xas, hpage);
> -		xas_unlock_irq(&xas);
> +	if (nr_none) {
> +		__mod_lruvec_page_state(hpage, NR_FILE_PAGES, nr_none);
> +		/* nr_none is always 0 for non-shmem. */
> +		__mod_lruvec_page_state(hpage, NR_SHMEM, nr_none);
> +	}
> +	/* Join all the small entries into a single multi-index entry. */
> +	xas_set_order(&xas, start, HPAGE_PMD_ORDER);
> +	xas_store(&xas, hpage);
> +	xas_unlock_irq(&xas);
>  
> -		folio = page_folio(hpage);
> -		folio_mark_uptodate(folio);
> -		folio_ref_add(folio, HPAGE_PMD_NR - 1);
> +	folio = page_folio(hpage);
> +	folio_mark_uptodate(folio);
> +	folio_ref_add(folio, HPAGE_PMD_NR - 1);
>  
> -		if (is_shmem)
> -			folio_mark_dirty(folio);
> -		folio_add_lru(folio);
> +	if (is_shmem)
> +		folio_mark_dirty(folio);
> +	folio_add_lru(folio);
>  
> -		/*
> -		 * Remove pte page tables, so we can re-fault the page as huge.
> -		 */
> -		result = retract_page_tables(mapping, start, mm, addr, hpage,
> -					     cc);
> -		unlock_page(hpage);
> -		hpage = NULL;
> -	} else {
> -		/* Something went wrong: roll back page cache changes */
> -		xas_lock_irq(&xas);
> -		if (nr_none) {
> -			mapping->nrpages -= nr_none;
> -			shmem_uncharge(mapping->host, nr_none);
> +	/*
> +	 * Remove pte page tables, so we can re-fault the page as huge.
> +	 */
> +	result = retract_page_tables(mapping, start, mm, addr, hpage,
> +				     cc);
> +	unlock_page(hpage);
> +	goto out;
> +
> +rollback:
> +	/* Something went wrong: roll back page cache changes */
> +	xas_lock_irq(&xas);
> +	if (nr_none) {
> +		mapping->nrpages -= nr_none;
> +		shmem_uncharge(mapping->host, nr_none);
> +	}
> +
> +	xas_set(&xas, start);
> +	xas_for_each(&xas, page, end - 1) {
> +		page = list_first_entry_or_null(&pagelist,
> +				struct page, lru);
> +		if (!page || xas.xa_index < page->index) {
> +			if (!nr_none)
> +				break;
> +			nr_none--;
> +			/* Put holes back where they were */
> +			xas_store(&xas, NULL);
> +			continue;
>  		}
>  
> -		xas_set(&xas, start);
> -		xas_for_each(&xas, page, end - 1) {
> -			page = list_first_entry_or_null(&pagelist,
> -					struct page, lru);
> -			if (!page || xas.xa_index < page->index) {
> -				if (!nr_none)
> -					break;
> -				nr_none--;
> -				/* Put holes back where they were */
> -				xas_store(&xas, NULL);
> -				continue;
> -			}
> +		VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
>  
> -			VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> +		/* Unfreeze the page. */
> +		list_del(&page->lru);
> +		page_ref_unfreeze(page, 2);
> +		xas_store(&xas, page);
> +		xas_pause(&xas);
> +		xas_unlock_irq(&xas);
> +		unlock_page(page);
> +		putback_lru_page(page);
> +		xas_lock_irq(&xas);
> +	}
> +	VM_BUG_ON(nr_none);
> +	/*
> +	 * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only.
> +	 * This undo is not needed unless failure is due to SCAN_COPY_MC.
> +	 */
> +	if (!is_shmem && result == SCAN_COPY_MC)
> +		filemap_nr_thps_dec(mapping);
>  
> -			/* Unfreeze the page. */
> -			list_del(&page->lru);
> -			page_ref_unfreeze(page, 2);
> -			xas_store(&xas, page);
> -			xas_pause(&xas);
> -			xas_unlock_irq(&xas);
> -			unlock_page(page);
> -			putback_lru_page(page);
> -			xas_lock_irq(&xas);
> -		}
> -		VM_BUG_ON(nr_none);
> -		/*
> -		 * Undo the updates of filemap_nr_thps_inc for non-SHMEM file only.
> -		 * This undo is not needed unless failure is due to SCAN_COPY_MC.
> -		 */
> -		if (!is_shmem && result == SCAN_COPY_MC)
> -			filemap_nr_thps_dec(mapping);
> +	xas_unlock_irq(&xas);
>  
> -		xas_unlock_irq(&xas);
> +	hpage->mapping = NULL;
>  
> -		hpage->mapping = NULL;
> -	}
> +	unlock_page(hpage);
> +	put_page(hpage);
>  
> -	if (hpage)
> -		unlock_page(hpage);
>  out:
>  	VM_BUG_ON(!list_empty(&pagelist));
> -	if (hpage)
> -		put_page(hpage);
> -
>  	trace_mm_khugepaged_collapse_file(mm, hpage, index, is_shmem, addr, file, nr, result);
>  	return result;
>  }
> -- 
> 2.40.0.rc0.216.gc4246ad0f0-goog
> 
> 


* Re: [PATCH v5 3/3] mm/khugepaged: maintain page cache uptodate flag
  2023-03-23 19:07   ` Hugh Dickins
@ 2023-03-23 21:56     ` Matthew Wilcox
  2023-03-23 22:28       ` Song Liu
  2023-03-28  9:48     ` David Stevens
  1 sibling, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2023-03-23 21:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Song Liu, Jiri Olsa, David Stevens, linux-mm, Andrew Morton,
	Peter Xu, Kirill A . Shutemov, Yang Shi, David Hildenbrand,
	Jiaqi Yan, linux-kernel

On Thu, Mar 23, 2023 at 12:07:46PM -0700, Hugh Dickins wrote:
> On an earlier audit, for different reasons, I did also run across
> lib/buildid.c build_id_parse() using find_get_page() without checking
> PageUptodate() - looks as if it might do the wrong thing if it races
> with khugepaged collapsing text to huge, and should probably have a
> similar fix.

That shouldn't be using find_get_page().  It should probably use
read_cache_folio() which will actually read in the data if it's not
present in the page cache, and return an ERR_PTR if the data couldn't
be read.


* Re: [PATCH v5 3/3] mm/khugepaged: maintain page cache uptodate flag
  2023-03-23 21:56     ` Matthew Wilcox
@ 2023-03-23 22:28       ` Song Liu
  2023-03-24  1:56         ` Hugh Dickins
  0 siblings, 1 reply; 17+ messages in thread
From: Song Liu @ 2023-03-23 22:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Song Liu, Jiri Olsa, David Stevens, linux-mm,
	Andrew Morton, Peter Xu, Kirill A . Shutemov, Yang Shi,
	David Hildenbrand, Jiaqi Yan, linux-kernel

On Thu, Mar 23, 2023 at 2:56 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Mar 23, 2023 at 12:07:46PM -0700, Hugh Dickins wrote:
> > On an earlier audit, for different reasons, I did also run across
> > lib/buildid.c build_id_parse() using find_get_page() without checking
> > PageUptodate() - looks as if it might do the wrong thing if it races
> > with khugepaged collapsing text to huge, and should probably have a
> > similar fix.
>
> That shouldn't be using find_get_page().  It should probably use
> read_cache_folio() which will actually read in the data if it's not
> present in the page cache, and return an ERR_PTR if the data couldn't
> be read.

build_id_parse() can be called from NMI, so I don't think we can let
read_cache_folio() read in the data.

Thanks,
Song


* Re: [PATCH v5 3/3] mm/khugepaged: maintain page cache uptodate flag
  2023-03-23 22:28       ` Song Liu
@ 2023-03-24  1:56         ` Hugh Dickins
  2023-03-24  3:30           ` Matthew Wilcox
  0 siblings, 1 reply; 17+ messages in thread
From: Hugh Dickins @ 2023-03-24  1:56 UTC (permalink / raw)
  To: Song Liu
  Cc: Matthew Wilcox, Hugh Dickins, Song Liu, Jiri Olsa, David Stevens,
	linux-mm, Andrew Morton, Peter Xu, Kirill A . Shutemov, Yang Shi,
	David Hildenbrand, Jiaqi Yan, linux-kernel


On Thu, 23 Mar 2023, Song Liu wrote:
> On Thu, Mar 23, 2023 at 2:56 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Mar 23, 2023 at 12:07:46PM -0700, Hugh Dickins wrote:
> > > On an earlier audit, for different reasons, I did also run across
> > > lib/buildid.c build_id_parse() using find_get_page() without checking
> > > PageUptodate() - looks as if it might do the wrong thing if it races
> > > with khugepaged collapsing text to huge, and should probably have a
> > > similar fix.
> >
> > That shouldn't be using find_get_page().  It should probably use
> > read_cache_folio() which will actually read in the data if it's not
> > present in the page cache, and return an ERR_PTR if the data couldn't
> > be read.
> 
> build_id_parse() can be called from NMI, so I don't think we can let
> read_cache_folio() read-in the data.

Interesting.

This being the same Layering_Violation_ID which is asking for a home in
everyone's struct file?  (Okay, I'm being disagreeable, no need to answer!)

I think even the current find_get_page() is unsafe from NMI: imagine an
NMI interrupting a sequence (maybe THP collapse or splitting, maybe page
migration, maybe others) when the page refcount has been frozen to 0:
you'll just have to reboot the machine?

I guess the RCU-safety of find_get_page() implies that its XArray basics
are safe in NMI; but you need a low-level variant (xas_find()?) which
does none of the "goto retry"s, and fails immediately if anything is
wrong - including !PageUptodate.

Hugh


* Re: [PATCH v5 3/3] mm/khugepaged: maintain page cache uptodate flag
  2023-03-24  1:56         ` Hugh Dickins
@ 2023-03-24  3:30           ` Matthew Wilcox
  2023-03-24  6:03             ` Song Liu
  0 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2023-03-24  3:30 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Song Liu, Song Liu, Jiri Olsa, David Stevens, linux-mm,
	Andrew Morton, Peter Xu, Kirill A . Shutemov, Yang Shi,
	David Hildenbrand, Jiaqi Yan, linux-kernel

On Thu, Mar 23, 2023 at 06:56:34PM -0700, Hugh Dickins wrote:
> On Thu, 23 Mar 2023, Song Liu wrote:
> > On Thu, Mar 23, 2023 at 2:56 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 12:07:46PM -0700, Hugh Dickins wrote:
> > > > On an earlier audit, for different reasons, I did also run across
> > > > lib/buildid.c build_id_parse() using find_get_page() without checking
> > > > PageUptodate() - looks as if it might do the wrong thing if it races
> > > > with khugepaged collapsing text to huge, and should probably have a
> > > > similar fix.
> > >
> > > That shouldn't be using find_get_page().  It should probably use
> > > read_cache_folio() which will actually read in the data if it's not
> > > present in the page cache, and return an ERR_PTR if the data couldn't
> > > be read.
> > 
> > build_id_parse() can be called from NMI, so I don't think we can let
> > read_cache_folio() read-in the data.
> 
> Interesting.
> 
> This being the same Layering_Violation_ID which is asking for a home in
> everyone's struct file?  (Okay, I'm being disagreeable, no need to answer!)

Yes, that's the one.  Part of the BPF "splatter stuff all across the
kernel that we don't really understand" (see, I can be just as
disagreeable!)

> I think even the current find_get_page() is unsafe from NMI: imagine that
> NMI interrupting a sequence (maybe THP collapse or splitting, maybe page
> migration, maybe others) when the page refcount has been frozen to 0:
> you'll just have to reboot the machine?

Correct; if it's been frozen by this CPU, we'll spin forever.  I think
the other conditions where we retry are temporary and caused by
something _another_ CPU is doing.  For example, if _this_ CPU is in the
middle of modifying the tree when an NMI happens, we won't hit the
xas_retry() case.  That can only happen if we've observed something
another CPU did, and then a second write happened from that same other
CPU.  The easiest example of that would be that we hit this kind of
race:

CPU 0				CPU 1
load root of tree, points to node
				store entry in root of tree
				wmb();
				store RETRY entry in node
load entry from node, observe RETRY entry

The retry will happen and we'll observe the new state of the tree with
the entry we're looking for in the root.

If CPU 1 takes an NMI and interrupts itself, it will always see a
consistent tree.

> I guess the RCU-safety of find_get_page() implies that its XArray basics
> are safe in NMI; but you need a low-level variant (xas_find()?) which
> does none of the "goto retry"s, and fails immediately if anything is
> wrong - including !PageUptodate.

The Uptodate flag check needs to be done by the caller; the
find_get_page() family return !uptodate pages.

But find_get_page() does not advertise itself as NMI-safe.  And I
think it's wrong to try to make it NMI-safe.  Most of the kernel is
not NMI-safe.  I think it's incumbent on the BPF people to get the
information they need ahead of taking the NMI.  NMI handlers are not
supposed to be doing a huge amount of work!  I don't really understand
why it needs to do work in NMI context; surely it can note the location of
the fault and queue work to be done later (e.g. on irq-enable, task-switch
or return-to-user).
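
As an illustration of that caller-side check, a hypothetical helper (this is
not the current lib/buildid.c code; the function name and error handling are
made up):

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Hypothetical helper: look up the first page of a mapped file and only
 * use it if it is uptodate, treating anything else (e.g. a page that is
 * mid-collapse, or a fallocated-but-unwritten hole) as a failed lookup.
 */
static struct page *get_uptodate_first_page(struct vm_area_struct *vma)
{
        struct page *page;

        if (!vma->vm_file)
                return NULL;
        page = find_get_page(vma->vm_file->f_mapping, 0);
        if (!page)
                return NULL;
        if (!PageUptodate(page)) {
                put_page(page);
                return NULL;
        }
        return page;
}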


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 2/3] mm/khugepaged: skip shmem with userfaultfd
  2023-03-23 19:48   ` Hugh Dickins
@ 2023-03-24  5:34     ` David Stevens
  2023-03-28 15:48       ` Peter Xu
  0 siblings, 1 reply; 17+ messages in thread
From: David Stevens @ 2023-03-24  5:34 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-mm, Andrew Morton, Peter Xu, Matthew Wilcox,
	Kirill A . Shutemov, Yang Shi, David Hildenbrand, Jiaqi Yan,
	Christoph Hellwig, linux-kernel

On Fri, Mar 24, 2023 at 4:48 AM Hugh Dickins <hughd@google.com> wrote:
>
> On Tue, 7 Mar 2023, David Stevens wrote:
>
> > From: David Stevens <stevensd@chromium.org>
> >
> > Make sure that collapse_file respects any userfaultfds registered with
> > MODE_MISSING. If userspace has any such userfaultfds registered, then
> > for any page which it knows to be missing, it may expect a
> > UFFD_EVENT_PAGEFAULT. This means collapse_file needs to be careful when
> > collapsing a shmem range would result in replacing an empty page with a
> > THP, to avoid breaking userfaultfd.
> >
> > Synchronization when checking for userfaultfds in collapse_file is
> > tricky because the mmap locks can't be used to prevent races with the
> > registration of new userfaultfds. Instead, we provide synchronization by
> > ensuring that userspace cannot observe the fact that pages are missing
> > before we check for userfaultfds. Although this allows registration of a
> > userfaultfd to race with collapse_file, it ensures that userspace cannot
> > observe any pages transition from missing to present after such a race
> > occurs. This makes such a race indistinguishable to the collapse
> > occurring immediately before the userfaultfd registration.
> >
> > The first step to provide this synchronization is to stop filling gaps
> > during the loop iterating over the target range, since the page cache
> > lock can be dropped during that loop. The second step is to fill the
> > gaps with XA_RETRY_ENTRY after the page cache lock is acquired the final
> > time, to avoid races with accesses to the page cache that only take the
> > RCU read lock.
> >
> > The fact that we don't fill holes during the initial iteration means
> > that collapse_file now has to handle faults occurring during the
> > collapse. This is done by re-validating the number of missing pages
> > after acquiring the page cache lock for the final time.
> >
> > This fix is targeted at khugepaged, but the change also applies to
> > MADV_COLLAPSE. MADV_COLLAPSE on a range with a userfaultfd will now
> > return EBUSY if there are any missing pages (instead of succeeding on
> > shmem and returning EINVAL on anonymous memory). There is also now a
> > window during MADV_COLLAPSE where a fault on a missing page will cause
> > the syscall to fail with EAGAIN.
> >
> > The fact that intermediate page cache state can no longer be observed
> > before the rollback of a failed collapse is also technically a
> > userspace-visible change (via at least SEEK_DATA and SEEK_END), but it
> > is exceedingly unlikely that anything relies on being able to observe
> > that transient state.
> >
> > Signed-off-by: David Stevens <stevensd@chromium.org>
> > Acked-by: Peter Xu <peterx@redhat.com>
>
> This patch looks to me like a lot of unnecessary (and not very pretty)
> change, with surprising use of XA_RETRY_ENTRY outside of xarray internals.
>
> I think you probably worked on this one, knowing what you intended in 3/3
> to follow.  But if 3/3 as such does not follow, why can't this one just
> leave collapse_file()'s code flow unchanged, but insert the
> > +             vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
> > +                     if (userfaultfd_missing(vma)) {
> you need somewhere just before copying and clearing the pages data?
> Limited to CONFIG_USERFAULTFD=y and shmem and nr_none to minimize impact.
> The locked !uptodate hpage in the xarray keeping others at bay as before.

The complication here is that there are places that examine the page
cache without actually examining the pages in the page cache, and thus
don't lock the pages they see. At the very least, folio_seek_hole_data
and shmem_mfill_atomic_pte do so today. I should have included this in
the commit message.

I suppose we could require that anything that might examine shmem page
cache must lock any pages it sees regardless of whether or not it
actually needs to access the pages, and then update the two functions
I referenced plus any other similar functions I've missed. I went with
the approach I did because I felt that localized complexity in
collapse_file that minimizes intermediate state visible throughout the
mm subsystem would be preferable. However, if that isn't actually
preferable, I can take the approach you suggested. Especially now that
we probably want to fix folio_seek_hole_data anyway.

> Hmm, looking at that extract, is "start, start" correct?

That's definitely a bug, good catch.

> And everywhere
> else in mm/khugepaged.c is checking userfaultfd_armed(vma): I imagine
> one or the other is more correct, but I'm not at all a userfaultfder.

IIUC, the check for userfaultfd_wp in retract_page_tables is
sufficient for uffd-wp in the shmem case. Simply collapsing a range in
the underlying shmem file shouldn't affect the semantics of uffd-wp. I
could be missing something, though, so Peter would probably be the one
to know for sure. As for uffd-minor, its interactions with khugepaged
are currently hopelessly undefined.

-David

> And now I think that I was mistaken to Ack hch's "shmem: open code
> the page cache lookup in shmem_get_folio_gfp" a few days ago: its
> userfault_missing() case needs folio_lock() to be taken after
> filemap_get_entry(), in order to exclude the race with collapse_file()
> as it used to do.  I think.  He and I will sort that out in due course,
> maybe mm/shmem.c wants its own replacement for what he's removing.
>
> Hugh
>
> > ---
> >  include/trace/events/huge_memory.h |  3 +-
> >  mm/khugepaged.c                    | 92 +++++++++++++++++++++++-------
> >  2 files changed, 73 insertions(+), 22 deletions(-)
> >
> > diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> > index 46cce509957b..7ee85fff89a3 100644
> > --- a/include/trace/events/huge_memory.h
> > +++ b/include/trace/events/huge_memory.h
> > @@ -37,7 +37,8 @@
> >       EM( SCAN_CGROUP_CHARGE_FAIL,    "ccgroup_charge_failed")        \
> >       EM( SCAN_TRUNCATED,             "truncated")                    \
> >       EM( SCAN_PAGE_HAS_PRIVATE,      "page_has_private")             \
> > -     EMe(SCAN_COPY_MC,               "copy_poisoned_page")           \
> > +     EM( SCAN_COPY_MC,               "copy_poisoned_page")           \
> > +     EMe(SCAN_PAGE_FILLED,           "page_filled")                  \
> >
> >  #undef EM
> >  #undef EMe
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index b954e3c685e7..51ae399f2035 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -57,6 +57,7 @@ enum scan_result {
> >       SCAN_TRUNCATED,
> >       SCAN_PAGE_HAS_PRIVATE,
> >       SCAN_COPY_MC,
> > +     SCAN_PAGE_FILLED,
> >  };
> >
> >  #define CREATE_TRACE_POINTS
> > @@ -1873,8 +1874,8 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> >   *  - allocate and lock a new huge page;
> >   *  - scan page cache replacing old pages with the new one
> >   *    + swap/gup in pages if necessary;
> > - *    + fill in gaps;
> >   *    + keep old pages around in case rollback is required;
> > + *  - finalize updates to the page cache;
> >   *  - if replacing succeeds:
> >   *    + copy data over;
> >   *    + free old pages;
> > @@ -1952,13 +1953,12 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >                                               result = SCAN_TRUNCATED;
> >                                               goto xa_locked;
> >                                       }
> > -                                     xas_set(&xas, index);
> > +                                     xas_set(&xas, index + 1);
> >                               }
> >                               if (!shmem_charge(mapping->host, 1)) {
> >                                       result = SCAN_FAIL;
> >                                       goto xa_locked;
> >                               }
> > -                             xas_store(&xas, hpage);
> >                               nr_none++;
> >                               continue;
> >                       }
> > @@ -2169,21 +2169,57 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >               index++;
> >       }
> >
> > -     /*
> > -      * Copying old pages to huge one has succeeded, now we
> > -      * need to free the old pages.
> > -      */
> > -     list_for_each_entry_safe(page, tmp, &pagelist, lru) {
> > -             list_del(&page->lru);
> > -             page->mapping = NULL;
> > -             page_ref_unfreeze(page, 1);
> > -             ClearPageActive(page);
> > -             ClearPageUnevictable(page);
> > -             unlock_page(page);
> > -             put_page(page);
> > +     if (nr_none) {
> > +             struct vm_area_struct *vma;
> > +             int nr_none_check = 0;
> > +
> > +             i_mmap_lock_read(mapping);
> > +             xas_lock_irq(&xas);
> > +
> > +             xas_set(&xas, start);
> > +             for (index = start; index < end; index++) {
> > +                     if (!xas_next(&xas)) {
> > +                             xas_store(&xas, XA_RETRY_ENTRY);
> > +                             nr_none_check++;
> > +                     }
> > +             }
> > +
> > +             if (nr_none != nr_none_check) {
> > +                     result = SCAN_PAGE_FILLED;
> > +                     goto immap_locked;
> > +             }
> > +
> > +             /*
> > +              * If userspace observed a missing page in a VMA with an armed
> > +              * userfaultfd, then it might expect a UFFD_EVENT_PAGEFAULT for
> > +              * that page, so we need to roll back to avoid suppressing such
> > +              * an event. Any userfaultfds armed after this point will not be
> > +              * able to observe any missing pages due to the previously
> > +              * inserted retry entries.
> > +              */
> > +             vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
> > +                     if (userfaultfd_missing(vma)) {
> > +                             result = SCAN_EXCEED_NONE_PTE;
> > +                             goto immap_locked;
> > +                     }
> > +             }
> > +
> > +immap_locked:
> > +             i_mmap_unlock_read(mapping);
> > +             if (result != SCAN_SUCCEED) {
> > +                     xas_set(&xas, start);
> > +                     for (index = start; index < end; index++) {
> > +                             if (xas_next(&xas) == XA_RETRY_ENTRY)
> > +                                     xas_store(&xas, NULL);
> > +                     }
> > +
> > +                     xas_unlock_irq(&xas);
> > +                     goto rollback;
> > +             }
> > +     } else {
> > +             xas_lock_irq(&xas);
> >       }
> >
> > -     xas_lock_irq(&xas);
> >       if (is_shmem)
> >               __mod_lruvec_page_state(hpage, NR_SHMEM_THPS, nr);
> >       else
> > @@ -2213,6 +2249,20 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >       result = retract_page_tables(mapping, start, mm, addr, hpage,
> >                                    cc);
> >       unlock_page(hpage);
> > +
> > +     /*
> > +      * The collapse has succeeded, so free the old pages.
> > +      */
> > +     list_for_each_entry_safe(page, tmp, &pagelist, lru) {
> > +             list_del(&page->lru);
> > +             page->mapping = NULL;
> > +             page_ref_unfreeze(page, 1);
> > +             ClearPageActive(page);
> > +             ClearPageUnevictable(page);
> > +             unlock_page(page);
> > +             put_page(page);
> > +     }
> > +
> >       goto out;
> >
> >  rollback:
> > @@ -2224,15 +2274,13 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
> >       }
> >
> >       xas_set(&xas, start);
> > -     xas_for_each(&xas, page, end - 1) {
> > +     end = index;
> > +     for (index = start; index < end; index++) {
> > +             xas_next(&xas);
> >               page = list_first_entry_or_null(&pagelist,
> >                               struct page, lru);
> >               if (!page || xas.xa_index < page->index) {
> > -                     if (!nr_none)
> > -                             break;
> >                       nr_none--;
> > -                     /* Put holes back where they were */
> > -                     xas_store(&xas, NULL);
> >                       continue;
> >               }
> >
> > @@ -2750,12 +2798,14 @@ static int madvise_collapse_errno(enum scan_result r)
> >       case SCAN_ALLOC_HUGE_PAGE_FAIL:
> >               return -ENOMEM;
> >       case SCAN_CGROUP_CHARGE_FAIL:
> > +     case SCAN_EXCEED_NONE_PTE:
> >               return -EBUSY;
> >       /* Resource temporary unavailable - trying again might succeed */
> >       case SCAN_PAGE_COUNT:
> >       case SCAN_PAGE_LOCK:
> >       case SCAN_PAGE_LRU:
> >       case SCAN_DEL_PAGE_LRU:
> > +     case SCAN_PAGE_FILLED:
> >               return -EAGAIN;
> >       /*
> >        * Other: Trying again likely not to succeed / error intrinsic to
> > --
> > 2.40.0.rc0.216.gc4246ad0f0-goog

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 3/3] mm/khugepaged: maintain page cache uptodate flag
  2023-03-24  3:30           ` Matthew Wilcox
@ 2023-03-24  6:03             ` Song Liu
  2023-03-24 13:31               ` Matthew Wilcox
  0 siblings, 1 reply; 17+ messages in thread
From: Song Liu @ 2023-03-24  6:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Hugh Dickins, Song Liu, Jiri Olsa, David Stevens, Linux-MM,
	Andrew Morton, Peter Xu, Kirill A . Shutemov, Yang Shi,
	David Hildenbrand, Jiaqi Yan, linux-kernel



> On Mar 23, 2023, at 8:30 PM, Matthew Wilcox <willy@infradead.org> wrote:

[...]

> 
> The Uptodate flag check needs to be done by the caller; the
> find_get_page() family return !uptodate pages.
> 
> But find_get_page() does not advertise itself as NMI-safe.  And I
> think it's wrong to try to make it NMI-safe.  Most of the kernel is
> not NMI-safe.  I think it's incumbent on the BPF people to get the
> information they need ahead of taking the NMI.  NMI handlers are not
> supposed to be doing a huge amount of work!  I don't really understand
> why it needs to do work in NMI context; surely it can note the location of
> the fault and queue work to be done later (eg on irq-enable, task-switch
> or return-to-user)

The use case here is a profiler (similar to perf-record). Parsing the
build id inside the NMI makes the profiler a lot simpler. Otherwise,
we will need some post-processing for each sample.

OTOH, it is totally fine if build_id_parse() fails some of the time, say < 5%.
The profiler output is still useful in such cases.

I guess the next step is to replace find_get_page() with an NMI-safe
version?

Thanks,
Song


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 3/3] mm/khugepaged: maintain page cache uptodate flag
  2023-03-24  6:03             ` Song Liu
@ 2023-03-24 13:31               ` Matthew Wilcox
  2023-03-29 16:53                 ` Song Liu
  0 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2023-03-24 13:31 UTC (permalink / raw)
  To: Song Liu
  Cc: Hugh Dickins, Song Liu, Jiri Olsa, David Stevens, Linux-MM,
	Andrew Morton, Peter Xu, Kirill A . Shutemov, Yang Shi,
	David Hildenbrand, Jiaqi Yan, linux-kernel

On Fri, Mar 24, 2023 at 06:03:37AM +0000, Song Liu wrote:
> 
> 
> > On Mar 23, 2023, at 8:30 PM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> [...]
> 
> > 
> > The Uptodate flag check needs to be done by the caller; the
> > find_get_page() family return !uptodate pages.
> > 
> > But find_get_page() does not advertise itself as NMI-safe.  And I
> > think it's wrong to try to make it NMI-safe.  Most of the kernel is
> > not NMI-safe.  I think it's incumbent on the BPF people to get the
> > information they need ahead of taking the NMI.  NMI handlers are not
> > supposed to be doing a huge amount of work!  I don't really understand
> > why it needs to do work in NMI context; surely it can note the location of
> > the fault and queue work to be done later (eg on irq-enable, task-switch
> > or return-to-user)
> 
> The use case here is a profiler (similar to perf-record). Parsing the 
> build id in side the NMI makes the profiler a lot simpler. Otherwise, 
> we will need some post processing for each sample. 

Simpler for you, maybe.  But this is an NMI!  It's not supposed to
be doing printf-formatting or whatever, much less poking around
in the file cache.  Like perf, it should record a sample and then
convert that later.  Maybe it can defer to a tasklet, but I think
scheduling work is a better option.

> OTOH, it is totally fine if build_id_parse() fails some time, say < 5%. 
> The profiler output is still useful in such cases. 
> 
> I guess the next step is to replace find_get_page() with a NMI-safe
> version?

No, absolutely not.  Stop doing so much work in an NMI.
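
A minimal sketch of the kind of deferral being suggested here (hypothetical
code, not existing perf or BPF infrastructure; the structure and function
names are made up): the NMI handler only records the sampled ip and queues
an irq_work, which is NMI-safe; the handler then runs in hard-irq context,
where it could punt further (e.g. via schedule_work()) if the real lookup
needs to sleep.

#include <linux/irq_work.h>
#include <linux/percpu.h>

struct deferred_buildid {
        struct irq_work work;
        unsigned long   ip;     /* sampled instruction pointer */
};

static void deferred_buildid_fn(struct irq_work *work)
{
        struct deferred_buildid *d =
                container_of(work, struct deferred_buildid, work);

        /* resolve d->ip to a file/offset and parse the build id here,
         * outside of NMI context; punt again if sleeping is required */
}

static DEFINE_PER_CPU(struct deferred_buildid, buildid_req) = {
        .work = IRQ_WORK_INIT(deferred_buildid_fn),
};

static void record_sample_in_nmi(unsigned long ip)
{
        struct deferred_buildid *d = this_cpu_ptr(&buildid_req);

        d->ip = ip;
        irq_work_queue(&d->work);       /* safe to call from NMI */
}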

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 3/3] mm/khugepaged: maintain page cache uptodate flag
  2023-03-23 19:07   ` Hugh Dickins
  2023-03-23 21:56     ` Matthew Wilcox
@ 2023-03-28  9:48     ` David Stevens
  1 sibling, 0 replies; 17+ messages in thread
From: David Stevens @ 2023-03-28  9:48 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: linux-mm, Andrew Morton, Peter Xu, Matthew Wilcox,
	Kirill A . Shutemov, Yang Shi, David Hildenbrand, Jiaqi Yan,
	linux-kernel

On Fri, Mar 24, 2023 at 4:08 AM Hugh Dickins <hughd@google.com> wrote:
>
> On Tue, 7 Mar 2023, David Stevens wrote:
>
> > From: David Stevens <stevensd@chromium.org>
> >
> > Make sure that collapse_file doesn't interfere with checking the
> > uptodate flag in the page cache by only inserting hpage into the page
> > cache after it has been updated and marked uptodate. This is achieved by
> > simply not replacing present pages with hpage when iterating over the
> > target range. The present pages are already locked, so replacing the
> > with the locked hpage before the collapse is finalized is unnecessary.
> >
> > This fixes a race where folio_seek_hole_data would mistake hpage for
> > an fallocated but unwritten page. This race is visible to userspace via
> > data temporarily disappearing from SEEK_DATA/SEEK_HOLE.
> >
> > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > Signed-off-by: David Stevens <stevensd@chromium.org>
> > Acked-by: Peter Xu <peterx@redhat.com>
>
> NAK to this patch, I'm afraid: it deadlocks.
>
> What I know it to deadlock against, does not make the most persuasive
> argument: cgroup v1 deprecated memcg moving, where mc_handle_file_pte()
> uses filemap_get_incore_folio() while holding page table lock, and spins
> around doing "goto repeat" in filemap_get_entry() while folio_try_get_rcu()
> keeps failing because collapse_file()'s old page has been left in the
> xarray with its refcount frozen to 0.  Meanwhile, collapse_file() is
> spinning to get that page table lock, to unmap pte of a later page.
>
> mincore()'s use of filemap_get_incore_folio() would be liable to hit
> the same deadlock.  If we think for longer, we may find more examples.
> But even when not actually deadlocking, it's wasting lots of CPU on
> concurrent lookups (e.g. faults) spinning in filemap_get_entry().

Ignoring my changes for now, these callers of filemap_get_incore_folio
seem broken to some degree with respect to khugepaged. Mincore can
show mlocked pages spuriously disappearing - this is pretty easy to
reproduce with concurrent calls to MADV_COLLAPSE and mincore. As for
the memcg code, I'm not sure how precise it is expected to be, but it
seems likely that khugepaged can make task migration accounting less
reliable (although I don't really understand the code).

> I don't suppose it's entirely accurate, but think of keeping a folio
> refcount frozen to 0 as like holding a spinlock (and this lock sadly out
> of sight from lockdep).  The pre-existing code works because the old page
> with refcount frozen to 0 is immediately replaced in the xarray by an
> entry for the new hpage, so the old page is no longer discoverable:
> and the new hpage is locked, not with a spinlock but the usual
> folio/page lock, on which concurrent lookups will sleep.

Is it actually necessary to freeze the original pages? At least at a
surface level, it seems that the arguments in 87c460a0bded
("mm/khugepaged: collapse_shmem() without freezing new_page") would
apply to the original pages as well. And if it is actually necessary
to freeze the original pages, why is it not necessary to freeze the
hugepage for the rollback case? Rolling back hugepage->original pages
seems more-or-less symmetric to collapsing original pages->hugepage.

> Your discovery of the SEEK_DATA/SEEK_HOLE issue is important - thank
> you - but I believe collapse_file() should be left as is, and the fix
> made instead in mapping_seek_hole_data() or folio_seek_hole_data():
> I believe that should not jump to assume that a !uptodate folio is a
> hole (as was reasonable to assume for shmem, before collapsing to huge
> got added), but should lock the folio if !uptodate, and check again
> after getting the lock - if still !uptodate, it's a shmem hole, not
> a transient race with collapse_file().

This sounds like it would work for lseek. I guess it could maybe be
made to sort of work for mincore if we abort the walk, lock the page,
restart the walk, and then re-validate the locked page, although
that's not exactly efficient.

-David

> I was (pleased but) a little surprised when Matthew found in 5.12 that
> shmem_seek_hole_data() could be generalized to filemap_seek_hole_data():
> he will have a surer grasp of what's safe or unsafe to assume of
> !uptodate in non-shmem folios.
>
> On an earlier audit, for different reasons, I did also run across
> lib/buildid.c build_id_parse() using find_get_page() without checking
> PageUptodate() - looks as if it might do the wrong thing if it races
> with khugepaged collapsing text to huge, and should probably have a
> similar fix.
>
> Hugh
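
For illustration, a rough sketch of what that lock-and-recheck for the seek
path could look like.  This is a hypothetical helper, not the actual
mm/filemap.c code, and it glosses over the fact that the real seek walk runs
under the RCU read lock and would need to take a folio reference before it
can sleep on the folio lock:

#include <linux/pagemap.h>

/*
 * A value entry (swap) counts as data; a folio that is still !uptodate
 * after we hold its lock is a genuine shmem hole rather than a transient
 * state of a racing collapse_file().
 */
static bool shmem_entry_is_data(struct folio *folio)
{
        bool uptodate;

        if (xa_is_value(folio))
                return true;
        if (folio_test_uptodate(folio))
                return true;
        folio_lock(folio);
        uptodate = folio_test_uptodate(folio);
        folio_unlock(folio);
        return uptodate;
}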

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 2/3] mm/khugepaged: skip shmem with userfaultfd
  2023-03-24  5:34     ` David Stevens
@ 2023-03-28 15:48       ` Peter Xu
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2023-03-28 15:48 UTC (permalink / raw)
  To: David Stevens
  Cc: Hugh Dickins, linux-mm, Andrew Morton, Matthew Wilcox,
	Kirill A . Shutemov, Yang Shi, David Hildenbrand, Jiaqi Yan,
	Christoph Hellwig, linux-kernel

On Fri, Mar 24, 2023 at 02:34:07PM +0900, David Stevens wrote:
> On Fri, Mar 24, 2023 at 4:48 AM Hugh Dickins <hughd@google.com> wrote:
> >
> > On Tue, 7 Mar 2023, David Stevens wrote:
> >
> > > From: David Stevens <stevensd@chromium.org>
> > >
> > > Make sure that collapse_file respects any userfaultfds registered with
> > > MODE_MISSING. If userspace has any such userfaultfds registered, then
> > > for any page which it knows to be missing, it may expect a
> > > UFFD_EVENT_PAGEFAULT. This means collapse_file needs to be careful when
> > > collapsing a shmem range would result in replacing an empty page with a
> > > THP, to avoid breaking userfaultfd.
> > >
> > > Synchronization when checking for userfaultfds in collapse_file is
> > > tricky because the mmap locks can't be used to prevent races with the
> > > registration of new userfaultfds. Instead, we provide synchronization by
> > > ensuring that userspace cannot observe the fact that pages are missing
> > > before we check for userfaultfds. Although this allows registration of a
> > > userfaultfd to race with collapse_file, it ensures that userspace cannot
> > > observe any pages transition from missing to present after such a race
> > > occurs. This makes such a race indistinguishable to the collapse
> > > occurring immediately before the userfaultfd registration.
> > >
> > > The first step to provide this synchronization is to stop filling gaps
> > > during the loop iterating over the target range, since the page cache
> > > lock can be dropped during that loop. The second step is to fill the
> > > gaps with XA_RETRY_ENTRY after the page cache lock is acquired the final
> > > time, to avoid races with accesses to the page cache that only take the
> > > RCU read lock.
> > >
> > > The fact that we don't fill holes during the initial iteration means
> > > that collapse_file now has to handle faults occurring during the
> > > collapse. This is done by re-validating the number of missing pages
> > > after acquiring the page cache lock for the final time.
> > >
> > > This fix is targeted at khugepaged, but the change also applies to
> > > MADV_COLLAPSE. MADV_COLLAPSE on a range with a userfaultfd will now
> > > return EBUSY if there are any missing pages (instead of succeeding on
> > > shmem and returning EINVAL on anonymous memory). There is also now a
> > > window during MADV_COLLAPSE where a fault on a missing page will cause
> > > the syscall to fail with EAGAIN.
> > >
> > > The fact that intermediate page cache state can no longer be observed
> > > before the rollback of a failed collapse is also technically a
> > > userspace-visible change (via at least SEEK_DATA and SEEK_END), but it
> > > is exceedingly unlikely that anything relies on being able to observe
> > > that transient state.
> > >
> > > Signed-off-by: David Stevens <stevensd@chromium.org>
> > > Acked-by: Peter Xu <peterx@redhat.com>
> >
> > This patch looks to me like a lot of unnecessary (and not very pretty)
> > change, with surprising use of XA_RETRY_ENTRY outside of xarray internals.
> >
> > I think you probably worked on this one, knowing what you intended in 3/3
> > to follow.  But if 3/3 as such does not follow, why can't this one just
> > leave collapse_file()'s code flow unchanged, but insert the
> > > +             vma_interval_tree_foreach(vma, &mapping->i_mmap, start, start) {
> > > +                     if (userfaultfd_missing(vma)) {
> > you need somewhere just before copying and clearing the pages data?
> > Limited to CONFIG_USERFAULTFD=y and shmem and nr_none to minimize impact.
> > The locked !uptodate hpage in the xarray keeping others at bay as before.
> 
> The complication here is that there are places that examine the page
> cache without actually examining the pages in the page cache, and thus
> don't lock the pages they see. At the very least, folio_seek_hole_data
> and shmem_mfill_atomic_pte do so today. I should have included this in
> the commit message.
> 
> I suppose we could require that anything that might examine shmem page
> cache must lock any pages it sees regardless of whether or not it
> actually needs to access the pages, and then update the two functions
> I referenced plus any other similar functions I've missed. I went with
> the approach I did because I felt that localized complexity in
> collapse_file that minimizes intermediate state visible throughout the
> mm subsystem would be preferable. However, if that isn't actually
> preferable, I can take the approach you suggested. Especially now that
> we probably want to fix folio_seek_hole_data anyway.
> 
> > Hmm, looking at that extract, is "start, start" correct?
> 
> That's definitely a bug, good catch.
> 
> > And everywhere
> > else in mm/khugepaged.c is checking userfaultfd_armed(vma): I imagine
> > one or the other is more correct, but I'm not at all a userfaultfder.
> 
> IIUC, the check for userfaultfd_wp in retract_page_tables is
> sufficient for uffd-wp in the shmem case. Simply collapsing a range in
> the underlying shmem file shouldn't affect the semantics of uffd-wp.

Yes, that should be the case.  I think this allows the shmem page cache to
still be collapsible into a thp even if some vma has uffd-wp registered.  On
a uffd-wp enabled VMA it keeps using pte mappings, while other vmas that
map the same page can map it as a thp.

> I could be missing something, though, so Peter would probably be the one
> to know for sure. As for uffd-minor, its interactions with khugepaged are
> currently hopelessly undefined.

Right, my guess is we're purely lucky, because currently khugepaged (or the
new MADV_COLLAPSE) merges file thps lazily, so we only drop the ptes/pmd
rather than installing the new pmd.

For minor mode it means old pte holes will keep being holes (with the pmd
entry dropped as a whole, though).  It's just that present ptes can become
holes after the collapse, so there can be false positives once the page
cache has been collapsed.  That should not be a problem for minor mode as
long as holes are not populated without being noticed.  After all, minor
mode is a tricky feature, and it has been prone to false positives from day
one because file ptes can get lost for a lot of reasons.

Checking userfaultfd_armed() here would also work, I think, but it won't
help much with e.g. minor mode, because at this stage we've already frozen
all the existing small pages and unmapped all the ptes anyway, so false
positives are unavoidable for minor mode.  Meanwhile it would prevent
processes from using shm thp mappings as long as they have a sharer vma
registered with any type of uffd.  So checking only for missing mode here
seems to be the right choice.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v5 3/3] mm/khugepaged: maintain page cache uptodate flag
  2023-03-24 13:31               ` Matthew Wilcox
@ 2023-03-29 16:53                 ` Song Liu
  0 siblings, 0 replies; 17+ messages in thread
From: Song Liu @ 2023-03-29 16:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Song Liu, Hugh Dickins, Song Liu, Jiri Olsa, David Stevens,
	Linux-MM, Andrew Morton, Peter Xu, Kirill A . Shutemov, Yang Shi,
	David Hildenbrand, Jiaqi Yan, linux-kernel



> On Mar 24, 2023, at 6:31 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Fri, Mar 24, 2023 at 06:03:37AM +0000, Song Liu wrote:
>> 
>> 
>>> On Mar 23, 2023, at 8:30 PM, Matthew Wilcox <willy@infradead.org> wrote:
>> 
>> [...]
>> 
>>> 
>>> The Uptodate flag check needs to be done by the caller; the
>>> find_get_page() family return !uptodate pages.
>>> 
>>> But find_get_page() does not advertise itself as NMI-safe.  And I
>>> think it's wrong to try to make it NMI-safe.  Most of the kernel is
>>> not NMI-safe.  I think it's incumbent on the BPF people to get the
>>> information they need ahead of taking the NMI.  NMI handlers are not
>>> supposed to be doing a huge amount of work!  I don't really understand
>>> why it needs to do work in NMI context; surely it can note the location of
>>> the fault and queue work to be done later (eg on irq-enable, task-switch
>>> or return-to-user)
>> 
>> The use case here is a profiler (similar to perf-record). Parsing the 
>> build id in side the NMI makes the profiler a lot simpler. Otherwise, 
>> we will need some post processing for each sample. 
> 
> Simpler for you, maybe.  But this is an NMI!  It's not supposed to
> be doing printf-formatting or whatever, much less poking around
> in the file cache.  Like perf, it should record a sample and then
> convert that later.  Maybe it can defer to a tasklet, but i think
> scheduling work is a better option.
> 
>> OTOH, it is totally fine if build_id_parse() fails some time, say < 5%. 
>> The profiler output is still useful in such cases. 
>> 
>> I guess the next step is to replace find_get_page() with a NMI-safe
>> version?
> 
> No, absolutely not.  Stop doing so much work in an NMI.

While I understand the concern, it is not something we can easily remove,
as there are users who rely on this feature. How about we discuss this at
the upcoming LSFMMBPF?

Thanks,
Song


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-03-29 16:54 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-07  5:20 [PATCH v5 0/3] mm/khugepaged: fix khugepaged+shmem races David Stevens
2023-03-07  5:20 ` [PATCH v5 1/3] mm/khugepaged: refactor collapse_file control flow David Stevens
2023-03-23 19:51   ` Hugh Dickins
2023-03-07  5:20 ` [PATCH v5 2/3] mm/khugepaged: skip shmem with userfaultfd David Stevens
2023-03-23 19:48   ` Hugh Dickins
2023-03-24  5:34     ` David Stevens
2023-03-28 15:48       ` Peter Xu
2023-03-07  5:20 ` [PATCH v5 3/3] mm/khugepaged: maintain page cache uptodate flag David Stevens
2023-03-23 19:07   ` Hugh Dickins
2023-03-23 21:56     ` Matthew Wilcox
2023-03-23 22:28       ` Song Liu
2023-03-24  1:56         ` Hugh Dickins
2023-03-24  3:30           ` Matthew Wilcox
2023-03-24  6:03             ` Song Liu
2023-03-24 13:31               ` Matthew Wilcox
2023-03-29 16:53                 ` Song Liu
2023-03-28  9:48     ` David Stevens
